btusb "firmware request while host is not available" at resume

All of lore.kernel.org
 help / color / mirror / Atom feed

* btusb "firmware request while host is not available" at resume
@ 2017-09-10 19:26 ` Linus Torvalds
  0 siblings, 0 replies; 44+ messages in thread
From: Linus Torvalds @ 2017-09-10 19:26 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Tejun Heo, Marcel Holtmann, Gustavo Padovan,
	Sukumar Ghorai, Rafael J. Wysocki
  Cc: Linux Kernel Mailing List,
	bluez mailin list (linux-bluetooth@vger.kernel.org)

This seems to be a new problem at resume for the Intel btusb driver,
but I'm not seeing anything in that driver itself that looks like a
likely trigger, so I wonder if it's some driver core change, a generic
resume path issue, or a workqueue change that has made it trigger for
me.

It might also just be a timing difference, maybe it's always been there?

Does anybody have any ideas? It does't happen on every resume, and the
machine works despite this (but no bluetooth - the *next* resume might
bring it back, though).

                 Linus

--

  ACPI: Low-level resume complete
  ACPI: EC: EC started
  PM: Restoring platform NVS memory
  Enabling non-boot CPUs ...
  x86: Booting SMP configuration:
  smpboot: Booting Node 0 Processor 1 APIC 0x2
   cache: parent cpu1 should not be sleeping
  CPU1 is up
  smpboot: Booting Node 0 Processor 2 APIC 0x1
   cache: parent cpu2 should not be sleeping
  CPU2 is up
  smpboot: Booting Node 0 Processor 3 APIC 0x3
   cache: parent cpu3 should not be sleeping
  CPU3 is up
  ACPI: Waking up from system sleep state S3
  ACPI: EC: event unblocked
  usb 1-3: reset full-speed USB device number 2 using xhci_hcd
  usb 1-4: reset full-speed USB device number 3 using xhci_hcd
  usb 1-5: reset high-speed USB device number 4 using xhci_hcd
  usb 1-3:1.0: rebind failed: -517
  usb 1-3:1.1: rebind failed: -517
  Bluetooth: hci0: Bootloader revision 0.0 build 2 week 52 2014
  OOM killer enabled.
  Restarting tasks ...
  Bluetooth: hci0: Device revision is 5
  Bluetooth: hci0: Secure boot is enabled
  Bluetooth: hci0: OTP lock is enabled
  Bluetooth: hci0: API lock is enabled
  Bluetooth: hci0: Debug lock is disabled
  Bluetooth: hci0: Minimum firmware build 1 week 10 2014
  firmware request while host is not available
  ------------[ cut here ]------------
  WARNING: CPU: 2 PID: 621 at drivers/base/firmware_class.c:1250
_request_firmware+0x460/0x790
  CPU: 2 PID: 621 Comm: kworker/u9:2 Not tainted 4.13.0-10313-ge860d2c904d1 #11
  Hardware name: Dell Inc. XPS 13 9350/09JHRY, BIOS 1.4.17 05/10/2017
  Workqueue: hci0 hci_power_on [bluetooth]
  task: ffff8d3767895ac0 task.stack: ffff9d3481efc000
  RIP: 0010:_request_firmware+0x460/0x790
  Call Trace:
   request_firmware+0x37/0x50
   btusb_setup_intel_new+0x227/0x7e0 [btusb]
   hci_dev_do_open+0x3da/0x570 [bluetooth]
   hci_power_on+0x52/0x1f0 [bluetooth]
   process_one_work+0x1db/0x3d0
   worker_thread+0x47/0x3e0
   kthread+0x125/0x140
   ret_from_fork+0x22/0x30
  ---[ end trace 007b222491432927 ]---
  Bluetooth: hci0: Failed to load Intel firmware file (-112)
  [drm] RC6 on
  done.
  thermal thermal_zone11: failed to read out thermal zone (-5)
  PM: suspend exit

^ permalink raw reply	[flat|nested] 44+ messages in thread

* btusb "firmware request while host is not available" at resume
@ 2017-09-10 19:26 ` Linus Torvalds
  0 siblings, 0 replies; 44+ messages in thread
From: Linus Torvalds @ 2017-09-10 19:26 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Tejun Heo, Marcel Holtmann, Gustavo Padovan,
	Sukumar Ghorai, Rafael J. Wysocki
  Cc: Linux Kernel Mailing List,
	bluez mailin list (linux-bluetooth@vger.kernel.org)

This seems to be a new problem at resume for the Intel btusb driver,
but I'm not seeing anything in that driver itself that looks like a
likely trigger, so I wonder if it's some driver core change, a generic
resume path issue, or a workqueue change that has made it trigger for
me.

It might also just be a timing difference, maybe it's always been there?

Does anybody have any ideas? It does't happen on every resume, and the
machine works despite this (but no bluetooth - the *next* resume might
bring it back, though).

                 Linus

--

  ACPI: Low-level resume complete
  ACPI: EC: EC started
  PM: Restoring platform NVS memory
  Enabling non-boot CPUs ...
  x86: Booting SMP configuration:
  smpboot: Booting Node 0 Processor 1 APIC 0x2
   cache: parent cpu1 should not be sleeping
  CPU1 is up
  smpboot: Booting Node 0 Processor 2 APIC 0x1
   cache: parent cpu2 should not be sleeping
  CPU2 is up
  smpboot: Booting Node 0 Processor 3 APIC 0x3
   cache: parent cpu3 should not be sleeping
  CPU3 is up
  ACPI: Waking up from system sleep state S3
  ACPI: EC: event unblocked
  usb 1-3: reset full-speed USB device number 2 using xhci_hcd
  usb 1-4: reset full-speed USB device number 3 using xhci_hcd
  usb 1-5: reset high-speed USB device number 4 using xhci_hcd
  usb 1-3:1.0: rebind failed: -517
  usb 1-3:1.1: rebind failed: -517
  Bluetooth: hci0: Bootloader revision 0.0 build 2 week 52 2014
  OOM killer enabled.
  Restarting tasks ...
  Bluetooth: hci0: Device revision is 5
  Bluetooth: hci0: Secure boot is enabled
  Bluetooth: hci0: OTP lock is enabled
  Bluetooth: hci0: API lock is enabled
  Bluetooth: hci0: Debug lock is disabled
  Bluetooth: hci0: Minimum firmware build 1 week 10 2014
  firmware request while host is not available
  ------------[ cut here ]------------
  WARNING: CPU: 2 PID: 621 at drivers/base/firmware_class.c:1250
_request_firmware+0x460/0x790
  CPU: 2 PID: 621 Comm: kworker/u9:2 Not tainted 4.13.0-10313-ge860d2c904d1 #11
  Hardware name: Dell Inc. XPS 13 9350/09JHRY, BIOS 1.4.17 05/10/2017
  Workqueue: hci0 hci_power_on [bluetooth]
  task: ffff8d3767895ac0 task.stack: ffff9d3481efc000
  RIP: 0010:_request_firmware+0x460/0x790
  Call Trace:
   request_firmware+0x37/0x50
   btusb_setup_intel_new+0x227/0x7e0 [btusb]
   hci_dev_do_open+0x3da/0x570 [bluetooth]
   hci_power_on+0x52/0x1f0 [bluetooth]
   process_one_work+0x1db/0x3d0
   worker_thread+0x47/0x3e0
   kthread+0x125/0x140
   ret_from_fork+0x22/0x30
  ---[ end trace 007b222491432927 ]---
  Bluetooth: hci0: Failed to load Intel firmware file (-112)
  [drm] RC6 on
  done.
  thermal thermal_zone11: failed to read out thermal zone (-5)
  PM: suspend exit

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: btusb "firmware request while host is not available" at resume
  2017-09-10 19:26 ` Linus Torvalds
@ 2017-09-11  1:25   ` Greg Kroah-Hartman
  -1 siblings, 0 replies; 44+ messages in thread
From: Greg Kroah-Hartman @ 2017-09-11  1:25 UTC (permalink / raw)
  To: Linus Torvalds, Luis R. Rodriguez
  Cc: Tejun Heo, Marcel Holtmann, Gustavo Padovan, Sukumar Ghorai,
	Rafael J. Wysocki, Linux Kernel Mailing List,
	bluez mailin list (linux-bluetooth@vger.kernel.org)

On Sun, Sep 10, 2017 at 12:26:02PM -0700, Linus Torvalds wrote:
> This seems to be a new problem at resume for the Intel btusb driver,
> but I'm not seeing anything in that driver itself that looks like a
> likely trigger, so I wonder if it's some driver core change, a generic
> resume path issue, or a workqueue change that has made it trigger for
> me.
> 
> It might also just be a timing difference, maybe it's always been there?
> 
> Does anybody have any ideas? It does't happen on every resume, and the
> machine works despite this (but no bluetooth - the *next* resume might
> bring it back, though).

Ah, it's not just me having this problem.  I don't see it happening in
4.12, and haven't had the time to bisect it.  I seem to be able to
trigger it every suspend/resume cycle, so I don't know if it's a timing
issue.



> 
>                  Linus
> 
> --
> 
>   ACPI: Low-level resume complete
>   ACPI: EC: EC started
>   PM: Restoring platform NVS memory
>   Enabling non-boot CPUs ...
>   x86: Booting SMP configuration:
>   smpboot: Booting Node 0 Processor 1 APIC 0x2
>    cache: parent cpu1 should not be sleeping
>   CPU1 is up
>   smpboot: Booting Node 0 Processor 2 APIC 0x1
>    cache: parent cpu2 should not be sleeping
>   CPU2 is up
>   smpboot: Booting Node 0 Processor 3 APIC 0x3
>    cache: parent cpu3 should not be sleeping
>   CPU3 is up
>   ACPI: Waking up from system sleep state S3
>   ACPI: EC: event unblocked
>   usb 1-3: reset full-speed USB device number 2 using xhci_hcd
>   usb 1-4: reset full-speed USB device number 3 using xhci_hcd
>   usb 1-5: reset high-speed USB device number 4 using xhci_hcd
>   usb 1-3:1.0: rebind failed: -517
>   usb 1-3:1.1: rebind failed: -517
>   Bluetooth: hci0: Bootloader revision 0.0 build 2 week 52 2014
>   OOM killer enabled.
>   Restarting tasks ...
>   Bluetooth: hci0: Device revision is 5
>   Bluetooth: hci0: Secure boot is enabled
>   Bluetooth: hci0: OTP lock is enabled
>   Bluetooth: hci0: API lock is enabled
>   Bluetooth: hci0: Debug lock is disabled
>   Bluetooth: hci0: Minimum firmware build 1 week 10 2014
>   firmware request while host is not available
>   ------------[ cut here ]------------
>   WARNING: CPU: 2 PID: 621 at drivers/base/firmware_class.c:1250
> _request_firmware+0x460/0x790
>   CPU: 2 PID: 621 Comm: kworker/u9:2 Not tainted 4.13.0-10313-ge860d2c904d1 #11
>   Hardware name: Dell Inc. XPS 13 9350/09JHRY, BIOS 1.4.17 05/10/2017
>   Workqueue: hci0 hci_power_on [bluetooth]
>   task: ffff8d3767895ac0 task.stack: ffff9d3481efc000
>   RIP: 0010:_request_firmware+0x460/0x790
>   Call Trace:
>    request_firmware+0x37/0x50
>    btusb_setup_intel_new+0x227/0x7e0 [btusb]
>    hci_dev_do_open+0x3da/0x570 [bluetooth]
>    hci_power_on+0x52/0x1f0 [bluetooth]
>    process_one_work+0x1db/0x3d0
>    worker_thread+0x47/0x3e0
>    kthread+0x125/0x140
>    ret_from_fork+0x22/0x30
>   ---[ end trace 007b222491432927 ]---
>   Bluetooth: hci0: Failed to load Intel firmware file (-112)
>   [drm] RC6 on
>   done.
>   thermal thermal_zone11: failed to read out thermal zone (-5)
>   PM: suspend exit

Ah, I'll blame Luis for this, I think it might be due to 81f95076281f
("firmware: add sanity check on shutdown/suspend")

Luis, any ideas?  I'll try to revert this and try it out tomorrow when
I get a chance.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: btusb "firmware request while host is not available" at resume
@ 2017-09-11  1:25   ` Greg Kroah-Hartman
  0 siblings, 0 replies; 44+ messages in thread
From: Greg Kroah-Hartman @ 2017-09-11  1:25 UTC (permalink / raw)
  To: Linus Torvalds, Luis R. Rodriguez
  Cc: Tejun Heo, Marcel Holtmann, Gustavo Padovan, Sukumar Ghorai,
	Rafael J. Wysocki, Linux Kernel Mailing List,
	bluez mailin list (linux-bluetooth@vger.kernel.org)

On Sun, Sep 10, 2017 at 12:26:02PM -0700, Linus Torvalds wrote:
> This seems to be a new problem at resume for the Intel btusb driver,
> but I'm not seeing anything in that driver itself that looks like a
> likely trigger, so I wonder if it's some driver core change, a generic
> resume path issue, or a workqueue change that has made it trigger for
> me.
> 
> It might also just be a timing difference, maybe it's always been there?
> 
> Does anybody have any ideas? It does't happen on every resume, and the
> machine works despite this (but no bluetooth - the *next* resume might
> bring it back, though).

Ah, it's not just me having this problem.  I don't see it happening in
4.12, and haven't had the time to bisect it.  I seem to be able to
trigger it every suspend/resume cycle, so I don't know if it's a timing
issue.



> 
>                  Linus
> 
> --
> 
>   ACPI: Low-level resume complete
>   ACPI: EC: EC started
>   PM: Restoring platform NVS memory
>   Enabling non-boot CPUs ...
>   x86: Booting SMP configuration:
>   smpboot: Booting Node 0 Processor 1 APIC 0x2
>    cache: parent cpu1 should not be sleeping
>   CPU1 is up
>   smpboot: Booting Node 0 Processor 2 APIC 0x1
>    cache: parent cpu2 should not be sleeping
>   CPU2 is up
>   smpboot: Booting Node 0 Processor 3 APIC 0x3
>    cache: parent cpu3 should not be sleeping
>   CPU3 is up
>   ACPI: Waking up from system sleep state S3
>   ACPI: EC: event unblocked
>   usb 1-3: reset full-speed USB device number 2 using xhci_hcd
>   usb 1-4: reset full-speed USB device number 3 using xhci_hcd
>   usb 1-5: reset high-speed USB device number 4 using xhci_hcd
>   usb 1-3:1.0: rebind failed: -517
>   usb 1-3:1.1: rebind failed: -517
>   Bluetooth: hci0: Bootloader revision 0.0 build 2 week 52 2014
>   OOM killer enabled.
>   Restarting tasks ...
>   Bluetooth: hci0: Device revision is 5
>   Bluetooth: hci0: Secure boot is enabled
>   Bluetooth: hci0: OTP lock is enabled
>   Bluetooth: hci0: API lock is enabled
>   Bluetooth: hci0: Debug lock is disabled
>   Bluetooth: hci0: Minimum firmware build 1 week 10 2014
>   firmware request while host is not available
>   ------------[ cut here ]------------
>   WARNING: CPU: 2 PID: 621 at drivers/base/firmware_class.c:1250
> _request_firmware+0x460/0x790
>   CPU: 2 PID: 621 Comm: kworker/u9:2 Not tainted 4.13.0-10313-ge860d2c904d1 #11
>   Hardware name: Dell Inc. XPS 13 9350/09JHRY, BIOS 1.4.17 05/10/2017
>   Workqueue: hci0 hci_power_on [bluetooth]
>   task: ffff8d3767895ac0 task.stack: ffff9d3481efc000
>   RIP: 0010:_request_firmware+0x460/0x790
>   Call Trace:
>    request_firmware+0x37/0x50
>    btusb_setup_intel_new+0x227/0x7e0 [btusb]
>    hci_dev_do_open+0x3da/0x570 [bluetooth]
>    hci_power_on+0x52/0x1f0 [bluetooth]
>    process_one_work+0x1db/0x3d0
>    worker_thread+0x47/0x3e0
>    kthread+0x125/0x140
>    ret_from_fork+0x22/0x30
>   ---[ end trace 007b222491432927 ]---
>   Bluetooth: hci0: Failed to load Intel firmware file (-112)
>   [drm] RC6 on
>   done.
>   thermal thermal_zone11: failed to read out thermal zone (-5)
>   PM: suspend exit

Ah, I'll blame Luis for this, I think it might be due to 81f95076281f
("firmware: add sanity check on shutdown/suspend")

Luis, any ideas?  I'll try to revert this and try it out tomorrow when
I get a chance.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: btusb "firmware request while host is not available" at resume
  2017-09-11  1:25   ` Greg Kroah-Hartman
@ 2017-09-11  3:15     ` Gabriel C
  -1 siblings, 0 replies; 44+ messages in thread
From: Gabriel C @ 2017-09-11  3:15 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Linus Torvalds, Luis R. Rodriguez
  Cc: Tejun Heo, Marcel Holtmann, Gustavo Padovan, Sukumar Ghorai,
	Rafael J. Wysocki, Linux Kernel Mailing List,
	bluez mailin list (linux-bluetooth@vger.kernel.org)

On 11.09.2017 03:25, Greg Kroah-Hartman wrote:
> On Sun, Sep 10, 2017 at 12:26:02PM -0700, Linus Torvalds wrote:
>> This seems to be a new problem at resume for the Intel btusb driver,
>> but I'm not seeing anything in that driver itself that looks like a
>> likely trigger, so I wonder if it's some driver core change, a generic
>> resume path issue, or a workqueue change that has made it trigger for
>> me.
>>
>> It might also just be a timing difference, maybe it's always been there?
>>
>> Does anybody have any ideas? It does't happen on every resume, and the
>> machine works despite this (but no bluetooth - the *next* resume might
>> bring it back, though).
> 
> Ah, it's not just me having this problem.  I don't see it happening in
> 4.12, and haven't had the time to bisect it.  I seem to be able to
> trigger it every suspend/resume cycle, so I don't know if it's a timing
> issue.


I see the same problem with QCA hardware.. but a bit different.

On first resume cycle the firmware call is fine but the adapter dies a bit later with :


  'Bluetooth: hci0: Failed to send body at 4 of 1857 (-110)'

On second resume cycle I hit the trace too.


> 
>>
>>                   Linus
>>
>> --
>>
>>    ACPI: Low-level resume complete
>>    ACPI: EC: EC started
>>    PM: Restoring platform NVS memory
>>    Enabling non-boot CPUs ...
>>    x86: Booting SMP configuration:
>>    smpboot: Booting Node 0 Processor 1 APIC 0x2
>>     cache: parent cpu1 should not be sleeping
>>    CPU1 is up
>>    smpboot: Booting Node 0 Processor 2 APIC 0x1
>>     cache: parent cpu2 should not be sleeping
>>    CPU2 is up
>>    smpboot: Booting Node 0 Processor 3 APIC 0x3
>>     cache: parent cpu3 should not be sleeping
>>    CPU3 is up
>>    ACPI: Waking up from system sleep state S3
>>    ACPI: EC: event unblocked
>>    usb 1-3: reset full-speed USB device number 2 using xhci_hcd
>>    usb 1-4: reset full-speed USB device number 3 using xhci_hcd
>>    usb 1-5: reset high-speed USB device number 4 using xhci_hcd
>>    usb 1-3:1.0: rebind failed: -517
>>    usb 1-3:1.1: rebind failed: -517
>>    Bluetooth: hci0: Bootloader revision 0.0 build 2 week 52 2014
>>    OOM killer enabled.
>>    Restarting tasks ...
>>    Bluetooth: hci0: Device revision is 5
>>    Bluetooth: hci0: Secure boot is enabled
>>    Bluetooth: hci0: OTP lock is enabled
>>    Bluetooth: hci0: API lock is enabled
>>    Bluetooth: hci0: Debug lock is disabled
>>    Bluetooth: hci0: Minimum firmware build 1 week 10 2014
>>    firmware request while host is not available
>>    ------------[ cut here ]------------
>>    WARNING: CPU: 2 PID: 621 at drivers/base/firmware_class.c:1250
>> _request_firmware+0x460/0x790
>>    CPU: 2 PID: 621 Comm: kworker/u9:2 Not tainted 4.13.0-10313-ge860d2c904d1 #11
>>    Hardware name: Dell Inc. XPS 13 9350/09JHRY, BIOS 1.4.17 05/10/2017
>>    Workqueue: hci0 hci_power_on [bluetooth]
>>    task: ffff8d3767895ac0 task.stack: ffff9d3481efc000
>>    RIP: 0010:_request_firmware+0x460/0x790
>>    Call Trace:
>>     request_firmware+0x37/0x50
>>     btusb_setup_intel_new+0x227/0x7e0 [btusb]
>>     hci_dev_do_open+0x3da/0x570 [bluetooth]
>>     hci_power_on+0x52/0x1f0 [bluetooth]
>>     process_one_work+0x1db/0x3d0
>>     worker_thread+0x47/0x3e0
>>     kthread+0x125/0x140
>>     ret_from_fork+0x22/0x30
>>    ---[ end trace 007b222491432927 ]---
>>    Bluetooth: hci0: Failed to load Intel firmware file (-112)
>>    [drm] RC6 on
>>    done.
>>    thermal thermal_zone11: failed to read out thermal zone (-5)
>>    PM: suspend exit
> 
> Ah, I'll blame Luis for this, I think it might be due to 81f95076281f
> ("firmware: add sanity check on shutdown/suspend")
>
> Luis, any ideas?  I'll try to revert this and try it out tomorrow when
> I get a chance.
>

I can revert it an fire up some testing..

This time with 4.13.x I hit all kind bugs on this box anyway :)

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: btusb "firmware request while host is not available" at resume
@ 2017-09-11  3:15     ` Gabriel C
  0 siblings, 0 replies; 44+ messages in thread
From: Gabriel C @ 2017-09-11  3:15 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Linus Torvalds, Luis R. Rodriguez
  Cc: Tejun Heo, Marcel Holtmann, Gustavo Padovan, Sukumar Ghorai,
	Rafael J. Wysocki, Linux Kernel Mailing List,
	bluez mailin list (linux-bluetooth@vger.kernel.org)

On 11.09.2017 03:25, Greg Kroah-Hartman wrote:
> On Sun, Sep 10, 2017 at 12:26:02PM -0700, Linus Torvalds wrote:
>> This seems to be a new problem at resume for the Intel btusb driver,
>> but I'm not seeing anything in that driver itself that looks like a
>> likely trigger, so I wonder if it's some driver core change, a generic
>> resume path issue, or a workqueue change that has made it trigger for
>> me.
>>
>> It might also just be a timing difference, maybe it's always been there?
>>
>> Does anybody have any ideas? It does't happen on every resume, and the
>> machine works despite this (but no bluetooth - the *next* resume might
>> bring it back, though).
> 
> Ah, it's not just me having this problem.  I don't see it happening in
> 4.12, and haven't had the time to bisect it.  I seem to be able to
> trigger it every suspend/resume cycle, so I don't know if it's a timing
> issue.


I see the same problem with QCA hardware.. but a bit different.

On first resume cycle the firmware call is fine but the adapter dies a bit later with :


  'Bluetooth: hci0: Failed to send body at 4 of 1857 (-110)'

On second resume cycle I hit the trace too.


> 
>>
>>                   Linus
>>
>> --
>>
>>    ACPI: Low-level resume complete
>>    ACPI: EC: EC started
>>    PM: Restoring platform NVS memory
>>    Enabling non-boot CPUs ...
>>    x86: Booting SMP configuration:
>>    smpboot: Booting Node 0 Processor 1 APIC 0x2
>>     cache: parent cpu1 should not be sleeping
>>    CPU1 is up
>>    smpboot: Booting Node 0 Processor 2 APIC 0x1
>>     cache: parent cpu2 should not be sleeping
>>    CPU2 is up
>>    smpboot: Booting Node 0 Processor 3 APIC 0x3
>>     cache: parent cpu3 should not be sleeping
>>    CPU3 is up
>>    ACPI: Waking up from system sleep state S3
>>    ACPI: EC: event unblocked
>>    usb 1-3: reset full-speed USB device number 2 using xhci_hcd
>>    usb 1-4: reset full-speed USB device number 3 using xhci_hcd
>>    usb 1-5: reset high-speed USB device number 4 using xhci_hcd
>>    usb 1-3:1.0: rebind failed: -517
>>    usb 1-3:1.1: rebind failed: -517
>>    Bluetooth: hci0: Bootloader revision 0.0 build 2 week 52 2014
>>    OOM killer enabled.
>>    Restarting tasks ...
>>    Bluetooth: hci0: Device revision is 5
>>    Bluetooth: hci0: Secure boot is enabled
>>    Bluetooth: hci0: OTP lock is enabled
>>    Bluetooth: hci0: API lock is enabled
>>    Bluetooth: hci0: Debug lock is disabled
>>    Bluetooth: hci0: Minimum firmware build 1 week 10 2014
>>    firmware request while host is not available
>>    ------------[ cut here ]------------
>>    WARNING: CPU: 2 PID: 621 at drivers/base/firmware_class.c:1250
>> _request_firmware+0x460/0x790
>>    CPU: 2 PID: 621 Comm: kworker/u9:2 Not tainted 4.13.0-10313-ge860d2c904d1 #11
>>    Hardware name: Dell Inc. XPS 13 9350/09JHRY, BIOS 1.4.17 05/10/2017
>>    Workqueue: hci0 hci_power_on [bluetooth]
>>    task: ffff8d3767895ac0 task.stack: ffff9d3481efc000
>>    RIP: 0010:_request_firmware+0x460/0x790
>>    Call Trace:
>>     request_firmware+0x37/0x50
>>     btusb_setup_intel_new+0x227/0x7e0 [btusb]
>>     hci_dev_do_open+0x3da/0x570 [bluetooth]
>>     hci_power_on+0x52/0x1f0 [bluetooth]
>>     process_one_work+0x1db/0x3d0
>>     worker_thread+0x47/0x3e0
>>     kthread+0x125/0x140
>>     ret_from_fork+0x22/0x30
>>    ---[ end trace 007b222491432927 ]---
>>    Bluetooth: hci0: Failed to load Intel firmware file (-112)
>>    [drm] RC6 on
>>    done.
>>    thermal thermal_zone11: failed to read out thermal zone (-5)
>>    PM: suspend exit
> 
> Ah, I'll blame Luis for this, I think it might be due to 81f95076281f
> ("firmware: add sanity check on shutdown/suspend")
>
> Luis, any ideas?  I'll try to revert this and try it out tomorrow when
> I get a chance.
>

I can revert it an fire up some testing..

This time with 4.13.x I hit all kind bugs on this box anyway :)

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: btusb "firmware request while host is not available" at resume
  2017-09-11  3:15     ` Gabriel C
@ 2017-09-11  3:49       ` Gabriel C
  -1 siblings, 0 replies; 44+ messages in thread
From: Gabriel C @ 2017-09-11  3:49 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Linus Torvalds, Luis R. Rodriguez
  Cc: Tejun Heo, Marcel Holtmann, Gustavo Padovan, Sukumar Ghorai,
	Rafael J. Wysocki, Linux Kernel Mailing List,
	bluez mailin list (linux-bluetooth@vger.kernel.org)

On 11.09.2017 05:15, Gabriel C wrote:
> On 11.09.2017 03:25, Greg Kroah-Hartman wrote:
>> On Sun, Sep 10, 2017 at 12:26:02PM -0700, Linus Torvalds wrote:
>>> This seems to be a new problem at resume for the Intel btusb driver,
>>> but I'm not seeing anything in that driver itself that looks like a
>>> likely trigger, so I wonder if it's some driver core change, a generic
>>> resume path issue, or a workqueue change that has made it trigger for
>>> me.
>>>
>>> It might also just be a timing difference, maybe it's always been there?
>>>
>>> Does anybody have any ideas? It does't happen on every resume, and the
>>> machine works despite this (but no bluetooth - the *next* resume might
>>> bring it back, though).
>>
>> Ah, it's not just me having this problem.  I don't see it happening in
>> 4.12, and haven't had the time to bisect it.  I seem to be able to
>> trigger it every suspend/resume cycle, so I don't know if it's a timing
>> issue.
> 
> 
> I see the same problem with QCA hardware.. but a bit different.
> 
> On first resume cycle the firmware call is fine but the adapter dies a bit later with :
> 
> 
>   'Bluetooth: hci0: Failed to send body at 4 of 1857 (-110)'
> 
> On second resume cycle I hit the trace too.
> 
> 
>>
>>>
>>>                   Linus
>>>
>>> -- 
>>>
>>>    ACPI: Low-level resume complete
>>>    ACPI: EC: EC started
>>>    PM: Restoring platform NVS memory
>>>    Enabling non-boot CPUs ...
>>>    x86: Booting SMP configuration:
>>>    smpboot: Booting Node 0 Processor 1 APIC 0x2
>>>     cache: parent cpu1 should not be sleeping
>>>    CPU1 is up
>>>    smpboot: Booting Node 0 Processor 2 APIC 0x1
>>>     cache: parent cpu2 should not be sleeping
>>>    CPU2 is up
>>>    smpboot: Booting Node 0 Processor 3 APIC 0x3
>>>     cache: parent cpu3 should not be sleeping
>>>    CPU3 is up
>>>    ACPI: Waking up from system sleep state S3
>>>    ACPI: EC: event unblocked
>>>    usb 1-3: reset full-speed USB device number 2 using xhci_hcd
>>>    usb 1-4: reset full-speed USB device number 3 using xhci_hcd
>>>    usb 1-5: reset high-speed USB device number 4 using xhci_hcd
>>>    usb 1-3:1.0: rebind failed: -517
>>>    usb 1-3:1.1: rebind failed: -517
>>>    Bluetooth: hci0: Bootloader revision 0.0 build 2 week 52 2014
>>>    OOM killer enabled.
>>>    Restarting tasks ...
>>>    Bluetooth: hci0: Device revision is 5
>>>    Bluetooth: hci0: Secure boot is enabled
>>>    Bluetooth: hci0: OTP lock is enabled
>>>    Bluetooth: hci0: API lock is enabled
>>>    Bluetooth: hci0: Debug lock is disabled
>>>    Bluetooth: hci0: Minimum firmware build 1 week 10 2014
>>>    firmware request while host is not available
>>>    ------------[ cut here ]------------
>>>    WARNING: CPU: 2 PID: 621 at drivers/base/firmware_class.c:1250
>>> _request_firmware+0x460/0x790
>>>    CPU: 2 PID: 621 Comm: kworker/u9:2 Not tainted 4.13.0-10313-ge860d2c904d1 #11
>>>    Hardware name: Dell Inc. XPS 13 9350/09JHRY, BIOS 1.4.17 05/10/2017
>>>    Workqueue: hci0 hci_power_on [bluetooth]
>>>    task: ffff8d3767895ac0 task.stack: ffff9d3481efc000
>>>    RIP: 0010:_request_firmware+0x460/0x790
>>>    Call Trace:
>>>     request_firmware+0x37/0x50
>>>     btusb_setup_intel_new+0x227/0x7e0 [btusb]
>>>     hci_dev_do_open+0x3da/0x570 [bluetooth]
>>>     hci_power_on+0x52/0x1f0 [bluetooth]
>>>     process_one_work+0x1db/0x3d0
>>>     worker_thread+0x47/0x3e0
>>>     kthread+0x125/0x140
>>>     ret_from_fork+0x22/0x30
>>>    ---[ end trace 007b222491432927 ]---
>>>    Bluetooth: hci0: Failed to load Intel firmware file (-112)
>>>    [drm] RC6 on
>>>    done.
>>>    thermal thermal_zone11: failed to read out thermal zone (-5)
>>>    PM: suspend exit
>>
>> Ah, I'll blame Luis for this, I think it might be due to 81f95076281f
>> ("firmware: add sanity check on shutdown/suspend")
>>
>> Luis, any ideas?  I'll try to revert this and try it out tomorrow when
>> I get a chance.
>>
> 
> I can revert it an fire up some testing..
> 

Yes 81f95076281f is to blame.. After reverting it all is fine again.

15 resume cycles on the one laptop , 10 on the other without to hit the trace.

Regards

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: btusb "firmware request while host is not available" at resume
@ 2017-09-11  3:49       ` Gabriel C
  0 siblings, 0 replies; 44+ messages in thread
From: Gabriel C @ 2017-09-11  3:49 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Linus Torvalds, Luis R. Rodriguez
  Cc: Tejun Heo, Marcel Holtmann, Gustavo Padovan, Sukumar Ghorai,
	Rafael J. Wysocki, Linux Kernel Mailing List,
	bluez mailin list (linux-bluetooth@vger.kernel.org)

On 11.09.2017 05:15, Gabriel C wrote:
> On 11.09.2017 03:25, Greg Kroah-Hartman wrote:
>> On Sun, Sep 10, 2017 at 12:26:02PM -0700, Linus Torvalds wrote:
>>> This seems to be a new problem at resume for the Intel btusb driver,
>>> but I'm not seeing anything in that driver itself that looks like a
>>> likely trigger, so I wonder if it's some driver core change, a generic
>>> resume path issue, or a workqueue change that has made it trigger for
>>> me.
>>>
>>> It might also just be a timing difference, maybe it's always been there?
>>>
>>> Does anybody have any ideas? It does't happen on every resume, and the
>>> machine works despite this (but no bluetooth - the *next* resume might
>>> bring it back, though).
>>
>> Ah, it's not just me having this problem.  I don't see it happening in
>> 4.12, and haven't had the time to bisect it.  I seem to be able to
>> trigger it every suspend/resume cycle, so I don't know if it's a timing
>> issue.
> 
> 
> I see the same problem with QCA hardware.. but a bit different.
> 
> On first resume cycle the firmware call is fine but the adapter dies a bit later with :
> 
> 
>   'Bluetooth: hci0: Failed to send body at 4 of 1857 (-110)'
> 
> On second resume cycle I hit the trace too.
> 
> 
>>
>>>
>>>                   Linus
>>>
>>> -- 
>>>
>>>    ACPI: Low-level resume complete
>>>    ACPI: EC: EC started
>>>    PM: Restoring platform NVS memory
>>>    Enabling non-boot CPUs ...
>>>    x86: Booting SMP configuration:
>>>    smpboot: Booting Node 0 Processor 1 APIC 0x2
>>>     cache: parent cpu1 should not be sleeping
>>>    CPU1 is up
>>>    smpboot: Booting Node 0 Processor 2 APIC 0x1
>>>     cache: parent cpu2 should not be sleeping
>>>    CPU2 is up
>>>    smpboot: Booting Node 0 Processor 3 APIC 0x3
>>>     cache: parent cpu3 should not be sleeping
>>>    CPU3 is up
>>>    ACPI: Waking up from system sleep state S3
>>>    ACPI: EC: event unblocked
>>>    usb 1-3: reset full-speed USB device number 2 using xhci_hcd
>>>    usb 1-4: reset full-speed USB device number 3 using xhci_hcd
>>>    usb 1-5: reset high-speed USB device number 4 using xhci_hcd
>>>    usb 1-3:1.0: rebind failed: -517
>>>    usb 1-3:1.1: rebind failed: -517
>>>    Bluetooth: hci0: Bootloader revision 0.0 build 2 week 52 2014
>>>    OOM killer enabled.
>>>    Restarting tasks ...
>>>    Bluetooth: hci0: Device revision is 5
>>>    Bluetooth: hci0: Secure boot is enabled
>>>    Bluetooth: hci0: OTP lock is enabled
>>>    Bluetooth: hci0: API lock is enabled
>>>    Bluetooth: hci0: Debug lock is disabled
>>>    Bluetooth: hci0: Minimum firmware build 1 week 10 2014
>>>    firmware request while host is not available
>>>    ------------[ cut here ]------------
>>>    WARNING: CPU: 2 PID: 621 at drivers/base/firmware_class.c:1250
>>> _request_firmware+0x460/0x790
>>>    CPU: 2 PID: 621 Comm: kworker/u9:2 Not tainted 4.13.0-10313-ge860d2c904d1 #11
>>>    Hardware name: Dell Inc. XPS 13 9350/09JHRY, BIOS 1.4.17 05/10/2017
>>>    Workqueue: hci0 hci_power_on [bluetooth]
>>>    task: ffff8d3767895ac0 task.stack: ffff9d3481efc000
>>>    RIP: 0010:_request_firmware+0x460/0x790
>>>    Call Trace:
>>>     request_firmware+0x37/0x50
>>>     btusb_setup_intel_new+0x227/0x7e0 [btusb]
>>>     hci_dev_do_open+0x3da/0x570 [bluetooth]
>>>     hci_power_on+0x52/0x1f0 [bluetooth]
>>>     process_one_work+0x1db/0x3d0
>>>     worker_thread+0x47/0x3e0
>>>     kthread+0x125/0x140
>>>     ret_from_fork+0x22/0x30
>>>    ---[ end trace 007b222491432927 ]---
>>>    Bluetooth: hci0: Failed to load Intel firmware file (-112)
>>>    [drm] RC6 on
>>>    done.
>>>    thermal thermal_zone11: failed to read out thermal zone (-5)
>>>    PM: suspend exit
>>
>> Ah, I'll blame Luis for this, I think it might be due to 81f95076281f
>> ("firmware: add sanity check on shutdown/suspend")
>>
>> Luis, any ideas?  I'll try to revert this and try it out tomorrow when
>> I get a chance.
>>
> 
> I can revert it an fire up some testing..
> 

Yes 81f95076281f is to blame.. After reverting it all is fine again.

15 resume cycles on the one laptop , 10 on the other without to hit the trace.

Regards

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: btusb "firmware request while host is not available" at resume
  2017-09-11  3:49       ` Gabriel C
@ 2017-09-11  4:25         ` Linus Torvalds
  -1 siblings, 0 replies; 44+ messages in thread
From: Linus Torvalds @ 2017-09-11  4:25 UTC (permalink / raw)
  To: Gabriel C
  Cc: Greg Kroah-Hartman, Luis R. Rodriguez, Tejun Heo,
	Marcel Holtmann, Gustavo Padovan, Sukumar Ghorai,
	Rafael J. Wysocki, Linux Kernel Mailing List,
	bluez mailin list (linux-bluetooth@vger.kernel.org)

On Sun, Sep 10, 2017 at 8:49 PM, Gabriel C <nix.or.die@gmail.com> wrote:
>
> Yes 81f95076281f is to blame.. After reverting it all is fine again.
>
> 15 resume cycles on the one laptop , 10 on the other without to hit the
> trace.

Yeah, I think that disable/enable_firmware in the suspend/resume path
is basically just completely random code. There  is nothing that
serializes with anything else, and it has no actual basis for saying
"I am now disabled/enabled" except for some entirely random time of
whenever the suspend/resume callbacks happen.

I'm going to revert it.

I wonder why 4.13 seemed to work for me. Or maybe it didn't, and I
just didn't notice, because I didn't use bluetooth and I wasn't
traveling. I test my laptop every release, but I don't necessarily
always _use_ it.

               Linus

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: btusb "firmware request while host is not available" at resume
@ 2017-09-11  4:25         ` Linus Torvalds
  0 siblings, 0 replies; 44+ messages in thread
From: Linus Torvalds @ 2017-09-11  4:25 UTC (permalink / raw)
  To: Gabriel C
  Cc: Greg Kroah-Hartman, Luis R. Rodriguez, Tejun Heo,
	Marcel Holtmann, Gustavo Padovan, Sukumar Ghorai,
	Rafael J. Wysocki, Linux Kernel Mailing List,
	bluez mailin list (linux-bluetooth@vger.kernel.org)

On Sun, Sep 10, 2017 at 8:49 PM, Gabriel C <nix.or.die@gmail.com> wrote:
>
> Yes 81f95076281f is to blame.. After reverting it all is fine again.
>
> 15 resume cycles on the one laptop , 10 on the other without to hit the
> trace.

Yeah, I think that disable/enable_firmware in the suspend/resume path
is basically just completely random code. There  is nothing that
serializes with anything else, and it has no actual basis for saying
"I am now disabled/enabled" except for some entirely random time of
whenever the suspend/resume callbacks happen.

I'm going to revert it.

I wonder why 4.13 seemed to work for me. Or maybe it didn't, and I
just didn't notice, because I didn't use bluetooth and I wasn't
traveling. I test my laptop every release, but I don't necessarily
always _use_ it.

               Linus

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: btusb "firmware request while host is not available" at resume
  2017-09-11  4:25         ` Linus Torvalds
@ 2017-09-11  5:12           ` Marcel Holtmann
  -1 siblings, 0 replies; 44+ messages in thread
From: Marcel Holtmann @ 2017-09-11  5:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Gabriel C, Greg Kroah-Hartman, Luis R. Rodriguez, Tejun Heo,
	Gustavo F. Padovan, Sukumar Ghorai, Rafael J. Wysocki,
	Linux Kernel Mailing List,
	bluez mailin list (linux-bluetooth@vger.kernel.org)

Hi Linus,

>> Yes 81f95076281f is to blame.. After reverting it all is fine again.
>> 
>> 15 resume cycles on the one laptop , 10 on the other without to hit the
>> trace.
> 
> Yeah, I think that disable/enable_firmware in the suspend/resume path
> is basically just completely random code. There  is nothing that
> serializes with anything else, and it has no actual basis for saying
> "I am now disabled/enabled" except for some entirely random time of
> whenever the suspend/resume callbacks happen.
> 
> I'm going to revert it.
> 
> I wonder why 4.13 seemed to work for me. Or maybe it didn't, and I
> just didn't notice, because I didn't use bluetooth and I wasn't
> traveling. I test my laptop every release, but I don't necessarily
> always _use_ it.

we have a bug report that might go into the similar direction. So the problem is that modern Bluetooth controller require full firmware loading (not just ROM patching) and if the controller has the firmware somehow already loaded, but then looses power and needs a reload, then it is not cached (since it was never requested in the first place).

Of course if the reload triggers during resume when not file system is available, it can not request the firmware. That will just fail and thus you might see this issue. We have a few RFC patches on the mailing list that I have to review. It is however not that easy all the time to find the right firmware file (at least not for Intel hardware) since the boot loader provides different information than the fully operational firmware. So I need to make sure that request the right firmware (if we do not need it initially) so that it gets cached.

Regards

Marcel

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: btusb "firmware request while host is not available" at resume
@ 2017-09-11  5:12           ` Marcel Holtmann
  0 siblings, 0 replies; 44+ messages in thread
From: Marcel Holtmann @ 2017-09-11  5:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Gabriel C, Greg Kroah-Hartman, Luis R. Rodriguez, Tejun Heo,
	Gustavo F. Padovan, Sukumar Ghorai, Rafael J. Wysocki,
	Linux Kernel Mailing List,
	bluez mailin list (linux-bluetooth@vger.kernel.org)

Hi Linus,

>> Yes 81f95076281f is to blame.. After reverting it all is fine again.
>> 
>> 15 resume cycles on the one laptop , 10 on the other without to hit the
>> trace.
> 
> Yeah, I think that disable/enable_firmware in the suspend/resume path
> is basically just completely random code. There  is nothing that
> serializes with anything else, and it has no actual basis for saying
> "I am now disabled/enabled" except for some entirely random time of
> whenever the suspend/resume callbacks happen.
> 
> I'm going to revert it.
> 
> I wonder why 4.13 seemed to work for me. Or maybe it didn't, and I
> just didn't notice, because I didn't use bluetooth and I wasn't
> traveling. I test my laptop every release, but I don't necessarily
> always _use_ it.

we have a bug report that might go into the similar direction. So the problem is that modern Bluetooth controller require full firmware loading (not just ROM patching) and if the controller has the firmware somehow already loaded, but then looses power and needs a reload, then it is not cached (since it was never requested in the first place).

Of course if the reload triggers during resume when not file system is available, it can not request the firmware. That will just fail and thus you might see this issue. We have a few RFC patches on the mailing list that I have to review. It is however not that easy all the time to find the right firmware file (at least not for Intel hardware) since the boot loader provides different information than the fully operational firmware. So I need to make sure that request the right firmware (if we do not need it initially) so that it gets cached.

Regards

Marcel

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: btusb "firmware request while host is not available" at resume
  2017-09-11  5:12           ` Marcel Holtmann
@ 2017-09-11 13:46             ` Greg Kroah-Hartman
  -1 siblings, 0 replies; 44+ messages in thread
From: Greg Kroah-Hartman @ 2017-09-11 13:46 UTC (permalink / raw)
  To: Marcel Holtmann
  Cc: Linus Torvalds, Gabriel C, Luis R. Rodriguez, Tejun Heo,
	Gustavo F. Padovan, Sukumar Ghorai, Rafael J. Wysocki,
	Linux Kernel Mailing List,
	bluez mailin list (linux-bluetooth@vger.kernel.org)

On Mon, Sep 11, 2017 at 07:12:44AM +0200, Marcel Holtmann wrote:
> Hi Linus,
> 
> >> Yes 81f95076281f is to blame.. After reverting it all is fine again.
> >> 
> >> 15 resume cycles on the one laptop , 10 on the other without to hit the
> >> trace.
> > 
> > Yeah, I think that disable/enable_firmware in the suspend/resume path
> > is basically just completely random code. There  is nothing that
> > serializes with anything else, and it has no actual basis for saying
> > "I am now disabled/enabled" except for some entirely random time of
> > whenever the suspend/resume callbacks happen.
> > 
> > I'm going to revert it.
> > 
> > I wonder why 4.13 seemed to work for me. Or maybe it didn't, and I
> > just didn't notice, because I didn't use bluetooth and I wasn't
> > traveling. I test my laptop every release, but I don't necessarily
> > always _use_ it.
> 
> we have a bug report that might go into the similar direction. So the
> problem is that modern Bluetooth controller require full firmware
> loading (not just ROM patching) and if the controller has the firmware
> somehow already loaded, but then looses power and needs a reload, then
> it is not cached (since it was never requested in the first place).
> 
> Of course if the reload triggers during resume when not file system is
> available, it can not request the firmware. That will just fail and
> thus you might see this issue. We have a few RFC patches on the
> mailing list that I have to review. It is however not that easy all
> the time to find the right firmware file (at least not for Intel
> hardware) since the boot loader provides different information than
> the fully operational firmware. So I need to make sure that request
> the right firmware (if we do not need it initially) so that it gets
> cached.

To confirm, reverting this fixes the problem I was seeing in 4.13.  I've
queued it up for the next 4.13-stable release as well.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: btusb "firmware request while host is not available" at resume
@ 2017-09-11 13:46             ` Greg Kroah-Hartman
  0 siblings, 0 replies; 44+ messages in thread
From: Greg Kroah-Hartman @ 2017-09-11 13:46 UTC (permalink / raw)
  To: Marcel Holtmann
  Cc: Linus Torvalds, Gabriel C, Luis R. Rodriguez, Tejun Heo,
	Gustavo F. Padovan, Sukumar Ghorai, Rafael J. Wysocki,
	Linux Kernel Mailing List,
	bluez mailin list (linux-bluetooth@vger.kernel.org)

On Mon, Sep 11, 2017 at 07:12:44AM +0200, Marcel Holtmann wrote:
> Hi Linus,
> 
> >> Yes 81f95076281f is to blame.. After reverting it all is fine again.
> >> 
> >> 15 resume cycles on the one laptop , 10 on the other without to hit the
> >> trace.
> > 
> > Yeah, I think that disable/enable_firmware in the suspend/resume path
> > is basically just completely random code. There  is nothing that
> > serializes with anything else, and it has no actual basis for saying
> > "I am now disabled/enabled" except for some entirely random time of
> > whenever the suspend/resume callbacks happen.
> > 
> > I'm going to revert it.
> > 
> > I wonder why 4.13 seemed to work for me. Or maybe it didn't, and I
> > just didn't notice, because I didn't use bluetooth and I wasn't
> > traveling. I test my laptop every release, but I don't necessarily
> > always _use_ it.
> 
> we have a bug report that might go into the similar direction. So the
> problem is that modern Bluetooth controller require full firmware
> loading (not just ROM patching) and if the controller has the firmware
> somehow already loaded, but then looses power and needs a reload, then
> it is not cached (since it was never requested in the first place).
> 
> Of course if the reload triggers during resume when not file system is
> available, it can not request the firmware. That will just fail and
> thus you might see this issue. We have a few RFC patches on the
> mailing list that I have to review. It is however not that easy all
> the time to find the right firmware file (at least not for Intel
> hardware) since the boot loader provides different information than
> the fully operational firmware. So I need to make sure that request
> the right firmware (if we do not need it initially) so that it gets
> cached.

To confirm, reverting this fixes the problem I was seeing in 4.13.  I've
queued it up for the next 4.13-stable release as well.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: btusb "firmware request while host is not available" at resume
  2017-09-11 13:46             ` Greg Kroah-Hartman
@ 2017-09-11 17:11               ` Luis R. Rodriguez
  -1 siblings, 0 replies; 44+ messages in thread
From: Luis R. Rodriguez @ 2017-09-11 17:11 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Marcel Holtmann, Linus Torvalds, Gabriel C, Luis R. Rodriguez,
	Tejun Heo, Gustavo F. Padovan, Sukumar Ghorai, Rafael J. Wysocki,
	Linux Kernel Mailing List,
	bluez mailin list (linux-bluetooth@vger.kernel.org)

On Mon, Sep 11, 2017 at 06:46:47AM -0700, Greg Kroah-Hartman wrote:
> On Mon, Sep 11, 2017 at 07:12:44AM +0200, Marcel Holtmann wrote:
> > Hi Linus,
> > 
> > >> Yes 81f95076281f is to blame.. After reverting it all is fine again.
> > >> 
> > >> 15 resume cycles on the one laptop , 10 on the other without to hit the
> > >> trace.
> > > 
> > > Yeah, I think that disable/enable_firmware in the suspend/resume path
> > > is basically just completely random code. There  is nothing that
> > > serializes with anything else, and it has no actual basis for saying
> > > "I am now disabled/enabled" except for some entirely random time of
> > > whenever the suspend/resume callbacks happen.
> > > 
> > > I'm going to revert it.
> > > 
> > > I wonder why 4.13 seemed to work for me. Or maybe it didn't, and I
> > > just didn't notice, because I didn't use bluetooth and I wasn't
> > > traveling. I test my laptop every release, but I don't necessarily
> > > always _use_ it.
> > 
> > we have a bug report that might go into the similar direction. So the
> > problem is that modern Bluetooth controller require full firmware
> > loading (not just ROM patching) and if the controller has the firmware
> > somehow already loaded, but then looses power and needs a reload, then
> > it is not cached (since it was never requested in the first place).

You mean that on resume there may be a requirement to load a different file
than on boot? If so you can just request it proactively at boot.  Note that the
cache is associated for the device with devres, and even if you call
release_firmware() on the firmware it will be cached for you.  The firmware
cache works with the devres entry, and the firmware devres entry is maintained
throughout the lifetime of the device. This means that even if you
release_firmware() the firmware cache will still be used on resume from
suspend.

> > Of course if the reload triggers during resume when not file system is
> > available, it can not request the firmware. That will just fail and
> > thus you might see this issue.

One of the designs of the firmware cache is to circumvent these concerns,
so you just want to request what you need for suspend early on in the
boot cycle, the firmware API will do what it needs for you.

Unfortunately a lot of this was just not well documented, I hoped that
recent love in this are would have helped.

> > We have a few RFC patches on the
> > mailing list that I have to review. It is however not that easy all
> > the time to find the right firmware file (at least not for Intel
> > hardware) since the boot loader provides different information than
> > the fully operational firmware. So I need to make sure that request
> > the right firmware (if we do not need it initially) so that it gets
> > cached.
> 
> To confirm, reverting this fixes the problem I was seeing in 4.13.  I've
> queued it up for the next 4.13-stable release as well.

Commit 81f95076281f ("firmware: add sanity check on shutdown/suspend") may
seem kludgy but the reason for it was to cleanup the horrible forced and
required UMH lock even when the UMH lock was *not* even needed, which was later
removed via commit 06a45a93e7d34aa ("firmware: move umh try locks into the umh
code").

Removing the old UMH lock even when the UMH lock was *not* needed was the right
thing to do but commit 81f95076281f (("firmware: add sanity check on
shutdown/suspend") was put in place as a safe guard as the lock was also
placing an implicit sanity check on the API. It ensures the API with the cache
was used as designed, otherwise you do run the risk of *not getting the
firmware you may need* -- Marcel seems to acknowledge this possibility.

It may be possible for us to already have in place safeguards so that upon
resume we are ensuring the path to the firmware *is* available, so IMHO we
should remove this *iff* we can provide this guarantee.  Otherwise the check is
valid. You see, although the UMH lock was bogus, it did implicitly ask the
question: is it safe for *any* helper to run and make assumptions on the
filesystem then?

In lieu of this question being answered the warning is valid given the design
of the firmware API and the having the cache available as a measure to resolve
this race.

  Luis

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: btusb "firmware request while host is not available" at resume
@ 2017-09-11 17:11               ` Luis R. Rodriguez
  0 siblings, 0 replies; 44+ messages in thread
From: Luis R. Rodriguez @ 2017-09-11 17:11 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Marcel Holtmann, Linus Torvalds, Gabriel C, Luis R. Rodriguez,
	Tejun Heo, Gustavo F. Padovan, Sukumar Ghorai, Rafael J. Wysocki,
	Linux Kernel Mailing List,
	bluez mailin list (linux-bluetooth@vger.kernel.org)

On Mon, Sep 11, 2017 at 06:46:47AM -0700, Greg Kroah-Hartman wrote:
> On Mon, Sep 11, 2017 at 07:12:44AM +0200, Marcel Holtmann wrote:
> > Hi Linus,
> > 
> > >> Yes 81f95076281f is to blame.. After reverting it all is fine again.
> > >> 
> > >> 15 resume cycles on the one laptop , 10 on the other without to hit the
> > >> trace.
> > > 
> > > Yeah, I think that disable/enable_firmware in the suspend/resume path
> > > is basically just completely random code. There  is nothing that
> > > serializes with anything else, and it has no actual basis for saying
> > > "I am now disabled/enabled" except for some entirely random time of
> > > whenever the suspend/resume callbacks happen.
> > > 
> > > I'm going to revert it.
> > > 
> > > I wonder why 4.13 seemed to work for me. Or maybe it didn't, and I
> > > just didn't notice, because I didn't use bluetooth and I wasn't
> > > traveling. I test my laptop every release, but I don't necessarily
> > > always _use_ it.
> > 
> > we have a bug report that might go into the similar direction. So the
> > problem is that modern Bluetooth controller require full firmware
> > loading (not just ROM patching) and if the controller has the firmware
> > somehow already loaded, but then looses power and needs a reload, then
> > it is not cached (since it was never requested in the first place).

You mean that on resume there may be a requirement to load a different file
than on boot? If so you can just request it proactively at boot.  Note that the
cache is associated for the device with devres, and even if you call
release_firmware() on the firmware it will be cached for you.  The firmware
cache works with the devres entry, and the firmware devres entry is maintained
throughout the lifetime of the device. This means that even if you
release_firmware() the firmware cache will still be used on resume from
suspend.

> > Of course if the reload triggers during resume when not file system is
> > available, it can not request the firmware. That will just fail and
> > thus you might see this issue.

One of the designs of the firmware cache is to circumvent these concerns,
so you just want to request what you need for suspend early on in the
boot cycle, the firmware API will do what it needs for you.

Unfortunately a lot of this was just not well documented, I hoped that
recent love in this are would have helped.

> > We have a few RFC patches on the
> > mailing list that I have to review. It is however not that easy all
> > the time to find the right firmware file (at least not for Intel
> > hardware) since the boot loader provides different information than
> > the fully operational firmware. So I need to make sure that request
> > the right firmware (if we do not need it initially) so that it gets
> > cached.
> 
> To confirm, reverting this fixes the problem I was seeing in 4.13.  I've
> queued it up for the next 4.13-stable release as well.

Commit 81f95076281f ("firmware: add sanity check on shutdown/suspend") may
seem kludgy but the reason for it was to cleanup the horrible forced and
required UMH lock even when the UMH lock was *not* even needed, which was later
removed via commit 06a45a93e7d34aa ("firmware: move umh try locks into the umh
code").

Removing the old UMH lock even when the UMH lock was *not* needed was the right
thing to do but commit 81f95076281f (("firmware: add sanity check on
shutdown/suspend") was put in place as a safe guard as the lock was also
placing an implicit sanity check on the API. It ensures the API with the cache
was used as designed, otherwise you do run the risk of *not getting the
firmware you may need* -- Marcel seems to acknowledge this possibility.

It may be possible for us to already have in place safeguards so that upon
resume we are ensuring the path to the firmware *is* available, so IMHO we
should remove this *iff* we can provide this guarantee.  Otherwise the check is
valid. You see, although the UMH lock was bogus, it did implicitly ask the
question: is it safe for *any* helper to run and make assumptions on the
filesystem then?

In lieu of this question being answered the warning is valid given the design
of the firmware API and the having the cache available as a measure to resolve
this race.

  Luis

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: btusb "firmware request while host is not available" at resume
  2017-09-11 17:11               ` Luis R. Rodriguez
@ 2017-09-11 19:29                 ` Greg Kroah-Hartman
  -1 siblings, 0 replies; 44+ messages in thread
From: Greg Kroah-Hartman @ 2017-09-11 19:29 UTC (permalink / raw)
  To: Luis R. Rodriguez
  Cc: Marcel Holtmann, Linus Torvalds, Gabriel C, Tejun Heo,
	Gustavo F. Padovan, Sukumar Ghorai, Rafael J. Wysocki,
	Linux Kernel Mailing List,
	bluez mailin list (linux-bluetooth@vger.kernel.org)

On Mon, Sep 11, 2017 at 07:11:38PM +0200, Luis R. Rodriguez wrote:
> On Mon, Sep 11, 2017 at 06:46:47AM -0700, Greg Kroah-Hartman wrote:
> > To confirm, reverting this fixes the problem I was seeing in 4.13.  I've
> > queued it up for the next 4.13-stable release as well.
> 
> Commit 81f95076281f ("firmware: add sanity check on shutdown/suspend") may
> seem kludgy but the reason for it was to cleanup the horrible forced and
> required UMH lock even when the UMH lock was *not* even needed, which was later
> removed via commit 06a45a93e7d34aa ("firmware: move umh try locks into the umh
> code").

So what does this mean now that it is reverted?

> Removing the old UMH lock even when the UMH lock was *not* needed was the right
> thing to do but commit 81f95076281f (("firmware: add sanity check on
> shutdown/suspend") was put in place as a safe guard as the lock was also
> placing an implicit sanity check on the API. It ensures the API with the cache
> was used as designed, otherwise you do run the risk of *not getting the
> firmware you may need* -- Marcel seems to acknowledge this possibility.
> 
> It may be possible for us to already have in place safeguards so that upon
> resume we are ensuring the path to the firmware *is* available, so IMHO we
> should remove this *iff* we can provide this guarantee.  Otherwise the check is
> valid. You see, although the UMH lock was bogus, it did implicitly ask the
> question: is it safe for *any* helper to run and make assumptions on the
> filesystem then?
> 
> In lieu of this question being answered the warning is valid given the design
> of the firmware API and the having the cache available as a measure to resolve
> this race.

I don't understand what you are trying to say here at all.

To be specific, what, if anything, is a problem with the current state
of Linus's tree (and the next 4.13-stable release)?

If something needs to be fixed, can you make a patch showing that?  Or
do we also need to revert anything else as well to get back to a "better
working" state?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: btusb "firmware request while host is not available" at resume
@ 2017-09-11 19:29                 ` Greg Kroah-Hartman
  0 siblings, 0 replies; 44+ messages in thread
From: Greg Kroah-Hartman @ 2017-09-11 19:29 UTC (permalink / raw)
  To: Luis R. Rodriguez
  Cc: Marcel Holtmann, Linus Torvalds, Gabriel C, Tejun Heo,
	Gustavo F. Padovan, Sukumar Ghorai, Rafael J. Wysocki,
	Linux Kernel Mailing List,
	bluez mailin list (linux-bluetooth@vger.kernel.org)

On Mon, Sep 11, 2017 at 07:11:38PM +0200, Luis R. Rodriguez wrote:
> On Mon, Sep 11, 2017 at 06:46:47AM -0700, Greg Kroah-Hartman wrote:
> > To confirm, reverting this fixes the problem I was seeing in 4.13.  I've
> > queued it up for the next 4.13-stable release as well.
> 
> Commit 81f95076281f ("firmware: add sanity check on shutdown/suspend") may
> seem kludgy but the reason for it was to cleanup the horrible forced and
> required UMH lock even when the UMH lock was *not* even needed, which was later
> removed via commit 06a45a93e7d34aa ("firmware: move umh try locks into the umh
> code").

So what does this mean now that it is reverted?

> Removing the old UMH lock even when the UMH lock was *not* needed was the right
> thing to do but commit 81f95076281f (("firmware: add sanity check on
> shutdown/suspend") was put in place as a safe guard as the lock was also
> placing an implicit sanity check on the API. It ensures the API with the cache
> was used as designed, otherwise you do run the risk of *not getting the
> firmware you may need* -- Marcel seems to acknowledge this possibility.
> 
> It may be possible for us to already have in place safeguards so that upon
> resume we are ensuring the path to the firmware *is* available, so IMHO we
> should remove this *iff* we can provide this guarantee.  Otherwise the check is
> valid. You see, although the UMH lock was bogus, it did implicitly ask the
> question: is it safe for *any* helper to run and make assumptions on the
> filesystem then?
> 
> In lieu of this question being answered the warning is valid given the design
> of the firmware API and the having the cache available as a measure to resolve
> this race.

I don't understand what you are trying to say here at all.

To be specific, what, if anything, is a problem with the current state
of Linus's tree (and the next 4.13-stable release)?

If something needs to be fixed, can you make a patch showing that?  Or
do we also need to revert anything else as well to get back to a "better
working" state?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: btusb "firmware request while host is not available" at resume
  2017-09-11 19:29                 ` Greg Kroah-Hartman
@ 2017-09-11 20:06                   ` Luis R. Rodriguez
  -1 siblings, 0 replies; 44+ messages in thread
From: Luis R. Rodriguez @ 2017-09-11 20:06 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Luis R. Rodriguez, Marcel Holtmann, Linus Torvalds, Gabriel C,
	Tejun Heo, Gustavo F. Padovan, Sukumar Ghorai, Rafael J. Wysocki,
	Linux Kernel Mailing List,
	bluez mailin list (linux-bluetooth@vger.kernel.org)

On Mon, Sep 11, 2017 at 12:29:55PM -0700, Greg Kroah-Hartman wrote:
> On Mon, Sep 11, 2017 at 07:11:38PM +0200, Luis R. Rodriguez wrote:
> > On Mon, Sep 11, 2017 at 06:46:47AM -0700, Greg Kroah-Hartman wrote:
> > > To confirm, reverting this fixes the problem I was seeing in 4.13.  I've
> > > queued it up for the next 4.13-stable release as well.
> > 
> > Commit 81f95076281f ("firmware: add sanity check on shutdown/suspend") may
> > seem kludgy but the reason for it was to cleanup the horrible forced and
> > required UMH lock even when the UMH lock was *not* even needed, which was later
> > removed via commit 06a45a93e7d34aa ("firmware: move umh try locks into the umh
> > code").
> 
> So what does this mean now that it is reverted?

We discuss what we should do about upkeeping a warning in the future, as
I think technically the warning was still valid and it could help avoid
racy lookups with the filesystem which otherwise could have gone unnoticed.

> > Removing the old UMH lock even when the UMH lock was *not* needed was the right
> > thing to do but commit 81f95076281f (("firmware: add sanity check on
> > shutdown/suspend") was put in place as a safe guard as the lock was also
> > placing an implicit sanity check on the API. It ensures the API with the cache
> > was used as designed, otherwise you do run the risk of *not getting the
> > firmware you may need* -- Marcel seems to acknowledge this possibility.
> > 
> > It may be possible for us to already have in place safeguards so that upon
> > resume we are ensuring the path to the firmware *is* available, so IMHO we
> > should remove this *iff* we can provide this guarantee.  Otherwise the check is
> > valid. You see, although the UMH lock was bogus, it did implicitly ask the
> > question: is it safe for *any* helper to run and make assumptions on the
> > filesystem then?
> > 
> > In lieu of this question being answered the warning is valid given the design
> > of the firmware API and the having the cache available as a measure to resolve
> > this race.
> 
> I don't understand what you are trying to say here at all.
> 
> To be specific, what, if anything, is a problem with the current state
> of Linus's tree (and the next 4.13-stable release)?

The warning is issued when drivers issue *new* firmware requests on resume. The
firmware API cache was designed to enable drivers to easily be able to request
firmware on resume without concern about races against the filesystem, but in order
for this to work the drivers must have requested the firmware prior to suspend.

> If something needs to be fixed, can you make a patch showing that?  Or
> do we also need to revert anything else as well to get back to a "better
> working" state?

I took a look at the driver and it seems that btusb_setup_intel_new() is
not called after the driver is initialized, rather its called only when
hci_dev_do_open() is called. Its not clear to me how you can end up calling
this on resume but not prior to this on a running system. Feedback from
someone more familiar with bt would be useful.

I'd have the call for firmware on probe, no processing would be needed, just
a load to kick the cache into effect would suffice. This may require a bit
of code shift so its best someone more familiar do this.

If it confirmed this information is helping avoid these races we can reconsider
re-instating the warn as a firmware dev debugging aid for developers.

If the race this warning complained about is indeed possible the same race is
also possible for other usermode helpers. Its *why* the UMH lock was
implemented, it however was never generalized.

  Luis

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: btusb "firmware request while host is not available" at resume
@ 2017-09-11 20:06                   ` Luis R. Rodriguez
  0 siblings, 0 replies; 44+ messages in thread
From: Luis R. Rodriguez @ 2017-09-11 20:06 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Luis R. Rodriguez, Marcel Holtmann, Linus Torvalds, Gabriel C,
	Tejun Heo, Gustavo F. Padovan, Sukumar Ghorai, Rafael J. Wysocki,
	Linux Kernel Mailing List,
	bluez mailin list (linux-bluetooth@vger.kernel.org)

On Mon, Sep 11, 2017 at 12:29:55PM -0700, Greg Kroah-Hartman wrote:
> On Mon, Sep 11, 2017 at 07:11:38PM +0200, Luis R. Rodriguez wrote:
> > On Mon, Sep 11, 2017 at 06:46:47AM -0700, Greg Kroah-Hartman wrote:
> > > To confirm, reverting this fixes the problem I was seeing in 4.13.  I've
> > > queued it up for the next 4.13-stable release as well.
> > 
> > Commit 81f95076281f ("firmware: add sanity check on shutdown/suspend") may
> > seem kludgy but the reason for it was to cleanup the horrible forced and
> > required UMH lock even when the UMH lock was *not* even needed, which was later
> > removed via commit 06a45a93e7d34aa ("firmware: move umh try locks into the umh
> > code").
> 
> So what does this mean now that it is reverted?

We discuss what we should do about upkeeping a warning in the future, as
I think technically the warning was still valid and it could help avoid
racy lookups with the filesystem which otherwise could have gone unnoticed.

> > Removing the old UMH lock even when the UMH lock was *not* needed was the right
> > thing to do but commit 81f95076281f (("firmware: add sanity check on
> > shutdown/suspend") was put in place as a safe guard as the lock was also
> > placing an implicit sanity check on the API. It ensures the API with the cache
> > was used as designed, otherwise you do run the risk of *not getting the
> > firmware you may need* -- Marcel seems to acknowledge this possibility.
> > 
> > It may be possible for us to already have in place safeguards so that upon
> > resume we are ensuring the path to the firmware *is* available, so IMHO we
> > should remove this *iff* we can provide this guarantee.  Otherwise the check is
> > valid. You see, although the UMH lock was bogus, it did implicitly ask the
> > question: is it safe for *any* helper to run and make assumptions on the
> > filesystem then?
> > 
> > In lieu of this question being answered the warning is valid given the design
> > of the firmware API and the having the cache available as a measure to resolve
> > this race.
> 
> I don't understand what you are trying to say here at all.
> 
> To be specific, what, if anything, is a problem with the current state
> of Linus's tree (and the next 4.13-stable release)?

The warning is issued when drivers issue *new* firmware requests on resume. The
firmware API cache was designed to enable drivers to easily be able to request
firmware on resume without concern about races against the filesystem, but in order
for this to work the drivers must have requested the firmware prior to suspend.

> If something needs to be fixed, can you make a patch showing that?  Or
> do we also need to revert anything else as well to get back to a "better
> working" state?

I took a look at the driver and it seems that btusb_setup_intel_new() is
not called after the driver is initialized, rather its called only when
hci_dev_do_open() is called. Its not clear to me how you can end up calling
this on resume but not prior to this on a running system. Feedback from
someone more familiar with bt would be useful.

I'd have the call for firmware on probe, no processing would be needed, just
a load to kick the cache into effect would suffice. This may require a bit
of code shift so its best someone more familiar do this.

If it confirmed this information is helping avoid these races we can reconsider
re-instating the warn as a firmware dev debugging aid for developers.

If the race this warning complained about is indeed possible the same race is
also possible for other usermode helpers. Its *why* the UMH lock was
implemented, it however was never generalized.

  Luis

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: btusb "firmware request while host is not available" at resume
  2017-09-11 20:06                   ` Luis R. Rodriguez
@ 2017-09-12  0:13                     ` Gabriel C
  -1 siblings, 0 replies; 44+ messages in thread
From: Gabriel C @ 2017-09-12  0:13 UTC (permalink / raw)
  To: Luis R. Rodriguez, Greg Kroah-Hartman
  Cc: Marcel Holtmann, Linus Torvalds, Tejun Heo, Gustavo F. Padovan,
	Sukumar Ghorai, Rafael J. Wysocki, Linux Kernel Mailing List,
	bluez mailin list (linux-bluetooth@vger.kernel.org)

On 11.09.2017 22:06, Luis R. Rodriguez wrote:
> On Mon, Sep 11, 2017 at 12:29:55PM -0700, Greg Kroah-Hartman wrote:
>> On Mon, Sep 11, 2017 at 07:11:38PM +0200, Luis R. Rodriguez wrote:
>>> On Mon, Sep 11, 2017 at 06:46:47AM -0700, Greg Kroah-Hartman wrote:
>>>> To confirm, reverting this fixes the problem I was seeing in 4.13.  I've
>>>> queued it up for the next 4.13-stable release as well.
>>>
>>> Commit 81f95076281f ("firmware: add sanity check on shutdown/suspend") may
>>> seem kludgy but the reason for it was to cleanup the horrible forced and
>>> required UMH lock even when the UMH lock was *not* even needed, which was later
>>> removed via commit 06a45a93e7d34aa ("firmware: move umh try locks into the umh
>>> code").
>>
>> So what does this mean now that it is reverted?
> 
> We discuss what we should do about upkeeping a warning in the future, as
> I think technically the warning was still valid and it could help avoid
> racy lookups with the filesystem which otherwise could have gone unnoticed.
> 
>>> Removing the old UMH lock even when the UMH lock was *not* needed was the right
>>> thing to do but commit 81f95076281f (("firmware: add sanity check on
>>> shutdown/suspend") was put in place as a safe guard as the lock was also
>>> placing an implicit sanity check on the API. It ensures the API with the cache
>>> was used as designed, otherwise you do run the risk of *not getting the
>>> firmware you may need* -- Marcel seems to acknowledge this possibility.
>>>
>>> It may be possible for us to already have in place safeguards so that upon
>>> resume we are ensuring the path to the firmware *is* available, so IMHO we
>>> should remove this *iff* we can provide this guarantee.  Otherwise the check is
>>> valid. You see, although the UMH lock was bogus, it did implicitly ask the
>>> question: is it safe for *any* helper to run and make assumptions on the
>>> filesystem then?
>>>
>>> In lieu of this question being answered the warning is valid given the design
>>> of the firmware API and the having the cache available as a measure to resolve
>>> this race.
>>
>> I don't understand what you are trying to say here at all.
>>
>> To be specific, what, if anything, is a problem with the current state
>> of Linus's tree (and the next 4.13-stable release)?
> 
> The warning is issued when drivers issue *new* firmware requests on resume. The
> firmware API cache was designed to enable drivers to easily be able to request
> firmware on resume without concern about races against the filesystem, but in order
> for this to work the drivers must have requested the firmware prior to suspend.
> 
>> If something needs to be fixed, can you make a patch showing that?  Or
>> do we also need to revert anything else as well to get back to a "better
>> working" state?
> 
> I took a look at the driver and it seems that btusb_setup_intel_new() is
> not called after the driver is initialized, rather its called only when
> hci_dev_do_open() is called. Its not clear to me how you can end up calling
> this on resume but not prior to this on a running system. Feedback from
> someone more familiar with bt would be useful.

While I really don't know that code I can tell anything about .. however this is _NOT_
about Intel Hardware only .. I trigger that on laptops with Atheros..

So seems like some driver need be fixed before trying something like this again?

Regards,

Gabriel C

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: btusb "firmware request while host is not available" at resume
@ 2017-09-12  0:13                     ` Gabriel C
  0 siblings, 0 replies; 44+ messages in thread
From: Gabriel C @ 2017-09-12  0:13 UTC (permalink / raw)
  To: Luis R. Rodriguez, Greg Kroah-Hartman
  Cc: Marcel Holtmann, Linus Torvalds, Tejun Heo, Gustavo F. Padovan,
	Sukumar Ghorai, Rafael J. Wysocki, Linux Kernel Mailing List,
	bluez mailin list (linux-bluetooth@vger.kernel.org)

On 11.09.2017 22:06, Luis R. Rodriguez wrote:
> On Mon, Sep 11, 2017 at 12:29:55PM -0700, Greg Kroah-Hartman wrote:
>> On Mon, Sep 11, 2017 at 07:11:38PM +0200, Luis R. Rodriguez wrote:
>>> On Mon, Sep 11, 2017 at 06:46:47AM -0700, Greg Kroah-Hartman wrote:
>>>> To confirm, reverting this fixes the problem I was seeing in 4.13.  I've
>>>> queued it up for the next 4.13-stable release as well.
>>>
>>> Commit 81f95076281f ("firmware: add sanity check on shutdown/suspend") may
>>> seem kludgy but the reason for it was to cleanup the horrible forced and
>>> required UMH lock even when the UMH lock was *not* even needed, which was later
>>> removed via commit 06a45a93e7d34aa ("firmware: move umh try locks into the umh
>>> code").
>>
>> So what does this mean now that it is reverted?
> 
> We discuss what we should do about upkeeping a warning in the future, as
> I think technically the warning was still valid and it could help avoid
> racy lookups with the filesystem which otherwise could have gone unnoticed.
> 
>>> Removing the old UMH lock even when the UMH lock was *not* needed was the right
>>> thing to do but commit 81f95076281f (("firmware: add sanity check on
>>> shutdown/suspend") was put in place as a safe guard as the lock was also
>>> placing an implicit sanity check on the API. It ensures the API with the cache
>>> was used as designed, otherwise you do run the risk of *not getting the
>>> firmware you may need* -- Marcel seems to acknowledge this possibility.
>>>
>>> It may be possible for us to already have in place safeguards so that upon
>>> resume we are ensuring the path to the firmware *is* available, so IMHO we
>>> should remove this *iff* we can provide this guarantee.  Otherwise the check is
>>> valid. You see, although the UMH lock was bogus, it did implicitly ask the
>>> question: is it safe for *any* helper to run and make assumptions on the
>>> filesystem then?
>>>
>>> In lieu of this question being answered the warning is valid given the design
>>> of the firmware API and the having the cache available as a measure to resolve
>>> this race.
>>
>> I don't understand what you are trying to say here at all.
>>
>> To be specific, what, if anything, is a problem with the current state
>> of Linus's tree (and the next 4.13-stable release)?
> 
> The warning is issued when drivers issue *new* firmware requests on resume. The
> firmware API cache was designed to enable drivers to easily be able to request
> firmware on resume without concern about races against the filesystem, but in order
> for this to work the drivers must have requested the firmware prior to suspend.
> 
>> If something needs to be fixed, can you make a patch showing that?  Or
>> do we also need to revert anything else as well to get back to a "better
>> working" state?
> 
> I took a look at the driver and it seems that btusb_setup_intel_new() is
> not called after the driver is initialized, rather its called only when
> hci_dev_do_open() is called. Its not clear to me how you can end up calling
> this on resume but not prior to this on a running system. Feedback from
> someone more familiar with bt would be useful.

While I really don't know that code I can tell anything about .. however this is _NOT_
about Intel Hardware only .. I trigger that on laptops with Atheros..

So seems like some driver need be fixed before trying something like this again?

Regards,

Gabriel C

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: btusb "firmware request while host is not available" at resume
  2017-09-12  0:13                     ` Gabriel C
@ 2017-09-12  0:33                       ` Luis R. Rodriguez
  -1 siblings, 0 replies; 44+ messages in thread
From: Luis R. Rodriguez @ 2017-09-12  0:33 UTC (permalink / raw)
  To: Gabriel C
  Cc: Greg Kroah-Hartman, Marcel Holtmann, Linus Torvalds, Tejun Heo,
	Gustavo F. Padovan, Sukumar Ghorai, Rafael J. Wysocki,
	Linux Kernel Mailing List,
	bluez mailin list (linux-bluetooth@vger.kernel.org)

On Mon, Sep 11, 2017 at 5:13 PM, Gabriel C <nix.or.die@gmail.com> wrote:
> On 11.09.2017 22:06, Luis R. Rodriguez wrote:
>>
>> On Mon, Sep 11, 2017 at 12:29:55PM -0700, Greg Kroah-Hartman wrote:
>>>
>>> On Mon, Sep 11, 2017 at 07:11:38PM +0200, Luis R. Rodriguez wrote:
>>>>
>>>> On Mon, Sep 11, 2017 at 06:46:47AM -0700, Greg Kroah-Hartman wrote:
>>>>>
>>>>> To confirm, reverting this fixes the problem I was seeing in 4.13.
>>>>> I've
>>>>> queued it up for the next 4.13-stable release as well.
>>>>
>>>>
>>>> Commit 81f95076281f ("firmware: add sanity check on shutdown/suspend")
>>>> may
>>>> seem kludgy but the reason for it was to cleanup the horrible forced and
>>>> required UMH lock even when the UMH lock was *not* even needed, which
>>>> was later
>>>> removed via commit 06a45a93e7d34aa ("firmware: move umh try locks into
>>>> the umh
>>>> code").
>>>
>>>
>>> So what does this mean now that it is reverted?
>>
>>
>> We discuss what we should do about upkeeping a warning in the future, as
>> I think technically the warning was still valid and it could help avoid
>> racy lookups with the filesystem which otherwise could have gone
>> unnoticed.
>>
>>>> Removing the old UMH lock even when the UMH lock was *not* needed was
>>>> the right
>>>> thing to do but commit 81f95076281f (("firmware: add sanity check on
>>>> shutdown/suspend") was put in place as a safe guard as the lock was also
>>>> placing an implicit sanity check on the API. It ensures the API with the
>>>> cache
>>>> was used as designed, otherwise you do run the risk of *not getting the
>>>> firmware you may need* -- Marcel seems to acknowledge this possibility.
>>>>
>>>> It may be possible for us to already have in place safeguards so that
>>>> upon
>>>> resume we are ensuring the path to the firmware *is* available, so IMHO
>>>> we
>>>> should remove this *iff* we can provide this guarantee.  Otherwise the
>>>> check is
>>>> valid. You see, although the UMH lock was bogus, it did implicitly ask
>>>> the
>>>> question: is it safe for *any* helper to run and make assumptions on the
>>>> filesystem then?
>>>>
>>>> In lieu of this question being answered the warning is valid given the
>>>> design
>>>> of the firmware API and the having the cache available as a measure to
>>>> resolve
>>>> this race.
>>>
>>>
>>> I don't understand what you are trying to say here at all.
>>>
>>> To be specific, what, if anything, is a problem with the current state
>>> of Linus's tree (and the next 4.13-stable release)?
>>
>>
>> The warning is issued when drivers issue *new* firmware requests on
>> resume. The
>> firmware API cache was designed to enable drivers to easily be able to
>> request
>> firmware on resume without concern about races against the filesystem, but
>> in order
>> for this to work the drivers must have requested the firmware prior to
>> suspend.
>>
>>> If something needs to be fixed, can you make a patch showing that?  Or
>>> do we also need to revert anything else as well to get back to a "better
>>> working" state?
>>
>>
>> I took a look at the driver and it seems that btusb_setup_intel_new() is
>> not called after the driver is initialized, rather its called only when
>> hci_dev_do_open() is called. Its not clear to me how you can end up
>> calling
>> this on resume but not prior to this on a running system. Feedback from
>> someone more familiar with bt would be useful.
>
>
> While I really don't know that code I can tell anything about .. however
> this is _NOT_
> about Intel Hardware only .. I trigger that on laptops with Atheros..

If the Atheros driver works in the same way then the same race exists
there as well.

  Luis

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: btusb "firmware request while host is not available" at resume
@ 2017-09-12  0:33                       ` Luis R. Rodriguez
  0 siblings, 0 replies; 44+ messages in thread
From: Luis R. Rodriguez @ 2017-09-12  0:33 UTC (permalink / raw)
  To: Gabriel C
  Cc: Greg Kroah-Hartman, Marcel Holtmann, Linus Torvalds, Tejun Heo,
	Gustavo F. Padovan, Sukumar Ghorai, Rafael J. Wysocki,
	Linux Kernel Mailing List,
	bluez mailin list (linux-bluetooth@vger.kernel.org)

On Mon, Sep 11, 2017 at 5:13 PM, Gabriel C <nix.or.die@gmail.com> wrote:
> On 11.09.2017 22:06, Luis R. Rodriguez wrote:
>>
>> On Mon, Sep 11, 2017 at 12:29:55PM -0700, Greg Kroah-Hartman wrote:
>>>
>>> On Mon, Sep 11, 2017 at 07:11:38PM +0200, Luis R. Rodriguez wrote:
>>>>
>>>> On Mon, Sep 11, 2017 at 06:46:47AM -0700, Greg Kroah-Hartman wrote:
>>>>>
>>>>> To confirm, reverting this fixes the problem I was seeing in 4.13.
>>>>> I've
>>>>> queued it up for the next 4.13-stable release as well.
>>>>
>>>>
>>>> Commit 81f95076281f ("firmware: add sanity check on shutdown/suspend")
>>>> may
>>>> seem kludgy but the reason for it was to cleanup the horrible forced and
>>>> required UMH lock even when the UMH lock was *not* even needed, which
>>>> was later
>>>> removed via commit 06a45a93e7d34aa ("firmware: move umh try locks into
>>>> the umh
>>>> code").
>>>
>>>
>>> So what does this mean now that it is reverted?
>>
>>
>> We discuss what we should do about upkeeping a warning in the future, as
>> I think technically the warning was still valid and it could help avoid
>> racy lookups with the filesystem which otherwise could have gone
>> unnoticed.
>>
>>>> Removing the old UMH lock even when the UMH lock was *not* needed was
>>>> the right
>>>> thing to do but commit 81f95076281f (("firmware: add sanity check on
>>>> shutdown/suspend") was put in place as a safe guard as the lock was also
>>>> placing an implicit sanity check on the API. It ensures the API with the
>>>> cache
>>>> was used as designed, otherwise you do run the risk of *not getting the
>>>> firmware you may need* -- Marcel seems to acknowledge this possibility.
>>>>
>>>> It may be possible for us to already have in place safeguards so that
>>>> upon
>>>> resume we are ensuring the path to the firmware *is* available, so IMHO
>>>> we
>>>> should remove this *iff* we can provide this guarantee.  Otherwise the
>>>> check is
>>>> valid. You see, although the UMH lock was bogus, it did implicitly ask
>>>> the
>>>> question: is it safe for *any* helper to run and make assumptions on the
>>>> filesystem then?
>>>>
>>>> In lieu of this question being answered the warning is valid given the
>>>> design
>>>> of the firmware API and the having the cache available as a measure to
>>>> resolve
>>>> this race.
>>>
>>>
>>> I don't understand what you are trying to say here at all.
>>>
>>> To be specific, what, if anything, is a problem with the current state
>>> of Linus's tree (and the next 4.13-stable release)?
>>
>>
>> The warning is issued when drivers issue *new* firmware requests on
>> resume. The
>> firmware API cache was designed to enable drivers to easily be able to
>> request
>> firmware on resume without concern about races against the filesystem, but
>> in order
>> for this to work the drivers must have requested the firmware prior to
>> suspend.
>>
>>> If something needs to be fixed, can you make a patch showing that?  Or
>>> do we also need to revert anything else as well to get back to a "better
>>> working" state?
>>
>>
>> I took a look at the driver and it seems that btusb_setup_intel_new() is
>> not called after the driver is initialized, rather its called only when
>> hci_dev_do_open() is called. Its not clear to me how you can end up
>> calling
>> this on resume but not prior to this on a running system. Feedback from
>> someone more familiar with bt would be useful.
>
>
> While I really don't know that code I can tell anything about .. however
> this is _NOT_
> about Intel Hardware only .. I trigger that on laptops with Atheros..

If the Atheros driver works in the same way then the same race exists
there as well.

  Luis

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: btusb "firmware request while host is not available" at resume
  2017-09-11 20:06                   ` Luis R. Rodriguez
@ 2017-09-12  0:48                     ` Greg Kroah-Hartman
  -1 siblings, 0 replies; 44+ messages in thread
From: Greg Kroah-Hartman @ 2017-09-12  0:48 UTC (permalink / raw)
  To: Luis R. Rodriguez
  Cc: Marcel Holtmann, Linus Torvalds, Gabriel C, Tejun Heo,
	Gustavo F. Padovan, Sukumar Ghorai, Rafael J. Wysocki,
	Linux Kernel Mailing List,
	bluez mailin list (linux-bluetooth@vger.kernel.org)

On Mon, Sep 11, 2017 at 10:06:51PM +0200, Luis R. Rodriguez wrote:
> On Mon, Sep 11, 2017 at 12:29:55PM -0700, Greg Kroah-Hartman wrote:
> > On Mon, Sep 11, 2017 at 07:11:38PM +0200, Luis R. Rodriguez wrote:
> > > On Mon, Sep 11, 2017 at 06:46:47AM -0700, Greg Kroah-Hartman wrote:
> > > > To confirm, reverting this fixes the problem I was seeing in 4.13.  I've
> > > > queued it up for the next 4.13-stable release as well.
> > > 
> > > Commit 81f95076281f ("firmware: add sanity check on shutdown/suspend") may
> > > seem kludgy but the reason for it was to cleanup the horrible forced and
> > > required UMH lock even when the UMH lock was *not* even needed, which was later
> > > removed via commit 06a45a93e7d34aa ("firmware: move umh try locks into the umh
> > > code").
> > 
> > So what does this mean now that it is reverted?
> 
> We discuss what we should do about upkeeping a warning in the future, as
> I think technically the warning was still valid and it could help avoid
> racy lookups with the filesystem which otherwise could have gone unnoticed.

Given that things that used to work before the patch, and then didn't
work after the patch was added, but then worked again after the patch
was reverted, I don't think that adding the warning back is good, if it
still breaks things...

> > > Removing the old UMH lock even when the UMH lock was *not* needed was the right
> > > thing to do but commit 81f95076281f (("firmware: add sanity check on
> > > shutdown/suspend") was put in place as a safe guard as the lock was also
> > > placing an implicit sanity check on the API. It ensures the API with the cache
> > > was used as designed, otherwise you do run the risk of *not getting the
> > > firmware you may need* -- Marcel seems to acknowledge this possibility.
> > > 
> > > It may be possible for us to already have in place safeguards so that upon
> > > resume we are ensuring the path to the firmware *is* available, so IMHO we
> > > should remove this *iff* we can provide this guarantee.  Otherwise the check is
> > > valid. You see, although the UMH lock was bogus, it did implicitly ask the
> > > question: is it safe for *any* helper to run and make assumptions on the
> > > filesystem then?
> > > 
> > > In lieu of this question being answered the warning is valid given the design
> > > of the firmware API and the having the cache available as a measure to resolve
> > > this race.
> > 
> > I don't understand what you are trying to say here at all.
> > 
> > To be specific, what, if anything, is a problem with the current state
> > of Linus's tree (and the next 4.13-stable release)?
> 
> The warning is issued when drivers issue *new* firmware requests on resume. The
> firmware API cache was designed to enable drivers to easily be able to request
> firmware on resume without concern about races against the filesystem, but in order
> for this to work the drivers must have requested the firmware prior to suspend.

Ok, then how is this all working today?

I think much more is going on here, sorry, this doesn't make sense.

> > If something needs to be fixed, can you make a patch showing that?  Or
> > do we also need to revert anything else as well to get back to a "better
> > working" state?
> 
> I took a look at the driver and it seems that btusb_setup_intel_new() is
> not called after the driver is initialized, rather its called only when
> hci_dev_do_open() is called. Its not clear to me how you can end up calling
> this on resume but not prior to this on a running system. Feedback from
> someone more familiar with bt would be useful.

open is a great place to want to load firmware, that makes sense, you
get to defer loading stuff until you actually need it.  I see nothing
wrong here (again, especially as it _does_ work...)

> If it confirmed this information is helping avoid these races we can reconsider
> re-instating the warn as a firmware dev debugging aid for developers.

Again, obviously this was not just a "warning" message, you changed the
logic here.

Anyway, all is good now, I guess we don't have to worry about it :)

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: btusb "firmware request while host is not available" at resume
@ 2017-09-12  0:48                     ` Greg Kroah-Hartman
  0 siblings, 0 replies; 44+ messages in thread
From: Greg Kroah-Hartman @ 2017-09-12  0:48 UTC (permalink / raw)
  To: Luis R. Rodriguez
  Cc: Marcel Holtmann, Linus Torvalds, Gabriel C, Tejun Heo,
	Gustavo F. Padovan, Sukumar Ghorai, Rafael J. Wysocki,
	Linux Kernel Mailing List,
	bluez mailin list (linux-bluetooth@vger.kernel.org)

On Mon, Sep 11, 2017 at 10:06:51PM +0200, Luis R. Rodriguez wrote:
> On Mon, Sep 11, 2017 at 12:29:55PM -0700, Greg Kroah-Hartman wrote:
> > On Mon, Sep 11, 2017 at 07:11:38PM +0200, Luis R. Rodriguez wrote:
> > > On Mon, Sep 11, 2017 at 06:46:47AM -0700, Greg Kroah-Hartman wrote:
> > > > To confirm, reverting this fixes the problem I was seeing in 4.13.  I've
> > > > queued it up for the next 4.13-stable release as well.
> > > 
> > > Commit 81f95076281f ("firmware: add sanity check on shutdown/suspend") may
> > > seem kludgy but the reason for it was to cleanup the horrible forced and
> > > required UMH lock even when the UMH lock was *not* even needed, which was later
> > > removed via commit 06a45a93e7d34aa ("firmware: move umh try locks into the umh
> > > code").
> > 
> > So what does this mean now that it is reverted?
> 
> We discuss what we should do about upkeeping a warning in the future, as
> I think technically the warning was still valid and it could help avoid
> racy lookups with the filesystem which otherwise could have gone unnoticed.

Given that things that used to work before the patch, and then didn't
work after the patch was added, but then worked again after the patch
was reverted, I don't think that adding the warning back is good, if it
still breaks things...

> > > Removing the old UMH lock even when the UMH lock was *not* needed was the right
> > > thing to do but commit 81f95076281f (("firmware: add sanity check on
> > > shutdown/suspend") was put in place as a safe guard as the lock was also
> > > placing an implicit sanity check on the API. It ensures the API with the cache
> > > was used as designed, otherwise you do run the risk of *not getting the
> > > firmware you may need* -- Marcel seems to acknowledge this possibility.
> > > 
> > > It may be possible for us to already have in place safeguards so that upon
> > > resume we are ensuring the path to the firmware *is* available, so IMHO we
> > > should remove this *iff* we can provide this guarantee.  Otherwise the check is
> > > valid. You see, although the UMH lock was bogus, it did implicitly ask the
> > > question: is it safe for *any* helper to run and make assumptions on the
> > > filesystem then?
> > > 
> > > In lieu of this question being answered the warning is valid given the design
> > > of the firmware API and the having the cache available as a measure to resolve
> > > this race.
> > 
> > I don't understand what you are trying to say here at all.
> > 
> > To be specific, what, if anything, is a problem with the current state
> > of Linus's tree (and the next 4.13-stable release)?
> 
> The warning is issued when drivers issue *new* firmware requests on resume. The
> firmware API cache was designed to enable drivers to easily be able to request
> firmware on resume without concern about races against the filesystem, but in order
> for this to work the drivers must have requested the firmware prior to suspend.

Ok, then how is this all working today?

I think much more is going on here, sorry, this doesn't make sense.

> > If something needs to be fixed, can you make a patch showing that?  Or
> > do we also need to revert anything else as well to get back to a "better
> > working" state?
> 
> I took a look at the driver and it seems that btusb_setup_intel_new() is
> not called after the driver is initialized, rather its called only when
> hci_dev_do_open() is called. Its not clear to me how you can end up calling
> this on resume but not prior to this on a running system. Feedback from
> someone more familiar with bt would be useful.

open is a great place to want to load firmware, that makes sense, you
get to defer loading stuff until you actually need it.  I see nothing
wrong here (again, especially as it _does_ work...)

> If it confirmed this information is helping avoid these races we can reconsider
> re-instating the warn as a firmware dev debugging aid for developers.

Again, obviously this was not just a "warning" message, you changed the
logic here.

Anyway, all is good now, I guess we don't have to worry about it :)

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: btusb "firmware request while host is not available" at resume
  2017-09-11 20:06                   ` Luis R. Rodriguez
@ 2017-09-12  5:13                     ` Marcel Holtmann
  -1 siblings, 0 replies; 44+ messages in thread
From: Marcel Holtmann @ 2017-09-12  5:13 UTC (permalink / raw)
  To: Luis R. Rodriguez
  Cc: Greg Kroah-Hartman, Linus Torvalds, Gabriel C, Tejun Heo,
	Gustavo F. Padovan, Sukumar Ghorai, Rafael J. Wysocki,
	Linux Kernel Mailing List,
	bluez mailin list (linux-bluetooth@vger.kernel.org)

Hi Luis,

>>>> To confirm, reverting this fixes the problem I was seeing in 4.13.  I've
>>>> queued it up for the next 4.13-stable release as well.
>>> 
>>> Commit 81f95076281f ("firmware: add sanity check on shutdown/suspend") may
>>> seem kludgy but the reason for it was to cleanup the horrible forced and
>>> required UMH lock even when the UMH lock was *not* even needed, which was later
>>> removed via commit 06a45a93e7d34aa ("firmware: move umh try locks into the umh
>>> code").
>> 
>> So what does this mean now that it is reverted?
> 
> We discuss what we should do about upkeeping a warning in the future, as
> I think technically the warning was still valid and it could help avoid
> racy lookups with the filesystem which otherwise could have gone unnoticed.
> 
>>> Removing the old UMH lock even when the UMH lock was *not* needed was the right
>>> thing to do but commit 81f95076281f (("firmware: add sanity check on
>>> shutdown/suspend") was put in place as a safe guard as the lock was also
>>> placing an implicit sanity check on the API. It ensures the API with the cache
>>> was used as designed, otherwise you do run the risk of *not getting the
>>> firmware you may need* -- Marcel seems to acknowledge this possibility.
>>> 
>>> It may be possible for us to already have in place safeguards so that upon
>>> resume we are ensuring the path to the firmware *is* available, so IMHO we
>>> should remove this *iff* we can provide this guarantee.  Otherwise the check is
>>> valid. You see, although the UMH lock was bogus, it did implicitly ask the
>>> question: is it safe for *any* helper to run and make assumptions on the
>>> filesystem then?
>>> 
>>> In lieu of this question being answered the warning is valid given the design
>>> of the firmware API and the having the cache available as a measure to resolve
>>> this race.
>> 
>> I don't understand what you are trying to say here at all.
>> 
>> To be specific, what, if anything, is a problem with the current state
>> of Linus's tree (and the next 4.13-stable release)?
> 
> The warning is issued when drivers issue *new* firmware requests on resume. The
> firmware API cache was designed to enable drivers to easily be able to request
> firmware on resume without concern about races against the filesystem, but in order
> for this to work the drivers must have requested the firmware prior to suspend.
> 
>> If something needs to be fixed, can you make a patch showing that?  Or
>> do we also need to revert anything else as well to get back to a "better
>> working" state?
> 
> I took a look at the driver and it seems that btusb_setup_intel_new() is
> not called after the driver is initialized, rather its called only when
> hci_dev_do_open() is called. Its not clear to me how you can end up calling
> this on resume but not prior to this on a running system. Feedback from
> someone more familiar with bt would be useful.
> 
> I'd have the call for firmware on probe, no processing would be needed, just
> a load to kick the cache into effect would suffice. This may require a bit
> of code shift so its best someone more familiar do this.
> 
> If it confirmed this information is helping avoid these races we can reconsider
> re-instating the warn as a firmware dev debugging aid for developers.
> 
> If the race this warning complained about is indeed possible the same race is
> also possible for other usermode helpers. Its *why* the UMH lock was
> implemented, it however was never generalized.

we can not load firmware on probe() in most cases. The btusb.ko driver establishes the HCI transport. It is setup in probe() and then started in hci_dev_do_open() and if there is a vendor setup routine like btsub_setup_intel_new(), then it is executed. Most Bluetooth controllers (not all, but most) are doing firmware loading over the HCI transport with vendor specific commands.

And yes, if the firmware was already loaded, we would skip requesting it at all. Which means after suspend/resume cycle where the power to the controller is cut, then we are missing the firmware from the cache. Since we never loaded it in the first place.

So yes, we would have to redo parts of the vendor specific handling to always request the firmware, even if we do not need it right now. And frankly that is not obvious to anybody. We seem to have some patches for doing exactly that, but I have not gotten to review them in detail since they deal with vendor specific complex setup handling. Also this affects more than just Intel since all hardware where firmware loading is skipped if there is already firmware loaded are affected.

Regards

Marcel

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: btusb "firmware request while host is not available" at resume
@ 2017-09-12  5:13                     ` Marcel Holtmann
  0 siblings, 0 replies; 44+ messages in thread
From: Marcel Holtmann @ 2017-09-12  5:13 UTC (permalink / raw)
  To: Luis R. Rodriguez
  Cc: Greg Kroah-Hartman, Linus Torvalds, Gabriel C, Tejun Heo,
	Gustavo F. Padovan, Sukumar Ghorai, Rafael J. Wysocki,
	Linux Kernel Mailing List,
	bluez mailin list (linux-bluetooth@vger.kernel.org)

Hi Luis,

>>>> To confirm, reverting this fixes the problem I was seeing in 4.13.  I've
>>>> queued it up for the next 4.13-stable release as well.
>>> 
>>> Commit 81f95076281f ("firmware: add sanity check on shutdown/suspend") may
>>> seem kludgy but the reason for it was to cleanup the horrible forced and
>>> required UMH lock even when the UMH lock was *not* even needed, which was later
>>> removed via commit 06a45a93e7d34aa ("firmware: move umh try locks into the umh
>>> code").
>> 
>> So what does this mean now that it is reverted?
> 
> We discuss what we should do about upkeeping a warning in the future, as
> I think technically the warning was still valid and it could help avoid
> racy lookups with the filesystem which otherwise could have gone unnoticed.
> 
>>> Removing the old UMH lock even when the UMH lock was *not* needed was the right
>>> thing to do but commit 81f95076281f (("firmware: add sanity check on
>>> shutdown/suspend") was put in place as a safe guard as the lock was also
>>> placing an implicit sanity check on the API. It ensures the API with the cache
>>> was used as designed, otherwise you do run the risk of *not getting the
>>> firmware you may need* -- Marcel seems to acknowledge this possibility.
>>> 
>>> It may be possible for us to already have in place safeguards so that upon
>>> resume we are ensuring the path to the firmware *is* available, so IMHO we
>>> should remove this *iff* we can provide this guarantee.  Otherwise the check is
>>> valid. You see, although the UMH lock was bogus, it did implicitly ask the
>>> question: is it safe for *any* helper to run and make assumptions on the
>>> filesystem then?
>>> 
>>> In lieu of this question being answered the warning is valid given the design
>>> of the firmware API and the having the cache available as a measure to resolve
>>> this race.
>> 
>> I don't understand what you are trying to say here at all.
>> 
>> To be specific, what, if anything, is a problem with the current state
>> of Linus's tree (and the next 4.13-stable release)?
> 
> The warning is issued when drivers issue *new* firmware requests on resume. The
> firmware API cache was designed to enable drivers to easily be able to request
> firmware on resume without concern about races against the filesystem, but in order
> for this to work the drivers must have requested the firmware prior to suspend.
> 
>> If something needs to be fixed, can you make a patch showing that?  Or
>> do we also need to revert anything else as well to get back to a "better
>> working" state?
> 
> I took a look at the driver and it seems that btusb_setup_intel_new() is
> not called after the driver is initialized, rather its called only when
> hci_dev_do_open() is called. Its not clear to me how you can end up calling
> this on resume but not prior to this on a running system. Feedback from
> someone more familiar with bt would be useful.
> 
> I'd have the call for firmware on probe, no processing would be needed, just
> a load to kick the cache into effect would suffice. This may require a bit
> of code shift so its best someone more familiar do this.
> 
> If it confirmed this information is helping avoid these races we can reconsider
> re-instating the warn as a firmware dev debugging aid for developers.
> 
> If the race this warning complained about is indeed possible the same race is
> also possible for other usermode helpers. Its *why* the UMH lock was
> implemented, it however was never generalized.

we can not load firmware on probe() in most cases. The btusb.ko driver establishes the HCI transport. It is setup in probe() and then started in hci_dev_do_open() and if there is a vendor setup routine like btsub_setup_intel_new(), then it is executed. Most Bluetooth controllers (not all, but most) are doing firmware loading over the HCI transport with vendor specific commands.

And yes, if the firmware was already loaded, we would skip requesting it at all. Which means after suspend/resume cycle where the power to the controller is cut, then we are missing the firmware from the cache. Since we never loaded it in the first place.

So yes, we would have to redo parts of the vendor specific handling to always request the firmware, even if we do not need it right now. And frankly that is not obvious to anybody. We seem to have some patches for doing exactly that, but I have not gotten to review them in detail since they deal with vendor specific complex setup handling. Also this affects more than just Intel since all hardware where firmware loading is skipped if there is already firmware loaded are affected.

Regards

Marcel

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: btusb "firmware request while host is not available" at resume
  2017-09-12  5:13                     ` Marcel Holtmann
@ 2017-09-12 16:27                       ` Luis R. Rodriguez
  -1 siblings, 0 replies; 44+ messages in thread
From: Luis R. Rodriguez @ 2017-09-12 16:27 UTC (permalink / raw)
  To: Marcel Holtmann
  Cc: Luis R. Rodriguez, Greg Kroah-Hartman, Linus Torvalds, Gabriel C,
	Tejun Heo, Gustavo F. Padovan, Sukumar Ghorai, Rafael J. Wysocki,
	Linux Kernel Mailing List,
	bluez mailin list (linux-bluetooth@vger.kernel.org)

On Tue, Sep 12, 2017 at 07:13:42AM +0200, Marcel Holtmann wrote:
> >> If something needs to be fixed, can you make a patch showing that?  Or
> >> do we also need to revert anything else as well to get back to a "better
> >> working" state?
> > 
> > I took a look at the driver and it seems that btusb_setup_intel_new() is
> > not called after the driver is initialized, rather its called only when
> > hci_dev_do_open() is called. Its not clear to me how you can end up calling
> > this on resume but not prior to this on a running system. Feedback from
> > someone more familiar with bt would be useful.
> > 
> > I'd have the call for firmware on probe, no processing would be needed, just
> > a load to kick the cache into effect would suffice. This may require a bit
> > of code shift so its best someone more familiar do this.
> > 
> > If it confirmed this information is helping avoid these races we can reconsider
> > re-instating the warn as a firmware dev debugging aid for developers.
> > 
> > If the race this warning complained about is indeed possible the same race is
> > also possible for other usermode helpers. Its *why* the UMH lock was
> > implemented, it however was never generalized.
> 
> we can not load firmware on probe() in most cases. The btusb.ko driver
> establishes the HCI transport. It is setup in probe() and then started in
> hci_dev_do_open() and if there is a vendor setup routine like
> btsub_setup_intel_new(), then it is executed. Most Bluetooth controllers (not
> all, but most) are doing firmware loading over the HCI transport with vendor
> specific commands.
> 
> And yes, if the firmware was already loaded, we would skip requesting it at
> all. Which means after suspend/resume cycle where the power to the controller
> is cut, then we are missing the firmware from the cache. Since we never
> loaded it in the first place.

I checked and prior to commit 81f95076281f ("firmware: add sanity check on
shutdown/suspend") and commit 06a45a93e7d34aa ("firmware: move umh try locks
into the umh code") I believe we could end up also failing at a firmware
request as well. Can you confirm? I see one bug report which confirms this
since v3.13:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1356076

For the sync case we had:

-               ret = usermodehelper_read_trylock();
-               if (WARN_ON(ret)) {
-                       dev_err(device, "firmware: %s will not be loaded\n",
-                               name);
-                       goto out;
-               }

The warning splat in launchpad bug 1356076 is the above case.

For the async case we had:

-               timeout = usermodehelper_read_lock_wait(timeout);
-               if (!timeout) {
-                       dev_dbg(device, "firmware: %s loading timed out\n",
-                               name);
-                       ret = -EBUSY;
-                       goto out;

This would be the more silent case.

If this is accurate than the error case has just been modified to be
a bit clearer, and without being implicated by the UMH lock. This was
the design change after all. Nothing more *broken* should have happened.

In other words if fixing your issues goes away by reverting commit
81f95076281f, to be fair you should also try reverting commit 06a45a93e7d34aa
which did the actual removal of the UMH lock when we don't use the UMH. *That*
was previously the failure trigger point.

> So yes, we would have to redo parts of the vendor specific handling to always
> request the firmware, even if we do not need it right now. And frankly that
> is not obvious to anybody.

I agree. I thought a bit about the above and tried co come to a clever resolution,
for instance one resolution could be to use the MODULE_FIRMWARE() declarations
of sorts to then let the firmware API do the pre-fetching for you at init. This
would just require a 1 line change to drivers and some already have this.
There are two issues with this though, one is that the firmware cache works with
devres and devres requires a device already present. Second is firmware names
can be very dynamic, so one can only know the firmware name later at boot.

Another issues is that most driver developers use the firmware API and without
knowing *are* creating a firmware cache entry, whereas a few times they don't
need a firmware cache entry. This is more of performance / optimization thing,
but something to consider long term as well.

> We seem to have some patches for doing exactly
> that, but I have not gotten to review them in detail since they deal with
> vendor specific complex setup handling.

Correct me if I'm wrong but some of this work very likely came from the above
old errors, not the new ones?

> Also this affects more than just Intel since all hardware where firmware
> loading is skipped if there is already firmware loaded are affected.

Right, its a core firmware API issue, but this issue has been present for a
while, it was however in different form. I'm glad we're able to think ahead
and address this now though, it would seem there are quite a bit of intrusive
changes required to use the API right, I'd hope we could help on the firmware
API front to make these changes smaller.

First, I'd like to understand a bit more how a device ends up with firmware
loaded, without having gone through the firmware API.

Second, while requesting firmware at probe seems not necessary, would it
at least be reasonable to require a struct device and a list of possible
firmware that could be used be made available? If so a separate hook could
be provided to help pre-load these only onto the firmware cache. Ie, we would
not actually look for the firmware but just create a firmware cache instance
so that when we suspend we *do* go fetch for these so that they are ready and
available upon resume.

  Luis

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: btusb "firmware request while host is not available" at resume
@ 2017-09-12 16:27                       ` Luis R. Rodriguez
  0 siblings, 0 replies; 44+ messages in thread
From: Luis R. Rodriguez @ 2017-09-12 16:27 UTC (permalink / raw)
  To: Marcel Holtmann
  Cc: Luis R. Rodriguez, Greg Kroah-Hartman, Linus Torvalds, Gabriel C,
	Tejun Heo, Gustavo F. Padovan, Sukumar Ghorai, Rafael J. Wysocki,
	Linux Kernel Mailing List,
	bluez mailin list (linux-bluetooth@vger.kernel.org)

On Tue, Sep 12, 2017 at 07:13:42AM +0200, Marcel Holtmann wrote:
> >> If something needs to be fixed, can you make a patch showing that?  Or
> >> do we also need to revert anything else as well to get back to a "better
> >> working" state?
> > 
> > I took a look at the driver and it seems that btusb_setup_intel_new() is
> > not called after the driver is initialized, rather its called only when
> > hci_dev_do_open() is called. Its not clear to me how you can end up calling
> > this on resume but not prior to this on a running system. Feedback from
> > someone more familiar with bt would be useful.
> > 
> > I'd have the call for firmware on probe, no processing would be needed, just
> > a load to kick the cache into effect would suffice. This may require a bit
> > of code shift so its best someone more familiar do this.
> > 
> > If it confirmed this information is helping avoid these races we can reconsider
> > re-instating the warn as a firmware dev debugging aid for developers.
> > 
> > If the race this warning complained about is indeed possible the same race is
> > also possible for other usermode helpers. Its *why* the UMH lock was
> > implemented, it however was never generalized.
> 
> we can not load firmware on probe() in most cases. The btusb.ko driver
> establishes the HCI transport. It is setup in probe() and then started in
> hci_dev_do_open() and if there is a vendor setup routine like
> btsub_setup_intel_new(), then it is executed. Most Bluetooth controllers (not
> all, but most) are doing firmware loading over the HCI transport with vendor
> specific commands.
> 
> And yes, if the firmware was already loaded, we would skip requesting it at
> all. Which means after suspend/resume cycle where the power to the controller
> is cut, then we are missing the firmware from the cache. Since we never
> loaded it in the first place.

I checked and prior to commit 81f95076281f ("firmware: add sanity check on
shutdown/suspend") and commit 06a45a93e7d34aa ("firmware: move umh try locks
into the umh code") I believe we could end up also failing at a firmware
request as well. Can you confirm? I see one bug report which confirms this
since v3.13:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1356076

For the sync case we had:

-               ret = usermodehelper_read_trylock();
-               if (WARN_ON(ret)) {
-                       dev_err(device, "firmware: %s will not be loaded\n",
-                               name);
-                       goto out;
-               }

The warning splat in launchpad bug 1356076 is the above case.

For the async case we had:

-               timeout = usermodehelper_read_lock_wait(timeout);
-               if (!timeout) {
-                       dev_dbg(device, "firmware: %s loading timed out\n",
-                               name);
-                       ret = -EBUSY;
-                       goto out;

This would be the more silent case.

If this is accurate than the error case has just been modified to be
a bit clearer, and without being implicated by the UMH lock. This was
the design change after all. Nothing more *broken* should have happened.

In other words if fixing your issues goes away by reverting commit
81f95076281f, to be fair you should also try reverting commit 06a45a93e7d34aa
which did the actual removal of the UMH lock when we don't use the UMH. *That*
was previously the failure trigger point.

> So yes, we would have to redo parts of the vendor specific handling to always
> request the firmware, even if we do not need it right now. And frankly that
> is not obvious to anybody.

I agree. I thought a bit about the above and tried co come to a clever resolution,
for instance one resolution could be to use the MODULE_FIRMWARE() declarations
of sorts to then let the firmware API do the pre-fetching for you at init. This
would just require a 1 line change to drivers and some already have this.
There are two issues with this though, one is that the firmware cache works with
devres and devres requires a device already present. Second is firmware names
can be very dynamic, so one can only know the firmware name later at boot.

Another issues is that most driver developers use the firmware API and without
knowing *are* creating a firmware cache entry, whereas a few times they don't
need a firmware cache entry. This is more of performance / optimization thing,
but something to consider long term as well.

> We seem to have some patches for doing exactly
> that, but I have not gotten to review them in detail since they deal with
> vendor specific complex setup handling.

Correct me if I'm wrong but some of this work very likely came from the above
old errors, not the new ones?

> Also this affects more than just Intel since all hardware where firmware
> loading is skipped if there is already firmware loaded are affected.

Right, its a core firmware API issue, but this issue has been present for a
while, it was however in different form. I'm glad we're able to think ahead
and address this now though, it would seem there are quite a bit of intrusive
changes required to use the API right, I'd hope we could help on the firmware
API front to make these changes smaller.

First, I'd like to understand a bit more how a device ends up with firmware
loaded, without having gone through the firmware API.

Second, while requesting firmware at probe seems not necessary, would it
at least be reasonable to require a struct device and a list of possible
firmware that could be used be made available? If so a separate hook could
be provided to help pre-load these only onto the firmware cache. Ie, we would
not actually look for the firmware but just create a firmware cache instance
so that when we suspend we *do* go fetch for these so that they are ready and
available upon resume.

  Luis

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: btusb "firmware request while host is not available" at resume
  2017-09-12  0:48                     ` Greg Kroah-Hartman
@ 2017-09-12 16:52                       ` Luis R. Rodriguez
  -1 siblings, 0 replies; 44+ messages in thread
From: Luis R. Rodriguez @ 2017-09-12 16:52 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Luis R. Rodriguez, Arend Van Spriel, Marcel Holtmann,
	Linus Torvalds, Gabriel C, Tejun Heo, Gustavo F. Padovan,
	Sukumar Ghorai, Rafael J. Wysocki, Linux Kernel Mailing List,
	bluez mailin list (linux-bluetooth@vger.kernel.org)

On Mon, Sep 11, 2017 at 05:48:52PM -0700, Greg Kroah-Hartman wrote:
> On Mon, Sep 11, 2017 at 10:06:51PM +0200, Luis R. Rodriguez wrote:
> > On Mon, Sep 11, 2017 at 12:29:55PM -0700, Greg Kroah-Hartman wrote:
> > > On Mon, Sep 11, 2017 at 07:11:38PM +0200, Luis R. Rodriguez wrote:
> > > > On Mon, Sep 11, 2017 at 06:46:47AM -0700, Greg Kroah-Hartman wrote:
> > > > > To confirm, reverting this fixes the problem I was seeing in 4.13.  I've
> > > > > queued it up for the next 4.13-stable release as well.
> > > > 
> > > > Commit 81f95076281f ("firmware: add sanity check on shutdown/suspend") may
> > > > seem kludgy but the reason for it was to cleanup the horrible forced and
> > > > required UMH lock even when the UMH lock was *not* even needed, which was later
> > > > removed via commit 06a45a93e7d34aa ("firmware: move umh try locks into the umh
> > > > code").
> > > 
> > > So what does this mean now that it is reverted?
> > 
> > We discuss what we should do about upkeeping a warning in the future, as
> > I think technically the warning was still valid and it could help avoid
> > racy lookups with the filesystem which otherwise could have gone unnoticed.
> 
> Given that things that used to work before the patch, and then didn't
> work after the patch was added, but then worked again after the patch
> was reverted, I don't think that adding the warning back is good, if it
> still breaks things...

As I noted to Marcel, it would seem this issue was present since before, the
warning either was changed or was more silent before. If reverting commit
81f95076281f "firmware: add sanity check on shutdown/suspend") then one should
consider reverting 06a45a93e7d34aa ("firmware: move umh try locks into the umh
code") as otherwise we end up in a situation where we loose determinism to be
sure of what will happen. From what I gather prior to these two patches we
either got a warning on sync case, or an error on async, after it we get a
warning on either and it should be clearer what the issue should be. I believe
that prior to these two commits the reason why things were failing could have
been pretty obscure. With just commit 81f95076281f reverted we then really
change the logic, and *that* was something I did want to avoid, given I
suspected that kernel was already implicitly depending on the functionality
which the UMH lock had brought upon the non-UMH code path. My suspicions seem
to have been correct.

With just commit 81f95076281f reverted, we may find firmware sometimes, but
we may also just as easily not find firmware some other times. It will depend
on the system.

> > > > Removing the old UMH lock even when the UMH lock was *not* needed was the right
> > > > thing to do but commit 81f95076281f (("firmware: add sanity check on
> > > > shutdown/suspend") was put in place as a safe guard as the lock was also
> > > > placing an implicit sanity check on the API. It ensures the API with the cache
> > > > was used as designed, otherwise you do run the risk of *not getting the
> > > > firmware you may need* -- Marcel seems to acknowledge this possibility.
> > > > 
> > > > It may be possible for us to already have in place safeguards so that upon
> > > > resume we are ensuring the path to the firmware *is* available, so IMHO we
> > > > should remove this *iff* we can provide this guarantee.  Otherwise the check is
> > > > valid. You see, although the UMH lock was bogus, it did implicitly ask the
> > > > question: is it safe for *any* helper to run and make assumptions on the
> > > > filesystem then?
> > > > 
> > > > In lieu of this question being answered the warning is valid given the design
> > > > of the firmware API and the having the cache available as a measure to resolve
> > > > this race.
> > > 
> > > I don't understand what you are trying to say here at all.
> > > 
> > > To be specific, what, if anything, is a problem with the current state
> > > of Linus's tree (and the next 4.13-stable release)?
> > 
> > The warning is issued when drivers issue *new* firmware requests on resume. The
> > firmware API cache was designed to enable drivers to easily be able to request
> > firmware on resume without concern about races against the filesystem, but in order
> > for this to work the drivers must have requested the firmware prior to suspend.
> 
> Ok, then how is this all working today?

Prior to the above *two* mentioned commits a UMH lock was used, and so similar but
different errors should have come up. With just 81f95076281f "firmware: add
sanity check on shutdown/suspend") reverted there is no suspend heuristic
check and we just let the users fend for themselves, it may or may not work
this will very depending on the system.

> I think much more is going on here, sorry, this doesn't make sense.

The UMH lock on the non UMH path was nasty to begin with. We've long depended
on it though, my two commits were an attempt to do us away from UMH lock on
the non-UMH path. It was not trivial to reach a fair middle ground.

> > > If something needs to be fixed, can you make a patch showing that?  Or
> > > do we also need to revert anything else as well to get back to a "better
> > > working" state?
> > 
> > I took a look at the driver and it seems that btusb_setup_intel_new() is
> > not called after the driver is initialized, rather its called only when
> > hci_dev_do_open() is called. Its not clear to me how you can end up calling
> > this on resume but not prior to this on a running system. Feedback from
> > someone more familiar with bt would be useful.
> 
> open is a great place to want to load firmware, that makes sense, you
> get to defer loading stuff until you actually need it.  I see nothing
> wrong here (again, especially as it _does_ work...)

I agree. I'm very well aware of drivers doing "stupid things" on probe
or init. But the issue here seems to have been present since before, I
see one report since 3.13 on the path I expect to be hit in similar
situations but without this commit on the old UMH lock path:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1356076

> > If it confirmed this information is helping avoid these races we can reconsider
> > re-instating the warn as a firmware dev debugging aid for developers.
> 
> Again, obviously this was not just a "warning" message, you changed the
> logic here.

I'm afraid the logic changes *more* with *just* commit 81f95076281f reverted.
I intentionally tried to avoid that.

> Anyway, all is good now, I guess we don't have to worry about it :)

I'll discuss with Marcel a proper long resolution, as it stands though with
just commit 81f95076281f reverted I'm afraid the situation may be worse off
long term even though it may seem things are peachy for some systems right off
the bat. We've long depended on the UMH lock path, my change was an attempt to
move us away from it and be more explicit about the issue.

  Luis

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: btusb "firmware request while host is not available" at resume
@ 2017-09-12 16:52                       ` Luis R. Rodriguez
  0 siblings, 0 replies; 44+ messages in thread
From: Luis R. Rodriguez @ 2017-09-12 16:52 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Luis R. Rodriguez, Arend Van Spriel, Marcel Holtmann,
	Linus Torvalds, Gabriel C, Tejun Heo, Gustavo F. Padovan,
	Sukumar Ghorai, Rafael J. Wysocki, Linux Kernel Mailing List,
	bluez mailin list (linux-bluetooth@vger.kernel.org)

On Mon, Sep 11, 2017 at 05:48:52PM -0700, Greg Kroah-Hartman wrote:
> On Mon, Sep 11, 2017 at 10:06:51PM +0200, Luis R. Rodriguez wrote:
> > On Mon, Sep 11, 2017 at 12:29:55PM -0700, Greg Kroah-Hartman wrote:
> > > On Mon, Sep 11, 2017 at 07:11:38PM +0200, Luis R. Rodriguez wrote:
> > > > On Mon, Sep 11, 2017 at 06:46:47AM -0700, Greg Kroah-Hartman wrote:
> > > > > To confirm, reverting this fixes the problem I was seeing in 4.13.  I've
> > > > > queued it up for the next 4.13-stable release as well.
> > > > 
> > > > Commit 81f95076281f ("firmware: add sanity check on shutdown/suspend") may
> > > > seem kludgy but the reason for it was to cleanup the horrible forced and
> > > > required UMH lock even when the UMH lock was *not* even needed, which was later
> > > > removed via commit 06a45a93e7d34aa ("firmware: move umh try locks into the umh
> > > > code").
> > > 
> > > So what does this mean now that it is reverted?
> > 
> > We discuss what we should do about upkeeping a warning in the future, as
> > I think technically the warning was still valid and it could help avoid
> > racy lookups with the filesystem which otherwise could have gone unnoticed.
> 
> Given that things that used to work before the patch, and then didn't
> work after the patch was added, but then worked again after the patch
> was reverted, I don't think that adding the warning back is good, if it
> still breaks things...

As I noted to Marcel, it would seem this issue was present since before, the
warning either was changed or was more silent before. If reverting commit
81f95076281f "firmware: add sanity check on shutdown/suspend") then one should
consider reverting 06a45a93e7d34aa ("firmware: move umh try locks into the umh
code") as otherwise we end up in a situation where we loose determinism to be
sure of what will happen. From what I gather prior to these two patches we
either got a warning on sync case, or an error on async, after it we get a
warning on either and it should be clearer what the issue should be. I believe
that prior to these two commits the reason why things were failing could have
been pretty obscure. With just commit 81f95076281f reverted we then really
change the logic, and *that* was something I did want to avoid, given I
suspected that kernel was already implicitly depending on the functionality
which the UMH lock had brought upon the non-UMH code path. My suspicions seem
to have been correct.

With just commit 81f95076281f reverted, we may find firmware sometimes, but
we may also just as easily not find firmware some other times. It will depend
on the system.

> > > > Removing the old UMH lock even when the UMH lock was *not* needed was the right
> > > > thing to do but commit 81f95076281f (("firmware: add sanity check on
> > > > shutdown/suspend") was put in place as a safe guard as the lock was also
> > > > placing an implicit sanity check on the API. It ensures the API with the cache
> > > > was used as designed, otherwise you do run the risk of *not getting the
> > > > firmware you may need* -- Marcel seems to acknowledge this possibility.
> > > > 
> > > > It may be possible for us to already have in place safeguards so that upon
> > > > resume we are ensuring the path to the firmware *is* available, so IMHO we
> > > > should remove this *iff* we can provide this guarantee.  Otherwise the check is
> > > > valid. You see, although the UMH lock was bogus, it did implicitly ask the
> > > > question: is it safe for *any* helper to run and make assumptions on the
> > > > filesystem then?
> > > > 
> > > > In lieu of this question being answered the warning is valid given the design
> > > > of the firmware API and the having the cache available as a measure to resolve
> > > > this race.
> > > 
> > > I don't understand what you are trying to say here at all.
> > > 
> > > To be specific, what, if anything, is a problem with the current state
> > > of Linus's tree (and the next 4.13-stable release)?
> > 
> > The warning is issued when drivers issue *new* firmware requests on resume. The
> > firmware API cache was designed to enable drivers to easily be able to request
> > firmware on resume without concern about races against the filesystem, but in order
> > for this to work the drivers must have requested the firmware prior to suspend.
> 
> Ok, then how is this all working today?

Prior to the above *two* mentioned commits a UMH lock was used, and so similar but
different errors should have come up. With just 81f95076281f "firmware: add
sanity check on shutdown/suspend") reverted there is no suspend heuristic
check and we just let the users fend for themselves, it may or may not work
this will very depending on the system.

> I think much more is going on here, sorry, this doesn't make sense.

The UMH lock on the non UMH path was nasty to begin with. We've long depended
on it though, my two commits were an attempt to do us away from UMH lock on
the non-UMH path. It was not trivial to reach a fair middle ground.

> > > If something needs to be fixed, can you make a patch showing that?  Or
> > > do we also need to revert anything else as well to get back to a "better
> > > working" state?
> > 
> > I took a look at the driver and it seems that btusb_setup_intel_new() is
> > not called after the driver is initialized, rather its called only when
> > hci_dev_do_open() is called. Its not clear to me how you can end up calling
> > this on resume but not prior to this on a running system. Feedback from
> > someone more familiar with bt would be useful.
> 
> open is a great place to want to load firmware, that makes sense, you
> get to defer loading stuff until you actually need it.  I see nothing
> wrong here (again, especially as it _does_ work...)

I agree. I'm very well aware of drivers doing "stupid things" on probe
or init. But the issue here seems to have been present since before, I
see one report since 3.13 on the path I expect to be hit in similar
situations but without this commit on the old UMH lock path:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1356076

> > If it confirmed this information is helping avoid these races we can reconsider
> > re-instating the warn as a firmware dev debugging aid for developers.
> 
> Again, obviously this was not just a "warning" message, you changed the
> logic here.

I'm afraid the logic changes *more* with *just* commit 81f95076281f reverted.
I intentionally tried to avoid that.

> Anyway, all is good now, I guess we don't have to worry about it :)

I'll discuss with Marcel a proper long resolution, as it stands though with
just commit 81f95076281f reverted I'm afraid the situation may be worse off
long term even though it may seem things are peachy for some systems right off
the bat. We've long depended on the UMH lock path, my change was an attempt to
move us away from it and be more explicit about the issue.

  Luis

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: btusb "firmware request while host is not available" at resume
  2017-09-12 16:27                       ` Luis R. Rodriguez
@ 2017-09-13  6:52                         ` Marcel Holtmann
  -1 siblings, 0 replies; 44+ messages in thread
From: Marcel Holtmann @ 2017-09-13  6:52 UTC (permalink / raw)
  To: Luis R. Rodriguez
  Cc: Greg Kroah-Hartman, Linus Torvalds, Gabriel C, Tejun Heo,
	Gustavo F. Padovan, Sukumar Ghorai, Rafael J. Wysocki,
	Linux Kernel Mailing List,
	bluez mailin list (linux-bluetooth@vger.kernel.org)

Hi Luis,

>>>> If something needs to be fixed, can you make a patch showing that?  Or
>>>> do we also need to revert anything else as well to get back to a "better
>>>> working" state?
>>> 
>>> I took a look at the driver and it seems that btusb_setup_intel_new() is
>>> not called after the driver is initialized, rather its called only when
>>> hci_dev_do_open() is called. Its not clear to me how you can end up calling
>>> this on resume but not prior to this on a running system. Feedback from
>>> someone more familiar with bt would be useful.
>>> 
>>> I'd have the call for firmware on probe, no processing would be needed, just
>>> a load to kick the cache into effect would suffice. This may require a bit
>>> of code shift so its best someone more familiar do this.
>>> 
>>> If it confirmed this information is helping avoid these races we can reconsider
>>> re-instating the warn as a firmware dev debugging aid for developers.
>>> 
>>> If the race this warning complained about is indeed possible the same race is
>>> also possible for other usermode helpers. Its *why* the UMH lock was
>>> implemented, it however was never generalized.
>> 
>> we can not load firmware on probe() in most cases. The btusb.ko driver
>> establishes the HCI transport. It is setup in probe() and then started in
>> hci_dev_do_open() and if there is a vendor setup routine like
>> btsub_setup_intel_new(), then it is executed. Most Bluetooth controllers (not
>> all, but most) are doing firmware loading over the HCI transport with vendor
>> specific commands.
>> 
>> And yes, if the firmware was already loaded, we would skip requesting it at
>> all. Which means after suspend/resume cycle where the power to the controller
>> is cut, then we are missing the firmware from the cache. Since we never
>> loaded it in the first place.
> 
> I checked and prior to commit 81f95076281f ("firmware: add sanity check on
> shutdown/suspend") and commit 06a45a93e7d34aa ("firmware: move umh try locks
> into the umh code") I believe we could end up also failing at a firmware
> request as well. Can you confirm? I see one bug report which confirms this
> since v3.13:
> 
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1356076
> 
> For the sync case we had:
> 
> -               ret = usermodehelper_read_trylock();
> -               if (WARN_ON(ret)) {
> -                       dev_err(device, "firmware: %s will not be loaded\n",
> -                               name);
> -                       goto out;
> -               }
> 
> The warning splat in launchpad bug 1356076 is the above case.
> 
> For the async case we had:
> 
> -               timeout = usermodehelper_read_lock_wait(timeout);
> -               if (!timeout) {
> -                       dev_dbg(device, "firmware: %s loading timed out\n",
> -                               name);
> -                       ret = -EBUSY;
> -                       goto out;
> 
> This would be the more silent case.
> 
> If this is accurate than the error case has just been modified to be
> a bit clearer, and without being implicated by the UMH lock. This was
> the design change after all. Nothing more *broken* should have happened.
> 
> In other words if fixing your issues goes away by reverting commit
> 81f95076281f, to be fair you should also try reverting commit 06a45a93e7d34aa
> which did the actual removal of the UMH lock when we don't use the UMH. *That*
> was previously the failure trigger point.
> 
>> So yes, we would have to redo parts of the vendor specific handling to always
>> request the firmware, even if we do not need it right now. And frankly that
>> is not obvious to anybody.
> 
> I agree. I thought a bit about the above and tried co come to a clever resolution,
> for instance one resolution could be to use the MODULE_FIRMWARE() declarations
> of sorts to then let the firmware API do the pre-fetching for you at init. This
> would just require a 1 line change to drivers and some already have this.
> There are two issues with this though, one is that the firmware cache works with
> devres and devres requires a device already present. Second is firmware names
> can be very dynamic, so one can only know the firmware name later at boot.

I don’t like that idea since we have drivers like btusb.ko that support devices from many manufactures. So the only difference is the firmware loading and setup. After that they behave all the same. So pre-loading all Bluetooth firmwares is pretty crazy.

> Another issues is that most driver developers use the firmware API and without
> knowing *are* creating a firmware cache entry, whereas a few times they don't
> need a firmware cache entry. This is more of performance / optimization thing,
> but something to consider long term as well.
> 
>> We seem to have some patches for doing exactly
>> that, but I have not gotten to review them in detail since they deal with
>> vendor specific complex setup handling.
> 
> Correct me if I'm wrong but some of this work very likely came from the above
> old errors, not the new ones?
> 
>> Also this affects more than just Intel since all hardware where firmware
>> loading is skipped if there is already firmware loaded are affected.
> 
> Right, its a core firmware API issue, but this issue has been present for a
> while, it was however in different form. I'm glad we're able to think ahead
> and address this now though, it would seem there are quite a bit of intrusive
> changes required to use the API right, I'd hope we could help on the firmware
> API front to make these changes smaller.
> 
> First, I'd like to understand a bit more how a device ends up with firmware
> loaded, without having gone through the firmware API.
> 
> Second, while requesting firmware at probe seems not necessary, would it
> at least be reasonable to require a struct device and a list of possible
> firmware that could be used be made available? If so a separate hook could
> be provided to help pre-load these only onto the firmware cache. Ie, we would
> not actually look for the firmware but just create a firmware cache instance
> so that when we suspend we *do* go fetch for these so that they are ready and
> available upon resume.

The devices come with boatloader mode and operational mode and both are independent. And most of them only fallback when fully power cycling the device. So a simple boot linux and reboot linux would get you into the situation that the firmware is already loaded. There are also some UEFI bioses that can load the firmware. So there are multitudes of possibilities where we can end up with a firmware loaded.

Regards

Marcel

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: btusb "firmware request while host is not available" at resume
@ 2017-09-13  6:52                         ` Marcel Holtmann
  0 siblings, 0 replies; 44+ messages in thread
From: Marcel Holtmann @ 2017-09-13  6:52 UTC (permalink / raw)
  To: Luis R. Rodriguez
  Cc: Greg Kroah-Hartman, Linus Torvalds, Gabriel C, Tejun Heo,
	Gustavo F. Padovan, Sukumar Ghorai, Rafael J. Wysocki,
	Linux Kernel Mailing List,
	bluez mailin list (linux-bluetooth@vger.kernel.org)

Hi Luis,

>>>> If something needs to be fixed, can you make a patch showing that?  Or
>>>> do we also need to revert anything else as well to get back to a "better
>>>> working" state?
>>> 
>>> I took a look at the driver and it seems that btusb_setup_intel_new() is
>>> not called after the driver is initialized, rather its called only when
>>> hci_dev_do_open() is called. Its not clear to me how you can end up calling
>>> this on resume but not prior to this on a running system. Feedback from
>>> someone more familiar with bt would be useful.
>>> 
>>> I'd have the call for firmware on probe, no processing would be needed, just
>>> a load to kick the cache into effect would suffice. This may require a bit
>>> of code shift so its best someone more familiar do this.
>>> 
>>> If it confirmed this information is helping avoid these races we can reconsider
>>> re-instating the warn as a firmware dev debugging aid for developers.
>>> 
>>> If the race this warning complained about is indeed possible the same race is
>>> also possible for other usermode helpers. Its *why* the UMH lock was
>>> implemented, it however was never generalized.
>> 
>> we can not load firmware on probe() in most cases. The btusb.ko driver
>> establishes the HCI transport. It is setup in probe() and then started in
>> hci_dev_do_open() and if there is a vendor setup routine like
>> btsub_setup_intel_new(), then it is executed. Most Bluetooth controllers (not
>> all, but most) are doing firmware loading over the HCI transport with vendor
>> specific commands.
>> 
>> And yes, if the firmware was already loaded, we would skip requesting it at
>> all. Which means after suspend/resume cycle where the power to the controller
>> is cut, then we are missing the firmware from the cache. Since we never
>> loaded it in the first place.
> 
> I checked and prior to commit 81f95076281f ("firmware: add sanity check on
> shutdown/suspend") and commit 06a45a93e7d34aa ("firmware: move umh try locks
> into the umh code") I believe we could end up also failing at a firmware
> request as well. Can you confirm? I see one bug report which confirms this
> since v3.13:
> 
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1356076
> 
> For the sync case we had:
> 
> -               ret = usermodehelper_read_trylock();
> -               if (WARN_ON(ret)) {
> -                       dev_err(device, "firmware: %s will not be loaded\n",
> -                               name);
> -                       goto out;
> -               }
> 
> The warning splat in launchpad bug 1356076 is the above case.
> 
> For the async case we had:
> 
> -               timeout = usermodehelper_read_lock_wait(timeout);
> -               if (!timeout) {
> -                       dev_dbg(device, "firmware: %s loading timed out\n",
> -                               name);
> -                       ret = -EBUSY;
> -                       goto out;
> 
> This would be the more silent case.
> 
> If this is accurate than the error case has just been modified to be
> a bit clearer, and without being implicated by the UMH lock. This was
> the design change after all. Nothing more *broken* should have happened.
> 
> In other words if fixing your issues goes away by reverting commit
> 81f95076281f, to be fair you should also try reverting commit 06a45a93e7d34aa
> which did the actual removal of the UMH lock when we don't use the UMH. *That*
> was previously the failure trigger point.
> 
>> So yes, we would have to redo parts of the vendor specific handling to always
>> request the firmware, even if we do not need it right now. And frankly that
>> is not obvious to anybody.
> 
> I agree. I thought a bit about the above and tried co come to a clever resolution,
> for instance one resolution could be to use the MODULE_FIRMWARE() declarations
> of sorts to then let the firmware API do the pre-fetching for you at init. This
> would just require a 1 line change to drivers and some already have this.
> There are two issues with this though, one is that the firmware cache works with
> devres and devres requires a device already present. Second is firmware names
> can be very dynamic, so one can only know the firmware name later at boot.

I don’t like that idea since we have drivers like btusb.ko that support devices from many manufactures. So the only difference is the firmware loading and setup. After that they behave all the same. So pre-loading all Bluetooth firmwares is pretty crazy.

> Another issues is that most driver developers use the firmware API and without
> knowing *are* creating a firmware cache entry, whereas a few times they don't
> need a firmware cache entry. This is more of performance / optimization thing,
> but something to consider long term as well.
> 
>> We seem to have some patches for doing exactly
>> that, but I have not gotten to review them in detail since they deal with
>> vendor specific complex setup handling.
> 
> Correct me if I'm wrong but some of this work very likely came from the above
> old errors, not the new ones?
> 
>> Also this affects more than just Intel since all hardware where firmware
>> loading is skipped if there is already firmware loaded are affected.
> 
> Right, its a core firmware API issue, but this issue has been present for a
> while, it was however in different form. I'm glad we're able to think ahead
> and address this now though, it would seem there are quite a bit of intrusive
> changes required to use the API right, I'd hope we could help on the firmware
> API front to make these changes smaller.
> 
> First, I'd like to understand a bit more how a device ends up with firmware
> loaded, without having gone through the firmware API.
> 
> Second, while requesting firmware at probe seems not necessary, would it
> at least be reasonable to require a struct device and a list of possible
> firmware that could be used be made available? If so a separate hook could
> be provided to help pre-load these only onto the firmware cache. Ie, we would
> not actually look for the firmware but just create a firmware cache instance
> so that when we suspend we *do* go fetch for these so that they are ready and
> available upon resume.

The devices come with boatloader mode and operational mode and both are independent. And most of them only fallback when fully power cycling the device. So a simple boot linux and reboot linux would get you into the situation that the firmware is already loaded. There are also some UEFI bioses that can load the firmware. So there are multitudes of possibilities where we can end up with a firmware loaded.

Regards

Marcel

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: btusb "firmware request while host is not available" at resume
  2017-09-13  6:52                         ` Marcel Holtmann
@ 2017-09-13 17:39                           ` Luis R. Rodriguez
  -1 siblings, 0 replies; 44+ messages in thread
From: Luis R. Rodriguez @ 2017-09-13 17:39 UTC (permalink / raw)
  To: Marcel Holtmann
  Cc: Luis R. Rodriguez, Greg Kroah-Hartman, Linus Torvalds, Gabriel C,
	Tejun Heo, Gustavo F. Padovan, Sukumar Ghorai, Rafael J. Wysocki,
	Linux Kernel Mailing List,
	bluez mailin list (linux-bluetooth@vger.kernel.org)

On Wed, Sep 13, 2017 at 08:52:15AM +0200, Marcel Holtmann wrote:
> > I checked and prior to commit 81f95076281f ("firmware: add sanity check on
> > shutdown/suspend") and commit 06a45a93e7d34aa ("firmware: move umh try locks
> > into the umh code") I believe we could end up also failing at a firmware
> > request as well. Can you confirm? I see one bug report which confirms this
> > since v3.13:
> > 
> > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1356076

Have you seen these sorts of issues before BTW?

> >> So yes, we would have to redo parts of the vendor specific handling to always
> >> request the firmware, even if we do not need it right now. And frankly that
> >> is not obvious to anybody.
> > 
> > I agree. I thought a bit about the above and tried co come to a clever resolution,
> > for instance one resolution could be to use the MODULE_FIRMWARE() declarations
> > of sorts to then let the firmware API do the pre-fetching for you at init. This
> > would just require a 1 line change to drivers and some already have this.
> > There are two issues with this though, one is that the firmware cache works with
> > devres and devres requires a device already present. Second is firmware names
> > can be very dynamic, so one can only know the firmware name later at boot.
> 
> I don’t like that idea since we have drivers like btusb.ko that support
> devices from many manufactures. So the only difference is the firmware
> loading and setup. After that they behave all the same. So pre-loading all
> Bluetooth firmwares is pretty crazy.

Makes sense. My point was also to explain how the dynamic aspect of the strings
for firmware make that approach not practical.

> > Another issues is that most driver developers use the firmware API and without
> > knowing *are* creating a firmware cache entry, whereas a few times they don't
> > need a firmware cache entry. This is more of performance / optimization thing,
> > but something to consider long term as well.
> > 
> >> We seem to have some patches for doing exactly
> >> that, but I have not gotten to review them in detail since they deal with
> >> vendor specific complex setup handling.

Do you happen to know if some of this code implemented because of the obscure
and perhaps random issues as noted in the above bug URL. If you are not sure,
who can I ask to see what the original motivation was?

> > Correct me if I'm wrong but some of this work very likely came from the above
> > old errors, not the new ones?
> > 
> >> Also this affects more than just Intel since all hardware where firmware
> >> loading is skipped if there is already firmware loaded are affected.
> > 
> > Right, its a core firmware API issue, but this issue has been present for a
> > while, it was however in different form. I'm glad we're able to think ahead
> > and address this now though, it would seem there are quite a bit of intrusive
> > changes required to use the API right, I'd hope we could help on the firmware
> > API front to make these changes smaller.
> > 
> > First, I'd like to understand a bit more how a device ends up with firmware
> > loaded, without having gone through the firmware API.
> > 
> > Second, while requesting firmware at probe seems not necessary, would it
> > at least be reasonable to require a struct device and a list of possible
> > firmware that could be used be made available? If so a separate hook could
> > be provided to help pre-load these only onto the firmware cache. Ie, we would
> > not actually look for the firmware but just create a firmware cache instance
> > so that when we suspend we *do* go fetch for these so that they are ready and
> > available upon resume.
> 
> The devices come with boatloader mode and operational mode and both are
> independent. And most of them only fallback when fully power cycling the
> device. So a simple boot linux and reboot linux would get you into the
> situation that the firmware is already loaded. There are also some UEFI
> bioses that can load the firmware. So there are multitudes of possibilities
> where we can end up with a firmware loaded.

I see thanks.

Would a fw_cache_hint(device, name_list) be reasonable then sometime *before*
suspend? All this would do is ask the firmware API to extend the fw cache
list with the entries. It would not load firmware immediately, instead this
would trigger a request for the hinted firmware to become available for
suspend. Then these drivers could request for firmware at resume and it
will pick up the cached firmware.

  Luis

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: btusb "firmware request while host is not available" at resume
@ 2017-09-13 17:39                           ` Luis R. Rodriguez
  0 siblings, 0 replies; 44+ messages in thread
From: Luis R. Rodriguez @ 2017-09-13 17:39 UTC (permalink / raw)
  To: Marcel Holtmann
  Cc: Luis R. Rodriguez, Greg Kroah-Hartman, Linus Torvalds, Gabriel C,
	Tejun Heo, Gustavo F. Padovan, Sukumar Ghorai, Rafael J. Wysocki,
	Linux Kernel Mailing List,
	bluez mailin list (linux-bluetooth@vger.kernel.org)

On Wed, Sep 13, 2017 at 08:52:15AM +0200, Marcel Holtmann wrote:
> > I checked and prior to commit 81f95076281f ("firmware: add sanity check on
> > shutdown/suspend") and commit 06a45a93e7d34aa ("firmware: move umh try locks
> > into the umh code") I believe we could end up also failing at a firmware
> > request as well. Can you confirm? I see one bug report which confirms this
> > since v3.13:
> > 
> > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1356076

Have you seen these sorts of issues before BTW?

> >> So yes, we would have to redo parts of the vendor specific handling to always
> >> request the firmware, even if we do not need it right now. And frankly that
> >> is not obvious to anybody.
> > 
> > I agree. I thought a bit about the above and tried co come to a clever resolution,
> > for instance one resolution could be to use the MODULE_FIRMWARE() declarations
> > of sorts to then let the firmware API do the pre-fetching for you at init. This
> > would just require a 1 line change to drivers and some already have this.
> > There are two issues with this though, one is that the firmware cache works with
> > devres and devres requires a device already present. Second is firmware names
> > can be very dynamic, so one can only know the firmware name later at boot.
> 
> I don’t like that idea since we have drivers like btusb.ko that support
> devices from many manufactures. So the only difference is the firmware
> loading and setup. After that they behave all the same. So pre-loading all
> Bluetooth firmwares is pretty crazy.

Makes sense. My point was also to explain how the dynamic aspect of the strings
for firmware make that approach not practical.

> > Another issues is that most driver developers use the firmware API and without
> > knowing *are* creating a firmware cache entry, whereas a few times they don't
> > need a firmware cache entry. This is more of performance / optimization thing,
> > but something to consider long term as well.
> > 
> >> We seem to have some patches for doing exactly
> >> that, but I have not gotten to review them in detail since they deal with
> >> vendor specific complex setup handling.

Do you happen to know if some of this code implemented because of the obscure
and perhaps random issues as noted in the above bug URL. If you are not sure,
who can I ask to see what the original motivation was?

> > Correct me if I'm wrong but some of this work very likely came from the above
> > old errors, not the new ones?
> > 
> >> Also this affects more than just Intel since all hardware where firmware
> >> loading is skipped if there is already firmware loaded are affected.
> > 
> > Right, its a core firmware API issue, but this issue has been present for a
> > while, it was however in different form. I'm glad we're able to think ahead
> > and address this now though, it would seem there are quite a bit of intrusive
> > changes required to use the API right, I'd hope we could help on the firmware
> > API front to make these changes smaller.
> > 
> > First, I'd like to understand a bit more how a device ends up with firmware
> > loaded, without having gone through the firmware API.
> > 
> > Second, while requesting firmware at probe seems not necessary, would it
> > at least be reasonable to require a struct device and a list of possible
> > firmware that could be used be made available? If so a separate hook could
> > be provided to help pre-load these only onto the firmware cache. Ie, we would
> > not actually look for the firmware but just create a firmware cache instance
> > so that when we suspend we *do* go fetch for these so that they are ready and
> > available upon resume.
> 
> The devices come with boatloader mode and operational mode and both are
> independent. And most of them only fallback when fully power cycling the
> device. So a simple boot linux and reboot linux would get you into the
> situation that the firmware is already loaded. There are also some UEFI
> bioses that can load the firmware. So there are multitudes of possibilities
> where we can end up with a firmware loaded.

I see thanks.

Would a fw_cache_hint(device, name_list) be reasonable then sometime *before*
suspend? All this would do is ask the firmware API to extend the fw cache
list with the entries. It would not load firmware immediately, instead this
would trigger a request for the hinted firmware to become available for
suspend. Then these drivers could request for firmware at resume and it
will pick up the cached firmware.

  Luis

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: btusb "firmware request while host is not available" at resume
  2017-09-13 17:39                           ` Luis R. Rodriguez
@ 2017-10-02  8:34                             ` Kai-Heng Feng
  -1 siblings, 0 replies; 44+ messages in thread
From: Kai-Heng Feng @ 2017-10-02  8:34 UTC (permalink / raw)
  To: Luis R. Rodriguez
  Cc: Marcel Holtmann, Greg Kroah-Hartman, Linus Torvalds, Gabriel C,
	Tejun Heo, Gustavo F. Padovan, Sukumar Ghorai, Rafael J. Wysocki,
	Linux Kernel Mailing List,
	bluez mailin list (linux-bluetooth@vger.kernel.org)

Hi Luis,

On Thu, Sep 14, 2017 at 1:39 AM, Luis R. Rodriguez <mcgrof@kernel.org> wrote:
[snipped]
> Would a fw_cache_hint(device, name_list) be reasonable then sometime *before*
> suspend? All this would do is ask the firmware API to extend the fw cache
> list with the entries. It would not load firmware immediately, instead this
> would trigger a request for the hinted firmware to become available for
> suspend. Then these drivers could request for firmware at resume and it
> will pick up the cached firmware.

I think I am the author the patch [1] mentioned by Marcel. I have to
admit, it's quite
clunky in it's current form.
So yes, the new API you proposed is definitely better to solve the
issue. I'll send
new patch for btusb once we have the new API to use.

Also, with patch [1], the "firmware request while host is not
available" may still get hit
on some corner cases. I proposed another patch [2] to tackle the edge
case, can you
take a look?

[1] https://lkml.org/lkml/2017/8/25/58
[2] https://lkml.org/lkml/2017/8/22/123

Kai-Heng

>
>   Luis
> --
> To unsubscribe from this list: send the line "unsubscribe linux-bluetooth" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: btusb "firmware request while host is not available" at resume
@ 2017-10-02  8:34                             ` Kai-Heng Feng
  0 siblings, 0 replies; 44+ messages in thread
From: Kai-Heng Feng @ 2017-10-02  8:34 UTC (permalink / raw)
  To: Luis R. Rodriguez
  Cc: Marcel Holtmann, Greg Kroah-Hartman, Linus Torvalds, Gabriel C,
	Tejun Heo, Gustavo F. Padovan, Sukumar Ghorai, Rafael J. Wysocki,
	Linux Kernel Mailing List,
	bluez mailin list (linux-bluetooth@vger.kernel.org)

Hi Luis,

On Thu, Sep 14, 2017 at 1:39 AM, Luis R. Rodriguez <mcgrof@kernel.org> wrote:
[snipped]
> Would a fw_cache_hint(device, name_list) be reasonable then sometime *before*
> suspend? All this would do is ask the firmware API to extend the fw cache
> list with the entries. It would not load firmware immediately, instead this
> would trigger a request for the hinted firmware to become available for
> suspend. Then these drivers could request for firmware at resume and it
> will pick up the cached firmware.

I think I am the author the patch [1] mentioned by Marcel. I have to
admit, it's quite
clunky in it's current form.
So yes, the new API you proposed is definitely better to solve the
issue. I'll send
new patch for btusb once we have the new API to use.

Also, with patch [1], the "firmware request while host is not
available" may still get hit
on some corner cases. I proposed another patch [2] to tackle the edge
case, can you
take a look?

[1] https://lkml.org/lkml/2017/8/25/58
[2] https://lkml.org/lkml/2017/8/22/123

Kai-Heng

>
>   Luis
> --
> To unsubscribe from this list: send the line "unsubscribe linux-bluetooth" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: btusb "firmware request while host is not available" at resume
  2017-10-02  8:34                             ` Kai-Heng Feng
@ 2017-10-04  0:20                               ` Luis R. Rodriguez
  -1 siblings, 0 replies; 44+ messages in thread
From: Luis R. Rodriguez @ 2017-10-04  0:20 UTC (permalink / raw)
  To: Kai-Heng Feng
  Cc: Luis R. Rodriguez, Marcel Holtmann, Greg Kroah-Hartman,
	Linus Torvalds, Gabriel C, Tejun Heo, Gustavo F. Padovan,
	Sukumar Ghorai, Rafael J. Wysocki, Linux Kernel Mailing List,
	bluez mailin list (linux-bluetooth@vger.kernel.org)

On Mon, Oct 02, 2017 at 04:34:11PM +0800, Kai-Heng Feng wrote:
> Hi Luis,
> 
> On Thu, Sep 14, 2017 at 1:39 AM, Luis R. Rodriguez <mcgrof@kernel.org> wrote:
> [snipped]
> > Would a fw_cache_hint(device, name_list) be reasonable then sometime *before*
> > suspend? All this would do is ask the firmware API to extend the fw cache
> > list with the entries. It would not load firmware immediately, instead this
> > would trigger a request for the hinted firmware to become available for
> > suspend. Then these drivers could request for firmware at resume and it
> > will pick up the cached firmware.
> 
> I think I am the author the patch [1] mentioned by Marcel. I have to admit,
> it's quite clunky in it's current form.  So yes, the new API you proposed is
> definitely better to solve the issue. I'll send new patch for btusb once we
> have the new API to use.

Note that Linus' recent revert to ensure we can keep the old behaviour
of failing to fetch firmware if we're on our way to suspend means that if
you use direct filesystem firmware fetch today on upstream kernels
you will get an attempt to actually look for the file. That is the old
failures you used to see should be gone.

Although at first I was cautious to avoid moving in this direction it was
the right direction to move in anyway long term. The old mecehanism was
put in place to avoid having the UMH call out given we are freezing
userspace and we'd deadlock as the userspace helper would already be
frozen. The UMH helpers *used* to be our default firmware fetchter,
but the API grew to just do a direct filesystem lookup.

If there are issues or races with suspend / resume those need to be
addressed at an ordering level at the core of the suspend framework.
I just actually proposed *how* to deal with proper filesystem suspend
races today [0], the ordering devised currently there is:

  o device driver pm ops called
  o notifier for suspend issued - *going to suspend*
  o userspace frozen
  o filesystem freeze

On the way back up this order is inverted:

  o filesystem freeze
  o userspace frozen
  o notifier for suspend issued - *going to suspend*
  o device driver pm ops called

With this order in place filesystems should in theory be available
on resume ASAP, any issues that could creep up implicate having
to address and review the above ordering.

As-is upstream though, things should be fine and the race you used to
see should no longer be possible. This does mean older kernels don't
have these fixes, as they were not intended to be fixes. The patch
Linus reverted was the patch I kept purposely to *avoid* us having
to move light years forward in one release, but alas, we've moved
forward. So the way to look at it, in the end, is the work to move
the UMH lock outside of direct FS lookups fixed this issue now.

[0] https://lkml.kernel.org/r/20171003185313.1017-1-mcgrof@kernel.org

> Also, with patch [1], the "firmware request while host is not
> available" may still get hit
> on some corner cases. I proposed another patch [2] to tackle the edge
> case, can you
> take a look?
> 
> [1] https://lkml.org/lkml/2017/8/25/58
> [2] https://lkml.org/lkml/2017/8/22/123

All this crap should not be needed anymore as the cause of the issue,
the UMH lock on direct-fs-lookups, is long gone on direct-fs lookups.

The only case and kernel config that will run into this issue now
is those calls that want to skip the direct-fs-lookup first, and
there are only two stupid device drivers upstream that do this:
dell rbu and some LED crap driver.

So please retest now that the revert happened: commit
f007cad159e99fa2acd3b2e9364fbb32ad28b971 ("Revert "firmware: add sanity check
on shutdown/suspend").

  Luis

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: btusb "firmware request while host is not available" at resume
@ 2017-10-04  0:20                               ` Luis R. Rodriguez
  0 siblings, 0 replies; 44+ messages in thread
From: Luis R. Rodriguez @ 2017-10-04  0:20 UTC (permalink / raw)
  To: Kai-Heng Feng
  Cc: Luis R. Rodriguez, Marcel Holtmann, Greg Kroah-Hartman,
	Linus Torvalds, Gabriel C, Tejun Heo, Gustavo F. Padovan,
	Sukumar Ghorai, Rafael J. Wysocki, Linux Kernel Mailing List,
	bluez mailin list (linux-bluetooth@vger.kernel.org)

On Mon, Oct 02, 2017 at 04:34:11PM +0800, Kai-Heng Feng wrote:
> Hi Luis,
> 
> On Thu, Sep 14, 2017 at 1:39 AM, Luis R. Rodriguez <mcgrof@kernel.org> wrote:
> [snipped]
> > Would a fw_cache_hint(device, name_list) be reasonable then sometime *before*
> > suspend? All this would do is ask the firmware API to extend the fw cache
> > list with the entries. It would not load firmware immediately, instead this
> > would trigger a request for the hinted firmware to become available for
> > suspend. Then these drivers could request for firmware at resume and it
> > will pick up the cached firmware.
> 
> I think I am the author the patch [1] mentioned by Marcel. I have to admit,
> it's quite clunky in it's current form.  So yes, the new API you proposed is
> definitely better to solve the issue. I'll send new patch for btusb once we
> have the new API to use.

Note that Linus' recent revert to ensure we can keep the old behaviour
of failing to fetch firmware if we're on our way to suspend means that if
you use direct filesystem firmware fetch today on upstream kernels
you will get an attempt to actually look for the file. That is the old
failures you used to see should be gone.

Although at first I was cautious to avoid moving in this direction it was
the right direction to move in anyway long term. The old mecehanism was
put in place to avoid having the UMH call out given we are freezing
userspace and we'd deadlock as the userspace helper would already be
frozen. The UMH helpers *used* to be our default firmware fetchter,
but the API grew to just do a direct filesystem lookup.

If there are issues or races with suspend / resume those need to be
addressed at an ordering level at the core of the suspend framework.
I just actually proposed *how* to deal with proper filesystem suspend
races today [0], the ordering devised currently there is:

  o device driver pm ops called
  o notifier for suspend issued - *going to suspend*
  o userspace frozen
  o filesystem freeze

On the way back up this order is inverted:

  o filesystem freeze
  o userspace frozen
  o notifier for suspend issued - *going to suspend*
  o device driver pm ops called

With this order in place filesystems should in theory be available
on resume ASAP, any issues that could creep up implicate having
to address and review the above ordering.

As-is upstream though, things should be fine and the race you used to
see should no longer be possible. This does mean older kernels don't
have these fixes, as they were not intended to be fixes. The patch
Linus reverted was the patch I kept purposely to *avoid* us having
to move light years forward in one release, but alas, we've moved
forward. So the way to look at it, in the end, is the work to move
the UMH lock outside of direct FS lookups fixed this issue now.

[0] https://lkml.kernel.org/r/20171003185313.1017-1-mcgrof@kernel.org

> Also, with patch [1], the "firmware request while host is not
> available" may still get hit
> on some corner cases. I proposed another patch [2] to tackle the edge
> case, can you
> take a look?
> 
> [1] https://lkml.org/lkml/2017/8/25/58
> [2] https://lkml.org/lkml/2017/8/22/123

All this crap should not be needed anymore as the cause of the issue,
the UMH lock on direct-fs-lookups, is long gone on direct-fs lookups.

The only case and kernel config that will run into this issue now
is those calls that want to skip the direct-fs-lookup first, and
there are only two stupid device drivers upstream that do this:
dell rbu and some LED crap driver.

So please retest now that the revert happened: commit
f007cad159e99fa2acd3b2e9364fbb32ad28b971 ("Revert "firmware: add sanity check
on shutdown/suspend").

  Luis

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: btusb "firmware request while host is not available" at resume
  2017-10-04  0:20                               ` Luis R. Rodriguez
@ 2017-10-04  1:21                                 ` Luis R. Rodriguez
  -1 siblings, 0 replies; 44+ messages in thread
From: Luis R. Rodriguez @ 2017-10-04  1:21 UTC (permalink / raw)
  To: Kai-Heng Feng
  Cc: Luis R. Rodriguez, Marcel Holtmann, Greg Kroah-Hartman,
	Linus Torvalds, Gabriel C, Tejun Heo, Gustavo F. Padovan,
	Sukumar Ghorai, Rafael J. Wysocki, Linux Kernel Mailing List,
	bluez mailin list (linux-bluetooth@vger.kernel.org)

On Tue, Oct 3, 2017 at 5:20 PM, Luis R. Rodriguez <mcgrof@kernel.org> wrote:
> the ordering devised currently there is:
>
>   o device driver pm ops called
>   o notifier for suspend issued - *going to suspend*
>   o userspace frozen
>   o filesystem freeze
>
> On the way back up this order is inverted:
>
>   o filesystem freeze
>   o userspace frozen
>   o notifier for suspend issued - *going to suspend*
>   o device driver pm ops called

Fortunately I had it a tad bit wrong, but in a good way. Our ordering
on our way down is:

 o notifier for suspend issued - *going to suspend*
 o userspace frozen
 o filesystem freeze (new, being proposed)
 o device driver pm ops called

Then on our way up:

  o device driver pm ops called
  o filesystem thaw
  o userspace thaw
  o notifier for resume issued - *thawing*

So the driver callbacks get called *later*, so anything called in
notifiers do get a chance to quiesce things properly.

  Luis

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: btusb "firmware request while host is not available" at resume
@ 2017-10-04  1:21                                 ` Luis R. Rodriguez
  0 siblings, 0 replies; 44+ messages in thread
From: Luis R. Rodriguez @ 2017-10-04  1:21 UTC (permalink / raw)
  To: Kai-Heng Feng
  Cc: Luis R. Rodriguez, Marcel Holtmann, Greg Kroah-Hartman,
	Linus Torvalds, Gabriel C, Tejun Heo, Gustavo F. Padovan,
	Sukumar Ghorai, Rafael J. Wysocki, Linux Kernel Mailing List,
	bluez mailin list (linux-bluetooth@vger.kernel.org)

On Tue, Oct 3, 2017 at 5:20 PM, Luis R. Rodriguez <mcgrof@kernel.org> wrote:
> the ordering devised currently there is:
>
>   o device driver pm ops called
>   o notifier for suspend issued - *going to suspend*
>   o userspace frozen
>   o filesystem freeze
>
> On the way back up this order is inverted:
>
>   o filesystem freeze
>   o userspace frozen
>   o notifier for suspend issued - *going to suspend*
>   o device driver pm ops called

Fortunately I had it a tad bit wrong, but in a good way. Our ordering
on our way down is:

 o notifier for suspend issued - *going to suspend*
 o userspace frozen
 o filesystem freeze (new, being proposed)
 o device driver pm ops called

Then on our way up:

  o device driver pm ops called
  o filesystem thaw
  o userspace thaw
  o notifier for resume issued - *thawing*

So the driver callbacks get called *later*, so anything called in
notifiers do get a chance to quiesce things properly.

  Luis

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: btusb "firmware request while host is not available" at resume
  2017-10-04  0:20                               ` Luis R. Rodriguez
@ 2017-10-06  4:42                                 ` Kai-Heng Feng
  -1 siblings, 0 replies; 44+ messages in thread
From: Kai-Heng Feng @ 2017-10-06  4:42 UTC (permalink / raw)
  To: Luis R. Rodriguez
  Cc: Marcel Holtmann, Greg Kroah-Hartman, Linus Torvalds, Gabriel C,
	Tejun Heo, Gustavo F. Padovan, Sukumar Ghorai, Rafael J. Wysocki,
	Linux Kernel Mailing List,
	bluez mailin list (linux-bluetooth@vger.kernel.org)

Hi Luis,

On Wed, Oct 4, 2017 at 8:20 AM, Luis R. Rodriguez <mcgrof@kernel.org> wrote:
> So please retest now that the revert happened: commit
> f007cad159e99fa2acd3b2e9364fbb32ad28b971 ("Revert "firmware: add sanity check
> on shutdown/suspend").

I can confirm the issue is gone after the commit.
Also, thanks for your detailed explanation.


Kai-Heng

>
>   Luis

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: btusb "firmware request while host is not available" at resume
@ 2017-10-06  4:42                                 ` Kai-Heng Feng
  0 siblings, 0 replies; 44+ messages in thread
From: Kai-Heng Feng @ 2017-10-06  4:42 UTC (permalink / raw)
  To: Luis R. Rodriguez
  Cc: Marcel Holtmann, Greg Kroah-Hartman, Linus Torvalds, Gabriel C,
	Tejun Heo, Gustavo F. Padovan, Sukumar Ghorai, Rafael J. Wysocki,
	Linux Kernel Mailing List,
	bluez mailin list (linux-bluetooth@vger.kernel.org)

Hi Luis,

On Wed, Oct 4, 2017 at 8:20 AM, Luis R. Rodriguez <mcgrof@kernel.org> wrote:
> So please retest now that the revert happened: commit
> f007cad159e99fa2acd3b2e9364fbb32ad28b971 ("Revert "firmware: add sanity check
> on shutdown/suspend").

I can confirm the issue is gone after the commit.
Also, thanks for your detailed explanation.


Kai-Heng

>
>   Luis

^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2017-10-06  4:42 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-09-10 19:26 btusb "firmware request while host is not available" at resume Linus Torvalds
2017-09-10 19:26 ` Linus Torvalds
2017-09-11  1:25 ` Greg Kroah-Hartman
2017-09-11  1:25   ` Greg Kroah-Hartman
2017-09-11  3:15   ` Gabriel C
2017-09-11  3:15     ` Gabriel C
2017-09-11  3:49     ` Gabriel C
2017-09-11  3:49       ` Gabriel C
2017-09-11  4:25       ` Linus Torvalds
2017-09-11  4:25         ` Linus Torvalds
2017-09-11  5:12         ` Marcel Holtmann
2017-09-11  5:12           ` Marcel Holtmann
2017-09-11 13:46           ` Greg Kroah-Hartman
2017-09-11 13:46             ` Greg Kroah-Hartman
2017-09-11 17:11             ` Luis R. Rodriguez
2017-09-11 17:11               ` Luis R. Rodriguez
2017-09-11 19:29               ` Greg Kroah-Hartman
2017-09-11 19:29                 ` Greg Kroah-Hartman
2017-09-11 20:06                 ` Luis R. Rodriguez
2017-09-11 20:06                   ` Luis R. Rodriguez
2017-09-12  0:13                   ` Gabriel C
2017-09-12  0:13                     ` Gabriel C
2017-09-12  0:33                     ` Luis R. Rodriguez
2017-09-12  0:33                       ` Luis R. Rodriguez
2017-09-12  0:48                   ` Greg Kroah-Hartman
2017-09-12  0:48                     ` Greg Kroah-Hartman
2017-09-12 16:52                     ` Luis R. Rodriguez
2017-09-12 16:52                       ` Luis R. Rodriguez
2017-09-12  5:13                   ` Marcel Holtmann
2017-09-12  5:13                     ` Marcel Holtmann
2017-09-12 16:27                     ` Luis R. Rodriguez
2017-09-12 16:27                       ` Luis R. Rodriguez
2017-09-13  6:52                       ` Marcel Holtmann
2017-09-13  6:52                         ` Marcel Holtmann
2017-09-13 17:39                         ` Luis R. Rodriguez
2017-09-13 17:39                           ` Luis R. Rodriguez
2017-10-02  8:34                           ` Kai-Heng Feng
2017-10-02  8:34                             ` Kai-Heng Feng
2017-10-04  0:20                             ` Luis R. Rodriguez
2017-10-04  0:20                               ` Luis R. Rodriguez
2017-10-04  1:21                               ` Luis R. Rodriguez
2017-10-04  1:21                                 ` Luis R. Rodriguez
2017-10-06  4:42                               ` Kai-Heng Feng
2017-10-06  4:42                                 ` Kai-Heng Feng

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.