* USB device cannot be reconnected and khubd "blocked for more than 120 seconds"
@ 2013-01-11 21:04 Alex Riesen
  2013-01-12  7:48 ` Alex Riesen
  0 siblings, 1 reply; 93+ messages in thread
From: Alex Riesen @ 2013-01-11 21:04 UTC (permalink / raw)
  To: linux-usb; +Cc: Linux Kernel Mailing List

Hi,

the USB stick (a Cruzer Titanium 2GB) was not recognized at any of
the USB ports of this system (a System76 lemu4 laptop, XHCI device)
after it was removed. If I attempt to insert it again in any of the
ports (one of the two USB3, or the USB2) the LED on the stick lights
up briefly and goes off again. There are no media detection messages in
the dmesg output, only those from the first time:

 usb 1-1.2: new high-speed USB device number 3 using ehci-pci
 usb 1-1.2: New USB device found, idVendor=0781, idProduct=5408
 usb 1-1.2: New USB device strings: Mfr=1, Product=2, SerialNumber=3
 usb 1-1.2: Product: U3 Titanium
 usb 1-1.2: Manufacturer: SanDisk Corporation
 usb 1-1.2: SerialNumber: 0000187A3A60F1E9
 scsi6 : usb-storage 1-1.2:1.0
 io scheduler deadline registered (default)
 usb 1-1.2: USB disconnect, device number 3

The kernel is v3.8-rc3. I never had this problem in 3.7. I could almost
reproduce the problem later in a simplified setup (init=/bin/bash) on
USB3 ports by inserting and removing the stick quickly. Almost - because
the USB3 ports recovered after some time, while the USB2 port never
experienced the problem.

Out of desperation, I tried to write "1\n" to
"/sys/bus/usb/devices/1-1.2/remove",
with an interesting result:

 INFO: task khubd:512 blocked for more than 120 seconds.
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 khubd           D ffff880213918000     0   512      2 0x00000000
  ffff880213b7fa78 0000000000000046 ffff88020000006b 0000000000000000
  ffff880213918000 ffff880213b7ffd8 ffff880213b7ffd8 0000000000013440
  ffff880213eb5d90 ffff880213918000 ffff880213b7fa08 0000000000000046
 Call Trace:
  [<ffffffff8104d763>] ? flush_work+0x6d/0x1fe
  [<ffffffff8133deeb>] ? scsi_remove_host+0x24/0x10e
  [<ffffffff8104d6fb>] ? flush_work+0x5/0x1fe
  [<ffffffff815dcf9e>] schedule+0x65/0x67
  [<ffffffff815dd1e6>] schedule_preempt_disabled+0x18/0x24
  [<ffffffff815db9ac>] mutex_lock_nested+0x181/0x2c1
  [<ffffffff8133deeb>] ? scsi_remove_host+0x24/0x10e
  [<ffffffff8133deeb>] scsi_remove_host+0x24/0x10e
  [<ffffffff8138c2f5>] usb_stor_disconnect+0x77/0xbc
  [<ffffffff81376a4c>] usb_unbind_interface+0x6c/0x14d
  [<ffffffff813266ec>] __device_release_driver+0x88/0xdb
  [<ffffffff81326764>] device_release_driver+0x25/0x32
  [<ffffffff8132615f>] bus_remove_device+0xf5/0x10a
  [<ffffffff8132412f>] device_del+0x12e/0x189
  [<ffffffff81374bee>] usb_disable_device+0x77/0x197
  [<ffffffff8136e719>] usb_disconnect+0x93/0xfb
  [<ffffffff8136f8ed>] hub_port_connect_change+0x14f/0x792
  [<ffffffff81370382>] hub_thread+0x452/0x6c3
  [<ffffffff8105ac1a>] ? complete+0x1f/0x50
  [<ffffffff81052587>] ? wake_up_bit+0x2a/0x2a
  [<ffffffff8136ff30>] ? hub_port_connect_change+0x792/0x792
  [<ffffffff81051f2a>] kthread+0xd5/0xdd
  [<ffffffff8105d5f6>] ? finish_task_switch+0x3f/0xf7
  [<ffffffff81051e55>] ? __init_kthread_worker+0x5a/0x5a
  [<ffffffff815e481c>] ret_from_fork+0x7c/0xb0
  [<ffffffff81051e55>] ? __init_kthread_worker+0x5a/0x5a
 4 locks held by khubd/512:
  #0:  (&__lockdep_no_validate__){......}, at: [<ffffffff81370039>] hub_thread+0x109/0x6c3
  #1:  (&__lockdep_no_validate__){......}, at: [<ffffffff8136e6e2>] usb_disconnect+0x5c/0xfb
  #2:  (&__lockdep_no_validate__){......}, at: [<ffffffff8132675c>] device_release_driver+0x1d/0x32
  #3:  (&shost->scan_mutex){......}, at: [<ffffffff8133deeb>] scsi_remove_host+0x24/0x10e
 INFO: task modprobe:12163 blocked for more than 120 seconds.
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 modprobe        D 0000000000000009     0 12163  12162 0x00000000
  ffff88020536dd68 0000000000000046 0000000000000000 ffff8801e66f86c8
  ffff8801e66f8000 ffff88020536dfd8 ffff88020536dfd8 0000000000013440
  ffff880213d33e60 ffff8801e66f8000 0000000000000000 ffff8801e66f86c8
 Call Trace:
  [<ffffffff810526f7>] ? prepare_to_wait+0x23/0x7d
  [<ffffffff81058775>] ? async_synchronize_cookie_domain+0xe0/0x167
  [<ffffffff815dcf9e>] schedule+0x65/0x67
  [<ffffffff8105879e>] async_synchronize_cookie_domain+0x109/0x167
  [<ffffffff81052587>] ? wake_up_bit+0x2a/0x2a
  [<ffffffff81058883>] async_synchronize_full+0x56/0x77
  [<ffffffff8108c837>] load_module+0x1002/0x11e8
  [<ffffffff810882e0>] ? sys_getegid16+0x4b/0x4b
  [<ffffffff815e13f2>] ? do_page_fault+0xe/0x10
  [<ffffffff8108cab6>] sys_init_module+0x99/0xa6
  [<ffffffff815e48c6>] system_call_fastpath+0x1a/0x1f
 1 lock held by modprobe/12163:
  #0:  (async_register_mutex){......}, at: [<ffffffff8105884a>] async_synchronize_full+0x1d/0x77

When reproducing in the simplified setup, this operation just disconnected
the device, as expected.
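
A minimal sketch, assuming Python is at hand, of how the hung-task reports above could be picked out of a saved kernel log. The function name `find_hung_tasks` and the two-line sample are illustrative only; the regular expression assumes the message format shown in the trace above and task names without spaces.

```python
import re

# Matches the hung-task detector line, for example:
# "INFO: task khubd:512 blocked for more than 120 seconds."
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<comm>\S+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(log_text):
    """Return a list of (comm, pid, seconds) for each hung-task report."""
    return [
        (m.group("comm"), int(m.group("pid")), int(m.group("secs")))
        for m in HUNG_TASK_RE.finditer(log_text)
    ]


# Two report lines copied from the log above.
sample = """\
 INFO: task khubd:512 blocked for more than 120 seconds.
 INFO: task modprobe:12163 blocked for more than 120 seconds.
"""

for comm, pid, secs in find_hung_tasks(sample):
    print(f"{comm} (pid {pid}) blocked for more than {secs}s")
```

On a real system the text would come from the kern.log or from dmesg output saved to a file.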

Additional information:
lspci:

00:00.0 Host bridge: Intel Corporation 3rd Gen Core processor DRAM Controller (rev 09)
00:02.0 VGA compatible controller: Intel Corporation 3rd Gen Core processor Graphics Controller (rev 09)
00:14.0 USB controller: Intel Corporation 7 Series/C210 Series Chipset Family USB xHCI Host Controller (rev 04)
00:16.0 Communication controller: Intel Corporation 7 Series/C210 Series Chipset Family MEI Controller #1 (rev 04)
00:1a.0 USB controller: Intel Corporation 7 Series/C210 Series Chipset Family USB Enhanced Host Controller #2 (rev 04)
00:1b.0 Audio device: Intel Corporation 7 Series/C210 Series Chipset Family High Definition Audio Controller (rev 04)
00:1c.0 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset Family PCI Express Root Port 1 (rev c4)
00:1c.2 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset Family PCI Express Root Port 3 (rev c4)
00:1c.3 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset Family PCI Express Root Port 4 (rev c4)
00:1d.0 USB controller: Intel Corporation 7 Series/C210 Series Chipset Family USB Enhanced Host Controller #1 (rev 04)
00:1f.0 ISA bridge: Intel Corporation HM76 Express Chipset LPC Controller (rev 04)
00:1f.2 SATA controller: Intel Corporation 7 Series Chipset Family 6-port SATA Controller [AHCI mode] (rev 04)
00:1f.3 SMBus: Intel Corporation 7 Series/C210 Series Chipset Family SMBus Controller (rev 04)
02:00.0 Network controller: Intel Corporation Centrino Advanced-N 6235 (rev 24)
03:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. Device 5289 (rev 01)
03:00.2 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 0a)

.config and the kern.log at:

http://familie-riesen.de/~raa/public/v3.8-rc3-khubd-hang-config-dmesg.tar.bz2

The kern.log ends with a long trace of running tasks: I pressed
alt-sysrq-t before reboot.


* Re: USB device cannot be reconnected and khubd "blocked for more than 120 seconds"
  2013-01-11 21:04 USB device cannot be reconnected and khubd "blocked for more than 120 seconds" Alex Riesen
@ 2013-01-12  7:48 ` Alex Riesen
  2013-01-12  9:18   ` Lan Tianyu
  2013-01-12 17:37   ` Alan Stern
  0 siblings, 2 replies; 93+ messages in thread
From: Alex Riesen @ 2013-01-12  7:48 UTC (permalink / raw)
  To: linux-usb; +Cc: Linux Kernel Mailing List

On Fri, Jan 11, 2013 at 10:04 PM, Alex Riesen <raa.lkml@gmail.com> wrote:
> Hi,
>
> the USB stick (an Cruzer Titanium 2GB) was not recognized at any of
> the USB ports of this system (an System76 lemu4 laptop, XHCI device)
> after it was removed. If I attempt to insert it again in any of the
> ports (one of the two USB3, or the USB2) the led on the stick lights
> up shortly and if off again. There is no media detection messages in
> the dmesg output, only that from the first time:
>
>  usb 1-1.2: new high-speed USB device number 3 using ehci-pci
>  usb 1-1.2: New USB device found, idVendor=0781, idProduct=5408
>  usb 1-1.2: New USB device strings: Mfr=1, Product=2, SerialNumber=3
>  usb 1-1.2: Product: U3 Titanium
>  usb 1-1.2: Manufacturer: SanDisk Corporation
>  usb 1-1.2: SerialNumber: 0000187A3A60F1E9
>  scsi6 : usb-storage 1-1.2:1.0
>  io scheduler deadline registered (default)
>  usb 1-1.2: USB disconnect, device number 3
>
> The kernel is v3.8-rc3. I never had this problem in 3.7. I could almost
> reproduce the problem later in a simplified setup (init=/bin/bash) on
> USB3 ports by inserting and removing the stick quickly. Almost - because
> the USB3 ports recovered after some time, while the USB2 port never
> experienced the problem.

One more detail: I usually use the "noop" elevator. That time it was
the "deadline". And I just reproduced it easily with "deadline".
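
For anyone checking which elevator a given disk is actually using: the block layer exposes it in /sys/block/<dev>/queue/scheduler, with the active scheduler shown in square brackets. A small sketch of reading that format follows; the helper name `active_scheduler` and the example string are illustrative, not taken from this machine.

```python
import re


def active_scheduler(sysfs_text):
    """Return the scheduler name shown in [brackets], or None if none is marked."""
    match = re.search(r"\[([^\]]+)\]", sysfs_text)
    return match.group(1) if match else None


# Example of the text format found in /sys/block/<dev>/queue/scheduler
# (made-up sample; the list of available schedulers varies per kernel).
example = "noop [deadline] cfq\n"
print(active_scheduler(example))
```

On a live system the argument would be the contents of that sysfs file for the USB disk in question, read after the stick has been plugged in.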


* Re: USB device cannot be reconnected and khubd "blocked for more than 120 seconds"
  2013-01-12  7:48 ` Alex Riesen
@ 2013-01-12  9:18   ` Lan Tianyu
  2013-01-12 17:37   ` Alan Stern
  1 sibling, 0 replies; 93+ messages in thread
From: Lan Tianyu @ 2013-01-12  9:18 UTC (permalink / raw)
  To: Alex Riesen; +Cc: Linux Kernel Mailing List, linux-usb

On 2013-01-12 15:48:59, Alex Riesen wrote:
> On Fri, Jan 11, 2013 at 10:04 PM, Alex Riesen <raa.lkml@gmail.com> wrote:
>> Hi,
>>
>> the USB stick (an Cruzer Titanium 2GB) was not recognized at any of
>> the USB ports of this system (an System76 lemu4 laptop, XHCI device)
>> after it was removed. If I attempt to insert it again in any of the
>> ports (one of the two USB3, or the USB2) the led on the stick lights
>> up shortly and if off again. There is no media detection messages in
>> the dmesg output, only that from the first time:
>>
>>  usb 1-1.2: new high-speed USB device number 3 using ehci-pci
>>  usb 1-1.2: New USB device found, idVendor=0781, idProduct=5408
>>  usb 1-1.2: New USB device strings: Mfr=1, Product=2, SerialNumber=3
>>  usb 1-1.2: Product: U3 Titanium
>>  usb 1-1.2: Manufacturer: SanDisk Corporation
>>  usb 1-1.2: SerialNumber: 0000187A3A60F1E9
>>  scsi6 : usb-storage 1-1.2:1.0
>>  io scheduler deadline registered (default)
>>  usb 1-1.2: USB disconnect, device number 3
>>
>> The kernel is v3.8-rc3. I never had this problem in 3.7. I could almost
>> reproduce the problem later in a simplified setup (init=/bin/bash) on
>> USB3 ports by inserting and removing the stick quickly. Almost - because
>> the USB3 ports recovered after some time, while the USB2 port never
>> experienced the problem.
>
> One more detail: I usually use the "noop" elevator. That time it was
> the "deadline". And I just reproduced it easily with "deadline".
Can you provide the output of dmesg with CONFIG_USB_DEBUG? This will
be helpful.



--
Best regards
Tianyu Lan


* Re: USB device cannot be reconnected and khubd "blocked for more than 120 seconds"
  2013-01-12  7:48 ` Alex Riesen
  2013-01-12  9:18   ` Lan Tianyu
@ 2013-01-12 17:37   ` Alan Stern
  2013-01-12 19:39     ` Alex Riesen
  2013-01-12 19:56     ` Alex Riesen
  1 sibling, 2 replies; 93+ messages in thread
From: Alan Stern @ 2013-01-12 17:37 UTC (permalink / raw)
  To: Alex Riesen; +Cc: linux-usb, Linux Kernel Mailing List

On Sat, 12 Jan 2013, Alex Riesen wrote:

> On Fri, Jan 11, 2013 at 10:04 PM, Alex Riesen <raa.lkml@gmail.com> wrote:
> > Hi,
> >
> > the USB stick (an Cruzer Titanium 2GB) was not recognized at any of
> > the USB ports of this system (an System76 lemu4 laptop, XHCI device)
> > after it was removed. If I attempt to insert it again in any of the
> > ports (one of the two USB3, or the USB2) the led on the stick lights
> > up shortly and if off again. There is no media detection messages in
> > the dmesg output, only that from the first time:

To make testing simpler, use only the USB-2 ports.  The xHCI driver is 
not as mature as the EHCI driver.

> >  usb 1-1.2: new high-speed USB device number 3 using ehci-pci
> >  usb 1-1.2: New USB device found, idVendor=0781, idProduct=5408
> >  usb 1-1.2: New USB device strings: Mfr=1, Product=2, SerialNumber=3
> >  usb 1-1.2: Product: U3 Titanium
> >  usb 1-1.2: Manufacturer: SanDisk Corporation
> >  usb 1-1.2: SerialNumber: 0000187A3A60F1E9
> >  scsi6 : usb-storage 1-1.2:1.0
> >  io scheduler deadline registered (default)
> >  usb 1-1.2: USB disconnect, device number 3
> >
> > The kernel is v3.8-rc3. I never had this problem in 3.7. I could almost
> > reproduce the problem later in a simplified setup (init=/bin/bash) on
> > USB3 ports by inserting and removing the stick quickly. Almost - because
> > the USB3 ports recovered after some time, while the USB2 port never
> > experienced the problem.

For testing, use a kernel with CONFIG_USB_DEBUG and CONFIG_PRINTK_TIME 
enabled.  Do the following:

After a normal boot, run "dmesg -C" to clear the log buffer.

Then plug in the stick.  After a couple of seconds, type Alt-SysRq-W.

Then unplug the stick.  After a couple of seconds, type Alt-SysRq-W 
again.

Then collect the output from dmesg and post it.

> One more detail: I usually use the "noop" elevator. That time it was
> the "deadline". And I just reproduced it easily with "deadline".

I doubt the elevator has anything to do with this.

Alan Stern



* Re: USB device cannot be reconnected and khubd "blocked for more than 120 seconds"
  2013-01-12 17:37   ` Alan Stern
@ 2013-01-12 19:39     ` Alex Riesen
  2013-01-12 20:33       ` Alex Riesen
  2013-01-12 19:56     ` Alex Riesen
  1 sibling, 1 reply; 93+ messages in thread
From: Alex Riesen @ 2013-01-12 19:39 UTC (permalink / raw)
  To: Alan Stern; +Cc: linux-usb, Linux Kernel Mailing List

On Sat, Jan 12, 2013 at 6:37 PM, Alan Stern <stern@rowland.harvard.edu> wrote:
> On Sat, 12 Jan 2013, Alex Riesen wrote:
>> One more detail: I usually use the "noop" elevator. That time it was
>> the "deadline". And I just reproduced it easily with "deadline".
>
> I doubt the elevator has anything to do with this.

But it looks like it does: just using the deadline elevator is a sure way
to reproduce the bug. The system always recovers (sometimes after a while)
with "noop".


* Re: USB device cannot be reconnected and khubd "blocked for more than 120 seconds"
  2013-01-12 17:37   ` Alan Stern
  2013-01-12 19:39     ` Alex Riesen
@ 2013-01-12 19:56     ` Alex Riesen
  1 sibling, 0 replies; 93+ messages in thread
From: Alex Riesen @ 2013-01-12 19:56 UTC (permalink / raw)
  To: Alan Stern; +Cc: linux-usb, Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 1647 bytes --]

On Sat, Jan 12, 2013 at 6:37 PM, Alan Stern <stern@rowland.harvard.edu> wrote:
> On Sat, 12 Jan 2013, Alex Riesen wrote:
>
>> On Fri, Jan 11, 2013 at 10:04 PM, Alex Riesen <raa.lkml@gmail.com> wrote:
>> >
>> > the USB stick (an Cruzer Titanium 2GB) was not recognized at any of
>> > the USB ports of this system (an System76 lemu4 laptop, XHCI device)
>> > after it was removed. [...]
>
> To make testing simpler, use only the USB-2 ports.  The xHCI driver is
> not as mature as the EHCI driver.

I used the USB2 port, but enabled debugging for xHCI too, since it is
in the same machine and, as you say, less mature. And there are some
traces from it, even though I didn't touch the USB3 ports.
Might be unrelated, but just in case...

>> > The kernel is v3.8-rc3. I never had this problem in 3.7. I could almost

For the record, I just retested: the problem persists with 3.7.1.

>> > reproduce the problem later in a simplified setup (init=/bin/bash) on
>> > USB3 ports by inserting and removing the stick quickly. Almost - because
>> > the USB3 ports recovered after some time, while the USB2 port never
>> > experienced the problem.
>
> For testing, use a kernel with CONFIG_USB_DEBUG and CONFIG_PRINTK_TIME
> enabled.  Do the following:
>
> After a normal boot, run "dmesg -C" to clear the log buffer.
>
> Then plug in the stick.  After a couple of seconds, type Alt-SysRq-W.
>
> Then unplug the stick.  After a couple of seconds, type Alt-SysRq-W
> again.
>
> Then collect the output from dmesg and post it.

Attached. A remount in the middle is me remounting an SATA device to
save dmesg output in case the system crashes hard.

[-- Attachment #2: dmesg2.bz2 --]
[-- Type: application/x-bzip2, Size: 12421 bytes --]


* Re: USB device cannot be reconnected and khubd "blocked for more than 120 seconds"
  2013-01-12 19:39     ` Alex Riesen
@ 2013-01-12 20:33       ` Alex Riesen
  2013-01-12 22:52         ` Alan Stern
  0 siblings, 1 reply; 93+ messages in thread
From: Alex Riesen @ 2013-01-12 20:33 UTC (permalink / raw)
  To: Alan Stern; +Cc: linux-usb, Linux Kernel Mailing List

On Sat, Jan 12, 2013 at 8:39 PM, Alex Riesen <raa.lkml@gmail.com> wrote:
> On Sat, Jan 12, 2013 at 6:37 PM, Alan Stern <stern@rowland.harvard.edu> wrote:
>> On Sat, 12 Jan 2013, Alex Riesen wrote:
>>> One more detail: I usually use the "noop" elevator. That time it was
>>> the "deadline". And I just reproduced it easily with "deadline".
>>
>> I doubt the elevator has anything to do with this.
>
> But it looks like it does: just using the deadline elevator is a sure way
> to reproduce the bug. The system always recovers (sometimes after a while)
> with "noop".

And no, it does not. Not by itself, but the fact that the deadline elevator was
compiled as a module certainly helped!

This explains the hanging modprobe in the dmesg output (the part after device
connect). I still wonder: why didn't it freeze at boot, while mounting SATA devices
(the root, /var, and /home are on an SSD connected by SATA)? And why was khubd
hanging at reboot?

Anyway, building the elevator in the kernel avoids the problem. Sorry for
not spotting this earlier.
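
A quick way to spot this configuration is to look at how the scheduler symbol is set in the kernel .config. The sketch below assumes the symbol is CONFIG_IOSCHED_DEADLINE, as in kernels of this era; the helper name `kconfig_state` and the sample fragment are made up for illustration and are not my actual .config.

```python
def kconfig_state(config_text, symbol):
    """Return the value of a Kconfig symbol from .config text.

    For tristate symbols this is 'y' (built in), 'm' (module) or 'n'.
    Symbols that are commented out or absent are reported as 'n'.
    """
    for line in config_text.splitlines():
        line = line.strip()
        if line == f"# {symbol} is not set":
            return "n"
        if line.startswith(symbol + "="):
            return line.split("=", 1)[1]
    return "n"


# Illustrative fragment only.
sample_config = """\
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_DEADLINE=m
# CONFIG_IOSCHED_CFQ is not set
"""

state = kconfig_state(sample_config, "CONFIG_IOSCHED_DEADLINE")
if state == "m":
    print("deadline elevator is built as a module")
```

A result of "m" together with "elevator=deadline" on the command line is the combination that triggered the hang here.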

Now, who would be interested in handling this kind of misconfiguration ...


* Re: USB device cannot be reconnected and khubd "blocked for more than 120 seconds"
  2013-01-12 20:33       ` Alex Riesen
@ 2013-01-12 22:52         ` Alan Stern
  2013-01-13 12:09           ` Alex Riesen
  0 siblings, 1 reply; 93+ messages in thread
From: Alan Stern @ 2013-01-12 22:52 UTC (permalink / raw)
  To: Alex Riesen; +Cc: linux-usb, Linux Kernel Mailing List

On Sat, 12 Jan 2013, Alex Riesen wrote:

> On Sat, Jan 12, 2013 at 8:39 PM, Alex Riesen <raa.lkml@gmail.com> wrote:
> > On Sat, Jan 12, 2013 at 6:37 PM, Alan Stern <stern@rowland.harvard.edu> wrote:
> >> On Sat, 12 Jan 2013, Alex Riesen wrote:
> >>> One more detail: I usually use the "noop" elevator. That time it was
> >>> the "deadline". And I just reproduced it easily with "deadline".
> >>
> >> I doubt the elevator has anything to do with this.
> >
> > But it looks like it does: just using the deadline elevator is a sure way
> > to reproduce the bug. The system always recovers (sometimes after a while)
> > with "noop".
> 
> And no, it does not. Not by itself, but the fact that deadline elevator was
> compiled as module certainly helped!
> 
> This explains the hanging modprobe in the dmesg output (the part after device
> connect). I still wonder, why didn't it froze at boot, mounting SATA devices
> (the root, /var, and /home are on an SSD connected by SATA)? And why hanging
> khubd at reboot?
> 
> Anyway, building the elevator in the kernel avoids the problem. Sorry for
> not spotting this earlier.
> 
> Now, who would be interested to handle this kind of misconfiguration ...

So the whole thing was a false alarm?

Maybe you should report to the block-layer maintainers that it's 
possible to mess up the system by building an elevator as a module.  
That sounds like the sort of thing they'd be interested to hear.

Alan Stern



* Re: USB device cannot be reconnected and khubd "blocked for more than 120 seconds"
  2013-01-12 22:52         ` Alan Stern
@ 2013-01-13 12:09           ` Alex Riesen
  2013-01-13 16:56             ` Alan Stern
  0 siblings, 1 reply; 93+ messages in thread
From: Alex Riesen @ 2013-01-13 12:09 UTC (permalink / raw)
  To: Alan Stern, Jens Axboe; +Cc: linux-usb, Linux Kernel Mailing List

On Sat, Jan 12, 2013 at 11:52 PM, Alan Stern <stern@rowland.harvard.edu> wrote:
> On Sat, 12 Jan 2013, Alex Riesen wrote:
>> Now, who would be interested to handle this kind of misconfiguration ...
>
> So the whole thing was a false alarm?

Yes, almost. What about khubd hanging when the machine is shut down?

> Maybe you should report to the block-layer maintainers that it's
> possible to mess up the system by building an elevator as a module.
> That sounds like the sort of thing they'd be interested to hear.

Hi Jens,

may I point you at this problem report:

http://thread.gmane.org/gmane.linux.kernel/1420814

It is surely a misconfiguration on my part (the I/O scheduler in use was
configured as a module), but the behavior is somewhat problematic
anyway: at least in this case USB storage is essentially locked up.

Regards,
Alex


* Re: USB device cannot be reconnected and khubd "blocked for more than 120 seconds"
  2013-01-13 12:09           ` Alex Riesen
@ 2013-01-13 16:56             ` Alan Stern
  2013-01-13 17:42               ` Alex Riesen
  0 siblings, 1 reply; 93+ messages in thread
From: Alan Stern @ 2013-01-13 16:56 UTC (permalink / raw)
  To: Alex Riesen; +Cc: Jens Axboe, linux-usb, Linux Kernel Mailing List

On Sun, 13 Jan 2013, Alex Riesen wrote:

> On Sat, Jan 12, 2013 at 11:52 PM, Alan Stern <stern@rowland.harvard.edu> wrote:
> > On Sat, 12 Jan 2013, Alex Riesen wrote:
> >> Now, who would be interested to handle this kind of misconfiguration ...
> >
> > So the whole thing was a false alarm?
> 
> Yes, almost. What about khubd hanging when machine is shutdown?

What about it?  I have trouble understanding all the descriptions you
have provided so far, because you talk about several different things
and change your mind a lot.  Can you provide a single, simple scenario
that illustrates this problem?

Alan Stern



* Re: USB device cannot be reconnected and khubd "blocked for more than 120 seconds"
  2013-01-13 16:56             ` Alan Stern
@ 2013-01-13 17:42               ` Alex Riesen
  2013-01-13 19:16                 ` Oliver Neukum
  2013-01-14  3:47                 ` Ming Lei
  0 siblings, 2 replies; 93+ messages in thread
From: Alex Riesen @ 2013-01-13 17:42 UTC (permalink / raw)
  To: Alan Stern; +Cc: Jens Axboe, linux-usb, Linux Kernel Mailing List

On Sun, Jan 13, 2013 at 5:56 PM, Alan Stern <stern@rowland.harvard.edu> wrote:
> On Sun, 13 Jan 2013, Alex Riesen wrote:
>>
>> Yes, almost. What about khubd hanging when machine is shutdown?
>
> What about it?  I have trouble understanding all the descriptions you
> have provided so far, because you talk about several different things
> and change your mind a lot.  Can you provide a single, simple scenario
> that illustrates this problem?

1. Compile a kernel with deadline elevator as module
2. Boot into it, make sure the elevator is selected
  (I used "elevator=deadline" in the kernel command line)
3. Insert a FAT formatted mass storage device in a USB2 port
   Observe "io scheduler deadline registered"
4. Pull the stick out, wait a moment, and either shut down or just
   press alt-sysrq-W:

[  158.170585] usb 1-1.2: USB disconnect, device number 3
[  158.170590] usb 1-1.2: unregistering device
[  158.170595] usb 1-1.2: unregistering interface 1-1.2:1.0
[  166.959398] SysRq : Show Blocked State
[  166.959410]   task                        PC stack   pid father
[  166.959432] khubd           D ffff880213a68000     0   513      2 0x00000000
[  166.959440]  ffff880213affa18 0000000000000046 ffff88020000006b 0000000000000
[  166.959448]  ffff880213a68000 ffff880213afffd8 ffff880213afffd8 0000000000013
[  166.959454]  ffffffff81a14400 ffff880213a68000 ffff880213aff9a8 0000000000000
[  166.959461] Call Trace:
[  166.959475]  [<ffffffff8104d763>] ? flush_work+0x6d/0x1fe
[  166.959485]  [<ffffffff8133defb>] ? scsi_remove_host+0x24/0x10e
[  166.959490]  [<ffffffff8104d6fb>] ? flush_work+0x5/0x1fe
[  166.959499]  [<ffffffff815e1dd6>] schedule+0x65/0x67
[  166.959506]  [<ffffffff815e201e>] schedule_preempt_disabled+0x18/0x24
[  166.959513]  [<ffffffff815e07e4>] mutex_lock_nested+0x181/0x2c1
[  166.959518]  [<ffffffff8133defb>] ? scsi_remove_host+0x24/0x10e
[  166.959524]  [<ffffffff8133defb>] scsi_remove_host+0x24/0x10e
[  166.959531]  [<ffffffff813910d5>] usb_stor_disconnect+0x77/0xbc
[  166.959539]  [<ffffffff81377ca3>] usb_unbind_interface+0x6c/0x14d
[  166.959548]  [<ffffffff813266fc>] __device_release_driver+0x88/0xdb
[  166.959554]  [<ffffffff81326774>] device_release_driver+0x25/0x32
[  166.959561]  [<ffffffff8132616f>] bus_remove_device+0xf5/0x10a
[  166.959567]  [<ffffffff8132413f>] device_del+0x12e/0x189
[  166.959574]  [<ffffffff81375d3a>] usb_disable_device+0xb1/0x20e
[  166.959582]  [<ffffffff8136ed8b>] usb_disconnect+0xab/0x113
[  166.959589]  [<ffffffff81370218>] hub_port_connect_change+0x1b0/0x879
[  166.959597]  [<ffffffff81370e3a>] hub_events+0x559/0x69d
[  166.959604]  [<ffffffff81370fb6>] hub_thread+0x38/0x19b
[  166.959612]  [<ffffffff81052587>] ? wake_up_bit+0x2a/0x2a
[  166.959618]  [<ffffffff81370f7e>] ? hub_events+0x69d/0x69d
[  166.959625]  [<ffffffff81051f2a>] kthread+0xd5/0xdd
[  166.959632]  [<ffffffff8105d5f6>] ? finish_task_switch+0x3f/0xf7
[  166.959641]  [<ffffffff81051e55>] ? __init_kthread_worker+0x5a/0x5a
[  166.959648]  [<ffffffff815e965c>] ret_from_fork+0x7c/0xb0
[  166.959655]  [<ffffffff81051e55>] ? __init_kthread_worker+0x5a/0x5a

This trace is from alt-sysrq-W. I can attach an image from the shutdown case,
the traces from that case are hard to save: the main storage is usually already
stopped. I believe it was the same, though.
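
To make the call chain in a trace like the one above easier to read, the entries marked with "?" can be filtered out: those are addresses found on the stack that the unwinder could not confirm as part of the call chain. A minimal sketch, with a hypothetical helper name `reliable_frames` and a few lines copied from the trace as sample input:

```python
import re

# One stack entry: "[<address>] [? ]symbol+0xoffset/0xsize"
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\?\s+)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def reliable_frames(trace_text):
    """Return symbol names of stack entries that are not marked with '?'."""
    frames = []
    for match in FRAME_RE.finditer(trace_text):
        if match.group(1) is None:  # no "?" marker on this entry
            frames.append(match.group(2))
    return frames


# A few lines copied from the khubd trace above.
sample_trace = """\
[  166.959475]  [<ffffffff8104d763>] ? flush_work+0x6d/0x1fe
[  166.959499]  [<ffffffff815e1dd6>] schedule+0x65/0x67
[  166.959513]  [<ffffffff815e07e4>] mutex_lock_nested+0x181/0x2c1
[  166.959518]  [<ffffffff8133defb>] ? scsi_remove_host+0x24/0x10e
[  166.959524]  [<ffffffff8133defb>] scsi_remove_host+0x24/0x10e
[  166.959531]  [<ffffffff813910d5>] usb_stor_disconnect+0x77/0xbc
"""

# Innermost call first, as in the kernel output.
print(" <- ".join(reliable_frames(sample_trace)))
```

The filtered chain starts at schedule and mutex_lock_nested, called from scsi_remove_host.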


* Re: USB device cannot be reconnected and khubd "blocked for more than 120 seconds"
  2013-01-13 17:42               ` Alex Riesen
@ 2013-01-13 19:16                 ` Oliver Neukum
  2013-01-14  2:39                   ` Alan Stern
  2013-01-14  3:47                 ` Ming Lei
  1 sibling, 1 reply; 93+ messages in thread
From: Oliver Neukum @ 2013-01-13 19:16 UTC (permalink / raw)
  To: Alex Riesen; +Cc: Alan Stern, Jens Axboe, linux-usb, Linux Kernel Mailing List

On Sunday 13 January 2013 18:42:49 Alex Riesen wrote:
> On Sun, Jan 13, 2013 at 5:56 PM, Alan Stern <stern@rowland.harvard.edu> wrote:
> > On Sun, 13 Jan 2013, Alex Riesen wrote:
> >>
> >> Yes, almost. What about khubd hanging when machine is shutdown?
> >
> > What about it?  I have trouble understanding all the descriptions you
> > have provided so far, because you talk about several different things
> > and change your mind a lot.  Can you provide a single, simple scenario
> > that illustrates this problem?
> 
> 1. Compile a kernel with deadline elevator as module
> 2. Boot into it, make sure the elevator is selected
>   (I used "elevator=deadline" in the kernel command line)
> 3. Insert a FAT formatted mass storage device in an USB2 port
>    Observe "io scheduler deadline registered"
> 4. Pull the stick out, wait a moment, and either shutdown or just
>    and press alt-sysrq-W:

That makes it clear. The elevator probably has scheduled work
which cannot finish waiting on a lock and scsi_remove_host()
wants to flush work.

This is not a USB problem. You need to involve the SCSI people.
khubd just stops working because disconnects are processed
in its context and the removal deadlocks.

	Regards
		Oliver




* Re: USB device cannot be reconnected and khubd "blocked for more than 120 seconds"
  2013-01-13 19:16                 ` Oliver Neukum
@ 2013-01-14  2:39                   ` Alan Stern
  2013-01-14 16:43                     ` Alex Riesen
  0 siblings, 1 reply; 93+ messages in thread
From: Alan Stern @ 2013-01-14  2:39 UTC (permalink / raw)
  To: Oliver Neukum
  Cc: Alex Riesen, Jens Axboe, linux-usb, Linux Kernel Mailing List

On Sun, 13 Jan 2013, Oliver Neukum wrote:

> On Sunday 13 January 2013 18:42:49 Alex Riesen wrote:
> > On Sun, Jan 13, 2013 at 5:56 PM, Alan Stern <stern@rowland.harvard.edu> wrote:
> > > On Sun, 13 Jan 2013, Alex Riesen wrote:
> > >>
> > >> Yes, almost. What about khubd hanging when machine is shutdown?
> > >
> > > What about it?  I have trouble understanding all the descriptions you
> > > have provided so far, because you talk about several different things
> > > and change your mind a lot.  Can you provide a single, simple scenario
> > > that illustrates this problem?
> > 
> > 1. Compile a kernel with deadline elevator as module
> > 2. Boot into it, make sure the elevator is selected
> >   (I used "elevator=deadline" in the kernel command line)
> > 3. Insert a FAT formatted mass storage device in an USB2 port
> >    Observe "io scheduler deadline registered"
> > 4. Pull the stick out, wait a moment, and either shutdown or just
> >    and press alt-sysrq-W:

Indeed.  I just tried booting into a kernel that has the deadline
elevator built-in, not a module.  Even then, when I specified
"elevator=deadline" on the boot command line, the system hung up
partway through booting.  Hard to tell exactly where, because it
occurred shortly after the switching from VGA to the framebuffer
driver, so the screen was completely blank.

When I get a chance, I'll try it on another machine where I can use a 
serial console.

> That makes it clear. The elevator probably has scheduled work
> which cannot finish waiting on a lock and scsi_remove_host()
> wants to flush work.

What is the work and why can't it finish?  Or rather, how can we 
figure these things out?  According to what Alex wrote, the blocked 
task doesn't show up in the Alt-SysRq-W listing.

And don't forget that the listing shows scsi_remove_host() blocks
waiting to acquire the host's scan_mutex.  Not waiting for work to be
flushed.  This casts doubt on your explanation.

> This is not a USB problem. You need to involve the SCSI people.
> khubd just stops working because disconnects are processed
> in its context and the removal deadlocks.

Then why would building the deadline elevator as a module make any
difference?  Or does it make a difference?

Alex, if the elevator is made static instead, do you still see the same 
behavior when the USB drive is removed?

Also, are there any mounted filesystems on the drive when you unplug
it?

Alan Stern



* Re: USB device cannot be reconnected and khubd "blocked for more than 120 seconds"
  2013-01-13 17:42               ` Alex Riesen
  2013-01-13 19:16                 ` Oliver Neukum
@ 2013-01-14  3:47                 ` Ming Lei
  2013-01-14  7:15                   ` Ming Lei
  2013-01-14  8:22                   ` Oliver Neukum
  1 sibling, 2 replies; 93+ messages in thread
From: Ming Lei @ 2013-01-14  3:47 UTC (permalink / raw)
  To: Alex Riesen; +Cc: Alan Stern, Jens Axboe, linux-usb, Linux Kernel Mailing List

On Mon, Jan 14, 2013 at 1:42 AM, Alex Riesen <raa.lkml@gmail.com> wrote:
>
> 1. Compile a kernel with deadline elevator as module
> 2. Boot into it, make sure the elevator is selected
>   (I used "elevator=deadline" in the kernel command line)
> 3. Insert a FAT formatted mass storage device in an USB2 port
>    Observe "io scheduler deadline registered"
> 4. Pull the stick out, wait a moment, and either shutdown or just
>    and press alt-sysrq-W:

I can reproduce the problem too on an EHCI-only system (Pandaboard)
when the deadline elevator is built as a module. The problem does not
occur when the elevator is built in. This is also on 3.8-rc3.

The dmesg log follows:

[   85.665679] usb 1-1.2.2: new high-speed USB device number 5 using ehci-omap
[   85.784423] usb 1-1.2.2: default language 0x0409
[   85.790008] usb 1-1.2.2: udev 5, busnum 1, minor = 4
[   85.790039] usb 1-1.2.2: New USB device found, idVendor=0951, idProduct=1624
[   85.790039] usb 1-1.2.2: New USB device strings: Mfr=1, Product=2,
SerialNumber=3
[   85.790069] usb 1-1.2.2: Product: DataTraveler G2
[   85.790069] usb 1-1.2.2: Manufacturer: Kingston
[   85.790100] usb 1-1.2.2: SerialNumber: 0019E06C5346EA41D0000071
[   85.790100] device: '1-1.2.2': device_add
[   85.790344] bus: 'usb': add device 1-1.2.2
[   85.790405] PM: Adding info for usb:1-1.2.2
[   85.790740] bus: 'usb': driver_probe_device: matched device 1-1.2.2
with driver usb
[   85.790771] bus: 'usb': really_probe: probing driver usb with device 1-1.2.2
[   85.790802] usb 1-1.2.2: usb_probe_device
[   85.790832] usb 1-1.2.2: configuration #1 chosen from 1 choice
[   85.791076] usb 1-1.2.2: adding 1-1.2.2:1.0 (config #1, interface 0)
[   85.791076] device: '1-1.2.2:1.0': device_add
[   85.791137] bus: 'usb': add device 1-1.2.2:1.0
[   85.791168] PM: Adding info for usb:1-1.2.2:1.0
[   85.791442] device: 'ep_81': device_add
[   85.791564] PM: Adding info for No Bus:ep_81
[   85.791564] device: 'ep_02': device_add
[   85.791687] PM: Adding info for No Bus:ep_02
[   85.791687] driver: '1-1.2.2': driver_bound: bound to device 'usb'
[   85.791717] bus: 'usb': really_probe: bound device 1-1.2.2 to driver usb
[   85.791748] PM: Moving platform:musb-hdrc.0.auto to end of list
[   85.791748] device: 'ep_00': device_add
[   85.791778] platform musb-hdrc.0.auto: Retrying from deferred list
[   85.791839] PM: Adding info for No Bus:ep_00
[   85.791839] bus: 'platform': driver_probe_device: matched device
musb-hdrc.0.auto with driver musb-hdrc
[   85.791839] bus: 'platform': really_probe: probing driver musb-hdrc
with device musb-hdrc.0.auto
[   85.791870] hub 1-1.2:1.0: state 7 ports 4 chg 0000 evt 0004
[   85.791900] unable to find transceiver of type USB2 PHY
[   85.797454] HS USB OTG: no transceiver configured
[   85.802703] musb-hdrc musb-hdrc.0.auto: musb_init_controller failed
with status -517
[   85.811157] platform musb-hdrc.0.auto: Driver musb-hdrc requests
probe deferral
[   85.811187] platform musb-hdrc.0.auto: Added to deferred list
[   85.811218] PM: Moving platform:twl6030_usb to end of list
[   85.811218] platform twl6030_usb: Retrying from deferred list
[   85.811279] bus: 'platform': driver_probe_device: matched device
twl6030_usb with driver twl6030_usb
[   85.811279] bus: 'platform': really_probe: probing driver
twl6030_usb with device twl6030_usb
[   85.811309] twl6030_usb twl6030_usb: phy not ready, deferring probe
[   85.811462] platform twl6030_usb: Driver twl6030_usb requests probe deferral
[   85.811462] platform twl6030_usb: Added to deferred list
[   85.883331] Initializing USB Mass Storage driver...
[   85.883361] bus: 'usb': add driver usb-storage
[   85.883453] bus: 'usb': driver_probe_device: matched device
1-1.2.2:1.0 with driver usb-storage
[   85.883483] bus: 'usb': really_probe: probing driver usb-storage
with device 1-1.2.2:1.0
[   85.883514] usb-storage 1-1.2.2:1.0: usb_probe_interface
[   85.883544] usb-storage 1-1.2.2:1.0: usb_probe_interface - got id
[   85.884094] scsi0 : usb-storage 1-1.2.2:1.0
[   85.884155] device: 'host0': device_add
[   85.884185] bus: 'scsi': add device host0
[   85.884246] PM: Adding info for scsi:host0
[   85.884552] device: 'host0': device_add
[   85.884674] PM: Adding info for No Bus:host0
[   85.884948] driver: '1-1.2.2:1.0': driver_bound: bound to device
'usb-storage'
[   85.884979] bus: 'usb': really_probe: bound device 1-1.2.2:1.0 to
driver usb-storage
[   85.884979] PM: Moving platform:musb-hdrc.0.auto to end of list
[   85.885009] platform musb-hdrc.0.auto: Retrying from deferred list
[   85.885070] bus: 'platform': driver_probe_device: matched device
musb-hdrc.0.auto with driver musb-hdrc
[   85.885070] bus: 'platform': really_probe: probing driver musb-hdrc
with device musb-hdrc.0.auto
[   85.885131] unable to find transceiver of type USB2 PHY
[   85.886230] usbcore: registered new interface driver usb-storage
[   85.886230] USB Mass Storage support registered.
[   85.890655] HS USB OTG: no transceiver configured
[   85.895660] musb-hdrc musb-hdrc.0.auto: musb_init_controller failed
with status -517
[   85.903839] platform musb-hdrc.0.auto: Driver musb-hdrc requests
probe deferral
[   85.903869] platform musb-hdrc.0.auto: Added to deferred list
[   85.903869] PM: Moving platform:twl6030_usb to end of list
[   85.903900] platform twl6030_usb: Retrying from deferred list
[   85.903930] bus: 'platform': driver_probe_device: matched device
twl6030_usb with driver twl6030_usb
[   85.903961] bus: 'platform': really_probe: probing driver
twl6030_usb with device twl6030_usb
[   85.903991] twl6030_usb twl6030_usb: phy not ready, deferring probe
[   85.904022] platform twl6030_usb: Driver twl6030_usb requests probe deferral
[   85.904052] platform twl6030_usb: Added to deferred list
[   86.901367] io scheduler deadline registered (default)
[  181.168487] INFO: task modprobe:2462 blocked for more than 90 seconds.
[  181.175323] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  181.183624] modprobe        D c04f1920     0  2462   2461 0x00000000
[  181.183685] [<c04f1920>] (__schedule+0x5fc/0x6d4) from [<c005eba4>] (async_synchronize_cookie_domain+0xdc/0x168)
[  181.183715] [<c005eba4>] (async_synchronize_cookie_domain+0xdc/0x168) from [<c005ed04>] (async_synchronize_full+0x3c/0x60)
[  181.183776] [<c005ed04>] (async_synchronize_full+0x3c/0x60) from [<c0085610>] (load_module+0x1aac/0x1cdc)
[  181.183807] [<c0085610>] (load_module+0x1aac/0x1cdc) from [<c0085944>] (sys_init_module+0x104/0x110)
[  181.183837] [<c0085944>] (sys_init_module+0x104/0x110) from [<c000dfe0>] (ret_fast_syscall+0x0/0x48)
[  271.175506] INFO: task modprobe:2462 blocked for more than 90 seconds.
[  271.182373] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  271.190826] modprobe        D c04f1920     0  2462   2461 0x00000000
[  271.190887] [<c04f1920>] (__schedule+0x5fc/0x6d4) from [<c005eba4>] (async_synchronize_cookie_domain+0xdc/0x168)
[  271.190917] [<c005eba4>] (async_synchronize_cookie_domain+0xdc/0x168) from [<c005ed04>] (async_synchronize_full+0x3c/0x60)
[  271.190948] [<c005ed04>] (async_synchronize_full+0x3c/0x60) from [<c0085610>] (load_module+0x1aac/0x1cdc)
[  271.190948] [<c0085610>] (load_module+0x1aac/0x1cdc) from [<c0085944>] (sys_init_module+0x104/0x110)
[  271.190979] [<c0085944>] (sys_init_module+0x104/0x110) from [<c000dfe0>] (ret_fast_syscall+0x0/0x48)



Thanks,
--
Ming Lei


* Re: USB device cannot be reconnected and khubd "blocked for more than 120 seconds"
  2013-01-14  3:47                 ` Ming Lei
@ 2013-01-14  7:15                   ` Ming Lei
  2013-01-14 17:30                     ` Linus Torvalds
  2013-01-14  8:22                   ` Oliver Neukum
  1 sibling, 1 reply; 93+ messages in thread
From: Ming Lei @ 2013-01-14  7:15 UTC (permalink / raw)
  To: Alex Riesen, Linus Torvalds
  Cc: Alan Stern, Jens Axboe, linux-usb, Linux Kernel Mailing List

On Mon, Jan 14, 2013 at 11:47 AM, Ming Lei <ming.lei@canonical.com> wrote:
> On Mon, Jan 14, 2013 at 1:42 AM, Alex Riesen <raa.lkml@gmail.com> wrote:
> [   86.901367] io scheduler deadline registered (default)
> [  181.168487] INFO: task modprobe:2462 blocked for more than 90 seconds.
> [  181.175323] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [  181.183624] modprobe        D c04f1920     0  2462   2461 0x00000000
> [  181.183685] [<c04f1920>] (__schedule+0x5fc/0x6d4) from [<c005eba4>]
> (async_synchronize_cookie_domain+0xdc/0x168)
> [  181.183715] [<c005eba4>] (async_synchronize_cookie_domain+0xdc/0x168) from [<c005ed04>] (async_synchronize_full+0x3c/0x60)
> [  181.183776] [<c005ed04>] (async_synchronize_full+0x3c/0x60) from [<c0085610>] (load_module+0x1aac/0x1cdc)
> [  181.183807] [<c0085610>] (load_module+0x1aac/0x1cdc) from [<c0085944>] (sys_init_module+0x104/0x110)
> [  181.183837] [<c0085944>] (sys_init_module+0x104/0x110) from
> [<c000dfe0>] (ret_fast_syscall+0x0/0x48)

The deadlock is caused by calling request_module() inside the async
function do_scan_async(). It was introduced by the following commit
from Linus:

commit d6de2c80e9d758d2e36c21699117db6178c0f517
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Fri Apr 10 12:17:41 2009 -0700

    async: Fix module loading async-work regression

IMO, the commit may not be a proper fix, considering the following:

- it isn't good to allow an async function to be marked as __init

- user mode shouldn't expect that the device is ready right after
'insmod' completes; drivers should make the device ready for user
mode only once their async probing or other kind of async
initialization (done in a work item or kthread) completes.

- from the driver's point of view, calling async_synchronize_full()
after do_one_initcall() inside do_init_module() amounts to a sync
probe for drivers built as modules, and easily causes this kind of
deadlock.

So could we revert the commit and fix the earlier problems case by
case? Or is there a better fix?


Thanks,
--
Ming Lei


* Re: USB device cannot be reconnected and khubd "blocked for more than 120 seconds"
  2013-01-14  3:47                 ` Ming Lei
  2013-01-14  7:15                   ` Ming Lei
@ 2013-01-14  8:22                   ` Oliver Neukum
  2013-01-14  8:40                     ` Ming Lei
  1 sibling, 1 reply; 93+ messages in thread
From: Oliver Neukum @ 2013-01-14  8:22 UTC (permalink / raw)
  To: Ming Lei
  Cc: Alex Riesen, Alan Stern, Jens Axboe, linux-usb,
	Linux Kernel Mailing List

On Monday 14 January 2013 11:47:57 Ming Lei wrote:
> [  181.175323] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  181.183624] modprobe        D c04f1920     0  2462   2461 0x00000000
> [  181.183685] [<c04f1920>] (__schedule+0x5fc/0x6d4) from [<c005eba4>] (async_synchronize_cookie_domain+0xdc/0x168)
> [  181.183715] [<c005eba4>] (async_synchronize_cookie_domain+0xdc/0x168) from [<c005ed04>] (async_synchronize_full+0x3c/0x60)
> [  181.183776] [<c005ed04>] (async_synchronize_full+0x3c/0x60) from [<c0085610>] (load_module+0x1aac/0x1cdc)
> [  181.183807] [<c0085610>] (load_module+0x1aac/0x1cdc) from [<c0085944>] (sys_init_module+0x104/0x110)
> [  181.183837] [<c0085944>] (sys_init_module+0x104/0x110) from [<c000dfe0>] (ret_fast_syscall+0x0/0x48)
> [  271.175506] INFO: task modprobe:2462 blocked for more than 90 seconds.
> [  271.182373] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  271.190826] modprobe        D c04f1920     0  2462   2461 0x00000000
> [  271.190887] [<c04f1920>] (__schedule+0x5fc/0x6d4) from [<c005eba4>] (async_synchronize_cookie_domain+0xdc/0x168)
> [  271.190917] [<c005eba4>] (async_synchronize_cookie_domain+0xdc/0x168) from [<c005ed04>] (async_synchronize_full+0x3c/0x60)
> [  271.190948] [<c005ed04>] (async_synchronize_full+0x3c/0x60) from [<c0085610>] (load_module+0x1aac/0x1cdc)
> [  271.190948] [<c0085610>] (load_module+0x1aac/0x1cdc) from [<c0085944>] (sys_init_module+0x104/0x110)
> [  271.190979] [<c0085944>] (sys_init_module+0x104/0x110) from [<c000dfe0>] (ret_fast_syscall+0x0/0x48)

OK, your trace is totally different. If your hangs are related, as is likely,
my explanation goes out of the window.

	Regards
		Oliver



* Re: USB device cannot be reconnected and khubd "blocked for more than 120 seconds"
  2013-01-14  8:22                   ` Oliver Neukum
@ 2013-01-14  8:40                     ` Ming Lei
  0 siblings, 0 replies; 93+ messages in thread
From: Ming Lei @ 2013-01-14  8:40 UTC (permalink / raw)
  To: Oliver Neukum
  Cc: Alex Riesen, Alan Stern, Jens Axboe, linux-usb,
	Linux Kernel Mailing List

On Mon, Jan 14, 2013 at 4:22 PM, Oliver Neukum <oliver@neukum.org> wrote:
>
> OK, your trace is totally different. If your hangs are related, as is likely,
> my explanation goes out of the window.

If I run 'shutdown' after unplugging the USB storage device, another hang
with the same trace as Alex's can be triggered too, so it should be the
same problem.

Thanks,
--
Ming Lei


* Re: USB device cannot be reconnected and khubd "blocked for more than 120 seconds"
  2013-01-14  2:39                   ` Alan Stern
@ 2013-01-14 16:43                     ` Alex Riesen
  0 siblings, 0 replies; 93+ messages in thread
From: Alex Riesen @ 2013-01-14 16:43 UTC (permalink / raw)
  To: Alan Stern
  Cc: Oliver Neukum, Jens Axboe, linux-usb, Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 1025 bytes --]

On Mon, Jan 14, 2013 at 3:39 AM, Alan Stern <stern@rowland.harvard.edu> wrote:
> On Sun, 13 Jan 2013, Oliver Neukum wrote:
>> This is not a USB problem. You need to involve the SCSI people.
>> khubd just stops working because disconnects are processed
>> in its context and the removal deadlocks.
>
> Then why would building the deadline elevator as a module make any
> difference?  Or does it make a difference?

Building elevator as module does make a difference: the system is broken.

> Alex, if the elevator is made static instead, do you still see the same
> behavior when the USB drive is removed?

How can I make the elevator static? Or did you mean "built-in"?
Or did you mean to ask if khubd hangs if the deadline is built in?
In that case - no. The behavior is normal. Nothing hangs.

> Also, are there any mounted filesystems on the drive when you unplug
> it?

No, no auto-mount. The whole of userspace init is attached, and I'm reasonably
sure nothing of it mounts anything automatically. Nothing of udev, too.

[-- Attachment #2: linuxrc-t --]
[-- Type: application/octet-stream, Size: 2520 bytes --]

#!/bin/bash -v
# SETUP
mount -nt proc  proc /proc
mount -nt sysfs sysfs /sys
(while read d m r; do [ "$d" = devtmpfs -a "$m" = /dev ] && exit 0
 done < /proc/mounts) || {
	mount -t tmpfs devfs /tmp &&
	cp -a /dev/* /tmp &&
	mount --move /tmp /dev
}
test -d /dev/pts || mkdir /dev/pts
mount -nt devpts devpts /dev/pts &
(
	ifconfig lo up
	test -f /etc/hostname && hostname --file /etc/hostname
)&

(
	cg=; while read g r; do
		[ "${g:0:1}" = '#' ] || cg="$cg,$g"
	done < /proc/cgroups
	mkdir /dev/cgroups &&
	mount -nt cgroup -o "${cg:1}" cgroup /dev/cgroups
)&

(
	mount -nt tmpfs run /run &&
	mkdir /run/lock /run/network
) &
mount -nt tmpfs tmp /tmp &
(
	mkdir /dev/shm &&
	mount -nt tmpfs shm /dev/shm -o nodev,nosuid,relatime
)&
mount -nt debugfs debugfs /sys/kernel/debug &

wait
read t < /proc/uptime && echo $t

HOME=/tmp # do not export HOME for udev
HOME=$HOME openvt -c 2 -w -- /bin/bash &
/sbin/udevd --daemon
/sbin/udevadm trigger
/sbin/udevadm settle
set +v # END OF MAIN SETUP
read t < /proc/uptime && echo -e "\e[1mUptime till now: $t\e[0m"

export HOME

test -e /proc/swaps && {
	echo Activating swap...
	swapon -a &
}

test -x /sbin/readahead-list && {
	echo Preloading programs...
	(
	for f in /usr/bin/vim /usr/bin/top ; do
		ldd -v "$f" |grep "^[	]/.*:$"|cut -c2-|cut -d: -f1
	done 2>/dev/null | grep -v "^/lib/" > /tmp/readahead
	test -e /tmp/readahead && {
		/sbin/readahead-list /tmp/readahead
		rm -f /tmp/readahead
	}
	) &
}

echo 9 >/proc/sysrq-trigger
echo -e '\e[32;1m'$(uname -a)'\e[0m'

export HOME
(shopt -s nullglob
grep -q notests /proc/cmdline ||
for LINUXRC_TEST in /boot/tests/*; do
	if test -x "$LINUXRC_TEST"
	then
		export LINUXRC_TEST
		echo TESTING "$LINUXRC_TEST"
		(. "$LINUXRC_TEST")
	fi
done) &

# reboot through three presses of power btn
(for e in /sys/class/input/event*/device/name
do
	if [ "$(< $e)" = 'Power Button' ]; then
		dev=${e%/device/name}
		exec /usr/local/bin/input-event -d /dev${dev:10} \
			-b power -t5 -r3 -- /sbin/reboot -f
	fi
done) &

# user shell
while :; do /sbin/getty 38400 tty3; done &
while :; do /sbin/getty 38400 tty4; done &
# X environment
grep -q nostartx /proc/cmdline ||
( mount -r /var
  test -d /var/lib/sudo && mount -t tmpfs -o size=$((1*1024*1024)) sudo /var/lib/sudo
  test -x /var/my-startx &&
  mount -r /home &&
  mount -t tmpfs -o size=$((1*1024*1024)) xkb /var/lib/xkb &&
  mount -t tmpfs -o size=$((15*1024*1024)) varlog /var/log &&
  /var/my-startx )&

NICE=
type -p nice >/dev/null && NICE="nice -n -3"
exec $NICE /bin/bash


* Re: USB device cannot be reconnected and khubd "blocked for more than 120 seconds"
  2013-01-14  7:15                   ` Ming Lei
@ 2013-01-14 17:30                     ` Linus Torvalds
  2013-01-14 18:04                       ` Alan Stern
  2013-01-15  1:53                       ` Ming Lei
  0 siblings, 2 replies; 93+ messages in thread
From: Linus Torvalds @ 2013-01-14 17:30 UTC (permalink / raw)
  To: Ming Lei
  Cc: Alex Riesen, Alan Stern, Jens Axboe, USB list, Linux Kernel Mailing List

On Sun, Jan 13, 2013 at 11:15 PM, Ming Lei <ming.lei@canonical.com> wrote:
>
> The deadlock problem is caused by calling request_module() inside
> async function of do_scan_async(), and it was introduced by Linus's
> below commit:
>
> commit d6de2c80e9d758d2e36c21699117db6178c0f517
> Author: Linus Torvalds <torvalds@linux-foundation.org>
> Date:   Fri Apr 10 12:17:41 2009 -0700
>
>     async: Fix module loading async-work regression
>
> IMO, maybe the commit isn't a proper fix, considered the
> below fact:
>
> - it isn't good to allow async function to be marked as __init

Immaterial. For modules, __init is a non-issue. For non-modules, the
synchronization elsewhere is sufficient.

> - any user mode shouldn't expect that the device is ready just
> after completing of 'insmod'

Bullshit. That expectation is just a fact. People insmod a device
driver, and mount the device immediately in scripts.

We do not say "user mode shouldn't". Seriously. EVER. User mode
*does*, and we deal with it. Learn it now, and stop ever saying that
again.

This is really starting to annoy me. Kernel developers who say "user
mode should be fixed to not do that" should go somewhere else. The
whole and *only* point of a kernel is to hide these kinds of issues
from user mode, and make things "just work" in user mode. User mode
should not ever worry about "oh, doing X can trigger a module load, so
now the device might not be available immediately, so I should delay
and re-try until it is".

That's just f*cking crazy talk.

We have a very simple rule in the kernel: we don't break user space. EVER.

Learn that rule. I don't ever want to hear "any user mode shouldn't
expect" again. User mode *does* expect. End of discussion.

> - from view of driver, introducing async_synchronize_full() after
> do_one_initcall() inside do_init_module() is like a sync probe
> for drivers built as module, and cause this kind of deadlock easily.
>
> So could we revert the commit and fix the previous problems just
> case by case? or other better fix?

There's no way in hell we take a "fix things one by one" approach.
It's not going to work. And your suggestion seems to not do async
discovery of devices in general, which is a *much* worse fix than
anything else. It's just crazy.

But there are other approaches we might take. We might move the call to

    async_synchronize_full();

to other places. For example, maybe we're better off doing it at
block/char device open instead?
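
The idea of waiting at open time rather than at module-load time can be sketched in plain userspace C. This is only a model under stated assumptions: every function ending in _sketch is a made-up stand-in, not the kernel's API, and the "async probe" is reduced to a pending flag.

```c
#include <stdbool.h>

/* Userspace model only; every name ending in _sketch is hypothetical. */
static bool probe_pending;   /* async probe scheduled but not yet run */
static bool device_ready;    /* set once the probe has completed */

/* Stand-in for async_schedule(): queue the probe for later. */
static void async_schedule_probe_sketch(void)
{
	probe_pending = true;
}

/* Stand-in for async_synchronize_full(): finish whatever is pending. */
static void async_synchronize_full_sketch(void)
{
	if (probe_pending) {
		device_ready = true;
		probe_pending = false;
	}
}

/* Module load returns without waiting for the async probe. */
static int load_module_sketch(void)
{
	async_schedule_probe_sketch();
	return 0;
}

/* Open is where the wait happens, so user space still sees a ready device. */
static int dev_open_sketch(void)
{
	async_synchronize_full_sketch();
	return device_ready ? 0 : -1;
}
```

In this model a script that loads the module and opens the device right away still succeeds, because the open path performs the wait that the load path skipped.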

              Linus


* Re: USB device cannot be reconnected and khubd "blocked for more than 120 seconds"
  2013-01-14 17:30                     ` Linus Torvalds
@ 2013-01-14 18:04                       ` Alan Stern
  2013-01-14 18:34                         ` Linus Torvalds
  2013-01-15  1:53                       ` Ming Lei
  1 sibling, 1 reply; 93+ messages in thread
From: Alan Stern @ 2013-01-14 18:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ming Lei, Alex Riesen, Jens Axboe, USB list, Linux Kernel Mailing List

On Mon, 14 Jan 2013, Linus Torvalds wrote:

> > - from view of driver, introducing async_synchronize_full() after
> > do_one_initcall() inside do_init_module() is like a sync probe
> > for drivers built as module, and cause this kind of deadlock easily.
> >
> > So could we revert the commit and fix the previous problems just
> > case by case? or other better fix?
> 
> There's no way in hell we take a "fix things one by one" approach.
> It's not going to work. And your suggestion seems to not do async
> discovery of devices in general, which is a *much* worse fix than
> anything else. It's just crazy.
> 
> But there are other approaches we might take. We might move the call to
> 
>     async_synchronize_full();
> 
> to other places. For example, maybe we're better off doing it at
> block/char device open instead?

How about skipping that call if the current thread is one of the async 
helpers?  Is it possible to detect when that happens?

Or maybe such a check should go inside async_synchronize_full() itself.

Alan Stern



* Re: USB device cannot be reconnected and khubd "blocked for more than 120 seconds"
  2013-01-14 18:04                       ` Alan Stern
@ 2013-01-14 18:34                         ` Linus Torvalds
  0 siblings, 0 replies; 93+ messages in thread
From: Linus Torvalds @ 2013-01-14 18:34 UTC (permalink / raw)
  To: Alan Stern
  Cc: Ming Lei, Alex Riesen, Jens Axboe, USB list, Linux Kernel Mailing List

On Mon, Jan 14, 2013 at 10:04 AM, Alan Stern <stern@rowland.harvard.edu> wrote:
>
> How about skipping that call if the current thread is one of the async
> helpers?  Is it possible to detect when that happens?
>
> Or maybe such a check should go inside async_synchronize_full() itself.

Do we have some idea of exactly what is waiting for what? Which async
context is causing the module load to happen in the first place?

I think *that* is what we should avoid - it sounds like the block
layer is loading the IO scheduler at the wrong point. I realize that
people like (for testing purposes) to change the IO scheduler at
random, but if that means that any IO can basically result in a
request_module(), then that sounds like a problem.

It seems to be "elevator_get()", and I presume the chain is something
like "load block driver async, the block driver does
blk_init_allocated_queue, that does request_module() to find the
elevator, the request_module() succeeds, but ends up waiting for async
work, which is the block driver load, which is waiting for the
request_module to finish".

And my gut feel is that blk_init_allocated_queue() probably shouldn't
do that request_module() at all. We might want to do it when we *open*
the device, but not while loading the module for the device.

So my _feeling_ is that this is just a bug in the block layer, and
that it shouldn't set up block device drivers for this kind of crazy
"need to load the elevator module while in the middle of scanning
devices". I think *that* is what we should aim to change.

Hmm?

That said, I think it might indeed be a good idea to make this problem
much easier to see, and that "detect when it happens" would be a good
thing (and then we should WARN_ON_ONCE() on people trying to do
request_module() calls from async context).
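
A rough userspace sketch of that "detect when it happens" check, assuming a hypothetical marker that is set while an async function runs. None of these names exist in the kernel, the module name string is only illustrative, and the warn-once behaviour is imitated with a static flag rather than the real WARN_ON_ONCE() macro.

```c
#include <stdbool.h>
#include <stdio.h>

static bool in_async_context;   /* hypothetical marker, set while async work runs */
static int warnings_emitted;    /* counts warnings so the behaviour is observable */

/* Stand-in for request_module(): warn once if called from async context. */
static int request_module_sketch(const char *name)
{
	static bool warned;

	if (in_async_context && !warned) {
		warned = true;
		warnings_emitted++;
		fprintf(stderr, "WARNING: request_module(\"%s\") from async context\n",
			name);
	}
	return 0;
}

/* Stand-in for the async worker: mark the context around the callback. */
static void run_async_sketch(void (*fn)(void))
{
	in_async_context = true;
	fn();
	in_async_context = false;
}

/* A device scan that asks for the IO scheduler module twice. */
static void scan_loads_elevator_sketch(void)
{
	request_module_sketch("deadline-iosched");
	request_module_sketch("deadline-iosched");
}
```

Calling run_async_sketch(scan_loads_elevator_sketch) emits a single warning no matter how often the scan repeats the request; the same request made outside the async wrapper stays silent.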

               Linus


* Re: USB device cannot be reconnected and khubd "blocked for more than 120 seconds"
  2013-01-14 17:30                     ` Linus Torvalds
  2013-01-14 18:04                       ` Alan Stern
@ 2013-01-15  1:53                       ` Ming Lei
  2013-01-15  6:23                         ` Ming Lei
  1 sibling, 1 reply; 93+ messages in thread
From: Ming Lei @ 2013-01-15  1:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alex Riesen, Alan Stern, Jens Axboe, USB list, Linux Kernel Mailing List

On Tue, Jan 15, 2013 at 1:30 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Sun, Jan 13, 2013 at 11:15 PM, Ming Lei <ming.lei@canonical.com> wrote:
>>
>> The deadlock problem is caused by calling request_module() inside
>> async function of do_scan_async(), and it was introduced by Linus's
>> below commit:
>>
>> commit d6de2c80e9d758d2e36c21699117db6178c0f517
>> Author: Linus Torvalds <torvalds@linux-foundation.org>
>> Date:   Fri Apr 10 12:17:41 2009 -0700
>>
>>     async: Fix module loading async-work regression
>>
>> IMO, maybe the commit isn't a proper fix, considered the
>> below fact:
>>
>> - it isn't good to allow async function to be marked as __init
>
> Immaterial. For modules, __init is a non-issue. For non-modules, the
> synchronization elsewhere is sufficient.

It looks like 5d38258ec026921a7b266f4047ebeaa75db358e5 (ACPI battery:
fix async boot oops) addresses the __init issue for modules.

>
>> - any user mode shouldn't expect that the device is ready just
>> after completing of 'insmod'
>
> Bullshit. That expectation is just a fact. People insmod a device
> driver, and mount the device immediately in scripts.

I mean we can let the device node be populated in probe() first,
but let open() wait for completion of the async probe(). Maybe my
wording was not accurate: here 'device isn't ready' just means
that the async probe() hasn't completed; it doesn't mean the device
node hasn't appeared.

>
> We do not say "user mode shouldn't". Seriously. EVER. User mode
> *does*, and we deal with it. Learn it now, and stop ever saying that
> again.
>
> This is really starting to annoy me. Kernel developers who say "user
> mode should be fixed to not do that" should go somewhere else. The
> whole and *only* point of a kernel is to hide these kinds of issues
> from user mode, and make things "just work" in user mode. User mode
> should not ever worry about "oh, doing X can trigger a module load, so
> now the device might not be available immediately, so I should delay
> and re-try until it is".
>
> That's just f*cking crazy talk.
>
> We have a very simple rule in the kernel: we don't break user space. EVER.

No, I don't mean we should break user space, see above.

>
> Learn that rule. I don't ever want to hear "any user mode shouldn't
> expect" again. User mode *does* expect. End of discussion.
>
>> - from view of driver, introducing async_synchronize_full() after
>> do_one_initcall() inside do_init_module() is like a sync probe
>> for drivers built as module, and cause this kind of deadlock easily.
>>
>> So could we revert the commit and fix the previous problems just
>> case by case? or other better fix?
>
> There's no way in hell we take a "fix things one by one" approach.
> It's not going to work. And your suggestion seems to not do async
> discovery of devices in general, which is a *much* worse fix than
> anything else. It's just crazy.

I will first try to work out a patch for the scsi block async probe
issue, and see whether the problem can be fixed by moving add_disk()
into sd_probe() and calling
async_synchronize_full_domain(&scsi_sd_probe_domain)
at the entry of sd_open().

>
> But there are other approaches we might take. We might move the call to
>
>     async_synchronize_full();
>
> to other places. For example, maybe we're better off doing it at
> block/char device open instead?

That looks similar to the idea above, but we would also have to remove
the async_synchronize_full() call in do_init_module().

Thanks,
--
Ming Lei


* Re: USB device cannot be reconnected and khubd "blocked for more than 120 seconds"
  2013-01-15  1:53                       ` Ming Lei
@ 2013-01-15  6:23                         ` Ming Lei
  2013-01-15 17:36                           ` Linus Torvalds
  0 siblings, 1 reply; 93+ messages in thread
From: Ming Lei @ 2013-01-15  6:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alex Riesen, Alan Stern, Jens Axboe, USB list, Linux Kernel Mailing List

On Tue, Jan 15, 2013 at 9:53 AM, Ming Lei <ming.lei@canonical.com> wrote:
>
> I will try to figure out one patch to address the scsi block async probe
> issue first, and see if it can fix the problem by moving add_disk()
> into sd_probe()
> and calling async_synchronize_full_domain(&scsi_sd_probe_domain)
> in the entry of sd_open().

It looks like that isn't doable, because the block partition devices can
only be created inside the async code.

But I have another idea to address the problem: let the module code call
async_synchronize_full() only if the module explicitly requires it. How
about the draft patch below?

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 7992635..c5106a0 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -3143,6 +3143,8 @@ static int __init init_sd(void)
 	if (err)
 		goto err_out_driver;

+	mod_init_async_wait(THIS_MODULE);
+
 	return 0;

 err_out_driver:
diff --git a/include/linux/module.h b/include/linux/module.h
index 7760c6d..09bd4c5 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -300,6 +300,12 @@ struct module

 	unsigned int taints;	/* same bits as kernel:tainted */

+	/*
+	 * set if the module wants to call async_synchronize_full
+	 * after its init() is completed.
+	 */
+	unsigned int init_async_wait:1;
+
 #ifdef CONFIG_GENERIC_BUG
 	/* Support for BUG */
 	unsigned num_bugs;
@@ -656,4 +662,16 @@ static inline void module_bug_finalize(const Elf_Ehdr *hdr,
 static inline void module_bug_cleanup(struct module *mod) {}
 #endif	/* CONFIG_GENERIC_BUG */

+/*
+ * If a module wants all of its async code to complete after
+ * its init() has executed, the module can call this function
+ * from its init(). The module's async functions then must not
+ * call request_module(), otherwise a deadlock will result.
+ */
+static inline void mod_init_async_wait(struct module *mod)
+{
+	if (mod)
+		mod->init_async_wait = 1;
+}
+
 #endif /* _LINUX_MODULE_H */
diff --git a/kernel/module.c b/kernel/module.c
index 250092c..dc5d011 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -3058,8 +3058,9 @@ static int do_init_module(struct module *mod)
 	blocking_notifier_call_chain(&module_notify_list,
 				     MODULE_STATE_LIVE, mod);

-	/* We need to finish all async code before the module init sequence is done */
-	async_synchronize_full();
+	/* Only complete all async code if the module requires that */
+	if (mod->init_async_wait)
+		async_synchronize_full();

 	mutex_lock(&module_mutex);
 	/* Drop initial reference. */


Thanks,
--
Ming Lei


* Re: USB device cannot be reconnected and khubd "blocked for more than 120 seconds"
  2013-01-15  6:23                         ` Ming Lei
@ 2013-01-15 17:36                           ` Linus Torvalds
  2013-01-15 18:18                             ` Linus Torvalds
                                               ` (3 more replies)
  0 siblings, 4 replies; 93+ messages in thread
From: Linus Torvalds @ 2013-01-15 17:36 UTC (permalink / raw)
  To: Ming Lei, Tejun Heo
  Cc: Alex Riesen, Alan Stern, Jens Axboe, USB list, Linux Kernel Mailing List

[ Added Tejun to the discussion, since he's the async go-to-guy ]

On Mon, Jan 14, 2013 at 10:23 PM, Ming Lei <ming.lei@canonical.com> wrote:
>
> But I have another idea to address the problem, and let module code call
> async_synchronize_full() only if the module requires that explicitly, so how
> about the below draft patch?

No way.

This kind of "let's just let drivers tell us when they used async
helpers" is basically *asking* for buggy code. In fact, just to prove
how bad it is, YOU SCREWED IT UP YOURSELF.

Because it's not just sd.c that uses async_schedule(), and would need
the async synchronize. It's floppy.c, it's generic scsi scanning (so
scsi tapes etc), and it's libata-core.c.

This kind of "let's randomly encourage people to write subtly buggy
code that has magical timing dependencies, so that the developer won't
likely even see it because he has fast disks etc" code is totally
unacceptable. And this code was *designed* to be that kind of buggy.

No, if we set a flag like this, then it needs to be set
*automatically*, so that a module cannot screw this up by mistake.

It could be as simple as having a per-thread flag that gets set by the
__async_schedule() function, and gets cleared by fork. Then the module
code could do something like

   /* before calling the module ->init function */
   current->used_async = 0;
   ...
   if (current->used_async)
      async_synchronize_full();

or whatever.

Tejun, comments? You can see the whole thread on lkml, but the basic
problem is that the module loading doing the unconditional
async_synchronize_full() has caused problems, because we have

 - load module A
   - module A does per-controller async discovery of its devices (eg
scsi or ata probing)
   - in the async thread, it initializes something that needs another
module B (in this case the default IO scheduler module)
      - modprobe for B loads the IO scheduler module successfully
          at the end of the module load, it does
async_synchronize_full() to make sure load_module won't return before
the module is ready
          *DEADLOCK*, because the async_synchronize_full() thing
actually waits for not the module B async code (it didn't have any),
but for the module *A* async code, which is waiting for module B to
finish.
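
The dependency chain above can be modeled as a tiny wait-for graph. What
follows is only a userspace sketch for illustration, not kernel code; the
names (wait_graph, is_deadlocked, ASYNC_JOB_A, MODPROBE_B) are made up
here and do not exist in the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Two participants: module A's async probe job and the modprobe for B. */
enum { ASYNC_JOB_A, MODPROBE_B, NR_TASKS };

/* waits_for[x][y] is true when task x cannot proceed until task y finishes. */
struct wait_graph {
	bool waits_for[NR_TASKS][NR_TASKS];
};

/* Follow wait edges starting at 'from'; report whether 'target' is reachable. */
bool reaches(const struct wait_graph *g, int from, int target, bool *seen)
{
	for (int next = 0; next < NR_TASKS; next++) {
		if (!g->waits_for[from][next])
			continue;
		if (next == target)
			return true;
		if (!seen[next]) {
			seen[next] = true;
			if (reaches(g, next, target, seen))
				return true;
		}
	}
	return false;
}

/* A task is deadlocked if it (transitively) waits for itself. */
bool is_deadlocked(const struct wait_graph *g, int task)
{
	bool seen[NR_TASKS] = { false };

	assert(task >= 0 && task < NR_TASKS);
	return reaches(g, task, task, seen);
}
```

Setting waits_for[ASYNC_JOB_A][MODPROBE_B] models request_module() blocking
in the async thread, and waits_for[MODPROBE_B][ASYNC_JOB_A] models the
unconditional async_synchronize_full(); with both edges present each task
reaches itself, and dropping the second edge removes the cycle.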

Now, I'll happily argue that we shouldn't have this kind of "load
modules from random context" behavior in the kernel, and I think the
block layer is to blame for doing the IO scheduler load at an insane
time. So "don't do that then" would be the best solution. Sadly, we
don't even have a good way to notice that we're doing it, so "hacky
workaround that at least doesn't require driver authors to care" is
likely the second-best workaround.

But the "hacky workaround" absolutely needs to be *automatic*. Because
the "driver writers need to get this subtle untestable thing right" is
*not* acceptable. That's the patch that Ming Lei did, and I refuse to
have that kind of fragile crap in the kernel.

                          Linus


* Re: USB device cannot be reconnected and khubd "blocked for more than 120 seconds"
  2013-01-15 17:36                           ` Linus Torvalds
@ 2013-01-15 18:18                             ` Linus Torvalds
  2013-01-15 23:17                               ` Tejun Heo
  2013-01-15 18:20                             ` Alan Stern
                                               ` (2 subsequent siblings)
  3 siblings, 1 reply; 93+ messages in thread
From: Linus Torvalds @ 2013-01-15 18:18 UTC (permalink / raw)
  To: Ming Lei, Tejun Heo
  Cc: Alex Riesen, Alan Stern, Jens Axboe, USB list, Linux Kernel Mailing List

On Tue, Jan 15, 2013 at 9:36 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> This kind of "let's randomly encourage people to write subtly buggy
> code that has magical timing dependencies, so that the developer won't
> likely even see it because he has fast disks etc" code is totally
> unacceptable. And this code was *designed* to be that kind of buggy.

Btw, we could *possibly* do this the other way around. Wait for all
async work by default, but then have a really hacky way to turn that
off for modules that explicitly don't want it, because they know they
can be loaded in async context, and they don't do any async work
themselves. Then we could make the IO schedulers set that flag ("I
know I'm loaded from async space, and I know I'm not myself doing any
async init")

Quite frankly, I'd still much rather prefer the automated approach -
or even better, just avoiding the "load modules in async context"
entirely. But at least the "I can put a huge comment about why I don't
want to be waited on" would be much more acceptable than the "I need
to explicitly tell the world that it needs to wait on me".

So Ming Lei's patch was "easily subtly buggy by mistake" (showing that
by the fact that it was indeed buggy), while the opposite model where
you have to explicitly ask people not to wait for you could still be
very buggy, but at least now it needs to explicitly do extra work in
order to be buggy.

So if an interface is fragile, it should aim to be fragile in the
right way - making the fragility explicit, so that people can grep for
it, and people can add comments to the particular code that marks it
fragile. The default behavior should be the robust one.

And it would be lovely to add a warning to the "people loaded a module
from async context" case, so that we'd *see* this.

Tejun, is there a good way for code to see "I'm running in async
context"? Then we could do something like

    WARN_ON_ONCE(wait && system_state == SYSTEM_RUNNING && in_async_thread());

in kernel/kmod.c (__request_module()). That should at least warn about
this whole issue happening.
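
A userspace sketch of what such a check could look like follows. It is an
illustration only: in_async_thread() does not exist at this point, so the
"in_async" argument stands in for it, and every sim_* name below is
invented for the example. The helper mimics the WARN_ON_ONCE() behaviour
of returning the condition every time while reporting it only once.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Stand-in for the kernel's system_state values used in the check. */
enum sim_system_state { SIM_SYSTEM_BOOTING, SIM_SYSTEM_RUNNING };

/* Number of warnings actually emitted, kept so callers can inspect it. */
int warnings_printed;

/* Returns the condition on every call, but reports it only the first time. */
bool sim_warn_on_once(bool cond, const char *what)
{
	static bool warned;

	assert(what != NULL);
	if (cond && !warned) {
		warned = true;
		warnings_printed++;
		fprintf(stderr, "WARNING: %s\n", what);
	}
	return cond;
}

/* The proposed condition: a waiting module request from async context. */
bool sim_request_module_check(bool wait, enum sim_system_state state,
			      bool in_async)
{
	return sim_warn_on_once(wait && state == SIM_SYSTEM_RUNNING && in_async,
				"synchronous module load from async context");
}
```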

                    Linus


* Re: USB device cannot be reconnected and khubd "blocked for more than 120 seconds"
  2013-01-15 17:36                           ` Linus Torvalds
  2013-01-15 18:18                             ` Linus Torvalds
@ 2013-01-15 18:20                             ` Alan Stern
  2013-01-15 18:39                               ` Tejun Heo
  2013-01-15 18:32                             ` Tejun Heo
  2013-01-16  3:05                             ` USB device cannot be reconnected and khubd "blocked for more than 120 seconds" Ming Lei
  3 siblings, 1 reply; 93+ messages in thread
From: Alan Stern @ 2013-01-15 18:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ming Lei, Tejun Heo, Alex Riesen, Jens Axboe, USB list,
	Linux Kernel Mailing List

On Tue, 15 Jan 2013, Linus Torvalds wrote:

> Tejun, comments? You can see the whole thread on lkml, but the basic
> problem is that the module loading doing the unconditional
> async_synchronize_full() has caused problems, because we have
> 
>  - load module A
>    - module A does per-controller async discovery of its devices (eg
> scsi or ata probing)
>    - in the async thread, it initializes something that needs another
> module B (in this case the default IO scheduler module)
>       - modprobe for B loads the IO scheduler module successfully
>           at the end of the module load, it does
> async_synchronize_full() to make sure load_module won't return before
> the module is ready
>           *DEADLOCK*, because the async_synchronize_full() thing
> actually waits for not the module B async code (it didn't have any),
> but for the module *A* async code, which is waiting for module B to
> finish.
> 
> Now, I'll happily argue that we shouldn't have this kind of "load
> modules from random context" behavior in the kernel, and I think the
> block layer is to blame for doing the IO scheduler load at an insane
> time. So "don't do that then" would be the best solution.

It may not be so easy.  When the SCSI async thread probes the new disk, 
it has to do I/O.  So it needs to use a scheduler.

But maybe it could use a built-in trivial scheduler until the proper 
one is loaded.  Then the loading could be asynchronous.

Alan Stern



* Re: USB device cannot be reconnected and khubd "blocked for more than 120 seconds"
  2013-01-15 17:36                           ` Linus Torvalds
  2013-01-15 18:18                             ` Linus Torvalds
  2013-01-15 18:20                             ` Alan Stern
@ 2013-01-15 18:32                             ` Tejun Heo
  2013-01-15 20:18                               ` Linus Torvalds
  2013-01-16 17:19                               ` [PATCH] async: fix __lowest_in_progress() Tejun Heo
  2013-01-16  3:05                             ` USB device cannot be reconnected and khubd "blocked for more than 120 seconds" Ming Lei
  3 siblings, 2 replies; 93+ messages in thread
From: Tejun Heo @ 2013-01-15 18:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ming Lei, Alex Riesen, Alan Stern, Jens Axboe, USB list,
	Linux Kernel Mailing List

Hello, Linus.

On Tue, Jan 15, 2013 at 09:36:57AM -0800, Linus Torvalds wrote:
> Tejun, comments? You can see the whole thread on lkml, but the basic
> problem is that the module loading doing the unconditional
> async_synchronize_full() has caused problems, because we have
> 
>  - load module A
>    - module A does per-controller async discovery of its devices (eg
> scsi or ata probing)
>    - in the async thread, it initializes something that needs another
> module B (in this case the default IO scheduler module)
>       - modprobe for B loads the IO scheduler module successfully
>           at the end of the module load, it does
> async_synchronize_full() to make sure load_module won't return before
> the module is ready
>           *DEADLOCK*, because the async_synchronize_full() thing
> actually waits for not the module B async code (it didn't have any),
> but for the module *A* async code, which is waiting for module B to
> finish.

I think the root problem here, apart from request_module() from block
- which is a bit nasty but making that part completely async would too
be quite nasty albeit in a different way - is that
async_synchronize_full() is way too indiscriminate.  It's something
only suitable for things like the end of system init.

I'm wondering whether what we need is a rudimentary nesting like the
following.

finished_loading()
{
	blah blah;

	cookie = async_current_cookie();

	do init calls;

	async_synchronize_upto(cookie);

	blah blah;
}

The nesting here would be an approximation as the dependency recorded
here is chronological.  I *suspect* this should be safe unless the
module is doing something weird.  Need to think more about it.  One
way or the other, I think what we need is some form of scoping for
flushing async ops.
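
To make the chronological scoping concrete, here is a small userspace
model. It is a sketch under assumptions, not the async implementation:
async_current_cookie() and async_synchronize_upto() are only proposed
names in this message, the sketch assumes the scoped wait covers just the
jobs queued after the cookie was sampled, and all sim_* identifiers are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_JOBS 16

struct sim_async {
	unsigned long next_cookie;	/* global, monotonically increasing */
	bool pending[MAX_JOBS];		/* pending[c]: job with cookie c is unfinished */
};

/* Queue a job and hand back its cookie. */
unsigned long sim_range_schedule(struct sim_async *a)
{
	assert(a->next_cookie < MAX_JOBS);
	a->pending[a->next_cookie] = true;
	return a->next_cookie++;
}

/* Cookie that the next queued job would receive. */
unsigned long sim_current_cookie(const struct sim_async *a)
{
	return a->next_cookie;
}

/* Count the unfinished jobs a scoped wait starting at 'start' would block on. */
size_t sim_jobs_to_wait_for(const struct sim_async *a, unsigned long start)
{
	size_t n = 0;

	for (unsigned long c = start; c < a->next_cookie; c++)
		if (a->pending[c])
			n++;
	return n;
}
```

In this model a probe job queued before the module load began falls
outside the range, so a module whose init queues nothing waits on
nothing, while a full flush (start of 0) would still block on it.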

BTW, the current synchronization is broken - cookie isn't transferred
to running->domain in queueing order but __lowest_in_progress()
assumes that.  I think I broke that while converting it to workqueue.

Anyways, working on it.

Thanks.

-- 
tejun


* Re: USB device cannot be reconnected and khubd "blocked for more than 120 seconds"
  2013-01-15 18:20                             ` Alan Stern
@ 2013-01-15 18:39                               ` Tejun Heo
  0 siblings, 0 replies; 93+ messages in thread
From: Tejun Heo @ 2013-01-15 18:39 UTC (permalink / raw)
  To: Alan Stern
  Cc: Linus Torvalds, Ming Lei, Alex Riesen, Jens Axboe, USB list,
	Linux Kernel Mailing List

Hello, Alan.

On Tue, Jan 15, 2013 at 01:20:58PM -0500, Alan Stern wrote:
> It may not be so easy.  When the SCSI async thread probes the new disk, 
> it has to do I/O.  So it needs to use a scheduler.
> 
> But maybe it could use a built-in trivial scheduler until the proper 
> one is loaded.  Then the loading could be asynchronous.

It can be done.  Noop is always built-in and block IO can do IOs with
noop.  The problem here is that request_module() is done synchronously
during elevator_init().  We can punt that to a work item so that the
elevator is switched on load completion.  There is some nastiness
involved tho - if module probing returns before the elevator switch
happens, userland can observe the elevator being switched after some
indeterminate short period of time, which can, for example, break
scripts adjusting elevator knobs etc...

I *think* it'll be best to allow scoped synchronization of async ops.
Looking into it.

Thanks.

-- 
tejun


* Re: USB device cannot be reconnected and khubd "blocked for more than 120 seconds"
  2013-01-15 18:32                             ` Tejun Heo
@ 2013-01-15 20:18                               ` Linus Torvalds
  2013-01-15 23:50                                 ` Tejun Heo
  2013-01-16 17:19                               ` [PATCH] async: fix __lowest_in_progress() Tejun Heo
  1 sibling, 1 reply; 93+ messages in thread
From: Linus Torvalds @ 2013-01-15 20:18 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ming Lei, Alex Riesen, Alan Stern, Jens Axboe, USB list,
	Linux Kernel Mailing List

On Tue, Jan 15, 2013 at 10:32 AM, Tejun Heo <tj@kernel.org> wrote:
>
> I think the root problem here, apart from request_module() from block
> - which is a bit nasty but making that part completely async would too
> be quite nasty albeit in a different way - is that
> async_synchronize_full() is way too indiscriminate.  It's something
> only suitable for things like the end of system init.
>
> I'm wondering whether what we need is a rudimentary nesting like the
> following.

I think that is a good solution if it works, but look out: we need to
synchronize across *all* domains, not just the default one.  The sd.c
code, for example, uses its own "scsi_sd_probe_domain",
and we *do* want to synchronize with it.

Can you do that with your suggested interface (ie it would have to be
a *global* sequence number)?

               Linus


* Re: USB device cannot be reconnected and khubd "blocked for more than 120 seconds"
  2013-01-15 18:18                             ` Linus Torvalds
@ 2013-01-15 23:17                               ` Tejun Heo
  0 siblings, 0 replies; 93+ messages in thread
From: Tejun Heo @ 2013-01-15 23:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ming Lei, Alex Riesen, Alan Stern, Jens Axboe, USB list,
	Linux Kernel Mailing List

Hello, Linus

Will continue on another reply but this one is relevant so...

On Tue, Jan 15, 2013 at 10:18:45AM -0800, Linus Torvalds wrote:
> Tejun, is there a good way for code to see "I'm running in async
> context"? Then we could do something like

Almost.  With a bit of modification we can ask whether current is a
kworker, reach struct worker via kthread_data() if so and then
test worker->current_func against the async workfn.

Thanks.

-- 
tejun


* Re: USB device cannot be reconnected and khubd "blocked for more than 120 seconds"
  2013-01-15 20:18                               ` Linus Torvalds
@ 2013-01-15 23:50                                 ` Tejun Heo
  2013-01-16  0:25                                   ` Arjan van de Ven
  2013-01-16  0:36                                   ` Linus Torvalds
  0 siblings, 2 replies; 93+ messages in thread
From: Tejun Heo @ 2013-01-15 23:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ming Lei, Alex Riesen, Alan Stern, Jens Axboe, USB list,
	Linux Kernel Mailing List, Arjan van de Ven

cc'ing Arjan.  Arjan, the original thread can be read from

  http://thread.gmane.org/gmane.linux.kernel/1420814

Hello, again.

On Tue, Jan 15, 2013 at 12:18:01PM -0800, Linus Torvalds wrote:
> I think that is a good solution if it works, but look out: we need to
> synchronize across *all* domains, not just the default one.  The sd.c
> code, for example, uses its own "scsi_sd_probe_domain",
> and we *do* want to synchronize with it.
> 
> Can you do that with your suggested interface (ie it would have to be
> a *global* sequence number)?

So, I've been thinking about it for a while now and it looks like
async is cutting too many corners to implement any sane stackable
flushing scheme on top.  There simply isn't much information to
determine who should wait for what.

I've thought of two workarounds.  Both suck.

A. Try to detect deadlock conditions from synchronize().  If a deadlock
   condition involving other async jobs is detected, whine about it
   and then skip.  Ignore deadlock condition on self (should solve
   this particular case).

   Detecting deadlock condition isn't difficult if there are only
   global synchronizations; unfortunately, fragmented dependencies via
   domain-local synchronization makes this non-trivial.

   We can still do ignore-self thing mostly trivially tho.  This will
   at least work around the problem at hand.

B. The ranged synchronization I first suggested.  The problem with
   this is that it's a common practice for a given async job to try to
   flush anything which comes before it.  This can introduce spurious
   synchronization dependencies which can then lead to deadlocks.

   These conditions can be detected and ignored, at least only
   considering global synchronizations.  The problem here is that
   those deadlock conditions will occur under normal usage and thus
   should be ignored silently, which basically makes synchronization
   silently ignore and finish successfully even if there are
   legitimate deadlocks which should be investigated.

For now, I'm gonna implement simple "I'm not gonna wait for myself"
self-deadlock avoidance.  If this needs any more sophistication, I
think we better reimplement it so that we can explicitly match up and
track who's gonna wait for what instead of throwing everything into a
single cookie space and then try to work back from there.

Thanks.

-- 
tejun


* Re: USB device cannot be reconnected and khubd "blocked for more than 120 seconds"
  2013-01-15 23:50                                 ` Tejun Heo
@ 2013-01-16  0:25                                   ` Arjan van de Ven
  2013-01-16  0:35                                     ` Tejun Heo
  2013-01-16  0:36                                   ` Linus Torvalds
  1 sibling, 1 reply; 93+ messages in thread
From: Arjan van de Ven @ 2013-01-16  0:25 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Linus Torvalds, Ming Lei, Alex Riesen, Alan Stern, Jens Axboe,
	USB list, Linux Kernel Mailing List


> For now, I'm gonna implement simple "I'm not gonna wait for myself"
> self-deadlock avoidance.  If this needs any more sophistication, I
> think we better reimplement it so that we can explicitly match up and
> track who's gonna wait for what instead of throwing everything into a
> single cookie space and then try to work back from there.

async fundamentally had the concept of a monotonic increasing number,
and that you could always wait for "everyone before me".
then people (like me) wanted exceptions to what "everyone" means ;-(
I'm ok with going back to a single space and simplify the world.

the case with (usb) module loading is "fun"...
people expect the device to be there (since frankly, it's hard to do otherwise)..
... but it's also really hard due to the nature of USB.. USB is async in nature,
even independent of the kernel async stuff.
Example: Load ehci.ko ... the actual USB devices don't show up for some time.


the module wait case is tricky, and I wonder if there's deadlocks lurking even without async.
(btw there is a similar situation at the end of the normal kernel boot versus things like asynchronous
driver initializing... but we "skip" that when an initrd is used, to bypass a very similar deadlock.
this is even without "async" in use.. typical hard case is the PS/2 mouse probing)

at some point in the past we had the concept of "request a module but don't wait for it",
and I wonder if that is what should have been used here.

Doing a "range wait", with the start of the range being taken at the start of module loading
is a bit of a hack, but it'll work for the userspace expected semantics of all async stuff of
the *loaded module* be done, independent of all other modules/async stuff.
It's not as deadlocky as one might think, but it's not going to be efficient to implement.

not self-deadlocking likely solves most practical cases though






* Re: USB device cannot be reconnected and khubd "blocked for more than 120 seconds"
  2013-01-16  0:25                                   ` Arjan van de Ven
@ 2013-01-16  0:35                                     ` Tejun Heo
  2013-01-16  4:01                                       ` Alan Stern
  0 siblings, 1 reply; 93+ messages in thread
From: Tejun Heo @ 2013-01-16  0:35 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Linus Torvalds, Ming Lei, Alex Riesen, Alan Stern, Jens Axboe,
	USB list, Linux Kernel Mailing List

Hello, Arjan.

On Tue, Jan 15, 2013 at 04:25:54PM -0800, Arjan van de Ven wrote:
> async fundamentally had the concept of a monotonic increasing number,
> and that you could always wait for "everyone before me".
> then people (like me) wanted exceptions to what "everyone" means ;-(
> I'm ok with going back to a single space and simplify the world.

If we want (or need) finer grained operation, we'll probably have to
head the other direction, so that we can definitively tell that an
async operation belongs to domains system, module load A and B, so
that each waiter knows what to wait for.

The current domain implementation is somewhere in between.  It's not
a completely simplistic system and at the same time not developed enough
to do properly stacked flushing.

> the module wait case is tricky, and I wonder if there's deadlocks
> lurking even without async.

I don't think so.  It's really an async job waiting for itself.
Working around just this case is mostly trivial (working on patches
now) but it really is putting kludges on top of shaky foundation.
Maybe this is the extent of complexity that we need to go given the
rather limited use cases of async.  Let's hope so.  I think we'll have
to reimplement synchronization scheme if we have to go further.

> at some point in the past we had the concept of "request a module
> but don't wait for it", and I wonder if that is what should have
> been used here.

We actually want to wait for it as it creates a userland visible
behavior difference otherwise.  It's just that async's way of waiting
is too ham-fisted to be used properly in more complex scenarios.

Thanks.

-- 
tejun


* Re: USB device cannot be reconnected and khubd "blocked for more than 120 seconds"
  2013-01-15 23:50                                 ` Tejun Heo
  2013-01-16  0:25                                   ` Arjan van de Ven
@ 2013-01-16  0:36                                   ` Linus Torvalds
  2013-01-16  0:40                                     ` Linus Torvalds
  2013-01-16  0:44                                     ` USB device cannot be reconnected and khubd "blocked for more than 120 seconds" Tejun Heo
  1 sibling, 2 replies; 93+ messages in thread
From: Linus Torvalds @ 2013-01-16  0:36 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ming Lei, Alex Riesen, Alan Stern, Jens Axboe, USB list,
	Linux Kernel Mailing List, Arjan van de Ven

On Tue, Jan 15, 2013 at 3:50 PM, Tejun Heo <tj@kernel.org> wrote:
>
> For now, I'm gonna implement simple "I'm not gonna wait for myself"
> self-deadlock avoidance.

You can't really do that. Or rather, it won't *help*.

The thing is, the module loading in particular is not necessarily
happening in the same context as what *started* the module loading. A
module loader will request the module from user space, and then later
user space - through possibly a totally unrelated process - will
finish it. So there is no "myself". There's not even necessarily any
relationship that the kernel even knows about, because the module
loading request can have gone from usermode_helper over something like
dbus to systemd.

See?

There's a reason I asked for a warning for this. Or the "let's flag
the current thread if it ever started anything asynchronous". Because
it's complicated.

          Linus


* Re: USB device cannot be reconnected and khubd "blocked for more than 120 seconds"
  2013-01-16  0:36                                   ` Linus Torvalds
@ 2013-01-16  0:40                                     ` Linus Torvalds
  2013-01-16  2:52                                       ` [PATCH] module, async: async_synchronize_full() on module init iff async is used Tejun Heo
  2013-01-16  0:44                                     ` USB device cannot be reconnected and khubd "blocked for more than 120 seconds" Tejun Heo
  1 sibling, 1 reply; 93+ messages in thread
From: Linus Torvalds @ 2013-01-16  0:40 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ming Lei, Alex Riesen, Alan Stern, Jens Axboe, USB list,
	Linux Kernel Mailing List, Arjan van de Ven

On Tue, Jan 15, 2013 at 4:36 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> There's a reason I asked for a warning for this. Or the "let's flag
> the current thread if it ever started anything asynchronous". Because
> it's complicated.

Btw, the sequence counter (that is *not* taking anything else into
account) is good enough in practice, exactly because the common case
for module loading is actually that nothing in the module init
sequence is done asynchronously.

Yes, device discovery (particularly for block devices) is often
asynchronous. But the modules it then asks to load usually wouldn't
be. So if we just have the flag "did this thread ever even start async
work" over the module init sequence, we can just avoid the async
serialization entirely for that case, and it breaks the deadlock chain
nicely in practice.

Only if a block device does async work and then wants to load another
module that does more async work in its init routine would it then
break. But at that point, I'll happily just put my foot down and tell
people they are crazy, and "Let's not do that kind of crap".
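
The cases above can be tabulated with a very small model. This is a
simplified sketch, not kernel code: it assumes the requesting async job
is still pending and blocked in request_module() while the loader runs,
and struct load_scenario, loader_synchronizes() and would_deadlock() are
names invented for this illustration.

```c
#include <assert.h>
#include <stdbool.h>

struct load_scenario {
	bool requester_is_async_job;	/* request_module() called from a pending async job */
	bool init_queues_async;		/* the loaded module's init queues async work */
};

/* With the flag, the loader synchronizes only if init itself queued async work. */
bool loader_synchronizes(const struct load_scenario *s, bool use_flag)
{
	assert(s);
	return use_flag ? s->init_queues_async : true;
}

/* The loader deadlocks when it waits on the job that is waiting for it. */
bool would_deadlock(const struct load_scenario *s, bool use_flag)
{
	return s->requester_is_async_job && loader_synchronizes(s, use_flag);
}
```

Under these assumptions the IO scheduler case (async requester, no async
work in init) deadlocks only without the flag, while the nested case
(async requester and async work in init) deadlocks either way.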

            Linus


* Re: USB device cannot be reconnected and khubd "blocked for more than 120 seconds"
  2013-01-16  0:36                                   ` Linus Torvalds
  2013-01-16  0:40                                     ` Linus Torvalds
@ 2013-01-16  0:44                                     ` Tejun Heo
  1 sibling, 0 replies; 93+ messages in thread
From: Tejun Heo @ 2013-01-16  0:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ming Lei, Alex Riesen, Alan Stern, Jens Axboe, USB list,
	Linux Kernel Mailing List, Arjan van de Ven

On Tue, Jan 15, 2013 at 04:36:34PM -0800, Linus Torvalds wrote:
> The thing is, the module loading in particular is not necessarily
> happening in the same context as what *started* the module loading. A
> module loader will request the module from user space, and then later
> user space - through possibly a totally unrelated process - will
> finish it. So there is no "myself". There's not even necessarily any
> relationship that the kernel even knows about, because the module
> loading request can have gone from usermode_helper over something like
> dbus to systemd.
> 
> See?

Right.  Gees, there's even no way to link them.

-- 
tejun


* [PATCH] module, async: async_synchronize_full() on module init iff async is used
  2013-01-16  0:40                                     ` Linus Torvalds
@ 2013-01-16  2:52                                       ` Tejun Heo
  2013-01-16  3:00                                         ` Linus Torvalds
                                                           ` (4 more replies)
  0 siblings, 5 replies; 93+ messages in thread
From: Tejun Heo @ 2013-01-16  2:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ming Lei, Alex Riesen, Alan Stern, Jens Axboe, USB list,
	Linux Kernel Mailing List, Arjan van de Ven, Rusty Russell

If the default iosched is built as module, the kernel may deadlock
while trying to load the iosched module on device probe if the probing
was running off async.  This is because async_synchronize_full() at
the end of module init ends up waiting for the async job which
initiated the module loading.

 async A				modprobe

 1. finds a device
 2. registers the block device
 3. request_module(default iosched)
					4. modprobe in userland
					5. load and init module
					6. async_synchronize_full()

Async A waits for modprobe to finish in request_module() and modprobe
waits for async A to finish in async_synchronize_full().

Because there's no easy way to track the dependency once control goes
out to userland, implementing properly nested flushing is difficult.  For
now, make module init perform async_synchronize_full() iff module init
has queued async jobs as suggested by Linus.

This avoids the described deadlock because iosched module doesn't use
async and thus wouldn't invoke async_synchronize_full().  This is
hacky and incomplete.  It will deadlock if async module loading nests;
however, this works around the known problem case and seems to be the
best of bad options.

For more details, please refer to the following thread.

  http://thread.gmane.org/gmane.linux.kernel/1420814

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Alex Riesen <raa.lkml@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
---
It makes me feel dirty but makes the problem go away and I can't think
of anything better, so here is the implementation of "used async"
workaround.

Thanks.

 include/linux/sched.h |    1 +
 kernel/async.c        |    3 +++
 kernel/module.c       |   27 +++++++++++++++++++++++++--
 3 files changed, 29 insertions(+), 2 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1810,6 +1810,7 @@ extern void thread_group_cputime_adjuste
 #define PF_MEMALLOC	0x00000800	/* Allocating memory */
 #define PF_NPROC_EXCEEDED 0x00001000	/* set_user noticed that RLIMIT_NPROC was exceeded */
 #define PF_USED_MATH	0x00002000	/* if unset the fpu must be initialized before use */
+#define PF_USED_ASYNC	0x00004000	/* used async_schedule*(), used by module init */
 #define PF_NOFREEZE	0x00008000	/* this thread should not be frozen */
 #define PF_FROZEN	0x00010000	/* frozen for system suspend */
 #define PF_FSTRANS	0x00020000	/* inside a filesystem transaction */
--- a/kernel/async.c
+++ b/kernel/async.c
@@ -196,6 +196,9 @@ static async_cookie_t __async_schedule(a
 	atomic_inc(&entry_count);
 	spin_unlock_irqrestore(&async_lock, flags);
 
+	/* mark that this task has queued an async job, used by module init */
+	current->flags |= PF_USED_ASYNC;
+
 	/* schedule for execution */
 	queue_work(system_unbound_wq, &entry->work);
 
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -3013,6 +3013,12 @@ static int do_init_module(struct module
 {
 	int ret = 0;
 
+	/*
+	 * We want to find out whether @mod uses async during init.  Clear
+	 * PF_USED_ASYNC.  async_schedule*() will set it.
+	 */
+	current->flags &= ~PF_USED_ASYNC;
+
 	blocking_notifier_call_chain(&module_notify_list,
 			MODULE_STATE_COMING, mod);
 
@@ -3058,8 +3064,25 @@ static int do_init_module(struct module
 	blocking_notifier_call_chain(&module_notify_list,
 				     MODULE_STATE_LIVE, mod);
 
-	/* We need to finish all async code before the module init sequence is done */
-	async_synchronize_full();
+	/*
+	 * We need to finish all async code before the module init sequence
+	 * is done.  This has potential to deadlock.  For example, a newly
+	 * detected block device can trigger request_module() of the
+	 * default iosched from async probing task.  Once userland helper
+	 * reaches here, async_synchronize_full() will wait on the async
+	 * task waiting on request_module() and deadlock.
+	 *
+	 * This deadlock is avoided by performing async_synchronize_full()
+	 * iff module init queued any async jobs.  This isn't a full
+	 * solution as it will deadlock the same if module loading from
+	 * async jobs nests more than once; however, due to the various
+	 * constraints, this hack seems to be the best option for now.
+	 * Please refer to the following thread for details.
+	 *
+	 * http://thread.gmane.org/gmane.linux.kernel/1420814
+	 */
+	if (current->flags & PF_USED_ASYNC)
+		async_synchronize_full();
 
 	mutex_lock(&module_mutex);
 	/* Drop initial reference. */
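
The flag handling in the patch above can be modelled in user space. This is only a sketch under stated assumptions: struct task_model, model_async_schedule(), model_do_init_module() and the two init functions are hypothetical stand-ins for current->flags, __async_schedule() and do_init_module(); only the PF_USED_ASYNC bit value is taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>

#define PF_USED_ASYNC 0x00004000u	/* same bit value as in the patch */

/* Hypothetical stand-in for the per-task state kept in current->flags. */
struct task_model {
	unsigned int flags;
	int full_syncs;		/* times the model "ran" async_synchronize_full() */
};

/* Models __async_schedule(): queueing any async job marks the calling task. */
static void model_async_schedule(struct task_model *t)
{
	t->flags |= PF_USED_ASYNC;
}

typedef void (*model_init_fn)(struct task_model *t);

/*
 * Models do_init_module(): clear the flag, run the module's init function,
 * and synchronize only if init queued an async job. Returns true when the
 * full synchronization was performed.
 */
static bool model_do_init_module(struct task_model *t, model_init_fn init)
{
	assert(t && init);
	t->flags &= ~PF_USED_ASYNC;
	init(t);
	if (t->flags & PF_USED_ASYNC) {
		t->full_syncs++;
		return true;
	}
	return false;
}

/* An init function that queues nothing, like the iosched module in the report. */
static void init_plain(struct task_model *t)
{
	(void)t;
}

/* An init function that queues an async job, like a disk driver probing. */
static void init_async(struct task_model *t)
{
	model_async_schedule(t);
}
```

In this model, loading a module through init_plain skips the synchronization even if the flag was left set by earlier work in the same task, which corresponds to the iosched case that used to deadlock.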

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH] module, async: async_synchronize_full() on module init iff async is used
  2013-01-16  2:52                                       ` [PATCH] module, async: async_synchronize_full() on module init iff async is used Tejun Heo
@ 2013-01-16  3:00                                         ` Linus Torvalds
  2013-01-16  3:25                                           ` Tejun Heo
  2013-01-16  3:30                                         ` [PATCH] module, async: async_synchronize_full() on module init iff async is used Ming Lei
                                                           ` (3 subsequent siblings)
  4 siblings, 1 reply; 93+ messages in thread
From: Linus Torvalds @ 2013-01-16  3:00 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ming Lei, Alex Riesen, Alan Stern, Jens Axboe, USB list,
	Linux Kernel Mailing List, Arjan van de Ven, Rusty Russell

On Tue, Jan 15, 2013 at 6:52 PM, Tejun Heo <tj@kernel.org> wrote:
>
> It makes me feel dirty but makes the problem go away and I can't think
> of anything better, so here is the implementation of "used async"
> workaround.

Ok, people, can we get a tested-by (or "Nope, doesn't work") from the
people who saw this?

That said, maybe we could just make the rule be that you can't pick a
default IO scheduler that is modular.

And I *would* like to see the warning we discussed. Maybe there are
other situations that can trigger this?

Because something like that

    WARN_ON_ONCE(wait && i_am_async() && system_state == SYSTEM_RUNNING);

in kernel/kmod.c (__request_module()) still sounds like a good idea to
verify that this is the only thing that triggers it (of course, we'd
need to somehow avoid the warning for the known case with the known
workaround).

             Linus
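
The condition in the proposed warning can be sketched as a small user-space model. Everything here is an assumption for illustration: i_am_async() is only a placeholder name in the mail above, so the model takes a plain boolean instead, and the enum and struct are hypothetical stand-ins for system_state and for the one-shot bookkeeping of WARN_ON_ONCE().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical two-state stand-in for the kernel's system_state. */
enum model_system_state { MODEL_BOOTING, MODEL_RUNNING };

/* Remembers whether the one-shot warning has already been reported. */
struct model_warn_once {
	bool fired;
};

/*
 * Model of the proposed check in __request_module(). Returns true when a
 * warning would be reported: the caller waits for the module load, the
 * caller is itself an async job, and the system is past boot. The warning
 * is reported at most once.
 */
static bool model_request_module_warns(struct model_warn_once *w, bool wait,
				       bool caller_is_async,
				       enum model_system_state state)
{
	bool cond = wait && caller_is_async && state == MODEL_RUNNING;

	assert(w);
	if (cond && !w->fired) {
		w->fired = true;
		return true;
	}
	return false;
}
```

The model does not cover the open question from the mail, namely how to keep the known default-iosched case from tripping the warning.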

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: USB device cannot be reconnected and khubd "blocked for more than 120 seconds"
  2013-01-15 17:36                           ` Linus Torvalds
                                               ` (2 preceding siblings ...)
  2013-01-15 18:32                             ` Tejun Heo
@ 2013-01-16  3:05                             ` Ming Lei
  2013-01-16  4:14                               ` Linus Torvalds
  3 siblings, 1 reply; 93+ messages in thread
From: Ming Lei @ 2013-01-16  3:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Tejun Heo, Alex Riesen, Alan Stern, Jens Axboe, USB list,
	Linux Kernel Mailing List

On Wed, Jan 16, 2013 at 1:36 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Because it's not just sd.c that uses async_schedule(), and would need
> the async synchronize. It's floppy.c, it's generic scsi scanning (so
> scsi tapes etc), and it's libata-core.c.

As discussed previously, only a module that populates a device node
for user space from inside an async function may require the
synchronization, so that a script like the one below

                modprobe A
                mount /dev/XXX /mnt

keeps working; that should be the case from the original bug report:

           https://bugzilla.kernel.org/attachment.cgi?id=20937

For other modules, it looks like the synchronization isn't needed. There
are lots of other async things (work, kthread, ...) which are scheduled in
driver probe(), and no synchronization is added for them after the module
init() completes while loading the module. Do we need to add that sync
for all async things inside a loading module?

So it looks like only sd.c and floppy.c need to be synchronized, supposing
some sync interfaces are introduced, doesn't it?


Thanks
--
Ming Lei

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH] module, async: async_synchronize_full() on module init iff async is used
  2013-01-16  3:00                                         ` Linus Torvalds
@ 2013-01-16  3:25                                           ` Tejun Heo
  2013-01-16  3:37                                             ` Linus Torvalds
  0 siblings, 1 reply; 93+ messages in thread
From: Tejun Heo @ 2013-01-16  3:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ming Lei, Alex Riesen, Alan Stern, Jens Axboe, USB list,
	Linux Kernel Mailing List, Arjan van de Ven, Rusty Russell

Hello, Linus.

On Tue, Jan 15, 2013 at 07:00:31PM -0800, Linus Torvalds wrote:
> That said, maybe we could just make the rule be that you can't pick a
> default IO scheduler that is modular.

This is definitely preferable, but it would affect the use case
where everything is built modular and the elevator is selected via
kernel param.  This is way outside the usual usage and we can warn
about the new behavior but it still is an observable behavior change.
Do you think this would be okay?

> And I *would* like to see the warning we discussed. Maybe there are
> other situations that can trigger this?
> 
> Because something like that
> 
>     WARN_ON_ONCE(wait && i_am_async() && system_state == SYSTEM_RUNNING);
> 
> in kernel/kmod.c (__request_module()) still sounds like a good idea to
> verify that this is the only thing that triggers it (of course, we'd
> need to somehow avoid the warning for the known case with the known
> workaround).

And then this warning can be added without introducing
request_module_but_dont_warn_about_being_called_from_async().

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH] module, async: async_synchronize_full() on module init iff async is used
  2013-01-16  2:52                                       ` [PATCH] module, async: async_synchronize_full() on module init iff async is used Tejun Heo
  2013-01-16  3:00                                         ` Linus Torvalds
@ 2013-01-16  3:30                                         ` Ming Lei
  2013-01-16  4:24                                         ` Rusty Russell
                                                           ` (2 subsequent siblings)
  4 siblings, 0 replies; 93+ messages in thread
From: Ming Lei @ 2013-01-16  3:30 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Linus Torvalds, Alex Riesen, Alan Stern, Jens Axboe, USB list,
	Linux Kernel Mailing List, Arjan van de Ven, Rusty Russell

On Wed, Jan 16, 2013 at 10:52 AM, Tejun Heo <tj@kernel.org> wrote:
> If the default iosched is built as module, the kernel may deadlock
> while trying to load the iosched module on device probe if the probing
> was running off async.  This is because async_synchronize_full() at
> the end of module init ends up waiting for the async job which
> initiated the module loading.
>
>  async A                                modprobe
>
>  1. finds a device
>  2. registers the block device
>  3. request_module(default iosched)
>                                         4. modprobe in userland
>                                         5. load and init module
>                                         6. async_synchronize_full()
>
> Async A waits for modprobe to finish in request_module() and modprobe
> waits for async A to finish in async_synchronize_full().
>
> Because there's no easy way to track dependency once control goes out to
> userland, implementing properly nested flushing is difficult.  For
> now, make module init perform async_synchronize_full() iff module init
> has queued async jobs as suggested by Linus.
>
> This avoids the described deadlock because iosched module doesn't use
> async and thus wouldn't invoke async_synchronize_full().  This is
> hacky and incomplete.  It will deadlock if async module loading nests;
> however, this works around the known problem case and seems to be the
> best of bad options.
>
> For more details, please refer to the following thread.
>
>   http://thread.gmane.org/gmane.linux.kernel/1420814
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reported-by: Alex Riesen <raa.lkml@gmail.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> ---

It does look like this fixes the deadlock problem on my Pandaboard;
also, the SCSI disk device node (/dev/sdX) shows up right
after loading the 'sd_mod' module.

Tested-by: Ming Lei <ming.lei@canonical.com>

Thanks,
--
Ming Lei

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH] module, async: async_synchronize_full() on module init iff async is used
  2013-01-16  3:25                                           ` Tejun Heo
@ 2013-01-16  3:37                                             ` Linus Torvalds
  2013-01-16 16:22                                               ` Arjan van de Ven
  2013-01-16 16:48                                               ` Tejun Heo
  0 siblings, 2 replies; 93+ messages in thread
From: Linus Torvalds @ 2013-01-16  3:37 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ming Lei, Alex Riesen, Alan Stern, Jens Axboe, USB list,
	Linux Kernel Mailing List, Arjan van de Ven, Rusty Russell

On Tue, Jan 15, 2013 at 7:25 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello, Linus.
>
> On Tue, Jan 15, 2013 at 07:00:31PM -0800, Linus Torvalds wrote:
>> That said, maybe we could just make the rule be that you can't pick a
>> default IO scheduler that is modular.
>
> This is definitely preferable, but it would affect the use case
> where everything is built modular and the elevator is selected via
> kernel param.  This is way outside the usual usage and we can warn
> about the new behavior but it still is an observable behavior change.
> Do you think this would be okay?

I do want the same user-visible semantics, so it's not some one-liner.

The compiled-in elevator would be easy enough to handle in the Kconfig
file (maybe we do already, I didn't even bother to check). The real
problem is the "chosen_elevator" one, which is dynamic with the kernel
command line. And we could handle that one by just trying to load the
module early (but exactly _when_?) and then instead of looking things
up by name, just keep a pointer to the default elevator around.

But no, it's not just some trivial one-liner. Especially the question
about "when to try to load the module that is given on the kernel
command line" is not trivial. Do we require that the module is in the
initrd and loadable basically immediately at boot? Do we try again after
switching the root filesystem? Things like that..

> And then this warning can be added without introducing
> request_module_but_dont_warn_about_being_called_from_async().

I do agree that it would be much nicer that way.

               Linus

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: USB device cannot be reconnected and khubd "blocked for more than 120 seconds"
  2013-01-16  0:35                                     ` Tejun Heo
@ 2013-01-16  4:01                                       ` Alan Stern
  2013-01-16 16:12                                         ` Tejun Heo
  0 siblings, 1 reply; 93+ messages in thread
From: Alan Stern @ 2013-01-16  4:01 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Arjan van de Ven, Linus Torvalds, Ming Lei, Alex Riesen,
	Jens Axboe, USB list, Linux Kernel Mailing List

On Tue, 15 Jan 2013, Tejun Heo wrote:

> Hello, Arjan.
> 
> On Tue, Jan 15, 2013 at 04:25:54PM -0800, Arjan van de Ven wrote:
> > async fundamentally had the concept of a monotonic increasing number,
> > and that you could always wait for "everyone before me".
> > then people (like me) wanted exceptions to what "everyone" means ;-(
> > I'm ok with going back to a single space and simplify the world.
> 
> If we want (or need) finer grained operation, we'll probably have to
> head the other direction, so that we can definitively tell that an
> async operation belongs to domains system, module load A and B, so
> that each waiter knows what to wait for.
> 
> The current domain implementation is somewhere in between.  It's not
> a completely simplistic system and at the same time not developed enough
> to do properly stacked flushing.

I like your idea of chronological synchronization: Insist that anybody
who wants to flush async jobs must get a cookie, and then only allow
them to wait for async jobs started after the cookie was issued.

I don't know if this is possible with the current implementation.  It 
would require changing every call to async_synchronize_*(), and in a 
nontrivial way.  But it might provide a proper solution to all these 
problems.

Can you think of any reasons why it wouldn't work in principle?  It 
would prevent code from doing "wait until all currently-running async 
jobs have finished" -- but arguably, nobody should be allowed to do 
that anyway.

Alan Stern
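
The chronological rule proposed above can be sketched as a small user-space model. This is an illustration under stated assumptions, not the kernel's async implementation: struct async_model, model_get_cookie(), model_start_job() and model_must_wait() are hypothetical names, and "waiting" is reduced to a predicate that says whether a waiter holding a given cookie would still have to block.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned long long cookie_t;

#define MODEL_MAX_JOBS 16

struct job_model {
	cookie_t cookie;	/* assigned when the job starts */
	bool done;
};

struct async_model {
	cookie_t next_cookie;	/* monotonically increasing */
	struct job_model jobs[MODEL_MAX_JOBS];
	size_t njobs;
};

/* Hand out a cookie marking "now"; used by job start and by would-be waiters. */
static cookie_t model_get_cookie(struct async_model *m)
{
	return m->next_cookie++;
}

/* Start a job; it receives its own cookie. Returns the job's index. */
static size_t model_start_job(struct async_model *m)
{
	assert(m->njobs < MODEL_MAX_JOBS);
	m->jobs[m->njobs].cookie = model_get_cookie(m);
	m->jobs[m->njobs].done = false;
	return m->njobs++;
}

/*
 * A waiter holding cookie `since` may only wait for unfinished jobs that
 * started after that cookie was issued. Returns true if such a job exists.
 */
static bool model_must_wait(const struct async_model *m, cookie_t since)
{
	for (size_t i = 0; i < m->njobs; i++)
		if (!m->jobs[i].done && m->jobs[i].cookie > since)
			return true;
	return false;
}
```

With two jobs A and B started in that order, the model lets A wait for B but never makes B wait for A, which is the property that would rule out the circular wait seen with request_module().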



^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: USB device cannot be reconnected and khubd "blocked for more than 120 seconds"
  2013-01-16  3:05                             ` USB device cannot be reconnected and khubd "blocked for more than 120 seconds" Ming Lei
@ 2013-01-16  4:14                               ` Linus Torvalds
  0 siblings, 0 replies; 93+ messages in thread
From: Linus Torvalds @ 2013-01-16  4:14 UTC (permalink / raw)
  To: Ming Lei
  Cc: Tejun Heo, Alex Riesen, Alan Stern, Jens Axboe, USB list,
	Linux Kernel Mailing List

On Tue, Jan 15, 2013 at 7:05 PM, Ming Lei <ming.lei@canonical.com> wrote:
>
> So it looks like only sd.c and floppy.c need to be synchronized, supposing
> some sync interfaces are introduced, doesn't it?

What about ata_host_register() (usually called through ata_host_activate())?

I don't understand why you continue to push for something fragile
where you have to get things right in the driver, when it clearly is
very fragile indeed, as now shown *twice* by how you seem to have
missed some potential case.

This is *exactly* why I NAK'ed the patch, and said it has to be
handled automatically (or at least default to the safe model, not the
unsafe one).

We do have the automatic patch now. Admittedly it's not wonderful, and
I agreed when Tejun called it slightly ugly, but at least it does
these things automatically without humans having to go through these
cases one by one and having to get them right. So please just stop
pushing this "manual marking" thing. It's fundamentally flawed and
broken.

             Linus

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH] module, async: async_synchronize_full() on module init iff async is used
  2013-01-16  2:52                                       ` [PATCH] module, async: async_synchronize_full() on module init iff async is used Tejun Heo
  2013-01-16  3:00                                         ` Linus Torvalds
  2013-01-16  3:30                                         ` [PATCH] module, async: async_synchronize_full() on module init iff async is used Ming Lei
@ 2013-01-16  4:24                                         ` Rusty Russell
  2013-01-16 11:36                                         ` Alex Riesen
  2013-08-12  7:04                                         ` [3.8-rc3 -> 3.8-rc4 regression] " Jonathan Nieder
  4 siblings, 0 replies; 93+ messages in thread
From: Rusty Russell @ 2013-01-16  4:24 UTC (permalink / raw)
  To: Tejun Heo, Linus Torvalds
  Cc: Ming Lei, Alex Riesen, Alan Stern, Jens Axboe, USB list,
	Linux Kernel Mailing List, Arjan van de Ven

Tejun Heo <tj@kernel.org> writes:
> --- a/kernel/module.c
> +++ b/kernel/module.c
> @@ -3058,8 +3064,25 @@ static int do_init_module(struct module
>  	blocking_notifier_call_chain(&module_notify_list,
>  				     MODULE_STATE_LIVE, mod);
>  
> -	/* We need to finish all async code before the module init sequence is done */
> -	async_synchronize_full();

Linus put async_synchronize_full() here as a fix but beware: you can
start using the module before this call.  Normally the potential caller
is the one requesting the module load so it works, but if we get more
async stuff we may land in that hole.

Changing every caller of any async-initializing service is not going to
be pretty, but maybe put an async_cookie_t in struct module for
module_init to use, and sync it in try_module_get()?  Which would now
need a can_sleep flag... but the result would be more async.

Cheers,
Rusty.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH] module, async: async_synchronize_full() on module init iff async is used
  2013-01-16  2:52                                       ` [PATCH] module, async: async_synchronize_full() on module init iff async is used Tejun Heo
                                                           ` (2 preceding siblings ...)
  2013-01-16  4:24                                         ` Rusty Russell
@ 2013-01-16 11:36                                         ` Alex Riesen
  2013-08-12  7:04                                         ` [3.8-rc3 -> 3.8-rc4 regression] " Jonathan Nieder
  4 siblings, 0 replies; 93+ messages in thread
From: Alex Riesen @ 2013-01-16 11:36 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Linus Torvalds, Ming Lei, Alan Stern, Jens Axboe, USB list,
	Linux Kernel Mailing List, Arjan van de Ven, Rusty Russell

On Wed, Jan 16, 2013 at 3:52 AM, Tejun Heo <tj@kernel.org> wrote:
> This avoids the described deadlock because iosched module doesn't use
> async and thus wouldn't invoke async_synchronize_full().  This is
> hacky and incomplete.  It will deadlock if async module loading nests;
> however, this works around the known problem case and seems to be the
> best of bad options.

I confirm it fixes the original problem.

Tested-by: Alex Riesen <raa.lkml@gmail.com>

[   27.594951] hub 1-1:1.0: state 7 ports 6 chg 0000 evt 0004
[   27.595245] hub 1-1:1.0: port 2, status 0101, change 0001, 12 Mb/s
[   27.698995] hub 1-1:1.0: debounce: port 2: total 100ms stable 100ms status 0x101
[   27.709977] hub 1-1:1.0: port 2 not reset yet, waiting 10ms
[   27.771888] usb 1-1.2: new high-speed USB device number 3 using ehci-pci
[   27.782871] hub 1-1:1.0: port 2 not reset yet, waiting 10ms
[   27.857503] usb 1-1.2: default language 0x0409
[   27.858248] usb 1-1.2: udev 3, busnum 1, minor = 2
[   27.858258] usb 1-1.2: New USB device found, idVendor=0781, idProduct=5408
[   27.858263] usb 1-1.2: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[   27.858267] usb 1-1.2: Product: U3 Titanium
[   27.858271] usb 1-1.2: Manufacturer: SanDisk Corporation
[   27.858275] usb 1-1.2: SerialNumber: 0000187A3A60F1E9
[   27.858800] usb 1-1.2: usb_probe_device
[   27.858815] usb 1-1.2: configuration #1 chosen from 1 choice
[   27.858940] usb 1-1.2: adding 1-1.2:1.0 (config #1, interface 0)
[   27.859246] usb-storage 1-1.2:1.0: usb_probe_interface
[   27.859258] usb-storage 1-1.2:1.0: usb_probe_interface - got id
[   27.859516] scsi6 : usb-storage 1-1.2:1.0
[   28.865771] io scheduler deadline registered (default)
[   28.866705] scsi 6:0:0:0: Direct-Access     SanDisk  U3 Titanium  2.18 PQ: 0 ANSI: 2
[   28.869483] sd 6:0:0:0: Attached scsi generic sg1 type 0
[   28.869700] sd 6:0:0:0: [sdb] 4013713 512-byte logical blocks: (2.05 GB/1.91 GiB)
[   28.870197] sd 6:0:0:0: [sdb] Write Protect is off
[   28.870204] sd 6:0:0:0: [sdb] Mode Sense: 03 00 00 00
[   28.870692] sd 6:0:0:0: [sdb] No Caching mode page present
[   28.870697] sd 6:0:0:0: [sdb] Assuming drive cache: write through
[   28.873565] sd 6:0:0:0: [sdb] No Caching mode page present
[   28.873575] sd 6:0:0:0: [sdb] Assuming drive cache: write through
[   28.883895]  sdb: sdb1
[   28.887775] sd 6:0:0:0: [sdb] No Caching mode page present
[   28.887783] sd 6:0:0:0: [sdb] Assuming drive cache: write through
[   28.887789] sd 6:0:0:0: [sdb] Attached SCSI removable disk

The filesystem can be mounted and files can be read.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: USB device cannot be reconnected and khubd "blocked for more than 120 seconds"
  2013-01-16  4:01                                       ` Alan Stern
@ 2013-01-16 16:12                                         ` Tejun Heo
  2013-01-16 17:01                                           ` Alan Stern
  0 siblings, 1 reply; 93+ messages in thread
From: Tejun Heo @ 2013-01-16 16:12 UTC (permalink / raw)
  To: Alan Stern
  Cc: Arjan van de Ven, Linus Torvalds, Ming Lei, Alex Riesen,
	Jens Axboe, USB list, Linux Kernel Mailing List

Hello, Alan.

On Tue, Jan 15, 2013 at 11:01:15PM -0500, Alan Stern wrote:
> > The current domain implementation is somewhere in between.  It's not
> > a completely simplistic system and at the same time not developed enough
> > to do properly stacked flushing.
> 
> I like your idea of chronological synchronization: Insist that anybody
> who wants to flush async jobs must get a cookie, and then only allow
> them to wait for async jobs started after the cookie was issued.
> 
> I don't know if this is possible with the current implementation.  It 
> would require changing every call to async_synchronize_*(), and in a 
> nontrivial way.  But it might provide a proper solution to all these 
> problems.

The problem here is that "flush everything which comes before me" is
used to order async jobs.  e.g. after async jobs probe the hardware
they order themselves by flushing before registering them, so unless
we build accurate flushing dependencies, those dependencies will reach
beyond the time window we're interested in and bring in deadlocks.

And, as Linus pointed out, tracking dependency through
request_module() is tricky no matter what we do.  I think it can be
done by matching the ones calling request_module() and the ones
actually loading modules but it's gonna be nasty.

There aren't too many users of async anyway, so changing stuff
shouldn't be too difficult, but I think the simplicity or dumbness is
one of the major attractions of async, so it'd be nice to keep things that
way and the PF_USED_ASYNC hack seems to be able to hold things
together for now.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH] module, async: async_synchronize_full() on module init iff async is used
  2013-01-16  3:37                                             ` Linus Torvalds
@ 2013-01-16 16:22                                               ` Arjan van de Ven
  2013-01-16 16:48                                               ` Tejun Heo
  1 sibling, 0 replies; 93+ messages in thread
From: Arjan van de Ven @ 2013-01-16 16:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Tejun Heo, Ming Lei, Alex Riesen, Alan Stern, Jens Axboe,
	USB list, Linux Kernel Mailing List, Rusty Russell


> But no, it's not just some trivial one-liner. Especially the question
> about "when to try to load the module that is given on the kernel
> command line" is not trivial. Do we require that the module is in the
> initrd and loadable basically immediately at boot? Do we try again after
> switching the root filesystem? Things like that..

to load it from the root fs you tend to need... an elevator ;-)


I think it's pretty fair to users to say that if you want something by default at boot time,
you need to build it in...
but for us to try a modprobe from the initrd is not too bad I suppose.
probably need to do this around the time we initialize the block layer


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH] module, async: async_synchronize_full() on module init iff async is used
  2013-01-16  3:37                                             ` Linus Torvalds
  2013-01-16 16:22                                               ` Arjan van de Ven
@ 2013-01-16 16:48                                               ` Tejun Heo
  2013-01-16 17:03                                                 ` Arjan van de Ven
  1 sibling, 1 reply; 93+ messages in thread
From: Tejun Heo @ 2013-01-16 16:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ming Lei, Alex Riesen, Alan Stern, Jens Axboe, USB list,
	Linux Kernel Mailing List, Arjan van de Ven, Rusty Russell

Hey,

On Tue, Jan 15, 2013 at 07:37:42PM -0800, Linus Torvalds wrote:
> I do want the same user-visible semantics, so it's not some one-liner.
> 
> The compiled-in elevator would be easy enough to handle in the Kconfig
> file (maybe we do already, I didn't even bother to check). The real
> problem is the "chosen_elevator" one, which is dynamic with the kernel
> command line. And we could handle that one by just trying to load the
> module early (but exactly _when_?) and then instead of looking things
> up by name, just keep a pointer to the default elevator around.
> 
> But no, it's not just some trivial one-liner. Especially the question
> about "when to try to load the module that is given on the kernel
> command line" is not trivial. Do we require that the module is in the
> initrd and loadable basically immediately at boot? Do we try again after
> switching the root filesystem? Things like that..

If the current user-visible semantics is defined as "the kernel shall
try to load the default iosched if not already available on each block
device discovery", nothing can be changed ever, but I'm not sure it
needs to be pushed that far.

As Arjan suggested, trying to load the default modules right after the
initial rootfs mount could be an acceptable compromise and it would be
really nice (for both code sanity and avoiding future problems) to be
able to declare module loading nested inside async unsupported.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: USB device cannot be reconnected and khubd "blocked for more than 120 seconds"
  2013-01-16 16:12                                         ` Tejun Heo
@ 2013-01-16 17:01                                           ` Alan Stern
  2013-01-16 17:37                                             ` Tejun Heo
  0 siblings, 1 reply; 93+ messages in thread
From: Alan Stern @ 2013-01-16 17:01 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Arjan van de Ven, Linus Torvalds, Ming Lei, Alex Riesen,
	Jens Axboe, USB list, Linux Kernel Mailing List

On Wed, 16 Jan 2013, Tejun Heo wrote:

> Hello, Alan.
> 
> On Tue, Jan 15, 2013 at 11:01:15PM -0500, Alan Stern wrote:
> > > The current domain implementation is somewhere in between.  It's not
> > > a completely simplistic system and at the same time not developed enough
> > > to do properly stacked flushing.
> > 
> > I like your idea of chronological synchronization: Insist that anybody
> > who wants to flush async jobs must get a cookie, and then only allow
> > them to wait for async jobs started after the cookie was issued.
> > 
> > I don't know if this is possible with the current implementation.  It 
> > would require changing every call to async_synchronize_*(), and in a 
> > nontrivial way.  But it might provide a proper solution to all these 
> > problems.
> 
> The problem here is that "flush everything which comes before me" is
> used to order async jobs.  e.g. after async jobs probe the hardware
> they order themselves by flushing before registering them, so unless

I don't fully understand this example.  What is the point -- to make 
sure that asynchronously probed devices are registered in the order of 
their discovery?

If so, here's how to do it safely: Start up the async jobs in reverse
order of discovery.  Have each job acquire a cookie when it starts.  
Then each job needs to wait only for tasks that started after its
cookie was issued.

> we build accurate flushing dependencies, those dependencies will reach
> beyond the time window we're interested in and bring in deadlocks.

The flushing-dependency principle can be very simple: No async task
should ever have to wait for another async task that started before it.  
The "cookie" approach satisfies this requirement (unless an earlier 
task passes its cookie to a later task or subverts the mechanism in 
another way).

> And, as Linus pointed out, tracking dependency through
> request_module() is tricky no matter what we do.  I think it can be
> done by matching the ones calling request_module() and the ones
> actually loading modules but it's gonna be nasty.

This shouldn't matter.  Dependencies don't need to be tracked
explicitly, because we know that any async work done by
request_module() must start _after_ request_module() is called.  Thus,
if async task A calls request_module(), which starts up async task B,
then we know that A can safely wait for B and B cannot safely wait for
A.

> There aren't too many users of async anyway, so changing stuff
> shouldn't be too difficult, but I think the simplicity or dumbness is
> one of the major attractions of async, so it'd be nice to keep things that
> way and the PF_USED_ASYNC hack seems to be able to hold things
> together for now.

Nesting won't matter for the chronological approach.  I really think 
you should consider it more fully.  It's not a hack, and it doesn't 
need to be complicated.

Alan Stern


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH] module, async: async_synchronize_full() on module init iff async is used
  2013-01-16 16:48                                               ` Tejun Heo
@ 2013-01-16 17:03                                                 ` Arjan van de Ven
  2013-01-16 17:06                                                   ` Linus Torvalds
  0 siblings, 1 reply; 93+ messages in thread
From: Arjan van de Ven @ 2013-01-16 17:03 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Linus Torvalds, Ming Lei, Alex Riesen, Alan Stern, Jens Axboe,
	USB list, Linux Kernel Mailing List, Rusty Russell


> As Arjan suggested, trying to load the default modules right after the
> initial rootfs mount could be an acceptable compromise and it would be
> really nice (for both code sanity and avoiding future problems) to be
> able to declare module loading nested inside async unsupported.

we can even try twice

the first time right after we mount the initramfs
the second time when the initramfs code exits, and before we exec init
(the initramfs supposedly mounted the real root fs at this point)

if you want your elevator to apply to your root filesystem storage, the rule
will then be "put the module in the initramfs"... but to be honest,
that's not a restriction that is unreasonable or unexpected.


for doing a module loading from inside an async handler..we can then just make
use of the normal "load this module async" way of requesting a module.


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH] module, async: async_synchronize_full() on module init iff async is used
  2013-01-16 17:03                                                 ` Arjan van de Ven
@ 2013-01-16 17:06                                                   ` Linus Torvalds
  2013-01-16 21:30                                                     ` [PATCH 1/2] init, block: try to load default elevator module early during boot Tejun Heo
  2013-01-16 21:31                                                     ` [PATCH 2/2] block: don't request module during elevator init Tejun Heo
  0 siblings, 2 replies; 93+ messages in thread
From: Linus Torvalds @ 2013-01-16 17:06 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Tejun Heo, Ming Lei, Alex Riesen, Alan Stern, Jens Axboe,
	USB list, Linux Kernel Mailing List, Rusty Russell

On Wed, Jan 16, 2013 at 9:03 AM, Arjan van de Ven <arjan@linux.intel.com> wrote:
>
> we can even try twice
>
> the first time right after we mount the initramfs
> the second time when the initramfs code exits, and before we exec init
> (the initramfs supposedly mounted the real root fs at this point)

Yes. This, together with "don't try request_module for the default
elevator", and the "warn if somebody does request_module from async
context" would, I think, be the right thing to do.

In the meantime, I've applied Tejun's patch. It possibly speeds things
up regardless of this particular deadlock thing, and while it's not
pretty it certainly isn't horribly nasty or very invasive either, so I
don't see any reason to delay it just because there might be a better
solution some day.

                  Linus

* [PATCH] async: fix __lowest_in_progress()
  2013-01-15 18:32                             ` Tejun Heo
  2013-01-15 20:18                               ` Linus Torvalds
@ 2013-01-16 17:19                               ` Tejun Heo
  2013-01-17 18:16                                 ` Linus Torvalds
  2013-01-23  0:15                                 ` [PATCH v2] " Tejun Heo
  1 sibling, 2 replies; 93+ messages in thread
From: Tejun Heo @ 2013-01-16 17:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ming Lei, Alex Riesen, Alan Stern, Jens Axboe, USB list,
	Linux Kernel Mailing List, Arjan van de Ven

083b804c4d3e1e3d0eace56bdbc0f674946d2847 ("async: use workqueue for
worker pool") made it possible that async jobs are moved from pending
to running out-of-order.  While pending async jobs will be queued and
dispatched for execution in the same order, nothing guarantees they'll
enter "1) move self to the running queue" of async_run_entry_fn() in
the same order.

This broke __lowest_in_progress().  running->domain may not be
properly sorted and, when not empty, is not guaranteed to contain
lower cookies than the pending list.  Fix it by sort-inserting into
the running list and always looking at both the pending and running
lists when determining the lowest cookie.

Over time, the async synchronization implementation became quite
messy.  We'd better restructure it such that each async_entry is
linked to two lists - one global and one per domain - and is not moved
when execution starts.  There's no reason to distinguish pending and
running.  They behave the same for synchronization purposes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: stable@vger.kernel.org
---
And here's the fix for the breakage I mentioned earlier.  It wouldn't
happen often in the wild and the effect of it happening wouldn't be
critical for modern distros but it's still kinda surprising nobody
noticed this.

We definitely need to rewrite async synchronization.  It was already
messy and this makes it worse and there's no reason to be messy here.

Thanks.

 kernel/async.c |   27 ++++++++++++++++++++-------
 1 file changed, 20 insertions(+), 7 deletions(-)

--- a/kernel/async.c
+++ b/kernel/async.c
@@ -86,18 +86,27 @@ static atomic_t entry_count;
  */
 static async_cookie_t  __lowest_in_progress(struct async_domain *running)
 {
+	async_cookie_t first_running = next_cookie;	/* infinity value */
+	async_cookie_t first_pending = next_cookie;	/* ditto */
 	struct async_entry *entry;
 
+	/*
+	 * Both running and pending lists are sorted but not disjoint.
+	 * Take the first cookies from both and return the min.
+	 */
 	if (!list_empty(&running->domain)) {
 		entry = list_first_entry(&running->domain, typeof(*entry), list);
-		return entry->cookie;
+		first_running = entry->cookie;
 	}
 
-	list_for_each_entry(entry, &async_pending, list)
-		if (entry->running == running)
-			return entry->cookie;
+	list_for_each_entry(entry, &async_pending, list) {
+		if (entry->running == running) {
+			first_pending = entry->cookie;
+			break;
+		}
+	}
 
-	return next_cookie;	/* "infinity" value */
+	return min(first_running, first_pending);
 }
 
 static async_cookie_t  lowest_in_progress(struct async_domain *running)
@@ -118,13 +127,17 @@ static void async_run_entry_fn(struct wo
 {
 	struct async_entry *entry =
 		container_of(work, struct async_entry, work);
+	struct async_entry *pos;
 	unsigned long flags;
 	ktime_t uninitialized_var(calltime), delta, rettime;
 	struct async_domain *running = entry->running;
 
-	/* 1) move self to the running queue */
+	/* 1) move self to the running queue, make sure it stays sorted */
 	spin_lock_irqsave(&async_lock, flags);
-	list_move_tail(&entry->list, &running->domain);
+	list_for_each_entry_reverse(pos, &running->domain, list)
+		if (entry->cookie < pos->cookie)
+			break;
+	list_move_tail(&entry->list, &pos->list);
 	spin_unlock_irqrestore(&async_lock, flags);
 
 	/* 2) run (and print duration) */
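
Editorial sketch, not part of the posted patch: a small user-space
model of the two ideas in the fix above, namely keeping the running
list sorted on insertion and returning the minimum of the first running
and first pending cookies.  The array-backed list and the names
cookie_list and sorted_insert() are assumptions made for illustration;
the kernel code uses linked lists under async_lock.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long cookie_t;

#define MAX_ENTRIES 16

/* Hypothetical array-backed list kept in ascending cookie order. */
struct cookie_list {
	cookie_t cookies[MAX_ENTRIES];
	size_t len;
};

/* Insert so the list stays sorted, whatever order the callers arrive in. */
static void sorted_insert(struct cookie_list *l, cookie_t c)
{
	size_t i = l->len;

	assert(l->len < MAX_ENTRIES);
	/* Walk from the tail, shifting larger cookies up by one slot. */
	while (i > 0 && l->cookies[i - 1] > c) {
		l->cookies[i] = l->cookies[i - 1];
		i--;
	}
	l->cookies[i] = c;
	l->len++;
}

/* Lowest cookie still in flight; next_cookie acts as "infinity". */
static cookie_t lowest_in_progress(const struct cookie_list *running,
				   const struct cookie_list *pending,
				   cookie_t next_cookie)
{
	cookie_t first_running = next_cookie;
	cookie_t first_pending = next_cookie;

	if (running->len)
		first_running = running->cookies[0];
	if (pending->len)
		first_pending = pending->cookies[0];
	return first_running < first_pending ? first_running : first_pending;
}
```

With cookies 5 and 3 reaching the running list in that order while
cookie 2 is still pending, a plain tail insert plus "first running
entry wins" would report 5; the sketch reports 2.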

* Re: USB device cannot be reconnected and khubd "blocked for more than 120 seconds"
  2013-01-16 17:01                                           ` Alan Stern
@ 2013-01-16 17:37                                             ` Tejun Heo
  2013-01-16 17:51                                               ` Alan Stern
  0 siblings, 1 reply; 93+ messages in thread
From: Tejun Heo @ 2013-01-16 17:37 UTC (permalink / raw)
  To: Alan Stern
  Cc: Arjan van de Ven, Linus Torvalds, Ming Lei, Alex Riesen,
	Jens Axboe, USB list, Linux Kernel Mailing List

Hello, Alan.

On Wed, Jan 16, 2013 at 12:01:53PM -0500, Alan Stern wrote:
> > The problem here is that "flush everything which comes before me" is
> > used to order async jobs.  e.g. after async jobs probe the hardware
> > they order themselves by flushing before registering them, so unless
> 
> I don't fully understand this example.  What is the point -- to make 
> sure that asynchronously probed devices are registered in the order of 
> their discovery?

People still want devices to be numbered according to their physical
ports and so on, so we keep the registration order the same as the
natural (whatever that means) hardware order.

> If so, here's how to do it safely: Start up the async jobs in reverse
> order of discovery.  Have each job acquire a cookie when it starts.  
> Then each job needs to wait only for tasks that started after its
> cookie was issued.

It's a bit clumsy but yeah I guess it could work.

> > There aren't too many which use async anyway so changing stuff
> > shouldn't be too difficult but I think the simplicity or dumbness is
> > one of major attractions of async, so it'd be nice to keep things that
> > way and the PF_USED_ASYNC hack seems to be able to hold things
> > together for now.
> 
> Nesting won't matter for the chronological approach.  I really think 
> you should consider it more fully.  It's not a hack, and it doesn't 
> need to be complicated.

There is a benefit to the current dumb implementation in that drivers
can use it without thinking too much, but yeah it could be that the
flushing range limit isn't too much of a restriction on top.  I don't
know.  At this point, I'd prefer to remove request_module() from
elevator init path for the problem at hand.  If we need something more
involved, changing cookie usage rules definitely seems like an option.

Thanks.

-- 
tejun
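
Editorial sketch, not from the thread participants: one way to read the
"chronological" ordering Alan describes in the quoted text above.  Jobs
are started in reverse order of discovery, each takes a cookie when it
starts, and a job waits only for jobs whose cookies were issued after
its own.  Everything here (struct job, start_jobs(), may_register(),
run_jobs()) is a hypothetical single-threaded model; it shows only the
resulting registration order, not real concurrency.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NJOBS 4

struct job {
	int device;		/* discovery index, 0 = found first */
	unsigned long cookie;	/* issued when the job starts */
	bool registered;
};

/* Start jobs in reverse discovery order, handing out increasing cookies. */
static void start_jobs(struct job jobs[NJOBS])
{
	unsigned long next_cookie = 1;

	for (int dev = NJOBS - 1; dev >= 0; dev--) {
		jobs[dev].device = dev;
		jobs[dev].cookie = next_cookie++;
		jobs[dev].registered = false;
	}
}

/* A job may register once every job started after it has registered. */
static bool may_register(const struct job jobs[NJOBS], const struct job *j)
{
	for (size_t i = 0; i < NJOBS; i++)
		if (jobs[i].cookie > j->cookie && !jobs[i].registered)
			return false;
	return true;
}

/*
 * Poll the jobs, last-discovered first, until all have registered,
 * recording the order in which registration actually happens.
 */
static void run_jobs(struct job jobs[NJOBS], int order[NJOBS])
{
	size_t done = 0;

	assert(NJOBS > 0);
	while (done < NJOBS) {
		for (int i = NJOBS - 1; i >= 0; i--) {
			if (!jobs[i].registered && may_register(jobs, &jobs[i])) {
				jobs[i].registered = true;
				order[done++] = jobs[i].device;
			}
		}
	}
}
```

Even though the polling loop favours the last-discovered device,
registration comes out in discovery order, because the first-discovered
device holds the highest cookie and therefore waits for nobody.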

* Re: USB device cannot be reconnected and khubd "blocked for more than 120 seconds"
  2013-01-16 17:37                                             ` Tejun Heo
@ 2013-01-16 17:51                                               ` Alan Stern
  0 siblings, 0 replies; 93+ messages in thread
From: Alan Stern @ 2013-01-16 17:51 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Arjan van de Ven, Linus Torvalds, Ming Lei, Alex Riesen,
	Jens Axboe, USB list, Linux Kernel Mailing List

On Wed, 16 Jan 2013, Tejun Heo wrote:

> Hello, Alan.
> 
> On Wed, Jan 16, 2013 at 12:01:53PM -0500, Alan Stern wrote:
> > > The problem here is that "flush everything which comes before me" is
> > > used to order async jobs.  e.g. after async jobs probe the hardware
> > > they order themselves by flushing before registering them, so unless
> > 
> > I don't fully understand this example.  What is the point -- to make 
> > sure that asynchronously probed devices are registered in the order of 
> > their discovery?
> 
> People still want devices to be numbered according to their physical
> ports and so on, so we keep the registration order the same as the
> natural (whatever that means) hardware order.
> 
> > If so, here's how to do it safely: Start up the async jobs in reverse
> > order of discovery.  Have each job acquire a cookie when it starts.  
> > Then each job needs to wait only for tasks that started after its
> > cookie was issued.
> 
> It's a bit clumsy but yeah I guess it could work.
> 
> > > There aren't too many which use async anyway so changing stuff
> > > shouldn't be too difficult but I think the simplicity or dumbness is
> > > one of major attractions of async, so it'd be nice to keep things that
> > > way and the PF_USED_ASYNC hack seems to be able to hold things
> > > together for now.
> > 
> > Nesting won't matter for the chronological approach.  I really think 
> > you should consider it more fully.  It's not a hack, and it doesn't 
> > need to be complicated.
> 
> There is a benefit to the current dumb implementation in that drivers
> can use it without thinking too much, but yeah it could be that the
> flushing range limit isn't too much of a restriction on top.  I don't
> know.  At this point, I'd prefer to remove request_module() from
> elevator init path for the problem at hand.  If we need something more
> involved, changing cookie usage rules definitely seems like an option.

A simpler approach might be to leave the existing synchronization 
mechanisms as they are, and use the chronological approach only for the 
case of loading a module (or wherever else someone wants to use it).

Alan Stern


* [PATCH 1/2] init, block: try to load default elevator module early during boot
  2013-01-16 17:06                                                   ` Linus Torvalds
@ 2013-01-16 21:30                                                     ` Tejun Heo
  2013-01-17 18:05                                                       ` Linus Torvalds
  2013-01-23  0:53                                                       ` [PATCH v2 1/2] init, block: try to load default elevator module early during boot Tejun Heo
  2013-01-16 21:31                                                     ` [PATCH 2/2] block: don't request module during elevator init Tejun Heo
  1 sibling, 2 replies; 93+ messages in thread
From: Tejun Heo @ 2013-01-16 21:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Arjan van de Ven, Ming Lei, Alex Riesen, Alan Stern, Jens Axboe,
	USB list, Linux Kernel Mailing List, Rusty Russell

This patch adds default module loading and uses it to load the default
block elevator.  During boot, it's called right after initramfs or
initrd is made available and right before control is passed to
userland.  This ensures that as long as the modules are available in
the usual places in initramfs, initrd or the root filesystem, the
default modules are loaded as soon as possible.

This will replace the on-demand elevator module loading from elevator
init path.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Alex Riesen <raa.lkml@gmail.com>
---
 block/elevator.c         |   16 ++++++++++++++++
 include/linux/elevator.h |    1 +
 include/linux/init.h     |    1 +
 init/do_mounts_initrd.c  |    3 +++
 init/initramfs.c         |    8 +++++++-
 init/main.c              |   16 ++++++++++++++++
 6 files changed, 44 insertions(+), 1 deletion(-)

--- a/block/elevator.c
+++ b/block/elevator.c
@@ -136,6 +136,22 @@ static int __init elevator_setup(char *s
 
 __setup("elevator=", elevator_setup);
 
+/* called during boot to load the elevator chosen by the elevator param */
+void __init load_default_elevator_module(void)
+{
+	struct elevator_type *e;
+
+	if (!chosen_elevator[0])
+		return;
+
+	spin_lock(&elv_list_lock);
+	e = elevator_find(chosen_elevator);
+	spin_unlock(&elv_list_lock);
+
+	if (!e)
+		request_module("%s-iosched", chosen_elevator);
+}
+
 static struct kobj_type elv_ktype;
 
 static struct elevator_queue *elevator_alloc(struct request_queue *q,
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -138,6 +138,7 @@ extern void elv_drain_elevator(struct re
 /*
  * io scheduler registration
  */
+extern void __init load_default_elevator_module(void);
 extern int elv_register(struct elevator_type *);
 extern void elv_unregister(struct elevator_type *);
 
--- a/include/linux/init.h
+++ b/include/linux/init.h
@@ -161,6 +161,7 @@ extern unsigned int reset_devices;
 /* used by init/main.c */
 void setup_arch(char **);
 void prepare_namespace(void);
+void __init load_default_modules(void);
 
 extern void (*late_time_init)(void);
 
--- a/init/do_mounts_initrd.c
+++ b/init/do_mounts_initrd.c
@@ -57,6 +57,9 @@ static void __init handle_initrd(void)
 	sys_mkdir("/old", 0700);
 	sys_chdir("/old");
 
+	/* try loading default modules from initrd */
+	load_default_modules();
+
 	/*
 	 * In case that a resume from disk is carried out by linuxrc or one of
 	 * its children, we need to tell the freezer not to wait for us.
--- a/init/initramfs.c
+++ b/init/initramfs.c
@@ -592,7 +592,7 @@ static int __init populate_rootfs(void)
 			initrd_end - initrd_start);
 		if (!err) {
 			free_initrd();
-			return 0;
+			goto done;
 		} else {
 			clean_rootfs();
 			unpack_to_rootfs(__initramfs_start, __initramfs_size);
@@ -607,6 +607,7 @@ static int __init populate_rootfs(void)
 			sys_close(fd);
 			free_initrd();
 		}
+	done:
 #else
 		printk(KERN_INFO "Unpacking initramfs...\n");
 		err = unpack_to_rootfs((char *)initrd_start,
@@ -615,6 +616,11 @@ static int __init populate_rootfs(void)
 			printk(KERN_EMERG "Initramfs unpacking failed: %s\n", err);
 		free_initrd();
 #endif
+		/*
+		 * Try loading default modules from initramfs.  This gives
+		 * us a chance to load before device_initcalls.
+		 */
+		load_default_modules();
 	}
 	return 0;
 }
--- a/init/main.c
+++ b/init/main.c
@@ -70,6 +70,8 @@
 #include <linux/perf_event.h>
 #include <linux/file.h>
 #include <linux/ptrace.h>
+#include <linux/blkdev.h>
+#include <linux/elevator.h>
 
 #include <asm/io.h>
 #include <asm/bugs.h>
@@ -794,6 +796,17 @@ static void __init do_pre_smp_initcalls(
 		do_one_initcall(*fn);
 }
 
+/*
+ * This function requests modules which should be loaded by default and is
+ * called twice right after initrd is mounted and right before init is
+ * exec'd.  If such modules are on either initrd or rootfs, they will be
+ * loaded before control is passed to userland.
+ */
+void __init load_default_modules(void)
+{
+	load_default_elevator_module();
+}
+
 static int run_init_process(const char *init_filename)
 {
 	argv_init[0] = init_filename;
@@ -900,4 +913,7 @@ static void __init kernel_init_freeable(
 	 * we're essentially up and running. Get rid of the
 	 * initmem segments and start the user-mode stuff..
 	 */
+
+	/* rootfs is available now, try loading default modules */
+	load_default_modules();
 }

* [PATCH 2/2] block: don't request module during elevator init
  2013-01-16 17:06                                                   ` Linus Torvalds
  2013-01-16 21:30                                                     ` [PATCH 1/2] init, block: try to load default elevator module early during boot Tejun Heo
@ 2013-01-16 21:31                                                     ` Tejun Heo
  2013-01-23  0:51                                                       ` [PATCH v2 " Tejun Heo
  1 sibling, 1 reply; 93+ messages in thread
From: Tejun Heo @ 2013-01-16 21:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Arjan van de Ven, Ming Lei, Alex Riesen, Alan Stern, Jens Axboe,
	USB list, Linux Kernel Mailing List, Rusty Russell

The block layer allows an elevator which is built as a module to be
selected as the system default via the kernel param "elevator=".  This
is achieved by automatically invoking request_module() whenever a new
block device is initialized and the elevator is not available.

This led to an interesting deadlock problem involving async and module
init.  Block device probing running off an async job invokes
request_module().  While the module is being loaded, it performs
async_synchronize_full() which ends up waiting for the async job which
is already waiting for request_module() to finish, leading to
deadlock.

Invoking request_module() from deep in block device init path is
already nasty in itself.  It seems best to avoid these situations from
the beginning by moving on-demand module loading out of block init
path.

The previous patch made sure that the default elevator module is
loaded early during boot if available.  This patch removes on-demand
loading of the default elevator from elevator init path.  As the
module would have been loaded during boot, userland-visible behavior
difference should be minimal.

For more details, please refer to the following thread.

  http://thread.gmane.org/gmane.linux.kernel/1420814

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Alex Riesen <raa.lkml@gmail.com>
---
 block/elevator.c |   19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

--- a/block/elevator.c
+++ b/block/elevator.c
@@ -100,14 +100,14 @@ static void elevator_put(struct elevator
 	module_put(e->elevator_owner);
 }
 
-static struct elevator_type *elevator_get(const char *name)
+static struct elevator_type *elevator_get(const char *name, bool request_module)
 {
 	struct elevator_type *e;
 
 	spin_lock(&elv_list_lock);
 
 	e = elevator_find(name);
-	if (!e) {
+	if (!e && request_module) {
 		spin_unlock(&elv_list_lock);
 		request_module("%s-iosched", name);
 		spin_lock(&elv_list_lock);
@@ -207,25 +207,30 @@ int elevator_init(struct request_queue *
 	q->boundary_rq = NULL;
 
 	if (name) {
-		e = elevator_get(name);
+		e = elevator_get(name, true);
 		if (!e)
 			return -EINVAL;
 	}
 
+	/*
+	 * Use the default elevator specified by config boot param or
+	 * config option.  Don't try to load modules as we could be running
+	 * off async and request_module() isn't allowed from async.
+	 */
 	if (!e && *chosen_elevator) {
-		e = elevator_get(chosen_elevator);
+		e = elevator_get(chosen_elevator, false);
 		if (!e)
 			printk(KERN_ERR "I/O scheduler %s not found\n",
 							chosen_elevator);
 	}
 
 	if (!e) {
-		e = elevator_get(CONFIG_DEFAULT_IOSCHED);
+		e = elevator_get(CONFIG_DEFAULT_IOSCHED, false);
 		if (!e) {
 			printk(KERN_ERR
 				"Default I/O scheduler not found. " \
 				"Using noop.\n");
-			e = elevator_get("noop");
+			e = elevator_get("noop", false);
 		}
 	}
 
@@ -967,7 +972,7 @@ int elevator_change(struct request_queue
 		return -ENXIO;
 
 	strlcpy(elevator_name, name, sizeof(elevator_name));
-	e = elevator_get(strstrip(elevator_name));
+	e = elevator_get(strstrip(elevator_name), true);
 	if (!e) {
 		printk(KERN_ERR "elevator: type %s not found\n", elevator_name);
 		return -EINVAL;
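
Editorial sketch, not part of the posted patch: the deadlock described
in the changelog above can be pictured as a cycle in a wait-for graph.
The async job waits for the module load it requested, and the module
load waits (through async_synchronize_full()) for the async job.  The
enum, the waits_for[] encoding and has_deadlock() are assumptions made
for illustration only.

```c
#include <assert.h>
#include <stdbool.h>

enum actor { ASYNC_JOB, MODULE_LOAD, NACTORS };

/* waits_for[a] is the actor that a is blocked on, or -1 if it can run. */
static bool has_deadlock(const int waits_for[NACTORS])
{
	assert(NACTORS > 0);
	for (int start = 0; start < NACTORS; start++) {
		int cur = start;

		/* Follow at most NACTORS edges; getting back to start is a cycle. */
		for (int steps = 0; steps < NACTORS && cur >= 0; steps++) {
			cur = waits_for[cur];
			if (cur == start)
				return true;
		}
	}
	return false;
}
```

Setting both entries to point at each other models the old behaviour
and reports a cycle.  Setting waits_for[ASYNC_JOB] to -1 models an
elevator init path that no longer calls request_module(), and the cycle
disappears even though module init still waits for async work.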

* Re: [PATCH 1/2] init, block: try to load default elevator module early during boot
  2013-01-16 21:30                                                     ` [PATCH 1/2] init, block: try to load default elevator module early during boot Tejun Heo
@ 2013-01-17 18:05                                                       ` Linus Torvalds
  2013-01-17 18:38                                                         ` Tejun Heo
                                                                           ` (3 more replies)
  2013-01-23  0:53                                                       ` [PATCH v2 1/2] init, block: try to load default elevator module early during boot Tejun Heo
  1 sibling, 4 replies; 93+ messages in thread
From: Linus Torvalds @ 2013-01-17 18:05 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Arjan van de Ven, Ming Lei, Alex Riesen, Alan Stern, Jens Axboe,
	USB list, Linux Kernel Mailing List, Rusty Russell

And here I was really hoping that there was a third patch in the
series that added the warning...

We don't currently have a "am I an async worker" helper function for
the warning to use, which is something very much up your alley.

                Linus

On Wed, Jan 16, 2013 at 1:30 PM, Tejun Heo <tj@kernel.org> wrote:
> This patch adds default module loading and uses it to load the default
> block elevator.  During boot, it's called right after initramfs or
> initrd is made available and right before control is passed to
> userland.  This ensures that as long as the modules are available in
> the usual places in initramfs, initrd or the root filesystem, the
> default modules are loaded as soon as possible.
>
> This will replace the on-demand elevator module loading from elevator
> init path.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Jens Axboe <axboe@kernel.dk>
> Cc: Arjan van de Ven <arjan@linux.intel.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Alex Riesen <raa.lkml@gmail.com>
> ---
>  block/elevator.c         |   16 ++++++++++++++++
>  include/linux/elevator.h |    1 +
>  include/linux/init.h     |    1 +
>  init/do_mounts_initrd.c  |    3 +++
>  init/initramfs.c         |    8 +++++++-
>  init/main.c              |   16 ++++++++++++++++
>  6 files changed, 44 insertions(+), 1 deletion(-)
>
> --- a/block/elevator.c
> +++ b/block/elevator.c
> @@ -136,6 +136,22 @@ static int __init elevator_setup(char *s
>
>  __setup("elevator=", elevator_setup);
>
> +/* called during boot to load the elevator chosen by the elevator param */
> +void __init load_default_elevator_module(void)
> +{
> +       struct elevator_type *e;
> +
> +       if (!chosen_elevator[0])
> +               return;
> +
> +       spin_lock(&elv_list_lock);
> +       e = elevator_find(chosen_elevator);
> +       spin_unlock(&elv_list_lock);
> +
> +       if (!e)
> +               request_module("%s-iosched", chosen_elevator);
> +}
> +
>  static struct kobj_type elv_ktype;
>
>  static struct elevator_queue *elevator_alloc(struct request_queue *q,
> --- a/include/linux/elevator.h
> +++ b/include/linux/elevator.h
> @@ -138,6 +138,7 @@ extern void elv_drain_elevator(struct re
>  /*
>   * io scheduler registration
>   */
> +extern void __init load_default_elevator_module(void);
>  extern int elv_register(struct elevator_type *);
>  extern void elv_unregister(struct elevator_type *);
>
> --- a/include/linux/init.h
> +++ b/include/linux/init.h
> @@ -161,6 +161,7 @@ extern unsigned int reset_devices;
>  /* used by init/main.c */
>  void setup_arch(char **);
>  void prepare_namespace(void);
> +void __init load_default_modules(void);
>
>  extern void (*late_time_init)(void);
>
> --- a/init/do_mounts_initrd.c
> +++ b/init/do_mounts_initrd.c
> @@ -57,6 +57,9 @@ static void __init handle_initrd(void)
>         sys_mkdir("/old", 0700);
>         sys_chdir("/old");
>
> +       /* try loading default modules from initrd */
> +       load_default_modules();
> +
>         /*
>          * In case that a resume from disk is carried out by linuxrc or one of
>          * its children, we need to tell the freezer not to wait for us.
> --- a/init/initramfs.c
> +++ b/init/initramfs.c
> @@ -592,7 +592,7 @@ static int __init populate_rootfs(void)
>                         initrd_end - initrd_start);
>                 if (!err) {
>                         free_initrd();
> -                       return 0;
> +                       goto done;
>                 } else {
>                         clean_rootfs();
>                         unpack_to_rootfs(__initramfs_start, __initramfs_size);
> @@ -607,6 +607,7 @@ static int __init populate_rootfs(void)
>                         sys_close(fd);
>                         free_initrd();
>                 }
> +       done:
>  #else
>                 printk(KERN_INFO "Unpacking initramfs...\n");
>                 err = unpack_to_rootfs((char *)initrd_start,
> @@ -615,6 +616,11 @@ static int __init populate_rootfs(void)
>                         printk(KERN_EMERG "Initramfs unpacking failed: %s\n", err);
>                 free_initrd();
>  #endif
> +               /*
> +                * Try loading default modules from initramfs.  This gives
> +                * us a chance to load before device_initcalls.
> +                */
> +               load_default_modules();
>         }
>         return 0;
>  }
> --- a/init/main.c
> +++ b/init/main.c
> @@ -70,6 +70,8 @@
>  #include <linux/perf_event.h>
>  #include <linux/file.h>
>  #include <linux/ptrace.h>
> +#include <linux/blkdev.h>
> +#include <linux/elevator.h>
>
>  #include <asm/io.h>
>  #include <asm/bugs.h>
> @@ -794,6 +796,17 @@ static void __init do_pre_smp_initcalls(
>                 do_one_initcall(*fn);
>  }
>
> +/*
> + * This function requests modules which should be loaded by default and is
> + * called twice right after initrd is mounted and right before init is
> + * exec'd.  If such modules are on either initrd or rootfs, they will be
> + * loaded before control is passed to userland.
> + */
> +void __init load_default_modules(void)
> +{
> +       load_default_elevator_module();
> +}
> +
>  static int run_init_process(const char *init_filename)
>  {
>         argv_init[0] = init_filename;
> @@ -900,4 +913,7 @@ static void __init kernel_init_freeable(
>          * we're essentially up and running. Get rid of the
>          * initmem segments and start the user-mode stuff..
>          */
> +
> +       /* rootfs is available now, try loading default modules */
> +       load_default_modules();
>  }

* Re: [PATCH] async: fix __lowest_in_progress()
  2013-01-16 17:19                               ` [PATCH] async: fix __lowest_in_progress() Tejun Heo
@ 2013-01-17 18:16                                 ` Linus Torvalds
  2013-01-17 18:50                                   ` Tejun Heo
  2013-01-23  0:15                                 ` [PATCH v2] " Tejun Heo
  1 sibling, 1 reply; 93+ messages in thread
From: Linus Torvalds @ 2013-01-17 18:16 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ming Lei, Alex Riesen, Alan Stern, Jens Axboe, USB list,
	Linux Kernel Mailing List, Arjan van de Ven

Tejun, mind explaining this one a bit more to me?

If ordering matters, and we have a running queue and a pending queue,
how could the pending queue *ever* be lower than the running one?

That implies that something was taken off the pending queue and put on
the running queue out of order, right?

And that in turn implies that there isn't much of a "lowest" ordering
at all, so how could anybody even care about what lowest is? It seems
to be a meaningless measure.

So with that in mind, I don't see what semantics the first part of the
patch fixes. Can you explain more?

               Linus

* Re: [PATCH 1/2] init, block: try to load default elevator module early during boot
  2013-01-17 18:05                                                       ` Linus Torvalds
@ 2013-01-17 18:38                                                         ` Tejun Heo
  2013-01-17 18:46                                                           ` Linus Torvalds
  2013-01-18  1:24                                                         ` [PATCH 1/3] workqueue: set PF_WQ_WORKER on rescuers Tejun Heo
                                                                           ` (2 subsequent siblings)
  3 siblings, 1 reply; 93+ messages in thread
From: Tejun Heo @ 2013-01-17 18:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Arjan van de Ven, Ming Lei, Alex Riesen, Alan Stern, Jens Axboe,
	USB list, Linux Kernel Mailing List, Rusty Russell

Hello, Linus.

On Thu, Jan 17, 2013 at 10:05:53AM -0800, Linus Torvalds wrote:
> And here I was really hoping that there was a third patch in the
> series that added the warning...
> 
> We don't currently have a "am I an async worker" helper function for
> the warning to use, which is something very much up your alley.

Oh yeah, it's coming.  I just wanted to finish something else first
and, as turning on PF_WQ_WORKER on a rescuer thread has some chance of
developing into an obscure, difficult-to-trigger-and-diagnose problem,
I don't want to hurry it too much.

Thanks.

-- 
tejun
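
Editorial sketch, not code from this thread: a user-space picture of
the warning being discussed, where a per-thread flag records "I am an
async worker" and the module-request path complains if the flag is set.
All names are hypothetical, including am_i_async(), which is used only
because the thread uses it as a working name; no kernel helper or
PF_WQ_WORKER handling is modelled.  "cfq-iosched" is just an example
module name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Per-thread marker, set only while an async callback is running. */
static _Thread_local bool in_async_worker;
static int warnings;

static bool am_i_async(void)
{
	return in_async_worker;
}

/* Stand-in for a module request; it only warns, it loads nothing. */
static int request_module_checked(const char *name)
{
	if (am_i_async()) {
		fprintf(stderr, "warning: %s requested from async context\n",
			name);
		warnings++;
	}
	return 0;
}

/* Run fn as if from an async worker, then clear the marker again. */
static void run_async(void (*fn)(void))
{
	assert(!in_async_worker);	/* nesting is not modelled here */
	in_async_worker = true;
	fn();
	in_async_worker = false;
}

/* Example caller: a probe routine that wants an I/O scheduler module. */
static void probe_disk(void)
{
	request_module_checked("cfq-iosched");
}
```

Calling probe_disk() directly stays silent, while run_async(probe_disk)
bumps the warning counter once.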

* Re: [PATCH 1/2] init, block: try to load default elevator module early during boot
  2013-01-17 18:38                                                         ` Tejun Heo
@ 2013-01-17 18:46                                                           ` Linus Torvalds
  2013-01-17 18:59                                                             ` Tejun Heo
  0 siblings, 1 reply; 93+ messages in thread
From: Linus Torvalds @ 2013-01-17 18:46 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Arjan van de Ven, Ming Lei, Alex Riesen, Alan Stern, Jens Axboe,
	USB list, Linux Kernel Mailing List, Rusty Russell

On Thu, Jan 17, 2013 at 10:38 AM, Tejun Heo <tj@kernel.org> wrote:
>
> Oh yeah, it's coming.  I just wanted to finish something else first
> and, as turning on PF_WQ_WORKER on a rescuer thread has some chance of
> developing into an obscure difficult-to-trigger and diagnose problem,
> don't want to hurry it too much.

Ok. I think I'll delay these things for 3.9 anyway, since the actual
_problem_ people are seeing should be fixed with your other patch. So
I guess it's not really all that critical any more.

                   Linus

* Re: [PATCH] async: fix __lowest_in_progress()
  2013-01-17 18:16                                 ` Linus Torvalds
@ 2013-01-17 18:50                                   ` Tejun Heo
  0 siblings, 0 replies; 93+ messages in thread
From: Tejun Heo @ 2013-01-17 18:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ming Lei, Alex Riesen, Alan Stern, Jens Axboe, USB list,
	Linux Kernel Mailing List, Arjan van de Ven

Hello,

On Thu, Jan 17, 2013 at 10:16:50AM -0800, Linus Torvalds wrote:
> Tejun, mind explaining this one a bit more to me?
> 
> If ordering matters, and we have a running queue and a pending queue,
> how could the pending queue *ever* be lower than the running one?

So, before being converted to workqueue, async spooled up its own
workers and each worker would lock and move the first pending item to
the executing list and everything was in order.

The conversion to workqueue was done by adding a work_struct to each
async_entry, and async just schedules the work item.  The queueing and
dispatching of such work items are still in order but now each worker
thread is associated with a specific async_entry and moves that
specific async_entry to the executing list.  So, depending on which
worker reaches that point earlier, which is completely
non-deterministic, we may end up moving an async_entry with a larger
cookie before one with a smaller one.

> That implies that something was taken off the pending queue and put on
> the running queue out of order, right?
> 
> And that in turn implies that there isn't much of a "lowest" ordering
> at all, so how could anybody even care about what lowest is? It seems
> to be a meaningless measure.

The execution is still lowest first as workqueue would dispatch
workers in queued order.  It just is that they can reach the
synchronization point at their own differing paces.

> So with that in mind, I don't see what semantics the first part of the
> patch fixes. Can you explain more?

The problem with the code is that it's keeping a global pending list
and domain-specific running lists.  I don't know why it developed like
this but even before workqueue conversion the code was weird.

* It has a sorted per-domain running list, so looking at the first
  item is easy.

* It has a sorted global pending list, and looking for the first item
  in a domain involves scanning it.

Global syncing ends up scanning all per-domain running lists and
domain syncing ends up scanning global pending list, when all we need
is each async item to be queued on two lists - global and per-domain
in-flight lists - and stay there until done.

The posted patch is a minimal fix which keeps the basic operation the
same so that it doesn't disturb -stable too much.  I'll prep a patch
to redo synchronization for 3.9.

Thanks.

-- 
tejun
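
Editorial sketch, not the eventual rewrite: a toy model of the
restructuring proposed above, in which an entry joins the in-flight set
when it is scheduled and leaves only on completion, with no move at
execution start.  As a simplification, a single array sorted by cookie
stands in for both the global and the per-domain lists, with a domain
filter applied on lookup.  struct inflight, schedule(), complete(),
lowest_in_flight() and ANY_DOMAIN are all hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_INFLIGHT 16
#define ANY_DOMAIN (-1)

struct entry {
	unsigned long cookie;
	int domain;
};

struct inflight {
	struct entry e[MAX_INFLIGHT];	/* ascending cookie order */
	size_t len;
	unsigned long next_cookie;
};

/* Schedule: cookies only grow, so appending keeps the array sorted. */
static unsigned long schedule(struct inflight *f, int domain)
{
	assert(f->len < MAX_INFLIGHT);
	f->e[f->len].cookie = f->next_cookie;
	f->e[f->len].domain = domain;
	f->len++;
	return f->next_cookie++;
}

/* Completion is the only point at which an entry leaves the set. */
static void complete(struct inflight *f, unsigned long cookie)
{
	for (size_t i = 0; i < f->len; i++) {
		if (f->e[i].cookie != cookie)
			continue;
		for (size_t j = i + 1; j < f->len; j++)
			f->e[j - 1] = f->e[j];
		f->len--;
		return;
	}
}

/* First in-flight cookie for a domain, or next_cookie if none remain. */
static unsigned long lowest_in_flight(const struct inflight *f, int domain)
{
	for (size_t i = 0; i < f->len; i++)
		if (domain == ANY_DOMAIN || f->e[i].domain == domain)
			return f->e[i].cookie;
	return f->next_cookie;
}
```

Because nothing is moved when execution starts, the order in which
workers reach their callbacks cannot disturb the result of
lowest_in_flight().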

* Re: [PATCH 1/2] init, block: try to load default elevator module early during boot
  2013-01-17 18:46                                                           ` Linus Torvalds
@ 2013-01-17 18:59                                                             ` Tejun Heo
  2013-01-17 19:00                                                               ` Linus Torvalds
  0 siblings, 1 reply; 93+ messages in thread
From: Tejun Heo @ 2013-01-17 18:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Arjan van de Ven, Ming Lei, Alex Riesen, Alan Stern, Jens Axboe,
	USB list, Linux Kernel Mailing List, Rusty Russell

On Thu, Jan 17, 2013 at 10:46:47AM -0800, Linus Torvalds wrote:
> On Thu, Jan 17, 2013 at 10:38 AM, Tejun Heo <tj@kernel.org> wrote:
> >
> > Oh yeah, it's coming.  I just wanted to finish something else first
> > and, as turning on PF_WQ_WORKER on a rescuer thread has some chance of
> > developing into an obscure difficult-to-trigger and diagnose problem,
> > don't want to hurry it too much.
> 
> Ok. I think I'll delay these things for 3.9 anyway, since the actual
> _problem_ people are seeing should be fixed with your other patch. So
> I guess it's not really all that critical any more.

If you're okay with it, I'll route these two and the patches to add
warning through a wq branch.  There's already a wq/for-3.9 patch which
am_i_async() can make use of, so it's gonna be easier this way.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 1/2] init, block: try to load default elevator module early during boot
  2013-01-17 18:59                                                             ` Tejun Heo
@ 2013-01-17 19:00                                                               ` Linus Torvalds
  0 siblings, 0 replies; 93+ messages in thread
From: Linus Torvalds @ 2013-01-17 19:00 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Arjan van de Ven, Ming Lei, Alex Riesen, Alan Stern, Jens Axboe,
	USB list, Linux Kernel Mailing List, Rusty Russell

On Thu, Jan 17, 2013 at 10:59 AM, Tejun Heo <tj@kernel.org> wrote:
>
> If you're okay with it, I'll route these two and the patches to add
> warning through a wq branch.  There's already a wq/for-3.9 patch which
> am_i_async() can make use of, so it's gonna be easier this way.

Sounds good to me. Thanks,

             Linus

^ permalink raw reply	[flat|nested] 93+ messages in thread

* [PATCH 1/3] workqueue: set PF_WQ_WORKER on rescuers
  2013-01-17 18:05                                                       ` Linus Torvalds
  2013-01-17 18:38                                                         ` Tejun Heo
@ 2013-01-18  1:24                                                         ` Tejun Heo
  2013-01-18  1:25                                                         ` [PATCH 2/3] workqueue, async: implement work/async_current_func() Tejun Heo
  2013-01-18  1:27                                                         ` [PATCH 3/3] " Tejun Heo
  3 siblings, 0 replies; 93+ messages in thread
From: Tejun Heo @ 2013-01-18  1:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Arjan van de Ven, Ming Lei, Alex Riesen, Alan Stern, Jens Axboe,
	USB list, Linux Kernel Mailing List, Rusty Russell

PF_WQ_WORKER is used to tell scheduler that the task is a workqueue
worker and needs wq_worker_sleeping/waking_up() invoked on it for
concurrency management.  As rescuers never participate in concurrency
management, PF_WQ_WORKER wasn't set on them.

There's a need for an interface which can query whether %current is
executing a work item and if so which.  Such interface requires a way
to identify all tasks which may execute work items and PF_WQ_WORKER
will be used for that.  As all normal workers always have PF_WQ_WORKER
set, we only need to add it to rescuers.

As rescuers start with WORKER_PREP but never clear it, they're always
NOT_RUNNING and there's no need to worry about them interfering with
concurrency management even if PF_WQ_WORKER is set; however, unlike
normal workers, rescuers currently don't have their worker struct as
kthread_data().  They use the associated workqueue_struct instead.
This is problematic as wq_worker_sleeping/waking_up() expect struct
worker at kthread_data().

This patch adds worker->rescue_wq and starts rescuer kthreads with
worker struct as kthread_data and sets PF_WQ_WORKER on rescuers.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
---
These three patches implement the warning on synchronous
request_module() from async.  The first two will go through wq/for-3.9
and the last one through wq/for-3.9-async-deadlock-fixes together with
the block layer updates.
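
As a rough illustration of the ordering change in
wq_worker_sleeping(), here is a simplified userspace model.  The
model_* names are hypothetical and the accounting is reduced to a
plain counter; it only shows why NOT_RUNNING is tested before the pool
pointer is touched, assuming a rescuer may not have its pool set up.

```c
#include <assert.h>
#include <stddef.h>

enum {
	MODEL_WORKER_PREP	 = 1 << 0,
	/* the real NOT_RUNNING mask covers more bits than PREP */
	MODEL_WORKER_NOT_RUNNING = MODEL_WORKER_PREP,
};

struct model_pool {
	int nr_running;
};

struct model_worker {
	unsigned int flags;
	struct model_pool *pool;	/* assumed unset for a rescuer */
};

/* Returns 1 if concurrency accounting was touched, 0 if skipped. */
int model_worker_sleeping(struct model_worker *worker)
{
	/* check the flag first so that a NULL pool is never dereferenced */
	if (worker->flags & MODEL_WORKER_NOT_RUNNING)
		return 0;

	worker->pool->nr_running--;
	return 1;
}
```

A rescuer, which never clears PREP, takes the early return, so the
hook is harmless for it even with PF_WQ_WORKER set.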

Thanks.

 kernel/workqueue.c | 35 ++++++++++++++++++++++++++++-------
 1 file changed, 28 insertions(+), 7 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 7967f34..6b99ac7 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -149,6 +149,9 @@ struct worker {
 
 	/* for rebinding worker to CPU */
 	struct work_struct	rebind_work;	/* L: for busy worker */
+
+	/* used only by rescuers to point to the target workqueue */
+	struct workqueue_struct	*rescue_wq;	/* I: the workqueue to rescue */
 };
 
 struct worker_pool {
@@ -763,12 +766,20 @@ struct task_struct *wq_worker_sleeping(struct task_struct *task,
 				       unsigned int cpu)
 {
 	struct worker *worker = kthread_data(task), *to_wakeup = NULL;
-	struct worker_pool *pool = worker->pool;
-	atomic_t *nr_running = get_pool_nr_running(pool);
+	struct worker_pool *pool;
+	atomic_t *nr_running;
 
+	/*
+	 * Rescuers, which may not have all the fields set up like normal
+	 * workers, also reach here, let's not access anything before
+	 * checking NOT_RUNNING.
+	 */
 	if (worker->flags & WORKER_NOT_RUNNING)
 		return NULL;
 
+	pool = worker->pool;
+	nr_running = get_pool_nr_running(pool);
+
 	/* this can only happen on the local cpu */
 	BUG_ON(cpu != raw_smp_processor_id());
 
@@ -2357,7 +2368,7 @@ sleep:
 
 /**
  * rescuer_thread - the rescuer thread function
- * @__wq: the associated workqueue
+ * @__rescuer: self
  *
  * Workqueue rescuer thread function.  There's one rescuer for each
  * workqueue which has WQ_RESCUER set.
@@ -2374,20 +2385,27 @@ sleep:
  *
  * This should happen rarely.
  */
-static int rescuer_thread(void *__wq)
+static int rescuer_thread(void *__rescuer)
 {
-	struct workqueue_struct *wq = __wq;
-	struct worker *rescuer = wq->rescuer;
+	struct worker *rescuer = __rescuer;
+	struct workqueue_struct *wq = rescuer->rescue_wq;
 	struct list_head *scheduled = &rescuer->scheduled;
 	bool is_unbound = wq->flags & WQ_UNBOUND;
 	unsigned int cpu;
 
 	set_user_nice(current, RESCUER_NICE_LEVEL);
+
+	/*
+	 * Mark rescuer as worker too.  As WORKER_PREP is never cleared, it
+	 * doesn't participate in concurrency management.
+	 */
+	rescuer->task->flags |= PF_WQ_WORKER;
 repeat:
 	set_current_state(TASK_INTERRUPTIBLE);
 
 	if (kthread_should_stop()) {
 		__set_current_state(TASK_RUNNING);
+		rescuer->task->flags &= ~PF_WQ_WORKER;
 		return 0;
 	}
 
@@ -2431,6 +2449,8 @@ repeat:
 		spin_unlock_irq(&gcwq->lock);
 	}
 
+	/* rescuers should never participate in concurrency management */
+	WARN_ON_ONCE(!(rescuer->flags & WORKER_NOT_RUNNING));
 	schedule();
 	goto repeat;
 }
@@ -3266,7 +3286,8 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt,
 		if (!rescuer)
 			goto err;
 
-		rescuer->task = kthread_create(rescuer_thread, wq, "%s",
+		rescuer->rescue_wq = wq;
+		rescuer->task = kthread_create(rescuer_thread, rescuer, "%s",
 					       wq->name);
 		if (IS_ERR(rescuer->task))
 			goto err;
-- 
1.8.0.2


^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [PATCH 2/3] workqueue, async: implement work/async_current_func()
  2013-01-17 18:05                                                       ` Linus Torvalds
  2013-01-17 18:38                                                         ` Tejun Heo
  2013-01-18  1:24                                                         ` [PATCH 1/3] workqueue: set PF_WQ_WORKER on rescuers Tejun Heo
@ 2013-01-18  1:25                                                         ` Tejun Heo
  2013-01-18  2:47                                                           ` Linus Torvalds
  2013-01-18  1:27                                                         ` [PATCH 3/3] " Tejun Heo
  3 siblings, 1 reply; 93+ messages in thread
From: Tejun Heo @ 2013-01-18  1:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Arjan van de Ven, Ming Lei, Alex Riesen, Alan Stern, Jens Axboe,
	USB list, Linux Kernel Mailing List, Rusty Russell

Implement work/async_current_func() which queries whether the current
task is a workqueue or async worker respectively and, if so, return
the current function being executed along with work / async item
related information.

This will be used to implement warning on synchronous request_module()
from async workers.
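
The lookup relies on the work item being embedded in the async entry.
A userspace sketch of that step, with hypothetical model_* names and a
local container_of-style macro standing in for the kernel one:

```c
#include <assert.h>
#include <stddef.h>

#define model_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct model_work {
	void (*func)(struct model_work *work);
};

/* The work item is embedded, so the entry can be recovered from it. */
struct model_async_entry {
	unsigned long cookie;
	void *data;
	struct model_work work;
};

/* Stand-in for the function every async work item runs. */
void model_async_run(struct model_work *work)
{
	(void)work;
}

/*
 * Given the work item a worker is currently executing (NULL if none),
 * return the enclosing async entry, or NULL if it is not async work.
 */
struct model_async_entry *model_entry_of(struct model_work *current_work)
{
	if (!current_work || current_work->func != model_async_run)
		return NULL;
	return model_container_of(current_work, struct model_async_entry,
				  work);
}
```

Comparing the function pointer first is what makes the cast safe: only
work items that run the async function are known to be embedded in an
async entry.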

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
---
 include/linux/async.h     |  2 ++
 include/linux/workqueue.h |  1 +
 kernel/async.c            | 25 +++++++++++++++++++++++++
 kernel/workqueue.c        | 22 ++++++++++++++++++++++
 4 files changed, 50 insertions(+)

diff --git a/include/linux/async.h b/include/linux/async.h
index 7a24fe9..6c49157 100644
--- a/include/linux/async.h
+++ b/include/linux/async.h
@@ -52,4 +52,6 @@ extern void async_synchronize_full_domain(struct async_domain *domain);
 extern void async_synchronize_cookie(async_cookie_t cookie);
 extern void async_synchronize_cookie_domain(async_cookie_t cookie,
 					    struct async_domain *domain);
+extern async_func_ptr *async_current_func(void **datap,
+					  async_cookie_t *cookiep);
 #endif
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 2b58905..984fbef 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -428,6 +428,7 @@ extern void workqueue_set_max_active(struct workqueue_struct *wq,
 extern bool workqueue_congested(unsigned int cpu, struct workqueue_struct *wq);
 extern unsigned int work_cpu(struct work_struct *work);
 extern unsigned int work_busy(struct work_struct *work);
+extern work_func_t work_current_func(struct work_struct **workp);
 
 /*
  * Like above, but uses del_timer() instead of del_timer_sync(). This means,
diff --git a/kernel/async.c b/kernel/async.c
index 9d31183..ed1eda0 100644
--- a/kernel/async.c
+++ b/kernel/async.c
@@ -337,3 +337,28 @@ void async_synchronize_cookie(async_cookie_t cookie)
 	async_synchronize_cookie_domain(cookie, &async_running);
 }
 EXPORT_SYMBOL_GPL(async_synchronize_cookie);
+
+/**
+ * async_current_func - determine the async entry %current is executing
+ * @datap: optional out param for the data
+ * @cookiep: optional out param for the cookie
+ *
+ * Determine whether %current is an async worker executing an async_entry
+ * and if so return the async function and, if @cookiep is not %NULL, the
+ * cookie.  If %current isn't executing an async_entry, %NULL is returned.
+ */
+async_func_ptr *async_current_func(void **datap, async_cookie_t *cookiep)
+{
+	struct work_struct *work;
+	struct async_entry *entry;
+
+	if (work_current_func(&work) != async_run_entry_fn)
+		return NULL;
+
+	entry = container_of(work, struct async_entry, work);
+	if (datap)
+		*datap = entry->data;
+	if (cookiep)
+		*cookiep = entry->cookie;
+	return entry->func;
+}
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 6b99ac7..9fc1549 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3492,6 +3492,28 @@ unsigned int work_busy(struct work_struct *work)
 }
 EXPORT_SYMBOL_GPL(work_busy);
 
+/**
+ * work_current_func - determine the work fn and item %current is executing
+ * @workp: optional out param for the current work item
+ *
+ * Determine whether %current is a kworker executing a work item and if so
+ * return the work function and, if @workp is not %NULL, the work item.  If
+ * %current isn't executing a work item, %NULL is returned.
+ */
+work_func_t work_current_func(struct work_struct **workp)
+{
+	struct worker *worker;
+
+	/* am I a kworker? */
+	if (!(current->flags & PF_WQ_WORKER))
+		return NULL;
+
+	worker = kthread_data(current);
+	if (workp)
+		*workp = worker->current_work;
+	return worker->current_func;
+}
+
 /*
  * CPU hotplug.
  *
-- 
1.8.0.2


^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [PATCH 3/3] async, kmod: warn on synchronous request_module() from async workers
  2013-01-17 18:05                                                       ` Linus Torvalds
                                                                           ` (2 preceding siblings ...)
  2013-01-18  1:25                                                         ` [PATCH 2/3] workqueue, async: implement work/async_current_func() Tejun Heo
@ 2013-01-18  1:27                                                         ` Tejun Heo
  3 siblings, 0 replies; 93+ messages in thread
From: Tejun Heo @ 2013-01-18  1:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Arjan van de Ven, Ming Lei, Alex Riesen, Alan Stern, Jens Axboe,
	USB list, Linux Kernel Mailing List, Rusty Russell

Synchronous request_module() from an async worker can lead to deadlock
because module init path may invoke async_synchronize_full().  The
async worker waits for request_module() to complete and the module
loading waits for the async task to finish.  This bug happened in the
block layer because of default elevator auto-loading.

Block layer has been updated not to do default elevator auto-loading
and it has been decided to disallow synchronous request_module() from
async workers.

Trigger WARN_ON_ONCE() on synchronous request_module() from async
workers.

For more details, please refer to the following thread.

  http://thread.gmane.org/gmane.linux.kernel/1420814

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Alex Riesen <raa.lkml@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
---
Linus, please note that I dropped system_state == SYSTEM_RUNNING
condition from WARN_ON_ONCE() as the deadlock can happen during system
init too - e.g. libata probing block device using async making block
layer try to load default elevator from initramfs.
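
For readers who want the shape of the check without the kernel
context, a small userspace model follows.  The names are hypothetical;
the in_async argument stands in for async_current_func() returning
non-NULL, and the static flag imitates the once-only reporting of
WARN_ON_ONCE().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Number of warnings actually printed; at most one in this model. */
int model_warn_count;

/*
 * Returns true whenever a blocking load is requested from async
 * context, but reports it only the first time.
 */
bool model_check_request(bool wait, bool in_async)
{
	static bool warned;
	bool bad = wait && in_async;

	if (bad && !warned) {
		warned = true;
		model_warn_count++;
		fprintf(stderr, "synchronous module load from async worker\n");
	}
	return bad;
}
```

Non-blocking requests and requests from non-async context pass
silently; only the combination of the two is flagged.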

Thanks.

 kernel/kmod.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/kernel/kmod.c b/kernel/kmod.c
index 1c317e3..028287e 100644
--- a/kernel/kmod.c
+++ b/kernel/kmod.c
@@ -38,6 +38,7 @@
 #include <linux/suspend.h>
 #include <linux/rwsem.h>
 #include <linux/ptrace.h>
+#include <linux/async.h>
 #include <asm/uaccess.h>
 
 #include <trace/events/module.h>
@@ -130,6 +131,14 @@ int __request_module(bool wait, const char *fmt, ...)
 #define MAX_KMOD_CONCURRENT 50	/* Completely arbitrary value - KAO */
 	static int kmod_loop_msg;
 
+	/*
+	 * We don't allow synchronous module loading from async.  Module
+	 * init may invoke async_synchronize_full() which will end up
+	 * waiting for this task which already is waiting for the module
+	 * loading to complete, leading to a deadlock.
+	 */
+	WARN_ON_ONCE(wait && async_current_func(NULL, NULL));
+
 	va_start(args, fmt);
 	ret = vsnprintf(module_name, MODULE_NAME_LEN, fmt, args);
 	va_end(args);
-- 
1.8.0.2


^ permalink raw reply related	[flat|nested] 93+ messages in thread

* Re: [PATCH 2/3] workqueue, async: implement work/async_current_func()
  2013-01-18  1:25                                                         ` [PATCH 2/3] workqueue, async: implement work/async_current_func() Tejun Heo
@ 2013-01-18  2:47                                                           ` Linus Torvalds
  2013-01-18  2:59                                                             ` Tejun Heo
  0 siblings, 1 reply; 93+ messages in thread
From: Linus Torvalds @ 2013-01-18  2:47 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Arjan van de Ven, Ming Lei, Alex Riesen, Alan Stern, Jens Axboe,
	USB list, Linux Kernel Mailing List, Rusty Russell

On Thu, Jan 17, 2013 at 5:25 PM, Tejun Heo <tj@kernel.org> wrote:
> Implement work/async_current_func() which queries whether the current
> task is a workqueue or async worker respectively and, if so, return
> the current function being executed along with work / async item
> related information.

So why the odd interface? The only user of it calls it with a
NULL/NULL pair of arguments, and in general it's just way too complex
to be an exported function at all. I *suspect* you chose that complex
interface because you feel you may have some use for it inside of the
async code itself, but why isn't that then not totally private to
there?

IOW, why isn't the interface just

   static struct worker *current_worker(void)
   {
      if (current->flags & PF_WQ_WORKER)
         return kthread_data(current);
      return NULL;
   }

   int current_is_async(void)
   {
      struct worker *worker = current_worker();
      return worker && worker->current_func == async_run_entry_fn;
   }

and that current_is_async() is enough for the exported interface.

Then, if you actually want to care about the work itself, you can use
that same "current_worker()" helper function, and look at the
different worker fields more. But why export that kind of logic?

Am I missing something?

                   Linus

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 2/3] workqueue, async: implement work/async_current_func()
  2013-01-18  2:47                                                           ` Linus Torvalds
@ 2013-01-18  2:59                                                             ` Tejun Heo
  2013-01-18  3:04                                                               ` Tejun Heo
  0 siblings, 1 reply; 93+ messages in thread
From: Tejun Heo @ 2013-01-18  2:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Arjan van de Ven, Ming Lei, Alex Riesen, Alan Stern, Jens Axboe,
	USB list, Linux Kernel Mailing List, Rusty Russell

Hello, Linus.

On Thu, Jan 17, 2013 at 06:47:48PM -0800, Linus Torvalds wrote:
> On Thu, Jan 17, 2013 at 5:25 PM, Tejun Heo <tj@kernel.org> wrote:
> > Implement work/async_current_func() which queries whether the current
> > task is a workqueue or async worker respectively and, if so, return
> > the current function being executed along with work / async item
> > related information.
> 
> So why the odd interface? The only user of it calls it with a

Yeah, I was doing something else in async and arguing between that and
current_is_async() and ended up keeping it as it was consistent with
the workqueue counterpart.

> NULL/NULL pair of arguments, and in general it's just way too complex
> to be an exported function at all. I *suspect* you chose that complex
> interface because you feel you may have some use for it inside of the
> async code itself, but why isn't that then not totally private to
> there?
> 
> IOW, why isn't the interface just
> 
>    static struct worker *current_worker(void)
>    {
>       if (current->flags & PF_WQ_WORKER)
>          return kthread_data(current);
>       return NULL;
>    }

I'd prefer to keep struct worker inside workqueue.c, so how about
keeping the workqueue part and make async part current_is_async()?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 2/3] workqueue, async: implement work/async_current_func()
  2013-01-18  2:59                                                             ` Tejun Heo
@ 2013-01-18  3:04                                                               ` Tejun Heo
  2013-01-18  3:18                                                                 ` Linus Torvalds
  0 siblings, 1 reply; 93+ messages in thread
From: Tejun Heo @ 2013-01-18  3:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Arjan van de Ven, Ming Lei, Alex Riesen, Alan Stern, Jens Axboe,
	USB list, Linux Kernel Mailing List, Rusty Russell

On Thu, Jan 17, 2013 at 06:59:36PM -0800, Tejun Heo wrote:
> I'd prefer to keep struct worker inside workqueue.c, so how about
> keeping the workqueue part and make async part current_is_async()?

Another thing is that it seems like having an introspection-type
interface often leads to abuses - work_pending(), work_busy() both
ended up bringing more unnecessary dependencies and subtle bugginess
on internal details than actual benefits.  Querying %current is much
less likely to be harmful in itself but I'm afraid it might encourage
its users to develop something crazy on top.  It might be a good idea
to make it only available to async.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 2/3] workqueue, async: implement work/async_current_func()
  2013-01-18  3:04                                                               ` Tejun Heo
@ 2013-01-18  3:18                                                                 ` Linus Torvalds
  2013-01-18  3:47                                                                   ` Tejun Heo
                                                                                     ` (5 more replies)
  0 siblings, 6 replies; 93+ messages in thread
From: Linus Torvalds @ 2013-01-18  3:18 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Arjan van de Ven, Ming Lei, Alex Riesen, Alan Stern, Jens Axboe,
	USB list, Linux Kernel Mailing List, Rusty Russell

On Thu, Jan 17, 2013 at 7:04 PM, Tejun Heo <tj@kernel.org> wrote:
>
> Another thing is that it seems like having an introspection-type
> interface often leads to abuses - work_pending(), work_busy() both
> ended up bringing more unnecessary dependencies and subtle bugginess
> on internal details than actual benefits.  Querying %current is much
> less likely to be harmful in itself but I'm afraid it might encourage
> its users to develop something crazy on top.  It might be a good idea
> to make it only available to async.

I'm not sure I understand what you mean? Do you mean trying to limit
work_current_func() to only be accessible to the async code? You'd
have to make some kind of private header file under kernel/ for that,
but I guess that would work fine. We already do something similar
inside filesystems etc, where they have their own local headers.

I still don't really see what other data the async code could possibly
ever want than the "is this an async thread".  I guess if you want to
keep all these things private to their own C files with no leaving of
the structure definitions (even within just a private kernel/worker.h
file or something) you could make the interface be one like

 - kernel/workqueue.c:

      int current_uses_workfn(work_func_t match)
      {
         if (current->flags & PF_WQ_WORKER) {
            struct worker *worker = kthread_data(current);
            return worker && match == worker->current_func;
         }
         return 0;
      }

 - kernel/async.c:

      int current_is_async(void)
      {
         return current_uses_workfn(async_run_entry_fn);
      }

but quite frankly, we've generally tried to avoid those kinds of silly
wrappers just because it's very wasteful to do two function calls just
to hide some detail like this. The code generation is atrocious,
jumping around for no good reason just increases cache pressure etc.

Yes, yes, some globally optimizing compiler could sort it all out, but
I'd personally be inclined to just move all the structure definitions
into kernel/worker.h, and make the code be inline functions. The only
actual current *user* would also be in the kernel/ subdirectory, and
we don't know if we'd ever want to really expand it past there.

Hmm?

             Linus

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 2/3] workqueue, async: implement work/async_current_func()
  2013-01-18  3:18                                                                 ` Linus Torvalds
@ 2013-01-18  3:47                                                                   ` Tejun Heo
  2013-01-18 22:08                                                                   ` [PATCH 1/5] workqueue: set PF_WQ_WORKER on rescuers Tejun Heo
                                                                                     ` (4 subsequent siblings)
  5 siblings, 0 replies; 93+ messages in thread
From: Tejun Heo @ 2013-01-18  3:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Arjan van de Ven, Ming Lei, Alex Riesen, Alan Stern, Jens Axboe,
	USB list, Linux Kernel Mailing List, Rusty Russell

On Thu, Jan 17, 2013 at 07:18:26PM -0800, Linus Torvalds wrote:
> I'm not sure I understand what you mean? Do you mean trying to limit
> work_current_func() to only be accessible to the async code? You'd
> have to make some kind of private header file under kernel/ for that,
> but I guess that would work fine. We already do something similar
> inside filesystems etc, where they have their own local headers.

Yeap, and I'm unsure whether it's worth introducing a new internal
header file.

> Yes, yes, some globally optimizing compiler could sort it all out, but
> I'd personally be inclined to just move all the structure definitions
> into kernel/worker.h, and make the code be inline functions. The only
> actual current *user* would also be in the kernel/ subdirectory, and
> we don't know if we'd ever want to really expand it past there.
> 
> Hmm?

If we're gonna make it kernel/ internal thing with internal header, we
definitely can go all the way.  It's a bit meh because the code path
involved is very cold.  Hmm... I'll make it that way.  I like how it
keeps the thing apparently internal.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 93+ messages in thread

* [PATCH 1/5] workqueue: set PF_WQ_WORKER on rescuers
  2013-01-18  3:18                                                                 ` Linus Torvalds
  2013-01-18  3:47                                                                   ` Tejun Heo
@ 2013-01-18 22:08                                                                   ` Tejun Heo
  2013-01-18 22:10                                                                   ` [PATCH 2/5] workqueue: rename kernel/workqueue_sched.h to kernel/workqueue_internal.h Tejun Heo
                                                                                     ` (3 subsequent siblings)
  5 siblings, 0 replies; 93+ messages in thread
From: Tejun Heo @ 2013-01-18 22:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Arjan van de Ven, Ming Lei, Alex Riesen, Alan Stern, Jens Axboe,
	USB list, Linux Kernel Mailing List, Rusty Russell

From 111c225a5f8d872bc9327ada18d13b75edaa34be Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Thu, 17 Jan 2013 17:16:24 -0800

PF_WQ_WORKER is used to tell scheduler that the task is a workqueue
worker and needs wq_worker_sleeping/waking_up() invoked on it for
concurrency management.  As rescuers never participate in concurrency
management, PF_WQ_WORKER wasn't set on them.

There's a need for an interface which can query whether %current is
executing a work item and if so which.  Such interface requires a way
to identify all tasks which may execute work items and PF_WQ_WORKER
will be used for that.  As all normal workers always have PF_WQ_WORKER
set, we only need to add it to rescuers.

As rescuers start with WORKER_PREP but never clear it, they're always
NOT_RUNNING and there's no need to worry about them interfering with
concurrency management even if PF_WQ_WORKER is set; however, unlike
normal workers, rescuers currently don't have their worker struct as
kthread_data().  They use the associated workqueue_struct instead.
This is problematic as wq_worker_sleeping/waking_up() expect struct
worker at kthread_data().

This patch adds worker->rescue_wq and starts rescuer kthreads with
worker struct as kthread_data and sets PF_WQ_WORKER on rescuers.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
---
This one is the same as last time.

 kernel/workqueue.c | 35 ++++++++++++++++++++++++++++-------
 1 file changed, 28 insertions(+), 7 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 7967f34..6b99ac7 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -149,6 +149,9 @@ struct worker {
 
 	/* for rebinding worker to CPU */
 	struct work_struct	rebind_work;	/* L: for busy worker */
+
+	/* used only by rescuers to point to the target workqueue */
+	struct workqueue_struct	*rescue_wq;	/* I: the workqueue to rescue */
 };
 
 struct worker_pool {
@@ -763,12 +766,20 @@ struct task_struct *wq_worker_sleeping(struct task_struct *task,
 				       unsigned int cpu)
 {
 	struct worker *worker = kthread_data(task), *to_wakeup = NULL;
-	struct worker_pool *pool = worker->pool;
-	atomic_t *nr_running = get_pool_nr_running(pool);
+	struct worker_pool *pool;
+	atomic_t *nr_running;
 
+	/*
+	 * Rescuers, which may not have all the fields set up like normal
+	 * workers, also reach here, let's not access anything before
+	 * checking NOT_RUNNING.
+	 */
 	if (worker->flags & WORKER_NOT_RUNNING)
 		return NULL;
 
+	pool = worker->pool;
+	nr_running = get_pool_nr_running(pool);
+
 	/* this can only happen on the local cpu */
 	BUG_ON(cpu != raw_smp_processor_id());
 
@@ -2357,7 +2368,7 @@ sleep:
 
 /**
  * rescuer_thread - the rescuer thread function
- * @__wq: the associated workqueue
+ * @__rescuer: self
  *
  * Workqueue rescuer thread function.  There's one rescuer for each
  * workqueue which has WQ_RESCUER set.
@@ -2374,20 +2385,27 @@ sleep:
  *
  * This should happen rarely.
  */
-static int rescuer_thread(void *__wq)
+static int rescuer_thread(void *__rescuer)
 {
-	struct workqueue_struct *wq = __wq;
-	struct worker *rescuer = wq->rescuer;
+	struct worker *rescuer = __rescuer;
+	struct workqueue_struct *wq = rescuer->rescue_wq;
 	struct list_head *scheduled = &rescuer->scheduled;
 	bool is_unbound = wq->flags & WQ_UNBOUND;
 	unsigned int cpu;
 
 	set_user_nice(current, RESCUER_NICE_LEVEL);
+
+	/*
+	 * Mark rescuer as worker too.  As WORKER_PREP is never cleared, it
+	 * doesn't participate in concurrency management.
+	 */
+	rescuer->task->flags |= PF_WQ_WORKER;
 repeat:
 	set_current_state(TASK_INTERRUPTIBLE);
 
 	if (kthread_should_stop()) {
 		__set_current_state(TASK_RUNNING);
+		rescuer->task->flags &= ~PF_WQ_WORKER;
 		return 0;
 	}
 
@@ -2431,6 +2449,8 @@ repeat:
 		spin_unlock_irq(&gcwq->lock);
 	}
 
+	/* rescuers should never participate in concurrency management */
+	WARN_ON_ONCE(!(rescuer->flags & WORKER_NOT_RUNNING));
 	schedule();
 	goto repeat;
 }
@@ -3266,7 +3286,8 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt,
 		if (!rescuer)
 			goto err;
 
-		rescuer->task = kthread_create(rescuer_thread, wq, "%s",
+		rescuer->rescue_wq = wq;
+		rescuer->task = kthread_create(rescuer_thread, rescuer, "%s",
 					       wq->name);
 		if (IS_ERR(rescuer->task))
 			goto err;
-- 
1.8.0.2


^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [PATCH 2/5] workqueue: rename kernel/workqueue_sched.h to kernel/workqueue_internal.h
  2013-01-18  3:18                                                                 ` Linus Torvalds
  2013-01-18  3:47                                                                   ` Tejun Heo
  2013-01-18 22:08                                                                   ` [PATCH 1/5] workqueue: set PF_WQ_WORKER on rescuers Tejun Heo
@ 2013-01-18 22:10                                                                   ` Tejun Heo
  2013-01-18 22:11                                                                   ` [PATCH 3/5] workqueue: move struct worker definition to workqueue_internal.h Tejun Heo
                                                                                     ` (2 subsequent siblings)
  5 siblings, 0 replies; 93+ messages in thread
From: Tejun Heo @ 2013-01-18 22:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Arjan van de Ven, Ming Lei, Alex Riesen, Alan Stern, Jens Axboe,
	USB list, Linux Kernel Mailing List, Rusty Russell, Ingo Molnar,
	Peter Zijlstra

From ea138446e51f7bfe55cdeffa3f1dd9cafc786bd8 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Fri, 18 Jan 2013 14:05:55 -0800

Workqueue wants to expose more interfaces internal to kernel/.  Instead
of adding a new header file, repurpose kernel/workqueue_sched.h.
Rename it to workqueue_internal.h and add an include guard.

This patch doesn't introduce any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
---
Ingo, Peter, the change to the scheduler is trivial.  If there's no
objection, I'd like to route this with the rest of the workqueue and
async changes.

Thanks.

 kernel/sched/core.c         |  2 +-
 kernel/workqueue.c          |  2 +-
 kernel/workqueue_internal.h | 18 ++++++++++++++++++
 kernel/workqueue_sched.h    |  9 ---------
 4 files changed, 20 insertions(+), 11 deletions(-)
 create mode 100644 kernel/workqueue_internal.h
 delete mode 100644 kernel/workqueue_sched.h

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 257002c..c6737f4f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -83,7 +83,7 @@
 #endif
 
 #include "sched.h"
-#include "../workqueue_sched.h"
+#include "../workqueue_internal.h"
 #include "../smpboot.h"
 
 #define CREATE_TRACE_POINTS
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 6b99ac7..b4e9206 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -43,7 +43,7 @@
 #include <linux/idr.h>
 #include <linux/hashtable.h>
 
-#include "workqueue_sched.h"
+#include "workqueue_internal.h"
 
 enum {
 	/*
diff --git a/kernel/workqueue_internal.h b/kernel/workqueue_internal.h
new file mode 100644
index 0000000..b3ea6ad
--- /dev/null
+++ b/kernel/workqueue_internal.h
@@ -0,0 +1,18 @@
+/*
+ * kernel/workqueue_internal.h
+ *
+ * Workqueue internal header file.  Only to be included by workqueue and
+ * core kernel subsystems.
+ */
+#ifndef _KERNEL_WORKQUEUE_INTERNAL_H
+#define _KERNEL_WORKQUEUE_INTERNAL_H
+
+/*
+ * Scheduler hooks for concurrency managed workqueue.  Only to be used from
+ * sched.c and workqueue.c.
+ */
+void wq_worker_waking_up(struct task_struct *task, unsigned int cpu);
+struct task_struct *wq_worker_sleeping(struct task_struct *task,
+				       unsigned int cpu);
+
+#endif /* _KERNEL_WORKQUEUE_INTERNAL_H */
diff --git a/kernel/workqueue_sched.h b/kernel/workqueue_sched.h
deleted file mode 100644
index 2d10fc9..0000000
--- a/kernel/workqueue_sched.h
+++ /dev/null
@@ -1,9 +0,0 @@
-/*
- * kernel/workqueue_sched.h
- *
- * Scheduler hooks for concurrency managed workqueue.  Only to be
- * included from sched.c and workqueue.c.
- */
-void wq_worker_waking_up(struct task_struct *task, unsigned int cpu);
-struct task_struct *wq_worker_sleeping(struct task_struct *task,
-				       unsigned int cpu);
-- 
1.8.0.2



* [PATCH 3/5] workqueue: move struct worker definition to workqueue_internal.h
  2013-01-18  3:18                                                                 ` Linus Torvalds
                                                                                     ` (2 preceding siblings ...)
  2013-01-18 22:10                                                                   ` [PATCH 2/5] workqueue: rename kernel/workqueue_sched.h to kernel/workqueue_internal.h Tejun Heo
@ 2013-01-18 22:11                                                                   ` Tejun Heo
  2013-01-18 22:11                                                                   ` [PATCH 4/5] workqueue: implement current_is_async() Tejun Heo
  2013-01-18 22:12                                                                   ` [PATCH 5/5] async, kmod: warn on synchronous request_module() from async workers Tejun Heo
  5 siblings, 0 replies; 93+ messages in thread
From: Tejun Heo @ 2013-01-18 22:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Arjan van de Ven, Ming Lei, Alex Riesen, Alan Stern, Jens Axboe,
	USB list, Linux Kernel Mailing List, Rusty Russell

From 2eaebdb33e1911c0cf3d44fd3596c42c6f502fab Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Fri, 18 Jan 2013 14:05:55 -0800

This will be used to implement an inline function to query whether
%current is a workqueue worker and, if so, allow determining which
work item it's executing.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
---
 kernel/workqueue.c          | 32 +-------------------------------
 kernel/workqueue_internal.h | 37 +++++++++++++++++++++++++++++++++++++
 2 files changed, 38 insertions(+), 31 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index b4e9206..2ffa240 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -122,37 +122,7 @@ enum {
  * W: workqueue_lock protected.
  */
 
-struct global_cwq;
-struct worker_pool;
-
-/*
- * The poor guys doing the actual heavy lifting.  All on-duty workers
- * are either serving the manager role, on idle list or on busy hash.
- */
-struct worker {
-	/* on idle list while idle, on busy hash table while busy */
-	union {
-		struct list_head	entry;	/* L: while idle */
-		struct hlist_node	hentry;	/* L: while busy */
-	};
-
-	struct work_struct	*current_work;	/* L: work being processed */
-	work_func_t		current_func;	/* L: current_work's fn */
-	struct cpu_workqueue_struct *current_cwq; /* L: current_work's cwq */
-	struct list_head	scheduled;	/* L: scheduled works */
-	struct task_struct	*task;		/* I: worker task */
-	struct worker_pool	*pool;		/* I: the associated pool */
-	/* 64 bytes boundary on 64bit, 32 on 32bit */
-	unsigned long		last_active;	/* L: last active timestamp */
-	unsigned int		flags;		/* X: flags */
-	int			id;		/* I: worker id */
-
-	/* for rebinding worker to CPU */
-	struct work_struct	rebind_work;	/* L: for busy worker */
-
-	/* used only by rescuers to point to the target workqueue */
-	struct workqueue_struct	*rescue_wq;	/* I: the workqueue to rescue */
-};
+/* struct worker is defined in workqueue_internal.h */
 
 struct worker_pool {
 	struct global_cwq	*gcwq;		/* I: the owning gcwq */
diff --git a/kernel/workqueue_internal.h b/kernel/workqueue_internal.h
index b3ea6ad..02549fa 100644
--- a/kernel/workqueue_internal.h
+++ b/kernel/workqueue_internal.h
@@ -7,6 +7,43 @@
 #ifndef _KERNEL_WORKQUEUE_INTERNAL_H
 #define _KERNEL_WORKQUEUE_INTERNAL_H
 
+#include <linux/workqueue.h>
+
+struct global_cwq;
+struct worker_pool;
+
+/*
+ * The poor guys doing the actual heavy lifting.  All on-duty workers are
+ * either serving the manager role, on idle list or on busy hash.  For
+ * details on the locking annotation (L, I, X...), refer to workqueue.c.
+ *
+ * Only to be used in workqueue and async.
+ */
+struct worker {
+	/* on idle list while idle, on busy hash table while busy */
+	union {
+		struct list_head	entry;	/* L: while idle */
+		struct hlist_node	hentry;	/* L: while busy */
+	};
+
+	struct work_struct	*current_work;	/* L: work being processed */
+	work_func_t		current_func;	/* L: current_work's fn */
+	struct cpu_workqueue_struct *current_cwq; /* L: current_work's cwq */
+	struct list_head	scheduled;	/* L: scheduled works */
+	struct task_struct	*task;		/* I: worker task */
+	struct worker_pool	*pool;		/* I: the associated pool */
+	/* 64 bytes boundary on 64bit, 32 on 32bit */
+	unsigned long		last_active;	/* L: last active timestamp */
+	unsigned int		flags;		/* X: flags */
+	int			id;		/* I: worker id */
+
+	/* for rebinding worker to CPU */
+	struct work_struct	rebind_work;	/* L: for busy worker */
+
+	/* used only by rescuers to point to the target workqueue */
+	struct workqueue_struct	*rescue_wq;	/* I: the workqueue to rescue */
+};
+
 /*
  * Scheduler hooks for concurrency managed workqueue.  Only to be used from
  * sched.c and workqueue.c.
-- 
1.8.0.2



* [PATCH 4/5] workqueue: implement current_is_async()
  2013-01-18  3:18                                                                 ` Linus Torvalds
                                                                                     ` (3 preceding siblings ...)
  2013-01-18 22:11                                                                   ` [PATCH 3/5] workqueue: move struct worker definition to workqueue_internal.h Tejun Heo
@ 2013-01-18 22:11                                                                   ` Tejun Heo
  2013-01-18 22:12                                                                   ` [PATCH 5/5] async, kmod: warn on synchronous request_module() from async workers Tejun Heo
  5 siblings, 0 replies; 93+ messages in thread
From: Tejun Heo @ 2013-01-18 22:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Arjan van de Ven, Ming Lei, Alex Riesen, Alan Stern, Jens Axboe,
	USB list, Linux Kernel Mailing List, Rusty Russell

From 84b233adcca3cacd5cfa8013a5feda7a3db4a9af Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Fri, 18 Jan 2013 14:05:56 -0800

This function queries whether %current is an async worker executing an
async item.  This will be used to implement warning on synchronous
request_module() from async workers.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/async.h       |  1 +
 kernel/async.c              | 14 ++++++++++++++
 kernel/workqueue_internal.h | 11 +++++++++++
 3 files changed, 26 insertions(+)

diff --git a/include/linux/async.h b/include/linux/async.h
index 7a24fe9..345169c 100644
--- a/include/linux/async.h
+++ b/include/linux/async.h
@@ -52,4 +52,5 @@ extern void async_synchronize_full_domain(struct async_domain *domain);
 extern void async_synchronize_cookie(async_cookie_t cookie);
 extern void async_synchronize_cookie_domain(async_cookie_t cookie,
 					    struct async_domain *domain);
+extern bool current_is_async(void);
 #endif
diff --git a/kernel/async.c b/kernel/async.c
index 9d31183..d9bf2a9 100644
--- a/kernel/async.c
+++ b/kernel/async.c
@@ -57,6 +57,8 @@ asynchronous and synchronous parts of the kernel.
 #include <linux/slab.h>
 #include <linux/workqueue.h>
 
+#include "workqueue_internal.h"
+
 static async_cookie_t next_cookie = 1;
 
 #define MAX_WORK	32768
@@ -337,3 +339,15 @@ void async_synchronize_cookie(async_cookie_t cookie)
 	async_synchronize_cookie_domain(cookie, &async_running);
 }
 EXPORT_SYMBOL_GPL(async_synchronize_cookie);
+
+/**
+ * current_is_async - is %current an async worker task?
+ *
+ * Returns %true if %current is an async worker task.
+ */
+bool current_is_async(void)
+{
+	struct worker *worker = current_wq_worker();
+
+	return worker && worker->current_func == async_run_entry_fn;
+}
diff --git a/kernel/workqueue_internal.h b/kernel/workqueue_internal.h
index 02549fa..cc35e7e 100644
--- a/kernel/workqueue_internal.h
+++ b/kernel/workqueue_internal.h
@@ -8,6 +8,7 @@
 #define _KERNEL_WORKQUEUE_INTERNAL_H
 
 #include <linux/workqueue.h>
+#include <linux/kthread.h>
 
 struct global_cwq;
 struct worker_pool;
@@ -44,6 +45,16 @@ struct worker {
 	struct workqueue_struct	*rescue_wq;	/* I: the workqueue to rescue */
 };
 
+/**
+ * current_wq_worker - return struct worker if %current is a workqueue worker
+ */
+static inline struct worker *current_wq_worker(void)
+{
+	if (current->flags & PF_WQ_WORKER)
+		return kthread_data(current);
+	return NULL;
+}
+
 /*
  * Scheduler hooks for concurrency managed workqueue.  Only to be used from
  * sched.c and workqueue.c.
-- 
1.8.0.2
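
For readers following along, the check added in patch 4/5 can be modelled
in plain userland C.  This is only a sketch under stated assumptions, not
kernel code: struct task, task_wq_worker(), task_is_async() and
other_work_fn() are hypothetical stand-ins for %current,
current_wq_worker(), current_is_async() and an unrelated work function,
and the PF_WQ_WORKER value is chosen just for the model.

```c
#include <stdbool.h>
#include <stddef.h>

#define PF_WQ_WORKER 0x20u	/* stand-in flag value for this model */

typedef void (*work_func_t)(void *);

/* Reduced model of struct worker: only the field the check reads. */
struct worker {
	work_func_t current_func;	/* function of the item being run */
};

/* Hypothetical model of a task: flags plus what kthread_data() returns. */
struct task {
	unsigned int flags;
	void *kthread_data;
};

/* Stand-ins for the async work function and some unrelated work function. */
void async_run_entry_fn(void *unused) { (void)unused; }
void other_work_fn(void *unused) { (void)unused; }

/* Mirrors current_wq_worker(): only flagged tasks carry a struct worker. */
struct worker *task_wq_worker(const struct task *t)
{
	if (t->flags & PF_WQ_WORKER)
		return t->kthread_data;
	return NULL;
}

/* Mirrors current_is_async(): a worker currently running the async fn. */
bool task_is_async(const struct task *t)
{
	struct worker *w = task_wq_worker(t);

	return w && w->current_func == async_run_entry_fn;
}
```

In this model a task counts as async only when both conditions hold: the
worker flag is set and the item being executed is the async entry
function.  A workqueue worker running any other work item is not async.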



* [PATCH 5/5] async, kmod: warn on synchronous request_module() from async workers
  2013-01-18  3:18                                                                 ` Linus Torvalds
                                                                                     ` (4 preceding siblings ...)
  2013-01-18 22:11                                                                   ` [PATCH 4/5] workqueue: implement current_is_async() Tejun Heo
@ 2013-01-18 22:12                                                                   ` Tejun Heo
  2022-06-23  5:25                                                                     ` Saravana Kannan
  5 siblings, 1 reply; 93+ messages in thread
From: Tejun Heo @ 2013-01-18 22:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Arjan van de Ven, Ming Lei, Alex Riesen, Alan Stern, Jens Axboe,
	USB list, Linux Kernel Mailing List, Rusty Russell

From 4983f3b51e18d008956dd113e0ea2f252774cefc Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Fri, 18 Jan 2013 14:05:57 -0800

Synchronous request_module() from an async worker can lead to deadlock
because module init path may invoke async_synchronize_full().  The
async worker waits for request_module() to complete and the module
loading waits for the async task to finish.  This bug happened in the
block layer because of default elevator auto-loading.

Block layer has been updated not to do default elevator auto-loading
and it has been decided to disallow synchronous request_module() from
async workers.

Trigger WARN_ON_ONCE() on synchronous request_module() from async
workers.

For more details, please refer to the following thread.

  http://thread.gmane.org/gmane.linux.kernel/1420814

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Alex Riesen <raa.lkml@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
---
 kernel/kmod.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/kernel/kmod.c b/kernel/kmod.c
index 1c317e3..ecd42b4 100644
--- a/kernel/kmod.c
+++ b/kernel/kmod.c
@@ -38,6 +38,7 @@
 #include <linux/suspend.h>
 #include <linux/rwsem.h>
 #include <linux/ptrace.h>
+#include <linux/async.h>
 #include <asm/uaccess.h>
 
 #include <trace/events/module.h>
@@ -130,6 +131,14 @@ int __request_module(bool wait, const char *fmt, ...)
 #define MAX_KMOD_CONCURRENT 50	/* Completely arbitrary value - KAO */
 	static int kmod_loop_msg;
 
+	/*
+	 * We don't allow synchronous module loading from async.  Module
+	 * init may invoke async_synchronize_full() which will end up
+	 * waiting for this task which already is waiting for the module
+	 * loading to complete, leading to a deadlock.
+	 */
+	WARN_ON_ONCE(wait && current_is_async());
+
 	va_start(args, fmt);
 	ret = vsnprintf(module_name, MODULE_NAME_LEN, fmt, args);
 	va_end(args);
-- 
1.8.0.2
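
The deadlock described in patch 5/5 is a two-party wait cycle: the async
worker waits for module loading, and module init waits for the async
worker.  A minimal userland sketch of that reasoning, assuming a
hypothetical wait-for table (waits_for[], NO_WAIT and
wait_chain_deadlocks() are illustrative names, not kernel interfaces):

```c
#include <stdbool.h>

#define NO_WAIT (-1)

/*
 * waits_for[i] is the index of the party that party i is blocked on, or
 * NO_WAIT if it is not blocked.  Returns true if following the chain
 * from @start leads back to @start, i.e. @start is part of a wait cycle.
 */
bool wait_chain_deadlocks(const int *waits_for, int n, int start)
{
	int cur = start;

	for (int steps = 0; steps < n; steps++) {
		cur = waits_for[cur];
		if (cur == NO_WAIT)
			return false;	/* chain ends, someone can progress */
		if (cur == start)
			return true;	/* came back around: deadlock */
	}
	return false;
}
```

With index 0 as the async worker and index 1 as the module loader, the
table { 1, 0 } models the reported hang and { 1, NO_WAIT } models a module
whose init does not wait on async work.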



* [PATCH v2] async: fix __lowest_in_progress()
  2013-01-16 17:19                               ` [PATCH] async: fix __lowest_in_progress() Tejun Heo
  2013-01-17 18:16                                 ` Linus Torvalds
@ 2013-01-23  0:15                                 ` Tejun Heo
  2013-01-23  0:22                                   ` Linus Torvalds
  1 sibling, 1 reply; 93+ messages in thread
From: Tejun Heo @ 2013-01-23  0:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ming Lei, Alex Riesen, Alan Stern, Jens Axboe, USB list,
	Linux Kernel Mailing List, Arjan van de Ven

083b804c4d3e1e3d0eace56bdbc0f674946d2847 ("async: use workqueue for
worker pool") made it possible that async jobs are moved from pending
to running out-of-order.  While pending async jobs will be queued and
dispatched for execution in the same order, nothing guarantees they'll
enter "1) move self to the running queue" of async_run_entry_fn() in
the same order.

Before the conversion, async implemented its own worker pool.  An
async worker, upon being woken up, fetches the first item from the
pending list, which kept the executing lists sorted.  The conversion
to workqueue was done by adding work_struct to each async_entry and
async just schedules the work item.  The queueing and dispatching of
such work items are still in order but now each worker thread is
associated with a specific async_entry and moves that specific
async_entry to the executing list.  So, depending on which worker
reaches that point earlier, which is non-deterministic, we may end up
moving an async_entry with a larger cookie before one with a smaller cookie.

This broke __lowest_in_progress().  running->domain may not be
properly sorted and is not guaranteed to contain lower cookies than
pending list when not empty.  Fix it by ensuring sort-inserting to the
running list and always looking at both pending and running when
trying to determine the lowest cookie.

Over time, the async synchronization implementation became quite
messy.  We'd better restructure it such that each async_entry is linked
to two lists - one global and one per domain - and not move it when
execution starts.  There's no reason to distinguish pending and
running.  They behave the same for synchronization purposes.

v2: Description updated to better explain why it's broken.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: stable@vger.kernel.org
---
Linus, I've updated the description to better explain why it's broken.
The code is ugly but cleanup patches are already ready, so it will be
cleaned up during 3.9-rc1.  How should this be routed?

Thanks.

 kernel/async.c |   27 ++++++++++++++++++++-------
 1 file changed, 20 insertions(+), 7 deletions(-)

--- a/kernel/async.c
+++ b/kernel/async.c
@@ -86,18 +86,27 @@ static atomic_t entry_count;
  */
 static async_cookie_t  __lowest_in_progress(struct async_domain *running)
 {
+	async_cookie_t first_running = next_cookie;	/* infinity value */
+	async_cookie_t first_pending = next_cookie;	/* ditto */
 	struct async_entry *entry;
 
+	/*
+	 * Both running and pending lists are sorted but not disjoint.
+	 * Take the first cookies from both and return the min.
+	 */
 	if (!list_empty(&running->domain)) {
 		entry = list_first_entry(&running->domain, typeof(*entry), list);
-		return entry->cookie;
+		first_running = entry->cookie;
 	}
 
-	list_for_each_entry(entry, &async_pending, list)
-		if (entry->running == running)
-			return entry->cookie;
+	list_for_each_entry(entry, &async_pending, list) {
+		if (entry->running == running) {
+			first_pending = entry->cookie;
+			break;
+		}
+	}
 
-	return next_cookie;	/* "infinity" value */
+	return min(first_running, first_pending);
 }
 
 static async_cookie_t  lowest_in_progress(struct async_domain *running)
@@ -118,13 +127,17 @@ static void async_run_entry_fn(struct wo
 {
 	struct async_entry *entry =
 		container_of(work, struct async_entry, work);
+	struct async_entry *pos;
 	unsigned long flags;
 	ktime_t uninitialized_var(calltime), delta, rettime;
 	struct async_domain *running = entry->running;
 
-	/* 1) move self to the running queue */
+	/* 1) move self to the running queue, make sure it stays sorted */
 	spin_lock_irqsave(&async_lock, flags);
-	list_move_tail(&entry->list, &running->domain);
+	list_for_each_entry_reverse(pos, &running->domain, list)
+		if (entry->cookie < pos->cookie)
+			break;
+	list_move_tail(&entry->list, &pos->list);
 	spin_unlock_irqrestore(&async_lock, flags);
 
 	/* 2) run (and print duration) */
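
To make the difference in __lowest_in_progress() concrete, here is a
small userland model of the old and new lookups.  It is a sketch only: the
lists are modelled as arrays already filtered to a single domain, and
cookie_t, lowest_old() and lowest_fixed() are hypothetical names used for
illustration.

```c
#include <stddef.h>

typedef unsigned long long cookie_t;

/* Old logic: trust the running list whenever it is non-empty. */
cookie_t lowest_old(const cookie_t *running, size_t nr_running,
		    const cookie_t *pending, size_t nr_pending,
		    cookie_t next_cookie)
{
	if (nr_running)
		return running[0];
	if (nr_pending)
		return pending[0];
	return next_cookie;	/* "infinity" value */
}

/* New logic: take the first cookie of both lists and return the min. */
cookie_t lowest_fixed(const cookie_t *running, size_t nr_running,
		      const cookie_t *pending, size_t nr_pending,
		      cookie_t next_cookie)
{
	cookie_t first_running = nr_running ? running[0] : next_cookie;
	cookie_t first_pending = nr_pending ? pending[0] : next_cookie;

	return first_running < first_pending ? first_running : first_pending;
}
```

If cookie 5 has already moved to running while cookie 3 is still pending,
the old lookup reports 5 and the new one reports 3, which is the case the
description above is concerned with.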


* Re: [PATCH v2] async: fix __lowest_in_progress()
  2013-01-23  0:15                                 ` [PATCH v2] " Tejun Heo
@ 2013-01-23  0:22                                   ` Linus Torvalds
  0 siblings, 0 replies; 93+ messages in thread
From: Linus Torvalds @ 2013-01-23  0:22 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ming Lei, Alex Riesen, Alan Stern, Jens Axboe, USB list,
	Linux Kernel Mailing List, Arjan van de Ven

On Tue, Jan 22, 2013 at 4:15 PM, Tejun Heo <tj@kernel.org> wrote:
>
> Linus, I've updated the description to better explain why it's broken.
> The code is ugly but cleanup patches are already ready, so it will be
> cleaned up during 3.9-rc1.  How should this be routed?

I'm not a huge fan of the patch and I might even have preferred the
whole cleanup, but I took it as-is.

            Linus


* [PATCH v2 2/2] block: don't request module during elevator init
  2013-01-16 21:31                                                     ` [PATCH 2/2] block: don't request module during elevator init Tejun Heo
@ 2013-01-23  0:51                                                       ` Tejun Heo
  0 siblings, 0 replies; 93+ messages in thread
From: Tejun Heo @ 2013-01-23  0:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Arjan van de Ven, Ming Lei, Alex Riesen, Alan Stern, Jens Axboe,
	USB list, Linux Kernel Mailing List, Rusty Russell

From 21c3c5d2800733b7a276725b8e1ae49a694adc1a Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Tue, 22 Jan 2013 16:48:03 -0800

Block layer allows an elevator which is built as a module to be
selected as system default via kernel param "elevator=".  This is
achieved by automatically invoking request_module() whenever a new
block device is initialized and the elevator is not available.

This led to an interesting deadlock problem involving async and module
init.  Block device probing running off an async job invokes
request_module().  While the module is being loaded, it performs
async_synchronize_full() which ends up waiting for the async job which
is already waiting for request_module() to finish, leading to
deadlock.

Invoking request_module() from deep in block device init path is
already nasty in itself.  It seems best to avoid these situations from
the beginning by moving on-demand module loading out of block init
path.

The previous patch made sure that the default elevator module is
loaded early during boot if available.  This patch removes on-demand
loading of the default elevator from elevator init path.  As the
module would have been loaded during boot, userland-visible behavior
difference should be minimal.

For more details, please refer to the following thread.

  http://thread.gmane.org/gmane.linux.kernel/1420814

v2: The bool parameter was named @request_module which conflicted with
    request_module().  This built okay w/ CONFIG_MODULES because
    request_module() was defined as a macro.  W/o CONFIG_MODULES, it
    causes build breakage.  Rename the parameter to @try_loading.
    Reported by Fengguang.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Alex Riesen <raa.lkml@gmail.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
---
Minor revision to fix build breakage on !CONFIG_MODULES reported by
Fengguang.  The patch has been queued to
wq/for-3.9-async-deadlock-fixes.

Thanks.

 block/elevator.c | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/block/elevator.c b/block/elevator.c
index c2d61d5..603b2c1 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -100,14 +100,14 @@ static void elevator_put(struct elevator_type *e)
 	module_put(e->elevator_owner);
 }
 
-static struct elevator_type *elevator_get(const char *name)
+static struct elevator_type *elevator_get(const char *name, bool try_loading)
 {
 	struct elevator_type *e;
 
 	spin_lock(&elv_list_lock);
 
 	e = elevator_find(name);
-	if (!e) {
+	if (!e && try_loading) {
 		spin_unlock(&elv_list_lock);
 		request_module("%s-iosched", name);
 		spin_lock(&elv_list_lock);
@@ -207,25 +207,30 @@ int elevator_init(struct request_queue *q, char *name)
 	q->boundary_rq = NULL;
 
 	if (name) {
-		e = elevator_get(name);
+		e = elevator_get(name, true);
 		if (!e)
 			return -EINVAL;
 	}
 
+	/*
+	 * Use the default elevator specified by config boot param or
+	 * config option.  Don't try to load modules as we could be running
+	 * off async and request_module() isn't allowed from async.
+	 */
 	if (!e && *chosen_elevator) {
-		e = elevator_get(chosen_elevator);
+		e = elevator_get(chosen_elevator, false);
 		if (!e)
 			printk(KERN_ERR "I/O scheduler %s not found\n",
 							chosen_elevator);
 	}
 
 	if (!e) {
-		e = elevator_get(CONFIG_DEFAULT_IOSCHED);
+		e = elevator_get(CONFIG_DEFAULT_IOSCHED, false);
 		if (!e) {
 			printk(KERN_ERR
 				"Default I/O scheduler not found. " \
 				"Using noop.\n");
-			e = elevator_get("noop");
+			e = elevator_get("noop", false);
 		}
 	}
 
@@ -967,7 +972,7 @@ int elevator_change(struct request_queue *q, const char *name)
 		return -ENXIO;
 
 	strlcpy(elevator_name, name, sizeof(elevator_name));
-	e = elevator_get(strstrip(elevator_name));
+	e = elevator_get(strstrip(elevator_name), true);
 	if (!e) {
 		printk(KERN_ERR "elevator: type %s not found\n", elevator_name);
 		return -EINVAL;
-- 
1.8.1
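
The effect of the new @try_loading parameter can be sketched outside the
kernel as follows.  This is a simplified model under stated assumptions:
the registry array, fake_request_module(), find_elevator() and
get_elevator() are hypothetical stand-ins for the elevator list,
request_module(), elevator_find() and elevator_get(), and locking is left
out entirely.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_ELEVATORS 8

static const char *registered[MAX_ELEVATORS];
static size_t nr_registered;
int load_attempts;	/* counts calls to the fake loader */

/* Stand-in for request_module(): "loads" by registering the name. */
void fake_request_module(const char *name)
{
	load_attempts++;
	if (nr_registered < MAX_ELEVATORS)
		registered[nr_registered++] = name;
}

/* Stand-in for elevator_find(): plain lookup, never loads anything. */
const char *find_elevator(const char *name)
{
	for (size_t i = 0; i < nr_registered; i++)
		if (strcmp(registered[i], name) == 0)
			return registered[i];
	return NULL;
}

/* Lookup that falls back to loading only when the caller allows it. */
const char *get_elevator(const char *name, bool try_loading)
{
	const char *e = find_elevator(name);

	if (!e && try_loading) {
		fake_request_module(name);
		e = find_elevator(name);
	}
	return e;
}
```

In this model the init path corresponds to get_elevator(name, false),
which can only find what is already registered, while an explicit change
request corresponds to get_elevator(name, true) and may trigger a load.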



* [PATCH v2 1/2] init, block: try to load default elevator module early during boot
  2013-01-16 21:30                                                     ` [PATCH 1/2] init, block: try to load default elevator module early during boot Tejun Heo
  2013-01-17 18:05                                                       ` Linus Torvalds
@ 2013-01-23  0:53                                                       ` Tejun Heo
  1 sibling, 0 replies; 93+ messages in thread
From: Tejun Heo @ 2013-01-23  0:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Arjan van de Ven, Ming Lei, Alex Riesen, Alan Stern, Jens Axboe,
	USB list, Linux Kernel Mailing List, Rusty Russell, Fengguang Wu

From bb813f4c933ae9f887a014483690d5f8b8ec05e1 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Fri, 18 Jan 2013 14:05:56 -0800

This patch adds default module loading and uses it to load the default
block elevator.  During boot, it's called right after initramfs or
initrd is made available and right before control is passed to
userland.  This ensures that as long as the modules are available in
the usual places in initramfs, initrd or the root filesystem, the
default modules are loaded as soon as possible.

This will replace the on-demand elevator module loading from elevator
init path.

v2: Fixed build breakage when !CONFIG_BLOCK.  Reported by kbuild test
    robot.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Alex Riesen <raa.lkml@gmail.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
---
Minor revision to fix build breakage on !CONFIG_BLOCK reported by
Fengguang.  The patch is queued to wq/for-3.9-async-deadlock-fixes.

Thanks.

 block/elevator.c         | 16 ++++++++++++++++
 include/linux/elevator.h |  5 +++++
 include/linux/init.h     |  1 +
 init/do_mounts_initrd.c  |  3 +++
 init/initramfs.c         |  8 +++++++-
 init/main.c              | 16 ++++++++++++++++
 6 files changed, 48 insertions(+), 1 deletion(-)

diff --git a/block/elevator.c b/block/elevator.c
index 9edba1b..c2d61d5 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -136,6 +136,22 @@ static int __init elevator_setup(char *str)
 
 __setup("elevator=", elevator_setup);
 
+/* called during boot to load the elevator chosen by the elevator param */
+void __init load_default_elevator_module(void)
+{
+	struct elevator_type *e;
+
+	if (!chosen_elevator[0])
+		return;
+
+	spin_lock(&elv_list_lock);
+	e = elevator_find(chosen_elevator);
+	spin_unlock(&elv_list_lock);
+
+	if (!e)
+		request_module("%s-iosched", chosen_elevator);
+}
+
 static struct kobj_type elv_ktype;
 
 static struct elevator_queue *elevator_alloc(struct request_queue *q,
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index c03af76..1866206 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -138,6 +138,7 @@ extern void elv_drain_elevator(struct request_queue *);
 /*
  * io scheduler registration
  */
+extern void __init load_default_elevator_module(void);
 extern int elv_register(struct elevator_type *);
 extern void elv_unregister(struct elevator_type *);
 
@@ -206,5 +207,9 @@ enum {
 	INIT_LIST_HEAD(&(rq)->csd.list);	\
 	} while (0)
 
+#else /* CONFIG_BLOCK */
+
+static inline void load_default_elevator_module(void) { }
+
 #endif /* CONFIG_BLOCK */
 #endif
diff --git a/include/linux/init.h b/include/linux/init.h
index a799273..9230c94 100644
--- a/include/linux/init.h
+++ b/include/linux/init.h
@@ -161,6 +161,7 @@ extern unsigned int reset_devices;
 /* used by init/main.c */
 void setup_arch(char **);
 void prepare_namespace(void);
+void __init load_default_modules(void);
 
 extern void (*late_time_init)(void);
 
diff --git a/init/do_mounts_initrd.c b/init/do_mounts_initrd.c
index 5e4ded5..dfe606a 100644
--- a/init/do_mounts_initrd.c
+++ b/init/do_mounts_initrd.c
@@ -57,6 +57,9 @@ static void __init handle_initrd(void)
 	sys_mkdir("/old", 0700);
 	sys_chdir("/old");
 
+	/* try loading default modules from initrd */
+	load_default_modules();
+
 	/*
 	 * In case that a resume from disk is carried out by linuxrc or one of
 	 * its children, we need to tell the freezer not to wait for us.
diff --git a/init/initramfs.c b/init/initramfs.c
index 84c6bf1..a67ef9d 100644
--- a/init/initramfs.c
+++ b/init/initramfs.c
@@ -592,7 +592,7 @@ static int __init populate_rootfs(void)
 			initrd_end - initrd_start);
 		if (!err) {
 			free_initrd();
-			return 0;
+			goto done;
 		} else {
 			clean_rootfs();
 			unpack_to_rootfs(__initramfs_start, __initramfs_size);
@@ -607,6 +607,7 @@ static int __init populate_rootfs(void)
 			sys_close(fd);
 			free_initrd();
 		}
+	done:
 #else
 		printk(KERN_INFO "Unpacking initramfs...\n");
 		err = unpack_to_rootfs((char *)initrd_start,
@@ -615,6 +616,11 @@ static int __init populate_rootfs(void)
 			printk(KERN_EMERG "Initramfs unpacking failed: %s\n", err);
 		free_initrd();
 #endif
+		/*
+		 * Try loading default modules from initramfs.  This gives
+		 * us a chance to load before device_initcalls.
+		 */
+		load_default_modules();
 	}
 	return 0;
 }
diff --git a/init/main.c b/init/main.c
index baf1f0f..18efadb 100644
--- a/init/main.c
+++ b/init/main.c
@@ -70,6 +70,8 @@
 #include <linux/perf_event.h>
 #include <linux/file.h>
 #include <linux/ptrace.h>
+#include <linux/blkdev.h>
+#include <linux/elevator.h>
 
 #include <asm/io.h>
 #include <asm/bugs.h>
@@ -794,6 +796,17 @@ static void __init do_pre_smp_initcalls(void)
 		do_one_initcall(*fn);
 }
 
+/*
+ * This function requests modules which should be loaded by default and is
+ * called twice right after initrd is mounted and right before init is
+ * exec'd.  If such modules are on either initrd or rootfs, they will be
+ * loaded before control is passed to userland.
+ */
+void __init load_default_modules(void)
+{
+	load_default_elevator_module();
+}
+
 static int run_init_process(const char *init_filename)
 {
 	argv_init[0] = init_filename;
@@ -898,4 +911,7 @@ static void __init kernel_init_freeable(void)
 	 * we're essentially up and running. Get rid of the
 	 * initmem segments and start the user-mode stuff..
 	 */
+
+	/* rootfs is available now, try loading default modules */
+	load_default_modules();
 }
-- 
1.8.1



* [3.8-rc3 -> 3.8-rc4 regression] Re: [PATCH] module, async: async_synchronize_full() on module init iff async is used
  2013-01-16  2:52                                       ` [PATCH] module, async: async_synchronize_full() on module init iff async is used Tejun Heo
                                                           ` (3 preceding siblings ...)
  2013-01-16 11:36                                         ` Alex Riesen
@ 2013-08-12  7:04                                         ` Jonathan Nieder
  2013-08-12 15:09                                           ` Tejun Heo
  4 siblings, 1 reply; 93+ messages in thread
From: Jonathan Nieder @ 2013-08-12  7:04 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Linus Torvalds, Ming Lei, Alex Riesen, Alan Stern, Jens Axboe,
	USB list, Linux Kernel Mailing List, Arjan van de Ven,
	Rusty Russell

Hi,

Tejun Heo wrote:

> This avoids the described deadlock because iosched module doesn't use
> async and thus wouldn't invoke async_synchronize_full().  This is
> hacky and incomplete.  It will deadlock if async module loading nests;
> however, this works around the known problem case and seems to be the
> best of bad options.
>
> For more details, please refer to the following thread.
>
>   http://thread.gmane.org/gmane.linux.kernel/1420814

My laptop fails to boot[1] with the message 'Volume group "data" not
found'.  Bisects to v3.8-rc4~17 (the above commit).  Reverting that
commit on top of current "master" (d92581fcad18, 2013-08-10) produces
a working kernel.  dmesg output from that working kernel attached.
More details, including .config, at [2].
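
For context, the rule in the quoted commit message (synchronize after
module init iff that init used async) can be modelled in user space
roughly as below.  This is only a sketch: every model_* name is
hypothetical and none of it is kernel code.  The per-task flag is an
assumption about how "async is used" might be tracked, and the nesting
case that the commit message says can still deadlock is not modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
struct model_task {
	bool used_async;   /* set when the task schedules async work */
	int  full_syncs;   /* how many times a full synchronize ran */
};

typedef void (*model_init_fn)(struct model_task *);

/* Stand-in for scheduling async work: only records that async was used. */
static void model_async_schedule(struct model_task *t)
{
	t->used_async = true;
}

/* Stand-in for async_synchronize_full(): only counts invocations. */
static void model_async_synchronize_full(struct model_task *t)
{
	t->full_syncs++;
}

/* Run a module's init and synchronize iff that init used async. */
static void model_do_init_module(struct model_task *t, model_init_fn init)
{
	t->used_async = false;
	init(t);
	if (t->used_async)
		model_async_synchronize_full(t);
}

/* An init that never schedules async work, like the iosched case. */
static void model_iosched_init(struct model_task *t)
{
	(void)t;
}

/* An init that does schedule async work. */
static void model_async_driver_init(struct model_task *t)
{
	model_async_schedule(t);
}

/* Small self-check of the two cases. */
static void model_demo(void)
{
	struct model_task t = { false, 0 };

	model_do_init_module(&t, model_iosched_init);
	assert(t.full_syncs == 0);

	model_do_init_module(&t, model_async_driver_init);
	assert(t.full_syncs == 1);
}
```

Calling model_demo() from a test program exercises both paths: the
iosched-style init triggers no synchronize, the async-using init
triggers exactly one.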

Any ideas for tracking this down?

Thanks,
Jonathan

[1] Screenshot: http://bugs.debian.org/cgi-bin/bugreport.cgi?msg=5;filename=bad_3.10.3-1.jpg;att=1;bug=719464
Screenshot in recovery mode: http://bugs.debian.org/cgi-bin/bugreport.cgi?msg=5;filename=bad_3.10.3-1_recovery.jpg;att=2;bug=719464
[2] http://bugs.debian.org/719464

[-- Attachment #2: dmesg --]
[-- Type: text/plain, Size: 46504 bytes --]

[    0.000000] Initializing cgroup subsys cpuset
[    0.000000] Initializing cgroup subsys cpu
[    0.000000] Initializing cgroup subsys cpuacct
[    0.000000] Linux version 3.11.0-rc4+ (jrn@elie) (gcc version 4.7.3 (Debian 4.7.3-6) ) #2 SMP Sun Aug 11 23:20:20 PDT 2013
[    0.000000] Command line: BOOT_IMAGE=/vmlinuz-3.11.0-rc4+ root=/dev/mapper/data-root ro quiet
[    0.000000] e820: BIOS-provided physical RAM map:
[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
[    0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000000100000-0x000000006e545fff] usable
[    0.000000] BIOS-e820: [mem 0x000000006e546000-0x000000006e745fff] ACPI NVS
[    0.000000] BIOS-e820: [mem 0x000000006e746000-0x000000006fd3efff] usable
[    0.000000] BIOS-e820: [mem 0x000000006fd3f000-0x000000006fdbefff] reserved
[    0.000000] BIOS-e820: [mem 0x000000006fdbf000-0x000000006febefff] ACPI NVS
[    0.000000] BIOS-e820: [mem 0x000000006febf000-0x000000006fef6fff] ACPI data
[    0.000000] BIOS-e820: [mem 0x000000006fef7000-0x000000006fefffff] usable
[    0.000000] BIOS-e820: [mem 0x00000000f8000000-0x00000000f8ffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000fec00000-0x00000000fec00fff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000fec10000-0x00000000fec10fff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000fee00000-0x00000000fee00fff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000ffe00000-0x00000000ffffffff] reserved
[    0.000000] NX (Execute Disable) protection: active
[    0.000000] SMBIOS 2.6 present.
[    0.000000] DMI: TOSHIBA Satellite C650D/Portable PC, BIOS 1.60 09/02/2010
[    0.000000] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
[    0.000000] e820: remove [mem 0x000a0000-0x000fffff] usable
[    0.000000] No AGP bridge found
[    0.000000] e820: last_pfn = 0x6ff00 max_arch_pfn = 0x400000000
[    0.000000] MTRR default type: uncachable
[    0.000000] MTRR fixed ranges enabled:
[    0.000000]   00000-9FFFF write-back
[    0.000000]   A0000-BFFFF uncachable
[    0.000000]   C0000-FFFFF write-through
[    0.000000] MTRR variable ranges enabled:
[    0.000000]   0 base 000000000000 mask FFFFC0000000 write-back
[    0.000000]   1 base 000040000000 mask FFFFE0000000 write-back
[    0.000000]   2 base 000060000000 mask FFFFF0000000 write-back
[    0.000000]   3 base 00006FEBE000 mask FFFFFFFFF000 uncachable
[    0.000000]   4 base 0000FFE00000 mask FFFFFFE00000 write-protect
[    0.000000]   5 disabled
[    0.000000]   6 disabled
[    0.000000]   7 disabled
[    0.000000] x86 PAT enabled: cpu 0, old 0x7040600070406, new 0x7010600070106
[    0.000000] Base memory trampoline at [ffff880000099000] 99000 size 24576
[    0.000000] Using GB pages for direct mapping
[    0.000000] init_memory_mapping: [mem 0x00000000-0x000fffff]
[    0.000000]  [mem 0x00000000-0x000fffff] page 4k
[    0.000000] BRK [0x0184c000, 0x0184cfff] PGTABLE
[    0.000000] BRK [0x0184d000, 0x0184dfff] PGTABLE
[    0.000000] BRK [0x0184e000, 0x0184efff] PGTABLE
[    0.000000] init_memory_mapping: [mem 0x6fa00000-0x6fbfffff]
[    0.000000]  [mem 0x6fa00000-0x6fbfffff] page 2M
[    0.000000] BRK [0x0184f000, 0x0184ffff] PGTABLE
[    0.000000] init_memory_mapping: [mem 0x6c000000-0x6e545fff]
[    0.000000]  [mem 0x6c000000-0x6e3fffff] page 2M
[    0.000000]  [mem 0x6e400000-0x6e545fff] page 4k
[    0.000000] BRK [0x01850000, 0x01850fff] PGTABLE
[    0.000000] init_memory_mapping: [mem 0x6e746000-0x6f9fffff]
[    0.000000]  [mem 0x6e746000-0x6e7fffff] page 4k
[    0.000000]  [mem 0x6e800000-0x6f9fffff] page 2M
[    0.000000] init_memory_mapping: [mem 0x00100000-0x6bffffff]
[    0.000000]  [mem 0x00100000-0x001fffff] page 4k
[    0.000000]  [mem 0x00200000-0x6bffffff] page 2M
[    0.000000] init_memory_mapping: [mem 0x6fc00000-0x6fd3efff]
[    0.000000]  [mem 0x6fc00000-0x6fd3efff] page 4k
[    0.000000] init_memory_mapping: [mem 0x6fef7000-0x6fefffff]
[    0.000000]  [mem 0x6fef7000-0x6fefffff] page 4k
[    0.000000] RAMDISK: [mem 0x37ab2000-0x37d50fff]
[    0.000000] ACPI: RSDP 00000000000fe020 00024 (v02 TOSINV)
[    0.000000] ACPI: XSDT 000000006fef6120 0005C (v01 TOSINV TOSINV00 00000003      01000013)
[    0.000000] ACPI: FACP 000000006fef5000 000F4 (v04 TOSINV TOSINV00 00000003 MSFT 01000013)
[    0.000000] ACPI: DSDT 000000006fee6000 0BAD7 (v01 TOSINV TOSINV00 F0000000 MSFT 01000013)
[    0.000000] ACPI: FACS 000000006fe9c000 00040
[    0.000000] ACPI: HPET 000000006fef4000 00038 (v01 TOSINV TOSINV00 00000001 MSFT 01000013)
[    0.000000] ACPI: APIC 000000006fef3000 00084 (v02 TOSINV TOSINV00 00000001 MSFT 01000013)
[    0.000000] ACPI: MCFG 000000006fef2000 0003C (v01 TOSINV TOSINV00 00000001 MSFT 01000013)
[    0.000000] ACPI: BOOT 000000006fee5000 00028 (v01 TOSINV TOSINV00 00000001 MSFT 01000013)
[    0.000000] ACPI: SLIC 000000006fee4000 00176 (v01 TOSINV TOSINV00 00000001 MSFT 01000013)
[    0.000000] ACPI: SSDT 000000006fee3000 00392 (v01 AMD    POWERNOW 00000001 AMD  00000001)
[    0.000000] ACPI: Local APIC address 0xfee00000
[    0.000000] Scanning NUMA topology in Northbridge 24
[    0.000000] No NUMA configuration found
[    0.000000] Faking a node at [mem 0x0000000000000000-0x000000006fefffff]
[    0.000000] Initmem setup node 0 [mem 0x00000000-0x6fefffff]
[    0.000000]   NODE_DATA [mem 0x6fefc000-0x6fefffff]
[    0.000000]  [ffffea0000000000-ffffea00019fffff] PMD -> [ffff88006c800000-ffff88006e1fffff] on node 0
[    0.000000] Zone ranges:
[    0.000000]   DMA      [mem 0x00001000-0x00ffffff]
[    0.000000]   DMA32    [mem 0x01000000-0xffffffff]
[    0.000000]   Normal   empty
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x00001000-0x0009efff]
[    0.000000]   node   0: [mem 0x00100000-0x6e545fff]
[    0.000000]   node   0: [mem 0x6e746000-0x6fd3efff]
[    0.000000]   node   0: [mem 0x6fef7000-0x6fefffff]
[    0.000000] On node 0 totalpages: 457446
[    0.000000]   DMA zone: 56 pages used for memmap
[    0.000000]   DMA zone: 21 pages reserved
[    0.000000]   DMA zone: 3998 pages, LIFO batch:0
[    0.000000]   DMA32 zone: 6213 pages used for memmap
[    0.000000]   DMA32 zone: 453448 pages, LIFO batch:31
[    0.000000] ACPI: PM-Timer IO Port: 0x408
[    0.000000] ACPI: Local APIC address 0xfee00000
[    0.000000] ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
[    0.000000] ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled)
[    0.000000] ACPI: LAPIC (acpi_id[0x02] lapic_id[0x00] disabled)
[    0.000000] ACPI: LAPIC (acpi_id[0x03] lapic_id[0x00] disabled)
[    0.000000] ACPI: LAPIC_NMI (acpi_id[0x00] high edge lint[0x1])
[    0.000000] ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1])
[    0.000000] ACPI: LAPIC_NMI (acpi_id[0x02] high edge lint[0x1])
[    0.000000] ACPI: LAPIC_NMI (acpi_id[0x03] high edge lint[0x1])
[    0.000000] ACPI: IOAPIC (id[0x04] address[0xfec00000] gsi_base[0])
[    0.000000] IOAPIC[0]: apic_id 4, version 33, address 0xfec00000, GSI 0-23
[    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
[    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 low level)
[    0.000000] ACPI: IRQ0 used by override.
[    0.000000] ACPI: IRQ2 used by override.
[    0.000000] ACPI: IRQ9 used by override.
[    0.000000] Using ACPI (MADT) for SMP configuration information
[    0.000000] ACPI: HPET id: 0x1002a201 base: 0xfed00000
[    0.000000] smpboot: Allowing 4 CPUs, 2 hotplug CPUs
[    0.000000] nr_irqs_gsi: 40
[    0.000000] PM: Registered nosave memory: [mem 0x0009f000-0x0009ffff]
[    0.000000] PM: Registered nosave memory: [mem 0x000a0000-0x000dffff]
[    0.000000] PM: Registered nosave memory: [mem 0x000e0000-0x000fffff]
[    0.000000] PM: Registered nosave memory: [mem 0x6e546000-0x6e745fff]
[    0.000000] PM: Registered nosave memory: [mem 0x6fd3f000-0x6fdbefff]
[    0.000000] PM: Registered nosave memory: [mem 0x6fdbf000-0x6febefff]
[    0.000000] PM: Registered nosave memory: [mem 0x6febf000-0x6fef6fff]
[    0.000000] e820: [mem 0x6ff00000-0xf7ffffff] available for PCI devices
[    0.000000] setup_percpu: NR_CPUS:512 nr_cpumask_bits:512 nr_cpu_ids:4 nr_node_ids:1
[    0.000000] PERCPU: Embedded 27 pages/cpu @ffff88006f800000 s80512 r8192 d21888 u524288
[    0.000000] pcpu-alloc: s80512 r8192 d21888 u524288 alloc=1*2097152
[    0.000000] pcpu-alloc: [0] 0 1 2 3 
[    0.000000] Built 1 zonelists in Node order, mobility grouping on.  Total pages: 451156
[    0.000000] Policy zone: DMA32
[    0.000000] Kernel command line: BOOT_IMAGE=/vmlinuz-3.11.0-rc4+ root=/dev/mapper/data-root ro quiet
[    0.000000] PID hash table entries: 4096 (order: 3, 32768 bytes)
[    0.000000] Checking aperture...
[    0.000000] No AGP bridge found
[    0.000000] Node 0: aperture @ d912000000 size 32 MB
[    0.000000] Aperture beyond 4GB. Ignoring.
[    0.000000] Memory: 1791168K/1829784K available (3365K kernel code, 591K rwdata, 1444K rodata, 856K init, 884K bss, 38616K reserved)
[    0.000000] Hierarchical RCU implementation.
[    0.000000] 	RCU dyntick-idle grace-period acceleration is enabled.
[    0.000000] 	RCU restricting CPUs from NR_CPUS=512 to nr_cpu_ids=4.
[    0.000000] NR_IRQS:33024 nr_irqs:712 16
[    0.000000] spurious 8259A interrupt: IRQ7.
[    0.000000] Console: colour VGA+ 80x25
[    0.000000] console [tty0] enabled
[    0.000000] allocated 7340032 bytes of page_cgroup
[    0.000000] please try 'cgroup_disable=memory' option if you don't want memory cgroups
[    0.000000] hpet clockevent registered
[    0.000000] tsc: Fast TSC calibration using PIT
[    0.004000] tsc: Detected 2294.162 MHz processor
[    0.000003] Calibrating delay loop (skipped), value calculated using timer frequency.. 4588.32 BogoMIPS (lpj=9176648)
[    0.000007] pid_max: default: 32768 minimum: 301
[    0.000049] Security Framework initialized
[    0.000055] AppArmor: AppArmor disabled by boot time parameter
[    0.000208] Dentry cache hash table entries: 262144 (order: 9, 2097152 bytes)
[    0.001142] Inode-cache hash table entries: 131072 (order: 8, 1048576 bytes)
[    0.001587] Mount-cache hash table entries: 256
[    0.001813] Initializing cgroup subsys memory
[    0.001835] Initializing cgroup subsys devices
[    0.001837] Initializing cgroup subsys freezer
[    0.001840] Initializing cgroup subsys net_cls
[    0.001842] Initializing cgroup subsys blkio
[    0.001844] Initializing cgroup subsys perf_event
[    0.001867] tseg: 006ff00000
[    0.001870] CPU: Physical Processor ID: 0
[    0.001871] CPU: Processor Core ID: 0
[    0.001873] mce: CPU supports 6 MCE banks
[    0.001880] LVT offset 0 assigned for vector 0xf9
[    0.001885] process: using AMD E400 aware idle routine
[    0.001888] Last level iTLB entries: 4KB 512, 2MB 16, 4MB 8
[    0.001888] Last level dTLB entries: 4KB 512, 2MB 128, 4MB 64
[    0.001888] tlb_flushall_shift: 4
[    0.001960] Freeing SMP alternatives memory: 8K (ffffffff8176b000 - ffffffff8176d000)
[    0.001962] ACPI: Core revision 20130517
[    0.001964] TOSHIBA Satellite detected - force copy of DSDT to local memory
[    0.002085] ACPI: Forced DSDT copy: length 0x0BAD7 copied locally, original unmapped
[    0.006991] ACPI: All ACPI Tables successfully acquired
[    0.011022] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
[    0.050677] smpboot: CPU0: AMD Athlon(tm) II P360 Dual-Core Processor (fam: 10, model: 06, stepping: 03)
[    0.155726] Performance Events: AMD PMU driver.
[    0.155731] ... version:                0
[    0.155733] ... bit width:              48
[    0.155734] ... generic registers:      4
[    0.155735] ... value mask:             0000ffffffffffff
[    0.155737] ... max period:             00007fffffffffff
[    0.155738] ... fixed-purpose events:   0
[    0.155739] ... event mask:             000000000000000f
[    0.156004] NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
[    0.156121] smpboot: Booting Node   0, Processors  #1
[    0.169239] Brought up 2 CPUs
[    0.169242] smpboot: Total of 2 processors activated (9176.64 BogoMIPS)
[    0.169589] process: System has AMD C1E enabled
[    0.169607] process: Switch to broadcast mode on CPU1
[    0.169729] process: Switch to broadcast mode on CPU0
[    0.173823] devtmpfs: initialized
[    0.176836] PM: Registering ACPI NVS region [mem 0x6e546000-0x6e745fff] (2097152 bytes)
[    0.176908] PM: Registering ACPI NVS region [mem 0x6fdbf000-0x6febefff] (1048576 bytes)
[    0.177161] regulator-dummy: no parameters
[    0.177238] NET: Registered protocol family 16
[    0.177339] node 0 link 0: io port [0, ffffff]
[    0.177342] TOM: 0000000080000000 aka 2048M
[    0.177345] Fam 10h mmconf [mem 0xf8000000-0xfbffffff]
[    0.177348] node 0 link 0: mmio [80000000, 8fffffff]
[    0.177350] node 0 link 0: mmio [90000000, 920fffff]
[    0.177352] node 0 link 0: mmio [92100000, 922fffff]
[    0.177354] node 0 link 0: mmio [92300000, f7ffffff]
[    0.177356] node 0 link 0: mmio [f8000000, fbffffff] ==> none
[    0.177358] node 0 link 0: mmio [fc000000, febfffff]
[    0.177360] node 0 link 0: mmio [fec00000, fffeffff]
[    0.177362] node 0 link 0: mmio [ffff0000, ffffffff]
[    0.177364] bus: [bus 00-1f] on node 0 link 0
[    0.177366] bus: 00 [io  0x0000-0xffff]
[    0.177368] bus: 00 [mem 0x80000000-0xf7ffffff]
[    0.177370] bus: 00 [mem 0xfc000000-0xfcffffffff]
[    0.177400] ACPI: bus type PCI registered
[    0.177402] acpiphp: ACPI Hot Plug PCI Controller Driver version: 0.5
[    0.177460] PCI: MMCONFIG for domain 0000 [bus 00-3f] at [mem 0xf8000000-0xfbffffff] (base 0xf8000000)
[    0.177463] PCI: MMCONFIG at [mem 0xf8000000-0xfbffffff] reserved in E820
[    0.177466] PCI: MMCONFIG for 0000 [bus00-0f] at [mem 0xf8000000-0xf8ffffff] (base 0xf8000000) (size reduced!)
[    0.178326] PCI: Using configuration type 1 for base access
[    0.178455] mtrr: your CPUs had inconsistent fixed MTRR settings
[    0.178456] mtrr: your CPUs had inconsistent variable MTRR settings
[    0.178458] mtrr: probably your BIOS does not setup all CPUs.
[    0.178459] mtrr: corrected configuration.
[    0.179115] bio: create slab <bio-0> at 0
[    0.179249] ACPI: Added _OSI(Module Device)
[    0.179252] ACPI: Added _OSI(Processor Device)
[    0.179253] ACPI: Added _OSI(3.0 _SCP Extensions)
[    0.179255] ACPI: Added _OSI(Processor Aggregator Device)
[    0.180366] ACPI: EC: Look up EC in DSDT
[    0.181561] ACPI: Executed 1 blocks of module-level executable AML code
[    0.185333] [Firmware Bug]: ACPI: BIOS _OSI(Linux) query ignored
[    0.611520] ACPI: Interpreter enabled
[    0.611531] ACPI Exception: AE_NOT_FOUND, While evaluating Sleep State [\_S1_] (20130517/hwxface-571)
[    0.611536] ACPI Exception: AE_NOT_FOUND, While evaluating Sleep State [\_S2_] (20130517/hwxface-571)
[    0.611552] ACPI: (supports S0 S3 S4 S5)
[    0.611555] ACPI: Using IOAPIC for interrupt routing
[    0.611763] PCI: Using host bridge windows from ACPI; if necessary, use "pci=nocrs" and report a bug
[    0.639420] ACPI: Power Resource [PFA1] (off)
[    0.640221] ACPI: PCI Root Bridge [PCI0] (domain 0000 [bus 00-ff])
[    0.641101] acpi PNP0A08:00: ignoring host bridge window [mem 0x000cc000-0x000cffff] (conflicts with Video ROM [mem 0x000c0000-0x000cefff])
[    0.641108] acpi PNP0A08:00: [Firmware Info]: MMCONFIG for domain 0000 [bus 00-0f] only partially covers this bridge
[    0.641193] PCI host bridge to bus 0000:00
[    0.641196] pci_bus 0000:00: root bus resource [bus 00-ff]
[    0.641199] pci_bus 0000:00: root bus resource [io  0x0000-0x0cf7]
[    0.641202] pci_bus 0000:00: root bus resource [io  0x0d00-0xffff]
[    0.641204] pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff]
[    0.641207] pci_bus 0000:00: root bus resource [mem 0x000c0000-0x000c3fff]
[    0.641209] pci_bus 0000:00: root bus resource [mem 0x000c4000-0x000c7fff]
[    0.641211] pci_bus 0000:00: root bus resource [mem 0x000c8000-0x000cbfff]
[    0.641214] pci_bus 0000:00: root bus resource [mem 0x000d0000-0x000d3fff]
[    0.641216] pci_bus 0000:00: root bus resource [mem 0x000d4000-0x000d7fff]
[    0.641218] pci_bus 0000:00: root bus resource [mem 0x000d8000-0x000dbfff]
[    0.641221] pci_bus 0000:00: root bus resource [mem 0x000dc000-0x000dffff]
[    0.641223] pci_bus 0000:00: root bus resource [mem 0x000e0000-0x000e3fff]
[    0.641225] pci_bus 0000:00: root bus resource [mem 0x000e4000-0x000e7fff]
[    0.641228] pci_bus 0000:00: root bus resource [mem 0x000e8000-0x000ebfff]
[    0.641230] pci_bus 0000:00: root bus resource [mem 0x000ec000-0x000effff]
[    0.641232] pci_bus 0000:00: root bus resource [mem 0x80000000-0xf7ffffff]
[    0.641235] pci_bus 0000:00: root bus resource [mem 0xfc000000-0xfed3ffff]
[    0.641237] pci_bus 0000:00: root bus resource [mem 0xfed45000-0xffffffff]
[    0.641251] pci 0000:00:00.0: [1022:9601] type 00 class 0x060000
[    0.641384] pci 0000:00:01.0: [1022:9602] type 01 class 0x060400
[    0.641493] pci 0000:00:04.0: [1022:9604] type 01 class 0x060400
[    0.641529] pci 0000:00:04.0: PME# supported from D0 D3hot D3cold
[    0.641615] pci 0000:00:05.0: [1022:9605] type 01 class 0x060400
[    0.641650] pci 0000:00:05.0: PME# supported from D0 D3hot D3cold
[    0.641701] pci 0000:00:05.0: System wakeup disabled by ACPI
[    0.641796] pci 0000:00:11.0: [1002:4391] type 00 class 0x010601
[    0.641818] pci 0000:00:11.0: reg 0x10: [io  0x8038-0x803f]
[    0.641828] pci 0000:00:11.0: reg 0x14: [io  0x804c-0x804f]
[    0.641837] pci 0000:00:11.0: reg 0x18: [io  0x8030-0x8037]
[    0.641847] pci 0000:00:11.0: reg 0x1c: [io  0x8048-0x804b]
[    0.641856] pci 0000:00:11.0: reg 0x20: [io  0x8010-0x801f]
[    0.641866] pci 0000:00:11.0: reg 0x24: [mem 0x92307000-0x923073ff]
[    0.642000] pci 0000:00:12.0: [1002:4397] type 00 class 0x0c0310
[    0.642013] pci 0000:00:12.0: reg 0x10: [mem 0x92306000-0x92306fff]
[    0.649649] pci 0000:00:12.0: System wakeup disabled by ACPI
[    0.649700] pci 0000:00:12.2: [1002:4396] type 00 class 0x0c0320
[    0.649718] pci 0000:00:12.2: reg 0x10: [mem 0x92307600-0x923076ff]
[    0.649796] pci 0000:00:12.2: supports D1 D2
[    0.649798] pci 0000:00:12.2: PME# supported from D0 D1 D2 D3hot
[    0.658318] pci 0000:00:12.2: System wakeup disabled by ACPI
[    0.658368] pci 0000:00:13.0: [1002:4397] type 00 class 0x0c0310
[    0.658381] pci 0000:00:13.0: reg 0x10: [mem 0x92305000-0x92305fff]
[    0.658521] pci 0000:00:13.2: [1002:4396] type 00 class 0x0c0320
[    0.658539] pci 0000:00:13.2: reg 0x10: [mem 0x92307500-0x923075ff]
[    0.658617] pci 0000:00:13.2: supports D1 D2
[    0.658619] pci 0000:00:13.2: PME# supported from D0 D1 D2 D3hot
[    0.658715] pci 0000:00:14.0: [1002:4385] type 00 class 0x0c0500
[    0.658859] pci 0000:00:14.2: [1002:4383] type 00 class 0x040300
[    0.658880] pci 0000:00:14.2: reg 0x10: [mem 0x92300000-0x92303fff 64bit]
[    0.658942] pci 0000:00:14.2: PME# supported from D0 D3hot D3cold
[    0.658994] pci 0000:00:14.2: System wakeup disabled by ACPI
[    0.659038] pci 0000:00:14.3: [1002:439d] type 00 class 0x060100
[    0.659178] pci 0000:00:14.4: [1002:4384] type 01 class 0x060400
[    0.659253] pci 0000:00:14.4: System wakeup disabled by ACPI
[    0.659302] pci 0000:00:16.0: [1002:4397] type 00 class 0x0c0310
[    0.659316] pci 0000:00:16.0: reg 0x10: [mem 0x92304000-0x92304fff]
[    0.659451] pci 0000:00:16.2: [1002:4396] type 00 class 0x0c0320
[    0.659470] pci 0000:00:16.2: reg 0x10: [mem 0x92307400-0x923074ff]
[    0.659547] pci 0000:00:16.2: supports D1 D2
[    0.659549] pci 0000:00:16.2: PME# supported from D0 D1 D2 D3hot
[    0.659645] pci 0000:00:18.0: [1022:1200] type 00 class 0x060000
[    0.659724] pci 0000:00:18.1: [1022:1201] type 00 class 0x060000
[    0.659801] pci 0000:00:18.2: [1022:1202] type 00 class 0x060000
[    0.659876] pci 0000:00:18.3: [1022:1203] type 00 class 0x060000
[    0.659955] pci 0000:00:18.4: [1022:1204] type 00 class 0x060000
[    0.660123] pci 0000:01:05.0: [1002:9712] type 00 class 0x030000
[    0.660132] pci 0000:01:05.0: reg 0x10: [mem 0x80000000-0x8fffffff pref]
[    0.660137] pci 0000:01:05.0: reg 0x14: [io  0x7000-0x70ff]
[    0.660143] pci 0000:01:05.0: reg 0x18: [mem 0x92200000-0x9220ffff]
[    0.660153] pci 0000:01:05.0: reg 0x24: [mem 0x92100000-0x921fffff]
[    0.660171] pci 0000:01:05.0: supports D1 D2
[    0.660239] pci 0000:00:01.0: PCI bridge to [bus 01]
[    0.660243] pci 0000:00:01.0:   bridge window [io  0x7000-0x7fff]
[    0.660246] pci 0000:00:01.0:   bridge window [mem 0x92100000-0x922fffff]
[    0.660251] pci 0000:00:01.0:   bridge window [mem 0x80000000-0x8fffffff 64bit pref]
[    0.660307] pci 0000:02:00.0: [168c:002b] type 00 class 0x028000
[    0.660325] pci 0000:02:00.0: reg 0x10: [mem 0x91100000-0x9110ffff 64bit]
[    0.660401] pci 0000:02:00.0: supports D1
[    0.660404] pci 0000:02:00.0: PME# supported from D0 D1 D3hot
[    0.665344] pci 0000:00:04.0: PCI bridge to [bus 02-07]
[    0.665352] pci 0000:00:04.0:   bridge window [io  0x3000-0x6fff]
[    0.665356] pci 0000:00:04.0:   bridge window [mem 0x91100000-0x920fffff]
[    0.665361] pci 0000:00:04.0:   bridge window [mem 0x90000000-0x90ffffff 64bit pref]
[    0.665455] pci 0000:08:00.0: [1969:2060] type 00 class 0x020000
[    0.665478] pci 0000:08:00.0: reg 0x10: [mem 0x91000000-0x9103ffff 64bit]
[    0.665489] pci 0000:08:00.0: reg 0x18: [io  0x2000-0x207f]
[    0.665584] pci 0000:08:00.0: PME# supported from D0 D1 D2 D3hot D3cold
[    0.673339] pci 0000:00:05.0: PCI bridge to [bus 08]
[    0.673348] pci 0000:00:05.0:   bridge window [io  0x2000-0x2fff]
[    0.673352] pci 0000:00:05.0:   bridge window [mem 0x91000000-0x910fffff]
[    0.673462] pci 0000:00:14.4: PCI bridge to [bus 09]
[    0.673488] acpi PNP0A08:00: ACPI _OSC support notification failed, disabling PCIe ASPM
[    0.673491] acpi PNP0A08:00: Unable to request _OSC control (_OSC support mask: 0x08)
[    0.674186] ACPI: PCI Interrupt Link [LNKA] (IRQs 3 4 5 7 10 11 12 14 15) *0
[    0.674263] ACPI: PCI Interrupt Link [LNKB] (IRQs 3 4 5 7 10 11 12 14 15) *0
[    0.674339] ACPI: PCI Interrupt Link [LNKC] (IRQs 3 4 5 7 10 11 12 14 15) *0
[    0.674413] ACPI: PCI Interrupt Link [LNKD] (IRQs 3 4 5 7 10 11 12 14 15) *0
[    0.674473] ACPI: PCI Interrupt Link [LNKE] (IRQs 3 4 5 7 10 11 12 14 15) *0
[    0.674522] ACPI: PCI Interrupt Link [LNKF] (IRQs 3 4 5 7 10 11 12 14 15) *0
[    0.674571] ACPI: PCI Interrupt Link [LNKG] (IRQs 3 4 5 7 10 11 12 14 15) *0
[    0.674619] ACPI: PCI Interrupt Link [LNKH] (IRQs 3 4 5 7 10 11 12 14 15) *0
[    0.674825] ACPI: Enabled 1 GPEs in block 00 to 1F
[    0.674835] ACPI: \_SB_.PCI0: notify handler is installed
[    0.674886] Found 1 acpi root devices
[    0.675035] vgaarb: device added: PCI:0000:01:05.0,decodes=io+mem,owns=io+mem,locks=none
[    0.675037] vgaarb: loaded
[    0.675039] vgaarb: bridge control possible 0000:01:05.0
[    0.675089] PCI: Using ACPI for IRQ routing
[    0.675440] PCI: pci_cache_line_size set to 64 bytes
[    0.675497] e820: reserve RAM buffer [mem 0x0009fc00-0x0009ffff]
[    0.675499] e820: reserve RAM buffer [mem 0x6e546000-0x6fffffff]
[    0.675502] e820: reserve RAM buffer [mem 0x6fd3f000-0x6fffffff]
[    0.675504] e820: reserve RAM buffer [mem 0x6ff00000-0x6fffffff]
[    0.675640] hpet0: at MMIO 0xfed00000, IRQs 2, 8, 0
[    0.675644] hpet0: 3 comparators, 32-bit 14.318180 MHz counter
[    0.677707] Switched to clocksource hpet
[    0.679522] pnp: PnP ACPI init
[    0.679544] ACPI: bus type PNP registered
[    0.679971] system 00:00: [mem 0xfec00000-0xfec00fff] could not be reserved
[    0.679974] system 00:00: [mem 0xfee00000-0xfee00fff] has been reserved
[    0.679978] system 00:00: [mem 0xf8000000-0xfbffffff] could not be reserved
[    0.679983] system 00:00: Plug and Play ACPI device, IDs PNP0c02 (active)
[    0.680164] pnp 00:01: [dma 4]
[    0.680204] pnp 00:01: Plug and Play ACPI device, IDs PNP0200 (active)
[    0.680254] pnp 00:02: Plug and Play ACPI device, IDs PNP0c04 (active)
[    0.680316] pnp 00:03: Plug and Play ACPI device, IDs PNP0b00 (active)
[    0.680351] pnp 00:04: Plug and Play ACPI device, IDs PNP0800 (active)
[    0.680396] pnp 00:05: Plug and Play ACPI device, IDs PNP0303 (active)
[    0.680452] pnp 00:06: Plug and Play ACPI device, IDs TOS0100 SYN1900 SYN0002 PNP0f13 (active)
[    0.680519] system 00:07: [io  0x0400-0x04cf] could not be reserved
[    0.680522] system 00:07: [io  0x04d0-0x04d1] has been reserved
[    0.680525] system 00:07: [io  0x04d6] has been reserved
[    0.680528] system 00:07: [io  0x0680-0x06ff] has been reserved
[    0.680531] system 00:07: [io  0x077a] has been reserved
[    0.680533] system 00:07: [io  0x0c00-0x0c01] has been reserved
[    0.680536] system 00:07: [io  0x0c14] has been reserved
[    0.680539] system 00:07: [io  0x0c50-0x0c52] has been reserved
[    0.680542] system 00:07: [io  0x0c6c] has been reserved
[    0.680544] system 00:07: [io  0x0c6f] has been reserved
[    0.680547] system 00:07: [io  0x0cd0-0x0cdb] has been reserved
[    0.680550] system 00:07: Plug and Play ACPI device, IDs PNP0c02 (active)
[    0.680633] system 00:08: [mem 0x000e0000-0x000fffff] could not be reserved
[    0.680636] system 00:08: [mem 0xffe00000-0xffffffff] has been reserved
[    0.680639] system 00:08: [mem 0xfed80000-0xfed80fff] has been reserved
[    0.680643] system 00:08: Plug and Play ACPI device, IDs PNP0c01 (active)
[    0.681021] pnp: PnP ACPI: found 9 devices
[    0.681023] ACPI: bus type PNP unregistered
[    0.688732] pci 0000:00:01.0: PCI bridge to [bus 01]
[    0.688737] pci 0000:00:01.0:   bridge window [io  0x7000-0x7fff]
[    0.688741] pci 0000:00:01.0:   bridge window [mem 0x92100000-0x922fffff]
[    0.688745] pci 0000:00:01.0:   bridge window [mem 0x80000000-0x8fffffff 64bit pref]
[    0.688749] pci 0000:00:04.0: PCI bridge to [bus 02-07]
[    0.688751] pci 0000:00:04.0:   bridge window [io  0x3000-0x6fff]
[    0.688755] pci 0000:00:04.0:   bridge window [mem 0x91100000-0x920fffff]
[    0.688758] pci 0000:00:04.0:   bridge window [mem 0x90000000-0x90ffffff 64bit pref]
[    0.688762] pci 0000:00:05.0: PCI bridge to [bus 08]
[    0.688765] pci 0000:00:05.0:   bridge window [io  0x2000-0x2fff]
[    0.688768] pci 0000:00:05.0:   bridge window [mem 0x91000000-0x910fffff]
[    0.688772] pci 0000:00:14.4: PCI bridge to [bus 09]
[    0.688793] pci 0000:00:01.0: setting latency timer to 64
[    0.689044] pci_bus 0000:00: resource 4 [io  0x0000-0x0cf7]
[    0.689047] pci_bus 0000:00: resource 5 [io  0x0d00-0xffff]
[    0.689049] pci_bus 0000:00: resource 6 [mem 0x000a0000-0x000bffff]
[    0.689052] pci_bus 0000:00: resource 7 [mem 0x000c0000-0x000c3fff]
[    0.689054] pci_bus 0000:00: resource 8 [mem 0x000c4000-0x000c7fff]
[    0.689057] pci_bus 0000:00: resource 9 [mem 0x000c8000-0x000cbfff]
[    0.689059] pci_bus 0000:00: resource 10 [mem 0x000d0000-0x000d3fff]
[    0.689062] pci_bus 0000:00: resource 11 [mem 0x000d4000-0x000d7fff]
[    0.689064] pci_bus 0000:00: resource 12 [mem 0x000d8000-0x000dbfff]
[    0.689066] pci_bus 0000:00: resource 13 [mem 0x000dc000-0x000dffff]
[    0.689069] pci_bus 0000:00: resource 14 [mem 0x000e0000-0x000e3fff]
[    0.689071] pci_bus 0000:00: resource 15 [mem 0x000e4000-0x000e7fff]
[    0.689073] pci_bus 0000:00: resource 16 [mem 0x000e8000-0x000ebfff]
[    0.689076] pci_bus 0000:00: resource 17 [mem 0x000ec000-0x000effff]
[    0.689078] pci_bus 0000:00: resource 18 [mem 0x80000000-0xf7ffffff]
[    0.689081] pci_bus 0000:00: resource 19 [mem 0xfc000000-0xfed3ffff]
[    0.689083] pci_bus 0000:00: resource 20 [mem 0xfed45000-0xffffffff]
[    0.689086] pci_bus 0000:01: resource 0 [io  0x7000-0x7fff]
[    0.689088] pci_bus 0000:01: resource 1 [mem 0x92100000-0x922fffff]
[    0.689091] pci_bus 0000:01: resource 2 [mem 0x80000000-0x8fffffff 64bit pref]
[    0.689093] pci_bus 0000:02: resource 0 [io  0x3000-0x6fff]
[    0.689096] pci_bus 0000:02: resource 1 [mem 0x91100000-0x920fffff]
[    0.689098] pci_bus 0000:02: resource 2 [mem 0x90000000-0x90ffffff 64bit pref]
[    0.689101] pci_bus 0000:08: resource 0 [io  0x2000-0x2fff]
[    0.689103] pci_bus 0000:08: resource 1 [mem 0x91000000-0x910fffff]
[    0.689177] NET: Registered protocol family 2
[    0.689381] TCP established hash table entries: 16384 (order: 6, 262144 bytes)
[    0.689514] TCP bind hash table entries: 16384 (order: 6, 262144 bytes)
[    0.689606] TCP: Hash tables configured (established 16384 bind 16384)
[    0.689671] TCP: reno registered
[    0.689679] UDP hash table entries: 1024 (order: 3, 32768 bytes)
[    0.689764] UDP-Lite hash table entries: 1024 (order: 3, 32768 bytes)
[    0.689897] NET: Registered protocol family 1
[    0.689913] pci 0000:00:01.0: MSI quirk detected; subordinate MSI disabled
[    0.691212] pci 0000:01:05.0: Boot video device
[    0.691220] PCI: CLS 64 bytes, default 64
[    0.691278] Unpacking initramfs...
[    1.063294] Freeing initrd memory: 2684K (ffff880037ab2000 - ffff880037d51000)
[    1.063634] Simple Boot Flag at 0x44 set to 0x1
[    1.063716] LVT offset 1 assigned for vector 0x400
[    1.063722] IBS: LVT offset 1 assigned
[    1.063746] perf: AMD IBS detected (0x0000001f)
[    1.064125] audit: initializing netlink socket (disabled)
[    1.064141] type=2000 audit(1376290884.944:1): initialized
[    1.080423] bounce pool size: 64 pages
[    1.080430] HugeTLB registered 2 MB page size, pre-allocated 0 pages
[    1.080964] VFS: Disk quotas dquot_6.5.2
[    1.081008] Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
[    1.081119] msgmni has been set to 3503
[    1.081427] alg: No test for stdrng (krng)
[    1.081462] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 252)
[    1.081503] io scheduler noop registered
[    1.081504] io scheduler deadline registered
[    1.081517] io scheduler cfq registered (default)
[    1.081677] pcieport 0000:00:04.0: irq 40 for MSI/MSI-X
[    1.081825] pcieport 0000:00:05.0: irq 41 for MSI/MSI-X
[    1.081975] pci_hotplug: PCI Hot Plug PCI Core version: 0.5
[    1.081997] pciehp: PCI Express Hot Plug Controller Driver version: 0.4
[    1.082113] GHES: HEST is not enabled!
[    1.082207] Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
[    1.082819] Linux agpgart interface v0.103
[    1.082925] i8042: PNP: PS/2 Controller [PNP0303:KBC0,PNP0f13:PS2M] at 0x60,0x64 irq 1,12
[    1.112145] serio: i8042 KBD port at 0x60,0x64 irq 1
[    1.112151] serio: i8042 AUX port at 0x60,0x64 irq 12
[    1.112371] mousedev: PS/2 mouse device common for all mice
[    1.112404] rtc_cmos 00:03: RTC can wake from S4
[    1.112582] rtc_cmos 00:03: rtc core: registered rtc_cmos as rtc0
[    1.112617] rtc_cmos 00:03: alarms up to one day, 114 bytes nvram, hpet irqs
[    1.112632] cpuidle: using governor ladder
[    1.112634] cpuidle: using governor menu
[    1.112707] TCP: cubic registered
[    1.113254] NET: Registered protocol family 10
[    1.113486] NET: Registered protocol family 17
[    1.113844] PM: Hibernation image not present or could not be loaded.
[    1.113859] registered taskstats version 1
[    1.114443] rtc_cmos 00:03: setting system clock to 2013-08-12 07:01:25 UTC (1376290885)
[    1.115745] Freeing unused kernel memory: 856K (ffffffff81695000 - ffffffff8176b000)
[    1.115749] Write protecting the kernel read-only data: 6144k
[    1.118871] Freeing unused kernel memory: 720K (ffff88000134c000 - ffff880001400000)
[    1.121300] Freeing unused kernel memory: 604K (ffff880001569000 - ffff880001600000)
[    1.139759] udevd[56]: starting version 175
[    1.146791] input: AT Translated Set 2 keyboard as /devices/platform/i8042/serio0/input/input0
[    1.170368] ACPI: bus type USB registered
[    1.170405] usbcore: registered new interface driver usbfs
[    1.170417] usbcore: registered new interface driver hub
[    1.174602] SCSI subsystem initialized
[    1.176040] ACPI: bus type ATA registered
[    1.184322] usbcore: registered new device driver usb
[    1.184747] ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver
[    1.184929] ehci-pci: EHCI PCI platform driver
[    1.185189] ehci-pci 0000:00:12.2: setting latency timer to 64
[    1.185209] ehci-pci 0000:00:12.2: EHCI Host Controller
[    1.185222] ehci-pci 0000:00:12.2: new USB bus registered, assigned bus number 1
[    1.185231] QUIRK: Enable AMD PLL fix
[    1.185233] ehci-pci 0000:00:12.2: applying AMD SB700/SB800/Hudson-2/3 EHCI dummy qh workaround
[    1.185249] ehci-pci 0000:00:12.2: debug port 1
[    1.185306] ehci-pci 0000:00:12.2: irq 17, io mem 0x92307600
[    1.190382] libata version 3.00 loaded.
[    1.193374] ehci-pci 0000:00:12.2: USB 2.0 started, EHCI 1.00
[    1.193411] usb usb1: New USB device found, idVendor=1d6b, idProduct=0002
[    1.193414] usb usb1: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[    1.193417] usb usb1: Product: EHCI Host Controller
[    1.193419] usb usb1: Manufacturer: Linux 3.11.0-rc4+ ehci_hcd
[    1.193421] usb usb1: SerialNumber: 0000:00:12.2
[    1.193609] hub 1-0:1.0: USB hub found
[    1.193616] hub 1-0:1.0: 5 ports detected
[    1.194098] ehci-pci 0000:00:13.2: setting latency timer to 64
[    1.194107] ehci-pci 0000:00:13.2: EHCI Host Controller
[    1.194115] ehci-pci 0000:00:13.2: new USB bus registered, assigned bus number 2
[    1.194121] ehci-pci 0000:00:13.2: applying AMD SB700/SB800/Hudson-2/3 EHCI dummy qh workaround
[    1.194133] ehci-pci 0000:00:13.2: debug port 1
[    1.194168] ehci-pci 0000:00:13.2: irq 17, io mem 0x92307500
[    1.205350] ehci-pci 0000:00:13.2: USB 2.0 started, EHCI 1.00
[    1.205385] usb usb2: New USB device found, idVendor=1d6b, idProduct=0002
[    1.205389] usb usb2: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[    1.205392] usb usb2: Product: EHCI Host Controller
[    1.205394] usb usb2: Manufacturer: Linux 3.11.0-rc4+ ehci_hcd
[    1.205396] usb usb2: SerialNumber: 0000:00:13.2
[    1.205550] hub 2-0:1.0: USB hub found
[    1.205556] hub 2-0:1.0: 5 ports detected
[    1.206000] ehci-pci 0000:00:16.2: setting latency timer to 64
[    1.206010] ehci-pci 0000:00:16.2: EHCI Host Controller
[    1.206017] ehci-pci 0000:00:16.2: new USB bus registered, assigned bus number 3
[    1.206023] ehci-pci 0000:00:16.2: applying AMD SB700/SB800/Hudson-2/3 EHCI dummy qh workaround
[    1.206034] ehci-pci 0000:00:16.2: debug port 1
[    1.206068] ehci-pci 0000:00:16.2: irq 17, io mem 0x92307400
[    1.213191] ACPI: Fan [FAN1] (off)
[    1.217381] ehci-pci 0000:00:16.2: USB 2.0 started, EHCI 1.00
[    1.217420] usb usb3: New USB device found, idVendor=1d6b, idProduct=0002
[    1.217423] usb usb3: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[    1.217426] usb usb3: Product: EHCI Host Controller
[    1.217429] usb usb3: Manufacturer: Linux 3.11.0-rc4+ ehci_hcd
[    1.217431] usb usb3: SerialNumber: 0000:00:16.2
[    1.217603] hub 3-0:1.0: USB hub found
[    1.217610] hub 3-0:1.0: 4 ports detected
[    1.218096] ahci 0000:00:11.0: version 3.0
[    1.218345] ahci 0000:00:11.0: irq 42 for MSI/MSI-X
[    1.218433] ahci 0000:00:11.0: AHCI 0001.0200 32 slots 3 ports 3 Gbps 0x7 impl SATA mode
[    1.218438] ahci 0000:00:11.0: flags: 64bit ncq sntf ilck pm led clo pmp pio slum part sxs 
[    1.220913] thermal LNXTHERM:00: registered as thermal_zone0
[    1.220917] ACPI: Thermal Zone [THZN] (53 C)
[    1.221130] scsi0 : ahci
[    1.221946] scsi1 : ahci
[    1.222325] scsi2 : ahci
[    1.222463] ata1: SATA max UDMA/133 abar m1024@0x92307000 port 0x92307100 irq 42
[    1.222467] ata2: SATA max UDMA/133 abar m1024@0x92307000 port 0x92307180 irq 42
[    1.222470] ata3: SATA max UDMA/133 abar m1024@0x92307000 port 0x92307200 irq 42
[    1.541194] ata3: SATA link down (SStatus 0 SControl 300)
[    1.713059] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[    1.713095] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[    1.715641] ata2.00: ATAPI: HL-DT-STDVDRAM GT30N, TN08, max UDMA/100
[    1.718625] ata2.00: configured for UDMA/100
[    1.761836] ata1.00: ATA-8: TOSHIBA MK3265GSXN, GH101M, max UDMA/100
[    1.761841] ata1.00: 625142448 sectors, multi 16: LBA48 NCQ (depth 31/32), AA
[    1.762720] ata1.00: configured for UDMA/100
[    1.763006] scsi 0:0:0:0: Direct-Access     ATA      TOSHIBA MK3265GS GH10 PQ: 0 ANSI: 5
[    1.765590] scsi 1:0:0:0: CD-ROM            HL-DT-ST DVDRAM GT30N     TN08 PQ: 0 ANSI: 5
[    1.771968] sd 0:0:0:0: [sda] 625142448 512-byte logical blocks: (320 GB/298 GiB)
[    1.772022] sd 0:0:0:0: [sda] Write Protect is off
[    1.772025] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
[    1.772049] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[    1.774714] sr0: scsi3-mmc drive: 24x/24x writer dvd-ram cd/rw xa/form2 cdda tray
[    1.774720] cdrom: Uniform CD-ROM driver Revision: 3.20
[    1.774896] sr 1:0:0:0: Attached scsi CD-ROM sr0
[    1.836272]  sda: sda1 sda2 sda3 < sda5 >
[    1.836951] sd 0:0:0:0: [sda] Attached SCSI disk
[    1.839130] sd 0:0:0:0: Attached scsi generic sg0 type 0
[    1.839304] sr 1:0:0:0: Attached scsi generic sg1 type 5
[    2.060799] tsc: Refined TSC clocksource calibration: 2294.254 MHz
[    2.243678] device-mapper: uevent: version 1.0.3
[    2.243750] device-mapper: ioctl: 4.25.0-ioctl (2013-06-26) initialised: dm-devel@redhat.com
[    2.348947] bio: create slab <bio-1> at 1
[    2.629501] PM: Starting manual resume from disk
[    2.629507] PM: Hibernation image partition 254:1 present
[    2.629509] PM: Looking for hibernation image.
[    2.639228] PM: Image not found (code -22)
[    2.639232] PM: Hibernation image not present or could not be loaded.
[    2.662957] EXT4-fs (dm-0): mounted filesystem with ordered data mode. Opts: (null)
[    3.060128] Switched to clocksource tsc
[    4.059917] udevd[375]: starting version 175
[    4.861125] ACPI: AC Adapter [ADP0] (on-line)
[    4.861748] ACPI: processor limited to max C-state 1
[    4.868631] piix4_smbus 0000:00:14.0: SMBus Host Controller at 0xb00, revision 0
[    4.868998] acpi-cpufreq: overriding BIOS provided _PSD data
[    4.877991] sp5100_tco: SP5100/SB800 TCO WatchDog Timer Driver v0.05
[    4.878049] sp5100_tco: PCI Revision ID: 0x42
[    4.878122] sp5100_tco: Using 0xfed80b00 for watchdog MMIO address
[    4.878135] sp5100_tco: Last reboot was not triggered by watchdog.
[    4.878187] sp5100_tco: initialized (0xffffc9000037ab00). heartbeat=60 sec (nowayout=0)
[    4.886280] microcode: CPU0: patch_level=0x010000b6
[    4.891107] ACPI: Battery Slot [BAT0] (battery present)
[    4.891378] input: Power Button as /devices/LNXSYSTM:00/device:00/PNP0C0C:00/input/input1
[    4.891393] ACPI: Power Button [PWRB]
[    4.891573] input: Lid Switch as /devices/LNXSYSTM:00/device:00/PNP0C0D:00/input/input2
[    4.893488] ACPI: Lid Switch [LID]
[    4.894026] input: Power Button as /devices/LNXSYSTM:00/LNXPWRBN:00/input/input3
[    4.894037] ACPI: Power Button [PWRF]
[    4.912777] acpi device:03: registered as cooling_device3
[    4.912801] ACPI: Video Device [VGA] (multi-head: yes  rom: no  post: no)
[    4.912865] input: Video Bus as /devices/LNXSYSTM:00/device:00/PNP0A08:00/device:01/LNXVIDEO:00/input/input4
[    4.973755] input: PC Speaker as /devices/platform/pcspkr/input/input5
[    4.976931] shpchp: Standard Hot Plug PCI Controller Driver version: 0.4
[    4.979706] ohci_hcd: USB 1.1 'Open' Host Controller (OHCI) Driver
[    4.987328] wmi: Mapper loaded
[    5.031810] toshiba_acpi: Toshiba Laptop ACPI Extras version 0.19
[    5.032469] input: Toshiba input device as /devices/virtual/input/input6
[    5.038401] microcode: failed to load file amd-ucode/microcode_amd.bin
[    5.038964] microcode: CPU1: patch_level=0x010000b6
[    5.039070] microcode: Microcode Update Driver: v2.00 <tigran@aivazian.fsnet.co.uk>, Peter Oruba
[    5.065700] [drm] Initialized drm 1.1.0 20060810
[    5.087178] ohci-pci: OHCI PCI platform driver
[    5.087421] ohci-pci 0000:00:12.0: setting latency timer to 64
[    5.087428] ohci-pci 0000:00:12.0: OHCI PCI host controller
[    5.087442] ohci-pci 0000:00:12.0: new USB bus registered, assigned bus number 4
[    5.087487] ohci-pci 0000:00:12.0: irq 18, io mem 0x92306000
[    5.146595] usb usb4: New USB device found, idVendor=1d6b, idProduct=0001
[    5.146607] usb usb4: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[    5.146615] usb usb4: Product: OHCI PCI host controller
[    5.146621] usb usb4: Manufacturer: Linux 3.11.0-rc4+ ohci_hcd
[    5.146627] usb usb4: SerialNumber: 0000:00:12.0
[    5.147948] hub 4-0:1.0: USB hub found
[    5.147964] hub 4-0:1.0: 5 ports detected
[    5.148903] ohci-pci 0000:00:13.0: setting latency timer to 64
[    5.148912] ohci-pci 0000:00:13.0: OHCI PCI host controller
[    5.148928] ohci-pci 0000:00:13.0: new USB bus registered, assigned bus number 5
[    5.148970] ohci-pci 0000:00:13.0: irq 18, io mem 0x92305000
[    5.163104] cfg80211: Calling CRDA to update world regulatory domain
[    5.206521] usb usb5: New USB device found, idVendor=1d6b, idProduct=0001
[    5.206534] usb usb5: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[    5.206542] usb usb5: Product: OHCI PCI host controller
[    5.206549] usb usb5: Manufacturer: Linux 3.11.0-rc4+ ohci_hcd
[    5.206554] usb usb5: SerialNumber: 0000:00:13.0
[    5.206947] hub 5-0:1.0: USB hub found
[    5.207001] hub 5-0:1.0: 5 ports detected
[    5.208062] ohci-pci 0000:00:16.0: setting latency timer to 64
[    5.208072] ohci-pci 0000:00:16.0: OHCI PCI host controller
[    5.208090] ohci-pci 0000:00:16.0: new USB bus registered, assigned bus number 6
[    5.208142] ohci-pci 0000:00:16.0: irq 18, io mem 0x92304000
[    5.266524] usb usb6: New USB device found, idVendor=1d6b, idProduct=0001
[    5.266536] usb usb6: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[    5.266544] usb usb6: Product: OHCI PCI host controller
[    5.266550] usb usb6: Manufacturer: Linux 3.11.0-rc4+ ohci_hcd
[    5.266555] usb usb6: SerialNumber: 0000:00:16.0
[    5.267006] hub 6-0:1.0: USB hub found
[    5.267032] hub 6-0:1.0: 4 ports detected
[    5.292584] [drm] radeon kernel modesetting enabled.
[    5.293455] [drm] initializing kernel modesetting (RS880 0x1002:0x9712 0x1179:0xFDE4).
[    5.293498] [drm] register mmio base: 0x92200000
[    5.293503] [drm] register mmio size: 65536
[    5.301098] ATOM BIOS: Tos_Berlin10AD
[    5.301149] radeon 0000:01:05.0: VRAM: 256M 0x00000000C0000000 - 0x00000000CFFFFFFF (256M used)
[    5.301157] radeon 0000:01:05.0: GTT: 512M 0x00000000A0000000 - 0x00000000BFFFFFFF
[    5.301166] [drm] Detected VRAM RAM=256M, BAR=256M
[    5.301170] [drm] RAM width 32bits DDR
[    5.301299] [TTM] Zone  kernel: Available graphics memory: 898020 kiB
[    5.301300] [TTM] Initializing pool allocator
[    5.301306] [TTM] Initializing DMA pool allocator
[    5.301333] [drm] radeon: 256M of VRAM memory ready
[    5.301335] [drm] radeon: 512M of GTT memory ready.
[    5.301347] [drm] GART: num cpu pages 131072, num gpu pages 131072
[    5.320736] [drm] Loading RS780 Microcode
[    5.338145] r600_cp: Failed to load firmware "radeon/RS780_pfp.bin"
[    5.338278] [drm:r600_startup] *ERROR* Failed to load firmware!
[    5.338331] radeon 0000:01:05.0: disabling GPU acceleration
[    5.339702] radeon 0000:01:05.0: ffff880069da9400 unpin not necessary
[    5.340300] [drm] radeon atom DIG backlight initialized
[    5.340305] [drm] Radeon Display Connectors
[    5.340309] [drm] Connector 0:
[    5.340313] [drm]   VGA-1
[    5.340319] [drm]   DDC: 0x7e40 0x7e40 0x7e44 0x7e44 0x7e48 0x7e48 0x7e4c 0x7e4c
[    5.340322] [drm]   Encoders:
[    5.340326] [drm]     CRT1: INTERNAL_KLDSCP_DAC1
[    5.340329] [drm] Connector 1:
[    5.340333] [drm]   LVDS-1
[    5.340339] [drm]   DDC: 0x7e50 0x7e50 0x7e54 0x7e54 0x7e58 0x7e58 0x7e5c 0x7e5c
[    5.340342] [drm]   Encoders:
[    5.340346] [drm]     LCD1: INTERNAL_KLDSCP_LVTMA
[    5.340377] [drm] radeon: power management initialized
[    5.733892] ath: phy0: ASPM enabled: 0x42
[    5.733903] ath: EEPROM regdomain: 0x65
[    5.733908] ath: EEPROM indicates we should expect a direct regpair map
[    5.733916] ath: Country alpha2 being used: 00
[    5.733920] ath: Regpair used: 0x65
[    5.772820] ieee80211 phy0: Selected rate control algorithm 'minstrel_ht'
[    5.773419] ieee80211 phy0: Atheros AR9285 Rev:2 mem=0xffffc900013e0000, irq=16
[    5.778400] kvm: Nested Virtualization enabled
[    5.778416] kvm: Nested Paging enabled
[    5.803078] snd_hda_intel 0000:00:14.2: setting latency timer to 64
[    5.864026] input: HDA Digital PCBeep as /devices/pci0000:00/0000:00:14.2/input/input7
[    6.251148] psmouse serio1: synaptics: Touchpad model: 1, fw: 7.2, id: 0x1c0b1, caps: 0xd04733/0xa40000/0xa0000, board id: 3655, fw id: 582762
[    6.251169] psmouse serio1: synaptics: Toshiba Satellite C650D detected, limiting rate to 40pps.
[    6.335430] [drm] fb mappable at 0x80040000
[    6.335439] [drm] vram apper at 0x80000000
[    6.335443] [drm] size 4325376
[    6.335447] [drm] fb depth is 24
[    6.335451] [drm]    pitch is 5632
[    6.335683] fbcon: radeondrmfb (fb0) is primary device
[    6.355680] input: SynPS/2 Synaptics TouchPad as /devices/platform/i8042/serio1/input/input8
[    6.410984] Console: switching to colour frame buffer device 170x48
[    6.424157] radeon 0000:01:05.0: fb0: radeondrmfb frame buffer device
[    6.424164] radeon 0000:01:05.0: registered panic notifier
[    6.424482] [drm] Initialized radeon 2.34.0 20080528 for 0000:01:05.0 on minor 0
[    8.014260] EXT4-fs (dm-0): re-mounted. Opts: (null)
[    8.139624] EXT4-fs (dm-0): re-mounted. Opts: errors=remount-ro
[    8.715272] Adding 4882428k swap on /dev/mapper/data-swap.  Priority:-1 extents:1 across:4882428k 
[    9.295763] fuse init (API version 7.22)
[    9.333131] loop: module loaded
[    9.435146] kjournald starting.  Commit interval 5 seconds
[    9.435688] EXT3-fs (sda1): using internal journal
[    9.435701] EXT3-fs (sda1): mounted filesystem with ordered data mode
[    9.483220] EXT4-fs (dm-4): mounted filesystem with ordered data mode. Opts: (null)
[    9.517596] EXT4-fs (dm-2): mounted filesystem with ordered data mode. Opts: (null)
[    9.557272] EXT4-fs (dm-3): mounted filesystem with ordered data mode. Opts: (null)
[   14.805373] input: ACPI Virtual Keyboard Device as /devices/virtual/input/input9
[   24.300277] lp: driver loaded but no devices found
[   24.307971] ppdev: user-space parallel port driver

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [3.8-rc3 -> 3.8-rc4 regression] Re: [PATCH] module, async: async_synchronize_full() on module init iff async is used
  2013-08-12  7:04                                         ` [3.8-rc3 -> 3.8-rc4 regression] " Jonathan Nieder
@ 2013-08-12 15:09                                           ` Tejun Heo
  2013-11-26 21:29                                             ` Josh Hunt
  0 siblings, 1 reply; 93+ messages in thread
From: Tejun Heo @ 2013-08-12 15:09 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Linus Torvalds, Ming Lei, Alex Riesen, Alan Stern, Jens Axboe,
	USB list, Linux Kernel Mailing List, Arjan van de Ven,
	Rusty Russell

Hello, Jonathan.

On Mon, Aug 12, 2013 at 12:04:11AM -0700, Jonathan Nieder wrote:
> My laptop fails to boot[1] with the message 'Volume group "data" not
> found'.  Bisects to v3.8-rc4~17 (the above commit).  Reverting that
> commit on top of current "master" (d92581fcad18, 2013-08-10) produces
> a working kernel.  dmesg output from that working kernel attached.
> More details, including .config, at [2].
> 
> Any ideas for tracking this down?

Which initrd / boot script are you using?  It looks like lvm assemble
scripts are running before sdX are detected leading to volume assembly
failure.  Before the patch, any module loading would end up
synchronizing async probes but after the patch modprobe invocations
which don't schedule them won't be.  Does your boot script happen to
run multiple modprobes in parallel and proceed to configure lvm
without waiting for modprobes of libata drivers to finish?
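
As a rough illustration of the alternative to treating modprobe
completion as "storage ready", a boot helper could poll for the device
node it actually needs. This is only a sketch: wait_for_path() and its
10 ms polling step are hypothetical, not part of any initrd tooling
discussed in this thread.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>
#include <time.h>

/*
 * Hypothetical helper: poll until `path` exists or roughly `timeout_ms`
 * milliseconds have passed. Returns 0 once the path is visible and -1
 * on timeout.
 */
static int wait_for_path(const char *path, long timeout_ms)
{
    const long step_ms = 10;
    struct timespec step = { .tv_sec = 0, .tv_nsec = step_ms * 1000000L };
    struct stat st;
    long waited_ms = 0;

    assert(path != NULL);

    for (;;) {
        if (stat(path, &st) == 0)
            return 0;               /* device node (or file) is present */
        if (waited_ms >= timeout_ms)
            return -1;              /* gave up waiting */
        nanosleep(&step, NULL);     /* sleep one step, then re-check */
        waited_ms += step_ms;
    }
}
```

A caller would pass the node it is about to use, for example
wait_for_path("/dev/sda1", 5000), and only continue to volume assembly
or mounting once it returns 0.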

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [3.8-rc3 -> 3.8-rc4 regression] Re: [PATCH] module, async: async_synchronize_full() on module init iff async is used
  2013-08-12 15:09                                           ` Tejun Heo
@ 2013-11-26 21:29                                             ` Josh Hunt
  2013-11-26 21:53                                               ` Linus Torvalds
  0 siblings, 1 reply; 93+ messages in thread
From: Josh Hunt @ 2013-11-26 21:29 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jonathan Nieder, Linus Torvalds, Ming Lei, Alex Riesen,
	Alan Stern, Jens Axboe, USB list, Linux Kernel Mailing List,
	Arjan van de Ven, Rusty Russell

[-- Attachment #1: Type: text/plain, Size: 2143 bytes --]

On Mon, Aug 12, 2013 at 10:09 AM, Tejun Heo <tj@kernel.org> wrote:
> Hello, Jonathan.
>
> On Mon, Aug 12, 2013 at 12:04:11AM -0700, Jonathan Nieder wrote:
>> My laptop fails to boot[1] with the message 'Volume group "data" not
>> found'.  Bisects to v3.8-rc4~17 (the above commit).  Reverting that
>> commit on top of current "master" (d92581fcad18, 2013-08-10) produces
>> a working kernel.  dmesg output from that working kernel attached.
>> More details, including .config, at [2].
>>
>> Any ideas for tracking this down?
>
> Which initrd / boot script are you using?  It looks like lvm assemble
> scripts are running before sdX are detected leading to volume assembly
> failure.  Before the patch, any module loading would end up
> synchronizing async probes but after the patch modprobe invocations
> which don't schedule them won't be.  Does your boot script happen to
> run multiple modprobes in parallel and proceed to configure lvm
> without waiting for modprobes of libata drivers to finish?
>
> Thanks.
>
> --
> tejun

I'm also hitting a regression because of this patch. When booting
3.10.20, a number of older machines with onboard SATA controllers are
unable to find their root device quickly enough. I bisected the issue
down to 774a1221e862b343388347bac9b318767336b20b. Reverting it allows my
systems to boot fine. I'm seeing this with both sata_svw and ahci. I'm
using Ubuntu precise userspace, which uses initramfs-tools
0.99ubuntu13.1. From what I can tell, the modprobes in this initrd are
done serially, but the port probing is async. This allows init to
continue on and try to mount root, but it's not there yet.

I'm attaching some serial log output with initcall_debug enabled.

Both ahci and sata_svw call ata_host_activate(), which calls
ata_host_register() and async_schedule(async_port_probe, ap).

Let me know what other information you may need.
Thanks

-- 
Josh

[-- Attachment #2: seriallog --]
[-- Type: application/octet-stream, Size: 5990 bytes --]

[    8.206940] async_waiting @ 1
[    8.210057] async_continuing @ 1 after 1 usec
[    8.215144] Freeing unused kernel memory: 1124k freed
[    8.281234] calling  ata_init+0x0/0x3de [libata] @ 1264
[    8.286615] ACPI: bus type ATA registered
[    8.298160] libata version 3.00 loaded.
[    8.302162] initcall ata_init+0x0/0x3de [libata] returned 0 after 15170 usecs
[    8.364665] calling  ahci_pci_driver_init+0x0/0x20 [ahci] @ 1264
[    8.370858] ahci 0000:00:11.0: version 3.0
[    8.375342] ahci 0000:00:11.0: AHCI 0001.0100 32 slots 6 ports 3 Gbps 0x3f impl SATA mode
[    8.383757] ahci 0000:00:11.0: flags: 64bit ncq sntf ilck pm led clo pmp pio slum part ccc 
[    8.426199] scsi0 : ahci
[    8.428983] scsi1 : ahci
[    8.431799] scsi2 : ahci
[    8.434544] scsi3 : ahci
[    8.437513] scsi4 : ahci
[    8.440247] scsi5 : ahci
[    8.442957] ata1: SATA max UDMA/133 abar m1024@0xfdff9800 port 0xfdff9900 irq 22
[    8.450598] ata2: SATA max UDMA/133 abar m1024@0xfdff9800 port 0xfdff9980 irq 22
[    8.458234] ata3: SATA max UDMA/133 abar m1024@0xfdff9800 port 0xfdff9a00 irq 22
[    8.465861] ata4: SATA max UDMA/133 abar m1024@0xfdff9800 port 0xfdff9a80 irq 22
[    8.473498] ata5: SATA max UDMA/133 abar m1024@0xfdff9800 port 0xfdff9b00 irq 22
[    8.481126] ata6: SATA max UDMA/133 abar m1024@0xfdff9800 port 0xfdff9b80 irq 22
[    8.530027] initcall ahci_pci_driver_init+0x0/0x20 [ahci] returned 0 after 155459 usecs
[    8.624448] calling  dm_init+0x0/0x48 [dm_mod] @ 1462
[    8.629697] device-mapper: ioctl: 4.24.0-ioctl (2013-01-15) initialised: dm-devel@redhat.com
[    8.638380] initcall dm_init+0x0/0x48 [dm_mod] returned 0 after 8514 usecs
[    8.672187] calling  md_init+0x0/0x16f [md_mod] @ 1468
[    8.677547] initcall md_init+0x0/0x16f [md_mod] returned 0 after 69 usecs
[    8.698175] calling  linear_init+0x0/0x12 [linear] @ 1468
[    8.703729] md: linear personality registered for level -1
[    8.709371] initcall linear_init+0x0/0x12 [linear] returned 0 after 5509 usecs
[    8.738262] calling  raid0_init+0x0/0x12 [raid0] @ 1473
[    8.743638] md: raid0 personality registered for level 0
[    8.749100] initcall raid0_init+0x0/0x12 [raid0] returned 0 after 5333 usecs
[    8.773966] calling  raid_init+0x0/0x12 [raid1] @ 1475
[    8.779261] md: raid1 personality registered for level 1
[    8.784721] initcall raid_init+0x0/0x12 [raid1] returned 0 after 5330 usecs
[    8.809389] calling  async_tx_init+0x0/0x1b [async_tx] @ 1477
[    8.815301] async_tx: api initialized (async)
[    8.819812] initcall async_tx_init+0x0/0x1b [async_tx] returned 0 after 4404 usecs
[    8.828137] ata4: SATA link down (SStatus 0 SControl 300)
[    8.841069] ata6: SATA link down (SStatus 0 SControl 300)
[    8.841118] ata5: SATA link down (SStatus 0 SControl 300)
[    8.872159] calling  calibrate_xor_blocks+0x0/0x144 [xor] @ 1477
[    8.878321] xor: measuring software checksum speed
[    8.893012]    prefetch64-sse: 11240.000 MB/sec
[    8.907013]    generic_sse: 10640.000 MB/sec
[    8.911435] xor: using function: prefetch64-sse (11240.000 MB/sec)
[    8.917765] initcall calibrate_xor_blocks+0x0/0x144 [xor] returned 0 after 38518 usecs
[    8.954420] calling  init_module+0x0/0x229 [raid6_pq] @ 1477
[    8.970052] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[    8.976737] ata1.00: ATAPI: ATAPI   iHDS118   5, RL08, max UDMA/100
[    8.977014] raid6: sse2x1    3863 MB/s
[    8.988099] ata1.00: configured for UDMA/100
[    8.994025] raid6: sse2x2    5917 MB/s
[    8.994481] scsi 0:0:0:0: CD-ROM            ATAPI    iHDS118   5      RL08 PQ: 0 ANSI: 5
[    8.994566] scsi 0:0:0:0: Attached scsi generic sg0 type 5
[    9.012065] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[    9.018420] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[    9.026256] ata3.00: ATA-8: WDC WD2502ABYS-02B7A0, 02.03B03, max UDMA/133
[    9.028023] raid6: sse2x4    7160 MB/s
[    9.028024] raid6: using algorithm sse2x4 (7160 MB/s)
[    9.028025] raid6: using intx1 recovery algorithm
[    9.028031] initcall init_module+0x0/0x229 [raid6_pq] returned 0 after 66206 usecs
[    9.028337] calling  async_pq_init+0x0/0x3b [async_pq] @ 1477
[    9.028343] initcall async_pq_init+0x0/0x3b [async_pq] returned 0 after 2 usecs
[    9.029728] calling  raid5_init+0x0/0x2c [raid456] @ 1477
[    9.029731] md: raid6 personality registered for level 6
[    9.029732] md: raid5 personality registered for level 5
[    9.029732] md: raid4 personality registered for level 4
[    9.029736] initcall raid5_init+0x0/0x2c [raid456] returned 0 after 3 usecs
[    9.146757] ata3.00: 490350672 sectors, multi 0: LBA48 NCQ (depth 31/32), AA
[    9.154167] ata2.00: ATA-8: WDC WD2502ABYS-02B7A0, 02.03B03, max UDMA/133
[    9.154169] ata2.00: 490350672 sectors, multi 0: LBA48 NCQ (depth 31/32), AA
[    9.168459] ata3.00: configured for UDMA/133
[    9.172920] ata2.00: configured for UDMA/133
[    9.173079] scsi 1:0:0:0: Direct-Access     ATA      WDC WD2502ABYS-0 02.0 PQ: 0 ANSI: 5
[    9.173173] sd 1:0:0:0: Attached scsi generic sg1 type 0
[    9.173494] scsi 2:0:0:0: Direct-Access     ATA      WDC WD2502ABYS-0 02.0 PQ: 0 ANSI: 5
[    9.173553] sd 2:0:0:0: Attached scsi generic sg2 type 0
[    9.173917] sd 2:0:0:0: [sdb] 490350672 512-byte logical blocks: (251 GB/233 GiB)
[    9.173965] sd 2:0:0:0: [sdb] Write Protect is off
[    9.173966] sd 2:0:0:0: [sdb] Mode Sense: 00 3a 00 00
[    9.174172] sd 2:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[    9.174368] sd 1:0:0:0: [sda] 490350672 512-byte logical blocks: (251 GB/233 GiB)
[    9.174443] sd 1:0:0:0: [sda] Write Protect is off
[    9.174444] sd 1:0:0:0: [sda] Mode Sense: 00 3a 00 00
[    9.174482] sd 1:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[    9.179247]  sda: sda1 sda2 sda3 sda4
[    9.179977] sd 1:0:0:0: [sda] Attached SCSI disk
[    9.184193]  sdb: sdb1 sdb2 sdb3 sdb4
[    9.184905] sd 2:0:0:0: [sdb] Attached SCSI disk

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [3.8-rc3 -> 3.8-rc4 regression] Re: [PATCH] module, async: async_synchronize_full() on module init iff async is used
  2013-11-26 21:29                                             ` Josh Hunt
@ 2013-11-26 21:53                                               ` Linus Torvalds
  2013-11-26 22:12                                                 ` Josh Hunt
  0 siblings, 1 reply; 93+ messages in thread
From: Linus Torvalds @ 2013-11-26 21:53 UTC (permalink / raw)
  To: Josh Hunt
  Cc: Tejun Heo, Jonathan Nieder, Ming Lei, Alex Riesen, Alan Stern,
	Jens Axboe, USB list, Linux Kernel Mailing List,
	Arjan van de Ven, Rusty Russell

On Tue, Nov 26, 2013 at 1:29 PM, Josh Hunt <joshhunt00@gmail.com> wrote:
>
> Both ahci and sata_svw call ata_host_activate(), which calls
> ata_host_register() and async_schedule(async_port_probe, ap).

Well, with the modern logic ("only wait for async probing if the
module itself did async probing") the ahci and svw modules didn't
really change any behavior.

But other modules did. I wonder, for example, if people insmod the dm
module, and expect all devices to exist afterwards. Which the old
logic of "we always wait for all async code regardless of whether we
started it ourselves" would do, but the new logic does not.
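
The difference between the two policies can be sketched with a toy
user-space model. Everything here is an assumption made for
illustration: the names (async_model, model_load_module and so on) are
not kernel APIs, and the real code tracks async work per task rather
than with a single counter.

```c
#include <assert.h>

/* Toy model only; none of these names exist in the kernel. */
enum sync_policy {
    SYNC_ALWAYS,    /* old logic: every module load waits for all async work */
    SYNC_IF_USED    /* new logic: wait only if this load scheduled async work */
};

struct async_model {
    int pending;    /* async probes still in flight, system-wide */
};

/* Stand-in for scheduling one async probe. */
static void model_schedule(struct async_model *m)
{
    m->pending++;
}

/* Stand-in for waiting until all async work has finished. */
static void model_synchronize_full(struct async_model *m)
{
    m->pending = 0;
}

/*
 * Model one module load that schedules `own_jobs` async probes itself.
 * Returns how many probes are still pending when the load returns.
 */
static int model_load_module(struct async_model *m,
                             enum sync_policy policy, int own_jobs)
{
    assert(own_jobs >= 0);

    for (int i = 0; i < own_jobs; i++)
        model_schedule(m);

    if (policy == SYNC_ALWAYS || own_jobs > 0)
        model_synchronize_full(m);

    return m->pending;
}
```

With two probes already in flight from elsewhere, loading a module that
schedules nothing itself (the dm case) returns 0 pending under
SYNC_ALWAYS but leaves both probes pending under SYNC_IF_USED.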

Something similar might hit the (non-modular) md auto-detect ioctl.

So maybe we should just special-case those two issues, and say "let's
just wait for async requests here"

Something like the appended (whitespace-damaged) diff. Does that make
a difference to you guys? And if it does, can you check *which* of the
two async_synchronize_full() calls it is that matters for your cases?

                 Linus

--- duh, apply by hand --

    diff --git a/drivers/md/dm.c b/drivers/md/dm.c
    index 0704c523a76b..7e7a2f743b11 100644
    --- a/drivers/md/dm.c
    +++ b/drivers/md/dm.c
    @@ -351,6 +351,7 @@ static int __init dm_init(void)
                            goto bad;
            }

    +       async_synchronize_full();
            return 0;

           bad:
    diff --git a/drivers/md/md.c b/drivers/md/md.c
    index b6b7a2866c9e..1d173dc662fc 100644
    --- a/drivers/md/md.c
    +++ b/drivers/md/md.c
    @@ -8602,6 +8602,7 @@ static void autostart_arrays(int part)
            i_scanned = 0;
            i_passed = 0;

    +       async_synchronize_full();
            printk(KERN_INFO "md: Autodetecting RAID arrays.\n");

            while (!list_empty(&all_detected_devices) && i_scanned < INT_MAX) {

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [3.8-rc3 -> 3.8-rc4 regression] Re: [PATCH] module, async: async_synchronize_full() on module init iff async is used
  2013-11-26 21:53                                               ` Linus Torvalds
@ 2013-11-26 22:12                                                 ` Josh Hunt
  2013-11-26 22:29                                                   ` Tejun Heo
  2013-11-26 22:30                                                   ` Linus Torvalds
  0 siblings, 2 replies; 93+ messages in thread
From: Josh Hunt @ 2013-11-26 22:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Tejun Heo, Jonathan Nieder, Ming Lei, Alex Riesen, Alan Stern,
	Jens Axboe, USB list, Linux Kernel Mailing List,
	Arjan van de Ven, Rusty Russell

On Tue, Nov 26, 2013 at 3:53 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Tue, Nov 26, 2013 at 1:29 PM, Josh Hunt <joshhunt00@gmail.com> wrote:
>>
>> Both ahci and sata_svw call ata_host_activate(), which calls
>> ata_host_register() and async_schedule(async_port_probe, ap).
>
> Well, with the modern logic ("only wait for async probing if the
> module itself did async probing") the ahci and svw modules didn't
> really change any behavior.
>
> But other modules did. I wonder, for example, if people insmod the dm
> module, and expect all devices to exist afterwards. Which the old
> logic of "we always wait for all async code regardless of whether we
> started it ourselves" would do, but the new logic does not.
>
> Something similar might hit the (non-modular) md auto-detect ioctl.
>
> So maybe we should just special-case those two issues, and say "let's
> just wait for async requests here"
>
> Something like the appended (whitespace-damaged) diff. Does that make
> a difference to you guys? And if it does, can you check *which* of the
> two async_synchronize_full() calls it is that matters for your cases?
>
>                  Linus
>
> --- duh, apply by hand --
>
>     diff --git a/drivers/md/dm.c b/drivers/md/dm.c
>     index 0704c523a76b..7e7a2f743b11 100644
>     --- a/drivers/md/dm.c
>     +++ b/drivers/md/dm.c
>     @@ -351,6 +351,7 @@ static int __init dm_init(void)
>                             goto bad;
>             }
>
>     +       async_synchronize_full();
>             return 0;
>
>            bad:
>     diff --git a/drivers/md/md.c b/drivers/md/md.c
>     index b6b7a2866c9e..1d173dc662fc 100644
>     --- a/drivers/md/md.c
>     +++ b/drivers/md/md.c
>     @@ -8602,6 +8602,7 @@ static void autostart_arrays(int part)
>             i_scanned = 0;
>             i_passed = 0;
>
>     +       async_synchronize_full();
>             printk(KERN_INFO "md: Autodetecting RAID arrays.\n");
>
>             while (!list_empty(&all_detected_devices) && i_scanned < INT_MAX) {

I should have clarified that I'm not using dm/md in my setup. I know
the modules are getting loaded in the log I attached, but root is not
a md/dm device.

-- 
Josh

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [3.8-rc3 -> 3.8-rc4 regression] Re: [PATCH] module, async: async_synchronize_full() on module init iff async is used
  2013-11-26 22:12                                                 ` Josh Hunt
@ 2013-11-26 22:29                                                   ` Tejun Heo
  2013-12-03 14:28                                                     ` Josh Hunt
  2013-11-26 22:30                                                   ` Linus Torvalds
  1 sibling, 1 reply; 93+ messages in thread
From: Tejun Heo @ 2013-11-26 22:29 UTC (permalink / raw)
  To: Josh Hunt
  Cc: Linus Torvalds, Jonathan Nieder, Ming Lei, Alex Riesen,
	Alan Stern, Jens Axboe, USB list, Linux Kernel Mailing List,
	Arjan van de Ven, Rusty Russell

Hello,

On Tue, Nov 26, 2013 at 04:12:41PM -0600, Josh Hunt wrote:
> I should have clarified that I'm not using dm/md in my setup. I know
> the modules are getting loaded in the log I attached, but root is not
> a md/dm device.

Can you please still try it?  The init script is broken and we're now
just trying to restore just enough of the old behavior so that the
issue is not exposed.  The boot script in use seems to load md/dm
modules after storage drivers and use their termination as the signal
for "storage ready", so it could be a good enough bandaid even if
you're not using dm/md.

Thanks.

-- 
tejun

* Re: [3.8-rc3 -> 3.8-rc4 regression] Re: [PATCH] module, async: async_synchronize_full() on module init iff async is used
  2013-11-26 22:12                                                 ` Josh Hunt
  2013-11-26 22:29                                                   ` Tejun Heo
@ 2013-11-26 22:30                                                   ` Linus Torvalds
  1 sibling, 0 replies; 93+ messages in thread
From: Linus Torvalds @ 2013-11-26 22:30 UTC (permalink / raw)
  To: Josh Hunt
  Cc: Tejun Heo, Jonathan Nieder, Ming Lei, Alex Riesen, Alan Stern,
	Jens Axboe, USB list, Linux Kernel Mailing List,
	Arjan van de Ven, Rusty Russell

On Tue, Nov 26, 2013 at 2:12 PM, Josh Hunt <joshhunt00@gmail.com> wrote:
>
> I should have clarified that I'm not using dm/md in my setup. I know
> the modules are getting loaded in the log I attached, but root is not
> a md/dm device.

Hmm. The initcall debugging doesn't actually show any of the "wait for
async events", because those debug messages come from
"do_one_initcall()", and the waiting happens later. Plus your messages
don't actually show where you are trying, and failing, to mount the
root filesystem.

Without that kind of information, it's kind of hard to guess. Maybe
you could add a few printk's to your kernel? Add one to
do_init_module() *after* the

        if (current->flags & PF_USED_ASYNC)
                async_synchronize_full();

thing, and another to do_mount() in fs/namespace.c (just put something
like

        printk("do_mount: %s at %s\n", dev_name, dir_name);

or whatever), so that we can see when that happens.

              Linus

* Re: [3.8-rc3 -> 3.8-rc4 regression] Re: [PATCH] module, async: async_synchronize_full() on module init iff async is used
  2013-11-26 22:29                                                   ` Tejun Heo
@ 2013-12-03 14:28                                                     ` Josh Hunt
  2013-12-03 15:19                                                       ` Tejun Heo
  0 siblings, 1 reply; 93+ messages in thread
From: Josh Hunt @ 2013-12-03 14:28 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Linus Torvalds, Jonathan Nieder, Ming Lei, Alex Riesen,
	Alan Stern, Jens Axboe, USB list, Linux Kernel Mailing List,
	Arjan van de Ven, Rusty Russell

[-- Attachment #1: Type: text/plain, Size: 1733 bytes --]

On Tue, Nov 26, 2013 at 4:29 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello,
>
> On Tue, Nov 26, 2013 at 04:12:41PM -0600, Josh Hunt wrote:
>> I should have clarified that I'm not using dm/md in my setup. I know
>> the modules are getting loaded in the log I attached, but root is not
>> a md/dm device.
>
> Can you please still try it?  The init script is broken and we're now
> just trying to restore just enough of the old behavior so that the
> issue is not exposed.  The boot script in use seems to load md/dm
> modules after storage drivers and use their termination as the signal
> for "storage ready", so it could be a good enough bandaid even if
> you're not using dm/md.
>
> Thanks.
>
> --
> tejun

Tejun

You're right. Thanks for pointing this out. I did not realize there
was a bug in the init script. The version of initramfs-tools I was
using had the following bug:
https://bugs.launchpad.net/ubuntu/+source/initramfs-tools/+bug/1215911

Updating to 0.99ubuntu13.4 of initramfs-tools resolved my boot hangs.

I did try using the workaround as suggested by Linus. In my setup the
dm_init() code was hit; however, it still appeared to be too late at
times. I also tried moving the call to async_synchronize_full() above
the for loop and it still had the same issue (patch attached). Out of
around 10 reboot tests it failed to find root 1 or 2 times.

The Ubuntu scripts don't ever actually call do_mount() if they can't
find the device. They seem to rely on some udev functionality to tell
them when the device is present, and if that fails they just bail out.

This change has introduced a regression. However, I only noticed it
b/c my init script had a bug which caused it not to wait around for
the device to appear.
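
For illustration, "waiting around for the device to appear" can be
sketched in userspace C as a poll with a deadline. This is only a
minimal sketch under my own assumptions, not the initramfs-tools code:
wait_for_path() and its parameters are hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <sys/stat.h>
#include <time.h>

/*
 * Hypothetical helper: poll until `path` exists or roughly `timeout_ms`
 * milliseconds have been slept.  Returns 0 once the path is present,
 * -1 on timeout.
 */
int wait_for_path(const char *path, long timeout_ms, long interval_ms)
{
	struct stat st;
	long waited_ms = 0;

	assert(interval_ms > 0);
	for (;;) {
		if (stat(path, &st) == 0)
			return 0;
		if (waited_ms >= timeout_ms)
			return -1;

		struct timespec ts = {
			.tv_sec = interval_ms / 1000,
			.tv_nsec = (interval_ms % 1000) * 1000000L,
		};
		/* Resume the sleep with the remaining time if a signal interrupts it. */
		while (nanosleep(&ts, &ts) == -1 && errno == EINTR)
			;
		waited_ms += interval_ms;
	}
}
```

A caller would do something like wait_for_path("/dev/sda1", 30000, 100)
before trying to mount; the device path and the timings here are
made-up values.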

-- 
Josh

[-- Attachment #2: async-dbg.patch --]
[-- Type: text/x-patch, Size: 1773 bytes --]

Index: linux-3.10/drivers/md/dm.c
===================================================================
--- linux-3.10.orig/drivers/md/dm.c
+++ linux-3.10/drivers/md/dm.c
@@ -16,12 +16,13 @@
 #include <linux/bio.h>
 #include <linux/mempool.h>
 #include <linux/slab.h>
 #include <linux/idr.h>
 #include <linux/hdreg.h>
 #include <linux/delay.h>
+#include <linux/async.h>
 
 #include <trace/events/block.h>
 
 #define DM_MSG_PREFIX "core"
 
 #ifdef CONFIG_PRINTK
@@ -275,12 +276,15 @@ static void (*_exits[])(void) = {
 static int __init dm_init(void)
 {
 	const int count = ARRAY_SIZE(_inits);
 
 	int r, i;
 
+	printk(KERN_CRIT "DBG: %s: Calling async_synchronize_full();\n", __func__);
+	async_synchronize_full();
+
 	for (i = 0; i < count; i++) {
 		r = _inits[i]();
 		if (r)
 			goto bad;
 	}
 
Index: linux-3.10/drivers/md/md.c
===================================================================
--- linux-3.10.orig/drivers/md/md.c
+++ linux-3.10/drivers/md/md.c
@@ -48,12 +48,13 @@
 #include <linux/file.h>
 #include <linux/compat.h>
 #include <linux/delay.h>
 #include <linux/raid/md_p.h>
 #include <linux/raid/md_u.h>
 #include <linux/slab.h>
+#include <linux/async.h>
 #include "md.h"
 #include "bitmap.h"
 
 #ifndef MODULE
 static void autostart_arrays(int part);
 #endif
@@ -8573,12 +8574,14 @@ static void autostart_arrays(int part)
 	dev_t dev;
 	int i_scanned, i_passed;
 
 	i_scanned = 0;
 	i_passed = 0;
 
+	printk(KERN_CRIT "DBG: %s: Calling async_synchronize_full()\n", __func__);
+	async_synchronize_full();
 	printk(KERN_INFO "md: Autodetecting RAID arrays.\n");
 
 	while (!list_empty(&all_detected_devices) && i_scanned < INT_MAX) {
 		i_scanned++;
 		node_detected_dev = list_entry(all_detected_devices.next,
 					struct detected_devices_node, list);

* Re: [3.8-rc3 -> 3.8-rc4 regression] Re: [PATCH] module, async: async_synchronize_full() on module init iff async is used
  2013-12-03 14:28                                                     ` Josh Hunt
@ 2013-12-03 15:19                                                       ` Tejun Heo
  2013-12-04 23:01                                                         ` Josh Hunt
  0 siblings, 1 reply; 93+ messages in thread
From: Tejun Heo @ 2013-12-03 15:19 UTC (permalink / raw)
  To: Josh Hunt
  Cc: Linus Torvalds, Jonathan Nieder, Ming Lei, Alex Riesen,
	Alan Stern, Jens Axboe, USB list, Linux Kernel Mailing List,
	Arjan van de Ven, Rusty Russell

Hello,

On Tue, Dec 03, 2013 at 08:28:43AM -0600, Josh Hunt wrote:
> You're right. Thanks for pointing this out. I did not realize there
> was a bug in the init script. The version of initramfs-tools I was
> using had the following bug:
> https://bugs.launchpad.net/ubuntu/+source/initramfs-tools/+bug/1215911
> 
> Updating to 0.99ubuntu13.4 of initramfs-tools resolved my boot hangs.
> 
> I did try using the workaround as suggested by Linus. In my setup the
> dm_init() code was hit, however it still appeared to be too late at
> times. I also tried moving the call to async_synchronize_full() above
> the for loop and it still had the same issue (patch attached.) Out of
> around 10 reboot tests it failed to find root 1 or 2 times.
> 
> The ubuntu scripts don't ever actually call do_mount() if it can't
> find the device. It seems to rely on some udev functionality to tell
> it when the device is present, and if that fails it just bails out.
> 
> This change has introduced a regression. However, I only noticed it
> b/c my init script had a bug which caused it not to wait around for
> the device to appear.

Hmmm.... so, I read the bug report, dug and asked around a bit.
Here's the root problem - Ubuntu's initramfs uses a tool to wait for
the root device, and that tool uses libudev to listen for the device
event; unfortunately, its rx buffer is not set large enough and the
receiver isn't fast enough, which means that netlink broadcast
messages from the kernel can overrun the buffer.  When that happens,
the kernel sets an error on the socket, so the next recv fails with
-ENOBUFS.  If that happens, the wait for root aborts immediately and
initramfs proceeds to mount a non-existent root device.

The only thing these patches change is the timing of events.  The
problem likely wasn't as exposed before because things were slow
enough that either the messages could be consumed in time or there
was enough delay between the libata module load and the root device
wait to hide the bug in the wait logic.

So, yeah, it's a full blown timing bug.  I'm not sure what we can do
to work around it from the kernel side except for randomly slowing
things down or forcefully enlarging the rx buffer size.  There really
is no interlocking to take advantage of. :(

Thanks.

-- 
tejun

* Re: [3.8-rc3 -> 3.8-rc4 regression] Re: [PATCH] module, async: async_synchronize_full() on module init iff async is used
  2013-12-03 15:19                                                       ` Tejun Heo
@ 2013-12-04 23:01                                                         ` Josh Hunt
  2013-12-04 23:12                                                           ` Tejun Heo
  0 siblings, 1 reply; 93+ messages in thread
From: Josh Hunt @ 2013-12-04 23:01 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Linus Torvalds, Jonathan Nieder, Ming Lei, Alex Riesen,
	Alan Stern, Jens Axboe, USB list, Linux Kernel Mailing List,
	Arjan van de Ven, Rusty Russell

[-- Attachment #1: Type: text/plain, Size: 2733 bytes --]

On Tue, Dec 3, 2013 at 9:19 AM, Tejun Heo <tj@kernel.org> wrote:
> Hello,
>
> On Tue, Dec 03, 2013 at 08:28:43AM -0600, Josh Hunt wrote:
>> You're right. Thanks for pointing this out. I did not realize there
>> was a bug in the init script. The version of initramfs-tools I was
>> using had the following bug:
>> https://bugs.launchpad.net/ubuntu/+source/initramfs-tools/+bug/1215911
>>
>> Updating to 0.99ubuntu13.4 of initramfs-tools resolved my boot hangs.
>>
>> I did try using the workaround as suggested by Linus. In my setup the
>> dm_init() code was hit, however it still appeared to be too late at
>> times. I also tried moving the call to async_synchronize_full() above
>> the for loop and it still had the same issue (patch attached.) Out of
>> around 10 reboot tests it failed to find root 1 or 2 times.
>>
>> The ubuntu scripts don't ever actually call do_mount() if it can't
>> find the device. It seems to rely on some udev functionality to tell
>> it when the device is present, and if that fails it just bails out.
>>
>> This change has introduced a regression. However, I only noticed it
>> b/c my init script had a bug which caused it not to wait around for
>> the device to appear.
>
> Hmmm.... so, read the bug report, dug and asked around a bit.
> Here's the root problem - ubuntu's initramfs uses a tool to wait for
> the root device which uses libudev to listen for the device event;
> unfortunately, its rx buffer is not set large enough and the receiver
> isn't fast enough, which means that netlink broadcast messages from
> the kernel can overrun the buffer.  When that happens, it sets an
> error on the socket, so the next recv fails with -ENOBUFS.  If that
> happens, the wait for root aborts immediately and initramfs proceeds
> to mount non-existent root device.
>
> The only thing which changes by these patches is the timing of events.
> The problem likely wasn't as exposed before because things were slow
> enough so that either the messages could be consumed fast enough or
> there's enough delay between libata module load and the root device
> wait hiding the bug in the wait logic.
>
> So, yeah, it's a full blown timing bug.  I'm not sure what we can do
> to work around from kernel side except for randomly slowing things
> down or forcefully enlarging rx buffer size.  There really is no
> interlocking to take advantage of. :(

So there used to be a call to async_synchronize_full() in
ata_host_register(), but it was removed by
f29d3b23238e1955a8094e038c72546e99308e61 as part of some fastboot
changes. Adding it back (in the attached patch) seems to resolve the
issue when using the broken initrd. I'm guessing adding it back isn't
an option, but I wanted to point it out.

-- 
Josh

[-- Attachment #2: dbg-ata.patch --]
[-- Type: text/x-patch, Size: 520 bytes --]

Index: b/drivers/ata/libata-core.c
===================================================================
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -6181,12 +6181,14 @@ int ata_host_register(struct ata_host *h
 	/* perform each probe asynchronously */
 	for (i = 0; i < host->n_ports; i++) {
 		struct ata_port *ap = host->ports[i];
 		async_schedule(async_port_probe, ap);
 	}
 
+	async_synchronize_full();
+
 	return 0;
 
  err_tadd:
 	while (--i >= 0) {
 		ata_tport_delete(host->ports[i]);
 	}

* Re: [3.8-rc3 -> 3.8-rc4 regression] Re: [PATCH] module, async: async_synchronize_full() on module init iff async is used
  2013-12-04 23:01                                                         ` Josh Hunt
@ 2013-12-04 23:12                                                           ` Tejun Heo
  0 siblings, 0 replies; 93+ messages in thread
From: Tejun Heo @ 2013-12-04 23:12 UTC (permalink / raw)
  To: Josh Hunt
  Cc: Linus Torvalds, Jonathan Nieder, Ming Lei, Alex Riesen,
	Alan Stern, Jens Axboe, USB list, Linux Kernel Mailing List,
	Arjan van de Ven, Rusty Russell

Hello, Josh.

On Wed, Dec 04, 2013 at 05:01:53PM -0600, Josh Hunt wrote:
> So there used to be a call to async_synchronize_full() in
> ata_host_register(), but it was removed by
> f29d3b23238e1955a8094e038c72546e99308e61 as part of some fastboot
> changes. Adding it back (in the attached patch) seems to resolve the
> issue when using the broken initrd. I'm guessing adding it back isn't
> an option, but I wanted to point it out.

The problem is that this really isn't the root cause.  There's no
real interlocking there and the problem gets hidden just because
things are slower.  Putting ssleep(2) there would work the same, so
the true issue is "the kernel is faster with probing now", which is
somewhat challenging to rectify.  :(

Thanks.

-- 
tejun

* Re: [PATCH 5/5] async, kmod: warn on synchronous request_module() from async workers
  2013-01-18 22:12                                                                   ` [PATCH 5/5] async, kmod: warn on synchronous request_module() from async workers Tejun Heo
@ 2022-06-23  5:25                                                                     ` Saravana Kannan
  0 siblings, 0 replies; 93+ messages in thread
From: Saravana Kannan @ 2022-06-23  5:25 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Linus Torvalds, Arjan van de Ven, Ming Lei, Alex Riesen,
	Alan Stern, Jens Axboe, USB list, Linux Kernel Mailing List,
	Rusty Russell, Marek Szyprowski

On Fri, Jan 18, 2013 at 2:12 PM Tejun Heo <tj@kernel.org> wrote:
>
> From 4983f3b51e18d008956dd113e0ea2f252774cefc Mon Sep 17 00:00:00 2001
> From: Tejun Heo <tj@kernel.org>
> Date: Fri, 18 Jan 2013 14:05:57 -0800
>
> Synchronous request_module() from an async worker can lead to deadlock
> because module init path may invoke async_synchronize_full().  The
> async worker waits for request_module() to complete and the module
> loading waits for the async task to finish.  This bug happened in the
> block layer because of default elevator auto-loading.
>
> Block layer has been updated not to do default elevator auto-loading
> and it has been decided to disallow synchronous request_module() from
> async workers.
>
> Trigger WARN_ON_ONCE() on synchronous request_module() from async
> workers.
>
> For more details, please refer to the following thread.
>
>   http://thread.gmane.org/gmane.linux.kernel/1420814
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reported-by: Alex Riesen <raa.lkml@gmail.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Arjan van de Ven <arjan@linux.intel.com>
> ---
>  kernel/kmod.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
>
> diff --git a/kernel/kmod.c b/kernel/kmod.c
> index 1c317e3..ecd42b4 100644
> --- a/kernel/kmod.c
> +++ b/kernel/kmod.c
> @@ -38,6 +38,7 @@
>  #include <linux/suspend.h>
>  #include <linux/rwsem.h>
>  #include <linux/ptrace.h>
> +#include <linux/async.h>
>  #include <asm/uaccess.h>
>
>  #include <trace/events/module.h>
> @@ -130,6 +131,14 @@ int __request_module(bool wait, const char *fmt, ...)
>  #define MAX_KMOD_CONCURRENT 50 /* Completely arbitrary value - KAO */
>         static int kmod_loop_msg;
>
> +       /*
> +        * We don't allow synchronous module loading from async.  Module
> +        * init may invoke async_synchronize_full() which will end up
> +        * waiting for this task which already is waiting for the module
> +        * loading to complete, leading to a deadlock.
> +        */
> +       WARN_ON_ONCE(wait && current_is_async());
> +

If a builtin driver does async probing even before we get to being
able to load modules, this causes a spurious warning splat.

Here's a report by Marek [1]. I tried taking a stab at not warning, at
least for drivers that do async probing before the initcalls are done,
but then I got confused [2] trying to understand the earliest point
during boot at which request_module() can succeed. If someone can
clarify my confusion, I can try avoiding this warning for calls to
request_module() made before we can load any modules. Any other ideas
for making this warning less trigger-happy about false positives?

[1] - https://lore.kernel.org/lkml/d5796286-ec24-511a-5910-5673f8ea8b10@samsung.com/
[2] - https://lore.kernel.org/lkml/CAGETcx-MHwex8tHLB1d71MAP01-3OPDZSNCUBb3iT+BtrugJmQ@mail.gmail.com/

Another question (pardon my ignorance) is whether we need to call
async_synchronize_full() at the end of do_init_module() or if we can
limit it to a smaller domain? Looking at the history, I see that this
call was added by Linus in commit d6de2c80e9d7 ("async: Fix module
loading async-work regression"). Are we doing the blanket
async_synchronize_full() only because we are not keeping proper track
of the async domains? And if so, then what if we have an async domain
per module, and any uses of async_schedule*() triggered by that module
are tied to the module's async domain? Then we'd only need to sync that
module's domain and we won't hit any deadlock issues.
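
To make the idea concrete, here is a toy userspace model in C. It is a
sketch of the reasoning only, not kernel code: struct toy_domain and
both functions are hypothetical, and "pending" just counts unfinished
async work. In the kernel, I'd expect the real building blocks to be
async_schedule_domain() and async_synchronize_full_domain().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for an async domain: a count of unfinished work items. */
struct toy_domain {
	const char *owner;
	int pending;
};

/* Models a blanket sync: it has to wait while ANY domain has pending work. */
bool full_sync_would_block(const struct toy_domain *domains, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (domains[i].pending > 0)
			return true;
	return false;
}

/* Models a per-module sync: it waits only on the module's own pending work. */
bool domain_sync_would_block(const struct toy_domain *d)
{
	assert(d != NULL);
	return d->pending > 0;
}
```

With an async worker from some other driver still pending (and itself
blocked in request_module() waiting for our module), the blanket sync
would wait on that worker and deadlock, while the per-module sync
returns because our own domain is empty.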

Grepping for async_schedule*() calls, I see only about 30 instances.
At a glance, it looks like most cases are one of:
1. There's a device/driver from which we can find the related module
and tie the async_schedule*() call to that module's domain.
2. Direct async_schedule*() calls from module_init() -- we can just
tie those directly to the module's domain.
3. Other?

Is this idea worth pursuing? Or am I going in a completely wrong direction?

Btw, I did see Linus's suggestion in one of the emails in this thread
(?) about just doing a synchronize full on device open. That'd seem
like it would work too, but I'm afraid to touch any file open code
path because I expect that to be a hot path.

-Saravana

