All of lore.kernel.org
 help / color / mirror / Atom feed
* usb: dwc2: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 146s
@ 2016-10-31 20:34 Stefan Wahren
  2016-11-01  8:26 ` Michael Zoran
  2017-04-17 11:05 ` Stefan Wahren
  0 siblings, 2 replies; 30+ messages in thread
From: Stefan Wahren @ 2016-10-31 20:34 UTC (permalink / raw)
  To: linux-arm-kernel

I inspired by this issue [1] i build up a slightly modified setup with a
Raspberry Pi B (mainline kernel 4.9rc3), a powered 7 port USB hub and 5 Prolific
PL2303 USB to serial convertors. I modified the usb_test for dwc2 [2], which
only tries to open all ttyUSB devices one after the other. 

Unfortunately the complete system stuck after opening the first ttyUSB device (
heartbeat LED stop blinking, no reaction to debug UART). The only way to
reanimate the system is to powerdown the USB hub with the USB to serial
convertors.

[1] - https://github.com/raspberrypi/linux/issues/1692
[2] - https://gist.github.com/lategoodbye/dd0d30af27b6f101b03d5923b279dbaa

pi at raspberrypi:~$ lsusb -t
/:  Bus 01.Port 1: Dev 1, Class=root_hub, Driver=dwc2/1p, 480M
    |__ Port 1: Dev 2, If 0, Class=Hub, Driver=hub/3p, 480M
        |__ Port 1: Dev 3, If 0, Class=Vendor Specific Class, Driver=smsc95xx,
480M
        |__ Port 2: Dev 11, If 0, Class=Hub, Driver=hub/4p, 480M
            |__ Port 3: Dev 15, If 0, Class=Vendor Specific Class,
Driver=pl2303, 12M
            |__ Port 1: Dev 12, If 0, Class=Hub, Driver=hub/4p, 480M
                |__ Port 2: Dev 14, If 0, Class=Vendor Specific Class,
Driver=pl2303, 12M
                |__ Port 3: Dev 16, If 0, Class=Vendor Specific Class,
Driver=pl2303, 12M
            |__ Port 4: Dev 17, If 0, Class=Vendor Specific Class,
Driver=pl2303, 12M
            |__ Port 2: Dev 13, If 0, Class=Vendor Specific Class,
Driver=pl2303, 12M

pi at raspberrypi:~$ ./usb_test
idle
opening [/dev/ttyUSB0] ok
[  232.658768] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 146s!
[usb_test:522]
[  232.678443] Modules linked in: pwm_bcm2835 vchiq(C)
[  232.678481] CPU: 0 PID: 522 Comm: usb_test Tainted: G         C
     4.9.0-rc3+ #33
[  232.678486] Hardware name: BCM2835
[  232.678498] task: cab54e40 task.stack: c688c000
[  232.678520] PC is at __do_softirq+0x68/0x2a8
[  232.678538] LR is at irq_exit+0xb8/0x120
[  232.678552] pc : [<c01014d0>]    lr : [<c0121270>]    psr: 60000113
sp : c688da60  ip : c688daa8  fp : c688daa4
[  232.678560] r10: c0b7f800  r9 : c688c000  r8 : 00400000
[  232.678567] r7 : 0000000a  r6 : cb802400  r5 : 00000000  r4 : 00000001
[  232.678575] r3 : 00000000  r2 : 00000100  r1 : 00400000  r0 : 00000000
[  232.678585] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
[  232.678593] Control: 00c5387d  Table: 068a4008  DAC: 00000051
[  232.678604] CPU: 0 PID: 522 Comm: usb_test Tainted: G         C
     4.9.0-rc3+ #33
[  232.678609] Hardware name: BCM2835
[  232.678667] [<c010ecc0>] (unwind_backtrace) from [<c010bc28>]
(show_stack+0x20/0x24)
[  232.678695] [<c010bc28>] (show_stack) from [<c03684e0>]
(dump_stack+0x20/0x28)
[  232.678718] [<c03684e0>] (dump_stack) from [<c0108ee4>] (show_regs+0x1c/0x20)
[  232.678752] [<c0108ee4>] (show_regs) from [<c018ea00>]
(watchdog_timer_fn+0x140/0x19c)
[  232.678779] [<c018ea00>] (watchdog_timer_fn) from [<c015eefc>]
(__hrtimer_run_queues.constprop.3+0xec/0x210)
[  232.678798] [<c015eefc>] (__hrtimer_run_queues.constprop.3) from [<c015f210>]
(hrtimer_interrupt+0xa4/0x1cc)
[  232.678827] [<c015f210>] (hrtimer_interrupt) from [<c04eff58>]
(bcm2835_time_interrupt+0x40/0x48)
[  232.678865] [<c04eff58>] (bcm2835_time_interrupt) from [<c01537ec>]
(__handle_irq_event_percpu+0x68/0x1b8)
[  232.678891] [<c01537ec>] (__handle_irq_event_percpu) from [<c0153968>]
(handle_irq_event_percpu+0x2c/0x68)
[  232.678915] [<c0153968>] (handle_irq_event_percpu) from [<c01539dc>]
(handle_irq_event+0x38/0x4c)
[  232.678939] [<c01539dc>] (handle_irq_event) from [<c0156df8>]
(handle_level_irq+0xc4/0xf8)
[  232.678960] [<c0156df8>] (handle_level_irq) from [<c0152fd8>]
(generic_handle_irq+0x28/0x38)
[  232.678983] [<c0152fd8>] (generic_handle_irq) from [<c0153144>]
(__handle_domain_irq+0x90/0xb8)
[  232.679004] [<c0153144>] (__handle_domain_irq) from [<c01013f4>]
(bcm2835_handle_irq+0x30/0x48)
[  232.679022] [<c01013f4>] (bcm2835_handle_irq) from [<c010c7cc>]
(__irq_svc+0x6c/0x90)
[  232.679031] Exception stack(0xc688da10 to 0xc688da58)
[  232.679044] da00:                                     00000000 00400000
00000100 00000000
[  232.679062] da20: 00000001 00000000 cb802400 0000000a 00400000 c688c000
c0b7f800 c688daa4
[  232.679078] da40: c688daa8 c688da60 c0121270 c01014d0 60000113 fffff000000
c687a300 c9374900 ca918b00 caff1200 c688db84
[  232.679235] db20: c688db28 c688db40 c04a1cf8 c049cb64 60000013 ffffffff
[  232.679272] [<c010c7cc>] (__irq_svc) from [<c049cb64>]
(_dwc2_hcd_urb_enqueue+0x414/0x560)
[  232.679302] [<c049cb64>] (_dwc2_hcd_urb_enqueue) from [<c04849ac>]
(usb_hcd_submit_urb+0x704/0x8c4)
[  232.679327] [<c04849ac>] (usb_hcd_submit_urb) from [<c048628c>]
(usb_submit_urb+0x438/0x4e0)
[  232.679358] [<c048628c>] (usb_submit_urb) from [<c04a9d60>]
(usb_serial_generic_submit_read_urb+0x50/0xc8)
[  232.679382] [<c04a9d60>] (usb_serial_generic_submit_read_urb) from
[<c04aa0a8>] (usb_serial_generic_submit_read_urbs+0x28/0x70)
[  232.679407] [<c04aa0a8>] (usb_serial_generic_submit_read_urbs) from
[<c04aa298>] (usb_serial_generic_open+0x44/0x4c)
[  232.679435] [<c04aa298>] (usb_serial_generic_open) from [<c04aebc4>]
(pl2303_open+0xc0/0xe0)
[  232.679458] [<c04aebc4>] (pl2303_open) from [<c04a744c>]
(serial_port_activate+0x54/0x8c)
[  232.679486] [<c04a744c>] (serial_port_activate) from [<c03c4014>]
(tty_port_open+0x84/0xd4)
[  232.679509] [<c03c4014>] (tty_port_open) from [<c04a7d40>]
(serial_open+0x34/0x5c)
[  232.679546] [<c04a7d40>] (serial_open) from [<c03bd3a8>]
(tty_open+0x380/0x534)
[  232.679578] [<c03bd3a8>] (tty_open) from [<c021e188>]
(chrdev_open+0x120/0x14c)
[  232.679612] [<c021e188>] (chrdev_open) from [<c0217368>]
(do_dentry_open+0x1b8/0x304)
[  232.679638] [<c0217368>] (do_dentry_open) from [<c021876c>]
(vfs_open+0x7c/0x80)
[  232.679663] [<c021876c>] (vfs_open) from [<c022834c>]
(path_openat+0xb60/0xd88)
[  232.679682] [<c022834c>] (path_openat) from [<c02285bc>]
(do_filp_open+0x48/0x94)
[  232.679704] [<c02285bc>] (do_filp_open) from [<c0218af4>]
(do_sys_open+0x12c/0x1cc)
[  232.679728] [<c0218af4>] (do_sys_open) from [<c0218bc4>] (SyS_open+0x30/0x34)
[  232.679754] [<c0218bc4>] (SyS_open) from [<c0108280>]
(ret_fast_syscall+0x0/0x3c)
[  233.852571] usb 1-1.2: clear tt 1 (9063) error -110
[  233.853046] usb 1-1.2.1: clear tt 1 (9073) error -110
[  234.892554] usb 1-1.2.1: clear tt 1 (9071) error -110
[  234.892995] usb 1-1.2: clear tt 1 (9061) error -110
[  235.932574] usb 1-1.2.1: clear tt 1 (9073) error -110
[  235.933542] usb 1-1.2: clear tt 1 (9063) error -110
[  236.972595] usb 1-1.2.1: clear tt 1 (9071) error -110
[  236.973235] usb 1-1.2: clear tt 1 (9061) error -110
[  238.012573] usb 1-1.2: clear tt 1 (9063) error -110
[  238.013052] usb 1-1.2.1: clear tt 1 (9073) error -110
[  238.013456] hub 1-1:1.0: hub_ext_port_status failed (err = -110)
[  239.052567] usb 1-1.2.1: clear tt 1 (9071) error -110
[  239.053076] usb 1-1.2: clear tt 1 (9061) error -110
[  240.092573] usb 1-1.2: clear tt 1 (9063) error -110
[  240.093050] usb 1-1.2.1: clear tt 1 (9073) error -110
[  240.239643] usb 1-1.2.1: USB disconnect, device number 5
[  240.239685] usb 1-1.2.1.2: USB disconnect, device number 7
[  240.262707] usb 1-1.2: clear tt 1 (9061) error -71
[  240.263123] usb 1-1.2.1: clear tt 1 (9071) error -71
[  240.263708] usb 1-1.2: USB disconnect, device number 4
[  694304] pl2303 ttyUSB0: pl2303 converter now disconnected from ttyUSB0
[  251.694645] pl2303 1-1.2.2:1.0: device disconnected
[  251.704668] usb 1-1.2.3: USB disconnect, device number 8
[  251.706468] pl2303 ttyUSB2: pl2303 converter now disconnected from ttyUSB2
[  251.706894] pl2303 1-1.2.3:1.0: device disconnected
[  251.714720] usb 1-1.2.4: USB disconnect, device number 10
[  251.733019] pl2303 ttyUSB4: pl2303 converter now disconnected from ttyUSB4
[  251.733459] pl2303 1-1.2.4:1.0: device disconnected
[  251.852767] hub 1-1.2:1.0: activate --> -19

^ permalink raw reply	[flat|nested] 30+ messages in thread

* usb: dwc2: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 146s
  2016-10-31 20:34 usb: dwc2: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 146s Stefan Wahren
@ 2016-11-01  8:26 ` Michael Zoran
  2017-04-17 11:05 ` Stefan Wahren
  1 sibling, 0 replies; 30+ messages in thread
From: Michael Zoran @ 2016-11-01  8:26 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, 2016-10-31 at 21:34 +0100, Stefan Wahren wrote:
> I inspired by this issue [1] i build up a slightly modified setup
> with a
> Raspberry Pi B (mainline kernel 4.9rc3), a powered 7 port USB hub and
> 5 Prolific
> PL2303 USB to serial convertors. I modified the usb_test for dwc2
> [2], which
> only tries to open all ttyUSB devices one after the other.?
> 
> Unfortunately the complete system stuck after opening the first
> ttyUSB device (
> heartbeat LED stop blinking, no reaction to debug UART). The only way
> to
> reanimate the system is to powerdown the USB hub with the USB to
> serial
> convertors.
> 
> [1] - https://github.com/raspberrypi/linux/issues/1692
> [2] - https://gist.github.com/lategoodbye/dd0d30af27b6f101b03d5923b27
> 9dbaa
> 
> pi at raspberrypi:~$ lsusb -t
> /:??Bus 01.Port 1: Dev 1, Class=root_hub, Driver=dwc2/1p, 480M
> ????|__ Port 1: Dev 2, If 0, Class=Hub, Driver=hub/3p, 480M
> ????????|__ Port 1: Dev 3, If 0, Class=Vendor Specific Class,
> Driver=smsc95xx,
> 480M
> ????????|__ Port 2: Dev 11, If 0, Class=Hub, Driver=hub/4p, 480M
> ????????????|__ Port 3: Dev 15, If 0, Class=Vendor Specific Class,
> Driver=pl2303, 12M
> ????????????|__ Port 1: Dev 12, If 0, Class=Hub, Driver=hub/4p, 480M
> ????????????????|__ Port 2: Dev 14, If 0, Class=Vendor Specific
> Class,
> Driver=pl2303, 12M
> ????????????????|__ Port 3: Dev 16, If 0, Class=Vendor Specific
> Class,
> Driver=pl2303, 12M
> ????????????|__ Port 4: Dev 17, If 0, Class=Vendor Specific Class,
> Driver=pl2303, 12M
> ????????????|__ Port 2: Dev 13, If 0, Class=Vendor Specific Class,
> Driver=pl2303, 12M
> 

Since I didn't see a response, I'll go ahead and add my two cents. 
Hopefully nobody minds me chiming in here.

I see these kinds of issues with USB on the RPI all the time. 
Typically, it's just the USB that breaks down not a hang in the CPU.

The issue is that I think the USB controller in the RPI chipset was
designed for the cellphone/set top box market.  In those markets, the
USB port is only used to connect to one device at a time.  For example,
with a cell phone you might connect it to a PC to transfer data.  On a
set top box, you might plug in a USB flash drive with video to play.

The RPI doesn't have a proper USB controller like you find in a PC. 
Instead it has a fixed number of hardware slots(I think the number is
between 5-7) that are used for pending transfers. Once the slots are
full, very little can be done.  And because USB 2.0 is based on
polling, the connected USB devices are constantly polled even if
nothing is happening.

The DWC_OTG driver works a bit better even though it's not perfect. 
That driver makes an attempt to schedule USB activity and send it to
the hardware a bit at a time just like a PC USB controller does. 
Unfortunatly, to get the precision required on some of the lower end
RPIs, it's necessary to use FIQ which is somewhat outside the Linux
driver model.

Although I don't completely agree that FIQ is necessary on the high end
RPIs with multiple cores.   In fact, I have a few local modifications
to the interrupt controller driver to round robin dispatch IRQs between
core to get more concurrency at the expense of more CPU cache misses. I
think this reduces the need for FIQ, but some other disagree...

  

^ permalink raw reply	[flat|nested] 30+ messages in thread

* usb: dwc2: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 146s
  2016-10-31 20:34 usb: dwc2: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 146s Stefan Wahren
  2016-11-01  8:26 ` Michael Zoran
@ 2017-04-17 11:05 ` Stefan Wahren
  2017-04-17 22:37   ` Doug Anderson
  2017-04-17 23:45   ` Heiko Stuebner
  1 sibling, 2 replies; 30+ messages in thread
From: Stefan Wahren @ 2017-04-17 11:05 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,

> Stefan Wahren <stefan.wahren@i2se.com> hat am 31. Oktober 2016 um 21:34 geschrieben:
> 
> 
> I inspired by this issue [1] i build up a slightly modified setup with a
> Raspberry Pi B (mainline kernel 4.9rc3), a powered 7 port USB hub and 5 Prolific
> PL2303 USB to serial convertors. I modified the usb_test for dwc2 [2], which
> only tries to open all ttyUSB devices one after the other. 
> 
> Unfortunately the complete system stuck after opening the first ttyUSB device (
> heartbeat LED stop blinking, no reaction to debug UART). The only way to
> reanimate the system is to powerdown the USB hub with the USB to serial
> convertors.
> 
> [1] - https://github.com/raspberrypi/linux/issues/1692
> [2] - https://gist.github.com/lategoodbye/dd0d30af27b6f101b03d5923b279dbaa

since this issue still exists with 4.11 (even without or with microframe scheduler enabled), i want to ask some additional questions:

Is this issue reproducible with other dwc2 platforms than bcm2835?
Does the soft lockup also occurs after opening the second serial convertor or later?

Regards
Stefan

> 
> pi at raspberrypi:~$ lsusb -t
> /:  Bus 01.Port 1: Dev 1, Class=root_hub, Driver=dwc2/1p, 480M
>     |__ Port 1: Dev 2, If 0, Class=Hub, Driver=hub/3p, 480M
>         |__ Port 1: Dev 3, If 0, Class=Vendor Specific Class, Driver=smsc95xx,
> 480M
>         |__ Port 2: Dev 11, If 0, Class=Hub, Driver=hub/4p, 480M
>             |__ Port 3: Dev 15, If 0, Class=Vendor Specific Class,
> Driver=pl2303, 12M
>             |__ Port 1: Dev 12, If 0, Class=Hub, Driver=hub/4p, 480M
>                 |__ Port 2: Dev 14, If 0, Class=Vendor Specific Class,
> Driver=pl2303, 12M
>                 |__ Port 3: Dev 16, If 0, Class=Vendor Specific Class,
> Driver=pl2303, 12M
>             |__ Port 4: Dev 17, If 0, Class=Vendor Specific Class,
> Driver=pl2303, 12M
>             |__ Port 2: Dev 13, If 0, Class=Vendor Specific Class,
> Driver=pl2303, 12M
> 
> pi at raspberrypi:~$ ./usb_test
> idle
> opening [/dev/ttyUSB0] ok
> [  232.658768] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 146s!
> [usb_test:522]
> [  232.678443] Modules linked in: pwm_bcm2835 vchiq(C)
> [  232.678481] CPU: 0 PID: 522 Comm: usb_test Tainted: G         C
>      4.9.0-rc3+ #33
> [  232.678486] Hardware name: BCM2835
> [  232.678498] task: cab54e40 task.stack: c688c000
> [  232.678520] PC is at __do_softirq+0x68/0x2a8
> [  232.678538] LR is at irq_exit+0xb8/0x120
> [  232.678552] pc : [<c01014d0>]    lr : [<c0121270>]    psr: 60000113
> sp : c688da60  ip : c688daa8  fp : c688daa4
> [  232.678560] r10: c0b7f800  r9 : c688c000  r8 : 00400000
> [  232.678567] r7 : 0000000a  r6 : cb802400  r5 : 00000000  r4 : 00000001
> [  232.678575] r3 : 00000000  r2 : 00000100  r1 : 00400000  r0 : 00000000
> [  232.678585] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
> [  232.678593] Control: 00c5387d  Table: 068a4008  DAC: 00000051
> [  232.678604] CPU: 0 PID: 522 Comm: usb_test Tainted: G         C
>      4.9.0-rc3+ #33
> [  232.678609] Hardware name: BCM2835
> [  232.678667] [<c010ecc0>] (unwind_backtrace) from [<c010bc28>]
> (show_stack+0x20/0x24)
> [  232.678695] [<c010bc28>] (show_stack) from [<c03684e0>]
> (dump_stack+0x20/0x28)
> [  232.678718] [<c03684e0>] (dump_stack) from [<c0108ee4>] (show_regs+0x1c/0x20)
> [  232.678752] [<c0108ee4>] (show_regs) from [<c018ea00>]
> (watchdog_timer_fn+0x140/0x19c)
> [  232.678779] [<c018ea00>] (watchdog_timer_fn) from [<c015eefc>]
> (__hrtimer_run_queues.constprop.3+0xec/0x210)
> [  232.678798] [<c015eefc>] (__hrtimer_run_queues.constprop.3) from [<c015f210>]
> (hrtimer_interrupt+0xa4/0x1cc)
> [  232.678827] [<c015f210>] (hrtimer_interrupt) from [<c04eff58>]
> (bcm2835_time_interrupt+0x40/0x48)
> [  232.678865] [<c04eff58>] (bcm2835_time_interrupt) from [<c01537ec>]
> (__handle_irq_event_percpu+0x68/0x1b8)
> [  232.678891] [<c01537ec>] (__handle_irq_event_percpu) from [<c0153968>]
> (handle_irq_event_percpu+0x2c/0x68)
> [  232.678915] [<c0153968>] (handle_irq_event_percpu) from [<c01539dc>]
> (handle_irq_event+0x38/0x4c)
> [  232.678939] [<c01539dc>] (handle_irq_event) from [<c0156df8>]
> (handle_level_irq+0xc4/0xf8)
> [  232.678960] [<c0156df8>] (handle_level_irq) from [<c0152fd8>]
> (generic_handle_irq+0x28/0x38)
> [  232.678983] [<c0152fd8>] (generic_handle_irq) from [<c0153144>]
> (__handle_domain_irq+0x90/0xb8)
> [  232.679004] [<c0153144>] (__handle_domain_irq) from [<c01013f4>]
> (bcm2835_handle_irq+0x30/0x48)
> [  232.679022] [<c01013f4>] (bcm2835_handle_irq) from [<c010c7cc>]
> (__irq_svc+0x6c/0x90)
> [  232.679031] Exception stack(0xc688da10 to 0xc688da58)
> [  232.679044] da00:                                     00000000 00400000
> 00000100 00000000
> [  232.679062] da20: 00000001 00000000 cb802400 0000000a 00400000 c688c000
> c0b7f800 c688daa4
> [  232.679078] da40: c688daa8 c688da60 c0121270 c01014d0 60000113 fffff000000
> c687a300 c9374900 ca918b00 caff1200 c688db84
> [  232.679235] db20: c688db28 c688db40 c04a1cf8 c049cb64 60000013 ffffffff
> [  232.679272] [<c010c7cc>] (__irq_svc) from [<c049cb64>]
> (_dwc2_hcd_urb_enqueue+0x414/0x560)
> [  232.679302] [<c049cb64>] (_dwc2_hcd_urb_enqueue) from [<c04849ac>]
> (usb_hcd_submit_urb+0x704/0x8c4)
> [  232.679327] [<c04849ac>] (usb_hcd_submit_urb) from [<c048628c>]
> (usb_submit_urb+0x438/0x4e0)
> [  232.679358] [<c048628c>] (usb_submit_urb) from [<c04a9d60>]
> (usb_serial_generic_submit_read_urb+0x50/0xc8)
> [  232.679382] [<c04a9d60>] (usb_serial_generic_submit_read_urb) from
> [<c04aa0a8>] (usb_serial_generic_submit_read_urbs+0x28/0x70)
> [  232.679407] [<c04aa0a8>] (usb_serial_generic_submit_read_urbs) from
> [<c04aa298>] (usb_serial_generic_open+0x44/0x4c)
> [  232.679435] [<c04aa298>] (usb_serial_generic_open) from [<c04aebc4>]
> (pl2303_open+0xc0/0xe0)
> [  232.679458] [<c04aebc4>] (pl2303_open) from [<c04a744c>]
> (serial_port_activate+0x54/0x8c)
> [  232.679486] [<c04a744c>] (serial_port_activate) from [<c03c4014>]
> (tty_port_open+0x84/0xd4)
> [  232.679509] [<c03c4014>] (tty_port_open) from [<c04a7d40>]
> (serial_open+0x34/0x5c)
> [  232.679546] [<c04a7d40>] (serial_open) from [<c03bd3a8>]
> (tty_open+0x380/0x534)
> [  232.679578] [<c03bd3a8>] (tty_open) from [<c021e188>]
> (chrdev_open+0x120/0x14c)
> [  232.679612] [<c021e188>] (chrdev_open) from [<c0217368>]
> (do_dentry_open+0x1b8/0x304)
> [  232.679638] [<c0217368>] (do_dentry_open) from [<c021876c>]
> (vfs_open+0x7c/0x80)
> [  232.679663] [<c021876c>] (vfs_open) from [<c022834c>]
> (path_openat+0xb60/0xd88)
> [  232.679682] [<c022834c>] (path_openat) from [<c02285bc>]
> (do_filp_open+0x48/0x94)
> [  232.679704] [<c02285bc>] (do_filp_open) from [<c0218af4>]
> (do_sys_open+0x12c/0x1cc)
> [  232.679728] [<c0218af4>] (do_sys_open) from [<c0218bc4>] (SyS_open+0x30/0x34)
> [  232.679754] [<c0218bc4>] (SyS_open) from [<c0108280>]
> (ret_fast_syscall+0x0/0x3c)
> [  233.852571] usb 1-1.2: clear tt 1 (9063) error -110
> [  233.853046] usb 1-1.2.1: clear tt 1 (9073) error -110
> [  234.892554] usb 1-1.2.1: clear tt 1 (9071) error -110
> [  234.892995] usb 1-1.2: clear tt 1 (9061) error -110
> [  235.932574] usb 1-1.2.1: clear tt 1 (9073) error -110
> [  235.933542] usb 1-1.2: clear tt 1 (9063) error -110
> [  236.972595] usb 1-1.2.1: clear tt 1 (9071) error -110
> [  236.973235] usb 1-1.2: clear tt 1 (9061) error -110
> [  238.012573] usb 1-1.2: clear tt 1 (9063) error -110
> [  238.013052] usb 1-1.2.1: clear tt 1 (9073) error -110
> [  238.013456] hub 1-1:1.0: hub_ext_port_status failed (err = -110)
> [  239.052567] usb 1-1.2.1: clear tt 1 (9071) error -110
> [  239.053076] usb 1-1.2: clear tt 1 (9061) error -110
> [  240.092573] usb 1-1.2: clear tt 1 (9063) error -110
> [  240.093050] usb 1-1.2.1: clear tt 1 (9073) error -110
> [  240.239643] usb 1-1.2.1: USB disconnect, device number 5
> [  240.239685] usb 1-1.2.1.2: USB disconnect, device number 7
> [  240.262707] usb 1-1.2: clear tt 1 (9061) error -71
> [  240.263123] usb 1-1.2.1: clear tt 1 (9071) error -71
> [  240.263708] usb 1-1.2: USB disconnect, device number 4
> [  694304] pl2303 ttyUSB0: pl2303 converter now disconnected from ttyUSB0
> [  251.694645] pl2303 1-1.2.2:1.0: device disconnected
> [  251.704668] usb 1-1.2.3: USB disconnect, device number 8
> [  251.706468] pl2303 ttyUSB2: pl2303 converter now disconnected from ttyUSB2
> [  251.706894] pl2303 1-1.2.3:1.0: device disconnected
> [  251.714720] usb 1-1.2.4: USB disconnect, device number 10
> [  251.733019] pl2303 ttyUSB4: pl2303 converter now disconnected from ttyUSB4
> [  251.733459] pl2303 1-1.2.4:1.0: device disconnected
> [  251.852767] hub 1-1.2:1.0: activate --> -19
> 
> _______________________________________________
> linux-rpi-kernel mailing list
> linux-rpi-kernel at lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-rpi-kernel

^ permalink raw reply	[flat|nested] 30+ messages in thread

* usb: dwc2: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 146s
  2017-04-17 11:05 ` Stefan Wahren
@ 2017-04-17 22:37   ` Doug Anderson
  2017-04-18  8:07     ` Stefan Wahren
  2017-04-17 23:45   ` Heiko Stuebner
  1 sibling, 1 reply; 30+ messages in thread
From: Doug Anderson @ 2017-04-17 22:37 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,

On Mon, Apr 17, 2017 at 4:05 AM, Stefan Wahren <stefan.wahren@i2se.com> wrote:
> Hi,
>
>> Stefan Wahren <stefan.wahren@i2se.com> hat am 31. Oktober 2016 um 21:34 geschrieben:
>>
>>
>> I inspired by this issue [1] i build up a slightly modified setup with a
>> Raspberry Pi B (mainline kernel 4.9rc3), a powered 7 port USB hub and 5 Prolific
>> PL2303 USB to serial convertors. I modified the usb_test for dwc2 [2], which
>> only tries to open all ttyUSB devices one after the other.
>>
>> Unfortunately the complete system stuck after opening the first ttyUSB device (
>> heartbeat LED stop blinking, no reaction to debug UART). The only way to
>> reanimate the system is to powerdown the USB hub with the USB to serial
>> convertors.
>>
>> [1] - https://github.com/raspberrypi/linux/issues/1692
>> [2] - https://gist.github.com/lategoodbye/dd0d30af27b6f101b03d5923b279dbaa
>
> since this issue still exists with 4.11 (even without or with microframe scheduler enabled), i want to ask some additional questions:
>
> Is this issue reproducible with other dwc2 platforms than bcm2835?

+Edmund Szeto, who I seem to remember emailing me about similar
questions in the past.


> Does the soft lockup also occurs after opening the second serial convertor or later?

I don't have serial converters easily available to me, but back in the
day when I was stressing things out on rk3288 I never saw anything
this bad.  ...of course, on rk3288 we've got 4 A17 cores running
really fast, so possibly just being slower is what causes your
problems here?

I will make the following observations:

1. With dwc2 you often end up in the situation where you need to
service an interrupt every 125 uS.  If servicing that interrupt takes
anywhere near 125 uS in the common case then you'll be in trouble.

===

2. When I was testing on rk3288 (on kernel 3.14) I did see occasions
where uvc_video_complete() could sometimes take > 125 uS.  It's been a
long time now, but if I remember correctly this had to do with the
fact that the URB buffers were allocated in a way where you had to
access them non-cached and this was super duper slow.  In my
particular case I could "fix" it by adjusting UVC_MAX_PACKETS
(crosreview.com/321932).  ...and I had some timing code in
crosreview.com/321980.

Again, it was a long time ago, but elsewhere I have written down:

-----

Specifically:
* The USB "complete" functions are called with local interrupts
disabled.  Specifically see __usb_hcd_giveback_urb().
* I see calls to uvc_video_complete() that easily take > 125us.

Unfortunately the interrupts disabled while uvc_video_complete() is
called are always the interrupts for the same CPU that's dealing with
the normal dwc2 USB interrupts.

--

Ugh.  This may be the memcpy() as others have found:

http://www.spinics.net/lists/linux-usb/msg83581.html

...looks like the issue is that the driver is allocating memory that's
supposed to be DMA coherent and copying from this memory is slow.

-----

You could probably pick my timing patch and then see if you're
actually hitting cases like this, I guess?

===

3. Are you running CPUFreq by chance?

...back in the day we had a bug on rk3288 where we were temporarily
running the CPU as slow as 8 MHz for a short while during a CPUFreq
transition.  If you happened to get a dwc2 interrupt while at this
speed then you were in trouble.


-Doug

^ permalink raw reply	[flat|nested] 30+ messages in thread

* usb: dwc2: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 146s
  2017-04-17 11:05 ` Stefan Wahren
  2017-04-17 22:37   ` Doug Anderson
@ 2017-04-17 23:45   ` Heiko Stuebner
  1 sibling, 0 replies; 30+ messages in thread
From: Heiko Stuebner @ 2017-04-17 23:45 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Stefan,

Am Montag, 17. April 2017, 13:05:40 CEST schrieb Stefan Wahren:
> > Stefan Wahren <stefan.wahren@i2se.com> hat am 31. Oktober 2016 um 21:34 geschrieben:
> > 
> > 
> > I inspired by this issue [1] i build up a slightly modified setup with a
> > Raspberry Pi B (mainline kernel 4.9rc3), a powered 7 port USB hub and 5 Prolific
> > PL2303 USB to serial convertors. I modified the usb_test for dwc2 [2], which
> > only tries to open all ttyUSB devices one after the other. 
> > 
> > Unfortunately the complete system stuck after opening the first ttyUSB device (
> > heartbeat LED stop blinking, no reaction to debug UART). The only way to
> > reanimate the system is to powerdown the USB hub with the USB to serial
> > convertors.
> > 
> > [1] - https://github.com/raspberrypi/linux/issues/1692
> > [2] - https://gist.github.com/lategoodbye/dd0d30af27b6f101b03d5923b279dbaa
> 
> since this issue still exists with 4.11 (even without or with microframe scheduler enabled), i want to ask some additional questions:
> 
> Is this issue reproducible with other dwc2 platforms than bcm2835?
> Does the soft lockup also occurs after opening the second serial convertor or later?

A friend of mine uses a rk3288-veyron-jerry as work-device also with a
slightly interesting usb topology (multiple hubs, keyboard, mouse,
smartcard-reader and printer) but not usb-serial adapters.

On the 4.9 that is currently on there, the system sometimes locks up when
trying to print things, sadly there is no serial console on it and the lockup
also kills the whole ui and we haven't managed to dig deeper yet.
[cpufreq is set to 600MHz minimum to mitigate the worst dwc2 issues].


Heiko


> > 
> > pi at raspberrypi:~$ lsusb -t
> > /:  Bus 01.Port 1: Dev 1, Class=root_hub, Driver=dwc2/1p, 480M
> >     |__ Port 1: Dev 2, If 0, Class=Hub, Driver=hub/3p, 480M
> >         |__ Port 1: Dev 3, If 0, Class=Vendor Specific Class, Driver=smsc95xx,
> > 480M
> >         |__ Port 2: Dev 11, If 0, Class=Hub, Driver=hub/4p, 480M
> >             |__ Port 3: Dev 15, If 0, Class=Vendor Specific Class,
> > Driver=pl2303, 12M
> >             |__ Port 1: Dev 12, If 0, Class=Hub, Driver=hub/4p, 480M
> >                 |__ Port 2: Dev 14, If 0, Class=Vendor Specific Class,
> > Driver=pl2303, 12M
> >                 |__ Port 3: Dev 16, If 0, Class=Vendor Specific Class,
> > Driver=pl2303, 12M
> >             |__ Port 4: Dev 17, If 0, Class=Vendor Specific Class,
> > Driver=pl2303, 12M
> >             |__ Port 2: Dev 13, If 0, Class=Vendor Specific Class,
> > Driver=pl2303, 12M
> > 
> > pi at raspberrypi:~$ ./usb_test
> > idle
> > opening [/dev/ttyUSB0] ok
> > [  232.658768] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 146s!
> > [usb_test:522]
> > [  232.678443] Modules linked in: pwm_bcm2835 vchiq(C)
> > [  232.678481] CPU: 0 PID: 522 Comm: usb_test Tainted: G         C
> >      4.9.0-rc3+ #33
> > [  232.678486] Hardware name: BCM2835
> > [  232.678498] task: cab54e40 task.stack: c688c000
> > [  232.678520] PC is at __do_softirq+0x68/0x2a8
> > [  232.678538] LR is at irq_exit+0xb8/0x120
> > [  232.678552] pc : [<c01014d0>]    lr : [<c0121270>]    psr: 60000113
> > sp : c688da60  ip : c688daa8  fp : c688daa4
> > [  232.678560] r10: c0b7f800  r9 : c688c000  r8 : 00400000
> > [  232.678567] r7 : 0000000a  r6 : cb802400  r5 : 00000000  r4 : 00000001
> > [  232.678575] r3 : 00000000  r2 : 00000100  r1 : 00400000  r0 : 00000000
> > [  232.678585] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
> > [  232.678593] Control: 00c5387d  Table: 068a4008  DAC: 00000051
> > [  232.678604] CPU: 0 PID: 522 Comm: usb_test Tainted: G         C
> >      4.9.0-rc3+ #33
> > [  232.678609] Hardware name: BCM2835
> > [  232.678667] [<c010ecc0>] (unwind_backtrace) from [<c010bc28>]
> > (show_stack+0x20/0x24)
> > [  232.678695] [<c010bc28>] (show_stack) from [<c03684e0>]
> > (dump_stack+0x20/0x28)
> > [  232.678718] [<c03684e0>] (dump_stack) from [<c0108ee4>] (show_regs+0x1c/0x20)
> > [  232.678752] [<c0108ee4>] (show_regs) from [<c018ea00>]
> > (watchdog_timer_fn+0x140/0x19c)
> > [  232.678779] [<c018ea00>] (watchdog_timer_fn) from [<c015eefc>]
> > (__hrtimer_run_queues.constprop.3+0xec/0x210)
> > [  232.678798] [<c015eefc>] (__hrtimer_run_queues.constprop.3) from [<c015f210>]
> > (hrtimer_interrupt+0xa4/0x1cc)
> > [  232.678827] [<c015f210>] (hrtimer_interrupt) from [<c04eff58>]
> > (bcm2835_time_interrupt+0x40/0x48)
> > [  232.678865] [<c04eff58>] (bcm2835_time_interrupt) from [<c01537ec>]
> > (__handle_irq_event_percpu+0x68/0x1b8)
> > [  232.678891] [<c01537ec>] (__handle_irq_event_percpu) from [<c0153968>]
> > (handle_irq_event_percpu+0x2c/0x68)
> > [  232.678915] [<c0153968>] (handle_irq_event_percpu) from [<c01539dc>]
> > (handle_irq_event+0x38/0x4c)
> > [  232.678939] [<c01539dc>] (handle_irq_event) from [<c0156df8>]
> > (handle_level_irq+0xc4/0xf8)
> > [  232.678960] [<c0156df8>] (handle_level_irq) from [<c0152fd8>]
> > (generic_handle_irq+0x28/0x38)
> > [  232.678983] [<c0152fd8>] (generic_handle_irq) from [<c0153144>]
> > (__handle_domain_irq+0x90/0xb8)
> > [  232.679004] [<c0153144>] (__handle_domain_irq) from [<c01013f4>]
> > (bcm2835_handle_irq+0x30/0x48)
> > [  232.679022] [<c01013f4>] (bcm2835_handle_irq) from [<c010c7cc>]
> > (__irq_svc+0x6c/0x90)
> > [  232.679031] Exception stack(0xc688da10 to 0xc688da58)
> > [  232.679044] da00:                                     00000000 00400000
> > 00000100 00000000
> > [  232.679062] da20: 00000001 00000000 cb802400 0000000a 00400000 c688c000
> > c0b7f800 c688daa4
> > [  232.679078] da40: c688daa8 c688da60 c0121270 c01014d0 60000113 fffff000000
> > c687a300 c9374900 ca918b00 caff1200 c688db84
> > [  232.679235] db20: c688db28 c688db40 c04a1cf8 c049cb64 60000013 ffffffff
> > [  232.679272] [<c010c7cc>] (__irq_svc) from [<c049cb64>]
> > (_dwc2_hcd_urb_enqueue+0x414/0x560)
> > [  232.679302] [<c049cb64>] (_dwc2_hcd_urb_enqueue) from [<c04849ac>]
> > (usb_hcd_submit_urb+0x704/0x8c4)
> > [  232.679327] [<c04849ac>] (usb_hcd_submit_urb) from [<c048628c>]
> > (usb_submit_urb+0x438/0x4e0)
> > [  232.679358] [<c048628c>] (usb_submit_urb) from [<c04a9d60>]
> > (usb_serial_generic_submit_read_urb+0x50/0xc8)
> > [  232.679382] [<c04a9d60>] (usb_serial_generic_submit_read_urb) from
> > [<c04aa0a8>] (usb_serial_generic_submit_read_urbs+0x28/0x70)
> > [  232.679407] [<c04aa0a8>] (usb_serial_generic_submit_read_urbs) from
> > [<c04aa298>] (usb_serial_generic_open+0x44/0x4c)
> > [  232.679435] [<c04aa298>] (usb_serial_generic_open) from [<c04aebc4>]
> > (pl2303_open+0xc0/0xe0)
> > [  232.679458] [<c04aebc4>] (pl2303_open) from [<c04a744c>]
> > (serial_port_activate+0x54/0x8c)
> > [  232.679486] [<c04a744c>] (serial_port_activate) from [<c03c4014>]
> > (tty_port_open+0x84/0xd4)
> > [  232.679509] [<c03c4014>] (tty_port_open) from [<c04a7d40>]
> > (serial_open+0x34/0x5c)
> > [  232.679546] [<c04a7d40>] (serial_open) from [<c03bd3a8>]
> > (tty_open+0x380/0x534)
> > [  232.679578] [<c03bd3a8>] (tty_open) from [<c021e188>]
> > (chrdev_open+0x120/0x14c)
> > [  232.679612] [<c021e188>] (chrdev_open) from [<c0217368>]
> > (do_dentry_open+0x1b8/0x304)
> > [  232.679638] [<c0217368>] (do_dentry_open) from [<c021876c>]
> > (vfs_open+0x7c/0x80)
> > [  232.679663] [<c021876c>] (vfs_open) from [<c022834c>]
> > (path_openat+0xb60/0xd88)
> > [  232.679682] [<c022834c>] (path_openat) from [<c02285bc>]
> > (do_filp_open+0x48/0x94)
> > [  232.679704] [<c02285bc>] (do_filp_open) from [<c0218af4>]
> > (do_sys_open+0x12c/0x1cc)
> > [  232.679728] [<c0218af4>] (do_sys_open) from [<c0218bc4>] (SyS_open+0x30/0x34)
> > [  232.679754] [<c0218bc4>] (SyS_open) from [<c0108280>]
> > (ret_fast_syscall+0x0/0x3c)
> > [  233.852571] usb 1-1.2: clear tt 1 (9063) error -110
> > [  233.853046] usb 1-1.2.1: clear tt 1 (9073) error -110
> > [  234.892554] usb 1-1.2.1: clear tt 1 (9071) error -110
> > [  234.892995] usb 1-1.2: clear tt 1 (9061) error -110
> > [  235.932574] usb 1-1.2.1: clear tt 1 (9073) error -110
> > [  235.933542] usb 1-1.2: clear tt 1 (9063) error -110
> > [  236.972595] usb 1-1.2.1: clear tt 1 (9071) error -110
> > [  236.973235] usb 1-1.2: clear tt 1 (9061) error -110
> > [  238.012573] usb 1-1.2: clear tt 1 (9063) error -110
> > [  238.013052] usb 1-1.2.1: clear tt 1 (9073) error -110
> > [  238.013456] hub 1-1:1.0: hub_ext_port_status failed (err = -110)
> > [  239.052567] usb 1-1.2.1: clear tt 1 (9071) error -110
> > [  239.053076] usb 1-1.2: clear tt 1 (9061) error -110
> > [  240.092573] usb 1-1.2: clear tt 1 (9063) error -110
> > [  240.093050] usb 1-1.2.1: clear tt 1 (9073) error -110
> > [  240.239643] usb 1-1.2.1: USB disconnect, device number 5
> > [  240.239685] usb 1-1.2.1.2: USB disconnect, device number 7
> > [  240.262707] usb 1-1.2: clear tt 1 (9061) error -71
> > [  240.263123] usb 1-1.2.1: clear tt 1 (9071) error -71
> > [  240.263708] usb 1-1.2: USB disconnect, device number 4
> > [  694304] pl2303 ttyUSB0: pl2303 converter now disconnected from ttyUSB0
> > [  251.694645] pl2303 1-1.2.2:1.0: device disconnected
> > [  251.704668] usb 1-1.2.3: USB disconnect, device number 8
> > [  251.706468] pl2303 ttyUSB2: pl2303 converter now disconnected from ttyUSB2
> > [  251.706894] pl2303 1-1.2.3:1.0: device disconnected
> > [  251.714720] usb 1-1.2.4: USB disconnect, device number 10
> > [  251.733019] pl2303 ttyUSB4: pl2303 converter now disconnected from ttyUSB4
> > [  251.733459] pl2303 1-1.2.4:1.0: device disconnected
> > [  251.852767] hub 1-1.2:1.0: activate --> -19
> > 
> > _______________________________________________
> > linux-rpi-kernel mailing list
> > linux-rpi-kernel at lists.infradead.org
> > http://lists.infradead.org/mailman/listinfo/linux-rpi-kernel
> 
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* usb: dwc2: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 146s
  2017-04-17 22:37   ` Doug Anderson
@ 2017-04-18  8:07     ` Stefan Wahren
  2017-04-18 20:08       ` Doug Anderson
  2017-04-18 20:25       ` Stefan Wahren
  0 siblings, 2 replies; 30+ messages in thread
From: Stefan Wahren @ 2017-04-18  8:07 UTC (permalink / raw)
  To: linux-arm-kernel

Am 18.04.2017 um 00:37 schrieb Doug Anderson:
> Hi,
>
> On Mon, Apr 17, 2017 at 4:05 AM, Stefan Wahren <stefan.wahren@i2se.com> wrote:
>> Hi,
>>
>>> Stefan Wahren <stefan.wahren@i2se.com> hat am 31. Oktober 2016 um 21:34 geschrieben:
>>>
>>>
>>> I inspired by this issue [1] i build up a slightly modified setup with a
>>> Raspberry Pi B (mainline kernel 4.9rc3), a powered 7 port USB hub and 5 Prolific
>>> PL2303 USB to serial convertors. I modified the usb_test for dwc2 [2], which
>>> only tries to open all ttyUSB devices one after the other.
>>>
>>> Unfortunately the complete system stuck after opening the first ttyUSB device (
>>> heartbeat LED stop blinking, no reaction to debug UART). The only way to
>>> reanimate the system is to powerdown the USB hub with the USB to serial
>>> convertors.
>>>
>>> [1] - https://github.com/raspberrypi/linux/issues/1692
>>> [2] - https://gist.github.com/lategoodbye/dd0d30af27b6f101b03d5923b279dbaa
>> since this issue still exists with 4.11 (even without or with microframe scheduler enabled), i want to ask some additional questions:
>>
>> Is this issue reproducible with other dwc2 platforms than bcm2835?
> +Edmund Szeto, who I seem to remember emailing me about similar
> questions in the past.
>
>
>> Does the soft lockup also occurs after opening the second serial convertor or later?
> I don't have serial converters easily available to me, but back in the
> day when I was stressing things out on rk3288 I never saw anything
> this bad.  ...of course, on rk3288 we've got 4 A17 cores running
> really fast, so possibly just being slower is what causes your
> problems here?

The downstream kernel of the Raspberry Pi foundation with it's
out-of-tree dwc_otg driver is able to handle 8 serial converter on a RPI
B. I would be happy to get at least 2 or 3 working on mainline.

>
> I will make the following observations:
>
> 1. With dwc2 you often end up in the situation where you need to
> service an interrupt every 125 uS.  If servicing that interrupt takes
> anywhere near 125 uS in the common case then you'll be in trouble.

I will try to measure this with a logic analyzer.

>
> ===
>
> 2. When I was testing on rk3288 (on kernel 3.14) I did see occasions
> where uvc_video_complete() could sometimes take > 125 uS.  It's been a
> long time now, but if I remember correctly this had to do with the
> fact that the URB buffers were allocated in a way where you had to
> access them non-cached and this was super duper slow.  In my
> particular case I could "fix" it by adjusting UVC_MAX_PACKETS
> (crosreview.com/321932).  ...and I had some timing code in
> crosreview.com/321980.
>
> Again, it was a long time ago, but elsewhere I have written down:
>
> -----
>
> Specifically:
> * The USB "complete" functions are called with local interrupts
> disabled.  Specifically see __usb_hcd_giveback_urb().
> * I see calls to uvc_video_complete() that easily take > 125us.
>
> Unfortunately the interrupts disabled while uvc_video_complete() is
> called are always the interrupts for the same CPU that's dealing with
> the normal dwc2 USB interrupts.
>
> --
>
> Ugh.  This may be the memcpy() as others have found:
>
> http://www.spinics.net/lists/linux-usb/msg83581.html
>
> ...looks like the issue is that the driver is allocating memory that's
> supposed to be DMA coherent and copying from this memory is slow.
>
> -----
>
> You could probably pick my timing patch and then see if you're
> actually hitting cases like this, I guess?
>
> ===
>
> 3. Are you running CPUFreq by chance?
>
> ...back in the day we had a bug on rk3288 where we were temporarily
> running the CPU as slow as 8 MHz for a short while during a CPUFreq
> transition.  If you happened to get a dwc2 interrupt while at this
> speed then you were in trouble.

There is no CPUFreq driver for bcm2835 in mainline and i hope the GPU
firmware doesn't play with the core frequency during runtime.

I also use the bcm2835_defconfig, which does have anything disabled
under kernel hacking.

Thanks
Stefan

>
>
> -Doug

^ permalink raw reply	[flat|nested] 30+ messages in thread

* usb: dwc2: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 146s
  2017-04-18  8:07     ` Stefan Wahren
@ 2017-04-18 20:08       ` Doug Anderson
  2017-04-18 20:25       ` Stefan Wahren
  1 sibling, 0 replies; 30+ messages in thread
From: Doug Anderson @ 2017-04-18 20:08 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,

On Tue, Apr 18, 2017 at 1:07 AM, Stefan Wahren <stefan.wahren@i2se.com> wrote:
>> 1. With dwc2 you often end up in the situation where you need to
>> service an interrupt every 125 uS.  If servicing that interrupt takes
>> anywhere near 125 uS in the common case then you'll be in trouble.
>
> I will try to measure this with a logic analyzer.

Why a logic analyzer?  I'd think that "ftrace" would be your friend
here.  If you configure it just right you ought to be able to figure
out exactly what your kernel is doing.

...or, if you don't want to learn ftrace (it's cool and worth it, but
there's some time) you can just use "trace_printk" to get a fairly
low-over head printout to a memory buffer.  You can put that printout
in various places in the code and figure out what's taking so long.
In fact, that's exactly how the patch at
<https://chromium-review.googlesource.com/c/321980/4/drivers/usb/core/hcd.c>
works.

-Doug

^ permalink raw reply	[flat|nested] 30+ messages in thread

* usb: dwc2: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 146s
  2017-04-18  8:07     ` Stefan Wahren
  2017-04-18 20:08       ` Doug Anderson
@ 2017-04-18 20:25       ` Stefan Wahren
  2017-04-18 20:41         ` Doug Anderson
  2017-04-19  8:56         ` Johan Hovold
  1 sibling, 2 replies; 30+ messages in thread
From: Stefan Wahren @ 2017-04-18 20:25 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,

[add Johan]

> Stefan Wahren <stefan.wahren@i2se.com> hat am 18. April 2017 um 10:07 geschrieben:
> 
> 
> Am 18.04.2017 um 00:37 schrieb Doug Anderson:
> > Hi,
> >
> > On Mon, Apr 17, 2017 at 4:05 AM, Stefan Wahren <stefan.wahren@i2se.com> wrote:
> >> Hi,
> >>
> >>> Stefan Wahren <stefan.wahren@i2se.com> hat am 31. Oktober 2016 um 21:34 geschrieben:
> >>>
> >>>
> >>> I inspired by this issue [1] i build up a slightly modified setup with a
> >>> Raspberry Pi B (mainline kernel 4.9rc3), a powered 7 port USB hub and 5 Prolific
> >>> PL2303 USB to serial convertors. I modified the usb_test for dwc2 [2], which
> >>> only tries to open all ttyUSB devices one after the other.
> >>>
> >>> Unfortunately the complete system stuck after opening the first ttyUSB device (
> >>> heartbeat LED stop blinking, no reaction to debug UART). The only way to
> >>> reanimate the system is to powerdown the USB hub with the USB to serial
> >>> convertors.
> >>>
> >>> [1] - https://github.com/raspberrypi/linux/issues/1692
> >>> [2] - https://gist.github.com/lategoodbye/dd0d30af27b6f101b03d5923b279dbaa
> >> since this issue still exists with 4.11 (even without or with microframe scheduler enabled), i want to ask some additional questions:
> >>
> >> Is this issue reproducible with other dwc2 platforms than bcm2835?
> > +Edmund Szeto, who I seem to remember emailing me about similar
> > questions in the past.
> >
> >
> >> Does the soft lockup also occurs after opening the second serial convertor or later?
> > I don't have serial converters easily available to me, but back in the
> > day when I was stressing things out on rk3288 I never saw anything
> > this bad.  ...of course, on rk3288 we've got 4 A17 cores running
> > really fast, so possibly just being slower is what causes your
> > problems here?
> 
> The downstream kernel of the Raspberry Pi foundation with it's
> out-of-tree dwc_otg driver is able to handle 8 serial converter on a RPI
> B. I would be happy to get at least 2 or 3 working on mainline.
> 
> >
> > I will make the following observations:
> >
> > 1. With dwc2 you often end up in the situation where you need to
> > service an interrupt every 125 uS.  If servicing that interrupt takes
> > anywhere near 125 uS in the common case then you'll be in trouble.
> 
> I will try to measure this with a logic analyzer.
> 

i took GPIO17 to measure _dwc2_hcd_irq and GPIO18 to measure _dwc2_hcd_urb_enqueue (patch against 4.11rc1 below).

So i made my observations for 3 test cases:

1) no serial converter connected (idle)
2) 1 FTDI serial converter connected
3) 1 PL2303 serial converter connected

case   | ksoftirq cpu     | mean duration | max duration  | max duration | urb_enqueue  |
       |                  | hcd_irq       | hcd_irq       | urb_enqueue  | within 10 sec|
-------+------------------+---------------+---------------+--------------+--------------+
idle   | 0.0%             | 2 us          | 16.5 us       |     12 us    | 5            |
FTDI   | 25.0%            | 8.5 us        | 18.0 us       |  31000 us    | ~ 400        |
PL2303 | top doesn't work | 8.5 us        | 22.5 us       | 900000 us    | 4            |

So it seems the serial USB driver has also an impact. In the analyzer trace the FTDI triggers many smaller urb_enqueue calls in the opposite to the PL2303 which only has few but huge calls.

Additional notes:
After closing the serial connection to the FTDI the system is usable as before. In case of PL2303 i need to disconnect the converter in order to get a usable system.

Why do they behave so differently?
Are these results of a overload?
Doug, can you point me to your timing patch?

Thanks
Stefan

PS: My hacky patches for measuring

diff --git a/drivers/usb/dwc2/core.h b/drivers/usb/dwc2/core.h
index 1a7e830..aa490c3 100644
--- a/drivers/usb/dwc2/core.h
+++ b/drivers/usb/dwc2/core.h
@@ -930,6 +930,9 @@ struct dwc2_hsotg {
 	struct dentry *debug_root;
 	struct debugfs_regset32 *regset;
 
+	int gpio_17;
+	int gpio_18;
+
 	/* DWC OTG HW Release versions */
 #define DWC2_CORE_REV_2_71a	0x4f54271a
 #define DWC2_CORE_REV_2_90a	0x4f54290a
diff --git a/drivers/usb/dwc2/hcd.c b/drivers/usb/dwc2/hcd.c
index a73722e..97271c8 100644
--- a/drivers/usb/dwc2/hcd.c
+++ b/drivers/usb/dwc2/hcd.c
@@ -45,6 +45,7 @@
 #include <linux/platform_device.h>
 #include <linux/dma-mapping.h>
 #include <linux/delay.h>
+#include <linux/gpio.h>
 #include <linux/io.h>
 #include <linux/slab.h>
 #include <linux/usb.h>
@@ -4595,13 +4596,17 @@ static int _dwc2_hcd_urb_enqueue(struct usb_hcd *hcd, struct urb *urb,
 	bool qh_allocated = false;
 	struct dwc2_qtd *qtd;
 
+	gpio_set_value(hsotg->gpio_18, 1);
+
 	if (dbg_urb(urb)) {
 		dev_vdbg(hsotg->dev, "DWC OTG HCD URB Enqueue\n");
 		dwc2_dump_urb_info(hcd, urb, "urb_enqueue");
 	}
 
-	if (!ep)
+	if (!ep) {
+		gpio_set_value(hsotg->gpio_18, 0);
 		return -EINVAL;
+	}
 
 	if (usb_pipetype(urb->pipe) == PIPE_ISOCHRONOUS ||
 	    usb_pipetype(urb->pipe) == PIPE_INTERRUPT) {
@@ -4628,8 +4633,10 @@ static int _dwc2_hcd_urb_enqueue(struct usb_hcd *hcd, struct urb *urb,
 
 	dwc2_urb = dwc2_hcd_urb_alloc(hsotg, urb->number_of_packets,
 				      mem_flags);
-	if (!dwc2_urb)
+	if (!dwc2_urb) {
+		gpio_set_value(hsotg->gpio_18, 0);
 		return -ENOMEM;
+	}
 
 	dwc2_hcd_urb_set_pipeinfo(hsotg, dwc2_urb, usb_pipedevice(urb->pipe),
 				  usb_pipeendpoint(urb->pipe), ep_type,
@@ -4705,6 +4712,8 @@ static int _dwc2_hcd_urb_enqueue(struct usb_hcd *hcd, struct urb *urb,
 
 	spin_unlock_irqrestore(&hsotg->lock, flags);
 
+	gpio_set_value(hsotg->gpio_18, 0);
+
 	return 0;
 
 fail3:
@@ -4732,6 +4741,8 @@ static int _dwc2_hcd_urb_enqueue(struct usb_hcd *hcd, struct urb *urb,
 fail0:
 	kfree(dwc2_urb);
 
+	gpio_set_value(hsotg->gpio_18, 0);
+
 	return retval;
 }
 
@@ -4825,8 +4836,15 @@ static void _dwc2_hcd_endpoint_reset(struct usb_hcd *hcd,
 static irqreturn_t _dwc2_hcd_irq(struct usb_hcd *hcd)
 {
 	struct dwc2_hsotg *hsotg = dwc2_hcd_to_hsotg(hcd);
+	irqreturn_t ret;
 
-	return dwc2_handle_hcd_intr(hsotg);
+	gpio_set_value(hsotg->gpio_17, 1);
+
+	ret = dwc2_handle_hcd_intr(hsotg);
+
+	gpio_set_value(hsotg->gpio_17, 0);
+
+	return ret;
 }
 
 /*
diff --git a/drivers/usb/dwc2/platform.c b/drivers/usb/dwc2/platform.c
index 9564bc7..e3f8a8a 100644
--- a/drivers/usb/dwc2/platform.c
+++ b/drivers/usb/dwc2/platform.c
@@ -35,6 +35,7 @@
  */
 
 #include <linux/kernel.h>
+#include <linux/gpio.h>
 #include <linux/module.h>
 #include <linux/slab.h>
 #include <linux/clk.h>
@@ -404,6 +405,16 @@ static int dwc2_driver_probe(struct platform_device *dev)
 		return hsotg->irq;
 	}
 
+	hsotg->gpio_17 = 475;
+	retval = devm_gpio_request_one(hsotg->dev, hsotg->gpio_17,  GPIOF_OUT_INIT_LOW, "gpio_17");
+	if (retval)
+		pr_warn("Unable to request GPIO17: %d\n", retval);
+	
+	hsotg->gpio_18 = 476;
+	retval = devm_gpio_request_one(hsotg->dev, hsotg->gpio_18,  GPIOF_OUT_INIT_LOW, "gpio_18");
+	if (retval)
+		pr_warn("Unable to request GPIO18: %d\n", retval);
+
 	dev_dbg(hsotg->dev, "registering common handler for irq%d\n",
 		hsotg->irq);
 	retval = devm_request_irq(hsotg->dev, hsotg->irq,

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* usb: dwc2: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 146s
  2017-04-18 20:25       ` Stefan Wahren
@ 2017-04-18 20:41         ` Doug Anderson
  2017-04-18 20:53           ` Stefan Wahren
  2017-04-19 20:25           ` Stefan Wahren
  2017-04-19  8:56         ` Johan Hovold
  1 sibling, 2 replies; 30+ messages in thread
From: Doug Anderson @ 2017-04-18 20:41 UTC (permalink / raw)
  To: linux-arm-kernel

Stefan,

On Tue, Apr 18, 2017 at 1:25 PM, Stefan Wahren <stefan.wahren@i2se.com> wrote:
> Hi,
>
> [add Johan]
>
>> Stefan Wahren <stefan.wahren@i2se.com> hat am 18. April 2017 um 10:07 geschrieben:
>>
>>
>> Am 18.04.2017 um 00:37 schrieb Doug Anderson:
>> > Hi,
>> >
>> > On Mon, Apr 17, 2017 at 4:05 AM, Stefan Wahren <stefan.wahren@i2se.com> wrote:
>> >> Hi,
>> >>
>> >>> Stefan Wahren <stefan.wahren@i2se.com> hat am 31. Oktober 2016 um 21:34 geschrieben:
>> >>>
>> >>>
>> >>> I inspired by this issue [1] i build up a slightly modified setup with a
>> >>> Raspberry Pi B (mainline kernel 4.9rc3), a powered 7 port USB hub and 5 Prolific
>> >>> PL2303 USB to serial convertors. I modified the usb_test for dwc2 [2], which
>> >>> only tries to open all ttyUSB devices one after the other.
>> >>>
>> >>> Unfortunately the complete system stuck after opening the first ttyUSB device (
>> >>> heartbeat LED stop blinking, no reaction to debug UART). The only way to
>> >>> reanimate the system is to powerdown the USB hub with the USB to serial
>> >>> convertors.
>> >>>
>> >>> [1] - https://github.com/raspberrypi/linux/issues/1692
>> >>> [2] - https://gist.github.com/lategoodbye/dd0d30af27b6f101b03d5923b279dbaa
>> >> since this issue still exists with 4.11 (even without or with microframe scheduler enabled), i want to ask some additional questions:
>> >>
>> >> Is this issue reproducible with other dwc2 platforms than bcm2835?
>> > +Edmund Szeto, who I seem to remember emailing me about similar
>> > questions in the past.
>> >
>> >
>> >> Does the soft lockup also occurs after opening the second serial convertor or later?
>> > I don't have serial converters easily available to me, but back in the
>> > day when I was stressing things out on rk3288 I never saw anything
>> > this bad.  ...of course, on rk3288 we've got 4 A17 cores running
>> > really fast, so possibly just being slower is what causes your
>> > problems here?
>>
>> The downstream kernel of the Raspberry Pi foundation with it's
>> out-of-tree dwc_otg driver is able to handle 8 serial converter on a RPI
>> B. I would be happy to get at least 2 or 3 working on mainline.
>>
>> >
>> > I will make the following observations:
>> >
>> > 1. With dwc2 you often end up in the situation where you need to
>> > service an interrupt every 125 uS.  If servicing that interrupt takes
>> > anywhere near 125 uS in the common case then you'll be in trouble.
>>
>> I will try to measure this with a logic analyzer.
>>
>
> i took GPIO17 to measure _dwc2_hcd_irq and GPIO18 to measure _dwc2_hcd_urb_enqueue (patch against 4.11rc1 below).
>
> So i made my observations for 3 test cases:
>
> 1) no serial converter connected (idle)
> 2) 1 FTDI serial converter connected
> 3) 1 PL2303 serial converter connected
>
> case   | ksoftirq cpu     | mean duration | max duration  | max duration | urb_enqueue  |
>        |                  | hcd_irq       | hcd_irq       | urb_enqueue  | within 10 sec|
> -------+------------------+---------------+---------------+--------------+--------------+
> idle   | 0.0%             | 2 us          | 16.5 us       |     12 us    | 5            |
> FTDI   | 25.0%            | 8.5 us        | 18.0 us       |  31000 us    | ~ 400        |
> PL2303 | top doesn't work | 8.5 us        | 22.5 us       | 900000 us    | 4            |

It's hard to know for sure that all of this time is really in
urb_enqueue().  Possible we could have task switched out and been
blocked elsewhere.  Using ftrace to get more fine-grained timings
would be useful.  ktime_get(), ktime_sub(), and ktime_to_us() are your
friends here if you want to use trace_printk.


> So it seems the serial USB driver has also an impact. In the analyzer trace the FTDI triggers many smaller urb_enqueue calls in the opposite to the PL2303 which only has few but huge calls.
>
> Additional notes:
> After closing the serial connection to the FTDI the system is usable as before. In case of PL2303 i need to disconnect the converter in order to get a usable system.
>
> Why do they behave so differently?
> Are these results of a overload?
> Doug, can you point me to your timing patch?

I did in the previous message, but basically take a look at:

https://chromium-review.googlesource.com/c/321980/4/drivers/usb/core/hcd.c#1668

I never posted this upstream since it was just a test patch.

>
> Thanks
> Stefan
>
> PS: My hacky patches for measuring
>
> diff --git a/drivers/usb/dwc2/core.h b/drivers/usb/dwc2/core.h
> index 1a7e830..aa490c3 100644
> --- a/drivers/usb/dwc2/core.h
> +++ b/drivers/usb/dwc2/core.h
> @@ -930,6 +930,9 @@ struct dwc2_hsotg {
>         struct dentry *debug_root;
>         struct debugfs_regset32 *regset;
>
> +       int gpio_17;
> +       int gpio_18;
> +
>         /* DWC OTG HW Release versions */
>  #define DWC2_CORE_REV_2_71a    0x4f54271a
>  #define DWC2_CORE_REV_2_90a    0x4f54290a
> diff --git a/drivers/usb/dwc2/hcd.c b/drivers/usb/dwc2/hcd.c
> index a73722e..97271c8 100644
> --- a/drivers/usb/dwc2/hcd.c
> +++ b/drivers/usb/dwc2/hcd.c
> @@ -45,6 +45,7 @@
>  #include <linux/platform_device.h>
>  #include <linux/dma-mapping.h>
>  #include <linux/delay.h>
> +#include <linux/gpio.h>
>  #include <linux/io.h>
>  #include <linux/slab.h>
>  #include <linux/usb.h>
> @@ -4595,13 +4596,17 @@ static int _dwc2_hcd_urb_enqueue(struct usb_hcd *hcd, struct urb *urb,
>         bool qh_allocated = false;
>         struct dwc2_qtd *qtd;
>
> +       gpio_set_value(hsotg->gpio_18, 1);
> +
>         if (dbg_urb(urb)) {
>                 dev_vdbg(hsotg->dev, "DWC OTG HCD URB Enqueue\n");
>                 dwc2_dump_urb_info(hcd, urb, "urb_enqueue");
>         }
>
> -       if (!ep)
> +       if (!ep) {
> +               gpio_set_value(hsotg->gpio_18, 0);
>                 return -EINVAL;
> +       }
>
>         if (usb_pipetype(urb->pipe) == PIPE_ISOCHRONOUS ||
>             usb_pipetype(urb->pipe) == PIPE_INTERRUPT) {
> @@ -4628,8 +4633,10 @@ static int _dwc2_hcd_urb_enqueue(struct usb_hcd *hcd, struct urb *urb,
>
>         dwc2_urb = dwc2_hcd_urb_alloc(hsotg, urb->number_of_packets,
>                                       mem_flags);
> -       if (!dwc2_urb)
> +       if (!dwc2_urb) {
> +               gpio_set_value(hsotg->gpio_18, 0);
>                 return -ENOMEM;
> +       }
>
>         dwc2_hcd_urb_set_pipeinfo(hsotg, dwc2_urb, usb_pipedevice(urb->pipe),
>                                   usb_pipeendpoint(urb->pipe), ep_type,
> @@ -4705,6 +4712,8 @@ static int _dwc2_hcd_urb_enqueue(struct usb_hcd *hcd, struct urb *urb,
>
>         spin_unlock_irqrestore(&hsotg->lock, flags);
>
> +       gpio_set_value(hsotg->gpio_18, 0);
> +
>         return 0;
>
>  fail3:
> @@ -4732,6 +4741,8 @@ static int _dwc2_hcd_urb_enqueue(struct usb_hcd *hcd, struct urb *urb,
>  fail0:
>         kfree(dwc2_urb);
>
> +       gpio_set_value(hsotg->gpio_18, 0);
> +
>         return retval;
>  }
>
> @@ -4825,8 +4836,15 @@ static void _dwc2_hcd_endpoint_reset(struct usb_hcd *hcd,
>  static irqreturn_t _dwc2_hcd_irq(struct usb_hcd *hcd)
>  {
>         struct dwc2_hsotg *hsotg = dwc2_hcd_to_hsotg(hcd);
> +       irqreturn_t ret;
>
> -       return dwc2_handle_hcd_intr(hsotg);
> +       gpio_set_value(hsotg->gpio_17, 1);
> +
> +       ret = dwc2_handle_hcd_intr(hsotg);
> +
> +       gpio_set_value(hsotg->gpio_17, 0);
> +
> +       return ret;
>  }
>
>  /*
> diff --git a/drivers/usb/dwc2/platform.c b/drivers/usb/dwc2/platform.c
> index 9564bc7..e3f8a8a 100644
> --- a/drivers/usb/dwc2/platform.c
> +++ b/drivers/usb/dwc2/platform.c
> @@ -35,6 +35,7 @@
>   */
>
>  #include <linux/kernel.h>
> +#include <linux/gpio.h>
>  #include <linux/module.h>
>  #include <linux/slab.h>
>  #include <linux/clk.h>
> @@ -404,6 +405,16 @@ static int dwc2_driver_probe(struct platform_device *dev)
>                 return hsotg->irq;
>         }
>
> +       hsotg->gpio_17 = 475;
> +       retval = devm_gpio_request_one(hsotg->dev, hsotg->gpio_17,  GPIOF_OUT_INIT_LOW, "gpio_17");
> +       if (retval)
> +               pr_warn("Unable to request GPIO17: %d\n", retval);
> +
> +       hsotg->gpio_18 = 476;
> +       retval = devm_gpio_request_one(hsotg->dev, hsotg->gpio_18,  GPIOF_OUT_INIT_LOW, "gpio_18");
> +       if (retval)
> +               pr_warn("Unable to request GPIO18: %d\n", retval);
> +
>         dev_dbg(hsotg->dev, "registering common handler for irq%d\n",
>                 hsotg->irq);
>         retval = devm_request_irq(hsotg->dev, hsotg->irq,

^ permalink raw reply	[flat|nested] 30+ messages in thread

* usb: dwc2: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 146s
  2017-04-18 20:41         ` Doug Anderson
@ 2017-04-18 20:53           ` Stefan Wahren
  2017-04-19 20:25           ` Stefan Wahren
  1 sibling, 0 replies; 30+ messages in thread
From: Stefan Wahren @ 2017-04-18 20:53 UTC (permalink / raw)
  To: linux-arm-kernel


> Doug Anderson <dianders@chromium.org> hat am 18. April 2017 um 22:41 geschrieben:
> 
> 
> Stefan,
> 
> On Tue, Apr 18, 2017 at 1:25 PM, Stefan Wahren <stefan.wahren@i2se.com> wrote:
> > Hi,
> >
> > [add Johan]
> >
> >> Stefan Wahren <stefan.wahren@i2se.com> hat am 18. April 2017 um 10:07 geschrieben:
> >>
> >>
> >> Am 18.04.2017 um 00:37 schrieb Doug Anderson:
> >> > Hi,
> >> >
> >> > On Mon, Apr 17, 2017 at 4:05 AM, Stefan Wahren <stefan.wahren@i2se.com> wrote:
> >> >> Hi,
> >> >>
> >> >>> Stefan Wahren <stefan.wahren@i2se.com> hat am 31. Oktober 2016 um 21:34 geschrieben:
> >> >>>
> >> >>>
> >> >>> I inspired by this issue [1] i build up a slightly modified setup with a
> >> >>> Raspberry Pi B (mainline kernel 4.9rc3), a powered 7 port USB hub and 5 Prolific
> >> >>> PL2303 USB to serial convertors. I modified the usb_test for dwc2 [2], which
> >> >>> only tries to open all ttyUSB devices one after the other.
> >> >>>
> >> >>> Unfortunately the complete system stuck after opening the first ttyUSB device (
> >> >>> heartbeat LED stop blinking, no reaction to debug UART). The only way to
> >> >>> reanimate the system is to powerdown the USB hub with the USB to serial
> >> >>> convertors.
> >> >>>
> >> >>> [1] - https://github.com/raspberrypi/linux/issues/1692
> >> >>> [2] - https://gist.github.com/lategoodbye/dd0d30af27b6f101b03d5923b279dbaa
> >> >> since this issue still exists with 4.11 (even without or with microframe scheduler enabled), i want to ask some additional questions:
> >> >>
> >> >> Is this issue reproducible with other dwc2 platforms than bcm2835?
> >> > +Edmund Szeto, who I seem to remember emailing me about similar
> >> > questions in the past.
> >> >
> >> >
> >> >> Does the soft lockup also occurs after opening the second serial convertor or later?
> >> > I don't have serial converters easily available to me, but back in the
> >> > day when I was stressing things out on rk3288 I never saw anything
> >> > this bad.  ...of course, on rk3288 we've got 4 A17 cores running
> >> > really fast, so possibly just being slower is what causes your
> >> > problems here?
> >>
> >> The downstream kernel of the Raspberry Pi foundation with it's
> >> out-of-tree dwc_otg driver is able to handle 8 serial converter on a RPI
> >> B. I would be happy to get at least 2 or 3 working on mainline.
> >>
> >> >
> >> > I will make the following observations:
> >> >
> >> > 1. With dwc2 you often end up in the situation where you need to
> >> > service an interrupt every 125 uS.  If servicing that interrupt takes
> >> > anywhere near 125 uS in the common case then you'll be in trouble.
> >>
> >> I will try to measure this with a logic analyzer.
> >>
> >
> > i took GPIO17 to measure _dwc2_hcd_irq and GPIO18 to measure _dwc2_hcd_urb_enqueue (patch against 4.11rc1 below).
> >
> > So i made my observations for 3 test cases:
> >
> > 1) no serial converter connected (idle)
> > 2) 1 FTDI serial converter connected
> > 3) 1 PL2303 serial converter connected
> >
> > case   | ksoftirq cpu     | mean duration | max duration  | max duration | urb_enqueue  |
> >        |                  | hcd_irq       | hcd_irq       | urb_enqueue  | within 10 sec|
> > -------+------------------+---------------+---------------+--------------+--------------+
> > idle   | 0.0%             | 2 us          | 16.5 us       |     12 us    | 5            |
> > FTDI   | 25.0%            | 8.5 us        | 18.0 us       |  31000 us    | ~ 400        |
> > PL2303 | top doesn't work | 8.5 us        | 22.5 us       | 900000 us    | 4            |
> 
> It's hard to know for sure that all of this time is really in
> urb_enqueue().  Possible we could have task switched out and been
> blocked elsewhere.  Using ftrace to get more fine-grained timings
> would be useful.  ktime_get(), ktime_sub(), and ktime_to_us() are your
> friends here if you want to use trace_printk.
> 

I saw your last reply after sending my last mail. I will go further with ftrace.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* usb: dwc2: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 146s
  2017-04-18 20:25       ` Stefan Wahren
  2017-04-18 20:41         ` Doug Anderson
@ 2017-04-19  8:56         ` Johan Hovold
  2017-04-19  9:12           ` Stefan Wahren
  1 sibling, 1 reply; 30+ messages in thread
From: Johan Hovold @ 2017-04-19  8:56 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Apr 18, 2017 at 10:25:06PM +0200, Stefan Wahren wrote:
> Hi,
> 
> [add Johan]
> 
> > Stefan Wahren <stefan.wahren@i2se.com> hat am 18. April 2017 um 10:07 geschrieben:
> > 
> > 
> > Am 18.04.2017 um 00:37 schrieb Doug Anderson:
> > > Hi,
> > >
> > > On Mon, Apr 17, 2017 at 4:05 AM, Stefan Wahren <stefan.wahren@i2se.com> wrote:
> > >> Hi,
> > >>
> > >>> Stefan Wahren <stefan.wahren@i2se.com> hat am 31. Oktober 2016 um 21:34 geschrieben:
> > >>>
> > >>>
> > >>> I inspired by this issue [1] i build up a slightly modified
> > >>> setup with a Raspberry Pi B (mainline kernel 4.9rc3), a powered
> > >>> 7 port USB hub and 5 Prolific PL2303 USB to serial convertors. I
> > >>> modified the usb_test for dwc2 [2], which only tries to open all
> > >>> ttyUSB devices one after the other.
> > >>>
> > >>> Unfortunately the complete system stuck after opening the first
> > >>> ttyUSB device ( heartbeat LED stop blinking, no reaction to
> > >>> debug UART). The only way to reanimate the system is to
> > >>> powerdown the USB hub with the USB to serial convertors.
> > >>>
> > >>> [1] - https://github.com/raspberrypi/linux/issues/1692

> i took GPIO17 to measure _dwc2_hcd_irq and GPIO18 to measure
> _dwc2_hcd_urb_enqueue (patch against 4.11rc1 below).
> 
> So i made my observations for 3 test cases:
> 
> 1) no serial converter connected (idle)
> 2) 1 FTDI serial converter connected
> 3) 1 PL2303 serial converter connected
> 
> case   | ksoftirq cpu     | mean duration | max duration  | max duration | urb_enqueue  |
>        |                  | hcd_irq       | hcd_irq       | urb_enqueue  | within 10 sec|
> -------+------------------+---------------+---------------+--------------+--------------+
> idle   | 0.0%             | 2 us          | 16.5 us       |     12 us    | 5            |
> FTDI   | 25.0%            | 8.5 us        | 18.0 us       |  31000 us    | ~ 400        |
> PL2303 | top doesn't work | 8.5 us        | 22.5 us       | 900000 us    | 4            |
> 
> So it seems the serial USB driver has also an impact. In the analyzer
> trace the FTDI triggers many smaller urb_enqueue calls in the opposite
> to the PL2303 which only has few but huge calls.

What do you mean by "huge calls" above?

Are you just keeping the ports open (i.e. with no data being received or
sent)?

FTDI devices are notorious for their status messages being sent
continuously while the port is open. You'll get a two-byte bulk-in
message every 16ms by default (these were sent every millisecond up
until recently due to a long-standing regression).

PL2303 devices have an interrupt-in endpoint which is used for status
notifications so you would not see any completion callbacks on an
otherwise idle port.

> Additional notes:
> After closing the serial connection to the FTDI the system is usable
> as before. In case of PL2303 i need to disconnect the converter in
> order to get a usable system.

Does your system lock up when you open the first pl2303 device?

> Why do they behave so differently?

So the ftdi bulk-in status messages and the pl2303 interrupt-in endpoint
could be two reasons, but I guess so could any change in timing etc.

Is your ftdi-device a Full Speed device like the pl2303?

Johan

^ permalink raw reply	[flat|nested] 30+ messages in thread

* usb: dwc2: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 146s
  2017-04-19  8:56         ` Johan Hovold
@ 2017-04-19  9:12           ` Stefan Wahren
  2017-04-19  9:55             ` Johan Hovold
  0 siblings, 1 reply; 30+ messages in thread
From: Stefan Wahren @ 2017-04-19  9:12 UTC (permalink / raw)
  To: linux-arm-kernel

Am 19.04.2017 um 10:56 schrieb Johan Hovold:
> On Tue, Apr 18, 2017 at 10:25:06PM +0200, Stefan Wahren wrote:
>> Hi,
>>
>> [add Johan]
>>
>>> Stefan Wahren <stefan.wahren@i2se.com> hat am 18. April 2017 um 10:07 geschrieben:
>>>
>>>
>>> Am 18.04.2017 um 00:37 schrieb Doug Anderson:
>>>> Hi,
>>>>
>>>> On Mon, Apr 17, 2017 at 4:05 AM, Stefan Wahren <stefan.wahren@i2se.com> wrote:
>>>>> Hi,
>>>>>
>>>>>> Stefan Wahren <stefan.wahren@i2se.com> hat am 31. Oktober 2016 um 21:34 geschrieben:
>>>>>>
>>>>>>
>>>>>> I inspired by this issue [1] i build up a slightly modified
>>>>>> setup with a Raspberry Pi B (mainline kernel 4.9rc3), a powered
>>>>>> 7 port USB hub and 5 Prolific PL2303 USB to serial convertors. I
>>>>>> modified the usb_test for dwc2 [2], which only tries to open all
>>>>>> ttyUSB devices one after the other.
>>>>>>
>>>>>> Unfortunately the complete system stuck after opening the first
>>>>>> ttyUSB device ( heartbeat LED stop blinking, no reaction to
>>>>>> debug UART). The only way to reanimate the system is to
>>>>>> powerdown the USB hub with the USB to serial convertors.
>>>>>>
>>>>>> [1] - https://github.com/raspberrypi/linux/issues/1692
>> i took GPIO17 to measure _dwc2_hcd_irq and GPIO18 to measure
>> _dwc2_hcd_urb_enqueue (patch against 4.11rc1 below).
>>
>> So i made my observations for 3 test cases:
>>
>> 1) no serial converter connected (idle)
>> 2) 1 FTDI serial converter connected
>> 3) 1 PL2303 serial converter connected
>>
>> case   | ksoftirq cpu     | mean duration | max duration  | max duration | urb_enqueue  |
>>        |                  | hcd_irq       | hcd_irq       | urb_enqueue  | within 10 sec|
>> -------+------------------+---------------+---------------+--------------+--------------+
>> idle   | 0.0%             | 2 us          | 16.5 us       |     12 us    | 5            |
>> FTDI   | 25.0%            | 8.5 us        | 18.0 us       |  31000 us    | ~ 400        |
>> PL2303 | top doesn't work | 8.5 us        | 22.5 us       | 900000 us    | 4            |
>>
>> So it seems the serial USB driver has also an impact. In the analyzer
>> trace the FTDI triggers many smaller urb_enqueue calls in the opposite
>> to the PL2303 which only has few but huge calls.
> What do you mean by "huge calls" above?

The time it takes until _dwc2_hcd_urb_enqueue is finished.

>
> Are you just keeping the ports open (i.e. with no data being received or
> sent)?

Yes, only open and no data is received or sent (LEDs doesn't show any
activity).

>
> FTDI devices are notorious for their status messages being sent
> continuously while the port is open. You'll get a two-byte bulk-in
> message every 16ms by default (these were sent every millisecond up
> until recently due to a long-standing regression).
>
> PL2303 devices have an interrupt-in endpoint which is used for status
> notifications so you would not see any completion callbacks on an
> otherwise idle port.
>
>> Additional notes:
>> After closing the serial connection to the FTDI the system is usable
>> as before. In case of PL2303 i need to disconnect the converter in
>> order to get a usable system.
> Does your system lock up when you open the first pl2303 device?

Yes (directly to the onboard hub or an external hub make no difference).

>
>> Why do they behave so differently?
> So the ftdi bulk-in status messages and the pl2303 interrupt-in endpoint
> could be two reasons, but I guess so could any change in timing etc.
>
> Is your ftdi-device a Full Speed device like the pl2303?

Sorry, i will need to verify.

Thanks
Stefan

>
> Johan

^ permalink raw reply	[flat|nested] 30+ messages in thread

* usb: dwc2: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 146s
  2017-04-19  9:12           ` Stefan Wahren
@ 2017-04-19  9:55             ` Johan Hovold
  0 siblings, 0 replies; 30+ messages in thread
From: Johan Hovold @ 2017-04-19  9:55 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Apr 19, 2017 at 11:12:17AM +0200, Stefan Wahren wrote:
> Am 19.04.2017 um 10:56 schrieb Johan Hovold:
> > On Tue, Apr 18, 2017 at 10:25:06PM +0200, Stefan Wahren wrote:
> >> Hi,
> >>
> >> [add Johan]
> >>
> >>> Stefan Wahren <stefan.wahren@i2se.com> hat am 18. April 2017 um 10:07 geschrieben:
> >>>
> >>>
> >>> Am 18.04.2017 um 00:37 schrieb Doug Anderson:
> >>>> Hi,
> >>>>
> >>>> On Mon, Apr 17, 2017 at 4:05 AM, Stefan Wahren <stefan.wahren@i2se.com> wrote:
> >>>>> Hi,
> >>>>>
> >>>>>> Stefan Wahren <stefan.wahren@i2se.com> hat am 31. Oktober 2016 um 21:34 geschrieben:
> >>>>>>
> >>>>>>
> >>>>>> I inspired by this issue [1] i build up a slightly modified
> >>>>>> setup with a Raspberry Pi B (mainline kernel 4.9rc3), a powered
> >>>>>> 7 port USB hub and 5 Prolific PL2303 USB to serial convertors. I
> >>>>>> modified the usb_test for dwc2 [2], which only tries to open all
> >>>>>> ttyUSB devices one after the other.
> >>>>>>
> >>>>>> Unfortunately the complete system stuck after opening the first
> >>>>>> ttyUSB device ( heartbeat LED stop blinking, no reaction to
> >>>>>> debug UART). The only way to reanimate the system is to
> >>>>>> powerdown the USB hub with the USB to serial convertors.
> >>>>>>
> >>>>>> [1] - https://github.com/raspberrypi/linux/issues/1692
> >> i took GPIO17 to measure _dwc2_hcd_irq and GPIO18 to measure
> >> _dwc2_hcd_urb_enqueue (patch against 4.11rc1 below).
> >>
> >> So i made my observations for 3 test cases:
> >>
> >> 1) no serial converter connected (idle)
> >> 2) 1 FTDI serial converter connected
> >> 3) 1 PL2303 serial converter connected
> >>
> >> case   | ksoftirq cpu     | mean duration | max duration  | max duration | urb_enqueue  |
> >>        |                  | hcd_irq       | hcd_irq       | urb_enqueue  | within 10 sec|
> >> -------+------------------+---------------+---------------+--------------+--------------+
> >> idle   | 0.0%             | 2 us          | 16.5 us       |     12 us    | 5            |
> >> FTDI   | 25.0%            | 8.5 us        | 18.0 us       |  31000 us    | ~ 400        |
> >> PL2303 | top doesn't work | 8.5 us        | 22.5 us       | 900000 us    | 4            |
> >>
> >> So it seems the serial USB driver has also an impact. In the analyzer
> >> trace the FTDI triggers many smaller urb_enqueue calls in the opposite
> >> to the PL2303 which only has few but huge calls.
> > What do you mean by "huge calls" above?
> 
> The time it takes until _dwc2_hcd_urb_enqueue is finished.

Yes, sorry, I realised that after I sent my reply.

> > Are you just keeping the ports open (i.e. with no data being received or
> > sent)?
> 
> Yes, only open and no data is received or sent (LEDs doesn't show any
> activity).
> 
> > FTDI devices are notorious for their status messages being sent
> > continuously while the port is open. You'll get a two-byte bulk-in
> > message every 16ms by default (these were sent every millisecond up
> > until recently due to a long-standing regression).
> >
> > PL2303 devices have an interrupt-in endpoint which is used for status
> > notifications so you would not see any completion callbacks on an
> > otherwise idle port.
> >
> >> Additional notes:
> >> After closing the serial connection to the FTDI the system is usable
> >> as before. In case of PL2303 i need to disconnect the converter in
> >> order to get a usable system.
> > Does your system lock up when you open the first pl2303 device?
> 
> Yes (directly to the onboard hub or an external hub make no difference).

Ok, good. I suggest not using the external hub until you've tracked this
down.

> >> Why do they behave so differently?
> > So the ftdi bulk-in status messages and the pl2303 interrupt-in endpoint
> > could be two reasons, but I guess so could any change in timing etc.
> >
> > Is your ftdi-device a Full Speed device like the pl2303?
> 
> Sorry, i will need to verify.

Can you reproduce this on 4.11-rc7 without the external hub and post the
corresponding soft lockup bug output?

On open, the pl2303 driver first enqueues the interrupt-in URB before
enqueueing two bulk-in URBs. From your original stack trace it looks
like the lockup happens when queueing either of those. Perhaps you can
enable debugging in the dwc2 driver or add some printks to determine how
far you get.

Another thing to try is to comment out the interrupt-in URB submission
in pl2303 and see whether that avoids triggering the lockup.

Johan

^ permalink raw reply	[flat|nested] 30+ messages in thread

* usb: dwc2: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 146s
  2017-04-18 20:41         ` Doug Anderson
  2017-04-18 20:53           ` Stefan Wahren
@ 2017-04-19 20:25           ` Stefan Wahren
  2017-04-19 21:47             ` Doug Anderson
  2017-04-20 18:54             ` Eric Anholt
  1 sibling, 2 replies; 30+ messages in thread
From: Stefan Wahren @ 2017-04-19 20:25 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,

> Doug Anderson <dianders@chromium.org> hat am 18. April 2017 um 22:41 geschrieben:
> 
> 
> It's hard to know for sure that all of this time is really in
> urb_enqueue().  Possible we could have task switched out and been
> blocked elsewhere.  Using ftrace to get more fine-grained timings
> would be useful.  ktime_get(), ktime_sub(), and ktime_to_us() are your
> friends here if you want to use trace_printk.

i'm a newbie to ftrace, so i hope this would be helpful.

# connect PL2303 to the onboard hub
# echo 0 > options/sleep-time
# echo 0 > function_profile_enabled
# echo 1 > function_profile_enabled
# ./usb_test
# Waiting for at least 20 seconds and then disconnect PL2303
# echo 0 > function_profile_enabled
# cat trace_stat/function0

  Function                               Hit    Time            Avg             s^2
  --------                               ---    ----            ---             ---
  bcm2835_handle_irq                  361347    219567633 us     607.636 us      1485199 us  
  __handle_domain_irq                1082482    212639551 us     196.437 us      3642030 us  
  generic_handle_irq                 1082482    100592051 us     92.927 us       50511334 us 
  irq_exit                           1082482    98197771 us     90.715 us       29649040 us 
  handle_level_irq                   1082482    95812379 us     88.511 us       51910093 us 
  do_sys_open                           1806    87612983 us     48512.17 us     2198507 us  
  SyS_open                              1601    87372331 us     54573.59 us     1898996 us  
  do_filp_open                          1862    87368058 us     46921.62 us     1634982 us  
  path_openat                           1862    87314553 us     46892.88 us     3357817 us  
  __do_softirq                          3035    86266050 us     28423.73 us     6449768 us  
  vfs_open                              1515    85877012 us     56684.49 us     101673.5 us 
  do_dentry_open                        1515    85861429 us     56674.21 us     812420.7 us 
  usb_submit_urb                         136    85760172 us     630589.5 us     59532024 us 
  usb_hcd_submit_urb                     134    85756518 us     639974.0 us     726298102 us 
  _dwc2_hcd_urb_enqueue                  134    85738333 us     639838.3 us     874104371 us 
  chrdev_open                             87    85716519 us     985247.3 us     1951112835 us 
  tty_open                                 3    85714494 us     28571498 us     3743206849 us 
  tty_port_open                            3    85712603 us     28570867 us     1968003468 us 
  serial_open                              1    85712472 us     85712472 us     0.000 us    
  serial_port_activate                     1    85709890 us     85709890 us     0.000 us    
  pl2303_open                              1    85709875 us     85709875 us     0.000 us    
  usb_serial_generic_open                  1    85701170 us     85701170 us     0.000 us    
  usb_serial_generic_submit_read           1    85701166 us     85701166 us     0.000 us    
  usb_serial_generic_submit_read           2    85701160 us     42850580 us     2252410463 us 
  handle_irq_event                   1082482    80686112 us     74.538 us       36417905 us 
  handle_irq_event_percpu            1082482    78398378 us     72.424 us       34060119 us 
  __handle_irq_event_percpu          1082482    68466046 us     63.249 us       53609076 us 
  usb_hcd_irq                        1078945    59761116 us     55.388 us       25139388 us 
  _dwc2_hcd_irq                      1078945    57481606 us     53.275 us       30847036 us 
  dwc2_handle_hcd_intr               1078945    55380910 us     51.328 us       35986237 us 
  do_idle                                394    43577379 us     110602.4 us     84278209 us 
  dwc2_hc_n_intr                      651065    42408491 us     65.137 us       7281229 us  
  default_idle_call                   229969    38021287 us     165.332 us      11386904 us 
  arch_cpu_idle                       229969    37379305 us     162.540 us      2168909 us  
  dwc2_halt_channel                   650811    34208348 us     52.562 us       7792012 us  
  dwc2_release_channel                651065    33042638 us     50.751 us       8112754 us  
  dwc2_hc_nak_intr                    334689    18892951 us     56.449 us       1823592 us  
  dwc2_hc_ack_intr                    315986    18323433 us     57.988 us       151346.6 us 
  dwc2_hcd_select_transactions       1730010    11694955 us     6.760 us        1188442 us  
  dwc2_hcd_queue_transactions         651065    10931854 us     16.790 us       952074.2 us 
  dwc2_assign_and_init_hc             651065    9540885 us      14.654 us       258067.1 us 
  dwc2_queue_transaction              651065    9192559 us      14.119 us       736431.6 us 
  tick_nohz_irq_exit                  325055    9052192 us      27.848 us       916293.3 us 
  __tick_nohz_idle_enter.constpr      324760    8463861 us      26.061 us       2410440 us  
  dwc2_hc_start_transfer              651065    7687376 us      11.807 us       625773.3 us 
  irq_enter                          1082482    6627959 us      6.122 us        11528932 us 
  arm_heavy_mb                      10903913    6603586 us      0.605 us        60562924 us 
  unmask_irq                         1082482    4763699 us      4.400 us        1032638 us  
  tick_irq_enter                      326467    4377088 us      13.407 us       4606805 us  
  dwc2_handle_common_intr            1078945    3799392 us      3.521 us        1131504 us  
  add_interrupt_randomness           1082482    3542044 us      3.272 us        30545511 us 
  get_next_armctrl_hwirq             1443829    3276974 us      2.269 us        2116848 us  
  tick_nohz_restart                    70325    3250462 us      46.220 us       35710.73 us 
  dwc2_hc_cleanup                     651065    3079003 us      4.729 us        81202.50 us 
  dwc2_hcd_qh_deactivate              651065    3039940 us      4.669 us        90928.10 us 
  ledtrig_cpu                         459938    2450327 us      5.327 us        1466705 us  
  arch_cpu_idle_exit                  229969    2443315 us      10.624 us       8606787 us  
  do_page_fault                         5509    2304998 us      418.405 us      17961439 us 
  ktime_get                           795998    2121741 us      2.665 us        166119.5 us 
  get_next_timer_interrupt            324760    1998999 us      6.155 us        32201.63 us 
  handle_mm_fault                       5537    1945500 us      351.363 us      8324917 us  
  dwc2_is_controller_alive           2157890    1473590 us      0.682 us        291399.4 us 
  hrtimer_cancel                       70450    1394330 us      19.791 us       36980.28 us 
  arch_cpu_idle_enter                 229969    1350759 us      5.873 us        10789.94 us 
  hrtimer_try_to_cancel                71029    1266967 us      17.837 us       30260.75 us 
  filemap_map_pages                     1632    1263006 us      773.900 us      207846.4 us 
  tick_program_event                  141799    1260485 us      8.889 us        9343.194 us 
  hrtimer_start_range_ns               97555    1253698 us      12.851 us       43166.94 us 
  irq_to_desc                        1082482    1146955 us      1.059 us        546937.6 us 
  link_path_walk                        3486    1134351 us      325.401 us      484443.0 us 
  __remove_hrtimer                     99205    999724.0 us     10.077 us       201933.9 us 
  clockevents_program_event           141799    985346.0 us     6.948 us        7114.725 us 
  note_interrupt                     1082482    899494.0 us     0.830 us        705141.6 us 
  update_ts_time_stats.constprop      324760    862996.0 us     2.657 us        21813.81 us 
  alloc_set_pte                        20011    805880.0 us     40.271 us       1086748 us  
  SyS_read                              1940    805061.0 us     414.979 us      164198.7 us 
  _local_bh_enable                    326467    777908.0 us     2.382 us        37202.43 us 
  vfs_read                              2080    761487.0 us     366.099 us      22944777 us 
  dwc2_hcd_get_frame_number          1097673    721108.0 us     0.656 us        525348.1 us 
  do_munmap                             1194    717161.0 us     600.637 us      65635.31 us 
  schedule                              3297    700578.0 us     212.489 us      28594.59 us 
  __vfs_read                            2080    694376.0 us     333.834 us      21955675 us 
  walk_component                        9912    683211.0 us     68.927 us       349771.7 us 
  irq_find_mapping                   1082482    673162.0 us     0.621 us        900861.5 us 
  __next_timer_interrupt              324929    667175.0 us     2.053 us        4576.720 us 
  armctrl_translate_shortcut         1079274    662875.0 us     0.614 us        164464.8 us 
  filename_lookup                       1316    646059.0 us     490.926 us      148884.8 us 
  mmput                                  259    644417.0 us     2488.096 us     33155247 us 
  armctrl_unmask_irq                 1082482    639692.0 us     0.590 us        891074.7 us 
  armctrl_mask_irq                   1082482    636505.0 us     0.588 us        888583.1 us 
  exit_mmap                               45    634389.0 us     14097.53 us     24105769 us 
  irq_state_set_masked               1082482    622182.0 us     0.574 us        894212.6 us 
  user_path_at_empty                    1156    620229.0 us     536.530 us      166865.3 us 
  SyS_munmap                             895    600613.0 us     671.075 us      75763.68 us 
  vm_mmap_pgoff                         1437    598955.0 us     416.809 us      55882.41 us 
  read_current_timer                 1082666    598757.0 us     0.553 us        86817.59 us 
  irq_may_run                        1082482    598471.0 us     0.552 us        898308.6 us 
  vm_munmap                              937    595309.0 us     635.335 us      70914.31 us 
  path_lookupat                         1316    592448.0 us     450.188 us      147624.4 us 
  SyS_mmap_pgoff                        1325    588992.0 us     444.522 us      62142.49 us 
  irq_state_clr_masked               1082482    582077.0 us     0.537 us        893572.4 us 
  idle_cpu                           1082482    580428.0 us     0.536 us        905609.5 us 
  do_execve                               28    573864.0 us     20495.14 us     98356111 us 
  do_execveat_common                      28    573576.0 us     20484.85 us     98471441 us 
  dwc2_hc_set_even_odd_frame          651065    570127.0 us     0.875 us        543701.5 us 
  ledtrig_cpu.part.0                  229969    549060.0 us     2.387 us        9142.920 us 
  do_mmap                               1437    548817.0 us     381.918 us      54270.64 us 
  do_work_pending                       2211    525805.0 us     237.813 us      234836.0 us 
  led_trigger_event                   460228    501093.0 us     1.088 us        4759241 us  
  search_binary_handler                   28    501026.0 us     17893.78 us     93445723 us 
  clocksource_mmio_readl_up           807791    500765.0 us     0.619 us        91050.42 us 
  load_elf_binary                         28    500359.0 us     17869.96 us     93309220 us 
  process_one_work                       558    496994.0 us     890.670 us      7196614 us  
  tick_do_update_jiffies64            324826    496168.0 us     1.527 us        3422867 us  
  hrtimer_forward                      72095    490887.0 us     6.808 us        4548.034 us 
  unmap_vmas                            1239    490538.0 us     395.914 us      2723415 us  
  finish_task_switch                    3926    479983.0 us     122.257 us      25336.71 us 
  unmap_single_vma                      3403    474812.0 us     139.527 us      959184.6 us 
  __sync_icache_dcache                 27231    468069.0 us     17.188 us       364740.1 us 
  SyS_execve                              16    462210.0 us     28888.12 us     4450186 us  
  unmap_region                          1194    460297.0 us     385.508 us      44372.58 us 
  unmap_page_range                      3403    455941.0 us     133.982 us      956896.1 us 
  bcm2835_time_interrupt                3208    455584.0 us     142.014 us      56012.72 us 
  mmap_region                           1437    449599.0 us     312.873 us      55175.49 us 
  hrtimer_interrupt                     3208    443908.0 us     138.375 us      55446.73 us 
  dwc2_hcd_qh_add                     632341    436511.0 us     0.690 us        58787.66 us 
  lookup_fast                          11601    436227.0 us     37.602 us       570183.6 us 
  inode_permission                     14699    426528.0 us     29.017 us       404007.8 us 
  SyS_access                             561    412200.0 us     734.759 us      146663.8 us 
  dwc2_hcd_qh_unlink                  632344    410157.0 us     0.648 us        87505.07 us 
  seq_read                               669    408885.0 us     611.188 us      226703.8 us 
  SyS_faccessat                          564    406969.0 us     721.576 us      141955.4 us 
  dwc2_check_qtd_still_ok             651065    392283.0 us     0.602 us        59460.80 us 
  SyS_getdents64                         392    379673.0 us     968.553 us      13678636 us 
  SyS_write                              404    378782.0 us     937.579 us      254399.0 us 
  iterate_dir                            393    367777.0 us     935.819 us      13659529 us 
  vfs_write                              404    358326.0 us     886.945 us      240956.8 us 
  __hrtimer_run_queues.constprop        3227    357720.0 us     110.852 us      47783.12 us 
  task_work_run                         1460    350902.0 us     240.343 us      71379.18 us 
  _cond_resched                       109971    344855.0 us     3.135 us        2245087 us  
  __vfs_write                            404    333260.0 us     824.900 us      230347.3 us 
  ____fput                              1680    326232.0 us     194.185 us      40913.99 us 
  __inode_permission                   14699    324102.0 us     22.049 us       328791.7 us 
  __fput                                1680    316285.0 us     188.264 us      40874.09 us 
  flush_old_exec                          28    313336.0 us     11190.57 us     96843591 us 
  tick_sched_timer                      1964    312793.0 us     159.263 us      2002.533 us 
  do_wp_page                            1758    300693.0 us     171.042 us      17667.03 us 
  _do_fork                                29    281388.0 us     9703.034 us     54207832 us 
  copy_process.part.4                     29    273985.0 us     9447.758 us     53888393 us 
  SyS_clone                               17    267170.0 us     15715.88 us     2444683 us  
  free_pgtables                         1239    265593.0 us     214.360 us      500428.0 us 
  dput                                 15879    265579.0 us     16.725 us       269685.7 us 
  SyS_openat                             205    258191.0 us     1259.468 us     209643.6 us 
  pfn_valid                            18146    257726.0 us     14.202 us       121416.6 us 
  update_wall_time                      5957    250787.0 us     42.099 us       619.230 us  
  mmc_blk_issue_rq                        63    246608.0 us     3914.412 us     8858815 us  
  mmc_blk_issue_rw_rq                     63    241626.0 us     3835.333 us     8884205 us  
  vfs_statx                              441    232198.0 us     526.526 us      150124.0 us 
  SyS_sendmsg                            316    226698.0 us     717.398 us      164208.5 us 
  mmc_start_request                      145    225615.0 us     1555.965 us     5542154 us  
  __sys_sendmsg                          316    223856.0 us     708.405 us      164907.1 us 
  wp_page_copy                          1061    222841.0 us     210.029 us      13010.76 us 
  __generic_file_write_iter              206    219092.0 us     1063.553 us     242948.7 us 
  terminate_walk                        3225    217649.0 us     67.488 us       47336.14 us 
  ext4_file_write_iter                   170    216280.0 us     1272.235 us     127738.3 us 
  __mmc_start_request                    145    214090.0 us     1476.482 us     5509235 us  
  ___sys_sendmsg                         316    212732.0 us     673.202 us      161410.8 us 
  kmem_cache_alloc                     17570    210590.0 us     11.985 us       120917.2 us 
  mmc_start_areq                          63    210528.0 us     3341.714 us     9610382 us  
  sdhci_request                          145    209828.0 us     1447.089 us     5547701 us  
  sock_sendmsg                           353    208523.0 us     590.716 us      147666.9 us 
  __split_vma                            981    206458.0 us     210.456 us      13253.43 us 
  credit_entropy_bits                  16914    205434.0 us     12.145 us       208.830 us  
  hrtimer_get_next_event              324760    199659.0 us     0.614 us        17454.58 us 
  path_put                              6090    196782.0 us     32.312 us       54264.43 us 
  SyS_mprotect                           427    193190.0 us     452.435 us      42168.10 us 
  SyS_stat64                             370    181587.0 us     490.775 us      119707.4 us 
  nr_iowait_cpu                       324760    176767.0 us     0.544 us        17782.33 us 
  __vma_adjust                          1452    174253.0 us     120.008 us      5746.323 us 
  tick_sched_handle                     1964    174004.0 us     88.596 us       419.232 us  
  sdhci_irq                              211    173984.0 us     824.568 us      4212897 us  
  touch_softlockup_watchdog_sche      323207    173767.0 us     0.537 us        108651.0 us 
  kernfs_iop_permission                 5012    173442.0 us     34.605 us       21830.09 us 
  __local_bh_enable                   329502    172898.0 us     0.524 us        9002.660 us 
  proc_single_show                       233    170938.0 us     733.639 us      235941.1 us 
  mprotect_fixup                         455    163622.0 us     359.608 us      43581.82 us 
  SyS_recvmsg                            515    163337.0 us     317.159 us      23749.85 us 
  update_process_times                  1964    161227.0 us     82.091 us       436.558 us  
  usb_control_msg                        131    160306.0 us     1223.709 us     63779.20 us 
  __sys_recvmsg                          515    159685.0 us     310.067 us      23888.88 us 
  check_carrier                           29    159209.0 us     5489.965 us     157176.1 us 
  __dentry_kill                         1306    159044.0 us     121.779 us      13522.69 us 
  smsc95xx_mdio_read                      29    156605.0 us     5400.172 us     147578.2 us 
  __smsc95xx_mdio_read                    29    156128.0 us     5383.724 us     153872.9 us 
  ext4_readdir                            60    152036.0 us     2533.933 us     67111799 us 
  run_ksoftirqd                          844    151846.0 us     179.912 us      74010.42 us 
  usb_start_wait_urb                     131    151181.0 us     1154.053 us     59721.86 us 
  find_vma                              8806    149418.0 us     16.967 us       419885.5 us 
  __alloc_pages_nodemask                4365    143647.0 us     32.908 us       22236.43 us 
  ___sys_recvmsg                         515    143567.0 us     278.770 us      23183.05 us 
  memblock_is_map_memory               18146    142969.0 us     7.878 us        62191.29 us 
  timekeeping_max_deferment           252165    141177.0 us     0.559 us        3782.924 us 
  unlink_anon_vmas                      3403    137080.0 us     40.282 us       17090.37 us 
  __vma_link_rb                         3403    136104.0 us     39.995 us       12357.79 us 
  rcu_process_callbacks                 1050    135701.0 us     129.239 us      66292.86 us 
  SyS_close                             1977    135041.0 us     68.306 us       4406.697 us 
  pick_next_task_fair                   3996    133622.0 us     33.438 us       1391.499 us 
  proc_readfd                             50    133185.0 us     2663.700 us     18788258 us 
  proc_readfd_common                      50    132911.0 us     2658.220 us     18769391 us 
  generic_perform_write                  206    132888.0 us     645.087 us      134152.0 us 
  __mix_pool_bytes                     16914    128343.0 us     7.587 us        303.814 us  
  run_timer_softirq                     1960    127980.0 us     65.295 us       13337.43 us 
  timekeeping_update                    5956    127128.0 us     21.344 us       147.035 us  
  try_to_wake_up                        2348    125639.0 us     53.508 us       5318.431 us 
  ktime_add_safe                      241886    125624.0 us     0.519 us        6352.310 us 
  __rcu_process_callbacks               2100    125522.0 us     59.772 us       1422883 us  
  __wake_up                            19318    121496.0 us     6.289 us        54037.07 us 
  tick_nohz_idle_enter                   394    119544.0 us     303.411 us      24204.73 us 
  proc_fill_cache                        402    119333.0 us     296.848 us      32614.19 us 
  unix_dgram_sendmsg                     144    119088.0 us     827.000 us      197140.4 us 
  enqueue_hrtimer                      99205    118756.0 us     1.197 us        1634.105 us 
  call_usermodehelper_exec_async          12    117852.0 us     9821.000 us     319067.4 us 
  mutex_lock                           14089    117414.0 us     8.333 us        245056.5 us 
  sock_recvmsg                           548    116787.0 us     213.114 us      20249.99 us 
  __smsc95xx_read_reg                     87    116418.0 us     1338.137 us     39000.81 us 
  usbnet_read_cmd                         87    115523.0 us     1327.850 us     37075.87 us 
  SyS_epoll_wait                         336    114271.0 us     340.092 us      47629.77 us 
  __close_fd                            1977    113569.0 us     57.445 us       3969.846 us 
  split_vma                              501    112236.0 us     224.023 us      20703.73 us 
  vma_link                              1248    111097.0 us     89.020 us       5058.706 us 
  __usbnet_read_cmd                       87    111024.0 us     1276.137 us     34288.14 us 
  schedule_hrtimeout_range               390    107480.0 us     275.589 us      10798.57 us 
  dentry_unlink_inode                    733    107440.0 us     146.575 us      8915.389 us 
  unix_find_other                        172    106799.0 us     620.924 us      79499.41 us 
  iput                                   757    106445.0 us     140.614 us      8828.710 us 
  ext4_htree_fill_tree                    36    106438.0 us     2956.611 us     10423331 us 
  kernfs_fop_read                        198    105215.0 us     531.388 us      48887.87 us 
  htree_dirblock_to_tree                  36    104352.0 us     2898.666 us     10008360 us 
  schedule_hrtimeout_range_clock         390    103931.0 us     266.489 us      10595.17 us 
  SyS_connect                             99    102476.0 us     1035.111 us     160650.8 us 
  __mark_inode_dirty                     641    102069.0 us     159.234 us      35047.69 us 
  expire_timers                          700    100145.0 us     143.064 us      8519.548 us 
  do_wait                                 42    99498.00 us     2369.000 us     11701021 us 
  lookup_slow                            413    99137.00 us     240.041 us      40034.02 us 
  release_task                            29    98995.00 us     3413.620 us     12343735 us 
  SyS_fstat64                           1168    98993.00 us     84.754 us       5326.254 us 
  wait_consider_task                     444    98839.00 us     222.610 us      1532328 us  
  kmem_cache_free                      17783    98715.00 us     5.551 us        72804.30 us

...

^ permalink raw reply	[flat|nested] 30+ messages in thread

* usb: dwc2: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 146s
  2017-04-19 20:25           ` Stefan Wahren
@ 2017-04-19 21:47             ` Doug Anderson
  2017-04-20  7:46               ` Stefan Wahren
  2017-04-20 18:54             ` Eric Anholt
  1 sibling, 1 reply; 30+ messages in thread
From: Doug Anderson @ 2017-04-19 21:47 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,

On Wed, Apr 19, 2017 at 1:25 PM, Stefan Wahren <stefan.wahren@i2se.com> wrote:
> Hi,
>
>> Doug Anderson <dianders@chromium.org> hat am 18. April 2017 um 22:41 geschrieben:
>>
>>
>> It's hard to know for sure that all of this time is really in
>> urb_enqueue().  Possible we could have task switched out and been
>> blocked elsewhere.  Using ftrace to get more fine-grained timings
>> would be useful.  ktime_get(), ktime_sub(), and ktime_to_us() are your
>> friends here if you want to use trace_printk.
>
> i'm a newbie to ftrace, so i hope this would be helpful.
>
> # connect PL2303 to the onboard hub
> # echo 0 > options/sleep-time
> # echo 0 > function_profile_enabled
> # echo 1 > function_profile_enabled
> # ./usb_test
> # Waiting for at least 20 seconds and then disconnect PL2303
> # echo 0 > function_profile_enabled
> # cat trace_stat/function0
>
>   Function                               Hit    Time            Avg             s^2
>   --------                               ---    ----            ---             ---
>   bcm2835_handle_irq                  361347    219567633 us     607.636 us      1485199 us
>   __handle_domain_irq                1082482    212639551 us     196.437 us      3642030 us
>   generic_handle_irq                 1082482    100592051 us     92.927 us       50511334 us
>   irq_exit                           1082482    98197771 us     90.715 us       29649040 us
>   handle_level_irq                   1082482    95812379 us     88.511 us       51910093 us
>   do_sys_open                           1806    87612983 us     48512.17 us     2198507 us
>   SyS_open                              1601    87372331 us     54573.59 us     1898996 us
>   do_filp_open                          1862    87368058 us     46921.62 us     1634982 us
>   path_openat                           1862    87314553 us     46892.88 us     3357817 us
>   __do_softirq                          3035    86266050 us     28423.73 us     6449768 us
>   vfs_open                              1515    85877012 us     56684.49 us     101673.5 us
>   do_dentry_open                        1515    85861429 us     56674.21 us     812420.7 us
>   usb_submit_urb                         136    85760172 us     630589.5 us     59532024 us
>   usb_hcd_submit_urb                     134    85756518 us     639974.0 us     726298102 us
>   _dwc2_hcd_urb_enqueue                  134    85738333 us     639838.3 us     874104371 us

The 134 calls to this are taking a ton of time.  It would be nice to
know where all the time actually was in here.  Are you on a
single-core device, or multi-core?

Oh, probably this so slow because we're just getting interrupted
constantly.  You can see that during your trace 80686112 us was in
handle_irq_event().  Presumably all of that time could have been
counted towards whatever code you were in when the interrupt went off.
Of that 80 seconds, 60 seconds or so was usb_hcd_irq, and of that 57
seconds was in _dwc2_hcd_irq().

Presumably the big part of your problem is that dwc2_handle_hcd_intr()
is just too slow.  From glancing at the numbers below, it seems likely
that nothing in particular is slow, it's just running a lot of code
and (in total) all of that is slow.  ...but someone with more time on
their hands would need to really debug.

One thing that would be interesting would be to see if you can
increase the bug clock for talking to the dwc2 controller.  Possibly
some of those memory mapped IOs talking to dwc2 could be making
everything slow?

Other than that, maybe you can find some way to optimize the code in
dwc2 so it runs faster, at least in the cases you care about...



>   chrdev_open                             87    85716519 us     985247.3 us     1951112835 us
>   tty_open                                 3    85714494 us     28571498 us     3743206849 us
>   tty_port_open                            3    85712603 us     28570867 us     1968003468 us
>   serial_open                              1    85712472 us     85712472 us     0.000 us
>   serial_port_activate                     1    85709890 us     85709890 us     0.000 us
>   pl2303_open                              1    85709875 us     85709875 us     0.000 us
>   usb_serial_generic_open                  1    85701170 us     85701170 us     0.000 us
>   usb_serial_generic_submit_read           1    85701166 us     85701166 us     0.000 us
>   usb_serial_generic_submit_read           2    85701160 us     42850580 us     2252410463 us
>   handle_irq_event                   1082482    80686112 us     74.538 us       36417905 us
>   handle_irq_event_percpu            1082482    78398378 us     72.424 us       34060119 us
>   __handle_irq_event_percpu          1082482    68466046 us     63.249 us       53609076 us
>   usb_hcd_irq                        1078945    59761116 us     55.388 us       25139388 us
>   _dwc2_hcd_irq                      1078945    57481606 us     53.275 us       30847036 us
>   dwc2_handle_hcd_intr               1078945    55380910 us     51.328 us       35986237 us
>   do_idle                                394    43577379 us     110602.4 us     84278209 us
>   dwc2_hc_n_intr                      651065    42408491 us     65.137 us       7281229 us
>   default_idle_call                   229969    38021287 us     165.332 us      11386904 us
>   arch_cpu_idle                       229969    37379305 us     162.540 us      2168909 us
>   dwc2_halt_channel                   650811    34208348 us     52.562 us       7792012 us
>   dwc2_release_channel                651065    33042638 us     50.751 us       8112754 us
>   dwc2_hc_nak_intr                    334689    18892951 us     56.449 us       1823592 us
>   dwc2_hc_ack_intr                    315986    18323433 us     57.988 us       151346.6 us
>   dwc2_hcd_select_transactions       1730010    11694955 us     6.760 us        1188442 us
>   dwc2_hcd_queue_transactions         651065    10931854 us     16.790 us       952074.2 us
>   dwc2_assign_and_init_hc             651065    9540885 us      14.654 us       258067.1 us
>   dwc2_queue_transaction              651065    9192559 us      14.119 us       736431.6 us
>   tick_nohz_irq_exit                  325055    9052192 us      27.848 us       916293.3 us
>   __tick_nohz_idle_enter.constpr      324760    8463861 us      26.061 us       2410440 us
>   dwc2_hc_start_transfer              651065    7687376 us      11.807 us       625773.3 us
>   irq_enter                          1082482    6627959 us      6.122 us        11528932 us
>   arm_heavy_mb                      10903913    6603586 us      0.605 us        60562924 us
>   unmask_irq                         1082482    4763699 us      4.400 us        1032638 us
>   tick_irq_enter                      326467    4377088 us      13.407 us       4606805 us
>   dwc2_handle_common_intr            1078945    3799392 us      3.521 us        1131504 us
>   add_interrupt_randomness           1082482    3542044 us      3.272 us        30545511 us
>   get_next_armctrl_hwirq             1443829    3276974 us      2.269 us        2116848 us
>   tick_nohz_restart                    70325    3250462 us      46.220 us       35710.73 us
>   dwc2_hc_cleanup                     651065    3079003 us      4.729 us        81202.50 us
>   dwc2_hcd_qh_deactivate              651065    3039940 us      4.669 us        90928.10 us
>   ledtrig_cpu                         459938    2450327 us      5.327 us        1466705 us
>   arch_cpu_idle_exit                  229969    2443315 us      10.624 us       8606787 us
>   do_page_fault                         5509    2304998 us      418.405 us      17961439 us
>   ktime_get                           795998    2121741 us      2.665 us        166119.5 us
>   get_next_timer_interrupt            324760    1998999 us      6.155 us        32201.63 us
>   handle_mm_fault                       5537    1945500 us      351.363 us      8324917 us
>   dwc2_is_controller_alive           2157890    1473590 us      0.682 us        291399.4 us
>   hrtimer_cancel                       70450    1394330 us      19.791 us       36980.28 us
>   arch_cpu_idle_enter                 229969    1350759 us      5.873 us        10789.94 us
>   hrtimer_try_to_cancel                71029    1266967 us      17.837 us       30260.75 us
>   filemap_map_pages                     1632    1263006 us      773.900 us      207846.4 us
>   tick_program_event                  141799    1260485 us      8.889 us        9343.194 us
>   hrtimer_start_range_ns               97555    1253698 us      12.851 us       43166.94 us
>   irq_to_desc                        1082482    1146955 us      1.059 us        546937.6 us
>   link_path_walk                        3486    1134351 us      325.401 us      484443.0 us
>   __remove_hrtimer                     99205    999724.0 us     10.077 us       201933.9 us
>   clockevents_program_event           141799    985346.0 us     6.948 us        7114.725 us
>   note_interrupt                     1082482    899494.0 us     0.830 us        705141.6 us
>   update_ts_time_stats.constprop      324760    862996.0 us     2.657 us        21813.81 us
>   alloc_set_pte                        20011    805880.0 us     40.271 us       1086748 us
>   SyS_read                              1940    805061.0 us     414.979 us      164198.7 us
>   _local_bh_enable                    326467    777908.0 us     2.382 us        37202.43 us
>   vfs_read                              2080    761487.0 us     366.099 us      22944777 us
>   dwc2_hcd_get_frame_number          1097673    721108.0 us     0.656 us        525348.1 us
>   do_munmap                             1194    717161.0 us     600.637 us      65635.31 us
>   schedule                              3297    700578.0 us     212.489 us      28594.59 us
>   __vfs_read                            2080    694376.0 us     333.834 us      21955675 us
>   walk_component                        9912    683211.0 us     68.927 us       349771.7 us
>   irq_find_mapping                   1082482    673162.0 us     0.621 us        900861.5 us
>   __next_timer_interrupt              324929    667175.0 us     2.053 us        4576.720 us
>   armctrl_translate_shortcut         1079274    662875.0 us     0.614 us        164464.8 us
>   filename_lookup                       1316    646059.0 us     490.926 us      148884.8 us
>   mmput                                  259    644417.0 us     2488.096 us     33155247 us
>   armctrl_unmask_irq                 1082482    639692.0 us     0.590 us        891074.7 us
>   armctrl_mask_irq                   1082482    636505.0 us     0.588 us        888583.1 us
>   exit_mmap                               45    634389.0 us     14097.53 us     24105769 us
>   irq_state_set_masked               1082482    622182.0 us     0.574 us        894212.6 us
>   user_path_at_empty                    1156    620229.0 us     536.530 us      166865.3 us
>   SyS_munmap                             895    600613.0 us     671.075 us      75763.68 us
>   vm_mmap_pgoff                         1437    598955.0 us     416.809 us      55882.41 us
>   read_current_timer                 1082666    598757.0 us     0.553 us        86817.59 us
>   irq_may_run                        1082482    598471.0 us     0.552 us        898308.6 us
>   vm_munmap                              937    595309.0 us     635.335 us      70914.31 us
>   path_lookupat                         1316    592448.0 us     450.188 us      147624.4 us
>   SyS_mmap_pgoff                        1325    588992.0 us     444.522 us      62142.49 us
>   irq_state_clr_masked               1082482    582077.0 us     0.537 us        893572.4 us
>   idle_cpu                           1082482    580428.0 us     0.536 us        905609.5 us
>   do_execve                               28    573864.0 us     20495.14 us     98356111 us
>   do_execveat_common                      28    573576.0 us     20484.85 us     98471441 us
>   dwc2_hc_set_even_odd_frame          651065    570127.0 us     0.875 us        543701.5 us
>   ledtrig_cpu.part.0                  229969    549060.0 us     2.387 us        9142.920 us
>   do_mmap                               1437    548817.0 us     381.918 us      54270.64 us
>   do_work_pending                       2211    525805.0 us     237.813 us      234836.0 us
>   led_trigger_event                   460228    501093.0 us     1.088 us        4759241 us
>   search_binary_handler                   28    501026.0 us     17893.78 us     93445723 us
>   clocksource_mmio_readl_up           807791    500765.0 us     0.619 us        91050.42 us
>   load_elf_binary                         28    500359.0 us     17869.96 us     93309220 us
>   process_one_work                       558    496994.0 us     890.670 us      7196614 us
>   tick_do_update_jiffies64            324826    496168.0 us     1.527 us        3422867 us
>   hrtimer_forward                      72095    490887.0 us     6.808 us        4548.034 us
>   unmap_vmas                            1239    490538.0 us     395.914 us      2723415 us
>   finish_task_switch                    3926    479983.0 us     122.257 us      25336.71 us
>   unmap_single_vma                      3403    474812.0 us     139.527 us      959184.6 us
>   __sync_icache_dcache                 27231    468069.0 us     17.188 us       364740.1 us
>   SyS_execve                              16    462210.0 us     28888.12 us     4450186 us
>   unmap_region                          1194    460297.0 us     385.508 us      44372.58 us
>   unmap_page_range                      3403    455941.0 us     133.982 us      956896.1 us
>   bcm2835_time_interrupt                3208    455584.0 us     142.014 us      56012.72 us
>   mmap_region                           1437    449599.0 us     312.873 us      55175.49 us
>   hrtimer_interrupt                     3208    443908.0 us     138.375 us      55446.73 us
>   dwc2_hcd_qh_add                     632341    436511.0 us     0.690 us        58787.66 us
>   lookup_fast                          11601    436227.0 us     37.602 us       570183.6 us
>   inode_permission                     14699    426528.0 us     29.017 us       404007.8 us
>   SyS_access                             561    412200.0 us     734.759 us      146663.8 us
>   dwc2_hcd_qh_unlink                  632344    410157.0 us     0.648 us        87505.07 us
>   seq_read                               669    408885.0 us     611.188 us      226703.8 us
>   SyS_faccessat                          564    406969.0 us     721.576 us      141955.4 us
>   dwc2_check_qtd_still_ok             651065    392283.0 us     0.602 us        59460.80 us
>   SyS_getdents64                         392    379673.0 us     968.553 us      13678636 us
>   SyS_write                              404    378782.0 us     937.579 us      254399.0 us
>   iterate_dir                            393    367777.0 us     935.819 us      13659529 us
>   vfs_write                              404    358326.0 us     886.945 us      240956.8 us
>   __hrtimer_run_queues.constprop        3227    357720.0 us     110.852 us      47783.12 us
>   task_work_run                         1460    350902.0 us     240.343 us      71379.18 us
>   _cond_resched                       109971    344855.0 us     3.135 us        2245087 us
>   __vfs_write                            404    333260.0 us     824.900 us      230347.3 us
>   ____fput                              1680    326232.0 us     194.185 us      40913.99 us
>   __inode_permission                   14699    324102.0 us     22.049 us       328791.7 us
>   __fput                                1680    316285.0 us     188.264 us      40874.09 us
>   flush_old_exec                          28    313336.0 us     11190.57 us     96843591 us
>   tick_sched_timer                      1964    312793.0 us     159.263 us      2002.533 us
>   do_wp_page                            1758    300693.0 us     171.042 us      17667.03 us
>   _do_fork                                29    281388.0 us     9703.034 us     54207832 us
>   copy_process.part.4                     29    273985.0 us     9447.758 us     53888393 us
>   SyS_clone                               17    267170.0 us     15715.88 us     2444683 us
>   free_pgtables                         1239    265593.0 us     214.360 us      500428.0 us
>   dput                                 15879    265579.0 us     16.725 us       269685.7 us
>   SyS_openat                             205    258191.0 us     1259.468 us     209643.6 us
>   pfn_valid                            18146    257726.0 us     14.202 us       121416.6 us
>   update_wall_time                      5957    250787.0 us     42.099 us       619.230 us
>   mmc_blk_issue_rq                        63    246608.0 us     3914.412 us     8858815 us
>   mmc_blk_issue_rw_rq                     63    241626.0 us     3835.333 us     8884205 us
>   vfs_statx                              441    232198.0 us     526.526 us      150124.0 us
>   SyS_sendmsg                            316    226698.0 us     717.398 us      164208.5 us
>   mmc_start_request                      145    225615.0 us     1555.965 us     5542154 us
>   __sys_sendmsg                          316    223856.0 us     708.405 us      164907.1 us
>   wp_page_copy                          1061    222841.0 us     210.029 us      13010.76 us
>   __generic_file_write_iter              206    219092.0 us     1063.553 us     242948.7 us
>   terminate_walk                        3225    217649.0 us     67.488 us       47336.14 us
>   ext4_file_write_iter                   170    216280.0 us     1272.235 us     127738.3 us
>   __mmc_start_request                    145    214090.0 us     1476.482 us     5509235 us
>   ___sys_sendmsg                         316    212732.0 us     673.202 us      161410.8 us
>   kmem_cache_alloc                     17570    210590.0 us     11.985 us       120917.2 us
>   mmc_start_areq                          63    210528.0 us     3341.714 us     9610382 us
>   sdhci_request                          145    209828.0 us     1447.089 us     5547701 us
>   sock_sendmsg                           353    208523.0 us     590.716 us      147666.9 us
>   __split_vma                            981    206458.0 us     210.456 us      13253.43 us
>   credit_entropy_bits                  16914    205434.0 us     12.145 us       208.830 us
>   hrtimer_get_next_event              324760    199659.0 us     0.614 us        17454.58 us
>   path_put                              6090    196782.0 us     32.312 us       54264.43 us
>   SyS_mprotect                           427    193190.0 us     452.435 us      42168.10 us
>   SyS_stat64                             370    181587.0 us     490.775 us      119707.4 us
>   nr_iowait_cpu                       324760    176767.0 us     0.544 us        17782.33 us
>   __vma_adjust                          1452    174253.0 us     120.008 us      5746.323 us
>   tick_sched_handle                     1964    174004.0 us     88.596 us       419.232 us
>   sdhci_irq                              211    173984.0 us     824.568 us      4212897 us
>   touch_softlockup_watchdog_sche      323207    173767.0 us     0.537 us        108651.0 us
>   kernfs_iop_permission                 5012    173442.0 us     34.605 us       21830.09 us
>   __local_bh_enable                   329502    172898.0 us     0.524 us        9002.660 us
>   proc_single_show                       233    170938.0 us     733.639 us      235941.1 us
>   mprotect_fixup                         455    163622.0 us     359.608 us      43581.82 us
>   SyS_recvmsg                            515    163337.0 us     317.159 us      23749.85 us
>   update_process_times                  1964    161227.0 us     82.091 us       436.558 us
>   usb_control_msg                        131    160306.0 us     1223.709 us     63779.20 us
>   __sys_recvmsg                          515    159685.0 us     310.067 us      23888.88 us
>   check_carrier                           29    159209.0 us     5489.965 us     157176.1 us
>   __dentry_kill                         1306    159044.0 us     121.779 us      13522.69 us
>   smsc95xx_mdio_read                      29    156605.0 us     5400.172 us     147578.2 us
>   __smsc95xx_mdio_read                    29    156128.0 us     5383.724 us     153872.9 us
>   ext4_readdir                            60    152036.0 us     2533.933 us     67111799 us
>   run_ksoftirqd                          844    151846.0 us     179.912 us      74010.42 us
>   usb_start_wait_urb                     131    151181.0 us     1154.053 us     59721.86 us
>   find_vma                              8806    149418.0 us     16.967 us       419885.5 us
>   __alloc_pages_nodemask                4365    143647.0 us     32.908 us       22236.43 us
>   ___sys_recvmsg                         515    143567.0 us     278.770 us      23183.05 us
>   memblock_is_map_memory               18146    142969.0 us     7.878 us        62191.29 us
>   timekeeping_max_deferment           252165    141177.0 us     0.559 us        3782.924 us
>   unlink_anon_vmas                      3403    137080.0 us     40.282 us       17090.37 us
>   __vma_link_rb                         3403    136104.0 us     39.995 us       12357.79 us
>   rcu_process_callbacks                 1050    135701.0 us     129.239 us      66292.86 us
>   SyS_close                             1977    135041.0 us     68.306 us       4406.697 us
>   pick_next_task_fair                   3996    133622.0 us     33.438 us       1391.499 us
>   proc_readfd                             50    133185.0 us     2663.700 us     18788258 us
>   proc_readfd_common                      50    132911.0 us     2658.220 us     18769391 us
>   generic_perform_write                  206    132888.0 us     645.087 us      134152.0 us
>   __mix_pool_bytes                     16914    128343.0 us     7.587 us        303.814 us
>   run_timer_softirq                     1960    127980.0 us     65.295 us       13337.43 us
>   timekeeping_update                    5956    127128.0 us     21.344 us       147.035 us
>   try_to_wake_up                        2348    125639.0 us     53.508 us       5318.431 us
>   ktime_add_safe                      241886    125624.0 us     0.519 us        6352.310 us
>   __rcu_process_callbacks               2100    125522.0 us     59.772 us       1422883 us
>   __wake_up                            19318    121496.0 us     6.289 us        54037.07 us
>   tick_nohz_idle_enter                   394    119544.0 us     303.411 us      24204.73 us
>   proc_fill_cache                        402    119333.0 us     296.848 us      32614.19 us
>   unix_dgram_sendmsg                     144    119088.0 us     827.000 us      197140.4 us
>   enqueue_hrtimer                      99205    118756.0 us     1.197 us        1634.105 us
>   call_usermodehelper_exec_async          12    117852.0 us     9821.000 us     319067.4 us
>   mutex_lock                           14089    117414.0 us     8.333 us        245056.5 us
>   sock_recvmsg                           548    116787.0 us     213.114 us      20249.99 us
>   __smsc95xx_read_reg                     87    116418.0 us     1338.137 us     39000.81 us
>   usbnet_read_cmd                         87    115523.0 us     1327.850 us     37075.87 us
>   SyS_epoll_wait                         336    114271.0 us     340.092 us      47629.77 us
>   __close_fd                            1977    113569.0 us     57.445 us       3969.846 us
>   split_vma                              501    112236.0 us     224.023 us      20703.73 us
>   vma_link                              1248    111097.0 us     89.020 us       5058.706 us
>   __usbnet_read_cmd                       87    111024.0 us     1276.137 us     34288.14 us
>   schedule_hrtimeout_range               390    107480.0 us     275.589 us      10798.57 us
>   dentry_unlink_inode                    733    107440.0 us     146.575 us      8915.389 us
>   unix_find_other                        172    106799.0 us     620.924 us      79499.41 us
>   iput                                   757    106445.0 us     140.614 us      8828.710 us
>   ext4_htree_fill_tree                    36    106438.0 us     2956.611 us     10423331 us
>   kernfs_fop_read                        198    105215.0 us     531.388 us      48887.87 us
>   htree_dirblock_to_tree                  36    104352.0 us     2898.666 us     10008360 us
>   schedule_hrtimeout_range_clock         390    103931.0 us     266.489 us      10595.17 us
>   SyS_connect                             99    102476.0 us     1035.111 us     160650.8 us
>   __mark_inode_dirty                     641    102069.0 us     159.234 us      35047.69 us
>   expire_timers                          700    100145.0 us     143.064 us      8519.548 us
>   do_wait                                 42    99498.00 us     2369.000 us     11701021 us
>   lookup_slow                            413    99137.00 us     240.041 us      40034.02 us
>   release_task                            29    98995.00 us     3413.620 us     12343735 us
>   SyS_fstat64                           1168    98993.00 us     84.754 us       5326.254 us
>   wait_consider_task                     444    98839.00 us     222.610 us      1532328 us
>   kmem_cache_free                      17783    98715.00 us     5.551 us        72804.30 us
>
> ...

^ permalink raw reply	[flat|nested] 30+ messages in thread

* usb: dwc2: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 146s
  2017-04-19 21:47             ` Doug Anderson
@ 2017-04-20  7:46               ` Stefan Wahren
  2017-04-20 16:19                 ` Doug Anderson
  0 siblings, 1 reply; 30+ messages in thread
From: Stefan Wahren @ 2017-04-20  7:46 UTC (permalink / raw)
  To: linux-arm-kernel

Am 19.04.2017 um 23:47 schrieb Doug Anderson:
> Hi,
>
> On Wed, Apr 19, 2017 at 1:25 PM, Stefan Wahren <stefan.wahren@i2se.com> wrote:
>> Hi,
>>
>>> Doug Anderson <dianders@chromium.org> hat am 18. April 2017 um 22:41 geschrieben:
>>>
>>>
>>> It's hard to know for sure that all of this time is really in
>>> urb_enqueue().  Possible we could have task switched out and been
>>> blocked elsewhere.  Using ftrace to get more fine-grained timings
>>> would be useful.  ktime_get(), ktime_sub(), and ktime_to_us() are your
>>> friends here if you want to use trace_printk.
>> i'm a newbie to ftrace, so i hope this would be helpful.
>>
>> # connect PL2303 to the onboard hub
>> # echo 0 > options/sleep-time
>> # echo 0 > function_profile_enabled
>> # echo 1 > function_profile_enabled
>> # ./usb_test
>> # Waiting for at least 20 seconds and then disconnect PL2303
>> # echo 0 > function_profile_enabled
>> # cat trace_stat/function0
>>
>>   Function                               Hit    Time            Avg             s^2
>>   --------                               ---    ----            ---             ---
>>   bcm2835_handle_irq                  361347    219567633 us     607.636 us      1485199 us
>>   __handle_domain_irq                1082482    212639551 us     196.437 us      3642030 us
>>   generic_handle_irq                 1082482    100592051 us     92.927 us       50511334 us
>>   irq_exit                           1082482    98197771 us     90.715 us       29649040 us
>>   handle_level_irq                   1082482    95812379 us     88.511 us       51910093 us
>>   do_sys_open                           1806    87612983 us     48512.17 us     2198507 us
>>   SyS_open                              1601    87372331 us     54573.59 us     1898996 us
>>   do_filp_open                          1862    87368058 us     46921.62 us     1634982 us
>>   path_openat                           1862    87314553 us     46892.88 us     3357817 us
>>   __do_softirq                          3035    86266050 us     28423.73 us     6449768 us
>>   vfs_open                              1515    85877012 us     56684.49 us     101673.5 us
>>   do_dentry_open                        1515    85861429 us     56674.21 us     812420.7 us
>>   usb_submit_urb                         136    85760172 us     630589.5 us     59532024 us
>>   usb_hcd_submit_urb                     134    85756518 us     639974.0 us     726298102 us
>>   _dwc2_hcd_urb_enqueue                  134    85738333 us     639838.3 us     874104371 us
> The 134 calls to this are taking a ton of time.  It would be nice to
> know where all the time actually was in here.  Are you on a
> single-core device, or multi-core?

Raspberry Pi B is a single core device which runs with 700 MHz cpu freq.

> Oh, probably this so slow because we're just getting interrupted
> constantly.  You can see that during your trace 80686112 us was in
> handle_irq_event().  Presumably all of that time could have been
> counted towards whatever code you were in when the interrupt went off.
> Of that 80 seconds, 60 seconds or so was usb_hcd_irq, and of that 57
> seconds was in _dwc2_hcd_irq().

Since the FTDI (full speed device) and the PL2303 (also full speed
device) behaves differently, i will try to compare those ftraces.

>
> Presumably the big part of your problem is that dwc2_handle_hcd_intr()
> is just too slow.  From glancing at the numbers below, it seems likely
> that nothing in particular is slow, it's just running a lot of code
> and (in total) all of that is slow.  ...but someone with more time on
> their hands would need to really debug.
>
> One thing that would be interesting would be to see if you can
> increase the bug clock for talking to the dwc2 controller.  

Unfortunately the bcm2835 datasheet [1] doesn't provide many information
about the USB IP core and the Synopsys documents aren't public
available. The Raspberry Pi Foundation decided to use a different driver
which uses FIQ in the downstream tree [2], but this won't be an option here.

Maybe i could start with comparing the register settings between dwc2
and dwc_otg.

[1] -
https://www.raspberrypi.org/app/uploads/2012/02/BCM2835-ARM-Peripherals.pdf
[2] -
https://github.com/raspberrypi/linux/commit/65165df02f2ba9feccebf45d8c732dffb34b6109

> Possibly
> some of those memory mapped IOs talking to dwc2 could be making
> everything slow?
>
> Other than that, maybe you can find some way to optimize the code in
> dwc2 so it runs faster, at least in the cases you care about...
>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* usb: dwc2: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 146s
  2017-04-20  7:46               ` Stefan Wahren
@ 2017-04-20 16:19                 ` Doug Anderson
  0 siblings, 0 replies; 30+ messages in thread
From: Doug Anderson @ 2017-04-20 16:19 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,

On Thu, Apr 20, 2017 at 12:46 AM, Stefan Wahren <stefan.wahren@i2se.com> wrote:
> Am 19.04.2017 um 23:47 schrieb Doug Anderson:
>> Hi,
>>
>> On Wed, Apr 19, 2017 at 1:25 PM, Stefan Wahren <stefan.wahren@i2se.com> wrote:
>>> Hi,
>>>
>>>> Doug Anderson <dianders@chromium.org> hat am 18. April 2017 um 22:41 geschrieben:
>>>>
>>>>
>>>> It's hard to know for sure that all of this time is really in
>>>> urb_enqueue().  Possible we could have task switched out and been
>>>> blocked elsewhere.  Using ftrace to get more fine-grained timings
>>>> would be useful.  ktime_get(), ktime_sub(), and ktime_to_us() are your
>>>> friends here if you want to use trace_printk.
>>> i'm a newbie to ftrace, so i hope this would be helpful.
>>>
>>> # connect PL2303 to the onboard hub
>>> # echo 0 > options/sleep-time
>>> # echo 0 > function_profile_enabled
>>> # echo 1 > function_profile_enabled
>>> # ./usb_test
>>> # Waiting for at least 20 seconds and then disconnect PL2303
>>> # echo 0 > function_profile_enabled
>>> # cat trace_stat/function0
>>>
>>>   Function                               Hit    Time            Avg             s^2
>>>   --------                               ---    ----            ---             ---
>>>   bcm2835_handle_irq                  361347    219567633 us     607.636 us      1485199 us
>>>   __handle_domain_irq                1082482    212639551 us     196.437 us      3642030 us
>>>   generic_handle_irq                 1082482    100592051 us     92.927 us       50511334 us
>>>   irq_exit                           1082482    98197771 us     90.715 us       29649040 us
>>>   handle_level_irq                   1082482    95812379 us     88.511 us       51910093 us
>>>   do_sys_open                           1806    87612983 us     48512.17 us     2198507 us
>>>   SyS_open                              1601    87372331 us     54573.59 us     1898996 us
>>>   do_filp_open                          1862    87368058 us     46921.62 us     1634982 us
>>>   path_openat                           1862    87314553 us     46892.88 us     3357817 us
>>>   __do_softirq                          3035    86266050 us     28423.73 us     6449768 us
>>>   vfs_open                              1515    85877012 us     56684.49 us     101673.5 us
>>>   do_dentry_open                        1515    85861429 us     56674.21 us     812420.7 us
>>>   usb_submit_urb                         136    85760172 us     630589.5 us     59532024 us
>>>   usb_hcd_submit_urb                     134    85756518 us     639974.0 us     726298102 us
>>>   _dwc2_hcd_urb_enqueue                  134    85738333 us     639838.3 us     874104371 us
>> The 134 calls to this are taking a ton of time.  It would be nice to
>> know where all the time actually was in here.  Are you on a
>> single-core device, or multi-core?
>
> Raspberry Pi B is a single core device which runs with 700 MHz cpu freq.
>
>> Oh, probably this so slow because we're just getting interrupted
>> constantly.  You can see that during your trace 80686112 us was in
>> handle_irq_event().  Presumably all of that time could have been
>> counted towards whatever code you were in when the interrupt went off.
>> Of that 80 seconds, 60 seconds or so was usb_hcd_irq, and of that 57
>> seconds was in _dwc2_hcd_irq().
>
> Since the FTDI (full speed device) and the PL2303 (also full speed
> device) behaves differently, i will try to compare those ftraces.

The most interesting to know is what the endpoints looked like for
these two devices.  Since they are full speed devices, I'd expect that
any interrupt end points should fire once per "ms" at most.


>> Presumably the big part of your problem is that dwc2_handle_hcd_intr()
>> is just too slow.  From glancing at the numbers below, it seems likely
>> that nothing in particular is slow, it's just running a lot of code
>> and (in total) all of that is slow.  ...but someone with more time on
>> their hands would need to really debug.
>>
>> One thing that would be interesting would be to see if you can
>> increase the bug clock for talking to the dwc2 controller.
>
> Unfortunately the bcm2835 datasheet [1] doesn't provide many information
> about the USB IP core and the Synopsys documents aren't public
> available. The Raspberry Pi Foundation decided to use a different driver
> which uses FIQ in the downstream tree [2], but this won't be an option here.
>
> Maybe i could start with comparing the register settings between dwc2
> and dwc_otg.

I was suggesting looking at clock speeds related to usb that might be
in /sys/kernel/debug/clk/clk_summary

...but at 700 MHz single core (and I think it's also an older ARM core
so 700 MHz there might be a lot slower than 700 MHz on newer cores),
the theory that you're just running too many instructions is a sane
one.

Probably someone just needs to see if there's any way to fast path
some of the common cases in the IRQ handler.

-Doug

^ permalink raw reply	[flat|nested] 30+ messages in thread

* usb: dwc2: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 146s
  2017-04-19 20:25           ` Stefan Wahren
  2017-04-19 21:47             ` Doug Anderson
@ 2017-04-20 18:54             ` Eric Anholt
  2017-04-20 19:45               ` Doug Anderson
  2017-04-22 20:50               ` Stefan Wahren
  1 sibling, 2 replies; 30+ messages in thread
From: Eric Anholt @ 2017-04-20 18:54 UTC (permalink / raw)
  To: linux-arm-kernel

Stefan Wahren <stefan.wahren@i2se.com> writes:

> Hi,
>
>> Doug Anderson <dianders@chromium.org> hat am 18. April 2017 um 22:41 geschrieben:
>> 
>> 
>> It's hard to know for sure that all of this time is really in
>> urb_enqueue().  Possible we could have task switched out and been
>> blocked elsewhere.  Using ftrace to get more fine-grained timings
>> would be useful.  ktime_get(), ktime_sub(), and ktime_to_us() are your
>> friends here if you want to use trace_printk.
>
> i'm a newbie to ftrace, so i hope this would be helpful.
>
> # connect PL2303 to the onboard hub
> # echo 0 > options/sleep-time
> # echo 0 > function_profile_enabled
> # echo 1 > function_profile_enabled
> # ./usb_test
> # Waiting for at least 20 seconds and then disconnect PL2303
> # echo 0 > function_profile_enabled
> # cat trace_stat/function0
>
>   Function                               Hit    Time            Avg             s^2
>   --------                               ---    ----            ---             ---
>   bcm2835_handle_irq                  361347    219567633 us     607.636 us      1485199 us  
>   __handle_domain_irq                1082482    212639551 us     196.437 us      3642030 us  
>   generic_handle_irq                 1082482    100592051 us     92.927 us       50511334 us 
>   irq_exit                           1082482    98197771 us     90.715 us       29649040 us 
>   handle_level_irq                   1082482    95812379 us     88.511 us       51910093 us 

If I'm reading this output right, we're spending half of our interrupt
processing time in irq_exit(), so even if dwc2's interrupt was free (the
generic_handle_irq() chain), we'd be eating about half the CPU getting
back out of the interrupt handler, right?

I don't really know anything about DWC2 or USB, but is there any way we
could mitigate the interrupt frequency with this hardware?  If nothing
else, could we loop reading gintsts until it reads back 0?
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 832 bytes
Desc: not available
URL: <http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20170420/bbabadb5/attachment.sig>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* usb: dwc2: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 146s
  2017-04-20 18:54             ` Eric Anholt
@ 2017-04-20 19:45               ` Doug Anderson
  2017-04-20 19:57                 ` Eric Anholt
  2017-04-22 20:50               ` Stefan Wahren
  1 sibling, 1 reply; 30+ messages in thread
From: Doug Anderson @ 2017-04-20 19:45 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,

On Thu, Apr 20, 2017 at 11:54 AM, Eric Anholt <eric@anholt.net> wrote:
> Stefan Wahren <stefan.wahren@i2se.com> writes:
>
>> Hi,
>>
>>> Doug Anderson <dianders@chromium.org> hat am 18. April 2017 um 22:41 geschrieben:
>>>
>>>
>>> It's hard to know for sure that all of this time is really in
>>> urb_enqueue().  Possible we could have task switched out and been
>>> blocked elsewhere.  Using ftrace to get more fine-grained timings
>>> would be useful.  ktime_get(), ktime_sub(), and ktime_to_us() are your
>>> friends here if you want to use trace_printk.
>>
>> i'm a newbie to ftrace, so i hope this would be helpful.
>>
>> # connect PL2303 to the onboard hub
>> # echo 0 > options/sleep-time
>> # echo 0 > function_profile_enabled
>> # echo 1 > function_profile_enabled
>> # ./usb_test
>> # Waiting for at least 20 seconds and then disconnect PL2303
>> # echo 0 > function_profile_enabled
>> # cat trace_stat/function0
>>
>>   Function                               Hit    Time            Avg             s^2
>>   --------                               ---    ----            ---             ---
>>   bcm2835_handle_irq                  361347    219567633 us     607.636 us      1485199 us
>>   __handle_domain_irq                1082482    212639551 us     196.437 us      3642030 us
>>   generic_handle_irq                 1082482    100592051 us     92.927 us       50511334 us
>>   irq_exit                           1082482    98197771 us     90.715 us       29649040 us
>>   handle_level_irq                   1082482    95812379 us     88.511 us       51910093 us
>
> If I'm reading this output right, we're spending half of our interrupt
> processing time in irq_exit(), so even if dwc2's interrupt was free (the
> generic_handle_irq() chain), we'd be eating about half the CPU getting
> back out of the interrupt handler, right?
>
> I don't really know anything about DWC2 or USB, but is there any way we
> could mitigate the interrupt frequency with this hardware?  If nothing
> else, could we loop reading gintsts until it reads back 0?

Take ftrace with a little bit of a grain of salt, especially on older
/ slower ARMs (without the arch timer).  Whenever ftrace takes a log
it grabs a timestamp.  This can be an expensive (ish) operation.  Even
on newer CPUs it's still not free if you call it as much as ftrace,
but on older CPUs it's extra expensive.

I spent a chunk of time working on optimizations for that on exynos
since it showed up in profiles as an expensive operation (Chrome asks
for the time a lot during its internal profiling).  Some of that type
of data is in commit 3252a646aa2c ("clocksource: exynos_mct: Only use
32-bits where possible").

^ permalink raw reply	[flat|nested] 30+ messages in thread

* usb: dwc2: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 146s
  2017-04-20 19:45               ` Doug Anderson
@ 2017-04-20 19:57                 ` Eric Anholt
  2017-04-20 20:37                   ` Doug Anderson
  0 siblings, 1 reply; 30+ messages in thread
From: Eric Anholt @ 2017-04-20 19:57 UTC (permalink / raw)
  To: linux-arm-kernel

Doug Anderson <dianders@chromium.org> writes:

> Hi,
>
> On Thu, Apr 20, 2017 at 11:54 AM, Eric Anholt <eric@anholt.net> wrote:
>> Stefan Wahren <stefan.wahren@i2se.com> writes:
>>
>>> Hi,
>>>
>>>> Doug Anderson <dianders@chromium.org> hat am 18. April 2017 um 22:41 geschrieben:
>>>>
>>>>
>>>> It's hard to know for sure that all of this time is really in
>>>> urb_enqueue().  Possible we could have task switched out and been
>>>> blocked elsewhere.  Using ftrace to get more fine-grained timings
>>>> would be useful.  ktime_get(), ktime_sub(), and ktime_to_us() are your
>>>> friends here if you want to use trace_printk.
>>>
>>> i'm a newbie to ftrace, so i hope this would be helpful.
>>>
>>> # connect PL2303 to the onboard hub
>>> # echo 0 > options/sleep-time
>>> # echo 0 > function_profile_enabled
>>> # echo 1 > function_profile_enabled
>>> # ./usb_test
>>> # Waiting for at least 20 seconds and then disconnect PL2303
>>> # echo 0 > function_profile_enabled
>>> # cat trace_stat/function0
>>>
>>>   Function                               Hit    Time            Avg             s^2
>>>   --------                               ---    ----            ---             ---
>>>   bcm2835_handle_irq                  361347    219567633 us     607.636 us      1485199 us
>>>   __handle_domain_irq                1082482    212639551 us     196.437 us      3642030 us
>>>   generic_handle_irq                 1082482    100592051 us     92.927 us       50511334 us
>>>   irq_exit                           1082482    98197771 us     90.715 us       29649040 us
>>>   handle_level_irq                   1082482    95812379 us     88.511 us       51910093 us
>>
>> If I'm reading this output right, we're spending half of our interrupt
>> processing time in irq_exit(), so even if dwc2's interrupt was free (the
>> generic_handle_irq() chain), we'd be eating about half the CPU getting
>> back out of the interrupt handler, right?
>>
>> I don't really know anything about DWC2 or USB, but is there any way we
>> could mitigate the interrupt frequency with this hardware?  If nothing
>> else, could we loop reading gintsts until it reads back 0?
>
> Take ftrace with a little bit of a grain of salt, especially on older
> / slower ARMs (without the arch timer).  Whenever ftrace takes a log
> it grabs a timestamp.  This can be an expensive (ish) operation.  Even
> on newer CPUs it's still not free if you call it as much as ftrace,
> but on older CPUs it's extra expensive.

If per-function timestamp cost was the problem, shouldn't I expect to
see a bunch of irq_exit()'s children each taking a bit of time?  We have
a long callchain with the functions each taking a bit of time in the
dwc2 interrupt handler, but irq_exit() seems to be a monolithic cost.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 832 bytes
Desc: not available
URL: <http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20170420/7d2ce091/attachment.sig>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* usb: dwc2: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 146s
  2017-04-20 19:57                 ` Eric Anholt
@ 2017-04-20 20:37                   ` Doug Anderson
  0 siblings, 0 replies; 30+ messages in thread
From: Doug Anderson @ 2017-04-20 20:37 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,

On Thu, Apr 20, 2017 at 12:57 PM, Eric Anholt <eric@anholt.net> wrote:
> Doug Anderson <dianders@chromium.org> writes:
>
>> Hi,
>>
>> On Thu, Apr 20, 2017 at 11:54 AM, Eric Anholt <eric@anholt.net> wrote:
>>> Stefan Wahren <stefan.wahren@i2se.com> writes:
>>>
>>>> Hi,
>>>>
>>>>> Doug Anderson <dianders@chromium.org> hat am 18. April 2017 um 22:41 geschrieben:
>>>>>
>>>>>
>>>>> It's hard to know for sure that all of this time is really in
>>>>> urb_enqueue().  Possible we could have task switched out and been
>>>>> blocked elsewhere.  Using ftrace to get more fine-grained timings
>>>>> would be useful.  ktime_get(), ktime_sub(), and ktime_to_us() are your
>>>>> friends here if you want to use trace_printk.
>>>>
>>>> i'm a newbie to ftrace, so i hope this would be helpful.
>>>>
>>>> # connect PL2303 to the onboard hub
>>>> # echo 0 > options/sleep-time
>>>> # echo 0 > function_profile_enabled
>>>> # echo 1 > function_profile_enabled
>>>> # ./usb_test
>>>> # Waiting for at least 20 seconds and then disconnect PL2303
>>>> # echo 0 > function_profile_enabled
>>>> # cat trace_stat/function0
>>>>
>>>>   Function                               Hit    Time            Avg             s^2
>>>>   --------                               ---    ----            ---             ---
>>>>   bcm2835_handle_irq                  361347    219567633 us     607.636 us      1485199 us
>>>>   __handle_domain_irq                1082482    212639551 us     196.437 us      3642030 us
>>>>   generic_handle_irq                 1082482    100592051 us     92.927 us       50511334 us
>>>>   irq_exit                           1082482    98197771 us     90.715 us       29649040 us
>>>>   handle_level_irq                   1082482    95812379 us     88.511 us       51910093 us
>>>
>>> If I'm reading this output right, we're spending half of our interrupt
>>> processing time in irq_exit(), so even if dwc2's interrupt was free (the
>>> generic_handle_irq() chain), we'd be eating about half the CPU getting
>>> back out of the interrupt handler, right?
>>>
>>> I don't really know anything about DWC2 or USB, but is there any way we
>>> could mitigate the interrupt frequency with this hardware?  If nothing
>>> else, could we loop reading gintsts until it reads back 0?
>>
>> Take ftrace with a little bit of a grain of salt, especially on older
>> / slower ARMs (without the arch timer).  Whenever ftrace takes a log
>> it grabs a timestamp.  This can be an expensive (ish) operation.  Even
>> on newer CPUs it's still not free if you call it as much as ftrace,
>> but on older CPUs it's extra expensive.
>
> If per-function timestamp cost was the problem, shouldn't I expect to
> see a bunch of irq_exit()'s children each taking a bit of time?  We have
> a long callchain with the functions each taking a bit of time in the
> dwc2 interrupt handler, but irq_exit() seems to be a monolithic cost.

Maybe.  I remember some of the timestamp code being a might bit odd.
Specifically there might have been cases where the timestamp code
predictable slower in some cases, but that slowness would be blamed on
the wrong function.  I think this might have to do with some of the
the fact that a memory mapped read could block until other outstanding
memory mapped operations finished.

-Doug

^ permalink raw reply	[flat|nested] 30+ messages in thread

* usb: dwc2: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 146s
  2017-04-20 18:54             ` Eric Anholt
  2017-04-20 19:45               ` Doug Anderson
@ 2017-04-22 20:50               ` Stefan Wahren
  2017-04-25 18:11                 ` Stefan Wahren
  1 sibling, 1 reply; 30+ messages in thread
From: Stefan Wahren @ 2017-04-22 20:50 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,

> Eric Anholt <eric@anholt.net> hat am 20. April 2017 um 20:54 geschrieben:
> 
> 
> Stefan Wahren <stefan.wahren@i2se.com> writes:
> 
> > Hi,
> >
> >> Doug Anderson <dianders@chromium.org> hat am 18. April 2017 um 22:41 geschrieben:
> >> 
> >> 
> >> It's hard to know for sure that all of this time is really in
> >> urb_enqueue().  Possible we could have task switched out and been
> >> blocked elsewhere.  Using ftrace to get more fine-grained timings
> >> would be useful.  ktime_get(), ktime_sub(), and ktime_to_us() are your
> >> friends here if you want to use trace_printk.
> >
> > i'm a newbie to ftrace, so i hope this would be helpful.
> >
> > # connect PL2303 to the onboard hub
> > # echo 0 > options/sleep-time
> > # echo 0 > function_profile_enabled
> > # echo 1 > function_profile_enabled
> > # ./usb_test
> > # Waiting for at least 20 seconds and then disconnect PL2303
> > # echo 0 > function_profile_enabled
> > # cat trace_stat/function0
> >
> >   Function                               Hit    Time            Avg             s^2
> >   --------                               ---    ----            ---             ---
> >   bcm2835_handle_irq                  361347    219567633 us     607.636 us      1485199 us  
> >   __handle_domain_irq                1082482    212639551 us     196.437 us      3642030 us  
> >   generic_handle_irq                 1082482    100592051 us     92.927 us       50511334 us 
> >   irq_exit                           1082482    98197771 us     90.715 us       29649040 us 
> >   handle_level_irq                   1082482    95812379 us     88.511 us       51910093 us 
> 
> If I'm reading this output right, we're spending half of our interrupt
> processing time in irq_exit(), so even if dwc2's interrupt was free (the
> generic_handle_irq() chain), we'd be eating about half the CPU getting
> back out of the interrupt handler, right?
> 
> I don't really know anything about DWC2 or USB, but is there any way we
> could mitigate the interrupt frequency with this hardware?  If nothing
> else, could we loop reading gintsts until it reads back 0?

first of all i updated my kernel to 4.11rc7 and the issue still occures. Today i had some time to investigate this issue further and i made some interesting observations:

1. The lockup doesn't occure always after starting usb_test. In rare cases i was able to run the program without lockup.
2. In case the lockup occured we are always able to "rescue" the BCM2835 from this state by sending some serial data to the PL2303.
3. I looked again at the logic analyzer traces of the PL2303 and FTDI. After usb_test has been started the time between 2 _dwc2_hcd_irq calls is mostly smaller than 6 us (this can't be correct for a single device doesn't send any data). With other words it seems to me one or more interrupt(s) are never acked.
4. I placed a trace_printk into dwc2_handle_hcd_intr() in order to dump GINTSTS. In normal state we usually reach the value 0x04000029, but not in lockup state.

Regards
Stefan

^ permalink raw reply	[flat|nested] 30+ messages in thread

* usb: dwc2: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 146s
  2017-04-22 20:50               ` Stefan Wahren
@ 2017-04-25 18:11                 ` Stefan Wahren
  2017-05-08 20:22                   ` Stefan Wahren
  0 siblings, 1 reply; 30+ messages in thread
From: Stefan Wahren @ 2017-04-25 18:11 UTC (permalink / raw)
  To: linux-arm-kernel

> Stefan Wahren <stefan.wahren@i2se.com> hat am 22. April 2017 um 22:50 geschrieben:
> 
> 
> Hi,
> 
> > Eric Anholt <eric@anholt.net> hat am 20. April 2017 um 20:54 geschrieben:
> > 
> > 
> > Stefan Wahren <stefan.wahren@i2se.com> writes:
> > 
> > > Hi,
> > >
> > >> Doug Anderson <dianders@chromium.org> hat am 18. April 2017 um 22:41 geschrieben:
> > >> 
> > >> 
> > >> It's hard to know for sure that all of this time is really in
> > >> urb_enqueue().  Possible we could have task switched out and been
> > >> blocked elsewhere.  Using ftrace to get more fine-grained timings
> > >> would be useful.  ktime_get(), ktime_sub(), and ktime_to_us() are your
> > >> friends here if you want to use trace_printk.
> > >
> > > i'm a newbie to ftrace, so i hope this would be helpful.
> > >
> > > # connect PL2303 to the onboard hub
> > > # echo 0 > options/sleep-time
> > > # echo 0 > function_profile_enabled
> > > # echo 1 > function_profile_enabled
> > > # ./usb_test
> > > # Waiting for at least 20 seconds and then disconnect PL2303
> > > # echo 0 > function_profile_enabled
> > > # cat trace_stat/function0
> > >
> > >   Function                               Hit    Time            Avg             s^2
> > >   --------                               ---    ----            ---             ---
> > >   bcm2835_handle_irq                  361347    219567633 us     607.636 us      1485199 us  
> > >   __handle_domain_irq                1082482    212639551 us     196.437 us      3642030 us  
> > >   generic_handle_irq                 1082482    100592051 us     92.927 us       50511334 us 
> > >   irq_exit                           1082482    98197771 us     90.715 us       29649040 us 
> > >   handle_level_irq                   1082482    95812379 us     88.511 us       51910093 us 
> > 
> > If I'm reading this output right, we're spending half of our interrupt
> > processing time in irq_exit(), so even if dwc2's interrupt was free (the
> > generic_handle_irq() chain), we'd be eating about half the CPU getting
> > back out of the interrupt handler, right?
> > 
> > I don't really know anything about DWC2 or USB, but is there any way we
> > could mitigate the interrupt frequency with this hardware?  If nothing
> > else, could we loop reading gintsts until it reads back 0?
> 
> first of all i updated my kernel to 4.11rc7 and the issue still occures. Today i had some time to investigate this issue further and i made some interesting observations:
> 
> 1. The lockup doesn't occure always after starting usb_test. In rare cases i was able to run the program without lockup.
> 2. In case the lockup occured we are always able to "rescue" the BCM2835 from this state by sending some serial data to the PL2303.

Based on this scenario i patched the interrupt routine to detect the interrupt storm and normal condition. So i can dump the global and host registers in both situations (bad and goodcase).

Here is the diff between both register dumps, maybe someone see something interesting:

--- badcase.txt	2017-04-25 18:02:59.000000000 +0000
+++ goodcase.txt	2017-04-25 18:02:59.000000000 +0000
@@ -4,12 +4,12 @@
  dwc2_dump_global_registers: GAHBCFG	 @0xCC850008 : 0x00000031
  dwc2_dump_global_registers: GUSBCFG	 @0xCC85000C : 0x20001700
  dwc2_dump_global_registers: GRSTCTL	 @0xCC850010 : 0x80000000
- dwc2_dump_global_registers: GINTSTS	 @0xCC850014 : 0x06000021
+ dwc2_dump_global_registers: GINTSTS	 @0xCC850014 : 0x06000029
  dwc2_dump_global_registers: GINTMSK	 @0xCC850018 : 0xF300080E
- dwc2_dump_global_registers: GRXSTSR	 @0xCC85001C : 0x00070044
+ dwc2_dump_global_registers: GRXSTSR	 @0xCC85001C : 0x000F0001
  dwc2_dump_global_registers: GRXFSIZ	 @0xCC850024 : 0x00000306
  dwc2_dump_global_registers: GNPTXFSIZ	 @0xCC850028 : 0x01000306
- dwc2_dump_global_registers: GNPTXSTS	 @0xCC85002C : 0x1D080100
+ dwc2_dump_global_registers: GNPTXSTS	 @0xCC85002C : 0x18080100
  dwc2_dump_global_registers: GI2CCTL	 @0xCC850030 : 0x00000000
  dwc2_dump_global_registers: GPVNDCTL	 @0xCC850034 : 0x00000000
  dwc2_dump_global_registers: GGPIO	 @0xCC850038 : 0x00000000
@@ -27,64 +27,64 @@
  dwc2_dump_host_registers: Host Global Registers
  dwc2_dump_host_registers: HCFG	 @0xCC850400 : 0x00000000
  dwc2_dump_host_registers: HFIR	 @0xCC850404 : 0x00001D4B
- dwc2_dump_host_registers: HFNUM	 @0xCC850408 : 0x0BF03CA1
- dwc2_dump_host_registers: HPTXSTS	 @0xCC850410 : 0x29080200
- dwc2_dump_host_registers: HAINT	 @0xCC850414 : 0x00000002
- dwc2_dump_host_registers: HAINTMSK	 @0xCC850418 : 0x00000002
+ dwc2_dump_host_registers: HFNUM	 @0xCC850408 : 0x029F3421
+ dwc2_dump_host_registers: HPTXSTS	 @0xCC850410 : 0x25080200
+ dwc2_dump_host_registers: HAINT	 @0xCC850414 : 0x00000001
+ dwc2_dump_host_registers: HAINTMSK	 @0xCC850418 : 0x00000001
  dwc2_dump_host_registers: HPRT0	 @0xCC850440 : 0x00001005
  dwc2_dump_host_registers: Host Channel 0 Specific Registers
- dwc2_dump_host_registers: HCCHAR	 @0xCC850500 : 0x01189840
- dwc2_dump_host_registers: HCSPLT	 @0xCC850504 : 0x8001C102
- dwc2_dump_host_registers: HCINT	 @0xCC850508 : 0x00000000
- dwc2_dump_host_registers: HCINTMSK	 @0xCC85050C : 0x00000000
- dwc2_dump_host_registers: HCTSIZ	 @0xCC850510 : 0x00080040
- dwc2_dump_host_registers: HCDMA	 @0xCC850514 : 0x4A07D300
+ dwc2_dump_host_registers: HCCHAR	 @0xCC850500 : 0x013C880A
+ dwc2_dump_host_registers: HCSPLT	 @0xCC850504 : 0x8000C102
+ dwc2_dump_host_registers: HCINT	 @0xCC850508 : 0x00000022
+ dwc2_dump_host_registers: HCINTMSK	 @0xCC85050C : 0x00000006
+ dwc2_dump_host_registers: HCTSIZ	 @0xCC850510 : 0x4008000A
+ dwc2_dump_host_registers: HCDMA	 @0xCC850514 : 0x4A093840
  dwc2_dump_host_registers: Host Channel 1 Specific Registers
  dwc2_dump_host_registers: HCCHAR	 @0xCC850520 : 0x01189840
  dwc2_dump_host_registers: HCSPLT	 @0xCC850524 : 0x8000C102
- dwc2_dump_host_registers: HCINT	 @0xCC850528 : 0x00000022
- dwc2_dump_host_registers: HCINTMSK	 @0xCC85052C : 0x00000006
+ dwc2_dump_host_registers: HCINT	 @0xCC850528 : 0x00000000
+ dwc2_dump_host_registers: HCINTMSK	 @0xCC85052C : 0x00000000
  dwc2_dump_host_registers: HCTSIZ	 @0xCC850530 : 0x00080040
- dwc2_dump_host_registers: HCDMA	 @0xCC850534 : 0x4A07D300
+ dwc2_dump_host_registers: HCDMA	 @0xCC850534 : 0x4A07D500
  dwc2_dump_host_registers: Host Channel 2 Specific Registers
  dwc2_dump_host_registers: HCCHAR	 @0xCC850540 : 0x01189840
  dwc2_dump_host_registers: HCSPLT	 @0xCC850544 : 0x8001C102
  dwc2_dump_host_registers: HCINT	 @0xCC850548 : 0x00000000
  dwc2_dump_host_registers: HCINTMSK	 @0xCC85054C : 0x00000000
  dwc2_dump_host_registers: HCTSIZ	 @0xCC850550 : 0x00080040
- dwc2_dump_host_registers: HCDMA	 @0xCC850554 : 0x4A07D300
+ dwc2_dump_host_registers: HCDMA	 @0xCC850554 : 0x4A07D500
  dwc2_dump_host_registers: Host Channel 3 Specific Registers
- dwc2_dump_host_registers: HCCHAR	 @0xCC850560 : 0x01189840
- dwc2_dump_host_registers: HCSPLT	 @0xCC850564 : 0x8001C102
+ dwc2_dump_host_registers: HCCHAR	 @0xCC850560 : 0x00D08040
+ dwc2_dump_host_registers: HCSPLT	 @0xCC850564 : 0x00000000
  dwc2_dump_host_registers: HCINT	 @0xCC850568 : 0x00000000
  dwc2_dump_host_registers: HCINTMSK	 @0xCC85056C : 0x00000000
- dwc2_dump_host_registers: HCTSIZ	 @0xCC850570 : 0x00080040
- dwc2_dump_host_registers: HCDMA	 @0xCC850574 : 0x4A07D300
+ dwc2_dump_host_registers: HCTSIZ	 @0xCC850570 : 0x8000003C
+ dwc2_dump_host_registers: HCDMA	 @0xCC850574 : 0x4A3B7544
  dwc2_dump_host_registers: Host Channel 4 Specific Registers
- dwc2_dump_host_registers: HCCHAR	 @0xCC850580 : 0x01189840
- dwc2_dump_host_registers: HCSPLT	 @0xCC850584 : 0x8001C102
+ dwc2_dump_host_registers: HCCHAR	 @0xCC850580 : 0x00D00040
+ dwc2_dump_host_registers: HCSPLT	 @0xCC850584 : 0x00000000
  dwc2_dump_host_registers: HCINT	 @0xCC850588 : 0x00000000
  dwc2_dump_host_registers: HCINTMSK	 @0xCC85058C : 0x00000000
- dwc2_dump_host_registers: HCTSIZ	 @0xCC850590 : 0x00080040
- dwc2_dump_host_registers: HCDMA	 @0xCC850594 : 0x4A07D300
+ dwc2_dump_host_registers: HCTSIZ	 @0xCC850590 : 0x20000008
+ dwc2_dump_host_registers: HCDMA	 @0xCC850594 : 0x4A3B74C8
  dwc2_dump_host_registers: Host Channel 5 Specific Registers
  dwc2_dump_host_registers: HCCHAR	 @0xCC8505A0 : 0x01189840
  dwc2_dump_host_registers: HCSPLT	 @0xCC8505A4 : 0x8000C102
  dwc2_dump_host_registers: HCINT	 @0xCC8505A8 : 0x00000000
  dwc2_dump_host_registers: HCINTMSK	 @0xCC8505AC : 0x00000000
  dwc2_dump_host_registers: HCTSIZ	 @0xCC8505B0 : 0x00080040
- dwc2_dump_host_registers: HCDMA	 @0xCC8505B4 : 0x4A07D300
+ dwc2_dump_host_registers: HCDMA	 @0xCC8505B4 : 0x4A07D500
  dwc2_dump_host_registers: Host Channel 6 Specific Registers
- dwc2_dump_host_registers: HCCHAR	 @0xCC8505C0 : 0x01189840
- dwc2_dump_host_registers: HCSPLT	 @0xCC8505C4 : 0x8001C102
+ dwc2_dump_host_registers: HCCHAR	 @0xCC8505C0 : 0x00D00040
+ dwc2_dump_host_registers: HCSPLT	 @0xCC8505C4 : 0x00000000
  dwc2_dump_host_registers: HCINT	 @0xCC8505C8 : 0x00000000
  dwc2_dump_host_registers: HCINTMSK	 @0xCC8505CC : 0x00000000
- dwc2_dump_host_registers: HCTSIZ	 @0xCC8505D0 : 0x00080040
- dwc2_dump_host_registers: HCDMA	 @0xCC8505D4 : 0x4A07D300
+ dwc2_dump_host_registers: HCTSIZ	 @0xCC8505D0 : 0x00000000
+ dwc2_dump_host_registers: HCDMA	 @0xCC8505D4 : 0x4B080000
  dwc2_dump_host_registers: Host Channel 7 Specific Registers
- dwc2_dump_host_registers: HCCHAR	 @0xCC8505E0 : 0x01189840
+ dwc2_dump_host_registers: HCCHAR	 @0xCC8505E0 : 0x013C880A
  dwc2_dump_host_registers: HCSPLT	 @0xCC8505E4 : 0x8001C102
  dwc2_dump_host_registers: HCINT	 @0xCC8505E8 : 0x00000000
  dwc2_dump_host_registers: HCINTMSK	 @0xCC8505EC : 0x00000000
- dwc2_dump_host_registers: HCTSIZ	 @0xCC8505F0 : 0x00080040
- dwc2_dump_host_registers: HCDMA	 @0xCC8505F4 : 0x4A07D300
+ dwc2_dump_host_registers: HCTSIZ	 @0xCC8505F0 : 0x4008000A
+ dwc2_dump_host_registers: HCDMA	 @0xCC8505F4 : 0x4A093840

^ permalink raw reply	[flat|nested] 30+ messages in thread

* usb: dwc2: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 146s
  2017-04-25 18:11                 ` Stefan Wahren
@ 2017-05-08 20:22                   ` Stefan Wahren
  2017-05-10 16:31                     ` Johan Hovold
  0 siblings, 1 reply; 30+ messages in thread
From: Stefan Wahren @ 2017-05-08 20:22 UTC (permalink / raw)
  To: linux-arm-kernel


> Stefan Wahren <stefan.wahren@i2se.com> hat am 25. April 2017 um 20:11 geschrieben:
> 
> 
> > Stefan Wahren <stefan.wahren@i2se.com> hat am 22. April 2017 um 22:50 geschrieben:
> > 
> > 
> > Hi,
> > 
> > > Eric Anholt <eric@anholt.net> hat am 20. April 2017 um 20:54 geschrieben:
> > > 
> > > 
> > > Stefan Wahren <stefan.wahren@i2se.com> writes:
> > > 
> > > > Hi,
> > > >
> > > >> Doug Anderson <dianders@chromium.org> hat am 18. April 2017 um 22:41 geschrieben:
> > > >> 
> > > >> 
> > > >> It's hard to know for sure that all of this time is really in
> > > >> urb_enqueue().  Possible we could have task switched out and been
> > > >> blocked elsewhere.  Using ftrace to get more fine-grained timings
> > > >> would be useful.  ktime_get(), ktime_sub(), and ktime_to_us() are your
> > > >> friends here if you want to use trace_printk.
> > > >
> > > > i'm a newbie to ftrace, so i hope this would be helpful.
> > > >
> > > > # connect PL2303 to the onboard hub
> > > > # echo 0 > options/sleep-time
> > > > # echo 0 > function_profile_enabled
> > > > # echo 1 > function_profile_enabled
> > > > # ./usb_test
> > > > # Waiting for at least 20 seconds and then disconnect PL2303
> > > > # echo 0 > function_profile_enabled
> > > > # cat trace_stat/function0
> > > >
> > > >   Function                               Hit    Time            Avg             s^2
> > > >   --------                               ---    ----            ---             ---
> > > >   bcm2835_handle_irq                  361347    219567633 us     607.636 us      1485199 us  
> > > >   __handle_domain_irq                1082482    212639551 us     196.437 us      3642030 us  
> > > >   generic_handle_irq                 1082482    100592051 us     92.927 us       50511334 us 
> > > >   irq_exit                           1082482    98197771 us     90.715 us       29649040 us 
> > > >   handle_level_irq                   1082482    95812379 us     88.511 us       51910093 us 
> > > 
> > > If I'm reading this output right, we're spending half of our interrupt
> > > processing time in irq_exit(), so even if dwc2's interrupt was free (the
> > > generic_handle_irq() chain), we'd be eating about half the CPU getting
> > > back out of the interrupt handler, right?
> > > 
> > > I don't really know anything about DWC2 or USB, but is there any way we
> > > could mitigate the interrupt frequency with this hardware?  If nothing
> > > else, could we loop reading gintsts until it reads back 0?
> > 
> > first of all i updated my kernel to 4.11rc7 and the issue still occures. Today i had some time to investigate this issue further and i made some interesting observations:
> > 
> > 1. The lockup doesn't occure always after starting usb_test. In rare cases i was able to run the program without lockup.
> > 2. In case the lockup occured we are always able to "rescue" the BCM2835 from this state by sending some serial data to the PL2303.
> 
> Based on this scenario i patched the interrupt routine to detect the interrupt storm and normal condition. So i can dump the global and host registers in both situations (bad and goodcase).
> 
> Here is the diff between both register dumps, maybe someone see something interesting:
> 
> --- badcase.txt	2017-04-25 18:02:59.000000000 +0000
> +++ goodcase.txt	2017-04-25 18:02:59.000000000 +0000
> @@ -4,12 +4,12 @@
>   dwc2_dump_global_registers: GAHBCFG	 @0xCC850008 : 0x00000031
>   dwc2_dump_global_registers: GUSBCFG	 @0xCC85000C : 0x20001700
>   dwc2_dump_global_registers: GRSTCTL	 @0xCC850010 : 0x80000000
> - dwc2_dump_global_registers: GINTSTS	 @0xCC850014 : 0x06000021
> + dwc2_dump_global_registers: GINTSTS	 @0xCC850014 : 0x06000029
>   dwc2_dump_global_registers: GINTMSK	 @0xCC850018 : 0xF300080E
> - dwc2_dump_global_registers: GRXSTSR	 @0xCC85001C : 0x00070044
> + dwc2_dump_global_registers: GRXSTSR	 @0xCC85001C : 0x000F0001
>   dwc2_dump_global_registers: GRXFSIZ	 @0xCC850024 : 0x00000306
>   dwc2_dump_global_registers: GNPTXFSIZ	 @0xCC850028 : 0x01000306
> - dwc2_dump_global_registers: GNPTXSTS	 @0xCC85002C : 0x1D080100
> + dwc2_dump_global_registers: GNPTXSTS	 @0xCC85002C : 0x18080100
> ...

@Doug, @John Y., @Johan: Any further ideas to narrow down the reason for this interrupt storm?

Regards
Stefan

^ permalink raw reply	[flat|nested] 30+ messages in thread

* usb: dwc2: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 146s
  2017-05-08 20:22                   ` Stefan Wahren
@ 2017-05-10 16:31                     ` Johan Hovold
  2017-05-10 23:50                       ` Doug Anderson
  0 siblings, 1 reply; 30+ messages in thread
From: Johan Hovold @ 2017-05-10 16:31 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, May 08, 2017 at 10:22:08PM +0200, Stefan Wahren wrote:
> 
> > Stefan Wahren <stefan.wahren@i2se.com> hat am 25. April 2017 um 20:11 geschrieben:
> > 
> > 
> > > Stefan Wahren <stefan.wahren@i2se.com> hat am 22. April 2017 um 22:50 geschrieben:
> > > 
> > > 
> > > Hi,
> > > 
> > > > Eric Anholt <eric@anholt.net> hat am 20. April 2017 um 20:54 geschrieben:
> > > > 
> > > > 
> > > > Stefan Wahren <stefan.wahren@i2se.com> writes:
> > > > 
> > > > > Hi,
> > > > >
> > > > >> Doug Anderson <dianders@chromium.org> hat am 18. April 2017 um 22:41 geschrieben:
> > > > >> 
> > > > >> 
> > > > >> It's hard to know for sure that all of this time is really in
> > > > >> urb_enqueue().  Possible we could have task switched out and been
> > > > >> blocked elsewhere.  Using ftrace to get more fine-grained timings
> > > > >> would be useful.  ktime_get(), ktime_sub(), and ktime_to_us() are your
> > > > >> friends here if you want to use trace_printk.
> > > > >
> > > > > i'm a newbie to ftrace, so i hope this would be helpful.
> > > > >
> > > > > # connect PL2303 to the onboard hub
> > > > > # echo 0 > options/sleep-time
> > > > > # echo 0 > function_profile_enabled
> > > > > # echo 1 > function_profile_enabled
> > > > > # ./usb_test
> > > > > # Waiting for at least 20 seconds and then disconnect PL2303
> > > > > # echo 0 > function_profile_enabled
> > > > > # cat trace_stat/function0
> > > > >
> > > > >   Function                               Hit    Time            Avg             s^2
> > > > >   --------                               ---    ----            ---             ---
> > > > >   bcm2835_handle_irq                  361347    219567633 us     607.636 us      1485199 us  
> > > > >   __handle_domain_irq                1082482    212639551 us     196.437 us      3642030 us  
> > > > >   generic_handle_irq                 1082482    100592051 us     92.927 us       50511334 us 
> > > > >   irq_exit                           1082482    98197771 us     90.715 us       29649040 us 
> > > > >   handle_level_irq                   1082482    95812379 us     88.511 us       51910093 us 
> > > > 
> > > > If I'm reading this output right, we're spending half of our interrupt
> > > > processing time in irq_exit(), so even if dwc2's interrupt was free (the
> > > > generic_handle_irq() chain), we'd be eating about half the CPU getting
> > > > back out of the interrupt handler, right?
> > > > 
> > > > I don't really know anything about DWC2 or USB, but is there any way we
> > > > could mitigate the interrupt frequency with this hardware?  If nothing
> > > > else, could we loop reading gintsts until it reads back 0?
> > > 
> > > first of all i updated my kernel to 4.11rc7 and the issue still
> > > occures. Today i had some time to investigate this issue further
> > > and i made some interesting observations:
> > > 
> > > 1. The lockup doesn't occure always after starting usb_test. In
> > > rare cases i was able to run the program without lockup.
> > > 2. In case the lockup occured we are always able to "rescue" the
> > > BCM2835 from this state by sending some serial data to the PL2303.
> > 
> > Based on this scenario i patched the interrupt routine to detect the
> > interrupt storm and normal condition. So i can dump the global and
> > host registers in both situations (bad and goodcase).
> > 
> > Here is the diff between both register dumps, maybe someone see
> > something interesting:
> > 
> > --- badcase.txt	2017-04-25 18:02:59.000000000 +0000
> > +++ goodcase.txt	2017-04-25 18:02:59.000000000 +0000
> > @@ -4,12 +4,12 @@
> >   dwc2_dump_global_registers: GAHBCFG	 @0xCC850008 : 0x00000031
> >   dwc2_dump_global_registers: GUSBCFG	 @0xCC85000C : 0x20001700
> >   dwc2_dump_global_registers: GRSTCTL	 @0xCC850010 : 0x80000000
> > - dwc2_dump_global_registers: GINTSTS	 @0xCC850014 : 0x06000021
> > + dwc2_dump_global_registers: GINTSTS	 @0xCC850014 : 0x06000029
> >   dwc2_dump_global_registers: GINTMSK	 @0xCC850018 : 0xF300080E
> > - dwc2_dump_global_registers: GRXSTSR	 @0xCC85001C : 0x00070044
> > + dwc2_dump_global_registers: GRXSTSR	 @0xCC85001C : 0x000F0001
> >   dwc2_dump_global_registers: GRXFSIZ	 @0xCC850024 : 0x00000306
> >   dwc2_dump_global_registers: GNPTXFSIZ	 @0xCC850028 : 0x01000306
> > - dwc2_dump_global_registers: GNPTXSTS	 @0xCC85002C : 0x1D080100
> > + dwc2_dump_global_registers: GNPTXSTS	 @0xCC85002C : 0x18080100
> > ...
> 
> @Doug, @John Y., @Johan: Any further ideas to narrow down the reason
> for this interrupt storm?

No, sorry.

I suggested earlier that you could try commenting out the interrupt-URB
submission in a naive attempt to narrow it down since that is an obvious
difference from the ftdi driver, but then again so are the frequent ftdi
status bulk messages (and any other timing differences).

Seems like you need to dig into dwc2.

Johan

^ permalink raw reply	[flat|nested] 30+ messages in thread

* usb: dwc2: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 146s
  2017-05-10 16:31                     ` Johan Hovold
@ 2017-05-10 23:50                       ` Doug Anderson
  2017-05-13 12:28                         ` Stefan Wahren
  0 siblings, 1 reply; 30+ messages in thread
From: Doug Anderson @ 2017-05-10 23:50 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,

On Wed, May 10, 2017 at 9:31 AM, Johan Hovold <johan@kernel.org> wrote:
> On Mon, May 08, 2017 at 10:22:08PM +0200, Stefan Wahren wrote:
>>
>> > Stefan Wahren <stefan.wahren@i2se.com> hat am 25. April 2017 um 20:11 geschrieben:
>> >
>> >
>> > > Stefan Wahren <stefan.wahren@i2se.com> hat am 22. April 2017 um 22:50 geschrieben:
>> > >
>> > >
>> > > Hi,
>> > >
>> > > > Eric Anholt <eric@anholt.net> hat am 20. April 2017 um 20:54 geschrieben:
>> > > >
>> > > >
>> > > > Stefan Wahren <stefan.wahren@i2se.com> writes:
>> > > >
>> > > > > Hi,
>> > > > >
>> > > > >> Doug Anderson <dianders@chromium.org> hat am 18. April 2017 um 22:41 geschrieben:
>> > > > >>
>> > > > >>
>> > > > >> It's hard to know for sure that all of this time is really in
>> > > > >> urb_enqueue().  Possible we could have task switched out and been
>> > > > >> blocked elsewhere.  Using ftrace to get more fine-grained timings
>> > > > >> would be useful.  ktime_get(), ktime_sub(), and ktime_to_us() are your
>> > > > >> friends here if you want to use trace_printk.
>> > > > >
>> > > > > i'm a newbie to ftrace, so i hope this would be helpful.
>> > > > >
>> > > > > # connect PL2303 to the onboard hub
>> > > > > # echo 0 > options/sleep-time
>> > > > > # echo 0 > function_profile_enabled
>> > > > > # echo 1 > function_profile_enabled
>> > > > > # ./usb_test
>> > > > > # Waiting for at least 20 seconds and then disconnect PL2303
>> > > > > # echo 0 > function_profile_enabled
>> > > > > # cat trace_stat/function0
>> > > > >
>> > > > >   Function                               Hit    Time            Avg             s^2
>> > > > >   --------                               ---    ----            ---             ---
>> > > > >   bcm2835_handle_irq                  361347    219567633 us     607.636 us      1485199 us
>> > > > >   __handle_domain_irq                1082482    212639551 us     196.437 us      3642030 us
>> > > > >   generic_handle_irq                 1082482    100592051 us     92.927 us       50511334 us
>> > > > >   irq_exit                           1082482    98197771 us     90.715 us       29649040 us
>> > > > >   handle_level_irq                   1082482    95812379 us     88.511 us       51910093 us
>> > > >
>> > > > If I'm reading this output right, we're spending half of our interrupt
>> > > > processing time in irq_exit(), so even if dwc2's interrupt was free (the
>> > > > generic_handle_irq() chain), we'd be eating about half the CPU getting
>> > > > back out of the interrupt handler, right?
>> > > >
>> > > > I don't really know anything about DWC2 or USB, but is there any way we
>> > > > could mitigate the interrupt frequency with this hardware?  If nothing
>> > > > else, could we loop reading gintsts until it reads back 0?
>> > >
>> > > first of all i updated my kernel to 4.11rc7 and the issue still
>> > > occures. Today i had some time to investigate this issue further
>> > > and i made some interesting observations:
>> > >
>> > > 1. The lockup doesn't occure always after starting usb_test. In
>> > > rare cases i was able to run the program without lockup.
>> > > 2. In case the lockup occured we are always able to "rescue" the
>> > > BCM2835 from this state by sending some serial data to the PL2303.
>> >
>> > Based on this scenario i patched the interrupt routine to detect the
>> > interrupt storm and normal condition. So i can dump the global and
>> > host registers in both situations (bad and goodcase).
>> >
>> > Here is the diff between both register dumps, maybe someone see
>> > something interesting:
>> >
>> > --- badcase.txt     2017-04-25 18:02:59.000000000 +0000
>> > +++ goodcase.txt    2017-04-25 18:02:59.000000000 +0000
>> > @@ -4,12 +4,12 @@
>> >   dwc2_dump_global_registers: GAHBCFG        @0xCC850008 : 0x00000031
>> >   dwc2_dump_global_registers: GUSBCFG        @0xCC85000C : 0x20001700
>> >   dwc2_dump_global_registers: GRSTCTL        @0xCC850010 : 0x80000000
>> > - dwc2_dump_global_registers: GINTSTS        @0xCC850014 : 0x06000021
>> > + dwc2_dump_global_registers: GINTSTS        @0xCC850014 : 0x06000029
>> >   dwc2_dump_global_registers: GINTMSK        @0xCC850018 : 0xF300080E
>> > - dwc2_dump_global_registers: GRXSTSR        @0xCC85001C : 0x00070044
>> > + dwc2_dump_global_registers: GRXSTSR        @0xCC85001C : 0x000F0001
>> >   dwc2_dump_global_registers: GRXFSIZ        @0xCC850024 : 0x00000306
>> >   dwc2_dump_global_registers: GNPTXFSIZ      @0xCC850028 : 0x01000306
>> > - dwc2_dump_global_registers: GNPTXSTS       @0xCC85002C : 0x1D080100
>> > + dwc2_dump_global_registers: GNPTXSTS       @0xCC85002C : 0x18080100
>> > ...
>>
>> @Doug, @John Y., @Johan: Any further ideas to narrow down the reason
>> for this interrupt storm?
>
> No, sorry.
>
> I suggested earlier that you could try commenting out the interrupt-URB
> submission in a naive attempt to narrow it down since that is an obvious
> difference from the ftdi driver, but then again so are the frequent ftdi
> status bulk messages (and any other timing differences).
>
> Seems like you need to dig into dwc2.

I haven't had a chance to dig and not sure when I'll have that time.
I'll keep this around in my inbox for now on the off chance that I
clean it out enough to get back to this.  Sorry!

^ permalink raw reply	[flat|nested] 30+ messages in thread

* usb: dwc2: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 146s
  2017-05-10 23:50                       ` Doug Anderson
@ 2017-05-13 12:28                         ` Stefan Wahren
  2017-10-16 20:49                           ` Julius Werner
  0 siblings, 1 reply; 30+ messages in thread
From: Stefan Wahren @ 2017-05-13 12:28 UTC (permalink / raw)
  To: linux-arm-kernel

> Doug Anderson <dianders@chromium.org> hat am 11. Mai 2017 um 01:50 geschrieben:
> 
> 
> Hi,
> 
> On Wed, May 10, 2017 at 9:31 AM, Johan Hovold <johan@kernel.org> wrote:
> > On Mon, May 08, 2017 at 10:22:08PM +0200, Stefan Wahren wrote:
> >> @Doug, @John Y., @Johan: Any further ideas to narrow down the reason
> >> for this interrupt storm?
> >
> > No, sorry.
> >
> > I suggested earlier that you could try commenting out the interrupt-URB
> > submission in a naive attempt to narrow it down since that is an obvious
> > difference from the ftdi driver, but then again so are the frequent ftdi
> > status bulk messages (and any other timing differences).
> >
> > Seems like you need to dig into dwc2.
> 
> I haven't had a chance to dig and not sure when I'll have that time.
> I'll keep this around in my inbox for now on the off chance that I
> clean it out enough to get back to this.  Sorry!
> 

But thanks for all your feedback guys.

Btw i think i took a step forward. During the whole time i a took a Raspberry Pi B to reproduce this issue, which has a builtin USB hub and a SMSC USB to ethernet chip.

/:  Bus 01.Port 1: Dev 1, Class=root_hub, Driver=dwc2/1p, 480M
    |__ Port 1: Dev 2, If 0, Class=Hub, Driver=hub/3p, 480M
        |__ Port 1: Dev 3, If 0, Class=Vendor Specific Class, Driver=smsc95xx, 480M
        |__ Port 2: Dev 4, If 0, Class=Vendor Specific Class, Driver=pl2303, 12M

I decide to switch to the Raspberry Pi Zero which neither have a builtin USB hub or a USB ethernet chip. If i connect the PL2303 directly to the RPi Zero the issue is not reproducible and usb_test behave as expected by opening the port and terminate after a few seconds (good case).

/:  Bus 01.Port 1: Dev 1, Class=root_hub, Driver=dwc2/1p, 480M
    |__ Port 1: Dev 5, If 0, Class=Vendor Specific Class, Driver=pl2303, 12M

After that i placed a USB 4 port hub between RPi Zero and PL2303. Now the issue (interrupt storm) occurs on the RPi Zero (bad case). After nearly 20 seconds i disconnected the PL2303 in order to make the system usable and stop the trace.

/:  Bus 01.Port 1: Dev 1, Class=root_hub, Driver=dwc2/1p, 480M
    |__ Port 1: Dev 2, If 0, Class=Hub, Driver=hub/4p, 480M
        |__ Port 3: Dev 4, If 0, Class=Vendor Specific Class, Driver=pl2303, 12M

Additional i made usbmon traces of both cases. Maybe that's useful for somebody with more USB knowledge than me.

Stefan

good case:

d9adfd00 1324940464 S Co:1:005:0 s 40 01 0008 0000 0000 0
d9adfd00 1324941739 C Co:1:005:0 0 0
d9adfd00 1324941792 S Co:1:005:0 s 40 01 0009 0000 0000 0
d9adfd00 1324941879 C Co:1:005:0 0 0
d9adfd00 1324941920 S Ci:1:005:0 s a1 21 0000 0000 0007 7 <
d9adfd00 1324942046 C Ci:1:005:0 0 7 = 80250000 000000
d9adfd00 1324942093 S Co:1:005:0 s 21 20 0000 0000 0007 7 = 80250000 000008
d9adfd00 1324942211 C Co:1:005:0 0 7 >
d9adfd00 1324942248 S Co:1:005:0 s 40 01 0000 0000 0000 0
d9adfd00 1324942335 C Co:1:005:0 0 0
da6ef600 1324942367 S Ii:1:005:1 -115:1 10 <
da6efd00 1324942412 S Bi:1:005:3 -115 256 <
da6ef800 1324942431 S Bi:1:005:3 -115 256 <
d9adfd00 1324942447 S Co:1:005:0 s 21 22 0003 0000 0000 0
d9adfd00 1324943310 C Co:1:005:0 0 0
da6ef600 1324944174 C Ii:1:005:1 0:1 10 = a1200000 00000200 8b06
da6ef600 1324944192 S Ii:1:005:1 -115:1 10 <
da6ef600 1324946152 C Ii:1:005:1 0:1 10 = a1200000 00000200 8b00
da6ef600 1324946161 S Ii:1:005:1 -115:1 10 <
da6ef600 1325451149 C Ii:1:005:1 -5:1 0
da6ef600 1325451177 S Ii:1:005:1 -115:1 10 <
d9adf400 1330210089 S Co:1:005:0 s 21 22 0000 0000 0000 0
d9adf400 1330211717 C Co:1:005:0 0 0
da6efd00 1330211855 C Bi:1:005:3 -2 0
da6ef800 1330211921 C Bi:1:005:3 -2 0
da6ef600 1330211969 C Ii:1:005:1 -2:1 0
d9adf400 1330211997 S Co:1:005:0 s 21 23 0000 0000 0000 0
d9adf400 1330212158 C Co:1:005:0 0 0

bad case:

da6efb00 316377734 S Co:1:004:0 s 40 01 0008 0000 0000 0
da6efb00 316378587 C Co:1:004:0 0 0
da6efb00 316378837 S Co:1:004:0 s 40 01 0009 0000 0000 0
da6efb00 316379201 C Co:1:004:0 0 0
da6efb00 316379397 S Ci:1:004:0 s a1 21 0000 0000 0007 7 <
da6efb00 316379697 C Ci:1:004:0 0 7 = 80250000 000000
da6efb00 316379890 S Co:1:004:0 s 21 20 0000 0000 0007 7 = 80250000 000008
da6efb00 316380350 C Co:1:004:0 0 7 >
da6efb00 316380580 S Co:1:004:0 s 40 01 0000 0000 0000 0
da6efb00 316380932 C Co:1:004:0 0 0
d9a14b00 316381132 S Ii:1:004:1 -115:1 10 <
da780600 316381325 S Bi:1:004:3 -115 256 <
da780100 316381488 S Bi:1:004:3 -115 256 <
d9a14b00 316389933 C Ii:1:004:1 0:1 10 = a1200000 00000200 8b06
d9a14b00 316390820 S Ii:1:004:1 -115:1 10 <
d9a14b00 316398308 C Ii:1:004:1 0:1 10 = a1200000 00000200 8b00
d9a14b00 316399936 S Ii:1:004:1 -115:1 10 <
da6efb00 316419808 S Co:1:004:0 s 21 22 0003 0000 0000 0
da6efb00 316438433 C Co:1:004:0 0 0
da780600 339316684 C Bi:1:004:3 -32 0
da780100 339316772 C Bi:1:004:3 -32 0
d9a14b00 339317035 C Ii:1:004:1 -32:1 0
d9a14b00 339317049 S Ii:1:004:1 -115:1 10 <
d9a14b00 339318040 C Ii:1:004:1 -32:1 0
d9a14b00 339318057 S Ii:1:004:1 -115:1 10 <
d9a14b00 339319042 C Ii:1:004:1 -32:1 0
d9a14b00 339319056 S Ii:1:004:1 -115:1 10 <
d9a14b00 339329551 C Ii:1:004:1 -32:1 0
d9a14b00 339329571 S Ii:1:004:1 -115:1 10 <
d9a14b00 339330586 C Ii:1:004:1 -32:1 0
d9a14b00 339330601 S Ii:1:004:1 -115:1 10 <
d9a14b00 339331035 C Ii:1:004:1 -32:1 0
d9a14b00 339331046 S Ii:1:004:1 -115:1 10 <
d9a14b00 339332037 C Ii:1:004:1 -32:1 0
d9a14b00 339332049 S Ii:1:004:1 -115:1 10 <
d9a14b00 339333054 C Ii:1:004:1 -32:1 0
d9a14b00 339333068 S Ii:1:004:1 -115:1 10 <
d9a14b00 339334037 C Ii:1:004:1 -32:1 0
d9a14b00 339334050 S Ii:1:004:1 -115:1 10 <
d9a14b00 339335036 C Ii:1:004:1 -32:1 0
d9a14b00 339335050 S Ii:1:004:1 -115:1 10 <
d9a14b00 339336036 C Ii:1:004:1 -32:1 0
d9a14b00 339336049 S Ii:1:004:1 -115:1 10 <
d9a14b00 339337037 C Ii:1:004:1 -32:1 0
d9a14b00 339337050 S Ii:1:004:1 -115:1 10 <
d9a14b00 339338039 C Ii:1:004:1 -32:1 0
d9a14b00 339338074 S Ii:1:004:1 -115:1 10 <
d9a14b00 339339038 C Ii:1:004:1 -32:1 0
d9a14b00 339339052 S Ii:1:004:1 -115:1 10 <
d9a14b00 339340130 C Ii:1:004:1 -32:1 0
d9a14b00 339340143 S Ii:1:004:1 -115:1 10 <
d9a14b00 339341035 C Ii:1:004:1 -32:1 0
d9a14b00 339341048 S Ii:1:004:1 -115:1 10 <
d9a14b00 339342039 C Ii:1:004:1 -32:1 0
d9a14b00 339342051 S Ii:1:004:1 -115:1 10 <
d9a14b00 339343038 C Ii:1:004:1 -32:1 0
d9a14b00 339343051 S Ii:1:004:1 -115:1 10 <
d9a14b00 339344038 C Ii:1:004:1 -32:1 0
d9a14b00 339344050 S Ii:1:004:1 -115:1 10 <
d9a14b00 339345037 C Ii:1:004:1 -32:1 0
d9a14b00 339345050 S Ii:1:004:1 -115:1 10 <
d9a14b00 339346038 C Ii:1:004:1 -32:1 0
d9a14b00 339346049 S Ii:1:004:1 -115:1 10 <
d9a14b00 339347030 C Ii:1:004:1 -32:1 0
d9a14b00 339347041 S Ii:1:004:1 -115:1 10 <
d9a14b00 339348039 C Ii:1:004:1 -32:1 0
d9a14b00 339348052 S Ii:1:004:1 -115:1 10 <
d9a14b00 339349039 C Ii:1:004:1 -32:1 0
d9a14b00 339349053 S Ii:1:004:1 -115:1 10 <
d9a14b00 339350105 C Ii:1:004:1 -32:1 0
d9a14b00 339350119 S Ii:1:004:1 -115:1 10 <
d9a14b00 339351032 C Ii:1:004:1 -32:1 0
d9a14b00 339351044 S Ii:1:004:1 -115:1 10 <
d9a14b00 339352030 C Ii:1:004:1 -32:1 0
d9a14b00 339352042 S Ii:1:004:1 -115:1 10 <
d9a14b00 339353037 C Ii:1:004:1 -32:1 0
d9a14b00 339353051 S Ii:1:004:1 -115:1 10 <
d9a14b00 339354068 C Ii:1:004:1 -32:1 0
d9a14b00 339354083 S Ii:1:004:1 -115:1 10 <
d9a14b00 339355018 C Ii:1:004:1 -32:1 0
d9a14b00 339355026 S Ii:1:004:1 -115:1 10 <
d9a14b00 339356017 C Ii:1:004:1 -32:1 0
d9a14b00 339356025 S Ii:1:004:1 -115:1 10 <
d9a14b00 339357018 C Ii:1:004:1 -32:1 0
d9a14b00 339357026 S Ii:1:004:1 -115:1 10 <
d9a14b00 339358017 C Ii:1:004:1 -32:1 0
d9a14b00 339358025 S Ii:1:004:1 -115:1 10 <
d9a14b00 339359018 C Ii:1:004:1 -32:1 0
d9a14b00 339359026 S Ii:1:004:1 -115:1 10 <
d9a14b00 339360016 C Ii:1:004:1 -32:1 0
d9a14b00 339360024 S Ii:1:004:1 -115:1 10 <
d9a14b00 339361018 C Ii:1:004:1 -32:1 0
d9a14b00 339361026 S Ii:1:004:1 -115:1 10 <
d9a14b00 339362016 C Ii:1:004:1 -32:1 0
d9a14b00 339362024 S Ii:1:004:1 -115:1 10 <
d9a14b00 339363017 C Ii:1:004:1 -32:1 0
d9a14b00 339363049 S Ii:1:004:1 -115:1 10 <
d9a14b00 339364017 C Ii:1:004:1 -32:1 0
d9a14b00 339364025 S Ii:1:004:1 -115:1 10 <
d9a14b00 339365017 C Ii:1:004:1 -32:1 0
d9a14b00 339365025 S Ii:1:004:1 -115:1 10 <
d9a14b00 339366016 C Ii:1:004:1 -32:1 0
d9a14b00 339366024 S Ii:1:004:1 -115:1 10 <
d9a14b00 339367017 C Ii:1:004:1 -32:1 0
d9a14b00 339367025 S Ii:1:004:1 -115:1 10 <
d9a14b00 339368017 C Ii:1:004:1 -32:1 0
d9a14b00 339368024 S Ii:1:004:1 -115:1 10 <
d9a14b00 339369017 C Ii:1:004:1 -32:1 0
d9a14b00 339369025 S Ii:1:004:1 -115:1 10 <
d9a14b00 339370017 C Ii:1:004:1 -32:1 0
d9a14b00 339370024 S Ii:1:004:1 -115:1 10 <
d9a14b00 339371017 C Ii:1:004:1 -32:1 0
d9a14b00 339371026 S Ii:1:004:1 -115:1 10 <
d9a14b00 339372017 C Ii:1:004:1 -32:1 0
d9a14b00 339372024 S Ii:1:004:1 -115:1 10 <
d9a14b00 339373017 C Ii:1:004:1 -32:1 0
d9a14b00 339373025 S Ii:1:004:1 -115:1 10 <
d9a14b00 339374018 C Ii:1:004:1 -32:1 0
d9a14b00 339374025 S Ii:1:004:1 -115:1 10 <
d9a14b00 339375017 C Ii:1:004:1 -32:1 0
d9a14b00 339375025 S Ii:1:004:1 -115:1 10 <
d9a14b00 339376018 C Ii:1:004:1 -32:1 0
d9a14b00 339376026 S Ii:1:004:1 -115:1 10 <
d9a14b00 339377017 C Ii:1:004:1 -32:1 0
d9a14b00 339377024 S Ii:1:004:1 -115:1 10 <
d9930180 339605934 S Ci:1:002:0 s a3 00 0000 0003 0004 4 <
d9930180 339606101 C Ci:1:002:0 0 4 = 00010000
d9930380 339650171 S Ci:1:002:0 s a3 00 0000 0003 0004 4 <
d9930380 339650346 C Ci:1:002:0 0 4 = 00010000
d9842c80 339700165 S Ci:1:002:0 s a3 00 0000 0003 0004 4 <
d9842c80 339700346 C Ci:1:002:0 0 4 = 00010000
da780000 339751063 S Ci:1:002:0 s a3 00 0000 0003 0004 4 <
da780000 339751221 C Ci:1:002:0 0 4 = 00010000
d9a15380 339800833 S Ci:1:002:0 s a3 00 0000 0003 0004 4 <
d9a15380 339800974 C Ci:1:002:0 0 4 = 00010000
da737080 339801535 C Ii:1:002:1 -2:2048 0
d9a15380 339802157 S Co:1:002:0 s 00 03 0001 0000 0000 0
d9a15380 339802265 C Co:1:002:0 0 0
d9a15380 339802291 S Co:1:001:0 s 23 03 0002 0001 0000 0
d9a15380 339802352 C Co:1:001:0 0 0
da6ba280 339830247 C Ii:1:001:1 -2:2048 0

^ permalink raw reply	[flat|nested] 30+ messages in thread

* usb: dwc2: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 146s
  2017-05-13 12:28                         ` Stefan Wahren
@ 2017-10-16 20:49                           ` Julius Werner
  2017-10-17  8:52                             ` Johan Hovold
  2017-10-25 21:22                             ` Doug Anderson
  0 siblings, 2 replies; 30+ messages in thread
From: Julius Werner @ 2017-10-16 20:49 UTC (permalink / raw)
  To: linux-arm-kernel

> d9a14b00 339317035 C Ii:1:004:1 -32:1 0
> d9a14b00 339317049 S Ii:1:004:1 -115:1 10 <
> d9a14b00 339318040 C Ii:1:004:1 -32:1 0
> d9a14b00 339318057 S Ii:1:004:1 -115:1 10 <
> d9a14b00 339319042 C Ii:1:004:1 -32:1 0
> d9a14b00 339319056 S Ii:1:004:1 -115:1 10 <
> d9a14b00 339329551 C Ii:1:004:1 -32:1 0
> d9a14b00 339329571 S Ii:1:004:1 -115:1 10 <
> d9a14b00 339330586 C Ii:1:004:1 -32:1 0
> d9a14b00 339330601 S Ii:1:004:1 -115:1 10 <
> d9a14b00 339331035 C Ii:1:004:1 -32:1 0

Sorry for necromancing an old thread, but I just happened to read
through this and thought someone might care:

If I read that right, the usbmon output shows that the interrupt
endpoint is stalled (keeps returning -EPIPE). A STALL is a special
device-side USB condition that tells the host something is wrong and
will persist until cleared manually. It seems that the driver isn't
prepared for this (see
drivers/usb/serial/pl2303.c#pl2303_read_int_callback) and just keeps
resubmitting the URB, so it will stall again as fast as the endpoint
allows it to. This may be the reason why you get so many transfers
that it overwhelms the CPU.

A fix would be to catch -EPIPE in that function and handle it
explicitly (with either a CLEAR_STALL to the endpoint or a full USB
reset... would have to look at the documentation for PL2303 to see
what the stall actually means and how you're supposed to treat it).

^ permalink raw reply	[flat|nested] 30+ messages in thread

* usb: dwc2: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 146s
  2017-10-16 20:49                           ` Julius Werner
@ 2017-10-17  8:52                             ` Johan Hovold
  2017-10-25 21:22                             ` Doug Anderson
  1 sibling, 0 replies; 30+ messages in thread
From: Johan Hovold @ 2017-10-17  8:52 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Oct 16, 2017 at 01:49:11PM -0700, Julius Werner wrote:
> > d9a14b00 339317035 C Ii:1:004:1 -32:1 0
> > d9a14b00 339317049 S Ii:1:004:1 -115:1 10 <
> > d9a14b00 339318040 C Ii:1:004:1 -32:1 0
> > d9a14b00 339318057 S Ii:1:004:1 -115:1 10 <
> > d9a14b00 339319042 C Ii:1:004:1 -32:1 0
> > d9a14b00 339319056 S Ii:1:004:1 -115:1 10 <
> > d9a14b00 339329551 C Ii:1:004:1 -32:1 0
> > d9a14b00 339329571 S Ii:1:004:1 -115:1 10 <
> > d9a14b00 339330586 C Ii:1:004:1 -32:1 0
> > d9a14b00 339330601 S Ii:1:004:1 -115:1 10 <
> > d9a14b00 339331035 C Ii:1:004:1 -32:1 0
> 
> Sorry for necromancing an old thread, but I just happened to read
> through this and thought someone might care:
> 
> If I read that right, the usbmon output shows that the interrupt
> endpoint is stalled (keeps returning -EPIPE). A STALL is a special
> device-side USB condition that tells the host something is wrong and
> will persist until cleared manually. It seems that the driver isn't
> prepared for this (see
> drivers/usb/serial/pl2303.c#pl2303_read_int_callback) and just keeps
> resubmitting the URB, so it will stall again as fast as the endpoint
> allows it to. This may be the reason why you get so many transfers
> that it overwhelms the CPU.

That's a bug in the driver, we should not resubmit (without further
action) on -EPIPE.

> A fix would be to catch -EPIPE in that function and handle it
> explicitly (with either a CLEAR_STALL to the endpoint or a full USB
> reset... would have to look at the documentation for PL2303 to see
> what the stall actually means and how you're supposed to treat it).

Yes, but we can't just clear the halt from the completion handler, so
you'd typically have to schedule a work struct and call usb_clear_halt
from there. Only then could we try resubmitting the URBs, but chances
are we'd just hit that stall again (with the hardware setup in
question). Note that no usb-serial drivers currently implement any such
stall recovery, and just stop resubmitting the URB on -EPIPE.

Or at least so I thought. The generic implementation (which most drivers
rely on) and a few others get this right, but we have a number of legacy
drivers with custom implementations that do resubmit on -EPIPE
(including the pl2303 one).

I'll go fix up that up.

Thanks,
Johan

^ permalink raw reply	[flat|nested] 30+ messages in thread

* usb: dwc2: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 146s
  2017-10-16 20:49                           ` Julius Werner
  2017-10-17  8:52                             ` Johan Hovold
@ 2017-10-25 21:22                             ` Doug Anderson
  1 sibling, 0 replies; 30+ messages in thread
From: Doug Anderson @ 2017-10-25 21:22 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,

On Mon, Oct 16, 2017 at 1:49 PM, Julius Werner <jwerner@chromium.org> wrote:
>> d9a14b00 339317035 C Ii:1:004:1 -32:1 0
>> d9a14b00 339317049 S Ii:1:004:1 -115:1 10 <
>> d9a14b00 339318040 C Ii:1:004:1 -32:1 0
>> d9a14b00 339318057 S Ii:1:004:1 -115:1 10 <
>> d9a14b00 339319042 C Ii:1:004:1 -32:1 0
>> d9a14b00 339319056 S Ii:1:004:1 -115:1 10 <
>> d9a14b00 339329551 C Ii:1:004:1 -32:1 0
>> d9a14b00 339329571 S Ii:1:004:1 -115:1 10 <
>> d9a14b00 339330586 C Ii:1:004:1 -32:1 0
>> d9a14b00 339330601 S Ii:1:004:1 -115:1 10 <
>> d9a14b00 339331035 C Ii:1:004:1 -32:1 0
>
> Sorry for necromancing an old thread, but I just happened to read
> through this and thought someone might care:
>
> If I read that right, the usbmon output shows that the interrupt
> endpoint is stalled (keeps returning -EPIPE). A STALL is a special
> device-side USB condition that tells the host something is wrong and
> will persist until cleared manually. It seems that the driver isn't
> prepared for this (see
> drivers/usb/serial/pl2303.c#pl2303_read_int_callback) and just keeps
> resubmitting the URB, so it will stall again as fast as the endpoint
> allows it to. This may be the reason why you get so many transfers
> that it overwhelms the CPU.
>
> A fix would be to catch -EPIPE in that function and handle it
> explicitly (with either a CLEAR_STALL to the endpoint or a full USB
> reset... would have to look at the documentation for PL2303 to see
> what the stall actually means and how you're supposed to treat it).

To further comment on this old thread, I just posted another patch at
<https://patchwork.kernel.org/patch/10027237/> that could also make
pl2303 less able to bring dwc2-based controllers to a screeching halt.
I added many of the people who had taken part in this thread, but if
you were just lurking here then hopefully you can dig it up and try it
out.

-Doug

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2017-10-25 21:22 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-10-31 20:34 usb: dwc2: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 146s Stefan Wahren
2016-11-01  8:26 ` Michael Zoran
2017-04-17 11:05 ` Stefan Wahren
2017-04-17 22:37   ` Doug Anderson
2017-04-18  8:07     ` Stefan Wahren
2017-04-18 20:08       ` Doug Anderson
2017-04-18 20:25       ` Stefan Wahren
2017-04-18 20:41         ` Doug Anderson
2017-04-18 20:53           ` Stefan Wahren
2017-04-19 20:25           ` Stefan Wahren
2017-04-19 21:47             ` Doug Anderson
2017-04-20  7:46               ` Stefan Wahren
2017-04-20 16:19                 ` Doug Anderson
2017-04-20 18:54             ` Eric Anholt
2017-04-20 19:45               ` Doug Anderson
2017-04-20 19:57                 ` Eric Anholt
2017-04-20 20:37                   ` Doug Anderson
2017-04-22 20:50               ` Stefan Wahren
2017-04-25 18:11                 ` Stefan Wahren
2017-05-08 20:22                   ` Stefan Wahren
2017-05-10 16:31                     ` Johan Hovold
2017-05-10 23:50                       ` Doug Anderson
2017-05-13 12:28                         ` Stefan Wahren
2017-10-16 20:49                           ` Julius Werner
2017-10-17  8:52                             ` Johan Hovold
2017-10-25 21:22                             ` Doug Anderson
2017-04-19  8:56         ` Johan Hovold
2017-04-19  9:12           ` Stefan Wahren
2017-04-19  9:55             ` Johan Hovold
2017-04-17 23:45   ` Heiko Stuebner

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.