All of lore.kernel.org
 help / color / mirror / Atom feed
* kernel locks due to USB I/O
@ 2020-11-10 19:20 Alberto Sentieri
  2020-11-10 20:51 ` Alan Stern
  0 siblings, 1 reply; 17+ messages in thread
From: Alberto Sentieri @ 2020-11-10 19:20 UTC (permalink / raw)
  To: Greg Kroah-Hartman, linux-usb

I’ve seen many kernel locks caused by a particular user-level 
application. After the kernel locks, there is no report left in the 
machine, neither in the logs. These locks have to do with USB input and 
output.

The objective of this email is to get guidance about how to collect more 
data related to the locks.

Follows a description of the problem.

I manage a few remote machines installed at a manufacturing facility, 
which run Ubuntu 18.04. For months I had seen unexpected kernel locks, 
which I could not explain. By locks I mean that the machine completely 
dies. The graphical screen and keyboard freezes. I cannot ping or 
connect through ssh during the locks. The only way of making the machine 
come back is through a “pull the plug”. After rebooting I cannot find 
anything meaningful about the lock in the logs. The machine is a good 
quality one with a 6-core Xeon, 32 GB ECC memory (and the application is 
using about 1GB). Exact the same problem happens in two identical 
machines, one running kernel 5.0.0-37 generic and the other running 
kernel 5.3.0-62-generic.

A few days ago I was able to create a sequence of events that produce 
the locks in a couple of minutes. These events have to do with USB 2.0 
interrupt I/O on USB devices connected at 12 Mbits/s and the frequency 
URBs are submitted and reaped . It is necessary to have at least 36 
devices connected to reproduce the problem easily, which I cannot do 
from where I am. The machines are in a country other than the one I 
live, and my physical access to them is not possible due to COVID-19 
restrictions.

There is no special USB drivers installed. However, there is a NVIDIA 
manufacturer driver installed, which I installed using the Ubuntu 
regular tools for non-free software. All USB I/O is done by a regular 
user opening /dev/bus/usb/xxx/xxx (the device group is set to the user 
group by udev).

Each set of 18 USB devices is connected to a 10-Amp.-power-supply 
powered HUB. Each hub has its own USB 2.0 root, I mean, I installed 
multiple USB 2.0 PCI express expansion cards, and only one port of each 
expansion card is used for each HUB.

The protocol to talk to any of the 36 devices is pretty simple. It uses 
USB interrupt frames. A 64-byte frame is sent to the device (request 
packet). I use ioctl (USBDEVFS_SUBMITURB). The file descriptor is 
monitored by epoll and when an answer comes back, the response packet 
(another 64-byte interrupt packet) is recovered by ioctl 
(USBDEVFS_REAPURBNDELAY). Then a 64-byte packet (confirmation packet) is 
sent through USBDEVFS_SUBMITURB. This sequence happens once every few 
seconds and the delay between the three packets is just a couple of 
milliseconds. All process of dealing with the 36 devices is in a unique 
thread, under the same epoll loop.

So if I synchronize all 36 devices, I mean, I try to talk to all them 
basically at the same time, the kernel will lock in about 2 minutes or 
less. By “at the same time” I mean to submit the URBs for the request 
packet around the same time for all of them, and then sit there, waiting 
for the proper epoll wake-up to deal with the state machine (response 
and confirmation packets).

However, if I lock a semaphore before sending the request packet for one 
device, and only unlock after reaping the URB I used to send the 
confirmation packet, it ran for ate least 72 hours without problems. So, 
one device at a time (using basically the same software plus the 
semaphore) does not cause the kernel lock.

My point is that simple ioctl calls to USB devices should not break the 
kernel. I need help to address the kernel issue. The problem is 
difficult to reproduce at my office because it needs many devices 
connected to it, which are available only in a place I do not have 
physical access to, due to COVID-19 travel restrictions.

My guess is that, for a regular user, this bug rarely manifests itself 
and it may be there for a long time.

I would like to figure out exactly where the problem is and I am looking 
for your guidance to get more information about it.

Thanks,

Alberto Sentieri

# cat /proc/version
Linux version 5.3.0-62-generic (buildd@lcy01-amd64-004) (gcc version 
7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)) #56~18.04.1-Ubuntu SMP Wed Jun 24 
16:17:03 UTC 2020

# free -h  (while running the application)
               total        used        free      shared  buff/cache 
available
Mem:            31G        860M         29G        7.1M        1.3G 
    29G
Swap:           15G          0B         15G

# lspci
00:00.0 Host bridge: Intel Corporation Sky Lake-E DMI3 Registers (rev 04)
00:04.0 System peripheral: Intel Corporation Sky Lake-E CBDMA Registers 
(rev 04)
00:04.1 System peripheral: Intel Corporation Sky Lake-E CBDMA Registers 
(rev 04)
00:04.2 System peripheral: Intel Corporation Sky Lake-E CBDMA Registers 
(rev 04)
00:04.3 System peripheral: Intel Corporation Sky Lake-E CBDMA Registers 
(rev 04)
00:04.4 System peripheral: Intel Corporation Sky Lake-E CBDMA Registers 
(rev 04)
00:04.5 System peripheral: Intel Corporation Sky Lake-E CBDMA Registers 
(rev 04)
00:04.6 System peripheral: Intel Corporation Sky Lake-E CBDMA Registers 
(rev 04)
00:04.7 System peripheral: Intel Corporation Sky Lake-E CBDMA Registers 
(rev 04)
00:05.0 System peripheral: Intel Corporation Sky Lake-E MM/Vt-d 
Configuration Registers (rev 04)
00:05.2 System peripheral: Intel Corporation Device 2025 (rev 04)
00:05.4 PIC: Intel Corporation Device 2026 (rev 04)
00:08.0 System peripheral: Intel Corporation Sky Lake-E Ubox Registers 
(rev 04)
00:08.1 Performance counters: Intel Corporation Sky Lake-E Ubox 
Registers (rev 04)
00:08.2 System peripheral: Intel Corporation Sky Lake-E Ubox Registers 
(rev 04)
00:14.0 USB controller: Intel Corporation 200 Series/Z370 Chipset Family 
USB 3.0 xHCI Controller
00:14.2 Signal processing controller: Intel Corporation 200 Series PCH 
Thermal Subsystem
00:16.0 Communication controller: Intel Corporation 200 Series PCH CSME 
HECI #1
00:17.0 RAID bus controller: Intel Corporation C600/X79 series chipset 
SATA RAID Controller
00:1b.0 PCI bridge: Intel Corporation 200 Series PCH PCI Express Root 
Port #21 (rev f0)
00:1c.0 PCI bridge: Intel Corporation 200 Series PCH PCI Express Root 
Port #3 (rev f0)
00:1c.4 PCI bridge: Intel Corporation 200 Series PCH PCI Express Root 
Port #5 (rev f0)
00:1d.0 PCI bridge: Intel Corporation 200 Series PCH PCI Express Root 
Port #9 (rev f0)
00:1f.0 ISA bridge: Intel Corporation C422 Chipset LPC/eSPI Controller
00:1f.2 Memory controller: Intel Corporation 200 Series/Z370 Chipset 
Family Power Management Controller
00:1f.3 Audio device: Intel Corporation 200 Series PCH HD Audio
00:1f.4 SMBus: Intel Corporation 200 Series/Z370 Chipset Family SMBus 
Controller
00:1f.5 Non-VGA unclassified device: Intel Corporation 200 Series/Z370 
Chipset Family SPI Controller
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) 
I219-LM
01:00.0 PCI bridge: Pericom Semiconductor PI7C9X111SL PCIe-to-PCI 
Reversible Bridge (rev 02)
02:04.0 USB controller: VIA Technologies, Inc. VT82xx/62xx UHCI USB 1.1 
Controller (rev 62)
02:04.1 USB controller: VIA Technologies, Inc. VT82xx/62xx UHCI USB 1.1 
Controller (rev 62)
02:04.2 USB controller: VIA Technologies, Inc. USB 2.0 (rev 65)
03:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network 
Connection (rev 03)
05:00.0 PCI bridge: Pericom Semiconductor PI7C9X111SL PCIe-to-PCI 
Reversible Bridge (rev 02)
06:04.0 USB controller: VIA Technologies, Inc. VT82xx/62xx UHCI USB 1.1 
Controller (rev 62)
06:04.1 USB controller: VIA Technologies, Inc. VT82xx/62xx UHCI USB 1.1 
Controller (rev 62)
06:04.2 USB controller: VIA Technologies, Inc. USB 2.0 (rev 65)
14:00.0 PCI bridge: Intel Corporation Sky Lake-E PCI Express Root Port A 
(rev 04)
14:05.0 System peripheral: Intel Corporation Device 2034 (rev 04)
14:05.2 System peripheral: Intel Corporation Sky Lake-E RAS 
Configuration Registers (rev 04)
14:05.4 PIC: Intel Corporation Device 2036 (rev 04)
14:08.0 System peripheral: Intel Corporation Sky Lake-E CHA Registers 
(rev 04)
14:08.1 System peripheral: Intel Corporation Sky Lake-E CHA Registers 
(rev 04)
14:08.2 System peripheral: Intel Corporation Sky Lake-E CHA Registers 
(rev 04)
14:08.3 System peripheral: Intel Corporation Sky Lake-E CHA Registers 
(rev 04)
14:08.4 System peripheral: Intel Corporation Sky Lake-E CHA Registers 
(rev 04)
14:08.5 System peripheral: Intel Corporation Sky Lake-E CHA Registers 
(rev 04)
14:08.6 System peripheral: Intel Corporation Sky Lake-E CHA Registers 
(rev 04)
14:08.7 System peripheral: Intel Corporation Sky Lake-E CHA Registers 
(rev 04)
14:09.0 System peripheral: Intel Corporation Sky Lake-E CHA Registers 
(rev 04)
14:09.1 System peripheral: Intel Corporation Sky Lake-E CHA Registers 
(rev 04)
14:0e.0 System peripheral: Intel Corporation Sky Lake-E CHA Registers 
(rev 04)
14:0e.1 System peripheral: Intel Corporation Sky Lake-E CHA Registers 
(rev 04)
14:0e.2 System peripheral: Intel Corporation Sky Lake-E CHA Registers 
(rev 04)
14:0e.3 System peripheral: Intel Corporation Sky Lake-E CHA Registers 
(rev 04)
14:0e.4 System peripheral: Intel Corporation Sky Lake-E CHA Registers 
(rev 04)
14:0e.5 System peripheral: Intel Corporation Sky Lake-E CHA Registers 
(rev 04)
14:0e.6 System peripheral: Intel Corporation Sky Lake-E CHA Registers 
(rev 04)
14:0e.7 System peripheral: Intel Corporation Sky Lake-E CHA Registers 
(rev 04)
14:0f.0 System peripheral: Intel Corporation Sky Lake-E CHA Registers 
(rev 04)
14:0f.1 System peripheral: Intel Corporation Sky Lake-E CHA Registers 
(rev 04)
14:1d.0 System peripheral: Intel Corporation Sky Lake-E CHA Registers 
(rev 04)
14:1d.1 System peripheral: Intel Corporation Sky Lake-E CHA Registers 
(rev 04)
14:1d.2 System peripheral: Intel Corporation Sky Lake-E CHA Registers 
(rev 04)
14:1d.3 System peripheral: Intel Corporation Sky Lake-E CHA Registers 
(rev 04)
14:1e.0 System peripheral: Intel Corporation Sky Lake-E PCU Registers 
(rev 04)
14:1e.1 System peripheral: Intel Corporation Sky Lake-E PCU Registers 
(rev 04)
14:1e.2 System peripheral: Intel Corporation Sky Lake-E PCU Registers 
(rev 04)
14:1e.3 System peripheral: Intel Corporation Sky Lake-E PCU Registers 
(rev 04)
14:1e.4 System peripheral: Intel Corporation Sky Lake-E PCU Registers 
(rev 04)
14:1e.5 System peripheral: Intel Corporation Sky Lake-E PCU Registers 
(rev 04)
14:1e.6 System peripheral: Intel Corporation Sky Lake-E PCU Registers 
(rev 04)
15:00.0 PCI bridge: Pericom Semiconductor PI7C9X111SL PCIe-to-PCI 
Reversible Bridge (rev 02)
16:04.0 USB controller: VIA Technologies, Inc. VT82xx/62xx UHCI USB 1.1 
Controller (rev 62)
16:04.1 USB controller: VIA Technologies, Inc. VT82xx/62xx UHCI USB 1.1 
Controller (rev 62)
16:04.2 USB controller: VIA Technologies, Inc. USB 2.0 (rev 65)
20:00.0 PCI bridge: Intel Corporation Sky Lake-E PCI Express Root Port A 
(rev 04)
20:05.0 System peripheral: Intel Corporation Device 2034 (rev 04)
20:05.2 System peripheral: Intel Corporation Sky Lake-E RAS 
Configuration Registers (rev 04)
20:05.4 PIC: Intel Corporation Device 2036 (rev 04)
20:08.0 System peripheral: Intel Corporation Device 2066 (rev 04)
20:09.0 System peripheral: Intel Corporation Device 2066 (rev 04)
20:0a.0 System peripheral: Intel Corporation Device 2040 (rev 04)
20:0a.1 System peripheral: Intel Corporation Device 2041 (rev 04)
20:0a.2 System peripheral: Intel Corporation Device 2042 (rev 04)
20:0a.3 System peripheral: Intel Corporation Device 2043 (rev 04)
20:0a.4 System peripheral: Intel Corporation Device 2044 (rev 04)
20:0a.5 System peripheral: Intel Corporation Device 2045 (rev 04)
20:0a.6 System peripheral: Intel Corporation Device 2046 (rev 04)
20:0a.7 System peripheral: Intel Corporation Device 2047 (rev 04)
20:0b.0 System peripheral: Intel Corporation Device 2048 (rev 04)
20:0b.1 System peripheral: Intel Corporation Device 2049 (rev 04)
20:0b.2 System peripheral: Intel Corporation Device 204a (rev 04)
20:0b.3 System peripheral: Intel Corporation Device 204b (rev 04)
20:0c.0 System peripheral: Intel Corporation Device 2040 (rev 04)
20:0c.1 System peripheral: Intel Corporation Device 2041 (rev 04)
20:0c.2 System peripheral: Intel Corporation Device 2042 (rev 04)
20:0c.3 System peripheral: Intel Corporation Device 2043 (rev 04)
20:0c.4 System peripheral: Intel Corporation Device 2044 (rev 04)
20:0c.5 System peripheral: Intel Corporation Device 2045 (rev 04)
20:0c.6 System peripheral: Intel Corporation Device 2046 (rev 04)
20:0c.7 System peripheral: Intel Corporation Device 2047 (rev 04)
20:0d.0 System peripheral: Intel Corporation Device 2048 (rev 04)
20:0d.1 System peripheral: Intel Corporation Device 2049 (rev 04)
20:0d.2 System peripheral: Intel Corporation Device 204a (rev 04)
20:0d.3 System peripheral: Intel Corporation Device 204b (rev 04)
21:00.0 VGA compatible controller: NVIDIA Corporation GP107GL [Quadro 
P400] (rev a1)
21:00.1 Audio device: NVIDIA Corporation GP107GL High Definition Audio 
Controller (rev a1)
2c:00.0 PCI bridge: Intel Corporation Sky Lake-E PCI Express Root Port A 
(rev 04)
2c:02.0 PCI bridge: Intel Corporation Sky Lake-E PCI Express Root Port C 
(rev 04)
2c:03.0 PCI bridge: Intel Corporation Sky Lake-E PCI Express Root Port D 
(rev 04)
2c:05.0 System peripheral: Intel Corporation Device 2034 (rev 04)
2c:05.2 System peripheral: Intel Corporation Sky Lake-E RAS 
Configuration Registers (rev 04)
2c:05.4 PIC: Intel Corporation Device 2036 (rev 04)
2c:12.0 Performance counters: Intel Corporation Sky Lake-E M3KTI 
Registers (rev 04)
2c:12.1 Performance counters: Intel Corporation Sky Lake-E M3KTI 
Registers (rev 04)
2c:12.2 System peripheral: Intel Corporation Sky Lake-E M3KTI Registers 
(rev 04)
2c:15.0 System peripheral: Intel Corporation Sky Lake-E M2PCI Registers 
(rev 04)
2c:16.0 System peripheral: Intel Corporation Sky Lake-E M2PCI Registers 
(rev 04)
2c:16.4 System peripheral: Intel Corporation Sky Lake-E M2PCI Registers 
(rev 04)
2c:17.0 System peripheral: Intel Corporation Sky Lake-E M2PCI Registers 
(rev 04)
2d:00.0 PCI bridge: Pericom Semiconductor PI7C9X111SL PCIe-to-PCI 
Reversible Bridge (rev 02)
2e:04.0 USB controller: VIA Technologies, Inc. VT82xx/62xx UHCI USB 1.1 
Controller (rev 62)
2e:04.1 USB controller: VIA Technologies, Inc. VT82xx/62xx UHCI USB 1.1 
Controller (rev 62)
2e:04.2 USB controller: VIA Technologies, Inc. USB 2.0 (rev 65)

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: kernel locks due to USB I/O
  2020-11-10 19:20 kernel locks due to USB I/O Alberto Sentieri
@ 2020-11-10 20:51 ` Alan Stern
  2020-11-10 23:42   ` Alberto Sentieri
  0 siblings, 1 reply; 17+ messages in thread
From: Alan Stern @ 2020-11-10 20:51 UTC (permalink / raw)
  To: Alberto Sentieri; +Cc: Greg Kroah-Hartman, linux-usb

On Tue, Nov 10, 2020 at 02:20:50PM -0500, Alberto Sentieri wrote:
> I’ve seen many kernel locks caused by a particular user-level application.
> After the kernel locks, there is no report left in the machine, neither in
> the logs. These locks have to do with USB input and output.
> 
> The objective of this email is to get guidance about how to collect more
> data related to the locks.
> 
> Follows a description of the problem.
> 
> I manage a few remote machines installed at a manufacturing facility, which
> run Ubuntu 18.04. For months I had seen unexpected kernel locks, which I
> could not explain. By locks I mean that the machine completely dies. The
> graphical screen and keyboard freezes. I cannot ping or connect through ssh
> during the locks. The only way of making the machine come back is through a
> “pull the plug”. After rebooting I cannot find anything meaningful about the
> lock in the logs. The machine is a good quality one with a 6-core Xeon, 32
> GB ECC memory (and the application is using about 1GB). Exact the same
> problem happens in two identical machines, one running kernel 5.0.0-37
> generic and the other running kernel 5.3.0-62-generic.

Can you update either machine to a 5.9 kernel?

> A few days ago I was able to create a sequence of events that produce the
> locks in a couple of minutes. These events have to do with USB 2.0 interrupt
> I/O on USB devices connected at 12 Mbits/s and the frequency URBs are
> submitted and reaped . It is necessary to have at least 36 devices connected
> to reproduce the problem easily, which I cannot do from where I am. The
> machines are in a country other than the one I live, and my physical access
> to them is not possible due to COVID-19 restrictions.
> 
> There is no special USB drivers installed. However, there is a NVIDIA
> manufacturer driver installed, which I installed using the Ubuntu regular
> tools for non-free software. All USB I/O is done by a regular user opening
> /dev/bus/usb/xxx/xxx (the device group is set to the user group by udev).
> 
> Each set of 18 USB devices is connected to a 10-Amp.-power-supply powered
> HUB. Each hub has its own USB 2.0 root, I mean, I installed multiple USB 2.0
> PCI express expansion cards, and only one port of each expansion card is
> used for each HUB.
> 
> The protocol to talk to any of the 36 devices is pretty simple. It uses USB
> interrupt frames. A 64-byte frame is sent to the device (request packet). I
> use ioctl (USBDEVFS_SUBMITURB). The file descriptor is monitored by epoll
> and when an answer comes back, the response packet (another 64-byte
> interrupt packet) is recovered by ioctl (USBDEVFS_REAPURBNDELAY). Then a
> 64-byte packet (confirmation packet) is sent through USBDEVFS_SUBMITURB.
> This sequence happens once every few seconds and the delay between the three
> packets is just a couple of milliseconds. All process of dealing with the 36
> devices is in a unique thread, under the same epoll loop.

This sentence is ambiguous.  Do you mean there is a single unique thread 
which talks to all 36 devices?  Or do you mean there is a separate 
unique thread for each device (so 36 threads)?

> So if I synchronize all 36 devices, I mean, I try to talk to all them
> basically at the same time, the kernel will lock in about 2 minutes or less.
> By “at the same time” I mean to submit the URBs for the request packet
> around the same time for all of them, and then sit there, waiting for the
> proper epoll wake-up to deal with the state machine (response and
> confirmation packets).
> 
> However, if I lock a semaphore before sending the request packet for one
> device, and only unlock after reaping the URB I used to send the
> confirmation packet, it ran for ate least 72 hours without problems. So, one
> device at a time (using basically the same software plus the semaphore) does
> not cause the kernel lock.
> 
> My point is that simple ioctl calls to USB devices should not break the
> kernel. I need help to address the kernel issue. The problem is difficult to
> reproduce at my office because it needs many devices connected to it, which
> are available only in a place I do not have physical access to, due to
> COVID-19 travel restrictions.
> 
> My guess is that, for a regular user, this bug rarely manifests itself and
> it may be there for a long time.
> 
> I would like to figure out exactly where the problem is and I am looking for
> your guidance to get more information about it.

You could try using a network console.  Or have someone who is on-site 
take a picture of the computer screen when a crash occurs.

Alan Stern

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: kernel locks due to USB I/O
  2020-11-10 20:51 ` Alan Stern
@ 2020-11-10 23:42   ` Alberto Sentieri
  2020-11-11  7:51     ` Greg Kroah-Hartman
  2020-11-11 15:51     ` Alan Stern
  0 siblings, 2 replies; 17+ messages in thread
From: Alberto Sentieri @ 2020-11-10 23:42 UTC (permalink / raw)
  To: Alan Stern; +Cc: Greg Kroah-Hartman, linux-usb

1) The current Ubuntu Kernel is 5.4.0-53. Do you want me to upgrade it 
to 5.9, from kernel.org? Or is there a Ubuntu 5.9 package that I can 
use? It would be easy to do it If there is a Ubuntu package with 5.9, 
which I would install and, after the tests, uninstall.

2) Why do you believe that 5.9 would solve the problem? I am asking that 
because I cannot change the production machine for a test if I cannot go 
back to the original state. There is always a risk involved.

3) It is one single thread dealing with all 36 devices. Each device has 
its own co-routine (not preemptive), but all co-routines are executed by 
a unique thread.

4) By network console, do you mean ssh? It dies as well when it locks. 
The screen is the regular GNOME3 screen and nothing can be seen there. 
Every time it locks they send a picture, and I cannot see anything 
meaningful there. I am thinking about disabling GNOME3, but I need their 
blessing for that.

Thanks,

Alberto

On 11/10/20 3:51 PM, Alan Stern wrote:
> On Tue, Nov 10, 2020 at 02:20:50PM -0500, Alberto Sentieri wrote:
>> I’ve seen many kernel locks caused by a particular user-level application.
>> After the kernel locks, there is no report left in the machine, neither in
>> the logs. These locks have to do with USB input and output.
>>
>> The objective of this email is to get guidance about how to collect more
>> data related to the locks.
>>
>> Follows a description of the problem.
>>
>> I manage a few remote machines installed at a manufacturing facility, which
>> run Ubuntu 18.04. For months I had seen unexpected kernel locks, which I
>> could not explain. By locks I mean that the machine completely dies. The
>> graphical screen and keyboard freezes. I cannot ping or connect through ssh
>> during the locks. The only way of making the machine come back is through a
>> “pull the plug”. After rebooting I cannot find anything meaningful about the
>> lock in the logs. The machine is a good quality one with a 6-core Xeon, 32
>> GB ECC memory (and the application is using about 1GB). Exact the same
>> problem happens in two identical machines, one running kernel 5.0.0-37
>> generic and the other running kernel 5.3.0-62-generic.
> Can you update either machine to a 5.9 kernel?
>
>> A few days ago I was able to create a sequence of events that produce the
>> locks in a couple of minutes. These events have to do with USB 2.0 interrupt
>> I/O on USB devices connected at 12 Mbits/s and the frequency URBs are
>> submitted and reaped . It is necessary to have at least 36 devices connected
>> to reproduce the problem easily, which I cannot do from where I am. The
>> machines are in a country other than the one I live, and my physical access
>> to them is not possible due to COVID-19 restrictions.
>>
>> There is no special USB drivers installed. However, there is a NVIDIA
>> manufacturer driver installed, which I installed using the Ubuntu regular
>> tools for non-free software. All USB I/O is done by a regular user opening
>> /dev/bus/usb/xxx/xxx (the device group is set to the user group by udev).
>>
>> Each set of 18 USB devices is connected to a 10-Amp.-power-supply powered
>> HUB. Each hub has its own USB 2.0 root, I mean, I installed multiple USB 2.0
>> PCI express expansion cards, and only one port of each expansion card is
>> used for each HUB.
>>
>> The protocol to talk to any of the 36 devices is pretty simple. It uses USB
>> interrupt frames. A 64-byte frame is sent to the device (request packet). I
>> use ioctl (USBDEVFS_SUBMITURB). The file descriptor is monitored by epoll
>> and when an answer comes back, the response packet (another 64-byte
>> interrupt packet) is recovered by ioctl (USBDEVFS_REAPURBNDELAY). Then a
>> 64-byte packet (confirmation packet) is sent through USBDEVFS_SUBMITURB.
>> This sequence happens once every few seconds and the delay between the three
>> packets is just a couple of milliseconds. All process of dealing with the 36
>> devices is in a unique thread, under the same epoll loop.
> This sentence is ambiguous.  Do you mean there is a single unique thread
> which talks to all 36 devices?  Or do you mean there is a separate
> unique thread for each device (so 36 threads)?
>
>> So if I synchronize all 36 devices, I mean, I try to talk to all them
>> basically at the same time, the kernel will lock in about 2 minutes or less.
>> By “at the same time” I mean to submit the URBs for the request packet
>> around the same time for all of them, and then sit there, waiting for the
>> proper epoll wake-up to deal with the state machine (response and
>> confirmation packets).
>>
>> However, if I lock a semaphore before sending the request packet for one
>> device, and only unlock after reaping the URB I used to send the
>> confirmation packet, it ran for ate least 72 hours without problems. So, one
>> device at a time (using basically the same software plus the semaphore) does
>> not cause the kernel lock.
>>
>> My point is that simple ioctl calls to USB devices should not break the
>> kernel. I need help to address the kernel issue. The problem is difficult to
>> reproduce at my office because it needs many devices connected to it, which
>> are available only in a place I do not have physical access to, due to
>> COVID-19 travel restrictions.
>>
>> My guess is that, for a regular user, this bug rarely manifests itself and
>> it may be there for a long time.
>>
>> I would like to figure out exactly where the problem is and I am looking for
>> your guidance to get more information about it.
> You could try using a network console.  Or have someone who is on-site
> take a picture of the computer screen when a crash occurs.
>
> Alan Stern

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: kernel locks due to USB I/O
  2020-11-10 23:42   ` Alberto Sentieri
@ 2020-11-11  7:51     ` Greg Kroah-Hartman
  2020-11-11 15:51     ` Alan Stern
  1 sibling, 0 replies; 17+ messages in thread
From: Greg Kroah-Hartman @ 2020-11-11  7:51 UTC (permalink / raw)
  To: Alberto Sentieri; +Cc: Alan Stern, linux-usb

On Tue, Nov 10, 2020 at 06:42:17PM -0500, Alberto Sentieri wrote:
> 1) The current Ubuntu Kernel is 5.4.0-53. Do you want me to upgrade it to
> 5.9, from kernel.org? Or is there a Ubuntu 5.9 package that I can use? It
> would be easy to do it If there is a Ubuntu package with 5.9, which I would
> install and, after the tests, uninstall.

There might be a Ubuntu package, we don't know.

The issue is that the Ubuntu kernel you are using is an old one, and we
(the community) have no idea what is in there.  We can help you if you
use a kernel.org release, otherwise you really should get support from
Ubuntu as you are relying on them for this type of thing (and probably
paying them for support as you are using it in a corporate setting and
relying on it, right?)

> 2) Why do you believe that 5.9 would solve the problem? I am asking that
> because I cannot change the production machine for a test if I cannot go
> back to the original state. There is always a risk involved.

You should always be able to choose a different kernel version at boot
time, no need to ever get rid of your older one in case something goes
wrong.

There has been lots of changes in the past year of kernel development,
as always, from 5.4 to 5.9, so something that we did then might have
resolved this.  Maybe not, but we don't know if you can't try as to try
to remember what we did a year ago is hard, as I am sure you can imagine
:)

> 3) It is one single thread dealing with all 36 devices. Each device has its
> own co-routine (not preemptive), but all co-routines are executed by a
> unique thread.

Are you sure that these threads are not just getting stalled somehow
when a response doesn't come back properly and are not timing out, which
your userspace co-routines should be doing if the bus stalls, right?

One thing to note, USB hubs are almost never tested with that many
devices at once, so you might have to upgrade to one that says it really
will work with that many devices, and provide enough power to them.  We
have seen more low-power issues on USB hubs and connections over the
years to know that this is a real issue, and one that the OS can't do
much about, but can affect it.

> 4) By network console, do you mean ssh? It dies as well when it locks. The
> screen is the regular GNOME3 screen and nothing can be seen there. Every
> time it locks they send a picture, and I cannot see anything meaningful
> there. I am thinking about disabling GNOME3, but I need their blessing for
> that.

No, netconsole is something else, see the kernel documentation for the
full details.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: kernel locks due to USB I/O
  2020-11-10 23:42   ` Alberto Sentieri
  2020-11-11  7:51     ` Greg Kroah-Hartman
@ 2020-11-11 15:51     ` Alan Stern
  2020-11-11 19:31       ` Alberto Sentieri
  2020-11-16 16:53       ` Alberto Sentieri
  1 sibling, 2 replies; 17+ messages in thread
From: Alan Stern @ 2020-11-11 15:51 UTC (permalink / raw)
  To: Alberto Sentieri; +Cc: Greg Kroah-Hartman, linux-usb

On Tue, Nov 10, 2020 at 06:42:17PM -0500, Alberto Sentieri wrote:
> 1) The current Ubuntu Kernel is 5.4.0-53. Do you want me to upgrade it to
> 5.9, from kernel.org? Or is there a Ubuntu 5.9 package that I can use? It
> would be easy to do it If there is a Ubuntu package with 5.9, which I would
> install and, after the tests, uninstall.

If you want to know what Ubuntu packages are available, you should ask 
on an Ubuntu mailing list instead of the linux-usb mailing list.

> 2) Why do you believe that 5.9 would solve the problem? I am asking that
> because I cannot change the production machine for a test if I cannot go
> back to the original state. There is always a risk involved.

We do not believe that 5.9 will solve the problem -- we have no reason 
to believe this -- but we could be wrong.  In any case it is always 
best to test with the most up-to-date software available, and 5.9 is the 
version closest to what we are working on now.

> 3) It is one single thread dealing with all 36 devices. Each device has its
> own co-routine (not preemptive), but all co-routines are executed by a
> unique thread.

If everything runs within a single thread, how can adding a semaphore 
or mutex make any difference?

> 4) By network console, do you mean ssh? It dies as well when it locks. The
> screen is the regular GNOME3 screen and nothing can be seen there. Every
> time it locks they send a picture, and I cannot see anything meaningful
> there. I am thinking about disabling GNOME3, but I need their blessing for
> that.

See https://www.kernel.org/doc/Documentation/networking/netconsole.txt
for instructions on netconsole.  And when you use it for testing, be 
sure to set the console log level to a high value.

Alan Stern

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: kernel locks due to USB I/O
  2020-11-11 15:51     ` Alan Stern
@ 2020-11-11 19:31       ` Alberto Sentieri
  2020-11-16 16:53       ` Alberto Sentieri
  1 sibling, 0 replies; 17+ messages in thread
From: Alberto Sentieri @ 2020-11-11 19:31 UTC (permalink / raw)
  To: Alan Stern; +Cc: Greg Kroah-Hartman, linux-usb

On 11/11/20 10:51 AM, Alan Stern wrote:
> On Tue, Nov 10, 2020 at 06:42:17PM -0500, Alberto Sentieri wrote:
>> 1) The current Ubuntu Kernel is 5.4.0-53. Do you want me to upgrade it to
>> 5.9, from kernel.org? Or is there a Ubuntu 5.9 package that I can use? It
>> would be easy to do it If there is a Ubuntu package with 5.9, which I would
>> install and, after the tests, uninstall.
> If you want to know what Ubuntu packages are available, you should ask
> on an Ubuntu mailing list instead of the linux-usb mailing list.
I am sorry to be asking about Ubuntu. For some reason I imagined I was 
exchanging emails with Ubuntu guys, but now I understand that you are 
from the kernel.org.
>
>> 2) Why do you believe that 5.9 would solve the problem? I am asking that
>> because I cannot change the production machine for a test if I cannot go
>> back to the original state. There is always a risk involved.
> We do not believe that 5.9 will solve the problem -- we have no reason
> to believe this -- but we could be wrong.  In any case it is always
> best to test with the most up-to-date software available, and 5.9 is the
> version closest to what we are working on now.
I will try kernel 5.9 soon...
>
>> 3) It is one single thread dealing with all 36 devices. Each device has its
>> own co-routine (not preemptive), but all co-routines are executed by a
>> unique thread.
> If everything runs within a single thread, how can adding a semaphore
> or mutex make any difference?
The semaphore will block a co-routine, not a thread. It is not the type 
of semaphore C programmers are used to. So, before the introduction of 
the semaphore, a sequence like that would happen:

request packet device 1 URB submit
request packet device 2 URB submit
...
request packet device 36 URB submit
wait on epoll after submitting 36 URBs, one for each device.
reap URBs, receive response packets, send confirmation packets 
(basically run the state machine, each device has its own state)

After the semaphore, a sequence like that would happen:

lock the semaphore
send request packet device 1 (URB submit)
wait on epoll
reap URB with device 1 response packet
submit URB with device 1 confirmation
wait on epoll
reap URB submitted on last step
unlock the semaphore
Now go to the next device, which was waiting on the co-routine semaphore.

The main difference is that I would not submit 36 URBs to 36 different 
devices at the same time. The submission of 36 URBs would make the 
devices start responding as soon as they get ready and receive a pool.


>
>> 4) By network console, do you mean ssh? It dies as well when it locks. The
>> screen is the regular GNOME3 screen and nothing can be seen there. Every
>> time it locks they send a picture, and I cannot see anything meaningful
>> there. I am thinking about disabling GNOME3, but I need their blessing for
>> that.
> See https://www.kernel.org/doc/Documentation/networking/netconsole.txt
> for instructions on netconsole.  And when you use it for testing, be
> sure to set the console log level to a high value.
>
> Alan Stern


I will try kernel 5.9. However, it seems that the 5.3 kernel gets lost 
when too many submits / reaps start happening very close to each other.

Thanks,

Alberto Sentieri


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: kernel locks due to USB I/O
  2020-11-11 15:51     ` Alan Stern
  2020-11-11 19:31       ` Alberto Sentieri
@ 2020-11-16 16:53       ` Alberto Sentieri
  2020-11-16 17:06         ` Alan Stern
  1 sibling, 1 reply; 17+ messages in thread
From: Alberto Sentieri @ 2020-11-16 16:53 UTC (permalink / raw)
  To: Alan Stern; +Cc: Greg Kroah-Hartman, linux-usb

The objective of this email is to report the current status of my findings.

I loaded netconsole on both machines I was having problems with. I tried 
3 times on the machine with kernel 5.0.0-37 and twice with on the 
machine with kernel 5.3.0-62. Each attempt consisted of running the 
program which lock the kernel until it locked (about 3 minutes after 
stating the program). The referred program had the "semphore code" 
commented out. Nothing was sent to netconsole on all the 5 attempts I 
made when the kernel locked.

Just to be clear about my use of netconsole, before loading the 
netconsole kernel module, I ran "dmesg -n 8". When netconsole module was 
loaded I could clearly see about 9 message lines on the computer 
receiving the netconsole messages telling me that netconsole was loaded 
(and how it was configured), so no doubts about the correct netconsole 
setup. The "netconsole server" was a machine on the same local network.

My next attempt will be to compile kernel 5.9, as you suggest, and try it.

Thanks,

Alberto Sentieri

On 11/11/20 10:51 AM, Alan Stern wrote:
> On Tue, Nov 10, 2020 at 06:42:17PM -0500, Alberto Sentieri wrote:
>> 1) The current Ubuntu Kernel is 5.4.0-53. Do you want me to upgrade it to
>> 5.9, from kernel.org? Or is there a Ubuntu 5.9 package that I can use? It
>> would be easy to do it If there is a Ubuntu package with 5.9, which I would
>> install and, after the tests, uninstall.
> If you want to know what Ubuntu packages are available, you should ask
> on an Ubuntu mailing list instead of the linux-usb mailing list.
>
>> 2) Why do you believe that 5.9 would solve the problem? I am asking that
>> because I cannot change the production machine for a test if I cannot go
>> back to the original state. There is always a risk involved.
> We do not believe that 5.9 will solve the problem -- we have no reason
> to believe this -- but we could be wrong.  In any case it is always
> best to test with the most up-to-date software available, and 5.9 is the
> version closest to what we are working on now.
>
>> 3) It is one single thread dealing with all 36 devices. Each device has its
>> own co-routine (not preemptive), but all co-routines are executed by a
>> unique thread.
> If everything runs within a single thread, how can adding a semaphore
> or mutex make any difference?
>
>> 4) By network console, do you mean ssh? It dies as well when it locks. The
>> screen is the regular GNOME3 screen and nothing can be seen there. Every
>> time it locks they send a picture, and I cannot see anything meaningful
>> there. I am thinking about disabling GNOME3, but I need their blessing for
>> that.
> See https://www.kernel.org/doc/Documentation/networking/netconsole.txt
> for instructions on netconsole.  And when you use it for testing, be
> sure to set the console log level to a high value.
>
> Alan Stern

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: kernel locks due to USB I/O
  2020-11-16 16:53       ` Alberto Sentieri
@ 2020-11-16 17:06         ` Alan Stern
  2020-11-16 18:42           ` Alberto Sentieri
  0 siblings, 1 reply; 17+ messages in thread
From: Alan Stern @ 2020-11-16 17:06 UTC (permalink / raw)
  To: Alberto Sentieri; +Cc: Greg Kroah-Hartman, linux-usb

On Mon, Nov 16, 2020 at 11:53:38AM -0500, Alberto Sentieri wrote:
> The objective of this email is to report the current status of my findings.
> 
> I loaded netconsole on both machines I was having problems with. I tried 3
> times on the machine with kernel 5.0.0-37 and twice with on the machine with
> kernel 5.3.0-62. Each attempt consisted of running the program which lock
> the kernel until it locked (about 3 minutes after stating the program). The
> referred program had the "semphore code" commented out. Nothing was sent to
> netconsole on all the 5 attempts I made when the kernel locked.
> 
> Just to be clear about my use of netconsole, before loading the netconsole
> kernel module, I ran "dmesg -n 8". When netconsole module was loaded I could
> clearly see about 9 message lines on the computer receiving the netconsole
> messages telling me that netconsole was loaded (and how it was configured),
> so no doubts about the correct netconsole setup. The "netconsole server" was
> a machine on the same local network.
> 
> My next attempt will be to compile kernel 5.9, as you suggest, and try it.

While this may not generate any useful information, one way to create a 
bunch of log output while running your test is to set the usbfs_snoop 
module parameter for usbcore to true.

You can also enable dynamic debugging for usbcore, although in a stable 
environment like yours it probably won't produce much output.

Alan Stern

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: kernel locks due to USB I/O
  2020-11-16 17:06         ` Alan Stern
@ 2020-11-16 18:42           ` Alberto Sentieri
  2020-11-19 17:22             ` Alan Stern
  0 siblings, 1 reply; 17+ messages in thread
From: Alberto Sentieri @ 2020-11-16 18:42 UTC (permalink / raw)
  To: Alan Stern; +Cc: Greg Kroah-Hartman, linux-usb

Alan,

I am not sure if that would give you any clue, but I enabled usbfs_snoop 
in the kernel and I got URB error -108 (endpoint shutdown) just before 
the kernel lock.

Just after starting the program, I got something like that:

[  174.325977] usb 1-1.1.1: opened by process 2961: ate
[  174.326006] usb 1-1.1.1: usbdev_do_ioctl: GETDRIVER
[  174.326015] usb 1-1.1.1: usbdev_do_ioctl: IOCTL
[  174.326334] usb 1-1.1.1: usbdev_do_ioctl: CLAIMINTERFACE
[  174.326371] usb 1-1.1.1: usbdev_do_ioctl: SETINTERFACE
[  174.326582] usb 1-1.1.1: usbdev_do_ioctl: SUBMITURB
[  174.326592] usb 1-1.1.1: userurb 0000000000000000, ep1 int-in, length 64
[  174.326617] usb 1-1.1.1: usbdev_do_ioctl: SUBMITURB
[  174.326623] usb 1-1.1.1: userurb 0000000000000000, ep1 int-in, length 64
[  174.326629] usb 1-1.1.1: usbdev_do_ioctl: SUBMITURB
[  174.326635] usb 1-1.1.1: userurb 0000000000000000, ep1 int-in, length 64
[  174.326640] usb 1-1.1.1: usbdev_do_ioctl: SUBMITURB
[  174.326645] usb 1-1.1.1: userurb 0000000000000000, ep1 int-in, length 64
[  174.326652] usb 1-1.1.1: usbdev_do_ioctl: SUBMITURB
[  174.326657] usb 1-1.1.1: userurb 0000000000000000, ep1 int-in, length 64
[  174.326663] usb 1-1.1.1: usbdev_do_ioctl: SUBMITURB
[  174.326668] usb 1-1.1.1: userurb 0000000000000000, ep1 int-in, length 64
[  174.326674] usb 1-1.1.1: usbdev_do_ioctl: SUBMITURB
[  174.326679] usb 1-1.1.1: userurb 0000000000000000, ep1 int-in, length 64
[  174.326685] usb 1-1.1.1: usbdev_do_ioctl: SUBMITURB
[  174.326691] usb 1-1.1.1: userurb 0000000000000000, ep1 int-in, length 64
[  174.326696] usb 1-1.1.1: usbdev_do_ioctl: SUBMITURB
[  174.326702] usb 1-1.1.1: userurb 0000000000000000, ep1 int-in, length 64
[  174.326708] usb 1-1.1.1: usbdev_do_ioctl: SUBMITURB
[  174.326713] usb 1-1.1.1: userurb 0000000000000000, ep1 int-in, length 64
[  174.326718] usb 1-1.1.1: usbdev_do_ioctl: SUBMITURB
[  174.326723] usb 1-1.1.1: userurb 0000000000000000, ep1 int-in, length 64
[  174.326729] usb 1-1.1.1: usbdev_do_ioctl: SUBMITURB
[  174.326734] usb 1-1.1.1: userurb 0000000000000000, ep1 int-in, length 64
[  174.326740] usb 1-1.1.1: usbdev_do_ioctl: SUBMITURB
[  174.326745] usb 1-1.1.1: userurb 0000000000000000, ep1 int-in, length 64
[  174.326751] usb 1-1.1.1: usbdev_do_ioctl: SUBMITURB
[  174.326756] usb 1-1.1.1: userurb 0000000000000000, ep1 int-in, length 64
[  174.326762] usb 1-1.1.1: usbdev_do_ioctl: SUBMITURB
[  174.326767] usb 1-1.1.1: userurb 0000000000000000, ep1 int-in, length 64
[  174.326773] usb 1-1.1.1: usbdev_do_ioctl: SUBMITURB
[  174.326783] usb 1-1.1.1: userurb 0000000000000000, ep1 int-in, length 64
[  174.327565] usb 1-1.3.1: opened by process 2961: ate
[  174.327584] usb 1-1.3.1: usbdev_do_ioctl: GETDRIVER
[  174.327593] usb 1-1.3.1: usbdev_do_ioctl: IOCTL
[  174.327855] usb 1-1.3.1: usbdev_do_ioctl: CLAIMINTERFACE
[  174.327881] usb 1-1.3.1: usbdev_do_ioctl: SETINTERFACE
[  174.328180] usb 1-1.3.1: usbdev_do_ioctl: SUBMITURB
[  174.328188] usb 1-1.3.1: userurb 0000000000000000, ep1 int-in, length 64
[  174.328205] usb 1-1.3.1: usbdev_do_ioctl: SUBMITURB


During the program execution, I got many lines like that:

[  178.537958] usb 1-1.4.1: urb complete
[  178.537975] usb 1-1.4.1: userurb         pK-error, ep1 int-out, 
actual_length 64 status 0
[  178.538000] usb 1-1.2.1: urb complete
[  178.538006] usb 1-1.2.1: userurb         pK-error, ep1 int-out, 
actual_length 64 status 0

I am not sure if pK-error has a special meaning there.

However, just before the lock, I got this sequence:

[  387.326795] usb 1-1.5.3.1: usbdev_do_ioctl: SUBMITURB
[  387.326801] usb 1-1.5.3.1: userurb 0000000000000000, ep1 int-in, 
length 64
[  387.327531] usb 1-1.5.3.1: urb complete
[  387.327539] usb 1-1.5.3.1: userurb         pK-error, ep1 int-out, 
actual_length 64 status 0
[  387.327582] usb 1-1.5.3.1: usbdev_do_ioctl: REAPURBNDELAY
[  387.327590] usb 1-1.5.3.1: reap 0000000000000000
[  387.327598] usb 1-1.5.3.1: usbdev_do_ioctl: REAPURBNDELAY
[  388.109400] usb 3-1.5.7.1: usbdev_do_ioctl: REAPURBNDELAY
[  388.110032] usb 3-1.6.7.1: usbdev_do_ioctl: REAPURBNDELAY
[  388.110169] usb 3-1.5.5.1: usbdev_do_ioctl: REAPURBNDELAY
[  388.110605] usb 3-1.6.5.1: usbdev_do_ioctl: REAPURBNDELAY
[  388.111394] usb 3-1.1.1: usbdev_do_ioctl: REAPURBNDELAY
[  388.403278] usb 3-1.6.4.1: usbdev_do_ioctl: REAPURBNDELAY
[  388.404011] usb 3-1.5.1.1: usbdev_do_ioctl: REAPURBNDELAY
[  388.404344] usb 3-1.6.3.1: usbdev_do_ioctl: REAPURBNDELAY
[  388.404649] usb 3-1.4.1: usbdev_do_ioctl: REAPURBNDELAY
[  388.404790] usb 3-1.3.1: usbdev_do_ioctl: REAPURBNDELAY
[  388.405283] usb 3-1.5.3.1: usbdev_do_ioctl: REAPURBNDELAY
[  388.407828] usb 3-1.6.1.1: usbdev_do_ioctl: REAPURBNDELAY
[  388.408771] usb 3-1.5.4.1: usbdev_do_ioctl: REAPURBNDELAY
[  388.409155] usb 3-1.5.2.1: usbdev_do_ioctl: REAPURBNDELAY
[  388.409452] usb 3-1.5.6.1: usbdev_do_ioctl: REAPURBNDELAY
[  388.409743] usb 3-1.6.6.1: usbdev_do_ioctl: REAPURBNDELAY
[  388.409848] usb 3-1.2.1: usbdev_do_ioctl: REAPURBNDELAY
[  388.410840] usb 3-1.6.2.1: usbdev_do_ioctl: REAPURBNDELAY
[  389.612597] usb 3-1.5.7.1: opened by process 2961: ate
[  389.612622] usb 3-1.5.7.1: usbdev_do_ioctl: GETDRIVER
[  389.612631] usb 3-1.5.7.1: usbdev_do_ioctl: IOCTL
[  389.614965] usb 3-1.5.7.1: urb complete
[  389.614981] usb 3-1.5.7.1: userurb         pK-error, ep1 int-in, 
actual_length 0 status -108
[  389.614989] usb 3-1.5.7.1: urb complete
[  389.614994] usb 3-1.5.7.1: userurb         pK-error, ep1 int-in, 
actual_length 0 status -108
[  389.614999] usb 3-1.5.7.1: urb complete
[  389.615003] usb 3-1.5.7.1: userurb         pK-error, ep1 int-in, 
actual_length 0 status -108
[  389.615008] usb 3-1.5.7.1: urb complete
[  389.615013] usb 3-1.5.7.1: userurb         pK-error, ep1 int-in, 
actual_length 0 status -108
[  389.615018] usb 3-1.5.7.1: urb complete
[  389.615022] usb 3-1.5.7.1: userurb         pK-error, ep1 int-in, 
actual_length 0 status -108
[  389.615027] usb 3-1.5.7.1: urb complete
[  389.615032] usb 3-1.5.7.1: userurb         pK-error, ep1 int-in, 
actual_length 0 status -108
[  389.615037] usb 3-1.5.7.1: urb complete
[  389.615042] usb 3-1.5.7.1: userurb         pK-error, ep1 int-in, 
actual_length 0 status -108
[  389.615047] usb 3-1.5.7.1: urb complete
[  389.615051] usb 3-1.5.7.1: userurb         pK-error, ep1 int-in, 
actual_length 0 status -108
[  389.615056] usb 3-1.5.7.1: urb complete
[  389.615060] usb 3-1.5.7.1: userurb         pK-error, ep1 int-in, 
actual_length 0 status -108
[  389.615065] usb 3-1.5.7.1: urb complete
[  389.615070] usb 3-1.5.7.1: userurb         pK-error, ep1 int-in, 
actual_length 0 status -108
[  389.615074] usb 3-1.5.7.1: urb complete
[  389.615079] usb 3-1.5.7.1: userurb         pK-error, ep1 int-in, 
actual_length 0 status -108
[  389.615084] usb 3-1.5.7.1: urb complete
[  389.615089] usb 3-1.5.7.1: userurb         pK-error, ep1 int-in, 
actual_length 0 status -108
[  389.615094] usb 3-1.5.7.1: urb complete
[  389.615098] usb 3-1.5.7.1: userurb         pK-error, ep1 int-in, 
actual_length 0 status -108
[  389.615103] usb 3-1.5.7.1: urb complete
[  389.615108] usb 3-1.5.7.1: userurb         pK-error, ep1 int-in, 
actual_length 0 status -108
[  389.615112] usb 3-1.5.7.1: urb complete
[  389.615117] usb 3-1.5.7.1: userurb         pK-error, ep1 int-in, 
actual_length 0 status -108
[  389.615121] usb 3-1.5.7.1: urb complete
[  389.615126] usb 3-1.5.7.1: userurb         pK-error, ep1 int-in, 
actual_length 0 status -108
[  389.616823] usb 3-1.5.7.1: urb complete
[  389.616839] usb 3-1.5.7.1: userurb         pK-error, ep1 int-out, 
actual_length 0 status -108
[  389.616926] usb 3-1.5.7.1: usbdev_do_ioctl: CLAIMINTERFACE
[  389.616964] usb 3-1.5.7.1: usbdev_do_ioctl: SETINTERFACE

As you can see, it locked in 215 seconds, in this test. Do you see 
anything special in this sequence?


Thanks,


Alberto Sentieri


On 11/16/20 12:06 PM, Alan Stern wrote:
> On Mon, Nov 16, 2020 at 11:53:38AM -0500, Alberto Sentieri wrote:
>> The objective of this email is to report the current status of my findings.
>>
>> I loaded netconsole on both machines I was having problems with. I tried 3
>> times on the machine with kernel 5.0.0-37 and twice with on the machine with
>> kernel 5.3.0-62. Each attempt consisted of running the program which lock
>> the kernel until it locked (about 3 minutes after stating the program). The
>> referred program had the "semphore code" commented out. Nothing was sent to
>> netconsole on all the 5 attempts I made when the kernel locked.
>>
>> Just to be clear about my use of netconsole, before loading the netconsole
>> kernel module, I ran "dmesg -n 8". When netconsole module was loaded I could
>> clearly see about 9 message lines on the computer receiving the netconsole
>> messages telling me that netconsole was loaded (and how it was configured),
>> so no doubts about the correct netconsole setup. The "netconsole server" was
>> a machine on the same local network.
>>
>> My next attempt will be to compile kernel 5.9, as you suggest, and try it.
> While this may not generate any useful information, one way to create a
> bunch of log output while running your test is to set the usbfs_snoop
> module parameter for usbcore to true.
>
> You can also enable dynamic debugging for usbcore, although in a stable
> environment like yours it probably won't produce much output.
>
> Alan Stern

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: kernel locks due to USB I/O
  2020-11-16 18:42           ` Alberto Sentieri
@ 2020-11-19 17:22             ` Alan Stern
  2020-11-19 18:50               ` Alberto Sentieri
  2020-11-19 19:21               ` Alberto Sentieri
  0 siblings, 2 replies; 17+ messages in thread
From: Alan Stern @ 2020-11-19 17:22 UTC (permalink / raw)
  To: Alberto Sentieri; +Cc: Greg Kroah-Hartman, linux-usb

On Mon, Nov 16, 2020 at 01:42:46PM -0500, Alberto Sentieri wrote:
> Alan,
> 
> I am not sure if that would give you any clue, but I enabled usbfs_snoop in
> the kernel and I got URB error -108 (endpoint shutdown) just before the
> kernel lock.
> 
> Just after starting the program, I got something like that:
> 
> [  174.325977] usb 1-1.1.1: opened by process 2961: ate
> [  174.326006] usb 1-1.1.1: usbdev_do_ioctl: GETDRIVER
> [  174.326015] usb 1-1.1.1: usbdev_do_ioctl: IOCTL
> [  174.326334] usb 1-1.1.1: usbdev_do_ioctl: CLAIMINTERFACE
> [  174.326371] usb 1-1.1.1: usbdev_do_ioctl: SETINTERFACE
> [  174.326582] usb 1-1.1.1: usbdev_do_ioctl: SUBMITURB
> [  174.326592] usb 1-1.1.1: userurb 0000000000000000, ep1 int-in, length 64
> [  174.326617] usb 1-1.1.1: usbdev_do_ioctl: SUBMITURB
> [  174.326623] usb 1-1.1.1: userurb 0000000000000000, ep1 int-in, length 64
> [  174.326629] usb 1-1.1.1: usbdev_do_ioctl: SUBMITURB
> [  174.326635] usb 1-1.1.1: userurb 0000000000000000, ep1 int-in, length 64
> [  174.326640] usb 1-1.1.1: usbdev_do_ioctl: SUBMITURB
> [  174.326645] usb 1-1.1.1: userurb 0000000000000000, ep1 int-in, length 64
> [  174.326652] usb 1-1.1.1: usbdev_do_ioctl: SUBMITURB
> [  174.326657] usb 1-1.1.1: userurb 0000000000000000, ep1 int-in, length 64
> [  174.326663] usb 1-1.1.1: usbdev_do_ioctl: SUBMITURB
> [  174.326668] usb 1-1.1.1: userurb 0000000000000000, ep1 int-in, length 64
> [  174.326674] usb 1-1.1.1: usbdev_do_ioctl: SUBMITURB
> [  174.326679] usb 1-1.1.1: userurb 0000000000000000, ep1 int-in, length 64
> [  174.326685] usb 1-1.1.1: usbdev_do_ioctl: SUBMITURB
> [  174.326691] usb 1-1.1.1: userurb 0000000000000000, ep1 int-in, length 64
> [  174.326696] usb 1-1.1.1: usbdev_do_ioctl: SUBMITURB
> [  174.326702] usb 1-1.1.1: userurb 0000000000000000, ep1 int-in, length 64
> [  174.326708] usb 1-1.1.1: usbdev_do_ioctl: SUBMITURB
> [  174.326713] usb 1-1.1.1: userurb 0000000000000000, ep1 int-in, length 64
> [  174.326718] usb 1-1.1.1: usbdev_do_ioctl: SUBMITURB
> [  174.326723] usb 1-1.1.1: userurb 0000000000000000, ep1 int-in, length 64
> [  174.326729] usb 1-1.1.1: usbdev_do_ioctl: SUBMITURB
> [  174.326734] usb 1-1.1.1: userurb 0000000000000000, ep1 int-in, length 64
> [  174.326740] usb 1-1.1.1: usbdev_do_ioctl: SUBMITURB
> [  174.326745] usb 1-1.1.1: userurb 0000000000000000, ep1 int-in, length 64
> [  174.326751] usb 1-1.1.1: usbdev_do_ioctl: SUBMITURB
> [  174.326756] usb 1-1.1.1: userurb 0000000000000000, ep1 int-in, length 64
> [  174.326762] usb 1-1.1.1: usbdev_do_ioctl: SUBMITURB
> [  174.326767] usb 1-1.1.1: userurb 0000000000000000, ep1 int-in, length 64
> [  174.326773] usb 1-1.1.1: usbdev_do_ioctl: SUBMITURB
> [  174.326783] usb 1-1.1.1: userurb 0000000000000000, ep1 int-in, length 64
> [  174.327565] usb 1-1.3.1: opened by process 2961: ate
> [  174.327584] usb 1-1.3.1: usbdev_do_ioctl: GETDRIVER
> [  174.327593] usb 1-1.3.1: usbdev_do_ioctl: IOCTL
> [  174.327855] usb 1-1.3.1: usbdev_do_ioctl: CLAIMINTERFACE
> [  174.327881] usb 1-1.3.1: usbdev_do_ioctl: SETINTERFACE
> [  174.328180] usb 1-1.3.1: usbdev_do_ioctl: SUBMITURB
> [  174.328188] usb 1-1.3.1: userurb 0000000000000000, ep1 int-in, length 64
> [  174.328205] usb 1-1.3.1: usbdev_do_ioctl: SUBMITURB

Can you post the output from "lsusb -t" for your test system?  The
device numbers aren't what I expected to see.  Not wrong, just
unexpected.

Apart from that, this looks like normal initialization.  Why does the
program submit 16 URBs all for the same endpoint without waiting?
Wouldn't two or three be enough?

> During the program execution, I got many lines like that:
> 
> [  178.537958] usb 1-1.4.1: urb complete
> [  178.537975] usb 1-1.4.1: userurb         pK-error, ep1 int-out,
> actual_length 64 status 0
> [  178.538000] usb 1-1.2.1: urb complete
> [  178.538006] usb 1-1.2.1: userurb         pK-error, ep1 int-out,
> actual_length 64 status 0
> 
> I am not sure if pK-error has a special meaning there.

It's an error in the debugging output.  I have just submitted a patch
to fix the error (see
<https://marc.info/?l=linux-usb&m=160580539818074&w=2>).  In any case,
it should be irrelevant for your problem.

> However, just before the lock, I got this sequence:
> 
> [  387.326795] usb 1-1.5.3.1: usbdev_do_ioctl: SUBMITURB
> [  387.326801] usb 1-1.5.3.1: userurb 0000000000000000, ep1 int-in, length
> 64
> [  387.327531] usb 1-1.5.3.1: urb complete
> [  387.327539] usb 1-1.5.3.1: userurb         pK-error, ep1 int-out,
> actual_length 64 status 0
> [  387.327582] usb 1-1.5.3.1: usbdev_do_ioctl: REAPURBNDELAY
> [  387.327590] usb 1-1.5.3.1: reap 0000000000000000
> [  387.327598] usb 1-1.5.3.1: usbdev_do_ioctl: REAPURBNDELAY
> [  388.109400] usb 3-1.5.7.1: usbdev_do_ioctl: REAPURBNDELAY
> [  388.110032] usb 3-1.6.7.1: usbdev_do_ioctl: REAPURBNDELAY
> [  388.110169] usb 3-1.5.5.1: usbdev_do_ioctl: REAPURBNDELAY
> [  388.110605] usb 3-1.6.5.1: usbdev_do_ioctl: REAPURBNDELAY
> [  388.111394] usb 3-1.1.1: usbdev_do_ioctl: REAPURBNDELAY
> [  388.403278] usb 3-1.6.4.1: usbdev_do_ioctl: REAPURBNDELAY
> [  388.404011] usb 3-1.5.1.1: usbdev_do_ioctl: REAPURBNDELAY
> [  388.404344] usb 3-1.6.3.1: usbdev_do_ioctl: REAPURBNDELAY
> [  388.404649] usb 3-1.4.1: usbdev_do_ioctl: REAPURBNDELAY
> [  388.404790] usb 3-1.3.1: usbdev_do_ioctl: REAPURBNDELAY
> [  388.405283] usb 3-1.5.3.1: usbdev_do_ioctl: REAPURBNDELAY
> [  388.407828] usb 3-1.6.1.1: usbdev_do_ioctl: REAPURBNDELAY
> [  388.408771] usb 3-1.5.4.1: usbdev_do_ioctl: REAPURBNDELAY
> [  388.409155] usb 3-1.5.2.1: usbdev_do_ioctl: REAPURBNDELAY
> [  388.409452] usb 3-1.5.6.1: usbdev_do_ioctl: REAPURBNDELAY
> [  388.409743] usb 3-1.6.6.1: usbdev_do_ioctl: REAPURBNDELAY
> [  388.409848] usb 3-1.2.1: usbdev_do_ioctl: REAPURBNDELAY
> [  388.410840] usb 3-1.6.2.1: usbdev_do_ioctl: REAPURBNDELAY

This rapid sequence of reap requests with no completions or actual
reaps is odd.  Is this the sort of thing you expect your program to
do?  Maybe you can add some debugging output to the program.

Also, can you turn on dynamic debugging for usbcore before starting
the test?

	echo 'module usbcore =p' >/sys/kernel/debug/dynamic_debug/control

> [  389.612597] usb 3-1.5.7.1: opened by process 2961: ate
> [  389.612622] usb 3-1.5.7.1: usbdev_do_ioctl: GETDRIVER
> [  389.612631] usb 3-1.5.7.1: usbdev_do_ioctl: IOCTL
> [  389.614965] usb 3-1.5.7.1: urb complete
> [  389.614981] usb 3-1.5.7.1: userurb         pK-error, ep1 int-in,
> actual_length 0 status -108
> [  389.614989] usb 3-1.5.7.1: urb complete
> [  389.614994] usb 3-1.5.7.1: userurb         pK-error, ep1 int-in,
> actual_length 0 status -108

The -108 erroro codes are ESHUTDOWN.  They indicate that the device
was disconnected or something like that, although I don't see any
other indication of this in the log.

> [  389.616926] usb 3-1.5.7.1: usbdev_do_ioctl: CLAIMINTERFACE
> [  389.616964] usb 3-1.5.7.1: usbdev_do_ioctl: SETINTERFACE

And if the device had been disconnected, these calls would have failed.

> As you can see, it locked in 215 seconds, in this test. Do you see anything
> special in this sequence?

No, but several things are puzzling.  But there's nothing that
indicates any reason for the system to freeze.

Alan Stern

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: kernel locks due to USB I/O
  2020-11-19 17:22             ` Alan Stern
@ 2020-11-19 18:50               ` Alberto Sentieri
  2020-11-19 20:01                 ` Alan Stern
  2020-11-19 19:21               ` Alberto Sentieri
  1 sibling, 1 reply; 17+ messages in thread
From: Alberto Sentieri @ 2020-11-19 18:50 UTC (permalink / raw)
  To: Alan Stern; +Cc: Greg Kroah-Hartman, linux-usb

Alan,

I have to get the proper blessing to do more tests, and I will work on them.

Meanwhile, my comments are below.

Thanks for your attention,


Alberto

On 11/19/20 12:22 PM, Alan Stern wrote:
> On Mon, Nov 16, 2020 at 01:42:46PM -0500, Alberto Sentieri wrote:
>> Alan,
>>
>> I am not sure if that would give you any clue, but I enabled usbfs_snoop in
>> the kernel and I got URB error -108 (endpoint shutdown) just before the
>> kernel lock.
>>
>> Just after starting the program, I got something like that:
>>
>> [  174.325977] usb 1-1.1.1: opened by process 2961: ate
>> [  174.326006] usb 1-1.1.1: usbdev_do_ioctl: GETDRIVER
>> [  174.326015] usb 1-1.1.1: usbdev_do_ioctl: IOCTL
>> [  174.326334] usb 1-1.1.1: usbdev_do_ioctl: CLAIMINTERFACE
>> [  174.326371] usb 1-1.1.1: usbdev_do_ioctl: SETINTERFACE
>> [  174.326582] usb 1-1.1.1: usbdev_do_ioctl: SUBMITURB
>> [  174.326592] usb 1-1.1.1: userurb 0000000000000000, ep1 int-in, length 64
>> [  174.326617] usb 1-1.1.1: usbdev_do_ioctl: SUBMITURB
>> [  174.326623] usb 1-1.1.1: userurb 0000000000000000, ep1 int-in, length 64
>> [  174.326629] usb 1-1.1.1: usbdev_do_ioctl: SUBMITURB
>> [  174.326635] usb 1-1.1.1: userurb 0000000000000000, ep1 int-in, length 64
>> [  174.326640] usb 1-1.1.1: usbdev_do_ioctl: SUBMITURB
>> [  174.326645] usb 1-1.1.1: userurb 0000000000000000, ep1 int-in, length 64
>> [  174.326652] usb 1-1.1.1: usbdev_do_ioctl: SUBMITURB
>> [  174.326657] usb 1-1.1.1: userurb 0000000000000000, ep1 int-in, length 64
>> [  174.326663] usb 1-1.1.1: usbdev_do_ioctl: SUBMITURB
>> [  174.326668] usb 1-1.1.1: userurb 0000000000000000, ep1 int-in, length 64
>> [  174.326674] usb 1-1.1.1: usbdev_do_ioctl: SUBMITURB
>> [  174.326679] usb 1-1.1.1: userurb 0000000000000000, ep1 int-in, length 64
>> [  174.326685] usb 1-1.1.1: usbdev_do_ioctl: SUBMITURB
>> [  174.326691] usb 1-1.1.1: userurb 0000000000000000, ep1 int-in, length 64
>> [  174.326696] usb 1-1.1.1: usbdev_do_ioctl: SUBMITURB
>> [  174.326702] usb 1-1.1.1: userurb 0000000000000000, ep1 int-in, length 64
>> [  174.326708] usb 1-1.1.1: usbdev_do_ioctl: SUBMITURB
>> [  174.326713] usb 1-1.1.1: userurb 0000000000000000, ep1 int-in, length 64
>> [  174.326718] usb 1-1.1.1: usbdev_do_ioctl: SUBMITURB
>> [  174.326723] usb 1-1.1.1: userurb 0000000000000000, ep1 int-in, length 64
>> [  174.326729] usb 1-1.1.1: usbdev_do_ioctl: SUBMITURB
>> [  174.326734] usb 1-1.1.1: userurb 0000000000000000, ep1 int-in, length 64
>> [  174.326740] usb 1-1.1.1: usbdev_do_ioctl: SUBMITURB
>> [  174.326745] usb 1-1.1.1: userurb 0000000000000000, ep1 int-in, length 64
>> [  174.326751] usb 1-1.1.1: usbdev_do_ioctl: SUBMITURB
>> [  174.326756] usb 1-1.1.1: userurb 0000000000000000, ep1 int-in, length 64
>> [  174.326762] usb 1-1.1.1: usbdev_do_ioctl: SUBMITURB
>> [  174.326767] usb 1-1.1.1: userurb 0000000000000000, ep1 int-in, length 64
>> [  174.326773] usb 1-1.1.1: usbdev_do_ioctl: SUBMITURB
>> [  174.326783] usb 1-1.1.1: userurb 0000000000000000, ep1 int-in, length 64
>> [  174.327565] usb 1-1.3.1: opened by process 2961: ate
>> [  174.327584] usb 1-1.3.1: usbdev_do_ioctl: GETDRIVER
>> [  174.327593] usb 1-1.3.1: usbdev_do_ioctl: IOCTL
>> [  174.327855] usb 1-1.3.1: usbdev_do_ioctl: CLAIMINTERFACE
>> [  174.327881] usb 1-1.3.1: usbdev_do_ioctl: SETINTERFACE
>> [  174.328180] usb 1-1.3.1: usbdev_do_ioctl: SUBMITURB
>> [  174.328188] usb 1-1.3.1: userurb 0000000000000000, ep1 int-in, length 64
>> [  174.328205] usb 1-1.3.1: usbdev_do_ioctl: SUBMITURB
> Can you post the output from "lsusb -t" for your test system?  The
> device numbers aren't what I expected to see.  Not wrong, just
> unexpected.
I will run lsusb -t when I get access to the system. But the numbers are 
correct. Each device has a 2-port USB hub which is connected to the USB 
bus, just before the micro-controller. That is the reason for all of 
them ending in .1. The micro-controller is connected to port 1 of the 
2-port hub.
>
> Apart from that, this looks like normal initialization.  Why does the
> program submit 16 URBs all for the same endpoint without waiting?
> Wouldn't two or three be enough?
I am using simple commands (to the device) and the responses fit in a 
64-byte HID packet. But there are more complex commands, which may 
return many packets. All submitted input URBs are there because, in some 
cases, the worker main loop can take many milliseconds to start reaping 
the URBs after they arrive.
>> During the program execution, I got many lines like that:
>>
>> [  178.537958] usb 1-1.4.1: urb complete
>> [  178.537975] usb 1-1.4.1: userurb         pK-error, ep1 int-out,
>> actual_length 64 status 0
>> [  178.538000] usb 1-1.2.1: urb complete
>> [  178.538006] usb 1-1.2.1: userurb         pK-error, ep1 int-out,
>> actual_length 64 status 0
>>
>> I am not sure if pK-error has a special meaning there.
> It's an error in the debugging output.  I have just submitted a patch
> to fix the error (see
> <https://marc.info/?l=linux-usb&m=160580539818074&w=2>).  In any case,
> it should be irrelevant for your problem.
Thanks.
>> However, just before the lock, I got this sequence:
>>
>> [  387.326795] usb 1-1.5.3.1: usbdev_do_ioctl: SUBMITURB
>> [  387.326801] usb 1-1.5.3.1: userurb 0000000000000000, ep1 int-in, length
>> 64
>> [  387.327531] usb 1-1.5.3.1: urb complete
>> [  387.327539] usb 1-1.5.3.1: userurb         pK-error, ep1 int-out,
>> actual_length 64 status 0
>> [  387.327582] usb 1-1.5.3.1: usbdev_do_ioctl: REAPURBNDELAY
>> [  387.327590] usb 1-1.5.3.1: reap 0000000000000000
>> [  387.327598] usb 1-1.5.3.1: usbdev_do_ioctl: REAPURBNDELAY
>> [  388.109400] usb 3-1.5.7.1: usbdev_do_ioctl: REAPURBNDELAY
>> [  388.110032] usb 3-1.6.7.1: usbdev_do_ioctl: REAPURBNDELAY
>> [  388.110169] usb 3-1.5.5.1: usbdev_do_ioctl: REAPURBNDELAY
>> [  388.110605] usb 3-1.6.5.1: usbdev_do_ioctl: REAPURBNDELAY
>> [  388.111394] usb 3-1.1.1: usbdev_do_ioctl: REAPURBNDELAY
>> [  388.403278] usb 3-1.6.4.1: usbdev_do_ioctl: REAPURBNDELAY
>> [  388.404011] usb 3-1.5.1.1: usbdev_do_ioctl: REAPURBNDELAY
>> [  388.404344] usb 3-1.6.3.1: usbdev_do_ioctl: REAPURBNDELAY
>> [  388.404649] usb 3-1.4.1: usbdev_do_ioctl: REAPURBNDELAY
>> [  388.404790] usb 3-1.3.1: usbdev_do_ioctl: REAPURBNDELAY
>> [  388.405283] usb 3-1.5.3.1: usbdev_do_ioctl: REAPURBNDELAY
>> [  388.407828] usb 3-1.6.1.1: usbdev_do_ioctl: REAPURBNDELAY
>> [  388.408771] usb 3-1.5.4.1: usbdev_do_ioctl: REAPURBNDELAY
>> [  388.409155] usb 3-1.5.2.1: usbdev_do_ioctl: REAPURBNDELAY
>> [  388.409452] usb 3-1.5.6.1: usbdev_do_ioctl: REAPURBNDELAY
>> [  388.409743] usb 3-1.6.6.1: usbdev_do_ioctl: REAPURBNDELAY
>> [  388.409848] usb 3-1.2.1: usbdev_do_ioctl: REAPURBNDELAY
>> [  388.410840] usb 3-1.6.2.1: usbdev_do_ioctl: REAPURBNDELAY
> This rapid sequence of reap requests with no completions or actual
> reaps is odd.  Is this the sort of thing you expect your program to
> do?  Maybe you can add some debugging output to the program.
What apparently happened before that (but unfortunately I did not 
capture it this time) is that the particular device return errno -71 in 
some URBs, without de-enumerating. However, I am only seeing userurbs 
with status 0 in my the netconsole capture, before the ones with status 
-108 (just before the lock). So, if error -71 happened, all URBs are 
reaped and the devices are then closed. Usually I see error -71 in the 
program log happening for a few devices at the same time. After closing 
the interface, the program opens it again. The return from open should 
be a valid file descriptor, otherwise it would fail. After opening the 
device successfully it starts submitting urbs, and error -108 is coming 
back, telling me that the device just opened is not open.

It just came to my attention now: maybe something to be investigated is 
what the kernel does if a interface is closed due to a URB error and 
immediately opened again.

> Also, can you turn on dynamic debugging for usbcore before starting
> the test?
>
> 	echo 'module usbcore =p' >/sys/kernel/debug/dynamic_debug/control
>
>> [  389.612597] usb 3-1.5.7.1: opened by process 2961: ate
>> [  389.612622] usb 3-1.5.7.1: usbdev_do_ioctl: GETDRIVER
>> [  389.612631] usb 3-1.5.7.1: usbdev_do_ioctl: IOCTL
>> [  389.614965] usb 3-1.5.7.1: urb complete
>> [  389.614981] usb 3-1.5.7.1: userurb         pK-error, ep1 int-in,
>> actual_length 0 status -108
>> [  389.614989] usb 3-1.5.7.1: urb complete
>> [  389.614994] usb 3-1.5.7.1: userurb         pK-error, ep1 int-in,
>> actual_length 0 status -108
> The -108 erroro codes are ESHUTDOWN.  They indicate that the device
> was disconnected or something like that, although I don't see any
> other indication of this in the log.
>
>> [  389.616926] usb 3-1.5.7.1: usbdev_do_ioctl: CLAIMINTERFACE
>> [  389.616964] usb 3-1.5.7.1: usbdev_do_ioctl: SETINTERFACE
> And if the device had been disconnected, these calls would have failed.
>
>> As you can see, it locked in 215 seconds, in this test. Do you see anything
>> special in this sequence?
> No, but several things are puzzling.  But there's nothing that
> indicates any reason for the system to freeze.
>
> Alan Stern

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: kernel locks due to USB I/O
  2020-11-19 17:22             ` Alan Stern
  2020-11-19 18:50               ` Alberto Sentieri
@ 2020-11-19 19:21               ` Alberto Sentieri
  2020-11-19 19:43                 ` Alan Stern
  1 sibling, 1 reply; 17+ messages in thread
From: Alberto Sentieri @ 2020-11-19 19:21 UTC (permalink / raw)
  To: Alan Stern; +Cc: Greg Kroah-Hartman, linux-usb

lsusb -t in a similar configuration I use for development (it has just 6 
device, and not 36):

$ lsusb -t
/:  Bus 16.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/2p, 5000M
/:  Bus 15.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/2p, 480M
/:  Bus 14.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/2p, 5000M
/:  Bus 13.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/2p, 480M
/:  Bus 12.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/2p, 5000M
/:  Bus 11.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/2p, 480M
/:  Bus 10.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/2p, 5000M
/:  Bus 09.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/2p, 480M
/:  Bus 08.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/10p, 5000M
     |__ Port 8: Dev 2, If 0, Class=Hub, Driver=hub/4p, 5000M
/:  Bus 07.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/16p, 480M
     |__ Port 6: Dev 2, If 0, Class=Hub, Driver=hub/4p, 480M
/:  Bus 06.Port 1: Dev 1, Class=root_hub, Driver=uhci_hcd/2p, 12M
/:  Bus 05.Port 1: Dev 1, Class=root_hub, Driver=uhci_hcd/2p, 12M
/:  Bus 04.Port 1: Dev 1, Class=root_hub, Driver=ohci-pci/2p, 12M
/:  Bus 03.Port 1: Dev 1, Class=root_hub, Driver=ohci-pci/3p, 12M
/:  Bus 02.Port 1: Dev 1, Class=root_hub, Driver=ehci-pci/4p, 480M
     |__ Port 3: Dev 5, If 0, Class=Hub, Driver=hub/7p, 480M
         |__ Port 1: Dev 6, If 0, Class=Hub, Driver=hub/2p, 480M
             |__ Port 1: Dev 8, If 0, Class=Human Interface Device, 
Driver=usbhid, 12M
         |__ Port 2: Dev 7, If 0, Class=Hub, Driver=hub/2p, 480M
             |__ Port 1: Dev 10, If 0, Class=Human Interface Device, 
Driver=usbhid, 12M
         |__ Port 4: Dev 9, If 0, Class=Hub, Driver=hub/2p, 480M
             |__ Port 1: Dev 12, If 0, Class=Human Interface Device, 
Driver=usbhid, 12M
         |__ Port 5: Dev 11, If 0, Class=Hub, Driver=hub/7p, 480M
         |__ Port 6: Dev 13, If 0, Class=Hub, Driver=hub/7p, 480M
             |__ Port 6: Dev 15, If 0, Class=Hub, Driver=hub/2p, 480M
                 |__ Port 1: Dev 17, If 0, Class=Human Interface Device, 
Driver=usbhid, 12M
             |__ Port 7: Dev 16, If 0, Class=Hub, Driver=hub/2p, 480M
                 |__ Port 1: Dev 18, If 0, Class=Human Interface Device, 
Driver=usbhid, 12M
         |__ Port 7: Dev 14, If 0, Class=Human Interface Device, 
Driver=usbhid, 12M

/:  Bus 01.Port 1: Dev 1, Class=root_hub, Driver=ehci-pci/5p, 480M

And lspci (I believe we are using the VIA Technologies, Inc controller 
at the site where we are having locks):

$ lspci
00:00.0 Host bridge: Intel Corporation Sky Lake-E DMI3 Registers (rev 04)
00:04.0 System peripheral: Intel Corporation Sky Lake-E CBDMA Registers 
(rev 04)
00:04.1 System peripheral: Intel Corporation Sky Lake-E CBDMA Registers 
(rev 04)
00:04.2 System peripheral: Intel Corporation Sky Lake-E CBDMA Registers 
(rev 04)
00:04.3 System peripheral: Intel Corporation Sky Lake-E CBDMA Registers 
(rev 04)
00:04.4 System peripheral: Intel Corporation Sky Lake-E CBDMA Registers 
(rev 04)
00:04.5 System peripheral: Intel Corporation Sky Lake-E CBDMA Registers 
(rev 04)
00:04.6 System peripheral: Intel Corporation Sky Lake-E CBDMA Registers 
(rev 04)
00:04.7 System peripheral: Intel Corporation Sky Lake-E CBDMA Registers 
(rev 04)
00:05.0 System peripheral: Intel Corporation Sky Lake-E MM/Vt-d 
Configuration Registers (rev 04)
00:05.2 System peripheral: Intel Corporation Sky Lake-E RAS (rev 04)
00:05.4 PIC: Intel Corporation Sky Lake-E IOAPIC (rev 04)
00:08.0 System peripheral: Intel Corporation Sky Lake-E Ubox Registers 
(rev 04)
00:08.1 Performance counters: Intel Corporation Sky Lake-E Ubox 
Registers (rev 04)
00:08.2 System peripheral: Intel Corporation Sky Lake-E Ubox Registers 
(rev 04)
00:14.0 USB controller: Intel Corporation 200 Series/Z370 Chipset Family 
USB 3.0 xHCI Controller
00:14.2 Signal processing controller: Intel Corporation 200 Series PCH 
Thermal Subsystem
00:17.0 RAID bus controller: Intel Corporation C600/X79 series chipset 
SATA RAID Controller
00:1b.0 PCI bridge: Intel Corporation 200 Series PCH PCI Express Root 
Port #21 (rev f0)
00:1c.0 PCI bridge: Intel Corporation 200 Series PCH PCI Express Root 
Port #3 (rev f0)
00:1c.4 PCI bridge: Intel Corporation 200 Series PCH PCI Express Root 
Port #5 (rev f0)
00:1d.0 PCI bridge: Intel Corporation 200 Series PCH PCI Express Root 
Port #9 (rev f0)
00:1f.0 ISA bridge: Intel Corporation C422 Chipset LPC/eSPI Controller
00:1f.2 Memory controller: Intel Corporation 200 Series/Z370 Chipset 
Family Power Management Controller
00:1f.3 Audio device: Intel Corporation 200 Series PCH HD Audio
00:1f.4 SMBus: Intel Corporation 200 Series/Z370 Chipset Family SMBus 
Controller
00:1f.5 Non-VGA unclassified device: Intel Corporation 200 Series/Z370 
Chipset Family SPI Controller
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) 
I219-LM
01:00.0 PCI bridge: Texas Instruments XIO2000(A)/XIO2200A PCI 
Express-to-PCI Bridge (rev 03)
02:01.0 USB controller: NEC Corporation OHCI USB Controller (rev 43)
02:01.1 USB controller: NEC Corporation OHCI USB Controller (rev 43)
02:01.2 USB controller: NEC Corporation uPD72010x USB 2.0 Controller 
(rev 04)
03:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network 
Connection (rev 03)
05:00.0 PCI bridge: PLX Technology, Inc. PEX 8609 8-lane, 8-Port PCI 
Express Gen 2 (5.0 GT/s) Switch with DMA (rev ba)
05:00.1 System peripheral: PLX Technology, Inc. PEX 8609 8-lane, 8-Port 
PCI Express Gen 2 (5.0 GT/s) Switch with DMA (rev ba)
06:01.0 PCI bridge: PLX Technology, Inc. PEX 8609 8-lane, 8-Port PCI 
Express Gen 2 (5.0 GT/s) Switch with DMA (rev ba)
06:05.0 PCI bridge: PLX Technology, Inc. PEX 8609 8-lane, 8-Port PCI 
Express Gen 2 (5.0 GT/s) Switch with DMA (rev ba)
06:07.0 PCI bridge: PLX Technology, Inc. PEX 8609 8-lane, 8-Port PCI 
Express Gen 2 (5.0 GT/s) Switch with DMA (rev ba)
06:09.0 PCI bridge: PLX Technology, Inc. PEX 8609 8-lane, 8-Port PCI 
Express Gen 2 (5.0 GT/s) Switch with DMA (rev ba)
07:00.0 USB controller: ASMedia Technology Inc. ASM1042A USB 3.0 Host 
Controller
08:00.0 USB controller: ASMedia Technology Inc. ASM1042A USB 3.0 Host 
Controller
09:00.0 USB controller: ASMedia Technology Inc. ASM1042A USB 3.0 Host 
Controller
0a:00.0 USB controller: ASMedia Technology Inc. ASM1042A USB 3.0 Host 
Controller
14:00.0 PCI bridge: Intel Corporation Sky Lake-E PCI Express Root Port A 
(rev 04)
14:05.0 System peripheral: Intel Corporation Sky Lake-E VT-d (rev 04)
14:05.2 System peripheral: Intel Corporation Sky Lake-E RAS 
Configuration Registers (rev 04)
14:05.4 PIC: Intel Corporation Sky Lake-E IOxAPIC Configuration 
Registers (rev 04)
14:08.0 System peripheral: Intel Corporation Sky Lake-E CHA Registers 
(rev 04)
14:08.1 System peripheral: Intel Corporation Sky Lake-E CHA Registers 
(rev 04)
14:08.2 System peripheral: Intel Corporation Sky Lake-E CHA Registers 
(rev 04)
14:08.3 System peripheral: Intel Corporation Sky Lake-E CHA Registers 
(rev 04)
14:08.4 System peripheral: Intel Corporation Sky Lake-E CHA Registers 
(rev 04)
14:08.5 System peripheral: Intel Corporation Sky Lake-E CHA Registers 
(rev 04)
14:08.6 System peripheral: Intel Corporation Sky Lake-E CHA Registers 
(rev 04)
14:08.7 System peripheral: Intel Corporation Sky Lake-E CHA Registers 
(rev 04)
14:09.0 System peripheral: Intel Corporation Sky Lake-E CHA Registers 
(rev 04)
14:09.1 System peripheral: Intel Corporation Sky Lake-E CHA Registers 
(rev 04)
14:0e.0 System peripheral: Intel Corporation Sky Lake-E CHA Registers 
(rev 04)
14:0e.1 System peripheral: Intel Corporation Sky Lake-E CHA Registers 
(rev 04)
14:0e.2 System peripheral: Intel Corporation Sky Lake-E CHA Registers 
(rev 04)
14:0e.3 System peripheral: Intel Corporation Sky Lake-E CHA Registers 
(rev 04)
14:0e.4 System peripheral: Intel Corporation Sky Lake-E CHA Registers 
(rev 04)
14:0e.5 System peripheral: Intel Corporation Sky Lake-E CHA Registers 
(rev 04)
14:0e.6 System peripheral: Intel Corporation Sky Lake-E CHA Registers 
(rev 04)
14:0e.7 System peripheral: Intel Corporation Sky Lake-E CHA Registers 
(rev 04)
14:0f.0 System peripheral: Intel Corporation Sky Lake-E CHA Registers 
(rev 04)
14:0f.1 System peripheral: Intel Corporation Sky Lake-E CHA Registers 
(rev 04)
14:1d.0 System peripheral: Intel Corporation Sky Lake-E CHA Registers 
(rev 04)
14:1d.1 System peripheral: Intel Corporation Sky Lake-E CHA Registers 
(rev 04)
14:1d.2 System peripheral: Intel Corporation Sky Lake-E CHA Registers 
(rev 04)
14:1d.3 System peripheral: Intel Corporation Sky Lake-E CHA Registers 
(rev 04)
14:1e.0 System peripheral: Intel Corporation Sky Lake-E PCU Registers 
(rev 04)
14:1e.1 System peripheral: Intel Corporation Sky Lake-E PCU Registers 
(rev 04)
14:1e.2 System peripheral: Intel Corporation Sky Lake-E PCU Registers 
(rev 04)
14:1e.3 System peripheral: Intel Corporation Sky Lake-E PCU Registers 
(rev 04)
14:1e.4 System peripheral: Intel Corporation Sky Lake-E PCU Registers 
(rev 04)
14:1e.5 System peripheral: Intel Corporation Sky Lake-E PCU Registers 
(rev 04)
14:1e.6 System peripheral: Intel Corporation Sky Lake-E PCU Registers 
(rev 04)
15:00.0 PCI bridge: Pericom Semiconductor PI7C9X111SL PCIe-to-PCI 
Reversible Bridge (rev 02)
16:04.0 USB controller: VIA Technologies, Inc. VT82xx/62xx UHCI USB 1.1 
Controller (rev 62)
16:04.1 USB controller: VIA Technologies, Inc. VT82xx/62xx UHCI USB 1.1 
Controller (rev 62)
16:04.2 USB controller: VIA Technologies, Inc. USB 2.0 (rev 65)
20:00.0 PCI bridge: Intel Corporation Sky Lake-E PCI Express Root Port A 
(rev 04)
20:05.0 System peripheral: Intel Corporation Sky Lake-E VT-d (rev 04)
20:05.2 System peripheral: Intel Corporation Sky Lake-E RAS 
Configuration Registers (rev 04)
20:05.4 PIC: Intel Corporation Sky Lake-E IOxAPIC Configuration 
Registers (rev 04)
20:08.0 System peripheral: Intel Corporation Sky Lake-E Integrated 
Memory Controller (rev 04)
20:09.0 System peripheral: Intel Corporation Sky Lake-E Integrated 
Memory Controller (rev 04)
20:0a.0 System peripheral: Intel Corporation Sky Lake-E Integrated 
Memory Controller (rev 04)
20:0a.1 System peripheral: Intel Corporation Sky Lake-E Integrated 
Memory Controller (rev 04)
20:0a.2 System peripheral: Intel Corporation Sky Lake-E Integrated 
Memory Controller (rev 04)
20:0a.3 System peripheral: Intel Corporation Sky Lake-E Integrated 
Memory Controller (rev 04)
20:0a.4 System peripheral: Intel Corporation Sky Lake-E Integrated 
Memory Controller (rev 04)
20:0a.5 System peripheral: Intel Corporation Sky Lake-E LM Channel 1 
(rev 04)
20:0a.6 System peripheral: Intel Corporation Sky Lake-E LMS Channel 1 
(rev 04)
20:0a.7 System peripheral: Intel Corporation Sky Lake-E LMDP Channel 1 
(rev 04)
20:0b.0 System peripheral: Intel Corporation Sky Lake-E DECS Channel 2 
(rev 04)
20:0b.1 System peripheral: Intel Corporation Sky Lake-E LM Channel 2 
(rev 04)
20:0b.2 System peripheral: Intel Corporation Sky Lake-E LMS Channel 2 
(rev 04)
20:0b.3 System peripheral: Intel Corporation Sky Lake-E LMDP Channel 2 
(rev 04)
20:0c.0 System peripheral: Intel Corporation Sky Lake-E Integrated 
Memory Controller (rev 04)
20:0c.1 System peripheral: Intel Corporation Sky Lake-E Integrated 
Memory Controller (rev 04)
20:0c.2 System peripheral: Intel Corporation Sky Lake-E Integrated 
Memory Controller (rev 04)
20:0c.3 System peripheral: Intel Corporation Sky Lake-E Integrated 
Memory Controller (rev 04)
20:0c.4 System peripheral: Intel Corporation Sky Lake-E Integrated 
Memory Controller (rev 04)
20:0c.5 System peripheral: Intel Corporation Sky Lake-E LM Channel 1 
(rev 04)
20:0c.6 System peripheral: Intel Corporation Sky Lake-E LMS Channel 1 
(rev 04)
20:0c.7 System peripheral: Intel Corporation Sky Lake-E LMDP Channel 1 
(rev 04)
20:0d.0 System peripheral: Intel Corporation Sky Lake-E DECS Channel 2 
(rev 04)
20:0d.1 System peripheral: Intel Corporation Sky Lake-E LM Channel 2 
(rev 04)
20:0d.2 System peripheral: Intel Corporation Sky Lake-E LMS Channel 2 
(rev 04)
20:0d.3 System peripheral: Intel Corporation Sky Lake-E LMDP Channel 2 
(rev 04)
21:00.0 VGA compatible controller: NVIDIA Corporation GP107GL [Quadro 
P400] (rev a1)
21:00.1 Audio device: NVIDIA Corporation GP107GL High Definition Audio 
Controller (rev a1)
2c:00.0 PCI bridge: Intel Corporation Sky Lake-E PCI Express Root Port A 
(rev 04)
2c:02.0 PCI bridge: Intel Corporation Sky Lake-E PCI Express Root Port C 
(rev 04)
2c:03.0 PCI bridge: Intel Corporation Sky Lake-E PCI Express Root Port D 
(rev 04)
2c:05.0 System peripheral: Intel Corporation Sky Lake-E VT-d (rev 04)
2c:05.2 System peripheral: Intel Corporation Sky Lake-E RAS 
Configuration Registers (rev 04)
2c:05.4 PIC: Intel Corporation Sky Lake-E IOxAPIC Configuration 
Registers (rev 04)
2c:12.0 Performance counters: Intel Corporation Sky Lake-E M3KTI 
Registers (rev 04)
2c:12.1 Performance counters: Intel Corporation Sky Lake-E M3KTI 
Registers (rev 04)
2c:12.2 System peripheral: Intel Corporation Sky Lake-E M3KTI Registers 
(rev 04)
2c:15.0 System peripheral: Intel Corporation Sky Lake-E M2PCI Registers 
(rev 04)
2c:16.0 System peripheral: Intel Corporation Sky Lake-E M2PCI Registers 
(rev 04)
2c:16.4 System peripheral: Intel Corporation Sky Lake-E M2PCI Registers 
(rev 04)
2c:17.0 System peripheral: Intel Corporation Sky Lake-E M2PCI Registers 
(rev 04)


On 11/19/20 12:22 PM, Alan Stern wrote:
> On Mon, Nov 16, 2020 at 01:42:46PM -0500, Alberto Sentieri wrote:
>> Alan,
>>
>> I am not sure if that would give you any clue, but I enabled usbfs_snoop in
>> the kernel and I got URB error -108 (endpoint shutdown) just before the
>> kernel lock.
>>
>> Just after starting the program, I got something like that:
>>
>> [  174.325977] usb 1-1.1.1: opened by process 2961: ate
>> [  174.326006] usb 1-1.1.1: usbdev_do_ioctl: GETDRIVER
>> [  174.326015] usb 1-1.1.1: usbdev_do_ioctl: IOCTL
>> [  174.326334] usb 1-1.1.1: usbdev_do_ioctl: CLAIMINTERFACE
>> [  174.326371] usb 1-1.1.1: usbdev_do_ioctl: SETINTERFACE
>> [  174.326582] usb 1-1.1.1: usbdev_do_ioctl: SUBMITURB
>> [  174.326592] usb 1-1.1.1: userurb 0000000000000000, ep1 int-in, length 64
>> [  174.326617] usb 1-1.1.1: usbdev_do_ioctl: SUBMITURB
>> [  174.326623] usb 1-1.1.1: userurb 0000000000000000, ep1 int-in, length 64
>> [  174.326629] usb 1-1.1.1: usbdev_do_ioctl: SUBMITURB
>> [  174.326635] usb 1-1.1.1: userurb 0000000000000000, ep1 int-in, length 64
>> [  174.326640] usb 1-1.1.1: usbdev_do_ioctl: SUBMITURB
>> [  174.326645] usb 1-1.1.1: userurb 0000000000000000, ep1 int-in, length 64
>> [  174.326652] usb 1-1.1.1: usbdev_do_ioctl: SUBMITURB
>> [  174.326657] usb 1-1.1.1: userurb 0000000000000000, ep1 int-in, length 64
>> [  174.326663] usb 1-1.1.1: usbdev_do_ioctl: SUBMITURB
>> [  174.326668] usb 1-1.1.1: userurb 0000000000000000, ep1 int-in, length 64
>> [  174.326674] usb 1-1.1.1: usbdev_do_ioctl: SUBMITURB
>> [  174.326679] usb 1-1.1.1: userurb 0000000000000000, ep1 int-in, length 64
>> [  174.326685] usb 1-1.1.1: usbdev_do_ioctl: SUBMITURB
>> [  174.326691] usb 1-1.1.1: userurb 0000000000000000, ep1 int-in, length 64
>> [  174.326696] usb 1-1.1.1: usbdev_do_ioctl: SUBMITURB
>> [  174.326702] usb 1-1.1.1: userurb 0000000000000000, ep1 int-in, length 64
>> [  174.326708] usb 1-1.1.1: usbdev_do_ioctl: SUBMITURB
>> [  174.326713] usb 1-1.1.1: userurb 0000000000000000, ep1 int-in, length 64
>> [  174.326718] usb 1-1.1.1: usbdev_do_ioctl: SUBMITURB
>> [  174.326723] usb 1-1.1.1: userurb 0000000000000000, ep1 int-in, length 64
>> [  174.326729] usb 1-1.1.1: usbdev_do_ioctl: SUBMITURB
>> [  174.326734] usb 1-1.1.1: userurb 0000000000000000, ep1 int-in, length 64
>> [  174.326740] usb 1-1.1.1: usbdev_do_ioctl: SUBMITURB
>> [  174.326745] usb 1-1.1.1: userurb 0000000000000000, ep1 int-in, length 64
>> [  174.326751] usb 1-1.1.1: usbdev_do_ioctl: SUBMITURB
>> [  174.326756] usb 1-1.1.1: userurb 0000000000000000, ep1 int-in, length 64
>> [  174.326762] usb 1-1.1.1: usbdev_do_ioctl: SUBMITURB
>> [  174.326767] usb 1-1.1.1: userurb 0000000000000000, ep1 int-in, length 64
>> [  174.326773] usb 1-1.1.1: usbdev_do_ioctl: SUBMITURB
>> [  174.326783] usb 1-1.1.1: userurb 0000000000000000, ep1 int-in, length 64
>> [  174.327565] usb 1-1.3.1: opened by process 2961: ate
>> [  174.327584] usb 1-1.3.1: usbdev_do_ioctl: GETDRIVER
>> [  174.327593] usb 1-1.3.1: usbdev_do_ioctl: IOCTL
>> [  174.327855] usb 1-1.3.1: usbdev_do_ioctl: CLAIMINTERFACE
>> [  174.327881] usb 1-1.3.1: usbdev_do_ioctl: SETINTERFACE
>> [  174.328180] usb 1-1.3.1: usbdev_do_ioctl: SUBMITURB
>> [  174.328188] usb 1-1.3.1: userurb 0000000000000000, ep1 int-in, length 64
>> [  174.328205] usb 1-1.3.1: usbdev_do_ioctl: SUBMITURB
> Can you post the output from "lsusb -t" for your test system?  The
> device numbers aren't what I expected to see.  Not wrong, just
> unexpected.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: kernel locks due to USB I/O
  2020-11-19 19:21               ` Alberto Sentieri
@ 2020-11-19 19:43                 ` Alan Stern
  2020-11-19 22:14                   ` Alberto Sentieri
  0 siblings, 1 reply; 17+ messages in thread
From: Alan Stern @ 2020-11-19 19:43 UTC (permalink / raw)
  To: Alberto Sentieri; +Cc: Greg Kroah-Hartman, linux-usb

On Thu, Nov 19, 2020 at 02:21:47PM -0500, Alberto Sentieri wrote:
> lsusb -t in a similar configuration I use for development (it has just 6
> device, and not 36):
> 
> $ lsusb -t

> /:  Bus 02.Port 1: Dev 1, Class=root_hub, Driver=ehci-pci/4p, 480M
>     |__ Port 3: Dev 5, If 0, Class=Hub, Driver=hub/7p, 480M
>         |__ Port 1: Dev 6, If 0, Class=Hub, Driver=hub/2p, 480M
>             |__ Port 1: Dev 8, If 0, Class=Human Interface Device,
> Driver=usbhid, 12M
>         |__ Port 2: Dev 7, If 0, Class=Hub, Driver=hub/2p, 480M
>             |__ Port 1: Dev 10, If 0, Class=Human Interface Device,
> Driver=usbhid, 12M
>         |__ Port 4: Dev 9, If 0, Class=Hub, Driver=hub/2p, 480M
>             |__ Port 1: Dev 12, If 0, Class=Human Interface Device,
> Driver=usbhid, 12M
>         |__ Port 5: Dev 11, If 0, Class=Hub, Driver=hub/7p, 480M
>         |__ Port 6: Dev 13, If 0, Class=Hub, Driver=hub/7p, 480M
>             |__ Port 6: Dev 15, If 0, Class=Hub, Driver=hub/2p, 480M
>                 |__ Port 1: Dev 17, If 0, Class=Human Interface Device,
> Driver=usbhid, 12M
>             |__ Port 7: Dev 16, If 0, Class=Hub, Driver=hub/2p, 480M
>                 |__ Port 1: Dev 18, If 0, Class=Human Interface Device,
> Driver=usbhid, 12M
>         |__ Port 7: Dev 14, If 0, Class=Human Interface Device,
> Driver=usbhid, 12M

Previously you said that each HID microcontroller is connected to port 1
of a two-port hub.  But that clearly isn't true for device 14 in the
listing above.  What happened there?

Alan Stern


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: kernel locks due to USB I/O
  2020-11-19 18:50               ` Alberto Sentieri
@ 2020-11-19 20:01                 ` Alan Stern
       [not found]                   ` <4f8f545e-4846-45e0-b8f8-5c73876b150a@tripolho.com>
  0 siblings, 1 reply; 17+ messages in thread
From: Alan Stern @ 2020-11-19 20:01 UTC (permalink / raw)
  To: Alberto Sentieri; +Cc: Greg Kroah-Hartman, linux-usb

On Thu, Nov 19, 2020 at 01:50:13PM -0500, Alberto Sentieri wrote:
> > > However, just before the lock, I got this sequence:
> > > 
> > > [  387.326795] usb 1-1.5.3.1: usbdev_do_ioctl: SUBMITURB
> > > [  387.326801] usb 1-1.5.3.1: userurb 0000000000000000, ep1 int-in, length
> > > 64
> > > [  387.327531] usb 1-1.5.3.1: urb complete
> > > [  387.327539] usb 1-1.5.3.1: userurb         pK-error, ep1 int-out,
> > > actual_length 64 status 0
> > > [  387.327582] usb 1-1.5.3.1: usbdev_do_ioctl: REAPURBNDELAY
> > > [  387.327590] usb 1-1.5.3.1: reap 0000000000000000
> > > [  387.327598] usb 1-1.5.3.1: usbdev_do_ioctl: REAPURBNDELAY
> > > [  388.109400] usb 3-1.5.7.1: usbdev_do_ioctl: REAPURBNDELAY
> > > [  388.110032] usb 3-1.6.7.1: usbdev_do_ioctl: REAPURBNDELAY
> > > [  388.110169] usb 3-1.5.5.1: usbdev_do_ioctl: REAPURBNDELAY
> > > [  388.110605] usb 3-1.6.5.1: usbdev_do_ioctl: REAPURBNDELAY
> > > [  388.111394] usb 3-1.1.1: usbdev_do_ioctl: REAPURBNDELAY
> > > [  388.403278] usb 3-1.6.4.1: usbdev_do_ioctl: REAPURBNDELAY
> > > [  388.404011] usb 3-1.5.1.1: usbdev_do_ioctl: REAPURBNDELAY
> > > [  388.404344] usb 3-1.6.3.1: usbdev_do_ioctl: REAPURBNDELAY
> > > [  388.404649] usb 3-1.4.1: usbdev_do_ioctl: REAPURBNDELAY
> > > [  388.404790] usb 3-1.3.1: usbdev_do_ioctl: REAPURBNDELAY
> > > [  388.405283] usb 3-1.5.3.1: usbdev_do_ioctl: REAPURBNDELAY
> > > [  388.407828] usb 3-1.6.1.1: usbdev_do_ioctl: REAPURBNDELAY
> > > [  388.408771] usb 3-1.5.4.1: usbdev_do_ioctl: REAPURBNDELAY
> > > [  388.409155] usb 3-1.5.2.1: usbdev_do_ioctl: REAPURBNDELAY
> > > [  388.409452] usb 3-1.5.6.1: usbdev_do_ioctl: REAPURBNDELAY
> > > [  388.409743] usb 3-1.6.6.1: usbdev_do_ioctl: REAPURBNDELAY
> > > [  388.409848] usb 3-1.2.1: usbdev_do_ioctl: REAPURBNDELAY
> > > [  388.410840] usb 3-1.6.2.1: usbdev_do_ioctl: REAPURBNDELAY
> > This rapid sequence of reap requests with no completions or actual
> > reaps is odd.  Is this the sort of thing you expect your program to
> > do?  Maybe you can add some debugging output to the program.
> What apparently happened before that (but unfortunately I did not capture it
> this time) is that the particular device return errno -71 in some URBs,
> without de-enumerating.

That's actually quite bad.  -71 (-EPROTO) indicates a serious error in
the USB protocol.  Most often it means that the kernel did not receive
a reply packet from the device -- this could be caused by the device's
firmware crashing or a USB cable failing.  In a well-functioning
system it should never occur.

> However, I am only seeing userurbs with status 0 in
> my the netconsole capture, before the ones with status -108 (just before the
> lock). So, if error -71 happened, all URBs are reaped and the devices are
> then closed. Usually I see error -71 in the program log happening for a few
> devices at the same time.

That probably means there's something wrong with a hub they are all
connected to.  But of course, a problem in a hub wouldn't explain why
the whole system freezes.

>  After closing the interface, the program opens it
> again. The return from open should be a valid file descriptor, otherwise it
> would fail. After opening the device successfully it starts submitting urbs,
> and error -108 is coming back, telling me that the device just opened is not
> open.

It means that the device is no longer present.  Perhaps it was
logically (if not physically) disconnected after the open but before
the URBs were submitted.

> It just came to my attention now: maybe something to be investigated is what
> the kernel does if a interface is closed due to a URB error and immediately
> opened again.

In itself, that should work perfectly.  But whatever caused the USB
error could go on to cause other things to go wrong.

It might help to see a usbmon trace.  You merely have to send the data 
from /sys/kernel/debug/usb/usbmon/1u (or 3u, or even both -- the number 
refers to the USB bus being monitored) through a UDP pipe, using netcat 
or something similar, to another computer which could safely store the 
output.

Alan Stern

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: kernel locks due to USB I/O
  2020-11-19 19:43                 ` Alan Stern
@ 2020-11-19 22:14                   ` Alberto Sentieri
  0 siblings, 0 replies; 17+ messages in thread
From: Alberto Sentieri @ 2020-11-19 22:14 UTC (permalink / raw)
  To: Alan Stern; +Cc: Greg Kroah-Hartman, linux-usb

On 11/19/20 2:43 PM, Alan Stern wrote:
> On Thu, Nov 19, 2020 at 02:21:47PM -0500, Alberto Sentieri wrote:
>> lsusb -t in a similar configuration I use for development (it has just 6
>> device, and not 36):
>>
>> $ lsusb -t
>> /:  Bus 02.Port 1: Dev 1, Class=root_hub, Driver=ehci-pci/4p, 480M
>>      |__ Port 3: Dev 5, If 0, Class=Hub, Driver=hub/7p, 480M
>>          |__ Port 1: Dev 6, If 0, Class=Hub, Driver=hub/2p, 480M
>>              |__ Port 1: Dev 8, If 0, Class=Human Interface Device,
>> Driver=usbhid, 12M
>>          |__ Port 2: Dev 7, If 0, Class=Hub, Driver=hub/2p, 480M
>>              |__ Port 1: Dev 10, If 0, Class=Human Interface Device,
>> Driver=usbhid, 12M
>>          |__ Port 4: Dev 9, If 0, Class=Hub, Driver=hub/2p, 480M
>>              |__ Port 1: Dev 12, If 0, Class=Human Interface Device,
>> Driver=usbhid, 12M
>>          |__ Port 5: Dev 11, If 0, Class=Hub, Driver=hub/7p, 480M
>>          |__ Port 6: Dev 13, If 0, Class=Hub, Driver=hub/7p, 480M
>>              |__ Port 6: Dev 15, If 0, Class=Hub, Driver=hub/2p, 480M
>>                  |__ Port 1: Dev 17, If 0, Class=Human Interface Device,
>> Driver=usbhid, 12M
>>              |__ Port 7: Dev 16, If 0, Class=Hub, Driver=hub/2p, 480M
>>                  |__ Port 1: Dev 18, If 0, Class=Human Interface Device,
>> Driver=usbhid, 12M
>>          |__ Port 7: Dev 14, If 0, Class=Human Interface Device,
>> Driver=usbhid, 12M
> Previously you said that each HID microcontroller is connected to port 1
> of a two-port hub.  But that clearly isn't true for device 14 in the
> listing above.  What happened there?
The program never talks to that device. It does not try to open it. The 
program has a list of valid interfaces, and if one is not in the list it 
will not be part of the game. Really I sent you a list of 5 devices, 
because one was disconnect when I ran lsusb.

This would be the list with the 6 devices:

/:  Bus 02.Port 1: Dev 1, Class=root_hub, Driver=ehci-pci/4p, 480M
     |__ Port 3: Dev 53, If 0, Class=Hub, Driver=hub/7p, 480M
         |__ Port 1: Dev 54, If 0, Class=Hub, Driver=hub/2p, 480M
             |__ Port 1: Dev 56, If 0, Class=Human Interface Device, 
Driver=, 12M
         |__ Port 2: Dev 55, If 0, Class=Hub, Driver=hub/2p, 480M
             |__ Port 1: Dev 58, If 0, Class=Human Interface Device, 
Driver=, 12M
         |__ Port 3: Dev 57, If 0, Class=Hub, Driver=hub/2p, 480M
             |__ Port 1: Dev 60, If 0, Class=Human Interface Device, 
Driver=, 12M
         |__ Port 4: Dev 59, If 0, Class=Hub, Driver=hub/2p, 480M
             |__ Port 1: Dev 62, If 0, Class=Human Interface Device, 
Driver=, 12M
         |__ Port 5: Dev 61, If 0, Class=Hub, Driver=hub/7p, 480M
         |__ Port 6: Dev 63, If 0, Class=Hub, Driver=hub/7p, 480M
             |__ Port 6: Dev 69, If 0, Class=Hub, Driver=hub/2p, 480M
                 |__ Port 1: Dev 70, If 0, Class=Human Interface Device, 
Driver=, 12M
             |__ Port 7: Dev 66, If 0, Class=Hub, Driver=hub/2p, 480M
                 |__ Port 1: Dev 68, If 0, Class=Human Interface Device, 
Driver=, 12M
         |__ Port 7: Dev 64, If 0, Class=Human Interface Device, 
Driver=usbhid, 12M

Again, device 64 (previous device 14) is not part of the game.

I compiled kernel 5.9.8, which is running, and have other kernels as 
well. I will try to reproduce the problem in my lab. I will let you know 
as soon as I have more information. If I were able to reproduce the 
problem, I will try to use a Beagle USB analyzer at the same time the 
problem occurs.

Thanks,

Alberto
> Alan Stern
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: kernel locks due to USB I/O
       [not found]                       ` <3df90f9d-0af2-2aaa-9853-966f99e961a4@tripolho.com>
@ 2020-12-14 17:18                         ` Alan Stern
  2020-12-16 22:14                           ` Alberto Sentieri
  0 siblings, 1 reply; 17+ messages in thread
From: Alan Stern @ 2020-12-14 17:18 UTC (permalink / raw)
  To: Alberto Sentieri; +Cc: Greg Kroah-Hartman, linux-usb

On Mon, Dec 14, 2020 at 09:43:25AM -0500, Alberto Sentieri wrote:
> Alan,
> 
> I was finally able to assemble (on my desk) a similar environment which can
> reproduce the problem. It is locking with only 18 USB devices, which are
> being accessed in higher frequency than on the regular application (once
> every 3 seconds instead of once every 15 seconds). I was able to reproduce
> locks with kernel version 5.9.8, which I compiled myself.
> 
> Now a brief description of what I did.
> 
> Each access I referred to on the first paragraph was composed of 3 64-byte
> USB interrupt packets, which are: a request the computer sends, a response
> the device sends back, and a confirmation that the computer send to finish
> the cycle, which would be repeated in about 3 seconds.
> 
> The computer motherboard I am using has a UART (apparently a 16550) and I
> could enable a serial console using: console=tty1 console=ttyS0,115200. I
> also would like to point out that ipv6.disable=1 is the other kernel
> argument I am using. The serial console was working properly and all its
> output was being captured by another computer during the locks. My lab
> machine has the same USB expansion card model being used at the production
> environment and that was the one being used during the experiments. Though
> my lab machine has some extra PCI USB cards installed, they were not being
> using during the locks.
> 
> I did the experiment just a few times during the weekend, after receiving
> the last parts I needed on Saturday.
> 
> So, with kernel 5.4.0-53-generic, which is one of the kernels released with
> Ubuntu, the locked happened in just a couple of minutes. I tried that twice.
> With kernel 5.9.8, the lock took more time to happen. Initially I though it
> would not happen, because I'd observed the machine running for about 15
> minutes and everything seemed normal. However after a couple of hours I
> found the machine locked. I also would like to point out that I’ve been
> using this machine for more than 6 months and that it had never locked on
> me, so that excludes any hardware problem.
> 
> Unfortunately on the 3 times it locked (2 with 5.4.0 and one with 5.9.8)
> nothing was printed on the serial console.
> 
> I will be working on this during the day and I will try to enable some of
> the debugging tools you suggested in your previous emails.
> 
> Any comment on this will be highly appreciated.

We'll have to wait and see what the tests and debugging tools reveal.

As for the comment about the hardware being reliable -- in fact, you 
can't be certain of this.  It may be that your test is stressing the 
hardware in a way that it never experienced during the prior six 
months.

Alan Stern

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: kernel locks due to USB I/O
  2020-12-14 17:18                         ` Alan Stern
@ 2020-12-16 22:14                           ` Alberto Sentieri
  0 siblings, 0 replies; 17+ messages in thread
From: Alberto Sentieri @ 2020-12-16 22:14 UTC (permalink / raw)
  To: Alan Stern; +Cc: Greg Kroah-Hartman, linux-usb

This is just a quick status update. I am just checking if it rings any 
bell. I am still doing more tests.

The machine where all these problems are happening is a HP Z4 G4, as 
well as the other two in production. Until this morning it had BIOS 
v01.76 dated 04/22/2019. Today I update the BIOS to v02.63 dated 10/28/2020.

After updating the BIOS, the behavior definitively changed. Now both 
kernels (5.4.0-53-generic and 5.9.8) reboot unexpectedly instead of 
locking. After re-enabling the console, expecting to capture something, 
I could not see a lock or reboot. However the serial console, after 
displaying the whole boot, locked, and I cannot use it to login.

Alberto Sentieri

On 12/14/20 12:18 PM, Alan Stern wrote:
> We'll have to wait and see what the tests and debugging tools reveal.
> As for the comment about the hardware being reliable -- in fact, you
> can't be certain of this.  It may be that your test is stressing the
> hardware in a way that it never experienced during the prior six
> months.
>
> Alan Stern

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2020-12-16 22:15 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-10 19:20 kernel locks due to USB I/O Alberto Sentieri
2020-11-10 20:51 ` Alan Stern
2020-11-10 23:42   ` Alberto Sentieri
2020-11-11  7:51     ` Greg Kroah-Hartman
2020-11-11 15:51     ` Alan Stern
2020-11-11 19:31       ` Alberto Sentieri
2020-11-16 16:53       ` Alberto Sentieri
2020-11-16 17:06         ` Alan Stern
2020-11-16 18:42           ` Alberto Sentieri
2020-11-19 17:22             ` Alan Stern
2020-11-19 18:50               ` Alberto Sentieri
2020-11-19 20:01                 ` Alan Stern
     [not found]                   ` <4f8f545e-4846-45e0-b8f8-5c73876b150a@tripolho.com>
     [not found]                     ` <20201119225144.GA590990@rowland.harvard.edu>
     [not found]                       ` <3df90f9d-0af2-2aaa-9853-966f99e961a4@tripolho.com>
2020-12-14 17:18                         ` Alan Stern
2020-12-16 22:14                           ` Alberto Sentieri
2020-11-19 19:21               ` Alberto Sentieri
2020-11-19 19:43                 ` Alan Stern
2020-11-19 22:14                   ` Alberto Sentieri

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.