From: Alberto Sentieri <22t@tripolho.com>
To: Alan Stern <stern@rowland.harvard.edu>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	linux-usb@vger.kernel.org
Subject: Re: kernel locks due to USB I/O
Date: Tue, 10 Nov 2020 18:42:17 -0500
Message-ID: <8152190e-c962-e376-64fd-cc2ebf3e6104@tripolho.com>
In-Reply-To: <20201110205114.GB204624@rowland.harvard.edu>

1) The current Ubuntu kernel is 5.4.0-53. Do you want me to upgrade it
to 5.9 from kernel.org? Or is there an Ubuntu 5.9 package that I can
use? An Ubuntu package would make this easy: I could install it and
then uninstall it after the tests.

2) Why do you believe that 5.9 would solve the problem? I am asking
because I cannot change the production machine for a test unless I can
go back to the original state. There is always a risk involved.

3) A single thread deals with all 36 devices. Each device has its own
co-routine (not preemptive), but all co-routines are executed by that
one thread.

4) By network console, do you mean ssh? ssh dies as well when the
machine locks. The screen is the regular GNOME3 screen and nothing can
be seen there. Every time the machine locks they send a picture, and I
cannot see anything meaningful in it. I am thinking about disabling
GNOME3, but I need their blessing for that.
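
To make point 3 concrete, here is a minimal C sketch of the submit/reap
sequence described in the original report quoted below. It is not the
production code: the function names, the on_done callback and the endpoint
address used in the usage note are hypothetical, and it assumes that an
interrupt IN URB is submitted to receive the response packet. usbfs reports
a completed URB to poll/epoll as a writable event, so the descriptor is
registered for EPOLLOUT.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/ioctl.h>
#include <linux/usbdevice_fs.h>

#define PKT_LEN 64 /* request, response and confirmation are all 64 bytes */

/* Fill in an interrupt URB for one 64-byte packet. Does not touch the device.
 * The URB and its buffer must stay valid until the URB has been reaped. */
void prepare_interrupt_urb(struct usbdevfs_urb *urb, unsigned char endpoint,
                           unsigned char *buf, void *ctx)
{
    memset(urb, 0, sizeof *urb);
    urb->type = USBDEVFS_URB_TYPE_INTERRUPT;
    urb->endpoint = endpoint;      /* bit 7 set means an IN endpoint */
    urb->buffer = buf;
    urb->buffer_length = PKT_LEN;
    urb->usercontext = ctx;        /* e.g. the co-routine that owns the device */
}

/* Queue a prepared URB on an open /dev/bus/usb/... descriptor.
 * Returns 0 on success, -1 with errno set on failure. */
int submit_packet(int fd, struct usbdevfs_urb *urb)
{
    assert(urb->buffer != NULL && urb->buffer_length == PKT_LEN);
    return ioctl(fd, USBDEVFS_SUBMITURB, urb);
}

/* Register a device descriptor with the shared epoll instance. */
int watch_device(int epfd, int fd, void *ctx)
{
    struct epoll_event ev = { .events = EPOLLOUT, .data.ptr = ctx };
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/* Reap every completed URB on fd without blocking and hand each one to
 * on_done (hypothetical callback that advances the per-device state machine).
 * Returns the number of URBs reaped, or -1 on an error other than EAGAIN. */
int reap_completed(int fd, void (*on_done)(struct usbdevfs_urb *))
{
    int reaped = 0;
    for (;;) {
        struct usbdevfs_urb *done = NULL;
        if (ioctl(fd, USBDEVFS_REAPURBNDELAY, &done) < 0)
            return errno == EAGAIN ? reaped : -1;
        on_done(done);
        reaped++;
    }
}
```

In the single-threaded loop, each epoll wake-up for a descriptor would call
reap_completed() once, and the callback would decide whether to submit the
next packet, for example prepare_interrupt_urb(&urb, 0x81, buf, dev) followed
by submit_packet(fd, &urb), where 0x81 is only a placeholder endpoint address.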

Thanks,

Alberto

On 11/10/20 3:51 PM, Alan Stern wrote:
> On Tue, Nov 10, 2020 at 02:20:50PM -0500, Alberto Sentieri wrote:
>> I’ve seen many kernel locks caused by a particular user-level application.
>> After the kernel locks, there is no report left in the machine, neither in
>> the logs. These locks have to do with USB input and output.
>>
>> The objective of this email is to get guidance about how to collect more
>> data related to the locks.
>>
>> A description of the problem follows.
>>
>> I manage a few remote machines installed at a manufacturing facility, which
>> run Ubuntu 18.04. For months I had seen unexpected kernel locks, which I
>> could not explain. By locks I mean that the machine completely dies. The
>> graphical screen and keyboard freeze. I cannot ping or connect through ssh
>> during the locks. The only way of making the machine come back is through a
>> “pull the plug”. After rebooting I cannot find anything meaningful about the
>> lock in the logs. The machine is a good quality one with a 6-core Xeon, 32
>> GB ECC memory (and the application is using about 1GB). Exactly the same
>> problem happens in two identical machines, one running kernel 5.0.0-37
>> generic and the other running kernel 5.3.0-62-generic.
> Can you update either machine to a 5.9 kernel?
>
>> A few days ago I was able to create a sequence of events that produce the
>> locks in a couple of minutes. These events have to do with USB 2.0 interrupt
>> I/O on USB devices connected at 12 Mbit/s and the frequency at which URBs are
>> submitted and reaped. It is necessary to have at least 36 devices connected
>> to reproduce the problem easily, which I cannot do from where I am. The
>> machines are in a country other than the one I live in, and my physical access
>> to them is not possible due to COVID-19 restrictions.
>>
>> There are no special USB drivers installed. However, there is an NVIDIA
>> manufacturer driver installed, which I installed using the Ubuntu regular
>> tools for non-free software. All USB I/O is done by a regular user opening
>> /dev/bus/usb/xxx/xxx (the device group is set to the user group by udev).
>>
>> Each set of 18 USB devices is connected to a hub powered by a 10 A power
>> supply. Each hub has its own USB 2.0 root: I installed multiple USB 2.0
>> PCI Express expansion cards, and only one port of each expansion card is
>> used, for one hub.
>>
>> The protocol to talk to any of the 36 devices is pretty simple. It uses USB
>> interrupt frames. A 64-byte frame is sent to the device (request packet). I
>> use ioctl (USBDEVFS_SUBMITURB). The file descriptor is monitored by epoll
>> and when an answer comes back, the response packet (another 64-byte
>> interrupt packet) is recovered by ioctl (USBDEVFS_REAPURBNDELAY). Then a
>> 64-byte packet (confirmation packet) is sent through USBDEVFS_SUBMITURB.
>> This sequence happens once every few seconds and the delay between the three
>> packets is just a couple of milliseconds. The whole process of dealing with
>> the 36 devices is in a unique thread, under the same epoll loop.
> This sentence is ambiguous.  Do you mean there is a single unique thread
> which talks to all 36 devices?  Or do you mean there is a separate
> unique thread for each device (so 36 threads)?
>
>> So if I synchronize all 36 devices, I mean, I try to talk to all of them
>> basically at the same time, the kernel will lock in about 2 minutes or less.
>> By “at the same time” I mean to submit the URBs for the request packet
>> around the same time for all of them, and then sit there, waiting for the
>> proper epoll wake-up to deal with the state machine (response and
>> confirmation packets).
>>
>> However, if I lock a semaphore before sending the request packet for one
>> device, and only unlock after reaping the URB I used to send the
>> confirmation packet, it ran for at least 72 hours without problems. So, one
>> device at a time (using basically the same software plus the semaphore) does
>> not cause the kernel lock.
>>
>> My point is that simple ioctl calls to USB devices should not break the
>> kernel. I need help to address the kernel issue. The problem is difficult to
>> reproduce at my office because it needs many devices connected to it, which
>> are available only in a place I do not have physical access to, due to
>> COVID-19 travel restrictions.
>>
>> My guess is that, for a regular user, this bug rarely manifests itself and
>> it may be there for a long time.
>>
>> I would like to figure out exactly where the problem is and I am looking for
>> your guidance to get more information about it.
> You could try using a network console.  Or have someone who is on-site
> take a picture of the computer screen when a crash occurs.
>
> Alan Stern
