All of lore.kernel.org
 help / color / mirror / Atom feed
* System crash/lockup after plugging CDC ACM device
@ 2020-07-15  8:58 David Guillen Fandos
  2020-07-15  9:30 ` Greg KH
  0 siblings, 1 reply; 19+ messages in thread
From: David Guillen Fandos @ 2020-07-15  8:58 UTC (permalink / raw)
  To: linux-usb

Hello linux-usb,

I think I might have found a kernel bug related to the USB subsystem
(cdc_acm perhaps).

Context: I was playing around with a device I'm creating, essentially a
USB quad modem device that exposes four modems to the host system. This
device is still a prototype so there's a few bugs here and there, most
likely in the USB descriptors and control requests.

What happens: After plugging the device the system starts spitting
warnings and BUGs and it locks up. Most of the time some CPUs get into
some spinloop and never comes back (you can see it being detected by
the watchdog after a few seconds). Generally after that the USB devices
stop working completely and at some point the machine freezes
completely. In a couple of ocasions I managed to see a bug in dmesg
saying "unable to handle page fault for address XXX" and "Supervisor
read access in kernel mode" "error code (0x0000) not present page". I
could not get a trace for that one since the kernel died completely and
my log files were truncated/lost.

Since it is happening to my two machines (both Intel but rather
different controllers, Sunrise Point-LP USB 3.0 vs 8 Series/C220) and
with different kernel versions I suspect this might be a bug in the
kernel.

I have 4 logs that I collected, they are sort of long-ish, not sure how
to best send them to the list.

Thanks!
David


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: System crash/lockup after plugging CDC ACM device
  2020-07-15  8:58 System crash/lockup after plugging CDC ACM device David Guillen Fandos
@ 2020-07-15  9:30 ` Greg KH
  2020-07-15 10:31   ` David Guillen Fandos
  0 siblings, 1 reply; 19+ messages in thread
From: Greg KH @ 2020-07-15  9:30 UTC (permalink / raw)
  To: David Guillen Fandos; +Cc: linux-usb

On Wed, Jul 15, 2020 at 10:58:03AM +0200, David Guillen Fandos wrote:
> Hello linux-usb,
> 
> I think I might have found a kernel bug related to the USB subsystem
> (cdc_acm perhaps).
> 
> Context: I was playing around with a device I'm creating, essentially a
> USB quad modem device that exposes four modems to the host system. This
> device is still a prototype so there's a few bugs here and there, most
> likely in the USB descriptors and control requests.
> 
> What happens: After plugging the device the system starts spitting
> warnings and BUGs and it locks up. Most of the time some CPUs get into
> some spinloop and never comes back (you can see it being detected by
> the watchdog after a few seconds). Generally after that the USB devices
> stop working completely and at some point the machine freezes
> completely. In a couple of ocasions I managed to see a bug in dmesg
> saying "unable to handle page fault for address XXX" and "Supervisor
> read access in kernel mode" "error code (0x0000) not present page". I
> could not get a trace for that one since the kernel died completely and
> my log files were truncated/lost.
> 
> Since it is happening to my two machines (both Intel but rather
> different controllers, Sunrise Point-LP USB 3.0 vs 8 Series/C220) and
> with different kernel versions I suspect this might be a bug in the
> kernel.
> 
> I have 4 logs that I collected, they are sort of long-ish, not sure how
> to best send them to the list.

Send the crashes with the callback list, that should be quite small,
right?  We don't need the full log.

The first crash is the most important, the others can be from the first
one and are not reliable.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: System crash/lockup after plugging CDC ACM device
  2020-07-15  9:30 ` Greg KH
@ 2020-07-15 10:31   ` David Guillen Fandos
  2020-07-15 10:50     ` Greg KH
  0 siblings, 1 reply; 19+ messages in thread
From: David Guillen Fandos @ 2020-07-15 10:31 UTC (permalink / raw)
  To: Greg KH; +Cc: linux-usb

On Wed, 2020-07-15 at 11:30 +0200, Greg KH wrote:
> On Wed, Jul 15, 2020 at 10:58:03AM +0200, David Guillen Fandos wrote:
> > Hello linux-usb,
> > 
> > I think I might have found a kernel bug related to the USB
> > subsystem
> > (cdc_acm perhaps).
> > 
> > Context: I was playing around with a device I'm creating,
> > essentially a
> > USB quad modem device that exposes four modems to the host system.
> > This
> > device is still a prototype so there's a few bugs here and there,
> > most
> > likely in the USB descriptors and control requests.
> > 
> > What happens: After plugging the device the system starts spitting
> > warnings and BUGs and it locks up. Most of the time some CPUs get
> > into
> > some spinloop and never comes back (you can see it being detected
> > by
> > the watchdog after a few seconds). Generally after that the USB
> > devices
> > stop working completely and at some point the machine freezes
> > completely. In a couple of ocasions I managed to see a bug in dmesg
> > saying "unable to handle page fault for address XXX" and
> > "Supervisor
> > read access in kernel mode" "error code (0x0000) not present page".
> > I
> > could not get a trace for that one since the kernel died completely
> > and
> > my log files were truncated/lost.
> > 
> > Since it is happening to my two machines (both Intel but rather
> > different controllers, Sunrise Point-LP USB 3.0 vs 8 Series/C220)
> > and
> > with different kernel versions I suspect this might be a bug in the
> > kernel.
> > 
> > I have 4 logs that I collected, they are sort of long-ish, not sure
> > how
> > to best send them to the list.
> 
> Send the crashes with the callback list, that should be quite small,
> right?  We don't need the full log.
> 
> The first crash is the most important, the others can be from the
> first
> one and are not reliable.
> 
> thanks,
> 
> greg k-h

Ok then, here comes one of the logs, I selected some bits only

[  147.302016] WARNING: CPU: 3 PID: 134 at kernel/workqueue.c:1473
__queue_work+0x364/0x410
[...]
[  147.302322] Call Trace:
[  147.302329]  <IRQ>
[  147.302342]  queue_work_on+0x36/0x40
[  147.302353]  __usb_hcd_giveback_urb+0x9c/0x110
[  147.302362]  usb_giveback_urb_bh+0xa0/0xf0
[  147.302372]  tasklet_action_common.constprop.0+0x66/0x100
[  147.302382]  __do_softirq+0xe9/0x2dc
[  147.302391]  irq_exit+0xcf/0x110
[  147.302397]  do_IRQ+0x55/0xe0
[  147.302408]  common_interrupt+0xf/0xf
[  147.302413]  </IRQ>
[...]
[  184.771172] watchdog: BUG: soft lockup - CPU#3 stuck for 23s!
[kworker/3:2:134]


The other machines:

Jul 15 01:22:16 localhost kernel: WARNING: CPU: 0 PID: 0 at
kernel/workqueue.c:1473 __queue_work+0x342/0x3e0
[...]
Jul 15 01:22:16 localhost kernel: Call Trace:
Jul 15 01:22:16 localhost kernel: <IRQ>
Jul 15 01:22:16 localhost kernel: queue_work_on+0x36/0x40
Jul 15 01:22:16 localhost kernel: __usb_hcd_giveback_urb+0x6f/0x120
Jul 15 01:22:16 localhost kernel: usb_giveback_urb_bh+0xa0/0xf0
Jul 15 01:22:16 localhost kernel:
tasklet_action_common.isra.0+0x5b/0x100
Jul 15 01:22:16 localhost kernel: __do_softirq+0xee/0x2ff
Jul 15 01:22:16 localhost kernel: irq_exit+0xe9/0xf0
Jul 15 01:22:16 localhost kernel: do_IRQ+0x55/0xe0
Jul 15 01:22:16 localhost kernel: common_interrupt+0xf/0xf
Jul 15 01:22:16 localhost kernel: </IRQ>

Pretty similar as well. This is the first crash, the other ones mention
the cdc-acm but the first ones do not.

In one occasion the first crash seems different, here goes:

[   46.713823] ------------[ cut here ]------------
[   46.713827] refcount_t: underflow; use-after-free.
[   46.713884] WARNING: CPU: 2 PID: 33 at lib/refcount.c:28
refcount_warn_saturate+0xa6/0xf0
[...]
[   46.714148] Call Trace:
[   46.714166]  acm_disconnect+0x198/0x280 [cdc_acm]
[   46.714188]  usb_unbind_interface+0x8a/0x270
[   46.714198]  __device_release_driver+0x15c/0x210
[   46.714204]  device_release_driver+0x24/0x30
[   46.714209]  bus_remove_device+0xdb/0x140
[   46.714218]  device_del+0x16f/0x3f0
[   46.714227]  usb_disable_device+0xd6/0x290
[   46.714234]  usb_disconnect.cold+0x7e/0x205
[   46.714242]  hub_event+0xbfa/0x1810
[   46.714257]  process_one_work+0x1b4/0x380
[   46.714264]  worker_thread+0x53/0x3e0
[   46.714269]  ? process_one_work+0x380/0x380
[   46.714276]  kthread+0x115/0x140
[   46.714284]  ? __kthread_bind_mask+0x60/0x60
[   46.714293]  ret_from_fork+0x35/0x40

Let me know if you need more relevant/complete logs. Will be glad to
send them over.

Thanks
David


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: System crash/lockup after plugging CDC ACM device
  2020-07-15 10:31   ` David Guillen Fandos
@ 2020-07-15 10:50     ` Greg KH
  2020-07-15 10:57       ` David Guillen Fandos
  0 siblings, 1 reply; 19+ messages in thread
From: Greg KH @ 2020-07-15 10:50 UTC (permalink / raw)
  To: David Guillen Fandos; +Cc: linux-usb

On Wed, Jul 15, 2020 at 12:31:42PM +0200, David Guillen Fandos wrote:
> On Wed, 2020-07-15 at 11:30 +0200, Greg KH wrote:
> > On Wed, Jul 15, 2020 at 10:58:03AM +0200, David Guillen Fandos wrote:
> > > Hello linux-usb,
> > > 
> > > I think I might have found a kernel bug related to the USB
> > > subsystem
> > > (cdc_acm perhaps).
> > > 
> > > Context: I was playing around with a device I'm creating,
> > > essentially a
> > > USB quad modem device that exposes four modems to the host system.
> > > This
> > > device is still a prototype so there's a few bugs here and there,
> > > most
> > > likely in the USB descriptors and control requests.
> > > 
> > > What happens: After plugging the device the system starts spitting
> > > warnings and BUGs and it locks up. Most of the time some CPUs get
> > > into
> > > some spinloop and never comes back (you can see it being detected
> > > by
> > > the watchdog after a few seconds). Generally after that the USB
> > > devices
> > > stop working completely and at some point the machine freezes
> > > completely. In a couple of ocasions I managed to see a bug in dmesg
> > > saying "unable to handle page fault for address XXX" and
> > > "Supervisor
> > > read access in kernel mode" "error code (0x0000) not present page".
> > > I
> > > could not get a trace for that one since the kernel died completely
> > > and
> > > my log files were truncated/lost.
> > > 
> > > Since it is happening to my two machines (both Intel but rather
> > > different controllers, Sunrise Point-LP USB 3.0 vs 8 Series/C220)
> > > and
> > > with different kernel versions I suspect this might be a bug in the
> > > kernel.
> > > 
> > > I have 4 logs that I collected, they are sort of long-ish, not sure
> > > how
> > > to best send them to the list.
> > 
> > Send the crashes with the callback list, that should be quite small,
> > right?  We don't need the full log.
> > 
> > The first crash is the most important, the others can be from the
> > first
> > one and are not reliable.
> > 
> > thanks,
> > 
> > greg k-h
> 
> Ok then, here comes one of the logs, I selected some bits only
> 
> [  147.302016] WARNING: CPU: 3 PID: 134 at kernel/workqueue.c:1473
> __queue_work+0x364/0x410
> [...]
> [  147.302322] Call Trace:
> [  147.302329]  <IRQ>
> [  147.302342]  queue_work_on+0x36/0x40
> [  147.302353]  __usb_hcd_giveback_urb+0x9c/0x110
> [  147.302362]  usb_giveback_urb_bh+0xa0/0xf0
> [  147.302372]  tasklet_action_common.constprop.0+0x66/0x100
> [  147.302382]  __do_softirq+0xe9/0x2dc
> [  147.302391]  irq_exit+0xcf/0x110
> [  147.302397]  do_IRQ+0x55/0xe0
> [  147.302408]  common_interrupt+0xf/0xf
> [  147.302413]  </IRQ>
> [...]
> [  184.771172] watchdog: BUG: soft lockup - CPU#3 stuck for 23s!
> [kworker/3:2:134]

That was the first message?

Ok, we need some more logs, how about the 30 lines right before the
above?

And what kernel version are you using?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: System crash/lockup after plugging CDC ACM device
  2020-07-15 10:50     ` Greg KH
@ 2020-07-15 10:57       ` David Guillen Fandos
  2020-07-15 11:12         ` Greg KH
  0 siblings, 1 reply; 19+ messages in thread
From: David Guillen Fandos @ 2020-07-15 10:57 UTC (permalink / raw)
  To: Greg KH; +Cc: linux-usb

On Wed, 2020-07-15 at 12:50 +0200, Greg KH wrote:
> On Wed, Jul 15, 2020 at 12:31:42PM +0200, David Guillen Fandos wrote:
> > On Wed, 2020-07-15 at 11:30 +0200, Greg KH wrote:
> > > On Wed, Jul 15, 2020 at 10:58:03AM +0200, David Guillen Fandos
> > > wrote:
> > > > Hello linux-usb,
> > > > 
> > > > I think I might have found a kernel bug related to the USB
> > > > subsystem
> > > > (cdc_acm perhaps).
> > > > 
> > > > Context: I was playing around with a device I'm creating,
> > > > essentially a
> > > > USB quad modem device that exposes four modems to the host
> > > > system.
> > > > This
> > > > device is still a prototype so there's a few bugs here and
> > > > there,
> > > > most
> > > > likely in the USB descriptors and control requests.
> > > > 
> > > > What happens: After plugging the device the system starts
> > > > spitting
> > > > warnings and BUGs and it locks up. Most of the time some CPUs
> > > > get
> > > > into
> > > > some spinloop and never comes back (you can see it being
> > > > detected
> > > > by
> > > > the watchdog after a few seconds). Generally after that the USB
> > > > devices
> > > > stop working completely and at some point the machine freezes
> > > > completely. In a couple of ocasions I managed to see a bug in
> > > > dmesg
> > > > saying "unable to handle page fault for address XXX" and
> > > > "Supervisor
> > > > read access in kernel mode" "error code (0x0000) not present
> > > > page".
> > > > I
> > > > could not get a trace for that one since the kernel died
> > > > completely
> > > > and
> > > > my log files were truncated/lost.
> > > > 
> > > > Since it is happening to my two machines (both Intel but rather
> > > > different controllers, Sunrise Point-LP USB 3.0 vs 8
> > > > Series/C220)
> > > > and
> > > > with different kernel versions I suspect this might be a bug in
> > > > the
> > > > kernel.
> > > > 
> > > > I have 4 logs that I collected, they are sort of long-ish, not
> > > > sure
> > > > how
> > > > to best send them to the list.
> > > 
> > > Send the crashes with the callback list, that should be quite
> > > small,
> > > right?  We don't need the full log.
> > > 
> > > The first crash is the most important, the others can be from the
> > > first
> > > one and are not reliable.
> > > 
> > > thanks,
> > > 
> > > greg k-h
> > 
> > Ok then, here comes one of the logs, I selected some bits only
> > 
> > [  147.302016] WARNING: CPU: 3 PID: 134 at kernel/workqueue.c:1473
> > __queue_work+0x364/0x410
> > [...]
> > [  147.302322] Call Trace:
> > [  147.302329]  <IRQ>
> > [  147.302342]  queue_work_on+0x36/0x40
> > [  147.302353]  __usb_hcd_giveback_urb+0x9c/0x110
> > [  147.302362]  usb_giveback_urb_bh+0xa0/0xf0
> > [  147.302372]  tasklet_action_common.constprop.0+0x66/0x100
> > [  147.302382]  __do_softirq+0xe9/0x2dc
> > [  147.302391]  irq_exit+0xcf/0x110
> > [  147.302397]  do_IRQ+0x55/0xe0
> > [  147.302408]  common_interrupt+0xf/0xf
> > [  147.302413]  </IRQ>
> > [...]
> > [  184.771172] watchdog: BUG: soft lockup - CPU#3 stuck for 23s!
> > [kworker/3:2:134]
> 
> That was the first message?
> 
> Ok, we need some more logs, how about the 30 lines right before the
> above?
> 
> And what kernel version are you using?
> 
> thanks,
> 
> greg k-h

Heh I assumed you would find the 3rd stack more interesting since it
involves more subsystems but anyway, here we got, the first one with
more context. The trigger as you can see is me connecting the USB
device:

[  141.445367] usb 1-1: new full-speed USB device number 5 using
xhci_hcd
[  141.573592] usb 1-1: New USB device found, idVendor=0483,
idProduct=5740, bcdDevice= 2.00
[  141.573597] usb 1-1: New USB device strings: Mfr=1, Product=2,
SerialNumber=3
[  141.573601] usb 1-1: Product: Quad-UART serial USB device
[  141.573603] usb 1-1: Manufacturer: davidgf.net
[  141.573605] usb 1-1: SerialNumber: serialno
[  142.375007] cdc_acm 1-1:1.0: ttyACM0: USB ACM device
[  142.376623] cdc_acm 1-1:1.2: ttyACM1: USB ACM device
[  142.378350] cdc_acm 1-1:1.4: ttyACM2: USB ACM device
[  142.379637] cdc_acm 1-1:1.6: ttyACM3: USB ACM device
[  142.382473] usbcore: registered new interface driver cdc_acm
[  142.382476] cdc_acm: USB Abstract Control Model driver for USB
modems and ISDN adapters
[  147.301997] ------------[ cut here ]------------
[  147.302016] WARNING: CPU: 3 PID: 134 at kernel/workqueue.c:1473
__queue_work+0x364/0x410
[  147.302019] Modules linked in: cdc_acm rfcomm ccm wireguard
curve25519_x86_64 libchacha20poly1305 chacha_x86_64 poly1305_x86_64
libblake2s blake2s_x86_64 ip6_udp_tunnel udp_tunnel
libcurve25519_generic libchacha libblake2s_generic nft_fib_inet
nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4
nf_reject_ipv6 nft_reject nft_ct nft_chain_nat ip6table_nat
ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c iptable_mangle
iptable_raw iptable_security ip_set nf_tables nfnetlink ip6table_filter
ip6_tables iptable_filter cmac vboxnetadp(OE) vboxnetflt(OE) bnep
vboxdrv(OE) sunrpc vfat fat uvcvideo videobuf2_vmalloc videobuf2_memops
videobuf2_v4l2 videobuf2_common videodev btusb btrtl btbcm btintel mc
bluetooth ecdh_generic ecc iTCO_wdt iTCO_vendor_support mei_hdcp
intel_rapl_msr dell_laptop x86_pkg_temp_thermal intel_powerclamp
coretemp kvm_intel kvm irqbypass intel_cstate intel_uncore
intel_rapl_perf iwlmvm
[  147.302121]  snd_hda_codec_hdmi mac80211 snd_soc_skl snd_soc_sst_ipc
snd_soc_sst_dsp dell_wmi snd_hda_ext_core dell_smbios
snd_hda_codec_realtek dcdbas libarc4 wmi_bmof dell_wmi_descriptor
snd_soc_acpi_intel_match snd_soc_acpi intel_wmi_thunderbolt
snd_hda_codec_generic snd_soc_core ledtrig_audio iwlwifi pcspkr
snd_compress ac97_bus snd_pcm_dmaengine snd_hda_intel snd_intel_dspcfg
snd_hda_codec cfg80211 snd_hda_core snd_hwdep snd_seq snd_seq_device
joydev snd_pcm rfkill snd_timer snd i2c_i801 soundcore idma64
int3403_thermal intel_hid int3400_thermal sparse_keymap
acpi_thermal_rel mei_me intel_xhci_usb_role_switch acpi_pad roles mei
intel_pch_thermal processor_thermal_device intel_rapl_common
int340x_thermal_zone intel_soc_dts_iosf binfmt_misc ip_tables dm_crypt
i915 rtsx_pci_sdmmc mmc_core crct10dif_pclmul crc32_pclmul i2c_algo_bit
cec crc32c_intel drm_kms_helper nvme ghash_clmulni_intel drm nvme_core
serio_raw rtsx_pci hid_multitouch wmi i2c_hid video
pinctrl_sunrisepoint pinctrl_intel
[  147.302218]  fuse
[  147.302230] CPU: 3 PID: 134 Comm: kworker/3:2 Tainted:
G          IOE     5.7.7-200.fc32.x86_64 #1
[  147.302233] Hardware name: Dell Inc. XPS 13 9350/0PWNCR, BIOS 1.12.2
12/15/2019
[  147.302260] Workqueue:  0x0 (mm_percpu_wq)
[  147.302275] RIP: 0010:__queue_work+0x364/0x410
[  147.302282] Code: e0 f1 69 a9 00 01 1f 00 75 0f 65 48 8b 3c 25 c0 8b
01 00 f6 47 24 20 75 25 0f 0b 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f
c3 <0f> 0b e9 78 fe ff ff 41 83 cc 02 49 8d 57 60 e9 5d fe ff ff e8 53
[  147.302286] RSP: 0018:ffffbab980154e68 EFLAGS: 00010002
[  147.302292] RAX: ffff8f551b333790 RBX: 0000000000000048 RCX:
0000000000000000
[  147.302295] RDX: ffff8f551b333798 RSI: ffff8f5575803718 RDI:
ffff8f5576daa840
[  147.302299] RBP: ffff8f551b333790 R08: ffffffff97856cb0 R09:
0000000000000000
[  147.302302] R10: 0000000000000000 R11: ffffffff97856cb8 R12:
0000000000000003
[  147.302306] R13: 0000000000002000 R14: ffff8f5575c14e00 R15:
ffff8f5576db0700
[  147.302311] FS:  0000000000000000(0000) GS:ffff8f5576d80000(0000)
knlGS:0000000000000000
[  147.302315] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  147.302319] CR2: 00000000000000b0 CR3: 0000000267774004 CR4:
00000000003606e0
[  147.302322] Call Trace:
[  147.302329]  <IRQ>
[  147.302342]  queue_work_on+0x36/0x40
[  147.302353]  __usb_hcd_giveback_urb+0x9c/0x110
[  147.302362]  usb_giveback_urb_bh+0xa0/0xf0
[  147.302372]  tasklet_action_common.constprop.0+0x66/0x100
[  147.302382]  __do_softirq+0xe9/0x2dc
[  147.302391]  irq_exit+0xcf/0x110
[  147.302397]  do_IRQ+0x55/0xe0
[  147.302408]  common_interrupt+0xf/0xf
[  147.302413]  </IRQ>
[  147.302422] RIP: 0010:finish_task_switch+0x80/0x2a0
[  147.302428] Code: 8b 1c 25 c0 8b 01 00 0f 1f 44 00 00 0f 1f 44 00 00
41 c7 45 38 00 00 00 00 4c 89 e7 c6 07 00 0f 1f 40 00 fb 66 0f 1f 44 00
00 <65> 48 8b 04 25 c0 8b 01 00 0f 1f 44 00 00 4d 85 f6 74 21 65 48 8b
[  147.302431] RSP: 0018:ffffbab9802dfe30 EFLAGS: 00000246 ORIG_RAX:
ffffffffffffffdc
[  147.302438] RAX: ffff8f55676aa700 RBX: ffff8f556ec70000 RCX:
0000000000000000
[  147.302441] RDX: 0000000000000000 RSI: 0000000000000003 RDI:
ffff8f5576daaec0
[  147.302444] RBP: ffffbab9802dfe58 R08: 0000000000000000 R09:
0000000000000000
[  147.302447] R10: 0000000000000001 R11: 0000000000000000 R12:
ffff8f5576daaec0
[  147.302450] R13: ffff8f55676aa700 R14: 0000000000000000 R15:
0000000000000000
[  147.302466]  __schedule+0x279/0x770
[  147.302474]  schedule+0x4a/0xb0
[  147.302484]  worker_thread+0xd1/0x3e0
[  147.302491]  ? process_one_work+0x380/0x380
[  147.302498]  kthread+0x115/0x140
[  147.302504]  ? __kthread_bind_mask+0x60/0x60
[  147.302514]  ret_from_fork+0x35/0x40
[  147.302524] ---[ end trace c3bd1c322016ab56 ]---
[  147.302618] ------------[ cut here ]------------

Thanks
David


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: System crash/lockup after plugging CDC ACM device
  2020-07-15 10:57       ` David Guillen Fandos
@ 2020-07-15 11:12         ` Greg KH
  2020-07-15 11:20           ` David Guillen Fandos
  0 siblings, 1 reply; 19+ messages in thread
From: Greg KH @ 2020-07-15 11:12 UTC (permalink / raw)
  To: David Guillen Fandos; +Cc: linux-usb

On Wed, Jul 15, 2020 at 12:57:14PM +0200, David Guillen Fandos wrote:
> On Wed, 2020-07-15 at 12:50 +0200, Greg KH wrote:
> > On Wed, Jul 15, 2020 at 12:31:42PM +0200, David Guillen Fandos wrote:
> > > On Wed, 2020-07-15 at 11:30 +0200, Greg KH wrote:
> > > > On Wed, Jul 15, 2020 at 10:58:03AM +0200, David Guillen Fandos
> > > > wrote:
> > > > > Hello linux-usb,
> > > > > 
> > > > > I think I might have found a kernel bug related to the USB
> > > > > subsystem
> > > > > (cdc_acm perhaps).
> > > > > 
> > > > > Context: I was playing around with a device I'm creating,
> > > > > essentially a
> > > > > USB quad modem device that exposes four modems to the host
> > > > > system.
> > > > > This
> > > > > device is still a prototype so there's a few bugs here and
> > > > > there,
> > > > > most
> > > > > likely in the USB descriptors and control requests.
> > > > > 
> > > > > What happens: After plugging the device the system starts
> > > > > spitting
> > > > > warnings and BUGs and it locks up. Most of the time some CPUs
> > > > > get
> > > > > into
> > > > > some spinloop and never comes back (you can see it being
> > > > > detected
> > > > > by
> > > > > the watchdog after a few seconds). Generally after that the USB
> > > > > devices
> > > > > stop working completely and at some point the machine freezes
> > > > > completely. In a couple of ocasions I managed to see a bug in
> > > > > dmesg
> > > > > saying "unable to handle page fault for address XXX" and
> > > > > "Supervisor
> > > > > read access in kernel mode" "error code (0x0000) not present
> > > > > page".
> > > > > I
> > > > > could not get a trace for that one since the kernel died
> > > > > completely
> > > > > and
> > > > > my log files were truncated/lost.
> > > > > 
> > > > > Since it is happening to my two machines (both Intel but rather
> > > > > different controllers, Sunrise Point-LP USB 3.0 vs 8
> > > > > Series/C220)
> > > > > and
> > > > > with different kernel versions I suspect this might be a bug in
> > > > > the
> > > > > kernel.
> > > > > 
> > > > > I have 4 logs that I collected, they are sort of long-ish, not
> > > > > sure
> > > > > how
> > > > > to best send them to the list.
> > > > 
> > > > Send the crashes with the callback list, that should be quite
> > > > small,
> > > > right?  We don't need the full log.
> > > > 
> > > > The first crash is the most important, the others can be from the
> > > > first
> > > > one and are not reliable.
> > > > 
> > > > thanks,
> > > > 
> > > > greg k-h
> > > 
> > > Ok then, here comes one of the logs, I selected some bits only
> > > 
> > > [  147.302016] WARNING: CPU: 3 PID: 134 at kernel/workqueue.c:1473
> > > __queue_work+0x364/0x410
> > > [...]
> > > [  147.302322] Call Trace:
> > > [  147.302329]  <IRQ>
> > > [  147.302342]  queue_work_on+0x36/0x40
> > > [  147.302353]  __usb_hcd_giveback_urb+0x9c/0x110
> > > [  147.302362]  usb_giveback_urb_bh+0xa0/0xf0
> > > [  147.302372]  tasklet_action_common.constprop.0+0x66/0x100
> > > [  147.302382]  __do_softirq+0xe9/0x2dc
> > > [  147.302391]  irq_exit+0xcf/0x110
> > > [  147.302397]  do_IRQ+0x55/0xe0
> > > [  147.302408]  common_interrupt+0xf/0xf
> > > [  147.302413]  </IRQ>
> > > [...]
> > > [  184.771172] watchdog: BUG: soft lockup - CPU#3 stuck for 23s!
> > > [kworker/3:2:134]
> > 
> > That was the first message?
> > 
> > Ok, we need some more logs, how about the 30 lines right before the
> > above?
> > 
> > And what kernel version are you using?
> > 
> > thanks,
> > 
> > greg k-h
> 
> Heh I assumed you would find the 3rd stack more interesting since it
> involves more subsystems but anyway, here we got, the first one with
> more context. The trigger as you can see is me connecting the USB
> device:
> 
> [  141.445367] usb 1-1: new full-speed USB device number 5 using
> xhci_hcd
> [  141.573592] usb 1-1: New USB device found, idVendor=0483,
> idProduct=5740, bcdDevice= 2.00
> [  141.573597] usb 1-1: New USB device strings: Mfr=1, Product=2,
> SerialNumber=3
> [  141.573601] usb 1-1: Product: Quad-UART serial USB device
> [  141.573603] usb 1-1: Manufacturer: davidgf.net
> [  141.573605] usb 1-1: SerialNumber: serialno
> [  142.375007] cdc_acm 1-1:1.0: ttyACM0: USB ACM device
> [  142.376623] cdc_acm 1-1:1.2: ttyACM1: USB ACM device
> [  142.378350] cdc_acm 1-1:1.4: ttyACM2: USB ACM device
> [  142.379637] cdc_acm 1-1:1.6: ttyACM3: USB ACM device
> [  142.382473] usbcore: registered new interface driver cdc_acm
> [  142.382476] cdc_acm: USB Abstract Control Model driver for USB
> modems and ISDN adapters
> [  147.301997] ------------[ cut here ]------------
> [  147.302016] WARNING: CPU: 3 PID: 134 at kernel/workqueue.c:1473
> __queue_work+0x364/0x410
> [  147.302019] Modules linked in: cdc_acm rfcomm ccm wireguard
> curve25519_x86_64 libchacha20poly1305 chacha_x86_64 poly1305_x86_64
> libblake2s blake2s_x86_64 ip6_udp_tunnel udp_tunnel
> libcurve25519_generic libchacha libblake2s_generic nft_fib_inet
> nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4
> nf_reject_ipv6 nft_reject nft_ct nft_chain_nat ip6table_nat
> ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat
> nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c iptable_mangle
> iptable_raw iptable_security ip_set nf_tables nfnetlink ip6table_filter
> ip6_tables iptable_filter cmac vboxnetadp(OE) vboxnetflt(OE) bnep
> vboxdrv(OE) sunrpc vfat fat uvcvideo videobuf2_vmalloc videobuf2_memops
> videobuf2_v4l2 videobuf2_common videodev btusb btrtl btbcm btintel mc
> bluetooth ecdh_generic ecc iTCO_wdt iTCO_vendor_support mei_hdcp
> intel_rapl_msr dell_laptop x86_pkg_temp_thermal intel_powerclamp
> coretemp kvm_intel kvm irqbypass intel_cstate intel_uncore
> intel_rapl_perf iwlmvm
> [  147.302121]  snd_hda_codec_hdmi mac80211 snd_soc_skl snd_soc_sst_ipc
> snd_soc_sst_dsp dell_wmi snd_hda_ext_core dell_smbios
> snd_hda_codec_realtek dcdbas libarc4 wmi_bmof dell_wmi_descriptor
> snd_soc_acpi_intel_match snd_soc_acpi intel_wmi_thunderbolt
> snd_hda_codec_generic snd_soc_core ledtrig_audio iwlwifi pcspkr
> snd_compress ac97_bus snd_pcm_dmaengine snd_hda_intel snd_intel_dspcfg
> snd_hda_codec cfg80211 snd_hda_core snd_hwdep snd_seq snd_seq_device
> joydev snd_pcm rfkill snd_timer snd i2c_i801 soundcore idma64
> int3403_thermal intel_hid int3400_thermal sparse_keymap
> acpi_thermal_rel mei_me intel_xhci_usb_role_switch acpi_pad roles mei
> intel_pch_thermal processor_thermal_device intel_rapl_common
> int340x_thermal_zone intel_soc_dts_iosf binfmt_misc ip_tables dm_crypt
> i915 rtsx_pci_sdmmc mmc_core crct10dif_pclmul crc32_pclmul i2c_algo_bit
> cec crc32c_intel drm_kms_helper nvme ghash_clmulni_intel drm nvme_core
> serio_raw rtsx_pci hid_multitouch wmi i2c_hid video
> pinctrl_sunrisepoint pinctrl_intel
> [  147.302218]  fuse
> [  147.302230] CPU: 3 PID: 134 Comm: kworker/3:2 Tainted:
> G          IOE     5.7.7-200.fc32.x86_64 #1
> [  147.302233] Hardware name: Dell Inc. XPS 13 9350/0PWNCR, BIOS 1.12.2
> 12/15/2019
> [  147.302260] Workqueue:  0x0 (mm_percpu_wq)
> [  147.302275] RIP: 0010:__queue_work+0x364/0x410
> [  147.302282] Code: e0 f1 69 a9 00 01 1f 00 75 0f 65 48 8b 3c 25 c0 8b
> 01 00 f6 47 24 20 75 25 0f 0b 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f
> c3 <0f> 0b e9 78 fe ff ff 41 83 cc 02 49 8d 57 60 e9 5d fe ff ff e8 53
> [  147.302286] RSP: 0018:ffffbab980154e68 EFLAGS: 00010002
> [  147.302292] RAX: ffff8f551b333790 RBX: 0000000000000048 RCX:
> 0000000000000000
> [  147.302295] RDX: ffff8f551b333798 RSI: ffff8f5575803718 RDI:
> ffff8f5576daa840
> [  147.302299] RBP: ffff8f551b333790 R08: ffffffff97856cb0 R09:
> 0000000000000000
> [  147.302302] R10: 0000000000000000 R11: ffffffff97856cb8 R12:
> 0000000000000003
> [  147.302306] R13: 0000000000002000 R14: ffff8f5575c14e00 R15:
> ffff8f5576db0700
> [  147.302311] FS:  0000000000000000(0000) GS:ffff8f5576d80000(0000)
> knlGS:0000000000000000
> [  147.302315] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  147.302319] CR2: 00000000000000b0 CR3: 0000000267774004 CR4:
> 00000000003606e0
> [  147.302322] Call Trace:
> [  147.302329]  <IRQ>
> [  147.302342]  queue_work_on+0x36/0x40
> [  147.302353]  __usb_hcd_giveback_urb+0x9c/0x110
> [  147.302362]  usb_giveback_urb_bh+0xa0/0xf0

Are you sure your device is working properly and talking USB correctly
to the host?  It looks like you are just timing out for some reason.

But, that warning is showing that something is odd in the usb workqueue,
which is strange.

What type of host controller is this talking to?  And does your device
actually answer the urbs being sent to it correctly?

Using usbmon on this might be the best way to watch the USB traffic, if
you don't have a hardware protocol sniffer, which could provide some
clues as to what is going wrong.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: System crash/lockup after plugging CDC ACM device
  2020-07-15 11:12         ` Greg KH
@ 2020-07-15 11:20           ` David Guillen Fandos
  2020-07-15 12:24             ` Greg KH
  2020-07-21  8:01             ` Oliver Neukum
  0 siblings, 2 replies; 19+ messages in thread
From: David Guillen Fandos @ 2020-07-15 11:20 UTC (permalink / raw)
  To: Greg KH; +Cc: linux-usb

On Wed, 2020-07-15 at 13:12 +0200, Greg KH wrote:
> On Wed, Jul 15, 2020 at 12:57:14PM +0200, David Guillen Fandos wrote:
> > On Wed, 2020-07-15 at 12:50 +0200, Greg KH wrote:
> > > On Wed, Jul 15, 2020 at 12:31:42PM +0200, David Guillen Fandos
> > > wrote:
> > > > On Wed, 2020-07-15 at 11:30 +0200, Greg KH wrote:
> > > > > On Wed, Jul 15, 2020 at 10:58:03AM +0200, David Guillen
> > > > > Fandos
> > > > > wrote:
> > > > > > Hello linux-usb,
> > > > > > 
> > > > > > I think I might have found a kernel bug related to the USB
> > > > > > subsystem
> > > > > > (cdc_acm perhaps).
> > > > > > 
> > > > > > Context: I was playing around with a device I'm creating,
> > > > > > essentially a
> > > > > > USB quad modem device that exposes four modems to the host
> > > > > > system.
> > > > > > This
> > > > > > device is still a prototype so there's a few bugs here and
> > > > > > there,
> > > > > > most
> > > > > > likely in the USB descriptors and control requests.
> > > > > > 
> > > > > > What happens: After plugging the device the system starts
> > > > > > spitting
> > > > > > warnings and BUGs and it locks up. Most of the time some
> > > > > > CPUs
> > > > > > get
> > > > > > into
> > > > > > some spinloop and never comes back (you can see it being
> > > > > > detected
> > > > > > by
> > > > > > the watchdog after a few seconds). Generally after that the
> > > > > > USB
> > > > > > devices
> > > > > > stop working completely and at some point the machine
> > > > > > freezes
> > > > > > completely. In a couple of ocasions I managed to see a bug
> > > > > > in
> > > > > > dmesg
> > > > > > saying "unable to handle page fault for address XXX" and
> > > > > > "Supervisor
> > > > > > read access in kernel mode" "error code (0x0000) not
> > > > > > present
> > > > > > page".
> > > > > > I
> > > > > > could not get a trace for that one since the kernel died
> > > > > > completely
> > > > > > and
> > > > > > my log files were truncated/lost.
> > > > > > 
> > > > > > Since it is happening to my two machines (both Intel but
> > > > > > rather
> > > > > > different controllers, Sunrise Point-LP USB 3.0 vs 8
> > > > > > Series/C220)
> > > > > > and
> > > > > > with different kernel versions I suspect this might be a
> > > > > > bug in
> > > > > > the
> > > > > > kernel.
> > > > > > 
> > > > > > I have 4 logs that I collected, they are sort of long-ish,
> > > > > > not
> > > > > > sure
> > > > > > how
> > > > > > to best send them to the list.
> > > > > 
> > > > > Send the crashes with the callback list, that should be quite
> > > > > small,
> > > > > right?  We don't need the full log.
> > > > > 
> > > > > The first crash is the most important, the others can be from
> > > > > the
> > > > > first
> > > > > one and are not reliable.
> > > > > 
> > > > > thanks,
> > > > > 
> > > > > greg k-h
> > > > 
> > > > Ok then, here comes one of the logs, I selected some bits only
> > > > 
> > > > [  147.302016] WARNING: CPU: 3 PID: 134 at
> > > > kernel/workqueue.c:1473
> > > > __queue_work+0x364/0x410
> > > > [...]
> > > > [  147.302322] Call Trace:
> > > > [  147.302329]  <IRQ>
> > > > [  147.302342]  queue_work_on+0x36/0x40
> > > > [  147.302353]  __usb_hcd_giveback_urb+0x9c/0x110
> > > > [  147.302362]  usb_giveback_urb_bh+0xa0/0xf0
> > > > [  147.302372]  tasklet_action_common.constprop.0+0x66/0x100
> > > > [  147.302382]  __do_softirq+0xe9/0x2dc
> > > > [  147.302391]  irq_exit+0xcf/0x110
> > > > [  147.302397]  do_IRQ+0x55/0xe0
> > > > [  147.302408]  common_interrupt+0xf/0xf
> > > > [  147.302413]  </IRQ>
> > > > [...]
> > > > [  184.771172] watchdog: BUG: soft lockup - CPU#3 stuck for
> > > > 23s!
> > > > [kworker/3:2:134]
> > > 
> > > That was the first message?
> > > 
> > > Ok, we need some more logs, how about the 30 lines right before
> > > the
> > > above?
> > > 
> > > And what kernel version are you using?
> > > 
> > > thanks,
> > > 
> > > greg k-h
> > 
> > Heh I assumed you would find the 3rd stack more interesting since
> > it
> > involves more subsystems but anyway, here we got, the first one
> > with
> > more context. The trigger as you can see is me connecting the USB
> > device:
> > 
> > [  141.445367] usb 1-1: new full-speed USB device number 5 using
> > xhci_hcd
> > [  141.573592] usb 1-1: New USB device found, idVendor=0483,
> > idProduct=5740, bcdDevice= 2.00
> > [  141.573597] usb 1-1: New USB device strings: Mfr=1, Product=2,
> > SerialNumber=3
> > [  141.573601] usb 1-1: Product: Quad-UART serial USB device
> > [  141.573603] usb 1-1: Manufacturer: davidgf.net
> > [  141.573605] usb 1-1: SerialNumber: serialno
> > [  142.375007] cdc_acm 1-1:1.0: ttyACM0: USB ACM device
> > [  142.376623] cdc_acm 1-1:1.2: ttyACM1: USB ACM device
> > [  142.378350] cdc_acm 1-1:1.4: ttyACM2: USB ACM device
> > [  142.379637] cdc_acm 1-1:1.6: ttyACM3: USB ACM device
> > [  142.382473] usbcore: registered new interface driver cdc_acm
> > [  142.382476] cdc_acm: USB Abstract Control Model driver for USB
> > modems and ISDN adapters
> > [  147.301997] ------------[ cut here ]------------
> > [  147.302016] WARNING: CPU: 3 PID: 134 at kernel/workqueue.c:1473
> > __queue_work+0x364/0x410
> > [  147.302019] Modules linked in: cdc_acm rfcomm ccm wireguard
> > curve25519_x86_64 libchacha20poly1305 chacha_x86_64 poly1305_x86_64
> > libblake2s blake2s_x86_64 ip6_udp_tunnel udp_tunnel
> > libcurve25519_generic libchacha libblake2s_generic nft_fib_inet
> > nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4
> > nf_reject_ipv6 nft_reject nft_ct nft_chain_nat ip6table_nat
> > ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat
> > nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c iptable_mangle
> > iptable_raw iptable_security ip_set nf_tables nfnetlink
> > ip6table_filter
> > ip6_tables iptable_filter cmac vboxnetadp(OE) vboxnetflt(OE) bnep
> > vboxdrv(OE) sunrpc vfat fat uvcvideo videobuf2_vmalloc
> > videobuf2_memops
> > videobuf2_v4l2 videobuf2_common videodev btusb btrtl btbcm btintel
> > mc
> > bluetooth ecdh_generic ecc iTCO_wdt iTCO_vendor_support mei_hdcp
> > intel_rapl_msr dell_laptop x86_pkg_temp_thermal intel_powerclamp
> > coretemp kvm_intel kvm irqbypass intel_cstate intel_uncore
> > intel_rapl_perf iwlmvm
> > [  147.302121]  snd_hda_codec_hdmi mac80211 snd_soc_skl
> > snd_soc_sst_ipc
> > snd_soc_sst_dsp dell_wmi snd_hda_ext_core dell_smbios
> > snd_hda_codec_realtek dcdbas libarc4 wmi_bmof dell_wmi_descriptor
> > snd_soc_acpi_intel_match snd_soc_acpi intel_wmi_thunderbolt
> > snd_hda_codec_generic snd_soc_core ledtrig_audio iwlwifi pcspkr
> > snd_compress ac97_bus snd_pcm_dmaengine snd_hda_intel
> > snd_intel_dspcfg
> > snd_hda_codec cfg80211 snd_hda_core snd_hwdep snd_seq
> > snd_seq_device
> > joydev snd_pcm rfkill snd_timer snd i2c_i801 soundcore idma64
> > int3403_thermal intel_hid int3400_thermal sparse_keymap
> > acpi_thermal_rel mei_me intel_xhci_usb_role_switch acpi_pad roles
> > mei
> > intel_pch_thermal processor_thermal_device intel_rapl_common
> > int340x_thermal_zone intel_soc_dts_iosf binfmt_misc ip_tables
> > dm_crypt
> > i915 rtsx_pci_sdmmc mmc_core crct10dif_pclmul crc32_pclmul
> > i2c_algo_bit
> > cec crc32c_intel drm_kms_helper nvme ghash_clmulni_intel drm
> > nvme_core
> > serio_raw rtsx_pci hid_multitouch wmi i2c_hid video
> > pinctrl_sunrisepoint pinctrl_intel
> > [  147.302218]  fuse
> > [  147.302230] CPU: 3 PID: 134 Comm: kworker/3:2 Tainted:
> > G          IOE     5.7.7-200.fc32.x86_64 #1
> > [  147.302233] Hardware name: Dell Inc. XPS 13 9350/0PWNCR, BIOS
> > 1.12.2
> > 12/15/2019
> > [  147.302260] Workqueue:  0x0 (mm_percpu_wq)
> > [  147.302275] RIP: 0010:__queue_work+0x364/0x410
> > [  147.302282] Code: e0 f1 69 a9 00 01 1f 00 75 0f 65 48 8b 3c 25
> > c0 8b
> > 01 00 f6 47 24 20 75 25 0f 0b 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e
> > 41 5f
> > c3 <0f> 0b e9 78 fe ff ff 41 83 cc 02 49 8d 57 60 e9 5d fe ff ff e8
> > 53
> > [  147.302286] RSP: 0018:ffffbab980154e68 EFLAGS: 00010002
> > [  147.302292] RAX: ffff8f551b333790 RBX: 0000000000000048 RCX:
> > 0000000000000000
> > [  147.302295] RDX: ffff8f551b333798 RSI: ffff8f5575803718 RDI:
> > ffff8f5576daa840
> > [  147.302299] RBP: ffff8f551b333790 R08: ffffffff97856cb0 R09:
> > 0000000000000000
> > [  147.302302] R10: 0000000000000000 R11: ffffffff97856cb8 R12:
> > 0000000000000003
> > [  147.302306] R13: 0000000000002000 R14: ffff8f5575c14e00 R15:
> > ffff8f5576db0700
> > [  147.302311] FS:  0000000000000000(0000)
> > GS:ffff8f5576d80000(0000)
> > knlGS:0000000000000000
> > [  147.302315] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [  147.302319] CR2: 00000000000000b0 CR3: 0000000267774004 CR4:
> > 00000000003606e0
> > [  147.302322] Call Trace:
> > [  147.302329]  <IRQ>
> > [  147.302342]  queue_work_on+0x36/0x40
> > [  147.302353]  __usb_hcd_giveback_urb+0x9c/0x110
> > [  147.302362]  usb_giveback_urb_bh+0xa0/0xf0
> 
> Are you sure your device is working properly and talking USB
> correctly
> to the host?  It looks like you are just timing out for some reason.
> 
> But, that warning is showing that something is odd in the usb
> workqueue,
> which is strange.
> 
> What type of host controller is this talking to?  And does your
> device
> actually answer the urbs being sent to it correctly?
> 
> Using usbmon on this might be the best way to watch the USB traffic,
> if
> you don't have a hardware protocol sniffer, which could provide some
> clues as to what is going wrong.
> 
> thanks,
> 
> greg k-h

As I mentioned the device is likely buggy, since I'm developing and
debugging it.
However my ability to debug and fix any issue is limited by the fact
that the kernel decides to stop working as usual, making my USB
keyboard and mouse useless, if not crashing later due to soft lockups.

Shouldn't the kernel be resilient to such devices? I've developed quite
a few USB devices in the past and I've never ran into things like this
on Linux (Windows is another story, rather 'easy' to crash, hang or
bluescreen). In any case since I do not have access to a hardware
debugger and the machine goes bananas (preventing me from using
Wireshark) I do not think I can further debug this issue. I could try
to find a kernel version where this does not crash the machine (only
tested 5.6.X and 5.7.X so far). Or perhaps use VirtualBox, but I'd need
to convice the host OS to ignore the USB device and just forward it to
the guest.

The firmware for this device can be easily tweaked to expose an
arbitrary number (up to 7 I think) of CDC ACM interfaces. When I use
one or two there's no issues, three has had some issues (but did not
investigate further). Going to four is what consistently triggers
kernel issues.

Let me know if you have any ideas on how I could proceed to better
debug this.

Thanks
David



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: System crash/lockup after plugging CDC ACM device
  2020-07-15 11:20           ` David Guillen Fandos
@ 2020-07-15 12:24             ` Greg KH
  2020-07-15 17:03               ` David Guillen Fandos
  2020-07-21  8:01             ` Oliver Neukum
  1 sibling, 1 reply; 19+ messages in thread
From: Greg KH @ 2020-07-15 12:24 UTC (permalink / raw)
  To: David Guillen Fandos; +Cc: linux-usb

On Wed, Jul 15, 2020 at 01:20:54PM +0200, David Guillen Fandos wrote:
> On Wed, 2020-07-15 at 13:12 +0200, Greg KH wrote:
> > On Wed, Jul 15, 2020 at 12:57:14PM +0200, David Guillen Fandos wrote:
> > > On Wed, 2020-07-15 at 12:50 +0200, Greg KH wrote:
> > > > On Wed, Jul 15, 2020 at 12:31:42PM +0200, David Guillen Fandos
> > > > wrote:
> > > > > On Wed, 2020-07-15 at 11:30 +0200, Greg KH wrote:
> > > > > > On Wed, Jul 15, 2020 at 10:58:03AM +0200, David Guillen
> > > > > > Fandos
> > > > > > wrote:
> > > > > > > Hello linux-usb,
> > > > > > > 
> > > > > > > I think I might have found a kernel bug related to the USB
> > > > > > > subsystem
> > > > > > > (cdc_acm perhaps).
> > > > > > > 
> > > > > > > Context: I was playing around with a device I'm creating,
> > > > > > > essentially a
> > > > > > > USB quad modem device that exposes four modems to the host
> > > > > > > system.
> > > > > > > This
> > > > > > > device is still a prototype so there's a few bugs here and
> > > > > > > there,
> > > > > > > most
> > > > > > > likely in the USB descriptors and control requests.
> > > > > > > 
> > > > > > > What happens: After plugging the device the system starts
> > > > > > > spitting
> > > > > > > warnings and BUGs and it locks up. Most of the time some
> > > > > > > CPUs
> > > > > > > get
> > > > > > > into
> > > > > > > some spinloop and never comes back (you can see it being
> > > > > > > detected
> > > > > > > by
> > > > > > > the watchdog after a few seconds). Generally after that the
> > > > > > > USB
> > > > > > > devices
> > > > > > > stop working completely and at some point the machine
> > > > > > > freezes
> > > > > > > completely. In a couple of ocasions I managed to see a bug
> > > > > > > in
> > > > > > > dmesg
> > > > > > > saying "unable to handle page fault for address XXX" and
> > > > > > > "Supervisor
> > > > > > > read access in kernel mode" "error code (0x0000) not
> > > > > > > present
> > > > > > > page".
> > > > > > > I
> > > > > > > could not get a trace for that one since the kernel died
> > > > > > > completely
> > > > > > > and
> > > > > > > my log files were truncated/lost.
> > > > > > > 
> > > > > > > Since it is happening to my two machines (both Intel but
> > > > > > > rather
> > > > > > > different controllers, Sunrise Point-LP USB 3.0 vs 8
> > > > > > > Series/C220)
> > > > > > > and
> > > > > > > with different kernel versions I suspect this might be a
> > > > > > > bug in
> > > > > > > the
> > > > > > > kernel.
> > > > > > > 
> > > > > > > I have 4 logs that I collected, they are sort of long-ish,
> > > > > > > not
> > > > > > > sure
> > > > > > > how
> > > > > > > to best send them to the list.
> > > > > > 
> > > > > > Send the crashes with the callback list, that should be quite
> > > > > > small,
> > > > > > right?  We don't need the full log.
> > > > > > 
> > > > > > The first crash is the most important, the others can be from
> > > > > > the
> > > > > > first
> > > > > > one and are not reliable.
> > > > > > 
> > > > > > thanks,
> > > > > > 
> > > > > > greg k-h
> > > > > 
> > > > > Ok then, here comes one of the logs, I selected some bits only
> > > > > 
> > > > > [  147.302016] WARNING: CPU: 3 PID: 134 at
> > > > > kernel/workqueue.c:1473
> > > > > __queue_work+0x364/0x410
> > > > > [...]
> > > > > [  147.302322] Call Trace:
> > > > > [  147.302329]  <IRQ>
> > > > > [  147.302342]  queue_work_on+0x36/0x40
> > > > > [  147.302353]  __usb_hcd_giveback_urb+0x9c/0x110
> > > > > [  147.302362]  usb_giveback_urb_bh+0xa0/0xf0
> > > > > [  147.302372]  tasklet_action_common.constprop.0+0x66/0x100
> > > > > [  147.302382]  __do_softirq+0xe9/0x2dc
> > > > > [  147.302391]  irq_exit+0xcf/0x110
> > > > > [  147.302397]  do_IRQ+0x55/0xe0
> > > > > [  147.302408]  common_interrupt+0xf/0xf
> > > > > [  147.302413]  </IRQ>
> > > > > [...]
> > > > > [  184.771172] watchdog: BUG: soft lockup - CPU#3 stuck for
> > > > > 23s!
> > > > > [kworker/3:2:134]
> > > > 
> > > > That was the first message?
> > > > 
> > > > Ok, we need some more logs, how about the 30 lines right before
> > > > the
> > > > above?
> > > > 
> > > > And what kernel version are you using?
> > > > 
> > > > thanks,
> > > > 
> > > > greg k-h
> > > 
> > > Heh I assumed you would find the 3rd stack more interesting since
> > > it
> > > involves more subsystems but anyway, here we got, the first one
> > > with
> > > more context. The trigger as you can see is me connecting the USB
> > > device:
> > > 
> > > [  141.445367] usb 1-1: new full-speed USB device number 5 using
> > > xhci_hcd
> > > [  141.573592] usb 1-1: New USB device found, idVendor=0483,
> > > idProduct=5740, bcdDevice= 2.00
> > > [  141.573597] usb 1-1: New USB device strings: Mfr=1, Product=2,
> > > SerialNumber=3
> > > [  141.573601] usb 1-1: Product: Quad-UART serial USB device
> > > [  141.573603] usb 1-1: Manufacturer: davidgf.net
> > > [  141.573605] usb 1-1: SerialNumber: serialno
> > > [  142.375007] cdc_acm 1-1:1.0: ttyACM0: USB ACM device
> > > [  142.376623] cdc_acm 1-1:1.2: ttyACM1: USB ACM device
> > > [  142.378350] cdc_acm 1-1:1.4: ttyACM2: USB ACM device
> > > [  142.379637] cdc_acm 1-1:1.6: ttyACM3: USB ACM device
> > > [  142.382473] usbcore: registered new interface driver cdc_acm
> > > [  142.382476] cdc_acm: USB Abstract Control Model driver for USB
> > > modems and ISDN adapters
> > > [  147.301997] ------------[ cut here ]------------
> > > [  147.302016] WARNING: CPU: 3 PID: 134 at kernel/workqueue.c:1473
> > > __queue_work+0x364/0x410
> > > [  147.302019] Modules linked in: cdc_acm rfcomm ccm wireguard
> > > curve25519_x86_64 libchacha20poly1305 chacha_x86_64 poly1305_x86_64
> > > libblake2s blake2s_x86_64 ip6_udp_tunnel udp_tunnel
> > > libcurve25519_generic libchacha libblake2s_generic nft_fib_inet
> > > nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4
> > > nf_reject_ipv6 nft_reject nft_ct nft_chain_nat ip6table_nat
> > > ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat
> > > nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c iptable_mangle
> > > iptable_raw iptable_security ip_set nf_tables nfnetlink
> > > ip6table_filter
> > > ip6_tables iptable_filter cmac vboxnetadp(OE) vboxnetflt(OE) bnep
> > > vboxdrv(OE) sunrpc vfat fat uvcvideo videobuf2_vmalloc
> > > videobuf2_memops
> > > videobuf2_v4l2 videobuf2_common videodev btusb btrtl btbcm btintel
> > > mc
> > > bluetooth ecdh_generic ecc iTCO_wdt iTCO_vendor_support mei_hdcp
> > > intel_rapl_msr dell_laptop x86_pkg_temp_thermal intel_powerclamp
> > > coretemp kvm_intel kvm irqbypass intel_cstate intel_uncore
> > > intel_rapl_perf iwlmvm
> > > [  147.302121]  snd_hda_codec_hdmi mac80211 snd_soc_skl
> > > snd_soc_sst_ipc
> > > snd_soc_sst_dsp dell_wmi snd_hda_ext_core dell_smbios
> > > snd_hda_codec_realtek dcdbas libarc4 wmi_bmof dell_wmi_descriptor
> > > snd_soc_acpi_intel_match snd_soc_acpi intel_wmi_thunderbolt
> > > snd_hda_codec_generic snd_soc_core ledtrig_audio iwlwifi pcspkr
> > > snd_compress ac97_bus snd_pcm_dmaengine snd_hda_intel
> > > snd_intel_dspcfg
> > > snd_hda_codec cfg80211 snd_hda_core snd_hwdep snd_seq
> > > snd_seq_device
> > > joydev snd_pcm rfkill snd_timer snd i2c_i801 soundcore idma64
> > > int3403_thermal intel_hid int3400_thermal sparse_keymap
> > > acpi_thermal_rel mei_me intel_xhci_usb_role_switch acpi_pad roles
> > > mei
> > > intel_pch_thermal processor_thermal_device intel_rapl_common
> > > int340x_thermal_zone intel_soc_dts_iosf binfmt_misc ip_tables
> > > dm_crypt
> > > i915 rtsx_pci_sdmmc mmc_core crct10dif_pclmul crc32_pclmul
> > > i2c_algo_bit
> > > cec crc32c_intel drm_kms_helper nvme ghash_clmulni_intel drm
> > > nvme_core
> > > serio_raw rtsx_pci hid_multitouch wmi i2c_hid video
> > > pinctrl_sunrisepoint pinctrl_intel
> > > [  147.302218]  fuse
> > > [  147.302230] CPU: 3 PID: 134 Comm: kworker/3:2 Tainted:
> > > G          IOE     5.7.7-200.fc32.x86_64 #1
> > > [  147.302233] Hardware name: Dell Inc. XPS 13 9350/0PWNCR, BIOS
> > > 1.12.2
> > > 12/15/2019
> > > [  147.302260] Workqueue:  0x0 (mm_percpu_wq)
> > > [  147.302275] RIP: 0010:__queue_work+0x364/0x410
> > > [  147.302282] Code: e0 f1 69 a9 00 01 1f 00 75 0f 65 48 8b 3c 25
> > > c0 8b
> > > 01 00 f6 47 24 20 75 25 0f 0b 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e
> > > 41 5f
> > > c3 <0f> 0b e9 78 fe ff ff 41 83 cc 02 49 8d 57 60 e9 5d fe ff ff e8
> > > 53
> > > [  147.302286] RSP: 0018:ffffbab980154e68 EFLAGS: 00010002
> > > [  147.302292] RAX: ffff8f551b333790 RBX: 0000000000000048 RCX:
> > > 0000000000000000
> > > [  147.302295] RDX: ffff8f551b333798 RSI: ffff8f5575803718 RDI:
> > > ffff8f5576daa840
> > > [  147.302299] RBP: ffff8f551b333790 R08: ffffffff97856cb0 R09:
> > > 0000000000000000
> > > [  147.302302] R10: 0000000000000000 R11: ffffffff97856cb8 R12:
> > > 0000000000000003
> > > [  147.302306] R13: 0000000000002000 R14: ffff8f5575c14e00 R15:
> > > ffff8f5576db0700
> > > [  147.302311] FS:  0000000000000000(0000)
> > > GS:ffff8f5576d80000(0000)
> > > knlGS:0000000000000000
> > > [  147.302315] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [  147.302319] CR2: 00000000000000b0 CR3: 0000000267774004 CR4:
> > > 00000000003606e0
> > > [  147.302322] Call Trace:
> > > [  147.302329]  <IRQ>
> > > [  147.302342]  queue_work_on+0x36/0x40
> > > [  147.302353]  __usb_hcd_giveback_urb+0x9c/0x110
> > > [  147.302362]  usb_giveback_urb_bh+0xa0/0xf0
> > 
> > Are you sure your device is working properly and talking USB
> > correctly
> > to the host?  It looks like you are just timing out for some reason.
> > 
> > But, that warning is showing that something is odd in the usb
> > workqueue,
> > which is strange.
> > 
> > What type of host controller is this talking to?  And does your
> > device
> > actually answer the urbs being sent to it correctly?
> > 
> > Using usbmon on this might be the best way to watch the USB traffic,
> > if
> > you don't have a hardware protocol sniffer, which could provide some
> > clues as to what is going wrong.
> > 
> > thanks,
> > 
> > greg k-h
> 
> As I mentioned the device is likely buggy, since I'm developing and
> debugging it.
> However my ability to debug and fix any issue is limited by the fact
> that the kernel decides to stop working as usual, making my USB
> keyboard and mouse useless, if not crashing later due to soft lockups.
> 
> Shouldn't the kernel be resilient to such devices?

Yes it should, we should not crash.


> I've developed quite
> a few USB devices in the past and I've never ran into things like this
> on Linux (Windows is another story, rather 'easy' to crash, hang or
> bluescreen). In any case since I do not have access to a hardware
> debugger and the machine goes bananas (preventing me from using
> Wireshark) I do not think I can further debug this issue. I could try
> to find a kernel version where this does not crash the machine (only
> tested 5.6.X and 5.7.X so far). Or perhaps use VirtualBox, but I'd need
> to convice the host OS to ignore the USB device and just forward it to
> the guest.

Trying to trace down what part of the setup is failing, by using usbmon,
will be good to try to figure out what the problem is here, if you can
do that.

> The firmware for this device can be easily tweaked to expose an
> arbitrary number (up to 7 I think) of CDC ACM interfaces. When I use
> one or two there's no issues, three has had some issues (but did not
> investigate further). Going to four is what consistently triggers
> kernel issues.

Hm, that might be a clue, what does the output of 'lsusb -v' for that
device when you have 3 and then 4 interfaces?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: System crash/lockup after plugging CDC ACM device
  2020-07-15 12:24             ` Greg KH
@ 2020-07-15 17:03               ` David Guillen Fandos
  2020-07-15 20:39                 ` Daniele Palmas
  2020-07-16 14:30                 ` David Guillen Fandos
  0 siblings, 2 replies; 19+ messages in thread
From: David Guillen Fandos @ 2020-07-15 17:03 UTC (permalink / raw)
  To: Greg KH; +Cc: linux-usb

On Wed, 2020-07-15 at 14:24 +0200, Greg KH wrote:
> On Wed, Jul 15, 2020 at 01:20:54PM +0200, David Guillen Fandos wrote:
> > On Wed, 2020-07-15 at 13:12 +0200, Greg KH wrote:
> > > On Wed, Jul 15, 2020 at 12:57:14PM +0200, David Guillen Fandos
> > > wrote:
> > > > On Wed, 2020-07-15 at 12:50 +0200, Greg KH wrote:
> > > > > On Wed, Jul 15, 2020 at 12:31:42PM +0200, David Guillen
> > > > > Fandos
> > > > > wrote:
> > > > > > On Wed, 2020-07-15 at 11:30 +0200, Greg KH wrote:
> > > > > > > On Wed, Jul 15, 2020 at 10:58:03AM +0200, David Guillen
> > > > > > > Fandos
> > > > > > > wrote:
> > > > > > > > Hello linux-usb,
> > > > > > > > 
> > > > > > > > I think I might have found a kernel bug related to the
> > > > > > > > USB
> > > > > > > > subsystem
> > > > > > > > (cdc_acm perhaps).
> > > > > > > > 
> > > > > > > > Context: I was playing around with a device I'm
> > > > > > > > creating,
> > > > > > > > essentially a
> > > > > > > > USB quad modem device that exposes four modems to the
> > > > > > > > host
> > > > > > > > system.
> > > > > > > > This
> > > > > > > > device is still a prototype so there's a few bugs here
> > > > > > > > and
> > > > > > > > there,
> > > > > > > > most
> > > > > > > > likely in the USB descriptors and control requests.
> > > > > > > > 
> > > > > > > > What happens: After plugging the device the system
> > > > > > > > starts
> > > > > > > > spitting
> > > > > > > > warnings and BUGs and it locks up. Most of the time
> > > > > > > > some
> > > > > > > > CPUs
> > > > > > > > get
> > > > > > > > into
> > > > > > > > some spinloop and never comes back (you can see it
> > > > > > > > being
> > > > > > > > detected
> > > > > > > > by
> > > > > > > > the watchdog after a few seconds). Generally after that
> > > > > > > > the
> > > > > > > > USB
> > > > > > > > devices
> > > > > > > > stop working completely and at some point the machine
> > > > > > > > freezes
> > > > > > > > completely. In a couple of ocasions I managed to see a
> > > > > > > > bug
> > > > > > > > in
> > > > > > > > dmesg
> > > > > > > > saying "unable to handle page fault for address XXX"
> > > > > > > > and
> > > > > > > > "Supervisor
> > > > > > > > read access in kernel mode" "error code (0x0000) not
> > > > > > > > present
> > > > > > > > page".
> > > > > > > > I
> > > > > > > > could not get a trace for that one since the kernel
> > > > > > > > died
> > > > > > > > completely
> > > > > > > > and
> > > > > > > > my log files were truncated/lost.
> > > > > > > > 
> > > > > > > > Since it is happening to my two machines (both Intel
> > > > > > > > but
> > > > > > > > rather
> > > > > > > > different controllers, Sunrise Point-LP USB 3.0 vs 8
> > > > > > > > Series/C220)
> > > > > > > > and
> > > > > > > > with different kernel versions I suspect this might be
> > > > > > > > a
> > > > > > > > bug in
> > > > > > > > the
> > > > > > > > kernel.
> > > > > > > > 
> > > > > > > > I have 4 logs that I collected, they are sort of long-
> > > > > > > > ish,
> > > > > > > > not
> > > > > > > > sure
> > > > > > > > how
> > > > > > > > to best send them to the list.
> > > > > > > 
> > > > > > > Send the crashes with the callback list, that should be
> > > > > > > quite
> > > > > > > small,
> > > > > > > right?  We don't need the full log.
> > > > > > > 
> > > > > > > The first crash is the most important, the others can be
> > > > > > > from
> > > > > > > the
> > > > > > > first
> > > > > > > one and are not reliable.
> > > > > > > 
> > > > > > > thanks,
> > > > > > > 
> > > > > > > greg k-h
> > > > > > 
> > > > > > Ok then, here comes one of the logs, I selected some bits
> > > > > > only
> > > > > > 
> > > > > > [  147.302016] WARNING: CPU: 3 PID: 134 at
> > > > > > kernel/workqueue.c:1473
> > > > > > __queue_work+0x364/0x410
> > > > > > [...]
> > > > > > [  147.302322] Call Trace:
> > > > > > [  147.302329]  <IRQ>
> > > > > > [  147.302342]  queue_work_on+0x36/0x40
> > > > > > [  147.302353]  __usb_hcd_giveback_urb+0x9c/0x110
> > > > > > [  147.302362]  usb_giveback_urb_bh+0xa0/0xf0
> > > > > > [  147.302372]  tasklet_action_common.constprop.0+0x66/0x10
> > > > > > 0
> > > > > > [  147.302382]  __do_softirq+0xe9/0x2dc
> > > > > > [  147.302391]  irq_exit+0xcf/0x110
> > > > > > [  147.302397]  do_IRQ+0x55/0xe0
> > > > > > [  147.302408]  common_interrupt+0xf/0xf
> > > > > > [  147.302413]  </IRQ>
> > > > > > [...]
> > > > > > [  184.771172] watchdog: BUG: soft lockup - CPU#3 stuck for
> > > > > > 23s!
> > > > > > [kworker/3:2:134]
> > > > > 
> > > > > That was the first message?
> > > > > 
> > > > > Ok, we need some more logs, how about the 30 lines right
> > > > > before
> > > > > the
> > > > > above?
> > > > > 
> > > > > And what kernel version are you using?
> > > > > 
> > > > > thanks,
> > > > > 
> > > > > greg k-h
> > > > 
> > > > Heh I assumed you would find the 3rd stack more interesting
> > > > since
> > > > it
> > > > involves more subsystems but anyway, here we got, the first one
> > > > with
> > > > more context. The trigger as you can see is me connecting the
> > > > USB
> > > > device:
> > > > 
> > > > [  141.445367] usb 1-1: new full-speed USB device number 5
> > > > using
> > > > xhci_hcd
> > > > [  141.573592] usb 1-1: New USB device found, idVendor=0483,
> > > > idProduct=5740, bcdDevice= 2.00
> > > > [  141.573597] usb 1-1: New USB device strings: Mfr=1,
> > > > Product=2,
> > > > SerialNumber=3
> > > > [  141.573601] usb 1-1: Product: Quad-UART serial USB device
> > > > [  141.573603] usb 1-1: Manufacturer: davidgf.net
> > > > [  141.573605] usb 1-1: SerialNumber: serialno
> > > > [  142.375007] cdc_acm 1-1:1.0: ttyACM0: USB ACM device
> > > > [  142.376623] cdc_acm 1-1:1.2: ttyACM1: USB ACM device
> > > > [  142.378350] cdc_acm 1-1:1.4: ttyACM2: USB ACM device
> > > > [  142.379637] cdc_acm 1-1:1.6: ttyACM3: USB ACM device
> > > > [  142.382473] usbcore: registered new interface driver cdc_acm
> > > > [  142.382476] cdc_acm: USB Abstract Control Model driver for
> > > > USB
> > > > modems and ISDN adapters
> > > > [  147.301997] ------------[ cut here ]------------
> > > > [  147.302016] WARNING: CPU: 3 PID: 134 at
> > > > kernel/workqueue.c:1473
> > > > __queue_work+0x364/0x410
> > > > [  147.302019] Modules linked in: cdc_acm rfcomm ccm wireguard
> > > > curve25519_x86_64 libchacha20poly1305 chacha_x86_64
> > > > poly1305_x86_64
> > > > libblake2s blake2s_x86_64 ip6_udp_tunnel udp_tunnel
> > > > libcurve25519_generic libchacha libblake2s_generic nft_fib_inet
> > > > nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet
> > > > nf_reject_ipv4
> > > > nf_reject_ipv6 nft_reject nft_ct nft_chain_nat ip6table_nat
> > > > ip6table_mangle ip6table_raw ip6table_security iptable_nat
> > > > nf_nat
> > > > nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c
> > > > iptable_mangle
> > > > iptable_raw iptable_security ip_set nf_tables nfnetlink
> > > > ip6table_filter
> > > > ip6_tables iptable_filter cmac vboxnetadp(OE) vboxnetflt(OE)
> > > > bnep
> > > > vboxdrv(OE) sunrpc vfat fat uvcvideo videobuf2_vmalloc
> > > > videobuf2_memops
> > > > videobuf2_v4l2 videobuf2_common videodev btusb btrtl btbcm
> > > > btintel
> > > > mc
> > > > bluetooth ecdh_generic ecc iTCO_wdt iTCO_vendor_support
> > > > mei_hdcp
> > > > intel_rapl_msr dell_laptop x86_pkg_temp_thermal
> > > > intel_powerclamp
> > > > coretemp kvm_intel kvm irqbypass intel_cstate intel_uncore
> > > > intel_rapl_perf iwlmvm
> > > > [  147.302121]  snd_hda_codec_hdmi mac80211 snd_soc_skl
> > > > snd_soc_sst_ipc
> > > > snd_soc_sst_dsp dell_wmi snd_hda_ext_core dell_smbios
> > > > snd_hda_codec_realtek dcdbas libarc4 wmi_bmof
> > > > dell_wmi_descriptor
> > > > snd_soc_acpi_intel_match snd_soc_acpi intel_wmi_thunderbolt
> > > > snd_hda_codec_generic snd_soc_core ledtrig_audio iwlwifi pcspkr
> > > > snd_compress ac97_bus snd_pcm_dmaengine snd_hda_intel
> > > > snd_intel_dspcfg
> > > > snd_hda_codec cfg80211 snd_hda_core snd_hwdep snd_seq
> > > > snd_seq_device
> > > > joydev snd_pcm rfkill snd_timer snd i2c_i801 soundcore idma64
> > > > int3403_thermal intel_hid int3400_thermal sparse_keymap
> > > > acpi_thermal_rel mei_me intel_xhci_usb_role_switch acpi_pad
> > > > roles
> > > > mei
> > > > intel_pch_thermal processor_thermal_device intel_rapl_common
> > > > int340x_thermal_zone intel_soc_dts_iosf binfmt_misc ip_tables
> > > > dm_crypt
> > > > i915 rtsx_pci_sdmmc mmc_core crct10dif_pclmul crc32_pclmul
> > > > i2c_algo_bit
> > > > cec crc32c_intel drm_kms_helper nvme ghash_clmulni_intel drm
> > > > nvme_core
> > > > serio_raw rtsx_pci hid_multitouch wmi i2c_hid video
> > > > pinctrl_sunrisepoint pinctrl_intel
> > > > [  147.302218]  fuse
> > > > [  147.302230] CPU: 3 PID: 134 Comm: kworker/3:2 Tainted:
> > > > G          IOE     5.7.7-200.fc32.x86_64 #1
> > > > [  147.302233] Hardware name: Dell Inc. XPS 13 9350/0PWNCR,
> > > > BIOS
> > > > 1.12.2
> > > > 12/15/2019
> > > > [  147.302260] Workqueue:  0x0 (mm_percpu_wq)
> > > > [  147.302275] RIP: 0010:__queue_work+0x364/0x410
> > > > [  147.302282] Code: e0 f1 69 a9 00 01 1f 00 75 0f 65 48 8b 3c
> > > > 25
> > > > c0 8b
> > > > 01 00 f6 47 24 20 75 25 0f 0b 48 83 c4 10 5b 5d 41 5c 41 5d 41
> > > > 5e
> > > > 41 5f
> > > > c3 <0f> 0b e9 78 fe ff ff 41 83 cc 02 49 8d 57 60 e9 5d fe ff
> > > > ff e8
> > > > 53
> > > > [  147.302286] RSP: 0018:ffffbab980154e68 EFLAGS: 00010002
> > > > [  147.302292] RAX: ffff8f551b333790 RBX: 0000000000000048 RCX:
> > > > 0000000000000000
> > > > [  147.302295] RDX: ffff8f551b333798 RSI: ffff8f5575803718 RDI:
> > > > ffff8f5576daa840
> > > > [  147.302299] RBP: ffff8f551b333790 R08: ffffffff97856cb0 R09:
> > > > 0000000000000000
> > > > [  147.302302] R10: 0000000000000000 R11: ffffffff97856cb8 R12:
> > > > 0000000000000003
> > > > [  147.302306] R13: 0000000000002000 R14: ffff8f5575c14e00 R15:
> > > > ffff8f5576db0700
> > > > [  147.302311] FS:  0000000000000000(0000)
> > > > GS:ffff8f5576d80000(0000)
> > > > knlGS:0000000000000000
> > > > [  147.302315] CS:  0010 DS: 0000 ES: 0000 CR0:
> > > > 0000000080050033
> > > > [  147.302319] CR2: 00000000000000b0 CR3: 0000000267774004 CR4:
> > > > 00000000003606e0
> > > > [  147.302322] Call Trace:
> > > > [  147.302329]  <IRQ>
> > > > [  147.302342]  queue_work_on+0x36/0x40
> > > > [  147.302353]  __usb_hcd_giveback_urb+0x9c/0x110
> > > > [  147.302362]  usb_giveback_urb_bh+0xa0/0xf0
> > > 
> > > Are you sure your device is working properly and talking USB
> > > correctly
> > > to the host?  It looks like you are just timing out for some
> > > reason.
> > > 
> > > But, that warning is showing that something is odd in the usb
> > > workqueue,
> > > which is strange.
> > > 
> > > What type of host controller is this talking to?  And does your
> > > device
> > > actually answer the urbs being sent to it correctly?
> > > 
> > > Using usbmon on this might be the best way to watch the USB
> > > traffic,
> > > if
> > > you don't have a hardware protocol sniffer, which could provide
> > > some
> > > clues as to what is going wrong.
> > > 
> > > thanks,
> > > 
> > > greg k-h
> > 
> > As I mentioned the device is likely buggy, since I'm developing and
> > debugging it.
> > However my ability to debug and fix any issue is limited by the
> > fact
> > that the kernel decides to stop working as usual, making my USB
> > keyboard and mouse useless, if not crashing later due to soft
> > lockups.
> > 
> > Shouldn't the kernel be resilient to such devices?
> 
> Yes it should, we should not crash.
> 
> 
> > I've developed quite
> > a few USB devices in the past and I've never ran into things like
> > this
> > on Linux (Windows is another story, rather 'easy' to crash, hang or
> > bluescreen). In any case since I do not have access to a hardware
> > debugger and the machine goes bananas (preventing me from using
> > Wireshark) I do not think I can further debug this issue. I could
> > try
> > to find a kernel version where this does not crash the machine
> > (only
> > tested 5.6.X and 5.7.X so far). Or perhaps use VirtualBox, but I'd
> > need
> > to convice the host OS to ignore the USB device and just forward it
> > to
> > the guest.
> 
> Trying to trace down what part of the setup is failing, by using
> usbmon,
> will be good to try to figure out what the problem is here, if you
> can
> do that.
> 
> > The firmware for this device can be easily tweaked to expose an
> > arbitrary number (up to 7 I think) of CDC ACM interfaces. When I
> > use
> > one or two there's no issues, three has had some issues (but did
> > not
> > investigate further). Going to four is what consistently triggers
> > kernel issues.
> 
> Hm, that might be a clue, what does the output of 'lsusb -v' for that
> device when you have 3 and then 4 interfaces?
> 
> thanks,
> 
> greg k-h

Hello,

I will try to see how I can debug further, perhaps I can locate a
machine or kernel that does not crash. Another option can be to disable
the firmware down to the minimum so that it does not response to the
bulk endpoints (just to the enumeration and some basic things), to rule
out a bad behaviour.

The USB descriptor is what you could imagine, just replicate the two
ACM interfaces (control & data) and add more endpoints. Here goes the
one with three ports. Note this one seems to make the kernel crash just
like the one with four. The only ones that work well are 1 and 2 ports.
Since I'm not aware of any other commercial solutions (apart from FTDI)
that use more than 2 ACM ports, could that be the issue? Meaning
there's a bug somewhere and no commercial hardware that can trigger it.

For reference the diff between two and three ports (in lsusb) is that
it's missing the last two interaces (with the 3 EPs described). Of
course the bNumInterfaces is 4 instead of 6, and wTotalLength has a
different value.

Hope this can help somehow.
Thanks
David

Bus 003 Device 012: ID 0483:5740 STMicroelectronics Virtual COM Port
Device Descriptor:
  bLength                18
  bDescriptorType         1
  bcdUSB               2.00
  bDeviceClass            2 Communications
  bDeviceSubClass         0 
  bDeviceProtocol         0 
  bMaxPacketSize0        64
  idVendor           0x0483 STMicroelectronics
  idProduct          0x5740 Virtual COM Port
  bcdDevice            2.00
  iManufacturer           1 
  iProduct                2 
  iSerial                 3 
  bNumConfigurations      1
  Configuration Descriptor:
    bLength                 9
    bDescriptorType         2
    wTotalLength       0x00b7
    bNumInterfaces          6
    bConfigurationValue     1
    iConfiguration          0 
    bmAttributes         0x80
      (Bus Powered)
    MaxPower              100mA
    Interface Descriptor:
      bLength                 9
      bDescriptorType         4
      bInterfaceNumber        0
      bAlternateSetting       0
      bNumEndpoints           1
      bInterfaceClass         2 Communications
      bInterfaceSubClass      2 Abstract (modem)
      bInterfaceProtocol      1 AT-commands (v.25ter)
      iInterface              0 
      CDC Header:
        bcdCDC               1.10
      CDC Call Management:
        bmCapabilities       0x00
        bDataInterface          1
      CDC ACM:
        bmCapabilities       0x00
      CDC Union:
        bMasterInterface        0
        bSlaveInterface         1 
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x82  EP 2 IN
        bmAttributes            3
          Transfer Type            Interrupt
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0010  1x 16 bytes
        bInterval             255
    Interface Descriptor:
      bLength                 9
      bDescriptorType         4
      bInterfaceNumber        1
      bAlternateSetting       0
      bNumEndpoints           2
      bInterfaceClass        10 CDC Data
      bInterfaceSubClass      0 
      bInterfaceProtocol      0 
      iInterface              0 
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x01  EP 1 OUT
        bmAttributes            2
          Transfer Type            Bulk
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0040  1x 64 bytes
        bInterval               1
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x81  EP 1 IN
        bmAttributes            2
          Transfer Type            Bulk
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0040  1x 64 bytes
        bInterval               1
    Interface Descriptor:
      bLength                 9
      bDescriptorType         4
      bInterfaceNumber        2
      bAlternateSetting       0
      bNumEndpoints           1
      bInterfaceClass         2 Communications
      bInterfaceSubClass      2 Abstract (modem)
      bInterfaceProtocol      1 AT-commands (v.25ter)
      iInterface              0 
      CDC Header:
        bcdCDC               1.10
      CDC Call Management:
        bmCapabilities       0x00
        bDataInterface          3
      CDC ACM:
        bmCapabilities       0x00
      CDC Union:
        bMasterInterface        2
        bSlaveInterface         3 
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x84  EP 4 IN
        bmAttributes            3
          Transfer Type            Interrupt
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0010  1x 16 bytes
        bInterval             255
    Interface Descriptor:
      bLength                 9
      bDescriptorType         4
      bInterfaceNumber        3
      bAlternateSetting       0
      bNumEndpoints           2
      bInterfaceClass        10 CDC Data
      bInterfaceSubClass      0 
      bInterfaceProtocol      0 
      iInterface              0 
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x03  EP 3 OUT
        bmAttributes            2
          Transfer Type            Bulk
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0040  1x 64 bytes
        bInterval               1
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x83  EP 3 IN
        bmAttributes            2
          Transfer Type            Bulk
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0040  1x 64 bytes
        bInterval               1
    Interface Descriptor:
      bLength                 9
      bDescriptorType         4
      bInterfaceNumber        4
      bAlternateSetting       0
      bNumEndpoints           1
      bInterfaceClass         2 Communications
      bInterfaceSubClass      2 Abstract (modem)
      bInterfaceProtocol      1 AT-commands (v.25ter)
      iInterface              0 
      CDC Header:
        bcdCDC               1.10
      CDC Call Management:
        bmCapabilities       0x00
        bDataInterface          5
      CDC ACM:
        bmCapabilities       0x00
      CDC Union:
        bMasterInterface        4
        bSlaveInterface         5 
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x86  EP 6 IN
        bmAttributes            3
          Transfer Type            Interrupt
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0010  1x 16 bytes
        bInterval             255
    Interface Descriptor:
      bLength                 9
      bDescriptorType         4
      bInterfaceNumber        5
      bAlternateSetting       0
      bNumEndpoints           2
      bInterfaceClass        10 CDC Data
      bInterfaceSubClass      0 
      bInterfaceProtocol      0 
      iInterface              0 
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x05  EP 5 OUT
        bmAttributes            2
          Transfer Type            Bulk
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0040  1x 64 bytes
        bInterval               1
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x85  EP 5 IN
        bmAttributes            2
          Transfer Type            Bulk
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0040  1x 64 bytes
        bInterval               1





^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: System crash/lockup after plugging CDC ACM device
  2020-07-15 17:03               ` David Guillen Fandos
@ 2020-07-15 20:39                 ` Daniele Palmas
  2020-07-16 14:30                 ` David Guillen Fandos
  1 sibling, 0 replies; 19+ messages in thread
From: Daniele Palmas @ 2020-07-15 20:39 UTC (permalink / raw)
  To: David Guillen Fandos; +Cc: Greg KH, linux-usb

Hi David,

Il giorno mer 15 lug 2020 alle ore 19:06 David Guillen Fandos
<david@davidgf.net> ha scritto:
>
> Since I'm not aware of any other commercial solutions (apart from FTDI)
> that use more than 2 ACM ports, could that be the issue? Meaning
> there's a bug somewhere and no commercial hardware that can trigger it.
>

Telit modems like LE910V2 or LE910EU1 expose 6 cdc-acm ports and are
working fine.

Regards,
Daniele

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: System crash/lockup after plugging CDC ACM device
  2020-07-15 17:03               ` David Guillen Fandos
  2020-07-15 20:39                 ` Daniele Palmas
@ 2020-07-16 14:30                 ` David Guillen Fandos
  2020-07-19 23:36                   ` David Guillen Fandos
  1 sibling, 1 reply; 19+ messages in thread
From: David Guillen Fandos @ 2020-07-16 14:30 UTC (permalink / raw)
  To: Greg KH; +Cc: linux-usb, dnlplm

On Wed, 2020-07-15 at 19:03 +0200, David Guillen Fandos wrote:
> On Wed, 2020-07-15 at 14:24 +0200, Greg KH wrote:
> > On Wed, Jul 15, 2020 at 01:20:54PM +0200, David Guillen Fandos
> > wrote:
> > > On Wed, 2020-07-15 at 13:12 +0200, Greg KH wrote:
> > > > On Wed, Jul 15, 2020 at 12:57:14PM +0200, David Guillen Fandos
> > > > wrote:
> > > > > On Wed, 2020-07-15 at 12:50 +0200, Greg KH wrote:
> > > > > > On Wed, Jul 15, 2020 at 12:31:42PM +0200, David Guillen
> > > > > > Fandos
> > > > > > wrote:
> > > > > > > On Wed, 2020-07-15 at 11:30 +0200, Greg KH wrote:
> > > > > > > > On Wed, Jul 15, 2020 at 10:58:03AM +0200, David Guillen
> > > > > > > > Fandos
> > > > > > > > wrote:
> > > > > > > > > Hello linux-usb,
> > > > > > > > > 
> > > > > > > > > I think I might have found a kernel bug related to
> > > > > > > > > the
> > > > > > > > > USB
> > > > > > > > > subsystem
> > > > > > > > > (cdc_acm perhaps).
> > > > > > > > > 
> > > > > > > > > Context: I was playing around with a device I'm
> > > > > > > > > creating,
> > > > > > > > > essentially a
> > > > > > > > > USB quad modem device that exposes four modems to the
> > > > > > > > > host
> > > > > > > > > system.
> > > > > > > > > This
> > > > > > > > > device is still a prototype so there's a few bugs
> > > > > > > > > here
> > > > > > > > > and
> > > > > > > > > there,
> > > > > > > > > most
> > > > > > > > > likely in the USB descriptors and control requests.
> > > > > > > > > 
> > > > > > > > > What happens: After plugging the device the system
> > > > > > > > > starts
> > > > > > > > > spitting
> > > > > > > > > warnings and BUGs and it locks up. Most of the time
> > > > > > > > > some
> > > > > > > > > CPUs
> > > > > > > > > get
> > > > > > > > > into
> > > > > > > > > some spinloop and never comes back (you can see it
> > > > > > > > > being
> > > > > > > > > detected
> > > > > > > > > by
> > > > > > > > > the watchdog after a few seconds). Generally after
> > > > > > > > > that
> > > > > > > > > the
> > > > > > > > > USB
> > > > > > > > > devices
> > > > > > > > > stop working completely and at some point the machine
> > > > > > > > > freezes
> > > > > > > > > completely. In a couple of ocasions I managed to see
> > > > > > > > > a
> > > > > > > > > bug
> > > > > > > > > in
> > > > > > > > > dmesg
> > > > > > > > > saying "unable to handle page fault for address XXX"
> > > > > > > > > and
> > > > > > > > > "Supervisor
> > > > > > > > > read access in kernel mode" "error code (0x0000) not
> > > > > > > > > present
> > > > > > > > > page".
> > > > > > > > > I
> > > > > > > > > could not get a trace for that one since the kernel
> > > > > > > > > died
> > > > > > > > > completely
> > > > > > > > > and
> > > > > > > > > my log files were truncated/lost.
> > > > > > > > > 
> > > > > > > > > Since it is happening to my two machines (both Intel
> > > > > > > > > but
> > > > > > > > > rather
> > > > > > > > > different controllers, Sunrise Point-LP USB 3.0 vs 8
> > > > > > > > > Series/C220)
> > > > > > > > > and
> > > > > > > > > with different kernel versions I suspect this might
> > > > > > > > > be
> > > > > > > > > a
> > > > > > > > > bug in
> > > > > > > > > the
> > > > > > > > > kernel.
> > > > > > > > > 
> > > > > > > > > I have 4 logs that I collected, they are sort of
> > > > > > > > > long-
> > > > > > > > > ish,
> > > > > > > > > not
> > > > > > > > > sure
> > > > > > > > > how
> > > > > > > > > to best send them to the list.
> > > > > > > > 
> > > > > > > > Send the crashes with the callback list, that should be
> > > > > > > > quite
> > > > > > > > small,
> > > > > > > > right?  We don't need the full log.
> > > > > > > > 
> > > > > > > > The first crash is the most important, the others can
> > > > > > > > be
> > > > > > > > from
> > > > > > > > the
> > > > > > > > first
> > > > > > > > one and are not reliable.
> > > > > > > > 
> > > > > > > > thanks,
> > > > > > > > 
> > > > > > > > greg k-h
> > > > > > > 
> > > > > > > Ok then, here comes one of the logs, I selected some bits
> > > > > > > only
> > > > > > > 
> > > > > > > [  147.302016] WARNING: CPU: 3 PID: 134 at
> > > > > > > kernel/workqueue.c:1473
> > > > > > > __queue_work+0x364/0x410
> > > > > > > [...]
> > > > > > > [  147.302322] Call Trace:
> > > > > > > [  147.302329]  <IRQ>
> > > > > > > [  147.302342]  queue_work_on+0x36/0x40
> > > > > > > [  147.302353]  __usb_hcd_giveback_urb+0x9c/0x110
> > > > > > > [  147.302362]  usb_giveback_urb_bh+0xa0/0xf0
> > > > > > > [  147.302372]  tasklet_action_common.constprop.0+0x66/0x
> > > > > > > 10
> > > > > > > 0
> > > > > > > [  147.302382]  __do_softirq+0xe9/0x2dc
> > > > > > > [  147.302391]  irq_exit+0xcf/0x110
> > > > > > > [  147.302397]  do_IRQ+0x55/0xe0
> > > > > > > [  147.302408]  common_interrupt+0xf/0xf
> > > > > > > [  147.302413]  </IRQ>
> > > > > > > [...]
> > > > > > > [  184.771172] watchdog: BUG: soft lockup - CPU#3 stuck
> > > > > > > for
> > > > > > > 23s!
> > > > > > > [kworker/3:2:134]
> > > > > > 
> > > > > > That was the first message?
> > > > > > 
> > > > > > Ok, we need some more logs, how about the 30 lines right
> > > > > > before
> > > > > > the
> > > > > > above?
> > > > > > 
> > > > > > And what kernel version are you using?
> > > > > > 
> > > > > > thanks,
> > > > > > 
> > > > > > greg k-h
> > > > > 
> > > > > Heh I assumed you would find the 3rd stack more interesting
> > > > > since
> > > > > it
> > > > > involves more subsystems but anyway, here we got, the first
> > > > > one
> > > > > with
> > > > > more context. The trigger as you can see is me connecting the
> > > > > USB
> > > > > device:
> > > > > 
> > > > > [  141.445367] usb 1-1: new full-speed USB device number 5
> > > > > using
> > > > > xhci_hcd
> > > > > [  141.573592] usb 1-1: New USB device found, idVendor=0483,
> > > > > idProduct=5740, bcdDevice= 2.00
> > > > > [  141.573597] usb 1-1: New USB device strings: Mfr=1,
> > > > > Product=2,
> > > > > SerialNumber=3
> > > > > [  141.573601] usb 1-1: Product: Quad-UART serial USB device
> > > > > [  141.573603] usb 1-1: Manufacturer: davidgf.net
> > > > > [  141.573605] usb 1-1: SerialNumber: serialno
> > > > > [  142.375007] cdc_acm 1-1:1.0: ttyACM0: USB ACM device
> > > > > [  142.376623] cdc_acm 1-1:1.2: ttyACM1: USB ACM device
> > > > > [  142.378350] cdc_acm 1-1:1.4: ttyACM2: USB ACM device
> > > > > [  142.379637] cdc_acm 1-1:1.6: ttyACM3: USB ACM device
> > > > > [  142.382473] usbcore: registered new interface driver
> > > > > cdc_acm
> > > > > [  142.382476] cdc_acm: USB Abstract Control Model driver for
> > > > > USB
> > > > > modems and ISDN adapters
> > > > > [  147.301997] ------------[ cut here ]------------
> > > > > [  147.302016] WARNING: CPU: 3 PID: 134 at
> > > > > kernel/workqueue.c:1473
> > > > > __queue_work+0x364/0x410
> > > > > [  147.302019] Modules linked in: cdc_acm rfcomm ccm
> > > > > wireguard
> > > > > curve25519_x86_64 libchacha20poly1305 chacha_x86_64
> > > > > poly1305_x86_64
> > > > > libblake2s blake2s_x86_64 ip6_udp_tunnel udp_tunnel
> > > > > libcurve25519_generic libchacha libblake2s_generic
> > > > > nft_fib_inet
> > > > > nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet
> > > > > nf_reject_ipv4
> > > > > nf_reject_ipv6 nft_reject nft_ct nft_chain_nat ip6table_nat
> > > > > ip6table_mangle ip6table_raw ip6table_security iptable_nat
> > > > > nf_nat
> > > > > nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c
> > > > > iptable_mangle
> > > > > iptable_raw iptable_security ip_set nf_tables nfnetlink
> > > > > ip6table_filter
> > > > > ip6_tables iptable_filter cmac vboxnetadp(OE) vboxnetflt(OE)
> > > > > bnep
> > > > > vboxdrv(OE) sunrpc vfat fat uvcvideo videobuf2_vmalloc
> > > > > videobuf2_memops
> > > > > videobuf2_v4l2 videobuf2_common videodev btusb btrtl btbcm
> > > > > btintel
> > > > > mc
> > > > > bluetooth ecdh_generic ecc iTCO_wdt iTCO_vendor_support
> > > > > mei_hdcp
> > > > > intel_rapl_msr dell_laptop x86_pkg_temp_thermal
> > > > > intel_powerclamp
> > > > > coretemp kvm_intel kvm irqbypass intel_cstate intel_uncore
> > > > > intel_rapl_perf iwlmvm
> > > > > [  147.302121]  snd_hda_codec_hdmi mac80211 snd_soc_skl
> > > > > snd_soc_sst_ipc
> > > > > snd_soc_sst_dsp dell_wmi snd_hda_ext_core dell_smbios
> > > > > snd_hda_codec_realtek dcdbas libarc4 wmi_bmof
> > > > > dell_wmi_descriptor
> > > > > snd_soc_acpi_intel_match snd_soc_acpi intel_wmi_thunderbolt
> > > > > snd_hda_codec_generic snd_soc_core ledtrig_audio iwlwifi
> > > > > pcspkr
> > > > > snd_compress ac97_bus snd_pcm_dmaengine snd_hda_intel
> > > > > snd_intel_dspcfg
> > > > > snd_hda_codec cfg80211 snd_hda_core snd_hwdep snd_seq
> > > > > snd_seq_device
> > > > > joydev snd_pcm rfkill snd_timer snd i2c_i801 soundcore idma64
> > > > > int3403_thermal intel_hid int3400_thermal sparse_keymap
> > > > > acpi_thermal_rel mei_me intel_xhci_usb_role_switch acpi_pad
> > > > > roles
> > > > > mei
> > > > > intel_pch_thermal processor_thermal_device intel_rapl_common
> > > > > int340x_thermal_zone intel_soc_dts_iosf binfmt_misc ip_tables
> > > > > dm_crypt
> > > > > i915 rtsx_pci_sdmmc mmc_core crct10dif_pclmul crc32_pclmul
> > > > > i2c_algo_bit
> > > > > cec crc32c_intel drm_kms_helper nvme ghash_clmulni_intel drm
> > > > > nvme_core
> > > > > serio_raw rtsx_pci hid_multitouch wmi i2c_hid video
> > > > > pinctrl_sunrisepoint pinctrl_intel
> > > > > [  147.302218]  fuse
> > > > > [  147.302230] CPU: 3 PID: 134 Comm: kworker/3:2 Tainted:
> > > > > G          IOE     5.7.7-200.fc32.x86_64 #1
> > > > > [  147.302233] Hardware name: Dell Inc. XPS 13 9350/0PWNCR,
> > > > > BIOS
> > > > > 1.12.2
> > > > > 12/15/2019
> > > > > [  147.302260] Workqueue:  0x0 (mm_percpu_wq)
> > > > > [  147.302275] RIP: 0010:__queue_work+0x364/0x410
> > > > > [  147.302282] Code: e0 f1 69 a9 00 01 1f 00 75 0f 65 48 8b
> > > > > 3c
> > > > > 25
> > > > > c0 8b
> > > > > 01 00 f6 47 24 20 75 25 0f 0b 48 83 c4 10 5b 5d 41 5c 41 5d
> > > > > 41
> > > > > 5e
> > > > > 41 5f
> > > > > c3 <0f> 0b e9 78 fe ff ff 41 83 cc 02 49 8d 57 60 e9 5d fe ff
> > > > > ff e8
> > > > > 53
> > > > > [  147.302286] RSP: 0018:ffffbab980154e68 EFLAGS: 00010002
> > > > > [  147.302292] RAX: ffff8f551b333790 RBX: 0000000000000048
> > > > > RCX:
> > > > > 0000000000000000
> > > > > [  147.302295] RDX: ffff8f551b333798 RSI: ffff8f5575803718
> > > > > RDI:
> > > > > ffff8f5576daa840
> > > > > [  147.302299] RBP: ffff8f551b333790 R08: ffffffff97856cb0
> > > > > R09:
> > > > > 0000000000000000
> > > > > [  147.302302] R10: 0000000000000000 R11: ffffffff97856cb8
> > > > > R12:
> > > > > 0000000000000003
> > > > > [  147.302306] R13: 0000000000002000 R14: ffff8f5575c14e00
> > > > > R15:
> > > > > ffff8f5576db0700
> > > > > [  147.302311] FS:  0000000000000000(0000)
> > > > > GS:ffff8f5576d80000(0000)
> > > > > knlGS:0000000000000000
> > > > > [  147.302315] CS:  0010 DS: 0000 ES: 0000 CR0:
> > > > > 0000000080050033
> > > > > [  147.302319] CR2: 00000000000000b0 CR3: 0000000267774004
> > > > > CR4:
> > > > > 00000000003606e0
> > > > > [  147.302322] Call Trace:
> > > > > [  147.302329]  <IRQ>
> > > > > [  147.302342]  queue_work_on+0x36/0x40
> > > > > [  147.302353]  __usb_hcd_giveback_urb+0x9c/0x110
> > > > > [  147.302362]  usb_giveback_urb_bh+0xa0/0xf0
> > > > 
> > > > Are you sure your device is working properly and talking USB
> > > > correctly
> > > > to the host?  It looks like you are just timing out for some
> > > > reason.
> > > > 
> > > > But, that warning is showing that something is odd in the usb
> > > > workqueue,
> > > > which is strange.
> > > > 
> > > > What type of host controller is this talking to?  And does your
> > > > device
> > > > actually answer the urbs being sent to it correctly?
> > > > 
> > > > Using usbmon on this might be the best way to watch the USB
> > > > traffic,
> > > > if
> > > > you don't have a hardware protocol sniffer, which could provide
> > > > some
> > > > clues as to what is going wrong.
> > > > 
> > > > thanks,
> > > > 
> > > > greg k-h
> > > 
> > > As I mentioned the device is likely buggy, since I'm developing
> > > and
> > > debugging it.
> > > However my ability to debug and fix any issue is limited by the
> > > fact
> > > that the kernel decides to stop working as usual, making my USB
> > > keyboard and mouse useless, if not crashing later due to soft
> > > lockups.
> > > 
> > > Shouldn't the kernel be resilient to such devices?
> > 
> > Yes it should, we should not crash.
> > 
> > 
> > > I've developed quite
> > > a few USB devices in the past and I've never ran into things like
> > > this
> > > on Linux (Windows is another story, rather 'easy' to crash, hang
> > > or
> > > bluescreen). In any case since I do not have access to a hardware
> > > debugger and the machine goes bananas (preventing me from using
> > > Wireshark) I do not think I can further debug this issue. I could
> > > try
> > > to find a kernel version where this does not crash the machine
> > > (only
> > > tested 5.6.X and 5.7.X so far). Or perhaps use VirtualBox, but
> > > I'd
> > > need
> > > to convice the host OS to ignore the USB device and just forward
> > > it
> > > to
> > > the guest.
> > 
> > Trying to trace down what part of the setup is failing, by using
> > usbmon,
> > will be good to try to figure out what the problem is here, if you
> > can
> > do that.
> > 
> > > The firmware for this device can be easily tweaked to expose an
> > > arbitrary number (up to 7 I think) of CDC ACM interfaces. When I
> > > use
> > > one or two there's no issues, three has had some issues (but did
> > > not
> > > investigate further). Going to four is what consistently triggers
> > > kernel issues.
> > 
> > Hm, that might be a clue, what does the output of 'lsusb -v' for
> > that
> > device when you have 3 and then 4 interfaces?
> > 
> > thanks,
> > 
> > greg k-h
> 
> Hello,
> 
> I will try to see how I can debug further, perhaps I can locate a
> machine or kernel that does not crash. Another option can be to
> disable
> the firmware down to the minimum so that it does not response to the
> bulk endpoints (just to the enumeration and some basic things), to
> rule
> out a bad behaviour.
> 
> The USB descriptor is what you could imagine, just replicate the two
> ACM interfaces (control & data) and add more endpoints. Here goes the
> one with three ports. Note this one seems to make the kernel crash
> just
> like the one with four. The only ones that work well are 1 and 2
> ports.
> Since I'm not aware of any other commercial solutions (apart from
> FTDI)
> that use more than 2 ACM ports, could that be the issue? Meaning
> there's a bug somewhere and no commercial hardware that can trigger
> it.
> 
> For reference the diff between two and three ports (in lsusb) is that
> it's missing the last two interaces (with the 3 EPs described). Of
> course the bNumInterfaces is 4 instead of 6, and wTotalLength has a
> different value.
> 
> Hope this can help somehow.
> Thanks
> David
> 
> Bus 003 Device 012: ID 0483:5740 STMicroelectronics Virtual COM Port

Hey again,

I was not aware about the modems Daniele, thanks!

So I did some testing on my old BeagleBone black, which has a very old
kernel (3.8.13-bone47). In this device the kernel is happy and I was
able to do some testing, it seems to work well. The UARTs seem to work
well in both directions, no weird shenanigans, no error/warn
messages...

I'm a bit at loss on how I can debug this further, I will try to use a
RPi with a newer kernel and see what happens. I could try to boot a
Live USB with older kernels (in my Intel machines) to try to locate a
version where it works. Since I'm no kernel expert: any way I can
provide more info? The computer becomes unusable shortly after plugging
the device so I can't really do any meaningful stuff on it.

Thanks again,
David




^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: System crash/lockup after plugging CDC ACM device
  2020-07-16 14:30                 ` David Guillen Fandos
@ 2020-07-19 23:36                   ` David Guillen Fandos
  2020-07-20  1:21                     ` Alan Stern
  2020-07-20 16:55                     ` Dan Williams
  0 siblings, 2 replies; 19+ messages in thread
From: David Guillen Fandos @ 2020-07-19 23:36 UTC (permalink / raw)
  To: Greg KH; +Cc: linux-usb, dnlplm

[-- Attachment #1: Type: text/plain, Size: 17519 bytes --]

On Thu, 2020-07-16 at 16:30 +0200, David Guillen Fandos wrote:
> On Wed, 2020-07-15 at 19:03 +0200, David Guillen Fandos wrote:
> > On Wed, 2020-07-15 at 14:24 +0200, Greg KH wrote:
> > > On Wed, Jul 15, 2020 at 01:20:54PM +0200, David Guillen Fandos
> > > wrote:
> > > > On Wed, 2020-07-15 at 13:12 +0200, Greg KH wrote:
> > > > > On Wed, Jul 15, 2020 at 12:57:14PM +0200, David Guillen
> > > > > Fandos
> > > > > wrote:
> > > > > > On Wed, 2020-07-15 at 12:50 +0200, Greg KH wrote:
> > > > > > > On Wed, Jul 15, 2020 at 12:31:42PM +0200, David Guillen
> > > > > > > Fandos
> > > > > > > wrote:
> > > > > > > > On Wed, 2020-07-15 at 11:30 +0200, Greg KH wrote:
> > > > > > > > > On Wed, Jul 15, 2020 at 10:58:03AM +0200, David
> > > > > > > > > Guillen
> > > > > > > > > Fandos
> > > > > > > > > wrote:
> > > > > > > > > > Hello linux-usb,
> > > > > > > > > > 
> > > > > > > > > > I think I might have found a kernel bug related to
> > > > > > > > > > the
> > > > > > > > > > USB
> > > > > > > > > > subsystem
> > > > > > > > > > (cdc_acm perhaps).
> > > > > > > > > > 
> > > > > > > > > > Context: I was playing around with a device I'm
> > > > > > > > > > creating,
> > > > > > > > > > essentially a
> > > > > > > > > > USB quad modem device that exposes four modems to
> > > > > > > > > > the
> > > > > > > > > > host
> > > > > > > > > > system.
> > > > > > > > > > This
> > > > > > > > > > device is still a prototype so there's a few bugs
> > > > > > > > > > here
> > > > > > > > > > and
> > > > > > > > > > there,
> > > > > > > > > > most
> > > > > > > > > > likely in the USB descriptors and control requests.
> > > > > > > > > > 
> > > > > > > > > > What happens: After plugging the device the system
> > > > > > > > > > starts
> > > > > > > > > > spitting
> > > > > > > > > > warnings and BUGs and it locks up. Most of the time
> > > > > > > > > > some
> > > > > > > > > > CPUs
> > > > > > > > > > get
> > > > > > > > > > into
> > > > > > > > > > some spinloop and never comes back (you can see it
> > > > > > > > > > being
> > > > > > > > > > detected
> > > > > > > > > > by
> > > > > > > > > > the watchdog after a few seconds). Generally after
> > > > > > > > > > that
> > > > > > > > > > the
> > > > > > > > > > USB
> > > > > > > > > > devices
> > > > > > > > > > stop working completely and at some point the
> > > > > > > > > > machine
> > > > > > > > > > freezes
> > > > > > > > > > completely. In a couple of ocasions I managed to
> > > > > > > > > > see
> > > > > > > > > > a
> > > > > > > > > > bug
> > > > > > > > > > in
> > > > > > > > > > dmesg
> > > > > > > > > > saying "unable to handle page fault for address
> > > > > > > > > > XXX"
> > > > > > > > > > and
> > > > > > > > > > "Supervisor
> > > > > > > > > > read access in kernel mode" "error code (0x0000)
> > > > > > > > > > not
> > > > > > > > > > present
> > > > > > > > > > page".
> > > > > > > > > > I
> > > > > > > > > > could not get a trace for that one since the kernel
> > > > > > > > > > died
> > > > > > > > > > completely
> > > > > > > > > > and
> > > > > > > > > > my log files were truncated/lost.
> > > > > > > > > > 
> > > > > > > > > > Since it is happening to my two machines (both
> > > > > > > > > > Intel
> > > > > > > > > > but
> > > > > > > > > > rather
> > > > > > > > > > different controllers, Sunrise Point-LP USB 3.0 vs
> > > > > > > > > > 8
> > > > > > > > > > Series/C220)
> > > > > > > > > > and
> > > > > > > > > > with different kernel versions I suspect this might
> > > > > > > > > > be
> > > > > > > > > > a
> > > > > > > > > > bug in
> > > > > > > > > > the
> > > > > > > > > > kernel.
> > > > > > > > > > 
> > > > > > > > > > I have 4 logs that I collected, they are sort of
> > > > > > > > > > long-
> > > > > > > > > > ish,
> > > > > > > > > > not
> > > > > > > > > > sure
> > > > > > > > > > how
> > > > > > > > > > to best send them to the list.
> > > > > > > > > 
> > > > > > > > > Send the crashes with the callback list, that should
> > > > > > > > > be
> > > > > > > > > quite
> > > > > > > > > small,
> > > > > > > > > right?  We don't need the full log.
> > > > > > > > > 
> > > > > > > > > The first crash is the most important, the others can
> > > > > > > > > be
> > > > > > > > > from
> > > > > > > > > the
> > > > > > > > > first
> > > > > > > > > one and are not reliable.
> > > > > > > > > 
> > > > > > > > > thanks,
> > > > > > > > > 
> > > > > > > > > greg k-h
> > > > > > > > 
> > > > > > > > Ok then, here comes one of the logs, I selected some
> > > > > > > > bits
> > > > > > > > only
> > > > > > > > 
> > > > > > > > [  147.302016] WARNING: CPU: 3 PID: 134 at
> > > > > > > > kernel/workqueue.c:1473
> > > > > > > > __queue_work+0x364/0x410
> > > > > > > > [...]
> > > > > > > > [  147.302322] Call Trace:
> > > > > > > > [  147.302329]  <IRQ>
> > > > > > > > [  147.302342]  queue_work_on+0x36/0x40
> > > > > > > > [  147.302353]  __usb_hcd_giveback_urb+0x9c/0x110
> > > > > > > > [  147.302362]  usb_giveback_urb_bh+0xa0/0xf0
> > > > > > > > [  147.302372]  tasklet_action_common.constprop.0+0x66/
> > > > > > > > 0x
> > > > > > > > 10
> > > > > > > > 0
> > > > > > > > [  147.302382]  __do_softirq+0xe9/0x2dc
> > > > > > > > [  147.302391]  irq_exit+0xcf/0x110
> > > > > > > > [  147.302397]  do_IRQ+0x55/0xe0
> > > > > > > > [  147.302408]  common_interrupt+0xf/0xf
> > > > > > > > [  147.302413]  </IRQ>
> > > > > > > > [...]
> > > > > > > > [  184.771172] watchdog: BUG: soft lockup - CPU#3 stuck
> > > > > > > > for
> > > > > > > > 23s!
> > > > > > > > [kworker/3:2:134]
> > > > > > > 
> > > > > > > That was the first message?
> > > > > > > 
> > > > > > > Ok, we need some more logs, how about the 30 lines right
> > > > > > > before
> > > > > > > the
> > > > > > > above?
> > > > > > > 
> > > > > > > And what kernel version are you using?
> > > > > > > 
> > > > > > > thanks,
> > > > > > > 
> > > > > > > greg k-h
> > > > > > 
> > > > > > Heh I assumed you would find the 3rd stack more interesting
> > > > > > since
> > > > > > it
> > > > > > involves more subsystems but anyway, here we got, the first
> > > > > > one
> > > > > > with
> > > > > > more context. The trigger as you can see is me connecting
> > > > > > the
> > > > > > USB
> > > > > > device:
> > > > > > 
> > > > > > [  141.445367] usb 1-1: new full-speed USB device number 5
> > > > > > using
> > > > > > xhci_hcd
> > > > > > [  141.573592] usb 1-1: New USB device found,
> > > > > > idVendor=0483,
> > > > > > idProduct=5740, bcdDevice= 2.00
> > > > > > [  141.573597] usb 1-1: New USB device strings: Mfr=1,
> > > > > > Product=2,
> > > > > > SerialNumber=3
> > > > > > [  141.573601] usb 1-1: Product: Quad-UART serial USB
> > > > > > device
> > > > > > [  141.573603] usb 1-1: Manufacturer: davidgf.net
> > > > > > [  141.573605] usb 1-1: SerialNumber: serialno
> > > > > > [  142.375007] cdc_acm 1-1:1.0: ttyACM0: USB ACM device
> > > > > > [  142.376623] cdc_acm 1-1:1.2: ttyACM1: USB ACM device
> > > > > > [  142.378350] cdc_acm 1-1:1.4: ttyACM2: USB ACM device
> > > > > > [  142.379637] cdc_acm 1-1:1.6: ttyACM3: USB ACM device
> > > > > > [  142.382473] usbcore: registered new interface driver
> > > > > > cdc_acm
> > > > > > [  142.382476] cdc_acm: USB Abstract Control Model driver
> > > > > > for
> > > > > > USB
> > > > > > modems and ISDN adapters
> > > > > > [  147.301997] ------------[ cut here ]------------
> > > > > > [  147.302016] WARNING: CPU: 3 PID: 134 at
> > > > > > kernel/workqueue.c:1473
> > > > > > __queue_work+0x364/0x410
> > > > > > [  147.302019] Modules linked in: cdc_acm rfcomm ccm
> > > > > > wireguard
> > > > > > curve25519_x86_64 libchacha20poly1305 chacha_x86_64
> > > > > > poly1305_x86_64
> > > > > > libblake2s blake2s_x86_64 ip6_udp_tunnel udp_tunnel
> > > > > > libcurve25519_generic libchacha libblake2s_generic
> > > > > > nft_fib_inet
> > > > > > nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet
> > > > > > nf_reject_ipv4
> > > > > > nf_reject_ipv6 nft_reject nft_ct nft_chain_nat ip6table_nat
> > > > > > ip6table_mangle ip6table_raw ip6table_security iptable_nat
> > > > > > nf_nat
> > > > > > nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c
> > > > > > iptable_mangle
> > > > > > iptable_raw iptable_security ip_set nf_tables nfnetlink
> > > > > > ip6table_filter
> > > > > > ip6_tables iptable_filter cmac vboxnetadp(OE)
> > > > > > vboxnetflt(OE)
> > > > > > bnep
> > > > > > vboxdrv(OE) sunrpc vfat fat uvcvideo videobuf2_vmalloc
> > > > > > videobuf2_memops
> > > > > > videobuf2_v4l2 videobuf2_common videodev btusb btrtl btbcm
> > > > > > btintel
> > > > > > mc
> > > > > > bluetooth ecdh_generic ecc iTCO_wdt iTCO_vendor_support
> > > > > > mei_hdcp
> > > > > > intel_rapl_msr dell_laptop x86_pkg_temp_thermal
> > > > > > intel_powerclamp
> > > > > > coretemp kvm_intel kvm irqbypass intel_cstate intel_uncore
> > > > > > intel_rapl_perf iwlmvm
> > > > > > [  147.302121]  snd_hda_codec_hdmi mac80211 snd_soc_skl
> > > > > > snd_soc_sst_ipc
> > > > > > snd_soc_sst_dsp dell_wmi snd_hda_ext_core dell_smbios
> > > > > > snd_hda_codec_realtek dcdbas libarc4 wmi_bmof
> > > > > > dell_wmi_descriptor
> > > > > > snd_soc_acpi_intel_match snd_soc_acpi intel_wmi_thunderbolt
> > > > > > snd_hda_codec_generic snd_soc_core ledtrig_audio iwlwifi
> > > > > > pcspkr
> > > > > > snd_compress ac97_bus snd_pcm_dmaengine snd_hda_intel
> > > > > > snd_intel_dspcfg
> > > > > > snd_hda_codec cfg80211 snd_hda_core snd_hwdep snd_seq
> > > > > > snd_seq_device
> > > > > > joydev snd_pcm rfkill snd_timer snd i2c_i801 soundcore
> > > > > > idma64
> > > > > > int3403_thermal intel_hid int3400_thermal sparse_keymap
> > > > > > acpi_thermal_rel mei_me intel_xhci_usb_role_switch acpi_pad
> > > > > > roles
> > > > > > mei
> > > > > > intel_pch_thermal processor_thermal_device
> > > > > > intel_rapl_common
> > > > > > int340x_thermal_zone intel_soc_dts_iosf binfmt_misc
> > > > > > ip_tables
> > > > > > dm_crypt
> > > > > > i915 rtsx_pci_sdmmc mmc_core crct10dif_pclmul crc32_pclmul
> > > > > > i2c_algo_bit
> > > > > > cec crc32c_intel drm_kms_helper nvme ghash_clmulni_intel
> > > > > > drm
> > > > > > nvme_core
> > > > > > serio_raw rtsx_pci hid_multitouch wmi i2c_hid video
> > > > > > pinctrl_sunrisepoint pinctrl_intel
> > > > > > [  147.302218]  fuse
> > > > > > [  147.302230] CPU: 3 PID: 134 Comm: kworker/3:2 Tainted:
> > > > > > G          IOE     5.7.7-200.fc32.x86_64 #1
> > > > > > [  147.302233] Hardware name: Dell Inc. XPS 13 9350/0PWNCR,
> > > > > > BIOS
> > > > > > 1.12.2
> > > > > > 12/15/2019
> > > > > > [  147.302260] Workqueue:  0x0 (mm_percpu_wq)
> > > > > > [  147.302275] RIP: 0010:__queue_work+0x364/0x410
> > > > > > [  147.302282] Code: e0 f1 69 a9 00 01 1f 00 75 0f 65 48 8b
> > > > > > 3c
> > > > > > 25
> > > > > > c0 8b
> > > > > > 01 00 f6 47 24 20 75 25 0f 0b 48 83 c4 10 5b 5d 41 5c 41 5d
> > > > > > 41
> > > > > > 5e
> > > > > > 41 5f
> > > > > > c3 <0f> 0b e9 78 fe ff ff 41 83 cc 02 49 8d 57 60 e9 5d fe
> > > > > > ff
> > > > > > ff e8
> > > > > > 53
> > > > > > [  147.302286] RSP: 0018:ffffbab980154e68 EFLAGS: 00010002
> > > > > > [  147.302292] RAX: ffff8f551b333790 RBX: 0000000000000048
> > > > > > RCX:
> > > > > > 0000000000000000
> > > > > > [  147.302295] RDX: ffff8f551b333798 RSI: ffff8f5575803718
> > > > > > RDI:
> > > > > > ffff8f5576daa840
> > > > > > [  147.302299] RBP: ffff8f551b333790 R08: ffffffff97856cb0
> > > > > > R09:
> > > > > > 0000000000000000
> > > > > > [  147.302302] R10: 0000000000000000 R11: ffffffff97856cb8
> > > > > > R12:
> > > > > > 0000000000000003
> > > > > > [  147.302306] R13: 0000000000002000 R14: ffff8f5575c14e00
> > > > > > R15:
> > > > > > ffff8f5576db0700
> > > > > > [  147.302311] FS:  0000000000000000(0000)
> > > > > > GS:ffff8f5576d80000(0000)
> > > > > > knlGS:0000000000000000
> > > > > > [  147.302315] CS:  0010 DS: 0000 ES: 0000 CR0:
> > > > > > 0000000080050033
> > > > > > [  147.302319] CR2: 00000000000000b0 CR3: 0000000267774004
> > > > > > CR4:
> > > > > > 00000000003606e0
> > > > > > [  147.302322] Call Trace:
> > > > > > [  147.302329]  <IRQ>
> > > > > > [  147.302342]  queue_work_on+0x36/0x40
> > > > > > [  147.302353]  __usb_hcd_giveback_urb+0x9c/0x110
> > > > > > [  147.302362]  usb_giveback_urb_bh+0xa0/0xf0
> > > > > 
> > > > > Are you sure your device is working properly and talking USB
> > > > > correctly
> > > > > to the host?  It looks like you are just timing out for some
> > > > > reason.
> > > > > 
> > > > > But, that warning is showing that something is odd in the usb
> > > > > workqueue,
> > > > > which is strange.
> > > > > 
> > > > > What type of host controller is this talking to?  And does
> > > > > your
> > > > > device
> > > > > actually answer the urbs being sent to it correctly?
> > > > > 
> > > > > Using usbmon on this might be the best way to watch the USB
> > > > > traffic,
> > > > > if
> > > > > you don't have a hardware protocol sniffer, which could
> > > > > provide
> > > > > some
> > > > > clues as to what is going wrong.
> > > > > 
> > > > > thanks,
> > > > > 
> > > > > greg k-h
> > > > 
> > > > As I mentioned the device is likely buggy, since I'm developing
> > > > and
> > > > debugging it.
> > > > However my ability to debug and fix any issue is limited by the
> > > > fact
> > > > that the kernel decides to stop working as usual, making my USB
> > > > keyboard and mouse useless, if not crashing later due to soft
> > > > lockups.
> > > > 
> > > > Shouldn't the kernel be resilient to such devices?
> > > 
> > > Yes it should, we should not crash.
> > > 
> > > 
> > > > I've developed quite
> > > > a few USB devices in the past and I've never ran into things
> > > > like
> > > > this
> > > > on Linux (Windows is another story, rather 'easy' to crash,
> > > > hang
> > > > or
> > > > bluescreen). In any case since I do not have access to a
> > > > hardware
> > > > debugger and the machine goes bananas (preventing me from using
> > > > Wireshark) I do not think I can further debug this issue. I
> > > > could
> > > > try
> > > > to find a kernel version where this does not crash the machine
> > > > (only
> > > > tested 5.6.X and 5.7.X so far). Or perhaps use VirtualBox, but
> > > > I'd
> > > > need
> > > > to convice the host OS to ignore the USB device and just
> > > > forward
> > > > it
> > > > to
> > > > the guest.
> > > 
> > > Trying to trace down what part of the setup is failing, by using
> > > usbmon,
> > > will be good to try to figure out what the problem is here, if
> > > you
> > > can
> > > do that.
> > > 
> > > > The firmware for this device can be easily tweaked to expose an
> > > > arbitrary number (up to 7 I think) of CDC ACM interfaces. When
> > > > I
> > > > use
> > > > one or two there's no issues, three has had some issues (but
> > > > did
> > > > not
> > > > investigate further). Going to four is what consistently
> > > > triggers
> > > > kernel issues.
> > > 
> > > Hm, that might be a clue, what does the output of 'lsusb -v' for
> > > that
> > > device when you have 3 and then 4 interfaces?
> > > 
> > > thanks,
> > > 
> > > greg k-h
> > 
> > Hello,
> > 
> > I will try to see how I can debug further, perhaps I can locate a
> > machine or kernel that does not crash. Another option can be to
> > disable
> > the firmware down to the minimum so that it does not response to
> > the
> > bulk endpoints (just to the enumeration and some basic things), to
> > rule
> > out a bad behaviour.
> > 
> > The USB descriptor is what you could imagine, just replicate the
> > two
> > ACM interfaces (control & data) and add more endpoints. Here goes
> > the
> > one with three ports. Note this one seems to make the kernel crash
> > just
> > like the one with four. The only ones that work well are 1 and 2
> > ports.
> > Since I'm not aware of any other commercial solutions (apart from
> > FTDI)
> > that use more than 2 ACM ports, could that be the issue? Meaning
> > there's a bug somewhere and no commercial hardware that can trigger
> > it.
> > 
> > For reference the diff between two and three ports (in lsusb) is
> > that
> > it's missing the last two interaces (with the 3 EPs described). Of
> > course the bNumInterfaces is 4 instead of 6, and wTotalLength has a
> > different value.
> > 
> > Hope this can help somehow.
> > Thanks
> > David
> > 
> > Bus 003 Device 012: ID 0483:5740 STMicroelectronics Virtual COM
> > Port
> 
> Hey again,
> 
> I was not aware about the modems Daniele, thanks!
> 
> So I did some testing on my old BeagleBone black, which has a very
> old
> kernel (3.8.13-bone47). In this device the kernel is happy and I was
> able to do some testing, it seems to work well. The UARTs seem to
> work
> well in both directions, no weird shenanigans, no error/warn
> messages...
> 
> I'm a bit at loss on how I can debug this further, I will try to use
> a
> RPi with a newer kernel and see what happens. I could try to boot a
> Live USB with older kernels (in my Intel machines) to try to locate a
> version where it works. Since I'm no kernel expert: any way I can
> provide more info? The computer becomes unusable shortly after
> plugging
> the device so I can't really do any meaningful stuff on it.
> 
> Thanks again,
> David
> 

Hey there again!

I managed to get a PCAP capture for this. Note that NetworkManager was
running and actively probing the ttyACM* devices for a modem, hence why
you can see "AT\n" commands being sent to the four devices.

As you can also probably see is that the device currently ignores any
control requests (like Set Line Coding).

Hope it can help your debugging.

David



[-- Attachment #2: usbcap.pcap --]
[-- Type: application/vnd.tcpdump.pcap, Size: 70064 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: System crash/lockup after plugging CDC ACM device
  2020-07-19 23:36                   ` David Guillen Fandos
@ 2020-07-20  1:21                     ` Alan Stern
  2020-07-20 16:55                     ` Dan Williams
  1 sibling, 0 replies; 19+ messages in thread
From: Alan Stern @ 2020-07-20  1:21 UTC (permalink / raw)
  To: David Guillen Fandos; +Cc: Greg KH, linux-usb, dnlplm

On Mon, Jul 20, 2020 at 01:36:58AM +0200, David Guillen Fandos wrote:
> Hey there again!
> 
> I managed to get a PCAP capture for this. Note that NetworkManager was
> running and actively probing the ttyACM* devices for a modem, hence why
> you can see "AT\n" commands being sent to the four devices.
> 
> As you can also probably see is that the device currently ignores any
> control requests (like Set Line Coding).
> 
> Hope it can help your debugging.

What's the story with all the -EPROTO errors on the bulk endpoints in
the trace?  It looks like your device hasn't enabled a bunch of
endpoints that are supposed to be active.

Alan Stern

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: System crash/lockup after plugging CDC ACM device
  2020-07-19 23:36                   ` David Guillen Fandos
  2020-07-20  1:21                     ` Alan Stern
@ 2020-07-20 16:55                     ` Dan Williams
  2020-07-20 20:39                       ` David Guillen Fandos
  1 sibling, 1 reply; 19+ messages in thread
From: Dan Williams @ 2020-07-20 16:55 UTC (permalink / raw)
  To: David Guillen Fandos, Greg KH; +Cc: linux-usb, dnlplm

On Mon, 2020-07-20 at 01:36 +0200, David Guillen Fandos wrote:
> On Thu, 2020-07-16 at 16:30 +0200, David Guillen Fandos wrote:
> > On Wed, 2020-07-15 at 19:03 +0200, David Guillen Fandos wrote:
> > > On Wed, 2020-07-15 at 14:24 +0200, Greg KH wrote:
> > > > On Wed, Jul 15, 2020 at 01:20:54PM +0200, David Guillen Fandos
> > > > wrote:
> > > > > On Wed, 2020-07-15 at 13:12 +0200, Greg KH wrote:
> > > > > > On Wed, Jul 15, 2020 at 12:57:14PM +0200, David Guillen
> > > > > > Fandos
> > > > > > wrote:
> > > > > > > On Wed, 2020-07-15 at 12:50 +0200, Greg KH wrote:
> > > > > > > > On Wed, Jul 15, 2020 at 12:31:42PM +0200, David Guillen
> > > > > > > > Fandos
> > > > > > > > wrote:
> > > > > > > > > On Wed, 2020-07-15 at 11:30 +0200, Greg KH wrote:
> > > > > > > > > > On Wed, Jul 15, 2020 at 10:58:03AM +0200, David
> > > > > > > > > > Guillen
> > > > > > > > > > Fandos
> > > > > > > > > > wrote:
> > > > > > > > > > > Hello linux-usb,
> > > > > > > > > > > 
> > > > > > > > > > > I think I might have found a kernel bug related
> > > > > > > > > > > to
> > > > > > > > > > > the
> > > > > > > > > > > USB
> > > > > > > > > > > subsystem
> > > > > > > > > > > (cdc_acm perhaps).
> > > > > > > > > > > 
> > > > > > > > > > > Context: I was playing around with a device I'm
> > > > > > > > > > > creating,
> > > > > > > > > > > essentially a
> > > > > > > > > > > USB quad modem device that exposes four modems to
> > > > > > > > > > > the
> > > > > > > > > > > host
> > > > > > > > > > > system.
> > > > > > > > > > > This
> > > > > > > > > > > device is still a prototype so there's a few bugs
> > > > > > > > > > > here
> > > > > > > > > > > and
> > > > > > > > > > > there,
> > > > > > > > > > > most
> > > > > > > > > > > likely in the USB descriptors and control
> > > > > > > > > > > requests.
> > > > > > > > > > > 
> > > > > > > > > > > What happens: After plugging the device the
> > > > > > > > > > > system
> > > > > > > > > > > starts
> > > > > > > > > > > spitting
> > > > > > > > > > > warnings and BUGs and it locks up. Most of the
> > > > > > > > > > > time
> > > > > > > > > > > some
> > > > > > > > > > > CPUs
> > > > > > > > > > > get
> > > > > > > > > > > into
> > > > > > > > > > > some spinloop and never comes back (you can see
> > > > > > > > > > > it
> > > > > > > > > > > being
> > > > > > > > > > > detected
> > > > > > > > > > > by
> > > > > > > > > > > the watchdog after a few seconds). Generally
> > > > > > > > > > > after
> > > > > > > > > > > that
> > > > > > > > > > > the
> > > > > > > > > > > USB
> > > > > > > > > > > devices
> > > > > > > > > > > stop working completely and at some point the
> > > > > > > > > > > machine
> > > > > > > > > > > freezes
> > > > > > > > > > > completely. In a couple of ocasions I managed to
> > > > > > > > > > > see
> > > > > > > > > > > a
> > > > > > > > > > > bug
> > > > > > > > > > > in
> > > > > > > > > > > dmesg
> > > > > > > > > > > saying "unable to handle page fault for address
> > > > > > > > > > > XXX"
> > > > > > > > > > > and
> > > > > > > > > > > "Supervisor
> > > > > > > > > > > read access in kernel mode" "error code (0x0000)
> > > > > > > > > > > not
> > > > > > > > > > > present
> > > > > > > > > > > page".
> > > > > > > > > > > I
> > > > > > > > > > > could not get a trace for that one since the
> > > > > > > > > > > kernel
> > > > > > > > > > > died
> > > > > > > > > > > completely
> > > > > > > > > > > and
> > > > > > > > > > > my log files were truncated/lost.
> > > > > > > > > > > 
> > > > > > > > > > > Since it is happening to my two machines (both
> > > > > > > > > > > Intel
> > > > > > > > > > > but
> > > > > > > > > > > rather
> > > > > > > > > > > different controllers, Sunrise Point-LP USB 3.0
> > > > > > > > > > > vs
> > > > > > > > > > > 8
> > > > > > > > > > > Series/C220)
> > > > > > > > > > > and
> > > > > > > > > > > with different kernel versions I suspect this
> > > > > > > > > > > might
> > > > > > > > > > > be
> > > > > > > > > > > a
> > > > > > > > > > > bug in
> > > > > > > > > > > the
> > > > > > > > > > > kernel.
> > > > > > > > > > > 
> > > > > > > > > > > I have 4 logs that I collected, they are sort of
> > > > > > > > > > > long-
> > > > > > > > > > > ish,
> > > > > > > > > > > not
> > > > > > > > > > > sure
> > > > > > > > > > > how
> > > > > > > > > > > to best send them to the list.
> > > > > > > > > > 
> > > > > > > > > > Send the crashes with the callback list, that
> > > > > > > > > > should
> > > > > > > > > > be
> > > > > > > > > > quite
> > > > > > > > > > small,
> > > > > > > > > > right?  We don't need the full log.
> > > > > > > > > > 
> > > > > > > > > > The first crash is the most important, the others
> > > > > > > > > > can
> > > > > > > > > > be
> > > > > > > > > > from
> > > > > > > > > > the
> > > > > > > > > > first
> > > > > > > > > > one and are not reliable.
> > > > > > > > > > 
> > > > > > > > > > thanks,
> > > > > > > > > > 
> > > > > > > > > > greg k-h
> > > > > > > > > 
> > > > > > > > > Ok then, here comes one of the logs, I selected some
> > > > > > > > > bits
> > > > > > > > > only
> > > > > > > > > 
> > > > > > > > > [  147.302016] WARNING: CPU: 3 PID: 134 at
> > > > > > > > > kernel/workqueue.c:1473
> > > > > > > > > __queue_work+0x364/0x410
> > > > > > > > > [...]
> > > > > > > > > [  147.302322] Call Trace:
> > > > > > > > > [  147.302329]  <IRQ>
> > > > > > > > > [  147.302342]  queue_work_on+0x36/0x40
> > > > > > > > > [  147.302353]  __usb_hcd_giveback_urb+0x9c/0x110
> > > > > > > > > [  147.302362]  usb_giveback_urb_bh+0xa0/0xf0
> > > > > > > > > [  147.302372]  tasklet_action_common.constprop.0+0x6
> > > > > > > > > 6/
> > > > > > > > > 0x
> > > > > > > > > 10
> > > > > > > > > 0
> > > > > > > > > [  147.302382]  __do_softirq+0xe9/0x2dc
> > > > > > > > > [  147.302391]  irq_exit+0xcf/0x110
> > > > > > > > > [  147.302397]  do_IRQ+0x55/0xe0
> > > > > > > > > [  147.302408]  common_interrupt+0xf/0xf
> > > > > > > > > [  147.302413]  </IRQ>
> > > > > > > > > [...]
> > > > > > > > > [  184.771172] watchdog: BUG: soft lockup - CPU#3
> > > > > > > > > stuck
> > > > > > > > > for
> > > > > > > > > 23s!
> > > > > > > > > [kworker/3:2:134]
> > > > > > > > 
> > > > > > > > That was the first message?
> > > > > > > > 
> > > > > > > > Ok, we need some more logs, how about the 30 lines
> > > > > > > > right
> > > > > > > > before
> > > > > > > > the
> > > > > > > > above?
> > > > > > > > 
> > > > > > > > And what kernel version are you using?
> > > > > > > > 
> > > > > > > > thanks,
> > > > > > > > 
> > > > > > > > greg k-h
> > > > > > > 
> > > > > > > Heh I assumed you would find the 3rd stack more
> > > > > > > interesting
> > > > > > > since
> > > > > > > it
> > > > > > > involves more subsystems but anyway, here we got, the
> > > > > > > first
> > > > > > > one
> > > > > > > with
> > > > > > > more context. The trigger as you can see is me connecting
> > > > > > > the
> > > > > > > USB
> > > > > > > device:
> > > > > > > 
> > > > > > > [  141.445367] usb 1-1: new full-speed USB device number
> > > > > > > 5
> > > > > > > using
> > > > > > > xhci_hcd
> > > > > > > [  141.573592] usb 1-1: New USB device found,
> > > > > > > idVendor=0483,
> > > > > > > idProduct=5740, bcdDevice= 2.00
> > > > > > > [  141.573597] usb 1-1: New USB device strings: Mfr=1,
> > > > > > > Product=2,
> > > > > > > SerialNumber=3
> > > > > > > [  141.573601] usb 1-1: Product: Quad-UART serial USB
> > > > > > > device
> > > > > > > [  141.573603] usb 1-1: Manufacturer: davidgf.net
> > > > > > > [  141.573605] usb 1-1: SerialNumber: serialno
> > > > > > > [  142.375007] cdc_acm 1-1:1.0: ttyACM0: USB ACM device
> > > > > > > [  142.376623] cdc_acm 1-1:1.2: ttyACM1: USB ACM device
> > > > > > > [  142.378350] cdc_acm 1-1:1.4: ttyACM2: USB ACM device
> > > > > > > [  142.379637] cdc_acm 1-1:1.6: ttyACM3: USB ACM device
> > > > > > > [  142.382473] usbcore: registered new interface driver
> > > > > > > cdc_acm
> > > > > > > [  142.382476] cdc_acm: USB Abstract Control Model driver
> > > > > > > for
> > > > > > > USB
> > > > > > > modems and ISDN adapters
> > > > > > > [  147.301997] ------------[ cut here ]------------
> > > > > > > [  147.302016] WARNING: CPU: 3 PID: 134 at
> > > > > > > kernel/workqueue.c:1473
> > > > > > > __queue_work+0x364/0x410
> > > > > > > [  147.302019] Modules linked in: cdc_acm rfcomm ccm
> > > > > > > wireguard
> > > > > > > curve25519_x86_64 libchacha20poly1305 chacha_x86_64
> > > > > > > poly1305_x86_64
> > > > > > > libblake2s blake2s_x86_64 ip6_udp_tunnel udp_tunnel
> > > > > > > libcurve25519_generic libchacha libblake2s_generic
> > > > > > > nft_fib_inet
> > > > > > > nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet
> > > > > > > nf_reject_ipv4
> > > > > > > nf_reject_ipv6 nft_reject nft_ct nft_chain_nat
> > > > > > > ip6table_nat
> > > > > > > ip6table_mangle ip6table_raw ip6table_security
> > > > > > > iptable_nat
> > > > > > > nf_nat
> > > > > > > nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c
> > > > > > > iptable_mangle
> > > > > > > iptable_raw iptable_security ip_set nf_tables nfnetlink
> > > > > > > ip6table_filter
> > > > > > > ip6_tables iptable_filter cmac vboxnetadp(OE)
> > > > > > > vboxnetflt(OE)
> > > > > > > bnep
> > > > > > > vboxdrv(OE) sunrpc vfat fat uvcvideo videobuf2_vmalloc
> > > > > > > videobuf2_memops
> > > > > > > videobuf2_v4l2 videobuf2_common videodev btusb btrtl
> > > > > > > btbcm
> > > > > > > btintel
> > > > > > > mc
> > > > > > > bluetooth ecdh_generic ecc iTCO_wdt iTCO_vendor_support
> > > > > > > mei_hdcp
> > > > > > > intel_rapl_msr dell_laptop x86_pkg_temp_thermal
> > > > > > > intel_powerclamp
> > > > > > > coretemp kvm_intel kvm irqbypass intel_cstate
> > > > > > > intel_uncore
> > > > > > > intel_rapl_perf iwlmvm
> > > > > > > [  147.302121]  snd_hda_codec_hdmi mac80211 snd_soc_skl
> > > > > > > snd_soc_sst_ipc
> > > > > > > snd_soc_sst_dsp dell_wmi snd_hda_ext_core dell_smbios
> > > > > > > snd_hda_codec_realtek dcdbas libarc4 wmi_bmof
> > > > > > > dell_wmi_descriptor
> > > > > > > snd_soc_acpi_intel_match snd_soc_acpi
> > > > > > > intel_wmi_thunderbolt
> > > > > > > snd_hda_codec_generic snd_soc_core ledtrig_audio iwlwifi
> > > > > > > pcspkr
> > > > > > > snd_compress ac97_bus snd_pcm_dmaengine snd_hda_intel
> > > > > > > snd_intel_dspcfg
> > > > > > > snd_hda_codec cfg80211 snd_hda_core snd_hwdep snd_seq
> > > > > > > snd_seq_device
> > > > > > > joydev snd_pcm rfkill snd_timer snd i2c_i801 soundcore
> > > > > > > idma64
> > > > > > > int3403_thermal intel_hid int3400_thermal sparse_keymap
> > > > > > > acpi_thermal_rel mei_me intel_xhci_usb_role_switch
> > > > > > > acpi_pad
> > > > > > > roles
> > > > > > > mei
> > > > > > > intel_pch_thermal processor_thermal_device
> > > > > > > intel_rapl_common
> > > > > > > int340x_thermal_zone intel_soc_dts_iosf binfmt_misc
> > > > > > > ip_tables
> > > > > > > dm_crypt
> > > > > > > i915 rtsx_pci_sdmmc mmc_core crct10dif_pclmul
> > > > > > > crc32_pclmul
> > > > > > > i2c_algo_bit
> > > > > > > cec crc32c_intel drm_kms_helper nvme ghash_clmulni_intel
> > > > > > > drm
> > > > > > > nvme_core
> > > > > > > serio_raw rtsx_pci hid_multitouch wmi i2c_hid video
> > > > > > > pinctrl_sunrisepoint pinctrl_intel
> > > > > > > [  147.302218]  fuse
> > > > > > > [  147.302230] CPU: 3 PID: 134 Comm: kworker/3:2 Tainted:
> > > > > > > G          IOE     5.7.7-200.fc32.x86_64 #1
> > > > > > > [  147.302233] Hardware name: Dell Inc. XPS 13
> > > > > > > 9350/0PWNCR,
> > > > > > > BIOS
> > > > > > > 1.12.2
> > > > > > > 12/15/2019
> > > > > > > [  147.302260] Workqueue:  0x0 (mm_percpu_wq)
> > > > > > > [  147.302275] RIP: 0010:__queue_work+0x364/0x410
> > > > > > > [  147.302282] Code: e0 f1 69 a9 00 01 1f 00 75 0f 65 48
> > > > > > > 8b
> > > > > > > 3c
> > > > > > > 25
> > > > > > > c0 8b
> > > > > > > 01 00 f6 47 24 20 75 25 0f 0b 48 83 c4 10 5b 5d 41 5c 41
> > > > > > > 5d
> > > > > > > 41
> > > > > > > 5e
> > > > > > > 41 5f
> > > > > > > c3 <0f> 0b e9 78 fe ff ff 41 83 cc 02 49 8d 57 60 e9 5d
> > > > > > > fe
> > > > > > > ff
> > > > > > > ff e8
> > > > > > > 53
> > > > > > > [  147.302286] RSP: 0018:ffffbab980154e68 EFLAGS:
> > > > > > > 00010002
> > > > > > > [  147.302292] RAX: ffff8f551b333790 RBX:
> > > > > > > 0000000000000048
> > > > > > > RCX:
> > > > > > > 0000000000000000
> > > > > > > [  147.302295] RDX: ffff8f551b333798 RSI:
> > > > > > > ffff8f5575803718
> > > > > > > RDI:
> > > > > > > ffff8f5576daa840
> > > > > > > [  147.302299] RBP: ffff8f551b333790 R08:
> > > > > > > ffffffff97856cb0
> > > > > > > R09:
> > > > > > > 0000000000000000
> > > > > > > [  147.302302] R10: 0000000000000000 R11:
> > > > > > > ffffffff97856cb8
> > > > > > > R12:
> > > > > > > 0000000000000003
> > > > > > > [  147.302306] R13: 0000000000002000 R14:
> > > > > > > ffff8f5575c14e00
> > > > > > > R15:
> > > > > > > ffff8f5576db0700
> > > > > > > [  147.302311] FS:  0000000000000000(0000)
> > > > > > > GS:ffff8f5576d80000(0000)
> > > > > > > knlGS:0000000000000000
> > > > > > > [  147.302315] CS:  0010 DS: 0000 ES: 0000 CR0:
> > > > > > > 0000000080050033
> > > > > > > [  147.302319] CR2: 00000000000000b0 CR3:
> > > > > > > 0000000267774004
> > > > > > > CR4:
> > > > > > > 00000000003606e0
> > > > > > > [  147.302322] Call Trace:
> > > > > > > [  147.302329]  <IRQ>
> > > > > > > [  147.302342]  queue_work_on+0x36/0x40
> > > > > > > [  147.302353]  __usb_hcd_giveback_urb+0x9c/0x110
> > > > > > > [  147.302362]  usb_giveback_urb_bh+0xa0/0xf0
> > > > > > 
> > > > > > Are you sure your device is working properly and talking
> > > > > > USB
> > > > > > correctly
> > > > > > to the host?  It looks like you are just timing out for
> > > > > > some
> > > > > > reason.
> > > > > > 
> > > > > > But, that warning is showing that something is odd in the
> > > > > > usb
> > > > > > workqueue,
> > > > > > which is strange.
> > > > > > 
> > > > > > What type of host controller is this talking to?  And does
> > > > > > your
> > > > > > device
> > > > > > actually answer the urbs being sent to it correctly?
> > > > > > 
> > > > > > Using usbmon on this might be the best way to watch the USB
> > > > > > traffic,
> > > > > > if
> > > > > > you don't have a hardware protocol sniffer, which could
> > > > > > provide
> > > > > > some
> > > > > > clues as to what is going wrong.
> > > > > > 
> > > > > > thanks,
> > > > > > 
> > > > > > greg k-h
> > > > > 
> > > > > As I mentioned the device is likely buggy, since I'm
> > > > > developing
> > > > > and
> > > > > debugging it.
> > > > > However my ability to debug and fix any issue is limited by
> > > > > the
> > > > > fact
> > > > > that the kernel decides to stop working as usual, making my
> > > > > USB
> > > > > keyboard and mouse useless, if not crashing later due to soft
> > > > > lockups.
> > > > > 
> > > > > Shouldn't the kernel be resilient to such devices?
> > > > 
> > > > Yes it should, we should not crash.
> > > > 
> > > > 
> > > > > I've developed quite
> > > > > a few USB devices in the past and I've never ran into things
> > > > > like
> > > > > this
> > > > > on Linux (Windows is another story, rather 'easy' to crash,
> > > > > hang
> > > > > or
> > > > > bluescreen). In any case since I do not have access to a
> > > > > hardware
> > > > > debugger and the machine goes bananas (preventing me from
> > > > > using
> > > > > Wireshark) I do not think I can further debug this issue. I
> > > > > could
> > > > > try
> > > > > to find a kernel version where this does not crash the
> > > > > machine
> > > > > (only
> > > > > tested 5.6.X and 5.7.X so far). Or perhaps use VirtualBox,
> > > > > but
> > > > > I'd
> > > > > need
> > > > > to convice the host OS to ignore the USB device and just
> > > > > forward
> > > > > it
> > > > > to
> > > > > the guest.
> > > > 
> > > > Trying to trace down what part of the setup is failing, by
> > > > using
> > > > usbmon,
> > > > will be good to try to figure out what the problem is here, if
> > > > you
> > > > can
> > > > do that.
> > > > 
> > > > > The firmware for this device can be easily tweaked to expose
> > > > > an
> > > > > arbitrary number (up to 7 I think) of CDC ACM interfaces.
> > > > > When
> > > > > I
> > > > > use
> > > > > one or two there's no issues, three has had some issues (but
> > > > > did
> > > > > not
> > > > > investigate further). Going to four is what consistently
> > > > > triggers
> > > > > kernel issues.
> > > > 
> > > > Hm, that might be a clue, what does the output of 'lsusb -v'
> > > > for
> > > > that
> > > > device when you have 3 and then 4 interfaces?
> > > > 
> > > > thanks,
> > > > 
> > > > greg k-h
> > > 
> > > Hello,
> > > 
> > > I will try to see how I can debug further, perhaps I can locate a
> > > machine or kernel that does not crash. Another option can be to
> > > disable
> > > the firmware down to the minimum so that it does not response to
> > > the
> > > bulk endpoints (just to the enumeration and some basic things),
> > > to
> > > rule
> > > out a bad behaviour.
> > > 
> > > The USB descriptor is what you could imagine, just replicate the
> > > two
> > > ACM interfaces (control & data) and add more endpoints. Here goes
> > > the
> > > one with three ports. Note this one seems to make the kernel
> > > crash
> > > just
> > > like the one with four. The only ones that work well are 1 and 2
> > > ports.
> > > Since I'm not aware of any other commercial solutions (apart from
> > > FTDI)
> > > that use more than 2 ACM ports, could that be the issue? Meaning
> > > there's a bug somewhere and no commercial hardware that can
> > > trigger
> > > it.
> > > 
> > > For reference the diff between two and three ports (in lsusb) is
> > > that
> > > it's missing the last two interaces (with the 3 EPs described).
> > > Of
> > > course the bNumInterfaces is 4 instead of 6, and wTotalLength has
> > > a
> > > different value.
> > > 
> > > Hope this can help somehow.
> > > Thanks
> > > David
> > > 
> > > Bus 003 Device 012: ID 0483:5740 STMicroelectronics Virtual COM
> > > Port
> > 
> > Hey again,
> > 
> > I was not aware about the modems Daniele, thanks!
> > 
> > So I did some testing on my old BeagleBone black, which has a very
> > old
> > kernel (3.8.13-bone47). In this device the kernel is happy and I
> > was
> > able to do some testing, it seems to work well. The UARTs seem to
> > work
> > well in both directions, no weird shenanigans, no error/warn
> > messages...
> > 
> > I'm a bit at loss on how I can debug this further, I will try to
> > use
> > a
> > RPi with a newer kernel and see what happens. I could try to boot a
> > Live USB with older kernels (in my Intel machines) to try to locate
> > a
> > version where it works. Since I'm no kernel expert: any way I can
> > provide more info? The computer becomes unusable shortly after
> > plugging
> > the device so I can't really do any meaningful stuff on it.
> > 
> > Thanks again,
> > David
> > 
> 
> Hey there again!
> 
> I managed to get a PCAP capture for this. Note that NetworkManager
> was
> running and actively probing the ttyACM* devices for a modem, hence
> why
> you can see "AT\n" commands being sent to the four devices.

Do you mean ModemManager? NM moved the probing to ModemManager about 6
or 7 years ago. In any case, ModemManager has both a "don't probe"
list, a greylist, and build-time options to only probe things it knows
are modems.

Of course the build-time policy depends on your distro; upstream
ModemManager now defaults to "strict" mode (only probe known
modems/drivers/USB IDs).

Dan

> As you can also probably see is that the device currently ignores any
> control requests (like Set Line Coding).
> 
> Hope it can help your debugging.
> 
> David
> 
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: System crash/lockup after plugging CDC ACM device
  2020-07-20 16:55                     ` Dan Williams
@ 2020-07-20 20:39                       ` David Guillen Fandos
  2020-07-21  8:26                         ` Greg KH
  0 siblings, 1 reply; 19+ messages in thread
From: David Guillen Fandos @ 2020-07-20 20:39 UTC (permalink / raw)
  To: Dan Williams, Greg KH; +Cc: linux-usb, dnlplm

On Mon, 2020-07-20 at 11:55 -0500, Dan Williams wrote:
> On Mon, 2020-07-20 at 01:36 +0200, David Guillen Fandos wrote:
> > On Thu, 2020-07-16 at 16:30 +0200, David Guillen Fandos wrote:
> > > On Wed, 2020-07-15 at 19:03 +0200, David Guillen Fandos wrote:
> > > > On Wed, 2020-07-15 at 14:24 +0200, Greg KH wrote:
> > > > > On Wed, Jul 15, 2020 at 01:20:54PM +0200, David Guillen
> > > > > Fandos
> > > > > wrote:
> > > > > > On Wed, 2020-07-15 at 13:12 +0200, Greg KH wrote:
> > > > > > > On Wed, Jul 15, 2020 at 12:57:14PM +0200, David Guillen
> > > > > > > Fandos
> > > > > > > wrote:
> > > > > > > > On Wed, 2020-07-15 at 12:50 +0200, Greg KH wrote:
> > > > > > > > > On Wed, Jul 15, 2020 at 12:31:42PM +0200, David
> > > > > > > > > Guillen
> > > > > > > > > Fandos
> > > > > > > > > wrote:
> > > > > > > > > > On Wed, 2020-07-15 at 11:30 +0200, Greg KH wrote:
> > > > > > > > > > > On Wed, Jul 15, 2020 at 10:58:03AM +0200, David
> > > > > > > > > > > Guillen
> > > > > > > > > > > Fandos
> > > > > > > > > > > wrote:
> > > > > > > > > > > > Hello linux-usb,
> > > > > > > > > > > > 
> > > > > > > > > > > > I think I might have found a kernel bug related
> > > > > > > > > > > > to
> > > > > > > > > > > > the
> > > > > > > > > > > > USB
> > > > > > > > > > > > subsystem
> > > > > > > > > > > > (cdc_acm perhaps).
> > > > > > > > > > > > 
> > > > > > > > > > > > Context: I was playing around with a device I'm
> > > > > > > > > > > > creating,
> > > > > > > > > > > > essentially a
> > > > > > > > > > > > USB quad modem device that exposes four modems
> > > > > > > > > > > > to
> > > > > > > > > > > > the
> > > > > > > > > > > > host
> > > > > > > > > > > > system.
> > > > > > > > > > > > This
> > > > > > > > > > > > device is still a prototype so there's a few
> > > > > > > > > > > > bugs
> > > > > > > > > > > > here
> > > > > > > > > > > > and
> > > > > > > > > > > > there,
> > > > > > > > > > > > most
> > > > > > > > > > > > likely in the USB descriptors and control
> > > > > > > > > > > > requests.
> > > > > > > > > > > > 
> > > > > > > > > > > > What happens: After plugging the device the
> > > > > > > > > > > > system
> > > > > > > > > > > > starts
> > > > > > > > > > > > spitting
> > > > > > > > > > > > warnings and BUGs and it locks up. Most of the
> > > > > > > > > > > > time
> > > > > > > > > > > > some
> > > > > > > > > > > > CPUs
> > > > > > > > > > > > get
> > > > > > > > > > > > into
> > > > > > > > > > > > some spinloop and never comes back (you can see
> > > > > > > > > > > > it
> > > > > > > > > > > > being
> > > > > > > > > > > > detected
> > > > > > > > > > > > by
> > > > > > > > > > > > the watchdog after a few seconds). Generally
> > > > > > > > > > > > after
> > > > > > > > > > > > that
> > > > > > > > > > > > the
> > > > > > > > > > > > USB
> > > > > > > > > > > > devices
> > > > > > > > > > > > stop working completely and at some point the
> > > > > > > > > > > > machine
> > > > > > > > > > > > freezes
> > > > > > > > > > > > completely. In a couple of ocasions I managed
> > > > > > > > > > > > to
> > > > > > > > > > > > see
> > > > > > > > > > > > a
> > > > > > > > > > > > bug
> > > > > > > > > > > > in
> > > > > > > > > > > > dmesg
> > > > > > > > > > > > saying "unable to handle page fault for address
> > > > > > > > > > > > XXX"
> > > > > > > > > > > > and
> > > > > > > > > > > > "Supervisor
> > > > > > > > > > > > read access in kernel mode" "error code
> > > > > > > > > > > > (0x0000)
> > > > > > > > > > > > not
> > > > > > > > > > > > present
> > > > > > > > > > > > page".
> > > > > > > > > > > > I
> > > > > > > > > > > > could not get a trace for that one since the
> > > > > > > > > > > > kernel
> > > > > > > > > > > > died
> > > > > > > > > > > > completely
> > > > > > > > > > > > and
> > > > > > > > > > > > my log files were truncated/lost.
> > > > > > > > > > > > 
> > > > > > > > > > > > Since it is happening to my two machines (both
> > > > > > > > > > > > Intel
> > > > > > > > > > > > but
> > > > > > > > > > > > rather
> > > > > > > > > > > > different controllers, Sunrise Point-LP USB 3.0
> > > > > > > > > > > > vs
> > > > > > > > > > > > 8
> > > > > > > > > > > > Series/C220)
> > > > > > > > > > > > and
> > > > > > > > > > > > with different kernel versions I suspect this
> > > > > > > > > > > > might
> > > > > > > > > > > > be
> > > > > > > > > > > > a
> > > > > > > > > > > > bug in
> > > > > > > > > > > > the
> > > > > > > > > > > > kernel.
> > > > > > > > > > > > 
> > > > > > > > > > > > I have 4 logs that I collected, they are sort
> > > > > > > > > > > > of
> > > > > > > > > > > > long-
> > > > > > > > > > > > ish,
> > > > > > > > > > > > not
> > > > > > > > > > > > sure
> > > > > > > > > > > > how
> > > > > > > > > > > > to best send them to the list.
> > > > > > > > > > > 
> > > > > > > > > > > Send the crashes with the callback list, that
> > > > > > > > > > > should
> > > > > > > > > > > be
> > > > > > > > > > > quite
> > > > > > > > > > > small,
> > > > > > > > > > > right?  We don't need the full log.
> > > > > > > > > > > 
> > > > > > > > > > > The first crash is the most important, the others
> > > > > > > > > > > can
> > > > > > > > > > > be
> > > > > > > > > > > from
> > > > > > > > > > > the
> > > > > > > > > > > first
> > > > > > > > > > > one and are not reliable.
> > > > > > > > > > > 
> > > > > > > > > > > thanks,
> > > > > > > > > > > 
> > > > > > > > > > > greg k-h
> > > > > > > > > > 
> > > > > > > > > > Ok then, here comes one of the logs, I selected
> > > > > > > > > > some
> > > > > > > > > > bits
> > > > > > > > > > only
> > > > > > > > > > 
> > > > > > > > > > [  147.302016] WARNING: CPU: 3 PID: 134 at
> > > > > > > > > > kernel/workqueue.c:1473
> > > > > > > > > > __queue_work+0x364/0x410
> > > > > > > > > > [...]
> > > > > > > > > > [  147.302322] Call Trace:
> > > > > > > > > > [  147.302329]  <IRQ>
> > > > > > > > > > [  147.302342]  queue_work_on+0x36/0x40
> > > > > > > > > > [  147.302353]  __usb_hcd_giveback_urb+0x9c/0x110
> > > > > > > > > > [  147.302362]  usb_giveback_urb_bh+0xa0/0xf0
> > > > > > > > > > [  147.302372]  tasklet_action_common.constprop.0+0
> > > > > > > > > > x6
> > > > > > > > > > 6/
> > > > > > > > > > 0x
> > > > > > > > > > 10
> > > > > > > > > > 0
> > > > > > > > > > [  147.302382]  __do_softirq+0xe9/0x2dc
> > > > > > > > > > [  147.302391]  irq_exit+0xcf/0x110
> > > > > > > > > > [  147.302397]  do_IRQ+0x55/0xe0
> > > > > > > > > > [  147.302408]  common_interrupt+0xf/0xf
> > > > > > > > > > [  147.302413]  </IRQ>
> > > > > > > > > > [...]
> > > > > > > > > > [  184.771172] watchdog: BUG: soft lockup - CPU#3
> > > > > > > > > > stuck
> > > > > > > > > > for
> > > > > > > > > > 23s!
> > > > > > > > > > [kworker/3:2:134]
> > > > > > > > > 
> > > > > > > > > That was the first message?
> > > > > > > > > 
> > > > > > > > > Ok, we need some more logs, how about the 30 lines
> > > > > > > > > right
> > > > > > > > > before
> > > > > > > > > the
> > > > > > > > > above?
> > > > > > > > > 
> > > > > > > > > And what kernel version are you using?
> > > > > > > > > 
> > > > > > > > > thanks,
> > > > > > > > > 
> > > > > > > > > greg k-h
> > > > > > > > 
> > > > > > > > Heh I assumed you would find the 3rd stack more
> > > > > > > > interesting
> > > > > > > > since
> > > > > > > > it
> > > > > > > > involves more subsystems but anyway, here we got, the
> > > > > > > > first
> > > > > > > > one
> > > > > > > > with
> > > > > > > > more context. The trigger as you can see is me
> > > > > > > > connecting
> > > > > > > > the
> > > > > > > > USB
> > > > > > > > device:
> > > > > > > > 
> > > > > > > > [  141.445367] usb 1-1: new full-speed USB device
> > > > > > > > number
> > > > > > > > 5
> > > > > > > > using
> > > > > > > > xhci_hcd
> > > > > > > > [  141.573592] usb 1-1: New USB device found,
> > > > > > > > idVendor=0483,
> > > > > > > > idProduct=5740, bcdDevice= 2.00
> > > > > > > > [  141.573597] usb 1-1: New USB device strings: Mfr=1,
> > > > > > > > Product=2,
> > > > > > > > SerialNumber=3
> > > > > > > > [  141.573601] usb 1-1: Product: Quad-UART serial USB
> > > > > > > > device
> > > > > > > > [  141.573603] usb 1-1: Manufacturer: davidgf.net
> > > > > > > > [  141.573605] usb 1-1: SerialNumber: serialno
> > > > > > > > [  142.375007] cdc_acm 1-1:1.0: ttyACM0: USB ACM device
> > > > > > > > [  142.376623] cdc_acm 1-1:1.2: ttyACM1: USB ACM device
> > > > > > > > [  142.378350] cdc_acm 1-1:1.4: ttyACM2: USB ACM device
> > > > > > > > [  142.379637] cdc_acm 1-1:1.6: ttyACM3: USB ACM device
> > > > > > > > [  142.382473] usbcore: registered new interface driver
> > > > > > > > cdc_acm
> > > > > > > > [  142.382476] cdc_acm: USB Abstract Control Model
> > > > > > > > driver
> > > > > > > > for
> > > > > > > > USB
> > > > > > > > modems and ISDN adapters
> > > > > > > > [  147.301997] ------------[ cut here ]------------
> > > > > > > > [  147.302016] WARNING: CPU: 3 PID: 134 at
> > > > > > > > kernel/workqueue.c:1473
> > > > > > > > __queue_work+0x364/0x410
> > > > > > > > [  147.302019] Modules linked in: cdc_acm rfcomm ccm
> > > > > > > > wireguard
> > > > > > > > curve25519_x86_64 libchacha20poly1305 chacha_x86_64
> > > > > > > > poly1305_x86_64
> > > > > > > > libblake2s blake2s_x86_64 ip6_udp_tunnel udp_tunnel
> > > > > > > > libcurve25519_generic libchacha libblake2s_generic
> > > > > > > > nft_fib_inet
> > > > > > > > nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet
> > > > > > > > nf_reject_ipv4
> > > > > > > > nf_reject_ipv6 nft_reject nft_ct nft_chain_nat
> > > > > > > > ip6table_nat
> > > > > > > > ip6table_mangle ip6table_raw ip6table_security
> > > > > > > > iptable_nat
> > > > > > > > nf_nat
> > > > > > > > nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c
> > > > > > > > iptable_mangle
> > > > > > > > iptable_raw iptable_security ip_set nf_tables nfnetlink
> > > > > > > > ip6table_filter
> > > > > > > > ip6_tables iptable_filter cmac vboxnetadp(OE)
> > > > > > > > vboxnetflt(OE)
> > > > > > > > bnep
> > > > > > > > vboxdrv(OE) sunrpc vfat fat uvcvideo videobuf2_vmalloc
> > > > > > > > videobuf2_memops
> > > > > > > > videobuf2_v4l2 videobuf2_common videodev btusb btrtl
> > > > > > > > btbcm
> > > > > > > > btintel
> > > > > > > > mc
> > > > > > > > bluetooth ecdh_generic ecc iTCO_wdt iTCO_vendor_support
> > > > > > > > mei_hdcp
> > > > > > > > intel_rapl_msr dell_laptop x86_pkg_temp_thermal
> > > > > > > > intel_powerclamp
> > > > > > > > coretemp kvm_intel kvm irqbypass intel_cstate
> > > > > > > > intel_uncore
> > > > > > > > intel_rapl_perf iwlmvm
> > > > > > > > [  147.302121]  snd_hda_codec_hdmi mac80211 snd_soc_skl
> > > > > > > > snd_soc_sst_ipc
> > > > > > > > snd_soc_sst_dsp dell_wmi snd_hda_ext_core dell_smbios
> > > > > > > > snd_hda_codec_realtek dcdbas libarc4 wmi_bmof
> > > > > > > > dell_wmi_descriptor
> > > > > > > > snd_soc_acpi_intel_match snd_soc_acpi
> > > > > > > > intel_wmi_thunderbolt
> > > > > > > > snd_hda_codec_generic snd_soc_core ledtrig_audio
> > > > > > > > iwlwifi
> > > > > > > > pcspkr
> > > > > > > > snd_compress ac97_bus snd_pcm_dmaengine snd_hda_intel
> > > > > > > > snd_intel_dspcfg
> > > > > > > > snd_hda_codec cfg80211 snd_hda_core snd_hwdep snd_seq
> > > > > > > > snd_seq_device
> > > > > > > > joydev snd_pcm rfkill snd_timer snd i2c_i801 soundcore
> > > > > > > > idma64
> > > > > > > > int3403_thermal intel_hid int3400_thermal sparse_keymap
> > > > > > > > acpi_thermal_rel mei_me intel_xhci_usb_role_switch
> > > > > > > > acpi_pad
> > > > > > > > roles
> > > > > > > > mei
> > > > > > > > intel_pch_thermal processor_thermal_device
> > > > > > > > intel_rapl_common
> > > > > > > > int340x_thermal_zone intel_soc_dts_iosf binfmt_misc
> > > > > > > > ip_tables
> > > > > > > > dm_crypt
> > > > > > > > i915 rtsx_pci_sdmmc mmc_core crct10dif_pclmul
> > > > > > > > crc32_pclmul
> > > > > > > > i2c_algo_bit
> > > > > > > > cec crc32c_intel drm_kms_helper nvme
> > > > > > > > ghash_clmulni_intel
> > > > > > > > drm
> > > > > > > > nvme_core
> > > > > > > > serio_raw rtsx_pci hid_multitouch wmi i2c_hid video
> > > > > > > > pinctrl_sunrisepoint pinctrl_intel
> > > > > > > > [  147.302218]  fuse
> > > > > > > > [  147.302230] CPU: 3 PID: 134 Comm: kworker/3:2
> > > > > > > > Tainted:
> > > > > > > > G          IOE     5.7.7-200.fc32.x86_64 #1
> > > > > > > > [  147.302233] Hardware name: Dell Inc. XPS 13
> > > > > > > > 9350/0PWNCR,
> > > > > > > > BIOS
> > > > > > > > 1.12.2
> > > > > > > > 12/15/2019
> > > > > > > > [  147.302260] Workqueue:  0x0 (mm_percpu_wq)
> > > > > > > > [  147.302275] RIP: 0010:__queue_work+0x364/0x410
> > > > > > > > [  147.302282] Code: e0 f1 69 a9 00 01 1f 00 75 0f 65
> > > > > > > > 48
> > > > > > > > 8b
> > > > > > > > 3c
> > > > > > > > 25
> > > > > > > > c0 8b
> > > > > > > > 01 00 f6 47 24 20 75 25 0f 0b 48 83 c4 10 5b 5d 41 5c
> > > > > > > > 41
> > > > > > > > 5d
> > > > > > > > 41
> > > > > > > > 5e
> > > > > > > > 41 5f
> > > > > > > > c3 <0f> 0b e9 78 fe ff ff 41 83 cc 02 49 8d 57 60 e9 5d
> > > > > > > > fe
> > > > > > > > ff
> > > > > > > > ff e8
> > > > > > > > 53
> > > > > > > > [  147.302286] RSP: 0018:ffffbab980154e68 EFLAGS:
> > > > > > > > 00010002
> > > > > > > > [  147.302292] RAX: ffff8f551b333790 RBX:
> > > > > > > > 0000000000000048
> > > > > > > > RCX:
> > > > > > > > 0000000000000000
> > > > > > > > [  147.302295] RDX: ffff8f551b333798 RSI:
> > > > > > > > ffff8f5575803718
> > > > > > > > RDI:
> > > > > > > > ffff8f5576daa840
> > > > > > > > [  147.302299] RBP: ffff8f551b333790 R08:
> > > > > > > > ffffffff97856cb0
> > > > > > > > R09:
> > > > > > > > 0000000000000000
> > > > > > > > [  147.302302] R10: 0000000000000000 R11:
> > > > > > > > ffffffff97856cb8
> > > > > > > > R12:
> > > > > > > > 0000000000000003
> > > > > > > > [  147.302306] R13: 0000000000002000 R14:
> > > > > > > > ffff8f5575c14e00
> > > > > > > > R15:
> > > > > > > > ffff8f5576db0700
> > > > > > > > [  147.302311] FS:  0000000000000000(0000)
> > > > > > > > GS:ffff8f5576d80000(0000)
> > > > > > > > knlGS:0000000000000000
> > > > > > > > [  147.302315] CS:  0010 DS: 0000 ES: 0000 CR0:
> > > > > > > > 0000000080050033
> > > > > > > > [  147.302319] CR2: 00000000000000b0 CR3:
> > > > > > > > 0000000267774004
> > > > > > > > CR4:
> > > > > > > > 00000000003606e0
> > > > > > > > [  147.302322] Call Trace:
> > > > > > > > [  147.302329]  <IRQ>
> > > > > > > > [  147.302342]  queue_work_on+0x36/0x40
> > > > > > > > [  147.302353]  __usb_hcd_giveback_urb+0x9c/0x110
> > > > > > > > [  147.302362]  usb_giveback_urb_bh+0xa0/0xf0
> > > > > > > 
> > > > > > > Are you sure your device is working properly and talking
> > > > > > > USB
> > > > > > > correctly
> > > > > > > to the host?  It looks like you are just timing out for
> > > > > > > some
> > > > > > > reason.
> > > > > > > 
> > > > > > > But, that warning is showing that something is odd in the
> > > > > > > usb
> > > > > > > workqueue,
> > > > > > > which is strange.
> > > > > > > 
> > > > > > > What type of host controller is this talking to?  And
> > > > > > > does
> > > > > > > your
> > > > > > > device
> > > > > > > actually answer the urbs being sent to it correctly?
> > > > > > > 
> > > > > > > Using usbmon on this might be the best way to watch the
> > > > > > > USB
> > > > > > > traffic,
> > > > > > > if
> > > > > > > you don't have a hardware protocol sniffer, which could
> > > > > > > provide
> > > > > > > some
> > > > > > > clues as to what is going wrong.
> > > > > > > 
> > > > > > > thanks,
> > > > > > > 
> > > > > > > greg k-h
> > > > > > 
> > > > > > As I mentioned the device is likely buggy, since I'm
> > > > > > developing
> > > > > > and
> > > > > > debugging it.
> > > > > > However my ability to debug and fix any issue is limited by
> > > > > > the
> > > > > > fact
> > > > > > that the kernel decides to stop working as usual, making my
> > > > > > USB
> > > > > > keyboard and mouse useless, if not crashing later due to
> > > > > > soft
> > > > > > lockups.
> > > > > > 
> > > > > > Shouldn't the kernel be resilient to such devices?
> > > > > 
> > > > > Yes it should, we should not crash.
> > > > > 
> > > > > 
> > > > > > I've developed quite
> > > > > > a few USB devices in the past and I've never ran into
> > > > > > things
> > > > > > like
> > > > > > this
> > > > > > on Linux (Windows is another story, rather 'easy' to crash,
> > > > > > hang
> > > > > > or
> > > > > > bluescreen). In any case since I do not have access to a
> > > > > > hardware
> > > > > > debugger and the machine goes bananas (preventing me from
> > > > > > using
> > > > > > Wireshark) I do not think I can further debug this issue. I
> > > > > > could
> > > > > > try
> > > > > > to find a kernel version where this does not crash the
> > > > > > machine
> > > > > > (only
> > > > > > tested 5.6.X and 5.7.X so far). Or perhaps use VirtualBox,
> > > > > > but
> > > > > > I'd
> > > > > > need
> > > > > > to convice the host OS to ignore the USB device and just
> > > > > > forward
> > > > > > it
> > > > > > to
> > > > > > the guest.
> > > > > 
> > > > > Trying to trace down what part of the setup is failing, by
> > > > > using
> > > > > usbmon,
> > > > > will be good to try to figure out what the problem is here,
> > > > > if
> > > > > you
> > > > > can
> > > > > do that.
> > > > > 
> > > > > > The firmware for this device can be easily tweaked to
> > > > > > expose
> > > > > > an
> > > > > > arbitrary number (up to 7 I think) of CDC ACM interfaces.
> > > > > > When
> > > > > > I
> > > > > > use
> > > > > > one or two there's no issues, three has had some issues
> > > > > > (but
> > > > > > did
> > > > > > not
> > > > > > investigate further). Going to four is what consistently
> > > > > > triggers
> > > > > > kernel issues.
> > > > > 
> > > > > Hm, that might be a clue, what does the output of 'lsusb -v'
> > > > > for
> > > > > that
> > > > > device when you have 3 and then 4 interfaces?
> > > > > 
> > > > > thanks,
> > > > > 
> > > > > greg k-h
> > > > 
> > > > Hello,
> > > > 
> > > > I will try to see how I can debug further, perhaps I can locate
> > > > a
> > > > machine or kernel that does not crash. Another option can be to
> > > > disable
> > > > the firmware down to the minimum so that it does not response
> > > > to
> > > > the
> > > > bulk endpoints (just to the enumeration and some basic things),
> > > > to
> > > > rule
> > > > out a bad behaviour.
> > > > 
> > > > The USB descriptor is what you could imagine, just replicate
> > > > the
> > > > two
> > > > ACM interfaces (control & data) and add more endpoints. Here
> > > > goes
> > > > the
> > > > one with three ports. Note this one seems to make the kernel
> > > > crash
> > > > just
> > > > like the one with four. The only ones that work well are 1 and
> > > > 2
> > > > ports.
> > > > Since I'm not aware of any other commercial solutions (apart
> > > > from
> > > > FTDI)
> > > > that use more than 2 ACM ports, could that be the issue?
> > > > Meaning
> > > > there's a bug somewhere and no commercial hardware that can
> > > > trigger
> > > > it.
> > > > 
> > > > For reference the diff between two and three ports (in lsusb)
> > > > is
> > > > that
> > > > it's missing the last two interaces (with the 3 EPs described).
> > > > Of
> > > > course the bNumInterfaces is 4 instead of 6, and wTotalLength
> > > > has
> > > > a
> > > > different value.
> > > > 
> > > > Hope this can help somehow.
> > > > Thanks
> > > > David
> > > > 
> > > > Bus 003 Device 012: ID 0483:5740 STMicroelectronics Virtual COM
> > > > Port
> > > 
> > > Hey again,
> > > 
> > > I was not aware about the modems Daniele, thanks!
> > > 
> > > So I did some testing on my old BeagleBone black, which has a
> > > very
> > > old
> > > kernel (3.8.13-bone47). In this device the kernel is happy and I
> > > was
> > > able to do some testing, it seems to work well. The UARTs seem to
> > > work
> > > well in both directions, no weird shenanigans, no error/warn
> > > messages...
> > > 
> > > I'm a bit at loss on how I can debug this further, I will try to
> > > use
> > > a
> > > RPi with a newer kernel and see what happens. I could try to boot
> > > a
> > > Live USB with older kernels (in my Intel machines) to try to
> > > locate
> > > a
> > > version where it works. Since I'm no kernel expert: any way I can
> > > provide more info? The computer becomes unusable shortly after
> > > plugging
> > > the device so I can't really do any meaningful stuff on it.
> > > 
> > > Thanks again,
> > > David
> > > 
> > 
> > Hey there again!
> > 
> > I managed to get a PCAP capture for this. Note that NetworkManager
> > was
> > running and actively probing the ttyACM* devices for a modem, hence
> > why
> > you can see "AT\n" commands being sent to the four devices.
> 
> Do you mean ModemManager? NM moved the probing to ModemManager about
> 6
> or 7 years ago. In any case, ModemManager has both a "don't probe"
> list, a greylist, and build-time options to only probe things it
> knows
> are modems.
> 
> Of course the build-time policy depends on your distro; upstream
> ModemManager now defaults to "strict" mode (only probe known
> modems/drivers/USB IDs).
> 
> Dan
> 
> > As you can also probably see is that the device currently ignores
> > any
> > control requests (like Set Line Coding).
> > 
> > Hope it can help your debugging.
> > 
> > David
> > 
> > 

Yeah ModemManager. I disabled it to better debug.

So I came to the bottom of the issue (device side). This device does
not support more than 3 IN and 3 OUT endpoints. Whenever you specify
more they start behaving very weirdly and Linux/Wireshark shows the
EPROTO errors. In almost all cases that correlates with a kernel crash.

I've seen some EPROTO errors at the begining of a trace on working
devices, I'm assuming it'd what happens during the period the device is
plugged and probed and before it gets full ready (all EPs are setup and
able to respond).

Now coming around the restriction I've been able to use 3 UARTs without
a problem, also on another device from the same family. That explains
why it did not work, does not explain why the kernel gets a CPU stuck
in some busy loop/interrupt loop though.

From my side there's not much more I can do. I could try to get this
setup in a simpler device and ship it someone willing to debug it. I'm
totally convinced this is a bug in the USB stack given it crashes
several computers with relateively different hardware.

Thanks!
David



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: System crash/lockup after plugging CDC ACM device
  2020-07-15 11:20           ` David Guillen Fandos
  2020-07-15 12:24             ` Greg KH
@ 2020-07-21  8:01             ` Oliver Neukum
  1 sibling, 0 replies; 19+ messages in thread
From: Oliver Neukum @ 2020-07-21  8:01 UTC (permalink / raw)
  To: David Guillen Fandos, Greg KH; +Cc: linux-usb

Am Mittwoch, den 15.07.2020, 13:20 +0200 schrieb David Guillen Fandos:
> Shouldn't the kernel be resilient to such devices? I've developed quite
> a few USB devices in the past and I've never ran into things like this
> on Linux


Yes. Unfortunately we do not know which module is crashing here.
It may be as simple as CDC-ACM retrying without a delay or giving
up. Or this is in usbcore. Could you initially try without
CDC-ACM, so we can narrow this down? And tehn we can try with
dynamic debugging or thought what can be done.

And thank you for identifying the root cause for the malfunction.
Yet you are clearly right that even such a device should not crash the
kernel.

	Regards
		Oliver


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: System crash/lockup after plugging CDC ACM device
  2020-07-20 20:39                       ` David Guillen Fandos
@ 2020-07-21  8:26                         ` Greg KH
  2020-07-22 14:41                           ` David Guillen Fandos
  0 siblings, 1 reply; 19+ messages in thread
From: Greg KH @ 2020-07-21  8:26 UTC (permalink / raw)
  To: David Guillen Fandos; +Cc: Dan Williams, linux-usb, dnlplm

On Mon, Jul 20, 2020 at 10:39:40PM +0200, David Guillen Fandos wrote:
> On Mon, 2020-07-20 at 11:55 -0500, Dan Williams wrote:
> > On Mon, 2020-07-20 at 01:36 +0200, David Guillen Fandos wrote:
> > > On Thu, 2020-07-16 at 16:30 +0200, David Guillen Fandos wrote:
> > > > On Wed, 2020-07-15 at 19:03 +0200, David Guillen Fandos wrote:
> > > > > On Wed, 2020-07-15 at 14:24 +0200, Greg KH wrote:
> > > > > > On Wed, Jul 15, 2020 at 01:20:54PM +0200, David Guillen
> > > > > > Fandos
> > > > > > wrote:
> > > > > > > On Wed, 2020-07-15 at 13:12 +0200, Greg KH wrote:
> > > > > > > > On Wed, Jul 15, 2020 at 12:57:14PM +0200, David Guillen
> > > > > > > > Fandos
> > > > > > > > wrote:
> > > > > > > > > On Wed, 2020-07-15 at 12:50 +0200, Greg KH wrote:
> > > > > > > > > > On Wed, Jul 15, 2020 at 12:31:42PM +0200, David
> > > > > > > > > > Guillen
> > > > > > > > > > Fandos
> > > > > > > > > > wrote:
> > > > > > > > > > > On Wed, 2020-07-15 at 11:30 +0200, Greg KH wrote:
> > > > > > > > > > > > On Wed, Jul 15, 2020 at 10:58:03AM +0200, David
> > > > > > > > > > > > Guillen
> > > > > > > > > > > > Fandos
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > > Hello linux-usb,
> > > > > > > > > > > > > 
> > > > > > > > > > > > > I think I might have found a kernel bug related
> > > > > > > > > > > > > to
> > > > > > > > > > > > > the
> > > > > > > > > > > > > USB
> > > > > > > > > > > > > subsystem
> > > > > > > > > > > > > (cdc_acm perhaps).
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Context: I was playing around with a device I'm
> > > > > > > > > > > > > creating,
> > > > > > > > > > > > > essentially a
> > > > > > > > > > > > > USB quad modem device that exposes four modems
> > > > > > > > > > > > > to
> > > > > > > > > > > > > the
> > > > > > > > > > > > > host
> > > > > > > > > > > > > system.
> > > > > > > > > > > > > This
> > > > > > > > > > > > > device is still a prototype so there's a few
> > > > > > > > > > > > > bugs
> > > > > > > > > > > > > here
> > > > > > > > > > > > > and
> > > > > > > > > > > > > there,
> > > > > > > > > > > > > most
> > > > > > > > > > > > > likely in the USB descriptors and control
> > > > > > > > > > > > > requests.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > What happens: After plugging the device the
> > > > > > > > > > > > > system
> > > > > > > > > > > > > starts
> > > > > > > > > > > > > spitting
> > > > > > > > > > > > > warnings and BUGs and it locks up. Most of the
> > > > > > > > > > > > > time
> > > > > > > > > > > > > some
> > > > > > > > > > > > > CPUs
> > > > > > > > > > > > > get
> > > > > > > > > > > > > into
> > > > > > > > > > > > > some spinloop and never comes back (you can see
> > > > > > > > > > > > > it
> > > > > > > > > > > > > being
> > > > > > > > > > > > > detected
> > > > > > > > > > > > > by
> > > > > > > > > > > > > the watchdog after a few seconds). Generally
> > > > > > > > > > > > > after
> > > > > > > > > > > > > that
> > > > > > > > > > > > > the
> > > > > > > > > > > > > USB
> > > > > > > > > > > > > devices
> > > > > > > > > > > > > stop working completely and at some point the
> > > > > > > > > > > > > machine
> > > > > > > > > > > > > freezes
> > > > > > > > > > > > > completely. In a couple of ocasions I managed
> > > > > > > > > > > > > to
> > > > > > > > > > > > > see
> > > > > > > > > > > > > a
> > > > > > > > > > > > > bug
> > > > > > > > > > > > > in
> > > > > > > > > > > > > dmesg
> > > > > > > > > > > > > saying "unable to handle page fault for address
> > > > > > > > > > > > > XXX"
> > > > > > > > > > > > > and
> > > > > > > > > > > > > "Supervisor
> > > > > > > > > > > > > read access in kernel mode" "error code
> > > > > > > > > > > > > (0x0000)
> > > > > > > > > > > > > not
> > > > > > > > > > > > > present
> > > > > > > > > > > > > page".
> > > > > > > > > > > > > I
> > > > > > > > > > > > > could not get a trace for that one since the
> > > > > > > > > > > > > kernel
> > > > > > > > > > > > > died
> > > > > > > > > > > > > completely
> > > > > > > > > > > > > and
> > > > > > > > > > > > > my log files were truncated/lost.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Since it is happening to my two machines (both
> > > > > > > > > > > > > Intel
> > > > > > > > > > > > > but
> > > > > > > > > > > > > rather
> > > > > > > > > > > > > different controllers, Sunrise Point-LP USB 3.0
> > > > > > > > > > > > > vs
> > > > > > > > > > > > > 8
> > > > > > > > > > > > > Series/C220)
> > > > > > > > > > > > > and
> > > > > > > > > > > > > with different kernel versions I suspect this
> > > > > > > > > > > > > might
> > > > > > > > > > > > > be
> > > > > > > > > > > > > a
> > > > > > > > > > > > > bug in
> > > > > > > > > > > > > the
> > > > > > > > > > > > > kernel.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > I have 4 logs that I collected, they are sort
> > > > > > > > > > > > > of
> > > > > > > > > > > > > long-
> > > > > > > > > > > > > ish,
> > > > > > > > > > > > > not
> > > > > > > > > > > > > sure
> > > > > > > > > > > > > how
> > > > > > > > > > > > > to best send them to the list.
> > > > > > > > > > > > 
> > > > > > > > > > > > Send the crashes with the callback list, that
> > > > > > > > > > > > should
> > > > > > > > > > > > be
> > > > > > > > > > > > quite
> > > > > > > > > > > > small,
> > > > > > > > > > > > right?  We don't need the full log.
> > > > > > > > > > > > 
> > > > > > > > > > > > The first crash is the most important, the others
> > > > > > > > > > > > can
> > > > > > > > > > > > be
> > > > > > > > > > > > from
> > > > > > > > > > > > the
> > > > > > > > > > > > first
> > > > > > > > > > > > one and are not reliable.
> > > > > > > > > > > > 
> > > > > > > > > > > > thanks,
> > > > > > > > > > > > 
> > > > > > > > > > > > greg k-h
> > > > > > > > > > > 
> > > > > > > > > > > Ok then, here comes one of the logs, I selected
> > > > > > > > > > > some
> > > > > > > > > > > bits
> > > > > > > > > > > only
> > > > > > > > > > > 
> > > > > > > > > > > [  147.302016] WARNING: CPU: 3 PID: 134 at
> > > > > > > > > > > kernel/workqueue.c:1473
> > > > > > > > > > > __queue_work+0x364/0x410
> > > > > > > > > > > [...]
> > > > > > > > > > > [  147.302322] Call Trace:
> > > > > > > > > > > [  147.302329]  <IRQ>
> > > > > > > > > > > [  147.302342]  queue_work_on+0x36/0x40
> > > > > > > > > > > [  147.302353]  __usb_hcd_giveback_urb+0x9c/0x110
> > > > > > > > > > > [  147.302362]  usb_giveback_urb_bh+0xa0/0xf0
> > > > > > > > > > > [  147.302372]  tasklet_action_common.constprop.0+0
> > > > > > > > > > > x6
> > > > > > > > > > > 6/
> > > > > > > > > > > 0x
> > > > > > > > > > > 10
> > > > > > > > > > > 0
> > > > > > > > > > > [  147.302382]  __do_softirq+0xe9/0x2dc
> > > > > > > > > > > [  147.302391]  irq_exit+0xcf/0x110
> > > > > > > > > > > [  147.302397]  do_IRQ+0x55/0xe0
> > > > > > > > > > > [  147.302408]  common_interrupt+0xf/0xf
> > > > > > > > > > > [  147.302413]  </IRQ>
> > > > > > > > > > > [...]
> > > > > > > > > > > [  184.771172] watchdog: BUG: soft lockup - CPU#3
> > > > > > > > > > > stuck
> > > > > > > > > > > for
> > > > > > > > > > > 23s!
> > > > > > > > > > > [kworker/3:2:134]
> > > > > > > > > > 
> > > > > > > > > > That was the first message?
> > > > > > > > > > 
> > > > > > > > > > Ok, we need some more logs, how about the 30 lines
> > > > > > > > > > right
> > > > > > > > > > before
> > > > > > > > > > the
> > > > > > > > > > above?
> > > > > > > > > > 
> > > > > > > > > > And what kernel version are you using?
> > > > > > > > > > 
> > > > > > > > > > thanks,
> > > > > > > > > > 
> > > > > > > > > > greg k-h
> > > > > > > > > 
> > > > > > > > > Heh I assumed you would find the 3rd stack more
> > > > > > > > > interesting
> > > > > > > > > since
> > > > > > > > > it
> > > > > > > > > involves more subsystems but anyway, here we got, the
> > > > > > > > > first
> > > > > > > > > one
> > > > > > > > > with
> > > > > > > > > more context. The trigger as you can see is me
> > > > > > > > > connecting
> > > > > > > > > the
> > > > > > > > > USB
> > > > > > > > > device:
> > > > > > > > > 
> > > > > > > > > [  141.445367] usb 1-1: new full-speed USB device
> > > > > > > > > number
> > > > > > > > > 5
> > > > > > > > > using
> > > > > > > > > xhci_hcd
> > > > > > > > > [  141.573592] usb 1-1: New USB device found,
> > > > > > > > > idVendor=0483,
> > > > > > > > > idProduct=5740, bcdDevice= 2.00
> > > > > > > > > [  141.573597] usb 1-1: New USB device strings: Mfr=1,
> > > > > > > > > Product=2,
> > > > > > > > > SerialNumber=3
> > > > > > > > > [  141.573601] usb 1-1: Product: Quad-UART serial USB
> > > > > > > > > device
> > > > > > > > > [  141.573603] usb 1-1: Manufacturer: davidgf.net
> > > > > > > > > [  141.573605] usb 1-1: SerialNumber: serialno
> > > > > > > > > [  142.375007] cdc_acm 1-1:1.0: ttyACM0: USB ACM device
> > > > > > > > > [  142.376623] cdc_acm 1-1:1.2: ttyACM1: USB ACM device
> > > > > > > > > [  142.378350] cdc_acm 1-1:1.4: ttyACM2: USB ACM device
> > > > > > > > > [  142.379637] cdc_acm 1-1:1.6: ttyACM3: USB ACM device
> > > > > > > > > [  142.382473] usbcore: registered new interface driver
> > > > > > > > > cdc_acm
> > > > > > > > > [  142.382476] cdc_acm: USB Abstract Control Model
> > > > > > > > > driver
> > > > > > > > > for
> > > > > > > > > USB
> > > > > > > > > modems and ISDN adapters
> > > > > > > > > [  147.301997] ------------[ cut here ]------------
> > > > > > > > > [  147.302016] WARNING: CPU: 3 PID: 134 at
> > > > > > > > > kernel/workqueue.c:1473
> > > > > > > > > __queue_work+0x364/0x410
> > > > > > > > > [  147.302019] Modules linked in: cdc_acm rfcomm ccm
> > > > > > > > > wireguard
> > > > > > > > > curve25519_x86_64 libchacha20poly1305 chacha_x86_64
> > > > > > > > > poly1305_x86_64
> > > > > > > > > libblake2s blake2s_x86_64 ip6_udp_tunnel udp_tunnel
> > > > > > > > > libcurve25519_generic libchacha libblake2s_generic
> > > > > > > > > nft_fib_inet
> > > > > > > > > nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet
> > > > > > > > > nf_reject_ipv4
> > > > > > > > > nf_reject_ipv6 nft_reject nft_ct nft_chain_nat
> > > > > > > > > ip6table_nat
> > > > > > > > > ip6table_mangle ip6table_raw ip6table_security
> > > > > > > > > iptable_nat
> > > > > > > > > nf_nat
> > > > > > > > > nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c
> > > > > > > > > iptable_mangle
> > > > > > > > > iptable_raw iptable_security ip_set nf_tables nfnetlink
> > > > > > > > > ip6table_filter
> > > > > > > > > ip6_tables iptable_filter cmac vboxnetadp(OE)
> > > > > > > > > vboxnetflt(OE)
> > > > > > > > > bnep
> > > > > > > > > vboxdrv(OE) sunrpc vfat fat uvcvideo videobuf2_vmalloc
> > > > > > > > > videobuf2_memops
> > > > > > > > > videobuf2_v4l2 videobuf2_common videodev btusb btrtl
> > > > > > > > > btbcm
> > > > > > > > > btintel
> > > > > > > > > mc
> > > > > > > > > bluetooth ecdh_generic ecc iTCO_wdt iTCO_vendor_support
> > > > > > > > > mei_hdcp
> > > > > > > > > intel_rapl_msr dell_laptop x86_pkg_temp_thermal
> > > > > > > > > intel_powerclamp
> > > > > > > > > coretemp kvm_intel kvm irqbypass intel_cstate
> > > > > > > > > intel_uncore
> > > > > > > > > intel_rapl_perf iwlmvm
> > > > > > > > > [  147.302121]  snd_hda_codec_hdmi mac80211 snd_soc_skl
> > > > > > > > > snd_soc_sst_ipc
> > > > > > > > > snd_soc_sst_dsp dell_wmi snd_hda_ext_core dell_smbios
> > > > > > > > > snd_hda_codec_realtek dcdbas libarc4 wmi_bmof
> > > > > > > > > dell_wmi_descriptor
> > > > > > > > > snd_soc_acpi_intel_match snd_soc_acpi
> > > > > > > > > intel_wmi_thunderbolt
> > > > > > > > > snd_hda_codec_generic snd_soc_core ledtrig_audio
> > > > > > > > > iwlwifi
> > > > > > > > > pcspkr
> > > > > > > > > snd_compress ac97_bus snd_pcm_dmaengine snd_hda_intel
> > > > > > > > > snd_intel_dspcfg
> > > > > > > > > snd_hda_codec cfg80211 snd_hda_core snd_hwdep snd_seq
> > > > > > > > > snd_seq_device
> > > > > > > > > joydev snd_pcm rfkill snd_timer snd i2c_i801 soundcore
> > > > > > > > > idma64
> > > > > > > > > int3403_thermal intel_hid int3400_thermal sparse_keymap
> > > > > > > > > acpi_thermal_rel mei_me intel_xhci_usb_role_switch
> > > > > > > > > acpi_pad
> > > > > > > > > roles
> > > > > > > > > mei
> > > > > > > > > intel_pch_thermal processor_thermal_device
> > > > > > > > > intel_rapl_common
> > > > > > > > > int340x_thermal_zone intel_soc_dts_iosf binfmt_misc
> > > > > > > > > ip_tables
> > > > > > > > > dm_crypt
> > > > > > > > > i915 rtsx_pci_sdmmc mmc_core crct10dif_pclmul
> > > > > > > > > crc32_pclmul
> > > > > > > > > i2c_algo_bit
> > > > > > > > > cec crc32c_intel drm_kms_helper nvme
> > > > > > > > > ghash_clmulni_intel
> > > > > > > > > drm
> > > > > > > > > nvme_core
> > > > > > > > > serio_raw rtsx_pci hid_multitouch wmi i2c_hid video
> > > > > > > > > pinctrl_sunrisepoint pinctrl_intel
> > > > > > > > > [  147.302218]  fuse
> > > > > > > > > [  147.302230] CPU: 3 PID: 134 Comm: kworker/3:2
> > > > > > > > > Tainted:
> > > > > > > > > G          IOE     5.7.7-200.fc32.x86_64 #1
> > > > > > > > > [  147.302233] Hardware name: Dell Inc. XPS 13
> > > > > > > > > 9350/0PWNCR,
> > > > > > > > > BIOS
> > > > > > > > > 1.12.2
> > > > > > > > > 12/15/2019
> > > > > > > > > [  147.302260] Workqueue:  0x0 (mm_percpu_wq)
> > > > > > > > > [  147.302275] RIP: 0010:__queue_work+0x364/0x410
> > > > > > > > > [  147.302282] Code: e0 f1 69 a9 00 01 1f 00 75 0f 65
> > > > > > > > > 48
> > > > > > > > > 8b
> > > > > > > > > 3c
> > > > > > > > > 25
> > > > > > > > > c0 8b
> > > > > > > > > 01 00 f6 47 24 20 75 25 0f 0b 48 83 c4 10 5b 5d 41 5c
> > > > > > > > > 41
> > > > > > > > > 5d
> > > > > > > > > 41
> > > > > > > > > 5e
> > > > > > > > > 41 5f
> > > > > > > > > c3 <0f> 0b e9 78 fe ff ff 41 83 cc 02 49 8d 57 60 e9 5d
> > > > > > > > > fe
> > > > > > > > > ff
> > > > > > > > > ff e8
> > > > > > > > > 53
> > > > > > > > > [  147.302286] RSP: 0018:ffffbab980154e68 EFLAGS:
> > > > > > > > > 00010002
> > > > > > > > > [  147.302292] RAX: ffff8f551b333790 RBX:
> > > > > > > > > 0000000000000048
> > > > > > > > > RCX:
> > > > > > > > > 0000000000000000
> > > > > > > > > [  147.302295] RDX: ffff8f551b333798 RSI:
> > > > > > > > > ffff8f5575803718
> > > > > > > > > RDI:
> > > > > > > > > ffff8f5576daa840
> > > > > > > > > [  147.302299] RBP: ffff8f551b333790 R08:
> > > > > > > > > ffffffff97856cb0
> > > > > > > > > R09:
> > > > > > > > > 0000000000000000
> > > > > > > > > [  147.302302] R10: 0000000000000000 R11:
> > > > > > > > > ffffffff97856cb8
> > > > > > > > > R12:
> > > > > > > > > 0000000000000003
> > > > > > > > > [  147.302306] R13: 0000000000002000 R14:
> > > > > > > > > ffff8f5575c14e00
> > > > > > > > > R15:
> > > > > > > > > ffff8f5576db0700
> > > > > > > > > [  147.302311] FS:  0000000000000000(0000)
> > > > > > > > > GS:ffff8f5576d80000(0000)
> > > > > > > > > knlGS:0000000000000000
> > > > > > > > > [  147.302315] CS:  0010 DS: 0000 ES: 0000 CR0:
> > > > > > > > > 0000000080050033
> > > > > > > > > [  147.302319] CR2: 00000000000000b0 CR3:
> > > > > > > > > 0000000267774004
> > > > > > > > > CR4:
> > > > > > > > > 00000000003606e0
> > > > > > > > > [  147.302322] Call Trace:
> > > > > > > > > [  147.302329]  <IRQ>
> > > > > > > > > [  147.302342]  queue_work_on+0x36/0x40
> > > > > > > > > [  147.302353]  __usb_hcd_giveback_urb+0x9c/0x110
> > > > > > > > > [  147.302362]  usb_giveback_urb_bh+0xa0/0xf0
> > > > > > > > 
> > > > > > > > Are you sure your device is working properly and talking
> > > > > > > > USB
> > > > > > > > correctly
> > > > > > > > to the host?  It looks like you are just timing out for
> > > > > > > > some
> > > > > > > > reason.
> > > > > > > > 
> > > > > > > > But, that warning is showing that something is odd in the
> > > > > > > > usb
> > > > > > > > workqueue,
> > > > > > > > which is strange.
> > > > > > > > 
> > > > > > > > What type of host controller is this talking to?  And
> > > > > > > > does
> > > > > > > > your
> > > > > > > > device
> > > > > > > > actually answer the urbs being sent to it correctly?
> > > > > > > > 
> > > > > > > > Using usbmon on this might be the best way to watch the
> > > > > > > > USB
> > > > > > > > traffic,
> > > > > > > > if
> > > > > > > > you don't have a hardware protocol sniffer, which could
> > > > > > > > provide
> > > > > > > > some
> > > > > > > > clues as to what is going wrong.
> > > > > > > > 
> > > > > > > > thanks,
> > > > > > > > 
> > > > > > > > greg k-h
> > > > > > > 
> > > > > > > As I mentioned the device is likely buggy, since I'm
> > > > > > > developing
> > > > > > > and
> > > > > > > debugging it.
> > > > > > > However my ability to debug and fix any issue is limited by
> > > > > > > the
> > > > > > > fact
> > > > > > > that the kernel decides to stop working as usual, making my
> > > > > > > USB
> > > > > > > keyboard and mouse useless, if not crashing later due to
> > > > > > > soft
> > > > > > > lockups.
> > > > > > > 
> > > > > > > Shouldn't the kernel be resilient to such devices?
> > > > > > 
> > > > > > Yes it should, we should not crash.
> > > > > > 
> > > > > > 
> > > > > > > I've developed quite
> > > > > > > a few USB devices in the past and I've never ran into
> > > > > > > things
> > > > > > > like
> > > > > > > this
> > > > > > > on Linux (Windows is another story, rather 'easy' to crash,
> > > > > > > hang
> > > > > > > or
> > > > > > > bluescreen). In any case since I do not have access to a
> > > > > > > hardware
> > > > > > > debugger and the machine goes bananas (preventing me from
> > > > > > > using
> > > > > > > Wireshark) I do not think I can further debug this issue. I
> > > > > > > could
> > > > > > > try
> > > > > > > to find a kernel version where this does not crash the
> > > > > > > machine
> > > > > > > (only
> > > > > > > tested 5.6.X and 5.7.X so far). Or perhaps use VirtualBox,
> > > > > > > but
> > > > > > > I'd
> > > > > > > need
> > > > > > > to convice the host OS to ignore the USB device and just
> > > > > > > forward
> > > > > > > it
> > > > > > > to
> > > > > > > the guest.
> > > > > > 
> > > > > > Trying to trace down what part of the setup is failing, by
> > > > > > using
> > > > > > usbmon,
> > > > > > will be good to try to figure out what the problem is here,
> > > > > > if
> > > > > > you
> > > > > > can
> > > > > > do that.
> > > > > > 
> > > > > > > The firmware for this device can be easily tweaked to
> > > > > > > expose
> > > > > > > an
> > > > > > > arbitrary number (up to 7 I think) of CDC ACM interfaces.
> > > > > > > When
> > > > > > > I
> > > > > > > use
> > > > > > > one or two there's no issues, three has had some issues
> > > > > > > (but
> > > > > > > did
> > > > > > > not
> > > > > > > investigate further). Going to four is what consistently
> > > > > > > triggers
> > > > > > > kernel issues.
> > > > > > 
> > > > > > Hm, that might be a clue, what does the output of 'lsusb -v'
> > > > > > for
> > > > > > that
> > > > > > device when you have 3 and then 4 interfaces?
> > > > > > 
> > > > > > thanks,
> > > > > > 
> > > > > > greg k-h
> > > > > 
> > > > > Hello,
> > > > > 
> > > > > I will try to see how I can debug further, perhaps I can locate
> > > > > a
> > > > > machine or kernel that does not crash. Another option can be to
> > > > > disable
> > > > > the firmware down to the minimum so that it does not response
> > > > > to
> > > > > the
> > > > > bulk endpoints (just to the enumeration and some basic things),
> > > > > to
> > > > > rule
> > > > > out a bad behaviour.
> > > > > 
> > > > > The USB descriptor is what you could imagine, just replicate
> > > > > the
> > > > > two
> > > > > ACM interfaces (control & data) and add more endpoints. Here
> > > > > goes
> > > > > the
> > > > > one with three ports. Note this one seems to make the kernel
> > > > > crash
> > > > > just
> > > > > like the one with four. The only ones that work well are 1 and
> > > > > 2
> > > > > ports.
> > > > > Since I'm not aware of any other commercial solutions (apart
> > > > > from
> > > > > FTDI)
> > > > > that use more than 2 ACM ports, could that be the issue?
> > > > > Meaning
> > > > > there's a bug somewhere and no commercial hardware that can
> > > > > trigger
> > > > > it.
> > > > > 
> > > > > For reference the diff between two and three ports (in lsusb)
> > > > > is
> > > > > that
> > > > > it's missing the last two interaces (with the 3 EPs described).
> > > > > Of
> > > > > course the bNumInterfaces is 4 instead of 6, and wTotalLength
> > > > > has
> > > > > a
> > > > > different value.
> > > > > 
> > > > > Hope this can help somehow.
> > > > > Thanks
> > > > > David
> > > > > 
> > > > > Bus 003 Device 012: ID 0483:5740 STMicroelectronics Virtual COM
> > > > > Port
> > > > 
> > > > Hey again,
> > > > 
> > > > I was not aware about the modems Daniele, thanks!
> > > > 
> > > > So I did some testing on my old BeagleBone black, which has a
> > > > very
> > > > old
> > > > kernel (3.8.13-bone47). In this device the kernel is happy and I
> > > > was
> > > > able to do some testing, it seems to work well. The UARTs seem to
> > > > work
> > > > well in both directions, no weird shenanigans, no error/warn
> > > > messages...
> > > > 
> > > > I'm a bit at loss on how I can debug this further, I will try to
> > > > use
> > > > a
> > > > RPi with a newer kernel and see what happens. I could try to boot
> > > > a
> > > > Live USB with older kernels (in my Intel machines) to try to
> > > > locate
> > > > a
> > > > version where it works. Since I'm no kernel expert: any way I can
> > > > provide more info? The computer becomes unusable shortly after
> > > > plugging
> > > > the device so I can't really do any meaningful stuff on it.
> > > > 
> > > > Thanks again,
> > > > David
> > > > 
> > > 
> > > Hey there again!
> > > 
> > > I managed to get a PCAP capture for this. Note that NetworkManager
> > > was
> > > running and actively probing the ttyACM* devices for a modem, hence
> > > why
> > > you can see "AT\n" commands being sent to the four devices.
> > 
> > Do you mean ModemManager? NM moved the probing to ModemManager about
> > 6
> > or 7 years ago. In any case, ModemManager has both a "don't probe"
> > list, a greylist, and build-time options to only probe things it
> > knows
> > are modems.
> > 
> > Of course the build-time policy depends on your distro; upstream
> > ModemManager now defaults to "strict" mode (only probe known
> > modems/drivers/USB IDs).
> > 
> > Dan
> > 
> > > As you can also probably see is that the device currently ignores
> > > any
> > > control requests (like Set Line Coding).
> > > 
> > > Hope it can help your debugging.
> > > 
> > > David
> > > 
> > > 
> 
> Yeah ModemManager. I disabled it to better debug.
> 
> So I came to the bottom of the issue (device side). This device does
> not support more than 3 IN and 3 OUT endpoints. Whenever you specify
> more they start behaving very weirdly and Linux/Wireshark shows the
> EPROTO errors. In almost all cases that correlates with a kernel crash.

Ah, that makes more sense, so the device itself is just not responding
to USB commands properly.

> I've seen some EPROTO errors at the begining of a trace on working
> devices, I'm assuming it'd what happens during the period the device is
> plugged and probed and before it gets full ready (all EPs are setup and
> able to respond).
> 
> Now coming around the restriction I've been able to use 3 UARTs without
> a problem, also on another device from the same family. That explains
> why it did not work, does not explain why the kernel gets a CPU stuck
> in some busy loop/interrupt loop though.

I agree, it's not good that we can hang with a broken device like this,
but if the hardware is the thing that is locking up, it might be hard
for us to do anything about this.  Timing out and causing errors like
you see might be the only thing that we can do.

> From my side there's not much more I can do. I could try to get this
> setup in a simpler device and ship it someone willing to debug it. I'm
> totally convinced this is a bug in the USB stack given it crashes
> several computers with relateively different hardware.

I would be interested to see what happens when you plug it into other
operating systems.  Do they also lock up, or can they handle devices
like this?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: System crash/lockup after plugging CDC ACM device
  2020-07-21  8:26                         ` Greg KH
@ 2020-07-22 14:41                           ` David Guillen Fandos
  2020-07-22 14:57                             ` Dan Williams
  0 siblings, 1 reply; 19+ messages in thread
From: David Guillen Fandos @ 2020-07-22 14:41 UTC (permalink / raw)
  To: Greg KH; +Cc: Dan Williams, linux-usb, dnlplm

On Tue, 2020-07-21 at 10:26 +0200, Greg KH wrote:
> On Mon, Jul 20, 2020 at 10:39:40PM +0200, David Guillen Fandos wrote:
> > On Mon, 2020-07-20 at 11:55 -0500, Dan Williams wrote:
> > > On Mon, 2020-07-20 at 01:36 +0200, David Guillen Fandos wrote:
> > > > On Thu, 2020-07-16 at 16:30 +0200, David Guillen Fandos wrote:
> > > > > On Wed, 2020-07-15 at 19:03 +0200, David Guillen Fandos
> > > > > wrote:
> > > > > > On Wed, 2020-07-15 at 14:24 +0200, Greg KH wrote:
> > > > > > > On Wed, Jul 15, 2020 at 01:20:54PM +0200, David Guillen
> > > > > > > Fandos
> > > > > > > wrote:
> > > > > > > > On Wed, 2020-07-15 at 13:12 +0200, Greg KH wrote:
> > > > > > > > > On Wed, Jul 15, 2020 at 12:57:14PM +0200, David
> > > > > > > > > Guillen
> > > > > > > > > Fandos
> > > > > > > > > wrote:
> > > > > > > > > > On Wed, 2020-07-15 at 12:50 +0200, Greg KH wrote:
> > > > > > > > > > > On Wed, Jul 15, 2020 at 12:31:42PM +0200, David
> > > > > > > > > > > Guillen
> > > > > > > > > > > Fandos
> > > > > > > > > > > wrote:
> > > > > > > > > > > > On Wed, 2020-07-15 at 11:30 +0200, Greg KH
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > > On Wed, Jul 15, 2020 at 10:58:03AM +0200,
> > > > > > > > > > > > > David
> > > > > > > > > > > > > Guillen
> > > > > > > > > > > > > Fandos
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > Hello linux-usb,
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > I think I might have found a kernel bug
> > > > > > > > > > > > > > related
> > > > > > > > > > > > > > to
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > USB
> > > > > > > > > > > > > > subsystem
> > > > > > > > > > > > > > (cdc_acm perhaps).
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Context: I was playing around with a device
> > > > > > > > > > > > > > I'm
> > > > > > > > > > > > > > creating,
> > > > > > > > > > > > > > essentially a
> > > > > > > > > > > > > > USB quad modem device that exposes four
> > > > > > > > > > > > > > modems
> > > > > > > > > > > > > > to
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > host
> > > > > > > > > > > > > > system.
> > > > > > > > > > > > > > This
> > > > > > > > > > > > > > device is still a prototype so there's a
> > > > > > > > > > > > > > few
> > > > > > > > > > > > > > bugs
> > > > > > > > > > > > > > here
> > > > > > > > > > > > > > and
> > > > > > > > > > > > > > there,
> > > > > > > > > > > > > > most
> > > > > > > > > > > > > > likely in the USB descriptors and control
> > > > > > > > > > > > > > requests.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > What happens: After plugging the device the
> > > > > > > > > > > > > > system
> > > > > > > > > > > > > > starts
> > > > > > > > > > > > > > spitting
> > > > > > > > > > > > > > warnings and BUGs and it locks up. Most of
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > time
> > > > > > > > > > > > > > some
> > > > > > > > > > > > > > CPUs
> > > > > > > > > > > > > > get
> > > > > > > > > > > > > > into
> > > > > > > > > > > > > > some spinloop and never comes back (you can
> > > > > > > > > > > > > > see
> > > > > > > > > > > > > > it
> > > > > > > > > > > > > > being
> > > > > > > > > > > > > > detected
> > > > > > > > > > > > > > by
> > > > > > > > > > > > > > the watchdog after a few seconds).
> > > > > > > > > > > > > > Generally
> > > > > > > > > > > > > > after
> > > > > > > > > > > > > > that
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > USB
> > > > > > > > > > > > > > devices
> > > > > > > > > > > > > > stop working completely and at some point
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > machine
> > > > > > > > > > > > > > freezes
> > > > > > > > > > > > > > completely. In a couple of ocasions I
> > > > > > > > > > > > > > managed
> > > > > > > > > > > > > > to
> > > > > > > > > > > > > > see
> > > > > > > > > > > > > > a
> > > > > > > > > > > > > > bug
> > > > > > > > > > > > > > in
> > > > > > > > > > > > > > dmesg
> > > > > > > > > > > > > > saying "unable to handle page fault for
> > > > > > > > > > > > > > address
> > > > > > > > > > > > > > XXX"
> > > > > > > > > > > > > > and
> > > > > > > > > > > > > > "Supervisor
> > > > > > > > > > > > > > read access in kernel mode" "error code
> > > > > > > > > > > > > > (0x0000)
> > > > > > > > > > > > > > not
> > > > > > > > > > > > > > present
> > > > > > > > > > > > > > page".
> > > > > > > > > > > > > > I
> > > > > > > > > > > > > > could not get a trace for that one since
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > kernel
> > > > > > > > > > > > > > died
> > > > > > > > > > > > > > completely
> > > > > > > > > > > > > > and
> > > > > > > > > > > > > > my log files were truncated/lost.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Since it is happening to my two machines
> > > > > > > > > > > > > > (both
> > > > > > > > > > > > > > Intel
> > > > > > > > > > > > > > but
> > > > > > > > > > > > > > rather
> > > > > > > > > > > > > > different controllers, Sunrise Point-LP USB
> > > > > > > > > > > > > > 3.0
> > > > > > > > > > > > > > vs
> > > > > > > > > > > > > > 8
> > > > > > > > > > > > > > Series/C220)
> > > > > > > > > > > > > > and
> > > > > > > > > > > > > > with different kernel versions I suspect
> > > > > > > > > > > > > > this
> > > > > > > > > > > > > > might
> > > > > > > > > > > > > > be
> > > > > > > > > > > > > > a
> > > > > > > > > > > > > > bug in
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > kernel.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > I have 4 logs that I collected, they are
> > > > > > > > > > > > > > sort
> > > > > > > > > > > > > > of
> > > > > > > > > > > > > > long-
> > > > > > > > > > > > > > ish,
> > > > > > > > > > > > > > not
> > > > > > > > > > > > > > sure
> > > > > > > > > > > > > > how
> > > > > > > > > > > > > > to best send them to the list.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Send the crashes with the callback list, that
> > > > > > > > > > > > > should
> > > > > > > > > > > > > be
> > > > > > > > > > > > > quite
> > > > > > > > > > > > > small,
> > > > > > > > > > > > > right?  We don't need the full log.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > The first crash is the most important, the
> > > > > > > > > > > > > others
> > > > > > > > > > > > > can
> > > > > > > > > > > > > be
> > > > > > > > > > > > > from
> > > > > > > > > > > > > the
> > > > > > > > > > > > > first
> > > > > > > > > > > > > one and are not reliable.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > thanks,
> > > > > > > > > > > > > 
> > > > > > > > > > > > > greg k-h
> > > > > > > > > > > > 
> > > > > > > > > > > > Ok then, here comes one of the logs, I selected
> > > > > > > > > > > > some
> > > > > > > > > > > > bits
> > > > > > > > > > > > only
> > > > > > > > > > > > 
> > > > > > > > > > > > [  147.302016] WARNING: CPU: 3 PID: 134 at
> > > > > > > > > > > > kernel/workqueue.c:1473
> > > > > > > > > > > > __queue_work+0x364/0x410
> > > > > > > > > > > > [...]
> > > > > > > > > > > > [  147.302322] Call Trace:
> > > > > > > > > > > > [  147.302329]  <IRQ>
> > > > > > > > > > > > [  147.302342]  queue_work_on+0x36/0x40
> > > > > > > > > > > > [  147.302353]  __usb_hcd_giveback_urb+0x9c/0x1
> > > > > > > > > > > > 10
> > > > > > > > > > > > [  147.302362]  usb_giveback_urb_bh+0xa0/0xf0
> > > > > > > > > > > > [  147.302372]  tasklet_action_common.constprop
> > > > > > > > > > > > .0+0
> > > > > > > > > > > > x6
> > > > > > > > > > > > 6/
> > > > > > > > > > > > 0x
> > > > > > > > > > > > 10
> > > > > > > > > > > > 0
> > > > > > > > > > > > [  147.302382]  __do_softirq+0xe9/0x2dc
> > > > > > > > > > > > [  147.302391]  irq_exit+0xcf/0x110
> > > > > > > > > > > > [  147.302397]  do_IRQ+0x55/0xe0
> > > > > > > > > > > > [  147.302408]  common_interrupt+0xf/0xf
> > > > > > > > > > > > [  147.302413]  </IRQ>
> > > > > > > > > > > > [...]
> > > > > > > > > > > > [  184.771172] watchdog: BUG: soft lockup -
> > > > > > > > > > > > CPU#3
> > > > > > > > > > > > stuck
> > > > > > > > > > > > for
> > > > > > > > > > > > 23s!
> > > > > > > > > > > > [kworker/3:2:134]
> > > > > > > > > > > 
> > > > > > > > > > > That was the first message?
> > > > > > > > > > > 
> > > > > > > > > > > Ok, we need some more logs, how about the 30
> > > > > > > > > > > lines
> > > > > > > > > > > right
> > > > > > > > > > > before
> > > > > > > > > > > the
> > > > > > > > > > > above?
> > > > > > > > > > > 
> > > > > > > > > > > And what kernel version are you using?
> > > > > > > > > > > 
> > > > > > > > > > > thanks,
> > > > > > > > > > > 
> > > > > > > > > > > greg k-h
> > > > > > > > > > 
> > > > > > > > > > Heh I assumed you would find the 3rd stack more
> > > > > > > > > > interesting
> > > > > > > > > > since
> > > > > > > > > > it
> > > > > > > > > > involves more subsystems but anyway, here we got,
> > > > > > > > > > the
> > > > > > > > > > first
> > > > > > > > > > one
> > > > > > > > > > with
> > > > > > > > > > more context. The trigger as you can see is me
> > > > > > > > > > connecting
> > > > > > > > > > the
> > > > > > > > > > USB
> > > > > > > > > > device:
> > > > > > > > > > 
> > > > > > > > > > [  141.445367] usb 1-1: new full-speed USB device
> > > > > > > > > > number
> > > > > > > > > > 5
> > > > > > > > > > using
> > > > > > > > > > xhci_hcd
> > > > > > > > > > [  141.573592] usb 1-1: New USB device found,
> > > > > > > > > > idVendor=0483,
> > > > > > > > > > idProduct=5740, bcdDevice= 2.00
> > > > > > > > > > [  141.573597] usb 1-1: New USB device strings:
> > > > > > > > > > Mfr=1,
> > > > > > > > > > Product=2,
> > > > > > > > > > SerialNumber=3
> > > > > > > > > > [  141.573601] usb 1-1: Product: Quad-UART serial
> > > > > > > > > > USB
> > > > > > > > > > device
> > > > > > > > > > [  141.573603] usb 1-1: Manufacturer: davidgf.net
> > > > > > > > > > [  141.573605] usb 1-1: SerialNumber: serialno
> > > > > > > > > > [  142.375007] cdc_acm 1-1:1.0: ttyACM0: USB ACM
> > > > > > > > > > device
> > > > > > > > > > [  142.376623] cdc_acm 1-1:1.2: ttyACM1: USB ACM
> > > > > > > > > > device
> > > > > > > > > > [  142.378350] cdc_acm 1-1:1.4: ttyACM2: USB ACM
> > > > > > > > > > device
> > > > > > > > > > [  142.379637] cdc_acm 1-1:1.6: ttyACM3: USB ACM
> > > > > > > > > > device
> > > > > > > > > > [  142.382473] usbcore: registered new interface
> > > > > > > > > > driver
> > > > > > > > > > cdc_acm
> > > > > > > > > > [  142.382476] cdc_acm: USB Abstract Control Model
> > > > > > > > > > driver
> > > > > > > > > > for
> > > > > > > > > > USB
> > > > > > > > > > modems and ISDN adapters
> > > > > > > > > > [  147.301997] ------------[ cut here ]------------
> > > > > > > > > > [  147.302016] WARNING: CPU: 3 PID: 134 at
> > > > > > > > > > kernel/workqueue.c:1473
> > > > > > > > > > __queue_work+0x364/0x410
> > > > > > > > > > [  147.302019] Modules linked in: cdc_acm rfcomm
> > > > > > > > > > ccm
> > > > > > > > > > wireguard
> > > > > > > > > > curve25519_x86_64 libchacha20poly1305 chacha_x86_64
> > > > > > > > > > poly1305_x86_64
> > > > > > > > > > libblake2s blake2s_x86_64 ip6_udp_tunnel udp_tunnel
> > > > > > > > > > libcurve25519_generic libchacha libblake2s_generic
> > > > > > > > > > nft_fib_inet
> > > > > > > > > > nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet
> > > > > > > > > > nf_reject_ipv4
> > > > > > > > > > nf_reject_ipv6 nft_reject nft_ct nft_chain_nat
> > > > > > > > > > ip6table_nat
> > > > > > > > > > ip6table_mangle ip6table_raw ip6table_security
> > > > > > > > > > iptable_nat
> > > > > > > > > > nf_nat
> > > > > > > > > > nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4
> > > > > > > > > > libcrc32c
> > > > > > > > > > iptable_mangle
> > > > > > > > > > iptable_raw iptable_security ip_set nf_tables
> > > > > > > > > > nfnetlink
> > > > > > > > > > ip6table_filter
> > > > > > > > > > ip6_tables iptable_filter cmac vboxnetadp(OE)
> > > > > > > > > > vboxnetflt(OE)
> > > > > > > > > > bnep
> > > > > > > > > > vboxdrv(OE) sunrpc vfat fat uvcvideo
> > > > > > > > > > videobuf2_vmalloc
> > > > > > > > > > videobuf2_memops
> > > > > > > > > > videobuf2_v4l2 videobuf2_common videodev btusb
> > > > > > > > > > btrtl
> > > > > > > > > > btbcm
> > > > > > > > > > btintel
> > > > > > > > > > mc
> > > > > > > > > > bluetooth ecdh_generic ecc iTCO_wdt
> > > > > > > > > > iTCO_vendor_support
> > > > > > > > > > mei_hdcp
> > > > > > > > > > intel_rapl_msr dell_laptop x86_pkg_temp_thermal
> > > > > > > > > > intel_powerclamp
> > > > > > > > > > coretemp kvm_intel kvm irqbypass intel_cstate
> > > > > > > > > > intel_uncore
> > > > > > > > > > intel_rapl_perf iwlmvm
> > > > > > > > > > [  147.302121]  snd_hda_codec_hdmi mac80211
> > > > > > > > > > snd_soc_skl
> > > > > > > > > > snd_soc_sst_ipc
> > > > > > > > > > snd_soc_sst_dsp dell_wmi snd_hda_ext_core
> > > > > > > > > > dell_smbios
> > > > > > > > > > snd_hda_codec_realtek dcdbas libarc4 wmi_bmof
> > > > > > > > > > dell_wmi_descriptor
> > > > > > > > > > snd_soc_acpi_intel_match snd_soc_acpi
> > > > > > > > > > intel_wmi_thunderbolt
> > > > > > > > > > snd_hda_codec_generic snd_soc_core ledtrig_audio
> > > > > > > > > > iwlwifi
> > > > > > > > > > pcspkr
> > > > > > > > > > snd_compress ac97_bus snd_pcm_dmaengine
> > > > > > > > > > snd_hda_intel
> > > > > > > > > > snd_intel_dspcfg
> > > > > > > > > > snd_hda_codec cfg80211 snd_hda_core snd_hwdep
> > > > > > > > > > snd_seq
> > > > > > > > > > snd_seq_device
> > > > > > > > > > joydev snd_pcm rfkill snd_timer snd i2c_i801
> > > > > > > > > > soundcore
> > > > > > > > > > idma64
> > > > > > > > > > int3403_thermal intel_hid int3400_thermal
> > > > > > > > > > sparse_keymap
> > > > > > > > > > acpi_thermal_rel mei_me intel_xhci_usb_role_switch
> > > > > > > > > > acpi_pad
> > > > > > > > > > roles
> > > > > > > > > > mei
> > > > > > > > > > intel_pch_thermal processor_thermal_device
> > > > > > > > > > intel_rapl_common
> > > > > > > > > > int340x_thermal_zone intel_soc_dts_iosf binfmt_misc
> > > > > > > > > > ip_tables
> > > > > > > > > > dm_crypt
> > > > > > > > > > i915 rtsx_pci_sdmmc mmc_core crct10dif_pclmul
> > > > > > > > > > crc32_pclmul
> > > > > > > > > > i2c_algo_bit
> > > > > > > > > > cec crc32c_intel drm_kms_helper nvme
> > > > > > > > > > ghash_clmulni_intel
> > > > > > > > > > drm
> > > > > > > > > > nvme_core
> > > > > > > > > > serio_raw rtsx_pci hid_multitouch wmi i2c_hid video
> > > > > > > > > > pinctrl_sunrisepoint pinctrl_intel
> > > > > > > > > > [  147.302218]  fuse
> > > > > > > > > > [  147.302230] CPU: 3 PID: 134 Comm: kworker/3:2
> > > > > > > > > > Tainted:
> > > > > > > > > > G          IOE     5.7.7-200.fc32.x86_64 #1
> > > > > > > > > > [  147.302233] Hardware name: Dell Inc. XPS 13
> > > > > > > > > > 9350/0PWNCR,
> > > > > > > > > > BIOS
> > > > > > > > > > 1.12.2
> > > > > > > > > > 12/15/2019
> > > > > > > > > > [  147.302260] Workqueue:  0x0 (mm_percpu_wq)
> > > > > > > > > > [  147.302275] RIP: 0010:__queue_work+0x364/0x410
> > > > > > > > > > [  147.302282] Code: e0 f1 69 a9 00 01 1f 00 75 0f
> > > > > > > > > > 65
> > > > > > > > > > 48
> > > > > > > > > > 8b
> > > > > > > > > > 3c
> > > > > > > > > > 25
> > > > > > > > > > c0 8b
> > > > > > > > > > 01 00 f6 47 24 20 75 25 0f 0b 48 83 c4 10 5b 5d 41
> > > > > > > > > > 5c
> > > > > > > > > > 41
> > > > > > > > > > 5d
> > > > > > > > > > 41
> > > > > > > > > > 5e
> > > > > > > > > > 41 5f
> > > > > > > > > > c3 <0f> 0b e9 78 fe ff ff 41 83 cc 02 49 8d 57 60
> > > > > > > > > > e9 5d
> > > > > > > > > > fe
> > > > > > > > > > ff
> > > > > > > > > > ff e8
> > > > > > > > > > 53
> > > > > > > > > > [  147.302286] RSP: 0018:ffffbab980154e68 EFLAGS:
> > > > > > > > > > 00010002
> > > > > > > > > > [  147.302292] RAX: ffff8f551b333790 RBX:
> > > > > > > > > > 0000000000000048
> > > > > > > > > > RCX:
> > > > > > > > > > 0000000000000000
> > > > > > > > > > [  147.302295] RDX: ffff8f551b333798 RSI:
> > > > > > > > > > ffff8f5575803718
> > > > > > > > > > RDI:
> > > > > > > > > > ffff8f5576daa840
> > > > > > > > > > [  147.302299] RBP: ffff8f551b333790 R08:
> > > > > > > > > > ffffffff97856cb0
> > > > > > > > > > R09:
> > > > > > > > > > 0000000000000000
> > > > > > > > > > [  147.302302] R10: 0000000000000000 R11:
> > > > > > > > > > ffffffff97856cb8
> > > > > > > > > > R12:
> > > > > > > > > > 0000000000000003
> > > > > > > > > > [  147.302306] R13: 0000000000002000 R14:
> > > > > > > > > > ffff8f5575c14e00
> > > > > > > > > > R15:
> > > > > > > > > > ffff8f5576db0700
> > > > > > > > > > [  147.302311] FS:  0000000000000000(0000)
> > > > > > > > > > GS:ffff8f5576d80000(0000)
> > > > > > > > > > knlGS:0000000000000000
> > > > > > > > > > [  147.302315] CS:  0010 DS: 0000 ES: 0000 CR0:
> > > > > > > > > > 0000000080050033
> > > > > > > > > > [  147.302319] CR2: 00000000000000b0 CR3:
> > > > > > > > > > 0000000267774004
> > > > > > > > > > CR4:
> > > > > > > > > > 00000000003606e0
> > > > > > > > > > [  147.302322] Call Trace:
> > > > > > > > > > [  147.302329]  <IRQ>
> > > > > > > > > > [  147.302342]  queue_work_on+0x36/0x40
> > > > > > > > > > [  147.302353]  __usb_hcd_giveback_urb+0x9c/0x110
> > > > > > > > > > [  147.302362]  usb_giveback_urb_bh+0xa0/0xf0
> > > > > > > > > 
> > > > > > > > > Are you sure your device is working properly and
> > > > > > > > > talking
> > > > > > > > > USB
> > > > > > > > > correctly
> > > > > > > > > to the host?  It looks like you are just timing out
> > > > > > > > > for
> > > > > > > > > some
> > > > > > > > > reason.
> > > > > > > > > 
> > > > > > > > > But, that warning is showing that something is odd in
> > > > > > > > > the
> > > > > > > > > usb
> > > > > > > > > workqueue,
> > > > > > > > > which is strange.
> > > > > > > > > 
> > > > > > > > > What type of host controller is this talking to?  And
> > > > > > > > > does
> > > > > > > > > your
> > > > > > > > > device
> > > > > > > > > actually answer the urbs being sent to it correctly?
> > > > > > > > > 
> > > > > > > > > Using usbmon on this might be the best way to watch
> > > > > > > > > the
> > > > > > > > > USB
> > > > > > > > > traffic,
> > > > > > > > > if
> > > > > > > > > you don't have a hardware protocol sniffer, which
> > > > > > > > > could
> > > > > > > > > provide
> > > > > > > > > some
> > > > > > > > > clues as to what is going wrong.
> > > > > > > > > 
> > > > > > > > > thanks,
> > > > > > > > > 
> > > > > > > > > greg k-h
> > > > > > > > 
> > > > > > > > As I mentioned the device is likely buggy, since I'm
> > > > > > > > developing
> > > > > > > > and
> > > > > > > > debugging it.
> > > > > > > > However my ability to debug and fix any issue is
> > > > > > > > limited by
> > > > > > > > the
> > > > > > > > fact
> > > > > > > > that the kernel decides to stop working as usual,
> > > > > > > > making my
> > > > > > > > USB
> > > > > > > > keyboard and mouse useless, if not crashing later due
> > > > > > > > to
> > > > > > > > soft
> > > > > > > > lockups.
> > > > > > > > 
> > > > > > > > Shouldn't the kernel be resilient to such devices?
> > > > > > > 
> > > > > > > Yes it should, we should not crash.
> > > > > > > 
> > > > > > > 
> > > > > > > > I've developed quite
> > > > > > > > a few USB devices in the past and I've never ran into
> > > > > > > > things
> > > > > > > > like
> > > > > > > > this
> > > > > > > > on Linux (Windows is another story, rather 'easy' to
> > > > > > > > crash,
> > > > > > > > hang
> > > > > > > > or
> > > > > > > > bluescreen). In any case since I do not have access to
> > > > > > > > a
> > > > > > > > hardware
> > > > > > > > debugger and the machine goes bananas (preventing me
> > > > > > > > from
> > > > > > > > using
> > > > > > > > Wireshark) I do not think I can further debug this
> > > > > > > > issue. I
> > > > > > > > could
> > > > > > > > try
> > > > > > > > to find a kernel version where this does not crash the
> > > > > > > > machine
> > > > > > > > (only
> > > > > > > > tested 5.6.X and 5.7.X so far). Or perhaps use
> > > > > > > > VirtualBox,
> > > > > > > > but
> > > > > > > > I'd
> > > > > > > > need
> > > > > > > > to convice the host OS to ignore the USB device and
> > > > > > > > just
> > > > > > > > forward
> > > > > > > > it
> > > > > > > > to
> > > > > > > > the guest.
> > > > > > > 
> > > > > > > Trying to trace down what part of the setup is failing,
> > > > > > > by
> > > > > > > using
> > > > > > > usbmon,
> > > > > > > will be good to try to figure out what the problem is
> > > > > > > here,
> > > > > > > if
> > > > > > > you
> > > > > > > can
> > > > > > > do that.
> > > > > > > 
> > > > > > > > The firmware for this device can be easily tweaked to
> > > > > > > > expose
> > > > > > > > an
> > > > > > > > arbitrary number (up to 7 I think) of CDC ACM
> > > > > > > > interfaces.
> > > > > > > > When
> > > > > > > > I
> > > > > > > > use
> > > > > > > > one or two there's no issues, three has had some issues
> > > > > > > > (but
> > > > > > > > did
> > > > > > > > not
> > > > > > > > investigate further). Going to four is what
> > > > > > > > consistently
> > > > > > > > triggers
> > > > > > > > kernel issues.
> > > > > > > 
> > > > > > > Hm, that might be a clue, what does the output of 'lsusb
> > > > > > > -v'
> > > > > > > for
> > > > > > > that
> > > > > > > device when you have 3 and then 4 interfaces?
> > > > > > > 
> > > > > > > thanks,
> > > > > > > 
> > > > > > > greg k-h
> > > > > > 
> > > > > > Hello,
> > > > > > 
> > > > > > I will try to see how I can debug further, perhaps I can
> > > > > > locate
> > > > > > a
> > > > > > machine or kernel that does not crash. Another option can
> > > > > > be to
> > > > > > disable
> > > > > > the firmware down to the minimum so that it does not
> > > > > > response
> > > > > > to
> > > > > > the
> > > > > > bulk endpoints (just to the enumeration and some basic
> > > > > > things),
> > > > > > to
> > > > > > rule
> > > > > > out a bad behaviour.
> > > > > > 
> > > > > > The USB descriptor is what you could imagine, just
> > > > > > replicate
> > > > > > the
> > > > > > two
> > > > > > ACM interfaces (control & data) and add more endpoints.
> > > > > > Here
> > > > > > goes
> > > > > > the
> > > > > > one with three ports. Note this one seems to make the
> > > > > > kernel
> > > > > > crash
> > > > > > just
> > > > > > like the one with four. The only ones that work well are 1
> > > > > > and
> > > > > > 2
> > > > > > ports.
> > > > > > Since I'm not aware of any other commercial solutions
> > > > > > (apart
> > > > > > from
> > > > > > FTDI)
> > > > > > that use more than 2 ACM ports, could that be the issue?
> > > > > > Meaning
> > > > > > there's a bug somewhere and no commercial hardware that can
> > > > > > trigger
> > > > > > it.
> > > > > > 
> > > > > > For reference the diff between two and three ports (in
> > > > > > lsusb)
> > > > > > is
> > > > > > that
> > > > > > it's missing the last two interaces (with the 3 EPs
> > > > > > described).
> > > > > > Of
> > > > > > course the bNumInterfaces is 4 instead of 6, and
> > > > > > wTotalLength
> > > > > > has
> > > > > > a
> > > > > > different value.
> > > > > > 
> > > > > > Hope this can help somehow.
> > > > > > Thanks
> > > > > > David
> > > > > > 
> > > > > > Bus 003 Device 012: ID 0483:5740 STMicroelectronics Virtual
> > > > > > COM
> > > > > > Port
> > > > > 
> > > > > Hey again,
> > > > > 
> > > > > I was not aware about the modems Daniele, thanks!
> > > > > 
> > > > > So I did some testing on my old BeagleBone black, which has a
> > > > > very
> > > > > old
> > > > > kernel (3.8.13-bone47). In this device the kernel is happy
> > > > > and I
> > > > > was
> > > > > able to do some testing, it seems to work well. The UARTs
> > > > > seem to
> > > > > work
> > > > > well in both directions, no weird shenanigans, no error/warn
> > > > > messages...
> > > > > 
> > > > > I'm a bit at loss on how I can debug this further, I will try
> > > > > to
> > > > > use
> > > > > a
> > > > > RPi with a newer kernel and see what happens. I could try to
> > > > > boot
> > > > > a
> > > > > Live USB with older kernels (in my Intel machines) to try to
> > > > > locate
> > > > > a
> > > > > version where it works. Since I'm no kernel expert: any way I
> > > > > can
> > > > > provide more info? The computer becomes unusable shortly
> > > > > after
> > > > > plugging
> > > > > the device so I can't really do any meaningful stuff on it.
> > > > > 
> > > > > Thanks again,
> > > > > David
> > > > > 
> > > > 
> > > > Hey there again!
> > > > 
> > > > I managed to get a PCAP capture for this. Note that
> > > > NetworkManager
> > > > was
> > > > running and actively probing the ttyACM* devices for a modem,
> > > > hence
> > > > why
> > > > you can see "AT\n" commands being sent to the four devices.
> > > 
> > > Do you mean ModemManager? NM moved the probing to ModemManager
> > > about
> > > 6
> > > or 7 years ago. In any case, ModemManager has both a "don't
> > > probe"
> > > list, a greylist, and build-time options to only probe things it
> > > knows
> > > are modems.
> > > 
> > > Of course the build-time policy depends on your distro; upstream
> > > ModemManager now defaults to "strict" mode (only probe known
> > > modems/drivers/USB IDs).
> > > 
> > > Dan
> > > 
> > > > As you can also probably see is that the device currently
> > > > ignores
> > > > any
> > > > control requests (like Set Line Coding).
> > > > 
> > > > Hope it can help your debugging.
> > > > 
> > > > David
> > > > 
> > > > 
> > 
> > Yeah ModemManager. I disabled it to better debug.
> > 
> > So I came to the bottom of the issue (device side). This device
> > does
> > not support more than 3 IN and 3 OUT endpoints. Whenever you
> > specify
> > more they start behaving very weirdly and Linux/Wireshark shows the
> > EPROTO errors. In almost all cases that correlates with a kernel
> > crash.
> 
> Ah, that makes more sense, so the device itself is just not
> responding
> to USB commands properly.
> 
> > I've seen some EPROTO errors at the begining of a trace on working
> > devices, I'm assuming it'd what happens during the period the
> > device is
> > plugged and probed and before it gets full ready (all EPs are setup
> > and
> > able to respond).
> > 
> > Now coming around the restriction I've been able to use 3 UARTs
> > without
> > a problem, also on another device from the same family. That
> > explains
> > why it did not work, does not explain why the kernel gets a CPU
> > stuck
> > in some busy loop/interrupt loop though.
> 
> I agree, it's not good that we can hang with a broken device like
> this,
> but if the hardware is the thing that is locking up, it might be hard
> for us to do anything about this.  Timing out and causing errors like
> you see might be the only thing that we can do.
> 
> > From my side there's not much more I can do. I could try to get
> > this
> > setup in a simpler device and ship it someone willing to debug it.
> > I'm
> > totally convinced this is a bug in the USB stack given it crashes
> > several computers with relateively different hardware.
> 
> I would be interested to see what happens when you plug it into other
> operating systems.  Do they also lock up, or can they handle devices
> like this?
> 
> thanks,
> 
> greg k-h

Hey there,

So I actually tested some Live media on the same computer and it is
interesting to see how other distros did not fail at all with the same
kernel version as Fedora's :/ (I tested CloneZilla 5.7.0 kernel)

So I dug a bit more and realized that the kernel wont go bananas if
ModemManager is not running. I'm assuming this means the kernel only
has issues whenever some software is poking the device (which sort of
makes sense, but at the same time should mean that the enumeration and
config bits are 'ok').

Also running ModemManager after plugging the device results sometimes
in different behaviour. In some occasions I didnt get it to crash (or
at least not as reliably) but it would print:

[12515.209323] xhci_hcd 0000:00:14.0: WARN Cannot submit Set TR Deq Ptr
[12515.209324] xhci_hcd 0000:00:14.0: A Set TR Deq Ptr command is
pending.

I'm not sure whether this is helpful or not. I haven't tried Windows
since I do not have any computer (and using VMs doesnt really seem to
work well).

Thanks!
David



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: System crash/lockup after plugging CDC ACM device
  2020-07-22 14:41                           ` David Guillen Fandos
@ 2020-07-22 14:57                             ` Dan Williams
  0 siblings, 0 replies; 19+ messages in thread
From: Dan Williams @ 2020-07-22 14:57 UTC (permalink / raw)
  To: David Guillen Fandos, Greg KH; +Cc: linux-usb, dnlplm

On Wed, 2020-07-22 at 16:41 +0200, David Guillen Fandos wrote:
> On Tue, 2020-07-21 at 10:26 +0200, Greg KH wrote:
> > On Mon, Jul 20, 2020 at 10:39:40PM +0200, David Guillen Fandos
> > wrote:
> > > On Mon, 2020-07-20 at 11:55 -0500, Dan Williams wrote:
> > > > On Mon, 2020-07-20 at 01:36 +0200, David Guillen Fandos wrote:
> > > > > On Thu, 2020-07-16 at 16:30 +0200, David Guillen Fandos
> > > > > wrote:
> > > > > > On Wed, 2020-07-15 at 19:03 +0200, David Guillen Fandos
> > > > > > wrote:
> > > > > > > On Wed, 2020-07-15 at 14:24 +0200, Greg KH wrote:
> > > > > > > > On Wed, Jul 15, 2020 at 01:20:54PM +0200, David Guillen
> > > > > > > > Fandos
> > > > > > > > wrote:
> > > > > > > > > On Wed, 2020-07-15 at 13:12 +0200, Greg KH wrote:
> > > > > > > > > > On Wed, Jul 15, 2020 at 12:57:14PM +0200, David
> > > > > > > > > > Guillen
> > > > > > > > > > Fandos
> > > > > > > > > > wrote:
> > > > > > > > > > > On Wed, 2020-07-15 at 12:50 +0200, Greg KH wrote:
> > > > > > > > > > > > On Wed, Jul 15, 2020 at 12:31:42PM +0200, David
> > > > > > > > > > > > Guillen
> > > > > > > > > > > > Fandos
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > > On Wed, 2020-07-15 at 11:30 +0200, Greg KH
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > On Wed, Jul 15, 2020 at 10:58:03AM +0200,
> > > > > > > > > > > > > > David
> > > > > > > > > > > > > > Guillen
> > > > > > > > > > > > > > Fandos
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > Hello linux-usb,
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > I think I might have found a kernel bug
> > > > > > > > > > > > > > > related
> > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > USB
> > > > > > > > > > > > > > > subsystem
> > > > > > > > > > > > > > > (cdc_acm perhaps).
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Context: I was playing around with a
> > > > > > > > > > > > > > > device
> > > > > > > > > > > > > > > I'm
> > > > > > > > > > > > > > > creating,
> > > > > > > > > > > > > > > essentially a
> > > > > > > > > > > > > > > USB quad modem device that exposes four
> > > > > > > > > > > > > > > modems
> > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > host
> > > > > > > > > > > > > > > system.
> > > > > > > > > > > > > > > This
> > > > > > > > > > > > > > > device is still a prototype so there's a
> > > > > > > > > > > > > > > few
> > > > > > > > > > > > > > > bugs
> > > > > > > > > > > > > > > here
> > > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > there,
> > > > > > > > > > > > > > > most
> > > > > > > > > > > > > > > likely in the USB descriptors and control
> > > > > > > > > > > > > > > requests.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > What happens: After plugging the device
> > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > system
> > > > > > > > > > > > > > > starts
> > > > > > > > > > > > > > > spitting
> > > > > > > > > > > > > > > warnings and BUGs and it locks up. Most
> > > > > > > > > > > > > > > of
> > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > time
> > > > > > > > > > > > > > > some
> > > > > > > > > > > > > > > CPUs
> > > > > > > > > > > > > > > get
> > > > > > > > > > > > > > > into
> > > > > > > > > > > > > > > some spinloop and never comes back (you
> > > > > > > > > > > > > > > can
> > > > > > > > > > > > > > > see
> > > > > > > > > > > > > > > it
> > > > > > > > > > > > > > > being
> > > > > > > > > > > > > > > detected
> > > > > > > > > > > > > > > by
> > > > > > > > > > > > > > > the watchdog after a few seconds).
> > > > > > > > > > > > > > > Generally
> > > > > > > > > > > > > > > after
> > > > > > > > > > > > > > > that
> > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > USB
> > > > > > > > > > > > > > > devices
> > > > > > > > > > > > > > > stop working completely and at some point
> > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > machine
> > > > > > > > > > > > > > > freezes
> > > > > > > > > > > > > > > completely. In a couple of ocasions I
> > > > > > > > > > > > > > > managed
> > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > see
> > > > > > > > > > > > > > > a
> > > > > > > > > > > > > > > bug
> > > > > > > > > > > > > > > in
> > > > > > > > > > > > > > > dmesg
> > > > > > > > > > > > > > > saying "unable to handle page fault for
> > > > > > > > > > > > > > > address
> > > > > > > > > > > > > > > XXX"
> > > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > "Supervisor
> > > > > > > > > > > > > > > read access in kernel mode" "error code
> > > > > > > > > > > > > > > (0x0000)
> > > > > > > > > > > > > > > not
> > > > > > > > > > > > > > > present
> > > > > > > > > > > > > > > page".
> > > > > > > > > > > > > > > I
> > > > > > > > > > > > > > > could not get a trace for that one since
> > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > kernel
> > > > > > > > > > > > > > > died
> > > > > > > > > > > > > > > completely
> > > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > my log files were truncated/lost.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Since it is happening to my two machines
> > > > > > > > > > > > > > > (both
> > > > > > > > > > > > > > > Intel
> > > > > > > > > > > > > > > but
> > > > > > > > > > > > > > > rather
> > > > > > > > > > > > > > > different controllers, Sunrise Point-LP
> > > > > > > > > > > > > > > USB
> > > > > > > > > > > > > > > 3.0
> > > > > > > > > > > > > > > vs
> > > > > > > > > > > > > > > 8
> > > > > > > > > > > > > > > Series/C220)
> > > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > with different kernel versions I suspect
> > > > > > > > > > > > > > > this
> > > > > > > > > > > > > > > might
> > > > > > > > > > > > > > > be
> > > > > > > > > > > > > > > a
> > > > > > > > > > > > > > > bug in
> > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > kernel.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > I have 4 logs that I collected, they are
> > > > > > > > > > > > > > > sort
> > > > > > > > > > > > > > > of
> > > > > > > > > > > > > > > long-
> > > > > > > > > > > > > > > ish,
> > > > > > > > > > > > > > > not
> > > > > > > > > > > > > > > sure
> > > > > > > > > > > > > > > how
> > > > > > > > > > > > > > > to best send them to the list.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Send the crashes with the callback list,
> > > > > > > > > > > > > > that
> > > > > > > > > > > > > > should
> > > > > > > > > > > > > > be
> > > > > > > > > > > > > > quite
> > > > > > > > > > > > > > small,
> > > > > > > > > > > > > > right?  We don't need the full log.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > The first crash is the most important, the
> > > > > > > > > > > > > > others
> > > > > > > > > > > > > > can
> > > > > > > > > > > > > > be
> > > > > > > > > > > > > > from
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > first
> > > > > > > > > > > > > > one and are not reliable.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > thanks,
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > greg k-h
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Ok then, here comes one of the logs, I
> > > > > > > > > > > > > selected
> > > > > > > > > > > > > some
> > > > > > > > > > > > > bits
> > > > > > > > > > > > > only
> > > > > > > > > > > > > 
> > > > > > > > > > > > > [  147.302016] WARNING: CPU: 3 PID: 134 at
> > > > > > > > > > > > > kernel/workqueue.c:1473
> > > > > > > > > > > > > __queue_work+0x364/0x410
> > > > > > > > > > > > > [...]
> > > > > > > > > > > > > [  147.302322] Call Trace:
> > > > > > > > > > > > > [  147.302329]  <IRQ>
> > > > > > > > > > > > > [  147.302342]  queue_work_on+0x36/0x40
> > > > > > > > > > > > > [  147.302353]  __usb_hcd_giveback_urb+0x9c/0
> > > > > > > > > > > > > x1
> > > > > > > > > > > > > 10
> > > > > > > > > > > > > [  147.302362]  usb_giveback_urb_bh+0xa0/0xf0
> > > > > > > > > > > > > [  147.302372]  tasklet_action_common.constpr
> > > > > > > > > > > > > op
> > > > > > > > > > > > > .0+0
> > > > > > > > > > > > > x6
> > > > > > > > > > > > > 6/
> > > > > > > > > > > > > 0x
> > > > > > > > > > > > > 10
> > > > > > > > > > > > > 0
> > > > > > > > > > > > > [  147.302382]  __do_softirq+0xe9/0x2dc
> > > > > > > > > > > > > [  147.302391]  irq_exit+0xcf/0x110
> > > > > > > > > > > > > [  147.302397]  do_IRQ+0x55/0xe0
> > > > > > > > > > > > > [  147.302408]  common_interrupt+0xf/0xf
> > > > > > > > > > > > > [  147.302413]  </IRQ>
> > > > > > > > > > > > > [...]
> > > > > > > > > > > > > [  184.771172] watchdog: BUG: soft lockup -
> > > > > > > > > > > > > CPU#3
> > > > > > > > > > > > > stuck
> > > > > > > > > > > > > for
> > > > > > > > > > > > > 23s!
> > > > > > > > > > > > > [kworker/3:2:134]
> > > > > > > > > > > > 
> > > > > > > > > > > > That was the first message?
> > > > > > > > > > > > 
> > > > > > > > > > > > Ok, we need some more logs, how about the 30
> > > > > > > > > > > > lines
> > > > > > > > > > > > right
> > > > > > > > > > > > before
> > > > > > > > > > > > the
> > > > > > > > > > > > above?
> > > > > > > > > > > > 
> > > > > > > > > > > > And what kernel version are you using?
> > > > > > > > > > > > 
> > > > > > > > > > > > thanks,
> > > > > > > > > > > > 
> > > > > > > > > > > > greg k-h
> > > > > > > > > > > 
> > > > > > > > > > > Heh I assumed you would find the 3rd stack more
> > > > > > > > > > > interesting
> > > > > > > > > > > since
> > > > > > > > > > > it
> > > > > > > > > > > involves more subsystems but anyway, here we got,
> > > > > > > > > > > the
> > > > > > > > > > > first
> > > > > > > > > > > one
> > > > > > > > > > > with
> > > > > > > > > > > more context. The trigger as you can see is me
> > > > > > > > > > > connecting
> > > > > > > > > > > the
> > > > > > > > > > > USB
> > > > > > > > > > > device:
> > > > > > > > > > > 
> > > > > > > > > > > [  141.445367] usb 1-1: new full-speed USB device
> > > > > > > > > > > number
> > > > > > > > > > > 5
> > > > > > > > > > > using
> > > > > > > > > > > xhci_hcd
> > > > > > > > > > > [  141.573592] usb 1-1: New USB device found,
> > > > > > > > > > > idVendor=0483,
> > > > > > > > > > > idProduct=5740, bcdDevice= 2.00
> > > > > > > > > > > [  141.573597] usb 1-1: New USB device strings:
> > > > > > > > > > > Mfr=1,
> > > > > > > > > > > Product=2,
> > > > > > > > > > > SerialNumber=3
> > > > > > > > > > > [  141.573601] usb 1-1: Product: Quad-UART serial
> > > > > > > > > > > USB
> > > > > > > > > > > device
> > > > > > > > > > > [  141.573603] usb 1-1: Manufacturer: davidgf.net
> > > > > > > > > > > [  141.573605] usb 1-1: SerialNumber: serialno
> > > > > > > > > > > [  142.375007] cdc_acm 1-1:1.0: ttyACM0: USB ACM
> > > > > > > > > > > device
> > > > > > > > > > > [  142.376623] cdc_acm 1-1:1.2: ttyACM1: USB ACM
> > > > > > > > > > > device
> > > > > > > > > > > [  142.378350] cdc_acm 1-1:1.4: ttyACM2: USB ACM
> > > > > > > > > > > device
> > > > > > > > > > > [  142.379637] cdc_acm 1-1:1.6: ttyACM3: USB ACM
> > > > > > > > > > > device
> > > > > > > > > > > [  142.382473] usbcore: registered new interface
> > > > > > > > > > > driver
> > > > > > > > > > > cdc_acm
> > > > > > > > > > > [  142.382476] cdc_acm: USB Abstract Control
> > > > > > > > > > > Model
> > > > > > > > > > > driver
> > > > > > > > > > > for
> > > > > > > > > > > USB
> > > > > > > > > > > modems and ISDN adapters
> > > > > > > > > > > [  147.301997] ------------[ cut here ]--------
> > > > > > > > > > > ----
> > > > > > > > > > > [  147.302016] WARNING: CPU: 3 PID: 134 at
> > > > > > > > > > > kernel/workqueue.c:1473
> > > > > > > > > > > __queue_work+0x364/0x410
> > > > > > > > > > > [  147.302019] Modules linked in: cdc_acm rfcomm
> > > > > > > > > > > ccm
> > > > > > > > > > > wireguard
> > > > > > > > > > > curve25519_x86_64 libchacha20poly1305
> > > > > > > > > > > chacha_x86_64
> > > > > > > > > > > poly1305_x86_64
> > > > > > > > > > > libblake2s blake2s_x86_64 ip6_udp_tunnel
> > > > > > > > > > > udp_tunnel
> > > > > > > > > > > libcurve25519_generic libchacha
> > > > > > > > > > > libblake2s_generic
> > > > > > > > > > > nft_fib_inet
> > > > > > > > > > > nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet
> > > > > > > > > > > nf_reject_ipv4
> > > > > > > > > > > nf_reject_ipv6 nft_reject nft_ct nft_chain_nat
> > > > > > > > > > > ip6table_nat
> > > > > > > > > > > ip6table_mangle ip6table_raw ip6table_security
> > > > > > > > > > > iptable_nat
> > > > > > > > > > > nf_nat
> > > > > > > > > > > nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4
> > > > > > > > > > > libcrc32c
> > > > > > > > > > > iptable_mangle
> > > > > > > > > > > iptable_raw iptable_security ip_set nf_tables
> > > > > > > > > > > nfnetlink
> > > > > > > > > > > ip6table_filter
> > > > > > > > > > > ip6_tables iptable_filter cmac vboxnetadp(OE)
> > > > > > > > > > > vboxnetflt(OE)
> > > > > > > > > > > bnep
> > > > > > > > > > > vboxdrv(OE) sunrpc vfat fat uvcvideo
> > > > > > > > > > > videobuf2_vmalloc
> > > > > > > > > > > videobuf2_memops
> > > > > > > > > > > videobuf2_v4l2 videobuf2_common videodev btusb
> > > > > > > > > > > btrtl
> > > > > > > > > > > btbcm
> > > > > > > > > > > btintel
> > > > > > > > > > > mc
> > > > > > > > > > > bluetooth ecdh_generic ecc iTCO_wdt
> > > > > > > > > > > iTCO_vendor_support
> > > > > > > > > > > mei_hdcp
> > > > > > > > > > > intel_rapl_msr dell_laptop x86_pkg_temp_thermal
> > > > > > > > > > > intel_powerclamp
> > > > > > > > > > > coretemp kvm_intel kvm irqbypass intel_cstate
> > > > > > > > > > > intel_uncore
> > > > > > > > > > > intel_rapl_perf iwlmvm
> > > > > > > > > > > [  147.302121]  snd_hda_codec_hdmi mac80211
> > > > > > > > > > > snd_soc_skl
> > > > > > > > > > > snd_soc_sst_ipc
> > > > > > > > > > > snd_soc_sst_dsp dell_wmi snd_hda_ext_core
> > > > > > > > > > > dell_smbios
> > > > > > > > > > > snd_hda_codec_realtek dcdbas libarc4 wmi_bmof
> > > > > > > > > > > dell_wmi_descriptor
> > > > > > > > > > > snd_soc_acpi_intel_match snd_soc_acpi
> > > > > > > > > > > intel_wmi_thunderbolt
> > > > > > > > > > > snd_hda_codec_generic snd_soc_core ledtrig_audio
> > > > > > > > > > > iwlwifi
> > > > > > > > > > > pcspkr
> > > > > > > > > > > snd_compress ac97_bus snd_pcm_dmaengine
> > > > > > > > > > > snd_hda_intel
> > > > > > > > > > > snd_intel_dspcfg
> > > > > > > > > > > snd_hda_codec cfg80211 snd_hda_core snd_hwdep
> > > > > > > > > > > snd_seq
> > > > > > > > > > > snd_seq_device
> > > > > > > > > > > joydev snd_pcm rfkill snd_timer snd i2c_i801
> > > > > > > > > > > soundcore
> > > > > > > > > > > idma64
> > > > > > > > > > > int3403_thermal intel_hid int3400_thermal
> > > > > > > > > > > sparse_keymap
> > > > > > > > > > > acpi_thermal_rel mei_me
> > > > > > > > > > > intel_xhci_usb_role_switch
> > > > > > > > > > > acpi_pad
> > > > > > > > > > > roles
> > > > > > > > > > > mei
> > > > > > > > > > > intel_pch_thermal processor_thermal_device
> > > > > > > > > > > intel_rapl_common
> > > > > > > > > > > int340x_thermal_zone intel_soc_dts_iosf
> > > > > > > > > > > binfmt_misc
> > > > > > > > > > > ip_tables
> > > > > > > > > > > dm_crypt
> > > > > > > > > > > i915 rtsx_pci_sdmmc mmc_core crct10dif_pclmul
> > > > > > > > > > > crc32_pclmul
> > > > > > > > > > > i2c_algo_bit
> > > > > > > > > > > cec crc32c_intel drm_kms_helper nvme
> > > > > > > > > > > ghash_clmulni_intel
> > > > > > > > > > > drm
> > > > > > > > > > > nvme_core
> > > > > > > > > > > serio_raw rtsx_pci hid_multitouch wmi i2c_hid
> > > > > > > > > > > video
> > > > > > > > > > > pinctrl_sunrisepoint pinctrl_intel
> > > > > > > > > > > [  147.302218]  fuse
> > > > > > > > > > > [  147.302230] CPU: 3 PID: 134 Comm: kworker/3:2
> > > > > > > > > > > Tainted:
> > > > > > > > > > > G          IOE     5.7.7-200.fc32.x86_64 #1
> > > > > > > > > > > [  147.302233] Hardware name: Dell Inc. XPS 13
> > > > > > > > > > > 9350/0PWNCR,
> > > > > > > > > > > BIOS
> > > > > > > > > > > 1.12.2
> > > > > > > > > > > 12/15/2019
> > > > > > > > > > > [  147.302260] Workqueue:  0x0 (mm_percpu_wq)
> > > > > > > > > > > [  147.302275] RIP: 0010:__queue_work+0x364/0x410
> > > > > > > > > > > [  147.302282] Code: e0 f1 69 a9 00 01 1f 00 75
> > > > > > > > > > > 0f
> > > > > > > > > > > 65
> > > > > > > > > > > 48
> > > > > > > > > > > 8b
> > > > > > > > > > > 3c
> > > > > > > > > > > 25
> > > > > > > > > > > c0 8b
> > > > > > > > > > > 01 00 f6 47 24 20 75 25 0f 0b 48 83 c4 10 5b 5d
> > > > > > > > > > > 41
> > > > > > > > > > > 5c
> > > > > > > > > > > 41
> > > > > > > > > > > 5d
> > > > > > > > > > > 41
> > > > > > > > > > > 5e
> > > > > > > > > > > 41 5f
> > > > > > > > > > > c3 <0f> 0b e9 78 fe ff ff 41 83 cc 02 49 8d 57 60
> > > > > > > > > > > e9 5d
> > > > > > > > > > > fe
> > > > > > > > > > > ff
> > > > > > > > > > > ff e8
> > > > > > > > > > > 53
> > > > > > > > > > > [  147.302286] RSP: 0018:ffffbab980154e68 EFLAGS:
> > > > > > > > > > > 00010002
> > > > > > > > > > > [  147.302292] RAX: ffff8f551b333790 RBX:
> > > > > > > > > > > 0000000000000048
> > > > > > > > > > > RCX:
> > > > > > > > > > > 0000000000000000
> > > > > > > > > > > [  147.302295] RDX: ffff8f551b333798 RSI:
> > > > > > > > > > > ffff8f5575803718
> > > > > > > > > > > RDI:
> > > > > > > > > > > ffff8f5576daa840
> > > > > > > > > > > [  147.302299] RBP: ffff8f551b333790 R08:
> > > > > > > > > > > ffffffff97856cb0
> > > > > > > > > > > R09:
> > > > > > > > > > > 0000000000000000
> > > > > > > > > > > [  147.302302] R10: 0000000000000000 R11:
> > > > > > > > > > > ffffffff97856cb8
> > > > > > > > > > > R12:
> > > > > > > > > > > 0000000000000003
> > > > > > > > > > > [  147.302306] R13: 0000000000002000 R14:
> > > > > > > > > > > ffff8f5575c14e00
> > > > > > > > > > > R15:
> > > > > > > > > > > ffff8f5576db0700
> > > > > > > > > > > [  147.302311] FS:  0000000000000000(0000)
> > > > > > > > > > > GS:ffff8f5576d80000(0000)
> > > > > > > > > > > knlGS:0000000000000000
> > > > > > > > > > > [  147.302315] CS:  0010 DS: 0000 ES: 0000 CR0:
> > > > > > > > > > > 0000000080050033
> > > > > > > > > > > [  147.302319] CR2: 00000000000000b0 CR3:
> > > > > > > > > > > 0000000267774004
> > > > > > > > > > > CR4:
> > > > > > > > > > > 00000000003606e0
> > > > > > > > > > > [  147.302322] Call Trace:
> > > > > > > > > > > [  147.302329]  <IRQ>
> > > > > > > > > > > [  147.302342]  queue_work_on+0x36/0x40
> > > > > > > > > > > [  147.302353]  __usb_hcd_giveback_urb+0x9c/0x110
> > > > > > > > > > > [  147.302362]  usb_giveback_urb_bh+0xa0/0xf0
> > > > > > > > > > 
> > > > > > > > > > Are you sure your device is working properly and
> > > > > > > > > > talking
> > > > > > > > > > USB
> > > > > > > > > > correctly
> > > > > > > > > > to the host?  It looks like you are just timing out
> > > > > > > > > > for
> > > > > > > > > > some
> > > > > > > > > > reason.
> > > > > > > > > > 
> > > > > > > > > > But, that warning is showing that something is odd
> > > > > > > > > > in
> > > > > > > > > > the
> > > > > > > > > > usb
> > > > > > > > > > workqueue,
> > > > > > > > > > which is strange.
> > > > > > > > > > 
> > > > > > > > > > What type of host controller is this talking
> > > > > > > > > > to?  And
> > > > > > > > > > does
> > > > > > > > > > your
> > > > > > > > > > device
> > > > > > > > > > actually answer the urbs being sent to it
> > > > > > > > > > correctly?
> > > > > > > > > > 
> > > > > > > > > > Using usbmon on this might be the best way to watch
> > > > > > > > > > the
> > > > > > > > > > USB
> > > > > > > > > > traffic,
> > > > > > > > > > if
> > > > > > > > > > you don't have a hardware protocol sniffer, which
> > > > > > > > > > could
> > > > > > > > > > provide
> > > > > > > > > > some
> > > > > > > > > > clues as to what is going wrong.
> > > > > > > > > > 
> > > > > > > > > > thanks,
> > > > > > > > > > 
> > > > > > > > > > greg k-h
> > > > > > > > > 
> > > > > > > > > As I mentioned the device is likely buggy, since I'm
> > > > > > > > > developing
> > > > > > > > > and
> > > > > > > > > debugging it.
> > > > > > > > > However my ability to debug and fix any issue is
> > > > > > > > > limited by
> > > > > > > > > the
> > > > > > > > > fact
> > > > > > > > > that the kernel decides to stop working as usual,
> > > > > > > > > making my
> > > > > > > > > USB
> > > > > > > > > keyboard and mouse useless, if not crashing later due
> > > > > > > > > to
> > > > > > > > > soft
> > > > > > > > > lockups.
> > > > > > > > > 
> > > > > > > > > Shouldn't the kernel be resilient to such devices?
> > > > > > > > 
> > > > > > > > Yes it should, we should not crash.
> > > > > > > > 
> > > > > > > > 
> > > > > > > > > I've developed quite
> > > > > > > > > a few USB devices in the past and I've never ran into
> > > > > > > > > things
> > > > > > > > > like
> > > > > > > > > this
> > > > > > > > > on Linux (Windows is another story, rather 'easy' to
> > > > > > > > > crash,
> > > > > > > > > hang
> > > > > > > > > or
> > > > > > > > > bluescreen). In any case since I do not have access
> > > > > > > > > to
> > > > > > > > > a
> > > > > > > > > hardware
> > > > > > > > > debugger and the machine goes bananas (preventing me
> > > > > > > > > from
> > > > > > > > > using
> > > > > > > > > Wireshark) I do not think I can further debug this
> > > > > > > > > issue. I
> > > > > > > > > could
> > > > > > > > > try
> > > > > > > > > to find a kernel version where this does not crash
> > > > > > > > > the
> > > > > > > > > machine
> > > > > > > > > (only
> > > > > > > > > tested 5.6.X and 5.7.X so far). Or perhaps use
> > > > > > > > > VirtualBox,
> > > > > > > > > but
> > > > > > > > > I'd
> > > > > > > > > need
> > > > > > > > > to convice the host OS to ignore the USB device and
> > > > > > > > > just
> > > > > > > > > forward
> > > > > > > > > it
> > > > > > > > > to
> > > > > > > > > the guest.
> > > > > > > > 
> > > > > > > > Trying to trace down what part of the setup is failing,
> > > > > > > > by
> > > > > > > > using
> > > > > > > > usbmon,
> > > > > > > > will be good to try to figure out what the problem is
> > > > > > > > here,
> > > > > > > > if
> > > > > > > > you
> > > > > > > > can
> > > > > > > > do that.
> > > > > > > > 
> > > > > > > > > The firmware for this device can be easily tweaked to
> > > > > > > > > expose
> > > > > > > > > an
> > > > > > > > > arbitrary number (up to 7 I think) of CDC ACM
> > > > > > > > > interfaces.
> > > > > > > > > When
> > > > > > > > > I
> > > > > > > > > use
> > > > > > > > > one or two there's no issues, three has had some
> > > > > > > > > issues
> > > > > > > > > (but
> > > > > > > > > did
> > > > > > > > > not
> > > > > > > > > investigate further). Going to four is what
> > > > > > > > > consistently
> > > > > > > > > triggers
> > > > > > > > > kernel issues.
> > > > > > > > 
> > > > > > > > Hm, that might be a clue, what does the output of
> > > > > > > > 'lsusb
> > > > > > > > -v'
> > > > > > > > for
> > > > > > > > that
> > > > > > > > device when you have 3 and then 4 interfaces?
> > > > > > > > 
> > > > > > > > thanks,
> > > > > > > > 
> > > > > > > > greg k-h
> > > > > > > 
> > > > > > > Hello,
> > > > > > > 
> > > > > > > I will try to see how I can debug further, perhaps I can
> > > > > > > locate
> > > > > > > a
> > > > > > > machine or kernel that does not crash. Another option can
> > > > > > > be to
> > > > > > > disable
> > > > > > > the firmware down to the minimum so that it does not
> > > > > > > response
> > > > > > > to
> > > > > > > the
> > > > > > > bulk endpoints (just to the enumeration and some basic
> > > > > > > things),
> > > > > > > to
> > > > > > > rule
> > > > > > > out a bad behaviour.
> > > > > > > 
> > > > > > > The USB descriptor is what you could imagine, just
> > > > > > > replicate
> > > > > > > the
> > > > > > > two
> > > > > > > ACM interfaces (control & data) and add more endpoints.
> > > > > > > Here
> > > > > > > goes
> > > > > > > the
> > > > > > > one with three ports. Note this one seems to make the
> > > > > > > kernel
> > > > > > > crash
> > > > > > > just
> > > > > > > like the one with four. The only ones that work well are
> > > > > > > 1
> > > > > > > and
> > > > > > > 2
> > > > > > > ports.
> > > > > > > Since I'm not aware of any other commercial solutions
> > > > > > > (apart
> > > > > > > from
> > > > > > > FTDI)
> > > > > > > that use more than 2 ACM ports, could that be the issue?
> > > > > > > Meaning
> > > > > > > there's a bug somewhere and no commercial hardware that
> > > > > > > can
> > > > > > > trigger
> > > > > > > it.
> > > > > > > 
> > > > > > > For reference the diff between two and three ports (in
> > > > > > > lsusb)
> > > > > > > is
> > > > > > > that
> > > > > > > it's missing the last two interaces (with the 3 EPs
> > > > > > > described).
> > > > > > > Of
> > > > > > > course the bNumInterfaces is 4 instead of 6, and
> > > > > > > wTotalLength
> > > > > > > has
> > > > > > > a
> > > > > > > different value.
> > > > > > > 
> > > > > > > Hope this can help somehow.
> > > > > > > Thanks
> > > > > > > David
> > > > > > > 
> > > > > > > Bus 003 Device 012: ID 0483:5740 STMicroelectronics
> > > > > > > Virtual
> > > > > > > COM
> > > > > > > Port
> > > > > > 
> > > > > > Hey again,
> > > > > > 
> > > > > > I was not aware about the modems Daniele, thanks!
> > > > > > 
> > > > > > So I did some testing on my old BeagleBone black, which has
> > > > > > a
> > > > > > very
> > > > > > old
> > > > > > kernel (3.8.13-bone47). In this device the kernel is happy
> > > > > > and I
> > > > > > was
> > > > > > able to do some testing, it seems to work well. The UARTs
> > > > > > seem to
> > > > > > work
> > > > > > well in both directions, no weird shenanigans, no
> > > > > > error/warn
> > > > > > messages...
> > > > > > 
> > > > > > I'm a bit at loss on how I can debug this further, I will
> > > > > > try
> > > > > > to
> > > > > > use
> > > > > > a
> > > > > > RPi with a newer kernel and see what happens. I could try
> > > > > > to
> > > > > > boot
> > > > > > a
> > > > > > Live USB with older kernels (in my Intel machines) to try
> > > > > > to
> > > > > > locate
> > > > > > a
> > > > > > version where it works. Since I'm no kernel expert: any way
> > > > > > I
> > > > > > can
> > > > > > provide more info? The computer becomes unusable shortly
> > > > > > after
> > > > > > plugging
> > > > > > the device so I can't really do any meaningful stuff on it.
> > > > > > 
> > > > > > Thanks again,
> > > > > > David
> > > > > > 
> > > > > 
> > > > > Hey there again!
> > > > > 
> > > > > I managed to get a PCAP capture for this. Note that
> > > > > NetworkManager
> > > > > was
> > > > > running and actively probing the ttyACM* devices for a modem,
> > > > > hence
> > > > > why
> > > > > you can see "AT\n" commands being sent to the four devices.
> > > > 
> > > > Do you mean ModemManager? NM moved the probing to ModemManager
> > > > about
> > > > 6
> > > > or 7 years ago. In any case, ModemManager has both a "don't
> > > > probe"
> > > > list, a greylist, and build-time options to only probe things
> > > > it
> > > > knows
> > > > are modems.
> > > > 
> > > > Of course the build-time policy depends on your distro;
> > > > upstream
> > > > ModemManager now defaults to "strict" mode (only probe known
> > > > modems/drivers/USB IDs).
> > > > 
> > > > Dan
> > > > 
> > > > > As you can also probably see is that the device currently
> > > > > ignores
> > > > > any
> > > > > control requests (like Set Line Coding).
> > > > > 
> > > > > Hope it can help your debugging.
> > > > > 
> > > > > David
> > > > > 
> > > > > 
> > > 
> > > Yeah ModemManager. I disabled it to better debug.
> > > 
> > > So I came to the bottom of the issue (device side). This device
> > > does
> > > not support more than 3 IN and 3 OUT endpoints. Whenever you
> > > specify
> > > more they start behaving very weirdly and Linux/Wireshark shows
> > > the
> > > EPROTO errors. In almost all cases that correlates with a kernel
> > > crash.
> > 
> > Ah, that makes more sense, so the device itself is just not
> > responding
> > to USB commands properly.
> > 
> > > I've seen some EPROTO errors at the begining of a trace on
> > > working
> > > devices, I'm assuming it'd what happens during the period the
> > > device is
> > > plugged and probed and before it gets full ready (all EPs are
> > > setup
> > > and
> > > able to respond).
> > > 
> > > Now coming around the restriction I've been able to use 3 UARTs
> > > without
> > > a problem, also on another device from the same family. That
> > > explains
> > > why it did not work, does not explain why the kernel gets a CPU
> > > stuck
> > > in some busy loop/interrupt loop though.
> > 
> > I agree, it's not good that we can hang with a broken device like
> > this,
> > but if the hardware is the thing that is locking up, it might be
> > hard
> > for us to do anything about this.  Timing out and causing errors
> > like
> > you see might be the only thing that we can do.
> > 
> > > From my side there's not much more I can do. I could try to get
> > > this
> > > setup in a simpler device and ship it someone willing to debug
> > > it.
> > > I'm
> > > totally convinced this is a bug in the USB stack given it crashes
> > > several computers with relateively different hardware.
> > 
> > I would be interested to see what happens when you plug it into
> > other
> > operating systems.  Do they also lock up, or can they handle
> > devices
> > like this?
> > 
> > thanks,
> > 
> > greg k-h
> 
> Hey there,
> 
> So I actually tested some Live media on the same computer and it is
> interesting to see how other distros did not fail at all with the
> same
> kernel version as Fedora's :/ (I tested CloneZilla 5.7.0 kernel)
> 
> So I dug a bit more and realized that the kernel wont go bananas if
> ModemManager is not running. I'm assuming this means the kernel only
> has issues whenever some software is poking the device (which sort of
> makes sense, but at the same time should mean that the enumeration
> and
> config bits are 'ok').
> 
> Also running ModemManager after plugging the device results sometimes
> in different behaviour. In some occasions I didnt get it to crash (or
> at least not as reliably) but it would print:

I don't think Fedora has yet switched to ModemManager's "strict" device
probing policy (which only probes known modems based on USB IDs and
driver types) so that might be the discrepancy.

But I still wouldn't expect the kernel to crash or BUG, even when the
device is opened and sent some traffic. Worst case a timeout.

Dan

> [12515.209323] xhci_hcd 0000:00:14.0: WARN Cannot submit Set TR Deq
> Ptr
> [12515.209324] xhci_hcd 0000:00:14.0: A Set TR Deq Ptr command is
> pending.
> 
> I'm not sure whether this is helpful or not. I haven't tried Windows
> since I do not have any computer (and using VMs doesnt really seem to
> work well).
> 
> Thanks!
> David
> 
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2020-07-22 14:57 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-15  8:58 System crash/lockup after plugging CDC ACM device David Guillen Fandos
2020-07-15  9:30 ` Greg KH
2020-07-15 10:31   ` David Guillen Fandos
2020-07-15 10:50     ` Greg KH
2020-07-15 10:57       ` David Guillen Fandos
2020-07-15 11:12         ` Greg KH
2020-07-15 11:20           ` David Guillen Fandos
2020-07-15 12:24             ` Greg KH
2020-07-15 17:03               ` David Guillen Fandos
2020-07-15 20:39                 ` Daniele Palmas
2020-07-16 14:30                 ` David Guillen Fandos
2020-07-19 23:36                   ` David Guillen Fandos
2020-07-20  1:21                     ` Alan Stern
2020-07-20 16:55                     ` Dan Williams
2020-07-20 20:39                       ` David Guillen Fandos
2020-07-21  8:26                         ` Greg KH
2020-07-22 14:41                           ` David Guillen Fandos
2020-07-22 14:57                             ` Dan Williams
2020-07-21  8:01             ` Oliver Neukum

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.