* list_del corruption (NULL pointer dereference) on xhci-pci unbind
@ 2022-08-31  0:31 Marek Marczykowski-Górecki
  2022-10-14  1:21 ` Marek Marczykowski-Górecki
  0 siblings, 1 reply; 8+ messages in thread
From: Marek Marczykowski-Górecki @ 2022-08-31  0:31 UTC (permalink / raw)
  To: linux-usb

Hello,

I hit a kernel crash when unbinding xhci-pci from the PCI device (via
sysfs write). I can trigger the issue at least on 5.19.2 and 6.0-rc3.
Interestingly, the same kernel does not crash on another machine while
doing the same, so it might depend on which devices are connected.

The specific message I get is this:

  ehci-pci 0000:00:06.0: remove, state 1
  usb usb4: USB disconnect, device number 1
  usb 4-1: USB disconnect, device number 2
  usb 4-1.5: USB disconnect, device number 3
  ehci-pci 0000:00:06.0: USB bus 4 deregistered
  ehci-pci 0000:00:07.0: remove, state 1
  usb usb5: USB disconnect, device number 1
  usb 5-1: USB disconnect, device number 2
  usb 5-1.2: USB disconnect, device number 3
  usb 5-1.4: USB disconnect, device number 4
  usb 5-1.5: USB disconnect, device number 5
  usb 5-1.6: USB disconnect, device number 6
  ehci-pci 0000:00:07.0: USB bus 5 deregistered
  xhci_hcd 0000:00:08.0: remove, state 4
  usb usb3: USB disconnect, device number 1
  xhci_hcd 0000:00:08.0: USB bus 3 deregistered
  xhci_hcd 0000:00:08.0: remove, state 1
  usb usb2: USB disconnect, device number 1
  usb 2-4: USB disconnect, device number 2
  cdc_mbim 2-4:1.6 wws8u4i6: unregister 'cdc_mbim' usb-0000:00:08.0-4, CDC MBIM
  xhci_hcd 0000:00:08.0: Slot 1 endpoint 8 not removed from BW list!
  xhci_hcd 0000:00:08.0: Slot 1 endpoint 12 not removed from BW list!
  xhci_hcd 0000:00:08.0: Slot 1 endpoint 14 not removed from BW list!
  xhci_hcd 0000:00:08.0: Slot 1 endpoint 16 not removed from BW list!
  xhci_hcd 0000:00:08.0: Slot 1 endpoint 18 not removed from BW list!
  xhci_hcd 0000:00:08.0: Slot 1 endpoint 20 not removed from BW list!
  list_del corruption, ffff935804028758->next is NULL
  ------------[ cut here ]------------
  kernel BUG at lib/list_debug.c:49!
  invalid opcode: 0000 [#1] PREEMPT SMP PTI
  CPU: 1 PID: 1211 Comm: prepare-suspend Not tainted 6.0.0-rc3-1.51.fc32.qubes.x86_64 #248
  Hardware name: Xen HVM domU, BIOS 4.14.5 08/24/2022
  RIP: 0010:__list_del_entry_valid.cold+0xf/0x6f
  Code: c7 c7 38 de 8c ae e8 22 d2 fd ff 0f 0b 48 c7 c7 10 de 8c ae e8 14 d2 fd ff 0f 0b 48 89 fe 48 c7 c7 20 df 8c ae e8 03 d2 fd ff <0f> 0b 48 89 d1 48 c7 c7 40 e0 8c ae 4c 89 c2 e8 ef d1 fd ff 0f 0b
  RSP: 0000:ffffb7ef817e7cd0 EFLAGS: 00010246
  RAX: 0000000000000033 RBX: ffff935803460900 RCX: 0000000000000000
  RDX: 0000000000000000 RSI: ffffffffae8b45a7 RDI: 00000000ffffffff
  RBP: 0000000000000006 R08: 0000000000000000 R09: 00000000ffffdfff
  R10: ffffb7ef817e7b78 R11: ffffffffaed46088 R12: ffff935803466260
  R13: ffff935803460810 R14: ffff935804028758 R15: ffff935803460928
  FS:  000076820cccd740(0000) GS:ffff935810700000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 000075bb7d654d70 CR3: 000000000066a003 CR4: 00000000001706e0
  Call Trace:
   <TASK>
   xhci_mem_cleanup+0x14c/0x520 [xhci_hcd]
   xhci_stop+0x12d/0x1b0 [xhci_hcd]
   usb_stop_hcd+0x3b/0x57
   usb_remove_hcd.cold+0xd0/0x159
   usb_hcd_pci_remove+0x76/0x110
   pci_device_remove+0x36/0xa0
   device_release_driver_internal+0x1aa/0x230
   unbind_store+0x11f/0x130
   kernfs_fop_write_iter+0x124/0x1b0
   vfs_write+0x2ff/0x400
   ksys_write+0x67/0xe0
   do_syscall_64+0x3b/0x90
   entry_SYSCALL_64_after_hwframe+0x63/0xcd
  RIP: 0033:0x76820cb3e807
  Code: 0b 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
  RSP: 002b:00007ffe4cddb668 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
  RAX: ffffffffffffffda RBX: 000000000000000d RCX: 000076820cb3e807
  RDX: 000000000000000d RSI: 00005b61eff10ec0 RDI: 0000000000000001
  RBP: 00005b61eff10ec0 R08: 0000000000000000 R09: 000076820cbb14e0
  R10: 000076820cbb13e0 R11: 0000000000000246 R12: 000000000000000d
  R13: 000076820cbfb780 R14: 000000000000000d R15: 000076820cbf69e0
   </TASK>
  Modules linked in: nft_ct bnep uvcvideo videobuf2_vmalloc videobuf2_memops ath3k btusb btrtl btbcm btintel btmtk bluetooth videobuf2_v4l2 videobuf2_common videodev ecdh_generic rfkill mc cdc_mbim cdc_ncm cdc_ether usbnet mii cdc_wdm cdc_acm ipt_REJECT nf_reject_ipv4 xt_state xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat intel_rapl_msr intel_rapl_common nf_tables joydev crct10dif_pclmul nfnetlink crc32_pclmul ghash_clmulni_intel xhci_pci pcspkr xhci_pci_renesas ehci_pci xhci_hcd drm_vram_helper ehci_hcd serio_raw drm_ttm_helper ttm ata_generic pata_acpi i2c_piix4 floppy xen_scsiback xen_netback xen_privcmd xen_gntdev xen_gntalloc xen_blkback xen_evtchn ipmi_devintf ipmi_msghandler fuse ip_tables overlay xen_blkfront
  ---[ end trace 0000000000000000 ]---
  RIP: 0010:__list_del_entry_valid.cold+0xf/0x6f
  Code: c7 c7 38 de 8c ae e8 22 d2 fd ff 0f 0b 48 c7 c7 10 de 8c ae e8 14 d2 fd ff 0f 0b 48 89 fe 48 c7 c7 20 df 8c ae e8 03 d2 fd ff <0f> 0b 48 89 d1 48 c7 c7 40 e0 8c ae 4c 89 c2 e8 ef d1 fd ff 0f 0b
  RSP: 0000:ffffb7ef817e7cd0 EFLAGS: 00010246
  RAX: 0000000000000033 RBX: ffff935803460900 RCX: 0000000000000000
  RDX: 0000000000000000 RSI: ffffffffae8b45a7 RDI: 00000000ffffffff
  RBP: 0000000000000006 R08: 0000000000000000 R09: 00000000ffffdfff
  R10: ffffb7ef817e7b78 R11: ffffffffaed46088 R12: ffff935803466260
  R13: ffff935803460810 R14: ffff935804028758 R15: ffff935803460928
  FS:  000076820cccd740(0000) GS:ffff935810700000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 000075bb7d654d70 CR3: 000000000066a003 CR4: 00000000001706e0
  Kernel panic - not syncing: Fatal exception
  Kernel Offset: 0x2c000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)

USB devices present in the system:

/:  Bus 05.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/4p, 5000M
/:  Bus 04.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/4p, 480M
    |__ Port 4: Dev 2, If 0, Class=Communications, Driver=, 480M
    |__ Port 4: Dev 2, If 1, Class=Communications, Driver=cdc_acm, 480M
    |__ Port 4: Dev 2, If 2, Class=CDC Data, Driver=cdc_acm, 480M
    |__ Port 4: Dev 2, If 3, Class=Communications, Driver=cdc_acm, 480M
    |__ Port 4: Dev 2, If 4, Class=CDC Data, Driver=cdc_acm, 480M
    |__ Port 4: Dev 2, If 5, Class=Communications, Driver=cdc_wdm, 480M
    |__ Port 4: Dev 2, If 6, Class=Communications, Driver=cdc_mbim, 480M
    |__ Port 4: Dev 2, If 7, Class=CDC Data, Driver=cdc_mbim, 480M
    |__ Port 4: Dev 2, If 8, Class=Communications, Driver=cdc_wdm, 480M
    |__ Port 4: Dev 2, If 9, Class=Communications, Driver=cdc_acm, 480M
    |__ Port 4: Dev 2, If 10, Class=CDC Data, Driver=cdc_acm, 480M
/:  Bus 03.Port 1: Dev 1, Class=root_hub, Driver=ehci-pci/3p, 480M
    |__ Port 1: Dev 2, If 0, Class=Hub, Driver=hub/6p, 480M
        |__ Port 2: Dev 3, If 1, Class=Chip/SmartCard, Driver=, 12M
        |__ Port 2: Dev 3, If 0, Class=Human Interface Device, Driver=usbhid, 12M
        |__ Port 4: Dev 4, If 2, Class=Vendor Specific Class, Driver=btusb, 12M
        |__ Port 4: Dev 4, If 0, Class=Vendor Specific Class, Driver=btusb, 12M
        |__ Port 4: Dev 4, If 3, Class=Application Specific Interface, Driver=, 12M
        |__ Port 4: Dev 4, If 1, Class=Vendor Specific Class, Driver=btusb, 12M
        |__ Port 5: Dev 5, If 1, Class=Wireless, Driver=btusb, 12M
        |__ Port 5: Dev 5, If 0, Class=Wireless, Driver=btusb, 12M
        |__ Port 6: Dev 6, If 0, Class=Video, Driver=uvcvideo, 480M
        |__ Port 6: Dev 6, If 1, Class=Video, Driver=uvcvideo, 480M
/:  Bus 02.Port 1: Dev 1, Class=root_hub, Driver=ehci-pci/3p, 480M
    |__ Port 1: Dev 2, If 0, Class=Hub, Driver=hub/8p, 480M
        |__ Port 5: Dev 3, If 0, Class=Human Interface Device, Driver=usbhid, 480M
        |__ Port 5: Dev 3, If 1, Class=Human Interface Device, Driver=usbhid, 480M
/:  Bus 01.Port 1: Dev 1, Class=root_hub, Driver=ehci-pci/6p, 480M
    |__ Port 1: Dev 2, If 0, Class=Human Interface Device, Driver=usbhid, 480M

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


* Re: list_del corruption (NULL pointer dereference) on xhci-pci unbind
  2022-08-31  0:31 list_del corruption (NULL pointer dereference) on xhci-pci unbind Marek Marczykowski-Górecki
@ 2022-10-14  1:21 ` Marek Marczykowski-Górecki
  2022-10-14 16:02   ` Mathias Nyman
  0 siblings, 1 reply; 8+ messages in thread
From: Marek Marczykowski-Górecki @ 2022-10-14  1:21 UTC (permalink / raw)
  To: linux-usb, Mathias Nyman

On Wed, Aug 31, 2022 at 02:31:23AM +0200, Marek Marczykowski-Górecki wrote:
> Hello,
> 
> I hit a kernel crash when unbinding xhci-pci from the PCI device (via
> sysfs write). I can trigger the issue at least on 5.19.2 and 6.0-rc3.
> Interestingly, the same kernel does not crash on another machine while
> doing the same, so it might depend on which devices are connected.

I did some more digging, and the issue is definitely much older; I can
see it in 5.10.112 too. It simply happened to be exposed by
init_on_free=1, a setting I changed around the same time (and then
forgot about).

> The specific message I get is this:
> 
>   ehci-pci 0000:00:06.0: remove, state 1
>   usb usb4: USB disconnect, device number 1
>   usb 4-1: USB disconnect, device number 2
>   usb 4-1.5: USB disconnect, device number 3
>   ehci-pci 0000:00:06.0: USB bus 4 deregistered
>   ehci-pci 0000:00:07.0: remove, state 1
>   usb usb5: USB disconnect, device number 1
>   usb 5-1: USB disconnect, device number 2
>   usb 5-1.2: USB disconnect, device number 3
>   usb 5-1.4: USB disconnect, device number 4
>   usb 5-1.5: USB disconnect, device number 5
>   usb 5-1.6: USB disconnect, device number 6
>   ehci-pci 0000:00:07.0: USB bus 5 deregistered
>   xhci_hcd 0000:00:08.0: remove, state 4
>   usb usb3: USB disconnect, device number 1
>   xhci_hcd 0000:00:08.0: USB bus 3 deregistered
>   xhci_hcd 0000:00:08.0: remove, state 1
>   usb usb2: USB disconnect, device number 1
>   usb 2-4: USB disconnect, device number 2
>   cdc_mbim 2-4:1.6 wws8u4i6: unregister 'cdc_mbim' usb-0000:00:08.0-4, CDC MBIM
>   xhci_hcd 0000:00:08.0: Slot 1 endpoint 8 not removed from BW list!
>   xhci_hcd 0000:00:08.0: Slot 1 endpoint 12 not removed from BW list!
>   xhci_hcd 0000:00:08.0: Slot 1 endpoint 14 not removed from BW list!
>   xhci_hcd 0000:00:08.0: Slot 1 endpoint 16 not removed from BW list!
>   xhci_hcd 0000:00:08.0: Slot 1 endpoint 18 not removed from BW list!
>   xhci_hcd 0000:00:08.0: Slot 1 endpoint 20 not removed from BW list!

These warnings seem closely related to the crash. The relevant code is
in drivers/usb/host/xhci-mem.c:

  void xhci_free_virt_device(struct xhci_hcd *xhci, int slot_id)
  {
  (...)
      dev = xhci->devs[slot_id];
  (...)
          if (!list_empty(&dev->eps[i].bw_endpoint_list))
              xhci_warn(xhci, "Slot %u endpoint %u "
                      "not removed from BW list!\n",
                      slot_id, i);
  (...)
      kfree(xhci->devs[slot_id]);
      xhci->devs[slot_id] = NULL;
  }

So it calls kfree() on a structure whose list entries are still linked
into a list somewhere else.
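
To illustrate the failure mode, here is a minimal userspace sketch, not
kernel code. The list helpers are simplified stand-ins for the kernel's
struct list_head API, struct fake_ep is a hypothetical endpoint, and the
memset() only simulates what kfree() does to the memory with
init_on_free=1 (it is assumed here that the zeroing is what turns the
stale entry into a NULL ->next).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Simplified userspace stand-in for the kernel's struct list_head. */
struct list_head {
    struct list_head *next, *prev;
};

static void list_init(struct list_head *h)
{
    h->next = h;
    h->prev = h;
}

static void list_add_tail(struct list_head *entry, struct list_head *head)
{
    entry->prev = head->prev;
    entry->next = head;
    head->prev->next = entry;
    head->prev = entry;
}

/* Simplified form of the sanity check that reports "list_del corruption". */
static bool list_del_entry_valid(const struct list_head *entry)
{
    if (entry->next == NULL || entry->prev == NULL)
        return false;
    return entry->prev->next == entry && entry->next->prev == entry;
}

/* Hypothetical endpoint with an embedded node, like bw_endpoint_list. */
struct fake_ep {
    struct list_head bw_endpoint_list;
};

/*
 * Links one endpoint into a table, "frees" it without unlinking, and
 * reports whether the entry the table still points at passes the check.
 */
static bool stale_entry_is_valid(void)
{
    struct list_head table;
    struct fake_ep ep;

    list_init(&table);
    list_add_tail(&ep.bw_endpoint_list, &table);
    assert(table.next == &ep.bw_endpoint_list);

    /* Stand-in for kfree() with init_on_free=1: zeroed, never unlinked. */
    memset(&ep, 0, sizeof(ep));

    /* The table still references the zeroed node, whose ->next is NULL. */
    return list_del_entry_valid(table.next);
}
```

Calling stale_entry_is_valid() returns false: the table head still
points at the zeroed node, which is the situation a later list_del on
that entry would trip over.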

I can trigger the issue by unbinding xhci_hcd from just this device.
This is a USB controller to which only the internal WWAN adapter is
connected. I can still trigger the crash if I prevent the relevant
driver(s) from ever loading, so the issue is clearly somewhere in the
xhci core. Adding the xHCI maintainer to the recipients.

BTW, the call trace leading to the above warning is (collected on a
different kernel version than the other trace):

  usb_disconnect+0x212/0x290
  usb_disconnect+0xc8/0x290
  usb_remove_hcd+0xdf/0x1d3
  usb_hcd_pci_remove+0x74/0x100
  pci_device_remove+0x3b/0xa0
  __device_release_driver+0x181/0x250
  device_driver_detach+0x3c/0xa0
  unbind_store+0xd8/0x100
  kernfs_fop_write_iter+0x11a/0x1f0
  new_sync_write+0x150/0x1e0
  vfs_write+0x1d0/0x260
  ksys_write+0x6b/0xe0
  do_syscall_64+0x33/0x40
  entry_SYSCALL_64_after_hwframe+0x44/0xa9


>   list_del corruption, ffff935804028758->next is NULL
>   ------------[ cut here ]------------
>   kernel BUG at lib/list_debug.c:49!
>   invalid opcode: 0000 [#1] PREEMPT SMP PTI
>   CPU: 1 PID: 1211 Comm: prepare-suspend Not tainted 6.0.0-rc3-1.51.fc32.qubes.x86_64 #248
>   Hardware name: Xen HVM domU, BIOS 4.14.5 08/24/2022
>   RIP: 0010:__list_del_entry_valid.cold+0xf/0x6f
>   Code: c7 c7 38 de 8c ae e8 22 d2 fd ff 0f 0b 48 c7 c7 10 de 8c ae e8 14 d2 fd ff 0f 0b 48 89 fe 48 c7 c7 20 df 8c ae e8 03 d2 fd ff <0f> 0b 48 89 d1 48 c7 c7 40 e0 8c ae 4c 89 c2 e8 ef d1 fd ff 0f 0b
>   RSP: 0000:ffffb7ef817e7cd0 EFLAGS: 00010246
>   RAX: 0000000000000033 RBX: ffff935803460900 RCX: 0000000000000000
>   RDX: 0000000000000000 RSI: ffffffffae8b45a7 RDI: 00000000ffffffff
>   RBP: 0000000000000006 R08: 0000000000000000 R09: 00000000ffffdfff
>   R10: ffffb7ef817e7b78 R11: ffffffffaed46088 R12: ffff935803466260
>   R13: ffff935803460810 R14: ffff935804028758 R15: ffff935803460928
>   FS:  000076820cccd740(0000) GS:ffff935810700000(0000) knlGS:0000000000000000
>   CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>   CR2: 000075bb7d654d70 CR3: 000000000066a003 CR4: 00000000001706e0
>   Call Trace:
>    <TASK>
>    xhci_mem_cleanup+0x14c/0x520 [xhci_hcd]

This indeed iterates over
xhci->rh_bw[i].bw_table.interval_bw[j].endpoints and tries to
list_del_init() every entry.
I guess this should happen before the xhci_free_virt_device() call
above, but for some reason it happens after.

>    xhci_stop+0x12d/0x1b0 [xhci_hcd]
>    usb_stop_hcd+0x3b/0x57
>    usb_remove_hcd.cold+0xd0/0x159
>    usb_hcd_pci_remove+0x76/0x110
>    pci_device_remove+0x36/0xa0
>    device_release_driver_internal+0x1aa/0x230
>    unbind_store+0x11f/0x130
>    kernfs_fop_write_iter+0x124/0x1b0
>    vfs_write+0x2ff/0x400
>    ksys_write+0x67/0xe0
>    do_syscall_64+0x3b/0x90
>    entry_SYSCALL_64_after_hwframe+0x63/0xcd
>   RIP: 0033:0x76820cb3e807
>   Code: 0b 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
>   RSP: 002b:00007ffe4cddb668 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
>   RAX: ffffffffffffffda RBX: 000000000000000d RCX: 000076820cb3e807
>   RDX: 000000000000000d RSI: 00005b61eff10ec0 RDI: 0000000000000001
>   RBP: 00005b61eff10ec0 R08: 0000000000000000 R09: 000076820cbb14e0
>   R10: 000076820cbb13e0 R11: 0000000000000246 R12: 000000000000000d
>   R13: 000076820cbfb780 R14: 000000000000000d R15: 000076820cbf69e0
>    </TASK>
>   Modules linked in: nft_ct bnep uvcvideo videobuf2_vmalloc videobuf2_memops ath3k btusb btrtl btbcm btintel btmtk bluetooth videobuf2_v4l2 videobuf2_common videodev ecdh_generic rfkill mc cdc_mbim cdc_ncm cdc_ether usbnet mii cdc_wdm cdc_acm ipt_REJECT nf_reject_ipv4 xt_state xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat intel_rapl_msr intel_rapl_common nf_tables joydev crct10dif_pclmul nfnetlink crc32_pclmul ghash_clmulni_intel xhci_pci pcspkr xhci_pci_renesas ehci_pci xhci_hcd drm_vram_helper ehci_hcd serio_raw drm_ttm_helper ttm ata_generic pata_acpi i2c_piix4 floppy xen_scsiback xen_netback xen_privcmd xen_gntdev xen_gntalloc xen_blkback xen_evtchn ipmi_devintf ipmi_msghandler fuse ip_tables overlay xen_blkfront
>   ---[ end trace 0000000000000000 ]---
>   RIP: 0010:__list_del_entry_valid.cold+0xf/0x6f
>   Code: c7 c7 38 de 8c ae e8 22 d2 fd ff 0f 0b 48 c7 c7 10 de 8c ae e8 14 d2 fd ff 0f 0b 48 89 fe 48 c7 c7 20 df 8c ae e8 03 d2 fd ff <0f> 0b 48 89 d1 48 c7 c7 40 e0 8c ae 4c 89 c2 e8 ef d1 fd ff 0f 0b
>   RSP: 0000:ffffb7ef817e7cd0 EFLAGS: 00010246
>   RAX: 0000000000000033 RBX: ffff935803460900 RCX: 0000000000000000
>   RDX: 0000000000000000 RSI: ffffffffae8b45a7 RDI: 00000000ffffffff
>   RBP: 0000000000000006 R08: 0000000000000000 R09: 00000000ffffdfff
>   R10: ffffb7ef817e7b78 R11: ffffffffaed46088 R12: ffff935803466260
>   R13: ffff935803460810 R14: ffff935804028758 R15: ffff935803460928
>   FS:  000076820cccd740(0000) GS:ffff935810700000(0000) knlGS:0000000000000000
>   CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>   CR2: 000075bb7d654d70 CR3: 000000000066a003 CR4: 00000000001706e0
>   Kernel panic - not syncing: Fatal exception
>   Kernel Offset: 0x2c000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> 
> USB devices present in the system:
> 
> /:  Bus 05.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/4p, 5000M
> /:  Bus 04.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/4p, 480M
>     |__ Port 4: Dev 2, If 0, Class=Communications, Driver=, 480M
>     |__ Port 4: Dev 2, If 1, Class=Communications, Driver=cdc_acm, 480M
>     |__ Port 4: Dev 2, If 2, Class=CDC Data, Driver=cdc_acm, 480M
>     |__ Port 4: Dev 2, If 3, Class=Communications, Driver=cdc_acm, 480M
>     |__ Port 4: Dev 2, If 4, Class=CDC Data, Driver=cdc_acm, 480M
>     |__ Port 4: Dev 2, If 5, Class=Communications, Driver=cdc_wdm, 480M
>     |__ Port 4: Dev 2, If 6, Class=Communications, Driver=cdc_mbim, 480M
>     |__ Port 4: Dev 2, If 7, Class=CDC Data, Driver=cdc_mbim, 480M
>     |__ Port 4: Dev 2, If 8, Class=Communications, Driver=cdc_wdm, 480M
>     |__ Port 4: Dev 2, If 9, Class=Communications, Driver=cdc_acm, 480M
>     |__ Port 4: Dev 2, If 10, Class=CDC Data, Driver=cdc_acm, 480M
> /:  Bus 03.Port 1: Dev 1, Class=root_hub, Driver=ehci-pci/3p, 480M
>     |__ Port 1: Dev 2, If 0, Class=Hub, Driver=hub/6p, 480M
>         |__ Port 2: Dev 3, If 1, Class=Chip/SmartCard, Driver=, 12M
>         |__ Port 2: Dev 3, If 0, Class=Human Interface Device, Driver=usbhid, 12M
>         |__ Port 4: Dev 4, If 2, Class=Vendor Specific Class, Driver=btusb, 12M
>         |__ Port 4: Dev 4, If 0, Class=Vendor Specific Class, Driver=btusb, 12M
>         |__ Port 4: Dev 4, If 3, Class=Application Specific Interface, Driver=, 12M
>         |__ Port 4: Dev 4, If 1, Class=Vendor Specific Class, Driver=btusb, 12M
>         |__ Port 5: Dev 5, If 1, Class=Wireless, Driver=btusb, 12M
>         |__ Port 5: Dev 5, If 0, Class=Wireless, Driver=btusb, 12M
>         |__ Port 6: Dev 6, If 0, Class=Video, Driver=uvcvideo, 480M
>         |__ Port 6: Dev 6, If 1, Class=Video, Driver=uvcvideo, 480M
> /:  Bus 02.Port 1: Dev 1, Class=root_hub, Driver=ehci-pci/3p, 480M
>     |__ Port 1: Dev 2, If 0, Class=Hub, Driver=hub/8p, 480M
>         |__ Port 5: Dev 3, If 0, Class=Human Interface Device, Driver=usbhid, 480M
>         |__ Port 5: Dev 3, If 1, Class=Human Interface Device, Driver=usbhid, 480M
> /:  Bus 01.Port 1: Dev 1, Class=root_hub, Driver=ehci-pci/6p, 480M
>     |__ Port 1: Dev 2, If 0, Class=Human Interface Device, Driver=usbhid, 480M

lsusb -v of relevant devices can be seen here: https://gist.github.com/marmarek/fe87a1e7339acb60a40d1ef5f598736d

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


* Re: list_del corruption (NULL pointer dereference) on xhci-pci unbind
  2022-10-14  1:21 ` Marek Marczykowski-Górecki
@ 2022-10-14 16:02   ` Mathias Nyman
  2022-10-14 20:29     ` Marek Marczykowski-Górecki
  0 siblings, 1 reply; 8+ messages in thread
From: Mathias Nyman @ 2022-10-14 16:02 UTC (permalink / raw)
  To: Marek Marczykowski-Górecki, linux-usb, Mathias Nyman

Hi

On 14.10.2022 4.21, Marek Marczykowski-Górecki wrote:
> On Wed, Aug 31, 2022 at 02:31:23AM +0200, Marek Marczykowski-Górecki wrote:
>> Hello,
>>
>> I hit a kernel crash when unbinding xhci-pci from the PCI device (via
>> sysfs write). I can trigger the issue at least on 5.19.2 and 6.0-rc3.
>> Interestingly, the same kernel does not crash on another machine while
>> doing the same, so it might depend on which devices are connected.
> 
> I did some more digging, and the issue is definitely much older, I can
> see it in 5.10.112 too. It simply happened to be found with
> init_on_free=1, which I changed about the same time (and forgot about
> it).
> 
>> The specific message I get is this:
>>
>>    ehci-pci 0000:00:06.0: remove, state 1
>>    usb usb4: USB disconnect, device number 1
>>    usb 4-1: USB disconnect, device number 2
>>    usb 4-1.5: USB disconnect, device number 3
>>    ehci-pci 0000:00:06.0: USB bus 4 deregistered
>>    ehci-pci 0000:00:07.0: remove, state 1
>>    usb usb5: USB disconnect, device number 1
>>    usb 5-1: USB disconnect, device number 2
>>    usb 5-1.2: USB disconnect, device number 3
>>    usb 5-1.4: USB disconnect, device number 4
>>    usb 5-1.5: USB disconnect, device number 5
>>    usb 5-1.6: USB disconnect, device number 6
>>    ehci-pci 0000:00:07.0: USB bus 5 deregistered
>>    xhci_hcd 0000:00:08.0: remove, state 4
>>    usb usb3: USB disconnect, device number 1
>>    xhci_hcd 0000:00:08.0: USB bus 3 deregistered
>>    xhci_hcd 0000:00:08.0: remove, state 1
>>    usb usb2: USB disconnect, device number 1
>>    usb 2-4: USB disconnect, device number 2
>>    cdc_mbim 2-4:1.6 wws8u4i6: unregister 'cdc_mbim' usb-0000:00:08.0-4, CDC MBIM
>>    xhci_hcd 0000:00:08.0: Slot 1 endpoint 8 not removed from BW list!
>>    xhci_hcd 0000:00:08.0: Slot 1 endpoint 12 not removed from BW list!
>>    xhci_hcd 0000:00:08.0: Slot 1 endpoint 14 not removed from BW list!
>>    xhci_hcd 0000:00:08.0: Slot 1 endpoint 16 not removed from BW list!
>>    xhci_hcd 0000:00:08.0: Slot 1 endpoint 18 not removed from BW list!
>>    xhci_hcd 0000:00:08.0: Slot 1 endpoint 20 not removed from BW list!
> 
> This seems to be highly related. The related code is
> (drivers/usb/host/xhci-mem.c):
> 
>   860 void xhci_free_virt_device(struct xhci_hcd *xhci, int slot_id)
>   861 {
> (...)
>   870     dev = xhci->devs[slot_id];
> (...)
>   892         if (!list_empty(&dev->eps[i].bw_endpoint_list))
>   893             xhci_warn(xhci, "Slot %u endpoint %u "
>   894                     "not removed from BW list!\n",
>   895                     slot_id, i);
> (...)
>   909     kfree(xhci->devs[slot_id]);
>   910     xhci->devs[slot_id] = NULL;
>   911 }
> 
> So, it does kfree() a list that is connected somewhere.
> 
> I can trigger the issue by unbinding xhci_hcd from just this device.
> This is a USB controller to which an internal WWAN adapter is connected,
> and nothing else. I can still trigger the crash, if I prevent relevant
> driver(s) from ever loading, so the issue is clearly somewhere in xhci
> core. Adding XHCI maintainer to the recipients.
> 
> BTW, the call trace to the above warning is (collected on different
> kernel version than the other one...):
> 
>    usb_disconnect+0x212/0x290
>    usb_disconnect+0xc8/0x290
>    usb_remove_hcd+0xdf/0x1d3
>    usb_hcd_pci_remove+0x74/0x100
>    pci_device_remove+0x3b/0xa0
>    __device_release_driver+0x181/0x250
>    device_driver_detach+0x3c/0xa0
>    unbind_store+0xd8/0x100
>    kernfs_fop_write_iter+0x11a/0x1f0
>    new_sync_write+0x150/0x1e0
>    vfs_write+0x1d0/0x260
>    ksys_write+0x6b/0xe0
>    do_syscall_64+0x33/0x40
>    entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> 
>>    list_del corruption, ffff935804028758->next is NULL
>>    ------------[ cut here ]------------
>>    kernel BUG at lib/list_debug.c:49!
>>    invalid opcode: 0000 [#1] PREEMPT SMP PTI
>>    CPU: 1 PID: 1211 Comm: prepare-suspend Not tainted 6.0.0-rc3-1.51.fc32.qubes.x86_64 #248
>>    Hardware name: Xen HVM domU, BIOS 4.14.5 08/24/2022
>>    RIP: 0010:__list_del_entry_valid.cold+0xf/0x6f
>>    Code: c7 c7 38 de 8c ae e8 22 d2 fd ff 0f 0b 48 c7 c7 10 de 8c ae e8 14 d2 fd ff 0f 0b 48 89 fe 48 c7 c7 20 df 8c ae e8 03 d2 fd ff <0f> 0b 48 89 d1 48 c7 c7 40 e0 8c ae 4c 89 c2 e8 ef d1 fd ff 0f 0b
>>    RSP: 0000:ffffb7ef817e7cd0 EFLAGS: 00010246
>>    RAX: 0000000000000033 RBX: ffff935803460900 RCX: 0000000000000000
>>    RDX: 0000000000000000 RSI: ffffffffae8b45a7 RDI: 00000000ffffffff
>>    RBP: 0000000000000006 R08: 0000000000000000 R09: 00000000ffffdfff
>>    R10: ffffb7ef817e7b78 R11: ffffffffaed46088 R12: ffff935803466260
>>    R13: ffff935803460810 R14: ffff935804028758 R15: ffff935803460928
>>    FS:  000076820cccd740(0000) GS:ffff935810700000(0000) knlGS:0000000000000000
>>    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>    CR2: 000075bb7d654d70 CR3: 000000000066a003 CR4: 00000000001706e0
>>    Call Trace:
>>     <TASK>
>>     xhci_mem_cleanup+0x14c/0x520 [xhci_hcd]
> 
> This indeed iterates over
> xhci->rh_bw[i].bw_table.interval_bw[j].endpoints and tries to
> list_del_init() every entry.
> I guess it should happen before the above xhci_free_virt_device(), but
> for some reason happens after.
> 

Thanks for looking into this.

This whole software bandwidth issue should only be visible on the Intel
Panther Point PCH xHC (Ivy Bridge).

If all goes well, endpoints should be deleted from the bw_table list, and
the xhci_virt_devices freed, before xhci_mem_cleanup() is called.

Normally endpoints are deleted from the bw_table list during usb_disconnect():

usb_disconnect()
   ...
   usb_hcd_alloc_bandwidth(dev, NULL, NULL, NULL);
     hcd->driver->drop_endpoint()  // flags endpoint to be dropped
     hcd->driver->check_bandwidth()
     ->xhci_check_bandwidth()
       xhci_configure_endpoint()
         xhci_reserve_bandwidth()  // only for Panther Point
           xhci_drop_ep_from_interval_table()

But to avoid queuing new commands to a host in the XHCI_STATE_DYING or
XHCI_STATE_REMOVING state, we return early and never call
xhci_reserve_bandwidth(), so the endpoints stay on the bw_table list.
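
The effect of that early return can be sketched in a few lines of
userspace C. This is a toy model under stated assumptions, not the
driver code: the enum, struct fake_host, fake_check_bandwidth() and the
-1 return value are all hypothetical names chosen for illustration.

```c
#include <assert.h>

/* Hypothetical host states standing in for the XHCI_STATE_* flags. */
enum fake_state { FAKE_RUNNING, FAKE_DYING, FAKE_REMOVING };

struct fake_host {
    enum fake_state state;
    int eps_on_bw_list;   /* endpoints still tracked in the bw table */
};

/*
 * Toy version of the check_bandwidth step: bail out before touching
 * the bandwidth bookkeeping if the host is dying or being removed.
 */
static int fake_check_bandwidth(struct fake_host *host)
{
    if (host->state == FAKE_DYING || host->state == FAKE_REMOVING)
        return -1;            /* early return: bookkeeping is skipped */

    host->eps_on_bw_list = 0; /* normal path: endpoints are dropped */
    return 0;
}

/* Returns how many endpoints are still on the list after a disconnect. */
static int eps_left_after_disconnect(enum fake_state state, int eps)
{
    struct fake_host host = { .state = state, .eps_on_bw_list = eps };

    assert(eps >= 0);
    (void)fake_check_bandwidth(&host);
    return host.eps_on_bw_list;
}
```

With six endpoints, eps_left_after_disconnect(FAKE_REMOVING, 6) leaves
all six on the list, while the FAKE_RUNNING case leaves none, which
matches the six "not removed from BW list!" warnings in the log.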

Thanks
Mathias


* Re: list_del corruption (NULL pointer dereference) on xhci-pci unbind
  2022-10-14 16:02   ` Mathias Nyman
@ 2022-10-14 20:29     ` Marek Marczykowski-Górecki
  2022-10-17 16:12       ` Mathias Nyman
  0 siblings, 1 reply; 8+ messages in thread
From: Marek Marczykowski-Górecki @ 2022-10-14 20:29 UTC (permalink / raw)
  To: Mathias Nyman; +Cc: linux-usb

On Fri, Oct 14, 2022 at 07:02:13PM +0300, Mathias Nyman wrote:
> This whole software bandwidth issue should only be visible in Intel
> Panther Point PCH xHC (Ivy Bridge)

It is indeed an Ivy Bridge platform.

> Endpoints should be deleted from bw_table list, and xhci_virt_devices
> should be freed already before xhci_mem_cleanup() is called if all goes well.
> 
> Normally endpoints are deleted from bw_table list during usb_disconnect()
> 
> usb_disconnect()
>   ...
>   usb_hcd_alloc_bandwidth(dev, NULL, NULL, NULL);
>     hcd->driver->drop_endpoint()  // flags endpoint to be dropped
>     hcd->driver->check_bandwidth()
>     ->xhci_check_bandwidth()
>       xhci_configure_endpoint()
>         xhci_reserve_bandwidth()  // only for Panther Point
>           xhci_drop_ep_from_interval_table()
> 
> But to avoid queuing new commands to a host in XHCI_STATE_DYING or
> XHCI_STATE_REMOVING state we return early, not calling xhci_reserve_bandwidth().

Indeed, when I remove that early return in xhci_check_bandwidth(), the
crash is gone. What's the proper solution?

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


* Re: list_del corruption (NULL pointer dereference) on xhci-pci unbind
  2022-10-14 20:29     ` Marek Marczykowski-Górecki
@ 2022-10-17 16:12       ` Mathias Nyman
  2022-10-17 18:43         ` Marek Marczykowski-Górecki
  0 siblings, 1 reply; 8+ messages in thread
From: Mathias Nyman @ 2022-10-17 16:12 UTC (permalink / raw)
  To: Marek Marczykowski-Górecki; +Cc: linux-usb

On 14.10.2022 23.29, Marek Marczykowski-Górecki wrote:
> On Fri, Oct 14, 2022 at 07:02:13PM +0300, Mathias Nyman wrote:
>> This whole software bandwidth issue should only be visible in Intel
>> Panther Point PCH xHC (Ivy Bridge)
> 
> It is indeed an Ivy Bridge platform.
> 
>> Endpoints should be deleted from bw_table list, and xhci_virt_devices
>> should be freed already before xhci_mem_cleanup() is called if all goes well.
>>
>> Normally endpoints are deleted from bw_table list during usb_disconnect()
>>
>> usb_disconnect()
>>    ...
>>    usb_hcd_alloc_bandwidth(dev, NULL, NULL, NULL);
>>      hcd->driver->drop_endpoint()  // flags endpoint to be dropped
>>      hcd->driver->check_bandwidth()
>>      ->xhci_check_bandwidth()
>>        xhci_configure_endpoint()
>>          xhci_reserve_bandwidth()  // only for Panther Point
>>            xhci_drop_ep_from_interval_table()
>>
>> But to avoid queuing new commands to a host in XHCI_STATE_DYING or
>> XHCI_STATE_REMOVING state we return early, not calling xhci_reserve_bandwidth().
> 
> Indeed when I remove that early return in xhci_check_bandwidth(), the
> crash is gone. What's the proper solution?
> 

We could probably just delete the endpoint from the bw list when freeing the device and
endpoints. Currently we just print the "endpoint x not removed from BW list!" message.

Does the below help?

diff --git a/drivers/usb/host/xhci-mem.c b/drivers/usb/host/xhci-mem.c
index 9e56aa28efcd..2adc0c2b470c 100644
--- a/drivers/usb/host/xhci-mem.c
+++ b/drivers/usb/host/xhci-mem.c
@@ -894,10 +894,12 @@ void xhci_free_virt_device(struct xhci_hcd *xhci, int slot_id)
                  * We can't drop them anyway, because the udev might have gone
                  * away by this point, and we can't tell what speed it was.
                  */
-               if (!list_empty(&dev->eps[i].bw_endpoint_list))
+               if (!list_empty(&dev->eps[i].bw_endpoint_list)) {
+                       list_del_init(&dev->eps[i].bw_endpoint_list);
                         xhci_warn(xhci, "Slot %u endpoint %u "
                                         "not removed from BW list!\n",
                                         slot_id, i);
+               }
         }
         /* If this is a hub, free the TT(s) from the TT list */
         xhci_free_tt_info(xhci, dev, slot_id);
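
The idea behind the change above can be sketched in userspace C. This is
only an illustration under stated assumptions: the list helpers are
simplified stand-ins for the kernel's list API, struct fake_ep is a
hypothetical endpoint, and memset() simulates the free.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Simplified userspace stand-in for the kernel's struct list_head. */
struct list_head {
    struct list_head *next, *prev;
};

static void list_init(struct list_head *h)
{
    h->next = h;
    h->prev = h;
}

static bool list_empty(const struct list_head *h)
{
    return h->next == h;
}

static void list_add_tail(struct list_head *entry, struct list_head *head)
{
    entry->prev = head->prev;
    entry->next = head;
    head->prev->next = entry;
    head->prev = entry;
}

/* Unlink the entry and leave it pointing at itself (an empty list). */
static void list_del_init(struct list_head *entry)
{
    entry->prev->next = entry->next;
    entry->next->prev = entry->prev;
    list_init(entry);
}

/* Hypothetical endpoint with an embedded node, like bw_endpoint_list. */
struct fake_ep {
    struct list_head bw_endpoint_list;
};

/*
 * "Frees" one endpoint, optionally unlinking it first, and reports
 * whether the table is empty afterwards, i.e. whether a later cleanup
 * pass over the table would have nothing stale left to walk.
 */
static bool table_empty_after_free(bool unlink_first)
{
    struct list_head table;
    struct fake_ep ep;

    list_init(&table);
    list_add_tail(&ep.bw_endpoint_list, &table);
    assert(!list_empty(&table));

    if (unlink_first && !list_empty(&ep.bw_endpoint_list))
        list_del_init(&ep.bw_endpoint_list);

    memset(&ep, 0, sizeof(ep));   /* stand-in for the free */
    return list_empty(&table);
}
```

table_empty_after_free(true) returns true, so a later walk over the
table finds no entries; with false the table still references the
freed endpoint.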

Thanks
-Mathias


* Re: list_del corruption (NULL pointer dereference) on xhci-pci unbind
  2022-10-17 16:12       ` Mathias Nyman
@ 2022-10-17 18:43         ` Marek Marczykowski-Górecki
  2022-10-18 13:36           ` Mathias Nyman
  0 siblings, 1 reply; 8+ messages in thread
From: Marek Marczykowski-Górecki @ 2022-10-17 18:43 UTC (permalink / raw)
  To: Mathias Nyman; +Cc: linux-usb

On Mon, Oct 17, 2022 at 07:12:36PM +0300, Mathias Nyman wrote:
> On 14.10.2022 23.29, Marek Marczykowski-Górecki wrote:
> > On Fri, Oct 14, 2022 at 07:02:13PM +0300, Mathias Nyman wrote:
> > > This whole software bandwidth issue should only be visible in Intel
> > > Panther Point PCH xHC (Ivy Bridge)
> > 
> > It is indeed an Ivy Bridge platform.
> > 
> > > Endpoints should be deleted from bw_table list, and xhci_virt_devices
> > > should be freed already before xhci_mem_cleanup() is called if all goes well.
> > > 
> > > Normally endpoints are deleted from bw_table list during usb_disconnect()
> > > 
> > > usb_disconnect()
> > >    ...
> > >    usb_hcd_alloc_bandwidth(dev, NULL, NULL, NULL);
> > >      hcd->driver->drop_endpoint()  // flags endpoint to be dropped
> > >      hcd->driver->check_bandwidth()
> > >      ->xhci_check_bandwidth()
> > >        xhci_configure_endpoint()
> > >          xhci_reserve_bandwidth()  // only for Panther Point
> > >            xhci_drop_ep_from_interval_table()
> > > 
> > > But to avoid queuing new commands to a host in XHCI_STATE_DYING or
> > > XHCI_STATE_REMOVING state we return early, not calling xhci_reserve_bandwidth().
> > 
> > Indeed when I remove that early return in xhci_check_bandwidth(), the
> > crash is gone. What's the proper solution?
> > 
> 
> We could probably just delete the endpoint from the bw list when freeing the device and
> endpoints. Currently we only print the "endpoint x not removed from BW list!" message.
>
> Does the patch below help?

Yes, this helps!

xhci_drop_ep_from_interval_table() does a few more things, but I assume
none of that matters by the time xhci_free_virt_device() runs, right?

> diff --git a/drivers/usb/host/xhci-mem.c b/drivers/usb/host/xhci-mem.c
> index 9e56aa28efcd..2adc0c2b470c 100644
> --- a/drivers/usb/host/xhci-mem.c
> +++ b/drivers/usb/host/xhci-mem.c
> @@ -894,10 +894,12 @@ void xhci_free_virt_device(struct xhci_hcd *xhci, int slot_id)
>                  * We can't drop them anyway, because the udev might have gone
>                  * away by this point, and we can't tell what speed it was.
>                  */
> -               if (!list_empty(&dev->eps[i].bw_endpoint_list))
> +               if (!list_empty(&dev->eps[i].bw_endpoint_list)) {
> +                       list_del_init(&dev->eps[i].bw_endpoint_list);
>                         xhci_warn(xhci, "Slot %u endpoint %u "
>                                         "not removed from BW list!\n",
>                                         slot_id, i);
> +               }
>         }
>         /* If this is a hub, free the TT(s) from the TT list */
>         xhci_free_tt_info(xhci, dev, slot_id);
> 
> Thanks
> -Mathias

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

* Re: list_del corruption (NULL pointer dereference) on xhci-pci unbind
  2022-10-17 18:43         ` Marek Marczykowski-Górecki
@ 2022-10-18 13:36           ` Mathias Nyman
  2022-10-18 13:58             ` Marek Marczykowski-Górecki
  0 siblings, 1 reply; 8+ messages in thread
From: Mathias Nyman @ 2022-10-18 13:36 UTC (permalink / raw)
  To: Marek Marczykowski-Górecki; +Cc: linux-usb

On 17.10.2022 21.43, Marek Marczykowski-Górecki wrote:
> On Mon, Oct 17, 2022 at 07:12:36PM +0300, Mathias Nyman wrote:
>> On 14.10.2022 23.29, Marek Marczykowski-Górecki wrote:
>>> On Fri, Oct 14, 2022 at 07:02:13PM +0300, Mathias Nyman wrote:
>>>> This whole software bandwidth issue should only be visible in Intel
>>>> Panther Point PCH xHC (Ivy bridge)
>>>
>>> It is indeed Ivy Bridge platform.
>>>
>>>> Endpoints should be deleted from bw_table list, and xhci_virt_devices
>>>> should be freed already before xhci_mem_cleanup() is called if all goes well.
>>>>
>>>> Normally endpoints are deleted from bw_table list during usb_disconnect()
>>>>
>>>> usb_disconnect()
>>>>     ...
>>>>     usb_hcd_alloc_bandwidth(dev, NULL, NULL, NULL);
>>>>       hcd->driver->drop_endpoint()  // flags endpoint to be dropped
>>>>       hcd->driver->check_bandwidth()
>>>>       ->xhci_check_bandwidth()
>>>>         xhci_configure_endpoint()
>>>>           xhci_reserve_bandwidth()  // only for Panther Point
>>>>             xhci_drop_ep_from_interval_table()
>>>>
>>>> But to avoid queuing new commands to a host in XHCI_STATE_DYING or
>>>> XHCI_STATE_REMOVING state we return early, not calling xhci_reserve_bandwidth().
>>>
>>> Indeed when I remove that early return in xhci_check_bandwidth(), the
>>> crash is gone. What's the proper solution?
>>>
>>
>> We could probably just delete the endpoint from the bw list when freeing the device and
>> endpoints. Currently we only print the "endpoint x not removed from BW list!" message.
>>
>> Does the patch below help?
> 
> Yes, this helps!

Great, thanks, I'll turn it into a proper patch.
Can I add your Reported-by and Tested-by tags to it?

> 
> xhci_drop_ep_from_interval_table() does a few more things, but I assume
> none of that matters by the time xhci_free_virt_device() runs, right?

Right. If bw_endpoint_list isn't empty when freeing the virt device,
it means something prevented the endpoint from being dropped cleanly earlier,
most likely because the host died or is being removed.

At that point we just want a clean exit.

Thanks
-Mathias
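
[Editorial illustration, not part of the original message.] The reasoning
above can be sketched in userspace C. This is only a minimal sketch: the list
helpers below are simplified re-implementations written for this example, not
the kernel's <linux/list.h>, and struct fake_ep and fake_ep_cleanup() are
hypothetical stand-ins for the endpoint structure and the patched loop body in
xhci_free_virt_device(). It shows that list_del_init() unlinks a node that is
still on the bandwidth list and leaves it pointing at itself, so neither the
list nor the node keeps a reference to the other.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified circular doubly linked list, modelled on the kernel's. */
struct list_head {
	struct list_head *next, *prev;
};

static void INIT_LIST_HEAD(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

static bool list_empty(const struct list_head *h)
{
	return h->next == h;
}

/* Insert entry right after head. */
static void list_add(struct list_head *entry, struct list_head *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

/* Unlink entry from its list and reinitialise it to point at itself. */
static void list_del_init(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	INIT_LIST_HEAD(entry);
}

/* Hypothetical stand-in for an endpoint with a bandwidth-list node. */
struct fake_ep {
	struct list_head bw_endpoint_list;
};

/*
 * Mirrors the patched branch: if the endpoint is still linked, unlink it.
 * Returns true when an unlink was needed (where the driver would warn).
 */
static bool fake_ep_cleanup(struct fake_ep *ep)
{
	if (!list_empty(&ep->bw_endpoint_list)) {
		list_del_init(&ep->bw_endpoint_list);
		return true;
	}
	return false;
}
```

Without the unlink, the list would still point at the endpoint's node after
the endpoint memory is freed, and a later walk or deletion on that list would
touch freed memory.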

* Re: list_del corruption (NULL pointer dereference) on xhci-pci unbind
  2022-10-18 13:36           ` Mathias Nyman
@ 2022-10-18 13:58             ` Marek Marczykowski-Górecki
  0 siblings, 0 replies; 8+ messages in thread
From: Marek Marczykowski-Górecki @ 2022-10-18 13:58 UTC (permalink / raw)
  To: Mathias Nyman; +Cc: linux-usb

[-- Attachment #1: Type: text/plain, Size: 255 bytes --]

On Tue, Oct 18, 2022 at 04:36:54PM +0300, Mathias Nyman wrote:
> Great, thanks, I'll turn it into a proper patch.
> Can I add your Reported-by and Tested-by tags to it?

Yes.


-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

end of thread, other threads:[~2022-10-18 13:58 UTC | newest]

Thread overview: 8+ messages
2022-08-31  0:31 list_del corruption (NULL pointer dereference) on xhci-pci unbind Marek Marczykowski-Górecki
2022-10-14  1:21 ` Marek Marczykowski-Górecki
2022-10-14 16:02   ` Mathias Nyman
2022-10-14 20:29     ` Marek Marczykowski-Górecki
2022-10-17 16:12       ` Mathias Nyman
2022-10-17 18:43         ` Marek Marczykowski-Górecki
2022-10-18 13:36           ` Mathias Nyman
2022-10-18 13:58             ` Marek Marczykowski-Górecki