linux-usb.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Possible race in 4.14 xhci stack
@ 2021-07-20 15:04 Schmid, Carsten
  2021-07-20 16:11 ` Greg KH
  0 siblings, 1 reply; 6+ messages in thread
From: Schmid, Carsten @ 2021-07-20 15:04 UTC (permalink / raw)
  To: Mathias Nyman; +Cc: USB list

Hi Mathias,

i got a NULL pointer deref in the usbfs and analyzed it.
The connected device seemed to have trouble on USB transmissions.
However, I have the impression that there is a race between finishing URB handling and disconnection of devices.
In detail:
[ 4979.029666] xhci_hcd 0000:00:15.0: USB transfer error. Maybe the USB cable is bad?
[ 4979.029721] usb 1-3.2: usbfs: USBDEVFS_CONTROL failed cmd NmeDefaultPrio rqt 192 rq 51 len 2 ret -71
[ 4979.053224] usb 1-3.2: USB disconnect, device number 28
[ 4979.061672] BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
[ 4979.155537] IP: xhci_triad_to_transfer_ring+0x19/0x80 [xhci_hcd]
[ 4979.155539] PGD 0 P4D 0
[ 4979.155543] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 4979.155547] Modules linked in: nls_cp437 nls_utf8 cfq_iosched sd_mod tntfs(PO) texfat(PO) usb_storage bcmdhd(O) squashfs zlib_inflate xz_dec lzo lzo_compress lzo_decompress ebtable_filter ebtables xfrm_user xfrm_algo cls_u32 sch_htb intel_tfm_governor intel_ipu4_psys intel_ipu4_psys_csslib crlmodule cdc_acm snd_soc_apl_mgu_hu ecryptfs dwc3 intel_xhci_usb_role_switch roles udc_core snd_soc_skl ti960 regmap_i2c adv728x sdw_cnl snd_soc_acpi_intel_match snd_soc_acpi intel_ipu4_isys videobuf2_dma_contig snd_soc_core coretemp videobuf2_memops ipu4_acpi intel_ipu4_isys_csslib videobuf2_v4l2 videobuf2_core snd_compress snd_soc_skl_ipc sbi_apl ahci i2c_i801 libahci sdw_bus crc8 snd_soc_sst_ipc libata snd_soc_sst_dsp cfg80211 intel_ipu4_mmu xhci_pci xhci_hcd snd_hda_ext_core snd_hda_core scsi_mod snd_pcm usbcore
[ 4979.155599]  rfkill usb_common mei_me snd_timer dwc3_pci snd intel_ipu4 mei soundcore iova nfsd auth_rpcgss lockd grace sunrpc zram zsmalloc loop fuse nft_exthdr 8021q bridge stp llc inap560t(O) i915 video backlight intel_gtt i2c_algo_bit drm_kms_helper drm igb_avb(O) firmware_class ptp pps_core hwmon spi_pxa2xx_platform [last unloaded: bcmdhd]
[ 4980.581941] CPU: 1 PID: 31853 Comm: NmeDefaultPrio Tainted: P     U     O    4.14.198-apl #1
[ 4980.683080] task: ffff9f6bf08e3200 task.stack: ffffafb5c04a4000
[ 4980.753930] RIP: 0010:xhci_triad_to_transfer_ring+0x19/0x80 [xhci_hcd]
[ 4980.832160] RSP: 0018:ffffafb5c04a7a40 EFLAGS: 00010046
[ 4980.894757] RAX: 0000000000000000 RBX: ffff9f6b52744540 RCX: 0000000000000000
[ 4980.980169] RDX: 0000000000000000 RSI: 0000000000000009 RDI: ffff9f6c1e9a6228
[ 4981.065584] RBP: ffffafb5c04a7a90 R08: 0000000000000009 R09: 0000000000000000
[ 4981.150995] R10: 0000000000000000 R11: ffff9f6aebd03000 R12: ffff9f6c1e9a6228
[ 4981.236409] R13: 0000000000000000 R14: 0000000001080020 R15: 0000000000000009
[ 4981.321926] FS:  00007ff8010b3700(0000) GS:ffff9f6c37c80000(0000) knlGS:0000000000000000
[ 4981.418896] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4981.487644] CR2: 0000000000000030 CR3: 000000024a312000 CR4: 00000000003406a0
[ 4981.573067] Call Trace:
[ 4981.602251]  ? xhci_queue_ctrl_tx+0x48/0x270 [xhci_hcd]
[ 4981.664868]  ? __kmalloc+0x176/0x1f0
[ 4981.707679]  xhci_urb_enqueue+0x232/0x590 [xhci_hcd]
[ 4981.767160]  usb_hcd_submit_urb+0x98/0xbc0 [usbcore]
[ 4981.826651]  ? __pm_runtime_resume+0x47/0x60
[ 4981.877804]  ? usbdev_open+0x1eb/0x230 [usbcore]
[ 4981.933122]  usb_submit_urb+0x2a3/0x5b0 [usbcore]
[ 4981.989493]  usb_start_wait_urb+0x69/0x170 [usbcore]
[ 4982.048981]  usb_control_msg+0xd7/0x140 [usbcore]
[ 4982.105352]  usbdev_ioctl+0x1d61/0x2570 [usbcore]
[ 4982.161723]  ? do_filp_open+0xaa/0x110
[ 4982.206632]  do_vfs_ioctl+0xa3/0x610
[ 4982.249440]  ? syscall_slow_exit_work+0x13a/0x1a0
[ 4982.305793]  ? __fget+0x71/0xa0
[ 4982.343397]  SyS_ioctl+0x74/0x80
[ 4982.382041]  do_syscall_64+0x79/0x350
[ 4982.425896]  entry_SYSCALL_64_after_hwframe+0x41/0xa6
[ 4982.486414] RIP: 0033:0x7ff81f1fb4a7
[ 4982.529222] RSP: 002b:00007ff8010b24e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 4982.619946] RAX: ffffffffffffffda RBX: 00007ff810000e80 RCX: 00007ff81f1fb4a7
[ 4982.705358] RDX: 00007ff8010b24f0 RSI: 00000000c0185500 RDI: 0000000000000061
[ 4982.790769] RBP: 0000000000000061 R08: 0000000000000300 R09: 0000000000000000
[ 4982.876179] R10: 0000000000000080 R11: 0000000000000246 R12: 0000000000000007
[ 4982.961599] R13: 00007ff80c024648 R14: 00007ff80c024648 R15: 00007ff80c018d10
[ 4983.047033] Code: e5 09 d1 48 8d 04 b0 89 08 5d c3 c3 66 0f 1f 44 00 00 89 d0 41 89 f0 4c 8d 0c 80 4a 8d 04 48 48 c1 e0 04 4a 03 84 c7 90 01 00 00 <f6> 40 30 10 74 20 55 85 c9 48 89 e5 74 3e 48 8b 40 20 44 8b 40
[ 4983.273287] RIP: xhci_triad_to_transfer_ring+0x19/0x80 [xhci_hcd] RSP: ffffafb5c04a7a40
[ 4983.369231] CR2: 0000000000000030

The crash happens here:
000000000000d8c0 <xhci_triad_to_transfer_ring>:
xhci_triad_to_transfer_ring():
    d8c0:89 d0                mov    eax,edx
    d8c2:41 89 f0             mov    r8d,esi
    d8c5:4c 8d 0c 80          lea    r9,[rax+rax*4]
    d8c9:4a 8d 04 48          lea    rax,[rax+r9*2]
    d8cd:48 c1 e0 04          shl    rax,0x4
    d8d1:4a 03 84 c7 90 01 00 add    rax,QWORD PTR [rdi+r8*8+0x190]
    d8d8:00

RAX holds *ep, which is NULL:
    d8d9:f6 40 30 10          test   BYTE PTR [rax+0x30],0x10 // if (!(ep->ep_state & EP_HAS_STREAMS)) //********* CRASH

RDI: ffff9f6c1e9a6228
RAX: 0000000000000000 ep
R08: 0000000000000009 ep_index
R09: 0000000000000000 slot_id

In sources this is (xhci_ring.c, 451ff):
struct xhci_ring *xhci_triad_to_transfer_ring(struct xhci_hcd *xhci,
unsigned int slot_id, unsigned int ep_index,
unsigned int stream_id)
{
struct xhci_virt_ep *ep;

ep = &xhci->devs[slot_id]->eps[ep_index];
/* Common case: no streams */

if (!(ep->ep_state & EP_HAS_STREAMS)) //********* CRASH

return ep->ring;

When looking into the dmesg, we see that there is a disconnect happening short before the crash:
[ 4979.053224] usb 1-3.2: USB disconnect, device number 28
[ 4979.061672] BUG: unable to handle kernel NULL pointer dereference at 0000000000000030

The disconnect message is thrown here (hub.c, 2137ff):
void usb_disconnect(struct usb_device **pdev)
{
struct usb_port *port_dev = NULL;
struct usb_device *udev = *pdev;
struct usb_hub *hub = NULL;
int port1 = 1;

/* mark the device as inactive, so any further urb submissions for
 * this device (and any of its children) will fail immediately.
 * this quiesces everything except pending urbs.
 */
usb_set_device_state(udev, USB_STATE_NOTATTACHED);
dev_info(&udev->dev, "USB disconnect, device number %d\n",
udev->devnum);

/*
 * Ensure that the pm runtime code knows that the USB device
 * is in the process of being disconnected.
 */
pm_runtime_barrier(&udev->dev);

usb_lock_device(udev);
... and at a later point in this function:
usb_remove_ep_devs(&udev->ep0);
usb_unlock_device(udev);

My assumption is that the transfer_ring is not locked against a disconnect, and doesn't use the device lock.

A first idea would be to add a NULL pointer check for *ep in xhci_triad_to_transfer_ring like
ep = &xhci->devs[slot_id]->eps[ep_index];
if (!ep)
return NULL;

but that would cure the symptom, not the root cause.
Any idea for a different approach to solve that race?

Best regards
Carsten
-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Possible race in 4.14 xhci stack
  2021-07-20 15:04 Possible race in 4.14 xhci stack Schmid, Carsten
@ 2021-07-20 16:11 ` Greg KH
  2021-07-21  7:51   ` Schmid, Carsten
  0 siblings, 1 reply; 6+ messages in thread
From: Greg KH @ 2021-07-20 16:11 UTC (permalink / raw)
  To: Schmid, Carsten; +Cc: Mathias Nyman, USB list

On Tue, Jul 20, 2021 at 03:04:44PM +0000, Schmid, Carsten wrote:
> Hi Mathias,
> 
> i got a NULL pointer deref in the usbfs and analyzed it.
> The connected device seemed to have trouble on USB transmissions.
> However, I have the impression that there is a race between finishing URB handling and disconnection of devices.
> In detail:
> [ 4979.029666] xhci_hcd 0000:00:15.0: USB transfer error. Maybe the USB cable is bad?

Did your cable die?

And 4.14 is quite old, what about 5.13?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 6+ messages in thread

* RE: Possible race in 4.14 xhci stack
  2021-07-20 16:11 ` Greg KH
@ 2021-07-21  7:51   ` Schmid, Carsten
  2021-07-21  8:14     ` Greg KH
  0 siblings, 1 reply; 6+ messages in thread
From: Schmid, Carsten @ 2021-07-21  7:51 UTC (permalink / raw)
  To: Greg KH; +Cc: Mathias Nyman, USB list

Hi Greg,

>> Hi Mathias,
>>
>> i got a NULL pointer deref in the usbfs and analyzed it.
>> The connected device seemed to have trouble on USB transmissions.
>> However, I have the impression that there is a race between finishing URB handling and disconnection of devices.
>> In detail:
>> [ 4979.029666] xhci_hcd 0000:00:15.0: USB transfer error. Maybe the USB cable is bad?
>
> Did your cable die?
>
> And 4.14 is quite old, what about 5.13?
>
> thanks,
>
> greg k-h

That was reported from a device in the field.
Yes, a defective cable is bad, but shouldn't the kernel's health avoid a NULL pointer deref?

4.14 is used due to being LTS and the device firmware has been stabilized on that.
We frequently update on the 4.14 and monitor patches, but there's nothing regarding
an xhci race up to now in 4.14 seen, I checked that.

Not sure if this would happen on a newer kernel also, but we can't test that on the device in the field.
It was seen once so far, but the indication of a race is really high here.

If nobody of the USB maintainers has an idea, I'll fix that with my proposal.

Best regards
Carsten
-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Possible race in 4.14 xhci stack
  2021-07-21  7:51   ` Schmid, Carsten
@ 2021-07-21  8:14     ` Greg KH
  2021-07-23 19:43       ` Mathias Nyman
  0 siblings, 1 reply; 6+ messages in thread
From: Greg KH @ 2021-07-21  8:14 UTC (permalink / raw)
  To: Schmid, Carsten; +Cc: Mathias Nyman, USB list

On Wed, Jul 21, 2021 at 07:51:58AM +0000, Schmid, Carsten wrote:
> Hi Greg,
> 
> >> Hi Mathias,
> >>
> >> i got a NULL pointer deref in the usbfs and analyzed it.
> >> The connected device seemed to have trouble on USB transmissions.
> >> However, I have the impression that there is a race between finishing URB handling and disconnection of devices.
> >> In detail:
> >> [ 4979.029666] xhci_hcd 0000:00:15.0: USB transfer error. Maybe the USB cable is bad?
> >
> > Did your cable die?
> >
> > And 4.14 is quite old, what about 5.13?
> >
> > thanks,
> >
> > greg k-h
> 
> That was reported from a device in the field.
> Yes, a defective cable is bad, but shouldn't the kernel's health avoid a NULL pointer deref?

Yes, I am not disagreeing about that, patches always welcome :)

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Possible race in 4.14 xhci stack
  2021-07-21  8:14     ` Greg KH
@ 2021-07-23 19:43       ` Mathias Nyman
  2021-07-26  7:10         ` Schmid, Carsten
  0 siblings, 1 reply; 6+ messages in thread
From: Mathias Nyman @ 2021-07-23 19:43 UTC (permalink / raw)
  To: Greg KH, Schmid, Carsten; +Cc: USB list

On 21.7.2021 11.14, Greg KH wrote:
> On Wed, Jul 21, 2021 at 07:51:58AM +0000, Schmid, Carsten wrote:
>> Hi Greg,
>>
>>>> Hi Mathias,
>>>>
>>>> i got a NULL pointer deref in the usbfs and analyzed it.
>>>> The connected device seemed to have trouble on USB transmissions.
>>>> However, I have the impression that there is a race between finishing URB handling and disconnection of devices.
>>>> In detail:
>>>> [ 4979.029666] xhci_hcd 0000:00:15.0: USB transfer error. Maybe the USB cable is bad?
>>>
>>> Did your cable die?
>>>
>>> And 4.14 is quite old, what about 5.13?
>>>
>>> thanks,
>>>
>>> greg k-h
>>
>> That was reported from a device in the field.
>> Yes, a defective cable is bad, but shouldn't the kernel's health avoid a NULL pointer deref?
> 
> Yes, I am not disagreeing about that, patches always welcome :)

Current kernel already has a cure for the symptom, avoiding the NULL pointer deref:

struct xhci_ring *xhci_triad_to_transfer_ring(...)
{
	struct xhci_virt_ep *ep;

	ep = xhci_get_virt_ep(xhci, slot_id, ep_index);
	if (!ep)
		return NULL;
	...
}

I'm still on vacation next week, but after that we could look closer at the root cause.

Thanks,
Mathias

^ permalink raw reply	[flat|nested] 6+ messages in thread

* RE: Possible race in 4.14 xhci stack
  2021-07-23 19:43       ` Mathias Nyman
@ 2021-07-26  7:10         ` Schmid, Carsten
  0 siblings, 0 replies; 6+ messages in thread
From: Schmid, Carsten @ 2021-07-26  7:10 UTC (permalink / raw)
  To: Mathias Nyman, Greg KH; +Cc: USB list

Hi Mathias,

thanks, that's exactly the fix/workaround/cure I wanted to do.
Will fix it this way 😊

Have a nice vacation.

Best regards
Carsten

-----Original Message-----
From: Mathias Nyman <mathias.nyman@linux.intel.com>
Sent: Freitag, 23. Juli 2021 21:44
To: Greg KH <gregkh@linuxfoundation.org>; Schmid, Carsten <Carsten_Schmid@mentor.com>
Cc: USB list <linux-usb@vger.kernel.org>
Subject: Re: Possible race in 4.14 xhci stack

On 21.7.2021 11.14, Greg KH wrote:
> On Wed, Jul 21, 2021 at 07:51:58AM +0000, Schmid, Carsten wrote:
>> Hi Greg,
>>
>>>> Hi Mathias,
>>>>
>>>> i got a NULL pointer deref in the usbfs and analyzed it.
>>>> The connected device seemed to have trouble on USB transmissions.
>>>> However, I have the impression that there is a race between finishing URB handling and disconnection of devices.
>>>> In detail:
>>>> [ 4979.029666] xhci_hcd 0000:00:15.0: USB transfer error. Maybe the USB cable is bad?
>>>
>>> Did your cable die?
>>>
>>> And 4.14 is quite old, what about 5.13?
>>>
>>> thanks,
>>>
>>> greg k-h
>>
>> That was reported from a device in the field.
>> Yes, a defective cable is bad, but shouldn't the kernel's health avoid a NULL pointer deref?
>
> Yes, I am not disagreeing about that, patches always welcome :)

Current kernel already has a cure for the symptom, avoiding the NULL pointer deref:

struct xhci_ring *xhci_triad_to_transfer_ring(...) {
struct xhci_virt_ep *ep;

ep = xhci_get_virt_ep(xhci, slot_id, ep_index);
if (!ep)
return NULL;
...
}

I'm still on vacation next week, but after that we could look closer at the root cause.

Thanks,
Mathias
-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2021-07-26  7:17 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-20 15:04 Possible race in 4.14 xhci stack Schmid, Carsten
2021-07-20 16:11 ` Greg KH
2021-07-21  7:51   ` Schmid, Carsten
2021-07-21  8:14     ` Greg KH
2021-07-23 19:43       ` Mathias Nyman
2021-07-26  7:10         ` Schmid, Carsten

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).