linux-usb.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* xhci problem -> general protection fault
@ 2020-09-17 15:30 Andrzej Pietrasiewicz
  2020-09-18 10:50 ` Mathias Nyman
  0 siblings, 1 reply; 26+ messages in thread
From: Andrzej Pietrasiewicz @ 2020-09-17 15:30 UTC (permalink / raw)
  To: linux-usb; +Cc: Andrzej Pietrasiewicz, kernel

Dear All,

I have encountered a problem in xhci which leads to general protection fault.

The problem is triggered by running this program:

https://gitlab.collabora.com/andrzej.p/bulk-cancel.git

$ sudo ./bulk-cancel -D /dev/bus/usb/002/006 -i 1 -b 1

where /dev/bus/usb/002/006 is a Gadget Zero:

$ lsusb -t
/:  Bus 03.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/4p, 5000M
/:  Bus 02.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/9p, 480M
     |__ Port 1: Dev 2, If 0, Class=Hub, Driver=hub/4p, 480M
         |__ Port 1: Dev 4, If 0, Class=Vendor Specific Class, Driver=cp210x, 12M
         |__ Port 2: Dev 6, If 0, Class=Vendor Specific Class, Driver=usbtest, 480M
         |__ Port 4: Dev 7, If 0, Class=Hub, Driver=hub/4p, 480M
     |__ Port 7: Dev 3, If 1, Class=Wireless, Driver=btusb, 12M
     |__ Port 7: Dev 3, If 0, Class=Wireless, Driver=btusb, 12M
     |__ Port 8: Dev 5, If 0, Class=Video, Driver=uvcvideo, 480M
     |__ Port 8: Dev 5, If 1, Class=Video, Driver=uvcvideo, 480M
/:  Bus 01.Port 1: Dev 1, Class=root_hub, Driver=ehci-pci/3p, 480M
     |__ Port 1: Dev 2, If 0, Class=Hub, Driver=hub/8p, 480M

[ --- gadget info
the gadget runs on an Odroid U3, with legacy g_zero compiled-in

# ls -l /sys/module/g_zero/
total 0
drwxr-xr-x 2 root root    0 Mar 15 12:23 parameters
--w------- 1 root root 4096 Nov  3  2016 uevent

and running

# uname -r
5.7.0

end of gadget info, back to the USB host --- ]

and iommu is turned on:

$ cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-5.8.0 root=UUID=<UUID> ro intel_iommu=on quiet splash 
vt.handoff=1

$ uname -r
5.8.0

It takes less than a minute until the crash happens.
The DMAR (iommu) errors don't happen always, i.e. there are crashes
when they are not reported in the system log. In either case the

"WARN Cannot submit Set TR Deq Ptr"
"A Set TR Deq Ptr command is pending."
"WARN Set TR Deq Ptr cmd failed due to incorrect slot or ep state."

messages do appear.

This is what is in the system log:

xhci_hcd 0000:00:14.0: WARN Cannot submit Set TR Deq Ptr
xhci_hcd 0000:00:14.0: A Set TR Deq Ptr command is pending.
xhci_hcd 0000:00:14.0: WARN Set TR Deq Ptr cmd failed due to incorrect slot or 
ep state.

[... tens of occurences ...]

xhci_hcd 0000:00:14.0: WARN Cannot submit Set TR Deq Ptr
xhci_hcd 0000:00:14.0: A Set TR Deq Ptr command is pending.
xhci_hcd 0000:00:14.0: WARN Set TR Deq Ptr cmd failed due to incorrect slot or 
ep state.
DMAR: DRHD: handling fault status reg 3
DMAR: [DMA Write] Request device [00:14.0] PASID ffffffff fault addr ffe20000 
[fault reason 05] PTE Write access is not set
xhci_hcd 0000:00:14.0: WARN Cannot submit Set TR Deq Ptr
xhci_hcd 0000:00:14.0: A Set TR Deq Ptr command is pending.
xhci_hcd 0000:00:14.0: WARN Set TR Deq Ptr cmd failed due to incorrect slot or 
ep state.

[... tens of occurences ...]

xhci_hcd 0000:00:14.0: WARN Cannot submit Set TR Deq Ptr
xhci_hcd 0000:00:14.0: A Set TR Deq Ptr command is pending.
xhci_hcd 0000:00:14.0: WARN Set TR Deq Ptr cmd failed due to incorrect slot or 
ep state.
DMAR: DRHD: handling fault status reg 3
DMAR: [DMA Write] Request device [00:14.0] PASID ffffffff fault addr ffcf8000 
[fault reason 05] PTE Write access is not set
xhci_hcd 0000:00:14.0: WARN Cannot submit Set TR Deq Ptr
xhci_hcd 0000:00:14.0: A Set TR Deq Ptr command is pending.
xhci_hcd 0000:00:14.0: WARN Set TR Deq Ptr cmd failed due to incorrect slot or 
ep state.

[... tens of occurences ...]

xhci_hcd 0000:00:14.0: WARN Cannot submit Set TR Deq Ptr
xhci_hcd 0000:00:14.0: A Set TR Deq Ptr command is pending.
xhci_hcd 0000:00:14.0: WARN Set TR Deq Ptr cmd failed due to incorrect slot or 
ep state.
general protection fault, probably for non-canonical address 0x18d159165869aac4: 
0000 [#1] SMP PTI
CPU: 1 PID: 1644 Comm: bulk-cancel Not tainted 5.8.0 #553
Hardware name: LENOVO 20B7S4L800/20B7S4L800, BIOS GJET82WW (2.32 ) 01/09/2015
RIP: 0010:__kmalloc+0x9e/0x270
Code: 7a 01 00 00 4d 8b 01 65 49 8b 50 08 65 4c 03 05 30 ea 95 69 4d 8b 20 4d 85 
e4 0f 84 88 01 00 00 41 8b 41 20 49 8b 39 4c 01 e0 <48> 8b 18 49 33 99 70 01 00 
00 48 89 c1 48 0f c9 4c 89 e0 48 31 cb
RSP: 0018:ffffb584c0f4fd48 EFLAGS: 00010202
RAX: 18d159165869aac4 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 00000000000076f4 RSI: 0000000000000cc0 RDI: 0000000000031120
RBP: ffffb584c0f4fd78 R08: ffff992f16671120 R09: ffff992f15406f40
R10: ffffb584c0f4fd28 R11: ffff992eeb1cf338 R12: 18d159165869aa44
R13: 0000000000000cc0 R14: 0000000000000100 R15: ffff992f15406f40
FS:  00007f664cbf3140(0000) GS:ffff992f16640000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f0c14f9e1d8 CR3: 00000001f26e6004 CR4: 00000000001606e0
Call Trace:
  ? proc_do_submiturb+0xc3a/0xe10
  proc_do_submiturb+0xc3a/0xe10
  usbdev_ioctl+0x8fa/0x1480
  ? fput+0x13/0x20
  ? do_timerfd_settime+0x2d5/0x470
  ksys_ioctl+0x98/0xb0
  __x64_sys_ioctl+0x1a/0x20
  do_syscall_64+0x48/0xc0
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f664c4f06d7
Code: b3 66 90 48 8b 05 b1 47 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 
66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 
48 8b 0d 81 47 2d 00 f7 d8 64 89 01 48
RSP: 002b:00007fffab782528 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 000055f16fd78330 RCX: 00007f664c4f06d7
RDX: 000055f16fd6d2b0 RSI: 000000008038550a RDI: 0000000000000003
RBP: 000055f16fd79bf0 R08: 000055f16fd6d2b0 R09: 000055f16fd6f540
R10: 000055f16fd79bf0 R11: 0000000000000246 R12: 0000000000000001
R13: 000055f16fd6d2b0 R14: 0000000000000000 R15: 0000000000000000
Modules linked in: ccm overlay cmac bnep intel_rapl_msr intel_rapl_common 
x86_pkg_temp_thermal intel_powerclamp coretemp snd_hda_codec_hdmi kvm_intel 
snd_hda_codec_realtek kvm snd_hda_codec_generic irqbypass snd_hda_intel 
snd_intel_dspcfg i915 snd_hda_codec crct10dif_pclmul snd_hda_core crc32_pclmul 
snd_hwdep ghash_clmulni_intel snd_pcm uvcvideo aesni_intel cec crypto_simd 
cryptd glue_helper rapl intel_cstate rc_core videobuf2_vmalloc videobuf2_memops 
btusb iwlmvm snd_seq_midi mac80211 snd_seq_midi_event btrtl videobuf2_v4l2 
thinkpad_acpi drm_kms_helper libarc4 btbcm joydev input_leds snd_rawmidi 
videobuf2_common nvram btintel ledtrig_audio wmi_bmof serio_raw cp210x snd_seq 
iwlwifi bluetooth drm videodev cfg80211 usbserial ecdh_generic mc ecc 
snd_seq_device i2c_algo_bit fb_sys_fops usbtest snd_timer syscopyarea 
sysfillrect sysimgblt snd mei_me mei mac_hid soundcore lpc_ich sch_fq_codel 
parport_pc ppdev lp parport ip_tables x_tables autofs4 rtsx_pci_sdmmc psmouse 
ahci libahci
  e1000e rtsx_pci wmi video
---[ end trace 93d969fc4f8e5305 ]---

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: xhci problem -> general protection fault
  2020-09-17 15:30 xhci problem -> general protection fault Andrzej Pietrasiewicz
@ 2020-09-18 10:50 ` Mathias Nyman
  2020-09-18 14:20   ` Andrzej Pietrasiewicz
  0 siblings, 1 reply; 26+ messages in thread
From: Mathias Nyman @ 2020-09-18 10:50 UTC (permalink / raw)
  To: Andrzej Pietrasiewicz, linux-usb; +Cc: kernel

On 17.9.2020 18.30, Andrzej Pietrasiewicz wrote:
> Dear All,
> 
> I have encountered a problem in xhci which leads to general protection fault.
> 
> The problem is triggered by running this program:
> 
> https://gitlab.collabora.com/andrzej.p/bulk-cancel.git
> 
> $ sudo ./bulk-cancel -D /dev/bus/usb/002/006 -i 1 -b 1
> 
> where /dev/bus/usb/002/006 is a Gadget Zero:
> 
> It takes less than a minute until the crash happens.
> The DMAR (iommu) errors don't happen always, i.e. there are crashes
> when they are not reported in the system log. In either case the
> 
> "WARN Cannot submit Set TR Deq Ptr"
> "A Set TR Deq Ptr command is pending."
> "WARN Set TR Deq Ptr cmd failed due to incorrect slot or ep state."
> 
> messages do appear.
> 

Nice testcase and report, thanks.

I started looking at issues in this area some time ago, and wrote a couple patches but
it was left hanging. The two patches (now rebased on 5.9-rc3) can be found in my tree in the
fix_invalid_context_at_stop_endpoint branch

git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git fix_invalid_context_at_stop_endpoint 

https://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git/log/?h=fix_invalid_context_at_stop_endpoint

If you could give those a try and see if they help I'd be grateful.

also, if possible, can you add xhci dynamic debug and traces to show in more detail what is going on.

mount -t debugfs none /sys/kernel/debug
echo 'module xhci_hcd =p' >/sys/kernel/debug/dynamic_debug/control
echo 'module usbcore =p' >/sys/kernel/debug/dynamic_debug/control
echo 81920 > /sys/kernel/debug/tracing/buffer_size_kb
echo 1 > /sys/kernel/debug/tracing/events/xhci-hcd/enable
< Run the testcase >
Send output of dmesg
Send content of /sys/kernel/debug/tracing/trace

Thanks
Mathias

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: xhci problem -> general protection fault
  2020-09-18 10:50 ` Mathias Nyman
@ 2020-09-18 14:20   ` Andrzej Pietrasiewicz
  2020-09-25 13:40     ` Mathias Nyman
  0 siblings, 1 reply; 26+ messages in thread
From: Andrzej Pietrasiewicz @ 2020-09-18 14:20 UTC (permalink / raw)
  To: Mathias Nyman, linux-usb; +Cc: kernel

Hi Mathias,

W dniu 18.09.2020 o 12:50, Mathias Nyman pisze:
> On 17.9.2020 18.30, Andrzej Pietrasiewicz wrote:
>> Dear All,
>>
>> I have encountered a problem in xhci which leads to general protection fault.
>>
>> The problem is triggered by running this program:
>>
>> https://gitlab.collabora.com/andrzej.p/bulk-cancel.git
>>
>> $ sudo ./bulk-cancel -D /dev/bus/usb/002/006 -i 1 -b 1
>>
>> where /dev/bus/usb/002/006 is a Gadget Zero:
>>
>> It takes less than a minute until the crash happens.
>> The DMAR (iommu) errors don't happen always, i.e. there are crashes
>> when they are not reported in the system log. In either case the
>>
>> "WARN Cannot submit Set TR Deq Ptr"
>> "A Set TR Deq Ptr command is pending."
>> "WARN Set TR Deq Ptr cmd failed due to incorrect slot or ep state."
>>
>> messages do appear.
>>
> 
> Nice testcase and report, thanks.
> 
> I started looking at issues in this area some time ago, and wrote a couple patches but
> it was left hanging. The two patches (now rebased on 5.9-rc3) can be found in my tree in the
> fix_invalid_context_at_stop_endpoint branch
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git fix_invalid_context_at_stop_endpoint
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git/log/?h=fix_invalid_context_at_stop_endpoint
> 
> If you could give those a try and see if they help I'd be grateful.

No, it doesn't help, albeit the errors are slightly different:

xhci_hcd 0000:00:14.0: WARN Cannot submit Set TR Deq Ptr
xhci_hcd 0000:00:14.0: A Set TR Deq Ptr command is pending.
dmar_fault: 44 callbacks suppressed
DRHD: handling fault status reg 3
DMAR: [DMA Write] Request device [00:14.0] PASID ffffffff fault addr ffcda000 
[fault reason 05] PTE Write access is not set
DMAR: DRHD: handling fault status reg 3
DMAR: [DMA Write] Request device [00:14.0] PASID ffffffff fault addr ffee8000 
[fault reason 05] PTE Write access is not set
xhci_hcd 0000:00:14.0: WARN Cannot submit Set TR Deq Ptr
xhci_hcd 0000:00:14.0: A Set TR Deq Ptr command is pending.
DMAR: DRHD: handling fault status reg 3
DMAR: [DMA Write] Request device [00:14.0] PASID ffffffff fault addr fff8b000 
[fault reason 05] PTE Write access is not set
DMAR: DRHD: handling fault status reg 3
xhci_hcd 0000:00:14.0: ERROR Transfer event TRB DMA ptr not part of current TD 
ep_index 2 comp_code 3
xhci_hcd 0000:00:14.0: Looking for event-dma 00000000fff49dd0 trb-start 
00000000fff49e00 trb-end 00000000fff49e10 seg-start 00000000fff49000 seg-end 
00000000fff49ff0
xhci_hcd 0000:00:14.0: ERROR Transfer event TRB DMA ptr not part of current TD 
ep_index 2 comp_code 3
xhci_hcd 0000:00:14.0: Looking for event-dma 00000000fff49dd0 trb-start 
00000000fff49e00 trb-end 00000000fff49e20 seg-start 00000000fff49000 seg-end

> 
> also, if possible, can you add xhci dynamic debug and traces to show in more detail what is going on.
> 
> mount -t debugfs none /sys/kernel/debug
> echo 'module xhci_hcd =p' >/sys/kernel/debug/dynamic_debug/control
> echo 'module usbcore =p' >/sys/kernel/debug/dynamic_debug/control
> echo 81920 > /sys/kernel/debug/tracing/buffer_size_kb
> echo 1 > /sys/kernel/debug/tracing/events/xhci-hcd/enable
> < Run the testcase >
> Send output of dmesg
> Send content of /sys/kernel/debug/tracing/trace
> 

For 5.9-rc3 _without_ your patches:

xhci_hcd 0000:00:14.0: Babble error for slot 5 ep 2 on endpoint
xhci_hcd 0000:00:14.0: Cleaning up stalled endpoint ring
xhci_hcd 0000:00:14.0: Finding endpoint context
xhci_hcd 0000:00:14.0: Cycle state = 0x1
xhci_hcd 0000:00:14.0: New dequeue segment = 000000000a108eaa (virtual)
xhci_hcd 0000:00:14.0: New dequeue pointer = 0xfff4b010 (DMA)
xhci_hcd 0000:00:14.0: Queueing new dequeue state
xhci_hcd 0000:00:14.0: Set TR Deq Ptr cmd, new deq seg = 000000000a108eaa 
(0xfff4b000 dma), new deq ptr = 000000006e3d8c35 (0xfff4b010 dma), new cycle = 1
xhci_hcd 0000:00:14.0: // Ding dong!
xhci_hcd 0000:00:14.0: Giveback URB 000000008b374ed9, len = 0, expected = 256, 
status = -75
xhci_hcd 0000:00:14.0: Ignoring reset ep completion code of 1
xhci_hcd 0000:00:14.0: Successful Set TR Deq Ptr cmd, deq = @fff4b010
xhci_hcd 0000:00:14.0: Babble error for slot 5 ep 2 on endpoint
xhci_hcd 0000:00:14.0: Cleaning up stalled endpoint ring
xhci_hcd 0000:00:14.0: Finding endpoint context
xhci_hcd 0000:00:14.0: Cycle state = 0x1
xhci_hcd 0000:00:14.0: New dequeue segment = 000000000a108eaa (virtual)
xhci_hcd 0000:00:14.0: New dequeue pointer = 0xfff4b020 (DMA)
xhci_hcd 0000:00:14.0: Queueing new dequeue state
xhci_hcd 0000:00:14.0: Set TR Deq Ptr cmd, new deq seg = 000000000a108eaa 
(0xfff4b000 dma), new deq ptr = 00000000355aae2b (0xfff4b020 dma), new cycle = 1
xhci_hcd 0000:00:14.0: // Ding dong!
xhci_hcd 0000:00:14.0: Giveback URB 0000000014c841b0, len = 0, expected = 256, 
status = -75
xhci_hcd 0000:00:14.0: Ignoring reset ep completion code of 1
xhci_hcd 0000:00:14.0: Successful Set TR Deq Ptr cmd, deq = @fff4b020
xhci_hcd 0000:00:14.0: Babble error for slot 5 ep 2 on endpoint
xhci_hcd 0000:00:14.0: Cleaning up stalled endpoint ring
xhci_hcd 0000:00:14.0: Finding endpoint context
xhci_hcd 0000:00:14.0: Cycle state = 0x1
xhci_hcd 0000:00:14.0: New dequeue segment = 000000000a108eaa (virtual)
xhci_hcd 0000:00:14.0: New dequeue pointer = 0xfff4b030 (DMA)
xhci_hcd 0000:00:14.0: Queueing new dequeue state
xhci_hcd 0000:00:14.0: Set TR Deq Ptr cmd, new deq seg = 000000000a108eaa 
(0xfff4b000 dma), new deq ptr = 00000000ce04ba9f (0xfff4b030 dma), new cycle = 1
xhci_hcd 0000:00:14.0: // Ding dong!
xhci_hcd 0000:00:14.0: Giveback URB 000000008b374ed9, len = 0, expected = 256, 
status = -75


[above blocks repeat a gazillion times]

xhci_hcd 0000:00:14.0: Ignoring reset ep completion code of 1
xhci_hcd 0000:00:14.0: Successful Set TR Deq Ptr cmd, deq = @fff4bd10
xhci_hcd 0000:00:14.0: Babble error for slot 5 ep 2 on endpoint
xhci_hcd 0000:00:14.0: Cleaning up stalled endpoint ring
xhci_hcd 0000:00:14.0: Finding endpoint context
xhci_hcd 0000:00:14.0: Cycle state = 0x0
xhci_hcd 0000:00:14.0: New dequeue segment = 000000000a108eaa (virtual)
xhci_hcd 0000:00:14.0: New dequeue pointer = 0xfff4bd20 (DMA)
xhci_hcd 0000:00:14.0: Queueing new dequeue state
xhci_hcd 0000:00:14.0: Set TR Deq Ptr cmd, new deq seg = 000000000a108eaa 
(0xfff4b000 dma), new deq ptr = 00000000e2b08aa6 (0xfff4bd20 dma), new cycle = 0
xhci_hcd 0000:00:14.0: // Ding dong!
xhci_hcd 0000:00:14.0: Giveback URB 00000000ea1584be, len = 0, expected = 256, 
status = -75
xhci_hcd 0000:00:14.0: Ignoring reset ep completion code of 1
xhci_hcd 0000:00:14.0: Successful Set TR Deq Ptr cmd, deq = @fff4bd20
xhci_hcd 0000:00:14.0: Babble error for slot 5 ep 2 on endpoint
xhci_hcd 0000:00:14.0: Cleaning up stalled endpoint ring
xhci_hcd 0000:00:14.0: Finding endpoint context
xhci_hcd 0000:00:14.0: Cycle state = 0x0
xhci_hcd 0000:00:14.0: New dequeue segment = 000000000a108eaa (virtual)
xhci_hcd 0000:00:14.0: New dequeue pointer = 0xfff4bd30 (DMA)
xhci_hcd 0000:00:14.0: Queueing new dequeue state
xhci_hcd 0000:00:14.0: Set TR Deq Ptr cmd, new deq seg = 000000000a108eaa 
(0xfff4b000 dma), new deq ptr = 00000000d04439dd (0xfff4bd30 dma), new cycle = 0
xhci_hcd 0000:00:14.0: // Ding dong!
xhci_hcd 0000:00:14.0: Giveback URB 000000007e94171c, len = 0, expected = 256, 
status = -75
xhci_hcd 0000:00:14.0: Ignoring reset ep completion code of 1
xhci_hcd 0000:00:14.0: Successful Set TR Deq Ptr cmd, deq = @fff4bd30
xhci_hcd 0000:00:14.0: Babble error for slot 5 ep 2 on endpoint
xhci_hcd 0000:00:14.0: Cleaning up stalled endpoint ring
xhci_hcd 0000:00:14.0: Finding endpoint context
xhci_hcd 0000:00:14.0: Cycle state = 0x0
xhci_hcd 0000:00:14.0: New dequeue segment = 000000000a108eaa (virtual)
xhci_hcd 0000:00:14.0: New dequeue pointer = 0xfff4bd40 (DMA)
xhci_hcd 0000:00:14.0: Queueing new dequeue state
xhci_hcd 0000:00:14.0: Set TR Deq Ptr cmd, new deq seg = 000000000a108eaa 
(0xfff4b000 dma), new deq ptr = 000000007a585f64 (0xfff4bd40 dma), new cycle = 0
xhci_hcd 0000:00:14.0: // Ding dong!
xhci_hcd 0000:00:14.0: Giveback URB 000000000c30ab36, len = 0, expected = 256, 
status = -75
xhci_hcd 0000:00:14.0: Ignoring reset ep completion code of 1
xhci_hcd 0000:00:14.0: Successful Set TR Deq Ptr cmd, deq = @fff4bd40
xhci_hcd 0000:00:14.0: Babble error for slot 5 ep 2 on endpoint
xhci_hcd 0000:00:14.0: Cleaning up stalled endpoint ring
xhci_hcd 0000:00:14.0: Finding endpoint context
xhci_hcd 0000:00:14.0: Cycle state = 0x0
xhci_hcd 0000:00:14.0: New dequeue segment = 000000000a108eaa (virtual)
xhci_hcd 0000:00:14.0: New dequeue pointer = 0xfff4bd50 (DMA)
xhci_hcd 0000:00:14.0: Queueing new dequeue state
xhci_hcd 0000:00:14.0: Set TR Deq Ptr cmd, new deq seg = 000000000a108eaa 
(0xfff4b000 dma), new deq ptr = 00000000bbfef907 (0xfff4bd50 dma), new cycle = 0
xhci_hcd 0000:00:14.0: // Ding dong!
xhci_hcd 0000:00:14.0: Giveback URB 0000000042fc26fe, len = 0, expected = 256, 
status = -75
xhci_hcd 0000:00:14.0: Ignoring reset ep completion code of 1
xhci_hcd 0000:00:14.0: Successful Set TR Deq Ptr cmd, deq = @fff4bd50
general protection fault: 0000 [#2] SMP PTI
CPU: 3 PID: 1061 Comm: gnome-shell Tainted: G      D           5.4.0-47-generic 
#51~18.04.1-Ubuntu
Hardware name: LENOVO 20B7S4L800/20B7S4L800, BIOS GJET82WW (2.32 ) 01/09/2015
RIP: 0010:kmem_cache_alloc_trace+0x7e/0x220
Code: b9 01 00 00 4d 8b 01 65 49 8b 50 08 65 4c 03 05 50 8e b6 6e 4d 8b 30 4d 85 
f6 0f 84 56 01 00 00 41 8b 41 20 49 8b 39 4c 01 f0 <48> 8b 18 49 33 99 70 01 00 
00 48 89 c1 48 0f c9 4c 89 f0 48 31 cb
RSP: 0018:ffffa3a640bfbeb8 EFLAGS: 00010206
RAX: 56329a4a4f062eb2 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 000000000000053b RSI: 0000000000000dc0 RDI: 000000000002f120
RBP: ffffa3a640bfbef0 R08: ffff9563164ef120 R09: ffff956315406f40
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000dc0
R13: ffff956315406f40 R14: 56329a4a4f062eb2 R15: ffffffff9153683c
FS:  00007fefcba2dac0(0000) GS:ffff9563164c0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00001d94d9640000 CR3: 00000001ec1ec006 CR4: 00000000001606e0
Call Trace:
  __x64_sys_timerfd_create+0x6c/0x1a0
  do_syscall_64+0x57/0x190
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7fefc8af6757
Code: 73 01 c3 48 8b 0d 31 87 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 
00 00 00 00 00 0f 1f 44 00 00 b8 1b 01 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 
48 8b 0d 01 87 2c 00 f7 d8 64 89 01 48
RSP: 002b:00007ffee678ad78 EFLAGS: 00000246 ORIG_RAX: 000000000000011b
RAX: ffffffffffffffda RBX: 0000555750f80560 RCX: 00007fefc8af6757
RDX: 0000000ed6f808b0 RSI: 0000000000080000 RDI: 0000000000000000
RBP: 0000000000000001 R08: 0000555750f80560 R09: 00007fefc8dbfc40
R10: 0000000000000007 R11: 0000000000000246 R12: 000000005f64c06c
R13: 000000005f64c030 R14: 0000000000000000 R15: 0000000000000000
Modules linked in: ccm aufs overlay cmac bnep intel_rapl_msr mei_hdcp 
intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm 
crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_codec_hdmi 
snd_hda_codec_realtek snd_hda_codec_generic aesni_intel crypto_simd cryptd 
glue_helper snd_hda_intel rapl snd_intel_dspcfg snd_hda_codec iwlmvm 
snd_hda_core snd_hwdep mac80211 uvcvideo intel_cstate videobuf2_vmalloc libarc4 
videobuf2_memops videobuf2_v4l2 videobuf2_common snd_pcm cp210x joydev usbtest 
usbserial videodev input_leds rtsx_pci_ms iwlwifi serio_raw mc wmi_bmof memstick 
i915 snd_seq_midi snd_seq_midi_event btusb btrtl btbcm btintel bluetooth 
thinkpad_acpi snd_rawmidi ecdh_generic cfg80211 ecc nvram ledtrig_audio snd_seq 
drm_kms_helper snd_seq_device drm snd_timer snd i2c_algo_bit fb_sys_fops 
syscopyarea mac_hid sysfillrect sysimgblt mei_me lpc_ich mei soundcore 
sch_fq_codel parport_pc ppdev lp parport ip_tables x_tables autofs4 
rtsx_pci_sdmmc psmouse
  ahci libahci rtsx_pci e1000e wmi video
---[ end trace ac9831e33eee8ef0 ]---
RIP: 0010:__kmalloc+0x9e/0x270
Code: 7f 01 00 00 4d 8b 01 65 49 8b 50 08 65 4c 03 05 70 84 b6 6e 4d 8b 20 4d 85 
e4 0f 84 8d 01 00 00 41 8b 41 20 49 8b 39 4c 01 e0 <48> 8b 18 49 33 99 70 01 00 
00 48 89 c1 48 0f c9 4c 89 e0 48 31 cb
RSP: 0018:ffffa3a640577ca8 EFLAGS: 00010206
RAX: 56329a4a4f062eb2 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 000000000000053b RSI: 0000000000000cc0 RDI: 000000000002f120
RBP: ffffa3a640577cd8 R08: ffff9563164ef120 R09: ffff956315406f40
R10: ffff95630af85598 R11: ffff9562ed344338 R12: 56329a4a4f062eb2
R13: 0000000000000cc0 R14: 0000000000000100 R15: ffff956315406f40
FS:  00007fefcba2dac0(0000) GS:ffff9563164c0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00001d94d9640000 CR3: 00000001ec1ec006 CR4: 00000000001606e0


# cat /sys/kernel/debug/tracing/trace
# tracer: nop
#
# entries-in-buffer/entries-written: 0/0   #P:8
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |

[empty contents!]

Andrzej

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: xhci problem -> general protection fault
  2020-09-18 14:20   ` Andrzej Pietrasiewicz
@ 2020-09-25 13:40     ` Mathias Nyman
  2020-09-25 21:05       ` Ross Zwisler
  0 siblings, 1 reply; 26+ messages in thread
From: Mathias Nyman @ 2020-09-25 13:40 UTC (permalink / raw)
  To: Andrzej Pietrasiewicz, linux-usb; +Cc: kernel

On 18.9.2020 17.20, Andrzej Pietrasiewicz wrote:
> Hi Mathias,
> 
> W dniu 18.09.2020 o 12:50, Mathias Nyman pisze:
>> On 17.9.2020 18.30, Andrzej Pietrasiewicz wrote:
>>> Dear All,
>>>
>>> I have encountered a problem in xhci which leads to general protection fault.
>>>
>>> The problem is triggered by running this program:
>>>
>>> https://gitlab.collabora.com/andrzej.p/bulk-cancel.git
>>>
>>> $ sudo ./bulk-cancel -D /dev/bus/usb/002/006 -i 1 -b 1
>>>
>>> where /dev/bus/usb/002/006 is a Gadget Zero:
>>>
>>> It takes less than a minute until the crash happens.
>>> The DMAR (iommu) errors don't happen always, i.e. there are crashes
>>> when they are not reported in the system log. In either case the
>>>
>>> "WARN Cannot submit Set TR Deq Ptr"
>>> "A Set TR Deq Ptr command is pending."
>>> "WARN Set TR Deq Ptr cmd failed due to incorrect slot or ep state."
>>>
>>> messages do appear.
>>>
>>
>> Nice testcase and report, thanks.
>>
>> I started looking at issues in this area some time ago, and wrote a couple patches but
>> it was left hanging. The two patches (now rebased on 5.9-rc3) can be found in my tree in the
>> fix_invalid_context_at_stop_endpoint branch
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git fix_invalid_context_at_stop_endpoint
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git/log/?h=fix_invalid_context_at_stop_endpoint
>>
>> If you could give those a try and see if they help I'd be grateful.
> 
> No, it doesn't help, albeit the errors are slightly different:
> 
> xhci_hcd 0000:00:14.0: WARN Cannot submit Set TR Deq Ptr
> xhci_hcd 0000:00:14.0: A Set TR Deq Ptr command is pending.
> dmar_fault: 44 callbacks suppressed
> DRHD: handling fault status reg 3> DMAR: [DMA Write] Request device [00:14.0] PASID ffffffff fault addr ffcda000 [fault reason 05] PTE Write access is not set
> DMAR: DRHD: handling fault status reg 3

Ok, thanks, the DMA problems make sense to me now.

If a transfer ring stops on a transfer request (TRB) that should be canceled (manual cancel,
or error) it's not enough to just turn the  TRB to a no-op.
HW has most likely cached the TRB, and we need to move the transfer ring dequeue pointer past this TRB.
Moving deq also clears controller cache.

We do all this, but if we fail to queue the Set TR Deq command the TRB (with DMA  pointers) will stay on the ring,
and controller will access the TRB DMA  address once it continues running. At this point xhci driver has already
given back the canceled/erroneous TRB, and is probably unmapped already.
Hence the DMAR entries.  

Looks like this part of the code needs a more extensive rewrite, on top of this we are not handling
races between endpoints halted due errors, and endpoints stopped by driver to cancel URBs. 

-Mathias

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: xhci problem -> general protection fault
  2020-09-25 13:40     ` Mathias Nyman
@ 2020-09-25 21:05       ` Ross Zwisler
  2020-09-28 13:32         ` Andrzej Pietrasiewicz
  2020-09-28 22:35         ` Mathias Nyman
  0 siblings, 2 replies; 26+ messages in thread
From: Ross Zwisler @ 2020-09-25 21:05 UTC (permalink / raw)
  To: Mathias Nyman; +Cc: Andrzej Pietrasiewicz, linux-usb, kernel

On Fri, Sep 25, 2020 at 04:40:29PM +0300, Mathias Nyman wrote:
> On 18.9.2020 17.20, Andrzej Pietrasiewicz wrote:
> > Hi Mathias,
> > 
> > W dniu 18.09.2020 o 12:50, Mathias Nyman pisze:
> >> On 17.9.2020 18.30, Andrzej Pietrasiewicz wrote:
> >>> Dear All,
> >>>
> >>> I have encountered a problem in xhci which leads to general protection fault.
> >>>
> >>> The problem is triggered by running this program:
> >>>
> >>> https://gitlab.collabora.com/andrzej.p/bulk-cancel.git
> >>>
> >>> $ sudo ./bulk-cancel -D /dev/bus/usb/002/006 -i 1 -b 1
> >>>
> >>> where /dev/bus/usb/002/006 is a Gadget Zero:
> >>>
> >>> It takes less than a minute until the crash happens.
> >>> The DMAR (iommu) errors don't happen always, i.e. there are crashes
> >>> when they are not reported in the system log. In either case the
> >>>
> >>> "WARN Cannot submit Set TR Deq Ptr"
> >>> "A Set TR Deq Ptr command is pending."
> >>> "WARN Set TR Deq Ptr cmd failed due to incorrect slot or ep state."
> >>>
> >>> messages do appear.
> >>>
> >>
> >> Nice testcase and report, thanks.
> >>
> >> I started looking at issues in this area some time ago, and wrote a couple patches but
> >> it was left hanging. The two patches (now rebased on 5.9-rc3) can be found in my tree in the
> >> fix_invalid_context_at_stop_endpoint branch
> >>
> >> git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git fix_invalid_context_at_stop_endpoint
> >>
> >> https://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git/log/?h=fix_invalid_context_at_stop_endpoint
> >>
> >> If you could give those a try and see if they help I'd be grateful.
> > 
> > No, it doesn't help, albeit the errors are slightly different:
> > 
> > xhci_hcd 0000:00:14.0: WARN Cannot submit Set TR Deq Ptr
> > xhci_hcd 0000:00:14.0: A Set TR Deq Ptr command is pending.
> > dmar_fault: 44 callbacks suppressed
> > DRHD: handling fault status reg 3> DMAR: [DMA Write] Request device [00:14.0] PASID ffffffff fault addr ffcda000 [fault reason 05] PTE Write access is not set
> > DMAR: DRHD: handling fault status reg 3
> 
> Ok, thanks, the DMA problems make sense to me now.
> 
> If a transfer ring stops on a transfer request (TRB) that should be canceled (manual cancel,
> or error) it's not enough to just turn the  TRB to a no-op.
> HW has most likely cached the TRB, and we need to move the transfer ring dequeue pointer past this TRB.
> Moving deq also clears controller cache.
> 
> We do all this, but if we fail to queue the Set TR Deq command the TRB (with DMA  pointers) will stay on the ring,
> and controller will access the TRB DMA  address once it continues running. At this point xhci driver has already
> given back the canceled/erroneous TRB, and is probably unmapped already.
> Hence the DMAR entries.  
> 
> Looks like this part of the code needs a more extensive rewrite, on top of this we are not handling
> races between endpoints halted due errors, and endpoints stopped by driver to cancel URBs. 
> 
> -Mathias

I'm chasing a similar problem which is also most easily reproduced using
Andrzej's bulk-cancel program, though we have seen it in the field many times
as well with normal usage.

The symptom is that we get the following errors in dmesg:

 xhci_hcd 0000:00:14.0: Mismatch between completed Set TR Deq Ptr command & xHCI internal state.
 xhci_hcd 0000:00:14.0: ep deq seg = 000000001141d6a0, deq ptr = 00000000ebd28dcf
 xhci_hcd 0000:00:14.0: WARNING: Host System Error
 DMAR: DRHD: handling fault status reg 2
 DMAR: [DMA Read] Request device [00:14.0] PASID ffffffff fault addr 0 [fault reason 06] PTE Read access is not set
 xhci_hcd 0000:00:14.0: xHCI host not responding to stop endpoint command.
 xhci_hcd 0000:00:14.0: USBSTS: HCHalted HSE EINT
 xhci_hcd 0000:00:14.0: xHCI host controller not responding, assume dead
 xhci_hcd 0000:00:14.0: HC died; cleaning up

The xhci USB stack loses all attached devices, and on my system at least has
only been recoverable by rebooting.

Full dmesg and trace after 'echo 1 > /sys/kernel/debug/tracing/events/xhci-hcd/enable'
can be found here:

https://gist.github.com/rzwisler/531b926e3d160609ead371c23fc15b55
https://gist.github.com/rzwisler/4621f805737993fec30b5ae23bfd8289

One interesting bit from the trace is that we observe the ep_ctx->deq pointer
being 0 in xhci_handle_cmd_set_deq():

 xhci_inc_enq: CMD 000000000b6352e0: enq 0x00000000ffffe0c0(0x00000000ffffe000) deq 0x00000000ffffe090(0x00000000ffffe000) segs 1 stream 0 free_trbs 251 bounce 0 cycle 0
 xhci_ring_host_doorbell: Ring doorbell for Command Ring 0
 xhci_urb_giveback: ep7in-bulk: urb 000000003c80b7a8 pipe 3221455744 slot 2 length 0/256 sgs 0/0 stream 0 flags 00010200
 xhci_inc_deq: CMD 000000000b6352e0: enq 0x00000000ffffe0c0(0x00000000ffffe000) deq 0x00000000ffffe0a0(0x00000000ffffe000) segs 1 stream 0 free_trbs 252 bounce 0 cycle 0
 xhci_inc_deq: EVENT 00000000b5c6e6a2: enq 0x00000000ffffc000(0x00000000ffffc000) deq 0x00000000ffffc1b0(0x00000000ffffc000) segs 1 stream 0 free_trbs 254 bounce 0 cycle 1
 xhci_handle_event: EVENT: TRB 00000000ffffe0a0 status 'Success' len 0 slot 4 ep 0 type 'Command Completion Event' flags e:C
 xhci_handle_command: CMD: Set TR Dequeue Pointer Command: deq 00000000fff296a1 stream 0 slot 4 ep 3 flags c
 xhci_handle_cmd_set_deq: RS 00000 full-speed Ctx Entries 15 MEL 0 us Port# 13/0 [TT Slot 0 Port# 0 TTT 0 Intr 0] Addr 4 State configured
 xhci_handle_cmd_set_deq_ep: State stopped mult 1 max P. Streams 0 interval 125 us max ESIT payload 0 CErr 3 Type Bulk IN burst 0 maxp 64 deq 0000000000000000 avg trb len 0
																	  ^^^^^^^^^^^^^^^^^^^^
 xhci_dbg_cancel_urb: Successful Set TR Deq Ptr cmd, deq = @00000000
						     ^^^^^^^^^^^^^^^

In successful completions they are real values:

 xhci_ring_ep_doorbell: Ring doorbell for Slot 4 ep1in
 xhci_inc_deq: CMD 000000000b6352e0: enq 0x00000000ffffe0c0(0x00000000ffffe000) deq 0x00000000ffffe0b0(0x00000000ffffe000) segs 1 stream 0 free_trbs 253 bounce 0 cycle 0
 xhci_inc_deq: EVENT 00000000b5c6e6a2: enq 0x00000000ffffc000(0x00000000ffffc000) deq 0x00000000ffffc1c0(0x00000000ffffc000) segs 1 stream 0 free_trbs 254 bounce 0 cycle 1
 xhci_handle_event: EVENT: TRB 00000000ffffe0b0 status 'Success' len 0 slot 2 ep 0 type 'Command Completion Event' flags e:C
 xhci_handle_command: CMD: Set TR Dequeue Pointer Command: deq 00000000fff86551 stream 0 slot 2 ep 15 flags c
 xhci_handle_cmd_set_deq: RS 00000 full-speed Ctx Entries 15 MEL 0 us Port# 11/0 [TT Slot 0 Port# 0 TTT 0 Intr 0] Addr 2 State configured
 xhci_handle_cmd_set_deq_ep: State stopped mult 1 max P. Streams 0 interval 125 us max ESIT payload 0 CErr 3 Type Bulk IN burst 0 maxp 16 deq 00000000fff86551 avg trb len 0
																	  ^^^^^^^^^^^^^^^^^^^^
 xhci_dbg_cancel_urb: Successful Set TR Deq Ptr cmd, deq = @fff86550
						     ^^^^^^^^^^^^^^^

I've noticed that I have to have CONFIG_INTEL_IOMMU_DEFAULT_ON=y for this
clean repro case, else the system still fails but I don't always (ever?) see
the failure to read at address 0.  

Mathias, do you think that your above explanation also covers my failure case?  
Are we just failing to enqueue a Set TR Deq command, and the HC is processing
a stale TRB?  Or does the fact that ep_ctx->deq == 0 not fit with that
explanation?

Other tidbits that I've disovered along the way:

1) We first started noticing this in the field with v4.19 based kernels, but
   I've been able to reproduce it using bulk-cancel with a v4.18 kernel
   without much trouble, so I'm pretty sure it's an old issue.

2) This definitely looks like a race which is very sensitive to timing.  If I
   put a single trace_printk() line in xhci_handle_cmd_set_deq() function, the
   issue is 10x harder to repro (on average 3 seconds to on average 30
   seconds).  Similarly, when we turned on tracing in the field to try and get
   logs the issue reproduces much less frequently.  I also tried to bisect,
   and it turns out that whether or not it reproduced on my system with older
   kernels was dependent on how tracing functions were inlined (!!).

   I think that we started seeing this in the field with v4.19 based kernels
   purely because of a timing change.

3) I did run with your patches from
   https://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git/log/?h=fix_invalid_context_at_stop_endpoint
   and it didn't change anything for me.

I'm reading up more on xhci and trying to understand this race, but would
appreciate any help or direction that you're able to provide.

Thanks,
- Ross

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: xhci problem -> general protection fault
  2020-09-25 21:05       ` Ross Zwisler
@ 2020-09-28 13:32         ` Andrzej Pietrasiewicz
  2020-09-29  7:13           ` Mathias Nyman
  2020-09-28 22:35         ` Mathias Nyman
  1 sibling, 1 reply; 26+ messages in thread
From: Andrzej Pietrasiewicz @ 2020-09-28 13:32 UTC (permalink / raw)
  To: Ross Zwisler, Mathias Nyman; +Cc: linux-usb, kernel

Hi Ross, hi Mathias,

W dniu 25.09.2020 o 23:05, Ross Zwisler pisze:
> On Fri, Sep 25, 2020 at 04:40:29PM +0300, Mathias Nyman wrote:
>> On 18.9.2020 17.20, Andrzej Pietrasiewicz wrote:
>>> Hi Mathias,
>>>
>>> W dniu 18.09.2020 o 12:50, Mathias Nyman pisze:
>>>> On 17.9.2020 18.30, Andrzej Pietrasiewicz wrote:
>>>>> Dear All,
>>>>>
>>>>> I have encountered a problem in xhci which leads to general protection fault.
>>>>>
>>>>> The problem is triggered by running this program:
>>>>>
>>>>> https://gitlab.collabora.com/andrzej.p/bulk-cancel.git
>>>>>
>>>>> $ sudo ./bulk-cancel -D /dev/bus/usb/002/006 -i 1 -b 1
>>>>>
>>>>> where /dev/bus/usb/002/006 is a Gadget Zero:
>>>>>
>>>>> It takes less than a minute until the crash happens.
>>>>> The DMAR (iommu) errors don't happen always, i.e. there are crashes
>>>>> when they are not reported in the system log. In either case the
>>>>>
>>>>> "WARN Cannot submit Set TR Deq Ptr"
>>>>> "A Set TR Deq Ptr command is pending."
>>>>> "WARN Set TR Deq Ptr cmd failed due to incorrect slot or ep state."
>>>>>
>>>>> messages do appear.
>>>>>
>>>>
>>>> Nice testcase and report, thanks.
>>>>
>>>> I started looking at issues in this area some time ago, and wrote a couple patches but
>>>> it was left hanging. The two patches (now rebased on 5.9-rc3) can be found in my tree in the
>>>> fix_invalid_context_at_stop_endpoint branch
>>>>
>>>> git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git fix_invalid_context_at_stop_endpoint
>>>>
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git/log/?h=fix_invalid_context_at_stop_endpoint
>>>>
>>>> If you could give those a try and see if they help I'd be grateful.
>>>
>>> No, it doesn't help, albeit the errors are slightly different:
>>>
>>> xhci_hcd 0000:00:14.0: WARN Cannot submit Set TR Deq Ptr
>>> xhci_hcd 0000:00:14.0: A Set TR Deq Ptr command is pending.
>>> dmar_fault: 44 callbacks suppressed
>>> DRHD: handling fault status reg 3> DMAR: [DMA Write] Request device [00:14.0] PASID ffffffff fault addr ffcda000 [fault reason 05] PTE Write access is not set
>>> DMAR: DRHD: handling fault status reg 3
>>
>> Ok, thanks, the DMA problems make sense to me now.
>>
>> If a transfer ring stops on a transfer request (TRB) that should be canceled (manual cancel,
>> or error) it's not enough to just turn the  TRB to a no-op.
>> HW has most likely cached the TRB, and we need to move the transfer ring dequeue pointer past this TRB.
>> Moving deq also clears controller cache.
>>
>> We do all this, but if we fail to queue the Set TR Deq command the TRB (with DMA  pointers) will stay on the ring,
>> and controller will access the TRB DMA  address once it continues running. At this point xhci driver has already
>> given back the canceled/erroneous TRB, and is probably unmapped already.
>> Hence the DMAR entries.
>>
>> Looks like this part of the code needs a more extensive rewrite, on top of this we are not handling
>> races between endpoints halted due errors, and endpoints stopped by driver to cancel URBs.
>>
>> -Mathias
> 
> I'm chasing a similar problem which is also most easily reproduced using
> Andrzej's bulk-cancel program, though we have seen it in the field many times
> as well with normal usage.
> 
> The symptom is that we get the following errors in dmesg:
> 
>   xhci_hcd 0000:00:14.0: Mismatch between completed Set TR Deq Ptr command & xHCI internal state.
>   xhci_hcd 0000:00:14.0: ep deq seg = 000000001141d6a0, deq ptr = 00000000ebd28dcf
>   xhci_hcd 0000:00:14.0: WARNING: Host System Error
>   DMAR: DRHD: handling fault status reg 2
>   DMAR: [DMA Read] Request device [00:14.0] PASID ffffffff fault addr 0 [fault reason 06] PTE Read access is not set
>   xhci_hcd 0000:00:14.0: xHCI host not responding to stop endpoint command.
>   xhci_hcd 0000:00:14.0: USBSTS: HCHalted HSE EINT
>   xhci_hcd 0000:00:14.0: xHCI host controller not responding, assume dead
>   xhci_hcd 0000:00:14.0: HC died; cleaning up
> 
> The xhci USB stack loses all attached devices, and on my system at least has
> only been recoverable by rebooting.
> 
> Full dmesg and trace after 'echo 1 > /sys/kernel/debug/tracing/events/xhci-hcd/enable'
> can be found here:
> 
> https://gist.github.com/rzwisler/531b926e3d160609ead371c23fc15b55
> https://gist.github.com/rzwisler/4621f805737993fec30b5ae23bfd8289
> 
> One interesting bit from the trace is that we observe the ep_ctx->deq pointer
> being 0 in xhci_handle_cmd_set_deq():
> 
>   xhci_inc_enq: CMD 000000000b6352e0: enq 0x00000000ffffe0c0(0x00000000ffffe000) deq 0x00000000ffffe090(0x00000000ffffe000) segs 1 stream 0 free_trbs 251 bounce 0 cycle 0
>   xhci_ring_host_doorbell: Ring doorbell for Command Ring 0
>   xhci_urb_giveback: ep7in-bulk: urb 000000003c80b7a8 pipe 3221455744 slot 2 length 0/256 sgs 0/0 stream 0 flags 00010200
>   xhci_inc_deq: CMD 000000000b6352e0: enq 0x00000000ffffe0c0(0x00000000ffffe000) deq 0x00000000ffffe0a0(0x00000000ffffe000) segs 1 stream 0 free_trbs 252 bounce 0 cycle 0
>   xhci_inc_deq: EVENT 00000000b5c6e6a2: enq 0x00000000ffffc000(0x00000000ffffc000) deq 0x00000000ffffc1b0(0x00000000ffffc000) segs 1 stream 0 free_trbs 254 bounce 0 cycle 1
>   xhci_handle_event: EVENT: TRB 00000000ffffe0a0 status 'Success' len 0 slot 4 ep 0 type 'Command Completion Event' flags e:C
>   xhci_handle_command: CMD: Set TR Dequeue Pointer Command: deq 00000000fff296a1 stream 0 slot 4 ep 3 flags c
>   xhci_handle_cmd_set_deq: RS 00000 full-speed Ctx Entries 15 MEL 0 us Port# 13/0 [TT Slot 0 Port# 0 TTT 0 Intr 0] Addr 4 State configured
>   xhci_handle_cmd_set_deq_ep: State stopped mult 1 max P. Streams 0 interval 125 us max ESIT payload 0 CErr 3 Type Bulk IN burst 0 maxp 64 deq 0000000000000000 avg trb len 0
> 																	  ^^^^^^^^^^^^^^^^^^^^
>   xhci_dbg_cancel_urb: Successful Set TR Deq Ptr cmd, deq = @00000000
> 						     ^^^^^^^^^^^^^^^
> 
> In successful completions they are real values:
> 
>   xhci_ring_ep_doorbell: Ring doorbell for Slot 4 ep1in
>   xhci_inc_deq: CMD 000000000b6352e0: enq 0x00000000ffffe0c0(0x00000000ffffe000) deq 0x00000000ffffe0b0(0x00000000ffffe000) segs 1 stream 0 free_trbs 253 bounce 0 cycle 0
>   xhci_inc_deq: EVENT 00000000b5c6e6a2: enq 0x00000000ffffc000(0x00000000ffffc000) deq 0x00000000ffffc1c0(0x00000000ffffc000) segs 1 stream 0 free_trbs 254 bounce 0 cycle 1
>   xhci_handle_event: EVENT: TRB 00000000ffffe0b0 status 'Success' len 0 slot 2 ep 0 type 'Command Completion Event' flags e:C
>   xhci_handle_command: CMD: Set TR Dequeue Pointer Command: deq 00000000fff86551 stream 0 slot 2 ep 15 flags c
>   xhci_handle_cmd_set_deq: RS 00000 full-speed Ctx Entries 15 MEL 0 us Port# 11/0 [TT Slot 0 Port# 0 TTT 0 Intr 0] Addr 2 State configured
>   xhci_handle_cmd_set_deq_ep: State stopped mult 1 max P. Streams 0 interval 125 us max ESIT payload 0 CErr 3 Type Bulk IN burst 0 maxp 16 deq 00000000fff86551 avg trb len 0
> 																	  ^^^^^^^^^^^^^^^^^^^^
>   xhci_dbg_cancel_urb: Successful Set TR Deq Ptr cmd, deq = @fff86550
> 						     ^^^^^^^^^^^^^^^
> 
> I've noticed that I have to have CONFIG_INTEL_IOMMU_DEFAULT_ON=y for this
> clean repro case, else the system still fails but I don't always (ever?) see
> the failure to read at address 0.
> 
> Mathias, do you think that your above explanation also covers my failure case?
> Are we just failing to enqueue a Set TR Deq command, and the HC is processing
> a stale TRB?  Or does the fact that ep_ctx->deq == 0 not fit with that
> explanation?
> 

 From the logs from Ross it seems that the direct cause of the HC dying
is Stop Endpoint not completing and, consequently, timing out.

In the xHCI spec, section "4.6.9 Stop Endpoint" we can read:

"Note that when an endpoint is stopped, the xHC maintains the state
necessary to restart the last active Transfer Ring where it left off,
however software may not want to do this. The options are discussed below:

1. Temporarily Stop Transfer Ring Activity - [...]

2. Aborting a Transfer - If, because of a timeout or other reason, software
issued the Stop Endpoint Command to abort the current TD. Then the
Set TR Dequeue Pointer Command may be used to force the xHC to
dump any internal state that it has for the ring and restart activity at the
new Transfer Ring location specified by the Set TR Dequeue Pointer Command."

If bulk-cancel reproducer is used then the software's intention is definitely
to abort a transfer, so indeed a Set TR Dequeue Pointer is needed.

On the other hand, in that same spec, in "4.6.10 Set TR Dequeue Pointer"
we read:

"This command may be executed only if the target endpoint is in the Error
or Stopped state."

My question is: why in the crash scenario Ross describes "Set TR Dequeue
Pointer" is being carried out while Stop Endpoint has not been completed
(and timed out instead)?

Regards,

Andrzej

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: xhci problem -> general protection fault
  2020-09-25 21:05       ` Ross Zwisler
  2020-09-28 13:32         ` Andrzej Pietrasiewicz
@ 2020-09-28 22:35         ` Mathias Nyman
  2020-10-01 16:43           ` zwisler
  1 sibling, 1 reply; 26+ messages in thread
From: Mathias Nyman @ 2020-09-28 22:35 UTC (permalink / raw)
  To: Ross Zwisler; +Cc: Andrzej Pietrasiewicz, linux-usb, kernel

On 26.9.2020 0.05, Ross Zwisler wrote:
> On Fri, Sep 25, 2020 at 04:40:29PM +0300, Mathias Nyman wrote:
>> On 18.9.2020 17.20, Andrzej Pietrasiewicz wrote:
>>> Hi Mathias,
>>>
>>> W dniu 18.09.2020 o 12:50, Mathias Nyman pisze:
>>>> On 17.9.2020 18.30, Andrzej Pietrasiewicz wrote:
>>>>> Dear All,
>>>>>
>>>>> I have encountered a problem in xhci which leads to general protection fault.
>>>>>
>>>>> The problem is triggered by running this program:
>>>>>
>>>>> https://gitlab.collabora.com/andrzej.p/bulk-cancel.git
>>>>>
>>>>> $ sudo ./bulk-cancel -D /dev/bus/usb/002/006 -i 1 -b 1
>>>>>
>>>>> where /dev/bus/usb/002/006 is a Gadget Zero:
>>>>>
>>>>> It takes less than a minute until the crash happens.
>>>>> The DMAR (iommu) errors don't happen always, i.e. there are crashes
>>>>> when they are not reported in the system log. In either case the
>>>>>
>>>>> "WARN Cannot submit Set TR Deq Ptr"
>>>>> "A Set TR Deq Ptr command is pending."
>>>>> "WARN Set TR Deq Ptr cmd failed due to incorrect slot or ep state."
>>>>>
>>>>> messages do appear.
>>>>>
>>>>
>>>> Nice testcase and report, thanks.
>>>>
>>>> I started looking at issues in this area some time ago, and wrote a couple patches but
>>>> it was left hanging. The two patches (now rebased on 5.9-rc3) can be found in my tree in the
>>>> fix_invalid_context_at_stop_endpoint branch
>>>>
>>>> git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git fix_invalid_context_at_stop_endpoint
>>>>
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git/log/?h=fix_invalid_context_at_stop_endpoint
>>>>
>>>> If you could give those a try and see if they help I'd be grateful.
>>>
>>> No, it doesn't help, albeit the errors are slightly different:
>>>
>>> xhci_hcd 0000:00:14.0: WARN Cannot submit Set TR Deq Ptr
>>> xhci_hcd 0000:00:14.0: A Set TR Deq Ptr command is pending.
>>> dmar_fault: 44 callbacks suppressed
>>> DRHD: handling fault status reg 3> DMAR: [DMA Write] Request device [00:14.0] PASID ffffffff fault addr ffcda000 [fault reason 05] PTE Write access is not set
>>> DMAR: DRHD: handling fault status reg 3
>>
>> Ok, thanks, the DMA problems make sense to me now.
>>
>> If a transfer ring stops on a transfer request (TRB) that should be canceled (manual cancel,
>> or error) it's not enough to just turn the  TRB to a no-op.
>> HW has most likely cached the TRB, and we need to move the transfer ring dequeue pointer past this TRB.
>> Moving deq also clears controller cache.
>>
>> We do all this, but if we fail to queue the Set TR Deq command the TRB (with DMA  pointers) will stay on the ring,
>> and controller will access the TRB DMA  address once it continues running. At this point xhci driver has already
>> given back the canceled/erroneous TRB, and is probably unmapped already.
>> Hence the DMAR entries.  
>>
>> Looks like this part of the code needs a more extensive rewrite, on top of this we are not handling
>> races between endpoints halted due errors, and endpoints stopped by driver to cancel URBs. 
>>
>> -Mathias
> 
> I'm chasing a similar problem which is also most easily reproduced using
> Andrzej's bulk-cancel program, though we have seen it in the field many times
> as well with normal usage.
> 
> The symptom is that we get the following errors in dmesg:
> 
>  xhci_hcd 0000:00:14.0: Mismatch between completed Set TR Deq Ptr command & xHCI internal state.
>  xhci_hcd 0000:00:14.0: ep deq seg = 000000001141d6a0, deq ptr = 00000000ebd28dcf
>  xhci_hcd 0000:00:14.0: WARNING: Host System Error
>  DMAR: DRHD: handling fault status reg 2
>  DMAR: [DMA Read] Request device [00:14.0] PASID ffffffff fault addr 0 [fault reason 06] PTE Read access is not set
>  xhci_hcd 0000:00:14.0: xHCI host not responding to stop endpoint command.
>  xhci_hcd 0000:00:14.0: USBSTS: HCHalted HSE EINT
>  xhci_hcd 0000:00:14.0: xHCI host controller not responding, assume dead
>  xhci_hcd 0000:00:14.0: HC died; cleaning up
> 
> The xhci USB stack loses all attached devices, and on my system at least has
> only been recoverable by rebooting.
> 
> Full dmesg and trace after 'echo 1 > /sys/kernel/debug/tracing/events/xhci-hcd/enable'
> can be found here:
> 
> https://gist.github.com/rzwisler/531b926e3d160609ead371c23fc15b55
> https://gist.github.com/rzwisler/4621f805737993fec30b5ae23bfd8289
> 
> One interesting bit from the trace is that we observe the ep_ctx->deq pointer
> being 0 in xhci_handle_cmd_set_deq():
> 
>  xhci_inc_enq: CMD 000000000b6352e0: enq 0x00000000ffffe0c0(0x00000000ffffe000) deq 0x00000000ffffe090(0x00000000ffffe000) segs 1 stream 0 free_trbs 251 bounce 0 cycle 0
>  xhci_ring_host_doorbell: Ring doorbell for Command Ring 0
>  xhci_urb_giveback: ep7in-bulk: urb 000000003c80b7a8 pipe 3221455744 slot 2 length 0/256 sgs 0/0 stream 0 flags 00010200
>  xhci_inc_deq: CMD 000000000b6352e0: enq 0x00000000ffffe0c0(0x00000000ffffe000) deq 0x00000000ffffe0a0(0x00000000ffffe000) segs 1 stream 0 free_trbs 252 bounce 0 cycle 0
>  xhci_inc_deq: EVENT 00000000b5c6e6a2: enq 0x00000000ffffc000(0x00000000ffffc000) deq 0x00000000ffffc1b0(0x00000000ffffc000) segs 1 stream 0 free_trbs 254 bounce 0 cycle 1
>  xhci_handle_event: EVENT: TRB 00000000ffffe0a0 status 'Success' len 0 slot 4 ep 0 type 'Command Completion Event' flags e:C
>  xhci_handle_command: CMD: Set TR Dequeue Pointer Command: deq 00000000fff296a1 stream 0 slot 4 ep 3 flags c
>  xhci_handle_cmd_set_deq: RS 00000 full-speed Ctx Entries 15 MEL 0 us Port# 13/0 [TT Slot 0 Port# 0 TTT 0 Intr 0] Addr 4 State configured
>  xhci_handle_cmd_set_deq_ep: State stopped mult 1 max P. Streams 0 interval 125 us max ESIT payload 0 CErr 3 Type Bulk IN burst 0 maxp 64 deq 0000000000000000 avg trb len 0
> 																	  ^^^^^^^^^^^^^^^^^^^^
>  xhci_dbg_cancel_urb: Successful Set TR Deq Ptr cmd, deq = @00000000
> 						     ^^^^^^^^^^^^^^^
> 
> In successful completions they are real values:
> 
>  xhci_ring_ep_doorbell: Ring doorbell for Slot 4 ep1in
>  xhci_inc_deq: CMD 000000000b6352e0: enq 0x00000000ffffe0c0(0x00000000ffffe000) deq 0x00000000ffffe0b0(0x00000000ffffe000) segs 1 stream 0 free_trbs 253 bounce 0 cycle 0
>  xhci_inc_deq: EVENT 00000000b5c6e6a2: enq 0x00000000ffffc000(0x00000000ffffc000) deq 0x00000000ffffc1c0(0x00000000ffffc000) segs 1 stream 0 free_trbs 254 bounce 0 cycle 1
>  xhci_handle_event: EVENT: TRB 00000000ffffe0b0 status 'Success' len 0 slot 2 ep 0 type 'Command Completion Event' flags e:C
>  xhci_handle_command: CMD: Set TR Dequeue Pointer Command: deq 00000000fff86551 stream 0 slot 2 ep 15 flags c
>  xhci_handle_cmd_set_deq: RS 00000 full-speed Ctx Entries 15 MEL 0 us Port# 11/0 [TT Slot 0 Port# 0 TTT 0 Intr 0] Addr 2 State configured
>  xhci_handle_cmd_set_deq_ep: State stopped mult 1 max P. Streams 0 interval 125 us max ESIT payload 0 CErr 3 Type Bulk IN burst 0 maxp 16 deq 00000000fff86551 avg trb len 0
> 																	  ^^^^^^^^^^^^^^^^^^^^
>  xhci_dbg_cancel_urb: Successful Set TR Deq Ptr cmd, deq = @fff86550
> 						     ^^^^^^^^^^^^^^^
> 
> I've noticed that I have to have CONFIG_INTEL_IOMMU_DEFAULT_ON=y for this
> clean repro case, else the system still fails but I don't always (ever?) see
> the failure to read at address 0.  
> 
> Mathias, do you think that your above explanation also covers my failure case?  
> Are we just failing to enqueue a Set TR Deq command, and the HC is processing
> a stale TRB?  Or does the fact that ep_ctx->deq == 0 not fit with that
> explanation?

The ep_ctx->deq == 0 is very odd. 
The Set TR Deq command looks correct when it was queued:
77.313308: xhci_queue_trb: CMD: Set TR Dequeue Pointer Command: deq 00000000fff296a1 stream 0 slot 4 ep 3 flags c

and the command TRB was intact still when the command completed:
77.313318: xhci_handle_event: EVENT: TRB 00000000ffffe0a0 status 'Success' len 0 slot 4 ep 0 type 'Command Completion Event' flags e:C
77.313318: xhci_handle_command: CMD: Set TR Dequeue Pointer Command: deq 00000000fff296a1 stream 0 slot 4 ep 3 flags c
77.313318: xhci_handle_cmd_set_deq: RS 00000 full-speed Ctx Entries 15 MEL 0 us Port# 13/0 [TT Slot 0 Port# 0 TTT 0 Intr 0] Addr 4 State configured

But when we read the deq ptr from the endpoint (output) context it's suddenly 0
77.313318: xhci_handle_cmd_set_deq_ep: State stopped mult 1 max P. Streams 0 interval 125 us max ESIT payload 0 CErr 3 Type Bulk IN burst 0 maxp 64 deq 0000000000000000 avg trb len 0

On a successful Set TR Deq command the xHC controller should "Copy the value of
the New TR Dequeue Pointer field in the Set TR Dequeue Pointer TRB to the 
TR Dequeue Pointer field of the target Endpoint or Stream Context." xhci 4.6.10

The xHC controller starts reading TRBs from the deq ptr address (0 in this case)
which explains the DMAR entries, and possibse the Host System Error (HSE)

> 
> Other tidbits that I've disovered along the way:
> 
> 1) We first started noticing this in the field with v4.19 based kernels, but
>    I've been able to reproduce it using bulk-cancel with a v4.18 kernel
>    without much trouble, so I'm pretty sure it's an old issue.
> 
> 2) This definitely looks like a race which is very sensitive to timing.  If I
>    put a single trace_printk() line in xhci_handle_cmd_set_deq() function, the
>    issue is 10x harder to repro (on average 3 seconds to on average 30
>    seconds).  Similarly, when we turned on tracing in the field to try and get
>    logs the issue reproduces much less frequently.  I also tried to bisect,
>    and it turns out that whether or not it reproduced on my system with older
>    kernels was dependent on how tracing functions were inlined (!!).
> 
>    I think that we started seeing this in the field with v4.19 based kernels
>    purely because of a timing change.
> 
> 3) I did run with your patches from
>    https://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git/log/?h=fix_invalid_context_at_stop_endpoint
>    and it didn't change anything for me.
> 
> I'm reading up more on xhci and trying to understand this race, but would
> appreciate any help or direction that you're able to provide.

The race I was referring to is if a driver issues a "Stop endpoint" command,
and it races with an endpoint error/halt initiated by the xHC controller. 

The additional note in xhci 4.6.9 - Stop Endpoint, explains it:
"Note: A Busy endpoint may asynchronously transition from the Running to the Halted
or Error state due to error conditions detected while processing TRBs. A possible
race condition may occur if software, thinking an endpoint is in the Running state,
issues a Stop Endpoint Command however at the same time the xHC
asynchronously transitions the endpoint to the Halted or Error state. In this case,
a Context State Error may be generated for the command completion. Software
may verify that this case occurred by inspecting the EP State for Halted or Error
when a Stop Endpoint Command results in a Context State Error."

There are several context state errors in your trace.

Thanks
-Mathias


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: xhci problem -> general protection fault
  2020-09-28 13:32         ` Andrzej Pietrasiewicz
@ 2020-09-29  7:13           ` Mathias Nyman
  2020-10-01 14:13             ` Andrzej Pietrasiewicz
  0 siblings, 1 reply; 26+ messages in thread
From: Mathias Nyman @ 2020-09-29  7:13 UTC (permalink / raw)
  To: Andrzej Pietrasiewicz, Ross Zwisler; +Cc: linux-usb, kernel

Hi

On 28.9.2020 16.32, Andrzej Pietrasiewicz wrote:
> Hi Ross, hi Mathias,
> 
> W dniu 25.09.2020 o 23:05, Ross Zwisler pisze:
>> On Fri, Sep 25, 2020 at 04:40:29PM +0300, Mathias Nyman wrote:
>>> On 18.9.2020 17.20, Andrzej Pietrasiewicz wrote:
>>>> Hi Mathias,
>>>>
>>>> W dniu 18.09.2020 o 12:50, Mathias Nyman pisze:
>>>>> On 17.9.2020 18.30, Andrzej Pietrasiewicz wrote:
>>>>>> Dear All,
>>>>>>
>>>>>> I have encountered a problem in xhci which leads to general protection fault.
>>>>>>
>>>>>> The problem is triggered by running this program:
>>>>>>
>>>>>> https://gitlab.collabora.com/andrzej.p/bulk-cancel.git
>>>>>>
>>>>>> $ sudo ./bulk-cancel -D /dev/bus/usb/002/006 -i 1 -b 1
>>>>>>
>>>>>> where /dev/bus/usb/002/006 is a Gadget Zero:
>>>>>>
>>>>>> It takes less than a minute until the crash happens.
>>>>>> The DMAR (iommu) errors don't happen always, i.e. there are crashes
>>>>>> when they are not reported in the system log. In either case the
>>>>>>
>>>>>> "WARN Cannot submit Set TR Deq Ptr"
>>>>>> "A Set TR Deq Ptr command is pending."
>>>>>> "WARN Set TR Deq Ptr cmd failed due to incorrect slot or ep state."
>>>>>>
>>>>>> messages do appear.
>>>>>>
>>>>>
>>>>> Nice testcase and report, thanks.
>>>>>
>>>>> I started looking at issues in this area some time ago, and wrote a couple patches but
>>>>> it was left hanging. The two patches (now rebased on 5.9-rc3) can be found in my tree in the
>>>>> fix_invalid_context_at_stop_endpoint branch
>>>>>
>>>>> git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git fix_invalid_context_at_stop_endpoint
>>>>>
>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git/log/?h=fix_invalid_context_at_stop_endpoint
>>>>>
>>>>> If you could give those a try and see if they help I'd be grateful.
>>>>
>>>> No, it doesn't help, albeit the errors are slightly different:
>>>>
>>>> xhci_hcd 0000:00:14.0: WARN Cannot submit Set TR Deq Ptr
>>>> xhci_hcd 0000:00:14.0: A Set TR Deq Ptr command is pending.
>>>> dmar_fault: 44 callbacks suppressed
>>>> DRHD: handling fault status reg 3> DMAR: [DMA Write] Request device [00:14.0] PASID ffffffff fault addr ffcda000 [fault reason 05] PTE Write access is not set
>>>> DMAR: DRHD: handling fault status reg 3
>>>
>>> Ok, thanks, the DMA problems make sense to me now.
>>>
>>> If a transfer ring stops on a transfer request (TRB) that should be canceled (manual cancel,
>>> or error) it's not enough to just turn the  TRB to a no-op.
>>> HW has most likely cached the TRB, and we need to move the transfer ring dequeue pointer past this TRB.
>>> Moving deq also clears controller cache.
>>>
>>> We do all this, but if we fail to queue the Set TR Deq command the TRB (with DMA  pointers) will stay on the ring,
>>> and controller will access the TRB DMA  address once it continues running. At this point xhci driver has already
>>> given back the canceled/erroneous TRB, and is probably unmapped already.
>>> Hence the DMAR entries.
>>>
>>> Looks like this part of the code needs a more extensive rewrite, on top of this we are not handling
>>> races between endpoints halted due errors, and endpoints stopped by driver to cancel URBs.
>>>
>>> -Mathias
>>
>> I'm chasing a similar problem which is also most easily reproduced using
>> Andrzej's bulk-cancel program, though we have seen it in the field many times
>> as well with normal usage.
>>
>> The symptom is that we get the following errors in dmesg:
>>
>>   xhci_hcd 0000:00:14.0: Mismatch between completed Set TR Deq Ptr command & xHCI internal state.
>>   xhci_hcd 0000:00:14.0: ep deq seg = 000000001141d6a0, deq ptr = 00000000ebd28dcf
>>   xhci_hcd 0000:00:14.0: WARNING: Host System Error
>>   DMAR: DRHD: handling fault status reg 2
>>   DMAR: [DMA Read] Request device [00:14.0] PASID ffffffff fault addr 0 [fault reason 06] PTE Read access is not set
>>   xhci_hcd 0000:00:14.0: xHCI host not responding to stop endpoint command.
>>   xhci_hcd 0000:00:14.0: USBSTS: HCHalted HSE EINT
>>   xhci_hcd 0000:00:14.0: xHCI host controller not responding, assume dead
>>   xhci_hcd 0000:00:14.0: HC died; cleaning up
>>
>> The xhci USB stack loses all attached devices, and on my system at least has
>> only been recoverable by rebooting.
>>
>> Full dmesg and trace after 'echo 1 > /sys/kernel/debug/tracing/events/xhci-hcd/enable'
>> can be found here:
>>
>> https://gist.github.com/rzwisler/531b926e3d160609ead371c23fc15b55
>> https://gist.github.com/rzwisler/4621f805737993fec30b5ae23bfd8289
>>
>> One interesting bit from the trace is that we observe the ep_ctx->deq pointer
>> being 0 in xhci_handle_cmd_set_deq():
>>
>>   xhci_inc_enq: CMD 000000000b6352e0: enq 0x00000000ffffe0c0(0x00000000ffffe000) deq 0x00000000ffffe090(0x00000000ffffe000) segs 1 stream 0 free_trbs 251 bounce 0 cycle 0
>>   xhci_ring_host_doorbell: Ring doorbell for Command Ring 0
>>   xhci_urb_giveback: ep7in-bulk: urb 000000003c80b7a8 pipe 3221455744 slot 2 length 0/256 sgs 0/0 stream 0 flags 00010200
>>   xhci_inc_deq: CMD 000000000b6352e0: enq 0x00000000ffffe0c0(0x00000000ffffe000) deq 0x00000000ffffe0a0(0x00000000ffffe000) segs 1 stream 0 free_trbs 252 bounce 0 cycle 0
>>   xhci_inc_deq: EVENT 00000000b5c6e6a2: enq 0x00000000ffffc000(0x00000000ffffc000) deq 0x00000000ffffc1b0(0x00000000ffffc000) segs 1 stream 0 free_trbs 254 bounce 0 cycle 1
>>   xhci_handle_event: EVENT: TRB 00000000ffffe0a0 status 'Success' len 0 slot 4 ep 0 type 'Command Completion Event' flags e:C
>>   xhci_handle_command: CMD: Set TR Dequeue Pointer Command: deq 00000000fff296a1 stream 0 slot 4 ep 3 flags c
>>   xhci_handle_cmd_set_deq: RS 00000 full-speed Ctx Entries 15 MEL 0 us Port# 13/0 [TT Slot 0 Port# 0 TTT 0 Intr 0] Addr 4 State configured
>>   xhci_handle_cmd_set_deq_ep: State stopped mult 1 max P. Streams 0 interval 125 us max ESIT payload 0 CErr 3 Type Bulk IN burst 0 maxp 64 deq 0000000000000000 avg trb len 0
>>                                                                       ^^^^^^^^^^^^^^^^^^^^
>>   xhci_dbg_cancel_urb: Successful Set TR Deq Ptr cmd, deq = @00000000
>>                              ^^^^^^^^^^^^^^^
>>
>> In successful completions they are real values:
>>
>>   xhci_ring_ep_doorbell: Ring doorbell for Slot 4 ep1in
>>   xhci_inc_deq: CMD 000000000b6352e0: enq 0x00000000ffffe0c0(0x00000000ffffe000) deq 0x00000000ffffe0b0(0x00000000ffffe000) segs 1 stream 0 free_trbs 253 bounce 0 cycle 0
>>   xhci_inc_deq: EVENT 00000000b5c6e6a2: enq 0x00000000ffffc000(0x00000000ffffc000) deq 0x00000000ffffc1c0(0x00000000ffffc000) segs 1 stream 0 free_trbs 254 bounce 0 cycle 1
>>   xhci_handle_event: EVENT: TRB 00000000ffffe0b0 status 'Success' len 0 slot 2 ep 0 type 'Command Completion Event' flags e:C
>>   xhci_handle_command: CMD: Set TR Dequeue Pointer Command: deq 00000000fff86551 stream 0 slot 2 ep 15 flags c
>>   xhci_handle_cmd_set_deq: RS 00000 full-speed Ctx Entries 15 MEL 0 us Port# 11/0 [TT Slot 0 Port# 0 TTT 0 Intr 0] Addr 2 State configured
>>   xhci_handle_cmd_set_deq_ep: State stopped mult 1 max P. Streams 0 interval 125 us max ESIT payload 0 CErr 3 Type Bulk IN burst 0 maxp 16 deq 00000000fff86551 avg trb len 0
>>                                                                       ^^^^^^^^^^^^^^^^^^^^
>>   xhci_dbg_cancel_urb: Successful Set TR Deq Ptr cmd, deq = @fff86550
>>                              ^^^^^^^^^^^^^^^
>>
>> I've noticed that I have to have CONFIG_INTEL_IOMMU_DEFAULT_ON=y for this
>> clean repro case, else the system still fails but I don't always (ever?) see
>> the failure to read at address 0.
>>
>> Mathias, do you think that your above explanation also covers my failure case?
>> Are we just failing to enqueue a Set TR Deq command, and the HC is processing
>> a stale TRB?  Or does the fact that ep_ctx->deq == 0 not fit with that
>> explanation?
>>
> 
> From the logs from Ross it seems that the direct cause of the HC dying
> is Stop Endpoint not completing and, consequently, timing out.
Stop endpoint command does not complete due to the 'catastrophic' HSE 
(host system error) which stops the controller completely. 

I'd guess the HSE is related to the zeroed dequeue pointer.

> 
> In the xHCI spec, section "4.6.9 Stop Endpoint" we can read:
> 
> "Note that when an endpoint is stopped, the xHC maintains the state
> necessary to restart the last active Transfer Ring where it left off,
> however software may not want to do this. The options are discussed below:
> 
> 1. Temporarily Stop Transfer Ring Activity - [...]
> 
> 2. Aborting a Transfer - If, because of a timeout or other reason, software
> issued the Stop Endpoint Command to abort the current TD. Then the
> Set TR Dequeue Pointer Command may be used to force the xHC to
> dump any internal state that it has for the ring and restart activity at the
> new Transfer Ring location specified by the Set TR Dequeue Pointer Command."
> 
> If bulk-cancel reproducer is used then the software's intention is definitely
> to abort a transfer, so indeed a Set TR Dequeue Pointer is needed.
> 
> On the other hand, in that same spec, in "4.6.10 Set TR Dequeue Pointer"
> we read:
> 
> "This command may be executed only if the target endpoint is in the Error
> or Stopped state."
> 
> My question is: why in the crash scenario Ross describes "Set TR Dequeue
> Pointer" is being carried out while Stop Endpoint has not been completed
> (and timed out instead)?

Every time a URB is canceled the xhci driver will queue a "stop endpoint command"
if there isn't one already pending. 
The Set TR Dequeue command you see is probably for the previous URB, while the stop is
for the next URB. 
 Andrzej, any change of getting traces from your case? They would tell more than just
the dynamic debug.

Thanks
-Mathias

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: xhci problem -> general protection fault
  2020-09-29  7:13           ` Mathias Nyman
@ 2020-10-01 14:13             ` Andrzej Pietrasiewicz
  0 siblings, 0 replies; 26+ messages in thread
From: Andrzej Pietrasiewicz @ 2020-10-01 14:13 UTC (permalink / raw)
  To: Mathias Nyman, Ross Zwisler; +Cc: linux-usb, kernel

Hi Mathias,

W dniu 29.09.2020 o 09:13, Mathias Nyman pisze:
> Hi
> 
> On 28.9.2020 16.32, Andrzej Pietrasiewicz wrote:
>> Hi Ross, hi Mathias,
>>
>> W dniu 25.09.2020 o 23:05, Ross Zwisler pisze:
>>> On Fri, Sep 25, 2020 at 04:40:29PM +0300, Mathias Nyman wrote:
>>>> On 18.9.2020 17.20, Andrzej Pietrasiewicz wrote:
>>>>> Hi Mathias,
>>>>>
>>>>> W dniu 18.09.2020 o 12:50, Mathias Nyman pisze:
>>>>>> On 17.9.2020 18.30, Andrzej Pietrasiewicz wrote:
>>>>>>> Dear All,
>>>>>>>
>>>>>>> I have encountered a problem in xhci which leads to general protection fault.
>>>>>>>
>>>>>>> The problem is triggered by running this program:
>>>>>>>
>>>>>>> https://gitlab.collabora.com/andrzej.p/bulk-cancel.git
>>>>>>>
>>>>>>> $ sudo ./bulk-cancel -D /dev/bus/usb/002/006 -i 1 -b 1
>>>>>>>
>>>>>>> where /dev/bus/usb/002/006 is a Gadget Zero:
>>>>>>>
>>>>>>> It takes less than a minute until the crash happens.
>>>>>>> The DMAR (iommu) errors don't happen always, i.e. there are crashes
>>>>>>> when they are not reported in the system log. In either case the
>>>>>>>
>>>>>>> "WARN Cannot submit Set TR Deq Ptr"
>>>>>>> "A Set TR Deq Ptr command is pending."
>>>>>>> "WARN Set TR Deq Ptr cmd failed due to incorrect slot or ep state."
>>>>>>>
>>>>>>> messages do appear.
>>>>>>>
>>>>>>
>>>>>> Nice testcase and report, thanks.
>>>>>>
>>>>>> I started looking at issues in this area some time ago, and wrote a couple patches but
>>>>>> it was left hanging. The two patches (now rebased on 5.9-rc3) can be found in my tree in the
>>>>>> fix_invalid_context_at_stop_endpoint branch
>>>>>>
>>>>>> git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git fix_invalid_context_at_stop_endpoint
>>>>>>
>>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git/log/?h=fix_invalid_context_at_stop_endpoint
>>>>>>
>>>>>> If you could give those a try and see if they help I'd be grateful.
>>>>>
>>>>> No, it doesn't help, albeit the errors are slightly different:
>>>>>
>>>>> xhci_hcd 0000:00:14.0: WARN Cannot submit Set TR Deq Ptr
>>>>> xhci_hcd 0000:00:14.0: A Set TR Deq Ptr command is pending.
>>>>> dmar_fault: 44 callbacks suppressed
>>>>> DRHD: handling fault status reg 3> DMAR: [DMA Write] Request device [00:14.0] PASID ffffffff fault addr ffcda000 [fault reason 05] PTE Write access is not set
>>>>> DMAR: DRHD: handling fault status reg 3
>>>>
>>>> Ok, thanks, the DMA problems make sense to me now.
>>>>
>>>> If a transfer ring stops on a transfer request (TRB) that should be canceled (manual cancel,
>>>> or error) it's not enough to just turn the  TRB to a no-op.
>>>> HW has most likely cached the TRB, and we need to move the transfer ring dequeue pointer past this TRB.
>>>> Moving deq also clears controller cache.
>>>>
>>>> We do all this, but if we fail to queue the Set TR Deq command the TRB (with DMA  pointers) will stay on the ring,
>>>> and controller will access the TRB DMA  address once it continues running. At this point xhci driver has already
>>>> given back the canceled/erroneous TRB, and is probably unmapped already.
>>>> Hence the DMAR entries.
>>>>
>>>> Looks like this part of the code needs a more extensive rewrite, on top of this we are not handling
>>>> races between endpoints halted due errors, and endpoints stopped by driver to cancel URBs.
>>>>
>>>> -Mathias
>>>
>>> I'm chasing a similar problem which is also most easily reproduced using
>>> Andrzej's bulk-cancel program, though we have seen it in the field many times
>>> as well with normal usage.
>>>
>>> The symptom is that we get the following errors in dmesg:
>>>
>>>    xhci_hcd 0000:00:14.0: Mismatch between completed Set TR Deq Ptr command & xHCI internal state.
>>>    xhci_hcd 0000:00:14.0: ep deq seg = 000000001141d6a0, deq ptr = 00000000ebd28dcf
>>>    xhci_hcd 0000:00:14.0: WARNING: Host System Error
>>>    DMAR: DRHD: handling fault status reg 2
>>>    DMAR: [DMA Read] Request device [00:14.0] PASID ffffffff fault addr 0 [fault reason 06] PTE Read access is not set
>>>    xhci_hcd 0000:00:14.0: xHCI host not responding to stop endpoint command.
>>>    xhci_hcd 0000:00:14.0: USBSTS: HCHalted HSE EINT
>>>    xhci_hcd 0000:00:14.0: xHCI host controller not responding, assume dead
>>>    xhci_hcd 0000:00:14.0: HC died; cleaning up
>>>
>>> The xhci USB stack loses all attached devices, and on my system at least has
>>> only been recoverable by rebooting.
>>>
>>> Full dmesg and trace after 'echo 1 > /sys/kernel/debug/tracing/events/xhci-hcd/enable'
>>> can be found here:
>>>
>>> https://gist.github.com/rzwisler/531b926e3d160609ead371c23fc15b55
>>> https://gist.github.com/rzwisler/4621f805737993fec30b5ae23bfd8289
>>>
>>> One interesting bit from the trace is that we observe the ep_ctx->deq pointer
>>> being 0 in xhci_handle_cmd_set_deq():
>>>
>>>    xhci_inc_enq: CMD 000000000b6352e0: enq 0x00000000ffffe0c0(0x00000000ffffe000) deq 0x00000000ffffe090(0x00000000ffffe000) segs 1 stream 0 free_trbs 251 bounce 0 cycle 0
>>>    xhci_ring_host_doorbell: Ring doorbell for Command Ring 0
>>>    xhci_urb_giveback: ep7in-bulk: urb 000000003c80b7a8 pipe 3221455744 slot 2 length 0/256 sgs 0/0 stream 0 flags 00010200
>>>    xhci_inc_deq: CMD 000000000b6352e0: enq 0x00000000ffffe0c0(0x00000000ffffe000) deq 0x00000000ffffe0a0(0x00000000ffffe000) segs 1 stream 0 free_trbs 252 bounce 0 cycle 0
>>>    xhci_inc_deq: EVENT 00000000b5c6e6a2: enq 0x00000000ffffc000(0x00000000ffffc000) deq 0x00000000ffffc1b0(0x00000000ffffc000) segs 1 stream 0 free_trbs 254 bounce 0 cycle 1
>>>    xhci_handle_event: EVENT: TRB 00000000ffffe0a0 status 'Success' len 0 slot 4 ep 0 type 'Command Completion Event' flags e:C
>>>    xhci_handle_command: CMD: Set TR Dequeue Pointer Command: deq 00000000fff296a1 stream 0 slot 4 ep 3 flags c
>>>    xhci_handle_cmd_set_deq: RS 00000 full-speed Ctx Entries 15 MEL 0 us Port# 13/0 [TT Slot 0 Port# 0 TTT 0 Intr 0] Addr 4 State configured
>>>    xhci_handle_cmd_set_deq_ep: State stopped mult 1 max P. Streams 0 interval 125 us max ESIT payload 0 CErr 3 Type Bulk IN burst 0 maxp 64 deq 0000000000000000 avg trb len 0
>>>                                                                        ^^^^^^^^^^^^^^^^^^^^
>>>    xhci_dbg_cancel_urb: Successful Set TR Deq Ptr cmd, deq = @00000000
>>>                               ^^^^^^^^^^^^^^^
>>>
>>> In successful completions they are real values:
>>>
>>>    xhci_ring_ep_doorbell: Ring doorbell for Slot 4 ep1in
>>>    xhci_inc_deq: CMD 000000000b6352e0: enq 0x00000000ffffe0c0(0x00000000ffffe000) deq 0x00000000ffffe0b0(0x00000000ffffe000) segs 1 stream 0 free_trbs 253 bounce 0 cycle 0
>>>    xhci_inc_deq: EVENT 00000000b5c6e6a2: enq 0x00000000ffffc000(0x00000000ffffc000) deq 0x00000000ffffc1c0(0x00000000ffffc000) segs 1 stream 0 free_trbs 254 bounce 0 cycle 1
>>>    xhci_handle_event: EVENT: TRB 00000000ffffe0b0 status 'Success' len 0 slot 2 ep 0 type 'Command Completion Event' flags e:C
>>>    xhci_handle_command: CMD: Set TR Dequeue Pointer Command: deq 00000000fff86551 stream 0 slot 2 ep 15 flags c
>>>    xhci_handle_cmd_set_deq: RS 00000 full-speed Ctx Entries 15 MEL 0 us Port# 11/0 [TT Slot 0 Port# 0 TTT 0 Intr 0] Addr 2 State configured
>>>    xhci_handle_cmd_set_deq_ep: State stopped mult 1 max P. Streams 0 interval 125 us max ESIT payload 0 CErr 3 Type Bulk IN burst 0 maxp 16 deq 00000000fff86551 avg trb len 0
>>>                                                                        ^^^^^^^^^^^^^^^^^^^^
>>>    xhci_dbg_cancel_urb: Successful Set TR Deq Ptr cmd, deq = @fff86550
>>>                               ^^^^^^^^^^^^^^^
>>>
>>> I've noticed that I have to have CONFIG_INTEL_IOMMU_DEFAULT_ON=y for this
>>> clean repro case, else the system still fails but I don't always (ever?) see
>>> the failure to read at address 0.
>>>
>>> Mathias, do you think that your above explanation also covers my failure case?
>>> Are we just failing to enqueue a Set TR Deq command, and the HC is processing
>>> a stale TRB?  Or does the fact that ep_ctx->deq == 0 not fit with that
>>> explanation?
>>>
>>
>>  From the logs from Ross it seems that the direct cause of the HC dying
>> is Stop Endpoint not completing and, consequently, timing out.
> Stop endpoint command does not complete due to the 'catastrophic' HSE
> (host system error) which stops the controller completely.
> 
> I'd guess the HSE is related to the zeroed dequeue pointer.
> 
>>
>> In the xHCI spec, section "4.6.9 Stop Endpoint" we can read:
>>
>> "Note that when an endpoint is stopped, the xHC maintains the state
>> necessary to restart the last active Transfer Ring where it left off,
>> however software may not want to do this. The options are discussed below:
>>
>> 1. Temporarily Stop Transfer Ring Activity - [...]
>>
>> 2. Aborting a Transfer - If, because of a timeout or other reason, software
>> issued the Stop Endpoint Command to abort the current TD. Then the
>> Set TR Dequeue Pointer Command may be used to force the xHC to
>> dump any internal state that it has for the ring and restart activity at the
>> new Transfer Ring location specified by the Set TR Dequeue Pointer Command."
>>
>> If bulk-cancel reproducer is used then the software's intention is definitely
>> to abort a transfer, so indeed a Set TR Dequeue Pointer is needed.
>>
>> On the other hand, in that same spec, in "4.6.10 Set TR Dequeue Pointer"
>> we read:
>>
>> "This command may be executed only if the target endpoint is in the Error
>> or Stopped state."
>>
>> My question is: why in the crash scenario Ross describes "Set TR Dequeue
>> Pointer" is being carried out while Stop Endpoint has not been completed
>> (and timed out instead)?
> 
> Every time a URB is canceled the xhci driver will queue a "stop endpoint command"
> if there isn't one already pending.
> The Set TR Dequeue command you see is probably for the previous URB, while the stop is
> for the next URB.
>   Andrzej, any change of getting traces from your case? They would tell more than just
> the dynamic debug.
> 

The problem is I am not getting any trace at all even though it is turned on,
the system crashes before tail -f /sys/kernel/debug/tracing/trace outputs
anything.

Andrzej


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: xhci problem -> general protection fault
  2020-09-28 22:35         ` Mathias Nyman
@ 2020-10-01 16:43           ` zwisler
  2020-10-12 19:20             ` Mathias Nyman
  0 siblings, 1 reply; 26+ messages in thread
From: zwisler @ 2020-10-01 16:43 UTC (permalink / raw)
  To: Mathias Nyman; +Cc: Andrzej Pietrasiewicz, linux-usb, kernel

On Tue, Sep 29, 2020 at 01:35:31AM +0300, Mathias Nyman wrote:
<>
> The race I was referring to is if a driver issues a "Stop endpoint" command,
> and it races with an endpoint error/halt initiated by the xHC controller. 
> 
> The additional note in xhci 4.6.9 - Stop Endpoint, explains it:
> "Note: A Busy endpoint may asynchronously transition from the Running to the Halted
> or Error state due to error conditions detected while processing TRBs. A possible
> race condition may occur if software, thinking an endpoint is in the Running state,
> issues a Stop Endpoint Command however at the same time the xHC
> asynchronously transitions the endpoint to the Halted or Error state. In this case,
> a Context State Error may be generated for the command completion. Software
> may verify that this case occurred by inspecting the EP State for Halted or Error
> when a Stop Endpoint Command results in a Context State Error."
> 
> There are several context state errors in your trace.
> 
> Thanks
> -Mathias

Interestingly it looks like it's the actions that we take at the end of
xhci_handle_cmd_set_deq() for the broken command which break the HC.
Specifically, this line:

	dev->eps[ep_index].ep_state &= ~SET_DEQ_PENDING;

If I skip this line when I notice that ep_ctx->deq==0, the system will keep
running happily.

Here is a trace and dmesg for a run with the patch at the bottom of this mail.
I trimmed the trace a bit since it was very large, but I think I've left the
important bits intact:

https://gist.github.com/rzwisler/422e55321d9d2db5fc258d6d5b93d018

I've been able to run with this patch and survive through many "Mismatch"
occurrences, both with ep_ctx->deq set to 0 and set to some other value which
just seems to be wrong.

It seems like there are a few other places where we notice that we're in a bad
state, and we just bail, specifically these in xhci_queue_new_dequeue_state():

	addr = xhci_trb_virt_to_dma(deq_state->new_deq_seg,
				    deq_state->new_deq_ptr);
	if (addr == 0) {
		xhci_warn(xhci, "WARN Cannot submit Set TR Deq Ptr\n");
		xhci_warn(xhci, "WARN deq seg = %px, deq pt = %px\n",
			  deq_state->new_deq_seg, deq_state->new_deq_ptr);
		return;
	}
	ep = &xhci->devs[slot_id]->eps[ep_index];
	if ((ep->ep_state & SET_DEQ_PENDING)) {
		xhci_warn(xhci, "WARN Cannot submit Set TR Deq Ptr\n");
		xhci_warn(xhci, "A Set TR Deq Ptr command is pending.\n");
		return;
	}

Is noticing that the HC has given us bad data via the "Mismatch" check in
xhci_handle_cmd_set_deq() and bailing out enough, or should we figure out
exactly why the HC is getting into a bad state?

I'm happy to gather logs with more debug or run other experiments, if that
would be helpful.  As it is I don't really know how to debug the internal
state of the HC further, but hopefully the knowledge that the patch below
makes a difference will help us move forward.

--- >8 ---
From 1642f6f375d4822919fb03a56ce8c9186f58d6f7 Mon Sep 17 00:00:00 2001
From: Ross Zwisler <zwisler@google.com>
Date: Wed, 30 Sep 2020 17:45:24 -0600
Subject: [PATCH] DEBUG: avoid xhci crash

---
 drivers/usb/host/xhci-ring.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index 2c255d0620b05..379570facbc9e 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -1142,6 +1142,8 @@ static void xhci_handle_cmd_set_deq(struct xhci_hcd *xhci, int slot_id,
 			xhci_warn(xhci, "Mismatch between completed Set TR Deq Ptr command & xHCI internal state.\n");
 			xhci_warn(xhci, "ep deq seg = %p, deq ptr = %p\n",
 				  ep->queued_deq_seg, ep->queued_deq_ptr);
+			xhci_warn(xhci, "calculated deq:%#llx  real deq:%#llx\n", xhci_trb_virt_to_dma(ep->queued_deq_seg, ep->queued_deq_ptr), deq);
+			return;
 		}
 	}
 
-- 
2.28.0.709.gb0816b6eb0-goog


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: xhci problem -> general protection fault
  2020-10-01 16:43           ` zwisler
@ 2020-10-12 19:20             ` Mathias Nyman
  2020-10-12 21:53               ` zwisler
  0 siblings, 1 reply; 26+ messages in thread
From: Mathias Nyman @ 2020-10-12 19:20 UTC (permalink / raw)
  To: zwisler; +Cc: Andrzej Pietrasiewicz, linux-usb, kernel

On 1.10.2020 19.43, zwisler@google.com wrote:
> On Tue, Sep 29, 2020 at 01:35:31AM +0300, Mathias Nyman wrote:
> <>
>> The race I was referring to is if a driver issues a "Stop endpoint" command,
>> and it races with an endpoint error/halt initiated by the xHC controller. 
>>
>> The additional note in xhci 4.6.9 - Stop Endpoint, explains it:
>> "Note: A Busy endpoint may asynchronously transition from the Running to the Halted
>> or Error state due to error conditions detected while processing TRBs. A possible
>> race condition may occur if software, thinking an endpoint is in the Running state,
>> issues a Stop Endpoint Command however at the same time the xHC
>> asynchronously transitions the endpoint to the Halted or Error state. In this case,
>> a Context State Error may be generated for the command completion. Software
>> may verify that this case occurred by inspecting the EP State for Halted or Error
>> when a Stop Endpoint Command results in a Context State Error."
>>
>> There are several context state errors in your trace.
>>
>> Thanks
>> -Mathias
> 
> Interestingly it looks like it's the actions that we take at the end of
> xhci_handle_cmd_set_deq() for the broken command which break the HC.
> Specifically, this line:
> 
> 	dev->eps[ep_index].ep_state &= ~SET_DEQ_PENDING;
> 
> If I skip this line when I notice that ep_ctx->deq==0, the system will keep
> running happily.

Skipping this will prevent this endpoint from running, and thus
preventing the issues seen if we continue.

> 
> Here is a trace and dmesg for a run with the patch at the bottom of this mail.
> I trimmed the trace a bit since it was very large, but I think I've left the
> important bits intact:
> 
> https://gist.github.com/rzwisler/422e55321d9d2db5fc258d6d5b93d018
> 
> I've been able to run with this patch and survive through many "Mismatch"
> occurrences, both with ep_ctx->deq set to 0 and set to some other value which
> just seems to be wrong.
> 
> It seems like there are a few other places where we notice that we're in a bad
> state, and we just bail, specifically these in xhci_queue_new_dequeue_state():
> 
> 	addr = xhci_trb_virt_to_dma(deq_state->new_deq_seg,
> 				    deq_state->new_deq_ptr);
> 	if (addr == 0) {
> 		xhci_warn(xhci, "WARN Cannot submit Set TR Deq Ptr\n");
> 		xhci_warn(xhci, "WARN deq seg = %px, deq pt = %px\n",
> 			  deq_state->new_deq_seg, deq_state->new_deq_ptr);
> 		return;
> 	}
> 	ep = &xhci->devs[slot_id]->eps[ep_index];
> 	if ((ep->ep_state & SET_DEQ_PENDING)) {
> 		xhci_warn(xhci, "WARN Cannot submit Set TR Deq Ptr\n");
> 		xhci_warn(xhci, "A Set TR Deq Ptr command is pending.\n");
> 		return;
> 	}
> 
> Is noticing that the HC has given us bad data via the "Mismatch" check in
> xhci_handle_cmd_set_deq() and bailing out enough, or should we figure out
> exactly why the HC is getting into a bad state?

I'm rewriting how xhci driver handles halted and canceled transfers.
While looking into it I found an older case where hardware gives bad data 
in the output context. This was 10 years ago and on some specic hardware,
see commit:

ac9d8fe7c6a8 USB: xhci: Add quirk for Fresco Logic xHCI hardware.

> 
> I'm happy to gather logs with more debug or run other experiments, if that
> would be helpful.  As it is I don't really know how to debug the internal
> state of the HC further, but hopefully the knowledge that the patch below
> makes a difference will help us move forward.

Great thanks, it will take some time before rewrite is ready.

-Mathias


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: xhci problem -> general protection fault
  2020-10-12 19:20             ` Mathias Nyman
@ 2020-10-12 21:53               ` zwisler
  2020-10-13  7:49                 ` Mathias Nyman
  0 siblings, 1 reply; 26+ messages in thread
From: zwisler @ 2020-10-12 21:53 UTC (permalink / raw)
  To: Mathias Nyman; +Cc: Andrzej Pietrasiewicz, linux-usb, kernel

On Mon, Oct 12, 2020 at 10:20:31PM +0300, Mathias Nyman wrote:
> On 1.10.2020 19.43, zwisler@google.com wrote:
> > On Tue, Sep 29, 2020 at 01:35:31AM +0300, Mathias Nyman wrote:

> I'm rewriting how xhci driver handles halted and canceled transfers.
> While looking into it I found an older case where hardware gives bad data 
> in the output context. This was 10 years ago and on some specic hardware,
> see commit:
> 
> ac9d8fe7c6a8 USB: xhci: Add quirk for Fresco Logic xHCI hardware.
> 
> > 
> > I'm happy to gather logs with more debug or run other experiments, if that
> > would be helpful.  As it is I don't really know how to debug the internal
> > state of the HC further, but hopefully the knowledge that the patch below
> > makes a difference will help us move forward.
> 
> Great thanks, it will take some time before rewrite is ready.

Should we/I invest time in trying to create a quirk similar to

> ac9d8fe7c6a8 USB: xhci: Add quirk for Fresco Logic xHCI hardware.

in the mean time, or should we just wait for your rewrite?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: xhci problem -> general protection fault
  2020-10-12 21:53               ` zwisler
@ 2020-10-13  7:49                 ` Mathias Nyman
  2020-10-13  8:29                   ` Andrzej Pietrasiewicz
  2020-11-19 16:52                   ` Ross Zwisler
  0 siblings, 2 replies; 26+ messages in thread
From: Mathias Nyman @ 2020-10-13  7:49 UTC (permalink / raw)
  To: zwisler; +Cc: Andrzej Pietrasiewicz, linux-usb, kernel

On 13.10.2020 0.53, zwisler@google.com wrote:
> On Mon, Oct 12, 2020 at 10:20:31PM +0300, Mathias Nyman wrote:
>> On 1.10.2020 19.43, zwisler@google.com wrote:
>>> On Tue, Sep 29, 2020 at 01:35:31AM +0300, Mathias Nyman wrote:
> 
>> I'm rewriting how xhci driver handles halted and canceled transfers.
>> While looking into it I found an older case where hardware gives bad data 
>> in the output context. This was 10 years ago and on some specic hardware,
>> see commit:
>>
>> ac9d8fe7c6a8 USB: xhci: Add quirk for Fresco Logic xHCI hardware.
>>
>>>
>>> I'm happy to gather logs with more debug or run other experiments, if that
>>> would be helpful.  As it is I don't really know how to debug the internal
>>> state of the HC further, but hopefully the knowledge that the patch below
>>> makes a difference will help us move forward.
>>
>> Great thanks, it will take some time before rewrite is ready.
> 
> Should we/I invest time in trying to create a quirk similar to
> 
>> ac9d8fe7c6a8 USB: xhci: Add quirk for Fresco Logic xHCI hardware.
> 
> in the mean time, or should we just wait for your rewrite?
> 

I've been focusing on this part so rewrite shouldn't take long.
If hardware is still giving incorrect values after this we might
also need a quirk on top of it.

-Mathias

 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: xhci problem -> general protection fault
  2020-10-13  7:49                 ` Mathias Nyman
@ 2020-10-13  8:29                   ` Andrzej Pietrasiewicz
  2020-10-13 16:44                     ` zwisler
  2020-11-19 16:52                   ` Ross Zwisler
  1 sibling, 1 reply; 26+ messages in thread
From: Andrzej Pietrasiewicz @ 2020-10-13  8:29 UTC (permalink / raw)
  To: Mathias Nyman, zwisler; +Cc: linux-usb, kernel

Hi,

W dniu 13.10.2020 o 09:49, Mathias Nyman pisze:
> On 13.10.2020 0.53, zwisler@google.com wrote:
>> On Mon, Oct 12, 2020 at 10:20:31PM +0300, Mathias Nyman wrote:
>>> On 1.10.2020 19.43, zwisler@google.com wrote:
>>>> On Tue, Sep 29, 2020 at 01:35:31AM +0300, Mathias Nyman wrote:
>>
>>> I'm rewriting how xhci driver handles halted and canceled transfers.
>>> While looking into it I found an older case where hardware gives bad data
>>> in the output context. This was 10 years ago and on some specic hardware,
>>> see commit:
>>>
>>> ac9d8fe7c6a8 USB: xhci: Add quirk for Fresco Logic xHCI hardware.
>>>
>>>>
>>>> I'm happy to gather logs with more debug or run other experiments, if that
>>>> would be helpful.  As it is I don't really know how to debug the internal
>>>> state of the HC further, but hopefully the knowledge that the patch below
>>>> makes a difference will help us move forward.
>>>
>>> Great thanks, it will take some time before rewrite is ready.
>>
>> Should we/I invest time in trying to create a quirk similar to
>>
>>> ac9d8fe7c6a8 USB: xhci: Add quirk for Fresco Logic xHCI hardware.
>>
>> in the mean time, or should we just wait for your rewrite?
>>
> 
> I've been focusing on this part so rewrite shouldn't take long.

Thank you for doing that!

Should you be sending your patches not as a reply in this thread
kindly Cc me andrzej.p@collabora.com.

Regards,

Andrzej

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: xhci problem -> general protection fault
  2020-10-13  8:29                   ` Andrzej Pietrasiewicz
@ 2020-10-13 16:44                     ` zwisler
  0 siblings, 0 replies; 26+ messages in thread
From: zwisler @ 2020-10-13 16:44 UTC (permalink / raw)
  To: Andrzej Pietrasiewicz; +Cc: Mathias Nyman, linux-usb, kernel

On Tue, Oct 13, 2020 at 10:29:51AM +0200, Andrzej Pietrasiewicz wrote:
> W dniu 13.10.2020 o 09:49, Mathias Nyman pisze:
> > On 13.10.2020 0.53, zwisler@google.com wrote:
> > > On Mon, Oct 12, 2020 at 10:20:31PM +0300, Mathias Nyman wrote:
> > > > On 1.10.2020 19.43, zwisler@google.com wrote:
> > > > > On Tue, Sep 29, 2020 at 01:35:31AM +0300, Mathias Nyman wrote:
> > > 
> > > > I'm rewriting how xhci driver handles halted and canceled transfers.
> > > > While looking into it I found an older case where hardware gives bad data
> > > > in the output context. This was 10 years ago and on some specic hardware,
> > > > see commit:
> > > > 
> > > > ac9d8fe7c6a8 USB: xhci: Add quirk for Fresco Logic xHCI hardware.
> > > > 
> > > > > 
> > > > > I'm happy to gather logs with more debug or run other experiments, if that
> > > > > would be helpful.  As it is I don't really know how to debug the internal
> > > > > state of the HC further, but hopefully the knowledge that the patch below
> > > > > makes a difference will help us move forward.
> > > > 
> > > > Great thanks, it will take some time before rewrite is ready.
> > > 
> > > Should we/I invest time in trying to create a quirk similar to
> > > 
> > > > ac9d8fe7c6a8 USB: xhci: Add quirk for Fresco Logic xHCI hardware.
> > > 
> > > in the mean time, or should we just wait for your rewrite?
> > > 
> > 
> > I've been focusing on this part so rewrite shouldn't take long.
> 
> Thank you for doing that!
> 
> Should you be sending your patches not as a reply in this thread
> kindly Cc me andrzej.p@collabora.com.

Please CC me as well, and I'd be happy to test for you.  Thanks again!

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: xhci problem -> general protection fault
  2020-10-13  7:49                 ` Mathias Nyman
  2020-10-13  8:29                   ` Andrzej Pietrasiewicz
@ 2020-11-19 16:52                   ` Ross Zwisler
  2020-11-23 15:06                     ` Mathias Nyman
  1 sibling, 1 reply; 26+ messages in thread
From: Ross Zwisler @ 2020-11-19 16:52 UTC (permalink / raw)
  To: Mathias Nyman; +Cc: Andrzej Pietrasiewicz, linux-usb, kernel

On Tue, Oct 13, 2020 at 1:45 AM Mathias Nyman
<mathias.nyman@linux.intel.com> wrote:

> I've been focusing on this part so rewrite shouldn't take long.
> If hardware is still giving incorrect values after this we might
> also need a quirk on top of it.

Hi Mathias,

I just wanted to check back in on this and see how things were going.
Can Andrzej and I help test anything for you?  Have you been able to
reproduce this issue locally?  Anything else we can do to help?

Thanks,
- Ross

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: xhci problem -> general protection fault
  2020-11-19 16:52                   ` Ross Zwisler
@ 2020-11-23 15:06                     ` Mathias Nyman
  2020-12-02 22:59                       ` Ross Zwisler
  0 siblings, 1 reply; 26+ messages in thread
From: Mathias Nyman @ 2020-11-23 15:06 UTC (permalink / raw)
  To: Ross Zwisler; +Cc: Andrzej Pietrasiewicz, linux-usb, kernel

On 19.11.2020 18.52, Ross Zwisler wrote:
> On Tue, Oct 13, 2020 at 1:45 AM Mathias Nyman
> <mathias.nyman@linux.intel.com> wrote:
> 
>> I've been focusing on this part so rewrite shouldn't take long.
>> If hardware is still giving incorrect values after this we might
>> also need a quirk on top of it.
> 
> Hi Mathias,
> 
> I just wanted to check back in on this and see how things were going.
> Can Andrzej and I help test anything for you?  Have you been able to
> reproduce this issue locally?  Anything else we can do to help?
> 
> Thanks,
> - Ross
> 

I think I got most of the functionality now working.
The series is not in upstream shape, but should work, and can be tested.
just pushed it to a rewrite_halt_stop_handling branch in my tree, ten patches on top of 5.10-rc4

git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git rewrite_halt_stop_handling
https://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git/log/?h=rewrite_halt_stop_handling

It still contains dead code that needs to be removed, and all streams (uas) cases are not
handled properly, it won't pass checkpatch.. and so on,  but it should be testable.

Thanks
-Mathias

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: xhci problem -> general protection fault
  2020-11-23 15:06                     ` Mathias Nyman
@ 2020-12-02 22:59                       ` Ross Zwisler
  2020-12-04 18:07                         ` Mathias Nyman
  0 siblings, 1 reply; 26+ messages in thread
From: Ross Zwisler @ 2020-12-02 22:59 UTC (permalink / raw)
  To: Mathias Nyman; +Cc: Andrzej Pietrasiewicz, linux-usb, kernel

On Mon, Nov 23, 2020 at 8:04 AM Mathias Nyman
<mathias.nyman@linux.intel.com> wrote:

> I think I got most of the functionality now working.
> The series is not in upstream shape, but should work, and can be tested.
> just pushed it to a rewrite_halt_stop_handling branch in my tree, ten patches on top of 5.10-rc4
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git rewrite_halt_stop_handling
> https://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git/log/?h=rewrite_halt_stop_handling
>
> It still contains dead code that needs to be removed, and all streams (uas) cases are not
> handled properly, it won't pass checkpatch.. and so on,  but it should be testable.
>
> Thanks
> -Mathias

Unfortunately I'm still able to reproduce the failure with your
patches.  Here is a dmesg:

https://gist.github.com/rzwisler/05e52020e87f0ccd6185182be999dae0

I was testing at this commit:

3c1f3ab219e5f xhci: handle halting transfer event properly after
endpoint stop and halt raced.

Turning on ftrace makes it much harder to reproduce.  Should I keep
trying for a repro with tracing turned on, or is the fact that it
still happens informative enough to know we have to look elsewhere for
a solution?

Thanks,
- Ross

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: xhci problem -> general protection fault
  2020-12-02 22:59                       ` Ross Zwisler
@ 2020-12-04 18:07                         ` Mathias Nyman
  2020-12-08 17:24                           ` Ross Zwisler
  0 siblings, 1 reply; 26+ messages in thread
From: Mathias Nyman @ 2020-12-04 18:07 UTC (permalink / raw)
  To: Ross Zwisler; +Cc: Andrzej Pietrasiewicz, linux-usb, kernel

On 3.12.2020 0.59, Ross Zwisler wrote:
> On Mon, Nov 23, 2020 at 8:04 AM Mathias Nyman
> <mathias.nyman@linux.intel.com> wrote:
> 
>> I think I got most of the functionality now working.
>> The series is not in upstream shape, but should work, and can be tested.
>> just pushed it to a rewrite_halt_stop_handling branch in my tree, ten patches on top of 5.10-rc4
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git rewrite_halt_stop_handling
>> https://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git/log/?h=rewrite_halt_stop_handling
>>
>> It still contains dead code that needs to be removed, and all streams (uas) cases are not
>> handled properly, it won't pass checkpatch.. and so on,  but it should be testable.
>>
>> Thanks
>> -Mathias
> 
> Unfortunately I'm still able to reproduce the failure with your
> patches.  Here is a dmesg:
> 
> https://gist.github.com/rzwisler/05e52020e87f0ccd6185182be999dae0
> 
> I was testing at this commit:
> 
> 3c1f3ab219e5f xhci: handle halting transfer event properly after
> endpoint stop and halt raced.
> 
> Turning on ftrace makes it much harder to reproduce.  Should I keep
> trying for a repro with tracing turned on, or is the fact that it
> still happens informative enough to know we have to look elsewhere for
> a solution?
> 
> Thanks,
> - Ross
> 

Ok, thanks.

Then the rootcause remains unknown.
For some reason the endpoint context dequeue pointer field contains zero
instead of the new dequeue pointer.
The (output) endpoint context is supposed to be written only by the controller.

Time to change strategy and start to detect and treat the symptoms instead.

I wrote a patch that detects the 0-dequeue pointer and issues a
new Set TR Deq pointer command. Hopefully that works.
patch added to same branch, can you try it out?

3f6326766abc xhci: retry setting new dequeue if xHC hardware failed to update it

I didn't set a retry limit yet so if it doesn't work it might retry forever.
 
Thanks
Mathias

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: xhci problem -> general protection fault
  2020-12-04 18:07                         ` Mathias Nyman
@ 2020-12-08 17:24                           ` Ross Zwisler
  2020-12-09 13:11                             ` Mathias Nyman
  0 siblings, 1 reply; 26+ messages in thread
From: Ross Zwisler @ 2020-12-08 17:24 UTC (permalink / raw)
  To: Mathias Nyman; +Cc: Andrzej Pietrasiewicz, linux-usb, kernel

On Fri, Dec 04, 2020 at 08:07:30PM +0200, Mathias Nyman wrote:
<>
> Ok, thanks.
> 
> Then the rootcause remains unknown.
> For some reason the endpoint context dequeue pointer field contains zero
> instead of the new dequeue pointer.
> The (output) endpoint context is supposed to be written only by the controller.
> 
> Time to change strategy and start to detect and treat the symptoms instead.
> 
> I wrote a patch that detects the 0-dequeue pointer and issues a
> new Set TR Deq pointer command. Hopefully that works.
> patch added to same branch, can you try it out?
> 
> 3f6326766abc xhci: retry setting new dequeue if xHC hardware failed to update it
> 
> I didn't set a retry limit yet so if it doesn't work it might retry forever.

Here are some logs when running with that commit:

https://gist.github.com/rzwisler/17923c9dedf2b914254eadd1cd294a4c

I think we only consistently get the clean failure case with the dequeue
pointer being 0 if CONFIG_INTEL_IOMMU_DEFAULT_ON=y.

If that option is set to 'n', we get the same failure where the xHCI
controller totally dies (log "CONFIG_INTEL_IOMMU_DEFAULT_ON=n" in the gist).

With CONFIG_INTEL_IOMMU_DEFAULT_ON=y we do seem to live through multiple
errors, but as soon as I try to use the device normally afterwards it seems to
spin forever with these messages:

xhci_hcd 0000:00:14.0: Looking for event-dma 00000000fff0a330 trb-start 00000000f8884000 trb-end 0000000000000000 seg-start 00000000f8884000 seg-end 00000000f8884ff0

Are you able to reproduce this with Andrzej's bulk-cancel script?  I think you
probably just need a device which accepts bulk transfer commands?  In my most
recent reproductions my servo hardware wasn't even attached to a device, so I
don't really think it's doing anything except sitting there and receiving
BULK_IN commands.   I'm doing this to two devices simultaneously.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: xhci problem -> general protection fault
  2020-12-08 17:24                           ` Ross Zwisler
@ 2020-12-09 13:11                             ` Mathias Nyman
  2020-12-09 18:54                               ` Ross Zwisler
  0 siblings, 1 reply; 26+ messages in thread
From: Mathias Nyman @ 2020-12-09 13:11 UTC (permalink / raw)
  To: Ross Zwisler; +Cc: Andrzej Pietrasiewicz, linux-usb, kernel

On 8.12.2020 19.24, Ross Zwisler wrote:
> On Fri, Dec 04, 2020 at 08:07:30PM +0200, Mathias Nyman wrote:
> <>
> 
> Here are some logs when running with that commit:
> 
> https://gist.github.com/rzwisler/17923c9dedf2b914254eadd1cd294a4c
> 
> I think we only consistently get the clean failure case with the dequeue
> pointer being 0 if CONFIG_INTEL_IOMMU_DEFAULT_ON=y.
> 
> If that option is set to 'n', we get the same failure where the xHCI
> controller totally dies (log "CONFIG_INTEL_IOMMU_DEFAULT_ON=n" in the gist).
> 
> With CONFIG_INTEL_IOMMU_DEFAULT_ON=y we do seem to live through multiple
> errors, but as soon as I try to use the device normally afterwards it seems to
> spin forever with these messages:
> 
> xhci_hcd 0000:00:14.0: Looking for event-dma 00000000fff0a330 trb-start 00000000f8884000 trb-end 0000000000000000 seg-start 00000000f8884000 seg-end 00000000f8884ff0
> 
> Are you able to reproduce this with Andrzej's bulk-cancel script?  I think you
> probably just need a device which accepts bulk transfer commands?  In my most
> recent reproductions my servo hardware wasn't even attached to a device, so I
> don't really think it's doing anything except sitting there and receiving
> BULK_IN commands.   I'm doing this to two devices simultaneously.
> 

I was testing with Andrzej's script against a g_zero gadget.
I could  trigger many similar issues as those he reported, but not this
dequeue issue you see.

The rewrite resolved all issues I saw. Script was running without issues
over night. (tested against both USB2 and USB3).

I haven't tried with two devices simultaneously, I could try that.

Could you share more details about the system you have, what xhci
controller do you have?

Thanks
-Mathias 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: xhci problem -> general protection fault
  2020-12-09 13:11                             ` Mathias Nyman
@ 2020-12-09 18:54                               ` Ross Zwisler
  2020-12-30 12:33                                 ` Mathias Nyman
  0 siblings, 1 reply; 26+ messages in thread
From: Ross Zwisler @ 2020-12-09 18:54 UTC (permalink / raw)
  To: Mathias Nyman; +Cc: Andrzej Pietrasiewicz, linux-usb, kernel

On Wed, Dec 09, 2020 at 03:11:14PM +0200, Mathias Nyman wrote:
<>
> I was testing with Andrzej's script against a g_zero gadget.
> I could  trigger many similar issues as those he reported, but not this
> dequeue issue you see.
> 
> The rewrite resolved all issues I saw. Script was running without issues
> over night. (tested against both USB2 and USB3).
> 
> I haven't tried with two devices simultaneously, I could try that.
> 
> Could you share more details about the system you have, what xhci
> controller do you have?

Sure.  I'm running with the following CPU:

Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz

with 72 virtual cores, if that matters.

The full output of "lsusb -vvv" can be found here:

https://gist.github.com/rzwisler/8c61f10c7a9642cbc6fffe1daadfa8cc

And here is the xhci entry in "lspci -vvv":

00:14.0 USB controller: Intel Corporation C620 Series Chipset Family USB 3.0 xHCI Controller (rev 09) (prog-if 30 [XHCI])
	Subsystem: Lenovo C620 Series Chipset Family USB 3.0 xHCI Controller
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 38
	NUMA node: 0
	IOMMU group: 16
	Region 0: Memory at 90320000 (64-bit, non-prefetchable) [size=64K]
	Capabilities: <access denied>
	Kernel driver in use: xhci_hcd
	Kernel modules: xhci_pci

We've seen this issue on a variety of hardware from Lenovo and HP, so it's not
specific to this specific workstation model (Lenovo P920).

I repro the issue by running the bulk-cancel program against my "servo"
hardware, which is debugging hardware for Chromebooks.  As I said, though, I
can repro it easily without having a device attached, so I *think* all you
need is a device which handles BULK IN commands.  I've got 2, and if I run
against both my system repros the issue in under a minute, usually.  With 1 it
takes a little longer but still repros, so I don't think it's an interaction
between the two devices.

Please let me know if I can provide anything else of value.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: xhci problem -> general protection fault
  2020-12-09 18:54                               ` Ross Zwisler
@ 2020-12-30 12:33                                 ` Mathias Nyman
  2021-01-06 18:52                                   ` Ross Zwisler
  0 siblings, 1 reply; 26+ messages in thread
From: Mathias Nyman @ 2020-12-30 12:33 UTC (permalink / raw)
  To: Ross Zwisler; +Cc: Andrzej Pietrasiewicz, linux-usb, kernel

On 9.12.2020 20.54, Ross Zwisler wrote:
> On Wed, Dec 09, 2020 at 03:11:14PM +0200, Mathias Nyman wrote:
> <>
>> I was testing with Andrzej's script against a g_zero gadget.
>> I could  trigger many similar issues as those he reported, but not this
>> dequeue issue you see.
>>
>> The rewrite resolved all issues I saw. Script was running without issues
>> over night. (tested against both USB2 and USB3).
>>
>> I haven't tried with two devices simultaneously, I could try that.
>>
>> Could you share more details about the system you have, what xhci
>> controller do you have?
> 
> Sure.  I'm running with the following CPU:
> 
> Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz

I was able to reproduce the issue with two USB devices on a different system.

I saw that the new incorrect dequeue pointer sometimes had the higher 32 bits
incorrect while the lower bits were correct.
So this looks like a memory access order issue.

The command TRB is 16 bytes. The xhci driver writes it in four 4 byte chunks.
Even if the driver writes the chunk that sets the cycle bit last, handing the TRB
over to the controller, it appears that the actual write order can be different.
The controller ends up reading a command TRB with updated cycle bit but old bogus
values in the "new dequeue pointer" field.

Adding a write memory barrier before writing the cycle bit solved the issue in my case.

Whole patch series is updated, added write memory barrier, and rebased on 5.10.
It can be found force-updated in the same rewrite_halt_stop_handling branch: 

git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git rewrite_halt_stop_handling
https://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git/log/?h=rewrite_halt_stop_handling

Does this work for you?

-Mathias
  

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: xhci problem -> general protection fault
  2020-12-30 12:33                                 ` Mathias Nyman
@ 2021-01-06 18:52                                   ` Ross Zwisler
  2021-01-07  8:57                                     ` Mathias Nyman
  0 siblings, 1 reply; 26+ messages in thread
From: Ross Zwisler @ 2021-01-06 18:52 UTC (permalink / raw)
  To: Mathias Nyman; +Cc: Andrzej Pietrasiewicz, linux-usb, kernel

On Wed, Dec 30, 2020 at 5:31 AM Mathias Nyman <mathias.nyman@linux.intel.com> wrote:
<>
> I was able to reproduce the issue with two USB devices on a different system.
>
> I saw that the new incorrect dequeue pointer sometimes had the higher 32 bits
> incorrect while the lower bits were correct.
> So this looks like a memory access order issue.
>
> The command TRB is 16 bytes. The xhci driver writes it in four 4 byte chunks.
> Even if the driver writes the chunk that sets the cycle bit last, handing the TRB
> over to the controller, it appears that the actual write order can be different.
> The controller ends up reading a command TRB with updated cycle bit but old bogus
> values in the "new dequeue pointer" field.
>
> Adding a write memory barrier before writing the cycle bit solved the issue in my case.
>
> Whole patch series is updated, added write memory barrier, and rebased on 5.10.
> It can be found force-updated in the same rewrite_halt_stop_handling branch:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git rewrite_halt_stop_handling
> https://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git/log/?h=rewrite_halt_stop_handling
>
> Does this work for you?

Yes, it does!!  I verified that I'm able to reproduce the issue in less than
10 seconds with this commit which is the HEAD~1 of your series:

a7d053d207121 xhci: handle halting transfer event properly after endpoint stop and halt raced.

With this commit (the final commit in your series, adding a single patch): 

96887d191a88c xhci: make sure TRB is fully written before giving it to the controller

I ran cleanly for over 20 minutes and haven't been able to reproduce the
issue.  It looks like the wmb() added in that patch makes all the difference!

Thank you for the fix, and you can add this tag to the series:

Tested-by: Ross Zwisler <zwisler@google.com>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: xhci problem -> general protection fault
  2021-01-06 18:52                                   ` Ross Zwisler
@ 2021-01-07  8:57                                     ` Mathias Nyman
  2021-01-07 16:07                                       ` Ross Zwisler
  0 siblings, 1 reply; 26+ messages in thread
From: Mathias Nyman @ 2021-01-07  8:57 UTC (permalink / raw)
  To: Ross Zwisler; +Cc: Andrzej Pietrasiewicz, linux-usb, kernel

On 6.1.2021 20.52, Ross Zwisler wrote:
> On Wed, Dec 30, 2020 at 5:31 AM Mathias Nyman <mathias.nyman@linux.intel.com> wrote:
> <>
>> I was able to reproduce the issue with two USB devices on a different system.
>>
>> I saw that the new incorrect dequeue pointer sometimes had the higher 32 bits
>> incorrect while the lower bits were correct.
>> So this looks like a memory access order issue.
>>
>> The command TRB is 16 bytes. The xhci driver writes it in four 4 byte chunks.
>> Even if the driver writes the chunk that sets the cycle bit last, handing the TRB
>> over to the controller, it appears that the actual write order can be different.
>> The controller ends up reading a command TRB with updated cycle bit but old bogus
>> values in the "new dequeue pointer" field.
>>
>> Adding a write memory barrier before writing the cycle bit solved the issue in my case.
>>
>> Whole patch series is updated, added write memory barrier, and rebased on 5.10.
>> It can be found force-updated in the same rewrite_halt_stop_handling branch:
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git rewrite_halt_stop_handling
>> https://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git/log/?h=rewrite_halt_stop_handling
>>
>> Does this work for you?
> 
> Yes, it does!!  I verified that I'm able to reproduce the issue in less than
> 10 seconds with this commit which is the HEAD~1 of your series:
> 
> a7d053d207121 xhci: handle halting transfer event properly after endpoint stop and halt raced.
> 
> With this commit (the final commit in your series, adding a single patch): 
> 
> 96887d191a88c xhci: make sure TRB is fully written before giving it to the controller
> 
> I ran cleanly for over 20 minutes and haven't been able to reproduce the
> issue.  It looks like the wmb() added in that patch makes all the difference!
> 
> Thank you for the fix, and you can add this tag to the series:
> 
> Tested-by: Ross Zwisler <zwisler@google.com>
> 

Great, thanks for testing.

I think it makes sense to cherry pick that one-liner wmb() patch first to 5.11 and stable,
then add rest of the rewrite later.

-Mathias

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: xhci problem -> general protection fault
  2021-01-07  8:57                                     ` Mathias Nyman
@ 2021-01-07 16:07                                       ` Ross Zwisler
  0 siblings, 0 replies; 26+ messages in thread
From: Ross Zwisler @ 2021-01-07 16:07 UTC (permalink / raw)
  To: Mathias Nyman; +Cc: Andrzej Pietrasiewicz, linux-usb, kernel

On Thu, Jan 07, 2021 at 10:57:19AM +0200, Mathias Nyman wrote:
> On 6.1.2021 20.52, Ross Zwisler wrote:
> > On Wed, Dec 30, 2020 at 5:31 AM Mathias Nyman <mathias.nyman@linux.intel.com> wrote:
> > <>
> >> I was able to reproduce the issue with two USB devices on a different system.
> >>
> >> I saw that the new incorrect dequeue pointer sometimes had the higher 32 bits
> >> incorrect while the lower bits were correct.
> >> So this looks like a memory access order issue.
> >>
> >> The command TRB is 16 bytes. The xhci driver writes it in four 4 byte chunks.
> >> Even if the driver writes the chunk that sets the cycle bit last, handing the TRB
> >> over to the controller, it appears that the actual write order can be different.
> >> The controller ends up reading a command TRB with updated cycle bit but old bogus
> >> values in the "new dequeue pointer" field.
> >>
> >> Adding a write memory barrier before writing the cycle bit solved the issue in my case.
> >>
> >> Whole patch series is updated, added write memory barrier, and rebased on 5.10.
> >> It can be found force-updated in the same rewrite_halt_stop_handling branch:
> >>
> >> git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git rewrite_halt_stop_handling
> >> https://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git/log/?h=rewrite_halt_stop_handling
> >>
> >> Does this work for you?
> > 
> > Yes, it does!!  I verified that I'm able to reproduce the issue in less than
> > 10 seconds with this commit which is the HEAD~1 of your series:
> > 
> > a7d053d207121 xhci: handle halting transfer event properly after endpoint stop and halt raced.
> > 
> > With this commit (the final commit in your series, adding a single patch): 
> > 
> > 96887d191a88c xhci: make sure TRB is fully written before giving it to the controller
> > 
> > I ran cleanly for over 20 minutes and haven't been able to reproduce the
> > issue.  It looks like the wmb() added in that patch makes all the difference!
> > 
> > Thank you for the fix, and you can add this tag to the series:
> > 
> > Tested-by: Ross Zwisler <zwisler@google.com>
> > 
> 
> Great, thanks for testing.
> 
> I think it makes sense to cherry pick that one-liner wmb() patch first to 5.11 and stable,
> then add rest of the rewrite later.
> 
> -Mathias

Makes sense to me.  FWIW I've verified that v5.10 reproduces the issue easily,
and v5.10 with just the wmb() patch does not.

Thanks again for the fix!

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2021-01-07 16:07 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-17 15:30 xhci problem -> general protection fault Andrzej Pietrasiewicz
2020-09-18 10:50 ` Mathias Nyman
2020-09-18 14:20   ` Andrzej Pietrasiewicz
2020-09-25 13:40     ` Mathias Nyman
2020-09-25 21:05       ` Ross Zwisler
2020-09-28 13:32         ` Andrzej Pietrasiewicz
2020-09-29  7:13           ` Mathias Nyman
2020-10-01 14:13             ` Andrzej Pietrasiewicz
2020-09-28 22:35         ` Mathias Nyman
2020-10-01 16:43           ` zwisler
2020-10-12 19:20             ` Mathias Nyman
2020-10-12 21:53               ` zwisler
2020-10-13  7:49                 ` Mathias Nyman
2020-10-13  8:29                   ` Andrzej Pietrasiewicz
2020-10-13 16:44                     ` zwisler
2020-11-19 16:52                   ` Ross Zwisler
2020-11-23 15:06                     ` Mathias Nyman
2020-12-02 22:59                       ` Ross Zwisler
2020-12-04 18:07                         ` Mathias Nyman
2020-12-08 17:24                           ` Ross Zwisler
2020-12-09 13:11                             ` Mathias Nyman
2020-12-09 18:54                               ` Ross Zwisler
2020-12-30 12:33                                 ` Mathias Nyman
2021-01-06 18:52                                   ` Ross Zwisler
2021-01-07  8:57                                     ` Mathias Nyman
2021-01-07 16:07                                       ` Ross Zwisler

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).