ASMedia ASM2142 USB host controller tries to DMA to address zero when doing bulk reads from multiple devices

* ASMedia ASM2142 USB host controller tries to DMA to address zero when doing bulk reads from multiple devices
@ 2020-07-21  5:49 Forest Crossman
  2020-07-29 13:22 ` Oliver O'Halloran
  0 siblings, 1 reply; 3+ messages in thread
From: Forest Crossman @ 2020-07-21  5:49 UTC (permalink / raw)
  To: linux-usb; +Cc: linuxppc-dev

Hello, again!

After fixing the issue in my previous thread using this patch[1], I
decided to do some stress-testing of the controller to make sure it
could handle my intended workloads and that there were no further DMA
address issues that would need to be fixed. Unfortunately, it looks
like there's still more work to be done: when I try to do long bulk
reads from multiple devices simultaneously, eventually the host
controller sends a DMA write to address zero, which then triggers EEH
in my POWER9 system, causing the controller card to get hotplug-reset,
which of course kills the disk-reading processes. For more details on
the EEH errors, you can see my kernel's EEH message log[2]. The
results of the various tests I performed are listed below.

Test results (all failures are due to DMA writes to address zero, all
hubs are USB 3.0/3.1 Gen1 only, and all disks are accessed via the
usb-storage driver):
- Reading simultaneously from two or more disks behind a hub connected
to one port on the host controller:
  - FAIL after 20-50 GB of data transferred for each device.
- Reading simultaneously from two disks, each connected directly to
one port on the host controller:
  - FAIL after about 800 GB of data transferred for each device.
- Reading from one disk behind a hub connected to one port on the host
controller:
  - OK for at least 2.7 TB of data transferred (I didn't test the
whole 8 TB disk).
- Writing simultaneously to two FL2000 dongles (using osmo-fl2k's
"fl2k_test"), each connected directly to one port on the host
controller:
  - OK, was able to write several dozen terabytes to each device over
the course of a little over 21 hours.

Seeing how simultaneous writes to multiple devices and reads from
single devices both seem to work fine, I assume that means this is
being caused by some race condition in the host controller firmware
when it responds to multiple read requests. I also assume we're not
going to be able to convince ASMedia to both fix the bug in their
firmware and release the details on how to flash it from Linux, so I
guess we'll just have to figure out how to make the driver talk to the
controller in a way that avoids triggering the bad DMA write. As
before, I decided to try a little kernel hacking of my own before
sending this email, and tried separately enabling the
XHCI_BROKEN_STREAMS and XHCI_ASMEDIA_MODIFY_FLOWCONTROL quirks in an
attempt to fix this. As you might expect since you're reading this
message, neither of those quirks fixed the issue, nor did they even
make the transfers last any longer before failing.

So now I've reached the limits of my understanding, and I need some
help devising a fix. If anyone has any comments to that effect, or any
questions about my hardware configuration, testing methodology, etc.,
please don't hesitate to air them. Also, if anyone needs me to perform
additional tests, or collect more log information, I'd be happy to do
that as well.

Thanks in advance for your help,

Forest

[1]: https://patchwork.kernel.org/patch/11669989/
[2]: https://paste.debian.net/hidden/2a442aa6

^ permalink raw reply	[flat|nested] 3+ messages in thread