* ASMedia ASM2142 USB host controller tries to DMA to address zero when doing bulk reads from multiple devices @ 2020-07-21 5:49 Forest Crossman 2020-07-29 13:22 ` Oliver O'Halloran 0 siblings, 1 reply; 3+ messages in thread From: Forest Crossman @ 2020-07-21 5:49 UTC (permalink / raw) To: linux-usb; +Cc: linuxppc-dev Hello, again! After fixing the issue in my previous thread using this patch[1], I decided to do some stress-testing of the controller to make sure it could handle my intended workloads and that there were no further DMA address issues that would need to be fixed. Unfortunately, it looks like there's still more work to be done: when I try to do long bulk reads from multiple devices simultaneously, eventually the host controller sends a DMA write to address zero, which then triggers EEH in my POWER9 system, causing the controller card to get hotplug-reset, which of course kills the disk-reading processes. For more details on the EEH errors, you can see my kernel's EEH message log[2]. The results of the various tests I performed are listed below. Test results (all failures are due to DMA writes to address zero, all hubs are USB 3.0/3.1 Gen1 only, and all disks are accessed via the usb-storage driver): - Reading simultaneously from two or more disks behind a hub connected to one port on the host controller: - FAIL after 20-50 GB of data transferred for each device. - Reading simultaneously from two disks, each connected directly to one port on the host controller: - FAIL after about 800 GB of data transferred for each device. - Reading from one disk behind a hub connected to one port on the host controller: - OK for at least 2.7 TB of data transferred (I didn't test the whole 8 TB disk). - Writing simultaneously to two FL2000 dongles (using osmo-fl2k's "fl2k_test"), each connected directly to one port on the host controller: - OK, was able to write several dozen terabytes to each device over the course of a little over 21 hours. Seeing how simultaneous writes to multiple devices and reads from single devices both seem to work fine, I assume that means this is being caused by some race condition in the host controller firmware when it responds to multiple read requests. I also assume we're not going to be able to convince ASMedia to both fix the bug in their firmware and release the details on how to flash it from Linux, so I guess we'll just have to figure out how to make the driver talk to the controller in a way that avoids triggering the bad DMA write. As before, I decided to try a little kernel hacking of my own before sending this email, and tried separately enabling the XHCI_BROKEN_STREAMS and XHCI_ASMEDIA_MODIFY_FLOWCONTROL quirks in an attempt to fix this. As you might expect since you're reading this message, neither of those quirks fixed the issue, nor did they even make the transfers last any longer before failing. So now I've reached the limits of my understanding, and I need some help devising a fix. If anyone has any comments to that effect, or any questions about my hardware configuration, testing methodology, etc., please don't hesitate to air them. Also, if anyone needs me to perform additional tests, or collect more log information, I'd be happy to do that as well. Thanks in advance for your help, Forest [1]: https://patchwork.kernel.org/patch/11669989/ [2]: https://paste.debian.net/hidden/2a442aa6 ^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: ASMedia ASM2142 USB host controller tries to DMA to address zero when doing bulk reads from multiple devices 2020-07-21 5:49 ASMedia ASM2142 USB host controller tries to DMA to address zero when doing bulk reads from multiple devices Forest Crossman @ 2020-07-29 13:22 ` Oliver O'Halloran 2020-08-01 21:57 ` Forest Crossman 0 siblings, 1 reply; 3+ messages in thread From: Oliver O'Halloran @ 2020-07-29 13:22 UTC (permalink / raw) To: Forest Crossman; +Cc: linux-usb, linuxppc-dev, Alexey Kardashevskiy On Tue, Jul 21, 2020 at 3:51 PM Forest Crossman <cyrozap@gmail.com> wrote: > > Hello, again! > > After fixing the issue in my previous thread using this patch[1], I > decided to do some stress-testing of the controller to make sure it > could handle my intended workloads and that there were no further DMA > address issues that would need to be fixed. Unfortunately, it looks > like there's still more work to be done: when I try to do long bulk > reads from multiple devices simultaneously, eventually the host > controller sends a DMA write to address zero, which then triggers EEH > in my POWER9 system, causing the controller card to get hotplug-reset, > which of course kills the disk-reading processes. For more details on > the EEH errors, you can see my kernel's EEH message log[2]. Take the logged address with a grain of salt. If an error occurs while translating the DMA address the PHB logs all zeros as the "DMA Address" because it only keeps around the bits that it needs to fetch the next level of the TCE table. The EEH dump says the error is due to a TCE permission mis-match so odds the ASmedia controller is writing to an address that's already been DMA unmapped, hence the logged address being zeros. Sorry, I probably should have mentioned that quirk in the last mail. > The results of the various tests I performed are listed below. > > Test results (all failures are due to DMA writes to address zero, all > hubs are USB 3.0/3.1 Gen1 only, and all disks are accessed via the > usb-storage driver): > - Reading simultaneously from two or more disks behind a hub connected > to one port on the host controller: > - FAIL after 20-50 GB of data transferred for each device. > - Reading simultaneously from two disks, each connected directly to > one port on the host controller: > - FAIL after about 800 GB of data transferred for each device. > - Reading from one disk behind a hub connected to one port on the host > controller: > - OK for at least 2.7 TB of data transferred (I didn't test the > whole 8 TB disk). > - Writing simultaneously to two FL2000 dongles (using osmo-fl2k's > "fl2k_test"), each connected directly to one port on the host > controller: > - OK, was able to write several dozen terabytes to each device over > the course of a little over 21 hours. > > Seeing how simultaneous writes to multiple devices and reads from > single devices both seem to work fine, I assume that means this is > being caused by some race condition in the host controller firmware > when it responds to multiple read requests. Most likely. It's possible it's a platform specific race with DMA map/unmap too, but I think we would be seeing similar issues with other devices if it was. > I also assume we're not > going to be able to convince ASMedia to both fix the bug in their > firmware and release the details on how to flash it from Linux, so I > guess we'll just have to figure out how to make the driver talk to the > controller in a way that avoids triggering the bad DMA write. As > before, I decided to try a little kernel hacking of my own before > sending this email, and tried separately enabling the > XHCI_BROKEN_STREAMS and XHCI_ASMEDIA_MODIFY_FLOWCONTROL quirks in an > attempt to fix this. As you might expect since you're reading this > message, neither of those quirks fixed the issue, nor did they even > make the transfers last any longer before failing. > > So now I've reached the limits of my understanding, and I need some > help devising a fix. If anyone has any comments to that effect, or any > questions about my hardware configuration, testing methodology, etc., > please don't hesitate to air them. Also, if anyone needs me to perform > additional tests, or collect more log information, I'd be happy to do > that as well. I started writing a tool a while ago to use the internal trace bus to log incoming TLPs. Something like that might allow you to get a better idea what the faulting access pattern is, but you would still need to find a way to mitigate the issue. I'm not all that familiar with USB3 so I'm not much help on that front. ^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: ASMedia ASM2142 USB host controller tries to DMA to address zero when doing bulk reads from multiple devices 2020-07-29 13:22 ` Oliver O'Halloran @ 2020-08-01 21:57 ` Forest Crossman 0 siblings, 0 replies; 3+ messages in thread From: Forest Crossman @ 2020-08-01 21:57 UTC (permalink / raw) To: Oliver O'Halloran; +Cc: linux-usb, linuxppc-dev, Alexey Kardashevskiy On Wed, Jul 29, 2020 at 8:22 AM Oliver O'Halloran <oohall@gmail.com> wrote: > > On Tue, Jul 21, 2020 at 3:51 PM Forest Crossman <cyrozap@gmail.com> wrote: > > > > Hello, again! > > > > After fixing the issue in my previous thread using this patch[1], I > > decided to do some stress-testing of the controller to make sure it > > could handle my intended workloads and that there were no further DMA > > address issues that would need to be fixed. Unfortunately, it looks > > like there's still more work to be done: when I try to do long bulk > > reads from multiple devices simultaneously, eventually the host > > controller sends a DMA write to address zero, which then triggers EEH > > in my POWER9 system, causing the controller card to get hotplug-reset, > > which of course kills the disk-reading processes. For more details on > > the EEH errors, you can see my kernel's EEH message log[2]. > > Take the logged address with a grain of salt. If an error occurs while > translating the DMA address the PHB logs all zeros as the "DMA > Address" because it only keeps around the bits that it needs to fetch > the next level of the TCE table. The EEH dump says the error is due to > a TCE permission mis-match so odds the ASmedia controller is writing > to an address that's already been DMA unmapped, hence the logged > address being zeros. Interesting, that's good to know. I saw that the RXE_TCE_FESR had the "TCE Response Page Access Error" bit set, and had originally assumed that just meant the DMA to address zero was triggering that error since it wasn't in a mapped page, but after reading that bit's description again I think I understand it now. > Sorry, I probably should have mentioned that quirk in the last mail. > > > The results of the various tests I performed are listed below. > > > > Test results (all failures are due to DMA writes to address zero, all > > hubs are USB 3.0/3.1 Gen1 only, and all disks are accessed via the > > usb-storage driver): > > - Reading simultaneously from two or more disks behind a hub connected > > to one port on the host controller: > > - FAIL after 20-50 GB of data transferred for each device. > > - Reading simultaneously from two disks, each connected directly to > > one port on the host controller: > > - FAIL after about 800 GB of data transferred for each device. > > - Reading from one disk behind a hub connected to one port on the host > > controller: > > - OK for at least 2.7 TB of data transferred (I didn't test the > > whole 8 TB disk). > > - Writing simultaneously to two FL2000 dongles (using osmo-fl2k's > > "fl2k_test"), each connected directly to one port on the host > > controller: > > - OK, was able to write several dozen terabytes to each device over > > the course of a little over 21 hours. > > > > Seeing how simultaneous writes to multiple devices and reads from > > single devices both seem to work fine, I assume that means this is > > being caused by some race condition in the host controller firmware > > when it responds to multiple read requests. > > Most likely. It's possible it's a platform specific race with DMA > map/unmap too, but I think we would be seeing similar issues with > other devices if it was. Yeah, I have several other xHCI controllers connected to this system, and I've never experienced this issue with any of them. If the problem was a POWER-specific DMA map/unmap race I would expect to be having problems with those controllers as well. > > I also assume we're not > > going to be able to convince ASMedia to both fix the bug in their > > firmware and release the details on how to flash it from Linux, so I > > guess we'll just have to figure out how to make the driver talk to the > > controller in a way that avoids triggering the bad DMA write. As > > before, I decided to try a little kernel hacking of my own before > > sending this email, and tried separately enabling the > > XHCI_BROKEN_STREAMS and XHCI_ASMEDIA_MODIFY_FLOWCONTROL quirks in an > > attempt to fix this. As you might expect since you're reading this > > message, neither of those quirks fixed the issue, nor did they even > > make the transfers last any longer before failing. > > > > So now I've reached the limits of my understanding, and I need some > > help devising a fix. If anyone has any comments to that effect, or any > > questions about my hardware configuration, testing methodology, etc., > > please don't hesitate to air them. Also, if anyone needs me to perform > > additional tests, or collect more log information, I'd be happy to do > > that as well. > > I started writing a tool a while ago to use the internal trace bus to > log incoming TLPs. Something like that might allow you to get a better > idea what the faulting access pattern is, but you would still need to > find a way to mitigate the issue. I'm not all that familiar with USB3 > so I'm not much help on that front. Oh, interesting, I remember seeing the trace registers in the PHB4 spec, but I wasn't sure how to access them without writing a kernel driver. I'd love to be able to log and dissect TLPs in Wireshark the same way I do for USB packets, since it makes reverse engineering protocols and debugging drivers so much easier. This would also be especially helpful because I haven't yet figured out how to get Qemu to intercept certain types of PCIe accesses (I forget if it was DMA or PIO or something, it was a quite while ago), and PCIe protocol analyzer hardware is prohibitively expensive (when it's even available for purchase). So if you've uploaded your code anywhere, I'd be really interested in seeing it, even if it's incomplete, since even with incomplete code I could use that as a reference for a Wireshark plugin or something. But if it's not online or if you'd prefer not to share it in its current state, I'll understand. Thanks again for your help, Forest ^ permalink raw reply [flat|nested] 3+ messages in thread
end of thread, other threads:[~2020-08-01 21:58 UTC | newest] Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2020-07-21 5:49 ASMedia ASM2142 USB host controller tries to DMA to address zero when doing bulk reads from multiple devices Forest Crossman 2020-07-29 13:22 ` Oliver O'Halloran 2020-08-01 21:57 ` Forest Crossman
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).