On Thu, Aug 12, 2021 at 04:26:41PM +0100, Marc Zyngier wrote:
> On Thu, 12 Aug 2021 15:29:06 +0100,
> Thierry Reding wrote:
> >
> > On Wed, Aug 11, 2021 at 02:23:10PM +0100, Marc Zyngier wrote:
> > [...]
>
> > > I love this machine... Did this issue occur with the Denver CPUs
> > > disabled?
> >
> > Interestingly I've been doing some work on a newer device called Jetson
> > TX2 NX (which is kind of a trimmed-down version of Jetson TX2, in the
> > spirit of the Jetson Nano) and I can't seem to reproduce these failures
> > there (tested on next-20210812).
> >
> > I'll go dig out my Jetson TX2 to run the same tests there, because I've
> > also been using a development version of the bootloader stack and
> > flashing tools and all that, so it's possible that something was fixed
> > at that level. I don't think I've ever tried disabling the Denver CPUs,
> > but then I've also never seen these issues myself.
> >
> > Just out of curiosity, what version of the BSP have you been using to
> > flash?
>
> I've only used the BSP for a few weeks when I got the board last
> year. The only thing I use from it is u-boot to chainload an upstream
> u-boot, and boot Debian from there.

That's interesting... have you ever tried to inject a version of upstream
U-Boot into the BSP and have it flash that instead? That should allow you
to drop the chainloading step. Not that that's likely to have anything to
do with this.

> > One other thing that I ran into: there's a known issue with the PHY
> > configuration. We mark the PHY on most devices as "rgmii-id" on most
> > devices and then the Marvell PHY driver needs to be enabled. Jetson TX2
> > has phy-mode = "rgmii", so it /should/ work okay.
> >
> > Typically what we're seeing with that misconfiguration is that the
> > device fails to get an IP address, but it might still be worth trying to
> > switch Jetson TX2 to rgmii-id and using the Marvell PHY, to see if that
> > improves anything.
>
> I never failed to get an IP address. Overall, networking has been
> solid on this machine until this patch. I'll try and mess with this
> when I get time, but that's probably going to be next week now.

So I've hooked up my Jetson TX2 and tried various workloads. I wasn't
able to reproduce this on next-20210813. I've tried both the L4T 32.6.1
release and a local development build.

Perhaps one thing to try would be to upgrade your L4T BSP to something
newer. I know that there have occasionally been bugs in the MTS firmware,
which is what's running on the Denver cores, and newer BSPs can fix those
kinds of issues.

If that doesn't help, perhaps try to read out the SoC version numbers so
that we can compare. I know that some newer Tegra186 chips behave
slightly differently, so that's perhaps a difference that would explain
why it's not happening on all devices. You can read the version and
revision from sysfs using something like:

	# cat /sys/devices/soc0/{major,minor,revision}

> [...]
>
> > > That'd be pretty annoying. Do you know if the Ethernet is a coherent
> > > device on this machine? or does it need active cache maintenance?
> >
> > I don't think Ethernet is a coherent device on Tegra186. I think
> > Tegra194 had various improvements with regard to coherency, but most
> > devices on Tegra186 do need active cache maintenance.
> >
> > Let me dig through some old patches and mailing list threads. I vaguely
> > recall prototyping a patch that did something special for outer cache
> > flushing, but that may have been Tegra132, not Tegra186. I also don't
> > think we ended up merging that because it turned out to not be needed.
>
> ARMv8 forbid any sort of *visible* outer cache, so I really hope this
> is not required. We wouldn't be able to support it.

I couldn't find any trace of this anywhere. So I'm possibly
misremembering. It's also more likely that this was on an earlier SoC
generation, otherwise I'd probably remember more clearly.

Thierry