On Wed, Jan 09, 2019 at 04:09:02PM +1100, Benjamin Herrenschmidt wrote:
> On Mon, 2019-01-07 at 21:01 -0700, Jason Gunthorpe wrote:
> >
> > > In a very cryptic way that requires manual parsing using non-public
> > > docs sadly but yes. From the look of it, it's a completion timeout.
> > >
> > > Looks to me like we don't get a response to a config space access
> > > during the change of D state. I don't know if it's the write of the D3
> > > state itself or the read back though (it's probably detected on the
> > > read back or a subsequent read, but that doesn't tell me which specific
> > > one failed).
> >
> > If it is just one card doing it (again, check you have latest
> > firmware) I wonder if it is a sketchy PCI-E electrical link that is
> > causing a long re-training cycle? Can you tell if the PCI-E link is
> > permanently gone or does it eventually return?
>
> No, it's 100% reproducible on systems with that specific card model,
> not card instance, and maybe different systems/cards as well, I'll let
> David & Alexey comment further on that.

Well, it's 100% reproducible on a particular model of system
(garrison) with a particular model of card.  I've had some suggestions
that it fails with some other system and card models, but nothing
confirmed - the one other system model I've been able to try, which
also had a newer card model, didn't reproduce the problem.

> > Does the card work in Gen 3 when it starts? Is there any indication of
> > PCI-E link errors?
>
> Nope.
>
> > Every time or sometimes?
> >
> > POWER 8 firmware is good? If the link does eventually come back, is
> > the POWER8's D3 resumption timeout long enough?
> >
> > If this doesn't lead to an obvious conclusion you'll probably need to
> > connect to IBM's Mellanox support team to get more information from
> > the card side.
>
> We are IBM :-) So far, it seems to be that the card is doing something
> not quite right, but we don't know what. We might need to engage
> Mellanox themselves.

Possibly.
On the other hand, I've had it reported that this is a software
regression, at least with downstream Red Hat kernels.  I haven't yet
been able to eliminate factors that might be confusing that, or try to
find a working version upstream.

--
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson