Thanks a lot Ma for trying - I think I need my own system to debug this, so I will keep trying to enable XGMI. I still think this is the right and generic solution for multi-node reset synchronization, and in fact the barrier should also be used for synchronizing PSP mode 1 XGMI reset.

Andrey

On 12/9/19 6:34 AM, Ma, Le wrote:
>
> [AMD Official Use Only - Internal Distribution Only]
>
> Hi Andrey,
>
> I tried your patches on my 2P XGMI platform. BACO works most of the
> time, but randomly hits the following error:
>
> [ 1701.542298] amdgpu: [powerplay] Failed to send message 0x25, response 0x0
>
> This error usually means a sync issue exists in the XGMI BACO case.
> Feel free to debug your patches on my XGMI platform.
>
> Regards,
>
> Ma Le
>
> *From:* Grodzovsky, Andrey
> *Sent:* Saturday, December 7, 2019 5:51 AM
> *To:* Ma, Le; amd-gfx@lists.freedesktop.org; Zhou1, Tao; Deucher, Alexander; Li, Dennis; Zhang, Hawking
> *Cc:* Chen, Guchun
> *Subject:* Re: [PATCH 07/10] drm/amdgpu: add concurrent baco reset support for XGMI
>
> Hey Ma, attached is a solution - it is only compile-tested, as I still
> can't make my XGMI setup work (with the bridge connected, only one device
> is visible to the system while the other is not). Please try it on your
> system if you have a chance.
>
> Andrey
>
> On 12/4/19 10:14 PM, Ma, Le wrote:
>
>     AFAIK it's enough for even a single node in the hive to fail to
>     enter the BACO state on time to fail the entire hive reset
>     procedure, no?
>
>     [Le]: Yeah, agreed. I’ve been thinking that making all nodes enter
>     BACO simultaneously can reduce the risk of a node failing to
>     enter/exit BACO. For example, in an XGMI hive with 8 nodes, the
>     total time for 8 nodes to enter/exit BACO on 8 CPUs is less than
>     the time for 8 nodes to enter BACO serially and exit BACO serially
>     on one CPU with yield capability. This time window is usually
>     strict for the BACO feature itself. Anyway, we need more looping
>     tests later on whichever method we choose.
>
>     Anyway - I see our discussion blocks your entire patch set - I
>     think you can go ahead and commit it your way (I think you got an
>     RB from Hawking), and I will then look and see if I can implement
>     my method; if it works, I will just revert your patch.
>
>     [Le]: OK, fine.
>
>     Andrey
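
For readers following the barrier idea discussed in this thread, here is a minimal user-space sketch in Python. It is not the amdgpu patch and makes no claim about how the driver implements it. It only assumes the general scheme described above: one worker per node, a rendezvous before the "enter" step and another before the "exit" step, and a whole-hive failure if any node misses the rendezvous. All names (reset_hive, node_worker, stuck_node) are hypothetical.

```python
import threading


def reset_hive(num_nodes, stuck_node=None, timeout=1.0):
    """Simulate a lockstep enter/exit across num_nodes workers.

    Returns (ok, log). ok is False if any node missed a rendezvous.
    stuck_node is a hypothetical knob that makes one node never show up.
    """
    barrier = threading.Barrier(num_nodes)
    lock = threading.Lock()
    log = []
    failed = threading.Event()

    def node_worker(node_id):
        if node_id == stuck_node:
            return  # this node never reaches the rendezvous
        try:
            # Rendezvous 1: every node must be ready before any enters.
            barrier.wait(timeout=timeout)
            with lock:
                log.append(("enter", node_id))
            # Rendezvous 2: every node must have entered before any exits.
            barrier.wait(timeout=timeout)
            with lock:
                log.append(("exit", node_id))
        except threading.BrokenBarrierError:
            # One late or missing node breaks the barrier for all waiters.
            failed.set()

    threads = [threading.Thread(target=node_worker, args=(i,))
               for i in range(num_nodes)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return (not failed.is_set(), log)
```

Calling reset_hive(8) yields a log in which all eight "enter" events precede all eight "exit" events, while reset_hive(4, stuck_node=2, timeout=0.2) reports failure for the whole group, which mirrors the point that a single late node fails the entire hive reset.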