I switched the workqueue we were using for xgmi_reset_work from system_highpri_wq to system_unbound_wq - the difference is that workers servicing the queue in system_unbound_wq are not bounded to specific CPU and so the reset jobs for each XGMI node are getting scheduled to different CPU while system_highpri_wq is a bounded work queue. I traced it as bellow for 10 consecutive times and didn't see errors any more. Also the time diff between BACO entries or exits was never more then around 2 uS. Please give this updated patchset a try    kworker/u16:2-57    [004] ...1   243.276312: trace_code: func: vega20_baco_set_state, line 91 <----- - Before BEACO enter            <...>-60    [007] ...1   243.276312: trace_code: func: vega20_baco_set_state, line 91 <----- - Before BEACO enter    kworker/u16:2-57    [004] ...1   243.276384: trace_code: func: vega20_baco_set_state, line 105 <----- - After BEACO enter done            <...>-60    [007] ...1   243.276392: trace_code: func: vega20_baco_set_state, line 105 <----- - After BEACO enter done    kworker/u16:3-60    [007] ...1   243.276397: trace_code: func: vega20_baco_set_state, line 108 <----- - Before BEACO exit    kworker/u16:2-57    [004] ...1   243.276399: trace_code: func: vega20_baco_set_state, line 108 <----- - Before BEACO exit    kworker/u16:3-60    [007] ...1   243.288067: trace_code: func: vega20_baco_set_state, line 114 <----- - After BEACO exit done    kworker/u16:2-57    [004] ...1   243.295624: trace_code: func: vega20_baco_set_state, line 114 <----- - After BEACO exit done Andrey On 12/9/19 9:45 PM, Ma, Le wrote: > > [AMD Official Use Only - Internal Distribution Only] > > > I’m fine with your solution if synchronization time interval satisfies > BACO requirements and loop test can pass on XGMI system. > > Regards, > > Ma Le > > *From:*Grodzovsky, Andrey > *Sent:* Monday, December 9, 2019 11:52 PM > *To:* Ma, Le ; amd-gfx@lists.freedesktop.org; Zhou1, > Tao ; Deucher, Alexander > ; Li, Dennis ; Zhang, > Hawking > *Cc:* Chen, Guchun > *Subject:* Re: [PATCH 07/10] drm/amdgpu: add concurrent baco reset > support for XGMI > > Thanks a lot Ma for trying - I think I have to have my own system to > debug this so I will keep trying enabling XGMI - i still think the is > the right and the generic solution for multiple nodes reset > synchronization and in fact the barrier should also be used for > synchronizing PSP mode 1 XGMI reset too. > > Andrey > > On 12/9/19 6:34 AM, Ma, Le wrote: > > [AMD Official Use Only - Internal Distribution Only] > > Hi Andrey, > > I tried your patches on my 2P XGMI platform. The baco can work at > most time, and randomly got following error: > > [ 1701.542298] amdgpu: [powerplay] Failed to send message 0x25, > response 0x0 > > This error usually means some sync issue exist for xgmi baco case. > Feel free to debug your patches on my XGMI platform. > > Regards, > > Ma Le > > *From:*Grodzovsky, Andrey > > *Sent:* Saturday, December 7, 2019 5:51 AM > *To:* Ma, Le ; > amd-gfx@lists.freedesktop.org > ; Zhou1, Tao > ; Deucher, Alexander > ; > Li, Dennis ; Zhang, > Hawking > *Cc:* Chen, Guchun > *Subject:* Re: [PATCH 07/10] drm/amdgpu: add concurrent baco reset > support for XGMI > > Hey Ma, attached a solution - it's just compiled as I still can't > make my XGMI setup work (with bridge connected only one device is > visible to the system while the other is not). Please try it on > your system if you have a chance. > > Andrey > > On 12/4/19 10:14 PM, Ma, Le wrote: > > AFAIK it's enough for even single one node in the hive to to > fail the enter the BACO state on time to fail the entire hive > reset procedure, no ? > > [Le]: Yeah, agree that. I’ve been thinking that make all nodes > entering baco simultaneously can reduce the possibility of > node failure to enter/exit BACO risk. For example, in an XGMI > hive with 8 nodes, the total time interval of 8 nodes > enter/exit BACO on 8 CPUs is less than the interval that 8 > nodes enter BACO serially and exit BACO serially depending on > one CPU with yield capability. This interval is usually strict > for BACO feature itself. Anyway, we need more looping test > later on any method we will choose. > > Any way - I see our discussion blocks your entire patch set - > I think you can go ahead and commit yours way (I think you got > an RB from Hawking) and I will look then and see if I can > implement my method and if it works will just revert your patch. > > [Le]: OK, fine. > > Andrey >