I switched the workqueue used for xgmi_reset_work from
system_highpri_wq to system_unbound_wq. The difference is that workers
servicing system_unbound_wq are not bound to a specific CPU, so the
reset jobs for each XGMI node get scheduled on different CPUs, while
system_highpri_wq is a bound workqueue. I traced it as below for 10
consecutive runs and didn't see errors any more. Also, the time
difference between BACO entries or exits was never more than around
2 us.
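
For anyone following the thread, the synchronization idea discussed here (one reset worker per node, all running concurrently and meeting at a barrier before BACO enter and again before BACO exit) can be sketched in userspace. This is only an analogy under stated assumptions, not the amdgpu code: `reset_hive` and `reset_node` are hypothetical names, Python threads stand in for the unbound kworkers, and the log entries stand in for the real BACO transitions.

```python
import threading


def reset_hive(num_nodes):
    """Run one simulated reset worker per node and return the event log."""
    barrier = threading.Barrier(num_nodes)  # reusable rendezvous point
    log = []
    log_lock = threading.Lock()

    def reset_node(node):
        # Wait until every node's worker is running, then "enter BACO".
        barrier.wait()
        with log_lock:
            log.append(("enter", node))
        # Wait until every node has entered, then "exit BACO".
        barrier.wait()
        with log_lock:
            log.append(("exit", node))

    workers = [
        threading.Thread(target=reset_node, args=(node,))
        for node in range(num_nodes)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return log


if __name__ == "__main__":
    for event, node in reset_hive(2):
        print(event, node)
```

Because no worker can pass the second barrier until all of them have logged their enter step, every "enter" is recorded before any "exit", whatever order the scheduler picks within each phase.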

Please give this updated patchset a try.

    kworker/u16:2-57    [004] ...1   243.276312: trace_code: func: vega20_baco_set_state, line 91 <----- - Before BACO enter
            <...>-60    [007] ...1   243.276312: trace_code: func: vega20_baco_set_state, line 91 <----- - Before BACO enter
    kworker/u16:2-57    [004] ...1   243.276384: trace_code: func: vega20_baco_set_state, line 105 <----- - After BACO enter done
            <...>-60    [007] ...1   243.276392: trace_code: func: vega20_baco_set_state, line 105 <----- - After BACO enter done
    kworker/u16:3-60    [007] ...1   243.276397: trace_code: func: vega20_baco_set_state, line 108 <----- - Before BACO exit
    kworker/u16:2-57    [004] ...1   243.276399: trace_code: func: vega20_baco_set_state, line 108 <----- - Before BACO exit
    kworker/u16:3-60    [007] ...1   243.288067: trace_code: func: vega20_baco_set_state, line 114 <----- - After BACO exit done
    kworker/u16:2-57    [004] ...1   243.295624: trace_code: func: vega20_baco_set_state, line 114 <----- - After BACO exit done
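
The gap between the two workers at each trace point can be read off the trace above with a short script. This is a minimal sketch: the `TRACE` string is the same trace with wrapped lines joined and my annotations dropped, and `skew_by_line` is a hypothetical helper name, not part of any tool.

```python
import re
from collections import defaultdict

TRACE = """\
kworker/u16:2-57 [004] ...1 243.276312: trace_code: func: vega20_baco_set_state, line 91
<...>-60 [007] ...1 243.276312: trace_code: func: vega20_baco_set_state, line 91
kworker/u16:2-57 [004] ...1 243.276384: trace_code: func: vega20_baco_set_state, line 105
<...>-60 [007] ...1 243.276392: trace_code: func: vega20_baco_set_state, line 105
kworker/u16:3-60 [007] ...1 243.276397: trace_code: func: vega20_baco_set_state, line 108
kworker/u16:2-57 [004] ...1 243.276399: trace_code: func: vega20_baco_set_state, line 108
kworker/u16:3-60 [007] ...1 243.288067: trace_code: func: vega20_baco_set_state, line 114
kworker/u16:2-57 [004] ...1 243.295624: trace_code: func: vega20_baco_set_state, line 114
"""

# Captures: seconds, microseconds, and the source line number of the trace point.
PATTERN = re.compile(r"\[\d+\]\s+\S+\s+(\d+)\.(\d{6}):.*line (\d+)")


def skew_by_line(trace):
    """Return {source line: gap in microseconds between the workers}."""
    stamps = defaultdict(list)
    for sec, usec, line in PATTERN.findall(trace):
        # Integer microseconds avoid floating-point rounding on the timestamps.
        stamps[int(line)].append(int(sec) * 1_000_000 + int(usec))
    return {line: max(ts) - min(ts) for line, ts in stamps.items()}


if __name__ == "__main__":
    for line, skew in sorted(skew_by_line(TRACE).items()):
        print(f"line {line}: {skew} us")
```

Lines 91 and 108 are the points just before BACO enter and exit. Lines 105 and 114 are completion points, so their gaps also include however long each device took to finish the transition.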

Andrey

On 12/9/19 9:45 PM, Ma, Le wrote:
>
> [AMD Official Use Only - Internal Distribution Only]
>
>
> I’m fine with your solution if the synchronization time interval satisfies 
> BACO requirements and the loop test passes on an XGMI system.
>
> Regards,
>
> Ma Le
>
> *From:*Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
> *Sent:* Monday, December 9, 2019 11:52 PM
> *To:* Ma, Le <Le.Ma@amd.com>; amd-gfx@lists.freedesktop.org; Zhou1, 
> Tao <Tao.Zhou1@amd.com>; Deucher, Alexander 
> <Alexander.Deucher@amd.com>; Li, Dennis <Dennis.Li@amd.com>; Zhang, 
> Hawking <Hawking.Zhang@amd.com>
> *Cc:* Chen, Guchun <Guchun.Chen@amd.com>
> *Subject:* Re: [PATCH 07/10] drm/amdgpu: add concurrent baco reset 
> support for XGMI
>
> Thanks a lot Ma for trying - I think I need my own system to 
> debug this, so I will keep trying to enable XGMI. I still think this is 
> the right and generic solution for multi-node reset 
> synchronization, and in fact the barrier should also be used for 
> synchronizing PSP mode 1 XGMI reset.
>
> Andrey
>
> On 12/9/19 6:34 AM, Ma, Le wrote:
>
>     Hi Andrey,
>
>     I tried your patches on my 2P XGMI platform. BACO works most of
>     the time, but randomly hits the following error:
>
>     [ 1701.542298] amdgpu: [powerplay] Failed to send message 0x25,
>     response 0x0
>
>     This error usually means some sync issue exists in the XGMI BACO case.
>     Feel free to debug your patches on my XGMI platform.
>
>     Regards,
>
>     Ma Le
>
>     *From:*Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
>     *Sent:* Saturday, December 7, 2019 5:51 AM
>     *To:* Ma, Le <Le.Ma@amd.com>; amd-gfx@lists.freedesktop.org;
>     Zhou1, Tao <Tao.Zhou1@amd.com>; Deucher, Alexander
>     <Alexander.Deucher@amd.com>; Li, Dennis <Dennis.Li@amd.com>;
>     Zhang, Hawking <Hawking.Zhang@amd.com>
>     *Cc:* Chen, Guchun <Guchun.Chen@amd.com>
>     *Subject:* Re: [PATCH 07/10] drm/amdgpu: add concurrent baco reset
>     support for XGMI
>
>     Hey Ma, attached is a solution. It's only compile-tested, as I still
>     can't make my XGMI setup work (with the bridge connected, only one
>     device is visible to the system while the other is not). Please try
>     it on your system if you have a chance.
>
>     Andrey
>
>     On 12/4/19 10:14 PM, Ma, Le wrote:
>
>         AFAIK it's enough for even a single node in the hive to
>         fail to enter the BACO state on time to fail the entire hive
>         reset procedure, no?
>
>         [Le]: Yeah, agreed. I’ve been thinking that making all nodes
>         enter BACO simultaneously can reduce the risk of a node
>         failing to enter/exit BACO. For example, in an XGMI
>         hive with 8 nodes, the total time interval for 8 nodes to
>         enter/exit BACO on 8 CPUs is less than the interval for 8
>         nodes to enter BACO serially and exit BACO serially on
>         one CPU with yield capability. This interval is usually strict
>         for the BACO feature itself. Anyway, we need more loop testing
>         later on whichever method we choose.
>
>         Anyway - I see our discussion blocks your entire patch set -
>         I think you can go ahead and commit it your way (I think you got
>         an RB from Hawking), and I will then look and see if I can
>         implement my method; if it works, I will just revert your patch.
>
>         [Le]: OK, fine.
>
>         Andrey
>