From: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com>
To: "Ma, Le" <Le.Ma@amd.com>,
	"amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>,
	"Zhou1, Tao" <Tao.Zhou1@amd.com>,
	"Deucher, Alexander" <Alexander.Deucher@amd.com>,
	"Li, Dennis" <Dennis.Li@amd.com>,
	"Zhang, Hawking" <Hawking.Zhang@amd.com>
Cc: "Chen, Guchun" <Guchun.Chen@amd.com>
Subject: Re: [PATCH 07/10] drm/amdgpu: add concurrent baco reset support for XGMI
Date: Wed, 11 Dec 2019 09:04:45 -0500
Message-ID: <d64f892c-44af-ff72-e581-f10853896271@amd.com>
In-Reply-To: <MN2PR12MB42855499B960506C3BA62198F65A0@MN2PR12MB4285.namprd12.prod.outlook.com>

Great! I will update the patches to also use the barrier in the PSP
mode 1 reset case and resend them for formal review.

Andrey

On 12/11/19 7:18 AM, Ma, Le wrote:
>
> [AMD Official Use Only - Internal Distribution Only]
>
> With your new patches I ran BACO for about 10 loops and the result
> looks positive: I did not observe the BACO enter/exit message failure
> again.
>
> The time interval between BACO entries or exits in my environment was
> mostly under 10 us (max 36 us, min 2 us). I think that is safe enough
> according to the sample data we collected on both sides.
>
> It also no longer looks necessary to use system_highpri_wq, because
> we only require all the nodes to enter or exit at the same time and
> do not mind how long the interval between enter and exit is.
> system_unbound_wq satisfies that requirement, since it wakes up
> different CPUs to work at the same time.
>
> Regards,
>
> Ma Le
>
> *From:*Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
> *Sent:* Wednesday, December 11, 2019 3:56 AM
> *To:* Ma, Le <Le.Ma@amd.com>; amd-gfx@lists.freedesktop.org; Zhou1, 
> Tao <Tao.Zhou1@amd.com>; Deucher, Alexander 
> <Alexander.Deucher@amd.com>; Li, Dennis <Dennis.Li@amd.com>; Zhang, 
> Hawking <Hawking.Zhang@amd.com>
> *Cc:* Chen, Guchun <Guchun.Chen@amd.com>
> *Subject:* Re: [PATCH 07/10] drm/amdgpu: add concurrent baco reset 
> support for XGMI
>
> I switched the workqueue we were using for xgmi_reset_work from
> system_highpri_wq to system_unbound_wq. The difference is that the
> workers servicing system_unbound_wq are not bound to a specific CPU,
> so the reset jobs for the XGMI nodes get scheduled on different CPUs,
> while system_highpri_wq is a bound workqueue. I traced it as below
> for 10 consecutive runs and didn't see errors any more. Also, the time
> difference between BACO entries or exits was never more than around
> 2 us.
>
> Please give this updated patchset a try.
>
>    kworker/u16:2-57    [004] ...1   243.276312: trace_code: func: vega20_baco_set_state, line 91 <----- Before BACO enter
>            <...>-60    [007] ...1   243.276312: trace_code: func: vega20_baco_set_state, line 91 <----- Before BACO enter
>    kworker/u16:2-57    [004] ...1   243.276384: trace_code: func: vega20_baco_set_state, line 105 <----- After BACO enter done
>            <...>-60    [007] ...1   243.276392: trace_code: func: vega20_baco_set_state, line 105 <----- After BACO enter done
>    kworker/u16:3-60    [007] ...1   243.276397: trace_code: func: vega20_baco_set_state, line 108 <----- Before BACO exit
>    kworker/u16:2-57    [004] ...1   243.276399: trace_code: func: vega20_baco_set_state, line 108 <----- Before BACO exit
>    kworker/u16:3-60    [007] ...1   243.288067: trace_code: func: vega20_baco_set_state, line 114 <----- After BACO exit done
>    kworker/u16:2-57    [004] ...1   243.295624: trace_code: func: vega20_baco_set_state, line 114 <----- After BACO exit done
>
> Andrey
>
> On 12/9/19 9:45 PM, Ma, Le wrote:
>
>     [AMD Official Use Only - Internal Distribution Only]
>
>     I’m fine with your solution if the synchronization time interval
>     satisfies the BACO requirements and the loop test passes on an
>     XGMI system.
>
>     Regards,
>
>     Ma Le
>
>     *From:*Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
>     <mailto:Andrey.Grodzovsky@amd.com>
>     *Sent:* Monday, December 9, 2019 11:52 PM
>     *To:* Ma, Le <Le.Ma@amd.com> <mailto:Le.Ma@amd.com>;
>     amd-gfx@lists.freedesktop.org
>     <mailto:amd-gfx@lists.freedesktop.org>; Zhou1, Tao
>     <Tao.Zhou1@amd.com> <mailto:Tao.Zhou1@amd.com>; Deucher, Alexander
>     <Alexander.Deucher@amd.com> <mailto:Alexander.Deucher@amd.com>;
>     Li, Dennis <Dennis.Li@amd.com> <mailto:Dennis.Li@amd.com>; Zhang,
>     Hawking <Hawking.Zhang@amd.com> <mailto:Hawking.Zhang@amd.com>
>     *Cc:* Chen, Guchun <Guchun.Chen@amd.com> <mailto:Guchun.Chen@amd.com>
>     *Subject:* Re: [PATCH 07/10] drm/amdgpu: add concurrent baco reset
>     support for XGMI
>
>     Thanks a lot, Ma, for trying. I think I need my own system to
>     debug this, so I will keep trying to enable XGMI. I still think
>     this is the right and generic solution for synchronizing the
>     reset of multiple nodes, and in fact the barrier should also be
>     used for synchronizing the PSP mode 1 XGMI reset.
>
>     Andrey
>
>     On 12/9/19 6:34 AM, Ma, Le wrote:
>
>         [AMD Official Use Only - Internal Distribution Only]
>
>         Hi Andrey,
>
>         I tried your patches on my 2P XGMI platform. BACO works most
>         of the time, but I randomly got the following error:
>
>         [ 1701.542298] amdgpu: [powerplay] Failed to send message 0x25, response 0x0
>
>         This error usually means there is a sync issue in the XGMI
>         BACO case. Feel free to debug your patches on my XGMI platform.
>
>         Regards,
>
>         Ma Le
>
>         *From:*Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
>         <mailto:Andrey.Grodzovsky@amd.com>
>         *Sent:* Saturday, December 7, 2019 5:51 AM
>         *To:* Ma, Le <Le.Ma@amd.com> <mailto:Le.Ma@amd.com>;
>         amd-gfx@lists.freedesktop.org
>         <mailto:amd-gfx@lists.freedesktop.org>; Zhou1, Tao
>         <Tao.Zhou1@amd.com> <mailto:Tao.Zhou1@amd.com>; Deucher,
>         Alexander <Alexander.Deucher@amd.com>
>         <mailto:Alexander.Deucher@amd.com>; Li, Dennis
>         <Dennis.Li@amd.com> <mailto:Dennis.Li@amd.com>; Zhang, Hawking
>         <Hawking.Zhang@amd.com> <mailto:Hawking.Zhang@amd.com>
>         *Cc:* Chen, Guchun <Guchun.Chen@amd.com>
>         <mailto:Guchun.Chen@amd.com>
>         *Subject:* Re: [PATCH 07/10] drm/amdgpu: add concurrent baco
>         reset support for XGMI
>
>         Hey Ma, I attached a solution. It is only compile-tested, as
>         I still can't get my XGMI setup to work (with the bridge
>         connected, only one device is visible to the system while the
>         other is not). Please try it on your system if you have a
>         chance.
>
>         Andrey
>
>         On 12/4/19 10:14 PM, Ma, Le wrote:
>
>             AFAIK it's enough for even a single node in the hive to
>             fail to enter the BACO state on time to fail the entire
>             hive reset procedure, no?
>
>             [Le]: Yeah, agreed. My thinking is that making all nodes
>             enter BACO simultaneously reduces the risk of a node
>             failing to enter/exit BACO. For example, in an XGMI hive
>             with 8 nodes, the total time for 8 nodes to enter/exit
>             BACO on 8 CPUs is less than the time for 8 nodes to enter
>             BACO serially and exit BACO serially on one CPU with
>             yield capability. This interval is usually strict for the
>             BACO feature itself. Anyway, we need more loop testing
>             later on whichever method we choose.
>
>             Anyway, I see our discussion blocks your entire patch
>             set. I think you can go ahead and commit it your way (I
>             think you got an RB from Hawking). I will then look at
>             whether I can implement my method and, if it works, will
>             just revert your patch.
>
>             [Le]: OK, fine.
>
>             Andrey
>
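
The barrier idea discussed in this thread (every node's reset job waits until all nodes are ready, then all enter BACO together, then all exit together) can be sketched in userspace. This is not the amdgpu patch: it assumes POSIX threads on Linux in place of kernel workqueue workers, and `node_work`, `hive_result`, `run_hive_reset` and `MAX_NODES` are hypothetical names made up for the illustration. The BACO steps are placeholders that only record a timestamp.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

#define MAX_NODES 8 /* hypothetical cap on hive size for this sketch */

/* Hypothetical stand-in for one node's reset work item. */
struct node_work {
    pthread_barrier_t *barrier; /* shared by every node in the hive */
    int64_t enter_ns;           /* taken where "enter BACO" would start */
    int64_t exit_ns;            /* taken where "exit BACO" would start */
};

struct hive_result {
    int64_t enter_spread_ns; /* latest enter minus earliest enter */
    int64_t exit_spread_ns;  /* latest exit minus earliest exit */
    int64_t gap_ns;          /* earliest exit minus latest enter */
};

static int64_t now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

static void *node_reset(void *arg)
{
    struct node_work *w = arg;

    pthread_barrier_wait(w->barrier); /* wait until every node is ready */
    w->enter_ns = now_ns();           /* placeholder for "enter BACO" */

    pthread_barrier_wait(w->barrier); /* wait until every node has entered */
    w->exit_ns = now_ns();            /* placeholder for "exit BACO" */
    return NULL;
}

/* Runs n concurrent "node resets"; returns 0 on success, -1 on bad input. */
static int run_hive_reset(unsigned int n, struct hive_result *res)
{
    pthread_t tid[MAX_NODES];
    struct node_work work[MAX_NODES];
    pthread_barrier_t barrier;
    int64_t min_in, max_in, min_out, max_out;
    unsigned int i;

    if (n == 0 || n > MAX_NODES)
        return -1;
    if (pthread_barrier_init(&barrier, NULL, n) != 0)
        return -1;

    for (i = 0; i < n; i++) {
        work[i].barrier = &barrier;
        /* Sketch only: a failed create would leave peers stuck at the barrier. */
        if (pthread_create(&tid[i], NULL, node_reset, &work[i]) != 0)
            abort();
    }
    for (i = 0; i < n; i++)
        pthread_join(tid[i], NULL);
    pthread_barrier_destroy(&barrier);

    min_in = max_in = work[0].enter_ns;
    min_out = max_out = work[0].exit_ns;
    for (i = 1; i < n; i++) {
        if (work[i].enter_ns < min_in) min_in = work[i].enter_ns;
        if (work[i].enter_ns > max_in) max_in = work[i].enter_ns;
        if (work[i].exit_ns < min_out) min_out = work[i].exit_ns;
        if (work[i].exit_ns > max_out) max_out = work[i].exit_ns;
    }
    res->enter_spread_ns = max_in - min_in;
    res->exit_spread_ns = max_out - min_out;
    res->gap_ns = min_out - max_in;
    /* The second barrier means no node exits before every node has entered. */
    assert(res->gap_ns >= 0);
    return 0;
}
```

Calling `run_hive_reset(2, &res)` on a `struct hive_result res` fills in the enter and exit spreads for a two-node hive; build with `cc -pthread`. The actual spread depends on how quickly the scheduler wakes the waiting threads, which is the same concern behind the choice of workqueue in the thread above.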

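The "never more than around 2 us" figure can be checked against the trace quoted above by comparing the two workers' timestamps stage by stage. The sketch below does that arithmetic with the eight timestamps from the trace, converted from seconds to microseconds; the struct and function names are hypothetical and chosen only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One traced stage: the timestamp (in us) seen on each of the two workers. */
struct stage_sample {
    const char *stage;
    int is_start;    /* 1 for the "before ..." trace points */
    int64_t cpu4_us; /* worker on CPU 004 */
    int64_t cpu7_us; /* worker on CPU 007 */
};

/* Timestamps copied from the trace in this thread (seconds -> us). */
static const struct stage_sample trace[] = {
    { "before BACO enter", 1, 243276312, 243276312 },
    { "after BACO enter",  0, 243276384, 243276392 },
    { "before BACO exit",  1, 243276399, 243276397 },
    { "after BACO exit",   0, 243295624, 243288067 },
};

/* Absolute time difference between the two workers at one stage. */
static int64_t skew_us(const struct stage_sample *s)
{
    int64_t d = s->cpu4_us - s->cpu7_us;

    return d < 0 ? -d : d;
}

/* Largest skew over the stages where a BACO enter or exit is started. */
static int64_t max_start_skew_us(void)
{
    int64_t max = 0;
    size_t i;

    for (i = 0; i < sizeof(trace) / sizeof(trace[0]); i++) {
        if (trace[i].is_start && skew_us(&trace[i]) > max)
            max = skew_us(&trace[i]);
    }
    return max;
}
```

For this one trace, the two "before" points differ by 0 us and 2 us, which matches the figure given in the message. The "after" points differ by more (8 us and 7557 us); that difference presumably reflects how long each node took to complete the step rather than how closely the two nodes started it.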