From: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com>
To: "Ma, Le" <Le.Ma@amd.com>,
	"amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>
Cc: "Deucher, Alexander" <Alexander.Deucher@amd.com>,
	"Zhou1, Tao" <Tao.Zhou1@amd.com>,
	"Li, Dennis" <Dennis.Li@amd.com>,
	"Chen, Guchun" <Guchun.Chen@amd.com>,
	"Zhang, Hawking" <Hawking.Zhang@amd.com>
Subject: Re: [PATCH 07/10] drm/amdgpu: add concurrent baco reset support for XGMI
Date: Mon, 2 Dec 2019 17:05:08 -0500	[thread overview]
Message-ID: <5b505116-17aa-383d-5cdf-246663a1f4f9@amd.com> (raw)
In-Reply-To: <MN2PR12MB428532FA663C99770AA71263F6430@MN2PR12MB4285.namprd12.prod.outlook.com>




On 12/2/19 6:42 AM, Ma, Le wrote:
>
> [AMD Official Use Only - Internal Distribution Only]
>
> *From:* Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
> *Sent:* Saturday, November 30, 2019 12:22 AM
> *To:* Ma, Le <Le.Ma@amd.com>; amd-gfx@lists.freedesktop.org
> *Cc:* Chen, Guchun <Guchun.Chen@amd.com>; Zhou1, Tao 
> <Tao.Zhou1@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>; 
> Li, Dennis <Dennis.Li@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>
> *Subject:* Re: [PATCH 07/10] drm/amdgpu: add concurrent baco reset 
> support for XGMI
>
> On 11/28/19 4:00 AM, Ma, Le wrote:
>
>     -----Original Message-----
>     From: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
>     Sent: Wednesday, November 27, 2019 11:46 PM
>     To: Ma, Le <Le.Ma@amd.com>; amd-gfx@lists.freedesktop.org
>     Cc: Chen, Guchun <Guchun.Chen@amd.com>; Zhou1, Tao <Tao.Zhou1@amd.com>;
>     Deucher, Alexander <Alexander.Deucher@amd.com>; Li, Dennis <Dennis.Li@amd.com>;
>     Zhang, Hawking <Hawking.Zhang@amd.com>
>     Subject: Re: [PATCH 07/10] drm/amdgpu: add concurrent baco reset support for XGMI
>
>     On 11/27/19 4:15 AM, Le Ma wrote:
>
>     > Currently each XGMI node reset wq does not run in parallel, because
>     > the same work item bound to the same cpu runs in sequence. So change
>     > to bind the xgmi_reset_work items to different cpus.
>
>     It's not the same work item, see more below
>
>     >
>     > XGMI requires all nodes to enter baco within very close proximity
>     > before any node exits baco. So schedule the xgmi_reset_work wq twice
>     > for enter/exit baco respectively.
>     >
>     > The default reset code path and methods do not change for vega20 production:
>     >    - baco reset without xgmi/ras
>     >    - psp reset with xgmi/ras
>     >
>     > To enable baco for the XGMI/RAS case, both conditions below are needed:
>     >    - amdgpu_ras_enable=2
>     >    - baco-supported smu firmware
>     >
>     > The case where PSP reset and baco reset coexist within an XGMI hive
>     > is not taken into consideration.
>     >
>     > Change-Id: I9c08cf90134f940b42e20d2129ff87fba761c532
>     > Signed-off-by: Le Ma <le.ma@amd.com>
>     > ---
>
>     >  drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  2 +
>     >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 78 ++++++++++++++++++++++++++----
>     >  2 files changed, 70 insertions(+), 10 deletions(-)
>     >
>     > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>     > index d120fe5..08929e6 100644
>     > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>     > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>     > @@ -998,6 +998,8 @@ struct amdgpu_device {
>     >      int                             pstate;
>     >      /* enable runtime pm on the device */
>     >      bool                            runpm;
>     > +
>     > +    bool                            in_baco;
>     >  };
>     >
>     >  static inline struct amdgpu_device *amdgpu_ttm_adev(struct ttm_bo_device *bdev)
>     > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>     > index bd387bb..71abfe9 100644
>     > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>     > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>     > @@ -2654,7 +2654,13 @@ static void amdgpu_device_xgmi_reset_func(struct work_struct *__work)
>     >      struct amdgpu_device *adev =
>     >          container_of(__work, struct amdgpu_device, xgmi_reset_work);
>     >
>     > -    adev->asic_reset_res = amdgpu_asic_reset(adev);
>     > +    if (amdgpu_asic_reset_method(adev) == AMD_RESET_METHOD_BACO)
>     > +        adev->asic_reset_res = (adev->in_baco == false) ?
>     > +                amdgpu_device_baco_enter(adev->ddev) :
>     > +                amdgpu_device_baco_exit(adev->ddev);
>     > +    else
>     > +        adev->asic_reset_res = amdgpu_asic_reset(adev);
>     > +
>     >      if (adev->asic_reset_res)
>     >          DRM_WARN("ASIC reset failed with error, %d for drm dev, %s",
>     >                   adev->asic_reset_res, adev->ddev->unique);
>     > @@ -3796,6 +3802,7 @@ static int amdgpu_do_asic_reset(struct amdgpu_hive_info *hive,
>     >      struct amdgpu_device *tmp_adev = NULL;
>     >      bool need_full_reset = *need_full_reset_arg, vram_lost = false;
>     >      int r = 0;
>     > +    int cpu = smp_processor_id();
>     >
>     >      /*
>     >       * ASIC reset has to be done on all HGMI hive nodes ASAP
>     > @@ -3803,21 +3810,24 @@ static int amdgpu_do_asic_reset(struct amdgpu_hive_info *hive,
>     >       */
>     >      if (need_full_reset) {
>     >          list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
>     > -            /* For XGMI run all resets in parallel to speed up the process */
>     > +            /*
>     > +             * For XGMI run all resets in parallel to speed up the
>     > +             * process by scheduling the highpri wq on different
>     > +             * cpus. For XGMI with baco reset, all nodes must enter
>     > +             * baco within close proximity before anyone exit.
>     > +             */
>     >              if (tmp_adev->gmc.xgmi.num_physical_nodes > 1) {
>     > -                if (!queue_work(system_highpri_wq, &tmp_adev->xgmi_reset_work))
>
>     Note that tmp_adev->xgmi_reset_work (the work item) is per device in
>     the XGMI hive, so it's not the same work item. I don't see why you
>     need to explicitly queue them on different CPUs; they should already
>     run in parallel.
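>
>     To illustrate what I mean (from memory, so treat this as a rough
>     sketch rather than the exact driver code): every adev in the hive
>     embeds its own work item, so the loop already submits one distinct
>     item per device:
>
>     struct amdgpu_device {
>         ...
>         struct work_struct  xgmi_reset_work;  /* one instance per device */
>         ...
>     };
>
>     /* during init, each device gets its own work item bound to the reset func */
>     INIT_WORK(&adev->xgmi_reset_work, amdgpu_device_xgmi_reset_func);
>
>     /* in amdgpu_do_asic_reset(): each iteration queues a different work item */
>     list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head)
>         queue_work(system_highpri_wq, &tmp_adev->xgmi_reset_work);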
>
>     Andrey
>
>     [Le]: It's also beyond my understanding why the two nodes' reset work
>     items scheduled to the same cpu do not run in parallel. But from the
>     experiments on my side, the 2nd work item always runs only after the
>     1st work item has finished. Based on this result, I changed to queue
>     them on different CPUs to make sure the case with more XGMI nodes runs
>     in parallel, because baco requires all nodes to enter baco within very
>     close proximity.
>
>     The experiment code is as follows for your reference. When the card0
>     worker is running, the card1 worker is not observed to run.
>
> The code below only tests that they don't run concurrently - but this
> doesn't mean they don't run on different CPUs and threads. I don't have
> an XGMI setup at hand to test this theory, but what if there is some
> locking dependency between them that serializes their execution? Can you
> just add a one-line print inside amdgpu_device_xgmi_reset_func that
> prints the CPU id, thread name/id and card number?
>
> Andrey
>
> [Le]: I checked: if queue_work() is used directly several times, the same
> CPU worker thread is used, and the per-CPU worker executes the items one
> by one. Our goal here is to make xgmi_reset_func run concurrently for the
> XGMI BACO case. That's why I schedule them on different CPUs so they run
> in parallel. And I can share the XGMI system with you if you'd like to
> verify more.
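>
> Schematically, what the patch does for the enter phase is roughly this
> (simplified sketch, error handling dropped):
>
>     int cpu = smp_processor_id();
>
>     list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
>         /* pin each node's reset work item to a different online CPU */
>         queue_work_on(cpu, system_highpri_wq, &tmp_adev->xgmi_reset_work);
>         cpu = cpumask_next(cpu, cpu_online_mask);
>     }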
>

I tried to set up an XGMI 2P system today to test this but wasn't able to
load the driver with the XGMI bridge in place (maybe a faulty bridge) - so
yeah - maybe lend me your setup with the original code (before your
changes) so I can enable some kernel traces that show the CPU id and
thread id to check this. It's just so weird that system_highpri_wq, which
is documented to be multi-cpu and multi-threaded, wouldn't queue those
work items to different cpus and worker threads.
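
Something along these lines inside amdgpu_device_xgmi_reset_func should
give us that data (untested, just a sketch):

    /* which card ran on which CPU and in which worker thread */
    printk("xgmi reset: card 0x%x on cpu %d, worker %s/%d\n",
           adev->pdev->bus->number, raw_smp_processor_id(),
           current->comm, current->pid);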

Andrey


>     +atomic_t card0_in_baco = ATOMIC_INIT(0);
>     +atomic_t card1_in_baco = ATOMIC_INIT(0);
>     +
>      static void amdgpu_device_xgmi_reset_func(struct work_struct *__work)
>      {
>          struct amdgpu_device *adev =
>              container_of(__work, struct amdgpu_device, xgmi_reset_work);
>     +    printk("lema1: card 0x%x goes into reset wq\n", adev->pdev->bus->number);
>     +    if (adev->pdev->bus->number == 0x7) {
>     +        atomic_set(&card1_in_baco, 1);
>     +        printk("lema1: card1 in baco from card1 view\n");
>     +    }
>     +
>          if (amdgpu_asic_reset_method(adev) == AMD_RESET_METHOD_BACO)
>              adev->asic_reset_res = (adev->in_baco == false) ?
>                  amdgpu_device_baco_enter(adev->ddev) :
>     @@ -2664,6 +2673,23 @@ static void amdgpu_device_xgmi_reset_func(struct work_struct *__work)
>          if (adev->asic_reset_res)
>              DRM_WARN("ASIC reset failed with error, %d for drm dev, %s",
>                       adev->asic_reset_res, adev->ddev->unique);
>     +
>     +    if (adev->pdev->bus->number == 0x4) {
>     +        atomic_set(&card0_in_baco, 1);
>     +        printk("lema1: card0 in baco from card0 view\n");
>     +
>     +        while (true)
>     +            if (!!atomic_read(&card1_in_baco))
>     +                break;
>     +        printk("lema1: card1 in baco from card0 view\n");
>     +    }
>     +
>     +    if (adev->pdev->bus->number == 0x7) {
>     +        while (true)
>     +            if (!!atomic_read(&card0_in_baco))
>     +                break;
>     +        printk("lema1: card0 in baco from card1 view\n");
>     +    }
>
>     > +                if (!queue_work_on(cpu, system_highpri_wq,
>     > +                                   &tmp_adev->xgmi_reset_work))
>     >                      r = -EALREADY;
>     > +                cpu = cpumask_next(cpu, cpu_online_mask);
>     >          } else
>     >              r = amdgpu_asic_reset(tmp_adev);
>     > -
>     > -            if (r) {
>     > -                DRM_ERROR("ASIC reset failed with error, %d for drm dev, %s",
>     > -                          r, tmp_adev->ddev->unique);
>     > +            if (r)
>     >                  break;
>     > -            }
>     >          }
>     >
>     > -        /* For XGMI wait for all PSP resets to complete before proceed */
>     > +        /* For XGMI wait for all work to complete before proceed */
>     >          if (!r) {
>     >              list_for_each_entry(tmp_adev, device_list_handle,
>     >                                  gmc.xgmi.head) {
>     > @@ -3826,11 +3836,59 @@ static int amdgpu_do_asic_reset(struct amdgpu_hive_info *hive,
>     >                      r = tmp_adev->asic_reset_res;
>     >                      if (r)
>     >                          break;
>     > +                    if (AMD_RESET_METHOD_BACO ==
>     > +                        amdgpu_asic_reset_method(tmp_adev))
>     > +                        tmp_adev->in_baco = true;
>     >                  }
>     >              }
>     >          }
>     > -    }
>     >
>     > +        /*
>     > +         * For XGMI with baco reset, need exit baco phase by scheduling
>     > +         * xgmi_reset_work one more time. PSP reset skips this phase.
>     > +         * Not assume the situation that PSP reset and baco reset
>     > +         * coexist within an XGMI hive.
>     > +         */
>     > +
>     > +        if (!r) {
>     > +            cpu = smp_processor_id();
>     > +            list_for_each_entry(tmp_adev, device_list_handle,
>     > +                                gmc.xgmi.head) {
>     > +                if (tmp_adev->gmc.xgmi.num_physical_nodes > 1
>     > +                    && AMD_RESET_METHOD_BACO ==
>     > +                       amdgpu_asic_reset_method(tmp_adev)) {
>     > +                    if (!queue_work_on(cpu, system_highpri_wq,
>     > +                                       &tmp_adev->xgmi_reset_work))
>     > +                        r = -EALREADY;
>     > +                    if (r)
>     > +                        break;
>     > +                    cpu = cpumask_next(cpu, cpu_online_mask);
>     > +                }
>     > +            }
>     > +        }
>     > +
>     > +        if (!r) {
>     > +            list_for_each_entry(tmp_adev, device_list_handle,
>     > +                                gmc.xgmi.head) {
>     > +                if (tmp_adev->gmc.xgmi.num_physical_nodes > 1
>     > +                    && AMD_RESET_METHOD_BACO ==
>     > +                       amdgpu_asic_reset_method(tmp_adev)) {
>     > +                    flush_work(&tmp_adev->xgmi_reset_work);
>     > +                    r = tmp_adev->asic_reset_res;
>     > +                    if (r)
>     > +                        break;
>     > +                    tmp_adev->in_baco = false;
>     > +                }
>     > +            }
>     > +        }
>     > +
>     > +        if (r) {
>     > +            DRM_ERROR("ASIC reset failed with error, %d for drm dev, %s",
>     > +                      r, tmp_adev->ddev->unique);
>     > +            goto end;
>     > +        }
>     > +    }
>     >
>     >      list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
>     >          if (need_full_reset) {
>


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
