From: Alex Deucher <alexdeucher@gmail.com>
To: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Cc: "Deucher, Alexander" <alexander.deucher@amd.com>,
	Nirmoy <nirmodas@amd.com>,
	Christian Koenig <christian.koenig@amd.com>,
	amd-gfx list <amd-gfx@lists.freedesktop.org>,
	Dennis Li <Dennis.Li@amd.com>
Subject: Re: [PATCH v2 3/7] drm/amdgpu: Block all job scheduling activity during DPC recovery
Date: Fri, 28 Aug 2020 15:28:04 -0400
Message-ID: <CADnq5_PUifk==tNW0NRWU6_qgT7fgoeX-Y_j3Y--Y+706zs7BA@mail.gmail.com>
In-Reply-To: <1598630743-21155-4-git-send-email-andrey.grodzovsky@amd.com>

On Fri, Aug 28, 2020 at 12:06 PM Andrey Grodzovsky
<andrey.grodzovsky@amd.com> wrote:
>
> DPC recovery involves ASIC reset just as normal GPU recovery so blosk

Typo: "block"

> SW GPU scedulers and wait on all concurent GPU resets.

Typos: "schedulers" and "concurrent"

>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 57 +++++++++++++++++++++++++++---
>  1 file changed, 53 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index e67cbf2..9a367a8 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4745,6 +4745,20 @@ int amdgpu_device_baco_exit(struct drm_device *dev)
>         return 0;
>  }
>
> +static void amdgpu_cancel_all_tdr(struct amdgpu_device *adev)
> +{
> +       int i;
> +
> +       for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> +               struct amdgpu_ring *ring = adev->rings[i];
> +
> +               if (!ring || !ring->sched.thread)
> +                       continue;
> +
> +               cancel_delayed_work_sync(&ring->sched.work_tdr);
> +       }
> +}
> +
>  /**
>   * amdgpu_pci_error_detected - Called when a PCI error is detected.
>   * @pdev: PCI device struct
> @@ -4758,16 +4772,38 @@ pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev, pci_channel_sta
>  {
>         struct drm_device *dev = pci_get_drvdata(pdev);
>         struct amdgpu_device *adev = drm_to_adev(dev);
> +       int i;
>
>         DRM_INFO("PCI error: detected callback, state(%d)!!\n", state);
>
>         switch (state) {
>         case pci_channel_io_normal:
>                 return PCI_ERS_RESULT_CAN_RECOVER;
> -       case pci_channel_io_frozen: {
> -               /* Fatal error, prepare for slot reset */
> +       case pci_channel_io_frozen: { /* Fatal error, prepare for slot reset */
> +
> +               /*
> +                * Cancel and wait for all TDRs in progress if failing to
> +                * set  adev->in_gpu_reset in amdgpu_device_lock_adev
> +                *
> +                * Locking adev->reset_sem will perevent any external access

Typo: "prevent"

> +                * to GPU during PCI error recovery
> +                */
> +               while (!amdgpu_device_lock_adev(adev, NULL))
> +                       amdgpu_cancel_all_tdr(adev);
> +
> +               /*
> +                * Block any work scheduling as we do for regualr GPU reset

Typo: "regular"

> +                * for the duration of the recoveryq

Typo: "recovery"

Overall looks good to me, but you might want to run the scheduling
changes by Christian as well.  With the typos fixed:
Acked-by: Alex Deucher <alexander.deucher@amd.com>


> +                */
> +               for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> +                       struct amdgpu_ring *ring = adev->rings[i];
> +
> +                       if (!ring || !ring->sched.thread)
> +                               continue;
> +
> +                       drm_sched_stop(&ring->sched, NULL);
> +               }
>
> -               amdgpu_device_lock_adev(adev);
>                 return PCI_ERS_RESULT_NEED_RESET;
>         }
>         case pci_channel_io_perm_failure:
> @@ -4900,8 +4936,21 @@ void amdgpu_pci_resume(struct pci_dev *pdev)
>  {
>         struct drm_device *dev = pci_get_drvdata(pdev);
>         struct amdgpu_device *adev = drm_to_adev(dev);
> +       int i;
>
> -       amdgpu_device_unlock_adev(adev);
>
>         DRM_INFO("PCI error: resume callback!!\n");
> +
> +       for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> +               struct amdgpu_ring *ring = adev->rings[i];
> +
> +               if (!ring || !ring->sched.thread)
> +                       continue;
> +
> +
> +               drm_sched_resubmit_jobs(&ring->sched);
> +               drm_sched_start(&ring->sched, true);
> +       }
> +
> +       amdgpu_device_unlock_adev(adev);
>  }
> --
> 2.7.4
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
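
Editorial sketch (not part of the original message or patch): the ordering
the quoted patch relies on can be modelled in plain userspace C. Everything
below is a hypothetical stand-in: the mock_* types and functions, MAX_RINGS,
and the rule that a pending TDR is the only holder of the reset flag are
assumptions made for illustration, not the amdgpu or drm_sched API. The
sketch shows three steps: retry the device lock while cancelling pending
timeout handlers, park every ring that has a scheduler thread, and on resume
resubmit and restart those rings before dropping the lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_RINGS 4 /* hypothetical; stands in for AMDGPU_MAX_RINGS */

/* Hypothetical stand-ins for the driver structures, not the real types. */
struct mock_ring {
    bool has_thread;  /* mirrors the ring->sched.thread check */
    bool tdr_pending; /* a timeout handler is queued and owns the reset */
    bool stopped;     /* scheduler is parked */
    int resubmits;    /* how many times jobs were resubmitted */
};

struct mock_dev {
    struct mock_ring *rings[MAX_RINGS];
    bool in_reset;    /* models adev->in_gpu_reset */
};

/* Try to take the reset lock; fails while another reset owns it. */
bool mock_lock(struct mock_dev *d)
{
    if (d->in_reset)
        return false;
    d->in_reset = true;
    return true;
}

/*
 * Model of cancelling (and waiting for) every pending timeout handler.
 * Assumption of this sketch: a pending handler is what holds the reset
 * flag, so once it is gone the flag is released.
 */
void mock_cancel_all_tdr(struct mock_dev *d)
{
    for (int i = 0; i < MAX_RINGS; ++i) {
        struct mock_ring *ring = d->rings[i];

        if (!ring || !ring->has_thread)
            continue;
        if (ring->tdr_pending) {
            ring->tdr_pending = false;
            d->in_reset = false;
        }
    }
}

/* Error detected: win the lock first, then park every active ring. */
void mock_error_detected(struct mock_dev *d)
{
    while (!mock_lock(d))
        mock_cancel_all_tdr(d);

    for (int i = 0; i < MAX_RINGS; ++i) {
        struct mock_ring *ring = d->rings[i];

        if (!ring || !ring->has_thread)
            continue;
        ring->stopped = true;
    }
}

/* Resume: resubmit and restart rings, and only then release the lock. */
void mock_resume(struct mock_dev *d)
{
    assert(d->in_reset); /* resume is only valid while recovery holds the lock */

    for (int i = 0; i < MAX_RINGS; ++i) {
        struct mock_ring *ring = d->rings[i];

        if (!ring || !ring->has_thread)
            continue;
        ring->resubmits++;
        ring->stopped = false;
    }
    d->in_reset = false;
}
```

In this model the lock is released last in mock_resume(), which matches the
patch moving amdgpu_device_unlock_adev() to the end of amdgpu_pci_resume().
Note that the retry loop only terminates here because of the stated
assumption that a pending handler owns the flag.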
