dri-devel.lists.freedesktop.org archive mirror
From: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
To: <dri-devel@lists.freedesktop.org>, <amd-gfx@lists.freedesktop.org>
Cc: Monk.Liu@amd.com, horace.chen@amd.com, christian.koenig@amd.com
Subject: [RFC v2 0/8] Define and use reset domain for GPU recovery in amdgpu
Date: Wed, 22 Dec 2021 17:04:58 -0500
Message-ID: <20211222220506.789133-1-andrey.grodzovsky@amd.com>

This patchset is based on earlier work by Boris[1] that made it possible to
have an ordered workqueue at the driver level, used by the different
schedulers to queue their timeout work. On top of that, I also serialized
any GPU reset we trigger from within amdgpu code to go through the same
ordered wq. This somewhat simplifies our GPU reset code, since we no longer
need to protect against concurrency between multiple GPU reset triggers such
as TDR on one hand and a sysfs trigger or RAS trigger on the other.
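
To make the serialization idea concrete, here is a small userspace model of
it. This is only a sketch under my own assumptions, not amdgpu code: the names
ordered_wq, wq_queue_reset and wq_drain are hypothetical, and a plain FIFO
array stands in for the kernel's ordered workqueue. Every trigger, whatever
its source, lands in the same queue and runs strictly in submission order, so
no two resets can overlap.

```c
#include <assert.h>
#include <stddef.h>

/* Reset sources named after the triggers mentioned in the cover letter. */
enum reset_source { RESET_TDR, RESET_SYSFS, RESET_RAS };

#define MAX_PENDING 16

/* Toy stand-in for an ordered workqueue: strict FIFO, one item at a time. */
struct ordered_wq {
	enum reset_source pending[MAX_PENDING];
	size_t head;	/* index of the next item to run */
	size_t tail;	/* index of the next free slot */
	enum reset_source log[MAX_PENDING];	/* order in which resets ran */
	size_t nr_done;
};

/* Queue a reset request; returns 0 on success, -1 if the queue is full. */
int wq_queue_reset(struct ordered_wq *wq, enum reset_source src)
{
	assert(wq != NULL);
	if (wq->tail == MAX_PENDING)
		return -1;
	wq->pending[wq->tail++] = src;
	return 0;
}

/* Run every queued reset in submission order; returns how many ran. */
size_t wq_drain(struct ordered_wq *wq)
{
	size_t ran = 0;

	assert(wq != NULL);
	while (wq->head < wq->tail) {
		/* A real driver would perform the reset here. */
		wq->log[wq->nr_done++] = wq->pending[wq->head++];
		ran++;
	}
	return ran;
}
```

Because there is a single consumer and a single queue, the model needs no
lock or "in reset" flag to keep the triggers apart, which is the
simplification described above.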

As advised by Christian and Daniel, I defined a reset_domain struct such that
all the entities that go through reset together will be serialized against
one another.
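
A rough userspace illustration of the domain idea follows. It is a sketch
under assumptions and does not reproduce the struct from the patches: the
types reset_domain and gpu_device and both helper functions are hypothetical
names of mine. Devices that reset together point at one shared domain and so
share one queue, while a device in a different domain is ordered
independently.

```c
#include <assert.h>
#include <stddef.h>

#define DOMAIN_MAX_PENDING 8

struct gpu_device;

/* Hypothetical reset domain: one FIFO shared by a group of devices. */
struct reset_domain {
	const struct gpu_device *pending[DOMAIN_MAX_PENDING];
	size_t nr_pending;
	size_t nr_done;
};

struct gpu_device {
	const char *name;
	struct reset_domain *domain;	/* shared by all devices in the group */
};

/* Queue a reset for dev on its domain; returns 0, or -1 if the queue is full. */
int device_queue_reset(const struct gpu_device *dev)
{
	struct reset_domain *d = dev->domain;

	assert(d != NULL);
	if (d->nr_pending == DOMAIN_MAX_PENDING)
		return -1;
	d->pending[d->nr_pending++] = dev;
	return 0;
}

/* Run the queued resets of one domain in order; returns how many ran. */
size_t domain_run_pending(struct reset_domain *d)
{
	size_t i, ran = d->nr_pending;

	for (i = 0; i < ran; i++)
		d->nr_done++;	/* a real driver would reset d->pending[i] here */
	d->nr_pending = 0;
	return ran;
}
```

In this model two devices created with the same domain pointer see their
requests in a single ordered list, and running one domain leaves the queue of
any other domain untouched.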

TDRs triggered by multiple entities within the same domain for the same
reason will not all be executed, as the first such reset will cancel all the
pending resets. This applies only to TDR timers and not to explicitly
triggered resets coming from RAS or sysfs; those will still happen after the
in-flight reset finishes.
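
The cancellation rule can be modeled in userspace as below. This is a sketch
under assumptions, not the driver's logic: the names are hypothetical, and I
assume that any reset that actually executes drops the TDR entries still
pending in its domain, while sysfs and RAS entries keep their place in the
queue.

```c
#include <assert.h>
#include <stddef.h>

enum reset_source { RESET_TDR, RESET_SYSFS, RESET_RAS };

#define MAX_PENDING 16

struct reset_queue {
	enum reset_source pending[MAX_PENDING];
	size_t nr_pending;
	enum reset_source executed[MAX_PENDING];	/* resets that really ran */
	size_t nr_executed;
};

/* Queue a reset request; returns 0 on success, -1 if the queue is full. */
int queue_reset(struct reset_queue *q, enum reset_source src)
{
	assert(q != NULL);
	if (q->nr_pending == MAX_PENDING)
		return -1;
	q->pending[q->nr_pending++] = src;
	return 0;
}

/* Drop every pending TDR entry; explicit triggers keep their order. */
void cancel_pending_tdr(struct reset_queue *q)
{
	size_t i, kept = 0;

	for (i = 0; i < q->nr_pending; i++)
		if (q->pending[i] != RESET_TDR)
			q->pending[kept++] = q->pending[i];
	q->nr_pending = kept;
}

/* Run resets in FIFO order; each reset that runs cancels pending TDRs. */
size_t run_resets(struct reset_queue *q)
{
	size_t i, ran = 0;

	while (q->nr_pending > 0) {
		enum reset_source cur = q->pending[0];

		/* Pop the head of the queue. */
		for (i = 1; i < q->nr_pending; i++)
			q->pending[i - 1] = q->pending[i];
		q->nr_pending--;

		cancel_pending_tdr(q);
		q->executed[q->nr_executed++] = cur;
		ran++;
	}
	return ran;
}
```

With the requests TDR, TDR, sysfs, TDR, RAS queued in that order, the model
executes the first TDR, then the sysfs reset, then the RAS reset; the two
later TDR entries are cancelled rather than run.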

v2:
Add handling for the SRIOV configuration: the reset notification coming from
the host and the driver already trigger a work queue to handle the reset, so
drop this intermediate wq and send directly to the timeout wq. (Shaoyun)

[1] https://patchwork.kernel.org/project/dri-devel/patch/20210629073510.2764391-3-boris.brezillon@collabora.com/

P.S. Going through drm-misc-next and not amd-staging-drm-next, as Boris's work hasn't landed there yet.

Andrey Grodzovsky (8):
  drm/amdgpu: Introduce reset domain
  drm/amdgpu: Move scheduler init to after XGMI is ready
  drm/amdgpu: Fix crash on modprobe
  drm/amdgpu: Serialize non TDR gpu recovery with TDRs
  drm/amd/virt: For SRIOV send GPU reset directly to TDR queue.
  drm/amdgpu: Drop hive->in_reset
  drm/amdgpu: Drop concurrent GPU reset protection for device
  drm/amd/virt: Drop concurrent GPU reset protection for SRIOV

 drivers/gpu/drm/amd/amdgpu/amdgpu.h        |   9 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 206 +++++++++++----------
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  |  36 +---
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |   2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |   2 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c   |  10 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h   |   3 +-
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c      |  18 +-
 drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c      |  18 +-
 drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c      |   7 +-
 10 files changed, 147 insertions(+), 164 deletions(-)

-- 
2.25.1



Thread overview: 55+ messages
2021-12-22 22:04 Andrey Grodzovsky [this message]
2021-12-22 22:04 ` [RFC v2 1/8] drm/amdgpu: Introduce reset domain Andrey Grodzovsky
2021-12-22 22:05 ` [RFC v2 2/8] drm/amdgpu: Move scheduler init to after XGMI is ready Andrey Grodzovsky
2021-12-23  8:39   ` Christian König
2021-12-22 22:05 ` [RFC v2 3/8] drm/amdgpu: Fix crash on modprobe Andrey Grodzovsky
2021-12-23  8:40   ` Christian König
2021-12-22 22:05 ` [RFC v2 4/8] drm/amdgpu: Serialize non TDR gpu recovery with TDRs Andrey Grodzovsky
2021-12-23  8:41   ` Christian König
2022-01-05  9:54   ` Lazar, Lijo
2022-01-05 12:31     ` Christian König
2022-01-05 13:11       ` Lazar, Lijo
2022-01-05 13:15         ` Christian König
2022-01-05 13:26           ` Lazar, Lijo
2022-01-05 13:41             ` Christian König
2022-01-05 18:11       ` Andrey Grodzovsky
2022-01-17 19:14         ` Andrey Grodzovsky
2022-01-17 19:17           ` Christian König
2022-01-17 19:21             ` Andrey Grodzovsky
2022-01-26 15:52               ` Andrey Grodzovsky
2022-01-28 16:57                 ` Grodzovsky, Andrey
2022-02-07  2:41                   ` JingWen Chen
2022-02-07  3:08                     ` Grodzovsky, Andrey
2021-12-22 22:13 ` [RFC v2 5/8] drm/amd/virt: For SRIOV send GPU reset directly to TDR queue Andrey Grodzovsky
2021-12-22 22:13   ` [RFC v2 6/8] drm/amdgpu: Drop hive->in_reset Andrey Grodzovsky
2021-12-22 22:13   ` [RFC v2 7/8] drm/amdgpu: Drop concurrent GPU reset protection for device Andrey Grodzovsky
2021-12-22 22:14   ` [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV Andrey Grodzovsky
2021-12-23  8:42     ` Christian König
2021-12-23 10:14       ` Liu, Monk
2021-12-24  8:58         ` Deng, Emily
2021-12-24  9:57           ` JingWen Chen
2021-12-30 18:45             ` Andrey Grodzovsky
2022-01-03 10:17               ` Christian König
2022-01-04  9:07                 ` JingWen Chen
2022-01-04 10:18                   ` Christian König
2022-01-04 10:49                     ` Liu, Monk
2022-01-04 11:36                       ` Christian König
2022-01-04 16:56                         ` Andrey Grodzovsky
2022-01-05  7:34                           ` JingWen Chen
2022-01-05  7:59                             ` Christian König
2022-01-05 18:24                               ` Andrey Grodzovsky
2022-01-06  4:59                                 ` JingWen Chen
2022-01-06  5:18                                   ` JingWen Chen
2022-01-06  9:13                                     ` Christian König
2022-01-06 19:13                                     ` Andrey Grodzovsky
2022-01-07  3:57                                       ` JingWen Chen
2022-01-07  5:46                                         ` JingWen Chen
2022-01-07 16:02                                           ` Andrey Grodzovsky
2022-01-12  6:28                                             ` JingWen Chen
2022-01-04 17:13                         ` Liu, Shaoyun
2022-01-04 20:54                           ` Andrey Grodzovsky
2022-01-05  0:01                             ` Liu, Shaoyun
2022-01-05  7:25                         ` JingWen Chen
2021-12-30 18:39           ` Andrey Grodzovsky
2021-12-23 18:07     ` Liu, Shaoyun
2021-12-23 18:29   ` [RFC v3 5/8] drm/amd/virt: For SRIOV send GPU reset directly to TDR queue Andrey Grodzovsky
