All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Christian König" <ckoenig.leichtzumerken@gmail.com>
To: Andrey Grodzovsky <andrey.grodzovsky@amd.com>,
	dri-devel@lists.freedesktop.org, amd-gfx@lists.freedesktop.org
Cc: horace.chen@amd.com, christian.koenig@amd.com, Monk.Liu@amd.com
Subject: Re: [RFC 0/6] Define and use reset domain for GPU recovery in amdgpu
Date: Mon, 20 Dec 2021 08:25:05 +0100	[thread overview]
Message-ID: <0a30778e-28b8-7d02-01e9-9db690227222@gmail.com> (raw)
In-Reply-To: <20211217222745.881637-1-andrey.grodzovsky@amd.com>

Am 17.12.21 um 23:27 schrieb Andrey Grodzovsky:
> This patchset is based on earlier work by Boris[1] that allowed to have an
> ordered workqueue at the driver level that will be used by the different
> schedulers to queue their timeout work. On top of that I also serialized
> any GPU reset we trigger from within amdgpu code to also go through the same
> ordered wq and in this way simplify somewhat our GPU reset code so we don't need
> to protect from concurrency by multiple GPU reset triggeres such as TDR on one
> hand and sysfs trigger or RAS trigger on the other hand.
>
> As advised by Christian and Daniel I defined a reset_domain struct such that
> all the entities that go through reset together will be serialized one against
> another.
>
> TDR triggered by multiple entities within the same domain due to the same reason will not
> be triggered as the first such reset will cancel all the pending resets. This is
> relevant only to TDR timers and not to triggered resets coming from RAS or SYSFS,
> those will still happen after the in flight resets finishes.
>
> [1] https://patchwork.kernel.org/project/dri-devel/patch/20210629073510.2764391-3-boris.brezillon@collabora.com/
>
> P.S Going through drm-misc-next and not amd-staging-drm-next as Boris work hasn't landed yet there.

Patches #1 and #5, #6 are Reviewed-by: Christian König 
<christian.koenig@amd.com>

Some minor comments on the rest, but in general absolutely looks like 
the way we want to go.

Regards,
Christian.

>
> Andrey Grodzovsky (6):
>    drm/amdgpu: Init GPU reset single threaded wq
>    drm/amdgpu: Move scheduler init to after XGMI is ready
>    drm/amdgpu: Fix crash on modprobe
>    drm/amdgpu: Serialize non TDR gpu recovery with TDRs
>    drm/amdgpu: Drop hive->in_reset
>    drm/amdgpu: Drop concurrent GPU reset protection for device
>
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |   9 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 206 +++++++++++----------
>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  |  36 +---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |   2 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |   2 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c   |  10 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h   |   3 +-
>   7 files changed, 132 insertions(+), 136 deletions(-)
>


WARNING: multiple messages have this Message-ID (diff)
From: "Christian König" <ckoenig.leichtzumerken@gmail.com>
To: Andrey Grodzovsky <andrey.grodzovsky@amd.com>,
	dri-devel@lists.freedesktop.org, amd-gfx@lists.freedesktop.org
Cc: daniel@ffwll.ch, horace.chen@amd.com, christian.koenig@amd.com,
	Monk.Liu@amd.com
Subject: Re: [RFC 0/6] Define and use reset domain for GPU recovery in amdgpu
Date: Mon, 20 Dec 2021 08:25:05 +0100	[thread overview]
Message-ID: <0a30778e-28b8-7d02-01e9-9db690227222@gmail.com> (raw)
In-Reply-To: <20211217222745.881637-1-andrey.grodzovsky@amd.com>

Am 17.12.21 um 23:27 schrieb Andrey Grodzovsky:
> This patchset is based on earlier work by Boris[1] that allowed to have an
> ordered workqueue at the driver level that will be used by the different
> schedulers to queue their timeout work. On top of that I also serialized
> any GPU reset we trigger from within amdgpu code to also go through the same
> ordered wq and in this way simplify somewhat our GPU reset code so we don't need
> to protect from concurrency by multiple GPU reset triggeres such as TDR on one
> hand and sysfs trigger or RAS trigger on the other hand.
>
> As advised by Christian and Daniel I defined a reset_domain struct such that
> all the entities that go through reset together will be serialized one against
> another.
>
> TDR triggered by multiple entities within the same domain due to the same reason will not
> be triggered as the first such reset will cancel all the pending resets. This is
> relevant only to TDR timers and not to triggered resets coming from RAS or SYSFS,
> those will still happen after the in flight resets finishes.
>
> [1] https://patchwork.kernel.org/project/dri-devel/patch/20210629073510.2764391-3-boris.brezillon@collabora.com/
>
> P.S Going through drm-misc-next and not amd-staging-drm-next as Boris work hasn't landed yet there.

Patches #1 and #5, #6 are Reviewed-by: Christian König 
<christian.koenig@amd.com>

Some minor comments on the rest, but in general absolutely looks like 
the way we want to go.

Regards,
Christian.

>
> Andrey Grodzovsky (6):
>    drm/amdgpu: Init GPU reset single threaded wq
>    drm/amdgpu: Move scheduler init to after XGMI is ready
>    drm/amdgpu: Fix crash on modprobe
>    drm/amdgpu: Serialize non TDR gpu recovery with TDRs
>    drm/amdgpu: Drop hive->in_reset
>    drm/amdgpu: Drop concurrent GPU reset protection for device
>
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |   9 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 206 +++++++++++----------
>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  |  36 +---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |   2 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |   2 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c   |  10 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h   |   3 +-
>   7 files changed, 132 insertions(+), 136 deletions(-)
>


  parent reply	other threads:[~2021-12-20  7:25 UTC|newest]

Thread overview: 46+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-12-17 22:27 [RFC 0/6] Define and use reset domain for GPU recovery in amdgpu Andrey Grodzovsky
2021-12-17 22:27 ` Andrey Grodzovsky
2021-12-17 22:27 ` [RFC 1/6] drm/amdgpu: Init GPU reset single threaded wq Andrey Grodzovsky
2021-12-17 22:27   ` Andrey Grodzovsky
2021-12-17 22:27 ` [RFC 2/6] drm/amdgpu: Move scheduler init to after XGMI is ready Andrey Grodzovsky
2021-12-17 22:27   ` Andrey Grodzovsky
2021-12-20  7:16   ` Christian König
2021-12-20  7:16     ` Christian König
2021-12-20 21:51     ` Andrey Grodzovsky
2021-12-20 21:51       ` Andrey Grodzovsky
2021-12-21  7:05       ` Christian König
2021-12-21  7:05         ` Christian König
2021-12-17 22:27 ` [RFC 3/6] drm/amdgpu: Fix crash on modprobe Andrey Grodzovsky
2021-12-17 22:27   ` Andrey Grodzovsky
2021-12-20  7:17   ` Christian König
2021-12-20  7:17     ` Christian König
2021-12-20 19:22     ` Andrey Grodzovsky
2021-12-20 19:22       ` Andrey Grodzovsky
2021-12-21  7:02       ` Christian König
2021-12-21  7:02         ` Christian König
2021-12-21 16:03         ` Andrey Grodzovsky
2021-12-21 16:03           ` Andrey Grodzovsky
2021-12-22  7:50           ` Christian König
2021-12-22  7:50             ` Christian König
2021-12-17 22:27 ` [RFC 4/6] drm/amdgpu: Serialize non TDR gpu recovery with TDRs Andrey Grodzovsky
2021-12-17 22:27   ` Andrey Grodzovsky
2021-12-20  7:20   ` Christian König
2021-12-20  7:20     ` Christian König
2021-12-20 22:17     ` Andrey Grodzovsky
2021-12-20 22:17       ` Andrey Grodzovsky
2021-12-21  7:59       ` Christian König
2021-12-21  7:59         ` Christian König
2021-12-21 16:10         ` Andrey Grodzovsky
2021-12-21 16:10           ` Andrey Grodzovsky
2021-12-17 22:27 ` [RFC 5/6] drm/amdgpu: Drop hive->in_reset Andrey Grodzovsky
2021-12-17 22:27   ` Andrey Grodzovsky
2021-12-17 22:27 ` [RFC 6/6] drm/amdgpu: Drop concurrent GPU reset protection for device Andrey Grodzovsky
2021-12-17 22:27   ` Andrey Grodzovsky
2021-12-20  7:25 ` Christian König [this message]
2021-12-20  7:25   ` [RFC 0/6] Define and use reset domain for GPU recovery in amdgpu Christian König
2021-12-20  9:43   ` Daniel Vetter
2021-12-20  9:43     ` Daniel Vetter
2021-12-20 17:06   ` Liu, Shaoyun
2021-12-20 17:06     ` Liu, Shaoyun
2021-12-20 19:11     ` Andrey Grodzovsky
2021-12-20 19:11       ` Andrey Grodzovsky

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=0a30778e-28b8-7d02-01e9-9db690227222@gmail.com \
    --to=ckoenig.leichtzumerken@gmail.com \
    --cc=Monk.Liu@amd.com \
    --cc=amd-gfx@lists.freedesktop.org \
    --cc=andrey.grodzovsky@amd.com \
    --cc=christian.koenig@amd.com \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=horace.chen@amd.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.