From: Felix Kuehling <felix.kuehling@amd.com>
To: Jonathan Kim <jonathan.kim@amd.com>,
	amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org
Subject: Re: [PATCH 08/34] drm/amdkfd: fix kfd_suspend_all_processes for gfx941 debugging
Date: Mon, 27 Mar 2023 17:20:48 -0400
Message-ID: <01e7c6a8-e6b6-7d28-1d54-4e065e7e150f@amd.com>
In-Reply-To: <20230327184339.125016-8-jonathan.kim@amd.com>

On 2023-03-27 14:43, Jonathan Kim wrote:
> The debugger for GFX9.4.1 uses kfd_suspend_all_processes to pause the
> compute pipeline so it can safely toggle the SQ's implicit wait on
> barrier setting during debug attach/detach to work around the wave
> exception s_barrier race condition.
>
> For mGPU setups, repeated calls to cancel all outstanding restore work can
> result in an asymmetric permanent cancelling of the restored work from the
> debug device after it has toggled the HW workaround settings.

This is a bit hard to follow. Not sure what you mean by asymmetric.

I think this is a general bug in how kfd_suspend_all_processes and 
kfd_resume_all_processes interact. The latter schedules restore work. If 
that gets cancelled before it gets a chance to run, it will result in 
the queues staying preempted forever. It just happened that the barrier 
waitcount setting workaround on GFXv9.4.1 was good at triggering the bug.
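
To make the imbalance concrete, here is a rough userspace sketch of two
back-to-back suspend/resume cycles. It is a toy model, not the KFD code:
every toy_* name is hypothetical, and it assumes that evictions nest as a
simple depth counter, which this message does not spell out. It compares
cancelling the pending restore with flushing it, as the patch under review
does.

```c
#include <stdbool.h>

/* Toy model only; none of these names exist in the kernel. */
struct toy_process {
	int evicted;          /* assumed eviction depth; queues run only at 0 */
	bool restore_pending; /* a delayed restore has been scheduled */
};

/* Stand-in for the restore work handler: undoes one eviction. */
static void toy_restore_work(struct toy_process *p)
{
	if (p->evicted > 0)
		p->evicted--;
	p->restore_pending = false;
}

/* Models cancelling the work: it is dropped without ever running. */
static void toy_cancel_restore(struct toy_process *p)
{
	p->restore_pending = false;
}

/* Models flushing the work: pending work runs to completion first. */
static void toy_flush_restore(struct toy_process *p)
{
	if (p->restore_pending)
		toy_restore_work(p);
}

/* Suspend: deal with any pending restore, then evict once more. */
static void toy_suspend(struct toy_process *p, bool flush)
{
	if (flush)
		toy_flush_restore(p);
	else
		toy_cancel_restore(p);
	p->evicted++;
}

/* Resume: only schedules the restore; it has not run yet. */
static void toy_resume(struct toy_process *p)
{
	p->restore_pending = true;
}

/*
 * Two suspend/resume cycles where the second suspend arrives before the
 * first restore has run. Returns the final eviction depth after the last
 * scheduled restore finally executes.
 */
int toy_run(bool flush)
{
	struct toy_process p = { 0, false };

	toy_suspend(&p, flush);
	toy_resume(&p);
	toy_suspend(&p, flush);
	toy_resume(&p);
	toy_flush_restore(&p); /* the last delayed restore gets to run */

	return p.evicted;
}
```

In this model toy_run(false) ends with a depth of 1, so the queues never
come back, while toy_run(true) ends at 0.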

I would simplify the description like this:

> Flush delayed restore work in kfd_suspend_all_processes instead of
> cancelling. Cancelling the work before it runs results in the queues
> becoming permanently disabled. Flushing the work ensures that the
> queue suspend/resume state stays balanced.

With the updated description, the patch is

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>


> Instead of cancelling the outstanding restore work, just flush it as it
> will be properly evicted anyways by the current suspend call.
>
> Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
> ---
>   drivers/gpu/drm/amd/amdkfd/kfd_process.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> index 1e3795e7e18d..55a4ddd35e12 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> @@ -2008,7 +2008,7 @@ void kfd_suspend_all_processes(void)
>   	WARN(debug_evictions, "Evicting all processes");
>   	hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) {
>   		cancel_delayed_work_sync(&p->eviction_work);
> -		cancel_delayed_work_sync(&p->restore_work);
> +		flush_delayed_work(&p->restore_work);
>   
>   		if (kfd_process_evict_queues(p, KFD_QUEUE_EVICTION_TRIGGER_SUSPEND))
>   			pr_err("Failed to suspend process 0x%x\n", p->pasid);
