From: "Christian König" <ckoenig.leichtzumerken@gmail.com>
To: Andrey Grodzovsky <andrey.grodzovsky@amd.com>,
	dri-devel@lists.freedesktop.org, amd-gfx@lists.freedesktop.org,
	linux-pci@vger.kernel.org, daniel.vetter@ffwll.ch,
	Harry.Wentland@amd.com
Cc: ppaalanen@gmail.com, Alexander.Deucher@amd.com,
	gregkh@linuxfoundation.org, helgaas@kernel.org,
	Felix.Kuehling@amd.com
Subject: Re: [PATCH v6 12/16] drm/amdgpu: Prevent any job recoveries after device is unplugged.
Date: Tue, 11 May 2021 08:53:13 +0200
Message-ID: <eda13789-95c7-42c2-320b-b29d5d95e465@gmail.com>
In-Reply-To: <20210510163625.407105-13-andrey.grodzovsky@amd.com>

On 10.05.21 at 18:36, Andrey Grodzovsky wrote:
> Return DRM_GPU_SCHED_STAT_ENODEV back to the scheduler when the device
> is not present so the timeout timer will not be rearmed.
>
> v5: Update to match updated return values in enum drm_gpu_sched_stat
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>

Reviewed-by: Christian König <christian.koenig@amd.com>

> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 19 ++++++++++++++++---
>   1 file changed, 16 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index 759b34799221..d33e6d97cc89 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -25,6 +25,8 @@
>   #include <linux/wait.h>
>   #include <linux/sched.h>
>   
> +#include <drm/drm_drv.h>
> +
>   #include "amdgpu.h"
>   #include "amdgpu_trace.h"
>   
> @@ -34,6 +36,15 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>   	struct amdgpu_job *job = to_amdgpu_job(s_job);
>   	struct amdgpu_task_info ti;
>   	struct amdgpu_device *adev = ring->adev;
> +	int idx;
> +
> +	if (!drm_dev_enter(&adev->ddev, &idx)) {
> +		DRM_INFO("%s - device unplugged skipping recovery on scheduler:%s\n",
> +			 __func__, s_job->sched->name);
> +
> +		/* Effectively the job is aborted as the device is gone */
> +		return DRM_GPU_SCHED_STAT_ENODEV;
> +	}
>   
>   	memset(&ti, 0, sizeof(struct amdgpu_task_info));
>   
> @@ -41,7 +52,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>   	    amdgpu_ring_soft_recovery(ring, job->vmid, s_job->s_fence->parent)) {
>   		DRM_ERROR("ring %s timeout, but soft recovered\n",
>   			  s_job->sched->name);
> -		return DRM_GPU_SCHED_STAT_NOMINAL;
> +		goto exit;
>   	}
>   
>   	amdgpu_vm_get_task_info(ring->adev, job->pasid, &ti);
> @@ -53,13 +64,15 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>   
>   	if (amdgpu_device_should_recover_gpu(ring->adev)) {
>   		amdgpu_device_gpu_recover(ring->adev, job);
> -		return DRM_GPU_SCHED_STAT_NOMINAL;
>   	} else {
>   		drm_sched_suspend_timeout(&ring->sched);
>   		if (amdgpu_sriov_vf(adev))
>   			adev->virt.tdr_debug = true;
> -		return DRM_GPU_SCHED_STAT_NOMINAL;
>   	}
> +
> +exit:
> +	drm_dev_exit(idx);
> +	return DRM_GPU_SCHED_STAT_NOMINAL;
>   }
>   
>   int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
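
The control flow in the quoted patch can be tried outside the kernel with a small userspace sketch. Everything below is a hypothetical stand-in, not the DRM API: mock_dev_enter() and mock_dev_exit() only imitate the enter/exit guard (the real drm_dev_exit() takes just the index), the two enum values mirror DRM_GPU_SCHED_STAT_NOMINAL and DRM_GPU_SCHED_STAT_ENODEV, and mock_sched_handle_timeout() assumes the scheduler rearms its timer on any status other than ENODEV.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for two enum drm_gpu_sched_stat values. */
enum sched_stat {
	SCHED_STAT_NOMINAL,	/* handler ran; the timer may be rearmed */
	SCHED_STAT_ENODEV,	/* device is gone; do not rearm the timer */
};

/* Mock device; 'unplugged' models the state after a hot unplug. */
struct mock_dev {
	bool unplugged;
	int active_sections;	/* currently open enter/exit sections */
	int recoveries;		/* number of times recovery was attempted */
};

/* Imitates the enter guard: fails once the device is unplugged. */
static bool mock_dev_enter(struct mock_dev *dev, int *idx)
{
	if (dev->unplugged)
		return false;
	*idx = dev->active_sections++;
	return true;
}

/* Imitates the exit guard: closes the section opened by the enter call. */
static void mock_dev_exit(struct mock_dev *dev, int idx)
{
	assert(dev->active_sections == idx + 1);
	dev->active_sections--;
}

/* Timeout handler shaped like the patched function: one early return
 * before the guarded section, one shared exit label after it. */
static enum sched_stat mock_job_timedout(struct mock_dev *dev,
					 bool soft_recovered)
{
	int idx;

	if (!mock_dev_enter(dev, &idx))
		return SCHED_STAT_ENODEV;	/* nothing entered, nothing to exit */

	if (soft_recovered)
		goto exit;

	dev->recoveries++;	/* stands in for the full recovery path */

exit:
	mock_dev_exit(dev, idx);
	return SCHED_STAT_NOMINAL;
}

/* Assumed scheduler side: returns true when the timer should be rearmed. */
static bool mock_sched_handle_timeout(struct mock_dev *dev,
				      bool soft_recovered)
{
	return mock_job_timedout(dev, soft_recovered) != SCHED_STAT_ENODEV;
}
```

The point of the single exit label is that every path that successfully entered the guarded section also leaves it, while the unplugged path returns before any section is opened.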



