From: "Christian König" <deathsimple-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
To: "Liu, Monk" <Monk.Liu-5C7GfCeVMHo@public.gmane.org>,
	"amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org"
	<amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org>
Subject: Re: [PATCH 1/4] drm/amdgpu:don't invoke srio-gpu-reset in gpu-reset
Date: Mon, 8 May 2017 11:50:38 +0200
Message-ID: <43340817-5a77-cf51-b9bd-8d8ca0b37415@vodafone.de>
In-Reply-To: <DM5PR12MB1610851C6F0BDD8FAC54DE1284EE0-2J9CzHegvk++jCVTvoAFKAdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>

> Because we can always rely on TDR and HYPERVISOR to detect GPU hang and resubmit malicious jobs or even kick them out later,
> and the gpu reset will eventually be invoked, so there is no reason to manually and voluntarily call gpu reset under SRIOV case.
Well, there is a rather good reason: we detect that something is wrong 
much faster than by waiting for the timeout.

But I agree that it was broken before as well and we can fix that later. 
Please add a code comment that this needs more work.
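
As a rough illustration of where such a comment could sit, here is a minimal user-space sketch of the call-site split the patch makes. Everything in it is a hypothetical stand-in: `struct mock_device` and the `mock_*` functions only mimic amdgpu_sriov_vf(), amdgpu_gpu_reset(), amdgpu_sriov_gpu_reset(), amdgpu_irq_reset_work_func() and amdgpu_job_timedout() from the quoted diff, and the counters exist only so the behaviour can be observed. It is not the driver code.

```c
#include <stdbool.h>

/* Hypothetical stand-in for struct amdgpu_device; only what the sketch needs. */
struct mock_device {
	bool is_sriov_vf;      /* mimics the result of amdgpu_sriov_vf(adev) */
	int bare_metal_resets; /* how often the bare-metal reset path ran */
	int sriov_resets;      /* how often the SR-IOV reset path ran */
};

/* Mimics amdgpu_gpu_reset() after the patch: no forwarding to the SR-IOV path. */
static int mock_gpu_reset(struct mock_device *dev)
{
	dev->bare_metal_resets++;
	return 0;
}

/* Mimics amdgpu_sriov_gpu_reset(); costly because other VFs get paused. */
static int mock_sriov_gpu_reset(struct mock_device *dev)
{
	dev->sriov_resets++;
	return 0;
}

/* Mimics the interrupt-triggered reset work item. */
static void mock_irq_reset_work(struct mock_device *dev)
{
	/*
	 * TODO: this needs more work. Under SR-IOV the fault is currently
	 * ignored here and only recovered once the job timeout fires, which
	 * is slower than reacting to the interrupt directly.
	 */
	if (!dev->is_sriov_vf)
		mock_gpu_reset(dev);
}

/* Mimics the job timeout handler: the only caller that picks the SR-IOV path. */
static void mock_job_timedout(struct mock_device *dev)
{
	if (dev->is_sriov_vf)
		mock_sriov_gpu_reset(dev);
	else
		mock_gpu_reset(dev);
}
```

With a device marked as a VF, calling mock_irq_reset_work() leaves both counters untouched and only mock_job_timedout() bumps sriov_resets; on bare metal both callers go through mock_gpu_reset().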

With that fixed feel free to add my rb on it.

Christian.

Am 08.05.2017 um 11:42 schrieb Liu, Monk:
> The VM fault or illegal instruction interrupt will be delivered to the GPU whether it's the SR-IOV or the bare-metal case,
> and I stopped them from invoking GPU reset for the same reason:
> don't trigger GPU reset in the SR-IOV case if possible. Always be aware that triggering a GPU reset under SR-IOV is costly (it needs to take full access mode on the GPU, so all
> other VFs will be paused for a while).
>
> Because we can always rely on TDR and HYPERVISOR to detect GPU hang and resubmit malicious jobs or even kick them out later,
> and the gpu reset will eventually be invoked, so there is no reason to manually and voluntarily call gpu reset under SRIOV case.
>
> BR Monk
>
>
> -----Original Message-----
> From: Christian König [mailto:deathsimple@vodafone.de]
> Sent: Monday, May 08, 2017 5:34 PM
> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 1/4] drm/amdgpu:don't invoke srio-gpu-reset in gpu-reset
>
> Sounds good, but what do we do with the amdgpu_irq_reset_work_func?
>
> Please note that I find that calling amdgpu_gpu_reset() here is a bad idea in the first place.
>
> Instead we should consider the scheduler as faulting and let the scheduler handle that in the same way as a job timeout.
>
> But I'm not sure if those interrupts are actually sent under SRIOV or if the hypervisor handles them somehow.
>
> Christian.
>
> Am 08.05.2017 um 11:24 schrieb Liu, Monk:
>> I agree with disabling debugfs for amdgpu_reset when SRIOV detected.
>>
>> -----Original Message-----
>> From: Christian König [mailto:deathsimple@vodafone.de]
>> Sent: Monday, May 08, 2017 5:20 PM
>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH 1/4] drm/amdgpu:don't invoke srio-gpu-reset in
>> gpu-reset
>>
>>> You know that gpu reset under SR-IOV will have very big impact on all other VFs ...
>> Mhm, good argument. But in this case we need to give at least some warning message instead of doing nothing.
>>
>> Or even better, disable creating the amdgpu_reset debugfs file altogether. This way nobody will wonder why using it doesn't trigger anything.
>>
>> Christian.
>>
>> Am 08.05.2017 um 11:10 schrieb Liu, Monk:
>>> For the SR-IOV use case, we call gpu reset only when we have no choice ...
>>>
>>> So places like debugfs shouldn't be a good reason to trigger gpu
>>> reset
>>>
>>> You know that gpu reset under SR-IOV will have very big impact on all other VFs ...
>>>
>>> BR Monk
>>>
>>> -----Original Message-----
>>> From: Christian König [mailto:deathsimple@vodafone.de]
>>> Sent: Monday, May 08, 2017 5:08 PM
>>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>>> Subject: Re: [PATCH 1/4] drm/amdgpu:don't invoke srio-gpu-reset in
>>> gpu-reset
>>>
>>> Am 08.05.2017 um 08:51 schrieb Monk Liu:
>>>> Because we don't want to do sriov-gpu-reset under certain cases,
>>>> just split those two functions and don't invoke the sr-iov one from
>>>> the bare-metal one.
>>>>
>>>> Change-Id: I641126c241e2ee2dfd54e6d16c389b159f99cfe0
>>>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>>>> ---
>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 ---
>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 3 ++-
>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c    | 2 +-
>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c    | 3 ++-
>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    | 6 +++++-
>>>>      5 files changed, 10 insertions(+), 7 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> index 45a60a6..4985a7e 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> @@ -2652,9 +2652,6 @@ int amdgpu_gpu_reset(struct amdgpu_device *adev)
>>>>      	int resched;
>>>>      	bool need_full_reset;
>>>>      
>>>> -	if (amdgpu_sriov_vf(adev))
>>>> -		return amdgpu_sriov_gpu_reset(adev, true);
>>>> -
>>>>      	if (!amdgpu_check_soft_reset(adev)) {
>>>>      		DRM_INFO("No hardware hang detected. Did some blocks stall?\n");
>>>>      		return 0;
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>> index 5772ef2..d7523d1 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>> @@ -651,7 +651,8 @@ static int amdgpu_debugfs_gpu_reset(struct seq_file *m, void *data)
>>>>      	struct amdgpu_device *adev = dev->dev_private;
>>>>      
>>>>      	seq_printf(m, "gpu reset\n");
>>>> -	amdgpu_gpu_reset(adev);
>>>> +	if (!amdgpu_sriov_vf(adev))
>>>> +		amdgpu_gpu_reset(adev);
>>> Well that is clearly not a good idea. Why do you want to disable the reset here?
>>>
>>>>      
>>>>      	return 0;
>>>>      }
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>>> index 67be795..5bcbea0 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>>> @@ -221,7 +221,7 @@ void amdgpu_gem_object_close(struct
>>>> drm_gem_object *obj,
>>>>      
>>>>      static int amdgpu_gem_handle_lockup(struct amdgpu_device *adev, int r)
>>>>      {
>>>> -	if (r == -EDEADLK) {
>>>> +	if (r == -EDEADLK && !amdgpu_sriov_vf(adev)) {
>>> Not a problem of your patch, but that stuff is outdated and should have been removed completely years ago. Going to take care of that.
>>>
>>>>      		r = amdgpu_gpu_reset(adev);
>>>>      		if (!r)
>>>>      			r = -EAGAIN;
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
>>>> index f8a6c95..49c6e6e 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
>>>> @@ -89,7 +89,8 @@ static void amdgpu_irq_reset_work_func(struct work_struct *work)
>>>>      	struct amdgpu_device *adev = container_of(work, struct amdgpu_device,
>>>>      						  reset_work);
>>>>      
>>>> -	amdgpu_gpu_reset(adev);
>>>> +	if (!amdgpu_sriov_vf(adev))
>>>> +		amdgpu_gpu_reset(adev);
>>> Mhm, that disables the reset on an invalid register access or invalid command stream. Is that really what we want?
>>>
>>> Christian.
>>>
>>>>      }
>>>>      
>>>>      /* Disable *all* interrupts */
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> index 690ef3d..c7718af 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> @@ -36,7 +36,11 @@ static void amdgpu_job_timedout(struct amd_sched_job *s_job)
>>>>      		  job->base.sched->name,
>>>>      		  atomic_read(&job->ring->fence_drv.last_seq),
>>>>      		  job->ring->fence_drv.sync_seq);
>>>> -	amdgpu_gpu_reset(job->adev);
>>>> +
>>>> +	if (amdgpu_sriov_vf(job->adev))
>>>> +		amdgpu_sriov_gpu_reset(job->adev, true);
>>>> +	else
>>>> +		amdgpu_gpu_reset(job->adev);
>>>>      }
>>>>      
>>>>      int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned
>>>> num_ibs,
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx@lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx


