From: Felix Kuehling <felix.kuehling-5C7GfCeVMHo@public.gmane.org>
To: Alex Sierra <alex.sierra-5C7GfCeVMHo@public.gmane.org>,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
Subject: Re: [PATCH 2/2] amd/amdgpu: force to trigger a no-retry-fault after a retry-fault
Date: Mon, 18 Nov 2019 17:46:45 -0500
Message-ID: <f60eeb60-712f-6aa4-2660-86970b92c637@amd.com>
In-Reply-To: <20191118222435.93134-2-alex.sierra-5C7GfCeVMHo@public.gmane.org>

On 2019-11-18 17:24, Alex Sierra wrote:
> Only for the debugger use case.
>
> [why]
> Avoid endless translation retries, after an invalid address access has
> been issued to the GPU. Instead, the trap handler is forced to enter by
> generating a no-retry-fault.
> A s_trap instruction is inserted in the debugger case to let the wave to
> enter trap handler to save context.
>
> [how]
> Intentionally using an invalid flag combination (F and P set at the same
> time) to trigger a no-retry-fault, after a retry-fault happens. This is
> only valid under compute context.
>
> Change-Id: I4180c30e2631dc0401cbd6171f8a6776e4733c9a
> Signed-off-by: Alex Sierra <alex.sierra@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 6 ++++++
>   1 file changed, 6 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> index d51ac8771ae0..358a4f50fcfb 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> @@ -3207,6 +3207,12 @@ bool amdgpu_vm_handle_fault(struct amdgpu_device *adev, unsigned int pasid,
>   		value = adev->dummy_page_addr;
>   		flags |= AMDGPU_PTE_EXECUTABLE | AMDGPU_PTE_READABLE |
>   			AMDGPU_PTE_WRITEABLE;
> +
> +		if (vm->is_compute_context) {
> +			/* Setting PTE flags to trigger a no-retry-fault  */
> +			flags = AMDGPU_PTE_EXECUTABLE | AMDGPU_PDE_PTE |
> +				AMDGPU_PTE_TF;

Hmm, this looks like you're setting flags twice in the compute case:
once with the |= above, and then again with a plain assignment that
overwrites it. I was also expecting something more like this:

if (vm->is_compute_context) {
     ...
} else if (amdgpu_vm_fault_stop == AMDGPU_VM_FAULT_STOP_NEVER) {
     ...
} else {
     ...
}

I.e. for compute contexts, we do our compute-specific thing, otherwise 
the behaviour depends on the amdgpu_vm_fault_stop setting.
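
To make the suggested structure concrete, here is a minimal standalone
sketch of the three-way branch. It is not the driver code: every
SKETCH_* name, the sketch_pick_pte() helper, the base_flags parameter
and the bit values are placeholders made up for illustration, not the
real AMDGPU_PTE_* definitions. The branch bodies are borrowed from the
quoted hunk, and leaving value at 0 in the compute branch is purely an
assumption, since the restructured version does not say what value
should be there.

```c
#include <assert.h>   /* only needed for checks like the ones in the note below */
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bits for illustration; NOT the real AMDGPU_PTE_* values. */
#define SKETCH_PTE_READABLE   (1ULL << 0)
#define SKETCH_PTE_WRITEABLE  (1ULL << 1)
#define SKETCH_PTE_EXECUTABLE (1ULL << 2)
#define SKETCH_PTE_TF         (1ULL << 3)
#define SKETCH_PDE_PTE        (1ULL << 4)

/* Hypothetical stand-in for the amdgpu_vm_fault_stop setting. */
enum sketch_fault_stop {
	SKETCH_FAULT_STOP_NEVER,
	SKETCH_FAULT_STOP_OTHER,
};

struct sketch_pte {
	uint64_t value;
	uint64_t flags;
};

/* Pick the PTE value/flags to write after a retry fault. */
static struct sketch_pte sketch_pick_pte(bool is_compute_context,
					 enum sketch_fault_stop fault_stop,
					 uint64_t dummy_page_addr,
					 uint64_t base_flags)
{
	struct sketch_pte pte = { .value = 0, .flags = base_flags };

	if (is_compute_context) {
		/*
		 * Compute: flags are assigned exactly once, using the
		 * intentionally invalid combination from the patch to
		 * force a no-retry fault. value stays 0 (assumption).
		 */
		pte.flags = SKETCH_PTE_EXECUTABLE | SKETCH_PDE_PTE |
			    SKETCH_PTE_TF;
	} else if (fault_stop == SKETCH_FAULT_STOP_NEVER) {
		/* Redirect the access to the dummy page. */
		pte.value = dummy_page_addr;
		pte.flags |= SKETCH_PTE_EXECUTABLE | SKETCH_PTE_READABLE |
			     SKETCH_PTE_WRITEABLE;
	} else {
		/* Let the hw retry silently on the PTE. */
		pte.value = 0;
	}

	return pte;
}
```

One visible difference from the hunk as posted: there, the compute
check sits inside the branch that sets up the dummy page, so it only
takes effect on that path. In this sketch a compute context gets the
no-retry-fault flags no matter what the fault-stop setting is, e.g.
sketch_pick_pte(true, SKETCH_FAULT_STOP_OTHER, 0x1000, 0).flags is
the same as with SKETCH_FAULT_STOP_NEVER.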

Regards,
   Felix

> +		}
>   	} else {
>   		/* Let the hw retry silently on the PTE */
>   		value = 0;