From: "Liu, Monk" <Monk.Liu-5C7GfCeVMHo@public.gmane.org>
To: "Koenig,
	Christian" <Christian.Koenig-5C7GfCeVMHo@public.gmane.org>,
	"Nicolai Hähnle"
	<nhaehnle-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>,
	"amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org"
	<amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org>,
	"Daenzer, Michel" <Michel.Daenzer-5C7GfCeVMHo@public.gmane.org>
Subject: RE: [PATCH 09/12] drm/amdgpu/sriov:return -ENODEV if gpu reseted
Date: Tue, 10 Oct 2017 07:12:55 +0000
Message-ID: <BLUPR12MB04497B5442F66C969861742584750@BLUPR12MB0449.namprd12.prod.outlook.com>
In-Reply-To: <85c67ae9-bfe2-390a-79d0-6e5872b9be62-5C7GfCeVMHo@public.gmane.org>

Then the question is: how do we handle recovery if VRAM is lost?

-----Original Message-----
From: Koenig, Christian 
Sent: October 10, 2017 14:59
To: Liu, Monk <Monk.Liu@amd.com>; Nicolai Hähnle <nhaehnle@gmail.com>; amd-gfx@lists.freedesktop.org; Daenzer, Michel <Michel.Daenzer@amd.com>
Subject: Re: [PATCH 09/12] drm/amdgpu/sriov:return -ENODEV if gpu reseted

As Nicolai explained, that approach simply won't work.

The fd is used by more than just the closed-source Vulkan driver, and I think even by some components not developed by AMD (common X code? Michel, please comment as well).

So closing and reopening it to handle a GPU reset is simply not an option.

Regards,
Christian.

Am 10.10.2017 um 06:26 schrieb Liu, Monk:
> After VRAM is lost, all clients, whether radv/mesa/ogl, are useless.
>
> Any driver using this FD should be denied by the KMD after VRAM is lost,
> and the UMD can destroy/close this FD, re-open it, and rebuild all resources.
>
> That's the only option for the VRAM-lost case.
>
>
>
> -----Original Message-----
> From: Nicolai Hähnle [mailto:nhaehnle@gmail.com]
> Sent: October 9, 2017 19:01
> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian 
> <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 09/12] drm/amdgpu/sriov:return -ENODEV if gpu 
> reseted
>
> On 09.10.2017 10:35, Liu, Monk wrote:
>> Please be aware that this policy is what strict mode defines and
>> what the customer wants. Also, please check the VK spec: it defines that
>> after a GPU reset every vk INSTANCE should close/release its
>> resource/device/ctx and all buffers, and then call re-initvkinstance.
> Sorry, but you simply cannot implement a correct user-space implementation of those specs on top of this.
>
> It will break as soon as you have both OpenGL and Vulkan running in the same process (or heck, our Vulkan and radv :)), because both drivers will use the same fd.
>
> Cheers,
> Nicolai
>
>
>
>> So this whole approach is simply aligned with the spec. To avoid
>> affecting current MESA/OGL clients, I put the whole approach into
>> strict mode, and strict mode is not selected by default.
>>
>>
>> BR Monk
>>
>> -----Original Message-----
>> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
>> Sent: October 9, 2017 16:26
>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH 09/12] drm/amdgpu/sriov:return -ENODEV if gpu 
>> reseted
>>
>> Am 30.09.2017 um 08:03 schrieb Monk Liu:
>>> for SRIOV strict mode gpu reset:
>>>
>>> In kms open we record the latest adev->gpu_reset_counter in fpriv, and
>>> we return -ENODEV in cs_ioctl or info_ioctl if they find
>>> fpriv->gpu_reset_counter != adev->gpu_reset_counter.
>>>
>>> This way we prevent a potentially bad process/FD from submitting cmds
>>> and notify userspace with -ENODEV.
>>>
>>> Userspace should close all BO/ctx and re-open the dri FD to re-create
>>> the virtual memory system for this process.
>> The whole approach is a NAK from my side.
>>
>> We need to enable userspace to continue, not force it into process termination to recover. Otherwise we could send a SIGTERM in the first place.
>>
>> Regards,
>> Christian.
>>
>>> Change-Id: Ib4c179f28a3d0783837566f29de07fc14aa9b9a4
>>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>>> ---
>>>     drivers/gpu/drm/amd/amdgpu/amdgpu.h     | 1 +
>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c  | 5 +++++
>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 7 +++++++
>>>     3 files changed, 13 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> index de9c164..b40d4ba 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> @@ -772,6 +772,7 @@ struct amdgpu_fpriv {
>>>     	struct idr		bo_list_handles;
>>>     	struct amdgpu_ctx_mgr	ctx_mgr;
>>>     	u32			vram_lost_counter;
>>> +	int gpu_reset_counter;
>>>     };
>>>     
>>>     /*
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>> index 9467cf6..6a1515e 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>> @@ -1199,6 +1199,11 @@ int amdgpu_cs_ioctl(struct drm_device *dev, void *data, struct drm_file *filp)
>>>     	if (amdgpu_kms_vram_lost(adev, fpriv))
>>>     		return -ENODEV;
>>>     
>>> +	if (amdgpu_sriov_vf(adev) &&
>>> +		amdgpu_sriov_reset_level == 1 &&
>>> +		fpriv->gpu_reset_counter < atomic_read(&adev->gpu_reset_counter))
>>> +		return -ENODEV;
>>> +
>>>     	parser.adev = adev;
>>>     	parser.filp = filp;
>>>     
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>> index 282f45b..bd389cf 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>> @@ -285,6 +285,11 @@ static int amdgpu_info_ioctl(struct drm_device *dev, void *data, struct drm_file
>>>     	if (amdgpu_kms_vram_lost(adev, fpriv))
>>>     		return -ENODEV;
>>>     
>>> +	if (amdgpu_sriov_vf(adev) &&
>>> +		amdgpu_sriov_reset_level == 1 &&
>>> +		fpriv->gpu_reset_counter < atomic_read(&adev->gpu_reset_counter))
>>> +		return -ENODEV;
>>> +
>>>     	switch (info->query) {
>>>     	case AMDGPU_INFO_ACCEL_WORKING:
>>>     		ui32 = adev->accel_working;
>>> @@ -824,6 +829,8 @@ int amdgpu_driver_open_kms(struct drm_device *dev, struct drm_file *file_priv)
>>>     		goto out_suspend;
>>>     	}
>>>     
>>> +	fpriv->gpu_reset_counter = atomic_read(&adev->gpu_reset_counter);
>>> +
>>>     	r = amdgpu_vm_init(adev, &fpriv->vm,
>>>     			   AMDGPU_VM_CONTEXT_GFX, 0);
>>>     	if (r) {
>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>
>
> --
> Lerne, wie die Welt wirklich ist,
> Aber vergiss niemals, wie sie sein sollte.
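
For readers following the thread, the per-fd counter check in the quoted patch, and the shared-fd objection raised against it, can be sketched in plain userspace C. This is only an illustrative model under stated assumptions: every `sketch_*` name is hypothetical and is not a kernel or libdrm API. Only the `<` comparison and the -ENODEV return value mirror the quoted diff.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for the device: counts GPU resets so far. */
struct sketch_device {
	int gpu_reset_counter;
};

/* Hypothetical stand-in for per-fd state: snapshot taken at open time. */
struct sketch_fpriv {
	int gpu_reset_counter;
};

/* Models the open path: remember the reset count seen when the fd was opened. */
static void sketch_open(const struct sketch_device *dev,
			struct sketch_fpriv *fpriv)
{
	fpriv->gpu_reset_counter = dev->gpu_reset_counter;
}

/* Models a GPU reset: the device-wide counter moves forward. */
static void sketch_gpu_reset(struct sketch_device *dev)
{
	dev->gpu_reset_counter++;
}

/* Models the ioctl check: an fd opened before the last reset is rejected. */
static int sketch_submit(const struct sketch_device *dev,
			 const struct sketch_fpriv *fpriv)
{
	if (fpriv->gpu_reset_counter < dev->gpu_reset_counter)
		return -ENODEV;
	return 0;
}

/*
 * Two user-space drivers in one process share a single fd (one fpriv).
 * After a reset both are rejected, and only a fresh open recovers, so one
 * driver cannot recover without the other also giving up the shared fd.
 */
int sketch_demo(void)
{
	struct sketch_device dev = { 0 };
	struct sketch_fpriv shared, fresh;

	sketch_open(&dev, &shared);
	assert(sketch_submit(&dev, &shared) == 0);       /* "Vulkan" submit */
	assert(sketch_submit(&dev, &shared) == 0);       /* "OpenGL" submit */

	sketch_gpu_reset(&dev);
	assert(sketch_submit(&dev, &shared) == -ENODEV); /* both users fail */

	sketch_open(&dev, &fresh);                       /* models re-open */
	assert(sketch_submit(&dev, &fresh) == 0);
	return 0;
}
```

Calling `sketch_demo()` from a small `main()` exercises the whole sequence. The model deliberately leaves out SR-IOV, the strict-mode switch, and atomics, since it is only meant to show why the stale snapshot affects every user of the same fd.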


