RE: TDR and VRAM lost handling in KMD:

From: "Liu, Monk" <Monk.Liu-5C7GfCeVMHo@public.gmane.org>
To: "Liu, Monk" <Monk.Liu-5C7GfCeVMHo@public.gmane.org>,
	"Koenig,
	Christian" <Christian.Koenig-5C7GfCeVMHo@public.gmane.org>,
	"Zhou,
	David(ChunMing)" <David1.Zhou-5C7GfCeVMHo@public.gmane.org>,
	"Haehnle, Nicolai" <Nicolai.Haehnle-5C7GfCeVMHo@public.gmane.org>,
	"Olsak, Marek" <Marek.Olsak-5C7GfCeVMHo@public.gmane.org>,
	"Deucher,
	Alexander" <Alexander.Deucher-5C7GfCeVMHo@public.gmane.org>
Cc: "Ramirez,
	Alejandro" <Alejandro.Ramirez-5C7GfCeVMHo@public.gmane.org>,
	"amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org"
	<amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org>,
	"Filipas, Mario" <Mario.Filipas-5C7GfCeVMHo@public.gmane.org>,
	"Ding, Pixel" <Pixel.Ding-5C7GfCeVMHo@public.gmane.org>,
	"Li, Bingley" <Bingley.Li-5C7GfCeVMHo@public.gmane.org>,
	"Jiang, Jerry (SW)" <Jerry.Jiang-5C7GfCeVMHo@public.gmane.org>
Subject: RE: TDR and VRAM lost handling in KMD:
Date: Wed, 11 Oct 2017 13:59:41 +0000	[thread overview]
Message-ID: <BLUPR12MB04497EDD5AE48484E7A18C2F844A0@BLUPR12MB0449.namprd12.prod.outlook.com> (raw)
In-Reply-To: <BLUPR12MB044961DEE4326E94A156ED05844A0-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>

But if we keep counter in entity, there is one issue I suddenly though of :

For regular user context, after vram lost UMD will aware this context is LOST since we have a counter copy in context, so user space can close it and re-create one
But for kernel entity, since no U/K interface, so it is kernel's responsibility to recover this kernel entity to work, that make things complicated ....

Emm, I agree that keep a copy in context and in job is a good move

BR Monk

-----Original Message-----
From: amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] On Behalf Of Liu, Monk
Sent: Wednesday, October 11, 2017 9:51 PM
To: Koenig, Christian <Christian.Koenig@amd.com>; Zhou, David(ChunMing) <David1.Zhou@amd.com>; Haehnle, Nicolai <Nicolai.Haehnle@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>
Cc: Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; amd-gfx@lists.freedesktop.org; Filipas, Mario <Mario.Filipas@amd.com>; Ding, Pixel <Pixel.Ding@amd.com>; Li, Bingley <Bingley.Li@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>
Subject: RE: TDR and VRAM lost handling in KMD:

> Some jobs don't have a context (VM updates, clears, buffer moves).

What? I remember even the VM update job is with a kernel entity, (no context is true), and if entity can keep a counter copy That can solve your concerns 

-----Original Message-----
From: Koenig, Christian
Sent: Wednesday, October 11, 2017 9:39 PM
To: Liu, Monk <Monk.Liu@amd.com>; Zhou, David(ChunMing) <David1.Zhou@amd.com>; Haehnle, Nicolai <Nicolai.Haehnle@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>
Cc: Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; amd-gfx@lists.freedesktop.org; Filipas, Mario <Mario.Filipas@amd.com>; Ding, Pixel <Pixel.Ding@amd.com>; Li, Bingley <Bingley.Li@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>
Subject: Re: TDR and VRAM lost handling in KMD:

Some jobs don't have a context (VM updates, clears, buffer moves).

I would still like to abort those when they where issued before a losing VRAM content, but keep the entity usable.

So I think we should just keep a copy of the VRAM lost counter in the job. That also removes us from the burden of figuring out the context during job run.

Regards,
Christian.

Am 11.10.2017 um 15:35 schrieb Liu, Monk:
> I think just compare the copy from context/entity with current counter 
> is enough, don't see how it's better to keep another copy in JOB
>
>
> -----Original Message-----
> From: Koenig, Christian
> Sent: Wednesday, October 11, 2017 6:40 PM
> To: Zhou, David(ChunMing) <David1.Zhou@amd.com>; Liu, Monk 
> <Monk.Liu@amd.com>; Haehnle, Nicolai <Nicolai.Haehnle@amd.com>; Olsak, 
> Marek <Marek.Olsak@amd.com>; Deucher, Alexander 
> <Alexander.Deucher@amd.com>
> Cc: Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; 
> amd-gfx@lists.freedesktop.org; Filipas, Mario <Mario.Filipas@amd.com>; 
> Ding, Pixel <Pixel.Ding@amd.com>; Li, Bingley <Bingley.Li@amd.com>; 
> Jiang, Jerry (SW) <Jerry.Jiang@amd.com>
> Subject: Re: TDR and VRAM lost handling in KMD:
>
> I've already posted a patch for this on the mailing list.
>
> Basically we just copy the vram lost counter into the job and when we try to run the job we can mark it as canceled.
>
> Regards,
> Christian.
>
> Am 11.10.2017 um 12:14 schrieb Chunming Zhou:
>> Your summary lacks the below issue:
>>
>> How about the job already pushed in scheduler queue when vram is lost?
>>
>>
>> Regards,
>> David Zhou
>> On 2017年10月11日 17:41, Liu, Monk wrote:
>>> Okay, let me summary our whole idea together and see if it works:
>>>
>>> 1, For cs_submit, always check vram-lost_counter first and reject 
>>> the submit (return -ECANCLED to UMD) if ctx->vram_lost_counter !=
>>> adev->vram_lost_counter. That way the vram lost issue can be handled
>>>
>>> 2, for cs_submit we still need to check if the incoming context is 
>>> "AMDGPU_CTX_GUILTY_RESET" or not even if we found
>>> ctx->vram_lost_counter == adev->vram_lost_counter, and we can reject
>>> the submit
>>> If it is "AMDGPU_CTX_GUILTY_RESET", correct ?
>>>
>>> 3, in gpu_reset() routine, we only mark the hang job's entity as 
>>> guilty (so we need to add new member in entity structure), and not 
>>> kick it out in gpu_reset() stage, but we need to set the context 
>>> behind this entity as " AMDGPU_CTX_GUILTY_RESET"
>>>     And if reset introduces VRAM LOST, we just update
>>> adev->vram_lost_counter, but *don't* change all entity to guilty, so
>>> still only the hang job's entity is "guilty"
>>>     After some entity marked as "guilty", we find a way to set the 
>>> context behind it as AMDGPU_CTX_GUILTY_RESET, because this is U/K 
>>> interface, we need let UMD can know that this context is wrong.
>>>
>>> 4, in gpu scheduler's run_job() routine, since it only reads entity, 
>>> so we skip job scheduling once found the entity is "guilty"
>>>
>>>
>>> Does above sounds good ?
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Haehnle, Nicolai
>>> Sent: Wednesday, October 11, 2017 5:26 PM
>>> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian 
>>> <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; 
>>> Deucher, Alexander <Alexander.Deucher@amd.com>
>>> Cc: amd-gfx@lists.freedesktop.org; Ding, Pixel <Pixel.Ding@amd.com>; 
>>> Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, Bingley 
>>> <Bingley.Li@amd.com>; Ramirez, Alejandro 
>>> <Alejandro.Ramirez@amd.com>; Filipas, Mario <Mario.Filipas@amd.com>
>>> Subject: Re: TDR and VRAM lost handling in KMD:
>>>
>>> On 11.10.2017 11:18, Liu, Monk wrote:
>>>> Let's talk it simple, When vram lost hit,  what's the action for 
>>>> amdgpu_ctx_query()/AMDGPU_CTX_OP_QUERY_STATE on other contexts (not 
>>>> the one trigger gpu hang) after vram lost ? do you mean we return 
>>>> -ENODEV to UMD ?
>>> It should successfully return AMDGPU_CTX_INNOCENT_RESET.
>>>
>>>
>>>> In cs_submit, with vram lost hit, if we don't mark all contexts as 
>>>> "guilty", how we block its from submitting ? can you show some 
>>>> implement way
>>> if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter))
>>>        return -ECANCELED;
>>>
>>> (where ctx->vram_lost_counter is initialized at context creation 
>>> time and never changed afterwards)
>>>
>>>
>>>> BTW: the "guilty" here is a new member I want to add to context, it 
>>>> is not related with AMDGPU_CTX_OP_QUERY_STATE UK interface, Looks I 
>>>> need to unify them and only one place to mark guilty or not
>>> Right, the AMDGPU_CTX_OP_QUERY_STATE handling needs to be made 
>>> consistent with the rest.
>>>
>>> Cheers,
>>> Nicolai
>>>
>>>
>>>> BR Monk
>>>>
>>>> -----Original Message-----
>>>> From: Haehnle, Nicolai
>>>> Sent: Wednesday, October 11, 2017 5:00 PM
>>>> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian 
>>>> <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; 
>>>> Deucher, Alexander <Alexander.Deucher@amd.com>
>>>> Cc: amd-gfx@lists.freedesktop.org; Ding, Pixel 
>>>> <Pixel.Ding@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, 
>>>> Bingley <Bingley.Li@amd.com>; Ramirez, Alejandro 
>>>> <Alejandro.Ramirez@amd.com>; Filipas, Mario <Mario.Filipas@amd.com>
>>>> Subject: Re: TDR and VRAM lost handling in KMD:
>>>>
>>>> On 11.10.2017 10:48, Liu, Monk wrote:
>>>>> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), 
>>>>> so it's reasonable to use it. However, it /does not/ make sense to 
>>>>> mark idle contexts as "guilty" just because VRAM is lost. VRAM 
>>>>> lost is a perfect example where the driver should report context 
>>>>> lost to applications with the "innocent" flag for contexts that 
>>>>> were idle at the time of reset. The only context(s) that should be 
>>>>> reported as "guilty"
>>>>> (or perhaps "unknown" in some cases) are the ones that were 
>>>>> executing at the time of reset.
>>>>>
>>>>> ML: KMD mark all contexts as guilty is because that way we can 
>>>>> unify our IOCTL behavior: e.g. for IOCTL only block 
>>>>> “guilty”context , no need to worry about vram-lost-counter 
>>>>> anymore, that’s a implementation style. I don’t think it is 
>>>>> related with UMD layer,
>>>>>
>>>>> For UMD the gl-context isn’t aware of by KMD, so UMD can implement 
>>>>> it own “guilty” gl-context if you want.
>>>> Well, to some extent this is just semantics, but it helps to keep 
>>>> the terminology consistent.
>>>>
>>>> Most importantly, please keep the AMDGPU_CTX_OP_QUERY_STATE uapi in
>>>> mind: this returns one of
>>>> AMDGPU_CTX_{GUILTY,INNOCENT,UNKNOWN}_RECENT,
>>>> and it must return "innocent" for contexts that are only lost due 
>>>> to VRAM lost without being otherwise involved in the timeout that 
>>>> lead to the reset.
>>>>
>>>> The point is that in the places where you used "guilty" it would be 
>>>> better to use "context lost", and then further differentiate 
>>>> between guilty/innocent context lost based on the details of what happened.
>>>>
>>>>
>>>>> If KMD doesn’t mark all ctx as guilty after VRAM lost, can you 
>>>>> illustrate what rule KMD should obey to check in KMS IOCTL like 
>>>>> cs_sumbit ?? let’s see which way better
>>>> if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter))
>>>>         return -ECANCELED;
>>>>
>>>> Plus similar logic for AMDGPU_CTX_OP_QUERY_STATE.
>>>>
>>>> Yes, it's one additional check in cs_submit. If you're worried 
>>>> about that (and Christian's concerns about possible issues with 
>>>> walking over all contexts are addressed), I suppose you could just 
>>>> store a per-context
>>>>
>>>>       unsigned context_reset_status;
>>>>
>>>> instead of a `bool guilty`. Its value would start out as 0
>>>> (AMDGPU_CTX_NO_RESET) and would be set to the correct value during 
>>>> reset.
>>>>
>>>> Cheers,
>>>> Nicolai
>>>>
>>>>
>>>>> *From:*Haehnle, Nicolai
>>>>> *Sent:* Wednesday, October 11, 2017 4:41 PM
>>>>> *To:* Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian 
>>>>> <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; 
>>>>> Deucher, Alexander <Alexander.Deucher@amd.com>
>>>>> *Cc:* amd-gfx@lists.freedesktop.org; Ding, Pixel 
>>>>> <Pixel.Ding@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, 
>>>>> Bingley <Bingley.Li@amd.com>; Ramirez, Alejandro 
>>>>> <Alejandro.Ramirez@amd.com>; Filipas, Mario 
>>>>> <Mario.Filipas@amd.com>
>>>>> *Subject:* Re: TDR and VRAM lost handling in KMD:
>>>>>
>>>>>     From a Mesa perspective, this almost all sounds reasonable to me.
>>>>>
>>>>> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), 
>>>>> so it's reasonable to use it. However, it /does not/ make sense to 
>>>>> mark idle contexts as "guilty" just because VRAM is lost. VRAM 
>>>>> lost is a perfect example where the driver should report context 
>>>>> lost to applications with the "innocent" flag for contexts that 
>>>>> were idle at the time of reset. The only context(s) that should be 
>>>>> reported as "guilty"
>>>>> (or perhaps "unknown" in some cases) are the ones that were 
>>>>> executing at the time of reset.
>>>>>
>>>>> On whether the whole context is marked as guilty from a user space 
>>>>> perspective, it would simply be nice for user space to get 
>>>>> consistent answers. It would be a bit odd if we could e.g. succeed 
>>>>> in submitting an SDMA job after a GFX job was rejected. This would 
>>>>> point in favor of marking the entire context as guilty (although 
>>>>> that could happen lazily instead of at reset time). On the other 
>>>>> hand, if that's too big a burden for the kernel implementation I'm 
>>>>> sure we can live without it.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Nicolai
>>>>>
>>>>> ------------------------------------------------------------------
>>>>> -
>>>>> ---
>>>>> --
>>>>>
>>>>> *From:*Liu, Monk
>>>>> *Sent:* Wednesday, October 11, 2017 10:15:40 AM
>>>>> *To:* Koenig, Christian; Haehnle, Nicolai; Olsak, Marek; Deucher, 
>>>>> Alexander
>>>>> *Cc:* amd-gfx@lists.freedesktop.org 
>>>>> <mailto:amd-gfx@lists.freedesktop.org>; Ding, Pixel; Jiang, Jerry 
>>>>> (SW); Li, Bingley; Ramirez, Alejandro; Filipas, Mario
>>>>> *Subject:* RE: TDR and VRAM lost handling in KMD:
>>>>>
>>>>> 1.Set its fence error status to “*ETIME*”,
>>>>>
>>>>> No, as I already explained ETIME is for synchronous operation.
>>>>>
>>>>> In other words when we return ETIME from the wait IOCTL it would 
>>>>> mean that the waiting has somehow timed out, but not the job we waited for.
>>>>>
>>>>> Please use ECANCELED as well or some other error code when we find 
>>>>> that we need to distinct the timedout job from the canceled ones 
>>>>> (probably a good idea, but I'm not sure).
>>>>>
>>>>> [ML] I’m okay if you insist not to use ETIME
>>>>>
>>>>> 1.Find the entity/ctx behind this job, and set this ctx as “*guilty*”
>>>>>
>>>>> Not sure. Do we want to set the whole context as guilty or just 
>>>>> the entity?
>>>>>
>>>>> Setting the whole contexts as guilty sounds racy to me.
>>>>>
>>>>> BTW: We should use a different name than "guilty", maybe just 
>>>>> "bool canceled;" ?
>>>>>
>>>>> [ML] I think context is better than entity, because for example if 
>>>>> you only block entity_0 of context and allow entity_N run, that 
>>>>> means the dependency between entities are broken (e.g. page table 
>>>>> updates in
>>>>>
>>>>> Sdma entity pass but gfx submit in GFX entity blocked, not make 
>>>>> sense to me)
>>>>>
>>>>> We’d better either block the whole context or let not…
>>>>>
>>>>> 1.Kick out all jobs in this “guilty” ctx’s KFIFO queue, and set 
>>>>> all their fence status to “*ECANCELED*”
>>>>>
>>>>> Setting ECANCELED should be ok. But I think we should do this when 
>>>>> we try to run the jobs and not during GPU reset.
>>>>>
>>>>> [ML] without deep thought and expritment, I’m not sure the 
>>>>> difference between them, but kick it out in gpu_reset routine is 
>>>>> more efficient,
>>>>>
>>>>> Otherwise you need to check context/entity guilty flag in run_job 
>>>>> routine …and you need to it for every context/entity, I don’t see 
>>>>> why
>>>>>
>>>>> We don’t just kickout all of them in gpu_reset stage ….
>>>>>
>>>>> a)Iterate over all living ctx, and set all ctx as “*guilty*” since 
>>>>> VRAM lost actually ruins all VRAM contents
>>>>>
>>>>> No, that shouldn't be done by comparing the counters. Iterating 
>>>>> over all contexts is way to much overhead.
>>>>>
>>>>> [ML] because I want to make KMS IOCTL rules clean, like they don’t 
>>>>> need to differentiate VRAM lost or not, they only interested in if 
>>>>> the context is guilty or not, and block
>>>>>
>>>>> Submit for guilty ones.
>>>>>
>>>>> *Can you give more details of your idea? And better the detail 
>>>>> implement in cs_submit, I want to see how you want to block submit 
>>>>> without checking context guilty flag*
>>>>>
>>>>> a)Kick out all jobs in all ctx’s KFIFO queue, and set all their 
>>>>> fence status to “*ECANCELDED*”
>>>>>
>>>>> Yes and no, that should be done when we try to run the jobs and 
>>>>> not during GPU reset.
>>>>>
>>>>> [ML] again, kicking out them in gpu reset routine is high 
>>>>> efficient, otherwise you need check on every job in run_job()
>>>>>
>>>>> Besides, can you illustrate the detail implementation ?
>>>>>
>>>>> Yes and no, dma_fence_get_status() is some specific handling for 
>>>>> sync_file debugging (no idea why that made it into the common 
>>>>> fence code).
>>>>>
>>>>> It was replaced by putting the error code directly into the fence, 
>>>>> so just reading that one after waiting should be ok.
>>>>>
>>>>> Maybe we should fix dma_fence_get_status() to do the right thing 
>>>>> for this?
>>>>>
>>>>> [ML] yeah, that’s too confusing, the name sound really the one I 
>>>>> want to use, we should change it…
>>>>>
>>>>> *But look into the implement, I don**’t see why we cannot use it ?
>>>>> it also finally return the fence->error *
>>>>>
>>>>> *From:*Koenig, Christian
>>>>> *Sent:* Wednesday, October 11, 2017 3:21 PM
>>>>> *To:* Liu, Monk <Monk.Liu@amd.com <mailto:Monk.Liu@amd.com>>; 
>>>>> Haehnle, Nicolai <Nicolai.Haehnle@amd.com 
>>>>> <mailto:Nicolai.Haehnle@amd.com>>;
>>>>> Olsak, Marek <Marek.Olsak@amd.com <mailto:Marek.Olsak@amd.com>>; 
>>>>> Deucher, Alexander <Alexander.Deucher@amd.com 
>>>>> <mailto:Alexander.Deucher@amd.com>>
>>>>> *Cc:* amd-gfx@lists.freedesktop.org 
>>>>> <mailto:amd-gfx@lists.freedesktop.org>; Ding, Pixel 
>>>>> <Pixel.Ding@amd.com <mailto:Pixel.Ding@amd.com>>; Jiang, Jerry
>>>>> (SW) <Jerry.Jiang@amd.com <mailto:Jerry.Jiang@amd.com>>; Li, 
>>>>> Bingley <Bingley.Li@amd.com <mailto:Bingley.Li@amd.com>>; Ramirez, 
>>>>> Alejandro <Alejandro.Ramirez@amd.com 
>>>>> <mailto:Alejandro.Ramirez@amd.com>>;
>>>>> Filipas, Mario <Mario.Filipas@amd.com 
>>>>> <mailto:Mario.Filipas@amd.com>>
>>>>> *Subject:* Re: TDR and VRAM lost handling in KMD:
>>>>>
>>>>> See inline:
>>>>>
>>>>> Am 11.10.2017 um 07:33 schrieb Liu, Monk:
>>>>>
>>>>>        Hi Christian & Nicolai,
>>>>>
>>>>>        We need to achieve some agreements on what should MESA/UMD 
>>>>> do and
>>>>>        what should KMD do, *please give your comments with “okay” 
>>>>> or “No”
>>>>>        and your idea on below items,*
>>>>>
>>>>>        ?When a job timed out (set from lockup_timeout kernel 
>>>>> parameter),
>>>>>        What KMD should do in TDR routine :
>>>>>
>>>>>        1.Update adev->*gpu_reset_counter*, and stop scheduler first,
>>>>>        (*gpu_reset_counter* is used to force vm flush after GPU 
>>>>> reset, out
>>>>>        of this thread’s scope so no more discussion on it)
>>>>>
>>>>> Okay.
>>>>>
>>>>>        2.Set its fence error status to “*ETIME*”,
>>>>>
>>>>> No, as I already explained ETIME is for synchronous operation.
>>>>>
>>>>> In other words when we return ETIME from the wait IOCTL it would 
>>>>> mean that the waiting has somehow timed out, but not the job we waited for.
>>>>>
>>>>> Please use ECANCELED as well or some other error code when we find 
>>>>> that we need to distinct the timedout job from the canceled ones 
>>>>> (probably a good idea, but I'm not sure).
>>>>>
>>>>>        3.Find the entity/ctx behind this job, and set this ctx as 
>>>>> “*guilty*”
>>>>>
>>>>> Not sure. Do we want to set the whole context as guilty or just 
>>>>> the entity?
>>>>>
>>>>> Setting the whole contexts as guilty sounds racy to me.
>>>>>
>>>>> BTW: We should use a different name than "guilty", maybe just 
>>>>> "bool canceled;" ?
>>>>>
>>>>>        4.Kick out this job from scheduler’s mirror list, so this 
>>>>> job won’t
>>>>>        get re-scheduled to ring anymore.
>>>>>
>>>>> Okay.
>>>>>
>>>>>        5.Kick out all jobs in this “guilty” ctx’s KFIFO queue, and 
>>>>> set all
>>>>>        their fence status to “*ECANCELED*”
>>>>>
>>>>> Setting ECANCELED should be ok. But I think we should do this when 
>>>>> we try to run the jobs and not during GPU reset.
>>>>>
>>>>>        6.Force signal all fences that get kicked out by above two
>>>>>        steps,*otherwise UMD will block forever if waiting on those
>>>>> fences*
>>>>>
>>>>> Okay.
>>>>>
>>>>>        7.Do gpu reset, which is can be some callbacks to let 
>>>>> bare-metal and
>>>>>        SR-IOV implement with their favor style
>>>>>
>>>>> Okay.
>>>>>
>>>>>        8.After reset, KMD need to aware if the VRAM lost happens 
>>>>> or not,
>>>>>        bare-metal can implement some function to judge, while for 
>>>>> SR-IOV I
>>>>>        prefer to read it from GIM side (for initial version we consider
>>>>>        it’s always VRAM lost, till GIM side change aligned)
>>>>>
>>>>> Okay.
>>>>>
>>>>>        9.If VRAM lost not hit, continue, otherwise:
>>>>>
>>>>>        a)Update adev->*vram_lost_counter*,
>>>>>
>>>>> Okay.
>>>>>
>>>>>        b)Iterate over all living ctx, and set all ctx as “*guilty*”
>>>>> since
>>>>>        VRAM lost actually ruins all VRAM contents
>>>>>
>>>>> No, that shouldn't be done by comparing the counters. Iterating 
>>>>> over all contexts is way to much overhead.
>>>>>
>>>>>        c)Kick out all jobs in all ctx’s KFIFO queue, and set all their
>>>>>        fence status to “*ECANCELDED*”
>>>>>
>>>>> Yes and no, that should be done when we try to run the jobs and 
>>>>> not during GPU reset.
>>>>>
>>>>>        10.Do GTT recovery and VRAM page tables/entries recovery 
>>>>> (optional,
>>>>>        do we need it ???)
>>>>>
>>>>> Yes, that is still needed. As Nicolai explained we can't be sure 
>>>>> that VRAM is still 100% correct even when it isn't cleared.
>>>>>
>>>>>        11.Re-schedule all JOBs remains in mirror list to ring again and
>>>>>        restart scheduler (for VRAM lost case, no JOB will
>>>>> re-scheduled)
>>>>>
>>>>> Okay.
>>>>>
>>>>>        ?For cs_wait() IOCTL:
>>>>>
>>>>>        After it found fence signaled, it should check with
>>>>>        *“dma_fence_get_status” *to see if there is error there,
>>>>>
>>>>>        And return the error status of fence
>>>>>
>>>>> Yes and no, dma_fence_get_status() is some specific handling for 
>>>>> sync_file debugging (no idea why that made it into the common 
>>>>> fence code).
>>>>>
>>>>> It was replaced by putting the error code directly into the fence, 
>>>>> so just reading that one after waiting should be ok.
>>>>>
>>>>> Maybe we should fix dma_fence_get_status() to do the right thing 
>>>>> for this?
>>>>>
>>>>>        ?For cs_wait_fences() IOCTL:
>>>>>
>>>>>        Similar with above approach
>>>>>
>>>>>        ?For cs_submit() IOCTL:
>>>>>
>>>>>        It need to check if current ctx been marked as “*guilty*” 
>>>>> and return
>>>>>        “*ECANCELED*” if so
>>>>>
>>>>>        ?Introduce a new IOCTL to let UMD query *vram_lost_counter*:
>>>>>
>>>>>        This way, UMD can also block app from submitting, like @Nicolai
>>>>>        mentioned, we can cache one copy of *vram_lost_counter* when
>>>>>        enumerate physical device, and deny all
>>>>>
>>>>>        gl-context from submitting if the counter queried bigger 
>>>>> than that
>>>>>        one cached in physical device. (looks a little overkill to 
>>>>> me, but
>>>>>        easy to implement )
>>>>>
>>>>>        UMD can also return error to APP when creating gl-context 
>>>>> if found
>>>>>        current queried*vram_lost_counter *bigger than that one 
>>>>> cached in
>>>>>        physical device.
>>>>>
>>>>> Okay. Already have a patch for this, please review that one if you 
>>>>> haven't already done so.
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>        BTW: I realized that gl-context is a little different with 
>>>>> kernel’s
>>>>>        context. Because for kernel. BO is not related with context 
>>>>> but only
>>>>>        with FD, while in UMD, BO have a backend
>>>>>
>>>>>        gl-context, so block submitting in UMD layer is also needed 
>>>>> although
>>>>>        KMD will do its job as bottom line
>>>>>
>>>>>        ?Basically “vram_lost_counter” is exposure by kernel to let 
>>>>> UMD take
>>>>>        the control of robust extension feature, it will be UMD’s 
>>>>> call to
>>>>>        move, KMD only deny “guilty” context from submitting
>>>>>
>>>>>        Need your feedback, thx
>>>>>
>>>>>        We’d better make TDR feature landed ASAP
>>>>>
>>>>>        BR Monk
>>>>>
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx@lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx