All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Liu, Monk" <Monk.Liu-5C7GfCeVMHo@public.gmane.org>
To: "Liu, Monk" <Monk.Liu-5C7GfCeVMHo@public.gmane.org>,
	"Koenig,
	Christian" <Christian.Koenig-5C7GfCeVMHo@public.gmane.org>,
	"Zhou,
	David(ChunMing)" <David1.Zhou-5C7GfCeVMHo@public.gmane.org>,
	"Haehnle, Nicolai" <Nicolai.Haehnle-5C7GfCeVMHo@public.gmane.org>,
	"Olsak, Marek" <Marek.Olsak-5C7GfCeVMHo@public.gmane.org>,
	"Deucher,
	Alexander" <Alexander.Deucher-5C7GfCeVMHo@public.gmane.org>
Cc: "Ramirez,
	Alejandro" <Alejandro.Ramirez-5C7GfCeVMHo@public.gmane.org>,
	"amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org"
	<amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org>,
	"Filipas, Mario" <Mario.Filipas-5C7GfCeVMHo@public.gmane.org>,
	"Ding, Pixel" <Pixel.Ding-5C7GfCeVMHo@public.gmane.org>,
	"Li, Bingley" <Bingley.Li-5C7GfCeVMHo@public.gmane.org>,
	"Jiang, Jerry (SW)" <Jerry.Jiang-5C7GfCeVMHo@public.gmane.org>
Subject: RE: TDR and VRAM lost handling in KMD:
Date: Wed, 11 Oct 2017 13:59:41 +0000	[thread overview]
Message-ID: <BLUPR12MB04497EDD5AE48484E7A18C2F844A0@BLUPR12MB0449.namprd12.prod.outlook.com> (raw)
In-Reply-To: <BLUPR12MB044961DEE4326E94A156ED05844A0-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>

But if we keep counter in entity, there is one issue I suddenly though of :

For regular user context, after vram lost UMD will aware this context is LOST since we have a counter copy in context, so user space can close it and re-create one
But for kernel entity, since no U/K interface, so it is kernel's responsibility to recover this kernel entity to work, that make things complicated ....

Emm, I agree that keep a copy in context and in job is a good move

BR Monk

-----Original Message-----
From: amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] On Behalf Of Liu, Monk
Sent: Wednesday, October 11, 2017 9:51 PM
To: Koenig, Christian <Christian.Koenig@amd.com>; Zhou, David(ChunMing) <David1.Zhou@amd.com>; Haehnle, Nicolai <Nicolai.Haehnle@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>
Cc: Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; amd-gfx@lists.freedesktop.org; Filipas, Mario <Mario.Filipas@amd.com>; Ding, Pixel <Pixel.Ding@amd.com>; Li, Bingley <Bingley.Li@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>
Subject: RE: TDR and VRAM lost handling in KMD:

> Some jobs don't have a context (VM updates, clears, buffer moves).

What? I remember even the VM update job is with a kernel entity, (no context is true), and if entity can keep a counter copy That can solve your concerns 



-----Original Message-----
From: Koenig, Christian
Sent: Wednesday, October 11, 2017 9:39 PM
To: Liu, Monk <Monk.Liu@amd.com>; Zhou, David(ChunMing) <David1.Zhou@amd.com>; Haehnle, Nicolai <Nicolai.Haehnle@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>
Cc: Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; amd-gfx@lists.freedesktop.org; Filipas, Mario <Mario.Filipas@amd.com>; Ding, Pixel <Pixel.Ding@amd.com>; Li, Bingley <Bingley.Li@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>
Subject: Re: TDR and VRAM lost handling in KMD:

Some jobs don't have a context (VM updates, clears, buffer moves).

I would still like to abort those when they where issued before a losing VRAM content, but keep the entity usable.

So I think we should just keep a copy of the VRAM lost counter in the job. That also removes us from the burden of figuring out the context during job run.

Regards,
Christian.

Am 11.10.2017 um 15:35 schrieb Liu, Monk:
> I think just compare the copy from context/entity with current counter 
> is enough, don't see how it's better to keep another copy in JOB
>
>
> -----Original Message-----
> From: Koenig, Christian
> Sent: Wednesday, October 11, 2017 6:40 PM
> To: Zhou, David(ChunMing) <David1.Zhou@amd.com>; Liu, Monk 
> <Monk.Liu@amd.com>; Haehnle, Nicolai <Nicolai.Haehnle@amd.com>; Olsak, 
> Marek <Marek.Olsak@amd.com>; Deucher, Alexander 
> <Alexander.Deucher@amd.com>
> Cc: Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; 
> amd-gfx@lists.freedesktop.org; Filipas, Mario <Mario.Filipas@amd.com>; 
> Ding, Pixel <Pixel.Ding@amd.com>; Li, Bingley <Bingley.Li@amd.com>; 
> Jiang, Jerry (SW) <Jerry.Jiang@amd.com>
> Subject: Re: TDR and VRAM lost handling in KMD:
>
> I've already posted a patch for this on the mailing list.
>
> Basically we just copy the vram lost counter into the job and when we try to run the job we can mark it as canceled.
>
> Regards,
> Christian.
>
> Am 11.10.2017 um 12:14 schrieb Chunming Zhou:
>> Your summary lacks the below issue:
>>
>> How about the job already pushed in scheduler queue when vram is lost?
>>
>>
>> Regards,
>> David Zhou
>> On 2017年10月11日 17:41, Liu, Monk wrote:
>>> Okay, let me summary our whole idea together and see if it works:
>>>
>>> 1, For cs_submit, always check vram-lost_counter first and reject 
>>> the submit (return -ECANCLED to UMD) if ctx->vram_lost_counter !=
>>> adev->vram_lost_counter. That way the vram lost issue can be handled
>>>
>>> 2, for cs_submit we still need to check if the incoming context is 
>>> "AMDGPU_CTX_GUILTY_RESET" or not even if we found
>>> ctx->vram_lost_counter == adev->vram_lost_counter, and we can reject
>>> the submit
>>> If it is "AMDGPU_CTX_GUILTY_RESET", correct ?
>>>
>>> 3, in gpu_reset() routine, we only mark the hang job's entity as 
>>> guilty (so we need to add new member in entity structure), and not 
>>> kick it out in gpu_reset() stage, but we need to set the context 
>>> behind this entity as " AMDGPU_CTX_GUILTY_RESET"
>>>     And if reset introduces VRAM LOST, we just update
>>> adev->vram_lost_counter, but *don't* change all entity to guilty, so
>>> still only the hang job's entity is "guilty"
>>>     After some entity marked as "guilty", we find a way to set the 
>>> context behind it as AMDGPU_CTX_GUILTY_RESET, because this is U/K 
>>> interface, we need let UMD can know that this context is wrong.
>>>
>>> 4, in gpu scheduler's run_job() routine, since it only reads entity, 
>>> so we skip job scheduling once found the entity is "guilty"
>>>
>>>
>>> Does above sounds good ?
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Haehnle, Nicolai
>>> Sent: Wednesday, October 11, 2017 5:26 PM
>>> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian 
>>> <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; 
>>> Deucher, Alexander <Alexander.Deucher@amd.com>
>>> Cc: amd-gfx@lists.freedesktop.org; Ding, Pixel <Pixel.Ding@amd.com>; 
>>> Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, Bingley 
>>> <Bingley.Li@amd.com>; Ramirez, Alejandro 
>>> <Alejandro.Ramirez@amd.com>; Filipas, Mario <Mario.Filipas@amd.com>
>>> Subject: Re: TDR and VRAM lost handling in KMD:
>>>
>>> On 11.10.2017 11:18, Liu, Monk wrote:
>>>> Let's talk it simple, When vram lost hit,  what's the action for 
>>>> amdgpu_ctx_query()/AMDGPU_CTX_OP_QUERY_STATE on other contexts (not 
>>>> the one trigger gpu hang) after vram lost ? do you mean we return 
>>>> -ENODEV to UMD ?
>>> It should successfully return AMDGPU_CTX_INNOCENT_RESET.
>>>
>>>
>>>> In cs_submit, with vram lost hit, if we don't mark all contexts as 
>>>> "guilty", how we block its from submitting ? can you show some 
>>>> implement way
>>> if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter))
>>>        return -ECANCELED;
>>>
>>> (where ctx->vram_lost_counter is initialized at context creation 
>>> time and never changed afterwards)
>>>
>>>
>>>> BTW: the "guilty" here is a new member I want to add to context, it 
>>>> is not related with AMDGPU_CTX_OP_QUERY_STATE UK interface, Looks I 
>>>> need to unify them and only one place to mark guilty or not
>>> Right, the AMDGPU_CTX_OP_QUERY_STATE handling needs to be made 
>>> consistent with the rest.
>>>
>>> Cheers,
>>> Nicolai
>>>
>>>
>>>> BR Monk
>>>>
>>>> -----Original Message-----
>>>> From: Haehnle, Nicolai
>>>> Sent: Wednesday, October 11, 2017 5:00 PM
>>>> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian 
>>>> <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; 
>>>> Deucher, Alexander <Alexander.Deucher@amd.com>
>>>> Cc: amd-gfx@lists.freedesktop.org; Ding, Pixel 
>>>> <Pixel.Ding@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, 
>>>> Bingley <Bingley.Li@amd.com>; Ramirez, Alejandro 
>>>> <Alejandro.Ramirez@amd.com>; Filipas, Mario <Mario.Filipas@amd.com>
>>>> Subject: Re: TDR and VRAM lost handling in KMD:
>>>>
>>>> On 11.10.2017 10:48, Liu, Monk wrote:
>>>>> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), 
>>>>> so it's reasonable to use it. However, it /does not/ make sense to 
>>>>> mark idle contexts as "guilty" just because VRAM is lost. VRAM 
>>>>> lost is a perfect example where the driver should report context 
>>>>> lost to applications with the "innocent" flag for contexts that 
>>>>> were idle at the time of reset. The only context(s) that should be 
>>>>> reported as "guilty"
>>>>> (or perhaps "unknown" in some cases) are the ones that were 
>>>>> executing at the time of reset.
>>>>>
>>>>> ML: KMD mark all contexts as guilty is because that way we can 
>>>>> unify our IOCTL behavior: e.g. for IOCTL only block 
>>>>> “guilty”context , no need to worry about vram-lost-counter 
>>>>> anymore, that’s a implementation style. I don’t think it is 
>>>>> related with UMD layer,
>>>>>
>>>>> For UMD the gl-context isn’t aware of by KMD, so UMD can implement 
>>>>> it own “guilty” gl-context if you want.
>>>> Well, to some extent this is just semantics, but it helps to keep 
>>>> the terminology consistent.
>>>>
>>>> Most importantly, please keep the AMDGPU_CTX_OP_QUERY_STATE uapi in
>>>> mind: this returns one of
>>>> AMDGPU_CTX_{GUILTY,INNOCENT,UNKNOWN}_RECENT,
>>>> and it must return "innocent" for contexts that are only lost due 
>>>> to VRAM lost without being otherwise involved in the timeout that 
>>>> lead to the reset.
>>>>
>>>> The point is that in the places where you used "guilty" it would be 
>>>> better to use "context lost", and then further differentiate 
>>>> between guilty/innocent context lost based on the details of what happened.
>>>>
>>>>
>>>>> If KMD doesn’t mark all ctx as guilty after VRAM lost, can you 
>>>>> illustrate what rule KMD should obey to check in KMS IOCTL like 
>>>>> cs_sumbit ?? let’s see which way better
>>>> if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter))
>>>>         return -ECANCELED;
>>>>
>>>> Plus similar logic for AMDGPU_CTX_OP_QUERY_STATE.
>>>>
>>>> Yes, it's one additional check in cs_submit. If you're worried 
>>>> about that (and Christian's concerns about possible issues with 
>>>> walking over all contexts are addressed), I suppose you could just 
>>>> store a per-context
>>>>
>>>>       unsigned context_reset_status;
>>>>
>>>> instead of a `bool guilty`. Its value would start out as 0
>>>> (AMDGPU_CTX_NO_RESET) and would be set to the correct value during 
>>>> reset.
>>>>
>>>> Cheers,
>>>> Nicolai
>>>>
>>>>
>>>>> *From:*Haehnle, Nicolai
>>>>> *Sent:* Wednesday, October 11, 2017 4:41 PM
>>>>> *To:* Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian 
>>>>> <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; 
>>>>> Deucher, Alexander <Alexander.Deucher@amd.com>
>>>>> *Cc:* amd-gfx@lists.freedesktop.org; Ding, Pixel 
>>>>> <Pixel.Ding@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, 
>>>>> Bingley <Bingley.Li@amd.com>; Ramirez, Alejandro 
>>>>> <Alejandro.Ramirez@amd.com>; Filipas, Mario 
>>>>> <Mario.Filipas@amd.com>
>>>>> *Subject:* Re: TDR and VRAM lost handling in KMD:
>>>>>
>>>>>     From a Mesa perspective, this almost all sounds reasonable to me.
>>>>>
>>>>> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), 
>>>>> so it's reasonable to use it. However, it /does not/ make sense to 
>>>>> mark idle contexts as "guilty" just because VRAM is lost. VRAM 
>>>>> lost is a perfect example where the driver should report context 
>>>>> lost to applications with the "innocent" flag for contexts that 
>>>>> were idle at the time of reset. The only context(s) that should be 
>>>>> reported as "guilty"
>>>>> (or perhaps "unknown" in some cases) are the ones that were 
>>>>> executing at the time of reset.
>>>>>
>>>>> On whether the whole context is marked as guilty from a user space 
>>>>> perspective, it would simply be nice for user space to get 
>>>>> consistent answers. It would be a bit odd if we could e.g. succeed 
>>>>> in submitting an SDMA job after a GFX job was rejected. This would 
>>>>> point in favor of marking the entire context as guilty (although 
>>>>> that could happen lazily instead of at reset time). On the other 
>>>>> hand, if that's too big a burden for the kernel implementation I'm 
>>>>> sure we can live without it.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Nicolai
>>>>>
>>>>> ------------------------------------------------------------------
>>>>> -
>>>>> ---
>>>>> --
>>>>>
>>>>> *From:*Liu, Monk
>>>>> *Sent:* Wednesday, October 11, 2017 10:15:40 AM
>>>>> *To:* Koenig, Christian; Haehnle, Nicolai; Olsak, Marek; Deucher, 
>>>>> Alexander
>>>>> *Cc:* amd-gfx@lists.freedesktop.org 
>>>>> <mailto:amd-gfx@lists.freedesktop.org>; Ding, Pixel; Jiang, Jerry 
>>>>> (SW); Li, Bingley; Ramirez, Alejandro; Filipas, Mario
>>>>> *Subject:* RE: TDR and VRAM lost handling in KMD:
>>>>>
>>>>> 1.Set its fence error status to “*ETIME*”,
>>>>>
>>>>> No, as I already explained ETIME is for synchronous operation.
>>>>>
>>>>> In other words when we return ETIME from the wait IOCTL it would 
>>>>> mean that the waiting has somehow timed out, but not the job we waited for.
>>>>>
>>>>> Please use ECANCELED as well or some other error code when we find 
>>>>> that we need to distinct the timedout job from the canceled ones 
>>>>> (probably a good idea, but I'm not sure).
>>>>>
>>>>> [ML] I’m okay if you insist not to use ETIME
>>>>>
>>>>> 1.Find the entity/ctx behind this job, and set this ctx as “*guilty*”
>>>>>
>>>>> Not sure. Do we want to set the whole context as guilty or just 
>>>>> the entity?
>>>>>
>>>>> Setting the whole contexts as guilty sounds racy to me.
>>>>>
>>>>> BTW: We should use a different name than "guilty", maybe just 
>>>>> "bool canceled;" ?
>>>>>
>>>>> [ML] I think context is better than entity, because for example if 
>>>>> you only block entity_0 of context and allow entity_N run, that 
>>>>> means the dependency between entities are broken (e.g. page table 
>>>>> updates in
>>>>>
>>>>> Sdma entity pass but gfx submit in GFX entity blocked, not make 
>>>>> sense to me)
>>>>>
>>>>> We’d better either block the whole context or let not…
>>>>>
>>>>> 1.Kick out all jobs in this “guilty” ctx’s KFIFO queue, and set 
>>>>> all their fence status to “*ECANCELED*”
>>>>>
>>>>> Setting ECANCELED should be ok. But I think we should do this when 
>>>>> we try to run the jobs and not during GPU reset.
>>>>>
>>>>> [ML] without deep thought and expritment, I’m not sure the 
>>>>> difference between them, but kick it out in gpu_reset routine is 
>>>>> more efficient,
>>>>>
>>>>> Otherwise you need to check context/entity guilty flag in run_job 
>>>>> routine …and you need to it for every context/entity, I don’t see 
>>>>> why
>>>>>
>>>>> We don’t just kickout all of them in gpu_reset stage ….
>>>>>
>>>>> a)Iterate over all living ctx, and set all ctx as “*guilty*” since 
>>>>> VRAM lost actually ruins all VRAM contents
>>>>>
>>>>> No, that shouldn't be done by comparing the counters. Iterating 
>>>>> over all contexts is way to much overhead.
>>>>>
>>>>> [ML] because I want to make KMS IOCTL rules clean, like they don’t 
>>>>> need to differentiate VRAM lost or not, they only interested in if 
>>>>> the context is guilty or not, and block
>>>>>
>>>>> Submit for guilty ones.
>>>>>
>>>>> *Can you give more details of your idea? And better the detail 
>>>>> implement in cs_submit, I want to see how you want to block submit 
>>>>> without checking context guilty flag*
>>>>>
>>>>> a)Kick out all jobs in all ctx’s KFIFO queue, and set all their 
>>>>> fence status to “*ECANCELDED*”
>>>>>
>>>>> Yes and no, that should be done when we try to run the jobs and 
>>>>> not during GPU reset.
>>>>>
>>>>> [ML] again, kicking out them in gpu reset routine is high 
>>>>> efficient, otherwise you need check on every job in run_job()
>>>>>
>>>>> Besides, can you illustrate the detail implementation ?
>>>>>
>>>>> Yes and no, dma_fence_get_status() is some specific handling for 
>>>>> sync_file debugging (no idea why that made it into the common 
>>>>> fence code).
>>>>>
>>>>> It was replaced by putting the error code directly into the fence, 
>>>>> so just reading that one after waiting should be ok.
>>>>>
>>>>> Maybe we should fix dma_fence_get_status() to do the right thing 
>>>>> for this?
>>>>>
>>>>> [ML] yeah, that’s too confusing, the name sound really the one I 
>>>>> want to use, we should change it…
>>>>>
>>>>> *But look into the implement, I don**’t see why we cannot use it ?
>>>>> it also finally return the fence->error *
>>>>>
>>>>> *From:*Koenig, Christian
>>>>> *Sent:* Wednesday, October 11, 2017 3:21 PM
>>>>> *To:* Liu, Monk <Monk.Liu@amd.com <mailto:Monk.Liu@amd.com>>; 
>>>>> Haehnle, Nicolai <Nicolai.Haehnle@amd.com 
>>>>> <mailto:Nicolai.Haehnle@amd.com>>;
>>>>> Olsak, Marek <Marek.Olsak@amd.com <mailto:Marek.Olsak@amd.com>>; 
>>>>> Deucher, Alexander <Alexander.Deucher@amd.com 
>>>>> <mailto:Alexander.Deucher@amd.com>>
>>>>> *Cc:* amd-gfx@lists.freedesktop.org 
>>>>> <mailto:amd-gfx@lists.freedesktop.org>; Ding, Pixel 
>>>>> <Pixel.Ding@amd.com <mailto:Pixel.Ding@amd.com>>; Jiang, Jerry
>>>>> (SW) <Jerry.Jiang@amd.com <mailto:Jerry.Jiang@amd.com>>; Li, 
>>>>> Bingley <Bingley.Li@amd.com <mailto:Bingley.Li@amd.com>>; Ramirez, 
>>>>> Alejandro <Alejandro.Ramirez@amd.com 
>>>>> <mailto:Alejandro.Ramirez@amd.com>>;
>>>>> Filipas, Mario <Mario.Filipas@amd.com 
>>>>> <mailto:Mario.Filipas@amd.com>>
>>>>> *Subject:* Re: TDR and VRAM lost handling in KMD:
>>>>>
>>>>> See inline:
>>>>>
>>>>> Am 11.10.2017 um 07:33 schrieb Liu, Monk:
>>>>>
>>>>>        Hi Christian & Nicolai,
>>>>>
>>>>>        We need to achieve some agreements on what should MESA/UMD 
>>>>> do and
>>>>>        what should KMD do, *please give your comments with “okay” 
>>>>> or “No”
>>>>>        and your idea on below items,*
>>>>>
>>>>>        ?When a job timed out (set from lockup_timeout kernel 
>>>>> parameter),
>>>>>        What KMD should do in TDR routine :
>>>>>
>>>>>        1.Update adev->*gpu_reset_counter*, and stop scheduler first,
>>>>>        (*gpu_reset_counter* is used to force vm flush after GPU 
>>>>> reset, out
>>>>>        of this thread’s scope so no more discussion on it)
>>>>>
>>>>> Okay.
>>>>>
>>>>>        2.Set its fence error status to “*ETIME*”,
>>>>>
>>>>> No, as I already explained ETIME is for synchronous operation.
>>>>>
>>>>> In other words when we return ETIME from the wait IOCTL it would 
>>>>> mean that the waiting has somehow timed out, but not the job we waited for.
>>>>>
>>>>> Please use ECANCELED as well or some other error code when we find 
>>>>> that we need to distinct the timedout job from the canceled ones 
>>>>> (probably a good idea, but I'm not sure).
>>>>>
>>>>>        3.Find the entity/ctx behind this job, and set this ctx as 
>>>>> “*guilty*”
>>>>>
>>>>> Not sure. Do we want to set the whole context as guilty or just 
>>>>> the entity?
>>>>>
>>>>> Setting the whole contexts as guilty sounds racy to me.
>>>>>
>>>>> BTW: We should use a different name than "guilty", maybe just 
>>>>> "bool canceled;" ?
>>>>>
>>>>>        4.Kick out this job from scheduler’s mirror list, so this 
>>>>> job won’t
>>>>>        get re-scheduled to ring anymore.
>>>>>
>>>>> Okay.
>>>>>
>>>>>        5.Kick out all jobs in this “guilty” ctx’s KFIFO queue, and 
>>>>> set all
>>>>>        their fence status to “*ECANCELED*”
>>>>>
>>>>> Setting ECANCELED should be ok. But I think we should do this when 
>>>>> we try to run the jobs and not during GPU reset.
>>>>>
>>>>>        6.Force signal all fences that get kicked out by above two
>>>>>        steps,*otherwise UMD will block forever if waiting on those
>>>>> fences*
>>>>>
>>>>> Okay.
>>>>>
>>>>>        7.Do gpu reset, which is can be some callbacks to let 
>>>>> bare-metal and
>>>>>        SR-IOV implement with their favor style
>>>>>
>>>>> Okay.
>>>>>
>>>>>        8.After reset, KMD need to aware if the VRAM lost happens 
>>>>> or not,
>>>>>        bare-metal can implement some function to judge, while for 
>>>>> SR-IOV I
>>>>>        prefer to read it from GIM side (for initial version we consider
>>>>>        it’s always VRAM lost, till GIM side change aligned)
>>>>>
>>>>> Okay.
>>>>>
>>>>>        9.If VRAM lost not hit, continue, otherwise:
>>>>>
>>>>>        a)Update adev->*vram_lost_counter*,
>>>>>
>>>>> Okay.
>>>>>
>>>>>        b)Iterate over all living ctx, and set all ctx as “*guilty*”
>>>>> since
>>>>>        VRAM lost actually ruins all VRAM contents
>>>>>
>>>>> No, that shouldn't be done by comparing the counters. Iterating 
>>>>> over all contexts is way to much overhead.
>>>>>
>>>>>        c)Kick out all jobs in all ctx’s KFIFO queue, and set all their
>>>>>        fence status to “*ECANCELDED*”
>>>>>
>>>>> Yes and no, that should be done when we try to run the jobs and 
>>>>> not during GPU reset.
>>>>>
>>>>>        10.Do GTT recovery and VRAM page tables/entries recovery 
>>>>> (optional,
>>>>>        do we need it ???)
>>>>>
>>>>> Yes, that is still needed. As Nicolai explained we can't be sure 
>>>>> that VRAM is still 100% correct even when it isn't cleared.
>>>>>
>>>>>        11.Re-schedule all JOBs remains in mirror list to ring again and
>>>>>        restart scheduler (for VRAM lost case, no JOB will
>>>>> re-scheduled)
>>>>>
>>>>> Okay.
>>>>>
>>>>>        ?For cs_wait() IOCTL:
>>>>>
>>>>>        After it found fence signaled, it should check with
>>>>>        *“dma_fence_get_status” *to see if there is error there,
>>>>>
>>>>>        And return the error status of fence
>>>>>
>>>>> Yes and no, dma_fence_get_status() is some specific handling for 
>>>>> sync_file debugging (no idea why that made it into the common 
>>>>> fence code).
>>>>>
>>>>> It was replaced by putting the error code directly into the fence, 
>>>>> so just reading that one after waiting should be ok.
>>>>>
>>>>> Maybe we should fix dma_fence_get_status() to do the right thing 
>>>>> for this?
>>>>>
>>>>>        ?For cs_wait_fences() IOCTL:
>>>>>
>>>>>        Similar with above approach
>>>>>
>>>>>        ?For cs_submit() IOCTL:
>>>>>
>>>>>        It need to check if current ctx been marked as “*guilty*” 
>>>>> and return
>>>>>        “*ECANCELED*” if so
>>>>>
>>>>>        ?Introduce a new IOCTL to let UMD query *vram_lost_counter*:
>>>>>
>>>>>        This way, UMD can also block app from submitting, like @Nicolai
>>>>>        mentioned, we can cache one copy of *vram_lost_counter* when
>>>>>        enumerate physical device, and deny all
>>>>>
>>>>>        gl-context from submitting if the counter queried bigger 
>>>>> than that
>>>>>        one cached in physical device. (looks a little overkill to 
>>>>> me, but
>>>>>        easy to implement )
>>>>>
>>>>>        UMD can also return error to APP when creating gl-context 
>>>>> if found
>>>>>        current queried*vram_lost_counter *bigger than that one 
>>>>> cached in
>>>>>        physical device.
>>>>>
>>>>> Okay. Already have a patch for this, please review that one if you 
>>>>> haven't already done so.
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>        BTW: I realized that gl-context is a little different with 
>>>>> kernel’s
>>>>>        context. Because for kernel. BO is not related with context 
>>>>> but only
>>>>>        with FD, while in UMD, BO have a backend
>>>>>
>>>>>        gl-context, so block submitting in UMD layer is also needed 
>>>>> although
>>>>>        KMD will do its job as bottom line
>>>>>
>>>>>        ?Basically “vram_lost_counter” is exposure by kernel to let 
>>>>> UMD take
>>>>>        the control of robust extension feature, it will be UMD’s 
>>>>> call to
>>>>>        move, KMD only deny “guilty” context from submitting
>>>>>
>>>>>        Need your feedback, thx
>>>>>
>>>>>        We’d better make TDR feature landed ASAP
>>>>>
>>>>>        BR Monk
>>>>>
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx@lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

  parent reply	other threads:[~2017-10-11 13:59 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-10-11  5:33 TDR and VRAM lost handling in KMD: Liu, Monk
     [not found] ` <BLUPR12MB0449785160E34EA9369C5E23844A0-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-10-11  7:14   ` Liu, Monk
2017-10-11  7:20   ` Christian König
     [not found]     ` <b5c5f6c9-07e2-4688-8ffc-3929bfc59366-5C7GfCeVMHo@public.gmane.org>
2017-10-11  8:15       ` Liu, Monk
     [not found]         ` <BLUPR12MB044911DFCB510022605DD38A844A0-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-10-11  8:40           ` Haehnle, Nicolai
     [not found]             ` <DM5PR12MB1292D21FC5438AEA8FCF9F64FF4A0-2J9CzHegvk82qrKJuDAMhQdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-10-11  8:48               ` Liu, Monk
     [not found]                 ` <BLUPR12MB0449287A92DF8D3EB30BE6A6844A0-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-10-11  8:59                   ` Nicolai Hähnle
     [not found]                     ` <28d64011-fd90-07fb-d95d-48286ecbdcc5-5C7GfCeVMHo@public.gmane.org>
2017-10-11  9:18                       ` Liu, Monk
     [not found]                         ` <BLUPR12MB044914F3A7B5D3D316481A7A844A0-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-10-11  9:25                           ` Nicolai Hähnle
     [not found]                             ` <6876e153-7e98-66ac-7338-5601cf83c633-5C7GfCeVMHo@public.gmane.org>
2017-10-11  9:41                               ` Liu, Monk
     [not found]                                 ` <BLUPR12MB044907C2C72DD8BEB1D5BE3B844A0-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-10-11 10:14                                   ` Chunming Zhou
     [not found]                                     ` <8c4e849f-9227-12bc-9d2e-3daf60fcd762-5C7GfCeVMHo@public.gmane.org>
2017-10-11 10:39                                       ` Christian König
     [not found]                                         ` <0c198ba6-b853-c26a-7fb4-bcc0344fdea0-5C7GfCeVMHo@public.gmane.org>
2017-10-11 13:35                                           ` Liu, Monk
     [not found]                                             ` <BLUPR12MB04490BE33EC2E851228E25F4844A0-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-10-11 13:39                                               ` Christian König
     [not found]                                                 ` <d9274bc6-27e8-f6c4-0851-4240bde72452-5C7GfCeVMHo@public.gmane.org>
2017-10-11 13:51                                                   ` Liu, Monk
     [not found]                                                     ` <BLUPR12MB044961DEE4326E94A156ED05844A0-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-10-11 13:59                                                       ` Liu, Monk [this message]
     [not found]                                                         ` <BLUPR12MB04497EDD5AE48484E7A18C2F844A0-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-10-11 14:04                                                           ` Christian König
2017-10-11 14:03                                                       ` Christian König
     [not found]                                                         ` <02bb9f77-bcc6-8a24-e9b0-8f3f260d74d8-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2017-10-11 14:04                                                           ` Liu, Monk
2017-10-11 13:27                                       ` Liu, Monk
2017-10-11  9:02                   ` Christian König
     [not found]                     ` <7a7a1830-5457-ea68-44dc-f88eb1e0a8fe-5C7GfCeVMHo@public.gmane.org>
2017-10-11  9:16                       ` Nicolai Hähnle
2017-10-11  9:27                       ` Liu, Monk

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=BLUPR12MB04497EDD5AE48484E7A18C2F844A0@BLUPR12MB0449.namprd12.prod.outlook.com \
    --to=monk.liu-5c7gfcevmho@public.gmane.org \
    --cc=Alejandro.Ramirez-5C7GfCeVMHo@public.gmane.org \
    --cc=Alexander.Deucher-5C7GfCeVMHo@public.gmane.org \
    --cc=Bingley.Li-5C7GfCeVMHo@public.gmane.org \
    --cc=Christian.Koenig-5C7GfCeVMHo@public.gmane.org \
    --cc=David1.Zhou-5C7GfCeVMHo@public.gmane.org \
    --cc=Jerry.Jiang-5C7GfCeVMHo@public.gmane.org \
    --cc=Marek.Olsak-5C7GfCeVMHo@public.gmane.org \
    --cc=Mario.Filipas-5C7GfCeVMHo@public.gmane.org \
    --cc=Nicolai.Haehnle-5C7GfCeVMHo@public.gmane.org \
    --cc=Pixel.Ding-5C7GfCeVMHo@public.gmane.org \
    --cc=amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.