From: Christian König
Subject: Re: TDR and VRAM lost handling in KMD:
Date: Wed, 11 Oct 2017 11:02:47 +0200
To: Liu, Monk; Haehnle, Nicolai; Olsak, Marek; Deucher, Alexander
Cc: Ramirez, Alejandro; amd-gfx@lists.freedesktop.org; Filipas, Mario; Ding, Pixel; Li, Bingley; Jiang, Jerry (SW)

[ML] I think context is better than entity, because if you only block entity_0 of a context but allow entity_N to run, the dependencies between the entities are broken (e.g. the page table updates in the SDMA entity pass, but the GFX submit in the GFX entity is blocked, which makes no sense to me).

We’d better either block the whole context or not block at all…

Page table updates are not part of any context.

So I think the only thing we can do is to mark the entity as not scheduled any more.
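For illustration, a minimal sketch of what “not scheduled any more” could look like on the entity, assuming a hypothetical stopped flag and only loosely following the amd_sched naming of this era (entity->queue_lock and entity->job_queue are approximations, not the confirmed API):

#include <linux/kfifo.h>
#include <linux/spinlock.h>

/* Sketch: stop a single entity instead of marking the whole context.
 * The stopped flag is an assumption for illustration.
 */
static void amd_sched_entity_stop(struct amd_sched_entity *entity)
{
        spin_lock(&entity->queue_lock);
        entity->stopped = true;         /* scheduler picks no more jobs here */
        spin_unlock(&entity->queue_lock);
}

/* The job-pick path then simply skips stopped entities: */
static bool amd_sched_entity_is_ready(struct amd_sched_entity *entity)
{
        return !entity->stopped && !kfifo_is_empty(&entity->job_queue);
}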

5.        Kick out all jobs in this “guilty” ctx’s KFIFO queue, and set all their fence status to “ECANCELED”

Setting ECANCELED should be ok. But I think we should do this when we try to run the jobs and not during GPU reset.

 

[ML] Without deep thought and experiment I’m not sure about the difference between them, but kicking them out in the gpu_reset routine is more efficient,

I really don't think so. Kicking them out during gpu_reset once more sounds racy to me.

And marking them canceled when we try to run them has the clear advantage that all dependencies are met first.
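As a sketch of that run-time path (struct and field names approximate the 2017 amdgpu scheduler and are assumptions; dma_fence_set_error() and dma_fence_signal() are the real fence helpers):

#include <linux/dma-fence.h>
#include <linux/errno.h>

/* Sketch: instead of purging the KFIFO during reset, cancel jobs at the
 * moment the scheduler would run them, after their dependencies were
 * already waited for. Signaling is what keeps UMD waiters from blocking.
 */
static struct dma_fence *amd_sched_run_job_sketch(struct amd_sched_job *job)
{
        if (job->s_entity->canceled) {
                dma_fence_set_error(&job->s_fence->finished, -ECANCELED);
                dma_fence_signal(&job->s_fence->finished);
                return NULL;
        }

        return real_run_job(job);       /* normal hardware submission (placeholder) */
}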

ML: KMD marks all contexts as guilty because that way we can unify our IOCTL behavior: e.g. the IOCTLs only block the “guilty” context and no longer need to worry about the vram-lost-counter; that’s an implementation style. I don’t think it is related to the UMD layer,

I don't think that this is a good idea. Instead, when you want to unify the behavior, we should use the vram_lost_counter as the marker for the guilty context.
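A sketch of how that unified check could look in cs_submit, assuming the context snapshots the counter at creation time (the field names are illustrative):

#include <linux/atomic.h>
#include <linux/errno.h>

/* Sketch: a context created before the last VRAM loss is rejected on
 * submit, with no per-reset iteration over all contexts.
 */
static int amdgpu_cs_check_vram_lost(struct amdgpu_device *adev,
                                     struct amdgpu_ctx *ctx)
{
        if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter))
                return -ECANCELED;      /* everything this ctx owns in VRAM is gone */

        return 0;
}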

Regards,
Christian.

On 11.10.2017 at 10:48, Liu, Monk wrote:

 

On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), so it's reasonable to use it. However, it does not make sense to mark idle contexts as "guilty" just because VRAM is lost. VRAM lost is a perfect example where the driver should report context lost to applications with the "innocent" flag for contexts that were idle at the time of reset. The only context(s) that should be reported as "guilty" (or perhaps "unknown" in some cases) are the ones that were executing at the time of reset.

 

ML: KMD marks all contexts as guilty because that way we can unify our IOCTL behavior: e.g. the IOCTLs only block the “guilty” context and no longer need to worry about the vram-lost-counter; that’s an implementation style. I don’t think it is related to the UMD layer,

For UMD: the gl-context isn’t known by the KMD, so the UMD can implement its own “guilty” gl-context marking if you want.

 

If the KMD doesn’t mark all contexts as guilty after VRAM is lost, can you illustrate what rule the KMD should obey when checking in a KMS IOCTL like cs_submit? Let’s see which way is better.

From: Haehnle, Nicolai
Sent: Wednesday, October 11, 2017 4:41 PM
To: Liu, Monk; Koenig, Christian; Olsak, Marek; Deucher, Alexander
Cc: amd-gfx@lists.freedesktop.org; Ding, Pixel; Jiang, Jerry (SW); Li, Bingley; Ramirez, Alejandro; Filipas, Mario
Subject: Re: TDR and VRAM lost handling in KMD:
From a Mesa perspective, this almost all sounds reasonable to me.

 

On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), so it's reasonable to use it. However, it does not make sense to mark idle contexts as "guilty" just because VRAM is lost. VRAM lost is a perfect example where the driver should report context lost to applications with the "innocent" flag for contexts that were idle at the time of reset. The only context(s) that should be reported as "guilty" (or perhaps "unknown" in some cases) are the ones that were executing at the time of reset.

 

On whether the whole context is marked as guilty from a user space perspective, it would simply be nice for user space to get consistent answers. It would be a bit odd if we could e.g. succeed in submitting an SDMA job after a GFX job was rejected. This would point in favor of marking the entire context as guilty (although that could happen lazily instead of at reset time). On the other hand, if that's too big a burden for the kernel implementation I'm sure we can live without it.

 

Cheers,

Nicolai


From: Liu, Monk
Sent: Wednesday, October 11, 2017 10:15:40 AM
To: Koenig, Christian; Haehnle, Nicolai; Olsak, Marek; Deucher, Alexander
Cc: amd-gfx@lists.freedesktop.org; Ding, Pixel; Jiang, Jerry (SW); Li, Bingley; Ramirez, Alejandro; Filipas, Mario
Subject: RE: TDR and VRAM lost handling in KMD:

 

2.        Set its fence error status to “ETIME”,

No, as I already explained ETIME is for synchronous operation.

In other words when we return ETIME from the wait IOCTL it would mean that the waiting has somehow timed out, but not the job we waited for.

Please use ECANCELED as well, or some other error code, when we find that we need to distinguish the timed-out job from the canceled ones (probably a good idea, but I'm not sure).
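If that distinction is wanted, a sketch using dma_fence_set_error(), the real helper for storing an error in a fence (the timedout parameter and the ETIMEDOUT-vs-ECANCELED split are illustrative, not a decided ABI):

#include <linux/dma-fence.h>
#include <linux/errno.h>

/* Sketch: the job that actually hung gets a distinct error code from
 * the jobs that were merely canceled in the fallout of the reset.
 */
static void amdgpu_job_mark_failed(struct dma_fence *fence, bool timedout)
{
        dma_fence_set_error(fence, timedout ? -ETIMEDOUT : -ECANCELED);
}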

 

[ML] I’m okay if you insist on not using ETIME.

 

3.        Find the entity/ctx behind this job, and set this ctx as “guilty”

Not sure. Do we want to set the whole context as guilty or just the entity?

Setting the whole context as guilty sounds racy to me.

BTW: We should use a different name than "guilty", maybe just "bool canceled;" ?

 

[ML] I think context is better than entity, because if you only block entity_0 of a context but allow entity_N to run, the dependencies between the entities are broken (e.g. the page table updates in the SDMA entity pass, but the GFX submit in the GFX entity is blocked, which makes no sense to me).

We’d better either block the whole context or not block at all…

 

 

5.        Kick out all jobs in this “guilty” ctx’s KFIFO queue, and set all their fence status to “ECANCELED”

Setting ECANCELED should be ok. But I think we should do this when we try to run the jobs and not during GPU reset.

 

[ML] Without deep thought and experiment I’m not sure about the difference between them, but kicking them out in the gpu_reset routine is more efficient.

Otherwise you need to check the context/entity guilty flag in the run_job routine, and you need to do it for every context/entity. I don’t see why we don’t just kick all of them out in the gpu_reset stage.

 

 

b)       Iterate over all living ctx, and set all ctx as “guilty” since VRAM lost actually ruins all VRAM contents

No, that should rather be done by comparing the counters; iterating over all contexts is way too much overhead.

 

[ML] Because I want to keep the KMS IOCTL rules clean: they don’t need to differentiate whether VRAM was lost or not, they are only interested in whether the context is guilty or not, and they block submission for guilty ones.

 

Can you give more details of your idea, and better, the detailed implementation in cs_submit? I want to see how you want to block submission without checking a context guilty flag.

 

 

c)       Kick out all jobs in all ctx’s KFIFO queues, and set all their fence status to “ECANCELED”

Yes and no, that should be done when we try to run the jobs and not during GPU reset.

 

[ML] Again, kicking them out in the gpu_reset routine is highly efficient; otherwise you need a check on every job in run_job().

Besides, can you illustrate the detailed implementation?

 

 

 

Yes and no, dma_fence_get_status() is some specific handling for sync_file debugging (no idea why that made it into the common fence code).

It was replaced by putting the error code directly into the fence, so just reading that one after waiting should be ok.

Maybe we should fix dma_fence_get_status() to do the right thing for this?
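For reference, a sketch of what the wait path could do after the fence signals; dma_fence_wait_timeout(), dma_fence_get_status() and fence->error are real interfaces of this era, while the wrapper itself is illustrative:

#include <linux/dma-fence.h>
#include <linux/errno.h>

/* Sketch: propagate an error stored in the fence back to user space.
 * dma_fence_get_status() returns a negative error for a fence that
 * signaled with an error set, which is exactly the case discussed here.
 */
static long amdgpu_wait_and_report(struct dma_fence *fence, long timeout)
{
        long r = dma_fence_wait_timeout(fence, true, timeout);

        if (r <= 0)
                return r ? r : -ETIME;  /* interrupted, or the wait itself timed out */

        r = dma_fence_get_status(fence);
        return r < 0 ? r : 0;           /* e.g. -ECANCELED for a kicked-out job */
}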

 

[ML] Yeah, that’s too confusing; the name really sounds like the one I want to use, we should change it…

But looking into the implementation, I don’t see why we cannot use it? It also finally returns fence->error.

 

 

 

 

From: Koenig, Christian
Sent: Wednesday, October 11, 2017 3:21 PM
To: Liu, Monk; Haehnle, Nicolai; Olsak, Marek; Deucher, Alexander
Cc: amd-gfx@lists.freedesktop.org; Ding, Pixel; Jiang, Jerry (SW); Li, Bingley; Ramirez, Alejandro; Filipas, Mario
Subject: Re: TDR and VRAM lost handling in KMD:

 

See inline:

On 11.10.2017 at 07:33, Liu, Monk wrote:

Hi Christian & Nicolai,

 

We need to reach agreement on what MESA/UMD should do and what the KMD should do. Please give your comments with “okay” or “no” and your ideas on the items below.

 

-  When a job times out (set from the lockup_timeout kernel parameter), what should the KMD do in the TDR routine:

 

1.        Update adev->gpu_reset_counter and stop the scheduler first (gpu_reset_counter is used to force a VM flush after GPU reset; out of this thread’s scope, so no more discussion on it)

Okay.

2.        Set its fence error status to “ETIME”,

No, as I already explained ETIME is for synchronous operation.

In other words when we return ETIME from the wait IOCTL it would mean that the waiting has somehow timed out, but not the job we waited for.

Please use ECANCELED as well, or some other error code, when we find that we need to distinguish the timed-out job from the canceled ones (probably a good idea, but I'm not sure).

3.        Find the entity/ctx behind this job, and set this ctx as “guilty”

Not sure. Do we want to set the whole context as guilty or just the entity?

Setting the whole context as guilty sounds racy to me.

BTW: We should use a different name than "guilty", maybe just "bool canceled;" ?

4.        Kick out this job from the scheduler’s mirror list, so this job won’t get re-scheduled to the ring anymore.

Okay.
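A sketch of step 4, assuming the scheduler keeps in-flight jobs on a mirror list the way the amdgpu scheduler of this era roughly did (job_list_lock and the node member are approximations):

#include <linux/list.h>
#include <linux/spinlock.h>

/* Sketch: unlink the timed-out job so the post-reset recovery loop can
 * never push it to the ring again.
 */
static void amd_sched_kick_job(struct amd_gpu_scheduler *sched,
                               struct amd_sched_job *bad)
{
        spin_lock(&sched->job_list_lock);
        list_del_init(&bad->node);      /* off the ring mirror list */
        spin_unlock(&sched->job_list_lock);
}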

5.        Kick out all jobs in this “guilty” ctx’s KFIFO queue, and set all their fence status to “ECANCELED”

Setting ECANCELED should be ok. But I think we should do this when we try to run the jobs and not during GPU reset.

6.        Force signal all fences that get kicked out by the above two steps, otherwise UMD will block forever if waiting on those fences

Okay.

7.        Do the GPU reset, which can be a set of callbacks so that bare-metal and SR-IOV can each implement it in their favored style

Okay.

8.        After reset, the KMD needs to be aware of whether VRAM loss happened or not; bare-metal can implement some function to judge this, while for SR-IOV I prefer to read it from the GIM side (for the initial version we consider VRAM always lost, until the GIM-side change is aligned)

Okay.
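Step 8 could be structured as one predicate with per-platform backends; amdgpu_sriov_vf() is the real SR-IOV check, while both detection helpers below are hypothetical placeholders:

#include <linux/types.h>

/* Sketch: "did we lose VRAM?" answered differently per platform. */
static bool amdgpu_check_vram_lost(struct amdgpu_device *adev)
{
        if (amdgpu_sriov_vf(adev))
                return gim_query_vram_lost(adev);       /* hypothetical; initially always true */

        return amdgpu_asic_judge_vram_lost(adev);       /* hypothetical bare-metal check */
}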

9.        If VRAM loss was not hit, continue; otherwise:

a)       Update adev->vram_lost_counter,

Okay.

b)       Iterate over all living ctx, and set all ctx as “guilty” since VRAM lost actually ruins all VRAM contents

No, that should rather be done by comparing the counters; iterating over all contexts is way too much overhead.

c)        Kick out all jobs in all ctx’s KFIFO queues, and set all their fence status to “ECANCELED”

Yes and no, that should be done when we try to run the jobs and not during GPU reset.

10.     Do GTT recovery and VRAM page tables/entries recovery (optional, do we need it ???)

Yes, that is still needed. As Nicolai explained we can't be sure that VRAM is still 100% correct even when it isn't cleared.

11.     Re-schedule all jobs remaining in the mirror list to the ring again and restart the scheduler (for the VRAM lost case, no job will be re-scheduled)

Okay.
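Step 11 is roughly a walk over the mirror list once the reset succeeded; a sketch only loosely modeled on the scheduler of this era (names and locking are simplified):

#include <linux/kthread.h>
#include <linux/list.h>

/* Sketch: re-submit every surviving job and restart the scheduler
 * thread. After a VRAM loss the list is empty, since everything was
 * kicked out in the earlier steps.
 */
static void amd_sched_job_recovery(struct amd_gpu_scheduler *sched)
{
        struct amd_sched_job *job;

        list_for_each_entry(job, &sched->ring_mirror_list, node)
                sched->ops->run_job(job);       /* push back to the hardware ring */

        kthread_unpark(sched->thread);          /* scheduler was parked for the reset */
}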

 

-  For cs_wait() IOCTL:

After it finds the fence signaled, it should check with “dma_fence_get_status” to see if there is an error there,

and return the error status of the fence.

Yes and no, dma_fence_get_status() is some specific handling for sync_file debugging (no idea why that made it into the common fence code).

It was replaced by putting the error code directly into the fence, so just reading that one after waiting should be ok.

Maybe we should fix dma_fence_get_status() to do the right thing for this?

 

-  For cs_wait_fences() IOCTL:

Similar to the above approach.

 

-  For cs_submit() IOCTL:

It needs to check whether the current ctx has been marked as “guilty” and return “ECANCELED” if so.
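A sketch of that check, assuming the per-context guilty flag this proposal introduces (the counter-based variant Christian prefers is sketched further up in this thread):

#include <linux/compiler.h>
#include <linux/errno.h>

/* Sketch: cs_submit refuses work from a context marked guilty by TDR. */
static int amdgpu_cs_check_guilty(struct amdgpu_ctx *ctx)
{
        if (READ_ONCE(ctx->guilty))
                return -ECANCELED;

        return 0;
}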

 

-  Introduce a new IOCTL to let UMD query vram_lost_counter:

This way the UMD can also block the app from submitting. Like @Nicolai mentioned, we can cache one copy of vram_lost_counter when enumerating the physical device, and deny all gl-contexts from submitting if the queried counter is bigger than the one cached in the physical device (looks a little overkill to me, but easy to implement).

The UMD can also return an error to the app when creating a gl-context if the currently queried vram_lost_counter is bigger than the one cached in the physical device.

Okay. Already have a patch for this, please review that one if you haven't already done so.
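From the UMD side the query could look like this; struct drm_amdgpu_info and DRM_IOCTL_AMDGPU_INFO are the real interfaces, while AMDGPU_INFO_VRAM_LOST_COUNTER is the query id from the patch under discussion and should be treated as an assumption here:

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include "amdgpu_drm.h"         /* assumed to define the query id */

/* Sketch: UMD caches the counter when enumerating the device and
 * compares it later to decide whether to deny new gl-contexts or
 * report a device reset to the app.
 */
static int query_vram_lost_counter(int fd, uint32_t *counter)
{
        struct drm_amdgpu_info request;

        memset(&request, 0, sizeof(request));
        request.return_pointer = (uintptr_t)counter;
        request.return_size = sizeof(*counter);
        request.query = AMDGPU_INFO_VRAM_LOST_COUNTER;

        return ioctl(fd, DRM_IOCTL_AMDGPU_INFO, &request);
}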

Regards,
Christian.

 

BTW: I realized that a gl-context is a little different from the kernel’s context, because for the kernel a BO is tied not to a context but only to an FD, while in the UMD a BO has a backing gl-context. So blocking submission in the UMD layer is also needed, although the KMD will do its job as the bottom line.

 

-  Basically, “vram_lost_counter” is exposed by the kernel to let the UMD take control of the robustness extension feature; it will be the UMD’s call to make, and the KMD only denies “guilty” contexts from submitting.

 

 

Need your feedback, thx

 

We’d better get the TDR feature landed ASAP.

 

BR Monk

 

 

 

 

 

