> [ML] I think context is better than entity, because for example if you
> only block entity_0 of a context and allow entity_N to run, that means
> the dependency between entities is broken (e.g. page table updates in
> the SDMA entity pass but the gfx submit in the GFX entity is blocked,
> which makes no sense to me).
>
> We'd better either block the whole context or let not…

Page table updates are not part of any context. So I think the only
thing we can do is to mark the entity as not scheduled any more.

>>> 5. Kick out all jobs in this "guilty" ctx's KFIFO queue, and set all
>>> their fence status to "*ECANCELED*"
>>
>> Setting ECANCELED should be ok. But I think we should do this when we
>> try to run the jobs and not during GPU reset.
>
> [ML] Without deep thought and experiment, I'm not sure of the
> difference between them, but kicking them out in the gpu_reset routine
> is more efficient.

I really don't think so. Kicking them out during gpu_reset sounds racy
to me once more. And marking them canceled when we try to run them has
the clear advantage that all dependencies are met first.

> ML: KMD marks all contexts as guilty because that way we can unify our
> IOCTL behavior: e.g. the IOCTLs only block the "guilty" context and no
> longer need to worry about the vram_lost_counter; that's an
> implementation style. I don't think it is related to the UMD layer.

I don't think that this is a good idea. Instead, when you want to unify
the behavior, we should use the vram_lost_counter as the marker for the
guilty context.
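For illustration, a minimal sketch of what such a vram_lost_counter-based
check in cs_submit could look like; the field names (ctx->guilty and a
per-context snapshot of the counter taken at context creation) are
assumptions for illustration, not the actual amdgpu code:

#include <linux/atomic.h>
#include <linux/errno.h>
#include "amdgpu.h"	/* struct amdgpu_device, struct amdgpu_ctx */

/* Assumed fields: ctx->guilty is set once when one of the context's
 * jobs is identified as the cause of a reset; ctx->vram_lost_counter
 * is a snapshot of adev->vram_lost_counter from context creation. */
static int amdgpu_cs_ctx_usable(struct amdgpu_device *adev,
				struct amdgpu_ctx *ctx)
{
	/* This context caused a GPU reset. */
	if (ctx->guilty)
		return -ECANCELED;

	/* VRAM was lost after this context was created: its buffers can
	 * no longer be trusted even though the context is innocent. */
	if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter))
		return -ECANCELED;

	return 0;
}

With this rule the reset path only bumps the global counter; no
iteration over living contexts is needed.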
Regards,
Christian.

On 11.10.2017 10:48, Liu, Monk wrote:

>> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), so
>> it's reasonable to use it. However, it /does not/ make sense to mark
>> idle contexts as "guilty" just because VRAM is lost. VRAM lost is a
>> perfect example where the driver should report context lost to
>> applications with the "innocent" flag for contexts that were idle at
>> the time of reset. The only context(s) that should be reported as
>> "guilty" (or perhaps "unknown" in some cases) are the ones that were
>> executing at the time of reset.
>
> ML: KMD marks all contexts as guilty because that way we can unify our
> IOCTL behavior: e.g. the IOCTLs only block the "guilty" context and no
> longer need to worry about the vram_lost_counter; that's an
> implementation style. I don't think it is related to the UMD layer.
>
> For UMD the gl-context isn't known to KMD, so UMD can implement its own
> "guilty" gl-context handling if it wants to.
>
> If KMD doesn't mark all ctx as guilty after VRAM lost, can you
> illustrate what rule KMD should obey for the check in a KMS IOCTL like
> cs_submit? Let's see which way is better.
>
>> *From:* Haehnle, Nicolai
>> *Sent:* Wednesday, October 11, 2017 4:41 PM
>> *To:* Liu, Monk; Koenig, Christian; Olsak, Marek; Deucher, Alexander
>> *Cc:* amd-gfx@lists.freedesktop.org; Ding, Pixel; Jiang, Jerry (SW);
>> Li, Bingley; Ramirez, Alejandro; Filipas, Mario
>> *Subject:* Re: TDR and VRAM lost handling in KMD:
>>
>> From a Mesa perspective, this almost all sounds reasonable to me.
>>
>> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), so
>> it's reasonable to use it. However, it /does not/ make sense to mark
>> idle contexts as "guilty" just because VRAM is lost. VRAM lost is a
>> perfect example where the driver should report context lost to
>> applications with the "innocent" flag for contexts that were idle at
>> the time of reset. The only context(s) that should be reported as
>> "guilty" (or perhaps "unknown" in some cases) are the ones that were
>> executing at the time of reset.
>>
>> On whether the whole context is marked as guilty from a user space
>> perspective, it would simply be nice for user space to get consistent
>> answers. It would be a bit odd if we could e.g. succeed in submitting
>> an SDMA job after a GFX job was rejected. This would point in favor of
>> marking the entire context as guilty (although that could happen
>> lazily instead of at reset time). On the other hand, if that's too big
>> a burden for the kernel implementation, I'm sure we can live without it.
>>
>> Cheers,
>> Nicolai
>>
>> ------------------------------------------------------------------------
>>
>> *From:* Liu, Monk
>> *Sent:* Wednesday, October 11, 2017 10:15:40 AM
>> *To:* Koenig, Christian; Haehnle, Nicolai; Olsak, Marek; Deucher,
>> Alexander
>> *Cc:* amd-gfx@lists.freedesktop.org; Ding, Pixel; Jiang, Jerry (SW);
>> Li, Bingley; Ramirez, Alejandro; Filipas, Mario
>> *Subject:* RE: TDR and VRAM lost handling in KMD:
>>
>>>> 2. Set its fence error status to "*ETIME*",
>>>
>>> No, as I already explained, ETIME is for synchronous operation.
>>>
>>> In other words, when we return ETIME from the wait IOCTL it would
>>> mean that the waiting has somehow timed out, but not the job we
>>> waited for.
>>>
>>> Please use ECANCELED as well, or some other error code if we find
>>> that we need to distinguish the timed-out job from the canceled ones
>>> (probably a good idea, but I'm not sure).
>>
>> [ML] I'm okay if you insist not to use ETIME.
>>
>>>> 3. Find the entity/ctx behind this job, and set this ctx as
>>>> "*guilty*"
>>>
>>> Not sure. Do we want to set the whole context as guilty or just the
>>> entity?
>>>
>>> Setting the whole context as guilty sounds racy to me.
>>>
>>> BTW: We should use a different name than "guilty", maybe just "bool
>>> canceled;"?
>>
>> [ML] I think context is better than entity, because for example if
>> you only block entity_0 of a context and allow entity_N to run, that
>> means the dependency between entities is broken (e.g. page table
>> updates in the SDMA entity pass but the gfx submit in the GFX entity
>> is blocked, which makes no sense to me).
>>
>> We'd better either block the whole context or let not…
>>
>>>> 5. Kick out all jobs in this "guilty" ctx's KFIFO queue, and set
>>>> all their fence status to "*ECANCELED*"
>>>
>>> Setting ECANCELED should be ok. But I think we should do this when
>>> we try to run the jobs and not during GPU reset.
>>
>> [ML] Without deep thought and experiment, I'm not sure of the
>> difference between them, but kicking them out in the gpu_reset
>> routine is more efficient. Otherwise you need to check the
>> context/entity guilty flag in the run_job routine… and you need to do
>> it for every context/entity. I don't see why we don't just kick all
>> of them out in the gpu_reset stage….
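For comparison, a sketch of the run-time variant: the cost is a single
flag test per job in the scheduler's run path, with no walk over
contexts or queues during reset. The entity->canceled flag and the
helper are hypothetical, loosely modeled on the 2017-era amd_sched
structures rather than the actual implementation:

#include <linux/compiler.h>
#include <linux/dma-fence.h>
#include <linux/errno.h>
#include "gpu_scheduler.h"	/* assumed: amd_sched_{entity,job,fence} */

/* Hypothetical wrapper the scheduler's main loop would call instead of
 * ops->run_job() directly. entity->canceled is an assumed flag, set
 * once when the offending job is identified. */
static struct dma_fence *
sched_run_job_checked(struct amd_sched_entity *entity,
		      struct amd_sched_job *job)
{
	if (READ_ONCE(entity->canceled)) {
		/* The scheduler only hands out jobs whose dependencies
		 * have already signaled, so canceling here cannot break
		 * the ordering between entities (e.g. SDMA page table
		 * updates vs. GFX submissions). */
		dma_fence_set_error(&job->s_fence->finished, -ECANCELED);
		amd_sched_fence_finished(job->s_fence);
		return NULL;
	}

	return job->sched->ops->run_job(job);
}

The same test works whether the flag lives on the entity or on the
context; only the pointer the check reads changes.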
>>>> 9.b) Iterate over all living ctx, and set all ctx as "*guilty*",
>>>> since VRAM lost actually ruins all VRAM contents
>>>
>>> No, that shouldn't be done by comparing the counters. Iterating over
>>> all contexts is way too much overhead.
>>
>> [ML] Because I want to make the KMS IOCTL rules clean: they don't
>> need to differentiate whether VRAM was lost or not, they are only
>> interested in whether the context is guilty or not, and they block
>> the submit for guilty ones.
>>
>> *Can you give more details of your idea? And better, detail the
>> implementation in cs_submit; I want to see how you want to block
>> submits without checking the context guilty flag.*
>>
>>>> 9.c) Kick out all jobs in all ctx's KFIFO queues, and set all their
>>>> fence status to "*ECANCELED*"
>>>
>>> Yes and no, that should be done when we try to run the jobs and not
>>> during GPU reset.
>>
>> [ML] Again, kicking them out in the gpu_reset routine is more
>> efficient; otherwise you need a check on every job in run_job().
>>
>> Besides, can you illustrate the detailed implementation?
>>
>>> Yes and no, dma_fence_get_status() is some specific handling for
>>> sync_file debugging (no idea why that made it into the common fence
>>> code).
>>>
>>> It was replaced by putting the error code directly into the fence,
>>> so just reading that one after waiting should be ok.
>>>
>>> Maybe we should fix dma_fence_get_status() to do the right thing for
>>> this?
>>
>> [ML] Yeah, that's too confusing; the name really sounds like the one
>> I want to use. We should change it…
>>
>> *But looking into the implementation, I don't see why we cannot use
>> it? It also finally returns the fence->error.*
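To make the distinction concrete, a hedged sketch of a wait-IOCTL tail
(the helper name is made up): -ETIME is reserved for the wait itself
timing out, while an error recorded on the fence via
dma_fence_set_error(), e.g. -ECANCELED from a reset, is surfaced through
dma_fence_get_status(), which returns a negative error, 0 while pending,
or 1 when signaled cleanly:

#include <linux/dma-fence.h>
#include <linux/errno.h>

static int wait_fence_and_report(struct dma_fence *fence, long timeout)
{
	long r;
	int status;

	r = dma_fence_wait_timeout(fence, true, timeout);
	if (r < 0)
		return r;		/* interrupted by a signal */
	if (r == 0)
		return -ETIME;		/* the wait itself timed out */

	/* Fence signaled: report any error stored in the fence, e.g.
	 * -ECANCELED set on jobs canceled by a GPU reset. */
	status = dma_fence_get_status(fence);
	return status < 0 ? status : 0;
}

Reading fence->error directly after the wait behaves the same way;
dma_fence_get_status() just wraps that read with the fence lock and a
signaled check.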
>>> *From:* Koenig, Christian
>>> *Sent:* Wednesday, October 11, 2017 3:21 PM
>>> *To:* Liu, Monk; Haehnle, Nicolai; Olsak, Marek; Deucher, Alexander
>>> *Cc:* amd-gfx@lists.freedesktop.org; Ding, Pixel; Jiang, Jerry (SW);
>>> Li, Bingley; Ramirez, Alejandro; Filipas, Mario
>>> *Subject:* Re: TDR and VRAM lost handling in KMD:
>>>
>>> See inline:
>>>
>>> On 11.10.2017 07:33, Liu, Monk wrote:
>>>
>>>> Hi Christian & Nicolai,
>>>>
>>>> We need to achieve some agreement on what MESA/UMD should do and
>>>> what KMD should do. *Please give your comments with "okay" or "No"
>>>> and your ideas on the items below.*
>>>>
>>>> - When a job times out (set from the lockup_timeout kernel
>>>> parameter), what should KMD do in the TDR routine:
>>>>
>>>> 1. Update adev->*gpu_reset_counter* and stop the scheduler first
>>>> (*gpu_reset_counter* is used to force a VM flush after GPU reset;
>>>> this is out of this thread's scope, so no more discussion on it).
>>>
>>> Okay.
>>>
>>>> 2. Set its fence error status to "*ETIME*".
>>>
>>> No, as I already explained, ETIME is for synchronous operation.
>>>
>>> In other words, when we return ETIME from the wait IOCTL it would
>>> mean that the waiting has somehow timed out, but not the job we
>>> waited for.
>>>
>>> Please use ECANCELED as well, or some other error code if we find
>>> that we need to distinguish the timed-out job from the canceled ones
>>> (probably a good idea, but I'm not sure).
>>>
>>>> 3. Find the entity/ctx behind this job, and set this ctx as
>>>> "*guilty*".
>>>
>>> Not sure. Do we want to set the whole context as guilty or just the
>>> entity?
>>>
>>> Setting the whole context as guilty sounds racy to me.
>>>
>>> BTW: We should use a different name than "guilty", maybe just "bool
>>> canceled;"?
>>>
>>>> 4. Kick out this job from the scheduler's mirror list, so this job
>>>> won't get re-scheduled to the ring anymore.
>>>
>>> Okay.
>>>
>>>> 5. Kick out all jobs in this "guilty" ctx's KFIFO queue, and set
>>>> all their fence status to "*ECANCELED*".
>>>
>>> Setting ECANCELED should be ok. But I think we should do this when
>>> we try to run the jobs and not during GPU reset.
>>>
>>>> 6. Force signal all fences that get kicked out by the above two
>>>> steps, *otherwise UMD will block forever if waiting on those
>>>> fences*.
>>>
>>> Okay.
>>>
>>>> 7. Do the GPU reset, which can be some callbacks to let bare-metal
>>>> and SR-IOV implement it in their favored style.
>>>
>>> Okay.
>>>
>>>> 8. After reset, KMD needs to know whether VRAM lost happened or
>>>> not; bare-metal can implement some function to judge, while for
>>>> SR-IOV I prefer to read it from the GIM side (for the initial
>>>> version we consider it always VRAM lost, until the GIM-side change
>>>> is aligned).
>>>
>>> Okay.
>>>
>>>> 9. If VRAM lost was not hit, continue; otherwise:
>>>>
>>>> a) Update adev->*vram_lost_counter*.
>>>
>>> Okay.
>>>
>>>> b) Iterate over all living ctx, and set all ctx as "*guilty*",
>>>> since VRAM lost actually ruins all VRAM contents.
>>>
>>> No, that shouldn't be done by comparing the counters. Iterating over
>>> all contexts is way too much overhead.
>>>
>>>> c) Kick out all jobs in all ctx's KFIFO queues, and set all their
>>>> fence status to "*ECANCELED*".
>>>
>>> Yes and no, that should be done when we try to run the jobs and not
>>> during GPU reset.
>>>
>>>> 10. Do GTT recovery and VRAM page tables/entries recovery
>>>> (optional, do we need it???).
>>>
>>> Yes, that is still needed. As Nicolai explained, we can't be sure
>>> that VRAM is still 100% correct even when it isn't cleared.
>>>
>>>> 11. Re-schedule all jobs remaining in the mirror list to the ring
>>>> again and restart the scheduler (for the VRAM lost case, no job
>>>> will be re-scheduled).
>>>
>>> Okay.
>>>
>>>> - For the cs_wait() IOCTL:
>>>>
>>>> After it finds the fence signaled, it should check with
>>>> *dma_fence_get_status()* to see if there is an error there, and
>>>> return the error status of the fence.
>>>
>>> Yes and no, dma_fence_get_status() is some specific handling for
>>> sync_file debugging (no idea why that made it into the common fence
>>> code).
>>>
>>> It was replaced by putting the error code directly into the fence,
>>> so just reading that one after waiting should be ok.
>>>
>>> Maybe we should fix dma_fence_get_status() to do the right thing for
>>> this?
>>>
>>>> - For the cs_wait_fences() IOCTL:
>>>>
>>>> Similar to the above approach.
>>>>
>>>> - For the cs_submit() IOCTL:
>>>>
>>>> It needs to check whether the current ctx has been marked as
>>>> "*guilty*" and return "*ECANCELED*" if so.
>>>>
>>>> - Introduce a new IOCTL to let UMD query the *vram_lost_counter*:
>>>>
>>>> This way UMD can also block the app from submitting. Like @Nicolai
>>>> mentioned, we can cache one copy of the *vram_lost_counter* when
>>>> enumerating the physical device, and deny all gl-contexts from
>>>> submitting if the queried counter is bigger than the one cached in
>>>> the physical device (looks a little overkill to me, but easy to
>>>> implement).
>>>>
>>>> UMD can also return an error to the app when creating a gl-context
>>>> if the currently queried *vram_lost_counter* is bigger than the one
>>>> cached in the physical device.
>>>
>>> Okay. Already have a patch for this, please review that one if you
>>> haven't already done so.
>>>
>>> Regards,
>>> Christian.
>>>
>>>> BTW: I realized that a gl-context is a little different from the
>>>> kernel's context, because for the kernel a BO is not related to a
>>>> context but only to an FD, while in UMD a BO has a backing
>>>> gl-context. So blocking submission in the UMD layer is also needed,
>>>> although KMD will do its job as the bottom line.
>>>>
>>>> - Basically, *vram_lost_counter* is exposed by the kernel to let
>>>> UMD take control of the robustness extension feature; it will be
>>>> UMD's call to make, and KMD only denies "guilty" contexts from
>>>> submitting.
>>>>
>>>> Need your feedback, thx.
>>>>
>>>> We'd better get the TDR feature landed ASAP.
>>>>
>>>> BR Monk
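Tying the UMD side together, a sketch of the policy Monk describes
above: cache the counter once per physical device at enumeration and
once per gl-context at creation, then compare on use. The query function
stands in for the proposed new IOCTL; none of these names are actual
Mesa or libdrm API:

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the proposed vram_lost_counter query IOCTL. */
extern uint64_t query_vram_lost_counter(void);

/* Snapshots taken at enumeration / context-creation time. */
struct phys_device { uint64_t vram_lost_counter; };
struct gl_context  { uint64_t vram_lost_counter; };

/* True if VRAM was lost after this gl-context was created: the context
 * should report "context lost" to the app and stop submitting. */
static bool gl_context_lost(const struct gl_context *ctx)
{
	return query_vram_lost_counter() > ctx->vram_lost_counter;
}

/* True if the device lost VRAM after enumeration: gl-context creation
 * can already fail at this point, as suggested above. */
static bool device_lost_since_enum(const struct phys_device *dev)
{
	return query_vram_lost_counter() > dev->vram_lost_counter;
}

The kernel-side guilty check then remains the bottom line, with UMD
enforcing the robustness extension semantics on top of it.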