* TDR and VRAM lost handling in KMD:
@ 2017-10-11  5:33 Liu, Monk
  [not found] ` <BLUPR12MB0449785160E34EA9369C5E23844A0-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Liu, Monk @ 2017-10-11  5:33 UTC (permalink / raw)
To: Koenig, Christian; Haehnle, Nicolai; Olsak, Marek; Deucher, Alexander
Cc: Ramirez, Alejandro; amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; Filipas, Mario; Ding, Pixel; Li, Bingley; Jiang, Jerry (SW)

Hi Christian & Nicolai,

We need to reach agreement on what MESA/UMD should do and what KMD should do. Please reply with "okay" or "no" and your ideas on the items below.

- When a job times out (set from the lockup_timeout kernel parameter), what KMD should do in the TDR routine:

  1. Update adev->gpu_reset_counter, and stop the scheduler first (gpu_reset_counter is used to force a VM flush after GPU reset; that is out of this thread's scope, so no more discussion on it).
  2. Set the job's fence error status to "ETIME".
  3. Find the entity/ctx behind this job, and mark that ctx as "guilty".
  4. Kick this job out of the scheduler's mirror list, so it won't get re-scheduled to the ring anymore.
  5. Kick out all jobs in this "guilty" ctx's KFIFO queue, and set all their fence statuses to "ECANCELED".
  6. Force-signal all fences kicked out by the two steps above, otherwise UMD will block forever when waiting on those fences.
  7. Do the GPU reset, which can be some callbacks so that bare-metal and SR-IOV can each implement it in their preferred style.
  8. After reset, KMD needs to know whether VRAM was lost. Bare-metal can implement some function to judge this, while for SR-IOV I prefer to read it from the GIM side (for the initial version we assume VRAM is always lost, until the GIM-side change is aligned).
  9. If VRAM lost did not hit, continue; otherwise:
     a) update adev->vram_lost_counter,
     b) iterate over all living ctxs and mark them all as "guilty", since VRAM lost actually ruins all VRAM contents,
     c) kick out all jobs in all ctxs' KFIFO queues, and set all their fence statuses to "ECANCELED".
  10. Do GTT recovery and VRAM page tables/entries recovery (optional; do we need it?).
  11. Re-schedule all jobs remaining in the mirror list to the ring and restart the scheduler (for the VRAM lost case, no job will be re-scheduled).

- For the cs_wait() IOCTL: after it finds the fence signaled, it should check with "dma_fence_get_status" whether there is an error, and return the fence's error status.

- For the cs_wait_fences() IOCTL: similar to the approach above.

- For the cs_submit() IOCTL: it needs to check whether the current ctx has been marked as "guilty", and return "ECANCELED" if so.

- Introduce a new IOCTL to let UMD query vram_lost_counter: this way UMD can also block the app from submitting. As @Nicolai mentioned, we can cache one copy of vram_lost_counter when enumerating the physical device, and deny all GL contexts from submitting if the queried counter is bigger than the one cached in the physical device (looks a little overkill to me, but easy to implement). UMD can also return an error to the app on GL context creation if the currently queried vram_lost_counter is bigger than the one cached in the physical device.

  BTW: I realized that a GL context is a little different from the kernel's context, because for the kernel a BO is related not to a context but only to an FD, while in UMD a BO has a backing GL context. So blocking submission in the UMD layer is also needed, although KMD will do its job as the bottom line.

- Basically, "vram_lost_counter" is exposed by the kernel to let UMD take control of the robustness extension feature. It will be UMD's call to make; KMD only denies "guilty" contexts from submitting.

Need your feedback, thanks. We'd better get the TDR feature landed ASAP.

BR Monk

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
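[Editor's note] The fence bookkeeping in the proposed TDR routine (steps 1-9) can be sketched as a small self-contained model. This is plain C with hypothetical stand-in types (struct fence, struct job, struct ctx, struct adev); it is not the real amdgpu/scheduler code, only an illustration of the proposed state transitions:

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the real amdgpu structures; this models
 * the proposed flow only, it is not kernel code. */
struct fence { bool signaled; int error; };
struct job   { struct fence fence; };
struct ctx   { bool guilty; struct job *queue; size_t n_queued; };
struct adev  { unsigned gpu_reset_counter; unsigned vram_lost_counter; };

/* Steps 2-6: mark the timed-out job's ctx guilty, cancel its queued
 * jobs, and force-signal every kicked-out fence so waiters wake up. */
static void tdr_cancel_ctx(struct ctx *ctx, struct job *timedout)
{
    ctx->guilty = true;
    timedout->fence.error = -ETIME;              /* step 2 */
    timedout->fence.signaled = true;             /* step 6 */
    for (size_t i = 0; i < ctx->n_queued; i++) {
        ctx->queue[i].fence.error = -ECANCELED;  /* step 5 */
        ctx->queue[i].fence.signaled = true;     /* step 6 */
    }
}

/* Steps 1, 8-9: bump the reset counter; on VRAM lost, bump the lost
 * counter and cancel everything in every living ctx. */
static void tdr_post_reset(struct adev *adev, struct ctx *ctxs,
                           size_t n_ctx, bool vram_lost)
{
    adev->gpu_reset_counter++;
    if (!vram_lost)
        return;
    adev->vram_lost_counter++;                       /* step 9a */
    for (size_t i = 0; i < n_ctx; i++) {
        ctxs[i].guilty = true;                       /* step 9b */
        for (size_t j = 0; j < ctxs[i].n_queued; j++) {
            ctxs[i].queue[j].fence.error = -ECANCELED;  /* step 9c */
            ctxs[i].queue[j].fence.signaled = true;
        }
    }
}
```

Note how force-signalling with a stored error keeps waiters from blocking forever while still letting them observe that the job never ran.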
* RE: TDR and VRAM lost handling in KMD:
  [not found] ` <BLUPR12MB0449785160E34EA9369C5E23844A0-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2017-10-11  7:14 ` Liu, Monk
  2017-10-11  7:20 ` Christian König
  1 sibling, 0 replies; 23+ messages in thread
From: Liu, Monk @ 2017-10-11  7:14 UTC (permalink / raw)
To: Koenig, Christian; Haehnle, Nicolai; Olsak, Marek; Deucher, Alexander; Mao, David
Cc: Ramirez, Alejandro; 'amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org'; Filipas, Mario; Ding, Pixel; Li, Bingley; Jiang, Jerry (SW)

+ david

From: Liu, Monk
Sent: Wednesday, October 11, 2017 1:34 PM
To: Koenig, Christian <Christian.Koenig@amd.com>; Haehnle, Nicolai <Nicolai.Haehnle@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>
Cc: amd-gfx@lists.freedesktop.org; Ding, Pixel <Pixel.Ding@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, Bingley (Bingley.Li@amd.com) <Bingley.Li@amd.com>; Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; Filipas, Mario <Mario.Filipas@amd.com>
Subject: TDR and VRAM lost handling in KMD:

Hi Christian & Nicolai,

We need to reach agreement on what MESA/UMD should do and what KMD should do. Please reply with "okay" or "no" and your ideas on the items below.

- When a job times out (set from the lockup_timeout kernel parameter), what KMD should do in the TDR routine:

  1. Update adev->gpu_reset_counter, and stop the scheduler first (gpu_reset_counter is used to force a VM flush after GPU reset; that is out of this thread's scope, so no more discussion on it).
  2. Set the job's fence error status to "ETIME".
  3. Find the entity/ctx behind this job, and mark that ctx as "guilty".
  4. Kick this job out of the scheduler's mirror list, so it won't get re-scheduled to the ring anymore.
  5. Kick out all jobs in this "guilty" ctx's KFIFO queue, and set all their fence statuses to "ECANCELED".
  6. Force-signal all fences kicked out by the two steps above, otherwise UMD will block forever when waiting on those fences.
  7. Do the GPU reset, which can be some callbacks so that bare-metal and SR-IOV can each implement it in their preferred style.
  8. After reset, KMD needs to know whether VRAM was lost. Bare-metal can implement some function to judge this, while for SR-IOV I prefer to read it from the GIM side (for the initial version we assume VRAM is always lost, until the GIM-side change is aligned).
  9. If VRAM lost did not hit, continue; otherwise:
     a) update adev->vram_lost_counter,
     b) iterate over all living ctxs and mark them all as "guilty", since VRAM lost actually ruins all VRAM contents,
     c) kick out all jobs in all ctxs' KFIFO queues, and set all their fence statuses to "ECANCELED".
  10. Do GTT recovery and VRAM page tables/entries recovery (optional; do we need it?).
  11. Re-schedule all jobs remaining in the mirror list to the ring and restart the scheduler (for the VRAM lost case, no job will be re-scheduled).

- For the cs_wait() IOCTL: after it finds the fence signaled, it should check with "dma_fence_get_status" whether there is an error, and return the fence's error status.

- For the cs_wait_fences() IOCTL: similar to the approach above.

- For the cs_submit() IOCTL: it needs to check whether the current ctx has been marked as "guilty", and return "ECANCELED" if so.

- Introduce a new IOCTL to let UMD query vram_lost_counter: this way UMD can also block the app from submitting. As @Nicolai mentioned, we can cache one copy of vram_lost_counter when enumerating the physical device, and deny all GL contexts from submitting if the queried counter is bigger than the one cached in the physical device (looks a little overkill to me, but easy to implement). UMD can also return an error to the app on GL context creation if the currently queried vram_lost_counter is bigger than the one cached in the physical device.

  BTW: I realized that a GL context is a little different from the kernel's context, because for the kernel a BO is related not to a context but only to an FD, while in UMD a BO has a backing GL context. So blocking submission in the UMD layer is also needed, although KMD will do its job as the bottom line.

- Basically, "vram_lost_counter" is exposed by the kernel to let UMD take control of the robustness extension feature. It will be UMD's call to make; KMD only denies "guilty" contexts from submitting.

Need your feedback, thanks. We'd better get the TDR feature landed ASAP.

BR Monk
* Re: TDR and VRAM lost handling in KMD:
  [not found] ` <BLUPR12MB0449785160E34EA9369C5E23844A0-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  2017-10-11  7:14 ` Liu, Monk
@ 2017-10-11  7:20 ` Christian König
  [not found] ` <b5c5f6c9-07e2-4688-8ffc-3929bfc59366-5C7GfCeVMHo@public.gmane.org>
  1 sibling, 1 reply; 23+ messages in thread
From: Christian König @ 2017-10-11  7:20 UTC (permalink / raw)
To: Liu, Monk; Haehnle, Nicolai; Olsak, Marek; Deucher, Alexander
Cc: Ramirez, Alejandro; amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; Filipas, Mario; Ding, Pixel; Li, Bingley; Jiang, Jerry (SW)

See inline:

Am 11.10.2017 um 07:33 schrieb Liu, Monk:
> Hi Christian & Nicolai,
>
> We need to reach agreement on what MESA/UMD should do and what KMD
> should do. Please reply with "okay" or "no" and your ideas on the
> items below.
>
> - When a job times out (set from the lockup_timeout kernel parameter),
>   what KMD should do in the TDR routine:
>
> 1. Update adev->gpu_reset_counter, and stop the scheduler first
>    (gpu_reset_counter is used to force a VM flush after GPU reset; out
>    of this thread's scope, so no more discussion on it).

Okay.

> 2. Set the job's fence error status to "ETIME".

No, as I already explained, ETIME is for synchronous operation. In other
words, when we return ETIME from the wait IOCTL it would mean that the
waiting has somehow timed out, but not the job we waited for.

Please use ECANCELED as well, or some other error code when we find that
we need to distinguish the timed-out job from the canceled ones
(probably a good idea, but I'm not sure).

> 3. Find the entity/ctx behind this job, and mark that ctx as "guilty".

Not sure. Do we want to set the whole context as guilty or just the
entity? Setting the whole context as guilty sounds racy to me.

BTW: We should use a different name than "guilty", maybe just
"bool canceled;"?

> 4. Kick this job out of the scheduler's mirror list, so it won't get
>    re-scheduled to the ring anymore.

Okay.

> 5. Kick out all jobs in this "guilty" ctx's KFIFO queue, and set all
>    their fence statuses to "ECANCELED".

Setting ECANCELED should be ok. But I think we should do this when we
try to run the jobs and not during GPU reset.

> 6. Force-signal all fences kicked out by the two steps above, otherwise
>    UMD will block forever when waiting on those fences.

Okay.

> 7. Do the GPU reset, which can be some callbacks so that bare-metal and
>    SR-IOV can each implement it in their preferred style.

Okay.

> 8. After reset, KMD needs to know whether VRAM was lost. Bare-metal can
>    implement some function to judge this, while for SR-IOV I prefer to
>    read it from the GIM side (for the initial version we assume VRAM is
>    always lost, until the GIM-side change is aligned).

Okay.

> 9. If VRAM lost did not hit, continue; otherwise:
>    a) update adev->vram_lost_counter,

Okay.

>    b) iterate over all living ctxs and mark them all as "guilty", since
>       VRAM lost actually ruins all VRAM contents,

No, that should be done by comparing the counters instead. Iterating
over all contexts is way too much overhead.

>    c) kick out all jobs in all ctxs' KFIFO queues, and set all their
>       fence statuses to "ECANCELED".

Yes and no, that should be done when we try to run the jobs and not
during GPU reset.

> 10. Do GTT recovery and VRAM page tables/entries recovery (optional; do
>     we need it?).

Yes, that is still needed. As Nicolai explained, we can't be sure that
VRAM is still 100% correct even when it isn't cleared.

> 11. Re-schedule all jobs remaining in the mirror list to the ring and
>     restart the scheduler (for the VRAM lost case, no job will be
>     re-scheduled).

Okay.

> - For the cs_wait() IOCTL: after it finds the fence signaled, it should
>   check with "dma_fence_get_status" whether there is an error, and
>   return the fence's error status.

Yes and no, dma_fence_get_status() is some specific handling for
sync_file debugging (no idea why that made it into the common fence
code). It was replaced by putting the error code directly into the
fence, so just reading that one after waiting should be ok.

Maybe we should fix dma_fence_get_status() to do the right thing for
this?

> - For the cs_wait_fences() IOCTL: similar to the approach above.
>
> - For the cs_submit() IOCTL: it needs to check whether the current ctx
>   has been marked as "guilty", and return "ECANCELED" if so.
>
> - Introduce a new IOCTL to let UMD query vram_lost_counter: this way
>   UMD can also block the app from submitting. As @Nicolai mentioned, we
>   can cache one copy of vram_lost_counter when enumerating the physical
>   device, and deny all GL contexts from submitting if the queried
>   counter is bigger than the one cached in the physical device (looks a
>   little overkill to me, but easy to implement). UMD can also return an
>   error to the app on GL context creation if the currently queried
>   vram_lost_counter is bigger than the one cached in the physical
>   device.

Okay. I already have a patch for this; please review that one if you
haven't already done so.

Regards,
Christian.

> BTW: I realized that a GL context is a little different from the
> kernel's context, because for the kernel a BO is related not to a
> context but only to an FD, while in UMD a BO has a backing GL context.
> So blocking submission in the UMD layer is also needed, although KMD
> will do its job as the bottom line.
>
> - Basically, "vram_lost_counter" is exposed by the kernel to let UMD
>   take control of the robustness extension feature. It will be UMD's
>   call to make; KMD only denies "guilty" contexts from submitting.
>
> Need your feedback, thanks. We'd better get the TDR feature landed ASAP.
>
> BR Monk
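[Editor's note] Christian's point about cs_wait() — ETIME only for a timed-out wait, and the job's own status read back from the fence after it signals — can be sketched with a tiny self-contained model (hypothetical struct fence; the real code would read the error stored in the dma_fence after the wait returns):

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the cs_wait() semantics discussed above:
 * -ETIME means the wait IOCTL itself timed out; once the fence has
 * signaled, return the error stored in the fence (0 on success, or
 * -ECANCELED for a job kicked out during TDR). */
struct fence { bool signaled; int error; };

static int cs_wait(const struct fence *f, bool wait_timed_out)
{
    if (!f->signaled && wait_timed_out)
        return -ETIME;     /* synchronous timeout of the wait itself */
    return f->error;       /* the awaited job's own status */
}
```

This keeps the two failure modes distinct: a slow wait and a canceled job never report the same error code.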
* RE: TDR and VRAM lost handling in KMD:
  [not found] ` <b5c5f6c9-07e2-4688-8ffc-3929bfc59366-5C7GfCeVMHo@public.gmane.org>
@ 2017-10-11  8:15 ` Liu, Monk
  [not found] ` <BLUPR12MB044911DFCB510022605DD38A844A0-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Liu, Monk @ 2017-10-11  8:15 UTC (permalink / raw)
To: Koenig, Christian; Haehnle, Nicolai; Olsak, Marek; Deucher, Alexander
Cc: Ramirez, Alejandro; amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; Filipas, Mario; Ding, Pixel; Li, Bingley; Jiang, Jerry (SW)

2. Set the job's fence error status to "ETIME".

No, as I already explained, ETIME is for synchronous operation. In other words, when we return ETIME from the wait IOCTL it would mean that the waiting has somehow timed out, but not the job we waited for. Please use ECANCELED as well, or some other error code when we find that we need to distinguish the timed-out job from the canceled ones (probably a good idea, but I'm not sure).

[ML] I'm okay if you insist on not using ETIME.

3. Find the entity/ctx behind this job, and mark that ctx as "guilty".

Not sure. Do we want to set the whole context as guilty or just the entity? Setting the whole context as guilty sounds racy to me. BTW: We should use a different name than "guilty", maybe just "bool canceled;"?

[ML] I think the context is better than the entity, because, for example, if you only block entity_0 of a context and allow entity_N to run, the dependencies between entities are broken (e.g. page-table updates in the SDMA entity pass while the GFX submit in the GFX entity is blocked, which makes no sense to me). We'd better either block the whole context or not block at all.

5. Kick out all jobs in this "guilty" ctx's KFIFO queue, and set all their fence statuses to "ECANCELED".

Setting ECANCELED should be ok. But I think we should do this when we try to run the jobs and not during GPU reset.

[ML] Without deeper thought and experiment I'm not sure of the difference between them, but kicking them out in the gpu_reset routine is more efficient. Otherwise you need to check the context/entity guilty flag in the run_job routine, and you need to do it for every context/entity. I don't see why we don't just kick all of them out in the gpu_reset stage.

9b) Iterate over all living ctxs and mark them all as "guilty", since VRAM lost actually ruins all VRAM contents.

No, that should be done by comparing the counters instead. Iterating over all contexts is way too much overhead.

[ML] That is because I want to keep the KMS IOCTL rules clean: they don't need to differentiate whether VRAM was lost or not, they are only interested in whether the context is guilty, and they block submission for guilty ones. Can you give more details of your idea, preferably the detailed implementation in cs_submit? I want to see how you want to block submission without checking the context guilty flag.

9c) Kick out all jobs in all ctxs' KFIFO queues, and set all their fence statuses to "ECANCELED".

Yes and no, that should be done when we try to run the jobs and not during GPU reset.

[ML] Again, kicking them out in the GPU reset routine is highly efficient; otherwise you need a check on every job in run_job(). Besides, can you illustrate the detailed implementation?

Yes and no, dma_fence_get_status() is some specific handling for sync_file debugging (no idea why that made it into the common fence code). It was replaced by putting the error code directly into the fence, so just reading that one after waiting should be ok. Maybe we should fix dma_fence_get_status() to do the right thing for this?

[ML] Yeah, that is confusing; the name really sounds like the one I want to use, so we should change it. But looking into the implementation, I don't see why we cannot use it: it also finally returns fence->error.

From: Koenig, Christian
Sent: Wednesday, October 11, 2017 3:21 PM
To: Liu, Monk <Monk.Liu@amd.com>; Haehnle, Nicolai <Nicolai.Haehnle@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>
Cc: amd-gfx@lists.freedesktop.org; Ding, Pixel <Pixel.Ding@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, Bingley <Bingley.Li@amd.com>; Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; Filipas, Mario <Mario.Filipas@amd.com>
Subject: Re: TDR and VRAM lost handling in KMD:

See inline:

Am 11.10.2017 um 07:33 schrieb Liu, Monk:

Hi Christian & Nicolai,

We need to reach agreement on what MESA/UMD should do and what KMD should do. Please reply with "okay" or "no" and your ideas on the items below.

- When a job times out (set from the lockup_timeout kernel parameter), what KMD should do in the TDR routine:

1. Update adev->gpu_reset_counter, and stop the scheduler first (gpu_reset_counter is used to force a VM flush after GPU reset; out of this thread's scope, so no more discussion on it).

Okay.

2. Set the job's fence error status to "ETIME".

No, as I already explained, ETIME is for synchronous operation. In other words, when we return ETIME from the wait IOCTL it would mean that the waiting has somehow timed out, but not the job we waited for. Please use ECANCELED as well, or some other error code when we find that we need to distinguish the timed-out job from the canceled ones (probably a good idea, but I'm not sure).

3. Find the entity/ctx behind this job, and mark that ctx as "guilty".

Not sure. Do we want to set the whole context as guilty or just the entity? Setting the whole context as guilty sounds racy to me. BTW: We should use a different name than "guilty", maybe just "bool canceled;"?

4. Kick this job out of the scheduler's mirror list, so it won't get re-scheduled to the ring anymore.

Okay.

5. Kick out all jobs in this "guilty" ctx's KFIFO queue, and set all their fence statuses to "ECANCELED".

Setting ECANCELED should be ok. But I think we should do this when we try to run the jobs and not during GPU reset.

6. Force-signal all fences kicked out by the two steps above, otherwise UMD will block forever when waiting on those fences.

Okay.

7. Do the GPU reset, which can be some callbacks so that bare-metal and SR-IOV can each implement it in their preferred style.

Okay.

8. After reset, KMD needs to know whether VRAM was lost. Bare-metal can implement some function to judge this, while for SR-IOV I prefer to read it from the GIM side (for the initial version we assume VRAM is always lost, until the GIM-side change is aligned).

Okay.

9. If VRAM lost did not hit, continue; otherwise:

a) update adev->vram_lost_counter,

Okay.

b) iterate over all living ctxs and mark them all as "guilty", since VRAM lost actually ruins all VRAM contents,

No, that should be done by comparing the counters instead. Iterating over all contexts is way too much overhead.

c) kick out all jobs in all ctxs' KFIFO queues, and set all their fence statuses to "ECANCELED".

Yes and no, that should be done when we try to run the jobs and not during GPU reset.

10. Do GTT recovery and VRAM page tables/entries recovery (optional; do we need it?).

Yes, that is still needed. As Nicolai explained, we can't be sure that VRAM is still 100% correct even when it isn't cleared.

11. Re-schedule all jobs remaining in the mirror list to the ring and restart the scheduler (for the VRAM lost case, no job will be re-scheduled).

Okay.

- For the cs_wait() IOCTL: after it finds the fence signaled, it should check with "dma_fence_get_status" whether there is an error, and return the fence's error status.

Yes and no, dma_fence_get_status() is some specific handling for sync_file debugging (no idea why that made it into the common fence code). It was replaced by putting the error code directly into the fence, so just reading that one after waiting should be ok. Maybe we should fix dma_fence_get_status() to do the right thing for this?

- For the cs_wait_fences() IOCTL: similar to the approach above.

- For the cs_submit() IOCTL: it needs to check whether the current ctx has been marked as "guilty", and return "ECANCELED" if so.

- Introduce a new IOCTL to let UMD query vram_lost_counter: this way UMD can also block the app from submitting. As @Nicolai mentioned, we can cache one copy of vram_lost_counter when enumerating the physical device, and deny all GL contexts from submitting if the queried counter is bigger than the one cached in the physical device (looks a little overkill to me, but easy to implement). UMD can also return an error to the app on GL context creation if the currently queried vram_lost_counter is bigger than the one cached in the physical device.

Okay. I already have a patch for this; please review that one if you haven't already done so.

Regards,
Christian.

BTW: I realized that a GL context is a little different from the kernel's context, because for the kernel a BO is related not to a context but only to an FD, while in UMD a BO has a backing GL context. So blocking submission in the UMD layer is also needed, although KMD will do its job as the bottom line.

- Basically, "vram_lost_counter" is exposed by the kernel to let UMD take control of the robustness extension feature. It will be UMD's call to make; KMD only denies "guilty" contexts from submitting.

Need your feedback, thanks. We'd better get the TDR feature landed ASAP.

BR Monk
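[Editor's note] The two alternatives being debated here (cancel queued jobs eagerly at reset time vs. lazily when the scheduler tries to run them) differ only in where the guilty/canceled check sits. A minimal sketch of both checks, with hypothetical types standing in for the real amdgpu context and scheduler job:

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins; a model of the two policies, not real code. */
struct ctx { bool guilty; };
struct job { struct ctx *ctx; int fence_error; };

/* Monk's bottom line either way: cs_submit() denies new work from a
 * guilty context up front. */
static int cs_submit(struct ctx *ctx)
{
    if (ctx->guilty)
        return -ECANCELED;
    /* ... otherwise push the job into the entity's KFIFO queue ... */
    return 0;
}

/* Christian's alternative: leave already-queued jobs alone during reset
 * and cancel them lazily when the scheduler tries to run them. Returns
 * true if the job was actually handed to the ring. */
static bool run_job(struct job *job)
{
    if (job->ctx->guilty) {
        job->fence_error = -ECANCELED;  /* cancel instead of running */
        return false;
    }
    return true;
}
```

The eager variant does all the work in the reset routine (one pass over the queues); the lazy variant adds one flag test per job in the scheduler's hot path but keeps the reset routine simpler.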
* Re: TDR and VRAM lost handling in KMD: [not found] ` <BLUPR12MB044911DFCB510022605DD38A844A0-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org> @ 2017-10-11 8:40 ` Haehnle, Nicolai [not found] ` <DM5PR12MB1292D21FC5438AEA8FCF9F64FF4A0-2J9CzHegvk82qrKJuDAMhQdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org> 0 siblings, 1 reply; 23+ messages in thread From: Haehnle, Nicolai @ 2017-10-11 8:40 UTC (permalink / raw) To: Liu, Monk, Koenig, Christian, Olsak, Marek, Deucher, Alexander Cc: Ramirez, Alejandro, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Filipas, Mario, Ding, Pixel, Li, Bingley, Jiang, Jerry (SW) [-- Attachment #1.1: Type: text/plain, Size: 10451 bytes --] >From a Mesa perspective, this almost all sounds reasonable to me. On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), so it's reasonable to use it. However, it does not make sense to mark idle contexts as "guilty" just because VRAM is lost. VRAM lost is a perfect example where the driver should report context lost to applications with the "innocent" flag for contexts that were idle at the time of reset. The only context(s) that should be reported as "guilty" (or perhaps "unknown" in some cases) are the ones that were executing at the time of reset. On whether the whole context is marked as guilty from a user space perspective, it would simply be nice for user space to get consistent answers. It would be a bit odd if we could e.g. succeed in submitting an SDMA job after a GFX job was rejected. This would point in favor of marking the entire context as guilty (although that could happen lazily instead of at reset time). On the other hand, if that's too big a burden for the kernel implementation I'm sure we can live without it. 
Cheers, Nicolai ________________________________ From: Liu, Monk Sent: Wednesday, October 11, 2017 10:15:40 AM To: Koenig, Christian; Haehnle, Nicolai; Olsak, Marek; Deucher, Alexander Cc: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org; Ding, Pixel; Jiang, Jerry (SW); Li, Bingley; Ramirez, Alejandro; Filipas, Mario Subject: RE: TDR and VRAM lost handling in KMD: 1. Set its fence error status to “ETIME”, No, as I already explained ETIME is for synchronous operation. In other words when we return ETIME from the wait IOCTL it would mean that the waiting has somehow timed out, but not the job we waited for. Please use ECANCELED as well or some other error code when we find that we need to distinct the timedout job from the canceled ones (probably a good idea, but I'm not sure). [ML] I’m okay if you insist not to use ETIME 1. Find the entity/ctx behind this job, and set this ctx as “guilty” Not sure. Do we want to set the whole context as guilty or just the entity? Setting the whole contexts as guilty sounds racy to me. BTW: We should use a different name than "guilty", maybe just "bool canceled;" ? [ML] I think context is better than entity, because for example if you only block entity_0 of context and allow entity_N run, that means the dependency between entities are broken (e.g. page table updates in Sdma entity pass but gfx submit in GFX entity blocked, not make sense to me) We’d better either block the whole context or let not… 1. Kick out all jobs in this “guilty” ctx’s KFIFO queue, and set all their fence status to “ECANCELED” Setting ECANCELED should be ok. But I think we should do this when we try to run the jobs and not during GPU reset. 
[ML] without deep thought and expritment, I’m not sure the difference between them, but kick it out in gpu_reset routine is more efficient, Otherwise you need to check context/entity guilty flag in run_job routine … and you need to it for every context/entity, I don’t see why We don’t just kickout all of them in gpu_reset stage …. a) Iterate over all living ctx, and set all ctx as “guilty” since VRAM lost actually ruins all VRAM contents No, that shouldn't be done by comparing the counters. Iterating over all contexts is way to much overhead. [ML] because I want to make KMS IOCTL rules clean, like they don’t need to differentiate VRAM lost or not, they only interested in if the context is guilty or not, and block Submit for guilty ones. Can you give more details of your idea? And better the detail implement in cs_submit, I want to see how you want to block submit without checking context guilty flag a) Kick out all jobs in all ctx’s KFIFO queue, and set all their fence status to “ECANCELDED” Yes and no, that should be done when we try to run the jobs and not during GPU reset. [ML] again, kicking out them in gpu reset routine is high efficient, otherwise you need check on every job in run_job() Besides, can you illustrate the detail implementation ? Yes and no, dma_fence_get_status() is some specific handling for sync_file debugging (no idea why that made it into the common fence code). It was replaced by putting the error code directly into the fence, so just reading that one after waiting should be ok. Maybe we should fix dma_fence_get_status() to do the right thing for this? [ML] yeah, that’s too confusing, the name sound really the one I want to use, we should change it… But look into the implement, I don’t see why we cannot use it ? 
it also finally return the fence->error From: Koenig, Christian Sent: Wednesday, October 11, 2017 3:21 PM To: Liu, Monk <Monk.Liu-5C7GfCeVMHo@public.gmane.org>; Haehnle, Nicolai <Nicolai.Haehnle-5C7GfCeVMHo@public.gmane.org>; Olsak, Marek <Marek.Olsak-5C7GfCeVMHo@public.gmane.org>; Deucher, Alexander <Alexander.Deucher@amd.com> Cc: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org; Ding, Pixel <Pixel.Ding-5C7GfCeVMHo@public.gmane.org>; Jiang, Jerry (SW) <Jerry.Jiang-5C7GfCeVMHo@public.gmane.org>; Li, Bingley <Bingley.Li-5C7GfCeVMHo@public.gmane.org>; Ramirez, Alejandro <Alejandro.Ramirez-5C7GfCeVMHo@public.gmane.org>; Filipas, Mario <Mario.Filipas@amd.com> Subject: Re: TDR and VRAM lost handling in KMD: See inline: Am 11.10.2017 um 07:33 schrieb Liu, Monk: Hi Christian & Nicolai, We need to achieve some agreements on what should MESA/UMD do and what should KMD do, please give your comments with “okay” or “No” and your idea on below items, l When a job timed out (set from lockup_timeout kernel parameter), What KMD should do in TDR routine : 1. Update adev->gpu_reset_counter, and stop scheduler first, (gpu_reset_counter is used to force vm flush after GPU reset, out of this thread’s scope so no more discussion on it) Okay. 2. Set its fence error status to “ETIME”, No, as I already explained ETIME is for synchronous operation. In other words when we return ETIME from the wait IOCTL it would mean that the waiting has somehow timed out, but not the job we waited for. Please use ECANCELED as well or some other error code when we find that we need to distinct the timedout job from the canceled ones (probably a good idea, but I'm not sure). 3. Find the entity/ctx behind this job, and set this ctx as “guilty” Not sure. Do we want to set the whole context as guilty or just the entity? Setting the whole contexts as guilty sounds racy to me. BTW: We should use a different name than "guilty", maybe just "bool canceled;" ? 4. 
Kick out this job from scheduler’s mirror list, so this job won’t get re-scheduled to ring anymore. Okay. 5. Kick out all jobs in this “guilty” ctx’s KFIFO queue, and set all their fence status to “ECANCELED” Setting ECANCELED should be ok. But I think we should do this when we try to run the jobs and not during GPU reset. 6. Force signal all fences that get kicked out by above two steps, otherwise UMD will block forever if waiting on those fences Okay. 7. Do gpu reset, which is can be some callbacks to let bare-metal and SR-IOV implement with their favor style Okay. 8. After reset, KMD need to aware if the VRAM lost happens or not, bare-metal can implement some function to judge, while for SR-IOV I prefer to read it from GIM side (for initial version we consider it’s always VRAM lost, till GIM side change aligned) Okay. 9. If VRAM lost not hit, continue, otherwise: a) Update adev->vram_lost_counter, Okay. b) Iterate over all living ctx, and set all ctx as “guilty” since VRAM lost actually ruins all VRAM contents No, that shouldn't be done by comparing the counters. Iterating over all contexts is way to much overhead. c) Kick out all jobs in all ctx’s KFIFO queue, and set all their fence status to “ECANCELDED” Yes and no, that should be done when we try to run the jobs and not during GPU reset. 10. Do GTT recovery and VRAM page tables/entries recovery (optional, do we need it ???) Yes, that is still needed. As Nicolai explained we can't be sure that VRAM is still 100% correct even when it isn't cleared. 11. Re-schedule all JOBs remains in mirror list to ring again and restart scheduler (for VRAM lost case, no JOB will re-scheduled) Okay. l For cs_wait() IOCTL: After it found fence signaled, it should check with “dma_fence_get_status” to see if there is error there, And return the error status of fence Yes and no, dma_fence_get_status() is some specific handling for sync_file debugging (no idea why that made it into the common fence code). 
It was replaced by putting the error code directly into the fence, so just reading that one after waiting should be ok. Maybe we should fix dma_fence_get_status() to do the right thing for this?

- For cs_wait_fences() IOCTL:

Similar to the above approach.

- For cs_submit() IOCTL:

It needs to check if the current ctx has been marked as "guilty" and return "ECANCELED" if so.

- Introduce a new IOCTL to let UMD query vram_lost_counter:

This way, UMD can also block the app from submitting. Like @Nicolai mentioned, we can cache one copy of vram_lost_counter when enumerating the physical device, and deny all gl-contexts from submitting if the queried counter is bigger than the one cached in the physical device. (Looks a little overkill to me, but easy to implement.)

UMD can also return an error to the app when creating a gl-context if the currently queried vram_lost_counter is bigger than the one cached in the physical device.

Okay. Already have a patch for this, please review that one if you haven't already done so.

Regards,
Christian.

BTW: I realized that a gl-context is a little different from the kernel's context, because for the kernel a BO is not related to a context but only to an FD, while in UMD a BO has a backing gl-context. So blocking submission in the UMD layer is also needed, although KMD will do its job as the bottom line.

- Basically "vram_lost_counter" is exposed by the kernel to let UMD take control of the robustness extension feature; it will be UMD's call to make, and KMD only denies "guilty" contexts from submitting.

Need your feedback, thx. We'd better make the TDR feature land ASAP.

BR Monk

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
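Christian's point about the wait IOCTL, that the error now lives directly in the fence and should simply be read back after the wait completes, can be sketched as below. This is a hedged, simplified model, not the real amdgpu code: the struct and function names are stand-ins for the kernel's struct dma_fence and the cs_wait IOCTL path.

```c
/*
 * Hedged sketch (not the actual amdgpu code): how a wait IOCTL can
 * propagate a per-job error after the fence signals. The fields are
 * simplified stand-ins for struct dma_fence, where the driver stores
 * -ECANCELED into the fence when it force-signals a kicked-out job.
 */
#include <errno.h>

struct sketch_fence {
    int signaled;   /* set when the job completes or is force-signaled */
    int error;      /* 0, or a negative errno set by the driver */
};

/* After the wait succeeds, return the fence's own error, not plain 0. */
static int sketch_wait_ioctl(const struct sketch_fence *f)
{
    if (!f->signaled)
        return -EBUSY;   /* stand-in for the real timeout/interrupt paths */
    return f->error;     /* 0 on success, -ECANCELED for canceled jobs */
}
```

This is why force-signaling kicked-out fences (step 6 above) matters: the waiter wakes up normally and then learns the job's fate from the fence error rather than from the wait return code.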
* RE: TDR and VRAM lost handling in KMD: @ 2017-10-11 8:48 Liu, Monk

From: Liu, Monk @ 2017-10-11 8:48 UTC (permalink / raw)
To: Haehnle, Nicolai, Koenig, Christian, Olsak, Marek, Deucher, Alexander
Cc: Ramirez, Alejandro, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Filipas, Mario, Ding, Pixel, Li, Bingley, Jiang, Jerry (SW)

On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), so it's reasonable to use it. However, it does not make sense to mark idle contexts as "guilty" just because VRAM is lost. VRAM lost is a perfect example where the driver should report context lost to applications with the "innocent" flag for contexts that were idle at the time of reset. The only context(s) that should be reported as "guilty" (or perhaps "unknown" in some cases) are the ones that were executing at the time of reset.

ML: KMD marks all contexts as guilty because that way we can unify our IOCTL behavior: e.g. the IOCTLs only block a "guilty" context and don't need to worry about vram_lost_counter anymore; that's an implementation style. I don't think it is related to the UMD layer. For UMD, the gl-context isn't known to KMD, so UMD can implement its own "guilty" gl-context if it wants.

If KMD doesn't mark all ctx as guilty after VRAM lost, can you illustrate what rule KMD should obey for the check in a KMS IOCTL like cs_submit?
Let's see which way is better.

From: Haehnle, Nicolai
Sent: Wednesday, October 11, 2017 4:41 PM
To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>
Cc: amd-gfx@lists.freedesktop.org; Ding, Pixel <Pixel.Ding@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, Bingley <Bingley.Li@amd.com>; Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; Filipas, Mario <Mario.Filipas@amd.com>
Subject: Re: TDR and VRAM lost handling in KMD:

From a Mesa perspective, this almost all sounds reasonable to me.

On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), so it's reasonable to use it. However, it does not make sense to mark idle contexts as "guilty" just because VRAM is lost. VRAM lost is a perfect example where the driver should report context lost to applications with the "innocent" flag for contexts that were idle at the time of reset. The only context(s) that should be reported as "guilty" (or perhaps "unknown" in some cases) are the ones that were executing at the time of reset.

On whether the whole context is marked as guilty from a user space perspective, it would simply be nice for user space to get consistent answers. It would be a bit odd if we could e.g. succeed in submitting an SDMA job after a GFX job was rejected. This would point in favor of marking the entire context as guilty (although that could happen lazily instead of at reset time). On the other hand, if that's too big a burden for the kernel implementation I'm sure we can live without it.

Cheers,
Nicolai

________________________________

From: Liu, Monk
Sent: Wednesday, October 11, 2017 10:15:40 AM
To: Koenig, Christian; Haehnle, Nicolai; Olsak, Marek; Deucher, Alexander
Cc: amd-gfx@lists.freedesktop.org; Ding, Pixel; Jiang, Jerry (SW); Li, Bingley; Ramirez, Alejandro; Filipas, Mario
Subject: RE: TDR and VRAM lost handling in KMD:

1.
Set its fence error status to "ETIME".

No, as I already explained: ETIME is for synchronous operations. In other words, when we return ETIME from the wait IOCTL it would mean that the waiting itself has somehow timed out, not the job we waited for. Please use ECANCELED as well, or some other error code if we find that we need to distinguish the timed-out job from the canceled ones (probably a good idea, but I'm not sure).

[ML] I'm okay with it if you insist on not using ETIME.

1. Find the entity/ctx behind this job, and set this ctx as "guilty".

Not sure. Do we want to set the whole context as guilty or just the entity? Setting the whole context as guilty sounds racy to me. BTW: We should use a different name than "guilty", maybe just "bool canceled;"?

[ML] I think the context is better than the entity, because for example if you only block entity_0 of a context and allow entity_N to run, that means the dependencies between entities are broken (e.g. page table updates in the SDMA entity pass but the submit in the GFX entity is blocked, which doesn't make sense to me). We'd better either block the whole context or not at all.

1. Kick out all jobs in this "guilty" ctx's KFIFO queue, and set all their fence statuses to "ECANCELED".

Setting ECANCELED should be ok. But I think we should do this when we try to run the jobs and not during GPU reset.

[ML] Without deeper thought and experiment, I'm not sure of the difference between them, but kicking them out in the gpu_reset routine is more efficient. Otherwise you need to check the context/entity guilty flag in the run_job routine, and you need to do it for every context/entity; I don't see why we don't just kick all of them out in the gpu_reset stage.

a) Iterate over all living ctx, and set all ctx as "guilty" since VRAM lost actually ruins all VRAM contents.

No, that shouldn't be done by comparing the counters. Iterating over all contexts is way too much overhead.
[ML] Because I want to keep the KMS IOCTL rules clean: they don't need to differentiate whether VRAM was lost or not, they are only interested in whether the context is guilty or not, and they block submission for guilty ones.

Can you give more details of your idea? Better yet, the detailed implementation in cs_submit: I want to see how you want to block submission without checking the context guilty flag.

a) Kick out all jobs in all ctx's KFIFO queues, and set all their fence statuses to "ECANCELED".

Yes and no, that should be done when we try to run the jobs and not during GPU reset.

[ML] Again, kicking them out in the gpu_reset routine is highly efficient; otherwise you need to check every job in run_job(). Besides, can you illustrate the detailed implementation?

Yes and no, dma_fence_get_status() is some specific handling for sync_file debugging (no idea why that made it into the common fence code). It was replaced by putting the error code directly into the fence, so just reading that one after waiting should be ok. Maybe we should fix dma_fence_get_status() to do the right thing for this?

[ML] Yeah, that's too confusing; the name sounds like exactly the one I want to use, so we should change it. But looking at the implementation, I don't see why we cannot use it?
It also finally returns the fence->error.

From: Koenig, Christian
Sent: Wednesday, October 11, 2017 3:21 PM
To: Liu, Monk <Monk.Liu@amd.com>; Haehnle, Nicolai <Nicolai.Haehnle@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>
Cc: amd-gfx@lists.freedesktop.org; Ding, Pixel <Pixel.Ding@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, Bingley <Bingley.Li@amd.com>; Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; Filipas, Mario <Mario.Filipas@amd.com>
Subject: Re: TDR and VRAM lost handling in KMD:

See inline:

Am 11.10.2017 um 07:33 schrieb Liu, Monk:

Hi Christian & Nicolai,

We need to reach agreement on what MESA/UMD should do and what KMD should do. Please reply with "okay" or "no" and your ideas on the items below:

- When a job times out (set from the lockup_timeout kernel parameter), what KMD should do in the TDR routine:

1. Update adev->gpu_reset_counter, and stop the scheduler first. (gpu_reset_counter is used to force a VM flush after GPU reset; out of this thread's scope, so no more discussion on it.)

Okay.

2. Set its fence error status to "ETIME".

No, as I already explained: ETIME is for synchronous operations. In other words, when we return ETIME from the wait IOCTL it would mean that the waiting itself has somehow timed out, not the job we waited for. Please use ECANCELED as well, or some other error code if we find that we need to distinguish the timed-out job from the canceled ones (probably a good idea, but I'm not sure).

3. Find the entity/ctx behind this job, and set this ctx as "guilty".

Not sure. Do we want to set the whole context as guilty or just the entity? Setting the whole context as guilty sounds racy to me. BTW: We should use a different name than "guilty", maybe just "bool canceled;"?

4. Kick out this job from the scheduler's mirror list, so this job won't get re-scheduled to the ring anymore.

Okay.

5. Kick out all jobs in this "guilty" ctx's KFIFO queue, and set all their fence statuses to "ECANCELED".

Setting ECANCELED should be ok. But I think we should do this when we try to run the jobs and not during GPU reset.

6. Force-signal all fences that get kicked out by the above two steps; otherwise UMD will block forever if waiting on those fences.

Okay.

7. Do the GPU reset, which can be some callbacks to let bare-metal and SR-IOV implement it in their own style.

Okay.

8. After reset, KMD needs to be aware of whether VRAM lost happened or not; bare-metal can implement some function to judge, while for SR-IOV I prefer to read it from the GIM side (for the initial version we consider it's always VRAM lost, until the GIM-side change is aligned).

Okay.

9. If VRAM lost is not hit, continue; otherwise:

a) Update adev->vram_lost_counter.

Okay.

b) Iterate over all living ctx, and set all ctx as "guilty" since VRAM lost actually ruins all VRAM contents.

No, that shouldn't be done by comparing the counters. Iterating over all contexts is way too much overhead.

c) Kick out all jobs in all ctx's KFIFO queues, and set all their fence statuses to "ECANCELED".

Yes and no, that should be done when we try to run the jobs and not during GPU reset.

10. Do GTT recovery and VRAM page tables/entries recovery (optional, do we need it ???)

Yes, that is still needed. As Nicolai explained, we can't be sure that VRAM is still 100% correct even when it isn't cleared.

11. Re-schedule all jobs remaining in the mirror list to the ring again and restart the scheduler (for the VRAM lost case, no job will be re-scheduled).

Okay.

- For cs_wait() IOCTL:

After it finds the fence signaled, it should check with "dma_fence_get_status" to see if there is an error there, and return the error status of the fence.

Yes and no, dma_fence_get_status() is some specific handling for sync_file debugging (no idea why that made it into the common fence code). It was replaced by putting the error code directly into the fence, so just reading that one after waiting should be ok. Maybe we should fix dma_fence_get_status() to do the right thing for this?

- For cs_wait_fences() IOCTL:

Similar to the above approach.

- For cs_submit() IOCTL:

It needs to check if the current ctx has been marked as "guilty" and return "ECANCELED" if so.

- Introduce a new IOCTL to let UMD query vram_lost_counter:

This way, UMD can also block the app from submitting. Like @Nicolai mentioned, we can cache one copy of vram_lost_counter when enumerating the physical device, and deny all gl-contexts from submitting if the queried counter is bigger than the one cached in the physical device. (Looks a little overkill to me, but easy to implement.)

UMD can also return an error to the app when creating a gl-context if the currently queried vram_lost_counter is bigger than the one cached in the physical device.

Okay. Already have a patch for this, please review that one if you haven't already done so.

Regards,
Christian.

BTW: I realized that a gl-context is a little different from the kernel's context, because for the kernel a BO is not related to a context but only to an FD, while in UMD a BO has a backing gl-context. So blocking submission in the UMD layer is also needed, although KMD will do its job as the bottom line.

- Basically "vram_lost_counter" is exposed by the kernel to let UMD take control of the robustness extension feature; it will be UMD's call to make, and KMD only denies "guilty" contexts from submitting.

Need your feedback, thx. We'd better make the TDR feature land ASAP.

BR Monk
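Christian's repeated alternative in the exchange above, rejecting canceled jobs when the scheduler is about to run them rather than purging queues during reset, might look roughly like this. All types here are simplified stand-ins for the real drm_sched structures, and the "canceled" flag name follows his naming suggestion; this is a sketch of the idea, not the actual scheduler code.

```c
/*
 * Hedged sketch of the "check at run time" approach: the scheduler
 * tests a per-entity "canceled" flag when it pops each job, and
 * completes the job's fence with -ECANCELED instead of sending it to
 * the ring. Simplified stand-ins, not the real drm_sched types.
 */
#include <errno.h>

struct sketch_entity {
    int canceled;          /* set once during reset for the bad entity */
};

struct sketch_job {
    struct sketch_entity *entity;
    int fence_error;       /* error the job's hardware fence will carry */
    int ran;               /* 1 if the job was actually sent to the ring */
};

/* Called by the scheduler for each job popped from the entity's queue. */
static void sketch_run_job(struct sketch_job *job)
{
    if (job->entity->canceled) {
        /* Skip the hardware submission; just signal the fence with an error. */
        job->fence_error = -ECANCELED;
        job->ran = 0;
        return;
    }
    job->ran = 1;
    job->fence_error = 0;
}
```

The trade-off debated above is visible here: the reset path only sets one flag per entity, and the per-job cost moves into run_job(), which is exactly what Monk objects to on efficiency grounds.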
* Re: TDR and VRAM lost handling in KMD: [not found] ` <BLUPR12MB0449287A92DF8D3EB30BE6A6844A0-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org> @ 2017-10-11 8:59 ` Nicolai Hähnle [not found] ` <28d64011-fd90-07fb-d95d-48286ecbdcc5-5C7GfCeVMHo@public.gmane.org> 2017-10-11 9:02 ` Christian König 1 sibling, 1 reply; 23+ messages in thread From: Nicolai Hähnle @ 2017-10-11 8:59 UTC (permalink / raw) To: Liu, Monk, Koenig, Christian, Olsak, Marek, Deucher, Alexander Cc: Ramirez, Alejandro, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Filipas, Mario, Ding, Pixel, Li, Bingley, Jiang, Jerry (SW) On 11.10.2017 10:48, Liu, Monk wrote: > On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), so > it's reasonable to use it. However, it /does not/ make sense to mark > idle contexts as "guilty" just because VRAM is lost. VRAM lost is a > perfect example where the driver should report context lost to > applications with the "innocent" flag for contexts that were idle at the > time of reset. The only context(s) that should be reported as "guilty" > (or perhaps "unknown" in some cases) are the ones that were executing at > the time of reset. > > ML: KMD mark all contexts as guilty is because that way we can unify our > IOCTL behavior: e.g. for IOCTL only block “guilty”context , no need to > worry about vram-lost-counter anymore, that’s a implementation style. I > don’t think it is related with UMD layer, > > For UMD the gl-context isn’t aware of by KMD, so UMD can implement it > own “guilty” gl-context if you want. Well, to some extent this is just semantics, but it helps to keep the terminology consistent. Most importantly, please keep the AMDGPU_CTX_OP_QUERY_STATE uapi in mind: this returns one of AMDGPU_CTX_{GUILTY,INNOCENT,UNKNOWN}_RECENT, and it must return "innocent" for contexts that are only lost due to VRAM lost without being otherwise involved in the timeout that lead to the reset. 
The point is that in the places where you used "guilty" it would be better to use "context lost", and then further differentiate between guilty/innocent context lost based on the details of what happened. > If KMD doesn’t mark all ctx as guilty after VRAM lost, can you > illustrate what rule KMD should obey to check in KMS IOCTL like > cs_sumbit ?? let’s see which way better if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter)) return -ECANCELED; Plus similar logic for AMDGPU_CTX_OP_QUERY_STATE. Yes, it's one additional check in cs_submit. If you're worried about that (and Christian's concerns about possible issues with walking over all contexts are addressed), I suppose you could just store a per-context unsigned context_reset_status; instead of a `bool guilty`. Its value would start out as 0 (AMDGPU_CTX_NO_RESET) and would be set to the correct value during reset. Cheers, Nicolai > > *From:*Haehnle, Nicolai > *Sent:* Wednesday, October 11, 2017 4:41 PM > *To:* Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian > <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; Deucher, > Alexander <Alexander.Deucher@amd.com> > *Cc:* amd-gfx@lists.freedesktop.org; Ding, Pixel <Pixel.Ding@amd.com>; > Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, Bingley > <Bingley.Li@amd.com>; Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; > Filipas, Mario <Mario.Filipas@amd.com> > *Subject:* Re: TDR and VRAM lost handling in KMD: > > From a Mesa perspective, this almost all sounds reasonable to me. > > On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), so > it's reasonable to use it. However, it /does not/ make sense to mark > idle contexts as "guilty" just because VRAM is lost. VRAM lost is a > perfect example where the driver should report context lost to > applications with the "innocent" flag for contexts that were idle at the > time of reset. 
The only context(s) that should be reported as "guilty" > (or perhaps "unknown" in some cases) are the ones that were executing at > the time of reset. > > On whether the whole context is marked as guilty from a user space > perspective, it would simply be nice for user space to get consistent > answers. It would be a bit odd if we could e.g. succeed in submitting an > SDMA job after a GFX job was rejected. This would point in favor of > marking the entire context as guilty (although that could happen lazily > instead of at reset time). On the other hand, if that's too big a burden > for the kernel implementation I'm sure we can live without it. > > Cheers, > > Nicolai > > ------------------------------------------------------------------------ > > *From:*Liu, Monk > *Sent:* Wednesday, October 11, 2017 10:15:40 AM > *To:* Koenig, Christian; Haehnle, Nicolai; Olsak, Marek; Deucher, Alexander > *Cc:* amd-gfx@lists.freedesktop.org > <mailto:amd-gfx@lists.freedesktop.org>; Ding, Pixel; Jiang, Jerry (SW); > Li, Bingley; Ramirez, Alejandro; Filipas, Mario > *Subject:* RE: TDR and VRAM lost handling in KMD: > > 1.Set its fence error status to “*ETIME*”, > > No, as I already explained ETIME is for synchronous operation. > > In other words when we return ETIME from the wait IOCTL it would mean > that the waiting has somehow timed out, but not the job we waited for. > > Please use ECANCELED as well or some other error code when we find that > we need to distinct the timedout job from the canceled ones (probably a > good idea, but I'm not sure). > > [ML] I’m okay if you insist not to use ETIME > > 1.Find the entity/ctx behind this job, and set this ctx as “*guilty*” > > Not sure. Do we want to set the whole context as guilty or just the entity? > > Setting the whole contexts as guilty sounds racy to me. > > BTW: We should use a different name than "guilty", maybe just "bool > canceled;" ? 
> > [ML] I think context is better than entity, because for example if you > only block entity_0 of context and allow entity_N run, that means the > dependency between entities are broken (e.g. page table updates in > > Sdma entity pass but gfx submit in GFX entity blocked, not make sense to me) > > We’d better either block the whole context or let not… > > 1.Kick out all jobs in this “guilty” ctx’s KFIFO queue, and set all > their fence status to “*ECANCELED*” > > Setting ECANCELED should be ok. But I think we should do this when we > try to run the jobs and not during GPU reset. > > [ML] without deep thought and expritment, I’m not sure the difference > between them, but kick it out in gpu_reset routine is more efficient, > > Otherwise you need to check context/entity guilty flag in run_job > routine …and you need to it for every context/entity, I don’t see why > > We don’t just kickout all of them in gpu_reset stage …. > > a)Iterate over all living ctx, and set all ctx as “*guilty*” since VRAM > lost actually ruins all VRAM contents > > No, that shouldn't be done by comparing the counters. Iterating over all > contexts is way to much overhead. > > [ML] because I want to make KMS IOCTL rules clean, like they don’t need > to differentiate VRAM lost or not, they only interested in if the > context is guilty or not, and block > > Submit for guilty ones. > > *Can you give more details of your idea? And better the detail implement > in cs_submit, I want to see how you want to block submit without > checking context guilty flag* > > a)Kick out all jobs in all ctx’s KFIFO queue, and set all their fence > status to “*ECANCELDED*” > > Yes and no, that should be done when we try to run the jobs and not > during GPU reset. > > [ML] again, kicking out them in gpu reset routine is high efficient, > otherwise you need check on every job in run_job() > > Besides, can you illustrate the detail implementation ? 
> > Yes and no, dma_fence_get_status() is some specific handling for > sync_file debugging (no idea why that made it into the common fence code). > > It was replaced by putting the error code directly into the fence, so > just reading that one after waiting should be ok. > > Maybe we should fix dma_fence_get_status() to do the right thing for this? > > [ML] yeah, that’s too confusing, the name sound really the one I want to > use, we should change it… > > *But look into the implement, I don**’t see why we cannot use it ? it > also finally return the fence->error * > > *From:*Koenig, Christian > *Sent:* Wednesday, October 11, 2017 3:21 PM > *To:* Liu, Monk <Monk.Liu@amd.com <mailto:Monk.Liu@amd.com>>; Haehnle, > Nicolai <Nicolai.Haehnle@amd.com <mailto:Nicolai.Haehnle@amd.com>>; > Olsak, Marek <Marek.Olsak@amd.com <mailto:Marek.Olsak@amd.com>>; > Deucher, Alexander <Alexander.Deucher@amd.com > <mailto:Alexander.Deucher@amd.com>> > *Cc:* amd-gfx@lists.freedesktop.org > <mailto:amd-gfx@lists.freedesktop.org>; Ding, Pixel <Pixel.Ding@amd.com > <mailto:Pixel.Ding@amd.com>>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com > <mailto:Jerry.Jiang@amd.com>>; Li, Bingley <Bingley.Li@amd.com > <mailto:Bingley.Li@amd.com>>; Ramirez, Alejandro > <Alejandro.Ramirez@amd.com <mailto:Alejandro.Ramirez@amd.com>>; Filipas, > Mario <Mario.Filipas@amd.com <mailto:Mario.Filipas@amd.com>> > *Subject:* Re: TDR and VRAM lost handling in KMD: > > See inline: > > Am 11.10.2017 um 07:33 schrieb Liu, Monk: > > Hi Christian & Nicolai, > > We need to achieve some agreements on what should MESA/UMD do and > what should KMD do, *please give your comments with “okay” or “No” > and your idea on below items,* > > ?When a job timed out (set from lockup_timeout kernel parameter), > What KMD should do in TDR routine : > > 1.Update adev->*gpu_reset_counter*, and stop scheduler first, > (*gpu_reset_counter* is used to force vm flush after GPU reset, out > of this thread’s scope so no more discussion on it) > > 
Okay. > > 2.Set its fence error status to “*ETIME*”, > > No, as I already explained ETIME is for synchronous operation. > > In other words when we return ETIME from the wait IOCTL it would mean > that the waiting has somehow timed out, but not the job we waited for. > > Please use ECANCELED as well or some other error code when we find that > we need to distinct the timedout job from the canceled ones (probably a > good idea, but I'm not sure). > > 3.Find the entity/ctx behind this job, and set this ctx as “*guilty*” > > Not sure. Do we want to set the whole context as guilty or just the entity? > > Setting the whole contexts as guilty sounds racy to me. > > BTW: We should use a different name than "guilty", maybe just "bool > canceled;" ? > > 4.Kick out this job from scheduler’s mirror list, so this job won’t > get re-scheduled to ring anymore. > > Okay. > > 5.Kick out all jobs in this “guilty” ctx’s KFIFO queue, and set all > their fence status to “*ECANCELED*” > > Setting ECANCELED should be ok. But I think we should do this when we > try to run the jobs and not during GPU reset. > > 6.Force signal all fences that get kicked out by above two > steps,*otherwise UMD will block forever if waiting on those fences* > > Okay. > > 7.Do gpu reset, which is can be some callbacks to let bare-metal and > SR-IOV implement with their favor style > > Okay. > > 8.After reset, KMD need to aware if the VRAM lost happens or not, > bare-metal can implement some function to judge, while for SR-IOV I > prefer to read it from GIM side (for initial version we consider > it’s always VRAM lost, till GIM side change aligned) > > Okay. > > 9.If VRAM lost not hit, continue, otherwise: > > a)Update adev->*vram_lost_counter*, > > Okay. > > b)Iterate over all living ctx, and set all ctx as “*guilty*” since > VRAM lost actually ruins all VRAM contents > > No, that shouldn't be done by comparing the counters. Iterating over all > contexts is way to much overhead. 
> > c)Kick out all jobs in all ctx’s KFIFO queue, and set all their > fence status to “*ECANCELDED*” > > Yes and no, that should be done when we try to run the jobs and not > during GPU reset. > > 10.Do GTT recovery and VRAM page tables/entries recovery (optional, > do we need it ???) > > Yes, that is still needed. As Nicolai explained we can't be sure that > VRAM is still 100% correct even when it isn't cleared. > > 11.Re-schedule all JOBs remains in mirror list to ring again and > restart scheduler (for VRAM lost case, no JOB will re-scheduled) > > Okay. > > ?For cs_wait() IOCTL: > > After it found fence signaled, it should check with > *“dma_fence_get_status” *to see if there is error there, > > And return the error status of fence > > Yes and no, dma_fence_get_status() is some specific handling for > sync_file debugging (no idea why that made it into the common fence code). > > It was replaced by putting the error code directly into the fence, so > just reading that one after waiting should be ok. > > Maybe we should fix dma_fence_get_status() to do the right thing for this? > > ?For cs_wait_fences() IOCTL: > > Similar with above approach > > ?For cs_submit() IOCTL: > > It need to check if current ctx been marked as “*guilty*” and return > “*ECANCELED*” if so > > ?Introduce a new IOCTL to let UMD query *vram_lost_counter*: > > This way, UMD can also block app from submitting, like @Nicolai > mentioned, we can cache one copy of *vram_lost_counter* when > enumerate physical device, and deny all > > gl-context from submitting if the counter queried bigger than that > one cached in physical device. (looks a little overkill to me, but > easy to implement ) > > UMD can also return error to APP when creating gl-context if found > current queried*vram_lost_counter *bigger than that one cached in > physical device. > > Okay. Already have a patch for this, please review that one if you > haven't already done so. > > Regards, > Christian. 
> > BTW: I realized that gl-context is a little different with kernel’s > context. Because for kernel. BO is not related with context but only > with FD, while in UMD, BO have a backend > > gl-context, so block submitting in UMD layer is also needed although > KMD will do its job as bottom line > > ?Basically “vram_lost_counter” is exposure by kernel to let UMD take > the control of robust extension feature, it will be UMD’s call to > move, KMD only deny “guilty” context from submitting > > Need your feedback, thx > > We’d better make TDR feature landed ASAP > > BR Monk >
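Nicolai's proposed cs_submit check from his reply above can be fleshed out as below. This is a sketch under assumptions: plain ints stand in for the kernel's atomic_t, and the field names (a per-context guilty flag plus a vram_lost_counter snapshot taken at context creation) are illustrative rather than the final uapi.

```c
/*
 * Hedged sketch combining the two rejection rules discussed in the
 * thread: deny submission if the context was marked lost during a
 * reset, or if VRAM was lost since the context was created. The
 * structs are simplified stand-ins for amdgpu_ctx / amdgpu_device.
 */
#include <errno.h>

struct sketch_device {
    int vram_lost_counter;  /* bumped by the reset path on VRAM lost;
                             * atomic_t in the real driver */
};

struct sketch_ctx {
    int vram_lost_counter;  /* snapshot taken at context creation */
    int guilty;             /* set when this ctx caused a timeout */
};

static int sketch_cs_submit_check(const struct sketch_ctx *ctx,
                                  const struct sketch_device *adev)
{
    if (ctx->guilty)
        return -ECANCELED;                  /* this ctx caused the hang */
    if (ctx->vram_lost_counter != adev->vram_lost_counter)
        return -ECANCELED;                  /* VRAM lost since creation */
    return 0;
}
```

The counter comparison is the one extra check per submission that Nicolai argues is cheap, and it avoids iterating over all contexts during reset, which is Christian's overhead concern.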
* RE: TDR and VRAM lost handling in KMD: @ 2017-10-11 9:18 Liu, Monk

From: Liu, Monk @ 2017-10-11 9:18 UTC (permalink / raw)
To: Haehnle, Nicolai, Koenig, Christian, Olsak, Marek, Deucher, Alexander
Cc: Ramirez, Alejandro, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Filipas, Mario, Ding, Pixel, Li, Bingley, Jiang, Jerry (SW)

Let's keep it simple. When VRAM lost hits, what is the action for amdgpu_ctx_query()/AMDGPU_CTX_OP_QUERY_STATE on the other contexts (not the one that triggered the GPU hang)? Do you mean we return -ENODEV to UMD?

In cs_submit, with VRAM lost hit, if we don't mark all contexts as "guilty", how do we block them from submitting? Can you show some way to implement it?

BTW: the "guilty" here is a new member I want to add to the context; it is not related to the AMDGPU_CTX_OP_QUERY_STATE UK interface. Looks like I need to unify them so there is only one place to mark guilty or not.

BR Monk

-----Original Message-----
From: Haehnle, Nicolai
Sent: Wednesday, October 11, 2017 5:00 PM
To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>
Cc: amd-gfx@lists.freedesktop.org; Ding, Pixel <Pixel.Ding@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, Bingley <Bingley.Li@amd.com>; Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; Filipas, Mario <Mario.Filipas@amd.com>
Subject: Re: TDR and VRAM lost handling in KMD:

On 11.10.2017 10:48, Liu, Monk wrote:
> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), so
> it's reasonable to use it. However, it /does not/ make sense to mark
> idle contexts as "guilty" just because VRAM is lost.
VRAM lost is a > perfect example where the driver should report context lost to > applications with the "innocent" flag for contexts that were idle at > the time of reset. The only context(s) that should be reported as "guilty" > (or perhaps "unknown" in some cases) are the ones that were executing > at the time of reset. > > ML: KMD mark all contexts as guilty is because that way we can unify > our IOCTL behavior: e.g. for IOCTL only block “guilty”context , no > need to worry about vram-lost-counter anymore, that’s a implementation > style. I don’t think it is related with UMD layer, > > For UMD the gl-context isn’t aware of by KMD, so UMD can implement it > own “guilty” gl-context if you want. Well, to some extent this is just semantics, but it helps to keep the terminology consistent. Most importantly, please keep the AMDGPU_CTX_OP_QUERY_STATE uapi in mind: this returns one of AMDGPU_CTX_{GUILTY,INNOCENT,UNKNOWN}_RECENT, and it must return "innocent" for contexts that are only lost due to VRAM lost without being otherwise involved in the timeout that lead to the reset. The point is that in the places where you used "guilty" it would be better to use "context lost", and then further differentiate between guilty/innocent context lost based on the details of what happened. > If KMD doesn’t mark all ctx as guilty after VRAM lost, can you > illustrate what rule KMD should obey to check in KMS IOCTL like > cs_sumbit ?? let’s see which way better if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter)) return -ECANCELED; Plus similar logic for AMDGPU_CTX_OP_QUERY_STATE. Yes, it's one additional check in cs_submit. If you're worried about that (and Christian's concerns about possible issues with walking over all contexts are addressed), I suppose you could just store a per-context unsigned context_reset_status; instead of a `bool guilty`. Its value would start out as 0 (AMDGPU_CTX_NO_RESET) and would be set to the correct value during reset. 
Cheers, Nicolai > > *From:*Haehnle, Nicolai > *Sent:* Wednesday, October 11, 2017 4:41 PM > *To:* Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian > <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; > Deucher, Alexander <Alexander.Deucher@amd.com> > *Cc:* amd-gfx@lists.freedesktop.org; Ding, Pixel <Pixel.Ding@amd.com>; > Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, Bingley > <Bingley.Li@amd.com>; Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; > Filipas, Mario <Mario.Filipas@amd.com> > *Subject:* Re: TDR and VRAM lost handling in KMD: > > From a Mesa perspective, this almost all sounds reasonable to me. > > On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), so > it's reasonable to use it. However, it /does not/ make sense to mark > idle contexts as "guilty" just because VRAM is lost. VRAM lost is a > perfect example where the driver should report context lost to > applications with the "innocent" flag for contexts that were idle at > the time of reset. The only context(s) that should be reported as "guilty" > (or perhaps "unknown" in some cases) are the ones that were executing > at the time of reset. > > On whether the whole context is marked as guilty from a user space > perspective, it would simply be nice for user space to get consistent > answers. It would be a bit odd if we could e.g. succeed in submitting > an SDMA job after a GFX job was rejected. This would point in favor of > marking the entire context as guilty (although that could happen > lazily instead of at reset time). On the other hand, if that's too big > a burden for the kernel implementation I'm sure we can live without it. 
> > Cheers, > > Nicolai > > ---------------------------------------------------------------------- > -- > > *From:*Liu, Monk > *Sent:* Wednesday, October 11, 2017 10:15:40 AM > *To:* Koenig, Christian; Haehnle, Nicolai; Olsak, Marek; Deucher, > Alexander > *Cc:* amd-gfx@lists.freedesktop.org > <mailto:amd-gfx@lists.freedesktop.org>; Ding, Pixel; Jiang, Jerry > (SW); Li, Bingley; Ramirez, Alejandro; Filipas, Mario > *Subject:* RE: TDR and VRAM lost handling in KMD: > > 1.Set its fence error status to “*ETIME*”, > > No, as I already explained ETIME is for synchronous operation. > > In other words when we return ETIME from the wait IOCTL it would mean > that the waiting has somehow timed out, but not the job we waited for. > > Please use ECANCELED as well or some other error code when we find > that we need to distinct the timedout job from the canceled ones > (probably a good idea, but I'm not sure). > > [ML] I’m okay if you insist not to use ETIME > > 1.Find the entity/ctx behind this job, and set this ctx as “*guilty*” > > Not sure. Do we want to set the whole context as guilty or just the entity? > > Setting the whole contexts as guilty sounds racy to me. > > BTW: We should use a different name than "guilty", maybe just "bool > canceled;" ? > > [ML] I think context is better than entity, because for example if you > only block entity_0 of context and allow entity_N run, that means the > dependency between entities are broken (e.g. page table updates in > > Sdma entity pass but gfx submit in GFX entity blocked, not make sense > to me) > > We’d better either block the whole context or let not… > > 1.Kick out all jobs in this “guilty” ctx’s KFIFO queue, and set all > their fence status to “*ECANCELED*” > > Setting ECANCELED should be ok. But I think we should do this when we > try to run the jobs and not during GPU reset. 
> > [ML] without deep thought and expritment, I’m not sure the difference > between them, but kick it out in gpu_reset routine is more efficient, > > Otherwise you need to check context/entity guilty flag in run_job > routine …and you need to it for every context/entity, I don’t see why > > We don’t just kickout all of them in gpu_reset stage …. > > a)Iterate over all living ctx, and set all ctx as “*guilty*” since > VRAM lost actually ruins all VRAM contents > > No, that shouldn't be done by comparing the counters. Iterating over > all contexts is way to much overhead. > > [ML] because I want to make KMS IOCTL rules clean, like they don’t > need to differentiate VRAM lost or not, they only interested in if the > context is guilty or not, and block > > Submit for guilty ones. > > *Can you give more details of your idea? And better the detail > implement in cs_submit, I want to see how you want to block submit > without checking context guilty flag* > > a)Kick out all jobs in all ctx’s KFIFO queue, and set all their fence > status to “*ECANCELDED*” > > Yes and no, that should be done when we try to run the jobs and not > during GPU reset. > > [ML] again, kicking out them in gpu reset routine is high efficient, > otherwise you need check on every job in run_job() > > Besides, can you illustrate the detail implementation ? > > Yes and no, dma_fence_get_status() is some specific handling for > sync_file debugging (no idea why that made it into the common fence code). > > It was replaced by putting the error code directly into the fence, so > just reading that one after waiting should be ok. > > Maybe we should fix dma_fence_get_status() to do the right thing for this? > > [ML] yeah, that’s too confusing, the name sound really the one I want > to use, we should change it… > > *But look into the implement, I don**’t see why we cannot use it ? 
it > also finally return the fence->error * > > *From:*Koenig, Christian > *Sent:* Wednesday, October 11, 2017 3:21 PM > *To:* Liu, Monk <Monk.Liu@amd.com <mailto:Monk.Liu@amd.com>>; Haehnle, > Nicolai <Nicolai.Haehnle@amd.com <mailto:Nicolai.Haehnle@amd.com>>; > Olsak, Marek <Marek.Olsak@amd.com <mailto:Marek.Olsak@amd.com>>; > Deucher, Alexander <Alexander.Deucher@amd.com > <mailto:Alexander.Deucher@amd.com>> > *Cc:* amd-gfx@lists.freedesktop.org > <mailto:amd-gfx@lists.freedesktop.org>; Ding, Pixel > <Pixel.Ding@amd.com <mailto:Pixel.Ding@amd.com>>; Jiang, Jerry (SW) > <Jerry.Jiang@amd.com <mailto:Jerry.Jiang@amd.com>>; Li, Bingley > <Bingley.Li@amd.com <mailto:Bingley.Li@amd.com>>; Ramirez, Alejandro > <Alejandro.Ramirez@amd.com <mailto:Alejandro.Ramirez@amd.com>>; > Filipas, Mario <Mario.Filipas@amd.com <mailto:Mario.Filipas@amd.com>> > *Subject:* Re: TDR and VRAM lost handling in KMD: > > See inline: > > Am 11.10.2017 um 07:33 schrieb Liu, Monk: > > Hi Christian & Nicolai, > > We need to achieve some agreements on what should MESA/UMD do and > what should KMD do, *please give your comments with “okay” or “No” > and your idea on below items,* > > ?When a job timed out (set from lockup_timeout kernel parameter), > What KMD should do in TDR routine : > > 1.Update adev->*gpu_reset_counter*, and stop scheduler first, > (*gpu_reset_counter* is used to force vm flush after GPU reset, out > of this thread’s scope so no more discussion on it) > > Okay. > > 2.Set its fence error status to “*ETIME*”, > > No, as I already explained ETIME is for synchronous operation. > > In other words when we return ETIME from the wait IOCTL it would mean > that the waiting has somehow timed out, but not the job we waited for. > > Please use ECANCELED as well or some other error code when we find > that we need to distinct the timedout job from the canceled ones > (probably a good idea, but I'm not sure). 
> > 3.Find the entity/ctx behind this job, and set this ctx as “*guilty*” > > Not sure. Do we want to set the whole context as guilty or just the entity? > > Setting the whole contexts as guilty sounds racy to me. > > BTW: We should use a different name than "guilty", maybe just "bool > canceled;" ? > > 4.Kick out this job from scheduler’s mirror list, so this job won’t > get re-scheduled to ring anymore. > > Okay. > > 5.Kick out all jobs in this “guilty” ctx’s KFIFO queue, and set all > their fence status to “*ECANCELED*” > > Setting ECANCELED should be ok. But I think we should do this when we > try to run the jobs and not during GPU reset. > > 6.Force signal all fences that get kicked out by above two > steps,*otherwise UMD will block forever if waiting on those > fences* > > Okay. > > 7.Do gpu reset, which is can be some callbacks to let bare-metal and > SR-IOV implement with their favor style > > Okay. > > 8.After reset, KMD need to aware if the VRAM lost happens or not, > bare-metal can implement some function to judge, while for SR-IOV I > prefer to read it from GIM side (for initial version we consider > it’s always VRAM lost, till GIM side change aligned) > > Okay. > > 9.If VRAM lost not hit, continue, otherwise: > > a)Update adev->*vram_lost_counter*, > > Okay. > > b)Iterate over all living ctx, and set all ctx as “*guilty*” since > VRAM lost actually ruins all VRAM contents > > No, that shouldn't be done by comparing the counters. Iterating over > all contexts is way to much overhead. > > c)Kick out all jobs in all ctx’s KFIFO queue, and set all their > fence status to “*ECANCELDED*” > > Yes and no, that should be done when we try to run the jobs and not > during GPU reset. > > 10.Do GTT recovery and VRAM page tables/entries recovery (optional, > do we need it ???) > > Yes, that is still needed. As Nicolai explained we can't be sure that > VRAM is still 100% correct even when it isn't cleared. 
> > 11.Re-schedule all JOBs remains in mirror list to ring again and > restart scheduler (for VRAM lost case, no JOB will re-scheduled) > > Okay. > > ?For cs_wait() IOCTL: > > After it found fence signaled, it should check with > *“dma_fence_get_status” *to see if there is error there, > > And return the error status of fence > > Yes and no, dma_fence_get_status() is some specific handling for > sync_file debugging (no idea why that made it into the common fence code). > > It was replaced by putting the error code directly into the fence, so > just reading that one after waiting should be ok. > > Maybe we should fix dma_fence_get_status() to do the right thing for this? > > ?For cs_wait_fences() IOCTL: > > Similar with above approach > > ?For cs_submit() IOCTL: > > It need to check if current ctx been marked as “*guilty*” and return > “*ECANCELED*” if so > > ?Introduce a new IOCTL to let UMD query *vram_lost_counter*: > > This way, UMD can also block app from submitting, like @Nicolai > mentioned, we can cache one copy of *vram_lost_counter* when > enumerate physical device, and deny all > > gl-context from submitting if the counter queried bigger than that > one cached in physical device. (looks a little overkill to me, but > easy to implement ) > > UMD can also return error to APP when creating gl-context if found > current queried*vram_lost_counter *bigger than that one cached in > physical device. > > Okay. Already have a patch for this, please review that one if you > haven't already done so. > > Regards, > Christian. > > BTW: I realized that gl-context is a little different with kernel’s > context. Because for kernel. 
BO is not related with context but only > with FD, while in UMD, BO have a backend > > gl-context, so block submitting in UMD layer is also needed although > KMD will do its job as bottom line > > ?Basically “vram_lost_counter” is exposure by kernel to let UMD take > the control of robust extension feature, it will be UMD’s call to > move, KMD only deny “guilty” context from submitting > > Need your feedback, thx > > We’d better make TDR feature landed ASAP > > BR Monk > _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply [flat|nested] 23+ messages in thread
[parent not found: <BLUPR12MB044914F3A7B5D3D316481A7A844A0-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>]
* Re: TDR and VRAM lost handling in KMD: [not found] ` <BLUPR12MB044914F3A7B5D3D316481A7A844A0-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org> @ 2017-10-11 9:25 ` Nicolai Hähnle [not found] ` <6876e153-7e98-66ac-7338-5601cf83c633-5C7GfCeVMHo@public.gmane.org> 0 siblings, 1 reply; 23+ messages in thread From: Nicolai Hähnle @ 2017-10-11 9:25 UTC (permalink / raw) To: Liu, Monk, Koenig, Christian, Olsak, Marek, Deucher, Alexander Cc: Ramirez, Alejandro, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Filipas, Mario, Ding, Pixel, Li, Bingley, Jiang, Jerry (SW) On 11.10.2017 11:18, Liu, Monk wrote: > Let's talk it simple, When vram lost hit, what's the action for amdgpu_ctx_query()/AMDGPU_CTX_OP_QUERY_STATE on other contexts (not the one trigger gpu hang) after vram lost ? do you mean we return -ENODEV to UMD ? It should successfully return AMDGPU_CTX_INNOCENT_RESET. > In cs_submit, with vram lost hit, if we don't mark all contexts as "guilty", how we block its from submitting ? can you show some implement way if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter)) return -ECANCELED; (where ctx->vram_lost_counter is initialized at context creation time and never changed afterwards) > BTW: the "guilty" here is a new member I want to add to context, it is not related with AMDGPU_CTX_OP_QUERY_STATE UK interface, > Looks I need to unify them and only one place to mark guilty or not Right, the AMDGPU_CTX_OP_QUERY_STATE handling needs to be made consistent with the rest. 
Cheers, Nicolai > > > BR Monk > > -----Original Message----- > From: Haehnle, Nicolai > Sent: Wednesday, October 11, 2017 5:00 PM > To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com> > Cc: amd-gfx@lists.freedesktop.org; Ding, Pixel <Pixel.Ding@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, Bingley <Bingley.Li@amd.com>; Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; Filipas, Mario <Mario.Filipas@amd.com> > Subject: Re: TDR and VRAM lost handling in KMD: > > On 11.10.2017 10:48, Liu, Monk wrote: >> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), so >> it's reasonable to use it. However, it /does not/ make sense to mark >> idle contexts as "guilty" just because VRAM is lost. VRAM lost is a >> perfect example where the driver should report context lost to >> applications with the "innocent" flag for contexts that were idle at >> the time of reset. The only context(s) that should be reported as "guilty" >> (or perhaps "unknown" in some cases) are the ones that were executing >> at the time of reset. >> >> ML: KMD mark all contexts as guilty is because that way we can unify >> our IOCTL behavior: e.g. for IOCTL only block “guilty”context , no >> need to worry about vram-lost-counter anymore, that’s a implementation >> style. I don’t think it is related with UMD layer, >> >> For UMD the gl-context isn’t aware of by KMD, so UMD can implement it >> own “guilty” gl-context if you want. > > Well, to some extent this is just semantics, but it helps to keep the terminology consistent. > > Most importantly, please keep the AMDGPU_CTX_OP_QUERY_STATE uapi in > mind: this returns one of AMDGPU_CTX_{GUILTY,INNOCENT,UNKNOWN}_RECENT, > and it must return "innocent" for contexts that are only lost due to VRAM lost without being otherwise involved in the timeout that lead to the reset. 
> > The point is that in the places where you used "guilty" it would be better to use "context lost", and then further differentiate between guilty/innocent context lost based on the details of what happened. > > >> If KMD doesn’t mark all ctx as guilty after VRAM lost, can you >> illustrate what rule KMD should obey to check in KMS IOCTL like >> cs_sumbit ?? let’s see which way better > > if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter)) > return -ECANCELED; > > Plus similar logic for AMDGPU_CTX_OP_QUERY_STATE. > > Yes, it's one additional check in cs_submit. If you're worried about that (and Christian's concerns about possible issues with walking over all contexts are addressed), I suppose you could just store a per-context > > unsigned context_reset_status; > > instead of a `bool guilty`. Its value would start out as 0 > (AMDGPU_CTX_NO_RESET) and would be set to the correct value during reset. > > Cheers, > Nicolai > > >> >> *From:*Haehnle, Nicolai >> *Sent:* Wednesday, October 11, 2017 4:41 PM >> *To:* Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian >> <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; >> Deucher, Alexander <Alexander.Deucher@amd.com> >> *Cc:* amd-gfx@lists.freedesktop.org; Ding, Pixel <Pixel.Ding@amd.com>; >> Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, Bingley >> <Bingley.Li@amd.com>; Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; >> Filipas, Mario <Mario.Filipas@amd.com> >> *Subject:* Re: TDR and VRAM lost handling in KMD: >> >> From a Mesa perspective, this almost all sounds reasonable to me. >> >> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), so >> it's reasonable to use it. However, it /does not/ make sense to mark >> idle contexts as "guilty" just because VRAM is lost. VRAM lost is a >> perfect example where the driver should report context lost to >> applications with the "innocent" flag for contexts that were idle at >> the time of reset. 
The only context(s) that should be reported as "guilty" >> (or perhaps "unknown" in some cases) are the ones that were executing >> at the time of reset. >> >> On whether the whole context is marked as guilty from a user space >> perspective, it would simply be nice for user space to get consistent >> answers. It would be a bit odd if we could e.g. succeed in submitting >> an SDMA job after a GFX job was rejected. This would point in favor of >> marking the entire context as guilty (although that could happen >> lazily instead of at reset time). On the other hand, if that's too big >> a burden for the kernel implementation I'm sure we can live without it. >> >> Cheers, >> >> Nicolai >> >> ---------------------------------------------------------------------- >> -- >> >> *From:*Liu, Monk >> *Sent:* Wednesday, October 11, 2017 10:15:40 AM >> *To:* Koenig, Christian; Haehnle, Nicolai; Olsak, Marek; Deucher, >> Alexander >> *Cc:* amd-gfx@lists.freedesktop.org >> <mailto:amd-gfx@lists.freedesktop.org>; Ding, Pixel; Jiang, Jerry >> (SW); Li, Bingley; Ramirez, Alejandro; Filipas, Mario >> *Subject:* RE: TDR and VRAM lost handling in KMD: >> >> 1.Set its fence error status to “*ETIME*”, >> >> No, as I already explained ETIME is for synchronous operation. >> >> In other words when we return ETIME from the wait IOCTL it would mean >> that the waiting has somehow timed out, but not the job we waited for. >> >> Please use ECANCELED as well or some other error code when we find >> that we need to distinct the timedout job from the canceled ones >> (probably a good idea, but I'm not sure). >> >> [ML] I’m okay if you insist not to use ETIME >> >> 1.Find the entity/ctx behind this job, and set this ctx as “*guilty*” >> >> Not sure. Do we want to set the whole context as guilty or just the entity? >> >> Setting the whole contexts as guilty sounds racy to me. >> >> BTW: We should use a different name than "guilty", maybe just "bool >> canceled;" ? 
>> >> [ML] I think context is better than entity, because for example if you >> only block entity_0 of context and allow entity_N run, that means the >> dependency between entities are broken (e.g. page table updates in >> >> Sdma entity pass but gfx submit in GFX entity blocked, not make sense >> to me) >> >> We’d better either block the whole context or let not… >> >> 1.Kick out all jobs in this “guilty” ctx’s KFIFO queue, and set all >> their fence status to “*ECANCELED*” >> >> Setting ECANCELED should be ok. But I think we should do this when we >> try to run the jobs and not during GPU reset. >> >> [ML] without deep thought and expritment, I’m not sure the difference >> between them, but kick it out in gpu_reset routine is more efficient, >> >> Otherwise you need to check context/entity guilty flag in run_job >> routine …and you need to it for every context/entity, I don’t see why >> >> We don’t just kickout all of them in gpu_reset stage …. >> >> a)Iterate over all living ctx, and set all ctx as “*guilty*” since >> VRAM lost actually ruins all VRAM contents >> >> No, that shouldn't be done by comparing the counters. Iterating over >> all contexts is way to much overhead. >> >> [ML] because I want to make KMS IOCTL rules clean, like they don’t >> need to differentiate VRAM lost or not, they only interested in if the >> context is guilty or not, and block >> >> Submit for guilty ones. >> >> *Can you give more details of your idea? And better the detail >> implement in cs_submit, I want to see how you want to block submit >> without checking context guilty flag* >> >> a)Kick out all jobs in all ctx’s KFIFO queue, and set all their fence >> status to “*ECANCELDED*” >> >> Yes and no, that should be done when we try to run the jobs and not >> during GPU reset. >> >> [ML] again, kicking out them in gpu reset routine is high efficient, >> otherwise you need check on every job in run_job() >> >> Besides, can you illustrate the detail implementation ? 
>> >> Yes and no, dma_fence_get_status() is some specific handling for >> sync_file debugging (no idea why that made it into the common fence code). >> >> It was replaced by putting the error code directly into the fence, so >> just reading that one after waiting should be ok. >> >> Maybe we should fix dma_fence_get_status() to do the right thing for this? >> >> [ML] yeah, that’s too confusing, the name sound really the one I want >> to use, we should change it… >> >> *But look into the implement, I don**’t see why we cannot use it ? it >> also finally return the fence->error * >> >> *From:*Koenig, Christian >> *Sent:* Wednesday, October 11, 2017 3:21 PM >> *To:* Liu, Monk <Monk.Liu@amd.com <mailto:Monk.Liu@amd.com>>; Haehnle, >> Nicolai <Nicolai.Haehnle@amd.com <mailto:Nicolai.Haehnle@amd.com>>; >> Olsak, Marek <Marek.Olsak@amd.com <mailto:Marek.Olsak@amd.com>>; >> Deucher, Alexander <Alexander.Deucher@amd.com >> <mailto:Alexander.Deucher@amd.com>> >> *Cc:* amd-gfx@lists.freedesktop.org >> <mailto:amd-gfx@lists.freedesktop.org>; Ding, Pixel >> <Pixel.Ding@amd.com <mailto:Pixel.Ding@amd.com>>; Jiang, Jerry (SW) >> <Jerry.Jiang@amd.com <mailto:Jerry.Jiang@amd.com>>; Li, Bingley >> <Bingley.Li@amd.com <mailto:Bingley.Li@amd.com>>; Ramirez, Alejandro >> <Alejandro.Ramirez@amd.com <mailto:Alejandro.Ramirez@amd.com>>; >> Filipas, Mario <Mario.Filipas@amd.com <mailto:Mario.Filipas@amd.com>> >> *Subject:* Re: TDR and VRAM lost handling in KMD: >> >> See inline: >> >> Am 11.10.2017 um 07:33 schrieb Liu, Monk: >> >> Hi Christian & Nicolai, >> >> We need to achieve some agreements on what should MESA/UMD do and >> what should KMD do, *please give your comments with “okay” or “No” >> and your idea on below items,* >> >> ?When a job timed out (set from lockup_timeout kernel parameter), >> What KMD should do in TDR routine : >> >> 1.Update adev->*gpu_reset_counter*, and stop scheduler first, >> (*gpu_reset_counter* is used to force vm flush after GPU reset, out >> of this 
thread’s scope so no more discussion on it) >> >> Okay. >> >> 2.Set its fence error status to “*ETIME*”, >> >> No, as I already explained ETIME is for synchronous operation. >> >> In other words when we return ETIME from the wait IOCTL it would mean >> that the waiting has somehow timed out, but not the job we waited for. >> >> Please use ECANCELED as well or some other error code when we find >> that we need to distinct the timedout job from the canceled ones >> (probably a good idea, but I'm not sure). >> >> 3.Find the entity/ctx behind this job, and set this ctx as “*guilty*” >> >> Not sure. Do we want to set the whole context as guilty or just the entity? >> >> Setting the whole contexts as guilty sounds racy to me. >> >> BTW: We should use a different name than "guilty", maybe just "bool >> canceled;" ? >> >> 4.Kick out this job from scheduler’s mirror list, so this job won’t >> get re-scheduled to ring anymore. >> >> Okay. >> >> 5.Kick out all jobs in this “guilty” ctx’s KFIFO queue, and set all >> their fence status to “*ECANCELED*” >> >> Setting ECANCELED should be ok. But I think we should do this when we >> try to run the jobs and not during GPU reset. >> >> 6.Force signal all fences that get kicked out by above two >> steps,*otherwise UMD will block forever if waiting on those >> fences* >> >> Okay. >> >> 7.Do gpu reset, which is can be some callbacks to let bare-metal and >> SR-IOV implement with their favor style >> >> Okay. >> >> 8.After reset, KMD need to aware if the VRAM lost happens or not, >> bare-metal can implement some function to judge, while for SR-IOV I >> prefer to read it from GIM side (for initial version we consider >> it’s always VRAM lost, till GIM side change aligned) >> >> Okay. >> >> 9.If VRAM lost not hit, continue, otherwise: >> >> a)Update adev->*vram_lost_counter*, >> >> Okay. 
>> >> b)Iterate over all living ctx, and set all ctx as “*guilty*” since >> VRAM lost actually ruins all VRAM contents >> >> No, that shouldn't be done by comparing the counters. Iterating over >> all contexts is way to much overhead. >> >> c)Kick out all jobs in all ctx’s KFIFO queue, and set all their >> fence status to “*ECANCELDED*” >> >> Yes and no, that should be done when we try to run the jobs and not >> during GPU reset. >> >> 10.Do GTT recovery and VRAM page tables/entries recovery (optional, >> do we need it ???) >> >> Yes, that is still needed. As Nicolai explained we can't be sure that >> VRAM is still 100% correct even when it isn't cleared. >> >> 11.Re-schedule all JOBs remains in mirror list to ring again and >> restart scheduler (for VRAM lost case, no JOB will re-scheduled) >> >> Okay. >> >> ?For cs_wait() IOCTL: >> >> After it found fence signaled, it should check with >> *“dma_fence_get_status” *to see if there is error there, >> >> And return the error status of fence >> >> Yes and no, dma_fence_get_status() is some specific handling for >> sync_file debugging (no idea why that made it into the common fence code). >> >> It was replaced by putting the error code directly into the fence, so >> just reading that one after waiting should be ok. >> >> Maybe we should fix dma_fence_get_status() to do the right thing for this? >> >> ?For cs_wait_fences() IOCTL: >> >> Similar with above approach >> >> ?For cs_submit() IOCTL: >> >> It need to check if current ctx been marked as “*guilty*” and return >> “*ECANCELED*” if so >> >> ?Introduce a new IOCTL to let UMD query *vram_lost_counter*: >> >> This way, UMD can also block app from submitting, like @Nicolai >> mentioned, we can cache one copy of *vram_lost_counter* when >> enumerate physical device, and deny all >> >> gl-context from submitting if the counter queried bigger than that >> one cached in physical device. 
(looks a little overkill to me, but >> easy to implement ) >> >> UMD can also return error to APP when creating gl-context if found >> current queried*vram_lost_counter *bigger than that one cached in >> physical device. >> >> Okay. Already have a patch for this, please review that one if you >> haven't already done so. >> >> Regards, >> Christian. >> >> BTW: I realized that gl-context is a little different with kernel’s >> context. Because for kernel. BO is not related with context but only >> with FD, while in UMD, BO have a backend >> >> gl-context, so block submitting in UMD layer is also needed although >> KMD will do its job as bottom line >> >> ?Basically “vram_lost_counter” is exposure by kernel to let UMD take >> the control of robust extension feature, it will be UMD’s call to >> move, KMD only deny “guilty” context from submitting >> >> Need your feedback, thx >> >> We’d better make TDR feature landed ASAP >> >> BR Monk >> > _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply [flat|nested] 23+ messages in thread
[parent not found: <6876e153-7e98-66ac-7338-5601cf83c633-5C7GfCeVMHo@public.gmane.org>]
* RE: TDR and VRAM lost handling in KMD: [not found] ` <6876e153-7e98-66ac-7338-5601cf83c633-5C7GfCeVMHo@public.gmane.org> @ 2017-10-11 9:41 ` Liu, Monk [not found] ` <BLUPR12MB044907C2C72DD8BEB1D5BE3B844A0-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org> 0 siblings, 1 reply; 23+ messages in thread From: Liu, Monk @ 2017-10-11 9:41 UTC (permalink / raw) To: Haehnle, Nicolai, Koenig, Christian, Olsak, Marek, Deucher, Alexander Cc: Ramirez, Alejandro, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Filipas, Mario, Ding, Pixel, Li, Bingley, Jiang, Jerry (SW)

Okay, let me summarize our whole idea and see if it works:

1. For cs_submit, always check vram_lost_counter first and reject the submit (return -ECANCELED to UMD) if ctx->vram_lost_counter != adev->vram_lost_counter. That way the VRAM lost case is handled.

2. For cs_submit we still need to check whether the incoming context is "AMDGPU_CTX_GUILTY_RESET" even when ctx->vram_lost_counter == adev->vram_lost_counter, and we reject the submit if it is "AMDGPU_CTX_GUILTY_RESET", correct?

3. In the gpu_reset() routine, we only mark the hang job's entity as guilty (so we need to add a new member to the entity structure), and we do not kick it out in the gpu_reset() stage, but we need to set the context behind this entity to "AMDGPU_CTX_GUILTY_RESET". And if the reset introduces VRAM loss, we just update adev->vram_lost_counter but *don't* mark all entities as guilty, so only the hang job's entity is "guilty". After an entity is marked "guilty", we find a way to set the context behind it to AMDGPU_CTX_GUILTY_RESET; because this is a U/K interface, we need to let UMD know that this context is broken.

4. In the GPU scheduler's run_job() routine, since it only reads the entity, we skip job scheduling once we find the entity is "guilty".

Does the above sound good?
-----Original Message----- From: Haehnle, Nicolai Sent: Wednesday, October 11, 2017 5:26 PM To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com> Cc: amd-gfx@lists.freedesktop.org; Ding, Pixel <Pixel.Ding@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, Bingley <Bingley.Li@amd.com>; Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; Filipas, Mario <Mario.Filipas@amd.com> Subject: Re: TDR and VRAM lost handling in KMD: On 11.10.2017 11:18, Liu, Monk wrote: > Let's talk it simple, When vram lost hit, what's the action for amdgpu_ctx_query()/AMDGPU_CTX_OP_QUERY_STATE on other contexts (not the one trigger gpu hang) after vram lost ? do you mean we return -ENODEV to UMD ? It should successfully return AMDGPU_CTX_INNOCENT_RESET. > In cs_submit, with vram lost hit, if we don't mark all contexts as > "guilty", how we block its from submitting ? can you show some > implement way if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter)) return -ECANCELED; (where ctx->vram_lost_counter is initialized at context creation time and never changed afterwards) > BTW: the "guilty" here is a new member I want to add to context, it is not related with AMDGPU_CTX_OP_QUERY_STATE UK interface, > Looks I need to unify them and only one place to mark guilty or not Right, the AMDGPU_CTX_OP_QUERY_STATE handling needs to be made consistent with the rest. 
Cheers, Nicolai > > > BR Monk > > -----Original Message----- > From: Haehnle, Nicolai > Sent: Wednesday, October 11, 2017 5:00 PM > To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com> > Cc: amd-gfx@lists.freedesktop.org; Ding, Pixel <Pixel.Ding@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, Bingley <Bingley.Li@amd.com>; Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; Filipas, Mario <Mario.Filipas@amd.com> > Subject: Re: TDR and VRAM lost handling in KMD: > > On 11.10.2017 10:48, Liu, Monk wrote: >> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), so >> it's reasonable to use it. However, it /does not/ make sense to mark >> idle contexts as "guilty" just because VRAM is lost. VRAM lost is a >> perfect example where the driver should report context lost to >> applications with the "innocent" flag for contexts that were idle at >> the time of reset. The only context(s) that should be reported as "guilty" >> (or perhaps "unknown" in some cases) are the ones that were executing >> at the time of reset. >> >> ML: KMD mark all contexts as guilty is because that way we can unify >> our IOCTL behavior: e.g. for IOCTL only block “guilty”context , no >> need to worry about vram-lost-counter anymore, that’s a implementation >> style. I don’t think it is related with UMD layer, >> >> For UMD the gl-context isn’t aware of by KMD, so UMD can implement it >> own “guilty” gl-context if you want. > > Well, to some extent this is just semantics, but it helps to keep the terminology consistent. > > Most importantly, please keep the AMDGPU_CTX_OP_QUERY_STATE uapi in > mind: this returns one of AMDGPU_CTX_{GUILTY,INNOCENT,UNKNOWN}_RECENT, > and it must return "innocent" for contexts that are only lost due to VRAM lost without being otherwise involved in the timeout that lead to the reset. 
> > The point is that in the places where you used "guilty" it would be better to use "context lost", and then further differentiate between guilty/innocent context lost based on the details of what happened. > > >> If KMD doesn’t mark all ctx as guilty after VRAM lost, can you >> illustrate what rule KMD should obey to check in KMS IOCTL like >> cs_submit ?? let’s see which way better > > if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter)) > return -ECANCELED; > > Plus similar logic for AMDGPU_CTX_OP_QUERY_STATE. > > Yes, it's one additional check in cs_submit. If you're worried about that (and Christian's concerns about possible issues with walking over all contexts are addressed), I suppose you could just store a per-context > > unsigned context_reset_status; > > instead of a `bool guilty`. Its value would start out as 0 > (AMDGPU_CTX_NO_RESET) and would be set to the correct value during reset. > > Cheers, > Nicolai > > >> >> *From:*Haehnle, Nicolai >> *Sent:* Wednesday, October 11, 2017 4:41 PM >> *To:* Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian >> <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; >> Deucher, Alexander <Alexander.Deucher@amd.com> >> *Cc:* amd-gfx@lists.freedesktop.org; Ding, Pixel <Pixel.Ding@amd.com>; >> Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, Bingley >> <Bingley.Li@amd.com>; Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; >> Filipas, Mario <Mario.Filipas@amd.com> >> *Subject:* Re: TDR and VRAM lost handling in KMD: >> >> From a Mesa perspective, this almost all sounds reasonable to me. >> >> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), so >> it's reasonable to use it. However, it /does not/ make sense to mark >> idle contexts as "guilty" just because VRAM is lost. VRAM lost is a >> perfect example where the driver should report context lost to >> applications with the "innocent" flag for contexts that were idle at >> the time of reset. 
The only context(s) that should be reported as "guilty" >> (or perhaps "unknown" in some cases) are the ones that were executing >> at the time of reset. >> >> On whether the whole context is marked as guilty from a user space >> perspective, it would simply be nice for user space to get consistent >> answers. It would be a bit odd if we could e.g. succeed in submitting >> an SDMA job after a GFX job was rejected. This would point in favor of >> marking the entire context as guilty (although that could happen >> lazily instead of at reset time). On the other hand, if that's too big >> a burden for the kernel implementation I'm sure we can live without it. >> >> Cheers, >> >> Nicolai >> >> ---------------------------------------------------------------------- >> -- >> >> *From:*Liu, Monk >> *Sent:* Wednesday, October 11, 2017 10:15:40 AM >> *To:* Koenig, Christian; Haehnle, Nicolai; Olsak, Marek; Deucher, >> Alexander >> *Cc:* amd-gfx@lists.freedesktop.org >> <mailto:amd-gfx@lists.freedesktop.org>; Ding, Pixel; Jiang, Jerry >> (SW); Li, Bingley; Ramirez, Alejandro; Filipas, Mario >> *Subject:* RE: TDR and VRAM lost handling in KMD: >> >> 1.Set its fence error status to “*ETIME*”, >> >> No, as I already explained ETIME is for synchronous operation. >> >> In other words when we return ETIME from the wait IOCTL it would mean >> that the waiting has somehow timed out, but not the job we waited for. >> >> Please use ECANCELED as well or some other error code when we find >> that we need to distinct the timedout job from the canceled ones >> (probably a good idea, but I'm not sure). >> >> [ML] I’m okay if you insist not to use ETIME >> >> 1.Find the entity/ctx behind this job, and set this ctx as “*guilty*” >> >> Not sure. Do we want to set the whole context as guilty or just the entity? >> >> Setting the whole contexts as guilty sounds racy to me. >> >> BTW: We should use a different name than "guilty", maybe just "bool >> canceled;" ? 
>> >> [ML] I think context is better than entity, because for example if you >> only block entity_0 of context and allow entity_N to run, that means the >> dependencies between entities are broken (e.g. page table updates in >> >> Sdma entity pass but gfx submit in GFX entity blocked, not make sense >> to me) >> >> We’d better either block the whole context or not… >> >> 1.Kick out all jobs in this “guilty” ctx’s KFIFO queue, and set all >> their fence status to “*ECANCELED*” >> >> Setting ECANCELED should be ok. But I think we should do this when we >> try to run the jobs and not during GPU reset. >> >> [ML] without deep thought and experiment, I’m not sure the difference >> between them, but kicking it out in the gpu_reset routine is more efficient, >> >> Otherwise you need to check the context/entity guilty flag in the run_job >> routine …and you need to do it for every context/entity, I don’t see why >> >> We don’t just kick out all of them in the gpu_reset stage …. >> >> a)Iterate over all living ctx, and set all ctx as “*guilty*” since >> VRAM lost actually ruins all VRAM contents >> >> No, that shouldn't be done by comparing the counters. Iterating over >> all contexts is way too much overhead. >> >> [ML] because I want to make the KMS IOCTL rules clean, like they don’t >> need to differentiate VRAM lost or not, they are only interested in whether the >> context is guilty or not, and block >> >> Submit for guilty ones. >> >> *Can you give more details of your idea? And better, the detailed >> implementation in cs_submit, I want to see how you want to block submit >> without checking the context guilty flag* >> >> a)Kick out all jobs in all ctx’s KFIFO queue, and set all their fence >> status to “*ECANCELED*” >> >> Yes and no, that should be done when we try to run the jobs and not >> during GPU reset. 
>> >> Yes and no, dma_fence_get_status() is some specific handling for >> sync_file debugging (no idea why that made it into the common fence code). >> >> It was replaced by putting the error code directly into the fence, so >> just reading that one after waiting should be ok. >> >> Maybe we should fix dma_fence_get_status() to do the right thing for this? >> >> [ML] yeah, that’s too confusing, the name really sounds like the one I want >> to use, we should change it… >> >> *But looking into the implementation, I don’t see why we cannot use it? It >> also finally returns the fence->error * >> >> *From:*Koenig, Christian >> *Sent:* Wednesday, October 11, 2017 3:21 PM >> *To:* Liu, Monk <Monk.Liu@amd.com <mailto:Monk.Liu@amd.com>>; Haehnle, >> Nicolai <Nicolai.Haehnle@amd.com <mailto:Nicolai.Haehnle@amd.com>>; >> Olsak, Marek <Marek.Olsak@amd.com <mailto:Marek.Olsak@amd.com>>; >> Deucher, Alexander <Alexander.Deucher@amd.com >> <mailto:Alexander.Deucher@amd.com>> >> *Cc:* amd-gfx@lists.freedesktop.org >> <mailto:amd-gfx@lists.freedesktop.org>; Ding, Pixel >> <Pixel.Ding@amd.com <mailto:Pixel.Ding@amd.com>>; Jiang, Jerry (SW) >> <Jerry.Jiang@amd.com <mailto:Jerry.Jiang@amd.com>>; Li, Bingley >> <Bingley.Li@amd.com <mailto:Bingley.Li@amd.com>>; Ramirez, Alejandro >> <Alejandro.Ramirez@amd.com <mailto:Alejandro.Ramirez@amd.com>>; >> Filipas, Mario <Mario.Filipas@amd.com <mailto:Mario.Filipas@amd.com>> >> *Subject:* Re: TDR and VRAM lost handling in KMD: >> >> See inline: >> >> On 11.10.2017 at 07:33, Liu, Monk wrote: >> >> Hi Christian & Nicolai, >> >> We need to achieve some agreements on what should MESA/UMD do and >> what should KMD do, *please give your comments with “okay” or “No” >> and your idea on below items,* >> >> • When a job timed out (set from lockup_timeout kernel parameter), >> What KMD should do in TDR routine : >> >> 1.Update adev->*gpu_reset_counter*, and stop scheduler first, >> (*gpu_reset_counter* is used to force vm flush after GPU reset, out >> of this 
thread’s scope so no more discussion on it) >> >> Okay. >> >> 2.Set its fence error status to “*ETIME*”, >> >> No, as I already explained ETIME is for synchronous operation. >> >> In other words when we return ETIME from the wait IOCTL it would mean >> that the waiting has somehow timed out, but not the job we waited for. >> >> Please use ECANCELED as well or some other error code when we find >> that we need to distinguish the timed-out job from the canceled ones >> (probably a good idea, but I'm not sure). >> >> 3.Find the entity/ctx behind this job, and set this ctx as “*guilty*” >> >> Not sure. Do we want to set the whole context as guilty or just the entity? >> >> Setting the whole context as guilty sounds racy to me. >> >> BTW: We should use a different name than "guilty", maybe just "bool >> canceled;" ? >> >> 4.Kick out this job from scheduler’s mirror list, so this job won’t >> get re-scheduled to ring anymore. >> >> Okay. >> >> 5.Kick out all jobs in this “guilty” ctx’s KFIFO queue, and set all >> their fence status to “*ECANCELED*” >> >> Setting ECANCELED should be ok. But I think we should do this when we >> try to run the jobs and not during GPU reset. >> >> 6.Force signal all fences that get kicked out by above two >> steps,*otherwise UMD will block forever if waiting on those >> fences* >> >> Okay. >> >> 7.Do gpu reset, which can be some callbacks to let bare-metal and >> SR-IOV implement in their favored style >> >> Okay. >> >> 8.After reset, KMD needs to be aware if the VRAM lost happens or not, >> bare-metal can implement some function to judge, while for SR-IOV I >> prefer to read it from GIM side (for initial version we consider >> it’s always VRAM lost, till GIM side change aligned) >> >> Okay. >> >> 9.If VRAM lost not hit, continue, otherwise: >> >> a)Update adev->*vram_lost_counter*, >> >> Okay. 
>> >> b)Iterate over all living ctx, and set all ctx as “*guilty*” since >> VRAM lost actually ruins all VRAM contents >> >> No, that shouldn't be done by comparing the counters. Iterating over >> all contexts is way too much overhead. >> >> c)Kick out all jobs in all ctx’s KFIFO queue, and set all their >> fence status to “*ECANCELED*” >> >> Yes and no, that should be done when we try to run the jobs and not >> during GPU reset. >> >> 10.Do GTT recovery and VRAM page tables/entries recovery (optional, >> do we need it ???) >> >> Yes, that is still needed. As Nicolai explained we can't be sure that >> VRAM is still 100% correct even when it isn't cleared. >> >> 11.Re-schedule all JOBs remaining in the mirror list to ring again and >> restart scheduler (for VRAM lost case, no JOB will be re-scheduled) >> >> Okay. >> >> • For cs_wait() IOCTL: >> >> After it found fence signaled, it should check with >> *“dma_fence_get_status”* to see if there is error there, >> >> And return the error status of the fence >> >> Yes and no, dma_fence_get_status() is some specific handling for >> sync_file debugging (no idea why that made it into the common fence code). >> >> It was replaced by putting the error code directly into the fence, so >> just reading that one after waiting should be ok. >> >> Maybe we should fix dma_fence_get_status() to do the right thing for this? >> >> • For cs_wait_fences() IOCTL: >> >> Similar to the above approach >> >> • For cs_submit() IOCTL: >> >> It needs to check if the current ctx has been marked as “*guilty*” and return >> “*ECANCELED*” if so >> >> • Introduce a new IOCTL to let UMD query *vram_lost_counter*: >> >> This way, UMD can also block the app from submitting, like @Nicolai >> mentioned, we can cache one copy of *vram_lost_counter* when >> enumerating the physical device, and deny all >> >> gl-contexts from submitting if the queried counter is bigger than that >> one cached in the physical device. 
(looks a little overkill to me, but >> easy to implement ) >> >> UMD can also return error to APP when creating gl-context if it found >> the currently queried *vram_lost_counter* bigger than that one cached in >> the physical device. >> >> Okay. Already have a patch for this, please review that one if you >> haven't already done so. >> >> Regards, >> Christian. >> >> BTW: I realized that gl-context is a little different with kernel’s >> context. Because for kernel, BO is not related with context but only >> with FD, while in UMD, BO have a backend >> >> gl-context, so block submitting in UMD layer is also needed although >> KMD will do its job as bottom line >> >> • Basically “vram_lost_counter” is exposed by the kernel to let UMD take >> the control of robust extension feature, it will be UMD’s call to >> move, KMD only deny “guilty” context from submitting >> >> Need your feedback, thx >> >> We’d better make TDR feature landed ASAP >> >> BR Monk >> > _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply [flat|nested] 23+ messages in thread
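The vram_lost_counter check Nicolai proposes in this message can be simulated in plain user-space C. This is only a sketch: the dummy_* names below are hypothetical stand-ins for the kernel's amdgpu_device and amdgpu_ctx, not the actual amdgpu implementation.

```c
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical stand-ins for amdgpu_device and amdgpu_ctx. */
struct dummy_device {
	atomic_uint vram_lost_counter;	/* bumped once on every VRAM loss */
};

struct dummy_ctx {
	unsigned int vram_lost_counter;	/* snapshot taken at ctx creation */
};

/* Reject submissions from contexts created before the last VRAM loss. */
static int dummy_cs_submit(struct dummy_ctx *ctx, struct dummy_device *adev)
{
	if (ctx->vram_lost_counter != atomic_load(&adev->vram_lost_counter))
		return -ECANCELED;
	return 0;	/* the job would be queued to the scheduler here */
}
```

The appeal of this scheme is that nothing has to iterate over contexts during reset: each stale context disqualifies itself lazily on its next submission.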
[parent not found: <BLUPR12MB044907C2C72DD8BEB1D5BE3B844A0-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>]
* Re: TDR and VRAM lost handling in KMD: [not found] ` <BLUPR12MB044907C2C72DD8BEB1D5BE3B844A0-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org> @ 2017-10-11 10:14 ` Chunming Zhou [not found] ` <8c4e849f-9227-12bc-9d2e-3daf60fcd762-5C7GfCeVMHo@public.gmane.org> 0 siblings, 1 reply; 23+ messages in thread From: Chunming Zhou @ 2017-10-11 10:14 UTC (permalink / raw) To: Liu, Monk, Haehnle, Nicolai, Koenig, Christian, Olsak, Marek, Deucher, Alexander Cc: Ramirez, Alejandro, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Filipas, Mario, Ding, Pixel, Li, Bingley, Jiang, Jerry (SW) Your summary lacks the below issue: What about a job already pushed into the scheduler queue when vram is lost? Regards, David Zhou On 2017-10-11 17:41, Liu, Monk wrote: > Okay, let me summarize our whole idea together and see if it works: > > 1, For cs_submit, always check vram_lost_counter first and reject the submit (return -ECANCELED to UMD) if ctx->vram_lost_counter != adev->vram_lost_counter. That way the vram lost issue can be handled > > 2, for cs_submit we still need to check if the incoming context is "AMDGPU_CTX_GUILTY_RESET" or not even if we found ctx->vram_lost_counter == adev->vram_lost_counter, and we can reject the submit > If it is "AMDGPU_CTX_GUILTY_RESET", correct ? > > 3, in gpu_reset() routine, we only mark the hang job's entity as guilty (so we need to add a new member in the entity structure), and not kick it out in gpu_reset() stage, but we need to set the context behind this entity as "AMDGPU_CTX_GUILTY_RESET" > And if reset introduces VRAM LOST, we just update adev->vram_lost_counter, but *don't* change all entities to guilty, so still only the hang job's entity is "guilty" > After some entity is marked as "guilty", we find a way to set the context behind it as AMDGPU_CTX_GUILTY_RESET, because this is the U/K interface, we need to let UMD know that this context is wrong. 
> > 4, in gpu scheduler's run_job() routine, since it only reads entity, so we skip job scheduling once found the entity is "guilty" > > > Does above sounds good ? > > > > -----Original Message----- > From: Haehnle, Nicolai > Sent: Wednesday, October 11, 2017 5:26 PM > To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com> > Cc: amd-gfx@lists.freedesktop.org; Ding, Pixel <Pixel.Ding@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, Bingley <Bingley.Li@amd.com>; Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; Filipas, Mario <Mario.Filipas@amd.com> > Subject: Re: TDR and VRAM lost handling in KMD: > > On 11.10.2017 11:18, Liu, Monk wrote: >> Let's talk it simple, When vram lost hit, what's the action for amdgpu_ctx_query()/AMDGPU_CTX_OP_QUERY_STATE on other contexts (not the one trigger gpu hang) after vram lost ? do you mean we return -ENODEV to UMD ? > It should successfully return AMDGPU_CTX_INNOCENT_RESET. > > >> In cs_submit, with vram lost hit, if we don't mark all contexts as >> "guilty", how we block its from submitting ? can you show some >> implement way > if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter)) > return -ECANCELED; > > (where ctx->vram_lost_counter is initialized at context creation time and never changed afterwards) > > >> BTW: the "guilty" here is a new member I want to add to context, it is not related with AMDGPU_CTX_OP_QUERY_STATE UK interface, >> Looks I need to unify them and only one place to mark guilty or not > Right, the AMDGPU_CTX_OP_QUERY_STATE handling needs to be made > consistent with the rest. 
> > Cheers, > Nicolai > > >> >> BR Monk >> >> -----Original Message----- >> From: Haehnle, Nicolai >> Sent: Wednesday, October 11, 2017 5:00 PM >> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com> >> Cc: amd-gfx@lists.freedesktop.org; Ding, Pixel <Pixel.Ding@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, Bingley <Bingley.Li@amd.com>; Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; Filipas, Mario <Mario.Filipas@amd.com> >> Subject: Re: TDR and VRAM lost handling in KMD: >> >> On 11.10.2017 10:48, Liu, Monk wrote: >>> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), so >>> it's reasonable to use it. However, it /does not/ make sense to mark >>> idle contexts as "guilty" just because VRAM is lost. VRAM lost is a >>> perfect example where the driver should report context lost to >>> applications with the "innocent" flag for contexts that were idle at >>> the time of reset. The only context(s) that should be reported as "guilty" >>> (or perhaps "unknown" in some cases) are the ones that were executing >>> at the time of reset. >>> >>> ML: KMD mark all contexts as guilty is because that way we can unify >>> our IOCTL behavior: e.g. for IOCTL only block “guilty”context , no >>> need to worry about vram-lost-counter anymore, that’s a implementation >>> style. I don’t think it is related with UMD layer, >>> >>> For UMD the gl-context isn’t aware of by KMD, so UMD can implement it >>> own “guilty” gl-context if you want. >> Well, to some extent this is just semantics, but it helps to keep the terminology consistent. >> >> Most importantly, please keep the AMDGPU_CTX_OP_QUERY_STATE uapi in >> mind: this returns one of AMDGPU_CTX_{GUILTY,INNOCENT,UNKNOWN}_RECENT, >> and it must return "innocent" for contexts that are only lost due to VRAM lost without being otherwise involved in the timeout that lead to the reset. 
>> >> The point is that in the places where you used "guilty" it would be better to use "context lost", and then further differentiate between guilty/innocent context lost based on the details of what happened. >> >> >>> If KMD doesn’t mark all ctx as guilty after VRAM lost, can you >>> illustrate what rule KMD should obey to check in KMS IOCTL like >>> cs_submit ?? let’s see which way better >> if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter)) >> return -ECANCELED; >> >> Plus similar logic for AMDGPU_CTX_OP_QUERY_STATE. >> >> Yes, it's one additional check in cs_submit. If you're worried about that (and Christian's concerns about possible issues with walking over all contexts are addressed), I suppose you could just store a per-context >> >> unsigned context_reset_status; >> >> instead of a `bool guilty`. Its value would start out as 0 >> (AMDGPU_CTX_NO_RESET) and would be set to the correct value during reset. >> >> Cheers, >> Nicolai >> >> >>> *From:*Haehnle, Nicolai >>> *Sent:* Wednesday, October 11, 2017 4:41 PM >>> *To:* Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian >>> <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; >>> Deucher, Alexander <Alexander.Deucher@amd.com> >>> *Cc:* amd-gfx@lists.freedesktop.org; Ding, Pixel <Pixel.Ding@amd.com>; >>> Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, Bingley >>> <Bingley.Li@amd.com>; Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; >>> Filipas, Mario <Mario.Filipas@amd.com> >>> *Subject:* Re: TDR and VRAM lost handling in KMD: >>> >>> From a Mesa perspective, this almost all sounds reasonable to me. >>> >>> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), so >>> it's reasonable to use it. However, it /does not/ make sense to mark >>> idle contexts as "guilty" just because VRAM is lost. VRAM lost is a >>> perfect example where the driver should report context lost to >>> applications with the "innocent" flag for contexts that were idle at >>> the time of reset. 
The only context(s) that should be reported as "guilty" >>> (or perhaps "unknown" in some cases) are the ones that were executing >>> at the time of reset. >>> >>> On whether the whole context is marked as guilty from a user space >>> perspective, it would simply be nice for user space to get consistent >>> answers. It would be a bit odd if we could e.g. succeed in submitting >>> an SDMA job after a GFX job was rejected. This would point in favor of >>> marking the entire context as guilty (although that could happen >>> lazily instead of at reset time). On the other hand, if that's too big >>> a burden for the kernel implementation I'm sure we can live without it. >>> >>> Cheers, >>> >>> Nicolai >>> >>> ---------------------------------------------------------------------- >>> -- >>> >>> *From:*Liu, Monk >>> *Sent:* Wednesday, October 11, 2017 10:15:40 AM >>> *To:* Koenig, Christian; Haehnle, Nicolai; Olsak, Marek; Deucher, >>> Alexander >>> *Cc:* amd-gfx@lists.freedesktop.org >>> <mailto:amd-gfx@lists.freedesktop.org>; Ding, Pixel; Jiang, Jerry >>> (SW); Li, Bingley; Ramirez, Alejandro; Filipas, Mario >>> *Subject:* RE: TDR and VRAM lost handling in KMD: >>> >>> 1.Set its fence error status to “*ETIME*”, >>> >>> No, as I already explained ETIME is for synchronous operation. >>> >>> In other words when we return ETIME from the wait IOCTL it would mean >>> that the waiting has somehow timed out, but not the job we waited for. >>> >>> Please use ECANCELED as well or some other error code when we find >>> that we need to distinct the timedout job from the canceled ones >>> (probably a good idea, but I'm not sure). >>> >>> [ML] I’m okay if you insist not to use ETIME >>> >>> 1.Find the entity/ctx behind this job, and set this ctx as “*guilty*” >>> >>> Not sure. Do we want to set the whole context as guilty or just the entity? >>> >>> Setting the whole contexts as guilty sounds racy to me. 
>>> >>> BTW: We should use a different name than "guilty", maybe just "bool >>> canceled;" ? >>> >>> [ML] I think context is better than entity, because for example if you >>> only block entity_0 of context and allow entity_N to run, that means the >>> dependencies between entities are broken (e.g. page table updates in >>> >>> Sdma entity pass but gfx submit in GFX entity blocked, not make sense >>> to me) >>> >>> We’d better either block the whole context or not… >>> >>> 1.Kick out all jobs in this “guilty” ctx’s KFIFO queue, and set all >>> their fence status to “*ECANCELED*” >>> >>> Setting ECANCELED should be ok. But I think we should do this when we >>> try to run the jobs and not during GPU reset. >>> >>> [ML] without deep thought and experiment, I’m not sure the difference >>> between them, but kicking it out in the gpu_reset routine is more efficient, >>> >>> Otherwise you need to check the context/entity guilty flag in the run_job >>> routine …and you need to do it for every context/entity, I don’t see why >>> >>> We don’t just kick out all of them in the gpu_reset stage …. >>> >>> a)Iterate over all living ctx, and set all ctx as “*guilty*” since >>> VRAM lost actually ruins all VRAM contents >>> >>> No, that shouldn't be done by comparing the counters. Iterating over >>> all contexts is way too much overhead. >>> >>> [ML] because I want to make the KMS IOCTL rules clean, like they don’t >>> need to differentiate VRAM lost or not, they are only interested in whether the >>> context is guilty or not, and block >>> >>> Submit for guilty ones. >>> >>> *Can you give more details of your idea? And better, the detailed >>> implementation in cs_submit, I want to see how you want to block submit >>> without checking the context guilty flag* >>> >>> a)Kick out all jobs in all ctx’s KFIFO queue, and set all their fence >>> status to “*ECANCELED*” >>> >>> Yes and no, that should be done when we try to run the jobs and not >>> during GPU reset. 
>>> >>> [ML] again, kicking them out in the gpu reset routine is highly efficient, >>> otherwise you need to check on every job in run_job() >>> >>> Besides, can you illustrate the detailed implementation ? >>> >>> Yes and no, dma_fence_get_status() is some specific handling for >>> sync_file debugging (no idea why that made it into the common fence code). >>> >>> It was replaced by putting the error code directly into the fence, so >>> just reading that one after waiting should be ok. >>> >>> Maybe we should fix dma_fence_get_status() to do the right thing for this? >>> >>> [ML] yeah, that’s too confusing, the name really sounds like the one I want >>> to use, we should change it… >>> >>> *But looking into the implementation, I don’t see why we cannot use it? It >>> also finally returns the fence->error * >>> >>> *From:*Koenig, Christian >>> *Sent:* Wednesday, October 11, 2017 3:21 PM >>> *To:* Liu, Monk <Monk.Liu@amd.com <mailto:Monk.Liu@amd.com>>; Haehnle, >>> Nicolai <Nicolai.Haehnle@amd.com <mailto:Nicolai.Haehnle@amd.com>>; >>> Olsak, Marek <Marek.Olsak@amd.com <mailto:Marek.Olsak@amd.com>>; >>> Deucher, Alexander <Alexander.Deucher@amd.com >>> <mailto:Alexander.Deucher@amd.com>> >>> *Cc:* amd-gfx@lists.freedesktop.org >>> <mailto:amd-gfx@lists.freedesktop.org>; Ding, Pixel >>> <Pixel.Ding@amd.com <mailto:Pixel.Ding@amd.com>>; Jiang, Jerry (SW) >>> <Jerry.Jiang@amd.com <mailto:Jerry.Jiang@amd.com>>; Li, Bingley >>> <Bingley.Li@amd.com <mailto:Bingley.Li@amd.com>>; Ramirez, Alejandro >>> <Alejandro.Ramirez@amd.com <mailto:Alejandro.Ramirez@amd.com>>; >>> Filipas, Mario <Mario.Filipas@amd.com <mailto:Mario.Filipas@amd.com>> >>> *Subject:* Re: TDR and VRAM lost handling in KMD: >>> >>> See inline: >>> >>> On 11.10.2017 at 07:33, Liu, Monk wrote: >>> >>> Hi Christian & Nicolai, >>> >>> We need to achieve some agreements on what should MESA/UMD do and >>> what should KMD do, *please give your comments with “okay” or “No” >>> and your idea on below items,* >>> >>> • When a job timed out 
(set from lockup_timeout kernel parameter), >>> What KMD should do in TDR routine : >>> >>> 1.Update adev->*gpu_reset_counter*, and stop scheduler first, >>> (*gpu_reset_counter* is used to force vm flush after GPU reset, out >>> of this thread’s scope so no more discussion on it) >>> >>> Okay. >>> >>> 2.Set its fence error status to “*ETIME*”, >>> >>> No, as I already explained ETIME is for synchronous operation. >>> >>> In other words when we return ETIME from the wait IOCTL it would mean >>> that the waiting has somehow timed out, but not the job we waited for. >>> >>> Please use ECANCELED as well or some other error code when we find >>> that we need to distinguish the timed-out job from the canceled ones >>> (probably a good idea, but I'm not sure). >>> >>> 3.Find the entity/ctx behind this job, and set this ctx as “*guilty*” >>> >>> Not sure. Do we want to set the whole context as guilty or just the entity? >>> >>> Setting the whole context as guilty sounds racy to me. >>> >>> BTW: We should use a different name than "guilty", maybe just "bool >>> canceled;" ? >>> >>> 4.Kick out this job from scheduler’s mirror list, so this job won’t >>> get re-scheduled to ring anymore. >>> >>> Okay. >>> >>> 5.Kick out all jobs in this “guilty” ctx’s KFIFO queue, and set all >>> their fence status to “*ECANCELED*” >>> >>> Setting ECANCELED should be ok. But I think we should do this when we >>> try to run the jobs and not during GPU reset. >>> >>> 6.Force signal all fences that get kicked out by above two >>> steps,*otherwise UMD will block forever if waiting on those >>> fences* >>> >>> Okay. >>> >>> 7.Do gpu reset, which can be some callbacks to let bare-metal and >>> SR-IOV implement in their favored style >>> >>> Okay. 
>>> >>> 8.After reset, KMD needs to be aware if the VRAM lost happens or not, >>> bare-metal can implement some function to judge, while for SR-IOV I >>> prefer to read it from GIM side (for initial version we consider >>> it’s always VRAM lost, till GIM side change aligned) >>> >>> Okay. >>> >>> 9.If VRAM lost not hit, continue, otherwise: >>> >>> a)Update adev->*vram_lost_counter*, >>> >>> Okay. >>> >>> b)Iterate over all living ctx, and set all ctx as “*guilty*” since >>> VRAM lost actually ruins all VRAM contents >>> >>> No, that shouldn't be done by comparing the counters. Iterating over >>> all contexts is way too much overhead. >>> >>> c)Kick out all jobs in all ctx’s KFIFO queue, and set all their >>> fence status to “*ECANCELED*” >>> >>> Yes and no, that should be done when we try to run the jobs and not >>> during GPU reset. >>> >>> 10.Do GTT recovery and VRAM page tables/entries recovery (optional, >>> do we need it ???) >>> >>> Yes, that is still needed. As Nicolai explained we can't be sure that >>> VRAM is still 100% correct even when it isn't cleared. >>> >>> 11.Re-schedule all JOBs remaining in the mirror list to ring again and >>> restart scheduler (for VRAM lost case, no JOB will be re-scheduled) >>> >>> Okay. >>> >>> • For cs_wait() IOCTL: >>> >>> After it found fence signaled, it should check with >>> *“dma_fence_get_status”* to see if there is error there, >>> >>> And return the error status of the fence >>> >>> Yes and no, dma_fence_get_status() is some specific handling for >>> sync_file debugging (no idea why that made it into the common fence code). >>> >>> It was replaced by putting the error code directly into the fence, so >>> just reading that one after waiting should be ok. >>> >>> Maybe we should fix dma_fence_get_status() to do the right thing for this? 
>>> >>> • For cs_wait_fences() IOCTL: >>> >>> Similar to the above approach >>> >>> • For cs_submit() IOCTL: >>> >>> It needs to check if the current ctx has been marked as “*guilty*” and return >>> “*ECANCELED*” if so >>> >>> • Introduce a new IOCTL to let UMD query *vram_lost_counter*: >>> >>> This way, UMD can also block the app from submitting, like @Nicolai >>> mentioned, we can cache one copy of *vram_lost_counter* when >>> enumerating the physical device, and deny all >>> >>> gl-contexts from submitting if the queried counter is bigger than that >>> one cached in the physical device. (looks a little overkill to me, but >>> easy to implement ) >>> >>> UMD can also return error to APP when creating gl-context if it found >>> the currently queried *vram_lost_counter* bigger than that one cached in >>> the physical device. >>> >>> Okay. Already have a patch for this, please review that one if you >>> haven't already done so. >>> >>> Regards, >>> Christian. >>> >>> BTW: I realized that gl-context is a little different with kernel’s >>> context. Because for kernel, BO is not related with context but only >>> with FD, while in UMD, BO have a backend >>> >>> gl-context, so block submitting in UMD layer is also needed although >>> KMD will do its job as bottom line >>> >>> • Basically “vram_lost_counter” is exposed by the kernel to let UMD take >>> the control of robust extension feature, it will be UMD’s call to >>> move, KMD only deny “guilty” context from submitting >>> >>> Need your feedback, thx >>> >>> We’d better make TDR feature landed ASAP >>> >>> BR Monk >>> > _______________________________________________ > amd-gfx mailing list > amd-gfx@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/amd-gfx _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply [flat|nested] 23+ messages in thread
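Nicolai's suggestion in this message of a per-context `unsigned context_reset_status` instead of a `bool guilty` can be sketched as below. The AMDGPU_CTX_*_RESET values match the real uapi constants in amdgpu_drm.h; the mock_* structures and the query helper are hypothetical simplifications, not the kernel code.

```c
#include <stdatomic.h>

/* Values as defined in the amdgpu uapi (include/uapi/drm/amdgpu_drm.h). */
#define AMDGPU_CTX_NO_RESET		0
#define AMDGPU_CTX_GUILTY_RESET		1	/* this ctx caused the hang */
#define AMDGPU_CTX_INNOCENT_RESET	2	/* ctx lost, but not at fault */

struct mock_device {
	atomic_uint vram_lost_counter;
};

struct mock_ctx {
	unsigned int vram_lost_counter;	/* snapshot at context creation */
	unsigned int reset_status;	/* written during gpu_reset() only */
};

/* What AMDGPU_CTX_OP_QUERY_STATE would report for this context. */
static unsigned int mock_ctx_query(struct mock_ctx *ctx,
				   struct mock_device *adev)
{
	if (ctx->reset_status != AMDGPU_CTX_NO_RESET)
		return ctx->reset_status;
	/* Idle contexts hit only by VRAM loss are innocent, not guilty. */
	if (ctx->vram_lost_counter != atomic_load(&adev->vram_lost_counter))
		return AMDGPU_CTX_INNOCENT_RESET;
	return AMDGPU_CTX_NO_RESET;
}
```

This keeps the query result consistent with cs_submit rejection without walking every living context at reset time.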
[parent not found: <8c4e849f-9227-12bc-9d2e-3daf60fcd762-5C7GfCeVMHo@public.gmane.org>]
* Re: TDR and VRAM lost handling in KMD: [not found] ` <8c4e849f-9227-12bc-9d2e-3daf60fcd762-5C7GfCeVMHo@public.gmane.org> @ 2017-10-11 10:39 ` Christian König [not found] ` <0c198ba6-b853-c26a-7fb4-bcc0344fdea0-5C7GfCeVMHo@public.gmane.org> 2017-10-11 13:27 ` Liu, Monk 1 sibling, 1 reply; 23+ messages in thread From: Christian König @ 2017-10-11 10:39 UTC (permalink / raw) To: Chunming Zhou, Liu, Monk, Haehnle, Nicolai, Olsak, Marek, Deucher, Alexander Cc: Ramirez, Alejandro, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Filipas, Mario, Ding, Pixel, Li, Bingley, Jiang, Jerry (SW) I've already posted a patch for this on the mailing list. Basically we just copy the vram lost counter into the job and when we try to run the job we can mark it as canceled. Regards, Christian. On 11.10.2017 at 12:14, Chunming Zhou wrote: > Your summary lacks the below issue: > > What about a job already pushed into the scheduler queue when vram is lost? > > > Regards, > David Zhou > On 2017-10-11 17:41, Liu, Monk wrote: >> Okay, let me summarize our whole idea together and see if it works: >> >> 1, For cs_submit, always check vram_lost_counter first and reject the >> submit (return -ECANCELED to UMD) if ctx->vram_lost_counter != >> adev->vram_lost_counter. That way the vram lost issue can be handled >> >> 2, for cs_submit we still need to check if the incoming context is >> "AMDGPU_CTX_GUILTY_RESET" or not even if we found >> ctx->vram_lost_counter == adev->vram_lost_counter, and we can reject >> the submit >> If it is "AMDGPU_CTX_GUILTY_RESET", correct ? 
>> >> 3, in the gpu_reset() routine, we only mark the hang job's entity as >> guilty (so we need to add a new member to the entity structure), and do not >> kick it out in the gpu_reset() stage, but we need to set the context >> behind this entity as "AMDGPU_CTX_GUILTY_RESET" >> And if the reset introduces VRAM LOST, we just update >> adev->vram_lost_counter, but *don't* change all entities to guilty, so >> still only the hang job's entity is "guilty" >> After some entity is marked as "guilty", we find a way to set the >> context behind it as AMDGPU_CTX_GUILTY_RESET; because this is a U/K >> interface, we need to let UMD know that this context is wrong. >> >> 4, in the gpu scheduler's run_job() routine, since it only reads the entity, >> we skip job scheduling once the entity is found to be "guilty" >> >> >> Does the above sound good ? >> >> >> >> -----Original Message----- >> From: Haehnle, Nicolai >> Sent: Wednesday, October 11, 2017 5:26 PM >> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian >> <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; >> Deucher, Alexander <Alexander.Deucher@amd.com> >> Cc: amd-gfx@lists.freedesktop.org; Ding, Pixel <Pixel.Ding@amd.com>; >> Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, Bingley >> <Bingley.Li@amd.com>; Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; >> Filipas, Mario <Mario.Filipas@amd.com> >> Subject: Re: TDR and VRAM lost handling in KMD: >> >> On 11.10.2017 11:18, Liu, Monk wrote: >>> Let's keep it simple: when VRAM lost hits, what's the action for >>> amdgpu_ctx_query()/AMDGPU_CTX_OP_QUERY_STATE on other contexts (not >>> the one that triggered the gpu hang) after VRAM lost ? Do you mean we return >>> -ENODEV to UMD ? >> It should successfully return AMDGPU_CTX_INNOCENT_RESET. >> >> >>> In cs_submit, with VRAM lost hit, if we don't mark all contexts as >>> "guilty", how do we block them from submitting ?
can you show some >>> implementation approach >> if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter)) >> return -ECANCELED; >> >> (where ctx->vram_lost_counter is initialized at context creation time >> and never changed afterwards) >> >> >>> BTW: the "guilty" here is a new member I want to add to the context; it >>> is not related to the AMDGPU_CTX_OP_QUERY_STATE U/K interface. >>> Looks like I need to unify them and have only one place to mark guilty or not >> Right, the AMDGPU_CTX_OP_QUERY_STATE handling needs to be made >> consistent with the rest. >> >> Cheers, >> Nicolai >> >> >>> >>> BR Monk >>> >>> -----Original Message----- >>> From: Haehnle, Nicolai >>> Sent: Wednesday, October 11, 2017 5:00 PM >>> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian >>> <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; >>> Deucher, Alexander <Alexander.Deucher@amd.com> >>> Cc: amd-gfx@lists.freedesktop.org; Ding, Pixel <Pixel.Ding@amd.com>; >>> Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, Bingley >>> <Bingley.Li@amd.com>; Ramirez, Alejandro >>> <Alejandro.Ramirez@amd.com>; Filipas, Mario <Mario.Filipas@amd.com> >>> Subject: Re: TDR and VRAM lost handling in KMD: >>> >>> On 11.10.2017 10:48, Liu, Monk wrote: >>>> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), so >>>> it's reasonable to use it. However, it /does not/ make sense to mark >>>> idle contexts as "guilty" just because VRAM is lost. VRAM lost is a >>>> perfect example where the driver should report context lost to >>>> applications with the "innocent" flag for contexts that were idle at >>>> the time of reset. The only context(s) that should be reported as >>>> "guilty" >>>> (or perhaps "unknown" in some cases) are the ones that were executing >>>> at the time of reset. >>>> >>>> ML: KMD marks all contexts as guilty because that way we can unify >>>> our IOCTL behavior: e.g.
for the IOCTL to only block the “guilty” context, with no >>>> need to worry about vram_lost_counter anymore; that's an implementation >>>> style. I don't think it is related to the UMD layer, >>>> >>>> For UMD, the gl-context isn't known by KMD, so UMD can implement its >>>> own “guilty” gl-context if you want. >>> Well, to some extent this is just semantics, but it helps to keep >>> the terminology consistent. >>> >>> Most importantly, please keep the AMDGPU_CTX_OP_QUERY_STATE uapi in >>> mind: this returns one of AMDGPU_CTX_{GUILTY,INNOCENT,UNKNOWN}_RESET, >>> and it must return "innocent" for contexts that are only lost due to >>> VRAM lost without being otherwise involved in the timeout that led >>> to the reset. >>> >>> The point is that in the places where you used "guilty" it would be >>> better to use "context lost", and then further differentiate between >>> guilty/innocent context lost based on the details of what happened. >>> >>> >>>> If KMD doesn't mark all ctx as guilty after VRAM lost, can you >>>> illustrate what rule KMD should obey to check in KMS IOCTLs like >>>> cs_submit ?? let's see which way is better >>> if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter)) >>> return -ECANCELED; >>> >>> Plus similar logic for AMDGPU_CTX_OP_QUERY_STATE. >>> >>> Yes, it's one additional check in cs_submit. If you're worried about >>> that (and Christian's concerns about possible issues with walking >>> over all contexts are addressed), I suppose you could just store a >>> per-context >>> >>> unsigned context_reset_status; >>> >>> instead of a `bool guilty`. Its value would start out as 0 >>> (AMDGPU_CTX_NO_RESET) and would be set to the correct value during >>> reset.
>>> >>> Cheers, >>> Nicolai >>> >>> >>>> *From:*Haehnle, Nicolai >>>> *Sent:* Wednesday, October 11, 2017 4:41 PM >>>> *To:* Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian >>>> <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; >>>> Deucher, Alexander <Alexander.Deucher@amd.com> >>>> *Cc:* amd-gfx@lists.freedesktop.org; Ding, Pixel <Pixel.Ding@amd.com>; >>>> Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, Bingley >>>> <Bingley.Li@amd.com>; Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; >>>> Filipas, Mario <Mario.Filipas@amd.com> >>>> *Subject:* Re: TDR and VRAM lost handling in KMD: >>>> >>>> From a Mesa perspective, this almost all sounds reasonable to me. >>>> >>>> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), so >>>> it's reasonable to use it. However, it /does not/ make sense to mark >>>> idle contexts as "guilty" just because VRAM is lost. VRAM lost is a >>>> perfect example where the driver should report context lost to >>>> applications with the "innocent" flag for contexts that were idle at >>>> the time of reset. The only context(s) that should be reported as >>>> "guilty" >>>> (or perhaps "unknown" in some cases) are the ones that were executing >>>> at the time of reset. >>>> >>>> On whether the whole context is marked as guilty from a user space >>>> perspective, it would simply be nice for user space to get consistent >>>> answers. It would be a bit odd if we could e.g. succeed in submitting >>>> an SDMA job after a GFX job was rejected. This would point in favor of >>>> marking the entire context as guilty (although that could happen >>>> lazily instead of at reset time). On the other hand, if that's too big >>>> a burden for the kernel implementation I'm sure we can live without >>>> it. 
>>>> >>>> Cheers, >>>> >>>> Nicolai >>>> >>>> ---------------------------------------------------------------------- >>>> -- >>>> >>>> *From:* Liu, Monk >>>> *Sent:* Wednesday, October 11, 2017 10:15:40 AM >>>> *To:* Koenig, Christian; Haehnle, Nicolai; Olsak, Marek; Deucher, >>>> Alexander >>>> *Cc:* amd-gfx@lists.freedesktop.org >>>> <mailto:amd-gfx@lists.freedesktop.org>; Ding, Pixel; Jiang, Jerry >>>> (SW); Li, Bingley; Ramirez, Alejandro; Filipas, Mario >>>> *Subject:* RE: TDR and VRAM lost handling in KMD: >>>> >>>> 2. Set its fence error status to “*ETIME*”, >>>> >>>> No, as I already explained, ETIME is for synchronous operation. >>>> >>>> In other words, when we return ETIME from the wait IOCTL it would mean >>>> that the waiting has somehow timed out, but not the job we waited for. >>>> >>>> Please use ECANCELED as well, or some other error code when we find >>>> that we need to distinguish the timed-out job from the canceled ones >>>> (probably a good idea, but I'm not sure). >>>> >>>> [ML] I'm okay if you insist not to use ETIME >>>> >>>> 3. Find the entity/ctx behind this job, and set this ctx as “*guilty*” >>>> >>>> Not sure. Do we want to set the whole context as guilty or just the >>>> entity? >>>> >>>> Setting the whole context as guilty sounds racy to me. >>>> >>>> BTW: We should use a different name than "guilty", maybe just "bool >>>> canceled;" ? >>>> >>>> [ML] I think context is better than entity, because for example if you >>>> only block entity_0 of a context and allow entity_N to run, that means the >>>> dependency between entities is broken (e.g. page table updates in the >>>> >>>> SDMA entity pass but the gfx submit in the GFX entity is blocked, which makes no sense >>>> to me) >>>> >>>> We'd better block either the whole context or none… >>>> >>>> 5. Kick out all jobs in this “guilty” ctx's KFIFO queue, and set all >>>> their fence status to “*ECANCELED*” >>>> >>>> Setting ECANCELED should be ok.
>>>> But I think we should do this when we >>>> try to run the jobs and not during GPU reset. >>>> >>>> [ML] Without deep thought and experiment, I'm not sure of the difference >>>> between them, but kicking it out in the gpu_reset routine is more efficient; >>>> >>>> otherwise you need to check the context/entity guilty flag in the run_job >>>> routine, and you need to do it for every context/entity. I don't see why >>>> >>>> we don't just kick all of them out in the gpu_reset stage. >>>> >>>> a) Iterate over all living ctx, and set all ctx as “*guilty*” since >>>> VRAM lost actually ruins all VRAM contents >>>> >>>> No, that shouldn't be done by comparing the counters. Iterating over >>>> all contexts is way too much overhead. >>>> >>>> [ML] Because I want to make the KMS IOCTL rules clean: they don't >>>> need to differentiate VRAM lost or not, they are only interested in whether the >>>> context is guilty or not, and block >>>> >>>> submission for guilty ones. >>>> >>>> *Can you give more details of your idea? And better, the detailed >>>> implementation in cs_submit; I want to see how you want to block submission >>>> without checking the context guilty flag* >>>> >>>> a) Kick out all jobs in all ctx's KFIFO queues, and set all their fence >>>> status to “*ECANCELED*” >>>> >>>> Yes and no, that should be done when we try to run the jobs and not >>>> during GPU reset. >>>> >>>> [ML] Again, kicking them out in the gpu reset routine is highly efficient; >>>> otherwise you need to check every job in run_job() >>>> >>>> Besides, can you illustrate the detailed implementation ? >>>> >>>> Yes and no, dma_fence_get_status() is some specific handling for >>>> sync_file debugging (no idea why that made it into the common fence >>>> code). >>>> >>>> It was replaced by putting the error code directly into the fence, so >>>> just reading that one after waiting should be ok. >>>> >>>> Maybe we should fix dma_fence_get_status() to do the right thing >>>> for this?
>>>> >>>> [ML] Yeah, that's too confusing; the name sounds exactly like the one I want >>>> to use, we should change it… >>>> >>>> *But looking into the implementation, I don't see why we cannot use it ? It >>>> also finally returns the fence->error* >>>> >>>> *From:* Koenig, Christian >>>> *Sent:* Wednesday, October 11, 2017 3:21 PM >>>> *To:* Liu, Monk <Monk.Liu@amd.com <mailto:Monk.Liu@amd.com>>; Haehnle, >>>> Nicolai <Nicolai.Haehnle@amd.com <mailto:Nicolai.Haehnle@amd.com>>; >>>> Olsak, Marek <Marek.Olsak@amd.com <mailto:Marek.Olsak@amd.com>>; >>>> Deucher, Alexander <Alexander.Deucher@amd.com >>>> <mailto:Alexander.Deucher@amd.com>> >>>> *Cc:* amd-gfx@lists.freedesktop.org >>>> <mailto:amd-gfx@lists.freedesktop.org>; Ding, Pixel >>>> <Pixel.Ding@amd.com <mailto:Pixel.Ding@amd.com>>; Jiang, Jerry (SW) >>>> <Jerry.Jiang@amd.com <mailto:Jerry.Jiang@amd.com>>; Li, Bingley >>>> <Bingley.Li@amd.com <mailto:Bingley.Li@amd.com>>; Ramirez, Alejandro >>>> <Alejandro.Ramirez@amd.com <mailto:Alejandro.Ramirez@amd.com>>; >>>> Filipas, Mario <Mario.Filipas@amd.com <mailto:Mario.Filipas@amd.com>> >>>> *Subject:* Re: TDR and VRAM lost handling in KMD: >>>> >>>> See inline: >>>> >>>> On 11.10.2017 at 07:33, Liu, Monk wrote: >>>> >>>> Hi Christian & Nicolai, >>>> >>>> We need to reach some agreement on what MESA/UMD should do >>>> and >>>> what KMD should do, *please give your comments with “okay” or >>>> “No” >>>> and your ideas on the items below,* >>>> >>>> - When a job times out (set from the lockup_timeout kernel >>>> parameter), >>>> what KMD should do in the TDR routine : >>>> >>>> 1. Update adev->*gpu_reset_counter*, and stop the scheduler first >>>> (*gpu_reset_counter* is used to force a vm flush after GPU >>>> reset; out >>>> of this thread's scope so no more discussion on it) >>>> >>>> Okay. >>>> >>>> 2. Set its fence error status to “*ETIME*”, >>>> >>>> No, as I already explained, ETIME is for synchronous operation.
>>>> >>>> In other words, when we return ETIME from the wait IOCTL it would mean >>>> that the waiting has somehow timed out, but not the job we waited for. >>>> >>>> Please use ECANCELED as well, or some other error code when we find >>>> that we need to distinguish the timed-out job from the canceled ones >>>> (probably a good idea, but I'm not sure). >>>> >>>> 3. Find the entity/ctx behind this job, and set this ctx as >>>> “*guilty*” >>>> >>>> Not sure. Do we want to set the whole context as guilty or just the >>>> entity? >>>> >>>> Setting the whole context as guilty sounds racy to me. >>>> >>>> BTW: We should use a different name than "guilty", maybe just "bool >>>> canceled;" ? >>>> >>>> 4. Kick this job out of the scheduler's mirror list, so this job >>>> won't >>>> get re-scheduled to the ring anymore. >>>> >>>> Okay. >>>> >>>> 5. Kick out all jobs in this “guilty” ctx's KFIFO queue, and >>>> set all >>>> their fence status to “*ECANCELED*” >>>> >>>> Setting ECANCELED should be ok. But I think we should do this when we >>>> try to run the jobs and not during GPU reset. >>>> >>>> 6. Force-signal all fences that get kicked out by the above two >>>> steps, *otherwise UMD will block forever if waiting on those >>>> fences* >>>> >>>> Okay. >>>> >>>> 7. Do the gpu reset, which can be some callbacks to let >>>> bare-metal and >>>> SR-IOV implement in their favored style >>>> >>>> Okay. >>>> >>>> 8. After reset, KMD needs to be aware of whether VRAM lost happened or >>>> not; >>>> bare-metal can implement some function to judge, while for >>>> SR-IOV I >>>> prefer to read it from the GIM side (for the initial version we consider >>>> it's always VRAM lost, till the GIM side change is aligned) >>>> >>>> Okay. >>>> >>>> 9. If VRAM lost is not hit, continue; otherwise: >>>> >>>> a) Update adev->*vram_lost_counter*, >>>> >>>> Okay.
>>>> >>>> b) Iterate over all living ctx, and set all ctx as “*guilty*” >>>> since >>>> VRAM lost actually ruins all VRAM contents >>>> >>>> No, that shouldn't be done by comparing the counters. Iterating over >>>> all contexts is way too much overhead. >>>> >>>> c) Kick out all jobs in all ctx's KFIFO queues, and set all their >>>> fence status to “*ECANCELED*” >>>> >>>> Yes and no, that should be done when we try to run the jobs and not >>>> during GPU reset. >>>> >>>> 10. Do GTT recovery and VRAM page tables/entries recovery >>>> (optional, >>>> do we need it ???) >>>> >>>> Yes, that is still needed. As Nicolai explained, we can't be sure that >>>> VRAM is still 100% correct even when it isn't cleared. >>>> >>>> 11. Re-schedule all JOBs remaining in the mirror list to the ring again and >>>> restart the scheduler (for the VRAM lost case, no JOB will be re-scheduled) >>>> >>>> Okay. >>>> >>>> - For cs_wait() IOCTL: >>>> >>>> After it finds the fence signaled, it should check with >>>> *“dma_fence_get_status”* to see if there is an error there, >>>> >>>> and return the error status of the fence >>>> >>>> Yes and no, dma_fence_get_status() is some specific handling for >>>> sync_file debugging (no idea why that made it into the common fence >>>> code). >>>> >>>> It was replaced by putting the error code directly into the fence, so >>>> just reading that one after waiting should be ok. >>>> >>>> Maybe we should fix dma_fence_get_status() to do the right thing >>>> for this?
>>>> >>>> - For cs_wait_fences() IOCTL: >>>> >>>> Similar to the above approach >>>> >>>> - For cs_submit() IOCTL: >>>> >>>> It needs to check whether the current ctx has been marked as “*guilty*” and >>>> return >>>> “*ECANCELED*” if so >>>> >>>> - Introduce a new IOCTL to let UMD query *vram_lost_counter*: >>>> >>>> This way, UMD can also block the app from submitting; as @Nicolai >>>> mentioned, we can cache one copy of *vram_lost_counter* when >>>> enumerating the physical device, and deny all >>>> >>>> gl-contexts from submitting if the queried counter is bigger than >>>> the >>>> one cached in the physical device. (Looks a little overkill to >>>> me, but >>>> easy to implement.) >>>> >>>> UMD can also return an error to the APP when creating a gl-context if >>>> the >>>> currently queried *vram_lost_counter* is bigger than the one >>>> cached in the >>>> physical device. >>>> >>>> Okay. Already have a patch for this, please review that one if you >>>> haven't already done so. >>>> >>>> Regards, >>>> Christian. >>>> >>>> BTW: I realized that a gl-context is a little different from the >>>> kernel's >>>> context, because for the kernel
BO is not related to a context >>>> but only >>>> to an FD, while in UMD a BO has a backing >>>> >>>> gl-context, so blocking submission in the UMD layer is also needed, >>>> although >>>> KMD will do its job as the bottom line >>>> >>>> - Basically “vram_lost_counter” is exposed by the kernel to let >>>> UMD take >>>> control of the robustness extension feature; it will be UMD's >>>> call to >>>> make, and KMD only denies “guilty” contexts from submitting >>>> >>>> Need your feedback, thx >>>> >>>> We'd better make the TDR feature land ASAP >>>> >>>> BR Monk >>>> >> _______________________________________________ >> amd-gfx mailing list >> amd-gfx@lists.freedesktop.org >> https://lists.freedesktop.org/mailman/listinfo/amd-gfx > _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply [flat|nested] 23+ messages in thread
[parent not found: <0c198ba6-b853-c26a-7fb4-bcc0344fdea0-5C7GfCeVMHo@public.gmane.org>]
* RE: TDR and VRAM lost handling in KMD: [not found] ` <0c198ba6-b853-c26a-7fb4-bcc0344fdea0-5C7GfCeVMHo@public.gmane.org> @ 2017-10-11 13:35 ` Liu, Monk [not found] ` <BLUPR12MB04490BE33EC2E851228E25F4844A0-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org> 0 siblings, 1 reply; 23+ messages in thread From: Liu, Monk @ 2017-10-11 13:35 UTC (permalink / raw) To: Koenig, Christian, Zhou, David(ChunMing), Haehnle, Nicolai, Olsak, Marek, Deucher, Alexander Cc: Ramirez, Alejandro, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Filipas, Mario, Ding, Pixel, Li, Bingley, Jiang, Jerry (SW) I think just comparing the copy from the context/entity with the current counter is enough; I don't see how it's better to keep another copy in the JOB -----Original Message----- From: Koenig, Christian Sent: Wednesday, October 11, 2017 6:40 PM To: Zhou, David(ChunMing) <David1.Zhou@amd.com>; Liu, Monk <Monk.Liu@amd.com>; Haehnle, Nicolai <Nicolai.Haehnle@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com> Cc: Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; amd-gfx@lists.freedesktop.org; Filipas, Mario <Mario.Filipas@amd.com>; Ding, Pixel <Pixel.Ding@amd.com>; Li, Bingley <Bingley.Li@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com> Subject: Re: TDR and VRAM lost handling in KMD: I've already posted a patch for this on the mailing list. Basically we just copy the vram lost counter into the job, and when we try to run the job we can mark it as canceled. Regards, Christian. On 11.10.2017 at 12:14, Chunming Zhou wrote: > Your summary lacks the below issue: > > What about jobs already pushed into the scheduler queue when VRAM is lost? > > > Regards, > David Zhou > On October 11, 2017 at 17:41, Liu, Monk wrote: >> Okay, let me summarize our whole idea together and see if it works: >> >> 1, For cs_submit, always check vram_lost_counter first and reject the >> submit (return -ECANCELED to UMD) if ctx->vram_lost_counter != >> adev->vram_lost_counter.
That way the vram lost issue can be handled >> >> 2, for cs_submit we still need to check if the incoming context is >> "AMDGPU_CTX_GUILTY_RESET" or not even if we found >> ctx->vram_lost_counter == adev->vram_lost_counter, and we can reject >> the submit >> If it is "AMDGPU_CTX_GUILTY_RESET", correct ? >> >> 3, in gpu_reset() routine, we only mark the hang job's entity as >> guilty (so we need to add new member in entity structure), and not >> kick it out in gpu_reset() stage, but we need to set the context >> behind this entity as " AMDGPU_CTX_GUILTY_RESET" >> And if reset introduces VRAM LOST, we just update >> adev->vram_lost_counter, but *don't* change all entity to guilty, so >> still only the hang job's entity is "guilty" >> After some entity marked as "guilty", we find a way to set the >> context behind it as AMDGPU_CTX_GUILTY_RESET, because this is U/K >> interface, we need let UMD can know that this context is wrong. >> >> 4, in gpu scheduler's run_job() routine, since it only reads entity, >> so we skip job scheduling once found the entity is "guilty" >> >> >> Does above sounds good ? >> >> >> >> -----Original Message----- >> From: Haehnle, Nicolai >> Sent: Wednesday, October 11, 2017 5:26 PM >> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian >> <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; >> Deucher, Alexander <Alexander.Deucher@amd.com> >> Cc: amd-gfx@lists.freedesktop.org; Ding, Pixel <Pixel.Ding@amd.com>; >> Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, Bingley >> <Bingley.Li@amd.com>; Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; >> Filipas, Mario <Mario.Filipas@amd.com> >> Subject: Re: TDR and VRAM lost handling in KMD: >> >> On 11.10.2017 11:18, Liu, Monk wrote: >>> Let's talk it simple, When vram lost hit, what's the action for >>> amdgpu_ctx_query()/AMDGPU_CTX_OP_QUERY_STATE on other contexts (not >>> the one trigger gpu hang) after vram lost ? do you mean we return >>> -ENODEV to UMD ? 
>> It should successfully return AMDGPU_CTX_INNOCENT_RESET. >> >> >>> In cs_submit, with vram lost hit, if we don't mark all contexts as >>> "guilty", how we block its from submitting ? can you show some >>> implement way >> if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter)) >> return -ECANCELED; >> >> (where ctx->vram_lost_counter is initialized at context creation time >> and never changed afterwards) >> >> >>> BTW: the "guilty" here is a new member I want to add to context, it >>> is not related with AMDGPU_CTX_OP_QUERY_STATE UK interface, Looks I >>> need to unify them and only one place to mark guilty or not >> Right, the AMDGPU_CTX_OP_QUERY_STATE handling needs to be made >> consistent with the rest. >> >> Cheers, >> Nicolai >> >> >>> >>> BR Monk >>> >>> -----Original Message----- >>> From: Haehnle, Nicolai >>> Sent: Wednesday, October 11, 2017 5:00 PM >>> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian >>> <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; >>> Deucher, Alexander <Alexander.Deucher@amd.com> >>> Cc: amd-gfx@lists.freedesktop.org; Ding, Pixel <Pixel.Ding@amd.com>; >>> Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, Bingley >>> <Bingley.Li@amd.com>; Ramirez, Alejandro >>> <Alejandro.Ramirez@amd.com>; Filipas, Mario <Mario.Filipas@amd.com> >>> Subject: Re: TDR and VRAM lost handling in KMD: >>> >>> On 11.10.2017 10:48, Liu, Monk wrote: >>>> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), >>>> so it's reasonable to use it. However, it /does not/ make sense to >>>> mark idle contexts as "guilty" just because VRAM is lost. VRAM lost >>>> is a perfect example where the driver should report context lost to >>>> applications with the "innocent" flag for contexts that were idle >>>> at the time of reset. The only context(s) that should be reported >>>> as "guilty" >>>> (or perhaps "unknown" in some cases) are the ones that were >>>> executing at the time of reset. 
>>>> >>>> ML: KMD mark all contexts as guilty is because that way we can >>>> unify our IOCTL behavior: e.g. for IOCTL only block “guilty”context >>>> , no need to worry about vram-lost-counter anymore, that’s a >>>> implementation style. I don’t think it is related with UMD layer, >>>> >>>> For UMD the gl-context isn’t aware of by KMD, so UMD can implement >>>> it own “guilty” gl-context if you want. >>> Well, to some extent this is just semantics, but it helps to keep >>> the terminology consistent. >>> >>> Most importantly, please keep the AMDGPU_CTX_OP_QUERY_STATE uapi in >>> mind: this returns one of >>> AMDGPU_CTX_{GUILTY,INNOCENT,UNKNOWN}_RECENT, >>> and it must return "innocent" for contexts that are only lost due to >>> VRAM lost without being otherwise involved in the timeout that lead >>> to the reset. >>> >>> The point is that in the places where you used "guilty" it would be >>> better to use "context lost", and then further differentiate between >>> guilty/innocent context lost based on the details of what happened. >>> >>> >>>> If KMD doesn’t mark all ctx as guilty after VRAM lost, can you >>>> illustrate what rule KMD should obey to check in KMS IOCTL like >>>> cs_sumbit ?? let’s see which way better >>> if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter)) >>> return -ECANCELED; >>> >>> Plus similar logic for AMDGPU_CTX_OP_QUERY_STATE. >>> >>> Yes, it's one additional check in cs_submit. If you're worried about >>> that (and Christian's concerns about possible issues with walking >>> over all contexts are addressed), I suppose you could just store a >>> per-context >>> >>> unsigned context_reset_status; >>> >>> instead of a `bool guilty`. Its value would start out as 0 >>> (AMDGPU_CTX_NO_RESET) and would be set to the correct value during >>> reset. 
>>> >>> Cheers, >>> Nicolai >>> >>> >>>> *From:*Haehnle, Nicolai >>>> *Sent:* Wednesday, October 11, 2017 4:41 PM >>>> *To:* Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian >>>> <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; >>>> Deucher, Alexander <Alexander.Deucher@amd.com> >>>> *Cc:* amd-gfx@lists.freedesktop.org; Ding, Pixel >>>> <Pixel.Ding@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, >>>> Bingley <Bingley.Li@amd.com>; Ramirez, Alejandro >>>> <Alejandro.Ramirez@amd.com>; Filipas, Mario <Mario.Filipas@amd.com> >>>> *Subject:* Re: TDR and VRAM lost handling in KMD: >>>> >>>> From a Mesa perspective, this almost all sounds reasonable to me. >>>> >>>> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), >>>> so it's reasonable to use it. However, it /does not/ make sense to >>>> mark idle contexts as "guilty" just because VRAM is lost. VRAM lost >>>> is a perfect example where the driver should report context lost to >>>> applications with the "innocent" flag for contexts that were idle >>>> at the time of reset. The only context(s) that should be reported >>>> as "guilty" >>>> (or perhaps "unknown" in some cases) are the ones that were >>>> executing at the time of reset. >>>> >>>> On whether the whole context is marked as guilty from a user space >>>> perspective, it would simply be nice for user space to get >>>> consistent answers. It would be a bit odd if we could e.g. succeed >>>> in submitting an SDMA job after a GFX job was rejected. This would >>>> point in favor of marking the entire context as guilty (although >>>> that could happen lazily instead of at reset time). On the other >>>> hand, if that's too big a burden for the kernel implementation I'm >>>> sure we can live without it. 
>>>> >>>> Cheers, >>>> >>>> Nicolai >>>> >>>> ------------------------------------------------------------------- >>>> --- >>>> -- >>>> >>>> *From:*Liu, Monk >>>> *Sent:* Wednesday, October 11, 2017 10:15:40 AM >>>> *To:* Koenig, Christian; Haehnle, Nicolai; Olsak, Marek; Deucher, >>>> Alexander >>>> *Cc:* amd-gfx@lists.freedesktop.org >>>> <mailto:amd-gfx@lists.freedesktop.org>; Ding, Pixel; Jiang, Jerry >>>> (SW); Li, Bingley; Ramirez, Alejandro; Filipas, Mario >>>> *Subject:* RE: TDR and VRAM lost handling in KMD: >>>> >>>> 1.Set its fence error status to “*ETIME*”, >>>> >>>> No, as I already explained ETIME is for synchronous operation. >>>> >>>> In other words when we return ETIME from the wait IOCTL it would >>>> mean that the waiting has somehow timed out, but not the job we waited for. >>>> >>>> Please use ECANCELED as well or some other error code when we find >>>> that we need to distinct the timedout job from the canceled ones >>>> (probably a good idea, but I'm not sure). >>>> >>>> [ML] I’m okay if you insist not to use ETIME >>>> >>>> 1.Find the entity/ctx behind this job, and set this ctx as “*guilty*” >>>> >>>> Not sure. Do we want to set the whole context as guilty or just the >>>> entity? >>>> >>>> Setting the whole contexts as guilty sounds racy to me. >>>> >>>> BTW: We should use a different name than "guilty", maybe just "bool >>>> canceled;" ? >>>> >>>> [ML] I think context is better than entity, because for example if >>>> you only block entity_0 of context and allow entity_N run, that >>>> means the dependency between entities are broken (e.g. page table >>>> updates in >>>> >>>> Sdma entity pass but gfx submit in GFX entity blocked, not make >>>> sense to me) >>>> >>>> We’d better either block the whole context or let not… >>>> >>>> 1.Kick out all jobs in this “guilty” ctx’s KFIFO queue, and set all >>>> their fence status to “*ECANCELED*” >>>> >>>> Setting ECANCELED should be ok. 
>>>> But I think we should do this when >>>> we try to run the jobs and not during GPU reset. >>>> >>>> [ML] Without deep thought and experiment, I'm not sure of the >>>> difference between them, but kicking it out in the gpu_reset routine is >>>> more efficient; >>>> >>>> otherwise you need to check the context/entity guilty flag in the run_job >>>> routine, and you need to do it for every context/entity. I don't see >>>> why >>>> >>>> we don't just kick all of them out in the gpu_reset stage. >>>> >>>> a) Iterate over all living ctx, and set all ctx as “*guilty*” since >>>> VRAM lost actually ruins all VRAM contents >>>> >>>> No, that shouldn't be done by comparing the counters. Iterating >>>> over all contexts is way too much overhead. >>>> >>>> [ML] Because I want to make the KMS IOCTL rules clean: they don't >>>> need to differentiate VRAM lost or not, they are only interested in whether >>>> the context is guilty or not, and block >>>> >>>> submission for guilty ones. >>>> >>>> *Can you give more details of your idea? And better, the detailed >>>> implementation in cs_submit; I want to see how you want to block submission >>>> without checking the context guilty flag* >>>> >>>> a) Kick out all jobs in all ctx's KFIFO queues, and set all their >>>> fence status to “*ECANCELED*” >>>> >>>> Yes and no, that should be done when we try to run the jobs and not >>>> during GPU reset. >>>> >>>> [ML] Again, kicking them out in the gpu reset routine is highly >>>> efficient; otherwise you need to check every job in run_job() >>>> >>>> Besides, can you illustrate the detailed implementation ? >>>> >>>> Yes and no, dma_fence_get_status() is some specific handling for >>>> sync_file debugging (no idea why that made it into the common fence >>>> code). >>>> >>>> It was replaced by putting the error code directly into the fence, >>>> so just reading that one after waiting should be ok. >>>> >>>> Maybe we should fix dma_fence_get_status() to do the right thing >>>> for this?
>>>> >>>> [ML] yeah, that’s too confusing, the name sound really the one I >>>> want to use, we should change it… >>>> >>>> *But look into the implement, I don**’t see why we cannot use it ? >>>> it also finally return the fence->error * >>>> >>>> *From:*Koenig, Christian >>>> *Sent:* Wednesday, October 11, 2017 3:21 PM >>>> *To:* Liu, Monk <Monk.Liu@amd.com <mailto:Monk.Liu@amd.com>>; >>>> Haehnle, Nicolai <Nicolai.Haehnle@amd.com >>>> <mailto:Nicolai.Haehnle@amd.com>>; >>>> Olsak, Marek <Marek.Olsak@amd.com <mailto:Marek.Olsak@amd.com>>; >>>> Deucher, Alexander <Alexander.Deucher@amd.com >>>> <mailto:Alexander.Deucher@amd.com>> >>>> *Cc:* amd-gfx@lists.freedesktop.org >>>> <mailto:amd-gfx@lists.freedesktop.org>; Ding, Pixel >>>> <Pixel.Ding@amd.com <mailto:Pixel.Ding@amd.com>>; Jiang, Jerry (SW) >>>> <Jerry.Jiang@amd.com <mailto:Jerry.Jiang@amd.com>>; Li, Bingley >>>> <Bingley.Li@amd.com <mailto:Bingley.Li@amd.com>>; Ramirez, >>>> Alejandro <Alejandro.Ramirez@amd.com >>>> <mailto:Alejandro.Ramirez@amd.com>>; >>>> Filipas, Mario <Mario.Filipas@amd.com >>>> <mailto:Mario.Filipas@amd.com>> >>>> *Subject:* Re: TDR and VRAM lost handling in KMD: >>>> >>>> See inline: >>>> >>>> Am 11.10.2017 um 07:33 schrieb Liu, Monk: >>>> >>>> Hi Christian & Nicolai, >>>> >>>> We need to achieve some agreements on what should MESA/UMD do >>>> and >>>> what should KMD do, *please give your comments with “okay” or >>>> “No” >>>> and your idea on below items,* >>>> >>>> ?When a job timed out (set from lockup_timeout kernel >>>> parameter), >>>> What KMD should do in TDR routine : >>>> >>>> 1.Update adev->*gpu_reset_counter*, and stop scheduler first, >>>> (*gpu_reset_counter* is used to force vm flush after GPU >>>> reset, out >>>> of this thread’s scope so no more discussion on it) >>>> >>>> Okay. >>>> >>>> 2.Set its fence error status to “*ETIME*”, >>>> >>>> No, as I already explained ETIME is for synchronous operation. 
>>>> >>>> In other words when we return ETIME from the wait IOCTL it would >>>> mean that the waiting has somehow timed out, but not the job we waited for. >>>> >>>> Please use ECANCELED as well, or some other error code when we find >>>> that we need to distinguish the timed-out job from the canceled ones >>>> (probably a good idea, but I'm not sure). >>>> >>>> 3. Find the entity/ctx behind this job, and set this ctx as >>>> “*guilty*” >>>> >>>> Not sure. Do we want to set the whole context as guilty or just the >>>> entity? >>>> >>>> Setting whole contexts as guilty sounds racy to me. >>>> >>>> BTW: We should use a different name than "guilty", maybe just "bool >>>> canceled;" ? >>>> >>>> 4. Kick out this job from the scheduler’s mirror list, so this job >>>> won’t >>>> get re-scheduled to the ring anymore. >>>> >>>> Okay. >>>> >>>> 5. Kick out all jobs in this “guilty” ctx’s KFIFO queue, and >>>> set all >>>> their fence status to “*ECANCELED*” >>>> >>>> Setting ECANCELED should be ok. But I think we should do this when >>>> we try to run the jobs and not during GPU reset. >>>> >>>> 6. Force signal all fences that get kicked out by the above two >>>> steps, *otherwise UMD will block forever if waiting on those >>>> fences* >>>> >>>> Okay. >>>> >>>> 7. Do the gpu reset, which can be some callbacks to let >>>> bare-metal and >>>> SR-IOV implement it in their favored style >>>> >>>> Okay. >>>> >>>> 8. After reset, KMD needs to be aware whether VRAM lost happened or >>>> not, >>>> bare-metal can implement some function to judge, while for >>>> SR-IOV I >>>> prefer to read it from the GIM side (for the initial version we consider >>>> it’s always VRAM lost, until the GIM-side change is aligned) >>>> >>>> Okay. >>>> >>>> 9. If VRAM lost is not hit, continue; otherwise: >>>> >>>> a) Update adev->*vram_lost_counter*, >>>> >>>> Okay. 
>>>> >>>> b) Iterate over all living ctx, and set all ctx as “*guilty*” >>>> since >>>> VRAM lost actually ruins all VRAM contents >>>> >>>> No, that shouldn't be done by comparing the counters. Iterating >>>> over all contexts is way too much overhead. >>>> >>>> c) Kick out all jobs in all ctx’s KFIFO queues, and set all their >>>> fence status to “*ECANCELED*” >>>> >>>> Yes and no, that should be done when we try to run the jobs and not >>>> during GPU reset. >>>> >>>> 10. Do GTT recovery and VRAM page tables/entries recovery >>>> (optional, >>>> do we need it ???) >>>> >>>> Yes, that is still needed. As Nicolai explained we can't be sure >>>> that VRAM is still 100% correct even when it isn't cleared. >>>> >>>> 11. Re-schedule all JOBs remaining in the mirror list to the ring again and >>>> restart the scheduler (for the VRAM lost case, no JOB will be >>>> re-scheduled) >>>> >>>> Okay. >>>> >>>> • For cs_wait() IOCTL: >>>> >>>> After it finds the fence signaled, it should check with >>>> *“dma_fence_get_status”* to see if there is an error there, >>>> >>>> and return the error status of the fence >>>> >>>> Yes and no, dma_fence_get_status() is some specific handling for >>>> sync_file debugging (no idea why that made it into the common fence >>>> code). >>>> >>>> It was replaced by putting the error code directly into the fence, >>>> so just reading that one after waiting should be ok. >>>> >>>> Maybe we should fix dma_fence_get_status() to do the right thing >>>> for this? 
>>>> >>>> • For cs_wait_fences() IOCTL: >>>> >>>> Similar to the above approach >>>> >>>> • For cs_submit() IOCTL: >>>> >>>> It needs to check whether the current ctx has been marked as “*guilty*” and >>>> return >>>> “*ECANCELED*” if so >>>> >>>> • Introduce a new IOCTL to let UMD query *vram_lost_counter*: >>>> >>>> This way, UMD can also block the app from submitting; like @Nicolai >>>> mentioned, we can cache one copy of *vram_lost_counter* when >>>> enumerating the physical device, and deny all >>>> >>>> gl-contexts from submitting if the queried counter is bigger than >>>> the >>>> one cached in the physical device. (looks a little like overkill to >>>> me, but >>>> easy to implement) >>>> >>>> UMD can also return an error to the APP when creating a gl-context if it >>>> finds >>>> the currently queried *vram_lost_counter* bigger than the one >>>> cached in the >>>> physical device. >>>> >>>> Okay. Already have a patch for this, please review that one if you >>>> haven't already done so. >>>> >>>> Regards, >>>> Christian. >>>> >>>> BTW: I realized that a gl-context is a little different from the >>>> kernel’s >>>> context. Because for the kernel, 
a BO is not related to a context >>>> but only >>>> to an FD, while in UMD, a BO has a backing >>>> >>>> gl-context, so blocking submission in the UMD layer is also needed, >>>> although >>>> KMD will do its job as the bottom line >>>> >>>> • Basically “vram_lost_counter” is exposed by the kernel to let >>>> UMD take >>>> control of the robustness extension feature; it will be UMD’s >>>> call to >>>> make, and KMD only denies “guilty” contexts from submitting >>>> >>>> Need your feedback, thx >>>> >>>> We’d better make the TDR feature land ASAP >>>> >>>> BR Monk >>>> >> _______________________________________________ >> amd-gfx mailing list >> amd-gfx@lists.freedesktop.org >> https://lists.freedesktop.org/mailman/listinfo/amd-gfx > _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply [flat|nested] 23+ messages in thread
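The cs_wait() behavior discussed above — force-signal kicked-out fences with an error, then have the wait IOCTL return the error stored in the fence — can be sketched in plain userspace C. This is only an illustration with made-up types: `struct fence` below is a stand-in for the kernel's dma_fence, not the real structure.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's dma_fence. */
struct fence {
    bool signaled;
    int error;            /* 0 = success, negative errno otherwise */
};

/* Steps 5/6 of the TDR routine: kick the job out by storing an error
 * and force-signaling, so waiters wake up instead of blocking forever. */
static void fence_cancel(struct fence *f)
{
    f->error = -ECANCELED;
    f->signaled = true;
}

/* cs_wait-style check: once the fence has signaled, propagate any
 * stored error back to UMD instead of plain success. */
static int fence_wait_status(const struct fence *f)
{
    if (!f->signaled)
        return -EBUSY;    /* placeholder: real code would block/timeout */
    return f->error;      /* 0 on success, -ECANCELED for kicked-out jobs */
}
```

This matches Christian's point that the error is stored directly in the fence, so reading it after the wait completes is enough.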
* Re: TDR and VRAM lost handling in KMD: [not found] ` <BLUPR12MB04490BE33EC2E851228E25F4844A0-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org> @ 2017-10-11 13:39 ` Christian König [not found] ` <d9274bc6-27e8-f6c4-0851-4240bde72452-5C7GfCeVMHo@public.gmane.org> 0 siblings, 1 reply; 23+ messages in thread From: Christian König @ 2017-10-11 13:39 UTC (permalink / raw) To: Liu, Monk, Zhou, David(ChunMing), Haehnle, Nicolai, Olsak, Marek, Deucher, Alexander Cc: Ramirez, Alejandro, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Filipas, Mario, Ding, Pixel, Li, Bingley, Jiang, Jerry (SW) Some jobs don't have a context (VM updates, clears, buffer moves). I would still like to abort those when they were issued before losing VRAM content, but keep the entity usable. So I think we should just keep a copy of the VRAM lost counter in the job. That also removes the burden of figuring out the context during job run. Regards, Christian. On 11.10.2017 at 15:35, Liu, Monk wrote: > I think just comparing the copy from the context/entity with the current counter is enough; I don't see how it's better to keep another copy in the JOB > > > -----Original Message----- > From: Koenig, Christian > Sent: Wednesday, October 11, 2017 6:40 PM > To: Zhou, David(ChunMing) <David1.Zhou@amd.com>; Liu, Monk <Monk.Liu@amd.com>; Haehnle, Nicolai <Nicolai.Haehnle@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com> > Cc: Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; amd-gfx@lists.freedesktop.org; Filipas, Mario <Mario.Filipas@amd.com>; Ding, Pixel <Pixel.Ding@amd.com>; Li, Bingley <Bingley.Li@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com> > Subject: Re: TDR and VRAM lost handling in KMD: > > I've already posted a patch for this on the mailing list. > > Basically we just copy the vram lost counter into the job and when we try to run the job we can mark it as canceled. > > Regards, > Christian. 
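Christian's approach above — snapshot the VRAM lost counter into each job at submission time, then cancel stale jobs when the scheduler tries to run them — can be sketched roughly as follows. All names are hypothetical simplifications, not the real amdgpu structures, and C11 atomics stand in for the kernel's atomic_t:

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical, simplified types -- not the real amdgpu structures. */
struct device_state {
    atomic_uint vram_lost_counter;   /* bumped on every VRAM-losing reset */
};

struct job {
    unsigned int vram_lost_counter;  /* snapshot taken at submission time */
    int fence_error;                 /* 0 = ok, negative errno otherwise */
};

/* At submission, snapshot the device-wide counter into the job. */
static void job_init(struct job *job, struct device_state *dev)
{
    job->vram_lost_counter = atomic_load(&dev->vram_lost_counter);
    job->fence_error = 0;
}

/* At run time, a stale snapshot means VRAM was lost after this job was
 * queued: mark it canceled instead of running it. */
static bool job_try_run(struct job *job, struct device_state *dev)
{
    if (job->vram_lost_counter != atomic_load(&dev->vram_lost_counter)) {
        job->fence_error = -ECANCELED;
        return false;                /* don't schedule it to the ring */
    }
    return true;
}
```

Note that this covers context-less jobs (VM updates, clears, buffer moves) too, since the check only needs the job itself, which is exactly the point Christian makes.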
> > On 11.10.2017 at 12:14, Chunming Zhou wrote: >> Your summary lacks the below issue: >> >> How about jobs already pushed into the scheduler queue when VRAM is lost? >> >> >> Regards, >> David Zhou >> On October 11, 2017 at 17:41, Liu, Monk wrote: >>> Okay, let me summarize our whole idea together and see if it works: >>> >>> 1, For cs_submit, always check vram_lost_counter first and reject the >>> submit (return -ECANCELED to UMD) if ctx->vram_lost_counter != >>> adev->vram_lost_counter. That way the vram lost issue can be handled >>> >>> 2, for cs_submit we still need to check whether the incoming context is >>> "AMDGPU_CTX_GUILTY_RESET" or not even if we found >>> ctx->vram_lost_counter == adev->vram_lost_counter, and we can reject >>> the submit >>> if it is "AMDGPU_CTX_GUILTY_RESET", correct? >>> >>> 3, in the gpu_reset() routine, we only mark the hang job's entity as >>> guilty (so we need to add a new member to the entity structure), and do not >>> kick it out at the gpu_reset() stage, but we need to set the context >>> behind this entity as "AMDGPU_CTX_GUILTY_RESET" >>> And if the reset introduces VRAM LOST, we just update >>> adev->vram_lost_counter, but *don't* change all entities to guilty, so >>> still only the hang job's entity is "guilty" >>> After some entity is marked as "guilty", we find a way to set the >>> context behind it as AMDGPU_CTX_GUILTY_RESET; because this is the U/K >>> interface, we need to let UMD know that this context is wrong. >>> >>> 4, in the gpu scheduler's run_job() routine, since it only reads the entity, >>> we skip job scheduling once the entity is found "guilty" >>> >>> >>> Does the above sound good? 
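Monk's four-point summary above can be condensed into a small sketch. The types and names here are hypothetical stand-ins, not the real amdgpu structures: only the hung job's entity is marked guilty, the context behind it gets a guilty reset state, cs_submit rejects stale or guilty contexts, and run_job skips guilty entities.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins mirroring the summary; not real uapi. */
enum ctx_reset_state { CTX_NO_RESET, CTX_GUILTY_RESET, CTX_INNOCENT_RESET };

struct device_state { atomic_uint vram_lost_counter; };

struct context {
    unsigned int vram_lost_counter;  /* snapshot at context creation */
    enum ctx_reset_state reset_state;
};

struct entity {
    struct context *ctx;
    bool guilty;                     /* set for the hung job's entity only */
};

/* Point 3: in gpu_reset(), mark only the hung job's entity and the
 * context behind it -- other entities/contexts stay untouched. */
static void mark_hang(struct entity *e)
{
    e->guilty = true;
    e->ctx->reset_state = CTX_GUILTY_RESET;
}

/* Points 1+2: cs_submit rejects stale-VRAM and guilty contexts. */
static int cs_submit_check(struct context *ctx, struct device_state *dev)
{
    if (ctx->vram_lost_counter != atomic_load(&dev->vram_lost_counter))
        return -ECANCELED;           /* VRAM lost since context creation */
    if (ctx->reset_state == CTX_GUILTY_RESET)
        return -ECANCELED;           /* this context caused a hang */
    return 0;
}

/* Point 4: the scheduler's run_job() skips jobs from guilty entities. */
static bool run_job_allowed(const struct entity *e)
{
    return !e->guilty;
}
```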
>>> >>> >>> >>> -----Original Message----- >>> From: Haehnle, Nicolai >>> Sent: Wednesday, October 11, 2017 5:26 PM >>> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian >>> <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; >>> Deucher, Alexander <Alexander.Deucher@amd.com> >>> Cc: amd-gfx@lists.freedesktop.org; Ding, Pixel <Pixel.Ding@amd.com>; >>> Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, Bingley >>> <Bingley.Li@amd.com>; Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; >>> Filipas, Mario <Mario.Filipas@amd.com> >>> Subject: Re: TDR and VRAM lost handling in KMD: >>> >>> On 11.10.2017 11:18, Liu, Monk wrote: >>>> Let's keep it simple: when VRAM lost hits, what's the action for >>>> amdgpu_ctx_query()/AMDGPU_CTX_OP_QUERY_STATE on other contexts (not >>>> the one that triggered the gpu hang) after VRAM lost? Do you mean we return >>>> -ENODEV to UMD? >>> It should successfully return AMDGPU_CTX_INNOCENT_RESET. >>> >>> >>>> In cs_submit, with VRAM lost hit, if we don't mark all contexts as >>>> "guilty", how do we block them from submitting? Can you show some >>>> way to implement it >>> if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter)) >>> return -ECANCELED; >>> >>> (where ctx->vram_lost_counter is initialized at context creation time >>> and never changed afterwards) >>> >>> >>>> BTW: the "guilty" here is a new member I want to add to the context; it >>>> is not related to the AMDGPU_CTX_OP_QUERY_STATE U/K interface. Looks like I >>>> need to unify them and have only one place to mark guilty or not >>> Right, the AMDGPU_CTX_OP_QUERY_STATE handling needs to be made >>> consistent with the rest. 
>>> >>> Cheers, >>> Nicolai >>> >>> >>>> BR Monk >>>> >>>> -----Original Message----- >>>> From: Haehnle, Nicolai >>>> Sent: Wednesday, October 11, 2017 5:00 PM >>>> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian >>>> <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; >>>> Deucher, Alexander <Alexander.Deucher@amd.com> >>>> Cc: amd-gfx@lists.freedesktop.org; Ding, Pixel <Pixel.Ding@amd.com>; >>>> Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, Bingley >>>> <Bingley.Li@amd.com>; Ramirez, Alejandro >>>> <Alejandro.Ramirez@amd.com>; Filipas, Mario <Mario.Filipas@amd.com> >>>> Subject: Re: TDR and VRAM lost handling in KMD: >>>> >>>> On 11.10.2017 10:48, Liu, Monk wrote: >>>>> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), >>>>> so it's reasonable to use it. However, it /does not/ make sense to >>>>> mark idle contexts as "guilty" just because VRAM is lost. VRAM lost >>>>> is a perfect example where the driver should report context lost to >>>>> applications with the "innocent" flag for contexts that were idle >>>>> at the time of reset. The only context(s) that should be reported >>>>> as "guilty" >>>>> (or perhaps "unknown" in some cases) are the ones that were >>>>> executing at the time of reset. >>>>> >>>>> ML: KMD marks all contexts as guilty because that way we can >>>>> unify our IOCTL behavior: e.g. the IOCTLs only block the “guilty” context, >>>>> no need to worry about the vram-lost-counter anymore; that’s an >>>>> implementation style. I don’t think it is related to the UMD layer, >>>>> >>>>> For UMD, the gl-context isn’t known to KMD, so UMD can implement >>>>> its own “guilty” gl-context if you want. >>>> Well, to some extent this is just semantics, but it helps to keep >>>> the terminology consistent. 
>>>> >>>> Most importantly, please keep the AMDGPU_CTX_OP_QUERY_STATE uapi in >>>> mind: this returns one of >>>> AMDGPU_CTX_{GUILTY,INNOCENT,UNKNOWN}_RESET, >>>> and it must return "innocent" for contexts that are only lost due to >>>> VRAM lost without being otherwise involved in the timeout that led >>>> to the reset. >>>> >>>> The point is that in the places where you used "guilty" it would be >>>> better to use "context lost", and then further differentiate between >>>> guilty/innocent context lost based on the details of what happened. >>>> >>>> >>>>> If KMD doesn’t mark all ctx as guilty after VRAM lost, can you >>>>> illustrate what rule KMD should obey to check in a KMS IOCTL like >>>>> cs_submit?? Let’s see which way is better >>>> if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter)) >>>> return -ECANCELED; >>>> >>>> Plus similar logic for AMDGPU_CTX_OP_QUERY_STATE. >>>> >>>> Yes, it's one additional check in cs_submit. If you're worried about >>>> that (and Christian's concerns about possible issues with walking >>>> over all contexts are addressed), I suppose you could just store a >>>> per-context >>>> >>>> unsigned context_reset_status; >>>> >>>> instead of a `bool guilty`. Its value would start out as 0 >>>> (AMDGPU_CTX_NO_RESET) and would be set to the correct value during >>>> reset. 
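Nicolai's `context_reset_status` suggestion, combined with the guilty-vs-innocent distinction above, might look roughly like this. The enum values are hypothetical illustrations that mirror, but do not reproduce, the AMDGPU_CTX_*_RESET uapi constants:

```c
#include <assert.h>

/* Hypothetical values mirroring the AMDGPU_CTX_*_RESET uapi constants. */
enum reset_status { NO_RESET = 0, GUILTY_RESET, INNOCENT_RESET, UNKNOWN_RESET };

struct ctx_state {
    unsigned int vram_lost_counter;  /* snapshot at context creation */
    enum reset_status reset_status;  /* starts at NO_RESET, set during reset */
};

/* Query-state sketch: a context that caused the hang reports its stored
 * status; a context that only lost VRAM while idle reports "innocent". */
static enum reset_status ctx_query(const struct ctx_state *ctx,
                                   unsigned int dev_vram_lost_counter)
{
    if (ctx->reset_status != NO_RESET)
        return ctx->reset_status;    /* e.g. the hung context: guilty */
    if (ctx->vram_lost_counter != dev_vram_lost_counter)
        return INNOCENT_RESET;       /* lost VRAM, but idle at hang time */
    return NO_RESET;
}
```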
>>>> >>>> Cheers, >>>> Nicolai >>>> >>>> >>>>> *From:*Haehnle, Nicolai >>>>> *Sent:* Wednesday, October 11, 2017 4:41 PM >>>>> *To:* Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian >>>>> <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; >>>>> Deucher, Alexander <Alexander.Deucher@amd.com> >>>>> *Cc:* amd-gfx@lists.freedesktop.org; Ding, Pixel >>>>> <Pixel.Ding@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, >>>>> Bingley <Bingley.Li@amd.com>; Ramirez, Alejandro >>>>> <Alejandro.Ramirez@amd.com>; Filipas, Mario <Mario.Filipas@amd.com> >>>>> *Subject:* Re: TDR and VRAM lost handling in KMD: >>>>> >>>>> From a Mesa perspective, this almost all sounds reasonable to me. >>>>> >>>>> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), >>>>> so it's reasonable to use it. However, it /does not/ make sense to >>>>> mark idle contexts as "guilty" just because VRAM is lost. VRAM lost >>>>> is a perfect example where the driver should report context lost to >>>>> applications with the "innocent" flag for contexts that were idle >>>>> at the time of reset. The only context(s) that should be reported >>>>> as "guilty" >>>>> (or perhaps "unknown" in some cases) are the ones that were >>>>> executing at the time of reset. >>>>> >>>>> On whether the whole context is marked as guilty from a user space >>>>> perspective, it would simply be nice for user space to get >>>>> consistent answers. It would be a bit odd if we could e.g. succeed >>>>> in submitting an SDMA job after a GFX job was rejected. This would >>>>> point in favor of marking the entire context as guilty (although >>>>> that could happen lazily instead of at reset time). On the other >>>>> hand, if that's too big a burden for the kernel implementation I'm >>>>> sure we can live without it. 
>>>>> >>>>> Cheers, >>>>> >>>>> Nicolai >>>>> >>>>> ---------------------------------------------------------------------- >>>>> >>>>> *From:* Liu, Monk >>>>> *Sent:* Wednesday, October 11, 2017 10:15:40 AM >>>>> *To:* Koenig, Christian; Haehnle, Nicolai; Olsak, Marek; Deucher, >>>>> Alexander >>>>> *Cc:* amd-gfx@lists.freedesktop.org >>>>> <mailto:amd-gfx@lists.freedesktop.org>; Ding, Pixel; Jiang, Jerry >>>>> (SW); Li, Bingley; Ramirez, Alejandro; Filipas, Mario >>>>> *Subject:* RE: TDR and VRAM lost handling in KMD: >>>>> >>>>> 1. Set its fence error status to “*ETIME*”, >>>>> >>>>> No, as I already explained ETIME is for synchronous operation. >>>>> >>>>> In other words when we return ETIME from the wait IOCTL it would >>>>> mean that the waiting has somehow timed out, but not the job we waited for. >>>>> >>>>> Please use ECANCELED as well, or some other error code when we find >>>>> that we need to distinguish the timed-out job from the canceled ones >>>>> (probably a good idea, but I'm not sure). >>>>> >>>>> [ML] I’m okay if you insist not to use ETIME >>>>> >>>>> 2. Find the entity/ctx behind this job, and set this ctx as “*guilty*” >>>>> >>>>> Not sure. Do we want to set the whole context as guilty or just the >>>>> entity? >>>>> >>>>> Setting whole contexts as guilty sounds racy to me. >>>>> >>>>> BTW: We should use a different name than "guilty", maybe just "bool >>>>> canceled;" ? >>>>> >>>>> [ML] I think the context is better than the entity, because for example if >>>>> you only block entity_0 of a context and allow entity_N to run, that >>>>> means the dependencies between entities are broken (e.g. page table >>>>> updates in the >>>>> >>>>> SDMA entity pass but the gfx submit in the GFX entity is blocked, which doesn't make >>>>> sense to me) >>>>> >>>>> We’d better either block the whole context or not at all… >>>>> >>>>> 3. Kick out all jobs in this “guilty” ctx’s KFIFO queue, and set all >>>>> their fence status to “*ECANCELED*” >>>>> >>>>> Setting ECANCELED should be ok. 
But I think we should do this when >>>>> we try to run the jobs and not during GPU reset. >>>>> >>>>> [ML] without deep thought and experiment, I’m not sure of the >>>>> difference between them, but kicking it out in the gpu_reset routine is >>>>> more efficient, >>>>> >>>>> Otherwise you need to check the context/entity guilty flag in the run_job >>>>> routine …and you need to do it for every context/entity, I don’t see >>>>> why >>>>> >>>>> We don’t just kick out all of them in the gpu_reset stage …. >>>>> >>>>> a) Iterate over all living ctx, and set all ctx as “*guilty*” since >>>>> VRAM lost actually ruins all VRAM contents >>>>> >>>>> No, that shouldn't be done by comparing the counters. Iterating >>>>> over all contexts is way too much overhead. >>>>> >>>>> [ML] because I want to make the KMS IOCTL rules clean, like they don’t >>>>> need to differentiate VRAM lost or not; they are only interested in whether >>>>> the context is guilty or not, and block >>>>> >>>>> submission for guilty ones. >>>>> >>>>> *Can you give more details of your idea? And better, the detailed >>>>> implementation in cs_submit; I want to see how you want to block submission >>>>> without checking the context guilty flag* >>>>> >>>>> a) Kick out all jobs in all ctx’s KFIFO queues, and set all their >>>>> fence status to “*ECANCELED*” >>>>> >>>>> Yes and no, that should be done when we try to run the jobs and not >>>>> during GPU reset. >>>>> >>>>> [ML] again, kicking them out in the gpu reset routine is highly >>>>> efficient; otherwise you need to check every job in run_job() >>>>> >>>>> Besides, can you illustrate the detailed implementation? >>>>> >>>>> Yes and no, dma_fence_get_status() is some specific handling for >>>>> sync_file debugging (no idea why that made it into the common fence >>>>> code). >>>>> >>>>> It was replaced by putting the error code directly into the fence, >>>>> so just reading that one after waiting should be ok. >>>>> >>>>> Maybe we should fix dma_fence_get_status() to do the right thing >>>>> for this? 
>>>>> >>>>> [ML] yeah, that’s too confusing, the name really sounds like the one I >>>>> want to use, we should change it… >>>>> >>>>> *But looking into the implementation, I don’t see why we cannot use it? >>>>> It also finally returns the fence->error* >>>>> >>>>> *From:* Koenig, Christian >>>>> *Sent:* Wednesday, October 11, 2017 3:21 PM >>>>> *To:* Liu, Monk <Monk.Liu@amd.com <mailto:Monk.Liu@amd.com>>; >>>>> Haehnle, Nicolai <Nicolai.Haehnle@amd.com >>>>> <mailto:Nicolai.Haehnle@amd.com>>; >>>>> Olsak, Marek <Marek.Olsak@amd.com <mailto:Marek.Olsak@amd.com>>; >>>>> Deucher, Alexander <Alexander.Deucher@amd.com >>>>> <mailto:Alexander.Deucher@amd.com>> >>>>> *Cc:* amd-gfx@lists.freedesktop.org >>>>> <mailto:amd-gfx@lists.freedesktop.org>; Ding, Pixel >>>>> <Pixel.Ding@amd.com <mailto:Pixel.Ding@amd.com>>; Jiang, Jerry (SW) >>>>> <Jerry.Jiang@amd.com <mailto:Jerry.Jiang@amd.com>>; Li, Bingley >>>>> <Bingley.Li@amd.com <mailto:Bingley.Li@amd.com>>; Ramirez, >>>>> Alejandro <Alejandro.Ramirez@amd.com >>>>> <mailto:Alejandro.Ramirez@amd.com>>; >>>>> Filipas, Mario <Mario.Filipas@amd.com >>>>> <mailto:Mario.Filipas@amd.com>> >>>>> *Subject:* Re: TDR and VRAM lost handling in KMD: >>>>> >>>>> See inline: >>>>> >>>>> On 11.10.2017 at 07:33, Liu, Monk wrote: >>>>> >>>>> Hi Christian & Nicolai, >>>>> >>>>> We need to reach agreement on what MESA/UMD should do >>>>> and >>>>> what KMD should do, *please give your comments with “okay” or >>>>> “No” >>>>> and your ideas on the items below,* >>>>> >>>>> • When a job times out (set from the lockup_timeout kernel >>>>> parameter), >>>>> what should KMD do in the TDR routine: >>>>> >>>>> 1. Update adev->*gpu_reset_counter*, and stop the scheduler first, >>>>> (*gpu_reset_counter* is used to force a vm flush after GPU >>>>> reset, out >>>>> of this thread’s scope so no more discussion on it) >>>>> >>>>> Okay. >>>>> >>>>> 2. Set its fence error status to “*ETIME*”, >>>>> >>>>> No, as I already explained ETIME is for synchronous operation. 
>>>>> >>>>> In other words when we return ETIME from the wait IOCTL it would >>>>> mean that the waiting has somehow timed out, but not the job we waited for. >>>>> >>>>> Please use ECANCELED as well, or some other error code when we find >>>>> that we need to distinguish the timed-out job from the canceled ones >>>>> (probably a good idea, but I'm not sure). >>>>> >>>>> 3. Find the entity/ctx behind this job, and set this ctx as >>>>> “*guilty*” >>>>> >>>>> Not sure. Do we want to set the whole context as guilty or just the >>>>> entity? >>>>> >>>>> Setting whole contexts as guilty sounds racy to me. >>>>> >>>>> BTW: We should use a different name than "guilty", maybe just "bool >>>>> canceled;" ? >>>>> >>>>> 4. Kick out this job from the scheduler’s mirror list, so this job >>>>> won’t >>>>> get re-scheduled to the ring anymore. >>>>> >>>>> Okay. >>>>> >>>>> 5. Kick out all jobs in this “guilty” ctx’s KFIFO queue, and >>>>> set all >>>>> their fence status to “*ECANCELED*” >>>>> >>>>> Setting ECANCELED should be ok. But I think we should do this when >>>>> we try to run the jobs and not during GPU reset. >>>>> >>>>> 6. Force signal all fences that get kicked out by the above two >>>>> steps, *otherwise UMD will block forever if waiting on those >>>>> fences* >>>>> >>>>> Okay. >>>>> >>>>> 7. Do the gpu reset, which can be some callbacks to let >>>>> bare-metal and >>>>> SR-IOV implement it in their favored style >>>>> >>>>> Okay. >>>>> >>>>> 8. After reset, KMD needs to be aware whether VRAM lost happened or >>>>> not, >>>>> bare-metal can implement some function to judge, while for >>>>> SR-IOV I >>>>> prefer to read it from the GIM side (for the initial version we consider >>>>> it’s always VRAM lost, until the GIM-side change is aligned) >>>>> >>>>> Okay. >>>>> >>>>> 9. If VRAM lost is not hit, continue; otherwise: >>>>> >>>>> a) Update adev->*vram_lost_counter*, >>>>> >>>>> Okay. 
>>>>> >>>>> b) Iterate over all living ctx, and set all ctx as “*guilty*” >>>>> since >>>>> VRAM lost actually ruins all VRAM contents >>>>> >>>>> No, that shouldn't be done by comparing the counters. Iterating >>>>> over all contexts is way too much overhead. >>>>> >>>>> c) Kick out all jobs in all ctx’s KFIFO queues, and set all their >>>>> fence status to “*ECANCELED*” >>>>> >>>>> Yes and no, that should be done when we try to run the jobs and not >>>>> during GPU reset. >>>>> >>>>> 10. Do GTT recovery and VRAM page tables/entries recovery >>>>> (optional, >>>>> do we need it ???) >>>>> >>>>> Yes, that is still needed. As Nicolai explained we can't be sure >>>>> that VRAM is still 100% correct even when it isn't cleared. >>>>> >>>>> 11. Re-schedule all JOBs remaining in the mirror list to the ring again and >>>>> restart the scheduler (for the VRAM lost case, no JOB will be >>>>> re-scheduled) >>>>> >>>>> Okay. >>>>> >>>>> • For cs_wait() IOCTL: >>>>> >>>>> After it finds the fence signaled, it should check with >>>>> *“dma_fence_get_status”* to see if there is an error there, >>>>> >>>>> and return the error status of the fence >>>>> >>>>> Yes and no, dma_fence_get_status() is some specific handling for >>>>> sync_file debugging (no idea why that made it into the common fence >>>>> code). >>>>> >>>>> It was replaced by putting the error code directly into the fence, >>>>> so just reading that one after waiting should be ok. >>>>> >>>>> Maybe we should fix dma_fence_get_status() to do the right thing >>>>> for this? 
>>>>> >>>>> • For cs_wait_fences() IOCTL: >>>>> >>>>> Similar to the above approach >>>>> >>>>> • For cs_submit() IOCTL: >>>>> >>>>> It needs to check whether the current ctx has been marked as “*guilty*” and >>>>> return >>>>> “*ECANCELED*” if so >>>>> >>>>> • Introduce a new IOCTL to let UMD query *vram_lost_counter*: >>>>> >>>>> This way, UMD can also block the app from submitting; like @Nicolai >>>>> mentioned, we can cache one copy of *vram_lost_counter* when >>>>> enumerating the physical device, and deny all >>>>> >>>>> gl-contexts from submitting if the queried counter is bigger than >>>>> the >>>>> one cached in the physical device. (looks a little like overkill to >>>>> me, but >>>>> easy to implement) >>>>> >>>>> UMD can also return an error to the APP when creating a gl-context if it >>>>> finds >>>>> the currently queried *vram_lost_counter* bigger than the one >>>>> cached in the >>>>> physical device. >>>>> >>>>> Okay. Already have a patch for this, please review that one if you >>>>> haven't already done so. >>>>> >>>>> Regards, >>>>> Christian. >>>>> >>>>> BTW: I realized that a gl-context is a little different from the >>>>> kernel’s >>>>> context. Because for the kernel, 
a BO is not related to a context >>>>> but only >>>>> to an FD, while in UMD, a BO has a backing >>>>> >>>>> gl-context, so blocking submission in the UMD layer is also needed, >>>>> although >>>>> KMD will do its job as the bottom line >>>>> >>>>> • Basically “vram_lost_counter” is exposed by the kernel to let >>>>> UMD take >>>>> control of the robustness extension feature; it will be UMD’s >>>>> call to >>>>> make, and KMD only denies “guilty” contexts from submitting >>>>> >>>>> Need your feedback, thx >>>>> >>>>> We’d better make the TDR feature land ASAP >>>>> >>>>> BR Monk >>>>> >>> _______________________________________________ >>> amd-gfx mailing list >>> amd-gfx@lists.freedesktop.org >>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply [flat|nested] 23+ messages in thread
* RE: TDR and VRAM lost handling in KMD: [not found] ` <d9274bc6-27e8-f6c4-0851-4240bde72452-5C7GfCeVMHo@public.gmane.org> @ 2017-10-11 13:51 ` Liu, Monk [not found] ` <BLUPR12MB044961DEE4326E94A156ED05844A0-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org> 0 siblings, 1 reply; 23+ messages in thread From: Liu, Monk @ 2017-10-11 13:51 UTC (permalink / raw) To: Koenig, Christian, Zhou, David(ChunMing), Haehnle, Nicolai, Olsak, Marek, Deucher, Alexander Cc: Ramirez, Alejandro, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Filipas, Mario, Ding, Pixel, Li, Bingley, Jiang, Jerry (SW) > Some jobs don't have a context (VM updates, clears, buffer moves). What? I remember even the VM update job runs on a kernel entity (no context, true), and if the entity can keep a counter copy, that can solve your concerns -----Original Message----- From: Koenig, Christian Sent: Wednesday, October 11, 2017 9:39 PM To: Liu, Monk <Monk.Liu@amd.com>; Zhou, David(ChunMing) <David1.Zhou@amd.com>; Haehnle, Nicolai <Nicolai.Haehnle@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com> Cc: Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; amd-gfx@lists.freedesktop.org; Filipas, Mario <Mario.Filipas@amd.com>; Ding, Pixel <Pixel.Ding@amd.com>; Li, Bingley <Bingley.Li@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com> Subject: Re: TDR and VRAM lost handling in KMD: Some jobs don't have a context (VM updates, clears, buffer moves). I would still like to abort those when they were issued before losing VRAM content, but keep the entity usable. So I think we should just keep a copy of the VRAM lost counter in the job. That also removes the burden of figuring out the context during job run. Regards, Christian. 
On 11.10.2017 at 15:35, Liu, Monk wrote: > I think just comparing the copy from the context/entity with the current > counter is enough; I don't see how it's better to keep another copy in the JOB > > > -----Original Message----- > From: Koenig, Christian > Sent: Wednesday, October 11, 2017 6:40 PM > To: Zhou, David(ChunMing) <David1.Zhou@amd.com>; Liu, Monk > <Monk.Liu@amd.com>; Haehnle, Nicolai <Nicolai.Haehnle@amd.com>; Olsak, > Marek <Marek.Olsak@amd.com>; Deucher, Alexander > <Alexander.Deucher@amd.com> > Cc: Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; > amd-gfx@lists.freedesktop.org; Filipas, Mario <Mario.Filipas@amd.com>; > Ding, Pixel <Pixel.Ding@amd.com>; Li, Bingley <Bingley.Li@amd.com>; > Jiang, Jerry (SW) <Jerry.Jiang@amd.com> > Subject: Re: TDR and VRAM lost handling in KMD: > > I've already posted a patch for this on the mailing list. > > Basically we just copy the vram lost counter into the job and when we try to run the job we can mark it as canceled. > > Regards, > Christian. > > On 11.10.2017 at 12:14, Chunming Zhou wrote: >> Your summary lacks the below issue: >> >> How about jobs already pushed into the scheduler queue when VRAM is lost? >> >> >> Regards, >> David Zhou >> On October 11, 2017 at 17:41, Liu, Monk wrote: >>> Okay, let me summarize our whole idea together and see if it works: >>> >>> 1, For cs_submit, always check vram_lost_counter first and reject >>> the submit (return -ECANCELED to UMD) if ctx->vram_lost_counter != >>> adev->vram_lost_counter. That way the vram lost issue can be handled >>> >>> 2, for cs_submit we still need to check whether the incoming context is >>> "AMDGPU_CTX_GUILTY_RESET" or not even if we found >>> ctx->vram_lost_counter == adev->vram_lost_counter, and we can reject >>> the submit >>> if it is "AMDGPU_CTX_GUILTY_RESET", correct? 
>>> >>> 3, in the gpu_reset() routine, we only mark the hang job's entity as >>> guilty (so we need to add a new member to the entity structure), and do not >>> kick it out at the gpu_reset() stage, but we need to set the context >>> behind this entity as "AMDGPU_CTX_GUILTY_RESET" >>> And if the reset introduces VRAM LOST, we just update >>> adev->vram_lost_counter, but *don't* change all entities to guilty, so >>> still only the hang job's entity is "guilty" >>> After some entity is marked as "guilty", we find a way to set the >>> context behind it as AMDGPU_CTX_GUILTY_RESET; because this is the U/K >>> interface, we need to let UMD know that this context is wrong. >>> >>> 4, in the gpu scheduler's run_job() routine, since it only reads the entity, >>> we skip job scheduling once the entity is found "guilty" >>> >>> >>> Does the above sound good? >>> >>> >>> >>> -----Original Message----- >>> From: Haehnle, Nicolai >>> Sent: Wednesday, October 11, 2017 5:26 PM >>> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian >>> <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; >>> Deucher, Alexander <Alexander.Deucher@amd.com> >>> Cc: amd-gfx@lists.freedesktop.org; Ding, Pixel <Pixel.Ding@amd.com>; >>> Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, Bingley >>> <Bingley.Li@amd.com>; Ramirez, Alejandro >>> <Alejandro.Ramirez@amd.com>; Filipas, Mario <Mario.Filipas@amd.com> >>> Subject: Re: TDR and VRAM lost handling in KMD: >>> >>> On 11.10.2017 11:18, Liu, Monk wrote: >>>> Let's keep it simple: when VRAM lost hits, what's the action for >>>> amdgpu_ctx_query()/AMDGPU_CTX_OP_QUERY_STATE on other contexts (not >>>> the one that triggered the gpu hang) after VRAM lost? Do you mean we return >>>> -ENODEV to UMD? >>> It should successfully return AMDGPU_CTX_INNOCENT_RESET. >>> >>> >>>> In cs_submit, with VRAM lost hit, if we don't mark all contexts as >>>> "guilty", how do we block them from submitting? 
can you show some >>>> implement way >>> if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter)) >>> return -ECANCELED; >>> >>> (where ctx->vram_lost_counter is initialized at context creation >>> time and never changed afterwards) >>> >>> >>>> BTW: the "guilty" here is a new member I want to add to context, it >>>> is not related with AMDGPU_CTX_OP_QUERY_STATE UK interface, Looks I >>>> need to unify them and only one place to mark guilty or not >>> Right, the AMDGPU_CTX_OP_QUERY_STATE handling needs to be made >>> consistent with the rest. >>> >>> Cheers, >>> Nicolai >>> >>> >>>> BR Monk >>>> >>>> -----Original Message----- >>>> From: Haehnle, Nicolai >>>> Sent: Wednesday, October 11, 2017 5:00 PM >>>> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian >>>> <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; >>>> Deucher, Alexander <Alexander.Deucher@amd.com> >>>> Cc: amd-gfx@lists.freedesktop.org; Ding, Pixel >>>> <Pixel.Ding@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, >>>> Bingley <Bingley.Li@amd.com>; Ramirez, Alejandro >>>> <Alejandro.Ramirez@amd.com>; Filipas, Mario <Mario.Filipas@amd.com> >>>> Subject: Re: TDR and VRAM lost handling in KMD: >>>> >>>> On 11.10.2017 10:48, Liu, Monk wrote: >>>>> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), >>>>> so it's reasonable to use it. However, it /does not/ make sense to >>>>> mark idle contexts as "guilty" just because VRAM is lost. VRAM >>>>> lost is a perfect example where the driver should report context >>>>> lost to applications with the "innocent" flag for contexts that >>>>> were idle at the time of reset. The only context(s) that should be >>>>> reported as "guilty" >>>>> (or perhaps "unknown" in some cases) are the ones that were >>>>> executing at the time of reset. >>>>> >>>>> ML: KMD mark all contexts as guilty is because that way we can >>>>> unify our IOCTL behavior: e.g. 
for IOCTL only block >>>>> “guilty”context , no need to worry about vram-lost-counter >>>>> anymore, that’s a implementation style. I don’t think it is >>>>> related with UMD layer, >>>>> >>>>> For UMD the gl-context isn’t aware of by KMD, so UMD can implement >>>>> it own “guilty” gl-context if you want. >>>> Well, to some extent this is just semantics, but it helps to keep >>>> the terminology consistent. >>>> >>>> Most importantly, please keep the AMDGPU_CTX_OP_QUERY_STATE uapi in >>>> mind: this returns one of >>>> AMDGPU_CTX_{GUILTY,INNOCENT,UNKNOWN}_RECENT, >>>> and it must return "innocent" for contexts that are only lost due >>>> to VRAM lost without being otherwise involved in the timeout that >>>> lead to the reset. >>>> >>>> The point is that in the places where you used "guilty" it would be >>>> better to use "context lost", and then further differentiate >>>> between guilty/innocent context lost based on the details of what happened. >>>> >>>> >>>>> If KMD doesn’t mark all ctx as guilty after VRAM lost, can you >>>>> illustrate what rule KMD should obey to check in KMS IOCTL like >>>>> cs_sumbit ?? let’s see which way better >>>> if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter)) >>>> return -ECANCELED; >>>> >>>> Plus similar logic for AMDGPU_CTX_OP_QUERY_STATE. >>>> >>>> Yes, it's one additional check in cs_submit. If you're worried >>>> about that (and Christian's concerns about possible issues with >>>> walking over all contexts are addressed), I suppose you could just >>>> store a per-context >>>> >>>> unsigned context_reset_status; >>>> >>>> instead of a `bool guilty`. Its value would start out as 0 >>>> (AMDGPU_CTX_NO_RESET) and would be set to the correct value during >>>> reset. 
>>>> >>>> Cheers, >>>> Nicolai >>>> >>>> >>>>> *From:*Haehnle, Nicolai >>>>> *Sent:* Wednesday, October 11, 2017 4:41 PM >>>>> *To:* Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian >>>>> <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; >>>>> Deucher, Alexander <Alexander.Deucher@amd.com> >>>>> *Cc:* amd-gfx@lists.freedesktop.org; Ding, Pixel >>>>> <Pixel.Ding@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, >>>>> Bingley <Bingley.Li@amd.com>; Ramirez, Alejandro >>>>> <Alejandro.Ramirez@amd.com>; Filipas, Mario >>>>> <Mario.Filipas@amd.com> >>>>> *Subject:* Re: TDR and VRAM lost handling in KMD: >>>>> >>>>> From a Mesa perspective, this almost all sounds reasonable to me. >>>>> >>>>> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), >>>>> so it's reasonable to use it. However, it /does not/ make sense to >>>>> mark idle contexts as "guilty" just because VRAM is lost. VRAM >>>>> lost is a perfect example where the driver should report context >>>>> lost to applications with the "innocent" flag for contexts that >>>>> were idle at the time of reset. The only context(s) that should be >>>>> reported as "guilty" >>>>> (or perhaps "unknown" in some cases) are the ones that were >>>>> executing at the time of reset. >>>>> >>>>> On whether the whole context is marked as guilty from a user space >>>>> perspective, it would simply be nice for user space to get >>>>> consistent answers. It would be a bit odd if we could e.g. succeed >>>>> in submitting an SDMA job after a GFX job was rejected. This would >>>>> point in favor of marking the entire context as guilty (although >>>>> that could happen lazily instead of at reset time). On the other >>>>> hand, if that's too big a burden for the kernel implementation I'm >>>>> sure we can live without it. 
>>>>> >>>>> Cheers, >>>>> >>>>> Nicolai >>>>> >>>>> ------------------------------------------------------------------ >>>>> - >>>>> --- >>>>> -- >>>>> >>>>> *From:*Liu, Monk >>>>> *Sent:* Wednesday, October 11, 2017 10:15:40 AM >>>>> *To:* Koenig, Christian; Haehnle, Nicolai; Olsak, Marek; Deucher, >>>>> Alexander >>>>> *Cc:* amd-gfx@lists.freedesktop.org >>>>> <mailto:amd-gfx@lists.freedesktop.org>; Ding, Pixel; Jiang, Jerry >>>>> (SW); Li, Bingley; Ramirez, Alejandro; Filipas, Mario >>>>> *Subject:* RE: TDR and VRAM lost handling in KMD: >>>>> >>>>> 1.Set its fence error status to “*ETIME*”, >>>>> >>>>> No, as I already explained ETIME is for synchronous operation. >>>>> >>>>> In other words when we return ETIME from the wait IOCTL it would >>>>> mean that the waiting has somehow timed out, but not the job we waited for. >>>>> >>>>> Please use ECANCELED as well or some other error code when we find >>>>> that we need to distinct the timedout job from the canceled ones >>>>> (probably a good idea, but I'm not sure). >>>>> >>>>> [ML] I’m okay if you insist not to use ETIME >>>>> >>>>> 1.Find the entity/ctx behind this job, and set this ctx as “*guilty*” >>>>> >>>>> Not sure. Do we want to set the whole context as guilty or just >>>>> the entity? >>>>> >>>>> Setting the whole contexts as guilty sounds racy to me. >>>>> >>>>> BTW: We should use a different name than "guilty", maybe just >>>>> "bool canceled;" ? >>>>> >>>>> [ML] I think context is better than entity, because for example if >>>>> you only block entity_0 of context and allow entity_N run, that >>>>> means the dependency between entities are broken (e.g. page table >>>>> updates in >>>>> >>>>> Sdma entity pass but gfx submit in GFX entity blocked, not make >>>>> sense to me) >>>>> >>>>> We’d better either block the whole context or let not… >>>>> >>>>> 1.Kick out all jobs in this “guilty” ctx’s KFIFO queue, and set >>>>> all their fence status to “*ECANCELED*” >>>>> >>>>> Setting ECANCELED should be ok. 
But I think we should do this when >>>>> we try to run the jobs and not during GPU reset. >>>>> >>>>> [ML] without deep thought and expritment, I’m not sure the >>>>> difference between them, but kick it out in gpu_reset routine is >>>>> more efficient, >>>>> >>>>> Otherwise you need to check context/entity guilty flag in run_job >>>>> routine …and you need to it for every context/entity, I don’t see >>>>> why >>>>> >>>>> We don’t just kickout all of them in gpu_reset stage …. >>>>> >>>>> a)Iterate over all living ctx, and set all ctx as “*guilty*” since >>>>> VRAM lost actually ruins all VRAM contents >>>>> >>>>> No, that shouldn't be done by comparing the counters. Iterating >>>>> over all contexts is way to much overhead. >>>>> >>>>> [ML] because I want to make KMS IOCTL rules clean, like they don’t >>>>> need to differentiate VRAM lost or not, they only interested in if >>>>> the context is guilty or not, and block >>>>> >>>>> Submit for guilty ones. >>>>> >>>>> *Can you give more details of your idea? And better the detail >>>>> implement in cs_submit, I want to see how you want to block submit >>>>> without checking context guilty flag* >>>>> >>>>> a)Kick out all jobs in all ctx’s KFIFO queue, and set all their >>>>> fence status to “*ECANCELDED*” >>>>> >>>>> Yes and no, that should be done when we try to run the jobs and >>>>> not during GPU reset. >>>>> >>>>> [ML] again, kicking out them in gpu reset routine is high >>>>> efficient, otherwise you need check on every job in run_job() >>>>> >>>>> Besides, can you illustrate the detail implementation ? >>>>> >>>>> Yes and no, dma_fence_get_status() is some specific handling for >>>>> sync_file debugging (no idea why that made it into the common >>>>> fence code). >>>>> >>>>> It was replaced by putting the error code directly into the fence, >>>>> so just reading that one after waiting should be ok. >>>>> >>>>> Maybe we should fix dma_fence_get_status() to do the right thing >>>>> for this? 
>>>>> >>>>> [ML] yeah, that’s too confusing, the name sound really the one I >>>>> want to use, we should change it… >>>>> >>>>> *But look into the implement, I don**’t see why we cannot use it ? >>>>> it also finally return the fence->error * >>>>> >>>>> *From:*Koenig, Christian >>>>> *Sent:* Wednesday, October 11, 2017 3:21 PM >>>>> *To:* Liu, Monk <Monk.Liu@amd.com <mailto:Monk.Liu@amd.com>>; >>>>> Haehnle, Nicolai <Nicolai.Haehnle@amd.com >>>>> <mailto:Nicolai.Haehnle@amd.com>>; >>>>> Olsak, Marek <Marek.Olsak@amd.com <mailto:Marek.Olsak@amd.com>>; >>>>> Deucher, Alexander <Alexander.Deucher@amd.com >>>>> <mailto:Alexander.Deucher@amd.com>> >>>>> *Cc:* amd-gfx@lists.freedesktop.org >>>>> <mailto:amd-gfx@lists.freedesktop.org>; Ding, Pixel >>>>> <Pixel.Ding@amd.com <mailto:Pixel.Ding@amd.com>>; Jiang, Jerry >>>>> (SW) <Jerry.Jiang@amd.com <mailto:Jerry.Jiang@amd.com>>; Li, >>>>> Bingley <Bingley.Li@amd.com <mailto:Bingley.Li@amd.com>>; Ramirez, >>>>> Alejandro <Alejandro.Ramirez@amd.com >>>>> <mailto:Alejandro.Ramirez@amd.com>>; >>>>> Filipas, Mario <Mario.Filipas@amd.com >>>>> <mailto:Mario.Filipas@amd.com>> >>>>> *Subject:* Re: TDR and VRAM lost handling in KMD: >>>>> >>>>> See inline: >>>>> >>>>> Am 11.10.2017 um 07:33 schrieb Liu, Monk: >>>>> >>>>> Hi Christian & Nicolai, >>>>> >>>>> We need to achieve some agreements on what should MESA/UMD >>>>> do and >>>>> what should KMD do, *please give your comments with “okay” >>>>> or “No” >>>>> and your idea on below items,* >>>>> >>>>> ?When a job timed out (set from lockup_timeout kernel >>>>> parameter), >>>>> What KMD should do in TDR routine : >>>>> >>>>> 1.Update adev->*gpu_reset_counter*, and stop scheduler first, >>>>> (*gpu_reset_counter* is used to force vm flush after GPU >>>>> reset, out >>>>> of this thread’s scope so no more discussion on it) >>>>> >>>>> Okay. >>>>> >>>>> 2.Set its fence error status to “*ETIME*”, >>>>> >>>>> No, as I already explained ETIME is for synchronous operation. 
>>>>> >>>>> In other words when we return ETIME from the wait IOCTL it would >>>>> mean that the waiting has somehow timed out, but not the job we waited for. >>>>> >>>>> Please use ECANCELED as well or some other error code when we find >>>>> that we need to distinct the timedout job from the canceled ones >>>>> (probably a good idea, but I'm not sure). >>>>> >>>>> 3.Find the entity/ctx behind this job, and set this ctx as >>>>> “*guilty*” >>>>> >>>>> Not sure. Do we want to set the whole context as guilty or just >>>>> the entity? >>>>> >>>>> Setting the whole contexts as guilty sounds racy to me. >>>>> >>>>> BTW: We should use a different name than "guilty", maybe just >>>>> "bool canceled;" ? >>>>> >>>>> 4.Kick out this job from scheduler’s mirror list, so this >>>>> job won’t >>>>> get re-scheduled to ring anymore. >>>>> >>>>> Okay. >>>>> >>>>> 5.Kick out all jobs in this “guilty” ctx’s KFIFO queue, and >>>>> set all >>>>> their fence status to “*ECANCELED*” >>>>> >>>>> Setting ECANCELED should be ok. But I think we should do this when >>>>> we try to run the jobs and not during GPU reset. >>>>> >>>>> 6.Force signal all fences that get kicked out by above two >>>>> steps,*otherwise UMD will block forever if waiting on those >>>>> fences* >>>>> >>>>> Okay. >>>>> >>>>> 7.Do gpu reset, which is can be some callbacks to let >>>>> bare-metal and >>>>> SR-IOV implement with their favor style >>>>> >>>>> Okay. >>>>> >>>>> 8.After reset, KMD need to aware if the VRAM lost happens >>>>> or not, >>>>> bare-metal can implement some function to judge, while for >>>>> SR-IOV I >>>>> prefer to read it from GIM side (for initial version we consider >>>>> it’s always VRAM lost, till GIM side change aligned) >>>>> >>>>> Okay. >>>>> >>>>> 9.If VRAM lost not hit, continue, otherwise: >>>>> >>>>> a)Update adev->*vram_lost_counter*, >>>>> >>>>> Okay. 
>>>>> >>>>> b)Iterate over all living ctx, and set all ctx as “*guilty*” >>>>> since >>>>> VRAM lost actually ruins all VRAM contents >>>>> >>>>> No, that shouldn't be done by comparing the counters. Iterating >>>>> over all contexts is way to much overhead. >>>>> >>>>> c)Kick out all jobs in all ctx’s KFIFO queue, and set all their >>>>> fence status to “*ECANCELDED*” >>>>> >>>>> Yes and no, that should be done when we try to run the jobs and >>>>> not during GPU reset. >>>>> >>>>> 10.Do GTT recovery and VRAM page tables/entries recovery >>>>> (optional, >>>>> do we need it ???) >>>>> >>>>> Yes, that is still needed. As Nicolai explained we can't be sure >>>>> that VRAM is still 100% correct even when it isn't cleared. >>>>> >>>>> 11.Re-schedule all JOBs remains in mirror list to ring again and >>>>> restart scheduler (for VRAM lost case, no JOB will >>>>> re-scheduled) >>>>> >>>>> Okay. >>>>> >>>>> ?For cs_wait() IOCTL: >>>>> >>>>> After it found fence signaled, it should check with >>>>> *“dma_fence_get_status” *to see if there is error there, >>>>> >>>>> And return the error status of fence >>>>> >>>>> Yes and no, dma_fence_get_status() is some specific handling for >>>>> sync_file debugging (no idea why that made it into the common >>>>> fence code). >>>>> >>>>> It was replaced by putting the error code directly into the fence, >>>>> so just reading that one after waiting should be ok. >>>>> >>>>> Maybe we should fix dma_fence_get_status() to do the right thing >>>>> for this? 
>>>>> >>>>> ?For cs_wait_fences() IOCTL: >>>>> >>>>> Similar with above approach >>>>> >>>>> ?For cs_submit() IOCTL: >>>>> >>>>> It need to check if current ctx been marked as “*guilty*” >>>>> and return >>>>> “*ECANCELED*” if so >>>>> >>>>> ?Introduce a new IOCTL to let UMD query *vram_lost_counter*: >>>>> >>>>> This way, UMD can also block app from submitting, like @Nicolai >>>>> mentioned, we can cache one copy of *vram_lost_counter* when >>>>> enumerate physical device, and deny all >>>>> >>>>> gl-context from submitting if the counter queried bigger >>>>> than that >>>>> one cached in physical device. (looks a little overkill to >>>>> me, but >>>>> easy to implement ) >>>>> >>>>> UMD can also return error to APP when creating gl-context >>>>> if found >>>>> current queried*vram_lost_counter *bigger than that one >>>>> cached in >>>>> physical device. >>>>> >>>>> Okay. Already have a patch for this, please review that one if you >>>>> haven't already done so. >>>>> >>>>> Regards, >>>>> Christian. >>>>> >>>>> BTW: I realized that gl-context is a little different with >>>>> kernel’s >>>>> context. Because for kernel. 
BO is not related with context >>>>> but only >>>>> with FD, while in UMD, BO have a backend >>>>> >>>>> gl-context, so block submitting in UMD layer is also needed >>>>> although >>>>> KMD will do its job as bottom line >>>>> >>>>> ?Basically “vram_lost_counter” is exposure by kernel to let >>>>> UMD take >>>>> the control of robust extension feature, it will be UMD’s >>>>> call to >>>>> move, KMD only deny “guilty” context from submitting >>>>> >>>>> Need your feedback, thx >>>>> >>>>> We’d better make TDR feature landed ASAP >>>>> >>>>> BR Monk >>>>> >>> _______________________________________________ >>> amd-gfx mailing list >>> amd-gfx@lists.freedesktop.org >>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply [flat|nested] 23+ messages in thread
* RE: TDR and VRAM lost handling in KMD: [not found] ` <BLUPR12MB044961DEE4326E94A156ED05844A0-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org> @ 2017-10-11 13:59 ` Liu, Monk [not found] ` <BLUPR12MB04497EDD5AE48484E7A18C2F844A0-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org> 2017-10-11 14:03 ` Christian König 1 sibling, 1 reply; 23+ messages in thread From: Liu, Monk @ 2017-10-11 13:59 UTC (permalink / raw) To: Liu, Monk, Koenig, Christian, Zhou, David(ChunMing), Haehnle, Nicolai, Olsak, Marek, Deucher, Alexander Cc: Ramirez, Alejandro, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Filipas, Mario, Ding, Pixel, Li, Bingley, Jiang, Jerry (SW) But if we keep the counter in the entity, there is one issue I suddenly thought of: for a regular user context, after VRAM lost the UMD will be aware that this context is LOST since we have a counter copy in the context, so user space can close it and re-create one. But for a kernel entity, since there is no U/K interface, it is the kernel's responsibility to recover this kernel entity to a working state, which makes things complicated... Emm, I agree that keeping a copy in the context and in the job is a good move BR Monk -----Original Message----- From: amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] On Behalf Of Liu, Monk Sent: Wednesday, October 11, 2017 9:51 PM To: Koenig, Christian <Christian.Koenig@amd.com>; Zhou, David(ChunMing) <David1.Zhou@amd.com>; Haehnle, Nicolai <Nicolai.Haehnle@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com> Cc: Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; amd-gfx@lists.freedesktop.org; Filipas, Mario <Mario.Filipas@amd.com>; Ding, Pixel <Pixel.Ding@amd.com>; Li, Bingley <Bingley.Li@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com> Subject: RE: TDR and VRAM lost handling in KMD: > Some jobs don't have a context (VM updates, clears, buffer moves). What?
I remember that even the VM update job goes through a kernel entity (true, it has no context), and if the entity keeps a counter copy, that can address your concern. -----Original Message----- From: Koenig, Christian Sent: Wednesday, October 11, 2017 9:39 PM To: Liu, Monk <Monk.Liu@amd.com>; Zhou, David(ChunMing) <David1.Zhou@amd.com>; Haehnle, Nicolai <Nicolai.Haehnle@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com> Cc: Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; amd-gfx@lists.freedesktop.org; Filipas, Mario <Mario.Filipas@amd.com>; Ding, Pixel <Pixel.Ding@amd.com>; Li, Bingley <Bingley.Li@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com> Subject: Re: TDR and VRAM lost handling in KMD: Some jobs don't have a context (VM updates, clears, buffer moves). I would still like to abort those when they were issued before losing VRAM contents, but keep the entity usable. So I think we should just keep a copy of the VRAM lost counter in the job. That also removes the burden of figuring out the context during job run. Regards, Christian.
> > Basically we just copy the vram lost counter into the job and when we try to run the job we can mark it as canceled. > > Regards, > Christian. > > Am 11.10.2017 um 12:14 schrieb Chunming Zhou: >> Your summary lacks the below issue: >> >> How about the job already pushed in scheduler queue when vram is lost? >> >> >> Regards, >> David Zhou >> On 2017年10月11日 17:41, Liu, Monk wrote: >>> Okay, let me summary our whole idea together and see if it works: >>> >>> 1, For cs_submit, always check vram-lost_counter first and reject >>> the submit (return -ECANCLED to UMD) if ctx->vram_lost_counter != >>> adev->vram_lost_counter. That way the vram lost issue can be handled >>> >>> 2, for cs_submit we still need to check if the incoming context is >>> "AMDGPU_CTX_GUILTY_RESET" or not even if we found >>> ctx->vram_lost_counter == adev->vram_lost_counter, and we can reject >>> the submit >>> If it is "AMDGPU_CTX_GUILTY_RESET", correct ? >>> >>> 3, in gpu_reset() routine, we only mark the hang job's entity as >>> guilty (so we need to add new member in entity structure), and not >>> kick it out in gpu_reset() stage, but we need to set the context >>> behind this entity as " AMDGPU_CTX_GUILTY_RESET" >>> And if reset introduces VRAM LOST, we just update >>> adev->vram_lost_counter, but *don't* change all entity to guilty, so >>> still only the hang job's entity is "guilty" >>> After some entity marked as "guilty", we find a way to set the >>> context behind it as AMDGPU_CTX_GUILTY_RESET, because this is U/K >>> interface, we need let UMD can know that this context is wrong. >>> >>> 4, in gpu scheduler's run_job() routine, since it only reads entity, >>> so we skip job scheduling once found the entity is "guilty" >>> >>> >>> Does above sounds good ? 
>>> >>> >>> >>> -----Original Message----- >>> From: Haehnle, Nicolai >>> Sent: Wednesday, October 11, 2017 5:26 PM >>> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian >>> <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; >>> Deucher, Alexander <Alexander.Deucher@amd.com> >>> Cc: amd-gfx@lists.freedesktop.org; Ding, Pixel <Pixel.Ding@amd.com>; >>> Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, Bingley >>> <Bingley.Li@amd.com>; Ramirez, Alejandro >>> <Alejandro.Ramirez@amd.com>; Filipas, Mario <Mario.Filipas@amd.com> >>> Subject: Re: TDR and VRAM lost handling in KMD: >>> >>> On 11.10.2017 11:18, Liu, Monk wrote: >>>> Let's talk it simple, When vram lost hit, what's the action for >>>> amdgpu_ctx_query()/AMDGPU_CTX_OP_QUERY_STATE on other contexts (not >>>> the one trigger gpu hang) after vram lost ? do you mean we return >>>> -ENODEV to UMD ? >>> It should successfully return AMDGPU_CTX_INNOCENT_RESET. >>> >>> >>>> In cs_submit, with vram lost hit, if we don't mark all contexts as >>>> "guilty", how we block its from submitting ? can you show some >>>> implement way >>> if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter)) >>> return -ECANCELED; >>> >>> (where ctx->vram_lost_counter is initialized at context creation >>> time and never changed afterwards) >>> >>> >>>> BTW: the "guilty" here is a new member I want to add to context, it >>>> is not related with AMDGPU_CTX_OP_QUERY_STATE UK interface, Looks I >>>> need to unify them and only one place to mark guilty or not >>> Right, the AMDGPU_CTX_OP_QUERY_STATE handling needs to be made >>> consistent with the rest. 
>>> >>> Cheers, >>> Nicolai >>> >>> >>>> BR Monk >>>> >>>> -----Original Message----- >>>> From: Haehnle, Nicolai >>>> Sent: Wednesday, October 11, 2017 5:00 PM >>>> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian >>>> <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; >>>> Deucher, Alexander <Alexander.Deucher@amd.com> >>>> Cc: amd-gfx@lists.freedesktop.org; Ding, Pixel >>>> <Pixel.Ding@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, >>>> Bingley <Bingley.Li@amd.com>; Ramirez, Alejandro >>>> <Alejandro.Ramirez@amd.com>; Filipas, Mario <Mario.Filipas@amd.com> >>>> Subject: Re: TDR and VRAM lost handling in KMD: >>>> >>>> On 11.10.2017 10:48, Liu, Monk wrote: >>>>> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), >>>>> so it's reasonable to use it. However, it /does not/ make sense to >>>>> mark idle contexts as "guilty" just because VRAM is lost. VRAM >>>>> lost is a perfect example where the driver should report context >>>>> lost to applications with the "innocent" flag for contexts that >>>>> were idle at the time of reset. The only context(s) that should be >>>>> reported as "guilty" >>>>> (or perhaps "unknown" in some cases) are the ones that were >>>>> executing at the time of reset. >>>>> >>>>> ML: KMD mark all contexts as guilty is because that way we can >>>>> unify our IOCTL behavior: e.g. for IOCTL only block >>>>> “guilty”context , no need to worry about vram-lost-counter >>>>> anymore, that’s a implementation style. I don’t think it is >>>>> related with UMD layer, >>>>> >>>>> For UMD the gl-context isn’t aware of by KMD, so UMD can implement >>>>> it own “guilty” gl-context if you want. >>>> Well, to some extent this is just semantics, but it helps to keep >>>> the terminology consistent. 
>>>> >>>> Most importantly, please keep the AMDGPU_CTX_OP_QUERY_STATE uapi in >>>> mind: this returns one of >>>> AMDGPU_CTX_{GUILTY,INNOCENT,UNKNOWN}_RECENT, >>>> and it must return "innocent" for contexts that are only lost due >>>> to VRAM lost without being otherwise involved in the timeout that >>>> lead to the reset. >>>> >>>> The point is that in the places where you used "guilty" it would be >>>> better to use "context lost", and then further differentiate >>>> between guilty/innocent context lost based on the details of what happened. >>>> >>>> >>>>> If KMD doesn’t mark all ctx as guilty after VRAM lost, can you >>>>> illustrate what rule KMD should obey to check in KMS IOCTL like >>>>> cs_sumbit ?? let’s see which way better >>>> if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter)) >>>> return -ECANCELED; >>>> >>>> Plus similar logic for AMDGPU_CTX_OP_QUERY_STATE. >>>> >>>> Yes, it's one additional check in cs_submit. If you're worried >>>> about that (and Christian's concerns about possible issues with >>>> walking over all contexts are addressed), I suppose you could just >>>> store a per-context >>>> >>>> unsigned context_reset_status; >>>> >>>> instead of a `bool guilty`. Its value would start out as 0 >>>> (AMDGPU_CTX_NO_RESET) and would be set to the correct value during >>>> reset. 
>>>> >>>> Cheers, >>>> Nicolai >>>> >>>> >>>>> *From:*Haehnle, Nicolai >>>>> *Sent:* Wednesday, October 11, 2017 4:41 PM >>>>> *To:* Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian >>>>> <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; >>>>> Deucher, Alexander <Alexander.Deucher@amd.com> >>>>> *Cc:* amd-gfx@lists.freedesktop.org; Ding, Pixel >>>>> <Pixel.Ding@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, >>>>> Bingley <Bingley.Li@amd.com>; Ramirez, Alejandro >>>>> <Alejandro.Ramirez@amd.com>; Filipas, Mario >>>>> <Mario.Filipas@amd.com> >>>>> *Subject:* Re: TDR and VRAM lost handling in KMD: >>>>> >>>>> From a Mesa perspective, this almost all sounds reasonable to me. >>>>> >>>>> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), >>>>> so it's reasonable to use it. However, it /does not/ make sense to >>>>> mark idle contexts as "guilty" just because VRAM is lost. VRAM >>>>> lost is a perfect example where the driver should report context >>>>> lost to applications with the "innocent" flag for contexts that >>>>> were idle at the time of reset. The only context(s) that should be >>>>> reported as "guilty" >>>>> (or perhaps "unknown" in some cases) are the ones that were >>>>> executing at the time of reset. >>>>> >>>>> On whether the whole context is marked as guilty from a user space >>>>> perspective, it would simply be nice for user space to get >>>>> consistent answers. It would be a bit odd if we could e.g. succeed >>>>> in submitting an SDMA job after a GFX job was rejected. This would >>>>> point in favor of marking the entire context as guilty (although >>>>> that could happen lazily instead of at reset time). On the other >>>>> hand, if that's too big a burden for the kernel implementation I'm >>>>> sure we can live without it. 
>>>>> >>>>> Cheers, >>>>> >>>>> Nicolai >>>>> >>>>> ------------------------------------------------------------------ >>>>> - >>>>> --- >>>>> -- >>>>> >>>>> *From:*Liu, Monk >>>>> *Sent:* Wednesday, October 11, 2017 10:15:40 AM >>>>> *To:* Koenig, Christian; Haehnle, Nicolai; Olsak, Marek; Deucher, >>>>> Alexander >>>>> *Cc:* amd-gfx@lists.freedesktop.org >>>>> <mailto:amd-gfx@lists.freedesktop.org>; Ding, Pixel; Jiang, Jerry >>>>> (SW); Li, Bingley; Ramirez, Alejandro; Filipas, Mario >>>>> *Subject:* RE: TDR and VRAM lost handling in KMD: >>>>> >>>>> 1.Set its fence error status to “*ETIME*”, >>>>> >>>>> No, as I already explained ETIME is for synchronous operation. >>>>> >>>>> In other words when we return ETIME from the wait IOCTL it would >>>>> mean that the waiting has somehow timed out, but not the job we waited for. >>>>> >>>>> Please use ECANCELED as well or some other error code when we find >>>>> that we need to distinct the timedout job from the canceled ones >>>>> (probably a good idea, but I'm not sure). >>>>> >>>>> [ML] I’m okay if you insist not to use ETIME >>>>> >>>>> 1.Find the entity/ctx behind this job, and set this ctx as “*guilty*” >>>>> >>>>> Not sure. Do we want to set the whole context as guilty or just >>>>> the entity? >>>>> >>>>> Setting the whole contexts as guilty sounds racy to me. >>>>> >>>>> BTW: We should use a different name than "guilty", maybe just >>>>> "bool canceled;" ? >>>>> >>>>> [ML] I think context is better than entity, because for example if >>>>> you only block entity_0 of context and allow entity_N run, that >>>>> means the dependency between entities are broken (e.g. page table >>>>> updates in >>>>> >>>>> Sdma entity pass but gfx submit in GFX entity blocked, not make >>>>> sense to me) >>>>> >>>>> We’d better either block the whole context or let not… >>>>> >>>>> 1.Kick out all jobs in this “guilty” ctx’s KFIFO queue, and set >>>>> all their fence status to “*ECANCELED*” >>>>> >>>>> Setting ECANCELED should be ok. 
But I think we should do this when >>>>> we try to run the jobs and not during GPU reset. >>>>> >>>>> [ML] without deep thought and expritment, I’m not sure the >>>>> difference between them, but kick it out in gpu_reset routine is >>>>> more efficient, >>>>> >>>>> Otherwise you need to check context/entity guilty flag in run_job >>>>> routine …and you need to it for every context/entity, I don’t see >>>>> why >>>>> >>>>> We don’t just kickout all of them in gpu_reset stage …. >>>>> >>>>> a)Iterate over all living ctx, and set all ctx as “*guilty*” since >>>>> VRAM lost actually ruins all VRAM contents >>>>> >>>>> No, that shouldn't be done by comparing the counters. Iterating >>>>> over all contexts is way to much overhead. >>>>> >>>>> [ML] because I want to make KMS IOCTL rules clean, like they don’t >>>>> need to differentiate VRAM lost or not, they only interested in if >>>>> the context is guilty or not, and block >>>>> >>>>> Submit for guilty ones. >>>>> >>>>> *Can you give more details of your idea? And better the detail >>>>> implement in cs_submit, I want to see how you want to block submit >>>>> without checking context guilty flag* >>>>> >>>>> a)Kick out all jobs in all ctx’s KFIFO queue, and set all their >>>>> fence status to “*ECANCELDED*” >>>>> >>>>> Yes and no, that should be done when we try to run the jobs and >>>>> not during GPU reset. >>>>> >>>>> [ML] again, kicking out them in gpu reset routine is high >>>>> efficient, otherwise you need check on every job in run_job() >>>>> >>>>> Besides, can you illustrate the detail implementation ? >>>>> >>>>> Yes and no, dma_fence_get_status() is some specific handling for >>>>> sync_file debugging (no idea why that made it into the common >>>>> fence code). >>>>> >>>>> It was replaced by putting the error code directly into the fence, >>>>> so just reading that one after waiting should be ok. >>>>> >>>>> Maybe we should fix dma_fence_get_status() to do the right thing >>>>> for this? 
>>>>> [ML] Yeah, that's too confusing; the name really sounds like the one I want to use. We should change it…
>>>>>
>>>>> *But looking into the implementation, I don't see why we cannot use it; it also ultimately returns fence->error.*
>>>>>
>>>>> *From:* Koenig, Christian
>>>>> *Sent:* Wednesday, October 11, 2017 3:21 PM
>>>>> *To:* Liu, Monk <Monk.Liu@amd.com>; Haehnle, Nicolai <Nicolai.Haehnle@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>
>>>>> *Cc:* amd-gfx@lists.freedesktop.org; Ding, Pixel <Pixel.Ding@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, Bingley <Bingley.Li@amd.com>; Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; Filipas, Mario <Mario.Filipas@amd.com>
>>>>> *Subject:* Re: TDR and VRAM lost handling in KMD:
>>>>>
>>>>> See inline:
>>>>>
>>>>> Am 11.10.2017 um 07:33 schrieb Liu, Monk:
>>>>>
>>>>>     Hi Christian & Nicolai,
>>>>>
>>>>>     We need to reach agreement on what MESA/UMD should do and what KMD should do. *Please give your comments with "okay" or "no" and your ideas on the items below.*
>>>>>
>>>>>     • When a job times out (set from the lockup_timeout kernel parameter), what should KMD do in the TDR routine:
>>>>>
>>>>>     1. Update adev->*gpu_reset_counter* and stop the scheduler first. (*gpu_reset_counter* is used to force a VM flush after GPU reset; out of this thread's scope, so no more discussion on it.)
>>>>>
>>>>> Okay.
>>>>>
>>>>>     2. Set its fence error status to "*ETIME*",
>>>>>
>>>>> No, as I already explained, ETIME is for synchronous operation.
>>>>> In other words, when we return ETIME from the wait IOCTL it would mean that the waiting has somehow timed out, but not the job we waited for.
>>>>>
>>>>> Please use ECANCELED as well, or some other error code if we find that we need to distinguish the timed-out job from the canceled ones (probably a good idea, but I'm not sure).
>>>>>
>>>>>     3. Find the entity/ctx behind this job, and set this ctx as "*guilty*"
>>>>>
>>>>> Not sure. Do we want to set the whole context as guilty or just the entity?
>>>>>
>>>>> Setting the whole context as guilty sounds racy to me.
>>>>>
>>>>> BTW: We should use a different name than "guilty", maybe just "bool canceled;"?
>>>>>
>>>>>     4. Kick out this job from the scheduler's mirror list, so this job won't get re-scheduled to the ring anymore.
>>>>>
>>>>> Okay.
>>>>>
>>>>>     5. Kick out all jobs in this "guilty" ctx's KFIFO queue, and set all their fence status to "*ECANCELED*"
>>>>>
>>>>> Setting ECANCELED should be ok. But I think we should do this when we try to run the jobs, not during GPU reset.
>>>>>
>>>>>     6. Force-signal all fences that get kicked out by the above two steps, *otherwise UMD will block forever if waiting on those fences*
>>>>>
>>>>> Okay.
>>>>>
>>>>>     7. Do the gpu reset, which can be some callbacks to let bare-metal and SR-IOV implement it in their preferred style
>>>>>
>>>>> Okay.
>>>>>
>>>>>     8. After the reset, KMD needs to know whether VRAM loss happened or not; bare-metal can implement some function to judge it, while for SR-IOV I prefer to read it from the GIM side (for the initial version we consider it always a VRAM loss, until the GIM-side change is aligned)
>>>>>
>>>>> Okay.
>>>>>
>>>>>     9. If VRAM loss was not hit, continue; otherwise:
>>>>>
>>>>>     a) Update adev->*vram_lost_counter*,
>>>>>
>>>>> Okay.
>>>>>     b) Iterate over all living ctx, and set all ctx as "*guilty*" since VRAM loss actually ruins all VRAM contents
>>>>>
>>>>> No, that shouldn't be done by comparing the counters. Iterating over all contexts is way too much overhead.
>>>>>
>>>>>     c) Kick out all jobs in all ctx's KFIFO queues, and set all their fence status to "*ECANCELED*"
>>>>>
>>>>> Yes and no, that should be done when we try to run the jobs, not during GPU reset.
>>>>>
>>>>>     10. Do GTT recovery and VRAM page tables/entries recovery (optional; do we need it?)
>>>>>
>>>>> Yes, that is still needed. As Nicolai explained, we can't be sure that VRAM is still 100% correct even when it isn't cleared.
>>>>>
>>>>>     11. Re-schedule all jobs remaining in the mirror list to the ring again and restart the scheduler (for the VRAM lost case, no job will be re-scheduled)
>>>>>
>>>>> Okay.
>>>>>
>>>>>     • For the cs_wait() IOCTL: after it finds the fence signaled, it should check with *dma_fence_get_status* to see whether there is an error there, and return the error status of the fence.
>>>>>
>>>>> Yes and no, dma_fence_get_status() is some specific handling for sync_file debugging (no idea why that made it into the common fence code).
>>>>>
>>>>> It was replaced by putting the error code directly into the fence, so just reading that one after waiting should be ok.
>>>>>
>>>>> Maybe we should fix dma_fence_get_status() to do the right thing for this?
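Christian's point, that the job's error ends up stored directly in the fence and can simply be read back after the wait completes, can be sketched as a stand-alone toy model (the struct and function names below are illustrative stubs, not the kernel's real dma_fence API):

```c
#include <errno.h>
#include <stdbool.h>

/* Toy stand-in for a fence: the error code set by the driver (e.g.
 * -ECANCELED for a kicked-out job) lives directly in the fence. */
struct toy_fence {
    bool signaled;
    int  error;     /* 0, or a negative errno set before force-signaling */
};

/* What a wait IOCTL would return once the fence is signaled: simply
 * propagate the fence's stored error instead of a wait-side code. */
static int toy_cs_wait_result(const struct toy_fence *fence)
{
    if (!fence->signaled)
        return -ETIME;          /* the wait itself timed out */
    return fence->error;        /* the job's fate: 0 or e.g. -ECANCELED */
}
```

This also matches the earlier ETIME discussion: ETIME describes the wait operation timing out, while a canceled job's status travels in the fence itself.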
>>>>>     • For the cs_wait_fences() IOCTL: similar to the above approach.
>>>>>
>>>>>     • For the cs_submit() IOCTL: it needs to check whether the current ctx has been marked as "*guilty*" and return "*ECANCELED*" if so.
>>>>>
>>>>>     • Introduce a new IOCTL to let UMD query *vram_lost_counter*: this way UMD can also block the app from submitting. As @Nicolai mentioned, we can cache one copy of *vram_lost_counter* when enumerating the physical device, and deny all gl-contexts from submitting if the queried counter is bigger than the one cached in the physical device (looks a little overkill to me, but easy to implement).
>>>>>
>>>>>     UMD can also return an error to the app when creating a gl-context if the currently queried *vram_lost_counter* is bigger than the one cached in the physical device.
>>>>>
>>>>> Okay. I already have a patch for this; please review that one if you haven't already done so.
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>     BTW: I realized that a gl-context is a little different from the kernel's context, because for the kernel
>>>>>     a BO is related not to a context but only to an FD, while in UMD a BO has a backing gl-context. So blocking submission in the UMD layer is also needed, although KMD will do its job as the bottom line.
>>>>>
>>>>>     • Basically, *vram_lost_counter* is exposed by the kernel to let UMD take control of the robustness extension feature. It will be UMD's call to make; KMD only denies "guilty" contexts from submitting.
>>>>>
>>>>>     Need your feedback, thanks.
>>>>>
>>>>>     We'd better make the TDR feature land ASAP.
>>>>>
>>>>>     BR Monk
>>>>>
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
* Re: TDR and VRAM lost handling in KMD: @ 2017-10-11 14:04 ` Christian König 0 siblings, 0 replies; 23+ messages in thread
From: Christian König @ 2017-10-11 14:04 UTC (permalink / raw)
To: Liu, Monk, Koenig, Christian, Zhou, David(ChunMing), Haehnle, Nicolai, Olsak, Marek, Deucher, Alexander
Cc: Ramirez, Alejandro, amd-gfx, Filipas, Mario, Ding, Pixel, Li, Bingley, Jiang, Jerry (SW)

Yeah, that was exactly my thinking as well.

Christian.

Am 11.10.2017 um 15:59 schrieb Liu, Monk:
> But if we keep the counter in the entity, there is one issue I suddenly thought of:
>
> For a regular user context, after VRAM loss UMD will be aware that this context is LOST, since we have a counter copy in the context, so user space can close it and re-create one.
> But for a kernel entity there is no U/K interface, so it is the kernel's responsibility to recover this kernel entity, and that makes things complicated…
>
> Emm, I agree that keeping a copy in the context and in the job is a good move.
>
> BR Monk
>
> -----Original Message-----
> From: amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] On Behalf Of Liu, Monk
> Sent: Wednesday, October 11, 2017 9:51 PM
> To: Koenig, Christian <Christian.Koenig@amd.com>; Zhou, David(ChunMing) <David1.Zhou@amd.com>; Haehnle, Nicolai <Nicolai.Haehnle@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>
> Cc: Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; amd-gfx@lists.freedesktop.org; Filipas, Mario <Mario.Filipas@amd.com>; Ding, Pixel <Pixel.Ding@amd.com>; Li, Bingley <Bingley.Li@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>
> Subject: RE: TDR and VRAM lost handling in KMD:
>
>> Some jobs don't have a context (VM updates, clears, buffer moves).
> What?
I remember even the VM update job runs with a kernel entity (no context, true), and if the entity can keep a counter copy, that can solve your concerns.
>
>
> -----Original Message-----
> From: Koenig, Christian
> Sent: Wednesday, October 11, 2017 9:39 PM
> To: Liu, Monk <Monk.Liu@amd.com>; Zhou, David(ChunMing) <David1.Zhou@amd.com>; Haehnle, Nicolai <Nicolai.Haehnle@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>
> Cc: Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; amd-gfx@lists.freedesktop.org; Filipas, Mario <Mario.Filipas@amd.com>; Ding, Pixel <Pixel.Ding@amd.com>; Li, Bingley <Bingley.Li@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>
> Subject: Re: TDR and VRAM lost handling in KMD:
>
> Some jobs don't have a context (VM updates, clears, buffer moves).
>
> I would still like to abort those when they were issued before losing VRAM contents, but keep the entity usable.
>
> So I think we should just keep a copy of the VRAM lost counter in the job. That also removes the burden of figuring out the context during job run.
>
> Regards,
> Christian.
>
> Am 11.10.2017 um 15:35 schrieb Liu, Monk:
>> I think just comparing the copy in the context/entity with the current counter is enough; I don't see how it's better to keep another copy in the job.
>>
>> -----Original Message-----
>> From: Koenig, Christian
>> Sent: Wednesday, October 11, 2017 6:40 PM
>> To: Zhou, David(ChunMing) <David1.Zhou@amd.com>; Liu, Monk <Monk.Liu@amd.com>; Haehnle, Nicolai <Nicolai.Haehnle@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>
>> Cc: Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; amd-gfx@lists.freedesktop.org; Filipas, Mario <Mario.Filipas@amd.com>; Ding, Pixel <Pixel.Ding@amd.com>; Li, Bingley <Bingley.Li@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>
>> Subject: Re: TDR and VRAM lost handling in KMD:
>>
>> I've already posted a patch for this on the mailing list.
>> Basically we just copy the vram lost counter into the job, and when we try to run the job we can mark it as canceled.
>>
>> Regards,
>> Christian.
>>
>> Am 11.10.2017 um 12:14 schrieb Chunming Zhou:
>>> Your summary lacks the below issue:
>>>
>>> What about jobs already pushed into the scheduler queue when VRAM is lost?
>>>
>>> Regards,
>>> David Zhou
>>> On 2017年10月11日 17:41, Liu, Monk wrote:
>>>> Okay, let me summarize our whole idea together and see if it works:
>>>>
>>>> 1. For cs_submit, always check vram_lost_counter first and reject the submit (return -ECANCELED to UMD) if ctx->vram_lost_counter != adev->vram_lost_counter. That way the VRAM lost issue can be handled.
>>>>
>>>> 2. For cs_submit we still need to check whether the incoming context is "AMDGPU_CTX_GUILTY_RESET", even if we found ctx->vram_lost_counter == adev->vram_lost_counter, and we can reject the submit if it is "AMDGPU_CTX_GUILTY_RESET". Correct?
>>>>
>>>> 3. In the gpu_reset() routine, we only mark the hanging job's entity as guilty (so we need to add a new member to the entity structure), and we do not kick it out in the gpu_reset() stage, but we need to set the context behind this entity as "AMDGPU_CTX_GUILTY_RESET". And if the reset introduces VRAM loss, we just update adev->vram_lost_counter but *don't* mark every entity guilty, so still only the hanging job's entity is "guilty". After an entity is marked as "guilty", we find a way to set the context behind it as AMDGPU_CTX_GUILTY_RESET; because this is a U/K interface, we need to let UMD know that this context is wrong.
>>>>
>>>> 4. In the gpu scheduler's run_job() routine, since it only reads the entity, we skip job scheduling once we find the entity is "guilty".
>>>>
>>>> Does the above sound good?
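Christian's patch idea, a vram-lost-counter copy carried in the job and checked in run_job() instead of walking contexts during reset, can be modeled roughly as follows. All names here are illustrative stubs, not actual amdgpu code:

```c
#include <errno.h>
#include <stdatomic.h>

/* Illustrative stubs for the pieces discussed in the thread. */
struct toy_device {
    atomic_uint vram_lost_counter;  /* bumped by the reset path on VRAM loss */
};

struct toy_job {
    unsigned vram_lost_counter;     /* snapshot taken at submit time */
    int      fence_error;
};

/* At submit time the job copies the current counter... */
static void toy_job_init(struct toy_job *job, struct toy_device *adev)
{
    job->vram_lost_counter = atomic_load(&adev->vram_lost_counter);
    job->fence_error = 0;
}

/* ...and run_job() cancels it instead of pushing it to the ring if a
 * VRAM loss happened in between. No per-context walk during reset. */
static int toy_run_job(struct toy_job *job, struct toy_device *adev)
{
    if (job->vram_lost_counter != atomic_load(&adev->vram_lost_counter)) {
        job->fence_error = -ECANCELED;  /* force-signal the fence with error */
        return -ECANCELED;
    }
    return 0;                           /* schedule to the hardware ring */
}
```

This also answers David Zhou's question: a job already sitting in the scheduler queue when VRAM is lost carries a stale counter and gets canceled at run time.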
>>>>
>>>> -----Original Message-----
>>>> From: Haehnle, Nicolai
>>>> Sent: Wednesday, October 11, 2017 5:26 PM
>>>> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>
>>>> Cc: amd-gfx@lists.freedesktop.org; Ding, Pixel <Pixel.Ding@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, Bingley <Bingley.Li@amd.com>; Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; Filipas, Mario <Mario.Filipas@amd.com>
>>>> Subject: Re: TDR and VRAM lost handling in KMD:
>>>>
>>>> On 11.10.2017 11:18, Liu, Monk wrote:
>>>>> Let's keep it simple: when a VRAM loss hits, what is the action for amdgpu_ctx_query()/AMDGPU_CTX_OP_QUERY_STATE on other contexts (not the one that triggered the gpu hang)? Do you mean we return -ENODEV to UMD?
>>>> It should successfully return AMDGPU_CTX_INNOCENT_RESET.
>>>>
>>>>> In cs_submit, with a VRAM loss hit, if we don't mark all contexts as "guilty", how do we block them from submitting? Can you show some implementation?
>>>> if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter))
>>>>     return -ECANCELED;
>>>>
>>>> (where ctx->vram_lost_counter is initialized at context creation time and never changed afterwards)
>>>>
>>>>> BTW: the "guilty" here is a new member I want to add to the context; it is not related to the AMDGPU_CTX_OP_QUERY_STATE U/K interface. Looks like I need to unify them, with only one place to mark guilty or not.
>>>> Right, the AMDGPU_CTX_OP_QUERY_STATE handling needs to be made consistent with the rest.
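Nicolai's two-line check, combined with the per-context guilty flag Monk wants cs_submit to honor, would make the submit-time gate look roughly like this (a hedged sketch with stub types; the real amdgpu structures differ):

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

struct toy_device {
    atomic_uint vram_lost_counter;
};

struct toy_ctx {
    unsigned vram_lost_counter;  /* initialized at context creation, never changed */
    bool     guilty;             /* set when one of this context's jobs hung */
};

/* The submit-time gate: reject contexts that predate a VRAM loss, and
 * reject contexts whose own job caused a hang. */
static int toy_cs_submit_check(const struct toy_ctx *ctx, struct toy_device *adev)
{
    if (ctx->vram_lost_counter != atomic_load(&adev->vram_lost_counter))
        return -ECANCELED;
    if (ctx->guilty)
        return -ECANCELED;
    return 0;
}
```

With this gate, the reset path never has to iterate over all living contexts on a VRAM loss; bumping the device counter is enough to reject every pre-loss context lazily at its next submit.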
>>>>
>>>> Cheers,
>>>> Nicolai
>>>>
>>>>> BR Monk
>>>>>
>>>>> -----Original Message-----
>>>>> From: Haehnle, Nicolai
>>>>> Sent: Wednesday, October 11, 2017 5:00 PM
>>>>> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>
>>>>> Cc: amd-gfx@lists.freedesktop.org; Ding, Pixel <Pixel.Ding@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, Bingley <Bingley.Li@amd.com>; Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; Filipas, Mario <Mario.Filipas@amd.com>
>>>>> Subject: Re: TDR and VRAM lost handling in KMD:
>>>>>
>>>>> On 11.10.2017 10:48, Liu, Monk wrote:
>>>>>> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), so it's reasonable to use it. However, it /does not/ make sense to mark idle contexts as "guilty" just because VRAM is lost. VRAM loss is a perfect example where the driver should report context lost to applications with the "innocent" flag for contexts that were idle at the time of reset. The only context(s) that should be reported as "guilty" (or perhaps "unknown" in some cases) are the ones that were executing at the time of reset.
>>>>>>
>>>>>> ML: KMD marking all contexts as guilty is so that we can unify our IOCTL behavior: e.g. the IOCTLs only block the "guilty" context, with no need to worry about the vram-lost-counter anymore. That's an implementation style; I don't think it is related to the UMD layer.
>>>>>>
>>>>>> For UMD, the gl-context isn't known by KMD, so UMD can implement its own "guilty" gl-context if it wants.
>>>>> Well, to some extent this is just semantics, but it helps to keep the terminology consistent.
>>>>>
>>>>> Most importantly, please keep the AMDGPU_CTX_OP_QUERY_STATE uapi in mind: this returns one of AMDGPU_CTX_{GUILTY,INNOCENT,UNKNOWN}_RESET, and it must return "innocent" for contexts that are only lost due to VRAM loss without being otherwise involved in the timeout that led to the reset.
>>>>>
>>>>> The point is that in the places where you used "guilty" it would be better to use "context lost", and then further differentiate between guilty/innocent context lost based on the details of what happened.
>>>>>
>>>>>> If KMD doesn't mark all ctx as guilty after VRAM loss, can you illustrate what rule KMD should obey for the check in a KMS IOCTL like cs_submit? Let's see which way is better.
>>>>> if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter))
>>>>>     return -ECANCELED;
>>>>>
>>>>> Plus similar logic for AMDGPU_CTX_OP_QUERY_STATE.
>>>>>
>>>>> Yes, it's one additional check in cs_submit. If you're worried about that (and Christian's concerns about possible issues with walking over all contexts are addressed), I suppose you could just store a per-context
>>>>>
>>>>>     unsigned context_reset_status;
>>>>>
>>>>> instead of a `bool guilty`. Its value would start out as 0 (AMDGPU_CTX_NO_RESET) and would be set to the correct value during reset.
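The per-context `context_reset_status` suggestion can be sketched like this; the enum values are illustrative placeholders for the real AMDGPU_CTX_*_RESET uapi constants, and the structs are stubs:

```c
#include <stdatomic.h>

/* Placeholder values; the real uapi defines its own constants. */
enum toy_reset_status {
    TOY_CTX_NO_RESET = 0,
    TOY_CTX_GUILTY_RESET,
    TOY_CTX_INNOCENT_RESET,
    TOY_CTX_UNKNOWN_RESET,
};

struct toy_device { atomic_uint vram_lost_counter; };

struct toy_ctx {
    unsigned vram_lost_counter;      /* snapshot from context creation */
    unsigned reset_status;           /* starts at TOY_CTX_NO_RESET */
};

/* A context involved in a hang reports its recorded status; one that
 * merely predates a VRAM loss reports "innocent", as Nicolai describes. */
static unsigned toy_ctx_query_reset_status(const struct toy_ctx *ctx,
                                           struct toy_device *adev)
{
    if (ctx->reset_status != TOY_CTX_NO_RESET)
        return ctx->reset_status;
    if (ctx->vram_lost_counter != atomic_load(&adev->vram_lost_counter))
        return TOY_CTX_INNOCENT_RESET;
    return TOY_CTX_NO_RESET;
}
```

The reset path only writes `reset_status` for the context behind the hanging job, so idle contexts caught by a VRAM loss fall through to the counter comparison and come back "innocent".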
>>>>> >>>>> Cheers, >>>>> Nicolai >>>>> >>>>> >>>>>> *From:*Haehnle, Nicolai >>>>>> *Sent:* Wednesday, October 11, 2017 4:41 PM >>>>>> *To:* Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian >>>>>> <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; >>>>>> Deucher, Alexander <Alexander.Deucher@amd.com> >>>>>> *Cc:* amd-gfx@lists.freedesktop.org; Ding, Pixel >>>>>> <Pixel.Ding@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, >>>>>> Bingley <Bingley.Li@amd.com>; Ramirez, Alejandro >>>>>> <Alejandro.Ramirez@amd.com>; Filipas, Mario >>>>>> <Mario.Filipas@amd.com> >>>>>> *Subject:* Re: TDR and VRAM lost handling in KMD: >>>>>> >>>>>> From a Mesa perspective, this almost all sounds reasonable to me. >>>>>> >>>>>> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), >>>>>> so it's reasonable to use it. However, it /does not/ make sense to >>>>>> mark idle contexts as "guilty" just because VRAM is lost. VRAM >>>>>> lost is a perfect example where the driver should report context >>>>>> lost to applications with the "innocent" flag for contexts that >>>>>> were idle at the time of reset. The only context(s) that should be >>>>>> reported as "guilty" >>>>>> (or perhaps "unknown" in some cases) are the ones that were >>>>>> executing at the time of reset. >>>>>> >>>>>> On whether the whole context is marked as guilty from a user space >>>>>> perspective, it would simply be nice for user space to get >>>>>> consistent answers. It would be a bit odd if we could e.g. succeed >>>>>> in submitting an SDMA job after a GFX job was rejected. This would >>>>>> point in favor of marking the entire context as guilty (although >>>>>> that could happen lazily instead of at reset time). On the other >>>>>> hand, if that's too big a burden for the kernel implementation I'm >>>>>> sure we can live without it. 
>>>>>
>>>>>> Cheers,
>>>>>> Nicolai
>>>>>>
>>>>>> [snip: earlier messages in the thread, already quoted in full above]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
* Re: TDR and VRAM lost handling in KMD: [not found] ` <BLUPR12MB044961DEE4326E94A156ED05844A0-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org> 2017-10-11 13:59 ` Liu, Monk @ 2017-10-11 14:03 ` Christian König [not found] ` <02bb9f77-bcc6-8a24-e9b0-8f3f260d74d8-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> 1 sibling, 1 reply; 23+ messages in thread From: Christian König @ 2017-10-11 14:03 UTC (permalink / raw) To: Liu, Monk, Koenig, Christian, Zhou, David(ChunMing), Haehnle, Nicolai, Olsak, Marek, Deucher, Alexander Cc: Ramirez, Alejandro, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Filipas, Mario, Ding, Pixel, Li, Bingley, Jiang, Jerry (SW) > I remember even the VM update job is with a kernel entity, (no context is true), and if entity can keep a counter copy That won't work. We want to keep the entities associated with VM updates and buffer moves alive, but their jobs canceled. Regards, Christian. Am 11.10.2017 um 15:51 schrieb Liu, Monk: >> Some jobs don't have a context (VM updates, clears, buffer moves). > What? I remember even the VM update job is with a kernel entity, (no context is true), and if entity can keep a counter copy > That can solve your concerns > > > > -----Original Message----- > From: Koenig, Christian > Sent: Wednesday, October 11, 2017 9:39 PM > To: Liu, Monk <Monk.Liu@amd.com>; Zhou, David(ChunMing) <David1.Zhou@amd.com>; Haehnle, Nicolai <Nicolai.Haehnle@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com> > Cc: Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; amd-gfx@lists.freedesktop.org; Filipas, Mario <Mario.Filipas@amd.com>; Ding, Pixel <Pixel.Ding@amd.com>; Li, Bingley <Bingley.Li@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com> > Subject: Re: TDR and VRAM lost handling in KMD: > > Some jobs don't have a context (VM updates, clears, buffer moves). > > I would still like to abort those when they where issued before a losing VRAM content, but keep the entity usable. 
> > So I think we should just keep a copy of the VRAM lost counter in the job. That also removes us from the burden of figuring out the context during job run. > > Regards, > Christian. > > Am 11.10.2017 um 15:35 schrieb Liu, Monk: >> I think just compare the copy from context/entity with current counter >> is enough, don't see how it's better to keep another copy in JOB >> >> >> -----Original Message----- >> From: Koenig, Christian >> Sent: Wednesday, October 11, 2017 6:40 PM >> To: Zhou, David(ChunMing) <David1.Zhou@amd.com>; Liu, Monk >> <Monk.Liu@amd.com>; Haehnle, Nicolai <Nicolai.Haehnle@amd.com>; Olsak, >> Marek <Marek.Olsak@amd.com>; Deucher, Alexander >> <Alexander.Deucher@amd.com> >> Cc: Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; >> amd-gfx@lists.freedesktop.org; Filipas, Mario <Mario.Filipas@amd.com>; >> Ding, Pixel <Pixel.Ding@amd.com>; Li, Bingley <Bingley.Li@amd.com>; >> Jiang, Jerry (SW) <Jerry.Jiang@amd.com> >> Subject: Re: TDR and VRAM lost handling in KMD: >> >> I've already posted a patch for this on the mailing list. >> >> Basically we just copy the vram lost counter into the job and when we try to run the job we can mark it as canceled. >> >> Regards, >> Christian. >> >> Am 11.10.2017 um 12:14 schrieb Chunming Zhou: >>> Your summary lacks the below issue: >>> >>> How about the job already pushed in scheduler queue when vram is lost? >>> >>> >>> Regards, >>> David Zhou >>> On 2017年10月11日 17:41, Liu, Monk wrote: >>>> Okay, let me summary our whole idea together and see if it works: >>>> >>>> 1, For cs_submit, always check vram-lost_counter first and reject >>>> the submit (return -ECANCLED to UMD) if ctx->vram_lost_counter != >>>> adev->vram_lost_counter. 
That way the vram lost issue can be handled >>>> >>>> 2, for cs_submit we still need to check if the incoming context is >>>> "AMDGPU_CTX_GUILTY_RESET" or not even if we found >>>> ctx->vram_lost_counter == adev->vram_lost_counter, and we can reject >>>> the submit >>>> If it is "AMDGPU_CTX_GUILTY_RESET", correct ? >>>> >>>> 3, in gpu_reset() routine, we only mark the hang job's entity as >>>> guilty (so we need to add new member in entity structure), and not >>>> kick it out in gpu_reset() stage, but we need to set the context >>>> behind this entity as " AMDGPU_CTX_GUILTY_RESET" >>>> And if reset introduces VRAM LOST, we just update >>>> adev->vram_lost_counter, but *don't* change all entity to guilty, so >>>> still only the hang job's entity is "guilty" >>>> After some entity marked as "guilty", we find a way to set the >>>> context behind it as AMDGPU_CTX_GUILTY_RESET, because this is U/K >>>> interface, we need let UMD can know that this context is wrong. >>>> >>>> 4, in gpu scheduler's run_job() routine, since it only reads entity, >>>> so we skip job scheduling once found the entity is "guilty" >>>> >>>> >>>> Does above sounds good ? 
>>>> >>>> >>>> >>>> -----Original Message----- >>>> From: Haehnle, Nicolai >>>> Sent: Wednesday, October 11, 2017 5:26 PM >>>> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian >>>> <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; >>>> Deucher, Alexander <Alexander.Deucher@amd.com> >>>> Cc: amd-gfx@lists.freedesktop.org; Ding, Pixel <Pixel.Ding@amd.com>; >>>> Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, Bingley >>>> <Bingley.Li@amd.com>; Ramirez, Alejandro >>>> <Alejandro.Ramirez@amd.com>; Filipas, Mario <Mario.Filipas@amd.com> >>>> Subject: Re: TDR and VRAM lost handling in KMD: >>>> >>>> On 11.10.2017 11:18, Liu, Monk wrote: >>>>> Let's talk it simple, When vram lost hit, what's the action for >>>>> amdgpu_ctx_query()/AMDGPU_CTX_OP_QUERY_STATE on other contexts (not >>>>> the one trigger gpu hang) after vram lost ? do you mean we return >>>>> -ENODEV to UMD ? >>>> It should successfully return AMDGPU_CTX_INNOCENT_RESET. >>>> >>>> >>>>> In cs_submit, with vram lost hit, if we don't mark all contexts as >>>>> "guilty", how we block its from submitting ? can you show some >>>>> implement way >>>> if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter)) >>>> return -ECANCELED; >>>> >>>> (where ctx->vram_lost_counter is initialized at context creation >>>> time and never changed afterwards) >>>> >>>> >>>>> BTW: the "guilty" here is a new member I want to add to context, it >>>>> is not related with AMDGPU_CTX_OP_QUERY_STATE UK interface, Looks I >>>>> need to unify them and only one place to mark guilty or not >>>> Right, the AMDGPU_CTX_OP_QUERY_STATE handling needs to be made >>>> consistent with the rest. 
>>>> >>>> Cheers, >>>> Nicolai >>>> >>>> >>>>> BR Monk >>>>> >>>>> -----Original Message----- >>>>> From: Haehnle, Nicolai >>>>> Sent: Wednesday, October 11, 2017 5:00 PM >>>>> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian >>>>> <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; >>>>> Deucher, Alexander <Alexander.Deucher@amd.com> >>>>> Cc: amd-gfx@lists.freedesktop.org; Ding, Pixel >>>>> <Pixel.Ding@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, >>>>> Bingley <Bingley.Li@amd.com>; Ramirez, Alejandro >>>>> <Alejandro.Ramirez@amd.com>; Filipas, Mario <Mario.Filipas@amd.com> >>>>> Subject: Re: TDR and VRAM lost handling in KMD: >>>>> >>>>> On 11.10.2017 10:48, Liu, Monk wrote: >>>>>> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), >>>>>> so it's reasonable to use it. However, it /does not/ make sense to >>>>>> mark idle contexts as "guilty" just because VRAM is lost. VRAM >>>>>> lost is a perfect example where the driver should report context >>>>>> lost to applications with the "innocent" flag for contexts that >>>>>> were idle at the time of reset. The only context(s) that should be >>>>>> reported as "guilty" >>>>>> (or perhaps "unknown" in some cases) are the ones that were >>>>>> executing at the time of reset. >>>>>> >>>>>> ML: KMD mark all contexts as guilty is because that way we can >>>>>> unify our IOCTL behavior: e.g. for IOCTL only block >>>>>> “guilty”context , no need to worry about vram-lost-counter >>>>>> anymore, that’s a implementation style. I don’t think it is >>>>>> related with UMD layer, >>>>>> >>>>>> For UMD the gl-context isn’t aware of by KMD, so UMD can implement >>>>>> it own “guilty” gl-context if you want. >>>>> Well, to some extent this is just semantics, but it helps to keep >>>>> the terminology consistent. 
>>>>> >>>>> Most importantly, please keep the AMDGPU_CTX_OP_QUERY_STATE uapi in >>>>> mind: this returns one of >>>>> AMDGPU_CTX_{GUILTY,INNOCENT,UNKNOWN}_RECENT, >>>>> and it must return "innocent" for contexts that are only lost due >>>>> to VRAM lost without being otherwise involved in the timeout that >>>>> lead to the reset. >>>>> >>>>> The point is that in the places where you used "guilty" it would be >>>>> better to use "context lost", and then further differentiate >>>>> between guilty/innocent context lost based on the details of what happened. >>>>> >>>>> >>>>>> If KMD doesn’t mark all ctx as guilty after VRAM lost, can you >>>>>> illustrate what rule KMD should obey to check in KMS IOCTL like >>>>>> cs_sumbit ?? let’s see which way better >>>>> if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter)) >>>>> return -ECANCELED; >>>>> >>>>> Plus similar logic for AMDGPU_CTX_OP_QUERY_STATE. >>>>> >>>>> Yes, it's one additional check in cs_submit. If you're worried >>>>> about that (and Christian's concerns about possible issues with >>>>> walking over all contexts are addressed), I suppose you could just >>>>> store a per-context >>>>> >>>>> unsigned context_reset_status; >>>>> >>>>> instead of a `bool guilty`. Its value would start out as 0 >>>>> (AMDGPU_CTX_NO_RESET) and would be set to the correct value during >>>>> reset. 
>>>>> >>>>> Cheers, >>>>> Nicolai >>>>> >>>>> >>>>>> *From:*Haehnle, Nicolai >>>>>> *Sent:* Wednesday, October 11, 2017 4:41 PM >>>>>> *To:* Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian >>>>>> <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; >>>>>> Deucher, Alexander <Alexander.Deucher@amd.com> >>>>>> *Cc:* amd-gfx@lists.freedesktop.org; Ding, Pixel >>>>>> <Pixel.Ding@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, >>>>>> Bingley <Bingley.Li@amd.com>; Ramirez, Alejandro >>>>>> <Alejandro.Ramirez@amd.com>; Filipas, Mario >>>>>> <Mario.Filipas@amd.com> >>>>>> *Subject:* Re: TDR and VRAM lost handling in KMD: >>>>>> >>>>>> From a Mesa perspective, this almost all sounds reasonable to me. >>>>>> >>>>>> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), >>>>>> so it's reasonable to use it. However, it /does not/ make sense to >>>>>> mark idle contexts as "guilty" just because VRAM is lost. VRAM >>>>>> lost is a perfect example where the driver should report context >>>>>> lost to applications with the "innocent" flag for contexts that >>>>>> were idle at the time of reset. The only context(s) that should be >>>>>> reported as "guilty" >>>>>> (or perhaps "unknown" in some cases) are the ones that were >>>>>> executing at the time of reset. >>>>>> >>>>>> On whether the whole context is marked as guilty from a user space >>>>>> perspective, it would simply be nice for user space to get >>>>>> consistent answers. It would be a bit odd if we could e.g. succeed >>>>>> in submitting an SDMA job after a GFX job was rejected. This would >>>>>> point in favor of marking the entire context as guilty (although >>>>>> that could happen lazily instead of at reset time). On the other >>>>>> hand, if that's too big a burden for the kernel implementation I'm >>>>>> sure we can live without it. 
>>>>>> >>>>>> Cheers, >>>>>> >>>>>> Nicolai >>>>>> >>>>>> ------------------------------------------------------------------ >>>>>> - >>>>>> --- >>>>>> -- >>>>>> >>>>>> *From:*Liu, Monk >>>>>> *Sent:* Wednesday, October 11, 2017 10:15:40 AM >>>>>> *To:* Koenig, Christian; Haehnle, Nicolai; Olsak, Marek; Deucher, >>>>>> Alexander >>>>>> *Cc:* amd-gfx@lists.freedesktop.org >>>>>> <mailto:amd-gfx@lists.freedesktop.org>; Ding, Pixel; Jiang, Jerry >>>>>> (SW); Li, Bingley; Ramirez, Alejandro; Filipas, Mario >>>>>> *Subject:* RE: TDR and VRAM lost handling in KMD: >>>>>> >>>>>> 1.Set its fence error status to “*ETIME*”, >>>>>> >>>>>> No, as I already explained ETIME is for synchronous operation. >>>>>> >>>>>> In other words when we return ETIME from the wait IOCTL it would >>>>>> mean that the waiting has somehow timed out, but not the job we waited for. >>>>>> >>>>>> Please use ECANCELED as well or some other error code when we find >>>>>> that we need to distinct the timedout job from the canceled ones >>>>>> (probably a good idea, but I'm not sure). >>>>>> >>>>>> [ML] I’m okay if you insist not to use ETIME >>>>>> >>>>>> 1.Find the entity/ctx behind this job, and set this ctx as “*guilty*” >>>>>> >>>>>> Not sure. Do we want to set the whole context as guilty or just >>>>>> the entity? >>>>>> >>>>>> Setting the whole contexts as guilty sounds racy to me. >>>>>> >>>>>> BTW: We should use a different name than "guilty", maybe just >>>>>> "bool canceled;" ? >>>>>> >>>>>> [ML] I think context is better than entity, because for example if >>>>>> you only block entity_0 of context and allow entity_N run, that >>>>>> means the dependency between entities are broken (e.g. 
page table >>>>>> updates in >>>>>> >>>>>> Sdma entity pass but gfx submit in GFX entity blocked, not make >>>>>> sense to me) >>>>>> >>>>>> We’d better either block the whole context or let not… >>>>>> >>>>>> 1.Kick out all jobs in this “guilty” ctx’s KFIFO queue, and set >>>>>> all their fence status to “*ECANCELED*” >>>>>> >>>>>> Setting ECANCELED should be ok. But I think we should do this when >>>>>> we try to run the jobs and not during GPU reset. >>>>>> >>>>>> [ML] without deep thought and expritment, I’m not sure the >>>>>> difference between them, but kick it out in gpu_reset routine is >>>>>> more efficient, >>>>>> >>>>>> Otherwise you need to check context/entity guilty flag in run_job >>>>>> routine …and you need to it for every context/entity, I don’t see >>>>>> why >>>>>> >>>>>> We don’t just kickout all of them in gpu_reset stage …. >>>>>> >>>>>> a)Iterate over all living ctx, and set all ctx as “*guilty*” since >>>>>> VRAM lost actually ruins all VRAM contents >>>>>> >>>>>> No, that shouldn't be done by comparing the counters. Iterating >>>>>> over all contexts is way to much overhead. >>>>>> >>>>>> [ML] because I want to make KMS IOCTL rules clean, like they don’t >>>>>> need to differentiate VRAM lost or not, they only interested in if >>>>>> the context is guilty or not, and block >>>>>> >>>>>> Submit for guilty ones. >>>>>> >>>>>> *Can you give more details of your idea? And better the detail >>>>>> implement in cs_submit, I want to see how you want to block submit >>>>>> without checking context guilty flag* >>>>>> >>>>>> a)Kick out all jobs in all ctx’s KFIFO queue, and set all their >>>>>> fence status to “*ECANCELDED*” >>>>>> >>>>>> Yes and no, that should be done when we try to run the jobs and >>>>>> not during GPU reset. >>>>>> >>>>>> [ML] again, kicking out them in gpu reset routine is high >>>>>> efficient, otherwise you need check on every job in run_job() >>>>>> >>>>>> Besides, can you illustrate the detail implementation ? 
>>>>>> >>>>>> Yes and no, dma_fence_get_status() is some specific handling for >>>>>> sync_file debugging (no idea why that made it into the common >>>>>> fence code). >>>>>> >>>>>> It was replaced by putting the error code directly into the fence, >>>>>> so just reading that one after waiting should be ok. >>>>>> >>>>>> Maybe we should fix dma_fence_get_status() to do the right thing >>>>>> for this? >>>>>> >>>>>> [ML] yeah, that’s too confusing, the name sound really the one I >>>>>> want to use, we should change it… >>>>>> >>>>>> *But look into the implement, I don**’t see why we cannot use it ? >>>>>> it also finally return the fence->error * >>>>>> >>>>>> *From:*Koenig, Christian >>>>>> *Sent:* Wednesday, October 11, 2017 3:21 PM >>>>>> *To:* Liu, Monk <Monk.Liu@amd.com <mailto:Monk.Liu@amd.com>>; >>>>>> Haehnle, Nicolai <Nicolai.Haehnle@amd.com >>>>>> <mailto:Nicolai.Haehnle@amd.com>>; >>>>>> Olsak, Marek <Marek.Olsak@amd.com <mailto:Marek.Olsak@amd.com>>; >>>>>> Deucher, Alexander <Alexander.Deucher@amd.com >>>>>> <mailto:Alexander.Deucher@amd.com>> >>>>>> *Cc:* amd-gfx@lists.freedesktop.org >>>>>> <mailto:amd-gfx@lists.freedesktop.org>; Ding, Pixel >>>>>> <Pixel.Ding@amd.com <mailto:Pixel.Ding@amd.com>>; Jiang, Jerry >>>>>> (SW) <Jerry.Jiang@amd.com <mailto:Jerry.Jiang@amd.com>>; Li, >>>>>> Bingley <Bingley.Li@amd.com <mailto:Bingley.Li@amd.com>>; Ramirez, >>>>>> Alejandro <Alejandro.Ramirez@amd.com >>>>>> <mailto:Alejandro.Ramirez@amd.com>>; >>>>>> Filipas, Mario <Mario.Filipas@amd.com >>>>>> <mailto:Mario.Filipas@amd.com>> >>>>>> *Subject:* Re: TDR and VRAM lost handling in KMD: >>>>>> >>>>>> See inline: >>>>>> >>>>>> Am 11.10.2017 um 07:33 schrieb Liu, Monk: >>>>>> >>>>>> Hi Christian & Nicolai, >>>>>> >>>>>> We need to achieve some agreements on what should MESA/UMD >>>>>> do and >>>>>> what should KMD do, *please give your comments with “okay” >>>>>> or “No” >>>>>> and your idea on below items,* >>>>>> >>>>>> ?When a job timed out (set from 
lockup_timeout kernel >>>>>> parameter), >>>>>> What KMD should do in TDR routine : >>>>>> >>>>>> 1.Update adev->*gpu_reset_counter*, and stop scheduler first, >>>>>> (*gpu_reset_counter* is used to force vm flush after GPU >>>>>> reset, out >>>>>> of this thread’s scope so no more discussion on it) >>>>>> >>>>>> Okay. >>>>>> >>>>>> 2.Set its fence error status to “*ETIME*”, >>>>>> >>>>>> No, as I already explained ETIME is for synchronous operation. >>>>>> >>>>>> In other words when we return ETIME from the wait IOCTL it would >>>>>> mean that the waiting has somehow timed out, but not the job we waited for. >>>>>> >>>>>> Please use ECANCELED as well or some other error code when we find >>>>>> that we need to distinct the timedout job from the canceled ones >>>>>> (probably a good idea, but I'm not sure). >>>>>> >>>>>> 3.Find the entity/ctx behind this job, and set this ctx as >>>>>> “*guilty*” >>>>>> >>>>>> Not sure. Do we want to set the whole context as guilty or just >>>>>> the entity? >>>>>> >>>>>> Setting the whole contexts as guilty sounds racy to me. >>>>>> >>>>>> BTW: We should use a different name than "guilty", maybe just >>>>>> "bool canceled;" ? >>>>>> >>>>>> 4.Kick out this job from scheduler’s mirror list, so this >>>>>> job won’t >>>>>> get re-scheduled to ring anymore. >>>>>> >>>>>> Okay. >>>>>> >>>>>> 5.Kick out all jobs in this “guilty” ctx’s KFIFO queue, and >>>>>> set all >>>>>> their fence status to “*ECANCELED*” >>>>>> >>>>>> Setting ECANCELED should be ok. But I think we should do this when >>>>>> we try to run the jobs and not during GPU reset. >>>>>> >>>>>> 6.Force signal all fences that get kicked out by above two >>>>>> steps,*otherwise UMD will block forever if waiting on those >>>>>> fences* >>>>>> >>>>>> Okay. >>>>>> >>>>>> 7.Do gpu reset, which is can be some callbacks to let >>>>>> bare-metal and >>>>>> SR-IOV implement with their favor style >>>>>> >>>>>> Okay. 
>>>>>> >>>>>> 8.After reset, KMD need to aware if the VRAM lost happens >>>>>> or not, >>>>>> bare-metal can implement some function to judge, while for >>>>>> SR-IOV I >>>>>> prefer to read it from GIM side (for initial version we consider >>>>>> it’s always VRAM lost, till GIM side change aligned) >>>>>> >>>>>> Okay. >>>>>> >>>>>> 9.If VRAM lost not hit, continue, otherwise: >>>>>> >>>>>> a)Update adev->*vram_lost_counter*, >>>>>> >>>>>> Okay. >>>>>> >>>>>> b)Iterate over all living ctx, and set all ctx as “*guilty*” >>>>>> since >>>>>> VRAM lost actually ruins all VRAM contents >>>>>> >>>>>> No, that shouldn't be done by comparing the counters. Iterating >>>>>> over all contexts is way to much overhead. >>>>>> >>>>>> c)Kick out all jobs in all ctx’s KFIFO queue, and set all their >>>>>> fence status to “*ECANCELDED*” >>>>>> >>>>>> Yes and no, that should be done when we try to run the jobs and >>>>>> not during GPU reset. >>>>>> >>>>>> 10.Do GTT recovery and VRAM page tables/entries recovery >>>>>> (optional, >>>>>> do we need it ???) >>>>>> >>>>>> Yes, that is still needed. As Nicolai explained we can't be sure >>>>>> that VRAM is still 100% correct even when it isn't cleared. >>>>>> >>>>>> 11.Re-schedule all JOBs remains in mirror list to ring again and >>>>>> restart scheduler (for VRAM lost case, no JOB will >>>>>> re-scheduled) >>>>>> >>>>>> Okay. >>>>>> >>>>>> ?For cs_wait() IOCTL: >>>>>> >>>>>> After it found fence signaled, it should check with >>>>>> *“dma_fence_get_status” *to see if there is error there, >>>>>> >>>>>> And return the error status of fence >>>>>> >>>>>> Yes and no, dma_fence_get_status() is some specific handling for >>>>>> sync_file debugging (no idea why that made it into the common >>>>>> fence code). >>>>>> >>>>>> It was replaced by putting the error code directly into the fence, >>>>>> so just reading that one after waiting should be ok. >>>>>> >>>>>> Maybe we should fix dma_fence_get_status() to do the right thing >>>>>> for this? 
>>>>>> >>>>>> ?For cs_wait_fences() IOCTL: >>>>>> >>>>>> Similar with above approach >>>>>> >>>>>> ?For cs_submit() IOCTL: >>>>>> >>>>>> It need to check if current ctx been marked as “*guilty*” >>>>>> and return >>>>>> “*ECANCELED*” if so >>>>>> >>>>>> ?Introduce a new IOCTL to let UMD query *vram_lost_counter*: >>>>>> >>>>>> This way, UMD can also block app from submitting, like @Nicolai >>>>>> mentioned, we can cache one copy of *vram_lost_counter* when >>>>>> enumerate physical device, and deny all >>>>>> >>>>>> gl-context from submitting if the counter queried bigger >>>>>> than that >>>>>> one cached in physical device. (looks a little overkill to >>>>>> me, but >>>>>> easy to implement ) >>>>>> >>>>>> UMD can also return error to APP when creating gl-context >>>>>> if found >>>>>> current queried*vram_lost_counter *bigger than that one >>>>>> cached in >>>>>> physical device. >>>>>> >>>>>> Okay. Already have a patch for this, please review that one if you >>>>>> haven't already done so. >>>>>> >>>>>> Regards, >>>>>> Christian. >>>>>> >>>>>> BTW: I realized that gl-context is a little different with >>>>>> kernel’s >>>>>> context. Because for kernel. 
BO is not related with context >>>>>> but only >>>>>> with FD, while in UMD, BO have a backend >>>>>> >>>>>> gl-context, so block submitting in UMD layer is also needed >>>>>> although >>>>>> KMD will do its job as bottom line >>>>>> >>>>>> ?Basically “vram_lost_counter” is exposure by kernel to let >>>>>> UMD take >>>>>> the control of robust extension feature, it will be UMD’s >>>>>> call to >>>>>> move, KMD only deny “guilty” context from submitting >>>>>> >>>>>> Need your feedback, thx >>>>>> >>>>>> We’d better make TDR feature landed ASAP >>>>>> >>>>>> BR Monk >>>>>> >>>> _______________________________________________ >>>> amd-gfx mailing list >>>> amd-gfx@lists.freedesktop.org >>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx > > _______________________________________________ > amd-gfx mailing list > amd-gfx@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/amd-gfx _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply [flat|nested] 23+ messages in thread
* RE: TDR and VRAM lost handling in KMD: [not found] ` <02bb9f77-bcc6-8a24-e9b0-8f3f260d74d8-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> @ 2017-10-11 14:04 ` Liu, Monk 0 siblings, 0 replies; 23+ messages in thread From: Liu, Monk @ 2017-10-11 14:04 UTC (permalink / raw) To: Koenig, Christian, Zhou, David(ChunMing), Haehnle, Nicolai, Olsak, Marek, Deucher, Alexander Cc: Ramirez, Alejandro, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Filipas, Mario, Ding, Pixel, Li, Bingley, Jiang, Jerry (SW) Yeah, I just thought of it, agree that shouldn't keep copy in entity, otherwise too complicated to handle BR Monk -----Original Message----- From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com] Sent: Wednesday, October 11, 2017 10:04 PM To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Zhou, David(ChunMing) <David1.Zhou@amd.com>; Haehnle, Nicolai <Nicolai.Haehnle@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com> Cc: Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; amd-gfx@lists.freedesktop.org; Filipas, Mario <Mario.Filipas@amd.com>; Ding, Pixel <Pixel.Ding@amd.com>; Li, Bingley <Bingley.Li@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com> Subject: Re: TDR and VRAM lost handling in KMD: > I remember even the VM update job is with a kernel entity, (no context > is true), and if entity can keep a counter copy That won't work. We want to keep the entities associated with VM updates and buffer moves alive, but their jobs canceled. Regards, Christian. Am 11.10.2017 um 15:51 schrieb Liu, Monk: >> Some jobs don't have a context (VM updates, clears, buffer moves). > What? 
I remember even the VM update job is with a kernel entity, (no > context is true), and if entity can keep a counter copy That can solve > your concerns > > > > -----Original Message----- > From: Koenig, Christian > Sent: Wednesday, October 11, 2017 9:39 PM > To: Liu, Monk <Monk.Liu@amd.com>; Zhou, David(ChunMing) > <David1.Zhou@amd.com>; Haehnle, Nicolai <Nicolai.Haehnle@amd.com>; > Olsak, Marek <Marek.Olsak@amd.com>; Deucher, Alexander > <Alexander.Deucher@amd.com> > Cc: Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; > amd-gfx@lists.freedesktop.org; Filipas, Mario <Mario.Filipas@amd.com>; > Ding, Pixel <Pixel.Ding@amd.com>; Li, Bingley <Bingley.Li@amd.com>; > Jiang, Jerry (SW) <Jerry.Jiang@amd.com> > Subject: Re: TDR and VRAM lost handling in KMD: > > Some jobs don't have a context (VM updates, clears, buffer moves). > > I would still like to abort those when they where issued before a losing VRAM content, but keep the entity usable. > > So I think we should just keep a copy of the VRAM lost counter in the job. That also removes us from the burden of figuring out the context during job run. > > Regards, > Christian. 
> > Am 11.10.2017 um 15:35 schrieb Liu, Monk: >> I think just compare the copy from context/entity with current >> counter is enough, don't see how it's better to keep another copy in >> JOB >> >> >> -----Original Message----- >> From: Koenig, Christian >> Sent: Wednesday, October 11, 2017 6:40 PM >> To: Zhou, David(ChunMing) <David1.Zhou@amd.com>; Liu, Monk >> <Monk.Liu@amd.com>; Haehnle, Nicolai <Nicolai.Haehnle@amd.com>; >> Olsak, Marek <Marek.Olsak@amd.com>; Deucher, Alexander >> <Alexander.Deucher@amd.com> >> Cc: Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; >> amd-gfx@lists.freedesktop.org; Filipas, Mario >> <Mario.Filipas@amd.com>; Ding, Pixel <Pixel.Ding@amd.com>; Li, >> Bingley <Bingley.Li@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com> >> Subject: Re: TDR and VRAM lost handling in KMD: >> >> I've already posted a patch for this on the mailing list. >> >> Basically we just copy the vram lost counter into the job and when we try to run the job we can mark it as canceled. >> >> Regards, >> Christian. >> >> Am 11.10.2017 um 12:14 schrieb Chunming Zhou: >>> Your summary lacks the below issue: >>> >>> How about the job already pushed in scheduler queue when vram is lost? >>> >>> >>> Regards, >>> David Zhou >>> On 2017年10月11日 17:41, Liu, Monk wrote: >>>> Okay, let me summary our whole idea together and see if it works: >>>> >>>> 1, For cs_submit, always check vram-lost_counter first and reject >>>> the submit (return -ECANCLED to UMD) if ctx->vram_lost_counter != >>>> adev->vram_lost_counter. That way the vram lost issue can be >>>> adev->handled >>>> >>>> 2, for cs_submit we still need to check if the incoming context is >>>> "AMDGPU_CTX_GUILTY_RESET" or not even if we found >>>> ctx->vram_lost_counter == adev->vram_lost_counter, and we can >>>> ctx->reject >>>> the submit >>>> If it is "AMDGPU_CTX_GUILTY_RESET", correct ? 
>>>> >>>> 3, in gpu_reset() routine, we only mark the hang job's entity as >>>> guilty (so we need to add new member in entity structure), and not >>>> kick it out in gpu_reset() stage, but we need to set the context >>>> behind this entity as " AMDGPU_CTX_GUILTY_RESET" >>>> And if reset introduces VRAM LOST, we just update >>>> adev->vram_lost_counter, but *don't* change all entity to guilty, >>>> adev->so >>>> still only the hang job's entity is "guilty" >>>> After some entity marked as "guilty", we find a way to set the >>>> context behind it as AMDGPU_CTX_GUILTY_RESET, because this is U/K >>>> interface, we need let UMD can know that this context is wrong. >>>> >>>> 4, in gpu scheduler's run_job() routine, since it only reads >>>> entity, so we skip job scheduling once found the entity is "guilty" >>>> >>>> >>>> Does above sounds good ? >>>> >>>> >>>> >>>> -----Original Message----- >>>> From: Haehnle, Nicolai >>>> Sent: Wednesday, October 11, 2017 5:26 PM >>>> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian >>>> <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; >>>> Deucher, Alexander <Alexander.Deucher@amd.com> >>>> Cc: amd-gfx@lists.freedesktop.org; Ding, Pixel >>>> <Pixel.Ding@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, >>>> Bingley <Bingley.Li@amd.com>; Ramirez, Alejandro >>>> <Alejandro.Ramirez@amd.com>; Filipas, Mario <Mario.Filipas@amd.com> >>>> Subject: Re: TDR and VRAM lost handling in KMD: >>>> >>>> On 11.10.2017 11:18, Liu, Monk wrote: >>>>> Let's talk it simple, When vram lost hit, what's the action for >>>>> amdgpu_ctx_query()/AMDGPU_CTX_OP_QUERY_STATE on other contexts >>>>> (not the one trigger gpu hang) after vram lost ? do you mean we >>>>> return -ENODEV to UMD ? >>>> It should successfully return AMDGPU_CTX_INNOCENT_RESET. >>>> >>>> >>>>> In cs_submit, with vram lost hit, if we don't mark all contexts as >>>>> "guilty", how we block its from submitting ? 
can you show some >>>>> implement way >>>> if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter)) >>>> return -ECANCELED; >>>> >>>> (where ctx->vram_lost_counter is initialized at context creation >>>> time and never changed afterwards) >>>> >>>> >>>>> BTW: the "guilty" here is a new member I want to add to context, >>>>> it is not related with AMDGPU_CTX_OP_QUERY_STATE UK interface, >>>>> Looks I need to unify them and only one place to mark guilty or >>>>> not >>>> Right, the AMDGPU_CTX_OP_QUERY_STATE handling needs to be made >>>> consistent with the rest. >>>> >>>> Cheers, >>>> Nicolai >>>> >>>> >>>>> BR Monk >>>>> >>>>> -----Original Message----- >>>>> From: Haehnle, Nicolai >>>>> Sent: Wednesday, October 11, 2017 5:00 PM >>>>> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian >>>>> <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; >>>>> Deucher, Alexander <Alexander.Deucher@amd.com> >>>>> Cc: amd-gfx@lists.freedesktop.org; Ding, Pixel >>>>> <Pixel.Ding@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, >>>>> Bingley <Bingley.Li@amd.com>; Ramirez, Alejandro >>>>> <Alejandro.Ramirez@amd.com>; Filipas, Mario >>>>> <Mario.Filipas@amd.com> >>>>> Subject: Re: TDR and VRAM lost handling in KMD: >>>>> >>>>> On 11.10.2017 10:48, Liu, Monk wrote: >>>>>> On "guilty": "guilty" is a term that's used by APIs (e.g. >>>>>> OpenGL), so it's reasonable to use it. However, it /does not/ >>>>>> make sense to mark idle contexts as "guilty" just because VRAM is >>>>>> lost. VRAM lost is a perfect example where the driver should >>>>>> report context lost to applications with the "innocent" flag for >>>>>> contexts that were idle at the time of reset. The only context(s) >>>>>> that should be reported as "guilty" >>>>>> (or perhaps "unknown" in some cases) are the ones that were >>>>>> executing at the time of reset. >>>>>> >>>>>> ML: KMD mark all contexts as guilty is because that way we can >>>>>> unify our IOCTL behavior: e.g. 
for IOCTL only block >>>>>> “guilty”context , no need to worry about vram-lost-counter >>>>>> anymore, that’s a implementation style. I don’t think it is >>>>>> related with UMD layer, >>>>>> >>>>>> For UMD the gl-context isn’t aware of by KMD, so UMD can >>>>>> implement it own “guilty” gl-context if you want. >>>>> Well, to some extent this is just semantics, but it helps to keep >>>>> the terminology consistent. >>>>> >>>>> Most importantly, please keep the AMDGPU_CTX_OP_QUERY_STATE uapi >>>>> in >>>>> mind: this returns one of >>>>> AMDGPU_CTX_{GUILTY,INNOCENT,UNKNOWN}_RECENT, >>>>> and it must return "innocent" for contexts that are only lost due >>>>> to VRAM lost without being otherwise involved in the timeout that >>>>> lead to the reset. >>>>> >>>>> The point is that in the places where you used "guilty" it would >>>>> be better to use "context lost", and then further differentiate >>>>> between guilty/innocent context lost based on the details of what happened. >>>>> >>>>> >>>>>> If KMD doesn’t mark all ctx as guilty after VRAM lost, can you >>>>>> illustrate what rule KMD should obey to check in KMS IOCTL like >>>>>> cs_sumbit ?? let’s see which way better >>>>> if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter)) >>>>> return -ECANCELED; >>>>> >>>>> Plus similar logic for AMDGPU_CTX_OP_QUERY_STATE. >>>>> >>>>> Yes, it's one additional check in cs_submit. If you're worried >>>>> about that (and Christian's concerns about possible issues with >>>>> walking over all contexts are addressed), I suppose you could just >>>>> store a per-context >>>>> >>>>> unsigned context_reset_status; >>>>> >>>>> instead of a `bool guilty`. Its value would start out as 0 >>>>> (AMDGPU_CTX_NO_RESET) and would be set to the correct value during >>>>> reset. 
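The scheme Nicolai outlines above (a vram_lost_counter snapshot taken at context creation, plus a per-context `unsigned context_reset_status` instead of a bool guilty flag) can be sketched in plain C. The structures below are simplified stand-ins for illustration, not the real amdgpu_ctx/amdgpu_device definitions:

```c
#include <errno.h>
#include <stdatomic.h>

/* Simplified stand-ins for the real amdgpu driver structures. */
struct amdgpu_device {
    atomic_uint vram_lost_counter;   /* bumped once per VRAM-lost reset */
};

struct amdgpu_ctx {
    struct amdgpu_device *adev;
    unsigned int vram_lost_counter;  /* snapshot taken at ctx creation */
    unsigned int reset_status;       /* 0 = no reset; nonzero = guilty/innocent value */
};

void amdgpu_ctx_init(struct amdgpu_ctx *ctx, struct amdgpu_device *adev)
{
    ctx->adev = adev;
    /* Snapshot once at creation time, never changed afterwards. */
    ctx->vram_lost_counter = atomic_load(&adev->vram_lost_counter);
    ctx->reset_status = 0;
}

/* Gatekeeper at the top of cs_submit: no iteration over all contexts
 * at reset time, just one compare per submission. */
int amdgpu_cs_submit_check(struct amdgpu_ctx *ctx)
{
    if (ctx->vram_lost_counter !=
        atomic_load(&ctx->adev->vram_lost_counter))
        return -ECANCELED;           /* VRAM lost since ctx creation */
    if (ctx->reset_status != 0)
        return -ECANCELED;           /* ctx was involved in a reset */
    return 0;
}
```

After a VRAM-lost reset bumps adev->vram_lost_counter, every pre-existing context fails the first compare and is rejected without the reset path ever having walked the context list.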
>>>>> >>>>> Cheers, >>>>> Nicolai >>>>> >>>>> >>>>>> *From:*Haehnle, Nicolai >>>>>> *Sent:* Wednesday, October 11, 2017 4:41 PM >>>>>> *To:* Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian >>>>>> <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; >>>>>> Deucher, Alexander <Alexander.Deucher@amd.com> >>>>>> *Cc:* amd-gfx@lists.freedesktop.org; Ding, Pixel >>>>>> <Pixel.Ding@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; >>>>>> Li, Bingley <Bingley.Li@amd.com>; Ramirez, Alejandro >>>>>> <Alejandro.Ramirez@amd.com>; Filipas, Mario >>>>>> <Mario.Filipas@amd.com> >>>>>> *Subject:* Re: TDR and VRAM lost handling in KMD: >>>>>> >>>>>> From a Mesa perspective, this almost all sounds reasonable to me. >>>>>> >>>>>> On "guilty": "guilty" is a term that's used by APIs (e.g. >>>>>> OpenGL), so it's reasonable to use it. However, it /does not/ >>>>>> make sense to mark idle contexts as "guilty" just because VRAM is >>>>>> lost. VRAM lost is a perfect example where the driver should >>>>>> report context lost to applications with the "innocent" flag for >>>>>> contexts that were idle at the time of reset. The only context(s) >>>>>> that should be reported as "guilty" >>>>>> (or perhaps "unknown" in some cases) are the ones that were >>>>>> executing at the time of reset. >>>>>> >>>>>> On whether the whole context is marked as guilty from a user >>>>>> space perspective, it would simply be nice for user space to get >>>>>> consistent answers. It would be a bit odd if we could e.g. >>>>>> succeed in submitting an SDMA job after a GFX job was rejected. >>>>>> This would point in favor of marking the entire context as guilty >>>>>> (although that could happen lazily instead of at reset time). On >>>>>> the other hand, if that's too big a burden for the kernel >>>>>> implementation I'm sure we can live without it. 
>>>>>> >>>>>> Cheers, >>>>>> >>>>>> Nicolai >>>>>> >>>>>> ----------------------------------------------------------------- >>>>>> - >>>>>> - >>>>>> --- >>>>>> -- >>>>>> >>>>>> *From:*Liu, Monk >>>>>> *Sent:* Wednesday, October 11, 2017 10:15:40 AM >>>>>> *To:* Koenig, Christian; Haehnle, Nicolai; Olsak, Marek; Deucher, >>>>>> Alexander >>>>>> *Cc:* amd-gfx@lists.freedesktop.org >>>>>> <mailto:amd-gfx@lists.freedesktop.org>; Ding, Pixel; Jiang, Jerry >>>>>> (SW); Li, Bingley; Ramirez, Alejandro; Filipas, Mario >>>>>> *Subject:* RE: TDR and VRAM lost handling in KMD: >>>>>> >>>>>> 1.Set its fence error status to “*ETIME*”, >>>>>> >>>>>> No, as I already explained ETIME is for synchronous operation. >>>>>> >>>>>> In other words when we return ETIME from the wait IOCTL it would >>>>>> mean that the waiting has somehow timed out, but not the job we waited for. >>>>>> >>>>>> Please use ECANCELED as well or some other error code when we >>>>>> find that we need to distinct the timedout job from the canceled >>>>>> ones (probably a good idea, but I'm not sure). >>>>>> >>>>>> [ML] I’m okay if you insist not to use ETIME >>>>>> >>>>>> 1.Find the entity/ctx behind this job, and set this ctx as “*guilty*” >>>>>> >>>>>> Not sure. Do we want to set the whole context as guilty or just >>>>>> the entity? >>>>>> >>>>>> Setting the whole contexts as guilty sounds racy to me. >>>>>> >>>>>> BTW: We should use a different name than "guilty", maybe just >>>>>> "bool canceled;" ? >>>>>> >>>>>> [ML] I think context is better than entity, because for example >>>>>> if you only block entity_0 of context and allow entity_N run, >>>>>> that means the dependency between entities are broken (e.g. 
page >>>>>> table updates in >>>>>> >>>>>> Sdma entity pass but gfx submit in GFX entity blocked, not make >>>>>> sense to me) >>>>>> >>>>>> We’d better either block the whole context or let not… >>>>>> >>>>>> 1.Kick out all jobs in this “guilty” ctx’s KFIFO queue, and set >>>>>> all their fence status to “*ECANCELED*” >>>>>> >>>>>> Setting ECANCELED should be ok. But I think we should do this >>>>>> when we try to run the jobs and not during GPU reset. >>>>>> >>>>>> [ML] without deep thought and expritment, I’m not sure the >>>>>> difference between them, but kick it out in gpu_reset routine is >>>>>> more efficient, >>>>>> >>>>>> Otherwise you need to check context/entity guilty flag in run_job >>>>>> routine …and you need to it for every context/entity, I don’t see >>>>>> why >>>>>> >>>>>> We don’t just kickout all of them in gpu_reset stage …. >>>>>> >>>>>> a)Iterate over all living ctx, and set all ctx as “*guilty*” >>>>>> since VRAM lost actually ruins all VRAM contents >>>>>> >>>>>> No, that shouldn't be done by comparing the counters. Iterating >>>>>> over all contexts is way to much overhead. >>>>>> >>>>>> [ML] because I want to make KMS IOCTL rules clean, like they >>>>>> don’t need to differentiate VRAM lost or not, they only >>>>>> interested in if the context is guilty or not, and block >>>>>> >>>>>> Submit for guilty ones. >>>>>> >>>>>> *Can you give more details of your idea? And better the detail >>>>>> implement in cs_submit, I want to see how you want to block >>>>>> submit without checking context guilty flag* >>>>>> >>>>>> a)Kick out all jobs in all ctx’s KFIFO queue, and set all their >>>>>> fence status to “*ECANCELDED*” >>>>>> >>>>>> Yes and no, that should be done when we try to run the jobs and >>>>>> not during GPU reset. >>>>>> >>>>>> [ML] again, kicking out them in gpu reset routine is high >>>>>> efficient, otherwise you need check on every job in run_job() >>>>>> >>>>>> Besides, can you illustrate the detail implementation ? 
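Christian's alternative of cancelling jobs when the scheduler tries to run them, rather than kicking them all out inside gpu_reset(), might look roughly like the sketch below. The sched_entity/sched_job types here are simplified stand-ins for the GPU scheduler's real structures:

```c
#include <errno.h>
#include <stdbool.h>

/* Simplified stand-ins for the GPU scheduler's entity/job types. */
struct sched_entity {
    bool guilty;               /* set once, during gpu_reset() */
};

struct sched_job {
    struct sched_entity *entity;
    int fence_error;           /* 0, or -ECANCELED once cancelled */
    bool fence_signaled;
};

/* Lazy variant: nothing is drained at reset time; each job is checked
 * just before it would be pushed to the ring. */
void sched_run_job(struct sched_job *job)
{
    if (job->entity->guilty) {
        /* Fail the job's fence instead of running it, so any waiter
         * wakes up with an error rather than blocking forever. */
        job->fence_error = -ECANCELED;
        job->fence_signaled = true;
        return;
    }
    /* ... otherwise hand the job to the hardware ring as usual ... */
    job->fence_signaled = true;
}
```

The trade-off debated above is exactly this: the lazy version costs one flag check per job in run_job(), while the eager version does all the kicking inside the (rare) reset path.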
>>>>>> >>>>>> Yes and no, dma_fence_get_status() is some specific handling for >>>>>> sync_file debugging (no idea why that made it into the common >>>>>> fence code). >>>>>> >>>>>> It was replaced by putting the error code directly into the >>>>>> fence, so just reading that one after waiting should be ok. >>>>>> >>>>>> Maybe we should fix dma_fence_get_status() to do the right thing >>>>>> for this? >>>>>> >>>>>> [ML] yeah, that’s too confusing, the name sound really the one I >>>>>> want to use, we should change it… >>>>>> >>>>>> *But look into the implement, I don**’t see why we cannot use it ? >>>>>> it also finally return the fence->error * >>>>>> >>>>>> *From:*Koenig, Christian >>>>>> *Sent:* Wednesday, October 11, 2017 3:21 PM >>>>>> *To:* Liu, Monk <Monk.Liu@amd.com <mailto:Monk.Liu@amd.com>>; >>>>>> Haehnle, Nicolai <Nicolai.Haehnle@amd.com >>>>>> <mailto:Nicolai.Haehnle@amd.com>>; >>>>>> Olsak, Marek <Marek.Olsak@amd.com <mailto:Marek.Olsak@amd.com>>; >>>>>> Deucher, Alexander <Alexander.Deucher@amd.com >>>>>> <mailto:Alexander.Deucher@amd.com>> >>>>>> *Cc:* amd-gfx@lists.freedesktop.org >>>>>> <mailto:amd-gfx@lists.freedesktop.org>; Ding, Pixel >>>>>> <Pixel.Ding@amd.com <mailto:Pixel.Ding@amd.com>>; Jiang, Jerry >>>>>> (SW) <Jerry.Jiang@amd.com <mailto:Jerry.Jiang@amd.com>>; Li, >>>>>> Bingley <Bingley.Li@amd.com <mailto:Bingley.Li@amd.com>>; >>>>>> Ramirez, Alejandro <Alejandro.Ramirez@amd.com >>>>>> <mailto:Alejandro.Ramirez@amd.com>>; >>>>>> Filipas, Mario <Mario.Filipas@amd.com >>>>>> <mailto:Mario.Filipas@amd.com>> >>>>>> *Subject:* Re: TDR and VRAM lost handling in KMD: >>>>>> >>>>>> See inline: >>>>>> >>>>>> Am 11.10.2017 um 07:33 schrieb Liu, Monk: >>>>>> >>>>>> Hi Christian & Nicolai, >>>>>> >>>>>> We need to achieve some agreements on what should >>>>>> MESA/UMD do and >>>>>> what should KMD do, *please give your comments with “okay” >>>>>> or “No” >>>>>> and your idea on below items,* >>>>>> >>>>>> ?When a job timed out (set from 
lockup_timeout kernel >>>>>> parameter), >>>>>> What KMD should do in TDR routine : >>>>>> >>>>>> 1.Update adev->*gpu_reset_counter*, and stop scheduler first, >>>>>> (*gpu_reset_counter* is used to force vm flush after GPU >>>>>> reset, out >>>>>> of this thread’s scope so no more discussion on it) >>>>>> >>>>>> Okay. >>>>>> >>>>>> 2.Set its fence error status to “*ETIME*”, >>>>>> >>>>>> No, as I already explained ETIME is for synchronous operation. >>>>>> >>>>>> In other words when we return ETIME from the wait IOCTL it would >>>>>> mean that the waiting has somehow timed out, but not the job we waited for. >>>>>> >>>>>> Please use ECANCELED as well or some other error code when we >>>>>> find that we need to distinct the timedout job from the canceled >>>>>> ones (probably a good idea, but I'm not sure). >>>>>> >>>>>> 3.Find the entity/ctx behind this job, and set this ctx >>>>>> as “*guilty*” >>>>>> >>>>>> Not sure. Do we want to set the whole context as guilty or just >>>>>> the entity? >>>>>> >>>>>> Setting the whole contexts as guilty sounds racy to me. >>>>>> >>>>>> BTW: We should use a different name than "guilty", maybe just >>>>>> "bool canceled;" ? >>>>>> >>>>>> 4.Kick out this job from scheduler’s mirror list, so this >>>>>> job won’t >>>>>> get re-scheduled to ring anymore. >>>>>> >>>>>> Okay. >>>>>> >>>>>> 5.Kick out all jobs in this “guilty” ctx’s KFIFO queue, >>>>>> and set all >>>>>> their fence status to “*ECANCELED*” >>>>>> >>>>>> Setting ECANCELED should be ok. But I think we should do this >>>>>> when we try to run the jobs and not during GPU reset. >>>>>> >>>>>> 6.Force signal all fences that get kicked out by above two >>>>>> steps,*otherwise UMD will block forever if waiting on >>>>>> those >>>>>> fences* >>>>>> >>>>>> Okay. >>>>>> >>>>>> 7.Do gpu reset, which is can be some callbacks to let >>>>>> bare-metal and >>>>>> SR-IOV implement with their favor style >>>>>> >>>>>> Okay. 
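Steps 5 and 6 of the TDR routine above (kick out the guilty context's queued jobs, set their fence status, then force-signal so no waiter blocks forever) can be sketched as follows. This uses a simplified fence in place of the kernel's struct dma_fence; the kernel equivalent of the two assignments is dma_fence_set_error() followed by dma_fence_signal():

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified fence standing in for struct dma_fence. */
struct fence {
    int error;
    bool signaled;
};

struct job {
    struct fence fence;
    struct job *next;          /* link in the ctx's pending queue */
};

/* Drain the guilty context's queue: fail every fence, then signal it
 * so UMD waiters wake up with an error instead of blocking forever. */
void ctx_cancel_pending_jobs(struct job **queue)
{
    struct job *j = *queue;

    while (j) {
        j->fence.error = -ECANCELED;   /* dma_fence_set_error() */
        j->fence.signaled = true;      /* dma_fence_signal() */
        j = j->next;
    }
    *queue = NULL;                     /* the queue is now empty */
}
```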
>>>>>> >>>>>> 8.After reset, KMD need to aware if the VRAM lost happens >>>>>> or not, >>>>>> bare-metal can implement some function to judge, while >>>>>> for SR-IOV I >>>>>> prefer to read it from GIM side (for initial version we consider >>>>>> it’s always VRAM lost, till GIM side change aligned) >>>>>> >>>>>> Okay. >>>>>> >>>>>> 9.If VRAM lost not hit, continue, otherwise: >>>>>> >>>>>> a)Update adev->*vram_lost_counter*, >>>>>> >>>>>> Okay. >>>>>> >>>>>> b)Iterate over all living ctx, and set all ctx as “*guilty*” >>>>>> since >>>>>> VRAM lost actually ruins all VRAM contents >>>>>> >>>>>> No, that shouldn't be done by comparing the counters. Iterating >>>>>> over all contexts is way to much overhead. >>>>>> >>>>>> c)Kick out all jobs in all ctx’s KFIFO queue, and set all their >>>>>> fence status to “*ECANCELDED*” >>>>>> >>>>>> Yes and no, that should be done when we try to run the jobs and >>>>>> not during GPU reset. >>>>>> >>>>>> 10.Do GTT recovery and VRAM page tables/entries recovery >>>>>> (optional, >>>>>> do we need it ???) >>>>>> >>>>>> Yes, that is still needed. As Nicolai explained we can't be sure >>>>>> that VRAM is still 100% correct even when it isn't cleared. >>>>>> >>>>>> 11.Re-schedule all JOBs remains in mirror list to ring again and >>>>>> restart scheduler (for VRAM lost case, no JOB will >>>>>> re-scheduled) >>>>>> >>>>>> Okay. >>>>>> >>>>>> ?For cs_wait() IOCTL: >>>>>> >>>>>> After it found fence signaled, it should check with >>>>>> *“dma_fence_get_status” *to see if there is error there, >>>>>> >>>>>> And return the error status of fence >>>>>> >>>>>> Yes and no, dma_fence_get_status() is some specific handling for >>>>>> sync_file debugging (no idea why that made it into the common >>>>>> fence code). >>>>>> >>>>>> It was replaced by putting the error code directly into the >>>>>> fence, so just reading that one after waiting should be ok. >>>>>> >>>>>> Maybe we should fix dma_fence_get_status() to do the right thing >>>>>> for this? 
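The wait-side behaviour being discussed here (once the fence signals, return the error stored in the fence itself, and reserve -ETIME for the wait operation timing out) can be sketched with the same simplified fence type; a real cs_wait() would block on the fence rather than just inspect it:

```c
#include <errno.h>
#include <stdbool.h>

/* Simplified fence standing in for struct dma_fence. */
struct fence {
    int error;       /* 0, or a negative errno set before signaling */
    bool signaled;
};

/* What cs_wait() reports after its wait completes.  This captures
 * Christian's ETIME point: -ETIME means the *wait* timed out, while a
 * cancelled job reports its own error through the fence. */
int cs_wait_result(const struct fence *f)
{
    if (!f->signaled)
        return -ETIME;   /* the wait itself ran out of time */
    return f->error;     /* 0, or e.g. -ECANCELED for a kicked job */
}
```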
>>>>>> >>>>>> ?For cs_wait_fences() IOCTL: >>>>>> >>>>>> Similar with above approach >>>>>> >>>>>> ?For cs_submit() IOCTL: >>>>>> >>>>>> It need to check if current ctx been marked as “*guilty*” >>>>>> and return >>>>>> “*ECANCELED*” if so >>>>>> >>>>>> ?Introduce a new IOCTL to let UMD query *vram_lost_counter*: >>>>>> >>>>>> This way, UMD can also block app from submitting, like @Nicolai >>>>>> mentioned, we can cache one copy of *vram_lost_counter* when >>>>>> enumerate physical device, and deny all >>>>>> >>>>>> gl-context from submitting if the counter queried bigger >>>>>> than that >>>>>> one cached in physical device. (looks a little overkill >>>>>> to me, but >>>>>> easy to implement ) >>>>>> >>>>>> UMD can also return error to APP when creating gl-context >>>>>> if found >>>>>> current queried*vram_lost_counter *bigger than that one >>>>>> cached in >>>>>> physical device. >>>>>> >>>>>> Okay. Already have a patch for this, please review that one if >>>>>> you haven't already done so. >>>>>> >>>>>> Regards, >>>>>> Christian. >>>>>> >>>>>> BTW: I realized that gl-context is a little different >>>>>> with kernel’s >>>>>> context. Because for kernel. 
BO is not related with >>>>>> context but only >>>>>> with FD, while in UMD, BO have a backend >>>>>> >>>>>> gl-context, so block submitting in UMD layer is also >>>>>> needed although >>>>>> KMD will do its job as bottom line >>>>>> >>>>>> ?Basically “vram_lost_counter” is exposure by kernel to >>>>>> let UMD take >>>>>> the control of robust extension feature, it will be UMD’s >>>>>> call to >>>>>> move, KMD only deny “guilty” context from submitting >>>>>> >>>>>> Need your feedback, thx >>>>>> >>>>>> We’d better make TDR feature landed ASAP >>>>>> >>>>>> BR Monk >>>>>> >>>> _______________________________________________ >>>> amd-gfx mailing list >>>> amd-gfx@lists.freedesktop.org >>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply [flat|nested] 23+ messages in thread
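The UMD-side policy Monk describes (cache vram_lost_counter when enumerating the physical device, then deny gl-context creation and submission once the kernel reports a newer value) might look like the sketch below. query_vram_lost_counter() is a hypothetical stand-in for the proposed new IOCTL, backed here by a plain global so the sketch is self-contained:

```c
#include <errno.h>

/* Test stand-in for the proposed query IOCTL: in a real UMD this
 * would be a DRM info-style query into the kernel. */
static unsigned int kernel_vram_lost_counter;

static unsigned int query_vram_lost_counter(void)
{
    return kernel_vram_lost_counter;
}

struct phys_device {
    unsigned int vram_lost_counter;  /* cached at enumeration time */
};

struct gl_context {
    struct phys_device *dev;
};

/* Return an error to the app at gl-context creation if VRAM has been
 * lost since the physical device was enumerated. */
int gl_context_create(struct gl_context *ctx, struct phys_device *dev)
{
    if (query_vram_lost_counter() > dev->vram_lost_counter)
        return -ECANCELED;
    ctx->dev = dev;
    return 0;
}

/* Deny submission from every gl-context once the counter has moved;
 * the KMD check in cs_submit remains the bottom line. */
int gl_context_submit(struct gl_context *ctx)
{
    if (query_vram_lost_counter() > ctx->dev->vram_lost_counter)
        return -ECANCELED;
    /* ... otherwise forward to the kernel cs_submit ... */
    return 0;
}
```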
* RE: TDR and VRAM lost handling in KMD: [not found] ` <8c4e849f-9227-12bc-9d2e-3daf60fcd762-5C7GfCeVMHo@public.gmane.org> 2017-10-11 10:39 ` Christian König @ 2017-10-11 13:27 ` Liu, Monk 1 sibling, 0 replies; 23+ messages in thread From: Liu, Monk @ 2017-10-11 13:27 UTC (permalink / raw) To: Zhou, David(ChunMing), Haehnle, Nicolai, Koenig, Christian, Olsak, Marek, Deucher, Alexander Cc: Ramirez, Alejandro, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Filipas, Mario, Ding, Pixel, Li, Bingley, Jiang, Jerry (SW) According to the initial summary, this situation is already considered: When vram lost hit, all context marked as guilty, and all jobs in guilty context's KFIFO queue will be kicked out Now if we move the kick out from gpu_reset to run_job, then I think your question can be answered by: in run_job(), before each job scheduling, check current vram_lost_counter, compare it with the copy cached during entity init (or context init), skip the job if not match BR Monk -----Original Message----- From: Zhou, David(ChunMing) Sent: Wednesday, October 11, 2017 6:15 PM To: Liu, Monk <Monk.Liu@amd.com>; Haehnle, Nicolai <Nicolai.Haehnle@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com> Cc: Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; amd-gfx@lists.freedesktop.org; Filipas, Mario <Mario.Filipas@amd.com>; Ding, Pixel <Pixel.Ding@amd.com>; Li, Bingley <Bingley.Li@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com> Subject: Re: TDR and VRAM lost handling in KMD: Your summary lacks the below issue: How about the job already pushed in scheduler queue when vram is lost? Regards, David Zhou On 11.10.2017 17:41, Liu, Monk wrote: > Okay, let me summary our whole idea together and see if it works: > > 1, For cs_submit, always check vram-lost_counter first and reject the > submit (return -ECANCELED to UMD) if ctx->vram_lost_counter != > adev->vram_lost_counter.
That way the vram lost issue can be handled > > 2, for cs_submit we still need to check if the incoming context is > "AMDGPU_CTX_GUILTY_RESET" or not even if we found ctx->vram_lost_counter == adev->vram_lost_counter, and we can reject the submit If it is "AMDGPU_CTX_GUILTY_RESET", correct ? > > 3, in gpu_reset() routine, we only mark the hang job's entity as guilty (so we need to add new member in entity structure), and not kick it out in gpu_reset() stage, but we need to set the context behind this entity as " AMDGPU_CTX_GUILTY_RESET" > And if reset introduces VRAM LOST, we just update adev->vram_lost_counter, but *don't* change all entity to guilty, so still only the hang job's entity is "guilty" > After some entity marked as "guilty", we find a way to set the context behind it as AMDGPU_CTX_GUILTY_RESET, because this is U/K interface, we need let UMD can know that this context is wrong. > > 4, in gpu scheduler's run_job() routine, since it only reads entity, so we skip job scheduling once found the entity is "guilty" > > > Does above sounds good ? > > > > -----Original Message----- > From: Haehnle, Nicolai > Sent: Wednesday, October 11, 2017 5:26 PM > To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian > <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; > Deucher, Alexander <Alexander.Deucher@amd.com> > Cc: amd-gfx@lists.freedesktop.org; Ding, Pixel <Pixel.Ding@amd.com>; > Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, Bingley > <Bingley.Li@amd.com>; Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; > Filipas, Mario <Mario.Filipas@amd.com> > Subject: Re: TDR and VRAM lost handling in KMD: > > On 11.10.2017 11:18, Liu, Monk wrote: >> Let's talk it simple, When vram lost hit, what's the action for amdgpu_ctx_query()/AMDGPU_CTX_OP_QUERY_STATE on other contexts (not the one trigger gpu hang) after vram lost ? do you mean we return -ENODEV to UMD ? > It should successfully return AMDGPU_CTX_INNOCENT_RESET. 
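Monk's answer to David Zhou's queued-job question (compare the current vram_lost_counter against a copy cached at entity init, and skip stale jobs in run_job()) can be sketched like this; the types below are simplified stand-ins for the real scheduler structures:

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

struct device {
    atomic_uint vram_lost_counter;
};

struct entity {
    struct device *dev;
    unsigned int vram_lost_counter;  /* snapshot cached at entity init */
};

struct job {
    struct entity *entity;
    int fence_error;
    bool fence_signaled;
};

/* Covers jobs that were already sitting in the scheduler queue when
 * VRAM was lost: each one is checked just before it would run, and
 * stale jobs are failed instead of scheduled. */
bool sched_try_run(struct job *job)
{
    struct entity *e = job->entity;

    if (e->vram_lost_counter !=
        atomic_load(&e->dev->vram_lost_counter)) {
        job->fence_error = -ECANCELED;
        job->fence_signaled = true;   /* wake any waiter with an error */
        return false;                 /* skipped, not run */
    }
    job->fence_signaled = true;       /* ran on the ring as usual */
    return true;
}
```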
> > >> In cs_submit, with vram lost hit, if we don't mark all contexts as >> "guilty", how we block its from submitting ? can you show some >> implement way > if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter)) > return -ECANCELED; > > (where ctx->vram_lost_counter is initialized at context creation time > and never changed afterwards) > > >> BTW: the "guilty" here is a new member I want to add to context, it >> is not related with AMDGPU_CTX_OP_QUERY_STATE UK interface, Looks I >> need to unify them and only one place to mark guilty or not > Right, the AMDGPU_CTX_OP_QUERY_STATE handling needs to be made > consistent with the rest. > > Cheers, > Nicolai > > >> >> BR Monk >> >> -----Original Message----- >> From: Haehnle, Nicolai >> Sent: Wednesday, October 11, 2017 5:00 PM >> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian >> <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; >> Deucher, Alexander <Alexander.Deucher@amd.com> >> Cc: amd-gfx@lists.freedesktop.org; Ding, Pixel <Pixel.Ding@amd.com>; >> Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, Bingley >> <Bingley.Li@amd.com>; Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; >> Filipas, Mario <Mario.Filipas@amd.com> >> Subject: Re: TDR and VRAM lost handling in KMD: >> >> On 11.10.2017 10:48, Liu, Monk wrote: >>> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), >>> so it's reasonable to use it. However, it /does not/ make sense to >>> mark idle contexts as "guilty" just because VRAM is lost. VRAM lost >>> is a perfect example where the driver should report context lost to >>> applications with the "innocent" flag for contexts that were idle at >>> the time of reset. The only context(s) that should be reported as "guilty" >>> (or perhaps "unknown" in some cases) are the ones that were >>> executing at the time of reset. >>> >>> ML: KMD mark all contexts as guilty is because that way we can unify >>> our IOCTL behavior: e.g. 
for IOCTL only block “guilty”context , no >>> need to worry about vram-lost-counter anymore, that’s a >>> implementation style. I don’t think it is related with UMD layer, >>> >>> For UMD the gl-context isn’t aware of by KMD, so UMD can implement >>> it own “guilty” gl-context if you want. >> Well, to some extent this is just semantics, but it helps to keep the terminology consistent. >> >> Most importantly, please keep the AMDGPU_CTX_OP_QUERY_STATE uapi in >> mind: this returns one of >> AMDGPU_CTX_{GUILTY,INNOCENT,UNKNOWN}_RECENT, >> and it must return "innocent" for contexts that are only lost due to VRAM lost without being otherwise involved in the timeout that lead to the reset. >> >> The point is that in the places where you used "guilty" it would be better to use "context lost", and then further differentiate between guilty/innocent context lost based on the details of what happened. >> >> >>> If KMD doesn’t mark all ctx as guilty after VRAM lost, can you >>> illustrate what rule KMD should obey to check in KMS IOCTL like >>> cs_sumbit ?? let’s see which way better >> if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter)) >> return -ECANCELED; >> >> Plus similar logic for AMDGPU_CTX_OP_QUERY_STATE. >> >> Yes, it's one additional check in cs_submit. If you're worried about >> that (and Christian's concerns about possible issues with walking >> over all contexts are addressed), I suppose you could just store a >> per-context >> >> unsigned context_reset_status; >> >> instead of a `bool guilty`. Its value would start out as 0 >> (AMDGPU_CTX_NO_RESET) and would be set to the correct value during reset. 
>> >> Cheers, >> Nicolai >> >> >>> *From:*Haehnle, Nicolai >>> *Sent:* Wednesday, October 11, 2017 4:41 PM >>> *To:* Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian >>> <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; >>> Deucher, Alexander <Alexander.Deucher@amd.com> >>> *Cc:* amd-gfx@lists.freedesktop.org; Ding, Pixel >>> <Pixel.Ding@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, >>> Bingley <Bingley.Li@amd.com>; Ramirez, Alejandro >>> <Alejandro.Ramirez@amd.com>; Filipas, Mario <Mario.Filipas@amd.com> >>> *Subject:* Re: TDR and VRAM lost handling in KMD: >>> >>> From a Mesa perspective, this almost all sounds reasonable to me. >>> >>> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), >>> so it's reasonable to use it. However, it /does not/ make sense to >>> mark idle contexts as "guilty" just because VRAM is lost. VRAM lost >>> is a perfect example where the driver should report context lost to >>> applications with the "innocent" flag for contexts that were idle at >>> the time of reset. The only context(s) that should be reported as "guilty" >>> (or perhaps "unknown" in some cases) are the ones that were >>> executing at the time of reset. >>> >>> On whether the whole context is marked as guilty from a user space >>> perspective, it would simply be nice for user space to get >>> consistent answers. It would be a bit odd if we could e.g. succeed >>> in submitting an SDMA job after a GFX job was rejected. This would >>> point in favor of marking the entire context as guilty (although >>> that could happen lazily instead of at reset time). On the other >>> hand, if that's too big a burden for the kernel implementation I'm sure we can live without it. 
>>> >>> Cheers, >>> >>> Nicolai >>> >>> -------------------------------------------------------------------- >>> -- >>> -- >>> >>> *From:*Liu, Monk >>> *Sent:* Wednesday, October 11, 2017 10:15:40 AM >>> *To:* Koenig, Christian; Haehnle, Nicolai; Olsak, Marek; Deucher, >>> Alexander >>> *Cc:* amd-gfx@lists.freedesktop.org >>> <mailto:amd-gfx@lists.freedesktop.org>; Ding, Pixel; Jiang, Jerry >>> (SW); Li, Bingley; Ramirez, Alejandro; Filipas, Mario >>> *Subject:* RE: TDR and VRAM lost handling in KMD: >>> >>> 1.Set its fence error status to “*ETIME*”, >>> >>> No, as I already explained ETIME is for synchronous operation. >>> >>> In other words when we return ETIME from the wait IOCTL it would >>> mean that the waiting has somehow timed out, but not the job we waited for. >>> >>> Please use ECANCELED as well or some other error code when we find >>> that we need to distinct the timedout job from the canceled ones >>> (probably a good idea, but I'm not sure). >>> >>> [ML] I’m okay if you insist not to use ETIME >>> >>> 1.Find the entity/ctx behind this job, and set this ctx as “*guilty*” >>> >>> Not sure. Do we want to set the whole context as guilty or just the entity? >>> >>> Setting the whole contexts as guilty sounds racy to me. >>> >>> BTW: We should use a different name than "guilty", maybe just "bool >>> canceled;" ? >>> >>> [ML] I think context is better than entity, because for example if >>> you only block entity_0 of context and allow entity_N run, that >>> means the dependency between entities are broken (e.g. page table >>> updates in >>> >>> Sdma entity pass but gfx submit in GFX entity blocked, not make >>> sense to me) >>> >>> We’d better either block the whole context or let not… >>> >>> 1.Kick out all jobs in this “guilty” ctx’s KFIFO queue, and set all >>> their fence status to “*ECANCELED*” >>> >>> Setting ECANCELED should be ok. But I think we should do this when >>> we try to run the jobs and not during GPU reset. 
>>> >>> [ML] without deep thought and expritment, I’m not sure the >>> difference between them, but kick it out in gpu_reset routine is >>> more efficient, >>> >>> Otherwise you need to check context/entity guilty flag in run_job >>> routine …and you need to it for every context/entity, I don’t see >>> why >>> >>> We don’t just kickout all of them in gpu_reset stage …. >>> >>> a)Iterate over all living ctx, and set all ctx as “*guilty*” since >>> VRAM lost actually ruins all VRAM contents >>> >>> No, that shouldn't be done by comparing the counters. Iterating over >>> all contexts is way to much overhead. >>> >>> [ML] because I want to make KMS IOCTL rules clean, like they don’t >>> need to differentiate VRAM lost or not, they only interested in if >>> the context is guilty or not, and block >>> >>> Submit for guilty ones. >>> >>> *Can you give more details of your idea? And better the detail >>> implement in cs_submit, I want to see how you want to block submit >>> without checking context guilty flag* >>> >>> a)Kick out all jobs in all ctx’s KFIFO queue, and set all their >>> fence status to “*ECANCELDED*” >>> >>> Yes and no, that should be done when we try to run the jobs and not >>> during GPU reset. >>> >>> [ML] again, kicking out them in gpu reset routine is high efficient, >>> otherwise you need check on every job in run_job() >>> >>> Besides, can you illustrate the detail implementation ? >>> >>> Yes and no, dma_fence_get_status() is some specific handling for >>> sync_file debugging (no idea why that made it into the common fence code). >>> >>> It was replaced by putting the error code directly into the fence, >>> so just reading that one after waiting should be ok. >>> >>> Maybe we should fix dma_fence_get_status() to do the right thing for this? >>> >>> [ML] yeah, that’s too confusing, the name sound really the one I >>> want to use, we should change it… >>> >>> *But look into the implement, I don**’t see why we cannot use it ? 
>>> it also finally return the fence->error * >>> >>> *From:*Koenig, Christian >>> *Sent:* Wednesday, October 11, 2017 3:21 PM >>> *To:* Liu, Monk <Monk.Liu@amd.com <mailto:Monk.Liu@amd.com>>; >>> Haehnle, Nicolai <Nicolai.Haehnle@amd.com >>> <mailto:Nicolai.Haehnle@amd.com>>; >>> Olsak, Marek <Marek.Olsak@amd.com <mailto:Marek.Olsak@amd.com>>; >>> Deucher, Alexander <Alexander.Deucher@amd.com >>> <mailto:Alexander.Deucher@amd.com>> >>> *Cc:* amd-gfx@lists.freedesktop.org >>> <mailto:amd-gfx@lists.freedesktop.org>; Ding, Pixel >>> <Pixel.Ding@amd.com <mailto:Pixel.Ding@amd.com>>; Jiang, Jerry (SW) >>> <Jerry.Jiang@amd.com <mailto:Jerry.Jiang@amd.com>>; Li, Bingley >>> <Bingley.Li@amd.com <mailto:Bingley.Li@amd.com>>; Ramirez, Alejandro >>> <Alejandro.Ramirez@amd.com <mailto:Alejandro.Ramirez@amd.com>>; >>> Filipas, Mario <Mario.Filipas@amd.com >>> <mailto:Mario.Filipas@amd.com>> >>> *Subject:* Re: TDR and VRAM lost handling in KMD: >>> >>> See inline: >>> >>> Am 11.10.2017 um 07:33 schrieb Liu, Monk: >>> >>> Hi Christian & Nicolai, >>> >>> We need to achieve some agreements on what should MESA/UMD do and >>> what should KMD do, *please give your comments with “okay” or “No” >>> and your idea on below items,* >>> >>> ?When a job timed out (set from lockup_timeout kernel parameter), >>> What KMD should do in TDR routine : >>> >>> 1.Update adev->*gpu_reset_counter*, and stop scheduler first, >>> (*gpu_reset_counter* is used to force vm flush after GPU reset, out >>> of this thread’s scope so no more discussion on it) >>> >>> Okay. >>> >>> 2.Set its fence error status to “*ETIME*”, >>> >>> No, as I already explained ETIME is for synchronous operation. >>> >>> In other words when we return ETIME from the wait IOCTL it would >>> mean that the waiting has somehow timed out, but not the job we waited for. 
>>> >>> Please use ECANCELED as well or some other error code when we find >>> that we need to distinct the timedout job from the canceled ones >>> (probably a good idea, but I'm not sure). >>> >>> 3.Find the entity/ctx behind this job, and set this ctx as “*guilty*” >>> >>> Not sure. Do we want to set the whole context as guilty or just the entity? >>> >>> Setting the whole contexts as guilty sounds racy to me. >>> >>> BTW: We should use a different name than "guilty", maybe just "bool >>> canceled;" ? >>> >>> 4.Kick out this job from scheduler’s mirror list, so this job won’t >>> get re-scheduled to ring anymore. >>> >>> Okay. >>> >>> 5.Kick out all jobs in this “guilty” ctx’s KFIFO queue, and set all >>> their fence status to “*ECANCELED*” >>> >>> Setting ECANCELED should be ok. But I think we should do this when >>> we try to run the jobs and not during GPU reset. >>> >>> 6.Force signal all fences that get kicked out by above two >>> steps,*otherwise UMD will block forever if waiting on those >>> fences* >>> >>> Okay. >>> >>> 7.Do gpu reset, which is can be some callbacks to let bare-metal and >>> SR-IOV implement with their favor style >>> >>> Okay. >>> >>> 8.After reset, KMD need to aware if the VRAM lost happens or not, >>> bare-metal can implement some function to judge, while for SR-IOV I >>> prefer to read it from GIM side (for initial version we consider >>> it’s always VRAM lost, till GIM side change aligned) >>> >>> Okay. >>> >>> 9.If VRAM lost not hit, continue, otherwise: >>> >>> a)Update adev->*vram_lost_counter*, >>> >>> Okay. >>> >>> b)Iterate over all living ctx, and set all ctx as “*guilty*” since >>> VRAM lost actually ruins all VRAM contents >>> >>> No, that shouldn't be done by comparing the counters. Iterating over >>> all contexts is way to much overhead. 
>>>
>>> c) Kick out all jobs in all ctx's KFIFO queues, and set all their fence status to "*ECANCELED*".
>>>
>>> Yes and no, that should be done when we try to run the jobs and not during GPU reset.
>>>
>>> 10. Do GTT recovery and VRAM page tables/entries recovery (optional, do we need it ???).
>>>
>>> Yes, that is still needed. As Nicolai explained, we can't be sure that VRAM is still 100% correct even when it isn't cleared.
>>>
>>> 11. Re-schedule all jobs remaining in the mirror list to the ring again and restart the scheduler (for the VRAM lost case, no job will be re-scheduled).
>>>
>>> Okay.
>>>
>>> - For the cs_wait() IOCTL:
>>>
>>> After it finds the fence signaled, it should check with *"dma_fence_get_status"* to see whether there is an error there, and return the error status of the fence.
>>>
>>> Yes and no, dma_fence_get_status() is some specific handling for sync_file debugging (no idea why that made it into the common fence code).
>>>
>>> It was replaced by putting the error code directly into the fence, so just reading that one after waiting should be ok.
>>>
>>> Maybe we should fix dma_fence_get_status() to do the right thing for this?
>>>
>>> - For the cs_wait_fences() IOCTL:
>>>
>>> Similar to the above approach.
>>>
>>> - For the cs_submit() IOCTL:
>>>
>>> It needs to check whether the current ctx has been marked as "*guilty*" and return "*ECANCELED*" if so.
>>>
>>> - Introduce a new IOCTL to let UMD query *vram_lost_counter*:
>>>
>>> This way, UMD can also block the app from submitting. Like @Nicolai mentioned, we can cache one copy of *vram_lost_counter* when enumerating the physical device, and deny all gl-contexts from submitting if the queried counter is bigger than the one cached in the physical device. (Looks a little overkill to me, but easy to implement.)
>>>
>>> UMD can also return an error to the app when creating a gl-context if the currently queried *vram_lost_counter* is bigger than the one cached in the physical device.
>>>
>>> Okay.
Already have a patch for this; please review that one if you haven't already done so.
>>>
>>> Regards,
>>> Christian.
>>>
>>> BTW: I realized that a gl-context is a little different from the kernel's context, because in the kernel a BO is not related to a context but only to an FD, while in UMD a BO has a backing gl-context. So blocking submission in the UMD layer is also needed, although KMD will do its job as the bottom line.
>>>
>>> - Basically, "vram_lost_counter" is exposed by the kernel to let UMD take control of the robustness extension feature; it will be UMD's call to make, and KMD only denies "guilty" contexts from submitting.
>>>
>>> Need your feedback, thx.
>>>
>>> We'd better make the TDR feature land ASAP.
>>>
>>> BR Monk
>>>

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread
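The numbered TDR steps above can be sketched as a toy model. This is illustrative only, with simplified stand-in structs rather than the real amdgpu code; it follows Monk's proposal as written (including the ETIME choice Christian objects to, which would be a one-line swap to ECANCELED) and elides the mirror-list/KFIFO purging and the actual reset callbacks:

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the real amdgpu structures (illustrative only). */
struct fence { int error; bool signaled; };
struct ctx   { bool guilty; };
struct job   { struct fence fence; struct ctx *ctx; };
struct adev  { unsigned gpu_reset_counter; unsigned vram_lost_counter; };

/* Steps 2/5 + 6: record an error on a fence and force-signal it so that
 * any UMD waiter wakes up instead of blocking forever. */
static void fence_cancel(struct fence *f, int err)
{
    f->error = err;
    f->signaled = true;
}

/* Steps 1-3, 6 and 9 of the proposed TDR routine, in order.  Kicking jobs
 * off the mirror list/KFIFO (4/5) and the reset itself (7/8) are elided. */
static void tdr_handle_timeout(struct adev *adev, struct job *bad_job,
                               struct ctx **live_ctx, size_t nctx,
                               bool vram_lost)
{
    adev->gpu_reset_counter++;               /* 1: bump reset counter       */
    fence_cancel(&bad_job->fence, ETIME);    /* 2+6: mark the timed-out job */
    bad_job->ctx->guilty = true;             /* 3: blame its context        */
    if (vram_lost) {                         /* 9: VRAM-lost handling       */
        adev->vram_lost_counter++;           /* 9a                          */
        for (size_t i = 0; i < nctx; i++)    /* 9b: VRAM loss affects all   */
            live_ctx[i]->guilty = true;      /*     living contexts         */
    }
}
```

The model makes the thread's main disagreement visible: everything here happens eagerly inside the timeout handler, which is exactly what Christian pushes back on below.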
* Re: TDR and VRAM lost handling in KMD:
  [not found] ` <BLUPR12MB0449287A92DF8D3EB30BE6A6844A0-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  2017-10-11  8:59 ` Nicolai Hähnle
@ 2017-10-11  9:02 ` Christian König
  [not found] ` <7a7a1830-5457-ea68-44dc-f88eb1e0a8fe-5C7GfCeVMHo@public.gmane.org>
  1 sibling, 1 reply; 23+ messages in thread
From: Christian König @ 2017-10-11 9:02 UTC (permalink / raw)
To: Liu, Monk, Haehnle, Nicolai, Olsak, Marek, Deucher, Alexander
Cc: Ramirez, Alejandro, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Filipas, Mario, Ding, Pixel, Li, Bingley, Jiang, Jerry (SW)

[-- Attachment #1.1: Type: text/plain, Size: 15195 bytes --]

> [ML] I think context is better than entity because, for example, if you only block entity_0 of a context and allow entity_N to run, the dependencies between entities are broken (e.g. page table updates in the SDMA entity pass, but the GFX submit in the GFX entity is blocked; that doesn't make sense to me).
>
> We'd better either block the whole context or not at all.

Page table updates are not part of any context. So I think the only thing we can do is to mark the entity as not scheduled any more.

> 5. Kick out all jobs in this "guilty" ctx's KFIFO queue, and set all their fence status to "*ECANCELED*".
>
> Setting ECANCELED should be ok. But I think we should do this when we try to run the jobs and not during GPU reset.
>
> [ML] Without deep thought and experiment, I'm not sure of the difference between them, but kicking it out in the gpu_reset routine is more efficient.

I really don't think so. Kicking them out during gpu_reset sounds racy to me once more.

And marking them canceled when we try to run them has the clear advantage that all dependencies are met first.

> ML: KMD marks all contexts as guilty because that way we can unify our IOCTL behavior: e.g. the IOCTLs only block the "guilty" context, with no need to worry about the vram-lost-counter anymore; that's an implementation style.
> I don't think it is related to the UMD layer.

I don't think that this is a good idea. Instead, when you want to unify the behavior, we should use the vram_lost_counter as the marker for the guilty context.

Regards,
Christian.

On 11.10.2017 at 10:48, Liu, Monk wrote:
>
> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), so it's reasonable to use it. However, it /does not/ make sense to mark idle contexts as "guilty" just because VRAM is lost. VRAM lost is a perfect example where the driver should report context lost to applications with the "innocent" flag for contexts that were idle at the time of reset. The only context(s) that should be reported as "guilty" (or perhaps "unknown" in some cases) are the ones that were executing at the time of reset.
>
> ML: KMD marks all contexts as guilty because that way we can unify our IOCTL behavior: e.g. the IOCTLs only block the "guilty" context, with no need to worry about the vram-lost-counter anymore; that's an implementation style. I don't think it is related to the UMD layer.
>
> For UMD, the gl-context isn't known by KMD, so UMD can implement its own "guilty" gl-context if it wants.
>
> If KMD doesn't mark all ctx as guilty after VRAM lost, can you illustrate what rule KMD should obey when checking in KMS IOCTLs like cs_submit?
let’s see which way better > > *From:*Haehnle, Nicolai > *Sent:* Wednesday, October 11, 2017 4:41 PM > *To:* Liu, Monk <Monk.Liu-5C7GfCeVMHo@public.gmane.org>; Koenig, Christian > <Christian.Koenig-5C7GfCeVMHo@public.gmane.org>; Olsak, Marek <Marek.Olsak-5C7GfCeVMHo@public.gmane.org>; > Deucher, Alexander <Alexander.Deucher-5C7GfCeVMHo@public.gmane.org> > *Cc:* amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org; Ding, Pixel <Pixel.Ding-5C7GfCeVMHo@public.gmane.org>; > Jiang, Jerry (SW) <Jerry.Jiang-5C7GfCeVMHo@public.gmane.org>; Li, Bingley > <Bingley.Li-5C7GfCeVMHo@public.gmane.org>; Ramirez, Alejandro <Alejandro.Ramirez-5C7GfCeVMHo@public.gmane.org>; > Filipas, Mario <Mario.Filipas-5C7GfCeVMHo@public.gmane.org> > *Subject:* Re: TDR and VRAM lost handling in KMD: > > From a Mesa perspective, this almost all sounds reasonable to me. > > On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), so > it's reasonable to use it. However, it /does not/ make sense to mark > idle contexts as "guilty" just because VRAM is lost. VRAM lost is a > perfect example where the driver should report context lost to > applications with the "innocent" flag for contexts that were idle at > the time of reset. The only context(s) that should be reported as > "guilty" (or perhaps "unknown" in some cases) are the ones that were > executing at the time of reset. > > On whether the whole context is marked as guilty from a user space > perspective, it would simply be nice for user space to get consistent > answers. It would be a bit odd if we could e.g. succeed in submitting > an SDMA job after a GFX job was rejected. This would point in favor of > marking the entire context as guilty (although that could happen > lazily instead of at reset time). On the other hand, if that's too big > a burden for the kernel implementation I'm sure we can live without it. 
>
> Cheers,
> Nicolai
>
> ------------------------------------------------------------------------
>
> *From:* Liu, Monk
> *Sent:* Wednesday, October 11, 2017 10:15:40 AM
> *To:* Koenig, Christian; Haehnle, Nicolai; Olsak, Marek; Deucher, Alexander
> *Cc:* amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org; Ding, Pixel; Jiang, Jerry (SW); Li, Bingley; Ramirez, Alejandro; Filipas, Mario
> *Subject:* RE: TDR and VRAM lost handling in KMD:
>
> 2. Set its fence error status to "*ETIME*".
>
> No, as I already explained, ETIME is for synchronous operations.
>
> In other words, when we return ETIME from the wait IOCTL it would mean that the waiting has somehow timed out, but not the job we waited for.
>
> Please use ECANCELED as well, or some other error code if we find that we need to distinguish the timed-out job from the canceled ones (probably a good idea, but I'm not sure).
>
> [ML] I'm okay if you insist not to use ETIME.
>
> 3. Find the entity/ctx behind this job, and set this ctx as "*guilty*".
>
> Not sure. Do we want to set the whole context as guilty or just the entity?
>
> Setting the whole context as guilty sounds racy to me.
>
> BTW: We should use a different name than "guilty", maybe just "bool canceled;"?
>
> [ML] I think context is better than entity because, for example, if you only block entity_0 of a context and allow entity_N to run, the dependencies between entities are broken (e.g. page table updates in the SDMA entity pass, but the GFX submit in the GFX entity is blocked; that doesn't make sense to me).
>
> We'd better either block the whole context or not at all.
>
> 5. Kick out all jobs in this "guilty" ctx's KFIFO queue, and set all their fence status to "*ECANCELED*".
>
> Setting ECANCELED should be ok. But I think we should do this when we try to run the jobs and not during GPU reset.
>
> [ML] Without deep thought and experiment, I'm not sure of the difference between them, but kicking it out in the gpu_reset routine is more efficient.
>
> Otherwise you need to check the context/entity guilty flag in the run_job routine … and you need to do it for every context/entity. I don't see why we shouldn't just kick all of them out in the gpu_reset stage ….
>
> b) Iterate over all living ctx, and set all ctx as "*guilty*", since VRAM lost actually ruins all VRAM contents.
>
> No, that shouldn't be done by comparing the counters. Iterating over all contexts is way too much overhead.
>
> [ML] Because I want to keep the KMS IOCTL rules clean: they don't need to differentiate VRAM lost or not, they are only interested in whether the context is guilty or not, and they block submission for guilty ones.
>
> *Can you give more details of your idea? And better, the detailed implementation in cs_submit; I want to see how you want to block submission without checking the context guilty flag.*
>
> c) Kick out all jobs in all ctx's KFIFO queues, and set all their fence status to "*ECANCELED*".
>
> Yes and no, that should be done when we try to run the jobs and not during GPU reset.
>
> [ML] Again, kicking them out in the gpu reset routine is highly efficient; otherwise you need to check every job in run_job().
>
> Besides, can you illustrate the detailed implementation?
>
> Yes and no, dma_fence_get_status() is some specific handling for sync_file debugging (no idea why that made it into the common fence code).
>
> It was replaced by putting the error code directly into the fence, so just reading that one after waiting should be ok.
>
> Maybe we should fix dma_fence_get_status() to do the right thing for this?
>
> [ML] Yeah, that's too confusing; the name really sounds like the one I want to use, we should change it…
>
> But looking into the implementation, I don't see why we cannot use it?
it also finally returns the fence->error.

> *From:* Koenig, Christian
> *Sent:* Wednesday, October 11, 2017 3:21 PM
> *Subject:* Re: TDR and VRAM lost handling in KMD:
>
> [...]

[-- Attachment #1.2: Type: text/html, Size: 65730 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread
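Christian's recurring point in this exchange is to mark the entity and cancel jobs lazily when the scheduler would run them, rather than purging queues inside gpu_reset. That shape can be sketched as follows (toy stand-in types, not the real gpu_scheduler code):

```c
#include <errno.h>
#include <stdbool.h>

struct fence  { int error; bool signaled; };
struct entity { bool canceled; };   /* Christian's suggested name */
struct job    { struct fence fence; struct entity *entity; };

/* Called when the scheduler would hand the job to the ring.  By this
 * point all of the job's dependencies have been met, which is the
 * claimed advantage over purging queues inside the gpu_reset routine. */
static bool sched_should_run_job(struct job *job)
{
    if (job->entity->canceled) {
        job->fence.error = ECANCELED;   /* set the error lazily         */
        job->fence.signaled = true;     /* signal so waiters don't hang */
        return false;                   /* drop the job                 */
    }
    return true;                        /* submit to the ring           */
}
```

Monk's efficiency objection is visible here too: the `canceled` check runs once per job at submission time, instead of one bulk purge during reset.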
[parent not found: <7a7a1830-5457-ea68-44dc-f88eb1e0a8fe-5C7GfCeVMHo@public.gmane.org>]
* Re: TDR and VRAM lost handling in KMD: [not found] ` <7a7a1830-5457-ea68-44dc-f88eb1e0a8fe-5C7GfCeVMHo@public.gmane.org> @ 2017-10-11 9:16 ` Nicolai Hähnle 2017-10-11 9:27 ` Liu, Monk 1 sibling, 0 replies; 23+ messages in thread From: Nicolai Hähnle @ 2017-10-11 9:16 UTC (permalink / raw) To: Christian König, Liu, Monk, Olsak, Marek, Deucher, Alexander Cc: Ramirez, Alejandro, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Filipas, Mario, Ding, Pixel, Li, Bingley, Jiang, Jerry (SW) On 11.10.2017 11:02, Christian König wrote: >> 1.Kick out all jobs in this “guilty” ctx’s KFIFO queue, and set all >> their fence status to “*ECANCELED*” >> >> Setting ECANCELED should be ok. But I think we should do this when we >> try to run the jobs and not during GPU reset. >> >> [ML] without deep thought and expritment, I’m not sure the difference >> between them, but kick it out in gpu_reset routine is more efficient, >> > I really don't think so. Kicking them out during gpu_reset sounds racy > to me once more. > > And marking them canceled when we try to run them has the clear > advantage that all dependencies are meet first. This makes sense to me as well. It raises a vaguely related question: What happens to jobs whose dependencies were canceled? I believe we currently don't check those errors, so we might execute them anyway if their contexts were unaffected by the reset. There's a risk that the job will hang due to stale data. I don't think it's a huge risk in practice today because we don't have a lot of buffer sharing between applications, but it's something to think through at some point. In a way, canceling out of an abundance of caution may be a bad idea because it could kill a compositor's task by being overly conservative. Cheers, Nicolai _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply [flat|nested] 23+ messages in thread
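The dependency check Nicolai asks about could be as simple as inspecting each dependency fence's error before running a job; whether a dirty dependency should actually cancel the dependent job is exactly the policy trade-off he describes. A sketch with a simplified fence type (not the real dma_fence):

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct fence { int error; bool signaled; };

/* Returns true if every dependency completed without error.  A dependency
 * that signaled with an error (e.g. -ECANCELED after a reset) may have
 * left stale data behind, so a conservative scheduler would cancel the
 * dependent job too; as noted above, that may be overly aggressive for
 * e.g. a compositor whose own context was unaffected by the reset. */
static bool job_deps_clean(const struct fence **deps, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (deps[i]->signaled && deps[i]->error != 0)
            return false;
    return true;
}
```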
* RE: TDR and VRAM lost handling in KMD:
  [not found] ` <7a7a1830-5457-ea68-44dc-f88eb1e0a8fe-5C7GfCeVMHo@public.gmane.org>
  2017-10-11  9:16 ` Nicolai Hähnle
@ 2017-10-11  9:27 ` Liu, Monk
  1 sibling, 0 replies; 23+ messages in thread
From: Liu, Monk @ 2017-10-11 9:27 UTC (permalink / raw)
To: Koenig, Christian, Haehnle, Nicolai, Olsak, Marek, Deucher, Alexander
Cc: Ramirez, Alejandro, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Filipas, Mario, Ding, Pixel, Li, Bingley, Jiang, Jerry (SW)

[-- Attachment #1.1: Type: text/plain, Size: 15673 bytes --]

ML: KMD marks all contexts as guilty because that way we can unify our IOCTL behavior: e.g. the IOCTLs only block the "guilty" context, with no need to worry about the vram-lost-counter anymore; that's an implementation style. I don't think it is related to the UMD layer.

I don't think that this is a good idea. Instead, when you want to unify the behavior, we should use the vram_lost_counter as the marker for the guilty context.

[ML] Say we only block at the entity level; then we have two rules:

1) we block submission for a "guilty" entity in the run_job routine (and mark the entity as guilty in gpu_reset);

2) for an innocent entity, we still need to check vram_lost_counter in cs_submit, correct?

Besides: Nicolai reminded me that we have amdgpu_ctx_query() to worry about … when we mark some entity as "guilty", do we need to mark the context behind it as "AMDGPU_CTX_GUILTY_RESET"? That I didn't think of … I just ignored it ….
BR Monk From: Koenig, Christian Sent: Wednesday, October 11, 2017 5:03 PM To: Liu, Monk <Monk.Liu@amd.com>; Haehnle, Nicolai <Nicolai.Haehnle@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com> Cc: amd-gfx@lists.freedesktop.org; Ding, Pixel <Pixel.Ding@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, Bingley <Bingley.Li@amd.com>; Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; Filipas, Mario <Mario.Filipas@amd.com> Subject: Re: TDR and VRAM lost handling in KMD: [ML] I think context is better than entity, because for example if you only block entity_0 of context and allow entity_N run, that means the dependency between entities are broken (e.g. page table updates in Sdma entity pass but gfx submit in GFX entity blocked, not make sense to me) We’d better either block the whole context or let not… Page table updates are not part of any context. So I think the only thing we can do is to mark the entity as not scheduled any more. 1. Kick out all jobs in this “guilty” ctx’s KFIFO queue, and set all their fence status to “ECANCELED” Setting ECANCELED should be ok. But I think we should do this when we try to run the jobs and not during GPU reset. [ML] without deep thought and expritment, I’m not sure the difference between them, but kick it out in gpu_reset routine is more efficient, I really don't think so. Kicking them out during gpu_reset sounds racy to me once more. And marking them canceled when we try to run them has the clear advantage that all dependencies are meet first. ML: KMD mark all contexts as guilty is because that way we can unify our IOCTL behavior: e.g. for IOCTL only block “guilty”context , no need to worry about vram-lost-counter anymore, that’s a implementation style. I don’t think it is related with UMD layer, I don't think that this is a good idea. Instead when you want to unify the behavior we should use the vram_lost_counter as marker for the guilty context. Regards, Christian. 
Am 11.10.2017 um 10:48 schrieb Liu, Monk: On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), so it's reasonable to use it. However, it does not make sense to mark idle contexts as "guilty" just because VRAM is lost. VRAM lost is a perfect example where the driver should report context lost to applications with the "innocent" flag for contexts that were idle at the time of reset. The only context(s) that should be reported as "guilty" (or perhaps "unknown" in some cases) are the ones that were executing at the time of reset. ML: KMD mark all contexts as guilty is because that way we can unify our IOCTL behavior: e.g. for IOCTL only block “guilty”context , no need to worry about vram-lost-counter anymore, that’s a implementation style. I don’t think it is related with UMD layer, For UMD the gl-context isn’t aware of by KMD, so UMD can implement it own “guilty” gl-context if you want. If KMD doesn’t mark all ctx as guilty after VRAM lost, can you illustrate what rule KMD should obey to check in KMS IOCTL like cs_sumbit ?? let’s see which way better From: Haehnle, Nicolai Sent: Wednesday, October 11, 2017 4:41 PM To: Liu, Monk <Monk.Liu@amd.com><mailto:Monk.Liu@amd.com>; Koenig, Christian <Christian.Koenig@amd.com><mailto:Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com><mailto:Marek.Olsak@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com><mailto:Alexander.Deucher@amd.com> Cc: amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>; Ding, Pixel <Pixel.Ding@amd.com><mailto:Pixel.Ding@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com><mailto:Jerry.Jiang@amd.com>; Li, Bingley <Bingley.Li@amd.com><mailto:Bingley.Li@amd.com>; Ramirez, Alejandro <Alejandro.Ramirez@amd.com><mailto:Alejandro.Ramirez@amd.com>; Filipas, Mario <Mario.Filipas@amd.com><mailto:Mario.Filipas@amd.com> Subject: Re: TDR and VRAM lost handling in KMD: From a Mesa perspective, this almost all sounds reasonable to me. 
On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), so it's reasonable to use it. However, it does not make sense to mark idle contexts as "guilty" just because VRAM is lost. VRAM lost is a perfect example where the driver should report context lost to applications with the "innocent" flag for contexts that were idle at the time of reset. The only context(s) that should be reported as "guilty" (or perhaps "unknown" in some cases) are the ones that were executing at the time of reset. On whether the whole context is marked as guilty from a user space perspective, it would simply be nice for user space to get consistent answers. It would be a bit odd if we could e.g. succeed in submitting an SDMA job after a GFX job was rejected. This would point in favor of marking the entire context as guilty (although that could happen lazily instead of at reset time). On the other hand, if that's too big a burden for the kernel implementation I'm sure we can live without it. Cheers, Nicolai ________________________________ From: Liu, Monk Sent: Wednesday, October 11, 2017 10:15:40 AM To: Koenig, Christian; Haehnle, Nicolai; Olsak, Marek; Deucher, Alexander Cc: amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>; Ding, Pixel; Jiang, Jerry (SW); Li, Bingley; Ramirez, Alejandro; Filipas, Mario Subject: RE: TDR and VRAM lost handling in KMD: 1. Set its fence error status to “ETIME”, No, as I already explained ETIME is for synchronous operation. In other words when we return ETIME from the wait IOCTL it would mean that the waiting has somehow timed out, but not the job we waited for. Please use ECANCELED as well or some other error code when we find that we need to distinct the timedout job from the canceled ones (probably a good idea, but I'm not sure). [ML] I’m okay if you insist not to use ETIME 1. Find the entity/ctx behind this job, and set this ctx as “guilty” Not sure. Do we want to set the whole context as guilty or just the entity? 
Setting the whole contexts as guilty sounds racy to me. BTW: We should use a different name than "guilty", maybe just "bool canceled;" ? [ML] I think context is better than entity, because for example if you only block entity_0 of context and allow entity_N run, that means the dependency between entities are broken (e.g. page table updates in Sdma entity pass but gfx submit in GFX entity blocked, not make sense to me) We’d better either block the whole context or let not… 1. Kick out all jobs in this “guilty” ctx’s KFIFO queue, and set all their fence status to “ECANCELED” Setting ECANCELED should be ok. But I think we should do this when we try to run the jobs and not during GPU reset. [ML] without deep thought and expritment, I’m not sure the difference between them, but kick it out in gpu_reset routine is more efficient, Otherwise you need to check context/entity guilty flag in run_job routine … and you need to it for every context/entity, I don’t see why We don’t just kickout all of them in gpu_reset stage …. a) Iterate over all living ctx, and set all ctx as “guilty” since VRAM lost actually ruins all VRAM contents No, that shouldn't be done by comparing the counters. Iterating over all contexts is way to much overhead. [ML] because I want to make KMS IOCTL rules clean, like they don’t need to differentiate VRAM lost or not, they only interested in if the context is guilty or not, and block Submit for guilty ones. Can you give more details of your idea? And better the detail implement in cs_submit, I want to see how you want to block submit without checking context guilty flag a) Kick out all jobs in all ctx’s KFIFO queue, and set all their fence status to “ECANCELDED” Yes and no, that should be done when we try to run the jobs and not during GPU reset. [ML] again, kicking out them in gpu reset routine is high efficient, otherwise you need check on every job in run_job() Besides, can you illustrate the detail implementation ? 
Yes and no, dma_fence_get_status() is some specific handling for sync_file debugging (no idea why that made it into the common fence code). It was replaced by putting the error code directly into the fence, so just reading that one after waiting should be ok. Maybe we should fix dma_fence_get_status() to do the right thing for this? [ML] yeah, that’s too confusing, the name sound really the one I want to use, we should change it… But look into the implement, I don’t see why we cannot use it ? it also finally return the fence->error From: Koenig, Christian Sent: Wednesday, October 11, 2017 3:21 PM To: Liu, Monk <Monk.Liu@amd.com<mailto:Monk.Liu@amd.com>>; Haehnle, Nicolai <Nicolai.Haehnle@amd.com<mailto:Nicolai.Haehnle@amd.com>>; Olsak, Marek <Marek.Olsak@amd.com<mailto:Marek.Olsak@amd.com>>; Deucher, Alexander <Alexander.Deucher@amd.com<mailto:Alexander.Deucher@amd.com>> Cc: amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>; Ding, Pixel <Pixel.Ding@amd.com<mailto:Pixel.Ding@amd.com>>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com<mailto:Jerry.Jiang@amd.com>>; Li, Bingley <Bingley.Li@amd.com<mailto:Bingley.Li@amd.com>>; Ramirez, Alejandro <Alejandro.Ramirez@amd.com<mailto:Alejandro.Ramirez@amd.com>>; Filipas, Mario <Mario.Filipas@amd.com<mailto:Mario.Filipas@amd.com>> Subject: Re: TDR and VRAM lost handling in KMD: See inline: Am 11.10.2017 um 07:33 schrieb Liu, Monk: Hi Christian & Nicolai, We need to achieve some agreements on what should MESA/UMD do and what should KMD do, please give your comments with “okay” or “No” and your idea on below items, ? When a job timed out (set from lockup_timeout kernel parameter), What KMD should do in TDR routine : 1. Update adev->gpu_reset_counter, and stop scheduler first, (gpu_reset_counter is used to force vm flush after GPU reset, out of this thread’s scope so no more discussion on it) Okay. 2. Set its fence error status to “ETIME”, No, as I already explained ETIME is for synchronous operation. 
In other words, when we return ETIME from the wait IOCTL it would mean that the waiting has somehow timed out, but not the job we waited for. Please use ECANCELED as well, or some other error code if we find that we need to distinguish the timed-out job from the canceled ones (probably a good idea, but I'm not sure).

3. Find the entity/ctx behind this job, and set this ctx as "guilty"

Not sure. Do we want to set the whole context as guilty, or just the entity? Setting the whole context as guilty sounds racy to me. BTW: We should use a different name than "guilty", maybe just "bool canceled;"?

4. Kick this job out of the scheduler's mirror list, so this job won't get re-scheduled to the ring anymore.

Okay.

5. Kick out all jobs in this "guilty" ctx's KFIFO queue, and set all their fence status to "ECANCELED"

Setting ECANCELED should be ok. But I think we should do this when we try to run the jobs and not during GPU reset.

6. Force-signal all fences that get kicked out by the above two steps; otherwise UMD will block forever if waiting on those fences

Okay.

7. Do the GPU reset, which can be some callbacks so that bare-metal and SR-IOV can each implement it in their own style

Okay.

8. After reset, KMD needs to know whether VRAM loss happened or not; bare-metal can implement some function to judge, while for SR-IOV I prefer to read it from the GIM side (for the initial version we consider it always VRAM lost, until the GIM side change is aligned)

Okay.

9. If VRAM loss was not hit, continue; otherwise:

a) Update adev->vram_lost_counter,

Okay.

b) Iterate over all living ctx, and set all ctx as "guilty" since VRAM loss actually ruins all VRAM contents

No, that shouldn't be done by comparing the counters. Iterating over all contexts is way too much overhead.

c) Kick out all jobs in all ctx's KFIFO queues, and set all their fence status to "ECANCELED"

Yes and no, that should be done when we try to run the jobs and not during GPU reset.

10.
Do GTT recovery and VRAM page tables/entries recovery (optional, do we need it???)

Yes, that is still needed. As Nicolai explained, we can't be sure that VRAM is still 100% correct even when it isn't cleared.

11. Re-schedule all jobs remaining in the mirror list to the ring again and restart the scheduler (for the VRAM-lost case, no job will be re-scheduled)

Okay.

- For the cs_wait() IOCTL: after it finds the fence signaled, it should check with "dma_fence_get_status" to see whether there is an error there, and return the error status of the fence

Yes and no, dma_fence_get_status() is some specific handling for sync_file debugging (no idea why that made it into the common fence code). It was replaced by putting the error code directly into the fence, so just reading that one after waiting should be ok. Maybe we should fix dma_fence_get_status() to do the right thing for this?

- For the cs_wait_fences() IOCTL: similar to the above approach

- For the cs_submit() IOCTL: it needs to check whether the current ctx has been marked as "guilty" and return "ECANCELED" if so

- Introduce a new IOCTL to let UMD query vram_lost_counter: this way, UMD can also block the app from submitting. As @Nicolai mentioned, we can cache one copy of vram_lost_counter when enumerating the physical device, and deny all gl-contexts from submitting if the queried counter is bigger than the one cached in the physical device (looks a little overkill to me, but easy to implement). UMD can also return an error to the app when creating a gl-context if the currently queried vram_lost_counter is bigger than the one cached in the physical device.

Okay. Already have a patch for this, please review that one if you haven't already done so.

Regards,
Christian.

BTW: I realized that a gl-context is a little different from the kernel's context, because for the kernel a BO is not related to a context but only to an FD, while in UMD a BO has a backing gl-context. So blocking submission in the UMD layer is also needed, although KMD will do its job as the bottom line.
Basically, "vram_lost_counter" is exposed by the kernel to let UMD take control of the robustness extension feature; it will be UMD's call to make, and KMD only denies "guilty" contexts from submitting.

Need your feedback, thanks. We'd better get the TDR feature landed ASAP.

BR Monk
end of thread, other threads: [~2017-10-11 14:04 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
2017-10-11  5:33 TDR and VRAM lost handling in KMD: Liu, Monk
2017-10-11  7:14 ` Liu, Monk
2017-10-11  7:20 ` Christian König
2017-10-11  8:15 ` Liu, Monk
2017-10-11  8:40 ` Haehnle, Nicolai
2017-10-11  8:48 ` Liu, Monk
2017-10-11  8:59 ` Nicolai Hähnle
2017-10-11  9:18 ` Liu, Monk
2017-10-11  9:25 ` Nicolai Hähnle
2017-10-11  9:41 ` Liu, Monk
2017-10-11 10:14 ` Chunming Zhou
2017-10-11 10:39 ` Christian König
2017-10-11 13:35 ` Liu, Monk
2017-10-11 13:39 ` Christian König
2017-10-11 13:51 ` Liu, Monk
2017-10-11 13:59 ` Liu, Monk
2017-10-11 14:04 ` Christian König
2017-10-11 14:03 ` Christian König
2017-10-11 14:04 ` Liu, Monk
2017-10-11 13:27 ` Liu, Monk
2017-10-11  9:02 ` Christian König
2017-10-11  9:16 ` Nicolai Hähnle
2017-10-11  9:27 ` Liu, Monk