* TDR and VRAM lost handling in KMD:
From: Liu, Monk @ 2017-10-11  5:33 UTC (permalink / raw)
  To: Koenig, Christian, Haehnle, Nicolai, Olsak, Marek, Deucher, Alexander
  Cc: Ramirez, Alejandro, amd-gfx@lists.freedesktop.org,
	Filipas, Mario, Ding, Pixel, Li, Bingley, Jiang, Jerry (SW)



Hi Christian & Nicolai,

We need to reach agreement on what MESA/UMD should do and what KMD should do. Please reply with “okay” or “no” and your thoughts on the items below:


- When a job times out (set from the lockup_timeout kernel parameter), what KMD should do in the TDR routine (a rough C sketch of the whole sequence follows the list):


1. Update adev->gpu_reset_counter, and stop the scheduler first (gpu_reset_counter is used to force a VM flush after GPU reset; out of this thread’s scope, so no more discussion on it)

2. Set the timed-out job’s fence error status to “ETIME”

3. Find the entity/ctx behind this job, and set this ctx as “guilty”

4. Kick out this job from the scheduler’s mirror list, so this job won’t get re-scheduled to the ring anymore

5. Kick out all jobs in this “guilty” ctx’s KFIFO queue, and set all their fence status to “ECANCELED”

6. Force-signal all fences that get kicked out by the above two steps; otherwise UMD will block forever if waiting on those fences

7. Do the GPU reset, which can be some callbacks to let bare-metal and SR-IOV each implement it in their preferred style

8. After reset, KMD needs to be aware whether VRAM loss happened or not; bare-metal can implement some function to judge this, while for SR-IOV I prefer to read it from the GIM side (for the initial version we consider it always VRAM lost, until the GIM-side change is aligned)

9. If VRAM loss was not hit, continue; otherwise:

a) Update adev->vram_lost_counter

b) Iterate over all living ctx, and set all ctx as “guilty”, since VRAM loss actually ruins all VRAM contents

c) Kick out all jobs in all ctx’s KFIFO queues, and set all their fence status to “ECANCELED”

10. Do GTT recovery and VRAM page tables/entries recovery (optional; do we need it?)

11. Re-schedule all jobs remaining in the mirror list to the ring again and restart the scheduler (for the VRAM-lost case, no job will be re-scheduled)
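
To make the sequence concrete, here is a rough (non-compiling) C sketch of steps 1-11 against the amdgpu/scheduler structures. amdgpu_ctx_cancel_jobs(), amdgpu_all_ctx_cancel_jobs(), amdgpu_vram_lost(), the job->ctx back-pointer and the “guilty” flag are assumptions for illustration, not existing API:

    /* Sketch only; helpers marked "assumed" do not exist yet. */
    static void amdgpu_tdr_recover(struct amdgpu_device *adev,
                                   struct amdgpu_job *bad_job)
    {
            struct amdgpu_ctx *ctx = bad_job->ctx;          /* 3. assumed */

            atomic_inc(&adev->gpu_reset_counter);           /* 1. */
            kthread_park(bad_job->ring->sched.thread);      /* 1. stop scheduler */

            /* 2. record the error before the fence gets force-signaled */
            dma_fence_set_error(&bad_job->base.s_fence->finished, -ETIME);
            ctx->guilty = true;                             /* 3. assumed flag */

            amd_sched_hw_job_reset(&bad_job->ring->sched);  /* 4. mirror list */
            amdgpu_ctx_cancel_jobs(ctx, -ECANCELED);        /* 5.+6. assumed:
                                                             * cancel + signal */

            amdgpu_gpu_reset(adev);                         /* 7. BM/SR-IOV cb */

            if (amdgpu_vram_lost(adev)) {                   /* 8. assumed */
                    atomic_inc(&adev->vram_lost_counter);   /* 9-a. */
                    amdgpu_all_ctx_cancel_jobs(adev, -ECANCELED); /* 9-b/c. */
            }

            /* 10. GTT / page-table recovery would go here, if we keep it. */

            amd_sched_job_recovery(&bad_job->ring->sched);  /* 11. resubmit */
            kthread_unpark(bad_job->ring->sched.thread);    /* 11. restart */
    }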


- For the cs_wait() IOCTL:
After it finds the fence signaled, it should check with “dma_fence_get_status” to see if there is an error there,
and return the error status of the fence.
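
For instance, the wait path could end with something like this (a sketch; dma_fence_wait_timeout() and dma_fence_get_status() are existing kernel helpers, the wrapper itself is made up):

    /* Returns 1 if signaled cleanly, 0 on timeout, or a negative error:
     * either from the wait itself or from the fence (e.g. -ECANCELED). */
    static long amdgpu_wait_fence_status(struct dma_fence *fence, long timeout)
    {
            long r = dma_fence_wait_timeout(fence, true, timeout);

            if (r > 0) {
                    int status = dma_fence_get_status(fence);
                    if (status < 0)
                            return status;  /* surface fence->error to UMD */
                    return 1;
            }
            return r;       /* 0 == timed out, < 0 == interrupted */
    }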


- For the cs_wait_fences() IOCTL:
Similar to the above approach.


- For the cs_submit() IOCTL:
It needs to check whether the current ctx has been marked as “guilty” and return “ECANCELED” if so.
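
For example (a sketch; the “guilty” flag is the proposed marker, not existing code):

    /* Early in the cs_submit path, before touching the hardware. */
    static int amdgpu_cs_check_guilty(struct amdgpu_ctx *ctx)
    {
            if (ctx->guilty)        /* assumed flag set by the TDR routine */
                    return -ECANCELED;
            return 0;
    }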


- Introduce a new IOCTL to let UMD query vram_lost_counter:
This way, UMD can also block the app from submitting. Like @Nicolai mentioned, we can cache one copy of vram_lost_counter when enumerating the physical device, and deny all
gl-contexts from submitting if the queried counter is bigger than the one cached in the physical device (looks a little overkill to me, but easy to implement).
UMD can also return an error to the app when creating a gl-context if the currently queried vram_lost_counter is bigger than the one cached in the physical device.
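
On the UMD side that could look roughly like this (Mesa-style C sketch; the struct and the IOCTL wrapper are made up, since the kernel query does not exist yet):

    #include <stdbool.h>
    #include <stdint.h>

    struct phys_dev {
            uint64_t vram_lost_counter;     /* cached at enumeration time */
    };

    /* Assumed wrapper around the proposed query IOCTL. */
    uint64_t query_vram_lost_counter(int drm_fd);

    /* Deny submission, or fail gl-context creation, when VRAM was lost
     * after the physical device was enumerated. */
    static bool context_is_lost(const struct phys_dev *pdev, int drm_fd)
    {
            return query_vram_lost_counter(drm_fd) > pdev->vram_lost_counter;
    }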

BTW: I realized that a gl-context is a little different from the kernel’s context, because for the kernel a BO is not related to a context but only to an FD, while in UMD a BO belongs to a
gl-context, so blocking submission in the UMD layer is also needed, although KMD will do its job as the bottom line.


- Basically, “vram_lost_counter” is exposed by the kernel to let UMD take control of the robustness extension feature; it will be UMD’s call how to act, and KMD only denies “guilty” contexts from submitting.


Need your feedback, thx

We’d better get the TDR feature landed ASAP

BR Monk






* RE: TDR and VRAM lost handling in KMD:
From: Liu, Monk @ 2017-10-11  7:14 UTC (permalink / raw)
  To: Koenig, Christian, Haehnle, Nicolai, Olsak, Marek, Deucher,
	Alexander, Mao, David
  Cc: Ramirez, Alejandro,
	amd-gfx@lists.freedesktop.org,
	Filipas, Mario, Ding, Pixel, Li, Bingley, Jiang, Jerry (SW)



+ david



* Re: TDR and VRAM lost handling in KMD:
From: Christian König @ 2017-10-11  7:20 UTC (permalink / raw)
  To: Liu, Monk, Haehnle, Nicolai, Olsak, Marek, Deucher, Alexander
  Cc: Ramirez, Alejandro, amd-gfx@lists.freedesktop.org,
	Filipas, Mario, Ding, Pixel, Li, Bingley, Jiang, Jerry (SW)



See inline:

On 11.10.2017 at 07:33, Liu, Monk wrote:
>
> Hi Christian & Nicolai,
>
> We need to reach agreement on what MESA/UMD should do and what KMD
> should do. *Please reply with “okay” or “no” and your thoughts on
> the items below:*
>
> - When a job times out (set from the lockup_timeout kernel parameter),
> what KMD should do in the TDR routine:
>
> 1. Update adev->*gpu_reset_counter*, and stop the scheduler first
> (*gpu_reset_counter* is used to force a VM flush after GPU reset; out of
> this thread’s scope, so no more discussion on it)
>
>
Okay.

> 2. Set the timed-out job’s fence error status to “*ETIME*”
>
No, as I already explained, ETIME is for synchronous operations.

In other words, returning ETIME from the wait IOCTL would mean that the
waiting itself has somehow timed out, but not the job we waited for.

Please use ECANCELED here as well, or some other error code when we find
that we need to distinguish the timed-out job from the canceled ones
(probably a good idea, but I'm not sure).
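
The cancel path would then boil down to something like this (a sketch; dma_fence_set_error() and dma_fence_signal() are the existing helpers, and the error must be set before signaling):

    /* Sketch: cancel a job's scheduler fence with the agreed error code. */
    static void amd_sched_cancel_job(struct amd_sched_job *job, int error)
    {
            dma_fence_set_error(&job->s_fence->finished, error); /* -ECANCELED */
            dma_fence_signal(&job->s_fence->finished);
    }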

> 3. Find the entity/ctx behind this job, and set this ctx as “*guilty*”
>
Not sure. Do we want to set the whole context as guilty or just the entity?

Setting the whole context as guilty sounds racy to me.

BTW: We should use a different name than "guilty", maybe just "bool 
canceled;" ?

> 4. Kick out this job from the scheduler’s mirror list, so this job won’t
> get re-scheduled to the ring anymore
>
Okay.

> 5. Kick out all jobs in this “guilty” ctx’s KFIFO queue, and set all
> their fence status to “*ECANCELED*”
>
Setting ECANCELED should be ok. But I think we should do this when we 
try to run the jobs and not during GPU reset.
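
I.e. something like this in the run_job path instead of walking the queues during reset (a sketch; the guilty flag, the ctx back-pointer and the NULL-return handling are assumptions):

    static struct dma_fence *amdgpu_sched_run_job(struct amd_sched_job *sched_job)
    {
            struct amdgpu_job *job = to_amdgpu_job(sched_job);

            if (job->ctx && job->ctx->guilty) {     /* assumed flag */
                    dma_fence_set_error(&sched_job->s_fence->finished,
                                        -ECANCELED);
                    return NULL;    /* assuming the scheduler then completes
                                     * the finished fence without touching
                                     * the hardware */
            }
            return amdgpu_job_run_normal(job);      /* assumed normal path */
    }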

> 6. Force-signal all fences that get kicked out by the above two steps,
> *otherwise UMD will block forever if waiting on those fences*
>
Okay.

>
> 7. Do the GPU reset, which can be some callbacks to let bare-metal and
> SR-IOV each implement it in their preferred style
>
Okay.

> 8. After reset, KMD needs to be aware whether VRAM loss happened or not;
> bare-metal can implement some function to judge this, while for SR-IOV I
> prefer to read it from the GIM side (for the initial version we consider
> it always VRAM lost, until the GIM-side change is aligned)
>
Okay.

> 9. If VRAM loss was not hit, continue; otherwise:
>
> a) Update adev->*vram_lost_counter*
>
Okay.

> b) Iterate over all living ctx, and set all ctx as “*guilty*” since
> VRAM loss actually ruins all VRAM contents
>
No, that should rather be done by comparing the counters. Iterating over
all contexts is way too much overhead.

> c) Kick out all jobs in all ctx’s KFIFO queues, and set all their fence
> status to “*ECANCELED*”
>
Yes and no, that should be done when we try to run the jobs and not 
during GPU reset.

> 10. Do GTT recovery and VRAM page tables/entries recovery (optional; do
> we need it?)
>
Yes, that is still needed. As Nicolai explained, we can't be sure that
VRAM is still 100% correct even when it isn't cleared.

> 11. Re-schedule all jobs remaining in the mirror list to the ring again
> and restart the scheduler (for the VRAM-lost case, no job will be
> re-scheduled)
>
Okay.

> - For the cs_wait() IOCTL:
>
> After it finds the fence signaled, it should check with
> *“dma_fence_get_status”* to see if there is an error there,
>
> and return the error status of the fence.
>
Yes and no, dma_fence_get_status() is some specific handling for 
sync_file debugging (no idea why that made it into the common fence code).

It was replaced by putting the error code directly into the fence, so 
just reading that one after waiting should be ok.

Maybe we should fix dma_fence_get_status() to do the right thing for this?
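
For reference, the helper already reduces to roughly this (simplified from drivers/dma-buf/dma-fence.c), which is why reading fence->error after waiting and calling dma_fence_get_status() end up equivalent for signaled fences:

    /* Simplified from the kernel's dma_fence_get_status_locked(). */
    int dma_fence_get_status_locked(struct dma_fence *fence)
    {
            if (dma_fence_is_signaled_locked(fence))
                    return fence->error ?: 1;       /* < 0 error, 1 == OK */
            else
                    return 0;                       /* still pending */
    }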

> - For the cs_wait_fences() IOCTL:
>
> Similar to the above approach.
>
> - For the cs_submit() IOCTL:
>
> It needs to check whether the current ctx has been marked as “*guilty*”
> and return “*ECANCELED*” if so.
>
> - Introduce a new IOCTL to let UMD query *vram_lost_counter*:
>
> This way, UMD can also block the app from submitting. Like @Nicolai
> mentioned, we can cache one copy of *vram_lost_counter* when enumerating
> the physical device, and deny all
>
> gl-contexts from submitting if the queried counter is bigger than the
> one cached in the physical device (looks a little overkill to me, but
> easy to implement).
>
> UMD can also return an error to the app when creating a gl-context if
> the currently queried *vram_lost_counter* is bigger than the one cached
> in the physical device.
>
Okay. I already have a patch for this; please review it if you haven't
already done so.

Regards,
Christian.

> BTW: I realized that a gl-context is a little different from the
> kernel’s context, because for the kernel a BO is not related to a
> context but only to an FD, while in UMD a BO belongs to a gl-context,
>
> so blocking submission in the UMD layer is also needed, although KMD
> will do its job as the bottom line.
>
> - Basically, “vram_lost_counter” is exposed by the kernel to let UMD
> take control of the robustness extension feature; it will be UMD’s
> call how to act, and KMD only denies “guilty” contexts from submitting.
>
> Need your feedback, thx
>
> We’d better get the TDR feature landed ASAP
>
> BR Monk
>



* RE: TDR and VRAM lost handling in KMD:
From: Liu, Monk @ 2017-10-11  8:15 UTC (permalink / raw)
  To: Koenig, Christian, Haehnle, Nicolai, Olsak, Marek, Deucher, Alexander
  Cc: Ramirez, Alejandro, amd-gfx@lists.freedesktop.org,
	Filipas, Mario, Ding, Pixel, Li, Bingley, Jiang, Jerry (SW)



2. Set the timed-out job’s fence error status to “ETIME”
No, as I already explained, ETIME is for synchronous operations.

In other words, returning ETIME from the wait IOCTL would mean that the waiting itself has somehow timed out, but not the job we waited for.

Please use ECANCELED here as well, or some other error code when we find that we need to distinguish the timed-out job from the canceled ones (probably a good idea, but I'm not sure).

[ML] I’m okay if you insist not to use ETIME.


3. Find the entity/ctx behind this job, and set this ctx as “guilty”
Not sure. Do we want to set the whole context as guilty or just the entity?

Setting the whole context as guilty sounds racy to me.

BTW: We should use a different name than "guilty", maybe just "bool canceled;" ?

[ML] I think context is better than entity, because, for example, if you only block entity_0 of a context and allow entity_N to run, the dependencies between entities are broken (e.g. page-table updates in
the SDMA entity pass but the gfx submit in the GFX entity is blocked; that doesn’t make sense to me).
We’d better either block the whole context or not at all…



5. Kick out all jobs in this “guilty” ctx’s KFIFO queue, and set all their fence status to “ECANCELED”
Setting ECANCELED should be ok. But I think we should do this when we try to run the jobs and not during GPU reset.

[ML] Without deep thought and experiment, I’m not sure about the difference between them, but kicking them out in the gpu_reset routine is more efficient.
Otherwise you need to check the context/entity guilty flag in the run_job routine… and you need to do it for every context/entity. I don’t see why
we don’t just kick all of them out in the gpu_reset stage…



9-b) Iterate over all living ctx, and set all ctx as “guilty” since VRAM loss actually ruins all VRAM contents
No, that should rather be done by comparing the counters. Iterating over all contexts is way too much overhead.

[ML] Because I want to keep the KMS IOCTL rules clean: they don’t need to differentiate VRAM lost or not, they are only interested in whether the context is guilty or not, and block
submission for guilty ones.

Can you give more details of your idea, ideally the detailed implementation in cs_submit? I want to see how you want to block submission without checking a context guilty flag.



9-c) Kick out all jobs in all ctx’s KFIFO queues, and set all their fence status to “ECANCELED”
Yes and no, that should be done when we try to run the jobs and not during GPU reset.

[ML] Again, kicking them out in the GPU reset routine is highly efficient; otherwise you need a check on every job in run_job().
Besides, can you illustrate the detailed implementation?



Yes and no, dma_fence_get_status() is some specific handling for sync_file debugging (no idea why that made it into the common fence code).

It was replaced by putting the error code directly into the fence, so just reading that one after waiting should be ok.

Maybe we should fix dma_fence_get_status() to do the right thing for this?

[ML] Yeah, that’s too confusing; the name really sounds like the one I want to use, so we should change it…
But looking into the implementation, I don’t see why we cannot use it; it also finally returns fence->error.





* Re: TDR and VRAM lost handling in KMD:
From: Haehnle, Nicolai @ 2017-10-11  8:40 UTC (permalink / raw)
  To: Liu, Monk, Koenig, Christian, Olsak, Marek, Deucher, Alexander
  Cc: Ramirez, Alejandro, amd-gfx@lists.freedesktop.org,
	Filipas, Mario, Ding, Pixel, Li, Bingley, Jiang, Jerry (SW)



From a Mesa perspective, this almost all sounds reasonable to me.


On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), so it's reasonable to use it. However, it does not make sense to mark idle contexts as "guilty" just because VRAM is lost. VRAM lost is a perfect example where the driver should report context lost to applications with the "innocent" flag for contexts that were idle at the time of reset. The only context(s) that should be reported as "guilty" (or perhaps "unknown" in some cases) are the ones that were executing at the time of reset.


On whether the whole context is marked as guilty from a user space perspective, it would simply be nice for user space to get consistent answers. It would be a bit odd if we could e.g. succeed in submitting an SDMA job after a GFX job was rejected. This would point in favor of marking the entire context as guilty (although that could happen lazily instead of at reset time). On the other hand, if that's too big a burden for the kernel implementation I'm sure we can live without it.


Cheers,

Nicolai


* RE: TDR and VRAM lost handling in KMD:
From: Liu, Monk @ 2017-10-11  8:48 UTC (permalink / raw)
  To: Haehnle, Nicolai, Koenig, Christian, Olsak, Marek, Deucher, Alexander
  Cc: Ramirez, Alejandro, amd-gfx@lists.freedesktop.org,
	Filipas, Mario, Ding, Pixel, Li, Bingley, Jiang, Jerry (SW)





On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), so it's reasonable to use it. However, it does not make sense to mark idle contexts as "guilty" just because VRAM is lost. VRAM lost is a perfect example where the driver should report context lost to applications with the "innocent" flag for contexts that were idle at the time of reset. The only context(s) that should be reported as "guilty" (or perhaps "unknown" in some cases) are the ones that were executing at the time of reset.

ML: KMD marking all contexts as guilty is so that we can unify our IOCTL behavior: e.g. the IOCTLs only block the “guilty” context, with no need to worry about the vram-lost-counter anymore; that’s an implementation style. I don’t think it is related to the UMD layer:
KMD isn’t aware of the UMD gl-context, so UMD can implement its own “guilty” gl-context if you want.

If KMD doesn’t mark all ctx as guilty after VRAM loss, can you illustrate what rule KMD should obey for the check in a KMS IOCTL like cs_submit? Let’s see which way is better.



* Re: TDR and VRAM lost handling in KMD:
From: Nicolai Hähnle @ 2017-10-11  8:59 UTC (permalink / raw)
  To: Liu, Monk, Koenig, Christian, Olsak, Marek, Deucher, Alexander
  Cc: Ramirez, Alejandro, amd-gfx@lists.freedesktop.org,
	Filipas, Mario, Ding, Pixel, Li, Bingley, Jiang, Jerry (SW)

On 11.10.2017 10:48, Liu, Monk wrote:
> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), so 
> it's reasonable to use it. However, it /does not/ make sense to mark 
> idle contexts as "guilty" just because VRAM is lost. VRAM lost is a 
> perfect example where the driver should report context lost to 
> applications with the "innocent" flag for contexts that were idle at the 
> time of reset. The only context(s) that should be reported as "guilty" 
> (or perhaps "unknown" in some cases) are the ones that were executing at 
> the time of reset.
> 
> ML: KMD marking all contexts as guilty is so that we can unify our
> IOCTL behavior: e.g. the IOCTLs only block the “guilty” context, with
> no need to worry about the vram-lost-counter anymore; that’s an
> implementation style. I don’t think it is related to the UMD layer:
> 
> KMD isn’t aware of the UMD gl-context, so UMD can implement its own
> “guilty” gl-context if you want.

Well, to some extent this is just semantics, but it helps to keep the 
terminology consistent.

Most importantly, please keep the AMDGPU_CTX_OP_QUERY_STATE uapi in 
mind: this returns one of AMDGPU_CTX_{GUILTY,INNOCENT,UNKNOWN}_RESET, 
and it must return "innocent" for contexts that are only lost due to 
VRAM loss without being otherwise involved in the timeout that led to 
the reset.

The point is that in the places where you used "guilty" it would be 
better to use "context lost", and then further differentiate between 
guilty/innocent context lost based on the details of what happened.


> If KMD doesn’t mark all ctx as guilty after VRAM loss, can you
> illustrate what rule KMD should obey for the check in a KMS IOCTL like
> cs_submit? Let’s see which way is better.

if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter))
     return -ECANCELED;

Plus similar logic for AMDGPU_CTX_OP_QUERY_STATE.

Yes, it's one additional check in cs_submit. If you're worried about 
that (and Christian's concerns about possible issues with walking over 
all contexts are addressed), I suppose you could just store a per-context

   unsigned context_reset_status;

instead of a `bool guilty`. Its value would start out as 0 
(AMDGPU_CTX_NO_RESET) and would be set to the correct value during reset.
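
A sketch of that (the field placement is an assumption; the AMDGPU_CTX_* values are the existing uapi):

    struct amdgpu_ctx {
            /* ... existing members ... */
            unsigned reset_status;  /* starts at AMDGPU_CTX_NO_RESET (0);
                                     * set to GUILTY/INNOCENT/UNKNOWN_RESET
                                     * during reset */
    };

    /* Both the cs_submit check and AMDGPU_CTX_OP_QUERY_STATE read it: */
    static int amdgpu_ctx_check_reset(struct amdgpu_ctx *ctx)
    {
            return ctx->reset_status == AMDGPU_CTX_NO_RESET ? 0 : -ECANCELED;
    }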

Cheers,
Nicolai


> 
> *From:*Haehnle, Nicolai
> *Sent:* Wednesday, October 11, 2017 4:41 PM
> *To:* Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian 
> <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; Deucher, 
> Alexander <Alexander.Deucher@amd.com>
> *Cc:* amd-gfx@lists.freedesktop.org; Ding, Pixel <Pixel.Ding@amd.com>; 
> Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, Bingley 
> <Bingley.Li@amd.com>; Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; 
> Filipas, Mario <Mario.Filipas@amd.com>
> *Subject:* Re: TDR and VRAM lost handling in KMD:
> 
>  From a Mesa perspective, this almost all sounds reasonable to me.
> 
> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), so 
> it's reasonable to use it. However, it /does not/ make sense to mark 
> idle contexts as "guilty" just because VRAM is lost. VRAM lost is a 
> perfect example where the driver should report context lost to 
> applications with the "innocent" flag for contexts that were idle at the 
> time of reset. The only context(s) that should be reported as "guilty" 
> (or perhaps "unknown" in some cases) are the ones that were executing at 
> the time of reset.
> 
> On whether the whole context is marked as guilty from a user space 
> perspective, it would simply be nice for user space to get consistent 
> answers. It would be a bit odd if we could e.g. succeed in submitting an 
> SDMA job after a GFX job was rejected. This would point in favor of 
> marking the entire context as guilty (although that could happen lazily 
> instead of at reset time). On the other hand, if that's too big a burden 
> for the kernel implementation I'm sure we can live without it.
> 
> Cheers,
> 
> Nicolai
> 
> ------------------------------------------------------------------------
> 
> *From:*Liu, Monk
> *Sent:* Wednesday, October 11, 2017 10:15:40 AM
> *To:* Koenig, Christian; Haehnle, Nicolai; Olsak, Marek; Deucher, Alexander
> *Cc:* amd-gfx@lists.freedesktop.org 
> <mailto:amd-gfx@lists.freedesktop.org>; Ding, Pixel; Jiang, Jerry (SW); 
> Li, Bingley; Ramirez, Alejandro; Filipas, Mario
> *Subject:* RE: TDR and VRAM lost handling in KMD:
> 
> 1.Set its fence error status to “*ETIME*”,
> 
> No, as I already explained ETIME is for synchronous operation.
> 
> In other words when we return ETIME from the wait IOCTL it would mean 
> that the waiting has somehow timed out, but not the job we waited for.
> 
> Please use ECANCELED as well or some other error code when we find that 
> we need to distinct the timedout job from the canceled ones (probably a 
> good idea, but I'm not sure).
> 
> [ML] I’m okay if you insist not to use ETIME
> 
> 1.Find the entity/ctx behind this job, and set this ctx as “*guilty*”
> 
> Not sure. Do we want to set the whole context as guilty or just the entity?
> 
> Setting the whole contexts as guilty sounds racy to me.
> 
> BTW: We should use a different name than "guilty", maybe just "bool 
> canceled;" ?
> 
> [ML] I think context is better than entity, because for example if you 
> only block entity_0 of context and allow entity_N run, that means the 
> dependency between entities are broken (e.g. page table updates in
> 
> Sdma entity pass but gfx submit in GFX entity blocked, not make sense to me)
> 
> We’d better either block the whole context or let not…
> 
> 1.Kick out all jobs in this “guilty” ctx’s KFIFO queue, and set all 
> their fence status to “*ECANCELED*”
> 
> Setting ECANCELED should be ok. But I think we should do this when we 
> try to run the jobs and not during GPU reset.
> 
> [ML] without deep thought and expritment, I’m not sure the difference 
> between them, but kick it out in gpu_reset routine is more efficient,
> 
> Otherwise you need to check context/entity guilty flag in run_job 
> routine …and you need to it for every context/entity, I don’t see why
> 
> We don’t just kickout all of them in gpu_reset stage ….
> 
> a)Iterate over all living ctx, and set all ctx as “*guilty*” since VRAM 
> lost actually ruins all VRAM contents
> 
> No, that shouldn't be done by comparing the counters. Iterating over all 
> contexts is way to much overhead.
> 
> [ML] because I want to make KMS IOCTL rules clean, like they don’t need 
> to differentiate VRAM lost or not, they only interested in if the 
> context is guilty or not, and block
> 
> Submit for guilty ones.
> 
> *Can you give more details of your idea? And better the detail implement 
> in cs_submit, I want to see how you want to block submit without 
> checking context guilty flag*
> 
> a)Kick out all jobs in all ctx’s KFIFO queue, and set all their fence 
> status to “*ECANCELDED*”
> 
> Yes and no, that should be done when we try to run the jobs and not 
> during GPU reset.
> 
> [ML] again, kicking them out in the gpu reset routine is highly efficient; 
> otherwise you need to check every job in run_job()
> 
> Besides, can you illustrate the detailed implementation?
> 
> Yes and no, dma_fence_get_status() is some specific handling for 
> sync_file debugging (no idea why that made it into the common fence code).
> 
> It was replaced by putting the error code directly into the fence, so 
> just reading that one after waiting should be ok.
> 
> Maybe we should fix dma_fence_get_status() to do the right thing for this?
> 
> [ML] yeah, that’s too confusing; the name really sounds like the one I 
> want to use, we should change it…
> 
> *But looking into the implementation, I don’t see why we cannot use it? 
> It also finally returns fence->error *
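> 
> For illustration only (a minimal sketch against the existing dma-fence 
> API, i.e. dma_fence_wait_timeout() and the fence->error field, not 
> actual driver code), the wait path could then look like:
> 
>     long r = dma_fence_wait_timeout(fence, true, timeout);
>     if (r > 0 && fence->error)
>             /* fence signaled, but the job behind it carried an
>              * error set at reset time, e.g. -ECANCELED */
>             return fence->error;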
> 
> *From:*Koenig, Christian
> *Sent:* Wednesday, October 11, 2017 3:21 PM
> *To:* Liu, Monk <Monk.Liu@amd.com <mailto:Monk.Liu@amd.com>>; Haehnle, 
> Nicolai <Nicolai.Haehnle@amd.com <mailto:Nicolai.Haehnle@amd.com>>; 
> Olsak, Marek <Marek.Olsak@amd.com <mailto:Marek.Olsak@amd.com>>; 
> Deucher, Alexander <Alexander.Deucher@amd.com 
> <mailto:Alexander.Deucher@amd.com>>
> *Cc:* amd-gfx@lists.freedesktop.org 
> <mailto:amd-gfx@lists.freedesktop.org>; Ding, Pixel <Pixel.Ding@amd.com 
> <mailto:Pixel.Ding@amd.com>>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com 
> <mailto:Jerry.Jiang@amd.com>>; Li, Bingley <Bingley.Li@amd.com 
> <mailto:Bingley.Li@amd.com>>; Ramirez, Alejandro 
> <Alejandro.Ramirez@amd.com <mailto:Alejandro.Ramirez@amd.com>>; Filipas, 
> Mario <Mario.Filipas@amd.com <mailto:Mario.Filipas@amd.com>>
> *Subject:* Re: TDR and VRAM lost handling in KMD:
> 
> See inline:
> 
> On 11.10.2017 at 07:33, Liu, Monk wrote:
> 
>     Hi Christian & Nicolai,
> 
>     We need to achieve some agreements on what should MESA/UMD do and
>     what should KMD do, *please give your comments with “okay” or “No”
>     and your idea on below items,*
> 
>     • When a job timed out (set from the lockup_timeout kernel parameter),
>     what should KMD do in the TDR routine:
> 
>     1.Update adev->*gpu_reset_counter*, and stop scheduler first,
>     (*gpu_reset_counter* is used to force vm flush after GPU reset, out
>     of this thread’s scope so no more discussion on it)
> 
> Okay.
> 
>     2.Set its fence error status to “*ETIME*”,
> 
> No, as I already explained ETIME is for synchronous operation.
> 
> In other words when we return ETIME from the wait IOCTL it would mean 
> that the waiting has somehow timed out, but not the job we waited for.
> 
> Please use ECANCELED as well, or some other error code, when we find that 
> we need to distinguish the timed-out job from the canceled ones (probably 
> a good idea, but I'm not sure).
> 
>     3.Find the entity/ctx behind this job, and set this ctx as “*guilty*”
> 
> Not sure. Do we want to set the whole context as guilty or just the entity?
> 
> Setting the whole context as guilty sounds racy to me.
> 
> BTW: We should use a different name than "guilty", maybe just "bool 
> canceled;" ?
> 
>     4.Kick out this job from scheduler’s mirror list, so this job won’t
>     get re-scheduled to ring anymore.
> 
> Okay.
> 
>     5.Kick out all jobs in this “guilty” ctx’s KFIFO queue, and set all
>     their fence status to “*ECANCELED*”
> 
> Setting ECANCELED should be ok. But I think we should do this when we 
> try to run the jobs and not during GPU reset.
> 
>     6.Force signal all fences that get kicked out by the above two
>     steps, *otherwise UMD will block forever if waiting on those fences*
> 
> Okay.
> 
>     7.Do gpu reset, which can be some callbacks to let bare-metal and
>     SR-IOV implement in their favored style
> 
> Okay.
> 
>     8.After reset, KMD needs to be aware of whether VRAM lost happened or
>     not; bare-metal can implement some function to judge, while for SR-IOV
>     I prefer to read it from the GIM side (for the initial version we
>     consider it’s always VRAM lost, until the GIM side change is aligned)
> 
> Okay.
> 
>     9.If VRAM lost is not hit, continue; otherwise:
> 
>     a)Update adev->*vram_lost_counter*,
> 
> Okay.
> 
>     b)Iterate over all living ctx, and set all ctx as “*guilty*” since
>     VRAM lost actually ruins all VRAM contents
> 
> No, that shouldn't be done by comparing the counters. Iterating over all 
> contexts is way too much overhead.
> 
>     c)Kick out all jobs in all ctx’s KFIFO queue, and set all their
>     fence status to “*ECANCELED*”
> 
> Yes and no, that should be done when we try to run the jobs and not 
> during GPU reset.
> 
>     10.Do GTT recovery and VRAM page tables/entries recovery (optional,
>     do we need it ???)
> 
> Yes, that is still needed. As Nicolai explained we can't be sure that 
> VRAM is still 100% correct even when it isn't cleared.
> 
>     11.Re-schedule all JOBs remaining in the mirror list to the ring again
>     and restart the scheduler (for the VRAM lost case, no JOB will be
>     re-scheduled)
> 
> Okay.
> 
>     • For cs_wait() IOCTL:
> 
>     After it finds the fence signaled, it should check with
>     *“dma_fence_get_status”* to see if there is an error there,
> 
>     and return the error status of the fence
> 
> Yes and no, dma_fence_get_status() is some specific handling for 
> sync_file debugging (no idea why that made it into the common fence code).
> 
> It was replaced by putting the error code directly into the fence, so 
> just reading that one after waiting should be ok.
> 
> Maybe we should fix dma_fence_get_status() to do the right thing for this?
> 
>     • For cs_wait_fences() IOCTL:
> 
>     Similar to the above approach
> 
>     • For cs_submit() IOCTL:
> 
>     It needs to check whether the current ctx has been marked as “*guilty*”
>     and return “*ECANCELED*” if so
> 
>     • Introduce a new IOCTL to let UMD query *vram_lost_counter*:
> 
>     This way, UMD can also block the app from submitting. Like @Nicolai
>     mentioned, we can cache one copy of *vram_lost_counter* when
>     enumerating the physical device, and deny all
> 
>     gl-contexts from submitting if the counter queried is bigger than the
>     one cached in the physical device (looks a little overkill to me, but
>     easy to implement)
> 
>     UMD can also return an error to the APP when creating a gl-context if
>     the currently queried *vram_lost_counter* is bigger than the one
>     cached in the physical device.
> 
> Okay. Already have a patch for this, please review that one if you 
> haven't already done so.
> 
> Regards,
> Christian.
> 
>     BTW: I realized that a gl-context is a little different from the
>     kernel’s context, because for the kernel a BO is not related to a
>     context but only to an FD, while in UMD a BO has a backing
> 
>     gl-context, so blocking submission in the UMD layer is also needed,
>     although KMD will do its job as the bottom line
> 
>     • Basically “vram_lost_counter” is exposed by the kernel to let UMD
>     take control of the robustness extension feature; it will be UMD’s
>     call to make, KMD only denies “guilty” contexts from submitting
> 
>     Need your feedback, thx
> 
>     We’d better get the TDR feature landed ASAP
> 
>     BR Monk
> 

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: TDR and VRAM lost handling in KMD:
       [not found]                 ` <BLUPR12MB0449287A92DF8D3EB30BE6A6844A0-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  2017-10-11  8:59                   ` Nicolai Hähnle
@ 2017-10-11  9:02                   ` Christian König
       [not found]                     ` <7a7a1830-5457-ea68-44dc-f88eb1e0a8fe-5C7GfCeVMHo@public.gmane.org>
  1 sibling, 1 reply; 23+ messages in thread
From: Christian König @ 2017-10-11  9:02 UTC (permalink / raw)
  To: Liu, Monk, Haehnle, Nicolai, Olsak, Marek, Deucher, Alexander
  Cc: Ramirez, Alejandro, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	Filipas, Mario, Ding, Pixel, Li, Bingley, Jiang, Jerry (SW)


[-- Attachment #1.1: Type: text/plain, Size: 15195 bytes --]

> [ML] I think context is better than entity, because for example if you 
> only block entity_0 of a context and allow entity_N to run, the 
> dependencies between entities are broken (e.g. page table updates in the 
> SDMA entity pass but the gfx submit in the GFX entity is blocked, which 
> makes no sense to me)
>
> We’d better either block the whole context or not at all…
>
Page table updates are not part of any context.

So I think the only thing we can do is to mark the entity as not 
scheduled any more.

> 5.Kick out all jobs in this “guilty” ctx’s KFIFO queue, and set all 
> their fence status to “*ECANCELED*”
>
> Setting ECANCELED should be ok. But I think we should do this when we 
> try to run the jobs and not during GPU reset.
>
> [ML] without deep thought and experiment, I’m not sure of the difference 
> between them, but kicking it out in the gpu_reset routine is more efficient,
>
I really don't think so. Kicking them out during gpu_reset sounds racy 
to me once more.

And marking them canceled when we try to run them has the clear 
advantage that all dependencies are met first.
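
To illustrate, a rough sketch against today's amd_sched structures 
("guilty" being the hypothetical per-entity flag we are discussing, 
not existing code):

    /* scheduler main loop: cancel instead of running the job */
    if (READ_ONCE(entity->guilty)) {
            dma_fence_set_error(&sched_job->s_fence->finished, -ECANCELED);
            amd_sched_fence_finished(sched_job->s_fence);
    } else {
            fence = sched->ops->run_job(sched_job);
    }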

> ML: KMD marks all contexts as guilty because that way we can unify 
> our IOCTL behavior: e.g. the IOCTLs only block a “guilty” context, no 
> need to worry about the vram-lost-counter anymore; that’s an 
> implementation style. I don’t think it is related to the UMD layer,
>
I don't think that this is a good idea. Instead, when you want to unify 
the behavior, we should use the vram_lost_counter as the marker for the 
guilty context.
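
Something like this (again just a sketch), deriving the state from the 
counter instead of storing a flag on every context:

    bool ctx_lost = ctx->vram_lost_counter !=
                    atomic_read(&adev->vram_lost_counter);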

Regards,
Christian.

On 11.10.2017 at 10:48, Liu, Monk wrote:
>
> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), so 
> it's reasonable to use it. However, it /does not/ make sense to mark 
> idle contexts as "guilty" just because VRAM is lost. VRAM lost is a 
> perfect example where the driver should report context lost to 
> applications with the "innocent" flag for contexts that were idle at 
> the time of reset. The only context(s) that should be reported as 
> "guilty" (or perhaps "unknown" in some cases) are the ones that were 
> executing at the time of reset.
>
> ML: KMD marks all contexts as guilty because that way we can unify 
> our IOCTL behavior: e.g. the IOCTLs only block a “guilty” context, no 
> need to worry about the vram-lost-counter anymore; that’s an 
> implementation style. I don’t think it is related to the UMD layer,
>
> For UMD, the gl-context isn’t known to KMD, so UMD can implement its 
> own “guilty” gl-context if it wants.
>
> If KMD doesn’t mark all ctx as guilty after VRAM lost, can you 
> illustrate what rule KMD should obey to check in a KMS IOCTL like 
> cs_submit?? Let’s see which way is better
>
> *From:*Haehnle, Nicolai
> *Sent:* Wednesday, October 11, 2017 4:41 PM
> *To:* Liu, Monk <Monk.Liu-5C7GfCeVMHo@public.gmane.org>; Koenig, Christian 
> <Christian.Koenig-5C7GfCeVMHo@public.gmane.org>; Olsak, Marek <Marek.Olsak-5C7GfCeVMHo@public.gmane.org>; 
> Deucher, Alexander <Alexander.Deucher-5C7GfCeVMHo@public.gmane.org>
> *Cc:* amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org; Ding, Pixel <Pixel.Ding-5C7GfCeVMHo@public.gmane.org>; 
> Jiang, Jerry (SW) <Jerry.Jiang-5C7GfCeVMHo@public.gmane.org>; Li, Bingley 
> <Bingley.Li-5C7GfCeVMHo@public.gmane.org>; Ramirez, Alejandro <Alejandro.Ramirez-5C7GfCeVMHo@public.gmane.org>; 
> Filipas, Mario <Mario.Filipas-5C7GfCeVMHo@public.gmane.org>
> *Subject:* Re: TDR and VRAM lost handling in KMD:
>
> From a Mesa perspective, this almost all sounds reasonable to me.
>
> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), so 
> it's reasonable to use it. However, it /does not/ make sense to mark 
> idle contexts as "guilty" just because VRAM is lost. VRAM lost is a 
> perfect example where the driver should report context lost to 
> applications with the "innocent" flag for contexts that were idle at 
> the time of reset. The only context(s) that should be reported as 
> "guilty" (or perhaps "unknown" in some cases) are the ones that were 
> executing at the time of reset.
>
> On whether the whole context is marked as guilty from a user space 
> perspective, it would simply be nice for user space to get consistent 
> answers. It would be a bit odd if we could e.g. succeed in submitting 
> an SDMA job after a GFX job was rejected. This would point in favor of 
> marking the entire context as guilty (although that could happen 
> lazily instead of at reset time). On the other hand, if that's too big 
> a burden for the kernel implementation I'm sure we can live without it.
>
> Cheers,
>
> Nicolai
>
> ------------------------------------------------------------------------
> 
> [snip: earlier messages quoted in full]



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: TDR and VRAM lost handling in KMD:
       [not found]                     ` <7a7a1830-5457-ea68-44dc-f88eb1e0a8fe-5C7GfCeVMHo@public.gmane.org>
@ 2017-10-11  9:16                       ` Nicolai Hähnle
  2017-10-11  9:27                       ` Liu, Monk
  1 sibling, 0 replies; 23+ messages in thread
From: Nicolai Hähnle @ 2017-10-11  9:16 UTC (permalink / raw)
  To: Christian König, Liu, Monk, Olsak, Marek, Deucher, Alexander
  Cc: Ramirez, Alejandro, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	Filipas, Mario, Ding, Pixel, Li, Bingley, Jiang, Jerry (SW)

On 11.10.2017 11:02, Christian König wrote:
>> 1.Kick out all jobs in this “guilty” ctx’s KFIFO queue, and set all 
>> their fence status to “*ECANCELED*”
>>
>> Setting ECANCELED should be ok. But I think we should do this when we 
>> try to run the jobs and not during GPU reset.
>>
>> [ML] without deep thought and experiment, I’m not sure of the difference 
>> between them, but kicking it out in the gpu_reset routine is more efficient,
>>
> I really don't think so. Kicking them out during gpu_reset sounds racy 
> to me once more.
> 
> And marking them canceled when we try to run them has the clear 
> advantage that all dependencies are meet first.

This makes sense to me as well.

It raises a vaguely related question: What happens to jobs whose 
dependencies were canceled? I believe we currently don't check those 
errors, so we might execute them anyway if their contexts were 
unaffected by the reset. There's a risk that the job will hang due to 
stale data.
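
For illustration, a conservative variant (purely a sketch against the 
scheduler's dependency callback) could propagate the error instead of 
running the job:

    struct dma_fence *dep = sched->ops->dependency(sched_job);
    if (dep && dep->error)
            /* a dependency was canceled; carry its error forward */
            dma_fence_set_error(&sched_job->s_fence->finished, dep->error);

Whether we actually want that is exactly the question above.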

I don't think it's a huge risk in practice today because we don't have a 
lot of buffer sharing between applications, but it's something to think 
through at some point. In a way, canceling out of an abundance of 
caution may be a bad idea because it could kill a compositor's task by 
being overly conservative.

Cheers,
Nicolai

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: TDR and VRAM lost handling in KMD:
       [not found]                     ` <28d64011-fd90-07fb-d95d-48286ecbdcc5-5C7GfCeVMHo@public.gmane.org>
@ 2017-10-11  9:18                       ` Liu, Monk
       [not found]                         ` <BLUPR12MB044914F3A7B5D3D316481A7A844A0-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Liu, Monk @ 2017-10-11  9:18 UTC (permalink / raw)
  To: Haehnle, Nicolai, Koenig, Christian, Olsak, Marek, Deucher, Alexander
  Cc: Ramirez, Alejandro, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	Filipas, Mario, Ding, Pixel, Li, Bingley, Jiang, Jerry (SW)

Let's keep it simple: when VRAM lost hits, what's the action for amdgpu_ctx_query()/AMDGPU_CTX_OP_QUERY_STATE on other contexts (not the one that triggered the gpu hang) after VRAM lost? Do you mean we return -ENODEV to UMD?

In cs_submit, with VRAM lost hit, if we don't mark all contexts as "guilty", how do we block them from submitting? Can you show some implementation?

BTW: the "guilty" here is a new member I want to add to the context; it is not related to the AMDGPU_CTX_OP_QUERY_STATE U/K interface.
Looks like I need to unify them so there is only one place to mark guilty or not


BR Monk

-----Original Message-----
From: Haehnle, Nicolai 
Sent: Wednesday, October 11, 2017 5:00 PM
To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>
Cc: amd-gfx@lists.freedesktop.org; Ding, Pixel <Pixel.Ding@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, Bingley <Bingley.Li@amd.com>; Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; Filipas, Mario <Mario.Filipas@amd.com>
Subject: Re: TDR and VRAM lost handling in KMD:

On 11.10.2017 10:48, Liu, Monk wrote:
> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), so 
> it's reasonable to use it. However, it /does not/ make sense to mark 
> idle contexts as "guilty" just because VRAM is lost. VRAM lost is a 
> perfect example where the driver should report context lost to 
> applications with the "innocent" flag for contexts that were idle at 
> the time of reset. The only context(s) that should be reported as "guilty"
> (or perhaps "unknown" in some cases) are the ones that were executing 
> at the time of reset.
> 
> ML: KMD marks all contexts as guilty because that way we can unify
> our IOCTL behavior: e.g. the IOCTLs only block a "guilty" context, no
> need to worry about the vram-lost-counter anymore; that's an
> implementation style. I don't think it is related to the UMD layer,
> 
> For UMD, the gl-context isn't known to KMD, so UMD can implement its
> own "guilty" gl-context if it wants.

Well, to some extent this is just semantics, but it helps to keep the terminology consistent.

Most importantly, please keep the AMDGPU_CTX_OP_QUERY_STATE uapi in 
mind: this returns one of AMDGPU_CTX_{GUILTY,INNOCENT,UNKNOWN}_RESET, 
and it must return "innocent" for contexts that are only lost due to 
VRAM lost without being otherwise involved in the timeout that led to 
the reset.

The point is that in the places where you used "guilty" it would be better to use "context lost", and then further differentiate between guilty/innocent context lost based on the details of what happened.


> If KMD doesn't mark all ctx as guilty after VRAM lost, can you 
> illustrate what rule KMD should obey to check in a KMS IOCTL like 
> cs_submit?? Let's see which way is better

if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter))
     return -ECANCELED;

Plus similar logic for AMDGPU_CTX_OP_QUERY_STATE.

Yes, it's one additional check in cs_submit. If you're worried about that (and Christian's concerns about possible issues with walking over all contexts are addressed), I suppose you could just store a per-context

   unsigned context_reset_status;

instead of a `bool guilty`. Its value would start out as 0
(AMDGPU_CTX_NO_RESET) and would be set to the correct value during reset.
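
Roughly, as a sketch only (using the field proposed above together with 
the vram_lost_counter comparison; untested):

    /* amdgpu_ctx_query(), AMDGPU_CTX_OP_QUERY_STATE */
    if (ctx->context_reset_status == AMDGPU_CTX_NO_RESET &&
        ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter))
            out->state.reset_status = AMDGPU_CTX_INNOCENT_RESET;
    else
            out->state.reset_status = ctx->context_reset_status;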

Cheers,
Nicolai


> 
> [snip: earlier messages quoted in full]


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: TDR and VRAM lost handling in KMD:
       [not found]                         ` <BLUPR12MB044914F3A7B5D3D316481A7A844A0-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2017-10-11  9:25                           ` Nicolai Hähnle
       [not found]                             ` <6876e153-7e98-66ac-7338-5601cf83c633-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Nicolai Hähnle @ 2017-10-11  9:25 UTC (permalink / raw)
  To: Liu, Monk, Koenig, Christian, Olsak, Marek, Deucher, Alexander
  Cc: Ramirez, Alejandro, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	Filipas, Mario, Ding, Pixel, Li, Bingley, Jiang, Jerry (SW)

On 11.10.2017 11:18, Liu, Monk wrote:
> Let's keep it simple: when VRAM lost hits, what's the action for amdgpu_ctx_query()/AMDGPU_CTX_OP_QUERY_STATE on other contexts (not the one that triggered the gpu hang) after VRAM lost? Do you mean we return -ENODEV to UMD?

It should successfully return AMDGPU_CTX_INNOCENT_RESET.


> In cs_submit, with VRAM lost hit, if we don't mark all contexts as "guilty", how do we block them from submitting? Can you show some implementation?

if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter))
     return -ECANCELED;

(where ctx->vram_lost_counter is initialized at context creation time 
and never changed afterwards)
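
For example (sketch; in amdgpu_ctx_init() or wherever the context is 
created):

    ctx->vram_lost_counter = atomic_read(&adev->vram_lost_counter);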


> BTW: the "guilty" here is a new member I want to add to the context; it is not related to the AMDGPU_CTX_OP_QUERY_STATE U/K interface.
> Looks like I need to unify them so there is only one place to mark guilty or not

Right, the AMDGPU_CTX_OP_QUERY_STATE handling needs to be made 
consistent with the rest.

Cheers,
Nicolai


> 
> 
> BR Monk
> 
> -----Original Message-----
> From: Haehnle, Nicolai
> Sent: Wednesday, October 11, 2017 5:00 PM
> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>
> Cc: amd-gfx@lists.freedesktop.org; Ding, Pixel <Pixel.Ding@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, Bingley <Bingley.Li@amd.com>; Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; Filipas, Mario <Mario.Filipas@amd.com>
> Subject: Re: TDR and VRAM lost handling in KMD:
> 
> On 11.10.2017 10:48, Liu, Monk wrote:
>> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), so
>> it's reasonable to use it. However, it /does not/ make sense to mark
>> idle contexts as "guilty" just because VRAM is lost. VRAM lost is a
>> perfect example where the driver should report context lost to
>> applications with the "innocent" flag for contexts that were idle at
>> the time of reset. The only context(s) that should be reported as "guilty"
>> (or perhaps "unknown" in some cases) are the ones that were executing
>> at the time of reset.
>>
>> ML: KMD mark all contexts as guilty is because that way we can unify
>> our IOCTL behavior: e.g. for IOCTL only block “guilty”context , no
>> need to worry about vram-lost-counter anymore, that’s a implementation
>> style. I don’t think it is related with UMD layer,
>>
>> For UMD the gl-context isn’t aware of by KMD, so UMD can implement it
>> own “guilty” gl-context if you want.
> 
> Well, to some extent this is just semantics, but it helps to keep the terminology consistent.
> 
> Most importantly, please keep the AMDGPU_CTX_OP_QUERY_STATE uapi in
> mind: this returns one of AMDGPU_CTX_{GUILTY,INNOCENT,UNKNOWN}_RECENT,
> and it must return "innocent" for contexts that are only lost due to VRAM lost without being otherwise involved in the timeout that lead to the reset.
> 
> The point is that in the places where you used "guilty" it would be better to use "context lost", and then further differentiate between guilty/innocent context lost based on the details of what happened.
> 
> 
>> If KMD doesn’t mark all ctx as guilty after VRAM lost, can you
>> illustrate what rule KMD should obey to check in KMS IOCTL like
>> cs_sumbit ?? let’s see which way better
> 
> if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter))
>       return -ECANCELED;
> 
> Plus similar logic for AMDGPU_CTX_OP_QUERY_STATE.
> 
> Yes, it's one additional check in cs_submit. If you're worried about that (and Christian's concerns about possible issues with walking over all contexts are addressed), I suppose you could just store a per-context
> 
>     unsigned context_reset_status;
> 
> instead of a `bool guilty`. Its value would start out as 0
> (AMDGPU_CTX_NO_RESET) and would be set to the correct value during reset.
> 
> Cheers,
> Nicolai
> 
> 
>>
>> *From:*Haehnle, Nicolai
>> *Sent:* Wednesday, October 11, 2017 4:41 PM
>> *To:* Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian
>> <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>;
>> Deucher, Alexander <Alexander.Deucher@amd.com>
>> *Cc:* amd-gfx@lists.freedesktop.org; Ding, Pixel <Pixel.Ding@amd.com>;
>> Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, Bingley
>> <Bingley.Li@amd.com>; Ramirez, Alejandro <Alejandro.Ramirez@amd.com>;
>> Filipas, Mario <Mario.Filipas@amd.com>
>> *Subject:* Re: TDR and VRAM lost handling in KMD:
>>
>>   From a Mesa perspective, this almost all sounds reasonable to me.
>>
>> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), so
>> it's reasonable to use it. However, it /does not/ make sense to mark
>> idle contexts as "guilty" just because VRAM is lost. VRAM lost is a
>> perfect example where the driver should report context lost to
>> applications with the "innocent" flag for contexts that were idle at
>> the time of reset. The only context(s) that should be reported as "guilty"
>> (or perhaps "unknown" in some cases) are the ones that were executing
>> at the time of reset.
>>
>> On whether the whole context is marked as guilty from a user space
>> perspective, it would simply be nice for user space to get consistent
>> answers. It would be a bit odd if we could e.g. succeed in submitting
>> an SDMA job after a GFX job was rejected. This would point in favor of
>> marking the entire context as guilty (although that could happen
>> lazily instead of at reset time). On the other hand, if that's too big
>> a burden for the kernel implementation I'm sure we can live without it.
>>
>> Cheers,
>>
>> Nicolai
>>
>> ----------------------------------------------------------------------
>> --
>>
>> *From:*Liu, Monk
>> *Sent:* Wednesday, October 11, 2017 10:15:40 AM
>> *To:* Koenig, Christian; Haehnle, Nicolai; Olsak, Marek; Deucher,
>> Alexander
>> *Cc:* amd-gfx@lists.freedesktop.org
>> <mailto:amd-gfx@lists.freedesktop.org>; Ding, Pixel; Jiang, Jerry
>> (SW); Li, Bingley; Ramirez, Alejandro; Filipas, Mario
>> *Subject:* RE: TDR and VRAM lost handling in KMD:
>>
>> 1.Set its fence error status to “*ETIME*”,
>>
>> No, as I already explained ETIME is for synchronous operation.
>>
>> In other words when we return ETIME from the wait IOCTL it would mean
>> that the waiting has somehow timed out, but not the job we waited for.
>>
>> Please use ECANCELED as well or some other error code when we find
>> that we need to distinct the timedout job from the canceled ones
>> (probably a good idea, but I'm not sure).
>>
>> [ML] I’m okay if you insist not to use ETIME
>>
>> 1.Find the entity/ctx behind this job, and set this ctx as “*guilty*”
>>
>> Not sure. Do we want to set the whole context as guilty or just the entity?
>>
>> Setting the whole contexts as guilty sounds racy to me.
>>
>> BTW: We should use a different name than "guilty", maybe just "bool
>> canceled;" ?
>>
>> [ML] I think context is better than entity, because for example if you
>> only block entity_0 of context and allow entity_N run, that means the
>> dependency between entities are broken (e.g. page table updates in
>>
>> Sdma entity pass but gfx submit in GFX entity blocked, not make sense
>> to me)
>>
>> We’d better either block the whole context or let not…
>>
>> 1.Kick out all jobs in this “guilty” ctx’s KFIFO queue, and set all
>> their fence status to “*ECANCELED*”
>>
>> Setting ECANCELED should be ok. But I think we should do this when we
>> try to run the jobs and not during GPU reset.
>>
>> [ML] without deep thought and expritment, I’m not sure the difference
>> between them, but kick it out in gpu_reset routine is more efficient,
>>
>> Otherwise you need to check context/entity guilty flag in run_job
>> routine …and you need to it for every context/entity, I don’t see why
>>
>> We don’t just kickout all of them in gpu_reset stage ….
>>
>> a)Iterate over all living ctx, and set all ctx as “*guilty*” since
>> VRAM lost actually ruins all VRAM contents
>>
>> No, that shouldn't be done by comparing the counters. Iterating over
>> all contexts is way to much overhead.
>>
>> [ML] because I want to make KMS IOCTL rules clean, like they don’t
>> need to differentiate VRAM lost or not, they only interested in if the
>> context is guilty or not, and block
>>
>> Submit for guilty ones.
>>
>> *Can you give more details of your idea? And better the detail
>> implement in cs_submit, I want to see how you want to block submit
>> without checking context guilty flag*
>>
>> a)Kick out all jobs in all ctx’s KFIFO queue, and set all their fence
>> status to “*ECANCELDED*”
>>
>> Yes and no, that should be done when we try to run the jobs and not
>> during GPU reset.
>>
>> [ML] again, kicking out them in gpu reset routine is high efficient,
>> otherwise you need check on every job in run_job()
>>
>> Besides, can you illustrate the detail implementation ?
>>
>> Yes and no, dma_fence_get_status() is some specific handling for
>> sync_file debugging (no idea why that made it into the common fence code).
>>
>> It was replaced by putting the error code directly into the fence, so
>> just reading that one after waiting should be ok.
>>
>> Maybe we should fix dma_fence_get_status() to do the right thing for this?
>>
>> [ML] yeah, that’s too confusing, the name sound really the one I want
>> to use, we should change it…
>>
>> *But look into the implement, I don**’t see why we cannot use it ? it
>> also finally return the fence->error *
>>
>> *From:*Koenig, Christian
>> *Sent:* Wednesday, October 11, 2017 3:21 PM
>> *To:* Liu, Monk <Monk.Liu@amd.com <mailto:Monk.Liu@amd.com>>; Haehnle,
>> Nicolai <Nicolai.Haehnle@amd.com <mailto:Nicolai.Haehnle@amd.com>>;
>> Olsak, Marek <Marek.Olsak@amd.com <mailto:Marek.Olsak@amd.com>>;
>> Deucher, Alexander <Alexander.Deucher@amd.com
>> <mailto:Alexander.Deucher@amd.com>>
>> *Cc:* amd-gfx@lists.freedesktop.org
>> <mailto:amd-gfx@lists.freedesktop.org>; Ding, Pixel
>> <Pixel.Ding@amd.com <mailto:Pixel.Ding@amd.com>>; Jiang, Jerry (SW)
>> <Jerry.Jiang@amd.com <mailto:Jerry.Jiang@amd.com>>; Li, Bingley
>> <Bingley.Li@amd.com <mailto:Bingley.Li@amd.com>>; Ramirez, Alejandro
>> <Alejandro.Ramirez@amd.com <mailto:Alejandro.Ramirez@amd.com>>;
>> Filipas, Mario <Mario.Filipas@amd.com <mailto:Mario.Filipas@amd.com>>
>> *Subject:* Re: TDR and VRAM lost handling in KMD:
>>
>> See inline:
>>
>> Am 11.10.2017 um 07:33 schrieb Liu, Monk:
>>
>>      Hi Christian & Nicolai,
>>
>>      We need to achieve some agreements on what should MESA/UMD do and
>>      what should KMD do, *please give your comments with “okay” or “No”
>>      and your idea on below items,*
>>
>>      ?When a job timed out (set from lockup_timeout kernel parameter),
>>      What KMD should do in TDR routine :
>>
>>      1.Update adev->*gpu_reset_counter*, and stop scheduler first,
>>      (*gpu_reset_counter* is used to force vm flush after GPU reset, out
>>      of this thread’s scope so no more discussion on it)
>>
>> Okay.
>>
>>      2.Set its fence error status to “*ETIME*”,
>>
>> No, as I already explained ETIME is for synchronous operation.
>>
>> In other words when we return ETIME from the wait IOCTL it would mean
>> that the waiting has somehow timed out, but not the job we waited for.
>>
>> Please use ECANCELED as well or some other error code when we find
>> that we need to distinct the timedout job from the canceled ones
>> (probably a good idea, but I'm not sure).
>>
>>      3.Find the entity/ctx behind this job, and set this ctx as “*guilty*”
>>
>> Not sure. Do we want to set the whole context as guilty or just the entity?
>>
>> Setting the whole contexts as guilty sounds racy to me.
>>
>> BTW: We should use a different name than "guilty", maybe just "bool
>> canceled;" ?
>>
>>      4.Kick out this job from scheduler’s mirror list, so this job won’t
>>      get re-scheduled to ring anymore.
>>
>> Okay.
>>
>>      5.Kick out all jobs in this “guilty” ctx’s KFIFO queue, and set all
>>      their fence status to “*ECANCELED*”
>>
>> Setting ECANCELED should be ok. But I think we should do this when we
>> try to run the jobs and not during GPU reset.
>>
>>      6.Force signal all fences that get kicked out by above two
>>      steps,*otherwise UMD will block forever if waiting on those
>> fences*
>>
>> Okay.
>>
>>      7.Do gpu reset, which is can be some callbacks to let bare-metal and
>>      SR-IOV implement with their favor style
>>
>> Okay.
>>
>>      8.After reset, KMD need to aware if the VRAM lost happens or not,
>>      bare-metal can implement some function to judge, while for SR-IOV I
>>      prefer to read it from GIM side (for initial version we consider
>>      it’s always VRAM lost, till GIM side change aligned)
>>
>> Okay.
>>
>>      9.If VRAM lost not hit, continue, otherwise:
>>
>>      a)Update adev->*vram_lost_counter*,
>>
>> Okay.
>>
>>      b)Iterate over all living ctx, and set all ctx as “*guilty*” since
>>      VRAM lost actually ruins all VRAM contents
>>
>> No, that shouldn't be done by comparing the counters. Iterating over
>> all contexts is way to much overhead.
>>
>>      c)Kick out all jobs in all ctx’s KFIFO queue, and set all their
>>      fence status to “*ECANCELDED*”
>>
>> Yes and no, that should be done when we try to run the jobs and not
>> during GPU reset.
>>
>>      10.Do GTT recovery and VRAM page tables/entries recovery (optional,
>>      do we need it ???)
>>
>> Yes, that is still needed. As Nicolai explained we can't be sure that
>> VRAM is still 100% correct even when it isn't cleared.
>>
>>      11.Re-schedule all JOBs remains in mirror list to ring again and
>>      restart scheduler (for VRAM lost case, no JOB will re-scheduled)
>>
>> Okay.
>>
>>      ?For cs_wait() IOCTL:
>>
>>      After it found fence signaled, it should check with
>>      *“dma_fence_get_status” *to see if there is error there,
>>
>>      And return the error status of fence
>>
>> Yes and no, dma_fence_get_status() is some specific handling for
>> sync_file debugging (no idea why that made it into the common fence code).
>>
>> It was replaced by putting the error code directly into the fence, so
>> just reading that one after waiting should be ok.
>>
>> Maybe we should fix dma_fence_get_status() to do the right thing for this?
>>
>>      ?For cs_wait_fences() IOCTL:
>>
>>      Similar with above approach
>>
>>      ?For cs_submit() IOCTL:
>>
>>      It need to check if current ctx been marked as “*guilty*” and return
>>      “*ECANCELED*” if so
>>
>>      ?Introduce a new IOCTL to let UMD query *vram_lost_counter*:
>>
>>      This way, UMD can also block the app from submitting. Like @Nicolai
>>      mentioned, we can cache one copy of *vram_lost_counter* when
>>      enumerating the physical device, and deny all
>>
>>      gl-contexts from submitting if the queried counter is bigger than the
>>      one cached in the physical device (looks a little overkill to me, but
>>      easy to implement).
>>
>>      UMD can also return an error to the APP when creating a gl-context if the
>>      currently queried *vram_lost_counter* is bigger than the one cached in
>>      the physical device.
>>
>> Okay. Already have a patch for this, please review that one if you
>> haven't already done so.
>>
>> Regards,
>> Christian.
>>
>>      BTW: I realized that a gl-context is a little different from the kernel’s
>>      context, because for the kernel a BO is not related to a context but only
>>      to an FD, while in UMD a BO has a backend
>>
>>      gl-context, so blocking submission in the UMD layer is also needed, although
>>      KMD will do its job as the bottom line
>>
>>      - Basically, “vram_lost_counter” is exposed by the kernel to let UMD take
>>      control of the robustness extension feature; it will be UMD’s call to
>>      move, and KMD only denies the “guilty” context from submitting
>>
>>      Need your feedback, thx
>>
>>      We’d better make the TDR feature land ASAP
>>
>>      BR Monk
>>
> 

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


* RE: TDR and VRAM lost handling in KMD:
       [not found]                     ` <7a7a1830-5457-ea68-44dc-f88eb1e0a8fe-5C7GfCeVMHo@public.gmane.org>
  2017-10-11  9:16                       ` Nicolai Hähnle
@ 2017-10-11  9:27                       ` Liu, Monk
  1 sibling, 0 replies; 23+ messages in thread
From: Liu, Monk @ 2017-10-11  9:27 UTC (permalink / raw)
  To: Koenig, Christian, Haehnle, Nicolai, Olsak, Marek, Deucher, Alexander
  Cc: Ramirez, Alejandro, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	Filipas, Mario, Ding, Pixel, Li, Bingley, Jiang, Jerry (SW)


[-- Attachment #1.1: Type: text/plain, Size: 15673 bytes --]

ML: KMD marks all contexts as guilty because that way we can unify our IOCTL behavior: e.g. the IOCTL only blocks the “guilty” context and no longer needs to worry about the vram-lost-counter; that’s an implementation style. I don’t think it is related to the UMD layer.
I don't think that this is a good idea. Instead, when you want to unify the behavior, we should use the vram_lost_counter as the marker for the guilty context.


[ML] Say we only block at the entity level; then we have two rules (see the sketch below):

1)      we block submission for a “guilty” entity in the run_job routine (and mark the entity as guilty in gpu_reset)

2)      for an innocent entity, we still need to check vram_lost_counter in cs_submit, correct?
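
A minimal sketch of how those two rules could look in cs_submit (the entity "guilty" member and the per-ctx vram_lost_counter copy are the proposed additions discussed in this thread, not existing fields):

    /* rule 1: reject jobs from an entity flagged guilty at reset time */
    if (entity->guilty)
            return -ECANCELED;
    /* rule 2: an innocent entity still lost its VRAM content */
    if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter))
            return -ECANCELED;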

Besides: Nicolai reminded me that we have amdgpu_ctx_query() to worry about …
when we mark some entity as “guilty”, do we need to mark the context behind it as “AMDGPU_CTX_GUILTY_RESET”?

This is something I didn’t think of … I just ignored it …

BR Monk
From: Koenig, Christian
Sent: Wednesday, October 11, 2017 5:03 PM
To: Liu, Monk <Monk.Liu@amd.com>; Haehnle, Nicolai <Nicolai.Haehnle@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>
Cc: amd-gfx@lists.freedesktop.org; Ding, Pixel <Pixel.Ding@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, Bingley <Bingley.Li@amd.com>; Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; Filipas, Mario <Mario.Filipas@amd.com>
Subject: Re: TDR and VRAM lost handling in KMD:

[ML] I think context is better than entity, because for example if you only block entity_0 of a context and allow entity_N to run, that means the dependencies between entities are broken (e.g. page table updates in the
SDMA entity pass but the gfx submit in the GFX entity is blocked, which doesn’t make sense to me)
We’d better either block the whole context or not at all…
Page table updates are not part of any context.

So I think the only thing we can do is to mark the entity as not scheduled any more.
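
Roughly like this in the scheduler, assuming the proposed new flag on the entity (a sketch, not the final patch):

    /* when selecting the next entity to schedule from */
    if (entity->guilty)
            continue; /* never pick jobs from this entity again */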



5.        Kick out all jobs in this “guilty” ctx’s KFIFO queue, and set all their fence status to “ECANCELED”
Setting ECANCELED should be ok. But I think we should do this when we try to run the jobs and not during GPU reset.

[ML] Without deep thought and experiment, I’m not sure about the difference between them, but kicking it out in the gpu_reset routine is more efficient.
I really don't think so. Kicking them out during gpu_reset sounds racy to me once more.

And marking them canceled when we try to run them has the clear advantage that all dependencies are met first.
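
Something along these lines in the scheduler main loop (a sketch only; dma_fence_set_error() stores the error in the fence so waiters can read it after the forced signal):

    /* all dependencies of the job are met at this point */
    if (job_canceled(job)) { /* hypothetical check, e.g. guilty ctx/entity */
            dma_fence_set_error(&job->s_fence->finished, -ECANCELED);
            /* signal the finished fence instead of pushing the job to the ring */
            continue;
    }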


ML: KMD marks all contexts as guilty because that way we can unify our IOCTL behavior: e.g. the IOCTL only blocks the “guilty” context and no longer needs to worry about the vram-lost-counter; that’s an implementation style. I don’t think it is related to the UMD layer.
I don't think that this is a good idea. Instead, when you want to unify the behavior, we should use the vram_lost_counter as the marker for the guilty context.

Regards,
Christian.

On 11.10.2017 at 10:48, Liu, Monk wrote:



On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), so it's reasonable to use it. However, it does not make sense to mark idle contexts as "guilty" just because VRAM is lost. VRAM lost is a perfect example where the driver should report context lost to applications with the "innocent" flag for contexts that were idle at the time of reset. The only context(s) that should be reported as "guilty" (or perhaps "unknown" in some cases) are the ones that were executing at the time of reset.

ML: KMD marks all contexts as guilty because that way we can unify our IOCTL behavior: e.g. the IOCTL only blocks the “guilty” context and no longer needs to worry about the vram-lost-counter; that’s an implementation style. I don’t think it is related to the UMD layer.
The gl-context isn’t known to KMD, so UMD can implement its own “guilty” gl-context if you want.

If KMD doesn’t mark all ctx as guilty after VRAM lost, can you illustrate what rule KMD should obey for the check in a KMS IOCTL like cs_submit? Let’s see which way is better


From: Haehnle, Nicolai
Sent: Wednesday, October 11, 2017 4:41 PM
To: Liu, Monk <Monk.Liu@amd.com><mailto:Monk.Liu@amd.com>; Koenig, Christian <Christian.Koenig@amd.com><mailto:Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com><mailto:Marek.Olsak@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com><mailto:Alexander.Deucher@amd.com>
Cc: amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>; Ding, Pixel <Pixel.Ding@amd.com><mailto:Pixel.Ding@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com><mailto:Jerry.Jiang@amd.com>; Li, Bingley <Bingley.Li@amd.com><mailto:Bingley.Li@amd.com>; Ramirez, Alejandro <Alejandro.Ramirez@amd.com><mailto:Alejandro.Ramirez@amd.com>; Filipas, Mario <Mario.Filipas@amd.com><mailto:Mario.Filipas@amd.com>
Subject: Re: TDR and VRAM lost handling in KMD:


From a Mesa perspective, this almost all sounds reasonable to me.



On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), so it's reasonable to use it. However, it does not make sense to mark idle contexts as "guilty" just because VRAM is lost. VRAM lost is a perfect example where the driver should report context lost to applications with the "innocent" flag for contexts that were idle at the time of reset. The only context(s) that should be reported as "guilty" (or perhaps "unknown" in some cases) are the ones that were executing at the time of reset.


On whether the whole context is marked as guilty from a user space perspective, it would simply be nice for user space to get consistent answers. It would be a bit odd if we could e.g. succeed in submitting an SDMA job after a GFX job was rejected. This would point in favor of marking the entire context as guilty (although that could happen lazily instead of at reset time). On the other hand, if that's too big a burden for the kernel implementation I'm sure we can live without it.



Cheers,

Nicolai

________________________________
From: Liu, Monk
Sent: Wednesday, October 11, 2017 10:15:40 AM
To: Koenig, Christian; Haehnle, Nicolai; Olsak, Marek; Deucher, Alexander
Cc: amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>; Ding, Pixel; Jiang, Jerry (SW); Li, Bingley; Ramirez, Alejandro; Filipas, Mario
Subject: RE: TDR and VRAM lost handling in KMD:


2.        Set its fence error status to “ETIME”,
No, as I already explained ETIME is for synchronous operation.

In other words when we return ETIME from the wait IOCTL it would mean that the waiting has somehow timed out, but not the job we waited for.

Please use ECANCELED as well or some other error code when we find that we need to distinguish the timed-out job from the canceled ones (probably a good idea, but I'm not sure).

[ML] I’m okay if you insist not to use ETIME


3.        Find the entity/ctx behind this job, and set this ctx as “guilty”
Not sure. Do we want to set the whole context as guilty or just the entity?

Setting the whole context as guilty sounds racy to me.

BTW: We should use a different name than "guilty", maybe just "bool canceled;" ?

[ML] I think context is better than entity, because for example if you only block entity_0 of a context and allow entity_N to run, that means the dependencies between entities are broken (e.g. page table updates in the
SDMA entity pass but the gfx submit in the GFX entity is blocked, which doesn’t make sense to me)
We’d better either block the whole context or not at all…



5.        Kick out all jobs in this “guilty” ctx’s KFIFO queue, and set all their fence status to “ECANCELED”
Setting ECANCELED should be ok. But I think we should do this when we try to run the jobs and not during GPU reset.

[ML] Without deep thought and experiment, I’m not sure about the difference between them, but kicking it out in the gpu_reset routine is more efficient;
otherwise you need to check the context/entity guilty flag in the run_job routine … and you need to do it for every context/entity. I don’t see why
we don’t just kick all of them out in the gpu_reset stage ….



b)       Iterate over all living ctx, and set all ctx as “guilty” since VRAM lost actually ruins all VRAM contents
No, that shouldn't be done by comparing the counters. Iterating over all contexts is way too much overhead.

[ML] Because I want to keep the KMS IOCTL rules clean: they don’t need to differentiate whether VRAM was lost or not, they are only interested in whether the context is guilty or not, and block
submission for guilty ones.

Can you give more details of your idea? Even better, the detailed implementation in cs_submit; I want to see how you want to block submission without checking a context guilty flag



c)       Kick out all jobs in all ctx’s KFIFO queue, and set all their fence status to “ECANCELED”
Yes and no, that should be done when we try to run the jobs and not during GPU reset.

[ML] Again, kicking them out in the gpu reset routine is highly efficient; otherwise you need to check every job in run_job().
Besides, can you illustrate the detailed implementation?



Yes and no, dma_fence_get_status() is some specific handling for sync_file debugging (no idea why that made it into the common fence code).

It was replaced by putting the error code directly into the fence, so just reading that one after waiting should be ok.

Maybe we should fix dma_fence_get_status() to do the right thing for this?

[ML] Yeah, that’s too confusing; the name really sounds like the one I want to use, we should change it…
But looking into the implementation, I don’t see why we cannot use it? It also finally returns fence->error
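
For reference, a sketch of the wait path being discussed; once a signaled fence carries an error, dma_fence_get_status() returns it, which is the same value as fence->error:

    long r = dma_fence_wait_timeout(fence, true, timeout);
    if (r > 0) { /* fence signaled */
            int status = dma_fence_get_status(fence);
            if (status < 0)
                    return status; /* e.g. -ECANCELED set during reset */
    }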




From: Koenig, Christian
Sent: Wednesday, October 11, 2017 3:21 PM
To: Liu, Monk <Monk.Liu@amd.com<mailto:Monk.Liu@amd.com>>; Haehnle, Nicolai <Nicolai.Haehnle@amd.com<mailto:Nicolai.Haehnle@amd.com>>; Olsak, Marek <Marek.Olsak@amd.com<mailto:Marek.Olsak@amd.com>>; Deucher, Alexander <Alexander.Deucher@amd.com<mailto:Alexander.Deucher@amd.com>>
Cc: amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>; Ding, Pixel <Pixel.Ding@amd.com<mailto:Pixel.Ding@amd.com>>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com<mailto:Jerry.Jiang@amd.com>>; Li, Bingley <Bingley.Li@amd.com<mailto:Bingley.Li@amd.com>>; Ramirez, Alejandro <Alejandro.Ramirez@amd.com<mailto:Alejandro.Ramirez@amd.com>>; Filipas, Mario <Mario.Filipas@amd.com<mailto:Mario.Filipas@amd.com>>
Subject: Re: TDR and VRAM lost handling in KMD:

See inline:

On 11.10.2017 at 07:33, Liu, Monk wrote:
Hi Christian & Nicolai,

We need to reach some agreement on what MESA/UMD should do and what KMD should do; please give your comments with “okay” or “No” and your ideas on the items below,


-  When a job times out (set from the lockup_timeout kernel parameter), what should KMD do in the TDR routine:


1.        Update adev->gpu_reset_counter, and stop scheduler first, (gpu_reset_counter is used to force vm flush after GPU reset, out of this thread’s scope so no more discussion on it)
Okay.



2.        Set its fence error status to “ETIME”,
No, as I already explained ETIME is for synchronous operation.

In other words when we return ETIME from the wait IOCTL it would mean that the waiting has somehow timed out, but not the job we waited for.

Please use ECANCELED as well or some other error code when we find that we need to distinguish the timed-out job from the canceled ones (probably a good idea, but I'm not sure).



3.        Find the entity/ctx behind this job, and set this ctx as “guilty”
Not sure. Do we want to set the whole context as guilty or just the entity?

Setting the whole context as guilty sounds racy to me.

BTW: We should use a different name than "guilty", maybe just "bool canceled;" ?



4.        Kick out this job from scheduler’s mirror list, so this job won’t get re-scheduled to ring anymore.
Okay.



5.        Kick out all jobs in this “guilty” ctx’s KFIFO queue, and set all their fence status to “ECANCELED”
Setting ECANCELED should be ok. But I think we should do this when we try to run the jobs and not during GPU reset.



6.        Force signal all fences that get kicked out by above two steps, otherwise UMD will block forever if waiting on those fences
Okay.



7.        Do gpu reset, which can be some callbacks to let bare-metal and SR-IOV implement in their favored style
Okay.



8.        After reset, KMD needs to be aware whether the VRAM lost happened or not; bare-metal can implement some function to judge, while for SR-IOV I prefer to read it from the GIM side (for the initial version we consider it’s always VRAM lost, until the GIM side change is aligned)
Okay.



9.        If VRAM lost is not hit, continue; otherwise:

a)       Update adev->vram_lost_counter,
Okay.



b)       Iterate over all living ctx, and set all ctx as “guilty” since VRAM lost actually ruins all VRAM contents
No, that shouldn't be done by comparing the counters. Iterating over all contexts is way too much overhead.



c)        Kick out all jobs in all ctx’s KFIFO queue, and set all their fence status to “ECANCELED”
Yes and no, that should be done when we try to run the jobs and not during GPU reset.



10.     Do GTT recovery and VRAM page tables/entries recovery (optional, do we need it ???)
Yes, that is still needed. As Nicolai explained we can't be sure that VRAM is still 100% correct even when it isn't cleared.



11.     Re-schedule all jobs remaining in the mirror list to the ring again and restart the scheduler (in the VRAM lost case, no job will be re-scheduled)
Okay.
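
Putting the steps agreed above together, a very rough outline of the TDR routine (the helper names are placeholders for illustration, not actual amdgpu functions):

    atomic_inc(&adev->gpu_reset_counter);          /* step 1 */
    stop_all_schedulers(adev);                     /* step 1 */
    cancel_bad_job_and_guilty_ctx(adev, bad_job);  /* steps 2-6 */
    do_asic_reset(adev);                           /* step 7, backend callback */
    if (asic_vram_lost(adev))                      /* step 8 */
            atomic_inc(&adev->vram_lost_counter);  /* step 9a */
    recover_gtt_and_page_tables(adev);             /* step 10 */
    resubmit_mirror_list_and_restart(adev);        /* step 11 */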




-  For cs_wait() IOCTL:
After it finds the fence signaled, it should check with “dma_fence_get_status” to see if there is an error there,
and return the error status of the fence
Yes and no, dma_fence_get_status() is some specific handling for sync_file debugging (no idea why that made it into the common fence code).

It was replaced by putting the error code directly into the fence, so just reading that one after waiting should be ok.

Maybe we should fix dma_fence_get_status() to do the right thing for this?




-  For cs_wait_fences() IOCTL:
Similar to the above approach


-  For cs_submit() IOCTL:
It needs to check whether the current ctx has been marked as “guilty” and return “ECANCELED” if so


-  Introduce a new IOCTL to let UMD query vram_lost_counter:
This way, UMD can also block the app from submitting. Like @Nicolai mentioned, we can cache one copy of vram_lost_counter when enumerating the physical device, and deny all
gl-contexts from submitting if the queried counter is bigger than the one cached in the physical device (looks a little overkill to me, but easy to implement).
UMD can also return an error to the APP when creating a gl-context if the currently queried vram_lost_counter is bigger than the one cached in the physical device.
Okay. Already have a patch for this, please review that one if you haven't already done so.

Regards,
Christian.



BTW: I realized that a gl-context is a little different from the kernel’s context, because for the kernel a BO is not related to a context but only to an FD, while in UMD a BO has a backend
gl-context, so blocking submission in the UMD layer is also needed, although KMD will do its job as the bottom line


-  Basically, “vram_lost_counter” is exposed by the kernel to let UMD take control of the robustness extension feature; it will be UMD’s call to move, and KMD only denies the “guilty” context from submitting


Need your feedback, thx

We’d better make the TDR feature land ASAP

BR Monk









[-- Attachment #1.2: Type: text/html, Size: 62940 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


* RE: TDR and VRAM lost handling in KMD:
       [not found]                             ` <6876e153-7e98-66ac-7338-5601cf83c633-5C7GfCeVMHo@public.gmane.org>
@ 2017-10-11  9:41                               ` Liu, Monk
       [not found]                                 ` <BLUPR12MB044907C2C72DD8BEB1D5BE3B844A0-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Liu, Monk @ 2017-10-11  9:41 UTC (permalink / raw)
  To: Haehnle, Nicolai, Koenig, Christian, Olsak, Marek, Deucher, Alexander
  Cc: Ramirez, Alejandro, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	Filipas, Mario, Ding, Pixel, Li, Bingley, Jiang, Jerry (SW)

Okay, let me summarize our whole idea together and see if it works:

1, For cs_submit, always check vram_lost_counter first and reject the submit (return -ECANCELED to UMD) if ctx->vram_lost_counter != adev->vram_lost_counter. That way the vram lost issue can be handled 

2, For cs_submit we still need to check whether the incoming context is "AMDGPU_CTX_GUILTY_RESET" or not, even if we found ctx->vram_lost_counter == adev->vram_lost_counter, and we can reject the submit
if it is "AMDGPU_CTX_GUILTY_RESET", correct?

3, In the gpu_reset() routine, we only mark the hanging job's entity as guilty (so we need to add a new member to the entity structure), and don't kick it out in the gpu_reset() stage, but we need to set the context behind this entity as "AMDGPU_CTX_GUILTY_RESET"
   And if the reset introduces VRAM LOST, we just update adev->vram_lost_counter but *don't* change all entities to guilty, so still only the hanging job's entity is "guilty"
   After some entity is marked as "guilty", we find a way to set the context behind it as AMDGPU_CTX_GUILTY_RESET; because this is the U/K interface, we need to let UMD know that this context is wrong.

4, In the gpu scheduler's run_job() routine, since it only reads the entity, we skip job scheduling once the entity is found to be "guilty" (see the sketch below)
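
A compact sketch of how points 1-4 could fit together (entity->guilty and ctx->reset_status stand for the proposed new members; none of this is existing code):

    /* cs_submit, points 1 and 2 */
    if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter))
            return -ECANCELED;
    if (ctx->reset_status == AMDGPU_CTX_GUILTY_RESET)
            return -ECANCELED;

    /* gpu_reset, point 3: flag only the hanging job's entity and its ctx */
    bad_job->s_entity->guilty = true;
    bad_ctx->reset_status = AMDGPU_CTX_GUILTY_RESET;

    /* scheduler run path, point 4 */
    if (entity->guilty)
            continue; /* skip scheduling jobs from this entity */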


Does the above sound good?



-----Original Message-----
From: Haehnle, Nicolai 
Sent: Wednesday, October 11, 2017 5:26 PM
To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>
Cc: amd-gfx@lists.freedesktop.org; Ding, Pixel <Pixel.Ding@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, Bingley <Bingley.Li@amd.com>; Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; Filipas, Mario <Mario.Filipas@amd.com>
Subject: Re: TDR and VRAM lost handling in KMD:

On 11.10.2017 11:18, Liu, Monk wrote:
> Let's keep it simple: when vram lost hits, what's the action for amdgpu_ctx_query()/AMDGPU_CTX_OP_QUERY_STATE on other contexts (not the one that triggered the gpu hang) after vram lost? Do you mean we return -ENODEV to UMD ?

It should successfully return AMDGPU_CTX_INNOCENT_RESET.


> In cs_submit, with vram lost hit, if we don't mark all contexts as 
> "guilty", how we block its from submitting ? can you show some 
> implement way

if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter))
     return -ECANCELED;

(where ctx->vram_lost_counter is initialized at context creation time and never changed afterwards)
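
For completeness, the matching one-time initialization would be something like this (sketch only):

    /* at context creation, e.g. in amdgpu_ctx_init() */
    ctx->vram_lost_counter = atomic_read(&adev->vram_lost_counter);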


> BTW: the "guilty" here is a new member I want to add to context, it is not related with AMDGPU_CTX_OP_QUERY_STATE UK interface,
> Looks I need to unify them and only one place to mark guilty or not

Right, the AMDGPU_CTX_OP_QUERY_STATE handling needs to be made 
consistent with the rest.

Cheers,
Nicolai


> 
> 
> BR Monk
> 
> -----Original Message-----
> From: Haehnle, Nicolai
> Sent: Wednesday, October 11, 2017 5:00 PM
> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>
> Cc: amd-gfx@lists.freedesktop.org; Ding, Pixel <Pixel.Ding@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, Bingley <Bingley.Li@amd.com>; Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; Filipas, Mario <Mario.Filipas@amd.com>
> Subject: Re: TDR and VRAM lost handling in KMD:
> 
> On 11.10.2017 10:48, Liu, Monk wrote:
>> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), so
>> it's reasonable to use it. However, it /does not/ make sense to mark
>> idle contexts as "guilty" just because VRAM is lost. VRAM lost is a
>> perfect example where the driver should report context lost to
>> applications with the "innocent" flag for contexts that were idle at
>> the time of reset. The only context(s) that should be reported as "guilty"
>> (or perhaps "unknown" in some cases) are the ones that were executing
>> at the time of reset.
>>
>> ML: KMD marks all contexts as guilty because that way we can unify
>> our IOCTL behavior: e.g. the IOCTL only blocks the “guilty” context, no
>> need to worry about the vram-lost-counter anymore; that’s an implementation
>> style. I don’t think it is related to the UMD layer,
>>
>> The gl-context isn’t known to KMD, so UMD can implement its
>> own “guilty” gl-context if you want.
> 
> Well, to some extent this is just semantics, but it helps to keep the terminology consistent.
> 
> Most importantly, please keep the AMDGPU_CTX_OP_QUERY_STATE uapi in
> mind: this returns one of AMDGPU_CTX_{GUILTY,INNOCENT,UNKNOWN}_RESET,
> and it must return "innocent" for contexts that are only lost due to VRAM lost without being otherwise involved in the timeout that led to the reset.
> 
> The point is that in the places where you used "guilty" it would be better to use "context lost", and then further differentiate between guilty/innocent context lost based on the details of what happened.
> 
> 
>> If KMD doesn’t mark all ctx as guilty after VRAM lost, can you
>> illustrate what rule KMD should obey for the check in a KMS IOCTL like
>> cs_submit? Let’s see which way is better
> 
> if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter))
>       return -ECANCELED;
> 
> Plus similar logic for AMDGPU_CTX_OP_QUERY_STATE.
> 
> Yes, it's one additional check in cs_submit. If you're worried about that (and Christian's concerns about possible issues with walking over all contexts are addressed), I suppose you could just store a per-context
> 
>     unsigned context_reset_status;
> 
> instead of a `bool guilty`. Its value would start out as 0
> (AMDGPU_CTX_NO_RESET) and would be set to the correct value during reset.
> 
> Cheers,
> Nicolai
> 
> 
>>
>> *From:*Haehnle, Nicolai
>> *Sent:* Wednesday, October 11, 2017 4:41 PM
>> *To:* Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian
>> <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>;
>> Deucher, Alexander <Alexander.Deucher@amd.com>
>> *Cc:* amd-gfx@lists.freedesktop.org; Ding, Pixel <Pixel.Ding@amd.com>;
>> Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, Bingley
>> <Bingley.Li@amd.com>; Ramirez, Alejandro <Alejandro.Ramirez@amd.com>;
>> Filipas, Mario <Mario.Filipas@amd.com>
>> *Subject:* Re: TDR and VRAM lost handling in KMD:
>>
>>   From a Mesa perspective, this almost all sounds reasonable to me.
>>
>> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), so
>> it's reasonable to use it. However, it /does not/ make sense to mark
>> idle contexts as "guilty" just because VRAM is lost. VRAM lost is a
>> perfect example where the driver should report context lost to
>> applications with the "innocent" flag for contexts that were idle at
>> the time of reset. The only context(s) that should be reported as "guilty"
>> (or perhaps "unknown" in some cases) are the ones that were executing
>> at the time of reset.
>>
>> On whether the whole context is marked as guilty from a user space
>> perspective, it would simply be nice for user space to get consistent
>> answers. It would be a bit odd if we could e.g. succeed in submitting
>> an SDMA job after a GFX job was rejected. This would point in favor of
>> marking the entire context as guilty (although that could happen
>> lazily instead of at reset time). On the other hand, if that's too big
>> a burden for the kernel implementation I'm sure we can live without it.
>>
>> Cheers,
>>
>> Nicolai
>>
>> ----------------------------------------------------------------------
>> --
>>
>> *From:*Liu, Monk
>> *Sent:* Wednesday, October 11, 2017 10:15:40 AM
>> *To:* Koenig, Christian; Haehnle, Nicolai; Olsak, Marek; Deucher,
>> Alexander
>> *Cc:* amd-gfx@lists.freedesktop.org
>> <mailto:amd-gfx@lists.freedesktop.org>; Ding, Pixel; Jiang, Jerry
>> (SW); Li, Bingley; Ramirez, Alejandro; Filipas, Mario
>> *Subject:* RE: TDR and VRAM lost handling in KMD:
>>
>> 2.Set its fence error status to “*ETIME*”,
>>
>> No, as I already explained ETIME is for synchronous operation.
>>
>> In other words when we return ETIME from the wait IOCTL it would mean
>> that the waiting has somehow timed out, but not the job we waited for.
>>
>> Please use ECANCELED as well or some other error code when we find
>> that we need to distinguish the timed-out job from the canceled ones
>> (probably a good idea, but I'm not sure).
>>
>> [ML] I’m okay if you insist not to use ETIME
>>
>> 3.Find the entity/ctx behind this job, and set this ctx as “*guilty*”
>>
>> Not sure. Do we want to set the whole context as guilty or just the entity?
>>
>> Setting the whole context as guilty sounds racy to me.
>>
>> BTW: We should use a different name than "guilty", maybe just "bool
>> canceled;" ?
>>
>> [ML] I think context is better than entity, because for example if you
>> only block entity_0 of a context and allow entity_N to run, that means the
>> dependencies between entities are broken (e.g. page table updates in the
>>
>> SDMA entity pass but the gfx submit in the GFX entity is blocked, which
>> doesn’t make sense to me)
>>
>> We’d better either block the whole context or not at all…
>>
>> 5.Kick out all jobs in this “guilty” ctx’s KFIFO queue, and set all
>> their fence status to “*ECANCELED*”
>>
>> Setting ECANCELED should be ok. But I think we should do this when we
>> try to run the jobs and not during GPU reset.
>>
>> [ML] Without deep thought and experiment, I’m not sure about the difference
>> between them, but kicking it out in the gpu_reset routine is more efficient;
>>
>> otherwise you need to check the context/entity guilty flag in the run_job
>> routine … and you need to do it for every context/entity. I don’t see why
>>
>> we don’t just kick all of them out in the gpu_reset stage ….
>>
>> b)Iterate over all living ctx, and set all ctx as “*guilty*” since
>> VRAM lost actually ruins all VRAM contents
>>
>> No, that shouldn't be done by comparing the counters. Iterating over
>> all contexts is way too much overhead.
>>
>> [ML] Because I want to keep the KMS IOCTL rules clean: they don’t
>> need to differentiate whether VRAM was lost or not, they are only interested in whether the
>> context is guilty or not, and block
>>
>> submission for guilty ones.
>>
>> *Can you give more details of your idea? Even better, the detailed
>> implementation in cs_submit; I want to see how you want to block submission
>> without checking a context guilty flag*
>>
>> c)Kick out all jobs in all ctx’s KFIFO queue, and set all their fence
>> status to “*ECANCELED*”
>>
>> Yes and no, that should be done when we try to run the jobs and not
>> during GPU reset.
>>
>> [ML] Again, kicking them out in the gpu reset routine is highly efficient;
>> otherwise you need to check every job in run_job()
>>
>> Besides, can you illustrate the detailed implementation?
>>
>> Yes and no, dma_fence_get_status() is some specific handling for
>> sync_file debugging (no idea why that made it into the common fence code).
>>
>> It was replaced by putting the error code directly into the fence, so
>> just reading that one after waiting should be ok.
>>
>> Maybe we should fix dma_fence_get_status() to do the right thing for this?
>>
>> [ML] Yeah, that’s too confusing; the name really sounds like the one I want
>> to use, we should change it…
>>
>> *But looking into the implementation, I don’t see why we cannot use it? It
>> also finally returns fence->error*
>>
>> *From:*Koenig, Christian
>> *Sent:* Wednesday, October 11, 2017 3:21 PM
>> *To:* Liu, Monk <Monk.Liu@amd.com <mailto:Monk.Liu@amd.com>>; Haehnle,
>> Nicolai <Nicolai.Haehnle@amd.com <mailto:Nicolai.Haehnle@amd.com>>;
>> Olsak, Marek <Marek.Olsak@amd.com <mailto:Marek.Olsak@amd.com>>;
>> Deucher, Alexander <Alexander.Deucher@amd.com
>> <mailto:Alexander.Deucher@amd.com>>
>> *Cc:* amd-gfx@lists.freedesktop.org
>> <mailto:amd-gfx@lists.freedesktop.org>; Ding, Pixel
>> <Pixel.Ding@amd.com <mailto:Pixel.Ding@amd.com>>; Jiang, Jerry (SW)
>> <Jerry.Jiang@amd.com <mailto:Jerry.Jiang@amd.com>>; Li, Bingley
>> <Bingley.Li@amd.com <mailto:Bingley.Li@amd.com>>; Ramirez, Alejandro
>> <Alejandro.Ramirez@amd.com <mailto:Alejandro.Ramirez@amd.com>>;
>> Filipas, Mario <Mario.Filipas@amd.com <mailto:Mario.Filipas@amd.com>>
>> *Subject:* Re: TDR and VRAM lost handling in KMD:
>>
>> See inline:
>>
>> On 11.10.2017 at 07:33, Liu, Monk wrote:
>>
>>      Hi Christian & Nicolai,
>>
>>      We need to reach some agreement on what MESA/UMD should do and
>>      what KMD should do; *please give your comments with “okay” or “No”
>>      and your ideas on the items below,*
>>
>>      - When a job times out (set from the lockup_timeout kernel parameter),
>>      what should KMD do in the TDR routine:
>>
>>      1.Update adev->*gpu_reset_counter*, and stop scheduler first,
>>      (*gpu_reset_counter* is used to force vm flush after GPU reset, out
>>      of this thread’s scope so no more discussion on it)
>>
>> Okay.
>>
>>      2.Set its fence error status to “*ETIME*”,
>>
>> No, as I already explained ETIME is for synchronous operation.
>>
>> In other words when we return ETIME from the wait IOCTL it would mean
>> that the waiting has somehow timed out, but not the job we waited for.
>>
>> Please use ECANCELED as well or some other error code when we find
>> that we need to distinguish the timed-out job from the canceled ones
>> (probably a good idea, but I'm not sure).
>>
>>      3.Find the entity/ctx behind this job, and set this ctx as “*guilty*”
>>
>> Not sure. Do we want to set the whole context as guilty or just the entity?
>>
>> Setting the whole context as guilty sounds racy to me.
>>
>> BTW: We should use a different name than "guilty", maybe just "bool
>> canceled;" ?
>>
>>      4.Kick out this job from scheduler’s mirror list, so this job won’t
>>      get re-scheduled to ring anymore.
>>
>> Okay.
>>
>>      5.Kick out all jobs in this “guilty” ctx’s KFIFO queue, and set all
>>      their fence status to “*ECANCELED*”
>>
>> Setting ECANCELED should be ok. But I think we should do this when we
>> try to run the jobs and not during GPU reset.
>>
>>      6.Force signal all fences that get kicked out by above two
>>      steps,*otherwise UMD will block forever if waiting on those
>> fences*
>>
>> Okay.
>>
>>      7.Do gpu reset, which can be some callbacks to let bare-metal and
>>      SR-IOV implement in their favored style
>>
>> Okay.
>>
>>      8.After reset, KMD needs to be aware whether the VRAM lost happened or not;
>>      bare-metal can implement some function to judge, while for SR-IOV I
>>      prefer to read it from the GIM side (for the initial version we consider
>>      it’s always VRAM lost, until the GIM side change is aligned)
>>
>> Okay.
>>
>>      9.If VRAM lost is not hit, continue; otherwise:
>>
>>      a)Update adev->*vram_lost_counter*,
>>
>> Okay.
>>
>>      b)Iterate over all living ctx, and set all ctx as “*guilty*” since
>>      VRAM lost actually ruins all VRAM contents
>>
>> No, that shouldn't be done by comparing the counters. Iterating over
>> all contexts is way too much overhead.
>>
>>      c)Kick out all jobs in all ctx’s KFIFO queue, and set all their
>>      fence status to “*ECANCELED*”
>>
>> Yes and no, that should be done when we try to run the jobs and not
>> during GPU reset.
>>
>>      10.Do GTT recovery and VRAM page tables/entries recovery (optional,
>>      do we need it ???)
>>
>> Yes, that is still needed. As Nicolai explained we can't be sure that
>> VRAM is still 100% correct even when it isn't cleared.
>>
>>      11.Re-schedule all jobs remaining in the mirror list to the ring again and
>>      restart the scheduler (in the VRAM lost case, no job will be re-scheduled)
>>
>> Okay.
>>
>>      - For cs_wait() IOCTL:
>>
>>      After it finds the fence signaled, it should check with
>>      *“dma_fence_get_status”* to see if there is an error there,
>>
>>      and return the error status of the fence
>>
>> Yes and no, dma_fence_get_status() is some specific handling for
>> sync_file debugging (no idea why that made it into the common fence code).
>>
>> It was replaced by putting the error code directly into the fence, so
>> just reading that one after waiting should be ok.
>>
>> Maybe we should fix dma_fence_get_status() to do the right thing for this?
>>
>>      - For cs_wait_fences() IOCTL:
>>
>>      Similar to the above approach
>>
>>      - For cs_submit() IOCTL:
>>
>>      It needs to check whether the current ctx has been marked as “*guilty*” and return
>>      “*ECANCELED*” if so
>>
>>      - Introduce a new IOCTL to let UMD query *vram_lost_counter*:
>>
>>      This way, UMD can also block the app from submitting. Like @Nicolai
>>      mentioned, we can cache one copy of *vram_lost_counter* when
>>      enumerating the physical device, and deny all
>>
>>      gl-contexts from submitting if the queried counter is bigger than the
>>      one cached in the physical device (looks a little overkill to me, but
>>      easy to implement).
>>
>>      UMD can also return an error to the APP when creating a gl-context if the
>>      currently queried *vram_lost_counter* is bigger than the one cached in
>>      the physical device.
>>
>> Okay. Already have a patch for this, please review that one if you
>> haven't already done so.
>>
>> Regards,
>> Christian.
>>
>>      BTW: I realized that a gl-context is a little different from the kernel’s
>>      context, because for the kernel a BO is not related to a context but only
>>      to an FD, while in UMD a BO has a backend
>>
>>      gl-context, so blocking submission in the UMD layer is also needed, although
>>      KMD will do its job as the bottom line
>>
>>      - Basically, “vram_lost_counter” is exposed by the kernel to let UMD take
>>      control of the robustness extension feature; it will be UMD’s call to
>>      move, and KMD only denies the “guilty” context from submitting
>>
>>      Need your feedback, thx
>>
>>      We’d better make the TDR feature land ASAP
>>
>>      BR Monk
>>
> 

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


* Re: TDR and VRAM lost handling in KMD:
       [not found]                                 ` <BLUPR12MB044907C2C72DD8BEB1D5BE3B844A0-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2017-10-11 10:14                                   ` Chunming Zhou
       [not found]                                     ` <8c4e849f-9227-12bc-9d2e-3daf60fcd762-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Chunming Zhou @ 2017-10-11 10:14 UTC (permalink / raw)
  To: Liu, Monk, Haehnle, Nicolai, Koenig, Christian, Olsak, Marek,
	Deucher, Alexander
  Cc: Ramirez, Alejandro, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	Filipas, Mario, Ding, Pixel, Li, Bingley, Jiang, Jerry (SW)

Your summary lacks the issue below:

What about jobs already pushed into the scheduler queue when vram is lost?


Regards,
David Zhou
On 2017-10-11 17:41, Liu, Monk wrote:
> Okay, let me summarize our whole idea together and see if it works:
>
> 1, For cs_submit, always check vram_lost_counter first and reject the submit (return -ECANCELED to UMD) if ctx->vram_lost_counter != adev->vram_lost_counter. That way the vram lost issue can be handled
>
> 2, For cs_submit we still need to check whether the incoming context is "AMDGPU_CTX_GUILTY_RESET" or not, even if we found ctx->vram_lost_counter == adev->vram_lost_counter, and we can reject the submit
> if it is "AMDGPU_CTX_GUILTY_RESET", correct?
>
> 3, In the gpu_reset() routine, we only mark the hanging job's entity as guilty (so we need to add a new member to the entity structure), and don't kick it out in the gpu_reset() stage, but we need to set the context behind this entity as "AMDGPU_CTX_GUILTY_RESET"
>    And if the reset introduces VRAM LOST, we just update adev->vram_lost_counter but *don't* change all entities to guilty, so still only the hanging job's entity is "guilty"
>    After some entity is marked as "guilty", we find a way to set the context behind it as AMDGPU_CTX_GUILTY_RESET; because this is the U/K interface, we need to let UMD know that this context is wrong.
>
> 4, In the gpu scheduler's run_job() routine, since it only reads the entity, we skip job scheduling once the entity is found to be "guilty"
>
>
> Does the above sound good?
>
>
>
> -----Original Message-----
> From: Haehnle, Nicolai
> Sent: Wednesday, October 11, 2017 5:26 PM
> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>
> Cc: amd-gfx@lists.freedesktop.org; Ding, Pixel <Pixel.Ding@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, Bingley <Bingley.Li@amd.com>; Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; Filipas, Mario <Mario.Filipas@amd.com>
> Subject: Re: TDR and VRAM lost handling in KMD:
>
> On 11.10.2017 11:18, Liu, Monk wrote:
>> Let's keep it simple: when vram lost hits, what's the action for amdgpu_ctx_query()/AMDGPU_CTX_OP_QUERY_STATE on other contexts (not the one that triggered the gpu hang) after vram lost? Do you mean we return -ENODEV to UMD ?
> It should successfully return AMDGPU_CTX_INNOCENT_RESET.
>
>
>> In cs_submit, with vram lost hit, if we don't mark all contexts as
>> "guilty", how we block its from submitting ? can you show some
>> implement way
> if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter))
>       return -ECANCELED;
>
> (where ctx->vram_lost_counter is initialized at context creation time and never changed afterwards)
>
>
>> BTW: the "guilty" here is a new member I want to add to context, it is not related with AMDGPU_CTX_OP_QUERY_STATE UK interface,
>> Looks I need to unify them and only one place to mark guilty or not
> Right, the AMDGPU_CTX_OP_QUERY_STATE handling needs to be made
> consistent with the rest.
>
> Cheers,
> Nicolai
>
>
>>
>> BR Monk
>>
>> -----Original Message-----
>> From: Haehnle, Nicolai
>> Sent: Wednesday, October 11, 2017 5:00 PM
>> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>
>> Cc: amd-gfx@lists.freedesktop.org; Ding, Pixel <Pixel.Ding@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, Bingley <Bingley.Li@amd.com>; Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; Filipas, Mario <Mario.Filipas@amd.com>
>> Subject: Re: TDR and VRAM lost handling in KMD:
>>
>> On 11.10.2017 10:48, Liu, Monk wrote:
>>> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), so
>>> it's reasonable to use it. However, it /does not/ make sense to mark
>>> idle contexts as "guilty" just because VRAM is lost. VRAM lost is a
>>> perfect example where the driver should report context lost to
>>> applications with the "innocent" flag for contexts that were idle at
>>> the time of reset. The only context(s) that should be reported as "guilty"
>>> (or perhaps "unknown" in some cases) are the ones that were executing
>>> at the time of reset.
>>>
>>> ML: KMD marks all contexts as guilty because that way we can unify
>>> our IOCTL behavior: e.g. the IOCTL only blocks the “guilty” context, no
>>> need to worry about the vram-lost-counter anymore; that’s an implementation
>>> style. I don’t think it is related to the UMD layer,
>>>
>>> The gl-context isn’t known to KMD, so UMD can implement its
>>> own “guilty” gl-context if you want.
>> Well, to some extent this is just semantics, but it helps to keep the terminology consistent.
>>
>> Most importantly, please keep the AMDGPU_CTX_OP_QUERY_STATE uapi in
>> mind: this returns one of AMDGPU_CTX_{GUILTY,INNOCENT,UNKNOWN}_RESET,
>> and it must return "innocent" for contexts that are only lost due to VRAM lost without being otherwise involved in the timeout that led to the reset.
>>
>> The point is that in the places where you used "guilty" it would be better to use "context lost", and then further differentiate between guilty/innocent context lost based on the details of what happened.
>>
>>
>>> If KMD doesn’t mark all ctx as guilty after VRAM lost, can you
>>> illustrate what rule KMD should obey for the check in a KMS IOCTL like
>>> cs_submit? Let’s see which way is better
>> if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter))
>>        return -ECANCELED;
>>
>> Plus similar logic for AMDGPU_CTX_OP_QUERY_STATE.
>>
>> Yes, it's one additional check in cs_submit. If you're worried about that (and Christian's concerns about possible issues with walking over all contexts are addressed), I suppose you could just store a per-context
>>
>>      unsigned context_reset_status;
>>
>> instead of a `bool guilty`. Its value would start out as 0
>> (AMDGPU_CTX_NO_RESET) and would be set to the correct value during reset.
>>
>> Cheers,
>> Nicolai
>>
>>
>>> *From:*Haehnle, Nicolai
>>> *Sent:* Wednesday, October 11, 2017 4:41 PM
>>> *To:* Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian
>>> <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>;
>>> Deucher, Alexander <Alexander.Deucher@amd.com>
>>> *Cc:* amd-gfx@lists.freedesktop.org; Ding, Pixel <Pixel.Ding@amd.com>;
>>> Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, Bingley
>>> <Bingley.Li@amd.com>; Ramirez, Alejandro <Alejandro.Ramirez@amd.com>;
>>> Filipas, Mario <Mario.Filipas@amd.com>
>>> *Subject:* Re: TDR and VRAM lost handling in KMD:
>>>
>>>    From a Mesa perspective, this almost all sounds reasonable to me.
>>>
>>> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), so
>>> it's reasonable to use it. However, it /does not/ make sense to mark
>>> idle contexts as "guilty" just because VRAM is lost. VRAM lost is a
>>> perfect example where the driver should report context lost to
>>> applications with the "innocent" flag for contexts that were idle at
>>> the time of reset. The only context(s) that should be reported as "guilty"
>>> (or perhaps "unknown" in some cases) are the ones that were executing
>>> at the time of reset.
>>>
>>> On whether the whole context is marked as guilty from a user space
>>> perspective, it would simply be nice for user space to get consistent
>>> answers. It would be a bit odd if we could e.g. succeed in submitting
>>> an SDMA job after a GFX job was rejected. This would point in favor of
>>> marking the entire context as guilty (although that could happen
>>> lazily instead of at reset time). On the other hand, if that's too big
>>> a burden for the kernel implementation I'm sure we can live without it.
>>>
>>> Cheers,
>>>
>>> Nicolai
>>>
>>> ----------------------------------------------------------------------
>>> --
>>>
>>> *From:*Liu, Monk
>>> *Sent:* Wednesday, October 11, 2017 10:15:40 AM
>>> *To:* Koenig, Christian; Haehnle, Nicolai; Olsak, Marek; Deucher,
>>> Alexander
>>> *Cc:* amd-gfx@lists.freedesktop.org
>>> <mailto:amd-gfx@lists.freedesktop.org>; Ding, Pixel; Jiang, Jerry
>>> (SW); Li, Bingley; Ramirez, Alejandro; Filipas, Mario
>>> *Subject:* RE: TDR and VRAM lost handling in KMD:
>>>
>>> 2.Set its fence error status to “*ETIME*”,
>>>
>>> No, as I already explained ETIME is for synchronous operation.
>>>
>>> In other words when we return ETIME from the wait IOCTL it would mean
>>> that the waiting has somehow timed out, but not the job we waited for.
>>>
>>> Please use ECANCELED as well or some other error code when we find
>>> that we need to distinguish the timed-out job from the canceled ones
>>> (probably a good idea, but I'm not sure).
>>>
>>> [ML] I’m okay if you insist not to use ETIME
>>>
>>> 3.Find the entity/ctx behind this job, and set this ctx as “*guilty*”
>>>
>>> Not sure. Do we want to set the whole context as guilty or just the entity?
>>>
>>> Setting the whole context as guilty sounds racy to me.
>>>
>>> BTW: We should use a different name than "guilty", maybe just "bool
>>> canceled;" ?
>>>
>>> [ML] I think context is better than entity, because for example if you
>>> only block entity_0 of a context and allow entity_N to run, that means the
>>> dependencies between entities are broken (e.g. page table updates in the
>>>
>>> SDMA entity pass but the gfx submit in the GFX entity is blocked, which
>>> doesn’t make sense to me)
>>>
>>> We’d better either block the whole context or not at all…
>>>
>>> 5.Kick out all jobs in this “guilty” ctx’s KFIFO queue, and set all
>>> their fence status to “*ECANCELED*”
>>>
>>> Setting ECANCELED should be ok. But I think we should do this when we
>>> try to run the jobs and not during GPU reset.
>>>
>>> [ML] Without deep thought and experiment, I’m not sure about the difference
>>> between them, but kicking it out in the gpu_reset routine is more efficient;
>>>
>>> otherwise you need to check the context/entity guilty flag in the run_job
>>> routine … and you need to do it for every context/entity. I don’t see why
>>>
>>> we don’t just kick all of them out in the gpu_reset stage ….
>>>
>>> b)Iterate over all living ctx, and set all ctx as “*guilty*” since
>>> VRAM lost actually ruins all VRAM contents
>>>
>>> No, that shouldn't be done by comparing the counters. Iterating over
>>> all contexts is way too much overhead.
>>>
>>> [ML] Because I want to keep the KMS IOCTL rules clean: they don’t
>>> need to differentiate whether VRAM was lost or not, they are only interested in whether the
>>> context is guilty or not, and block
>>>
>>> submission for guilty ones.
>>>
>>> *Can you give more details of your idea? Even better, the detailed
>>> implementation in cs_submit; I want to see how you want to block submission
>>> without checking a context guilty flag*
>>>
>>> c)Kick out all jobs in all ctx’s KFIFO queue, and set all their fence
>>> status to “*ECANCELED*”
>>>
>>> Yes and no, that should be done when we try to run the jobs and not
>>> during GPU reset.
>>>
>>> [ML] Again, kicking them out in the gpu reset routine is highly efficient;
>>> otherwise you need to check every job in run_job()
>>>
>>> Besides, can you illustrate the detailed implementation?
>>>
>>> Yes and no, dma_fence_get_status() is some specific handling for
>>> sync_file debugging (no idea why that made it into the common fence code).
>>>
>>> It was replaced by putting the error code directly into the fence, so
>>> just reading that one after waiting should be ok.
>>>
>>> Maybe we should fix dma_fence_get_status() to do the right thing for this?
>>>
>>> [ML] Yeah, that’s too confusing; the name really sounds like the one I want
>>> to use, we should change it…
>>>
>>> *But looking into the implementation, I don’t see why we cannot use it? It
>>> also finally returns fence->error*
>>>
>>> *From:*Koenig, Christian
>>> *Sent:* Wednesday, October 11, 2017 3:21 PM
>>> *To:* Liu, Monk <Monk.Liu@amd.com <mailto:Monk.Liu@amd.com>>; Haehnle,
>>> Nicolai <Nicolai.Haehnle@amd.com <mailto:Nicolai.Haehnle@amd.com>>;
>>> Olsak, Marek <Marek.Olsak@amd.com <mailto:Marek.Olsak@amd.com>>;
>>> Deucher, Alexander <Alexander.Deucher@amd.com
>>> <mailto:Alexander.Deucher@amd.com>>
>>> *Cc:* amd-gfx@lists.freedesktop.org
>>> <mailto:amd-gfx@lists.freedesktop.org>; Ding, Pixel
>>> <Pixel.Ding@amd.com <mailto:Pixel.Ding@amd.com>>; Jiang, Jerry (SW)
>>> <Jerry.Jiang@amd.com <mailto:Jerry.Jiang@amd.com>>; Li, Bingley
>>> <Bingley.Li@amd.com <mailto:Bingley.Li@amd.com>>; Ramirez, Alejandro
>>> <Alejandro.Ramirez@amd.com <mailto:Alejandro.Ramirez@amd.com>>;
>>> Filipas, Mario <Mario.Filipas@amd.com <mailto:Mario.Filipas@amd.com>>
>>> *Subject:* Re: TDR and VRAM lost handling in KMD:
>>>
>>> See inline:
>>>
>>> On 11.10.2017 at 07:33, Liu, Monk wrote:
>>>
>>>       Hi Christian & Nicolai,
>>>
>>>       We need to reach some agreement on what MESA/UMD should do and
>>>       what KMD should do; *please give your comments with “okay” or “No”
>>>       and your ideas on the items below,*
>>>
>>>       - When a job times out (set from the lockup_timeout kernel parameter),
>>>       what should KMD do in the TDR routine:
>>>
>>>       1.Update adev->*gpu_reset_counter*, and stop scheduler first,
>>>       (*gpu_reset_counter* is used to force vm flush after GPU reset, out
>>>       of this thread’s scope so no more discussion on it)
>>>
>>> Okay.
>>>
>>>       2.Set its fence error status to “*ETIME*”,
>>>
>>> No, as I already explained ETIME is for synchronous operation.
>>>
>>> In other words when we return ETIME from the wait IOCTL it would mean
>>> that the waiting has somehow timed out, but not the job we waited for.
>>>
>>> Please use ECANCELED as well or some other error code when we find
>>> that we need to distinguish the timed-out job from the canceled ones
>>> (probably a good idea, but I'm not sure).
>>>
>>>       3.Find the entity/ctx behind this job, and set this ctx as “*guilty*”
>>>
>>> Not sure. Do we want to set the whole context as guilty or just the entity?
>>>
>>> Setting the whole context as guilty sounds racy to me.
>>>
>>> BTW: We should use a different name than "guilty", maybe just "bool
>>> canceled;" ?
>>>
>>>       4.Kick out this job from scheduler’s mirror list, so this job won’t
>>>       get re-scheduled to ring anymore.
>>>
>>> Okay.
>>>
>>>       5.Kick out all jobs in this “guilty” ctx’s KFIFO queue, and set all
>>>       their fence status to “*ECANCELED*”
>>>
>>> Setting ECANCELED should be ok. But I think we should do this when we
>>> try to run the jobs and not during GPU reset.
>>>
>>>       6.Force signal all fences that get kicked out by above two
>>>       steps,*otherwise UMD will block forever if waiting on those
>>> fences*
>>>
>>> Okay.
>>>
>>>       7.Do gpu reset, which can be some callbacks to let bare-metal and
>>>       SR-IOV implement in their favored style
>>>
>>> Okay.
>>>
>>>       8.After reset, KMD needs to be aware whether the VRAM lost happened or not;
>>>       bare-metal can implement some function to judge, while for SR-IOV I
>>>       prefer to read it from the GIM side (for the initial version we consider
>>>       it’s always VRAM lost, until the GIM side change is aligned)
>>>
>>> Okay.
>>>
>>>       9.If VRAM lost is not hit, continue; otherwise:
>>>
>>>       a)Update adev->*vram_lost_counter*,
>>>
>>> Okay.
>>>
>>>       b)Iterate over all living ctx, and set all ctx as “*guilty*” since
>>>       VRAM lost actually ruins all VRAM contents
>>>
>>> No, that shouldn't be done by comparing the counters. Iterating over
>>> all contexts is way too much overhead.
>>>
>>>       c)Kick out all jobs in all ctx’s KFIFO queue, and set all their
>>>       fence status to “*ECANCELED*”
>>>
>>> Yes and no, that should be done when we try to run the jobs and not
>>> during GPU reset.
>>>
>>>       10.Do GTT recovery and VRAM page tables/entries recovery (optional,
>>>       do we need it ???)
>>>
>>> Yes, that is still needed. As Nicolai explained we can't be sure that
>>> VRAM is still 100% correct even when it isn't cleared.
>>>
>>>       11.Re-schedule all jobs remaining in the mirror list to the ring again and
>>>       restart the scheduler (in the VRAM lost case, no job will be re-scheduled)
>>>
>>> Okay.
>>>
>>>       - For cs_wait() IOCTL:
>>>
>>>       After it finds the fence signaled, it should check with
>>>       *“dma_fence_get_status”* to see if there is an error there,
>>>
>>>       and return the error status of the fence
>>>
>>> Yes and no, dma_fence_get_status() is some specific handling for
>>> sync_file debugging (no idea why that made it into the common fence code).
>>>
>>> It was replaced by putting the error code directly into the fence, so
>>> just reading that one after waiting should be ok.
>>>
>>> Maybe we should fix dma_fence_get_status() to do the right thing for this?
>>>
>>>       - For cs_wait_fences() IOCTL:
>>>
>>>       Similar to the above approach
>>>
>>>       - For cs_submit() IOCTL:
>>>
>>>       It needs to check whether the current ctx has been marked as “*guilty*” and return
>>>       “*ECANCELED*” if so
>>>
>>>       - Introduce a new IOCTL to let UMD query *vram_lost_counter*:
>>>
>>>       This way, UMD can also block the app from submitting. Like @Nicolai
>>>       mentioned, we can cache one copy of *vram_lost_counter* when
>>>       enumerating the physical device, and deny all
>>>
>>>       gl-contexts from submitting if the queried counter is bigger than the
>>>       one cached in the physical device (looks a little overkill to me, but
>>>       easy to implement).
>>>
>>>       UMD can also return an error to the APP when creating a gl-context if the
>>>       currently queried *vram_lost_counter* is bigger than the one cached in
>>>       the physical device.
>>>
>>> Okay. Already have a patch for this, please review that one if you
>>> haven't already done so.
>>>
>>> Regards,
>>> Christian.
>>>
>>>       BTW: I realized that a gl-context is a little different from the kernel’s
>>>       context, because for the kernel a BO is not related to a context but only
>>>       to an FD, while in UMD a BO has a backend
>>>
>>>       gl-context, so blocking submission in the UMD layer is also needed, although
>>>       KMD will do its job as the bottom line
>>>
>>>       - Basically, “vram_lost_counter” is exposed by the kernel to let UMD take
>>>       control of the robustness extension feature; it will be UMD’s call to
>>>       move, and KMD only denies the “guilty” context from submitting
>>>
>>>       Need your feedback, thx
>>>
>>>       We’d better make the TDR feature land ASAP
>>>
>>>       BR Monk
>>>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


* Re: TDR and VRAM lost handling in KMD:
       [not found]                                     ` <8c4e849f-9227-12bc-9d2e-3daf60fcd762-5C7GfCeVMHo@public.gmane.org>
@ 2017-10-11 10:39                                       ` Christian König
       [not found]                                         ` <0c198ba6-b853-c26a-7fb4-bcc0344fdea0-5C7GfCeVMHo@public.gmane.org>
  2017-10-11 13:27                                       ` Liu, Monk
  1 sibling, 1 reply; 23+ messages in thread
From: Christian König @ 2017-10-11 10:39 UTC (permalink / raw)
  To: Chunming Zhou, Liu, Monk, Haehnle, Nicolai, Olsak, Marek,
	Deucher, Alexander
  Cc: Ramirez, Alejandro, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	Filipas, Mario, Ding, Pixel, Li, Bingley, Jiang, Jerry (SW)

I've already posted a patch for this on the mailing list.

Basically we just copy the vram lost counter into the job, and when we 
try to run the job we can mark it as canceled.
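
Roughly (the job member is what the posted patch adds; this snippet is only a sketch of the idea, not the patch itself):

    /* at submission time */
    job->vram_lost_counter = atomic_read(&adev->vram_lost_counter);

    /* when the scheduler is about to run the job */
    if (job->vram_lost_counter != atomic_read(&adev->vram_lost_counter))
            dma_fence_set_error(&job->s_fence->finished, -ECANCELED);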

Regards,
Christian.

On 11.10.2017 at 12:14, Chunming Zhou wrote:
> Your summary lacks the issue below:
>
> What about jobs already pushed into the scheduler queue when vram is lost?
>
>
> Regards,
> David Zhou
> On 2017年10月11日 17:41, Liu, Monk wrote:
>> Okay, let me summarize our whole idea together and see if it works:
>>
>> 1, For cs_submit, always check vram_lost_counter first and reject the
>> submit (return -ECANCELED to UMD) if ctx->vram_lost_counter !=
>> adev->vram_lost_counter. That way the vram lost issue can be handled
>>
>> 2, For cs_submit we still need to check whether the incoming context is
>> "AMDGPU_CTX_GUILTY_RESET" or not, even if we found
>> ctx->vram_lost_counter == adev->vram_lost_counter, and we can reject
>> the submit
>> if it is "AMDGPU_CTX_GUILTY_RESET", correct?
>>
>> 3, In the gpu_reset() routine, we only mark the hanging job's entity as
>> guilty (so we need to add a new member to the entity structure), and do
>> not kick it out in the gpu_reset() stage, but we need to set the context
>> behind this entity as "AMDGPU_CTX_GUILTY_RESET".
>>    And if the reset introduces VRAM LOST, we just update
>> adev->vram_lost_counter, but *don't* change all entities to guilty, so
>> still only the hanging job's entity is "guilty".
>>    After an entity is marked as "guilty", we find a way to set the
>> context behind it as AMDGPU_CTX_GUILTY_RESET; because this is the U/K
>> interface, we need to let UMD know that this context is wrong.
>>
>> 4, In the gpu scheduler's run_job() routine, since it only reads the
>> entity, we skip job scheduling once the entity is found to be "guilty"
>> (a rough sketch of the whole flow follows below).
>>
>>
>> Does the above sound good?
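A rough sketch of points 1-4 put together (all field names are
assumptions for illustration, not final):

    /* gpu_reset(): mark only the hanging job's entity and its context. */
    entity->guilty = true;
    ctx->reset_status = AMDGPU_CTX_GUILTY_RESET;
    if (vram_lost)
        atomic_inc(&adev->vram_lost_counter);

    /* cs_submit(): reject contexts that predate a VRAM loss or are guilty. */
    if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter))
        return -ECANCELED;
    if (ctx->reset_status == AMDGPU_CTX_GUILTY_RESET)
        return -ECANCELED;

    /* run_job(): skip jobs from guilty entities instead of kicking them
     * out during reset; their fences get signaled with the error set. */
    if (entity->guilty)
        dma_fence_set_error(&job->s_fence->finished, -ECANCELED);
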
>>
>>
>>
>> -----Original Message-----
>> From: Haehnle, Nicolai
>> Sent: Wednesday, October 11, 2017 5:26 PM
>> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian 
>> <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; 
>> Deucher, Alexander <Alexander.Deucher@amd.com>
>> Cc: amd-gfx@lists.freedesktop.org; Ding, Pixel <Pixel.Ding@amd.com>; 
>> Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, Bingley 
>> <Bingley.Li@amd.com>; Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; 
>> Filipas, Mario <Mario.Filipas@amd.com>
>> Subject: Re: TDR and VRAM lost handling in KMD:
>>
>> On 11.10.2017 11:18, Liu, Monk wrote:
>>> Let's keep it simple: when vram lost hits, what's the action for
>>> amdgpu_ctx_query()/AMDGPU_CTX_OP_QUERY_STATE on other contexts (not
>>> the one that triggered the gpu hang) after vram lost? Do you mean we
>>> return -ENODEV to UMD?
>> It should successfully return AMDGPU_CTX_INNOCENT_RESET.
>>
>>
>>> In cs_submit, with vram lost hit, if we don't mark all contexts as
>>> "guilty", how do we block them from submitting? Can you show some
>>> way to implement it?
>> if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter))
>>       return -ECANCELED;
>>
>> (where ctx->vram_lost_counter is initialized at context creation time 
>> and never changed afterwards)
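A sketch of that initialization, assuming the counter is an atomic_t on
the device (names follow the snippet above):

    /* At context creation: cache the device-wide counter once; cs_submit
     * later compares this copy against the live counter. */
    ctx->vram_lost_counter = atomic_read(&adev->vram_lost_counter);
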
>>
>>
>>> BTW: the "guilty" here is a new member I want to add to the context; it
>>> is not related to the AMDGPU_CTX_OP_QUERY_STATE U/K interface.
>>> Looks like I need to unify them so there is only one place to mark
>>> guilty or not
>> Right, the AMDGPU_CTX_OP_QUERY_STATE handling needs to be made
>> consistent with the rest.
>>
>> Cheers,
>> Nicolai
>>
>>
>>>
>>> BR Monk
>>>
>>> -----Original Message-----
>>> From: Haehnle, Nicolai
>>> Sent: Wednesday, October 11, 2017 5:00 PM
>>> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian 
>>> <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; 
>>> Deucher, Alexander <Alexander.Deucher@amd.com>
>>> Cc: amd-gfx@lists.freedesktop.org; Ding, Pixel <Pixel.Ding@amd.com>; 
>>> Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, Bingley 
>>> <Bingley.Li@amd.com>; Ramirez, Alejandro 
>>> <Alejandro.Ramirez@amd.com>; Filipas, Mario <Mario.Filipas@amd.com>
>>> Subject: Re: TDR and VRAM lost handling in KMD:
>>>
>>> On 11.10.2017 10:48, Liu, Monk wrote:
>>>> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), so
>>>> it's reasonable to use it. However, it /does not/ make sense to mark
>>>> idle contexts as "guilty" just because VRAM is lost. VRAM lost is a
>>>> perfect example where the driver should report context lost to
>>>> applications with the "innocent" flag for contexts that were idle at
>>>> the time of reset. The only context(s) that should be reported as 
>>>> "guilty"
>>>> (or perhaps "unknown" in some cases) are the ones that were executing
>>>> at the time of reset.
>>>>
>>>> ML: KMD marks all contexts as guilty because that way we can unify
>>>> our IOCTL behavior: e.g. the IOCTLs only block the “guilty” context, no
>>>> need to worry about vram-lost-counter anymore; that’s an implementation
>>>> style. I don’t think it is related to the UMD layer.
>>>>
>>>> For UMD: the gl-context isn’t known to KMD, so UMD can implement its
>>>> own “guilty” gl-context if you want.
>>> Well, to some extent this is just semantics, but it helps to keep 
>>> the terminology consistent.
>>>
>>> Most importantly, please keep the AMDGPU_CTX_OP_QUERY_STATE uapi in
>>> mind: this returns one of AMDGPU_CTX_{GUILTY,INNOCENT,UNKNOWN}_RECENT,
>>> and it must return "innocent" for contexts that are only lost due to 
>>> VRAM lost without being otherwise involved in the timeout that led 
>>> to the reset.
>>>
>>> The point is that in the places where you used "guilty" it would be 
>>> better to use "context lost", and then further differentiate between 
>>> guilty/innocent context lost based on the details of what happened.
>>>
>>>
>>>> If KMD doesn’t mark all ctx as guilty after VRAM lost, can you
>>>> illustrate what rule KMD should obey to check in a KMS IOCTL like
>>>> cs_submit? Let’s see which way is better
>>> if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter))
>>>        return -ECANCELED;
>>>
>>> Plus similar logic for AMDGPU_CTX_OP_QUERY_STATE.
>>>
>>> Yes, it's one additional check in cs_submit. If you're worried about 
>>> that (and Christian's concerns about possible issues with walking 
>>> over all contexts are addressed), I suppose you could just store a 
>>> per-context
>>>
>>>      unsigned context_reset_status;
>>>
>>> instead of a `bool guilty`. Its value would start out as 0
>>> (AMDGPU_CTX_NO_RESET) and would be set to the correct value during 
>>> reset.
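A sketch of that per-context variant, assuming the existing
AMDGPU_CTX_*_RESET values are reused as the stored states and that the
query output carries a reset_status field as today:

    /* In the context: starts out as AMDGPU_CTX_NO_RESET. */
    unsigned context_reset_status;

    /* During reset: only the context behind the hanging job is marked. */
    guilty_ctx->context_reset_status = AMDGPU_CTX_GUILTY_RESET;

    /* AMDGPU_CTX_OP_QUERY_STATE then simply reports the stored value. */
    out->state.reset_status = ctx->context_reset_status;
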
>>>
>>> Cheers,
>>> Nicolai
>>>
>>>
>>>> *From:*Haehnle, Nicolai
>>>> *Sent:* Wednesday, October 11, 2017 4:41 PM
>>>> *To:* Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian
>>>> <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>;
>>>> Deucher, Alexander <Alexander.Deucher@amd.com>
>>>> *Cc:* amd-gfx@lists.freedesktop.org; Ding, Pixel <Pixel.Ding@amd.com>;
>>>> Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, Bingley
>>>> <Bingley.Li@amd.com>; Ramirez, Alejandro <Alejandro.Ramirez@amd.com>;
>>>> Filipas, Mario <Mario.Filipas@amd.com>
>>>> *Subject:* Re: TDR and VRAM lost handling in KMD:
>>>>
>>>>    From a Mesa perspective, this almost all sounds reasonable to me.
>>>>
>>>> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), so
>>>> it's reasonable to use it. However, it /does not/ make sense to mark
>>>> idle contexts as "guilty" just because VRAM is lost. VRAM lost is a
>>>> perfect example where the driver should report context lost to
>>>> applications with the "innocent" flag for contexts that were idle at
>>>> the time of reset. The only context(s) that should be reported as 
>>>> "guilty"
>>>> (or perhaps "unknown" in some cases) are the ones that were executing
>>>> at the time of reset.
>>>>
>>>> On whether the whole context is marked as guilty from a user space
>>>> perspective, it would simply be nice for user space to get consistent
>>>> answers. It would be a bit odd if we could e.g. succeed in submitting
>>>> an SDMA job after a GFX job was rejected. This would point in favor of
>>>> marking the entire context as guilty (although that could happen
>>>> lazily instead of at reset time). On the other hand, if that's too big
>>>> a burden for the kernel implementation I'm sure we can live without 
>>>> it.
>>>>
>>>> Cheers,
>>>>
>>>> Nicolai
>>>>
>>>> ----------------------------------------------------------------------
>>>> -- 
>>>>
>>>> *From:*Liu, Monk
>>>> *Sent:* Wednesday, October 11, 2017 10:15:40 AM
>>>> *To:* Koenig, Christian; Haehnle, Nicolai; Olsak, Marek; Deucher,
>>>> Alexander
>>>> *Cc:* amd-gfx@lists.freedesktop.org
>>>> <mailto:amd-gfx@lists.freedesktop.org>; Ding, Pixel; Jiang, Jerry
>>>> (SW); Li, Bingley; Ramirez, Alejandro; Filipas, Mario
>>>> *Subject:* RE: TDR and VRAM lost handling in KMD:
>>>>
>>>> 2. Set its fence error status to “*ETIME*”,
>>>>
>>>> No, as I already explained ETIME is for synchronous operation.
>>>>
>>>> In other words when we return ETIME from the wait IOCTL it would mean
>>>> that the waiting has somehow timed out, but not the job we waited for.
>>>>
>>>> Please use ECANCELED as well or some other error code when we find
>>>> that we need to distinguish the timed-out job from the canceled ones
>>>> (probably a good idea, but I'm not sure).
>>>>
>>>> [ML] I’m okay if you insist not to use ETIME
>>>>
>>>> 3. Find the entity/ctx behind this job, and set this ctx as “*guilty*”
>>>>
>>>> Not sure. Do we want to set the whole context as guilty or just the 
>>>> entity?
>>>>
>>>> Setting the whole context as guilty sounds racy to me.
>>>>
>>>> BTW: We should use a different name than "guilty", maybe just "bool
>>>> canceled;" ?
>>>>
>>>> [ML] I think context is better than entity, because for example if you
>>>> only block entity_0 of a context and allow entity_N to run, the
>>>> dependencies between entities are broken (e.g. page table updates in
>>>>
>>>> the SDMA entity pass but the gfx submit in the GFX entity is blocked,
>>>> which makes no sense to me)
>>>>
>>>> We’d better block the whole context or not at all…
>>>>
>>>> 5. Kick out all jobs in this “guilty” ctx’s KFIFO queue, and set all
>>>> their fence status to “*ECANCELED*”
>>>>
>>>> Setting ECANCELED should be ok. But I think we should do this when we
>>>> try to run the jobs and not during GPU reset.
>>>>
>>>> [ML] without deep thought and experiment, I’m not sure of the difference
>>>> between them, but kicking it out in the gpu_reset routine is more efficient.
>>>>
>>>> Otherwise you need to check the context/entity guilty flag in the run_job
>>>> routine … and you need to do it for every context/entity; I don’t see why
>>>>
>>>> we don’t just kick out all of them in the gpu_reset stage ….
>>>>
>>>> b) Iterate over all living ctx, and set all ctx as “*guilty*” since
>>>> VRAM lost actually ruins all VRAM contents
>>>>
>>>> No, that shouldn't be done by comparing the counters. Iterating over
>>>> all contexts is way too much overhead.
>>>>
>>>> [ML] because I want to make the KMS IOCTL rules clean: they don’t
>>>> need to differentiate VRAM lost or not, they are only interested in
>>>> whether the context is guilty or not, and block
>>>>
>>>> submission for guilty ones.
>>>>
>>>> *Can you give more details of your idea? And better, the detailed
>>>> implementation in cs_submit; I want to see how you want to block
>>>> submission without checking the context guilty flag*
>>>>
>>>> c) Kick out all jobs in all ctx’s KFIFO queues, and set all their fence
>>>> status to “*ECANCELED*”
>>>>
>>>> Yes and no, that should be done when we try to run the jobs and not
>>>> during GPU reset.
>>>>
>>>> [ML] again, kicking them out in the gpu reset routine is highly
>>>> efficient; otherwise you need a check on every job in run_job()
>>>>
>>>> Besides, can you illustrate the detailed implementation?
>>>>
>>>> Yes and no, dma_fence_get_status() is some specific handling for
>>>> sync_file debugging (no idea why that made it into the common fence 
>>>> code).
>>>>
>>>> It was replaced by putting the error code directly into the fence, so
>>>> just reading that one after waiting should be ok.
>>>>
>>>> Maybe we should fix dma_fence_get_status() to do the right thing 
>>>> for this?
>>>>
>>>> [ML] yeah, that’s too confusing; the name really sounds like the one I
>>>> want to use, we should change it…
>>>>
>>>> *But looking into the implementation, I don’t see why we cannot use it?
>>>> It also finally returns the fence->error*
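For reference, a sketch of the pattern under discussion after a wait
(fence->error is 0 unless the driver stored an error such as -ECANCELED
before signaling):

    r = dma_fence_wait_timeout(fence, true, timeout);
    if (r > 0 && fence->error)
        return fence->error;   /* e.g. -ECANCELED for a kicked-out job */
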
>>>>
>>>> *From:*Koenig, Christian
>>>> *Sent:* Wednesday, October 11, 2017 3:21 PM
>>>> *To:* Liu, Monk <Monk.Liu@amd.com <mailto:Monk.Liu@amd.com>>; Haehnle,
>>>> Nicolai <Nicolai.Haehnle@amd.com <mailto:Nicolai.Haehnle@amd.com>>;
>>>> Olsak, Marek <Marek.Olsak@amd.com <mailto:Marek.Olsak@amd.com>>;
>>>> Deucher, Alexander <Alexander.Deucher@amd.com
>>>> <mailto:Alexander.Deucher@amd.com>>
>>>> *Cc:* amd-gfx@lists.freedesktop.org
>>>> <mailto:amd-gfx@lists.freedesktop.org>; Ding, Pixel
>>>> <Pixel.Ding@amd.com <mailto:Pixel.Ding@amd.com>>; Jiang, Jerry (SW)
>>>> <Jerry.Jiang@amd.com <mailto:Jerry.Jiang@amd.com>>; Li, Bingley
>>>> <Bingley.Li@amd.com <mailto:Bingley.Li@amd.com>>; Ramirez, Alejandro
>>>> <Alejandro.Ramirez@amd.com <mailto:Alejandro.Ramirez@amd.com>>;
>>>> Filipas, Mario <Mario.Filipas@amd.com <mailto:Mario.Filipas@amd.com>>
>>>> *Subject:* Re: TDR and VRAM lost handling in KMD:
>>>>
>>>> See inline:
>>>>
>>>> On 11.10.2017 at 07:33, Liu, Monk wrote:
>>>>
>>>>       Hi Christian & Nicolai,
>>>>
>>>>       We need to achieve some agreement on what MESA/UMD should do
>>>>       and what KMD should do, *please give your comments with “okay”
>>>>       or “No” and your idea on the items below,*
>>>>
>>>>       - When a job times out (set from the lockup_timeout kernel
>>>>       parameter), what KMD should do in the TDR routine:
>>>>
>>>>       1. Update adev->*gpu_reset_counter* and stop the scheduler first
>>>>       (*gpu_reset_counter* is used to force a vm flush after GPU reset;
>>>>       out of this thread’s scope so no more discussion on it)
>>>>
>>>> Okay.
>>>>
>>>>       2. Set its fence error status to “*ETIME*”,
>>>>
>>>> No, as I already explained ETIME is for synchronous operation.
>>>>
>>>> In other words when we return ETIME from the wait IOCTL it would mean
>>>> that the waiting has somehow timed out, but not the job we waited for.
>>>>
>>>> Please use ECANCELED as well or some other error code when we find
>>>> that we need to distinguish the timed-out job from the canceled ones
>>>> (probably a good idea, but I'm not sure).
>>>>
>>>>       3. Find the entity/ctx behind this job, and set this ctx as
>>>> “*guilty*”
>>>>
>>>> Not sure. Do we want to set the whole context as guilty or just the 
>>>> entity?
>>>>
>>>> Setting the whole context as guilty sounds racy to me.
>>>>
>>>> BTW: We should use a different name than "guilty", maybe just "bool
>>>> canceled;" ?
>>>>
>>>>       4. Kick out this job from the scheduler’s mirror list, so this job
>>>>       won’t get re-scheduled to the ring anymore.
>>>>
>>>> Okay.
>>>>
>>>>       5. Kick out all jobs in this “guilty” ctx’s KFIFO queue, and set
>>>>       all their fence status to “*ECANCELED*”
>>>>
>>>> Setting ECANCELED should be ok. But I think we should do this when we
>>>> try to run the jobs and not during GPU reset.
>>>>
>>>>       6. Force-signal all fences that got kicked out by the above two
>>>>       steps, *otherwise UMD will block forever if waiting on those
>>>>       fences*
>>>>
>>>> Okay.
>>>>
>>>>       7. Do the gpu reset, which can be some callbacks to let
>>>>       bare-metal and SR-IOV implement it in their favored style
>>>>
>>>> Okay.
>>>>
>>>>       8. After reset, KMD needs to be aware whether VRAM lost happened
>>>>       or not; bare-metal can implement some function to judge, while for
>>>>       SR-IOV I prefer to read it from the GIM side (for the initial
>>>>       version we consider it always VRAM lost, till the GIM side change
>>>>       is aligned)
>>>>
>>>> Okay.
>>>>
>>>>       9. If VRAM lost is not hit, continue; otherwise:
>>>>
>>>>       a) Update adev->*vram_lost_counter*,
>>>>
>>>> Okay.
>>>>
>>>>       b) Iterate over all living ctx, and set all ctx as “*guilty*” since
>>>>       VRAM lost actually ruins all VRAM contents
>>>>
>>>> No, that shouldn't be done by comparing the counters. Iterating over
>>>> all contexts is way too much overhead.
>>>>
>>>>       c) Kick out all jobs in all ctx’s KFIFO queues, and set all their
>>>>       fence status to “*ECANCELED*”
>>>>
>>>> Yes and no, that should be done when we try to run the jobs and not
>>>> during GPU reset.
>>>>
>>>>       10. Do GTT recovery and VRAM page tables/entries recovery
>>>>       (optional, do we need it ???)
>>>>
>>>> Yes, that is still needed. As Nicolai explained we can't be sure that
>>>> VRAM is still 100% correct even when it isn't cleared.
>>>>
>>>>       11. Re-schedule all JOBs remaining in the mirror list to the ring
>>>>       again and restart the scheduler (for the VRAM lost case, no JOB
>>>>       will be re-scheduled)
>>>>
>>>> Okay.
>>>>
>>>>       - For cs_wait() IOCTL:
>>>>
>>>>       After it finds the fence signaled, it should check with
>>>>       *“dma_fence_get_status”* to see if there is an error there,
>>>>
>>>>       and return the error status of the fence
>>>>
>>>> Yes and no, dma_fence_get_status() is some specific handling for
>>>> sync_file debugging (no idea why that made it into the common fence 
>>>> code).
>>>>
>>>> It was replaced by putting the error code directly into the fence, so
>>>> just reading that one after waiting should be ok.
>>>>
>>>> Maybe we should fix dma_fence_get_status() to do the right thing 
>>>> for this?
>>>>
>>>>       - For cs_wait_fences() IOCTL:
>>>>
>>>>       Similar to the above approach
>>>>
>>>>       - For cs_submit() IOCTL:
>>>>
>>>>       It needs to check whether the current ctx has been marked as
>>>>       “*guilty*” and return “*ECANCELED*” if so
>>>>
>>>>       - Introduce a new IOCTL to let UMD query *vram_lost_counter*:
>>>>
>>>>       This way, UMD can also block the app from submitting. As @Nicolai
>>>>       mentioned, we can cache one copy of *vram_lost_counter* when
>>>>       enumerating the physical device, and deny all
>>>>
>>>>       gl-contexts from submitting if the queried counter is bigger than
>>>>       the one cached in the physical device. (Looks a little overkill to
>>>>       me, but easy to implement.)
>>>>
>>>>       UMD can also return an error to the APP when creating a gl-context
>>>>       if the currently queried *vram_lost_counter* is bigger than the one
>>>>       cached in the physical device.
>>>>
>>>> Okay. Already have a patch for this, please review that one if you
>>>> haven't already done so.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>       BTW: I realized that a gl-context is a little different from the
>>>>       kernel’s context, because for the kernel a BO is not related to a
>>>>       context but only to an FD, while in UMD a BO has a backing
>>>>
>>>>       gl-context, so blocking submission in the UMD layer is also needed,
>>>>       although KMD will do its job as the bottom line
>>>>
>>>>       - Basically “vram_lost_counter” is exposed by the kernel to let
>>>>       UMD take control of the robustness extension feature; it will be
>>>>       UMD’s call to make, KMD only denies “guilty” contexts from
>>>>       submitting
>>>>
>>>>       Need your feedback, thx
>>>>
>>>>       We’d better get the TDR feature landed ASAP
>>>>
>>>>       BR Monk
>>>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: TDR and VRAM lost handling in KMD:
       [not found]                                     ` <8c4e849f-9227-12bc-9d2e-3daf60fcd762-5C7GfCeVMHo@public.gmane.org>
  2017-10-11 10:39                                       ` Christian König
@ 2017-10-11 13:27                                       ` Liu, Monk
  1 sibling, 0 replies; 23+ messages in thread
From: Liu, Monk @ 2017-10-11 13:27 UTC (permalink / raw)
  To: Zhou, David(ChunMing),
	Haehnle, Nicolai, Koenig, Christian, Olsak, Marek, Deucher,
	Alexander
  Cc: Ramirez, Alejandro, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	Filipas, Mario, Ding, Pixel, Li, Bingley, Jiang, Jerry (SW)

According to the initial summary, this situation is already considered:
when vram lost hits, all contexts are marked as guilty, and all jobs in a guilty context's KFIFO queue will be kicked out.

Now if we move the kick-out from gpu_reset to run_job, then I think your question can be answered by:
in run_job(), before each job is scheduled, check the current vram_lost_counter, compare it with the copy cached during entity init (or context init), and skip the job if they don't match.

BR Monk


-----Original Message-----
From: Zhou, David(ChunMing) 
Sent: Wednesday, October 11, 2017 6:15 PM
To: Liu, Monk <Monk.Liu@amd.com>; Haehnle, Nicolai <Nicolai.Haehnle@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>
Cc: Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; amd-gfx@lists.freedesktop.org; Filipas, Mario <Mario.Filipas@amd.com>; Ding, Pixel <Pixel.Ding@amd.com>; Li, Bingley <Bingley.Li@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>
Subject: Re: TDR and VRAM lost handling in KMD:

Your summary lacks the issue below:

What about jobs already pushed into the scheduler queue when vram is lost?


Regards,
David Zhou
On 2017-10-11 17:41, Liu, Monk wrote:
> Okay, let me summarize our whole idea together and see if it works:
>
> 1, For cs_submit, always check vram_lost_counter first and reject the
> submit (return -ECANCELED to UMD) if ctx->vram_lost_counter !=
> adev->vram_lost_counter. That way the vram lost issue is handled.
>
> 2, For cs_submit we still need to check whether the incoming context is "AMDGPU_CTX_GUILTY_RESET" even if we found ctx->vram_lost_counter == adev->vram_lost_counter, and we reject the submit if it is "AMDGPU_CTX_GUILTY_RESET", correct?
>
> 3, In the gpu_reset() routine, we only mark the hanging job's entity as guilty (so we need to add a new member to the entity structure), and do not kick it out in the gpu_reset() stage, but we need to set the context behind this entity as "AMDGPU_CTX_GUILTY_RESET".
>    And if the reset introduces VRAM LOST, we just update adev->vram_lost_counter, but *don't* change all entities to guilty, so still only the hanging job's entity is "guilty".
>    After an entity is marked as "guilty", we find a way to set the context behind it as AMDGPU_CTX_GUILTY_RESET; because this is the U/K interface, we need to let UMD know that this context is wrong.
>
> 4, In the gpu scheduler's run_job() routine, since it only reads the entity, we skip job scheduling once the entity is found to be "guilty"
>
>
> Does the above sound good?
>
>
>
> -----Original Message-----
> From: Haehnle, Nicolai
> Sent: Wednesday, October 11, 2017 5:26 PM
> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian 
> <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; 
> Deucher, Alexander <Alexander.Deucher@amd.com>
> Cc: amd-gfx@lists.freedesktop.org; Ding, Pixel <Pixel.Ding@amd.com>; 
> Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, Bingley 
> <Bingley.Li@amd.com>; Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; 
> Filipas, Mario <Mario.Filipas@amd.com>
> Subject: Re: TDR and VRAM lost handling in KMD:
>
> On 11.10.2017 11:18, Liu, Monk wrote:
>> Let's keep it simple: when vram lost hits, what's the action for amdgpu_ctx_query()/AMDGPU_CTX_OP_QUERY_STATE on other contexts (not the one that triggered the gpu hang) after vram lost? Do you mean we return -ENODEV to UMD?
> It should successfully return AMDGPU_CTX_INNOCENT_RESET.
>
>
>> In cs_submit, with vram lost hit, if we don't mark all contexts as
>> "guilty", how do we block them from submitting? Can you show some
>> way to implement it?
> if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter))
>       return -ECANCELED;
>
> (where ctx->vram_lost_counter is initialized at context creation time 
> and never changed afterwards)
>
>
>> BTW: the "guilty" here is a new member I want to add to the context; it
>> is not related to the AMDGPU_CTX_OP_QUERY_STATE U/K interface. Looks like
>> I need to unify them so there is only one place to mark guilty or not
> Right, the AMDGPU_CTX_OP_QUERY_STATE handling needs to be made 
> consistent with the rest.
>
> Cheers,
> Nicolai
>
>
>>
>> BR Monk
>>
>> -----Original Message-----
>> From: Haehnle, Nicolai
>> Sent: Wednesday, October 11, 2017 5:00 PM
>> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian 
>> <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; 
>> Deucher, Alexander <Alexander.Deucher@amd.com>
>> Cc: amd-gfx@lists.freedesktop.org; Ding, Pixel <Pixel.Ding@amd.com>; 
>> Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, Bingley 
>> <Bingley.Li@amd.com>; Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; 
>> Filipas, Mario <Mario.Filipas@amd.com>
>> Subject: Re: TDR and VRAM lost handling in KMD:
>>
>> On 11.10.2017 10:48, Liu, Monk wrote:
>>> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), 
>>> so it's reasonable to use it. However, it /does not/ make sense to 
>>> mark idle contexts as "guilty" just because VRAM is lost. VRAM lost 
>>> is a perfect example where the driver should report context lost to 
>>> applications with the "innocent" flag for contexts that were idle at 
>>> the time of reset. The only context(s) that should be reported as "guilty"
>>> (or perhaps "unknown" in some cases) are the ones that were 
>>> executing at the time of reset.
>>>
>>> ML: KMD marks all contexts as guilty because that way we can unify
>>> our IOCTL behavior: e.g. the IOCTLs only block the “guilty” context, no
>>> need to worry about vram-lost-counter anymore; that’s an
>>> implementation style. I don’t think it is related to the UMD layer.
>>>
>>> For UMD: the gl-context isn’t known to KMD, so UMD can implement its
>>> own “guilty” gl-context if you want.
>> Well, to some extent this is just semantics, but it helps to keep the terminology consistent.
>>
>> Most importantly, please keep the AMDGPU_CTX_OP_QUERY_STATE uapi in
>> mind: this returns one of 
>> AMDGPU_CTX_{GUILTY,INNOCENT,UNKNOWN}_RECENT,
>> and it must return "innocent" for contexts that are only lost due to VRAM lost without being otherwise involved in the timeout that led to the reset.
>>
>> The point is that in the places where you used "guilty" it would be better to use "context lost", and then further differentiate between guilty/innocent context lost based on the details of what happened.
>>
>>
>>> If KMD doesn’t mark all ctx as guilty after VRAM lost, can you
>>> illustrate what rule KMD should obey to check in a KMS IOCTL like
>>> cs_submit? Let’s see which way is better
>> if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter))
>>        return -ECANCELED;
>>
>> Plus similar logic for AMDGPU_CTX_OP_QUERY_STATE.
>>
>> Yes, it's one additional check in cs_submit. If you're worried about 
>> that (and Christian's concerns about possible issues with walking 
>> over all contexts are addressed), I suppose you could just store a 
>> per-context
>>
>>      unsigned context_reset_status;
>>
>> instead of a `bool guilty`. Its value would start out as 0
>> (AMDGPU_CTX_NO_RESET) and would be set to the correct value during reset.
>>
>> Cheers,
>> Nicolai
>>
>>
>>> *From:*Haehnle, Nicolai
>>> *Sent:* Wednesday, October 11, 2017 4:41 PM
>>> *To:* Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian 
>>> <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; 
>>> Deucher, Alexander <Alexander.Deucher@amd.com>
>>> *Cc:* amd-gfx@lists.freedesktop.org; Ding, Pixel 
>>> <Pixel.Ding@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, 
>>> Bingley <Bingley.Li@amd.com>; Ramirez, Alejandro 
>>> <Alejandro.Ramirez@amd.com>; Filipas, Mario <Mario.Filipas@amd.com>
>>> *Subject:* Re: TDR and VRAM lost handling in KMD:
>>>
>>>    From a Mesa perspective, this almost all sounds reasonable to me.
>>>
>>> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), 
>>> so it's reasonable to use it. However, it /does not/ make sense to 
>>> mark idle contexts as "guilty" just because VRAM is lost. VRAM lost 
>>> is a perfect example where the driver should report context lost to 
>>> applications with the "innocent" flag for contexts that were idle at 
>>> the time of reset. The only context(s) that should be reported as "guilty"
>>> (or perhaps "unknown" in some cases) are the ones that were 
>>> executing at the time of reset.
>>>
>>> On whether the whole context is marked as guilty from a user space 
>>> perspective, it would simply be nice for user space to get 
>>> consistent answers. It would be a bit odd if we could e.g. succeed 
>>> in submitting an SDMA job after a GFX job was rejected. This would 
>>> point in favor of marking the entire context as guilty (although 
>>> that could happen lazily instead of at reset time). On the other 
>>> hand, if that's too big a burden for the kernel implementation I'm sure we can live without it.
>>>
>>> Cheers,
>>>
>>> Nicolai
>>>
>>> --------------------------------------------------------------------
>>> --
>>> --
>>>
>>> *From:*Liu, Monk
>>> *Sent:* Wednesday, October 11, 2017 10:15:40 AM
>>> *To:* Koenig, Christian; Haehnle, Nicolai; Olsak, Marek; Deucher, 
>>> Alexander
>>> *Cc:* amd-gfx@lists.freedesktop.org
>>> <mailto:amd-gfx@lists.freedesktop.org>; Ding, Pixel; Jiang, Jerry 
>>> (SW); Li, Bingley; Ramirez, Alejandro; Filipas, Mario
>>> *Subject:* RE: TDR and VRAM lost handling in KMD:
>>>
>>> 2. Set its fence error status to “*ETIME*”,
>>>
>>> No, as I already explained ETIME is for synchronous operation.
>>>
>>> In other words when we return ETIME from the wait IOCTL it would 
>>> mean that the waiting has somehow timed out, but not the job we waited for.
>>>
>>> Please use ECANCELED as well or some other error code when we find
>>> that we need to distinguish the timed-out job from the canceled ones
>>> (probably a good idea, but I'm not sure).
>>>
>>> [ML] I’m okay if you insist not to use ETIME
>>>
>>> 3. Find the entity/ctx behind this job, and set this ctx as “*guilty*”
>>>
>>> Not sure. Do we want to set the whole context as guilty or just the entity?
>>>
>>> Setting the whole context as guilty sounds racy to me.
>>>
>>> BTW: We should use a different name than "guilty", maybe just "bool 
>>> canceled;" ?
>>>
>>> [ML] I think context is better than entity, because for example if
>>> you only block entity_0 of a context and allow entity_N to run, the
>>> dependencies between entities are broken (e.g. page table
>>> updates in
>>>
>>> the SDMA entity pass but the gfx submit in the GFX entity is blocked,
>>> which makes no sense to me)
>>>
>>> We’d better block the whole context or not at all…
>>>
>>> 5. Kick out all jobs in this “guilty” ctx’s KFIFO queue, and set all
>>> their fence status to “*ECANCELED*”
>>>
>>> Setting ECANCELED should be ok. But I think we should do this when 
>>> we try to run the jobs and not during GPU reset.
>>>
>>> [ML] without deep thought and experiment, I’m not sure of the
>>> difference between them, but kicking it out in the gpu_reset routine is
>>> more efficient.
>>>
>>> Otherwise you need to check the context/entity guilty flag in the run_job
>>> routine … and you need to do it for every context/entity; I don’t see
>>> why
>>>
>>> we don’t just kick out all of them in the gpu_reset stage ….
>>>
>>> b) Iterate over all living ctx, and set all ctx as “*guilty*” since
>>> VRAM lost actually ruins all VRAM contents
>>>
>>> No, that shouldn't be done by comparing the counters. Iterating over
>>> all contexts is way too much overhead.
>>>
>>> [ML] because I want to make the KMS IOCTL rules clean: they don’t
>>> need to differentiate VRAM lost or not, they are only interested in
>>> whether the context is guilty or not, and block
>>>
>>> submission for guilty ones.
>>>
>>> *Can you give more details of your idea? And better, the detailed
>>> implementation in cs_submit; I want to see how you want to block
>>> submission without checking the context guilty flag*
>>>
>>> c) Kick out all jobs in all ctx’s KFIFO queues, and set all their
>>> fence status to “*ECANCELED*”
>>>
>>> Yes and no, that should be done when we try to run the jobs and not
>>> during GPU reset.
>>>
>>> [ML] again, kicking them out in the gpu reset routine is highly
>>> efficient; otherwise you need a check on every job in run_job()
>>>
>>> Besides, can you illustrate the detailed implementation?
>>>
>>> Yes and no, dma_fence_get_status() is some specific handling for 
>>> sync_file debugging (no idea why that made it into the common fence code).
>>>
>>> It was replaced by putting the error code directly into the fence, 
>>> so just reading that one after waiting should be ok.
>>>
>>> Maybe we should fix dma_fence_get_status() to do the right thing for this?
>>>
>>> [ML] yeah, that’s too confusing; the name really sounds like the one I
>>> want to use, we should change it…
>>>
>>> *But looking into the implementation, I don’t see why we cannot use it?
>>> It also finally returns the fence->error*
>>>
>>> *From:*Koenig, Christian
>>> *Sent:* Wednesday, October 11, 2017 3:21 PM
>>> *To:* Liu, Monk <Monk.Liu@amd.com <mailto:Monk.Liu@amd.com>>; 
>>> Haehnle, Nicolai <Nicolai.Haehnle@amd.com 
>>> <mailto:Nicolai.Haehnle@amd.com>>;
>>> Olsak, Marek <Marek.Olsak@amd.com <mailto:Marek.Olsak@amd.com>>; 
>>> Deucher, Alexander <Alexander.Deucher@amd.com 
>>> <mailto:Alexander.Deucher@amd.com>>
>>> *Cc:* amd-gfx@lists.freedesktop.org
>>> <mailto:amd-gfx@lists.freedesktop.org>; Ding, Pixel 
>>> <Pixel.Ding@amd.com <mailto:Pixel.Ding@amd.com>>; Jiang, Jerry (SW) 
>>> <Jerry.Jiang@amd.com <mailto:Jerry.Jiang@amd.com>>; Li, Bingley 
>>> <Bingley.Li@amd.com <mailto:Bingley.Li@amd.com>>; Ramirez, Alejandro 
>>> <Alejandro.Ramirez@amd.com <mailto:Alejandro.Ramirez@amd.com>>;
>>> Filipas, Mario <Mario.Filipas@amd.com 
>>> <mailto:Mario.Filipas@amd.com>>
>>> *Subject:* Re: TDR and VRAM lost handling in KMD:
>>>
>>> See inline:
>>>
>>> On 11.10.2017 at 07:33, Liu, Monk wrote:
>>>
>>>       Hi Christian & Nicolai,
>>>
>>>       We need to achieve some agreement on what MESA/UMD should do and
>>>       what KMD should do, *please give your comments with “okay” or “No”
>>>       and your idea on the items below,*
>>>
>>>       - When a job times out (set from the lockup_timeout kernel parameter),
>>>       what KMD should do in the TDR routine:
>>>
>>>       1. Update adev->*gpu_reset_counter* and stop the scheduler first
>>>       (*gpu_reset_counter* is used to force a vm flush after GPU reset;
>>>       out of this thread’s scope so no more discussion on it)
>>>
>>> Okay.
>>>
>>>       2. Set its fence error status to “*ETIME*”,
>>>
>>> No, as I already explained ETIME is for synchronous operation.
>>>
>>> In other words when we return ETIME from the wait IOCTL it would 
>>> mean that the waiting has somehow timed out, but not the job we waited for.
>>>
>>> Please use ECANCELED as well or some other error code when we find
>>> that we need to distinguish the timed-out job from the canceled ones
>>> (probably a good idea, but I'm not sure).
>>>
>>>       3. Find the entity/ctx behind this job, and set this ctx as “*guilty*”
>>>
>>> Not sure. Do we want to set the whole context as guilty or just the entity?
>>>
>>> Setting the whole context as guilty sounds racy to me.
>>>
>>> BTW: We should use a different name than "guilty", maybe just "bool 
>>> canceled;" ?
>>>
>>>       4. Kick out this job from the scheduler’s mirror list, so this job
>>>       won’t get re-scheduled to the ring anymore.
>>>
>>> Okay.
>>>
>>>       5. Kick out all jobs in this “guilty” ctx’s KFIFO queue, and set all
>>>       their fence status to “*ECANCELED*”
>>>
>>> Setting ECANCELED should be ok. But I think we should do this when 
>>> we try to run the jobs and not during GPU reset.
>>>
>>>       6. Force-signal all fences that got kicked out by the above two
>>>       steps, *otherwise UMD will block forever if waiting on those
>>>       fences*
>>>
>>> Okay.
>>>
>>>       7. Do the gpu reset, which can be some callbacks to let bare-metal
>>>       and SR-IOV implement it in their favored style
>>>
>>> Okay.
>>>
>>>       8. After reset, KMD needs to be aware whether VRAM lost happened or
>>>       not; bare-metal can implement some function to judge, while for
>>>       SR-IOV I prefer to read it from the GIM side (for the initial version
>>>       we consider it always VRAM lost, till the GIM side change is aligned)
>>>
>>> Okay.
>>>
>>>       9. If VRAM lost is not hit, continue; otherwise:
>>>
>>>       a) Update adev->*vram_lost_counter*,
>>>
>>> Okay.
>>>
>>>       b) Iterate over all living ctx, and set all ctx as “*guilty*” since
>>>       VRAM lost actually ruins all VRAM contents
>>>
>>> No, that shouldn't be done by comparing the counters. Iterating over
>>> all contexts is way too much overhead.
>>>
>>>       c) Kick out all jobs in all ctx’s KFIFO queues, and set all their
>>>       fence status to “*ECANCELED*”
>>>
>>> Yes and no, that should be done when we try to run the jobs and not 
>>> during GPU reset.
>>>
>>>       10. Do GTT recovery and VRAM page tables/entries recovery (optional,
>>>       do we need it ???)
>>>
>>> Yes, that is still needed. As Nicolai explained we can't be sure 
>>> that VRAM is still 100% correct even when it isn't cleared.
>>>
>>>       11. Re-schedule all JOBs remaining in the mirror list to the ring
>>>       again and restart the scheduler (for the VRAM lost case, no JOB
>>>       will be re-scheduled)
>>>
>>> Okay.
>>>
>>>       - For cs_wait() IOCTL:
>>>
>>>       After it finds the fence signaled, it should check with
>>>       *“dma_fence_get_status”* to see if there is an error there,
>>>
>>>       and return the error status of the fence
>>>
>>> Yes and no, dma_fence_get_status() is some specific handling for 
>>> sync_file debugging (no idea why that made it into the common fence code).
>>>
>>> It was replaced by putting the error code directly into the fence, 
>>> so just reading that one after waiting should be ok.
>>>
>>> Maybe we should fix dma_fence_get_status() to do the right thing for this?
>>>
>>>       - For cs_wait_fences() IOCTL:
>>>
>>>       Similar to the above approach
>>>
>>>       - For cs_submit() IOCTL:
>>>
>>>       It needs to check whether the current ctx has been marked as “*guilty*”
>>>       and return “*ECANCELED*” if so
>>>
>>>       - Introduce a new IOCTL to let UMD query *vram_lost_counter*:
>>>
>>>       This way, UMD can also block the app from submitting. As @Nicolai
>>>       mentioned, we can cache one copy of *vram_lost_counter* when
>>>       enumerating the physical device, and deny all
>>>
>>>       gl-contexts from submitting if the queried counter is bigger than the
>>>       one cached in the physical device. (Looks a little overkill to me, but
>>>       easy to implement.)
>>>
>>>       UMD can also return an error to the APP when creating a gl-context if
>>>       the currently queried *vram_lost_counter* is bigger than the one cached
>>>       in the physical device.
>>>
>>> Okay. Already have a patch for this, please review that one if you 
>>> haven't already done so.
>>>
>>> Regards,
>>> Christian.
>>>
>>>       BTW: I realized that a gl-context is a little different from the
>>>       kernel’s context, because for the kernel a BO is not related to a
>>>       context but only to an FD, while in UMD a BO has a backing
>>>
>>>       gl-context, so blocking submission in the UMD layer is also needed,
>>>       although KMD will do its job as the bottom line
>>>
>>>       - Basically “vram_lost_counter” is exposed by the kernel to let UMD
>>>       take control of the robustness extension feature; it will be UMD’s
>>>       call to make, KMD only denies “guilty” contexts from submitting
>>>
>>>       Need your feedback, thx
>>>
>>>       We’d better get the TDR feature landed ASAP
>>>
>>>       BR Monk
>>>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: TDR and VRAM lost handling in KMD:
       [not found]                                         ` <0c198ba6-b853-c26a-7fb4-bcc0344fdea0-5C7GfCeVMHo@public.gmane.org>
@ 2017-10-11 13:35                                           ` Liu, Monk
       [not found]                                             ` <BLUPR12MB04490BE33EC2E851228E25F4844A0-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Liu, Monk @ 2017-10-11 13:35 UTC (permalink / raw)
  To: Koenig, Christian, Zhou, David(ChunMing),
	Haehnle, Nicolai, Olsak, Marek, Deucher, Alexander
  Cc: Ramirez, Alejandro, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	Filipas, Mario, Ding, Pixel, Li, Bingley, Jiang, Jerry (SW)

I think just comparing the copy from the context/entity with the current counter is enough; I don't see how it's better to keep another copy in the JOB
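
The two variants differ only in where the snapshot lives (sketches with
assumed field names):

    /* Per-job copy (the posted patch): taken at submit time. */
    if (job->vram_lost_counter != atomic_read(&adev->vram_lost_counter))
        dma_fence_set_error(finished, -ECANCELED);

    /* Per-context copy (this proposal): taken at context creation and
     * reached through the entity; saves a field per job. */
    if (job->ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter))
        dma_fence_set_error(finished, -ECANCELED);

Since the context copy never changes after creation and cs_submit already
rejects mismatching contexts, both checks cancel the same set of queued
jobs; the trade-off is an extra per-job field versus an extra pointer
chase in run_job().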


-----Original Message-----
From: Koenig, Christian 
Sent: Wednesday, October 11, 2017 6:40 PM
To: Zhou, David(ChunMing) <David1.Zhou@amd.com>; Liu, Monk <Monk.Liu@amd.com>; Haehnle, Nicolai <Nicolai.Haehnle@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>
Cc: Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; amd-gfx@lists.freedesktop.org; Filipas, Mario <Mario.Filipas@amd.com>; Ding, Pixel <Pixel.Ding@amd.com>; Li, Bingley <Bingley.Li@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>
Subject: Re: TDR and VRAM lost handling in KMD:

I've already posted a patch for this on the mailing list.

Basically we just copy the vram lost counter into the job and when we try to run the job we can mark it as canceled.

Regards,
Christian.

On 11.10.2017 at 12:14, Chunming Zhou wrote:
> Your summary lacks the issue below:
>
> What about jobs already pushed into the scheduler queue when vram is lost?
>
>
> Regards,
> David Zhou
> On 2017-10-11 17:41, Liu, Monk wrote:
>> Okay, let me summarize our whole idea together and see if it works:
>>
>> 1, For cs_submit, always check vram_lost_counter first and reject the
>> submit (return -ECANCELED to UMD) if ctx->vram_lost_counter !=
>> adev->vram_lost_counter. That way the vram lost issue is handled.
>>
>> 2, For cs_submit we still need to check whether the incoming context is
>> "AMDGPU_CTX_GUILTY_RESET" even if we found
>> ctx->vram_lost_counter == adev->vram_lost_counter, and we reject
>> the submit
>> if it is "AMDGPU_CTX_GUILTY_RESET", correct?
>>
>> 3, In the gpu_reset() routine, we only mark the hanging job's entity as
>> guilty (so we need to add a new member to the entity structure), and do
>> not kick it out in the gpu_reset() stage, but we need to set the context
>> behind this entity as "AMDGPU_CTX_GUILTY_RESET".
>>    And if the reset introduces VRAM LOST, we just update
>> adev->vram_lost_counter, but *don't* change all entities to guilty, so
>> still only the hanging job's entity is "guilty".
>>    After an entity is marked as "guilty", we find a way to set the
>> context behind it as AMDGPU_CTX_GUILTY_RESET; because this is the U/K
>> interface, we need to let UMD know that this context is wrong.
>>
>> 4, In the gpu scheduler's run_job() routine, since it only reads the
>> entity, we skip job scheduling once the entity is found to be "guilty"
>>
>>
>> Does the above sound good?
>>
>>
>>
>> -----Original Message-----
>> From: Haehnle, Nicolai
>> Sent: Wednesday, October 11, 2017 5:26 PM
>> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian 
>> <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; 
>> Deucher, Alexander <Alexander.Deucher@amd.com>
>> Cc: amd-gfx@lists.freedesktop.org; Ding, Pixel <Pixel.Ding@amd.com>; 
>> Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, Bingley 
>> <Bingley.Li@amd.com>; Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; 
>> Filipas, Mario <Mario.Filipas@amd.com>
>> Subject: Re: TDR and VRAM lost handling in KMD:
>>
>> On 11.10.2017 11:18, Liu, Monk wrote:
>>> Let's keep it simple: when vram lost hits, what's the action for
>>> amdgpu_ctx_query()/AMDGPU_CTX_OP_QUERY_STATE on other contexts (not
>>> the one that triggered the gpu hang) after vram lost? Do you mean we
>>> return -ENODEV to UMD?
>> It should successfully return AMDGPU_CTX_INNOCENT_RESET.
>>
>>
>>> In cs_submit, with vram lost hit, if we don't mark all contexts as
>>> "guilty", how do we block them from submitting? Can you show some
>>> way to implement it?
>> if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter))
>>       return -ECANCELED;
>>
>> (where ctx->vram_lost_counter is initialized at context creation time 
>> and never changed afterwards)
>>
>>
>>> BTW: the "guilty" here is a new member I want to add to the context; it
>>> is not related to the AMDGPU_CTX_OP_QUERY_STATE U/K interface. Looks like
>>> I need to unify them so there is only one place to mark guilty or not
>> Right, the AMDGPU_CTX_OP_QUERY_STATE handling needs to be made 
>> consistent with the rest.
>>
>> Cheers,
>> Nicolai
>>
>>
>>>
>>> BR Monk
>>>
>>> -----Original Message-----
>>> From: Haehnle, Nicolai
>>> Sent: Wednesday, October 11, 2017 5:00 PM
>>> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian 
>>> <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; 
>>> Deucher, Alexander <Alexander.Deucher@amd.com>
>>> Cc: amd-gfx@lists.freedesktop.org; Ding, Pixel <Pixel.Ding@amd.com>; 
>>> Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, Bingley 
>>> <Bingley.Li@amd.com>; Ramirez, Alejandro 
>>> <Alejandro.Ramirez@amd.com>; Filipas, Mario <Mario.Filipas@amd.com>
>>> Subject: Re: TDR and VRAM lost handling in KMD:
>>>
>>> On 11.10.2017 10:48, Liu, Monk wrote:
>>>> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), 
>>>> so it's reasonable to use it. However, it /does not/ make sense to 
>>>> mark idle contexts as "guilty" just because VRAM is lost. VRAM lost 
>>>> is a perfect example where the driver should report context lost to 
>>>> applications with the "innocent" flag for contexts that were idle 
>>>> at the time of reset. The only context(s) that should be reported 
>>>> as "guilty"
>>>> (or perhaps "unknown" in some cases) are the ones that were 
>>>> executing at the time of reset.
>>>>
>>>> ML: KMD marks all contexts as guilty because that way we can
>>>> unify our IOCTL behavior: e.g. the IOCTLs only block the “guilty”
>>>> context, no need to worry about vram-lost-counter anymore; that’s an
>>>> implementation style. I don’t think it is related to the UMD layer.
>>>>
>>>> For UMD: the gl-context isn’t known to KMD, so UMD can implement its
>>>> own “guilty” gl-context if you want.
>>> Well, to some extent this is just semantics, but it helps to keep 
>>> the terminology consistent.
>>>
>>> Most importantly, please keep the AMDGPU_CTX_OP_QUERY_STATE uapi in
>>> mind: this returns one of 
>>> AMDGPU_CTX_{GUILTY,INNOCENT,UNKNOWN}_RECENT,
>>> and it must return "innocent" for contexts that are only lost due to 
>>> and it must return "innocent" for contexts that are only lost due to VRAM lost without being otherwise involved in the timeout that led to the reset.
>>> to the reset.
>>>
>>> The point is that in the places where you used "guilty" it would be 
>>> better to use "context lost", and then further differentiate between 
>>> guilty/innocent context lost based on the details of what happened.
>>>
>>>
>>>> If KMD doesn’t mark all ctx as guilty after VRAM lost, can you
>>>> illustrate what rule KMD should obey to check in a KMS IOCTL like
>>>> cs_submit? Let’s see which way is better
>>> if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter))
>>>        return -ECANCELED;
>>>
>>> Plus similar logic for AMDGPU_CTX_OP_QUERY_STATE.
>>>
>>> Yes, it's one additional check in cs_submit. If you're worried about 
>>> that (and Christian's concerns about possible issues with walking 
>>> over all contexts are addressed), I suppose you could just store a 
>>> per-context
>>>
>>>      unsigned context_reset_status;
>>>
>>> instead of a `bool guilty`. Its value would start out as 0
>>> (AMDGPU_CTX_NO_RESET) and would be set to the correct value during 
>>> reset.
>>>
>>> Cheers,
>>> Nicolai
>>>
>>>
>>>> *From:*Haehnle, Nicolai
>>>> *Sent:* Wednesday, October 11, 2017 4:41 PM
>>>> *To:* Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian 
>>>> <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; 
>>>> Deucher, Alexander <Alexander.Deucher@amd.com>
>>>> *Cc:* amd-gfx@lists.freedesktop.org; Ding, Pixel 
>>>> <Pixel.Ding@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, 
>>>> Bingley <Bingley.Li@amd.com>; Ramirez, Alejandro 
>>>> <Alejandro.Ramirez@amd.com>; Filipas, Mario <Mario.Filipas@amd.com>
>>>> *Subject:* Re: TDR and VRAM lost handling in KMD:
>>>>
>>>>    From a Mesa perspective, this almost all sounds reasonable to me.
>>>>
>>>> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), 
>>>> so it's reasonable to use it. However, it /does not/ make sense to 
>>>> mark idle contexts as "guilty" just because VRAM is lost. VRAM lost 
>>>> is a perfect example where the driver should report context lost to 
>>>> applications with the "innocent" flag for contexts that were idle 
>>>> at the time of reset. The only context(s) that should be reported 
>>>> as "guilty"
>>>> (or perhaps "unknown" in some cases) are the ones that were 
>>>> executing at the time of reset.
>>>>
>>>> On whether the whole context is marked as guilty from a user space 
>>>> perspective, it would simply be nice for user space to get 
>>>> consistent answers. It would be a bit odd if we could e.g. succeed 
>>>> in submitting an SDMA job after a GFX job was rejected. This would 
>>>> point in favor of marking the entire context as guilty (although 
>>>> that could happen lazily instead of at reset time). On the other 
>>>> hand, if that's too big a burden for the kernel implementation I'm 
>>>> sure we can live without it.
>>>>
>>>> Cheers,
>>>>
>>>> Nicolai
>>>>
>>>> -------------------------------------------------------------------
>>>> ---
>>>> --
>>>>
>>>> *From:*Liu, Monk
>>>> *Sent:* Wednesday, October 11, 2017 10:15:40 AM
>>>> *To:* Koenig, Christian; Haehnle, Nicolai; Olsak, Marek; Deucher, 
>>>> Alexander
>>>> *Cc:* amd-gfx@lists.freedesktop.org 
>>>> <mailto:amd-gfx@lists.freedesktop.org>; Ding, Pixel; Jiang, Jerry 
>>>> (SW); Li, Bingley; Ramirez, Alejandro; Filipas, Mario
>>>> *Subject:* RE: TDR and VRAM lost handling in KMD:
>>>>
>>>> 2. Set its fence error status to “*ETIME*”,
>>>>
>>>> No, as I already explained ETIME is for synchronous operation.
>>>>
>>>> In other words when we return ETIME from the wait IOCTL it would 
>>>> mean that the waiting has somehow timed out, but not the job we waited for.
>>>>
>>>> Please use ECANCELED as well or some other error code when we find
>>>> that we need to distinguish the timed-out job from the canceled ones
>>>> (probably a good idea, but I'm not sure).
>>>>
>>>> [ML] I’m okay if you insist not to use ETIME
>>>>
>>>> 3. Find the entity/ctx behind this job, and set this ctx as “*guilty*”
>>>>
>>>> Not sure. Do we want to set the whole context as guilty or just the 
>>>> entity?
>>>>
>>>> Setting the whole context as guilty sounds racy to me.
>>>>
>>>> BTW: We should use a different name than "guilty", maybe just "bool 
>>>> canceled;" ?
>>>>
>>>> [ML] I think context is better than entity, because for example if
>>>> you only block entity_0 of a context and allow entity_N to run, the
>>>> dependencies between entities are broken (e.g. page table
>>>> updates in
>>>>
>>>> the SDMA entity pass but the gfx submit in the GFX entity is blocked,
>>>> which makes no sense to me)
>>>>
>>>> We’d better block the whole context or not at all…
>>>>
>>>> 5. Kick out all jobs in this “guilty” ctx’s KFIFO queue, and set all
>>>> their fence status to “*ECANCELED*”
>>>>
>>>> Setting ECANCELED should be ok. But I think we should do this when 
>>>> we try to run the jobs and not during GPU reset.
>>>>
>>>> [ML] without deep thought and experiment, I’m not sure of the
>>>> difference between them, but kicking it out in the gpu_reset routine is
>>>> more efficient.
>>>>
>>>> Otherwise you need to check the context/entity guilty flag in the run_job
>>>> routine … and you need to do it for every context/entity; I don’t see
>>>> why
>>>>
>>>> we don’t just kick out all of them in the gpu_reset stage ….
>>>>
>>>> b) Iterate over all living ctx, and set all ctx as “*guilty*” since
>>>> VRAM lost actually ruins all VRAM contents
>>>>
>>>> No, that shouldn't be done by comparing the counters. Iterating
>>>> over all contexts is way too much overhead.
>>>>
>>>> [ML] because I want to make the KMS IOCTL rules clean: they don’t
>>>> need to differentiate VRAM lost or not, they are only interested in
>>>> whether the context is guilty or not, and block
>>>>
>>>> submission for guilty ones.
>>>>
>>>> *Can you give more details of your idea? And better, the detailed
>>>> implementation in cs_submit; I want to see how you want to block
>>>> submission without checking the context guilty flag*
>>>>
>>>> c) Kick out all jobs in all ctx’s KFIFO queues, and set all their
>>>> fence status to “*ECANCELED*”
>>>>
>>>> Yes and no, that should be done when we try to run the jobs and not
>>>> during GPU reset.
>>>>
>>>> [ML] again, kicking them out in the gpu reset routine is highly
>>>> efficient; otherwise you need a check on every job in run_job()
>>>>
>>>> Besides, can you illustrate the detailed implementation?
>>>>
>>>> Yes and no, dma_fence_get_status() is some specific handling for 
>>>> sync_file debugging (no idea why that made it into the common fence 
>>>> code).
>>>>
>>>> It was replaced by putting the error code directly into the fence, 
>>>> so just reading that one after waiting should be ok.
>>>>
>>>> Maybe we should fix dma_fence_get_status() to do the right thing 
>>>> for this?
>>>>
>>>> [ML] yeah, that’s too confusing; the name really sounds like the one I
>>>> want to use, we should change it…
>>>>
>>>> *But looking into the implementation, I don’t see why we cannot use it?
>>>> It also finally returns the fence->error*
>>>>
>>>> *From:*Koenig, Christian
>>>> *Sent:* Wednesday, October 11, 2017 3:21 PM
>>>> *To:* Liu, Monk <Monk.Liu@amd.com <mailto:Monk.Liu@amd.com>>; 
>>>> Haehnle, Nicolai <Nicolai.Haehnle@amd.com 
>>>> <mailto:Nicolai.Haehnle@amd.com>>;
>>>> Olsak, Marek <Marek.Olsak@amd.com <mailto:Marek.Olsak@amd.com>>; 
>>>> Deucher, Alexander <Alexander.Deucher@amd.com 
>>>> <mailto:Alexander.Deucher@amd.com>>
>>>> *Cc:* amd-gfx@lists.freedesktop.org 
>>>> <mailto:amd-gfx@lists.freedesktop.org>; Ding, Pixel 
>>>> <Pixel.Ding@amd.com <mailto:Pixel.Ding@amd.com>>; Jiang, Jerry (SW) 
>>>> <Jerry.Jiang@amd.com <mailto:Jerry.Jiang@amd.com>>; Li, Bingley 
>>>> <Bingley.Li@amd.com <mailto:Bingley.Li@amd.com>>; Ramirez, 
>>>> Alejandro <Alejandro.Ramirez@amd.com 
>>>> <mailto:Alejandro.Ramirez@amd.com>>;
>>>> Filipas, Mario <Mario.Filipas@amd.com 
>>>> <mailto:Mario.Filipas@amd.com>>
>>>> *Subject:* Re: TDR and VRAM lost handling in KMD:
>>>>
>>>> See inline:
>>>>
>>>> On 11.10.2017 at 07:33, Liu, Monk wrote:
>>>>
>>>>       Hi Christian & Nicolai,
>>>>
>>>>       We need to achieve some agreement on what MESA/UMD should do
>>>>       and what KMD should do, *please give your comments with “okay”
>>>>       or “No” and your idea on the items below,*
>>>>
>>>>       - When a job times out (set from the lockup_timeout kernel
>>>>       parameter), what KMD should do in the TDR routine:
>>>>
>>>>       1. Update adev->*gpu_reset_counter* and stop the scheduler first
>>>>       (*gpu_reset_counter* is used to force a vm flush after GPU reset;
>>>>       out of this thread’s scope so no more discussion on it)
>>>>
>>>> Okay.
>>>>
>>>>       2. Set its fence error status to “*ETIME*”,
>>>>
>>>> No, as I already explained ETIME is for synchronous operation.
>>>>
>>>> In other words when we return ETIME from the wait IOCTL it would 
>>>> mean that the waiting has somehow timed out, but not the job we waited for.
>>>>
>>>> Please use ECANCELED as well or some other error code when we find
>>>> that we need to distinguish the timed-out job from the canceled ones
>>>> (probably a good idea, but I'm not sure).
>>>>
>>>>       3. Find the entity/ctx behind this job, and set this ctx as
>>>> “*guilty*”
>>>>
>>>> Not sure. Do we want to set the whole context as guilty or just the 
>>>> entity?
>>>>
>>>> Setting the whole context as guilty sounds racy to me.
>>>>
>>>> BTW: We should use a different name than "guilty", maybe just "bool 
>>>> canceled;" ?
>>>>
>>>>       4. Kick out this job from the scheduler’s mirror list, so this job
>>>>       won’t get re-scheduled to the ring anymore.
>>>>
>>>> Okay.
>>>>
>>>>       5. Kick out all jobs in this “guilty” ctx’s KFIFO queue, and set
>>>>       all their fence status to “*ECANCELED*”
>>>>
>>>> Setting ECANCELED should be ok. But I think we should do this when 
>>>> we try to run the jobs and not during GPU reset.
>>>>
>>>>       6. Force-signal all fences that got kicked out by the above two
>>>>       steps, *otherwise UMD will block forever if waiting on those
>>>>       fences*
>>>>
>>>> Okay.
>>>>
>>>>       7. Do the gpu reset, which can be some callbacks to let
>>>>       bare-metal and SR-IOV implement it in their favored style
>>>>
>>>> Okay.
>>>>
>>>>       8. After reset, KMD needs to be aware whether VRAM lost happened
>>>>       or not; bare-metal can implement some function to judge, while for
>>>>       SR-IOV I prefer to read it from the GIM side (for the initial
>>>>       version we consider it always VRAM lost, till the GIM side change
>>>>       is aligned)
>>>>
>>>> Okay.
>>>>
>>>>       9. If VRAM lost is not hit, continue; otherwise:
>>>>
>>>>       a)Update adev->*vram_lost_counter*,
>>>>
>>>> Okay.
>>>>
>>>>       b)Iterate over all living ctx, and set all ctx as “*guilty*” 
>>>> since
>>>>       VRAM lost actually ruins all VRAM contents
>>>>
>>>> No, that shouldn't be done by comparing the counters. Iterating
>>>> over all contexts is way too much overhead.
>>>>
>>>>       c) Kick out all jobs in all ctx’s KFIFO queues, and set all their
>>>>       fence status to “*ECANCELED*”
>>>>
>>>> Yes and no, that should be done when we try to run the jobs and not 
>>>> during GPU reset.
>>>>
>>>>       10.Do GTT recovery and VRAM page tables/entries recovery 
>>>> (optional,
>>>>       do we need it ???)
>>>>
>>>> Yes, that is still needed. As Nicolai explained we can't be sure 
>>>> that VRAM is still 100% correct even when it isn't cleared.
>>>>
>>>>       11. Re-schedule all JOBs remaining in the mirror list to the ring
>>>>       again and restart the scheduler (for the VRAM lost case, no JOB
>>>> will be re-scheduled)
>>>>
>>>> Okay.
>>>>
>>>>       - For the cs_wait() IOCTL:
>>>>
>>>>       After it finds the fence signaled, it should check with
>>>>       *“dma_fence_get_status”* to see if there is an error there,
>>>>
>>>>       and return the error status of the fence
>>>>
>>>> Yes and no, dma_fence_get_status() is some specific handling for 
>>>> sync_file debugging (no idea why that made it into the common fence 
>>>> code).
>>>>
>>>> It was replaced by putting the error code directly into the fence, 
>>>> so just reading that one after waiting should be ok.
>>>>
>>>> Maybe we should fix dma_fence_get_status() to do the right thing 
>>>> for this?
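>>>> I.e. roughly the following in the wait IOCTL (a sketch only, assuming
>>>> the fence->error field described above; not the final code):
>>>>
>>>>     r = dma_fence_wait_timeout(fence, true, timeout);
>>>>     if (r > 0 && fence->error)
>>>>             return fence->error; /* e.g. -ECANCELED for canceled jobs */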
>>>>
>>>>       - For the cs_wait_fences() IOCTL:
>>>>
>>>>       Similar to the above approach
>>>>
>>>>       - For the cs_submit() IOCTL:
>>>>
>>>>       It needs to check whether the current ctx has been marked as
>>>> “*guilty*” and return
>>>>       “*ECANCELED*” if so
>>>>
>>>>       - Introduce a new IOCTL to let UMD query *vram_lost_counter*:
>>>>
>>>>       This way, UMD can also block the app from submitting; like @Nicolai
>>>>       mentioned, we can cache one copy of *vram_lost_counter* when
>>>>       enumerating the physical device, and deny all
>>>>
>>>>       gl-contexts from submitting if the queried counter is bigger
>>>> than the
>>>>       one cached in the physical device. (looks a little overkill to
>>>> me, but
>>>>       easy to implement)
>>>>
>>>>       UMD can also return an error to the APP when creating a
>>>> gl-context if the
>>>>       currently queried *vram_lost_counter* is bigger than the one
>>>> cached in the
>>>>       physical device.
>>>>
>>>> Okay. Already have a patch for this, please review that one if you 
>>>> haven't already done so.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>       BTW: I realized that a gl-context is a little different from the
>>>> kernel’s
>>>>       context. For the kernel, a BO is not tied to a context
>>>> but only
>>>>       to an FD, while in UMD a BO has a backing
>>>>
>>>>       gl-context, so blocking submits in the UMD layer is also needed,
>>>> although
>>>>       KMD will do its job as the bottom line
>>>>
>>>>       - Basically “vram_lost_counter” is exposed by the kernel to let
>>>> UMD take
>>>>       control of the robustness extension feature; it will be UMD’s
>>>> call to
>>>>       make, KMD only denies “guilty” contexts from submitting
>>>>
>>>>       Need your feedback, thx
>>>>
>>>>       We’d better get the TDR feature landed ASAP
>>>>
>>>>       BR Monk
>>>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: TDR and VRAM lost handling in KMD:
       [not found]                                             ` <BLUPR12MB04490BE33EC2E851228E25F4844A0-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2017-10-11 13:39                                               ` Christian König
       [not found]                                                 ` <d9274bc6-27e8-f6c4-0851-4240bde72452-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Christian König @ 2017-10-11 13:39 UTC (permalink / raw)
  To: Liu, Monk, Zhou, David(ChunMing),
	Haehnle, Nicolai, Olsak, Marek, Deucher, Alexander
  Cc: Ramirez, Alejandro, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	Filipas, Mario, Ding, Pixel, Li, Bingley, Jiang, Jerry (SW)

Some jobs don't have a context (VM updates, clears, buffer moves).

I would still like to abort those when they were issued before VRAM
content was lost, but keep the entity usable.

So I think we should just keep a copy of the VRAM lost counter in the
job. That also relieves us of the burden of figuring out the context
during job run.
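Roughly like this (a sketch only; the job member and the exact fence
plumbing are illustrative, not the actual patch):

    /* at job creation: snapshot the counter */
    job->vram_lost_counter = atomic_read(&adev->vram_lost_counter);

    /* in the run_job() callback: cancel instead of running */
    if (job->vram_lost_counter != atomic_read(&adev->vram_lost_counter)) {
            dma_fence_set_error(&job->base.s_fence->finished, -ECANCELED);
            return NULL; /* the scheduler still signals the finished fence */
    }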

Regards,
Christian.

On 11.10.2017 at 15:35, Liu, Monk wrote:
> I think just comparing the copy in the context/entity with the current counter is enough; I don't see how it's better to keep another copy in the JOB
>
>
> -----Original Message-----
> From: Koenig, Christian
> Sent: Wednesday, October 11, 2017 6:40 PM
> To: Zhou, David(ChunMing) <David1.Zhou@amd.com>; Liu, Monk <Monk.Liu@amd.com>; Haehnle, Nicolai <Nicolai.Haehnle@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>
> Cc: Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; amd-gfx@lists.freedesktop.org; Filipas, Mario <Mario.Filipas@amd.com>; Ding, Pixel <Pixel.Ding@amd.com>; Li, Bingley <Bingley.Li@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>
> Subject: Re: TDR and VRAM lost handling in KMD:
>
> I've already posted a patch for this on the mailing list.
>
> Basically we just copy the vram lost counter into the job and when we try to run the job we can mark it as canceled.
>
> Regards,
> Christian.
>
On 11.10.2017 at 12:14, Chunming Zhou wrote:
>> Your summary is missing the issue below:
>>
>> What about jobs already pushed into the scheduler queue when vram is lost?
>>
>>
>> Regards,
>> David Zhou
>> On October 11, 2017, 17:41, Liu, Monk wrote:
>>> Okay, let me summarize our whole idea and see if it works:
>>>
>>> 1, For cs_submit, always check vram_lost_counter first and reject the
>>> submit (return -ECANCELED to UMD) if ctx->vram_lost_counter !=
>>> adev->vram_lost_counter. That way the vram lost issue can be handled
>>>
>>> 2, For cs_submit we still need to check whether the incoming context
>>> is "AMDGPU_CTX_GUILTY_RESET", even if we found ctx->vram_lost_counter ==
>>> adev->vram_lost_counter, and we can reject the submit
>>> if it is "AMDGPU_CTX_GUILTY_RESET", correct?
>>>
>>> 3, in the gpu_reset() routine, we only mark the hang job's entity as
>>> guilty (so we need to add a new member to the entity structure), and not
>>> kick it out in the gpu_reset() stage, but we need to set the context
>>> behind this entity to "AMDGPU_CTX_GUILTY_RESET"
>>>     And if the reset introduces VRAM LOST, we just update
>>> adev->vram_lost_counter, but *don't* change all entities to guilty, so
>>> still only the hang job's entity is "guilty"
>>>     After some entity is marked as "guilty", we find a way to set the
>>> context behind it to AMDGPU_CTX_GUILTY_RESET; because this is the U/K
>>> interface, we need to let UMD know that this context is broken.
>>>
>>> 4, in the gpu scheduler's run_job() routine, since it only reads the
>>> entity, we skip job scheduling once the entity is found to be "guilty"
>>> (rough sketch below)
>>>
>>>
>>> Does the above sound good?
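>>> A hedged sketch of step 4, assuming the new "guilty" entity member
>>> from step 3 (names are illustrative, not final):
>>>
>>>     /* in the scheduler main loop, before handing the job to the ring */
>>>     if (sched_job->s_entity->guilty) {
>>>             dma_fence_set_error(&sched_job->s_fence->finished, -ECANCELED);
>>>             amd_sched_fence_finished(sched_job->s_fence);
>>>             continue; /* canceled: never touches the ring */
>>>     }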
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Haehnle, Nicolai
>>> Sent: Wednesday, October 11, 2017 5:26 PM
>>> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian
>>> <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>;
>>> Deucher, Alexander <Alexander.Deucher@amd.com>
>>> Cc: amd-gfx@lists.freedesktop.org; Ding, Pixel <Pixel.Ding@amd.com>;
>>> Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, Bingley
>>> <Bingley.Li@amd.com>; Ramirez, Alejandro <Alejandro.Ramirez@amd.com>;
>>> Filipas, Mario <Mario.Filipas@amd.com>
>>> Subject: Re: TDR and VRAM lost handling in KMD:
>>>
>>> On 11.10.2017 11:18, Liu, Monk wrote:
>>>> Let's keep it simple: when vram lost hits, what's the action for
>>>> amdgpu_ctx_query()/AMDGPU_CTX_OP_QUERY_STATE on other contexts (not
>>>> the one that triggered the gpu hang) after vram lost? Do you mean we
>>>> return -ENODEV to UMD?
>>> It should successfully return AMDGPU_CTX_INNOCENT_RESET.
>>>
>>>
>>>> In cs_submit, with vram lost hit, if we don't mark all contexts as
>>>> "guilty", how do we block them from submitting? Can you show a
>>>> possible implementation?
>>> if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter))
>>>        return -ECANCELED;
>>>
>>> (where ctx->vram_lost_counter is initialized at context creation time
>>> and never changed afterwards)
>>>
>>>
>>>> BTW: the "guilty" here is a new member I want to add to the context;
>>>> it is not related to the AMDGPU_CTX_OP_QUERY_STATE U/K interface.
>>>> Looks like I need to unify them, with only one place to mark guilty
>>>> or not
>>> Right, the AMDGPU_CTX_OP_QUERY_STATE handling needs to be made
>>> consistent with the rest.
>>>
>>> Cheers,
>>> Nicolai
>>>
>>>
>>>> BR Monk
>>>>
>>>> -----Original Message-----
>>>> From: Haehnle, Nicolai
>>>> Sent: Wednesday, October 11, 2017 5:00 PM
>>>> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian
>>>> <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>;
>>>> Deucher, Alexander <Alexander.Deucher@amd.com>
>>>> Cc: amd-gfx@lists.freedesktop.org; Ding, Pixel <Pixel.Ding@amd.com>;
>>>> Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, Bingley
>>>> <Bingley.Li@amd.com>; Ramirez, Alejandro
>>>> <Alejandro.Ramirez@amd.com>; Filipas, Mario <Mario.Filipas@amd.com>
>>>> Subject: Re: TDR and VRAM lost handling in KMD:
>>>>
>>>> On 11.10.2017 10:48, Liu, Monk wrote:
>>>>> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL),
>>>>> so it's reasonable to use it. However, it /does not/ make sense to
>>>>> mark idle contexts as "guilty" just because VRAM is lost. VRAM lost
>>>>> is a perfect example where the driver should report context lost to
>>>>> applications with the "innocent" flag for contexts that were idle
>>>>> at the time of reset. The only context(s) that should be reported
>>>>> as "guilty"
>>>>> (or perhaps "unknown" in some cases) are the ones that were
>>>>> executing at the time of reset.
>>>>>
>>>>> ML: KMD marks all contexts as guilty because that way we can
>>>>> unify our IOCTL behavior: e.g. the IOCTLs only block a “guilty” context,
>>>>> no need to worry about vram_lost_counter anymore; that’s an
>>>>> implementation style. I don’t think it is related to the UMD layer.
>>>>>
>>>>> For UMD, the gl-context isn’t known by KMD, so UMD can implement
>>>>> its own “guilty” gl-context if it wants.
>>>> Well, to some extent this is just semantics, but it helps to keep
>>>> the terminology consistent.
>>>>
>>>> Most importantly, please keep the AMDGPU_CTX_OP_QUERY_STATE uapi in
>>>> mind: this returns one of
>>>> AMDGPU_CTX_{GUILTY,INNOCENT,UNKNOWN}_RECENT,
>>>> and it must return "innocent" for contexts that are only lost due to
>>>> VRAM lost without being otherwise involved in the timeout that led
>>>> to the reset.
>>>>
>>>> The point is that in the places where you used "guilty" it would be
>>>> better to use "context lost", and then further differentiate between
>>>> guilty/innocent context lost based on the details of what happened.
>>>>
>>>>
>>>>> If KMD doesn’t mark all ctx as guilty after VRAM lost, can you
>>>>> illustrate what rule KMD should obey to check in a KMS IOCTL like
>>>>> cs_submit? Let’s see which way is better
>>>> if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter))
>>>>         return -ECANCELED;
>>>>
>>>> Plus similar logic for AMDGPU_CTX_OP_QUERY_STATE.
>>>>
>>>> Yes, it's one additional check in cs_submit. If you're worried about
>>>> that (and Christian's concerns about possible issues with walking
>>>> over all contexts are addressed), I suppose you could just store a
>>>> per-context
>>>>
>>>>       unsigned context_reset_status;
>>>>
>>>> instead of a `bool guilty`. Its value would start out as 0
>>>> (AMDGPU_CTX_NO_RESET) and would be set to the correct value during
>>>> reset.
>>>>
>>>> Cheers,
>>>> Nicolai
>>>>
>>>>
>>>>> *From:*Haehnle, Nicolai
>>>>> *Sent:* Wednesday, October 11, 2017 4:41 PM
>>>>> *To:* Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian
>>>>> <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>;
>>>>> Deucher, Alexander <Alexander.Deucher@amd.com>
>>>>> *Cc:* amd-gfx@lists.freedesktop.org; Ding, Pixel
>>>>> <Pixel.Ding@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li,
>>>>> Bingley <Bingley.Li@amd.com>; Ramirez, Alejandro
>>>>> <Alejandro.Ramirez@amd.com>; Filipas, Mario <Mario.Filipas@amd.com>
>>>>> *Subject:* Re: TDR and VRAM lost handling in KMD:
>>>>>
>>>>>     From a Mesa perspective, this almost all sounds reasonable to me.
>>>>>
>>>>> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL),
>>>>> so it's reasonable to use it. However, it /does not/ make sense to
>>>>> mark idle contexts as "guilty" just because VRAM is lost. VRAM lost
>>>>> is a perfect example where the driver should report context lost to
>>>>> applications with the "innocent" flag for contexts that were idle
>>>>> at the time of reset. The only context(s) that should be reported
>>>>> as "guilty"
>>>>> (or perhaps "unknown" in some cases) are the ones that were
>>>>> executing at the time of reset.
>>>>>
>>>>> On whether the whole context is marked as guilty from a user space
>>>>> perspective, it would simply be nice for user space to get
>>>>> consistent answers. It would be a bit odd if we could e.g. succeed
>>>>> in submitting an SDMA job after a GFX job was rejected. This would
>>>>> point in favor of marking the entire context as guilty (although
>>>>> that could happen lazily instead of at reset time). On the other
>>>>> hand, if that's too big a burden for the kernel implementation I'm
>>>>> sure we can live without it.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Nicolai
>>>>>
>>>>> ------------------------------------------------------------------------
>>>>>
>>>>> *From:*Liu, Monk
>>>>> *Sent:* Wednesday, October 11, 2017 10:15:40 AM
>>>>> *To:* Koenig, Christian; Haehnle, Nicolai; Olsak, Marek; Deucher,
>>>>> Alexander
>>>>> *Cc:* amd-gfx@lists.freedesktop.org
>>>>> <mailto:amd-gfx@lists.freedesktop.org>; Ding, Pixel; Jiang, Jerry
>>>>> (SW); Li, Bingley; Ramirez, Alejandro; Filipas, Mario
>>>>> *Subject:* RE: TDR and VRAM lost handling in KMD:
>>>>>
>>>>> 2. Set its fence error status to “*ETIME*”,
>>>>>
>>>>> No, as I already explained ETIME is for synchronous operation.
>>>>>
>>>>> In other words when we return ETIME from the wait IOCTL it would
>>>>> mean that the waiting has somehow timed out, but not the job we waited for.
>>>>>
>>>>> Please use ECANCELED as well or some other error code when we find
>>>>> that we need to distinct the timedout job from the canceled ones
>>>>> (probably a good idea, but I'm not sure).
>>>>>
>>>>> [ML] I’m okay if you insist not to use ETIME
>>>>>
>>>>> 3. Find the entity/ctx behind this job, and set this ctx as “*guilty*”
>>>>>
>>>>> Not sure. Do we want to set the whole context as guilty or just the
>>>>> entity?
>>>>>
>>>>> Setting the whole contexts as guilty sounds racy to me.
>>>>>
>>>>> BTW: We should use a different name than "guilty", maybe just "bool
>>>>> canceled;" ?
>>>>>
>>>>> [ML] I think context is better than entity, because for example if
>>>>> you only block entity_0 of a context and allow entity_N to run, that
>>>>> means the dependencies between entities are broken (e.g. page table
>>>>> updates in
>>>>>
>>>>> the SDMA entity pass but the gfx submit in the GFX entity is blocked;
>>>>> that makes no sense to me)
>>>>>
>>>>> We’d better either block the whole context or not at all…
>>>>>
>>>>> 5. Kick out all jobs in this “guilty” ctx’s KFIFO queue, and set all
>>>>> their fence status to “*ECANCELED*”
>>>>>
>>>>> Setting ECANCELED should be ok. But I think we should do this when
>>>>> we try to run the jobs and not during GPU reset.
>>>>>
>>>>> [ML] without deep thought and experiment, I’m not sure of the
>>>>> difference between them, but kicking it out in the gpu_reset routine
>>>>> is more efficient,
>>>>>
>>>>> Otherwise you need to check the context/entity guilty flag in the
>>>>> run_job routine … and you need to do it for every context/entity; I
>>>>> don’t see why
>>>>>
>>>>> we shouldn’t just kick all of them out in the gpu_reset stage ….
>>>>>
>>>>> b) Iterate over all living ctx, and set all ctx as “*guilty*” since
>>>>> VRAM lost actually ruins all VRAM contents
>>>>>
>>>>> No, that shouldn't be done by comparing the counters. Iterating
>>>>> over all contexts is way to much overhead.
>>>>>
>>>>> [ML] because I want to keep the KMS IOCTL rules clean: they don’t
>>>>> need to differentiate whether VRAM was lost or not, they are only
>>>>> interested in whether the context is guilty or not, and block
>>>>>
>>>>> submits for guilty ones.
>>>>>
>>>>> *Can you give more details of your idea? And better, the detailed
>>>>> implementation in cs_submit; I want to see how you want to block
>>>>> submits without checking the context guilty flag*
>>>>>
>>>>> c) Kick out all jobs in all ctx’s KFIFO queues, and set all their
>>>>> fence status to “*ECANCELED*”
>>>>>
>>>>> Yes and no, that should be done when we try to run the jobs and not
>>>>> during GPU reset.
>>>>>
>>>>> [ML] again, kicking them out in the gpu reset routine is highly
>>>>> efficient; otherwise you need to check every job in run_job()
>>>>>
>>>>> Besides, can you illustrate the detailed implementation?
>>>>>
>>>>> Yes and no, dma_fence_get_status() is some specific handling for
>>>>> sync_file debugging (no idea why that made it into the common fence
>>>>> code).
>>>>>
>>>>> It was replaced by putting the error code directly into the fence,
>>>>> so just reading that one after waiting should be ok.
>>>>>
>>>>> Maybe we should fix dma_fence_get_status() to do the right thing
>>>>> for this?
>>>>>
>>>>> [ML] yeah, that’s too confusing; the name really sounds like the one
>>>>> I want to use, we should change it…
>>>>>
>>>>> *But looking into the implementation, I don’t see why we cannot use
>>>>> it? It also finally returns the fence->error*
>>>>>
>>>>> [quoted text trimmed; Christian’s point-by-point reply is quoted in full earlier in this thread]
>>>>>
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx@lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: TDR and VRAM lost handling in KMD:
       [not found]                                                 ` <d9274bc6-27e8-f6c4-0851-4240bde72452-5C7GfCeVMHo@public.gmane.org>
@ 2017-10-11 13:51                                                   ` Liu, Monk
       [not found]                                                     ` <BLUPR12MB044961DEE4326E94A156ED05844A0-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Liu, Monk @ 2017-10-11 13:51 UTC (permalink / raw)
  To: Koenig, Christian, Zhou, David(ChunMing),
	Haehnle, Nicolai, Olsak, Marek, Deucher, Alexander
  Cc: Ramirez, Alejandro, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	Filipas, Mario, Ding, Pixel, Li, Bingley, Jiang, Jerry (SW)

> Some jobs don't have a context (VM updates, clears, buffer moves).

What? I remember even the VM update job comes with a kernel entity (true, no context), and if the entity can keep a counter copy,
that can solve your concerns.
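Roughly, the entity variant would look like this (a sketch only; the
vram_lost_counter member on the entity is a proposal, not existing code):

    /* amd_sched_entity_init(): snapshot the counter at entity creation */
    entity->vram_lost_counter = atomic_read(&adev->vram_lost_counter);

    /* on submission: reject work queued behind a stale entity */
    if (entity->vram_lost_counter != atomic_read(&adev->vram_lost_counter))
            return -ECANCELED;

The catch is kernel entities: they live as long as the device, so after a
recovery the kernel itself would have to refresh that snapshot.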



-----Original Message-----
From: Koenig, Christian 
Sent: Wednesday, October 11, 2017 9:39 PM
To: Liu, Monk <Monk.Liu@amd.com>; Zhou, David(ChunMing) <David1.Zhou@amd.com>; Haehnle, Nicolai <Nicolai.Haehnle@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>
Cc: Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; amd-gfx@lists.freedesktop.org; Filipas, Mario <Mario.Filipas@amd.com>; Ding, Pixel <Pixel.Ding@amd.com>; Li, Bingley <Bingley.Li@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>
Subject: Re: TDR and VRAM lost handling in KMD:

Some jobs don't have a context (VM updates, clears, buffer moves).

I would still like to abort those when they were issued before VRAM content was lost, but keep the entity usable.

So I think we should just keep a copy of the VRAM lost counter in the job. That also relieves us of the burden of figuring out the context during job run.

Regards,
Christian.

[earlier messages in this thread trimmed; they are quoted in full above]


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: TDR and VRAM lost handling in KMD:
       [not found]                                                     ` <BLUPR12MB044961DEE4326E94A156ED05844A0-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2017-10-11 13:59                                                       ` Liu, Monk
       [not found]                                                         ` <BLUPR12MB04497EDD5AE48484E7A18C2F844A0-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  2017-10-11 14:03                                                       ` Christian König
  1 sibling, 1 reply; 23+ messages in thread
From: Liu, Monk @ 2017-10-11 13:59 UTC (permalink / raw)
  To: Liu, Monk, Koenig, Christian, Zhou, David(ChunMing),
	Haehnle, Nicolai, Olsak, Marek, Deucher, Alexander
  Cc: Ramirez, Alejandro, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	Filipas, Mario, Ding, Pixel, Li, Bingley, Jiang, Jerry (SW)

But if we keep the counter in the entity, there is one issue I suddenly thought of:

For a regular user context, after vram lost UMD will be aware that the context is LOST since we have a counter copy in the context, so user space can close it and re-create one.
But for a kernel entity, since there is no U/K interface, it is the kernel's responsibility to recover that kernel entity to a working state, and that makes things complicated ....

Hmm, I agree that keeping a copy in the context and in the job is a good move
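To make that concrete, the two copies would be checked roughly like this
(a sketch; ctx->vram_lost_counter and job->vram_lost_counter are the
proposed members, not existing code):

    /* cs_submit: user contexts are rejected up front */
    if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter))
            return -ECANCELED;

    /* run_job: every job (including kernel ones) carries its own snapshot */
    if (job->vram_lost_counter != atomic_read(&adev->vram_lost_counter))
            dma_fence_set_error(&job->base.s_fence->finished, -ECANCELED);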

BR Monk

-----Original Message-----
From: amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] On Behalf Of Liu, Monk
Sent: Wednesday, October 11, 2017 9:51 PM
To: Koenig, Christian <Christian.Koenig@amd.com>; Zhou, David(ChunMing) <David1.Zhou@amd.com>; Haehnle, Nicolai <Nicolai.Haehnle@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>
Cc: Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; amd-gfx@lists.freedesktop.org; Filipas, Mario <Mario.Filipas@amd.com>; Ding, Pixel <Pixel.Ding@amd.com>; Li, Bingley <Bingley.Li@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>
Subject: RE: TDR and VRAM lost handling in KMD:

> Some jobs don't have a context (VM updates, clears, buffer moves).

What? I remember even the VM update job is with a kernel entity, (no context is true), and if entity can keep a counter copy That can solve your concerns 



-----Original Message-----
From: Koenig, Christian
Sent: Wednesday, October 11, 2017 9:39 PM
To: Liu, Monk <Monk.Liu@amd.com>; Zhou, David(ChunMing) <David1.Zhou@amd.com>; Haehnle, Nicolai <Nicolai.Haehnle@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>
Cc: Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; amd-gfx@lists.freedesktop.org; Filipas, Mario <Mario.Filipas@amd.com>; Ding, Pixel <Pixel.Ding@amd.com>; Li, Bingley <Bingley.Li@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>
Subject: Re: TDR and VRAM lost handling in KMD:

Some jobs don't have a context (VM updates, clears, buffer moves).

I would still like to abort those when they where issued before a losing VRAM content, but keep the entity usable.

So I think we should just keep a copy of the VRAM lost counter in the job. That also removes us from the burden of figuring out the context during job run.

Regards,
Christian.

Am 11.10.2017 um 15:35 schrieb Liu, Monk:
> I think just compare the copy from context/entity with current counter 
> is enough, don't see how it's better to keep another copy in JOB
>
>
> -----Original Message-----
> From: Koenig, Christian
> Sent: Wednesday, October 11, 2017 6:40 PM
> To: Zhou, David(ChunMing) <David1.Zhou@amd.com>; Liu, Monk 
> <Monk.Liu@amd.com>; Haehnle, Nicolai <Nicolai.Haehnle@amd.com>; Olsak, 
> Marek <Marek.Olsak@amd.com>; Deucher, Alexander 
> <Alexander.Deucher@amd.com>
> Cc: Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; 
> amd-gfx@lists.freedesktop.org; Filipas, Mario <Mario.Filipas@amd.com>; 
> Ding, Pixel <Pixel.Ding@amd.com>; Li, Bingley <Bingley.Li@amd.com>; 
> Jiang, Jerry (SW) <Jerry.Jiang@amd.com>
> Subject: Re: TDR and VRAM lost handling in KMD:
>
> I've already posted a patch for this on the mailing list.
>
> Basically we just copy the vram lost counter into the job and when we try to run the job we can mark it as canceled.
>
> Regards,
> Christian.
>
> Am 11.10.2017 um 12:14 schrieb Chunming Zhou:
>> Your summary lacks the below issue:
>>
>> How about the job already pushed in scheduler queue when vram is lost?
>>
>>
>> Regards,
>> David Zhou
>> On 2017年10月11日 17:41, Liu, Monk wrote:
>>> Okay, let me summary our whole idea together and see if it works:
>>>
>>> 1, For cs_submit, always check vram-lost_counter first and reject 
>>> the submit (return -ECANCLED to UMD) if ctx->vram_lost_counter !=
>>> adev->vram_lost_counter. That way the vram lost issue can be handled
>>>
>>> 2, for cs_submit we still need to check if the incoming context is 
>>> "AMDGPU_CTX_GUILTY_RESET" or not even if we found
>>> ctx->vram_lost_counter == adev->vram_lost_counter, and we can reject
>>> the submit
>>> If it is "AMDGPU_CTX_GUILTY_RESET", correct ?
>>>
>>> 3, in gpu_reset() routine, we only mark the hang job's entity as 
>>> guilty (so we need to add new member in entity structure), and not 
>>> kick it out in gpu_reset() stage, but we need to set the context 
>>> behind this entity as " AMDGPU_CTX_GUILTY_RESET"
>>>     And if reset introduces VRAM LOST, we just update
>>> adev->vram_lost_counter, but *don't* change all entity to guilty, so
>>> still only the hang job's entity is "guilty"
>>>     After some entity marked as "guilty", we find a way to set the 
>>> context behind it as AMDGPU_CTX_GUILTY_RESET, because this is U/K 
>>> interface, we need let UMD can know that this context is wrong.
>>>
>>> 4, in gpu scheduler's run_job() routine, since it only reads entity, 
>>> so we skip job scheduling once found the entity is "guilty"
>>>
>>>
>>> Does above sounds good ?
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Haehnle, Nicolai
>>> Sent: Wednesday, October 11, 2017 5:26 PM
>>> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian 
>>> <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; 
>>> Deucher, Alexander <Alexander.Deucher@amd.com>
>>> Cc: amd-gfx@lists.freedesktop.org; Ding, Pixel <Pixel.Ding@amd.com>; 
>>> Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, Bingley 
>>> <Bingley.Li@amd.com>; Ramirez, Alejandro 
>>> <Alejandro.Ramirez@amd.com>; Filipas, Mario <Mario.Filipas@amd.com>
>>> Subject: Re: TDR and VRAM lost handling in KMD:
>>>
>>> On 11.10.2017 11:18, Liu, Monk wrote:
>>>> Let's talk it simple, When vram lost hit,  what's the action for 
>>>> amdgpu_ctx_query()/AMDGPU_CTX_OP_QUERY_STATE on other contexts (not 
>>>> the one trigger gpu hang) after vram lost ? do you mean we return 
>>>> -ENODEV to UMD ?
>>> It should successfully return AMDGPU_CTX_INNOCENT_RESET.
>>>
>>>
>>>> In cs_submit, with vram lost hit, if we don't mark all contexts as 
>>>> "guilty", how we block its from submitting ? can you show some 
>>>> implement way
>>> if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter))
>>>        return -ECANCELED;
>>>
>>> (where ctx->vram_lost_counter is initialized at context creation 
>>> time and never changed afterwards)
>>>
>>>
>>>> BTW: the "guilty" here is a new member I want to add to context, it 
>>>> is not related with AMDGPU_CTX_OP_QUERY_STATE UK interface, Looks I 
>>>> need to unify them and only one place to mark guilty or not
>>> Right, the AMDGPU_CTX_OP_QUERY_STATE handling needs to be made 
>>> consistent with the rest.
>>>
>>> Cheers,
>>> Nicolai
>>>
>>>
>>>> BR Monk
>>>>
>>>> -----Original Message-----
>>>> From: Haehnle, Nicolai
>>>> Sent: Wednesday, October 11, 2017 5:00 PM
>>>> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian 
>>>> <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; 
>>>> Deucher, Alexander <Alexander.Deucher@amd.com>
>>>> Cc: amd-gfx@lists.freedesktop.org; Ding, Pixel 
>>>> <Pixel.Ding@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, 
>>>> Bingley <Bingley.Li@amd.com>; Ramirez, Alejandro 
>>>> <Alejandro.Ramirez@amd.com>; Filipas, Mario <Mario.Filipas@amd.com>
>>>> Subject: Re: TDR and VRAM lost handling in KMD:
>>>>
>>>> On 11.10.2017 10:48, Liu, Monk wrote:
>>>>> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), 
>>>>> so it's reasonable to use it. However, it /does not/ make sense to 
>>>>> mark idle contexts as "guilty" just because VRAM is lost. VRAM 
>>>>> lost is a perfect example where the driver should report context 
>>>>> lost to applications with the "innocent" flag for contexts that 
>>>>> were idle at the time of reset. The only context(s) that should be 
>>>>> reported as "guilty"
>>>>> (or perhaps "unknown" in some cases) are the ones that were 
>>>>> executing at the time of reset.
>>>>>
>>>>> ML: KMD marks all contexts as guilty because that way we can 
>>>>> unify our IOCTL behavior: e.g. the IOCTL only blocks a 
>>>>> “guilty” context, no need to worry about the vram-lost-counter 
>>>>> anymore; that’s an implementation style. I don’t think it is 
>>>>> related to the UMD layer.
>>>>>
>>>>> The gl-context isn’t something KMD is aware of, so UMD can implement 
>>>>> its own “guilty” gl-context if you want.
>>>> Well, to some extent this is just semantics, but it helps to keep 
>>>> the terminology consistent.
>>>>
>>>> Most importantly, please keep the AMDGPU_CTX_OP_QUERY_STATE uapi in
>>>> mind: this returns one of
>>>> AMDGPU_CTX_{GUILTY,INNOCENT,UNKNOWN}_RESET,
>>>> and it must return "innocent" for contexts that are only lost due 
>>>> to VRAM lost without being otherwise involved in the timeout that 
>>>> led to the reset.
>>>>
>>>> The point is that in the places where you used "guilty" it would be 
>>>> better to use "context lost", and then further differentiate 
>>>> between guilty/innocent context lost based on the details of what happened.
>>>>
>>>>
>>>>> If KMD doesn’t mark all ctx as guilty after VRAM lost, can you 
>>>>> illustrate what rule KMD should obey to check in a KMS IOCTL like 
>>>>> cs_submit ?? Let’s see which way is better.
>>>> if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter))
>>>>         return -ECANCELED;
>>>>
>>>> Plus similar logic for AMDGPU_CTX_OP_QUERY_STATE.
>>>>
>>>> Yes, it's one additional check in cs_submit. If you're worried 
>>>> about that (and Christian's concerns about possible issues with 
>>>> walking over all contexts are addressed), I suppose you could just 
>>>> store a per-context
>>>>
>>>>       unsigned context_reset_status;
>>>>
>>>> instead of a `bool guilty`. Its value would start out as 0
>>>> (AMDGPU_CTX_NO_RESET) and would be set to the correct value during 
>>>> reset.
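
(A sketch of Nicolai's suggestion; the AMDGPU_CTX_*_RESET values are existing
uapi constants, while the marking helper and where it is called from are
assumptions:)

    struct amdgpu_ctx {
            /* ... existing members ... */
            unsigned int reset_status;      /* starts as AMDGPU_CTX_NO_RESET */
    };

    /* Hypothetical helper used while handling a reset. */
    static void amdgpu_ctx_mark_reset(struct amdgpu_ctx *ctx, bool guilty)
    {
            ctx->reset_status = guilty ? AMDGPU_CTX_GUILTY_RESET
                                       : AMDGPU_CTX_INNOCENT_RESET;
    }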
>>>>
>>>> Cheers,
>>>> Nicolai
>>>>
>>>>
>>>>> *From:*Haehnle, Nicolai
>>>>> *Sent:* Wednesday, October 11, 2017 4:41 PM
>>>>> *To:* Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian 
>>>>> <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; 
>>>>> Deucher, Alexander <Alexander.Deucher@amd.com>
>>>>> *Cc:* amd-gfx@lists.freedesktop.org; Ding, Pixel 
>>>>> <Pixel.Ding@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, 
>>>>> Bingley <Bingley.Li@amd.com>; Ramirez, Alejandro 
>>>>> <Alejandro.Ramirez@amd.com>; Filipas, Mario 
>>>>> <Mario.Filipas@amd.com>
>>>>> *Subject:* Re: TDR and VRAM lost handling in KMD:
>>>>>
>>>>>     From a Mesa perspective, this almost all sounds reasonable to me.
>>>>>
>>>>> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), 
>>>>> so it's reasonable to use it. However, it /does not/ make sense to 
>>>>> mark idle contexts as "guilty" just because VRAM is lost. VRAM 
>>>>> lost is a perfect example where the driver should report context 
>>>>> lost to applications with the "innocent" flag for contexts that 
>>>>> were idle at the time of reset. The only context(s) that should be 
>>>>> reported as "guilty"
>>>>> (or perhaps "unknown" in some cases) are the ones that were 
>>>>> executing at the time of reset.
>>>>>
>>>>> On whether the whole context is marked as guilty from a user space 
>>>>> perspective, it would simply be nice for user space to get 
>>>>> consistent answers. It would be a bit odd if we could e.g. succeed 
>>>>> in submitting an SDMA job after a GFX job was rejected. This would 
>>>>> point in favor of marking the entire context as guilty (although 
>>>>> that could happen lazily instead of at reset time). On the other 
>>>>> hand, if that's too big a burden for the kernel implementation I'm 
>>>>> sure we can live without it.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Nicolai
>>>>>
>>>>> ------------------------------------------------------------------------
>>>>>
>>>>> *From:*Liu, Monk
>>>>> *Sent:* Wednesday, October 11, 2017 10:15:40 AM
>>>>> *To:* Koenig, Christian; Haehnle, Nicolai; Olsak, Marek; Deucher, 
>>>>> Alexander
>>>>> *Cc:* amd-gfx@lists.freedesktop.org 
>>>>> <mailto:amd-gfx@lists.freedesktop.org>; Ding, Pixel; Jiang, Jerry 
>>>>> (SW); Li, Bingley; Ramirez, Alejandro; Filipas, Mario
>>>>> *Subject:* RE: TDR and VRAM lost handling in KMD:
>>>>>
>>>>> 2. Set its fence error status to “*ETIME*”,
>>>>>
>>>>> No, as I already explained ETIME is for synchronous operation.
>>>>>
>>>>> In other words when we return ETIME from the wait IOCTL it would 
>>>>> mean that the waiting has somehow timed out, but not the job we waited for.
>>>>>
>>>>> Please use ECANCELED as well or some other error code when we find 
>>>>> that we need to distinguish the timed-out job from the canceled ones 
>>>>> (probably a good idea, but I'm not sure).
>>>>>
>>>>> [ML] I’m okay if you insist not to use ETIME
>>>>>
>>>>> 3. Find the entity/ctx behind this job, and set this ctx as “*guilty*”
>>>>>
>>>>> Not sure. Do we want to set the whole context as guilty or just 
>>>>> the entity?
>>>>>
>>>>> Setting the whole contexts as guilty sounds racy to me.
>>>>>
>>>>> BTW: We should use a different name than "guilty", maybe just 
>>>>> "bool canceled;" ?
>>>>>
>>>>> [ML] I think context is better than entity, because for example if 
>>>>> you only block entity_0 of a context and allow entity_N to run, that 
>>>>> means the dependencies between entities are broken (e.g. page table 
>>>>> updates in the
>>>>>
>>>>> SDMA entity pass but the gfx submit in the GFX entity is blocked, which makes no 
>>>>> sense to me)
>>>>>
>>>>> We’d better either block the whole context or not at all…
>>>>>
>>>>> 5. Kick out all jobs in this “guilty” ctx’s KFIFO queue, and set 
>>>>> all their fence status to “*ECANCELED*”
>>>>>
>>>>> Setting ECANCELED should be ok. But I think we should do this when 
>>>>> we try to run the jobs and not during GPU reset.
>>>>>
>>>>> [ML] Without deeper thought and experiment, I’m not sure of the 
>>>>> difference between them, but kicking it out in the gpu_reset routine is 
>>>>> more efficient.
>>>>>
>>>>> Otherwise you need to check the context/entity guilty flag in the run_job 
>>>>> routine …and you need to do it for every context/entity; I don’t see 
>>>>> why
>>>>>
>>>>> we wouldn’t just kick all of them out in the gpu_reset stage ….
>>>>>
>>>>> b) Iterate over all living ctx, and set all ctx as “*guilty*” since 
>>>>> VRAM lost actually ruins all VRAM contents
>>>>>
>>>>> No, that shouldn't be done by comparing the counters. Iterating 
>>>>> over all contexts is way too much overhead.
>>>>>
>>>>> [ML] Because I want to keep the KMS IOCTL rules clean: they don’t 
>>>>> need to differentiate VRAM lost or not, they are only interested in whether 
>>>>> the context is guilty or not, and they block
>>>>>
>>>>> submit for the guilty ones.
>>>>>
>>>>> *Can you give more details of your idea? Better yet, the detailed 
>>>>> implementation in cs_submit; I want to see how you want to block submits 
>>>>> without checking a context guilty flag*
>>>>>
>>>>> c) Kick out all jobs in all ctx’s KFIFO queues, and set all their 
>>>>> fence status to “*ECANCELED*”
>>>>>
>>>>> Yes and no, that should be done when we try to run the jobs and 
>>>>> not during GPU reset.
>>>>>
>>>>> [ML] Again, kicking them out in the gpu reset routine is highly 
>>>>> efficient; otherwise you need a check on every job in run_job().
>>>>>
>>>>> Besides, can you illustrate the detailed implementation ?
>>>>>
>>>>> Yes and no, dma_fence_get_status() is some specific handling for 
>>>>> sync_file debugging (no idea why that made it into the common 
>>>>> fence code).
>>>>>
>>>>> It was replaced by putting the error code directly into the fence, 
>>>>> so just reading that one after waiting should be ok.
>>>>>
>>>>> Maybe we should fix dma_fence_get_status() to do the right thing 
>>>>> for this?
>>>>>
>>>>> [ML] Yeah, that’s too confusing; the name really sounds like the one I 
>>>>> want to use, we should change it…
>>>>>
>>>>> *But looking into the implementation, I don’t see why we cannot use it ?
>>>>> It also finally returns the fence->error *
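
(A sketch of the wait-side handling being debated here: after the fence
signals, return the error the reset path stored in it. Whether UMD-facing code
should read fence->error directly or go through dma_fence_get_status() is
exactly the open question; the helper name is hypothetical:)

    /* Wait for a fence, then surface any error set during TDR. */
    static long amdgpu_wait_fence_and_report(struct dma_fence *fence,
                                             signed long timeout)
    {
            long r = dma_fence_wait_timeout(fence, true, timeout);

            if (r < 0)
                    return r;               /* wait interrupted */
            if (r == 0)
                    return -ETIME;          /* the wait itself timed out */

            /* Signaled: fence->error carries -ECANCELED etc., or 0. */
            return fence->error;
    }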
>>>>>
>>>>> *From:*Koenig, Christian
>>>>> *Sent:* Wednesday, October 11, 2017 3:21 PM
>>>>> *To:* Liu, Monk <Monk.Liu@amd.com <mailto:Monk.Liu@amd.com>>; 
>>>>> Haehnle, Nicolai <Nicolai.Haehnle@amd.com 
>>>>> <mailto:Nicolai.Haehnle@amd.com>>;
>>>>> Olsak, Marek <Marek.Olsak@amd.com <mailto:Marek.Olsak@amd.com>>; 
>>>>> Deucher, Alexander <Alexander.Deucher@amd.com 
>>>>> <mailto:Alexander.Deucher@amd.com>>
>>>>> *Cc:* amd-gfx@lists.freedesktop.org 
>>>>> <mailto:amd-gfx@lists.freedesktop.org>; Ding, Pixel 
>>>>> <Pixel.Ding@amd.com <mailto:Pixel.Ding@amd.com>>; Jiang, Jerry
>>>>> (SW) <Jerry.Jiang@amd.com <mailto:Jerry.Jiang@amd.com>>; Li, 
>>>>> Bingley <Bingley.Li@amd.com <mailto:Bingley.Li@amd.com>>; Ramirez, 
>>>>> Alejandro <Alejandro.Ramirez@amd.com 
>>>>> <mailto:Alejandro.Ramirez@amd.com>>;
>>>>> Filipas, Mario <Mario.Filipas@amd.com 
>>>>> <mailto:Mario.Filipas@amd.com>>
>>>>> *Subject:* Re: TDR and VRAM lost handling in KMD:
>>>>>
>>>>> See inline:
>>>>>
>>>>> On 11.10.2017 at 07:33, Liu, Monk wrote:
>>>>>
>>>>>        Hi Christian & Nicolai,
>>>>>
>>>>>        We need to reach agreement on what MESA/UMD should 
>>>>> do and
>>>>>        what KMD should do, *please give your comments with “okay” 
>>>>> or “No”
>>>>>        and your ideas on the items below,*
>>>>>
>>>>>        - When a job times out (set from the lockup_timeout kernel 
>>>>> parameter),
>>>>>        what should KMD do in the TDR routine :
>>>>>
>>>>>        1.Update adev->*gpu_reset_counter*, and stop scheduler first,
>>>>>        (*gpu_reset_counter* is used to force vm flush after GPU 
>>>>> reset, out
>>>>>        of this thread’s scope so no more discussion on it)
>>>>>
>>>>> Okay.
>>>>>
>>>>>        2.Set its fence error status to “*ETIME*”,
>>>>>
>>>>> No, as I already explained ETIME is for synchronous operation.
>>>>>
>>>>> In other words when we return ETIME from the wait IOCTL it would 
>>>>> mean that the waiting has somehow timed out, but not the job we waited for.
>>>>>
>>>>> Please use ECANCELED as well or some other error code when we find 
>>>>> that we need to distinguish the timed-out job from the canceled ones 
>>>>> (probably a good idea, but I'm not sure).
>>>>>
>>>>>        3.Find the entity/ctx behind this job, and set this ctx as 
>>>>> “*guilty*”
>>>>>
>>>>> Not sure. Do we want to set the whole context as guilty or just 
>>>>> the entity?
>>>>>
>>>>> Setting the whole contexts as guilty sounds racy to me.
>>>>>
>>>>> BTW: We should use a different name than "guilty", maybe just 
>>>>> "bool canceled;" ?
>>>>>
>>>>>        4.Kick out this job from scheduler’s mirror list, so this 
>>>>> job won’t
>>>>>        get re-scheduled to ring anymore.
>>>>>
>>>>> Okay.
>>>>>
>>>>>        5.Kick out all jobs in this “guilty” ctx’s KFIFO queue, and 
>>>>> set all
>>>>>        their fence status to “*ECANCELED*”
>>>>>
>>>>> Setting ECANCELED should be ok. But I think we should do this when 
>>>>> we try to run the jobs and not during GPU reset.
>>>>>
>>>>>        6. Force signal all fences that get kicked out by the above two
>>>>>        steps, *otherwise UMD will block forever if waiting on those
>>>>> fences*
>>>>>
>>>>> Okay.
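
(Step 6 in code, roughly: complete each kicked-out fence with an error so
user-space waiters wake up. dma_fence_set_error() and dma_fence_signal() are
the existing primitives; the wrapper itself is hypothetical:)

    static void amdgpu_fence_cancel(struct dma_fence *fence)
    {
            dma_fence_set_error(fence, -ECANCELED);
            dma_fence_signal(fence);
    }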
>>>>>
>>>>>        7. Do gpu reset, which can be some callbacks to let 
>>>>> bare-metal and
>>>>>        SR-IOV implement in their favored style
>>>>>
>>>>> Okay.
>>>>>
>>>>>        8. After reset, KMD needs to be aware whether VRAM lost happened 
>>>>> or not;
>>>>>        bare-metal can implement some function to judge, while for 
>>>>> SR-IOV I
>>>>>        prefer to read it from the GIM side (for the initial version we consider
>>>>>        it’s always VRAM lost, till the GIM side change is aligned)
>>>>>
>>>>> Okay.
>>>>>
>>>>>        9. If VRAM lost is not hit, continue; otherwise:
>>>>>
>>>>>        a)Update adev->*vram_lost_counter*,
>>>>>
>>>>> Okay.
>>>>>
>>>>>        b)Iterate over all living ctx, and set all ctx as “*guilty*”
>>>>> since
>>>>>        VRAM lost actually ruins all VRAM contents
>>>>>
>>>>> No, that shouldn't be done by comparing the counters. Iterating 
>>>>> over all contexts is way too much overhead.
>>>>>
>>>>>        c) Kick out all jobs in all ctx’s KFIFO queues, and set all their
>>>>>        fence status to “*ECANCELED*”
>>>>>
>>>>> Yes and no, that should be done when we try to run the jobs and 
>>>>> not during GPU reset.
>>>>>
>>>>>        10.Do GTT recovery and VRAM page tables/entries recovery 
>>>>> (optional,
>>>>>        do we need it ???)
>>>>>
>>>>> Yes, that is still needed. As Nicolai explained we can't be sure 
>>>>> that VRAM is still 100% correct even when it isn't cleared.
>>>>>
>>>>>        11. Re-schedule all JOBs remaining in the mirror list to the ring again and
>>>>>        restart the scheduler (for the VRAM lost case, no JOB will be
>>>>> re-scheduled)
>>>>>
>>>>> Okay.
>>>>>
>>>>>        - For cs_wait() IOCTL:
>>>>>
>>>>>        After it finds the fence signaled, it should check with
>>>>>        *“dma_fence_get_status” *to see if there is an error there,
>>>>>
>>>>>        and return the error status of the fence
>>>>>
>>>>> Yes and no, dma_fence_get_status() is some specific handling for 
>>>>> sync_file debugging (no idea why that made it into the common 
>>>>> fence code).
>>>>>
>>>>> It was replaced by putting the error code directly into the fence, 
>>>>> so just reading that one after waiting should be ok.
>>>>>
>>>>> Maybe we should fix dma_fence_get_status() to do the right thing 
>>>>> for this?
>>>>>
>>>>>        - For cs_wait_fences() IOCTL:
>>>>>
>>>>>        Similar to the above approach
>>>>>
>>>>>        - For cs_submit() IOCTL:
>>>>>
>>>>>        It needs to check whether the current ctx has been marked as “*guilty*” 
>>>>> and return
>>>>>        “*ECANCELED*” if so
>>>>>
>>>>>        - Introduce a new IOCTL to let UMD query *vram_lost_counter*:
>>>>>
>>>>>        This way, UMD can also block the app from submitting. Like @Nicolai
>>>>>        mentioned, we can cache one copy of *vram_lost_counter* when we
>>>>>        enumerate the physical device, and deny all
>>>>>
>>>>>        gl-contexts from submitting if the queried counter is bigger 
>>>>> than the
>>>>>        one cached in the physical device. (looks a little overkill to 
>>>>> me, but
>>>>>        easy to implement )
>>>>>
>>>>>        UMD can also return an error to the APP when creating a gl-context 
>>>>> if the
>>>>>        currently queried *vram_lost_counter* is bigger than the one 
>>>>> cached in the
>>>>>        physical device.
>>>>>
>>>>> Okay. Already have a patch for this, please review that one if you 
>>>>> haven't already done so.
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>        BTW: I realized that a gl-context is a little different from the 
>>>>> kernel’s
>>>>>        context, because for the kernel a BO is not related to a context 
>>>>> but only
>>>>>        to an FD, while in UMD a BO has a backing
>>>>>
>>>>>        gl-context, so blocking submission in the UMD layer is also needed 
>>>>> although
>>>>>        KMD will do its job as the bottom line
>>>>>
>>>>>        - Basically “vram_lost_counter” is exposed by the kernel to let 
>>>>> UMD take
>>>>>        control of the robustness extension feature; it will be UMD’s 
>>>>> call to
>>>>>        make, KMD only denies a “guilty” context from submitting
>>>>>
>>>>>        Need your feedback, thx
>>>>>
>>>>>        We’d better get the TDR feature landed ASAP
>>>>>
>>>>>        BR Monk
>>>>>
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx@lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: TDR and VRAM lost handling in KMD:
       [not found]                                                     ` <BLUPR12MB044961DEE4326E94A156ED05844A0-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  2017-10-11 13:59                                                       ` Liu, Monk
@ 2017-10-11 14:03                                                       ` Christian König
       [not found]                                                         ` <02bb9f77-bcc6-8a24-e9b0-8f3f260d74d8-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  1 sibling, 1 reply; 23+ messages in thread
From: Christian König @ 2017-10-11 14:03 UTC (permalink / raw)
  To: Liu, Monk, Koenig, Christian, Zhou, David(ChunMing),
	Haehnle, Nicolai, Olsak, Marek, Deucher, Alexander
  Cc: Ramirez, Alejandro, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	Filipas, Mario, Ding, Pixel, Li, Bingley, Jiang, Jerry (SW)

> I remember even the VM update job comes with a kernel entity (no context, true), and if the entity can keep a counter copy
That won't work. We want to keep the entities associated with VM updates 
and buffer moves alive, but their jobs canceled.
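
(A minimal sketch of the per-job counter approach described below: stale jobs
are canceled when the scheduler tries to run them, while the entity itself
stays usable. The struct layout and function shape are assumptions:)

    struct amdgpu_job {
            /* ... existing members ... */
            struct amdgpu_device *adev;
            uint32_t vram_lost_counter;     /* snapshot taken at submit time */
    };

    /* Scheduler run_job backend: cancel jobs from an older VRAM
     * generation instead of executing their stale command buffers. */
    static struct dma_fence *amdgpu_job_run(struct amdgpu_job *job)
    {
            if (job->vram_lost_counter !=
                atomic_read(&job->adev->vram_lost_counter))
                    return ERR_PTR(-ECANCELED);

            /* ... normal IB submission elided ... */
            return NULL;
    }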

Regards,
Christian.

On 11.10.2017 at 15:51, Liu, Monk wrote:
>> Some jobs don't have a context (VM updates, clears, buffer moves).
> What? I remember even the VM update job comes with a kernel entity (no context, true), and if the entity can keep a counter copy,
> that can solve your concerns
>
>
>
> -----Original Message-----
> From: Koenig, Christian
> Sent: Wednesday, October 11, 2017 9:39 PM
> To: Liu, Monk <Monk.Liu@amd.com>; Zhou, David(ChunMing) <David1.Zhou@amd.com>; Haehnle, Nicolai <Nicolai.Haehnle@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>
> Cc: Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; amd-gfx@lists.freedesktop.org; Filipas, Mario <Mario.Filipas@amd.com>; Ding, Pixel <Pixel.Ding@amd.com>; Li, Bingley <Bingley.Li@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>
> Subject: Re: TDR and VRAM lost handling in KMD:
>
> Some jobs don't have a context (VM updates, clears, buffer moves).
>
> I would still like to abort those when they were issued before losing VRAM content, but keep the entity usable.
>
> So I think we should just keep a copy of the VRAM lost counter in the job. That also removes us from the burden of figuring out the context during job run.
>
> Regards,
> Christian.
>
> On 11.10.2017 at 15:35, Liu, Monk wrote:
>> I think just comparing the copy from the context/entity with the current counter
>> is enough; I don't see how it's better to keep another copy in the JOB
>>
>>
>> -----Original Message-----
>> From: Koenig, Christian
>> Sent: Wednesday, October 11, 2017 6:40 PM
>> To: Zhou, David(ChunMing) <David1.Zhou@amd.com>; Liu, Monk
>> <Monk.Liu@amd.com>; Haehnle, Nicolai <Nicolai.Haehnle@amd.com>; Olsak,
>> Marek <Marek.Olsak@amd.com>; Deucher, Alexander
>> <Alexander.Deucher@amd.com>
>> Cc: Ramirez, Alejandro <Alejandro.Ramirez@amd.com>;
>> amd-gfx@lists.freedesktop.org; Filipas, Mario <Mario.Filipas@amd.com>;
>> Ding, Pixel <Pixel.Ding@amd.com>; Li, Bingley <Bingley.Li@amd.com>;
>> Jiang, Jerry (SW) <Jerry.Jiang@amd.com>
>> Subject: Re: TDR and VRAM lost handling in KMD:
>>
>> I've already posted a patch for this on the mailing list.
>>
>> Basically we just copy the vram lost counter into the job and when we try to run the job we can mark it as canceled.
>>
>> Regards,
>> Christian.
>>
>> [snip: the rest of the quoted thread trimmed; it repeats the messages shown in full above]


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: TDR and VRAM lost handling in KMD:
       [not found]                                                         ` <BLUPR12MB04497EDD5AE48484E7A18C2F844A0-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2017-10-11 14:04                                                           ` Christian König
  0 siblings, 0 replies; 23+ messages in thread
From: Christian König @ 2017-10-11 14:04 UTC (permalink / raw)
  To: Liu, Monk, Koenig, Christian, Zhou, David(ChunMing),
	Haehnle, Nicolai, Olsak, Marek, Deucher, Alexander
  Cc: Ramirez, Alejandro, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	Filipas, Mario, Ding, Pixel, Li, Bingley, Jiang, Jerry (SW)

Yeah, that was exactly my thinking as well.

Christian.
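
(Under the scheme agreed on here, AMDGPU_CTX_OP_QUERY_STATE could distinguish
the cases roughly as follows; only the counter comparison is spelled out in
the thread, and the reset_status field is the proposed addition:)

    static uint32_t amdgpu_ctx_query_reset_state(struct amdgpu_device *adev,
                                                 struct amdgpu_ctx *ctx)
    {
            /* The hang was traced back to one of this context's entities. */
            if (ctx->reset_status == AMDGPU_CTX_GUILTY_RESET)
                    return AMDGPU_CTX_GUILTY_RESET;

            /* The context merely predates a VRAM loss: lost, but innocent. */
            if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter))
                    return AMDGPU_CTX_INNOCENT_RESET;

            return AMDGPU_CTX_NO_RESET;
    }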

On 11.10.2017 at 15:59, Liu, Monk wrote:
> But if we keep the counter in the entity, there is one issue I suddenly thought of :
>
> For a regular user context, after vram lost UMD will be aware this context is LOST since we have a counter copy in the context, so user space can close it and re-create one.
> But for a kernel entity, since there is no U/K interface, it is the kernel's responsibility to recover this kernel entity to a working state, and that makes things complicated ....
>
> Emm, I agree that keeping a copy in the context and in the job is a good move
>
> BR Monk
>
> -----Original Message-----
> From: amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] On Behalf Of Liu, Monk
> Sent: Wednesday, October 11, 2017 9:51 PM
> To: Koenig, Christian <Christian.Koenig@amd.com>; Zhou, David(ChunMing) <David1.Zhou@amd.com>; Haehnle, Nicolai <Nicolai.Haehnle@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>
> Cc: Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; amd-gfx@lists.freedesktop.org; Filipas, Mario <Mario.Filipas@amd.com>; Ding, Pixel <Pixel.Ding@amd.com>; Li, Bingley <Bingley.Li@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>
> Subject: RE: TDR and VRAM lost handling in KMD:
>
>> Some jobs don't have a context (VM updates, clears, buffer moves).
> What? I remember even the VM update job comes with a kernel entity (no context, true), and if the entity can keep a counter copy, that can solve your concerns
>
>
>
> -----Original Message-----
> From: Koenig, Christian
> Sent: Wednesday, October 11, 2017 9:39 PM
> To: Liu, Monk <Monk.Liu@amd.com>; Zhou, David(ChunMing) <David1.Zhou@amd.com>; Haehnle, Nicolai <Nicolai.Haehnle@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>
> Cc: Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; amd-gfx@lists.freedesktop.org; Filipas, Mario <Mario.Filipas@amd.com>; Ding, Pixel <Pixel.Ding@amd.com>; Li, Bingley <Bingley.Li@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>
> Subject: Re: TDR and VRAM lost handling in KMD:
>
> Some jobs don't have a context (VM updates, clears, buffer moves).
>
> I would still like to abort those when they were issued before losing VRAM content, but keep the entity usable.
>
> So I think we should just keep a copy of the VRAM lost counter in the job. That also removes us from the burden of figuring out the context during job run.
>
> Regards,
> Christian.
>
> On 11.10.2017 at 15:35, Liu, Monk wrote:
>> I think just comparing the copy from the context/entity with the current counter
>> is enough; I don't see how it's better to keep another copy in the JOB
>>
>>
>> -----Original Message-----
>> From: Koenig, Christian
>> Sent: Wednesday, October 11, 2017 6:40 PM
>> To: Zhou, David(ChunMing) <David1.Zhou@amd.com>; Liu, Monk
>> <Monk.Liu@amd.com>; Haehnle, Nicolai <Nicolai.Haehnle@amd.com>; Olsak,
>> Marek <Marek.Olsak@amd.com>; Deucher, Alexander
>> <Alexander.Deucher@amd.com>
>> Cc: Ramirez, Alejandro <Alejandro.Ramirez@amd.com>;
>> amd-gfx@lists.freedesktop.org; Filipas, Mario <Mario.Filipas@amd.com>;
>> Ding, Pixel <Pixel.Ding@amd.com>; Li, Bingley <Bingley.Li@amd.com>;
>> Jiang, Jerry (SW) <Jerry.Jiang@amd.com>
>> Subject: Re: TDR and VRAM lost handling in KMD:
>>
>> I've already posted a patch for this on the mailing list.
>>
>> Basically we just copy the vram lost counter into the job and when we try to run the job we can mark it as canceled.
>>
>> Regards,
>> Christian.
>>
>> On 11.10.2017 at 12:14, Chunming Zhou wrote:
>>> Your summary lacks the below issue:
>>>
>>> How about jobs already pushed into the scheduler queue when vram is lost?
>>>
>>>
>>> Regards,
>>> David Zhou
>>> On 2017-10-11 17:41, Liu, Monk wrote:
>>>> Okay, let me summarize our whole idea together and see if it works:
>>>>
>>>> 1, For cs_submit, always check vram_lost_counter first and reject
>>>> the submit (return -ECANCELED to UMD) if ctx->vram_lost_counter !=
>>>> adev->vram_lost_counter. That way the vram lost issue can be handled.
>>>>
>>>> 2, For cs_submit we still need to check whether the incoming context
>>>> is "AMDGPU_CTX_GUILTY_RESET", even if we found
>>>> ctx->vram_lost_counter == adev->vram_lost_counter, and we reject the
>>>> submit if it is "AMDGPU_CTX_GUILTY_RESET", correct ?
>>>>
>>>> 3, In the gpu_reset() routine, we only mark the hang job's entity as
>>>> guilty (so we need to add a new member to the entity structure); we
>>>> don't kick it out in the gpu_reset() stage, but we need to set the context
>>>> behind this entity to "AMDGPU_CTX_GUILTY_RESET".
>>>>      And if the reset introduces VRAM LOST, we just update
>>>> adev->vram_lost_counter but *don't* change all entities to guilty, so
>>>> still only the hang job's entity is "guilty".
>>>>      After an entity is marked as "guilty", we find a way to set the
>>>> context behind it as AMDGPU_CTX_GUILTY_RESET; because this is the U/K
>>>> interface, we need to let UMD know that this context is broken.
>>>>
>>>> 4, In the gpu scheduler's run_job() routine, since it only reads the
>>>> entity, we skip job scheduling once we find the entity is "guilty".
>>>>
>>>>
>>>> Does the above sound good ?
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Haehnle, Nicolai
>>>> Sent: Wednesday, October 11, 2017 5:26 PM
>>>> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian
>>>> <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>;
>>>> Deucher, Alexander <Alexander.Deucher@amd.com>
>>>> Cc: amd-gfx@lists.freedesktop.org; Ding, Pixel <Pixel.Ding@amd.com>;
>>>> Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, Bingley
>>>> <Bingley.Li@amd.com>; Ramirez, Alejandro
>>>> <Alejandro.Ramirez@amd.com>; Filipas, Mario <Mario.Filipas@amd.com>
>>>> Subject: Re: TDR and VRAM lost handling in KMD:
>>>>
>>>> On 11.10.2017 11:18, Liu, Monk wrote:
>>>>> Let's keep it simple: when vram lost hits, what's the action for
>>>>> amdgpu_ctx_query()/AMDGPU_CTX_OP_QUERY_STATE on other contexts (not
>>>>> the one that triggered the gpu hang)? Do you mean we return
>>>>> -ENODEV to UMD ?
>>>> It should successfully return AMDGPU_CTX_INNOCENT_RESET.
>>>>
>>>>
>>>>> In cs_submit, with vram lost hit, if we don't mark all contexts as
>>>>> "guilty", how do we block them from submitting ? Can you show a
>>>>> possible implementation?
>>>> if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter))
>>>>         return -ECANCELED;
>>>>
>>>> (where ctx->vram_lost_counter is initialized at context creation
>>>> time and never changed afterwards)
>>>>
>>>>
>>>>> BTW: the "guilty" here is a new member I want to add to the context; it
>>>>> is not related to the AMDGPU_CTX_OP_QUERY_STATE U/K interface. Looks like I
>>>>> need to unify them so there is only one place to mark guilty or not.
>>>> Right, the AMDGPU_CTX_OP_QUERY_STATE handling needs to be made
>>>> consistent with the rest.
>>>>
>>>> Cheers,
>>>> Nicolai
>>>>
>>>>
>>>>> BR Monk
>>>>>
>>>>> -----Original Message-----
>>>>> From: Haehnle, Nicolai
>>>>> Sent: Wednesday, October 11, 2017 5:00 PM
>>>>> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian
>>>>> <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>;
>>>>> Deucher, Alexander <Alexander.Deucher@amd.com>
>>>>> Cc: amd-gfx@lists.freedesktop.org; Ding, Pixel
>>>>> <Pixel.Ding@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li,
>>>>> Bingley <Bingley.Li@amd.com>; Ramirez, Alejandro
>>>>> <Alejandro.Ramirez@amd.com>; Filipas, Mario <Mario.Filipas@amd.com>
>>>>> Subject: Re: TDR and VRAM lost handling in KMD:
>>>>>
>>>>> On 11.10.2017 10:48, Liu, Monk wrote:
>>>>>> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL),
>>>>>> so it's reasonable to use it. However, it /does not/ make sense to
>>>>>> mark idle contexts as "guilty" just because VRAM is lost. VRAM
>>>>>> lost is a perfect example where the driver should report context
>>>>>> lost to applications with the "innocent" flag for contexts that
>>>>>> were idle at the time of reset. The only context(s) that should be
>>>>>> reported as "guilty"
>>>>>> (or perhaps "unknown" in some cases) are the ones that were
>>>>>> executing at the time of reset.
>>>>>>
>>>>>> ML: KMD marks all contexts as guilty because that way we can
>>>>>> unify our IOCTL behavior: e.g. the IOCTL only blocks a
>>>>>> “guilty” context, no need to worry about the vram-lost-counter
>>>>>> anymore; that’s an implementation style. I don’t think it is
>>>>>> related to the UMD layer.
>>>>>>
>>>>>> The gl-context isn’t something KMD is aware of, so UMD can implement
>>>>>> its own “guilty” gl-context if you want.
>>>>> Well, to some extent this is just semantics, but it helps to keep
>>>>> the terminology consistent.
>>>>>
>>>>> Most importantly, please keep the AMDGPU_CTX_OP_QUERY_STATE uapi in
>>>>> mind: this returns one of
>>>>> AMDGPU_CTX_{GUILTY,INNOCENT,UNKNOWN}_RESET,
>>>>> and it must return "innocent" for contexts that are only lost due
>>>>> to VRAM lost without being otherwise involved in the timeout that
>>>>> led to the reset.
>>>>>
>>>>> The point is that in the places where you used "guilty" it would be
>>>>> better to use "context lost", and then further differentiate
>>>>> between guilty/innocent context lost based on the details of what happened.
>>>>>
>>>>>
>>>>>> If KMD doesn’t mark all ctx as guilty after VRAM lost, can you
>>>>>> illustrate what rule KMD should obey to check in a KMS IOCTL like
>>>>>> cs_submit ?? Let’s see which way is better.
>>>>> if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter))
>>>>>          return -ECANCELED;
>>>>>
>>>>> Plus similar logic for AMDGPU_CTX_OP_QUERY_STATE.
>>>>>
>>>>> Yes, it's one additional check in cs_submit. If you're worried
>>>>> about that (and Christian's concerns about possible issues with
>>>>> walking over all contexts are addressed), I suppose you could just
>>>>> store a per-context
>>>>>
>>>>>        unsigned context_reset_status;
>>>>>
>>>>> instead of a `bool guilty`. Its value would start out as 0
>>>>> (AMDGPU_CTX_NO_RESET) and would be set to the correct value during
>>>>> reset.
>>>>>
>>>>> Cheers,
>>>>> Nicolai
>>>>>
>>>>>
>>>>>> *From:*Haehnle, Nicolai
>>>>>> *Sent:* Wednesday, October 11, 2017 4:41 PM
>>>>>> *To:* Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian
>>>>>> <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>;
>>>>>> Deucher, Alexander <Alexander.Deucher@amd.com>
>>>>>> *Cc:* amd-gfx@lists.freedesktop.org; Ding, Pixel
>>>>>> <Pixel.Ding@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li,
>>>>>> Bingley <Bingley.Li@amd.com>; Ramirez, Alejandro
>>>>>> <Alejandro.Ramirez@amd.com>; Filipas, Mario
>>>>>> <Mario.Filipas@amd.com>
>>>>>> *Subject:* Re: TDR and VRAM lost handling in KMD:
>>>>>>
>>>>>>      From a Mesa perspective, this almost all sounds reasonable to me.
>>>>>>
>>>>>> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL),
>>>>>> so it's reasonable to use it. However, it /does not/ make sense to
>>>>>> mark idle contexts as "guilty" just because VRAM is lost. VRAM
>>>>>> lost is a perfect example where the driver should report context
>>>>>> lost to applications with the "innocent" flag for contexts that
>>>>>> were idle at the time of reset. The only context(s) that should be
>>>>>> reported as "guilty"
>>>>>> (or perhaps "unknown" in some cases) are the ones that were
>>>>>> executing at the time of reset.
>>>>>>
>>>>>> On whether the whole context is marked as guilty from a user space
>>>>>> perspective, it would simply be nice for user space to get
>>>>>> consistent answers. It would be a bit odd if we could e.g. succeed
>>>>>> in submitting an SDMA job after a GFX job was rejected. This would
>>>>>> point in favor of marking the entire context as guilty (although
>>>>>> that could happen lazily instead of at reset time). On the other
>>>>>> hand, if that's too big a burden for the kernel implementation I'm
>>>>>> sure we can live without it.
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Nicolai
>>>>>>
>>>>>> ------------------------------------------------------------------------
>>>>>>
>>>>>> *From:*Liu, Monk
>>>>>> *Sent:* Wednesday, October 11, 2017 10:15:40 AM
>>>>>> *To:* Koenig, Christian; Haehnle, Nicolai; Olsak, Marek; Deucher,
>>>>>> Alexander
>>>>>> *Cc:* amd-gfx@lists.freedesktop.org
>>>>>> <mailto:amd-gfx@lists.freedesktop.org>; Ding, Pixel; Jiang, Jerry
>>>>>> (SW); Li, Bingley; Ramirez, Alejandro; Filipas, Mario
>>>>>> *Subject:* RE: TDR and VRAM lost handling in KMD:
>>>>>>
>>>>>> 2. Set its fence error status to “*ETIME*”,
>>>>>>
>>>>>> No, as I already explained ETIME is for synchronous operation.
>>>>>>
>>>>>> In other words when we return ETIME from the wait IOCTL it would
>>>>>> mean that the waiting has somehow timed out, but not the job we waited for.
>>>>>>
>>>>>> Please use ECANCELED as well or some other error code when we find
>>>>>> that we need to distinguish the timed-out job from the canceled ones
>>>>>> (probably a good idea, but I'm not sure).
>>>>>>
>>>>>> [ML] I’m okay if you insist not to use ETIME
>>>>>>
>>>>>> 3. Find the entity/ctx behind this job, and set this ctx as “*guilty*”
>>>>>>
>>>>>> Not sure. Do we want to set the whole context as guilty or just
>>>>>> the entity?
>>>>>>
>>>>>> Setting the whole contexts as guilty sounds racy to me.
>>>>>>
>>>>>> BTW: We should use a different name than "guilty", maybe just
>>>>>> "bool canceled;" ?
>>>>>>
>>>>>> [ML] I think context is better than entity, because for example if
>>>>>> you only block entity_0 of a context and allow entity_N to run, that
>>>>>> means the dependencies between entities are broken (e.g. page table
>>>>>> updates in the
>>>>>>
>>>>>> SDMA entity pass but the gfx submit in the GFX entity is blocked, which makes no
>>>>>> sense to me)
>>>>>>
>>>>>> We’d better either block the whole context or not at all…
>>>>>>
>>>>>> 5. Kick out all jobs in this “guilty” ctx’s KFIFO queue, and set
>>>>>> all their fence status to “*ECANCELED*”
>>>>>>
>>>>>> Setting ECANCELED should be ok. But I think we should do this when
>>>>>> we try to run the jobs and not during GPU reset.
>>>>>>
>>>>>> [ML] Without deeper thought and experiment, I’m not sure of the
>>>>>> difference between them, but kicking it out in the gpu_reset routine is
>>>>>> more efficient.
>>>>>>
>>>>>> Otherwise you need to check the context/entity guilty flag in the run_job
>>>>>> routine …and you need to do it for every context/entity; I don’t see
>>>>>> why
>>>>>>
>>>>>> we wouldn’t just kick all of them out in the gpu_reset stage ….
>>>>>>
>>>>>> b) Iterate over all living ctx, and set all ctx as “*guilty*” since
>>>>>> VRAM lost actually ruins all VRAM contents
>>>>>>
>>>>>> No, that shouldn't be done by comparing the counters. Iterating
>>>>>> over all contexts is way too much overhead.
>>>>>>
>>>>>> [ML] Because I want to keep the KMS IOCTL rules clean: they don’t
>>>>>> need to differentiate VRAM lost or not, they are only interested in whether
>>>>>> the context is guilty or not, and they block
>>>>>>
>>>>>> submit for the guilty ones.
>>>>>>
>>>>>> *Can you give more details of your idea? Better yet, the detailed
>>>>>> implementation in cs_submit; I want to see how you want to block submits
>>>>>> without checking a context guilty flag*
>>>>>>
>>>>>> c) Kick out all jobs in all ctx’s KFIFO queues, and set all their
>>>>>> fence status to “*ECANCELED*”
>>>>>>
>>>>>> Yes and no, that should be done when we try to run the jobs and
>>>>>> not during GPU reset.
>>>>>>
>>>>>> [ML] Again, kicking them out in the gpu reset routine is highly
>>>>>> efficient; otherwise you need to check every job in run_job().
>>>>>>
>>>>>> Besides, can you illustrate the detailed implementation?
>>>>>>
>>>>>> Yes and no, dma_fence_get_status() is some specific handling for
>>>>>> sync_file debugging (no idea why that made it into the common
>>>>>> fence code).
>>>>>>
>>>>>> It was replaced by putting the error code directly into the fence,
>>>>>> so just reading that one after waiting should be ok.
>>>>>>
>>>>>> Maybe we should fix dma_fence_get_status() to do the right thing
>>>>>> for this?
>>>>>>
>>>>>> [ML] Yeah, that’s too confusing; the name really sounds like the one
>>>>>> I want to use, we should change it…
>>>>>>
>>>>>> *But looking into the implementation, I don’t see why we cannot use
>>>>>> it? It also finally returns the fence->error.*
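>>>>>>
>>>>>> For the record, a minimal sketch of that read-after-wait, assuming
>>>>>> the dma_fence API of the time (a sketch, not the actual patch):
>>>>>>
>>>>>>     long r = dma_fence_wait_timeout(fence, true, timeout);
>>>>>>     if (r > 0 && fence->error)
>>>>>>             r = fence->error; /* e.g. -ECANCELED for a canceled job */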
>>>>>>
>>>>>> *From:*Koenig, Christian
>>>>>> *Sent:* Wednesday, October 11, 2017 3:21 PM
>>>>>> *To:* Liu, Monk <Monk.Liu@amd.com <mailto:Monk.Liu@amd.com>>;
>>>>>> Haehnle, Nicolai <Nicolai.Haehnle@amd.com
>>>>>> <mailto:Nicolai.Haehnle@amd.com>>;
>>>>>> Olsak, Marek <Marek.Olsak@amd.com <mailto:Marek.Olsak@amd.com>>;
>>>>>> Deucher, Alexander <Alexander.Deucher@amd.com
>>>>>> <mailto:Alexander.Deucher@amd.com>>
>>>>>> *Cc:* amd-gfx@lists.freedesktop.org
>>>>>> <mailto:amd-gfx@lists.freedesktop.org>; Ding, Pixel
>>>>>> <Pixel.Ding@amd.com <mailto:Pixel.Ding@amd.com>>; Jiang, Jerry
>>>>>> (SW) <Jerry.Jiang@amd.com <mailto:Jerry.Jiang@amd.com>>; Li,
>>>>>> Bingley <Bingley.Li@amd.com <mailto:Bingley.Li@amd.com>>; Ramirez,
>>>>>> Alejandro <Alejandro.Ramirez@amd.com
>>>>>> <mailto:Alejandro.Ramirez@amd.com>>;
>>>>>> Filipas, Mario <Mario.Filipas@amd.com
>>>>>> <mailto:Mario.Filipas@amd.com>>
>>>>>> *Subject:* Re: TDR and VRAM lost handling in KMD:
>>>>>>
>>>>>> See inline:
>>>>>>
>>>>>> On 11.10.2017 at 07:33, Liu, Monk wrote:
>>>>>>
>>>>>>         Hi Christian & Nicolai,
>>>>>>
>>>>>>         We need to reach agreement on what MESA/UMD should do and
>>>>>>         what KMD should do; *please give your comments with “okay”
>>>>>>         or “No” and your ideas on the items below:*
>>>>>>
>>>>>>         - When a job times out (set from the lockup_timeout kernel
>>>>>>         parameter), what should KMD do in the TDR routine:
>>>>>>
>>>>>>         1. Update adev->*gpu_reset_counter*, and stop the scheduler
>>>>>>         first (*gpu_reset_counter* is used to force a VM flush after
>>>>>>         GPU reset; out of this thread’s scope, so no more discussion
>>>>>>         on it)
>>>>>>
>>>>>> Okay.
>>>>>>
>>>>>>         2. Set its fence error status to “*ETIME*”,
>>>>>>
>>>>>> No, as I already explained ETIME is for synchronous operation.
>>>>>>
>>>>>> In other words when we return ETIME from the wait IOCTL it would
>>>>>> mean that the waiting has somehow timed out, but not the job we waited for.
>>>>>>
>>>>>> Please use ECANCELED as well, or some other error code if we find
>>>>>> that we need to distinguish the timed-out job from the canceled ones
>>>>>> (probably a good idea, but I'm not sure).
>>>>>>
>>>>>>         3. Find the entity/ctx behind this job, and set this ctx as
>>>>>> “*guilty*”
>>>>>>
>>>>>> Not sure. Do we want to set the whole context as guilty or just
>>>>>> the entity?
>>>>>>
>>>>>> Setting the whole context as guilty sounds racy to me.
>>>>>>
>>>>>> BTW: We should use a different name than "guilty", maybe just
>>>>>> "bool canceled;" ?
>>>>>>
>>>>>>         4. Kick out this job from the scheduler’s mirror list, so
>>>>>>         this job won’t get re-scheduled to the ring anymore.
>>>>>>
>>>>>> Okay.
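>>>>>>
>>>>>> For illustration, the mirror-list kick could look roughly like this
>>>>>> with the scheduler internals of the time (a hedged sketch, not the
>>>>>> actual patch):
>>>>>>
>>>>>>     spin_lock(&sched->job_list_lock);
>>>>>>     list_del_init(&bad_job->node);   /* drop from ring_mirror_list */
>>>>>>     spin_unlock(&sched->job_list_lock);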
>>>>>>
>>>>>>         5. Kick out all jobs in this “guilty” ctx’s KFIFO queue, and
>>>>>>         set all their fence status to “*ECANCELED*”
>>>>>>
>>>>>> Setting ECANCELED should be ok. But I think we should do this when
>>>>>> we try to run the jobs and not during GPU reset.
>>>>>>
>>>>>>         6. Force-signal all fences that get kicked out by the above
>>>>>>         two steps, *otherwise UMD will block forever if waiting on
>>>>>>         those fences*
>>>>>>
>>>>>> Okay.
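>>>>>>
>>>>>> One way the force-signal could look with the amd scheduler fences of
>>>>>> the time (a hedged sketch; helper names are assumptions, not the
>>>>>> actual patch):
>>>>>>
>>>>>>     /* record the cancellation error, then signal the fence so any
>>>>>>      * UMD waiter wakes up instead of blocking forever */
>>>>>>     dma_fence_set_error(&s_fence->finished, -ECANCELED);
>>>>>>     amd_sched_fence_finished(s_fence);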
>>>>>>
>>>>>>         7. Do the GPU reset, which can be a set of callbacks so that
>>>>>>         bare-metal and SR-IOV can each implement it in their
>>>>>>         preferred style
>>>>>>
>>>>>> Okay.
>>>>>>
>>>>>>         8. After reset, KMD needs to be aware of whether VRAM loss
>>>>>>         happened or not; bare-metal can implement some function to
>>>>>>         judge, while for SR-IOV I prefer to read it from the GIM side
>>>>>>         (for the initial version we consider it always VRAM lost,
>>>>>>         until the GIM-side change is aligned)
>>>>>>
>>>>>> Okay.
>>>>>>
>>>>>>         9. If VRAM loss was not hit, continue; otherwise:
>>>>>>
>>>>>>         a) Update adev->*vram_lost_counter*,
>>>>>>
>>>>>> Okay.
>>>>>>
>>>>>>         b) Iterate over all living ctx, and set all ctx as
>>>>>>         “*guilty*”, since VRAM loss actually ruins all VRAM contents
>>>>>>
>>>>>> No, that should instead be done by comparing the counters. Iterating
>>>>>> over all contexts is way too much overhead.
>>>>>>
>>>>>>         c) Kick out all jobs in all ctx’s KFIFO queues, and set all
>>>>>>         their fence status to “*ECANCELED*”
>>>>>>
>>>>>> Yes and no, that should be done when we try to run the jobs and
>>>>>> not during GPU reset.
>>>>>>
>>>>>>         10. Do GTT recovery and VRAM page tables/entries recovery
>>>>>>         (optional, do we need it???)
>>>>>>
>>>>>> Yes, that is still needed. As Nicolai explained we can't be sure
>>>>>> that VRAM is still 100% correct even when it isn't cleared.
>>>>>>
>>>>>>         11. Re-schedule all jobs remaining in the mirror list to the
>>>>>>         ring again and restart the scheduler (for the VRAM lost case,
>>>>>>         no job will be re-scheduled)
>>>>>>
>>>>>> Okay.
>>>>>>
>>>>>>         - For the cs_wait() IOCTL:
>>>>>>
>>>>>>         After it finds the fence signaled, it should check with
>>>>>>         *“dma_fence_get_status”* to see if there is an error there,
>>>>>>
>>>>>>         and return the error status of the fence
>>>>>>
>>>>>> Yes and no, dma_fence_get_status() is some specific handling for
>>>>>> sync_file debugging (no idea why that made it into the common
>>>>>> fence code).
>>>>>>
>>>>>> It was replaced by putting the error code directly into the fence,
>>>>>> so just reading that one after waiting should be ok.
>>>>>>
>>>>>> Maybe we should fix dma_fence_get_status() to do the right thing
>>>>>> for this?
>>>>>>
>>>>>>         - For the cs_wait_fences() IOCTL:
>>>>>>
>>>>>>         Similar to the above approach
>>>>>>
>>>>>>         - For the cs_submit() IOCTL:
>>>>>>
>>>>>>         It needs to check whether the current ctx has been marked as
>>>>>>         “*guilty*” and return “*ECANCELED*” if so
>>>>>>
>>>>>>         - Introduce a new IOCTL to let UMD query *vram_lost_counter*:
>>>>>>
>>>>>>         This way, UMD can also block the app from submitting. Like
>>>>>>         @Nicolai mentioned, we can cache one copy of
>>>>>>         *vram_lost_counter* when enumerating the physical device, and
>>>>>>         deny all gl-contexts from submitting if the queried counter
>>>>>>         is bigger than the one cached in the physical device (looks a
>>>>>>         little overkill to me, but easy to implement).
>>>>>>
>>>>>>         UMD can also return an error to the app when creating a
>>>>>>         gl-context if the currently queried *vram_lost_counter* is
>>>>>>         bigger than the one cached in the physical device.
>>>>>>
>>>>>> Okay. Already have a patch for this, please review that one if you
>>>>>> haven't already done so.
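>>>>>>
>>>>>> A rough sketch of the UMD-side gate described above (the query
>>>>>> wrapper is hypothetical, for illustration only):
>>>>>>
>>>>>>     uint64_t now = query_vram_lost_counter(dev_fd); /* hypothetical */
>>>>>>     if (now != dev->cached_vram_lost_counter)
>>>>>>             return -ECANCELED; /* deny submits from this gl-context */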
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>>         BTW: I realized that a gl-context is a little different from
>>>>>>         the kernel’s context, because for the kernel a BO is not
>>>>>>         related to a context but only to an FD, while in UMD a BO has
>>>>>>         a backing gl-context. So blocking submission in the UMD layer
>>>>>>         is also needed, although KMD will do its job as the bottom
>>>>>>         line.
>>>>>>
>>>>>>         - Basically, “vram_lost_counter” is exposed by the kernel to
>>>>>>         let UMD take control of the robustness extension feature; it
>>>>>>         will be UMD’s call to make, and KMD only denies “guilty”
>>>>>>         contexts from submitting
>>>>>>
>>>>>>         Need your feedback, thx
>>>>>>
>>>>>>         We’d better get the TDR feature landed ASAP
>>>>>>
>>>>>>         BR Monk
>>>>>>
>>>> _______________________________________________
>>>> amd-gfx mailing list
>>>> amd-gfx@lists.freedesktop.org
>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: TDR and VRAM lost handling in KMD:
       [not found]                                                         ` <02bb9f77-bcc6-8a24-e9b0-8f3f260d74d8-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2017-10-11 14:04                                                           ` Liu, Monk
  0 siblings, 0 replies; 23+ messages in thread
From: Liu, Monk @ 2017-10-11 14:04 UTC (permalink / raw)
  To: Koenig, Christian, Zhou, David(ChunMing),
	Haehnle, Nicolai, Olsak, Marek, Deucher, Alexander
  Cc: Ramirez, Alejandro, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	Filipas, Mario, Ding, Pixel, Li, Bingley, Jiang, Jerry (SW)

Yeah, I just thought of it; agreed, we shouldn't keep the copy in the entity, otherwise it's too complicated to handle. Something like the sketch below, then:
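
For concreteness, a minimal sketch of the job-side counter copy under
discussion (field and callback details are assumptions for illustration,
not the actual patch):

    /* at submit time: snapshot the counter into the job */
    job->vram_lost_counter = atomic_read(&adev->vram_lost_counter);

    /* later, in the scheduler's run_job() callback */
    if (job->vram_lost_counter != atomic_read(&adev->vram_lost_counter)) {
            /* VRAM was lost after this job was queued: cancel the job,
             * but leave the entity itself usable for future jobs */
            dma_fence_set_error(&job->base.s_fence->finished, -ECANCELED);
            return NULL;
    }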

BR Monk

-----Original Message-----
From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com] 
Sent: Wednesday, October 11, 2017 10:04 PM
To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Zhou, David(ChunMing) <David1.Zhou@amd.com>; Haehnle, Nicolai <Nicolai.Haehnle@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>
Cc: Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; amd-gfx@lists.freedesktop.org; Filipas, Mario <Mario.Filipas@amd.com>; Ding, Pixel <Pixel.Ding@amd.com>; Li, Bingley <Bingley.Li@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>
Subject: Re: TDR and VRAM lost handling in KMD:

> I remember even the VM update job has a kernel entity (no context,
> true), and if the entity can keep a counter copy
That won't work. We want to keep the entities associated with VM updates and buffer moves alive, but their jobs canceled.

Regards,
Christian.

Am 11.10.2017 um 15:51 schrieb Liu, Monk:
>> Some jobs don't have a context (VM updates, clears, buffer moves).
> What? I remember even the VM update job has a kernel entity (no
> context, true), and if the entity can keep a counter copy, that can
> solve your concerns
>
>
>
> -----Original Message-----
> From: Koenig, Christian
> Sent: Wednesday, October 11, 2017 9:39 PM
> To: Liu, Monk <Monk.Liu@amd.com>; Zhou, David(ChunMing) 
> <David1.Zhou@amd.com>; Haehnle, Nicolai <Nicolai.Haehnle@amd.com>; 
> Olsak, Marek <Marek.Olsak@amd.com>; Deucher, Alexander 
> <Alexander.Deucher@amd.com>
> Cc: Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; 
> amd-gfx@lists.freedesktop.org; Filipas, Mario <Mario.Filipas@amd.com>; 
> Ding, Pixel <Pixel.Ding@amd.com>; Li, Bingley <Bingley.Li@amd.com>; 
> Jiang, Jerry (SW) <Jerry.Jiang@amd.com>
> Subject: Re: TDR and VRAM lost handling in KMD:
>
> Some jobs don't have a context (VM updates, clears, buffer moves).
>
> I would still like to abort those that were issued before losing VRAM contents, but keep the entity usable.
>
> So I think we should just keep a copy of the VRAM lost counter in the job. That also relieves us of the burden of figuring out the context during job run.
>
> Regards,
> Christian.
>
On 11.10.2017 at 15:35, Liu, Monk wrote:
>> I think just comparing the copy from the context/entity with the
>> current counter is enough; I don't see how it's better to keep another
>> copy in the JOB
>>
>>
>> -----Original Message-----
>> From: Koenig, Christian
>> Sent: Wednesday, October 11, 2017 6:40 PM
>> To: Zhou, David(ChunMing) <David1.Zhou@amd.com>; Liu, Monk 
>> <Monk.Liu@amd.com>; Haehnle, Nicolai <Nicolai.Haehnle@amd.com>; 
>> Olsak, Marek <Marek.Olsak@amd.com>; Deucher, Alexander 
>> <Alexander.Deucher@amd.com>
>> Cc: Ramirez, Alejandro <Alejandro.Ramirez@amd.com>; 
>> amd-gfx@lists.freedesktop.org; Filipas, Mario 
>> <Mario.Filipas@amd.com>; Ding, Pixel <Pixel.Ding@amd.com>; Li, 
>> Bingley <Bingley.Li@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>
>> Subject: Re: TDR and VRAM lost handling in KMD:
>>
>> I've already posted a patch for this on the mailing list.
>>
>> Basically we just copy the vram lost counter into the job and when we try to run the job we can mark it as canceled.
>>
>> Regards,
>> Christian.
>>
On 11.10.2017 at 12:14, Chunming Zhou wrote:
>>> Your summary is missing the issue below:
>>>
>>> What about jobs already pushed into the scheduler queue when VRAM is lost?
>>>
>>>
>>> Regards,
>>> David Zhou
>>> On 2017-10-11 17:41, Liu, Monk wrote:
>>>> Okay, let me summarize our whole idea and see if it works:
>>>>
>>>> 1. For cs_submit, always check vram_lost_counter first and reject
>>>> the submit (return -ECANCELED to UMD) if ctx->vram_lost_counter !=
>>>> adev->vram_lost_counter. That way the VRAM lost issue can be handled.
>>>>
>>>> 2. For cs_submit we still need to check whether the incoming context
>>>> is "AMDGPU_CTX_GUILTY_RESET" or not, even if we found
>>>> ctx->vram_lost_counter == adev->vram_lost_counter, and we can reject
>>>> the submit if it is "AMDGPU_CTX_GUILTY_RESET", correct?
>>>>
>>>> 3. In the gpu_reset() routine, we only mark the hanging job's entity
>>>> as guilty (so we need to add a new member to the entity structure),
>>>> and do not kick it out in the gpu_reset() stage, but we need to set
>>>> the context behind this entity as "AMDGPU_CTX_GUILTY_RESET".
>>>>      And if the reset introduces VRAM loss, we just update
>>>> adev->vram_lost_counter, but *don't* change every entity to guilty,
>>>> so still only the hanging job's entity is "guilty".
>>>>      After some entity is marked as "guilty", we find a way to set the
>>>> context behind it as AMDGPU_CTX_GUILTY_RESET; because this is the U/K
>>>> interface, we need to let UMD know that this context is wrong.
>>>>
>>>> 4. In the GPU scheduler's run_job() routine, since it only reads the
>>>> entity, we skip job scheduling once we find the entity is "guilty".
>>>>
>>>>
>>>> Does the above sound good? A rough sketch of checks 1, 2 and 4 follows.
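>>>>
>>>> For illustration, a hedged sketch under the member names proposed in
>>>> this thread (ctx->guilty and entity->guilty are proposed fields, not
>>>> existing code):
>>>>
>>>>     /* cs_submit: checks 1 and 2 */
>>>>     if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter))
>>>>             return -ECANCELED;
>>>>     if (ctx->guilty)
>>>>             return -ECANCELED;
>>>>
>>>>     /* run_job: check 4, skip scheduling for a guilty entity */
>>>>     if (entity->guilty)
>>>>             return NULL;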
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Haehnle, Nicolai
>>>> Sent: Wednesday, October 11, 2017 5:26 PM
>>>> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian 
>>>> <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; 
>>>> Deucher, Alexander <Alexander.Deucher@amd.com>
>>>> Cc: amd-gfx@lists.freedesktop.org; Ding, Pixel 
>>>> <Pixel.Ding@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, 
>>>> Bingley <Bingley.Li@amd.com>; Ramirez, Alejandro 
>>>> <Alejandro.Ramirez@amd.com>; Filipas, Mario <Mario.Filipas@amd.com>
>>>> Subject: Re: TDR and VRAM lost handling in KMD:
>>>>
>>>> On 11.10.2017 11:18, Liu, Monk wrote:
>>>>> Let's keep it simple: when VRAM loss hits, what's the action for
>>>>> amdgpu_ctx_query()/AMDGPU_CTX_OP_QUERY_STATE on other contexts (not
>>>>> the one that triggered the GPU hang) after VRAM loss? Do you mean we
>>>>> return -ENODEV to UMD?
>>>> It should successfully return AMDGPU_CTX_INNOCENT_RESET.
>>>>
>>>>
>>>>> In cs_submit, with VRAM loss hit, if we don't mark all contexts as
>>>>> "guilty", how do we block them from submitting? Can you show some
>>>>> implementation?
>>>> if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter))
>>>>         return -ECANCELED;
>>>>
>>>> (where ctx->vram_lost_counter is initialized at context creation 
>>>> time and never changed afterwards)
>>>>
>>>>
>>>>> BTW: the "guilty" here is a new member I want to add to the context;
>>>>> it is not related to the AMDGPU_CTX_OP_QUERY_STATE U/K interface.
>>>>> Looks like I need to unify them so there is only one place to mark
>>>>> guilty or not.
>>>> Right, the AMDGPU_CTX_OP_QUERY_STATE handling needs to be made 
>>>> consistent with the rest.
>>>>
>>>> Cheers,
>>>> Nicolai
>>>>
>>>>
>>>>> BR Monk
>>>>>
>>>>> -----Original Message-----
>>>>> From: Haehnle, Nicolai
>>>>> Sent: Wednesday, October 11, 2017 5:00 PM
>>>>> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian 
>>>>> <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; 
>>>>> Deucher, Alexander <Alexander.Deucher@amd.com>
>>>>> Cc: amd-gfx@lists.freedesktop.org; Ding, Pixel 
>>>>> <Pixel.Ding@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; Li, 
>>>>> Bingley <Bingley.Li@amd.com>; Ramirez, Alejandro 
>>>>> <Alejandro.Ramirez@amd.com>; Filipas, Mario 
>>>>> <Mario.Filipas@amd.com>
>>>>> Subject: Re: TDR and VRAM lost handling in KMD:
>>>>>
>>>>> On 11.10.2017 10:48, Liu, Monk wrote:
>>>>>> On "guilty": "guilty" is a term that's used by APIs (e.g. 
>>>>>> OpenGL), so it's reasonable to use it. However, it /does not/ 
>>>>>> make sense to mark idle contexts as "guilty" just because VRAM is 
>>>>>> lost. VRAM lost is a perfect example where the driver should 
>>>>>> report context lost to applications with the "innocent" flag for 
>>>>>> contexts that were idle at the time of reset. The only context(s) 
>>>>>> that should be reported as "guilty"
>>>>>> (or perhaps "unknown" in some cases) are the ones that were 
>>>>>> executing at the time of reset.
>>>>>>
>>>>>> ML: KMD marks all contexts as guilty because that way we can unify
>>>>>> our IOCTL behavior: e.g. the IOCTL only blocks a “guilty” context,
>>>>>> no need to worry about the vram-lost-counter anymore; that’s an
>>>>>> implementation style. I don’t think it is related to the UMD layer.
>>>>>>
>>>>>> For UMD, the gl-context isn’t known to KMD, so UMD can implement its
>>>>>> own “guilty” gl-context if you want.
>>>>> Well, to some extent this is just semantics, but it helps to keep 
>>>>> the terminology consistent.
>>>>>
>>>>> Most importantly, please keep the AMDGPU_CTX_OP_QUERY_STATE uapi in
>>>>> mind: this returns one of
>>>>> AMDGPU_CTX_{GUILTY,INNOCENT,UNKNOWN}_RESET,
>>>>> and it must return "innocent" for contexts that are only lost due to
>>>>> VRAM loss without being otherwise involved in the timeout that led
>>>>> to the reset.
>>>>>
>>>>> The point is that in the places where you used "guilty" it would 
>>>>> be better to use "context lost", and then further differentiate 
>>>>> between guilty/innocent context lost based on the details of what happened.
>>>>>
>>>>>
>>>>>> If KMD doesn’t mark all ctx as guilty after VRAM loss, can you
>>>>>> illustrate what rule KMD should obey for the check in a KMS IOCTL
>>>>>> like cs_submit? Let’s see which way is better.
>>>>> if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter))
>>>>>          return -ECANCELED;
>>>>>
>>>>> Plus similar logic for AMDGPU_CTX_OP_QUERY_STATE.
>>>>>
>>>>> Yes, it's one additional check in cs_submit. If you're worried 
>>>>> about that (and Christian's concerns about possible issues with 
>>>>> walking over all contexts are addressed), I suppose you could just 
>>>>> store a per-context
>>>>>
>>>>>        unsigned context_reset_status;
>>>>>
>>>>> instead of a `bool guilty`. Its value would start out as 0
>>>>> (AMDGPU_CTX_NO_RESET) and would be set to the correct value during 
>>>>> reset.
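>>>>>
>>>>> For example (a sketch; the field placement is assumed, while the
>>>>> AMDGPU_CTX_*_RESET values are from the existing uapi):
>>>>>
>>>>>     /* reset path: only for the context behind the hanging job */
>>>>>     guilty_ctx->context_reset_status = AMDGPU_CTX_GUILTY_RESET;
>>>>>
>>>>>     /* AMDGPU_CTX_OP_QUERY_STATE handler */
>>>>>     if (ctx->context_reset_status != AMDGPU_CTX_NO_RESET)
>>>>>             return ctx->context_reset_status;
>>>>>     if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter))
>>>>>             return AMDGPU_CTX_INNOCENT_RESET;
>>>>>     return AMDGPU_CTX_NO_RESET;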
>>>>>
>>>>> Cheers,
>>>>> Nicolai
>>>>>
>>>>>
>>>>>> *From:*Haehnle, Nicolai
>>>>>> *Sent:* Wednesday, October 11, 2017 4:41 PM
>>>>>> *To:* Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian 
>>>>>> <Christian.Koenig@amd.com>; Olsak, Marek <Marek.Olsak@amd.com>; 
>>>>>> Deucher, Alexander <Alexander.Deucher@amd.com>
>>>>>> *Cc:* amd-gfx@lists.freedesktop.org; Ding, Pixel 
>>>>>> <Pixel.Ding@amd.com>; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>; 
>>>>>> Li, Bingley <Bingley.Li@amd.com>; Ramirez, Alejandro 
>>>>>> <Alejandro.Ramirez@amd.com>; Filipas, Mario 
>>>>>> <Mario.Filipas@amd.com>
>>>>>> *Subject:* Re: TDR and VRAM lost handling in KMD:
>>>>>>
>>>>>> From a Mesa perspective, this almost all sounds reasonable to me.
>>>>>>
>>>>>> On "guilty": "guilty" is a term that's used by APIs (e.g. 
>>>>>> OpenGL), so it's reasonable to use it. However, it /does not/ 
>>>>>> make sense to mark idle contexts as "guilty" just because VRAM is 
>>>>>> lost. VRAM lost is a perfect example where the driver should 
>>>>>> report context lost to applications with the "innocent" flag for 
>>>>>> contexts that were idle at the time of reset. The only context(s) 
>>>>>> that should be reported as "guilty"
>>>>>> (or perhaps "unknown" in some cases) are the ones that were 
>>>>>> executing at the time of reset.
>>>>>>
>>>>>> On whether the whole context is marked as guilty from a user 
>>>>>> space perspective, it would simply be nice for user space to get 
>>>>>> consistent answers. It would be a bit odd if we could e.g. 
>>>>>> succeed in submitting an SDMA job after a GFX job was rejected. 
>>>>>> This would point in favor of marking the entire context as guilty 
>>>>>> (although that could happen lazily instead of at reset time). On 
>>>>>> the other hand, if that's too big a burden for the kernel 
>>>>>> implementation I'm sure we can live without it.
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Nicolai
>>>>>>
>>>>>> ----------------------------------------------------------------------
>>>>>>
>>>>>> *From:*Liu, Monk
>>>>>> *Sent:* Wednesday, October 11, 2017 10:15:40 AM
>>>>>> *To:* Koenig, Christian; Haehnle, Nicolai; Olsak, Marek; Deucher, 
>>>>>> Alexander
>>>>>> *Cc:* amd-gfx@lists.freedesktop.org 
>>>>>> <mailto:amd-gfx@lists.freedesktop.org>; Ding, Pixel; Jiang, Jerry 
>>>>>> (SW); Li, Bingley; Ramirez, Alejandro; Filipas, Mario
>>>>>> *Subject:* RE: TDR and VRAM lost handling in KMD:
>>>>>>
>>>>>> 2. Set its fence error status to “*ETIME*”,
>>>>>>
>>>>>> No, as I already explained ETIME is for synchronous operation.
>>>>>>
>>>>>> In other words when we return ETIME from the wait IOCTL it would 
>>>>>> mean that the waiting has somehow timed out, but not the job we waited for.
>>>>>>
>>>>>> Please use ECANCELED as well, or some other error code if we find
>>>>>> that we need to distinguish the timed-out job from the canceled
>>>>>> ones (probably a good idea, but I'm not sure).
>>>>>>
>>>>>> [ML] I’m okay if you insist not to use ETIME
>>>>>>
>>>>>> 3. Find the entity/ctx behind this job, and set this ctx as “*guilty*”
>>>>>>
>>>>>> Not sure. Do we want to set the whole context as guilty or just 
>>>>>> the entity?
>>>>>>
>>>>>> Setting the whole context as guilty sounds racy to me.
>>>>>>
>>>>>> BTW: We should use a different name than "guilty", maybe just 
>>>>>> "bool canceled;" ?
>>>>>>
>>>>>> [ML] I think context is better than entity, because, for example,
>>>>>> if you only block entity_0 of a context and allow entity_N to run,
>>>>>> the dependencies between entities are broken (e.g. page table
>>>>>> updates in the SDMA entity pass but the gfx submit in the GFX
>>>>>> entity is blocked, which makes no sense to me).
>>>>>>
>>>>>> We’d better either block the whole context or not block it at all…
>>>>>>
>>>>>> 5. Kick out all jobs in this “guilty” ctx’s KFIFO queue, and set
>>>>>> all their fence status to “*ECANCELED*”
>>>>>>
>>>>>> Setting ECANCELED should be ok. But I think we should do this 
>>>>>> when we try to run the jobs and not during GPU reset.
>>>>>>
>>>>>> [ML] Without deep thought and experiment, I’m not sure of the
>>>>>> difference between them, but kicking them out in the gpu_reset
>>>>>> routine is more efficient.
>>>>>>
>>>>>> Otherwise you need to check the context/entity guilty flag in the
>>>>>> run_job routine, and you need to do it for every context/entity; I
>>>>>> don’t see why we shouldn’t just kick all of them out in the
>>>>>> gpu_reset stage….
>>>>>>
>>>>>> b) Iterate over all living ctx, and set all ctx as “*guilty*”,
>>>>>> since VRAM loss actually ruins all VRAM contents
>>>>>>
>>>>>> No, that should instead be done by comparing the counters. Iterating
>>>>>> over all contexts is way too much overhead.
>>>>>>
>>>>>> [ML] Because I want to make the KMS IOCTL rules clean: they don’t
>>>>>> need to differentiate VRAM lost or not, they are only interested in
>>>>>> whether the context is guilty or not, and block submits for the
>>>>>> guilty ones.
>>>>>>
>>>>>> *Can you give more details of your idea? Even better, the detailed
>>>>>> implementation in cs_submit; I want to see how you want to block
>>>>>> submits without checking the context guilty flag.*
>>>>>>
>>>>>> c) Kick out all jobs in all ctx’s KFIFO queues, and set all their
>>>>>> fence status to “*ECANCELED*”
>>>>>>
>>>>>> Yes and no, that should be done when we try to run the jobs and 
>>>>>> not during GPU reset.
>>>>>>
>>>>>> [ML] Again, kicking them out in the gpu reset routine is highly
>>>>>> efficient; otherwise you need to check every job in run_job().
>>>>>>
>>>>>> Besides, can you illustrate the detailed implementation?
>>>>>>
>>>>>> Yes and no, dma_fence_get_status() is some specific handling for 
>>>>>> sync_file debugging (no idea why that made it into the common 
>>>>>> fence code).
>>>>>>
>>>>>> It was replaced by putting the error code directly into the 
>>>>>> fence, so just reading that one after waiting should be ok.
>>>>>>
>>>>>> Maybe we should fix dma_fence_get_status() to do the right thing 
>>>>>> for this?
>>>>>>
>>>>>> [ML] Yeah, that’s too confusing; the name really sounds like the
>>>>>> one I want to use, we should change it…
>>>>>>
>>>>>> *But looking into the implementation, I don’t see why we cannot use
>>>>>> it? It also finally returns the fence->error.*
>>>>>>
>>>>>> *From:*Koenig, Christian
>>>>>> *Sent:* Wednesday, October 11, 2017 3:21 PM
>>>>>> *To:* Liu, Monk <Monk.Liu@amd.com <mailto:Monk.Liu@amd.com>>; 
>>>>>> Haehnle, Nicolai <Nicolai.Haehnle@amd.com 
>>>>>> <mailto:Nicolai.Haehnle@amd.com>>;
>>>>>> Olsak, Marek <Marek.Olsak@amd.com <mailto:Marek.Olsak@amd.com>>; 
>>>>>> Deucher, Alexander <Alexander.Deucher@amd.com 
>>>>>> <mailto:Alexander.Deucher@amd.com>>
>>>>>> *Cc:* amd-gfx@lists.freedesktop.org 
>>>>>> <mailto:amd-gfx@lists.freedesktop.org>; Ding, Pixel 
>>>>>> <Pixel.Ding@amd.com <mailto:Pixel.Ding@amd.com>>; Jiang, Jerry
>>>>>> (SW) <Jerry.Jiang@amd.com <mailto:Jerry.Jiang@amd.com>>; Li, 
>>>>>> Bingley <Bingley.Li@amd.com <mailto:Bingley.Li@amd.com>>; 
>>>>>> Ramirez, Alejandro <Alejandro.Ramirez@amd.com 
>>>>>> <mailto:Alejandro.Ramirez@amd.com>>;
>>>>>> Filipas, Mario <Mario.Filipas@amd.com 
>>>>>> <mailto:Mario.Filipas@amd.com>>
>>>>>> *Subject:* Re: TDR and VRAM lost handling in KMD:
>>>>>>
>>>>>> See inline:
>>>>>>
>>>>>> On 11.10.2017 at 07:33, Liu, Monk wrote:
>>>>>>
>>>>>>         Hi Christian & Nicolai,
>>>>>>
>>>>>>         We need to reach agreement on what MESA/UMD should do and
>>>>>>         what KMD should do; *please give your comments with “okay”
>>>>>>         or “No” and your ideas on the items below:*
>>>>>>
>>>>>>         - When a job times out (set from the lockup_timeout kernel
>>>>>>         parameter), what should KMD do in the TDR routine:
>>>>>>
>>>>>>         1. Update adev->*gpu_reset_counter*, and stop the scheduler
>>>>>>         first (*gpu_reset_counter* is used to force a VM flush after
>>>>>>         GPU reset; out of this thread’s scope, so no more discussion
>>>>>>         on it)
>>>>>>
>>>>>> Okay.
>>>>>>
>>>>>>         2. Set its fence error status to “*ETIME*”,
>>>>>>
>>>>>> No, as I already explained ETIME is for synchronous operation.
>>>>>>
>>>>>> In other words when we return ETIME from the wait IOCTL it would 
>>>>>> mean that the waiting has somehow timed out, but not the job we waited for.
>>>>>>
>>>>>> Please use ECANCELED as well, or some other error code if we find
>>>>>> that we need to distinguish the timed-out job from the canceled
>>>>>> ones (probably a good idea, but I'm not sure).
>>>>>>
>>>>>>         3. Find the entity/ctx behind this job, and set this ctx as
>>>>>> “*guilty*”
>>>>>>
>>>>>> Not sure. Do we want to set the whole context as guilty or just 
>>>>>> the entity?
>>>>>>
>>>>>> Setting the whole context as guilty sounds racy to me.
>>>>>>
>>>>>> BTW: We should use a different name than "guilty", maybe just 
>>>>>> "bool canceled;" ?
>>>>>>
>>>>>>         4. Kick out this job from the scheduler’s mirror list, so
>>>>>>         this job won’t get re-scheduled to the ring anymore.
>>>>>>
>>>>>> Okay.
>>>>>>
>>>>>>         5. Kick out all jobs in this “guilty” ctx’s KFIFO queue, and
>>>>>>         set all their fence status to “*ECANCELED*”
>>>>>>
>>>>>> Setting ECANCELED should be ok. But I think we should do this 
>>>>>> when we try to run the jobs and not during GPU reset.
>>>>>>
>>>>>>         6. Force-signal all fences that get kicked out by the above
>>>>>>         two steps, *otherwise UMD will block forever if waiting on
>>>>>>         those fences*
>>>>>>
>>>>>> Okay.
>>>>>>
>>>>>>         7. Do the GPU reset, which can be a set of callbacks so that
>>>>>>         bare-metal and SR-IOV can each implement it in their
>>>>>>         preferred style
>>>>>>
>>>>>> Okay.
>>>>>>
>>>>>>         8. After reset, KMD needs to be aware of whether VRAM loss
>>>>>>         happened or not; bare-metal can implement some function to
>>>>>>         judge, while for SR-IOV I prefer to read it from the GIM side
>>>>>>         (for the initial version we consider it always VRAM lost,
>>>>>>         until the GIM-side change is aligned)
>>>>>>
>>>>>> Okay.
>>>>>>
>>>>>>         9. If VRAM loss was not hit, continue; otherwise:
>>>>>>
>>>>>>         a) Update adev->*vram_lost_counter*,
>>>>>>
>>>>>> Okay.
>>>>>>
>>>>>>         b) Iterate over all living ctx, and set all ctx as
>>>>>>         “*guilty*”, since VRAM loss actually ruins all VRAM contents
>>>>>>
>>>>>> No, that should instead be done by comparing the counters. Iterating
>>>>>> over all contexts is way too much overhead.
>>>>>>
>>>>>>         c) Kick out all jobs in all ctx’s KFIFO queues, and set all
>>>>>>         their fence status to “*ECANCELED*”
>>>>>>
>>>>>> Yes and no, that should be done when we try to run the jobs and 
>>>>>> not during GPU reset.
>>>>>>
>>>>>>         10. Do GTT recovery and VRAM page tables/entries recovery
>>>>>>         (optional, do we need it???)
>>>>>>
>>>>>> Yes, that is still needed. As Nicolai explained we can't be sure 
>>>>>> that VRAM is still 100% correct even when it isn't cleared.
>>>>>>
>>>>>>         11. Re-schedule all jobs remaining in the mirror list to the
>>>>>>         ring again and restart the scheduler (for the VRAM lost case,
>>>>>>         no job will be re-scheduled)
>>>>>>
>>>>>> Okay.
>>>>>>
>>>>>>         - For the cs_wait() IOCTL:
>>>>>>
>>>>>>         After it finds the fence signaled, it should check with
>>>>>>         *“dma_fence_get_status”* to see if there is an error there,
>>>>>>
>>>>>>         and return the error status of the fence
>>>>>>
>>>>>> Yes and no, dma_fence_get_status() is some specific handling for 
>>>>>> sync_file debugging (no idea why that made it into the common 
>>>>>> fence code).
>>>>>>
>>>>>> It was replaced by putting the error code directly into the 
>>>>>> fence, so just reading that one after waiting should be ok.
>>>>>>
>>>>>> Maybe we should fix dma_fence_get_status() to do the right thing 
>>>>>> for this?
>>>>>>
>>>>>>         - For the cs_wait_fences() IOCTL:
>>>>>>
>>>>>>         Similar to the above approach
>>>>>>
>>>>>>         - For the cs_submit() IOCTL:
>>>>>>
>>>>>>         It needs to check whether the current ctx has been marked as
>>>>>>         “*guilty*” and return “*ECANCELED*” if so
>>>>>>
>>>>>>         - Introduce a new IOCTL to let UMD query *vram_lost_counter*:
>>>>>>
>>>>>>         This way, UMD can also block the app from submitting. Like
>>>>>>         @Nicolai mentioned, we can cache one copy of
>>>>>>         *vram_lost_counter* when enumerating the physical device, and
>>>>>>         deny all gl-contexts from submitting if the queried counter
>>>>>>         is bigger than the one cached in the physical device (looks a
>>>>>>         little overkill to me, but easy to implement).
>>>>>>
>>>>>>         UMD can also return an error to the app when creating a
>>>>>>         gl-context if the currently queried *vram_lost_counter* is
>>>>>>         bigger than the one cached in the physical device.
>>>>>>
>>>>>> Okay. Already have a patch for this, please review that one if 
>>>>>> you haven't already done so.
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>>         BTW: I realized that a gl-context is a little different from
>>>>>>         the kernel’s context, because for the kernel a BO is not
>>>>>>         related to a context but only to an FD, while in UMD a BO has
>>>>>>         a backing gl-context. So blocking submission in the UMD layer
>>>>>>         is also needed, although KMD will do its job as the bottom
>>>>>>         line.
>>>>>>
>>>>>>         - Basically, “vram_lost_counter” is exposed by the kernel to
>>>>>>         let UMD take control of the robustness extension feature; it
>>>>>>         will be UMD’s call to make, and KMD only denies “guilty”
>>>>>>         contexts from submitting
>>>>>>
>>>>>>         Need your feedback, thx
>>>>>>
>>>>>>         We’d better get the TDR feature landed ASAP
>>>>>>
>>>>>>         BR Monk
>>>>>>
>>>> _______________________________________________
>>>> amd-gfx mailing list
>>>> amd-gfx@lists.freedesktop.org
>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2017-10-11 14:04 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-10-11  5:33 TDR and VRAM lost handling in KMD: Liu, Monk
     [not found] ` <BLUPR12MB0449785160E34EA9369C5E23844A0-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-10-11  7:14   ` Liu, Monk
2017-10-11  7:20   ` Christian König
     [not found]     ` <b5c5f6c9-07e2-4688-8ffc-3929bfc59366-5C7GfCeVMHo@public.gmane.org>
2017-10-11  8:15       ` Liu, Monk
     [not found]         ` <BLUPR12MB044911DFCB510022605DD38A844A0-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-10-11  8:40           ` Haehnle, Nicolai
     [not found]             ` <DM5PR12MB1292D21FC5438AEA8FCF9F64FF4A0-2J9CzHegvk82qrKJuDAMhQdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-10-11  8:48               ` Liu, Monk
     [not found]                 ` <BLUPR12MB0449287A92DF8D3EB30BE6A6844A0-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-10-11  8:59                   ` Nicolai Hähnle
     [not found]                     ` <28d64011-fd90-07fb-d95d-48286ecbdcc5-5C7GfCeVMHo@public.gmane.org>
2017-10-11  9:18                       ` Liu, Monk
     [not found]                         ` <BLUPR12MB044914F3A7B5D3D316481A7A844A0-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-10-11  9:25                           ` Nicolai Hähnle
     [not found]                             ` <6876e153-7e98-66ac-7338-5601cf83c633-5C7GfCeVMHo@public.gmane.org>
2017-10-11  9:41                               ` Liu, Monk
     [not found]                                 ` <BLUPR12MB044907C2C72DD8BEB1D5BE3B844A0-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-10-11 10:14                                   ` Chunming Zhou
     [not found]                                     ` <8c4e849f-9227-12bc-9d2e-3daf60fcd762-5C7GfCeVMHo@public.gmane.org>
2017-10-11 10:39                                       ` Christian König
     [not found]                                         ` <0c198ba6-b853-c26a-7fb4-bcc0344fdea0-5C7GfCeVMHo@public.gmane.org>
2017-10-11 13:35                                           ` Liu, Monk
     [not found]                                             ` <BLUPR12MB04490BE33EC2E851228E25F4844A0-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-10-11 13:39                                               ` Christian König
     [not found]                                                 ` <d9274bc6-27e8-f6c4-0851-4240bde72452-5C7GfCeVMHo@public.gmane.org>
2017-10-11 13:51                                                   ` Liu, Monk
     [not found]                                                     ` <BLUPR12MB044961DEE4326E94A156ED05844A0-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-10-11 13:59                                                       ` Liu, Monk
     [not found]                                                         ` <BLUPR12MB04497EDD5AE48484E7A18C2F844A0-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-10-11 14:04                                                           ` Christian König
2017-10-11 14:03                                                       ` Christian König
     [not found]                                                         ` <02bb9f77-bcc6-8a24-e9b0-8f3f260d74d8-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2017-10-11 14:04                                                           ` Liu, Monk
2017-10-11 13:27                                       ` Liu, Monk
2017-10-11  9:02                   ` Christian König
     [not found]                     ` <7a7a1830-5457-ea68-44dc-f88eb1e0a8fe-5C7GfCeVMHo@public.gmane.org>
2017-10-11  9:16                       ` Nicolai Hähnle
2017-10-11  9:27                       ` Liu, Monk

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.