* Failed to find memory space for buffer eviction
@ 2020-07-14  0:44 Felix Kuehling
  2020-07-14  8:28 ` Christian König
  0 siblings, 1 reply; 9+ messages in thread
From: Felix Kuehling @ 2020-07-14  0:44 UTC (permalink / raw)
  To: amd-gfx list, Koenig, Christian

I'm running into this problem with the KFD EvictionTest. The log snippet
below looks like it ran out of GTT space for the eviction of a 64MB
buffer. But then it dumps the used and free space and shows plenty of
free space.

As I understand it, the per-page breakdown of used and free space shown
by TTM is the GART space. So it's not very meaningful.

What matters more is the GTT space managed by amdgpu_gtt_mgr.c. And
that's where the problem is. It keeps track of available GTT space with
an atomic counter in amdgpu_gtt_mgr.available. It gets decremented in
amdgpu_gtt_mgr_new and incremented in amdgpu_gtt_mgr_del. The trouble
is that TTM doesn't call the latter for ttm_mem_regs that don't have an
mm_node:

> void ttm_bo_mem_put(struct ttm_buffer_object *bo, struct ttm_mem_reg *mem)
> {
>         struct ttm_mem_type_manager *man = &bo->bdev->man[mem->mem_type];
>
>         if (mem->mm_node)
>                 (*man->func->put_node)(man, mem);
> }
GTT BOs that don't have GART space allocated don't have an mm_node. So
the amdgpu_gtt_mgr.available counter doesn't get incremented when an
unmapped GTT BO is freed, and the available GTT space eventually runs out.

Now I know what the problem is, but I don't know how to fix it. Maybe a
dummy mm_node for unmapped GTT BOs, to trick TTM into calling our
put_node callback? Or a change in TTM to call put_node unconditionally?

Regards,
  Felix


[  360.082552] [TTM] Failed to find memory space for buffer
0x00000000264c823c eviction
[  360.090331] [TTM]  No space for 00000000264c823c (16384 pages,
65536K, 64M)
[  360.090334] [TTM]    placement[0]=0x00010002 (1)
[  360.090336] [TTM]      has_type: 1
[  360.090337] [TTM]      use_type: 1
[  360.090339] [TTM]      flags: 0x0000000A
[  360.090341] [TTM]      gpu_offset: 0xFF00000000
[  360.090342] [TTM]      size: 1048576
[  360.090344] [TTM]      available_caching: 0x00070000
[  360.090346] [TTM]      default_caching: 0x00010000
[  360.090349] [TTM]  0x0000000000000400-0x0000000000000402: 2: used
[  360.090352] [TTM]  0x0000000000000402-0x0000000000000404: 2: used
[  360.090354] [TTM]  0x0000000000000404-0x0000000000000406: 2: used
[  360.090355] [TTM]  0x0000000000000406-0x0000000000000408: 2: used
[  360.090357] [TTM]  0x0000000000000408-0x000000000000040a: 2: used
[  360.090359] [TTM]  0x000000000000040a-0x000000000000040c: 2: used
[  360.090361] [TTM]  0x000000000000040c-0x000000000000040e: 2: used
[  360.090363] [TTM]  0x000000000000040e-0x0000000000000410: 2: used
[  360.090365] [TTM]  0x0000000000000410-0x0000000000000412: 2: used
[  360.090367] [TTM]  0x0000000000000412-0x0000000000000414: 2: used
[  360.090368] [TTM]  0x0000000000000414-0x0000000000000415: 1: used
[  360.090370] [TTM]  0x0000000000000415-0x0000000000000515: 256: used
[  360.090372] [TTM]  0x0000000000000515-0x0000000000000516: 1: used
[  360.090374] [TTM]  0x0000000000000516-0x0000000000000517: 1: used
[  360.090376] [TTM]  0x0000000000000517-0x0000000000000518: 1: used
[  360.090378] [TTM]  0x0000000000000518-0x0000000000000519: 1: used
[  360.090379] [TTM]  0x0000000000000519-0x000000000000051a: 1: used
[  360.090381] [TTM]  0x000000000000051a-0x000000000000051b: 1: used
[  360.090383] [TTM]  0x000000000000051b-0x000000000000051c: 1: used
[  360.090385] [TTM]  0x000000000000051c-0x000000000000051d: 1: used
[  360.090387] [TTM]  0x000000000000051d-0x000000000000051f: 2: used
[  360.090389] [TTM]  0x000000000000051f-0x0000000000000521: 2: used
[  360.090391] [TTM]  0x0000000000000521-0x0000000000000522: 1: used
[  360.090392] [TTM]  0x0000000000000522-0x0000000000000523: 1: used
[  360.090394] [TTM]  0x0000000000000523-0x0000000000000524: 1: used
[  360.090396] [TTM]  0x0000000000000524-0x0000000000000525: 1: used
[  360.090398] [TTM]  0x0000000000000525-0x0000000000000625: 256: used
[  360.090400] [TTM]  0x0000000000000625-0x0000000000000725: 256: used
[  360.090402] [TTM]  0x0000000000000725-0x0000000000000727: 2: used
[  360.090404] [TTM]  0x0000000000000727-0x00000000000007c0: 153: used
[  360.090406] [TTM]  0x00000000000007c0-0x0000000000000b8a: 970: used
[  360.090407] [TTM]  0x0000000000000b8a-0x0000000000000b8b: 1: used
[  360.090409] [TTM]  0x0000000000000b8b-0x0000000000000bcb: 64: used
[  360.090411] [TTM]  0x0000000000000bcb-0x0000000000000bcd: 2: used
[  360.090413] [TTM]  0x0000000000000bcd-0x0000000000040000: 259123: free
[  360.090415] [TTM]  total: 261120, used 1997 free 259123
[  360.090417] [TTM]  man size:1048576 pages, gtt available:14371 pages,
usage:4039MB


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


* Re: Failed to find memory space for buffer eviction
  2020-07-14  0:44 Failed to find memory space for buffer eviction Felix Kuehling
@ 2020-07-14  8:28 ` Christian König
  2020-07-15  2:49   ` Felix Kuehling
  0 siblings, 1 reply; 9+ messages in thread
From: Christian König @ 2020-07-14  8:28 UTC (permalink / raw)
  To: Felix Kuehling, amd-gfx list

Hi Felix,

yes I already stumbled over this as well quite recently.

See the following patch which I pushed to drm-misc-next just yesterday:

commit e04be2310b5eac683ec03b096c0e22c4c2e23593
Author: Christian König <christian.koenig@amd.com>
Date:   Mon Jul 6 17:32:55 2020 +0200

     drm/ttm: further cleanup ttm_mem_reg handling

     Stop touching the backend private pointer alltogether and
     make sure we never put the same mem twice by.

     Signed-off-by: Christian König <christian.koenig@amd.com>
     Reviewed-by: Madhav Chauhan <madhav.chauhan@amd.com>
     Link: https://patchwork.freedesktop.org/patch/375613/


But this shouldn't have been problematic since we used a dummy value for 
mem->mm_node in this case.

What could be problematic and result in an overrun is that TTM was buggy
and called put_node twice for the same memory.

So I've seen that the code needs fixing as well, but I'm not 100% sure 
how you ran into your problem.

Regards,
Christian.

Am 14.07.20 um 02:44 schrieb Felix Kuehling:
> I'm running into this problem with the KFD EvictionTest. The log snippet
> below looks like it ran out of GTT space for the eviction of a 64MB
> buffer. But then it dumps the used and free space and shows plenty of
> free space.
>
> As I understand it, the per-page breakdown of used and free space shown
> by TTM is the GART space. So it's not very meaningful.
>
> What matters more is the GTT space managed by amdgpu_gtt_mgr.c. And
> that's where the problem is. It keeps track of available GTT space with
> an atomic counter in amdgpu_gtt_mgr.available. It gets decremented in
> amdgpu_gtt_mgr_new and incremented in amdgpu_gtt_mgr_del. The trouble
> is that TTM doesn't call the latter for ttm_mem_regs that don't have an
> mm_node:
>
>> void ttm_bo_mem_put(struct ttm_buffer_object *bo, struct ttm_mem_reg *mem)
>> {
>>          struct ttm_mem_type_manager *man = &bo->bdev->man[mem->mem_type];
>>
>>          if (mem->mm_node)
>>                  (*man->func->put_node)(man, mem);
>> }
> GTT BOs that don't have GART space allocated don't have an mm_node. So
> the amdgpu_gtt_mgr.available counter doesn't get incremented when an
> unmapped GTT BO is freed, and the available GTT space eventually runs out.
>
> Now I know what the problem is, but I don't know how to fix it. Maybe a
> dummy mm_node for unmapped GTT BOs, to trick TTM into calling our
> put_node callback? Or a change in TTM to call put_node unconditionally?
>
> Regards,
>    Felix
>
>
> [log snipped]


* Re: Failed to find memory space for buffer eviction
  2020-07-14  8:28 ` Christian König
@ 2020-07-15  2:49   ` Felix Kuehling
  2020-07-15  9:28     ` Christian König
  0 siblings, 1 reply; 9+ messages in thread
From: Felix Kuehling @ 2020-07-15  2:49 UTC (permalink / raw)
  To: Christian König, amd-gfx list

Am 2020-07-14 um 4:28 a.m. schrieb Christian König:
> Hi Felix,
>
> yes I already stumbled over this as well quite recently.
>
> See the following patch which I pushed to drm-misc-next just yesterday:
>
> commit e04be2310b5eac683ec03b096c0e22c4c2e23593
> Author: Christian König <christian.koenig@amd.com>
> Date:   Mon Jul 6 17:32:55 2020 +0200
>
>     drm/ttm: further cleanup ttm_mem_reg handling
>
>     Stop touching the backend private pointer alltogether and
>     make sure we never put the same mem twice by.
>
>     Signed-off-by: Christian König <christian.koenig@amd.com>
>     Reviewed-by: Madhav Chauhan <madhav.chauhan@amd.com>
>     Link: https://patchwork.freedesktop.org/patch/375613/
>
>
> But this shouldn't have been problematic since we used a dummy value
> for mem->mm_node in this case.

Hmm, yeah, I was reading the code wrong. It's possible that I was really
just out of GTT space. But see below.


>
>
> What could be problematic and result in an overrun is that TTM was
> buggy and called put_node twice for the same memory.
>
> So I've seen that the code needs fixing as well, but I'm not 100% sure
> how you ran into your problem.

This is in the KFD eviction test, which deliberately overcommits VRAM in
order to trigger lots of evictions. It will use some GTT space while BOs
are evicted. But shouldn't it move them further out of GTT and into
SYSTEM to free up GTT space?

Your change "further cleanup ttm_mem_reg handling" removes a
mem->mm_node = NULL in ttm_bo_handle_move_mem in exactly the case where
a BO is moved from GTT to SYSTEM. I think that leads to a later put_node
call not happening or amdgpu_gtt_mgr_del returning before incrementing
mgr->available.

I can try whether cherry-picking your two fixes helps with the eviction test.

Regards,
  Felix


>
> Regards,
> Christian.
>
> Am 14.07.20 um 02:44 schrieb Felix Kuehling:
>> I'm running into this problem with the KFD EvictionTest. The log snippet
>> below looks like it ran out of GTT space for the eviction of a 64MB
>> buffer. But then it dumps the used and free space and shows plenty of
>> free space.
>>
>> As I understand it, the per-page breakdown of used and free space shown
>> by TTM is the GART space. So it's not very meaningful.
>>
>> What matters more is the GTT space managed by amdgpu_gtt_mgr.c. And
>> that's where the problem is. It keeps track of available GTT space with
>> an atomic counter in amdgpu_gtt_mgr.available. It gets decremented in
>> amdgpu_gtt_mgr_new and incremented in amdgpu_gtt_mgr_del. The trouble
>> is that TTM doesn't call the latter for ttm_mem_regs that don't have an
>> mm_node:
>>
>>> void ttm_bo_mem_put(struct ttm_buffer_object *bo, struct ttm_mem_reg
>>> *mem)
>>> {
>>>          struct ttm_mem_type_manager *man =
>>> &bo->bdev->man[mem->mem_type];
>>>
>>>          if (mem->mm_node)
>>>                  (*man->func->put_node)(man, mem);
>>> }
>> GTT BOs that don't have GART space allocated don't have an mm_node. So
>> the amdgpu_gtt_mgr.available counter doesn't get incremented when an
>> unmapped GTT BO is freed, and the available GTT space eventually runs out.
>>
>> Now I know what the problem is, but I don't know how to fix it. Maybe a
>> dummy mm_node for unmapped GTT BOs, to trick TTM into calling our
>> put_node callback? Or a change in TTM to call put_node unconditionally?
>>
>> Regards,
>>    Felix
>>
>>
>> [log snipped]
>

* Re: Failed to find memory space for buffer eviction
  2020-07-15  2:49   ` Felix Kuehling
@ 2020-07-15  9:28     ` Christian König
  2020-07-15 14:24       ` Deucher, Alexander
  2020-07-15 15:14       ` Felix Kuehling
  0 siblings, 2 replies; 9+ messages in thread
From: Christian König @ 2020-07-15  9:28 UTC (permalink / raw)
  To: Felix Kuehling, Christian König, amd-gfx list

Am 15.07.20 um 04:49 schrieb Felix Kuehling:
> Am 2020-07-14 um 4:28 a.m. schrieb Christian König:
>> Hi Felix,
>>
>> yes I already stumbled over this as well quite recently.
>>
>> See the following patch which I pushed to drm-misc-next just yesterday:
>>
>> commit e04be2310b5eac683ec03b096c0e22c4c2e23593
>> Author: Christian König <christian.koenig@amd.com>
>> Date:   Mon Jul 6 17:32:55 2020 +0200
>>
>>      drm/ttm: further cleanup ttm_mem_reg handling
>>
>>      Stop touching the backend private pointer alltogether and
>>      make sure we never put the same mem twice by.
>>
>>      Signed-off-by: Christian König <christian.koenig@amd.com>
>>      Reviewed-by: Madhav Chauhan <madhav.chauhan@amd.com>
>>      Link: https://patchwork.freedesktop.org/patch/375613/
>>
>>
>> But this shouldn't have been problematic since we used a dummy value
>> for mem->mm_node in this case.
> Hmm, yeah, I was reading the code wrong. It's possible that I was really
> just out of GTT space. But see below.

It looks like it, yes.

>> What could be problematic and result in an overrun is that TTM was
>> buggy and called put_node twice for the same memory.
>>
>> So I've seen that the code needs fixing as well, but I'm not 100% sure
>> how you ran into your problem.
> This is in the KFD eviction test, which deliberately overcommits VRAM in
> order to trigger lots of evictions. It will use some GTT space while BOs
> are evicted. But shouldn't it move them further out of GTT and into
> SYSTEM to free up GTT space?

Yes, exactly that should happen.

But for some reason it couldn't find a candidate to evict, and the 14371
pages left are just a bit too small for the buffer.

Regards,
Christian.

> Your change "further cleanup ttm_mem_reg handling" removes a
> mem->mm_node = NULL in ttm_bo_handle_move_mem in exactly the case where
> a BO is moved from GTT to SYSTEM. I think that leads to a later put_node
> call not happening or amdgpu_gtt_mgr_del returning before incrementing
> mgr->available.
>
> I can try whether cherry-picking your two fixes helps with the eviction test.
>
> Regards,
>    Felix
>
>
>> Regards,
>> Christian.
>>
>> Am 14.07.20 um 02:44 schrieb Felix Kuehling:
>>> I'm running into this problem with the KFD EvictionTest. The log snippet
>>> below looks like it ran out of GTT space for the eviction of a 64MB
>>> buffer. But then it dumps the used and free space and shows plenty of
>>> free space.
>>>
>>> As I understand it, the per-page breakdown of used and free space shown
>>> by TTM is the GART space. So it's not very meaningful.
>>>
>>> What matters more is the GTT space managed by amdgpu_gtt_mgr.c. And
>>> that's where the problem is. It keeps track of available GTT space with
>>> an atomic counter in amdgpu_gtt_mgr.available. It gets decremented in
>>> amdgpu_gtt_mgr_new and incremented in amdgpu_gtt_mgr_del. The trouble
>>> is that TTM doesn't call the latter for ttm_mem_regs that don't have an
>>> mm_node:
>>>
>>>> void ttm_bo_mem_put(struct ttm_buffer_object *bo, struct ttm_mem_reg
>>>> *mem)
>>>> {
>>>>           struct ttm_mem_type_manager *man =
>>>> &bo->bdev->man[mem->mem_type];
>>>>
>>>>           if (mem->mm_node)
>>>>                   (*man->func->put_node)(man, mem);
>>>> }
>>> GTT BOs that don't have GART space allocated don't have an mm_node. So
>>> the amdgpu_gtt_mgr.available counter doesn't get incremented when an
>>> unmapped GTT BO is freed, and the available GTT space eventually runs out.
>>>
>>> Now I know what the problem is, but I don't know how to fix it. Maybe a
>>> dummy mm_node for unmapped GTT BOs, to trick TTM into calling our
>>> put_node callback? Or a change in TTM to call put_node unconditionally?
>>>
>>> Regards,
>>>     Felix
>>>
>>>
>>> [log snipped]

* Re: Failed to find memory space for buffer eviction
  2020-07-15  9:28     ` Christian König
@ 2020-07-15 14:24       ` Deucher, Alexander
  2020-07-15 15:14       ` Felix Kuehling
  1 sibling, 0 replies; 9+ messages in thread
From: Deucher, Alexander @ 2020-07-15 14:24 UTC (permalink / raw)
  To: Kuehling, Felix, Koenig, Christian, amd-gfx list



Maybe we should re-test the problematic piglit test and if it's no longer an issue, revert:

commit 24562523688bebc7ec17a88271b4e8c3fc337b74
Author: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Date:   Fri Dec 15 12:09:16 2017 -0500

    Revert "drm/amd/amdgpu: set gtt size according to system memory size only"

    This reverts commit ba851eed895c76be0eb4260bdbeb7e26f9ccfaa2.
    With that change piglit max size tests (running with -t max.*size) are causing
    OOM and hard hang on my CZ with 1GB RAM.

    Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
    Acked-by: Alex Deucher <alexander.deucher@amd.com>
    Reviewed-by: Christian König <christian.koenig@amd.com>
    Reviewed-by: Roger He <Hongbo.He@amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

________________________________
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> on behalf of Christian König <ckoenig.leichtzumerken@gmail.com>
Sent: Wednesday, July 15, 2020 5:28 AM
To: Kuehling, Felix <Felix.Kuehling@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; amd-gfx list <amd-gfx@lists.freedesktop.org>
Subject: Re: Failed to find memory space for buffer eviction

Am 15.07.20 um 04:49 schrieb Felix Kuehling:
> Am 2020-07-14 um 4:28 a.m. schrieb Christian König:
>> Hi Felix,
>>
>> yes I already stumbled over this as well quite recently.
>>
>> See the following patch which I pushed to drm-misc-next just yesterday:
>>
>> commit e04be2310b5eac683ec03b096c0e22c4c2e23593
>> Author: Christian König <christian.koenig@amd.com>
>> Date:   Mon Jul 6 17:32:55 2020 +0200
>>
>>      drm/ttm: further cleanup ttm_mem_reg handling
>>
>>      Stop touching the backend private pointer alltogether and
>>      make sure we never put the same mem twice by.
>>
>>      Signed-off-by: Christian König <christian.koenig@amd.com>
>>      Reviewed-by: Madhav Chauhan <madhav.chauhan@amd.com>
>>      Link: https://patchwork.freedesktop.org/patch/375613/
>>
>>
>> But this shouldn't have been problematic since we used a dummy value
>> for mem->mm_node in this case.
> Hmm, yeah, I was reading the code wrong. It's possible that I was really
> just out of GTT space. But see below.

It looks like it, yes.

>> What could be problematic and result in an overrun is that TTM was
>> buggy and called put_node twice for the same memory.
>>
>> So I've seen that the code needs fixing as well, but I'm not 100% sure
>> how you ran into your problem.
> This is in the KFD eviction test, which deliberately overcommits VRAM in
> order to trigger lots of evictions. It will use some GTT space while BOs
> are evicted. But shouldn't it move them further out of GTT and into
> SYSTEM to free up GTT space?

Yes, exactly that should happen.

But for some reason it couldn't find a candidate to evict, and the 14371
pages left are just a bit too small for the buffer.

Regards,
Christian.

> Your change "further cleanup ttm_mem_reg handling" removes a
> mem->mm_node = NULL in ttm_bo_handle_move_mem in exactly the case where
> a BO is moved from GTT to SYSTEM. I think that leads to a later put_node
> call not happening or amdgpu_gtt_mgr_del returning before incrementing
> mgr->available.
>
> I can try whether cherry-picking your two fixes helps with the eviction test.
>
> Regards,
>    Felix
>
>
>> Regards,
>> Christian.
>>
>> Am 14.07.20 um 02:44 schrieb Felix Kuehling:
>>> I'm running into this problem with the KFD EvictionTest. The log snippet
>>> below looks like it ran out of GTT space for the eviction of a 64MB
>>> buffer. But then it dumps the used and free space and shows plenty of
>>> free space.
>>>
>>> As I understand it, the per-page breakdown of used and free space shown
>>> by TTM is the GART space. So it's not very meaningful.
>>>
>>> What matters more is the GTT space managed by amdgpu_gtt_mgr.c. And
>>> that's where the problem is. It keeps track of available GTT space with
>>> an atomic counter in amdgpu_gtt_mgr.available. It gets decremented in
>>> amdgpu_gtt_mgr_new and incremented in amdgpu_gtt_mgr_del. The trouble
>>> is that TTM doesn't call the latter for ttm_mem_regs that don't have an
>>> mm_node:
>>>
>>>> void ttm_bo_mem_put(struct ttm_buffer_object *bo, struct ttm_mem_reg
>>>> *mem)
>>>> {
>>>>           struct ttm_mem_type_manager *man =
>>>> &bo->bdev->man[mem->mem_type];
>>>>
>>>>           if (mem->mm_node)
>>>>                   (*man->func->put_node)(man, mem);
>>>> }
>>> GTT BOs that don't have GART space allocated don't have an mm_node. So
>>> the amdgpu_gtt_mgr.available counter doesn't get incremented when an
>>> unmapped GTT BO is freed, and the available GTT space eventually runs out.
>>>
>>> Now I know what the problem is, but I don't know how to fix it. Maybe a
>>> dummy mm_node for unmapped GTT BOs, to trick TTM into calling our
>>> put_node callback? Or a change in TTM to call put_node unconditionally?
>>>
>>> Regards,
>>>     Felix
>>>
>>>
>>> [log snipped]
>>> [  360.090391] [TTM]  0x0000000000000521-0x0000000000000522: 1: used
>>> [  360.090392] [TTM]  0x0000000000000522-0x0000000000000523: 1: used
>>> [  360.090394] [TTM]  0x0000000000000523-0x0000000000000524: 1: used
>>> [  360.090396] [TTM]  0x0000000000000524-0x0000000000000525: 1: used
>>> [  360.090398] [TTM]  0x0000000000000525-0x0000000000000625: 256: used
>>> [  360.090400] [TTM]  0x0000000000000625-0x0000000000000725: 256: used
>>> [  360.090402] [TTM]  0x0000000000000725-0x0000000000000727: 2: used
>>> [  360.090404] [TTM]  0x0000000000000727-0x00000000000007c0: 153: used
>>> [  360.090406] [TTM]  0x00000000000007c0-0x0000000000000b8a: 970: used
>>> [  360.090407] [TTM]  0x0000000000000b8a-0x0000000000000b8b: 1: used
>>> [  360.090409] [TTM]  0x0000000000000b8b-0x0000000000000bcb: 64: used
>>> [  360.090411] [TTM]  0x0000000000000bcb-0x0000000000000bcd: 2: used
>>> [  360.090413] [TTM]  0x0000000000000bcd-0x0000000000040000: 259123:
>>> free
>>> [  360.090415] [TTM]  total: 261120, used 1997 free 259123
>>> [  360.090417] [TTM]  man size:1048576 pages, gtt available:14371 pages,
>>> usage:4039MB
>>>
>>>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Failed to find memory space for buffer eviction
  2020-07-15  9:28     ` Christian König
  2020-07-15 14:24       ` Deucher, Alexander
@ 2020-07-15 15:14       ` Felix Kuehling
  2020-07-16  6:58         ` Christian König
  1 sibling, 1 reply; 9+ messages in thread
From: Felix Kuehling @ 2020-07-15 15:14 UTC (permalink / raw)
  To: christian.koenig, amd-gfx list


Am 2020-07-15 um 5:28 a.m. schrieb Christian König:
> Am 15.07.20 um 04:49 schrieb Felix Kuehling:
>> Am 2020-07-14 um 4:28 a.m. schrieb Christian König:
>>> Hi Felix,
>>>
>>> yes I already stumbled over this as well quite recently.
>>>
>>> See the following patch which I pushed to drm-misc-next just yesterday:
>>>
>>> commit e04be2310b5eac683ec03b096c0e22c4c2e23593
>>> Author: Christian König <christian.koenig@amd.com>
>>> Date:   Mon Jul 6 17:32:55 2020 +0200
>>>
>>>      drm/ttm: further cleanup ttm_mem_reg handling
>>>
>>>      Stop touching the backend private pointer alltogether and
>>>      make sure we never put the same mem twice by.
>>>
>>>      Signed-off-by: Christian König <christian.koenig@amd.com>
>>>      Reviewed-by: Madhav Chauhan <madhav.chauhan@amd.com>
>>>      Link:
>>> https://patchwork.freedesktop.org/patch/375613/
>>>
>>>
>>> But this shouldn't have been problematic since we used a dummy value
>>> for mem->mm_node in this case.
>> Hmm, yeah, I was reading the code wrong. It's possible that I was really
>> just out of GTT space. But see below.
>
> It looks like it yes.

I checked. I don't see a general GTT space leak. During the eviction
test the GTT usage spikes, but after finishing the test, GTT usage goes
back down to 7MB.


>
>>> What could be problematic and result in an overrun is that TTM was
>>> buggy and called put_node twice for the same memory.
>>>
>>> So I've seen that the code needs fixing as well, but I'm not 100% sure
>>> how you ran into your problem.
>> This is in the KFD eviction test, which deliberately overcommits VRAM in
>> order to trigger lots of evictions. It will use some GTT space while BOs
>> are evicted. But shouldn't it move them further out of GTT and into
>> SYSTEM to free up GTT space?
>
> Yes, exactly that should happen.
>
> But for some reason it couldn't find a candidate to evict and the
> 14371 pages left are just a bit too small for the buffer.

That would be a nested eviction. A VRAM to GTT eviction requires a GTT
to SYSTEM eviction to make space in GTT. Is that even possible?

Regards,
  Felix
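[Editor's note: the accounting Felix describes in the first message can be modeled with a small userspace sketch. This is a hypothetical simplification, not the driver code (the real manager uses an atomic64_t counter and TTM callbacks); it models the leak Felix initially suspected, which later messages in the thread show is avoided because TTM uses a dummy mm_node value.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of amdgpu_gtt_mgr's "available" counter (hypothetical,
 * simplified names; the real code lives in amdgpu_gtt_mgr.c). */
struct gtt_mgr { long available; };

struct mem_reg { void *mm_node; long num_pages; };

/* gtt_mgr_new: reserve GTT space on allocation. BOs that don't need a
 * GART mapping yet get no mm_node in this model. */
static bool gtt_mgr_new(struct gtt_mgr *mgr, struct mem_reg *mem,
			bool need_gart)
{
	if (mgr->available < mem->num_pages)
		return false;
	mgr->available -= mem->num_pages;
	mem->mm_node = need_gart ? (void *)1 : NULL;
	return true;
}

/* gtt_mgr_del: return the reserved space. */
static void gtt_mgr_del(struct gtt_mgr *mgr, struct mem_reg *mem)
{
	mgr->available += mem->num_pages;
	mem->mm_node = NULL;
}

/* bo_mem_put mirrors the quoted ttm_bo_mem_put: put_node is only
 * called when mm_node is set, so an unmapped BO would never return
 * its pages -- the suspected leak. */
static void bo_mem_put(struct gtt_mgr *mgr, struct mem_reg *mem)
{
	if (mem->mm_node)
		gtt_mgr_del(mgr, mem);
}
```

With this model, freeing an unmapped BO leaves `available` permanently reduced, which is exactly why TTM's dummy mm_node (so put_node always runs) matters.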


>
> Regards,
> Christian.
>
>> Your change "further cleanup ttm_mem_reg handling" removes a
>> mem->mm_node = NULL in ttm_bo_handle_move_mem in exactly the case where
>> a BO is moved from GTT to SYSTEM. I think that leads to a later put_node
>> call not happening or amdgpu_gtt_mgr_del returning before incrementing
>> mgr->available.
>>
>> I can try if cherry-picking your two fixes will help with the
>> eviction test.
>>
>> Regards,
>>    Felix
>>
>>
>>> Regards,
>>> Christian.
>>>
>>> Am 14.07.20 um 02:44 schrieb Felix Kuehling:
>>>> [SNIP]

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Failed to find memory space for buffer eviction
  2020-07-15 15:14       ` Felix Kuehling
@ 2020-07-16  6:58         ` Christian König
  2020-07-16 17:05           ` Felix Kuehling
  0 siblings, 1 reply; 9+ messages in thread
From: Christian König @ 2020-07-16  6:58 UTC (permalink / raw)
  To: Felix Kuehling, amd-gfx list

Am 15.07.20 um 17:14 schrieb Felix Kuehling:
> Am 2020-07-15 um 5:28 a.m. schrieb Christian König:
>> Am 15.07.20 um 04:49 schrieb Felix Kuehling:
>>> Am 2020-07-14 um 4:28 a.m. schrieb Christian König:
>>>> Hi Felix,
>>>>
>>>> yes I already stumbled over this as well quite recently.
>>>>
>>>> See the following patch which I pushed to drm-misc-next just yesterday:
>>>>
>>>> commit e04be2310b5eac683ec03b096c0e22c4c2e23593
>>>> Author: Christian König <christian.koenig@amd.com>
>>>> Date:   Mon Jul 6 17:32:55 2020 +0200
>>>>
>>>>       drm/ttm: further cleanup ttm_mem_reg handling
>>>>
>>>>       Stop touching the backend private pointer alltogether and
>>>>       make sure we never put the same mem twice by.
>>>>
>>>>       Signed-off-by: Christian König <christian.koenig@amd.com>
>>>>       Reviewed-by: Madhav Chauhan <madhav.chauhan@amd.com>
>>>>       Link:
>>>> https://patchwork.freedesktop.org/patch/375613/
>>>>
>>>>
>>>> But this shouldn't have been problematic since we used a dummy value
>>>> for mem->mm_node in this case.
>>> Hmm, yeah, I was reading the code wrong. It's possible that I was really
>>> just out of GTT space. But see below.
>> It looks like it yes.
> I checked. I don't see a general GTT space leak. During the eviction
> test the GTT usage spikes, but after finishing the test, GTT usage goes
> back down to 7MB.
>
>
>>>> What could be problematic and result in an overrun is that TTM was
>>>> buggy and called put_node twice for the same memory.
>>>>
>>>> So I've seen that the code needs fixing as well, but I'm not 100% sure
>>>> how you ran into your problem.
>>> This is in the KFD eviction test, which deliberately overcommits VRAM in
>>> order to trigger lots of evictions. It will use some GTT space while BOs
>>> are evicted. But shouldn't it move them further out of GTT and into
>>> SYSTEM to free up GTT space?
>> Yes, exactly that should happen.
>>
>> But for some reason it couldn't find a candidate to evict and the
>> 14371 pages left are just a bit too small for the buffer.
> That would be a nested eviction. A VRAM to GTT eviction requires a GTT
> to SYSTEM eviction to make space in GTT. Is that even possible?

Yes, this is the core of the TTM design problem which I talked about in 
my FOSDEM presentation in February.

Question: do we still have this crude workaround where KFD is not taking 
all reservations of the current process when allocating new BOs?

That could maybe cause this as well.

Regards,
Christian.
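[Editor's note: the nested-eviction constraint Christian refers to can be illustrated with a toy model. This is hypothetical userspace code, not TTM: a VRAM-to-GTT eviction needs free GTT pages, and if producing them would require a further GTT-to-SYSTEM eviction nested inside the first one, the top-level placement fails even though system memory is plentiful.]

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model (hypothetical, not TTM code) of nested eviction. */
struct domain { long free_pages; };

/* Move "pages" out of GTT into SYSTEM; SYSTEM always has room here. */
static void evict_gtt_to_system(struct domain *gtt, long pages)
{
	gtt->free_pages += pages;
}

/* Try to evict a VRAM buffer into GTT. When GTT is full, we may only
 * recurse into a GTT -> SYSTEM eviction if nesting is allowed. */
static bool evict_vram_to_gtt(struct domain *gtt, long pages,
			      bool allow_nested)
{
	if (gtt->free_pages < pages) {
		if (!allow_nested)
			return false;  /* "Failed to find memory space" */
		evict_gtt_to_system(gtt, pages - gtt->free_pages);
	}
	gtt->free_pages -= pages;
	return true;
}
```

Plugging in the numbers from the log (14371 GTT pages free, a 16384-page buffer), the move fails without nesting and succeeds with it, matching the symptom in the original report.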

>
> Regards,
>    Felix
>
>
>> Regards,
>> Christian.
>>
>>> Your change "further cleanup ttm_mem_reg handling" removes a
>>> mem->mm_node = NULL in ttm_bo_handle_move_mem in exactly the case where
>>> a BO is moved from GTT to SYSTEM. I think that leads to a later put_node
>>> call not happening or amdgpu_gtt_mgr_del returning before incrementing
>>> mgr->available.
>>>
>>> I can try if cherry-picking your two fixes will help with the
>>> eviction test.
>>>
>>> Regards,
>>>     Felix
>>>
>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>> Am 14.07.20 um 02:44 schrieb Felix Kuehling:
>>>>> [SNIP]


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Failed to find memory space for buffer eviction
  2020-07-16  6:58         ` Christian König
@ 2020-07-16 17:05           ` Felix Kuehling
  2020-07-20  9:25             ` Christian König
  0 siblings, 1 reply; 9+ messages in thread
From: Felix Kuehling @ 2020-07-16 17:05 UTC (permalink / raw)
  To: Christian König, amd-gfx list


Am 2020-07-16 um 2:58 a.m. schrieb Christian König:
> Am 15.07.20 um 17:14 schrieb Felix Kuehling:
>> Am 2020-07-15 um 5:28 a.m. schrieb Christian König:
>>> Am 15.07.20 um 04:49 schrieb Felix Kuehling:
>>>> Am 2020-07-14 um 4:28 a.m. schrieb Christian König:
>>>>> Hi Felix,
>>>>>
>>>>> yes I already stumbled over this as well quite recently.
>>>>>
>>>>> See the following patch which I pushed to drm-misc-next just
>>>>> yesterday:
>>>>>
>>>>> commit e04be2310b5eac683ec03b096c0e22c4c2e23593
>>>>> Author: Christian König <christian.koenig@amd.com>
>>>>> Date:   Mon Jul 6 17:32:55 2020 +0200
>>>>>
>>>>>       drm/ttm: further cleanup ttm_mem_reg handling
>>>>>
>>>>>       Stop touching the backend private pointer alltogether and
>>>>>       make sure we never put the same mem twice by.
>>>>>
>>>>>       Signed-off-by: Christian König <christian.koenig@amd.com>
>>>>>       Reviewed-by: Madhav Chauhan <madhav.chauhan@amd.com>
>>>>>       Link:
>>>>> https://patchwork.freedesktop.org/patch/375613/
>>>>>
>>>>>
>>>>>
>>>>> But this shouldn't have been problematic since we used a dummy value
>>>>> for mem->mm_node in this case.
>>>> Hmm, yeah, I was reading the code wrong. It's possible that I was
>>>> really
>>>> just out of GTT space. But see below.
>>> It looks like it yes.
>> I checked. I don't see a general GTT space leak. During the eviction
>> test the GTT usage spikes, but after finishing the test, GTT usage goes
>> back down to 7MB.
>>
>>
>>>>> What could be problematic and result in an overrun is that TTM was
>>>>> buggy and called put_node twice for the same memory.
>>>>>
>>>>> So I've seen that the code needs fixing as well, but I'm not 100%
>>>>> sure
>>>>> how you ran into your problem.
>>>> This is in the KFD eviction test, which deliberately overcommits
>>>> VRAM in
>>>> order to trigger lots of evictions. It will use some GTT space
>>>> while BOs
>>>> are evicted. But shouldn't it move them further out of GTT and into
>>>> SYSTEM to free up GTT space?
>>> Yes, exactly that should happen.
>>>
>>> But for some reason it couldn't find a candidate to evict and the
>>> 14371 pages left are just a bit too small for the buffer.
>> That would be a nested eviction. A VRAM to GTT eviction requires a GTT
>> to SYSTEM eviction to make space in GTT. Is that even possible?
>
> Yes, this is the core of the TTM design problem which I talked about
> in my FOSDEM presentation in February.
>
> Question: do we still have this crude workaround where KFD is not taking
> all reservations of the current process when allocating new BOs?

Not sure if you're referring to the workarounds where we temporarily
removed eviction fences from reservations. Those are all gone. We're
making full use of the sync-object fence-owner logic to avoid triggering
eviction fences unintentionally.

I don't know why we would need to take all reservations when we allocate
a new BO. I'm probably misunderstanding you.

Regards,
  Felix


>
> That could maybe cause this as well.
>
> Regards,
> Christian.
>
>>
>> Regards,
>>    Felix
>>
>>
>>> Regards,
>>> Christian.
>>>
>>>> Your change "further cleanup ttm_mem_reg handling" removes a
>>>> mem->mm_node = NULL in ttm_bo_handle_move_mem in exactly the case
>>>> where
>>>> a BO is moved from GTT to SYSTEM. I think that leads to a later
>>>> put_node
>>>> call not happening or amdgpu_gtt_mgr_del returning before incrementing
>>>> mgr->available.
>>>>
>>>> I can try if cherry-picking your two fixes will help with the
>>>> eviction test.
>>>>
>>>> Regards,
>>>>     Felix
>>>>
>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>> Am 14.07.20 um 02:44 schrieb Felix Kuehling:
>>>>>> I'm running into this problem with the KFD EvictionTest. The log
>>>>>> snippet
>>>>>> below looks like it ran out of GTT space for the eviction of a 64MB
>>>>>> buffer. But then it dumps the used and free space and shows
>>>>>> plenty of
>>>>>> free space.
>>>>>>
>>>>>> As I understand it, the per-page breakdown of used and free space
>>>>>> shown
>>>>>> by TTM is the GART space. So it's not very meaningful.
>>>>>>
>>>>>> What matters more is the GTT space managed by amdgpu_gtt_mgr.c. And
>>>>>> that's where the problem is. It keeps track of available GTT space
>>>>>> with
>>>>>> an atomic counter in amdgpu_gtt_mgr.available. It gets
>>>>>> decremented in
>>>>>> amdgpu_gtt_mgr_new and incremented in amdgpu_gtt_mgr_del. The
>>>>>> trouble
>>>>>> is, that TTM doesn't call the latter for ttm_mem_regs that don't
>>>>>> have an
>>>>>> mm_node:
>>>>>>
>>>>>>> void ttm_bo_mem_put(struct ttm_buffer_object *bo, struct
>>>>>>> ttm_mem_reg
>>>>>>> *mem)
>>>>>>> {
>>>>>>>            struct ttm_mem_type_manager *man =
>>>>>>> &bo->bdev->man[mem->mem_type];
>>>>>>>
>>>>>>>            if (mem->mm_node)
>>>>>>>                    (*man->func->put_node)(man, mem);
>>>>>>> }
>>>>>> GTT BOs that don't have GART space allocated, don't hate an
>>>>>> mm_node. So
>>>>>> the amdgpu_gtt_mgr.available counter doesn't get incremented when an
>>>>>> unmapped GTT BO is freed, and eventually runs out of space.
>>>>>>
>>>>>> Now I know what the problem is, but I don't know how to fix it.
>>>>>> Maybe a
>>>>>> dummy-mm_node for unmapped GTT BOs, to trick TTM into calling our
>>>>>> put_node callback? Or a change in TTM to call put_node
>>>>>> unconditionally?
>>>>>>
>>>>>> Regards,
>>>>>>      Felix
>>>>>>
>>>>>>
>>>>>> [  360.082552] [TTM] Failed to find memory space for buffer
>>>>>> 0x00000000264c823c eviction
>>>>>> [  360.090331] [TTM]  No space for 00000000264c823c (16384 pages,
>>>>>> 65536K, 64M)
>>>>>> [  360.090334] [TTM]    placement[0]=0x00010002 (1)
>>>>>> [  360.090336] [TTM]      has_type: 1
>>>>>> [  360.090337] [TTM]      use_type: 1
>>>>>> [  360.090339] [TTM]      flags: 0x0000000A
>>>>>> [  360.090341] [TTM]      gpu_offset: 0xFF00000000
>>>>>> [  360.090342] [TTM]      size: 1048576
>>>>>> [  360.090344] [TTM]      available_caching: 0x00070000
>>>>>> [  360.090346] [TTM]      default_caching: 0x00010000
>>>>>> [  360.090349] [TTM]  0x0000000000000400-0x0000000000000402: 2: used
>>>>>> [  360.090352] [TTM]  0x0000000000000402-0x0000000000000404: 2: used
>>>>>> [  360.090354] [TTM]  0x0000000000000404-0x0000000000000406: 2: used
>>>>>> [  360.090355] [TTM]  0x0000000000000406-0x0000000000000408: 2: used
>>>>>> [  360.090357] [TTM]  0x0000000000000408-0x000000000000040a: 2: used
>>>>>> [  360.090359] [TTM]  0x000000000000040a-0x000000000000040c: 2: used
>>>>>> [  360.090361] [TTM]  0x000000000000040c-0x000000000000040e: 2: used
>>>>>> [  360.090363] [TTM]  0x000000000000040e-0x0000000000000410: 2: used
>>>>>> [  360.090365] [TTM]  0x0000000000000410-0x0000000000000412: 2: used
>>>>>> [  360.090367] [TTM]  0x0000000000000412-0x0000000000000414: 2: used
>>>>>> [  360.090368] [TTM]  0x0000000000000414-0x0000000000000415: 1: used
>>>>>> [  360.090370] [TTM]  0x0000000000000415-0x0000000000000515: 256:
>>>>>> used
>>>>>> [  360.090372] [TTM]  0x0000000000000515-0x0000000000000516: 1: used
>>>>>> [  360.090374] [TTM]  0x0000000000000516-0x0000000000000517: 1: used
>>>>>> [  360.090376] [TTM]  0x0000000000000517-0x0000000000000518: 1: used
>>>>>> [  360.090378] [TTM]  0x0000000000000518-0x0000000000000519: 1: used
>>>>>> [  360.090379] [TTM]  0x0000000000000519-0x000000000000051a: 1: used
>>>>>> [  360.090381] [TTM]  0x000000000000051a-0x000000000000051b: 1: used
>>>>>> [  360.090383] [TTM]  0x000000000000051b-0x000000000000051c: 1: used
>>>>>> [  360.090385] [TTM]  0x000000000000051c-0x000000000000051d: 1: used
>>>>>> [  360.090387] [TTM]  0x000000000000051d-0x000000000000051f: 2: used
>>>>>> [  360.090389] [TTM]  0x000000000000051f-0x0000000000000521: 2: used
>>>>>> [  360.090391] [TTM]  0x0000000000000521-0x0000000000000522: 1: used
>>>>>> [  360.090392] [TTM]  0x0000000000000522-0x0000000000000523: 1: used
>>>>>> [  360.090394] [TTM]  0x0000000000000523-0x0000000000000524: 1: used
>>>>>> [  360.090396] [TTM]  0x0000000000000524-0x0000000000000525: 1: used
>>>>>> [  360.090398] [TTM]  0x0000000000000525-0x0000000000000625: 256:
>>>>>> used
>>>>>> [  360.090400] [TTM]  0x0000000000000625-0x0000000000000725: 256:
>>>>>> used
>>>>>> [  360.090402] [TTM]  0x0000000000000725-0x0000000000000727: 2: used
>>>>>> [  360.090404] [TTM]  0x0000000000000727-0x00000000000007c0: 153:
>>>>>> used
>>>>>> [  360.090406] [TTM]  0x00000000000007c0-0x0000000000000b8a: 970:
>>>>>> used
>>>>>> [  360.090407] [TTM]  0x0000000000000b8a-0x0000000000000b8b: 1: used
>>>>>> [  360.090409] [TTM]  0x0000000000000b8b-0x0000000000000bcb: 64:
>>>>>> used
>>>>>> [  360.090411] [TTM]  0x0000000000000bcb-0x0000000000000bcd: 2: used
>>>>>> [  360.090413] [TTM]  0x0000000000000bcd-0x0000000000040000: 259123: free
>>>>>> [  360.090415] [TTM]  total: 261120, used 1997 free 259123
>>>>>> [  360.090417] [TTM]  man size:1048576 pages, gtt available:14371 pages, usage:4039MB
>>>>>>
>>>>>>
>
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


* Re: Failed to find memory space for buffer eviction
  2020-07-16 17:05           ` Felix Kuehling
@ 2020-07-20  9:25             ` Christian König
  0 siblings, 0 replies; 9+ messages in thread
From: Christian König @ 2020-07-20  9:25 UTC (permalink / raw)
  To: Felix Kuehling, Christian König, amd-gfx list

Am 16.07.20 um 19:05 schrieb Felix Kuehling:
> Am 2020-07-16 um 2:58 a.m. schrieb Christian König:
>> Am 15.07.20 um 17:14 schrieb Felix Kuehling:
>>> Am 2020-07-15 um 5:28 a.m. schrieb Christian König:
>>> [SNIP]
>>>>>> What could be problematic and result in an overrun is that TTM was
>>>>>> buggy and called put_node twice for the same memory.
>>>>>>
>>>>>> So I've seen that the code needs fixing as well, but I'm not 100%
>>>>>> sure
>>>>>> how you ran into your problem.
>>>>> This is in the KFD eviction test, which deliberately overcommits
>>>>> VRAM in
>>>>> order to trigger lots of evictions. It will use some GTT space
>>>>> while BOs
>>>>> are evicted. But shouldn't it move them further out of GTT and into
>>>>> SYSTEM to free up GTT space?
>>>> Yes, exactly that should happen.
>>>>
>>>> But for some reason it couldn't find a candidate to evict and the
>>>> 14371 pages left are just a bit too small for the buffer.
>>> That would be a nested eviction. A VRAM to GTT eviction requires a GTT
>>> to SYSTEM eviction to make space in GTT. Is that even possible?
>> Yes, this is the core of the TTM design problem which I talked about
>> in my FOSDEM presentation in February.
>>
>> Question do we still have this crude workaround that KFD is not taking
>> all reservations of the current process when allocating new BOs?
> Not sure if you're referring to the workarounds we had to remove
> eviction fences from reservations temporarily. Those are all gone. We're
> making full use of the sync-object fence owner logic to avoid triggering
> eviction fences unintentionally.

I was talking about this check here in amdgpu_ttm_bo_eviction_valuable():
>         /* If bo is a KFD BO, check if the bo belongs to the current process.
>          * If true, then return false as any KFD process needs all its BOs to
>          * be resident to run successfully
>          */
>         flist = dma_resv_get_list(bo->base.resv);
>         if (flist) {
>                 for (i = 0; i < flist->shared_count; ++i) {
>                         f = rcu_dereference_protected(flist->shared[i],
>                                 dma_resv_held(bo->base.resv));
>                         if (amdkfd_fence_check_mm(f, current->mm))
>                                 return false;
>                 }
>         }

What can happen is that the allocating process owns too much of GTT as 
well, and as a result we can't evict anything from GTT to allow the 
VRAM eviction to happen.

> I don't know why we would need to take all reservations when we allocate
> a new BO. I'm probably misunderstanding you.

Taking all reservations when you change the set of BOs allocated in a 
working context is mandatory for correct operation.

I've already noted multiple times that working around this like we 
currently do is just a hack, and what you see here is one of the 
symptoms of it.

Regards,
Christian.

>
> Regards,
>    Felix
>
>
>> That could maybe cause this as well.
>>
>> Regards,
>> Christian.
>>



end of thread, other threads:[~2020-07-20  9:25 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-14  0:44 Failed to find memory space for buffer eviction Felix Kuehling
2020-07-14  8:28 ` Christian König
2020-07-15  2:49   ` Felix Kuehling
2020-07-15  9:28     ` Christian König
2020-07-15 14:24       ` Deucher, Alexander
2020-07-15 15:14       ` Felix Kuehling
2020-07-16  6:58         ` Christian König
2020-07-16 17:05           ` Felix Kuehling
2020-07-20  9:25             ` Christian König
