From: "Koenig, Christian" <Christian.Koenig-5C7GfCeVMHo@public.gmane.org>
To: "Kuehling, Felix" <Felix.Kuehling-5C7GfCeVMHo@public.gmane.org>,
	"amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org"
	<amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org>
Cc: "Russell, Kent" <Kent.Russell-5C7GfCeVMHo@public.gmane.org>
Subject: Re: [PATCH 27/27] drm/amdgpu: Fix GTT size calculation
Date: Tue, 30 Apr 2019 17:03:26 +0000	[thread overview]
Message-ID: <f5c698ad-2aff-b3c5-2041-05a10983438a@amd.com> (raw)
In-Reply-To: <9f882acd-c48f-3bbd-2d90-659c2edead39-5C7GfCeVMHo@public.gmane.org>

On 30.04.19 at 17:36, Kuehling, Felix wrote:
> On 2019-04-30 5:32 a.m., Christian König wrote:
>> On 30.04.19 at 01:16, Kuehling, Felix wrote:
>>> On 2019-04-29 8:34 a.m., Christian König wrote:
>>>> On 28.04.19 at 09:44, Kuehling, Felix wrote:
>>>>> From: Kent Russell <kent.russell@amd.com>
>>>>>
>>>>> GTT size is currently limited to the minimum of VRAM size or 3/4 of
>>>>> system memory. This severely limits the quantity of system memory
>>>>> that can be used by ROCm applications.
>>>>>
>>>>> Increase GTT size to the maximum of VRAM size or system memory size.
>>>> Well, NAK.
>>>>
>>>> This limit was added on purpose because otherwise the
>>>> max-texture-size would crash the system, as the OOM killer
>>>> would cause a system panic.
>>>>
>>>> Using more than 75% of system memory by the GPU at the same time makes
>>>> the system unstable and so we can't allow that by default.
>>> Like we discussed, the current implementation is too limiting. On a Fiji
>>> system with 4GB VRAM and 32GB system memory, it limits system memory
>>> allocations to 4GB. I think this workaround was fixed once before and
>>> reverted because it broke a CZ system with 1GB system memory. So I
>>> suspect that this is an issue affecting small memory systems where maybe
>>> the 1/2 system memory limit in TTM isn't sufficient to protect from OOM
>>> panics.
>> Well, it didn't only break on a 1GB CZ system; that was just where Andrey
>> reproduced it. We got reports from all kinds of systems.
> I'd like to see those reports. This patch has been included in Linux Pro
> releases since 18.20. I'm not aware that anyone complained about it.

Well, to be honest, our Pro driver is not used that widely, and only
on rather homogeneous systems.

That is not really surprising, since we only recommend it for
professional use cases.

>>> The OOM killer problem is a more general problem that potentially
>>> affects other drivers too. Keeping this GTT limit broken in AMDGPU is an
>>> inadequate workaround at best. I'd like to look for a better solution,
>>> probably some adjustment of the TTM system memory limits on systems with
>>> small memory, to avoid OOM panics on such systems.
>> The core problem here is that the OOM killer explicitly doesn't want to
>> block waiting for shaders to finish whatever they are doing.
>>
>> So currently as soon as the hardware is using some memory it can't be
>> reclaimed immediately.
>>
>> The original limit in TTM was 2/3 of system memory and that worked
>> really reliably; we ran into problems only after raising it to 3/4.
> The TTM system memory limit is still 3/8 soft and 1/2 hard, 3/4 for
> emergencies. See ttm_mem_init_kernel_zone. AFAICT, the emergency limit
> is only available to root.

Ah! I think I know why those limits don't kick in here!

When GTT space is used by evictions from VRAM then we will use the 
emergency limit as well.
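
To make the fractions quoted above concrete, here is a minimal sketch of how
the three TTM limits relate to system memory. It only encodes the numbers
given in this thread (3/8 soft, 1/2 hard, 3/4 emergency) and the two cases
mentioned for the emergency limit (root, and evictions from VRAM). The
struct and function names are hypothetical and are not the kernel's own
definitions.

```c
#include <stdint.h>

/* Hypothetical container for the three limits discussed in the thread. */
struct ttm_limits_sketch {
	uint64_t soft;      /* 3/8 of system memory */
	uint64_t hard;      /* 1/2 of system memory */
	uint64_t emergency; /* 3/4 of system memory */
};

/* Derive the limits from the total system memory size in bytes. */
struct ttm_limits_sketch ttm_limits_for(uint64_t system_mem)
{
	struct ttm_limits_sketch l;

	l.hard = system_mem >> 1;                            /* 1/2 */
	l.emergency = (system_mem >> 1) + (system_mem >> 2); /* 1/2 + 1/4 = 3/4 */
	l.soft = l.hard - (system_mem >> 3);                 /* 1/2 - 1/8 = 3/8 */
	return l;
}

/*
 * Assumption taken from the discussion: the emergency limit applies to
 * root and to evictions from VRAM, everything else sees the hard limit.
 */
uint64_t ttm_ceiling(const struct ttm_limits_sketch *l,
		     int is_root, int is_eviction)
{
	return (is_root || is_eviction) ? l->emergency : l->hard;
}
```

For a 32GB system this gives 12GB soft, 16GB hard and 24GB emergency, so an
eviction from VRAM would be allowed to go up to 24GB under this model.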

> This GTT limit kicks in before I get anywhere close to the TTM limit.
> That's why I think it is both broken and redundant.

That was also the argument when we removed it the last time, but it got 
immediately reverted.
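
For reference, the two GTT sizing rules under discussion can be written down
as a small sketch. It follows the wording of the patch description quoted
above (minimum of VRAM size or 3/4 of system memory versus maximum of VRAM
size or system memory). The function names are hypothetical, and the actual
driver code may contain additional terms that are not modelled here.

```c
#include <stdint.h>

static uint64_t min_u64(uint64_t a, uint64_t b) { return a < b ? a : b; }
static uint64_t max_u64(uint64_t a, uint64_t b) { return a > b ? a : b; }

/* Current rule as described in the patch text: min(VRAM, 3/4 system memory). */
uint64_t gtt_size_current(uint64_t vram, uint64_t sysmem)
{
	return min_u64(vram, sysmem / 4 * 3);
}

/* Rule proposed by the patch: max(VRAM, system memory). */
uint64_t gtt_size_proposed(uint64_t vram, uint64_t sysmem)
{
	return max_u64(vram, sysmem);
}
```

For the Fiji example from this thread (4GB VRAM, 32GB system memory) the
current rule yields 4GB and the proposed rule yields 32GB. Half of system
memory, the TTM hard limit quoted above, would be 16GB on that machine.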

>> To sum it up, the requirement of a GPU using almost all system memory
>> is simply not possible upstream, and rather questionable even in any
>> production system.
> It should be doable with userptr, which now uses unpinned pages through
> HMM. Currently the GTT limit affects the largest possible userptr
> allocation, though not the total sum of all userptr allocations. Maybe
> making userptr completely independent of GTT size would be an easier
> problem to tackle. Then I can leave the GTT limit alone.

Well, this way we would only avoid the symptoms, not the real problem.

>> The only real solution I can see is to be able to reliably kill shaders
>> in an OOM situation.
> Well, we can in fact preempt our compute shaders with low latency.
> Killing a KFD process will do exactly that.

I've taken a look at that thing as well and to be honest it is not even 
remotely sufficient.

We need something which stops the hardware *immediately* from accessing
system memory, rather than waiting for the SQ to kill all waves, flush
caches, etc.

One possibility I've been playing around with for a while is to replace
the root PD for the VMIDs in question on the fly. E.g. we just let it
point to some dummy which redirects everything into nirvana.

But implementing this is easier said than done...
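
As a purely conceptual illustration of the idea, the toy model below swaps a
root pointer so that every later translation lands on a single scratch page.
Everything in it is hypothetical (the single-level "page directory", the
names, the scratch value). It does not reflect amdgpu data structures and it
ignores everything that makes this hard on real hardware, for instance
translations that are already cached.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_PD_ENTRIES   4
#define TOY_SCRATCH_PAGE 0xdeadULL

/* Toy single-level page directory: each entry is a target page number. */
struct toy_pd {
	uint64_t entry[TOY_PD_ENTRIES];
};

/* Toy per-VMID state: only the root pointer matters in this model. */
struct toy_vmid {
	const struct toy_pd *root;
};

/* Build the dummy PD: every entry points at the same scratch page. */
void toy_pd_init_dummy(struct toy_pd *pd)
{
	for (size_t i = 0; i < TOY_PD_ENTRIES; i++)
		pd->entry[i] = TOY_SCRATCH_PAGE;
}

/* Translate a virtual page index through whichever root is installed. */
uint64_t toy_translate(const struct toy_vmid *vm, size_t vpage)
{
	return vm->root->entry[vpage % TOY_PD_ENTRIES];
}

/* The swap itself: one pointer update redirects all later translations. */
void toy_redirect(struct toy_vmid *vm, const struct toy_pd *dummy)
{
	vm->root = dummy;
}
```

In this model the process's real pages are no longer reachable through the
VMID after toy_redirect() returns, which is the property the OOM case would
need.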

Regards,
Christian.


