Re: [RFC PATCH 3/5] drm/amdgpu: Allow explicit sync for VM ops.

From: "Christian König" <christian.koenig@amd.com>
To: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Cc: ML dri-devel <dri-devel@lists.freedesktop.org>
Subject: Re: [RFC PATCH 3/5] drm/amdgpu: Allow explicit sync for VM ops.
Date: Mon, 6 Jun 2022 12:15:35 +0200	[thread overview]
Message-ID: <3b6a7d56-6c65-046d-0a51-bbe167e04322@amd.com> (raw)
In-Reply-To: <CAP+8YyEWE9KCmo2pGOHMyT3fSEXskTG-tfymwQmmy7d1c9CuRQ@mail.gmail.com>

Am 03.06.22 um 21:11 schrieb Bas Nieuwenhuizen:
> On Fri, Jun 3, 2022 at 8:41 PM Christian König <christian.koenig@amd.com> wrote:
>> Am 03.06.22 um 19:50 schrieb Bas Nieuwenhuizen:
>>> [SNIP]
>>>>>> Yeah, but that's exactly the bubble we try to avoid. Isn't it?
>>>>> For this series, not really.  To clarify there are two sides for
>>>>> getting GPU bubbles and no overlap:
>>>>>
>>>>> (1) VM operations implicitly wait for earlier CS submissions
>>>>> (2) CS submissions implicitly wait for earlier VM operations
>>>>>
>>>>> Together, these combine to ensure that you get a (potentially small)
>>>>> bubble any time VM work happens.
>>>>>
>>>>> Your series (and further ideas) tackles (2), and is a worthwhile thing
>>>>> to do. However, while writing the userspace for this I noticed this
>>>>> isn't enough to get rid of all our GPU bubbles. In particular when
>>>>> doing a non-sparse map of a new BO, that tends to need to be waited on
>>>>> for the next CS anyway for API semantics. Due to VM operations
>>>>> happening on a single timeline that means this high priority map can
>>>>> end up being blocked by earlier sparse maps and hence the bubble in
>>>>> that case still exists.
>>>>>
>>>>> So in this series I try to tackle (1) instead. Since GPU work
>>>>> typically lags behind CPU submissions and VM operations aren't that
>>>>> slow, we can typically execute VM operations early enough that any
>>>>> implicit syncs from (2) are less/no issue.
>>>> Ok, once more since you don't seem to understand what I want to say: It
>>>> isn't possible to fix #1 before you have fixed #2.
>>>>
>>>> The VM unmap operation here is a barrier which divides the CS operations
>>>> in a before and after. This is intentional design.
>>> Why is that barrier needed? The two barriers I got and understood and
>>> I think we can deal with:
>>>
>>> 1) the VM unmap is a barrier between prior CS and later memory free.
>>> 2) The TLB flush need to happen between a VM unmap and later CS.
>>>
>>> But why do we need the VM unmap to be a strict barrier between prior
>>> CS and later CS?
>> Exactly because of the two reasons you mentioned.
> This is the part I'm not seeing. I get that removing #2 is a
> nightmare, which is why I did something that doesn't violate that
> constraint.
>
> Like if an explicit CS that was running before the VM operation  runs
> till after the VM operation (and hence possibly till after the TLB
> flush, or otherwise have the TLB flush not apply due to lack of async
> TLB flush support), that is not an issue. It might see the state from
> before the unmap, or after the unmap, or some intermediate state and
> all of those would be okay.
>
> We still get the constraint that the TLB flush happens between the VM
> unmap and later CS and hence the unmap is certainly visible to them.

So you basically just want to set the sync mode in 
amdgpu_vm_update_range() to AMDGPU_SYNC_EXPLICIT even when it is an unmap?

That should be doable, but then you don't need any of the other changes.

Regards,
Christian.

>
>> #1 Is rather easy to fix, you just need to copy all dma_fences from the
>> page table dma_resv object over to the BOs dma_resv object in the gem
>> close handler. E.g. exactly what you suggested with the dma_resv_copy
>> function.
>>
>> #2 is a nightmare.
>>
>> We can't move the TLB flush at the end of the unmap operation because on
>> async TLB flushes are either a bit complicated (double flushes etc..) or
>> don't even work at all because of hw bugs. So to have a reliable TLB
>> flush we must make sure that nothing else is ongoing and that means
>> CS->VM->CS barrier.
>>
>> We try very hard to circumvent that already on maps by (for example)
>> using a completely new VMID for CS after the VM map operation.
>>
>> But for the unmap operation we would need some kind special dma_fence
>> implementation which would not only wait for all existing dma_fence but
>> also for the one added until the unmap operation is completed. Cause
>> otherwise our operation we do at #1 would simply not catch all
>> dma_fences which have access to the memory.
>>
>> That's certainly doable, but I think just using the drm_exec stuff I
>> already came up with is easier.
>>
>> When we can grab locks for all the BOs involved amdgpu_vm_clear_freed()
>> goes away and we can keep track of the unmap operations in the bo_va
>> structure.
>>
>> With that done you can make the explicit sync you noted in the bo_va
>> structure and implicit sync when the bo_va structure goes away.
>>
>> Then the only reason I can see why we would need a CS->VM dependency is
>> implicit synchronization, and that's what we are trying to avoid here in
>> the first place.
>>
>> Regards,
>> Christian.
>>
>>>> To get rid of this barrier you must first fix the part where CS
>>>> submissions wait for the VM operation to complete, e.g. the necessity of
>>>> the barrier.
>>>>
>>>> I'm working on this for a couple of years now and I'm really running out
>>>> of idea how to explain this restriction.
>>>>
>>>> Regards,
>>>> Christian.
>>>>