On Wed, Jun 1, 2022, 10:48 Bas Nieuwenhuizen wrote:
> On Wed, Jun 1, 2022 at 10:40 AM Christian König wrote:
> >
> > On 01.06.22 at 10:16, Bas Nieuwenhuizen wrote:
> > > On Wed, Jun 1, 2022 at 10:03 AM Christian König wrote:
> > >> On 01.06.22 at 02:40, Bas Nieuwenhuizen wrote:
> > >>> This should be okay because moves themselves use KERNEL usage and
> > >>> hence still sync with BOOKKEEP usage. Then any later submits still
> > >>> wait on any pending VM operations.
> > >>>
> > >>> (i.e. we only made VM ops not wait on BOOKKEEP submits, not the other
> > >>> way around)
> > >> Well NAK again. This allows access to freed-up memory and is a complete
> > >> no-go.
> > > How does this allow access to freed memory? The worst I can see is that
> > > the unmap happens earlier if the app/driver gets the waits wrong,
> > > which wouldn't give access after the underlying BO is freed?
> >
> > To free up memory we need to update the PTEs and then flush those out by
> > invalidating the TLB.
> >
> > On gfx6, gfx7, gfx8 and some broken gfx10 hardware, invalidating the TLB
> > can only be done while the VMID is idle.
> >
> > Only gfx9 can reliably invalidate the TLB while it is in use, and even
> > there it comes with quite a performance penalty (a TLB invalidation
> > can take multiple seconds).
> >
> > Because of this, what we do in the kernel driver is to sync to
> > everything when we unmap entries:
> >
> >     if (!(flags & AMDGPU_PTE_VALID))
> >         sync_mode = AMDGPU_SYNC_EQ_OWNER;
> >     else
> >         sync_mode = AMDGPU_SYNC_EXPLICIT;
> >
> > This acts as a barrier for freeing the memory. In other words, we
> > intentionally add a bubble which syncs everything.
> >
> > I've been working for months on a concept for how to do all this without
> > causing the stalls you observe, but so far haven't come to much of a
> > conclusion.

> That might cause an unmap operation too early, but for freeing up the
> actual backing memory we still wait for all fences on the BO to finish
> first, no?
> In that case, since BOOKKEEP fences are still added for explicit sync,
> that should not be a problem, no?
>
> (If not, that sounds like the obvious fix for making this work?)

As an aside, this is the same hole/issue as when an app forgets a BO in
the BO list on submission.

> > Regards,
> > Christian.
> >
> > >
> > >> Regards,
> > >> Christian.
> > >>
> > >>> Signed-off-by: Bas Nieuwenhuizen
> > >>> ---
> > >>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c  | 2 +-
> > >>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 2 +-
> > >>>  2 files changed, 2 insertions(+), 2 deletions(-)
> > >>>
> > >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
> > >>> index f10332e1c6c0..31bc73fd1fae 100644
> > >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
> > >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
> > >>> @@ -51,7 +51,7 @@ static int amdgpu_vm_cpu_prepare(struct amdgpu_vm_update_params *p,
> > >>>  	if (!resv)
> > >>>  		return 0;
> > >>>
> > >>> -	return amdgpu_bo_sync_wait_resv(p->adev, resv, sync_mode, sync_mode, p->vm, true);
> > >>> +	return amdgpu_bo_sync_wait_resv(p->adev, resv, sync_mode, AMDGPU_SYNC_EXPLICIT, p->vm, true);
> > >>>  }
> > >>>
> > >>>  /**
> > >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> > >>> index 63b484dc76c5..c8d5898bea11 100644
> > >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> > >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> > >>> @@ -75,7 +75,7 @@ static int amdgpu_vm_sdma_prepare(struct amdgpu_vm_update_params *p,
> > >>>  	if (!resv)
> > >>>  		return 0;
> > >>>
> > >>> -	return amdgpu_sync_resv(p->adev, &p->job->sync, resv, sync_mode, sync_mode, p->vm);
> > >>> +	return amdgpu_sync_resv(p->adev, &p->job->sync, resv, sync_mode, AMDGPU_SYNC_EXPLICIT, p->vm);
> > >>>  }
> > >>>
> > >>>  /**
> > >