On Wed, Jun 1, 2022, 10:48 Bas Nieuwenhuizen wrote:
> On Wed, Jun 1, 2022 at 10:40 AM Christian König wrote:
> >
> > On 01.06.22 at 10:16, Bas Nieuwenhuizen wrote:
> > > On Wed, Jun 1, 2022 at 10:03 AM Christian König wrote:
> > >> On 01.06.22 at 02:40, Bas Nieuwenhuizen wrote:
> > >>> This should be okay because moves themselves use KERNEL usage and
> > >>> hence still sync with BOOKKEEP usage. Then any later submits still
> > >>> wait on any pending VM operations.
> > >>>
> > >>> (i.e. we only made VM ops not wait on BOOKKEEP submits, not the other
> > >>> way around)
> > >> Well NAK again. This allows access to freed-up memory and is a complete
> > >> no-go.
> > > How does this allow access to freed memory? The worst I can see is that
> > > the unmap happens earlier if the app/driver gets the waits wrong,
> > > which wouldn't give access after the underlying BO is freed?
> >
> > To free up memory we need to update the PTEs and then flush those out by
> > invalidating the TLB.
> >
> > On gfx6, gfx7, gfx8 and some broken gfx10 hardware, invalidating the TLB
> > can only be done while the VMID is idle.
> >
> > Only gfx9 can reliably invalidate the TLB while it is in use, and even
> > there it comes with quite a performance penalty (a TLB invalidation
> > can take multiple seconds).
> >
> > Because of this, what we do in the kernel driver is to sync to
> > everything when we unmap entries:
> >
> >     if (!(flags & AMDGPU_PTE_VALID))
> >         sync_mode = AMDGPU_SYNC_EQ_OWNER;
> >     else
> >         sync_mode = AMDGPU_SYNC_EXPLICIT;
> >
> > This acts as a barrier for freeing the memory. In other words, we
> > intentionally add a bubble which syncs everything.
> >
> > I've been working for months on a concept for how to do all this without
> > causing the stalls you observe, but so far haven't come to much of a
> > conclusion.

> That might cause an unmap operation too early, but for freeing up the
> actual backing memory we still wait for all fences on the BO to finish
> first, no?
> In that case, since BOOKKEEP fences are still added for explicit sync,
> that should not be a problem, no?
>
> (If not, that sounds like the obvious fix for making this work?)

As an aside, this is the same hole/issue as when an app forgets a BO in
the BO list on submission.

> > Regards,
> > Christian.
> >
> > >
> > >> Regards,
> > >> Christian.
> > >>
> > >>> Signed-off-by: Bas Nieuwenhuizen
> > >>> ---
> > >>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c  | 2 +-
> > >>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 2 +-
> > >>>  2 files changed, 2 insertions(+), 2 deletions(-)
> > >>>
> > >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
> > >>> index f10332e1c6c0..31bc73fd1fae 100644
> > >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
> > >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
> > >>> @@ -51,7 +51,7 @@ static int amdgpu_vm_cpu_prepare(struct amdgpu_vm_update_params *p,
> > >>>  	if (!resv)
> > >>>  		return 0;
> > >>>
> > >>> -	return amdgpu_bo_sync_wait_resv(p->adev, resv, sync_mode, sync_mode, p->vm, true);
> > >>> +	return amdgpu_bo_sync_wait_resv(p->adev, resv, sync_mode, AMDGPU_SYNC_EXPLICIT, p->vm, true);
> > >>>  }
> > >>>
> > >>>  /**
> > >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> > >>> index 63b484dc76c5..c8d5898bea11 100644
> > >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> > >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> > >>> @@ -75,7 +75,7 @@ static int amdgpu_vm_sdma_prepare(struct amdgpu_vm_update_params *p,
> > >>>  	if (!resv)
> > >>>  		return 0;
> > >>>
> > >>> -	return amdgpu_sync_resv(p->adev, &p->job->sync, resv, sync_mode, sync_mode, p->vm);
> > >>> +	return amdgpu_sync_resv(p->adev, &p->job->sync, resv, sync_mode, AMDGPU_SYNC_EXPLICIT, p->vm);
> > >>>  }
> > >>>
> > >>>  /**
> > >