* [RFC PATCH 0/5] Add option to disable implicit sync for userspace submits.
@ 2022-06-01  0:40 Bas Nieuwenhuizen
  2022-06-01  0:40 ` [RFC PATCH 1/5] drm/ttm: Refactor num_shared into usage Bas Nieuwenhuizen
                   ` (4 more replies)
  0 siblings, 5 replies; 46+ messages in thread
From: Bas Nieuwenhuizen @ 2022-06-01  0:40 UTC (permalink / raw)
  To: dri-devel; +Cc: christian.koenig

This adds a context option to use DMA_RESV_USAGE_BOOKKEEP for userspace submissions,
based on Christian's TTM work.

Disabling implicit sync is something we've wanted in radv for a while to resolve some
corner cases. A more immediate benefit is that it also avoids a bunch of implicit sync
on GPU map/unmap operations, which helps with stutter around sparse maps/unmaps.

I have experimental userspace in radv, but it isn't 100% ready yet. There are still
issues with some games that I'm looking at, but in the meantime I'm looking for early
feedback on the idea.

Besides the debugging, an open question is whether it is worth adding the option to
wait on additional explicit syncobjs in the VM map/unmap operations. My current radv
code waits on the wait syncobj in userspace on a thread before doing the operation,
which results in some corner cases because we can't provide a binary syncobj at
submission time (impacting the usual sync file exports). However, adding these fences
risks head-of-line blocking: all VM operations get executed on the same ring, so all
later operations would get blocked by waiting on these fences as well.

I'm looking to get more implementation experience with different games to see if we
need this; if we do, it would be a somewhat separate addition to the UAPI.
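
As a rough illustration of what the series does with a BO's reservation object (a
sketch distilled from patches 1 and 5, not code taken verbatim from them): fences from
an explicitly-synced context get added with BOOKKEEP usage instead of READ/WRITE, so
implicit-sync users of the BO no longer wait on them.

  /* Per-submit usage selection, roughly as in amdgpu_cs.c after patch 5;
   * disable_implicit_sync is the new per-context flag. */
  enum dma_resv_usage usage = ctx->disable_implicit_sync ?
          DMA_RESV_USAGE_BOOKKEEP : DMA_RESV_USAGE_READ;
  dma_resv_add_fence(bo->tbo.base.resv, fence, usage);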

Bas Nieuwenhuizen (5):
  drm/ttm: Refactor num_shared into usage.
  drm/amdgpu: Add separate mode for syncing DMA_RESV_USAGE_BOOKKEEP.
  drm/amdgpu: Allow explicit sync for VM ops.
  drm/amdgpu: Refactor amdgpu_vm_get_pd_bo.
  drm/amdgpu: Add option to disable implicit sync for a context.

 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  | 21 ++++++------
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c        | 19 ++++++-----
 drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c       |  4 +--
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c       | 32 +++++++++++++++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h       |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c       | 10 +++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c    | 11 ++++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.h    |  3 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c      | 11 +++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_sync.h      |  4 +--
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c       |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c      |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c        |  7 ++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h        |  3 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c    |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c   |  2 +-
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c          |  2 +-
 .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c |  2 +-
 drivers/gpu/drm/qxl/qxl_release.c             |  2 +-
 drivers/gpu/drm/radeon/radeon_cs.c            |  5 +--
 drivers/gpu/drm/radeon/radeon_gem.c           |  2 +-
 drivers/gpu/drm/radeon/radeon_vm.c            |  4 +--
 drivers/gpu/drm/ttm/ttm_execbuf_util.c        |  5 ++-
 drivers/gpu/drm/vmwgfx/vmwgfx_resource.c      | 10 +++---
 drivers/gpu/drm/vmwgfx/vmwgfx_validation.c    |  2 +-
 include/drm/ttm/ttm_execbuf_util.h            |  3 +-
 include/uapi/drm/amdgpu_drm.h                 |  3 ++
 28 files changed, 112 insertions(+), 63 deletions(-)

-- 
2.36.1



* [RFC PATCH 1/5] drm/ttm: Refactor num_shared into usage.
  2022-06-01  0:40 [RFC PATCH 0/5] Add option to disable implicit sync for userspace submits Bas Nieuwenhuizen
@ 2022-06-01  0:40 ` Bas Nieuwenhuizen
  2022-06-01  8:02   ` Christian König
  2022-06-01  0:40 ` [RFC PATCH 2/5] drm/amdgpu: Add separate mode for syncing DMA_RESV_USAGE_BOOKKEEP Bas Nieuwenhuizen
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 46+ messages in thread
From: Bas Nieuwenhuizen @ 2022-06-01  0:40 UTC (permalink / raw)
  To: dri-devel; +Cc: christian.koenig

So that the driver can set BOOKKEEP usage for explicit sync. Some of the existing
places might already make sense for that, but I targeted this patch at no functional
changes.
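
For reference, the rough mapping applied at the call sites this touches (an
illustrative summary only, not part of the diff):

  /* Before: num_shared doubled as a read/write flag plus a fence-slot count. */
  tv.num_shared = 0;               /* exclusive access, i.e. a writer */
  tv.num_shared = 1;               /* shared access, i.e. a reader    */

  /* After: the call site states the dma_resv usage directly. */
  tv.usage = DMA_RESV_USAGE_WRITE; /* was num_shared == 0 */
  tv.usage = DMA_RESV_USAGE_READ;  /* was num_shared >= 1 */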

Signed-off-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  | 10 +++++-----
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c            |  8 +++-----
 drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c           |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c           |  6 +++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c          |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c            |  3 +--
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c              |  2 +-
 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c |  2 +-
 drivers/gpu/drm/qxl/qxl_release.c                 |  2 +-
 drivers/gpu/drm/radeon/radeon_cs.c                |  5 +++--
 drivers/gpu/drm/radeon/radeon_gem.c               |  2 +-
 drivers/gpu/drm/radeon/radeon_vm.c                |  4 ++--
 drivers/gpu/drm/ttm/ttm_execbuf_util.c            |  5 ++---
 drivers/gpu/drm/vmwgfx/vmwgfx_resource.c          | 10 +++++-----
 drivers/gpu/drm/vmwgfx/vmwgfx_validation.c        |  2 +-
 include/drm/ttm/ttm_execbuf_util.h                |  3 ++-
 16 files changed, 33 insertions(+), 35 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index a4955ef76cfc..a790a089e829 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -774,7 +774,7 @@ static void add_kgd_mem_to_kfd_bo_list(struct kgd_mem *mem,
 	struct amdgpu_bo *bo = mem->bo;
 
 	INIT_LIST_HEAD(&entry->head);
-	entry->num_shared = 1;
+	entry->usage = DMA_RESV_USAGE_READ;
 	entry->bo = &bo->tbo;
 	mutex_lock(&process_info->lock);
 	if (userptr)
@@ -918,7 +918,7 @@ static int reserve_bo_and_vm(struct kgd_mem *mem,
 
 	ctx->kfd_bo.priority = 0;
 	ctx->kfd_bo.tv.bo = &bo->tbo;
-	ctx->kfd_bo.tv.num_shared = 1;
+	ctx->kfd_bo.tv.usage = DMA_RESV_USAGE_READ;
 	list_add(&ctx->kfd_bo.tv.head, &ctx->list);
 
 	amdgpu_vm_get_pd_bo(vm, &ctx->list, &ctx->vm_pd[0]);
@@ -981,7 +981,7 @@ static int reserve_bo_and_cond_vms(struct kgd_mem *mem,
 
 	ctx->kfd_bo.priority = 0;
 	ctx->kfd_bo.tv.bo = &bo->tbo;
-	ctx->kfd_bo.tv.num_shared = 1;
+	ctx->kfd_bo.tv.usage = DMA_RESV_USAGE_READ;
 	list_add(&ctx->kfd_bo.tv.head, &ctx->list);
 
 	i = 0;
@@ -2218,7 +2218,7 @@ static int validate_invalid_user_pages(struct amdkfd_process_info *process_info)
 			    validate_list.head) {
 		list_add_tail(&mem->resv_list.head, &resv_list);
 		mem->resv_list.bo = mem->validate_list.bo;
-		mem->resv_list.num_shared = mem->validate_list.num_shared;
+		mem->resv_list.usage = mem->validate_list.usage;
 	}
 
 	/* Reserve all BOs and page tables for validation */
@@ -2417,7 +2417,7 @@ int amdgpu_amdkfd_gpuvm_restore_process_bos(void *info, struct dma_fence **ef)
 
 		list_add_tail(&mem->resv_list.head, &ctx.list);
 		mem->resv_list.bo = mem->validate_list.bo;
-		mem->resv_list.num_shared = mem->validate_list.num_shared;
+		mem->resv_list.usage = mem->validate_list.usage;
 	}
 
 	ret = ttm_eu_reserve_buffers(&ctx.ticket, &ctx.list,
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index 60ca14afb879..2ae1c0d9d33a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -55,8 +55,7 @@ static int amdgpu_cs_user_fence_chunk(struct amdgpu_cs_parser *p,
 	bo = amdgpu_bo_ref(gem_to_amdgpu_bo(gobj));
 	p->uf_entry.priority = 0;
 	p->uf_entry.tv.bo = &bo->tbo;
-	/* One for TTM and two for the CS job */
-	p->uf_entry.tv.num_shared = 3;
+	p->uf_entry.tv.usage = DMA_RESV_USAGE_READ;
 
 	drm_gem_object_put(gobj);
 
@@ -519,9 +518,8 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p,
 			return r;
 	}
 
-	/* One for TTM and one for the CS job */
 	amdgpu_bo_list_for_each_entry(e, p->bo_list)
-		e->tv.num_shared = 2;
+		e->tv.usage = DMA_RESV_USAGE_READ;
 
 	amdgpu_bo_list_get_list(p->bo_list, &p->validated);
 
@@ -1261,7 +1259,7 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p,
 
 	/* Make sure all BOs are remembered as writers */
 	amdgpu_bo_list_for_each_entry(e, p->bo_list)
-		e->tv.num_shared = 0;
+		e->tv.usage = DMA_RESV_USAGE_WRITE;
 
 	ttm_eu_fence_buffer_objects(&p->ticket, &p->validated, p->fence);
 	mutex_unlock(&p->adev->notifier_lock);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c
index c6d4d41c4393..71277257d94d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c
@@ -74,7 +74,7 @@ int amdgpu_map_static_csa(struct amdgpu_device *adev, struct amdgpu_vm *vm,
 	INIT_LIST_HEAD(&list);
 	INIT_LIST_HEAD(&csa_tv.head);
 	csa_tv.bo = &bo->tbo;
-	csa_tv.num_shared = 1;
+	csa_tv.usage = DMA_RESV_USAGE_READ;
 
 	list_add(&csa_tv.head, &list);
 	amdgpu_vm_get_pd_bo(vm, &list, &pd);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
index 84a53758e18e..7483411229f4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
@@ -207,7 +207,7 @@ static void amdgpu_gem_object_close(struct drm_gem_object *obj,
 	INIT_LIST_HEAD(&duplicates);
 
 	tv.bo = &bo->tbo;
-	tv.num_shared = 2;
+	tv.usage = DMA_RESV_USAGE_READ;
 	list_add(&tv.head, &list);
 
 	amdgpu_vm_get_pd_bo(vm, &list, &vm_pd);
@@ -731,9 +731,9 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
 		abo = gem_to_amdgpu_bo(gobj);
 		tv.bo = &abo->tbo;
 		if (abo->flags & AMDGPU_GEM_CREATE_VM_ALWAYS_VALID)
-			tv.num_shared = 1;
+			tv.usage = DMA_RESV_USAGE_READ;
 		else
-			tv.num_shared = 0;
+			tv.usage = DMA_RESV_USAGE_WRITE;
 		list_add(&tv.head, &list);
 	} else {
 		gobj = NULL;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
index 5224d9a39737..f670d8473993 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
@@ -319,7 +319,7 @@ static int amdgpu_vkms_prepare_fb(struct drm_plane *plane,
 	INIT_LIST_HEAD(&list);
 
 	tv.bo = &rbo->tbo;
-	tv.num_shared = 1;
+	tv.usage = DMA_RESV_USAGE_READ;
 	list_add(&tv.head, &list);
 
 	r = ttm_eu_reserve_buffers(&ticket, &list, false, NULL);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 15184153e2b9..515be19ab279 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -633,8 +633,7 @@ void amdgpu_vm_get_pd_bo(struct amdgpu_vm *vm,
 {
 	entry->priority = 0;
 	entry->tv.bo = &vm->root.bo->tbo;
-	/* Two for VM updates, one for TTM and one for the CS job */
-	entry->tv.num_shared = 4;
+	entry->tv.usage = DMA_RESV_USAGE_READ;
 	entry->user_pages = NULL;
 	list_add(&entry->tv.head, validated);
 }
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index b3fc3e958227..af844b636778 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -1395,7 +1395,7 @@ static int svm_range_reserve_bos(struct svm_validate_context *ctx)
 		vm = drm_priv_to_vm(pdd->drm_priv);
 
 		ctx->tv[gpuidx].bo = &vm->root.bo->tbo;
-		ctx->tv[gpuidx].num_shared = 4;
+		ctx->tv[gpuidx].usage = DMA_RESV_USAGE_READ;
 		list_add(&ctx->tv[gpuidx].head, &ctx->validate_list);
 	}
 
diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index 73423b805b54..851b7844b084 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -7601,7 +7601,7 @@ static int dm_plane_helper_prepare_fb(struct drm_plane *plane,
 	INIT_LIST_HEAD(&list);
 
 	tv.bo = &rbo->tbo;
-	tv.num_shared = 1;
+	tv.usage = DMA_RESV_USAGE_READ;
 	list_add(&tv.head, &list);
 
 	r = ttm_eu_reserve_buffers(&ticket, &list, false, NULL);
diff --git a/drivers/gpu/drm/qxl/qxl_release.c b/drivers/gpu/drm/qxl/qxl_release.c
index 368d26da0d6a..689e35192070 100644
--- a/drivers/gpu/drm/qxl/qxl_release.c
+++ b/drivers/gpu/drm/qxl/qxl_release.c
@@ -183,7 +183,7 @@ int qxl_release_list_add(struct qxl_release *release, struct qxl_bo *bo)
 
 	qxl_bo_ref(bo);
 	entry->tv.bo = &bo->tbo;
-	entry->tv.num_shared = 0;
+	entry->tv.usage = DMA_RESV_USAGE_WRITE;
 	list_add_tail(&entry->tv.head, &release->bos);
 	return 0;
 }
diff --git a/drivers/gpu/drm/radeon/radeon_cs.c b/drivers/gpu/drm/radeon/radeon_cs.c
index 446f7bae54c4..30afe0c62dd9 100644
--- a/drivers/gpu/drm/radeon/radeon_cs.c
+++ b/drivers/gpu/drm/radeon/radeon_cs.c
@@ -183,7 +183,8 @@ static int radeon_cs_parser_relocs(struct radeon_cs_parser *p)
 		}
 
 		p->relocs[i].tv.bo = &p->relocs[i].robj->tbo;
-		p->relocs[i].tv.num_shared = !r->write_domain;
+		p->relocs[i].tv.usage =
+			r->write_domain ? DMA_RESV_USAGE_WRITE : DMA_RESV_USAGE_READ;
 
 		radeon_cs_buckets_add(&buckets, &p->relocs[i].tv.head,
 				      priority);
@@ -258,7 +259,7 @@ static int radeon_cs_sync_rings(struct radeon_cs_parser *p)
 
 		resv = reloc->robj->tbo.base.resv;
 		r = radeon_sync_resv(p->rdev, &p->ib.sync, resv,
-				     reloc->tv.num_shared);
+				     reloc->tv.usage != DMA_RESV_USAGE_WRITE);
 		if (r)
 			return r;
 	}
diff --git a/drivers/gpu/drm/radeon/radeon_gem.c b/drivers/gpu/drm/radeon/radeon_gem.c
index 8c01a7f0e027..eae47c709f5d 100644
--- a/drivers/gpu/drm/radeon/radeon_gem.c
+++ b/drivers/gpu/drm/radeon/radeon_gem.c
@@ -635,7 +635,7 @@ static void radeon_gem_va_update_vm(struct radeon_device *rdev,
 	INIT_LIST_HEAD(&list);
 
 	tv.bo = &bo_va->bo->tbo;
-	tv.num_shared = 1;
+	tv.usage = DMA_RESV_USAGE_READ;
 	list_add(&tv.head, &list);
 
 	vm_bos = radeon_vm_get_bos(rdev, bo_va->vm, &list);
diff --git a/drivers/gpu/drm/radeon/radeon_vm.c b/drivers/gpu/drm/radeon/radeon_vm.c
index 987cabbf1318..702627b48dae 100644
--- a/drivers/gpu/drm/radeon/radeon_vm.c
+++ b/drivers/gpu/drm/radeon/radeon_vm.c
@@ -143,7 +143,7 @@ struct radeon_bo_list *radeon_vm_get_bos(struct radeon_device *rdev,
 	list[0].preferred_domains = RADEON_GEM_DOMAIN_VRAM;
 	list[0].allowed_domains = RADEON_GEM_DOMAIN_VRAM;
 	list[0].tv.bo = &vm->page_directory->tbo;
-	list[0].tv.num_shared = 1;
+	list[0].tv.usage = DMA_RESV_USAGE_READ;
 	list[0].tiling_flags = 0;
 	list_add(&list[0].tv.head, head);
 
@@ -155,7 +155,7 @@ struct radeon_bo_list *radeon_vm_get_bos(struct radeon_device *rdev,
 		list[idx].preferred_domains = RADEON_GEM_DOMAIN_VRAM;
 		list[idx].allowed_domains = RADEON_GEM_DOMAIN_VRAM;
 		list[idx].tv.bo = &list[idx].robj->tbo;
-		list[idx].tv.num_shared = 1;
+		list[idx].tv.usage = DMA_RESV_USAGE_READ;
 		list[idx].tiling_flags = 0;
 		list_add(&list[idx++].tv.head, head);
 	}
diff --git a/drivers/gpu/drm/ttm/ttm_execbuf_util.c b/drivers/gpu/drm/ttm/ttm_execbuf_util.c
index 0eb995d25df1..c39d8e5ac271 100644
--- a/drivers/gpu/drm/ttm/ttm_execbuf_util.c
+++ b/drivers/gpu/drm/ttm/ttm_execbuf_util.c
@@ -101,7 +101,7 @@ int ttm_eu_reserve_buffers(struct ww_acquire_ctx *ticket,
 			continue;
 		}
 
-		num_fences = min(entry->num_shared, 1u);
+		num_fences = entry->usage <= DMA_RESV_USAGE_WRITE ? 0u : 1u;
 		if (!ret) {
 			ret = dma_resv_reserve_fences(bo->base.resv,
 						      num_fences);
@@ -154,8 +154,7 @@ void ttm_eu_fence_buffer_objects(struct ww_acquire_ctx *ticket,
 	list_for_each_entry(entry, list, head) {
 		struct ttm_buffer_object *bo = entry->bo;
 
-		dma_resv_add_fence(bo->base.resv, fence, entry->num_shared ?
-				   DMA_RESV_USAGE_READ : DMA_RESV_USAGE_WRITE);
+		dma_resv_add_fence(bo->base.resv, fence, entry->usage);
 		ttm_bo_move_to_lru_tail_unlocked(bo);
 		dma_resv_unlock(bo->base.resv);
 	}
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c b/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c
index c6d02c98a19a..58dfff7d6c76 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c
@@ -130,7 +130,7 @@ static void vmw_resource_release(struct kref *kref)
 			struct ttm_validate_buffer val_buf;
 
 			val_buf.bo = bo;
-			val_buf.num_shared = 0;
+			val_buf.usage = DMA_RESV_USAGE_WRITE;
 			res->func->unbind(res, false, &val_buf);
 		}
 		res->backup_dirty = false;
@@ -552,7 +552,7 @@ vmw_resource_check_buffer(struct ww_acquire_ctx *ticket,
 	INIT_LIST_HEAD(&val_list);
 	ttm_bo_get(&res->backup->base);
 	val_buf->bo = &res->backup->base;
-	val_buf->num_shared = 0;
+	val_buf->usage = DMA_RESV_USAGE_WRITE;
 	list_add_tail(&val_buf->head, &val_list);
 	ret = ttm_eu_reserve_buffers(ticket, &val_list, interruptible, NULL);
 	if (unlikely(ret != 0))
@@ -657,7 +657,7 @@ static int vmw_resource_do_evict(struct ww_acquire_ctx *ticket,
 	BUG_ON(!func->may_evict);
 
 	val_buf.bo = NULL;
-	val_buf.num_shared = 0;
+	val_buf.usage = DMA_RESV_USAGE_WRITE;
 	ret = vmw_resource_check_buffer(ticket, res, interruptible, &val_buf);
 	if (unlikely(ret != 0))
 		return ret;
@@ -708,7 +708,7 @@ int vmw_resource_validate(struct vmw_resource *res, bool intr,
 		return 0;
 
 	val_buf.bo = NULL;
-	val_buf.num_shared = 0;
+	val_buf.usage = DMA_RESV_USAGE_WRITE;
 	if (res->backup)
 		val_buf.bo = &res->backup->base;
 	do {
@@ -777,7 +777,7 @@ void vmw_resource_unbind_list(struct vmw_buffer_object *vbo)
 {
 	struct ttm_validate_buffer val_buf = {
 		.bo = &vbo->base,
-		.num_shared = 0
+		.usage = DMA_RESV_USAGE_WRITE
 	};
 
 	dma_resv_assert_held(vbo->base.base.resv);
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_validation.c b/drivers/gpu/drm/vmwgfx/vmwgfx_validation.c
index f46891012be3..0476ba498321 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_validation.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_validation.c
@@ -288,7 +288,7 @@ int vmw_validation_add_bo(struct vmw_validation_context *ctx,
 		val_buf->bo = ttm_bo_get_unless_zero(&vbo->base);
 		if (!val_buf->bo)
 			return -ESRCH;
-		val_buf->num_shared = 0;
+		val_buf->usage = DMA_RESV_USAGE_WRITE;
 		list_add_tail(&val_buf->head, &ctx->bo_list);
 		bo_node->as_mob = as_mob;
 		bo_node->cpu_blit = cpu_blit;
diff --git a/include/drm/ttm/ttm_execbuf_util.h b/include/drm/ttm/ttm_execbuf_util.h
index a99d7fdf2964..851961a06c27 100644
--- a/include/drm/ttm/ttm_execbuf_util.h
+++ b/include/drm/ttm/ttm_execbuf_util.h
@@ -31,6 +31,7 @@
 #ifndef _TTM_EXECBUF_UTIL_H_
 #define _TTM_EXECBUF_UTIL_H_
 
+#include <linux/dma-resv.h>
 #include <linux/list.h>
 
 #include "ttm_bo_api.h"
@@ -46,7 +47,7 @@
 struct ttm_validate_buffer {
 	struct list_head head;
 	struct ttm_buffer_object *bo;
-	unsigned int num_shared;
+	enum dma_resv_usage usage;
 };
 
 /**
-- 
2.36.1



* [RFC PATCH 2/5] drm/amdgpu: Add separate mode for syncing DMA_RESV_USAGE_BOOKKEEP.
  2022-06-01  0:40 [RFC PATCH 0/5] Add option to disable implicit sync for userspace submits Bas Nieuwenhuizen
  2022-06-01  0:40 ` [RFC PATCH 1/5] drm/ttm: Refactor num_shared into usage Bas Nieuwenhuizen
@ 2022-06-01  0:40 ` Bas Nieuwenhuizen
  2022-06-01  0:40 ` [RFC PATCH 3/5] drm/amdgpu: Allow explicit sync for VM ops Bas Nieuwenhuizen
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 46+ messages in thread
From: Bas Nieuwenhuizen @ 2022-06-01  0:40 UTC (permalink / raw)
  To: dri-devel; +Cc: christian.koenig

To prep for allowing different sync modes in a follow-up patch.
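
Concretely, amdgpu_sync_resv() now takes two modes and picks between them based on the
fence's usage; a condensed sketch of the selection added in the amdgpu_sync.c hunk below:

  enum amdgpu_sync_mode mode = implicit_mode;

  /* BOOKKEEP fences (explicitly synced work) follow the new explicit mode;
   * everything up to DMA_RESV_USAGE_READ keeps using the implicit mode. */
  if (dma_resv_iter_usage(&cursor) >= DMA_RESV_USAGE_BOOKKEEP)
          mode = explicit_mode;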

Signed-off-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c           |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c       | 11 +++++++----
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.h       |  3 ++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c         | 11 ++++++++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_sync.h         |  4 ++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c          |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c          |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c       |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c      |  2 +-
 10 files changed, 25 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index a790a089e829..92a1b08b3bbc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -1157,7 +1157,7 @@ static int process_sync_pds_resv(struct amdkfd_process_info *process_info,
 		struct amdgpu_bo *pd = peer_vm->root.bo;
 
 		ret = amdgpu_sync_resv(NULL, sync, pd->tbo.base.resv,
-				       AMDGPU_SYNC_NE_OWNER,
+				       AMDGPU_SYNC_NE_OWNER, AMDGPU_SYNC_NE_OWNER,
 				       AMDGPU_FENCE_OWNER_KFD);
 		if (ret)
 			return ret;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index 2ae1c0d9d33a..0318a6d46a41 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -654,7 +654,7 @@ static int amdgpu_cs_sync_rings(struct amdgpu_cs_parser *p)
 		sync_mode = amdgpu_bo_explicit_sync(bo) ?
 			AMDGPU_SYNC_EXPLICIT : AMDGPU_SYNC_NE_OWNER;
 		r = amdgpu_sync_resv(p->adev, &p->job->sync, resv, sync_mode,
-				     &fpriv->vm);
+				     AMDGPU_SYNC_EXPLICIT, &fpriv->vm);
 		if (r)
 			return r;
 	}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index 91b99eb7dc35..63e6f7b8b522 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -1407,7 +1407,8 @@ void amdgpu_bo_fence(struct amdgpu_bo *bo, struct dma_fence *fence,
  *
  * @adev: amdgpu device pointer
  * @resv: reservation object to sync to
- * @sync_mode: synchronization mode
+ * @implicit_sync_mode: synchronization mode for usage <= DMA_RESV_USAGE_READ
+ * @explicit_sync_mode: synchronization mode for usage DMA_RESV_USAGE_BOOKKEEP
  * @owner: fence owner
  * @intr: Whether the wait is interruptible
  *
@@ -1417,14 +1418,15 @@ void amdgpu_bo_fence(struct amdgpu_bo *bo, struct dma_fence *fence,
  * 0 on success, errno otherwise.
  */
 int amdgpu_bo_sync_wait_resv(struct amdgpu_device *adev, struct dma_resv *resv,
-			     enum amdgpu_sync_mode sync_mode, void *owner,
+			     enum amdgpu_sync_mode implicit_sync_mode,
+			     enum amdgpu_sync_mode explicit_sync_mode, void *owner,
 			     bool intr)
 {
 	struct amdgpu_sync sync;
 	int r;
 
 	amdgpu_sync_create(&sync);
-	amdgpu_sync_resv(adev, &sync, resv, sync_mode, owner);
+	amdgpu_sync_resv(adev, &sync, resv, implicit_sync_mode, explicit_sync_mode, owner);
 	r = amdgpu_sync_wait(&sync, intr);
 	amdgpu_sync_free(&sync);
 	return r;
@@ -1445,7 +1447,8 @@ int amdgpu_bo_sync_wait(struct amdgpu_bo *bo, void *owner, bool intr)
 	struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
 
 	return amdgpu_bo_sync_wait_resv(adev, bo->tbo.base.resv,
-					AMDGPU_SYNC_NE_OWNER, owner, intr);
+					AMDGPU_SYNC_NE_OWNER, AMDGPU_SYNC_EXPLICIT,
+					owner, intr);
 }
 
 /**
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
index 4c9cbdc66995..9540ee1102ad 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
@@ -321,7 +321,8 @@ vm_fault_t amdgpu_bo_fault_reserve_notify(struct ttm_buffer_object *bo);
 void amdgpu_bo_fence(struct amdgpu_bo *bo, struct dma_fence *fence,
 		     bool shared);
 int amdgpu_bo_sync_wait_resv(struct amdgpu_device *adev, struct dma_resv *resv,
-			     enum amdgpu_sync_mode sync_mode, void *owner,
+			     enum amdgpu_sync_mode implicit_sync_mode,
+			     enum amdgpu_sync_mode explicit_sync_mode, void *owner,
 			     bool intr);
 int amdgpu_bo_sync_wait(struct amdgpu_bo *bo, void *owner, bool intr);
 u64 amdgpu_bo_gpu_offset(struct amdgpu_bo *bo);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
index 11c46b3e4c60..b40cd4eff6a3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
@@ -243,14 +243,15 @@ static bool amdgpu_sync_test_fence(struct amdgpu_device *adev,
  * @adev: amdgpu device
  * @sync: sync object to add fences from reservation object to
  * @resv: reservation object with embedded fence
- * @mode: how owner affects which fences we sync to
+ * @implicit_mode: how owner affects which fences with usage <= DMA_RESV_USAGE_READ we sync to
+ * @explicit_mode: how owner affects which fences with usage DMA_RESV_USAGE_BOOKKEEP we sync to
  * @owner: owner of the planned job submission
  *
  * Sync to the fence
  */
 int amdgpu_sync_resv(struct amdgpu_device *adev, struct amdgpu_sync *sync,
-		     struct dma_resv *resv, enum amdgpu_sync_mode mode,
-		     void *owner)
+		     struct dma_resv *resv, enum amdgpu_sync_mode implicit_mode,
+		     enum amdgpu_sync_mode explicit_mode, void *owner)
 {
 	struct dma_resv_iter cursor;
 	struct dma_fence *f;
@@ -263,6 +264,10 @@ int amdgpu_sync_resv(struct amdgpu_device *adev, struct amdgpu_sync *sync,
 	dma_resv_for_each_fence(&cursor, resv, DMA_RESV_USAGE_BOOKKEEP, f) {
 		dma_fence_chain_for_each(f, f) {
 			struct dma_fence *tmp = dma_fence_chain_contained(f);
+			enum amdgpu_sync_mode mode = implicit_mode;
+
+			if (dma_resv_iter_usage(&cursor) >= DMA_RESV_USAGE_BOOKKEEP)
+				mode = explicit_mode;
 
 			if (amdgpu_sync_test_fence(adev, mode, owner, tmp)) {
 				r = amdgpu_sync_fence(sync, f);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.h
index 7c0fe20c470d..f786e30eb0a3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.h
@@ -50,8 +50,8 @@ void amdgpu_sync_create(struct amdgpu_sync *sync);
 int amdgpu_sync_fence(struct amdgpu_sync *sync, struct dma_fence *f);
 int amdgpu_sync_vm_fence(struct amdgpu_sync *sync, struct dma_fence *fence);
 int amdgpu_sync_resv(struct amdgpu_device *adev, struct amdgpu_sync *sync,
-		     struct dma_resv *resv, enum amdgpu_sync_mode mode,
-		     void *owner);
+		     struct dma_resv *resv, enum amdgpu_sync_mode implicit_mode,
+		     enum amdgpu_sync_mode explicit_mode, void *owner);
 struct dma_fence *amdgpu_sync_peek_fence(struct amdgpu_sync *sync,
 				     struct amdgpu_ring *ring);
 struct dma_fence *amdgpu_sync_get_fence(struct amdgpu_sync *sync);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index 48a635864a92..00a749016b6d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -1971,6 +1971,7 @@ static int amdgpu_ttm_prepare_job(struct amdgpu_device *adev,
 	if (resv) {
 		r = amdgpu_sync_resv(adev, &(*job)->sync, resv,
 				     AMDGPU_SYNC_ALWAYS,
+				     AMDGPU_SYNC_EXPLICIT,
 				     AMDGPU_FENCE_OWNER_UNDEFINED);
 		if (r) {
 			DRM_ERROR("sync failed (%d).\n", r);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
index 6eac649499d3..de08bab400d5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
@@ -1176,7 +1176,7 @@ static int amdgpu_uvd_send_msg(struct amdgpu_ring *ring, struct amdgpu_bo *bo,
 			goto err_free;
 	} else {
 		r = amdgpu_sync_resv(adev, &job->sync, bo->tbo.base.resv,
-				     AMDGPU_SYNC_ALWAYS,
+				     AMDGPU_SYNC_ALWAYS, AMDGPU_SYNC_ALWAYS,
 				     AMDGPU_FENCE_OWNER_UNDEFINED);
 		if (r)
 			goto err_free;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
index 31913ae86de6..f10332e1c6c0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
@@ -51,7 +51,7 @@ static int amdgpu_vm_cpu_prepare(struct amdgpu_vm_update_params *p,
 	if (!resv)
 		return 0;
 
-	return amdgpu_bo_sync_wait_resv(p->adev, resv, sync_mode, p->vm, true);
+	return amdgpu_bo_sync_wait_resv(p->adev, resv, sync_mode, sync_mode, p->vm, true);
 }
 
 /**
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
index bdb44cee19d3..63b484dc76c5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
@@ -75,7 +75,7 @@ static int amdgpu_vm_sdma_prepare(struct amdgpu_vm_update_params *p,
 	if (!resv)
 		return 0;
 
-	return amdgpu_sync_resv(p->adev, &p->job->sync, resv, sync_mode, p->vm);
+	return amdgpu_sync_resv(p->adev, &p->job->sync, resv, sync_mode, sync_mode, p->vm);
 }
 
 /**
-- 
2.36.1



* [RFC PATCH 3/5] drm/amdgpu: Allow explicit sync for VM ops.
  2022-06-01  0:40 [RFC PATCH 0/5] Add option to disable implicit sync for userspace submits Bas Nieuwenhuizen
  2022-06-01  0:40 ` [RFC PATCH 1/5] drm/ttm: Refactor num_shared into usage Bas Nieuwenhuizen
  2022-06-01  0:40 ` [RFC PATCH 2/5] drm/amdgpu: Add separate mode for syncing DMA_RESV_USAGE_BOOKKEEP Bas Nieuwenhuizen
@ 2022-06-01  0:40 ` Bas Nieuwenhuizen
  2022-06-01  8:03   ` Christian König
  2022-06-01  0:40 ` [RFC PATCH 4/5] drm/amdgpu: Refactor amdgpu_vm_get_pd_bo Bas Nieuwenhuizen
  2022-06-01  0:40 ` [RFC PATCH 5/5] drm/amdgpu: Add option to disable implicit sync for a context Bas Nieuwenhuizen
  4 siblings, 1 reply; 46+ messages in thread
From: Bas Nieuwenhuizen @ 2022-06-01  0:40 UTC (permalink / raw)
  To: dri-devel; +Cc: christian.koenig

This should be okay because moves themselves use KERNEL usage and
hence still sync with BOOKKEEP usage. Then any later submits still
wait on any pending VM operations.

(i.e. we only made VM ops not wait on BOOKKEEP submits, not the other
 way around)
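
The reasoning relies on dma_resv usage levels being ordered; a short sketch expressing
that assumption (based on the ordering in include/linux/dma-resv.h, not part of this
patch):

  /* Usages are ordered; syncing at a given level also covers all lower
   * (stricter) levels, so work that syncs up to BOOKKEEP still waits on
   * KERNEL fences such as TTM moves. */
  BUILD_BUG_ON(!(DMA_RESV_USAGE_KERNEL < DMA_RESV_USAGE_WRITE &&
                 DMA_RESV_USAGE_WRITE  < DMA_RESV_USAGE_READ  &&
                 DMA_RESV_USAGE_READ   < DMA_RESV_USAGE_BOOKKEEP));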

Signed-off-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c  | 2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
index f10332e1c6c0..31bc73fd1fae 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
@@ -51,7 +51,7 @@ static int amdgpu_vm_cpu_prepare(struct amdgpu_vm_update_params *p,
 	if (!resv)
 		return 0;
 
-	return amdgpu_bo_sync_wait_resv(p->adev, resv, sync_mode, sync_mode, p->vm, true);
+	return amdgpu_bo_sync_wait_resv(p->adev, resv, sync_mode, AMDGPU_SYNC_EXPLICIT, p->vm, true);
 }
 
 /**
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
index 63b484dc76c5..c8d5898bea11 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
@@ -75,7 +75,7 @@ static int amdgpu_vm_sdma_prepare(struct amdgpu_vm_update_params *p,
 	if (!resv)
 		return 0;
 
-	return amdgpu_sync_resv(p->adev, &p->job->sync, resv, sync_mode, sync_mode, p->vm);
+	return amdgpu_sync_resv(p->adev, &p->job->sync, resv, sync_mode, AMDGPU_SYNC_EXPLICIT, p->vm);
 }
 
 /**
-- 
2.36.1



* [RFC PATCH 4/5] drm/amdgpu: Refactor amdgpu_vm_get_pd_bo.
  2022-06-01  0:40 [RFC PATCH 0/5] Add option to disable implicit sync for userspace submits Bas Nieuwenhuizen
                   ` (2 preceding siblings ...)
  2022-06-01  0:40 ` [RFC PATCH 3/5] drm/amdgpu: Allow explicit sync for VM ops Bas Nieuwenhuizen
@ 2022-06-01  0:40 ` Bas Nieuwenhuizen
  2022-06-01  0:40 ` [RFC PATCH 5/5] drm/amdgpu: Add option to disable implicit sync for a context Bas Nieuwenhuizen
  4 siblings, 0 replies; 46+ messages in thread
From: Bas Nieuwenhuizen @ 2022-06-01  0:40 UTC (permalink / raw)
  To: dri-devel; +Cc: christian.koenig

We want to take only a BOOKKEEP usage for contexts that are not implicitly synced,
so let callers of amdgpu_vm_get_pd_bo() pass the resv usage explicitly.

Signed-off-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 9 +++++----
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c           | 2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c          | 2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c          | 4 ++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c           | 6 ++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h           | 3 ++-
 6 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 92a1b08b3bbc..c47695b37a1c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -921,7 +921,7 @@ static int reserve_bo_and_vm(struct kgd_mem *mem,
 	ctx->kfd_bo.tv.usage = DMA_RESV_USAGE_READ;
 	list_add(&ctx->kfd_bo.tv.head, &ctx->list);
 
-	amdgpu_vm_get_pd_bo(vm, &ctx->list, &ctx->vm_pd[0]);
+	amdgpu_vm_get_pd_bo(vm, &ctx->list, &ctx->vm_pd[0], DMA_RESV_USAGE_READ);
 
 	ret = ttm_eu_reserve_buffers(&ctx->ticket, &ctx->list,
 				     false, &ctx->duplicates);
@@ -992,7 +992,7 @@ static int reserve_bo_and_cond_vms(struct kgd_mem *mem,
 			continue;
 
 		amdgpu_vm_get_pd_bo(entry->bo_va->base.vm, &ctx->list,
-				&ctx->vm_pd[i]);
+				&ctx->vm_pd[i], DMA_RESV_USAGE_READ);
 		i++;
 	}
 
@@ -2212,7 +2212,7 @@ static int validate_invalid_user_pages(struct amdkfd_process_info *process_info)
 	list_for_each_entry(peer_vm, &process_info->vm_list_head,
 			    vm_list_node)
 		amdgpu_vm_get_pd_bo(peer_vm, &resv_list,
-				    &pd_bo_list_entries[i++]);
+				    &pd_bo_list_entries[i++], DMA_RESV_USAGE_READ);
 	/* Add the userptr_inval_list entries to resv_list */
 	list_for_each_entry(mem, &process_info->userptr_inval_list,
 			    validate_list.head) {
@@ -2407,7 +2407,8 @@ int amdgpu_amdkfd_gpuvm_restore_process_bos(void *info, struct dma_fence **ef)
 	mutex_lock(&process_info->lock);
 	list_for_each_entry(peer_vm, &process_info->vm_list_head,
 			vm_list_node)
-		amdgpu_vm_get_pd_bo(peer_vm, &ctx.list, &pd_bo_list[i++]);
+		amdgpu_vm_get_pd_bo(peer_vm, &ctx.list, &pd_bo_list[i++],
+				    DMA_RESV_USAGE_READ);
 
 	/* Reserve all BOs and page tables/directory. Add all BOs from
 	 * kfd_bo_list to ctx.list
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index 0318a6d46a41..64419f55606f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -524,7 +524,7 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p,
 	amdgpu_bo_list_get_list(p->bo_list, &p->validated);
 
 	INIT_LIST_HEAD(&duplicates);
-	amdgpu_vm_get_pd_bo(&fpriv->vm, &p->validated, &p->vm_pd);
+	amdgpu_vm_get_pd_bo(&fpriv->vm, &p->validated, &p->vm_pd, DMA_RESV_USAGE_READ);
 
 	if (p->uf_entry.tv.bo && !ttm_to_amdgpu_bo(p->uf_entry.tv.bo)->parent)
 		list_add(&p->uf_entry.tv.head, &p->validated);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c
index 71277257d94d..f091fe6bb985 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c
@@ -77,7 +77,7 @@ int amdgpu_map_static_csa(struct amdgpu_device *adev, struct amdgpu_vm *vm,
 	csa_tv.usage = DMA_RESV_USAGE_READ;
 
 	list_add(&csa_tv.head, &list);
-	amdgpu_vm_get_pd_bo(vm, &list, &pd);
+	amdgpu_vm_get_pd_bo(vm, &list, &pd, DMA_RESV_USAGE_READ);
 
 	r = ttm_eu_reserve_buffers(&ticket, &list, true, NULL);
 	if (r) {
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
index 7483411229f4..a1194a0986bf 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
@@ -210,7 +210,7 @@ static void amdgpu_gem_object_close(struct drm_gem_object *obj,
 	tv.usage = DMA_RESV_USAGE_READ;
 	list_add(&tv.head, &list);
 
-	amdgpu_vm_get_pd_bo(vm, &list, &vm_pd);
+	amdgpu_vm_get_pd_bo(vm, &list, &vm_pd, DMA_RESV_USAGE_READ);
 
 	r = ttm_eu_reserve_buffers(&ticket, &list, false, &duplicates);
 	if (r) {
@@ -740,7 +740,7 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
 		abo = NULL;
 	}
 
-	amdgpu_vm_get_pd_bo(&fpriv->vm, &list, &vm_pd);
+	amdgpu_vm_get_pd_bo(&fpriv->vm, &list, &vm_pd, DMA_RESV_USAGE_READ);
 
 	r = ttm_eu_reserve_buffers(&ticket, &list, true, &duplicates);
 	if (r)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 515be19ab279..da04072a3ea6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -623,17 +623,19 @@ static void amdgpu_vm_pt_next_dfs(struct amdgpu_device *adev,
  * @vm: vm providing the BOs
  * @validated: head of validation list
  * @entry: entry to add
+ * @resv_usage: resv usage for the synchronization
  *
  * Add the page directory to the list of BOs to
  * validate for command submission.
  */
 void amdgpu_vm_get_pd_bo(struct amdgpu_vm *vm,
 			 struct list_head *validated,
-			 struct amdgpu_bo_list_entry *entry)
+			 struct amdgpu_bo_list_entry *entry,
+			 enum dma_resv_usage resv_usage)
 {
 	entry->priority = 0;
 	entry->tv.bo = &vm->root.bo->tbo;
-	entry->tv.usage = DMA_RESV_USAGE_READ;
+	entry->tv.usage = resv_usage;
 	entry->user_pages = NULL;
 	list_add(&entry->tv.head, validated);
 }
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
index a40a6a993bb0..a14cd9716f44 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
@@ -384,7 +384,8 @@ void amdgpu_vm_release_compute(struct amdgpu_device *adev, struct amdgpu_vm *vm)
 void amdgpu_vm_fini(struct amdgpu_device *adev, struct amdgpu_vm *vm);
 void amdgpu_vm_get_pd_bo(struct amdgpu_vm *vm,
 			 struct list_head *validated,
-			 struct amdgpu_bo_list_entry *entry);
+			 struct amdgpu_bo_list_entry *entry,
+			 enum dma_resv_usage resv_usage);
 bool amdgpu_vm_ready(struct amdgpu_vm *vm);
 int amdgpu_vm_validate_pt_bos(struct amdgpu_device *adev, struct amdgpu_vm *vm,
 			      int (*callback)(void *p, struct amdgpu_bo *bo),
-- 
2.36.1



* [RFC PATCH 5/5] drm/amdgpu: Add option to disable implicit sync for a context.
  2022-06-01  0:40 [RFC PATCH 0/5] Add option to disable implicit sync for userspace submits Bas Nieuwenhuizen
                   ` (3 preceding siblings ...)
  2022-06-01  0:40 ` [RFC PATCH 4/5] drm/amdgpu: Refactor amdgpu_vm_get_pd_bo Bas Nieuwenhuizen
@ 2022-06-01  0:40 ` Bas Nieuwenhuizen
  4 siblings, 0 replies; 46+ messages in thread
From: Bas Nieuwenhuizen @ 2022-06-01  0:40 UTC (permalink / raw)
  To: dri-devel; +Cc: christian.koenig

This changes all BO usages in a submit to BOOKKEEP instead of READ,
which effectively disables implicit sync for these submits.

This is configured at the context level using the existing CTX ioctl.
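
For completeness, a minimal sketch of how userspace might flip the flag on an existing
context (the op and flag names come from the uapi hunk below; the exact flag semantics
are still RFC, the helper name is made up and error handling is omitted):

  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <drm/amdgpu_drm.h>

  /* Hypothetical helper: disable implicit sync for context 'ctx_id'. */
  static int amdgpu_ctx_disable_implicit_sync(int drm_fd, uint32_t ctx_id)
  {
          union drm_amdgpu_ctx args = {0};

          args.in.op = AMDGPU_CTX_OP_SET_IMPLICIT_SYNC;
          args.in.ctx_id = ctx_id;
          args.in.flags = 0; /* AMDGPU_CTX_IMPICIT_SYNC_ENABLED left unset */

          return ioctl(drm_fd, DRM_IOCTL_AMDGPU_CTX, &args);
  }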

Signed-off-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c  | 13 ++++++----
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 32 +++++++++++++++++++++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h |  1 +
 include/uapi/drm/amdgpu_drm.h           |  3 +++
 4 files changed, 43 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index 64419f55606f..944028d0ed6d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -498,6 +498,7 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p,
 	struct amdgpu_bo *gws;
 	struct amdgpu_bo *oa;
 	int r;
+	enum dma_resv_usage resv_usage;
 
 	INIT_LIST_HEAD(&p->validated);
 
@@ -518,13 +519,16 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p,
 			return r;
 	}
 
+	resv_usage = p->ctx->disable_implicit_sync ? DMA_RESV_USAGE_BOOKKEEP :
+						     DMA_RESV_USAGE_READ;
+
 	amdgpu_bo_list_for_each_entry(e, p->bo_list)
-		e->tv.usage = DMA_RESV_USAGE_READ;
+		e->tv.usage = resv_usage;
 
 	amdgpu_bo_list_get_list(p->bo_list, &p->validated);
 
 	INIT_LIST_HEAD(&duplicates);
-	amdgpu_vm_get_pd_bo(&fpriv->vm, &p->validated, &p->vm_pd, DMA_RESV_USAGE_READ);
+	amdgpu_vm_get_pd_bo(&fpriv->vm, &p->validated, &p->vm_pd, resv_usage);
 
 	if (p->uf_entry.tv.bo && !ttm_to_amdgpu_bo(p->uf_entry.tv.bo)->parent)
 		list_add(&p->uf_entry.tv.head, &p->validated);
@@ -651,7 +655,7 @@ static int amdgpu_cs_sync_rings(struct amdgpu_cs_parser *p)
 		struct dma_resv *resv = bo->tbo.base.resv;
 		enum amdgpu_sync_mode sync_mode;
 
-		sync_mode = amdgpu_bo_explicit_sync(bo) ?
+		sync_mode = (amdgpu_bo_explicit_sync(bo) || p->ctx->disable_implicit_sync) ?
 			AMDGPU_SYNC_EXPLICIT : AMDGPU_SYNC_NE_OWNER;
 		r = amdgpu_sync_resv(p->adev, &p->job->sync, resv, sync_mode,
 				     AMDGPU_SYNC_EXPLICIT, &fpriv->vm);
@@ -1259,7 +1263,8 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p,
 
 	/* Make sure all BOs are remembered as writers */
 	amdgpu_bo_list_for_each_entry(e, p->bo_list)
-		e->tv.usage = DMA_RESV_USAGE_WRITE;
+		e->tv.usage = p->ctx->disable_implicit_sync ? DMA_RESV_USAGE_BOOKKEEP
+							    : DMA_RESV_USAGE_WRITE;
 
 	ttm_eu_fence_buffer_objects(&p->ticket, &p->validated, p->fence);
 	mutex_unlock(&p->adev->notifier_lock);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
index c317078d1afd..5fd3ad630194 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
@@ -559,8 +559,6 @@ static int amdgpu_ctx_query2(struct amdgpu_device *adev,
 	return 0;
 }
 
-
-
 static int amdgpu_ctx_stable_pstate(struct amdgpu_device *adev,
 				    struct amdgpu_fpriv *fpriv, uint32_t id,
 				    bool set, u32 *stable_pstate)
@@ -589,6 +587,30 @@ static int amdgpu_ctx_stable_pstate(struct amdgpu_device *adev,
 	return r;
 }
 
+static int amdgpu_ctx_set_implicit_sync(struct amdgpu_device *adev,
+					struct amdgpu_fpriv *fpriv, uint32_t id,
+					bool enable)
+{
+	struct amdgpu_ctx *ctx;
+	struct amdgpu_ctx_mgr *mgr;
+
+	if (!fpriv)
+		return -EINVAL;
+
+	mgr = &fpriv->ctx_mgr;
+	mutex_lock(&mgr->lock);
+	ctx = idr_find(&mgr->ctx_handles, id);
+	if (!ctx) {
+		mutex_unlock(&mgr->lock);
+		return -EINVAL;
+	}
+
+	ctx->disable_implicit_sync = !enable;
+
+	mutex_unlock(&mgr->lock);
+	return 0;
+}
+
 int amdgpu_ctx_ioctl(struct drm_device *dev, void *data,
 		     struct drm_file *filp)
 {
@@ -637,6 +659,12 @@ int amdgpu_ctx_ioctl(struct drm_device *dev, void *data,
 			return -EINVAL;
 		r = amdgpu_ctx_stable_pstate(adev, fpriv, id, true, &stable_pstate);
 		break;
+	case AMDGPU_CTX_OP_SET_IMPLICIT_SYNC:
+		if ((args->in.flags & ~AMDGPU_CTX_IMPICIT_SYNC_ENABLED) || args->in.priority)
+			return -EINVAL;
+		r = amdgpu_ctx_set_implicit_sync(adev, fpriv, id,
+						 args->in.flags & ~AMDGPU_CTX_IMPICIT_SYNC_ENABLED);
+		break;
 	default:
 		return -EINVAL;
 	}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h
index 142f2f87d44c..7675838d1640 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h
@@ -54,6 +54,7 @@ struct amdgpu_ctx {
 	unsigned long			ras_counter_ce;
 	unsigned long			ras_counter_ue;
 	uint32_t			stable_pstate;
+	bool				disable_implicit_sync;
 };
 
 struct amdgpu_ctx_mgr {
diff --git a/include/uapi/drm/amdgpu_drm.h b/include/uapi/drm/amdgpu_drm.h
index 1d65c1fbc4ec..09d9388e35a7 100644
--- a/include/uapi/drm/amdgpu_drm.h
+++ b/include/uapi/drm/amdgpu_drm.h
@@ -208,6 +208,7 @@ union drm_amdgpu_bo_list {
 #define AMDGPU_CTX_OP_QUERY_STATE2	4
 #define AMDGPU_CTX_OP_GET_STABLE_PSTATE	5
 #define AMDGPU_CTX_OP_SET_STABLE_PSTATE	6
+#define AMDGPU_CTX_OP_SET_IMPLICIT_SYNC	7
 
 /* GPU reset status */
 #define AMDGPU_CTX_NO_RESET		0
@@ -248,6 +249,8 @@ union drm_amdgpu_bo_list {
 #define AMDGPU_CTX_STABLE_PSTATE_MIN_MCLK  3
 #define AMDGPU_CTX_STABLE_PSTATE_PEAK  4
 
+#define AMDGPU_CTX_IMPICIT_SYNC_ENABLED 1
+
 struct drm_amdgpu_ctx_in {
 	/** AMDGPU_CTX_OP_* */
 	__u32	op;
-- 
2.36.1



* Re: [RFC PATCH 1/5] drm/ttm: Refactor num_shared into usage.
  2022-06-01  0:40 ` [RFC PATCH 1/5] drm/ttm: Refactor num_shared into usage Bas Nieuwenhuizen
@ 2022-06-01  8:02   ` Christian König
  2022-06-01  8:11     ` Bas Nieuwenhuizen
  2022-06-01  8:41     ` Daniel Vetter
  0 siblings, 2 replies; 46+ messages in thread
From: Christian König @ 2022-06-01  8:02 UTC (permalink / raw)
  To: Bas Nieuwenhuizen, dri-devel

On 01.06.22 at 02:40, Bas Nieuwenhuizen wrote:
> So that the driver can set some BOOKKEEP for explicit sync. Maybe
> some of the existing places would already make sense for that, but
> I targeted this for no functional changes.

Well first of all NAK to that one since it will totally break cases 
which need to reserve more than one fence slot.

Also as discussed with Daniel we don't want to use BOOKKEEP for implicit 
sync. We should instead use READ for that.

BOOKKEEP is for stuff userspace should never be aware of, e.g. like page 
table updates and KFD eviction fences.

Regards,
Christian.

>   	struct ttm_buffer_object *bo;
> -	unsigned int num_shared;
> +	enum dma_resv_usage usage;
>   };
>   
>   /**
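
For reference, a caller-side sketch of the refactored interface, mirroring
the converted call sites above (illustrative only; the declarations of
ticket, list, bo and fence are assumed to come from the surrounding driver
code):

	struct ttm_validate_buffer tv = {
		.bo    = &bo->tbo,
		.usage = DMA_RESV_USAGE_READ,	/* was tv.num_shared = 1 */
	};

	list_add(&tv.head, &list);
	r = ttm_eu_reserve_buffers(&ticket, &list, false, NULL);
	/* ... validate / build and submit the job ... */
	ttm_eu_fence_buffer_objects(&ticket, &list, fence);
	/* the fence is added with exactly the usage the driver picked */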


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 3/5] drm/amdgpu: Allow explicit sync for VM ops.
  2022-06-01  0:40 ` [RFC PATCH 3/5] drm/amdgpu: Allow explicit sync for VM ops Bas Nieuwenhuizen
@ 2022-06-01  8:03   ` Christian König
  2022-06-01  8:16     ` Bas Nieuwenhuizen
  0 siblings, 1 reply; 46+ messages in thread
From: Christian König @ 2022-06-01  8:03 UTC (permalink / raw)
  To: Bas Nieuwenhuizen, dri-devel

On 01.06.22 at 02:40, Bas Nieuwenhuizen wrote:
> This should be okay because moves themselves use KERNEL usage and
> hence still sync with BOOKKEEP usage. Then any later submits still
> wait on any pending VM operations.
>
> (i.e. we only made VM ops not wait on BOOKKEEP submits, not the other
>   way around)

Well NAK again. This allows access to freed up memory and is a complete 
no-go.

Regards,
Christian.

>
> Signed-off-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c  | 2 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 2 +-
>   2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
> index f10332e1c6c0..31bc73fd1fae 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
> @@ -51,7 +51,7 @@ static int amdgpu_vm_cpu_prepare(struct amdgpu_vm_update_params *p,
>   	if (!resv)
>   		return 0;
>   
> -	return amdgpu_bo_sync_wait_resv(p->adev, resv, sync_mode, sync_mode, p->vm, true);
> +	return amdgpu_bo_sync_wait_resv(p->adev, resv, sync_mode, AMDGPU_SYNC_EXPLICIT, p->vm, true);
>   }
>   
>   /**
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> index 63b484dc76c5..c8d5898bea11 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> @@ -75,7 +75,7 @@ static int amdgpu_vm_sdma_prepare(struct amdgpu_vm_update_params *p,
>   	if (!resv)
>   		return 0;
>   
> -	return amdgpu_sync_resv(p->adev, &p->job->sync, resv, sync_mode, sync_mode, p->vm);
> +	return amdgpu_sync_resv(p->adev, &p->job->sync, resv, sync_mode, AMDGPU_SYNC_EXPLICIT, p->vm);
>   }
>   
>   /**


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 1/5] drm/ttm: Refactor num_shared into usage.
  2022-06-01  8:02   ` Christian König
@ 2022-06-01  8:11     ` Bas Nieuwenhuizen
  2022-06-01  8:29       ` Christian König
  2022-06-01  8:41     ` Daniel Vetter
  1 sibling, 1 reply; 46+ messages in thread
From: Bas Nieuwenhuizen @ 2022-06-01  8:11 UTC (permalink / raw)
  To: Christian König; +Cc: ML dri-devel

On Wed, Jun 1, 2022 at 10:02 AM Christian König
<christian.koenig@amd.com> wrote:
>
> On 01.06.22 at 02:40, Bas Nieuwenhuizen wrote:
> > So that the driver can set some BOOKKEEP for explicit sync. Maybe
> > some of the existing places would already make sense for that, but
> > I targeted this for no functional changes.
>
> Well first of all NAK to that one since it will totally break cases
> which need to reserve more than one fence slot.

TTM already didn't do that? From ttm_execbuf_util.c :

> > -             num_fences = min(entry->num_shared, 1u);
> > +             num_fences = entry->usage <= DMA_RESV_USAGE_WRITE ? 0u : 1u;

>
> Also as discussed with Daniel we don't want to use BOOKKEEP for implicit
> sync. We should instead use READ for that.

That is the plan and what we do later in the series, use BOOKKEEP for
submissions that don't want to participate in implicit sync?
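
Roughly, the idea for the later patches in the series is something like
the sketch below (illustrative only; the per-context flag name here is
made up, not taken from the actual patches):

	/* at submission time, pick the fence usage per context */
	amdgpu_bo_list_for_each_entry(e, p->bo_list)
		e->tv.usage = ctx_wants_explicit_sync ?
			      DMA_RESV_USAGE_BOOKKEEP : DMA_RESV_USAGE_WRITE;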

This refactor sets everything to READ or WRITE based on the previous
num_shared value, to make sure this patch by itself is not a
functional change.
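
The mapping rule is simply "was a writer requested or not"; spelled out as
an illustration (this helper does not exist in the patch):

	static inline enum dma_resv_usage num_shared_to_usage(unsigned int num_shared)
	{
		/* num_shared == 0 used to mean "add the fence as a writer" */
		return num_shared ? DMA_RESV_USAGE_READ : DMA_RESV_USAGE_WRITE;
	}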

>
> BOOKKEEP is for stuff userspace should never be aware of, e.g. like page
> table updates and KFD eviction fences.
>
> Regards,
> Christian.
>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 3/5] drm/amdgpu: Allow explicit sync for VM ops.
  2022-06-01  8:03   ` Christian König
@ 2022-06-01  8:16     ` Bas Nieuwenhuizen
  2022-06-01  8:40       ` Christian König
  0 siblings, 1 reply; 46+ messages in thread
From: Bas Nieuwenhuizen @ 2022-06-01  8:16 UTC (permalink / raw)
  To: Christian König; +Cc: ML dri-devel

On Wed, Jun 1, 2022 at 10:03 AM Christian König
<christian.koenig@amd.com> wrote:
>
> On 01.06.22 at 02:40, Bas Nieuwenhuizen wrote:
> > This should be okay because moves themselves use KERNEL usage and
> > hence still sync with BOOKKEEP usage. Then any later submits still
> > wait on any pending VM operations.
> >
> > (i.e. we only made VM ops not wait on BOOKKEEP submits, not the other
> >   way around)
>
> Well NAK again. This allows access to freed up memory and is a complete
> no-go.

How does this allow access to freed memory? The worst I can see is
that the unmap happens earlier if the app/driver gets the waits wrong,
which wouldn't give access after the underlying BO is freed?
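
To restate the ordering argument from the commit message: moves and
evictions add their fence with DMA_RESV_USAGE_KERNEL, and a reservation
wait at any usage level always includes the KERNEL fences, so a VM update
can never run ahead of a pending move; only BOOKKEEP fences from other
userspace submissions are skipped. A minimal illustration of that
filtering (illustrative only, not the actual amdgpu sync code; resv is
the reservation object being waited on):

	struct dma_resv_iter cursor;
	struct dma_fence *fence;

	/* KERNEL is the lowest usage level, so these fences are part of
	 * every wait; BOOKKEEP fences are only returned when explicitly
	 * asked for with DMA_RESV_USAGE_BOOKKEEP. */
	dma_resv_for_each_fence(&cursor, resv, DMA_RESV_USAGE_KERNEL, fence)
		dma_fence_wait(fence, false);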

>
> Regards,
> Christian.
>
> >
> > Signed-off-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c  | 2 +-
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 2 +-
> >   2 files changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
> > index f10332e1c6c0..31bc73fd1fae 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
> > @@ -51,7 +51,7 @@ static int amdgpu_vm_cpu_prepare(struct amdgpu_vm_update_params *p,
> >       if (!resv)
> >               return 0;
> >
> > -     return amdgpu_bo_sync_wait_resv(p->adev, resv, sync_mode, sync_mode, p->vm, true);
> > +     return amdgpu_bo_sync_wait_resv(p->adev, resv, sync_mode, AMDGPU_SYNC_EXPLICIT, p->vm, true);
> >   }
> >
> >   /**
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> > index 63b484dc76c5..c8d5898bea11 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> > @@ -75,7 +75,7 @@ static int amdgpu_vm_sdma_prepare(struct amdgpu_vm_update_params *p,
> >       if (!resv)
> >               return 0;
> >
> > -     return amdgpu_sync_resv(p->adev, &p->job->sync, resv, sync_mode, sync_mode, p->vm);
> > +     return amdgpu_sync_resv(p->adev, &p->job->sync, resv, sync_mode, AMDGPU_SYNC_EXPLICIT, p->vm);
> >   }
> >
> >   /**
>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 1/5] drm/ttm: Refactor num_shared into usage.
  2022-06-01  8:11     ` Bas Nieuwenhuizen
@ 2022-06-01  8:29       ` Christian König
  2022-06-01  8:39         ` Bas Nieuwenhuizen
  0 siblings, 1 reply; 46+ messages in thread
From: Christian König @ 2022-06-01  8:29 UTC (permalink / raw)
  To: Bas Nieuwenhuizen; +Cc: ML dri-devel

On 01.06.22 at 10:11, Bas Nieuwenhuizen wrote:
> On Wed, Jun 1, 2022 at 10:02 AM Christian König
> <christian.koenig@amd.com> wrote:
>> On 01.06.22 at 02:40, Bas Nieuwenhuizen wrote:
>>> So that the driver can set some BOOKKEEP for explicit sync. Maybe
>>> some of the existing places would already make sense for that, but
>>> I targeted this for no functional changes.
>> Well first of all NAK to that one since it will totally break cases
>> which need to reserve more than one fence slot.
> TTM already didn't do that? From ttm_execbuf_util.c :
>
>>> -             num_fences = min(entry->num_shared, 1u);
>>> +             num_fences = entry->usage <= DMA_RESV_USAGE_WRITE ? 0u : 1u;

That's doing a min(entry->num_shared, 1u). In other words, even when the
driver requested to reserve no fence we still reserve at least one.

But if the driver requested to reserve more than one then we do reserve 
more than one. That's rather important because both radeon and amdgpu 
need that for their VM updates.

This patch here completely breaks that.

There is already a drm_exec patch set from me on the dri-devel mailing
list which untangles all of this and deprecates the whole
ttm_execbuf_util handling.

Regards,
Christian.

>> Also as discussed with Daniel we don't want to use BOOKKEEP for implicit
>> sync. We should instead use READ for that.
> That is the plan and what we do later in the series, use BOOKKEEP for
> submissions that don't want to participate in implicit sync?
>
> This refactor sets everything to READ or WRITE based on the previous
> num_shared value, to make sure this patch by itself is not a
> functional change.
>
>> BOOKKEEP is for stuff userspace should never be aware of, e.g. like page
>> table updates and KFD eviction fences.
>>
>> Regards,
>> Christian.
>>


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 1/5] drm/ttm: Refactor num_shared into usage.
  2022-06-01  8:29       ` Christian König
@ 2022-06-01  8:39         ` Bas Nieuwenhuizen
  2022-06-01  8:42           ` Christian König
  0 siblings, 1 reply; 46+ messages in thread
From: Bas Nieuwenhuizen @ 2022-06-01  8:39 UTC (permalink / raw)
  To: Christian König; +Cc: ML dri-devel

On Wed, Jun 1, 2022 at 10:29 AM Christian König
<christian.koenig@amd.com> wrote:
>
> On 01.06.22 at 10:11, Bas Nieuwenhuizen wrote:
> > On Wed, Jun 1, 2022 at 10:02 AM Christian König
> > <christian.koenig@amd.com> wrote:
> >> On 01.06.22 at 02:40, Bas Nieuwenhuizen wrote:
> >>> So that the driver can set some BOOKKEEP for explicit sync. Maybe
> >>> some of the existing places would already make sense for that, but
> >>> I targeted this for no functional changes.
> >> Well first of all NAK to that one since it will totally break cases
> >> which need to reserve more than one fence slot.
> > TTM already didn't do that? From ttm_execbuf_util.c :
> >
> >>> -             num_fences = min(entry->num_shared, 1u);
> >>> +             num_fences = entry->usage <= DMA_RESV_USAGE_WRITE ? 0u : 1u;
>
> That's doing a min(entry->num_shared, 1u). In other words, even when the
> driver requested to reserve no fence we still reserve at least one.

That would be the case if it were a max, not a min. However, since it
is a min, it only ever resulted in 0 or 1, which is the behavior we
mimic based on DMA_RESV_USAGE_*.

Nowhere else do we actually use the specific number assigned to num_shared.
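
Spelled out, for any value the drivers actually set (illustrative check
only, not code from the patch):

	static bool reserve_count_unchanged(unsigned int num_shared)
	{
		enum dma_resv_usage usage = num_shared ? DMA_RESV_USAGE_READ
						       : DMA_RESV_USAGE_WRITE;
		unsigned int before = min(num_shared, 1u);		/* old TTM code */
		unsigned int after = usage <= DMA_RESV_USAGE_WRITE ? 0u : 1u;	/* new */

		return before == after;	/* holds for 0, 1, 2, 3, 4, ... */
	}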

>
> But if the driver requested to reserve more than one then we do reserve
> more than one. That's rather important because both radeon and amdgpu
> need that for their VM updates.
>
> This patch here completely breaks that.
>
> There is already a drm_exec patch set from me on the dri-devel mailing
> list which untangles all of this and deprecates the whole
> ttm_execbuf_util handling.

I can take a look at your patch, but I believe that in the pre-patch
state this is a correct non-functional change.

>
> Regards,
> Christian.
>
> >> Also as discussed with Daniel we don't want to use BOOKKEEP for implicit
> >> sync. We should instead use READ for that.
> > That is the plan and what we do later in the series, use BOOKKEEP for
> > submissions that don't want to participate in implicit sync?
> >
> > This refactor sets everything to READ or WRITE based on the previous
> > num_shared value, to make sure this patch by itself is not a
> > functional change.
> >
> >> BOOKKEEP is for stuff userspace should never be aware of, e.g. like page
> >> table updates and KFD eviction fences.
> >>
> >> Regards,
> >> Christian.
> >>
> >>> Signed-off-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
> >>> ---
> >>>    drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  | 10 +++++-----
> >>>    drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c            |  8 +++-----
> >>>    drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c           |  2 +-
> >>>    drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c           |  6 +++---
> >>>    drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c          |  2 +-
> >>>    drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c            |  3 +--
> >>>    drivers/gpu/drm/amd/amdkfd/kfd_svm.c              |  2 +-
> >>>    drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c |  2 +-
> >>>    drivers/gpu/drm/qxl/qxl_release.c                 |  2 +-
> >>>    drivers/gpu/drm/radeon/radeon_cs.c                |  5 +++--
> >>>    drivers/gpu/drm/radeon/radeon_gem.c               |  2 +-
> >>>    drivers/gpu/drm/radeon/radeon_vm.c                |  4 ++--
> >>>    drivers/gpu/drm/ttm/ttm_execbuf_util.c            |  5 ++---
> >>>    drivers/gpu/drm/vmwgfx/vmwgfx_resource.c          | 10 +++++-----
> >>>    drivers/gpu/drm/vmwgfx/vmwgfx_validation.c        |  2 +-
> >>>    include/drm/ttm/ttm_execbuf_util.h                |  3 ++-
> >>>    16 files changed, 33 insertions(+), 35 deletions(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> >>> index a4955ef76cfc..a790a089e829 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> >>> @@ -774,7 +774,7 @@ static void add_kgd_mem_to_kfd_bo_list(struct kgd_mem *mem,
> >>>        struct amdgpu_bo *bo = mem->bo;
> >>>
> >>>        INIT_LIST_HEAD(&entry->head);
> >>> -     entry->num_shared = 1;
> >>> +     entry->usage = DMA_RESV_USAGE_READ;
> >>>        entry->bo = &bo->tbo;
> >>>        mutex_lock(&process_info->lock);
> >>>        if (userptr)
> >>> @@ -918,7 +918,7 @@ static int reserve_bo_and_vm(struct kgd_mem *mem,
> >>>
> >>>        ctx->kfd_bo.priority = 0;
> >>>        ctx->kfd_bo.tv.bo = &bo->tbo;
> >>> -     ctx->kfd_bo.tv.num_shared = 1;
> >>> +     ctx->kfd_bo.tv.usage = DMA_RESV_USAGE_READ;
> >>>        list_add(&ctx->kfd_bo.tv.head, &ctx->list);
> >>>
> >>>        amdgpu_vm_get_pd_bo(vm, &ctx->list, &ctx->vm_pd[0]);
> >>> @@ -981,7 +981,7 @@ static int reserve_bo_and_cond_vms(struct kgd_mem *mem,
> >>>
> >>>        ctx->kfd_bo.priority = 0;
> >>>        ctx->kfd_bo.tv.bo = &bo->tbo;
> >>> -     ctx->kfd_bo.tv.num_shared = 1;
> >>> +     ctx->kfd_bo.tv.usage = DMA_RESV_USAGE_READ;
> >>>        list_add(&ctx->kfd_bo.tv.head, &ctx->list);
> >>>
> >>>        i = 0;
> >>> @@ -2218,7 +2218,7 @@ static int validate_invalid_user_pages(struct amdkfd_process_info *process_info)
> >>>                            validate_list.head) {
> >>>                list_add_tail(&mem->resv_list.head, &resv_list);
> >>>                mem->resv_list.bo = mem->validate_list.bo;
> >>> -             mem->resv_list.num_shared = mem->validate_list.num_shared;
> >>> +             mem->resv_list.usage = mem->validate_list.usage;
> >>>        }
> >>>
> >>>        /* Reserve all BOs and page tables for validation */
> >>> @@ -2417,7 +2417,7 @@ int amdgpu_amdkfd_gpuvm_restore_process_bos(void *info, struct dma_fence **ef)
> >>>
> >>>                list_add_tail(&mem->resv_list.head, &ctx.list);
> >>>                mem->resv_list.bo = mem->validate_list.bo;
> >>> -             mem->resv_list.num_shared = mem->validate_list.num_shared;
> >>> +             mem->resv_list.usage = mem->validate_list.usage;
> >>>        }
> >>>
> >>>        ret = ttm_eu_reserve_buffers(&ctx.ticket, &ctx.list,
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> >>> index 60ca14afb879..2ae1c0d9d33a 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> >>> @@ -55,8 +55,7 @@ static int amdgpu_cs_user_fence_chunk(struct amdgpu_cs_parser *p,
> >>>        bo = amdgpu_bo_ref(gem_to_amdgpu_bo(gobj));
> >>>        p->uf_entry.priority = 0;
> >>>        p->uf_entry.tv.bo = &bo->tbo;
> >>> -     /* One for TTM and two for the CS job */
> >>> -     p->uf_entry.tv.num_shared = 3;
> >>> +     p->uf_entry.tv.usage = DMA_RESV_USAGE_READ;
> >>>
> >>>        drm_gem_object_put(gobj);
> >>>
> >>> @@ -519,9 +518,8 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p,
> >>>                        return r;
> >>>        }
> >>>
> >>> -     /* One for TTM and one for the CS job */
> >>>        amdgpu_bo_list_for_each_entry(e, p->bo_list)
> >>> -             e->tv.num_shared = 2;
> >>> +             e->tv.usage = DMA_RESV_USAGE_READ;
> >>>
> >>>        amdgpu_bo_list_get_list(p->bo_list, &p->validated);
> >>>
> >>> @@ -1261,7 +1259,7 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p,
> >>>
> >>>        /* Make sure all BOs are remembered as writers */
> >>>        amdgpu_bo_list_for_each_entry(e, p->bo_list)
> >>> -             e->tv.num_shared = 0;
> >>> +             e->tv.usage = DMA_RESV_USAGE_WRITE;
> >>>
> >>>        ttm_eu_fence_buffer_objects(&p->ticket, &p->validated, p->fence);
> >>>        mutex_unlock(&p->adev->notifier_lock);
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c
> >>> index c6d4d41c4393..71277257d94d 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c
> >>> @@ -74,7 +74,7 @@ int amdgpu_map_static_csa(struct amdgpu_device *adev, struct amdgpu_vm *vm,
> >>>        INIT_LIST_HEAD(&list);
> >>>        INIT_LIST_HEAD(&csa_tv.head);
> >>>        csa_tv.bo = &bo->tbo;
> >>> -     csa_tv.num_shared = 1;
> >>> +     csa_tv.usage = DMA_RESV_USAGE_READ;
> >>>
> >>>        list_add(&csa_tv.head, &list);
> >>>        amdgpu_vm_get_pd_bo(vm, &list, &pd);
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
> >>> index 84a53758e18e..7483411229f4 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
> >>> @@ -207,7 +207,7 @@ static void amdgpu_gem_object_close(struct drm_gem_object *obj,
> >>>        INIT_LIST_HEAD(&duplicates);
> >>>
> >>>        tv.bo = &bo->tbo;
> >>> -     tv.num_shared = 2;
> >>> +     tv.usage = DMA_RESV_USAGE_READ;
> >>>        list_add(&tv.head, &list);
> >>>
> >>>        amdgpu_vm_get_pd_bo(vm, &list, &vm_pd);
> >>> @@ -731,9 +731,9 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
> >>>                abo = gem_to_amdgpu_bo(gobj);
> >>>                tv.bo = &abo->tbo;
> >>>                if (abo->flags & AMDGPU_GEM_CREATE_VM_ALWAYS_VALID)
> >>> -                     tv.num_shared = 1;
> >>> +                     tv.usage = DMA_RESV_USAGE_READ;
> >>>                else
> >>> -                     tv.num_shared = 0;
> >>> +                     tv.usage = DMA_RESV_USAGE_WRITE;
> >>>                list_add(&tv.head, &list);
> >>>        } else {
> >>>                gobj = NULL;
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
> >>> index 5224d9a39737..f670d8473993 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
> >>> @@ -319,7 +319,7 @@ static int amdgpu_vkms_prepare_fb(struct drm_plane *plane,
> >>>        INIT_LIST_HEAD(&list);
> >>>
> >>>        tv.bo = &rbo->tbo;
> >>> -     tv.num_shared = 1;
> >>> +     tv.usage = DMA_RESV_USAGE_READ;
> >>>        list_add(&tv.head, &list);
> >>>
> >>>        r = ttm_eu_reserve_buffers(&ticket, &list, false, NULL);
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> >>> index 15184153e2b9..515be19ab279 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> >>> @@ -633,8 +633,7 @@ void amdgpu_vm_get_pd_bo(struct amdgpu_vm *vm,
> >>>    {
> >>>        entry->priority = 0;
> >>>        entry->tv.bo = &vm->root.bo->tbo;
> >>> -     /* Two for VM updates, one for TTM and one for the CS job */
> >>> -     entry->tv.num_shared = 4;
> >>> +     entry->tv.usage = DMA_RESV_USAGE_READ;
> >>>        entry->user_pages = NULL;
> >>>        list_add(&entry->tv.head, validated);
> >>>    }
> >>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> >>> index b3fc3e958227..af844b636778 100644
> >>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> >>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> >>> @@ -1395,7 +1395,7 @@ static int svm_range_reserve_bos(struct svm_validate_context *ctx)
> >>>                vm = drm_priv_to_vm(pdd->drm_priv);
> >>>
> >>>                ctx->tv[gpuidx].bo = &vm->root.bo->tbo;
> >>> -             ctx->tv[gpuidx].num_shared = 4;
> >>> +             ctx->tv[gpuidx].usage = DMA_RESV_USAGE_READ;
> >>>                list_add(&ctx->tv[gpuidx].head, &ctx->validate_list);
> >>>        }
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> >>> index 73423b805b54..851b7844b084 100644
> >>> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> >>> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> >>> @@ -7601,7 +7601,7 @@ static int dm_plane_helper_prepare_fb(struct drm_plane *plane,
> >>>        INIT_LIST_HEAD(&list);
> >>>
> >>>        tv.bo = &rbo->tbo;
> >>> -     tv.num_shared = 1;
> >>> +     tv.usage = DMA_RESV_USAGE_READ;
> >>>        list_add(&tv.head, &list);
> >>>
> >>>        r = ttm_eu_reserve_buffers(&ticket, &list, false, NULL);
> >>> diff --git a/drivers/gpu/drm/qxl/qxl_release.c b/drivers/gpu/drm/qxl/qxl_release.c
> >>> index 368d26da0d6a..689e35192070 100644
> >>> --- a/drivers/gpu/drm/qxl/qxl_release.c
> >>> +++ b/drivers/gpu/drm/qxl/qxl_release.c
> >>> @@ -183,7 +183,7 @@ int qxl_release_list_add(struct qxl_release *release, struct qxl_bo *bo)
> >>>
> >>>        qxl_bo_ref(bo);
> >>>        entry->tv.bo = &bo->tbo;
> >>> -     entry->tv.num_shared = 0;
> >>> +     entry->tv.usage = DMA_RESV_USAGE_WRITE;
> >>>        list_add_tail(&entry->tv.head, &release->bos);
> >>>        return 0;
> >>>    }
> >>> diff --git a/drivers/gpu/drm/radeon/radeon_cs.c b/drivers/gpu/drm/radeon/radeon_cs.c
> >>> index 446f7bae54c4..30afe0c62dd9 100644
> >>> --- a/drivers/gpu/drm/radeon/radeon_cs.c
> >>> +++ b/drivers/gpu/drm/radeon/radeon_cs.c
> >>> @@ -183,7 +183,8 @@ static int radeon_cs_parser_relocs(struct radeon_cs_parser *p)
> >>>                }
> >>>
> >>>                p->relocs[i].tv.bo = &p->relocs[i].robj->tbo;
> >>> -             p->relocs[i].tv.num_shared = !r->write_domain;
> >>> +             p->relocs[i].tv.usage =
> >>> +                     r->write_domain ? DMA_RESV_USAGE_WRITE : DMA_RESV_USAGE_READ;
> >>>
> >>>                radeon_cs_buckets_add(&buckets, &p->relocs[i].tv.head,
> >>>                                      priority);
> >>> @@ -258,7 +259,7 @@ static int radeon_cs_sync_rings(struct radeon_cs_parser *p)
> >>>
> >>>                resv = reloc->robj->tbo.base.resv;
> >>>                r = radeon_sync_resv(p->rdev, &p->ib.sync, resv,
> >>> -                                  reloc->tv.num_shared);
> >>> +                                  reloc->tv.usage != DMA_RESV_USAGE_WRITE);
> >>>                if (r)
> >>>                        return r;
> >>>        }
> >>> diff --git a/drivers/gpu/drm/radeon/radeon_gem.c b/drivers/gpu/drm/radeon/radeon_gem.c
> >>> index 8c01a7f0e027..eae47c709f5d 100644
> >>> --- a/drivers/gpu/drm/radeon/radeon_gem.c
> >>> +++ b/drivers/gpu/drm/radeon/radeon_gem.c
> >>> @@ -635,7 +635,7 @@ static void radeon_gem_va_update_vm(struct radeon_device *rdev,
> >>>        INIT_LIST_HEAD(&list);
> >>>
> >>>        tv.bo = &bo_va->bo->tbo;
> >>> -     tv.num_shared = 1;
> >>> +     tv.usage = DMA_RESV_USAGE_READ;
> >>>        list_add(&tv.head, &list);
> >>>
> >>>        vm_bos = radeon_vm_get_bos(rdev, bo_va->vm, &list);
> >>> diff --git a/drivers/gpu/drm/radeon/radeon_vm.c b/drivers/gpu/drm/radeon/radeon_vm.c
> >>> index 987cabbf1318..702627b48dae 100644
> >>> --- a/drivers/gpu/drm/radeon/radeon_vm.c
> >>> +++ b/drivers/gpu/drm/radeon/radeon_vm.c
> >>> @@ -143,7 +143,7 @@ struct radeon_bo_list *radeon_vm_get_bos(struct radeon_device *rdev,
> >>>        list[0].preferred_domains = RADEON_GEM_DOMAIN_VRAM;
> >>>        list[0].allowed_domains = RADEON_GEM_DOMAIN_VRAM;
> >>>        list[0].tv.bo = &vm->page_directory->tbo;
> >>> -     list[0].tv.num_shared = 1;
> >>> +     list[0].tv.usage = DMA_RESV_USAGE_READ;
> >>>        list[0].tiling_flags = 0;
> >>>        list_add(&list[0].tv.head, head);
> >>>
> >>> @@ -155,7 +155,7 @@ struct radeon_bo_list *radeon_vm_get_bos(struct radeon_device *rdev,
> >>>                list[idx].preferred_domains = RADEON_GEM_DOMAIN_VRAM;
> >>>                list[idx].allowed_domains = RADEON_GEM_DOMAIN_VRAM;
> >>>                list[idx].tv.bo = &list[idx].robj->tbo;
> >>> -             list[idx].tv.num_shared = 1;
> >>> +             list[idx].tv.usage = DMA_RESV_USAGE_READ;
> >>>                list[idx].tiling_flags = 0;
> >>>                list_add(&list[idx++].tv.head, head);
> >>>        }
> >>> diff --git a/drivers/gpu/drm/ttm/ttm_execbuf_util.c b/drivers/gpu/drm/ttm/ttm_execbuf_util.c
> >>> index 0eb995d25df1..c39d8e5ac271 100644
> >>> --- a/drivers/gpu/drm/ttm/ttm_execbuf_util.c
> >>> +++ b/drivers/gpu/drm/ttm/ttm_execbuf_util.c
> >>> @@ -101,7 +101,7 @@ int ttm_eu_reserve_buffers(struct ww_acquire_ctx *ticket,
> >>>                        continue;
> >>>                }
> >>>
> >>> -             num_fences = min(entry->num_shared, 1u);
> >>> +             num_fences = entry->usage <= DMA_RESV_USAGE_WRITE ? 0u : 1u;
> >>>                if (!ret) {
> >>>                        ret = dma_resv_reserve_fences(bo->base.resv,
> >>>                                                      num_fences);
> >>> @@ -154,8 +154,7 @@ void ttm_eu_fence_buffer_objects(struct ww_acquire_ctx *ticket,
> >>>        list_for_each_entry(entry, list, head) {
> >>>                struct ttm_buffer_object *bo = entry->bo;
> >>>
> >>> -             dma_resv_add_fence(bo->base.resv, fence, entry->num_shared ?
> >>> -                                DMA_RESV_USAGE_READ : DMA_RESV_USAGE_WRITE);
> >>> +             dma_resv_add_fence(bo->base.resv, fence, entry->usage);
> >>>                ttm_bo_move_to_lru_tail_unlocked(bo);
> >>>                dma_resv_unlock(bo->base.resv);
> >>>        }
> >>> diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c b/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c
> >>> index c6d02c98a19a..58dfff7d6c76 100644
> >>> --- a/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c
> >>> +++ b/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c
> >>> @@ -130,7 +130,7 @@ static void vmw_resource_release(struct kref *kref)
> >>>                        struct ttm_validate_buffer val_buf;
> >>>
> >>>                        val_buf.bo = bo;
> >>> -                     val_buf.num_shared = 0;
> >>> +                     val_buf.usage = DMA_RESV_USAGE_WRITE;
> >>>                        res->func->unbind(res, false, &val_buf);
> >>>                }
> >>>                res->backup_dirty = false;
> >>> @@ -552,7 +552,7 @@ vmw_resource_check_buffer(struct ww_acquire_ctx *ticket,
> >>>        INIT_LIST_HEAD(&val_list);
> >>>        ttm_bo_get(&res->backup->base);
> >>>        val_buf->bo = &res->backup->base;
> >>> -     val_buf->num_shared = 0;
> >>> +     val_buf->usage = DMA_RESV_USAGE_WRITE;
> >>>        list_add_tail(&val_buf->head, &val_list);
> >>>        ret = ttm_eu_reserve_buffers(ticket, &val_list, interruptible, NULL);
> >>>        if (unlikely(ret != 0))
> >>> @@ -657,7 +657,7 @@ static int vmw_resource_do_evict(struct ww_acquire_ctx *ticket,
> >>>        BUG_ON(!func->may_evict);
> >>>
> >>>        val_buf.bo = NULL;
> >>> -     val_buf.num_shared = 0;
> >>> +     val_buf.usage = DMA_RESV_USAGE_WRITE;
> >>>        ret = vmw_resource_check_buffer(ticket, res, interruptible, &val_buf);
> >>>        if (unlikely(ret != 0))
> >>>                return ret;
> >>> @@ -708,7 +708,7 @@ int vmw_resource_validate(struct vmw_resource *res, bool intr,
> >>>                return 0;
> >>>
> >>>        val_buf.bo = NULL;
> >>> -     val_buf.num_shared = 0;
> >>> +     val_buf.usage = DMA_RESV_USAGE_WRITE;
> >>>        if (res->backup)
> >>>                val_buf.bo = &res->backup->base;
> >>>        do {
> >>> @@ -777,7 +777,7 @@ void vmw_resource_unbind_list(struct vmw_buffer_object *vbo)
> >>>    {
> >>>        struct ttm_validate_buffer val_buf = {
> >>>                .bo = &vbo->base,
> >>> -             .num_shared = 0
> >>> +             .usage = DMA_RESV_USAGE_WRITE
> >>>        };
> >>>
> >>>        dma_resv_assert_held(vbo->base.base.resv);
> >>> diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_validation.c b/drivers/gpu/drm/vmwgfx/vmwgfx_validation.c
> >>> index f46891012be3..0476ba498321 100644
> >>> --- a/drivers/gpu/drm/vmwgfx/vmwgfx_validation.c
> >>> +++ b/drivers/gpu/drm/vmwgfx/vmwgfx_validation.c
> >>> @@ -288,7 +288,7 @@ int vmw_validation_add_bo(struct vmw_validation_context *ctx,
> >>>                val_buf->bo = ttm_bo_get_unless_zero(&vbo->base);
> >>>                if (!val_buf->bo)
> >>>                        return -ESRCH;
> >>> -             val_buf->num_shared = 0;
> >>> +             val_buf->usage = DMA_RESV_USAGE_WRITE;
> >>>                list_add_tail(&val_buf->head, &ctx->bo_list);
> >>>                bo_node->as_mob = as_mob;
> >>>                bo_node->cpu_blit = cpu_blit;
> >>> diff --git a/include/drm/ttm/ttm_execbuf_util.h b/include/drm/ttm/ttm_execbuf_util.h
> >>> index a99d7fdf2964..851961a06c27 100644
> >>> --- a/include/drm/ttm/ttm_execbuf_util.h
> >>> +++ b/include/drm/ttm/ttm_execbuf_util.h
> >>> @@ -31,6 +31,7 @@
> >>>    #ifndef _TTM_EXECBUF_UTIL_H_
> >>>    #define _TTM_EXECBUF_UTIL_H_
> >>>
> >>> +#include <linux/dma-resv.h>
> >>>    #include <linux/list.h>
> >>>
> >>>    #include "ttm_bo_api.h"
> >>> @@ -46,7 +47,7 @@
> >>>    struct ttm_validate_buffer {
> >>>        struct list_head head;
> >>>        struct ttm_buffer_object *bo;
> >>> -     unsigned int num_shared;
> >>> +     enum dma_resv_usage usage;
> >>>    };
> >>>
> >>>    /**
>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 3/5] drm/amdgpu: Allow explicit sync for VM ops.
  2022-06-01  8:16     ` Bas Nieuwenhuizen
@ 2022-06-01  8:40       ` Christian König
  2022-06-01  8:48         ` Bas Nieuwenhuizen
  0 siblings, 1 reply; 46+ messages in thread
From: Christian König @ 2022-06-01  8:40 UTC (permalink / raw)
  To: Bas Nieuwenhuizen; +Cc: ML dri-devel

Am 01.06.22 um 10:16 schrieb Bas Nieuwenhuizen:
> On Wed, Jun 1, 2022 at 10:03 AM Christian König
> <christian.koenig@amd.com> wrote:
>> Am 01.06.22 um 02:40 schrieb Bas Nieuwenhuizen:
>>> This should be okay because moves themselves use KERNEL usage and
>>> hence still sync with BOOKKEEP usage. Then any later submits still
>>> wait on any pending VM operations.
>>>
>>> (i.e. we only made VM ops not wait on BOOKKEEP submits, not the other
>>>    way around)
>> Well NAK again. This allows access to freed up memory and is a complete
>> no-go.
> How does this allow access to freed memory? The worst I can see is that
> the unmap happens earlier if the app/driver gets the waits wrong,
> which wouldn't give access after the underlying BO is freed?

To free up memory we need to update the PTEs and then flush those out by 
invalidating the TLB.

On gfx6, gfx7 and gfx8, and on some broken gfx10 hw, invalidating the TLB
can only be done while the VMID is idle.

Only gfx9 can reliably invalidate the TLB while it is in use, and even
there it comes with quite a performance penalty (a TLB invalidation
can take multiple seconds).

Because of this what we do in the kernel driver is to sync to everything 
when we unmap entries:

         if (!(flags & AMDGPU_PTE_VALID))
                 sync_mode = AMDGPU_SYNC_EQ_OWNER;
         else
                 sync_mode = AMDGPU_SYNC_EXPLICIT;

This acts as a barrier for freeing the memory. In other words we 
intentionally add a bubble which syncs everything.
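
For illustration, a quick stand-alone sketch (plain user-space C with
made-up names, not the kernel code) of the dma_resv usage ordering
involved here: KERNEL < WRITE < READ < BOOKKEEP, where a sync request at
a given class also covers every more important (numerically smaller)
class. That is why a submission syncing at BOOKKEEP still waits on
KERNEL moves, while something that only syncs at KERNEL does not wait
on BOOKKEEP work.

        #include <stdbool.h>
        #include <stdio.h>

        /* Mirrors the ordering of enum dma_resv_usage in <linux/dma-resv.h>. */
        enum usage { USAGE_KERNEL, USAGE_WRITE, USAGE_READ, USAGE_BOOKKEEP };

        /*
         * Syncing at wait_class waits on every fence whose class is at
         * least as important, i.e. numerically smaller or equal.
         */
        static bool must_wait(enum usage fence_class, enum usage wait_class)
        {
                return fence_class <= wait_class;
        }

        int main(void)
        {
                /* A userspace submit syncing at BOOKKEEP still waits on KERNEL moves. */
                printf("%d\n", must_wait(USAGE_KERNEL, USAGE_BOOKKEEP));  /* 1 */
                /* A VM update syncing only at KERNEL ignores BOOKKEEP submissions. */
                printf("%d\n", must_wait(USAGE_BOOKKEEP, USAGE_KERNEL));  /* 0 */
                return 0;
        }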

I've been working for months on a concept for how to do all this without
causing the stalls you observed, but so far I haven't come to much of a
conclusion.

Regards,
Christian.

>
>> Regards,
>> Christian.
>>
>>> Signed-off-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
>>> ---
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c  | 2 +-
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 2 +-
>>>    2 files changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
>>> index f10332e1c6c0..31bc73fd1fae 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
>>> @@ -51,7 +51,7 @@ static int amdgpu_vm_cpu_prepare(struct amdgpu_vm_update_params *p,
>>>        if (!resv)
>>>                return 0;
>>>
>>> -     return amdgpu_bo_sync_wait_resv(p->adev, resv, sync_mode, sync_mode, p->vm, true);
>>> +     return amdgpu_bo_sync_wait_resv(p->adev, resv, sync_mode, AMDGPU_SYNC_EXPLICIT, p->vm, true);
>>>    }
>>>
>>>    /**
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
>>> index 63b484dc76c5..c8d5898bea11 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
>>> @@ -75,7 +75,7 @@ static int amdgpu_vm_sdma_prepare(struct amdgpu_vm_update_params *p,
>>>        if (!resv)
>>>                return 0;
>>>
>>> -     return amdgpu_sync_resv(p->adev, &p->job->sync, resv, sync_mode, sync_mode, p->vm);
>>> +     return amdgpu_sync_resv(p->adev, &p->job->sync, resv, sync_mode, AMDGPU_SYNC_EXPLICIT, p->vm);
>>>    }
>>>
>>>    /**


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 1/5] drm/ttm: Refactor num_shared into usage.
  2022-06-01  8:02   ` Christian König
  2022-06-01  8:11     ` Bas Nieuwenhuizen
@ 2022-06-01  8:41     ` Daniel Vetter
  2022-06-01  8:47       ` Christian König
  1 sibling, 1 reply; 46+ messages in thread
From: Daniel Vetter @ 2022-06-01  8:41 UTC (permalink / raw)
  To: Christian König; +Cc: dri-devel

On Wed, 1 Jun 2022 at 10:02, Christian König <christian.koenig@amd.com> wrote:
> Am 01.06.22 um 02:40 schrieb Bas Nieuwenhuizen:
> > So that the driver can set some BOOKKEEP for explicit sync. Maybe
> > some of the existing places would already make sense for that, but
> > I targeted this for no functional changes.
>
> Well first of all NAK to that one since it will totally break cases
> which need to reserve more than one fence slot.

Quick reminder, we talked about this in the past. For many folks (not
you) NAK means "fuck off" and not "this won't work for the reasons I
just explained". Looks like the conversation is on a good track in
the further replies; just figured I'd drop this again as a reminder
:-)

Maybe add an autocomplete in your mail editor which replaces NAK with
NAK (note: this means "fuck off" for many folks) so you can decide
whether that's really the message you want to send out to start the
morning. And in some rare cases I do agree that just dropping a polite
"fuck off" is the right thing to make it clear what's up ...

Cheers, Daniel

>
> Also as discussed with Daniel we don't want to use BOOKKEEP for implicit
> sync. We should instead use READ for that.
>
> BOOKKEEP is for stuff userspace should never be aware of, e.g. like page
> table updates and KFD eviction fences.
>
> Regards,
> Christian.
>
> >
> > Signed-off-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  | 10 +++++-----
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c            |  8 +++-----
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c           |  2 +-
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c           |  6 +++---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c          |  2 +-
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c            |  3 +--
> >   drivers/gpu/drm/amd/amdkfd/kfd_svm.c              |  2 +-
> >   drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c |  2 +-
> >   drivers/gpu/drm/qxl/qxl_release.c                 |  2 +-
> >   drivers/gpu/drm/radeon/radeon_cs.c                |  5 +++--
> >   drivers/gpu/drm/radeon/radeon_gem.c               |  2 +-
> >   drivers/gpu/drm/radeon/radeon_vm.c                |  4 ++--
> >   drivers/gpu/drm/ttm/ttm_execbuf_util.c            |  5 ++---
> >   drivers/gpu/drm/vmwgfx/vmwgfx_resource.c          | 10 +++++-----
> >   drivers/gpu/drm/vmwgfx/vmwgfx_validation.c        |  2 +-
> >   include/drm/ttm/ttm_execbuf_util.h                |  3 ++-
> >   16 files changed, 33 insertions(+), 35 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> > index a4955ef76cfc..a790a089e829 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> > @@ -774,7 +774,7 @@ static void add_kgd_mem_to_kfd_bo_list(struct kgd_mem *mem,
> >       struct amdgpu_bo *bo = mem->bo;
> >
> >       INIT_LIST_HEAD(&entry->head);
> > -     entry->num_shared = 1;
> > +     entry->usage = DMA_RESV_USAGE_READ;
> >       entry->bo = &bo->tbo;
> >       mutex_lock(&process_info->lock);
> >       if (userptr)
> > @@ -918,7 +918,7 @@ static int reserve_bo_and_vm(struct kgd_mem *mem,
> >
> >       ctx->kfd_bo.priority = 0;
> >       ctx->kfd_bo.tv.bo = &bo->tbo;
> > -     ctx->kfd_bo.tv.num_shared = 1;
> > +     ctx->kfd_bo.tv.usage = DMA_RESV_USAGE_READ;
> >       list_add(&ctx->kfd_bo.tv.head, &ctx->list);
> >
> >       amdgpu_vm_get_pd_bo(vm, &ctx->list, &ctx->vm_pd[0]);
> > @@ -981,7 +981,7 @@ static int reserve_bo_and_cond_vms(struct kgd_mem *mem,
> >
> >       ctx->kfd_bo.priority = 0;
> >       ctx->kfd_bo.tv.bo = &bo->tbo;
> > -     ctx->kfd_bo.tv.num_shared = 1;
> > +     ctx->kfd_bo.tv.usage = DMA_RESV_USAGE_READ;
> >       list_add(&ctx->kfd_bo.tv.head, &ctx->list);
> >
> >       i = 0;
> > @@ -2218,7 +2218,7 @@ static int validate_invalid_user_pages(struct amdkfd_process_info *process_info)
> >                           validate_list.head) {
> >               list_add_tail(&mem->resv_list.head, &resv_list);
> >               mem->resv_list.bo = mem->validate_list.bo;
> > -             mem->resv_list.num_shared = mem->validate_list.num_shared;
> > +             mem->resv_list.usage = mem->validate_list.usage;
> >       }
> >
> >       /* Reserve all BOs and page tables for validation */
> > @@ -2417,7 +2417,7 @@ int amdgpu_amdkfd_gpuvm_restore_process_bos(void *info, struct dma_fence **ef)
> >
> >               list_add_tail(&mem->resv_list.head, &ctx.list);
> >               mem->resv_list.bo = mem->validate_list.bo;
> > -             mem->resv_list.num_shared = mem->validate_list.num_shared;
> > +             mem->resv_list.usage = mem->validate_list.usage;
> >       }
> >
> >       ret = ttm_eu_reserve_buffers(&ctx.ticket, &ctx.list,
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> > index 60ca14afb879..2ae1c0d9d33a 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> > @@ -55,8 +55,7 @@ static int amdgpu_cs_user_fence_chunk(struct amdgpu_cs_parser *p,
> >       bo = amdgpu_bo_ref(gem_to_amdgpu_bo(gobj));
> >       p->uf_entry.priority = 0;
> >       p->uf_entry.tv.bo = &bo->tbo;
> > -     /* One for TTM and two for the CS job */
> > -     p->uf_entry.tv.num_shared = 3;
> > +     p->uf_entry.tv.usage = DMA_RESV_USAGE_READ;
> >
> >       drm_gem_object_put(gobj);
> >
> > @@ -519,9 +518,8 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p,
> >                       return r;
> >       }
> >
> > -     /* One for TTM and one for the CS job */
> >       amdgpu_bo_list_for_each_entry(e, p->bo_list)
> > -             e->tv.num_shared = 2;
> > +             e->tv.usage = DMA_RESV_USAGE_READ;
> >
> >       amdgpu_bo_list_get_list(p->bo_list, &p->validated);
> >
> > @@ -1261,7 +1259,7 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p,
> >
> >       /* Make sure all BOs are remembered as writers */
> >       amdgpu_bo_list_for_each_entry(e, p->bo_list)
> > -             e->tv.num_shared = 0;
> > +             e->tv.usage = DMA_RESV_USAGE_WRITE;
> >
> >       ttm_eu_fence_buffer_objects(&p->ticket, &p->validated, p->fence);
> >       mutex_unlock(&p->adev->notifier_lock);
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c
> > index c6d4d41c4393..71277257d94d 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c
> > @@ -74,7 +74,7 @@ int amdgpu_map_static_csa(struct amdgpu_device *adev, struct amdgpu_vm *vm,
> >       INIT_LIST_HEAD(&list);
> >       INIT_LIST_HEAD(&csa_tv.head);
> >       csa_tv.bo = &bo->tbo;
> > -     csa_tv.num_shared = 1;
> > +     csa_tv.usage = DMA_RESV_USAGE_READ;
> >
> >       list_add(&csa_tv.head, &list);
> >       amdgpu_vm_get_pd_bo(vm, &list, &pd);
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
> > index 84a53758e18e..7483411229f4 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
> > @@ -207,7 +207,7 @@ static void amdgpu_gem_object_close(struct drm_gem_object *obj,
> >       INIT_LIST_HEAD(&duplicates);
> >
> >       tv.bo = &bo->tbo;
> > -     tv.num_shared = 2;
> > +     tv.usage = DMA_RESV_USAGE_READ;
> >       list_add(&tv.head, &list);
> >
> >       amdgpu_vm_get_pd_bo(vm, &list, &vm_pd);
> > @@ -731,9 +731,9 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
> >               abo = gem_to_amdgpu_bo(gobj);
> >               tv.bo = &abo->tbo;
> >               if (abo->flags & AMDGPU_GEM_CREATE_VM_ALWAYS_VALID)
> > -                     tv.num_shared = 1;
> > +                     tv.usage = DMA_RESV_USAGE_READ;
> >               else
> > -                     tv.num_shared = 0;
> > +                     tv.usage = DMA_RESV_USAGE_WRITE;
> >               list_add(&tv.head, &list);
> >       } else {
> >               gobj = NULL;
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
> > index 5224d9a39737..f670d8473993 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
> > @@ -319,7 +319,7 @@ static int amdgpu_vkms_prepare_fb(struct drm_plane *plane,
> >       INIT_LIST_HEAD(&list);
> >
> >       tv.bo = &rbo->tbo;
> > -     tv.num_shared = 1;
> > +     tv.usage = DMA_RESV_USAGE_READ;
> >       list_add(&tv.head, &list);
> >
> >       r = ttm_eu_reserve_buffers(&ticket, &list, false, NULL);
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > index 15184153e2b9..515be19ab279 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > @@ -633,8 +633,7 @@ void amdgpu_vm_get_pd_bo(struct amdgpu_vm *vm,
> >   {
> >       entry->priority = 0;
> >       entry->tv.bo = &vm->root.bo->tbo;
> > -     /* Two for VM updates, one for TTM and one for the CS job */
> > -     entry->tv.num_shared = 4;
> > +     entry->tv.usage = DMA_RESV_USAGE_READ;
> >       entry->user_pages = NULL;
> >       list_add(&entry->tv.head, validated);
> >   }
> > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> > index b3fc3e958227..af844b636778 100644
> > --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> > @@ -1395,7 +1395,7 @@ static int svm_range_reserve_bos(struct svm_validate_context *ctx)
> >               vm = drm_priv_to_vm(pdd->drm_priv);
> >
> >               ctx->tv[gpuidx].bo = &vm->root.bo->tbo;
> > -             ctx->tv[gpuidx].num_shared = 4;
> > +             ctx->tv[gpuidx].usage = DMA_RESV_USAGE_READ;
> >               list_add(&ctx->tv[gpuidx].head, &ctx->validate_list);
> >       }
> >
> > diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> > index 73423b805b54..851b7844b084 100644
> > --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> > +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> > @@ -7601,7 +7601,7 @@ static int dm_plane_helper_prepare_fb(struct drm_plane *plane,
> >       INIT_LIST_HEAD(&list);
> >
> >       tv.bo = &rbo->tbo;
> > -     tv.num_shared = 1;
> > +     tv.usage = DMA_RESV_USAGE_READ;
> >       list_add(&tv.head, &list);
> >
> >       r = ttm_eu_reserve_buffers(&ticket, &list, false, NULL);
> > diff --git a/drivers/gpu/drm/qxl/qxl_release.c b/drivers/gpu/drm/qxl/qxl_release.c
> > index 368d26da0d6a..689e35192070 100644
> > --- a/drivers/gpu/drm/qxl/qxl_release.c
> > +++ b/drivers/gpu/drm/qxl/qxl_release.c
> > @@ -183,7 +183,7 @@ int qxl_release_list_add(struct qxl_release *release, struct qxl_bo *bo)
> >
> >       qxl_bo_ref(bo);
> >       entry->tv.bo = &bo->tbo;
> > -     entry->tv.num_shared = 0;
> > +     entry->tv.usage = DMA_RESV_USAGE_WRITE;
> >       list_add_tail(&entry->tv.head, &release->bos);
> >       return 0;
> >   }
> > diff --git a/drivers/gpu/drm/radeon/radeon_cs.c b/drivers/gpu/drm/radeon/radeon_cs.c
> > index 446f7bae54c4..30afe0c62dd9 100644
> > --- a/drivers/gpu/drm/radeon/radeon_cs.c
> > +++ b/drivers/gpu/drm/radeon/radeon_cs.c
> > @@ -183,7 +183,8 @@ static int radeon_cs_parser_relocs(struct radeon_cs_parser *p)
> >               }
> >
> >               p->relocs[i].tv.bo = &p->relocs[i].robj->tbo;
> > -             p->relocs[i].tv.num_shared = !r->write_domain;
> > +             p->relocs[i].tv.usage =
> > +                     r->write_domain ? DMA_RESV_USAGE_WRITE : DMA_RESV_USAGE_READ;
> >
> >               radeon_cs_buckets_add(&buckets, &p->relocs[i].tv.head,
> >                                     priority);
> > @@ -258,7 +259,7 @@ static int radeon_cs_sync_rings(struct radeon_cs_parser *p)
> >
> >               resv = reloc->robj->tbo.base.resv;
> >               r = radeon_sync_resv(p->rdev, &p->ib.sync, resv,
> > -                                  reloc->tv.num_shared);
> > +                                  reloc->tv.usage != DMA_RESV_USAGE_WRITE);
> >               if (r)
> >                       return r;
> >       }
> > diff --git a/drivers/gpu/drm/radeon/radeon_gem.c b/drivers/gpu/drm/radeon/radeon_gem.c
> > index 8c01a7f0e027..eae47c709f5d 100644
> > --- a/drivers/gpu/drm/radeon/radeon_gem.c
> > +++ b/drivers/gpu/drm/radeon/radeon_gem.c
> > @@ -635,7 +635,7 @@ static void radeon_gem_va_update_vm(struct radeon_device *rdev,
> >       INIT_LIST_HEAD(&list);
> >
> >       tv.bo = &bo_va->bo->tbo;
> > -     tv.num_shared = 1;
> > +     tv.usage = DMA_RESV_USAGE_READ;
> >       list_add(&tv.head, &list);
> >
> >       vm_bos = radeon_vm_get_bos(rdev, bo_va->vm, &list);
> > diff --git a/drivers/gpu/drm/radeon/radeon_vm.c b/drivers/gpu/drm/radeon/radeon_vm.c
> > index 987cabbf1318..702627b48dae 100644
> > --- a/drivers/gpu/drm/radeon/radeon_vm.c
> > +++ b/drivers/gpu/drm/radeon/radeon_vm.c
> > @@ -143,7 +143,7 @@ struct radeon_bo_list *radeon_vm_get_bos(struct radeon_device *rdev,
> >       list[0].preferred_domains = RADEON_GEM_DOMAIN_VRAM;
> >       list[0].allowed_domains = RADEON_GEM_DOMAIN_VRAM;
> >       list[0].tv.bo = &vm->page_directory->tbo;
> > -     list[0].tv.num_shared = 1;
> > +     list[0].tv.usage = DMA_RESV_USAGE_READ;
> >       list[0].tiling_flags = 0;
> >       list_add(&list[0].tv.head, head);
> >
> > @@ -155,7 +155,7 @@ struct radeon_bo_list *radeon_vm_get_bos(struct radeon_device *rdev,
> >               list[idx].preferred_domains = RADEON_GEM_DOMAIN_VRAM;
> >               list[idx].allowed_domains = RADEON_GEM_DOMAIN_VRAM;
> >               list[idx].tv.bo = &list[idx].robj->tbo;
> > -             list[idx].tv.num_shared = 1;
> > +             list[idx].tv.usage = DMA_RESV_USAGE_READ;
> >               list[idx].tiling_flags = 0;
> >               list_add(&list[idx++].tv.head, head);
> >       }
> > diff --git a/drivers/gpu/drm/ttm/ttm_execbuf_util.c b/drivers/gpu/drm/ttm/ttm_execbuf_util.c
> > index 0eb995d25df1..c39d8e5ac271 100644
> > --- a/drivers/gpu/drm/ttm/ttm_execbuf_util.c
> > +++ b/drivers/gpu/drm/ttm/ttm_execbuf_util.c
> > @@ -101,7 +101,7 @@ int ttm_eu_reserve_buffers(struct ww_acquire_ctx *ticket,
> >                       continue;
> >               }
> >
> > -             num_fences = min(entry->num_shared, 1u);
> > +             num_fences = entry->usage <= DMA_RESV_USAGE_WRITE ? 0u : 1u;
> >               if (!ret) {
> >                       ret = dma_resv_reserve_fences(bo->base.resv,
> >                                                     num_fences);
> > @@ -154,8 +154,7 @@ void ttm_eu_fence_buffer_objects(struct ww_acquire_ctx *ticket,
> >       list_for_each_entry(entry, list, head) {
> >               struct ttm_buffer_object *bo = entry->bo;
> >
> > -             dma_resv_add_fence(bo->base.resv, fence, entry->num_shared ?
> > -                                DMA_RESV_USAGE_READ : DMA_RESV_USAGE_WRITE);
> > +             dma_resv_add_fence(bo->base.resv, fence, entry->usage);
> >               ttm_bo_move_to_lru_tail_unlocked(bo);
> >               dma_resv_unlock(bo->base.resv);
> >       }
> > diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c b/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c
> > index c6d02c98a19a..58dfff7d6c76 100644
> > --- a/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c
> > +++ b/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c
> > @@ -130,7 +130,7 @@ static void vmw_resource_release(struct kref *kref)
> >                       struct ttm_validate_buffer val_buf;
> >
> >                       val_buf.bo = bo;
> > -                     val_buf.num_shared = 0;
> > +                     val_buf.usage = DMA_RESV_USAGE_WRITE;
> >                       res->func->unbind(res, false, &val_buf);
> >               }
> >               res->backup_dirty = false;
> > @@ -552,7 +552,7 @@ vmw_resource_check_buffer(struct ww_acquire_ctx *ticket,
> >       INIT_LIST_HEAD(&val_list);
> >       ttm_bo_get(&res->backup->base);
> >       val_buf->bo = &res->backup->base;
> > -     val_buf->num_shared = 0;
> > +     val_buf->usage = DMA_RESV_USAGE_WRITE;
> >       list_add_tail(&val_buf->head, &val_list);
> >       ret = ttm_eu_reserve_buffers(ticket, &val_list, interruptible, NULL);
> >       if (unlikely(ret != 0))
> > @@ -657,7 +657,7 @@ static int vmw_resource_do_evict(struct ww_acquire_ctx *ticket,
> >       BUG_ON(!func->may_evict);
> >
> >       val_buf.bo = NULL;
> > -     val_buf.num_shared = 0;
> > +     val_buf.usage = DMA_RESV_USAGE_WRITE;
> >       ret = vmw_resource_check_buffer(ticket, res, interruptible, &val_buf);
> >       if (unlikely(ret != 0))
> >               return ret;
> > @@ -708,7 +708,7 @@ int vmw_resource_validate(struct vmw_resource *res, bool intr,
> >               return 0;
> >
> >       val_buf.bo = NULL;
> > -     val_buf.num_shared = 0;
> > +     val_buf.usage = DMA_RESV_USAGE_WRITE;
> >       if (res->backup)
> >               val_buf.bo = &res->backup->base;
> >       do {
> > @@ -777,7 +777,7 @@ void vmw_resource_unbind_list(struct vmw_buffer_object *vbo)
> >   {
> >       struct ttm_validate_buffer val_buf = {
> >               .bo = &vbo->base,
> > -             .num_shared = 0
> > +             .usage = DMA_RESV_USAGE_WRITE
> >       };
> >
> >       dma_resv_assert_held(vbo->base.base.resv);
> > diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_validation.c b/drivers/gpu/drm/vmwgfx/vmwgfx_validation.c
> > index f46891012be3..0476ba498321 100644
> > --- a/drivers/gpu/drm/vmwgfx/vmwgfx_validation.c
> > +++ b/drivers/gpu/drm/vmwgfx/vmwgfx_validation.c
> > @@ -288,7 +288,7 @@ int vmw_validation_add_bo(struct vmw_validation_context *ctx,
> >               val_buf->bo = ttm_bo_get_unless_zero(&vbo->base);
> >               if (!val_buf->bo)
> >                       return -ESRCH;
> > -             val_buf->num_shared = 0;
> > +             val_buf->usage = DMA_RESV_USAGE_WRITE;
> >               list_add_tail(&val_buf->head, &ctx->bo_list);
> >               bo_node->as_mob = as_mob;
> >               bo_node->cpu_blit = cpu_blit;
> > diff --git a/include/drm/ttm/ttm_execbuf_util.h b/include/drm/ttm/ttm_execbuf_util.h
> > index a99d7fdf2964..851961a06c27 100644
> > --- a/include/drm/ttm/ttm_execbuf_util.h
> > +++ b/include/drm/ttm/ttm_execbuf_util.h
> > @@ -31,6 +31,7 @@
> >   #ifndef _TTM_EXECBUF_UTIL_H_
> >   #define _TTM_EXECBUF_UTIL_H_
> >
> > +#include <linux/dma-resv.h>
> >   #include <linux/list.h>
> >
> >   #include "ttm_bo_api.h"
> > @@ -46,7 +47,7 @@
> >   struct ttm_validate_buffer {
> >       struct list_head head;
> >       struct ttm_buffer_object *bo;
> > -     unsigned int num_shared;
> > +     enum dma_resv_usage usage;
> >   };
> >
> >   /**
>


-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 1/5] drm/ttm: Refactor num_shared into usage.
  2022-06-01  8:39         ` Bas Nieuwenhuizen
@ 2022-06-01  8:42           ` Christian König
  0 siblings, 0 replies; 46+ messages in thread
From: Christian König @ 2022-06-01  8:42 UTC (permalink / raw)
  To: Bas Nieuwenhuizen; +Cc: ML dri-devel

Am 01.06.22 um 10:39 schrieb Bas Nieuwenhuizen:
> On Wed, Jun 1, 2022 at 10:29 AM Christian König
> <christian.koenig@amd.com> wrote:
>> Am 01.06.22 um 10:11 schrieb Bas Nieuwenhuizen:
>>> On Wed, Jun 1, 2022 at 10:02 AM Christian König
>>> <christian.koenig@amd.com> wrote:
>>>> Am 01.06.22 um 02:40 schrieb Bas Nieuwenhuizen:
>>>>> So that the driver can set some BOOKKEEP for explicit sync. Maybe
>>>>> some of the existing places would already make sense for that, but
>>>>> I targeted this for no functional changes.
>>>> Well first of all NAK to that one since it will totally break cases
>>>> which need to reserve more than one fence slot.
>>> TTM already didn't do that? From ttm_execbuf_util.c :
>>>
>>>>> -             num_fences = min(entry->num_shared, 1u);
>>>>> +             num_fences = entry->usage <= DMA_RESV_USAGE_WRITE ? 0u : 1u;
>> That's doing a min(entry->num_shared, 1u). In other words even when the
>> driver requested to reserve no fence we at least reserve at least one.
> That would be the case if it was a max, not a min. However, since it
> is a min, it only ever resulted in 0 or 1, behavior that we mimic
> based on DMA_RESV_USAGE_*.

Ah! You are working on a broken branch, that was fixed with:

commit d72dcbe9fce505228dae43bef9da8f2b707d1b3d
Author: Christian König <christian.koenig@amd.com>
Date:   Mon Apr 11 15:21:59 2022 +0200

     drm/ttm: fix logic inversion in ttm_eu_reserve_buffers

     That should have been max, not min.

Without that fix your branch can cause rare, hard-to-debug memory corruptions.
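
For illustration, a minimal stand-alone sketch (plain user-space C, not
the TTM code) of the difference between the broken min() and the fixed
max(), using a few of the num_shared values that show up in the quoted
patch:

        #include <stdio.h>

        static unsigned int min_u(unsigned int a, unsigned int b) { return a < b ? a : b; }
        static unsigned int max_u(unsigned int a, unsigned int b) { return a > b ? a : b; }

        int main(void)
        {
                unsigned int requested[] = { 0, 1, 4 };  /* num_shared values from the diff */
                unsigned int i;

                for (i = 0; i < 3; i++)
                        printf("num_shared=%u -> min: %u slot(s), max: %u slot(s)\n",
                               requested[i],
                               min_u(requested[i], 1u),   /* broken: never reserves more than one */
                               max_u(requested[i], 1u));  /* fixed: at least one, larger requests honoured */
                return 0;
        }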

Regards,
Christian.


>
> Nowhere else do we actually use the specific number assigned to num_shared.
>
>> But if the driver requested to reserve more than one then we do reserve
>> more than one. That's rather important because both radeon and amdgpu
>> need that for their VM updates.
>>
>> This patch here completely breaks that.
>>
>> There is already an drm_exec patch set from me on the dri-devel mailing
>> list which untangles all of this and deprecates the whole
>> ttm_exec_buf_util handling.
> I can take a look at your patch, but I believe in the pre-patch state this
> is a correct non-functional change.
>
>> Regards,
>> Christian.
>>
>>>> Also as discussed with Daniel we don't want to use BOOKKEEP for implicit
>>>> sync. We should instead use READ for that.
>>> That is the plan and what we do later in the series, use BOOKKEEP for
>>> submissions that don't want to participate in implicit sync?
>>>
>>> This refactor sets everything to READ or WRITE based on the previous
>>> num_shared value, to make sure this patch by itself is not a
>>> functional change.
>>>
>>>> BOOKKEEP is for stuff userspace should never be aware of, e.g. like page
>>>> table updates and KFD eviction fences.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> Signed-off-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
>>>>> ---
>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  | 10 +++++-----
>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c            |  8 +++-----
>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c           |  2 +-
>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c           |  6 +++---
>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c          |  2 +-
>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c            |  3 +--
>>>>>     drivers/gpu/drm/amd/amdkfd/kfd_svm.c              |  2 +-
>>>>>     drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c |  2 +-
>>>>>     drivers/gpu/drm/qxl/qxl_release.c                 |  2 +-
>>>>>     drivers/gpu/drm/radeon/radeon_cs.c                |  5 +++--
>>>>>     drivers/gpu/drm/radeon/radeon_gem.c               |  2 +-
>>>>>     drivers/gpu/drm/radeon/radeon_vm.c                |  4 ++--
>>>>>     drivers/gpu/drm/ttm/ttm_execbuf_util.c            |  5 ++---
>>>>>     drivers/gpu/drm/vmwgfx/vmwgfx_resource.c          | 10 +++++-----
>>>>>     drivers/gpu/drm/vmwgfx/vmwgfx_validation.c        |  2 +-
>>>>>     include/drm/ttm/ttm_execbuf_util.h                |  3 ++-
>>>>>     16 files changed, 33 insertions(+), 35 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>>>> index a4955ef76cfc..a790a089e829 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>>>> @@ -774,7 +774,7 @@ static void add_kgd_mem_to_kfd_bo_list(struct kgd_mem *mem,
>>>>>         struct amdgpu_bo *bo = mem->bo;
>>>>>
>>>>>         INIT_LIST_HEAD(&entry->head);
>>>>> -     entry->num_shared = 1;
>>>>> +     entry->usage = DMA_RESV_USAGE_READ;
>>>>>         entry->bo = &bo->tbo;
>>>>>         mutex_lock(&process_info->lock);
>>>>>         if (userptr)
>>>>> @@ -918,7 +918,7 @@ static int reserve_bo_and_vm(struct kgd_mem *mem,
>>>>>
>>>>>         ctx->kfd_bo.priority = 0;
>>>>>         ctx->kfd_bo.tv.bo = &bo->tbo;
>>>>> -     ctx->kfd_bo.tv.num_shared = 1;
>>>>> +     ctx->kfd_bo.tv.usage = DMA_RESV_USAGE_READ;
>>>>>         list_add(&ctx->kfd_bo.tv.head, &ctx->list);
>>>>>
>>>>>         amdgpu_vm_get_pd_bo(vm, &ctx->list, &ctx->vm_pd[0]);
>>>>> @@ -981,7 +981,7 @@ static int reserve_bo_and_cond_vms(struct kgd_mem *mem,
>>>>>
>>>>>         ctx->kfd_bo.priority = 0;
>>>>>         ctx->kfd_bo.tv.bo = &bo->tbo;
>>>>> -     ctx->kfd_bo.tv.num_shared = 1;
>>>>> +     ctx->kfd_bo.tv.usage = DMA_RESV_USAGE_READ;
>>>>>         list_add(&ctx->kfd_bo.tv.head, &ctx->list);
>>>>>
>>>>>         i = 0;
>>>>> @@ -2218,7 +2218,7 @@ static int validate_invalid_user_pages(struct amdkfd_process_info *process_info)
>>>>>                             validate_list.head) {
>>>>>                 list_add_tail(&mem->resv_list.head, &resv_list);
>>>>>                 mem->resv_list.bo = mem->validate_list.bo;
>>>>> -             mem->resv_list.num_shared = mem->validate_list.num_shared;
>>>>> +             mem->resv_list.usage = mem->validate_list.usage;
>>>>>         }
>>>>>
>>>>>         /* Reserve all BOs and page tables for validation */
>>>>> @@ -2417,7 +2417,7 @@ int amdgpu_amdkfd_gpuvm_restore_process_bos(void *info, struct dma_fence **ef)
>>>>>
>>>>>                 list_add_tail(&mem->resv_list.head, &ctx.list);
>>>>>                 mem->resv_list.bo = mem->validate_list.bo;
>>>>> -             mem->resv_list.num_shared = mem->validate_list.num_shared;
>>>>> +             mem->resv_list.usage = mem->validate_list.usage;
>>>>>         }
>>>>>
>>>>>         ret = ttm_eu_reserve_buffers(&ctx.ticket, &ctx.list,
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>>> index 60ca14afb879..2ae1c0d9d33a 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>>> @@ -55,8 +55,7 @@ static int amdgpu_cs_user_fence_chunk(struct amdgpu_cs_parser *p,
>>>>>         bo = amdgpu_bo_ref(gem_to_amdgpu_bo(gobj));
>>>>>         p->uf_entry.priority = 0;
>>>>>         p->uf_entry.tv.bo = &bo->tbo;
>>>>> -     /* One for TTM and two for the CS job */
>>>>> -     p->uf_entry.tv.num_shared = 3;
>>>>> +     p->uf_entry.tv.usage = DMA_RESV_USAGE_READ;
>>>>>
>>>>>         drm_gem_object_put(gobj);
>>>>>
>>>>> @@ -519,9 +518,8 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p,
>>>>>                         return r;
>>>>>         }
>>>>>
>>>>> -     /* One for TTM and one for the CS job */
>>>>>         amdgpu_bo_list_for_each_entry(e, p->bo_list)
>>>>> -             e->tv.num_shared = 2;
>>>>> +             e->tv.usage = DMA_RESV_USAGE_READ;
>>>>>
>>>>>         amdgpu_bo_list_get_list(p->bo_list, &p->validated);
>>>>>
>>>>> @@ -1261,7 +1259,7 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p,
>>>>>
>>>>>         /* Make sure all BOs are remembered as writers */
>>>>>         amdgpu_bo_list_for_each_entry(e, p->bo_list)
>>>>> -             e->tv.num_shared = 0;
>>>>> +             e->tv.usage = DMA_RESV_USAGE_WRITE;
>>>>>
>>>>>         ttm_eu_fence_buffer_objects(&p->ticket, &p->validated, p->fence);
>>>>>         mutex_unlock(&p->adev->notifier_lock);
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c
>>>>> index c6d4d41c4393..71277257d94d 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c
>>>>> @@ -74,7 +74,7 @@ int amdgpu_map_static_csa(struct amdgpu_device *adev, struct amdgpu_vm *vm,
>>>>>         INIT_LIST_HEAD(&list);
>>>>>         INIT_LIST_HEAD(&csa_tv.head);
>>>>>         csa_tv.bo = &bo->tbo;
>>>>> -     csa_tv.num_shared = 1;
>>>>> +     csa_tv.usage = DMA_RESV_USAGE_READ;
>>>>>
>>>>>         list_add(&csa_tv.head, &list);
>>>>>         amdgpu_vm_get_pd_bo(vm, &list, &pd);
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>>>> index 84a53758e18e..7483411229f4 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>>>> @@ -207,7 +207,7 @@ static void amdgpu_gem_object_close(struct drm_gem_object *obj,
>>>>>         INIT_LIST_HEAD(&duplicates);
>>>>>
>>>>>         tv.bo = &bo->tbo;
>>>>> -     tv.num_shared = 2;
>>>>> +     tv.usage = DMA_RESV_USAGE_READ;
>>>>>         list_add(&tv.head, &list);
>>>>>
>>>>>         amdgpu_vm_get_pd_bo(vm, &list, &vm_pd);
>>>>> @@ -731,9 +731,9 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
>>>>>                 abo = gem_to_amdgpu_bo(gobj);
>>>>>                 tv.bo = &abo->tbo;
>>>>>                 if (abo->flags & AMDGPU_GEM_CREATE_VM_ALWAYS_VALID)
>>>>> -                     tv.num_shared = 1;
>>>>> +                     tv.usage = DMA_RESV_USAGE_READ;
>>>>>                 else
>>>>> -                     tv.num_shared = 0;
>>>>> +                     tv.usage = DMA_RESV_USAGE_WRITE;
>>>>>                 list_add(&tv.head, &list);
>>>>>         } else {
>>>>>                 gobj = NULL;
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
>>>>> index 5224d9a39737..f670d8473993 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
>>>>> @@ -319,7 +319,7 @@ static int amdgpu_vkms_prepare_fb(struct drm_plane *plane,
>>>>>         INIT_LIST_HEAD(&list);
>>>>>
>>>>>         tv.bo = &rbo->tbo;
>>>>> -     tv.num_shared = 1;
>>>>> +     tv.usage = DMA_RESV_USAGE_READ;
>>>>>         list_add(&tv.head, &list);
>>>>>
>>>>>         r = ttm_eu_reserve_buffers(&ticket, &list, false, NULL);
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>> index 15184153e2b9..515be19ab279 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>> @@ -633,8 +633,7 @@ void amdgpu_vm_get_pd_bo(struct amdgpu_vm *vm,
>>>>>     {
>>>>>         entry->priority = 0;
>>>>>         entry->tv.bo = &vm->root.bo->tbo;
>>>>> -     /* Two for VM updates, one for TTM and one for the CS job */
>>>>> -     entry->tv.num_shared = 4;
>>>>> +     entry->tv.usage = DMA_RESV_USAGE_READ;
>>>>>         entry->user_pages = NULL;
>>>>>         list_add(&entry->tv.head, validated);
>>>>>     }
>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>>>> index b3fc3e958227..af844b636778 100644
>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>>>> @@ -1395,7 +1395,7 @@ static int svm_range_reserve_bos(struct svm_validate_context *ctx)
>>>>>                 vm = drm_priv_to_vm(pdd->drm_priv);
>>>>>
>>>>>                 ctx->tv[gpuidx].bo = &vm->root.bo->tbo;
>>>>> -             ctx->tv[gpuidx].num_shared = 4;
>>>>> +             ctx->tv[gpuidx].usage = DMA_RESV_USAGE_READ;
>>>>>                 list_add(&ctx->tv[gpuidx].head, &ctx->validate_list);
>>>>>         }
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
>>>>> index 73423b805b54..851b7844b084 100644
>>>>> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
>>>>> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
>>>>> @@ -7601,7 +7601,7 @@ static int dm_plane_helper_prepare_fb(struct drm_plane *plane,
>>>>>         INIT_LIST_HEAD(&list);
>>>>>
>>>>>         tv.bo = &rbo->tbo;
>>>>> -     tv.num_shared = 1;
>>>>> +     tv.usage = DMA_RESV_USAGE_READ;
>>>>>         list_add(&tv.head, &list);
>>>>>
>>>>>         r = ttm_eu_reserve_buffers(&ticket, &list, false, NULL);
>>>>> diff --git a/drivers/gpu/drm/qxl/qxl_release.c b/drivers/gpu/drm/qxl/qxl_release.c
>>>>> index 368d26da0d6a..689e35192070 100644
>>>>> --- a/drivers/gpu/drm/qxl/qxl_release.c
>>>>> +++ b/drivers/gpu/drm/qxl/qxl_release.c
>>>>> @@ -183,7 +183,7 @@ int qxl_release_list_add(struct qxl_release *release, struct qxl_bo *bo)
>>>>>
>>>>>         qxl_bo_ref(bo);
>>>>>         entry->tv.bo = &bo->tbo;
>>>>> -     entry->tv.num_shared = 0;
>>>>> +     entry->tv.usage = DMA_RESV_USAGE_WRITE;
>>>>>         list_add_tail(&entry->tv.head, &release->bos);
>>>>>         return 0;
>>>>>     }
>>>>> diff --git a/drivers/gpu/drm/radeon/radeon_cs.c b/drivers/gpu/drm/radeon/radeon_cs.c
>>>>> index 446f7bae54c4..30afe0c62dd9 100644
>>>>> --- a/drivers/gpu/drm/radeon/radeon_cs.c
>>>>> +++ b/drivers/gpu/drm/radeon/radeon_cs.c
>>>>> @@ -183,7 +183,8 @@ static int radeon_cs_parser_relocs(struct radeon_cs_parser *p)
>>>>>                 }
>>>>>
>>>>>                 p->relocs[i].tv.bo = &p->relocs[i].robj->tbo;
>>>>> -             p->relocs[i].tv.num_shared = !r->write_domain;
>>>>> +             p->relocs[i].tv.usage =
>>>>> +                     r->write_domain ? DMA_RESV_USAGE_WRITE : DMA_RESV_USAGE_READ;
>>>>>
>>>>>                 radeon_cs_buckets_add(&buckets, &p->relocs[i].tv.head,
>>>>>                                       priority);
>>>>> @@ -258,7 +259,7 @@ static int radeon_cs_sync_rings(struct radeon_cs_parser *p)
>>>>>
>>>>>                 resv = reloc->robj->tbo.base.resv;
>>>>>                 r = radeon_sync_resv(p->rdev, &p->ib.sync, resv,
>>>>> -                                  reloc->tv.num_shared);
>>>>> +                                  reloc->tv.usage != DMA_RESV_USAGE_WRITE);
>>>>>                 if (r)
>>>>>                         return r;
>>>>>         }
>>>>> diff --git a/drivers/gpu/drm/radeon/radeon_gem.c b/drivers/gpu/drm/radeon/radeon_gem.c
>>>>> index 8c01a7f0e027..eae47c709f5d 100644
>>>>> --- a/drivers/gpu/drm/radeon/radeon_gem.c
>>>>> +++ b/drivers/gpu/drm/radeon/radeon_gem.c
>>>>> @@ -635,7 +635,7 @@ static void radeon_gem_va_update_vm(struct radeon_device *rdev,
>>>>>         INIT_LIST_HEAD(&list);
>>>>>
>>>>>         tv.bo = &bo_va->bo->tbo;
>>>>> -     tv.num_shared = 1;
>>>>> +     tv.usage = DMA_RESV_USAGE_READ;
>>>>>         list_add(&tv.head, &list);
>>>>>
>>>>>         vm_bos = radeon_vm_get_bos(rdev, bo_va->vm, &list);
>>>>> diff --git a/drivers/gpu/drm/radeon/radeon_vm.c b/drivers/gpu/drm/radeon/radeon_vm.c
>>>>> index 987cabbf1318..702627b48dae 100644
>>>>> --- a/drivers/gpu/drm/radeon/radeon_vm.c
>>>>> +++ b/drivers/gpu/drm/radeon/radeon_vm.c
>>>>> @@ -143,7 +143,7 @@ struct radeon_bo_list *radeon_vm_get_bos(struct radeon_device *rdev,
>>>>>         list[0].preferred_domains = RADEON_GEM_DOMAIN_VRAM;
>>>>>         list[0].allowed_domains = RADEON_GEM_DOMAIN_VRAM;
>>>>>         list[0].tv.bo = &vm->page_directory->tbo;
>>>>> -     list[0].tv.num_shared = 1;
>>>>> +     list[0].tv.usage = DMA_RESV_USAGE_READ;
>>>>>         list[0].tiling_flags = 0;
>>>>>         list_add(&list[0].tv.head, head);
>>>>>
>>>>> @@ -155,7 +155,7 @@ struct radeon_bo_list *radeon_vm_get_bos(struct radeon_device *rdev,
>>>>>                 list[idx].preferred_domains = RADEON_GEM_DOMAIN_VRAM;
>>>>>                 list[idx].allowed_domains = RADEON_GEM_DOMAIN_VRAM;
>>>>>                 list[idx].tv.bo = &list[idx].robj->tbo;
>>>>> -             list[idx].tv.num_shared = 1;
>>>>> +             list[idx].tv.usage = DMA_RESV_USAGE_READ;
>>>>>                 list[idx].tiling_flags = 0;
>>>>>                 list_add(&list[idx++].tv.head, head);
>>>>>         }
>>>>> diff --git a/drivers/gpu/drm/ttm/ttm_execbuf_util.c b/drivers/gpu/drm/ttm/ttm_execbuf_util.c
>>>>> index 0eb995d25df1..c39d8e5ac271 100644
>>>>> --- a/drivers/gpu/drm/ttm/ttm_execbuf_util.c
>>>>> +++ b/drivers/gpu/drm/ttm/ttm_execbuf_util.c
>>>>> @@ -101,7 +101,7 @@ int ttm_eu_reserve_buffers(struct ww_acquire_ctx *ticket,
>>>>>                         continue;
>>>>>                 }
>>>>>
>>>>> -             num_fences = min(entry->num_shared, 1u);
>>>>> +             num_fences = entry->usage <= DMA_RESV_USAGE_WRITE ? 0u : 1u;
>>>>>                 if (!ret) {
>>>>>                         ret = dma_resv_reserve_fences(bo->base.resv,
>>>>>                                                       num_fences);
>>>>> @@ -154,8 +154,7 @@ void ttm_eu_fence_buffer_objects(struct ww_acquire_ctx *ticket,
>>>>>         list_for_each_entry(entry, list, head) {
>>>>>                 struct ttm_buffer_object *bo = entry->bo;
>>>>>
>>>>> -             dma_resv_add_fence(bo->base.resv, fence, entry->num_shared ?
>>>>> -                                DMA_RESV_USAGE_READ : DMA_RESV_USAGE_WRITE);
>>>>> +             dma_resv_add_fence(bo->base.resv, fence, entry->usage);
>>>>>                 ttm_bo_move_to_lru_tail_unlocked(bo);
>>>>>                 dma_resv_unlock(bo->base.resv);
>>>>>         }
>>>>> diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c b/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c
>>>>> index c6d02c98a19a..58dfff7d6c76 100644
>>>>> --- a/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c
>>>>> +++ b/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c
>>>>> @@ -130,7 +130,7 @@ static void vmw_resource_release(struct kref *kref)
>>>>>                         struct ttm_validate_buffer val_buf;
>>>>>
>>>>>                         val_buf.bo = bo;
>>>>> -                     val_buf.num_shared = 0;
>>>>> +                     val_buf.usage = DMA_RESV_USAGE_WRITE;
>>>>>                         res->func->unbind(res, false, &val_buf);
>>>>>                 }
>>>>>                 res->backup_dirty = false;
>>>>> @@ -552,7 +552,7 @@ vmw_resource_check_buffer(struct ww_acquire_ctx *ticket,
>>>>>         INIT_LIST_HEAD(&val_list);
>>>>>         ttm_bo_get(&res->backup->base);
>>>>>         val_buf->bo = &res->backup->base;
>>>>> -     val_buf->num_shared = 0;
>>>>> +     val_buf->usage = DMA_RESV_USAGE_WRITE;
>>>>>         list_add_tail(&val_buf->head, &val_list);
>>>>>         ret = ttm_eu_reserve_buffers(ticket, &val_list, interruptible, NULL);
>>>>>         if (unlikely(ret != 0))
>>>>> @@ -657,7 +657,7 @@ static int vmw_resource_do_evict(struct ww_acquire_ctx *ticket,
>>>>>         BUG_ON(!func->may_evict);
>>>>>
>>>>>         val_buf.bo = NULL;
>>>>> -     val_buf.num_shared = 0;
>>>>> +     val_buf.usage = DMA_RESV_USAGE_WRITE;
>>>>>         ret = vmw_resource_check_buffer(ticket, res, interruptible, &val_buf);
>>>>>         if (unlikely(ret != 0))
>>>>>                 return ret;
>>>>> @@ -708,7 +708,7 @@ int vmw_resource_validate(struct vmw_resource *res, bool intr,
>>>>>                 return 0;
>>>>>
>>>>>         val_buf.bo = NULL;
>>>>> -     val_buf.num_shared = 0;
>>>>> +     val_buf.usage = DMA_RESV_USAGE_WRITE;
>>>>>         if (res->backup)
>>>>>                 val_buf.bo = &res->backup->base;
>>>>>         do {
>>>>> @@ -777,7 +777,7 @@ void vmw_resource_unbind_list(struct vmw_buffer_object *vbo)
>>>>>     {
>>>>>         struct ttm_validate_buffer val_buf = {
>>>>>                 .bo = &vbo->base,
>>>>> -             .num_shared = 0
>>>>> +             .usage = DMA_RESV_USAGE_WRITE
>>>>>         };
>>>>>
>>>>>         dma_resv_assert_held(vbo->base.base.resv);
>>>>> diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_validation.c b/drivers/gpu/drm/vmwgfx/vmwgfx_validation.c
>>>>> index f46891012be3..0476ba498321 100644
>>>>> --- a/drivers/gpu/drm/vmwgfx/vmwgfx_validation.c
>>>>> +++ b/drivers/gpu/drm/vmwgfx/vmwgfx_validation.c
>>>>> @@ -288,7 +288,7 @@ int vmw_validation_add_bo(struct vmw_validation_context *ctx,
>>>>>                 val_buf->bo = ttm_bo_get_unless_zero(&vbo->base);
>>>>>                 if (!val_buf->bo)
>>>>>                         return -ESRCH;
>>>>> -             val_buf->num_shared = 0;
>>>>> +             val_buf->usage = DMA_RESV_USAGE_WRITE;
>>>>>                 list_add_tail(&val_buf->head, &ctx->bo_list);
>>>>>                 bo_node->as_mob = as_mob;
>>>>>                 bo_node->cpu_blit = cpu_blit;
>>>>> diff --git a/include/drm/ttm/ttm_execbuf_util.h b/include/drm/ttm/ttm_execbuf_util.h
>>>>> index a99d7fdf2964..851961a06c27 100644
>>>>> --- a/include/drm/ttm/ttm_execbuf_util.h
>>>>> +++ b/include/drm/ttm/ttm_execbuf_util.h
>>>>> @@ -31,6 +31,7 @@
>>>>>     #ifndef _TTM_EXECBUF_UTIL_H_
>>>>>     #define _TTM_EXECBUF_UTIL_H_
>>>>>
>>>>> +#include <linux/dma-resv.h>
>>>>>     #include <linux/list.h>
>>>>>
>>>>>     #include "ttm_bo_api.h"
>>>>> @@ -46,7 +47,7 @@
>>>>>     struct ttm_validate_buffer {
>>>>>         struct list_head head;
>>>>>         struct ttm_buffer_object *bo;
>>>>> -     unsigned int num_shared;
>>>>> +     enum dma_resv_usage usage;
>>>>>     };
>>>>>
>>>>>     /**


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 1/5] drm/ttm: Refactor num_shared into usage.
  2022-06-01  8:41     ` Daniel Vetter
@ 2022-06-01  8:47       ` Christian König
  0 siblings, 0 replies; 46+ messages in thread
From: Christian König @ 2022-06-01  8:47 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: dri-devel

Am 01.06.22 um 10:41 schrieb Daniel Vetter:
> On Wed, 1 Jun 2022 at 10:02, Christian König <christian.koenig@amd.com> wrote:
>> Am 01.06.22 um 02:40 schrieb Bas Nieuwenhuizen:
>>> So that the driver can set some BOOKKEEP for explicit sync. Maybe
>>> some of the existing places would already make sense for that, but
>>> I targeted this for no functional changes.
>> Well first of all NAK to that one since it will totally break cases
>> which need to reserve more than one fence slot.
> Quick reminder, we talked about this in the past. For many folks (not
> you) NAK means "fuck off" and not "this won't work for the reasons I
> just explained". Looks like the conversation is all on a good track in
> the further replies, just figured I'll drop this again as a reminder
> :-)

Yeah, that came to my mind as well.

But I still prefer NAK for what it means in computer science, e.g. "Not 
AcKnowledged": please restart from scratch.

We do need a clear indicator that the whole approach taken in a patch 
needs to be dropped and restarted from scratch, and a NAK seems to fit that.

If I wanted to tell somebody to fuck off, I would clearly write that.

Christian.

>
> Maybe do an autocomplete in your mail editor which replaces NAK with
> NAK (note: this means "fuck off" for many folks) so you can decide
> whether that's really the message you want to send out to start the
> morning. And in some rare case I do agree that just dropping a polite
> "fuck off" is the right thing to make it clear what's up ...
>
> Cheers, Daniel
>
>> Also as discussed with Daniel we don't want to use BOOKKEEP for implicit
>> sync. We should instead use READ for that.
>>
>> BOOKKEEP is for stuff userspace should never be aware of, e.g. like page
>> table updates and KFD eviction fences.
>>
>> Regards,
>> Christian.
>>
>>> Signed-off-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
>>> ---
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  | 10 +++++-----
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c            |  8 +++-----
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c           |  2 +-
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c           |  6 +++---
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c          |  2 +-
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c            |  3 +--
>>>    drivers/gpu/drm/amd/amdkfd/kfd_svm.c              |  2 +-
>>>    drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c |  2 +-
>>>    drivers/gpu/drm/qxl/qxl_release.c                 |  2 +-
>>>    drivers/gpu/drm/radeon/radeon_cs.c                |  5 +++--
>>>    drivers/gpu/drm/radeon/radeon_gem.c               |  2 +-
>>>    drivers/gpu/drm/radeon/radeon_vm.c                |  4 ++--
>>>    drivers/gpu/drm/ttm/ttm_execbuf_util.c            |  5 ++---
>>>    drivers/gpu/drm/vmwgfx/vmwgfx_resource.c          | 10 +++++-----
>>>    drivers/gpu/drm/vmwgfx/vmwgfx_validation.c        |  2 +-
>>>    include/drm/ttm/ttm_execbuf_util.h                |  3 ++-
>>>    16 files changed, 33 insertions(+), 35 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>> index a4955ef76cfc..a790a089e829 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>> @@ -774,7 +774,7 @@ static void add_kgd_mem_to_kfd_bo_list(struct kgd_mem *mem,
>>>        struct amdgpu_bo *bo = mem->bo;
>>>
>>>        INIT_LIST_HEAD(&entry->head);
>>> -     entry->num_shared = 1;
>>> +     entry->usage = DMA_RESV_USAGE_READ;
>>>        entry->bo = &bo->tbo;
>>>        mutex_lock(&process_info->lock);
>>>        if (userptr)
>>> @@ -918,7 +918,7 @@ static int reserve_bo_and_vm(struct kgd_mem *mem,
>>>
>>>        ctx->kfd_bo.priority = 0;
>>>        ctx->kfd_bo.tv.bo = &bo->tbo;
>>> -     ctx->kfd_bo.tv.num_shared = 1;
>>> +     ctx->kfd_bo.tv.usage = DMA_RESV_USAGE_READ;
>>>        list_add(&ctx->kfd_bo.tv.head, &ctx->list);
>>>
>>>        amdgpu_vm_get_pd_bo(vm, &ctx->list, &ctx->vm_pd[0]);
>>> @@ -981,7 +981,7 @@ static int reserve_bo_and_cond_vms(struct kgd_mem *mem,
>>>
>>>        ctx->kfd_bo.priority = 0;
>>>        ctx->kfd_bo.tv.bo = &bo->tbo;
>>> -     ctx->kfd_bo.tv.num_shared = 1;
>>> +     ctx->kfd_bo.tv.usage = DMA_RESV_USAGE_READ;
>>>        list_add(&ctx->kfd_bo.tv.head, &ctx->list);
>>>
>>>        i = 0;
>>> @@ -2218,7 +2218,7 @@ static int validate_invalid_user_pages(struct amdkfd_process_info *process_info)
>>>                            validate_list.head) {
>>>                list_add_tail(&mem->resv_list.head, &resv_list);
>>>                mem->resv_list.bo = mem->validate_list.bo;
>>> -             mem->resv_list.num_shared = mem->validate_list.num_shared;
>>> +             mem->resv_list.usage = mem->validate_list.usage;
>>>        }
>>>
>>>        /* Reserve all BOs and page tables for validation */
>>> @@ -2417,7 +2417,7 @@ int amdgpu_amdkfd_gpuvm_restore_process_bos(void *info, struct dma_fence **ef)
>>>
>>>                list_add_tail(&mem->resv_list.head, &ctx.list);
>>>                mem->resv_list.bo = mem->validate_list.bo;
>>> -             mem->resv_list.num_shared = mem->validate_list.num_shared;
>>> +             mem->resv_list.usage = mem->validate_list.usage;
>>>        }
>>>
>>>        ret = ttm_eu_reserve_buffers(&ctx.ticket, &ctx.list,
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>> index 60ca14afb879..2ae1c0d9d33a 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>> @@ -55,8 +55,7 @@ static int amdgpu_cs_user_fence_chunk(struct amdgpu_cs_parser *p,
>>>        bo = amdgpu_bo_ref(gem_to_amdgpu_bo(gobj));
>>>        p->uf_entry.priority = 0;
>>>        p->uf_entry.tv.bo = &bo->tbo;
>>> -     /* One for TTM and two for the CS job */
>>> -     p->uf_entry.tv.num_shared = 3;
>>> +     p->uf_entry.tv.usage = DMA_RESV_USAGE_READ;
>>>
>>>        drm_gem_object_put(gobj);
>>>
>>> @@ -519,9 +518,8 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p,
>>>                        return r;
>>>        }
>>>
>>> -     /* One for TTM and one for the CS job */
>>>        amdgpu_bo_list_for_each_entry(e, p->bo_list)
>>> -             e->tv.num_shared = 2;
>>> +             e->tv.usage = DMA_RESV_USAGE_READ;
>>>
>>>        amdgpu_bo_list_get_list(p->bo_list, &p->validated);
>>>
>>> @@ -1261,7 +1259,7 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p,
>>>
>>>        /* Make sure all BOs are remembered as writers */
>>>        amdgpu_bo_list_for_each_entry(e, p->bo_list)
>>> -             e->tv.num_shared = 0;
>>> +             e->tv.usage = DMA_RESV_USAGE_WRITE;
>>>
>>>        ttm_eu_fence_buffer_objects(&p->ticket, &p->validated, p->fence);
>>>        mutex_unlock(&p->adev->notifier_lock);
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c
>>> index c6d4d41c4393..71277257d94d 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c
>>> @@ -74,7 +74,7 @@ int amdgpu_map_static_csa(struct amdgpu_device *adev, struct amdgpu_vm *vm,
>>>        INIT_LIST_HEAD(&list);
>>>        INIT_LIST_HEAD(&csa_tv.head);
>>>        csa_tv.bo = &bo->tbo;
>>> -     csa_tv.num_shared = 1;
>>> +     csa_tv.usage = DMA_RESV_USAGE_READ;
>>>
>>>        list_add(&csa_tv.head, &list);
>>>        amdgpu_vm_get_pd_bo(vm, &list, &pd);
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>> index 84a53758e18e..7483411229f4 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>> @@ -207,7 +207,7 @@ static void amdgpu_gem_object_close(struct drm_gem_object *obj,
>>>        INIT_LIST_HEAD(&duplicates);
>>>
>>>        tv.bo = &bo->tbo;
>>> -     tv.num_shared = 2;
>>> +     tv.usage = DMA_RESV_USAGE_READ;
>>>        list_add(&tv.head, &list);
>>>
>>>        amdgpu_vm_get_pd_bo(vm, &list, &vm_pd);
>>> @@ -731,9 +731,9 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
>>>                abo = gem_to_amdgpu_bo(gobj);
>>>                tv.bo = &abo->tbo;
>>>                if (abo->flags & AMDGPU_GEM_CREATE_VM_ALWAYS_VALID)
>>> -                     tv.num_shared = 1;
>>> +                     tv.usage = DMA_RESV_USAGE_READ;
>>>                else
>>> -                     tv.num_shared = 0;
>>> +                     tv.usage = DMA_RESV_USAGE_WRITE;
>>>                list_add(&tv.head, &list);
>>>        } else {
>>>                gobj = NULL;
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
>>> index 5224d9a39737..f670d8473993 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
>>> @@ -319,7 +319,7 @@ static int amdgpu_vkms_prepare_fb(struct drm_plane *plane,
>>>        INIT_LIST_HEAD(&list);
>>>
>>>        tv.bo = &rbo->tbo;
>>> -     tv.num_shared = 1;
>>> +     tv.usage = DMA_RESV_USAGE_READ;
>>>        list_add(&tv.head, &list);
>>>
>>>        r = ttm_eu_reserve_buffers(&ticket, &list, false, NULL);
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> index 15184153e2b9..515be19ab279 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> @@ -633,8 +633,7 @@ void amdgpu_vm_get_pd_bo(struct amdgpu_vm *vm,
>>>    {
>>>        entry->priority = 0;
>>>        entry->tv.bo = &vm->root.bo->tbo;
>>> -     /* Two for VM updates, one for TTM and one for the CS job */
>>> -     entry->tv.num_shared = 4;
>>> +     entry->tv.usage = DMA_RESV_USAGE_READ;
>>>        entry->user_pages = NULL;
>>>        list_add(&entry->tv.head, validated);
>>>    }
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>> index b3fc3e958227..af844b636778 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>> @@ -1395,7 +1395,7 @@ static int svm_range_reserve_bos(struct svm_validate_context *ctx)
>>>                vm = drm_priv_to_vm(pdd->drm_priv);
>>>
>>>                ctx->tv[gpuidx].bo = &vm->root.bo->tbo;
>>> -             ctx->tv[gpuidx].num_shared = 4;
>>> +             ctx->tv[gpuidx].usage = DMA_RESV_USAGE_READ;
>>>                list_add(&ctx->tv[gpuidx].head, &ctx->validate_list);
>>>        }
>>>
>>> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
>>> index 73423b805b54..851b7844b084 100644
>>> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
>>> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
>>> @@ -7601,7 +7601,7 @@ static int dm_plane_helper_prepare_fb(struct drm_plane *plane,
>>>        INIT_LIST_HEAD(&list);
>>>
>>>        tv.bo = &rbo->tbo;
>>> -     tv.num_shared = 1;
>>> +     tv.usage = DMA_RESV_USAGE_READ;
>>>        list_add(&tv.head, &list);
>>>
>>>        r = ttm_eu_reserve_buffers(&ticket, &list, false, NULL);
>>> diff --git a/drivers/gpu/drm/qxl/qxl_release.c b/drivers/gpu/drm/qxl/qxl_release.c
>>> index 368d26da0d6a..689e35192070 100644
>>> --- a/drivers/gpu/drm/qxl/qxl_release.c
>>> +++ b/drivers/gpu/drm/qxl/qxl_release.c
>>> @@ -183,7 +183,7 @@ int qxl_release_list_add(struct qxl_release *release, struct qxl_bo *bo)
>>>
>>>        qxl_bo_ref(bo);
>>>        entry->tv.bo = &bo->tbo;
>>> -     entry->tv.num_shared = 0;
>>> +     entry->tv.usage = DMA_RESV_USAGE_WRITE;
>>>        list_add_tail(&entry->tv.head, &release->bos);
>>>        return 0;
>>>    }
>>> diff --git a/drivers/gpu/drm/radeon/radeon_cs.c b/drivers/gpu/drm/radeon/radeon_cs.c
>>> index 446f7bae54c4..30afe0c62dd9 100644
>>> --- a/drivers/gpu/drm/radeon/radeon_cs.c
>>> +++ b/drivers/gpu/drm/radeon/radeon_cs.c
>>> @@ -183,7 +183,8 @@ static int radeon_cs_parser_relocs(struct radeon_cs_parser *p)
>>>                }
>>>
>>>                p->relocs[i].tv.bo = &p->relocs[i].robj->tbo;
>>> -             p->relocs[i].tv.num_shared = !r->write_domain;
>>> +             p->relocs[i].tv.usage =
>>> +                     r->write_domain ? DMA_RESV_USAGE_WRITE : DMA_RESV_USAGE_READ;
>>>
>>>                radeon_cs_buckets_add(&buckets, &p->relocs[i].tv.head,
>>>                                      priority);
>>> @@ -258,7 +259,7 @@ static int radeon_cs_sync_rings(struct radeon_cs_parser *p)
>>>
>>>                resv = reloc->robj->tbo.base.resv;
>>>                r = radeon_sync_resv(p->rdev, &p->ib.sync, resv,
>>> -                                  reloc->tv.num_shared);
>>> +                                  reloc->tv.usage != DMA_RESV_USAGE_WRITE);
>>>                if (r)
>>>                        return r;
>>>        }
>>> diff --git a/drivers/gpu/drm/radeon/radeon_gem.c b/drivers/gpu/drm/radeon/radeon_gem.c
>>> index 8c01a7f0e027..eae47c709f5d 100644
>>> --- a/drivers/gpu/drm/radeon/radeon_gem.c
>>> +++ b/drivers/gpu/drm/radeon/radeon_gem.c
>>> @@ -635,7 +635,7 @@ static void radeon_gem_va_update_vm(struct radeon_device *rdev,
>>>        INIT_LIST_HEAD(&list);
>>>
>>>        tv.bo = &bo_va->bo->tbo;
>>> -     tv.num_shared = 1;
>>> +     tv.usage = DMA_RESV_USAGE_READ;
>>>        list_add(&tv.head, &list);
>>>
>>>        vm_bos = radeon_vm_get_bos(rdev, bo_va->vm, &list);
>>> diff --git a/drivers/gpu/drm/radeon/radeon_vm.c b/drivers/gpu/drm/radeon/radeon_vm.c
>>> index 987cabbf1318..702627b48dae 100644
>>> --- a/drivers/gpu/drm/radeon/radeon_vm.c
>>> +++ b/drivers/gpu/drm/radeon/radeon_vm.c
>>> @@ -143,7 +143,7 @@ struct radeon_bo_list *radeon_vm_get_bos(struct radeon_device *rdev,
>>>        list[0].preferred_domains = RADEON_GEM_DOMAIN_VRAM;
>>>        list[0].allowed_domains = RADEON_GEM_DOMAIN_VRAM;
>>>        list[0].tv.bo = &vm->page_directory->tbo;
>>> -     list[0].tv.num_shared = 1;
>>> +     list[0].tv.usage = DMA_RESV_USAGE_READ;
>>>        list[0].tiling_flags = 0;
>>>        list_add(&list[0].tv.head, head);
>>>
>>> @@ -155,7 +155,7 @@ struct radeon_bo_list *radeon_vm_get_bos(struct radeon_device *rdev,
>>>                list[idx].preferred_domains = RADEON_GEM_DOMAIN_VRAM;
>>>                list[idx].allowed_domains = RADEON_GEM_DOMAIN_VRAM;
>>>                list[idx].tv.bo = &list[idx].robj->tbo;
>>> -             list[idx].tv.num_shared = 1;
>>> +             list[idx].tv.usage = DMA_RESV_USAGE_READ;
>>>                list[idx].tiling_flags = 0;
>>>                list_add(&list[idx++].tv.head, head);
>>>        }
>>> diff --git a/drivers/gpu/drm/ttm/ttm_execbuf_util.c b/drivers/gpu/drm/ttm/ttm_execbuf_util.c
>>> index 0eb995d25df1..c39d8e5ac271 100644
>>> --- a/drivers/gpu/drm/ttm/ttm_execbuf_util.c
>>> +++ b/drivers/gpu/drm/ttm/ttm_execbuf_util.c
>>> @@ -101,7 +101,7 @@ int ttm_eu_reserve_buffers(struct ww_acquire_ctx *ticket,
>>>                        continue;
>>>                }
>>>
>>> -             num_fences = min(entry->num_shared, 1u);
>>> +             num_fences = entry->usage <= DMA_RESV_USAGE_WRITE ? 0u : 1u;
>>>                if (!ret) {
>>>                        ret = dma_resv_reserve_fences(bo->base.resv,
>>>                                                      num_fences);
>>> @@ -154,8 +154,7 @@ void ttm_eu_fence_buffer_objects(struct ww_acquire_ctx *ticket,
>>>        list_for_each_entry(entry, list, head) {
>>>                struct ttm_buffer_object *bo = entry->bo;
>>>
>>> -             dma_resv_add_fence(bo->base.resv, fence, entry->num_shared ?
>>> -                                DMA_RESV_USAGE_READ : DMA_RESV_USAGE_WRITE);
>>> +             dma_resv_add_fence(bo->base.resv, fence, entry->usage);
>>>                ttm_bo_move_to_lru_tail_unlocked(bo);
>>>                dma_resv_unlock(bo->base.resv);
>>>        }
>>> diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c b/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c
>>> index c6d02c98a19a..58dfff7d6c76 100644
>>> --- a/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c
>>> +++ b/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c
>>> @@ -130,7 +130,7 @@ static void vmw_resource_release(struct kref *kref)
>>>                        struct ttm_validate_buffer val_buf;
>>>
>>>                        val_buf.bo = bo;
>>> -                     val_buf.num_shared = 0;
>>> +                     val_buf.usage = DMA_RESV_USAGE_WRITE;
>>>                        res->func->unbind(res, false, &val_buf);
>>>                }
>>>                res->backup_dirty = false;
>>> @@ -552,7 +552,7 @@ vmw_resource_check_buffer(struct ww_acquire_ctx *ticket,
>>>        INIT_LIST_HEAD(&val_list);
>>>        ttm_bo_get(&res->backup->base);
>>>        val_buf->bo = &res->backup->base;
>>> -     val_buf->num_shared = 0;
>>> +     val_buf->usage = DMA_RESV_USAGE_WRITE;
>>>        list_add_tail(&val_buf->head, &val_list);
>>>        ret = ttm_eu_reserve_buffers(ticket, &val_list, interruptible, NULL);
>>>        if (unlikely(ret != 0))
>>> @@ -657,7 +657,7 @@ static int vmw_resource_do_evict(struct ww_acquire_ctx *ticket,
>>>        BUG_ON(!func->may_evict);
>>>
>>>        val_buf.bo = NULL;
>>> -     val_buf.num_shared = 0;
>>> +     val_buf.usage = DMA_RESV_USAGE_WRITE;
>>>        ret = vmw_resource_check_buffer(ticket, res, interruptible, &val_buf);
>>>        if (unlikely(ret != 0))
>>>                return ret;
>>> @@ -708,7 +708,7 @@ int vmw_resource_validate(struct vmw_resource *res, bool intr,
>>>                return 0;
>>>
>>>        val_buf.bo = NULL;
>>> -     val_buf.num_shared = 0;
>>> +     val_buf.usage = DMA_RESV_USAGE_WRITE;
>>>        if (res->backup)
>>>                val_buf.bo = &res->backup->base;
>>>        do {
>>> @@ -777,7 +777,7 @@ void vmw_resource_unbind_list(struct vmw_buffer_object *vbo)
>>>    {
>>>        struct ttm_validate_buffer val_buf = {
>>>                .bo = &vbo->base,
>>> -             .num_shared = 0
>>> +             .usage = DMA_RESV_USAGE_WRITE
>>>        };
>>>
>>>        dma_resv_assert_held(vbo->base.base.resv);
>>> diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_validation.c b/drivers/gpu/drm/vmwgfx/vmwgfx_validation.c
>>> index f46891012be3..0476ba498321 100644
>>> --- a/drivers/gpu/drm/vmwgfx/vmwgfx_validation.c
>>> +++ b/drivers/gpu/drm/vmwgfx/vmwgfx_validation.c
>>> @@ -288,7 +288,7 @@ int vmw_validation_add_bo(struct vmw_validation_context *ctx,
>>>                val_buf->bo = ttm_bo_get_unless_zero(&vbo->base);
>>>                if (!val_buf->bo)
>>>                        return -ESRCH;
>>> -             val_buf->num_shared = 0;
>>> +             val_buf->usage = DMA_RESV_USAGE_WRITE;
>>>                list_add_tail(&val_buf->head, &ctx->bo_list);
>>>                bo_node->as_mob = as_mob;
>>>                bo_node->cpu_blit = cpu_blit;
>>> diff --git a/include/drm/ttm/ttm_execbuf_util.h b/include/drm/ttm/ttm_execbuf_util.h
>>> index a99d7fdf2964..851961a06c27 100644
>>> --- a/include/drm/ttm/ttm_execbuf_util.h
>>> +++ b/include/drm/ttm/ttm_execbuf_util.h
>>> @@ -31,6 +31,7 @@
>>>    #ifndef _TTM_EXECBUF_UTIL_H_
>>>    #define _TTM_EXECBUF_UTIL_H_
>>>
>>> +#include <linux/dma-resv.h>
>>>    #include <linux/list.h>
>>>
>>>    #include "ttm_bo_api.h"
>>> @@ -46,7 +47,7 @@
>>>    struct ttm_validate_buffer {
>>>        struct list_head head;
>>>        struct ttm_buffer_object *bo;
>>> -     unsigned int num_shared;
>>> +     enum dma_resv_usage usage;
>>>    };
>>>
>>>    /**
>


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 3/5] drm/amdgpu: Allow explicit sync for VM ops.
  2022-06-01  8:40       ` Christian König
@ 2022-06-01  8:48         ` Bas Nieuwenhuizen
  2022-06-01  8:59           ` Bas Nieuwenhuizen
  2022-06-01  9:01           ` Christian König
  0 siblings, 2 replies; 46+ messages in thread
From: Bas Nieuwenhuizen @ 2022-06-01  8:48 UTC (permalink / raw)
  To: Christian König; +Cc: ML dri-devel

On Wed, Jun 1, 2022 at 10:40 AM Christian König
<christian.koenig@amd.com> wrote:
>
> Am 01.06.22 um 10:16 schrieb Bas Nieuwenhuizen:
> > On Wed, Jun 1, 2022 at 10:03 AM Christian König
> > <christian.koenig@amd.com> wrote:
> >> Am 01.06.22 um 02:40 schrieb Bas Nieuwenhuizen:
> >>> This should be okay because moves themselves use KERNEL usage and
> >>> hence still sync with BOOKKEEP usage. Then any later submits still
> >>> wait on any pending VM operations.
> >>>
> >>> (i.e. we only made VM ops not wait on BOOKKEEP submits, not the other
> >>>    way around)
> >> Well NAK again. This allows access to freed up memory and is a complete
> >> no-go.
> > How does this allow access to freed memory? Worst I can see is that
> > the unmap happens earlier if the app/drivers gets the waits wrong,
> > which wouldn't give access after the underlying BO is freed?
>
> To free up memory we need to update the PTEs and then flush those out by
> invalidating the TLB.
>
> On gfx6, gfx7 and gfx8 and some broken gfx10 hw invalidating the TLB can
> only be done while the VMID is idle.
>
> Only gfx9 can reliably invalidate the TLB while it is in use and even
> there it comes with quite some performance penalty (a TLB invalidation
> can take multiple seconds).
>
> Because of this what we do in the kernel driver is to sync to everything
> when we unmap entries:
>
>          if (!(flags & AMDGPU_PTE_VALID))
>                  sync_mode = AMDGPU_SYNC_EQ_OWNER;
>          else
>                  sync_mode = AMDGPU_SYNC_EXPLICIT;
>
> This acts as a barrier for freeing the memory. In other words we
> intentionally add a bubble which syncs everything.
>
> I've been working for months on a concept for how to do all this without
> causing the stalls you observe, but so far didn't come to much of a conclusion.

That might cause an unmap operation too early, but for freeing up the
actual backing memory we still wait for all fences on the BO to finish
first, no? In that case, since BOOKKEEP fences are still added for
explicit sync, that should not be a problem, no?

(If not, that sounds like the obvious fix for making this work?)
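
For illustration, the wait I am assuming here is roughly the following
(just a sketch of the semantics, not the actual TTM release path; the
helper name is made up):

    /* Block until *every* fence on the BO, including BOOKKEEP ones,
     * has signaled before the backing store is handed back. */
    static void wait_before_release(struct ttm_buffer_object *bo)
    {
            dma_resv_wait_timeout(bo->base.resv, DMA_RESV_USAGE_BOOKKEEP,
                                  false, MAX_SCHEDULE_TIMEOUT);
    }
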
>
> Regards,
> Christian.
>
> >
> >> Regards,
> >> Christian.
> >>
> >>> Signed-off-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
> >>> ---
> >>>    drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c  | 2 +-
> >>>    drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 2 +-
> >>>    2 files changed, 2 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
> >>> index f10332e1c6c0..31bc73fd1fae 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
> >>> @@ -51,7 +51,7 @@ static int amdgpu_vm_cpu_prepare(struct amdgpu_vm_update_params *p,
> >>>        if (!resv)
> >>>                return 0;
> >>>
> >>> -     return amdgpu_bo_sync_wait_resv(p->adev, resv, sync_mode, sync_mode, p->vm, true);
> >>> +     return amdgpu_bo_sync_wait_resv(p->adev, resv, sync_mode, AMDGPU_SYNC_EXPLICIT, p->vm, true);
> >>>    }
> >>>
> >>>    /**
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> >>> index 63b484dc76c5..c8d5898bea11 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> >>> @@ -75,7 +75,7 @@ static int amdgpu_vm_sdma_prepare(struct amdgpu_vm_update_params *p,
> >>>        if (!resv)
> >>>                return 0;
> >>>
> >>> -     return amdgpu_sync_resv(p->adev, &p->job->sync, resv, sync_mode, sync_mode, p->vm);
> >>> +     return amdgpu_sync_resv(p->adev, &p->job->sync, resv, sync_mode, AMDGPU_SYNC_EXPLICIT, p->vm);
> >>>    }
> >>>
> >>>    /**
>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 3/5] drm/amdgpu: Allow explicit sync for VM ops.
  2022-06-01  8:48         ` Bas Nieuwenhuizen
@ 2022-06-01  8:59           ` Bas Nieuwenhuizen
  2022-06-01  9:01           ` Christian König
  1 sibling, 0 replies; 46+ messages in thread
From: Bas Nieuwenhuizen @ 2022-06-01  8:59 UTC (permalink / raw)
  To: Christian König; +Cc: ML dri-devel


On Wed, Jun 1, 2022, 10:48 Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
wrote:

> On Wed, Jun 1, 2022 at 10:40 AM Christian König
> <christian.koenig@amd.com> wrote:
> >
> > Am 01.06.22 um 10:16 schrieb Bas Nieuwenhuizen:
> > > On Wed, Jun 1, 2022 at 10:03 AM Christian König
> > > <christian.koenig@amd.com> wrote:
> > >> Am 01.06.22 um 02:40 schrieb Bas Nieuwenhuizen:
> > >>> This should be okay because moves themselves use KERNEL usage and
> > >>> hence still sync with BOOKKEEP usage. Then any later submits still
> > >>> wait on any pending VM operations.
> > >>>
> > >>> (i.e. we only made VM ops not wait on BOOKKEEP submits, not the other
> > >>>    way around)
> > >> Well NAK again. This allows access to freed up memory and is a
> complete
> > >> no-go.
> > > How does this allow access to freed memory? Worst I can see is that
> > > the unmap happens earlier if the app/drivers gets the waits wrong,
> > > which wouldn't give access after the underlying BO is freed?
> >
> > To free up memory we need to update the PTEs and then flush those out by
> > invalidating the TLB.
> >
> > On gfx6, gfx7 and gfx8 and some broken gfx10 hw invalidating the TLB can
> > only be done while the VMID is idle.
> >
> > Only gfx9 can reliably invalidate the TLB while it is in use and even
> > there it comes with quite some performance penalty (a TLB invalidation
> > can take multiple seconds).
> >
> > Because of this what we do in the kernel driver is to sync to everything
> > when we unmap entries:
> >
> >          if (!(flags & AMDGPU_PTE_VALID))
> >                  sync_mode = AMDGPU_SYNC_EQ_OWNER;
> >          else
> >                  sync_mode = AMDGPU_SYNC_EXPLICIT;
> >
> > This acts as a barrier for freeing the memory. In other words we
> > intentionally add a bubble which syncs everything.
> >
> > I've been working for months on a concept for how to do all this without
> > causing the stalls you observe, but so far didn't come to much of a conclusion.
>
> That might cause an unmap operation too early, but for freeing up the
> actual backing memory we still wait for all fences on the BO to finish
> first, no? In that case, since BOOKKEEP fences are still added for
> explicit sync, that should not be a problem, no?
>
> (If not, that sounds like the obvious fix for making this work?)
>

As an aside, this is the same hole/issue as when an app forgets a bo in the
bo list on submission.

> >
> > Regards,
> > Christian.
> >
> > >
> > >> Regards,
> > >> Christian.
> > >>
> > >>> Signed-off-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
> > >>> ---
> > >>>    drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c  | 2 +-
> > >>>    drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 2 +-
> > >>>    2 files changed, 2 insertions(+), 2 deletions(-)
> > >>>
> > >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
> > >>> index f10332e1c6c0..31bc73fd1fae 100644
> > >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
> > >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
> > >>> @@ -51,7 +51,7 @@ static int amdgpu_vm_cpu_prepare(struct
> amdgpu_vm_update_params *p,
> > >>>        if (!resv)
> > >>>                return 0;
> > >>>
> > >>> -     return amdgpu_bo_sync_wait_resv(p->adev, resv, sync_mode,
> sync_mode, p->vm, true);
> > >>> +     return amdgpu_bo_sync_wait_resv(p->adev, resv, sync_mode,
> AMDGPU_SYNC_EXPLICIT, p->vm, true);
> > >>>    }
> > >>>
> > >>>    /**
> > >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> > >>> index 63b484dc76c5..c8d5898bea11 100644
> > >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> > >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> > >>> @@ -75,7 +75,7 @@ static int amdgpu_vm_sdma_prepare(struct
> amdgpu_vm_update_params *p,
> > >>>        if (!resv)
> > >>>                return 0;
> > >>>
> > >>> -     return amdgpu_sync_resv(p->adev, &p->job->sync, resv,
> sync_mode, sync_mode, p->vm);
> > >>> +     return amdgpu_sync_resv(p->adev, &p->job->sync, resv,
> sync_mode, AMDGPU_SYNC_EXPLICIT, p->vm);
> > >>>    }
> > >>>
> > >>>    /**
> >
>


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 3/5] drm/amdgpu: Allow explicit sync for VM ops.
  2022-06-01  8:48         ` Bas Nieuwenhuizen
  2022-06-01  8:59           ` Bas Nieuwenhuizen
@ 2022-06-01  9:01           ` Christian König
  2022-06-03  1:21             ` Bas Nieuwenhuizen
  1 sibling, 1 reply; 46+ messages in thread
From: Christian König @ 2022-06-01  9:01 UTC (permalink / raw)
  To: Bas Nieuwenhuizen; +Cc: ML dri-devel

Am 01.06.22 um 10:48 schrieb Bas Nieuwenhuizen:
> On Wed, Jun 1, 2022 at 10:40 AM Christian König
> <christian.koenig@amd.com> wrote:
>> Am 01.06.22 um 10:16 schrieb Bas Nieuwenhuizen:
>>> On Wed, Jun 1, 2022 at 10:03 AM Christian König
>>> <christian.koenig@amd.com> wrote:
>>>> Am 01.06.22 um 02:40 schrieb Bas Nieuwenhuizen:
>>>>> This should be okay because moves themselves use KERNEL usage and
>>>>> hence still sync with BOOKKEEP usage. Then any later submits still
>>>>> wait on any pending VM operations.
>>>>>
>>>>> (i.e. we only made VM ops not wait on BOOKKEEP submits, not the other
>>>>>     way around)
>>>> Well NAK again. This allows access to freed up memory and is a complete
>>>> no-go.
>>> How does this allow access to freed memory? Worst I can see is that
>>> the unmap happens earlier if the app/drivers gets the waits wrong,
>>> which wouldn't give access after the underlying BO is freed?
>> To free up memory we need to update the PTEs and then flush those out by
>> invalidating the TLB.
>>
>> On gfx6, gfx7 and gfx8 and some broken gfx10 hw invalidating the TLB can
>> only be done while the VMID is idle.
>>
>> Only gfx9 can reliably invalidate the TLB while it is in use and even
>> there it comes with quite some performance penalty (a TLB invalidation
>> can take multiple seconds).
>>
>> Because of this what we do in the kernel driver is to sync to everything
>> when we unmap entries:
>>
>>           if (!(flags & AMDGPU_PTE_VALID))
>>                   sync_mode = AMDGPU_SYNC_EQ_OWNER;
>>           else
>>                   sync_mode = AMDGPU_SYNC_EXPLICIT;
>>
>> This acts as a barrier for freeing the memory. In other words we
>> intentionally add a bubble which syncs everything.
>>
>> I've been working for months on a concept for how to do all this without
>> causing the stalls you observe, but so far didn't come to much of a conclusion.
> That might cause an unmap operation too early, but for freeing up the
> actual backing memory we still wait for all fences on the BO to finish
> first, no? In that case, since BOOKKEEP fences are still added for
> explicit sync, that should not be a problem, no?
>
> (If not, that sounds like the obvious fix for making this work?)

The problem is we need to wait on fences *not* added to the buffer object.

E.g. what we currently do here while freeing memory is:
1. Update the PTEs and make that update wait for everything!
2. Add the fence of that update to the freed up BO so that this BO isn't 
freed before the next CS.
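
Spelled out as a rough sketch (the helper name is a placeholder, not the
real call chain):

    /* 1. The PTE clear for the unmapped range syncs to everything
     *    still running on this VM -- that is the bubble. */
    fence = submit_pte_clear(vm, range, AMDGPU_SYNC_EQ_OWNER);

    /* 2. That fence is attached to the freed BO, so its backing store
     *    cannot be released before the clear has actually executed. */
    amdgpu_bo_fence(bo, fence, true);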

We might be able to fix this by adding the fences to the BO before 
freeing it manually, but I'm not 100% sure we can actually allocate 
memory for the fences in that moment.

Regards,
Christian.


>> Regards,
>> Christian.
>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> Signed-off-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
>>>>> ---
>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c  | 2 +-
>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 2 +-
>>>>>     2 files changed, 2 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
>>>>> index f10332e1c6c0..31bc73fd1fae 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
>>>>> @@ -51,7 +51,7 @@ static int amdgpu_vm_cpu_prepare(struct amdgpu_vm_update_params *p,
>>>>>         if (!resv)
>>>>>                 return 0;
>>>>>
>>>>> -     return amdgpu_bo_sync_wait_resv(p->adev, resv, sync_mode, sync_mode, p->vm, true);
>>>>> +     return amdgpu_bo_sync_wait_resv(p->adev, resv, sync_mode, AMDGPU_SYNC_EXPLICIT, p->vm, true);
>>>>>     }
>>>>>
>>>>>     /**
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
>>>>> index 63b484dc76c5..c8d5898bea11 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
>>>>> @@ -75,7 +75,7 @@ static int amdgpu_vm_sdma_prepare(struct amdgpu_vm_update_params *p,
>>>>>         if (!resv)
>>>>>                 return 0;
>>>>>
>>>>> -     return amdgpu_sync_resv(p->adev, &p->job->sync, resv, sync_mode, sync_mode, p->vm);
>>>>> +     return amdgpu_sync_resv(p->adev, &p->job->sync, resv, sync_mode, AMDGPU_SYNC_EXPLICIT, p->vm);
>>>>>     }
>>>>>
>>>>>     /**


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 3/5] drm/amdgpu: Allow explicit sync for VM ops.
  2022-06-01  9:01           ` Christian König
@ 2022-06-03  1:21             ` Bas Nieuwenhuizen
  2022-06-03  8:11               ` Christian König
  0 siblings, 1 reply; 46+ messages in thread
From: Bas Nieuwenhuizen @ 2022-06-03  1:21 UTC (permalink / raw)
  To: Christian König; +Cc: ML dri-devel

On Wed, Jun 1, 2022 at 11:01 AM Christian König
<christian.koenig@amd.com> wrote:
>
> Am 01.06.22 um 10:48 schrieb Bas Nieuwenhuizen:
> > On Wed, Jun 1, 2022 at 10:40 AM Christian König
> > <christian.koenig@amd.com> wrote:
> >> Am 01.06.22 um 10:16 schrieb Bas Nieuwenhuizen:
> >>> On Wed, Jun 1, 2022 at 10:03 AM Christian König
> >>> <christian.koenig@amd.com> wrote:
> >>>> Am 01.06.22 um 02:40 schrieb Bas Nieuwenhuizen:
> >>>>> This should be okay because moves themselves use KERNEL usage and
> >>>>> hence still sync with BOOKKEEP usage. Then any later submits still
> >>>>> wait on any pending VM operations.
> >>>>>
> >>>>> (i.e. we only made VM ops not wait on BOOKKEEP submits, not the other
> >>>>>     way around)
> >>>> Well NAK again. This allows access to freed up memory and is a complete
> >>>> no-go.
> >>> How does this allow access to freed memory? Worst I can see is that
> >>> the unmap happens earlier if the app/drivers gets the waits wrong,
> >>> which wouldn't give access after the underlying BO is freed?
> >> To free up memory we need to update the PTEs and then flush those out by
> >> invalidating the TLB.
> >>
> >> On gfx6, gfx7 and gfx8 and some broken gfx10 hw invalidating the TLB can
> >> only be done while the VMID is idle.
> >>
> >> Only gfx9 can reliably invalidate the TLB while it is in use and even
> >> there it comes with quite some performance penalty (a TLB invalidation
> >> can take multiple seconds).
> >>
> >> Because of this what we do in the kernel driver is to sync to everything
> >> when we unmap entries:
> >>
> >>           if (!(flags & AMDGPU_PTE_VALID))
> >>                   sync_mode = AMDGPU_SYNC_EQ_OWNER;
> >>           else
> >>                   sync_mode = AMDGPU_SYNC_EXPLICIT;
> >>
> >> This acts as a barrier for freeing the memory. In other words we
> >> intentionally add a bubble which syncs everything.
> >>
> >> I've been working for months on a concept for how to do all this without
> >> causing the stalls you observe, but so far didn't come to much of a conclusion.
> > That might cause an unmap operation too early, but for freeing up the
> > actual backing memory we still wait for all fences on the BO to finish
> > first, no? In that case, since BOOKKEEP fences are still added for
> > explicit sync, that should not be a problem, no?
> >
> > (If not, that sounds like the obvious fix for making this work?)
>
> The problem is we need to wait on fences *not* added to the buffer object.

What fences wouldn't be added to the buffer object that we need here?
>
> E.g. what we currently do here while freeing memory is:
> 1. Update the PTEs and make that update wait for everything!
> 2. Add the fence of that update to the freed up BO so that this BO isn't
> freed before the next CS.
>
> We might be able to fix this by adding the fences to the BO before
> freeing it manually, but I'm not 100% sure we can actually allocate
> memory for the fences in that moment.

I think we don't need to be able to. We're already adding the unmap
fence to the BO in the gem close ioctl, and that has the fallback that
if we can't allocate space for the fence in the BO, we wait on the
fence manually on the CPU. I think that is a reasonable fallback for
this as well?
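
i.e. roughly this shape (an illustrative sketch of that fallback, not
the exact amdgpu_gem_object_close() code):

    /* Try to attach the unmap fence to the BO; without memory for the
     * fence slot, block on the CPU instead so the BO still cannot go
     * away before the unmap has finished. */
    r = dma_resv_reserve_fences(bo->tbo.base.resv, 1);
    if (!r)
            amdgpu_bo_fence(bo, fence, true);
    else
            dma_fence_wait(fence, false);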

For the TTM move path amdgpu_copy_buffer will wait on the BO resv, and
then following submissions will trigger VM updates that wait on the
amdgpu_copy_buffer jobs (and hence, transitively, on that work).
AFAICT the amdgpu_bo_move does not trigger any VM updates by
itself, and the amdgpu_bo_move_notify is way after the move (and after
the ttm_bo_move_accel_cleanup which would free the old resource), so
any VM changes triggered by that would see the TTM copy and sync to
it.

I do have to fix some stuff indeed, especially for the GEM close but
with that we should be able to keep the same basic approach?
>
> Regards,
> Christian.
>
>
> >> Regards,
> >> Christian.
> >>
> >>>> Regards,
> >>>> Christian.
> >>>>
> >>>>> Signed-off-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
> >>>>> ---
> >>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c  | 2 +-
> >>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 2 +-
> >>>>>     2 files changed, 2 insertions(+), 2 deletions(-)
> >>>>>
> >>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
> >>>>> index f10332e1c6c0..31bc73fd1fae 100644
> >>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
> >>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
> >>>>> @@ -51,7 +51,7 @@ static int amdgpu_vm_cpu_prepare(struct amdgpu_vm_update_params *p,
> >>>>>         if (!resv)
> >>>>>                 return 0;
> >>>>>
> >>>>> -     return amdgpu_bo_sync_wait_resv(p->adev, resv, sync_mode, sync_mode, p->vm, true);
> >>>>> +     return amdgpu_bo_sync_wait_resv(p->adev, resv, sync_mode, AMDGPU_SYNC_EXPLICIT, p->vm, true);
> >>>>>     }
> >>>>>
> >>>>>     /**
> >>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> >>>>> index 63b484dc76c5..c8d5898bea11 100644
> >>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> >>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> >>>>> @@ -75,7 +75,7 @@ static int amdgpu_vm_sdma_prepare(struct amdgpu_vm_update_params *p,
> >>>>>         if (!resv)
> >>>>>                 return 0;
> >>>>>
> >>>>> -     return amdgpu_sync_resv(p->adev, &p->job->sync, resv, sync_mode, sync_mode, p->vm);
> >>>>> +     return amdgpu_sync_resv(p->adev, &p->job->sync, resv, sync_mode, AMDGPU_SYNC_EXPLICIT, p->vm);
> >>>>>     }
> >>>>>
> >>>>>     /**
>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 3/5] drm/amdgpu: Allow explicit sync for VM ops.
  2022-06-03  1:21             ` Bas Nieuwenhuizen
@ 2022-06-03  8:11               ` Christian König
  2022-06-03 10:08                 ` Bas Nieuwenhuizen
  0 siblings, 1 reply; 46+ messages in thread
From: Christian König @ 2022-06-03  8:11 UTC (permalink / raw)
  To: Bas Nieuwenhuizen; +Cc: ML dri-devel

Am 03.06.22 um 03:21 schrieb Bas Nieuwenhuizen:
> [SNIP]
>> The problem is we need to wait on fences *not* added to the buffer object.
> What fences wouldn't be added to the buffer object that we need here?

Basically all still running submissions from the VM which could 
potentially access the BO.

That's why we have the AMDGPU_SYNC_EQ_OWNER in amdgpu_vm_update_range().
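
Roughly (pieced together from the hunks quoted in this thread, not the
verbatim code): the unmap job syncs against the VM page-table
reservation object rather than the unmapped BO's own resv, so it also
picks up submissions that were never added to that BO:

    if (!(flags & AMDGPU_PTE_VALID))
            sync_mode = AMDGPU_SYNC_EQ_OWNER; /* unmap: wait for the whole VM */
    else
            sync_mode = AMDGPU_SYNC_EXPLICIT;

    amdgpu_sync_resv(adev, &job->sync, vm->root.bo->tbo.base.resv,
                     sync_mode, sync_mode, vm);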

>> E.g. what we currently do here while freeing memory is:
>> 1. Update the PTEs and make that update wait for everything!
>> 2. Add the fence of that update to the freed up BO so that this BO isn't
>> freed before the next CS.
>>
>> We might be able to fix this by adding the fences to the BO before
>> freeing it manually, but I'm not 100% sure we can actually allocate
>> memory for the fences in that moment.
> I think we don't need to be able to. We're already adding the unmap
> fence to the BO in the gem close ioctl, and that has the fallback that
> if we can't allocate space for the fence in the BO, we wait on the
> fence manually on the CPU. I think that is a reasonable fallback for
> this as well?

Yes, just blocking might work in an OOM situation as well.

> For the TTM move path amdgpu_copy_buffer will wait on the BO resv, and
> then following submissions will trigger VM updates that wait on the
> amdgpu_copy_buffer jobs (and hence, transitively, on that work).
> AFAICT the amdgpu_bo_move does not trigger any VM updates by
> itself, and the amdgpu_bo_move_notify is way after the move (and after
> the ttm_bo_move_accel_cleanup which would free the old resource), so
> any VM changes triggered by that would see the TTM copy and sync to
> it.
>
> I do have to fix some stuff indeed, especially for the GEM close but
> with that we should be able to keep the same basic approach?

Nope, not even remotely.

What we need is the following:
1. Rolling out my drm_exec patch set, so that we can lock buffers as needed.
2. When we get a VM operation we not only lock the VM page tables, but 
also all buffers we potentially need to unmap.
3. Nuking the freed list in the amdgpu_vm structure by updating freed 
areas directly when they are unmapped.
4. Tracking those updates inside the bo_va structure for the BO+VM 
combination.
5. When the bo_va structure is destroyed because of closing the handle
move the last clear operation over to the VM as implicit sync.

Only when all this is done we then can resolve the dependency that the 
CS currently must wait for any clear operation on the VM.
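
To make step 2 above a bit more concrete, the locking it asks for is
roughly the following (sketched with the existing ttm_eu helpers and the
usage field from patch 1 purely for illustration, mirroring the
amdgpu_gem_object_close() pattern quoted earlier; the actual plan is to
do this with drm_exec instead):

    static int lock_vm_and_bo(struct amdgpu_vm *vm, struct amdgpu_bo *bo)
    {
            struct ttm_validate_buffer tv;
            struct amdgpu_bo_list_entry vm_pd;
            struct ww_acquire_ctx ticket;
            struct list_head list;
            int r;

            INIT_LIST_HEAD(&list);

            /* The BO whose mapping goes away ... */
            tv.bo = &bo->tbo;
            tv.usage = DMA_RESV_USAGE_READ;
            list_add(&tv.head, &list);

            /* ... plus the page directory / page tables of the VM. */
            amdgpu_vm_get_pd_bo(vm, &list, &vm_pd);

            r = ttm_eu_reserve_buffers(&ticket, &list, true, NULL);
            if (r)
                    return r;

            /* ... update/clear the mappings while everything is locked ... */

            ttm_eu_backoff_reservation(&ticket, &list);
            return 0;
    }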

Regards,
Christian.



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 3/5] drm/amdgpu: Allow explicit sync for VM ops.
  2022-06-03  8:11               ` Christian König
@ 2022-06-03 10:08                 ` Bas Nieuwenhuizen
  2022-06-03 10:16                   ` Christian König
  0 siblings, 1 reply; 46+ messages in thread
From: Bas Nieuwenhuizen @ 2022-06-03 10:08 UTC (permalink / raw)
  To: Christian König; +Cc: ML dri-devel

On Fri, Jun 3, 2022 at 10:11 AM Christian König
<christian.koenig@amd.com> wrote:
>
> Am 03.06.22 um 03:21 schrieb Bas Nieuwenhuizen:
> > [SNIP]
> >> The problem is we need to wait on fences *not* added to the buffer object.
> > What fences wouldn't be added to the buffer object that we need here?
>
> Basically all still running submissions from the VM which could
> potentially access the BO.
>
> That's why we have the AMDGPU_SYNC_EQ_OWNER in amdgpu_vm_update_range().
>
> >> E.g. what we currently do here while freeing memory is:
> >> 1. Update the PTEs and make that update wait for everything!
> >> 2. Add the fence of that update to the freed up BO so that this BO isn't
> >> freed before the next CS.
> >>
> >> We might be able to fix this by adding the fences to the BO before
> >> freeing it manually, but I'm not 100% sure we can actually allocate
> >> memory for the fences in that moment.
> > I think we don't need to be able to. We're already adding the unmap
> > fence to the BO in the gem close ioctl, and that has the fallback that
> > if we can't allocate space for the fence in the BO, we wait on the
> > fence manually on the CPU. I think that is a reasonable fallback for
> > this as well?
>
> Yes, just blocking might work in an OOM situation as well.
>
> > For the TTM move path amdgpu_copy_buffer will wait on the BO resv, and
> > then following submissions will trigger VM updates that wait on the
> > amdgpu_copy_buffer jobs (and hence, transitively, on that work).
> > AFAICT the amdgpu_bo_move does not trigger any VM updates by
> > itself, and the amdgpu_bo_move_notify is way after the move (and after
> > the ttm_bo_move_accel_cleanup which would free the old resource), so
> > any VM changes triggered by that would see the TTM copy and sync to
> > it.
> >
> > I do have to fix some stuff indeed, especially for the GEM close but
> > with that we should be able to keep the same basic approach?
>
> Nope, not even remotely.
>
> What we need is the following:
> 1. Rolling out my drm_exec patch set, so that we can lock buffers as needed.
> 2. When we get a VM operation we not only lock the VM page tables, but
> also all buffers we potentially need to unmap.
> 3. Nuking the freed list in the amdgpu_vm structure by updating freed
> areas directly when they are unmapped.
> 4. Tracking those updates inside the bo_va structure for the BO+VM
> combination.
> 5. When the bo_va structure is destroy because of closing the handle
> move the last clear operation over to the VM as implicit sync.
>

Hi Christian, isn't that a different problem though (that we're also
trying to solve, but in your series)?

What this patch tries to achieve:

(t+0) CS submission setting BOOKKEEP fences (i.e. no implicit sync)
(t+1) a VM operation on a BO/VM accessed by the CS.

to run concurrently. What it *doesn't* try is

(t+0) a VM operation on a BO/VM accessed by the CS.
(t+1) CS submission setting BOOKKEEP fences (i.e. no implicit sync)

to run concurrently. When you write

> Only when all this is done we then can resolve the dependency that the
> CS currently must wait for any clear operation on the VM.

isn't that all about the second problem?


>
> Regards,
> Christian.
>
>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 3/5] drm/amdgpu: Allow explicit sync for VM ops.
  2022-06-03 10:08                 ` Bas Nieuwenhuizen
@ 2022-06-03 10:16                   ` Christian König
  2022-06-03 11:07                     ` Bas Nieuwenhuizen
  0 siblings, 1 reply; 46+ messages in thread
From: Christian König @ 2022-06-03 10:16 UTC (permalink / raw)
  To: Bas Nieuwenhuizen; +Cc: ML dri-devel

Am 03.06.22 um 12:08 schrieb Bas Nieuwenhuizen:
> [SNIP]
>>> I do have to fix some stuff indeed, especially for the GEM close but
>>> with that we should be able to keep the same basic approach?
>> Nope, not even remotely.
>>
>> What we need is the following:
>> 1. Rolling out my drm_exec patch set, so that we can lock buffers as needed.
>> 2. When we get a VM operation we not only lock the VM page tables, but
>> also all buffers we potentially need to unmap.
>> 3. Nuking the freed list in the amdgpu_vm structure by updating freed
>> areas directly when they are unmapped.
>> 4. Tracking those updates inside the bo_va structure for the BO+VM
>> combination.
>> 5. When the bo_va structure is destroyed because of closing the handle
>> move the last clear operation over to the VM as implicit sync.
>>
> Hi Christian, isn't that a different problem though (that we're also
> trying to solve, but in your series)?
>
> What this patch tries to achieve:
>
> (t+0) CS submission setting BOOKKEEP fences (i.e. no implicit sync)
> (t+1) a VM operation on a BO/VM accessed by the CS.
>
> to run concurrently. What it *doesn't* try is
>
> (t+0) a VM operation on a BO/VM accessed by the CS.
> (t+1) CS submission setting BOOKKEEP fences (i.e. no implicit sync)
>
> to run concurrently. When you write
>
>> Only when all this is done we then can resolve the dependency that the
>> CS currently must wait for any clear operation on the VM.
> isn't that all about the second problem?

No, it's the same.

See what we do in the VM code is to artificially insert a bubble so that 
all VM clear operations wait for all CS operations and then use the 
clear fence to indicate when the backing store of the BO can be freed.

When you want to remove this bubble (which is certainly a good idea) you 
need to first come up with a different approach to handle the clear 
operations.

Regards,
Christian.

>
>
>> Regards,
>> Christian.
>>
>>


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 3/5] drm/amdgpu: Allow explicit sync for VM ops.
  2022-06-03 10:16                   ` Christian König
@ 2022-06-03 11:07                     ` Bas Nieuwenhuizen
  2022-06-03 12:08                       ` Christian König
  0 siblings, 1 reply; 46+ messages in thread
From: Bas Nieuwenhuizen @ 2022-06-03 11:07 UTC (permalink / raw)
  To: Christian König; +Cc: ML dri-devel

On Fri, Jun 3, 2022 at 12:16 PM Christian König
<christian.koenig@amd.com> wrote:
>
> Am 03.06.22 um 12:08 schrieb Bas Nieuwenhuizen:
> > [SNIP]
> >>> I do have to fix some stuff indeed, especially for the GEM close but
> >>> with that we should be able to keep the same basic approach?
> >> Nope, not even remotely.
> >>
> >> What we need is the following:
> >> 1. Rolling out my drm_exec patch set, so that we can lock buffers as needed.
> >> 2. When we get a VM operation we not only lock the VM page tables, but
> >> also all buffers we potentially need to unmap.
> >> 3. Nuking the freed list in the amdgpu_vm structure by updating freed
> >> areas directly when they are unmapped.
> >> 4. Tracking those updates inside the bo_va structure for the BO+VM
> >> combination.
> >> 5. When the bo_va structure is destroyed because of closing the handle
> >> move the last clear operation over to the VM as implicit sync.
> >>
> > Hi Christian, isn't that a different problem though (that we're also
> > trying to solve, but in your series)?
> >
> > What this patch tries to achieve:
> >
> > (t+0) CS submission setting BOOKKEEP fences (i.e. no implicit sync)
> > (t+1) a VM operation on a BO/VM accessed by the CS.
> >
> > to run concurrently. What it *doesn't* try is
> >
> > (t+0) a VM operation on a BO/VM accessed by the CS.
> > (t+1) CS submission setting BOOKKEEP fences (i.e. no implicit sync)
> >
> > to run concurrently. When you write
> >
> >> Only when all this is done we then can resolve the dependency that the
> >> CS currently must wait for any clear operation on the VM.
> > isn't that all about the second problem?
>
> No, it's the same.
>
> See what we do in the VM code is to artificially insert a bubble so that
> all VM clear operations wait for all CS operations and then use the
> clear fence to indicate when the backing store of the BO can be freed.

Isn't that remediated with something like the code below? At least the
gem_close case should be handled with this, and the move case was
already handled by the copy operation.


--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
@@ -187,6 +187,39 @@ static int amdgpu_gem_object_open(struct
drm_gem_object *obj,
       return 0;
}

+static void dma_resv_copy(struct dma_resv *src, struct dma_resv *dst)
+{
+       struct dma_resv_iter cursor;
+       struct dma_fence *f;
+       int r;
+       unsigned num_fences = 0;
+
+       if (src == dst)
+               return;
+
+       /* We assume the later loops get the same fences as the caller should
+        * lock the resv. */
+       dma_resv_for_each_fence(&cursor, src, DMA_RESV_USAGE_BOOKKEEP, f) {
+               ++num_fences;
+               dma_fence_put(f);
+       }
+
+       r = dma_resv_reserve_fences(dst, num_fences);
+       if (r) {
+               /* As last resort on OOM we block for the fence */
+               dma_resv_for_each_fence(&cursor, src,
DMA_RESV_USAGE_BOOKKEEP, f) {
+                       dma_fence_wait(f, false);
+                       dma_fence_put(f);
+               }
+       }
+
+       dma_resv_for_each_fence(&cursor, src, DMA_RESV_USAGE_BOOKKEEP, f) {
+               dma_resv_add_fence(dst, f, dma_resv_iter_usage(&cursor));
+               dma_fence_put(f);
+       }
+}
+
+
static void amdgpu_gem_object_close(struct drm_gem_object *obj,
                                   struct drm_file *file_priv)
{
@@ -233,6 +266,8 @@ static void amdgpu_gem_object_close(struct
drm_gem_object *obj,
       amdgpu_bo_fence(bo, fence, true);
       dma_fence_put(fence);

+       dma_resv_copy(vm->root.bo->tbo.base.resv, bo->tbo.base.resv);
+
out_unlock:
       if (unlikely(r < 0))
               dev_err(adev->dev, "failed to clear page "

>
> When you want to remove this bubble (which is certainly a good idea) you
> need to first come up with a different approach to handle the clear
> operations.
>
> Regards,
> Christian.
>
> >
> >
> >> Regards,
> >> Christian.
> >>
> >>
>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 3/5] drm/amdgpu: Allow explicit sync for VM ops.
  2022-06-03 11:07                     ` Bas Nieuwenhuizen
@ 2022-06-03 12:08                       ` Christian König
  2022-06-03 12:39                         ` Bas Nieuwenhuizen
  0 siblings, 1 reply; 46+ messages in thread
From: Christian König @ 2022-06-03 12:08 UTC (permalink / raw)
  To: Bas Nieuwenhuizen; +Cc: ML dri-devel

Am 03.06.22 um 13:07 schrieb Bas Nieuwenhuizen:
> On Fri, Jun 3, 2022 at 12:16 PM Christian König
> <christian.koenig@amd.com> wrote:
>> Am 03.06.22 um 12:08 schrieb Bas Nieuwenhuizen:
>>> [SNIP]
>>>>> I do have to fix some stuff indeed, especially for the GEM close but
>>>>> with that we should be able to keep the same basic approach?
>>>> Nope, not even remotely.
>>>>
>>>> What we need is the following:
>>>> 1. Rolling out my drm_exec patch set, so that we can lock buffers as needed.
>>>> 2. When we get a VM operation we not only lock the VM page tables, but
>>>> also all buffers we potentially need to unmap.
>>>> 3. Nuking the freed list in the amdgpu_vm structure by updating freed
>>>> areas directly when they are unmapped.
>>>> 4. Tracking those updates inside the bo_va structure for the BO+VM
>>>> combination.
>>>> 5. When the bo_va structure is destroyed because of closing the handle
>>>> move the last clear operation over to the VM as implicit sync.
>>>>
>>> Hi Christian, isn't that a different problem though (that we're also
>>> trying to solve, but in your series)?
>>>
>>> What this patch tries to achieve:
>>>
>>> (t+0) CS submission setting BOOKKEEP fences (i.e. no implicit sync)
>>> (t+1) a VM operation on a BO/VM accessed by the CS.
>>>
>>> to run concurrently. What it *doesn't* try is
>>>
>>> (t+0) a VM operation on a BO/VM accessed by the CS.
>>> (t+1) CS submission setting BOOKKEEP fences (i.e. no implicit sync)
>>>
>>> to run concurrently. When you write
>>>
>>>> Only when all this is done we then can resolve the dependency that the
>>>> CS currently must wait for any clear operation on the VM.
>>> isn't that all about the second problem?
>> No, it's the same.
>>
>> See what we do in the VM code is to artificially insert a bubble so that
>> all VM clear operations wait for all CS operations and then use the
>> clear fence to indicate when the backing store of the BO can be freed.
> Isn't that remediated with something like the code below? At least the
> gem_close case should be handled with this, and the move case was
> already handled by the copy operation.

That is one necessary puzzle piece, yes. But you need more than that.

Especially the explicit unmap operation needs to be converted into an 
implicit unmap to get the TLB flush right.

I think I know all the necessary steps now, it's just tons of work to do.

Regards,
Christian.

>
>
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
> @@ -187,6 +187,39 @@ static int amdgpu_gem_object_open(struct
> drm_gem_object *obj,
>         return 0;
> }
>
> +static void dma_resv_copy(struct dma_resv *src, struct dma_resv *dst)
> +{
> +       struct dma_resv_iter cursor;
> +       struct dma_fence *f;
> +       int r;
> +       unsigned num_fences = 0;
> +
> +       if (src == dst)
> +               return;
> +
> +       /* We assume the later loops get the same fences as the caller should
> +        * lock the resv. */
> +       dma_resv_for_each_fence(&cursor, src, DMA_RESV_USAGE_BOOKKEEP, f) {
> +               ++num_fences;
> +               dma_fence_put(f);
> +       }
> +
> +       r = dma_resv_reserve_fences(dst, num_fences);
> +       if (r) {
> +               /* As last resort on OOM we block for the fence */
> +               dma_resv_for_each_fence(&cursor, src,
> DMA_RESV_USAGE_BOOKKEEP, f) {
> +                       dma_fence_wait(f, false);
> +                       dma_fence_put(f);
> +               }
> +       }
> +
> +       dma_resv_for_each_fence(&cursor, src, DMA_RESV_USAGE_BOOKKEEP, f) {
> +               dma_resv_add_fence(dst, f, dma_resv_iter_usage(&cursor));
> +               dma_fence_put(f);
> +       }
> +}
> +
> +
> static void amdgpu_gem_object_close(struct drm_gem_object *obj,
>                                     struct drm_file *file_priv)
> {
> @@ -233,6 +266,8 @@ static void amdgpu_gem_object_close(struct
> drm_gem_object *obj,
>         amdgpu_bo_fence(bo, fence, true);
>         dma_fence_put(fence);
>
> +       dma_resv_copy(vm->root.bo->tbo.base.resv, bo->tbo.base.resv);
> +
> out_unlock:
>         if (unlikely(r < 0))
>                 dev_err(adev->dev, "failed to clear page "
>
>> When you want to remove this bubble (which is certainly a good idea) you
>> need to first come up with a different approach to handle the clear
>> operations.
>>
>> Regards,
>> Christian.
>>
>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 3/5] drm/amdgpu: Allow explicit sync for VM ops.
  2022-06-03 12:08                       ` Christian König
@ 2022-06-03 12:39                         ` Bas Nieuwenhuizen
  2022-06-03 12:49                           ` Christian König
  0 siblings, 1 reply; 46+ messages in thread
From: Bas Nieuwenhuizen @ 2022-06-03 12:39 UTC (permalink / raw)
  To: Christian König; +Cc: ML dri-devel

On Fri, Jun 3, 2022 at 2:08 PM Christian König <christian.koenig@amd.com> wrote:
>
> Am 03.06.22 um 13:07 schrieb Bas Nieuwenhuizen:
> > On Fri, Jun 3, 2022 at 12:16 PM Christian König
> > <christian.koenig@amd.com> wrote:
> >> Am 03.06.22 um 12:08 schrieb Bas Nieuwenhuizen:
> >>> [SNIP]
> >>>>> I do have to fix some stuff indeed, especially for the GEM close but
> >>>>> with that we should be able to keep the same basic approach?
> >>>> Nope, not even remotely.
> >>>>
> >>>> What we need is the following:
> >>>> 1. Rolling out my drm_exec patch set, so that we can lock buffers as needed.
> >>>> 2. When we get a VM operation we not only lock the VM page tables, but
> >>>> also all buffers we potentially need to unmap.
> >>>> 3. Nuking the freed list in the amdgpu_vm structure by updating freed
> >>>> areas directly when they are unmapped.
> >>>> 4. Tracking those updates inside the bo_va structure for the BO+VM
> >>>> combination.
> >>>> 5. When the bo_va structure is destroy because of closing the handle
> >>>> move the last clear operation over to the VM as implicit sync.
> >>>>
> >>> Hi Christian, isn't that a different problem though (that we're also
> >>> trying to solve, but in your series)?
> >>>
> >>> What this patch tries to achieve:
> >>>
> >>> (t+0) CS submission setting BOOKKEEP fences (i.e. no implicit sync)
> >>> (t+1) a VM operation on a BO/VM accessed by the CS.
> >>>
> >>> to run concurrently. What it *doesn't* try is
> >>>
> >>> (t+0) a VM operation on a BO/VM accessed by the CS.
> >>> (t+1) CS submission setting BOOKKEEP fences (i.e. no implicit sync)
> >>>
> >>> to run concurrently. When you write
> >>>
> >>>> Only when all this is done we then can resolve the dependency that the
> >>>> CS currently must wait for any clear operation on the VM.
> >>> isn't that all about the second problem?
> >> No, it's the same.
> >>
> >> See what we do in the VM code is to artificially insert a bubble so that
> >> all VM clear operations wait for all CS operations and then use the
> >> clear fence to indicate when the backing store of the BO can be freed.
> > Isn't that remediated with something like the code below? At least the
> > gem_close case should be handled with this, and the move case was
> > already handled by the copy operation.
>
> That is one necessary puzzle piece, yes. But you need more than that.
>
> Especially the explicit unmap operation needs to be converted into an
> implicit unmap to get the TLB flush right.

This doesn't change anything about the TLB flush though, since all
unmap -> later-job dependencies are still implicit.

So the worst that could happen (if e.g. userspace gets the
waits/dependencies wrong) is

1) non-implicit CS gets submitted that touches a BO
2)  VM unmap on that BO happens
2.5) the CS from 1 is still active due to missing dependencies
2.6) but any CS submission after 2 will trigger a TLB flush
3) A TLB flush happens for a new CS
4) All CS submissions here see the TLB flush and hence the unmap

So the main problem would be the CS from step 1, but (a) if that
VM faults, that is the app's own fault, and (b) because we don't free
the memory until (1) finishes, it is not a security issue kernel-wise.

>
> I think I know all the necessary steps now, it's just tons of work to do.
>
> Regards,
> Christian.
>
> >
> >
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
> > @@ -187,6 +187,39 @@ static int amdgpu_gem_object_open(struct
> > drm_gem_object *obj,
> >         return 0;
> > }
> >
> > +static void dma_resv_copy(struct dma_resv *src, struct dma_resv *dst)
> > +{
> > +       struct dma_resv_iter cursor;
> > +       struct dma_fence *f;
> > +       int r;
> > +       unsigned num_fences = 0;
> > +
> > +       if (src == dst)
> > +               return;
> > +
> > +       /* We assume the later loops get the same fences as the caller should
> > +        * lock the resv. */
> > +       dma_resv_for_each_fence(&cursor, src, DMA_RESV_USAGE_BOOKKEEP, f) {
> > +               ++num_fences;
> > +               dma_fence_put(f);
> > +       }
> > +
> > +       r = dma_resv_reserve_fences(dst, num_fences);
> > +       if (r) {
> > +               /* As last resort on OOM we block for the fence */
> > +               dma_resv_for_each_fence(&cursor, src,
> > DMA_RESV_USAGE_BOOKKEEP, f) {
> > +                       dma_fence_wait(f, false);
> > +                       dma_fence_put(f);
> > +               }
> > +       }
> > +
> > +       dma_resv_for_each_fence(&cursor, src, DMA_RESV_USAGE_BOOKKEEP, f) {
> > +               dma_resv_add_fence(dst, f, dma_resv_iter_usage(&cursor));
> > +               dma_fence_put(f);
> > +       }
> > +}
> > +
> > +
> > static void amdgpu_gem_object_close(struct drm_gem_object *obj,
> >                                     struct drm_file *file_priv)
> > {
> > @@ -233,6 +266,8 @@ static void amdgpu_gem_object_close(struct
> > drm_gem_object *obj,
> >         amdgpu_bo_fence(bo, fence, true);
> >         dma_fence_put(fence);
> >
> > +       dma_resv_copy(vm->root.bo->tbo.base.resv, bo->tbo.base.resv);
> > +
> > out_unlock:
> >         if (unlikely(r < 0))
> >                 dev_err(adev->dev, "failed to clear page "
> >
> >> When you want to remove this bubble (which is certainly a good idea) you
> >> need to first come up with a different approach to handle the clear
> >> operations.
> >>
> >> Regards,
> >> Christian.
> >>
> >>>
> >>>> Regards,
> >>>> Christian.
> >>>>
> >>>>
>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 3/5] drm/amdgpu: Allow explicit sync for VM ops.
  2022-06-03 12:39                         ` Bas Nieuwenhuizen
@ 2022-06-03 12:49                           ` Christian König
  2022-06-03 13:23                             ` Bas Nieuwenhuizen
  0 siblings, 1 reply; 46+ messages in thread
From: Christian König @ 2022-06-03 12:49 UTC (permalink / raw)
  To: Bas Nieuwenhuizen; +Cc: ML dri-devel

On 03.06.22 at 14:39, Bas Nieuwenhuizen wrote:
> On Fri, Jun 3, 2022 at 2:08 PM Christian König <christian.koenig@amd.com> wrote:
>> Am 03.06.22 um 13:07 schrieb Bas Nieuwenhuizen:
>>> On Fri, Jun 3, 2022 at 12:16 PM Christian König
>>> <christian.koenig@amd.com> wrote:
>>>> Am 03.06.22 um 12:08 schrieb Bas Nieuwenhuizen:
>>>>> [SNIP]
>>>>>>> I do have to fix some stuff indeed, especially for the GEM close but
>>>>>>> with that we should be able to keep the same basic approach?
>>>>>> Nope, not even remotely.
>>>>>>
>>>>>> What we need is the following:
>>>>>> 1. Rolling out my drm_exec patch set, so that we can lock buffers as needed.
>>>>>> 2. When we get a VM operation we not only lock the VM page tables, but
>>>>>> also all buffers we potentially need to unmap.
>>>>>> 3. Nuking the freed list in the amdgpu_vm structure by updating freed
>>>>>> areas directly when they are unmapped.
>>>>>> 4. Tracking those updates inside the bo_va structure for the BO+VM
>>>>>> combination.
>>>>>> 5. When the bo_va structure is destroy because of closing the handle
>>>>>> move the last clear operation over to the VM as implicit sync.
>>>>>>
>>>>> Hi Christian, isn't that a different problem though (that we're also
>>>>> trying to solve, but in your series)?
>>>>>
>>>>> What this patch tries to achieve:
>>>>>
>>>>> (t+0) CS submission setting BOOKKEEP fences (i.e. no implicit sync)
>>>>> (t+1) a VM operation on a BO/VM accessed by the CS.
>>>>>
>>>>> to run concurrently. What it *doesn't* try is
>>>>>
>>>>> (t+0) a VM operation on a BO/VM accessed by the CS.
>>>>> (t+1) CS submission setting BOOKKEEP fences (i.e. no implicit sync)
>>>>>
>>>>> to run concurrently. When you write
>>>>>
>>>>>> Only when all this is done we then can resolve the dependency that the
>>>>>> CS currently must wait for any clear operation on the VM.
>>>>> isn't that all about the second problem?
>>>> No, it's the same.
>>>>
>>>> See what we do in the VM code is to artificially insert a bubble so that
>>>> all VM clear operations wait for all CS operations and then use the
>>>> clear fence to indicate when the backing store of the BO can be freed.
>>> Isn't that remediated with something like the code below? At least the
>>> gem_close case should be handled with this, and the move case was
>>> already handled by the copy operation.
>> That is one necessary puzzle piece, yes. But you need more than that.
>>
>> Especially the explicit unmap operation needs to be converted into an
>> implicit unmap to get the TLB flush right.
> This doesn't change anything about the TLB flush though? Since all
> unmap -> later jobs dependencies are still implicit.
>
> So the worst what could happen (i.f. e.g. userspace gets the
> waits/dependencies wrong) is
>
> 1) non-implicit CS gets submitted that touches a BO
> 2)  VM unmap on that BO happens
> 2.5) the CS from 1 is still active due to missing dependencies
> 2.6) but any CS submission after 2 will trigger a TLB flush

Yeah, but that's exactly the bubble we try to avoid. Isn't it?

When we want to do a TLB flush the unmap operation must already be 
completed. Otherwise the flush is rather pointless since any access 
could reload the not yet updated PTEs.

And this means that we need to artificially add a dependency on every 
command submission after 2 to wait until the unmap operation is completed.

Christian.

> 3) A TLB flush happens for a new CS
> 4) All CS submissions here see the TLB flush and hence the unmap
>
> So the main problem would be the CS from step 1, but (a) if that
> VMFaults that is the apps own fault and (b) because we don't free the
> memory until (1) finishes it is not a security issue kernel-wise.
>
>> I think I know all the necessary steps now, it's just tons of work to do.
>>
>> Regards,
>> Christian.
>>
>>>
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>> @@ -187,6 +187,39 @@ static int amdgpu_gem_object_open(struct
>>> drm_gem_object *obj,
>>>          return 0;
>>> }
>>>
>>> +static void dma_resv_copy(struct dma_resv *src, struct dma_resv *dst)
>>> +{
>>> +       struct dma_resv_iter cursor;
>>> +       struct dma_fence *f;
>>> +       int r;
>>> +       unsigned num_fences = 0;
>>> +
>>> +       if (src == dst)
>>> +               return;
>>> +
>>> +       /* We assume the later loops get the same fences as the caller should
>>> +        * lock the resv. */
>>> +       dma_resv_for_each_fence(&cursor, src, DMA_RESV_USAGE_BOOKKEEP, f) {
>>> +               ++num_fences;
>>> +               dma_fence_put(f);
>>> +       }
>>> +
>>> +       r = dma_resv_reserve_fences(dst, num_fences);
>>> +       if (r) {
>>> +               /* As last resort on OOM we block for the fence */
>>> +               dma_resv_for_each_fence(&cursor, src,
>>> DMA_RESV_USAGE_BOOKKEEP, f) {
>>> +                       dma_fence_wait(f, false);
>>> +                       dma_fence_put(f);
>>> +               }
>>> +       }
>>> +
>>> +       dma_resv_for_each_fence(&cursor, src, DMA_RESV_USAGE_BOOKKEEP, f) {
>>> +               dma_resv_add_fence(dst, f, dma_resv_iter_usage(&cursor));
>>> +               dma_fence_put(f);
>>> +       }
>>> +}
>>> +
>>> +
>>> static void amdgpu_gem_object_close(struct drm_gem_object *obj,
>>>                                      struct drm_file *file_priv)
>>> {
>>> @@ -233,6 +266,8 @@ static void amdgpu_gem_object_close(struct
>>> drm_gem_object *obj,
>>>          amdgpu_bo_fence(bo, fence, true);
>>>          dma_fence_put(fence);
>>>
>>> +       dma_resv_copy(vm->root.bo->tbo.base.resv, bo->tbo.base.resv);
>>> +
>>> out_unlock:
>>>          if (unlikely(r < 0))
>>>                  dev_err(adev->dev, "failed to clear page "
>>>
>>>> When you want to remove this bubble (which is certainly a good idea) you
>>>> need to first come up with a different approach to handle the clear
>>>> operations.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>>


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 3/5] drm/amdgpu: Allow explicit sync for VM ops.
  2022-06-03 12:49                           ` Christian König
@ 2022-06-03 13:23                             ` Bas Nieuwenhuizen
  2022-06-03 17:41                               ` Christian König
  0 siblings, 1 reply; 46+ messages in thread
From: Bas Nieuwenhuizen @ 2022-06-03 13:23 UTC (permalink / raw)
  To: Christian König; +Cc: ML dri-devel

On Fri, Jun 3, 2022 at 2:49 PM Christian König <christian.koenig@amd.com> wrote:
>
> Am 03.06.22 um 14:39 schrieb Bas Nieuwenhuizen:
> > On Fri, Jun 3, 2022 at 2:08 PM Christian König <christian.koenig@amd.com> wrote:
> >> Am 03.06.22 um 13:07 schrieb Bas Nieuwenhuizen:
> >>> On Fri, Jun 3, 2022 at 12:16 PM Christian König
> >>> <christian.koenig@amd.com> wrote:
> >>>> Am 03.06.22 um 12:08 schrieb Bas Nieuwenhuizen:
> >>>>> [SNIP]
> >>>>>>> I do have to fix some stuff indeed, especially for the GEM close but
> >>>>>>> with that we should be able to keep the same basic approach?
> >>>>>> Nope, not even remotely.
> >>>>>>
> >>>>>> What we need is the following:
> >>>>>> 1. Rolling out my drm_exec patch set, so that we can lock buffers as needed.
> >>>>>> 2. When we get a VM operation we not only lock the VM page tables, but
> >>>>>> also all buffers we potentially need to unmap.
> >>>>>> 3. Nuking the freed list in the amdgpu_vm structure by updating freed
> >>>>>> areas directly when they are unmapped.
> >>>>>> 4. Tracking those updates inside the bo_va structure for the BO+VM
> >>>>>> combination.
> >>>>>> 5. When the bo_va structure is destroy because of closing the handle
> >>>>>> move the last clear operation over to the VM as implicit sync.
> >>>>>>
> >>>>> Hi Christian, isn't that a different problem though (that we're also
> >>>>> trying to solve, but in your series)?
> >>>>>
> >>>>> What this patch tries to achieve:
> >>>>>
> >>>>> (t+0) CS submission setting BOOKKEEP fences (i.e. no implicit sync)
> >>>>> (t+1) a VM operation on a BO/VM accessed by the CS.
> >>>>>
> >>>>> to run concurrently. What it *doesn't* try is
> >>>>>
> >>>>> (t+0) a VM operation on a BO/VM accessed by the CS.
> >>>>> (t+1) CS submission setting BOOKKEEP fences (i.e. no implicit sync)
> >>>>>
> >>>>> to run concurrently. When you write
> >>>>>
> >>>>>> Only when all this is done we then can resolve the dependency that the
> >>>>>> CS currently must wait for any clear operation on the VM.
> >>>>> isn't that all about the second problem?
> >>>> No, it's the same.
> >>>>
> >>>> See what we do in the VM code is to artificially insert a bubble so that
> >>>> all VM clear operations wait for all CS operations and then use the
> >>>> clear fence to indicate when the backing store of the BO can be freed.
> >>> Isn't that remediated with something like the code below? At least the
> >>> gem_close case should be handled with this, and the move case was
> >>> already handled by the copy operation.
> >> That is one necessary puzzle piece, yes. But you need more than that.
> >>
> >> Especially the explicit unmap operation needs to be converted into an
> >> implicit unmap to get the TLB flush right.
> > This doesn't change anything about the TLB flush though? Since all
> > unmap -> later jobs dependencies are still implicit.
> >
> > So the worst what could happen (i.f. e.g. userspace gets the
> > waits/dependencies wrong) is
> >
> > 1) non-implicit CS gets submitted that touches a BO
> > 2)  VM unmap on that BO happens
> > 2.5) the CS from 1 is still active due to missing dependencies
> > 2.6) but any CS submission after 2 will trigger a TLB flush
>
> Yeah, but that's exactly the bubble we try to avoid. Isn't it?

For this series, not really.  To clarify, there are two sides to
getting GPU bubbles and no overlap:

(1) VM operations implicitly wait for earlier CS submissions
(2) CS submissions implicitly wait for earlier VM operations

Together, these combine to ensure that you get a (potentially small)
bubble any time VM work happens.

Your series (and further ideas) tackles (2), and is a worthwhile thing
to do. However, while writing the userspace for this I noticed this
isn't enough to get rid of all our GPU bubbles. In particular, when
doing a non-sparse map of a new BO, the next CS tends to need to wait
on it anyway for API semantics. Because VM operations happen on a
single timeline, this high-priority map can end up being blocked by
earlier sparse maps, and hence the bubble in that case still exists.

So in this series I try to tackle (1) instead. Since GPU work
typically lags behind CPU submissions and VM operations aren't that
slow, we can typically execute VM operations early enough that any
implicit syncs from (2) are less/no issue. In particular, by doing all
dependency waits in userspace, we can make almost all VM operations
start pretty much immediately (with a bunch of exceptions, like other
VM work that takes time, radeonsi still submitting implicitly synced
stuff etc.).

So I think (2) is valuable, just not what this series tries to focus
on or touch at all.

(And then the cherry on top would be having two timelines for VM
operations, a high priority one for non-sparse bindings and a low
priority one for sparse bindings, but that is very complex and not
super high value on top of eliminating (1) + (2), so I'd punt that for
"maybe later". See e.g. the discussion wrt Intel at
https://patchwork.freedesktop.org/patch/486604/#comment_879193)
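
For illustration, the userspace side of doing the dependency waits
before submitting the VM operation, as described above, is roughly the
following pattern. This is only a minimal sketch using the public
libdrm entry points; the helper itself is made up and is not the actual
radv code:

    #include <stdint.h>
    #include <xf86drm.h>
    #include <amdgpu.h>
    #include <amdgpu_drm.h>

    /* Wait for the app-provided dependency on a helper thread, then issue
     * the VA map, so the kernel never has to insert the wait itself. */
    static int map_after_wait(int fd, uint32_t wait_syncobj,
                              amdgpu_bo_handle bo, uint64_t size, uint64_t va)
    {
        int r = drmSyncobjWait(fd, &wait_syncobj, 1, INT64_MAX,
                               DRM_SYNCOBJ_WAIT_FLAGS_WAIT_ALL, NULL);
        if (r)
            return r;

        return amdgpu_bo_va_op(bo, 0, size, va,
                               AMDGPU_VM_PAGE_READABLE |
                               AMDGPU_VM_PAGE_WRITEABLE |
                               AMDGPU_VM_PAGE_EXECUTABLE,
                               AMDGPU_VA_OP_MAP);
    }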

>
> When we want to do a TLB flush the unmap operation must already be
> completed. Otherwise the flush is rather pointless since any access
> could reloads the not yet updated PTEs.
>
> And this means that we need to artificially add a dependency on every
> command submission after 2 to wait until the unmap operation is completed.
>
> Christian.
>
> > 3) A TLB flush happens for a new CS
> > 4) All CS submissions here see the TLB flush and hence the unmap
> >
> > So the main problem would be the CS from step 1, but (a) if that
> > VMFaults that is the apps own fault and (b) because we don't free the
> > memory until (1) finishes it is not a security issue kernel-wise.
> >
> >> I think I know all the necessary steps now, it's just tons of work to do.
> >>
> >> Regards,
> >> Christian.
> >>
> >>>
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
> >>> @@ -187,6 +187,39 @@ static int amdgpu_gem_object_open(struct
> >>> drm_gem_object *obj,
> >>>          return 0;
> >>> }
> >>>
> >>> +static void dma_resv_copy(struct dma_resv *src, struct dma_resv *dst)
> >>> +{
> >>> +       struct dma_resv_iter cursor;
> >>> +       struct dma_fence *f;
> >>> +       int r;
> >>> +       unsigned num_fences = 0;
> >>> +
> >>> +       if (src == dst)
> >>> +               return;
> >>> +
> >>> +       /* We assume the later loops get the same fences as the caller should
> >>> +        * lock the resv. */
> >>> +       dma_resv_for_each_fence(&cursor, src, DMA_RESV_USAGE_BOOKKEEP, f) {
> >>> +               ++num_fences;
> >>> +               dma_fence_put(f);
> >>> +       }
> >>> +
> >>> +       r = dma_resv_reserve_fences(dst, num_fences);
> >>> +       if (r) {
> >>> +               /* As last resort on OOM we block for the fence */
> >>> +               dma_resv_for_each_fence(&cursor, src,
> >>> DMA_RESV_USAGE_BOOKKEEP, f) {
> >>> +                       dma_fence_wait(f, false);
> >>> +                       dma_fence_put(f);
> >>> +               }
> >>> +       }
> >>> +
> >>> +       dma_resv_for_each_fence(&cursor, src, DMA_RESV_USAGE_BOOKKEEP, f) {
> >>> +               dma_resv_add_fence(dst, f, dma_resv_iter_usage(&cursor));
> >>> +               dma_fence_put(f);
> >>> +       }
> >>> +}
> >>> +
> >>> +
> >>> static void amdgpu_gem_object_close(struct drm_gem_object *obj,
> >>>                                      struct drm_file *file_priv)
> >>> {
> >>> @@ -233,6 +266,8 @@ static void amdgpu_gem_object_close(struct
> >>> drm_gem_object *obj,
> >>>          amdgpu_bo_fence(bo, fence, true);
> >>>          dma_fence_put(fence);
> >>>
> >>> +       dma_resv_copy(vm->root.bo->tbo.base.resv, bo->tbo.base.resv);
> >>> +
> >>> out_unlock:
> >>>          if (unlikely(r < 0))
> >>>                  dev_err(adev->dev, "failed to clear page "
> >>>
> >>>> When you want to remove this bubble (which is certainly a good idea) you
> >>>> need to first come up with a different approach to handle the clear
> >>>> operations.
> >>>>
> >>>> Regards,
> >>>> Christian.
> >>>>
> >>>>>> Regards,
> >>>>>> Christian.
> >>>>>>
> >>>>>>
>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 3/5] drm/amdgpu: Allow explicit sync for VM ops.
  2022-06-03 13:23                             ` Bas Nieuwenhuizen
@ 2022-06-03 17:41                               ` Christian König
  2022-06-03 17:50                                 ` Bas Nieuwenhuizen
  0 siblings, 1 reply; 46+ messages in thread
From: Christian König @ 2022-06-03 17:41 UTC (permalink / raw)
  To: Bas Nieuwenhuizen; +Cc: ML dri-devel

On 03.06.22 at 15:23, Bas Nieuwenhuizen wrote:
> On Fri, Jun 3, 2022 at 2:49 PM Christian König <christian.koenig@amd.com> wrote:
>> Am 03.06.22 um 14:39 schrieb Bas Nieuwenhuizen:
>>> On Fri, Jun 3, 2022 at 2:08 PM Christian König <christian.koenig@amd.com> wrote:
>>>> Am 03.06.22 um 13:07 schrieb Bas Nieuwenhuizen:
>>>>> On Fri, Jun 3, 2022 at 12:16 PM Christian König
>>>>> <christian.koenig@amd.com> wrote:
>>>>>> Am 03.06.22 um 12:08 schrieb Bas Nieuwenhuizen:
>>>>>>> [SNIP]
>>>>>>>>> I do have to fix some stuff indeed, especially for the GEM close but
>>>>>>>>> with that we should be able to keep the same basic approach?
>>>>>>>> Nope, not even remotely.
>>>>>>>>
>>>>>>>> What we need is the following:
>>>>>>>> 1. Rolling out my drm_exec patch set, so that we can lock buffers as needed.
>>>>>>>> 2. When we get a VM operation we not only lock the VM page tables, but
>>>>>>>> also all buffers we potentially need to unmap.
>>>>>>>> 3. Nuking the freed list in the amdgpu_vm structure by updating freed
>>>>>>>> areas directly when they are unmapped.
>>>>>>>> 4. Tracking those updates inside the bo_va structure for the BO+VM
>>>>>>>> combination.
>>>>>>>> 5. When the bo_va structure is destroy because of closing the handle
>>>>>>>> move the last clear operation over to the VM as implicit sync.
>>>>>>>>
>>>>>>> Hi Christian, isn't that a different problem though (that we're also
>>>>>>> trying to solve, but in your series)?
>>>>>>>
>>>>>>> What this patch tries to achieve:
>>>>>>>
>>>>>>> (t+0) CS submission setting BOOKKEEP fences (i.e. no implicit sync)
>>>>>>> (t+1) a VM operation on a BO/VM accessed by the CS.
>>>>>>>
>>>>>>> to run concurrently. What it *doesn't* try is
>>>>>>>
>>>>>>> (t+0) a VM operation on a BO/VM accessed by the CS.
>>>>>>> (t+1) CS submission setting BOOKKEEP fences (i.e. no implicit sync)
>>>>>>>
>>>>>>> to run concurrently. When you write
>>>>>>>
>>>>>>>> Only when all this is done we then can resolve the dependency that the
>>>>>>>> CS currently must wait for any clear operation on the VM.
>>>>>>> isn't that all about the second problem?
>>>>>> No, it's the same.
>>>>>>
>>>>>> See what we do in the VM code is to artificially insert a bubble so that
>>>>>> all VM clear operations wait for all CS operations and then use the
>>>>>> clear fence to indicate when the backing store of the BO can be freed.
>>>>> Isn't that remediated with something like the code below? At least the
>>>>> gem_close case should be handled with this, and the move case was
>>>>> already handled by the copy operation.
>>>> That is one necessary puzzle piece, yes. But you need more than that.
>>>>
>>>> Especially the explicit unmap operation needs to be converted into an
>>>> implicit unmap to get the TLB flush right.
>>> This doesn't change anything about the TLB flush though? Since all
>>> unmap -> later jobs dependencies are still implicit.
>>>
>>> So the worst what could happen (i.f. e.g. userspace gets the
>>> waits/dependencies wrong) is
>>>
>>> 1) non-implicit CS gets submitted that touches a BO
>>> 2)  VM unmap on that BO happens
>>> 2.5) the CS from 1 is still active due to missing dependencies
>>> 2.6) but any CS submission after 2 will trigger a TLB flush
>> Yeah, but that's exactly the bubble we try to avoid. Isn't it?
> For this series, not really.  To clarify there are two sides for
> getting GPU bubbles and no overlap:
>
> (1) VM operations implicitly wait for earlier CS submissions
> (2) CS submissions implicitly wait for earlier VM operations
>
> Together, these combine to ensure that you get a (potentially small)
> bubble any time VM work happens.
>
> Your series (and further ideas) tackles (2), and is a worthwhile thing
> to do. However, while writing the userspace for this I noticed this
> isn't enough to get rid of all our GPU bubbles. In particular when
> doing a non-sparse map of a new BO, that tends to need to be waited on
> for the next CS anyway for API semantics. Due to VM operations
> happening on a single timeline that means this high priority map can
> end up being blocked by earlier sparse maps and hence the bubble in
> that case still exists.
>
> So in this series I try to tackle (1) instead. Since GPU work
> typically lags behind CPU submissions and VM operations aren't that
> slow, we can typically execute VM operations early enough that any
> implicit syncs from (2) are less/no issue.

Ok, once more since you don't seem to understand what I want to say: It 
isn't possible to fix #1 before you have fixed #2.

The VM unmap operation here is a barrier which divides the CS operations 
into a before and after. This is intentional design.

To get rid of this barrier you must first fix the part where CS 
submissions wait for the VM operation to complete, i.e. the necessity of 
the barrier.

I've been working on this for a couple of years now and I'm really 
running out of ideas for how to explain this restriction.

Regards,
Christian.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 3/5] drm/amdgpu: Allow explicit sync for VM ops.
  2022-06-03 17:41                               ` Christian König
@ 2022-06-03 17:50                                 ` Bas Nieuwenhuizen
  2022-06-03 18:41                                   ` Christian König
  0 siblings, 1 reply; 46+ messages in thread
From: Bas Nieuwenhuizen @ 2022-06-03 17:50 UTC (permalink / raw)
  To: Christian König; +Cc: ML dri-devel

On Fri, Jun 3, 2022 at 7:42 PM Christian König <christian.koenig@amd.com> wrote:
>
> Am 03.06.22 um 15:23 schrieb Bas Nieuwenhuizen:
> > On Fri, Jun 3, 2022 at 2:49 PM Christian König <christian.koenig@amd.com> wrote:
> >> Am 03.06.22 um 14:39 schrieb Bas Nieuwenhuizen:
> >>> On Fri, Jun 3, 2022 at 2:08 PM Christian König <christian.koenig@amd.com> wrote:
> >>>> Am 03.06.22 um 13:07 schrieb Bas Nieuwenhuizen:
> >>>>> On Fri, Jun 3, 2022 at 12:16 PM Christian König
> >>>>> <christian.koenig@amd.com> wrote:
> >>>>>> Am 03.06.22 um 12:08 schrieb Bas Nieuwenhuizen:
> >>>>>>> [SNIP]
> >>>>>>>>> I do have to fix some stuff indeed, especially for the GEM close but
> >>>>>>>>> with that we should be able to keep the same basic approach?
> >>>>>>>> Nope, not even remotely.
> >>>>>>>>
> >>>>>>>> What we need is the following:
> >>>>>>>> 1. Rolling out my drm_exec patch set, so that we can lock buffers as needed.
> >>>>>>>> 2. When we get a VM operation we not only lock the VM page tables, but
> >>>>>>>> also all buffers we potentially need to unmap.
> >>>>>>>> 3. Nuking the freed list in the amdgpu_vm structure by updating freed
> >>>>>>>> areas directly when they are unmapped.
> >>>>>>>> 4. Tracking those updates inside the bo_va structure for the BO+VM
> >>>>>>>> combination.
> >>>>>>>> 5. When the bo_va structure is destroy because of closing the handle
> >>>>>>>> move the last clear operation over to the VM as implicit sync.
> >>>>>>>>
> >>>>>>> Hi Christian, isn't that a different problem though (that we're also
> >>>>>>> trying to solve, but in your series)?
> >>>>>>>
> >>>>>>> What this patch tries to achieve:
> >>>>>>>
> >>>>>>> (t+0) CS submission setting BOOKKEEP fences (i.e. no implicit sync)
> >>>>>>> (t+1) a VM operation on a BO/VM accessed by the CS.
> >>>>>>>
> >>>>>>> to run concurrently. What it *doesn't* try is
> >>>>>>>
> >>>>>>> (t+0) a VM operation on a BO/VM accessed by the CS.
> >>>>>>> (t+1) CS submission setting BOOKKEEP fences (i.e. no implicit sync)
> >>>>>>>
> >>>>>>> to run concurrently. When you write
> >>>>>>>
> >>>>>>>> Only when all this is done we then can resolve the dependency that the
> >>>>>>>> CS currently must wait for any clear operation on the VM.
> >>>>>>> isn't that all about the second problem?
> >>>>>> No, it's the same.
> >>>>>>
> >>>>>> See what we do in the VM code is to artificially insert a bubble so that
> >>>>>> all VM clear operations wait for all CS operations and then use the
> >>>>>> clear fence to indicate when the backing store of the BO can be freed.
> >>>>> Isn't that remediated with something like the code below? At least the
> >>>>> gem_close case should be handled with this, and the move case was
> >>>>> already handled by the copy operation.
> >>>> That is one necessary puzzle piece, yes. But you need more than that.
> >>>>
> >>>> Especially the explicit unmap operation needs to be converted into an
> >>>> implicit unmap to get the TLB flush right.
> >>> This doesn't change anything about the TLB flush though? Since all
> >>> unmap -> later jobs dependencies are still implicit.
> >>>
> >>> So the worst what could happen (i.f. e.g. userspace gets the
> >>> waits/dependencies wrong) is
> >>>
> >>> 1) non-implicit CS gets submitted that touches a BO
> >>> 2)  VM unmap on that BO happens
> >>> 2.5) the CS from 1 is still active due to missing dependencies
> >>> 2.6) but any CS submission after 2 will trigger a TLB flush
> >> Yeah, but that's exactly the bubble we try to avoid. Isn't it?
> > For this series, not really.  To clarify there are two sides for
> > getting GPU bubbles and no overlap:
> >
> > (1) VM operations implicitly wait for earlier CS submissions
> > (2) CS submissions implicitly wait for earlier VM operations
> >
> > Together, these combine to ensure that you get a (potentially small)
> > bubble any time VM work happens.
> >
> > Your series (and further ideas) tackles (2), and is a worthwhile thing
> > to do. However, while writing the userspace for this I noticed this
> > isn't enough to get rid of all our GPU bubbles. In particular when
> > doing a non-sparse map of a new BO, that tends to need to be waited on
> > for the next CS anyway for API semantics. Due to VM operations
> > happening on a single timeline that means this high priority map can
> > end up being blocked by earlier sparse maps and hence the bubble in
> > that case still exists.
> >
> > So in this series I try to tackle (1) instead. Since GPU work
> > typically lags behind CPU submissions and VM operations aren't that
> > slow, we can typically execute VM operations early enough that any
> > implicit syncs from (2) are less/no issue.
>
> Ok, once more since you don't seem to understand what I want to say: It
> isn't possible to fix #1 before you have fixed #2.
>
> The VM unmap operation here is a barrier which divides the CS operations
> in a before and after. This is intentional design.

Why is that barrier needed? The two barriers I do understand, and that
I think we can deal with, are:

1) the VM unmap is a barrier between prior CS and later memory free.
2) The TLB flush needs to happen between a VM unmap and later CS.

But why do we need the VM unmap to be a strict barrier between prior
CS and later CS?

>
> To get rid of this barrier you must first fix the part where CS
> submissions wait for the VM operation to complete, e.g. the necessity of
> the barrier.
>
> I'm working on this for a couple of years now and I'm really running out
> of idea how to explain this restriction.
>
> Regards,
> Christian.
>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 3/5] drm/amdgpu: Allow explicit sync for VM ops.
  2022-06-03 17:50                                 ` Bas Nieuwenhuizen
@ 2022-06-03 18:41                                   ` Christian König
  2022-06-03 19:11                                     ` Bas Nieuwenhuizen
  0 siblings, 1 reply; 46+ messages in thread
From: Christian König @ 2022-06-03 18:41 UTC (permalink / raw)
  To: Bas Nieuwenhuizen; +Cc: ML dri-devel

On 03.06.22 at 19:50, Bas Nieuwenhuizen wrote:
> [SNIP]
>>>> Yeah, but that's exactly the bubble we try to avoid. Isn't it?
>>> For this series, not really.  To clarify there are two sides for
>>> getting GPU bubbles and no overlap:
>>>
>>> (1) VM operations implicitly wait for earlier CS submissions
>>> (2) CS submissions implicitly wait for earlier VM operations
>>>
>>> Together, these combine to ensure that you get a (potentially small)
>>> bubble any time VM work happens.
>>>
>>> Your series (and further ideas) tackles (2), and is a worthwhile thing
>>> to do. However, while writing the userspace for this I noticed this
>>> isn't enough to get rid of all our GPU bubbles. In particular when
>>> doing a non-sparse map of a new BO, that tends to need to be waited on
>>> for the next CS anyway for API semantics. Due to VM operations
>>> happening on a single timeline that means this high priority map can
>>> end up being blocked by earlier sparse maps and hence the bubble in
>>> that case still exists.
>>>
>>> So in this series I try to tackle (1) instead. Since GPU work
>>> typically lags behind CPU submissions and VM operations aren't that
>>> slow, we can typically execute VM operations early enough that any
>>> implicit syncs from (2) are less/no issue.
>> Ok, once more since you don't seem to understand what I want to say: It
>> isn't possible to fix #1 before you have fixed #2.
>>
>> The VM unmap operation here is a barrier which divides the CS operations
>> in a before and after. This is intentional design.
> Why is that barrier needed? The two barriers I got and understood and
> I think we can deal with:
>
> 1) the VM unmap is a barrier between prior CS and later memory free.
> 2) The TLB flush need to happen between a VM unmap and later CS.
>
> But why do we need the VM unmap to be a strict barrier between prior
> CS and later CS?

Exactly because of the two reasons you mentioned.

#1 is rather easy to fix: you just need to copy all dma_fences from the 
page table dma_resv object over to the BO's dma_resv object in the GEM 
close handler, e.g. exactly what you suggested with the dma_resv_copy 
function.

#2 is a nightmare.

We can't move the TLB flush to the end of the unmap operation because 
async TLB flushes are either a bit complicated (double flushes etc.) or 
don't even work at all because of hw bugs. So to have a reliable TLB 
flush we must make sure that nothing else is ongoing, and that means a 
CS->VM->CS barrier.

We try very hard to circumvent that already on maps by (for example) 
using a completely new VMID for CS after the VM map operation.

But for the unmap operation we would need some kind of special dma_fence 
implementation which would wait not only for all existing dma_fences but 
also for the ones added until the unmap operation is completed. Because 
otherwise the operation we do at #1 would simply not catch all 
dma_fences which have access to the memory.

That's certainly doable, but I think just using the drm_exec stuff I 
already came up with is easier.

When we can grab locks for all the BOs involved, amdgpu_vm_clear_freed() 
goes away and we can keep track of the unmap operations in the bo_va 
structure.

With that done you can make the explicit sync you noted in the bo_va 
structure and implicit sync when the bo_va structure goes away.

Then the only reason I can see why we would need a CS->VM dependency is 
implicit synchronization, and that's what we are trying to avoid here in 
the first place.

Regards,
Christian.

>
>> To get rid of this barrier you must first fix the part where CS
>> submissions wait for the VM operation to complete, e.g. the necessity of
>> the barrier.
>>
>> I'm working on this for a couple of years now and I'm really running out
>> of idea how to explain this restriction.
>>
>> Regards,
>> Christian.
>>


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 3/5] drm/amdgpu: Allow explicit sync for VM ops.
  2022-06-03 18:41                                   ` Christian König
@ 2022-06-03 19:11                                     ` Bas Nieuwenhuizen
  2022-06-06 10:15                                       ` Christian König
  0 siblings, 1 reply; 46+ messages in thread
From: Bas Nieuwenhuizen @ 2022-06-03 19:11 UTC (permalink / raw)
  To: Christian König; +Cc: ML dri-devel

On Fri, Jun 3, 2022 at 8:41 PM Christian König <christian.koenig@amd.com> wrote:
>
> Am 03.06.22 um 19:50 schrieb Bas Nieuwenhuizen:
> > [SNIP]
> >>>> Yeah, but that's exactly the bubble we try to avoid. Isn't it?
> >>> For this series, not really.  To clarify there are two sides for
> >>> getting GPU bubbles and no overlap:
> >>>
> >>> (1) VM operations implicitly wait for earlier CS submissions
> >>> (2) CS submissions implicitly wait for earlier VM operations
> >>>
> >>> Together, these combine to ensure that you get a (potentially small)
> >>> bubble any time VM work happens.
> >>>
> >>> Your series (and further ideas) tackles (2), and is a worthwhile thing
> >>> to do. However, while writing the userspace for this I noticed this
> >>> isn't enough to get rid of all our GPU bubbles. In particular when
> >>> doing a non-sparse map of a new BO, that tends to need to be waited on
> >>> for the next CS anyway for API semantics. Due to VM operations
> >>> happening on a single timeline that means this high priority map can
> >>> end up being blocked by earlier sparse maps and hence the bubble in
> >>> that case still exists.
> >>>
> >>> So in this series I try to tackle (1) instead. Since GPU work
> >>> typically lags behind CPU submissions and VM operations aren't that
> >>> slow, we can typically execute VM operations early enough that any
> >>> implicit syncs from (2) are less/no issue.
> >> Ok, once more since you don't seem to understand what I want to say: It
> >> isn't possible to fix #1 before you have fixed #2.
> >>
> >> The VM unmap operation here is a barrier which divides the CS operations
> >> in a before and after. This is intentional design.
> > Why is that barrier needed? The two barriers I got and understood and
> > I think we can deal with:
> >
> > 1) the VM unmap is a barrier between prior CS and later memory free.
> > 2) The TLB flush need to happen between a VM unmap and later CS.
> >
> > But why do we need the VM unmap to be a strict barrier between prior
> > CS and later CS?
>
> Exactly because of the two reasons you mentioned.

This is the part I'm not seeing. I get that removing #2 is a
nightmare, which is why I did something that doesn't violate that
constraint.

Like if an explicit CS that was running before the VM operation runs
till after the VM operation (and hence possibly till after the TLB
flush, or otherwise has the TLB flush not apply due to lack of async
TLB flush support), that is not an issue. It might see the state from
before the unmap, or after the unmap, or some intermediate state, and
all of those would be okay.

We still get the constraint that the TLB flush happens between the VM
unmap and later CS and hence the unmap is certainly visible to them.

>
> #1 Is rather easy to fix, you just need to copy all dma_fences from the
> page table dma_resv object over to the BOs dma_resv object in the gem
> close handler. E.g. exactly what you suggested with the dma_resv_copy
> function.
>
> #2 is a nightmare.
>
> We can't move the TLB flush at the end of the unmap operation because on
> async TLB flushes are either a bit complicated (double flushes etc..) or
> don't even work at all because of hw bugs. So to have a reliable TLB
> flush we must make sure that nothing else is ongoing and that means
> CS->VM->CS barrier.
>
> We try very hard to circumvent that already on maps by (for example)
> using a completely new VMID for CS after the VM map operation.
>
> But for the unmap operation we would need some kind special dma_fence
> implementation which would not only wait for all existing dma_fence but
> also for the one added until the unmap operation is completed. Cause
> otherwise our operation we do at #1 would simply not catch all
> dma_fences which have access to the memory.
>
> That's certainly doable, but I think just using the drm_exec stuff I
> already came up with is easier.
>
> When we can grab locks for all the BOs involved amdgpu_vm_clear_freed()
> goes away and we can keep track of the unmap operations in the bo_va
> structure.
>
> With that done you can make the explicit sync you noted in the bo_va
> structure and implicit sync when the bo_va structure goes away.
>
> Then the only reason I can see why we would need a CS->VM dependency is
> implicit synchronization, and that's what we are trying to avoid here in
> the first place.
>
> Regards,
> Christian.
>
> >
> >> To get rid of this barrier you must first fix the part where CS
> >> submissions wait for the VM operation to complete, e.g. the necessity of
> >> the barrier.
> >>
> >> I'm working on this for a couple of years now and I'm really running out
> >> of idea how to explain this restriction.
> >>
> >> Regards,
> >> Christian.
> >>
>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 3/5] drm/amdgpu: Allow explicit sync for VM ops.
  2022-06-03 19:11                                     ` Bas Nieuwenhuizen
@ 2022-06-06 10:15                                       ` Christian König
  2022-06-06 10:30                                         ` Bas Nieuwenhuizen
  0 siblings, 1 reply; 46+ messages in thread
From: Christian König @ 2022-06-06 10:15 UTC (permalink / raw)
  To: Bas Nieuwenhuizen; +Cc: ML dri-devel



On 03.06.22 at 21:11, Bas Nieuwenhuizen wrote:
> On Fri, Jun 3, 2022 at 8:41 PM Christian König <christian.koenig@amd.com> wrote:
>> Am 03.06.22 um 19:50 schrieb Bas Nieuwenhuizen:
>>> [SNIP]
>>>>>> Yeah, but that's exactly the bubble we try to avoid. Isn't it?
>>>>> For this series, not really.  To clarify there are two sides for
>>>>> getting GPU bubbles and no overlap:
>>>>>
>>>>> (1) VM operations implicitly wait for earlier CS submissions
>>>>> (2) CS submissions implicitly wait for earlier VM operations
>>>>>
>>>>> Together, these combine to ensure that you get a (potentially small)
>>>>> bubble any time VM work happens.
>>>>>
>>>>> Your series (and further ideas) tackles (2), and is a worthwhile thing
>>>>> to do. However, while writing the userspace for this I noticed this
>>>>> isn't enough to get rid of all our GPU bubbles. In particular when
>>>>> doing a non-sparse map of a new BO, that tends to need to be waited on
>>>>> for the next CS anyway for API semantics. Due to VM operations
>>>>> happening on a single timeline that means this high priority map can
>>>>> end up being blocked by earlier sparse maps and hence the bubble in
>>>>> that case still exists.
>>>>>
>>>>> So in this series I try to tackle (1) instead. Since GPU work
>>>>> typically lags behind CPU submissions and VM operations aren't that
>>>>> slow, we can typically execute VM operations early enough that any
>>>>> implicit syncs from (2) are less/no issue.
>>>> Ok, once more since you don't seem to understand what I want to say: It
>>>> isn't possible to fix #1 before you have fixed #2.
>>>>
>>>> The VM unmap operation here is a barrier which divides the CS operations
>>>> in a before and after. This is intentional design.
>>> Why is that barrier needed? The two barriers I got and understood and
>>> I think we can deal with:
>>>
>>> 1) the VM unmap is a barrier between prior CS and later memory free.
>>> 2) The TLB flush need to happen between a VM unmap and later CS.
>>>
>>> But why do we need the VM unmap to be a strict barrier between prior
>>> CS and later CS?
>> Exactly because of the two reasons you mentioned.
> This is the part I'm not seeing. I get that removing #2 is a
> nightmare, which is why I did something that doesn't violate that
> constraint.
>
> Like if an explicit CS that was running before the VM operation  runs
> till after the VM operation (and hence possibly till after the TLB
> flush, or otherwise have the TLB flush not apply due to lack of async
> TLB flush support), that is not an issue. It might see the state from
> before the unmap, or after the unmap, or some intermediate state and
> all of those would be okay.
>
> We still get the constraint that the TLB flush happens between the VM
> unmap and later CS and hence the unmap is certainly visible to them.

So you basically just want to set the sync mode in 
amdgpu_vm_update_range() to AMDGPU_SYNC_EXPLICIT even when it is an unmap?

That should be doable, but then you don't need any of the other changes.

Regards,
Christian.

>
>> #1 Is rather easy to fix, you just need to copy all dma_fences from the
>> page table dma_resv object over to the BOs dma_resv object in the gem
>> close handler. E.g. exactly what you suggested with the dma_resv_copy
>> function.
>>
>> #2 is a nightmare.
>>
>> We can't move the TLB flush at the end of the unmap operation because on
>> async TLB flushes are either a bit complicated (double flushes etc..) or
>> don't even work at all because of hw bugs. So to have a reliable TLB
>> flush we must make sure that nothing else is ongoing and that means
>> CS->VM->CS barrier.
>>
>> We try very hard to circumvent that already on maps by (for example)
>> using a completely new VMID for CS after the VM map operation.
>>
>> But for the unmap operation we would need some kind special dma_fence
>> implementation which would not only wait for all existing dma_fence but
>> also for the one added until the unmap operation is completed. Cause
>> otherwise our operation we do at #1 would simply not catch all
>> dma_fences which have access to the memory.
>>
>> That's certainly doable, but I think just using the drm_exec stuff I
>> already came up with is easier.
>>
>> When we can grab locks for all the BOs involved amdgpu_vm_clear_freed()
>> goes away and we can keep track of the unmap operations in the bo_va
>> structure.
>>
>> With that done you can make the explicit sync you noted in the bo_va
>> structure and implicit sync when the bo_va structure goes away.
>>
>> Then the only reason I can see why we would need a CS->VM dependency is
>> implicit synchronization, and that's what we are trying to avoid here in
>> the first place.
>>
>> Regards,
>> Christian.
>>
>>>> To get rid of this barrier you must first fix the part where CS
>>>> submissions wait for the VM operation to complete, e.g. the necessity of
>>>> the barrier.
>>>>
>>>> I'm working on this for a couple of years now and I'm really running out
>>>> of idea how to explain this restriction.
>>>>
>>>> Regards,
>>>> Christian.
>>>>


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 3/5] drm/amdgpu: Allow explicit sync for VM ops.
  2022-06-06 10:15                                       ` Christian König
@ 2022-06-06 10:30                                         ` Bas Nieuwenhuizen
  2022-06-06 10:35                                           ` Christian König
  0 siblings, 1 reply; 46+ messages in thread
From: Bas Nieuwenhuizen @ 2022-06-06 10:30 UTC (permalink / raw)
  To: Christian König; +Cc: ML dri-devel

On Mon, Jun 6, 2022 at 12:15 PM Christian König
<christian.koenig@amd.com> wrote:
>
>
>
> Am 03.06.22 um 21:11 schrieb Bas Nieuwenhuizen:
> > On Fri, Jun 3, 2022 at 8:41 PM Christian König <christian.koenig@amd.com> wrote:
> >> Am 03.06.22 um 19:50 schrieb Bas Nieuwenhuizen:
> >>> [SNIP]
> >>>>>> Yeah, but that's exactly the bubble we try to avoid. Isn't it?
> >>>>> For this series, not really.  To clarify there are two sides for
> >>>>> getting GPU bubbles and no overlap:
> >>>>>
> >>>>> (1) VM operations implicitly wait for earlier CS submissions
> >>>>> (2) CS submissions implicitly wait for earlier VM operations
> >>>>>
> >>>>> Together, these combine to ensure that you get a (potentially small)
> >>>>> bubble any time VM work happens.
> >>>>>
> >>>>> Your series (and further ideas) tackles (2), and is a worthwhile thing
> >>>>> to do. However, while writing the userspace for this I noticed this
> >>>>> isn't enough to get rid of all our GPU bubbles. In particular when
> >>>>> doing a non-sparse map of a new BO, that tends to need to be waited on
> >>>>> for the next CS anyway for API semantics. Due to VM operations
> >>>>> happening on a single timeline that means this high priority map can
> >>>>> end up being blocked by earlier sparse maps and hence the bubble in
> >>>>> that case still exists.
> >>>>>
> >>>>> So in this series I try to tackle (1) instead. Since GPU work
> >>>>> typically lags behind CPU submissions and VM operations aren't that
> >>>>> slow, we can typically execute VM operations early enough that any
> >>>>> implicit syncs from (2) are less/no issue.
> >>>> Ok, once more since you don't seem to understand what I want to say: It
> >>>> isn't possible to fix #1 before you have fixed #2.
> >>>>
> >>>> The VM unmap operation here is a barrier which divides the CS operations
> >>>> in a before and after. This is intentional design.
> >>> Why is that barrier needed? The two barriers I got and understood and
> >>> I think we can deal with:
> >>>
> >>> 1) the VM unmap is a barrier between prior CS and later memory free.
> >>> 2) The TLB flush need to happen between a VM unmap and later CS.
> >>>
> >>> But why do we need the VM unmap to be a strict barrier between prior
> >>> CS and later CS?
> >> Exactly because of the two reasons you mentioned.
> > This is the part I'm not seeing. I get that removing #2 is a
> > nightmare, which is why I did something that doesn't violate that
> > constraint.
> >
> > Like if an explicit CS that was running before the VM operation  runs
> > till after the VM operation (and hence possibly till after the TLB
> > flush, or otherwise have the TLB flush not apply due to lack of async
> > TLB flush support), that is not an issue. It might see the state from
> > before the unmap, or after the unmap, or some intermediate state and
> > all of those would be okay.
> >
> > We still get the constraint that the TLB flush happens between the VM
> > unmap and later CS and hence the unmap is certainly visible to them.
>
> So you basically just want to set the sync mode in
> amdgpu_vm_update_range() to AMDGPU_SYNC_EXPLICIT even when it is an unmap?

Yes, with the caveat that I want to do that only for
DMA_RESV_USAGE_BOOKKEEP or higher (i.e. if we submit a CS with
implicit sync we get the old implicit behavior, if we submit a CS with
explicit sync we get the new explicit behavior). The rest of the
series is basically just for enabling explicit sync submissions.
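
For the unmap case that would roughly mean gathering only the
non-BOOKKEEP fences, something like the sketch below. The helper name
and call site are made up for illustration; only the dma_resv and
amdgpu_sync calls are real, and the caller is assumed to hold the
reservation lock (kernel-side code, relying on <linux/dma-resv.h> and
the driver's amdgpu_sync.h):

    static int unmap_sync_resv(struct amdgpu_sync *sync, struct dma_resv *resv)
    {
        struct dma_resv_iter cursor;
        struct dma_fence *f;
        int r;

        /* Walk KERNEL/WRITE/READ fences only; BOOKKEEP fences from
         * explicitly synced CS are skipped, so the unmap can run
         * concurrently with that work. */
        dma_resv_for_each_fence(&cursor, resv, DMA_RESV_USAGE_READ, f) {
            r = amdgpu_sync_fence(sync, f);
            if (r)
                return r;
        }
        return 0;
    }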

> That should be doable, but then you don't need any of the other changes.
>
> Regards,
> Christian.
>
> >
> >> #1 Is rather easy to fix, you just need to copy all dma_fences from the
> >> page table dma_resv object over to the BOs dma_resv object in the gem
> >> close handler. E.g. exactly what you suggested with the dma_resv_copy
> >> function.
> >>
> >> #2 is a nightmare.
> >>
> >> We can't move the TLB flush at the end of the unmap operation because on
> >> async TLB flushes are either a bit complicated (double flushes etc..) or
> >> don't even work at all because of hw bugs. So to have a reliable TLB
> >> flush we must make sure that nothing else is ongoing and that means
> >> CS->VM->CS barrier.
> >>
> >> We try very hard to circumvent that already on maps by (for example)
> >> using a completely new VMID for CS after the VM map operation.
> >>
> >> But for the unmap operation we would need some kind special dma_fence
> >> implementation which would not only wait for all existing dma_fence but
> >> also for the one added until the unmap operation is completed. Cause
> >> otherwise our operation we do at #1 would simply not catch all
> >> dma_fences which have access to the memory.
> >>
> >> That's certainly doable, but I think just using the drm_exec stuff I
> >> already came up with is easier.
> >>
> >> When we can grab locks for all the BOs involved amdgpu_vm_clear_freed()
> >> goes away and we can keep track of the unmap operations in the bo_va
> >> structure.
> >>
> >> With that done you can make the explicit sync you noted in the bo_va
> >> structure and implicit sync when the bo_va structure goes away.
> >>
> >> Then the only reason I can see why we would need a CS->VM dependency is
> >> implicit synchronization, and that's what we are trying to avoid here in
> >> the first place.
> >>
> >> Regards,
> >> Christian.
> >>
> >>>> To get rid of this barrier you must first fix the part where CS
> >>>> submissions wait for the VM operation to complete, e.g. the necessity of
> >>>> the barrier.
> >>>>
> >>>> I'm working on this for a couple of years now and I'm really running out
> >>>> of idea how to explain this restriction.
> >>>>
> >>>> Regards,
> >>>> Christian.
> >>>>
>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 3/5] drm/amdgpu: Allow explicit sync for VM ops.
  2022-06-06 10:30                                         ` Bas Nieuwenhuizen
@ 2022-06-06 10:35                                           ` Christian König
  2022-06-06 11:00                                             ` Bas Nieuwenhuizen
  0 siblings, 1 reply; 46+ messages in thread
From: Christian König @ 2022-06-06 10:35 UTC (permalink / raw)
  To: Bas Nieuwenhuizen; +Cc: ML dri-devel

On 06.06.22 at 12:30, Bas Nieuwenhuizen wrote:
> On Mon, Jun 6, 2022 at 12:15 PM Christian König
> <christian.koenig@amd.com> wrote:
>>
>>
>> Am 03.06.22 um 21:11 schrieb Bas Nieuwenhuizen:
>>> On Fri, Jun 3, 2022 at 8:41 PM Christian König <christian.koenig@amd.com> wrote:
>>>> Am 03.06.22 um 19:50 schrieb Bas Nieuwenhuizen:
>>>>> [SNIP]
>>>>>>>> Yeah, but that's exactly the bubble we try to avoid. Isn't it?
>>>>>>> For this series, not really.  To clarify there are two sides for
>>>>>>> getting GPU bubbles and no overlap:
>>>>>>>
>>>>>>> (1) VM operations implicitly wait for earlier CS submissions
>>>>>>> (2) CS submissions implicitly wait for earlier VM operations
>>>>>>>
>>>>>>> Together, these combine to ensure that you get a (potentially small)
>>>>>>> bubble any time VM work happens.
>>>>>>>
>>>>>>> Your series (and further ideas) tackles (2), and is a worthwhile thing
>>>>>>> to do. However, while writing the userspace for this I noticed this
>>>>>>> isn't enough to get rid of all our GPU bubbles. In particular when
>>>>>>> doing a non-sparse map of a new BO, that tends to need to be waited on
>>>>>>> for the next CS anyway for API semantics. Due to VM operations
>>>>>>> happening on a single timeline that means this high priority map can
>>>>>>> end up being blocked by earlier sparse maps and hence the bubble in
>>>>>>> that case still exists.
>>>>>>>
>>>>>>> So in this series I try to tackle (1) instead. Since GPU work
>>>>>>> typically lags behind CPU submissions and VM operations aren't that
>>>>>>> slow, we can typically execute VM operations early enough that any
>>>>>>> implicit syncs from (2) are less/no issue.
>>>>>> Ok, once more since you don't seem to understand what I want to say: It
>>>>>> isn't possible to fix #1 before you have fixed #2.
>>>>>>
>>>>>> The VM unmap operation here is a barrier which divides the CS operations
>>>>>> in a before and after. This is intentional design.
>>>>> Why is that barrier needed? The two barriers I got and understood and
>>>>> I think we can deal with:
>>>>>
>>>>> 1) the VM unmap is a barrier between prior CS and later memory free.
>>>>> 2) The TLB flush need to happen between a VM unmap and later CS.
>>>>>
>>>>> But why do we need the VM unmap to be a strict barrier between prior
>>>>> CS and later CS?
>>>> Exactly because of the two reasons you mentioned.
>>> This is the part I'm not seeing. I get that removing #2 is a
>>> nightmare, which is why I did something that doesn't violate that
>>> constraint.
>>>
>>> Like if an explicit CS that was running before the VM operation  runs
>>> till after the VM operation (and hence possibly till after the TLB
>>> flush, or otherwise have the TLB flush not apply due to lack of async
>>> TLB flush support), that is not an issue. It might see the state from
>>> before the unmap, or after the unmap, or some intermediate state and
>>> all of those would be okay.
>>>
>>> We still get the constraint that the TLB flush happens between the VM
>>> unmap and later CS and hence the unmap is certainly visible to them.
>> So you basically just want to set the sync mode in
>> amdgpu_vm_update_range() to AMDGPU_SYNC_EXPLICIT even when it is an unmap?
> Yes, with the caveat that I want to do that only for
> DMA_RESV_USAGE_BOOKKEEP or higher (i.e. if we submit a CS with
> implicit sync we get the old implicit behavior, if we submit a CS with
> explicit sync we get the new explicit behavior). The rest of the
> series is basically just for enabling explicit sync submissions.

That part won't work at all and would cause additional synchronization 
problems.

First of all, for implicitly synced CS we should use READ, not BOOKKEEP, 
because BOOKKEEP would incorrectly be ignored by OpenGL importers. I've 
fixed the case where this causes memory corruption, but it is still nice 
to avoid.

BOOKKEEP can only be used by VM updates themselves, so that they don't 
interfere with CS.
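
Roughly, in dma_resv terms that rule looks like this (sketch only, the 
helper names are made up and the caller is assumed to hold the 
reservation lock):

#include <linux/dma-resv.h>
#include <linux/dma-fence.h>

/* Illustrative helpers, not driver code. Caller holds the dma_resv lock. */

static int add_implicit_cs_fence(struct dma_resv *resv, struct dma_fence *f)
{
	int r = dma_resv_reserve_fences(resv, 1);

	if (r)
		return r;
	/* Visible to DMA-buf importers such as OpenGL. */
	dma_resv_add_fence(resv, f, DMA_RESV_USAGE_READ);
	return 0;
}

static int add_vm_update_fence(struct dma_resv *resv, struct dma_fence *f)
{
	int r = dma_resv_reserve_fences(resv, 1);

	if (r)
		return r;
	/* Memory management dependency only, never implicit sync. */
	dma_resv_add_fence(resv, f, DMA_RESV_USAGE_BOOKKEEP);
	return 0;
}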

Then the second problem is that the VM IOCTL has absolutely no idea what 
the CS IOCTL would be doing. That's why we have added the EXPLICIT sync 
flag on the BO.
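
(For reference, that flag is the existing AMDGPU_GEM_CREATE_EXPLICIT_SYNC 
BO creation flag; a minimal sketch of the idea, with an illustrative 
helper name:)

#include <drm/amdgpu_drm.h>
#include <linux/types.h>

/*
 * Sketch only: the CS code can decide per BO whether implicit sync applies,
 * based on how the BO was created, without the VM IOCTL knowing anything
 * about what the CS IOCTL does.
 */
static bool bo_wants_implicit_sync(__u64 bo_create_flags)
{
	return !(bo_create_flags & AMDGPU_GEM_CREATE_EXPLICIT_SYNC);
}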

Regards,
Christian.

>
>> That should be doable, but then you don't need any of the other changes.
>>
>> Regards,
>> Christian.
>>
>>>> #1 Is rather easy to fix, you just need to copy all dma_fences from the
>>>> page table dma_resv object over to the BOs dma_resv object in the gem
>>>> close handler. E.g. exactly what you suggested with the dma_resv_copy
>>>> function.
>>>>
>>>> #2 is a nightmare.
>>>>
>>>> We can't move the TLB flush at the end of the unmap operation because on
>>>> async TLB flushes are either a bit complicated (double flushes etc..) or
>>>> don't even work at all because of hw bugs. So to have a reliable TLB
>>>> flush we must make sure that nothing else is ongoing and that means
>>>> CS->VM->CS barrier.
>>>>
>>>> We try very hard to circumvent that already on maps by (for example)
>>>> using a completely new VMID for CS after the VM map operation.
>>>>
>>>> But for the unmap operation we would need some kind special dma_fence
>>>> implementation which would not only wait for all existing dma_fence but
>>>> also for the one added until the unmap operation is completed. Cause
>>>> otherwise our operation we do at #1 would simply not catch all
>>>> dma_fences which have access to the memory.
>>>>
>>>> That's certainly doable, but I think just using the drm_exec stuff I
>>>> already came up with is easier.
>>>>
>>>> When we can grab locks for all the BOs involved amdgpu_vm_clear_freed()
>>>> goes away and we can keep track of the unmap operations in the bo_va
>>>> structure.
>>>>
>>>> With that done you can make the explicit sync you noted in the bo_va
>>>> structure and implicit sync when the bo_va structure goes away.
>>>>
>>>> Then the only reason I can see why we would need a CS->VM dependency is
>>>> implicit synchronization, and that's what we are trying to avoid here in
>>>> the first place.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>> To get rid of this barrier you must first fix the part where CS
>>>>>> submissions wait for the VM operation to complete, e.g. the necessity of
>>>>>> the barrier.
>>>>>>
>>>>>> I'm working on this for a couple of years now and I'm really running out
>>>>>> of idea how to explain this restriction.
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>



* Re: [RFC PATCH 3/5] drm/amdgpu: Allow explicit sync for VM ops.
  2022-06-06 10:35                                           ` Christian König
@ 2022-06-06 11:00                                             ` Bas Nieuwenhuizen
  2022-06-15  0:40                                               ` Bas Nieuwenhuizen
  2022-06-15  7:00                                               ` Christian König
  0 siblings, 2 replies; 46+ messages in thread
From: Bas Nieuwenhuizen @ 2022-06-06 11:00 UTC (permalink / raw)
  To: Christian König; +Cc: ML dri-devel

On Mon, Jun 6, 2022 at 12:35 PM Christian König
<christian.koenig@amd.com> wrote:
>
> Am 06.06.22 um 12:30 schrieb Bas Nieuwenhuizen:
> > On Mon, Jun 6, 2022 at 12:15 PM Christian König
> > <christian.koenig@amd.com> wrote:
> >>
> >>
> >> Am 03.06.22 um 21:11 schrieb Bas Nieuwenhuizen:
> >>> On Fri, Jun 3, 2022 at 8:41 PM Christian König <christian.koenig@amd.com> wrote:
> >>>> Am 03.06.22 um 19:50 schrieb Bas Nieuwenhuizen:
> >>>>> [SNIP]
> >>>>>>>> Yeah, but that's exactly the bubble we try to avoid. Isn't it?
> >>>>>>> For this series, not really.  To clarify there are two sides for
> >>>>>>> getting GPU bubbles and no overlap:
> >>>>>>>
> >>>>>>> (1) VM operations implicitly wait for earlier CS submissions
> >>>>>>> (2) CS submissions implicitly wait for earlier VM operations
> >>>>>>>
> >>>>>>> Together, these combine to ensure that you get a (potentially small)
> >>>>>>> bubble any time VM work happens.
> >>>>>>>
> >>>>>>> Your series (and further ideas) tackles (2), and is a worthwhile thing
> >>>>>>> to do. However, while writing the userspace for this I noticed this
> >>>>>>> isn't enough to get rid of all our GPU bubbles. In particular when
> >>>>>>> doing a non-sparse map of a new BO, that tends to need to be waited on
> >>>>>>> for the next CS anyway for API semantics. Due to VM operations
> >>>>>>> happening on a single timeline that means this high priority map can
> >>>>>>> end up being blocked by earlier sparse maps and hence the bubble in
> >>>>>>> that case still exists.
> >>>>>>>
> >>>>>>> So in this series I try to tackle (1) instead. Since GPU work
> >>>>>>> typically lags behind CPU submissions and VM operations aren't that
> >>>>>>> slow, we can typically execute VM operations early enough that any
> >>>>>>> implicit syncs from (2) are less/no issue.
> >>>>>> Ok, once more since you don't seem to understand what I want to say: It
> >>>>>> isn't possible to fix #1 before you have fixed #2.
> >>>>>>
> >>>>>> The VM unmap operation here is a barrier which divides the CS operations
> >>>>>> in a before and after. This is intentional design.
> >>>>> Why is that barrier needed? The two barriers I got and understood and
> >>>>> I think we can deal with:
> >>>>>
> >>>>> 1) the VM unmap is a barrier between prior CS and later memory free.
> >>>>> 2) The TLB flush need to happen between a VM unmap and later CS.
> >>>>>
> >>>>> But why do we need the VM unmap to be a strict barrier between prior
> >>>>> CS and later CS?
> >>>> Exactly because of the two reasons you mentioned.
> >>> This is the part I'm not seeing. I get that removing #2 is a
> >>> nightmare, which is why I did something that doesn't violate that
> >>> constraint.
> >>>
> >>> Like if an explicit CS that was running before the VM operation  runs
> >>> till after the VM operation (and hence possibly till after the TLB
> >>> flush, or otherwise have the TLB flush not apply due to lack of async
> >>> TLB flush support), that is not an issue. It might see the state from
> >>> before the unmap, or after the unmap, or some intermediate state and
> >>> all of those would be okay.
> >>>
> >>> We still get the constraint that the TLB flush happens between the VM
> >>> unmap and later CS and hence the unmap is certainly visible to them.
> >> So you basically just want to set the sync mode in
> >> amdgpu_vm_update_range() to AMDGPU_SYNC_EXPLICIT even when it is an unmap?
> > Yes, with the caveat that I want to do that only for
> > DMA_RESV_USAGE_BOOKKEEP or higher (i.e. if we submit a CS with
> > implicit sync we get the old implicit behavior, if we submit a CS with
> > explicit sync we get the new explicit behavior). The rest of the
> > series is basically just for enabling explicit sync submissions.
>
> That part won't work at all and would cause additional synchronization
> problems.
>
> First of all for implicit synced CS we should use READ, not BOOKKEEP.
> Because BOOKKEEP would incorrectly be ignored by OpenGL importers. I've
> fixed that this causes memory corruption, but it is still nice to avoid.

Yes, what I'm saying is that an implicit sync CS submission should add
READ fences to the dma_resv and an explicit sync CS submission should
add BOOKKEEP fences.

>
> BOOKKEEP can only be used by VM updates themselves. So that they don't
> interfere with CS.

That is exactly why we would go with BOOKKEEP for explicit sync CS
submissions, no? An explicit submission shouldn't interfere with any
other CS submissions. That includes being totally ignored by GL
importers (if we want synchronization there between an explicit
submission and GL, userspace is expected to use Jason's dmabuf fence
import/export IOCTLs).
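
For completeness, a userspace-side sketch of that bridge, assuming the 
export/import sync-file ioctls from Jason's series (not merged at this 
point, so the uapi names may still change); error handling omitted:

#include <linux/dma-buf.h>
#include <sys/ioctl.h>

/* Pull the buffer's current implicit fences out as a sync_file fd. */
static int export_implicit_fences(int dmabuf_fd)
{
	struct dma_buf_export_sync_file args = {
		.flags = DMA_BUF_SYNC_RW,
		.fd = -1,
	};

	if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &args) < 0)
		return -1;
	return args.fd;
}

/* Make an explicit (e.g. Vulkan) fence visible to implicit sync users. */
static int import_explicit_fence(int dmabuf_fd, int sync_file_fd)
{
	struct dma_buf_import_sync_file args = {
		.flags = DMA_BUF_SYNC_WRITE,
		.fd = sync_file_fd,
	};

	return ioctl(dmabuf_fd, DMA_BUF_IOCTL_IMPORT_SYNC_FILE, &args);
}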

>
> Then the second problem is that the VM IOCTL has absolutely no idea what
> the CS IOCTL would be doing. That's why we have added the EXPLICIT sync
> flag on the BO.

It doesn't need to? We just use a different sync_mode for BOOKKEEP
fences vs others:
https://patchwork.freedesktop.org/patch/487887/?series=104578&rev=2

(the nice thing about doing it this way is that it is independent of
the IOCTL, i.e. also works for the delayed mapping changes we trigger
on CS submit)
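
The rough shape of the idea, purely illustrative and not the actual patch 
linked above:

#include <linux/dma-resv.h>

#include "amdgpu_sync.h"

/*
 * Illustrative only: decide per fence, based on the usage it was added with,
 * whether a VM operation has to wait for it. Fences added as BOOKKEEP never
 * gate other work; kernel fences always do.
 */
static bool vm_op_needs_fence(enum amdgpu_sync_mode mode,
			      enum dma_resv_usage usage)
{
	if (usage == DMA_RESV_USAGE_KERNEL)
		return true;

	if (usage == DMA_RESV_USAGE_BOOKKEEP)
		return false;

	return mode != AMDGPU_SYNC_EXPLICIT;
}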

>
> Regards,
> Christian.
>
> >
> >> That should be doable, but then you don't need any of the other changes.
> >>
> >> Regards,
> >> Christian.
> >>
> >>>> #1 Is rather easy to fix, you just need to copy all dma_fences from the
> >>>> page table dma_resv object over to the BOs dma_resv object in the gem
> >>>> close handler. E.g. exactly what you suggested with the dma_resv_copy
> >>>> function.
> >>>>
> >>>> #2 is a nightmare.
> >>>>
> >>>> We can't move the TLB flush at the end of the unmap operation because on
> >>>> async TLB flushes are either a bit complicated (double flushes etc..) or
> >>>> don't even work at all because of hw bugs. So to have a reliable TLB
> >>>> flush we must make sure that nothing else is ongoing and that means
> >>>> CS->VM->CS barrier.
> >>>>
> >>>> We try very hard to circumvent that already on maps by (for example)
> >>>> using a completely new VMID for CS after the VM map operation.
> >>>>
> >>>> But for the unmap operation we would need some kind special dma_fence
> >>>> implementation which would not only wait for all existing dma_fence but
> >>>> also for the one added until the unmap operation is completed. Cause
> >>>> otherwise our operation we do at #1 would simply not catch all
> >>>> dma_fences which have access to the memory.
> >>>>
> >>>> That's certainly doable, but I think just using the drm_exec stuff I
> >>>> already came up with is easier.
> >>>>
> >>>> When we can grab locks for all the BOs involved amdgpu_vm_clear_freed()
> >>>> goes away and we can keep track of the unmap operations in the bo_va
> >>>> structure.
> >>>>
> >>>> With that done you can make the explicit sync you noted in the bo_va
> >>>> structure and implicit sync when the bo_va structure goes away.
> >>>>
> >>>> Then the only reason I can see why we would need a CS->VM dependency is
> >>>> implicit synchronization, and that's what we are trying to avoid here in
> >>>> the first place.
> >>>>
> >>>> Regards,
> >>>> Christian.
> >>>>
> >>>>>> To get rid of this barrier you must first fix the part where CS
> >>>>>> submissions wait for the VM operation to complete, e.g. the necessity of
> >>>>>> the barrier.
> >>>>>>
> >>>>>> I'm working on this for a couple of years now and I'm really running out
> >>>>>> of idea how to explain this restriction.
> >>>>>>
> >>>>>> Regards,
> >>>>>> Christian.
> >>>>>>
>


* Re: [RFC PATCH 3/5] drm/amdgpu: Allow explicit sync for VM ops.
  2022-06-06 11:00                                             ` Bas Nieuwenhuizen
@ 2022-06-15  0:40                                               ` Bas Nieuwenhuizen
  2022-06-15  7:00                                                 ` Christian König
  2022-06-15  7:00                                               ` Christian König
  1 sibling, 1 reply; 46+ messages in thread
From: Bas Nieuwenhuizen @ 2022-06-15  0:40 UTC (permalink / raw)
  To: Christian König; +Cc: ML dri-devel

Hi Christian,

Friendly ping on the comments here. Are you okay with me re-spinning
the series with a fixed patch 1 or do we need further discussion on
the approach here?

Thanks,
Bas

On Mon, Jun 6, 2022 at 1:00 PM Bas Nieuwenhuizen
<bas@basnieuwenhuizen.nl> wrote:
>
> On Mon, Jun 6, 2022 at 12:35 PM Christian König
> <christian.koenig@amd.com> wrote:
> >
> > Am 06.06.22 um 12:30 schrieb Bas Nieuwenhuizen:
> > > On Mon, Jun 6, 2022 at 12:15 PM Christian König
> > > <christian.koenig@amd.com> wrote:
> > >>
> > >>
> > >> Am 03.06.22 um 21:11 schrieb Bas Nieuwenhuizen:
> > >>> On Fri, Jun 3, 2022 at 8:41 PM Christian König <christian.koenig@amd.com> wrote:
> > >>>> Am 03.06.22 um 19:50 schrieb Bas Nieuwenhuizen:
> > >>>>> [SNIP]
> > >>>>>>>> Yeah, but that's exactly the bubble we try to avoid. Isn't it?
> > >>>>>>> For this series, not really.  To clarify there are two sides for
> > >>>>>>> getting GPU bubbles and no overlap:
> > >>>>>>>
> > >>>>>>> (1) VM operations implicitly wait for earlier CS submissions
> > >>>>>>> (2) CS submissions implicitly wait for earlier VM operations
> > >>>>>>>
> > >>>>>>> Together, these combine to ensure that you get a (potentially small)
> > >>>>>>> bubble any time VM work happens.
> > >>>>>>>
> > >>>>>>> Your series (and further ideas) tackles (2), and is a worthwhile thing
> > >>>>>>> to do. However, while writing the userspace for this I noticed this
> > >>>>>>> isn't enough to get rid of all our GPU bubbles. In particular when
> > >>>>>>> doing a non-sparse map of a new BO, that tends to need to be waited on
> > >>>>>>> for the next CS anyway for API semantics. Due to VM operations
> > >>>>>>> happening on a single timeline that means this high priority map can
> > >>>>>>> end up being blocked by earlier sparse maps and hence the bubble in
> > >>>>>>> that case still exists.
> > >>>>>>>
> > >>>>>>> So in this series I try to tackle (1) instead. Since GPU work
> > >>>>>>> typically lags behind CPU submissions and VM operations aren't that
> > >>>>>>> slow, we can typically execute VM operations early enough that any
> > >>>>>>> implicit syncs from (2) are less/no issue.
> > >>>>>> Ok, once more since you don't seem to understand what I want to say: It
> > >>>>>> isn't possible to fix #1 before you have fixed #2.
> > >>>>>>
> > >>>>>> The VM unmap operation here is a barrier which divides the CS operations
> > >>>>>> in a before and after. This is intentional design.
> > >>>>> Why is that barrier needed? The two barriers I got and understood and
> > >>>>> I think we can deal with:
> > >>>>>
> > >>>>> 1) the VM unmap is a barrier between prior CS and later memory free.
> > >>>>> 2) The TLB flush need to happen between a VM unmap and later CS.
> > >>>>>
> > >>>>> But why do we need the VM unmap to be a strict barrier between prior
> > >>>>> CS and later CS?
> > >>>> Exactly because of the two reasons you mentioned.
> > >>> This is the part I'm not seeing. I get that removing #2 is a
> > >>> nightmare, which is why I did something that doesn't violate that
> > >>> constraint.
> > >>>
> > >>> Like if an explicit CS that was running before the VM operation  runs
> > >>> till after the VM operation (and hence possibly till after the TLB
> > >>> flush, or otherwise have the TLB flush not apply due to lack of async
> > >>> TLB flush support), that is not an issue. It might see the state from
> > >>> before the unmap, or after the unmap, or some intermediate state and
> > >>> all of those would be okay.
> > >>>
> > >>> We still get the constraint that the TLB flush happens between the VM
> > >>> unmap and later CS and hence the unmap is certainly visible to them.
> > >> So you basically just want to set the sync mode in
> > >> amdgpu_vm_update_range() to AMDGPU_SYNC_EXPLICIT even when it is an unmap?
> > > Yes, with the caveat that I want to do that only for
> > > DMA_RESV_USAGE_BOOKKEEP or higher (i.e. if we submit a CS with
> > > implicit sync we get the old implicit behavior, if we submit a CS with
> > > explicit sync we get the new explicit behavior). The rest of the
> > > series is basically just for enabling explicit sync submissions.
> >
> > That part won't work at all and would cause additional synchronization
> > problems.
> >
> > First of all for implicit synced CS we should use READ, not BOOKKEEP.
> > Because BOOKKEEP would incorrectly be ignored by OpenGL importers. I've
> > fixed that this causes memory corruption, but it is still nice to avoid.
>
> Yes, what I'm saying is that on implicit sync CS submission should add
> READ fences to the dma resv and on explicit sync CS submission should
> add BOOKKEEP fences.
>
> >
> > BOOKKEEP can only be used by VM updates themselves. So that they don't
> > interfere with CS.
>
> That is the point why we would go BOOKKEEP for explicit sync CS
> submissions, no? Explicit submission shouldn't interfere with any
> other CS submissions. That includes being totally ignored by GL
> importers (if we want to have synchronization there between an
> explicit submission and GL, userspace is expected to use Jason's
> dmabuf fence import/export IOCTLs)
>
> >
> > Then the second problem is that the VM IOCTL has absolutely no idea what
> > the CS IOCTL would be doing. That's why we have added the EXPLICIT sync
> > flag on the BO.
>
> It doesn't need to? We just use a different sync_mode for BOOKKEEP
> fences vs others:
> https://patchwork.freedesktop.org/patch/487887/?series=104578&rev=2
>
> (the nice thing about doing it this way is that it is independent of
> the IOCTL, i.e. also works for the delayed mapping changes we trigger
> on CS submit)
>
> >
> > Regards,
> > Christian.
> >
> > >
> > >> That should be doable, but then you don't need any of the other changes.
> > >>
> > >> Regards,
> > >> Christian.
> > >>
> > >>>> #1 Is rather easy to fix, you just need to copy all dma_fences from the
> > >>>> page table dma_resv object over to the BOs dma_resv object in the gem
> > >>>> close handler. E.g. exactly what you suggested with the dma_resv_copy
> > >>>> function.
> > >>>>
> > >>>> #2 is a nightmare.
> > >>>>
> > >>>> We can't move the TLB flush at the end of the unmap operation because on
> > >>>> async TLB flushes are either a bit complicated (double flushes etc..) or
> > >>>> don't even work at all because of hw bugs. So to have a reliable TLB
> > >>>> flush we must make sure that nothing else is ongoing and that means
> > >>>> CS->VM->CS barrier.
> > >>>>
> > >>>> We try very hard to circumvent that already on maps by (for example)
> > >>>> using a completely new VMID for CS after the VM map operation.
> > >>>>
> > >>>> But for the unmap operation we would need some kind special dma_fence
> > >>>> implementation which would not only wait for all existing dma_fence but
> > >>>> also for the one added until the unmap operation is completed. Cause
> > >>>> otherwise our operation we do at #1 would simply not catch all
> > >>>> dma_fences which have access to the memory.
> > >>>>
> > >>>> That's certainly doable, but I think just using the drm_exec stuff I
> > >>>> already came up with is easier.
> > >>>>
> > >>>> When we can grab locks for all the BOs involved amdgpu_vm_clear_freed()
> > >>>> goes away and we can keep track of the unmap operations in the bo_va
> > >>>> structure.
> > >>>>
> > >>>> With that done you can make the explicit sync you noted in the bo_va
> > >>>> structure and implicit sync when the bo_va structure goes away.
> > >>>>
> > >>>> Then the only reason I can see why we would need a CS->VM dependency is
> > >>>> implicit synchronization, and that's what we are trying to avoid here in
> > >>>> the first place.
> > >>>>
> > >>>> Regards,
> > >>>> Christian.
> > >>>>
> > >>>>>> To get rid of this barrier you must first fix the part where CS
> > >>>>>> submissions wait for the VM operation to complete, e.g. the necessity of
> > >>>>>> the barrier.
> > >>>>>>
> > >>>>>> I'm working on this for a couple of years now and I'm really running out
> > >>>>>> of idea how to explain this restriction.
> > >>>>>>
> > >>>>>> Regards,
> > >>>>>> Christian.
> > >>>>>>
> >


* Re: [RFC PATCH 3/5] drm/amdgpu: Allow explicit sync for VM ops.
  2022-06-06 11:00                                             ` Bas Nieuwenhuizen
  2022-06-15  0:40                                               ` Bas Nieuwenhuizen
@ 2022-06-15  7:00                                               ` Christian König
  2022-06-17 13:03                                                 ` Bas Nieuwenhuizen
  1 sibling, 1 reply; 46+ messages in thread
From: Christian König @ 2022-06-15  7:00 UTC (permalink / raw)
  To: Bas Nieuwenhuizen; +Cc: ML dri-devel

Am 06.06.22 um 13:00 schrieb Bas Nieuwenhuizen:
> On Mon, Jun 6, 2022 at 12:35 PM Christian König
> <christian.koenig@amd.com> wrote:
>> [SNIP]
>> That part won't work at all and would cause additional synchronization
>> problems.
>>
>> First of all for implicit synced CS we should use READ, not BOOKKEEP.
>> Because BOOKKEEP would incorrectly be ignored by OpenGL importers. I've
>> fixed that this causes memory corruption, but it is still nice to avoid.
> Yes, what I'm saying is that on implicit sync CS submission should add
> READ fences to the dma resv and on explicit sync CS submission should
> add BOOKKEEP fences.

No, exactly that is wrong.

Implicit CS submissions should add WRITE fences.

Explicit CS submissions should add READ fences.

Only VM updates should add BOOKKEEP fences.
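
Spelled out as a sketch (the submission kinds are illustrative, the 
usages are the real dma_resv ones):

#include <linux/dma-resv.h>

enum submission_kind {
	SUBMISSION_IMPLICIT_CS,
	SUBMISSION_EXPLICIT_CS,
	SUBMISSION_VM_UPDATE,
};

static enum dma_resv_usage fence_usage_for(enum submission_kind kind)
{
	switch (kind) {
	case SUBMISSION_IMPLICIT_CS:
		/* Implicitly synced readers and writers wait on it. */
		return DMA_RESV_USAGE_WRITE;
	case SUBMISSION_EXPLICIT_CS:
		/* Implicitly synced writers still wait on it. */
		return DMA_RESV_USAGE_READ;
	case SUBMISSION_VM_UPDATE:
		/* Only memory management waits on it. */
		return DMA_RESV_USAGE_BOOKKEEP;
	}
	return DMA_RESV_USAGE_WRITE;
}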

>> BOOKKEEP can only be used by VM updates themselves. So that they don't
>> interfere with CS.
> That is the point why we would go BOOKKEEP for explicit sync CS
> submissions, no? Explicit submission shouldn't interfere with any
> other CS submissions. That includes being totally ignored by GL
> importers (if we want to have synchronization there between an
> explicit submission and GL, userspace is expected to use Jason's
> dmabuf fence import/export IOCTLs)

No, that would break existing DMA-buf rules.

Explicit CS submissions are still a dependency for implicit submissions.
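
Under the DMA_RESV_USAGE rules that dependency falls out of how the usage 
levels nest: waiting for a given usage also waits for all stricter ones, 
so READ fences are seen by implicit writers while BOOKKEEP fences are only 
seen by memory management. A minimal sketch with the core helpers:

#include <linux/dma-resv.h>
#include <linux/sched.h>

/* What an implicitly synced writer effectively does before touching the BO. */
static long wait_for_implicit_write(struct dma_resv *resv)
{
	/* dma_resv_usage_rw(true) == DMA_RESV_USAGE_READ: wait for KERNEL,
	 * WRITE and READ fences, but not for BOOKKEEP ones. */
	return dma_resv_wait_timeout(resv, dma_resv_usage_rw(true),
				     true, MAX_SCHEDULE_TIMEOUT);
}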

>
> Then the second problem is that the VM IOCTL has absolutely no idea what
> the CS IOCTL would be doing. That's why we have added the EXPLICIT sync
> flag on the BO.
> It doesn't need to? We just use a different sync_mode for BOOKKEEP
> fences vs others:
> https://patchwork.freedesktop.org/patch/487887/?series=104578&rev=2

No, exactly that's completely broken.

Regards,
Christian.

>
> (the nice thing about doing it this way is that it is independent of
> the IOCTL, i.e. also works for the delayed mapping changes we trigger
> on CS submit)
>
>> Regards,
>> Christian.
>>
>>>> That should be doable, but then you don't need any of the other changes.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>> #1 Is rather easy to fix, you just need to copy all dma_fences from the
>>>>>> page table dma_resv object over to the BOs dma_resv object in the gem
>>>>>> close handler. E.g. exactly what you suggested with the dma_resv_copy
>>>>>> function.
>>>>>>
>>>>>> #2 is a nightmare.
>>>>>>
>>>>>> We can't move the TLB flush at the end of the unmap operation because on
>>>>>> async TLB flushes are either a bit complicated (double flushes etc..) or
>>>>>> don't even work at all because of hw bugs. So to have a reliable TLB
>>>>>> flush we must make sure that nothing else is ongoing and that means
>>>>>> CS->VM->CS barrier.
>>>>>>
>>>>>> We try very hard to circumvent that already on maps by (for example)
>>>>>> using a completely new VMID for CS after the VM map operation.
>>>>>>
>>>>>> But for the unmap operation we would need some kind special dma_fence
>>>>>> implementation which would not only wait for all existing dma_fence but
>>>>>> also for the one added until the unmap operation is completed. Cause
>>>>>> otherwise our operation we do at #1 would simply not catch all
>>>>>> dma_fences which have access to the memory.
>>>>>>
>>>>>> That's certainly doable, but I think just using the drm_exec stuff I
>>>>>> already came up with is easier.
>>>>>>
>>>>>> When we can grab locks for all the BOs involved amdgpu_vm_clear_freed()
>>>>>> goes away and we can keep track of the unmap operations in the bo_va
>>>>>> structure.
>>>>>>
>>>>>> With that done you can make the explicit sync you noted in the bo_va
>>>>>> structure and implicit sync when the bo_va structure goes away.
>>>>>>
>>>>>> Then the only reason I can see why we would need a CS->VM dependency is
>>>>>> implicit synchronization, and that's what we are trying to avoid here in
>>>>>> the first place.
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>>>> To get rid of this barrier you must first fix the part where CS
>>>>>>>> submissions wait for the VM operation to complete, e.g. the necessity of
>>>>>>>> the barrier.
>>>>>>>>
>>>>>>>> I'm working on this for a couple of years now and I'm really running out
>>>>>>>> of idea how to explain this restriction.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Christian.
>>>>>>>>



* Re: [RFC PATCH 3/5] drm/amdgpu: Allow explicit sync for VM ops.
  2022-06-15  0:40                                               ` Bas Nieuwenhuizen
@ 2022-06-15  7:00                                                 ` Christian König
  0 siblings, 0 replies; 46+ messages in thread
From: Christian König @ 2022-06-15  7:00 UTC (permalink / raw)
  To: Bas Nieuwenhuizen; +Cc: ML dri-devel

Hi Bas,

sorry I totally missed your reply. Just tried to answer your original 
questions.

Regards,
Christian.

Am 15.06.22 um 02:40 schrieb Bas Nieuwenhuizen:
> Hi Christian,
>
> Friendly ping on the comments here. Are you okay with me re-spinning
> the series with a fixed patch 1 or do we need further discussion on
> the approach here?
>
> Thanks,
> Bas
>
> On Mon, Jun 6, 2022 at 1:00 PM Bas Nieuwenhuizen
> <bas@basnieuwenhuizen.nl> wrote:
>> On Mon, Jun 6, 2022 at 12:35 PM Christian König
>> <christian.koenig@amd.com> wrote:
>>> Am 06.06.22 um 12:30 schrieb Bas Nieuwenhuizen:
>>>> On Mon, Jun 6, 2022 at 12:15 PM Christian König
>>>> <christian.koenig@amd.com> wrote:
>>>>>
>>>>> Am 03.06.22 um 21:11 schrieb Bas Nieuwenhuizen:
>>>>>> On Fri, Jun 3, 2022 at 8:41 PM Christian König <christian.koenig@amd.com> wrote:
>>>>>>> Am 03.06.22 um 19:50 schrieb Bas Nieuwenhuizen:
>>>>>>>> [SNIP]
>>>>>>>>>>> Yeah, but that's exactly the bubble we try to avoid. Isn't it?
>>>>>>>>>> For this series, not really.  To clarify there are two sides for
>>>>>>>>>> getting GPU bubbles and no overlap:
>>>>>>>>>>
>>>>>>>>>> (1) VM operations implicitly wait for earlier CS submissions
>>>>>>>>>> (2) CS submissions implicitly wait for earlier VM operations
>>>>>>>>>>
>>>>>>>>>> Together, these combine to ensure that you get a (potentially small)
>>>>>>>>>> bubble any time VM work happens.
>>>>>>>>>>
>>>>>>>>>> Your series (and further ideas) tackles (2), and is a worthwhile thing
>>>>>>>>>> to do. However, while writing the userspace for this I noticed this
>>>>>>>>>> isn't enough to get rid of all our GPU bubbles. In particular when
>>>>>>>>>> doing a non-sparse map of a new BO, that tends to need to be waited on
>>>>>>>>>> for the next CS anyway for API semantics. Due to VM operations
>>>>>>>>>> happening on a single timeline that means this high priority map can
>>>>>>>>>> end up being blocked by earlier sparse maps and hence the bubble in
>>>>>>>>>> that case still exists.
>>>>>>>>>>
>>>>>>>>>> So in this series I try to tackle (1) instead. Since GPU work
>>>>>>>>>> typically lags behind CPU submissions and VM operations aren't that
>>>>>>>>>> slow, we can typically execute VM operations early enough that any
>>>>>>>>>> implicit syncs from (2) are less/no issue.
>>>>>>>>> Ok, once more since you don't seem to understand what I want to say: It
>>>>>>>>> isn't possible to fix #1 before you have fixed #2.
>>>>>>>>>
>>>>>>>>> The VM unmap operation here is a barrier which divides the CS operations
>>>>>>>>> in a before and after. This is intentional design.
>>>>>>>> Why is that barrier needed? The two barriers I got and understood and
>>>>>>>> I think we can deal with:
>>>>>>>>
>>>>>>>> 1) the VM unmap is a barrier between prior CS and later memory free.
>>>>>>>> 2) The TLB flush need to happen between a VM unmap and later CS.
>>>>>>>>
>>>>>>>> But why do we need the VM unmap to be a strict barrier between prior
>>>>>>>> CS and later CS?
>>>>>>> Exactly because of the two reasons you mentioned.
>>>>>> This is the part I'm not seeing. I get that removing #2 is a
>>>>>> nightmare, which is why I did something that doesn't violate that
>>>>>> constraint.
>>>>>>
>>>>>> Like if an explicit CS that was running before the VM operation  runs
>>>>>> till after the VM operation (and hence possibly till after the TLB
>>>>>> flush, or otherwise have the TLB flush not apply due to lack of async
>>>>>> TLB flush support), that is not an issue. It might see the state from
>>>>>> before the unmap, or after the unmap, or some intermediate state and
>>>>>> all of those would be okay.
>>>>>>
>>>>>> We still get the constraint that the TLB flush happens between the VM
>>>>>> unmap and later CS and hence the unmap is certainly visible to them.
>>>>> So you basically just want to set the sync mode in
>>>>> amdgpu_vm_update_range() to AMDGPU_SYNC_EXPLICIT even when it is an unmap?
>>>> Yes, with the caveat that I want to do that only for
>>>> DMA_RESV_USAGE_BOOKKEEP or higher (i.e. if we submit a CS with
>>>> implicit sync we get the old implicit behavior, if we submit a CS with
>>>> explicit sync we get the new explicit behavior). The rest of the
>>>> series is basically just for enabling explicit sync submissions.
>>> That part won't work at all and would cause additional synchronization
>>> problems.
>>>
>>> First of all for implicit synced CS we should use READ, not BOOKKEEP.
>>> Because BOOKKEEP would incorrectly be ignored by OpenGL importers. I've
>>> fixed that this causes memory corruption, but it is still nice to avoid.
>> Yes, what I'm saying is that on implicit sync CS submission should add
>> READ fences to the dma resv and on explicit sync CS submission should
>> add BOOKKEEP fences.
>>
>>> BOOKKEEP can only be used by VM updates themselves. So that they don't
>>> interfere with CS.
>> That is the point why we would go BOOKKEEP for explicit sync CS
>> submissions, no? Explicit submission shouldn't interfere with any
>> other CS submissions. That includes being totally ignored by GL
>> importers (if we want to have synchronization there between an
>> explicit submission and GL, userspace is expected to use Jason's
>> dmabuf fence import/export IOCTLs)
>>
>>> Then the second problem is that the VM IOCTL has absolutely no idea what
>>> the CS IOCTL would be doing. That's why we have added the EXPLICIT sync
>>> flag on the BO.
>> It doesn't need to? We just use a different sync_mode for BOOKKEEP
>> fences vs others:
>> https://patchwork.freedesktop.org/patch/487887/?series=104578&rev=2
>>
>> (the nice thing about doing it this way is that it is independent of
>> the IOCTL, i.e. also works for the delayed mapping changes we trigger
>> on CS submit)
>>
>>> Regards,
>>> Christian.
>>>
>>>>> That should be doable, but then you don't need any of the other changes.
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>>> #1 Is rather easy to fix, you just need to copy all dma_fences from the
>>>>>>> page table dma_resv object over to the BOs dma_resv object in the gem
>>>>>>> close handler. E.g. exactly what you suggested with the dma_resv_copy
>>>>>>> function.
>>>>>>>
>>>>>>> #2 is a nightmare.
>>>>>>>
>>>>>>> We can't move the TLB flush at the end of the unmap operation because on
>>>>>>> async TLB flushes are either a bit complicated (double flushes etc..) or
>>>>>>> don't even work at all because of hw bugs. So to have a reliable TLB
>>>>>>> flush we must make sure that nothing else is ongoing and that means
>>>>>>> CS->VM->CS barrier.
>>>>>>>
>>>>>>> We try very hard to circumvent that already on maps by (for example)
>>>>>>> using a completely new VMID for CS after the VM map operation.
>>>>>>>
>>>>>>> But for the unmap operation we would need some kind special dma_fence
>>>>>>> implementation which would not only wait for all existing dma_fence but
>>>>>>> also for the one added until the unmap operation is completed. Cause
>>>>>>> otherwise our operation we do at #1 would simply not catch all
>>>>>>> dma_fences which have access to the memory.
>>>>>>>
>>>>>>> That's certainly doable, but I think just using the drm_exec stuff I
>>>>>>> already came up with is easier.
>>>>>>>
>>>>>>> When we can grab locks for all the BOs involved amdgpu_vm_clear_freed()
>>>>>>> goes away and we can keep track of the unmap operations in the bo_va
>>>>>>> structure.
>>>>>>>
>>>>>>> With that done you can make the explicit sync you noted in the bo_va
>>>>>>> structure and implicit sync when the bo_va structure goes away.
>>>>>>>
>>>>>>> Then the only reason I can see why we would need a CS->VM dependency is
>>>>>>> implicit synchronization, and that's what we are trying to avoid here in
>>>>>>> the first place.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Christian.
>>>>>>>
>>>>>>>>> To get rid of this barrier you must first fix the part where CS
>>>>>>>>> submissions wait for the VM operation to complete, e.g. the necessity of
>>>>>>>>> the barrier.
>>>>>>>>>
>>>>>>>>> I'm working on this for a couple of years now and I'm really running out
>>>>>>>>> of idea how to explain this restriction.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Christian.
>>>>>>>>>



* Re: [RFC PATCH 3/5] drm/amdgpu: Allow explicit sync for VM ops.
  2022-06-15  7:00                                               ` Christian König
@ 2022-06-17 13:03                                                 ` Bas Nieuwenhuizen
  2022-06-17 13:08                                                   ` Christian König
  0 siblings, 1 reply; 46+ messages in thread
From: Bas Nieuwenhuizen @ 2022-06-17 13:03 UTC (permalink / raw)
  To: Christian König; +Cc: ML dri-devel

On Wed, Jun 15, 2022 at 9:00 AM Christian König
<christian.koenig@amd.com> wrote:
>
> Am 06.06.22 um 13:00 schrieb Bas Nieuwenhuizen:
> > On Mon, Jun 6, 2022 at 12:35 PM Christian König
> > <christian.koenig@amd.com> wrote:
> >> [SNIP]
> >> That part won't work at all and would cause additional synchronization
> >> problems.
> >>
> >> First of all for implicit synced CS we should use READ, not BOOKKEEP.
> >> Because BOOKKEEP would incorrectly be ignored by OpenGL importers. I've
> >> fixed that this causes memory corruption, but it is still nice to avoid.
> > Yes, what I'm saying is that on implicit sync CS submission should add
> > READ fences to the dma resv and on explicit sync CS submission should
> > add BOOKKEEP fences.
>
> No, exactly that is wrong.
>
> Implicit CS submissions should add WRITE fences.
>
> Explicit CS submissions should add READ fences.
>
> Only VM updates should add BOOKKEEP fences.
>
> >> BOOKKEEP can only be used by VM updates themselves. So that they don't
> >> interfere with CS.
> > That is the point why we would go BOOKKEEP for explicit sync CS
> > submissions, no? Explicit submission shouldn't interfere with any
> > other CS submissions. That includes being totally ignored by GL
> > importers (if we want to have synchronization there between an
> > explicit submission and GL, userspace is expected to use Jason's
> > dmabuf fence import/export IOCTLs)
>
> No, that would break existing DMA-buf rules.
>
> Explicit CS submissions are still a dependency for implicit submissions.

This is explicitly what we don't want for explicit submissions and why
I waited with this series until the DMA_RESV_USAGE series landed. We
wish to opt out from implicit sync completely, and just use the IOCTLs
Jason wrote for back-compat with windowing systems that need it.

If BOOKKEEP isn't for that, should we add a new USAGE?

>
> >
> > Then the second problem is that the VM IOCTL has absolutely no idea what
> > the CS IOCTL would be doing. That's why we have added the EXPLICIT sync
> > flag on the BO.
> > It doesn't need to? We just use a different sync_mode for BOOKKEEP
> > fences vs others:
> > https://patchwork.freedesktop.org/patch/487887/?series=104578&rev=2
>
> No, exactly that's completely broken.
>
> Regards,
> Christian.
>
> >
> > (the nice thing about doing it this way is that it is independent of
> > the IOCTL, i.e. also works for the delayed mapping changes we trigger
> > on CS submit)
> >
> >> Regards,
> >> Christian.
> >>
> >>>> That should be doable, but then you don't need any of the other changes.
> >>>>
> >>>> Regards,
> >>>> Christian.
> >>>>
> >>>>>> #1 Is rather easy to fix, you just need to copy all dma_fences from the
> >>>>>> page table dma_resv object over to the BOs dma_resv object in the gem
> >>>>>> close handler. E.g. exactly what you suggested with the dma_resv_copy
> >>>>>> function.
> >>>>>>
> >>>>>> #2 is a nightmare.
> >>>>>>
> >>>>>> We can't move the TLB flush at the end of the unmap operation because on
> >>>>>> async TLB flushes are either a bit complicated (double flushes etc..) or
> >>>>>> don't even work at all because of hw bugs. So to have a reliable TLB
> >>>>>> flush we must make sure that nothing else is ongoing and that means
> >>>>>> CS->VM->CS barrier.
> >>>>>>
> >>>>>> We try very hard to circumvent that already on maps by (for example)
> >>>>>> using a completely new VMID for CS after the VM map operation.
> >>>>>>
> >>>>>> But for the unmap operation we would need some kind special dma_fence
> >>>>>> implementation which would not only wait for all existing dma_fence but
> >>>>>> also for the one added until the unmap operation is completed. Cause
> >>>>>> otherwise our operation we do at #1 would simply not catch all
> >>>>>> dma_fences which have access to the memory.
> >>>>>>
> >>>>>> That's certainly doable, but I think just using the drm_exec stuff I
> >>>>>> already came up with is easier.
> >>>>>>
> >>>>>> When we can grab locks for all the BOs involved amdgpu_vm_clear_freed()
> >>>>>> goes away and we can keep track of the unmap operations in the bo_va
> >>>>>> structure.
> >>>>>>
> >>>>>> With that done you can make the explicit sync you noted in the bo_va
> >>>>>> structure and implicit sync when the bo_va structure goes away.
> >>>>>>
> >>>>>> Then the only reason I can see why we would need a CS->VM dependency is
> >>>>>> implicit synchronization, and that's what we are trying to avoid here in
> >>>>>> the first place.
> >>>>>>
> >>>>>> Regards,
> >>>>>> Christian.
> >>>>>>
> >>>>>>>> To get rid of this barrier you must first fix the part where CS
> >>>>>>>> submissions wait for the VM operation to complete, e.g. the necessity of
> >>>>>>>> the barrier.
> >>>>>>>>
> >>>>>>>> I'm working on this for a couple of years now and I'm really running out
> >>>>>>>> of idea how to explain this restriction.
> >>>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> Christian.
> >>>>>>>>
>


* Re: [RFC PATCH 3/5] drm/amdgpu: Allow explicit sync for VM ops.
  2022-06-17 13:03                                                 ` Bas Nieuwenhuizen
@ 2022-06-17 13:08                                                   ` Christian König
  2022-06-24 20:34                                                     ` Daniel Vetter
  0 siblings, 1 reply; 46+ messages in thread
From: Christian König @ 2022-06-17 13:08 UTC (permalink / raw)
  To: Bas Nieuwenhuizen; +Cc: ML dri-devel

Am 17.06.22 um 15:03 schrieb Bas Nieuwenhuizen:
> [SNIP]
>>>> BOOKKEEP can only be used by VM updates themselves. So that they don't
>>>> interfere with CS.
>>> That is the point why we would go BOOKKEEP for explicit sync CS
>>> submissions, no? Explicit submission shouldn't interfere with any
>>> other CS submissions. That includes being totally ignored by GL
>>> importers (if we want to have synchronization there between an
>>> explicit submission and GL, userspace is expected to use Jason's
>>> dmabuf fence import/export IOCTLs)
>> No, that would break existing DMA-buf rules.
>>
>> Explicit CS submissions are still a dependency for implicit submissions.
> This is explicitly what we don't want for explicit submissions and why
> I waited with this series until the DMA_RESV_USAGE series landed. We
> wish to opt out from implicit sync completely, and just use the IOCTLs
> Jason wrote for back-compat with windowing systems that need it.
>
> If BOOKKEEP isn't for that, should we add a new USAGE?

BOOKKEEP is exactly for that, but as discussed with Daniel that's not 
what we want in the kernel.

When you mix implicit with explicit synchronization (OpenGL with RADV 
for example) it should be mandatory for the OpenGL side to wait for any 
RADV submission before issuing an operation.

What you want to do is intentionally not supported.

Regards,
Christian.


* Re: [RFC PATCH 3/5] drm/amdgpu: Allow explicit sync for VM ops.
  2022-06-17 13:08                                                   ` Christian König
@ 2022-06-24 20:34                                                     ` Daniel Vetter
  2022-06-25 13:58                                                       ` Christian König
  0 siblings, 1 reply; 46+ messages in thread
From: Daniel Vetter @ 2022-06-24 20:34 UTC (permalink / raw)
  To: Christian König; +Cc: ML dri-devel

Digging out of a hole, apologies to everyone.

On Fri, Jun 17, 2022 at 03:08:00PM +0200, Christian König wrote:
> Am 17.06.22 um 15:03 schrieb Bas Nieuwenhuizen:
> > [SNIP]
> > > > > BOOKKEEP can only be used by VM updates themselves. So that they don't
> > > > > interfere with CS.
> > > > That is the point why we would go BOOKKEEP for explicit sync CS
> > > > submissions, no? Explicit submission shouldn't interfere with any
> > > > other CS submissions. That includes being totally ignored by GL
> > > > importers (if we want to have synchronization there between an
> > > > explicit submission and GL, userspace is expected to use Jason's
> > > > dmabuf fence import/export IOCTLs)
> > > No, that would break existing DMA-buf rules.
> > > 
> > > Explicit CS submissions are still a dependency for implicit submissions.
> > This is explicitly what we don't want for explicit submissions and why
> > I waited with this series until the DMA_RESV_USAGE series landed. We
> > wish to opt out from implicit sync completely, and just use the IOCTLs
> > Jason wrote for back-compat with windowing systems that need it.
> > 
> > If BOOKKEEP isn't for that, should we add a new USAGE?
> 
> BOOKKEEP is exactly for that, but as discussed with Daniel that's not what
> we want in the kernel.

Not sure which Daniel you talked to, but this wasn't me.

> When you mix implicit with explicit synchronization (OpenGL with RADV for
> example) it should be mandatory for the OpenGL to wait for any RADV
> submission before issuing an operation.
> 
> What you want to do is intentionally not supported.

vk is very intentional in its rejection of any implicit sync. Which means
when you share a buffer with gl, even in _that_ case there must be no sync
automatically, or your implementation is kinda shit. Instead anyone
sharing a buffer with vk and using it in gl must take care of sync by
importing the timeline syncobj to gl, that's why all these extensions got
added.

This leaves libva in the cold, but hey libva didn't even get around to
adding the full set of modifier extensions so I can't really get myself to
care.

So summary this means:

- a CS/execbuf for vk should _only_ set BOOKKEEPING fences (except ofc if
  there's memory management moves in the preparation, which use KERNEL
  fences and then become additional dependencies for the job; sketched
  right after this list)

- because the vk memory model is that everything currently bound can always
  be used, this means you set BOOKKEEPING on absolutely everything. The
  current clever trick amdgpu has with shared buffers is also not really
  the right thing.

- implicit sync is only controlled through the new import/export ioctl on
  the dma-buf

- if you set any READ/WRITE fences anywhere else, you have potential
  oversync compared to what the vk spec would want

- userspace gets to keep absolutely all the pieces here. Which is not an
  issue, because userspace is totally allowed to fill a buffer with
  garbage and hand that to the compositor already, so there's nothing new
  going wrong here.

- ideally (definitely required for vk sparse) when you unbind or rebind
  then the BOOKKEEPING fences for the vm/ctx for the old buffers get
  simply replaced by the pte clearing and tlb flushing fences (like amdkfd
  does for compute, vk really just wants to look like compute in
  everything). In practice, especially with partial and multiple mappings
  of the same underlying bo involved, this might be too expensive to
  accurately track since you can only do the replacement trick when the
  last mapping is gone. It might be worth it for private bo though, dunno.
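
The first point above as a minimal sketch (helper names and the dependency 
callback are illustrative; the caller holds the reservation lock):

#include <linux/dma-resv.h>
#include <linux/dma-fence.h>

/*
 * Sketch of a vk-style CS: take only the KERNEL fences (memory moves/clears)
 * as job dependencies, then publish the job fence as BOOKKEEP so it never
 * becomes an implicit dependency for anyone else.
 */
static int vk_style_cs_fences(struct dma_resv *resv,
			      struct dma_fence *job_fence,
			      int (*add_job_dependency)(struct dma_fence *f))
{
	struct dma_resv_iter cursor;
	struct dma_fence *fence;
	int r;

	dma_resv_for_each_fence(&cursor, resv, DMA_RESV_USAGE_KERNEL, fence) {
		r = add_job_dependency(fence); /* expected to grab its own ref */
		if (r)
			return r;
	}

	r = dma_resv_reserve_fences(resv, 1);
	if (r)
		return r;

	dma_resv_add_fence(resv, job_fence, DMA_RESV_USAGE_BOOKKEEP);
	return 0;
}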

For amdgpu the current special owner checks mostly allow you to get the
semantics vulkan wants. But it breaks down when you have cross-device or
cross-process sharing.
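
For context, the "special owner checks" are roughly the amdgpu_sync mode 
filters; a heavily simplified sketch (the real code in amdgpu_sync.c also 
always honours kernel fences and has KFD special cases):

#include "amdgpu_sync.h"

static bool sync_needs_fence(enum amdgpu_sync_mode mode,
			     const void *fence_owner, const void *owner)
{
	switch (mode) {
	case AMDGPU_SYNC_ALWAYS:
		return true;
	case AMDGPU_SYNC_NE_OWNER:
		/* Skip fences from the same owner, e.g. the same VM. */
		return fence_owner != owner;
	case AMDGPU_SYNC_EQ_OWNER:
		return fence_owner == owner;
	case AMDGPU_SYNC_EXPLICIT:
		return false;
	}
	return true;
}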

We should probably also document this in the kerneldoc for the BOOKKEEPING
usage that this is the fence type that vulkan cs should use in all
drivers, otherwise this will become an endless mess of driver specific
hacks (i.e. the world we currently live in).
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


* Re: [RFC PATCH 3/5] drm/amdgpu: Allow explicit sync for VM ops.
  2022-06-24 20:34                                                     ` Daniel Vetter
@ 2022-06-25 13:58                                                       ` Christian König
  2022-06-25 22:45                                                         ` Daniel Vetter
  0 siblings, 1 reply; 46+ messages in thread
From: Christian König @ 2022-06-25 13:58 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: ML dri-devel


Am 24.06.22 um 22:34 schrieb Daniel Vetter:
> Digging out of a hole, apologies to everyone.

No problem, I'm totally overworked as well.

> On Fri, Jun 17, 2022 at 03:08:00PM +0200, Christian König wrote:
>> Am 17.06.22 um 15:03 schrieb Bas Nieuwenhuizen:
>>> [SNIP]
>> BOOKKEEP is exactly for that, but as discussed with Daniel that's not what
>> we want in the kernel.
> Not sure which Daniel you talked to, but this wasn't me.

Hui what? Of course I'm talking about you.

>> When you mix implicit with explicit synchronization (OpenGL with RADV for
>> example) it should be mandatory for the OpenGL to wait for any RADV
>> submission before issuing an operation.
>>
>> What you want to do is intentionally not supported.
> vk is very intentional in it's rejecting of any implicit sync.

[SNIP]

> We should probably also document this in the kerneldoc for the BOOKKEEPING
> usage that this is the fence type that vulkan cs should use in all
> drivers, otherwise this will become an endless mess of driver specific
> hacks (i.e. the world we currently live in).

Well Daniel, somehow we are not talking about the same thing here :)

I've documented exactly what you describe above in the initial patch 
which added BOOKKEEPING (I still called it OTHER in that iteration):

> > + /**
> > + * @DMA_RESV_USAGE_OTHER: No implicit sync.
> > + *
> > + * This should be used for operations which don't want to add an
> > + * implicit dependency at all, but still have a dependency on memory
> > + * management.
> > + *
> > + * This might include things like preemption fences as well as device
> > + * page table updates or even userspace command submissions.
> > + *
> > + * The kernel memory management *always* need to wait for those fences
> > + * before moving or freeing the resource protected by the dma_resv
> > + * object.
> > + */
> > + DMA_RESV_USAGE_OTHER

Later on I've even explicitly mentioned that this is for Vulkan submissions.

But it was *you* who made me remove that with the explanation that we 
have to use READ for that or we break existing userspace.

I mean that still makes a lot of sense to me because if I'm not 
completely mistaken we do have use cases which break, especially 
Vulkan+encoding.

Regards,
Christian.

> -Daniel



* Re: [RFC PATCH 3/5] drm/amdgpu: Allow explicit sync for VM ops.
  2022-06-25 13:58                                                       ` Christian König
@ 2022-06-25 22:45                                                         ` Daniel Vetter
  2022-07-04 13:37                                                           ` Christian König
  0 siblings, 1 reply; 46+ messages in thread
From: Daniel Vetter @ 2022-06-25 22:45 UTC (permalink / raw)
  To: Christian König; +Cc: ML dri-devel

On Sat, Jun 25, 2022 at 03:58:17PM +0200, Christian König wrote:
> Am 24.06.22 um 22:34 schrieb Daniel Vetter:
> > Digging out of a hole, apologies to everyone.
> 
> No problem, I'm totally overworked as well.
> 
> > On Fri, Jun 17, 2022 at 03:08:00PM +0200, Christian König wrote:
> > > Am 17.06.22 um 15:03 schrieb Bas Nieuwenhuizen:
> > > > [SNIP]
> > > BOOKKEEP is exactly for that, but as discussed with Daniel that's not what
> > > we want in the kernel.
> > Not sure which Daniel you talked to, but this wasn't me.
> 
> Hui what? Of course I'm talking about you.
> 
> > > When you mix implicit with explicit synchronization (OpenGL with RADV for
> > > example) it should be mandatory for the OpenGL to wait for any RADV
> > > submission before issuing an operation.
> > > 
> > > What you want to do is intentionally not supported.
> > vk is very intentional in it's rejecting of any implicit sync.
> 
> [SNIP]
> 
> > We should probably also document this in the kerneldoc for the BOOKKEEPING
> > usage that this is the fence type that vulkan cs should use in all
> > drivers, otherwise this will become an endless mess of driver specific
> > hacks (i.e. the world we currently live in).
> 
> Well, Daniel somehow we are somehow not talking about the same thing here :)
> 
> I've documented exactly what you describe above in the initial patch which
> added BOOKKEEPING (I've still called it OTHER in that iteration):
> 
> > > + /**
> > > + * @DMA_RESV_USAGE_OTHER: No implicit sync.
> > > + *
> > > + * This should be used for operations which don't want to add an
> > > + * implicit dependency at all, but still have a dependency on memory
> > > + * management.
> > > + *
> > > + * This might include things like preemption fences as well as device
> > > + * page table updates or even userspace command submissions.
> > > + *
> > > + * The kernel memory management *always* need to wait for those fences
> > > + * before moving or freeing the resource protected by the dma_resv
> > > + * object.
> > > + */
> > > + DMA_RESV_USAGE_OTHER
> 
> Later on I've even explicitly mentioned that this is for Vulkan submissions.
> 
> But it was *you* who made me remove that with the explanation that we have
> to use READ for that or we break existing userspace.

Hm, in the only discussion I've found I actually mentioned we should highlight
that vk should use OTHER even more than what you had. Quoting myself:

> +      * This might include things like preemption fences as well as device
> +      * page table updates or even userspace command submissions.

I think we should highlight a bit more that for explicitly synchronized
userspace like vk OTHER is the normal case. So really not an exception.
Ofc aside from amdkfd there's currently no driver doing this, but really
we should have lots of them ...

See https://lore.kernel.org/dri-devel/YZ+y+Uwo809qtvs5@phenom.ffwll.local/

I didn't find anything else. So not sure how we managed to create
confusion here :-(

> I mean that still makes a lot of sense to me because if I'm not completely
> mistaken we do have use cases which break, especially Vulkan+encoding.

Yeah I think we only have some communication fumble here, nothing else :-)

And yes libva doesn't have any support for vk's explicit sync model, so
that will just fall flat on its face. Might motivate a few folks to fix
libva :-)

Note that on i915 side it's exactly the same, we've also been setting the
READ fence thus far. Since the breakage will be introduced by upgrading
mesa we'll at least avoid the kernel regression complaints, or at least I
hope we can get away with that.

Since really I don't have any idea how it could be fixed otherwise, except
through some really, really terrible hacks. Maybe kernel module option or
so.

Anyway I think all we need is just a patch to the dma_resv docs to explain
that USAGE_BOOKKEEPING is what vulkan userspace wants, and why. Bas,
you're up to typing that?

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


* Re: [RFC PATCH 3/5] drm/amdgpu: Allow explicit sync for VM ops.
  2022-06-25 22:45                                                         ` Daniel Vetter
@ 2022-07-04 13:37                                                           ` Christian König
  2022-08-09 14:37                                                             ` Daniel Vetter
  0 siblings, 1 reply; 46+ messages in thread
From: Christian König @ 2022-07-04 13:37 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: ML dri-devel

Hey Daniel,

Am 26.06.22 um 00:45 schrieb Daniel Vetter:
> [SNIP]
> I think we should highlight a bit more that for explicitly synchronized
> userspace like vk OTHER is the normal case. So really not an exception.
> Ofc aside from amdkfd there's currently no driver doing this, but really
> we should have lots of them ...
>
> See https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Fdri-devel%2FYZ%2By%2BUwo809qtvs5%40phenom.ffwll.local%2F&amp;data=05%7C01%7Cchristian.koenig%40amd.com%7C88037a566a8d4c8d4aca08da56fc6e3c%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637917939428739923%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&amp;sdata=6sYto7GCLw8i3pT9OCFN1l6dxeYYHPghzKDMYxqUw90%3D&amp;reserved=0
>
> I didn't find anything else. So not sure how we managed to create
> confusion here :-(

Well you said something like "Yeah, READ is supposed to be used for 
that...." and that created the impression that AMDGPU should start using 
that for Vulkan submissions and that you were rejecting my idea of using 
BOOKKEEP for that.

>> I mean that still makes a lot of sense to me because if I'm not completely
>> mistaken we do have use cases which break, especially Vulkan+encoding.
> Yeah I think we only have some communication fumble here, nothing else :-)

Ok, well then @Bas: Sorry for all the noise, we are actually all on the 
same page :)

> And yes libva doesn't have any support for vk's explicit sync model, so
> that will just fall flat on its face. Might motivate a few folks to fix
> libva :-)

Well that's not the problem. The problem is that we have a couple of use 
cases where libva is supposed to encode a Vulkan surface without letting 
Vulkan know about that.

In other words: Application shares DMA-buf between Vulkan and VA-API, 
renders with Vulkan and encodes with VA-API without any explicit 
synchronization between the two.
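
For context, the Vulkan side of such a setup just exports the memory as a
dma-buf fd and the application hands that fd to VA-API; no fence travels
with it, so everything relies on the kernel's implicit sync. A sketch of
the export (error handling trimmed, helper name made up):

#include <vulkan/vulkan.h>

/* Sketch: export VkDeviceMemory as a dma-buf fd
 * (VK_KHR_external_memory_fd + VK_EXT_external_memory_dma_buf).
 */
static int export_dma_buf_fd(VkDevice dev, VkDeviceMemory mem)
{
        int fd = -1;
        const VkMemoryGetFdInfoKHR info = {
                .sType = VK_STRUCTURE_TYPE_MEMORY_GET_FD_INFO_KHR,
                .memory = mem,
                .handleType = VK_EXTERNAL_MEMORY_HANDLE_TYPE_DMA_BUF_BIT_EXT,
        };
        PFN_vkGetMemoryFdKHR get_fd = (PFN_vkGetMemoryFdKHR)
                vkGetDeviceProcAddr(dev, "vkGetMemoryFdKHR");

        if (!get_fd || get_fd(dev, &info, &fd) != VK_SUCCESS)
                return -1;
        return fd; /* e.g. imported by libva as a DRM PRIME surface */
}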

I know that this is absolutely against the Vulkan specification, but it 
just happened to work fine. And when you break something which used to 
work people start to complain...

> Note that on i915 side it's exactly the same, we've also been setting the
> READ fence thus far. Since the breakage will be introduced by upgrading
> mesa we'll at least avoid the kernel regression complaints, or at least I
> hope we can get away with that.

Yeah, the path to salvation starts with the words: It's not my f...
problem :)

> Since really I don't have any idea how it could be fixed otherwise, except
> through some really, really terrible hacks. Maybe kernel module option or
> so.
>
> Anyway I think all we need is just a patch to the dma_resv docs to explain
> that USAGE_BOOKKEEP is what vulkan userspace wants, and why. Bas,
> you're up for typing that?

I can do that. I'm just back from a week of vacation and still digging 
through my mails.

Cheers,
Christian.

>
> Cheers, Daniel


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 3/5] drm/amdgpu: Allow explicit sync for VM ops.
  2022-07-04 13:37                                                           ` Christian König
@ 2022-08-09 14:37                                                             ` Daniel Vetter
  0 siblings, 0 replies; 46+ messages in thread
From: Daniel Vetter @ 2022-08-09 14:37 UTC (permalink / raw)
  To: Christian König; +Cc: ML dri-devel

[Back from vacations and work change and out sick and absolutely
everything else going wrong]

On Mon, Jul 04, 2022 at 03:37:43PM +0200, Christian König wrote:
> Hey Daniel,
> 
> Am 26.06.22 um 00:45 schrieb Daniel Vetter:
> > [SNIP]
> > I think we should highlight a bit more that for explicitly synchronized
> > userspace like vk OTHER is the normal case. So really not an exception.
> > Ofc aside from amdkfd there's currently no driver doing this, but really
> > we should have lots of them ...
> > 
> > See https://lore.kernel.org/dri-devel/YZ+y+Uwo809qtvs5@phenom.ffwll.local/
> > 
> > I didn't find anything else. So not sure how we managed to create
> > confusion here :-(
> 
> Well you said something like "Yeah, READ is supposed to be used for
> that...." and that created the impression that AMDGPU should start using
> that for Vulkan submissions and that you were rejecting my idea of using
> BOOKKEEP for that.
> 
> > > I mean that still makes a lot of sense to me because if I'm not completely
> > > mistaken we do have use cases which break, especially Vulkan+encoding.
> > Yeah I think we only have some communication fumble here, nothing else :-)
> 
> Ok, well then @Bas: Sorry for all the noise, we are actually all on the same
> page :)
> 
> > And yes libva doesn't have any support for vk's explicit sync model, so
> > that will just fall flat on its face. Might motivate a few folks to fix
> > libva :-)
> 
> Well that's not the problem. The problem is that we have a couple of use
> cases where libva is supposed to encode a Vulkan surface without letting
> Vulkan know about that.
> 
> In other words: Application shares DMA-buf between Vulkan and VA-API,
> renders with Vulkan and encodes with VA-API without any explicit
> synchronization between the two.
> 
> I know that this is absolutely against the Vulkan specification, but it just
> happened to work fine. And when you break something which used to work
> people start to complain...

Yeah I feared that, and worse libva doesn't have the nice gl interop
extensions to make it actually work.

> > Note that on i915 side it's exactly the same, we've also been setting the
> > READ fence thus far. Since the breakage will be introduced by upgrading
> > mesa we'll at least avoid the kernel regression complaints, or at least I
> > hope we can get away with that.
> 
> Yeah, the path to salvation starts with the words: It's not my f... problem
> :)
> 
> > Since really I don't have any idea how it could be fixed otherwise, except
> > through some really, really terrible hacks. Maybe kernel module option or
> > so.
> > 
> > Anyway I think all we need is just a patch to the dma_resv docs to explain
> > that USAGE_BOOKKEEP is what vulkan userspace wants, and why. Bas,
> > you're up for typing that?
> 
> I can do that. I'm just back from a week of vacation and still digging
> through my mails.

Yeah I think the best path is that we push hard for adding the libva
syncobj extensions like gl has, so that this can be done properly. And
then just pave it over with kernel module options until userspace is fixed.
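
Purely as an illustration of what such a pave-over could look like (all
names here are made up, this is not an actual amdgpu parameter):

#include <linux/module.h>
#include <linux/dma-resv.h>

/* Hypothetical knob keeping the old implicit-sync behaviour for userspace
 * CS fences until the stack above is fixed.
 */
static bool force_implicit_sync = true;
module_param(force_implicit_sync, bool, 0444);
MODULE_PARM_DESC(force_implicit_sync,
                 "Attach userspace CS fences as READ instead of BOOKKEEP");

static enum dma_resv_usage cs_fence_usage(void)
{
        return force_implicit_sync ? DMA_RESV_USAGE_READ
                                   : DMA_RESV_USAGE_BOOKKEEP;
}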

If we end up making implicit sync part of the de facto vk api on linux, then
a _lot_ of people will be very sad :-(
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 46+ messages in thread

end of thread, other threads:[~2022-08-09 14:38 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-01  0:40 [RFC PATCH 0/5] Add option to disable implicit sync for userspace submits Bas Nieuwenhuizen
2022-06-01  0:40 ` [RFC PATCH 1/5] drm/ttm: Refactor num_shared into usage Bas Nieuwenhuizen
2022-06-01  8:02   ` Christian König
2022-06-01  8:11     ` Bas Nieuwenhuizen
2022-06-01  8:29       ` Christian König
2022-06-01  8:39         ` Bas Nieuwenhuizen
2022-06-01  8:42           ` Christian König
2022-06-01  8:41     ` Daniel Vetter
2022-06-01  8:47       ` Christian König
2022-06-01  0:40 ` [RFC PATCH 2/5] drm/amdgpu: Add separate mode for syncing DMA_RESV_USAGE_BOOKKEEP Bas Nieuwenhuizen
2022-06-01  0:40 ` [RFC PATCH 3/5] drm/amdgpu: Allow explicit sync for VM ops Bas Nieuwenhuizen
2022-06-01  8:03   ` Christian König
2022-06-01  8:16     ` Bas Nieuwenhuizen
2022-06-01  8:40       ` Christian König
2022-06-01  8:48         ` Bas Nieuwenhuizen
2022-06-01  8:59           ` Bas Nieuwenhuizen
2022-06-01  9:01           ` Christian König
2022-06-03  1:21             ` Bas Nieuwenhuizen
2022-06-03  8:11               ` Christian König
2022-06-03 10:08                 ` Bas Nieuwenhuizen
2022-06-03 10:16                   ` Christian König
2022-06-03 11:07                     ` Bas Nieuwenhuizen
2022-06-03 12:08                       ` Christian König
2022-06-03 12:39                         ` Bas Nieuwenhuizen
2022-06-03 12:49                           ` Christian König
2022-06-03 13:23                             ` Bas Nieuwenhuizen
2022-06-03 17:41                               ` Christian König
2022-06-03 17:50                                 ` Bas Nieuwenhuizen
2022-06-03 18:41                                   ` Christian König
2022-06-03 19:11                                     ` Bas Nieuwenhuizen
2022-06-06 10:15                                       ` Christian König
2022-06-06 10:30                                         ` Bas Nieuwenhuizen
2022-06-06 10:35                                           ` Christian König
2022-06-06 11:00                                             ` Bas Nieuwenhuizen
2022-06-15  0:40                                               ` Bas Nieuwenhuizen
2022-06-15  7:00                                                 ` Christian König
2022-06-15  7:00                                               ` Christian König
2022-06-17 13:03                                                 ` Bas Nieuwenhuizen
2022-06-17 13:08                                                   ` Christian König
2022-06-24 20:34                                                     ` Daniel Vetter
2022-06-25 13:58                                                       ` Christian König
2022-06-25 22:45                                                         ` Daniel Vetter
2022-07-04 13:37                                                           ` Christian König
2022-08-09 14:37                                                             ` Daniel Vetter
2022-06-01  0:40 ` [RFC PATCH 4/5] drm/amdgpu: Refactor amdgpu_vm_get_pd_bo Bas Nieuwenhuizen
2022-06-01  0:40 ` [RFC PATCH 5/5] drm/amdgpu: Add option to disable implicit sync for a context Bas Nieuwenhuizen

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.