All of lore.kernel.org
 help / color / mirror / Atom feed
* Tackling the indefinite/user DMA fence problem
@ 2022-05-02 16:37 Christian König
  2022-05-02 16:37 ` [PATCH 01/15] dma-buf: rename DMA_FENCE_FLAG_USER_BITS to _DEVICE Christian König
                   ` (15 more replies)
  0 siblings, 16 replies; 41+ messages in thread
From: Christian König @ 2022-05-02 16:37 UTC (permalink / raw)
  To: daniel, jason, daniels, skhawaja, maad.aldabagh, sergemetral,
	sumit.semwal, gustavo, Felix.Kuehling, alexander.deucher,
	tzimmermann, tvrtko.ursulin, linux-media, dri-devel,
	linaro-mm-sig

Hello everyone,

it's a well known problem that the DMA-buf subsystem mixed synchronization and memory management requirements into the same dma_fence and dma_resv objects. Because of this dma_fence objects need to guarantee that they complete within a finite amount of time or otherwise the system can easily deadlock.

One of the few good things about this problem is that it is really good understood by now.

Daniel and others came up with some documentation: https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_buf#indefinite-dma-fences

And Jason did an excellent presentation about that problem on last years LPC: https://lpc.events/event/11/contributions/1115/

Based on that we had been able to reject new implementations of infinite/user DMA fences and mitigate the effect of the few existing ones.

The still remaining down side is that we don't have a way of using user fences as dependency in both the explicit (sync_file, drm_syncobj) as well as the implicit (dma_resv) synchronization objects, resulting in numerous problems and limitations for things like HMM, user queues etc....

This patch set here now tries to tackle this problem by untangling the synchronization from the memory management. What it does *not* try to do is to fix the existing kernel fences, because I think we now can all agree on that this isn't really possible.

To archive this goal what I do in this patch set is to add some parallel infrastructure to cleanly separate normal kernel dma_fence objects from indefinite/user fences:

1. It introduce a DMA_FENCE_FLAG_USER define (after renaming some existing driver defines). To note that a certain dma_fence is an user fence and *must* be ignore by memory management and never used as dependency for normal none user dma_fence objects.

2. The dma_fence_array and dma_fence_chain containers are modified so that they are marked as user fences whenever any of their contained fences are an user fence.

3. The dma_resv object gets a new DMA_RESV_USAGE_USER flag which must be used with indefinite/user fences and separates those into it's own synchronization domain.

4. The existing dma_buf_poll_add_cb() function is modified so that indefinite/user fences are included in the polling.

5. The sync_file synchronization object is modified so that we essentially have two fence streams instead of just one.

6. The drm_syncobj is modified in a similar way. User fences are just ignored unless the driver explicitly states support to wait for them.

7. The DRM subsystem gains a new DRIVER_USER_FENCE flag which drivers can use to indicate the need for user fences. If user fences are used the atomic mode setting starts to support user fences as IN/OUT fences.

8. Lockdep is used at various critical locations to ensure that nobody ever tries to mix user fences with non user fences.

The general approach is to just ignore user fences unless a driver stated explicitely support for them.

On top of all of this I've hacked amdgpu so that we add the resulting CS fence only as kernel dependency to the dma_resv object and an additional wrapped up with a dma_fence_array and a stub user fence.

The result is that the newly added atomic modeset functions now correctly wait for the user fence to complete before doing the flip. And dependent CS don't pipeline any more, but rather block on the CPU before submitting work.

After tons of debugging and testing everything now seems to not go up in flames immediately and even lockdep is happy with the annotations.

I'm perfectly aware that this is probably by far the most controversial patch set I've ever created and I really wish we wouldn't need it. But we certainly have the requirement for this and I don't see much other chance to get that working in an UAPI compatible way.

Thoughts/comments?

Regards,
Christian.



^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 01/15] dma-buf: rename DMA_FENCE_FLAG_USER_BITS to _DEVICE
  2022-05-02 16:37 Tackling the indefinite/user DMA fence problem Christian König
@ 2022-05-02 16:37 ` Christian König
  2022-05-02 16:37 ` [PATCH 02/15] dma-buf: introduce user fence support Christian König
                   ` (14 subsequent siblings)
  15 siblings, 0 replies; 41+ messages in thread
From: Christian König @ 2022-05-02 16:37 UTC (permalink / raw)
  To: daniel, jason, daniels, skhawaja, maad.aldabagh, sergemetral,
	sumit.semwal, gustavo, Felix.Kuehling, alexander.deucher,
	tzimmermann, tvrtko.ursulin, linux-media, dri-devel,
	linaro-mm-sig
  Cc: Christian König

This is supposed to be used by device drivers.

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/i915/i915_request.h       | 2 +-
 drivers/gpu/drm/i915/i915_sw_fence_work.h | 2 +-
 drivers/gpu/drm/nouveau/nouveau_fence.c   | 4 ++--
 include/linux/dma-fence.h                 | 4 ++--
 4 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
index 28b1f9db5487..716c8c413cc4 100644
--- a/drivers/gpu/drm/i915/i915_request.h
+++ b/drivers/gpu/drm/i915/i915_request.h
@@ -80,7 +80,7 @@ enum {
 	 *
 	 * See i915_request_is_active()
 	 */
-	I915_FENCE_FLAG_ACTIVE = DMA_FENCE_FLAG_USER_BITS,
+	I915_FENCE_FLAG_ACTIVE = DMA_FENCE_FLAG_DRIVER,
 
 	/*
 	 * I915_FENCE_FLAG_PQUEUE - this request is ready for execution
diff --git a/drivers/gpu/drm/i915/i915_sw_fence_work.h b/drivers/gpu/drm/i915/i915_sw_fence_work.h
index d56806918d13..ece0a06e598c 100644
--- a/drivers/gpu/drm/i915/i915_sw_fence_work.h
+++ b/drivers/gpu/drm/i915/i915_sw_fence_work.h
@@ -33,7 +33,7 @@ struct dma_fence_work {
 };
 
 enum {
-	DMA_FENCE_WORK_IMM = DMA_FENCE_FLAG_USER_BITS,
+	DMA_FENCE_WORK_IMM = DMA_FENCE_FLAG_DRIVER,
 };
 
 void dma_fence_work_init(struct dma_fence_work *f,
diff --git a/drivers/gpu/drm/nouveau/nouveau_fence.c b/drivers/gpu/drm/nouveau/nouveau_fence.c
index 7f01dcf81fab..e2f61b34cc1e 100644
--- a/drivers/gpu/drm/nouveau/nouveau_fence.c
+++ b/drivers/gpu/drm/nouveau/nouveau_fence.c
@@ -61,7 +61,7 @@ nouveau_fence_signal(struct nouveau_fence *fence)
 	list_del(&fence->head);
 	rcu_assign_pointer(fence->channel, NULL);
 
-	if (test_bit(DMA_FENCE_FLAG_USER_BITS, &fence->base.flags)) {
+	if (test_bit(DMA_FENCE_FLAG_DRIVER, &fence->base.flags)) {
 		struct nouveau_fence_chan *fctx = nouveau_fctx(fence);
 
 		if (!--fctx->notify_ref)
@@ -510,7 +510,7 @@ static bool nouveau_fence_enable_signaling(struct dma_fence *f)
 
 	ret = nouveau_fence_no_signaling(f);
 	if (ret)
-		set_bit(DMA_FENCE_FLAG_USER_BITS, &fence->base.flags);
+		set_bit(DMA_FENCE_FLAG_DRIVER, &fence->base.flags);
 	else if (!--fctx->notify_ref)
 		nvif_notify_put(&fctx->notify);
 
diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h
index 775cdc0b4f24..afea82ec5946 100644
--- a/include/linux/dma-fence.h
+++ b/include/linux/dma-fence.h
@@ -49,7 +49,7 @@ struct dma_fence_cb;
  * DMA_FENCE_FLAG_SIGNALED_BIT - fence is already signaled
  * DMA_FENCE_FLAG_TIMESTAMP_BIT - timestamp recorded for fence signaling
  * DMA_FENCE_FLAG_ENABLE_SIGNAL_BIT - enable_signaling might have been called
- * DMA_FENCE_FLAG_USER_BITS - start of the unused bits, can be used by the
+ * DMA_FENCE_FLAG_DRIVER - start of the unused bits, can be used by the
  * implementer of the fence for its own purposes. Can be used in different
  * ways by different fence implementers, so do not rely on this.
  *
@@ -99,7 +99,7 @@ enum dma_fence_flag_bits {
 	DMA_FENCE_FLAG_SIGNALED_BIT,
 	DMA_FENCE_FLAG_TIMESTAMP_BIT,
 	DMA_FENCE_FLAG_ENABLE_SIGNAL_BIT,
-	DMA_FENCE_FLAG_USER_BITS, /* must always be last member */
+	DMA_FENCE_FLAG_DRIVER, /* must always be last member */
 };
 
 typedef void (*dma_fence_func_t)(struct dma_fence *fence,
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 02/15] dma-buf: introduce user fence support
  2022-05-02 16:37 Tackling the indefinite/user DMA fence problem Christian König
  2022-05-02 16:37 ` [PATCH 01/15] dma-buf: rename DMA_FENCE_FLAG_USER_BITS to _DEVICE Christian König
@ 2022-05-02 16:37 ` Christian König
  2022-05-04  7:53   ` Tvrtko Ursulin
  2022-05-02 16:37 ` [PATCH 03/15] dma-buf: add user fence support to dma_fence_array Christian König
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 41+ messages in thread
From: Christian König @ 2022-05-02 16:37 UTC (permalink / raw)
  To: daniel, jason, daniels, skhawaja, maad.aldabagh, sergemetral,
	sumit.semwal, gustavo, Felix.Kuehling, alexander.deucher,
	tzimmermann, tvrtko.ursulin, linux-media, dri-devel,
	linaro-mm-sig
  Cc: Christian König

Start introducing a new part of the framework for handling user fences.

In strict opposition to normal fences user fences don't reliable finish in
a fixed amount of time and therefore can't be used as dependency in memory
management.

Because of this user fences are marked with DMA_FENCE_FLAG_USER. Lockdep
is checked that we can at least fault user pages when we check them for
signaling.

This patch also adds a flag to dma_fence_get_stub() so that we can
retrieve a signaled user fence. This can be used together with lockdep to
test the handling in code path supporting user fences.

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/dma-buf/dma-fence-unwrap.c            |  4 +--
 drivers/dma-buf/dma-fence.c                   | 31 ++++++++++++-------
 drivers/dma-buf/st-dma-fence.c                |  2 +-
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c        |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c      |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c        |  4 +--
 drivers/gpu/drm/drm_syncobj.c                 | 10 +++---
 include/linux/dma-fence.h                     | 17 +++++++++-
 9 files changed, 49 insertions(+), 25 deletions(-)

diff --git a/drivers/dma-buf/dma-fence-unwrap.c b/drivers/dma-buf/dma-fence-unwrap.c
index c9becc74896d..87ee2efced10 100644
--- a/drivers/dma-buf/dma-fence-unwrap.c
+++ b/drivers/dma-buf/dma-fence-unwrap.c
@@ -76,7 +76,7 @@ struct dma_fence *__dma_fence_merge(unsigned int num_fences,
 	}
 
 	if (count == 0)
-		return dma_fence_get_stub();
+		return dma_fence_get_stub(false);
 
 	if (count > INT_MAX)
 		return NULL;
@@ -129,7 +129,7 @@ struct dma_fence *__dma_fence_merge(unsigned int num_fences,
 	} while (tmp);
 
 	if (count == 0) {
-		tmp = dma_fence_get_stub();
+		tmp = dma_fence_get_stub(false);
 		goto return_tmp;
 	}
 
diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
index 066400ed8841..52873f5eaeba 100644
--- a/drivers/dma-buf/dma-fence.c
+++ b/drivers/dma-buf/dma-fence.c
@@ -26,6 +26,7 @@ EXPORT_TRACEPOINT_SYMBOL(dma_fence_signaled);
 
 static DEFINE_SPINLOCK(dma_fence_stub_lock);
 static struct dma_fence dma_fence_stub;
+static struct dma_fence dma_fence_user_stub;
 
 /*
  * fence context counter: each execution context should have its own
@@ -123,24 +124,28 @@ static const struct dma_fence_ops dma_fence_stub_ops = {
 
 /**
  * dma_fence_get_stub - return a signaled fence
+ * @user: if true the returned fence is an user fence
  *
- * Return a stub fence which is already signaled. The fence's
- * timestamp corresponds to the first time after boot this
- * function is called.
+ * Return a stub fence which is already signaled. The fence's timestamp
+ * corresponds to the first time after boot this function is called. If @user is
+ * true an user fence is returned which can be used with lockdep to test user
+ * fence saveness in a code path.
  */
-struct dma_fence *dma_fence_get_stub(void)
+struct dma_fence *dma_fence_get_stub(bool user)
 {
+	struct dma_fence *fence = user ? &dma_fence_stub : &dma_fence_user_stub;
+
 	spin_lock(&dma_fence_stub_lock);
-	if (!dma_fence_stub.ops) {
-		dma_fence_init(&dma_fence_stub,
-			       &dma_fence_stub_ops,
-			       &dma_fence_stub_lock,
+	if (!fence->ops) {
+		dma_fence_init(fence, &dma_fence_stub_ops, &dma_fence_stub_lock,
 			       0, 0);
-		dma_fence_signal_locked(&dma_fence_stub);
+		if (user)
+			set_bit(DMA_FENCE_FLAG_USER, &fence->flags);
+		dma_fence_signal_locked(fence);
 	}
 	spin_unlock(&dma_fence_stub_lock);
 
-	return dma_fence_get(&dma_fence_stub);
+	return dma_fence_get(fence);
 }
 EXPORT_SYMBOL(dma_fence_get_stub);
 
@@ -497,8 +502,9 @@ dma_fence_wait_timeout(struct dma_fence *fence, bool intr, signed long timeout)
 		return -EINVAL;
 
 	might_sleep();
-
 	__dma_fence_might_wait();
+	if (test_bit(DMA_FENCE_FLAG_USER, &fence->flags))
+		might_fault();
 
 	trace_dma_fence_wait_start(fence);
 	if (fence->ops->wait)
@@ -870,6 +876,9 @@ dma_fence_wait_any_timeout(struct dma_fence **fences, uint32_t count,
 	for (i = 0; i < count; ++i) {
 		struct dma_fence *fence = fences[i];
 
+		if (test_bit(DMA_FENCE_FLAG_USER, &fence->flags))
+			might_fault();
+
 		cb[i].task = current;
 		if (dma_fence_add_callback(fence, &cb[i].base,
 					   dma_fence_default_wait_cb)) {
diff --git a/drivers/dma-buf/st-dma-fence.c b/drivers/dma-buf/st-dma-fence.c
index c8a12d7ad71a..50f757f75645 100644
--- a/drivers/dma-buf/st-dma-fence.c
+++ b/drivers/dma-buf/st-dma-fence.c
@@ -412,7 +412,7 @@ static int test_stub(void *arg)
 	int i;
 
 	for (i = 0; i < ARRAY_SIZE(f); i++) {
-		f[i] = dma_fence_get_stub();
+		f[i] = dma_fence_get_stub((i % 2) == 1);
 		if (!dma_fence_is_signaled(f[i])) {
 			pr_err("Obtained unsignaled stub fence!\n");
 			goto err;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 64ac4f8f49be..541c59635c34 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -263,7 +263,7 @@ static int amdgpu_amdkfd_remove_eviction_fence(struct amdgpu_bo *bo,
 	/* TODO: Instead of block before we should use the fence of the page
 	 * table update and TLB flush here directly.
 	 */
-	replacement = dma_fence_get_stub();
+	replacement = dma_fence_get_stub(false);
 	dma_resv_replace_fences(bo->tbo.base.resv, ef->base.context,
 				replacement, DMA_RESV_USAGE_READ);
 	dma_fence_put(replacement);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index a28b7947a034..95eeab527ca9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -1399,7 +1399,7 @@ int amdgpu_cs_fence_to_handle_ioctl(struct drm_device *dev, void *data,
 		return PTR_ERR(fence);
 
 	if (!fence)
-		fence = dma_fence_get_stub();
+		fence = dma_fence_get_stub(false);
 
 	switch (info->in.what) {
 	case AMDGPU_FENCE_TO_HANDLE_GET_SYNCOBJ:
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
index 7f33ae87cb41..73165f387f3b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -193,7 +193,7 @@ int amdgpu_ring_init(struct amdgpu_device *adev, struct amdgpu_ring *ring,
 		adev->rings[ring->idx] = ring;
 		ring->num_hw_submission = sched_hw_submission;
 		ring->sched_score = sched_score;
-		ring->vmid_wait = dma_fence_get_stub();
+		ring->vmid_wait = dma_fence_get_stub(false);
 		r = amdgpu_fence_driver_init_ring(ring);
 		if (r)
 			return r;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 7e5cc8323329..e5c8e72a9485 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -1689,7 +1689,7 @@ int amdgpu_vm_bo_update_mapping(struct amdgpu_device *adev,
 	}
 
 	if (!unlocked && !dma_fence_is_signaled(vm->last_unlocked)) {
-		struct dma_fence *tmp = dma_fence_get_stub();
+		struct dma_fence *tmp = dma_fence_get_stub(false);
 
 		amdgpu_bo_fence(vm->root.bo, vm->last_unlocked, true);
 		swap(vm->last_unlocked, tmp);
@@ -2905,7 +2905,7 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm)
 	else
 		vm->update_funcs = &amdgpu_vm_sdma_funcs;
 	vm->last_update = NULL;
-	vm->last_unlocked = dma_fence_get_stub();
+	vm->last_unlocked = dma_fence_get_stub(false);
 
 	mutex_init(&vm->eviction_lock);
 	vm->evicting = false;
diff --git a/drivers/gpu/drm/drm_syncobj.c b/drivers/gpu/drm/drm_syncobj.c
index 7e48dcd1bee4..5a961ea90a35 100644
--- a/drivers/gpu/drm/drm_syncobj.c
+++ b/drivers/gpu/drm/drm_syncobj.c
@@ -255,7 +255,7 @@ static void drm_syncobj_fence_add_wait(struct drm_syncobj *syncobj,
 		dma_fence_put(fence);
 		list_add_tail(&wait->node, &syncobj->cb_list);
 	} else if (!fence) {
-		wait->fence = dma_fence_get_stub();
+		wait->fence = dma_fence_get_stub(false);
 	} else {
 		wait->fence = fence;
 	}
@@ -411,7 +411,7 @@ int drm_syncobj_find_fence(struct drm_file *file_private,
 			 * signalled, use a new fence instead.
 			 */
 			if (!*fence)
-				*fence = dma_fence_get_stub();
+				*fence = dma_fence_get_stub(false);
 
 			goto out;
 		}
@@ -1000,7 +1000,7 @@ static void syncobj_wait_syncobj_func(struct drm_syncobj *syncobj,
 		dma_fence_put(fence);
 		return;
 	} else if (!fence) {
-		wait->fence = dma_fence_get_stub();
+		wait->fence = dma_fence_get_stub(false);
 	} else {
 		wait->fence = fence;
 	}
@@ -1067,7 +1067,7 @@ static signed long drm_syncobj_array_wait_timeout(struct drm_syncobj **syncobjs,
 		if (fence)
 			entries[i].fence = fence;
 		else
-			entries[i].fence = dma_fence_get_stub();
+			entries[i].fence = dma_fence_get_stub(false);
 
 		if ((flags & DRM_SYNCOBJ_WAIT_FLAGS_WAIT_AVAILABLE) ||
 		    dma_fence_is_signaled(entries[i].fence)) {
@@ -1472,7 +1472,7 @@ drm_syncobj_timeline_signal_ioctl(struct drm_device *dev, void *data,
 	}
 
 	for (i = 0; i < args->count_handles; i++) {
-		struct dma_fence *fence = dma_fence_get_stub();
+		struct dma_fence *fence = dma_fence_get_stub(false);
 
 		drm_syncobj_add_point(syncobjs[i], chains[i],
 				      fence, points[i]);
diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h
index afea82ec5946..be96687d31d8 100644
--- a/include/linux/dma-fence.h
+++ b/include/linux/dma-fence.h
@@ -99,6 +99,16 @@ enum dma_fence_flag_bits {
 	DMA_FENCE_FLAG_SIGNALED_BIT,
 	DMA_FENCE_FLAG_TIMESTAMP_BIT,
 	DMA_FENCE_FLAG_ENABLE_SIGNAL_BIT,
+
+	/**
+	 * @DMA_FENCE_FLAG_USER:
+	 *
+	 * Indicates an user fence. User fences are not guaranteed to signal in
+	 * a finite amount of time. Because of this it is not allowed to wait for user
+	 * fences with any lock held nor depend the signaling of none user
+	 * fences on them.
+	 */
+	DMA_FENCE_FLAG_USER,
 	DMA_FENCE_FLAG_DRIVER, /* must always be last member */
 };
 
@@ -398,6 +408,8 @@ void dma_fence_enable_sw_signaling(struct dma_fence *fence);
 static inline bool
 dma_fence_is_signaled_locked(struct dma_fence *fence)
 {
+	WARN_ON(test_bit(DMA_FENCE_FLAG_USER, &fence->flags));
+
 	if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags))
 		return true;
 
@@ -428,6 +440,9 @@ dma_fence_is_signaled_locked(struct dma_fence *fence)
 static inline bool
 dma_fence_is_signaled(struct dma_fence *fence)
 {
+	if (test_bit(DMA_FENCE_FLAG_USER, &fence->flags))
+		might_fault();
+
 	if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags))
 		return true;
 
@@ -583,7 +598,7 @@ static inline signed long dma_fence_wait(struct dma_fence *fence, bool intr)
 	return ret < 0 ? ret : 0;
 }
 
-struct dma_fence *dma_fence_get_stub(void);
+struct dma_fence *dma_fence_get_stub(bool user);
 struct dma_fence *dma_fence_allocate_private_stub(void);
 u64 dma_fence_context_alloc(unsigned num);
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 03/15] dma-buf: add user fence support to dma_fence_array
  2022-05-02 16:37 Tackling the indefinite/user DMA fence problem Christian König
  2022-05-02 16:37 ` [PATCH 01/15] dma-buf: rename DMA_FENCE_FLAG_USER_BITS to _DEVICE Christian König
  2022-05-02 16:37 ` [PATCH 02/15] dma-buf: introduce user fence support Christian König
@ 2022-05-02 16:37 ` Christian König
  2022-05-02 16:37 ` [PATCH 04/15] dma-buf: add user fence support to dma_fence_chain Christian König
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 41+ messages in thread
From: Christian König @ 2022-05-02 16:37 UTC (permalink / raw)
  To: daniel, jason, daniels, skhawaja, maad.aldabagh, sergemetral,
	sumit.semwal, gustavo, Felix.Kuehling, alexander.deucher,
	tzimmermann, tvrtko.ursulin, linux-media, dri-devel,
	linaro-mm-sig
  Cc: Christian König

When any of the fences inside the array is an user fence the whole array
is considered to be an user fence as well.

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/dma-buf/dma-fence-array.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/dma-buf/dma-fence-array.c b/drivers/dma-buf/dma-fence-array.c
index 5c8a7084577b..de196b07d36d 100644
--- a/drivers/dma-buf/dma-fence-array.c
+++ b/drivers/dma-buf/dma-fence-array.c
@@ -189,8 +189,13 @@ struct dma_fence_array *dma_fence_array_create(int num_fences,
 	 * Enforce this here by checking that we don't create a dma_fence_array
 	 * with any container inside.
 	 */
-	while (num_fences--)
-		WARN_ON(dma_fence_is_container(fences[num_fences]));
+	while (num_fences--) {
+		struct dma_fence *f = fences[num_fences];
+
+		WARN_ON(dma_fence_is_container(f));
+		if (test_bit(DMA_FENCE_FLAG_USER, &f->flags))
+			set_bit(DMA_FENCE_FLAG_USER, &array->base.flags);
+	}
 
 	return array;
 }
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 04/15] dma-buf: add user fence support to dma_fence_chain
  2022-05-02 16:37 Tackling the indefinite/user DMA fence problem Christian König
                   ` (2 preceding siblings ...)
  2022-05-02 16:37 ` [PATCH 03/15] dma-buf: add user fence support to dma_fence_array Christian König
@ 2022-05-02 16:37 ` Christian König
  2022-05-02 16:37 ` [PATCH 05/15] dma-buf: add user fence support to dma_resv Christian König
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 41+ messages in thread
From: Christian König @ 2022-05-02 16:37 UTC (permalink / raw)
  To: daniel, jason, daniels, skhawaja, maad.aldabagh, sergemetral,
	sumit.semwal, gustavo, Felix.Kuehling, alexander.deucher,
	tzimmermann, tvrtko.ursulin, linux-media, dri-devel,
	linaro-mm-sig
  Cc: Christian König

If either the previous fence or the newly chained on is an user fence the
dma_fence chain node is considered an user fence as well.

This way the user fence status propagates through the chain.

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/dma-buf/dma-fence-chain.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/dma-buf/dma-fence-chain.c b/drivers/dma-buf/dma-fence-chain.c
index 06f8ef97c6e8..421241607ac2 100644
--- a/drivers/dma-buf/dma-fence-chain.c
+++ b/drivers/dma-buf/dma-fence-chain.c
@@ -253,6 +253,10 @@ void dma_fence_chain_init(struct dma_fence_chain *chain,
 	dma_fence_init(&chain->base, &dma_fence_chain_ops,
 		       &chain->lock, context, seqno);
 
+	if ((prev && test_bit(DMA_FENCE_FLAG_USER, &prev->flags)) ||
+	    test_bit(DMA_FENCE_FLAG_USER, &fence->flags))
+		set_bit(DMA_FENCE_FLAG_USER, &chain->base.flags);
+
 	/*
 	 * Chaining dma_fence_chain container together is only allowed through
 	 * the prev fence and not through the contained fence.
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 05/15] dma-buf: add user fence support to dma_resv
  2022-05-02 16:37 Tackling the indefinite/user DMA fence problem Christian König
                   ` (3 preceding siblings ...)
  2022-05-02 16:37 ` [PATCH 04/15] dma-buf: add user fence support to dma_fence_chain Christian König
@ 2022-05-02 16:37 ` Christian König
  2022-05-02 16:37 ` [PATCH 06/15] dma-buf: add user fence support to dma_fence_merge() Christian König
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 41+ messages in thread
From: Christian König @ 2022-05-02 16:37 UTC (permalink / raw)
  To: daniel, jason, daniels, skhawaja, maad.aldabagh, sergemetral,
	sumit.semwal, gustavo, Felix.Kuehling, alexander.deucher,
	tzimmermann, tvrtko.ursulin, linux-media, dri-devel,
	linaro-mm-sig
  Cc: Christian König

This patch adds the new DMA_RESV_USAGE_USER flag to the dma_resv object
which must be used with user fence objects.

In opposite to the other usage flags this one doesn't automatically return
other lower classes. So when user fences are requested from the dma_resv
object only user fences are returned.

Lockdep is used to enforce that user fences can only be queried while the
dma_resv object is not locked. Additional to that waiting for the user
fences inside a dma_resv object requires not other lock to be held.

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/dma-buf/dma-resv.c | 62 +++++++++++++++++++++++---------------
 include/linux/dma-resv.h   | 23 ++++++++++++--
 2 files changed, 58 insertions(+), 27 deletions(-)

diff --git a/drivers/dma-buf/dma-resv.c b/drivers/dma-buf/dma-resv.c
index 0cce6e4ec946..da667c21ad55 100644
--- a/drivers/dma-buf/dma-resv.c
+++ b/drivers/dma-buf/dma-resv.c
@@ -58,7 +58,7 @@ DEFINE_WD_CLASS(reservation_ww_class);
 EXPORT_SYMBOL(reservation_ww_class);
 
 /* Mask for the lower fence pointer bits */
-#define DMA_RESV_LIST_MASK	0x3
+#define DMA_RESV_LIST_MASK	0x7
 
 struct dma_resv_list {
 	struct rcu_head rcu;
@@ -288,6 +288,10 @@ void dma_resv_add_fence(struct dma_resv *obj, struct dma_fence *fence,
 	 */
 	WARN_ON(dma_fence_is_container(fence));
 
+	/* User fences must be added using DMA_RESV_USAGE_USER */
+	WARN_ON(test_bit(DMA_FENCE_FLAG_USER, &fence->flags) !=
+		(usage == DMA_RESV_USAGE_USER));
+
 	fobj = dma_resv_fences_list(obj);
 	count = fobj->num_fences;
 
@@ -349,6 +353,15 @@ void dma_resv_replace_fences(struct dma_resv *obj, uint64_t context,
 }
 EXPORT_SYMBOL(dma_resv_replace_fences);
 
+/* Matches requested usage with the fence usage for iterators */
+static bool dma_resv_iter_match_usage(struct dma_resv_iter *cursor)
+{
+	if (cursor->usage == DMA_RESV_USAGE_USER)
+		return cursor->fence_usage == DMA_RESV_USAGE_USER;
+
+	return cursor->usage >= cursor->fence_usage;
+}
+
 /* Restart the unlocked iteration by initializing the cursor object. */
 static void dma_resv_iter_restart_unlocked(struct dma_resv_iter *cursor)
 {
@@ -385,8 +398,7 @@ static void dma_resv_iter_walk_unlocked(struct dma_resv_iter *cursor)
 			continue;
 		}
 
-		if (!dma_fence_is_signaled(cursor->fence) &&
-		    cursor->usage >= cursor->fence_usage)
+		if (dma_resv_iter_match_usage(cursor))
 			break;
 	} while (true);
 }
@@ -405,14 +417,9 @@ static void dma_resv_iter_walk_unlocked(struct dma_resv_iter *cursor)
  */
 struct dma_fence *dma_resv_iter_first_unlocked(struct dma_resv_iter *cursor)
 {
-	rcu_read_lock();
-	do {
-		dma_resv_iter_restart_unlocked(cursor);
-		dma_resv_iter_walk_unlocked(cursor);
-	} while (dma_resv_fences_list(cursor->obj) != cursor->fences);
-	rcu_read_unlock();
-
-	return cursor->fence;
+	/* Force a restart */
+	cursor->fences = NULL;
+	return dma_resv_iter_next_unlocked(cursor);
 }
 EXPORT_SYMBOL(dma_resv_iter_first_unlocked);
 
@@ -428,18 +435,21 @@ EXPORT_SYMBOL(dma_resv_iter_first_unlocked);
  */
 struct dma_fence *dma_resv_iter_next_unlocked(struct dma_resv_iter *cursor)
 {
-	bool restart;
-
-	rcu_read_lock();
 	cursor->is_restarted = false;
-	restart = dma_resv_fences_list(cursor->obj) != cursor->fences;
 	do {
-		if (restart)
-			dma_resv_iter_restart_unlocked(cursor);
-		dma_resv_iter_walk_unlocked(cursor);
-		restart = true;
-	} while (dma_resv_fences_list(cursor->obj) != cursor->fences);
-	rcu_read_unlock();
+		bool restart;
+
+		rcu_read_lock();
+		restart = dma_resv_fences_list(cursor->obj) != cursor->fences;
+		do {
+			if (restart)
+				dma_resv_iter_restart_unlocked(cursor);
+			dma_resv_iter_walk_unlocked(cursor);
+			restart = true;
+		} while (dma_resv_fences_list(cursor->obj) != cursor->fences);
+		rcu_read_unlock();
+
+	} while (cursor->fence && dma_fence_is_signaled(cursor->fence));
 
 	return cursor->fence;
 }
@@ -491,7 +501,7 @@ struct dma_fence *dma_resv_iter_next(struct dma_resv_iter *cursor)
 
 		dma_resv_list_entry(cursor->fences, cursor->index++,
 				    cursor->obj, &fence, &cursor->fence_usage);
-	} while (cursor->fence_usage > cursor->usage);
+	} while (!dma_resv_iter_match_usage(cursor));
 
 	return fence;
 }
@@ -663,6 +673,9 @@ long dma_resv_wait_timeout(struct dma_resv *obj, enum dma_resv_usage usage,
 	struct dma_resv_iter cursor;
 	struct dma_fence *fence;
 
+	if (usage == DMA_RESV_USAGE_USER)
+		lockdep_assert_none_held_once();
+
 	dma_resv_iter_begin(&cursor, obj, usage);
 	dma_resv_for_each_fence_unlocked(&cursor, fence) {
 
@@ -678,7 +691,6 @@ long dma_resv_wait_timeout(struct dma_resv *obj, enum dma_resv_usage usage,
 }
 EXPORT_SYMBOL_GPL(dma_resv_wait_timeout);
 
-
 /**
  * dma_resv_test_signaled - Test if a reservation object's fences have been
  * signaled.
@@ -717,7 +729,9 @@ EXPORT_SYMBOL_GPL(dma_resv_test_signaled);
  */
 void dma_resv_describe(struct dma_resv *obj, struct seq_file *seq)
 {
-	static const char *usage[] = { "kernel", "write", "read", "bookkeep" };
+	static const char *usage[] = {
+		"kernel", "write", "read", "bookkeep", "user"
+	};
 	struct dma_resv_iter cursor;
 	struct dma_fence *fence;
 
diff --git a/include/linux/dma-resv.h b/include/linux/dma-resv.h
index c8ccbc94d5d2..81a9ca32cc69 100644
--- a/include/linux/dma-resv.h
+++ b/include/linux/dma-resv.h
@@ -42,7 +42,6 @@
 #include <linux/ww_mutex.h>
 #include <linux/dma-fence.h>
 #include <linux/slab.h>
-#include <linux/seqlock.h>
 #include <linux/rcupdate.h>
 
 extern struct ww_class reservation_ww_class;
@@ -57,11 +56,15 @@ struct dma_resv_list;
  *
  * An important fact is that there is the order KERNEL<WRITE<READ<BOOKKEEP and
  * when the dma_resv object is asked for fences for one use case the fences
- * for the lower use case are returned as well.
+ * for the lower use case are returned as well. The exception are USER fences
+ * which only return USER fences and nothing else.
  *
  * For example when asking for WRITE fences then the KERNEL fences are returned
  * as well. Similar when asked for READ fences then both WRITE and KERNEL
  * fences are returned as well.
+ *
+ * But when asked for USER fences only USER fences are returned and not WRITE
+ * nor any other fences.
  */
 enum dma_resv_usage {
 	/**
@@ -103,7 +106,18 @@ enum dma_resv_usage {
 	 * The most common case are preemption fences as well as page table
 	 * updates and their TLB flushes.
 	 */
-	DMA_RESV_USAGE_BOOKKEEP
+	DMA_RESV_USAGE_BOOKKEEP,
+
+	/**
+	 * @DMA_RESV_USAGE_USER: Special usage for user fences.
+	 *
+	 * This must only be used with fences which have DMA_FENCE_FLAG_USER
+	 * set so that memory mangement completely ignores those fences.
+	 *
+	 * A warning is raised if a fence with DMA_FENCE_FLAG USER is added with
+	 * any other usage than DMA_RESV_USAGE_USER.
+	 */
+	DMA_RESV_USAGE_USER
 };
 
 /**
@@ -221,6 +235,9 @@ static inline void dma_resv_iter_begin(struct dma_resv_iter *cursor,
 				       struct dma_resv *obj,
 				       enum dma_resv_usage usage)
 {
+	if (usage == DMA_RESV_USAGE_USER)
+		lockdep_assert_not_held(&(obj)->lock.base);
+
 	cursor->obj = obj;
 	cursor->usage = usage;
 	cursor->fence = NULL;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 06/15] dma-buf: add user fence support to dma_fence_merge()
  2022-05-02 16:37 Tackling the indefinite/user DMA fence problem Christian König
                   ` (4 preceding siblings ...)
  2022-05-02 16:37 ` [PATCH 05/15] dma-buf: add user fence support to dma_resv Christian König
@ 2022-05-02 16:37 ` Christian König
  2022-05-02 16:37 ` [PATCH 07/15] dma-buf: add user fence utility functions Christian König
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 41+ messages in thread
From: Christian König @ 2022-05-02 16:37 UTC (permalink / raw)
  To: daniel, jason, daniels, skhawaja, maad.aldabagh, sergemetral,
	sumit.semwal, gustavo, Felix.Kuehling, alexander.deucher,
	tzimmermann, tvrtko.ursulin, linux-media, dri-devel,
	linaro-mm-sig
  Cc: Christian König

Just make sure that we forward the user fence status to the output
whenever any of the inputs are an user fence.

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/dma-buf/dma-fence-unwrap.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/dma-buf/dma-fence-unwrap.c b/drivers/dma-buf/dma-fence-unwrap.c
index 87ee2efced10..dd3c43aba8f1 100644
--- a/drivers/dma-buf/dma-fence-unwrap.c
+++ b/drivers/dma-buf/dma-fence-unwrap.c
@@ -66,17 +66,19 @@ struct dma_fence *__dma_fence_merge(unsigned int num_fences,
 {
 	struct dma_fence_array *result;
 	struct dma_fence *tmp, **array;
+	bool user_fence = false;
 	unsigned int i, count;
 
 	count = 0;
 	for (i = 0; i < num_fences; ++i) {
+		user_fence |= test_bit(DMA_FENCE_FLAG_USER, &fences[i]->flags);
 		dma_fence_unwrap_for_each(tmp, &iter[i], fences[i])
 			if (!dma_fence_is_signaled(tmp))
 				++count;
 	}
 
 	if (count == 0)
-		return dma_fence_get_stub(false);
+		return dma_fence_get_stub(user_fence);
 
 	if (count > INT_MAX)
 		return NULL;
@@ -129,11 +131,12 @@ struct dma_fence *__dma_fence_merge(unsigned int num_fences,
 	} while (tmp);
 
 	if (count == 0) {
-		tmp = dma_fence_get_stub(false);
+		tmp = dma_fence_get_stub(user_fence);
 		goto return_tmp;
 	}
 
-	if (count == 1) {
+	if (count == 1 &&
+	    test_bit(DMA_FENCE_FLAG_USER, &array[0]->flags) == user_fence) {
 		tmp = array[0];
 		goto return_tmp;
 	}
@@ -145,6 +148,8 @@ struct dma_fence *__dma_fence_merge(unsigned int num_fences,
 		tmp = NULL;
 		goto return_tmp;
 	}
+	if (user_fence)
+		set_bit(DMA_FENCE_FLAG_USER, &result->base.flags);
 	return &result->base;
 
 return_tmp:
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 07/15] dma-buf: add user fence utility functions
  2022-05-02 16:37 Tackling the indefinite/user DMA fence problem Christian König
                   ` (5 preceding siblings ...)
  2022-05-02 16:37 ` [PATCH 06/15] dma-buf: add user fence support to dma_fence_merge() Christian König
@ 2022-05-02 16:37 ` Christian König
  2022-05-02 16:37 ` [PATCH 08/15] dma-buf: add support for polling on user fences Christian König
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 41+ messages in thread
From: Christian König @ 2022-05-02 16:37 UTC (permalink / raw)
  To: daniel, jason, daniels, skhawaja, maad.aldabagh, sergemetral,
	sumit.semwal, gustavo, Felix.Kuehling, alexander.deucher,
	tzimmermann, tvrtko.ursulin, linux-media, dri-devel,
	linaro-mm-sig
  Cc: Christian König

Allows to filter user fences from a dma_fence_chain and wait for user
fences in containers.

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/dma-buf/dma-fence-unwrap.c | 47 ++++++++++++++++++++++++++++++
 include/linux/dma-fence-unwrap.h   |  3 ++
 2 files changed, 50 insertions(+)

diff --git a/drivers/dma-buf/dma-fence-unwrap.c b/drivers/dma-buf/dma-fence-unwrap.c
index dd3c43aba8f1..4e9eda0b65c5 100644
--- a/drivers/dma-buf/dma-fence-unwrap.c
+++ b/drivers/dma-buf/dma-fence-unwrap.c
@@ -157,3 +157,50 @@ struct dma_fence *__dma_fence_merge(unsigned int num_fences,
 	return tmp;
 }
 EXPORT_SYMBOL_GPL(__dma_fence_merge);
+
+/**
+ * dma_fence_filter_user - filter user fences
+ * @fence: entry point into the chain
+ *
+ * Check @fence and if it's a dma_fence_chain advance it until it doesn't depend
+ * on any user fence any more.
+ *
+ * Returns the advanced fence or NULL if no none user fence could be found.
+ */
+struct dma_fence *dma_fence_filter_user(struct dma_fence *fence)
+{
+
+	while (fence && test_bit(DMA_FENCE_FLAG_USER, &fence->flags))
+		fence = dma_fence_chain_walk(fence);
+
+	return fence;
+}
+EXPORT_SYMBOL(dma_fence_filter_user);
+
+/**
+ * dma_fence_wait_user - wait for all user fences to signal
+ * @fence: entry point
+ * @timeout: timeout for the wait
+ *
+ * Unwrap the potential container in @fence and wait for all the user fences to
+ * signal.
+ * Returns: Either negative error code or the remaining timeout.
+ */
+long dma_fence_wait_user(struct dma_fence *fence, unsigned long timeout)
+{
+	struct dma_fence_unwrap iter;
+	long res;
+
+	dma_fence_unwrap_for_each(fence, &iter, fence) {
+		if (!test_bit(DMA_FENCE_FLAG_USER, &fence->flags))
+			continue;
+
+		res = dma_fence_wait(fence, timeout);
+		if (res <= 0)
+			return res;
+		if (timeout)
+			timeout = res;
+	}
+	return timeout;
+}
+EXPORT_SYMBOL(dma_fence_wait_user);
diff --git a/include/linux/dma-fence-unwrap.h b/include/linux/dma-fence-unwrap.h
index 7c0fab318301..2a786f5a7819 100644
--- a/include/linux/dma-fence-unwrap.h
+++ b/include/linux/dma-fence-unwrap.h
@@ -72,4 +72,7 @@ struct dma_fence *__dma_fence_merge(unsigned int num_fences,
 		__dma_fence_merge(ARRAY_SIZE(__f), __f, __c);	\
 	})
 
+struct dma_fence *dma_fence_filter_user(struct dma_fence *fence);
+long dma_fence_wait_user(struct dma_fence *fence, unsigned long timeout);
+
 #endif
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 08/15] dma-buf: add support for polling on user fences
  2022-05-02 16:37 Tackling the indefinite/user DMA fence problem Christian König
                   ` (6 preceding siblings ...)
  2022-05-02 16:37 ` [PATCH 07/15] dma-buf: add user fence utility functions Christian König
@ 2022-05-02 16:37 ` Christian König
  2022-05-02 16:37 ` [PATCH 09/15] dma-buf/sync_file: add user fence support Christian König
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 41+ messages in thread
From: Christian König @ 2022-05-02 16:37 UTC (permalink / raw)
  To: daniel, jason, daniels, skhawaja, maad.aldabagh, sergemetral,
	sumit.semwal, gustavo, Felix.Kuehling, alexander.deucher,
	tzimmermann, tvrtko.ursulin, linux-media, dri-devel,
	linaro-mm-sig
  Cc: Christian König

Just also wait for user fences to signal when we wait for write fences.

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/dma-buf/dma-buf.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 79795857be3e..5558f4e555f8 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -225,6 +225,16 @@ static bool dma_buf_poll_add_cb(struct dma_resv *resv, bool write,
 		dma_fence_put(fence);
 	}
 
+	if (!write)
+		return false;
+
+	dma_resv_for_each_fence(&cursor, resv, DMA_RESV_USAGE_USER, fence) {
+		dma_fence_get(fence);
+		r = dma_fence_add_callback(fence, &dcb->cb, dma_buf_poll_cb);
+		if (!r)
+			return true;
+		dma_fence_put(fence);
+	}
 	return false;
 }
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 09/15] dma-buf/sync_file: add user fence support
  2022-05-02 16:37 Tackling the indefinite/user DMA fence problem Christian König
                   ` (7 preceding siblings ...)
  2022-05-02 16:37 ` [PATCH 08/15] dma-buf: add support for polling on user fences Christian König
@ 2022-05-02 16:37 ` Christian König
  2022-05-02 16:37 ` [PATCH 10/15] drm: add user fence support for atomic out fences Christian König
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 41+ messages in thread
From: Christian König @ 2022-05-02 16:37 UTC (permalink / raw)
  To: daniel, jason, daniels, skhawaja, maad.aldabagh, sergemetral,
	sumit.semwal, gustavo, Felix.Kuehling, alexander.deucher,
	tzimmermann, tvrtko.ursulin, linux-media, dri-devel,
	linaro-mm-sig
  Cc: Christian König

Keep user fences separate from normal fences.

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/dma-buf/sync_file.c | 82 +++++++++++++++++++++++++++++++++----
 include/linux/sync_file.h   |  4 +-
 2 files changed, 76 insertions(+), 10 deletions(-)

diff --git a/drivers/dma-buf/sync_file.c b/drivers/dma-buf/sync_file.c
index fe149d7e3ce2..630472d79dc1 100644
--- a/drivers/dma-buf/sync_file.c
+++ b/drivers/dma-buf/sync_file.c
@@ -70,7 +70,13 @@ struct sync_file *sync_file_create(struct dma_fence *fence)
 	if (!sync_file)
 		return NULL;
 
-	sync_file->fence = dma_fence_get(fence);
+	if (test_bit(DMA_FENCE_FLAG_USER, &fence->flags)) {
+		sync_file->fence = dma_fence_get_stub(false);
+		sync_file->user_fence = dma_fence_get(fence);
+	} else {
+		sync_file->fence = dma_fence_get(fence);
+		sync_file->user_fence = dma_fence_get_stub(true);
+	}
 
 	return sync_file;
 }
@@ -116,6 +122,28 @@ struct dma_fence *sync_file_get_fence(int fd)
 }
 EXPORT_SYMBOL(sync_file_get_fence);
 
+/**
+ * sync_file_get_user_fence - get user fence related to the sync_file fd
+ * @fd:		sync_file fd to get the fence from
+ *
+ * Ensures @fd references a valid sync_file and returns an user fence reference
+ * which represents all fence in the sync_file. On error NULL is returned.
+ */
+struct dma_fence *sync_file_get_user_fence(int fd)
+{
+	struct sync_file *sync_file;
+	struct dma_fence *fence;
+
+	sync_file = sync_file_fdget(fd);
+	if (!sync_file)
+		return NULL;
+
+	fence = dma_fence_merge(sync_file->fence, sync_file->user_fence);
+	fput(sync_file->file);
+	return fence;
+}
+EXPORT_SYMBOL(sync_file_get_user_fence);
+
 /**
  * sync_file_get_name - get the name of the sync_file
  * @sync_file:		sync_file to get the fence from
@@ -136,6 +164,9 @@ char *sync_file_get_name(struct sync_file *sync_file, char *buf, int len)
 	} else {
 		struct dma_fence *fence = sync_file->fence;
 
+		if (dma_fence_is_signaled(fence))
+			fence = sync_file->user_fence;
+
 		snprintf(buf, len, "%s-%s%llu-%lld",
 			 fence->ops->get_driver_name(fence),
 			 fence->ops->get_timeline_name(fence),
@@ -159,21 +190,32 @@ char *sync_file_get_name(struct sync_file *sync_file, char *buf, int len)
 static struct sync_file *sync_file_merge(const char *name, struct sync_file *a,
 					 struct sync_file *b)
 {
+	struct dma_fence *fence, *user_fence;
 	struct sync_file *sync_file;
-	struct dma_fence *fence;
 
 	sync_file = sync_file_alloc();
 	if (!sync_file)
 		return NULL;
 
 	fence = dma_fence_merge(a->fence, b->fence);
-	if (!fence) {
-		fput(sync_file->file);
-		return NULL;
-	}
+	if (!fence)
+		goto error_fput;
+
+	user_fence = dma_fence_merge(a->user_fence, b->user_fence);
+	if (!user_fence)
+		goto error_put_fence;
+
 	sync_file->fence = fence;
+	sync_file->user_fence = user_fence;
 	strlcpy(sync_file->user_name, name, sizeof(sync_file->user_name));
 	return sync_file;
+
+error_put_fence:
+	dma_fence_put(fence);
+
+error_fput:
+	fput(sync_file->file);
+	return NULL;
 }
 
 static int sync_file_release(struct inode *inode, struct file *file)
@@ -183,6 +225,7 @@ static int sync_file_release(struct inode *inode, struct file *file)
 	if (test_bit(POLL_ENABLED, &sync_file->flags))
 		dma_fence_remove_callback(sync_file->fence, &sync_file->cb);
 	dma_fence_put(sync_file->fence);
+	dma_fence_put(sync_file->user_fence);
 	kfree(sync_file);
 
 	return 0;
@@ -191,17 +234,25 @@ static int sync_file_release(struct inode *inode, struct file *file)
 static __poll_t sync_file_poll(struct file *file, poll_table *wait)
 {
 	struct sync_file *sync_file = file->private_data;
+	int ret;
 
 	poll_wait(file, &sync_file->wq, wait);
 
 	if (list_empty(&sync_file->cb.node) &&
 	    !test_and_set_bit(POLL_ENABLED, &sync_file->flags)) {
-		if (dma_fence_add_callback(sync_file->fence, &sync_file->cb,
-					   fence_check_cb_func) < 0)
+		ret = dma_fence_add_callback(sync_file->fence, &sync_file->cb,
+					     fence_check_cb_func);
+		if (ret) {
+			ret = dma_fence_add_callback(sync_file->user_fence,
+						     &sync_file->cb,
+						     fence_check_cb_func);
+		}
+		if (ret)
 			wake_up_all(&sync_file->wq);
 	}
 
-	return dma_fence_is_signaled(sync_file->fence) ? EPOLLIN : 0;
+	return (dma_fence_is_signaled(sync_file->fence) &&
+		dma_fence_is_signaled(sync_file->user_fence)) ? EPOLLIN : 0;
 }
 
 static long sync_file_ioctl_merge(struct sync_file *sync_file,
@@ -299,6 +350,8 @@ static long sync_file_ioctl_fence_info(struct sync_file *sync_file,
 	num_fences = 0;
 	dma_fence_unwrap_for_each(fence, &iter, sync_file->fence)
 		++num_fences;
+	dma_fence_unwrap_for_each(fence, &iter, sync_file->user_fence)
+		++num_fences;
 
 	/*
 	 * Passing num_fences = 0 means that userspace doesn't want to
@@ -307,7 +360,12 @@ static long sync_file_ioctl_fence_info(struct sync_file *sync_file,
 	 * info->num_fences.
 	 */
 	if (!info.num_fences) {
+		int status;
+
 		info.status = dma_fence_get_status(sync_file->fence);
+		status = dma_fence_get_status(sync_file->user_fence);
+		if (!info.status)
+			info.status = status;
 		goto no_fences;
 	} else {
 		info.status = 1;
@@ -328,6 +386,12 @@ static long sync_file_ioctl_fence_info(struct sync_file *sync_file,
 		status = sync_fill_fence_info(fence, &fence_info[num_fences++]);
 		info.status = info.status <= 0 ? info.status : status;
 	}
+	dma_fence_unwrap_for_each(fence, &iter, sync_file->user_fence) {
+		int status;
+
+		status = sync_fill_fence_info(fence, &fence_info[num_fences++]);
+		info.status = info.status <= 0 ? info.status : status;
+	}
 
 	if (copy_to_user(u64_to_user_ptr(info.sync_fence_info), fence_info,
 			 size)) {
diff --git a/include/linux/sync_file.h b/include/linux/sync_file.h
index 790ca021203a..14aff1a4ee75 100644
--- a/include/linux/sync_file.h
+++ b/include/linux/sync_file.h
@@ -50,13 +50,15 @@ struct sync_file {
 	unsigned long		flags;
 
 	struct dma_fence	*fence;
-	struct dma_fence_cb cb;
+	struct dma_fence	*user_fence;
+	struct dma_fence_cb	cb;
 };
 
 #define POLL_ENABLED 0
 
 struct sync_file *sync_file_create(struct dma_fence *fence);
 struct dma_fence *sync_file_get_fence(int fd);
+struct dma_fence *sync_file_get_user_fence(int fd);
 char *sync_file_get_name(struct sync_file *sync_file, char *buf, int len);
 
 #endif /* _LINUX_SYNC_H */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 10/15] drm: add user fence support for atomic out fences
  2022-05-02 16:37 Tackling the indefinite/user DMA fence problem Christian König
                   ` (8 preceding siblings ...)
  2022-05-02 16:37 ` [PATCH 09/15] dma-buf/sync_file: add user fence support Christian König
@ 2022-05-02 16:37 ` Christian König
  2022-05-02 16:37 ` [PATCH 11/15] drm: add user fence support for atomic in fences Christian König
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 41+ messages in thread
From: Christian König @ 2022-05-02 16:37 UTC (permalink / raw)
  To: daniel, jason, daniels, skhawaja, maad.aldabagh, sergemetral,
	sumit.semwal, gustavo, Felix.Kuehling, alexander.deucher,
	tzimmermann, tvrtko.ursulin, linux-media, dri-devel,
	linaro-mm-sig
  Cc: Christian König

Add a new driver flag indicating support for user fences.

This flag is then used when creating out fences for atomic mode setting,
indicating that the mode set might depend on an user fence.

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/drm_atomic_uapi.c | 8 ++++++++
 include/drm/drm_drv.h             | 7 +++++++
 2 files changed, 15 insertions(+)

diff --git a/drivers/gpu/drm/drm_atomic_uapi.c b/drivers/gpu/drm/drm_atomic_uapi.c
index 434f3d4cb8a2..e2112c10569b 100644
--- a/drivers/gpu/drm/drm_atomic_uapi.c
+++ b/drivers/gpu/drm/drm_atomic_uapi.c
@@ -1111,6 +1111,7 @@ static int prepare_signaling(struct drm_device *dev,
 				  struct drm_out_fence_state **fence_state,
 				  unsigned int *num_fences)
 {
+	bool use_user_fence = drm_core_check_feature(dev, DRIVER_USER_FENCE);
 	struct drm_crtc *crtc;
 	struct drm_crtc_state *crtc_state;
 	struct drm_connector *conn;
@@ -1120,6 +1121,7 @@ static int prepare_signaling(struct drm_device *dev,
 	if (arg->flags & DRM_MODE_ATOMIC_TEST_ONLY)
 		return 0;
 
+
 	for_each_new_crtc_in_state(state, crtc, crtc_state, i) {
 		s32 __user *fence_ptr;
 
@@ -1168,6 +1170,9 @@ static int prepare_signaling(struct drm_device *dev,
 			if (!fence)
 				return -ENOMEM;
 
+			if (use_user_fence)
+				set_bit(DMA_FENCE_FLAG_USER, &fence->flags);
+
 			ret = setup_out_fence(&f[(*num_fences)++], fence);
 			if (ret) {
 				dma_fence_put(fence);
@@ -1208,6 +1213,9 @@ static int prepare_signaling(struct drm_device *dev,
 		if (!fence)
 			return -ENOMEM;
 
+		if (use_user_fence)
+			set_bit(DMA_FENCE_FLAG_USER, &fence->flags);
+
 		ret = setup_out_fence(&f[(*num_fences)++], fence);
 		if (ret) {
 			dma_fence_put(fence);
diff --git a/include/drm/drm_drv.h b/include/drm/drm_drv.h
index f6159acb8856..b2b8ea8d4a9e 100644
--- a/include/drm/drm_drv.h
+++ b/include/drm/drm_drv.h
@@ -94,6 +94,13 @@ enum drm_driver_feature {
 	 * synchronization of command submission.
 	 */
 	DRIVER_SYNCOBJ_TIMELINE         = BIT(6),
+	/**
+	 * @DRIVER_USER_FENCE:
+	 *
+	 * Drivers supports user fences and waiting for the before command
+	 * submission.
+	 */
+	DRIVER_USER_FENCE		= BIT(7),
 
 	/* IMPORTANT: Below are all the legacy flags, add new ones above. */
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 11/15] drm: add user fence support for atomic in fences
  2022-05-02 16:37 Tackling the indefinite/user DMA fence problem Christian König
                   ` (9 preceding siblings ...)
  2022-05-02 16:37 ` [PATCH 10/15] drm: add user fence support for atomic out fences Christian König
@ 2022-05-02 16:37 ` Christian König
  2022-05-02 16:37 ` [PATCH 12/15] drm: add user fence support to drm_gem_plane_helper_prepare_fb Christian König
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 41+ messages in thread
From: Christian König @ 2022-05-02 16:37 UTC (permalink / raw)
  To: daniel, jason, daniels, skhawaja, maad.aldabagh, sergemetral,
	sumit.semwal, gustavo, Felix.Kuehling, alexander.deucher,
	tzimmermann, tvrtko.ursulin, linux-media, dri-devel,
	linaro-mm-sig
  Cc: Christian König

When the DRIVER_USER_FENCE flag is set we grab the user fence from the
sync file instead of the normal one.

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/drm_atomic_uapi.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/drm_atomic_uapi.c b/drivers/gpu/drm/drm_atomic_uapi.c
index e2112c10569b..d1b13657e2ae 100644
--- a/drivers/gpu/drm/drm_atomic_uapi.c
+++ b/drivers/gpu/drm/drm_atomic_uapi.c
@@ -517,7 +517,10 @@ static int drm_atomic_plane_set_property(struct drm_plane *plane,
 		if (U642I64(val) == -1)
 			return 0;
 
-		state->fence = sync_file_get_fence(val);
+		if (drm_core_check_feature(dev, DRIVER_USER_FENCE))
+			state->fence = sync_file_get_user_fence(val);
+		else
+			state->fence = sync_file_get_fence(val);
 		if (!state->fence)
 			return -EINVAL;
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 12/15] drm: add user fence support to drm_gem_plane_helper_prepare_fb
  2022-05-02 16:37 Tackling the indefinite/user DMA fence problem Christian König
                   ` (10 preceding siblings ...)
  2022-05-02 16:37 ` [PATCH 11/15] drm: add user fence support for atomic in fences Christian König
@ 2022-05-02 16:37 ` Christian König
  2022-05-02 16:37 ` [PATCH 13/15] drm: add user fence support to drm_syncobj Christian König
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 41+ messages in thread
From: Christian König @ 2022-05-02 16:37 UTC (permalink / raw)
  To: daniel, jason, daniels, skhawaja, maad.aldabagh, sergemetral,
	sumit.semwal, gustavo, Felix.Kuehling, alexander.deucher,
	tzimmermann, tvrtko.ursulin, linux-media, dri-devel,
	linaro-mm-sig
  Cc: Christian König

Just handle them the same way as fences with the DMA_RESV_USAGE_WRITE
flag when the DRIVER_USER_FENCE flag is set.

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/drm_gem_atomic_helper.c | 68 ++++++++++++++++---------
 1 file changed, 45 insertions(+), 23 deletions(-)

diff --git a/drivers/gpu/drm/drm_gem_atomic_helper.c b/drivers/gpu/drm/drm_gem_atomic_helper.c
index a5026f617739..75d04333ff2e 100644
--- a/drivers/gpu/drm/drm_gem_atomic_helper.c
+++ b/drivers/gpu/drm/drm_gem_atomic_helper.c
@@ -5,6 +5,7 @@
 
 #include <drm/drm_atomic_state_helper.h>
 #include <drm/drm_atomic_uapi.h>
+#include <drm/drm_drv.h>
 #include <drm/drm_gem.h>
 #include <drm/drm_gem_atomic_helper.h>
 #include <drm/drm_gem_framebuffer_helper.h>
@@ -121,6 +122,40 @@
  * Plane Helpers
  */
 
+static int chain_fb_fences(struct drm_framebuffer *fb,
+			   enum dma_resv_usage usage,
+			   struct dma_fence **fence)
+{
+	size_t i;
+	int ret;
+
+	for (i = 0; i < fb->format->num_planes; ++i) {
+		struct drm_gem_object *obj = drm_gem_fb_get_obj(fb, i);
+		struct dma_fence *new;
+
+		if (WARN_ON_ONCE(!obj))
+			continue;
+
+		ret = dma_resv_get_singleton(obj->resv, usage, &new);
+		if (ret)
+			return ret;
+
+		if (new && *fence) {
+			struct dma_fence_chain *chain = dma_fence_chain_alloc();
+
+			if (!chain)
+				return -ENOMEM;
+
+			dma_fence_chain_init(chain, *fence, new, 1);
+			*fence = &chain->base;
+
+		} else if (new) {
+			*fence = new;
+		}
+	}
+	return 0;
+}
+
 /**
  * drm_gem_plane_helper_prepare_fb() - Prepare a GEM backed framebuffer
  * @plane: Plane
@@ -143,8 +178,6 @@ int drm_gem_plane_helper_prepare_fb(struct drm_plane *plane,
 				    struct drm_plane_state *state)
 {
 	struct dma_fence *fence = dma_fence_get(state->fence);
-	enum dma_resv_usage usage;
-	size_t i;
 	int ret;
 
 	if (!state->fb)
@@ -163,32 +196,21 @@ int drm_gem_plane_helper_prepare_fb(struct drm_plane *plane,
 	 * obeys both implicit and explicit fences for plane updates, then it
 	 * will break all the benefits of explicit fencing.
 	 */
-	usage = fence ? DMA_RESV_USAGE_KERNEL : DMA_RESV_USAGE_WRITE;
-
-	for (i = 0; i < state->fb->format->num_planes; ++i) {
-		struct drm_gem_object *obj = drm_gem_fb_get_obj(state->fb, i);
-		struct dma_fence *new;
-
-		if (WARN_ON_ONCE(!obj))
-			continue;
-
-		ret = dma_resv_get_singleton(obj->resv, usage, &new);
+	if (fence) {
+		ret = chain_fb_fences(state->fb, DMA_RESV_USAGE_KERNEL, &fence);
 		if (ret)
 			goto error;
 
-		if (new && fence) {
-			struct dma_fence_chain *chain = dma_fence_chain_alloc();
+	} else {
+		ret = chain_fb_fences(state->fb, DMA_RESV_USAGE_WRITE, &fence);
+		if (ret)
+			goto error;
 
-			if (!chain) {
-				ret = -ENOMEM;
+		if (drm_core_check_feature(plane->dev, DRIVER_USER_FENCE)) {
+			ret = chain_fb_fences(state->fb, DMA_RESV_USAGE_USER,
+					      &fence);
+			if (ret)
 				goto error;
-			}
-
-			dma_fence_chain_init(chain, fence, new, 1);
-			fence = &chain->base;
-
-		} else if (new) {
-			fence = new;
 		}
 	}
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 13/15] drm: add user fence support to drm_syncobj
  2022-05-02 16:37 Tackling the indefinite/user DMA fence problem Christian König
                   ` (11 preceding siblings ...)
  2022-05-02 16:37 ` [PATCH 12/15] drm: add user fence support to drm_gem_plane_helper_prepare_fb Christian König
@ 2022-05-02 16:37 ` Christian König
  2022-05-02 16:37   ` Christian König
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 41+ messages in thread
From: Christian König @ 2022-05-02 16:37 UTC (permalink / raw)
  To: daniel, jason, daniels, skhawaja, maad.aldabagh, sergemetral,
	sumit.semwal, gustavo, Felix.Kuehling, alexander.deucher,
	tzimmermann, tvrtko.ursulin, linux-media, dri-devel,
	linaro-mm-sig
  Cc: Christian König

For now just filter or wait for user fences.

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/drm_syncobj.c | 15 ++++++++++++---
 include/drm/drm_syncobj.h     | 25 +++++++++++++++++++++----
 2 files changed, 33 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/drm_syncobj.c b/drivers/gpu/drm/drm_syncobj.c
index 5a961ea90a35..8d25a2dd107b 100644
--- a/drivers/gpu/drm/drm_syncobj.c
+++ b/drivers/gpu/drm/drm_syncobj.c
@@ -386,7 +386,7 @@ int drm_syncobj_find_fence(struct drm_file *file_private,
 	struct drm_syncobj *syncobj = drm_syncobj_find(file_private, handle);
 	struct syncobj_wait_entry wait;
 	u64 timeout = nsecs_to_jiffies64(DRM_SYNCOBJ_WAIT_FOR_SUBMIT_TIMEOUT);
-	int ret;
+	long ret;
 
 	if (!syncobj)
 		return -ENOENT;
@@ -398,10 +398,19 @@ int drm_syncobj_find_fence(struct drm_file *file_private,
 	if (flags & DRM_SYNCOBJ_WAIT_FLAGS_WAIT_FOR_SUBMIT) {
 		might_sleep();
 		lockdep_assert_none_held_once();
+		*fence = drm_syncobj_user_fence_get(syncobj);
+		if (*fence) {
+			ret = dma_fence_wait_user(*fence, timeout);
+			if (ret < 0)
+				return ret;
+			if (ret == 0)
+				return -ETIME;
+			timeout = ret;
+		}
+	} else {
+		*fence = drm_syncobj_fence_get(syncobj);
 	}
 
-	*fence = drm_syncobj_fence_get(syncobj);
-
 	if (*fence) {
 		ret = dma_fence_chain_find_seqno(fence, point);
 		if (!ret) {
diff --git a/include/drm/drm_syncobj.h b/include/drm/drm_syncobj.h
index 6cf7243a1dc5..acc3979815eb 100644
--- a/include/drm/drm_syncobj.h
+++ b/include/drm/drm_syncobj.h
@@ -28,6 +28,7 @@
 
 #include <linux/dma-fence.h>
 #include <linux/dma-fence-chain.h>
+#include <linux/dma-fence-unwrap.h>
 
 struct drm_file;
 
@@ -89,18 +90,18 @@ drm_syncobj_put(struct drm_syncobj *obj)
 }
 
 /**
- * drm_syncobj_fence_get - get a reference to a fence in a sync object
+ * drm_syncobj_user_fence_get - get a user fence from a sync object
  * @syncobj: sync object.
  *
  * This acquires additional reference to &drm_syncobj.fence contained in @obj,
- * if not NULL. It is illegal to call this without already holding a reference.
- * No locks required.
+ * if not NULL. It is illegal to call this without holding a reference to the
+ * syncobj. No locks required.
  *
  * Returns:
  * Either the fence of @obj or NULL if there's none.
  */
 static inline struct dma_fence *
-drm_syncobj_fence_get(struct drm_syncobj *syncobj)
+drm_syncobj_user_fence_get(struct drm_syncobj *syncobj)
 {
 	struct dma_fence *fence;
 
@@ -111,6 +112,22 @@ drm_syncobj_fence_get(struct drm_syncobj *syncobj)
 	return fence;
 }
 
+/**
+ * drm_syncobj_fence_get - get a reference to a fence in a sync object
+ * @syncobj: sync object.
+ *
+ * Same functionality as drm_syncobj_user_fence_get(), but user fences are
+ * filtered out.
+ *
+ * Returns:
+ * Either the fence of @obj or NULL if there's none.
+ */
+static inline struct dma_fence *
+drm_syncobj_fence_get(struct drm_syncobj *syncobj)
+{
+	return dma_fence_filter_user(drm_syncobj_user_fence_get(syncobj));
+}
+
 struct drm_syncobj *drm_syncobj_find(struct drm_file *file_private,
 				     u32 handle);
 void drm_syncobj_add_point(struct drm_syncobj *syncobj,
-- 
2.25.1
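
A minimal caller-side sketch of the split above (hypothetical driver
code, not part of this patch; dev and syncobj stand in for the driver's
device and sync object): a driver that declared user fence support
fetches the unfiltered fence and only waits on it where faulting is
allowed, while everyone else keeps getting the filtered view.

	struct dma_fence *fence;

	if (drm_core_check_feature(dev, DRIVER_USER_FENCE)) {
		/* May contain a user fence, wait only where faulting is allowed. */
		fence = drm_syncobj_user_fence_get(syncobj);
	} else {
		/* User fences filtered out, safe as a kernel dependency. */
		fence = drm_syncobj_fence_get(syncobj);
	}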


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 14/15] drm/amdgpu: switch DM to atomic fence helpers
  2022-05-02 16:37 Tackling the indefinite/user DMA fence problem Christian König
@ 2022-05-02 16:37   ` Christian König
  2022-05-02 16:37 ` [PATCH 02/15] dma-buf: introduce user fence support Christian König
                     ` (14 subsequent siblings)
  15 siblings, 0 replies; 41+ messages in thread
From: Christian König @ 2022-05-02 16:37 UTC (permalink / raw)
  To: daniel, jason, daniels, skhawaja, maad.aldabagh, sergemetral,
	sumit.semwal, gustavo, Felix.Kuehling, alexander.deucher,
	tzimmermann, tvrtko.ursulin, linux-media, dri-devel,
	linaro-mm-sig
  Cc: Christian König, Harry Wentland, Nicholas Kazlauskas,
	Roman Li, Qingqing Zhuo, Jude Shih, Wayne Lin, Rodrigo Siqueira

This gives us the standard atomic implicit and explicit fencing rules.

Signed-off-by: Christian König <christian.koenig@amd.com>
Cc: Harry Wentland <harry.wentland@amd.com>
Cc: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com>
Cc: Roman Li <Roman.Li@amd.com>
Cc: Qingqing Zhuo <qingqing.zhuo@amd.com>
Cc: Jude Shih <shenshih@amd.com>
Cc: Wayne Lin <Wayne.Lin@amd.com>
Cc: Rodrigo Siqueira <Rodrigo.Siqueira@amd.com>
---
 .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 23 ++++++++-----------
 1 file changed, 10 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index 2ade82cfb1ac..c5b2417adcc6 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -83,6 +83,7 @@
 #include <drm/drm_edid.h>
 #include <drm/drm_vblank.h>
 #include <drm/drm_audio_component.h>
+#include <drm/drm_gem_atomic_helper.h>
 
 #if defined(CONFIG_DRM_AMD_DC_DCN)
 #include "ivsrcid/dcn/irqsrcs_dcn_1_0.h"
@@ -7627,6 +7628,10 @@ static int dm_plane_helper_prepare_fb(struct drm_plane *plane,
 		goto error_unpin;
 	}
 
+	r = drm_gem_plane_helper_prepare_fb(plane, new_state);
+	if (unlikely(r != 0))
+		goto error_unpin;
+
 	amdgpu_bo_unreserve(rbo);
 
 	afb->address = amdgpu_bo_gpu_offset(rbo);
@@ -9160,7 +9165,6 @@ static void amdgpu_dm_commit_planes(struct drm_atomic_state *state,
 	struct dm_crtc_state *dm_old_crtc_state =
 			to_dm_crtc_state(drm_atomic_get_old_crtc_state(state, pcrtc));
 	int planes_count = 0, vpos, hpos;
-	long r;
 	unsigned long flags;
 	struct amdgpu_bo *abo;
 	uint32_t target_vblank, last_flip_vblank;
@@ -9173,6 +9177,7 @@ static void amdgpu_dm_commit_planes(struct drm_atomic_state *state,
 		struct dc_flip_addrs flip_addrs[MAX_SURFACES];
 		struct dc_stream_update stream_update;
 	} *bundle;
+	int r;
 
 	bundle = kzalloc(sizeof(*bundle), GFP_KERNEL);
 
@@ -9181,6 +9186,10 @@ static void amdgpu_dm_commit_planes(struct drm_atomic_state *state,
 		goto cleanup;
 	}
 
+	r = drm_atomic_helper_wait_for_fences(dev, state, false);
+	if (unlikely(r))
+		DRM_ERROR("Waiting for fences timed out!");
+
 	/*
 	 * Disable the cursor first if we're disabling all the planes.
 	 * It'll remain on the screen after the planes are re-enabled
@@ -9235,18 +9244,6 @@ static void amdgpu_dm_commit_planes(struct drm_atomic_state *state,
 		}
 
 		abo = gem_to_amdgpu_bo(fb->obj[0]);
-
-		/*
-		 * Wait for all fences on this FB. Do limited wait to avoid
-		 * deadlock during GPU reset when this fence will not signal
-		 * but we hold reservation lock for the BO.
-		 */
-		r = dma_resv_wait_timeout(abo->tbo.base.resv,
-					  DMA_RESV_USAGE_WRITE, false,
-					  msecs_to_jiffies(5000));
-		if (unlikely(r <= 0))
-			DRM_ERROR("Waiting for fences timed out!");
-
 		fill_dc_plane_info_and_addr(
 			dm->adev, new_plane_state,
 			afb->tiling_flags,
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 15/15] drm/amdgpu: user fence proof of concept
  2022-05-02 16:37 Tackling the indefinite/user DMA fence problem Christian König
                   ` (13 preceding siblings ...)
  2022-05-02 16:37   ` Christian König
@ 2022-05-02 16:37 ` Christian König
  2022-05-04 10:08   ` Daniel Vetter
  15 siblings, 0 replies; 41+ messages in thread
From: Christian König @ 2022-05-02 16:37 UTC (permalink / raw)
  To: daniel, jason, daniels, skhawaja, maad.aldabagh, sergemetral,
	sumit.semwal, gustavo, Felix.Kuehling, alexander.deucher,
	tzimmermann, tvrtko.ursulin, linux-media, dri-devel,
	linaro-mm-sig
  Cc: Christian König

Just some hack to test the functionality, not a real implementation of
the interface.

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/dma-buf/dma-resv.c                    |  3 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c        | 28 ++++++++++++++++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c       |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c        |  2 +-
 .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c |  9 ++++--
 5 files changed, 34 insertions(+), 10 deletions(-)

diff --git a/drivers/dma-buf/dma-resv.c b/drivers/dma-buf/dma-resv.c
index da667c21ad55..e18efb21c452 100644
--- a/drivers/dma-buf/dma-resv.c
+++ b/drivers/dma-buf/dma-resv.c
@@ -286,7 +286,8 @@ void dma_resv_add_fence(struct dma_resv *obj, struct dma_fence *fence,
 	/* Drivers should not add containers here, instead add each fence
 	 * individually.
 	 */
-	WARN_ON(dma_fence_is_container(fence));
+	//WARN_ON(dma_fence_is_container(fence));
+
 
 	/* User fences must be added using DMA_RESV_USAGE_USER */
 	WARN_ON(test_bit(DMA_FENCE_FLAG_USER, &fence->flags) !=
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index 95eeab527ca9..299ab8e50c42 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -453,6 +453,7 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p,
 	struct amdgpu_vm *vm = &fpriv->vm;
 	struct amdgpu_bo_list_entry *e;
 	struct drm_gem_object *obj;
+	long timeout = HZ / 10;
 	struct amdgpu_bo *gds;
 	struct amdgpu_bo *gws;
 	struct amdgpu_bo *oa;
@@ -476,6 +477,17 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p,
 			return r;
 	}
 
+	amdgpu_bo_list_for_each_entry(e, p->bo_list) {
+		struct dma_resv *resv = e->bo->tbo.base.resv;
+
+		timeout = dma_resv_wait_timeout(resv, DMA_RESV_USAGE_USER,
+						true, timeout);
+		if (unlikely(timeout < 0))
+			return timeout;
+		if (unlikely(timeout == 0))
+			return -ETIME;
+	}
+
 	/* Get userptr backing pages. If pages are updated after registered
 	 * in amdgpu_gem_userptr_ioctl(), amdgpu_cs_list_validate() will do
 	 * amdgpu_ttm_backend_bind() to flush and invalidate new pages
@@ -516,7 +528,7 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p,
 			return r;
 
 		amdgpu_bo_list_for_each_entry(e, p->bo_list) {
-			r = drm_exec_prepare_obj(&p->exec, &e->bo->tbo.base, 2);
+			r = drm_exec_prepare_obj(&p->exec, &e->bo->tbo.base, 3);
 			drm_exec_break_on_contention(&p->exec);
 			if (unlikely(r))
 				return r;
@@ -527,7 +539,7 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p,
 
 		if (p->uf_bo) {
 			r = drm_exec_prepare_obj(&p->exec, &p->uf_bo->tbo.base,
-						 2);
+						 3);
 			drm_exec_continue_on_contention(&p->exec);
 			if (unlikely(r))
 				return r;
@@ -1160,6 +1172,7 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p,
 	struct drm_sched_entity *entity = p->entity;
 	struct amdgpu_bo_list_entry *e;
 	struct drm_gem_object *gobj;
+	struct dma_fence *dummy;
 	struct amdgpu_job *job;
 	unsigned long index;
 	uint64_t seq;
@@ -1191,6 +1204,11 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p,
 	}
 
 	p->fence = dma_fence_get(&job->base.s_fence->finished);
+	dummy = dma_fence_merge(p->fence, dma_fence_get_stub(true));
+	if (!dummy) {
+		r = -ENOMEM;
+		goto error_abort;
+	}
 
 	amdgpu_ctx_add_fence(p->ctx, entity, p->fence, &seq);
 	amdgpu_cs_post_dependencies(p);
@@ -1214,11 +1232,13 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p,
 
 	drm_exec_for_each_duplicate_object(&p->exec, index, gobj) {
 		ttm_bo_move_to_lru_tail_unlocked(&gem_to_amdgpu_bo(gobj)->tbo);
-		dma_resv_add_fence(gobj->resv, p->fence, DMA_RESV_USAGE_WRITE);
+		dma_resv_add_fence(gobj->resv, p->fence, DMA_RESV_USAGE_KERNEL);
+		dma_resv_add_fence(gobj->resv, dummy, DMA_RESV_USAGE_USER);
 	}
 	drm_exec_for_each_locked_object(&p->exec, index, gobj) {
 		ttm_bo_move_to_lru_tail_unlocked(&gem_to_amdgpu_bo(gobj)->tbo);
-		dma_resv_add_fence(gobj->resv, p->fence, DMA_RESV_USAGE_WRITE);
+		dma_resv_add_fence(gobj->resv, p->fence, DMA_RESV_USAGE_KERNEL);
+		dma_resv_add_fence(gobj->resv, dummy, DMA_RESV_USAGE_USER);
 	}
 
 	mutex_unlock(&p->adev->notifier_lock);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index b03663f42cc9..bd334f5fd64f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -2655,7 +2655,7 @@ static const struct drm_driver amdgpu_kms_driver = {
 	    DRIVER_ATOMIC |
 	    DRIVER_GEM |
 	    DRIVER_RENDER | DRIVER_MODESET | DRIVER_SYNCOBJ |
-	    DRIVER_SYNCOBJ_TIMELINE,
+	    DRIVER_SYNCOBJ_TIMELINE | DRIVER_USER_FENCE,
 	.open = amdgpu_driver_open_kms,
 	.postclose = amdgpu_driver_postclose_kms,
 	.lastclose = amdgpu_driver_lastclose_kms,
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index e5c8e72a9485..6705287887e9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -628,7 +628,7 @@ static void amdgpu_vm_pt_next_dfs(struct amdgpu_device *adev,
  */
 int amdgpu_vm_lock_pd(struct amdgpu_vm *vm, struct drm_exec *exec)
 {
-	return drm_exec_prepare_obj(exec, &vm->root.bo->tbo.base, 4);
+	return drm_exec_prepare_obj(exec, &vm->root.bo->tbo.base, 5);
 }
 
 /**
diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index c5b2417adcc6..2e0f059b9d12 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -7627,12 +7627,11 @@ static int dm_plane_helper_prepare_fb(struct drm_plane *plane,
 		DRM_ERROR("%p bind failed\n", rbo);
 		goto error_unpin;
 	}
+	amdgpu_bo_unreserve(rbo);
 
 	r = drm_gem_plane_helper_prepare_fb(plane, new_state);
 	if (unlikely(r != 0))
-		goto error_unpin;
-
-	amdgpu_bo_unreserve(rbo);
+		goto error_reserve;
 
 	afb->address = amdgpu_bo_gpu_offset(rbo);
 
@@ -7665,6 +7664,10 @@ static int dm_plane_helper_prepare_fb(struct drm_plane *plane,
 
 	return 0;
 
+error_reserve:
+	if (WARN_ON(amdgpu_bo_reserve(rbo, true)))
+		return r;
+
 error_unpin:
 	amdgpu_bo_unpin(rbo);
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH 02/15] dma-buf: introduce user fence support
  2022-05-02 16:37 ` [PATCH 02/15] dma-buf: introduce user fence support Christian König
@ 2022-05-04  7:53   ` Tvrtko Ursulin
  2022-05-04  9:15     ` Christian König
  0 siblings, 1 reply; 41+ messages in thread
From: Tvrtko Ursulin @ 2022-05-04  7:53 UTC (permalink / raw)
  To: Christian König, daniel, jason, daniels, skhawaja,
	maad.aldabagh, sergemetral, sumit.semwal, gustavo,
	Felix.Kuehling, alexander.deucher, tzimmermann, linux-media,
	dri-devel, linaro-mm-sig
  Cc: Christian König


On 02/05/2022 17:37, Christian König wrote:
> Start introducing a new part of the framework for handling user fences.
> 
> In strict opposition to normal fences user fences don't reliable finish in

reliably

> a fixed amount of time and therefore can't be used as dependency in memory
> management.
> 
> Because of this user fences are marked with DMA_FENCE_FLAG_USER. Lockdep
> is checked that we can at least fault user pages when we check them for
> signaling.
> 
> This patch also adds a flag to dma_fence_get_stub() so that we can
> retrieve a signaled user fence. This can be used together with lockdep to
> test the handling in code path supporting user fences.
> 
> Signed-off-by: Christian König <christian.koenig@amd.com>
> ---
>   drivers/dma-buf/dma-fence-unwrap.c            |  4 +--
>   drivers/dma-buf/dma-fence.c                   | 31 ++++++++++++-------
>   drivers/dma-buf/st-dma-fence.c                |  2 +-
>   .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |  2 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c        |  2 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c      |  2 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c        |  4 +--
>   drivers/gpu/drm/drm_syncobj.c                 | 10 +++---
>   include/linux/dma-fence.h                     | 17 +++++++++-
>   9 files changed, 49 insertions(+), 25 deletions(-)
> 
> diff --git a/drivers/dma-buf/dma-fence-unwrap.c b/drivers/dma-buf/dma-fence-unwrap.c
> index c9becc74896d..87ee2efced10 100644
> --- a/drivers/dma-buf/dma-fence-unwrap.c
> +++ b/drivers/dma-buf/dma-fence-unwrap.c
> @@ -76,7 +76,7 @@ struct dma_fence *__dma_fence_merge(unsigned int num_fences,
>   	}
>   
>   	if (count == 0)
> -		return dma_fence_get_stub();
> +		return dma_fence_get_stub(false);
>   
>   	if (count > INT_MAX)
>   		return NULL;
> @@ -129,7 +129,7 @@ struct dma_fence *__dma_fence_merge(unsigned int num_fences,
>   	} while (tmp);
>   
>   	if (count == 0) {
> -		tmp = dma_fence_get_stub();
> +		tmp = dma_fence_get_stub(false);
>   		goto return_tmp;
>   	}
>   
> diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
> index 066400ed8841..52873f5eaeba 100644
> --- a/drivers/dma-buf/dma-fence.c
> +++ b/drivers/dma-buf/dma-fence.c
> @@ -26,6 +26,7 @@ EXPORT_TRACEPOINT_SYMBOL(dma_fence_signaled);
>   
>   static DEFINE_SPINLOCK(dma_fence_stub_lock);
>   static struct dma_fence dma_fence_stub;
> +static struct dma_fence dma_fence_user_stub;
>   
>   /*
>    * fence context counter: each execution context should have its own
> @@ -123,24 +124,28 @@ static const struct dma_fence_ops dma_fence_stub_ops = {
>   
>   /**
>    * dma_fence_get_stub - return a signaled fence
> + * @user: if true the returned fence is a user fence
>    *
> - * Return a stub fence which is already signaled. The fence's
> - * timestamp corresponds to the first time after boot this
> - * function is called.
> + * Return a stub fence which is already signaled. The fence's timestamp
> + * corresponds to the first time after boot this function is called. If @user is
> + * true a user fence is returned which can be used with lockdep to test user
> + * fence safety in a code path.
>    */
> -struct dma_fence *dma_fence_get_stub(void)
> +struct dma_fence *dma_fence_get_stub(bool user)
>   {
> +	struct dma_fence *fence = user ? &dma_fence_user_stub : &dma_fence_stub;
> +
>   	spin_lock(&dma_fence_stub_lock);
> -	if (!dma_fence_stub.ops) {
> -		dma_fence_init(&dma_fence_stub,
> -			       &dma_fence_stub_ops,
> -			       &dma_fence_stub_lock,
> +	if (!fence->ops) {
> +		dma_fence_init(fence, &dma_fence_stub_ops, &dma_fence_stub_lock,
>   			       0, 0);
> -		dma_fence_signal_locked(&dma_fence_stub);
> +		if (user)
> +			set_bit(DMA_FENCE_FLAG_USER, &fence->flags);
> +		dma_fence_signal_locked(fence);
>   	}
>   	spin_unlock(&dma_fence_stub_lock);
>   
> -	return dma_fence_get(&dma_fence_stub);
> +	return dma_fence_get(fence);
>   }
>   EXPORT_SYMBOL(dma_fence_get_stub);
>   
> @@ -497,8 +502,9 @@ dma_fence_wait_timeout(struct dma_fence *fence, bool intr, signed long timeout)
>   		return -EINVAL;
>   
>   	might_sleep();
> -
>   	__dma_fence_might_wait();
> +	if (test_bit(DMA_FENCE_FLAG_USER, &fence->flags))
> +		might_fault();
>   
>   	trace_dma_fence_wait_start(fence);
>   	if (fence->ops->wait)
> @@ -870,6 +876,9 @@ dma_fence_wait_any_timeout(struct dma_fence **fences, uint32_t count,
>   	for (i = 0; i < count; ++i) {
>   		struct dma_fence *fence = fences[i];
>   
> +		if (test_bit(DMA_FENCE_FLAG_USER, &fence->flags))
> +			might_fault();
> +
>   		cb[i].task = current;
>   		if (dma_fence_add_callback(fence, &cb[i].base,
>   					   dma_fence_default_wait_cb)) {
> diff --git a/drivers/dma-buf/st-dma-fence.c b/drivers/dma-buf/st-dma-fence.c
> index c8a12d7ad71a..50f757f75645 100644
> --- a/drivers/dma-buf/st-dma-fence.c
> +++ b/drivers/dma-buf/st-dma-fence.c
> @@ -412,7 +412,7 @@ static int test_stub(void *arg)
>   	int i;
>   
>   	for (i = 0; i < ARRAY_SIZE(f); i++) {
> -		f[i] = dma_fence_get_stub();
> +		f[i] = dma_fence_get_stub((i % 2) == 1);
>   		if (!dma_fence_is_signaled(f[i])) {
>   			pr_err("Obtained unsignaled stub fence!\n");
>   			goto err;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> index 64ac4f8f49be..541c59635c34 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> @@ -263,7 +263,7 @@ static int amdgpu_amdkfd_remove_eviction_fence(struct amdgpu_bo *bo,
>   	/* TODO: Instead of block before we should use the fence of the page
>   	 * table update and TLB flush here directly.
>   	 */
> -	replacement = dma_fence_get_stub();
> +	replacement = dma_fence_get_stub(false);
>   	dma_resv_replace_fences(bo->tbo.base.resv, ef->base.context,
>   				replacement, DMA_RESV_USAGE_READ);
>   	dma_fence_put(replacement);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> index a28b7947a034..95eeab527ca9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> @@ -1399,7 +1399,7 @@ int amdgpu_cs_fence_to_handle_ioctl(struct drm_device *dev, void *data,
>   		return PTR_ERR(fence);
>   
>   	if (!fence)
> -		fence = dma_fence_get_stub();
> +		fence = dma_fence_get_stub(false);
>   
>   	switch (info->in.what) {
>   	case AMDGPU_FENCE_TO_HANDLE_GET_SYNCOBJ:
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> index 7f33ae87cb41..73165f387f3b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> @@ -193,7 +193,7 @@ int amdgpu_ring_init(struct amdgpu_device *adev, struct amdgpu_ring *ring,
>   		adev->rings[ring->idx] = ring;
>   		ring->num_hw_submission = sched_hw_submission;
>   		ring->sched_score = sched_score;
> -		ring->vmid_wait = dma_fence_get_stub();
> +		ring->vmid_wait = dma_fence_get_stub(false);
>   		r = amdgpu_fence_driver_init_ring(ring);
>   		if (r)
>   			return r;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> index 7e5cc8323329..e5c8e72a9485 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> @@ -1689,7 +1689,7 @@ int amdgpu_vm_bo_update_mapping(struct amdgpu_device *adev,
>   	}
>   
>   	if (!unlocked && !dma_fence_is_signaled(vm->last_unlocked)) {
> -		struct dma_fence *tmp = dma_fence_get_stub();
> +		struct dma_fence *tmp = dma_fence_get_stub(false);
>   
>   		amdgpu_bo_fence(vm->root.bo, vm->last_unlocked, true);
>   		swap(vm->last_unlocked, tmp);
> @@ -2905,7 +2905,7 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm)
>   	else
>   		vm->update_funcs = &amdgpu_vm_sdma_funcs;
>   	vm->last_update = NULL;
> -	vm->last_unlocked = dma_fence_get_stub();
> +	vm->last_unlocked = dma_fence_get_stub(false);
>   
>   	mutex_init(&vm->eviction_lock);
>   	vm->evicting = false;
> diff --git a/drivers/gpu/drm/drm_syncobj.c b/drivers/gpu/drm/drm_syncobj.c
> index 7e48dcd1bee4..5a961ea90a35 100644
> --- a/drivers/gpu/drm/drm_syncobj.c
> +++ b/drivers/gpu/drm/drm_syncobj.c
> @@ -255,7 +255,7 @@ static void drm_syncobj_fence_add_wait(struct drm_syncobj *syncobj,
>   		dma_fence_put(fence);
>   		list_add_tail(&wait->node, &syncobj->cb_list);
>   	} else if (!fence) {
> -		wait->fence = dma_fence_get_stub();
> +		wait->fence = dma_fence_get_stub(false);
>   	} else {
>   		wait->fence = fence;
>   	}
> @@ -411,7 +411,7 @@ int drm_syncobj_find_fence(struct drm_file *file_private,
>   			 * signalled, use a new fence instead.
>   			 */
>   			if (!*fence)
> -				*fence = dma_fence_get_stub();
> +				*fence = dma_fence_get_stub(false);
>   
>   			goto out;
>   		}
> @@ -1000,7 +1000,7 @@ static void syncobj_wait_syncobj_func(struct drm_syncobj *syncobj,
>   		dma_fence_put(fence);
>   		return;
>   	} else if (!fence) {
> -		wait->fence = dma_fence_get_stub();
> +		wait->fence = dma_fence_get_stub(false);
>   	} else {
>   		wait->fence = fence;
>   	}
> @@ -1067,7 +1067,7 @@ static signed long drm_syncobj_array_wait_timeout(struct drm_syncobj **syncobjs,
>   		if (fence)
>   			entries[i].fence = fence;
>   		else
> -			entries[i].fence = dma_fence_get_stub();
> +			entries[i].fence = dma_fence_get_stub(false);
>   
>   		if ((flags & DRM_SYNCOBJ_WAIT_FLAGS_WAIT_AVAILABLE) ||
>   		    dma_fence_is_signaled(entries[i].fence)) {
> @@ -1472,7 +1472,7 @@ drm_syncobj_timeline_signal_ioctl(struct drm_device *dev, void *data,
>   	}
>   
>   	for (i = 0; i < args->count_handles; i++) {
> -		struct dma_fence *fence = dma_fence_get_stub();
> +		struct dma_fence *fence = dma_fence_get_stub(false);
>   
>   		drm_syncobj_add_point(syncobjs[i], chains[i],
>   				      fence, points[i]);
> diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h
> index afea82ec5946..be96687d31d8 100644
> --- a/include/linux/dma-fence.h
> +++ b/include/linux/dma-fence.h
> @@ -99,6 +99,16 @@ enum dma_fence_flag_bits {
>   	DMA_FENCE_FLAG_SIGNALED_BIT,
>   	DMA_FENCE_FLAG_TIMESTAMP_BIT,
>   	DMA_FENCE_FLAG_ENABLE_SIGNAL_BIT,
> +
> +	/**
> +	 * @DMA_FENCE_FLAG_USER:
> +	 *
> +	 * Indicates a user fence. User fences are not guaranteed to signal in
> +	 * a finite amount of time. Because of this it is not allowed to wait for
> +	 * user fences with any lock held, nor to make the signaling of non-user
> +	 * fences depend on them.
> +	 */
> +	DMA_FENCE_FLAG_USER,
>   	DMA_FENCE_FLAG_DRIVER, /* must always be last member */
>   };
>   
> @@ -398,6 +408,8 @@ void dma_fence_enable_sw_signaling(struct dma_fence *fence);
>   static inline bool
>   dma_fence_is_signaled_locked(struct dma_fence *fence)
>   {
> +	WARN_ON(test_bit(DMA_FENCE_FLAG_USER, &fence->flags));
> +
>   	if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags))
>   		return true;
>   
> @@ -428,6 +440,9 @@ dma_fence_is_signaled_locked(struct dma_fence *fence)
>   static inline bool
>   dma_fence_is_signaled(struct dma_fence *fence)
>   {
> +	if (test_bit(DMA_FENCE_FLAG_USER, &fence->flags))
> +		might_fault();

Why can this one fault while dma_fence_is_signaled_locked() cannot? I 
mean obviously it must not in the locked section, but how do signalling 
callbacks know the context of the caller?

This maybe ties back to the issue I don't think I ever found an 
explanation for - why do the "is signaled" helpers actually need (or 
want?) to do the signalling themselves, and are therefore also hit with 
the signalling annotations?

Regards,

Tvrtko

> +
>   	if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags))
>   		return true;
>   
> @@ -583,7 +598,7 @@ static inline signed long dma_fence_wait(struct dma_fence *fence, bool intr)
>   	return ret < 0 ? ret : 0;
>   }
>   
> -struct dma_fence *dma_fence_get_stub(void);
> +struct dma_fence *dma_fence_get_stub(bool user);
>   struct dma_fence *dma_fence_allocate_private_stub(void);
>   u64 dma_fence_context_alloc(unsigned num);
>   

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 02/15] dma-buf: introduce user fence support
  2022-05-04  7:53   ` Tvrtko Ursulin
@ 2022-05-04  9:15     ` Christian König
  0 siblings, 0 replies; 41+ messages in thread
From: Christian König @ 2022-05-04  9:15 UTC (permalink / raw)
  To: Tvrtko Ursulin, Christian König, daniel, jason, daniels,
	skhawaja, maad.aldabagh, sergemetral, sumit.semwal, gustavo,
	Felix.Kuehling, alexander.deucher, tzimmermann, linux-media,
	dri-devel, linaro-mm-sig



On 04.05.22 at 09:53, Tvrtko Ursulin wrote:
>
> On 02/05/2022 17:37, Christian König wrote:
>> [SNIP]
>>   @@ -398,6 +408,8 @@ void dma_fence_enable_sw_signaling(struct 
>> dma_fence *fence);
>>   static inline bool
>>   dma_fence_is_signaled_locked(struct dma_fence *fence)
>>   {
>> +    WARN_ON(test_bit(DMA_FENCE_FLAG_USER, &fence->flags));
>> +
>>       if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags))
>>           return true;
>>   @@ -428,6 +440,9 @@ dma_fence_is_signaled_locked(struct dma_fence 
>> *fence)
>>   static inline bool
>>   dma_fence_is_signaled(struct dma_fence *fence)
>>   {
>> +    if (test_bit(DMA_FENCE_FLAG_USER, &fence->flags))
>> +        might_fault();
>
> Why can this one fault while dma_fence_is_signaled_locked() cannot? I 
> mean obviously it must not in the locked section, but how do signalling 
> callbacks know the context of the caller?

Well, dma_fence_is_signaled_locked() shouldn't exist in the first place.

The locking can only be done by either the framework itself or the 
driver exporting the fence. And if you take a look at the actual users 
you indeed see that in the only two cases where this is used the driver 
knows what fence it has and should probably call its internal function 
itself.
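
A hedged sketch of that point (exporter-internal code, names
illustrative and not from this series): a driver which already holds
its own fence lock can test the signaled bit directly instead of
re-entering the framework through dma_fence_is_signaled_locked().

	/* Illustrative exporter-internal check, fence lock already held. */
	static bool my_fence_done_locked(struct dma_fence *fence)
	{
		lockdep_assert_held(fence->lock);

		return test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags);
	}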

>
> This maybe ties back to the issue I don't think I ever found an 
> explanation for - why do the "is signaled" helpers actually need (or 
> want?) to do the signalling themselves, and are therefore also hit with 
> the signalling annotations?

Yeah, exactly, that's what I could never figure out either. And I think 
it isn't good design to do this, because it results in the dma_fence 
being signaled from different sources.

Without that the fence would just be consistently signaled from the 
interrupt (or whatever context) the exporter uses to signal it.

Thanks,
Christian.

>
> Regards,
>
> Tvrtko
>
>> +
>>       if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags))
>>           return true;
>>   @@ -583,7 +598,7 @@ static inline signed long dma_fence_wait(struct 
>> dma_fence *fence, bool intr)
>>       return ret < 0 ? ret : 0;
>>   }
>>   -struct dma_fence *dma_fence_get_stub(void);
>> +struct dma_fence *dma_fence_get_stub(bool user);
>>   struct dma_fence *dma_fence_allocate_private_stub(void);
>>   u64 dma_fence_context_alloc(unsigned num);


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Tackling the indefinite/user DMA fence problem
  2022-05-02 16:37 Tackling the indefinite/user DMA fence problem Christian König
@ 2022-05-04 10:08   ` Daniel Vetter
  2022-05-02 16:37 ` [PATCH 02/15] dma-buf: introduce user fence support Christian König
                     ` (14 subsequent siblings)
  15 siblings, 0 replies; 41+ messages in thread
From: Daniel Vetter @ 2022-05-04 10:08 UTC (permalink / raw)
  To: Christian König
  Cc: daniel, jason, daniels, skhawaja, maad.aldabagh, sergemetral,
	sumit.semwal, gustavo, Felix.Kuehling, alexander.deucher,
	tzimmermann, tvrtko.ursulin, linux-media, dri-devel,
	linaro-mm-sig

On Mon, May 02, 2022 at 06:37:07PM +0200, Christian König wrote:
> Hello everyone,
> 
> it's a well known problem that the DMA-buf subsystem mixed
> synchronization and memory management requirements into the same
> dma_fence and dma_resv objects. Because of this dma_fence objects need
> to guarantee that they complete within a finite amount of time or
> otherwise the system can easily deadlock.
> 
> One of the few good things about this problem is that it is really good
> understood by now.
> 
> Daniel and others came up with some documentation:
> https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_buf#indefinite-dma-fences
> 
> And Jason did an excellent presentation about that problem on last years
> LPC: https://lpc.events/event/11/contributions/1115/
> 
> Based on that we had been able to reject new implementations of
> infinite/user DMA fences and mitigate the effect of the few existing
> ones.
> 
> The still remaining down side is that we don't have a way of using user
> fences as dependency in both the explicit (sync_file, drm_syncobj) as
> well as the implicit (dma_resv) synchronization objects, resulting in
> numerous problems and limitations for things like HMM, user queues
> etc....
> 
> This patch set here now tries to tackle this problem by untangling the
> synchronization from the memory management. What it does *not* try to do
> is to fix the existing kernel fences, because I think we now can all
> agree on that this isn't really possible.
> 
> To archive this goal what I do in this patch set is to add some parallel
> infrastructure to cleanly separate normal kernel dma_fence objects from
> indefinite/user fences:
> 
> 1. It introduce a DMA_FENCE_FLAG_USER define (after renaming some
> existing driver defines). To note that a certain dma_fence is an user
> fence and *must* be ignore by memory management and never used as
> dependency for normal none user dma_fence objects.
> 
> 2. The dma_fence_array and dma_fence_chain containers are modified so
> that they are marked as user fences whenever any of their contained
> fences are an user fence.
> 
> 3. The dma_resv object gets a new DMA_RESV_USAGE_USER flag which must be
> used with indefinite/user fences and separates those into it's own
> synchronization domain.
> 
> 4. The existing dma_buf_poll_add_cb() function is modified so that
> indefinite/user fences are included in the polling.
> 
> 5. The sync_file synchronization object is modified so that we
> essentially have two fence streams instead of just one.
> 
> 6. The drm_syncobj is modified in a similar way. User fences are just
> ignored unless the driver explicitly states support to wait for them.
> 
> 7. The DRM subsystem gains a new DRIVER_USER_FENCE flag which drivers
> can use to indicate the need for user fences. If user fences are used
> the atomic mode setting starts to support user fences as IN/OUT fences.
> 
> 8. Lockdep is used at various critical locations to ensure that nobody
> ever tries to mix user fences with non user fences.
> 
> The general approach is to just ignore user fences unless a driver
> stated explicitely support for them.
> 
> On top of all of this I've hacked amdgpu so that we add the resulting CS
> fence only as kernel dependency to the dma_resv object and an additional
> wrapped up with a dma_fence_array and a stub user fence.
> 
> The result is that the newly added atomic modeset functions now
> correctly wait for the user fence to complete before doing the flip. And
> dependent CS don't pipeline any more, but rather block on the CPU before
> submitting work.
> 
> After tons of debugging and testing everything now seems to not go up in
> flames immediately and even lockdep is happy with the annotations.
> 
> I'm perfectly aware that this is probably by far the most controversial
> patch set I've ever created and I really wish we wouldn't need it. But
> we certainly have the requirement for this and I don't see much other
> chance to get that working in an UAPI compatible way.
> 
> Thoughts/comments?

I think you need to type up the goal or exact problem statement you're
trying to solve first. What you typed up is a solution along the lines of
"try to stuff userspace memory fences into dma_fence and see how horrible
it all is", and that's certainly an interesting experiment, but what are
you trying to solve with it?

Like if the issue is to enable opencl or whatever, then that's no problem
(rocm on amdkfd is a thing, same maybe without the kfd part can be done
anywhere else). If the goal is to enable userspace memory fences for vk,
then we really don't need these everywhere, but really only in drm_syncobj
(and maybe sync_file).

If the goal is specifically atomic kms, then there's an entire can of
worms there that I really don't want to think about, but it exists: We
have dma_fence as out-fences from atomic commit, and that's already
massively broken since most drivers allocate some memory or at least take
locks which can allocate memory in their commit path. Like i2c. Putting a
userspace memory fence as in-fence in there makes that problem
substantially worse, since at least in theory you're just not allowed to
might_fault in atomic_commit_tail.
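
A compact illustration of that constraint (a sketch, using the fence
signalling annotations that already exist in dma-fence): everything
between these two annotations is on the path to signalling a dma_fence
and must therefore neither fault nor wait on anything that can fault.

	bool cookie = dma_fence_begin_signalling();
	/* commit tail work: no might_fault(), no user fence waits */
	dma_fence_end_signalling(cookie);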

If the goal is to keep the uapi perfectly compatible then your patch set
doesn't look like a solution, since as soon as another driver is involved
which doesn't understand userspace memory fences it all falls apart. So
works great for a quick demo with amd+amd sharing, but not much further.
And I don't think it's feasible to just rev the entire ecosystem, since
that kinda defeats the point of keeping uapi stable - if we rev everything
we might as well also rev the uapi and make this a bit more incremental
again :-)

There's probably more to ponder here ...

I'm not sure what exactly the problem statement is that matches your
solution here though, so that seems to be missing.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Tackling the indefinite/user DMA fence problem
  2022-05-04 10:08   ` Daniel Vetter
@ 2022-05-09  6:56     ` Christian König
  -1 siblings, 0 replies; 41+ messages in thread
From: Christian König @ 2022-05-09  6:56 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: tvrtko.ursulin, sergemetral, tzimmermann, gustavo,
	Felix.Kuehling, linux-media, dri-devel, linaro-mm-sig,
	maad.aldabagh, jason, alexander.deucher, skhawaja, sumit.semwal,
	daniels

On 04.05.22 at 12:08, Daniel Vetter wrote:
> On Mon, May 02, 2022 at 06:37:07PM +0200, Christian König wrote:
>> Hello everyone,
>>
>> it's a well known problem that the DMA-buf subsystem mixed
>> synchronization and memory management requirements into the same
>> dma_fence and dma_resv objects. Because of this dma_fence objects need
>> to guarantee that they complete within a finite amount of time or
>> otherwise the system can easily deadlock.
>>
>> One of the few good things about this problem is that it is really good
>> understood by now.
>>
>> Daniel and others came up with some documentation:
>> https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_buf#indefinite-dma-fences
>>
>> And Jason did an excellent presentation about that problem on last years
>> LPC: https://lpc.events/event/11/contributions/1115/
>>
>> Based on that we had been able to reject new implementations of
>> infinite/user DMA fences and mitigate the effect of the few existing
>> ones.
>>
>> The still remaining down side is that we don't have a way of using user
>> fences as dependency in both the explicit (sync_file, drm_syncobj) as
>> well as the implicit (dma_resv) synchronization objects, resulting in
>> numerous problems and limitations for things like HMM, user queues
>> etc....
>>
>> This patch set here now tries to tackle this problem by untangling the
>> synchronization from the memory management. What it does *not* try to do
>> is to fix the existing kernel fences, because I think we now can all
>> agree on that this isn't really possible.
>>
>> To archive this goal what I do in this patch set is to add some parallel
>> infrastructure to cleanly separate normal kernel dma_fence objects from
>> indefinite/user fences:
>>
>> 1. It introduce a DMA_FENCE_FLAG_USER define (after renaming some
>> existing driver defines). To note that a certain dma_fence is an user
>> fence and *must* be ignore by memory management and never used as
>> dependency for normal none user dma_fence objects.
>>
>> 2. The dma_fence_array and dma_fence_chain containers are modified so
>> that they are marked as user fences whenever any of their contained
>> fences are an user fence.
>>
>> 3. The dma_resv object gets a new DMA_RESV_USAGE_USER flag which must be
>> used with indefinite/user fences and separates those into it's own
>> synchronization domain.
>>
>> 4. The existing dma_buf_poll_add_cb() function is modified so that
>> indefinite/user fences are included in the polling.
>>
>> 5. The sync_file synchronization object is modified so that we
>> essentially have two fence streams instead of just one.
>>
>> 6. The drm_syncobj is modified in a similar way. User fences are just
>> ignored unless the driver explicitly states support to wait for them.
>>
>> 7. The DRM subsystem gains a new DRIVER_USER_FENCE flag which drivers
>> can use to indicate the need for user fences. If user fences are used
>> the atomic mode setting starts to support user fences as IN/OUT fences.
>>
>> 8. Lockdep is used at various critical locations to ensure that nobody
>> ever tries to mix user fences with non user fences.
>>
>> The general approach is to just ignore user fences unless a driver
>> stated explicitely support for them.
>>
>> On top of all of this I've hacked amdgpu so that we add the resulting CS
>> fence only as kernel dependency to the dma_resv object and an additional
>> wrapped up with a dma_fence_array and a stub user fence.
>>
>> The result is that the newly added atomic modeset functions now
>> correctly wait for the user fence to complete before doing the flip. And
>> dependent CS don't pipeline any more, but rather block on the CPU before
>> submitting work.
>>
>> After tons of debugging and testing everything now seems to not go up in
>> flames immediately and even lockdep is happy with the annotations.
>>
>> I'm perfectly aware that this is probably by far the most controversial
>> patch set I've ever created and I really wish we wouldn't need it. But
>> we certainly have the requirement for this and I don't see much other
>> chance to get that working in an UAPI compatible way.
>>
>> Thoughts/comments?
> I think you need to type up the goal or exact problem statement you're
> trying to solve first. What you typed up is a solution along the lines of
> "try to stuff userspace memory fences into dma_fence and see how horrible
> it all is", and that's certainly an interesting experiment, but what are
> you trying to solve with it?

Well, good point. I explained too much how it works, but not why.

In general I would describe the goal as: Providing a standard kernel 
infrastructure for user fences.

> Like if the issue is to enable opencl or whatever, then that's no problem
> (rocm on amdkfd is a thing, same maybe without the kfd part can be done
> anywhere else). If the goal is to enable userspace memory fences for vk,
> then we really don't need these everywhere, but really only in drm_syncobj
> (and maybe sync_file).

Yes, having an in-kernel representation for vk user space fences is one 
of the goals.

And I was going back and forth on whether I should rather come up with a 
new structure for this or use the existing dma_fence with a flag as well.

I've decided to go down the latter route because we have quite a lot of 
existing functionality which can be re-used. But if you have a good 
argument that it would be safer to come up with something completely 
new, I'm perfectly fine with that as well.
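
To make that trade-off concrete, a minimal sketch of what reusing
dma_fence with the DMA_FENCE_FLAG_USER bit buys (the helper name here
is hypothetical): common code can keep user fences out of memory
management paths with a single test instead of growing a second fence
type everywhere.

	/* Sketch: memory management paths reject user fences with one test. */
	static bool fence_usable_for_mm(struct dma_fence *fence)
	{
		return !test_bit(DMA_FENCE_FLAG_USER, &fence->flags);
	}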

> If the goal is specifically atomic kms, then there's an entire can of
> worms there that I really don't want to think about, but it exists: We
> have dma_fence as out-fences from atomic commit, and that's already
> massively broken since most drivers allocate some memory or at least take
> locks which can allocate memory in their commit path. Like i2c. Putting a
> userspace memory fence as in-fence in there makes that problem
> substantially worse, since at least in theory you're just not allowed to
> might_fault in atomic_commit_tail.

Yes, that's unfortunately one of the goals as well and yes I completely 
agree on the can of worms. But I think I've solved that.

What I do in the patch set is to enforce that the out fence is a user 
fence when the driver supports user in-fences as well.

Since user fences don't have the memory management dependency, drivers 
can actually allocate memory or call I2C functions which take locks 
that have memory allocation dependencies.

Or am I missing some other reason why you can't fault or allocate memory 
in atomic_commit_tail? At least lockdep seems to be happy about it now.
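
As a sketch of that rule (hypothetical snippet, assuming the
DRIVER_USER_FENCE and DMA_FENCE_FLAG_USER bits from this series; dev
and out_fence are stand-ins): drivers which opt into user in-fences get
their atomic out-fences marked as user fences too, so memory management
can never pick up a dependency on the commit work.

	/* Hypothetical: out-fences become user fences on opt-in drivers. */
	if (drm_core_check_feature(dev, DRIVER_USER_FENCE))
		set_bit(DMA_FENCE_FLAG_USER, &out_fence->flags);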

> If the goal is to keep the uapi perfectly compatible then your patch set
> doesn't look like a solution, since as soon as another driver is involved
> which doesn't understand userspace memory fences it all falls apart. So
> works great for a quick demo with amd+amd sharing, but not much further.
> And I don't think it's feasible to just rev the entire ecosystem, since
> that kinda defeats the point of keeping uapi stable - if we rev everything
> we might as well also rev the uapi and make this a bit more incremental
> again :-)

Yes, unfortunately the uapi needs to stay compatible as well and yes 
that means we need to deploy this to all drivers involved.

We at least need to be able to provide a stack on new hardware with (for 
example) Ubuntu 18.04 without replacing all the userspace components.

What we can replace is the OpenGL stack and if necessary libdrm, but not 
(for example) the X server and most likely not the DDX in some cases.

The same applies to surfaceflinger and to some extent Wayland as well.

Regards,
Christian.

>
> There's probably more to ponder here ...
>
> I'm not sure what exactly the problem statement is that matches your
> solution here though, so that seems to be missing.
> -Daniel


^ permalink raw reply	[flat|nested] 41+ messages in thread


> If the goal is to keep the uapi perfectly compatible then your patch set
> doesn't look like a solution, since as soon as another driver is involved
> which doesn't understand userspace memory fences it all falls apart. So
> works great for a quick demo with amd+amd sharing, but not much further.
> And I don't think it's feasible to just rev the entire ecosystem, since
> that kinda defeats the point of keeping uapi stable - if we rev everything
> we might as well also rev the uapi and make this a bit more incremental
> again :-)

Yes, unfortunately the uapi needs to stay compatible as well and yes 
that means we need to deploy this to all drivers involved.

We at least need to be able to provide a stack on new hardware with (for 
example) Ubuntu 18.04 without replacing all the userspace components.

What we can replace is the OpenGL stack and if necessary libdrm, but not 
(for example) the X server and most likely not the DDX in some cases.

The same applies with surfaceflinger and to some extend Wayland as well.

Regards,
Christian.

>
> There's probably more to ponder here ...
>
> I'm not sure what exactly the problem statement is that matches your
> solution here though, so that seems to be missing.
> -Daniel


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Tackling the indefinite/user DMA fence problem
  2022-05-09  6:56     ` Christian König
@ 2022-05-09 14:10       ` Daniel Vetter
  -1 siblings, 0 replies; 41+ messages in thread
From: Daniel Vetter @ 2022-05-09 14:10 UTC (permalink / raw)
  To: Christian König
  Cc: tvrtko.ursulin, sergemetral, tzimmermann, gustavo,
	Felix.Kuehling, linux-media, dri-devel, linaro-mm-sig, jason,
	alexander.deucher, daniels, skhawaja, sumit.semwal,
	maad.aldabagh

On Mon, May 09, 2022 at 08:56:41AM +0200, Christian König wrote:
> Am 04.05.22 um 12:08 schrieb Daniel Vetter:
> > On Mon, May 02, 2022 at 06:37:07PM +0200, Christian König wrote:
> > > Hello everyone,
> > > 
> > > it's a well known problem that the DMA-buf subsystem mixed
> > > synchronization and memory management requirements into the same
> > > dma_fence and dma_resv objects. Because of this dma_fence objects need
> > > to guarantee that they complete within a finite amount of time or
> > > otherwise the system can easily deadlock.
> > > 
> > > One of the few good things about this problem is that it is really good
> > > understood by now.
> > > 
> > > Daniel and others came up with some documentation:
> > > https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_buf#indefinite-dma-fences
> > > 
> > > And Jason did an excellent presentation about that problem on last years
> > > LPC: https://lpc.events/event/11/contributions/1115/
> > > 
> > > Based on that we had been able to reject new implementations of
> > > infinite/user DMA fences and mitigate the effect of the few existing
> > > ones.
> > > 
> > > The still remaining down side is that we don't have a way of using user
> > > fences as dependency in both the explicit (sync_file, drm_syncobj) as
> > > well as the implicit (dma_resv) synchronization objects, resulting in
> > > numerous problems and limitations for things like HMM, user queues
> > > etc....
> > > 
> > > This patch set here now tries to tackle this problem by untangling the
> > > synchronization from the memory management. What it does *not* try to do
> > > is to fix the existing kernel fences, because I think we now can all
> > > agree on that this isn't really possible.
> > > 
> > > To archive this goal what I do in this patch set is to add some parallel
> > > infrastructure to cleanly separate normal kernel dma_fence objects from
> > > indefinite/user fences:
> > > 
> > > 1. It introduce a DMA_FENCE_FLAG_USER define (after renaming some
> > > existing driver defines). To note that a certain dma_fence is an user
> > > fence and *must* be ignore by memory management and never used as
> > > dependency for normal none user dma_fence objects.
> > > 
> > > 2. The dma_fence_array and dma_fence_chain containers are modified so
> > > that they are marked as user fences whenever any of their contained
> > > fences are an user fence.
> > > 
> > > 3. The dma_resv object gets a new DMA_RESV_USAGE_USER flag which must be
> > > used with indefinite/user fences and separates those into it's own
> > > synchronization domain.
> > > 
> > > 4. The existing dma_buf_poll_add_cb() function is modified so that
> > > indefinite/user fences are included in the polling.
> > > 
> > > 5. The sync_file synchronization object is modified so that we
> > > essentially have two fence streams instead of just one.
> > > 
> > > 6. The drm_syncobj is modified in a similar way. User fences are just
> > > ignored unless the driver explicitly states support to wait for them.
> > > 
> > > 7. The DRM subsystem gains a new DRIVER_USER_FENCE flag which drivers
> > > can use to indicate the need for user fences. If user fences are used
> > > the atomic mode setting starts to support user fences as IN/OUT fences.
> > > 
> > > 8. Lockdep is used at various critical locations to ensure that nobody
> > > ever tries to mix user fences with non user fences.
> > > 
> > > The general approach is to just ignore user fences unless a driver
> > > stated explicitely support for them.
> > > 
> > > On top of all of this I've hacked amdgpu so that we add the resulting CS
> > > fence only as kernel dependency to the dma_resv object and an additional
> > > wrapped up with a dma_fence_array and a stub user fence.
> > > 
> > > The result is that the newly added atomic modeset functions now
> > > correctly wait for the user fence to complete before doing the flip. And
> > > dependent CS don't pipeline any more, but rather block on the CPU before
> > > submitting work.
> > > 
> > > After tons of debugging and testing everything now seems to not go up in
> > > flames immediately and even lockdep is happy with the annotations.
> > > 
> > > I'm perfectly aware that this is probably by far the most controversial
> > > patch set I've ever created and I really wish we wouldn't need it. But
> > > we certainly have the requirement for this and I don't see much other
> > > chance to get that working in an UAPI compatible way.
> > > 
> > > Thoughts/comments?
> > I think you need to type up the goal or exact problem statement you're
> > trying to solve first. What you typed up is a solution along the lines of
> > "try to stuff userspace memory fences into dma_fence and see how horrible
> > it all is", and that's certainly an interesting experiment, but what are
> > you trying to solve with it?
> 
> Well, good point. I explained to much how it works, but now why.
> 
> In general I would describe the goal as: Providing a standard kernel
> infrastructure for user fences.

So on that goal the part I fully agree on is that drm_syncobj can (and
should imo) be able to contain userspace memory fences. The uapi semantics
and everything is already fully set up to support that, but maybe with
reduced performance: Non-aware userspace (or when you don't trust the
supplier of the umf) needs to block when looking up the fence, and the
dma_fence returned will always be signalled already. But that's just a
mild performance issue (and vk drivers paper over that already with
threading) and not a correctness issue.

> > Like if the issue is to enable opencl or whatever, then that's no problem
> > (rocm on amdkfd is a thing, same maybe without the kfd part can be done
> > anywhere else). If the goal is to enable userspace memory fences for vk,
> > then we really don't need these everywhere, but really only in drm_syncobj
> > (and maybe sync_file).
> 
> Yes, having an in kernel representation for vk user space fences is one of
> the goals.
> 
> And I was going back and forth if I should rather come up with a new
> structure for this or use the existing dma_fence with a flag as well.
> 
> I've decided to go down the later router because we have quite a lot of
> existing functionality which can be re-used. But if you have a good argument
> that it would be more defensive to come up with something completely new,
> I'm perfectly fine with that as well.

Yeah so stuffing that into dma_fence already freaks me out a bit. It is
quite fundamentally a different thing, and it would be really nice to make
that very apparent at the type level too.

E.g. to make sure you never ever end up with an umf fence in mmu notifier
invalidate callback. You can enforce that with runtime checks too, but imo
compile time fail is better than runtime fail.

> > If the goal is specifically atomic kms, then there's an entire can of
> > worms there that I really don't want to think about, but it exists: We
> > have dma_fence as out-fences from atomic commit, and that's already
> > massively broken since most drivers allocate some memory or at least take
> > locks which can allocate memory in their commit path. Like i2c. Putting a
> > userspace memory fence as in-fence in there makes that problem
> > substantially worse, since at least in theory you're just not allowed to
> > might_faul in atomic_commit_tail.
> 
> Yes, that's unfortunately one of the goals as well and yes I completely
> agree on the can of worms. But I think I've solved that.
> 
> What I do in the patch set is to enforce that the out fence is an user fence
> when the driver supports user in fences as well.
> 
> Since user fences doesn't have the memory management dependency drivers can
> actually allocate memory or call I2C functions which takes locks which have
> memory allocation dependencies.
> 
> Or do I miss some other reason why you can't fault or allocate memory in
> atomic_commit_tail? At least lockdep seems to be happy about that now.

The problem is a bit that this breaks the uapi already. At least if the
goal is to have this all be perfectly transparent for userspace - as you
as you have multi-gpu setups going on at least.

> > If the goal is to keep the uapi perfectly compatible then your patch set
> > doesn't look like a solution, since as soon as another driver is involved
> > which doesn't understand userspace memory fences it all falls apart. So
> > works great for a quick demo with amd+amd sharing, but not much further.
> > And I don't think it's feasible to just rev the entire ecosystem, since
> > that kinda defeats the point of keeping uapi stable - if we rev everything
> > we might as well also rev the uapi and make this a bit more incremental
> > again :-)
> 
> Yes, unfortunately the uapi needs to stay compatible as well and yes that
> means we need to deploy this to all drivers involved.
> 
> We at least need to be able to provide a stack on new hardware with (for
> example) Ubuntu 18.04 without replacing all the userspace components.
> 
> What we can replace is the OpenGL stack and if necessary libdrm, but not
> (for example) the X server and most likely not the DDX in some cases.
> 
> The same applies with surfaceflinger and to some extend Wayland as well.

So for perfect uapi compat for existing compositor I really don't think
stuffing umf into the kernel is the right approach. Too many little
corners that break:

- the in/out fence mismatch every
- cross gpu with different userspace that doesn't understand umf and then
  just ignores them
- compositors which currently assume implicit sync finishes eventually,
  and with umf that gets complicated at best
- same with sync_file, the uapi atm does not have a concept of future
  fence

So you can kinda make this work, but it falls apart all over the place.
And I also don't think smashing umf into all these old concepts helps us
in any way to get towards a desktop which is umf-native.

My take is still that for backwards compat the simplest way is if a
umf-native driver simply provides dma-fence backwards compat as an opt-in,
which userspace chooses when it's necessary. There's really only two
things you need for that to work:

- a timeout of some sort on the dma_fence, which might or might not kill
  the entire context. This is entirey up to how your userspace does or
  does not implement stuff like arb robustness or vk_error_device_lost

- pre-pinned memory management to block out the all the inversions. This
  is a bit more nasty, but since we do have all the code for this already
  it really shouldn't be too tricky to make that happen for the fancy new
  umf world.

You do not need a kernel scheduler or anything like that at all, you can
do full userspace direct submit to hw and all that fun. Maybe do a
drm/sched frontend (and then your submit code does exactly what userspace
would do too).

Importantly the things you really don't need:

- special hw support, even if the only mode your hw supports is with page
  faults and all that: You can make sure all the pages are present
  upfront, and then simply kill the entire context is a page fault
  happens.

- special fw scheduler support: Once the memory management inversions are
  taken care of with pre-pinning under dma_fences, then the only other
  thing you need is a timeout for the dma_fence to signal. And maybe some
  kind of guaranteed ordering if you want to use a dma_fence timeline
  since that one can't go backwards.

Trying to shoehorn umf into all the old concepts like implicit sync or
sync_file which really don't support umf works for a demo, but imo just
isn't solid enough for shipping everywhere.

And long term I really don't think we ever want umf anywhere else than
drm_syncobj, at least for a 100% umf-native stack.

So maybe this all goes back to the old discussion with had, where you
argued for the need for special fw and hw and all that to make the old
dma_fence stuff work. Why is that needed? I still don't get that part ...
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Tackling the indefinite/user DMA fence problem
@ 2022-05-09 14:10       ` Daniel Vetter
  0 siblings, 0 replies; 41+ messages in thread
From: Daniel Vetter @ 2022-05-09 14:10 UTC (permalink / raw)
  To: Christian König
  Cc: Daniel Vetter, jason, daniels, skhawaja, maad.aldabagh,
	sergemetral, sumit.semwal, gustavo, Felix.Kuehling,
	alexander.deucher, tzimmermann, tvrtko.ursulin, linux-media,
	dri-devel, linaro-mm-sig

On Mon, May 09, 2022 at 08:56:41AM +0200, Christian König wrote:
> Am 04.05.22 um 12:08 schrieb Daniel Vetter:
> > On Mon, May 02, 2022 at 06:37:07PM +0200, Christian König wrote:
> > > Hello everyone,
> > > 
> > > it's a well known problem that the DMA-buf subsystem mixed
> > > synchronization and memory management requirements into the same
> > > dma_fence and dma_resv objects. Because of this dma_fence objects need
> > > to guarantee that they complete within a finite amount of time or
> > > otherwise the system can easily deadlock.
> > > 
> > > One of the few good things about this problem is that it is really good
> > > understood by now.
> > > 
> > > Daniel and others came up with some documentation:
> > > https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_buf#indefinite-dma-fences
> > > 
> > > And Jason did an excellent presentation about that problem on last years
> > > LPC: https://lpc.events/event/11/contributions/1115/
> > > 
> > > Based on that we had been able to reject new implementations of
> > > infinite/user DMA fences and mitigate the effect of the few existing
> > > ones.
> > > 
> > > The still remaining down side is that we don't have a way of using user
> > > fences as dependency in both the explicit (sync_file, drm_syncobj) as
> > > well as the implicit (dma_resv) synchronization objects, resulting in
> > > numerous problems and limitations for things like HMM, user queues
> > > etc....
> > > 
> > > This patch set here now tries to tackle this problem by untangling the
> > > synchronization from the memory management. What it does *not* try to do
> > > is to fix the existing kernel fences, because I think we now can all
> > > agree on that this isn't really possible.
> > > 
> > > To archive this goal what I do in this patch set is to add some parallel
> > > infrastructure to cleanly separate normal kernel dma_fence objects from
> > > indefinite/user fences:
> > > 
> > > 1. It introduce a DMA_FENCE_FLAG_USER define (after renaming some
> > > existing driver defines). To note that a certain dma_fence is an user
> > > fence and *must* be ignore by memory management and never used as
> > > dependency for normal none user dma_fence objects.
> > > 
> > > 2. The dma_fence_array and dma_fence_chain containers are modified so
> > > that they are marked as user fences whenever any of their contained
> > > fences are an user fence.
> > > 
> > > 3. The dma_resv object gets a new DMA_RESV_USAGE_USER flag which must be
> > > used with indefinite/user fences and separates those into it's own
> > > synchronization domain.
> > > 
> > > 4. The existing dma_buf_poll_add_cb() function is modified so that
> > > indefinite/user fences are included in the polling.
> > > 
> > > 5. The sync_file synchronization object is modified so that we
> > > essentially have two fence streams instead of just one.
> > > 
> > > 6. The drm_syncobj is modified in a similar way. User fences are just
> > > ignored unless the driver explicitly states support to wait for them.
> > > 
> > > 7. The DRM subsystem gains a new DRIVER_USER_FENCE flag which drivers
> > > can use to indicate the need for user fences. If user fences are used
> > > the atomic mode setting starts to support user fences as IN/OUT fences.
> > > 
> > > 8. Lockdep is used at various critical locations to ensure that nobody
> > > ever tries to mix user fences with non user fences.
> > > 
> > > The general approach is to just ignore user fences unless a driver
> > > stated explicitely support for them.
> > > 
> > > On top of all of this I've hacked amdgpu so that we add the resulting CS
> > > fence only as kernel dependency to the dma_resv object and an additional
> > > wrapped up with a dma_fence_array and a stub user fence.
> > > 
> > > The result is that the newly added atomic modeset functions now
> > > correctly wait for the user fence to complete before doing the flip. And
> > > dependent CS don't pipeline any more, but rather block on the CPU before
> > > submitting work.
> > > 
> > > After tons of debugging and testing everything now seems to not go up in
> > > flames immediately and even lockdep is happy with the annotations.
> > > 
> > > I'm perfectly aware that this is probably by far the most controversial
> > > patch set I've ever created and I really wish we wouldn't need it. But
> > > we certainly have the requirement for this and I don't see much other
> > > chance to get that working in an UAPI compatible way.
> > > 
> > > Thoughts/comments?
> > I think you need to type up the goal or exact problem statement you're
> > trying to solve first. What you typed up is a solution along the lines of
> > "try to stuff userspace memory fences into dma_fence and see how horrible
> > it all is", and that's certainly an interesting experiment, but what are
> > you trying to solve with it?
> 
> Well, good point. I explained to much how it works, but now why.
> 
> In general I would describe the goal as: Providing a standard kernel
> infrastructure for user fences.

So on that goal the part I fully agree on is that drm_syncobj can (and
should imo) be able to contain userspace memory fences. The uapi semantics
and everything is already fully set up to support that, but maybe with
reduced performance: Non-aware userspace (or when you don't trust the
supplier of the umf) needs to block when looking up the fence, and the
dma_fence returned will always be signalled already. But that's just a
mild performance issue (and vk drivers paper over that already with
threading) and not a correctness issue.

> > Like if the issue is to enable opencl or whatever, then that's no problem
> > (rocm on amdkfd is a thing, same maybe without the kfd part can be done
> > anywhere else). If the goal is to enable userspace memory fences for vk,
> > then we really don't need these everywhere, but really only in drm_syncobj
> > (and maybe sync_file).
> 
> Yes, having an in kernel representation for vk user space fences is one of
> the goals.
> 
> And I was going back and forth if I should rather come up with a new
> structure for this or use the existing dma_fence with a flag as well.
> 
> I've decided to go down the later router because we have quite a lot of
> existing functionality which can be re-used. But if you have a good argument
> that it would be more defensive to come up with something completely new,
> I'm perfectly fine with that as well.

Yeah so stuffing that into dma_fence already freaks me out a bit. It is
quite fundamentally a different thing, and it would be really nice to make
that very apparent at the type level too.

E.g. to make sure you never ever end up with an umf fence in mmu notifier
invalidate callback. You can enforce that with runtime checks too, but imo
compile time fail is better than runtime fail.

> > If the goal is specifically atomic kms, then there's an entire can of
> > worms there that I really don't want to think about, but it exists: We
> > have dma_fence as out-fences from atomic commit, and that's already
> > massively broken since most drivers allocate some memory or at least take
> > locks which can allocate memory in their commit path. Like i2c. Putting a
> > userspace memory fence as in-fence in there makes that problem
> > substantially worse, since at least in theory you're just not allowed to
> > might_faul in atomic_commit_tail.
> 
> Yes, that's unfortunately one of the goals as well and yes I completely
> agree on the can of worms. But I think I've solved that.
> 
> What I do in the patch set is to enforce that the out fence is an user fence
> when the driver supports user in fences as well.
> 
> Since user fences doesn't have the memory management dependency drivers can
> actually allocate memory or call I2C functions which takes locks which have
> memory allocation dependencies.
> 
> Or do I miss some other reason why you can't fault or allocate memory in
> atomic_commit_tail? At least lockdep seems to be happy about that now.

The problem is a bit that this breaks the uapi already. At least if the
goal is to have this all be perfectly transparent for userspace - as you
as you have multi-gpu setups going on at least.

> > If the goal is to keep the uapi perfectly compatible then your patch set
> > doesn't look like a solution, since as soon as another driver is involved
> > which doesn't understand userspace memory fences it all falls apart. So
> > works great for a quick demo with amd+amd sharing, but not much further.
> > And I don't think it's feasible to just rev the entire ecosystem, since
> > that kinda defeats the point of keeping uapi stable - if we rev everything
> > we might as well also rev the uapi and make this a bit more incremental
> > again :-)
> 
> Yes, unfortunately the uapi needs to stay compatible as well and yes that
> means we need to deploy this to all drivers involved.
> 
> We at least need to be able to provide a stack on new hardware with (for
> example) Ubuntu 18.04 without replacing all the userspace components.
> 
> What we can replace is the OpenGL stack and if necessary libdrm, but not
> (for example) the X server and most likely not the DDX in some cases.
> 
> The same applies with surfaceflinger and to some extend Wayland as well.

So for perfect uapi compat for existing compositor I really don't think
stuffing umf into the kernel is the right approach. Too many little
corners that break:

- the in/out fence mismatch every
- cross gpu with different userspace that doesn't understand umf and then
  just ignores them
- compositors which currently assume implicit sync finishes eventually,
  and with umf that gets complicated at best
- same with sync_file, the uapi atm does not have a concept of future
  fence

So you can kinda make this work, but it falls apart all over the place.
And I also don't think smashing umf into all these old concepts helps us
in any way to get towards a desktop which is umf-native.

My take is still that for backwards compat the simplest way is if a
umf-native driver simply provides dma-fence backwards compat as an opt-in,
which userspace chooses when it's necessary. There's really only two
things you need for that to work:

- a timeout of some sort on the dma_fence, which might or might not kill
  the entire context. This is entirey up to how your userspace does or
  does not implement stuff like arb robustness or vk_error_device_lost

- pre-pinned memory management to block out the all the inversions. This
  is a bit more nasty, but since we do have all the code for this already
  it really shouldn't be too tricky to make that happen for the fancy new
  umf world.

You do not need a kernel scheduler or anything like that at all, you can
do full userspace direct submit to hw and all that fun. Maybe do a
drm/sched frontend (and then your submit code does exactly what userspace
would do too).

Importantly the things you really don't need:

- special hw support, even if the only mode your hw supports is with page
  faults and all that: You can make sure all the pages are present
  upfront, and then simply kill the entire context is a page fault
  happens.

- special fw scheduler support: Once the memory management inversions are
  taken care of with pre-pinning under dma_fences, then the only other
  thing you need is a timeout for the dma_fence to signal. And maybe some
  kind of guaranteed ordering if you want to use a dma_fence timeline
  since that one can't go backwards.

Trying to shoehorn umf into all the old concepts like implicit sync or
sync_file which really don't support umf works for a demo, but imo just
isn't solid enough for shipping everywhere.

And long term I really don't think we ever want umf anywhere else than
drm_syncobj, at least for a 100% umf-native stack.

So maybe this all goes back to the old discussion with had, where you
argued for the need for special fw and hw and all that to make the old
dma_fence stuff work. Why is that needed? I still don't get that part ...
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Tackling the indefinite/user DMA fence problem
  2022-05-09 14:10       ` Daniel Vetter
@ 2022-05-17 10:28         ` Christian König
  -1 siblings, 0 replies; 41+ messages in thread
From: Christian König @ 2022-05-17 10:28 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: tvrtko.ursulin, sergemetral, tzimmermann, gustavo,
	Felix.Kuehling, linux-media, dri-devel, linaro-mm-sig,
	maad.aldabagh, jason, alexander.deucher, skhawaja, sumit.semwal,
	daniels

Am 09.05.22 um 16:10 schrieb Daniel Vetter:
> On Mon, May 09, 2022 at 08:56:41AM +0200, Christian König wrote:
>> Am 04.05.22 um 12:08 schrieb Daniel Vetter:
>>> On Mon, May 02, 2022 at 06:37:07PM +0200, Christian König wrote:
>>>> Hello everyone,
>>>>
>>>> it's a well known problem that the DMA-buf subsystem mixed
>>>> synchronization and memory management requirements into the same
>>>> dma_fence and dma_resv objects. Because of this dma_fence objects need
>>>> to guarantee that they complete within a finite amount of time or
>>>> otherwise the system can easily deadlock.
>>>>
>>>> One of the few good things about this problem is that it is really good
>>>> understood by now.
>>>>
>>>> Daniel and others came up with some documentation:
>>>> https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_buf#indefinite-dma-fences
>>>>
>>>> And Jason did an excellent presentation about that problem on last years
>>>> LPC: https://lpc.events/event/11/contributions/1115/
>>>>
>>>> Based on that we had been able to reject new implementations of
>>>> infinite/user DMA fences and mitigate the effect of the few existing
>>>> ones.
>>>>
>>>> The still remaining down side is that we don't have a way of using user
>>>> fences as dependency in both the explicit (sync_file, drm_syncobj) as
>>>> well as the implicit (dma_resv) synchronization objects, resulting in
>>>> numerous problems and limitations for things like HMM, user queues
>>>> etc....
>>>>
>>>> This patch set here now tries to tackle this problem by untangling the
>>>> synchronization from the memory management. What it does *not* try to do
>>>> is to fix the existing kernel fences, because I think we now can all
>>>> agree on that this isn't really possible.
>>>>
>>>> To archive this goal what I do in this patch set is to add some parallel
>>>> infrastructure to cleanly separate normal kernel dma_fence objects from
>>>> indefinite/user fences:
>>>>
>>>> 1. It introduce a DMA_FENCE_FLAG_USER define (after renaming some
>>>> existing driver defines). To note that a certain dma_fence is an user
>>>> fence and *must* be ignore by memory management and never used as
>>>> dependency for normal none user dma_fence objects.
>>>>
>>>> 2. The dma_fence_array and dma_fence_chain containers are modified so
>>>> that they are marked as user fences whenever any of their contained
>>>> fences are an user fence.
>>>>
>>>> 3. The dma_resv object gets a new DMA_RESV_USAGE_USER flag which must be
>>>> used with indefinite/user fences and separates those into it's own
>>>> synchronization domain.
>>>>
>>>> 4. The existing dma_buf_poll_add_cb() function is modified so that
>>>> indefinite/user fences are included in the polling.
>>>>
>>>> 5. The sync_file synchronization object is modified so that we
>>>> essentially have two fence streams instead of just one.
>>>>
>>>> 6. The drm_syncobj is modified in a similar way. User fences are just
>>>> ignored unless the driver explicitly states support to wait for them.
>>>>
>>>> 7. The DRM subsystem gains a new DRIVER_USER_FENCE flag which drivers
>>>> can use to indicate the need for user fences. If user fences are used
>>>> the atomic mode setting starts to support user fences as IN/OUT fences.
>>>>
>>>> 8. Lockdep is used at various critical locations to ensure that nobody
>>>> ever tries to mix user fences with non user fences.
>>>>
>>>> The general approach is to just ignore user fences unless a driver
>>>> stated explicitely support for them.
>>>>
>>>> On top of all of this I've hacked amdgpu so that we add the resulting CS
>>>> fence only as kernel dependency to the dma_resv object and an additional
>>>> wrapped up with a dma_fence_array and a stub user fence.
>>>>
>>>> The result is that the newly added atomic modeset functions now
>>>> correctly wait for the user fence to complete before doing the flip. And
>>>> dependent CS don't pipeline any more, but rather block on the CPU before
>>>> submitting work.
>>>>
>>>> After tons of debugging and testing everything now seems to not go up in
>>>> flames immediately and even lockdep is happy with the annotations.
>>>>
>>>> I'm perfectly aware that this is probably by far the most controversial
>>>> patch set I've ever created and I really wish we wouldn't need it. But
>>>> we certainly have the requirement for this and I don't see much other
>>>> chance to get that working in an UAPI compatible way.
>>>>
>>>> Thoughts/comments?
>>> I think you need to type up the goal or exact problem statement you're
>>> trying to solve first. What you typed up is a solution along the lines of
>>> "try to stuff userspace memory fences into dma_fence and see how horrible
>>> it all is", and that's certainly an interesting experiment, but what are
>>> you trying to solve with it?
>> Well, good point. I explained to much how it works, but now why.
>>
>> In general I would describe the goal as: Providing a standard kernel
>> infrastructure for user fences.
> So on that goal the part I fully agree on is that drm_syncobj can (and
> should imo) be able to contain userspace memory fences. The uapi semantics
> and everything is already fully set up to support that, but maybe with
> reduced performance: Non-aware userspace (or when you don't trust the
> supplier of the umf) needs to block when looking up the fence, and the
> dma_fence returned will always be signalled already. But that's just a
> mild performance issue (and vk drivers paper over that already with
> threading) and not a correctness issue.

Exactly that, yes.

>>> Like if the issue is to enable opencl or whatever, then that's no problem
>>> (rocm on amdkfd is a thing, same maybe without the kfd part can be done
>>> anywhere else). If the goal is to enable userspace memory fences for vk,
>>> then we really don't need these everywhere, but really only in drm_syncobj
>>> (and maybe sync_file).
>> Yes, having an in kernel representation for vk user space fences is one of
>> the goals.
>>
>> And I was going back and forth if I should rather come up with a new
>> structure for this or use the existing dma_fence with a flag as well.
>>
>> I've decided to go down the later router because we have quite a lot of
>> existing functionality which can be re-used. But if you have a good argument
>> that it would be more defensive to come up with something completely new,
>> I'm perfectly fine with that as well.
> Yeah so stuffing that into dma_fence already freaks me out a bit. It is
> quite fundamentally a different thing, and it would be really nice to make
> that very apparent at the type level too.
>
> E.g. to make sure you never ever end up with an umf fence in mmu notifier
> invalidate callback. You can enforce that with runtime checks too, but imo
> compile time fail is better than runtime fail.

Well, I see arguments for both sides.

There is certainly the danger that we have an umf wait in the mmu 
notifier, but then lockdep will scream "bloody hell" immediately.

On the other hand when I make this a separate structure we need to 
maintain containers for both variants, especially a chain implementation 
for drm_syncobj. And here I don't have lockdep to keep an eye that 
nobody does anything strange.

It's only a gut feeling with no clear evidence for one side. If you 
insists on a separate structure I will go down that route.

>>> If the goal is specifically atomic kms, then there's an entire can of
>>> worms there that I really don't want to think about, but it exists: We
>>> have dma_fence as out-fences from atomic commit, and that's already
>>> massively broken since most drivers allocate some memory or at least take
>>> locks which can allocate memory in their commit path. Like i2c. Putting a
>>> userspace memory fence as in-fence in there makes that problem
>>> substantially worse, since at least in theory you're just not allowed to
>>> might_faul in atomic_commit_tail.
>> Yes, that's unfortunately one of the goals as well and yes I completely
>> agree on the can of worms. But I think I've solved that.
>>
>> What I do in the patch set is to enforce that the out fence is an user fence
>> when the driver supports user in fences as well.
>>
>> Since user fences doesn't have the memory management dependency drivers can
>> actually allocate memory or call I2C functions which takes locks which have
>> memory allocation dependencies.
>>
>> Or do I miss some other reason why you can't fault or allocate memory in
>> atomic_commit_tail? At least lockdep seems to be happy about that now.
> The problem is a bit that this breaks the uapi already. At least if the
> goal is to have this all be perfectly transparent for userspace - as you
> as you have multi-gpu setups going on at least.

Question here is why do you think there is an UAPI break? We currently 
wait in a work item already, so where exactly is the problem?

>>> If the goal is to keep the uapi perfectly compatible then your patch set
>>> doesn't look like a solution, since as soon as another driver is involved
>>> which doesn't understand userspace memory fences it all falls apart. So
>>> works great for a quick demo with amd+amd sharing, but not much further.
>>> And I don't think it's feasible to just rev the entire ecosystem, since
>>> that kinda defeats the point of keeping uapi stable - if we rev everything
>>> we might as well also rev the uapi and make this a bit more incremental
>>> again :-)
>> Yes, unfortunately the uapi needs to stay compatible as well and yes that
>> means we need to deploy this to all drivers involved.
>>
>> We at least need to be able to provide a stack on new hardware with (for
>> example) Ubuntu 18.04 without replacing all the userspace components.
>>
>> What we can replace is the OpenGL stack and if necessary libdrm, but not
>> (for example) the X server and most likely not the DDX in some cases.
>>
>> The same applies with surfaceflinger and to some extend Wayland as well.
> So for perfect uapi compat for existing compositor I really don't think
> stuffing umf into the kernel is the right approach. Too many little
> corners that break:
>
> - the in/out fence mismatch every

On that part I need further explanation, cause I hoped that this is solved.

> - cross gpu with different userspace that doesn't understand umf and then
>    just ignores them

Well by stuffing umf into the kernel the whole thing becomes transparent 
for userspace.

So it won't matter that you have a new amdgpu stack which wants to use 
umf and an older i915 stack which knows nothing about umfs.

The kernel will block on command submission as soon as a buffer is used 
by i915. And as you said above as well that might cause performance 
trouble, but is not a correctness problem.

> - compositors which currently assume implicit sync finishes eventually,
>    and with umf that gets complicated at best

But for userspace compositors there is no difference between an umf 
which times out and a dma_fence which timesout? Or am I missing 
something here?

> - same with sync_file, the uapi atm does not have a concept of future
>    fence
>
> So you can kinda make this work, but it falls apart all over the place.
> And I also don't think smashing umf into all these old concepts helps us
> in any way to get towards a desktop which is umf-native.

Yeah, but having an umf compatibility memory management doesn't help 
either to get away from pre-pinned pages.

> My take is still that for backwards compat the simplest way is if a
> umf-native driver simply provides dma-fence backwards compat as an opt-in,
> which userspace chooses when it's necessary. There's really only two
> things you need for that to work:
>
> - a timeout of some sort on the dma_fence, which might or might not kill
>    the entire context. This is entirey up to how your userspace does or
>    does not implement stuff like arb robustness or vk_error_device_lost
>
> - pre-pinned memory management to block out the all the inversions. This
>    is a bit more nasty, but since we do have all the code for this already
>    it really shouldn't be too tricky to make that happen for the fancy new
>    umf world.

Well, exactly that's what we want to get away from.

> You do not need a kernel scheduler or anything like that at all, you can
> do full userspace direct submit to hw and all that fun. Maybe do a
> drm/sched frontend (and then your submit code does exactly what userspace
> would do too).
>
> Importantly the things you really don't need:
>
> - special hw support, even if the only mode your hw supports is with page
>    faults and all that: You can make sure all the pages are present
>    upfront, and then simply kill the entire context is a page fault
>    happens.

Well, that's only like 90% correct.

You can make that work without special hardware support, but from the 
experience with ROCm and very extensive talks with out hardware folks we 
have seriously problems making sure that the hw can't access freed up 
memory any more.

Except for the solution of never freeing up memory the only other 
possibility is to wait between 1 and 6 seconds until a shoot down made 
sure that there is really nobody accessing old page tables entries any more.

In the case of an user space queue with hardware scheduler support and 
HMM the memory would just still be referenced until userspace 
cooperatively inserted a barrier, but that again breaks some dma_fence 
assumptions as far as I can see.

> - special fw scheduler support: Once the memory management inversions are
>    taken care of with pre-pinning under dma_fences, then the only other
>    thing you need is a timeout for the dma_fence to signal. And maybe some
>    kind of guaranteed ordering if you want to use a dma_fence timeline
>    since that one can't go backwards.

Yeah, that not going backward thing turned out to be massively more 
tricky than I thought initially as well.

Alex, Marek and I worked quite hard on relaying those requirements to 
our internal teams, but I'm still not quite sure if that will turn out 
working or not.

> Trying to shoehorn umf into all the old concepts like implicit sync or
> sync_file which really don't support umf works for a demo, but imo just
> isn't solid enough for shipping everywhere.
>
> And long term I really don't think we ever want umf anywhere else than
> drm_syncobj, at least for a 100% umf-native stack.

Ok then I will concentrate on drm_syncobj for now.

What about in driver backward compatibility? E.g. blocking wait in the 
multimedia driver CS IOCTL until umf signals?

Thanks,
Christian.

>
> So maybe this all goes back to the old discussion with had, where you
> argued for the need for special fw and hw and all that to make the old
> dma_fence stuff work. Why is that needed? I still don't get that part ...
> -Daniel


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Tackling the indefinite/user DMA fence problem
@ 2022-05-17 10:28         ` Christian König
  0 siblings, 0 replies; 41+ messages in thread
From: Christian König @ 2022-05-17 10:28 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: jason, daniels, skhawaja, maad.aldabagh, sergemetral,
	sumit.semwal, gustavo, Felix.Kuehling, alexander.deucher,
	tzimmermann, tvrtko.ursulin, linux-media, dri-devel,
	linaro-mm-sig

Am 09.05.22 um 16:10 schrieb Daniel Vetter:
> On Mon, May 09, 2022 at 08:56:41AM +0200, Christian König wrote:
>> Am 04.05.22 um 12:08 schrieb Daniel Vetter:
>>> On Mon, May 02, 2022 at 06:37:07PM +0200, Christian König wrote:
>>>> Hello everyone,
>>>>
>>>> it's a well known problem that the DMA-buf subsystem mixed
>>>> synchronization and memory management requirements into the same
>>>> dma_fence and dma_resv objects. Because of this dma_fence objects need
>>>> to guarantee that they complete within a finite amount of time or
>>>> otherwise the system can easily deadlock.
>>>>
>>>> One of the few good things about this problem is that it is really good
>>>> understood by now.
>>>>
>>>> Daniel and others came up with some documentation:
>>>> https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_buf#indefinite-dma-fences
>>>>
>>>> And Jason did an excellent presentation about that problem on last years
>>>> LPC: https://lpc.events/event/11/contributions/1115/
>>>>
>>>> Based on that we had been able to reject new implementations of
>>>> infinite/user DMA fences and mitigate the effect of the few existing
>>>> ones.
>>>>
>>>> The still remaining down side is that we don't have a way of using user
>>>> fences as dependency in both the explicit (sync_file, drm_syncobj) as
>>>> well as the implicit (dma_resv) synchronization objects, resulting in
>>>> numerous problems and limitations for things like HMM, user queues
>>>> etc....
>>>>
>>>> This patch set here now tries to tackle this problem by untangling the
>>>> synchronization from the memory management. What it does *not* try to do
>>>> is to fix the existing kernel fences, because I think we now can all
>>>> agree on that this isn't really possible.
>>>>
>>>> To archive this goal what I do in this patch set is to add some parallel
>>>> infrastructure to cleanly separate normal kernel dma_fence objects from
>>>> indefinite/user fences:
>>>>
>>>> 1. It introduce a DMA_FENCE_FLAG_USER define (after renaming some
>>>> existing driver defines). To note that a certain dma_fence is an user
>>>> fence and *must* be ignore by memory management and never used as
>>>> dependency for normal none user dma_fence objects.
>>>>
>>>> 2. The dma_fence_array and dma_fence_chain containers are modified so
>>>> that they are marked as user fences whenever any of their contained
>>>> fences are an user fence.
>>>>
>>>> 3. The dma_resv object gets a new DMA_RESV_USAGE_USER flag which must be
>>>> used with indefinite/user fences and separates those into it's own
>>>> synchronization domain.
>>>>
>>>> 4. The existing dma_buf_poll_add_cb() function is modified so that
>>>> indefinite/user fences are included in the polling.
>>>>
>>>> 5. The sync_file synchronization object is modified so that we
>>>> essentially have two fence streams instead of just one.
>>>>
>>>> 6. The drm_syncobj is modified in a similar way. User fences are just
>>>> ignored unless the driver explicitly states support to wait for them.
>>>>
>>>> 7. The DRM subsystem gains a new DRIVER_USER_FENCE flag which drivers
>>>> can use to indicate the need for user fences. If user fences are used
>>>> the atomic mode setting starts to support user fences as IN/OUT fences.
>>>>
>>>> 8. Lockdep is used at various critical locations to ensure that nobody
>>>> ever tries to mix user fences with non user fences.
>>>>
>>>> The general approach is to just ignore user fences unless a driver
>>>> stated explicitely support for them.
>>>>
>>>> On top of all of this I've hacked amdgpu so that we add the resulting CS
>>>> fence only as kernel dependency to the dma_resv object and an additional
>>>> wrapped up with a dma_fence_array and a stub user fence.
>>>>
>>>> The result is that the newly added atomic modeset functions now
>>>> correctly wait for the user fence to complete before doing the flip. And
>>>> dependent CS don't pipeline any more, but rather block on the CPU before
>>>> submitting work.
>>>>
>>>> After tons of debugging and testing everything now seems to not go up in
>>>> flames immediately and even lockdep is happy with the annotations.
>>>>
>>>> I'm perfectly aware that this is probably by far the most controversial
>>>> patch set I've ever created and I really wish we wouldn't need it. But
>>>> we certainly have the requirement for this and I don't see much other
>>>> chance to get that working in an UAPI compatible way.
>>>>
>>>> Thoughts/comments?
>>> I think you need to type up the goal or exact problem statement you're
>>> trying to solve first. What you typed up is a solution along the lines of
>>> "try to stuff userspace memory fences into dma_fence and see how horrible
>>> it all is", and that's certainly an interesting experiment, but what are
>>> you trying to solve with it?
>> Well, good point. I explained to much how it works, but now why.
>>
>> In general I would describe the goal as: Providing a standard kernel
>> infrastructure for user fences.
> So on that goal the part I fully agree on is that drm_syncobj can (and
> should imo) be able to contain userspace memory fences. The uapi semantics
> and everything is already fully set up to support that, but maybe with
> reduced performance: Non-aware userspace (or when you don't trust the
> supplier of the umf) needs to block when looking up the fence, and the
> dma_fence returned will always be signalled already. But that's just a
> mild performance issue (and vk drivers paper over that already with
> threading) and not a correctness issue.

Exactly that, yes.

>>> Like if the issue is to enable opencl or whatever, then that's no problem
>>> (rocm on amdkfd is a thing, same maybe without the kfd part can be done
>>> anywhere else). If the goal is to enable userspace memory fences for vk,
>>> then we really don't need these everywhere, but really only in drm_syncobj
>>> (and maybe sync_file).
>> Yes, having an in kernel representation for vk user space fences is one of
>> the goals.
>>
>> And I was going back and forth if I should rather come up with a new
>> structure for this or use the existing dma_fence with a flag as well.
>>
>> I've decided to go down the later router because we have quite a lot of
>> existing functionality which can be re-used. But if you have a good argument
>> that it would be more defensive to come up with something completely new,
>> I'm perfectly fine with that as well.
> Yeah so stuffing that into dma_fence already freaks me out a bit. It is
> quite fundamentally a different thing, and it would be really nice to make
> that very apparent at the type level too.
>
> E.g. to make sure you never ever end up with an umf fence in mmu notifier
> invalidate callback. You can enforce that with runtime checks too, but imo
> compile time fail is better than runtime fail.

Well, I see arguments for both sides.

There is certainly the danger that we have an umf wait in the mmu 
notifier, but then lockdep will scream "bloody hell" immediately.

On the other hand when I make this a separate structure we need to 
maintain containers for both variants, especially a chain implementation 
for drm_syncobj. And here I don't have lockdep to keep an eye that 
nobody does anything strange.

It's only a gut feeling with no clear evidence for one side. If you 
insists on a separate structure I will go down that route.

>>> If the goal is specifically atomic kms, then there's an entire can of
>>> worms there that I really don't want to think about, but it exists: We
>>> have dma_fence as out-fences from atomic commit, and that's already
>>> massively broken since most drivers allocate some memory or at least take
>>> locks which can allocate memory in their commit path. Like i2c. Putting a
>>> userspace memory fence as in-fence in there makes that problem
>>> substantially worse, since at least in theory you're just not allowed to
>>> might_faul in atomic_commit_tail.
>> Yes, that's unfortunately one of the goals as well and yes I completely
>> agree on the can of worms. But I think I've solved that.
>>
>> What I do in the patch set is to enforce that the out fence is an user fence
>> when the driver supports user in fences as well.
>>
>> Since user fences doesn't have the memory management dependency drivers can
>> actually allocate memory or call I2C functions which takes locks which have
>> memory allocation dependencies.
>>
>> Or do I miss some other reason why you can't fault or allocate memory in
>> atomic_commit_tail? At least lockdep seems to be happy about that now.
> The problem is a bit that this breaks the uapi already. At least if the
> goal is to have this all be perfectly transparent for userspace - as you
> as you have multi-gpu setups going on at least.

Question here is why do you think there is an UAPI break? We currently 
wait in a work item already, so where exactly is the problem?

>>> If the goal is to keep the uapi perfectly compatible then your patch set
>>> doesn't look like a solution, since as soon as another driver is involved
>>> which doesn't understand userspace memory fences it all falls apart. So
>>> works great for a quick demo with amd+amd sharing, but not much further.
>>> And I don't think it's feasible to just rev the entire ecosystem, since
>>> that kinda defeats the point of keeping uapi stable - if we rev everything
>>> we might as well also rev the uapi and make this a bit more incremental
>>> again :-)
>> Yes, unfortunately the uapi needs to stay compatible as well and yes that
>> means we need to deploy this to all drivers involved.
>>
>> We at least need to be able to provide a stack on new hardware with (for
>> example) Ubuntu 18.04 without replacing all the userspace components.
>>
>> What we can replace is the OpenGL stack and if necessary libdrm, but not
>> (for example) the X server and most likely not the DDX in some cases.
>>
>> The same applies with surfaceflinger and to some extend Wayland as well.
> So for perfect uapi compat for existing compositor I really don't think
> stuffing umf into the kernel is the right approach. Too many little
> corners that break:
>
> - the in/out fence mismatch every

On that part I need further explanation, cause I hoped that this is solved.

> - cross gpu with different userspace that doesn't understand umf and then
>    just ignores them

Well by stuffing umf into the kernel the whole thing becomes transparent 
for userspace.

So it won't matter that you have a new amdgpu stack which wants to use 
umf and an older i915 stack which knows nothing about umfs.

The kernel will block on command submission as soon as a buffer is used 
by i915. And as you said above as well that might cause performance 
trouble, but is not a correctness problem.

> - compositors which currently assume implicit sync finishes eventually,
>    and with umf that gets complicated at best

But for userspace compositors there is no difference between an umf 
which times out and a dma_fence which timesout? Or am I missing 
something here?

> - same with sync_file, the uapi atm does not have a concept of future
>    fence
>
> So you can kinda make this work, but it falls apart all over the place.
> And I also don't think smashing umf into all these old concepts helps us
> in any way to get towards a desktop which is umf-native.

Yeah, but having an umf compatibility memory management doesn't help 
either to get away from pre-pinned pages.

> My take is still that for backwards compat the simplest way is if a
> umf-native driver simply provides dma-fence backwards compat as an opt-in,
> which userspace chooses when it's necessary. There's really only two
> things you need for that to work:
>
> - a timeout of some sort on the dma_fence, which might or might not kill
>    the entire context. This is entirey up to how your userspace does or
>    does not implement stuff like arb robustness or vk_error_device_lost
>
> - pre-pinned memory management to block out the all the inversions. This
>    is a bit more nasty, but since we do have all the code for this already
>    it really shouldn't be too tricky to make that happen for the fancy new
>    umf world.

Well, exactly that's what we want to get away from.

> You do not need a kernel scheduler or anything like that at all, you can
> do full userspace direct submit to hw and all that fun. Maybe do a
> drm/sched frontend (and then your submit code does exactly what userspace
> would do too).
>
> Importantly the things you really don't need:
>
> - special hw support, even if the only mode your hw supports is with page
>    faults and all that: You can make sure all the pages are present
>    upfront, and then simply kill the entire context is a page fault
>    happens.

Well, that's only like 90% correct.

You can make that work without special hardware support, but from the 
experience with ROCm and very extensive talks with out hardware folks we 
have seriously problems making sure that the hw can't access freed up 
memory any more.

Except for the solution of never freeing up memory the only other 
possibility is to wait between 1 and 6 seconds until a shoot down made 
sure that there is really nobody accessing old page tables entries any more.

In the case of an user space queue with hardware scheduler support and 
HMM the memory would just still be referenced until userspace 
cooperatively inserted a barrier, but that again breaks some dma_fence 
assumptions as far as I can see.

> - special fw scheduler support: Once the memory management inversions are
>    taken care of with pre-pinning under dma_fences, then the only other
>    thing you need is a timeout for the dma_fence to signal. And maybe some
>    kind of guaranteed ordering if you want to use a dma_fence timeline
>    since that one can't go backwards.

Yeah, that not going backward thing turned out to be massively more 
tricky than I thought initially as well.

Alex, Marek and I worked quite hard on relaying those requirements to 
our internal teams, but I'm still not quite sure if that will turn out 
working or not.

> Trying to shoehorn umf into all the old concepts like implicit sync or
> sync_file which really don't support umf works for a demo, but imo just
> isn't solid enough for shipping everywhere.
>
> And long term I really don't think we ever want umf anywhere else than
> drm_syncobj, at least for a 100% umf-native stack.

Ok then I will concentrate on drm_syncobj for now.

What about in driver backward compatibility? E.g. blocking wait in the 
multimedia driver CS IOCTL until umf signals?

Thanks,
Christian.

>
> So maybe this all goes back to the old discussion with had, where you
> argued for the need for special fw and hw and all that to make the old
> dma_fence stuff work. Why is that needed? I still don't get that part ...
> -Daniel


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Tackling the indefinite/user DMA fence problem
  2022-05-17 10:28         ` Christian König
@ 2022-05-25 13:05           ` Daniel Vetter
  -1 siblings, 0 replies; 41+ messages in thread
From: Daniel Vetter @ 2022-05-25 13:05 UTC (permalink / raw)
  To: Christian König
  Cc: tvrtko.ursulin, sergemetral, tzimmermann, gustavo,
	Felix.Kuehling, linux-media, dri-devel, linaro-mm-sig, jason,
	alexander.deucher, daniels, skhawaja, sumit.semwal,
	maad.aldabagh

Apologies I'm constantly behind on m-l discussions :-/

On Tue, May 17, 2022 at 12:28:17PM +0200, Christian König wrote:
> Am 09.05.22 um 16:10 schrieb Daniel Vetter:
> > On Mon, May 09, 2022 at 08:56:41AM +0200, Christian König wrote:
> > > Am 04.05.22 um 12:08 schrieb Daniel Vetter:
> > > > On Mon, May 02, 2022 at 06:37:07PM +0200, Christian König wrote:
> > > > > Hello everyone,
> > > > > 
> > > > > it's a well known problem that the DMA-buf subsystem mixed
> > > > > synchronization and memory management requirements into the same
> > > > > dma_fence and dma_resv objects. Because of this dma_fence objects need
> > > > > to guarantee that they complete within a finite amount of time or
> > > > > otherwise the system can easily deadlock.
> > > > > 
> > > > > One of the few good things about this problem is that it is really good
> > > > > understood by now.
> > > > > 
> > > > > Daniel and others came up with some documentation:
> > > > > https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_buf#indefinite-dma-fences
> > > > > 
> > > > > And Jason did an excellent presentation about that problem on last years
> > > > > LPC: https://lpc.events/event/11/contributions/1115/
> > > > > 
> > > > > Based on that we had been able to reject new implementations of
> > > > > infinite/user DMA fences and mitigate the effect of the few existing
> > > > > ones.
> > > > > 
> > > > > The still remaining down side is that we don't have a way of using user
> > > > > fences as dependency in both the explicit (sync_file, drm_syncobj) as
> > > > > well as the implicit (dma_resv) synchronization objects, resulting in
> > > > > numerous problems and limitations for things like HMM, user queues
> > > > > etc....
> > > > > 
> > > > > This patch set here now tries to tackle this problem by untangling the
> > > > > synchronization from the memory management. What it does *not* try to do
> > > > > is to fix the existing kernel fences, because I think we now can all
> > > > > agree on that this isn't really possible.
> > > > > 
> > > > > To archive this goal what I do in this patch set is to add some parallel
> > > > > infrastructure to cleanly separate normal kernel dma_fence objects from
> > > > > indefinite/user fences:
> > > > > 
> > > > > 1. It introduce a DMA_FENCE_FLAG_USER define (after renaming some
> > > > > existing driver defines). To note that a certain dma_fence is an user
> > > > > fence and *must* be ignore by memory management and never used as
> > > > > dependency for normal none user dma_fence objects.
> > > > > 
> > > > > 2. The dma_fence_array and dma_fence_chain containers are modified so
> > > > > that they are marked as user fences whenever any of their contained
> > > > > fences are an user fence.
> > > > > 
> > > > > 3. The dma_resv object gets a new DMA_RESV_USAGE_USER flag which must be
> > > > > used with indefinite/user fences and separates those into it's own
> > > > > synchronization domain.
> > > > > 
> > > > > 4. The existing dma_buf_poll_add_cb() function is modified so that
> > > > > indefinite/user fences are included in the polling.
> > > > > 
> > > > > 5. The sync_file synchronization object is modified so that we
> > > > > essentially have two fence streams instead of just one.
> > > > > 
> > > > > 6. The drm_syncobj is modified in a similar way. User fences are just
> > > > > ignored unless the driver explicitly states support to wait for them.
> > > > > 
> > > > > 7. The DRM subsystem gains a new DRIVER_USER_FENCE flag which drivers
> > > > > can use to indicate the need for user fences. If user fences are used
> > > > > the atomic mode setting starts to support user fences as IN/OUT fences.
> > > > > 
> > > > > 8. Lockdep is used at various critical locations to ensure that nobody
> > > > > ever tries to mix user fences with non user fences.
> > > > > 
> > > > > The general approach is to just ignore user fences unless a driver
> > > > > stated explicitely support for them.
> > > > > 
> > > > > On top of all of this I've hacked amdgpu so that we add the resulting CS
> > > > > fence only as kernel dependency to the dma_resv object and an additional
> > > > > wrapped up with a dma_fence_array and a stub user fence.
> > > > > 
> > > > > The result is that the newly added atomic modeset functions now
> > > > > correctly wait for the user fence to complete before doing the flip. And
> > > > > dependent CS don't pipeline any more, but rather block on the CPU before
> > > > > submitting work.
> > > > > 
> > > > > After tons of debugging and testing everything now seems to not go up in
> > > > > flames immediately and even lockdep is happy with the annotations.
> > > > > 
> > > > > I'm perfectly aware that this is probably by far the most controversial
> > > > > patch set I've ever created and I really wish we wouldn't need it. But
> > > > > we certainly have the requirement for this and I don't see much other
> > > > > chance to get that working in an UAPI compatible way.
> > > > > 
> > > > > Thoughts/comments?
> > > > I think you need to type up the goal or exact problem statement you're
> > > > trying to solve first. What you typed up is a solution along the lines of
> > > > "try to stuff userspace memory fences into dma_fence and see how horrible
> > > > it all is", and that's certainly an interesting experiment, but what are
> > > > you trying to solve with it?
> > > Well, good point. I explained to much how it works, but now why.
> > > 
> > > In general I would describe the goal as: Providing a standard kernel
> > > infrastructure for user fences.
> > So on that goal the part I fully agree on is that drm_syncobj can (and
> > should imo) be able to contain userspace memory fences. The uapi semantics
> > and everything is already fully set up to support that, but maybe with
> > reduced performance: Non-aware userspace (or when you don't trust the
> > supplier of the umf) needs to block when looking up the fence, and the
> > dma_fence returned will always be signalled already. But that's just a
> > mild performance issue (and vk drivers paper over that already with
> > threading) and not a correctness issue.
> 
> Exactly that, yes.
> 
> > > > Like if the issue is to enable opencl or whatever, then that's no problem
> > > > (rocm on amdkfd is a thing, same maybe without the kfd part can be done
> > > > anywhere else). If the goal is to enable userspace memory fences for vk,
> > > > then we really don't need these everywhere, but really only in drm_syncobj
> > > > (and maybe sync_file).
> > > Yes, having an in kernel representation for vk user space fences is one of
> > > the goals.
> > > 
> > > And I was going back and forth if I should rather come up with a new
> > > structure for this or use the existing dma_fence with a flag as well.
> > > 
> > > I've decided to go down the later router because we have quite a lot of
> > > existing functionality which can be re-used. But if you have a good argument
> > > that it would be more defensive to come up with something completely new,
> > > I'm perfectly fine with that as well.
> > Yeah so stuffing that into dma_fence already freaks me out a bit. It is
> > quite fundamentally a different thing, and it would be really nice to make
> > that very apparent at the type level too.
> > 
> > E.g. to make sure you never ever end up with an umf fence in mmu notifier
> > invalidate callback. You can enforce that with runtime checks too, but imo
> > compile time fail is better than runtime fail.
> 
> Well, I see arguments for both sides.
> 
> There is certainly the danger that we have an umf wait in the mmu notifier,
> but then lockdep will scream "bloody hell" immediately.
> 
> On the other hand when I make this a separate structure we need to maintain
> containers for both variants, especially a chain implementation for
> drm_syncobj. And here I don't have lockdep to keep an eye that nobody does
> anything strange.
> 
> It's only a gut feeling with no clear evidence for one side. If you insists
> on a separate structure I will go down that route.
> 
> > > > If the goal is specifically atomic kms, then there's an entire can of
> > > > worms there that I really don't want to think about, but it exists: We
> > > > have dma_fence as out-fences from atomic commit, and that's already
> > > > massively broken since most drivers allocate some memory or at least take
> > > > locks which can allocate memory in their commit path. Like i2c. Putting a
> > > > userspace memory fence as in-fence in there makes that problem
> > > > substantially worse, since at least in theory you're just not allowed to
> > > > might_faul in atomic_commit_tail.
> > > Yes, that's unfortunately one of the goals as well and yes I completely
> > > agree on the can of worms. But I think I've solved that.
> > > 
> > > What I do in the patch set is to enforce that the out fence is an user fence
> > > when the driver supports user in fences as well.
> > > 
> > > Since user fences doesn't have the memory management dependency drivers can
> > > actually allocate memory or call I2C functions which takes locks which have
> > > memory allocation dependencies.
> > > 
> > > Or do I miss some other reason why you can't fault or allocate memory in
> > > atomic_commit_tail? At least lockdep seems to be happy about that now.
> > The problem is a bit that this breaks the uapi already. At least if the
> > goal is to have this all be perfectly transparent for userspace - as you
> > as you have multi-gpu setups going on at least.
> 
> Question here is why do you think there is an UAPI break? We currently wait
> in a work item already, so where exactly is the problem?

It's a bit washy, but dma_fence and hence implicit sync is supposed to
finish in finite time. umf just doesn't.

Ofc in reality you can still flood your compositor and they're not very
robust, but with umf it's trivial to just hang your compositor forever and
nothing happens. There was a least quite a bit of screaming from
compositor folks 2 years ago when we've gone through the i915-gem fun of
"hey we redefined dma_fence to have indefinite fence semantcis, cool isn't
it". And this is kinda similar - you don't hang the kernel with all the
lockdep stuff, meaning ^C can get you out of any hang. But userspace is
still dead, and the user is still potentially rather unhappy.

So either we force a timeout in the kernel or in userspace or somewhere,
and at that point I'm not entirely sure this is really worth it.

Same holds for sync_file. drm_syncobj is different and does have the
entire "this might never signal" concept encoded in it's uapi already, so
that's fine.

> > > > If the goal is to keep the uapi perfectly compatible then your patch set
> > > > doesn't look like a solution, since as soon as another driver is involved
> > > > which doesn't understand userspace memory fences it all falls apart. So
> > > > works great for a quick demo with amd+amd sharing, but not much further.
> > > > And I don't think it's feasible to just rev the entire ecosystem, since
> > > > that kinda defeats the point of keeping uapi stable - if we rev everything
> > > > we might as well also rev the uapi and make this a bit more incremental
> > > > again :-)
> > > Yes, unfortunately the uapi needs to stay compatible as well and yes that
> > > means we need to deploy this to all drivers involved.
> > > 
> > > We at least need to be able to provide a stack on new hardware with (for
> > > example) Ubuntu 18.04 without replacing all the userspace components.
> > > 
> > > What we can replace is the OpenGL stack and if necessary libdrm, but not
> > > (for example) the X server and most likely not the DDX in some cases.
> > > 
> > > The same applies with surfaceflinger and to some extend Wayland as well.
> > So for perfect uapi compat for existing compositor I really don't think
> > stuffing umf into the kernel is the right approach. Too many little
> > corners that break:
> > 
> > - the in/out fence mismatch every
> 
> On that part I need further explanation, cause I hoped that this is solved.

The thing is even if you force the out fence to be an umf too if the
in-fence is, they're chained. So as soon as you have an unsignalled in
fence all subsequent out fences need to be forced to be umf semantics.
Possible, but definitely messy.

> > - cross gpu with different userspace that doesn't understand umf and then
> >    just ignores them
> 
> Well by stuffing umf into the kernel the whole thing becomes transparent for
> userspace.
> 
> So it won't matter that you have a new amdgpu stack which wants to use umf
> and an older i915 stack which knows nothing about umfs.
> 
> The kernel will block on command submission as soon as a buffer is used by
> i915. And as you said above as well that might cause performance trouble,
> but is not a correctness problem.

The problem is that you still have to adjust all drivers. Which yes
eventually we have to do, but if the compat code is just in the driver
which has switched to umf already, then the cost is a lot more localized.
And with the dma_fence trick below it's all good.

> > - compositors which currently assume implicit sync finishes eventually,
> >    and with umf that gets complicated at best
> 
> But for userspace compositors there is no difference between an umf which
> times out and a dma_fence which timesout? Or am I missing something here?
> 
> > - same with sync_file, the uapi atm does not have a concept of future
> >    fence
> > 
> > So you can kinda make this work, but it falls apart all over the place.
> > And I also don't think smashing umf into all these old concepts helps us
> > in any way to get towards a desktop which is umf-native.
> 
> Yeah, but having an umf compatibility memory management doesn't help either
> to get away from pre-pinned pages.

I think it's really two worlds:

a) dma_fence syncing (sync_file, implicit sync in dma_resv, drm_syncobj
with dma_fence) and pre-pinned everything memory

b) umf mode, which opens the door for page faults and clever tricks and
avoids the need to pre-pin everything

It sucks, but I really haven't found a way to mix these. And I feel like
that it's pretty dangerous to pretend that we can do it - it freaks me out
at least :-)

> > My take is still that for backwards compat the simplest way is if a
> > umf-native driver simply provides dma-fence backwards compat as an opt-in,
> > which userspace chooses when it's necessary. There's really only two
> > things you need for that to work:
> > 
> > - a timeout of some sort on the dma_fence, which might or might not kill
> >    the entire context. This is entirey up to how your userspace does or
> >    does not implement stuff like arb robustness or vk_error_device_lost
> > 
> > - pre-pinned memory management to block out the all the inversions. This
> >    is a bit more nasty, but since we do have all the code for this already
> >    it really shouldn't be too tricky to make that happen for the fancy new
> >    umf world.
> 
> Well, exactly that's what we want to get away from.
> 
> > You do not need a kernel scheduler or anything like that at all, you can
> > do full userspace direct submit to hw and all that fun. Maybe do a
> > drm/sched frontend (and then your submit code does exactly what userspace
> > would do too).
> > 
> > Importantly the things you really don't need:
> > 
> > - special hw support, even if the only mode your hw supports is with page
> >    faults and all that: You can make sure all the pages are present
> >    upfront, and then simply kill the entire context is a page fault
> >    happens.
> 
> Well, that's only like 90% correct.
> 
> You can make that work without special hardware support, but from the
> experience with ROCm and very extensive talks with out hardware folks we
> have seriously problems making sure that the hw can't access freed up memory
> any more.
> 
> Except for the solution of never freeing up memory the only other
> possibility is to wait between 1 and 6 seconds until a shoot down made sure
> that there is really nobody accessing old page tables entries any more.
> 
> In the case of an user space queue with hardware scheduler support and HMM
> the memory would just still be referenced until userspace cooperatively
> inserted a barrier, but that again breaks some dma_fence assumptions as far
> as I can see.

So two things:

- Don't use legacy semantics for compute. They're completely different
  worlds, and i915 plan is to have completely different mode between
  legacy and compute mode creation, because they really don't mesh, at
  context creation time. So kinda like amdgpu renderD* vs amdkfd, except
  all int he renderD* node.

  You _never_ want to run ROCm or level0 or cuda with legacy mode, because
  it sucks too much. The trouble is a bit vulkan since if you run it
  headless, you want to run in compute mode, but if want wayland/x11
  winsys support, then you need legacy support. But we've come up with
  some supremely nasty tricks for that too in the discussion at plumbers
  last year.

  The other trouble is interop. I think the best path here is the vk
  approach of "you only get drm_syncobj, no implicit sync ever". But that
  might not be workable for rocm/opencl/whatever. I'm not sure what's the
  best option there, if absolutely everything breaks I guess we could look
  into adding a umf implicit sync slot to dma_resv, but I really think we
  should exhaust all other options first before we step into that dragon's
  lair :-/

  If we do go with the umf implicit sync I think we should really do it as
  a timeline thing, i.e. you attach the umf container once, the other side
  gets it once, an after that point the only thing that keeps happening is
  that writers increment the syncpt seqno, and readers just sample that
  every time they use the buffer. So essentially:

  struct dma_buf_umf_implict_sync {
  	struct drm_syncobj *sync;
	atomic_t seqno;
  };

  Plus a few ioctls to increment the seqno (and receive the value) and
  read the current seqno. That still means interop isn't perfectly
  transparent, but maybe we could at least reuse some of the codepaths we
  have for drm_syncobj in both kernel and userspace.

  Maybe we could make it backwards compat but just blocking rendering
  until the umf signals if a buffer is shared with any importer that
  doesn't understand umf implicit sync.

  But that's all rather horrible imo.

- For memory management I think the right long term approach is to punt
  this all to userspace, for both legacy and compute mode, with exactly
  the same semantics as mmap()/munmap() on the cpu side. i915 calls this
  vm_bind (the rfc is floating around somewhere), but essentially i915.ko
  will execute vm_bind and vm_unbind directly, completely decoupled from
  any in-flight command submission (we just punch out the ptes and force a
  tlb flush and done). If userspace unmaps memory it still uses, it gets
  to keep all the pieces. This is how vk and all the new stuff works. It
  does mean there's more work for gl and that kind of stuff, but this
  really shouldn't be the kernel's problem.

  Aside: We're aiming for the same for gpu context. If userspace closes a
  gpu context we just preempt and destroy it right away, userspace gets to
  keep he pieces if they wanted to let it finish first.

  This is exactly like on the cpu side where userspace can create/destroy
  threads and mmap/munmap as it sees fit, and it just blows up if it gets
  it all wrong.

  Ofc if you're hw is busted and you can't do reasonable efficient tlb
  flushing (even with pipeling and batching) then you're screwed, so I'm
  hoping your 1-6 seconds of waiting isn't that. If it is then you kinda
  have an unfixable security bug :-/ But iirc amdkfd already has batched
  up tlb flushes as dma_fence in USAGE_OTHER now, so this should all work.

  Anything else, i.e. userpsace fails to insert the required cache flushes
  or barriers or anything else before it calls gpu_unamp (or vm_unbind or
  whatever you want to call it) is a userspace bug.

  Ofc if you preempt a context for memory eviction due to memory pressure
  then the kernel has to do all the cache flushing and everything. Same
  for evicting pages due to memory pressure in the hmm use case. If you're
  preemption takes a few seconds I think there's bigger problems no matter
  what :-)

  Btw on vm_bind it'd be great if you or someone from amd's rocm/kfd side
  can take a look and drop your input. I expect that we'll have the same
  discussion with nouveau/nvidia and maybe also some of the armsoc chips.
  So it would be great if we can collect upstream wisdom on this topic a
  bit.

Neither of these are a reason to stuff umf into implicit sync or sync file
imo.

> > - special fw scheduler support: Once the memory management inversions are
> >    taken care of with pre-pinning under dma_fences, then the only other
> >    thing you need is a timeout for the dma_fence to signal. And maybe some
> >    kind of guaranteed ordering if you want to use a dma_fence timeline
> >    since that one can't go backwards.
> 
> Yeah, that not going backward thing turned out to be massively more tricky
> than I thought initially as well.
> 
> Alex, Marek and I worked quite hard on relaying those requirements to our
> internal teams, but I'm still not quite sure if that will turn out working
> or not.

I really don't think you need to teach this to your fw scheduler, that
sounds backwards. At least for i915+guc we don't have any plans to inflict
this on our fw scheduler (otoh our fw scheduler sucks at resolving
dependencies, but that's a different topic - currently it just busy-spins
if a context is stuck, there's some plans to preempt contexts instead and
round-robin to the next one):

All our fw scheduler is doing is scheduling gpu context and round-robing
them with preemption if they use up their time slice before they go idle
again. Everything else is meaning and semantics we put on top either in
the kernel (for dma_fence based sync) or in userspace (for direct submit
userspace galore, which currently is done in the most hilarious way in a
downstream hack since the hw support isn't there yet). Hilarious here
means we submit a batchbuffer to the kernel which jumps back to it's
start, and then use conditional rendering CS commands to latch in the next
real userspace batch as userspace queues them up. There's a ring of these
to keep it all nicely pipelined and the gpu busy. It's terrible, but for
certain niche use-cases it gets the job done :-)

> > Trying to shoehorn umf into all the old concepts like implicit sync or
> > sync_file which really don't support umf works for a demo, but imo just
> > isn't solid enough for shipping everywhere.
> > 
> > And long term I really don't think we ever want umf anywhere else than
> > drm_syncobj, at least for a 100% umf-native stack.
> 
> Ok then I will concentrate on drm_syncobj for now.
> 
> What about in driver backward compatibility? E.g. blocking wait in the
> multimedia driver CS IOCTL until umf signals?

Yeah I think for that we can keep on using drm_syncobj exactly as-is, i.e.
if it's a umf you just block until signalled when the driver tries to look
up the dma_fence. Ofc userspace can be a bit more clever (and they all
should do so) by doing that waiting in a userspace thread, and eventually
we want them to get at the umf behind it all if they trust the other side,
so that the wait can happen on the gpu.

But that's kinda details, in general I think drm_syncobj should be ready
for cross-driver/process/whatever umf sharing.

Cheers, Daniel

> Thanks,
> Christian.
> 
> > 
> > So maybe this all goes back to the old discussion with had, where you
> > argued for the need for special fw and hw and all that to make the old
> > dma_fence stuff work. Why is that needed? I still don't get that part ...
> > -Daniel
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Tackling the indefinite/user DMA fence problem
@ 2022-05-25 13:05           ` Daniel Vetter
  0 siblings, 0 replies; 41+ messages in thread
From: Daniel Vetter @ 2022-05-25 13:05 UTC (permalink / raw)
  To: Christian König
  Cc: Daniel Vetter, jason, daniels, skhawaja, maad.aldabagh,
	sergemetral, sumit.semwal, gustavo, Felix.Kuehling,
	alexander.deucher, tzimmermann, tvrtko.ursulin, linux-media,
	dri-devel, linaro-mm-sig

Apologies I'm constantly behind on m-l discussions :-/

On Tue, May 17, 2022 at 12:28:17PM +0200, Christian König wrote:
> Am 09.05.22 um 16:10 schrieb Daniel Vetter:
> > On Mon, May 09, 2022 at 08:56:41AM +0200, Christian König wrote:
> > > Am 04.05.22 um 12:08 schrieb Daniel Vetter:
> > > > On Mon, May 02, 2022 at 06:37:07PM +0200, Christian König wrote:
> > > > > Hello everyone,
> > > > > 
> > > > > it's a well known problem that the DMA-buf subsystem mixed
> > > > > synchronization and memory management requirements into the same
> > > > > dma_fence and dma_resv objects. Because of this dma_fence objects need
> > > > > to guarantee that they complete within a finite amount of time or
> > > > > otherwise the system can easily deadlock.
> > > > > 
> > > > > One of the few good things about this problem is that it is really good
> > > > > understood by now.
> > > > > 
> > > > > Daniel and others came up with some documentation:
> > > > > https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_buf#indefinite-dma-fences
> > > > > 
> > > > > And Jason did an excellent presentation about that problem on last years
> > > > > LPC: https://lpc.events/event/11/contributions/1115/
> > > > > 
> > > > > Based on that we had been able to reject new implementations of
> > > > > infinite/user DMA fences and mitigate the effect of the few existing
> > > > > ones.
> > > > > 
> > > > > The still remaining down side is that we don't have a way of using user
> > > > > fences as dependency in both the explicit (sync_file, drm_syncobj) as
> > > > > well as the implicit (dma_resv) synchronization objects, resulting in
> > > > > numerous problems and limitations for things like HMM, user queues
> > > > > etc....
> > > > > 
> > > > > This patch set here now tries to tackle this problem by untangling the
> > > > > synchronization from the memory management. What it does *not* try to do
> > > > > is to fix the existing kernel fences, because I think we now can all
> > > > > agree on that this isn't really possible.
> > > > > 
> > > > > To archive this goal what I do in this patch set is to add some parallel
> > > > > infrastructure to cleanly separate normal kernel dma_fence objects from
> > > > > indefinite/user fences:
> > > > > 
> > > > > 1. It introduce a DMA_FENCE_FLAG_USER define (after renaming some
> > > > > existing driver defines). To note that a certain dma_fence is an user
> > > > > fence and *must* be ignore by memory management and never used as
> > > > > dependency for normal none user dma_fence objects.
> > > > > 
> > > > > 2. The dma_fence_array and dma_fence_chain containers are modified so
> > > > > that they are marked as user fences whenever any of their contained
> > > > > fences are an user fence.
> > > > > 
> > > > > 3. The dma_resv object gets a new DMA_RESV_USAGE_USER flag which must be
> > > > > used with indefinite/user fences and separates those into it's own
> > > > > synchronization domain.
> > > > > 
> > > > > 4. The existing dma_buf_poll_add_cb() function is modified so that
> > > > > indefinite/user fences are included in the polling.
> > > > > 
> > > > > 5. The sync_file synchronization object is modified so that we
> > > > > essentially have two fence streams instead of just one.
> > > > > 
> > > > > 6. The drm_syncobj is modified in a similar way. User fences are just
> > > > > ignored unless the driver explicitly states support to wait for them.
> > > > > 
> > > > > 7. The DRM subsystem gains a new DRIVER_USER_FENCE flag which drivers
> > > > > can use to indicate the need for user fences. If user fences are used
> > > > > the atomic mode setting starts to support user fences as IN/OUT fences.
> > > > > 
> > > > > 8. Lockdep is used at various critical locations to ensure that nobody
> > > > > ever tries to mix user fences with non user fences.
> > > > > 
> > > > > The general approach is to just ignore user fences unless a driver
> > > > > stated explicitely support for them.
> > > > > 
> > > > > On top of all of this I've hacked amdgpu so that we add the resulting CS
> > > > > fence only as kernel dependency to the dma_resv object and an additional
> > > > > wrapped up with a dma_fence_array and a stub user fence.
> > > > > 
> > > > > The result is that the newly added atomic modeset functions now
> > > > > correctly wait for the user fence to complete before doing the flip. And
> > > > > dependent CS don't pipeline any more, but rather block on the CPU before
> > > > > submitting work.
> > > > > 
> > > > > After tons of debugging and testing everything now seems to not go up in
> > > > > flames immediately and even lockdep is happy with the annotations.
> > > > > 
> > > > > I'm perfectly aware that this is probably by far the most controversial
> > > > > patch set I've ever created and I really wish we wouldn't need it. But
> > > > > we certainly have the requirement for this and I don't see much other
> > > > > chance to get that working in an UAPI compatible way.
> > > > > 
> > > > > Thoughts/comments?
> > > > I think you need to type up the goal or exact problem statement you're
> > > > trying to solve first. What you typed up is a solution along the lines of
> > > > "try to stuff userspace memory fences into dma_fence and see how horrible
> > > > it all is", and that's certainly an interesting experiment, but what are
> > > > you trying to solve with it?
> > > Well, good point. I explained to much how it works, but now why.
> > > 
> > > In general I would describe the goal as: Providing a standard kernel
> > > infrastructure for user fences.
> > So on that goal the part I fully agree on is that drm_syncobj can (and
> > should imo) be able to contain userspace memory fences. The uapi semantics
> > and everything is already fully set up to support that, but maybe with
> > reduced performance: Non-aware userspace (or when you don't trust the
> > supplier of the umf) needs to block when looking up the fence, and the
> > dma_fence returned will always be signalled already. But that's just a
> > mild performance issue (and vk drivers paper over that already with
> > threading) and not a correctness issue.
> 
> Exactly that, yes.
> 
> > > > Like if the issue is to enable opencl or whatever, then that's no problem
> > > > (rocm on amdkfd is a thing, same maybe without the kfd part can be done
> > > > anywhere else). If the goal is to enable userspace memory fences for vk,
> > > > then we really don't need these everywhere, but really only in drm_syncobj
> > > > (and maybe sync_file).
> > > Yes, having an in kernel representation for vk user space fences is one of
> > > the goals.
> > > 
> > > And I was going back and forth if I should rather come up with a new
> > > structure for this or use the existing dma_fence with a flag as well.
> > > 
> > > I've decided to go down the later router because we have quite a lot of
> > > existing functionality which can be re-used. But if you have a good argument
> > > that it would be more defensive to come up with something completely new,
> > > I'm perfectly fine with that as well.
> > Yeah so stuffing that into dma_fence already freaks me out a bit. It is
> > quite fundamentally a different thing, and it would be really nice to make
> > that very apparent at the type level too.
> > 
> > E.g. to make sure you never ever end up with an umf fence in mmu notifier
> > invalidate callback. You can enforce that with runtime checks too, but imo
> > compile time fail is better than runtime fail.
> 
> Well, I see arguments for both sides.
> 
> There is certainly the danger that we have an umf wait in the mmu notifier,
> but then lockdep will scream "bloody hell" immediately.
> 
> On the other hand when I make this a separate structure we need to maintain
> containers for both variants, especially a chain implementation for
> drm_syncobj. And here I don't have lockdep to keep an eye that nobody does
> anything strange.
> 
> It's only a gut feeling with no clear evidence for one side. If you insists
> on a separate structure I will go down that route.
> 
> > > > If the goal is specifically atomic kms, then there's an entire can of
> > > > worms there that I really don't want to think about, but it exists: We
> > > > have dma_fence as out-fences from atomic commit, and that's already
> > > > massively broken since most drivers allocate some memory or at least take
> > > > locks which can allocate memory in their commit path. Like i2c. Putting a
> > > > userspace memory fence as in-fence in there makes that problem
> > > > substantially worse, since at least in theory you're just not allowed to
> > > > might_faul in atomic_commit_tail.
> > > Yes, that's unfortunately one of the goals as well and yes I completely
> > > agree on the can of worms. But I think I've solved that.
> > > 
> > > What I do in the patch set is to enforce that the out fence is an user fence
> > > when the driver supports user in fences as well.
> > > 
> > > Since user fences doesn't have the memory management dependency drivers can
> > > actually allocate memory or call I2C functions which takes locks which have
> > > memory allocation dependencies.
> > > 
> > > Or do I miss some other reason why you can't fault or allocate memory in
> > > atomic_commit_tail? At least lockdep seems to be happy about that now.
> > The problem is a bit that this breaks the uapi already. At least if the
> > goal is to have this all be perfectly transparent for userspace - as you
> > as you have multi-gpu setups going on at least.
> 
> Question here is why do you think there is an UAPI break? We currently wait
> in a work item already, so where exactly is the problem?

It's a bit washy, but dma_fence and hence implicit sync is supposed to
finish in finite time. umf just doesn't.

Ofc in reality you can still flood your compositor and they're not very
robust, but with umf it's trivial to just hang your compositor forever and
nothing happens. There was a least quite a bit of screaming from
compositor folks 2 years ago when we've gone through the i915-gem fun of
"hey we redefined dma_fence to have indefinite fence semantcis, cool isn't
it". And this is kinda similar - you don't hang the kernel with all the
lockdep stuff, meaning ^C can get you out of any hang. But userspace is
still dead, and the user is still potentially rather unhappy.

So either we force a timeout in the kernel or in userspace or somewhere,
and at that point I'm not entirely sure this is really worth it.

Same holds for sync_file. drm_syncobj is different and does have the
entire "this might never signal" concept encoded in it's uapi already, so
that's fine.

> > > > If the goal is to keep the uapi perfectly compatible then your patch set
> > > > doesn't look like a solution, since as soon as another driver is involved
> > > > which doesn't understand userspace memory fences it all falls apart. So
> > > > works great for a quick demo with amd+amd sharing, but not much further.
> > > > And I don't think it's feasible to just rev the entire ecosystem, since
> > > > that kinda defeats the point of keeping uapi stable - if we rev everything
> > > > we might as well also rev the uapi and make this a bit more incremental
> > > > again :-)
> > > Yes, unfortunately the uapi needs to stay compatible as well and yes that
> > > means we need to deploy this to all drivers involved.
> > > 
> > > We at least need to be able to provide a stack on new hardware with (for
> > > example) Ubuntu 18.04 without replacing all the userspace components.
> > > 
> > > What we can replace is the OpenGL stack and if necessary libdrm, but not
> > > (for example) the X server and most likely not the DDX in some cases.
> > > 
> > > The same applies with surfaceflinger and to some extend Wayland as well.
> > So for perfect uapi compat for existing compositor I really don't think
> > stuffing umf into the kernel is the right approach. Too many little
> > corners that break:
> > 
> > - the in/out fence mismatch every
> 
> On that part I need further explanation, cause I hoped that this is solved.

The thing is even if you force the out fence to be an umf too if the
in-fence is, they're chained. So as soon as you have an unsignalled in
fence all subsequent out fences need to be forced to be umf semantics.
Possible, but definitely messy.

> > - cross gpu with different userspace that doesn't understand umf and then
> >    just ignores them
> 
> Well by stuffing umf into the kernel the whole thing becomes transparent for
> userspace.
> 
> So it won't matter that you have a new amdgpu stack which wants to use umf
> and an older i915 stack which knows nothing about umfs.
> 
> The kernel will block on command submission as soon as a buffer is used by
> i915. And as you said above as well that might cause performance trouble,
> but is not a correctness problem.

The problem is that you still have to adjust all drivers. Which yes
eventually we have to do, but if the compat code is just in the driver
which has switched to umf already, then the cost is a lot more localized.
And with the dma_fence trick below it's all good.

> > - compositors which currently assume implicit sync finishes eventually,
> >    and with umf that gets complicated at best
> 
> But for userspace compositors there is no difference between an umf which
> times out and a dma_fence which timesout? Or am I missing something here?
> 
> > - same with sync_file, the uapi atm does not have a concept of future
> >    fence
> > 
> > So you can kinda make this work, but it falls apart all over the place.
> > And I also don't think smashing umf into all these old concepts helps us
> > in any way to get towards a desktop which is umf-native.
> 
> Yeah, but having an umf compatibility memory management doesn't help either
> to get away from pre-pinned pages.

I think it's really two worlds:

a) dma_fence syncing (sync_file, implicit sync in dma_resv, drm_syncobj
with dma_fence) and pre-pinned everything memory

b) umf mode, which opens the door for page faults and clever tricks and
avoids the need to pre-pin everything

It sucks, but I really haven't found a way to mix these. And I feel like
that it's pretty dangerous to pretend that we can do it - it freaks me out
at least :-)

> > My take is still that for backwards compat the simplest way is if a
> > umf-native driver simply provides dma-fence backwards compat as an opt-in,
> > which userspace chooses when it's necessary. There's really only two
> > things you need for that to work:
> > 
> > - a timeout of some sort on the dma_fence, which might or might not kill
> >    the entire context. This is entirey up to how your userspace does or
> >    does not implement stuff like arb robustness or vk_error_device_lost
> > 
> > - pre-pinned memory management to block out the all the inversions. This
> >    is a bit more nasty, but since we do have all the code for this already
> >    it really shouldn't be too tricky to make that happen for the fancy new
> >    umf world.
> 
> Well, exactly that's what we want to get away from.
> 
> > You do not need a kernel scheduler or anything like that at all, you can
> > do full userspace direct submit to hw and all that fun. Maybe do a
> > drm/sched frontend (and then your submit code does exactly what userspace
> > would do too).
> > 
> > Importantly the things you really don't need:
> > 
> > - special hw support, even if the only mode your hw supports is with page
> >    faults and all that: You can make sure all the pages are present
> >    upfront, and then simply kill the entire context is a page fault
> >    happens.
> 
> Well, that's only like 90% correct.
> 
> You can make that work without special hardware support, but from the
> experience with ROCm and very extensive talks with out hardware folks we
> have seriously problems making sure that the hw can't access freed up memory
> any more.
> 
> Except for the solution of never freeing up memory the only other
> possibility is to wait between 1 and 6 seconds until a shoot down made sure
> that there is really nobody accessing old page tables entries any more.
> 
> In the case of an user space queue with hardware scheduler support and HMM
> the memory would just still be referenced until userspace cooperatively
> inserted a barrier, but that again breaks some dma_fence assumptions as far
> as I can see.

So two things:

- Don't use legacy semantics for compute. They're completely different
  worlds, and i915 plan is to have completely different mode between
  legacy and compute mode creation, because they really don't mesh, at
  context creation time. So kinda like amdgpu renderD* vs amdkfd, except
  all int he renderD* node.

  You _never_ want to run ROCm or level0 or cuda with legacy mode, because
  it sucks too much. The trouble is a bit vulkan since if you run it
  headless, you want to run in compute mode, but if want wayland/x11
  winsys support, then you need legacy support. But we've come up with
  some supremely nasty tricks for that too in the discussion at plumbers
  last year.

  The other trouble is interop. I think the best path here is the vk
  approach of "you only get drm_syncobj, no implicit sync ever". But that
  might not be workable for rocm/opencl/whatever. I'm not sure what's the
  best option there, if absolutely everything breaks I guess we could look
  into adding a umf implicit sync slot to dma_resv, but I really think we
  should exhaust all other options first before we step into that dragon's
  lair :-/

  If we do go with the umf implicit sync I think we should really do it as
  a timeline thing, i.e. you attach the umf container once, the other side
  gets it once, an after that point the only thing that keeps happening is
  that writers increment the syncpt seqno, and readers just sample that
  every time they use the buffer. So essentially:

  struct dma_buf_umf_implict_sync {
  	struct drm_syncobj *sync;
	atomic_t seqno;
  };

  Plus a few ioctls to increment the seqno (and receive the value) and
  read the current seqno. That still means interop isn't perfectly
  transparent, but maybe we could at least reuse some of the codepaths we
  have for drm_syncobj in both kernel and userspace.

  Maybe we could make it backwards compat but just blocking rendering
  until the umf signals if a buffer is shared with any importer that
  doesn't understand umf implicit sync.

  But that's all rather horrible imo.

- For memory management I think the right long term approach is to punt
  this all to userspace, for both legacy and compute mode, with exactly
  the same semantics as mmap()/munmap() on the cpu side. i915 calls this
  vm_bind (the rfc is floating around somewhere), but essentially i915.ko
  will execute vm_bind and vm_unbind directly, completely decoupled from
  any in-flight command submission (we just punch out the ptes and force a
  tlb flush and done). If userspace unmaps memory it still uses, it gets
  to keep all the pieces. This is how vk and all the new stuff works. It
  does mean there's more work for gl and that kind of stuff, but this
  really shouldn't be the kernel's problem.

  Aside: We're aiming for the same for gpu context. If userspace closes a
  gpu context we just preempt and destroy it right away, userspace gets to
  keep he pieces if they wanted to let it finish first.

  This is exactly like on the cpu side where userspace can create/destroy
  threads and mmap/munmap as it sees fit, and it just blows up if it gets
  it all wrong.

  Ofc if you're hw is busted and you can't do reasonable efficient tlb
  flushing (even with pipeling and batching) then you're screwed, so I'm
  hoping your 1-6 seconds of waiting isn't that. If it is then you kinda
  have an unfixable security bug :-/ But iirc amdkfd already has batched
  up tlb flushes as dma_fence in USAGE_OTHER now, so this should all work.

  Anything else, i.e. userpsace fails to insert the required cache flushes
  or barriers or anything else before it calls gpu_unamp (or vm_unbind or
  whatever you want to call it) is a userspace bug.

  Ofc if you preempt a context for memory eviction due to memory pressure
  then the kernel has to do all the cache flushing and everything. Same
  for evicting pages due to memory pressure in the hmm use case. If you're
  preemption takes a few seconds I think there's bigger problems no matter
  what :-)

  Btw on vm_bind it'd be great if you or someone from amd's rocm/kfd side
  can take a look and drop your input. I expect that we'll have the same
  discussion with nouveau/nvidia and maybe also some of the armsoc chips.
  So it would be great if we can collect upstream wisdom on this topic a
  bit.

Neither of these are a reason to stuff umf into implicit sync or sync file
imo.

> > - special fw scheduler support: Once the memory management inversions are
> >    taken care of with pre-pinning under dma_fences, then the only other
> >    thing you need is a timeout for the dma_fence to signal. And maybe some
> >    kind of guaranteed ordering if you want to use a dma_fence timeline
> >    since that one can't go backwards.
> 
> Yeah, that not going backward thing turned out to be massively more tricky
> than I thought initially as well.
> 
> Alex, Marek and I worked quite hard on relaying those requirements to our
> internal teams, but I'm still not quite sure if that will turn out working
> or not.

I really don't think you need to teach this to your fw scheduler, that
sounds backwards. At least for i915+guc we don't have any plans to inflict
this on our fw scheduler (otoh our fw scheduler sucks at resolving
dependencies, but that's a different topic - currently it just busy-spins
if a context is stuck, there's some plans to preempt contexts instead and
round-robin to the next one):

All our fw scheduler is doing is scheduling gpu context and round-robing
them with preemption if they use up their time slice before they go idle
again. Everything else is meaning and semantics we put on top either in
the kernel (for dma_fence based sync) or in userspace (for direct submit
userspace galore, which currently is done in the most hilarious way in a
downstream hack since the hw support isn't there yet). Hilarious here
means we submit a batchbuffer to the kernel which jumps back to it's
start, and then use conditional rendering CS commands to latch in the next
real userspace batch as userspace queues them up. There's a ring of these
to keep it all nicely pipelined and the gpu busy. It's terrible, but for
certain niche use-cases it gets the job done :-)

> > Trying to shoehorn umf into all the old concepts like implicit sync or
> > sync_file which really don't support umf works for a demo, but imo just
> > isn't solid enough for shipping everywhere.
> > 
> > And long term I really don't think we ever want umf anywhere else than
> > drm_syncobj, at least for a 100% umf-native stack.
> 
> Ok then I will concentrate on drm_syncobj for now.
> 
> What about in driver backward compatibility? E.g. blocking wait in the
> multimedia driver CS IOCTL until umf signals?

Yeah I think for that we can keep on using drm_syncobj exactly as-is, i.e.
if it's a umf you just block until signalled when the driver tries to look
up the dma_fence. Ofc userspace can be a bit more clever (and they all
should do so) by doing that waiting in a userspace thread, and eventually
we want them to get at the umf behind it all if they trust the other side,
so that the wait can happen on the gpu.

But that's kinda details, in general I think drm_syncobj should be ready
for cross-driver/process/whatever umf sharing.

Cheers, Daniel

> Thanks,
> Christian.
> 
> > 
> > So maybe this all goes back to the old discussion with had, where you
> > argued for the need for special fw and hw and all that to make the old
> > dma_fence stuff work. Why is that needed? I still don't get that part ...
> > -Daniel
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Tackling the indefinite/user DMA fence problem
  2022-05-25 13:05           ` Daniel Vetter
@ 2022-05-25 13:28             ` Michel Dänzer
  -1 siblings, 0 replies; 41+ messages in thread
From: Michel Dänzer @ 2022-05-25 13:28 UTC (permalink / raw)
  To: Daniel Vetter, Christian König
  Cc: tvrtko.ursulin, sergemetral, gustavo, Felix.Kuehling, dri-devel,
	linaro-mm-sig, maad.aldabagh, tzimmermann, alexander.deucher,
	daniels, skhawaja, sumit.semwal, jason, linux-media

On 2022-05-25 15:05, Daniel Vetter wrote:
> On Tue, May 17, 2022 at 12:28:17PM +0200, Christian König wrote:
>> Am 09.05.22 um 16:10 schrieb Daniel Vetter:
>>> On Mon, May 09, 2022 at 08:56:41AM +0200, Christian König wrote:
>>>> Am 04.05.22 um 12:08 schrieb Daniel Vetter:
>>>>>
>>>>> If the goal is specifically atomic kms, then there's an entire can of
>>>>> worms there that I really don't want to think about, but it exists: We
>>>>> have dma_fence as out-fences from atomic commit, and that's already
>>>>> massively broken since most drivers allocate some memory or at least take
>>>>> locks which can allocate memory in their commit path. Like i2c. Putting a
>>>>> userspace memory fence as in-fence in there makes that problem
>>>>> substantially worse, since at least in theory you're just not allowed to
>>>>> might_faul in atomic_commit_tail.
>>>> Yes, that's unfortunately one of the goals as well and yes I completely
>>>> agree on the can of worms. But I think I've solved that.
>>>>
>>>> What I do in the patch set is to enforce that the out fence is an user fence
>>>> when the driver supports user in fences as well.
>>>>
>>>> Since user fences doesn't have the memory management dependency drivers can
>>>> actually allocate memory or call I2C functions which takes locks which have
>>>> memory allocation dependencies.
>>>>
>>>> Or do I miss some other reason why you can't fault or allocate memory in
>>>> atomic_commit_tail? At least lockdep seems to be happy about that now.
>>> The problem is a bit that this breaks the uapi already. At least if the
>>> goal is to have this all be perfectly transparent for userspace - as you
>>> as you have multi-gpu setups going on at least.
>>
>> Question here is why do you think there is an UAPI break? We currently wait
>> in a work item already, so where exactly is the problem?
> 
> It's a bit washy, but dma_fence and hence implicit sync is supposed to
> finish in finite time. umf just doesn't.
> 
> Ofc in reality you can still flood your compositor and they're not very
> robust, but with umf it's trivial to just hang your compositor forever and
> nothing happens.

You can add that to the list of reasons why compositors need to stop using buffers with unsignaled fences. There's plenty of other reasons there already (the big one being that otherwise slow clients can slow down the compositor, even if the compositor uses a high priority context and the HW supports preemption).


-- 
Earthling Michel Dänzer            |                  https://redhat.com
Libre software enthusiast          |         Mesa and Xwayland developer

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Tackling the indefinite/user DMA fence problem
@ 2022-05-25 13:28             ` Michel Dänzer
  0 siblings, 0 replies; 41+ messages in thread
From: Michel Dänzer @ 2022-05-25 13:28 UTC (permalink / raw)
  To: Daniel Vetter, Christian König
  Cc: tvrtko.ursulin, sergemetral, tzimmermann, gustavo,
	Felix.Kuehling, linux-media, dri-devel, linaro-mm-sig, jason,
	alexander.deucher, daniels, skhawaja, sumit.semwal,
	maad.aldabagh

On 2022-05-25 15:05, Daniel Vetter wrote:
> On Tue, May 17, 2022 at 12:28:17PM +0200, Christian König wrote:
>> Am 09.05.22 um 16:10 schrieb Daniel Vetter:
>>> On Mon, May 09, 2022 at 08:56:41AM +0200, Christian König wrote:
>>>> Am 04.05.22 um 12:08 schrieb Daniel Vetter:
>>>>>
>>>>> If the goal is specifically atomic kms, then there's an entire can of
>>>>> worms there that I really don't want to think about, but it exists: We
>>>>> have dma_fence as out-fences from atomic commit, and that's already
>>>>> massively broken since most drivers allocate some memory or at least take
>>>>> locks which can allocate memory in their commit path. Like i2c. Putting a
>>>>> userspace memory fence as in-fence in there makes that problem
>>>>> substantially worse, since at least in theory you're just not allowed to
>>>>> might_faul in atomic_commit_tail.
>>>> Yes, that's unfortunately one of the goals as well and yes I completely
>>>> agree on the can of worms. But I think I've solved that.
>>>>
>>>> What I do in the patch set is to enforce that the out fence is an user fence
>>>> when the driver supports user in fences as well.
>>>>
>>>> Since user fences doesn't have the memory management dependency drivers can
>>>> actually allocate memory or call I2C functions which takes locks which have
>>>> memory allocation dependencies.
>>>>
>>>> Or do I miss some other reason why you can't fault or allocate memory in
>>>> atomic_commit_tail? At least lockdep seems to be happy about that now.
>>> The problem is a bit that this breaks the uapi already. At least if the
>>> goal is to have this all be perfectly transparent for userspace - as you
>>> as you have multi-gpu setups going on at least.
>>
>> Question here is why do you think there is an UAPI break? We currently wait
>> in a work item already, so where exactly is the problem?
> 
> It's a bit washy, but dma_fence and hence implicit sync is supposed to
> finish in finite time. umf just doesn't.
> 
> Ofc in reality you can still flood your compositor and they're not very
> robust, but with umf it's trivial to just hang your compositor forever and
> nothing happens.

You can add that to the list of reasons why compositors need to stop using buffers with unsignaled fences. There's plenty of other reasons there already (the big one being that otherwise slow clients can slow down the compositor, even if the compositor uses a high priority context and the HW supports preemption).


-- 
Earthling Michel Dänzer            |                  https://redhat.com
Libre software enthusiast          |         Mesa and Xwayland developer

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Tackling the indefinite/user DMA fence problem
  2022-05-25 13:28             ` Michel Dänzer
@ 2022-05-25 13:51               ` Daniel Vetter
  -1 siblings, 0 replies; 41+ messages in thread
From: Daniel Vetter @ 2022-05-25 13:51 UTC (permalink / raw)
  To: Michel Dänzer
  Cc: Daniel Vetter, Christian König, tvrtko.ursulin, sergemetral,
	tzimmermann, gustavo, Felix.Kuehling, linux-media, dri-devel,
	linaro-mm-sig, jason, alexander.deucher, daniels, skhawaja,
	sumit.semwal, maad.aldabagh

On Wed, May 25, 2022 at 03:28:41PM +0200, Michel Dänzer wrote:
> On 2022-05-25 15:05, Daniel Vetter wrote:
> > On Tue, May 17, 2022 at 12:28:17PM +0200, Christian König wrote:
> >> Am 09.05.22 um 16:10 schrieb Daniel Vetter:
> >>> On Mon, May 09, 2022 at 08:56:41AM +0200, Christian König wrote:
> >>>> Am 04.05.22 um 12:08 schrieb Daniel Vetter:
> >>>>>
> >>>>> If the goal is specifically atomic kms, then there's an entire can of
> >>>>> worms there that I really don't want to think about, but it exists: We
> >>>>> have dma_fence as out-fences from atomic commit, and that's already
> >>>>> massively broken since most drivers allocate some memory or at least take
> >>>>> locks which can allocate memory in their commit path. Like i2c. Putting a
> >>>>> userspace memory fence as in-fence in there makes that problem
> >>>>> substantially worse, since at least in theory you're just not allowed to
> >>>>> might_faul in atomic_commit_tail.
> >>>> Yes, that's unfortunately one of the goals as well and yes I completely
> >>>> agree on the can of worms. But I think I've solved that.
> >>>>
> >>>> What I do in the patch set is to enforce that the out fence is an user fence
> >>>> when the driver supports user in fences as well.
> >>>>
> >>>> Since user fences doesn't have the memory management dependency drivers can
> >>>> actually allocate memory or call I2C functions which takes locks which have
> >>>> memory allocation dependencies.
> >>>>
> >>>> Or do I miss some other reason why you can't fault or allocate memory in
> >>>> atomic_commit_tail? At least lockdep seems to be happy about that now.
> >>> The problem is a bit that this breaks the uapi already. At least if the
> >>> goal is to have this all be perfectly transparent for userspace - as you
> >>> as you have multi-gpu setups going on at least.
> >>
> >> Question here is why do you think there is an UAPI break? We currently wait
> >> in a work item already, so where exactly is the problem?
> > 
> > It's a bit washy, but dma_fence and hence implicit sync is supposed to
> > finish in finite time. umf just doesn't.
> > 
> > Ofc in reality you can still flood your compositor and they're not very
> > robust, but with umf it's trivial to just hang your compositor forever and
> > nothing happens.
> 
> You can add that to the list of reasons why compositors need to stop
> using buffers with unsignaled fences. There's plenty of other reasons
> there already (the big one being that otherwise slow clients can slow
> down the compositor, even if the compositor uses a high priority context
> and the HW supports preemption).

Yeah that's tbh another reason why I think we shouldn't do umf as a
transparent thing - compositors need to get better anyway, so we might as
well take this as a chance to do this right.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Tackling the indefinite/user DMA fence problem
@ 2022-05-25 13:51               ` Daniel Vetter
  0 siblings, 0 replies; 41+ messages in thread
From: Daniel Vetter @ 2022-05-25 13:51 UTC (permalink / raw)
  To: Michel Dänzer
  Cc: tvrtko.ursulin, skhawaja, gustavo, Felix.Kuehling, dri-devel,
	sumit.semwal, linaro-mm-sig, maad.aldabagh, tzimmermann,
	alexander.deucher, daniels, linux-media, Christian König,
	jason, sergemetral

On Wed, May 25, 2022 at 03:28:41PM +0200, Michel Dänzer wrote:
> On 2022-05-25 15:05, Daniel Vetter wrote:
> > On Tue, May 17, 2022 at 12:28:17PM +0200, Christian König wrote:
> >> Am 09.05.22 um 16:10 schrieb Daniel Vetter:
> >>> On Mon, May 09, 2022 at 08:56:41AM +0200, Christian König wrote:
> >>>> Am 04.05.22 um 12:08 schrieb Daniel Vetter:
> >>>>>
> >>>>> If the goal is specifically atomic kms, then there's an entire can of
> >>>>> worms there that I really don't want to think about, but it exists: We
> >>>>> have dma_fence as out-fences from atomic commit, and that's already
> >>>>> massively broken since most drivers allocate some memory or at least take
> >>>>> locks which can allocate memory in their commit path. Like i2c. Putting a
> >>>>> userspace memory fence as in-fence in there makes that problem
> >>>>> substantially worse, since at least in theory you're just not allowed to
> >>>>> might_faul in atomic_commit_tail.
> >>>> Yes, that's unfortunately one of the goals as well and yes I completely
> >>>> agree on the can of worms. But I think I've solved that.
> >>>>
> >>>> What I do in the patch set is to enforce that the out fence is an user fence
> >>>> when the driver supports user in fences as well.
> >>>>
> >>>> Since user fences doesn't have the memory management dependency drivers can
> >>>> actually allocate memory or call I2C functions which takes locks which have
> >>>> memory allocation dependencies.
> >>>>
> >>>> Or do I miss some other reason why you can't fault or allocate memory in
> >>>> atomic_commit_tail? At least lockdep seems to be happy about that now.
> >>> The problem is a bit that this breaks the uapi already. At least if the
> >>> goal is to have this all be perfectly transparent for userspace - as you
> >>> as you have multi-gpu setups going on at least.
> >>
> >> Question here is why do you think there is an UAPI break? We currently wait
> >> in a work item already, so where exactly is the problem?
> > 
> > It's a bit washy, but dma_fence and hence implicit sync is supposed to
> > finish in finite time. umf just doesn't.
> > 
> > Ofc in reality you can still flood your compositor and they're not very
> > robust, but with umf it's trivial to just hang your compositor forever and
> > nothing happens.
> 
> You can add that to the list of reasons why compositors need to stop
> using buffers with unsignaled fences. There's plenty of other reasons
> there already (the big one being that otherwise slow clients can slow
> down the compositor, even if the compositor uses a high priority context
> and the HW supports preemption).

Yeah that's tbh another reason why I think we shouldn't do umf as a
transparent thing - compositors need to get better anyway, so we might as
well take this as a chance to do this right.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Tackling the indefinite/user DMA fence problem
  2022-05-25 13:51               ` Daniel Vetter
@ 2022-05-25 14:07                 ` Simon Ser
  -1 siblings, 0 replies; 41+ messages in thread
From: Simon Ser @ 2022-05-25 14:07 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: tvrtko.ursulin, daniels, Michel Dänzer, gustavo,
	Felix.Kuehling, linux-media, dri-devel, Christian König,
	linaro-mm-sig, sergemetral, tzimmermann, alexander.deucher,
	skhawaja, sumit.semwal, jason, maad.aldabagh

On Wednesday, May 25th, 2022 at 15:51, Daniel Vetter <daniel@ffwll.ch> wrote:

> > > Ofc in reality you can still flood your compositor and they're not very
> > > robust, but with umf it's trivial to just hang your compositor forever and
> > > nothing happens.
> >
> > You can add that to the list of reasons why compositors need to stop
> > using buffers with unsignaled fences. There's plenty of other reasons
> > there already (the big one being that otherwise slow clients can slow
> > down the compositor, even if the compositor uses a high priority context
> > and the HW supports preemption).
>
>
> Yeah that's tbh another reason why I think we shouldn't do umf as a
> transparent thing - compositors need to get better anyway, so we might as
> well take this as a chance to do this right.

As a compositor dev, I agree -- we should definitely be smarter about
this. Note, it would help a lot to have a good way to integrate the
waits into a poll(2) event loop.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Tackling the indefinite/user DMA fence problem
@ 2022-05-25 14:07                 ` Simon Ser
  0 siblings, 0 replies; 41+ messages in thread
From: Simon Ser @ 2022-05-25 14:07 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Michel Dänzer, tvrtko.ursulin, skhawaja, gustavo,
	Felix.Kuehling, dri-devel, sumit.semwal, linaro-mm-sig,
	maad.aldabagh, tzimmermann, alexander.deucher, daniels,
	linux-media, Christian König, jason, sergemetral

On Wednesday, May 25th, 2022 at 15:51, Daniel Vetter <daniel@ffwll.ch> wrote:

> > > Ofc in reality you can still flood your compositor and they're not very
> > > robust, but with umf it's trivial to just hang your compositor forever and
> > > nothing happens.
> >
> > You can add that to the list of reasons why compositors need to stop
> > using buffers with unsignaled fences. There's plenty of other reasons
> > there already (the big one being that otherwise slow clients can slow
> > down the compositor, even if the compositor uses a high priority context
> > and the HW supports preemption).
>
>
> Yeah that's tbh another reason why I think we shouldn't do umf as a
> transparent thing - compositors need to get better anyway, so we might as
> well take this as a chance to do this right.

As a compositor dev, I agree -- we should definitely be smarter about
this. Note, it would help a lot to have a good way to integrate the
waits into a poll(2) event loop.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Tackling the indefinite/user DMA fence problem
  2022-05-25 14:07                 ` Simon Ser
@ 2022-05-25 14:15                   ` Daniel Stone
  -1 siblings, 0 replies; 41+ messages in thread
From: Daniel Stone @ 2022-05-25 14:15 UTC (permalink / raw)
  To: Simon Ser
  Cc: Daniel Vetter, tvrtko.ursulin, daniels, Michel Dänzer,
	gustavo, Felix.Kuehling, linux-media, dri-devel,
	Christian König, linaro-mm-sig, sergemetral, tzimmermann,
	alexander.deucher, skhawaja, sumit.semwal, jason, maad.aldabagh

Hi,

On Wed, 25 May 2022 at 15:07, Simon Ser <contact@emersion.fr> wrote:
> On Wednesday, May 25th, 2022 at 15:51, Daniel Vetter <daniel@ffwll.ch> wrote:
> > > You can add that to the list of reasons why compositors need to stop
> > > using buffers with unsignaled fences. There's plenty of other reasons
> > > there already (the big one being that otherwise slow clients can slow
> > > down the compositor, even if the compositor uses a high priority context
> > > and the HW supports preemption).
> >
> >
> > Yeah that's tbh another reason why I think we shouldn't do umf as a
> > transparent thing - compositors need to get better anyway, so we might as
> > well take this as a chance to do this right.
>
> As a compositor dev, I agree -- we should definitely be smarter about
> this. Note, it would help a lot to have a good way to integrate the
> waits into a poll(2) event loop.

The same holds for Weston. We're currently working through a bunch of
internal infrastructure to be able to handle this. Mutter (aka GNOME)
is also really well-placed to be able to do this.

Having pollable waits would be really useful, but I don't think it's
essential. In my strawman I'm just waking up at the usual
just-before-repaint point and checking; if it doesn't make it for this
frame then we'll wait for the next frame. If someone submits buffers
which take 4 repaint periods to clear then we'll have 3 'unnecessary'
wakeups, but given the GPU is already slammed then it's not an
efficiency problem I don't think. (I don't know if all the other
compositor people share this view.)

Cheers,
Daniel

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Tackling the indefinite/user DMA fence problem
@ 2022-05-25 14:15                   ` Daniel Stone
  0 siblings, 0 replies; 41+ messages in thread
From: Daniel Stone @ 2022-05-25 14:15 UTC (permalink / raw)
  To: Simon Ser
  Cc: tvrtko.ursulin, daniels, tzimmermann, Michel Dänzer,
	gustavo, Felix.Kuehling, dri-devel, sumit.semwal, linaro-mm-sig,
	sergemetral, maad.aldabagh, alexander.deucher, skhawaja,
	Christian König, jason, linux-media

Hi,

On Wed, 25 May 2022 at 15:07, Simon Ser <contact@emersion.fr> wrote:
> On Wednesday, May 25th, 2022 at 15:51, Daniel Vetter <daniel@ffwll.ch> wrote:
> > > You can add that to the list of reasons why compositors need to stop
> > > using buffers with unsignaled fences. There's plenty of other reasons
> > > there already (the big one being that otherwise slow clients can slow
> > > down the compositor, even if the compositor uses a high priority context
> > > and the HW supports preemption).
> >
> >
> > Yeah that's tbh another reason why I think we shouldn't do umf as a
> > transparent thing - compositors need to get better anyway, so we might as
> > well take this as a chance to do this right.
>
> As a compositor dev, I agree -- we should definitely be smarter about
> this. Note, it would help a lot to have a good way to integrate the
> waits into a poll(2) event loop.

The same holds for Weston. We're currently working through a bunch of
internal infrastructure to be able to handle this. Mutter (aka GNOME)
is also really well-placed to be able to do this.

Having pollable waits would be really useful, but I don't think it's
essential. In my strawman I'm just waking up at the usual
just-before-repaint point and checking; if it doesn't make it for this
frame then we'll wait for the next frame. If someone submits buffers
which take 4 repaint periods to clear then we'll have 3 'unnecessary'
wakeups, but given the GPU is already slammed then it's not an
efficiency problem I don't think. (I don't know if all the other
compositor people share this view.)

Cheers,
Daniel

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Tackling the indefinite/user DMA fence problem
  2022-05-25 14:15                   ` Daniel Stone
@ 2022-05-25 14:22                     ` Christian König
  -1 siblings, 0 replies; 41+ messages in thread
From: Christian König @ 2022-05-25 14:22 UTC (permalink / raw)
  To: Daniel Stone, Simon Ser
  Cc: Daniel Vetter, tvrtko.ursulin, daniels, Michel Dänzer,
	gustavo, Felix.Kuehling, linux-media, dri-devel, linaro-mm-sig,
	sergemetral, tzimmermann, alexander.deucher, skhawaja,
	sumit.semwal, jason, maad.aldabagh

Am 25.05.22 um 16:15 schrieb Daniel Stone:
> Hi,
>
> On Wed, 25 May 2022 at 15:07, Simon Ser <contact@emersion.fr> wrote:
>> On Wednesday, May 25th, 2022 at 15:51, Daniel Vetter <daniel@ffwll.ch> wrote:
>>>> You can add that to the list of reasons why compositors need to stop
>>>> using buffers with unsignaled fences. There's plenty of other reasons
>>>> there already (the big one being that otherwise slow clients can slow
>>>> down the compositor, even if the compositor uses a high priority context
>>>> and the HW supports preemption).
>>>
>>> Yeah that's tbh another reason why I think we shouldn't do umf as a
>>> transparent thing - compositors need to get better anyway, so we might as
>>> well take this as a chance to do this right.
>> As a compositor dev, I agree -- we should definitely be smarter about
>> this. Note, it would help a lot to have a good way to integrate the
>> waits into a poll(2) event loop.
> The same holds for Weston. We're currently working through a bunch of
> internal infrastructure to be able to handle this. Mutter (aka GNOME)
> is also really well-placed to be able to do this.
>
> Having pollable waits would be really useful, but I don't think it's
> essential. In my strawman I'm just waking up at the usual
> just-before-repaint point and checking; if it doesn't make it for this
> frame then we'll wait for the next frame. If someone submits buffers
> which take 4 repaint periods to clear then we'll have 3 'unnecessary'
> wakeups, but given the GPU is already slammed then it's not an
> efficiency problem I don't think. (I don't know if all the other
> compositor people share this view.)

Oh, well you should already have pollable waits, at least on DMA-buf.

If you are saying you want to have that for drm_syncobj timelines as 
well then that's certainly something we could do.

Regards,
Christian.

>
> Cheers,
> Daniel


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Tackling the indefinite/user DMA fence problem
@ 2022-05-25 14:22                     ` Christian König
  0 siblings, 0 replies; 41+ messages in thread
From: Christian König @ 2022-05-25 14:22 UTC (permalink / raw)
  To: Daniel Stone, Simon Ser
  Cc: tvrtko.ursulin, daniels, tzimmermann, Michel Dänzer,
	gustavo, Felix.Kuehling, dri-devel, linaro-mm-sig, sergemetral,
	maad.aldabagh, alexander.deucher, skhawaja, sumit.semwal, jason,
	linux-media

Am 25.05.22 um 16:15 schrieb Daniel Stone:
> Hi,
>
> On Wed, 25 May 2022 at 15:07, Simon Ser <contact@emersion.fr> wrote:
>> On Wednesday, May 25th, 2022 at 15:51, Daniel Vetter <daniel@ffwll.ch> wrote:
>>>> You can add that to the list of reasons why compositors need to stop
>>>> using buffers with unsignaled fences. There's plenty of other reasons
>>>> there already (the big one being that otherwise slow clients can slow
>>>> down the compositor, even if the compositor uses a high priority context
>>>> and the HW supports preemption).
>>>
>>> Yeah that's tbh another reason why I think we shouldn't do umf as a
>>> transparent thing - compositors need to get better anyway, so we might as
>>> well take this as a chance to do this right.
>> As a compositor dev, I agree -- we should definitely be smarter about
>> this. Note, it would help a lot to have a good way to integrate the
>> waits into a poll(2) event loop.
> The same holds for Weston. We're currently working through a bunch of
> internal infrastructure to be able to handle this. Mutter (aka GNOME)
> is also really well-placed to be able to do this.
>
> Having pollable waits would be really useful, but I don't think it's
> essential. In my strawman I'm just waking up at the usual
> just-before-repaint point and checking; if it doesn't make it for this
> frame then we'll wait for the next frame. If someone submits buffers
> which take 4 repaint periods to clear then we'll have 3 'unnecessary'
> wakeups, but given the GPU is already slammed then it's not an
> efficiency problem I don't think. (I don't know if all the other
> compositor people share this view.)

Oh, well you should already have pollable waits, at least on DMA-buf.

If you are saying you want to have that for drm_syncobj timelines as 
well then that's certainly something we could do.

Regards,
Christian.

>
> Cheers,
> Daniel


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Tackling the indefinite/user DMA fence problem
  2022-05-25 14:22                     ` Christian König
@ 2022-05-25 14:25                       ` Daniel Vetter
  -1 siblings, 0 replies; 41+ messages in thread
From: Daniel Vetter @ 2022-05-25 14:25 UTC (permalink / raw)
  To: Christian König
  Cc: Daniel Stone, Simon Ser, Daniel Vetter, tvrtko.ursulin, daniels,
	Michel Dänzer, gustavo, Felix.Kuehling, linux-media,
	dri-devel, linaro-mm-sig, sergemetral, tzimmermann,
	alexander.deucher, skhawaja, sumit.semwal, jason, maad.aldabagh

On Wed, May 25, 2022 at 04:22:48PM +0200, Christian König wrote:
> Am 25.05.22 um 16:15 schrieb Daniel Stone:
> > Hi,
> > 
> > On Wed, 25 May 2022 at 15:07, Simon Ser <contact@emersion.fr> wrote:
> > > On Wednesday, May 25th, 2022 at 15:51, Daniel Vetter <daniel@ffwll.ch> wrote:
> > > > > You can add that to the list of reasons why compositors need to stop
> > > > > using buffers with unsignaled fences. There's plenty of other reasons
> > > > > there already (the big one being that otherwise slow clients can slow
> > > > > down the compositor, even if the compositor uses a high priority context
> > > > > and the HW supports preemption).
> > > > 
> > > > Yeah that's tbh another reason why I think we shouldn't do umf as a
> > > > transparent thing - compositors need to get better anyway, so we might as
> > > > well take this as a chance to do this right.
> > > As a compositor dev, I agree -- we should definitely be smarter about
> > > this. Note, it would help a lot to have a good way to integrate the
> > > waits into a poll(2) event loop.
> > The same holds for Weston. We're currently working through a bunch of
> > internal infrastructure to be able to handle this. Mutter (aka GNOME)
> > is also really well-placed to be able to do this.
> > 
> > Having pollable waits would be really useful, but I don't think it's
> > essential. In my strawman I'm just waking up at the usual
> > just-before-repaint point and checking; if it doesn't make it for this
> > frame then we'll wait for the next frame. If someone submits buffers
> > which take 4 repaint periods to clear then we'll have 3 'unnecessary'
> > wakeups, but given the GPU is already slammed then it's not an
> > efficiency problem I don't think. (I don't know if all the other
> > compositor people share this view.)
> 
> Oh, well you should already have pollable waits, at least on DMA-buf.
> 
> If you are saying you want to have that for drm_syncobj timelines as well
> then that's certainly something we could do.

The pollable wait isn't enough because the client could keep issuing more
rendering. So what you actually want is
- dma-buf export to get a sync_file (should land any minute now)
- poll on that (it works, as it doesn on dma-buf)
- tell egl to no do implicit sync or (maybe better) just have a vk
  renderer in your compositor

But interim you can poll on the dma-buf and it'll dtrt for at least
cooperative clients already.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Tackling the indefinite/user DMA fence problem
@ 2022-05-25 14:25                       ` Daniel Vetter
  0 siblings, 0 replies; 41+ messages in thread
From: Daniel Vetter @ 2022-05-25 14:25 UTC (permalink / raw)
  To: Christian König
  Cc: tvrtko.ursulin, daniels, tzimmermann, Michel Dänzer,
	Felix.Kuehling, linaro-mm-sig, sergemetral, maad.aldabagh,
	dri-devel, gustavo, alexander.deucher, skhawaja, sumit.semwal,
	jason, linux-media

On Wed, May 25, 2022 at 04:22:48PM +0200, Christian König wrote:
> Am 25.05.22 um 16:15 schrieb Daniel Stone:
> > Hi,
> > 
> > On Wed, 25 May 2022 at 15:07, Simon Ser <contact@emersion.fr> wrote:
> > > On Wednesday, May 25th, 2022 at 15:51, Daniel Vetter <daniel@ffwll.ch> wrote:
> > > > > You can add that to the list of reasons why compositors need to stop
> > > > > using buffers with unsignaled fences. There's plenty of other reasons
> > > > > there already (the big one being that otherwise slow clients can slow
> > > > > down the compositor, even if the compositor uses a high priority context
> > > > > and the HW supports preemption).
> > > > 
> > > > Yeah that's tbh another reason why I think we shouldn't do umf as a
> > > > transparent thing - compositors need to get better anyway, so we might as
> > > > well take this as a chance to do this right.
> > > As a compositor dev, I agree -- we should definitely be smarter about
> > > this. Note, it would help a lot to have a good way to integrate the
> > > waits into a poll(2) event loop.
> > The same holds for Weston. We're currently working through a bunch of
> > internal infrastructure to be able to handle this. Mutter (aka GNOME)
> > is also really well-placed to be able to do this.
> > 
> > Having pollable waits would be really useful, but I don't think it's
> > essential. In my strawman I'm just waking up at the usual
> > just-before-repaint point and checking; if it doesn't make it for this
> > frame then we'll wait for the next frame. If someone submits buffers
> > which take 4 repaint periods to clear then we'll have 3 'unnecessary'
> > wakeups, but given the GPU is already slammed then it's not an
> > efficiency problem I don't think. (I don't know if all the other
> > compositor people share this view.)
> 
> Oh, well you should already have pollable waits, at least on DMA-buf.
> 
> If you are saying you want to have that for drm_syncobj timelines as well
> then that's certainly something we could do.

The pollable wait isn't enough because the client could keep issuing more
rendering. So what you actually want is
- dma-buf export to get a sync_file (should land any minute now)
- poll on that (it works, as it doesn on dma-buf)
- tell egl to no do implicit sync or (maybe better) just have a vk
  renderer in your compositor

But interim you can poll on the dma-buf and it'll dtrt for at least
cooperative clients already.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2022-05-25 14:25 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-02 16:37 Tackling the indefinite/user DMA fence problem Christian König
2022-05-02 16:37 ` [PATCH 01/15] dma-buf: rename DMA_FENCE_FLAG_USER_BITS to _DEVICE Christian König
2022-05-02 16:37 ` [PATCH 02/15] dma-buf: introduce user fence support Christian König
2022-05-04  7:53   ` Tvrtko Ursulin
2022-05-04  9:15     ` Christian König
2022-05-02 16:37 ` [PATCH 03/15] dma-buf: add user fence support to dma_fence_array Christian König
2022-05-02 16:37 ` [PATCH 04/15] dma-buf: add user fence support to dma_fence_chain Christian König
2022-05-02 16:37 ` [PATCH 05/15] dma-buf: add user fence support to dma_resv Christian König
2022-05-02 16:37 ` [PATCH 06/15] dma-buf: add user fence support to dma_fence_merge() Christian König
2022-05-02 16:37 ` [PATCH 07/15] dma-buf: add user fence utility functions Christian König
2022-05-02 16:37 ` [PATCH 08/15] dma-buf: add support for polling on user fences Christian König
2022-05-02 16:37 ` [PATCH 09/15] dma-buf/sync_file: add user fence support Christian König
2022-05-02 16:37 ` [PATCH 10/15] drm: add user fence support for atomic out fences Christian König
2022-05-02 16:37 ` [PATCH 11/15] drm: add user fence support for atomic in fences Christian König
2022-05-02 16:37 ` [PATCH 12/15] drm: add user fence support to drm_gem_plane_helper_prepare_fb Christian König
2022-05-02 16:37 ` [PATCH 13/15] drm: add user fence support to drm_syncobj Christian König
2022-05-02 16:37 ` [PATCH 14/15] drm/amdgpu: switch DM to atomic fence helpers Christian König
2022-05-02 16:37   ` Christian König
2022-05-02 16:37 ` [PATCH 15/15] drm/amdgpu: user fence proof of concept Christian König
2022-05-04 10:08 ` Tackling the indefinite/user DMA fence problem Daniel Vetter
2022-05-04 10:08   ` Daniel Vetter
2022-05-09  6:56   ` Christian König
2022-05-09  6:56     ` Christian König
2022-05-09 14:10     ` Daniel Vetter
2022-05-09 14:10       ` Daniel Vetter
2022-05-17 10:28       ` Christian König
2022-05-17 10:28         ` Christian König
2022-05-25 13:05         ` Daniel Vetter
2022-05-25 13:05           ` Daniel Vetter
2022-05-25 13:28           ` Michel Dänzer
2022-05-25 13:28             ` Michel Dänzer
2022-05-25 13:51             ` Daniel Vetter
2022-05-25 13:51               ` Daniel Vetter
2022-05-25 14:07               ` Simon Ser
2022-05-25 14:07                 ` Simon Ser
2022-05-25 14:15                 ` Daniel Stone
2022-05-25 14:15                   ` Daniel Stone
2022-05-25 14:22                   ` Christian König
2022-05-25 14:22                     ` Christian König
2022-05-25 14:25                     ` Daniel Vetter
2022-05-25 14:25                       ` Daniel Vetter

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.