* [PATCH 0/4] dma-buf: Add an API for exporting sync files (v8)
@ 2021-05-20 19:00 Jason Ekstrand
  2021-05-20 19:00 ` [PATCH 1/4] dma-buf: add dma_fence_array_for_each (v2) Jason Ekstrand
                   ` (4 more replies)
  0 siblings, 5 replies; 39+ messages in thread
From: Jason Ekstrand @ 2021-05-20 19:00 UTC (permalink / raw)
  To: dri-devel
  Cc: Daniel Stone, Michel Dänzer, Jason Ekstrand, Dave Airlie,
	Christian König

This is mostly a re-send of v8, only with a fourth patch which contains the
sync file import ioctl that I had in the original series.  I've not updated
the IGT tests yet for sync file import.  This resend is mostly intended to
aid in discussions around implicit sync in general.  I'll write up some IGT
tests if there is serious interest in patch 4.  I can also update the Mesa
MR to use it for Vulkan.

-------------------------------------------

Modern userspace APIs like Vulkan are built on an explicit
synchronization model.  This doesn't always play nicely with the
implicit synchronization used in the kernel and assumed by X11 and
Wayland.  The client -> compositor half of the synchronization isn't too
bad, at least on intel, because we can control whether or not i915
synchronizes on the buffer and whether or not it's considered written.

The harder part is the compositor -> client synchronization when we get
the buffer back from the compositor.  We're required to be able to
provide the client with a VkSemaphore and VkFence representing the point
in time where the window system (compositor and/or display) finished
using the buffer.  With current APIs, it's very hard to do this in such
a way that we don't get confused by the Vulkan driver's access of the
buffer.  In particular, once we tell the kernel that we're rendering to
the buffer again, any CPU waits on the buffer or GPU dependencies will
wait on some of the client rendering and not just the compositor.

This new IOCTL solves this problem by allowing us to get a snapshot of
the implicit synchronization state of a given dma-buf in the form of a
sync file.  It's effectively the same as a poll() or I915_GEM_WAIT only,
instead of CPU waiting directly, it encapsulates the wait operation, at
the current moment in time, in a sync_file so we can check/wait on it
later.  As long as the Vulkan driver does the sync_file export from the
dma-buf before we re-introduce it for rendering, it will only contain
fences from the compositor or display.  This allows us to accurately turn
it into a VkFence or VkSemaphore without any over-synchronization.
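
To make the intended usage concrete, here is a rough userspace sketch.
It is not part of this series: acquire_wsi_semaphore and its
import_semaphore_fd parameter are made up for illustration, the dma-buf
uAPI names come from patch 3, and the Vulkan side assumes
VK_KHR_external_semaphore_fd with SYNC_FD handle support.

  #include <sys/ioctl.h>
  #include <linux/dma-buf.h>
  #include <vulkan/vulkan.h>

  static VkResult
  acquire_wsi_semaphore(VkDevice dev,
                        PFN_vkImportSemaphoreFdKHR import_semaphore_fd,
                        int dmabuf_fd, VkSemaphore sem)
  {
          struct dma_buf_sync_file arg = {
                  /* We are about to write, so snapshot all fences. */
                  .flags = DMA_BUF_SYNC_WRITE,
          };

          /* Do this *before* the buffer is re-used for rendering so the
           * snapshot only contains compositor/display fences. */
          if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &arg))
                  return VK_ERROR_OUT_OF_HOST_MEMORY;

          const VkImportSemaphoreFdInfoKHR import = {
                  .sType = VK_STRUCTURE_TYPE_IMPORT_SEMAPHORE_FD_INFO_KHR,
                  .semaphore = sem,
                  .flags = VK_SEMAPHORE_IMPORT_TEMPORARY_BIT,
                  .handleType = VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_SYNC_FD_BIT,
                  .fd = arg.fd,  /* ownership passes to the driver */
          };

          return import_semaphore_fd(dev, &import);
  }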

Mesa MR: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4037
IGT tests: https://lists.freedesktop.org/archives/igt-dev/2021-March/029825.html

Cc: Christian König <christian.koenig@amd.com>
Cc: Michel Dänzer <michel@daenzer.net>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Cc: Daniel Stone <daniels@collabora.com>

Christian König (1):
  dma-buf: add dma_fence_array_for_each (v2)

Jason Ekstrand (3):
  dma-buf: add dma_resv_get_singleton_rcu (v4)
  dma-buf: Add an API for exporting sync files (v9)
  RFC: dma-buf: Add an API for importing sync files (v6)

 drivers/dma-buf/dma-buf.c         |  94 +++++++++++++++++++++++
 drivers/dma-buf/dma-fence-array.c |  27 +++++++
 drivers/dma-buf/dma-resv.c        | 122 ++++++++++++++++++++++++++++++
 include/linux/dma-fence-array.h   |  17 +++++
 include/linux/dma-resv.h          |   3 +
 include/uapi/linux/dma-buf.h      |  25 ++++++
 6 files changed, 288 insertions(+)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 1/4] dma-buf: add dma_fence_array_for_each (v2)
  2021-05-20 19:00 [PATCH 0/4] dma-buf: Add an API for exporting sync files (v8) Jason Ekstrand
@ 2021-05-20 19:00 ` Jason Ekstrand
  2021-05-21  7:51   ` Christian König
  2021-05-20 19:00 ` [PATCH 2/4] dma-buf: add dma_resv_get_singleton_rcu (v4) Jason Ekstrand
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 39+ messages in thread
From: Jason Ekstrand @ 2021-05-20 19:00 UTC (permalink / raw)
  To: dri-devel; +Cc: Christian König, Jason Ekstrand

From: Christian König <ckoenig.leichtzumerken@gmail.com>

Add a helper to iterate over all fences in a dma_fence_array object.

v2 (Jason Ekstrand)
 - Return NULL from dma_fence_array_first if head == NULL.  This matches
   the iterator behavior of dma_fence_chain_for_each in that it iterates
   zero times if head == NULL.
 - Return NULL from dma_fence_array_next if index >= array->num_fences.
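
For reference, a minimal usage sketch follows; count_unsignaled is a
made-up helper used only to show the iterator, not part of this patch:

  #include <linux/dma-fence-array.h>

  /* Illustrative only: count the unsignaled fences behind @fence, which
   * may or may not be a dma_fence_array. */
  static unsigned int count_unsignaled(struct dma_fence *fence)
  {
          struct dma_fence *f;
          unsigned int i, count = 0;

          dma_fence_array_for_each(f, i, fence) {
                  if (!dma_fence_is_signaled(f))
                          count++;
          }

          return count;
  }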

Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
---
 drivers/dma-buf/dma-fence-array.c | 27 +++++++++++++++++++++++++++
 include/linux/dma-fence-array.h   | 17 +++++++++++++++++
 2 files changed, 44 insertions(+)

diff --git a/drivers/dma-buf/dma-fence-array.c b/drivers/dma-buf/dma-fence-array.c
index d3fbd950be944..2ac1afc697d0f 100644
--- a/drivers/dma-buf/dma-fence-array.c
+++ b/drivers/dma-buf/dma-fence-array.c
@@ -201,3 +201,30 @@ bool dma_fence_match_context(struct dma_fence *fence, u64 context)
 	return true;
 }
 EXPORT_SYMBOL(dma_fence_match_context);
+
+struct dma_fence *dma_fence_array_first(struct dma_fence *head)
+{
+	struct dma_fence_array *array;
+
+	if (!head)
+		return NULL;
+
+	array = to_dma_fence_array(head);
+	if (!array)
+		return head;
+
+	return array->fences[0];
+}
+EXPORT_SYMBOL(dma_fence_array_first);
+
+struct dma_fence *dma_fence_array_next(struct dma_fence *head,
+				       unsigned int index)
+{
+	struct dma_fence_array *array = to_dma_fence_array(head);
+
+	if (!array || index >= array->num_fences)
+		return NULL;
+
+	return array->fences[index];
+}
+EXPORT_SYMBOL(dma_fence_array_next);
diff --git a/include/linux/dma-fence-array.h b/include/linux/dma-fence-array.h
index 303dd712220fd..588ac8089dd61 100644
--- a/include/linux/dma-fence-array.h
+++ b/include/linux/dma-fence-array.h
@@ -74,6 +74,19 @@ to_dma_fence_array(struct dma_fence *fence)
 	return container_of(fence, struct dma_fence_array, base);
 }
 
+/**
+ * dma_fence_array_for_each - iterate over all fences in array
+ * @fence: current fence
+ * @index: index into the array
+ * @head: potential dma_fence_array object
+ *
+ * Test if @head is a dma_fence_array object and if yes iterate over all fences
+ * in the array. If not, just iterate over the fence in @head itself.
+ */
+#define dma_fence_array_for_each(fence, index, head)			\
+	for (index = 0, fence = dma_fence_array_first(head); fence;	\
+	     ++(index), fence = dma_fence_array_next(head, index))
+
 struct dma_fence_array *dma_fence_array_create(int num_fences,
 					       struct dma_fence **fences,
 					       u64 context, unsigned seqno,
@@ -81,4 +94,8 @@ struct dma_fence_array *dma_fence_array_create(int num_fences,
 
 bool dma_fence_match_context(struct dma_fence *fence, u64 context);
 
+struct dma_fence *dma_fence_array_first(struct dma_fence *head);
+struct dma_fence *dma_fence_array_next(struct dma_fence *head,
+				       unsigned int index);
+
 #endif /* __LINUX_DMA_FENCE_ARRAY_H */
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 2/4] dma-buf: add dma_resv_get_singleton_rcu (v4)
  2021-05-20 19:00 [PATCH 0/4] dma-buf: Add an API for exporting sync files (v8) Jason Ekstrand
  2021-05-20 19:00 ` [PATCH 1/4] dma-buf: add dma_fence_array_for_each (v2) Jason Ekstrand
@ 2021-05-20 19:00 ` Jason Ekstrand
  2021-05-21 17:48   ` Daniel Vetter
                     ` (2 more replies)
  2021-05-20 19:00 ` [PATCH 3/4] dma-buf: Add an API for exporting sync files (v9) Jason Ekstrand
                   ` (2 subsequent siblings)
  4 siblings, 3 replies; 39+ messages in thread
From: Jason Ekstrand @ 2021-05-20 19:00 UTC (permalink / raw)
  To: dri-devel; +Cc: Jason Ekstrand

Add a helper function to get a single fence representing
all fences in a dma_resv object.

This fence is either the only one in the object or all of the object's
unsignaled fences flattened out into a dma_fence_array.
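
As an illustration, a caller might use it roughly like this (sketch
only, assuming some surrounding function with a struct dma_resv *obj in
hand; error handling trimmed):

          struct dma_fence *fence = NULL;
          int ret;

          /* Flatten everything currently in obj (no extra fence) into a
           * single fence we can keep around. */
          ret = dma_resv_get_singleton_rcu(obj, NULL, &fence);
          if (ret)
                  return ret;

          if (fence) {
                  /* e.g. wait on it, or wrap it in a sync_file */
                  dma_fence_wait(fence, false);
                  dma_fence_put(fence);
          }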

v2 (Jason Ekstrand):
 - Take reference of fences both for creating the dma_fence_array and in
   the case where we return one fence.
 - Handle the case where dma_resv_get_list() returns NULL

v3 (Jason Ekstrand):
 - Add an _rcu suffix because it is read-only
 - Rewrite to use dma_resv_get_fences_rcu so it's RCU-safe
 - Add an EXPORT_SYMBOL_GPL declaration
 - Re-author the patch to Jason since very little is left of Christian
   König's original patch
 - Remove the extra fence argument

v4 (Jason Ekstrand):
 - Restore the extra fence argument

Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>

get_singleton
---
 drivers/dma-buf/dma-resv.c | 122 +++++++++++++++++++++++++++++++++++++
 include/linux/dma-resv.h   |   3 +
 2 files changed, 125 insertions(+)

diff --git a/drivers/dma-buf/dma-resv.c b/drivers/dma-buf/dma-resv.c
index 6ddbeb5dfbf65..25995fc15c370 100644
--- a/drivers/dma-buf/dma-resv.c
+++ b/drivers/dma-buf/dma-resv.c
@@ -33,6 +33,8 @@
  */
 
 #include <linux/dma-resv.h>
+#include <linux/dma-fence-chain.h>
+#include <linux/dma-fence-array.h>
 #include <linux/export.h>
 #include <linux/mm.h>
 #include <linux/sched/mm.h>
@@ -49,6 +51,19 @@
  * write-side updates.
  */
 
+/**
+ * dma_fence_deep_dive_for_each - deep dive into the fence containers
+ * @fence: resulting fence
+ * @chain: variable for a dma_fence_chain
+ * @index: index into a dma_fence_array
+ * @head: starting point
+ *
+ * Helper to deep dive into the fence containers for flattening them.
+ */
+#define dma_fence_deep_dive_for_each(fence, chain, index, head)	\
+	dma_fence_chain_for_each(chain, head)			\
+		dma_fence_array_for_each(fence, index, chain)
+
 DEFINE_WD_CLASS(reservation_ww_class);
 EXPORT_SYMBOL(reservation_ww_class);
 
@@ -517,6 +532,113 @@ int dma_resv_get_fences_rcu(struct dma_resv *obj,
 }
 EXPORT_SYMBOL_GPL(dma_resv_get_fences_rcu);
 
+/**
+ * dma_resv_get_singleton - get a single fence for the dma_resv object
+ * @obj: the reservation object
+ * @extra: extra fence to add to the resulting array
+ * @result: resulting dma_fence
+ *
+ * Get a single fence representing all unsignaled fences in the dma_resv object
+ * plus the given extra fence. If we got only one fence return a new
+ * reference to that, otherwise return a dma_fence_array object.
+ *
+ * RETURNS
+ * Returns -ENOMEM if allocations fail, zero otherwise.
+ */
+int dma_resv_get_singleton_rcu(struct dma_resv *obj, struct dma_fence *extra,
+			       struct dma_fence **result)
+{
+	struct dma_fence **resv_fences, *fence, *chain, **fences;
+	struct dma_fence_array *array;
+	unsigned int num_resv_fences, num_fences;
+	unsigned int ret, i, j;
+
+	ret = dma_resv_get_fences_rcu(obj, NULL, &num_resv_fences, &resv_fences);
+	if (ret)
+		return ret;
+
+	num_fences = 0;
+	*result = NULL;
+
+	if (num_resv_fences == 0 && !extra)
+		return 0;
+
+	for (i = 0; i < num_resv_fences; ++i) {
+		dma_fence_deep_dive_for_each(fence, chain, j, resv_fences[i]) {
+			if (dma_fence_is_signaled(fence))
+				continue;
+
+			*result = fence;
+			++num_fences;
+		}
+	}
+
+	if (extra) {
+		dma_fence_deep_dive_for_each(fence, chain, j, extra) {
+			if (dma_fence_is_signaled(fence))
+				continue;
+
+			*result = fence;
+			++num_fences;
+		}
+	}
+
+	if (num_fences <= 1) {
+		*result = dma_fence_get(*result);
+		goto put_resv_fences;
+	}
+
+	fences = kmalloc_array(num_fences, sizeof(struct dma_fence*),
+			       GFP_KERNEL);
+	if (!fences) {
+		*result = NULL;
+		ret = -ENOMEM;
+		goto put_resv_fences;
+	}
+
+	num_fences = 0;
+	for (i = 0; i < num_resv_fences; ++i) {
+		dma_fence_deep_dive_for_each(fence, chain, j, resv_fences[i]) {
+			if (!dma_fence_is_signaled(fence))
+				fences[num_fences++] = dma_fence_get(fence);
+		}
+	}
+
+	if (extra) {
+		dma_fence_deep_dive_for_each(fence, chain, j, extra) {
+			if (!dma_fence_is_signaled(fence))
+				fences[num_fences++] = dma_fence_get(fence);
+		}
+	}
+
+	if (num_fences <= 1) {
+		*result = num_fences ? fences[0] : NULL;
+		kfree(fences);
+		goto put_resv_fences;
+	}
+
+	array = dma_fence_array_create(num_fences, fences,
+				       dma_fence_context_alloc(1),
+				       1, false);
+	if (array) {
+		*result = &array->base;
+	} else {
+		*result = NULL;
+		while (num_fences--)
+			dma_fence_put(fences[num_fences]);
+		kfree(fences);
+		ret = -ENOMEM;
+	}
+
+put_resv_fences:
+	while (num_resv_fences--)
+		dma_fence_put(resv_fences[num_resv_fences]);
+	kfree(resv_fences);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(dma_resv_get_singleton_rcu);
+
 /**
  * dma_resv_wait_timeout_rcu - Wait on reservation's objects
  * shared and/or exclusive fences.
diff --git a/include/linux/dma-resv.h b/include/linux/dma-resv.h
index d44a77e8a7e34..d49ca263e78b4 100644
--- a/include/linux/dma-resv.h
+++ b/include/linux/dma-resv.h
@@ -285,6 +285,9 @@ int dma_resv_get_fences_rcu(struct dma_resv *obj,
 
 int dma_resv_copy_fences(struct dma_resv *dst, struct dma_resv *src);
 
+int dma_resv_get_singleton_rcu(struct dma_resv *obj, struct dma_fence *extra,
+			       struct dma_fence **result);
+
 long dma_resv_wait_timeout_rcu(struct dma_resv *obj, bool wait_all, bool intr,
 			       unsigned long timeout);
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 3/4] dma-buf: Add an API for exporting sync files (v9)
  2021-05-20 19:00 [PATCH 0/4] dma-buf: Add an API for exporting sync files (v8) Jason Ekstrand
  2021-05-20 19:00 ` [PATCH 1/4] dma-buf: add dma_fence_array_for_each (v2) Jason Ekstrand
  2021-05-20 19:00 ` [PATCH 2/4] dma-buf: add dma_resv_get_singleton_rcu (v4) Jason Ekstrand
@ 2021-05-20 19:00 ` Jason Ekstrand
  2021-05-20 19:00 ` [PATCH 4/4] RFC: dma-buf: Add an API for importing sync files (v6) Jason Ekstrand
  2021-05-21 21:32 ` [PATCH 0/4] dma-buf: Add an API for exporting sync files (v8) Jason Ekstrand
  4 siblings, 0 replies; 39+ messages in thread
From: Jason Ekstrand @ 2021-05-20 19:00 UTC (permalink / raw)
  To: dri-devel; +Cc: Jason Ekstrand

Modern userspace APIs like Vulkan are built on an explicit
synchronization model.  This doesn't always play nicely with the
implicit synchronization used in the kernel and assumed by X11 and
Wayland.  The client -> compositor half of the synchronization isn't too
bad, at least on intel, because we can control whether or not i915
synchronizes on the buffer and whether or not it's considered written.

The harder part is the compositor -> client synchronization when we get
the buffer back from the compositor.  We're required to be able to
provide the client with a VkSemaphore and VkFence representing the point
in time where the window system (compositor and/or display) finished
using the buffer.  With current APIs, it's very hard to do this in such
a way that we don't get confused by the Vulkan driver's access of the
buffer.  In particular, once we tell the kernel that we're rendering to
the buffer again, any CPU waits on the buffer or GPU dependencies will
wait on some of the client rendering and not just the compositor.

This new IOCTL solves this problem by allowing us to get a snapshot of
the implicit synchronization state of a given dma-buf in the form of a
sync file.  It's effectively the same as a poll() or I915_GEM_WAIT only,
instead of CPU waiting directly, it encapsulates the wait operation, at
the current moment in time, in a sync_file so we can check/wait on it
later.  As long as the Vulkan driver does the sync_file export from the
dma-buf before we re-introduce it for rendering, it will only contain
fences from the compositor or display.  This allows us to accurately turn
it into a VkFence or VkSemaphore without any over-synchronization.
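
The flags select how much gets snapshotted: WRITE grabs every fence on
the dma-buf (you intend to write, so you must wait for all readers and
writers), while READ only grabs the exclusive fence (you only need to
wait for the last writer).  A rough read-side sketch, where dmabuf_fd is
assumed to be an already-open dma-buf fd (not part of this patch):

  struct dma_buf_sync_file arg = { .flags = DMA_BUF_SYNC_READ };

  if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &arg) == 0) {
          /* arg.fd is now a sync_file snapshotting the exclusive (write)
           * fence; hand it to the GPU or display as an explicit in-fence. */
  }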

v2 (Jason Ekstrand):
 - Use a wrapper dma_fence_array of all fences including the new one
   when importing an exclusive fence.

v3 (Jason Ekstrand):
 - Lock around setting shared fences as well as exclusive
 - Mark SIGNAL_SYNC_FILE as a read-write ioctl.
 - Initialize ret to 0 in dma_buf_wait_sync_file

v4 (Jason Ekstrand):
 - Use the new dma_resv_get_singleton helper

v5 (Jason Ekstrand):
 - Rename the IOCTLs to import/export rather than wait/signal
 - Drop the WRITE flag and always get/set the exclusive fence

v6 (Jason Ekstrand):
 - Drop the sync_file import as it was all-around sketchy and not nearly
   as useful as export.
 - Re-introduce READ/WRITE flag support for export
 - Rework the commit message

v7 (Jason Ekstrand):
 - Require at least one sync flag
 - Fix a refcounting bug: dma_resv_get_excl() doesn't take a reference
 - Use _rcu helpers since we're accessing the dma_resv read-only

v8 (Jason Ekstrand):
 - Return -ENOMEM if the sync_file_create fails
 - Predicate support on IS_ENABLED(CONFIG_SYNC_FILE)

v9 (Jason Ekstrand):
 - Add documentation for the new ioctl

v10 (Jason Ekstrand):
 - Go back to dma_buf_sync_file as the ioctl struct name

Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
Acked-by: Simon Ser <contact@emersion.fr>
---
 drivers/dma-buf/dma-buf.c    | 62 ++++++++++++++++++++++++++++++++++++
 include/uapi/linux/dma-buf.h | 24 ++++++++++++++
 2 files changed, 86 insertions(+)

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index f264b70c383eb..7679562b57bfc 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -20,6 +20,7 @@
 #include <linux/debugfs.h>
 #include <linux/module.h>
 #include <linux/seq_file.h>
+#include <linux/sync_file.h>
 #include <linux/poll.h>
 #include <linux/dma-resv.h>
 #include <linux/mm.h>
@@ -362,6 +363,62 @@ static long dma_buf_set_name(struct dma_buf *dmabuf, const char __user *buf)
 	return ret;
 }
 
+#if IS_ENABLED(CONFIG_SYNC_FILE)
+static long dma_buf_export_sync_file(struct dma_buf *dmabuf,
+				     void __user *user_data)
+{
+	struct dma_buf_sync_file arg;
+	struct dma_fence *fence = NULL;
+	struct sync_file *sync_file;
+	int fd, ret;
+
+	if (copy_from_user(&arg, user_data, sizeof(arg)))
+		return -EFAULT;
+
+	if (arg.flags & ~DMA_BUF_SYNC_RW)
+		return -EINVAL;
+
+	if ((arg.flags & DMA_BUF_SYNC_RW) == 0)
+		return -EINVAL;
+
+	fd = get_unused_fd_flags(O_CLOEXEC);
+	if (fd < 0)
+		return fd;
+
+	if (arg.flags & DMA_BUF_SYNC_WRITE) {
+		ret = dma_resv_get_singleton_rcu(dmabuf->resv, NULL, &fence);
+		if (ret)
+			goto err_put_fd;
+	} else if (arg.flags & DMA_BUF_SYNC_READ) {
+		fence = dma_resv_get_excl_rcu(dmabuf->resv);
+	}
+
+	if (!fence)
+		fence = dma_fence_get_stub();
+
+	sync_file = sync_file_create(fence);
+
+	dma_fence_put(fence);
+
+	if (!sync_file) {
+		ret = -ENOMEM;
+		goto err_put_fd;
+	}
+
+	fd_install(fd, sync_file->file);
+
+	arg.fd = fd;
+	if (copy_to_user(user_data, &arg, sizeof(arg)))
+		return -EFAULT;
+
+	return 0;
+
+err_put_fd:
+	put_unused_fd(fd);
+	return ret;
+}
+#endif
+
 static long dma_buf_ioctl(struct file *file,
 			  unsigned int cmd, unsigned long arg)
 {
@@ -405,6 +462,11 @@ static long dma_buf_ioctl(struct file *file,
 	case DMA_BUF_SET_NAME_B:
 		return dma_buf_set_name(dmabuf, (const char __user *)arg);
 
+#if IS_ENABLED(CONFIG_SYNC_FILE)
+	case DMA_BUF_IOCTL_EXPORT_SYNC_FILE:
+		return dma_buf_export_sync_file(dmabuf, (void __user *)arg);
+#endif
+
 	default:
 		return -ENOTTY;
 	}
diff --git a/include/uapi/linux/dma-buf.h b/include/uapi/linux/dma-buf.h
index 7f30393b92c3b..f902cadcbdb56 100644
--- a/include/uapi/linux/dma-buf.h
+++ b/include/uapi/linux/dma-buf.h
@@ -37,6 +37,29 @@ struct dma_buf_sync {
 
 #define DMA_BUF_NAME_LEN	32
 
+/**
+ * struct dma_buf_sync_file - Get a sync_file from a dma-buf
+ *
+ * Userspace can perform a DMA_BUF_IOCTL_EXPORT_SYNC_FILE to retrieve the
+ * current set of fences on a dma-buf file descriptor as a sync_file.  CPU
+ * waits via poll() or other driver-specific mechanisms typically wait on
+ * whatever fences are on the dma-buf at the time the wait begins.  This
+ * is similar except that it takes a snapshot of the current fences on the
+ * dma-buf for waiting later instead of waiting immediately.  This is
+ * useful for modern graphics APIs such as Vulkan which assume an explicit
+ * synchronization model but still need to inter-operate with dma-buf.
+ */
+struct dma_buf_sync_file {
+	/**
+	 * @flags: Read/write flags
+	 *
+	 * Must be DMA_BUF_SYNC_READ, DMA_BUF_SYNC_WRITE, or both.
+	 */
+	__u32 flags;
+	/** @fd: Sync file file descriptor */
+	__s32 fd;
+};
+
 #define DMA_BUF_BASE		'b'
 #define DMA_BUF_IOCTL_SYNC	_IOW(DMA_BUF_BASE, 0, struct dma_buf_sync)
 
@@ -46,5 +69,6 @@ struct dma_buf_sync {
 #define DMA_BUF_SET_NAME	_IOW(DMA_BUF_BASE, 1, const char *)
 #define DMA_BUF_SET_NAME_A	_IOW(DMA_BUF_BASE, 1, u32)
 #define DMA_BUF_SET_NAME_B	_IOW(DMA_BUF_BASE, 1, u64)
+#define DMA_BUF_IOCTL_EXPORT_SYNC_FILE	_IOWR(DMA_BUF_BASE, 2, struct dma_buf_sync_file)
 
 #endif
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 4/4] RFC: dma-buf: Add an API for importing sync files (v6)
  2021-05-20 19:00 [PATCH 0/4] dma-buf: Add an API for exporting sync files (v8) Jason Ekstrand
                   ` (2 preceding siblings ...)
  2021-05-20 19:00 ` [PATCH 3/4] dma-buf: Add an API for exporting sync files (v9) Jason Ekstrand
@ 2021-05-20 19:00 ` Jason Ekstrand
  2021-05-22 20:05   ` Daniel Stone
  2021-05-21 21:32 ` [PATCH 0/4] dma-buf: Add an API for exporting sync files (v8) Jason Ekstrand
  4 siblings, 1 reply; 39+ messages in thread
From: Jason Ekstrand @ 2021-05-20 19:00 UTC (permalink / raw)
  To: dri-devel; +Cc: Jason Ekstrand

This patch is analogous to the previous sync file export patch in that
it allows you to import a sync_file into a dma-buf.  Unlike the previous
patch, however, this does add genuinely new functionality to dma-buf.
Without this, the only way to attach a sync_file to a dma-buf is to
submit a batch to your driver of choice which waits on the sync_file and
claims to write to the dma-buf.  Even if said batch is a no-op, a submit
is typically way more overhead than just attaching a fence.  A submit
may also imply extra synchronization with other work because it happens
on a hardware queue.

In the Vulkan world, this is useful for dealing with the out-fence from
vkQueuePresent.  Current Linux window-systems (X11, Wayland, etc.) all
rely on dma-buf implicit sync.  Since Vulkan is an explicit sync API, we
get a set of fences (VkSemaphores) in vkQueuePresent and have to stash
those as an exclusive (write) fence on the dma-buf.  We handle it in
Mesa today with the above mentioned dummy submit trick.  This ioctl
would allow us to set it directly without the dummy submit.

This may also open up possibilities for GPU drivers to move away from
implicit sync for their kernel driver uAPI and instead provide sync
files and rely on dma-buf import/export for communicating with other
implicit sync clients.

We make the explicit choice here to only allow setting RW fences which
translates to an exclusive fence on the dma_resv.  There's no use for
read-only fences for communicating with other implicit sync userspace
and any such attempts are likely to be racy at best.  When we go to
insert the RW fence, the actual fence we set as the new exclusive fence
is a combination of the sync_file provided by the user and all the other
fences on the dma_resv.  This ensures that the newly added exclusive
fence will never signal before the old one would have and ensures that
we don't break any dma_resv contracts.  We require userspace to specify
RW in the flags for symmetry with the export ioctl and in case we ever
want to support read fences in the future.

There is one downside here that's worth documenting:  If two clients
writing to the same dma-buf using this API race with each other, their
actions on the dma-buf may happen in parallel or in an undefined order.
Both with and without this API, the pattern is the same:  Collect all
the fences on dma-buf, submit work which depends on said fences, and
then set a new exclusive (write) fence on the dma-buf which depends on
said work.  The difference is that, when it's all handled by the GPU
driver's submit ioctl, the three operations happen atomically under the
dma_resv lock.  If two userspace submits race, one will happen before
the other.  You aren't guaranteed which but you are guaranteed that
they're strictly ordered.  If userspace manages the fences itself, then
these three operations happen separately and the two render operations
may happen genuinely in parallel or get interleaved.  However, this is a
case of userspace racing with itself.  As long as we ensure userspace
can't back the kernel into a corner, it should be fine.
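
For illustration, userspace usage would look roughly like this; the
attach_explicit_fence helper is made up and only exists to show the
call, and the sync_file fd is assumed to have come from somewhere like
a VkSemaphore exported as a SYNC_FD:

  #include <sys/ioctl.h>
  #include <linux/dma-buf.h>

  static int attach_explicit_fence(int dmabuf_fd, int sync_file_fd)
  {
          struct dma_buf_sync_file arg = {
                  .flags = DMA_BUF_SYNC_RW,  /* required by this ioctl */
                  .fd = sync_file_fd,
          };

          /* The kernel folds this fence together with everything already
           * on the dma_resv and installs the result as the new exclusive
           * fence. */
          return ioctl(dmabuf_fd, DMA_BUF_IOCTL_IMPORT_SYNC_FILE, &arg);
  }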

v2 (Jason Ekstrand):
 - Use a wrapper dma_fence_array of all fences including the new one
   when importing an exclusive fence.

v3 (Jason Ekstrand):
 - Lock around setting shared fences as well as exclusive
 - Mark SIGNAL_SYNC_FILE as a read-write ioctl.
 - Initialize ret to 0 in dma_buf_wait_sync_file

v4 (Jason Ekstrand):
 - Use the new dma_resv_get_singleton helper

v5 (Jason Ekstrand):
 - Rename the IOCTLs to import/export rather than wait/signal
 - Drop the WRITE flag and always get/set the exclusive fence

v6 (Jason Ekstrand):
 - Split import and export into separate patches
 - New commit message

Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
---
 drivers/dma-buf/dma-buf.c    | 32 ++++++++++++++++++++++++++++++++
 include/uapi/linux/dma-buf.h |  1 +
 2 files changed, 33 insertions(+)

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 7679562b57bfc..c9d6b9195c00c 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -417,6 +417,36 @@ static long dma_buf_export_sync_file(struct dma_buf *dmabuf,
 	put_unused_fd(fd);
 	return ret;
 }
+
+static long dma_buf_import_sync_file(struct dma_buf *dmabuf,
+				     const void __user *user_data)
+{
+	struct dma_buf_sync_file arg;
+	struct dma_fence *fence, *singleton = NULL;
+	int ret = 0;
+
+	if (copy_from_user(&arg, user_data, sizeof(arg)))
+		return -EFAULT;
+
+	if (arg.flags != DMA_BUF_SYNC_RW)
+		return -EINVAL;
+
+	fence = sync_file_get_fence(arg.fd);
+	if (!fence)
+		return -EINVAL;
+
+	dma_resv_lock(dmabuf->resv, NULL);
+
+	ret = dma_resv_get_singleton_rcu(dmabuf->resv, fence, &singleton);
+	if (!ret && singleton)
+		dma_resv_add_excl_fence(dmabuf->resv, singleton);
+
+	dma_resv_unlock(dmabuf->resv);
+
+	dma_fence_put(fence);
+
+	return ret;
+}
 #endif
 
 static long dma_buf_ioctl(struct file *file,
@@ -465,6 +495,8 @@ static long dma_buf_ioctl(struct file *file,
 #if IS_ENABLED(CONFIG_SYNC_FILE)
 	case DMA_BUF_IOCTL_EXPORT_SYNC_FILE:
 		return dma_buf_export_sync_file(dmabuf, (void __user *)arg);
+	case DMA_BUF_IOCTL_IMPORT_SYNC_FILE:
+		return dma_buf_import_sync_file(dmabuf, (const void __user *)arg);
 #endif
 
 	default:
diff --git a/include/uapi/linux/dma-buf.h b/include/uapi/linux/dma-buf.h
index f902cadcbdb56..75fdde4800267 100644
--- a/include/uapi/linux/dma-buf.h
+++ b/include/uapi/linux/dma-buf.h
@@ -70,5 +70,6 @@ struct dma_buf_sync_file {
 #define DMA_BUF_SET_NAME_A	_IOW(DMA_BUF_BASE, 1, u32)
 #define DMA_BUF_SET_NAME_B	_IOW(DMA_BUF_BASE, 1, u64)
 #define DMA_BUF_IOCTL_EXPORT_SYNC_FILE	_IOWR(DMA_BUF_BASE, 2, struct dma_buf_sync_file)
+#define DMA_BUF_IOCTL_IMPORT_SYNC_FILE	_IOW(DMA_BUF_BASE, 3, struct dma_buf_sync_file)
 
 #endif
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] dma-buf: add dma_fence_array_for_each (v2)
  2021-05-20 19:00 ` [PATCH 1/4] dma-buf: add dma_fence_array_for_each (v2) Jason Ekstrand
@ 2021-05-21  7:51   ` Christian König
  2021-05-21 16:27     ` Jason Ekstrand
  0 siblings, 1 reply; 39+ messages in thread
From: Christian König @ 2021-05-21  7:51 UTC (permalink / raw)
  To: Jason Ekstrand, dri-devel

Am 20.05.21 um 21:00 schrieb Jason Ekstrand:
> From: Christian König <ckoenig.leichtzumerken@gmail.com>
>
> Add a helper to iterate over all fences in a dma_fence_array object.
>
> v2 (Jason Ekstrand)
>   - Return NULL from dma_fence_array_first if head == NULL.  This matches
>     the iterator behavior of dma_fence_chain_for_each in that it iterates
>     zero times if head == NULL.
>   - Return NULL from dma_fence_array_next if index > array->num_fences.
>
> Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>

Reviewed-by: Christian König <christian.koenig@amd.com>

BTW: I'm only seeing this patch from the series. Looks like somehow the 
rest didn't make it into my inbox.

> ---
>   drivers/dma-buf/dma-fence-array.c | 27 +++++++++++++++++++++++++++
>   include/linux/dma-fence-array.h   | 17 +++++++++++++++++
>   2 files changed, 44 insertions(+)
>
> diff --git a/drivers/dma-buf/dma-fence-array.c b/drivers/dma-buf/dma-fence-array.c
> index d3fbd950be944..2ac1afc697d0f 100644
> --- a/drivers/dma-buf/dma-fence-array.c
> +++ b/drivers/dma-buf/dma-fence-array.c
> @@ -201,3 +201,30 @@ bool dma_fence_match_context(struct dma_fence *fence, u64 context)
>   	return true;
>   }
>   EXPORT_SYMBOL(dma_fence_match_context);
> +
> +struct dma_fence *dma_fence_array_first(struct dma_fence *head)
> +{
> +	struct dma_fence_array *array;
> +
> +	if (!head)
> +		return NULL;
> +
> +	array = to_dma_fence_array(head);
> +	if (!array)
> +		return head;
> +
> +	return array->fences[0];
> +}
> +EXPORT_SYMBOL(dma_fence_array_first);
> +
> +struct dma_fence *dma_fence_array_next(struct dma_fence *head,
> +				       unsigned int index)
> +{
> +	struct dma_fence_array *array = to_dma_fence_array(head);
> +
> +	if (!array || index >= array->num_fences)
> +		return NULL;
> +
> +	return array->fences[index];
> +}
> +EXPORT_SYMBOL(dma_fence_array_next);
> diff --git a/include/linux/dma-fence-array.h b/include/linux/dma-fence-array.h
> index 303dd712220fd..588ac8089dd61 100644
> --- a/include/linux/dma-fence-array.h
> +++ b/include/linux/dma-fence-array.h
> @@ -74,6 +74,19 @@ to_dma_fence_array(struct dma_fence *fence)
>   	return container_of(fence, struct dma_fence_array, base);
>   }
>   
> +/**
> + * dma_fence_array_for_each - iterate over all fences in array
> + * @fence: current fence
> + * @index: index into the array
> + * @head: potential dma_fence_array object
> + *
> + * Test if @array is a dma_fence_array object and if yes iterate over all fences
> + * in the array. If not just iterate over the fence in @array itself.
> + */
> +#define dma_fence_array_for_each(fence, index, head)			\
> +	for (index = 0, fence = dma_fence_array_first(head); fence;	\
> +	     ++(index), fence = dma_fence_array_next(head, index))
> +
>   struct dma_fence_array *dma_fence_array_create(int num_fences,
>   					       struct dma_fence **fences,
>   					       u64 context, unsigned seqno,
> @@ -81,4 +94,8 @@ struct dma_fence_array *dma_fence_array_create(int num_fences,
>   
>   bool dma_fence_match_context(struct dma_fence *fence, u64 context);
>   
> +struct dma_fence *dma_fence_array_first(struct dma_fence *head);
> +struct dma_fence *dma_fence_array_next(struct dma_fence *head,
> +				       unsigned int index);
> +
>   #endif /* __LINUX_DMA_FENCE_ARRAY_H */


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] dma-buf: add dma_fence_array_for_each (v2)
  2021-05-21  7:51   ` Christian König
@ 2021-05-21 16:27     ` Jason Ekstrand
  2021-05-21 18:11       ` Christian König
  0 siblings, 1 reply; 39+ messages in thread
From: Jason Ekstrand @ 2021-05-21 16:27 UTC (permalink / raw)
  To: Christian König; +Cc: Maling list - DRI developers

On Fri, May 21, 2021 at 2:51 AM Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
>
> Am 20.05.21 um 21:00 schrieb Jason Ekstrand:
> > From: Christian König <ckoenig.leichtzumerken@gmail.com>
> >
> > Add a helper to iterate over all fences in a dma_fence_array object.
> >
> > v2 (Jason Ekstrand)
> >   - Return NULL from dma_fence_array_first if head == NULL.  This matches
> >     the iterator behavior of dma_fence_chain_for_each in that it iterates
> >     zero times if head == NULL.
> >   - Return NULL from dma_fence_array_next if index > array->num_fences.
> >
> > Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
>
> Reviewed-by: Christian König <christian.koenig@amd.com>
>
> BTW: I'm only seeing this patch from the series. Looks like somehow the
> rest didn't make it into my inbox.

https://lists.freedesktop.org/archives/dri-devel/2021-May/307561.html

Not sure why it didn't make it to your inbox.  This one was CC'd to you
because you're the author from a year ago or something.

--Jason


> > ---
> >   drivers/dma-buf/dma-fence-array.c | 27 +++++++++++++++++++++++++++
> >   include/linux/dma-fence-array.h   | 17 +++++++++++++++++
> >   2 files changed, 44 insertions(+)
> >
> > diff --git a/drivers/dma-buf/dma-fence-array.c b/drivers/dma-buf/dma-fence-array.c
> > index d3fbd950be944..2ac1afc697d0f 100644
> > --- a/drivers/dma-buf/dma-fence-array.c
> > +++ b/drivers/dma-buf/dma-fence-array.c
> > @@ -201,3 +201,30 @@ bool dma_fence_match_context(struct dma_fence *fence, u64 context)
> >       return true;
> >   }
> >   EXPORT_SYMBOL(dma_fence_match_context);
> > +
> > +struct dma_fence *dma_fence_array_first(struct dma_fence *head)
> > +{
> > +     struct dma_fence_array *array;
> > +
> > +     if (!head)
> > +             return NULL;
> > +
> > +     array = to_dma_fence_array(head);
> > +     if (!array)
> > +             return head;
> > +
> > +     return array->fences[0];
> > +}
> > +EXPORT_SYMBOL(dma_fence_array_first);
> > +
> > +struct dma_fence *dma_fence_array_next(struct dma_fence *head,
> > +                                    unsigned int index)
> > +{
> > +     struct dma_fence_array *array = to_dma_fence_array(head);
> > +
> > +     if (!array || index >= array->num_fences)
> > +             return NULL;
> > +
> > +     return array->fences[index];
> > +}
> > +EXPORT_SYMBOL(dma_fence_array_next);
> > diff --git a/include/linux/dma-fence-array.h b/include/linux/dma-fence-array.h
> > index 303dd712220fd..588ac8089dd61 100644
> > --- a/include/linux/dma-fence-array.h
> > +++ b/include/linux/dma-fence-array.h
> > @@ -74,6 +74,19 @@ to_dma_fence_array(struct dma_fence *fence)
> >       return container_of(fence, struct dma_fence_array, base);
> >   }
> >
> > +/**
> > + * dma_fence_array_for_each - iterate over all fences in array
> > + * @fence: current fence
> > + * @index: index into the array
> > + * @head: potential dma_fence_array object
> > + *
> > + * Test if @array is a dma_fence_array object and if yes iterate over all fences
> > + * in the array. If not just iterate over the fence in @array itself.
> > + */
> > +#define dma_fence_array_for_each(fence, index, head)                 \
> > +     for (index = 0, fence = dma_fence_array_first(head); fence;     \
> > +          ++(index), fence = dma_fence_array_next(head, index))
> > +
> >   struct dma_fence_array *dma_fence_array_create(int num_fences,
> >                                              struct dma_fence **fences,
> >                                              u64 context, unsigned seqno,
> > @@ -81,4 +94,8 @@ struct dma_fence_array *dma_fence_array_create(int num_fences,
> >
> >   bool dma_fence_match_context(struct dma_fence *fence, u64 context);
> >
> > +struct dma_fence *dma_fence_array_first(struct dma_fence *head);
> > +struct dma_fence *dma_fence_array_next(struct dma_fence *head,
> > +                                    unsigned int index);
> > +
> >   #endif /* __LINUX_DMA_FENCE_ARRAY_H */
>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 2/4] dma-buf: add dma_resv_get_singleton_rcu (v4)
  2021-05-20 19:00 ` [PATCH 2/4] dma-buf: add dma_resv_get_singleton_rcu (v4) Jason Ekstrand
@ 2021-05-21 17:48   ` Daniel Vetter
  2021-05-24 20:04     ` Jason Ekstrand
  2021-05-22 14:38     ` kernel test robot
  2021-05-22 18:59     ` kernel test robot
  2 siblings, 1 reply; 39+ messages in thread
From: Daniel Vetter @ 2021-05-21 17:48 UTC (permalink / raw)
  To: Jason Ekstrand; +Cc: dri-devel

On Thu, May 20, 2021 at 02:00:05PM -0500, Jason Ekstrand wrote:
> Add a helper function to get a single fence representing
> all fences in a dma_resv object.
> 
> This fence is either the only one in the object or all not
> signaled fences of the object in a flatted out dma_fence_array.
> 
> v2 (Jason Ekstrand):
>  - Take reference of fences both for creating the dma_fence_array and in
>    the case where we return one fence.
>  - Handle the case where dma_resv_get_list() returns NULL
> 
> v3 (Jason Ekstrand):
>  - Add an _rcu suffix because it is read-only
>  - Rewrite to use dma_resv_get_fences_rcu so it's RCU-safe
>  - Add an EXPORT_SYMBOL_GPL declaration
>  - Re-author the patch to Jason since very little is left of Christian
>    König's original patch
>  - Remove the extra fence argument
> 
> v4 (Jason Ekstrand):
>  - Restore the extra fence argument
> 
> Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
> 
> get_singleton

Spurious thing here.

> ---
>  drivers/dma-buf/dma-resv.c | 122 +++++++++++++++++++++++++++++++++++++
>  include/linux/dma-resv.h   |   3 +
>  2 files changed, 125 insertions(+)
> 
> diff --git a/drivers/dma-buf/dma-resv.c b/drivers/dma-buf/dma-resv.c
> index 6ddbeb5dfbf65..25995fc15c370 100644
> --- a/drivers/dma-buf/dma-resv.c
> +++ b/drivers/dma-buf/dma-resv.c
> @@ -33,6 +33,8 @@
>   */
>  
>  #include <linux/dma-resv.h>
> +#include <linux/dma-fence-chain.h>
> +#include <linux/dma-fence-array.h>
>  #include <linux/export.h>
>  #include <linux/mm.h>
>  #include <linux/sched/mm.h>
> @@ -49,6 +51,19 @@
>   * write-side updates.
>   */
>  
> +/**
> + * dma_fence_deep_dive_for_each - deep dive into the fence containers
> + * @fence: resulting fence
> + * @chain: variable for a dma_fence_chain
> + * @index: index into a dma_fence_array
> + * @head: starting point
> + *
> + * Helper to deep dive into the fence containers for flattening them.
> + */
> +#define dma_fence_deep_dive_for_each(fence, chain, index, head)	\
> +	dma_fence_chain_for_each(chain, head)			\
> +		dma_fence_array_for_each(fence, index, chain)

Since this is is just internal helper in the .c file we generally don't
document it. Maybe small comment if you feel it's worth it.

> +
>  DEFINE_WD_CLASS(reservation_ww_class);
>  EXPORT_SYMBOL(reservation_ww_class);
>  
> @@ -517,6 +532,113 @@ int dma_resv_get_fences_rcu(struct dma_resv *obj,
>  }
>  EXPORT_SYMBOL_GPL(dma_resv_get_fences_rcu);
>  
> +/**
> + * dma_resv_get_singleton - get a single fence for the dma_resv object

Name doesn't match here.

> + * @obj: the reservation object
> + * @extra: extra fence to add to the resulting array
> + * @result: resulting dma_fence
> + *
> + * Get a single fence representing all unsignaled fences in the dma_resv object
> + * plus the given extra fence. If we got only one fence return a new
> + * reference to that, otherwise return a dma_fence_array object.
> + *
> + * RETURNS
> + * Returns -NOMEM if allocations fail, zero otherwise.

Kernel often encodes this in ERR_PTR so that you don't have to pass a
pointer to store the result. Would feel more kernel-y I think that way. So
no result parameter, and on alloc failure you'd return

	return ERR_PTR(-ENOMEM);
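
i.e. roughly (untested sketch, keeping the current function name):

	struct dma_fence *dma_resv_get_singleton_rcu(struct dma_resv *obj,
						     struct dma_fence *extra);

	fence = dma_resv_get_singleton_rcu(obj, NULL);
	if (IS_ERR(fence))
		return PTR_ERR(fence);
	/* fence == NULL can still mean "nothing to wait on" */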

> + */
> +int dma_resv_get_singleton_rcu(struct dma_resv *obj, struct dma_fence *extra,

tbh the _rcu here is confusing. I think _unlocked is the better suffix,
maybe we should rename dma_resv_get_fences_rcu too for consistency. The
rcu-ness of the lookup isn't leaked to callers at all, so no point giving
them a panic.

> +			       struct dma_fence **result)
> +{
> +	struct dma_fence **resv_fences, *fence, *chain, **fences;
> +	struct dma_fence_array *array;
> +	unsigned int num_resv_fences, num_fences;
> +	unsigned int ret, i, j;
> +
> +	ret = dma_resv_get_fences_rcu(obj, NULL, &num_resv_fences, &resv_fences);
> +	if (ret)
> +		return ret;
> +
> +	num_fences = 0;
> +	*result = NULL;
> +
> +	if (num_resv_fences == 0 && !extra)
> +		return 0;
> +
> +	for (i = 0; i < num_resv_fences; ++i) {
> +		dma_fence_deep_dive_for_each(fence, chain, j, resv_fences[i]) {
> +			if (dma_fence_is_signaled(fence))
> +				continue;
> +
> +			*result = fence;
> +			++num_fences;
> +		}
> +	}
> +
> +	if (extra) {
> +		dma_fence_deep_dive_for_each(fence, chain, j, extra) {
> +			if (dma_fence_is_signaled(fence))
> +				continue;
> +
> +			*result = fence;
> +			++num_fences;
> +		}
> +	}
> +
> +	if (num_fences <= 1) {
> +		*result = dma_fence_get(*result);
> +		goto put_resv_fences;
> +	}
> +
> +	fences = kmalloc_array(num_fences, sizeof(struct dma_fence*),
> +			       GFP_KERNEL);
> +	if (!fences) {
> +		*result = NULL;
> +		ret = -ENOMEM;
> +		goto put_resv_fences;
> +	}
> +
> +	num_fences = 0;
> +	for (i = 0; i < num_resv_fences; ++i) {
> +		dma_fence_deep_dive_for_each(fence, chain, j, resv_fences[i]) {
> +			if (!dma_fence_is_signaled(fence))
> +				fences[num_fences++] = dma_fence_get(fence);
> +		}
> +	}
> +
> +	if (extra) {
> +		dma_fence_deep_dive_for_each(fence, chain, j, extra) {
> +			if (dma_fence_is_signaled(fence))
> +				fences[num_fences++] = dma_fence_get(fence);
> +		}
> +	}
> +
> +	if (num_fences <= 1) {
> +		*result = num_fences ? fences[0] : NULL;
> +		kfree(fences);
> +		goto put_resv_fences;
> +	}
> +
> +	array = dma_fence_array_create(num_fences, fences,
> +				       dma_fence_context_alloc(1),
> +				       1, false);
> +	if (array) {
> +		*result = &array->base;
> +	} else {
> +		*result = NULL;
> +		while (num_fences--)
> +			dma_fence_put(fences[num_fences]);
> +		kfree(fences);
> +		ret = -ENOMEM;
> +	}
> +
> +put_resv_fences:
> +	while (num_resv_fences--)
> +		dma_fence_put(resv_fences[num_resv_fences]);
> +	kfree(resv_fences);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(dma_resv_get_singleton_rcu);

With the nits addressed:

Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>

> +
>  /**
>   * dma_resv_wait_timeout_rcu - Wait on reservation's objects
>   * shared and/or exclusive fences.
> diff --git a/include/linux/dma-resv.h b/include/linux/dma-resv.h
> index d44a77e8a7e34..d49ca263e78b4 100644
> --- a/include/linux/dma-resv.h
> +++ b/include/linux/dma-resv.h
> @@ -285,6 +285,9 @@ int dma_resv_get_fences_rcu(struct dma_resv *obj,
>  
>  int dma_resv_copy_fences(struct dma_resv *dst, struct dma_resv *src);
>  
> +int dma_resv_get_singleton_rcu(struct dma_resv *obj, struct dma_fence *extra,
> +			       struct dma_fence **result);
> +
>  long dma_resv_wait_timeout_rcu(struct dma_resv *obj, bool wait_all, bool intr,
>  			       unsigned long timeout);
>  
> -- 
> 2.31.1
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] dma-buf: add dma_fence_array_for_each (v2)
  2021-05-21 16:27     ` Jason Ekstrand
@ 2021-05-21 18:11       ` Christian König
  0 siblings, 0 replies; 39+ messages in thread
From: Christian König @ 2021-05-21 18:11 UTC (permalink / raw)
  To: Jason Ekstrand; +Cc: Maling list - DRI developers

Am 21.05.21 um 18:27 schrieb Jason Ekstrand:
> On Fri, May 21, 2021 at 2:51 AM Christian König
> <ckoenig.leichtzumerken@gmail.com> wrote:
>> Am 20.05.21 um 21:00 schrieb Jason Ekstrand:
>>> From: Christian König <ckoenig.leichtzumerken@gmail.com>
>>>
>>> Add a helper to iterate over all fences in a dma_fence_array object.
>>>
>>> v2 (Jason Ekstrand)
>>>    - Return NULL from dma_fence_array_first if head == NULL.  This matches
>>>      the iterator behavior of dma_fence_chain_for_each in that it iterates
>>>      zero times if head == NULL.
>>>    - Return NULL from dma_fence_array_next if index > array->num_fences.
>>>
>>> Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
>> Reviewed-by: Christian König <christian.koenig@amd.com>
>>
>> BTW: I'm only seeing this patch from the series. Looks like somehow the
>> rest didn't make it into my inbox.
> https://lists.freedesktop.org/archives/dri-devel/2021-May/307561.html
>
> Not sure why it didn't make your mail.  This one was CC'd to you
> because you're the author from a year ago or something.

Yeah, feel free to add an Acked-by on the part exporting the fences to
the sync_file.

But I think we really need to untangle the resource management from the 
implicit sync handling before the importing fence part can land.

Regards,
Christian.

>
> --Jason
>
>
>>> ---
>>>    drivers/dma-buf/dma-fence-array.c | 27 +++++++++++++++++++++++++++
>>>    include/linux/dma-fence-array.h   | 17 +++++++++++++++++
>>>    2 files changed, 44 insertions(+)
>>>
>>> diff --git a/drivers/dma-buf/dma-fence-array.c b/drivers/dma-buf/dma-fence-array.c
>>> index d3fbd950be944..2ac1afc697d0f 100644
>>> --- a/drivers/dma-buf/dma-fence-array.c
>>> +++ b/drivers/dma-buf/dma-fence-array.c
>>> @@ -201,3 +201,30 @@ bool dma_fence_match_context(struct dma_fence *fence, u64 context)
>>>        return true;
>>>    }
>>>    EXPORT_SYMBOL(dma_fence_match_context);
>>> +
>>> +struct dma_fence *dma_fence_array_first(struct dma_fence *head)
>>> +{
>>> +     struct dma_fence_array *array;
>>> +
>>> +     if (!head)
>>> +             return NULL;
>>> +
>>> +     array = to_dma_fence_array(head);
>>> +     if (!array)
>>> +             return head;
>>> +
>>> +     return array->fences[0];
>>> +}
>>> +EXPORT_SYMBOL(dma_fence_array_first);
>>> +
>>> +struct dma_fence *dma_fence_array_next(struct dma_fence *head,
>>> +                                    unsigned int index)
>>> +{
>>> +     struct dma_fence_array *array = to_dma_fence_array(head);
>>> +
>>> +     if (!array || index >= array->num_fences)
>>> +             return NULL;
>>> +
>>> +     return array->fences[index];
>>> +}
>>> +EXPORT_SYMBOL(dma_fence_array_next);
>>> diff --git a/include/linux/dma-fence-array.h b/include/linux/dma-fence-array.h
>>> index 303dd712220fd..588ac8089dd61 100644
>>> --- a/include/linux/dma-fence-array.h
>>> +++ b/include/linux/dma-fence-array.h
>>> @@ -74,6 +74,19 @@ to_dma_fence_array(struct dma_fence *fence)
>>>        return container_of(fence, struct dma_fence_array, base);
>>>    }
>>>
>>> +/**
>>> + * dma_fence_array_for_each - iterate over all fences in array
>>> + * @fence: current fence
>>> + * @index: index into the array
>>> + * @head: potential dma_fence_array object
>>> + *
>>> + * Test if @array is a dma_fence_array object and if yes iterate over all fences
>>> + * in the array. If not just iterate over the fence in @array itself.
>>> + */
>>> +#define dma_fence_array_for_each(fence, index, head)                 \
>>> +     for (index = 0, fence = dma_fence_array_first(head); fence;     \
>>> +          ++(index), fence = dma_fence_array_next(head, index))
>>> +
>>>    struct dma_fence_array *dma_fence_array_create(int num_fences,
>>>                                               struct dma_fence **fences,
>>>                                               u64 context, unsigned seqno,
>>> @@ -81,4 +94,8 @@ struct dma_fence_array *dma_fence_array_create(int num_fences,
>>>
>>>    bool dma_fence_match_context(struct dma_fence *fence, u64 context);
>>>
>>> +struct dma_fence *dma_fence_array_first(struct dma_fence *head);
>>> +struct dma_fence *dma_fence_array_next(struct dma_fence *head,
>>> +                                    unsigned int index);
>>> +
>>>    #endif /* __LINUX_DMA_FENCE_ARRAY_H */


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/4] dma-buf: Add an API for exporting sync files (v8)
  2021-05-20 19:00 [PATCH 0/4] dma-buf: Add an API for exporting sync files (v8) Jason Ekstrand
                   ` (3 preceding siblings ...)
  2021-05-20 19:00 ` [PATCH 4/4] RFC: dma-buf: Add an API for importing sync files (v6) Jason Ekstrand
@ 2021-05-21 21:32 ` Jason Ekstrand
  4 siblings, 0 replies; 39+ messages in thread
From: Jason Ekstrand @ 2021-05-21 21:32 UTC (permalink / raw)
  To: Maling list - DRI developers
  Cc: Dave Airlie, Michel Dänzer, Christian König, Daniel Stone

IGT: https://patchwork.freedesktop.org/series/90433/
Mesa (Vulkan WSI):
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4037

Ideally, it'd be nice to see a RADV MR and maybe even radeonsi before
we really land on it.  However, I think it's proved out well enough in
ANV to at least land the first three.

--Jason

On Thu, May 20, 2021 at 2:00 PM Jason Ekstrand <jason@jlekstrand.net> wrote:
>
> This is mostly a re-send of v8 only with a fourth patch which contains the
> sync file import ioctl that I had in the original series.  I've not updated
> the IGT tests yet for sync file import.  This resend is mostly intended to
> aid in discussions around implicit sync in general.  I'll write up some IGT
> tests if there is serious interest in patch 4.  I can also update the Mesa
> MR to use it for Vulkan.
>
> -------------------------------------------
>
> Modern userspace APIs like Vulkan are built on an explicit
> synchronization model.  This doesn't always play nicely with the
> implicit synchronization used in the kernel and assumed by X11 and
> Wayland.  The client -> compositor half of the synchronization isn't too
> bad, at least on intel, because we can control whether or not i915
> synchronizes on the buffer and whether or not it's considered written.
>
> The harder part is the compositor -> client synchronization when we get
> the buffer back from the compositor.  We're required to be able to
> provide the client with a VkSemaphore and VkFence representing the point
> in time where the window system (compositor and/or display) finished
> using the buffer.  With current APIs, it's very hard to do this in such
> a way that we don't get confused by the Vulkan driver's access of the
> buffer.  In particular, once we tell the kernel that we're rendering to
> the buffer again, any CPU waits on the buffer or GPU dependencies will
> wait on some of the client rendering and not just the compositor.
>
> This new IOCTL solves this problem by allowing us to get a snapshot of
> the implicit synchronization state of a given dma-buf in the form of a
> sync file.  It's effectively the same as a poll() or I915_GEM_WAIT only,
> instead of CPU waiting directly, it encapsulates the wait operation, at
> the current moment in time, in a sync_file so we can check/wait on it
> later.  As long as the Vulkan driver does the sync_file export from the
> dma-buf before we re-introduce it for rendering, it will only contain
> fences from the compositor or display.  This allows us to accurately turn
> it into a VkFence or VkSemaphore without any over-synchronization.
>
> Mesa MR: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4037
> IGT tests: https://lists.freedesktop.org/archives/igt-dev/2021-March/029825.html
>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: Michel Dänzer <michel@daenzer.net>
> Cc: Dave Airlie <airlied@redhat.com>
> Cc: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
> Cc: Daniel Stone <daniels@collabora.com>
>
> Christian König (1):
>   dma-buf: add dma_fence_array_for_each (v2)
>
> Jason Ekstrand (3):
>   dma-buf: add dma_resv_get_singleton_rcu (v4)
>   dma-buf: Add an API for exporting sync files (v9)
>   RFC: dma-buf: Add an API for importing sync files (v6)
>
>  drivers/dma-buf/dma-buf.c         |  94 +++++++++++++++++++++++
>  drivers/dma-buf/dma-fence-array.c |  27 +++++++
>  drivers/dma-buf/dma-resv.c        | 122 ++++++++++++++++++++++++++++++
>  include/linux/dma-fence-array.h   |  17 +++++
>  include/linux/dma-resv.h          |   3 +
>  include/uapi/linux/dma-buf.h      |  25 ++++++
>  6 files changed, 288 insertions(+)
>
> --
> 2.31.1
>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 2/4] dma-buf: add dma_resv_get_singleton_rcu (v4)
  2021-05-20 19:00 ` [PATCH 2/4] dma-buf: add dma_resv_get_singleton_rcu (v4) Jason Ekstrand
@ 2021-05-22 14:38     ` kernel test robot
  2021-05-22 14:38     ` kernel test robot
  2021-05-22 18:59     ` kernel test robot
  2 siblings, 0 replies; 39+ messages in thread
From: kernel test robot @ 2021-05-22 14:38 UTC (permalink / raw)
  To: Jason Ekstrand, dri-devel; +Cc: kbuild-all, Jason Ekstrand

[-- Attachment #1: Type: text/plain, Size: 4988 bytes --]

Hi Jason,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on tegra-drm/drm/tegra/for-next]
[also build test WARNING on linus/master v5.13-rc2 next-20210521]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Jason-Ekstrand/dma-buf-Add-an-API-for-exporting-sync-files-v8/20210522-201251
base:   git://anongit.freedesktop.org/tegra/linux.git drm/tegra/for-next
config: x86_64-randconfig-s031-20210522 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce:
        # apt-get install sparse
        # sparse version: v0.6.3-341-g8af24329-dirty
        # https://github.com/0day-ci/linux/commit/925221f402201e7b1f665619dda2c5ee6d6324f1
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Jason-Ekstrand/dma-buf-Add-an-API-for-exporting-sync-files-v8/20210522-201251
        git checkout 925221f402201e7b1f665619dda2c5ee6d6324f1
        # save the attached .config to linux build tree
        make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' W=1 ARCH=x86_64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> drivers/dma-buf/dma-resv.c:550: warning: expecting prototype for dma_resv_get_singleton(). Prototype was for dma_resv_get_singleton_rcu() instead


vim +550 drivers/dma-buf/dma-resv.c

   534	
   535	/**
   536	 * dma_resv_get_singleton - get a single fence for the dma_resv object
   537	 * @obj: the reservation object
   538	 * @extra: extra fence to add to the resulting array
   539	 * @result: resulting dma_fence
   540	 *
   541	 * Get a single fence representing all unsignaled fences in the dma_resv object
   542	 * plus the given extra fence. If we got only one fence return a new
   543	 * reference to that, otherwise return a dma_fence_array object.
   544	 *
   545	 * RETURNS
   546	 * Returns -NOMEM if allocations fail, zero otherwise.
   547	 */
   548	int dma_resv_get_singleton_rcu(struct dma_resv *obj, struct dma_fence *extra,
   549				       struct dma_fence **result)
 > 550	{
   551		struct dma_fence **resv_fences, *fence, *chain, **fences;
   552		struct dma_fence_array *array;
   553		unsigned int num_resv_fences, num_fences;
   554		unsigned int ret, i, j;
   555	
   556		ret = dma_resv_get_fences_rcu(obj, NULL, &num_resv_fences, &resv_fences);
   557		if (ret)
   558			return ret;
   559	
   560		num_fences = 0;
   561		*result = NULL;
   562	
   563		if (num_resv_fences == 0 && !extra)
   564			return 0;
   565	
   566		for (i = 0; i < num_resv_fences; ++i) {
   567			dma_fence_deep_dive_for_each(fence, chain, j, resv_fences[i]) {
   568				if (dma_fence_is_signaled(fence))
   569					continue;
   570	
   571				*result = fence;
   572				++num_fences;
   573			}
   574		}
   575	
   576		if (extra) {
   577			dma_fence_deep_dive_for_each(fence, chain, j, extra) {
   578				if (dma_fence_is_signaled(fence))
   579					continue;
   580	
   581				*result = fence;
   582				++num_fences;
   583			}
   584		}
   585	
   586		if (num_fences <= 1) {
   587			*result = dma_fence_get(*result);
   588			goto put_resv_fences;
   589		}
   590	
   591		fences = kmalloc_array(num_fences, sizeof(struct dma_fence*),
   592				       GFP_KERNEL);
   593		if (!fences) {
   594			*result = NULL;
   595			ret = -ENOMEM;
   596			goto put_resv_fences;
   597		}
   598	
   599		num_fences = 0;
   600		for (i = 0; i < num_resv_fences; ++i) {
   601			dma_fence_deep_dive_for_each(fence, chain, j, resv_fences[i]) {
   602				if (!dma_fence_is_signaled(fence))
   603					fences[num_fences++] = dma_fence_get(fence);
   604			}
   605		}
   606	
   607		if (extra) {
   608			dma_fence_deep_dive_for_each(fence, chain, j, extra) {
   609				if (dma_fence_is_signaled(fence))
   610					fences[num_fences++] = dma_fence_get(fence);
   611			}
   612		}
   613	
   614		if (num_fences <= 1) {
   615			*result = num_fences ? fences[0] : NULL;
   616			kfree(fences);
   617			goto put_resv_fences;
   618		}
   619	
   620		array = dma_fence_array_create(num_fences, fences,
   621					       dma_fence_context_alloc(1),
   622					       1, false);
   623		if (array) {
   624			*result = &array->base;
   625		} else {
   626			*result = NULL;
   627			while (num_fences--)
   628				dma_fence_put(fences[num_fences]);
   629			kfree(fences);
   630			ret = -ENOMEM;
   631		}
   632	
   633	put_resv_fences:
   634		while (num_resv_fences--)
   635			dma_fence_put(resv_fences[num_resv_fences]);
   636		kfree(resv_fences);
   637	
   638		return ret;
   639	}
   640	EXPORT_SYMBOL_GPL(dma_resv_get_singleton_rcu);
   641	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 31798 bytes --]

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 2/4] dma-buf: add dma_resv_get_singleton_rcu (v4)
  2021-05-20 19:00 ` [PATCH 2/4] dma-buf: add dma_resv_get_singleton_rcu (v4) Jason Ekstrand
@ 2021-05-22 18:59     ` kernel test robot
  2021-05-22 14:38     ` kernel test robot
  2021-05-22 18:59     ` kernel test robot
  2 siblings, 0 replies; 39+ messages in thread
From: kernel test robot @ 2021-05-22 18:59 UTC (permalink / raw)
  To: Jason Ekstrand, dri-devel; +Cc: clang-built-linux, kbuild-all, Jason Ekstrand

[-- Attachment #1: Type: text/plain, Size: 5247 bytes --]

Hi Jason,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on tegra-drm/drm/tegra/for-next]
[also build test WARNING on linus/master v5.13-rc2 next-20210521]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting a patch, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Jason-Ekstrand/dma-buf-Add-an-API-for-exporting-sync-files-v8/20210522-201251
base:   git://anongit.freedesktop.org/tegra/linux.git drm/tegra/for-next
config: x86_64-randconfig-a013-20210522 (attached as .config)
compiler: clang version 13.0.0 (https://github.com/llvm/llvm-project e84a9b9bb3051c35dea993cdad7b3d2575638f85)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install x86_64 cross compiling tool for clang build
        # apt-get install binutils-x86-64-linux-gnu
        # https://github.com/0day-ci/linux/commit/925221f402201e7b1f665619dda2c5ee6d6324f1
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Jason-Ekstrand/dma-buf-Add-an-API-for-exporting-sync-files-v8/20210522-201251
        git checkout 925221f402201e7b1f665619dda2c5ee6d6324f1
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=x86_64 

If you fix the issue, kindly add the following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> drivers/dma-buf/dma-resv.c:550: warning: expecting prototype for dma_resv_get_singleton(). Prototype was for dma_resv_get_singleton_rcu() instead


vim +550 drivers/dma-buf/dma-resv.c

   534	
   535	/**
   536	 * dma_resv_get_singleton - get a single fence for the dma_resv object
   537	 * @obj: the reservation object
   538	 * @extra: extra fence to add to the resulting array
   539	 * @result: resulting dma_fence
   540	 *
   541	 * Get a single fence representing all unsignaled fences in the dma_resv object
   542	 * plus the given extra fence. If we got only one fence return a new
   543	 * reference to that, otherwise return a dma_fence_array object.
   544	 *
   545	 * RETURNS
   546	 * Returns -NOMEM if allocations fail, zero otherwise.
   547	 */
   548	int dma_resv_get_singleton_rcu(struct dma_resv *obj, struct dma_fence *extra,
   549				       struct dma_fence **result)
 > 550	{
   551		struct dma_fence **resv_fences, *fence, *chain, **fences;
   552		struct dma_fence_array *array;
   553		unsigned int num_resv_fences, num_fences;
   554		unsigned int ret, i, j;
   555	
   556		ret = dma_resv_get_fences_rcu(obj, NULL, &num_resv_fences, &resv_fences);
   557		if (ret)
   558			return ret;
   559	
   560		num_fences = 0;
   561		*result = NULL;
   562	
   563		if (num_resv_fences == 0 && !extra)
   564			return 0;
   565	
   566		for (i = 0; i < num_resv_fences; ++i) {
   567			dma_fence_deep_dive_for_each(fence, chain, j, resv_fences[i]) {
   568				if (dma_fence_is_signaled(fence))
   569					continue;
   570	
   571				*result = fence;
   572				++num_fences;
   573			}
   574		}
   575	
   576		if (extra) {
   577			dma_fence_deep_dive_for_each(fence, chain, j, extra) {
   578				if (dma_fence_is_signaled(fence))
   579					continue;
   580	
   581				*result = fence;
   582				++num_fences;
   583			}
   584		}
   585	
   586		if (num_fences <= 1) {
   587			*result = dma_fence_get(*result);
   588			goto put_resv_fences;
   589		}
   590	
   591		fences = kmalloc_array(num_fences, sizeof(struct dma_fence*),
   592				       GFP_KERNEL);
   593		if (!fences) {
   594			*result = NULL;
   595			ret = -ENOMEM;
   596			goto put_resv_fences;
   597		}
   598	
   599		num_fences = 0;
   600		for (i = 0; i < num_resv_fences; ++i) {
   601			dma_fence_deep_dive_for_each(fence, chain, j, resv_fences[i]) {
   602				if (!dma_fence_is_signaled(fence))
   603					fences[num_fences++] = dma_fence_get(fence);
   604			}
   605		}
   606	
   607		if (extra) {
   608			dma_fence_deep_dive_for_each(fence, chain, j, extra) {
   609				if (dma_fence_is_signaled(fence))
   610					fences[num_fences++] = dma_fence_get(fence);
   611			}
   612		}
   613	
   614		if (num_fences <= 1) {
   615			*result = num_fences ? fences[0] : NULL;
   616			kfree(fences);
   617			goto put_resv_fences;
   618		}
   619	
   620		array = dma_fence_array_create(num_fences, fences,
   621					       dma_fence_context_alloc(1),
   622					       1, false);
   623		if (array) {
   624			*result = &array->base;
   625		} else {
   626			*result = NULL;
   627			while (num_fences--)
   628				dma_fence_put(fences[num_fences]);
   629			kfree(fences);
   630			ret = -ENOMEM;
   631		}
   632	
   633	put_resv_fences:
   634		while (num_resv_fences--)
   635			dma_fence_put(resv_fences[num_resv_fences]);
   636		kfree(resv_fences);
   637	
   638		return ret;
   639	}
   640	EXPORT_SYMBOL_GPL(dma_resv_get_singleton_rcu);
   641	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 40297 bytes --]

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/4] RFC: dma-buf: Add an API for importing sync files (v6)
  2021-05-20 19:00 ` [PATCH 4/4] RFC: dma-buf: Add an API for importing sync files (v6) Jason Ekstrand
@ 2021-05-22 20:05   ` Daniel Stone
  2021-05-23 17:15     ` EPOLL for drm_syncfile (was Re: [PATCH 4/4] RFC: dma-buf: Add an API for importing sync files (v6)) Christian König
  2021-05-24 17:11     ` [PATCH 4/4] RFC: dma-buf: Add an API for importing sync files (v6) Jason Ekstrand
  0 siblings, 2 replies; 39+ messages in thread
From: Daniel Stone @ 2021-05-22 20:05 UTC (permalink / raw)
  To: Jason Ekstrand; +Cc: Daniel Vetter, Christian König, dri-devel

Hi,

On Thu, 20 May 2021 at 20:00, Jason Ekstrand <jason@jlekstrand.net> wrote:
> In the Vulkan world, this is useful for dealing with the out-fence from
> vkQueuePresent.  Current Linux window-systems (X11, Wayland, etc.) all
> rely on dma-buf implicit sync.  Since Vulkan is an explicit sync API, we
> get a set of fences (VkSemaphores) in vkQueuePresent and have to stash
> those as an exclusive (write) fence on the dma-buf.  We handle it in
> Mesa today with the above mentioned dummy submit trick.  This ioctl
> would allow us to set it directly without the dummy submit.
>
> This may also open up possibilities for GPU drivers to move away from
> implicit sync for their kernel driver uAPI and instead provide sync
> files and rely on dma-buf import/export for communicating with other
> implicit sync clients.
>
> We make the explicit choice here to only allow setting RW fences which
> translates to an exclusive fence on the dma_resv.  There's no use for
> read-only fences for communicating with other implicit sync userspace
> and any such attempts are likely to be racy at best.

I think I almost follow, but I'm not quite seeing the race you allude
to. Let me talk through my understanding so it's hopefully more clear
for others as a summary. Please correct me on anything I've
misunderstood or just missed completely. (I thought when I wrote this
intro that the email might be relatively snappy, but it's really long
and takes in a lot of breadth. Sorry.)

So as far as I'm reading this patchset and the Mesa MR, this _only_
concerns the out-fence (i.e. compositor -> client AcquireNextImage
semaphore/fence) so far. The in-fence (client->compositor QueuePresent
wait-semaphores/fences) is handled by the driver ensuring that an
exclusive resv is placed on the union of the signal semaphores passed
to QueuePresent, either through flags on its CS ioctl, or amdgpu's BO
flags, or ... either way, no problem as it should always be exclusive,
and it seems pretty uncontroversial that we should pull the wait
semaphore into an exclusive fence so that no downstream consumer will
begin using it until the client ops have fully retired.

For AcquireNextImage, your patchset exports all the fences (shared and
exclusive both) on the dma_resv out into the semaphore/fence such that
the client can't progress (CPU-side for VkFence, GPU-side for
VkSemaphore) until all currently queued operations have completely
retired. So, as long as the server ensures that all its kernel-side
work is flushed before its IPC to unblock ANI (wl_buffer.release or
DRI3 PresentIdle, both indicating that the client is free to reuse the
buffer, subject to synchronising against implicit fences on the resv),
all good: we snapshot the current resv state, wrap the relevant
driver-side Vulkan primitive around it, and go back to explicit
synchronisation. The client can't race generating new work against
this, because it can't queue any work until ANI has returned and
queued a signaling op on the semaphore/fence.

So far, so good. I really like both your series up until this
narrative point; as I was saying in the userspace-fence thread, the
WSI<->client thread is certainly pulling a very big lever with a
heavyweight transition between the two different worlds, and I like
that the new export ioctl lets us be very clear about what exactly is
happening under the hood. Good stuff.

So, what gives with the import ioctl? Taking a guess at where you're
going, the import ioctl is going to be used in QueuePresent just as
the export ioctl is in ANI: instead of having CS flags to write into
the resv exclusive slot or per-BO flags to always dump in exclusive,
there's a forthcoming patch somewhere which lets drivers skip this and
instead have common QueuePresent code dump the wait semaphore into the
resv, so servers on the other side of an implicit-only protocol will
synchronise their access against the fence imported as part of
client-side QueuePresent?

That makes sense to me and is nicely symmetrical, plus it gets GPU
drivers further out of the business of doing magic winsys/display
synchronisation, which is good. But why enforce that imported fences
have to go into the exclusive slot? It makes sense from the point of
view of clients doing QueueSubmit, but I think it might preclude other
uses for no particularly good reason.

Certainly on X11 there's no real use for the shared slot - you get
into difficulties with mixed client/server rendering - but at least
Wayland and GBM are always non-destructive, so conceptually have a
good use for being able to import fences into the shared resv. This
goes for media and KMS access as well, if you think beyond the usecase
of an explicit Vulkan client sending through to a dumb implicit server
which just wants to texture.

Media clients are definitely a relevant usecase, both directly with
V4L2/VA-API, neither of which have support for explicit
synchronisation yet and (at least for V4L2) are unlikely to do so in
the near future, but even more with pipeline frameworks like GStreamer
and PipeWire, which don't have fundamental awareness (they're prepared
to deal with deep pipelines which return at arbitrary times, but once
anything is returned, it is available for immediate use). Maybe
open-coding CPU-side waits in these users is the best way longer term,
but for easier interop if nothing else, being able to pull shared
fences in seems like it could be a win ('the compositor is still
texturing from this for now, so feel free to read back from ref
frames, but don't scribble over it until it's finished').

I'm slightly bemused that there's so much energy spent on designing
around winsys being dumb and implicit. We did take a long time to roll
out sync_file support, but that was only because we didn't quite
understand all the nuances of why implicit sync is so difficult for
GPU drivers on modern architectures and how it was actually a win
compared to doing nothing; maybe we should have some kind of
conference where we all get together and talk to each other ... ?
Anyway, by the time we got to the cusp of rolling out bi-directional
sync_file support, suddenly we had drm_syncobj. By the time that had
semi-settled down and started to be rolled out, then we suddenly had
userspace fences on the horizon. And what do we do with those?

We've - certainly Weston, and I'm pretty confident I speak for Simon
on the wlroots side and Jonas for Mutter - landed on accepting that
we're going to have to deal with timeline semaphores and
wait-before-signal; we have a credible plan (we think) for protocol to
deal with at least syncobj, which should conceptually extend to
userspace fences. The biggest blocker there is the syncobj uAPI.
Compositors aren't Vulkan clients - we don't have one thread per group
of work, because the inter-client synchronisation becomes nightmarish
overhead for little gain. I don't think this will ever change, because
the balance of work is totally different between client and
compositor.

Anyway, the problem with syncobj is that the ioctl to wait for a
sync_file to materialise for a given timeline point only allows us to
block with a timeout; this is a non-starter, because we need something
which fits into epoll. The most optimal case is returning a new
eventfd or similar which signals when a given timeline point becomes
available or signaled, but in extremis a syncobj FD becoming readable
when any activity which would change the result of any zero-timeout
wait on that syncobj is more or less workable.
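
To make that concrete, the only thing available today looks roughly like
the sketch below (using libdrm's syncobj helpers, error handling omitted;
purely illustrative):

#include <stdint.h>
#include <xf86drm.h>

/* Blocks the calling thread until the fence for @point has materialised
 * (or the timeout expires).  There is no fd here to hand to epoll, which
 * is the problem described above. */
static int wait_for_point_to_materialise(int drm_fd, uint32_t syncobj,
                                         uint64_t point)
{
	return drmSyncobjTimelineWait(drm_fd, &syncobj, &point, 1,
				      INT64_MAX /* timeout_nsec */,
				      DRM_SYNCOBJ_WAIT_FLAGS_WAIT_AVAILABLE,
				      NULL);
}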

What we do want to do though, regardless of the primitive or its
semantics, is to only have to go through this once, not rework it all
in six months, and have to support a bunch of diverging codepaths. So
what is the optimal synchronisation primitive we should be aiming for
on the winsys side? Is sync_file a good enough lowest common
denominator, or should we do timeline-syncobj for fancy clients, in
tandem with sync_file bolted on the side for media and ancient GPUs?
Or are userspace fences going to give us some new primitive? And if
so, can we please talk about those semantics now, so we don't end up
with three synchronisation mechanisms which are all sort of but not
really alike?

As far as I can tell, the three relevant vendors with (more or less)
upstream drivers here are AMD, Arm, and Intel. Arm is obviously a bit
more up in the air as the hardware and specs aren't currently
available to the upstream development teams, but hopefully we can
bring them into this conversation. I think it's looking like we're
optimising for all of the following, without leaving anyone out in the
cold:

1. Modern userspace-fencing hardware running on a
userspace-fencing-aware winsys, i.e. new AMD/Arm/Intel on one of the
big three Wayland compositors which handle enough synchronisation
logic internally that the choice of underlying
composition/presentation API is immaterial (for which any new API is
important)
2. Modern userspace-fencing hardware running on Xwayland, for which
we'll inevitably have to type out new DRI3 synchronisation, but
derived purely from the Wayland protocols so it can be passed through
quite directly, and with any mixed X11<->user buffer client
interaction (e.g. subwindows) being penalised by a long blocking wait
in the X server
3. sync_file-oriented hardware, for which we need to do ~nothing
4. Modern userspace-fencing hardware and APIs which need to interop
with media units, for all four combinations of client/server
source/sink; for some APIs like Vulkan Video synchronisation is not a
problem, for others like VA-API/V4L2/GStreamer we're probably going to
need to live with the implicit semantics for the foreseeable future,
which means we should do the right thing for them up front, because
why would you even be playing games if you're not streaming them to
Twitch at the same time? (Personally I'm much happier that no-one
other than my friends who already know I'm terrible can see me playing
CoD, but y'know, kids these days ...)

The fifth case I've left out is clients who want to smash Vulkan
content directly into the display. For games and Kodi-like UI I'm just
going to ignore this, because (maybe I'm biased, but) using KMS
directly is difficult enough that you shouldn't do it and just use a
winsys because we've spent a very long time dealing with these
problems for you. The remaining usecase there is XR, but their uses
are very different, and OpenXR already deals with a _lot_ of this for
you, with the runtimes having sufficiently sophisticated
synchronisation internally that whatever we come up with won't be
onerous for them to implement. Either way, integration with KMS/GBM
has the same problem as Wayland, in that unless you fully encapsulate
it in something like VK_KHR_display, you don't get to throw a thread
under the bus to do delayed submit, because you need to actually
return something to the caller.

Ultimately, I think I'm leaning towards agreeing with Christian that I
would like to see a good holistic model for how this works in a
variety of usecases before we progress new uAPI, but also to agreeing
with you and Dan in that how amdgpu currently implements things is
currently objectively very wrong (despite the motivations for having
implemented it that way).

If there are any usecases I've missed then I'm all ears, but else I
think we should take this forward by working with
Vulkan/EGL/Wayland/X11/media/KMS and coming up with realistic
skeletons for end-to-end synchronisation through those usecases. It's
clear that this never happened for syncobj, because it's just not
usable as a primitive in any compositor which exists today: let's make
sure we don't get into the same trap again. If we can do this over
email/GitLab then that's great, but if not, maybe we need to do a kind
of mini-XDC with some kind of virtual whiteboard we can all scribble
over.

(As a coda to this, I'm pretty sure I understand the subtleties of the
memory fences for relocation/shootdown, but it would be _really_ good
to have a coherent description everyone agrees on written down
somewhere, so people can understand the issues without having to read
250 emails with people at loggerheads.)

Cheers,
Daniel








> When we go to
> insert the RW fence, the actual fence we set as the new exclusive fence
> is a combination of the sync_file provided by the user and all the other
> fences on the dma_resv.  This ensures that the newly added exclusive
> fence will never signal before the old one would have and ensures that
> we don't break any dma_resv contracts.  We require userspace to specify
> RW in the flags for symmetry with the export ioctl and in case we ever
> want to support read fences in the future.
>
> There is one downside here that's worth documenting:  If two clients
> writing to the same dma-buf using this API race with each other, their
> actions on the dma-buf may happen in parallel or in an undefined order.
> Both with and without this API, the pattern is the same:  Collect all
> the fences on dma-buf, submit work which depends on said fences, and
> then set a new exclusive (write) fence on the dma-buf which depends on
> said work.  The difference is that, when it's all handled by the GPU
> driver's submit ioctl, the three operations happen atomically under the
> dma_resv lock.  If two userspace submits race, one will happen before
> the other.  You aren't guaranteed which but you are guaranteed that
> they're strictly ordered.  If userspace manages the fences itself, then
> these three operations happen separately and the two render operations
> may happen genuinely in parallel or get interleaved.  However, this is a
> case of userspace racing with itself.  As long as we ensure userspace
> can't back the kernel into a corner, it should be fine.
>
> v2 (Jason Ekstrand):
>  - Use a wrapper dma_fence_array of all fences including the new one
>    when importing an exclusive fence.
>
> v3 (Jason Ekstrand):
>  - Lock around setting shared fences as well as exclusive
>  - Mark SIGNAL_SYNC_FILE as a read-write ioctl.
>  - Initialize ret to 0 in dma_buf_wait_sync_file
>
> v4 (Jason Ekstrand):
>  - Use the new dma_resv_get_singleton helper
>
> v5 (Jason Ekstrand):
>  - Rename the IOCTLs to import/export rather than wait/signal
>  - Drop the WRITE flag and always get/set the exclusive fence
>
> v5 (Jason Ekstrand):
>  - Split import and export into separate patches
>  - New commit message
>
> Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
> ---
>  drivers/dma-buf/dma-buf.c    | 32 ++++++++++++++++++++++++++++++++
>  include/uapi/linux/dma-buf.h |  1 +
>  2 files changed, 33 insertions(+)
>
> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> index 7679562b57bfc..c9d6b9195c00c 100644
> --- a/drivers/dma-buf/dma-buf.c
> +++ b/drivers/dma-buf/dma-buf.c
> @@ -417,6 +417,36 @@ static long dma_buf_export_sync_file(struct dma_buf *dmabuf,
>         put_unused_fd(fd);
>         return ret;
>  }
> +
> +static long dma_buf_import_sync_file(struct dma_buf *dmabuf,
> +                                    const void __user *user_data)
> +{
> +       struct dma_buf_sync_file arg;
> +       struct dma_fence *fence, *singleton = NULL;
> +       int ret = 0;
> +
> +       if (copy_from_user(&arg, user_data, sizeof(arg)))
> +               return -EFAULT;
> +
> +       if (arg.flags != DMA_BUF_SYNC_RW)
> +               return -EINVAL;
> +
> +       fence = sync_file_get_fence(arg.fd);
> +       if (!fence)
> +               return -EINVAL;
> +
> +       dma_resv_lock(dmabuf->resv, NULL);
> +
> +       ret = dma_resv_get_singleton_rcu(dmabuf->resv, fence, &singleton);
> +       if (!ret && singleton)
> +               dma_resv_add_excl_fence(dmabuf->resv, singleton);
> +
> +       dma_resv_unlock(dmabuf->resv);
> +
> +       dma_fence_put(fence);
> +
> +       return ret;
> +}
>  #endif
>
>  static long dma_buf_ioctl(struct file *file,
> @@ -465,6 +495,8 @@ static long dma_buf_ioctl(struct file *file,
>  #if IS_ENABLED(CONFIG_SYNC_FILE)
>         case DMA_BUF_IOCTL_EXPORT_SYNC_FILE:
>                 return dma_buf_export_sync_file(dmabuf, (void __user *)arg);
> +       case DMA_BUF_IOCTL_IMPORT_SYNC_FILE:
> +               return dma_buf_import_sync_file(dmabuf, (const void __user *)arg);
>  #endif
>
>         default:
> diff --git a/include/uapi/linux/dma-buf.h b/include/uapi/linux/dma-buf.h
> index f902cadcbdb56..75fdde4800267 100644
> --- a/include/uapi/linux/dma-buf.h
> +++ b/include/uapi/linux/dma-buf.h
> @@ -70,5 +70,6 @@ struct dma_buf_sync_file {
>  #define DMA_BUF_SET_NAME_A     _IOW(DMA_BUF_BASE, 1, u32)
>  #define DMA_BUF_SET_NAME_B     _IOW(DMA_BUF_BASE, 1, u64)
>  #define DMA_BUF_IOCTL_EXPORT_SYNC_FILE _IOWR(DMA_BUF_BASE, 2, struct dma_buf_sync_file)
> +#define DMA_BUF_IOCTL_IMPORT_SYNC_FILE _IOW(DMA_BUF_BASE, 3, struct dma_buf_sync)
>
>  #endif
> --
> 2.31.1
>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* EPOLL for drm_syncfile (was Re: [PATCH 4/4] RFC: dma-buf: Add an API for importing sync files (v6))
  2021-05-22 20:05   ` Daniel Stone
@ 2021-05-23 17:15     ` Christian König
  2021-05-23 21:34       ` Daniel Stone
  2021-05-24 17:11     ` [PATCH 4/4] RFC: dma-buf: Add an API for importing sync files (v6) Jason Ekstrand
  1 sibling, 1 reply; 39+ messages in thread
From: Christian König @ 2021-05-23 17:15 UTC (permalink / raw)
  To: Daniel Stone, Jason Ekstrand; +Cc: Daniel Vetter, dri-devel

Hi guys,

separating that discussion out since Daniel had a rather interesting 
idea here.

Am 22.05.21 um 22:05 schrieb Daniel Stone:
> [SNIP]
> Anyway, the problem with syncobj is that the ioctl to wait for a
> sync_file to materialise for a given timeline point only allows us to
> block with a timeout; this is a non-starter, because we need something
> which fits into epoll. The most optimal case is returning a new
> eventfd or similar which signals when a given timeline point becomes
> available or signaled, but in extremis a syncobj FD becoming readable
> when any activity which would change the result of any zero-timeout
> wait on that syncobj is more or less workable.
>

I think the tricky part is to epoll for a certain value.

Not sure how eventfd is supposed to work, but IIRC we don't have the 
functionality to poll for a certain value/offset etc to become available.

We could of course create a separate fd for each requested value to poll
for, but that sounds like a bit too much overhead to me.

Apart from that this is a really interesting idea.

Regards,
Christian.



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: EPOLL for drm_syncfile (was Re: [PATCH 4/4] RFC: dma-buf: Add an API for importing sync files (v6))
  2021-05-23 17:15     ` EPOLL for drm_syncfile (was Re: [PATCH 4/4] RFC: dma-buf: Add an API for importing sync files (v6)) Christian König
@ 2021-05-23 21:34       ` Daniel Stone
  2021-05-25  5:55         ` Daniel Vetter
  2021-05-26  9:45         ` Simon Ser
  0 siblings, 2 replies; 39+ messages in thread
From: Daniel Stone @ 2021-05-23 21:34 UTC (permalink / raw)
  To: Christian König; +Cc: Daniel Vetter, dri-devel, Jason Ekstrand

Hi Christian,

On Sun, 23 May 2021 at 18:16, Christian König <christian.koenig@amd.com> wrote:
> Am 22.05.21 um 22:05 schrieb Daniel Stone:
> > Anyway, the problem with syncobj is that the ioctl to wait for a
> > sync_file to materialise for a given timeline point only allows us to
> > block with a timeout; this is a non-starter, because we need something
> > which fits into epoll. The most optimal case is returning a new
> > eventfd or similar which signals when a given timeline point becomes
> > available or signaled, but in extremis a syncobj FD becoming readable
> > when any activity which would change the result of any zero-timeout
> > wait on that syncobj is more or less workable.
>
> I think the tricky part is to epoll for a certain value.
>
> Not sure how eventfd is supposed to work, but IIRC we don't have the
> functionality to poll for a certain value/offset etc to become available.
>
> We could of course create a separate fd for each requested value to poll
> for, but that sounds like a bit too much overhead to me.

Yeah, I understand the point; something like an eventfd works exactly as
you think: we would have to materialise a new FD for it. On
the other hand, as soon as the sync point becomes available, the
winsys will want to immediately extract a sync_file from it, so I
don't think FD amplification is a big issue. If it looks like being a
problem in practice, we could look at providing a FD which starts off
inert, and only later becomes a sync_file after it polls readable, but
that sounds like something we really want to avoid if we can.

Cheers,
Daniel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/4] RFC: dma-buf: Add an API for importing sync files (v6)
  2021-05-22 20:05   ` Daniel Stone
  2021-05-23 17:15     ` EPOLL for drm_syncfile (was Re: [PATCH 4/4] RFC: dma-buf: Add an API for importing sync files (v6)) Christian König
@ 2021-05-24 17:11     ` Jason Ekstrand
  2021-05-25  6:27       ` Daniel Vetter
                         ` (2 more replies)
  1 sibling, 3 replies; 39+ messages in thread
From: Jason Ekstrand @ 2021-05-24 17:11 UTC (permalink / raw)
  To: Daniel Stone; +Cc: Daniel Vetter, Christian König, dri-devel

On Sat, May 22, 2021 at 3:05 PM Daniel Stone <daniel@fooishbar.org> wrote:
>
> Hi,
>
> On Thu, 20 May 2021 at 20:00, Jason Ekstrand <jason@jlekstrand.net> wrote:
> > In the Vulkan world, this is useful for dealing with the out-fence from
> > vkQueuePresent.  Current Linux window-systems (X11, Wayland, etc.) all
> > rely on dma-buf implicit sync.  Since Vulkan is an explicit sync API, we
> > get a set of fences (VkSemaphores) in vkQueuePresent and have to stash
> > those as an exclusive (write) fence on the dma-buf.  We handle it in
> > Mesa today with the above mentioned dummy submit trick.  This ioctl
> > would allow us to set it directly without the dummy submit.
> >
> > This may also open up possibilities for GPU drivers to move away from
> > implicit sync for their kernel driver uAPI and instead provide sync
> > files and rely on dma-buf import/export for communicating with other
> > implicit sync clients.
> >
> > We make the explicit choice here to only allow setting RW fences which
> > translates to an exclusive fence on the dma_resv.  There's no use for
> > read-only fences for communicating with other implicit sync userspace
> > and any such attempts are likely to be racy at best.
>
> I think I almost follow, but I'm not quite seeing the race you allude
> to. Let me talk through my understanding so it's hopefully more clear
> for others as a summary. Please correct me on anything I've
> misunderstood or just missed completely. (I thought when I wrote this
> intro that the email might be relatively snappy, but it's really long
> and takes in a lot of breadth. Sorry.)
>
> So as far as I'm reading this patchset and the Mesa MR, this _only_
> concerns the out-fence (i.e. compositor -> client AcquireNextImage
> semaphore/fence) so far. The in-fence (client->compositor QueuePresent
> wait-semaphores/fences) is handled by the driver ensuring that an
> exclusive resv is placed on the union of the signal semaphores passed
> to QueuePresent, either through flags on its CS ioctl, or amdgpu's BO
> flags, or ... either way, no problem as it should always be exclusive,
> and it seems pretty uncontroversial that we should pull the wait
> semaphore into an exclusive fence so that no downstream consumer will
> begin using it until the client ops have fully retired.
>
> For AcquireNextImage, your patchset exports all the fences (shared and
> exclusive both) on the dma_resv out into the semaphore/fence such that
> the client can't progress (CPU-side for VkFence, GPU-side for
> VkSemaphore) until all currently queued operations have completely
> retired. So, as long as the server ensures that all its kernel-side
> work is flushed before its IPC to unblock ANI (wl_buffer.release or
> DRI3 PresentIdle, both indicating that the client is free to reuse the
> buffer, subject to synchronising against implicit fences on the resv),
> all good: we snapshot the current resv state, wrap the relevant
> driver-side Vulkan primitive around it, and go back to explicit
> synchronisation. The client can't race generating new work against
> this, because it can't queue any work until ANI has returned and
> queued a signaling op on the semaphore/fence.
>
> So far, so good. I really like both your series up until this
> narrative point; as I was saying in the userspace-fence thread, the
> WSI<->client thread is certainly pulling a very big lever with a
> heavyweight transition between the two different worlds, and I like
> that the new export ioctl lets us be very clear about what exactly is
> happening under the hood. Good stuff.

Glad to hear.  If you turned that into RBs on the first three patches
in this series and all but the last patch on the Mesa MR, it'd make me
even happier. :-D

At this point, I think everyone is pretty happy with the first three
patches and the export ioctl.  In the Vulkan WSI code, it solves a
significant over-synchronization issue for ANV.  Also, the uAPI
shouldn't be controversial at all because it's identical to poll()
except that it gives you a FD you can poll on later to get the result
of the poll as it would be now.  I think if we get some Mesa reviews,
we should be able to land those.  It's the import side that's trickier.
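
For reference, the export usage is about as simple as it sounds; roughly
the sketch below, assuming the struct dma_buf_sync_file layout from this
series ({ __u32 flags; __s32 fd; }) and these headers applied:

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/dma-buf.h>

/* Snapshot the dma-buf's current implicit fences as a sync_file.
 * @flags is DMA_BUF_SYNC_READ or DMA_BUF_SYNC_RW.  Returns a poll()able
 * sync_file fd, or -1 on error.  Sketch only, not the final uAPI. */
static int dmabuf_export_sync_file(int dmabuf_fd, uint32_t flags)
{
	struct dma_buf_sync_file arg = { .flags = flags };

	if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &arg) < 0)
		return -1;
	return arg.fd;
}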

> So, what gives with the import ioctl? Taking a guess at where you're
> going, the import ioctl is going to be used in QueuePresent just as
> the export ioctl is in ANI: instead of having CS flags to write into
> the resv exclusive slot or per-BO flags to always dump in exclusive,
> there's a forthcoming patch somewhere which lets drivers skip this and
> instead have common QueuePresent code dump the wait semaphore into the
> resv, so servers on the other side of an implicit-only protocol will
> synchronise their access against the fence imported as part of
> client-side QueuePresent?

Correct.  And the patch isn't forthcoming.  It already exists as the
top patch in the Mesa MR (!4037).

> That makes sense to me and is nicely symmetrical, plus it gets GPU
> drivers further out of the business of doing magic winsys/display
> synchronisation, which is good. But why enforce that imported fences
> have to go into the exclusive slot? It makes sense from the point of
> view of clients doing QueueSubmit, but I think it might preclude other
> uses for no particularly good reason.

Mostly, I was trying to limit the scope a bit.  Import is tricky and,
to be honest, I'm still not 100% sure that it's safe.  I probably
should have left RFC on this patch.

As long as we keep everything inside the kernel, there's a requirement
that any work which triggers any fence on a dma_resv waits on the
exclusive fence, if any.  Work which sets the exclusive fence has to
wait on all fences.  With the import ioctl, we can guarantee that the
fences fire in the right order by making an array fence containing the
new fence and all other fences and using that as the exclusive fence.
What we can't do, however, is ensure that the work which triggers the
fence actually blocks on ANY of the fences on the dma_resv.

Hrm... I think I've now convinced myself that importing shared fences
is no more dangerous than importing an exclusive fence because they
share the same set of problems.  Unfortunately, I've unconvinced
myself of the safety of either.  I've got to think some more....

The most convincing argument I can make to myself is that this ioctl
is like having a one-shot little queue that contains tiny little work
jobs which wait on whatever sync_file you provide, do nothing, and
then signal.  That should be safe, right?
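
In code, that "one-shot queue" amounts to nothing more than this on the
userspace side (same struct-layout caveats as the export sketch above;
illustrative only):

/* Push a sync_file into the dma-buf's exclusive slot (QueuePresent side).
 * Only DMA_BUF_SYNC_RW is accepted by the patch. */
static int dmabuf_import_sync_file(int dmabuf_fd, int sync_file_fd)
{
	struct dma_buf_sync_file arg = {
		.flags = DMA_BUF_SYNC_RW,
		.fd = sync_file_fd,
	};

	return ioctl(dmabuf_fd, DMA_BUF_IOCTL_IMPORT_SYNC_FILE, &arg);
}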

> Certainly on X11 there's no real use for the shared slot - you get
> into difficulties with mixed client/server rendering - but at least
> Wayland and GBM are always non-destructive, so conceptually have a
> good use for being able to import fences into the shared resv. This
> goes for media and KMS access as well, if you think beyond the usecase
> of an explicit Vulkan client sending through to a dumb implicit server
> which just wants to texture.
>
> Media clients are definitely a relevant usecase, both directly with
> V4L2/VA-API, neither of which have support for explicit
> synchronisation yet and (at least for V4L2) are unlikely to do so in
> the near future, but even more with pipeline frameworks like GStreamer
> and PipeWire, which don't have fundamental awareness (they're prepared
> to deal with deep pipelines which return at arbitrary times, but once
> anything is returned, it is available for immediate use). Maybe
> open-coding CPU-side waits in these users is the best way longer term,
> but for easier interop if nothing else, being able to pull shared
> fences in seems like it could be a win ('the compositor is still
> texturing from this for now, so feel free to read back from ref
> frames, but don't scribble over it until it's finished').
>
> I'm slightly bemused that there's so much energy spent on designing
> around winsys being dumb and implicit.

I'd like to address this one as it's a comment you've made several
times.  Once you've fixed raw X11 (not just XWayland) and a new
release has been made (hah!) and is shipping in distros with said
support, then we can talk.  Sorry if that comes off as overly snarky
but that's reality that we (driver devs) are living with.  To make
things even worse, when we're in Vulkan land (as opposed to GL), we
can't tell up-front whether or not our window-system supports foo
fences and adjust accordingly.  We have to start up, begin rendering,
and only later figure out "oops, this one goes to X11".  We really
can't say things like "when running on modern Wayland, do things the
new way" because Vulkan doesn't have a concept of "running on" a
window system.

FWIW, I do have a Mesa MR somewhere which adds explicit sync for
Vulkan+Wayland when the wayland protocols are there.  I don't remember
why it didn't land.  Maybe lack of acquire fence support?  I think the
protocol issues have been fixed, so we should up-rev the requirements
as needed and land it.

> We did take a long time to roll
> out sync_file support, but that was only because we didn't quite
> understand all the nuances of why implicit sync is so difficult for
> GPU drivers on modern architectures and how it was actually a win
> compared to doing nothing; maybe we should have some kind of
> conference where we all get together and talk to each other ... ?
> Anyway, by the time we got to the cusp of rolling out bi-directional
> sync_file support, suddenly we had drm_syncobj. By the time that had
> semi-settled down and started to be rolled out, then we suddenly had
> userspace fences on the horizon. And what do we do with those?

You're not wrong....  And that's the second reason for the gymnastics
above.  Ever since Vulkan launched, we've been fumbling around trying
to figure out how to best do synchronization.  'm reasonably certain
that userspace memory fences are the future but I'm much less certain
about the path to get there.  It's been a process of patient
experimentation so far and I think we're getting closer.  Syncobj,
timeline syncobj, etc. have all been steps along that path.  I've been
hesitant to ask the window-system people to draft piles of extensions
until we're sure we've found "the one".  It's bad enough iterating in
kernel-space and userspace without bringing Wayland protocol into each
iteration step.  For that reason, one of my goals along this process
has been to try and decouple things as much as we can.

So, in summary, none of this is because I think that window systems
are dumb and implicit or for any lack of respect for the people that
work on them.  I assume that X11 will always be dumb and implicit.
(I'd love to be proven wrong!)  For Wayland (and XWayland, probably),
I assume the people are very smart and active and will implement
whatever APIs we (the driver people) need as long as they're
reasonable.  I just don't know what to ask for yet.

> We've - certainly Weston, and I'm pretty confident I speak for Simon
> on the wlroots side and Jonas for Mutter - landed on accepting that
> we're going to have to deal with timeline semaphores and
> wait-before-signal; we have a credible plan (we think) for protocol to
> deal with at least syncobj, which should conceptually extend to
> userspace fences. The biggest blocker there is the syncobj uAPI.
> Compositors aren't Vulkan clients - we don't have one thread per group
> of work, because the inter-client synchronisation becomes nightmarish
> overhead for little gain. I don't think this will ever change, because
> the balance of work is totally different between client and
> compositor.
>
> Anyway, the problem with syncobj is that the ioctl to wait for a
> sync_file to materialise for a given timeline point only allows us to
> block with a timeout; this is a non-starter, because we need something
> which fits into epoll. The most optimal case is returning a new
> eventfd or similar which signals when a given timeline point becomes
> available or signaled, but in extremis a syncobj FD becoming readable
> when any activity which would change the result of any zero-timeout
> wait on that syncobj is more or less workable.

Right.  You could call this an oversight.  Really, though, it was
because the use-cases we were interested in were ones where a wait
with a timeout was perfectly acceptable and where the extra overhead
of setting an epoll with ioctls wasn't.  It shouldn't be too hard to
wire up if that helps (cross your fingers).  But....

If we go the direction of userspace memory fences (which I think is
likely), there is no such thing as "wait for the fence to
materialize".  The work is either done or it isn't.  There is no
enforceable concept of "about to be done".  The word "enforceable" is
important there.  We could add such a concept but it'd be dependent on
the userspace client (not driver) to let us know when it's just about
ready and then we'd have to VK_ERROR_DEVICE_LOST on them or similar if
they break the contract.  Maybe possible but there's some design work
required there.  The other option is to make the compositors handle
this new way of thinking more thoroughly and eat the latency hit.

> What we do want to do though, regardless of the primitive or its
> semantics, is to only have to go through this once, not rework it all
> in six months, and have to support a bunch of diverging codepaths. So
> what is the optimal synchronisation primitive we should be aiming for
> on the winsys side? Is sync_file a good enough lowest common
> denominator, or should we do timeline-syncobj for fancy clients, in
> tandem with sync_file bolted on the side for media and ancient GPUs?
> Or are userspace fences going to give us some new primitive? And if
> so, can we please talk about those semantics now, so we don't end up
> with three synchronisation mechanisms which are all sort of but not
> really alike?

As I said above, I think we're getting close but I don't think we're there yet.

> As far as I can tell, the three relevant vendors with (more or less)
> upstream drivers here are AMD, Arm, and Intel. Arm is obviously a bit
> more up in the air as the hardware and specs aren't currently
> available to the upstream development teams, but hopefully we can
> bring them into this conversation. I think it's looking like we're
> optimising for all of the following, without leaving anyone out in the
> cold:
>
> 1. Modern userspace-fencing hardware running on a
> userspace-fencing-aware winsys, i.e. new AMD/Arm/Intel on one of the
> big three Wayland compositors which handle enough synchronisation
> logic internally that the choice of underlying
> composition/presentation API is immaterial (for which any new API is
> important)
> 2. Modern userspace-fencing hardware running on Xwayland, for which
> we'll inevitably have to type out new DRI3 synchronisation, but
> derived purely from the Wayland protocols so it can be passed through
> quite directly, and with any mixed X11<->user buffer client
> interaction (e.g. subwindows) being penalised by a long blocking wait
> in the X server
> 3. sync_file-oriented hardware, for which we need to do ~nothing
> 4. Modern userspace-fencing hardware and APIs which need to interop
> with media units, for all four combinations of client/server
> source/sink; for some APIs like Vulkan Video synchronisation is not a
> problem, for others like VA-API/V4L2/GStreamer we're probably going to
> need to live with the implicit semantics for the foreseeable future,
> which means we should do the right thing for them up front, because
> why would you even be playing games if you're not streaming them to
> Twitch at the same time? (Personally I'm much happier that no-one
> other than my friends who already know I'm terrible can see me playing
> CoD, but y'know, kids these days ...)
>
> The fifth case I've left out is clients who want to smash Vulkan
> content directly into the display. For games and Kodi-like UI I'm just
> going to ignore this, because (maybe I'm biased, but) using KMS
> directly is difficult enough that you shouldn't do it and just use a
> winsys because we've spent a very long time dealing with these
> problems for you. The remaining usecase there is XR, but their uses
> are very different, and OpenXR already deals with a _lot_ of this for
> you, with the runtimes having sufficiently sophisticated
> synchronisation internally that whatever we come up with won't be
> onerous for them to implement. Either way, integration with KMS/GBM
> has the same problem as Wayland, in that unless you fully encapsulate
> it in something like VK_KHR_display, you don't get to throw a thread
> under the bus to do delayed submit, because you need to actually
> return something to the caller.

You're missing a use-case:  Modern userspace-fencing hardware running
on bare X11.  I don't know that we should optimize for this case, so
to speak, but it has to work non-terribly.  Also, as I alluded to
above, I really don't want to be maintaining two drivers forever: One
for X11 and ancient Wayland and one for modern Wayland.  We need to be
able to modernize the driver and still support the old window-systems.
Unless you can promise me that X11 and KDE/Wayland will either go away
or else get modernized, I need a fall-back plan.  And even if you make
me said promise, I'm not going to believe you. :-P  8-9 years ago, I
was one of the people making those promises.  Now I'm writing X11
winsys code for drivers.  And... now I'm re-thinking some of my life
choices....

> Ultimately, I think I'm leaning towards agreeing with Christian that I
> would like to see a good holistic model for how this works in a
> variety of usecases before we progress new uAPI, but also to agreeing
> with you and Dan in that how amdgpu currently implements things is
> currently objectively very wrong (despite the motivations for having
> implemented it that way).
>
> If there are any usecases I've missed then I'm all ears, but else I
> think we should take this forward by working with
> Vulkan/EGL/Wayland/X11/media/KMS and coming up with realistic
> skeletons for end-to-end synchronisation through those usecases. It's
> clear that this never happened for syncobj, because it's just not
> usable as a primitive in any compositor which exists today: let's make
> sure we don't get into the same trap again. If we can do this over
> email/GitLab then that's great, but if not, maybe we need to do a kind
> of mini-XDC with some kind of virtual whiteboard we can all scribble
> over.

I think what we're looking at is roughly three steps, the first two of
which are mostly there on the Wayland side:

 1. Implicit sync.  We know what this one is.  glFlush/vkQueueSubmit
on the one side, start texturing on the other, and it all works.

 2. Immutable SW fences, i.e. sync_file.  This is where we have a
fence object that gets returned from the flush/submit and can be
passed into the texture op to provide a dependency.  Syncobj may be
useful here but only as a container.  If a sync_file is a dma_fence*,
a syncobj should be thought of as a dma_fence**.  This may be useful
if we want to retrofit sync_file into X11 where the current DRI3
explicit sync stuff is based on creating an object and then re-using
it for every present.

 3. Userspace memory fences.

Note that timeline syncobj is NOT in that list.  IMO, all the "wait
for submit" stuff is an implementation detail we needed in order to
get the timeline semantics on top of immutable SW fences.  Under the
hood it's all dma_fence; this just gives us a shareable container so
we can implement VK_KHR_timeline_semaphore with sharing.  I really
don't want to make Wayland protocol around it if memory fences are the
final solution.
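
For readers keeping score, the container distinction looks roughly like
this in C (a conceptual sketch only, with made-up names; these are not
the real struct sync_file / struct drm_syncobj definitions):

    #include <linux/dma-fence.h>

    struct sync_file_like {
            /* immutable: fixed when the sync_file is created ("dma_fence*") */
            struct dma_fence * const fence;
    };

    struct syncobj_like {
            /* mutable slot, re-pointed at a new fence per submit/present
             * ("dma_fence**" in spirit) */
            struct dma_fence *fence;
    };

A protocol built around the container can keep one long-lived object and
just swap which dma_fence it currently points at, which is what makes it
a plausible retrofit for the DRI3 create-once-reuse-every-present model.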


> (As a coda to this, I'm pretty sure I understand the subtleties of the
> memory fences for relocation/shootdown, but it would be _really_ good
> to have a coherent description everyone agrees on written down
> somewhere, so people can understand the issues without having to read
> 250 emails with people at loggerheads.)

Let me give it a try.  I define a userspace memory fence as follows:
 - The fence object is fundamentally a bit of mappable gpu-accessible
memory which stores a 64-bit counter value which is used as a
timeline.  (On Windows, they require it to live in system memory.)
 - For sharable memory fences, each one probably has to go in its own page. :-(
 - The value is initialized to 0.
 - To signal the fence, someone (GPU or CPU) writes a new 64-bit value.
 - Waits on the fence are waits for it to be >= some value.

Depending on circumstances, the wait may be a 32-bit comparison and
may be >= with wrap-around.  For the purposes of window-system
protocol, I think we can assume 64-bit >= with no wrap-around.
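
A minimal CPU-side sketch of that model, just to pin the semantics down
(illustrative names only; a real implementation lives in GPU-visible
memory and waits via a GPU instruction or driver/firmware wait rather
than a busy loop):

    #include <stdatomic.h>
    #include <stdint.h>

    struct umem_fence {
            _Atomic uint64_t value;   /* 64-bit timeline counter, starts at 0 */
    };

    /* Signal: anyone (CPU or GPU) simply writes a larger value. */
    static void umem_fence_signal(struct umem_fence *f, uint64_t v)
    {
            atomic_store_explicit(&f->value, v, memory_order_release);
    }

    /* Wait: block until the counter reaches at least 'wanted'. */
    static void umem_fence_wait(struct umem_fence *f, uint64_t wanted)
    {
            while (atomic_load_explicit(&f->value, memory_order_acquire) < wanted)
                    ;   /* real code sleeps or waits on the GPU, not a spin */
    }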

There are a few very important points worth noting about them, however:
 - No one except the userspace client (not driver!) has any idea
when/if a value will be signaled
 - The value may never be signaled
 - Someone may submit a wait before someone submits a signal; we have
to deal with that
 - There is no concept of "about to be signaled"
 - For the kernel/firmware GPU scheduler handling waits, this means it
just keeps trying to run work and hopes the wait eventually unblocks.
It also means we need to totally decouple kernel synchronization for
memory management purposes from synchronization for work scheduling.
 - For a compositor, I'm still not 100% sure what this means.  TBD, I think.

There are some ways to work around some of these issues.  Windows has
a few tricks which we might be able to steal if we want.

Why would anyone want such a horrid thing?  Application developers
absolutely love them.  They can write massively multi-threaded apps
with piles of work queues that require very little cross-thread
synchronization and the GPU scheduler sorts it all out for them in the
end.  If you're a 3D game engine developer, this timeline model is
very powerful.  If you're a driver or window-system developer, you
really have to just embrace the lack of knowledge and trust apps.

Oof... That got long.  I hope it was informative.

--Jason


> Cheers,
> Daniel
>
>
>
>
>
>
>
>
> > When we go to
> > insert the RW fence, the actual fence we set as the new exclusive fence
> > is a combination of the sync_file provided by the user and all the other
> > fences on the dma_resv.  This ensures that the newly added exclusive
> > fence will never signal before the old one would have and ensures that
> > we don't break any dma_resv contracts.  We require userspace to specify
> > RW in the flags for symmetry with the export ioctl and in case we ever
> > want to support read fences in the future.
> >
> > There is one downside here that's worth documenting:  If two clients
> > writing to the same dma-buf using this API race with each other, their
> > actions on the dma-buf may happen in parallel or in an undefined order.
> > Both with and without this API, the pattern is the same:  Collect all
> > the fences on dma-buf, submit work which depends on said fences, and
> > then set a new exclusive (write) fence on the dma-buf which depends on
> > said work.  The difference is that, when it's all handled by the GPU
> > driver's submit ioctl, the three operations happen atomically under the
> > dma_resv lock.  If two userspace submits race, one will happen before
> > the other.  You aren't guaranteed which but you are guaranteed that
> > they're strictly ordered.  If userspace manages the fences itself, then
> > these three operations happen separately and the two render operations
> > may happen genuinely in parallel or get interleaved.  However, this is a
> > case of userspace racing with itself.  As long as we ensure userspace
> > can't back the kernel into a corner, it should be fine.
> >
> > v2 (Jason Ekstrand):
> >  - Use a wrapper dma_fence_array of all fences including the new one
> >    when importing an exclusive fence.
> >
> > v3 (Jason Ekstrand):
> >  - Lock around setting shared fences as well as exclusive
> >  - Mark SIGNAL_SYNC_FILE as a read-write ioctl.
> >  - Initialize ret to 0 in dma_buf_wait_sync_file
> >
> > v4 (Jason Ekstrand):
> >  - Use the new dma_resv_get_singleton helper
> >
> > v5 (Jason Ekstrand):
> >  - Rename the IOCTLs to import/export rather than wait/signal
> >  - Drop the WRITE flag and always get/set the exclusive fence
> >
> > v5 (Jason Ekstrand):
> >  - Split import and export into separate patches
> >  - New commit message
> >
> > Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
> > ---
> >  drivers/dma-buf/dma-buf.c    | 32 ++++++++++++++++++++++++++++++++
> >  include/uapi/linux/dma-buf.h |  1 +
> >  2 files changed, 33 insertions(+)
> >
> > diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> > index 7679562b57bfc..c9d6b9195c00c 100644
> > --- a/drivers/dma-buf/dma-buf.c
> > +++ b/drivers/dma-buf/dma-buf.c
> > @@ -417,6 +417,36 @@ static long dma_buf_export_sync_file(struct dma_buf *dmabuf,
> >         put_unused_fd(fd);
> >         return ret;
> >  }
> > +
> > +static long dma_buf_import_sync_file(struct dma_buf *dmabuf,
> > +                                    const void __user *user_data)
> > +{
> > +       struct dma_buf_sync_file arg;
> > +       struct dma_fence *fence, *singleton = NULL;
> > +       int ret = 0;
> > +
> > +       if (copy_from_user(&arg, user_data, sizeof(arg)))
> > +               return -EFAULT;
> > +
> > +       if (arg.flags != DMA_BUF_SYNC_RW)
> > +               return -EINVAL;
> > +
> > +       fence = sync_file_get_fence(arg.fd);
> > +       if (!fence)
> > +               return -EINVAL;
> > +
> > +       dma_resv_lock(dmabuf->resv, NULL);
> > +
> > +       ret = dma_resv_get_singleton_rcu(dmabuf->resv, fence, &singleton);
> > +       if (!ret && singleton)
> > +               dma_resv_add_excl_fence(dmabuf->resv, singleton);
> > +
> > +       dma_resv_unlock(dmabuf->resv);
> > +
> > +       dma_fence_put(fence);
> > +
> > +       return ret;
> > +}
> >  #endif
> >
> >  static long dma_buf_ioctl(struct file *file,
> > @@ -465,6 +495,8 @@ static long dma_buf_ioctl(struct file *file,
> >  #if IS_ENABLED(CONFIG_SYNC_FILE)
> >         case DMA_BUF_IOCTL_EXPORT_SYNC_FILE:
> >                 return dma_buf_export_sync_file(dmabuf, (void __user *)arg);
> > +       case DMA_BUF_IOCTL_IMPORT_SYNC_FILE:
> > +               return dma_buf_import_sync_file(dmabuf, (const void __user *)arg);
> >  #endif
> >
> >         default:
> > diff --git a/include/uapi/linux/dma-buf.h b/include/uapi/linux/dma-buf.h
> > index f902cadcbdb56..75fdde4800267 100644
> > --- a/include/uapi/linux/dma-buf.h
> > +++ b/include/uapi/linux/dma-buf.h
> > @@ -70,5 +70,6 @@ struct dma_buf_sync_file {
> >  #define DMA_BUF_SET_NAME_A     _IOW(DMA_BUF_BASE, 1, u32)
> >  #define DMA_BUF_SET_NAME_B     _IOW(DMA_BUF_BASE, 1, u64)
> >  #define DMA_BUF_IOCTL_EXPORT_SYNC_FILE _IOWR(DMA_BUF_BASE, 2, struct dma_buf_sync_file)
> > +#define DMA_BUF_IOCTL_IMPORT_SYNC_FILE _IOW(DMA_BUF_BASE, 3, struct dma_buf_sync)
> >
> >  #endif
> > --
> > 2.31.1
> >

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 2/4] dma-buf: add dma_resv_get_singleton_rcu (v4)
  2021-05-21 17:48   ` Daniel Vetter
@ 2021-05-24 20:04     ` Jason Ekstrand
  2021-05-25 12:38       ` Daniel Vetter
  0 siblings, 1 reply; 39+ messages in thread
From: Jason Ekstrand @ 2021-05-24 20:04 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Maling list - DRI developers

On Fri, May 21, 2021 at 12:48 PM Daniel Vetter <daniel@ffwll.ch> wrote:
>
> On Thu, May 20, 2021 at 02:00:05PM -0500, Jason Ekstrand wrote:
> > Add a helper function to get a single fence representing
> > all fences in a dma_resv object.
> >
> > This fence is either the only one in the object or all not
> > signaled fences of the object in a flattened out dma_fence_array.
> >
> > v2 (Jason Ekstrand):
> >  - Take reference of fences both for creating the dma_fence_array and in
> >    the case where we return one fence.
> >  - Handle the case where dma_resv_get_list() returns NULL
> >
> > v3 (Jason Ekstrand):
> >  - Add an _rcu suffix because it is read-only
> >  - Rewrite to use dma_resv_get_fences_rcu so it's RCU-safe
> >  - Add an EXPORT_SYMBOL_GPL declaration
> >  - Re-author the patch to Jason since very little is left of Christian
> >    König's original patch
> >  - Remove the extra fence argument
> >
> > v4 (Jason Ekstrand):
> >  - Restore the extra fence argument
> >
> > Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
> >
> > get_singleton
>
> Spurious thing here.

Fixed.

> > ---
> >  drivers/dma-buf/dma-resv.c | 122 +++++++++++++++++++++++++++++++++++++
> >  include/linux/dma-resv.h   |   3 +
> >  2 files changed, 125 insertions(+)
> >
> > diff --git a/drivers/dma-buf/dma-resv.c b/drivers/dma-buf/dma-resv.c
> > index 6ddbeb5dfbf65..25995fc15c370 100644
> > --- a/drivers/dma-buf/dma-resv.c
> > +++ b/drivers/dma-buf/dma-resv.c
> > @@ -33,6 +33,8 @@
> >   */
> >
> >  #include <linux/dma-resv.h>
> > +#include <linux/dma-fence-chain.h>
> > +#include <linux/dma-fence-array.h>
> >  #include <linux/export.h>
> >  #include <linux/mm.h>
> >  #include <linux/sched/mm.h>
> > @@ -49,6 +51,19 @@
> >   * write-side updates.
> >   */
> >
> > +/**
> > + * dma_fence_deep_dive_for_each - deep dive into the fence containers
> > + * @fence: resulting fence
> > + * @chain: variable for a dma_fence_chain
> > + * @index: index into a dma_fence_array
> > + * @head: starting point
> > + *
> > + * Helper to deep dive into the fence containers for flattening them.
> > + */
> > +#define dma_fence_deep_dive_for_each(fence, chain, index, head)      \
> > +     dma_fence_chain_for_each(chain, head)                   \
> > +             dma_fence_array_for_each(fence, index, chain)
>
> Since this is just an internal helper in the .c file we generally don't
> document it. Maybe small comment if you feel it's worth it.

Sure, I can write LESS documentation. :-P

> > +
> >  DEFINE_WD_CLASS(reservation_ww_class);
> >  EXPORT_SYMBOL(reservation_ww_class);
> >
> > @@ -517,6 +532,113 @@ int dma_resv_get_fences_rcu(struct dma_resv *obj,
> >  }
> >  EXPORT_SYMBOL_GPL(dma_resv_get_fences_rcu);
> >
> > +/**
> > + * dma_resv_get_singleton - get a single fence for the dma_resv object
>
> Name doesn't match here.

Fixed.

> > + * @obj: the reservation object
> > + * @extra: extra fence to add to the resulting array
> > + * @result: resulting dma_fence
> > + *
> > + * Get a single fence representing all unsignaled fences in the dma_resv object
> > + * plus the given extra fence. If we got only one fence return a new
> > + * reference to that, otherwise return a dma_fence_array object.
> > + *
> > + * RETURNS
> > + * Returns -NOMEM if allocations fail, zero otherwise.
>
> Kernel often encodes this in ERR_PTR so that you don't have to pass a
> pointer to store the result. Would feel more kernel-y I think that way. So
> no result parameter, and on alloc failure you'd return

Done.

>         return ERR_PTR(-ENOMEM);
>
> > + */
> > +int dma_resv_get_singleton_rcu(struct dma_resv *obj, struct dma_fence *extra,
>
> tbh the _rcu here is confusing. I think _unlocked is the better suffix,
> maybe we should rename dma_resv_get_fences_rcu too for consistency. The
> rcu-ness of the lookup isn't leaked to callers at all, so no point giving
> them a panic.

I can make that change.  I'll also include a patch in the next re-send
that renames all the _rcu helpers to _unlocked for consistency.

--Jason

> > +                            struct dma_fence **result)
> > +{
> > +     struct dma_fence **resv_fences, *fence, *chain, **fences;
> > +     struct dma_fence_array *array;
> > +     unsigned int num_resv_fences, num_fences;
> > +     unsigned int ret, i, j;
> > +
> > +     ret = dma_resv_get_fences_rcu(obj, NULL, &num_resv_fences, &resv_fences);
> > +     if (ret)
> > +             return ret;
> > +
> > +     num_fences = 0;
> > +     *result = NULL;
> > +
> > +     if (num_resv_fences == 0 && !extra)
> > +             return 0;
> > +
> > +     for (i = 0; i < num_resv_fences; ++i) {
> > +             dma_fence_deep_dive_for_each(fence, chain, j, resv_fences[i]) {
> > +                     if (dma_fence_is_signaled(fence))
> > +                             continue;
> > +
> > +                     *result = fence;
> > +                     ++num_fences;
> > +             }
> > +     }
> > +
> > +     if (extra) {
> > +             dma_fence_deep_dive_for_each(fence, chain, j, extra) {
> > +                     if (dma_fence_is_signaled(fence))
> > +                             continue;
> > +
> > +                     *result = fence;
> > +                     ++num_fences;
> > +             }
> > +     }
> > +
> > +     if (num_fences <= 1) {
> > +             *result = dma_fence_get(*result);
> > +             goto put_resv_fences;
> > +     }
> > +
> > +     fences = kmalloc_array(num_fences, sizeof(struct dma_fence*),
> > +                            GFP_KERNEL);
> > +     if (!fences) {
> > +             *result = NULL;
> > +             ret = -ENOMEM;
> > +             goto put_resv_fences;
> > +     }
> > +
> > +     num_fences = 0;
> > +     for (i = 0; i < num_resv_fences; ++i) {
> > +             dma_fence_deep_dive_for_each(fence, chain, j, resv_fences[i]) {
> > +                     if (!dma_fence_is_signaled(fence))
> > +                             fences[num_fences++] = dma_fence_get(fence);
> > +             }
> > +     }
> > +
> > +     if (extra) {
> > +             dma_fence_deep_dive_for_each(fence, chain, j, extra) {
> > +                     if (dma_fence_is_signaled(fence))
> > +                             fences[num_fences++] = dma_fence_get(fence);
> > +             }
> > +     }
> > +
> > +     if (num_fences <= 1) {
> > +             *result = num_fences ? fences[0] : NULL;
> > +             kfree(fences);
> > +             goto put_resv_fences;
> > +     }
> > +
> > +     array = dma_fence_array_create(num_fences, fences,
> > +                                    dma_fence_context_alloc(1),
> > +                                    1, false);
> > +     if (array) {
> > +             *result = &array->base;
> > +     } else {
> > +             *result = NULL;
> > +             while (num_fences--)
> > +                     dma_fence_put(fences[num_fences]);
> > +             kfree(fences);
> > +             ret = -ENOMEM;
> > +     }
> > +
> > +put_resv_fences:
> > +     while (num_resv_fences--)
> > +             dma_fence_put(resv_fences[num_resv_fences]);
> > +     kfree(resv_fences);
> > +
> > +     return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(dma_resv_get_singleton_rcu);
>
> With the nits addressed:
>
> Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
>
> > +
> >  /**
> >   * dma_resv_wait_timeout_rcu - Wait on reservation's objects
> >   * shared and/or exclusive fences.
> > diff --git a/include/linux/dma-resv.h b/include/linux/dma-resv.h
> > index d44a77e8a7e34..d49ca263e78b4 100644
> > --- a/include/linux/dma-resv.h
> > +++ b/include/linux/dma-resv.h
> > @@ -285,6 +285,9 @@ int dma_resv_get_fences_rcu(struct dma_resv *obj,
> >
> >  int dma_resv_copy_fences(struct dma_resv *dst, struct dma_resv *src);
> >
> > +int dma_resv_get_singleton_rcu(struct dma_resv *obj, struct dma_fence *extra,
> > +                            struct dma_fence **result);
> > +
> >  long dma_resv_wait_timeout_rcu(struct dma_resv *obj, bool wait_all, bool intr,
> >                              unsigned long timeout);
> >
> > --
> > 2.31.1
> >
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: EPOLL for drm_syncfile (was Re: [PATCH 4/4] RFC: dma-buf: Add an API for importing sync files (v6))
  2021-05-23 21:34       ` Daniel Stone
@ 2021-05-25  5:55         ` Daniel Vetter
  2021-05-26  9:45         ` Simon Ser
  1 sibling, 0 replies; 39+ messages in thread
From: Daniel Vetter @ 2021-05-25  5:55 UTC (permalink / raw)
  To: Daniel Stone
  Cc: Jason Ekstrand, Daniel Vetter, Christian König, dri-devel

On Sun, May 23, 2021 at 11:34 PM Daniel Stone <daniel@fooishbar.org> wrote:
>
> Hi Christian,
>
> On Sun, 23 May 2021 at 18:16, Christian König <christian.koenig@amd.com> wrote:
> > Am 22.05.21 um 22:05 schrieb Daniel Stone:
> > > Anyway, the problem with syncobj is that the ioctl to wait for a
> > > sync_file to materialise for a given timeline point only allows us to
> > > block with a timeout; this is a non-starter, because we need something
> > > which fits into epoll. The most optimal case is returning a new
> > > eventfd or similar which signals when a given timeline point becomes
> > > available or signaled, but in extremis a syncobj FD becoming readable
> > > when any activity which would change the result of any zero-timeout
> > > wait on that syncobj is more or less workable.
> >
> > I think the tricky part is to epoll for a certain value.
> >
> > Not sure how eventfd is supposed to work, but IIRC we don't have the
> > functionality to poll for a certain value/offset etc to become available.
> >
> > We could of course create a separate fd for each requested value to poll
> > for though, but that sounds like a bit too much overhead to me.
>
> Yeah, I understand the point; something like an eventfd is exactly
> what you think, that we would have to materialise a new FD for it. On
> the other hand, as soon as the sync point becomes available, the
> winsys will want to immediately extract a sync_file from it, so I
> don't think FD amplification is a big issue. If it looks like being a
> problem in practice, we could look at providing a FD which starts off
> inert, and only later becomes a sync_file after it polls readable, but
> that sounds like something we really want to avoid if we can.

We can't change the personality of a filp in the kernel because there's no
locking for that. But we can add a future-fence state to sync_file for
this, and then make sure sync_file_get_fence() fails on such a sync_file
as long as the fence hasn't materialized. That's required because the
current uapi is that a sync_file is never a future fence. It also opens
the door for future fences in sync_file more generally, which might be
useful for all our userspace fence ideas.

Then we could support poll on that sync_file to wait for the fence to
materialize using POLLPRI. man poll suggests this isn't the dumbest
thing to go with, and grep in the kernel shows plenty of users (it's
EPOLLPRI internally).

With this we'd just need a new flag for allowing this mode in
sync_handle_to_fd_ioctl, and a fairly small amount of typing to glue
it together.
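
Rough compositor-side sketch of what that could look like (entirely
hypothetical uAPI: the future-fence export flag does not exist today,
and POLLPRI-on-sync_file is just the proposal above):

    #include <poll.h>
    #include <stdbool.h>

    /* sync_fd: a sync_file fd exported in the proposed future-fence mode. */
    static bool sync_file_fence_materialized(int sync_fd)
    {
            struct pollfd pfd = {
                    .fd = sync_fd,
                    .events = POLLPRI,  /* proposed: fence has materialized */
            };

            /* zero timeout: plain readiness check, epoll works the same way */
            return poll(&pfd, 1, 0) > 0 && (pfd.revents & POLLPRI);
    }

Once this reports true, sync_file_get_fence() would be allowed to
succeed and the fd behaves like any other sync_file from then on.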
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/4] RFC: dma-buf: Add an API for importing sync files (v6)
  2021-05-24 17:11     ` [PATCH 4/4] RFC: dma-buf: Add an API for importing sync files (v6) Jason Ekstrand
@ 2021-05-25  6:27       ` Daniel Vetter
  2021-05-26 11:08       ` Daniel Stone
  2021-05-26 12:21       ` Vlad Zahorodnii
  2 siblings, 0 replies; 39+ messages in thread
From: Daniel Vetter @ 2021-05-25  6:27 UTC (permalink / raw)
  To: Jason Ekstrand; +Cc: Daniel Vetter, dri-devel, Christian König

On Mon, May 24, 2021 at 7:11 PM Jason Ekstrand <jason@jlekstrand.net> wrote:
>
> On Sat, May 22, 2021 at 3:05 PM Daniel Stone <daniel@fooishbar.org> wrote:
> >
> > Hi,
> >
> > On Thu, 20 May 2021 at 20:00, Jason Ekstrand <jason@jlekstrand.net> wrote:
> > > In the Vulkan world, this is useful for dealing with the out-fence from
> > > vkQueuePresent.  Current Linux window-systems (X11, Wayland, etc.) all
> > > rely on dma-buf implicit sync.  Since Vulkan is an explicit sync API, we
> > > get a set of fences (VkSemaphores) in vkQueuePresent and have to stash
> > > those as an exclusive (write) fence on the dma-buf.  We handle it in
> > > Mesa today with the above mentioned dummy submit trick.  This ioctl
> > > would allow us to set it directly without the dummy submit.
> > >
> > > This may also open up possibilities for GPU drivers to move away from
> > > implicit sync for their kernel driver uAPI and instead provide sync
> > > files and rely on dma-buf import/export for communicating with other
> > > implicit sync clients.
> > >
> > > We make the explicit choice here to only allow setting RW fences which
> > > translates to an exclusive fence on the dma_resv.  There's no use for
> > > read-only fences for communicating with other implicit sync userspace
> > > and any such attempts are likely to be racy at best.
> >
> > I think I almost follow, but I'm not quite seeing the race you allude
> > to. Let me talk through my understanding so it's hopefully more clear
> > for others as a summary. Please correct me on anything I've
> > misunderstood or just missed completely. (I thought when I wrote this
> > intro that the email might be relatively snappy, but it's really long
> > and takes in a lot of breadth. Sorry.)
> >
> > So as far as I'm reading this patchset and the Mesa MR, this _only_
> > concerns the out-fence (i.e. compositor -> client AcquireNextImage
> > semaphore/fence) so far. The in-fence (client->compositor QueuePresent
> > wait-semaphores/fences) is handled by the driver ensuring that an
> > exclusive resv is placed on the union of the signal semaphores passed
> > to QueuePresent, either through flags on its CS ioctl, or amdgpu's BO
> > flags, or ... either way, no problem as it should always be exclusive,
> > and it seems pretty uncontroversial that we should pull the wait
> > semaphore into an exclusive fence so that no downstream consumer will
> > begin using it until the client ops have fully retired.
> >
> > For AcquireNextImage, your patchset exports all the fences (shared and
> > exclusive both) on the dma_resv out into the semaphore/fence such that
> > the client can't progress (CPU-side for VkFence, GPU-side for
> > VkSemaphore) until all currently queued operations have completely
> > retired. So, as long as the server ensures that all its kernel-side
> > work is flushed before its IPC to unblock ANI (wl_buffer.release or
> > DRI3 PresentIdle, both indicating that the client is free to reuse the
> > buffer, subject to synchronising against implicit fences on the resv),
> > all good: we snapshot the current resv state, wrap the relevant
> > driver-side Vulkan primitive around it, and go back to explicit
> > synchronisation. The client can't race generating new work against
> > this, because it can't queue any work until ANI has returned and
> > queued a signaling op on the semaphore/fence.
> >
> > So far, so good. I really like both your series up until this
> > narrative point; as I was saying in the userspace-fence thread, the
> > WSI<->client thread is certainly pulling a very big lever with a
> > heavyweight transition between the two different worlds, and I like
> > that the new export ioctl lets us be very clear about what exactly is
> > happening under the hood. Good stuff.
>
> Glad to hear.  If you turned that into RBs on the first three patches
> in this series and all but the last patch on the Mesa MR, it'd make me
> even happier. :-D
>
> At this point, I think everyone is pretty happy with the first three
> patches and the export ioctl.  In the Vulkan WSI code, it solves a
> significant over-synchronization issue for ANV.  Also, the uAPI
> shouldn't be controversial at all because it's identical to poll()
> except that it gives you a FD you can poll on later to get the result
> of the poll as it would be now.  I think if we get some Mesa reviews,
> we should be able to land those.  It's import that's trickier.
>
> > So, what gives with the import ioctl? Taking a guess at where you're
> > going, the import ioctl is going to be used in QueuePresent just as
> > the export ioctl is in ANI: instead of having CS flags to write into
> > the resv exclusive slot or per-BO flags to always dump in exclusive,
> > there's a forthcoming patch somewhere which lets drivers skip this and
> > instead have common QueuePresent code dump the wait semaphore into the
> > resv, so servers on the other side of an implicit-only protocol will
> > synchronise their access against the fence imported as part of
> > client-side QueuePresent?
>
> Correct.  And the patch isn't forthcoming.  It already exists as the
> top patch in the Mesa MR (!4037).
>
> > That makes sense to me and is nicely symmetrical, plus it gets GPU
> > drivers further out of the business of doing magic winsys/display
> > synchronisation, which is good. But why enforce that imported fences
> > have to go into the exclusive slot? It makes sense from the point of
> > view of clients doing QueueSubmit, but I think it might preclude other
> > uses for no particularly good reason.
>
> Mostly, I was trying to limit the scope a bit.  Import is tricky and,
> to be honest, I'm still not 100% sure that it's safe.  I probably
> should have left RFC on this patch.
>
> As long as we keep everything inside the kernel, there's a requirement
> that any work which triggers any fence on a dma_resv waits on the
> exclusive fence, if any.  Work which sets the exclusive fence has to
> wait on all fences.  With the import ioctl, we can guarantee that the
> fences fire in the right order by making an array fence containing the
> new fence and all other fences and using that as the exclusive fence.
> What we can't do, however, is ensure that the work which triggers the
> fence actually blocks on ANY of the fences on the dma_resv.
>
> Hrm... I think I've now convinced myself that importing shared fences
> is no more dangerous than importing an exclusive fence because they
> share the same set of problems.  Unfortunately, I've unconvinced
> myself of the safety of either.  I've got to think some more....
>
> The most convincing argument I can make to myself is that this ioctl
> is like having a one-shot little queue that contains tiny little work
> jobs which wait on whatever sync_file you provide, do nothing, and
> then signal.  That should be safe, right?

Yup. Same for shared. If that breaks a driver, then things are pretty
badly broken already.

What I'm not so sure about is completely unsynced behaviour between
exclusive and shared slot. Internally I think drivers just wait for
both (but I haven't done a full audit), see
drm_gem_fence_array_add_implicit(). And the only thing that's fully
requiring that you never ignore the exclusive fence is dynamic
dma-buf sharing (atm only amdgpu supports that), at least until we've
split the exclusive slot in dma_resv into the userspace implicit sync
and kernel-internal memory management part.

So within a driver, as long as the buffer isn't shared as dynamic
dma-buf (non-dynamic is fine, those are pinned, so never any concern
about the buffer moving around), then I think drivers can attach fences
however they like, and the strict ordering just oversynchronizes.

But for something cross-driver like this it probably makes sense to
(for now) uphold the rule that if there are both shared and exclusive
slots present, then the shared fences are each guaranteed to get
signalled after the exclusive one. This is also overkill I think, and
something we might want to lift later on - I'm mildly afraid it will
cause pointless oversync since we end up mixing exclusive/shared
slots. Especially in cases where both sides of a pipeline do a lot of
readbacks of rendered buffers (e.g. codec reference frames on one side
and some post-processing on the other). Anyway I think it's fixable
later on.
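
For reference, the helper I mean does roughly the following (paraphrased
from memory, not a verbatim copy, and with error handling trimmed): a
read-only job only waits for the exclusive (writer) fence, while a
writing job collects the exclusive fence plus all shared fences and
waits for the lot:

    static int add_implicit_deps(struct xarray *deps,
                                 struct drm_gem_object *obj, bool write)
    {
            struct dma_fence **fences;
            unsigned int i, count;
            int ret;

            if (!write) {
                    /* readers only order against the current writer */
                    return drm_gem_fence_array_add(deps,
                                    dma_resv_get_excl_rcu(obj->resv));
            }

            /* writers order against everything currently on the resv */
            ret = dma_resv_get_fences_rcu(obj->resv, NULL, &count, &fences);
            if (ret)
                    return ret;

            for (i = 0; i < count; i++)
                    drm_gem_fence_array_add(deps, fences[i]);

            kfree(fences);
            return 0;
    }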

> > Certainly on X11 there's no real use for the shared slot - you get
> > into difficulties with mixed client/server rendering - but at least
> > Wayland and GBM are always non-destructive, so conceptually have a
> > good use for being able to import fences into the shared resv. This
> > goes for media and KMS access as well, if you think beyond the usecase
> > of an explicit Vulkan client sending through to a dumb implicit server
> > which just wants to texture.
> >
> > Media clients are definitely a relevant usecase, both directly with
> > V4L2/VA-API, neither of which have support for explicit
> > synchronisation yet and (at least for V4L2) are unlikely to do so in
> > the near future, but even more with pipeline frameworks like GStreamer
> > and PipeWire, which don't have fundamental awareness (they're prepared
> > to deal with deep pipelines which return at arbitrary times, but once
> > anything is returned, it is available for immediate use). Maybe
> > open-coding CPU-side waits in these users is the best way longer term,
> > but for easier interop if nothing else, being able to pull shared
> > fences in seems like it could be a win ('the compositor is still
> > texturing from this for now, so feel free to read back from ref
> > frames, but don't scribble over it until it's finished').
> >
> > I'm slightly bemused that there's so much energy spent on designing
> > around winsys being dumb and implicit.
>
> I'd like to address this one as it's a comment you've made several
> times.  Once you've fixed raw X11 (not just XWayland) and a new
> release has been made (hah!) and is shipping in distros with said
> support, then we can talk.  Sorry if that comes off as overly snarky
> but that's reality that we (driver devs) are living with.  To make
> things even worse, when we're in Vulkan land (as opposed to GL), we
> can't tell up-front whether or not our window-system supports foo
> fences and adjust accordingly.  We have to start up, begin rendering,
> and only later figure out "oops, this one goes to X11".  We really
> can't say things like "when running on modern Wayland, do things the
> new way" because Vulkan doesn't have a concept of "running on" a
> window system.
>
> FWIW, I do have a Mesa MR somewhere which adds explicit sync for
> Vulkan+Wayland when the wayland protocols are there.  I don't remember
> why it didn't land.  Maybe lack of acquire fence support?  I think the
> protocol issues have been fixed, so we should up-rev the requirements
> as needed and land it.
>
> > We did take a long time to roll
> > out sync_file support, but that was only because we didn't quite
> > understand all the nuances of why implicit sync is so difficult for
> > GPU drivers on modern architectures and how it was actually a win
> > compared to doing nothing; maybe we should have some kind of
> > conference where we all get together and talk to each other ... ?
> > Anyway, by the time we got to the cusp of rolling out bi-directional
> > sync_file support, suddenly we had drm_syncobj. By the time that had
> > semi-settled down and started to be rolled out, then we suddenly had
> > userspace fences on the horizon. And what do we do with those?
>
> You're not wrong....  And that's the second reason for the gymnastics
> above.  Ever since Vulkan launched, we've been fumbling around trying
> to figure out how to best do synchronization.  I'm reasonably certain
> that userspace memory fences are the future but I'm much less certain
> about the path to get there.  It's been a process of patient
> experimentation so far and I think we're getting closer.  Syncobj,
> timeline syncobj, etc. have all been steps along that path.  I've been
> hesitant to ask the window-system people to draft piles of extensions
> until we're sure we've found "the one".  It's bad enough iterating in
> kernel-space and userspace without bringing Wayland protocol into each
> iteration step.  For that reason, one of my goals along this process
> has been to try and decouple things as much as we can.
>
> So, in summary, none of this is because I think that window systems
> are dumb and implicit or for any lack of respect for the people that
> work on them.  I assume that X11 will always be dumb and implicit.
> (I'd love to be proven wrong!)  For Wayland (and XWayland, probably),
> I assume the people are very smart and active and will implement
> whatever APIs we (the driver people) need as long as they're
> reasonable.  I just don't know what to ask for yet.
>
> > We've - certainly Weston, and I'm pretty confident I speak for Simon
> > on the wlroots side and Jonas for Mutter - landed on accepting that
> > we're going to have to deal with timeline semaphores and
> > wait-before-signal; we have a credible plan (we think) for protocol to
> > deal with at least syncobj, which should conceptually extend to
> > userspace fences. The biggest blocker there is the syncobj uAPI.
> > Compositors aren't Vulkan clients - we don't have one thread per group
> > of work, because the inter-client synchronisation becomes nightmarish
> > overhead for little gain. I don't think this will ever change, because
> > the balance of work is totally different between client and
> > compositor.
> >
> > Anyway, the problem with syncobj is that the ioctl to wait for a
> > sync_file to materialise for a given timeline point only allows us to
> > block with a timeout; this is a non-starter, because we need something
> > which fits into epoll. The most optimal case is returning a new
> > eventfd or similar which signals when a given timeline point becomes
> > available or signaled, but in extremis a syncobj FD becoming readable
> > when any activity which would change the result of any zero-timeout
> > wait on that syncobj is more or less workable.
>
> Right.  You could call this an oversight.  Really, though, it was
> because the use-cases we were interested in were ones where a wait
> with a timeout was perfectly acceptable and where the extra overhead
> of setting an epoll with ioctls wasn't.  It shouldn't be too hard to
> wire up if that helps (cross your fingers).  But....
>
> If we go the direction of userspace memory fences (which I think is
> likely), there is no such thing as "wait for the fence to
> materialize".  The work is either done or it isn't.  There is no
> enforceable concept of "about to be done".  The word "enforceable" is
> important there.  We could add such a concept but it'd be dependent on
> the userspace client (not driver) to let us know when it's just about
> ready and then we'd have to VK_ERROR_DEVICE_LOST on them or similar if
> they break the contract.  Maybe possible but there's some design work
> required there.  The other option is to make the compositors handle
> this new way of thinking more thoroughly and eat the latency hit.
>
> > What we do want to do though, regardless of the primitive or its
> > semantics, is to only have to go through this once, not rework it all
> > in six months, and have to support a bunch of diverging codepaths. So
> > what is the optimal synchronisation primitive we should be aiming for
> > on the winsys side? Is sync_file a good enough lowest common
> > denominator, or should we do timeline-syncobj for fancy clients, in
> > tandem with sync_file bolted on the side for media and ancient GPUs?
> > Or are userspace fences going to give us some new primitive? And if
> > so, can we please talk about those semantics now, so we don't end up
> > with three synchronisation mechanisms which are all sort of but not
> > really alike?
>
> As I said above, I think we're getting close but I don't think we're there yet.
>
> > As far as I can tell, the three relevant vendors with (more or less)
> > upstream drivers here are AMD, Arm, and Intel. Arm is obviously a bit
> > more up in the air as the hardware and specs aren't currently
> > available to the upstream development teams, but hopefully we can
> > bring them into this conversation. I think it's looking like we're
> > optimising for all of the following, without leaving anyone out in the
> > cold:
> >
> > 1. Modern userspace-fencing hardware running on a
> > userspace-fencing-aware winsys, i.e. new AMD/Arm/Intel on one of the
> > big three Wayland compositors which handle enough synchronisation
> > logic internally that the choice of underlying
> > composition/presentation API is immaterial (for which any new API is
> > important)
> > 2. Modern userspace-fencing hardware running on Xwayland, for which
> > we'll inevitably have to type out new DRI3 synchronisation, but
> > derived purely from the Wayland protocols so it can be passed through
> > quite directly, and with any mixed X11<->user buffer client
> > interaction (e.g. subwindows) being penalised by a long blocking wait
> > in the X server
> > 3. sync_file-oriented hardware, for which we need to do ~nothing
> > 4. Modern userspace-fencing hardware and APIs which need to interop
> > with media units, for all four combinations of client/server
> > source/sink; for some APIs like Vulkan Video synchronisation is not a
> > problem, for others like VA-API/V4L2/GStreamer we're probably going
> > to need to live with the implicit semantics for the foreseeable future,
> > which means we should do the right thing for them up front, because
> > why would you even be playing games if you're not streaming them to
> > Twitch at the same time? (Personally I'm much happier that no-one
> > other than my friends who already know I'm terrible can see me playing
> > CoD, but y'know, kids these days ...)
> >
> > The fifth case I've left out is clients who want to smash Vulkan
> > content directly into the display. For games and Kodi-like UI I'm just
> > going to ignore this, because (maybe I'm biased, but) using KMS
> > directly is difficult enough that you shouldn't do it and just use a
> > winsys because we've spent a very long time dealing with these
> > problems for you. The remaining usecase there is XR, but their uses
> > are very different, and OpenXR already deals with a _lot_ of this for
> > you, with the runtimes having sufficiently sophisticated
> > synchronisation internally that whatever we come up with won't be
> > onerous for them to implement. Either way, integration with KMS/GBM
> > has the same problem as Wayland, in that unless you fully encapsulate
> > it in something like VK_KHR_display, you don't get to throw a thread
> > under the bus to do delayed submit, because you need to actually
> > return something to the caller.
>
> You're missing a use-case:  Modern userspace-fencing hardware running
> on bare X11.  I don't know that we should optimize for this case, so
> to speak, but it has to work non-terribly.  Also, as I alluded to
> above, I really don't want to be maintaining two drivers forever: One
> for X11 and ancient Wayland and one for modern Wayland.  We need to be
> able to modernize the driver and still support the old window-systems.
> Unless you can promise me that X11 and KDE/Wayland will either go away
> or else get modernized, I need a fall-back plan.  And even if you make
> me said promise, I'm not going to believe you. :-P  8-9 years ago, I
> was one of the people making those promises.  Now I'm writing X11
> winsys code for drivers.  And... now I'm re-thinking some of my life
> choices....
>
> > Ultimately, I think I'm leaning towards agreeing with Christian that I
> > would like to see a good holistic model for how this works in a
> > variety of usecases before we progress new uAPI, but also to agreeing
> > with you and Dan in that how amdgpu currently implements things is
> > currently objectively very wrong (despite the motivations for having
> > implemented it that way).
> >
> > If there are any usecases I've missed then I'm all ears, but else I
> > think we should take this forward by working with
> > Vulkan/EGL/Wayland/X11/media/KMS and coming up with realistic
> > skeletons for end-to-end synchronisation through those usecases. It's
> > clear that this never happened for syncobj, because it's just not
> > usable as a primitive in any compositor which exists today: let's make
> > sure we don't get into the same trap again. If we can do this over
> > email/GitLab then that's great, but if not, maybe we need to do a kind
> > of mini-XDC with some kind of virtual whiteboard we can all scribble
> > over.
>
> I think what we're looking at is roughly three steps, the first two of
> which are mostly there on the Wayland side:
>
>  1. Implicit sync.  We know what this one is.  glFlush/vkQueueSubmit
> on the one side, start texturing on the other, and it all works.
>
>  2. Immutable SW fences, i.e. sync_file.  This is where we have a
> fence object that gets returned from the flush/submit and can be
> passed into the texture op to provide a dependency.  Syncobj may be
> useful here but only as a container.  If a sync_file is a dma_fence*,
> a syncobj should be thought of as a dma_fence**.  This may be useful
> if we want to retrofit sync_file into X11 where the current DRI3
> explicit sync stuff is based on creating an object and then re-using
> it for every present.
>
>  3. Userspace memory fences.
>
> Note that timeline syncobj is NOT in that list.  IMO, all the "wait
> for submit" stuff is an implementation detail we needed in order to
> get the timeline semantics on top of immutable SW fences.  Under the
> hood it's all dma_fence; this just gives us a shareable container so
> we can implement VK_KHR_timeline_semaphore with sharing.  I really
> don't want to make Wayland protocol around it if memory fences are the
> final solution.

Note that I expect there will be gpus for a very long time which won't
do userspace fences. They might do memory fences, but not the full
userspace submit story. Establishing the container object aka
drm_syncobj for that case I think still makes sense, since it's
something they can fully support. Also I do think even full userspace
submit hw can fully support the dma_fence model (at least I haven't seen
hw yet where that shouldn't work), you "just" lose stuff like hw page
fault&repair on the gpu.

> > (As a coda to this, I'm pretty sure I understand the subtleties of the
> > memory fences for relocation/shootdown, but it would be _really_ good
> > to have a coherent description everyone agrees on written down
> > somewhere, so people can understand the issues without having to read
> > 250 emails with people at loggerheads.)
>
> Let me give it a try.  I define a userspace memory fence as follows:
>  - The fence object is fundamentally a bit of mappable gpu-accessible
> memory which stores a 64-bit counter value which is used as a
> timeline.  (On Windows, they require it to live in system memory.)
>  - For sharable memory fences, each one probably has to go in its own page. :-(
>  - The value is initialized to 0.
>  - To signal the fence, someone (GPU or CPU) writes a new 64-bit value.
>  - Waits on the fence are waits for it to be >= some value.
>
> Depending on circumstances, the wait may be a 32-bit comparison and
> may be >= with wrap-around.  For the purposes of window-system
> protocol, I think we can assume 64-bit >= with no wrap-around.
>
> There are a few very important points worth noting about them, however:
>  - No one except the userspace client (not driver!) has any idea
> when/if a value will be signaled
>  - The value may never be signaled
>  - Someone may submit a wait before someone submits a signal; we have
> to deal with that
>  - There is no concept of "about to be signaled"
>  - For the kernel/firmware GPU scheduler handling waits, this means it
> just keeps trying to run work and hopes the wait eventually unblocks.
> It also means we need to totally decouple kernel synchronization for
> memory management purposes from synchronization for work scheduling.
>  - For a compositor, I'm still not 100% sure what this means.  TBD, I think.
>
> There are some ways to work around some of these issues.  Windows has
> a few tricks which we might be able to steal if we want.
>
> Why would anyone want such a horrid thing?  Application developers
> absolutely love them.  They can write massively multi-threaded apps
> with piles of work queues that require very little cross-thread
> synchronization and the GPU scheduler sorts it all out for them in the
> end.  If you're a 3D game engine developer, this timeline model is
> very powerful.  If you're a driver or window-system developer, you
> really have to just embrace the lack of knowledge and trust apps.
>
> Oof... That got long.  I hope it was informative.

Yeah maybe smashing dma_fence based syncobj into the same interface as
userspace memory fences isn't the best idea ...
-Daniel

>
> --Jason
>
>
> > Cheers,
> > Daniel
> >
> >
> >
> >
> >
> >
> >
> >
> > > When we go to
> > > insert the RW fence, the actual fence we set as the new exclusive fence
> > > is a combination of the sync_file provided by the user and all the other
> > > fences on the dma_resv.  This ensures that the newly added exclusive
> > > fence will never signal before the old one would have and ensures that
> > > we don't break any dma_resv contracts.  We require userspace to specify
> > > RW in the flags for symmetry with the export ioctl and in case we ever
> > > want to support read fences in the future.
> > >
> > > There is one downside here that's worth documenting:  If two clients
> > > writing to the same dma-buf using this API race with each other, their
> > > actions on the dma-buf may happen in parallel or in an undefined order.
> > > Both with and without this API, the pattern is the same:  Collect all
> > > the fences on dma-buf, submit work which depends on said fences, and
> > > then set a new exclusive (write) fence on the dma-buf which depends on
> > > said work.  The difference is that, when it's all handled by the GPU
> > > driver's submit ioctl, the three operations happen atomically under the
> > > dma_resv lock.  If two userspace submits race, one will happen before
> > > the other.  You aren't guaranteed which but you are guaranteed that
> > > they're strictly ordered.  If userspace manages the fences itself, then
> > > these three operations happen separately and the two render operations
> > > may happen genuinely in parallel or get interleaved.  However, this is a
> > > case of userspace racing with itself.  As long as we ensure userspace
> > > can't back the kernel into a corner, it should be fine.
> > >
> > > v2 (Jason Ekstrand):
> > >  - Use a wrapper dma_fence_array of all fences including the new one
> > >    when importing an exclusive fence.
> > >
> > > v3 (Jason Ekstrand):
> > >  - Lock around setting shared fences as well as exclusive
> > >  - Mark SIGNAL_SYNC_FILE as a read-write ioctl.
> > >  - Initialize ret to 0 in dma_buf_wait_sync_file
> > >
> > > v4 (Jason Ekstrand):
> > >  - Use the new dma_resv_get_singleton helper
> > >
> > > v5 (Jason Ekstrand):
> > >  - Rename the IOCTLs to import/export rather than wait/signal
> > >  - Drop the WRITE flag and always get/set the exclusive fence
> > >
> > > v5 (Jason Ekstrand):
> > >  - Split import and export into separate patches
> > >  - New commit message
> > >
> > > Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
> > > ---
> > >  drivers/dma-buf/dma-buf.c    | 32 ++++++++++++++++++++++++++++++++
> > >  include/uapi/linux/dma-buf.h |  1 +
> > >  2 files changed, 33 insertions(+)
> > >
> > > diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> > > index 7679562b57bfc..c9d6b9195c00c 100644
> > > --- a/drivers/dma-buf/dma-buf.c
> > > +++ b/drivers/dma-buf/dma-buf.c
> > > @@ -417,6 +417,36 @@ static long dma_buf_export_sync_file(struct dma_buf *dmabuf,
> > >         put_unused_fd(fd);
> > >         return ret;
> > >  }
> > > +
> > > +static long dma_buf_import_sync_file(struct dma_buf *dmabuf,
> > > +                                    const void __user *user_data)
> > > +{
> > > +       struct dma_buf_sync_file arg;
> > > +       struct dma_fence *fence, *singleton = NULL;
> > > +       int ret = 0;
> > > +
> > > +       if (copy_from_user(&arg, user_data, sizeof(arg)))
> > > +               return -EFAULT;
> > > +
> > > +       if (arg.flags != DMA_BUF_SYNC_RW)
> > > +               return -EINVAL;
> > > +
> > > +       fence = sync_file_get_fence(arg.fd);
> > > +       if (!fence)
> > > +               return -EINVAL;
> > > +
> > > +       dma_resv_lock(dmabuf->resv, NULL);
> > > +
> > > +       ret = dma_resv_get_singleton_rcu(dmabuf->resv, fence, &singleton);
> > > +       if (!ret && singleton)
> > > +               dma_resv_add_excl_fence(dmabuf->resv, singleton);
> > > +
> > > +       dma_resv_unlock(dmabuf->resv);
> > > +
> > > +       dma_fence_put(fence);
> > > +
> > > +       return ret;
> > > +}
> > >  #endif
> > >
> > >  static long dma_buf_ioctl(struct file *file,
> > > @@ -465,6 +495,8 @@ static long dma_buf_ioctl(struct file *file,
> > >  #if IS_ENABLED(CONFIG_SYNC_FILE)
> > >         case DMA_BUF_IOCTL_EXPORT_SYNC_FILE:
> > >                 return dma_buf_export_sync_file(dmabuf, (void __user *)arg);
> > > +       case DMA_BUF_IOCTL_IMPORT_SYNC_FILE:
> > > +               return dma_buf_import_sync_file(dmabuf, (const void __user *)arg);
> > >  #endif
> > >
> > >         default:
> > > diff --git a/include/uapi/linux/dma-buf.h b/include/uapi/linux/dma-buf.h
> > > index f902cadcbdb56..75fdde4800267 100644
> > > --- a/include/uapi/linux/dma-buf.h
> > > +++ b/include/uapi/linux/dma-buf.h
> > > @@ -70,5 +70,6 @@ struct dma_buf_sync_file {
> > >  #define DMA_BUF_SET_NAME_A     _IOW(DMA_BUF_BASE, 1, u32)
> > >  #define DMA_BUF_SET_NAME_B     _IOW(DMA_BUF_BASE, 1, u64)
> > >  #define DMA_BUF_IOCTL_EXPORT_SYNC_FILE _IOWR(DMA_BUF_BASE, 2, struct dma_buf_sync_file)
> > > +#define DMA_BUF_IOCTL_IMPORT_SYNC_FILE _IOW(DMA_BUF_BASE, 3, struct dma_buf_sync)
> > >
> > >  #endif
> > > --
> > > 2.31.1
> > >



-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 2/4] dma-buf: add dma_resv_get_singleton_rcu (v4)
  2021-05-24 20:04     ` Jason Ekstrand
@ 2021-05-25 12:38       ` Daniel Vetter
  2021-05-25 15:50         ` Christian König
  0 siblings, 1 reply; 39+ messages in thread
From: Daniel Vetter @ 2021-05-25 12:38 UTC (permalink / raw)
  To: Jason Ekstrand, Christian König; +Cc: Maling list - DRI developers

On Mon, May 24, 2021 at 03:04:35PM -0500, Jason Ekstrand wrote:
> On Fri, May 21, 2021 at 12:48 PM Daniel Vetter <daniel@ffwll.ch> wrote:
> >
> > On Thu, May 20, 2021 at 02:00:05PM -0500, Jason Ekstrand wrote:
> > > Add a helper function to get a single fence representing
> > > all fences in a dma_resv object.
> > >
> > > This fence is either the only one in the object or all not
> > > signaled fences of the object in a flattened out dma_fence_array.
> > >
> > > v2 (Jason Ekstrand):
> > >  - Take reference of fences both for creating the dma_fence_array and in
> > >    the case where we return one fence.
> > >  - Handle the case where dma_resv_get_list() returns NULL
> > >
> > > v3 (Jason Ekstrand):
> > >  - Add an _rcu suffix because it is read-only
> > >  - Rewrite to use dma_resv_get_fences_rcu so it's RCU-safe
> > >  - Add an EXPORT_SYMBOL_GPL declaration
> > >  - Re-author the patch to Jason since very little is left of Christian
> > >    König's original patch
> > >  - Remove the extra fence argument
> > >
> > > v4 (Jason Ekstrand):
> > >  - Restore the extra fence argument
> > >
> > > Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
> > >
> > > get_singleton
> >
> > Spurious thing here.
> 
> Fixed.
> 
> > > ---
> > >  drivers/dma-buf/dma-resv.c | 122 +++++++++++++++++++++++++++++++++++++
> > >  include/linux/dma-resv.h   |   3 +
> > >  2 files changed, 125 insertions(+)
> > >
> > > diff --git a/drivers/dma-buf/dma-resv.c b/drivers/dma-buf/dma-resv.c
> > > index 6ddbeb5dfbf65..25995fc15c370 100644
> > > --- a/drivers/dma-buf/dma-resv.c
> > > +++ b/drivers/dma-buf/dma-resv.c
> > > @@ -33,6 +33,8 @@
> > >   */
> > >
> > >  #include <linux/dma-resv.h>
> > > +#include <linux/dma-fence-chain.h>
> > > +#include <linux/dma-fence-array.h>
> > >  #include <linux/export.h>
> > >  #include <linux/mm.h>
> > >  #include <linux/sched/mm.h>
> > > @@ -49,6 +51,19 @@
> > >   * write-side updates.
> > >   */
> > >
> > > +/**
> > > + * dma_fence_deep_dive_for_each - deep dive into the fence containers
> > > + * @fence: resulting fence
> > > + * @chain: variable for a dma_fence_chain
> > > + * @index: index into a dma_fence_array
> > > + * @head: starting point
> > > + *
> > > + * Helper to deep dive into the fence containers for flattening them.
> > > + */
> > > +#define dma_fence_deep_dive_for_each(fence, chain, index, head)      \
> > > +     dma_fence_chain_for_each(chain, head)                   \
> > > +             dma_fence_array_for_each(fence, index, chain)
> >
> > Since this is just an internal helper in the .c file we generally don't
> > document it. Maybe small comment if you feel it's worth it.
> 
> Sure, I can write LESS documentation. :-P
> 
> > > +
> > >  DEFINE_WD_CLASS(reservation_ww_class);
> > >  EXPORT_SYMBOL(reservation_ww_class);
> > >
> > > @@ -517,6 +532,113 @@ int dma_resv_get_fences_rcu(struct dma_resv *obj,
> > >  }
> > >  EXPORT_SYMBOL_GPL(dma_resv_get_fences_rcu);
> > >
> > > +/**
> > > + * dma_resv_get_singleton - get a single fence for the dma_resv object
> >
> > Name doesn't match here.
> 
> Fixed.
> 
> > > + * @obj: the reservation object
> > > + * @extra: extra fence to add to the resulting array
> > > + * @result: resulting dma_fence
> > > + *
> > > + * Get a single fence representing all unsignaled fences in the dma_resv object
> > > + * plus the given extra fence. If we got only one fence return a new
> > > + * reference to that, otherwise return a dma_fence_array object.
> > > + *
> > > + * RETURNS
> > > + * Returns -NOMEM if allocations fail, zero otherwise.
> >
> > Kernel often encodes this in ERR_PTR so that you don't have to pass a
> > pointer to store the result. Would feel more kernel-y I think that way. So
> > no result parameter, and on alloc failure you'd return
> 
> Done.
> 
> >         return ERR_PTR(-ENOMEM);
> >
> > > + */
> > > +int dma_resv_get_singleton_rcu(struct dma_resv *obj, struct dma_fence *extra,
> >
> > tbh the _rcu here is confusing. I think _unlocked is the better suffix,
> > maybe we should rename dma_resv_get_fences_rcu too for consistency. The
> > rcu-ness of the lookup isn't leaked to callers at all, so no point giving
> > them a panic.
> 
> I can make that change.  I'll also include a patch in the next re-send
> that renames all the _rcu helpers to _unlocked for consistency.

Maybe double-check with Christian König whether he's on board with this
bikeshed, but the current _rcu postfix we have in some dma_resv functions
really confuses me.
-Daniel

> 
> --Jason
> 
> > > +                            struct dma_fence **result)
> > > +{
> > > +     struct dma_fence **resv_fences, *fence, *chain, **fences;
> > > +     struct dma_fence_array *array;
> > > +     unsigned int num_resv_fences, num_fences;
> > > +     unsigned int ret, i, j;
> > > +
> > > +     ret = dma_resv_get_fences_rcu(obj, NULL, &num_resv_fences, &resv_fences);
> > > +     if (ret)
> > > +             return ret;
> > > +
> > > +     num_fences = 0;
> > > +     *result = NULL;
> > > +
> > > +     if (num_resv_fences == 0 && !extra)
> > > +             return 0;
> > > +
> > > +     for (i = 0; i < num_resv_fences; ++i) {
> > > +             dma_fence_deep_dive_for_each(fence, chain, j, resv_fences[i]) {
> > > +                     if (dma_fence_is_signaled(fence))
> > > +                             continue;
> > > +
> > > +                     *result = fence;
> > > +                     ++num_fences;
> > > +             }
> > > +     }
> > > +
> > > +     if (extra) {
> > > +             dma_fence_deep_dive_for_each(fence, chain, j, extra) {
> > > +                     if (dma_fence_is_signaled(fence))
> > > +                             continue;
> > > +
> > > +                     *result = fence;
> > > +                     ++num_fences;
> > > +             }
> > > +     }
> > > +
> > > +     if (num_fences <= 1) {
> > > +             *result = dma_fence_get(*result);
> > > +             goto put_resv_fences;
> > > +     }
> > > +
> > > +     fences = kmalloc_array(num_fences, sizeof(struct dma_fence*),
> > > +                            GFP_KERNEL);
> > > +     if (!fences) {
> > > +             *result = NULL;
> > > +             ret = -ENOMEM;
> > > +             goto put_resv_fences;
> > > +     }
> > > +
> > > +     num_fences = 0;
> > > +     for (i = 0; i < num_resv_fences; ++i) {
> > > +             dma_fence_deep_dive_for_each(fence, chain, j, resv_fences[i]) {
> > > +                     if (!dma_fence_is_signaled(fence))
> > > +                             fences[num_fences++] = dma_fence_get(fence);
> > > +             }
> > > +     }
> > > +
> > > +     if (extra) {
> > > +             dma_fence_deep_dive_for_each(fence, chain, j, extra) {
> > > +                     if (!dma_fence_is_signaled(fence))
> > > +                             fences[num_fences++] = dma_fence_get(fence);
> > > +             }
> > > +     }
> > > +
> > > +     if (num_fences <= 1) {
> > > +             *result = num_fences ? fences[0] : NULL;
> > > +             kfree(fences);
> > > +             goto put_resv_fences;
> > > +     }
> > > +
> > > +     array = dma_fence_array_create(num_fences, fences,
> > > +                                    dma_fence_context_alloc(1),
> > > +                                    1, false);
> > > +     if (array) {
> > > +             *result = &array->base;
> > > +     } else {
> > > +             *result = NULL;
> > > +             while (num_fences--)
> > > +                     dma_fence_put(fences[num_fences]);
> > > +             kfree(fences);
> > > +             ret = -ENOMEM;
> > > +     }
> > > +
> > > +put_resv_fences:
> > > +     while (num_resv_fences--)
> > > +             dma_fence_put(resv_fences[num_resv_fences]);
> > > +     kfree(resv_fences);
> > > +
> > > +     return ret;
> > > +}
> > > +EXPORT_SYMBOL_GPL(dma_resv_get_singleton_rcu);
> >
> > With the nits addressed:
> >
> > Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> >
> > > +
> > >  /**
> > >   * dma_resv_wait_timeout_rcu - Wait on reservation's objects
> > >   * shared and/or exclusive fences.
> > > diff --git a/include/linux/dma-resv.h b/include/linux/dma-resv.h
> > > index d44a77e8a7e34..d49ca263e78b4 100644
> > > --- a/include/linux/dma-resv.h
> > > +++ b/include/linux/dma-resv.h
> > > @@ -285,6 +285,9 @@ int dma_resv_get_fences_rcu(struct dma_resv *obj,
> > >
> > >  int dma_resv_copy_fences(struct dma_resv *dst, struct dma_resv *src);
> > >
> > > +int dma_resv_get_singleton_rcu(struct dma_resv *obj, struct dma_fence *extra,
> > > +                            struct dma_fence **result);
> > > +
> > >  long dma_resv_wait_timeout_rcu(struct dma_resv *obj, bool wait_all, bool intr,
> > >                              unsigned long timeout);
> > >
> > > --
> > > 2.31.1
> > >
> >
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 2/4] dma-buf: add dma_resv_get_singleton_rcu (v4)
  2021-05-25 12:38       ` Daniel Vetter
@ 2021-05-25 15:50         ` Christian König
  0 siblings, 0 replies; 39+ messages in thread
From: Christian König @ 2021-05-25 15:50 UTC (permalink / raw)
  To: Daniel Vetter, Jason Ekstrand; +Cc: Maling list - DRI developers

On 25.05.21 at 14:38, Daniel Vetter wrote:
> On Mon, May 24, 2021 at 03:04:35PM -0500, Jason Ekstrand wrote:
>> On Fri, May 21, 2021 at 12:48 PM Daniel Vetter <daniel@ffwll.ch> wrote:
>>> On Thu, May 20, 2021 at 02:00:05PM -0500, Jason Ekstrand wrote:
>>>> Add a helper function to get a single fence representing
>>>> all fences in a dma_resv object.
>>>>
>>>> This fence is either the only one in the object or all not
>>>> signaled fences of the object in a flattened out dma_fence_array.
>>>>
>>>> v2 (Jason Ekstrand):
>>>>   - Take reference of fences both for creating the dma_fence_array and in
>>>>     the case where we return one fence.
>>>>   - Handle the case where dma_resv_get_list() returns NULL
>>>>
>>>> v3 (Jason Ekstrand):
>>>>   - Add an _rcu suffix because it is read-only
>>>>   - Rewrite to use dma_resv_get_fences_rcu so it's RCU-safe
>>>>   - Add an EXPORT_SYMBOL_GPL declaration
>>>>   - Re-author the patch to Jason since very little is left of Christian
>>>>     König's original patch
>>>>   - Remove the extra fence argument
>>>>
>>>> v4 (Jason Ekstrand):
>>>>   - Restore the extra fence argument
>>>>
>>>> Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
>>>>
>>>> get_singleton
>>> Spurious thing here.
>> Fixed.
>>
>>>> ---
>>>>   drivers/dma-buf/dma-resv.c | 122 +++++++++++++++++++++++++++++++++++++
>>>>   include/linux/dma-resv.h   |   3 +
>>>>   2 files changed, 125 insertions(+)
>>>>
>>>> diff --git a/drivers/dma-buf/dma-resv.c b/drivers/dma-buf/dma-resv.c
>>>> index 6ddbeb5dfbf65..25995fc15c370 100644
>>>> --- a/drivers/dma-buf/dma-resv.c
>>>> +++ b/drivers/dma-buf/dma-resv.c
>>>> @@ -33,6 +33,8 @@
>>>>    */
>>>>
>>>>   #include <linux/dma-resv.h>
>>>> +#include <linux/dma-fence-chain.h>
>>>> +#include <linux/dma-fence-array.h>
>>>>   #include <linux/export.h>
>>>>   #include <linux/mm.h>
>>>>   #include <linux/sched/mm.h>
>>>> @@ -49,6 +51,19 @@
>>>>    * write-side updates.
>>>>    */
>>>>
>>>> +/**
>>>> + * dma_fence_deep_dive_for_each - deep dive into the fence containers
>>>> + * @fence: resulting fence
>>>> + * @chain: variable for a dma_fence_chain
>>>> + * @index: index into a dma_fence_array
>>>> + * @head: starting point
>>>> + *
>>>> + * Helper to deep dive into the fence containers for flattening them.
>>>> + */
>>>> +#define dma_fence_deep_dive_for_each(fence, chain, index, head)      \
>>>> +     dma_fence_chain_for_each(chain, head)                   \
>>>> +             dma_fence_array_for_each(fence, index, chain)
>>> Since this is is just internal helper in the .c file we generally don't
>>> document it. Maybe small comment if you feel it's worth it.
>> Sure, I can write LESS documentation. :-P
>>
>>>> +
>>>>   DEFINE_WD_CLASS(reservation_ww_class);
>>>>   EXPORT_SYMBOL(reservation_ww_class);
>>>>
>>>> @@ -517,6 +532,113 @@ int dma_resv_get_fences_rcu(struct dma_resv *obj,
>>>>   }
>>>>   EXPORT_SYMBOL_GPL(dma_resv_get_fences_rcu);
>>>>
>>>> +/**
>>>> + * dma_resv_get_singleton - get a single fence for the dma_resv object
>>> Name doesn't match here.
>> Fixed.
>>
>>>> + * @obj: the reservation object
>>>> + * @extra: extra fence to add to the resulting array
>>>> + * @result: resulting dma_fence
>>>> + *
>>>> + * Get a single fence representing all unsignaled fences in the dma_resv object
>>>> + * plus the given extra fence. If we got only one fence return a new
>>>> + * reference to that, otherwise return a dma_fence_array object.
>>>> + *
>>>> + * RETURNS
>>>> + * Returns -ENOMEM if allocations fail, zero otherwise.
>>> Kernel often encodes this in ERR_PTR so that you don't have to pass a
>>> pointer to store the result. Would feel more kernel-y I think that way. So
>>> no result parameter, and on alloc failure you'd return
>> Done.
>>
>>>          return ERR_PTR(-ENOMEM);
>>>
>>>> + */
>>>> +int dma_resv_get_singleton_rcu(struct dma_resv *obj, struct dma_fence *extra,
>>> tbh the _rcu here is confusing. I think _unlocked is the better suffix,
>>> maybe we should rename dma_resv_get_fences_rcu too for consistency. The
>>> rcu-ness of the lookup isn't leaked to callers at all, so no point giving
>>> them a panic.
>> I can make that change.  I'll also include a patch in the next re-send
>> that renames all the _rcu helpers to _unlocked for consistency.
> Maybe double-check with Christian König whether he's on board with this
> bikeshed, but the current _rcu postfix we have in some dma_resv functions
> really confuses me.

Well I would say yes please :)

The _rcu postfix was something I was always wondering about as well.

Christian.

> -Daniel
>
>> --Jason
>>
>>>> +                            struct dma_fence **result)
>>>> +{
>>>> +     struct dma_fence **resv_fences, *fence, *chain, **fences;
>>>> +     struct dma_fence_array *array;
>>>> +     unsigned int num_resv_fences, num_fences;
>>>> +     unsigned int ret, i, j;
>>>> +
>>>> +     ret = dma_resv_get_fences_rcu(obj, NULL, &num_resv_fences, &resv_fences);
>>>> +     if (ret)
>>>> +             return ret;
>>>> +
>>>> +     num_fences = 0;
>>>> +     *result = NULL;
>>>> +
>>>> +     if (num_resv_fences == 0 && !extra)
>>>> +             return 0;
>>>> +
>>>> +     for (i = 0; i < num_resv_fences; ++i) {
>>>> +             dma_fence_deep_dive_for_each(fence, chain, j, resv_fences[i]) {
>>>> +                     if (dma_fence_is_signaled(fence))
>>>> +                             continue;
>>>> +
>>>> +                     *result = fence;
>>>> +                     ++num_fences;
>>>> +             }
>>>> +     }
>>>> +
>>>> +     if (extra) {
>>>> +             dma_fence_deep_dive_for_each(fence, chain, j, extra) {
>>>> +                     if (dma_fence_is_signaled(fence))
>>>> +                             continue;
>>>> +
>>>> +                     *result = fence;
>>>> +                     ++num_fences;
>>>> +             }
>>>> +     }
>>>> +
>>>> +     if (num_fences <= 1) {
>>>> +             *result = dma_fence_get(*result);
>>>> +             goto put_resv_fences;
>>>> +     }
>>>> +
>>>> +     fences = kmalloc_array(num_fences, sizeof(struct dma_fence*),
>>>> +                            GFP_KERNEL);
>>>> +     if (!fences) {
>>>> +             *result = NULL;
>>>> +             ret = -ENOMEM;
>>>> +             goto put_resv_fences;
>>>> +     }
>>>> +
>>>> +     num_fences = 0;
>>>> +     for (i = 0; i < num_resv_fences; ++i) {
>>>> +             dma_fence_deep_dive_for_each(fence, chain, j, resv_fences[i]) {
>>>> +                     if (!dma_fence_is_signaled(fence))
>>>> +                             fences[num_fences++] = dma_fence_get(fence);
>>>> +             }
>>>> +     }
>>>> +
>>>> +     if (extra) {
>>>> +             dma_fence_deep_dive_for_each(fence, chain, j, extra) {
>>>> +                     if (!dma_fence_is_signaled(fence))
>>>> +                             fences[num_fences++] = dma_fence_get(fence);
>>>> +             }
>>>> +     }
>>>> +
>>>> +     if (num_fences <= 1) {
>>>> +             *result = num_fences ? fences[0] : NULL;
>>>> +             kfree(fences);
>>>> +             goto put_resv_fences;
>>>> +     }
>>>> +
>>>> +     array = dma_fence_array_create(num_fences, fences,
>>>> +                                    dma_fence_context_alloc(1),
>>>> +                                    1, false);
>>>> +     if (array) {
>>>> +             *result = &array->base;
>>>> +     } else {
>>>> +             *result = NULL;
>>>> +             while (num_fences--)
>>>> +                     dma_fence_put(fences[num_fences]);
>>>> +             kfree(fences);
>>>> +             ret = -ENOMEM;
>>>> +     }
>>>> +
>>>> +put_resv_fences:
>>>> +     while (num_resv_fences--)
>>>> +             dma_fence_put(resv_fences[num_resv_fences]);
>>>> +     kfree(resv_fences);
>>>> +
>>>> +     return ret;
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(dma_resv_get_singleton_rcu);
>>> With the nits addressed:
>>>
>>> Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
>>>
>>>> +
>>>>   /**
>>>>    * dma_resv_wait_timeout_rcu - Wait on reservation's objects
>>>>    * shared and/or exclusive fences.
>>>> diff --git a/include/linux/dma-resv.h b/include/linux/dma-resv.h
>>>> index d44a77e8a7e34..d49ca263e78b4 100644
>>>> --- a/include/linux/dma-resv.h
>>>> +++ b/include/linux/dma-resv.h
>>>> @@ -285,6 +285,9 @@ int dma_resv_get_fences_rcu(struct dma_resv *obj,
>>>>
>>>>   int dma_resv_copy_fences(struct dma_resv *dst, struct dma_resv *src);
>>>>
>>>> +int dma_resv_get_singleton_rcu(struct dma_resv *obj, struct dma_fence *extra,
>>>> +                            struct dma_fence **result);
>>>> +
>>>>   long dma_resv_wait_timeout_rcu(struct dma_resv *obj, bool wait_all, bool intr,
>>>>                               unsigned long timeout);
>>>>
>>>> --
>>>> 2.31.1
>>>>
>>> --
>>> Daniel Vetter
>>> Software Engineer, Intel Corporation
>>> http://blog.ffwll.ch


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: EPOLL for drm_syncfile (was Re: [PATCH 4/4] RFC: dma-buf: Add an API for importing sync files (v6))
  2021-05-23 21:34       ` Daniel Stone
  2021-05-25  5:55         ` Daniel Vetter
@ 2021-05-26  9:45         ` Simon Ser
  1 sibling, 0 replies; 39+ messages in thread
From: Simon Ser @ 2021-05-26  9:45 UTC (permalink / raw)
  To: Daniel Stone
  Cc: Jason Ekstrand, Daniel Vetter, Christian König, dri-devel

On Sunday, May 23rd, 2021 at 11:34 PM, Daniel Stone <daniel@fooishbar.org> wrote:

> On the other hand, as soon as the sync point becomes available, the
> winsys will want to immediately extract a sync_file from it

With today's APIs, yes, but if we ever get drm_syncobj timeline support
in KMS/EGL(/Vulkan?) maybe this won't be necessary at some point.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/4] RFC: dma-buf: Add an API for importing sync files (v6)
  2021-05-24 17:11     ` [PATCH 4/4] RFC: dma-buf: Add an API for importing sync files (v6) Jason Ekstrand
  2021-05-25  6:27       ` Daniel Vetter
@ 2021-05-26 11:08       ` Daniel Stone
  2021-05-26 12:35         ` Daniel Vetter
                           ` (2 more replies)
  2021-05-26 12:21       ` Vlad Zahorodnii
  2 siblings, 3 replies; 39+ messages in thread
From: Daniel Stone @ 2021-05-26 11:08 UTC (permalink / raw)
  To: Jason Ekstrand; +Cc: Daniel Vetter, Christian König, dri-devel

Hey,

On Mon, 24 May 2021 at 18:11, Jason Ekstrand <jason@jlekstrand.net> wrote:
> On Sat, May 22, 2021 at 3:05 PM Daniel Stone <daniel@fooishbar.org> wrote:
> > So far, so good. I really like both your series up until this
> > narrative point; as I was saying in the userspace-fence thread, the
> > WSI<->client thread is certainly pulling a very big lever with a
> > heavyweight transition between the two different worlds, and I like
> > that the new export ioctl lets us be very clear about what exactly is
> > happening under the hood. Good stuff.
>
> Glad to hear.  If you turned that into RBs on the first three patches
> in this series and all but the last patch on the Mesa MR, it'd make me
> even happier. :-D
>
> At this point, I think everyone is pretty happy with the first three
> patches and the export ioctl.  In the Vulkan WSI code, it solves a
> significant over-synchronization issue for ANV.  Also, the uAPI
> shouldn't be controversial at all because it's identical to poll()
> except that it gives you a FD you can poll on later to get the result
> of the poll as it would be now.  I think if we get some Mesa reviews,
> we should be able to land those.  It's the import ioctl that's trickier.
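
For reference, the export side described above boils down to roughly
this from userspace (struct and ioctl names assumed from the uapi patch
in this series, so check them against include/uapi/linux/dma-buf.h):

    #include <sys/ioctl.h>
    #include <linux/dma-buf.h>

    /* Sketch: snapshot the current implicit fences on a dma-buf as a
     * sync_file.  The returned fd can be handed to poll()/epoll later. */
    static int export_implicit_fences(int dmabuf_fd)
    {
            struct dma_buf_export_sync_file arg = {
                    .flags = DMA_BUF_SYNC_READ | DMA_BUF_SYNC_WRITE,
                    .fd = -1,
            };

            if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &arg) < 0)
                    return -1;

            return arg.fd;
    }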

Agree. Have a few more burning things on my list (not least fd.o infra
this week) but will get to it so we can land.

(Clarified on IRC that my description above was accurate to the best
of our shared understanding, so we're all on the same page here.)

> > That makes sense to me and is nicely symmetrical, plus it gets GPU
> > drivers further out of the business of doing magic winsys/display
> > synchronisation, which is good. But why enforce that imported fences
> > have to go into the exclusive slot? It makes sense from the point of
> > view of clients doing QueueSubmit, but I think it might preclude other
> > uses for no particularly good reason.
>
> Mostly, I was trying to limit the scope a bit.  Import is tricky and,
> to be honest, I'm still not 100% sure that it's safe.  I probably
> should have left RFC on this patch.
>
> As long as we keep everything inside the kernel, there's a requirement
> that any work which triggers any fence on a dma_resv waits on the
> exclusive fence, if any.  Work which sets the exclusive fence has to
> wait on all fences.  With the import ioctl, we can guarantee that the
> fences fire in the right order by making an array fence containing the
> new fence and all other fences and using that as the exclusive fence.
> What we can't do, however, is ensure that the work which triggers the
> fence actually blocks on ANY of the fences on the dma_resv.
>
> Hrm... I think I've now convinced myself that importing shared fences
> is no more dangerous than importing an exclusive fence because they
> share the same set of problems.  Unfortunately, I've unconvinced
> myself of the safety of either.  I've got to think some more....
>
> The most convincing argument I can make to myself is that this ioctl
> is like having a one-shot little queue that contains tiny little work
> jobs which wait on whatever sync_file you provide, do nothing, and
> then signal.  That should be safe, right?
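
Kernel-side, that one-shot queue amounts to roughly the following sketch
(built on dma_resv_get_singleton_rcu() from patch 2; this is an
illustration rather than the actual patch 4 code, with error handling
trimmed):

    /* Sketch: combine an imported fence with everything already on the
     * dma_resv and install the result as the new exclusive fence. */
    static int sketch_import_fence(struct dma_buf *dmabuf,
                                   struct dma_fence *fence)
    {
            struct dma_fence *singleton = NULL;
            int ret;

            /* Flatten the current resv fences plus the imported one. */
            ret = dma_resv_get_singleton_rcu(dmabuf->resv, fence, &singleton);
            if (ret)
                    return ret;

            dma_resv_lock(dmabuf->resv, NULL);
            dma_resv_add_excl_fence(dmabuf->resv, singleton);
            dma_resv_unlock(dmabuf->resv);

            dma_fence_put(singleton);
            return 0;
    }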

Yeah, I don't think there's any difference between shared and
exclusive wrt safety. The difference lies in, well, exclusive putting
a hard serialisation barrier between everything which comes before and
everything that comes after, and shared being more relaxed to allow
for reads to retire in parallel.

As said below, I think there's a good argument for the latter once you
get out of the very straightforward uses. One of the arguments for
these ioctls is to eliminate oversync, but then the import ioctl
mandates oversync in the case where the consumer only does
non-destructive reads - which is the case for the vast majority of
users!

> > Certainly on X11 there's no real use for the shared slot - you get
> > into difficulties with mixed client/server rendering - but at least
> > Wayland and GBM are always non-destructive, so conceptually have a
> > good use for being able to import fences into the shared resv. This
> > goes for media and KMS access as well, if you think beyond the usecase
> > of an explicit Vulkan client sending through to a dumb implicit server
> > which just wants to texture.
> >
> > Media clients are definitely a relevant usecase, both directly with
> > V4L2/VA-API, neither of which have support for explicit
> > synchronisation yet and (at least for V4L2) are unlikely to do so in
> > the near future, but even more with pipeline frameworks like GStreamer
> > and PipeWire, which don't have fundamental awareness (they're prepared
> > to deal with deep pipelines which return at arbitrary times, but once
> > anything is returned, it is available for immediate use). Maybe
> > open-coding CPU-side waits in these users is the best way longer term,
> > but for easier interop if nothing else, being able to pull shared
> > fences in seems like it could be a win ('the compositor is still
> > texturing from this for now, so feel free to read back from ref
> > frames, but don't scribble over it until it's finished').
> >
> > I'm slightly bemused that there's so much energy spent on designing
> > around winsys being dumb and implicit.
>
> I'd like to address this one as it's a comment you've made several
> times.  Once you've fixed raw X11 (not just XWayland) and a new
> release has been made (hah!) and is shipping in distros with said
> support, then we can talk.  Sorry if that comes off as overly snarky
> but that's reality that we (driver devs) are living with.  To make
> things even worse, when we're in Vulkan land (as opposed to GL), we
> can't tell up-front whether or not our window-system supports foo
> fences and adjust accordingly.  We have to start up, begin rendering,
> and only later figure out "oops, this one goes to X11".  We really
> can't say things like "when running on modern Wayland, do things the
> new way" because Vulkan doesn't have a concept of "running on" a
> window system.

Hey, no offence taken, and even if there was, there's no point denying
the reality that we don't still have Wayland absolutely everywhere,
and even when we do, there's still a Streams-sized elephant in the
room that no-one wants to talk about.

Ultimately though, there will always be two codepaths. Xorg (as
opposed to Xwayland) is currently unmaintained, and I expect it to
remain so forever. The last times we tried to plumb synchronisation
through to native-Xorg, it collapsed in conflicting requests for
rewriting the presentation mechanism (which very few people understand
and even fewer people can make work reliably), and I don't see that
changing any time soon. Plus the single biggest remaining use for
native-Xorg is that it implements tearflips by blitting to a single
internal frontbuffer which is currently being scanned out, which is
going to be ... difficult ... to square with the goal of actually
synchronising things.

What I mean though is that I don't think it makes sense to have three
design points.

Previously, the only effort put into explicit synchronisation on the
winsys side was using it as a means to surface the timings for both
tracing and being able to calibrate compositor repaints. Given that we
need to support interop with the implicit world forever, and given
that there was no explicit interaction with the implicit resv slots
(solved by this new uAPI), and given that the sync_file vs.
drm_syncobj vs. ?? userspace fence story wasn't clear, it didn't seem
like a practical benefit to sink time into supporting it, just that it
ticked a 'yep we're doing things explicitly' box, which was not
practically useful since EGL doesn't let you opt out of implicit
semantics anyway.

Now with the details having come out of AMD/Arm/Intel about future
gens, and the driver-side pain being properly understood, as well as
the new uAPI actually giving us clear benefit, there's a good reason
to work on it. So we'll do that on the winsys side, and support the
new explicitly-managed-implicit-slot world, and everyone's lives will
be better. Which gives us two design points: unaware implicit-only
users (Xorg, old Wayland, old GPU, media), and fully-aware
extremely-explicit users (Mutter, Weston, wlroots, new GPU).

Having the new uAPI makes the transitions clear and explicit, as
opposed to the previous halfway house where we'd dutifully import and
export sync_files, but the kernel would just do the exact same resv
bookkeeping anyway as if we were implicit.

> FWIW, I do have a Mesa MR somewhere which adds explicit sync for
> Vulkan+Wayland when the wayland protocols are there.  I don't remember
> why it didn't land.  Maybe lack of acquire fence support?  I think the
> protocol issues have been fixed, so we should up-rev the requirements
> as needed and land it.

Depends what you mean by acquire-fence? Assuming you mean ANI
(compositor release -> client reuse), we never hooked that up in
Weston because we were waiting for things to settle down with e.g.
syncobj, but I guess we've just written ourselves into a catch-22
there. Something to fix. In the meantime we do have the
timeline-syncobj protocol typed out, but it's not useful to us until
it works with epoll.

> > We did take a long time to roll
> > out sync_file support, but that was only because we didn't quite
> > understand all the nuances of why implicit sync is so difficult for
> > GPU drivers on modern architectures and how it was actually a win
> > compared to doing nothing; maybe we should have some kind of
> > conference where we all get together and talk to each other ... ?
> > Anyway, by the time we got to the cusp of rolling out bi-directional
> > sync_file support, suddenly we had drm_syncobj. By the time that had
> > semi-settled down and started to be rolled out, then we suddenly had
> > userspace fences on the horizon. And what do we do with those?
>
> You're not wrong....  And that's the second reason for the gymnastics
> above.  Ever since Vulkan launched, we've been fumbling around trying
> to figure out how to best do synchronization.  I'm reasonably certain
> that userspace memory fences are the future but I'm much less certain
> about the path to get there.  It's been a process of patient
> experimentation so far and I think we're getting closer.  Syncobj,
> timeline syncobj, etc. have all been steps along that path.  I've been
> hesitant to ask the window-system people to draft piles of extensions
> until we're sure we've found "the one".  It's bad enough iterating in
> kernel-space and userspace without bringing Wayland protocol into each
> iteration step.  For that reason, one of my goals along this process
> has been to try and decouple things as much as we can.
>
> So, in summary, none of this is because I think that window systems
> are dumb and implicit or for any lack of respect for the people that
> work on them.  I assume that X11 will always be dumb and implicit.
> (I'd love to be proven wrong!)  For Wayland (and XWayland, probably),
> I assume the people are very smart and active and will implement
> whatever APIs we (the driver people) need as long as they're
> reasonable.  I just don't know what to ask for yet.

Typing out timeline-syncobj support gets us like 90% of the way to
userspace fences. At the moment, the protocol only has the concept of
fully-materialised fences which will definitely retire at some point;
typing out timeline-syncobj gives us the two concepts of syncobj/val
and not-yet-materialised fences (i.e. wait-before-signal).

The former is just typing, whereas the latter is a fairly violent
compositor-internal change to decouple the scene graph across contexts
- stalling your desktop for a second because a fence takes too long to
retire is not nice, but stalling your desktop infinitely because the
client gave you an infinite WBS is less acceptable.

Once we've done that, userspace fence support is slightly annoying to
type out (new polling mechanism, compositor balancing burning CPU
polling excessively vs. getting your VRR frames in as early as
possible), but the protocol sed job and polling switch-up is nothing
compared to being WBS-aware.

So I think it does make perfect sense for us to get typing on
timeline-syncobj, and then all our compositors are 95% of the way to
being ready for userspace fences.

> > We've - certainly Weston, and I'm pretty confident I speak for Simon
> > on the wlroots side and Jonas for Mutter - landed on accepting that
> > we're going to have to deal with timeline semaphores and
> > wait-before-signal; we have a credible plan (we think) for protocol to
> > deal with at least syncobj, which should conceptually extend to
> > userspace fences. The biggest blocker there is the syncobj uAPI.
> > Compositors aren't Vulkan clients - we don't have one thread per group
> > of work, because the inter-client synchronisation becomes nightmarish
> > overhead for little gain. I don't think this will ever change, because
> > the balance of work is totally different between client and
> > compositor.
> >
> > Anyway, the problem with syncobj is that the ioctl to wait for a
> > sync_file to materialise for a given timeline point only allows us to
> > block with a timeout; this is a non-starter, because we need something
> > which fits into epoll. The most optimal case is returning a new
> > eventfd or similar which signals when a given timeline point becomes
> > available or signaled, but in extremis a syncobj FD becoming readable
> > when any activity which would change the result of any zero-timeout
> > wait on that syncobj is more or less workable.
>
> Right.  You could call this an oversight.  Really, though, it was
> because the use-cases we were interested in were ones where a wait
> with a timeout was perfectly acceptable and where the extra overhead
> of setting an epoll with ioctls wasn't.  It shouldn't be too hard to
> wire up if that helps (cross your fingers).  But....
>
> If we go the direction of userspace memory fences (which I think is
> likely), there is no such thing as "wait for the fence to
> materialize".  The work is either done or it isn't.  There is no
> enforceable concept of "about to be done".  The word "enforceable" is
> important there.  We could add such a concept but it'd be dependent on
> the userspace client (not driver) to let us know when it's just about
> ready and then we'd have to VK_ERROR_DEVICE_LOST on them or similar if
> they break the contract.  Maybe possible but there's some design work
> required there.  The other option is to make the compositors handle
> this new way of thinking more thoroughly and eat the latency hit.

There's no good answer whatsoever to combining pure userspace fences
and VRR, because either you burn your CPU polling at 200Hz, or you're
late getting frames to screen.

But again, once we're timeline-syncobj-aware, we're pretty much
already userspace-fence aware. Compositors already have very separate
timed points for client buffer submit (arbitrary time), to scheduling
output repaint (as needed), to launching output repaint (negative
delta from deadline), to frame deadline (fixed by hardware, VRR
notwithstanding).

Timeline syncobjs hurt because we have to introduce an additional
stage where we check if a client's fence has materialised; the surface
tree has to live in limbo until they have, with the scene-graph update
happening when they have. Userspace fencing just switches the trigger
from 'epoll tells us the fence is materialised and/or signalled' to
'epoll tells us it's time to wake up and check the fence'. Again, just
tedious typing, not actually conceptually difficult.

> > As far as I can tell, the three relevant vendors with (more or less)
> > upstream drivers here are AMD, Arm, and Intel. Arm is obviously a bit
> > more up in the air as the hardware and specs aren't currently
> > available to the upstream development teams, but hopefully we can
> > bring them into this conversation. I think it's looking like we're
> > optimising for all of the following, without leaving anyone out in the
> > cold:
> >
> > [...]
>
> You're missing a use-case:  Modern userspace-fencing hardware running
> on bare X11.  I don't know that we should optimize for this case, so
> to speak, but it has to work non-terribly.  Also, as I alluded to
> above, I really don't want to be maintaining two drivers forever: One
> for X11 and ancient Wayland and one for modern Wayland.  We need to be
> able to modernize the driver and still support the old window-systems.
> Unless you can promise me that X11 and KDE/Wayland will either go away
> or else get modernized, I need a fall-back plan.  And even if you make
> me said promise, I'm not going to believe you. :-P  8-9 years ago, I
> was one of the people making those promises.  Now I'm writing X11
> winsys code for drivers.  And... now I'm re-thinking some of my life
> choices....

KWinFT is based on wlroots, so it gets everything for free.

You're right about X11, but weirdly this is the one case that's really
easy. X11 WSI already throws a client submit thread under the bus for
asynchronous presentation. This invalidates every possible guarantee:
you can't do vkQueuePresentKHR() followed by native X11 rendering,
because the queue may not have flushed out to the server. So just do
the same thing: use the present thread that already exists, but make
it spin on completion instead, and then the server doesn't need to do
any synchronisation on QueuePresent, just on ANI. And you don't get
eaten by Xorg grues.

> > Ultimately, I think I'm leaning towards agreeing with Christian that I
> > would like to see a good holistic model for how this works in a
> > variety of usecases before we progress new uAPI, but also to agreeing
> > with you and Dan in that how amdgpu currently implements things is
> > currently objectively very wrong (despite the motivations for having
> > implemented it that way).
> >
> > If there are any usecases I've missed then I'm all ears, but else I
> > think we should take this forward by working with
> > Vulkan/EGL/Wayland/X11/media/KMS and coming up with realistic
> > skeletons for end-to-end synchronisation through those usecases. It's
> > clear that this never happened for syncobj, because it's just not
> > usable as a primitive in any compositor which exists today: let's make
> > sure we don't get into the same trap again. If we can do this over
> > email/GitLab then that's great, but if not, maybe we need to do a kind
> > of mini-XDC with some kind of virtual whiteboard we can all scribble
> > over.
>
> I think what we're looking at is roughly three steps, the first two of
> which are mostly there on the Wayland side:
>
>  1. Implicit sync.  We know what this one is.  glFlush/vkQueueSubmit
> on the one side, start texturing on the other, and it all works.

Done.

>  2. Immutable SW fences, i.e. sync_file.  This is where we have a
> fence object that gets returned from the flush/submit and can be
> passed into the texture op to provide a dependency.  Syncobj may be
> useful here but only as a container.  If a sync_file is a dma_fence*,
> a syncobj should be thought of as a dma_fence**.  This may be useful
> if we want to retrofit sync_file into X11 where the current DRI3
> explicit sync stuff is based on creating an object and then re-using
> it for every present.

Mostly done for Wayland, but as above we never bothered typing out the
sync_file return path in any implementation.

>  3. Userspace memory fences.
>
> Note that timeline syncobj is NOT in that list.  IMO, all the "wait
> for submit" stuff is an implementation detail we needed in order to
> get the timeline semantics on top of immutable SW fences.  Under the
> hood it's all dma_fence; this just gives us a shareable container so
> we can implement VK_KHR_timeline_semaphore with sharing.  I really
> don't want to make Wayland protocol around it if memory fences are the
> final solution.

Typing out the Wayland protocol isn't the hard bit. If we just need to
copy and sed syncobj to weirdsyncobj, no problem really, and it gives
us a six-month head start on painful compositor-internal surgery
whilst we work on common infrastructure to ship userspace fences
around (mappable dmabuf with the sync bracketing? FD where every
read() gives you the current value? memfd? other?).

> > (As a coda to this, I'm pretty sure I understand the subtleties of the
> > memory fences for relocation/shootdown, but it would be _really_ good
> > to have a coherent description everyone agrees on written down
> > somewhere, so people can understand the issues without having to read
> > 250 emails with people at loggerheads.)
>
> Let me give it a try.  I define a userspace memory fence as follows:
>  [...]

Thanks, and I completely agree with your description. What I meant
though was the 'relocation fences' Christian has for amdgpu, and Intel
semi-has with ttm_bo->moving. This is semi-orthogonal, but
understanding how it's supposed to interact is pretty critical I
think.

> There are some ways to work around some of these issues.  Windows has
> a few tricks which we might be able to steal if we want.
>
> Why would anyone want such a horrid thing?  Application developers
> absolutely love them.  They can write massively multi-threaded apps
> with piles of work queues that require very little cross-thread
> synchronization and the GPU scheduler sorts it all out for them in the
> end.  If you're a 3D game engine developer, this timeline model is
> very powerful.  If you're a driver or window-system developer, you
> really have to just embrace the lack of knowledge and trust apps.
>
> Oof... That got long.  I hope it was informative.

It was! Thanks for the write-up.

Cheers,
Daniel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/4] RFC: dma-buf: Add an API for importing sync files (v6)
  2021-05-24 17:11     ` [PATCH 4/4] RFC: dma-buf: Add an API for importing sync files (v6) Jason Ekstrand
  2021-05-25  6:27       ` Daniel Vetter
  2021-05-26 11:08       ` Daniel Stone
@ 2021-05-26 12:21       ` Vlad Zahorodnii
  2 siblings, 0 replies; 39+ messages in thread
From: Vlad Zahorodnii @ 2021-05-26 12:21 UTC (permalink / raw)
  To: dri-devel

On 5/24/21 8:11 PM, Jason Ekstrand wrote:
> On Sat, May 22, 2021 at 3:05 PM Daniel Stone <daniel@fooishbar.org> wrote:
>>
>> Hi,
>>
>> On Thu, 20 May 2021 at 20:00, Jason Ekstrand <jason@jlekstrand.net> wrote:
>>> In the Vulkan world, this is useful for dealing with the out-fence from
>>> vkQueuePresent.  Current Linux window-systems (X11, Wayland, etc.) all
>>> rely on dma-buf implicit sync.  Since Vulkan is an explicit sync API, we
>>> get a set of fences (VkSemaphores) in vkQueuePresent and have to stash
>>> those as an exclusive (write) fence on the dma-buf.  We handle it in
>>> Mesa today with the above mentioned dummy submit trick.  This ioctl
>>> would allow us to set it directly without the dummy submit.
>>>
>>> This may also open up possibilities for GPU drivers to move away from
>>> implicit sync for their kernel driver uAPI and instead provide sync
>>> files and rely on dma-buf import/export for communicating with other
>>> implicit sync clients.
>>>
>>> We make the explicit choice here to only allow setting RW fences which
>>> translates to an exclusive fence on the dma_resv.  There's no use for
>>> read-only fences for communicating with other implicit sync userspace
>>> and any such attempts are likely to be racy at best.
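
As a sketch, the userspace half of that import would mirror the export
side (names assumed from the RFC uapi patch; dmabuf_fd and sync_file_fd
are placeholders):

    #include <sys/ioctl.h>
    #include <linux/dma-buf.h>

    /* Sketch: attach a sync_file as the exclusive (write) fence of a
     * dma-buf; the RFC deliberately only accepts the RW/exclusive case. */
    static int import_exclusive_fence(int dmabuf_fd, int sync_file_fd)
    {
            struct dma_buf_import_sync_file arg = {
                    .flags = DMA_BUF_SYNC_RW,
                    .fd = sync_file_fd,
            };

            return ioctl(dmabuf_fd, DMA_BUF_IOCTL_IMPORT_SYNC_FILE, &arg);
    }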
>>
>> I think I almost follow, but I'm not quite seeing the race you allude
>> to. Let me talk through my understanding so it's hopefully more clear
>> for others as a summary. Please correct me on anything I've
>> misunderstood or just missed completely. (I thought when I wrote this
>> intro that the email might be relatively snappy, but it's really long
>> and takes in a lot of breadth. Sorry.)
>>
>> So as far as I'm reading this patchset and the Mesa MR, this _only_
>> concerns the out-fence (i.e. compositor -> client AcquireNextImage
>> semaphore/fence) so far. The in-fence (client->compositor QueuePresent
>> wait-semaphores/fences) is handled by the driver ensuring that an
>> exclusive resv is placed on the union of the signal semaphores passed
>> to QueuePresent, either through flags on its CS ioctl, or amdgpu's BO
>> flags, or ... either way, no problem as it should always be exclusive,
>> and it seems pretty uncontroversial that we should pull the wait
>> semaphore into an exclusive fence so that no downstream consumer will
>> begin using it until the client ops have fully retired.
>>
>> For AcquireNextImage, your patchset exports all the fences (shared and
>> exclusive both) on the dma_resv out into the semaphore/fence such that
>> the client can't progress (CPU-side for VkFence, GPU-side for
>> VkSemaphore) until all currently queued operations have completely
>> retired. So, as long as the server ensures that all its kernel-side
>> work is flushed before its IPC to unblock ANI (wl_buffer.release or
>> DRI3 PresentIdle, both indicating that the client is free to reuse the
>> buffer, subject to synchronising against implicit fences on the resv),
>> all good: we snapshot the current resv state, wrap the relevant
>> driver-side Vulkan primitive around it, and go back to explicit
>> synchronisation. The client can't race generating new work against
>> this, because it can't queue any work until ANI has returned and
>> queued a signaling op on the semaphore/fence.
>>
>> So far, so good. I really like both your series up until this
>> narrative point; as I was saying in the userspace-fence thread, the
>> WSI<->client thread is certainly pulling a very big lever with a
>> heavyweight transition between the two different worlds, and I like
>> that the new export ioctl lets us be very clear about what exactly is
>> happening under the hood. Good stuff.
> 
> Glad to hear.  If you turned that into RBs on the first three patches
> in this series and all but the last patch on the Mesa MR, it'd make me
> even happier. :-D
> 
> At this point, I think everyone is pretty happy with the first three
> patches and the export ioctl.  In the Vulkan WSI code, it solves a
> significant over-synchronization issue for ANV.  Also, the uAPI
> shouldn't be controversial at all because it's identical to poll()
> except that it gives you a FD you can poll on later to get the result
> of the poll as it would be now.  I think if we get some Mesa reviews,
> we should be able to land those.  It's the import ioctl that's trickier.
> 
>> So, what gives with the import ioctl? Taking a guess at where you're
>> going, the import ioctl is going to be used in QueuePresent just as
>> the export ioctl is in ANI: instead of having CS flags to write into
>> the resv exclusive slot or per-BO flags to always dump in exclusive,
>> there's a forthcoming patch somewhere which lets drivers skip this and
>> instead have common QueuePresent code dump the wait semaphore into the
>> resv, so servers on the other side of an implicit-only protocol will
>> synchronise their access against the fence imported as part of
>> client-side QueuePresent?
> 
> Correct.  And the patch isn't forthcoming.  It already exists as the
> top patch in the Mesa MR (!4037).
> 
>> That makes sense to me and is nicely symmetrical, plus it gets GPU
>> drivers further out of the business of doing magic winsys/display
>> synchronisation, which is good. But why enforce that imported fences
>> have to go into the exclusive slot? It makes sense from the point of
>> view of clients doing QueueSubmit, but I think it might preclude other
>> uses for no particularly good reason.
> 
> Mostly, I was trying to limit the scope a bit.  Import is tricky and,
> to be honest, I'm still not 100% sure that it's safe.  I probably
> should have left RFC on this patch.
> 
> As long as we keep everything inside the kernel, there's a requirement
> that any work which triggers any fence on a dma_resv waits on the
> exclusive fence, if any.  Work which sets the exclusive fence has to
> wait on all fences.  With the import ioctl, we can guarantee that the
> fences fire in the right order by making an array fence containing the
> new fence and all other fences and using that as the exclusive fence.
> What we can't do, however, is ensure that the work which triggers the
> fence actually blocks on ANY of the fences on the dma_resv.
> 
> Hrm... I think I've now convinced myself that importing shared fences
> is no more dangerous than importing an exclusive fence because they
> share the same set of problems.  Unfortunately, I've unconvinced
> myself of the safety of either.  I've got to think some more....
> 
> The most convincing argument I can make to myself is that this ioctl
> is like having a one-shot little queue that contains tiny little work
> jobs which wait on whatever sync_file you provide, do nothing, and
> then signal.  That should be safe, right?
> 
>> Certainly on X11 there's no real use for the shared slot - you get
>> into difficulties with mixed client/server rendering - but at least
>> Wayland and GBM are always non-destructive, so conceptually have a
>> good use for being able to import fences into the shared resv. This
>> goes for media and KMS access as well, if you think beyond the usecase
>> of an explicit Vulkan client sending through to a dumb implicit server
>> which just wants to texture.
>>
>> Media clients are definitely a relevant usecase, both directly with
>> V4L2/VA-API, neither of which have support for explicit
>> synchronisation yet and (at least for V4L2) are unlikely to do so in
>> the near future, but even more with pipeline frameworks like GStreamer
>> and PipeWire, which don't have fundamental awareness (they're prepared
>> to deal with deep pipelines which return at arbitrary times, but once
>> anything is returned, it is available for immediate use). Maybe
>> open-coding CPU-side waits in these users is the best way longer term,
>> but for easier interop if nothing else, being able to pull shared
>> fences in seems like it could be a win ('the compositor is still
>> texturing from this for now, so feel free to read back from ref
>> frames, but don't scribble over it until it's finished').
>>
>> I'm slightly bemused that there's so much energy spent on designing
>> around winsys being dumb and implicit.
> 
> I'd like to address this one as it's a comment you've made several
> times.  Once you've fixed raw X11 (not just XWayland) and a new
> release has been made (hah!) and is shipping in distros with said
> support, then we can talk.  Sorry if that comes off as overly snarky
> but that's the reality that we (driver devs) are living with.  To make
> things even worse, when we're in Vulkan land (as opposed to GL), we
> can't tell up-front whether or not our window-system supports foo
> fences and adjust accordingly.  We have to start up, begin rendering,
> and only later figure out "oops, this one goes to X11".  We really
> can't say things like "when running on modern Wayland, do things the
> new way" because Vulkan doesn't have a concept of "running on" a
> window system.
> 
> FWIW, I do have a Mesa MR somewhere which adds explicit sync for
> Vulkan+Wayland when the wayland protocols are there.  I don't remember
> why it didn't land.  Maybe lack of acquire fence support?  I think the
> protocol issues have been fixed, so we should up-rev the requirements
> as needed and land it.
> 
>> We did take a long time to roll
>> out sync_file support, but that was only because we didn't quite
>> understand all the nuances of why implicit sync is so difficult for
>> GPU drivers on modern architectures and how it was actually a win
>> compared to doing nothing; maybe we should have some kind of
>> conference where we all get together and talk to each other ... ?
>> Anyway, by the time we got to the cusp of rolling out bi-directional
>> sync_file support, suddenly we had drm_syncobj. By the time that had
>> semi-settled down and started to be rolled out, then we suddenly had
>> userspace fences on the horizon. And what do we do with those?
> 
> You're not wrong....  And that's the second reason for the gymnastics
> above.  Ever since Vulkan launched, we've been fumbling around trying
> to figure out how to best do synchronization.  I'm reasonably certain
> that userspace memory fences are the future but I'm much less certain
> about the path to get there.  It's been a process of patient
> experimentation so far and I think we're getting closer.  Syncobj,
> timeline syncobj, etc. have all been steps along that path.  I've been
> hesitant to ask the window-system people to draft piles of extensions
> until we're sure we've found "the one".  It's bad enough iterating in
> kernel-space and userspace without bringing Wayland protocol into each
> iteration step.  For that reason, one of my goals along this process
> has been to try and decouple things as much as we can.
> 
> So, in summary, none of this is because I think that window systems
> are dumb and implicit or for any lack of respect for the people that
> work on them.  I assume that X11 will always be dumb and implicit.
> (I'd love to be proven wrong!)  For Wayland (and XWayland, probably),
> I assume the people are very smart and active and will implement
> whatever APIs we (the driver people) need as long as they're
> reasonable.  I just don't know what to ask for yet.
> 
>> We've - certainly Weston, and I'm pretty confident I speak for Simon
>> on the wlroots side and Jonas for Mutter - landed on accepting that
>> we're going to have to deal with timeline semaphores and
>> wait-before-signal; we have a credible plan (we think) for protocol to
>> deal with at least syncobj, which should conceptually extend to
>> userspace fences. The biggest blocker there is the syncobj uAPI.
>> Compositors aren't Vulkan clients - we don't have one thread per group
>> of work, because the inter-client synchronisation becomes nightmarish
>> overhead for little gain. I don't think this will ever change, because
>> the balance of work is totally different between client and
>> compositor.
>>
>> Anyway, the problem with syncobj is that the ioctl to wait for a
>> sync_file to materialise for a given timeline point only allows us to
>> block with a timeout; this is a non-starter, because we need something
>> which fits into epoll. The most optimal case is returning a new
>> eventfd or similar which signals when a given timeline point becomes
>> available or signaled, but in extremis a syncobj FD becoming readable
>> when any activity which would change the result of any zero-timeout
>> wait on that syncobj is more or less workable.
> 
> Right.  You could call this an oversight.  Really, though, it was
> because the use-cases we were interested in were ones where a wait
> with a timeout was perfectly acceptable and where the extra overhead
> of setting an epoll with ioctls wasn't.  It shouldn't be too hard to
> wire up if that helps (cross your fingers).  But....
> 
> If we go the direction of userspace memory fences (which I think is
> likely), there is no such thing as "wait for the fence to
> materialize".  The work is either done or it isn't.  There is no
> enforceable concept of "about to be done".  The word "enforceable" is
> important there.  We could add such a concept but it'd be dependent on
> the userspace client (not driver) to let us know when it's just about
> ready and then we'd have to VK_ERROR_DEVICE_LOST on them or similar if
> they break the contract.  Maybe possible but there's some design work
> required there.  The other option is to make the compositors handle
> this new way of thinking more thoroughly and eat the latency hit.
> 
>> What we do want to do though, regardless of the primitive or its
>> semantics, is to only have to go through this once, not rework it all
>> in six months, and have to support a bunch of diverging codepaths. So
>> what is the optimal synchronisation primitive we should be aiming for
>> on the winsys side? Is sync_file a good enough lowest common
>> denominator, or should we do timeline-syncobj for fancy clients, in
>> tandem with sync_file bolted on the side for media and ancient GPUs?
>> Or are userspace fences going to give us some new primitive? And if
>> so, can we please talk about those semantics now, so we don't end up
>> with three synchronisation mechanisms which are all sort of but not
>> really alike?
> 
> As I said above, I think we're getting close but I don't think we're there yet.
> 
>> As far as I can tell, the three relevant vendors with (more or less)
>> upstream drivers here are AMD, Arm, and Intel. Arm is obviously a bit
>> more up in the air as the hardware and specs aren't currently
>> available to the upstream development teams, but hopefully we can
>> bring them into this conversation. I think it's looking like we're
>> optimising for all of the following, without leaving anyone out in the
>> cold:
>>
>> 1. Modern userspace-fencing hardware running on a
>> userspace-fencing-aware winsys, i.e. new AMD/Arm/Intel on one of the
>> big three Wayland compositors which handle enough synchronisation
>> logic internally that the choice of underlying
>> composition/presentation API is immaterial (for which any new API is
>> important)
>> 2. Modern userspace-fencing hardware running on Xwayland, for which
>> we'll inevitably have to type out new DRI3 synchronsiation, but
>> derived purely from the Wayland protocols so it can be passed through
>> quite directly, and with any mixed X11<->user buffer client
>> interaction (e.g. subwindows) being penalised by a long blocking wait
>> in the X server
>> 3. sync_file-oriented hardware, for which we need to do ~nothing
>> 4. Modern userspace-fencing hardware and APIs which need to interop
>> with media units, for all four combinations of client/server
>> source/sink; for some APIs like Vulkan Video synchronisation is not a
>> problem, for others like VA-API/V4L2/GStreamer we're probably need
>> going to live with the implicit semantics for the foreseeable future,
>> which means we should do the right thing for them up front, because
>> why would you even be playing games if you're not streaming them to
>> Twitch at the same time? (Personally I'm much happier that no-one
>> other than my friends who already know I'm terrible can see me playing
>> CoD, but y'know, kids these days ...)
>>
>> The fifth case I've left out is clients who want to smash Vulkan
>> content directly into the display. For games and Kodi-like UI I'm just
>> going to ignore this, because (maybe I'm biased, but) using KMS
>> directly is difficult enough that you shouldn't do it and just use a
>> winsys because we've spent a very long time dealing with these
>> problems for you. The remaining usecase there is XR, but their uses
>> are very different, and OpenXR already deals with a _lot_ of this for
>> you, with the runtimes having sufficiently sophisticated
>> synchronisation internally that whatever we come up with won't be
>> onerous for them to implement. Either way, integration with KMS/GBM
>> has the same problem as Wayland, in that unless you fully encapsulate
>> it in something like VK_KHR_display, you don't get to throw a thread
>> under the bus to do delayed submit, because you need to actually
>> return something to the caller.
> 
> You're missing a use-case:  Modern userspace-fencing hardware running
> on bare X11.  I don't know that we should optimize for this case, so
> to speak, but it has to work non-terribly.  Also, as I alluded to
> above, I really don't want to be maintaining two drivers forever: One
> for X11 and ancient Wayland and one for modern Wayland.  We need to be
> able to modernize the driver and still support the old window-systems.
> Unless you can promise me that X11 and KDE/Wayland will either go away
> or else get modernized, I need a fall-back plan.  And even if you make

Hi,

I don't know why you specifically mentioned KDE/Wayland, but we are on 
board with explicit sync. :)

Cheers,
Vlad

> me said promise, I'm not going to believe you. :-P  8-9 years ago, I
> was one of the people making those promises.  Now I'm writing X11
> winsys code for drivers.  And... now I'm re-thinking some of my life
> choices....
> 
>> Ultimately, I think I'm leaning towards agreeing with Christian that I
>> would like to see a good holistic model for how this works in a
>> variety of usecases before we progress new uAPI, but also to agreeing
>> with you and Dan in that how amdgpu currently implements things is
>> currently objectively very wrong (despite the motivations for having
>> implemented it that way).
>>
>> If there are any usecases I've missed then I'm all ears, but else I
>> think we should take this forward by working with
>> Vulkan/EGL/Wayland/X11/media/KMS and coming up with realistic
>> skeletons for end-to-end synchronisation through those usecases. It's
>> clear that this never happened for syncobj, because it's just not
>> usable as a primitive in any compositor which exists today: let's make
>> sure we don't get into the same trap again. If we can do this over
>> email/GitLab then that's great, but if not, maybe we need to do a kind
>> of mini-XDC with some kind of virtual whiteboard we can all scribble
>> over.
> 
> I think what we're looking at is roughly three steps, the first two of
> which are mostly there on the Wayland side:
> 
>   1. Implicit sync.  We know what this one is.  glFlush/vkQueueSubmit
> on the one side, start texturing on the other, and it all works.
> 
>   2. Immutable SW fences, i.e. sync_file.  This is where we have a
> fence object that gets returned from the flush/submit and can be
> passed into the texture op to provide a dependency.  Syncobj may be
> useful here but only as a container.  If a sync_file is a dma_fence*,
> a syncobj should be thought of as a dma_fence**.  This may be useful
> if we want to retrofit sync_file into X11 where the current DRI3
> explicit sync stuff is based on creating an object and then re-using
> it for every present.
> 
>   3. Userspace memory fences.
> 
> Note that timeline syncobj is NOT in that list.  IMO, all the "wait
> for submit" stuff is an implementation detail we needed in order to
> get the timeline semantics on top of immutable SW fences.  Under the
> hood it's all dma_fence; this just gives us a shareable container so
> we can implement VK_KHR_timeline_semaphore with sharing.  I really
> don't want to make Wayland protocol around it if memory fences are the
> final solution.
> 
> 
>> (As a coda to this, I'm pretty sure I understand the subtleties of the
>> memory fences for relocation/shootdown, but it would be _really_ good
>> to have a coherent description everyone agrees on written down
>> somewhere, so people can understand the issues without having to read
>> 250 emails with people at loggerheads.)
> 
> Let me give it a try.  I define a userspace memory fence as follows:
>   - The fence object is fundamentally a bit of mappable gpu-accessible
> memory which stores a 64-bit counter value which is used as a
> timeline.  (On Windows, they require it to live in system memory.)
>   - For sharable memory fences, each one probably has to go in its own page. :-(
>   - The value is initialized to 0.
>   - To signal the fence, someone (GPU or CPU) writes a new 64-bit value.
>   - Waits on the fence are waits for it to be >= some value.
> 
> Depending on circumstances, the wait may be a 32-bit comparison and
> may be >= with wrap-around.  For the purposes of window-system
> protocol, I think we can assume 64-bit >= with no wrap-around.
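> 
> As a rough CPU-side sketch (names made up purely for illustration, not
> proposed API):
> 
>   #include <stdatomic.h>
>   #include <stdint.h>
> 
>   struct umf { _Atomic uint64_t value; };  /* one per page if shared */
> 
>   static void umf_signal(struct umf *f, uint64_t v)
>   {
>       atomic_store_explicit(&f->value, v, memory_order_release);
>   }
> 
>   static void umf_wait(struct umf *f, uint64_t v)
>   {
>       /* polling wait; a real implementation would park on a futex or
>        * use a GPU-side wait instead of spinning */
>       while (atomic_load_explicit(&f->value, memory_order_acquire) < v)
>           ;
>   }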
> 
> There are a few very important points worth noting about them, however:
>   - No one except the userspace client (not driver!) has any idea
> when/if a value will be signaled
>   - The value may never be signaled
>   - Someone may submit a wait before someone submits a signal; we have
> to deal with that
>   - There is no concept of "about to be signaled"
>   - For the kernel/firmware GPU scheduler handling waits, this means it
> just keeps trying to run work and hopes the wait eventually unblocks.
> It also means we need to totally decouple kernel synchronization for
> memory management purposes from synchronization for work scheduling.
>   - For a compositor, I'm still not 100% sure what this means.  TBD, I think.
> 
> There are some ways to work around some of these issues.  Windows has
> a few tricks which we might be able to steal if we want.
> 
> Why would anyone want such a horrid thing?  Application developers
> absolutely love them.  They can write massively multi-threaded apps
> with piles of work queues that require very little cross-thread
> synchronization and the GPU scheduler sorts it all out for them in the
> end.  If you're a 3D game engine developer, this timeline model is
> very powerful.  If you're a driver or window-system developer, you
> really have to just embrace the lack of knowledge and trust apps.
> 
> Oof... That got long.  I hope it was informative.
> 
> --Jason
> 
> 
>> Cheers,
>> Daniel
>>
>>
>>
>>
>>
>>
>>
>>
>>> When we go to
>>> insert the RW fence, the actual fence we set as the new exclusive fence
>>> is a combination of the sync_file provided by the user and all the other
>>> fences on the dma_resv.  This ensures that the newly added exclusive
>>> fence will never signal before the old one would have and ensures that
>>> we don't break any dma_resv contracts.  We require userspace to specify
>>> RW in the flags for symmetry with the export ioctl and in case we ever
>>> want to support read fences in the future.
>>>
>>> There is one downside here that's worth documenting:  If two clients
>>> writing to the same dma-buf using this API race with each other, their
>>> actions on the dma-buf may happen in parallel or in an undefined order.
>>> Both with and without this API, the pattern is the same:  Collect all
>>> the fences on dma-buf, submit work which depends on said fences, and
>>> then set a new exclusive (write) fence on the dma-buf which depends on
>>> said work.  The difference is that, when it's all handled by the GPU
>>> driver's submit ioctl, the three operations happen atomically under the
>>> dma_resv lock.  If two userspace submits race, one will happen before
>>> the other.  You aren't guaranteed which but you are guaranteed that
>>> they're strictly ordered.  If userspace manages the fences itself, then
>>> these three operations happen separately and the two render operations
>>> may happen genuinely in parallel or get interleaved.  However, this is a
>>> case of userspace racing with itself.  As long as we ensure userspace
>>> can't back the kernel into a corner, it should be fine.
>>>
>>> v2 (Jason Ekstrand):
>>>   - Use a wrapper dma_fence_array of all fences including the new one
>>>     when importing an exclusive fence.
>>>
>>> v3 (Jason Ekstrand):
>>>   - Lock around setting shared fences as well as exclusive
>>>   - Mark SIGNAL_SYNC_FILE as a read-write ioctl.
>>>   - Initialize ret to 0 in dma_buf_wait_sync_file
>>>
>>> v4 (Jason Ekstrand):
>>>   - Use the new dma_resv_get_singleton helper
>>>
>>> v5 (Jason Ekstrand):
>>>   - Rename the IOCTLs to import/export rather than wait/signal
>>>   - Drop the WRITE flag and always get/set the exclusive fence
>>>
>>> v6 (Jason Ekstrand):
>>>   - Split import and export into separate patches
>>>   - New commit message
>>>
>>> Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
>>> ---
>>>   drivers/dma-buf/dma-buf.c    | 32 ++++++++++++++++++++++++++++++++
>>>   include/uapi/linux/dma-buf.h |  1 +
>>>   2 files changed, 33 insertions(+)
>>>
>>> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
>>> index 7679562b57bfc..c9d6b9195c00c 100644
>>> --- a/drivers/dma-buf/dma-buf.c
>>> +++ b/drivers/dma-buf/dma-buf.c
>>> @@ -417,6 +417,36 @@ static long dma_buf_export_sync_file(struct dma_buf *dmabuf,
>>>          put_unused_fd(fd);
>>>          return ret;
>>>   }
>>> +
>>> +static long dma_buf_import_sync_file(struct dma_buf *dmabuf,
>>> +                                    const void __user *user_data)
>>> +{
>>> +       struct dma_buf_sync_file arg;
>>> +       struct dma_fence *fence, *singleton = NULL;
>>> +       int ret = 0;
>>> +
>>> +       if (copy_from_user(&arg, user_data, sizeof(arg)))
>>> +               return -EFAULT;
>>> +
>>> +       if (arg.flags != DMA_BUF_SYNC_RW)
>>> +               return -EINVAL;
>>> +
>>> +       fence = sync_file_get_fence(arg.fd);
>>> +       if (!fence)
>>> +               return -EINVAL;
>>> +
>>> +       dma_resv_lock(dmabuf->resv, NULL);
>>> +
>>> +       ret = dma_resv_get_singleton_rcu(dmabuf->resv, fence, &singleton);
>>> +       if (!ret && singleton)
>>> +               dma_resv_add_excl_fence(dmabuf->resv, singleton);
>>> +
>>> +       dma_resv_unlock(dmabuf->resv);
>>> +
>>> +       dma_fence_put(fence);
>>> +
>>> +       return ret;
>>> +}
>>>   #endif
>>>
>>>   static long dma_buf_ioctl(struct file *file,
>>> @@ -465,6 +495,8 @@ static long dma_buf_ioctl(struct file *file,
>>>   #if IS_ENABLED(CONFIG_SYNC_FILE)
>>>          case DMA_BUF_IOCTL_EXPORT_SYNC_FILE:
>>>                  return dma_buf_export_sync_file(dmabuf, (void __user *)arg);
>>> +       case DMA_BUF_IOCTL_IMPORT_SYNC_FILE:
>>> +               return dma_buf_import_sync_file(dmabuf, (const void __user *)arg);
>>>   #endif
>>>
>>>          default:
>>> diff --git a/include/uapi/linux/dma-buf.h b/include/uapi/linux/dma-buf.h
>>> index f902cadcbdb56..75fdde4800267 100644
>>> --- a/include/uapi/linux/dma-buf.h
>>> +++ b/include/uapi/linux/dma-buf.h
>>> @@ -70,5 +70,6 @@ struct dma_buf_sync_file {
>>>   #define DMA_BUF_SET_NAME_A     _IOW(DMA_BUF_BASE, 1, u32)
>>>   #define DMA_BUF_SET_NAME_B     _IOW(DMA_BUF_BASE, 1, u64)
>>>   #define DMA_BUF_IOCTL_EXPORT_SYNC_FILE _IOWR(DMA_BUF_BASE, 2, struct dma_buf_sync_file)
>>> +#define DMA_BUF_IOCTL_IMPORT_SYNC_FILE _IOW(DMA_BUF_BASE, 3, struct dma_buf_sync_file)
>>>
>>>   #endif
>>> --
>>> 2.31.1
>>>


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/4] RFC: dma-buf: Add an API for importing sync files (v6)
  2021-05-26 11:08       ` Daniel Stone
@ 2021-05-26 12:35         ` Daniel Vetter
  2021-05-26 13:08           ` Daniel Stone
  2021-05-26 13:52         ` Daniel Stone
  2021-05-26 15:24         ` Jason Ekstrand
  2 siblings, 1 reply; 39+ messages in thread
From: Daniel Vetter @ 2021-05-26 12:35 UTC (permalink / raw)
  To: Daniel Stone
  Cc: Daniel Vetter, dri-devel, Christian König, Jason Ekstrand

On Wed, May 26, 2021 at 1:09 PM Daniel Stone <daniel@fooishbar.org> wrote:
>
> Hey,
>
> On Mon, 24 May 2021 at 18:11, Jason Ekstrand <jason@jlekstrand.net> wrote:
> > On Sat, May 22, 2021 at 3:05 PM Daniel Stone <daniel@fooishbar.org> wrote:
> > > So far, so good. I really like both your series up until this
> > > narrative point; as I was saying in the userspace-fence thread, the
> > > WSI<->client thread is certainly pulling a very big lever with a
> > > heavyweight transition between the two different worlds, and I like
> > > that the new export ioctl lets us be very clear about what exactly is
> > > happening under the hood. Good stuff.
> >
> > Glad to hear.  If you turned that into RBs on the first three patches
> > in this series and all but the last patch on the Mesa MR, it'd make me
> > even happier. :-D
> >
> > At this point, I think everyone is pretty happy with the first three
> > patches and the export ioctl.  In the Vulkan WSI code, it solves a
> > significant over-synchronization issue for ANV.  Also, the uAPI
> > shouldn't be controversial at all because it's identical to poll()
> > except that it gives you a FD you can poll on later to get the result
> > of the poll as it would be now.  I think if we get some Mesa reviews,
> > we should be able to land those.  It's import that's trickier.
>
> Agree. Have a few more burning things on my list (not least fd.o infra
> this week) but will get to it so we can land.
>
> (Clarified on IRC that my description above was accurate to the best
> of our shared understanding, so we're all on the same page here.)

fd.o work is much appreciated!

> > > That makes sense to me and is nicely symmetrical, plus it gets GPU
> > > drivers further out of the business of doing magic winsys/display
> > > synchronisation, which is good. But why enforce that imported fences
> > > have to go into the exclusive slot? It makes sense from the point of
> > > view of clients doing QueueSubmit, but I think it might preclude other
> > > uses for no particularly good reason.
> >
> > Mostly, I was trying to limit the scope a bit.  Import is tricky and,
> > to be honest, I'm still not 100% sure that it's safe.  I probably
> > should have left RFC on this patch.
> >
> > As long as we keep everything inside the kernel, there's a requirement
> > that any work which triggers any fence on a dma_resv waits on the
> > exclusive fence, if any.  Work which sets the exclusive fence has to
> > wait on all fences.  With the import ioctl, we can guarantee that the
> > fences fire in the right order by making an array fence containing the
> > new fence and all other fences and using that as the exclusive fence.
> > What we can't do, however, is ensure that the work which triggers the
> > fence actually blocks on ANY of the fences on the dma_resv.
> >
> > Hrm... I think I've now convinced myself that importing shared fences
> > is no more dangerous than importing an exclusive fence because they
> > share the same set of problems.  Unfortunately, I've unconvinced
> > myself of the safety of either.  I've got to think some more....
> >
> > The most convincing argument I can make to myself is that this ioctl
> > is like having a one-shot little queue that contains tiny little work
> > jobs which wait on whatever sync_file you provide, do nothing, and
> > then signal.  That should be safe, right?
>
> Yeah, I don't think there's any difference between shared and
> exclusive wrt safety. The difference lies in, well, exclusive putting
> a hard serialisation barrier between everything which comes before and
> everything that comes after, and shared being more relaxed to allow
> for reads to retire in parallel.
>
> As said below, I think there's a good argument for the latter once you
> get out of the very straightforward uses. One of the arguments for
> these ioctls is to eliminate oversync, but then the import ioctl
> mandates oversync in the case where the consumer only does
> non-destructive reads - which is the case for the vast majority of
> users!

Just wanted to comment on this: Right now we always attach a
shared end-of-batch fence to every dma_resv. So reads are
automatically and always synced. So in that sense having an explicit
ioctl to set the read fence is not really useful, since at most you
just make everything worse.

Until we have userspace memory fences this is also pretty much guaranteed to
stay like that, except if we maybe split up the shared fences into
"relevant for implicit sync" and "not so relevant for implicit sync".
amdgpu has an interesting model there with only engaging in implicit
sync if it's a fence from a different drm_file (but they lack the
distinction between read/write and therefore oversync in other
cases, oops). I haven't wrapped my head around how to glue those two
approaches together.

So anyway right now all we really have on the table for explicit
control of the implicit fences is really just:
- should I stall for the exclusive fence/all fences/not at all
- should I set the exclusive fence

And from my additional auditing this week we already have plenty of
small bugs in this area :-( So lifting more restrictions on what you
can do when handling implicit fences explicitly is going to be
major surgery ... Implicit fencing is bound to stay a very funny IPC
for fences for quite some time I think.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/4] RFC: dma-buf: Add an API for importing sync files (v6)
  2021-05-26 12:35         ` Daniel Vetter
@ 2021-05-26 13:08           ` Daniel Stone
  2021-05-26 13:44             ` Daniel Vetter
  0 siblings, 1 reply; 39+ messages in thread
From: Daniel Stone @ 2021-05-26 13:08 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Daniel Vetter, dri-devel, Christian König, Jason Ekstrand

Hey,

On Wed, 26 May 2021 at 13:35, Daniel Vetter <daniel@ffwll.ch> wrote:
> On Wed, May 26, 2021 at 1:09 PM Daniel Stone <daniel@fooishbar.org> wrote:
> > Yeah, I don't think there's any difference between shared and
> > exclusive wrt safety. The difference lies in, well, exclusive putting
> > a hard serialisation barrier between everything which comes before and
> > everything that comes after, and shared being more relaxed to allow
> > for reads to retire in parallel.
> >
> > As said below, I think there's a good argument for the latter once you
> > get out of the very straightforward uses. One of the arguments for
> > these ioctls is to eliminate oversync, but then the import ioctl
> > mandates oversync in the case where the consumer only does
> > non-destructive reads - which is the case for the vast majority of
> > users!
>
> Just wanted to comment on this: Right now we always attach a
> shared end-of-batch fence to every dma_resv. So reads are
> automatically and always synced. So in that sense having an explicit
> ioctl to set the read fence is not really useful, since at most you
> just make everything worse.

Are you saying that if a compositor imports a client-provided dmabuf
as an EGLImage to use as a source texture for its rendering, and then
provides it to VA-API or V4L2 to use as a media encode source (both
purely read-only ops), that these will both serialise against each
other? Like, my media decode job won't begin execution until the
composition read has fully retired?

If so, a) good lord that hurts, and b) what are shared fences actually ... for?

Cheers,
Daniel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/4] RFC: dma-buf: Add an API for importing sync files (v6)
  2021-05-26 13:08           ` Daniel Stone
@ 2021-05-26 13:44             ` Daniel Vetter
  2021-05-26 15:13               ` Daniel Stone
  0 siblings, 1 reply; 39+ messages in thread
From: Daniel Vetter @ 2021-05-26 13:44 UTC (permalink / raw)
  To: Daniel Stone
  Cc: Daniel Vetter, dri-devel, Jason Ekstrand, Christian König

On Wed, May 26, 2021 at 02:08:19PM +0100, Daniel Stone wrote:
> Hey,
> 
> On Wed, 26 May 2021 at 13:35, Daniel Vetter <daniel@ffwll.ch> wrote:
> > On Wed, May 26, 2021 at 1:09 PM Daniel Stone <daniel@fooishbar.org> wrote:
> > > Yeah, I don't think there's any difference between shared and
> > > exclusive wrt safety. The difference lies in, well, exclusive putting
> > > a hard serialisation barrier between everything which comes before and
> > > everything that comes after, and shared being more relaxed to allow
> > > for reads to retire in parallel.
> > >
> > > As said below, I think there's a good argument for the latter once you
> > > get out of the very straightforward uses. One of the arguments for
> > > these ioctls is to eliminate oversync, but then the import ioctl
> > > mandates oversync in the case where the consumer only does
> > > non-destructive reads - which is the case for the vast majority of
> > > users!
> >
> > Just wanted to comment on this: Right now we always attach a
> > shared end-of-batch fence to every dma_resv. So reads are
> > automatically and always synced. So in that sense having an explicit
> > ioctl to set the read fence is not really useful, since at most you
> > just make everything worse.
> 
> Are you saying that if a compositor imports a client-provided dmabuf
> as an EGLImage to use as a source texture for its rendering, and then
> provides it to VA-API or V4L2 to use as a media encode source (both
> purely read-only ops), that these will both serialise against each
> other? Like, my media decode job won't begin execution until the
> composition read has fully retired?
> 
> If so, a) good lord that hurts, and b) what are shared fences actually ... for?

Shared is shared, I just meant to say that we always add the shared fence.
So an explicit ioctl to add more shared fences is kinda pointless.

So yeah on a good driver this will run in parallel. On a not-so-good
driver (which currently includes amdgpu and panfrost) this will serialize,
because those drivers don't have the concept of a non-exclusive fence for
such shared buffers (amdgpu does not sync internally, but will sync as
soon as it's cross-drm_file).
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/4] RFC: dma-buf: Add an API for importing sync files (v6)
  2021-05-26 11:08       ` Daniel Stone
  2021-05-26 12:35         ` Daniel Vetter
@ 2021-05-26 13:52         ` Daniel Stone
  2021-05-26 15:24         ` Jason Ekstrand
  2 siblings, 0 replies; 39+ messages in thread
From: Daniel Stone @ 2021-05-26 13:52 UTC (permalink / raw)
  To: Jason Ekstrand; +Cc: Daniel Vetter, Christian König, dri-devel

Hi,
Just making sure this thread doesn't get too short ...

On Wed, 26 May 2021 at 12:08, Daniel Stone <daniel@fooishbar.org> wrote:
> On Mon, 24 May 2021 at 18:11, Jason Ekstrand <jason@jlekstrand.net> wrote:
> > I'd like to address this one as it's a comment you've made several
> > times.  Once you've fixed raw X11 (not just XWayland) and a new
> > release has been made (hah!) and is shipping in distros with said
> > support, then we can talk.  Sorry if that comes off as overly snarky
> > but that's reality that we (driver devs) are living with.  To make
> > things even worse, when we're in Vulkan land (as opposed to GL), we
> > can't tell up-front whether or not our window-system supports foo
> > fences and adjust accordingly.  We have to start up, begin rendering,
> > and only later figure out "oops, this one goes to X11".  We really
> > can't say things like "when running on modern Wayland, do things the
> > new way" because Vulkan doesn't have a concept of "running on" a
> > window system.
>
> Hey, no offence taken, and even if there was, there's no point denying
> the reality that we don't still have Wayland absolutely everywhere,
> and even when we do, there's still a Streams-sized elephant in the
> room that no-one wants to talk about.
>
> Ultimately though, there will always be two codepaths. Xorg (as
> opposed to Xwayland) is currently unmaintained, and I expect it to
> remain so forever. The last times we tried to plumb synchronisation
> through to native-Xorg, it collapsed under conflicting requests for
> rewriting the presentation mechanism (which very few people understand
> and even fewer people can make work reliably), and I don't see that
> changing any time soon. Plus the single biggest remaining use for
> native-Xorg is that it implements tearflips by blitting to a single
> internal frontbuffer which is currently being scanned out, which is
> going to be ... difficult ... to square with the goal of actually
> synchronising things.
>
> What I mean though is that I don't think it makes sense to have three
> design points.
>
> Previously, the only effort put into explicit synchronisation on the
> winsys side was using it was a means to surface the timings for both
> tracing and being able to calibrate compositor repaints. Given that we
> need to support interop with the implicit world forever, and given
> that there was no explicit interaction with the implicit resv slots
> (solved by this new uAPI), and given that the sync_file vs.
> drm_syncobj vs. ?? userspace fence story wasn't clear, it didn't seem
> like a practical benefit to sink time into supporting it, just that it
> ticked a 'yep we're doing things explicitly' box, which was not
> practically useful since EGL doesn't let you opt out of implicit
> semantics anyway.
>
> Now with the details having come out of AMD/Arm/Intel about future
> gens, and the driver-side pain being properly understood, as well as
> the new uAPI actually giving us clear benefit, there's a good reason
> to work on it. So we'll do that on the winsys side, and support the
> new explicitly-managed-implicit-slot world, and everyone's lives will
> be better. Which gives us two design points: unaware implicit-only
> users (Xorg, old Wayland, old GPU, media), and fully-aware
> extremely-explicit users (Mutter, Weston, wlroots, new GPU).

So I said why I think X11 is totally fine (sync on client, let the
server stay dumb and unsynchronised) for the new world, but not why
I'm so confident that it's totally fine for Wayland and we can do it
quickly, despite previously banging on about how much work it was and
why it was so difficult. You already know this Jason, but for the rest
of the class ...

All Wayland compositors I've ever seen have a strong separation
between the protocol-visible objects, and per-output repaint loops,
not least because this is strongly encoded into the protocol. Creating
a surface, attaching a buffer to it, and committing the surface state,
will only update a compositor list of the current state of the
protocol-visible objects.

At 'a good time' for each output (often next vblank minus a few
milliseconds), the compositor will effectively snapshot that state,
generate a scene graph from it, and go through repaint for that output
(GPU composition and/or KMS planes and/or media encode and/or RDP etc)
to turn that to light. This is unlike X11 where you can post rendering
commands and then call XGetImage to pull the pixel result back. We
also explicitly chose not to follow the X11 model of a global
co-ordinate space with windows explicitly treed down from a shared
root window: inter-surface operations are very much disconnected from
each other with no ordering guarantee, and since clients are totally
isolated from each other, there are no inter-client operations.

So that already gives us a clean break between clients posting state
and visible effect, with a lot of latitude on timing: you post some
updates, and a compositor which users actually want will make those
updates real at some point, but nothing is guaranteed.

There is one cut-out from inter-surface operations, which is our
subsurface protocol. An example of this is having your browser UI as
the primary window, which embeds a separate rendered window for each
tab, which further embeds a window with your pop-over ad^W^Wmedia
content. To make resizing and scrolling work without being all
Flash-on-Navigator jank, subsurfaces have a 'synchronised' mode, where
clients can make protocol-visible updates to their surfaces, but the
updates will not be recorded into the visible scene graph until
explicitly released by the topmost parent - think of it like a display
controller's 'go bit' or updating the tail pointer on a ring buffer,
where we record updates but they're stashed away until everything has
become coherent.
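
To make that concrete, the client-side flow looks roughly like this (a
minimal libwayland-client sketch; buffer management and error handling
omitted):

  #include <wayland-client.h>

  /* tab content: state is recorded, but held back from the scene graph */
  wl_subsurface_set_sync(tab_subsurface);
  wl_surface_attach(tab_surface, tab_buffer, 0, 0);
  wl_surface_damage(tab_surface, 0, 0, tab_width, tab_height);
  wl_surface_commit(tab_surface);          /* cached, not yet visible */

  /* the topmost parent's commit is the 'go bit': the cached subsurface
   * state becomes part of the scene graph atomically with this update */
  wl_surface_attach(ui_surface, ui_buffer, 0, 0);
  wl_surface_damage(ui_surface, 0, 0, ui_width, ui_height);
  wl_surface_commit(ui_surface);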

Every useful compositor already implements this subsurface protocol as
well, which gives us a _further_ gap between protocol-visible
client-recorded state and the scene graph which is used for painting
outputs. Handling wait-before-sync is still a lot of typing, and
difficult for corner cases, and will need experimentation in the real
world before we can all land on useful common semantics, but we've
already done the most conceptually difficult work, which was to
separate client/protocol-visible object state from the scene graph
that gets generated/snapshotted on every output update.

So that's why I'm bullish on how quickly we can move on the Wayland
side. It's a lot of typing, but we already have two clean breaks from
recorded client surface state -> scene graph -> pixels lit up, with
each break already handling fragmentation between different surface
groups rather than being a global freeze/thaw.

(Having typed all of that, I'm reminded that we never finished up the
EGL extension to opt out of implicit sync on a per-EGLImage basis,
which was mostly blocked on explicit sync being pretty immature in
drivers and none of the uAPI fitting together e.g. per-CS vs. per-BO
... now seems like a pretty apt time to revive that too.)

Cheers,
Daniel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/4] RFC: dma-buf: Add an API for importing sync files (v6)
  2021-05-26 13:44             ` Daniel Vetter
@ 2021-05-26 15:13               ` Daniel Stone
  2021-05-26 16:52                 ` Daniel Vetter
  0 siblings, 1 reply; 39+ messages in thread
From: Daniel Stone @ 2021-05-26 15:13 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Daniel Vetter, dri-devel, Christian König, Jason Ekstrand

On Wed, 26 May 2021 at 14:44, Daniel Vetter <daniel@ffwll.ch> wrote:
> On Wed, May 26, 2021 at 02:08:19PM +0100, Daniel Stone wrote:
> > Are you saying that if a compositor imports a client-provided dmabuf
> > as an EGLImage to use as a source texture for its rendering, and then
> > provides it to VA-API or V4L2 to use as a media encode source (both
> > purely read-only ops), that these will both serialise against each
> > other? Like, my media decode job won't begin execution until the
> > composition read has fully retired?
> >
> > If so, a) good lord that hurts, and b) what are shared fences actually ... for?
>
> Shared is shared, I just meant to say that we always add the shared fence.
> So an explicit ioctl to add more shared fences is kinda pointless.
>
> So yeah on a good driver this will run in parallel. On a not-so-good
> driver (which currently includes amdgpu and panfrost) this will serialize,
> because those drivers don't have the concept of a non-exclusive fence for
> such shared buffers (amdgpu does not sync internally, but will sync as
> soon as it's cross-drm_file).

When you say 'we always add the shared fence', add it to ... where?
And which shared fence? (I'm going to use 'fence' below to refer to
anything from literal sync_file to timeline-syncobj to userspace
fence.)

I'll admit that I've typed out an argument twice for always export
from excl+shared, and always import to excl, results in oversync. And
I keep tying myself in knots trying to do it. It's arguably slightly
contrived, but here's my third attempt ...

Vulkan Wayland client, full-flying-car-sync Wayland protocol,
Vulkan-based compositor. Part of the contract when the server exposes
that protocol is that it guarantees to do explicit sync in both
directions, so the client provides a fence at QueueSubmit time and the
server provides one back when releasing the image for return to ANI.
Neither side ever record fences into the dma_resv because they've
opted out by being fully explicit-aware.

Now add media encode out on the side because you're streaming. The
compositor knows this is a transition between explicit and implicit
worlds, so it imports the client's fence into the exclusive dma_resv
slot, which makes sense: the media encode has to sync against the
client work, but is indifferent to the parallel compositor work. The
shared fence is exported back out so the compositor can union the
encode-finished fence with its composition-finished fence to send back
to the client with release/ANI.

Now add a second media encode because you want a higher-quality local
capture to upload to YouTube later on. The compositor can do the exact
same import/export dance, and the two encodes can safely run in
parallel. Which is good.
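
In terms of the new uAPI, that import/export dance would look roughly
like this (a sketch only, assuming the struct and flag names from this
series; client_fence_fd is the fence the client handed over through the
flying-car protocol):

  #include <sys/ioctl.h>
  #include <linux/dma-buf.h>

  struct dma_buf_sync_file args = {
      .flags = DMA_BUF_SYNC_RW,
      .fd = client_fence_fd,
  };
  /* make implicit-sync consumers (the media encode) wait on the client */
  ioctl(dmabuf_fd, DMA_BUF_IOCTL_IMPORT_SYNC_FILE, &args);

  /* ... kick off the VA-API / V4L2 encode job reading from dmabuf_fd ... */

  /* snapshot the resv (now including the encode's fence) to union with
   * the compositor's composition-finished fence for release/ANI */
  struct dma_buf_sync_file out = { .flags = DMA_BUF_SYNC_RW };
  ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &out);
  /* out.fd is a sync_file fd to merge and send back with the release */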

Where it starts to become complex is: what if your compositor is fully
explicit-aware but your clients aren't, so your compositor has more
import/export points to record into the resv? What if you aren't
actually a compositor but a full-blown media pipeline, where you have
a bunch of threads all launching reads in parallel, to the extent
where it's not practical to manage implicit/explicit transitions
globally, but each thread has to more pessimistically import and
export around each access?

I can make the relatively simple usecases work, but it really feels
like in practice we'll end up with massive oversync in some fairly
complex usecases, and we'll regret not having had it from the start,
plus people will just rely on implicit sync for longer because it has
better (more parallel) semantics in some usecases.

Cheers,
Daniel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/4] RFC: dma-buf: Add an API for importing sync files (v6)
  2021-05-26 11:08       ` Daniel Stone
  2021-05-26 12:35         ` Daniel Vetter
  2021-05-26 13:52         ` Daniel Stone
@ 2021-05-26 15:24         ` Jason Ekstrand
  2021-05-26 18:14           ` Daniel Stone
  2 siblings, 1 reply; 39+ messages in thread
From: Jason Ekstrand @ 2021-05-26 15:24 UTC (permalink / raw)
  To: Daniel Stone; +Cc: Daniel Vetter, Christian König, dri-devel

On Wed, May 26, 2021 at 6:09 AM Daniel Stone <daniel@fooishbar.org> wrote:
> On Mon, 24 May 2021 at 18:11, Jason Ekstrand <jason@jlekstrand.net> wrote:
> >  3. Userspace memory fences.
> >
> > Note that timeline syncobj is NOT in that list.  IMO, all the "wait
> > for submit" stuff is an implementation detail we needed in order to
> > get the timeline semantics on top of immutable SW fences.  Under the
> > hood it's all dma_fence; this just gives us a shareable container so
> > we can implement VK_KHR_timeline_semaphore with sharing.  I really
> > don't want to make Wayland protocol around it if memory fences are the
> > final solution.
>
> Typing out the Wayland protocol isn't the hard bit. If we just need to
> copy and sed syncobj to weirdsyncobj, no problem really, and it gives
> us a six-month head start on painful compositor-internal surgery
> whilst we work on common infrastructure to ship userspace fences
> around (mappable dmabuf with the sync bracketing? FD where every
> read() gives you the current value? memfd? other?).

I feel like I should elaborate more about timelines.  In my earlier
reply, my commentary about timeline syncobj was mostly focused around
helping people avoid typing.  That's not really the full story,
though, and I hope more context will help.

First, let me say that timeline syncobj was designed as a mechanism to
implement VK_KHR_timeline_semaphore without inserting future fences
into the kernel.  It's entirely designed around the needs of Vulkan
drivers, not really as a window-system primitive.  The semantics are
designed around one driver communicating to another that new fences
have been added and it's safe to kick off more rendering.  I'm not
convinced that it's the right object for window-systems and I'm also
not convinced that it's a good idea to try and make a version of it
that's a wrapper around a userspace memory fence.  (I'm going to start
typing UMF for userspace memory fence because it's long to type out.)

Why?  Well, the fundamental problem with timelines in general is
trying to figure out when it's about to be done.  But timeline syncobj
solves this for us!  It gives us this fancy super-useful ioctl!
Right?  Uh.... not as well as I'd like.  Let's say we make a timeline
syncobj that's a wrapper around a userspace memory fence.  What do we
do with that ioctl?  As I mentioned above, the kernel doesn't have any
clue when it will be triggered so that ioctl turns into an actual
wait.  That's no good because it creates unnecessary stalls.

There's another potential solution here:  Have each UMF be two
timelines: submitted and completed.  At the start of every batch
that's supposed to trigger a UMF, we set the "submitted" side and
then, when it completes, we set the "completed" side.  Ok, great, now
we can get at the "about to be done" with the submitted side,
implement the ioctl, and we're all good, right?  Sadly, no.  There's
no guarantee about how long a "batch" takes.  So there's no universal
timeout the kernel can apply.  Also, if it does time out, the kernel
doesn't know who to blame for the timeout and how to prevent itself
from getting in trouble again.  The compositor does, so in theory,
given the right ioctls, it could detect the -ETIME and kill that
client.  Not a great solution.

The best option I've been able to come up with for this is some sort
of client-provided signal.  Something where it says, as part of submit
or somewhere else, "I promise I'll be done soon" where that promise
comes with dire consequences if it's not.  At that point, we can turn
the UMF and a particular wait value into a one-shot fence like a
dma_fence or sync_file, or signal a syncobj on it.  If it ever times
out, we kick their context.  In Vulkan terminology, they get
VK_ERROR_DEVICE_LOST.  There are two important bits here:  First, is
that it's based on a client-provided thing.  With a fully timeline
model and wait-before-signal, we can't infer when something is about
to be done.  Only the client knows when it submitted its last node in
the dependency graph and the whole mess is unblocked.  Second, is that
the dma_fence is created within the client's driver context.  If it's
created compositor-side, the kernel doesn't know who to blame if
things go badly.  If we create it in the client, it's pretty easy to
make context death on -ETIME part of the contract.

(Before danvet jumps in here and rants about how UMF -> dma_fence
isn't possible, I haven't forgotten.  I'm pretending, for now, that
we've solved some of those problems.)

Another option is to just stall on the UMF until it's done.  Yeah,
kind-of terrible and high-latency, but it always works and doesn't
involve any complex logic to kill clients.  If a client never gets
around to signaling a fence, it just never repaints.  The compositor
keeps going like nothing's wrong.  Maybe, if the client submits lots
of frames without ever triggering, it'll hit some max queue depth
somewhere and kill it but that's it.  More likely, the client's
vkAcquireNextImage will start timing out and it'll crash.

I suspect where we might actually land is some combination of the two
depending on client choice.  If the client wants to be dumb, it gets
the high-latency always-works path.  If the client really wants
lowest-latency VRR, it has to take the smarter path and risk
VK_ERROR_DEVICE_LOST if it misses too far.

But the point of all of this is, neither of the above two paths have
anything to do with the compositor calling a "wait for submit" ioctl.
Building a design around that and baking it into protocol is, IMO, a
mistake.  I don't see any valid way to handle this mess without "wait
for submit" either not existing or existing only client-side for the
purposes of WSI.

--Jason

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/4] RFC: dma-buf: Add an API for importing sync files (v6)
  2021-05-26 15:13               ` Daniel Stone
@ 2021-05-26 16:52                 ` Daniel Vetter
  2021-05-26 18:01                   ` Daniel Stone
  2021-05-27  7:21                   ` Christian König
  0 siblings, 2 replies; 39+ messages in thread
From: Daniel Vetter @ 2021-05-26 16:52 UTC (permalink / raw)
  To: Daniel Stone
  Cc: Daniel Vetter, dri-devel, Christian König, Jason Ekstrand

On Wed, May 26, 2021 at 5:13 PM Daniel Stone <daniel@fooishbar.org> wrote:
> On Wed, 26 May 2021 at 14:44, Daniel Vetter <daniel@ffwll.ch> wrote:
> > On Wed, May 26, 2021 at 02:08:19PM +0100, Daniel Stone wrote:
> > > Are you saying that if a compositor imports a client-provided dmabuf
> > > as an EGLImage to use as a source texture for its rendering, and then
> > > provides it to VA-API or V4L2 to use as a media encode source (both
> > > purely read-only ops), that these will both serialise against each
> > > other? Like, my media decode job won't begin execution until the
> > > composition read has fully retired?
> > >
> > > If so, a) good lord that hurts, and b) what are shared fences actually ... for?
> >
> > Shared is shared, I just meant to say that we always add the shared fence.
> > So an explicit ioctl to add more shared fences is kinda pointless.
> >
> > So yeah on a good driver this will run in parallel. On a not-so-good
> > driver (which currently includes amdgpu and panfrost) this will serialize,
> > because those drivers don't have the concept of a non-exclusive fence for
> > such shared buffers (amdgpu does not sync internally, but will sync as
> > soon as it's cross-drm_file).
>
> When you say 'we always add the shared fence', add it to ... where?
> And which shared fence? (I'm going to use 'fence' below to refer to
> anything from literal sync_file to timeline-syncobj to userspace
> fence.)

In the current model, every time you submit anything to the gpu, we
create a dma_fence to track this work. This dma_fence is attached as a
shared fence to the dma_resv obj of every object in your working set.
Clarifications:
you = userspace or the kernel, anything really, including fun stuff
like writing PTEs, or clearing PTEs and then flushing TLBs
working set = depends, but can be anything from "really just the
buffers the current gpu submission uses" to "everything bound into a
given gpu VM"

This is the fence I'm talking about here.

Since you can't escape this (not unless we do direct userspace submit
with userspace memory fences) and since there's no distinction of the
shared fences into "relevant for implicit sync" and "not relevant for
implicit sync" there's really not much point in adding implicit read
fences. For now at least, we might want to change this eventually.
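
Roughly, the canonical per-object bookkeeping in a driver's submit path
looks like this (a minimal sketch; which slot gets used for reads vs.
writes varies a bit between drivers):

  /* 'fence' is the end-of-batch dma_fence for this submission */
  dma_resv_lock(obj->resv, NULL);
  if (writes) {
      /* writers replace the exclusive fence */
      dma_resv_add_excl_fence(obj->resv, fence);
  } else {
      /* readers add a shared end-of-batch fence */
      ret = dma_resv_reserve_shared(obj->resv, 1);
      if (!ret)
          dma_resv_add_shared_fence(obj->resv, fence);
  }
  dma_resv_unlock(obj->resv);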

> I'll admit that I've typed out an argument twice for always export
> from excl+shared, and always import to excl, results in oversync. And
> I keep tying myself in knots trying to do it. It's arguably slightly
> contrived, but here's my third attempt ...
>
> Vulkan Wayland client, full-flying-car-sync Wayland protocol,
> Vulkan-based compositor. Part of the contract when the server exposes
> that protocol is that it guarantees to do explicit sync in both
> directions, so the client provides a fence at QueueSubmit time and the
> server provides one back when releasing the image for return to ANI.
> Neither side ever record fences into the dma_resv because they've
> opted out by being fully explicit-aware.
>
> Now add media encode out on the side because you're streaming. The
> compositor knows this is a transition between explicit and implicit
> worlds, so it imports the client's fence into the exclusive dma_resv
> slot, which makes sense: the media encode has to sync against the
> client work, but is indifferent to the parallel compositor work. The
> shared fence is exported back out so the compositor can union the
> encode-finished fence with its composition-finished fence to send back
> to the client with release/ANI.
>
> Now add a second media encode because you want a higher-quality local
> capture to upload to YouTube later on. The compositor can do the exact
> same import/export dance, and the two encodes can safely run in
> parallel. Which is good.

So the example which works is really clear ...

> Where it starts to become complex is: what if your compositor is fully
> explicit-aware but your clients aren't, so your compositor has more
> import/export points to record into the resv? What if you aren't
> actually a compositor but a full-blown media pipeline, where you have
> a bunch of threads all launching reads in parallel, to the extent
> where it's not practical to manage implicit/explicit transitions
> globally, but each thread has to more pessimistically import and
> export around each access?

... but the example where we oversync is hand-waving?

:-P

> I can make the relatively simple usecases work, but it really feels
> like in practice we'll end up with massive oversync in some fairly
> complex usecases, and we'll regret not having had it from the start,
> plus people will just rely on implicit sync for longer because it has
> better (more parallel) semantics in some usecases.

Things fall apart in implicit sync if you have more than one logical
writer into the same buffer. Trivial example is two images in one
buffer, but you could also do funky stuff like interleaved/tiled
rendering with _indepedent_ consumers. If the consumers are not
independent, then you can again just stuff the two writer fences into
the exclusive slot with the new ioctl (they'll get merged without
additional overhead into one fence array fence).

And the fundamental thing is: This is just not possible with implicit
sync. There's only one fence slot (even if that resolves to an array
of fences for all the producers), so anytime you do multiple
independent things in the same buffer you either:
- must split the buffers so there's again a clear&unique handoff at
each stage of the pipeline
- or use explicit sync

So in your example, options are
- per-client buffers, which you then blend into a composite buffer to
handle the N implicit fences from N buffers into a single implicit
fence for libva conversion. This single buffer then also allows you to
again fan out to M libva encoders, or whatever it is that you fancy
- explicit fencing and clients render into a single buffer with no
copying, and libva encodes from that single buffer (but again needs
explicit fences or it all comes crashing down)

There's really no option C where you somehow do multiple implicitly
fenced things into a single buffer and expect it to work out in
parallel.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/4] RFC: dma-buf: Add an API for importing sync files (v6)
  2021-05-26 16:52                 ` Daniel Vetter
@ 2021-05-26 18:01                   ` Daniel Stone
  2021-05-27  7:21                   ` Christian König
  1 sibling, 0 replies; 39+ messages in thread
From: Daniel Stone @ 2021-05-26 18:01 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Daniel Vetter, dri-devel, Christian König, Jason Ekstrand

Hey,

On Wed, 26 May 2021 at 17:53, Daniel Vetter <daniel@ffwll.ch> wrote:
> On Wed, May 26, 2021 at 5:13 PM Daniel Stone <daniel@fooishbar.org> wrote:
> > > Shared is shared, I just meant to say that we always add the shared fence.
> > > So an explicit ioctl to add more shared fences is kinda pointless.
> > >
> > > So yeah on a good driver this will run in parallel. On a not-so-good
> > > driver (which currently includes amdgpu and panfrost) this will serialize,
> > > because those drivers don't have the concept of a non-exclusive fence for
> > > such shared buffers (amdgpu does not sync internally, but will sync as
> > > soon as it's cross-drm_file).
> >
> > When you say 'we always add the shared fence', add it to ... where?
> > And which shared fence? (I'm going to use 'fence' below to refer to
> > anything from literal sync_file to timeline-syncobj to userspace
> > fence.)
>
> In the current model, every time you submit anything to the gpu, we
> create a dma_fence to track this work. This dma_fence is attached as a
> shared fence to the dma_resv obj of every object in your working set.
> Clarifications:
> you = userspace or the kernel, anything really, including fun stuff
> like writing PTEs, or clearing PTEs and then flushing TLBs
> working set = depends, but can be anything from "really just the
> buffers the current gpu submission uses" to "everything bound into a
> given gpu VM"
>
> This is the fence I'm talking about here.
>
> Since you can't escape this (not unless we do direct userspace submit
> with userspace memory fences) and since there's no distinction of the
> shared fences into "relevant for implicit sync" and "not relevant for
> implicit sync" there's really not much point in adding implicit read
> fences. For now at least, we might want to change this eventually.

Yeah, I agree. My own clarification is that I'm talking about an
explicit-first world, where synchronisation is done primarily through
unknowable UMF, and falling back to implicit sync is a painful and
expensive operation that we only do when we need to. So, definitely
not on every CS (command submission aka execbuf aka vkQueueSubmit aka
glFlush).

> > I'll admit that I've typed out an argument twice for always export
> > from excl+shared, and always import to excl, results in oversync. And
> > I keep tying myself in knots trying to do it. It's arguably slightly
> > contrived, but here's my third attempt ...
> >
> > Vulkan Wayland client, full-flying-car-sync Wayland protocol,
> > Vulkan-based compositor. Part of the contract when the server exposes
> > that protocol is that it guarantees to do explicit sync in both
> > directions, so the client provides a fence at QueueSubmit time and the
> > server provides one back when releasing the image for return to ANI.
> > Neither side ever record fences into the dma_resv because they've
> > opted out by being fully explicit-aware.
> >
> > Now add media encode out on the side because you're streaming. The
> > compositor knows this is a transition between explicit and implicit
> > worlds, so it imports the client's fence into the exclusive dma_resv
> > slot, which makes sense: the media encode has to sync against the
> > client work, but is indifferent to the parallel compositor work. The
> > shared fence is exported back out so the compositor can union the
> > encode-finished fence with its composition-finished fence to send back
> > to the client with release/ANI.
> >
> > Now add a second media encode because you want a higher-quality local
> > capture to upload to YouTube later on. The compositor can do the exact
> > same import/export dance, and the two encodes can safely run in
> > parallel. Which is good.
>
> So the example which works is really clear ...
>
> > Where it starts to become complex is: what if your compositor is fully
> > explicit-aware but your clients aren't, so your compositor has more
> > import/export points to record into the resv? What if you aren't
> > actually a compositor but a full-blown media pipeline, where you have
> > a bunch of threads all launching reads in parallel, to the extent
> > where it's not practical to manage implicit/explicit transitions
> > globally, but each thread has to more pessimistically import and
> > export around each access?
>
> ... but the example where we oversync is hand-waving?
>
> :-P

Hey, I said I tied myself into knots! Maybe it's because my brain is
too deeply baked into implicit sync, maybe it's because the problem
cases aren't actually problems. Who knows.

I think what it comes down to is that these modal switches are
workable for (at least current-generation, before someone bakes it
into Unity) Wayland compositors, but really difficult for more complex
and variable pipeline frameworks like GStreamer or PipeWire.

> > I can make the relatively simple usecases work, but it really feels
> > like in practice we'll end up with massive oversync in some fairly
> > complex usecases, and we'll regret not having had it from the start,
> > plus people will just rely on implicit sync for longer because it has
> > better (more parallel) semantics in some usecases.
>
> Things fall apart in implicit sync if you have more than one logical
> writer into the same buffer. Trivial example is two images in one
> buffer, but you could also do funky stuff like interleaved/tiled
> rendering with _indepedent_ consumers. If the consumers are not
> independent, then you can again just stuff the two writer fences into
> the exclusive slot with the new ioctl (they'll get merged without
> additional overhead into one fence array fence).
>
> And the fundamental thing is: This is just not possible with implicit
> sync. There's only one fence slot (even if that resolves to an array
> of fences for all the producers), so anytime you do multiple
> independent things in the same buffer you either:
> - must split the buffers so there's again a clear&unique handoff at
> each stage of the pipeline
> - or use explicit sync

Yeah no argument, this doesn't work & can't work in implicit sync.

But what I'm talking about is having a single writer (serialised) and
multiple readers (in parallel). Readers add to the shared slot,
serialising their access against all earlier exclusive fences, and
writers add to the exclusive slot, serialising their access against
all earlier fences (both exclusive and shared).

So if import can only add to the exclusive slot, then we can end up
potentially serialising readers against each other. We want readers to
land in the shared slot to be able to parallelise against each other
and let writers serialise after them, no?

> So in your example, options are
> - per-client buffers, which you then blend into a composite buffer to
> handle the N implicit fences from N buffers into a single implicit
> fence for libva conversion. This single buffer then also allows you to
> again fan out to M libva encoders, or whatever it is that you fancy
> - explicit fencing and clients render into a single buffer with no
> copying, and libva encodes from that single buffer (but again needs
> explicit fences or it all comes crashing down)
>
> There's really no option C where you somehow do multiple implicitly
> fenced things into a single buffer and expect it to work out in
> parallel.

All of my examples above are a single client buffer (GPU source which
places a fence into the exclusive slot for when the colour buffer
contents are fully realised), just working its way through multiple
stages and APIs. Like, your single Dota2 window ends up in a
Vulkan-based Wayland compositor, a pure VA-API encode stream to write
high-quality AV1 to disk, and also an EGL pipeline which overlays your
awesome logo and webcam stream before VA-API encoding to a
lower-quality H.264 stream for Twitch. This isn't a convoluted
example, it's literally what the non-geriatric millennials do all day.
It's a lot of potential boundaries between implicit & explicit world,
and if we've learned one thing from modifiers it's that we probably
shouldn't underthink the boundaries.

So:

1. Does every CS generate the appropriate resv entries (exclusive for
write, shared for read) for every access to every buffer? I think the
answer has to be no, because it's not necessarily viable in future.
2. If every CS doesn't generate the appropriate resv entries, do we go
for the middle ground where we keep interactions with implicit sync
implicit (e.g. every client API accessing any externally-visible BO
populates the appropriate resv slot, but internal-only buffers get to
skip it), or do we surface them and make it explicit (e.g. the Wayland
explicit-sync protocol is a contract between client/compositor that
the client doesn't have to populate the resv slots, because the
compositor will ensure every access it makes is appropriate
synchronised)? I think the latter, because the halfway house sounds
really painful for questionable if any benefit, and makes it maybe
impossible for us to one day deprecate implicit.
3. If we do surface everything and make userspace handle the
implicit/explicit boundaries, do we make every explicit -> implicit
boundary (via the import ioctl) populate the exclusive slot or allow
it to choose? I think allow it to choose, because I don't understand
what the restriction buys us.
4. Can the combination of dynamic modifier negotiation and explicit
synchronisation let us deliver the EGLStreams promise before
EGLStreams can? :)

Cheers,
Daniel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/4] RFC: dma-buf: Add an API for importing sync files (v6)
  2021-05-26 15:24         ` Jason Ekstrand
@ 2021-05-26 18:14           ` Daniel Stone
  2021-05-27  4:32             ` Jason Ekstrand
  0 siblings, 1 reply; 39+ messages in thread
From: Daniel Stone @ 2021-05-26 18:14 UTC (permalink / raw)
  To: Jason Ekstrand; +Cc: Daniel Vetter, Christian König, dri-devel

Hey,

On Wed, 26 May 2021 at 16:24, Jason Ekstrand <jason@jlekstrand.net> wrote:
> On Wed, May 26, 2021 at 6:09 AM Daniel Stone <daniel@fooishbar.org> wrote:
> > Typing out the Wayland protocol isn't the hard bit. If we just need to
> > copy and sed syncobj to weirdsyncobj, no problem really, and it gives
> > us a six-month head start on painful compositor-internal surgery
> > whilst we work on common infrastructure to ship userspace fences
> > around (mappable dmabuf with the sync bracketing? FD where every
> > read() gives you the current value? memfd? other?).
>
> I feel like I should elaborate more about timelines.  In my earlier
> reply, my commentary about timeline syncobj was mostly focused around
> helping people avoid typing.  That's not really the full story,
> though, and I hope more context will help.
>
> First, let me say that timeline syncobj was designed as a mechanism to
> implement VK_KHR_timeline_semaphore without inserting future fences
> into the kernel.  It's entirely designed around the needs of Vulkan
> drivers, not really as a window-system primitive.  The semantics are
> designed around one driver communicating to another that new fences
> have been added and it's safe to kick off more rendering.  I'm not
> convinced that it's the right object for window-systems and I'm also
> not convinced that it's a good idea to try and make a version of it
> that's a wrapper around a userspace memory fence.  (I'm going to start
> typing UMF for userspace memory fence because it's long to type out.)
>
> Why?  Well, the fundamental problem with timelines in general is
> trying to figure out when it's about to be done.  But timeline syncobj
> solves this for us!  It gives us this fancy super-useful ioctl!
> Right?  Uh.... not as well as I'd like.  Let's say we make a timeline
> syncobj that's a wrapper around a userspace memory fence.  What do we
> do with that ioctl?  As I mentioned above, the kernel doesn't have any
> clue when it will be triggered so that ioctl turns into an actual
> wait.  That's no good because it creates unnecessary stalls.

Yeah, I'm assuming that UMF will be a separate primitive. No problem.
I also think that your submitted/completed thing is a non-problem: at
this stage we're just throwing up our hands and admitting that we're
letting userspace tie itself in knots, and giving it the tools to tie
a sufficiently un-streetwise compositor in knots too. We're already
crossing that Rubicon, so let's just embrace it and not try to design
it out. Us compositors can handle the scheduling, really.

> There's another potential solution here:  Have each UMF be two
> timelines: submitted and completed.  At the start of every batch
> that's supposed to trigger a UMF, we set the "submitted" side and
> then, when it completes, we set the "completed" side.  Ok, great, now
> we can get at the "about to be done" with the submitted side,
> implement the ioctl, and we're all good, right?  Sadly, no.  There's
> no guarantee about how long a "batch" takes.  So there's no universal
> timeout the kernel can apply.  Also, if it does time out, the kernel
> doesn't know who to blame for the timeout and how to prevent itself
> from getting in trouble again.  The compositor does, so in theory,
> given the right ioctls, it could detect the -ETIME and kill that
> client.  Not a great solution.
>
> The best option I've been able to come up with for this is some sort
> of client-provided signal.  Something where it says, as part of submit
> or somewhere else, "I promise I'll be done soon" where that promise
> comes with dire consequences if it's not.  At that point, we can turn
> the UMF and a particular wait value into a one-shot fence like a
> dma_fence or sync_file, or signal a syncobj on it.  If it ever times
> out, we kick their context.  In Vulkan terminology, they get
> VK_ERROR_DEVICE_LOST.  There are two important bits here:  First, is
> that it's based on a client-provided thing.  With a fully timeline
> model and wait-before-signal, we can't infer when something is about
> to be done.  Only the client knows when it submitted its last node in
> the dependency graph and the whole mess is unblocked.  Second, is that
> the dma_fence is created within the client's driver context.  If it's
> created compositor-side, the kernel doesn't know who to blame if
> things go badly.  If we create it in the client, it's pretty easy to
> make context death on -ETIME part of the contract.
>
> (Before danvet jumps in here and rants about how UMF -> dma_fence
> isn't possible, I haven't forgotten.  I'm pretending, for now, that
> we've solved some of those problems.)

Funny how we've come full circle to the original proposal here ...

If we really want a kernel primitive for this - and I think it's a
good idea, since it can help surface 'badness' in a way which is
observable by e.g. session managers in a way analogous to cgroup stats
and controls - how about this for a counter-proposal? Client exports a
FD for its context/queue and sends it to winsys as part of setup,
compositor can ioctl() on that to kill it, which lands in the same
zap/zap/zap/zap/ban codepath as GPU hangs do today. It's a bigger
hammer than per-sync-point primitives, but you as a client have to
accept the social contract that if you want to participate in a
session, your context has to be making forward progress and you aren't
writing cheques the compositor can't cash.
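
To make the shape of that concrete, a purely hypothetical sketch -- none
of these ioctls exist anywhere today, the names are invented only to
illustrate the counter-proposal:

  #include <stdint.h>
  #include <sys/ioctl.h>

  /* HYPOTHETICAL uAPI, invented for illustration only. */
  #define FAKE_CTX_EXPORT_KILL_FD  _IOWR('x', 1, uint32_t)
  #define FAKE_CTX_KILL            _IO('x', 2)

  /* Client, at winsys setup: export a kill handle for its context/queue
   * and send the fd to the compositor over the protocol. */
  static int export_ctx_kill_fd(int drm_fd, uint32_t ctx_id)
  {
      return ioctl(drm_fd, FAKE_CTX_EXPORT_KILL_FD, &ctx_id);
  }

  /* Compositor, when a client stops making forward progress: one ioctl
   * drops the context into the same zap/ban path as a GPU hang. */
  static int kill_client_context(int ctx_kill_fd)
  {
      return ioctl(ctx_kill_fd, FAKE_CTX_KILL, 0);
  }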

I'm also going to pre-emptively agree with other-Dan; I'm extremely
wary of anything which tries to make UMF look even a little bit like
sync_file. The requirements to support them are so wildly different
that I'd almost rather a totally orthogonal interface so that there's
no danger of confusing the two. Us sophisticates on this thread can
eat the mild annoyance of typing out separate codepaths, but it's much
worse for anyone else who may look at a UMF wolf in dma_fence sheep's
clothing then only later be substantially more annoyed when they
realise that it's not anything like they thought it was.

So let's keep sync_file for what it is, and for UMF since the usage is
so radically different, build out whatever we do around making the
uAPI as useful as possible for what we want to do with it. The real
complexity in handling the difference between UMF and 'real' fences is
in how they behave, not in how they look.

> Another option is to just stall on the UMF until it's done.  Yeah,
> kind-of terrible and high-latency, but it always works and doesn't
> involve any complex logic to kill clients.  If a client never gets
> around to signaling a fence, it just never repaints.  The compositor
> keeps going like nothing's wrong.  Maybe, if the client submits lots
> of frames without ever triggering, it'll hit some max queue depth
> somewhere and kill it but that's it.  More likely, the client's
> vkAcquireNextImage will start timing out and it'll crash.
>
> I suspect where we might actually land is some combination of the two
> depending on client choice.  If the client wants to be dumb, it gets
> the high-latency always-works path.  If the client really wants
> lowest-latency VRR, it has to take the smarter path and risk
> VK_ERROR_DEVICE_LOST if it misses too far.

We already have to handle unresponsive clients. If your browser
livelocks today (say because it's Chrome and you hotunplug your
monitor at the wrong time with active media playback in an inactive
tab in an inactive window ... hypothetically), your Wayland server
notices that it isn't responding to pings, throws up the 'do you want
to force-quit?' dialog and kills the client; it's actually really
simple logic. So we just hook unsignaled fences up to the same. (And,
if we have the context-kill primitive, trigger that on our way out.)

So yeah, we already have all the complexity points to put particular
surface trees in limbo (thanks to subsurface sync mode), we already
have all the complexity points to separate realised surface trees from
pixels on screen, and we already have the complexity points for
different parts of the surface trees being rendered at different
times. Checking on fence progression is just a little more typing
around those interface points which already exist, and zapping clients
is utterly trivial.

> But the point of all of this is, neither of the above two paths have
> anything to do with the compositor calling a "wait for submit" ioctl.
> Building a design around that and baking it into protocol is, IMO, a
> mistake.  I don't see any valid way to handle this mess without "wait
> for submit" either not existing or existing only client-side for the
> purposes of WSI.

I'm still on the fence (sorry) about a wait-before-submit ioctl. For
the sync_file-based timeline syncobjs that we have today, yes it is
helpful, and we do already have it, it's just the wrong shape in being
sleep rather than epoll.
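
For reference, the sleep-shaped thing we have today looks roughly like
this (libdrm; I'm assuming WAIT_AVAILABLE is the right flag for 'wait for
a fence to materialise without waiting for it to signal'):

  #include <stdint.h>
  #include <xf86drm.h>

  /* Block until a fence has been attached at 'point' on the timeline
   * syncobj, without waiting for it to actually signal.  It's a sleep,
   * though, not something you can drop into an epoll loop. */
  static int wait_for_submit(int drm_fd, uint32_t syncobj, uint64_t point,
                             int64_t abs_timeout_ns)
  {
      uint32_t first;

      return drmSyncobjTimelineWait(drm_fd, &syncobj, &point, 1,
                                    abs_timeout_ns,
                                    DRM_SYNCOBJ_WAIT_FLAGS_WAIT_AVAILABLE,
                                    &first);
  }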

For UMF, taking it as a given that the kernel really has no visibility
at all into syncpoint progress, then the kernel is conceptually a
worse place to spin-sleep than userspace is, because why account the
CPU burn to a kthread rather than a real PID, and lose
latency/efficiency on context switches when you do wake up?

But also, the kernel is conceptually the best place to spin-sleep,
because it can fuse waits and do better wakeup quantisation than
userspace can. And I'm still hopeful that the IHVs and Windows can
both step away from the postmodern 'synchronisation doesn't mean
anything anymore, just poll in a lap-burning loop' approach that we've
been presented (sorry) with, where we at least get doorbells which
allow the kernel to do polling much smarter than quantising timers
('this fence might not have triggered yet, but _something_ happened
which might have triggered it so why not check?').

On the other other hand, the only winsys case for burning poll in a
tight loop is flipping as quickly as possible straight to a VRR
display. In that case, you're definitely running on mains power so
you're just melting the polar ice caps rather than your lap, and
you've got everything fully lit up anyway so the power cost of polling
is immaterial. For FRR, the compositor already has a fixed deadline at
which it will wake up and make a hard binding decision about which
image to present - this includes XR as well. So we don't have to worry
about optimising a polling loop, because there isn't one: we wake up
once, we check once, and if the client's missed then too bad, try
again next frame.

As you can see, much like userspace memory fences, my position on
which way to go here is not knowable upfront, and depends on when
exactly you observe it. Hopefully someone can come back with an
argument compelling enough either way that I have something better to
do than to try to pun my way out of having more hands than Ganesh. I
don't think it's material to the design or implementation of winsys
support though.

Cheers,
Daniel


* Re: [PATCH 4/4] RFC: dma-buf: Add an API for importing sync files (v6)
  2021-05-26 18:14           ` Daniel Stone
@ 2021-05-27  4:32             ` Jason Ekstrand
  2021-05-27 10:19               ` Daniel Vetter
  0 siblings, 1 reply; 39+ messages in thread
From: Jason Ekstrand @ 2021-05-27  4:32 UTC (permalink / raw)
  To: Daniel Stone; +Cc: Daniel Vetter, Christian König, dri-devel

[-- Attachment #1: Type: text/plain, Size: 12314 bytes --]

On May 26, 2021 13:15:08 Daniel Stone <daniel@fooishbar.org> wrote:

> Hey,
>
> On Wed, 26 May 2021 at 16:24, Jason Ekstrand <jason@jlekstrand.net> wrote:
>> On Wed, May 26, 2021 at 6:09 AM Daniel Stone <daniel@fooishbar.org> wrote:
>>> Typing out the Wayland protocol isn't the hard bit. If we just need to
>>> copy and sed syncobj to weirdsyncobj, no problem really, and it gives
>>> us a six-month head start on painful compositor-internal surgery
>>> whilst we work on common infrastructure to ship userspace fences
>>> around (mappable dmabuf with the sync bracketing? FD where every
>>> read() gives you the current value? memfd? other?).
>>
>> I feel like I should elaborate more about timelines.  In my earlier
>> reply, my commentary about timeline syncobj was mostly focused around
>> helping people avoid typing.  That's not really the full story,
>> though, and I hope more context will help.
>>
>> First, let me say that timeline syncobj was designed as a mechanism to
>> implement VK_KHR_timeline_semaphore without inserting future fences
>> into the kernel.  It's entirely designed around the needs of Vulkan
>> drivers, not really as a window-system primitive.  The semantics are
>> designed around one driver communicating to another that new fences
>> have been added and it's safe to kick off more rendering.  I'm not
>> convinced that it's the right object for window-systems and I'm also
>> not convinced that it's a good idea to try and make a version of it
>> that's a wrapper around a userspace memory fence.  (I'm going to start
>> typing UMF for userspace memory fence because it's long to type out.)
>>
>> Why?  Well, the fundamental problem with timelines in general is
>> trying to figure out when it's about to be done.  But timeline syncobj
>> solves this for us!  It gives us this fancy super-useful ioctl!
>> Right?  Uh.... not as well as I'd like.  Let's say we make a timeline
>> syncobj that's a wrapper around a userspace memory fence.  What do we
>> do with that ioctl?  As I mentioned above, the kernel doesn't have any
>> clue when it will be triggered so that ioctl turns into an actual
>> wait.  That's no good because it creates unnecessary stalls.
>
> Yeah, I'm assuming that UMF will be a separate primitive. No problem.
> I also think that your submitted/completed thing is a non-problem: at
> this stage we're just throwing up our hands and admitting that we're
> letting userspace tie itself in knots, and giving it the tools to tie
> a sufficiently un-streetwise compositor in knots too. We're already
> crossing that Rubicon, so let's just embrace it and not try to design
> it out. Us compositors can handle the scheduling, really.

Ok, good. I think we're on the same page.

>
>> There's another potential solution here:  Have each UMF be two
>> timelines: submitted and completed.  At the start of every batch
>> that's supposed to trigger a UMF, we set the "submitted" side and
>> then, when it completes, we set the "completed" side.  Ok, great, now
>> we can get at the "about to be done" with the submitted side,
>> implement the ioctl, and we're all good, right?  Sadly, no.  There's
>> no guarantee about how long a "batch" takes.  So there's no universal
>> timeout the kernel can apply.  Also, if it does time out, the kernel
>> doesn't know who to blame for the timeout and how to prevent itself
>> from getting in trouble again.  The compositor does, so in theory,
>> given the right ioctls, it could detect the -ETIME and kill that
>> client.  Not a great solution.
>>
>> The best option I've been able to come up with for this is some sort
>> of client-provided signal.  Something where it says, as part of submit
>> or somewhere else, "I promise I'll be done soon" where that promise
>> comes with dire consequences if it's not.  At that point, we can turn
>> the UMF and a particular wait value into a one-shot fence like a
>> dma_fence or sync_file, or signal a syncobj on it.  If it ever times
>> out, we kick their context.  In Vulkan terminology, they get
>> VK_ERROR_DEVICE_LOST.  There are two important bits here:  First, is
>> that it's based on a client-provided thing.  With a fully timeline
>> model and wait-before-signal, we can't infer when something is about
>> to be done.  Only the client knows when it submitted its last node in
>> the dependency graph and the whole mess is unblocked.  Second, is that
>> the dma_fence is created within the client's driver context.  If it's
>> created compositor-side, the kernel doesn't know who to blame if
>> things go badly.  If we create it in the client, it's pretty easy to
>> make context death on -ETIME part of the contract.
>>
>> (Before danvet jumps in here and rants about how UMF -> dma_fence
>> isn't possible, I haven't forgotten.  I'm pretending, for now, that
>> we've solved some of those problems.)
>
> Funny how we've come full circle to the original proposal here ...
>
> If we really want a kernel primitive for this - and I think it's a
> good idea, since it can help surface 'badness' in a way which is
> observable by e.g. session managers in a way analogous to cgroup stats
> and controls - how about this for a counter-proposal? Client exports a
> FD for its context/queue and sends it to winsys as part of setup,
> compositor can ioctl() on that to kill it, which lands in the same
> zap/zap/zap/zap/ban codepath as GPU hangs do today. It's a bigger
> hammer than per-sync-point primitives, but you as a client have to
> accept the social contract that if you want to participate in a
> session, your context has to be making forward progress and you aren't
> writing cheques the compositor can't cash.

The compositor already has that. It can kick the client's Wayland protocol 
connection. Banning the context from the kernel might be nice too but 
kicking it is probably sufficient.
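
(Compositor-side, kicking the connection really is about one line --
roughly, assuming plain libwayland-server:)

  #include <wayland-server-core.h>

  /* Drop a misbehaving client: closes the connection and destroys all
   * of its resources. */
  static void kick_client(struct wl_client *client)
  {
      wl_client_destroy(client);
  }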

Side-note to danvet: Do we need a plan for UMF with persistent contexts? My 
gut says that's a very bad idea but this made me think I should at least 
pose the question.

> I'm also going to pre-emptively agree with other-Dan; I'm extremely
> wary of anything which tries to make UMF look even a little bit like
> sync_file. The requirements to support them are so wildly different
> that I'd almost rather a totally orthogonal interface so that there's
> no danger of confusing the two. Us sophisticates on this thread can
> eat the mild annoyance of typing out separate codepaths, but it's much
> worse for anyone else who may look at a UMF wolf in dma_fence sheep's
> clothing then only later be substantially more annoyed when they
> realise that it's not anything like they thought it was.
>
> So let's keep sync_file for what it is, and for UMF since the usage is
> so radically different, build out whatever we do around making the
> uAPI as useful as possible for what we want to do with it. The real
> complexity in handling the difference between UMF and 'real' fences is
> in how they behave, not in how they look.

Sounds good.

>
>> Another option is to just stall on the UMF until it's done.  Yeah,
>> kind-of terrible and high-latency, but it always works and doesn't
>> involve any complex logic to kill clients.  If a client never gets
>> around to signaling a fence, it just never repaints.  The compositor
>> keeps going like nothing's wrong.  Maybe, if the client submits lots
>> of frames without ever triggering, it'll hit some max queue depth
>> somewhere and kill it but that's it.  More likely, the client's
>> vkAcquireNextImage will start timing out and it'll crash.
>>
>> I suspect where we might actually land is some combination of the two
>> depending on client choice.  If the client wants to be dumb, it gets
>> the high-latency always-works path.  If the client really wants
>> lowest-latency VRR, it has to take the smarter path and risk
>> VK_ERROR_DEVICE_LOST if it misses too far.
>
> We already have to handle unresponsive clients. If your browser
> livelocks today (say because it's Chrome and you hotunplug your
> monitor at the wrong time with active media playback in an inactive
> tab in an inactive window ... hypothetically),

That's an oddly specific hypothetical...

> your Wayland server
> notices that it isn't responding to pings, throws up the 'do you want
> to force-quit?' dialog and kills the client; it's actually really
> simple logic. So we just hook unsignaled fences up to the same. (And,
> if we have the context-kill primitive, trigger that on our way out.)
>
> So yeah, we already have all the complexity points to put particular
> surface trees in limbo (thanks to subsurface sync mode), we already
> have all the complexity points to separate realised surface trees from
> pixels on screen, and we already have the complexity points for
> different parts of the surface trees being rendered at different
> times. Checking on fence progression is just a little more typing
> around those interface points which already exist, and zapping clients
> is utterly trivial.

👍

>
>> But the point of all of this is, neither of the above two paths have
>> anything to do with the compositor calling a "wait for submit" ioctl.
>> Building a design around that and baking it into protocol is, IMO, a
>> mistake.  I don't see any valid way to handle this mess without "wait
>> for submit" either not existing or existing only client-side for the
>> purposes of WSI.
>
> I'm still on the fence (sorry) about a wait-before-submit ioctl. For
> the sync_file-based timeline syncobjs that we have today, yes it is
> helpful, and we do already have it, it's just the wrong shape in being
> sleep rather than epoll.

I still don't see why we're still talking about timeline syncobj...

> For UMF, taking it as a given that the kernel really has no visibility
> at all into syncpoint progress, then the kernel is conceptually a
> worse place to spin-sleep than userspace is, because why account the
> CPU burn to a kthread rather than a real PID, and lose
> latency/efficiency on context switches when you do wake up?
>
> But also, the kernel is conceptually the best place to spin-sleep,
> because it can fuse waits and do better wakeup quantisation than
> userspace can. And I'm still hopeful that the IHVs and Windows can
> both step away from the postmodern 'synchronisation doesn't mean
> anything anymore, just poll in a lap-burning loop' approach that we've
> been presented (sorry) with, where we at least get doorbells which
> allow the kernel to do polling much smarter than quantising timers
> ('this fence might not have triggered yet, but _something_ happened
> which might have triggered it so why not check?').

I think we can and do do something better than just poll on the memory. I'm 
not sure on the details but I've been told that we can set some sort of 
interrupt-like thing on the address so it's not actually a spin. Even 
without that, some hardware has some way that a command buffer can trigger 
an interrupt. If the protocol is to write memory and then trigger an 
interrupt rather than just write memory, that gives us something of a 
doorbell. Not as convenient, maybe, but it'd help with power consumption, etc.

--Jason

> On the other other hand, the only winsys case for burning poll in a
> tight loop is flipping as quickly as possible straight to a VRR
> display. In that case, you're definitely running on mains power so
> you're just melting the polar ice caps rather than your lap, and
> you've got everything fully lit up anyway so the power cost of polling
> is immaterial. For FRR, the compositor already has a fixed deadline at
> which it will wake up and make a hard binding decision about which
> image to present - this includes XR as well. So we don't have to worry
> about optimising a polling loop, because there isn't one: we wake up
> once, we check once, and if the client's missed then too bad, try
> again next frame.
>
> As you can see, much like userspace memory fences, my position on
> which way to go here is not knowable upfront, and depends on when
> exactly you observe it. Hopefully someone can come back with an
> argument compelling enough either way that I have something better to
> do than to try to pun my way out of having more hands than Ganesh. I
> don't think it's material to the design or implementation of winsys
> support though.
>
> Cheers,
> Daniel


[-- Attachment #2: Type: text/html, Size: 19762 bytes --]


* Re: [PATCH 4/4] RFC: dma-buf: Add an API for importing sync files (v6)
  2021-05-26 16:52                 ` Daniel Vetter
  2021-05-26 18:01                   ` Daniel Stone
@ 2021-05-27  7:21                   ` Christian König
  1 sibling, 0 replies; 39+ messages in thread
From: Christian König @ 2021-05-27  7:21 UTC (permalink / raw)
  To: Daniel Vetter, Daniel Stone; +Cc: Daniel Vetter, dri-devel, Jason Ekstrand

On 26.05.21 at 18:52, Daniel Vetter wrote:
> [SNIP]
>> I can make the relatively simple usecases work, but it really feels
>> like in practice we'll end up with massive oversync in some fairly
>> complex usecases, and we'll regret not having had it from the start,
>> plus people will just rely on implicit sync for longer because it has
>> better (more parallel) semantics in some usecases.
> Things fall apart in implicit sync if you have more than one logical
> writer into the same buffer. Trivial example is two images in one
> buffer, but you could also do funky stuff like interleaved/tiled
> rendering with _indepedent_ consumers. If the consumers are not
> independent, then you can again just stuff the two writer fences into
> the exclusive slot with the new ioctl (they'll get merged without
> additional overhead into one fence array fence).
>
> And the fundamental thing is: This is just not possible with implicit
> sync. There's only one fence slot (even if that resolves to an array
> of fences for all the producers), so anytime you do multiple
> independent things in the same buffer you either:
> - must split the buffers so there's again a clear&unique handoff at
> each stage of the pipeline
> - or use explicit sync

Well exactly that is the problem we had with amdgpu and why we came up 
with the special handling there.

And you don't even need two images in one buffer, just special hardware 
which handles multiple writers gracefully is sufficient. The simplest 
example is a depth buffer, but we also have things like ordered append 
for ring buffers.

> So in your example, options are
> - per-client buffers, which you then blend into a composite buffer to
> handle the N implicit fences from N buffers into a single implicit
> fence for libva conversion. This single buffer then also allows you to
> again fan out to M libva encoders, or whatever it is that you fancy
> - explicit fencing and clients render into a single buffer with no
> copying, and libva encodes from that single buffer (but again needs
> explicit fences or it all comes crashing down)
>
> There's really no option C where you somehow do multiple implicitly
> fenced things into a single buffer and expect it to work out in
> parallel.

You could also fall back to a dummy submission, e.g. compose the image 
with multiple engines in parallel and then make a single dummy 
submission to collect all the shared fences into the single exclusive fence.

But this needs an extra IOCTL and unfortunately the stack above also 
needs to know when to make that dummy submission.
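
For completeness, this is roughly how the export + import ioctls from
this series would spell that collection step from userspace, without a
dummy submission (struct and flag names are my assumption and may differ
in this exact revision):

  #include <unistd.h>
  #include <sys/ioctl.h>
  #include <linux/dma-buf.h>

  /* Round-trip a dma-buf's fences through a sync_file: export everything
   * a writer would have to wait on, then import the merged fence back as
   * the exclusive/write fence.  Names assumed from this series. */
  static int promote_to_exclusive(int dmabuf_fd)
  {
      struct dma_buf_export_sync_file exp = { .flags = DMA_BUF_SYNC_WRITE };
      struct dma_buf_import_sync_file imp = { .flags = DMA_BUF_SYNC_WRITE };
      int ret;

      ret = ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &exp);
      if (ret)
          return ret;

      imp.fd = exp.fd;
      ret = ioctl(dmabuf_fd, DMA_BUF_IOCTL_IMPORT_SYNC_FILE, &imp);
      close(exp.fd);
      return ret;
  }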

Christian.

> -Daniel



* Re: [PATCH 4/4] RFC: dma-buf: Add an API for importing sync files (v6)
  2021-05-27  4:32             ` Jason Ekstrand
@ 2021-05-27 10:19               ` Daniel Vetter
  2021-05-27 10:39                 ` Christian König
  0 siblings, 1 reply; 39+ messages in thread
From: Daniel Vetter @ 2021-05-27 10:19 UTC (permalink / raw)
  To: Jason Ekstrand; +Cc: Daniel Vetter, dri-devel, Christian König

On Wed, May 26, 2021 at 11:32:49PM -0500, Jason Ekstrand wrote:
> On May 26, 2021 13:15:08 Daniel Stone <daniel@fooishbar.org> wrote:
> 
> > Hey,
> > 
> > On Wed, 26 May 2021 at 16:24, Jason Ekstrand <jason@jlekstrand.net> wrote:
> > > On Wed, May 26, 2021 at 6:09 AM Daniel Stone <daniel@fooishbar.org> wrote:
> > > > Typing out the Wayland protocol isn't the hard bit. If we just need to
> > > > copy and sed syncobj to weirdsyncobj, no problem really, and it gives
> > > > us a six-month head start on painful compositor-internal surgery
> > > > whilst we work on common infrastructure to ship userspace fences
> > > > around (mappable dmabuf with the sync bracketing? FD where every
> > > > read() gives you the current value? memfd? other?).
> > > 
> > > I feel like I should elaborate more about timelines.  In my earlier
> > > reply, my commentary about timeline syncobj was mostly focused around
> > > helping people avoid typing.  That's not really the full story,
> > > though, and I hope more context will help.
> > > 
> > > First, let me say that timeline syncobj was designed as a mechanism to
> > > implement VK_KHR_timeline_semaphore without inserting future fences
> > > into the kernel.  It's entirely designed around the needs of Vulkan
> > > drivers, not really as a window-system primitive.  The semantics are
> > > designed around one driver communicating to another that new fences
> > > have been added and it's safe to kick off more rendering.  I'm not
> > > convinced that it's the right object for window-systems and I'm also
> > > not convinced that it's a good idea to try and make a version of it
> > > that's a wrapper around a userspace memory fence.  (I'm going to start
> > > typing UMF for userspace memory fence because it's long to type out.)
> > > 
> > > Why?  Well, the fundamental problem with timelines in general is
> > > trying to figure out when it's about to be done.  But timeline syncobj
> > > solves this for us!  It gives us this fancy super-useful ioctl!
> > > Right?  Uh.... not as well as I'd like.  Let's say we make a timeline
> > > syncobj that's a wrapper around a userspace memory fence.  What do we
> > > do with that ioctl?  As I mentioned above, the kernel doesn't have any
> > > clue when it will be triggered so that ioctl turns into an actual
> > > wait.  That's no good because it creates unnecessary stalls.
> > 
> > Yeah, I'm assuming that UMF will be a separate primitive. No problem.
> > I also think that your submitted/completed thing is a non-problem: at
> > this stage we're just throwing up our hands and admitting that we're
> > letting userspace tie itself in knots, and giving it the tools to tie
> > a sufficiently un-streetwise compositor in knots too. We're already
> > crossing that Rubicon, so let's just embrace it and not try to design
> > it out. Us compositors can handle the scheduling, really.
> 
> Ok, good. I think we're on the same page.
> 
> > 
> > > There's another potential solution here:  Have each UMF be two
> > > timelines: submitted and completed.  At the start of every batch
> > > that's supposed to trigger a UMF, we set the "submitted" side and
> > > then, when it completes, we set the "completed" side.  Ok, great, now
> > > we can get at the "about to be done" with the submitted side,
> > > implement the ioctl, and we're all good, right?  Sadly, no.  There's
> > > no guarantee about how long a "batch" takes.  So there's no universal
> > > timeout the kernel can apply.  Also, if it does time out, the kernel
> > > doesn't know who to blame for the timeout and how to prevent itself
> > > from getting in trouble again.  The compositor does, so in theory,
> > > given the right ioctls, it could detect the -ETIME and kill that
> > > client.  Not a great solution.
> > > 
> > > The best option I've been able to come up with for this is some sort
> > > of client-provided signal.  Something where it says, as part of submit
> > > or somewhere else, "I promise I'll be done soon" where that promise
> > > comes with dire consequences if it's not.  At that point, we can turn
> > > the UMF and a particular wait value into a one-shot fence like a
> > > dma_fence or sync_file, or signal a syncobj on it.  If it ever times
> > > out, we kick their context.  In Vulkan terminology, they get
> > > VK_ERROR_DEVICE_LOST.  There are two important bits here:  First, is
> > > that it's based on a client-provided thing.  With a fully timeline
> > > model and wait-before-signal, we can't infer when something is about
> > > to be done.  Only the client knows when it submitted its last node in
> > > the dependency graph and the whole mess is unblocked.  Second, is that
> > > the dma_fence is created within the client's driver context.  If it's
> > > created compositor-side, the kernel doesn't know who to blame if
> > > things go badly.  If we create it in the client, it's pretty easy to
> > > make context death on -ETIME part of the contract.
> > > 
> > > (Before danvet jumps in here and rants about how UMF -> dma_fence
> > > isn't possible, I haven't forgotten.  I'm pretending, for now, that
> > > we've solved some of those problems.)
> > 
> > Funny how we've come full circle to the original proposal here ...
> > 
> > If we really want a kernel primitive for this - and I think it's a
> > good idea, since it can help surface 'badness' in a way which is
> > observable by e.g. session managers in a way analogous to cgroup stats
> > and controls - how about this for a counter-proposal? Client exports a
> > FD for its context/queue and sends it to winsys as part of setup,
> > compositor can ioctl() on that to kill it, which lands in the same
> > zap/zap/zap/zap/ban codepath as GPU hangs do today. It's a bigger
> > hammer than per-sync-point primitives, but you as a client have to
> > accept the social contract that if you want to participate in a
> > session, your context has to be making forward progress and you aren't
> > writing cheques the compositor can't cash.
> 
> The compositor already has that. It can kick the client's Wayland protocol
> connection. Banning the context from the kernel might be nice too but
> kicking it is probably sufficient.
> 
> Side-note to danvet: Do we need a plan for UMF with persistent contexts? My
> gut says that's a very bad idea but this made me think I should at least
> pose the question.

Yeah UMF ctx mode needs to require non-persistent ctx mode too or it just
goes horribly wrong everywhere.

> > I'm also going to pre-emptively agree with other-Dan; I'm extremely
> > wary of anything which tries to make UMF look even a little bit like
> > sync_file. The requirements to support them are so wildly different
> > that I'd almost rather a totally orthogonal interface so that there's
> > no danger of confusing the two. Us sophisticates on this thread can
> > eat the mild annoyance of typing out separate codepaths, but it's much
> > worse for anyone else who may look at a UMF wolf in dma_fence sheep's
> > clothing then only later be substantially more annoyed when they
> > realise that it's not anything like they thought it was.
> > 
> > So let's keep sync_file for what it is, and for UMF since the usage is
> > so radically different, build out whatever we do around making the
> > uAPI as useful as possible for what we want to do with it. The real
> > complexity in handling the difference between UMF and 'real' fences is
> > in how they behave, not in how they look.
> 
> Sounds good.
> 
> > 
> > > Another option is to just stall on the UMF until it's done.  Yeah,
> > > kind-of terrible and high-latency, but it always works and doesn't
> > > involve any complex logic to kill clients.  If a client never gets
> > > around to signaling a fence, it just never repaints.  The compositor
> > > keeps going like nothing's wrong.  Maybe, if the client submits lots
> > > of frames without ever triggering, it'll hit some max queue depth
> > > somewhere and kill it but that's it.  More likely, the client's
> > > vkAcquireNextImage will start timing out and it'll crash.
> > > 
> > > I suspect where we might actually land is some combination of the two
> > > depending on client choice.  If the client wants to be dumb, it gets
> > > the high-latency always-works path.  If the client really wants
> > > lowest-latency VRR, it has to take the smarter path and risk
> > > VK_ERROR_DEVICE_LOST if it misses too far.
> > 
> > We already have to handle unresponsive clients. If your browser
> > livelocks today (say because it's Chrome and you hotunplug your
> > monitor at the wrong time with active media playback in an inactive
> > tab in an inactive window ... hypothetically),
> 
> That's an oddly specific hypothetical...
> 
> > your Wayland server
> > notices that it isn't responding to pings, throws up the 'do you want
> > to force-quit?' dialog and kills the client; it's actually really
> > simple logic. So we just hook unsignaled fences up to the same. (And,
> > if we have the context-kill primitive, trigger that on our way out.)
> > 
> > So yeah, we already have all the complexity points to put particular
> > surface trees in limbo (thanks to subsurface sync mode), we already
> > have all the complexity points to separate realised surface trees from
> > pixels on screen, and we already have the complexity points for
> > different parts of the surface trees being rendered at different
> > times. Checking on fence progression is just a little more typing
> > around those interface points which already exist, and zapping clients
> > is utterly trivial.
> 
> 👍
> 
> > 
> > > But the point of all of this is, neither of the above two paths have
> > > anything to do with the compositor calling a "wait for submit" ioctl.
> > > Building a design around that and baking it into protocol is, IMO, a
> > > mistake.  I don't see any valid way to handle this mess without "wait
> > > for submit" either not existing or existing only client-side for the
> > > purposes of WSI.
> > 
> > I'm still on the fence (sorry) about a wait-before-submit ioctl. For
> > the sync_file-based timeline syncobjs that we have today, yes it is
> > helpful, and we do already have it, it's just the wrong shape in being
> > sleep rather than epoll.
> 
> I still don't see why we're still talking about timeline syncobj...
> 
> > For UMF, taking it as a given that the kernel really has no visibility
> > at all into syncpoint progress, then the kernel is conceptually a
> > worse place to spin-sleep than userspace is, because why account the
> > CPU burn to a kthread rather than a real PID, and lose
> > latency/efficiency on context switches when you do wake up?
> > 
> > But also, the kernel is conceptually the best place to spin-sleep,
> > because it can fuse waits and do better wakeup quantisation than
> > userspace can. And I'm still hopeful that the IHVs and Windows can
> > both step away from the postmodern 'synchronisation doesn't mean
> > anything anymore, just poll in a lap-burning loop' approach that we've
> > been presented (sorry) with, where we at least get doorbells which
> > allow the kernel to do polling much smarter than quantising timers
> > ('this fence might not have triggered yet, but _something_ happened
> > which might have triggered it so why not check?').
> 
> I think we can and do do something better than just poll on the memory. I'm
> not sure on the details but I've been told that we can set some sort of
> interrupt-like thing on the address so it's not actually a spin. Even
> without that, some hardware has some way that a command buffer can trigger
> an interrupt. If the protocol is to write memory and then trigger an
> interrupt rather than just write memory, that gives us something of a
> doorbell. Not as convenient, maybe, but it'd help with power consumption,
> etc.

Spoiler but the i915 plan is to indeed suck less than cpu spinning on
memory. The icky part is converting this into a protocol that can be used
across processes and stuff like that.
-Daniel

> 
> --Jason
> 
> > On the other other hand, the only winsys case for burning poll in a
> > tight loop is flipping as quickly as possible straight to a VRR
> > display. In that case, you're definitely running on mains power so
> > you're just melting the polar ice caps rather than your lap, and
> > you've got everything fully lit up anyway so the power cost of polling
> > is immaterial. For FRR, the compositor already has a fixed deadline at
> > which it will wake up and make a hard binding decision about which
> > image to present - this includes XR as well. So we don't have to worry
> > about optimising a polling loop, because there isn't one: we wake up
> > once, we check once, and if the client's missed then too bad, try
> > again next frame.
> > 
> > As you can see, much like userspace memory fences, my position on
> > which way to go here is not knowable upfront, and depends on when
> > exactly you observe it. Hopefully someone can come back with an
> > argument compelling enough either way that I have something better to
> > do than to try to pun my way out of having more hands than Ganesh. I
> > don't think it's material to the design or implementation of winsys
> > support though.
> > 
> > Cheers,
> > Daniel
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


* Re: [PATCH 4/4] RFC: dma-buf: Add an API for importing sync files (v6)
  2021-05-27 10:19               ` Daniel Vetter
@ 2021-05-27 10:39                 ` Christian König
  0 siblings, 0 replies; 39+ messages in thread
From: Christian König @ 2021-05-27 10:39 UTC (permalink / raw)
  To: Daniel Vetter, Jason Ekstrand; +Cc: Daniel Vetter, dri-devel

On 27.05.21 at 12:19, Daniel Vetter wrote:
> On Wed, May 26, 2021 at 11:32:49PM -0500, Jason Ekstrand wrote:
>> On May 26, 2021 13:15:08 Daniel Stone <daniel@fooishbar.org> wrote:
>>
>>> Hey,
>>>
>>> On Wed, 26 May 2021 at 16:24, Jason Ekstrand <jason@jlekstrand.net> wrote:
>>>> On Wed, May 26, 2021 at 6:09 AM Daniel Stone <daniel@fooishbar.org> wrote:
>>>>> Typing out the Wayland protocol isn't the hard bit. If we just need to
>>>>> copy and sed syncobj to weirdsyncobj, no problem really, and it gives
>>>>> us a six-month head start on painful compositor-internal surgery
>>>>> whilst we work on common infrastructure to ship userspace fences
>>>>> around (mappable dmabuf with the sync bracketing? FD where every
>>>>> read() gives you the current value? memfd? other?).
>>>> I feel like I should elaborate more about timelines.  In my earlier
>>>> reply, my commentary about timeline syncobj was mostly focused around
>>>> helping people avoid typing.  That's not really the full story,
>>>> though, and I hope more context will help.
>>>>
>>>> First, let me say that timeline syncobj was designed as a mechanism to
>>>> implement VK_KHR_timeline_semaphore without inserting future fences
>>>> into the kernel.  It's entirely designed around the needs of Vulkan
>>>> drivers, not really as a window-system primitive.  The semantics are
>>>> designed around one driver communicating to another that new fences
>>>> have been added and it's safe to kick off more rendering.  I'm not
>>>> convinced that it's the right object for window-systems and I'm also
>>>> not convinced that it's a good idea to try and make a version of it
>>>> that's a wrapper around a userspace memory fence.  (I'm going to start
>>>> typing UMF for userspace memory fence because it's long to type out.)
>>>>
>>>> Why?  Well, the fundamental problem with timelines in general is
>>>> trying to figure out when it's about to be done.  But timeline syncobj
>>>> solves this for us!  It gives us this fancy super-useful ioctl!
>>>> Right?  Uh.... not as well as I'd like.  Let's say we make a timeline
>>>> syncobj that's a wrapper around a userspace memory fence.  What do we
>>>> do with that ioctl?  As I mentioned above, the kernel doesn't have any
>>>> clue when it will be triggered so that ioctl turns into an actual
>>>> wait.  That's no good because it creates unnecessary stalls.
>>> Yeah, I'm assuming that UMF will be a separate primitive. No problem.
>>> I also think that your submitted/completed thing is a non-problem: at
>>> this stage we're just throwing up our hands and admitting that we're
>>> letting userspace tie itself in knots, and giving it the tools to tie
>>> a sufficiently un-streetwise compositor in knots too. We're already
>>> crossing that Rubicon, so let's just embrace it and not try to design
>>> it out. Us compositors can handle the scheduling, really.
>> Ok, good. I think we're on the same page.
>>
>>>> There's another potential solution here:  Have each UMF be two
>>>> timelines: submitted and completed.  At the start of every batch
>>>> that's supposed to trigger a UMF, we set the "submitted" side and
>>>> then, when it completes, we set the "completed" side.  Ok, great, now
>>>> we can get at the "about to be done" with the submitted side,
>>>> implement the ioctl, and we're all good, right?  Sadly, no.  There's
>>>> no guarantee about how long a "batch" takes.  So there's no universal
>>>> timeout the kernel can apply.  Also, if it does time out, the kernel
>>>> doesn't know who to blame for the timeout and how to prevent itself
>>>> from getting in trouble again.  The compositor does, so in theory,
>>>> given the right ioctls, it could detect the -ETIME and kill that
>>>> client.  Not a great solution.
>>>>
>>>> The best option I've been able to come up with for this is some sort
>>>> of client-provided signal.  Something where it says, as part of submit
>>>> or somewhere else, "I promise I'll be done soon" where that promise
>>>> comes with dire consequences if it's not.  At that point, we can turn
>>>> the UMF and a particular wait value into a one-shot fence like a
>>>> dma_fence or sync_file, or signal a syncobj on it.  If it ever times
>>>> out, we kick their context.  In Vulkan terminology, they get
>>>> VK_ERROR_DEVICE_LOST.  There are two important bits here:  First, is
>>>> that it's based on a client-provided thing.  With a fully timeline
>>>> model and wait-before-signal, we can't infer when something is about
>>>> to be done.  Only the client knows when it submitted its last node in
>>>> the dependency graph and the whole mess is unblocked.  Second, is that
>>>> the dma_fence is created within the client's driver context.  If it's
>>>> created compositor-side, the kernel doesn't know who to blame if
>>>> things go badly.  If we create it in the client, it's pretty easy to
>>>> make context death on -ETIME part of the contract.
>>>>
>>>> (Before danvet jumps in here and rants about how UMF -> dma_fence
>>>> isn't possible, I haven't forgotten.  I'm pretending, for now, that
>>>> we've solved some of those problems.)
>>> Funny how we've come full circle to the original proposal here ...
>>>
>>> If we really want a kernel primitive for this - and I think it's a
>>> good idea, since it can help surface 'badness' in a way which is
>>> observable by e.g. session managers in a way analogous to cgroup stats
>>> and controls - how about this for a counter-proposal? Client exports a
>>> FD for its context/queue and sends it to winsys as part of setup,
>>> compositor can ioctl() on that to kill it, which lands in the same
>>> zap/zap/zap/zap/ban codepath as GPU hangs do today. It's a bigger
>>> hammer than per-sync-point primitives, but you as a client have to
>>> accept the social contract that if you want to participate in a
>>> session, your context has to be making forward progress and you aren't
>>> writing cheques the compositor can't cash.
>> The compositor already has that. It can kick the client's Wayland protocol
>> connection. Banning the context from the kernel might be nice too but
>> kicking it is probably sufficient.
>>
>> Side-note to danvet: Do we need a plan for UMF with persistent contexts? My
>> gut says that's a very bad idea but this made me think I should at least
>> pose the question.
> Yeah UMF ctx mode needs to require non-persistent ctx mode too or it just
> goes horribly wrong everywhere.
>
>>> I'm also going to pre-emptively agree with other-Dan; I'm extremely
>>> wary of anything which tries to make UMF look even a little bit like
>>> sync_file. The requirements to support them are so wildly different
>>> that I'd almost rather a totally orthogonal interface so that there's
>>> no danger of confusing the two. Us sophisticates on this thread can
>>> eat the mild annoyance of typing out separate codepaths, but it's much
>>> worse for anyone else who may look at a UMF wolf in dma_fence sheep's
>>> clothing then only later be substantially more annoyed when they
>>> realise that it's not anything like they thought it was.
>>>
>>> So let's keep sync_file for what it is, and for UMF since the usage is
>>> so radically different, build out whatever we do around making the
>>> uAPI as useful as possible for what we want to do with it. The real
>>> complexity in handling the difference between UMF and 'real' fences is
>>> in how they behave, not in how they look.
>> Sounds good.
>>
>>>> Another option is to just stall on the UMF until it's done.  Yeah,
>>>> kind-of terrible and high-latency, but it always works and doesn't
>>>> involve any complex logic to kill clients.  If a client never gets
>>>> around to signaling a fence, it just never repaints.  The compositor
>>>> keeps going like nothing's wrong.  Maybe, if the client submits lots
>>>> of frames without ever triggering, it'll hit some max queue depth
>>>> somewhere and kill it but that's it.  More likely, the client's
>>>> vkAcquireNextImage will start timing out and it'll crash.
>>>>
>>>> I suspect where we might actually land is some combination of the two
>>>> depending on client choice.  If the client wants to be dumb, it gets
>>>> the high-latency always-works path.  If the client really wants
>>>> lowest-latency VRR, it has to take the smarter path and risk
>>>> VK_ERROR_DEVICE_LOST if it misses too far.
>>> We already have to handle unresponsive clients. If your browser
>>> livelocks today (say because it's Chrome and you hotunplug your
>>> monitor at the wrong time with active media playback in an inactive
>>> tab in an inactive window ... hypothetically),
>> That's an oddly specific hypothetical...
>>
>>> your Wayland server
>>> notices that it isn't responding to pings, throws up the 'do you want
>>> to force-quit?' dialog and kills the client; it's actually really
>>> simple logic. So we just hook unsignaled fences up to the same. (And,
>>> if we have the context-kill primitive, trigger that on our way out.)
>>>
>>> So yeah, we already have all the complexity points to put particular
>>> surface trees in limbo (thanks to subsurface sync mode), we already
>>> have all the complexity points to separate realised surface trees from
>>> pixels on screen, and we already have the complexity points for
>>> different parts of the surface trees being rendered at different
>>> times. Checking on fence progression is just a little more typing
>>> around those interface points which already exist, and zapping clients
>>> is utterly trivial.
>> 👍
>>
>>>> But the point of all of this is, neither of the above two paths have
>>>> anything to do with the compositor calling a "wait for submit" ioctl.
>>>> Building a design around that and baking it into protocol is, IMO, a
>>>> mistake.  I don't see any valid way to handle this mess without "wait
>>>> for submit" either not existing or existing only client-side for the
>>>> purposes of WSI.
>>> I'm still on the fence (sorry) about a wait-before-submit ioctl. For
>>> the sync_file-based timeline syncobjs that we have today, yes it is
>>> helpful, and we do already have it, it's just the wrong shape in being
>>> sleep rather than epoll.
>> I still don't see why we're still talking about timeline syncobj...
>>
>>> For UMF, taking it as a given that the kernel really has no visibility
>>> at all into syncpoint progress, then the kernel is conceptually a
>>> worse place to spin-sleep than userspace is, because why account the
>>> CPU burn to a kthread rather than a real PID, and lose
>>> latency/efficiency on context switches when you do wake up?
>>>
>>> But also, the kernel is conceptually the best place to spin-sleep,
>>> because it can fuse waits and do better wakeup quantisation than
>>> userspace can. And I'm still hopeful that the IHVs and Windows can
>>> both step away from the postmodern 'synchronisation doesn't mean
>>> anything anymore, just poll in a lap-burning loop' approach that we've
>>> been presented (sorry) with, where we at least get doorbells which
>>> allow the kernel to do polling much smarter than quantising timers
>>> ('this fence might not have triggered yet, but _something_ happened
>>> which might have triggered it so why not check?').
>> I think we can and do do something better than just poll on the memory. I'm
>> not sure on the details but I've been told that we can set some sort of
>> interrupt-like thing on the address so it's not actually a spin. Even
>> without that, some hardware has some way that a command buffer can trigger
>> an interrupt. If the protocol is to write memory and then trigger an
>> interrupt rather than just write memory, that gives us something of a
>> doorbell. Not as convenient, maybe, but it'd help with power consumption,
>> etc.
> Spoiler but the i915 plan is to indeed suck less than cpu spinning on
> memory. The icky part is converting this into a protocol that can be used
> across processes and stuff like that.

Yeah, same here. Basically the hardware now gives you the information 
where something was written in the fence interrupt vector.

With that information you can then figure out in the driver if anybody 
is waiting for that memory location and stop the sleeping etc...

Christian.

> -Daniel
>
>> --Jason
>>
>>> On the other other hand, the only winsys case for burning poll in a
>>> tight loop is flipping as quickly as possible straight to a VRR
>>> display. In that case, you're definitely running on mains power so
>>> you're just melting the polar ice caps rather than your lap, and
>>> you've got everything fully lit up anyway so the power cost of polling
>>> is immaterial. For FRR, the compositor already has a fixed deadline at
>>> which it will wake up and make a hard binding decision about which
>>> image to present - this includes XR as well. So we don't have to worry
>>> about optimising a polling loop, because there isn't one: we wake up
>>> once, we check once, and if the client's missed then too bad, try
>>> again next frame.
>>>
>>> As you can see, much like userspace memory fences, my position on
>>> which way to go here is not knowable upfront, and depends on when
>>> exactly you observe it. Hopefully someone can come back with an
>>> argument compelling enough either way that I have something better to
>>> do than to try to pun my way out of having more hands than Ganesh. I
>>> don't think it's material to the design or implementation of winsys
>>> support though.
>>>
>>> Cheers,
>>> Daniel



end of thread, other threads:[~2021-05-27 10:39 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-20 19:00 [PATCH 0/4] dma-buf: Add an API for exporting sync files (v8) Jason Ekstrand
2021-05-20 19:00 ` [PATCH 1/4] dma-buf: add dma_fence_array_for_each (v2) Jason Ekstrand
2021-05-21  7:51   ` Christian König
2021-05-21 16:27     ` Jason Ekstrand
2021-05-21 18:11       ` Christian König
2021-05-20 19:00 ` [PATCH 2/4] dma-buf: add dma_resv_get_singleton_rcu (v4) Jason Ekstrand
2021-05-21 17:48   ` Daniel Vetter
2021-05-24 20:04     ` Jason Ekstrand
2021-05-25 12:38       ` Daniel Vetter
2021-05-25 15:50         ` Christian König
2021-05-22 14:38   ` kernel test robot
2021-05-22 14:38     ` kernel test robot
2021-05-22 18:59   ` kernel test robot
2021-05-22 18:59     ` kernel test robot
2021-05-20 19:00 ` [PATCH 3/4] dma-buf: Add an API for exporting sync files (v9) Jason Ekstrand
2021-05-20 19:00 ` [PATCH 4/4] RFC: dma-buf: Add an API for importing sync files (v6) Jason Ekstrand
2021-05-22 20:05   ` Daniel Stone
2021-05-23 17:15     ` EPOLL for drm_syncfile (was Re: [PATCH 4/4] RFC: dma-buf: Add an API for importing sync files (v6)) Christian König
2021-05-23 21:34       ` Daniel Stone
2021-05-25  5:55         ` Daniel Vetter
2021-05-26  9:45         ` Simon Ser
2021-05-24 17:11     ` [PATCH 4/4] RFC: dma-buf: Add an API for importing sync files (v6) Jason Ekstrand
2021-05-25  6:27       ` Daniel Vetter
2021-05-26 11:08       ` Daniel Stone
2021-05-26 12:35         ` Daniel Vetter
2021-05-26 13:08           ` Daniel Stone
2021-05-26 13:44             ` Daniel Vetter
2021-05-26 15:13               ` Daniel Stone
2021-05-26 16:52                 ` Daniel Vetter
2021-05-26 18:01                   ` Daniel Stone
2021-05-27  7:21                   ` Christian König
2021-05-26 13:52         ` Daniel Stone
2021-05-26 15:24         ` Jason Ekstrand
2021-05-26 18:14           ` Daniel Stone
2021-05-27  4:32             ` Jason Ekstrand
2021-05-27 10:19               ` Daniel Vetter
2021-05-27 10:39                 ` Christian König
2021-05-26 12:21       ` Vlad Zahorodnii
2021-05-21 21:32 ` [PATCH 0/4] dma-buf: Add an API for exporting sync files (v8) Jason Ekstrand
