* [RFC PATCH 0/7] drm/panfrost: Add a new submit ioctl
@ 2021-03-11  9:25 Boris Brezillon
  2021-03-11  9:25 ` [RFC PATCH 1/7] drm/panfrost: Pass a job to panfrost_{acquire, attach_object_fences}() Boris Brezillon
                   ` (7 more replies)
  0 siblings, 8 replies; 20+ messages in thread
From: Boris Brezillon @ 2021-03-11  9:25 UTC (permalink / raw)
  To: Rob Herring, Tomeu Vizoso, Alyssa Rosenzweig, Steven Price, Robin Murphy
  Cc: Boris Brezillon, dri-devel

Hello,

I've been playing with Vulkan lately and struggled quite a bit to
implement VkQueueSubmit with the submit ioctl we have. There are
several limiting factors that can be worked around if we really have to,
but I think it'd be much easier and future-proof if we introduce a new
ioctl that addresses the current limitations:

1/ There can only be one out_sync, but Vulkan might ask us to signal
   several VkSemaphores and possibly one VkFence too, both of those
   being based on sync objects in my PoC. Making out_sync an array of
   syncobjs to attach the render_done fence to would make that possible.
   The other option would be to collect syncobj updates in userspace
   in a separate thread and propagate those updates to all
   semaphores+fences waiting on those events (I think the v3dv driver
   does something like that, but I didn't spend enough time studying
   the code to be sure, so I might be wrong).

2/ Queued jobs might be executed out-of-order (unless they have
   explicit/implicit deps between them), and Vulkan asks that the out
   fence be signaled when all jobs are done. Timeline syncobjs are a
   good match for that use case. All we need to do is pass the same
   fence syncobj to all jobs being attached to a single QueueSubmit
   request, but a different point on the timeline. The syncobj
   timeline wait does the rest and guarantees that we've reached a
   given timeline point (IOW, all jobs before that point are done)
   before declaring the fence as signaled (see the userspace sketch
   below).
   One alternative would be to have dummy 'synchronization' jobs that
   don't actually execute anything on the GPU but declare a dependency
   on all other jobs that are part of the QueueSubmit request, and
   signal the out fence (the scheduler would do most of the work for
   us, all we have to do is support NULL job heads and signal the
   fence directly when that happens instead of queueing the job).

3/ The current implementation lacks information about BO access,
   so we serialize all jobs accessing the same set of BOs, even
   if those jobs might just be reading from them (which can
   happen concurrently). Other drivers pass an access type to the
   list of referenced BOs to address that. Another option would be
   to disable implicit deps (deps based on BOs) and force the driver
   to pass all deps explicitly (interestingly, some drivers have
   both the no-implicit-dep and r/w flags, probably to support
   sub-resource access, so we might want to add that one too).
   I don't see any userspace workaround to that problem, so that one
   alone would justify extending the existing ioctl or adding a new
   one.

4/ There's also the fact that submitting one job at a time adds an
   overhead when QueueSubmit is being passed more than one
   CommandBuffer. That one is less problematic, but if we're adding
   a new ioctl we'd better design it to limit the userspace -> kernel
   transition overhead.
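
To make this a bit more concrete, here is a rough sketch of how I
picture VkQueueSubmit mapping onto the new BATCH_SUBMIT ioctl from
userspace (it illustrates points 1, 2 and 4, and uses the uAPI added
in patch 5). Error handling is omitted, the helper and its arguments
are placeholders, and bo_refs[] is assumed to be filled by the caller
with the PANFROST_BO_REF_* flags discussed in point 3:

#include <stdint.h>
#include <xf86drm.h>
#include "panfrost_drm.h" /* the uAPI header as extended by this series */

static int queue_submit(int fd, uint32_t timeline_syncobj,
			const uint64_t *jc_heads, uint32_t job_count,
			const struct drm_panfrost_bo_ref *bo_refs,
			uint32_t bo_count)
{
	struct drm_panfrost_job jobs[job_count];
	struct drm_panfrost_syncobj_ref out_refs[job_count];
	struct drm_panfrost_batch_submit batch = {
		.job_stride = sizeof(jobs[0]),
		.job_count = job_count,
		.jobs = (uintptr_t)jobs,
		.bo_ref_stride = sizeof(struct drm_panfrost_bo_ref),
		.syncobj_ref_stride = sizeof(struct drm_panfrost_syncobj_ref),
	};
	uint32_t i;

	for (i = 0; i < job_count; i++) {
		/* All jobs signal the same timeline syncobj, each at its
		 * own point, so a timeline wait on point 'job_count'
		 * guarantees that all jobs of this QueueSubmit are done.
		 */
		out_refs[i] = (struct drm_panfrost_syncobj_ref){
			.handle = timeline_syncobj,
			.point = i + 1,
		};

		jobs[i] = (struct drm_panfrost_job){
			.head = jc_heads[i],
			.bos = (uintptr_t)bo_refs,
			.bo_count = bo_count,
			.out_syncs = (uintptr_t)&out_refs[i],
			.out_sync_count = 1,
		};
	}

	return drmIoctl(fd, DRM_IOCTL_PANFROST_BATCH_SUBMIT, &batch);
}

The VkSemaphores and the VkFence can then all be backed by
(timeline_syncobj, job_count), or have that point's fence propagated
to them. In a real driver the points would of course come from a
monotonically increasing per-queue counter instead of restarting at 1.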

Right now I'm just trying to collect feedback. I don't intend to get
those patches merged until we have a userspace user, but I thought
starting the discussion early would be a good thing.

Feel free to suggest other approaches.

Regards,

Boris

Boris Brezillon (7):
  drm/panfrost: Pass a job to panfrost_{acquire,attach_object_fences}()
  drm/panfrost: Collect implicit and explicit deps in an XArray
  drm/panfrost: Move the mappings collection out of
    panfrost_lookup_bos()
  drm/panfrost: Add BO access flags to relax dependencies between jobs
  drm/panfrost: Add a new ioctl to submit batches
  drm/panfrost: Advertise the SYNCOBJ_TIMELINE feature
  drm/panfrost: Bump minor version to reflect the feature additions

 drivers/gpu/drm/panfrost/panfrost_drv.c | 408 +++++++++++++++++++++---
 drivers/gpu/drm/panfrost/panfrost_job.c |  80 +++--
 drivers/gpu/drm/panfrost/panfrost_job.h |   8 +-
 include/uapi/drm/panfrost_drm.h         |  83 +++++
 4 files changed, 483 insertions(+), 96 deletions(-)

-- 
2.26.2

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [RFC PATCH 1/7] drm/panfrost: Pass a job to panfrost_{acquire, attach_object_fences}()
  2021-03-11  9:25 [RFC PATCH 0/7] drm/panfrost: Add a new submit ioctl Boris Brezillon
@ 2021-03-11  9:25 ` Boris Brezillon
  2021-03-11  9:25 ` [RFC PATCH 2/7] drm/panfrost: Collect implicit and explicit deps in an XArray Boris Brezillon
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 20+ messages in thread
From: Boris Brezillon @ 2021-03-11  9:25 UTC (permalink / raw)
  To: Rob Herring, Tomeu Vizoso, Alyssa Rosenzweig, Steven Price, Robin Murphy
  Cc: Boris Brezillon, dri-devel

So we don't have to change the prototype if we extend the function.

Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
---
 drivers/gpu/drm/panfrost/panfrost_job.c | 22 ++++++++--------------
 1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
index 6003cfeb1322..564054a96ca9 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -196,24 +196,20 @@ static void panfrost_job_hw_submit(struct panfrost_job *job, int js)
 	job_write(pfdev, JS_COMMAND_NEXT(js), JS_COMMAND_START);
 }
 
-static void panfrost_acquire_object_fences(struct drm_gem_object **bos,
-					   int bo_count,
-					   struct dma_fence **implicit_fences)
+static void panfrost_acquire_object_fences(struct panfrost_job *job)
 {
 	int i;
 
-	for (i = 0; i < bo_count; i++)
-		implicit_fences[i] = dma_resv_get_excl_rcu(bos[i]->resv);
+	for (i = 0; i < job->bo_count; i++)
+		job->implicit_fences[i] = dma_resv_get_excl_rcu(job->bos[i]->resv);
 }
 
-static void panfrost_attach_object_fences(struct drm_gem_object **bos,
-					  int bo_count,
-					  struct dma_fence *fence)
+static void panfrost_attach_object_fences(struct panfrost_job *job)
 {
 	int i;
 
-	for (i = 0; i < bo_count; i++)
-		dma_resv_add_excl_fence(bos[i]->resv, fence);
+	for (i = 0; i < job->bo_count; i++)
+		dma_resv_add_excl_fence(job->bos[i]->resv, job->render_done_fence);
 }
 
 int panfrost_job_push(struct panfrost_job *job)
@@ -243,15 +239,13 @@ int panfrost_job_push(struct panfrost_job *job)
 
 	kref_get(&job->refcount); /* put by scheduler job completion */
 
-	panfrost_acquire_object_fences(job->bos, job->bo_count,
-				       job->implicit_fences);
+	panfrost_acquire_object_fences(job);
 
 	drm_sched_entity_push_job(&job->base, entity);
 
 	mutex_unlock(&pfdev->sched_lock);
 
-	panfrost_attach_object_fences(job->bos, job->bo_count,
-				      job->render_done_fence);
+	panfrost_attach_object_fences(job);
 
 unlock:
 	drm_gem_unlock_reservations(job->bos, job->bo_count, &acquire_ctx);
-- 
2.26.2

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [RFC PATCH 2/7] drm/panfrost: Collect implicit and explicit deps in an XArray
  2021-03-11  9:25 [RFC PATCH 0/7] drm/panfrost: Add a new submit ioctl Boris Brezillon
  2021-03-11  9:25 ` [RFC PATCH 1/7] drm/panfrost: Pass a job to panfrost_{acquire, attach_object_fences}() Boris Brezillon
@ 2021-03-11  9:25 ` Boris Brezillon
  2021-03-11  9:25 ` [RFC PATCH 3/7] drm/panfrost: Move the mappings collection out of panfrost_lookup_bos() Boris Brezillon
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 20+ messages in thread
From: Boris Brezillon @ 2021-03-11  9:25 UTC (permalink / raw)
  To: Rob Herring, Tomeu Vizoso, Alyssa Rosenzweig, Steven Price, Robin Murphy
  Cc: Boris Brezillon, dri-devel

This way we can re-use the standard drm_gem_fence_array_add_implicit()
helper and simplify the panfrost_job_dependency() logic.

Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
---
 drivers/gpu/drm/panfrost/panfrost_drv.c | 42 +++++++-----------
 drivers/gpu/drm/panfrost/panfrost_job.c | 57 +++++++++++--------------
 drivers/gpu/drm/panfrost/panfrost_job.h |  7 +--
 3 files changed, 42 insertions(+), 64 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c b/drivers/gpu/drm/panfrost/panfrost_drv.c
index 83a461bdeea8..86c1347c6f0e 100644
--- a/drivers/gpu/drm/panfrost/panfrost_drv.c
+++ b/drivers/gpu/drm/panfrost/panfrost_drv.c
@@ -137,12 +137,6 @@ panfrost_lookup_bos(struct drm_device *dev,
 	if (!job->bo_count)
 		return 0;
 
-	job->implicit_fences = kvmalloc_array(job->bo_count,
-				  sizeof(struct dma_fence *),
-				  GFP_KERNEL | __GFP_ZERO);
-	if (!job->implicit_fences)
-		return -ENOMEM;
-
 	ret = drm_gem_objects_lookup(file_priv,
 				     (void __user *)(uintptr_t)args->bo_handles,
 				     job->bo_count, &job->bos);
@@ -173,8 +167,7 @@ panfrost_lookup_bos(struct drm_device *dev,
 }
 
 /**
- * panfrost_copy_in_sync() - Sets up job->in_fences[] with the sync objects
- * referenced by the job.
+ * panfrost_copy_in_sync() - Add in_syncs to the job->deps array.
  * @dev: DRM device
  * @file_priv: DRM file for this fd
  * @args: IOCTL args
@@ -187,28 +180,18 @@ panfrost_lookup_bos(struct drm_device *dev,
  */
 static int
 panfrost_copy_in_sync(struct drm_device *dev,
-		  struct drm_file *file_priv,
-		  struct drm_panfrost_submit *args,
-		  struct panfrost_job *job)
+		      struct drm_file *file_priv,
+		      struct drm_panfrost_submit *args,
+		      struct panfrost_job *job)
 {
 	u32 *handles;
 	int ret = 0;
 	int i;
 
-	job->in_fence_count = args->in_sync_count;
-
-	if (!job->in_fence_count)
+	if (!args->in_sync_count)
 		return 0;
 
-	job->in_fences = kvmalloc_array(job->in_fence_count,
-					sizeof(struct dma_fence *),
-					GFP_KERNEL | __GFP_ZERO);
-	if (!job->in_fences) {
-		DRM_DEBUG("Failed to allocate job in fences\n");
-		return -ENOMEM;
-	}
-
-	handles = kvmalloc_array(job->in_fence_count, sizeof(u32), GFP_KERNEL);
+	handles = kvmalloc_array(args->in_sync_count, sizeof(u32), GFP_KERNEL);
 	if (!handles) {
 		ret = -ENOMEM;
 		DRM_DEBUG("Failed to allocate incoming syncobj handles\n");
@@ -217,17 +200,23 @@ panfrost_copy_in_sync(struct drm_device *dev,
 
 	if (copy_from_user(handles,
 			   (void __user *)(uintptr_t)args->in_syncs,
-			   job->in_fence_count * sizeof(u32))) {
+			   args->in_sync_count * sizeof(u32))) {
 		ret = -EFAULT;
 		DRM_DEBUG("Failed to copy in syncobj handles\n");
 		goto fail;
 	}
 
-	for (i = 0; i < job->in_fence_count; i++) {
+	for (i = 0; i < args->in_sync_count; i++) {
+		struct dma_fence *fence;
+
 		ret = drm_syncobj_find_fence(file_priv, handles[i], 0, 0,
-					     &job->in_fences[i]);
+					     &fence);
 		if (ret == -EINVAL)
 			goto fail;
+
+		ret = drm_gem_fence_array_add(&job->deps, fence);
+		if (ret)
+			goto fail;
 	}
 
 fail:
@@ -269,6 +258,7 @@ static int panfrost_ioctl_submit(struct drm_device *dev, void *data,
 	job->requirements = args->requirements;
 	job->flush_id = panfrost_gpu_get_latest_flush_id(pfdev);
 	job->file_priv = file->driver_priv;
+	xa_init_flags(&job->deps, XA_FLAGS_ALLOC);
 
 	ret = panfrost_copy_in_sync(dev, file, args, job);
 	if (ret)
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
index 564054a96ca9..87bfbd77d753 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -196,20 +196,31 @@ static void panfrost_job_hw_submit(struct panfrost_job *job, int js)
 	job_write(pfdev, JS_COMMAND_NEXT(js), JS_COMMAND_START);
 }
 
-static void panfrost_acquire_object_fences(struct panfrost_job *job)
+static int panfrost_acquire_object_fences(struct panfrost_job *job)
 {
 	int i;
 
-	for (i = 0; i < job->bo_count; i++)
-		job->implicit_fences[i] = dma_resv_get_excl_rcu(job->bos[i]->resv);
+	for (i = 0; i < job->bo_count; i++) {
+		int ret;
+
+		ret = drm_gem_fence_array_add_implicit(&job->deps,
+						       job->bos[i], true);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
 }
 
 static void panfrost_attach_object_fences(struct panfrost_job *job)
 {
 	int i;
 
-	for (i = 0; i < job->bo_count; i++)
-		dma_resv_add_excl_fence(job->bos[i]->resv, job->render_done_fence);
+	for (i = 0; i < job->bo_count; i++) {
+		struct dma_resv *robj = job->bos[i]->resv;
+
+		dma_resv_add_excl_fence(robj, job->render_done_fence);
+	}
 }
 
 int panfrost_job_push(struct panfrost_job *job)
@@ -257,18 +268,15 @@ static void panfrost_job_cleanup(struct kref *ref)
 {
 	struct panfrost_job *job = container_of(ref, struct panfrost_job,
 						refcount);
+	struct dma_fence *fence;
+	unsigned long dep_idx;
 	unsigned int i;
 
-	if (job->in_fences) {
-		for (i = 0; i < job->in_fence_count; i++)
-			dma_fence_put(job->in_fences[i]);
-		kvfree(job->in_fences);
-	}
-	if (job->implicit_fences) {
-		for (i = 0; i < job->bo_count; i++)
-			dma_fence_put(job->implicit_fences[i]);
-		kvfree(job->implicit_fences);
+	xa_for_each(&job->deps, dep_idx, fence) {
+		dma_fence_put(fence);
 	}
+	xa_destroy(&job->deps);
+
 	dma_fence_put(job->done_fence);
 	dma_fence_put(job->render_done_fence);
 
@@ -311,26 +319,9 @@ static struct dma_fence *panfrost_job_dependency(struct drm_sched_job *sched_job
 						 struct drm_sched_entity *s_entity)
 {
 	struct panfrost_job *job = to_panfrost_job(sched_job);
-	struct dma_fence *fence;
-	unsigned int i;
 
-	/* Explicit fences */
-	for (i = 0; i < job->in_fence_count; i++) {
-		if (job->in_fences[i]) {
-			fence = job->in_fences[i];
-			job->in_fences[i] = NULL;
-			return fence;
-		}
-	}
-
-	/* Implicit fences, max. one per BO */
-	for (i = 0; i < job->bo_count; i++) {
-		if (job->implicit_fences[i]) {
-			fence = job->implicit_fences[i];
-			job->implicit_fences[i] = NULL;
-			return fence;
-		}
-	}
+	if (!xa_empty(&job->deps))
+		return xa_erase(&job->deps, job->last_dep++);
 
 	return NULL;
 }
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.h b/drivers/gpu/drm/panfrost/panfrost_job.h
index bbd3ba97ff67..b10b5be57dd2 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.h
+++ b/drivers/gpu/drm/panfrost/panfrost_job.h
@@ -19,9 +19,8 @@ struct panfrost_job {
 	struct panfrost_device *pfdev;
 	struct panfrost_file_priv *file_priv;
 
-	/* Optional fences userspace can pass in for the job to depend on. */
-	struct dma_fence **in_fences;
-	u32 in_fence_count;
+	struct xarray deps;
+	unsigned long last_dep;
 
 	/* Fence to be signaled by IRQ handler when the job is complete. */
 	struct dma_fence *done_fence;
@@ -30,8 +29,6 @@ struct panfrost_job {
 	__u32 requirements;
 	__u32 flush_id;
 
-	/* Exclusive fences we have taken from the BOs to wait for */
-	struct dma_fence **implicit_fences;
 	struct panfrost_gem_mapping **mappings;
 	struct drm_gem_object **bos;
 	u32 bo_count;
-- 
2.26.2

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [RFC PATCH 3/7] drm/panfrost: Move the mappings collection out of panfrost_lookup_bos()
  2021-03-11  9:25 [RFC PATCH 0/7] drm/panfrost: Add a new submit ioctl Boris Brezillon
  2021-03-11  9:25 ` [RFC PATCH 1/7] drm/panfrost: Pass a job to panfrost_{acquire, attach_object_fences}() Boris Brezillon
  2021-03-11  9:25 ` [RFC PATCH 2/7] drm/panfrost: Collect implicit and explicit deps in an XArray Boris Brezillon
@ 2021-03-11  9:25 ` Boris Brezillon
  2021-03-11  9:25 ` [RFC PATCH 4/7] drm/panfrost: Add BO access flags to relax dependencies between jobs Boris Brezillon
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 20+ messages in thread
From: Boris Brezillon @ 2021-03-11  9:25 UTC (permalink / raw)
  To: Rob Herring, Tomeu Vizoso, Alyssa Rosenzweig, Steven Price, Robin Murphy
  Cc: Boris Brezillon, dri-devel

So we can re-use it from elsewhere.

Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
---
 drivers/gpu/drm/panfrost/panfrost_drv.c | 52 ++++++++++++++-----------
 1 file changed, 29 insertions(+), 23 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c b/drivers/gpu/drm/panfrost/panfrost_drv.c
index 86c1347c6f0e..32e822e00a08 100644
--- a/drivers/gpu/drm/panfrost/panfrost_drv.c
+++ b/drivers/gpu/drm/panfrost/panfrost_drv.c
@@ -108,6 +108,34 @@ static int panfrost_ioctl_create_bo(struct drm_device *dev, void *data,
 	return 0;
 }
 
+static int
+panfrost_get_job_mappings(struct drm_file *file_priv, struct panfrost_job *job)
+{
+	struct panfrost_file_priv *priv = file_priv->driver_priv;
+	unsigned int i;
+
+	job->mappings = kvmalloc_array(job->bo_count,
+				       sizeof(*job->mappings),
+				       GFP_KERNEL | __GFP_ZERO);
+	if (!job->mappings)
+		return -ENOMEM;
+
+	for (i = 0; i < job->bo_count; i++) {
+		struct panfrost_gem_mapping *mapping;
+		struct panfrost_gem_object *bo;
+
+		bo = to_panfrost_bo(job->bos[i]);
+		mapping = panfrost_gem_mapping_get(bo, priv);
+		if (!mapping)
+			return -EINVAL;
+
+		atomic_inc(&bo->gpu_usecount);
+		job->mappings[i] = mapping;
+	}
+
+	return 0;
+}
+
 /**
  * panfrost_lookup_bos() - Sets up job->bo[] with the GEM objects
  * referenced by the job.
@@ -127,8 +155,6 @@ panfrost_lookup_bos(struct drm_device *dev,
 		  struct drm_panfrost_submit *args,
 		  struct panfrost_job *job)
 {
-	struct panfrost_file_priv *priv = file_priv->driver_priv;
-	struct panfrost_gem_object *bo;
 	unsigned int i;
 	int ret;
 
@@ -143,27 +169,7 @@ panfrost_lookup_bos(struct drm_device *dev,
 	if (ret)
 		return ret;
 
-	job->mappings = kvmalloc_array(job->bo_count,
-				       sizeof(struct panfrost_gem_mapping *),
-				       GFP_KERNEL | __GFP_ZERO);
-	if (!job->mappings)
-		return -ENOMEM;
-
-	for (i = 0; i < job->bo_count; i++) {
-		struct panfrost_gem_mapping *mapping;
-
-		bo = to_panfrost_bo(job->bos[i]);
-		mapping = panfrost_gem_mapping_get(bo, priv);
-		if (!mapping) {
-			ret = -EINVAL;
-			break;
-		}
-
-		atomic_inc(&bo->gpu_usecount);
-		job->mappings[i] = mapping;
-	}
-
-	return ret;
+	return panfrost_get_job_mappings(file_priv, job);
 }
 
 /**
-- 
2.26.2

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [RFC PATCH 4/7] drm/panfrost: Add BO access flags to relax dependencies between jobs
  2021-03-11  9:25 [RFC PATCH 0/7] drm/panfrost: Add a new submit ioctl Boris Brezillon
                   ` (2 preceding siblings ...)
  2021-03-11  9:25 ` [RFC PATCH 3/7] drm/panfrost: Move the mappings collection out of panfrost_lookup_bos() Boris Brezillon
@ 2021-03-11  9:25 ` Boris Brezillon
  2021-03-11  9:25 ` [RFC PATCH 5/7] drm/panfrost: Add a new ioctl to submit batches Boris Brezillon
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 20+ messages in thread
From: Boris Brezillon @ 2021-03-11  9:25 UTC (permalink / raw)
  To: Rob Herring, Tomeu Vizoso, Alyssa Rosenzweig, Steven Price, Robin Murphy
  Cc: Boris Brezillon, dri-devel

Jobs reading from the same BO should not be serialized. Add access flags
so we can relax the implicit dependencies in that case. We force RW
access for now to keep the behavior unchanged, but a new SUBMIT ioctl
taking explicit access flags will be introduced.

Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
---
 drivers/gpu/drm/panfrost/panfrost_drv.c |  9 +++++++++
 drivers/gpu/drm/panfrost/panfrost_job.c | 15 +++++++++++++--
 drivers/gpu/drm/panfrost/panfrost_job.h |  1 +
 include/uapi/drm/panfrost_drm.h         |  4 ++++
 4 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c b/drivers/gpu/drm/panfrost/panfrost_drv.c
index 32e822e00a08..be45efee47a2 100644
--- a/drivers/gpu/drm/panfrost/panfrost_drv.c
+++ b/drivers/gpu/drm/panfrost/panfrost_drv.c
@@ -163,6 +163,15 @@ panfrost_lookup_bos(struct drm_device *dev,
 	if (!job->bo_count)
 		return 0;
 
+	job->bo_flags = kvmalloc_array(job->bo_count,
+				       sizeof(*job->bo_flags),
+				       GFP_KERNEL | __GFP_ZERO);
+	if (!job->bo_flags)
+		return -ENOMEM;
+
+	for (i = 0; i < job->bo_count; i++)
+		job->bo_flags[i] = PANFROST_BO_REF_RW;
+
 	ret = drm_gem_objects_lookup(file_priv,
 				     (void __user *)(uintptr_t)args->bo_handles,
 				     job->bo_count, &job->bos);
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
index 87bfbd77d753..170c3b0c8641 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -201,10 +201,17 @@ static int panfrost_acquire_object_fences(struct panfrost_job *job)
 	int i;
 
 	for (i = 0; i < job->bo_count; i++) {
+		bool write = job->bo_flags[i] & PANFROST_BO_REF_WRITE;
 		int ret;
 
+		if (!(job->bo_flags[i] & PANFROST_BO_REF_WRITE)) {
+			ret = dma_resv_reserve_shared(job->bos[i]->resv, 1);
+			if (ret)
+				return ret;
+		}
+
 		ret = drm_gem_fence_array_add_implicit(&job->deps,
-						       job->bos[i], true);
+						       job->bos[i], write);
 		if (ret)
 			return ret;
 	}
@@ -219,7 +226,10 @@ static void panfrost_attach_object_fences(struct panfrost_job *job)
 	for (i = 0; i < job->bo_count; i++) {
 		struct dma_resv *robj = job->bos[i]->resv;
 
-		dma_resv_add_excl_fence(robj, job->render_done_fence);
+		if (job->bo_flags[i] & PANFROST_BO_REF_WRITE)
+			dma_resv_add_excl_fence(robj, job->render_done_fence);
+		else
+			dma_resv_add_shared_fence(robj, job->render_done_fence);
 	}
 }
 
@@ -298,6 +308,7 @@ static void panfrost_job_cleanup(struct kref *ref)
 		kvfree(job->bos);
 	}
 
+	kvfree(job->bo_flags);
 	kfree(job);
 }
 
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.h b/drivers/gpu/drm/panfrost/panfrost_job.h
index b10b5be57dd2..c3f662771ed9 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.h
+++ b/drivers/gpu/drm/panfrost/panfrost_job.h
@@ -31,6 +31,7 @@ struct panfrost_job {
 
 	struct panfrost_gem_mapping **mappings;
 	struct drm_gem_object **bos;
+	u32 *bo_flags;
 	u32 bo_count;
 
 	/* Fence to be signaled by drm-sched once its done with the job */
diff --git a/include/uapi/drm/panfrost_drm.h b/include/uapi/drm/panfrost_drm.h
index ec19db1eead8..eab96d7e0e0e 100644
--- a/include/uapi/drm/panfrost_drm.h
+++ b/include/uapi/drm/panfrost_drm.h
@@ -223,6 +223,10 @@ struct drm_panfrost_madvise {
 	__u32 retained;       /* out, whether backing store still exists */
 };
 
+#define PANFROST_BO_REF_READ		0x1
+#define PANFROST_BO_REF_WRITE		0x2
+#define PANFROST_BO_REF_RW		(PANFROST_BO_REF_READ | PANFROST_BO_REF_WRITE)
+
 #if defined(__cplusplus)
 }
 #endif
-- 
2.26.2

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [RFC PATCH 5/7] drm/panfrost: Add a new ioctl to submit batches
  2021-03-11  9:25 [RFC PATCH 0/7] drm/panfrost: Add a new submit ioctl Boris Brezillon
                   ` (3 preceding siblings ...)
  2021-03-11  9:25 ` [RFC PATCH 4/7] drm/panfrost: Add BO access flags to relax dependencies between jobs Boris Brezillon
@ 2021-03-11  9:25 ` Boris Brezillon
  2021-03-11  9:25 ` [RFC PATCH 6/7] drm/panfrost: Advertise the SYNCOBJ_TIMELINE feature Boris Brezillon
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 20+ messages in thread
From: Boris Brezillon @ 2021-03-11  9:25 UTC (permalink / raw)
  To: Rob Herring, Tomeu Vizoso, Alyssa Rosenzweig, Steven Price, Robin Murphy
  Cc: Boris Brezillon, dri-devel

This should help limit the number of ioctls when submitting multiple
jobs. The new ioctl also supports syncobj timelines and BO access flags.

Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
---
 drivers/gpu/drm/panfrost/panfrost_drv.c | 303 ++++++++++++++++++++++++
 include/uapi/drm/panfrost_drm.h         |  79 ++++++
 2 files changed, 382 insertions(+)

diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c b/drivers/gpu/drm/panfrost/panfrost_drv.c
index be45efee47a2..3f8addab8bab 100644
--- a/drivers/gpu/drm/panfrost/panfrost_drv.c
+++ b/drivers/gpu/drm/panfrost/panfrost_drv.c
@@ -446,6 +446,308 @@ static int panfrost_ioctl_madvise(struct drm_device *dev, void *data,
 	return ret;
 }
 
+static int
+panfrost_get_job_in_syncs(struct drm_file *file_priv,
+			  u64 refs, u32 ref_stride,
+			  u32 count, struct panfrost_job *job)
+{
+	const void __user *in = u64_to_user_ptr(refs);
+	unsigned int i;
+	int ret;
+
+	if (!count)
+		return 0;
+
+	for (i = 0; i < count; i++) {
+		struct drm_panfrost_syncobj_ref ref = { };
+		struct dma_fence *fence;
+
+		ret = copy_struct_from_user(&ref, sizeof(ref),
+					    in + (i * ref_stride),
+					    ref_stride);
+		if (ret)
+			return ret;
+
+		if (ref.pad)
+			return -EINVAL;
+
+		ret = drm_syncobj_find_fence(file_priv, ref.handle, ref.point,
+					     0, &fence);
+		if (ret)
+			return ret;
+
+		ret = drm_gem_fence_array_add(&job->deps, fence);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
+struct panfrost_job_out_sync {
+	struct drm_syncobj *syncobj;
+	struct dma_fence_chain *chain;
+	u64 point;
+};
+
+static void
+panfrost_put_job_out_syncs(struct panfrost_job_out_sync *out_syncs, u32 count)
+{
+	unsigned int i;
+
+	for (i = 0; i < count; i++) {
+		if (!out_syncs[i].syncobj)
+			break;
+
+		drm_syncobj_put(out_syncs[i].syncobj);
+		kvfree(out_syncs[i].chain);
+	}
+
+	kvfree(out_syncs);
+}
+
+static struct panfrost_job_out_sync *
+panfrost_get_job_out_syncs(struct drm_file *file_priv,
+			   u64 refs, u32 ref_stride,
+			   u32 count)
+{
+	void __user *in = u64_to_user_ptr(refs);
+	struct panfrost_job_out_sync *out_syncs;
+	unsigned int i;
+	int ret;
+
+	if (!count)
+		return NULL;
+
+	out_syncs = kvmalloc_array(count, sizeof(*out_syncs),
+				   GFP_KERNEL | __GFP_ZERO);
+	if (!out_syncs)
+		return ERR_PTR(-ENOMEM);
+
+	for (i = 0; i < count; i++) {
+		struct drm_panfrost_syncobj_ref ref = { };
+
+		ret = copy_struct_from_user(&ref, sizeof(ref),
+					    in + (i * ref_stride),
+					    ref_stride);
+		if (ret)
+			goto err_free_out_syncs;
+
+		if (ref.pad) {
+			ret = -EINVAL;
+			goto err_free_out_syncs;
+		}
+
+		out_syncs[i].syncobj = drm_syncobj_find(file_priv, ref.handle);
+		if (!out_syncs[i].syncobj) {
+			ret = -EINVAL;
+			goto err_free_out_syncs;
+		}
+
+		out_syncs[i].point = ref.point;
+		if (!out_syncs[i].point)
+			continue;
+
+		out_syncs[i].chain = kmalloc(sizeof(*out_syncs[i].chain),
+					     GFP_KERNEL);
+		if (!out_syncs[i].chain) {
+			ret = -ENOMEM;
+			goto err_free_out_syncs;
+		}
+	}
+
+	return out_syncs;
+
+err_free_out_syncs:
+	panfrost_put_job_out_syncs(out_syncs, count);
+	return ERR_PTR(ret);
+}
+
+static void
+panfrost_set_job_out_fence(struct panfrost_job_out_sync *out_syncs,
+			   unsigned int count, struct dma_fence *fence)
+{
+	unsigned int i;
+
+	for (i = 0; i < count; i++) {
+		if (out_syncs[i].chain) {
+			drm_syncobj_add_point(out_syncs[i].syncobj,
+					      out_syncs[i].chain,
+					      fence, out_syncs[i].point);
+			out_syncs[i].chain = NULL;
+		} else {
+			drm_syncobj_replace_fence(out_syncs[i].syncobj,
+						  fence);
+		}
+	}
+}
+
+#define PANFROST_BO_REF_ALLOWED_FLAGS PANFROST_BO_REF_RW
+
+static int
+panfrost_get_job_bos(struct drm_file *file_priv,
+		     u64 refs, u32 ref_stride, u32 count,
+		     struct panfrost_job *job)
+{
+	void __user *in = u64_to_user_ptr(refs);
+	unsigned int i;
+
+	job->bo_count = count;
+
+	if (!count)
+		return 0;
+
+	job->bos = kvmalloc_array(job->bo_count, sizeof(*job->bos),
+				  GFP_KERNEL | __GFP_ZERO);
+	job->bo_flags = kvmalloc_array(job->bo_count,
+				       sizeof(*job->bo_flags),
+				       GFP_KERNEL | __GFP_ZERO);
+	if (!job->bos || !job->bo_flags)
+		return -ENOMEM;
+
+	for (i = 0; i < count; i++) {
+		struct drm_panfrost_bo_ref ref = { };
+		int ret;
+
+		ret = copy_struct_from_user(&ref, sizeof(ref),
+					    in + (i * ref_stride),
+					    ref_stride);
+		if (ret)
+			return ret;
+
+		if ((ref.flags & ~PANFROST_BO_REF_ALLOWED_FLAGS))
+			return -EINVAL;
+
+		/* We need to know the access type. */
+		if (!(ref.flags & PANFROST_BO_REF_RW))
+			return -EINVAL;
+
+		job->bos[i] = drm_gem_object_lookup(file_priv, ref.handle);
+		if (!job->bos[i])
+			return -EINVAL;
+
+		job->bo_flags[i] = ref.flags;
+	}
+
+	return 0;
+}
+
+#define PANFROST_JD_ALLOWED_REQS PANFROST_JD_REQ_FS
+
+static int
+panfrost_submit_job(struct drm_device *dev, struct drm_file *file_priv,
+		    const struct drm_panfrost_job *args,
+		    u32 bo_stride, u32 syncobj_stride)
+{
+	struct panfrost_device *pfdev = dev->dev_private;
+	struct panfrost_job_out_sync *out_syncs;
+	struct panfrost_job *job;
+	int ret;
+
+	if (!args->head)
+		return -EINVAL;
+
+	if (args->requirements & ~PANFROST_JD_ALLOWED_REQS)
+		return -EINVAL;
+
+	job = kzalloc(sizeof(*job), GFP_KERNEL);
+	if (!job)
+		return -ENOMEM;
+
+	kref_init(&job->refcount);
+
+	job->pfdev = pfdev;
+	job->jc = args->head;
+	job->requirements = args->requirements;
+	job->flush_id = panfrost_gpu_get_latest_flush_id(pfdev);
+	job->file_priv = file_priv->driver_priv;
+	xa_init_flags(&job->deps, XA_FLAGS_ALLOC);
+
+	ret = panfrost_get_job_in_syncs(file_priv,
+					args->in_syncs,
+					syncobj_stride,
+					args->in_sync_count,
+					job);
+	if (ret)
+		goto err_put_job;
+
+	out_syncs = panfrost_get_job_out_syncs(file_priv,
+					       args->out_syncs,
+					       syncobj_stride,
+					       args->out_sync_count);
+	if (IS_ERR(out_syncs)) {
+		ret = PTR_ERR(out_syncs);
+		goto err_put_job;
+	}
+
+	ret = panfrost_get_job_bos(file_priv, args->bos, bo_stride,
+				   args->bo_count, job);
+	if (ret)
+		goto err_put_job;
+
+	ret = panfrost_get_job_mappings(file_priv, job);
+	if (ret)
+		goto err_put_job;
+
+	ret = panfrost_job_push(job);
+	if (ret) {
+		panfrost_put_job_out_syncs(out_syncs, args->out_sync_count);
+		goto err_put_job;
+	}
+
+	panfrost_set_job_out_fence(out_syncs, args->out_sync_count,
+				   job->render_done_fence);
+	panfrost_put_job_out_syncs(out_syncs, args->out_sync_count);
+	return 0;
+
+err_put_job:
+	panfrost_job_put(job);
+	return ret;
+}
+
+static int
+panfrost_ioctl_batch_submit(struct drm_device *dev, void *data,
+			    struct drm_file *file_priv)
+{
+	struct drm_panfrost_batch_submit *args = data;
+	void __user *jobs_args = u64_to_user_ptr(args->jobs);
+	unsigned int i;
+	int ret;
+
+	if (args->pad)
+		return -EINVAL;
+
+	/* Relax this test if new fields are added to
+	 * drm_panfrost_{bo_ref,syncobj_ref,job}.
+	 */
+	if (args->job_stride < sizeof(struct drm_panfrost_job) ||
+	    args->bo_ref_stride < sizeof(struct drm_panfrost_bo_ref) ||
+	    args->syncobj_ref_stride < sizeof(struct drm_panfrost_syncobj_ref))
+		return -EINVAL;
+
+	for (i = 0; i < args->job_count; i++) {
+		struct drm_panfrost_job job_args = { };
+
+		ret = copy_struct_from_user(&job_args, sizeof(job_args),
+					    jobs_args + (i * args->job_stride),
+					    args->job_stride);
+		if (ret) {
+			args->fail_idx = i;
+			return ret;
+		}
+
+		ret = panfrost_submit_job(dev, file_priv, &job_args,
+					  args->bo_ref_stride,
+					  args->syncobj_ref_stride);
+		if (ret) {
+			args->fail_idx = i;
+			return ret;
+		}
+	}
+
+	return 0;
+}
+
 int panfrost_unstable_ioctl_check(void)
 {
 	if (!unstable_ioctls)
@@ -544,6 +846,7 @@ static const struct drm_ioctl_desc panfrost_drm_driver_ioctls[] = {
 	PANFROST_IOCTL(PERFCNT_ENABLE,	perfcnt_enable,	DRM_RENDER_ALLOW),
 	PANFROST_IOCTL(PERFCNT_DUMP,	perfcnt_dump,	DRM_RENDER_ALLOW),
 	PANFROST_IOCTL(MADVISE,		madvise,	DRM_RENDER_ALLOW),
+	PANFROST_IOCTL(BATCH_SUBMIT,	batch_submit,	DRM_RENDER_ALLOW),
 };
 
 DEFINE_DRM_GEM_FOPS(panfrost_drm_driver_fops);
diff --git a/include/uapi/drm/panfrost_drm.h b/include/uapi/drm/panfrost_drm.h
index eab96d7e0e0e..34a10dac7ade 100644
--- a/include/uapi/drm/panfrost_drm.h
+++ b/include/uapi/drm/panfrost_drm.h
@@ -21,6 +21,7 @@ extern "C" {
 #define DRM_PANFROST_PERFCNT_ENABLE		0x06
 #define DRM_PANFROST_PERFCNT_DUMP		0x07
 #define DRM_PANFROST_MADVISE			0x08
+#define DRM_PANFROST_BATCH_SUBMIT		0x09
 
 #define DRM_IOCTL_PANFROST_SUBMIT		DRM_IOW(DRM_COMMAND_BASE + DRM_PANFROST_SUBMIT, struct drm_panfrost_submit)
 #define DRM_IOCTL_PANFROST_WAIT_BO		DRM_IOW(DRM_COMMAND_BASE + DRM_PANFROST_WAIT_BO, struct drm_panfrost_wait_bo)
@@ -29,6 +30,7 @@ extern "C" {
 #define DRM_IOCTL_PANFROST_GET_PARAM		DRM_IOWR(DRM_COMMAND_BASE + DRM_PANFROST_GET_PARAM, struct drm_panfrost_get_param)
 #define DRM_IOCTL_PANFROST_GET_BO_OFFSET	DRM_IOWR(DRM_COMMAND_BASE + DRM_PANFROST_GET_BO_OFFSET, struct drm_panfrost_get_bo_offset)
 #define DRM_IOCTL_PANFROST_MADVISE		DRM_IOWR(DRM_COMMAND_BASE + DRM_PANFROST_MADVISE, struct drm_panfrost_madvise)
+#define DRM_IOCTL_PANFROST_BATCH_SUBMIT		DRM_IOWR(DRM_COMMAND_BASE + DRM_PANFROST_BATCH_SUBMIT, struct drm_panfrost_batch_submit)
 
 /*
  * Unstable ioctl(s): only exposed when the unsafe unstable_ioctls module
@@ -223,10 +225,87 @@ struct drm_panfrost_madvise {
 	__u32 retained;       /* out, whether backing store still exists */
 };
 
+/* Syncobj reference passed at job submission time to encode explicit
+ * input/output fences.
+ */
+struct drm_panfrost_syncobj_ref {
+	__u32 handle;
+	__u32 pad;
+	__u64 point;
+};
+
 #define PANFROST_BO_REF_READ		0x1
 #define PANFROST_BO_REF_WRITE		0x2
 #define PANFROST_BO_REF_RW		(PANFROST_BO_REF_READ | PANFROST_BO_REF_WRITE)
 
+/* Describes a BO referenced by a job and the type of access. */
+struct drm_panfrost_bo_ref {
+	__u32 handle;
+	__u32 flags;
+};
+
+/* Describes a GPU job and the resources attached to it. */
+struct drm_panfrost_job {
+	/** GPU pointer to the head of the job chain. */
+	__u64 head;
+
+	/**
+	 * Array of drm_panfrost_bo_ref objects describing the BOs referenced
+	 * by this job.
+	 */
+	__u64 bos;
+
+	/**
+	 * Arrays of drm_panfrost_syncobj_ref objects describing the input
+	 * and output fences.
+	 */
+	__u64 in_syncs;
+	__u64 out_syncs;
+
+	/** Syncobj reference array sizes. */
+	__u32 in_sync_count;
+	__u32 out_sync_count;
+
+	/** BO reference array size. */
+	__u32 bo_count;
+
+	/** Combination of PANFROST_JD_REQ_* flags. */
+	__u32 requirements;
+};
+
+/* Used to submit multiple jobs in one call */
+struct drm_panfrost_batch_submit {
+	/**
+	 * Stride of the jobs array (needed to ease extension of the
+	 * BATCH_SUBMIT ioctl). Should be set to
+	 * sizeof(struct drm_panfrost_job).
+	 */
+	__u32 job_stride;
+
+	/** Number of jobs to submit. */
+	__u32 job_count;
+
+	/* Pointer to a job array. */
+	__u64 jobs;
+
+	/**
+	 * Stride of the BO and syncobj reference arrays (needed to ease
+	 * extension of the BATCH_SUBMIT ioctl). Each should be set to
+	 * the size of the corresponding drm_panfrost_*_ref struct.
+	 */
+	__u32 bo_ref_stride;
+	__u32 syncobj_ref_stride;
+
+	/**
+	 * If the submission fails, this encodes the index of the job
+	 * that failed.
+	 */
+	__u32 fail_idx;
+
+	/** Always 0. */
+	__u32 pad;
+};
+
 #if defined(__cplusplus)
 }
 #endif
-- 
2.26.2

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [RFC PATCH 6/7] drm/panfrost: Advertise the SYNCOBJ_TIMELINE feature
  2021-03-11  9:25 [RFC PATCH 0/7] drm/panfrost: Add a new submit ioctl Boris Brezillon
                   ` (4 preceding siblings ...)
  2021-03-11  9:25 ` [RFC PATCH 5/7] drm/panfrost: Add a new ioctl to submit batches Boris Brezillon
@ 2021-03-11  9:25 ` Boris Brezillon
  2021-03-11  9:25 ` [RFC PATCH 7/7] drm/panfrost: Bump minor version to reflect the feature additions Boris Brezillon
  2021-03-11 12:16 ` [RFC PATCH 0/7] drm/panfrost: Add a new submit ioctl Steven Price
  7 siblings, 0 replies; 20+ messages in thread
From: Boris Brezillon @ 2021-03-11  9:25 UTC (permalink / raw)
  To: Rob Herring, Tomeu Vizoso, Alyssa Rosenzweig, Steven Price, Robin Murphy
  Cc: Boris Brezillon, dri-devel

Now that we have a new SUBMIT ioctl dealing with timelined syncobjs we
can advertise the feature.

Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
---
 drivers/gpu/drm/panfrost/panfrost_drv.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c b/drivers/gpu/drm/panfrost/panfrost_drv.c
index 3f8addab8bab..b3cc6fecb18d 100644
--- a/drivers/gpu/drm/panfrost/panfrost_drv.c
+++ b/drivers/gpu/drm/panfrost/panfrost_drv.c
@@ -857,7 +857,8 @@ DEFINE_DRM_GEM_FOPS(panfrost_drm_driver_fops);
  * - 1.1 - adds HEAP and NOEXEC flags for CREATE_BO
  */
 static const struct drm_driver panfrost_drm_driver = {
-	.driver_features	= DRIVER_RENDER | DRIVER_GEM | DRIVER_SYNCOBJ,
+	.driver_features	= DRIVER_RENDER | DRIVER_GEM | DRIVER_SYNCOBJ |
+				  DRIVER_SYNCOBJ_TIMELINE,
 	.open			= panfrost_open,
 	.postclose		= panfrost_postclose,
 	.ioctls			= panfrost_drm_driver_ioctls,
-- 
2.26.2

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [RFC PATCH 7/7] drm/panfrost: Bump minor version to reflect the feature additions
  2021-03-11  9:25 [RFC PATCH 0/7] drm/panfrost: Add a new submit ioctl Boris Brezillon
                   ` (5 preceding siblings ...)
  2021-03-11  9:25 ` [RFC PATCH 6/7] drm/panfrost: Advertise the SYNCOBJ_TIMELINE feature Boris Brezillon
@ 2021-03-11  9:25 ` Boris Brezillon
  2021-03-11 12:16 ` [RFC PATCH 0/7] drm/panfrost: Add a new submit ioctl Steven Price
  7 siblings, 0 replies; 20+ messages in thread
From: Boris Brezillon @ 2021-03-11  9:25 UTC (permalink / raw)
  To: Rob Herring, Tomeu Vizoso, Alyssa Rosenzweig, Steven Price, Robin Murphy
  Cc: Boris Brezillon, dri-devel

We now have a new ioctl that allows submitting multiple jobs at once
(among other things) and we support timelined syncobjs. Bump the
minor version number to reflect those changes.

Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
---
 drivers/gpu/drm/panfrost/panfrost_drv.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c b/drivers/gpu/drm/panfrost/panfrost_drv.c
index b3cc6fecb18d..7fb9b20ba27f 100644
--- a/drivers/gpu/drm/panfrost/panfrost_drv.c
+++ b/drivers/gpu/drm/panfrost/panfrost_drv.c
@@ -855,6 +855,7 @@ DEFINE_DRM_GEM_FOPS(panfrost_drm_driver_fops);
  * Panfrost driver version:
  * - 1.0 - initial interface
  * - 1.1 - adds HEAP and NOEXEC flags for CREATE_BO
+ * - 1.2 - adds BATCH_SUBMIT ioctl and SYNCOBJ_TIMELINE feature
  */
 static const struct drm_driver panfrost_drm_driver = {
 	.driver_features	= DRIVER_RENDER | DRIVER_GEM | DRIVER_SYNCOBJ |
@@ -868,7 +869,7 @@ static const struct drm_driver panfrost_drm_driver = {
 	.desc			= "panfrost DRM",
 	.date			= "20180908",
 	.major			= 1,
-	.minor			= 1,
+	.minor			= 2,
 
 	.gem_create_object	= panfrost_gem_create_object,
 	.prime_handle_to_fd	= drm_gem_prime_handle_to_fd,
-- 
2.26.2

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 0/7] drm/panfrost: Add a new submit ioctl
  2021-03-11  9:25 [RFC PATCH 0/7] drm/panfrost: Add a new submit ioctl Boris Brezillon
                   ` (6 preceding siblings ...)
  2021-03-11  9:25 ` [RFC PATCH 7/7] drm/panfrost: Bump minor version to reflect the feature additions Boris Brezillon
@ 2021-03-11 12:16 ` Steven Price
  2021-03-11 13:00   ` Boris Brezillon
  2021-03-12 14:15   ` Boris Brezillon
  7 siblings, 2 replies; 20+ messages in thread
From: Steven Price @ 2021-03-11 12:16 UTC (permalink / raw)
  To: Boris Brezillon, Rob Herring, Tomeu Vizoso, Alyssa Rosenzweig,
	Robin Murphy
  Cc: dri-devel

On 11/03/2021 09:25, Boris Brezillon wrote:
> Hello,
> 
> I've been playing with Vulkan lately and struggled quite a bit to
> implement VkQueueSubmit with the submit ioctl we have. There are
> several limiting factors that can be worked around if we really have to,
> but I think it'd be much easier and future-proof if we introduce a new
> ioctl that addresses the current limitations:

Hi Boris,

I think what you've proposed is quite reasonable, some detailed comments 
to your points below.

> 
> 1/ There can only be one out_sync, but Vulkan might ask us to signal
>     several VkSemaphores and possibly one VkFence too, both of those
>     being based on sync objects in my PoC. Making out_sync an array of
>     syncobjs to attach the render_done fence to would make that possible.
>     The other option would be to collect syncobj updates in userspace
>     in a separate thread and propagate those updates to all
>     semaphores+fences waiting on those events (I think the v3dv driver
>     does something like that, but I didn't spend enough time studying
>     the code to be sure, so I might be wrong).

You should be able to avoid the separate propagation thread by having a 
proxy object in user space that maps between the one out_sync of the job 
and the possibly many Vulkan objects. But I've had this argument before 
with the DDK... and the upshot of it was that the Vulkan API is 
unnecessarily complex here and makes this really hard to do in practice. 
So I agree adding this capability to the kernel is likely the best approach.

> 2/ Queued jobs might be executed out-of-order (unless they have
>     explicit/implicit deps between them), and Vulkan asks that the out
>     fence be signaled when all jobs are done. Timeline syncobjs are a
>     good match for that use case. All we need to do is pass the same
>     fence syncobj to all jobs being attached to a single QueueSubmit
>     request, but a different point on the timeline. The syncobj
>     timeline wait does the rest and guarantees that we've reached a
>     given timeline point (IOW, all jobs before that point are done)
>     before declaring the fence as signaled.
>     One alternative would be to have dummy 'synchronization' jobs that
>     don't actually execute anything on the GPU but declare a dependency
>     on all other jobs that are part of the QueueSubmit request, and
>     signal the out fence (the scheduler would do most of the work for
>     us, all we have to do is support NULL job heads and signal the
>     fence directly when that happens instead of queueing the job).

I have to admit to being rather hazy on the details of timeline 
syncobjs, but I thought there was a requirement that the timeline moves 
monotonically. I.e. if you have multiple jobs signalling the same 
syncobj just with different points, then AFAIU the API requires that the 
points are triggered in order.

So I'm not sure that you've actually fixed this point - you either need 
to force an order (in which case the last job can signal the Vulkan 
fence) or you still need a dummy job to do the many-to-one dependency.

Or I may have completely misunderstood timeline syncobjs - definitely a 
possibility :)

> 3/ The current implementation lacks information about BO access,
>     so we serialize all jobs accessing the same set of BOs, even
>     if those jobs might just be reading from them (which can
>     happen concurrently). Other drivers pass an access type to the
>     list of referenced BOs to address that. Another option would be
>     to disable implicit deps (deps based on BOs) and force the driver
>     to pass all deps explicitly (interestingly, some drivers have
>     both the no-implicit-dep and r/w flags, probably to support
>     sub-resource access, so we might want to add that one too).
>     I don't see any userspace workaround to that problem, so that one
>     alone would justify extending the existing ioctl or adding a new
>     one.

Yeah - I think we need this. My only comment is that I think the 
read/write terminology may come back to bite. Better to use 'shared' and 
'exclusive' - which better matches the dma_resv_xxx APIs anyway.

Also the current code completely ignores PANFROST_BO_REF_READ. So either 
that should be defined as 0, or even better we support 3 modes:

  * Exclusive ('write' access)
  * Shared ('read' access)
  * No fence - ensures the BO is mapped, but doesn't add any implicit 
fences.

The last may make sense when doing explicit fences and e.g. doing 
front-buffer rendering with a display driver which does implicit fencing.
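
Purely as an illustration of what I mean (not something from this
series), the reference modes could look like:

/* Hypothetical BO reference modes. */
#define PANFROST_BO_REF_NO_FENCE	0x0 /* mapped, but no implicit fence */
#define PANFROST_BO_REF_EXCLUSIVE	0x1 /* 'write': adds an exclusive fence */
#define PANFROST_BO_REF_SHARED		0x2 /* 'read': adds a shared fence */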

> 4/ There's also the fact that submitting one job at a time adds an
>     overhead when QueueSubmit is being passed more than one
>     CommandBuffer. That one is less problematic, but if we're adding
>     a new ioctl we'd better design it to limit the userspace -> kernel
>     transition overhead.

I've no objection - but I doubt the performance effect is significant. I 
was pleased to see the handling of stride which makes the interface 
extendable. In particular I suspect at some point we're going to want a 
priority field in some form.
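
For example (hypothetically, just to illustrate the extension
mechanism), a later revision could append fields like this, and old
userspace passing the previous, smaller job_stride could still be
accepted (once the minimum-stride check is relaxed as the comment in
the ioctl notes) because copy_struct_from_user() zero-fills whatever
the caller didn't provide:

struct drm_panfrost_job {
	__u64 head;
	__u64 bos;
	__u64 in_syncs;
	__u64 out_syncs;
	__u32 in_sync_count;
	__u32 out_sync_count;
	__u32 bo_count;
	__u32 requirements;
	__u32 priority;		/* hypothetical new field */
	__u32 pad;
};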

> Right now I'm just trying to collect feedback. I don't intend to get
> those patches merged until we have a userspace user, but I thought
> starting the discussion early would be a good thing.
> 
> Feel free to suggest other approaches.

Other than the above I didn't see any obvious issues, and I know the 
Vulkan API is problematic in terms of synchronisation primitives - so if 
this makes it easier to implement then it seems like a good idea to me.

Thanks,

Steve
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 0/7] drm/panfrost: Add a new submit ioctl
  2021-03-11 12:16 ` [RFC PATCH 0/7] drm/panfrost: Add a new submit ioctl Steven Price
@ 2021-03-11 13:00   ` Boris Brezillon
  2021-03-11 16:58     ` Jason Ekstrand
  2021-03-12 14:15   ` Boris Brezillon
  1 sibling, 1 reply; 20+ messages in thread
From: Boris Brezillon @ 2021-03-11 13:00 UTC (permalink / raw)
  To: Steven Price; +Cc: dri-devel, Rob Herring, Robin Murphy, Alyssa Rosenzweig

Hi Steven,

On Thu, 11 Mar 2021 12:16:33 +0000
Steven Price <steven.price@arm.com> wrote:

> On 11/03/2021 09:25, Boris Brezillon wrote:
> > Hello,
> > 
> > I've been playing with Vulkan lately and struggled quite a bit to
> > implement VkQueueSubmit with the submit ioctl we have. There are
> > several limiting factors that can be worked around if we really have to,
> > but I think it'd be much easier and future-proof if we introduce a new
> > ioctl that addresses the current limitations:  
> 
> Hi Boris,
> 
> I think what you've proposed is quite reasonable, some detailed comments 
> to your points below.
> 
> > 
> > 1/ There can only be one out_sync, but Vulkan might ask us to signal
> >     several VkSemaphores and possibly one VkFence too, both of those
> >     being based on sync objects in my PoC. Making out_sync an array of
> >     syncobjs to attach the render_done fence to would make that possible.
> >     The other option would be to collect syncobj updates in userspace
> >     in a separate thread and propagate those updates to all
> >     semaphores+fences waiting on those events (I think the v3dv driver
> >     does something like that, but I didn't spend enough time studying
> >     the code to be sure, so I might be wrong).  
> 
> You should be able to avoid the separate propagation thread by having a
> proxy object in user space that maps between the one out_sync of the job
> and the possibly many Vulkan objects. But I've had this argument before
> with the DDK... and the upshot of it was that the Vulkan API is
> unnecessarily complex here and makes this really hard to do in practice.
> So I agree adding this capability to the kernel is likely the best approach.
> 
> > 2/ Queued jobs might be executed out-of-order (unless they have
> >     explicit/implicit deps between them), and Vulkan asks that the out
> >     fence be signaled when all jobs are done. Timeline syncobjs are a
> >     good match for that use case. All we need to do is pass the same
> >     fence syncobj to all jobs being attached to a single QueueSubmit
> >     request, but a different point on the timeline. The syncobj
> >     timeline wait does the rest and guarantees that we've reached a
> >     given timeline point (IOW, all jobs before that point are done)
> >     before declaring the fence as signaled.
> >     One alternative would be to have dummy 'synchronization' jobs that
> >     don't actually execute anything on the GPU but declare a dependency
> >     on all other jobs that are part of the QueueSubmit request, and
> >     signal the out fence (the scheduler would do most of the work for
> >     us, all we have to do is support NULL job heads and signal the
> >     fence directly when that happens instead of queueing the job).  
> 
> I have to admit to being rather hazy on the details of timeline 
> syncobjs, but I thought there was a requirement that the timeline moves 
> monotonically. I.e. if you have multiple jobs signalling the same 
> syncobj just with different points, then AFAIU the API requires that the 
> points are triggered in order.

I only started looking at the SYNCOBJ_TIMELINE API a few days ago, so I
might be wrong, but my understanding is that queuing fences (addition
of new points in the timeline) should happen in order, but signaling
can happen in any order. When checking for a signaled fence the
fence-chain logic starts from the last point (or from an explicit point
if you use the timeline wait flavor) and goes backward, stopping at the
first un-signaled node. If I'm correct, that means that fences that
are part of a chain can be signaled in any order.
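
FWIW, on the userspace side the wait I have in mind is just the
timeline flavor pointed at the last point attached to a QueueSubmit,
using libdrm's drmSyncobjTimelineWait() wrapper (a sketch, with
'drm_fd', 'timeline_syncobj' and 'last_point' being whatever the
submit path used):

	uint64_t point = last_point;
	int ret;

	ret = drmSyncobjTimelineWait(drm_fd, &timeline_syncobj, &point, 1,
				     INT64_MAX /* effectively no timeout */,
				     DRM_SYNCOBJ_WAIT_FLAGS_WAIT_ALL, NULL);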

Note that I also considered using a sync file, which has the ability to
merge fences, but that required 3 extra ioctls for each syncobj to merge
(for the export/merge/import round trip), and AFAICT, fences stay around
until the sync file is destroyed, which forces some garbage collection
if we want to keep the number of active fences low. One nice thing
about the fence-chain/syncobj-timeline logic is that signaled fences
are collected automatically when walking the chain.
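
For the record, the round trip I'm talking about looks roughly like
this (error handling and the empty-syncobj case omitted):

#include <stdint.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/sync_file.h>
#include <xf86drm.h>

/* Merge a new fence into a syncobj through a sync file:
 * export + merge + import, i.e. 3 extra ioctls per syncobj.
 */
static int merge_into_syncobj(int drm_fd, uint32_t syncobj, int new_fence_fd)
{
	struct sync_merge_data merge = {
		.name = "merge",
		.fd2 = new_fence_fd,
	};
	int cur_fd, ret;

	ret = drmSyncobjExportSyncFile(drm_fd, syncobj, &cur_fd);
	if (ret)
		return ret;

	ret = ioctl(cur_fd, SYNC_IOC_MERGE, &merge);
	close(cur_fd);
	if (ret)
		return ret;

	ret = drmSyncobjImportSyncFile(drm_fd, syncobj, merge.fence);
	close(merge.fence);
	return ret;
}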

> 
> So I'm not sure that you've actually fixed this point - you either need 
> to force an order (in which case the last job can signal the Vulkan 
> fence)

That option requires adding deps that do not really exist on the last
jobs, so I'm not sure I like it.

> or you still need a dummy job to do the many-to-one dependency.

Yeah, that's what I've considered doing before realizing timelined
syncobjs could solve this problem (assuming I got it right :-/).

> 
> Or I may have completely misunderstood timeline syncobjs - definitely a 
> possibility :)

I wouldn't pretend I got it right either :-).

> 
> > 3/ The current implementation lacks information about BO access,
> >     so we serialize all jobs accessing the same set of BOs, even
> >     if those jobs might just be reading from them (which can
> >     happen concurrently). Other drivers pass an access type to the
> >     list of referenced BOs to address that. Another option would be
> >     to disable implicit deps (deps based on BOs) and force the driver
> >     to pass all deps explicitly (interestingly, some drivers have
> >     both the no-implicit-dep and r/w flags, probably to support
> >     sub-resource access, so we might want to add that one too).
> >     I don't see any userspace workaround to that problem, so that one
> >     alone would justify extending the existing ioctl or adding a new
> >     one.  
> 
> Yeah - I think we need this. My only comment is that I think the 
> read/write terminology may come back to bite. Better to use 'shared' and 
> 'exclusive' - which better matches the dma_resv_xxx APIs anyway.
> 
> Also the current code completely ignores PANFROST_BO_REF_READ. So either 
> that should be defined as 0, or even better we support 3 modes:
> 
>   * Exclusive ('write' access)
>   * Shared ('read' access)
>   * No fence - ensures the BO is mapped, but doesn't add any implicit 
> fences.
> 
> The last may make sense when doing explicit fences and e.g. doing 
> front-buffer rendering with a display driver which does implicit fencing.

Sounds good to me.

> 
> > 4/ There's also the fact that submitting one job at a time adds an
> >     overhead when QueueSubmit is being passed more than one
> >     CommandBuffer. That one is less problematic, but if we're adding
> >     a new ioctl we'd better design it to limit the userspace -> kernel
> >     transition overhead.  
> 
> I've no objection - but I doubt the performance effect is significant. I 
> was pleased to see the handling of stride which makes the interface 
> extendable. In particular I suspect at some point we're going to want a 
> priority field in some form.
> 
> > Right now I'm just trying to collect feedback. I don't intend to get
> > those patches merged until we have a userspace user, but I thought
> > starting the discussion early would be a good thing.
> > 
> > Feel free to suggest other approaches.  
> 
> Other than the above I didn't see any obvious issues, and I know the 
> Vulkan API is problematic in terms of synchronisation primitives - so if 
> this makes it easier to implement then it seems like a good idea to me.

Thanks a lot for you review.

Regards,

Boris
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 0/7] drm/panfrost: Add a new submit ioctl
  2021-03-11 13:00   ` Boris Brezillon
@ 2021-03-11 16:58     ` Jason Ekstrand
  2021-03-11 17:24       ` Boris Brezillon
  0 siblings, 1 reply; 20+ messages in thread
From: Jason Ekstrand @ 2021-03-11 16:58 UTC (permalink / raw)
  To: Boris Brezillon
  Cc: Rob Herring, Robin Murphy, Alyssa Rosenzweig,
	Mailing list - DRI developers, Steven Price

Hi all,

Dropping in where I may or may not be wanted, so feel free to ignore. :-)

On Thu, Mar 11, 2021 at 7:00 AM Boris Brezillon
<boris.brezillon@collabora.com> wrote:
>
> Hi Steven,
>
> On Thu, 11 Mar 2021 12:16:33 +0000
> Steven Price <steven.price@arm.com> wrote:
>
> > On 11/03/2021 09:25, Boris Brezillon wrote:
> > > Hello,
> > >
> > > I've been playing with Vulkan lately and struggled quite a bit to
> > > implement VkQueueSubmit with the submit ioctl we have. There are
> > > several limiting factors that can be worked around if we really have to,
> > > but I think it'd be much easier and future-proof if we introduce a new
> > > ioctl that addresses the current limitations:
> >
> > Hi Boris,
> >
> > I think what you've proposed is quite reasonable, some detailed comments
> > to your points below.
> >
> > >
> > > 1/ There can only be one out_sync, but Vulkan might ask us to signal
> > >     several VkSemaphores and possibly one VkFence too, both of those
> > >     being based on sync objects in my PoC. Making out_sync an array of
> > >     syncobjs to attach the render_done fence to would make that possible.
> > >     The other option would be to collect syncobj updates in userspace
> > >     in a separate thread and propagate those updates to all
> > >     semaphores+fences waiting on those events (I think the v3dv driver
> > >     does something like that, but I didn't spend enough time studying
> > >     the code to be sure, so I might be wrong).
> >
> > You should be able to avoid the separate propagation thread by having a
> > proxy object in user space that maps between the one out_sync of the job
> > and the possibly many Vulkan objects. But I've had this argument before
> > with the DDK... and the upshot of it was that the Vulkan API is
> > unnecessarily complex here and makes this really hard to do in practice.
> > So I agree adding this capability to the kernel is likely the best approach.

This is pretty easy to do from userspace.  Just take the out sync_file
and stuff it into each of the syncobjs using
DRM_IOCTL_SYNCOBJ_FD_TO_HANDLE.  That's all the kernel would be doing
under the hood anyway.  Having it built into your submit ioctl does
have the advantage of fewer round-trips to kernel space, though.
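
A minimal userspace sketch of that round trip, using the libdrm wrappers
around the HANDLE_TO_FD/FD_TO_HANDLE ioctls (the propagate_out_fence()
helper and its arguments are made up for illustration, error handling
trimmed):

#include <stdint.h>
#include <unistd.h>
#include <xf86drm.h>

/* Propagate a job's out-syncobj to several semaphore/fence syncobjs by
 * bouncing its current fence through a sync_file. */
static int propagate_out_fence(int drm_fd, uint32_t job_out_syncobj,
                               const uint32_t *vk_syncobjs, unsigned count)
{
        int sync_file_fd = -1;
        unsigned i;
        int ret;

        /* Export the fence currently held by the job's syncobj. */
        ret = drmSyncobjExportSyncFile(drm_fd, job_out_syncobj, &sync_file_fd);
        if (ret)
                return ret;

        /* Import that same fence into every semaphore/fence syncobj. */
        for (i = 0; i < count; i++) {
                ret = drmSyncobjImportSyncFile(drm_fd, vk_syncobjs[i],
                                               sync_file_fd);
                if (ret)
                        break;
        }

        close(sync_file_fd);
        return ret;
}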

> > > 2/ Queued jobs might be executed out-of-order (unless they have
> > >     explicit/implicit deps between them), and Vulkan asks that the out
> > >     fence be signaled when all jobs are done. Timeline syncobjs are a
> > >     good match for that use case. All we need to do is pass the same
> > >     fence syncobj to all jobs being attached to a single QueueSubmit
> > >     request, but a different point on the timeline. The syncobj
> > >     timeline wait does the rest and guarantees that we've reached a
> > >     given timeline point (IOW, all jobs before that point are done)
> > >     before declaring the fence as signaled.
> > >     One alternative would be to have dummy 'synchronization' jobs that
> > >     don't actually execute anything on the GPU but declare a dependency
> > >     on all other jobs that are part of the QueueSubmit request, and
> > >     signal the out fence (the scheduler would do most of the work for
> > >     us, all we have to do is support NULL job heads and signal the
> > >     fence directly when that happens instead of queueing the job).
> >
> > I have to admit to being rather hazy on the details of timeline
> > syncobjs, but I thought there was a requirement that the timeline moves
> > monotonically. I.e. if you have multiple jobs signalling the same
> > syncobj just with different points, then AFAIU the API requires that the
> > points are triggered in order.
>
> I only started looking at the SYNCOBJ_TIMELINE API a few days ago, so I
> might be wrong, but my understanding is that queuing fences (addition
> of new points in the timeline) should happen in order, but signaling
> can happen in any order. When checking for a signaled fence the
> fence-chain logic starts from the last point (or from an explicit point
> if you use the timeline wait flavor) and goes backward, stopping at the
> first un-signaled node. If I'm correct, that means that fences that
> are part of a chain can be signaled in any order.

You don't even need a timeline for this.  Just have a single syncobj
per-queue and make each submit wait on it and then signal it.
Alternatively, you can just always hang on to the out-fence from the
previous submit and make the next one wait on that.  Timelines are
overkill here, IMO.
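
A sketch of that per-queue serialization against the existing submit
ioctl (handles and the job chain address are placeholders, BO list and
error handling omitted; this relies on the kernel resolving in_syncs
before it installs the new out_sync fence, which the current
implementation does):

#include <stdint.h>
#include <xf86drm.h>
#include "panfrost_drm.h" /* panfrost uAPI header */

/* Every submit waits on queue_syncobj and then overwrites it with its
 * own completion fence, so submissions on the queue are serialized. */
static int submit_serialized(int drm_fd, uint32_t queue_syncobj, uint64_t jc,
                             uint32_t requirements)
{
        uint32_t in_sync = queue_syncobj;
        struct drm_panfrost_submit submit = {
                .jc = jc,
                .in_syncs = (uintptr_t)&in_sync,
                .in_sync_count = 1,
                .out_sync = queue_syncobj, /* next submit will wait on this */
                .requirements = requirements,
        };

        return drmIoctl(drm_fd, DRM_IOCTL_PANFROST_SUBMIT, &submit);
}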

> Note that I also considered using a sync file, which has the ability to
> merge fences, but that required 3 extra ioctls for each syncobj to merge
> (for the export/merge/import round trip), and AFAICT, fences stay around
> until the sync file is destroyed, which forces some garbage collection
> if we want to keep the number of active fences low. One nice thing
> about the fence-chain/syncobj-timeline logic is that signaled fences
> are collected automatically when walking the chain.

Yeah, that's the pain when working with sync files.  This is one of
the reasons why our driver takes an arbitrary number of in/out
syncobjs.

> > So I'm not sure that you've actually fixed this point - you either need
> > to force an order (in which case the last job can signal the Vulkan
> > fence)
>
> That option requires adding deps that do not really exist on the last
> jobs, so I'm not sure I like it.
>
> > or you still need a dummy job to do the many-to-one dependency.
>
> Yeah, that's what I've considered doing before realizing timeline
> syncobjs could solve this problem (assuming I got it right :-/).
>
> >
> > Or I may have completely misunderstood timeline syncobjs - definitely a
> > possibility :)
>
> I wouldn't pretend I got it right either :-).
>
> >
> > > 3/ The current implementation lacks information about BO access,
> > >     so we serialize all jobs accessing the same set of BOs, even
> > >     if those jobs might just be reading from them (which can
> > >     happen concurrently). Other drivers pass an access type to the
> > >     list of referenced BOs to address that. Another option would be
> > >     to disable implicit deps (deps based on BOs) and force the driver
> > >     to pass all deps explicitly (interestingly, some drivers have
> > >     both the no-implicit-dep and r/w flags, probably to support
> > >     sub-resource access, so we might want to add that one too).
> > >     I don't see any userspace workaround to that problem, so that one
> > >     alone would justify extending the existing ioctl or adding a new
> > >     one.
> >
> > Yeah - I think we need this. My only comment is that I think the
> > read/write terminology may come back to bite. Better to use 'shared' and
> > 'exclusive' - which better matches the dma_resv_xxx APIs anyway.
> >
> > Also the current code completely ignores PANFROST_BO_REF_READ. So either
> > that should be defined as 0, or even better we support 3 modes:
> >
> >   * Exclusive ('write' access)
> >   * Shared ('read' access)
> >   * No fence - ensures the BO is mapped, but doesn't add any implicit
> > fences.
> >
> > The last may make sense when doing explicit fences and e.g. doing
> > front-buffer rendering with a display driver which does implicit fencing.

This one's really annoying.  TBH, we've still not gotten it right on
Intel, AFAICT.  That is roughly the set of modes you need but you'll
have to watch out for window-system buffers.  RADV and ANV take
slightly different approaches here and they each have their own
problems.  On the balance, I'd look at what RADV is doing rather than
ANV because ANV's results in some over-synchronization every time you
vkWaitForFences on the WSI fence.  I've got a patch floating round
somewhere that adds some new kernel API to make that case a bit better
but it's a snarly mess.  Sorry for being cryptic but it's a 5-page
e-mail if I type out all the annoying details. (-:

--Jason

> Sounds good to me.
>
> >
> > > 4/ There's also the fact that submitting one job at a time adds an
> > >     overhead when QueueSubmit is being passed more than one
> > >     CommandBuffer. That one is less problematic, but if we're adding
> > >     a new ioctl we'd better design it to limit the userspace -> kernel
> > >     transition overhead.
> >
> > I've no objection - but I doubt the performance effect is significant. I
> > was pleased to see the handling of stride which makes the interface
> > extendable. In particular I suspect at some point we're going to want a
> > priority field in some form.
> >
> > > Right now I'm just trying to collect feedback. I don't intend to get
> > > those patches merged until we have a userspace user, but I thought
> > > starting the discussion early would be a good thing.
> > >
> > > Feel free to suggest other approaches.
> >
> > Other than the above I didn't see any obvious issues, and I know the
> > Vulkan API is problematic in terms of synchronisation primitives - so if
> > this makes it easier to implement then it seems like a good idea to me.
>
> Thanks a lot for you review.
>
> Regards,
>
> Boris

* Re: [RFC PATCH 0/7] drm/panfrost: Add a new submit ioctl
  2021-03-11 16:58     ` Jason Ekstrand
@ 2021-03-11 17:24       ` Boris Brezillon
  2021-03-11 18:11         ` Jason Ekstrand
  0 siblings, 1 reply; 20+ messages in thread
From: Boris Brezillon @ 2021-03-11 17:24 UTC (permalink / raw)
  To: Jason Ekstrand
  Cc: Rob Herring, Robin Murphy, Alyssa Rosenzweig,
	Maling list - DRI developers, Steven Price

Hi Jason,

On Thu, 11 Mar 2021 10:58:46 -0600
Jason Ekstrand <jason@jlekstrand.net> wrote:

> Hi all,
> 
> Dropping in where I may or may not be wanted to feel free to ignore. : -)

I'm glad you decided to chime in. :-)

 
> > > > 2/ Queued jobs might be executed out-of-order (unless they have
> > > >     explicit/implicit deps between them), and Vulkan asks that the out
> > > >     fence be signaled when all jobs are done. Timeline syncobjs are a
> > > >     good match for that use case. All we need to do is pass the same
> > > >     fence syncobj to all jobs being attached to a single QueueSubmit
> > > >     request, but a different point on the timeline. The syncobj
> > > >     timeline wait does the rest and guarantees that we've reached a
> > > >     given timeline point (IOW, all jobs before that point are done)
> > > >     before declaring the fence as signaled.
> > > >     One alternative would be to have dummy 'synchronization' jobs that
> > > >     don't actually execute anything on the GPU but declare a dependency
> > > >     on all other jobs that are part of the QueueSubmit request, and
> > > >     signal the out fence (the scheduler would do most of the work for
> > > >     us, all we have to do is support NULL job heads and signal the
> > > >     fence directly when that happens instead of queueing the job).  
> > >
> > > I have to admit to being rather hazy on the details of timeline
> > > syncobjs, but I thought there was a requirement that the timeline moves
> > > monotonically. I.e. if you have multiple jobs signalling the same
> > > syncobj just with different points, then AFAIU the API requires that the
> > > points are triggered in order.  
> >
> > I only started looking at the SYNCOBJ_TIMELINE API a few days ago, so I
> > might be wrong, but my understanding is that queuing fences (addition
> > of new points in the timeline) should happen in order, but signaling
> > can happen in any order. When checking for a signaled fence the
> > fence-chain logic starts from the last point (or from an explicit point
> > if you use the timeline wait flavor) and goes backward, stopping at the
> > first un-signaled node. If I'm correct, that means that fences that
> > are part of a chain can be signaled in any order.  
> 
> You don't even need a timeline for this.  Just have a single syncobj
> per-queue and make each submit wait on it and then signal it.
> Alternatively, you can just always hang on to the out-fence from the
> previous submit and make the next one wait on that.

That's what I have right now, but it forces the serialization of all
jobs that are pushed during a submit (and there can be more than one
per command buffer on panfrost :-/). Maybe I'm wrong, but I thought it'd
be better to not force this serialization if we can avoid it.

> Timelines are overkill here, IMO.

Mind developing why you think this is overkill? After looking at the
kernel implementation I thought using timeline syncobjs would be
pretty cheap compared to the other options I considered.

> 
> > Note that I also considered using a sync file, which has the ability to
> > merge fences, but that required 3 extra ioctls for each syncobj to merge
> > (for the export/merge/import round trip), and AFAICT, fences stay around
> > until the sync file is destroyed, which forces some garbage collection
> > if we want to keep the number of active fences low. One nice thing
> > about the fence-chain/syncobj-timeline logic is that signaled fences
> > are collected automatically when walking the chain.  
> 
> Yeah, that's the pain when working with sync files.  This is one of
> the reasons why our driver takes an arbitrary number of in/out
> syncobjs.
> 
> > > So I'm not sure that you've actually fixed this point - you either need
> > > to force an order (in which case the last job can signal the Vulkan
> > > fence)  
> >
> > That options requires adding deps that do not really exist on the last
> > jobs, so I'm not sure I like it.
> >  
> > > or you still need a dummy job to do the many-to-one dependency.  
> >
> > Yeah, that's what I've considered doing before realizing timelined
> > syncojbs could solve this problem (assuming I got it right :-/).
> >  
> > >
> > > Or I may have completely misunderstood timeline syncobjs - definitely a
> > > possibility :)  
> >
> > I wouldn't pretend I got it right either :-).
> >  
> > >  
> > > > 3/ The current implementation lacks information about BO access,
> > > >     so we serialize all jobs accessing the same set of BOs, even
> > > >     if those jobs might just be reading from them (which can
> > > >     happen concurrently). Other drivers pass an access type to the
> > > >     list of referenced BOs to address that. Another option would be
> > > >     to disable implicit deps (deps based on BOs) and force the driver
> > > >     to pass all deps explicitly (interestingly, some drivers have
> > > >     both the no-implicit-dep and r/w flags, probably to support
> > > >     sub-resource access, so we might want to add that one too).
> > > >     I don't see any userspace workaround to that problem, so that one
> > > >     alone would justify extending the existing ioctl or adding a new
> > > >     one.  
> > >
> > > Yeah - I think we need this. My only comment is that I think the
> > > read/write terminology may come back to bite. Better to use 'shared' and
> > > 'exclusive' - which better matches the dma_resv_xxx APIs anyway.
> > >
> > > Also the current code completely ignores PANFROST_BO_REF_READ. So either
> > > that should be defined as 0, or even better we support 3 modes:
> > >
> > >   * Exclusive ('write' access)
> > >   * Shared ('read' access)
> > >   * No fence - ensures the BO is mapped, but doesn't add any implicit
> > > fences.
> > >
> > > The last may make sense when doing explicit fences and e.g. doing
> > > front-buffer rendering with a display driver which does implicit fencing.  
> 
> This one's really annoying.  TBH, we've still not gotten it right on
> Intel, AFAICT.  That is roughly the set of modes you need but you'll
> have to watch out for window-system buffers.  RADV and ANV take
> slightly different approaches here and they each have their own
> problems.  On the balance, I'd look at what RADV is doing rather than
> ANV because ANV's results in some over-synchronization every time you
> vkWaitForFences on the WSI fence.  I've got a patch floating round
> somewhere that adds some new kernel API to make that case a bit better
> but it's a snarly mess.  Sorry for being cryptic but it's a 5-page
> e-mail if I type out all the annoying details. (-:

Ok, I'll have a look at the RADV driver.

Thanks for your feedback.

Boris

* Re: [RFC PATCH 0/7] drm/panfrost: Add a new submit ioctl
  2021-03-11 17:24       ` Boris Brezillon
@ 2021-03-11 18:11         ` Jason Ekstrand
  2021-03-11 22:38           ` Alyssa Rosenzweig
  2021-03-12  7:31           ` Boris Brezillon
  0 siblings, 2 replies; 20+ messages in thread
From: Jason Ekstrand @ 2021-03-11 18:11 UTC (permalink / raw)
  To: Boris Brezillon
  Cc: Rob Herring, Robin Murphy, Alyssa Rosenzweig,
	Maling list - DRI developers, Steven Price

On Thu, Mar 11, 2021 at 11:25 AM Boris Brezillon
<boris.brezillon@collabora.com> wrote:
>
> Hi Jason,
>
> On Thu, 11 Mar 2021 10:58:46 -0600
> Jason Ekstrand <jason@jlekstrand.net> wrote:
>
> > Hi all,
> >
> > Dropping in where I may or may not be wanted to feel free to ignore. : -)
>
> I'm glad you decided to chime in. :-)
>
>
> > > > > 2/ Queued jobs might be executed out-of-order (unless they have
> > > > >     explicit/implicit deps between them), and Vulkan asks that the out
> > > > >     fence be signaled when all jobs are done. Timeline syncobjs are a
> > > > >     good match for that use case. All we need to do is pass the same
> > > > >     fence syncobj to all jobs being attached to a single QueueSubmit
> > > > >     request, but a different point on the timeline. The syncobj
> > > > >     timeline wait does the rest and guarantees that we've reached a
> > > > >     given timeline point (IOW, all jobs before that point are done)
> > > > >     before declaring the fence as signaled.
> > > > >     One alternative would be to have dummy 'synchronization' jobs that
> > > > >     don't actually execute anything on the GPU but declare a dependency
> > > > >     on all other jobs that are part of the QueueSubmit request, and
> > > > >     signal the out fence (the scheduler would do most of the work for
> > > > >     us, all we have to do is support NULL job heads and signal the
> > > > >     fence directly when that happens instead of queueing the job).
> > > >
> > > > I have to admit to being rather hazy on the details of timeline
> > > > syncobjs, but I thought there was a requirement that the timeline moves
> > > > monotonically. I.e. if you have multiple jobs signalling the same
> > > > syncobj just with different points, then AFAIU the API requires that the
> > > > points are triggered in order.
> > >
> > > I only started looking at the SYNCOBJ_TIMELINE API a few days ago, so I
> > > might be wrong, but my understanding is that queuing fences (addition
> > > of new points in the timeline) should happen in order, but signaling
> > > can happen in any order. When checking for a signaled fence the
> > > fence-chain logic starts from the last point (or from an explicit point
> > > if you use the timeline wait flavor) and goes backward, stopping at the
> > > first un-signaled node. If I'm correct, that means that fences that
> > > are part of a chain can be signaled in any order.
> >
> > You don't even need a timeline for this.  Just have a single syncobj
> > per-queue and make each submit wait on it and then signal it.
> > Alternatively, you can just always hang on to the out-fence from the
> > previous submit and make the next one wait on that.
>
> That's what I have right now, but it forces the serialization of all
> jobs that are pushed during a submit (and there can be more than one
> per command buffer on panfrost :-/). Maybe I'm wrong, but I thought it'd
> be better to not force this serialization if we can avoid it.

I'm not familiar with panfrost's needs, I don't work on a tiler, and I
know there are different issues there.  But...

The Vulkan spec requires that all the submits that happen on a given
vkQueue happen in-order.  Search the spec for "Submission order" for
more details.

So, generally speaking, there are some in-order requirements there.
Again, not having a lot of tiler experience, I'm not the one to weigh
in.

> > Timelines are overkill here, IMO.
>
> Mind developing why you think this is overkill? After looking at the
> kernel implementation I thought using timeline syncobjs would be
> pretty cheap compared to the other options I considered.

If you use a regular syncobj, every time you wait on it it inserts a
dependency between the current submit and the last thing to signal it
on the CPU timeline.  The internal dma_fences will hang around
as-needed to ensure those dependencies.  If you use a timeline, you
have to also track a uint64_t to reference the current time point.
This may be worth it if you need to sync a bunch of in-flight stuff in
one go, but if you're just trying to serialize, it's extra tracking for
no benefit.  Again, maybe there's something I'm missing and
you don't actually want to serialize.
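
For comparison, a sketch of the extra bookkeeping a timeline adds on the
userspace side (the struct and helpers are illustrative;
drmSyncobjTimelineWait() is the libdrm wrapper for the timeline wait
ioctl and takes an absolute CLOCK_MONOTONIC timeout):

#include <stdint.h>
#include <xf86drm.h>

struct queue_timeline {
        uint32_t syncobj;    /* timeline syncobj handle */
        uint64_t last_point; /* last point handed to a submit */
};

/* Reserve the point the next submit will signal. */
static uint64_t queue_next_point(struct queue_timeline *q)
{
        return ++q->last_point;
}

/* Wait until everything submitted so far has completed (assumes the
 * fence for last_point was attached at submit time). */
static int queue_wait_idle(int drm_fd, struct queue_timeline *q,
                           int64_t abs_timeout_nsec)
{
        return drmSyncobjTimelineWait(drm_fd, &q->syncobj, &q->last_point, 1,
                                      abs_timeout_nsec,
                                      DRM_SYNCOBJ_WAIT_FLAGS_WAIT_ALL, NULL);
}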

--Jason


> >
> > > Note that I also considered using a sync file, which has the ability to
> > > merge fences, but that required 3 extra ioctls for each syncobj to merge
> > > (for the export/merge/import round trip), and AFAICT, fences stay around
> > > until the sync file is destroyed, which forces some garbage collection
> > > if we want to keep the number of active fences low. One nice thing
> > > about the fence-chain/syncobj-timeline logic is that signaled fences
> > > are collected automatically when walking the chain.
> >
> > Yeah, that's the pain when working with sync files.  This is one of
> > the reasons why our driver takes an arbitrary number of in/out
> > syncobjs.
> >
> > > > So I'm not sure that you've actually fixed this point - you either need
> > > > to force an order (in which case the last job can signal the Vulkan
> > > > fence)
> > >
> > > That options requires adding deps that do not really exist on the last
> > > jobs, so I'm not sure I like it.
> > >
> > > > or you still need a dummy job to do the many-to-one dependency.
> > >
> > > Yeah, that's what I've considered doing before realizing timelined
> > > syncojbs could solve this problem (assuming I got it right :-/).
> > >
> > > >
> > > > Or I may have completely misunderstood timeline syncobjs - definitely a
> > > > possibility :)
> > >
> > > I wouldn't pretend I got it right either :-).
> > >
> > > >
> > > > > 3/ The current implementation lacks information about BO access,
> > > > >     so we serialize all jobs accessing the same set of BOs, even
> > > > >     if those jobs might just be reading from them (which can
> > > > >     happen concurrently). Other drivers pass an access type to the
> > > > >     list of referenced BOs to address that. Another option would be
> > > > >     to disable implicit deps (deps based on BOs) and force the driver
> > > > >     to pass all deps explicitly (interestingly, some drivers have
> > > > >     both the no-implicit-dep and r/w flags, probably to support
> > > > >     sub-resource access, so we might want to add that one too).
> > > > >     I don't see any userspace workaround to that problem, so that one
> > > > >     alone would justify extending the existing ioctl or adding a new
> > > > >     one.
> > > >
> > > > Yeah - I think we need this. My only comment is that I think the
> > > > read/write terminology may come back to bite. Better to use 'shared' and
> > > > 'exclusive' - which better matches the dma_resv_xxx APIs anyway.
> > > >
> > > > Also the current code completely ignores PANFROST_BO_REF_READ. So either
> > > > that should be defined as 0, or even better we support 3 modes:
> > > >
> > > >   * Exclusive ('write' access)
> > > >   * Shared ('read' access)
> > > >   * No fence - ensures the BO is mapped, but doesn't add any implicit
> > > > fences.
> > > >
> > > > The last may make sense when doing explicit fences and e.g. doing
> > > > front-buffer rendering with a display driver which does implicit fencing.
> >
> > This one's really annoying.  TBH, we've still not gotten it right on
> > Intel, AFAICT.  That is roughly the set of modes you need but you'll
> > have to watch out for window-system buffers.  RADV and ANV take
> > slightly different approaches here and they each have their own
> > problems.  On the balance, I'd look at what RADV is doing rather than
> > ANV because ANV's results in some over-synchronization every time you
> > vkWaitForFences on the WSI fence.  I've got a patch floating round
> > somewhere that adds some new kernel API to make that case a bit better
> > but it's a snarly mess.  Sorry for being cryptic but it's a 5-page
> > e-mail if I type out all the annoying details. (-:
>
> Ok, I'll have a look at the RADV driver.
>
> Thanks for your feedback.
>
> Boris

* Re: [RFC PATCH 0/7] drm/panfrost: Add a new submit ioctl
  2021-03-11 18:11         ` Jason Ekstrand
@ 2021-03-11 22:38           ` Alyssa Rosenzweig
  2021-03-12  7:31           ` Boris Brezillon
  1 sibling, 0 replies; 20+ messages in thread
From: Alyssa Rosenzweig @ 2021-03-11 22:38 UTC (permalink / raw)
  To: Jason Ekstrand
  Cc: Maling list - DRI developers, Steven Price, Rob Herring,
	Alyssa Rosenzweig, Boris Brezillon, Robin Murphy

> I'm not familiar with panfrost's needs and I don't work on a tiler and
> I know there are different issues there.  But...

The primary issue is we submit vertex+compute and fragment for each
batch as two disjoint jobs (with a dependency of course), reflecting the
internal hardware structure as parallel job slots. That we actually
require two ioctl() calls and a round trip for this is a design wart
inherited from the early days of the kernel driver. The downstream Mali
driver handles
this by allowing multiple jobs to be submitted with a single ioctl, as
Boris's patch enables. In every other respect I believe our needs are
similar to other renderonly drivers. (What does turnip do?)
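
For reference, that wart looks roughly like this with the current uAPI
(a sketch with made-up handles; the vertex -> fragment dependency is
expressed here through an explicit syncobj, BO lists and error handling
omitted):

#include <stdint.h>
#include <xf86drm.h>
#include "panfrost_drm.h" /* panfrost uAPI header */

/* One batch today: two DRM_IOCTL_PANFROST_SUBMIT calls, the fragment
 * job depending on the vertex/tiler job. */
static int submit_batch(int drm_fd, uint64_t vertex_jc, uint64_t fragment_jc,
                        uint32_t vt_done_syncobj, uint32_t frag_done_syncobj)
{
        struct drm_panfrost_submit vertex = {
                .jc = vertex_jc,
                .out_sync = vt_done_syncobj,
        };
        struct drm_panfrost_submit fragment = {
                .jc = fragment_jc,
                .in_syncs = (uintptr_t)&vt_done_syncobj,
                .in_sync_count = 1,
                .out_sync = frag_done_syncobj,
                .requirements = PANFROST_JD_REQ_FS, /* fragment job slot */
        };
        int ret;

        ret = drmIoctl(drm_fd, DRM_IOCTL_PANFROST_SUBMIT, &vertex);
        if (ret)
                return ret;
        return drmIoctl(drm_fd, DRM_IOCTL_PANFROST_SUBMIT, &fragment);
}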

* Re: [RFC PATCH 0/7] drm/panfrost: Add a new submit ioctl
  2021-03-11 18:11         ` Jason Ekstrand
  2021-03-11 22:38           ` Alyssa Rosenzweig
@ 2021-03-12  7:31           ` Boris Brezillon
  2021-03-12 15:37             ` Jason Ekstrand
  1 sibling, 1 reply; 20+ messages in thread
From: Boris Brezillon @ 2021-03-12  7:31 UTC (permalink / raw)
  To: Jason Ekstrand
  Cc: Rob Herring, Robin Murphy, Alyssa Rosenzweig,
	Maling list - DRI developers, Steven Price

On Thu, 11 Mar 2021 12:11:48 -0600
Jason Ekstrand <jason@jlekstrand.net> wrote:

> > > > > > 2/ Queued jobs might be executed out-of-order (unless they have
> > > > > >     explicit/implicit deps between them), and Vulkan asks that the out
> > > > > >     fence be signaled when all jobs are done. Timeline syncobjs are a
> > > > > >     good match for that use case. All we need to do is pass the same
> > > > > >     fence syncobj to all jobs being attached to a single QueueSubmit
> > > > > >     request, but a different point on the timeline. The syncobj
> > > > > >     timeline wait does the rest and guarantees that we've reached a
> > > > > >     given timeline point (IOW, all jobs before that point are done)
> > > > > >     before declaring the fence as signaled.
> > > > > >     One alternative would be to have dummy 'synchronization' jobs that
> > > > > >     don't actually execute anything on the GPU but declare a dependency
> > > > > >     on all other jobs that are part of the QueueSubmit request, and
> > > > > >     signal the out fence (the scheduler would do most of the work for
> > > > > >     us, all we have to do is support NULL job heads and signal the
> > > > > >     fence directly when that happens instead of queueing the job).  
> > > > >
> > > > > I have to admit to being rather hazy on the details of timeline
> > > > > syncobjs, but I thought there was a requirement that the timeline moves
> > > > > monotonically. I.e. if you have multiple jobs signalling the same
> > > > > syncobj just with different points, then AFAIU the API requires that the
> > > > > points are triggered in order.  
> > > >
> > > > I only started looking at the SYNCOBJ_TIMELINE API a few days ago, so I
> > > > might be wrong, but my understanding is that queuing fences (addition
> > > > of new points in the timeline) should happen in order, but signaling
> > > > can happen in any order. When checking for a signaled fence the
> > > > fence-chain logic starts from the last point (or from an explicit point
> > > > if you use the timeline wait flavor) and goes backward, stopping at the
> > > > first un-signaled node. If I'm correct, that means that fences that
> > > > are part of a chain can be signaled in any order.  
> > >
> > > You don't even need a timeline for this.  Just have a single syncobj
> > > per-queue and make each submit wait on it and then signal it.
> > > Alternatively, you can just always hang on to the out-fence from the
> > > previous submit and make the next one wait on that.  
> >
> > That's what I have right now, but it forces the serialization of all
> > jobs that are pushed during a submit (and there can be more than one
> > per command buffer on panfrost :-/). Maybe I'm wrong, but I thought it'd
> > be better to not force this serialization if we can avoid it.  
> 
> I'm not familiar with panfrost's needs and I don't work on a tiler and
> I know there are different issues there.  But...
> 
> The Vulkan spec requires that everything that all the submits that
> happen on a given vkQueue happen in-order.  Search the spec for
> "Submission order" for more details.

Duh, looks like I completely overlooked the "Submission order"
guarantees. This being said, even after reading this chapter multiple
times I'm not sure what kind of guarantee this gives us, given the
execution itself can be out-of-order. My understanding is that
submission order matters for implicit deps, say you have 2 distinct
VkSubmitInfo, the first one (in submission order) writing to a buffer
and the second one reading from it, you really want the first one to
be submitted first and the second one to wait on the implicit BO fence
created by the first one. If we were to submit out-of-order, this
guarantee wouldn't be met. OTOH, if we have 2 completely independent
submits, I don't really see what submission order gives us if execution
is out-of-order.

In our case, the kernel driver takes care of the submission
serialization (gathering implicit and explicit deps, queuing the job and
assigning the "done" fence to the output sync objects). Once things
are queued, it's the scheduler (drm_sched) that decides the execution
order.

> 
> So, generally speaking, there are some in-order requirements there.
> Again, not having a lot of tiler experience, I'm not the one to weigh
> in.
> 
> > > Timelines are overkill here, IMO.  
> >
> > Mind developing why you think this is overkill? After looking at the
> > kernel implementation I thought using timeline syncobjs would be
> > pretty cheap compared to the other options I considered.  
> 
> If you use a regular syncobj, every time you wait on it it inserts a
> dependency between the current submit and the last thing to signal it
> on the CPU timeline.  The internal dma_fences will hang around
> as-needed to ensure those dependencies.  If you use a timeline, you
> have to also track a uint64_t to reference the current time point.
> This may work if you need to sync a bunch of in-flight stuff at one
> go, that may work but if you're trying to serialize, it's just extra
> tracking for no point.  Again, maybe there's something I'm missing and
> you don't actually want to serialize.

My understanding (and I might very much be wrong here) is that using a
regular syncobj to do this actually enforces not only the submission
order but also the execution order (each job waiting on the previous
one to complete before being scheduled). The idea of the timeline
syncobj approach is that jobs that have no inter dependencies can be
started in any order, the scheduler picking the first whose deps are
ready (which might not necessarily match submission order). The
timeline syncobj allows us to have one point per kernel-submission and
eventually wait on the last point for the fence passed to
vkSubmitQueue(), and some specific point on the timeline for
pSignalSemaphores entries.

What's more challenging is signal operation ordering:

"
Signal operation order is a fundamental ordering in Vulkan, giving
meaning to the order in which semaphore and fence signal operations
occur when submitted to a single queue.

1.  The initial order is determined by the order in which
    vkQueueSubmit commands are executed on the host, for a single
    queue, from first to last.

2.  The order in which VkSubmitInfo structures are specified in the
    pSubmits parameter of vkQueueSubmit, from lowest index to highest.

3.  The fence signal operation defined by the fence parameter of a
    vkQueueSubmit or vkQueueBindSparse command is ordered after all
    semaphore signal operations defined by that command.
"

This means we have to add implicit dependencies on the signaling
itself. We have two options to guarantee that:

1/ Transfer one of the queue syncobj timeline points to each semaphore
   and fence after job submission (the point itself depending on the
   position of the submit entry in the array for semaphores, and the
   last point for the fence), as sketched after this list. The problem
   with this approach is that we now have an extra TRANSFER_SYNCOBJ
   call per semaphore/fence.

2/ Add SYNC jobs (jobs that do not actually execute on the GPU, but
   serve as a synchronization point) whose responsibility would be to
   do this muxing/transfer as part of the batch submission process.
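
A minimal sketch of option 1/, using the libdrm wrapper for the syncobj
transfer ioctl (handles and the per-submit point bookkeeping are
illustrative):

#include <stdint.h>
#include <xf86drm.h>

/* After queuing the jobs of one vkQueueSubmit, copy the relevant queue
 * timeline points into the semaphore and fence syncobjs.  Transferring
 * to dst_point 0 treats the destination as a binary syncobj. */
static int signal_vk_objects(int drm_fd, uint32_t queue_timeline,
                             const uint32_t *sem_syncobjs,
                             const uint64_t *sem_points, unsigned sem_count,
                             uint32_t fence_syncobj, uint64_t last_point)
{
        unsigned i;
        int ret;

        /* One transfer per VkSemaphore. */
        for (i = 0; i < sem_count; i++) {
                ret = drmSyncobjTransfer(drm_fd, sem_syncobjs[i], 0,
                                         queue_timeline, sem_points[i], 0);
                if (ret)
                        return ret;
        }

        /* The VkFence gets the last point of this submission. */
        return drmSyncobjTransfer(drm_fd, fence_syncobj, 0,
                                  queue_timeline, last_point, 0);
}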


* Re: [RFC PATCH 0/7] drm/panfrost: Add a new submit ioctl
  2021-03-11 12:16 ` [RFC PATCH 0/7] drm/panfrost: Add a new submit ioctl Steven Price
  2021-03-11 13:00   ` Boris Brezillon
@ 2021-03-12 14:15   ` Boris Brezillon
  1 sibling, 0 replies; 20+ messages in thread
From: Boris Brezillon @ 2021-03-12 14:15 UTC (permalink / raw)
  To: Steven Price; +Cc: dri-devel, Rob Herring, Robin Murphy, Alyssa Rosenzweig

On Thu, 11 Mar 2021 12:16:33 +0000
Steven Price <steven.price@arm.com> wrote:

> Also the current code completely ignores PANFROST_BO_REF_READ. So either 
> that should be defined as 0, or even better we support 3 modes:
> 
>   * Exclusive ('write' access)
>   * Shared ('read' access)
>   * No fence - ensures the BO is mapped, but doesn't add any implicit 
> fences.

I looked at other drivers and they seem to have this
NO_FENCE/NO_IMPLICIT flag at the job/submit level, conditioned on
the presence of an in_sync_fd file. I can see per-BO NO_FENCE being
useful for sub-buffer accesses, assuming we don't want to deal with
other deps explicitly, but we could also force the NO_IMPLICIT behavior
for the whole job and let the user deal with all deps explicitly. TBH,
it's a bit unclear to me how that feature would be used by the Vulkan
driver, so I'd rather leave that NO_FENCE use case for later.
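
For illustration only, the three modes discussed above could look like
this as per-BO flags (hypothetical uAPI, not what the series currently
defines):

#include <linux/types.h>

/* Hypothetical per-BO reference modes for a new submit ioctl. */
#define PANFROST_BO_REF_EXCLUSIVE (1 << 0) /* 'write': exclusive fence */
#define PANFROST_BO_REF_SHARED    (1 << 1) /* 'read': shared fence */
#define PANFROST_BO_REF_NO_FENCE  (1 << 2) /* map only, no implicit fence */

struct drm_panfrost_bo_ref {
        __u32 handle; /* GEM handle */
        __u32 flags;  /* one of the PANFROST_BO_REF_* modes above */
};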

* Re: [RFC PATCH 0/7] drm/panfrost: Add a new submit ioctl
  2021-03-12  7:31           ` Boris Brezillon
@ 2021-03-12 15:37             ` Jason Ekstrand
  2021-03-12 18:25               ` Boris Brezillon
  0 siblings, 1 reply; 20+ messages in thread
From: Jason Ekstrand @ 2021-03-12 15:37 UTC (permalink / raw)
  To: Boris Brezillon
  Cc: Rob Herring, Robin Murphy, Alyssa Rosenzweig,
	Maling list - DRI developers, Steven Price

On Fri, Mar 12, 2021 at 1:31 AM Boris Brezillon
<boris.brezillon@collabora.com> wrote:
>
> On Thu, 11 Mar 2021 12:11:48 -0600
> Jason Ekstrand <jason@jlekstrand.net> wrote:
>
> > > > > > > 2/ Queued jobs might be executed out-of-order (unless they have
> > > > > > >     explicit/implicit deps between them), and Vulkan asks that the out
> > > > > > >     fence be signaled when all jobs are done. Timeline syncobjs are a
> > > > > > >     good match for that use case. All we need to do is pass the same
> > > > > > >     fence syncobj to all jobs being attached to a single QueueSubmit
> > > > > > >     request, but a different point on the timeline. The syncobj
> > > > > > >     timeline wait does the rest and guarantees that we've reached a
> > > > > > >     given timeline point (IOW, all jobs before that point are done)
> > > > > > >     before declaring the fence as signaled.
> > > > > > >     One alternative would be to have dummy 'synchronization' jobs that
> > > > > > >     don't actually execute anything on the GPU but declare a dependency
> > > > > > >     on all other jobs that are part of the QueueSubmit request, and
> > > > > > >     signal the out fence (the scheduler would do most of the work for
> > > > > > >     us, all we have to do is support NULL job heads and signal the
> > > > > > >     fence directly when that happens instead of queueing the job).
> > > > > >
> > > > > > I have to admit to being rather hazy on the details of timeline
> > > > > > syncobjs, but I thought there was a requirement that the timeline moves
> > > > > > monotonically. I.e. if you have multiple jobs signalling the same
> > > > > > syncobj just with different points, then AFAIU the API requires that the
> > > > > > points are triggered in order.
> > > > >
> > > > > I only started looking at the SYNCOBJ_TIMELINE API a few days ago, so I
> > > > > might be wrong, but my understanding is that queuing fences (addition
> > > > > of new points in the timeline) should happen in order, but signaling
> > > > > can happen in any order. When checking for a signaled fence the
> > > > > fence-chain logic starts from the last point (or from an explicit point
> > > > > if you use the timeline wait flavor) and goes backward, stopping at the
> > > > > first un-signaled node. If I'm correct, that means that fences that
> > > > > are part of a chain can be signaled in any order.
> > > >
> > > > You don't even need a timeline for this.  Just have a single syncobj
> > > > per-queue and make each submit wait on it and then signal it.
> > > > Alternatively, you can just always hang on to the out-fence from the
> > > > previous submit and make the next one wait on that.
> > >
> > > That's what I have right now, but it forces the serialization of all
> > > jobs that are pushed during a submit (and there can be more than one
> > > per command buffer on panfrost :-/). Maybe I'm wrong, but I thought it'd
> > > be better to not force this serialization if we can avoid it.
> >
> > I'm not familiar with panfrost's needs and I don't work on a tiler and
> > I know there are different issues there.  But...
> >
> > The Vulkan spec requires that everything that all the submits that
> > happen on a given vkQueue happen in-order.  Search the spec for
> > "Submission order" for more details.
>
> Duh, looks like I completely occulted the "Submission order"
> guarantees. This being said, even after reading this chapter multiple
> times I'm not sure what kind of guarantee this gives us, given the
> execution itself can be out-of-order. My understanding is that
> submission order matters for implicit deps, say you have 2 distinct
> VkSubmitInfo, the first one (in submission order) writing to a buffer
> and the second one reading from it, you really want the first one to
> be submitted first and the second one to wait on the implicit BO fence
> created by the first one. If we were to submit out-of-order, this
> guarantee wouldn't be met. OTOH, if we have 2 completely independent
> submits, I don't really see what submission order gives us if execution
> is out-of-order.

Right, this is where things get sticky.  What's guaranteed there is
submission order not execution order.  But, sadly, there's more hidden
in there than you might think.  Before we can go there, though, we
need to establish a few details of how Mali hardware works.

My understanding (feel free to correct me if I'm wrong) is that,
roughly, Mali has two engines which have to work together to render:
One engine for compute/vertex/geometry and one for binning and
fragment.  When you go to submit a draw, you fire off any geometry
work on the compute engine and the fragment work on the binning
engine.  At a tile flush boundary (or renderpass in Vulkan), you
insert a dependency between the geometry work you put on the compute
engine and the binning/fragment work.  In a GL driver, you'd probably
kick it off to the kernel at this point.

In Vulkan, things are going to get more tricky.  You can't just kick
off to the kernel at tile flush boundaries.  You also can't assume
that all vertex work within a given command buffer is independent of
all the fragment work.  Let's say you have a sequence like this:

vkBeginRenderPass() /* Writes to ImageA */
vkCmdDraw()
vkCmdDraw()
...
vkEndRenderPass()
vkPipelineBarrier(imageA /* fragment -> compute */)
vkCmdDispatch() /* reads imageA, writes BufferB */
vkBeginRenderPass() /* Writes to ImageC */
vkCmdBindVertexBuffers(bufferB)
vkCmdDraw();
...
vkEndRenderPass()

Now you have, in a single command buffer, a draw which writes to an
image which is read by a compute shader which writes to a buffer which
is read by vertex fetch for another draw.  The only way to implement
this on a Mali like device is to ping-pong back and forth between the
compute and binning engines.  It'd look something like this (assuming
my mental model):

A: Vertex for the first draw on the compute engine
B: Vertex for the second draw on the compute engine
C: Fragment for the first draw on the binning engine; depends on A
D: Fragment for the second draw on the binning engine; depends on B
E: Compute on the compute engine; depends on C and D
F: Vertex for the third draw on the compute engine; depends on E
G: Fragment for the third draw on the binning engine; depends on F

There are a couple of options for how to do this:

 1. Each VkCommandBuffer is actually a bunch of command buffers for
each engine with dependencies.  vkQueueSubmit() calls the kernel
submit ioctl a brezillion times to submit them all.

 2. Same as 1, only the submit ioctl takes an array of things with
dependencies so that you don't have to do as many round-trips to
kernel space.

 3. Each VkCommandBuffer is two command buffers: one for compute and
one for binning, and you use some sort of HW synchronization mechanism
to handle the dependencies as you ping-pong between them.

Option 1 is the easiest to get going with if you're working on a
kernel driver that was designed for OpenGL but it has the obvious
drawback of lots of smaller command buffers and lots of submits.  It's
not going to scale well.  Option 3 is nice if the hardware can do it
(I have no idea if Mali can).

Sorry if that's all a re-hash of stuff you already know.  I'm just
trying to establish a baseline here so we're all talking about the
same things.

Ok, so what does this have to do with submission order in the spec?
The mental model of the Vulkan spec is that you have queues and you
submit commands to those queues.  Command buffers are just big
sequences of commands.  Those commands kick off work.  The kick-off
happens in-order but completion is out-of-order.  There are then
pipeline barriers which allow you to specify dependencies between
those bits of work such as "future compute work depends on previous
fragment work".  Importantly, those dependencies can even apply across
command buffers.

So where does this leave us?  Well, it depends on your submit model
and exactly how you handle pipeline barriers that sync between
engines.  If you're taking option 3 above and doing two command
buffers for each VkCommandBuffer, then you probably want two
serialized timelines, one for each engine, and some mechanism to tell
the kernel driver "these two command buffers have to run in parallel"
so that your ping-pong works.  If you're doing 1 or 2 above, I think
you probably still want two simple syncobjs, one for each engine.  You
don't really have any need to go all that far back in history.  All
you really need to describe is "command buffer X depends on previous
compute work" or "command buffer X depends on previous binning work".
As long as your multi-submit ioctl processes the command buffers
in-order, you can do it all with two syncobjs.
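
To make that concrete, a sketch of the two-syncobj scheme (mapped onto
today's one-job-per-ioctl uAPI for illustration; with a batched ioctl
the two submits below would just become entries of a single array, BO
lists and error handling omitted):

#include <stdint.h>
#include <xf86drm.h>
#include "panfrost_drm.h" /* panfrost uAPI header */

struct engine_sync {
        uint32_t compute_syncobj; /* last vertex/compute work */
        uint32_t binning_syncobj; /* last binning/fragment work */
};

static int submit_one(int drm_fd, uint64_t jc, const uint32_t *in_syncs,
                      uint32_t in_sync_count, uint32_t out_sync,
                      uint32_t requirements)
{
        struct drm_panfrost_submit submit = {
                .jc = jc,
                .in_syncs = (uintptr_t)in_syncs,
                .in_sync_count = in_sync_count,
                .out_sync = out_sync,
                .requirements = requirements,
        };

        return drmIoctl(drm_fd, DRM_IOCTL_PANFROST_SUBMIT, &submit);
}

static int submit_draw(int drm_fd, struct engine_sync *s,
                       uint64_t vertex_jc, uint64_t fragment_jc)
{
        uint32_t frag_deps[2] = { s->compute_syncobj, s->binning_syncobj };
        int ret;

        /* "depends on previous compute work" */
        ret = submit_one(drm_fd, vertex_jc, &s->compute_syncobj, 1,
                         s->compute_syncobj, 0);
        if (ret)
                return ret;

        /* depends on this draw's vertex job and on previous binning work */
        return submit_one(drm_fd, fragment_jc, frag_deps, 2,
                          s->binning_syncobj, PANFROST_JD_REQ_FS);
}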

Sorry for the tome.  I hope it wasn't too wrong and was at least a
little helpful.

--Jason

> In our case, the kernel driver takes care of the submission
> serialization (gathering implicit and explicit deps, queuing the job and
> assigning the "done" fence to the output sync objects). Once things
> are queued, it's the scheduler (drm_sched) deciding of the execution
> order.
>
> >
> > So, generally speaking, there are some in-order requirements there.
> > Again, not having a lot of tiler experience, I'm not the one to weigh
> > in.
> >
> > > > Timelines are overkill here, IMO.
> > >
> > > Mind developing why you think this is overkill? After looking at the
> > > kernel implementation I thought using timeline syncobjs would be
> > > pretty cheap compared to the other options I considered.
> >
> > If you use a regular syncobj, every time you wait on it it inserts a
> > dependency between the current submit and the last thing to signal it
> > on the CPU timeline.  The internal dma_fences will hang around
> > as-needed to ensure those dependencies.  If you use a timeline, you
> > have to also track a uint64_t to reference the current time point.
> > This may work if you need to sync a bunch of in-flight stuff at one
> > go, that may work but if you're trying to serialize, it's just extra
> > tracking for no point.  Again, maybe there's something I'm missing and
> > you don't actually want to serialize.
>
> My understanding (and I might very much be wrong here) is that using a
> regular syncobj to do this actually enforces not only the submission
> order but also the execution order (each job waiting on the previous
> one to complete before being scheduled). The idea of the timeline
> syncobj approach is that jobs that have no inter dependencies can be
> started in any order, the scheduler picking the first whose deps are
> ready (which might not necessarily match submission order). The
> timeline syncobj allows us to have one point per kernel-submission and
> eventually wait on the last point for the fence passed to
> vkSubmitQueue(), and some specific point on the timeline for
> pSignalSemaphores entries.
>
> What's more challenging is signal operation ordering:
>
> "
> Signal operation order is a fundamental ordering in Vulkan, giving
> meaning to the order in which semaphore and fence signal operations
> occur when submitted to a single queue.
>
>
>
> 1.  The initial order is determined by the order in which
>     vkQueueSubmit commands are executed on the host, for a single
>     queue, from first to last.
>
> 2.  The order in which VkSubmitInfo structures are specified in the
>     pSubmits parameter of vkQueueSubmit, from lowest index to highest.
>
> 3.  The fence signal operation defined by the fence parameter of a
>     vkQueueSubmit or vkQueueBindSparse command is ordered after all
>     semaphore signal operations defined by that command.
> "
>
> This means we have to add implicit dependencies on the signaling
> itself. We have two options to guarantee that:
>
> 1/ Transfer one of the queue syncobj timeline point to each semaphore
>    and fence after job submission (the point itself being dependent
>    on the position of the submit entry in the array for semaphores, and
>    the last point for the fence). Problem with this approach is that we
>    now have an extra TRANSFER_SYNCOBJ call per semaphore/fence
>
> 2/ Add SYNC jobs (jobs that do not actually execute on the GPU, but
>    serve as a synchronization point) whose responsibility would be to
>    do this muxing/transfer as part of the batch submission process.
>

* Re: [RFC PATCH 0/7] drm/panfrost: Add a new submit ioctl
  2021-03-12 15:37             ` Jason Ekstrand
@ 2021-03-12 18:25               ` Boris Brezillon
  2021-03-12 20:06                 ` Boris Brezillon
  2021-03-12 21:48                 ` Alyssa Rosenzweig
  0 siblings, 2 replies; 20+ messages in thread
From: Boris Brezillon @ 2021-03-12 18:25 UTC (permalink / raw)
  To: Jason Ekstrand
  Cc: Rob Herring, Robin Murphy, Alyssa Rosenzweig,
	Maling list - DRI developers, Steven Price

On Fri, 12 Mar 2021 09:37:49 -0600
Jason Ekstrand <jason@jlekstrand.net> wrote:

> On Fri, Mar 12, 2021 at 1:31 AM Boris Brezillon
> <boris.brezillon@collabora.com> wrote:
> >
> > On Thu, 11 Mar 2021 12:11:48 -0600
> > Jason Ekstrand <jason@jlekstrand.net> wrote:
> >  
> > > > > > > > 2/ Queued jobs might be executed out-of-order (unless they have
> > > > > > > >     explicit/implicit deps between them), and Vulkan asks that the out
> > > > > > > >     fence be signaled when all jobs are done. Timeline syncobjs are a
> > > > > > > >     good match for that use case. All we need to do is pass the same
> > > > > > > >     fence syncobj to all jobs being attached to a single QueueSubmit
> > > > > > > >     request, but a different point on the timeline. The syncobj
> > > > > > > >     timeline wait does the rest and guarantees that we've reached a
> > > > > > > >     given timeline point (IOW, all jobs before that point are done)
> > > > > > > >     before declaring the fence as signaled.
> > > > > > > >     One alternative would be to have dummy 'synchronization' jobs that
> > > > > > > >     don't actually execute anything on the GPU but declare a dependency
> > > > > > > >     on all other jobs that are part of the QueueSubmit request, and
> > > > > > > >     signal the out fence (the scheduler would do most of the work for
> > > > > > > >     us, all we have to do is support NULL job heads and signal the
> > > > > > > >     fence directly when that happens instead of queueing the job).  
> > > > > > >
> > > > > > > I have to admit to being rather hazy on the details of timeline
> > > > > > > syncobjs, but I thought there was a requirement that the timeline moves
> > > > > > > monotonically. I.e. if you have multiple jobs signalling the same
> > > > > > > syncobj just with different points, then AFAIU the API requires that the
> > > > > > > points are triggered in order.  
> > > > > >
> > > > > > I only started looking at the SYNCOBJ_TIMELINE API a few days ago, so I
> > > > > > might be wrong, but my understanding is that queuing fences (addition
> > > > > > of new points in the timeline) should happen in order, but signaling
> > > > > > can happen in any order. When checking for a signaled fence the
> > > > > > fence-chain logic starts from the last point (or from an explicit point
> > > > > > if you use the timeline wait flavor) and goes backward, stopping at the
> > > > > > first un-signaled node. If I'm correct, that means that fences that
> > > > > > are part of a chain can be signaled in any order.  
> > > > >
> > > > > You don't even need a timeline for this.  Just have a single syncobj
> > > > > per-queue and make each submit wait on it and then signal it.
> > > > > Alternatively, you can just always hang on to the out-fence from the
> > > > > previous submit and make the next one wait on that.  
> > > >
> > > > That's what I have right now, but it forces the serialization of all
> > > > jobs that are pushed during a submit (and there can be more than one
> > > > per command buffer on panfrost :-/). Maybe I'm wrong, but I thought it'd
> > > > be better to not force this serialization if we can avoid it.  
> > >
> > > I'm not familiar with panfrost's needs and I don't work on a tiler and
> > > I know there are different issues there.  But...
> > >
> > > The Vulkan spec requires that everything that all the submits that
> > > happen on a given vkQueue happen in-order.  Search the spec for
> > > "Submission order" for more details.  
> >
> > Duh, looks like I completely occulted the "Submission order"
> > guarantees. This being said, even after reading this chapter multiple
> > times I'm not sure what kind of guarantee this gives us, given the
> > execution itself can be out-of-order. My understanding is that
> > submission order matters for implicit deps, say you have 2 distinct
> > VkSubmitInfo, the first one (in submission order) writing to a buffer
> > and the second one reading from it, you really want the first one to
> > be submitted first and the second one to wait on the implicit BO fence
> > created by the first one. If we were to submit out-of-order, this
> > guarantee wouldn't be met. OTOH, if we have 2 completely independent
> > submits, I don't really see what submission order gives us if execution
> > is out-of-order.  
> 
> Right, this is where things get sticky.  What's guaranteed there is
> submission order not execution order.  But, sadly, there's more hidden
> in there than you might think.  Before we can go there, though, we
> need to establish a few details of Mali hardware works.
> 
> My understanding (feel free to correct me if I'm wrong) is that,
> roughly, Mali has two engines which have to work together to render:
> One engine for compute/vertex/geometry and one for binning and
> fragment.  When you go to submit a draw, you fire off any geometry
> work on the compute engine and the fragment work on the binning
> engine.  At a tile flush boundary (or renderpass in Vulkan), you
> insert a dependency between the geometry work you put on the compute
> engine and the binning/fragment work.  In a GL driver, you'd probably
> kick it off to the kernel at this point.
> 
> In Vulkan, things are going to get more tricky.  You can't just kick
> off to the kernel at tile flush boundaries.  You also can't assume
> that all vertex work within a given command buffer is independent of
> all the fragment work.  Let's say you have a sequence like this:
> 
> vkBeginRenderPass() /* Writes to ImageA */
> vkCmdDraw()
> vkCmdDraw()
> ...
> vkEndRenderPass()
> vkPipelineBarrier(imageA /* fragment -> compute */)
> vkCmdDispatch() /* reads imageA, writes BufferB */
> vkBeginRenderPass() /* Writes to ImageC */
> vkCmdBindVertexBuffers(bufferB)
> vkCmdDraw();
> ...
> vkEndRenderPass()
> 
> Now you have, in a single command buffer, a draw which writes to an
> image which is read by a compute shader which writes to a buffer which
> is read by vertex fetch for another draw.  The only way to implement
> this on a Mali like device is to ping-pong back and forth between the
> compute and binning engines.  It'd look something like this (assuming
> my mental model):
> 
> A: Vertex for the first draw on the compute engine
> B: Vertex for the first draw on the compute engine
> C: Fragment for the first draw on the binning engine; depends on A
> D: Fragment for the second draw on the binning engine; depends on B
> E: Compute on the compute engine; depends on C and D
> F: Vertex for the third draw on the compute engine; depends on E
> G: Fragment for the third draw on the binning engine; depends on F
> 
> There are a couple of options for how to do this:
> 
>  1. Each VkCommandBuffer is actually a bunch of command buffers for
> each engine with dependencies.  vkQueueSubmit() calls the kernel
> submit ioctl a brezillion times to submit them all.

This is what we currently have.

> 
>  2. Same as 1, only the submit ioctl takes an array of things with
> dependencies so that you don't have to do as many round-trips to
> kernels space.


This is the PoC I worked on.

> 
>  3. Each VkCommandBuffer is two command buffers: one for compute and
> one for binning, and you use some sort of HW synchronization mechanism
> to handle the dependencies as you ping-pong between them.

I didn't consider that option. We have a DOORBELL instruction on Bifrost
to wake up the CPU when the GPU wants to report something (not even
sure we have something equivalent on Midgard), but there's no
inter-queue sync mechanism AFAICT.

> 
> Option 1 is the easiest to get going with if you're working on a
> kernel driver that was designed for OpenGL but it has the obvious
> drawback of lots of smaller command buffers and lots of submits.  It's
> not going to scale well.  Option 3 is nice if the hardware can do it
> (I have no idea if Mali can).
> 
> Sorry if that's all a re-hash of stuff you already know.  I'm just
> trying to establish a baseline here so we're all talking about the
> same things.
> 
> Ok, so what does this have to do with submission order in the spec?
> The mental model of the Vulkan spec is that you have queues and you
> submit commands to those queues.  Command buffers are just big
> sequences of commands.  Those commands kick off work.  The kick-off
> happens in-order but completion is out-of-order.  There are then
> pipeline barriers which allow you to specify dependencies between
> those bits of work such as "future compute work depends on previous
> fragment work".  Importantly, those dependencies can even apply across
> command buffers.
> 
> So where does this leave us?  Well, it depends on your submit model
> and exactly how you handle pipeline barriers that sync between
> engines.  If you're taking option 3 above and doing two command
> buffers for each VkCommandBuffer, then you probably want two
> serialized timelines, one for each engine, and some mechanism to tell
> the kernel driver "these two command buffers have to run in parallel"
> so that your ping-pong works.  If you're doing 1 or 2 above, I think
> you probably still want two simple syncobjs, one for each engine.  You
> don't really have any need to go all that far back in history.  All
> you really need to describe is "command buffer X depends on previous
> compute work" or "command buffer X depends on previous binning work".

Okay, so this will effectively force in-order execution. Let's take your
previous example and add 2 more jobs at the end that have no deps on
previous commands:

vkBeginRenderPass() /* Writes to ImageA */
vkCmdDraw()
vkCmdDraw()
...
vkEndRenderPass()
vkPipelineBarrier(imageA /* fragment -> compute */)
vkCmdDispatch() /* reads imageA, writes BufferB */
vkBeginRenderPass() /* Writes to ImageC */
vkCmdBindVertexBuffers(bufferB)
vkCmdDraw();
...
vkEndRenderPass()
vkBeginRenderPass() /* Writes to ImageD */
vkCmdDraw()
...
vkEndRenderPass()

A: Vertex for the first draw on the compute engine
B: Vertex for the second draw on the compute engine
C: Fragment for the first draw on the binning engine; depends on A
D: Fragment for the second draw on the binning engine; depends on B
E: Compute on the compute engine; depends on C and D
F: Vertex for the third draw on the compute engine; depends on E
G: Fragment for the third draw on the binning engine; depends on F
H: Vertex for the fourth draw on the compute engine
I: Fragment for the fourth draw on the binning engine

When we reach E, we might be waiting for D to finish before scheduling
the job, and because of the implicit serialization we have on the
compute queue (F implicitly depends on E, and H on F) we can't schedule
H either, which could, in theory, be started. I guess that's where the
term submission order is a bit unclear to me. The action of starting a
job sounds like execution order to me (the order you start jobs
determines the execution order, since we only have one HW queue per job
type). All implicit deps have been calculated when we queued the job to
the SW queue, and I thought that would be enough to meet the submission
order requirements, but I might be wrong.
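
Just to make that stall concrete, here's a quick toy model (plain C,
definitely not the drm scheduler, just my mental model of one in-order
HW queue per job type) that replays the A-I graph above. H only gets
picked at slot 5 even though it has no deps:

#include <stdio.h>
#include <string.h>

#define NJOBS 9

struct job {
	char name;
	const char *engine;	/* "compute" or "binning" */
	const char *deps;	/* dependencies, by job name */
	int done;
};

static struct job jobs[NJOBS] = {
	{ 'A', "compute", ""   },
	{ 'B', "compute", ""   },
	{ 'C', "binning", "A"  },
	{ 'D', "binning", "B"  },
	{ 'E', "compute", "CD" },
	{ 'F', "compute", "E"  },
	{ 'G', "binning", "F"  },
	{ 'H', "compute", ""   },	/* no deps, could run right away */
	{ 'I', "binning", "H"  },
};

static int deps_met(const struct job *j, const int *done)
{
	for (const char *d = j->deps; *d; d++)
		if (!done[*d - 'A'])
			return 0;
	return 1;
}

int main(void)
{
	static const char *engines[] = { "compute", "binning" };
	int remaining = NJOBS, slot = 0;

	while (remaining) {
		int done[NJOBS];

		/* Snapshot so both engines see the same completion state. */
		for (int i = 0; i < NJOBS; i++)
			done[i] = jobs[i].done;

		for (int e = 0; e < 2; e++) {
			for (int i = 0; i < NJOBS; i++) {
				struct job *j = &jobs[i];

				if (j->done || strcmp(j->engine, engines[e]))
					continue;
				/* In-order: only the head of the queue may run. */
				if (deps_met(j, done)) {
					printf("slot %d: %c runs on %s\n",
					       slot, j->name, engines[e]);
					j->done = 1;
					remaining--;
				} else {
					printf("slot %d: %s idle, head %c is waiting\n",
					       slot, engines[e], j->name);
				}
				break;
			}
		}
		slot++;
	}
	return 0;
}

Running it prints the per-slot schedule, with the compute engine
sitting idle at slot 2 (E waiting on D) while H could have run.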

The PoC I have was trying to get rid of this explicit serialization on
the compute and fragment queues by having one syncobj timeline
(queue(<syncpoint>)) and synchronization points (Sx).

S0: in-fences=<waitSemaphores[]>, out-fences=<explicit_deps> #waitSemaphore sync point
A: in-fences=<explicit_deps>, out-fences=<queue(1)>
B: in-fences=<explicit_deps>, out-fences=<queue(2)>
C: in-fences=<explicit_deps>, out-fence=<queue(3)> #implicit dep on A through the tiler context
D: in-fences=<explicit_deps>, out-fence=<queue(4)> #implicit dep on B through the tiler context
E: in-fences=<explicit_deps>, out-fence=<queue(5)> #implicit dep on D through imageA
F: in-fences=<explicit_deps>, out-fence=<queue(6)> #implicit dep on E through buffer B
G: in-fences=<explicit_deps>, out-fence=<queue(7)> #implicit dep on F through the tiler context
H: in-fences=<explicit_deps>, out-fence=<queue(8)>
I: in-fences=<explicit_deps>, out-fence=<queue(9)> #implicit dep on H through the tiler buffer
S1: in-fences=<queue(9)>, out-fences=<signalSemaphores[],fence> #signalSemaphore,fence sync point
# QueueWaitIdle is implemented with a wait(queue(0)), AKA wait on the last point

With this solution H can be started before E if the compute slot
is empty and E's implicit deps are not done. It's probably overkill,
but I thought maximizing GPU utilization was important.
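
From the userspace side, the whole S0..S1 sequence above maps to a
single ioctl call. Very roughly (the struct layout and the
batch_submit() wrapper below are made up for illustration, the real
UAPI is in patch 5 of this series), it looks like:

#include <stdint.h>

struct job_desc {
	uint64_t in_syncs;		/* explicit deps (syncobj handles) */
	uint32_t in_sync_count;
	uint32_t out_sync;		/* the per-VkQueue timeline syncobj */
	uint64_t out_sync_point;	/* queue(n) above */
	/* job head, BO list + access flags, etc. omitted */
};

/* Hypothetical wrapper around the batch submit ioctl. */
void batch_submit(int fd, struct job_desc *jobs, uint32_t njobs);

static uint64_t queue_submit(int fd, uint32_t queue_syncobj,
			     struct job_desc *jobs, uint32_t njobs,
			     uint64_t next_point)
{
	/*
	 * A..I each signal the next point on the queue timeline; the
	 * kernel only serializes jobs that have a (BO or explicit)
	 * dependency, so H can jump ahead of E.
	 */
	for (uint32_t i = 0; i < njobs; i++) {
		jobs[i].out_sync = queue_syncobj;
		jobs[i].out_sync_point = next_point++;
	}

	/* One kernel transition per vkQueueSubmit(). */
	batch_submit(fd, jobs, njobs);

	/*
	 * S1 and vkQueueWaitIdle() boil down to a
	 * DRM_IOCTL_SYNCOBJ_TIMELINE_WAIT on the last point
	 * (next_point - 1).
	 */
	return next_point;
}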

> As long as your multi-submit ioctl processes the command buffers
> in-order, you can do it all with two syncobjs.
> 
> Sorry for the tome.  I hope it wasn't too wrong and was at least a
> little helpful.

It definitely helps, even if the concept of submission order is still
unclear to me, especially when applied to GPUs that have a single HW
queue, and where submission and execution order could be considered the
same thing.

* Re: [RFC PATCH 0/7] drm/panfrost: Add a new submit ioctl
  2021-03-12 18:25               ` Boris Brezillon
@ 2021-03-12 20:06                 ` Boris Brezillon
  2021-03-12 21:48                 ` Alyssa Rosenzweig
  1 sibling, 0 replies; 20+ messages in thread
From: Boris Brezillon @ 2021-03-12 20:06 UTC (permalink / raw)
  To: Jason Ekstrand
  Cc: Rob Herring, Robin Murphy, Alyssa Rosenzweig,
	Maling list - DRI developers, Steven Price

On Fri, 12 Mar 2021 19:25:13 +0100
Boris Brezillon <boris.brezillon@collabora.com> wrote:

> > So where does this leave us?  Well, it depends on your submit model
> > and exactly how you handle pipeline barriers that sync between
> > engines.  If you're taking option 3 above and doing two command
> > buffers for each VkCommandBuffer, then you probably want two
> > serialized timelines, one for each engine, and some mechanism to tell
> > the kernel driver "these two command buffers have to run in parallel"
> > so that your ping-pong works.  If you're doing 1 or 2 above, I think
> > you probably still want two simple syncobjs, one for each engine.  You
> > don't really have any need to go all that far back in history.  All
> > you really need to describe is "command buffer X depends on previous
> > compute work" or "command buffer X depends on previous binning work".  
> 
> Okay, so this will effectively force in-order execution. Let's take your
> previous example and add 2 more jobs at the end that have no deps on
> previous commands:
> 
> vkBeginRenderPass() /* Writes to ImageA */
> vkCmdDraw()
> vkCmdDraw()
> ...
> vkEndRenderPass()
> vkPipelineBarrier(imageA /* fragment -> compute */)
> vkCmdDispatch() /* reads imageA, writes BufferB */
> vkBeginRenderPass() /* Writes to ImageC */
> vkCmdBindVertexBuffers(bufferB)
> vkCmdDraw();
> ...
> vkEndRenderPass()
> vkBeginRenderPass() /* Writes to ImageD */
> vkCmdDraw()
> ...
> vkEndRenderPass()
> 
> A: Vertex for the first draw on the compute engine
> B: Vertex for the second draw on the compute engine
> C: Fragment for the first draw on the binning engine; depends on A
> D: Fragment for the second draw on the binning engine; depends on B
> E: Compute on the compute engine; depends on C and D
> F: Vertex for the third draw on the compute engine; depends on E
> G: Fragment for the third draw on the binning engine; depends on F
> H: Vertex for the fourth draw on the compute engine
> I: Fragment for the fourth draw on the binning engine
> 
> When we reach E, we might be waiting for D to finish before scheduling
> the job, and because of the implicit serialization we have on the
> compute queue (F implicitly depends on E, and H on F) we can't schedule
> H either, which could, in theory, be started. I guess that's where the
> term submission order is a bit unclear to me. The action of starting a
> job sounds like execution order to me (the order you start jobs
> determines the execution order, since we only have one HW queue per job
> type). All implicit deps have been calculated when we queued the job to
> the SW queue, and I thought that would be enough to meet the submission
> order requirements, but I might be wrong.
> 
> The PoC I have was trying to get rid of this explicit serialization on
> the compute and fragment queues by having one syncobj timeline
> (queue(<syncpoint>)) and synchronization points (Sx).
> 
> S0: in-fences=<waitSemaphores[]>, out-fences=<explicit_deps> #waitSemaphore sync point
> A: in-fences=<explicit_deps>, out-fences=<queue(1)>
> B: in-fences=<explicit_deps>, out-fences=<queue(2)>
> C: in-fences=<explicit_deps>, out-fence=<queue(3)> #implicit dep on A through the tiler context
> D: in-fences=<explicit_deps>, out-fence=<queue(4)> #implicit dep on B through the tiler context
> E: in-fences=<explicit_deps>, out-fence=<queue(5)> #implicit dep on D through imageA
> F: in-fences=<explicit_deps>, out-fence=<queue(6)> #implicit dep on E through buffer B
> G: in-fences=<explicit_deps>, out-fence=<queue(7)> #implicit dep on F through the tiler context
> H: in-fences=<explicit_deps>, out-fence=<queue(8)>
> I: in-fences=<explicit_deps>, out-fence=<queue(9)> #implicit dep on H through the tiler buffer
> S1: in-fences=<queue(9)>, out-fences=<signalSemaphores[],fence> #signalSemaphore,fence sync point
> # QueueWaitIdle is implemented with a wait(queue(0)), AKA wait on the last point
> 
> With this solution H can be started before E if the compute slot
> is empty and E's implicit deps are not done. It's probably overkill,
> but I thought maximizing GPU utilization was important.

Never mind, I forgot the drm scheduler dequeues jobs in order, so 2
syncobjs (one per queue type) is indeed the right approach.
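
IOW, something like this on the userspace side (submit_job() is a
made-up wrapper, the syncobjs are regular binary syncobjs); the
scheduler's in-order dequeuing takes care of the per-engine
serialization, and the two syncobjs only carry the cross-engine deps
coming from pipeline barriers:

#include <stdint.h>
#include <xf86drm.h>

/* Hypothetical wrapper around the submit ioctl. */
void submit_job(int fd, void *job, const uint32_t *in_syncs,
		uint32_t in_sync_count, uint32_t out_sync);

struct engine_syncs {
	uint32_t compute_done;	/* fence of the last vertex/compute job */
	uint32_t binning_done;	/* fence of the last fragment job */
};

static void init_engine_syncs(int fd, struct engine_syncs *s)
{
	/* Start signaled so the first barrier has nothing to wait on. */
	drmSyncobjCreate(fd, DRM_SYNCOBJ_CREATE_SIGNALED, &s->compute_done);
	drmSyncobjCreate(fd, DRM_SYNCOBJ_CREATE_SIGNALED, &s->binning_done);
}

/*
 * A compute job right after a fragment->compute pipeline barrier: wait
 * on binning_done (on top of whatever explicit deps it has), signal
 * compute_done so a later barrier in the other direction can wait on
 * "previous compute work".
 */
static void submit_compute_after_barrier(int fd, struct engine_syncs *s,
					 void *job)
{
	uint32_t in = s->binning_done;

	submit_job(fd, job, &in, 1, s->compute_done);
}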

* Re: [RFC PATCH 0/7] drm/panfrost: Add a new submit ioctl
  2021-03-12 18:25               ` Boris Brezillon
  2021-03-12 20:06                 ` Boris Brezillon
@ 2021-03-12 21:48                 ` Alyssa Rosenzweig
  1 sibling, 0 replies; 20+ messages in thread
From: Alyssa Rosenzweig @ 2021-03-12 21:48 UTC (permalink / raw)
  To: Boris Brezillon
  Cc: Maling list - DRI developers, Steven Price, Rob Herring,
	Alyssa Rosenzweig, Jason Ekstrand, Robin Murphy

> >  3. Each VkCommandBuffer is two command buffers: one for compute and
> > one for binning, and you use some sort of HW synchronization mechanism
> > to handle the dependencies as you ping-pong between them.
> 
> I didn't consider that option. We have a DOORBELL instruction on Bifrost
> to wake up the CPU when the GPU wants to report something (not even
> sure we have something equivalent on Midgard), but there's no
> inter-queue sync mechanism AFAICT.

I was about to say that we don't have hardware support. Even considering
DOORBELL (which AFAIK isn't piped through to anything on kbase at
least), I'm inclined to say we don't have hardware support. Option 2
(what kbase does? dunno how DDK vulkan works though) is probably our
best bet, tbh.

end of thread, other threads:[~2021-03-12 21:48 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-11  9:25 [RFC PATCH 0/7] drm/panfrost: Add a new submit ioctl Boris Brezillon
2021-03-11  9:25 ` [RFC PATCH 1/7] drm/panfrost: Pass a job to panfrost_{acquire, attach_object_fences}() Boris Brezillon
2021-03-11  9:25 ` [RFC PATCH 2/7] drm/panfrost: Collect implicit and explicit deps in an XArray Boris Brezillon
2021-03-11  9:25 ` [RFC PATCH 3/7] drm/panfrost: Move the mappings collection out of panfrost_lookup_bos() Boris Brezillon
2021-03-11  9:25 ` [RFC PATCH 4/7] drm/panfrost: Add BO access flags to relax dependencies between jobs Boris Brezillon
2021-03-11  9:25 ` [RFC PATCH 5/7] drm/panfrost: Add a new ioctl to submit batches Boris Brezillon
2021-03-11  9:25 ` [RFC PATCH 6/7] drm/panfrost: Advertise the SYNCOBJ_TIMELINE feature Boris Brezillon
2021-03-11  9:25 ` [RFC PATCH 7/7] drm/panfrost: Bump minor version to reflect the feature additions Boris Brezillon
2021-03-11 12:16 ` [RFC PATCH 0/7] drm/panfrost: Add a new submit ioctl Steven Price
2021-03-11 13:00   ` Boris Brezillon
2021-03-11 16:58     ` Jason Ekstrand
2021-03-11 17:24       ` Boris Brezillon
2021-03-11 18:11         ` Jason Ekstrand
2021-03-11 22:38           ` Alyssa Rosenzweig
2021-03-12  7:31           ` Boris Brezillon
2021-03-12 15:37             ` Jason Ekstrand
2021-03-12 18:25               ` Boris Brezillon
2021-03-12 20:06                 ` Boris Brezillon
2021-03-12 21:48                 ` Alyssa Rosenzweig
2021-03-12 14:15   ` Boris Brezillon
