intel-xe.lists.freedesktop.org archive mirror
* [Intel-xe] [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans
@ 2023-04-04  0:22 Matthew Brost
  2023-04-04  0:22 ` [Intel-xe] [RFC PATCH 01/10] drm/sched: Convert drm scheduler to use a work queue rather than kthread Matthew Brost
                   ` (16 more replies)
  0 siblings, 17 replies; 87+ messages in thread
From: Matthew Brost @ 2023-04-04  0:22 UTC (permalink / raw)
  To: dri-devel, intel-xe
  Cc: robdclark, airlied, lina, boris.brezillon, daniel,
	christian.koenig, faith.ekstrand

Hello,

As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
have been asked to merge our common DRM scheduler patches first as well
as develop a common solution for long running workloads with the DRM
scheduler. This RFC series is our first attempt at doing this. We
welcome any and all feedback.

This can be thought of as 4 parts, detailed below.

- DRM scheduler changes for 1 to 1 relationship between scheduler and
entity (patches 1-3)

In Xe all of the scheduling of jobs is done by a firmware scheduler (the
GuC), which is a new paradigm WRT the DRM scheduler and presents several
problems, as the DRM scheduler was originally designed to schedule jobs
on hardware queues. The main problem is that the DRM scheduler expects
the submission order of jobs to be the completion order of jobs, even
across multiple entities. This assumption falls apart with a firmware
scheduler, as a firmware scheduler has no concept of jobs and jobs can
complete out of order. A novel solution was originally proposed by Faith
during the initial prototyping of Xe: create a 1 to 1 relationship
between scheduler and entity. I believe the AGX driver [3] is using this
approach and Boris may use this approach as well for the Mali driver [4].

To support a 1 to 1 relationship we move the main execution function
from a kthread to a work queue and add a new scheduling mode which
bypasses code in the DRM scheduler that isn't needed in a 1 to 1
relationship. The new scheduling mode should unify all driver usage of a
1 to 1 relationship and can be thought of as using the scheduler as a
dependency / in-flight job tracker rather than a true scheduler.
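
For illustration, a minimal sketch of what this looks like on the driver
side (the structure and names below are made up for this example, they
are not Xe code):

  /*
   * Hypothetical per-user-queue object embedding its own scheduler and
   * entity (1 to 1). The DRM scheduler here only tracks dependencies and
   * in-flight jobs; the firmware (GuC) does the actual scheduling.
   */
  struct example_exec_queue {
	struct drm_gpu_scheduler sched;		/* one scheduler... */
	struct drm_sched_entity entity;		/* ...per entity */
	/* ring buffer, firmware context state, etc. */
  };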

- Generic messaging interface for DRM scheduler

The idea is to be able to communicate with the submission backend via
in-band (relative to the main execution function) messages. Messages are
backend defined and flexible enough for any use case. In Xe we use these
messages to clean up entities, set properties for entities, and suspend /
resume execution of an entity [5]. I suspect other drivers can leverage
this messaging concept too, as it is a convenient way to avoid races in
the backend.

- Support for using TDR for all error paths of a scheduler / entity

Fix a few races / bugs and add a function to dynamically set the TDR
timeout.

- Annotate dma-fences for long running workloads.

The idea here is to use dma-fences only as sync points within the
scheduler and never export them for long running workloads. By
annotating these fences as long running we ensure that these dma-fences
are never used in a way that breaks the dma-fence rules. A benefit of
this approach is that the scheduler can still safely flow control the
execution ring buffer via the job limit without breaking the dma-fence
rules.
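
As a hedged sketch of the flow-control point (the constants below are
illustrative, not values from this series):

  /* Sizing the scheduler's job limit against the ring gives flow control
   * for free; names and sizes here are examples only. */
  #define EXAMPLE_RING_SIZE		(16 * 1024)
  #define EXAMPLE_MAX_SIZE_PER_JOB	1024

  /* Passed as the hw_submission limit to drm_sched_init(); at most 16
   * jobs can then be in flight, so the ring can never be overrun. */
  u32 hw_submission = EXAMPLE_RING_SIZE / EXAMPLE_MAX_SIZE_PER_JOB;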

Again, this is a first draft and we are looking forward to feedback.

Enjoy - Matt

[1] https://gitlab.freedesktop.org/drm/xe/kernel
[2] https://patchwork.freedesktop.org/series/112188/ 
[3] https://patchwork.freedesktop.org/series/114772/
[4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1
[5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031

Matthew Brost (8):
  drm/sched: Convert drm scheduler to use a work queue rather than
    kthread
  drm/sched: Move schedule policy to scheduler / entity
  drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
  drm/sched: Add generic scheduler message interface
  drm/sched: Start run wq before TDR in drm_sched_start
  drm/sched: Submit job before starting TDR
  drm/sched: Add helper to set TDR timeout
  drm/syncobj: Warn on long running dma-fences

Thomas Hellström (2):
  dma-buf/dma-fence: Introduce long-running completion fences
  drm/sched: Support long-running sched entities

 drivers/dma-buf/dma-fence.c                 | 142 +++++++---
 drivers/dma-buf/dma-resv.c                  |   5 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
 drivers/gpu/drm/drm_syncobj.c               |   5 +-
 drivers/gpu/drm/etnaviv/etnaviv_sched.c     |   5 +-
 drivers/gpu/drm/lima/lima_sched.c           |   5 +-
 drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
 drivers/gpu/drm/msm/msm_ringbuffer.c        |   5 +-
 drivers/gpu/drm/panfrost/panfrost_job.c     |   5 +-
 drivers/gpu/drm/scheduler/sched_entity.c    | 127 +++++++--
 drivers/gpu/drm/scheduler/sched_fence.c     |   6 +-
 drivers/gpu/drm/scheduler/sched_main.c      | 278 +++++++++++++++-----
 drivers/gpu/drm/v3d/v3d_sched.c             |  25 +-
 include/drm/gpu_scheduler.h                 | 130 +++++++--
 include/linux/dma-fence.h                   |  60 ++++-
 16 files changed, 649 insertions(+), 184 deletions(-)

-- 
2.34.1



* [Intel-xe] [RFC PATCH 01/10] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-04-04  0:22 [Intel-xe] [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans Matthew Brost
@ 2023-04-04  0:22 ` Matthew Brost
  2023-06-09  6:58   ` Boris Brezillon
  2023-04-04  0:22 ` [Intel-xe] [RFC PATCH 02/10] drm/sched: Move schedule policy to scheduler / entity Matthew Brost
                   ` (15 subsequent siblings)
  16 siblings, 1 reply; 87+ messages in thread
From: Matthew Brost @ 2023-04-04  0:22 UTC (permalink / raw)
  To: dri-devel, intel-xe
  Cc: robdclark, airlied, lina, boris.brezillon, daniel,
	christian.koenig, faith.ekstrand

In XE, the new Intel GPU driver, a choice has been made to have a 1 to 1
mapping between a drm_gpu_scheduler and drm_sched_entity. At first this
seems a bit odd, but let us explain the reasoning below.

1. In XE the submission order from multiple drm_sched_entity is not
guaranteed to be the same as the completion order, even if targeting the
same hardware engine. This is because in XE we have a firmware scheduler,
the GuC, which is allowed to reorder, timeslice, and preempt submissions.
If a shared drm_gpu_scheduler is used across multiple drm_sched_entity,
the TDR falls apart as the TDR expects submission order == completion
order. Using a dedicated drm_gpu_scheduler per drm_sched_entity solves
this problem.

2. In XE submissions are done via programming a ring buffer (circular
buffer). A drm_gpu_scheduler provides a limit on the number of in-flight
jobs; if that limit is set to RING_SIZE / MAX_SIZE_PER_JOB, we get flow
control on the ring for free.

A problem with this design is that a drm_gpu_scheduler currently uses a
kthread for submission / job cleanup. This doesn't scale if a large
number of drm_gpu_scheduler are used. To work around the scaling issue,
use a worker rather than a kthread for submission / job cleanup.
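
For illustration, a hedged sketch of how a driver with many schedulers
might use the new run_wq argument (the workqueue and driver names below
are assumptions, not part of this patch):

  /* One shared submission workqueue serving many per-queue schedulers,
   * instead of one kthread per scheduler. */
  struct workqueue_struct *submit_wq;
  int err;

  submit_wq = alloc_workqueue("example-sched-submit", 0, 0);
  if (!submit_wq)
	return -ENOMEM;

  /* Called once per queue; every scheduler shares submit_wq. Passing
   * NULL here falls back to system_wq. */
  err = drm_sched_init(&q->sched, &example_sched_ops, submit_wq,
		       hw_submission, 0 /* hang_limit */,
		       msecs_to_jiffies(5000), NULL /* timeout_wq */,
		       NULL /* score */, q->name, dev);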

v2:
  - (Rob Clark) Fix msm build
  - Pass in run work queue

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +--
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  14 +--
 drivers/gpu/drm/etnaviv/etnaviv_sched.c     |   2 +-
 drivers/gpu/drm/lima/lima_sched.c           |   2 +-
 drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
 drivers/gpu/drm/msm/msm_ringbuffer.c        |   2 +-
 drivers/gpu/drm/panfrost/panfrost_job.c     |   2 +-
 drivers/gpu/drm/scheduler/sched_main.c      | 126 ++++++++++++--------
 drivers/gpu/drm/v3d/v3d_sched.c             |  10 +-
 include/drm/gpu_scheduler.h                 |  14 ++-
 10 files changed, 110 insertions(+), 82 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
index f60753f97ac5..9c2a10aeb0b3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
@@ -1489,9 +1489,9 @@ static int amdgpu_debugfs_test_ib_show(struct seq_file *m, void *unused)
 	for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
 		struct amdgpu_ring *ring = adev->rings[i];
 
-		if (!ring || !ring->sched.thread)
+		if (!ring || !ring->sched.ready)
 			continue;
-		kthread_park(ring->sched.thread);
+		drm_sched_run_wq_stop(&ring->sched);
 	}
 
 	seq_printf(m, "run ib test:\n");
@@ -1505,9 +1505,9 @@ static int amdgpu_debugfs_test_ib_show(struct seq_file *m, void *unused)
 	for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
 		struct amdgpu_ring *ring = adev->rings[i];
 
-		if (!ring || !ring->sched.thread)
+		if (!ring || !ring->sched.ready)
 			continue;
-		kthread_unpark(ring->sched.thread);
+		drm_sched_run_wq_start(&ring->sched);
 	}
 
 	up_write(&adev->reset_domain->sem);
@@ -1727,7 +1727,7 @@ static int amdgpu_debugfs_ib_preempt(void *data, u64 val)
 
 	ring = adev->rings[val];
 
-	if (!ring || !ring->funcs->preempt_ib || !ring->sched.thread)
+	if (!ring || !ring->funcs->preempt_ib || !ring->sched.ready)
 		return -EINVAL;
 
 	/* the last preemption failed */
@@ -1745,7 +1745,7 @@ static int amdgpu_debugfs_ib_preempt(void *data, u64 val)
 		goto pro_end;
 
 	/* stop the scheduler */
-	kthread_park(ring->sched.thread);
+	drm_sched_run_wq_stop(&ring->sched);
 
 	/* preempt the IB */
 	r = amdgpu_ring_preempt_ib(ring);
@@ -1779,7 +1779,7 @@ static int amdgpu_debugfs_ib_preempt(void *data, u64 val)
 
 failure:
 	/* restart the scheduler */
-	kthread_unpark(ring->sched.thread);
+	drm_sched_run_wq_start(&ring->sched);
 
 	up_read(&adev->reset_domain->sem);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index fac9312b1695..00c9c03c8f94 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2364,7 +2364,7 @@ static int amdgpu_device_init_schedulers(struct amdgpu_device *adev)
 			break;
 		}
 
-		r = drm_sched_init(&ring->sched, &amdgpu_sched_ops,
+		r = drm_sched_init(&ring->sched, &amdgpu_sched_ops, NULL,
 				   ring->num_hw_submission, amdgpu_job_hang_limit,
 				   timeout, adev->reset_domain->wq,
 				   ring->sched_score, ring->name,
@@ -4627,7 +4627,7 @@ bool amdgpu_device_has_job_running(struct amdgpu_device *adev)
 	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
 		struct amdgpu_ring *ring = adev->rings[i];
 
-		if (!ring || !ring->sched.thread)
+		if (!ring || !ring->sched.ready)
 			continue;
 
 		spin_lock(&ring->sched.job_list_lock);
@@ -4753,7 +4753,7 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
 	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
 		struct amdgpu_ring *ring = adev->rings[i];
 
-		if (!ring || !ring->sched.thread)
+		if (!ring || !ring->sched.ready)
 			continue;
 
 		/*clear job fence from fence drv to avoid force_completion
@@ -5294,7 +5294,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 		for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
 			struct amdgpu_ring *ring = tmp_adev->rings[i];
 
-			if (!ring || !ring->sched.thread)
+			if (!ring || !ring->sched.ready)
 				continue;
 
 			drm_sched_stop(&ring->sched, job ? &job->base : NULL);
@@ -5369,7 +5369,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 		for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
 			struct amdgpu_ring *ring = tmp_adev->rings[i];
 
-			if (!ring || !ring->sched.thread)
+			if (!ring || !ring->sched.ready)
 				continue;
 
 			drm_sched_start(&ring->sched, true);
@@ -5696,7 +5696,7 @@ pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev, pci_channel_sta
 		for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
 			struct amdgpu_ring *ring = adev->rings[i];
 
-			if (!ring || !ring->sched.thread)
+			if (!ring || !ring->sched.ready)
 				continue;
 
 			drm_sched_stop(&ring->sched, NULL);
@@ -5824,7 +5824,7 @@ void amdgpu_pci_resume(struct pci_dev *pdev)
 	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
 		struct amdgpu_ring *ring = adev->rings[i];
 
-		if (!ring || !ring->sched.thread)
+		if (!ring || !ring->sched.ready)
 			continue;
 
 		drm_sched_start(&ring->sched, true);
diff --git a/drivers/gpu/drm/etnaviv/etnaviv_sched.c b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
index 1ae87dfd19c4..8486a2923f1b 100644
--- a/drivers/gpu/drm/etnaviv/etnaviv_sched.c
+++ b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
@@ -133,7 +133,7 @@ int etnaviv_sched_init(struct etnaviv_gpu *gpu)
 {
 	int ret;
 
-	ret = drm_sched_init(&gpu->sched, &etnaviv_sched_ops,
+	ret = drm_sched_init(&gpu->sched, &etnaviv_sched_ops, NULL,
 			     etnaviv_hw_jobs_limit, etnaviv_job_hang_limit,
 			     msecs_to_jiffies(500), NULL, NULL,
 			     dev_name(gpu->dev), gpu->dev);
diff --git a/drivers/gpu/drm/lima/lima_sched.c b/drivers/gpu/drm/lima/lima_sched.c
index ff003403fbbc..54f53bece27c 100644
--- a/drivers/gpu/drm/lima/lima_sched.c
+++ b/drivers/gpu/drm/lima/lima_sched.c
@@ -488,7 +488,7 @@ int lima_sched_pipe_init(struct lima_sched_pipe *pipe, const char *name)
 
 	INIT_WORK(&pipe->recover_work, lima_sched_recover_work);
 
-	return drm_sched_init(&pipe->base, &lima_sched_ops, 1,
+	return drm_sched_init(&pipe->base, &lima_sched_ops, NULL, 1,
 			      lima_job_hang_limit,
 			      msecs_to_jiffies(timeout), NULL,
 			      NULL, name, pipe->ldev->dev);
diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c
index c5c4c93b3689..f76ce11a5384 100644
--- a/drivers/gpu/drm/msm/adreno/adreno_device.c
+++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
@@ -662,7 +662,8 @@ static void suspend_scheduler(struct msm_gpu *gpu)
 	 */
 	for (i = 0; i < gpu->nr_rings; i++) {
 		struct drm_gpu_scheduler *sched = &gpu->rb[i]->sched;
-		kthread_park(sched->thread);
+
+		drm_sched_run_wq_stop(sched);
 	}
 }
 
@@ -672,7 +673,8 @@ static void resume_scheduler(struct msm_gpu *gpu)
 
 	for (i = 0; i < gpu->nr_rings; i++) {
 		struct drm_gpu_scheduler *sched = &gpu->rb[i]->sched;
-		kthread_unpark(sched->thread);
+
+		drm_sched_run_wq_start(sched);
 	}
 }
 
diff --git a/drivers/gpu/drm/msm/msm_ringbuffer.c b/drivers/gpu/drm/msm/msm_ringbuffer.c
index 57a8e9564540..5879fc262047 100644
--- a/drivers/gpu/drm/msm/msm_ringbuffer.c
+++ b/drivers/gpu/drm/msm/msm_ringbuffer.c
@@ -95,7 +95,7 @@ struct msm_ringbuffer *msm_ringbuffer_new(struct msm_gpu *gpu, int id,
 	 /* currently managing hangcheck ourselves: */
 	sched_timeout = MAX_SCHEDULE_TIMEOUT;
 
-	ret = drm_sched_init(&ring->sched, &msm_sched_ops,
+	ret = drm_sched_init(&ring->sched, &msm_sched_ops, NULL,
 			num_hw_submissions, 0, sched_timeout,
 			NULL, NULL, to_msm_bo(ring->bo)->name, gpu->dev->dev);
 	if (ret) {
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
index dbc597ab46fb..f48b07056a16 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -815,7 +815,7 @@ int panfrost_job_init(struct panfrost_device *pfdev)
 		js->queue[j].fence_context = dma_fence_context_alloc(1);
 
 		ret = drm_sched_init(&js->queue[j].sched,
-				     &panfrost_sched_ops,
+				     &panfrost_sched_ops, NULL,
 				     nentries, 0,
 				     msecs_to_jiffies(JOB_TIMEOUT_MS),
 				     pfdev->reset.wq,
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index a18c8f5e8cc0..808008990721 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -44,7 +44,6 @@
  * The jobs in a entity are always scheduled in the order that they were pushed.
  */
 
-#include <linux/kthread.h>
 #include <linux/wait.h>
 #include <linux/sched.h>
 #include <linux/completion.h>
@@ -252,6 +251,53 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
 	return rb ? rb_entry(rb, struct drm_sched_entity, rb_tree_node) : NULL;
 }
 
+/**
+ * drm_sched_run_wq_stop - stop scheduler run worker
+ *
+ * @sched: scheduler instance to stop run worker
+ */
+void drm_sched_run_wq_stop(struct drm_gpu_scheduler *sched)
+{
+	sched->pause_run_wq = true;
+	smp_wmb();
+
+	cancel_work_sync(&sched->work_run);
+}
+EXPORT_SYMBOL(drm_sched_run_wq_stop);
+
+/**
+ * drm_sched_run_wq_start - start scheduler run worker
+ *
+ * @sched: scheduler instance to start run worker
+ */
+void drm_sched_run_wq_start(struct drm_gpu_scheduler *sched)
+{
+	sched->pause_run_wq = false;
+	smp_wmb();
+
+	queue_work(sched->run_wq, &sched->work_run);
+}
+EXPORT_SYMBOL(drm_sched_run_wq_start);
+
+/**
+ * drm_sched_run_wq_queue - queue scheduler run worker
+ *
+ * @sched: scheduler instance to queue run worker
+ */
+static void drm_sched_run_wq_queue(struct drm_gpu_scheduler *sched)
+{
+	smp_rmb();
+
+	/*
+	 * Try not to schedule work if pause_run_wq is set, but it is not the
+	 * end of the world if we do, as it will either be cancelled by the
+	 * above cancel_work_sync, or drm_sched_main turns into a NOP while
+	 * pause_run_wq is set.
+	 */
+	if (!sched->pause_run_wq)
+		queue_work(sched->run_wq, &sched->work_run);
+}
+
 /**
  * drm_sched_job_done - complete a job
  * @s_job: pointer to the job which is done
@@ -271,7 +317,7 @@ static void drm_sched_job_done(struct drm_sched_job *s_job)
 	dma_fence_get(&s_fence->finished);
 	drm_sched_fence_finished(s_fence);
 	dma_fence_put(&s_fence->finished);
-	wake_up_interruptible(&sched->wake_up_worker);
+	drm_sched_run_wq_queue(sched);
 }
 
 /**
@@ -434,7 +480,7 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
 {
 	struct drm_sched_job *s_job, *tmp;
 
-	kthread_park(sched->thread);
+	drm_sched_run_wq_stop(sched);
 
 	/*
 	 * Reinsert back the bad job here - now it's safe as
@@ -547,7 +593,7 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
 		spin_unlock(&sched->job_list_lock);
 	}
 
-	kthread_unpark(sched->thread);
+	drm_sched_run_wq_start(sched);
 }
 EXPORT_SYMBOL(drm_sched_start);
 
@@ -864,7 +910,7 @@ static bool drm_sched_ready(struct drm_gpu_scheduler *sched)
 void drm_sched_wakeup(struct drm_gpu_scheduler *sched)
 {
 	if (drm_sched_ready(sched))
-		wake_up_interruptible(&sched->wake_up_worker);
+		drm_sched_run_wq_queue(sched);
 }
 
 /**
@@ -974,60 +1020,42 @@ drm_sched_pick_best(struct drm_gpu_scheduler **sched_list,
 }
 EXPORT_SYMBOL(drm_sched_pick_best);
 
-/**
- * drm_sched_blocked - check if the scheduler is blocked
- *
- * @sched: scheduler instance
- *
- * Returns true if blocked, otherwise false.
- */
-static bool drm_sched_blocked(struct drm_gpu_scheduler *sched)
-{
-	if (kthread_should_park()) {
-		kthread_parkme();
-		return true;
-	}
-
-	return false;
-}
-
 /**
  * drm_sched_main - main scheduler thread
  *
  * @param: scheduler instance
- *
- * Returns 0.
  */
-static int drm_sched_main(void *param)
+static void drm_sched_main(struct work_struct *w)
 {
-	struct drm_gpu_scheduler *sched = (struct drm_gpu_scheduler *)param;
+	struct drm_gpu_scheduler *sched =
+		container_of(w, struct drm_gpu_scheduler, work_run);
 	int r;
 
-	sched_set_fifo_low(current);
-
-	while (!kthread_should_stop()) {
-		struct drm_sched_entity *entity = NULL;
+	while (!READ_ONCE(sched->pause_run_wq)) {
+		struct drm_sched_entity *entity;
 		struct drm_sched_fence *s_fence;
 		struct drm_sched_job *sched_job;
 		struct dma_fence *fence;
-		struct drm_sched_job *cleanup_job = NULL;
+		struct drm_sched_job *cleanup_job;
 
-		wait_event_interruptible(sched->wake_up_worker,
-					 (cleanup_job = drm_sched_get_cleanup_job(sched)) ||
-					 (!drm_sched_blocked(sched) &&
-					  (entity = drm_sched_select_entity(sched))) ||
-					 kthread_should_stop());
+		cleanup_job = drm_sched_get_cleanup_job(sched);
+		entity = drm_sched_select_entity(sched);
 
 		if (cleanup_job)
 			sched->ops->free_job(cleanup_job);
 
-		if (!entity)
+		if (!entity) {
+			if (!cleanup_job)
+				break;
 			continue;
+		}
 
 		sched_job = drm_sched_entity_pop_job(entity);
 
 		if (!sched_job) {
 			complete_all(&entity->entity_idle);
+			if (!cleanup_job)
+				break;
 			continue;
 		}
 
@@ -1055,14 +1083,14 @@ static int drm_sched_main(void *param)
 					  r);
 		} else {
 			if (IS_ERR(fence))
-				dma_fence_set_error(&s_fence->finished, PTR_ERR(fence));
+				dma_fence_set_error(&s_fence->finished,
+						    PTR_ERR(fence));
 
 			drm_sched_job_done(sched_job);
 		}
 
 		wake_up(&sched->job_scheduled);
 	}
-	return 0;
 }
 
 /**
@@ -1070,6 +1098,7 @@ static int drm_sched_main(void *param)
  *
  * @sched: scheduler instance
  * @ops: backend operations for this scheduler
+ * @run_wq: workqueue to use for run work. If NULL, the system_wq is used
  * @hw_submission: number of hw submissions that can be in flight
  * @hang_limit: number of times to allow a job to hang before dropping it
  * @timeout: timeout value in jiffies for the scheduler
@@ -1083,14 +1112,16 @@ static int drm_sched_main(void *param)
  */
 int drm_sched_init(struct drm_gpu_scheduler *sched,
 		   const struct drm_sched_backend_ops *ops,
+		   struct workqueue_struct *run_wq,
 		   unsigned hw_submission, unsigned hang_limit,
 		   long timeout, struct workqueue_struct *timeout_wq,
 		   atomic_t *score, const char *name, struct device *dev)
 {
-	int i, ret;
+	int i;
 	sched->ops = ops;
 	sched->hw_submission_limit = hw_submission;
 	sched->name = name;
+	sched->run_wq = run_wq ? : system_wq;
 	sched->timeout = timeout;
 	sched->timeout_wq = timeout_wq ? : system_wq;
 	sched->hang_limit = hang_limit;
@@ -1099,23 +1130,15 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
 	for (i = DRM_SCHED_PRIORITY_MIN; i < DRM_SCHED_PRIORITY_COUNT; i++)
 		drm_sched_rq_init(sched, &sched->sched_rq[i]);
 
-	init_waitqueue_head(&sched->wake_up_worker);
 	init_waitqueue_head(&sched->job_scheduled);
 	INIT_LIST_HEAD(&sched->pending_list);
 	spin_lock_init(&sched->job_list_lock);
 	atomic_set(&sched->hw_rq_count, 0);
 	INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout);
+	INIT_WORK(&sched->work_run, drm_sched_main);
 	atomic_set(&sched->_score, 0);
 	atomic64_set(&sched->job_id_count, 0);
-
-	/* Each scheduler will run on a seperate kernel thread */
-	sched->thread = kthread_run(drm_sched_main, sched, sched->name);
-	if (IS_ERR(sched->thread)) {
-		ret = PTR_ERR(sched->thread);
-		sched->thread = NULL;
-		DRM_DEV_ERROR(sched->dev, "Failed to create scheduler for %s.\n", name);
-		return ret;
-	}
+	sched->pause_run_wq = false;
 
 	sched->ready = true;
 	return 0;
@@ -1134,8 +1157,7 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)
 	struct drm_sched_entity *s_entity;
 	int i;
 
-	if (sched->thread)
-		kthread_stop(sched->thread);
+	drm_sched_run_wq_stop(sched);
 
 	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
 		struct drm_sched_rq *rq = &sched->sched_rq[i];
diff --git a/drivers/gpu/drm/v3d/v3d_sched.c b/drivers/gpu/drm/v3d/v3d_sched.c
index 06238e6d7f5c..38e092ea41e6 100644
--- a/drivers/gpu/drm/v3d/v3d_sched.c
+++ b/drivers/gpu/drm/v3d/v3d_sched.c
@@ -388,7 +388,7 @@ v3d_sched_init(struct v3d_dev *v3d)
 	int ret;
 
 	ret = drm_sched_init(&v3d->queue[V3D_BIN].sched,
-			     &v3d_bin_sched_ops,
+			     &v3d_bin_sched_ops, NULL,
 			     hw_jobs_limit, job_hang_limit,
 			     msecs_to_jiffies(hang_limit_ms), NULL,
 			     NULL, "v3d_bin", v3d->drm.dev);
@@ -396,7 +396,7 @@ v3d_sched_init(struct v3d_dev *v3d)
 		return ret;
 
 	ret = drm_sched_init(&v3d->queue[V3D_RENDER].sched,
-			     &v3d_render_sched_ops,
+			     &v3d_render_sched_ops, NULL,
 			     hw_jobs_limit, job_hang_limit,
 			     msecs_to_jiffies(hang_limit_ms), NULL,
 			     NULL, "v3d_render", v3d->drm.dev);
@@ -404,7 +404,7 @@ v3d_sched_init(struct v3d_dev *v3d)
 		goto fail;
 
 	ret = drm_sched_init(&v3d->queue[V3D_TFU].sched,
-			     &v3d_tfu_sched_ops,
+			     &v3d_tfu_sched_ops, NULL,
 			     hw_jobs_limit, job_hang_limit,
 			     msecs_to_jiffies(hang_limit_ms), NULL,
 			     NULL, "v3d_tfu", v3d->drm.dev);
@@ -413,7 +413,7 @@ v3d_sched_init(struct v3d_dev *v3d)
 
 	if (v3d_has_csd(v3d)) {
 		ret = drm_sched_init(&v3d->queue[V3D_CSD].sched,
-				     &v3d_csd_sched_ops,
+				     &v3d_csd_sched_ops, NULL,
 				     hw_jobs_limit, job_hang_limit,
 				     msecs_to_jiffies(hang_limit_ms), NULL,
 				     NULL, "v3d_csd", v3d->drm.dev);
@@ -421,7 +421,7 @@ v3d_sched_init(struct v3d_dev *v3d)
 			goto fail;
 
 		ret = drm_sched_init(&v3d->queue[V3D_CACHE_CLEAN].sched,
-				     &v3d_cache_clean_sched_ops,
+				     &v3d_cache_clean_sched_ops, NULL,
 				     hw_jobs_limit, job_hang_limit,
 				     msecs_to_jiffies(hang_limit_ms), NULL,
 				     NULL, "v3d_cache_clean", v3d->drm.dev);
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index c0586d832260..98fb5f85eba6 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -473,17 +473,16 @@ struct drm_sched_backend_ops {
  * @timeout: the time after which a job is removed from the scheduler.
  * @name: name of the ring for which this scheduler is being used.
  * @sched_rq: priority wise array of run queues.
- * @wake_up_worker: the wait queue on which the scheduler sleeps until a job
- *                  is ready to be scheduled.
  * @job_scheduled: once @drm_sched_entity_do_release is called the scheduler
  *                 waits on this wait queue until all the scheduled jobs are
  *                 finished.
  * @hw_rq_count: the number of jobs currently in the hardware queue.
  * @job_id_count: used to assign unique id to the each job.
+ * @run_wq: workqueue used to queue @work_run
  * @timeout_wq: workqueue used to queue @work_tdr
+ * @work_run: schedules jobs and cleans up entities
  * @work_tdr: schedules a delayed call to @drm_sched_job_timedout after the
  *            timeout interval is over.
- * @thread: the kthread on which the scheduler which run.
  * @pending_list: the list of jobs which are currently in the job queue.
  * @job_list_lock: lock to protect the pending_list.
  * @hang_limit: once the hangs by a job crosses this limit then it is marked
@@ -492,6 +491,7 @@ struct drm_sched_backend_ops {
  * @_score: score used when the driver doesn't provide one
  * @ready: marks if the underlying HW is ready to work
  * @free_guilty: A hit to time out handler to free the guilty job.
+ * @pause_run_wq: pause queuing of @work_run on @run_wq
  * @dev: system &struct device
  *
  * One scheduler is implemented for each hardware ring.
@@ -502,13 +502,13 @@ struct drm_gpu_scheduler {
 	long				timeout;
 	const char			*name;
 	struct drm_sched_rq		sched_rq[DRM_SCHED_PRIORITY_COUNT];
-	wait_queue_head_t		wake_up_worker;
 	wait_queue_head_t		job_scheduled;
 	atomic_t			hw_rq_count;
 	atomic64_t			job_id_count;
+	struct workqueue_struct		*run_wq;
 	struct workqueue_struct		*timeout_wq;
+	struct work_struct		work_run;
 	struct delayed_work		work_tdr;
-	struct task_struct		*thread;
 	struct list_head		pending_list;
 	spinlock_t			job_list_lock;
 	int				hang_limit;
@@ -516,11 +516,13 @@ struct drm_gpu_scheduler {
 	atomic_t                        _score;
 	bool				ready;
 	bool				free_guilty;
+	bool				pause_run_wq;
 	struct device			*dev;
 };
 
 int drm_sched_init(struct drm_gpu_scheduler *sched,
 		   const struct drm_sched_backend_ops *ops,
+		   struct workqueue_struct *run_wq,
 		   uint32_t hw_submission, unsigned hang_limit,
 		   long timeout, struct workqueue_struct *timeout_wq,
 		   atomic_t *score, const char *name, struct device *dev);
@@ -550,6 +552,8 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
 
 void drm_sched_job_cleanup(struct drm_sched_job *job);
 void drm_sched_wakeup(struct drm_gpu_scheduler *sched);
+void drm_sched_run_wq_stop(struct drm_gpu_scheduler *sched);
+void drm_sched_run_wq_start(struct drm_gpu_scheduler *sched);
 void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad);
 void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery);
 void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched);
-- 
2.34.1



* [Intel-xe] [RFC PATCH 02/10] drm/sched: Move schedule policy to scheduler / entity
  2023-04-04  0:22 [Intel-xe] [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans Matthew Brost
  2023-04-04  0:22 ` [Intel-xe] [RFC PATCH 01/10] drm/sched: Convert drm scheduler to use a work queue rather than kthread Matthew Brost
@ 2023-04-04  0:22 ` Matthew Brost
  2023-04-05 17:37   ` Luben Tuikov
  2023-04-04  0:22 ` [Intel-xe] [RFC PATCH 03/10] drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy Matthew Brost
                   ` (14 subsequent siblings)
  16 siblings, 1 reply; 87+ messages in thread
From: Matthew Brost @ 2023-04-04  0:22 UTC (permalink / raw)
  To: dri-devel, intel-xe
  Cc: robdclark, airlied, lina, boris.brezillon, daniel,
	christian.koenig, faith.ekstrand

Rather than a global modparam for scheduling policy, move the scheduling
policy to the scheduler / entity so the user can control each scheduler /
entity policy.
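
A hedged usage sketch (the driver bits are made up; only the new
sched_policy argument is from this patch):

  /* Existing callers pass DRM_SCHED_POLICY_DEFAULT and keep following
   * the sched_policy modparam; a driver that always wants FIFO can now
   * ask for it explicitly, per scheduler. */
  err = drm_sched_init(&ring->sched, &example_sched_ops, NULL,
		       num_hw_submission, hang_limit, timeout, NULL,
		       NULL, ring->name, DRM_SCHED_POLICY_FIFO, dev);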

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  1 +
 drivers/gpu/drm/etnaviv/etnaviv_sched.c    |  3 ++-
 drivers/gpu/drm/lima/lima_sched.c          |  3 ++-
 drivers/gpu/drm/msm/msm_ringbuffer.c       |  3 ++-
 drivers/gpu/drm/panfrost/panfrost_job.c    |  3 ++-
 drivers/gpu/drm/scheduler/sched_entity.c   | 25 ++++++++++++++++++----
 drivers/gpu/drm/scheduler/sched_main.c     | 21 +++++++++++++-----
 drivers/gpu/drm/v3d/v3d_sched.c            | 15 ++++++++-----
 include/drm/gpu_scheduler.h                | 23 ++++++++++++++------
 9 files changed, 73 insertions(+), 24 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 00c9c03c8f94..4df0fca5a74c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2368,6 +2368,7 @@ static int amdgpu_device_init_schedulers(struct amdgpu_device *adev)
 				   ring->num_hw_submission, amdgpu_job_hang_limit,
 				   timeout, adev->reset_domain->wq,
 				   ring->sched_score, ring->name,
+				   DRM_SCHED_POLICY_DEFAULT,
 				   adev->dev);
 		if (r) {
 			DRM_ERROR("Failed to create scheduler on ring %s.\n",
diff --git a/drivers/gpu/drm/etnaviv/etnaviv_sched.c b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
index 8486a2923f1b..61204a3f8b0b 100644
--- a/drivers/gpu/drm/etnaviv/etnaviv_sched.c
+++ b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
@@ -136,7 +136,8 @@ int etnaviv_sched_init(struct etnaviv_gpu *gpu)
 	ret = drm_sched_init(&gpu->sched, &etnaviv_sched_ops, NULL,
 			     etnaviv_hw_jobs_limit, etnaviv_job_hang_limit,
 			     msecs_to_jiffies(500), NULL, NULL,
-			     dev_name(gpu->dev), gpu->dev);
+			     dev_name(gpu->dev), DRM_SCHED_POLICY_DEFAULT,
+			     gpu->dev);
 	if (ret)
 		return ret;
 
diff --git a/drivers/gpu/drm/lima/lima_sched.c b/drivers/gpu/drm/lima/lima_sched.c
index 54f53bece27c..33042ba6ae93 100644
--- a/drivers/gpu/drm/lima/lima_sched.c
+++ b/drivers/gpu/drm/lima/lima_sched.c
@@ -491,7 +491,8 @@ int lima_sched_pipe_init(struct lima_sched_pipe *pipe, const char *name)
 	return drm_sched_init(&pipe->base, &lima_sched_ops, NULL, 1,
 			      lima_job_hang_limit,
 			      msecs_to_jiffies(timeout), NULL,
-			      NULL, name, pipe->ldev->dev);
+			      NULL, name, DRM_SCHED_POLICY_DEFAULT,
+			      pipe->ldev->dev);
 }
 
 void lima_sched_pipe_fini(struct lima_sched_pipe *pipe)
diff --git a/drivers/gpu/drm/msm/msm_ringbuffer.c b/drivers/gpu/drm/msm/msm_ringbuffer.c
index 5879fc262047..f408a9097315 100644
--- a/drivers/gpu/drm/msm/msm_ringbuffer.c
+++ b/drivers/gpu/drm/msm/msm_ringbuffer.c
@@ -97,7 +97,8 @@ struct msm_ringbuffer *msm_ringbuffer_new(struct msm_gpu *gpu, int id,
 
 	ret = drm_sched_init(&ring->sched, &msm_sched_ops, NULL,
 			num_hw_submissions, 0, sched_timeout,
-			NULL, NULL, to_msm_bo(ring->bo)->name, gpu->dev->dev);
+			NULL, NULL, to_msm_bo(ring->bo)->name,
+			DRM_SCHED_POLICY_DEFAULT, gpu->dev->dev);
 	if (ret) {
 		goto fail;
 	}
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
index f48b07056a16..effa48b33dce 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -819,7 +819,8 @@ int panfrost_job_init(struct panfrost_device *pfdev)
 				     nentries, 0,
 				     msecs_to_jiffies(JOB_TIMEOUT_MS),
 				     pfdev->reset.wq,
-				     NULL, "pan_js", pfdev->dev);
+				     NULL, "pan_js", DRM_SCHED_POLICY_DEFAULT,
+				     pfdev->dev);
 		if (ret) {
 			dev_err(pfdev->dev, "Failed to create scheduler: %d.", ret);
 			goto err_sched;
diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index 15d04a0ec623..f1299e51860b 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -33,6 +33,20 @@
 #define to_drm_sched_job(sched_job)		\
 		container_of((sched_job), struct drm_sched_job, queue_node)
 
+static bool bad_policies(struct drm_gpu_scheduler **sched_list,
+			 unsigned int num_sched_list)
+{
+	enum drm_sched_policy sched_policy = sched_list[0]->sched_policy;
+	unsigned int i;
+
+	/* All schedule policies must match */
+	for (i = 1; i < num_sched_list; ++i)
+		if (sched_policy != sched_list[i]->sched_policy)
+			return true;
+
+	return false;
+}
+
 /**
  * drm_sched_entity_init - Init a context entity used by scheduler when
  * submit to HW ring.
@@ -62,7 +76,8 @@ int drm_sched_entity_init(struct drm_sched_entity *entity,
 			  unsigned int num_sched_list,
 			  atomic_t *guilty)
 {
-	if (!(entity && sched_list && (num_sched_list == 0 || sched_list[0])))
+	if (!(entity && sched_list && (num_sched_list == 0 || sched_list[0])) ||
+	    bad_policies(sched_list, num_sched_list))
 		return -EINVAL;
 
 	memset(entity, 0, sizeof(struct drm_sched_entity));
@@ -75,8 +90,10 @@ int drm_sched_entity_init(struct drm_sched_entity *entity,
 	entity->last_scheduled = NULL;
 	RB_CLEAR_NODE(&entity->rb_tree_node);
 
-	if(num_sched_list)
+	if(num_sched_list) {
 		entity->rq = &sched_list[0]->sched_rq[entity->priority];
+		entity->sched_policy = sched_list[0]->sched_policy;
+	}
 
 	init_completion(&entity->entity_idle);
 
@@ -440,7 +457,7 @@ struct drm_sched_job *drm_sched_entity_pop_job(struct drm_sched_entity *entity)
 	 * Update the entity's location in the min heap according to
 	 * the timestamp of the next job, if any.
 	 */
-	if (drm_sched_policy == DRM_SCHED_POLICY_FIFO) {
+	if (entity->sched_policy == DRM_SCHED_POLICY_FIFO) {
 		struct drm_sched_job *next;
 
 		next = to_drm_sched_job(spsc_queue_peek(&entity->job_queue));
@@ -528,7 +545,7 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
 		drm_sched_rq_add_entity(entity->rq, entity);
 		spin_unlock(&entity->rq_lock);
 
-		if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
+		if (entity->sched_policy == DRM_SCHED_POLICY_FIFO)
 			drm_sched_rq_update_fifo(entity, sched_job->submit_ts);
 
 		drm_sched_wakeup(entity->rq->sched);
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 808008990721..77894976fa55 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -62,14 +62,14 @@
 #define to_drm_sched_job(sched_job)		\
 		container_of((sched_job), struct drm_sched_job, queue_node)
 
-int drm_sched_policy = DRM_SCHED_POLICY_FIFO;
+int default_drm_sched_policy = DRM_SCHED_POLICY_FIFO;
 
 /**
  * DOC: sched_policy (int)
  * Used to override default entities scheduling policy in a run queue.
  */
 MODULE_PARM_DESC(sched_policy, "Specify the scheduling policy for entities on a run-queue, " __stringify(DRM_SCHED_POLICY_RR) " = Round Robin, " __stringify(DRM_SCHED_POLICY_FIFO) " = FIFO (default).");
-module_param_named(sched_policy, drm_sched_policy, int, 0444);
+module_param_named(sched_policy, default_drm_sched_policy, int, 0444);
 
 static __always_inline bool drm_sched_entity_compare_before(struct rb_node *a,
 							    const struct rb_node *b)
@@ -173,7 +173,7 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
 	if (rq->current_entity == entity)
 		rq->current_entity = NULL;
 
-	if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
+	if (entity->sched_policy == DRM_SCHED_POLICY_FIFO)
 		drm_sched_rq_remove_fifo_locked(entity);
 
 	spin_unlock(&rq->lock);
@@ -931,7 +931,7 @@ drm_sched_select_entity(struct drm_gpu_scheduler *sched)
 
 	/* Kernel run queue has higher priority than normal run queue*/
 	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
-		entity = drm_sched_policy == DRM_SCHED_POLICY_FIFO ?
+		entity = sched->sched_policy == DRM_SCHED_POLICY_FIFO ?
 			drm_sched_rq_select_entity_fifo(&sched->sched_rq[i]) :
 			drm_sched_rq_select_entity_rr(&sched->sched_rq[i]);
 		if (entity)
@@ -1106,6 +1106,7 @@ static void drm_sched_main(struct work_struct *w)
  *		used
  * @score: optional score atomic shared with other schedulers
  * @name: name used for debugging
+ * @sched_policy: schedule policy
  * @dev: target &struct device
  *
  * Return 0 on success, otherwise error code.
@@ -1115,9 +1116,15 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
 		   struct workqueue_struct *run_wq,
 		   unsigned hw_submission, unsigned hang_limit,
 		   long timeout, struct workqueue_struct *timeout_wq,
-		   atomic_t *score, const char *name, struct device *dev)
+		   atomic_t *score, const char *name,
+		   enum drm_sched_policy sched_policy,
+		   struct device *dev)
 {
 	int i;
+
+	if (sched_policy >= DRM_SCHED_POLICY_MAX)
+		return -EINVAL;
+
 	sched->ops = ops;
 	sched->hw_submission_limit = hw_submission;
 	sched->name = name;
@@ -1127,6 +1134,10 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
 	sched->hang_limit = hang_limit;
 	sched->score = score ? score : &sched->_score;
 	sched->dev = dev;
+	if (sched_policy == DRM_SCHED_POLICY_DEFAULT)
+		sched->sched_policy = default_drm_sched_policy;
+	else
+		sched->sched_policy = sched_policy;
 	for (i = DRM_SCHED_PRIORITY_MIN; i < DRM_SCHED_PRIORITY_COUNT; i++)
 		drm_sched_rq_init(sched, &sched->sched_rq[i]);
 
diff --git a/drivers/gpu/drm/v3d/v3d_sched.c b/drivers/gpu/drm/v3d/v3d_sched.c
index 38e092ea41e6..5e3fe77fa991 100644
--- a/drivers/gpu/drm/v3d/v3d_sched.c
+++ b/drivers/gpu/drm/v3d/v3d_sched.c
@@ -391,7 +391,8 @@ v3d_sched_init(struct v3d_dev *v3d)
 			     &v3d_bin_sched_ops, NULL,
 			     hw_jobs_limit, job_hang_limit,
 			     msecs_to_jiffies(hang_limit_ms), NULL,
-			     NULL, "v3d_bin", v3d->drm.dev);
+			     NULL, "v3d_bin", DRM_SCHED_POLICY_DEFAULT,
+			     v3d->drm.dev);
 	if (ret)
 		return ret;
 
@@ -399,7 +400,8 @@ v3d_sched_init(struct v3d_dev *v3d)
 			     &v3d_render_sched_ops, NULL,
 			     hw_jobs_limit, job_hang_limit,
 			     msecs_to_jiffies(hang_limit_ms), NULL,
-			     NULL, "v3d_render", v3d->drm.dev);
+			     NULL, "v3d_render", DRM_SCHED_POLICY_DEFAULT,
+			     v3d->drm.dev);
 	if (ret)
 		goto fail;
 
@@ -407,7 +409,8 @@ v3d_sched_init(struct v3d_dev *v3d)
 			     &v3d_tfu_sched_ops, NULL,
 			     hw_jobs_limit, job_hang_limit,
 			     msecs_to_jiffies(hang_limit_ms), NULL,
-			     NULL, "v3d_tfu", v3d->drm.dev);
+			     NULL, "v3d_tfu", DRM_SCHED_POLICY_DEFAULT,
+			     v3d->drm.dev);
 	if (ret)
 		goto fail;
 
@@ -416,7 +419,8 @@ v3d_sched_init(struct v3d_dev *v3d)
 				     &v3d_csd_sched_ops, NULL,
 				     hw_jobs_limit, job_hang_limit,
 				     msecs_to_jiffies(hang_limit_ms), NULL,
-				     NULL, "v3d_csd", v3d->drm.dev);
+				     NULL, "v3d_csd", DRM_SCHED_POLICY_DEFAULT,
+				     v3d->drm.dev);
 		if (ret)
 			goto fail;
 
@@ -424,7 +428,8 @@ v3d_sched_init(struct v3d_dev *v3d)
 				     &v3d_cache_clean_sched_ops, NULL,
 				     hw_jobs_limit, job_hang_limit,
 				     msecs_to_jiffies(hang_limit_ms), NULL,
-				     NULL, "v3d_cache_clean", v3d->drm.dev);
+				     NULL, "v3d_cache_clean",
+				     DRM_SCHED_POLICY_DEFAULT, v3d->drm.dev);
 		if (ret)
 			goto fail;
 	}
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 98fb5f85eba6..39cb72b7fe5d 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -72,11 +72,15 @@ enum drm_sched_priority {
 	DRM_SCHED_PRIORITY_UNSET = -2
 };
 
-/* Used to chose between FIFO and RR jobs scheduling */
-extern int drm_sched_policy;
-
-#define DRM_SCHED_POLICY_RR    0
-#define DRM_SCHED_POLICY_FIFO  1
+/* Used to chose default scheduling policy*/
+extern int default_drm_sched_policy;
+
+enum drm_sched_policy {
+	DRM_SCHED_POLICY_DEFAULT,
+	DRM_SCHED_POLICY_RR,
+	DRM_SCHED_POLICY_FIFO,
+	DRM_SCHED_POLICY_MAX,
+};
 
 /**
  * struct drm_sched_entity - A wrapper around a job queue (typically
@@ -217,6 +221,9 @@ struct drm_sched_entity {
 	 */
 	bool 				stopped;
 
+	/** @sched_policy: Schedule policy for entity */
+	enum drm_sched_policy		sched_policy;
+
 	/**
 	 * @entity_idle:
 	 *
@@ -489,6 +496,7 @@ struct drm_sched_backend_ops {
  *              guilty and it will no longer be considered for scheduling.
  * @score: score to help loadbalancer pick a idle sched
  * @_score: score used when the driver doesn't provide one
+ * @sched_policy: Schedule policy for scheduler
  * @ready: marks if the underlying HW is ready to work
  * @free_guilty: A hit to time out handler to free the guilty job.
  * @pause_run_wq: pause queuing of @work_run on @run_wq
@@ -514,6 +522,7 @@ struct drm_gpu_scheduler {
 	int				hang_limit;
 	atomic_t                        *score;
 	atomic_t                        _score;
+	enum drm_sched_policy		sched_policy;
 	bool				ready;
 	bool				free_guilty;
 	bool				pause_run_wq;
@@ -525,7 +534,9 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
 		   struct workqueue_struct *run_wq,
 		   uint32_t hw_submission, unsigned hang_limit,
 		   long timeout, struct workqueue_struct *timeout_wq,
-		   atomic_t *score, const char *name, struct device *dev);
+		   atomic_t *score, const char *name,
+		   enum drm_sched_policy sched_policy,
+		   struct device *dev);
 
 void drm_sched_fini(struct drm_gpu_scheduler *sched);
 int drm_sched_job_init(struct drm_sched_job *job,
-- 
2.34.1



* [Intel-xe] [RFC PATCH 03/10] drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
  2023-04-04  0:22 [Intel-xe] [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans Matthew Brost
  2023-04-04  0:22 ` [Intel-xe] [RFC PATCH 01/10] drm/sched: Convert drm scheduler to use a work queue rather than kthread Matthew Brost
  2023-04-04  0:22 ` [Intel-xe] [RFC PATCH 02/10] drm/sched: Move schedule policy to scheduler / entity Matthew Brost
@ 2023-04-04  0:22 ` Matthew Brost
  2023-04-04  0:22 ` [Intel-xe] [RFC PATCH 04/10] drm/sched: Add generic scheduler message interface Matthew Brost
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 87+ messages in thread
From: Matthew Brost @ 2023-04-04  0:22 UTC (permalink / raw)
  To: dri-devel, intel-xe
  Cc: robdclark, airlied, lina, boris.brezillon, daniel,
	christian.koenig, faith.ekstrand

DRM_SCHED_POLICY_SINGLE_ENTITY creates a 1 to 1 relationship between
scheduler and entity. No priorities or run queues are used in this mode.
It is intended for devices with firmware schedulers.
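
A hedged sketch of the intended 1 to 1 usage (the per-queue structure
and names are assumptions, not from this patch):

  struct drm_gpu_scheduler *sched_list[] = { &q->sched };
  int err;

  /* One scheduler per firmware-managed queue, no run queues. */
  err = drm_sched_init(&q->sched, &example_sched_ops, submit_wq,
		       job_limit, 0, timeout, NULL, NULL, q->name,
		       DRM_SCHED_POLICY_SINGLE_ENTITY, dev);
  if (err)
	return err;

  /* Exactly one scheduler in the list; more than one, or reusing the
   * scheduler for a second entity, fails with -EINVAL under this policy.
   * The priority is effectively unused as there are no run queues. */
  err = drm_sched_entity_init(&q->entity, DRM_SCHED_PRIORITY_NORMAL,
			      sched_list, 1, NULL);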

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/scheduler/sched_entity.c | 58 +++++++++++++++++----
 drivers/gpu/drm/scheduler/sched_fence.c  |  2 +-
 drivers/gpu/drm/scheduler/sched_main.c   | 64 +++++++++++++++++++++---
 include/drm/gpu_scheduler.h              | 29 +++++++----
 4 files changed, 123 insertions(+), 30 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index f1299e51860b..ccea4d079d0f 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -91,8 +91,15 @@ int drm_sched_entity_init(struct drm_sched_entity *entity,
 	RB_CLEAR_NODE(&entity->rb_tree_node);
 
 	if(num_sched_list) {
-		entity->rq = &sched_list[0]->sched_rq[entity->priority];
 		entity->sched_policy = sched_list[0]->sched_policy;
+		if (entity->sched_policy != DRM_SCHED_POLICY_SINGLE_ENTITY) {
+			entity->rq = &sched_list[0]->sched_rq[entity->priority];
+		} else {
+			if (num_sched_list != 1 || sched_list[0]->single_entity)
+				return -EINVAL;
+			sched_list[0]->single_entity = entity;
+			entity->single_sched = sched_list[0];
+		}
 	}
 
 	init_completion(&entity->entity_idle);
@@ -126,7 +133,8 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
 				    struct drm_gpu_scheduler **sched_list,
 				    unsigned int num_sched_list)
 {
-	WARN_ON(!num_sched_list || !sched_list);
+	WARN_ON(!num_sched_list || !sched_list ||
+		entity->sched_policy == DRM_SCHED_POLICY_SINGLE_ENTITY);
 
 	entity->sched_list = sched_list;
 	entity->num_sched_list = num_sched_list;
@@ -196,13 +204,16 @@ static void drm_sched_entity_kill(struct drm_sched_entity *entity)
 {
 	struct drm_sched_job *job;
 	struct dma_fence *prev;
+	bool single_entity =
+		entity->sched_policy == DRM_SCHED_POLICY_SINGLE_ENTITY;
 
-	if (!entity->rq)
+	if (!entity->rq && !single_entity)
 		return;
 
 	spin_lock(&entity->rq_lock);
 	entity->stopped = true;
-	drm_sched_rq_remove_entity(entity->rq, entity);
+	if (!single_entity)
+		drm_sched_rq_remove_entity(entity->rq, entity);
 	spin_unlock(&entity->rq_lock);
 
 	/* Make sure this entity is not used by the scheduler at the moment */
@@ -224,6 +235,21 @@ static void drm_sched_entity_kill(struct drm_sched_entity *entity)
 	dma_fence_put(prev);
 }
 
+/**
+ * drm_sched_entity_to_scheduler - Schedule entity to GPU scheduler
+ * @entity: scheduler entity
+ *
+ * Returns GPU scheduler for the entity
+ */
+struct drm_gpu_scheduler *
+drm_sched_entity_to_scheduler(struct drm_sched_entity *entity)
+{
+	bool single_entity =
+		entity->sched_policy == DRM_SCHED_POLICY_SINGLE_ENTITY;
+
+	return single_entity ? entity->single_sched : entity->rq->sched;
+}
+
 /**
  * drm_sched_entity_flush - Flush a context entity
  *
@@ -241,11 +267,13 @@ long drm_sched_entity_flush(struct drm_sched_entity *entity, long timeout)
 	struct drm_gpu_scheduler *sched;
 	struct task_struct *last_user;
 	long ret = timeout;
+	bool single_entity =
+		entity->sched_policy == DRM_SCHED_POLICY_SINGLE_ENTITY;
 
-	if (!entity->rq)
+	if (!entity->rq && !single_entity)
 		return 0;
 
-	sched = entity->rq->sched;
+	sched = drm_sched_entity_to_scheduler(entity);
 	/**
 	 * The client will not queue more IBs during this fini, consume existing
 	 * queued IBs or discard them on SIGKILL
@@ -338,7 +366,7 @@ static void drm_sched_entity_wakeup(struct dma_fence *f,
 		container_of(cb, struct drm_sched_entity, cb);
 
 	drm_sched_entity_clear_dep(f, cb);
-	drm_sched_wakeup(entity->rq->sched);
+	drm_sched_wakeup(drm_sched_entity_to_scheduler(entity));
 }
 
 /**
@@ -352,6 +380,8 @@ static void drm_sched_entity_wakeup(struct dma_fence *f,
 void drm_sched_entity_set_priority(struct drm_sched_entity *entity,
 				   enum drm_sched_priority priority)
 {
+	WARN_ON(entity->sched_policy == DRM_SCHED_POLICY_SINGLE_ENTITY);
+
 	spin_lock(&entity->rq_lock);
 	entity->priority = priority;
 	spin_unlock(&entity->rq_lock);
@@ -364,7 +394,7 @@ EXPORT_SYMBOL(drm_sched_entity_set_priority);
  */
 static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
 {
-	struct drm_gpu_scheduler *sched = entity->rq->sched;
+	struct drm_gpu_scheduler *sched = drm_sched_entity_to_scheduler(entity);
 	struct dma_fence *fence = entity->dependency;
 	struct drm_sched_fence *s_fence;
 
@@ -474,6 +504,8 @@ void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
 	struct drm_gpu_scheduler *sched;
 	struct drm_sched_rq *rq;
 
+	WARN_ON(entity->sched_policy == DRM_SCHED_POLICY_SINGLE_ENTITY);
+
 	/* single possible engine and already selected */
 	if (!entity->sched_list)
 		return;
@@ -523,10 +555,13 @@ void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
 void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
 {
 	struct drm_sched_entity *entity = sched_job->entity;
+	bool single_entity =
+		entity->sched_policy == DRM_SCHED_POLICY_SINGLE_ENTITY;
 	bool first;
 
 	trace_drm_sched_job(sched_job, entity);
-	atomic_inc(entity->rq->sched->score);
+	if (!single_entity)
+		atomic_inc(entity->rq->sched->score);
 	WRITE_ONCE(entity->last_user, current->group_leader);
 	first = spsc_queue_push(&entity->job_queue, &sched_job->queue_node);
 	sched_job->submit_ts = ktime_get();
@@ -542,13 +577,14 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
 			return;
 		}
 
-		drm_sched_rq_add_entity(entity->rq, entity);
+		if (!single_entity)
+			drm_sched_rq_add_entity(entity->rq, entity);
 		spin_unlock(&entity->rq_lock);
 
 		if (entity->sched_policy == DRM_SCHED_POLICY_FIFO)
 			drm_sched_rq_update_fifo(entity, sched_job->submit_ts);
 
-		drm_sched_wakeup(entity->rq->sched);
+		drm_sched_wakeup(drm_sched_entity_to_scheduler(entity));
 	}
 }
 EXPORT_SYMBOL(drm_sched_entity_push_job);
diff --git a/drivers/gpu/drm/scheduler/sched_fence.c b/drivers/gpu/drm/scheduler/sched_fence.c
index fe9c6468e440..d7cfc0441885 100644
--- a/drivers/gpu/drm/scheduler/sched_fence.c
+++ b/drivers/gpu/drm/scheduler/sched_fence.c
@@ -213,7 +213,7 @@ void drm_sched_fence_init(struct drm_sched_fence *fence,
 {
 	unsigned seq;
 
-	fence->sched = entity->rq->sched;
+	fence->sched = drm_sched_entity_to_scheduler(entity);
 	seq = atomic_inc_return(&entity->fence_seq);
 	dma_fence_init(&fence->scheduled, &drm_sched_fence_ops_scheduled,
 		       &fence->lock, entity->fence_context, seq);
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 77894976fa55..2795021efe7b 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -32,7 +32,8 @@
  * backend operations to the scheduler like submitting a job to hardware run queue,
  * returning the dependencies of a job etc.
  *
- * The organisation of the scheduler is the following:
+ * The organisation of the scheduler is the following for scheduling policies
+ * DRM_SCHED_POLICY_RR and DRM_SCHED_POLICY_FIFO:
  *
  * 1. Each hw run queue has one scheduler
  * 2. Each scheduler has multiple run queues with different priorities
@@ -41,7 +42,22 @@
  * 4. Entities themselves maintain a queue of jobs that will be scheduled on
  *    the hardware.
  *
- * The jobs in a entity are always scheduled in the order that they were pushed.
+ * The organisation of the scheduler is the following for scheduling policy
+ * DRM_SCHED_POLICY_SINGLE_ENTITY:
+ *
+ * 1. One to one relationship between scheduler and entity
+ * 2. No priorities implemented per scheduler (single job queue)
+ * 3. No run queues in scheduler rather jobs are directly dequeued from entity
+ * 4. The entity maintains a queue of jobs that will be scheduled on the
+ * hardware
+ *
+ * The jobs in an entity are always scheduled in the order that they were pushed
+ * regardless of scheduling policy.
+ *
+ * A policy of DRM_SCHED_POLICY_RR or DRM_SCHED_POLICY_FIFO is expected to be
+ * used when the KMD is scheduling directly on the hardware while a scheduling
+ * policy of DRM_SCHED_POLICY_SINGLE_ENTITY is expected to be used when there
+ * is a firmware scheduler.
  */
 
 #include <linux/wait.h>
@@ -92,6 +108,8 @@ static inline void drm_sched_rq_remove_fifo_locked(struct drm_sched_entity *enti
 
 void drm_sched_rq_update_fifo(struct drm_sched_entity *entity, ktime_t ts)
 {
+	WARN_ON(entity->sched_policy == DRM_SCHED_POLICY_SINGLE_ENTITY);
+
 	/*
 	 * Both locks need to be grabbed, one to protect from entity->rq change
 	 * for entity from within concurrent drm_sched_entity_select_rq and the
@@ -122,6 +140,8 @@ void drm_sched_rq_update_fifo(struct drm_sched_entity *entity, ktime_t ts)
 static void drm_sched_rq_init(struct drm_gpu_scheduler *sched,
 			      struct drm_sched_rq *rq)
 {
+	WARN_ON(sched->sched_policy == DRM_SCHED_POLICY_SINGLE_ENTITY);
+
 	spin_lock_init(&rq->lock);
 	INIT_LIST_HEAD(&rq->entities);
 	rq->rb_tree_root = RB_ROOT_CACHED;
@@ -140,6 +160,8 @@ static void drm_sched_rq_init(struct drm_gpu_scheduler *sched,
 void drm_sched_rq_add_entity(struct drm_sched_rq *rq,
 			     struct drm_sched_entity *entity)
 {
+	WARN_ON(entity->sched_policy == DRM_SCHED_POLICY_SINGLE_ENTITY);
+
 	if (!list_empty(&entity->list))
 		return;
 
@@ -162,6 +184,8 @@ void drm_sched_rq_add_entity(struct drm_sched_rq *rq,
 void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
 				struct drm_sched_entity *entity)
 {
+	WARN_ON(entity->sched_policy == DRM_SCHED_POLICY_SINGLE_ENTITY);
+
 	if (list_empty(&entity->list))
 		return;
 
@@ -673,7 +697,7 @@ int drm_sched_job_init(struct drm_sched_job *job,
 		       struct drm_sched_entity *entity,
 		       void *owner)
 {
-	if (!entity->rq)
+	if (!entity->rq && !entity->single_sched)
 		return -ENOENT;
 
 	job->entity = entity;
@@ -706,13 +730,17 @@ void drm_sched_job_arm(struct drm_sched_job *job)
 {
 	struct drm_gpu_scheduler *sched;
 	struct drm_sched_entity *entity = job->entity;
+	bool single_entity =
+		entity->sched_policy == DRM_SCHED_POLICY_SINGLE_ENTITY;
 
 	BUG_ON(!entity);
-	drm_sched_entity_select_rq(entity);
-	sched = entity->rq->sched;
+	if (!single_entity)
+		drm_sched_entity_select_rq(entity);
+	sched = drm_sched_entity_to_scheduler(entity);
 
 	job->sched = sched;
-	job->s_priority = entity->rq - sched->sched_rq;
+	if (!single_entity)
+		job->s_priority = entity->rq - sched->sched_rq;
 	job->id = atomic64_inc_return(&sched->job_id_count);
 
 	drm_sched_fence_init(job->s_fence, job->entity);
@@ -929,6 +957,13 @@ drm_sched_select_entity(struct drm_gpu_scheduler *sched)
 	if (!drm_sched_ready(sched))
 		return NULL;
 
+	if (sched->single_entity) {
+		if (drm_sched_entity_is_ready(sched->single_entity))
+			return sched->single_entity;
+
+		return NULL;
+	}
+
 	/* Kernel run queue has higher priority than normal run queue*/
 	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
 		entity = sched->sched_policy == DRM_SCHED_POLICY_FIFO ?
@@ -1126,6 +1161,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
 		return -EINVAL;
 
 	sched->ops = ops;
+	sched->single_entity = NULL;
 	sched->hw_submission_limit = hw_submission;
 	sched->name = name;
 	sched->run_wq = run_wq ? : system_wq;
@@ -1138,7 +1174,9 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
 		sched->sched_policy = default_drm_sched_policy;
 	else
 		sched->sched_policy = sched_policy;
-	for (i = DRM_SCHED_PRIORITY_MIN; i < DRM_SCHED_PRIORITY_COUNT; i++)
+	for (i = DRM_SCHED_PRIORITY_MIN; sched_policy !=
+	     DRM_SCHED_POLICY_SINGLE_ENTITY && i < DRM_SCHED_PRIORITY_COUNT;
+	     i++)
 		drm_sched_rq_init(sched, &sched->sched_rq[i]);
 
 	init_waitqueue_head(&sched->job_scheduled);
@@ -1170,7 +1208,15 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)
 
 	drm_sched_run_wq_stop(sched);
 
-	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
+	if (sched->single_entity) {
+		spin_lock(&sched->single_entity->rq_lock);
+		sched->single_entity->stopped = true;
+		spin_unlock(&sched->single_entity->rq_lock);
+	}
+
+	for (i = DRM_SCHED_PRIORITY_COUNT - 1; sched->sched_policy !=
+	     DRM_SCHED_POLICY_SINGLE_ENTITY && i >= DRM_SCHED_PRIORITY_MIN;
+	     i--) {
 		struct drm_sched_rq *rq = &sched->sched_rq[i];
 
 		if (!rq)
@@ -1214,6 +1260,8 @@ void drm_sched_increase_karma(struct drm_sched_job *bad)
 	struct drm_sched_entity *entity;
 	struct drm_gpu_scheduler *sched = bad->sched;
 
+	WARN_ON(sched->sched_policy == DRM_SCHED_POLICY_SINGLE_ENTITY);
+
 	/* don't change @bad's karma if it's from KERNEL RQ,
 	 * because sometimes GPU hang would cause kernel jobs (like VM updating jobs)
 	 * corrupt but keep in mind that kernel jobs always considered good.
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 39cb72b7fe5d..3e421f5a710c 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -79,6 +79,7 @@ enum drm_sched_policy {
 	DRM_SCHED_POLICY_DEFAULT,
 	DRM_SCHED_POLICY_RR,
 	DRM_SCHED_POLICY_FIFO,
+	DRM_SCHED_POLICY_SINGLE_ENTITY,
 	DRM_SCHED_POLICY_MAX,
 };
 
@@ -101,16 +102,20 @@ struct drm_sched_entity {
 	 */
 	struct list_head		list;
 
-	/**
-	 * @rq:
-	 *
-	 * Runqueue on which this entity is currently scheduled.
-	 *
-	 * FIXME: Locking is very unclear for this. Writers are protected by
-	 * @rq_lock, but readers are generally lockless and seem to just race
-	 * with not even a READ_ONCE.
-	 */
-	struct drm_sched_rq		*rq;
+	union {
+		/**
+		 * @rq:
+		 *
+		 * Runqueue on which this entity is currently scheduled.
+		 *
+		 * FIXME: Locking is very unclear for this. Writers are
+		 * protected by @rq_lock, but readers are generally lockless and
+		 * seem to just race with not even a READ_ONCE.
+		 */
+		struct drm_sched_rq		*rq;
+		/** @single_sched: Single scheduler */
+		struct drm_gpu_scheduler	*single_sched;
+	};
 
 	/**
 	 * @sched_list:
@@ -476,6 +481,7 @@ struct drm_sched_backend_ops {
  * struct drm_gpu_scheduler - scheduler instance-specific data
  *
  * @ops: backend operations provided by the driver.
+ * @single_entity: Single entity for the scheduler
  * @hw_submission_limit: the max size of the hardware queue.
  * @timeout: the time after which a job is removed from the scheduler.
  * @name: name of the ring for which this scheduler is being used.
@@ -506,6 +512,7 @@ struct drm_sched_backend_ops {
  */
 struct drm_gpu_scheduler {
 	const struct drm_sched_backend_ops	*ops;
+	struct drm_sched_entity		*single_entity;
 	uint32_t			hw_submission_limit;
 	long				timeout;
 	const char			*name;
@@ -587,6 +594,8 @@ int drm_sched_entity_init(struct drm_sched_entity *entity,
 			  struct drm_gpu_scheduler **sched_list,
 			  unsigned int num_sched_list,
 			  atomic_t *guilty);
+struct drm_gpu_scheduler *
+drm_sched_entity_to_scheduler(struct drm_sched_entity *entity);
 long drm_sched_entity_flush(struct drm_sched_entity *entity, long timeout);
 void drm_sched_entity_fini(struct drm_sched_entity *entity);
 void drm_sched_entity_destroy(struct drm_sched_entity *entity);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [Intel-xe] [RFC PATCH 04/10] drm/sched: Add generic scheduler message interface
  2023-04-04  0:22 [Intel-xe] [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans Matthew Brost
                   ` (2 preceding siblings ...)
  2023-04-04  0:22 ` [Intel-xe] [RFC PATCH 03/10] drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy Matthew Brost
@ 2023-04-04  0:22 ` Matthew Brost
  2023-05-04  5:28   ` Luben Tuikov
  2023-04-04  0:22 ` [Intel-xe] [RFC PATCH 05/10] drm/sched: Start run wq before TDR in drm_sched_start Matthew Brost
                   ` (12 subsequent siblings)
  16 siblings, 1 reply; 87+ messages in thread
From: Matthew Brost @ 2023-04-04  0:22 UTC (permalink / raw)
  To: dri-devel, intel-xe
  Cc: robdclark, airlied, lina, boris.brezillon, daniel,
	christian.koenig, faith.ekstrand

Add a generic scheduler message interface which sends messages to the
backend from the drm_gpu_scheduler main submission thread. The idea is
that some of these messages modify state in a drm_sched_entity which is
also modified during submission. By handling these messages and
submission in the same thread there is no race when changing state in
the drm_sched_entity.

This interface will be used in Xe, the new Intel GPU driver, to clean up,
suspend, resume, and change scheduling properties of a drm_sched_entity.

The interface is designed to be generic and extensible, with only the
backend understanding the messages.
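
As a rough usage sketch (not part of this patch), a backend could define
its own opcodes and handler along these lines; the example_* names and
the embedded scheduler layout are assumptions for illustration only, while
struct drm_sched_msg, drm_sched_add_msg() and the process_msg() hook are
what this patch adds:

/* Hypothetical backend-defined opcodes. */
enum example_msg_opcode {
	EXAMPLE_MSG_CLEANUP,
	EXAMPLE_MSG_SUSPEND,
	EXAMPLE_MSG_RESUME,
};

/* Runs in the scheduler's main submission thread, serialized with job
 * submission. Hooked up via .process_msg in drm_sched_backend_ops.
 */
static void example_process_msg(struct drm_sched_msg *msg)
{
	struct example_exec_queue *q = msg->private_data;

	switch (msg->opcode) {
	case EXAMPLE_MSG_CLEANUP:
		example_exec_queue_cleanup(q);	/* hypothetical helpers */
		break;
	case EXAMPLE_MSG_SUSPEND:
		example_exec_queue_suspend(q);
		break;
	case EXAMPLE_MSG_RESUME:
		example_exec_queue_resume(q);
		break;
	}

	kfree(msg);	/* process_msg() owns dynamically allocated messages */
}

/* Caller side: hand work to the submission thread instead of touching
 * entity state directly.
 */
static int example_exec_queue_suspend_async(struct example_exec_queue *q)
{
	struct drm_sched_msg *msg = kzalloc(sizeof(*msg), GFP_KERNEL);

	if (!msg)
		return -ENOMEM;

	msg->opcode = EXAMPLE_MSG_SUSPEND;
	msg->private_data = q;
	drm_sched_add_msg(&q->sched, msg);	/* q->sched: assumed embedded scheduler */

	return 0;
}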

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/scheduler/sched_main.c | 58 +++++++++++++++++++++++++-
 include/drm/gpu_scheduler.h            | 29 ++++++++++++-
 2 files changed, 84 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 2795021efe7b..9dc3378e9c5e 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1055,6 +1055,54 @@ drm_sched_pick_best(struct drm_gpu_scheduler **sched_list,
 }
 EXPORT_SYMBOL(drm_sched_pick_best);
 
+/**
+ * drm_sched_add_msg - add scheduler message
+ *
+ * @sched: scheduler instance
+ * @msg: message to be added
+ *
+ * Messages can and will be processed ahead of jobs that are waiting on
+ * dependencies or sitting in a runnable queue. Message processing will stop
+ * if the scheduler run wq is stopped and resume when the run wq is started.
+ */
+void drm_sched_add_msg(struct drm_gpu_scheduler *sched,
+		       struct drm_sched_msg *msg)
+{
+	spin_lock(&sched->job_list_lock);
+	list_add_tail(&msg->link, &sched->msgs);
+	spin_unlock(&sched->job_list_lock);
+
+	/*
+	 * Same as above in drm_sched_run_wq_queue, try to kick worker if
+	 * paused, harmless if this races
+	 */
+	if (!sched->pause_run_wq)
+		queue_work(sched->run_wq, &sched->work_run);
+}
+EXPORT_SYMBOL(drm_sched_add_msg);
+
+/**
+ * drm_sched_get_msg - get scheduler message
+ *
+ * @sched: scheduler instance
+ *
+ * Returns the next message or NULL if none is pending
+ */
+static struct drm_sched_msg *
+drm_sched_get_msg(struct drm_gpu_scheduler *sched)
+{
+	struct drm_sched_msg *msg;
+
+	spin_lock(&sched->job_list_lock);
+	msg = list_first_entry_or_null(&sched->msgs,
+				       struct drm_sched_msg, link);
+	if (msg)
+		list_del(&msg->link);
+	spin_unlock(&sched->job_list_lock);
+
+	return msg;
+}
+
 /**
  * drm_sched_main - main scheduler thread
  *
@@ -1068,6 +1116,7 @@ static void drm_sched_main(struct work_struct *w)
 
 	while (!READ_ONCE(sched->pause_run_wq)) {
 		struct drm_sched_entity *entity;
+		struct drm_sched_msg *msg;
 		struct drm_sched_fence *s_fence;
 		struct drm_sched_job *sched_job;
 		struct dma_fence *fence;
@@ -1075,12 +1124,16 @@ static void drm_sched_main(struct work_struct *w)
 
 		cleanup_job = drm_sched_get_cleanup_job(sched);
 		entity = drm_sched_select_entity(sched);
+		msg = drm_sched_get_msg(sched);
 
 		if (cleanup_job)
 			sched->ops->free_job(cleanup_job);
 
+		if (msg)
+			sched->ops->process_msg(msg);
+
 		if (!entity) {
-			if (!cleanup_job)
+			if (!cleanup_job && !msg)
 				break;
 			continue;
 		}
@@ -1089,7 +1142,7 @@ static void drm_sched_main(struct work_struct *w)
 
 		if (!sched_job) {
 			complete_all(&entity->entity_idle);
-			if (!cleanup_job)
+			if (!cleanup_job && !msg)
 				break;
 			continue;
 		}
@@ -1181,6 +1234,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
 
 	init_waitqueue_head(&sched->job_scheduled);
 	INIT_LIST_HEAD(&sched->pending_list);
+	INIT_LIST_HEAD(&sched->msgs);
 	spin_lock_init(&sched->job_list_lock);
 	atomic_set(&sched->hw_rq_count, 0);
 	INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout);
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 3e421f5a710c..18172ae63ab7 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -398,6 +398,23 @@ enum drm_gpu_sched_stat {
 	DRM_GPU_SCHED_STAT_ENODEV,
 };
 
+/**
+ * struct drm_sched_msg - an in-band (relative to GPU scheduler run queue)
+ * message
+ *
+ * Generic enough for backend defined messages, backend can expand if needed.
+ */
+struct drm_sched_msg {
+	/** @link: list link into the gpu scheduler list of messages */
+	struct list_head		link;
+	/**
+	 * @private_data: opaque pointer to message private data (backend defined)
+	 */
+	void				*private_data;
+	/** @opcode: opcode of message (backend defined) */
+	unsigned int			opcode;
+};
+
 /**
  * struct drm_sched_backend_ops - Define the backend operations
  *	called by the scheduler
@@ -475,6 +492,12 @@ struct drm_sched_backend_ops {
          * and it's time to clean it up.
 	 */
 	void (*free_job)(struct drm_sched_job *sched_job);
+
+	/**
+	 * @process_msg: Process a message. Allowed to block, it is this
+	 * function's responsibility to free message if dynamically allocated.
+	 */
+	void (*process_msg)(struct drm_sched_msg *msg);
 };
 
 /**
@@ -486,6 +509,7 @@ struct drm_sched_backend_ops {
  * @timeout: the time after which a job is removed from the scheduler.
  * @name: name of the ring for which this scheduler is being used.
  * @sched_rq: priority wise array of run queues.
+ * @msgs: list of messages to be processed in @work_run
  * @job_scheduled: once @drm_sched_entity_do_release is called the scheduler
  *                 waits on this wait queue until all the scheduled jobs are
  *                 finished.
@@ -493,7 +517,7 @@ struct drm_sched_backend_ops {
  * @job_id_count: used to assign unique id to the each job.
  * @run_wq: workqueue used to queue @work_run
  * @timeout_wq: workqueue used to queue @work_tdr
- * @work_run: schedules jobs and cleans up entities
+ * @work_run: schedules jobs, cleans up jobs, and processes messages
  * @work_tdr: schedules a delayed call to @drm_sched_job_timedout after the
  *            timeout interval is over.
  * @pending_list: the list of jobs which are currently in the job queue.
@@ -517,6 +541,7 @@ struct drm_gpu_scheduler {
 	long				timeout;
 	const char			*name;
 	struct drm_sched_rq		sched_rq[DRM_SCHED_PRIORITY_COUNT];
+	struct list_head		msgs;
 	wait_queue_head_t		job_scheduled;
 	atomic_t			hw_rq_count;
 	atomic64_t			job_id_count;
@@ -570,6 +595,8 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
 
 void drm_sched_job_cleanup(struct drm_sched_job *job);
 void drm_sched_wakeup(struct drm_gpu_scheduler *sched);
+void drm_sched_add_msg(struct drm_gpu_scheduler *sched,
+		       struct drm_sched_msg *msg);
 void drm_sched_run_wq_stop(struct drm_gpu_scheduler *sched);
 void drm_sched_run_wq_start(struct drm_gpu_scheduler *sched);
 void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [Intel-xe] [RFC PATCH 05/10] drm/sched: Start run wq before TDR in drm_sched_start
  2023-04-04  0:22 [Intel-xe] [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans Matthew Brost
                   ` (3 preceding siblings ...)
  2023-04-04  0:22 ` [Intel-xe] [RFC PATCH 04/10] drm/sched: Add generic scheduler message interface Matthew Brost
@ 2023-04-04  0:22 ` Matthew Brost
  2023-04-04  0:22 ` [Intel-xe] [RFC PATCH 06/10] drm/sched: Submit job before starting TDR Matthew Brost
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 87+ messages in thread
From: Matthew Brost @ 2023-04-04  0:22 UTC (permalink / raw)
  To: dri-devel, intel-xe
  Cc: robdclark, airlied, lina, boris.brezillon, daniel,
	christian.koenig, faith.ekstrand

If the TDR is set to a very small value it can fire before the run wq is
started in drm_sched_start. The run wq is expected to be running when the
TDR fires; fix this ordering so that expectation is always met.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/scheduler/sched_main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 9dc3378e9c5e..6ae710017024 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -611,13 +611,13 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
 			drm_sched_job_done(s_job);
 	}
 
+	drm_sched_run_wq_start(sched);
+
 	if (full_recovery) {
 		spin_lock(&sched->job_list_lock);
 		drm_sched_start_timeout(sched);
 		spin_unlock(&sched->job_list_lock);
 	}
-
-	drm_sched_run_wq_start(sched);
 }
 EXPORT_SYMBOL(drm_sched_start);
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [Intel-xe] [RFC PATCH 06/10] drm/sched: Submit job before starting TDR
  2023-04-04  0:22 [Intel-xe] [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans Matthew Brost
                   ` (4 preceding siblings ...)
  2023-04-04  0:22 ` [Intel-xe] [RFC PATCH 05/10] drm/sched: Start run wq before TDR in drm_sched_start Matthew Brost
@ 2023-04-04  0:22 ` Matthew Brost
  2023-05-04  5:23   ` Luben Tuikov
  2023-04-04  0:22 ` [Intel-xe] [RFC PATCH 07/10] drm/sched: Add helper to set TDR timeout Matthew Brost
                   ` (10 subsequent siblings)
  16 siblings, 1 reply; 87+ messages in thread
From: Matthew Brost @ 2023-04-04  0:22 UTC (permalink / raw)
  To: dri-devel, intel-xe
  Cc: robdclark, airlied, lina, boris.brezillon, daniel,
	christian.koenig, faith.ekstrand

If the TDR is set to a very small value, it can fire before a job is
submitted in drm_sched_main. The job should always be submitted before
the TDR fires; fix this ordering.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/scheduler/sched_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 6ae710017024..4eac02d212c1 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1150,10 +1150,10 @@ static void drm_sched_main(struct work_struct *w)
 		s_fence = sched_job->s_fence;
 
 		atomic_inc(&sched->hw_rq_count);
-		drm_sched_job_begin(sched_job);
 
 		trace_drm_run_job(sched_job, entity);
 		fence = sched->ops->run_job(sched_job);
+		drm_sched_job_begin(sched_job);
 		complete_all(&entity->entity_idle);
 		drm_sched_fence_scheduled(s_fence);
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [Intel-xe] [RFC PATCH 07/10] drm/sched: Add helper to set TDR timeout
  2023-04-04  0:22 [Intel-xe] [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans Matthew Brost
                   ` (5 preceding siblings ...)
  2023-04-04  0:22 ` [Intel-xe] [RFC PATCH 06/10] drm/sched: Submit job before starting TDR Matthew Brost
@ 2023-04-04  0:22 ` Matthew Brost
  2023-05-04  5:28   ` Luben Tuikov
  2023-04-04  0:22 ` [Intel-xe] [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences Matthew Brost
                   ` (9 subsequent siblings)
  16 siblings, 1 reply; 87+ messages in thread
From: Matthew Brost @ 2023-04-04  0:22 UTC (permalink / raw)
  To: dri-devel, intel-xe
  Cc: robdclark, airlied, lina, boris.brezillon, daniel,
	christian.koenig, faith.ekstrand

Add a helper to set the TDR timeout and restart the TDR with the new
timeout value. This will be used in Xe, the new Intel GPU driver, to
trigger the TDR to clean up drm_sched_entities that encounter errors.
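
For illustration only (the surrounding function names are made up;
drm_sched_set_timeout() is the only piece this patch adds), the intended
error-path usage could look roughly like:

/* Re-arm the TDR with a zero timeout so the timedout_job() callback runs
 * as soon as possible and can tear down the errored entity.
 */
static void example_trigger_tdr_cleanup(struct drm_gpu_scheduler *sched)
{
	drm_sched_set_timeout(sched, 0);
}

/* Once recovered, restore a normal per-job timeout. */
static void example_restore_tdr_timeout(struct drm_gpu_scheduler *sched)
{
	drm_sched_set_timeout(sched, msecs_to_jiffies(5000));
}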

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/scheduler/sched_main.c | 18 ++++++++++++++++++
 include/drm/gpu_scheduler.h            |  1 +
 2 files changed, 19 insertions(+)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 4eac02d212c1..d61880315d8d 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -370,6 +370,24 @@ static void drm_sched_start_timeout(struct drm_gpu_scheduler *sched)
 		queue_delayed_work(sched->timeout_wq, &sched->work_tdr, sched->timeout);
 }
 
+/**
+ * drm_sched_set_timeout - set timeout for reset worker
+ *
+ * @sched: scheduler instance to set and (re)-start the worker for
+ * @timeout: timeout period
+ *
+ * Set and (re)-start the timeout for the given scheduler.
+ */
+void drm_sched_set_timeout(struct drm_gpu_scheduler *sched, long timeout)
+{
+	spin_lock(&sched->job_list_lock);
+	sched->timeout = timeout;
+	cancel_delayed_work(&sched->work_tdr);
+	drm_sched_start_timeout(sched);
+	spin_unlock(&sched->job_list_lock);
+}
+EXPORT_SYMBOL(drm_sched_set_timeout);
+
 /**
  * drm_sched_fault - immediately start timeout handler
  *
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 18172ae63ab7..6258e324bd7c 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -593,6 +593,7 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
 				    struct drm_gpu_scheduler **sched_list,
                                    unsigned int num_sched_list);
 
+void drm_sched_set_timeout(struct drm_gpu_scheduler *sched, long timeout);
 void drm_sched_job_cleanup(struct drm_sched_job *job);
 void drm_sched_wakeup(struct drm_gpu_scheduler *sched);
 void drm_sched_add_msg(struct drm_gpu_scheduler *sched,
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [Intel-xe] [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences
  2023-04-04  0:22 [Intel-xe] [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans Matthew Brost
                   ` (6 preceding siblings ...)
  2023-04-04  0:22 ` [Intel-xe] [RFC PATCH 07/10] drm/sched: Add helper to set TDR timeout Matthew Brost
@ 2023-04-04  0:22 ` Matthew Brost
  2023-04-04  9:09   ` Christian König
  2023-04-04  0:22 ` [Intel-xe] [RFC PATCH 09/10] drm/sched: Support long-running sched entities Matthew Brost
                   ` (8 subsequent siblings)
  16 siblings, 1 reply; 87+ messages in thread
From: Matthew Brost @ 2023-04-04  0:22 UTC (permalink / raw)
  To: dri-devel, intel-xe
  Cc: robdclark, airlied, lina, boris.brezillon, daniel,
	christian.koenig, faith.ekstrand

From: Thomas Hellström <thomas.hellstrom@linux.intel.com>

For long-running workloads, drivers either need to open-code completion
waits, invent their own synchronization primitives or internally use
dma-fences that do not obey the cross-driver dma-fence protocol, but
without any lockdep annotation all these approaches are error prone.

Since, for example, the drm scheduler uses dma-fences, it is desirable for
a driver to be able to use it for throttling and error handling also with
internal dma-fences that do not obey the cross-driver dma-fence protocol.

Introduce long-running completion fences in form of dma-fences, and add
lockdep annotation for them. In particular:

* Do not allow waiting under any memory management locks.
* Do not allow attaching them to a dma-resv object.
* Introduce a new interface for adding callbacks, making the helper that
  adds a callback sign off on being aware that the dma-fence may not
  complete anytime soon. Typically this will be the scheduler chaining
  a new long-running fence on another one.
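
A hedged sketch of the intended driver-side usage follows; the example_*
names and the lr_fence field are assumptions, while
dma_fence_lr_begin_signalling(), dma_fence_lr_end_signalling() and
dma_fence_lr_add_callback() are introduced by this patch:

/* Annotate a path that must be able to signal a long-running fence;
 * lockdep will now complain if memory management locks are taken inside
 * the section.
 */
static void example_signal_lr_fence(struct example_queue *q)
{
	bool cookie = dma_fence_lr_begin_signalling();

	dma_fence_signal(q->lr_fence);	/* q->lr_fence: assumed long-running fence */

	dma_fence_lr_end_signalling(cookie);
}

/* Chaining on a long-running fence requires the _lr_ variant, which signs
 * off on the callback possibly not running anytime soon; the plain
 * dma_fence_add_callback() would WARN on an lr fence under lockdep.
 */
static int example_chain_on_lr_fence(struct dma_fence *lr_fence,
				     struct dma_fence_cb *cb,
				     dma_fence_func_t func)
{
	return dma_fence_lr_add_callback(lr_fence, cb, func);
}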

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
---
 drivers/dma-buf/dma-fence.c | 142 ++++++++++++++++++++++++++----------
 drivers/dma-buf/dma-resv.c  |   5 ++
 include/linux/dma-fence.h   |  55 +++++++++++++-
 3 files changed, 160 insertions(+), 42 deletions(-)

diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
index f177c56269bb..9726b2a3c67d 100644
--- a/drivers/dma-buf/dma-fence.c
+++ b/drivers/dma-buf/dma-fence.c
@@ -111,6 +111,20 @@ static atomic64_t dma_fence_context_counter = ATOMIC64_INIT(1);
  * drivers/gpu should ever call dma_fence_wait() in such contexts.
  */
 
+/**
+ * DOC: Long-Running (lr) dma-fences.
+ *
+ * * Long-running dma-fences are NOT required to complete in reasonable time.
+ *   Typically they signal completion of user-space controlled workloads and
+ *   as such, need to never be part of a cross-driver contract, never waited
+ *   for inside a kernel lock, nor attached to a dma-resv. There are helpers
+ *   and warnings in place to help ensure that this never happens.
+ *
+ * * The motivation for their existence is that helpers that are intended to
+ *   be used by drivers may use dma-fences that, given the workloads mentioned
+ *   above, become long-running.
+ */
+
 static const char *dma_fence_stub_get_name(struct dma_fence *fence)
 {
         return "stub";
@@ -284,6 +298,34 @@ static struct lockdep_map dma_fence_lockdep_map = {
 	.name = "dma_fence_map"
 };
 
+static struct lockdep_map dma_fence_lr_lockdep_map = {
+	.name = "dma_fence_lr_map"
+};
+
+static bool __dma_fence_begin_signalling(struct lockdep_map *map)
+{
+	/* explicitly nesting ... */
+	if (lock_is_held_type(map, 1))
+		return true;
+
+	/* rely on might_sleep check for soft/hardirq locks */
+	if (in_atomic())
+		return true;
+
+	/* ... and non-recursive readlock */
+	lock_acquire(map, 0, 0, 1, 1, NULL, _RET_IP_);
+
+	return false;
+}
+
+static void __dma_fence_end_signalling(bool cookie, struct lockdep_map *map)
+{
+	if (cookie)
+		return;
+
+	lock_release(map, _RET_IP_);
+}
+
 /**
  * dma_fence_begin_signalling - begin a critical DMA fence signalling section
  *
@@ -300,18 +342,7 @@ static struct lockdep_map dma_fence_lockdep_map = {
  */
 bool dma_fence_begin_signalling(void)
 {
-	/* explicitly nesting ... */
-	if (lock_is_held_type(&dma_fence_lockdep_map, 1))
-		return true;
-
-	/* rely on might_sleep check for soft/hardirq locks */
-	if (in_atomic())
-		return true;
-
-	/* ... and non-recursive readlock */
-	lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _RET_IP_);
-
-	return false;
+	return __dma_fence_begin_signalling(&dma_fence_lockdep_map);
 }
 EXPORT_SYMBOL(dma_fence_begin_signalling);
 
@@ -323,25 +354,61 @@ EXPORT_SYMBOL(dma_fence_begin_signalling);
  */
 void dma_fence_end_signalling(bool cookie)
 {
-	if (cookie)
-		return;
-
-	lock_release(&dma_fence_lockdep_map, _RET_IP_);
+	__dma_fence_end_signalling(cookie, &dma_fence_lockdep_map);
 }
 EXPORT_SYMBOL(dma_fence_end_signalling);
 
-void __dma_fence_might_wait(void)
+/**
+ * dma_fence_lr_begin_signalling - begin a critical long-running DMA fence
+ * signalling section
+ *
+ * Drivers should use this to annotate the beginning of any code section
+ * required to eventually complete &dma_fence by calling dma_fence_signal().
+ *
+ * The end of these critical sections are annotated with
+ * dma_fence_lr_end_signalling(). Ideally the section should encompass all
+ * locks that are ever required to signal a long-running dma-fence.
+ *
+ * Return: An opaque cookie needed by the implementation, which needs to be
+ * passed to dma_fence_lr_end_signalling().
+ */
+bool dma_fence_lr_begin_signalling(void)
+{
+	return __dma_fence_begin_signalling(&dma_fence_lr_lockdep_map);
+}
+EXPORT_SYMBOL(dma_fence_lr_begin_signalling);
+
+/**
+ * dma_fence_lr_end_signalling - end a critical DMA fence signalling section
+ * @cookie: opaque cookie from dma_fence_lr_begin_signalling()
+ *
+ * Closes a critical section annotation opened by
+ * dma_fence_lr_begin_signalling().
+ */
+void dma_fence_lr_end_signalling(bool cookie)
+{
+	__dma_fence_end_signalling(cookie, &dma_fence_lr_lockdep_map);
+}
+EXPORT_SYMBOL(dma_fence_lr_end_signalling);
+
+static void ___dma_fence_might_wait(struct lockdep_map *map)
 {
 	bool tmp;
 
-	tmp = lock_is_held_type(&dma_fence_lockdep_map, 1);
+	tmp = lock_is_held_type(map, 1);
 	if (tmp)
-		lock_release(&dma_fence_lockdep_map, _THIS_IP_);
-	lock_map_acquire(&dma_fence_lockdep_map);
-	lock_map_release(&dma_fence_lockdep_map);
+		lock_release(map, _THIS_IP_);
+	lock_map_acquire(map);
+	lock_map_release(map);
 	if (tmp)
-		lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _THIS_IP_);
+		lock_acquire(map, 0, 0, 1, 1, NULL, _THIS_IP_);
+}
+
+void __dma_fence_might_wait(void)
+{
+	___dma_fence_might_wait(&dma_fence_lockdep_map);
 }
+
 #endif
 
 
@@ -506,7 +573,11 @@ dma_fence_wait_timeout(struct dma_fence *fence, bool intr, signed long timeout)
 
 	might_sleep();
 
-	__dma_fence_might_wait();
+#ifdef CONFIG_LOCKDEP
+	___dma_fence_might_wait(dma_fence_is_lr(fence) ?
+				&dma_fence_lr_lockdep_map :
+				&dma_fence_lockdep_map);
+#endif
 
 	dma_fence_enable_sw_signaling(fence);
 
@@ -618,29 +689,22 @@ void dma_fence_enable_sw_signaling(struct dma_fence *fence)
 EXPORT_SYMBOL(dma_fence_enable_sw_signaling);
 
 /**
- * dma_fence_add_callback - add a callback to be called when the fence
+ * dma_fence_lr_add_callback - add a callback to be called when the fence
  * is signaled
  * @fence: the fence to wait on
  * @cb: the callback to register
  * @func: the function to call
  *
- * Add a software callback to the fence. The caller should keep a reference to
- * the fence.
- *
- * @cb will be initialized by dma_fence_add_callback(), no initialization
- * by the caller is required. Any number of callbacks can be registered
- * to a fence, but a callback can only be registered to one fence at a time.
- *
- * If fence is already signaled, this function will return -ENOENT (and
- * *not* call the callback).
- *
- * Note that the callback can be called from an atomic context or irq context.
+ * This function is identical to dma_fence_add_callback() but allows adding
+ * callbacks also to lr dma-fences. The naming helps annotate the fact that
+ * we're adding a callback to an lr fence and that the callback might therefore
+ * not be called within a reasonable amount of time.
  *
- * Returns 0 in case of success, -ENOENT if the fence is already signaled
+ * Return: 0 in case of success, -ENOENT if the fence is already signaled
  * and -EINVAL in case of error.
  */
-int dma_fence_add_callback(struct dma_fence *fence, struct dma_fence_cb *cb,
-			   dma_fence_func_t func)
+int dma_fence_lr_add_callback(struct dma_fence *fence, struct dma_fence_cb *cb,
+			      dma_fence_func_t func)
 {
 	unsigned long flags;
 	int ret = 0;
@@ -667,7 +731,7 @@ int dma_fence_add_callback(struct dma_fence *fence, struct dma_fence_cb *cb,
 
 	return ret;
 }
-EXPORT_SYMBOL(dma_fence_add_callback);
+EXPORT_SYMBOL(dma_fence_lr_add_callback);
 
 /**
  * dma_fence_get_status - returns the status upon completion
diff --git a/drivers/dma-buf/dma-resv.c b/drivers/dma-buf/dma-resv.c
index 2a594b754af1..fa0210c1442e 100644
--- a/drivers/dma-buf/dma-resv.c
+++ b/drivers/dma-buf/dma-resv.c
@@ -292,6 +292,7 @@ void dma_resv_add_fence(struct dma_resv *obj, struct dma_fence *fence,
 	 * individually.
 	 */
 	WARN_ON(dma_fence_is_container(fence));
+	WARN_ON_ONCE(dma_fence_is_lr(fence));
 
 	fobj = dma_resv_fences_list(obj);
 	count = fobj->num_fences;
@@ -340,6 +341,7 @@ void dma_resv_replace_fences(struct dma_resv *obj, uint64_t context,
 	unsigned int i;
 
 	dma_resv_assert_held(obj);
+	WARN_ON_ONCE(dma_fence_is_lr(replacement));
 
 	list = dma_resv_fences_list(obj);
 	for (i = 0; list && i < list->num_fences; ++i) {
@@ -764,6 +766,7 @@ static int __init dma_resv_lockdep(void)
 	struct ww_acquire_ctx ctx;
 	struct dma_resv obj;
 	struct address_space mapping;
+	bool lr_cookie;
 	int ret;
 
 	if (!mm)
@@ -772,6 +775,7 @@ static int __init dma_resv_lockdep(void)
 	dma_resv_init(&obj);
 	address_space_init_once(&mapping);
 
+	lr_cookie = dma_fence_lr_begin_signalling();
 	mmap_read_lock(mm);
 	ww_acquire_init(&ctx, &reservation_ww_class);
 	ret = dma_resv_lock(&obj, &ctx);
@@ -792,6 +796,7 @@ static int __init dma_resv_lockdep(void)
 	ww_mutex_unlock(&obj.lock);
 	ww_acquire_fini(&ctx);
 	mmap_read_unlock(mm);
+	dma_fence_lr_end_signalling(lr_cookie);
 
 	mmput(mm);
 
diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h
index d54b595a0fe0..08d21e26782b 100644
--- a/include/linux/dma-fence.h
+++ b/include/linux/dma-fence.h
@@ -99,6 +99,7 @@ enum dma_fence_flag_bits {
 	DMA_FENCE_FLAG_SIGNALED_BIT,
 	DMA_FENCE_FLAG_TIMESTAMP_BIT,
 	DMA_FENCE_FLAG_ENABLE_SIGNAL_BIT,
+	DMA_FENCE_FLAG_LR_BIT,
 	DMA_FENCE_FLAG_USER_BITS, /* must always be last member */
 };
 
@@ -279,6 +280,11 @@ struct dma_fence_ops {
 	void (*set_deadline)(struct dma_fence *fence, ktime_t deadline);
 };
 
+static inline bool dma_fence_is_lr(const struct dma_fence *fence)
+{
+	return test_bit(DMA_FENCE_FLAG_LR_BIT, &fence->flags);
+}
+
 void dma_fence_init(struct dma_fence *fence, const struct dma_fence_ops *ops,
 		    spinlock_t *lock, u64 context, u64 seqno);
 
@@ -377,13 +383,23 @@ dma_fence_get_rcu_safe(struct dma_fence __rcu **fencep)
 #ifdef CONFIG_LOCKDEP
 bool dma_fence_begin_signalling(void);
 void dma_fence_end_signalling(bool cookie);
+bool dma_fence_lr_begin_signalling(void);
+void dma_fence_lr_end_signalling(bool cookie);
 void __dma_fence_might_wait(void);
 #else
+
 static inline bool dma_fence_begin_signalling(void)
 {
 	return true;
 }
+
 static inline void dma_fence_end_signalling(bool cookie) {}
+static inline bool dma_fence_lr_begin_signalling(void)
+{
+	return true;
+}
+
+static inline void dma_fence_lr_end_signalling(bool cookie) {}
 static inline void __dma_fence_might_wait(void) {}
 #endif
 
@@ -394,9 +410,42 @@ int dma_fence_signal_timestamp_locked(struct dma_fence *fence,
 				      ktime_t timestamp);
 signed long dma_fence_default_wait(struct dma_fence *fence,
 				   bool intr, signed long timeout);
-int dma_fence_add_callback(struct dma_fence *fence,
-			   struct dma_fence_cb *cb,
-			   dma_fence_func_t func);
+
+int dma_fence_lr_add_callback(struct dma_fence *fence,
+			      struct dma_fence_cb *cb,
+			      dma_fence_func_t func);
+
+/**
+ * dma_fence_add_callback - add a callback to be called when the fence
+ * is signaled
+ * @fence: the fence to wait on
+ * @cb: the callback to register
+ * @func: the function to call
+ *
+ * Add a software callback to the fence. The caller should keep a reference to
+ * the fence.
+ *
+ * @cb will be initialized by dma_fence_add_callback(), no initialization
+ * by the caller is required. Any number of callbacks can be registered
+ * to a fence, but a callback can only be registered to one fence at a time.
+ *
+ * If fence is already signaled, this function will return -ENOENT (and
+ * *not* call the callback).
+ *
+ * Note that the callback can be called from an atomic context or irq context.
+ *
+ * Returns 0 in case of success, -ENOENT if the fence is already signaled
+ * and -EINVAL in case of error.
+ */
+static inline int dma_fence_add_callback(struct dma_fence *fence,
+					 struct dma_fence_cb *cb,
+					 dma_fence_func_t func)
+{
+	WARN_ON(IS_ENABLED(CONFIG_LOCKDEP) && dma_fence_is_lr(fence));
+
+	return dma_fence_lr_add_callback(fence, cb, func);
+}
+
 bool dma_fence_remove_callback(struct dma_fence *fence,
 			       struct dma_fence_cb *cb);
 void dma_fence_enable_sw_signaling(struct dma_fence *fence);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [Intel-xe] [RFC PATCH 09/10] drm/sched: Support long-running sched entities
  2023-04-04  0:22 [Intel-xe] [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans Matthew Brost
                   ` (7 preceding siblings ...)
  2023-04-04  0:22 ` [Intel-xe] [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences Matthew Brost
@ 2023-04-04  0:22 ` Matthew Brost
  2023-04-04  0:22 ` [Intel-xe] [RFC PATCH 10/10] drm/syncobj: Warn on long running dma-fences Matthew Brost
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 87+ messages in thread
From: Matthew Brost @ 2023-04-04  0:22 UTC (permalink / raw)
  To: dri-devel, intel-xe
  Cc: robdclark, airlied, lina, boris.brezillon, daniel,
	christian.koenig, faith.ekstrand

From: Thomas Hellström <thomas.hellstrom@linux.intel.com>

Make the drm scheduler aware of long-running dma fences by

* Enable marking a sched entity as producing long-running fences.
* Disallow long-running fences as dependencies for non-long-running
  sched entities, while allowing them for long-running sched entities.
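
As a rough sketch of how a driver might opt an entity in (the example_*
names are invented; drm_sched_entity_set_lr() is what this patch provides,
and the init call reflects the existing drm_sched_entity_init() API):

static int example_init_lr_queue(struct example_queue *q,
				 struct drm_gpu_scheduler *sched)
{
	int err;

	err = drm_sched_entity_init(&q->entity, DRM_SCHED_PRIORITY_NORMAL,
				    &sched, 1, NULL);
	if (err)
		return err;

	/* Fences created for this entity are now flagged long-running and
	 * dependency callbacks are added via dma_fence_lr_add_callback().
	 */
	drm_sched_entity_set_lr(&q->entity);

	return 0;
}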

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
---
 drivers/gpu/drm/scheduler/sched_entity.c | 44 +++++++++++++++++++-----
 drivers/gpu/drm/scheduler/sched_fence.c  |  4 +++
 drivers/gpu/drm/scheduler/sched_main.c   |  9 ++---
 include/drm/gpu_scheduler.h              | 36 +++++++++++++++++++
 include/linux/dma-fence.h                |  5 +++
 5 files changed, 86 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index ccea4d079d0f..0640fc9d4491 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -174,6 +174,32 @@ static void drm_sched_entity_kill_jobs_work(struct work_struct *wrk)
 	job->sched->ops->free_job(job);
 }
 
+/**
+ * drm_sched_entity_add_fence_cb() - Helper to add a fence callback
+ * @entity: The sched entity
+ * @f: The possibly long-running dma-fence on which to add a callback
+ * @cb: The struct dma_fence_cb to use for the callback
+ * @func: The callback function.
+ *
+ * This function calls the proper dma_fence add callback function
+ * depending on whether @entity is marked as long-running or not. If it
+ * is not, this will make sure we get a warning if trying to add a
+ * callback on a long-running dma-fence.
+ *
+ * Return: Zero on success, -ENOENT if already signaled and -EINVAL in case
+ * of error.
+ */
+int drm_sched_entity_add_fence_cb(struct drm_sched_entity *entity,
+				  struct dma_fence *f,
+				  struct dma_fence_cb *cb,
+				  dma_fence_func_t func)
+{
+	if (drm_sched_entity_is_lr(entity))
+		return dma_fence_lr_add_callback(f, cb, func);
+
+	return dma_fence_add_callback(f, cb, func);
+}
+
 /* Signal the scheduler finished fence when the entity in question is killed. */
 static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
 					  struct dma_fence_cb *cb)
@@ -187,8 +213,8 @@ static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
 	/* Wait for all dependencies to avoid data corruptions */
 	while (!xa_empty(&job->dependencies)) {
 		f = xa_erase(&job->dependencies, job->last_dependency++);
-		r = dma_fence_add_callback(f, &job->finish_cb,
-					   drm_sched_entity_kill_jobs_cb);
+		r = drm_sched_entity_add_fence_cb(job->entity, f, &job->finish_cb,
+						  drm_sched_entity_kill_jobs_cb);
 		if (!r)
 			return;
 
@@ -226,8 +252,9 @@ static void drm_sched_entity_kill(struct drm_sched_entity *entity)
 		dma_fence_set_error(&s_fence->finished, -ESRCH);
 
 		dma_fence_get(&s_fence->finished);
-		if (!prev || dma_fence_add_callback(prev, &job->finish_cb,
-					   drm_sched_entity_kill_jobs_cb))
+		if (!prev || drm_sched_entity_add_fence_cb(job->entity, prev,
+							   &job->finish_cb,
+							   drm_sched_entity_kill_jobs_cb))
 			drm_sched_entity_kill_jobs_cb(NULL, &job->finish_cb);
 
 		prev = &s_fence->finished;
@@ -420,8 +447,8 @@ static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
 		fence = dma_fence_get(&s_fence->scheduled);
 		dma_fence_put(entity->dependency);
 		entity->dependency = fence;
-		if (!dma_fence_add_callback(fence, &entity->cb,
-					    drm_sched_entity_clear_dep))
+		if (!drm_sched_entity_add_fence_cb(entity, fence, &entity->cb,
+						   drm_sched_entity_clear_dep))
 			return true;
 
 		/* Ignore it when it is already scheduled */
@@ -429,8 +456,9 @@ static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
 		return false;
 	}
 
-	if (!dma_fence_add_callback(entity->dependency, &entity->cb,
-				    drm_sched_entity_wakeup))
+	if (!drm_sched_entity_add_fence_cb(entity, entity->dependency,
+					   &entity->cb,
+					   drm_sched_entity_wakeup))
 		return true;
 
 	dma_fence_put(entity->dependency);
diff --git a/drivers/gpu/drm/scheduler/sched_fence.c b/drivers/gpu/drm/scheduler/sched_fence.c
index d7cfc0441885..a566723ecc2c 100644
--- a/drivers/gpu/drm/scheduler/sched_fence.c
+++ b/drivers/gpu/drm/scheduler/sched_fence.c
@@ -217,8 +217,12 @@ void drm_sched_fence_init(struct drm_sched_fence *fence,
 	seq = atomic_inc_return(&entity->fence_seq);
 	dma_fence_init(&fence->scheduled, &drm_sched_fence_ops_scheduled,
 		       &fence->lock, entity->fence_context, seq);
+	if (drm_sched_entity_is_lr(entity))
+		dma_fence_set_lr(&fence->scheduled);
 	dma_fence_init(&fence->finished, &drm_sched_fence_ops_finished,
 		       &fence->lock, entity->fence_context + 1, seq);
+	if (drm_sched_entity_is_lr(entity))
+		dma_fence_set_lr(&fence->finished);
 }
 
 module_init(drm_sched_fence_slab_init);
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index d61880315d8d..76336a31aa82 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -618,8 +618,8 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
 			continue;
 
 		if (fence) {
-			r = dma_fence_add_callback(fence, &s_job->cb,
-						   drm_sched_job_done_cb);
+			r = drm_sched_entity_add_fence_cb(s_job->entity, fence,
+							  &s_job->cb, drm_sched_job_done_cb);
 			if (r == -ENOENT)
 				drm_sched_job_done(s_job);
 			else if (r)
@@ -1180,8 +1180,9 @@ static void drm_sched_main(struct work_struct *w)
 			/* Drop for original kref_init of the fence */
 			dma_fence_put(fence);
 
-			r = dma_fence_add_callback(fence, &sched_job->cb,
-						   drm_sched_job_done_cb);
+			r = drm_sched_entity_add_fence_cb(sched_job->entity, fence,
+							  &sched_job->cb,
+							  drm_sched_job_done_cb);
 			if (r == -ENOENT)
 				drm_sched_job_done(sched_job);
 			else if (r)
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 6258e324bd7c..546507852771 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -142,6 +142,16 @@ struct drm_sched_entity {
 	 */
 	unsigned int                    num_sched_list;
 
+	/**
+	 * @flags: Flags to govern the behaviour:
+	 *
+	 * DRM_SCHED_ENTITY_LR: The entity handles long-running jobs and
+	 * produces long-running completion fences, as well as accepts
+	 * long-running dependency fences.
+	 */
+	u32                             flags;
+#define DRM_SCHED_ENTITY_LR             BIT(0)
+
 	/**
 	 * @priority:
 	 *
@@ -253,6 +263,32 @@ struct drm_sched_entity {
 
 };
 
+/**
+ * drm_sched_entity_is_lr() - Whether the entity manages long-running jobs.
+ * @entity: The entity.
+ *
+ * Return: true if managing long-running jobs. Otherwise false.
+ */
+static inline bool drm_sched_entity_is_lr(const struct drm_sched_entity *entity)
+{
+	return entity->flags & DRM_SCHED_ENTITY_LR;
+}
+
+/**
+ * drm_sched_entity_set_lr() - Mark the entity as managing long-running jobs.
+ * @entity: The entity.
+ *
+ */
+static inline void drm_sched_entity_set_lr(struct drm_sched_entity *entity)
+{
+	entity->flags |= DRM_SCHED_ENTITY_LR;
+}
+
+int drm_sched_entity_add_fence_cb(struct drm_sched_entity *entity,
+				  struct dma_fence *f,
+				  struct dma_fence_cb *cb,
+				  dma_fence_func_t func);
+
 /**
  * struct drm_sched_rq - queue of entities to be scheduled.
  *
diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h
index 08d21e26782b..b513811ce536 100644
--- a/include/linux/dma-fence.h
+++ b/include/linux/dma-fence.h
@@ -285,6 +285,11 @@ static inline bool dma_fence_is_lr(const struct dma_fence *fence)
 	return test_bit(DMA_FENCE_FLAG_LR_BIT, &fence->flags);
 }
 
+static inline void dma_fence_set_lr(struct dma_fence *fence)
+{
+	__set_bit(DMA_FENCE_FLAG_LR_BIT, &fence->flags);
+}
+
 void dma_fence_init(struct dma_fence *fence, const struct dma_fence_ops *ops,
 		    spinlock_t *lock, u64 context, u64 seqno);
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [Intel-xe] [RFC PATCH 10/10] drm/syncobj: Warn on long running dma-fences
  2023-04-04  0:22 [Intel-xe] [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans Matthew Brost
                   ` (8 preceding siblings ...)
  2023-04-04  0:22 ` [Intel-xe] [RFC PATCH 09/10] drm/sched: Support long-running sched entities Matthew Brost
@ 2023-04-04  0:22 ` Matthew Brost
  2023-04-04  0:24 ` [Intel-xe] ✗ CI.Patch_applied: failure for Xe DRM scheduler and long running workload plans Patchwork
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 87+ messages in thread
From: Matthew Brost @ 2023-04-04  0:22 UTC (permalink / raw)
  To: dri-devel, intel-xe
  Cc: robdclark, airlied, lina, Matthew Brost, boris.brezillon, daniel,
	christian.koenig, faith.ekstrand

Long-running dma-fences are not allowed to be exported, while a drm_syncobj
is designed to be exported to the user, so add a warning if a drm_syncobj
installs a long-running dma-fence, as this is not allowed.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/drm_syncobj.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/drm_syncobj.c b/drivers/gpu/drm/drm_syncobj.c
index 0c2be8360525..7c304cd7d037 100644
--- a/drivers/gpu/drm/drm_syncobj.c
+++ b/drivers/gpu/drm/drm_syncobj.c
@@ -291,6 +291,7 @@ void drm_syncobj_add_point(struct drm_syncobj *syncobj,
 	struct syncobj_wait_entry *cur, *tmp;
 	struct dma_fence *prev;
 
+	WARN_ON_ONCE(dma_fence_is_lr(fence));
 	dma_fence_get(fence);
 
 	spin_lock(&syncobj->lock);
@@ -325,8 +326,10 @@ void drm_syncobj_replace_fence(struct drm_syncobj *syncobj,
 	struct dma_fence *old_fence;
 	struct syncobj_wait_entry *cur, *tmp;
 
-	if (fence)
+	if (fence) {
+		WARN_ON_ONCE(dma_fence_is_lr(fence));
 		dma_fence_get(fence);
+	}
 
 	spin_lock(&syncobj->lock);
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [Intel-xe] ✗ CI.Patch_applied: failure for Xe DRM scheduler and long running workload plans
  2023-04-04  0:22 [Intel-xe] [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans Matthew Brost
                   ` (9 preceding siblings ...)
  2023-04-04  0:22 ` [Intel-xe] [RFC PATCH 10/10] drm/syncobj: Warn on long running dma-fences Matthew Brost
@ 2023-04-04  0:24 ` Patchwork
  2023-04-04  1:07 ` [Intel-xe] [RFC PATCH 00/10] " Asahi Lina
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 87+ messages in thread
From: Patchwork @ 2023-04-04  0:24 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-xe

== Series Details ==

Series: Xe DRM scheduler and long running workload plans
URL   : https://patchwork.freedesktop.org/series/116054/
State : failure

== Summary ==

=== Applying kernel patches on branch 'drm-xe-next' with base: ===
commit 63b79d536e96e045be4f6c63947c7d42e8dbf600
Author:     Lucas De Marchi <lucas.demarchi@intel.com>
AuthorDate: Fri Mar 31 16:09:02 2023 -0700
Commit:     Lucas De Marchi <lucas.demarchi@intel.com>
CommitDate: Mon Apr 3 13:41:08 2023 -0700

    drm/xe: Fix platform order
    
    Platform order in enum xe_platform started to be used by some parts of
    the code, like the GuC/HuC firmware loading logic. The order itself is
    not very important, but it's better to follow a convention: as was
    documented in the comment above the enum, reorder the platforms by
    graphics version. While at it, remove the gen terminology.
    
    v2:
      - Use "graphics version" instead of chronological order (Matt Roper)
      - Also change pciidlist to follow the same order
      - Remove "gen" from comments around enum xe_platform
    
    Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
    Reviewed-by: Matt Roper <matthew.d.roper@intel.com>
    Link: https://lore.kernel.org/r/20230331230902.1603294-1-lucas.demarchi@intel.com
=== git am output follows ===
error: patch failed: drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c:1489
error: drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c: patch does not apply
error: patch failed: drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:4627
error: drivers/gpu/drm/amd/amdgpu/amdgpu_device.c: patch does not apply
error: patch failed: drivers/gpu/drm/msm/adreno/adreno_device.c:662
error: drivers/gpu/drm/msm/adreno/adreno_device.c: patch does not apply
error: patch failed: drivers/gpu/drm/scheduler/sched_main.c:44
error: drivers/gpu/drm/scheduler/sched_main.c: patch does not apply
error: patch failed: include/drm/gpu_scheduler.h:473
error: include/drm/gpu_scheduler.h: patch does not apply
hint: Use 'git am --show-current-patch' to see the failed patch
Applying: drm/sched: Convert drm scheduler to use a work queue rather than kthread
Patch failed at 0001 drm/sched: Convert drm scheduler to use a work queue rather than kthread
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".



^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans
  2023-04-04  0:22 [Intel-xe] [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans Matthew Brost
                   ` (10 preceding siblings ...)
  2023-04-04  0:24 ` [Intel-xe] ✗ CI.Patch_applied: failure for Xe DRM scheduler and long running workload plans Patchwork
@ 2023-04-04  1:07 ` Asahi Lina
  2023-04-04  1:58   ` Matthew Brost
  2023-04-04  9:04 ` Christian König
                   ` (4 subsequent siblings)
  16 siblings, 1 reply; 87+ messages in thread
From: Asahi Lina @ 2023-04-04  1:07 UTC (permalink / raw)
  To: Matthew Brost, dri-devel, intel-xe
  Cc: robdclark, airlied, boris.brezillon, daniel, christian.koenig,
	faith.ekstrand

Hi, thanks for the Cc!

On 04/04/2023 09.22, Matthew Brost wrote:
> Hello,
> 
> As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> have been asked to merge our common DRM scheduler patches first as well
> as develop a common solution for long running workloads with the DRM
> scheduler. This RFC series is our first attempt at doing this. We
> welcome any and all feedback.
> 
> This can we thought of as 4 parts detailed below.
> 
> - DRM scheduler changes for 1 to 1 relationship between scheduler and
> entity (patches 1-3)
> 
> In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> GuC) which is a new paradigm WRT to the DRM scheduler and presents
> severals problems as the DRM was originally designed to schedule jobs on
> hardware queues. The main problem being that DRM scheduler expects the
> submission order of jobs to be the completion order of jobs even across
> multiple entities. This assumption falls apart with a firmware scheduler
> as a firmware scheduler has no concept of jobs and jobs can complete out
> of order. A novel solution for was originally thought of by Faith during
> the initial prototype of Xe, create a 1 to 1 relationship between scheduler
> and entity. I believe the AGX driver [3] is using this approach and
> Boris may use approach as well for the Mali driver [4].
> 
> To support a 1 to 1 relationship we move the main execution function
> from a kthread to a work queue and add a new scheduling mode which
> bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
> The new scheduling mode should unify all drivers usage with a 1 to 1
> relationship and can be thought of as using scheduler as a dependency /
> infligt job tracker rather than a true scheduler.

Yup, we're in the exact same situation with drm/asahi, so this is very 
welcome! We've been using the existing scheduler as-is, but this should 
help remove some unneeded complexity in this use case.

Do you want me to pull in this series into our tree and make sure this 
all works out for us?

I also have a couple bugfixes for drm/sched I need to send out, but I 
think the rebase/merge with this series should be trivial. I'll send 
that out this week.

> - Generic messaging interface for DRM scheduler
> 
> Idea is to be able to communicate to the submission backend with in band
> (relative to main execution function) messages. Messages are backend
> defined and flexable enough for any use case. In Xe we use these
> messages to clean up entites, set properties for entites, and suspend /
> resume execution of an entity [5]. I suspect other driver can leverage
> this messaging concept too as it a convenient way to avoid races in the
> backend.

We haven't needed this so far (mostly by using fine-grained locking and 
refcounting all over the place) but I can see it being useful to 
simplify some of those constructs and maybe avoid potential deadlocks in 
some places. I'm not sure yet whether we can fully get rid of the main 
queue refcounting/locking (our completion/error signaling path doesn't 
map well to DMA fences directly so we still need something there to get 
from the global GPU completion signaling thread to individual queues) 
but it might be a step in the right direction at least!

~~ Lina


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans
  2023-04-04  1:07 ` [Intel-xe] [RFC PATCH 00/10] " Asahi Lina
@ 2023-04-04  1:58   ` Matthew Brost
  2023-04-08  7:05     ` Asahi Lina
  0 siblings, 1 reply; 87+ messages in thread
From: Matthew Brost @ 2023-04-04  1:58 UTC (permalink / raw)
  To: Asahi Lina
  Cc: robdclark, airlied, dri-devel, christian.koenig, boris.brezillon,
	daniel, intel-xe, faith.ekstrand

On Tue, Apr 04, 2023 at 10:07:48AM +0900, Asahi Lina wrote:
> Hi, thanks for the Cc!
> 

No problem.

> On 04/04/2023 09.22, Matthew Brost wrote:
> > Hello,
> > 
> > As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> > have been asked to merge our common DRM scheduler patches first as well
> > as develop a common solution for long running workloads with the DRM
> > scheduler. This RFC series is our first attempt at doing this. We
> > welcome any and all feedback.
> > 
> > This can we thought of as 4 parts detailed below.
> > 
> > - DRM scheduler changes for 1 to 1 relationship between scheduler and
> > entity (patches 1-3)
> > 
> > In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> > GuC) which is a new paradigm WRT to the DRM scheduler and presents
> > severals problems as the DRM was originally designed to schedule jobs on
> > hardware queues. The main problem being that DRM scheduler expects the
> > submission order of jobs to be the completion order of jobs even across
> > multiple entities. This assumption falls apart with a firmware scheduler
> > as a firmware scheduler has no concept of jobs and jobs can complete out
> > of order. A novel solution for was originally thought of by Faith during
> > the initial prototype of Xe, create a 1 to 1 relationship between scheduler
> > and entity. I believe the AGX driver [3] is using this approach and
> > Boris may use approach as well for the Mali driver [4].
> > 
> > To support a 1 to 1 relationship we move the main execution function
> > from a kthread to a work queue and add a new scheduling mode which
> > bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
> > The new scheduling mode should unify all drivers usage with a 1 to 1
> > relationship and can be thought of as using scheduler as a dependency /
> > infligt job tracker rather than a true scheduler.
> 
> Yup, we're in the exact same situation with drm/asahi, so this is very
> welcome! We've been using the existing scheduler as-is, but this should help
> remove some unneeded complexity in this use case.
>

That's the idea.

> Do you want me to pull in this series into our tree and make sure this all
> works out for us?
>

We tested this in Xe and it definitely works for us but the more testing
the better.

> I also have a couple bugfixes for drm/sched I need to send out, but I think
> the rebase/merge with this series should be trivial. I'll send that out this
> week.
> 
> > - Generic messaging interface for DRM scheduler
> > 
> > Idea is to be able to communicate to the submission backend with in band
> > (relative to main execution function) messages. Messages are backend
> > defined and flexable enough for any use case. In Xe we use these
> > messages to clean up entites, set properties for entites, and suspend /
> > resume execution of an entity [5]. I suspect other driver can leverage
> > this messaging concept too as it a convenient way to avoid races in the
> > backend.
> 
> We haven't needed this so far (mostly by using fine-grained locking and
> refcounting all over the place) but I can see it being useful to simplify
> some of those constructs and maybe avoid potential deadlocks in some places.
> I'm not sure yet whether we can fully get rid of the main queue
> refcounting/locking (our completion/error signaling path doesn't map well to
> DMA fences directly so we still need something there to get from the global
> GPU completion signaling thread to individual queues) but it might be a step
> in the right direction at least!
>

With this messaging interface we essentially have a lockless submission
backend which is really nice compared to what we did in the i915.

Matt

> ~~ Lina
> 

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans
  2023-04-04  0:22 [Intel-xe] [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans Matthew Brost
                   ` (11 preceding siblings ...)
  2023-04-04  1:07 ` [Intel-xe] [RFC PATCH 00/10] " Asahi Lina
@ 2023-04-04  9:04 ` Christian König
  2023-04-04 13:23   ` Matthew Brost
  2023-04-04  9:13 ` Christian König
                   ` (3 subsequent siblings)
  16 siblings, 1 reply; 87+ messages in thread
From: Christian König @ 2023-04-04  9:04 UTC (permalink / raw)
  To: Matthew Brost, dri-devel, intel-xe, Tuikov, Luben
  Cc: robdclark, airlied, lina, boris.brezillon, daniel, faith.ekstrand

Please make sure to CC Luben on scheduler patches.

Regards,
Christian.

Am 04.04.23 um 02:22 schrieb Matthew Brost:
> Hello,
>
> As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> have been asked to merge our common DRM scheduler patches first as well
> as develop a common solution for long running workloads with the DRM
> scheduler. This RFC series is our first attempt at doing this. We
> welcome any and all feedback.
>
> This can we thought of as 4 parts detailed below.
>
> - DRM scheduler changes for 1 to 1 relationship between scheduler and
> entity (patches 1-3)
>
> In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> GuC) which is a new paradigm WRT to the DRM scheduler and presents
> severals problems as the DRM was originally designed to schedule jobs on
> hardware queues. The main problem being that DRM scheduler expects the
> submission order of jobs to be the completion order of jobs even across
> multiple entities. This assumption falls apart with a firmware scheduler
> as a firmware scheduler has no concept of jobs and jobs can complete out
> of order. A novel solution for was originally thought of by Faith during
> the initial prototype of Xe, create a 1 to 1 relationship between scheduler
> and entity. I believe the AGX driver [3] is using this approach and
> Boris may use approach as well for the Mali driver [4].
>
> To support a 1 to 1 relationship we move the main execution function
> from a kthread to a work queue and add a new scheduling mode which
> bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
> The new scheduling mode should unify all drivers usage with a 1 to 1
> relationship and can be thought of as using scheduler as a dependency /
> infligt job tracker rather than a true scheduler.
>
> - Generic messaging interface for DRM scheduler
>
> Idea is to be able to communicate to the submission backend with in band
> (relative to main execution function) messages. Messages are backend
> defined and flexable enough for any use case. In Xe we use these
> messages to clean up entites, set properties for entites, and suspend /
> resume execution of an entity [5]. I suspect other driver can leverage
> this messaging concept too as it a convenient way to avoid races in the
> backend.
>
> - Support for using TDR for all error paths of a scheduler / entity
>
> Fix a few races / bugs, add function to dynamically set the TDR timeout.
>
> - Annotate dma-fences for long running workloads.
>
> The idea here is to use dma-fences only as sync points within the
> scheduler and never export them for long running workloads. By
> annotating these fences as long running we ensure that these dma-fences
> are never used in a way that breaks the dma-fence rules. A benefit of
> thus approach is the scheduler can still safely flow control the
> execution ring buffer via the job limit without breaking the dma-fence
> rules.
>
> Again this a first draft and looking forward to feedback.
>
> Enjoy - Matt
>
> [1] https://gitlab.freedesktop.org/drm/xe/kernel
> [2] https://patchwork.freedesktop.org/series/112188/
> [3] https://patchwork.freedesktop.org/series/114772/
> [4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1
> [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031
>
> Matthew Brost (8):
>    drm/sched: Convert drm scheduler to use a work queue rather than
>      kthread
>    drm/sched: Move schedule policy to scheduler / entity
>    drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
>    drm/sched: Add generic scheduler message interface
>    drm/sched: Start run wq before TDR in drm_sched_start
>    drm/sched: Submit job before starting TDR
>    drm/sched: Add helper to set TDR timeout
>    drm/syncobj: Warn on long running dma-fences
>
> Thomas Hellström (2):
>    dma-buf/dma-fence: Introduce long-running completion fences
>    drm/sched: Support long-running sched entities
>
>   drivers/dma-buf/dma-fence.c                 | 142 +++++++---
>   drivers/dma-buf/dma-resv.c                  |   5 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
>   drivers/gpu/drm/drm_syncobj.c               |   5 +-
>   drivers/gpu/drm/etnaviv/etnaviv_sched.c     |   5 +-
>   drivers/gpu/drm/lima/lima_sched.c           |   5 +-
>   drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
>   drivers/gpu/drm/msm/msm_ringbuffer.c        |   5 +-
>   drivers/gpu/drm/panfrost/panfrost_job.c     |   5 +-
>   drivers/gpu/drm/scheduler/sched_entity.c    | 127 +++++++--
>   drivers/gpu/drm/scheduler/sched_fence.c     |   6 +-
>   drivers/gpu/drm/scheduler/sched_main.c      | 278 +++++++++++++++-----
>   drivers/gpu/drm/v3d/v3d_sched.c             |  25 +-
>   include/drm/gpu_scheduler.h                 | 130 +++++++--
>   include/linux/dma-fence.h                   |  60 ++++-
>   16 files changed, 649 insertions(+), 184 deletions(-)
>


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences
  2023-04-04  0:22 ` [Intel-xe] [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences Matthew Brost
@ 2023-04-04  9:09   ` Christian König
  2023-04-04 12:54     ` Thomas Hellström
  0 siblings, 1 reply; 87+ messages in thread
From: Christian König @ 2023-04-04  9:09 UTC (permalink / raw)
  To: Matthew Brost, dri-devel, intel-xe
  Cc: robdclark, airlied, lina, boris.brezillon, daniel, faith.ekstrand

Am 04.04.23 um 02:22 schrieb Matthew Brost:
> From: Thomas Hellström <thomas.hellstrom@linux.intel.com>
>
> For long-running workloads, drivers either need to open-code completion
> waits, invent their own synchronization primitives or internally use
> dma-fences that do not obey the cross-driver dma-fence protocol, but
> without any lockdep annotation all these approaches are error prone.
>
> So since for example the drm scheduler uses dma-fences it is desirable for
> a driver to be able to use it for throttling and error handling also with
> internal dma-fences that do not obey the cross-driver dma-fence protocol.
>
> Introduce long-running completion fences in form of dma-fences, and add
> lockdep annotation for them. In particular:
>
> * Do not allow waiting under any memory management locks.
> * Do not allow to attach them to a dma-resv object.
> * Introduce a new interface for adding callbacks making the helper adding
>    a callback sign off on that it is aware that the dma-fence may not
>    complete anytime soon. Typically this will be the scheduler chaining
>    a new long-running fence on another one.

Well that's pretty much what I tried before: 
https://lwn.net/Articles/893704/

And the reasons why it was rejected haven't changed.

Regards,
Christian.

>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> ---
>   drivers/dma-buf/dma-fence.c | 142 ++++++++++++++++++++++++++----------
>   drivers/dma-buf/dma-resv.c  |   5 ++
>   include/linux/dma-fence.h   |  55 +++++++++++++-
>   3 files changed, 160 insertions(+), 42 deletions(-)
>
> diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
> index f177c56269bb..9726b2a3c67d 100644
> --- a/drivers/dma-buf/dma-fence.c
> +++ b/drivers/dma-buf/dma-fence.c
> @@ -111,6 +111,20 @@ static atomic64_t dma_fence_context_counter = ATOMIC64_INIT(1);
>    * drivers/gpu should ever call dma_fence_wait() in such contexts.
>    */
>   
> +/**
> + * DOC: Long-Running (lr) dma-fences.
> + *
> + * * Long-running dma-fences are NOT required to complete in reasonable time.
> + *   Typically they signal completion of user-space controlled workloads and
> + *   as such, need to never be part of a cross-driver contract, never waited
> + *   for inside a kernel lock, nor attached to a dma-resv. There are helpers
> + *   and warnings in place to help ensure that this never happens.
> + *
> + * * The motivation for their existence is that helpers that are intended to
> + *   be used by drivers may use dma-fences that, given the workloads mentioned
> + *   above, become long-running.
> + */
> +
>   static const char *dma_fence_stub_get_name(struct dma_fence *fence)
>   {
>           return "stub";
> @@ -284,6 +298,34 @@ static struct lockdep_map dma_fence_lockdep_map = {
>   	.name = "dma_fence_map"
>   };
>   
> +static struct lockdep_map dma_fence_lr_lockdep_map = {
> +	.name = "dma_fence_lr_map"
> +};
> +
> +static bool __dma_fence_begin_signalling(struct lockdep_map *map)
> +{
> +	/* explicitly nesting ... */
> +	if (lock_is_held_type(map, 1))
> +		return true;
> +
> +	/* rely on might_sleep check for soft/hardirq locks */
> +	if (in_atomic())
> +		return true;
> +
> +	/* ... and non-recursive readlock */
> +	lock_acquire(map, 0, 0, 1, 1, NULL, _RET_IP_);
> +
> +	return false;
> +}
> +
> +static void __dma_fence_end_signalling(bool cookie, struct lockdep_map *map)
> +{
> +	if (cookie)
> +		return;
> +
> +	lock_release(map, _RET_IP_);
> +}
> +
>   /**
>    * dma_fence_begin_signalling - begin a critical DMA fence signalling section
>    *
> @@ -300,18 +342,7 @@ static struct lockdep_map dma_fence_lockdep_map = {
>    */
>   bool dma_fence_begin_signalling(void)
>   {
> -	/* explicitly nesting ... */
> -	if (lock_is_held_type(&dma_fence_lockdep_map, 1))
> -		return true;
> -
> -	/* rely on might_sleep check for soft/hardirq locks */
> -	if (in_atomic())
> -		return true;
> -
> -	/* ... and non-recursive readlock */
> -	lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _RET_IP_);
> -
> -	return false;
> +	return __dma_fence_begin_signalling(&dma_fence_lockdep_map);
>   }
>   EXPORT_SYMBOL(dma_fence_begin_signalling);
>   
> @@ -323,25 +354,61 @@ EXPORT_SYMBOL(dma_fence_begin_signalling);
>    */
>   void dma_fence_end_signalling(bool cookie)
>   {
> -	if (cookie)
> -		return;
> -
> -	lock_release(&dma_fence_lockdep_map, _RET_IP_);
> +	__dma_fence_end_signalling(cookie, &dma_fence_lockdep_map);
>   }
>   EXPORT_SYMBOL(dma_fence_end_signalling);
>   
> -void __dma_fence_might_wait(void)
> +/**
> + * dma_fence_lr_begin_signalling - begin a critical long-running DMA fence
> + * signalling section
> + *
> + * Drivers should use this to annotate the beginning of any code section
> + * required to eventually complete &dma_fence by calling dma_fence_signal().
> + *
> + * The end of these critical sections are annotated with
> + * dma_fence_lr_end_signalling(). Ideally the section should encompass all
> + * locks that are ever required to signal a long-running dma-fence.
> + *
> + * Return: An opaque cookie needed by the implementation, which needs to be
> + * passed to dma_fence_lr_end_signalling().
> + */
> +bool dma_fence_lr_begin_signalling(void)
> +{
> +	return __dma_fence_begin_signalling(&dma_fence_lr_lockdep_map);
> +}
> +EXPORT_SYMBOL(dma_fence_lr_begin_signalling);
> +
> +/**
> + * dma_fence_lr_end_signalling - end a critical DMA fence signalling section
> + * @cookie: opaque cookie from dma_fence_lr_begin_signalling()
> + *
> + * Closes a critical section annotation opened by
> + * dma_fence_lr_begin_signalling().
> + */
> +void dma_fence_lr_end_signalling(bool cookie)
> +{
> +	__dma_fence_end_signalling(cookie, &dma_fence_lr_lockdep_map);
> +}
> +EXPORT_SYMBOL(dma_fence_lr_end_signalling);
> +
> +static void ___dma_fence_might_wait(struct lockdep_map *map)
>   {
>   	bool tmp;
>   
> -	tmp = lock_is_held_type(&dma_fence_lockdep_map, 1);
> +	tmp = lock_is_held_type(map, 1);
>   	if (tmp)
> -		lock_release(&dma_fence_lockdep_map, _THIS_IP_);
> -	lock_map_acquire(&dma_fence_lockdep_map);
> -	lock_map_release(&dma_fence_lockdep_map);
> +		lock_release(map, _THIS_IP_);
> +	lock_map_acquire(map);
> +	lock_map_release(map);
>   	if (tmp)
> -		lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _THIS_IP_);
> +		lock_acquire(map, 0, 0, 1, 1, NULL, _THIS_IP_);
> +}
> +
> +void __dma_fence_might_wait(void)
> +{
> +	___dma_fence_might_wait(&dma_fence_lockdep_map);
>   }
> +
>   #endif
>   
>   
> @@ -506,7 +573,11 @@ dma_fence_wait_timeout(struct dma_fence *fence, bool intr, signed long timeout)
>   
>   	might_sleep();
>   
> -	__dma_fence_might_wait();
> +#ifdef CONFIG_LOCKDEP
> +	___dma_fence_might_wait(dma_fence_is_lr(fence) ?
> +				&dma_fence_lr_lockdep_map :
> +				&dma_fence_lockdep_map);
> +#endif
>   
>   	dma_fence_enable_sw_signaling(fence);
>   
> @@ -618,29 +689,22 @@ void dma_fence_enable_sw_signaling(struct dma_fence *fence)
>   EXPORT_SYMBOL(dma_fence_enable_sw_signaling);
>   
>   /**
> - * dma_fence_add_callback - add a callback to be called when the fence
> + * dma_fence_lr_add_callback - add a callback to be called when the fence
>    * is signaled
>    * @fence: the fence to wait on
>    * @cb: the callback to register
>    * @func: the function to call
>    *
> - * Add a software callback to the fence. The caller should keep a reference to
> - * the fence.
> - *
> - * @cb will be initialized by dma_fence_add_callback(), no initialization
> - * by the caller is required. Any number of callbacks can be registered
> - * to a fence, but a callback can only be registered to one fence at a time.
> - *
> - * If fence is already signaled, this function will return -ENOENT (and
> - * *not* call the callback).
> - *
> - * Note that the callback can be called from an atomic context or irq context.
> + * This function is identical to dma_fence_add_callback() but allows adding
> + * callbacks also to lr dma-fences. The naming helps annotating the fact that
> + * we're adding a callback to an lr fence and that the callback might therefore
> + * not be called within a reasonable amount of time.
>    *
> - * Returns 0 in case of success, -ENOENT if the fence is already signaled
> + * Return: 0 in case of success, -ENOENT if the fence is already signaled
>    * and -EINVAL in case of error.
>    */
> -int dma_fence_add_callback(struct dma_fence *fence, struct dma_fence_cb *cb,
> -			   dma_fence_func_t func)
> +int dma_fence_lr_add_callback(struct dma_fence *fence, struct dma_fence_cb *cb,
> +			      dma_fence_func_t func)
>   {
>   	unsigned long flags;
>   	int ret = 0;
> @@ -667,7 +731,7 @@ int dma_fence_add_callback(struct dma_fence *fence, struct dma_fence_cb *cb,
>   
>   	return ret;
>   }
> -EXPORT_SYMBOL(dma_fence_add_callback);
> +EXPORT_SYMBOL(dma_fence_lr_add_callback);
>   
>   /**
>    * dma_fence_get_status - returns the status upon completion
> diff --git a/drivers/dma-buf/dma-resv.c b/drivers/dma-buf/dma-resv.c
> index 2a594b754af1..fa0210c1442e 100644
> --- a/drivers/dma-buf/dma-resv.c
> +++ b/drivers/dma-buf/dma-resv.c
> @@ -292,6 +292,7 @@ void dma_resv_add_fence(struct dma_resv *obj, struct dma_fence *fence,
>   	 * individually.
>   	 */
>   	WARN_ON(dma_fence_is_container(fence));
> +	WARN_ON_ONCE(dma_fence_is_lr(fence));
>   
>   	fobj = dma_resv_fences_list(obj);
>   	count = fobj->num_fences;
> @@ -340,6 +341,7 @@ void dma_resv_replace_fences(struct dma_resv *obj, uint64_t context,
>   	unsigned int i;
>   
>   	dma_resv_assert_held(obj);
> +	WARN_ON_ONCE(dma_fence_is_lr(replacement));
>   
>   	list = dma_resv_fences_list(obj);
>   	for (i = 0; list && i < list->num_fences; ++i) {
> @@ -764,6 +766,7 @@ static int __init dma_resv_lockdep(void)
>   	struct ww_acquire_ctx ctx;
>   	struct dma_resv obj;
>   	struct address_space mapping;
> +	bool lr_cookie;
>   	int ret;
>   
>   	if (!mm)
> @@ -772,6 +775,7 @@ static int __init dma_resv_lockdep(void)
>   	dma_resv_init(&obj);
>   	address_space_init_once(&mapping);
>   
> +	lr_cookie = dma_fence_lr_begin_signalling();
>   	mmap_read_lock(mm);
>   	ww_acquire_init(&ctx, &reservation_ww_class);
>   	ret = dma_resv_lock(&obj, &ctx);
> @@ -792,6 +796,7 @@ static int __init dma_resv_lockdep(void)
>   	ww_mutex_unlock(&obj.lock);
>   	ww_acquire_fini(&ctx);
>   	mmap_read_unlock(mm);
> +	dma_fence_lr_end_signalling(lr_cookie);
>   
>   	mmput(mm);
>   
> diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h
> index d54b595a0fe0..08d21e26782b 100644
> --- a/include/linux/dma-fence.h
> +++ b/include/linux/dma-fence.h
> @@ -99,6 +99,7 @@ enum dma_fence_flag_bits {
>   	DMA_FENCE_FLAG_SIGNALED_BIT,
>   	DMA_FENCE_FLAG_TIMESTAMP_BIT,
>   	DMA_FENCE_FLAG_ENABLE_SIGNAL_BIT,
> +	DMA_FENCE_FLAG_LR_BIT,
>   	DMA_FENCE_FLAG_USER_BITS, /* must always be last member */
>   };
>   
> @@ -279,6 +280,11 @@ struct dma_fence_ops {
>   	void (*set_deadline)(struct dma_fence *fence, ktime_t deadline);
>   };
>   
> +static inline bool dma_fence_is_lr(const struct dma_fence *fence)
> +{
> +	return test_bit(DMA_FENCE_FLAG_LR_BIT, &fence->flags);
> +}
> +
>   void dma_fence_init(struct dma_fence *fence, const struct dma_fence_ops *ops,
>   		    spinlock_t *lock, u64 context, u64 seqno);
>   
> @@ -377,13 +383,23 @@ dma_fence_get_rcu_safe(struct dma_fence __rcu **fencep)
>   #ifdef CONFIG_LOCKDEP
>   bool dma_fence_begin_signalling(void);
>   void dma_fence_end_signalling(bool cookie);
> +bool dma_fence_lr_begin_signalling(void);
> +void dma_fence_lr_end_signalling(bool cookie);
>   void __dma_fence_might_wait(void);
>   #else
> +
>   static inline bool dma_fence_begin_signalling(void)
>   {
>   	return true;
>   }
> +
>   static inline void dma_fence_end_signalling(bool cookie) {}
> +static inline bool dma_fence_lr_begin_signalling(void)
> +{
> +	return true;
> +}
> +
> +static inline void dma_fence_lr_end_signalling(bool cookie) {}
>   static inline void __dma_fence_might_wait(void) {}
>   #endif
>   
> @@ -394,9 +410,42 @@ int dma_fence_signal_timestamp_locked(struct dma_fence *fence,
>   				      ktime_t timestamp);
>   signed long dma_fence_default_wait(struct dma_fence *fence,
>   				   bool intr, signed long timeout);
> -int dma_fence_add_callback(struct dma_fence *fence,
> -			   struct dma_fence_cb *cb,
> -			   dma_fence_func_t func);
> +
> +int dma_fence_lr_add_callback(struct dma_fence *fence,
> +			      struct dma_fence_cb *cb,
> +			      dma_fence_func_t func);
> +
> +/**
> + * dma_fence_add_callback - add a callback to be called when the fence
> + * is signaled
> + * @fence: the fence to wait on
> + * @cb: the callback to register
> + * @func: the function to call
> + *
> + * Add a software callback to the fence. The caller should keep a reference to
> + * the fence.
> + *
> + * @cb will be initialized by dma_fence_add_callback(), no initialization
> + * by the caller is required. Any number of callbacks can be registered
> + * to a fence, but a callback can only be registered to one fence at a time.
> + *
> + * If fence is already signaled, this function will return -ENOENT (and
> + * *not* call the callback).
> + *
> + * Note that the callback can be called from an atomic context or irq context.
> + *
> + * Returns 0 in case of success, -ENOENT if the fence is already signaled
> + * and -EINVAL in case of error.
> + */
> +static inline int dma_fence_add_callback(struct dma_fence *fence,
> +					 struct dma_fence_cb *cb,
> +					 dma_fence_func_t func)
> +{
> +	WARN_ON(IS_ENABLED(CONFIG_LOCKDEP) && dma_fence_is_lr(fence));
> +
> +	return dma_fence_lr_add_callback(fence, cb, func);
> +}
> +
>   bool dma_fence_remove_callback(struct dma_fence *fence,
>   			       struct dma_fence_cb *cb);
>   void dma_fence_enable_sw_signaling(struct dma_fence *fence);
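
For reference, a minimal usage sketch of the lr helpers introduced above
(illustrative only, not part of the series; the fence comes from some
hypothetical driver-internal long-running context):

#include <linux/dma-fence.h>

/* Annotate the path that eventually signals a driver-internal
 * long-running fence, mirroring how dma_resv_lockdep() uses the new
 * helpers in the hunk above. Only locks that are safe to take on the
 * lr signalling path should be taken inside the section. */
static void driver_signal_lr_fence(struct dma_fence *lr_fence)
{
        bool cookie = dma_fence_lr_begin_signalling();

        dma_fence_signal(lr_fence);

        dma_fence_lr_end_signalling(cookie);
}

Chaining such fences would then go through dma_fence_lr_add_callback(),
with the caller signing off on the callback possibly not running anytime
soon.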


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans
  2023-04-04  0:22 [Intel-xe] [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans Matthew Brost
                   ` (12 preceding siblings ...)
  2023-04-04  9:04 ` Christian König
@ 2023-04-04  9:13 ` Christian König
  2023-04-04 13:37   ` Matthew Brost
  2023-04-04  9:43 ` Tvrtko Ursulin
                   ` (2 subsequent siblings)
  16 siblings, 1 reply; 87+ messages in thread
From: Christian König @ 2023-04-04  9:13 UTC (permalink / raw)
  To: Matthew Brost, dri-devel, intel-xe
  Cc: robdclark, airlied, lina, boris.brezillon, daniel, faith.ekstrand

Hi,

Am 04.04.23 um 02:22 schrieb Matthew Brost:
> Hello,
>
> As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> have been asked to merge our common DRM scheduler patches first as well
> as develop a common solution for long running workloads with the DRM
> scheduler. This RFC series is our first attempt at doing this. We
> welcome any and all feedback.
>
> This can we thought of as 4 parts detailed below.
>
> - DRM scheduler changes for 1 to 1 relationship between scheduler and
> entity (patches 1-3)
>
> In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> GuC) which is a new paradigm WRT to the DRM scheduler and presents
> severals problems as the DRM was originally designed to schedule jobs on
> hardware queues. The main problem being that DRM scheduler expects the
> submission order of jobs to be the completion order of jobs even across
> multiple entities. This assumption falls apart with a firmware scheduler
> as a firmware scheduler has no concept of jobs and jobs can complete out
> of order. A novel solution for was originally thought of by Faith during
> the initial prototype of Xe, create a 1 to 1 relationship between scheduler
> and entity. I believe the AGX driver [3] is using this approach and
> Boris may use approach as well for the Mali driver [4].
>
> To support a 1 to 1 relationship we move the main execution function
> from a kthread to a work queue and add a new scheduling mode which
> bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
> The new scheduling mode should unify all drivers usage with a 1 to 1
> relationship and can be thought of as using scheduler as a dependency /
> infligt job tracker rather than a true scheduler.
>
> - Generic messaging interface for DRM scheduler
>
> Idea is to be able to communicate to the submission backend with in band
> (relative to main execution function) messages. Messages are backend
> defined and flexable enough for any use case. In Xe we use these
> messages to clean up entites, set properties for entites, and suspend /
> resume execution of an entity [5]. I suspect other driver can leverage
> this messaging concept too as it a convenient way to avoid races in the
> backend.

Oh, please absolutely *don't* do this.

This is basically the design which makes a bunch of stuff so horrible 
broken on Windows.

I can explain it in more detail if necessary, but I strongly recommend 
to not go down this path.

Regards,
Christian.

>
> - Support for using TDR for all error paths of a scheduler / entity
>
> Fix a few races / bugs, add function to dynamically set the TDR timeout.
>
> - Annotate dma-fences for long running workloads.
>
> The idea here is to use dma-fences only as sync points within the
> scheduler and never export them for long running workloads. By
> annotating these fences as long running we ensure that these dma-fences
> are never used in a way that breaks the dma-fence rules. A benefit of
> thus approach is the scheduler can still safely flow control the
> execution ring buffer via the job limit without breaking the dma-fence
> rules.
>
> Again this a first draft and looking forward to feedback.
>
> Enjoy - Matt
>
> [1] https://gitlab.freedesktop.org/drm/xe/kernel
> [2] https://patchwork.freedesktop.org/series/112188/
> [3] https://patchwork.freedesktop.org/series/114772/
> [4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1
> [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031
>
> Matthew Brost (8):
>    drm/sched: Convert drm scheduler to use a work queue rather than
>      kthread
>    drm/sched: Move schedule policy to scheduler / entity
>    drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
>    drm/sched: Add generic scheduler message interface
>    drm/sched: Start run wq before TDR in drm_sched_start
>    drm/sched: Submit job before starting TDR
>    drm/sched: Add helper to set TDR timeout
>    drm/syncobj: Warn on long running dma-fences
>
> Thomas Hellström (2):
>    dma-buf/dma-fence: Introduce long-running completion fences
>    drm/sched: Support long-running sched entities
>
>   drivers/dma-buf/dma-fence.c                 | 142 +++++++---
>   drivers/dma-buf/dma-resv.c                  |   5 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
>   drivers/gpu/drm/drm_syncobj.c               |   5 +-
>   drivers/gpu/drm/etnaviv/etnaviv_sched.c     |   5 +-
>   drivers/gpu/drm/lima/lima_sched.c           |   5 +-
>   drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
>   drivers/gpu/drm/msm/msm_ringbuffer.c        |   5 +-
>   drivers/gpu/drm/panfrost/panfrost_job.c     |   5 +-
>   drivers/gpu/drm/scheduler/sched_entity.c    | 127 +++++++--
>   drivers/gpu/drm/scheduler/sched_fence.c     |   6 +-
>   drivers/gpu/drm/scheduler/sched_main.c      | 278 +++++++++++++++-----
>   drivers/gpu/drm/v3d/v3d_sched.c             |  25 +-
>   include/drm/gpu_scheduler.h                 | 130 +++++++--
>   include/linux/dma-fence.h                   |  60 ++++-
>   16 files changed, 649 insertions(+), 184 deletions(-)
>


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans
  2023-04-04  0:22 [Intel-xe] [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans Matthew Brost
                   ` (13 preceding siblings ...)
  2023-04-04  9:13 ` Christian König
@ 2023-04-04  9:43 ` Tvrtko Ursulin
  2023-04-04  9:48   ` Christian König
  2023-04-04 13:52   ` Matthew Brost
  2023-04-04 18:02 ` Zeng, Oak
  2023-04-18 15:10 ` Liviu Dudau
  16 siblings, 2 replies; 87+ messages in thread
From: Tvrtko Ursulin @ 2023-04-04  9:43 UTC (permalink / raw)
  To: Matthew Brost, dri-devel, intel-xe
  Cc: robdclark, airlied, lina, boris.brezillon, christian.koenig,
	faith.ekstrand


On 04/04/2023 01:22, Matthew Brost wrote:
> Hello,
> 
> As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> have been asked to merge our common DRM scheduler patches first as well
> as develop a common solution for long running workloads with the DRM
> scheduler. This RFC series is our first attempt at doing this. We
> welcome any and all feedback.
> 
> This can we thought of as 4 parts detailed below.
> 
> - DRM scheduler changes for 1 to 1 relationship between scheduler and
> entity (patches 1-3)
> 
> In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> GuC) which is a new paradigm WRT to the DRM scheduler and presents
> severals problems as the DRM was originally designed to schedule jobs on
> hardware queues. The main problem being that DRM scheduler expects the
> submission order of jobs to be the completion order of jobs even across
> multiple entities. This assumption falls apart with a firmware scheduler
> as a firmware scheduler has no concept of jobs and jobs can complete out
> of order. A novel solution for was originally thought of by Faith during
> the initial prototype of Xe, create a 1 to 1 relationship between scheduler
> and entity. I believe the AGX driver [3] is using this approach and
> Boris may use approach as well for the Mali driver [4].
> 
> To support a 1 to 1 relationship we move the main execution function
> from a kthread to a work queue and add a new scheduling mode which
> bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
> The new scheduling mode should unify all drivers usage with a 1 to 1
> relationship and can be thought of as using scheduler as a dependency /
> infligt job tracker rather than a true scheduler.

Once you add capability for a more proper 1:1 via 
DRM_SCHED_POLICY_SINGLE_ENTITY, do you still have further need to 
replace kthreads with a wq?

Or in other words, what purpose does offloading the job-picking code 
to a separate execution context serve? Could it be done directly in the 
1:1 mode and leave the kthread setup for N:M?

Apart from those design-level questions, a low-level open issue IMO is 
still that the default fallback of using the system_wq has the potential 
to affect latency for other drivers. But that's for those driver owners 
to approve.

Regards,

Tvrtko

> - Generic messaging interface for DRM scheduler
> 
> Idea is to be able to communicate to the submission backend with in band
> (relative to main execution function) messages. Messages are backend
> defined and flexable enough for any use case. In Xe we use these
> messages to clean up entites, set properties for entites, and suspend /
> resume execution of an entity [5]. I suspect other driver can leverage
> this messaging concept too as it a convenient way to avoid races in the
> backend.
> 
> - Support for using TDR for all error paths of a scheduler / entity
> 
> Fix a few races / bugs, add function to dynamically set the TDR timeout.
> 
> - Annotate dma-fences for long running workloads.
> 
> The idea here is to use dma-fences only as sync points within the
> scheduler and never export them for long running workloads. By
> annotating these fences as long running we ensure that these dma-fences
> are never used in a way that breaks the dma-fence rules. A benefit of
> thus approach is the scheduler can still safely flow control the
> execution ring buffer via the job limit without breaking the dma-fence
> rules.
> 
> Again this a first draft and looking forward to feedback.
> 
> Enjoy - Matt
> 
> [1] https://gitlab.freedesktop.org/drm/xe/kernel
> [2] https://patchwork.freedesktop.org/series/112188/
> [3] https://patchwork.freedesktop.org/series/114772/
> [4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1
> [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031
> 
> Matthew Brost (8):
>    drm/sched: Convert drm scheduler to use a work queue rather than
>      kthread
>    drm/sched: Move schedule policy to scheduler / entity
>    drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
>    drm/sched: Add generic scheduler message interface
>    drm/sched: Start run wq before TDR in drm_sched_start
>    drm/sched: Submit job before starting TDR
>    drm/sched: Add helper to set TDR timeout
>    drm/syncobj: Warn on long running dma-fences
> 
> Thomas Hellström (2):
>    dma-buf/dma-fence: Introduce long-running completion fences
>    drm/sched: Support long-running sched entities
> 
>   drivers/dma-buf/dma-fence.c                 | 142 +++++++---
>   drivers/dma-buf/dma-resv.c                  |   5 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
>   drivers/gpu/drm/drm_syncobj.c               |   5 +-
>   drivers/gpu/drm/etnaviv/etnaviv_sched.c     |   5 +-
>   drivers/gpu/drm/lima/lima_sched.c           |   5 +-
>   drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
>   drivers/gpu/drm/msm/msm_ringbuffer.c        |   5 +-
>   drivers/gpu/drm/panfrost/panfrost_job.c     |   5 +-
>   drivers/gpu/drm/scheduler/sched_entity.c    | 127 +++++++--
>   drivers/gpu/drm/scheduler/sched_fence.c     |   6 +-
>   drivers/gpu/drm/scheduler/sched_main.c      | 278 +++++++++++++++-----
>   drivers/gpu/drm/v3d/v3d_sched.c             |  25 +-
>   include/drm/gpu_scheduler.h                 | 130 +++++++--
>   include/linux/dma-fence.h                   |  60 ++++-
>   16 files changed, 649 insertions(+), 184 deletions(-)
> 

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans
  2023-04-04  9:43 ` Tvrtko Ursulin
@ 2023-04-04  9:48   ` Christian König
  2023-04-04 13:43     ` Matthew Brost
  2023-04-04 13:52   ` Matthew Brost
  1 sibling, 1 reply; 87+ messages in thread
From: Christian König @ 2023-04-04  9:48 UTC (permalink / raw)
  To: Tvrtko Ursulin, Matthew Brost, dri-devel, intel-xe
  Cc: robdclark, airlied, lina, boris.brezillon, faith.ekstrand

Am 04.04.23 um 11:43 schrieb Tvrtko Ursulin:
>
> On 04/04/2023 01:22, Matthew Brost wrote:
>> Hello,
>>
>> As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
>> have been asked to merge our common DRM scheduler patches first as well
>> as develop a common solution for long running workloads with the DRM
>> scheduler. This RFC series is our first attempt at doing this. We
>> welcome any and all feedback.
>>
>> This can we thought of as 4 parts detailed below.
>>
>> - DRM scheduler changes for 1 to 1 relationship between scheduler and
>> entity (patches 1-3)
>>
>> In Xe all of the scheduling of jobs is done by a firmware scheduler (the
>> GuC) which is a new paradigm WRT to the DRM scheduler and presents
>> severals problems as the DRM was originally designed to schedule jobs on
>> hardware queues. The main problem being that DRM scheduler expects the
>> submission order of jobs to be the completion order of jobs even across
>> multiple entities. This assumption falls apart with a firmware scheduler
>> as a firmware scheduler has no concept of jobs and jobs can complete out
>> of order. A novel solution for was originally thought of by Faith during
>> the initial prototype of Xe, create a 1 to 1 relationship between 
>> scheduler
>> and entity. I believe the AGX driver [3] is using this approach and
>> Boris may use approach as well for the Mali driver [4].
>>
>> To support a 1 to 1 relationship we move the main execution function
>> from a kthread to a work queue and add a new scheduling mode which
>> bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
>> The new scheduling mode should unify all drivers usage with a 1 to 1
>> relationship and can be thought of as using scheduler as a dependency /
>> infligt job tracker rather than a true scheduler.
>
> Once you add capability for a more proper 1:1 via 
> DRM_SCHED_POLICY_SINGLE_ENTITY, do you still have further need to 
> replace kthreads with a wq?

Yeah, I fail to see the need for that as well. On the other hand it 
would be really nice to get rid of the rq/priority design in general.

>
> Or in other words, what purpose does offloading the job-picking code 
> to a separate execution context serve? Could it be done directly in 
> the 1:1 mode and leave the kthread setup for N:M?

Well moving from kthread to work item is beneficial on its own since 
the latter usually follows the source of its queue. E.g. when this 
is triggered by an interrupt we run on the CPU that took the interrupt 
and don't have inter-CPU signaling.
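
As a rough illustration of that point (structure and names made up):
queueing onto a bound workqueue from the interrupt handler lets the
submission work typically run on the CPU that took the interrupt,
instead of waking a kthread that may live on another CPU.

#include <linux/interrupt.h>
#include <linux/workqueue.h>

struct my_sched {
        struct work_struct submit_work; /* INIT_WORK(..., my_submit_work) at init */
};

static void my_submit_work(struct work_struct *w)
{
        /* job picking / ring submission happens here, usually on the
         * same CPU that queued the work below */
}

static irqreturn_t my_irq_handler(int irq, void *data)
{
        struct my_sched *sched = data;

        /* no cross-CPU wakeup of a dedicated kthread required */
        queue_work(system_wq, &sched->submit_work);

        return IRQ_HANDLED;
}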

>
> Apart from those design-level questions, a low-level open issue IMO is 
> still that the default fallback of using the system_wq has the potential 
> to affect latency for other drivers. But that's for those driver owners 
> to approve.

Oh, yeah that's a good point as well. This needs some high priority queue.

Christian.

>
> Regards,
>
> Tvrtko
>
>> - Generic messaging interface for DRM scheduler
>>
>> Idea is to be able to communicate to the submission backend with in band
>> (relative to main execution function) messages. Messages are backend
>> defined and flexable enough for any use case. In Xe we use these
>> messages to clean up entites, set properties for entites, and suspend /
>> resume execution of an entity [5]. I suspect other driver can leverage
>> this messaging concept too as it a convenient way to avoid races in the
>> backend.
>>
>> - Support for using TDR for all error paths of a scheduler / entity
>>
>> Fix a few races / bugs, add function to dynamically set the TDR timeout.
>>
>> - Annotate dma-fences for long running workloads.
>>
>> The idea here is to use dma-fences only as sync points within the
>> scheduler and never export them for long running workloads. By
>> annotating these fences as long running we ensure that these dma-fences
>> are never used in a way that breaks the dma-fence rules. A benefit of
>> thus approach is the scheduler can still safely flow control the
>> execution ring buffer via the job limit without breaking the dma-fence
>> rules.
>>
>> Again this a first draft and looking forward to feedback.
>>
>> Enjoy - Matt
>>
>> [1] https://gitlab.freedesktop.org/drm/xe/kernel
>> [2] https://patchwork.freedesktop.org/series/112188/
>> [3] https://patchwork.freedesktop.org/series/114772/
>> [4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1
>> [5] 
>> https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031
>>
>> Matthew Brost (8):
>>    drm/sched: Convert drm scheduler to use a work queue rather than
>>      kthread
>>    drm/sched: Move schedule policy to scheduler / entity
>>    drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
>>    drm/sched: Add generic scheduler message interface
>>    drm/sched: Start run wq before TDR in drm_sched_start
>>    drm/sched: Submit job before starting TDR
>>    drm/sched: Add helper to set TDR timeout
>>    drm/syncobj: Warn on long running dma-fences
>>
>> Thomas Hellström (2):
>>    dma-buf/dma-fence: Introduce long-running completion fences
>>    drm/sched: Support long-running sched entities
>>
>>   drivers/dma-buf/dma-fence.c                 | 142 +++++++---
>>   drivers/dma-buf/dma-resv.c                  |   5 +
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
>>   drivers/gpu/drm/drm_syncobj.c               |   5 +-
>>   drivers/gpu/drm/etnaviv/etnaviv_sched.c     |   5 +-
>>   drivers/gpu/drm/lima/lima_sched.c           |   5 +-
>>   drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
>>   drivers/gpu/drm/msm/msm_ringbuffer.c        |   5 +-
>>   drivers/gpu/drm/panfrost/panfrost_job.c     |   5 +-
>>   drivers/gpu/drm/scheduler/sched_entity.c    | 127 +++++++--
>>   drivers/gpu/drm/scheduler/sched_fence.c     |   6 +-
>>   drivers/gpu/drm/scheduler/sched_main.c      | 278 +++++++++++++++-----
>>   drivers/gpu/drm/v3d/v3d_sched.c             |  25 +-
>>   include/drm/gpu_scheduler.h                 | 130 +++++++--
>>   include/linux/dma-fence.h                   |  60 ++++-
>>   16 files changed, 649 insertions(+), 184 deletions(-)
>>


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences
  2023-04-04  9:09   ` Christian König
@ 2023-04-04 12:54     ` Thomas Hellström
  2023-04-04 13:10       ` Christian König
  0 siblings, 1 reply; 87+ messages in thread
From: Thomas Hellström @ 2023-04-04 12:54 UTC (permalink / raw)
  To: Christian König, Matthew Brost, dri-devel, intel-xe
  Cc: robdclark, airlied, lina, boris.brezillon, daniel, faith.ekstrand

Hi, Christian,

On 4/4/23 11:09, Christian König wrote:
> Am 04.04.23 um 02:22 schrieb Matthew Brost:
>> From: Thomas Hellström <thomas.hellstrom@linux.intel.com>
>>
>> For long-running workloads, drivers either need to open-code completion
>> waits, invent their own synchronization primitives or internally use
>> dma-fences that do not obey the cross-driver dma-fence protocol, but
>> without any lockdep annotation all these approaches are error prone.
>>
>> So since for example the drm scheduler uses dma-fences it is 
>> desirable for
>> a driver to be able to use it for throttling and error handling also 
>> with
>> internal dma-fences that do not obey the cross-driver dma-fence protocol.
>>
>> Introduce long-running completion fences in form of dma-fences, and add
>> lockdep annotation for them. In particular:
>>
>> * Do not allow waiting under any memory management locks.
>> * Do not allow to attach them to a dma-resv object.
>> * Introduce a new interface for adding callbacks making the helper 
>> adding
>>    a callback sign off on that it is aware that the dma-fence may not
>>    complete anytime soon. Typically this will be the scheduler chaining
>>    a new long-running fence on another one.
>
> Well that's pretty much what I tried before: 
> https://lwn.net/Articles/893704/
>
> And the reasons why it was rejected haven't changed.
>
> Regards,
> Christian.
>
Yes, TBH this was mostly to get discussion going on how we'd best tackle 
this problem while being able to reuse the scheduler for long-running 
workloads.

I couldn't see any clear decision on your series, though, but one main 
difference I see is that this is intended for driver-internal use only 
(I'm counting using the drm_scheduler as a helper for driver-private 
use). This is by no means a way to try to tackle the indefinite fence 
problem.

We could ofc invent a completely different data-type that abstracts the 
synchronization the scheduler needs in the long-running case, or each 
driver could hack something up, like sleeping in the prepare_job() or 
run_job() callback for throttling, but those waits should still be 
annotated one way or another (and probably in a similar way across 
drivers) to make sure we don't do anything bad.

So any suggestions as to what would be a better solution here would 
be appreciated.

Thanks,

Thomas






^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences
  2023-04-04 12:54     ` Thomas Hellström
@ 2023-04-04 13:10       ` Christian König
  2023-04-04 18:14         ` Thomas Hellström (Intel)
  2023-04-04 19:00         ` Daniel Vetter
  0 siblings, 2 replies; 87+ messages in thread
From: Christian König @ 2023-04-04 13:10 UTC (permalink / raw)
  To: Thomas Hellström, Matthew Brost, dri-devel, intel-xe
  Cc: robdclark, airlied, lina, boris.brezillon, daniel, faith.ekstrand

Am 04.04.23 um 14:54 schrieb Thomas Hellström:
> Hi, Christian,
>
> On 4/4/23 11:09, Christian König wrote:
>> Am 04.04.23 um 02:22 schrieb Matthew Brost:
>>> From: Thomas Hellström <thomas.hellstrom@linux.intel.com>
>>>
>>> For long-running workloads, drivers either need to open-code completion
>>> waits, invent their own synchronization primitives or internally use
>>> dma-fences that do not obey the cross-driver dma-fence protocol, but
>>> without any lockdep annotation all these approaches are error prone.
>>>
>>> So since for example the drm scheduler uses dma-fences it is 
>>> desirable for
>>> a driver to be able to use it for throttling and error handling also 
>>> with
>>> internal dma-fences that do not obey the cross-driver dma-fence protocol.
>>>
>>> Introduce long-running completion fences in form of dma-fences, and add
>>> lockdep annotation for them. In particular:
>>>
>>> * Do not allow waiting under any memory management locks.
>>> * Do not allow to attach them to a dma-resv object.
>>> * Introduce a new interface for adding callbacks making the helper 
>>> adding
>>>    a callback sign off on that it is aware that the dma-fence may not
>>>    complete anytime soon. Typically this will be the scheduler chaining
>>>    a new long-running fence on another one.
>>
>> Well that's pretty much what I tried before: 
>> https://lwn.net/Articles/893704/
>>
>> And the reasons why it was rejected haven't changed.
>>
>> Regards,
>> Christian.
>>
> Yes, TBH this was mostly to get discussion going on how we'd best tackle 
> this problem while being able to reuse the scheduler for long-running 
> workloads.
>
> I couldn't see any clear decision on your series, though, but one main 
> difference I see is that this is intended for driver-internal use 
> only (I'm counting using the drm_scheduler as a helper for 
> driver-private use). This is by no means a way to try to tackle the 
> indefinite fence problem.

Well this was just my latest try to tackle this, but essentially the 
problems are the same as with your approach: when we express such 
operations as dma_fence there is always the chance that we leak that 
somewhere.

My approach of adding a flag, noting that this operation is dangerous 
and can't be synced with something memory management depends on, tried 
to contain this as much as possible, but Daniel still pretty clearly 
rejected it (for good reasons I think).

>
> We could ofc invent a completely different data-type that abstracts 
> the synchronization the scheduler needs in the long-running case, or 
> each driver could hack something up, like sleeping in the 
> prepare_job() or run_job() callback for throttling, but those waits 
> should still be annotated one way or another (and probably in a 
> similar way across drivers) to make sure we don't do anything bad.
>
> So any suggestions as to what would be a better solution here would 
> be appreciated.

Mhm, do we really need the GPU scheduler for that?

I mean in the 1 to 1 case you basically just need a component which 
collects the dependencies as dma_fences and, if all of them are 
fulfilled, schedules a work item.

As long as the work item itself doesn't produce a dma_fence it can then 
still just wait for other non-dma_fence dependencies.

Then the work function could submit the work and wait for the result.

The work item would then pretty much represent what you want: you can 
wait for it to finish and pass it along as a long-running dependency.

Maybe give it a funky name and wrap it up in a structure, but that's 
basically it.
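
A very rough sketch of such a component, with made-up names (the extra
"pending" reference held by the caller until all dependencies are added
is just one way to arm it):

#include <linux/atomic.h>
#include <linux/completion.h>
#include <linux/dma-fence.h>
#include <linux/workqueue.h>

struct lr_dep {
        struct dma_fence_cb cb;
        struct lr_job *job;
};

struct lr_job {
        struct work_struct work;  /* runs once all deps have signalled */
        atomic_t pending;         /* init to 1; caller drops that ref to arm */
        struct completion done;   /* long-running "result", not a dma_fence */
};

static void lr_job_put_dep(struct lr_job *job)
{
        if (atomic_dec_and_test(&job->pending))
                queue_work(system_unbound_wq, &job->work);
}

static void lr_job_dep_cb(struct dma_fence *fence, struct dma_fence_cb *cb)
{
        struct lr_dep *dep = container_of(cb, struct lr_dep, cb);

        lr_job_put_dep(dep->job);
}

static void lr_job_add_dep(struct lr_job *job, struct lr_dep *dep,
                           struct dma_fence *fence)
{
        dep->job = job;
        atomic_inc(&job->pending);
        /* -ENOENT means the fence has already signalled */
        if (dma_fence_add_callback(fence, &dep->cb, lr_job_dep_cb))
                lr_job_put_dep(job);
}

static void lr_job_work(struct work_struct *w)
{
        struct lr_job *job = container_of(w, struct lr_job, work);

        /* submit to HW/FW here; may block on non-dma_fence events */
        complete_all(&job->done);
}

Waiters then use wait_for_completion(&job->done) (or the work item
itself) as the long-running dependency instead of a dma_fence.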

Regards,
Christian.

>
> Thanks,
>
> Thomas
>
>
>
>
>


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans
  2023-04-04  9:04 ` Christian König
@ 2023-04-04 13:23   ` Matthew Brost
  0 siblings, 0 replies; 87+ messages in thread
From: Matthew Brost @ 2023-04-04 13:23 UTC (permalink / raw)
  To: Christian König
  Cc: robdclark, airlied, lina, dri-devel, Tuikov, Luben, daniel,
	boris.brezillon, intel-xe, faith.ekstrand

On Tue, Apr 04, 2023 at 11:04:54AM +0200, Christian König wrote:
> Please make sure to CC Luben on scheduler patches.
> 

Sure, figured I was missing a few people.

Matt

> Regards,
> Christian.
> 
> Am 04.04.23 um 02:22 schrieb Matthew Brost:
> > Hello,
> > 
> > As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> > have been asked to merge our common DRM scheduler patches first as well
> > as develop a common solution for long running workloads with the DRM
> > scheduler. This RFC series is our first attempt at doing this. We
> > welcome any and all feedback.
> > 
> > This can we thought of as 4 parts detailed below.
> > 
> > - DRM scheduler changes for 1 to 1 relationship between scheduler and
> > entity (patches 1-3)
> > 
> > In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> > GuC) which is a new paradigm WRT to the DRM scheduler and presents
> > severals problems as the DRM was originally designed to schedule jobs on
> > hardware queues. The main problem being that DRM scheduler expects the
> > submission order of jobs to be the completion order of jobs even across
> > multiple entities. This assumption falls apart with a firmware scheduler
> > as a firmware scheduler has no concept of jobs and jobs can complete out
> > of order. A novel solution for was originally thought of by Faith during
> > the initial prototype of Xe, create a 1 to 1 relationship between scheduler
> > and entity. I believe the AGX driver [3] is using this approach and
> > Boris may use approach as well for the Mali driver [4].
> > 
> > To support a 1 to 1 relationship we move the main execution function
> > from a kthread to a work queue and add a new scheduling mode which
> > bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
> > The new scheduling mode should unify all drivers usage with a 1 to 1
> > relationship and can be thought of as using scheduler as a dependency /
> > infligt job tracker rather than a true scheduler.
> > 
> > - Generic messaging interface for DRM scheduler
> > 
> > Idea is to be able to communicate to the submission backend with in band
> > (relative to main execution function) messages. Messages are backend
> > defined and flexable enough for any use case. In Xe we use these
> > messages to clean up entites, set properties for entites, and suspend /
> > resume execution of an entity [5]. I suspect other driver can leverage
> > this messaging concept too as it a convenient way to avoid races in the
> > backend.
> > 
> > - Support for using TDR for all error paths of a scheduler / entity
> > 
> > Fix a few races / bugs, add function to dynamically set the TDR timeout.
> > 
> > - Annotate dma-fences for long running workloads.
> > 
> > The idea here is to use dma-fences only as sync points within the
> > scheduler and never export them for long running workloads. By
> > annotating these fences as long running we ensure that these dma-fences
> > are never used in a way that breaks the dma-fence rules. A benefit of
> > thus approach is the scheduler can still safely flow control the
> > execution ring buffer via the job limit without breaking the dma-fence
> > rules.
> > 
> > Again this a first draft and looking forward to feedback.
> > 
> > Enjoy - Matt
> > 
> > [1] https://gitlab.freedesktop.org/drm/xe/kernel
> > [2] https://patchwork.freedesktop.org/series/112188/
> > [3] https://patchwork.freedesktop.org/series/114772/
> > [4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1
> > [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031
> > 
> > Matthew Brost (8):
> >    drm/sched: Convert drm scheduler to use a work queue rather than
> >      kthread
> >    drm/sched: Move schedule policy to scheduler / entity
> >    drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
> >    drm/sched: Add generic scheduler message interface
> >    drm/sched: Start run wq before TDR in drm_sched_start
> >    drm/sched: Submit job before starting TDR
> >    drm/sched: Add helper to set TDR timeout
> >    drm/syncobj: Warn on long running dma-fences
> > 
> > Thomas Hellström (2):
> >    dma-buf/dma-fence: Introduce long-running completion fences
> >    drm/sched: Support long-running sched entities
> > 
> >   drivers/dma-buf/dma-fence.c                 | 142 +++++++---
> >   drivers/dma-buf/dma-resv.c                  |   5 +
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
> >   drivers/gpu/drm/drm_syncobj.c               |   5 +-
> >   drivers/gpu/drm/etnaviv/etnaviv_sched.c     |   5 +-
> >   drivers/gpu/drm/lima/lima_sched.c           |   5 +-
> >   drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
> >   drivers/gpu/drm/msm/msm_ringbuffer.c        |   5 +-
> >   drivers/gpu/drm/panfrost/panfrost_job.c     |   5 +-
> >   drivers/gpu/drm/scheduler/sched_entity.c    | 127 +++++++--
> >   drivers/gpu/drm/scheduler/sched_fence.c     |   6 +-
> >   drivers/gpu/drm/scheduler/sched_main.c      | 278 +++++++++++++++-----
> >   drivers/gpu/drm/v3d/v3d_sched.c             |  25 +-
> >   include/drm/gpu_scheduler.h                 | 130 +++++++--
> >   include/linux/dma-fence.h                   |  60 ++++-
> >   16 files changed, 649 insertions(+), 184 deletions(-)
> > 
> 

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans
  2023-04-04  9:13 ` Christian König
@ 2023-04-04 13:37   ` Matthew Brost
  2023-04-05  7:41     ` Christian König
  0 siblings, 1 reply; 87+ messages in thread
From: Matthew Brost @ 2023-04-04 13:37 UTC (permalink / raw)
  To: Christian König
  Cc: robdclark, airlied, lina, dri-devel, boris.brezillon, intel-xe,
	faith.ekstrand

On Tue, Apr 04, 2023 at 11:13:28AM +0200, Christian König wrote:
> Hi,
> 
> Am 04.04.23 um 02:22 schrieb Matthew Brost:
> > Hello,
> > 
> > As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> > have been asked to merge our common DRM scheduler patches first as well
> > as develop a common solution for long running workloads with the DRM
> > scheduler. This RFC series is our first attempt at doing this. We
> > welcome any and all feedback.
> > 
> > This can we thought of as 4 parts detailed below.
> > 
> > - DRM scheduler changes for 1 to 1 relationship between scheduler and
> > entity (patches 1-3)
> > 
> > In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> > GuC) which is a new paradigm WRT to the DRM scheduler and presents
> > severals problems as the DRM was originally designed to schedule jobs on
> > hardware queues. The main problem being that DRM scheduler expects the
> > submission order of jobs to be the completion order of jobs even across
> > multiple entities. This assumption falls apart with a firmware scheduler
> > as a firmware scheduler has no concept of jobs and jobs can complete out
> > of order. A novel solution for was originally thought of by Faith during
> > the initial prototype of Xe, create a 1 to 1 relationship between scheduler
> > and entity. I believe the AGX driver [3] is using this approach and
> > Boris may use approach as well for the Mali driver [4].
> > 
> > To support a 1 to 1 relationship we move the main execution function
> > from a kthread to a work queue and add a new scheduling mode which
> > bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
> > The new scheduling mode should unify all drivers usage with a 1 to 1
> > relationship and can be thought of as using scheduler as a dependency /
> > infligt job tracker rather than a true scheduler.
> > 
> > - Generic messaging interface for DRM scheduler
> > 
> > Idea is to be able to communicate to the submission backend with in band
> > (relative to main execution function) messages. Messages are backend
> > defined and flexable enough for any use case. In Xe we use these
> > messages to clean up entites, set properties for entites, and suspend /
> > resume execution of an entity [5]. I suspect other driver can leverage
> > this messaging concept too as it a convenient way to avoid races in the
> > backend.
> 
> Oh, please absolutely *don't* do this.
> 
> This is basically the design which makes a bunch of stuff so horrible broken
> on Windows.
> 
> I can explain it in more detail if necessary, but I strongly recommend to
> not go down this path.
> 

I'm afraid we are going to have to discuss this further. Let me explain
my reasoning: basically the idea is to have a single main entry point to
the backend - the work queue. This avoids the need for a lock between
run_job and any message that changes an entity's state, and it also
really helps during the reset flows (either TDR or GT reset) as we can
call drm_sched_run_wq_stop and ensure that nothing else in the backend
is changing an entity's state. It all works out really nicely actually,
our GuC backend is incredibly stable (hasn't really had a bug pop up in
about a year) and way simpler than what we did in the i915. I think the
simplicity is largely due to this design of limiting the entry points.

I personally don't see how this is a poor design; limiting entry points
absolutely makes sense to me. If it didn't, why not just call cleanup_job
bypassing the main execution thread (now worker)? It's the exact same
concept.
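
To make that concrete, an illustrative shape of the idea (not the actual
Xe code; backend_process_msg() and the ordered submit_wq are assumed
driver pieces): both run_job and backend messages are work items on the
same ordered workqueue, so entity state is only ever touched from one
execution context and stopping that queue quiesces the whole backend.

#include <linux/workqueue.h>

struct backend_msg {
        struct work_struct work;
        unsigned int opcode;    /* e.g. CLEANUP / SET_PROP / SUSPEND / RESUME */
        void *payload;
};

void backend_process_msg(struct backend_msg *msg);      /* hypothetical */

static void backend_msg_work(struct work_struct *w)
{
        struct backend_msg *msg = container_of(w, struct backend_msg, work);

        /* serialized with run_job on the same ordered wq, so no extra
         * locking against in-flight submission is needed */
        backend_process_msg(msg);
}

static void backend_send_msg(struct workqueue_struct *submit_wq,
                             struct backend_msg *msg)
{
        INIT_WORK(&msg->work, backend_msg_work);
        queue_work(submit_wq, &msg->work);
}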

FWIW Asahi liked the idea as well and thinks it could be useful for AGX.
Matt

> Regards,
> Christian.
> 
> > 
> > - Support for using TDR for all error paths of a scheduler / entity
> > 
> > Fix a few races / bugs, add function to dynamically set the TDR timeout.
> > 
> > - Annotate dma-fences for long running workloads.
> > 
> > The idea here is to use dma-fences only as sync points within the
> > scheduler and never export them for long running workloads. By
> > annotating these fences as long running we ensure that these dma-fences
> > are never used in a way that breaks the dma-fence rules. A benefit of
> > thus approach is the scheduler can still safely flow control the
> > execution ring buffer via the job limit without breaking the dma-fence
> > rules.
> > 
> > Again this a first draft and looking forward to feedback.
> > 
> > Enjoy - Matt
> > 
> > [1] https://gitlab.freedesktop.org/drm/xe/kernel
> > [2] https://patchwork.freedesktop.org/series/112188/
> > [3] https://patchwork.freedesktop.org/series/114772/
> > [4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1
> > [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031
> > 
> > Matthew Brost (8):
> >    drm/sched: Convert drm scheduler to use a work queue rather than
> >      kthread
> >    drm/sched: Move schedule policy to scheduler / entity
> >    drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
> >    drm/sched: Add generic scheduler message interface
> >    drm/sched: Start run wq before TDR in drm_sched_start
> >    drm/sched: Submit job before starting TDR
> >    drm/sched: Add helper to set TDR timeout
> >    drm/syncobj: Warn on long running dma-fences
> > 
> > Thomas Hellström (2):
> >    dma-buf/dma-fence: Introduce long-running completion fences
> >    drm/sched: Support long-running sched entities
> > 
> >   drivers/dma-buf/dma-fence.c                 | 142 +++++++---
> >   drivers/dma-buf/dma-resv.c                  |   5 +
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
> >   drivers/gpu/drm/drm_syncobj.c               |   5 +-
> >   drivers/gpu/drm/etnaviv/etnaviv_sched.c     |   5 +-
> >   drivers/gpu/drm/lima/lima_sched.c           |   5 +-
> >   drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
> >   drivers/gpu/drm/msm/msm_ringbuffer.c        |   5 +-
> >   drivers/gpu/drm/panfrost/panfrost_job.c     |   5 +-
> >   drivers/gpu/drm/scheduler/sched_entity.c    | 127 +++++++--
> >   drivers/gpu/drm/scheduler/sched_fence.c     |   6 +-
> >   drivers/gpu/drm/scheduler/sched_main.c      | 278 +++++++++++++++-----
> >   drivers/gpu/drm/v3d/v3d_sched.c             |  25 +-
> >   include/drm/gpu_scheduler.h                 | 130 +++++++--
> >   include/linux/dma-fence.h                   |  60 ++++-
> >   16 files changed, 649 insertions(+), 184 deletions(-)
> > 
> 

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans
  2023-04-04  9:48   ` Christian König
@ 2023-04-04 13:43     ` Matthew Brost
  0 siblings, 0 replies; 87+ messages in thread
From: Matthew Brost @ 2023-04-04 13:43 UTC (permalink / raw)
  To: Christian König
  Cc: robdclark, Tvrtko Ursulin, airlied, lina, dri-devel,
	boris.brezillon, intel-xe, faith.ekstrand

On Tue, Apr 04, 2023 at 11:48:36AM +0200, Christian König wrote:
> Am 04.04.23 um 11:43 schrieb Tvrtko Ursulin:
> > 
> > On 04/04/2023 01:22, Matthew Brost wrote:
> > > Hello,
> > > 
> > > As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> > > have been asked to merge our common DRM scheduler patches first as well
> > > as develop a common solution for long running workloads with the DRM
> > > scheduler. This RFC series is our first attempt at doing this. We
> > > welcome any and all feedback.
> > > 
> > > This can we thought of as 4 parts detailed below.
> > > 
> > > - DRM scheduler changes for 1 to 1 relationship between scheduler and
> > > entity (patches 1-3)
> > > 
> > > In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> > > GuC) which is a new paradigm WRT to the DRM scheduler and presents
> > > severals problems as the DRM was originally designed to schedule jobs on
> > > hardware queues. The main problem being that DRM scheduler expects the
> > > submission order of jobs to be the completion order of jobs even across
> > > multiple entities. This assumption falls apart with a firmware scheduler
> > > as a firmware scheduler has no concept of jobs and jobs can complete out
> > > of order. A novel solution for was originally thought of by Faith during
> > > the initial prototype of Xe, create a 1 to 1 relationship between
> > > scheduler
> > > and entity. I believe the AGX driver [3] is using this approach and
> > > Boris may use approach as well for the Mali driver [4].
> > > 
> > > To support a 1 to 1 relationship we move the main execution function
> > > from a kthread to a work queue and add a new scheduling mode which
> > > bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
> > > The new scheduling mode should unify all drivers usage with a 1 to 1
> > > relationship and can be thought of as using scheduler as a dependency /
> > > infligt job tracker rather than a true scheduler.
> > 
> > Once you add capability for a more proper 1:1 via
> > DRM_SCHED_POLICY_SINGLE_ENTITY, do you still have further need to
> > replace kthreads with a wq?
> 
> Yeah, I fail to see the need for that as well. On the other hand it would be
> really nice to get rid of the rq/priority design in general.
> 

WRT replacing the kthread with a worker, I think the idea is that you
don't want to tie kthread creation directly to a uAPI, as a user could
then create 1000s of kthreads.

FWIW, in a private email about a year ago you actually suggested using
a work queue, Christian.

> > 
> > Or in other words, what purpose does offloading the job-picking code
> > to a separate execution context serve? Could it be done directly in
> > the 1:1 mode and leave the kthread setup for N:M?
> 
> Well moving from kthread to work item is beneficial on its own since the
> latter usually follows the source of its queue. E.g. when this is
> triggered by an interrupt we run on the CPU that took the interrupt and
> don't have inter-CPU signaling.
> 
> > 
> > Apart from those design level questions, low level open IMO still is
> > that default fallback of using the system_wq has the potential to affect
> > latency for other drivers. But that's for those driver owners to
> > approve.
> 
> Oh, yeah that's a good point as well. This needs some high priority queue.
>

system_highpri_wq?

Matt

> Christian.
> 
> > 
> > Regards,
> > 
> > Tvrtko
> > 
> > > - Generic messaging interface for DRM scheduler
> > > 
> > > Idea is to be able to communicate to the submission backend with in band
> > > (relative to main execution function) messages. Messages are backend
> > > defined and flexable enough for any use case. In Xe we use these
> > > messages to clean up entites, set properties for entites, and suspend /
> > > resume execution of an entity [5]. I suspect other driver can leverage
> > > this messaging concept too as it a convenient way to avoid races in the
> > > backend.
> > > 
> > > - Support for using TDR for all error paths of a scheduler / entity
> > > 
> > > Fix a few races / bugs, add function to dynamically set the TDR timeout.
> > > 
> > > - Annotate dma-fences for long running workloads.
> > > 
> > > The idea here is to use dma-fences only as sync points within the
> > > scheduler and never export them for long running workloads. By
> > > annotating these fences as long running we ensure that these dma-fences
> > > are never used in a way that breaks the dma-fence rules. A benefit of
> > > thus approach is the scheduler can still safely flow control the
> > > execution ring buffer via the job limit without breaking the dma-fence
> > > rules.
> > > 
> > > Again this a first draft and looking forward to feedback.
> > > 
> > > Enjoy - Matt
> > > 
> > > [1] https://gitlab.freedesktop.org/drm/xe/kernel
> > > [2] https://patchwork.freedesktop.org/series/112188/
> > > [3] https://patchwork.freedesktop.org/series/114772/
> > > [4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1
> > > [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031
> > > 
> > > Matthew Brost (8):
> > >    drm/sched: Convert drm scheduler to use a work queue rather than
> > >      kthread
> > >    drm/sched: Move schedule policy to scheduler / entity
> > >    drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
> > >    drm/sched: Add generic scheduler message interface
> > >    drm/sched: Start run wq before TDR in drm_sched_start
> > >    drm/sched: Submit job before starting TDR
> > >    drm/sched: Add helper to set TDR timeout
> > >    drm/syncobj: Warn on long running dma-fences
> > > 
> > > Thomas Hellström (2):
> > >    dma-buf/dma-fence: Introduce long-running completion fences
> > >    drm/sched: Support long-running sched entities
> > > 
> > >   drivers/dma-buf/dma-fence.c                 | 142 +++++++---
> > >   drivers/dma-buf/dma-resv.c                  |   5 +
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
> > >   drivers/gpu/drm/drm_syncobj.c               |   5 +-
> > >   drivers/gpu/drm/etnaviv/etnaviv_sched.c     |   5 +-
> > >   drivers/gpu/drm/lima/lima_sched.c           |   5 +-
> > >   drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
> > >   drivers/gpu/drm/msm/msm_ringbuffer.c        |   5 +-
> > >   drivers/gpu/drm/panfrost/panfrost_job.c     |   5 +-
> > >   drivers/gpu/drm/scheduler/sched_entity.c    | 127 +++++++--
> > >   drivers/gpu/drm/scheduler/sched_fence.c     |   6 +-
> > >   drivers/gpu/drm/scheduler/sched_main.c      | 278 +++++++++++++++-----
> > >   drivers/gpu/drm/v3d/v3d_sched.c             |  25 +-
> > >   include/drm/gpu_scheduler.h                 | 130 +++++++--
> > >   include/linux/dma-fence.h                   |  60 ++++-
> > >   16 files changed, 649 insertions(+), 184 deletions(-)
> > > 
> 


* Re: [Intel-xe] [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans
  2023-04-04  9:43 ` Tvrtko Ursulin
  2023-04-04  9:48   ` Christian König
@ 2023-04-04 13:52   ` Matthew Brost
  2023-04-04 17:29     ` Tvrtko Ursulin
  1 sibling, 1 reply; 87+ messages in thread
From: Matthew Brost @ 2023-04-04 13:52 UTC (permalink / raw)
  To: Tvrtko Ursulin
  Cc: robdclark, airlied, lina, dri-devel, christian.koenig,
	boris.brezillon, intel-xe, faith.ekstrand

On Tue, Apr 04, 2023 at 10:43:03AM +0100, Tvrtko Ursulin wrote:
> 
> On 04/04/2023 01:22, Matthew Brost wrote:
> > Hello,
> > 
> > As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> > have been asked to merge our common DRM scheduler patches first as well
> > as develop a common solution for long running workloads with the DRM
> > scheduler. This RFC series is our first attempt at doing this. We
> > welcome any and all feedback.
> > 
> > This can we thought of as 4 parts detailed below.
> > 
> > - DRM scheduler changes for 1 to 1 relationship between scheduler and
> > entity (patches 1-3)
> > 
> > In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> > GuC) which is a new paradigm WRT to the DRM scheduler and presents
> > severals problems as the DRM was originally designed to schedule jobs on
> > hardware queues. The main problem being that DRM scheduler expects the
> > submission order of jobs to be the completion order of jobs even across
> > multiple entities. This assumption falls apart with a firmware scheduler
> > as a firmware scheduler has no concept of jobs and jobs can complete out
> > of order. A novel solution for was originally thought of by Faith during
> > the initial prototype of Xe, create a 1 to 1 relationship between scheduler
> > and entity. I believe the AGX driver [3] is using this approach and
> > Boris may use approach as well for the Mali driver [4].
> > 
> > To support a 1 to 1 relationship we move the main execution function
> > from a kthread to a work queue and add a new scheduling mode which
> > bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
> > The new scheduling mode should unify all drivers usage with a 1 to 1
> > relationship and can be thought of as using scheduler as a dependency /
> > infligt job tracker rather than a true scheduler.
> 
> Once you add capability for a more proper 1:1 via
> DRM_SCHED_POLICY_SINGLE_ENTITY, do you still have further need to replace
> kthreads with a wq?
> 
> Or in other words, what purpose does the offloading of a job picking code to
> a separate execution context serve? Could it be done directly in the 1:1
> mode and leave kthread setup for N:M?
> 

Addressed the other two in my reply to Christian...

For this one, the concept of a single entry point is IMO a very good one
which I'd like to keep. But the most important reason is that the main
execution thread (now a worker) is kicked when a dependency for a job is
resolved; dependencies are dma-fences signaled via a callback, and these
callbacks can fire in IRQ context. We absolutely do not want to enter the
backend in an IRQ context for a variety of reasons.
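
Roughly, the shape is a dma-fence callback that only kicks the worker. A
minimal sketch, with all struct and field names illustrative rather than
taken from the series:

#include <linux/container_of.h>
#include <linux/dma-fence.h>
#include <linux/workqueue.h>

/* All struct and field names here are illustrative, not the RFC's. */
struct example_sched {
	struct workqueue_struct *run_wq;
	struct work_struct work_run;
	struct dma_fence_cb dep_cb;
};

/* Dependency callbacks can fire from hard IRQ context ... */
static void example_sched_dependency_cb(struct dma_fence *f,
					struct dma_fence_cb *cb)
{
	struct example_sched *sched = container_of(cb, struct example_sched,
						   dep_cb);

	/* ... so no backend submission here, just kick the worker. */
	queue_work(sched->run_wq, &sched->work_run);
}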

Matt

> Apart from those design level questions, low level open IMO still is that
> default fallback of using the system_wq has the potential to affect latency
> for other drivers. But that's for those driver owners to approve.
> 
> Regards,
> 
> Tvrtko
> 
> > - Generic messaging interface for DRM scheduler
> > 
> > Idea is to be able to communicate to the submission backend with in band
> > (relative to main execution function) messages. Messages are backend
> > defined and flexable enough for any use case. In Xe we use these
> > messages to clean up entites, set properties for entites, and suspend /
> > resume execution of an entity [5]. I suspect other driver can leverage
> > this messaging concept too as it a convenient way to avoid races in the
> > backend.
> > 
> > - Support for using TDR for all error paths of a scheduler / entity
> > 
> > Fix a few races / bugs, add function to dynamically set the TDR timeout.
> > 
> > - Annotate dma-fences for long running workloads.
> > 
> > The idea here is to use dma-fences only as sync points within the
> > scheduler and never export them for long running workloads. By
> > annotating these fences as long running we ensure that these dma-fences
> > are never used in a way that breaks the dma-fence rules. A benefit of
> > thus approach is the scheduler can still safely flow control the
> > execution ring buffer via the job limit without breaking the dma-fence
> > rules.
> > 
> > Again this a first draft and looking forward to feedback.
> > 
> > Enjoy - Matt
> > 
> > [1] https://gitlab.freedesktop.org/drm/xe/kernel
> > [2] https://patchwork.freedesktop.org/series/112188/
> > [3] https://patchwork.freedesktop.org/series/114772/
> > [4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1
> > [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031
> > 
> > Matthew Brost (8):
> >    drm/sched: Convert drm scheduler to use a work queue rather than
> >      kthread
> >    drm/sched: Move schedule policy to scheduler / entity
> >    drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
> >    drm/sched: Add generic scheduler message interface
> >    drm/sched: Start run wq before TDR in drm_sched_start
> >    drm/sched: Submit job before starting TDR
> >    drm/sched: Add helper to set TDR timeout
> >    drm/syncobj: Warn on long running dma-fences
> > 
> > Thomas Hellström (2):
> >    dma-buf/dma-fence: Introduce long-running completion fences
> >    drm/sched: Support long-running sched entities
> > 
> >   drivers/dma-buf/dma-fence.c                 | 142 +++++++---
> >   drivers/dma-buf/dma-resv.c                  |   5 +
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
> >   drivers/gpu/drm/drm_syncobj.c               |   5 +-
> >   drivers/gpu/drm/etnaviv/etnaviv_sched.c     |   5 +-
> >   drivers/gpu/drm/lima/lima_sched.c           |   5 +-
> >   drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
> >   drivers/gpu/drm/msm/msm_ringbuffer.c        |   5 +-
> >   drivers/gpu/drm/panfrost/panfrost_job.c     |   5 +-
> >   drivers/gpu/drm/scheduler/sched_entity.c    | 127 +++++++--
> >   drivers/gpu/drm/scheduler/sched_fence.c     |   6 +-
> >   drivers/gpu/drm/scheduler/sched_main.c      | 278 +++++++++++++++-----
> >   drivers/gpu/drm/v3d/v3d_sched.c             |  25 +-
> >   include/drm/gpu_scheduler.h                 | 130 +++++++--
> >   include/linux/dma-fence.h                   |  60 ++++-
> >   16 files changed, 649 insertions(+), 184 deletions(-)
> > 


* Re: [Intel-xe] [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans
  2023-04-04 13:52   ` Matthew Brost
@ 2023-04-04 17:29     ` Tvrtko Ursulin
  2023-04-04 19:07       ` Daniel Vetter
  0 siblings, 1 reply; 87+ messages in thread
From: Tvrtko Ursulin @ 2023-04-04 17:29 UTC (permalink / raw)
  To: Matthew Brost
  Cc: robdclark, airlied, lina, dri-devel, christian.koenig,
	boris.brezillon, intel-xe, faith.ekstrand


On 04/04/2023 14:52, Matthew Brost wrote:
> On Tue, Apr 04, 2023 at 10:43:03AM +0100, Tvrtko Ursulin wrote:
>>
>> On 04/04/2023 01:22, Matthew Brost wrote:
>>> Hello,
>>>
>>> As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
>>> have been asked to merge our common DRM scheduler patches first as well
>>> as develop a common solution for long running workloads with the DRM
>>> scheduler. This RFC series is our first attempt at doing this. We
>>> welcome any and all feedback.
>>>
>>> This can we thought of as 4 parts detailed below.
>>>
>>> - DRM scheduler changes for 1 to 1 relationship between scheduler and
>>> entity (patches 1-3)
>>>
>>> In Xe all of the scheduling of jobs is done by a firmware scheduler (the
>>> GuC) which is a new paradigm WRT to the DRM scheduler and presents
>>> severals problems as the DRM was originally designed to schedule jobs on
>>> hardware queues. The main problem being that DRM scheduler expects the
>>> submission order of jobs to be the completion order of jobs even across
>>> multiple entities. This assumption falls apart with a firmware scheduler
>>> as a firmware scheduler has no concept of jobs and jobs can complete out
>>> of order. A novel solution for was originally thought of by Faith during
>>> the initial prototype of Xe, create a 1 to 1 relationship between scheduler
>>> and entity. I believe the AGX driver [3] is using this approach and
>>> Boris may use approach as well for the Mali driver [4].
>>>
>>> To support a 1 to 1 relationship we move the main execution function
>>> from a kthread to a work queue and add a new scheduling mode which
>>> bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
>>> The new scheduling mode should unify all drivers usage with a 1 to 1
>>> relationship and can be thought of as using scheduler as a dependency /
>>> infligt job tracker rather than a true scheduler.
>>
>> Once you add capability for a more proper 1:1 via
>> DRM_SCHED_POLICY_SINGLE_ENTITY, do you still have further need to replace
>> kthreads with a wq?
>>
>> Or in other words, what purpose does the offloading of a job picking code to
>> a separate execution context serve? Could it be done directly in the 1:1
>> mode and leave kthread setup for N:M?
>>
> 
> Addressed the other two on my reply to Christian...
> 
> For this one basically the concept of a single entity point IMO is a
> very good concept which I'd like to keep. But most important reason
> being the main execution thread (now worker) is kicked when a dependency
> for a job is resolved, dependencies are dma-fences signaled via a
> callback, and these call backs can be signaled in IRQ contexts. We
> absolutely do not want to enter the backend in an IRQ context for a
> variety of reasons.

Sounds like a fair enough requirement, but if drivers will not be
comfortable with the wq conversion, it is probably possible to introduce
some vfuncs for the 1:1 case which would allow scheduler users to override
the scheduler wakeup and select a special "pick one job" path. That
could allow 1:1 users to do their thing, leaving the rest as is. I mean you
already have the special single entity scheduler, you'd just need to add
some more specialization on the init, wake up, etc. paths.

And I will mention once more that I find a wq item with a loop such as:

	while (!READ_ONCE(sched->pause_run_wq)) {
	...

A bit dodgy. If you piggy-back on any system_wq it smells of system-wide
starvation, so for me any proposal with an option to use a system-shared
wq is a no-go.
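
For comparison, a dedicated, ordered, high-priority queue per scheduler
would avoid the shared-wq starvation concern entirely. A minimal sketch
(the helper name is illustrative, not from the series):

#include <linux/workqueue.h>

/* One queue per scheduler: ordered and high priority, nothing shared
 * system-wide. */
static struct workqueue_struct *example_sched_alloc_run_wq(const char *name)
{
	return alloc_ordered_workqueue("%s", WQ_HIGHPRI, name);
}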

Regards,

Tvrtko


>> Apart from those design level questions, low level open IMO still is that
>> default fallback of using the system_wq has the potential to affect latency
>> for other drivers. But that's for those driver owners to approve.
>>
>> Regards,
>>
>> Tvrtko
>>
>>> - Generic messaging interface for DRM scheduler
>>>
>>> Idea is to be able to communicate to the submission backend with in band
>>> (relative to main execution function) messages. Messages are backend
>>> defined and flexable enough for any use case. In Xe we use these
>>> messages to clean up entites, set properties for entites, and suspend /
>>> resume execution of an entity [5]. I suspect other driver can leverage
>>> this messaging concept too as it a convenient way to avoid races in the
>>> backend.
>>>
>>> - Support for using TDR for all error paths of a scheduler / entity
>>>
>>> Fix a few races / bugs, add function to dynamically set the TDR timeout.
>>>
>>> - Annotate dma-fences for long running workloads.
>>>
>>> The idea here is to use dma-fences only as sync points within the
>>> scheduler and never export them for long running workloads. By
>>> annotating these fences as long running we ensure that these dma-fences
>>> are never used in a way that breaks the dma-fence rules. A benefit of
>>> thus approach is the scheduler can still safely flow control the
>>> execution ring buffer via the job limit without breaking the dma-fence
>>> rules.
>>>
>>> Again this a first draft and looking forward to feedback.
>>>
>>> Enjoy - Matt
>>>
>>> [1] https://gitlab.freedesktop.org/drm/xe/kernel
>>> [2] https://patchwork.freedesktop.org/series/112188/
>>> [3] https://patchwork.freedesktop.org/series/114772/
>>> [4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1
>>> [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031
>>>
>>> Matthew Brost (8):
>>>     drm/sched: Convert drm scheduler to use a work queue rather than
>>>       kthread
>>>     drm/sched: Move schedule policy to scheduler / entity
>>>     drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
>>>     drm/sched: Add generic scheduler message interface
>>>     drm/sched: Start run wq before TDR in drm_sched_start
>>>     drm/sched: Submit job before starting TDR
>>>     drm/sched: Add helper to set TDR timeout
>>>     drm/syncobj: Warn on long running dma-fences
>>>
>>> Thomas Hellström (2):
>>>     dma-buf/dma-fence: Introduce long-running completion fences
>>>     drm/sched: Support long-running sched entities
>>>
>>>    drivers/dma-buf/dma-fence.c                 | 142 +++++++---
>>>    drivers/dma-buf/dma-resv.c                  |   5 +
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
>>>    drivers/gpu/drm/drm_syncobj.c               |   5 +-
>>>    drivers/gpu/drm/etnaviv/etnaviv_sched.c     |   5 +-
>>>    drivers/gpu/drm/lima/lima_sched.c           |   5 +-
>>>    drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
>>>    drivers/gpu/drm/msm/msm_ringbuffer.c        |   5 +-
>>>    drivers/gpu/drm/panfrost/panfrost_job.c     |   5 +-
>>>    drivers/gpu/drm/scheduler/sched_entity.c    | 127 +++++++--
>>>    drivers/gpu/drm/scheduler/sched_fence.c     |   6 +-
>>>    drivers/gpu/drm/scheduler/sched_main.c      | 278 +++++++++++++++-----
>>>    drivers/gpu/drm/v3d/v3d_sched.c             |  25 +-
>>>    include/drm/gpu_scheduler.h                 | 130 +++++++--
>>>    include/linux/dma-fence.h                   |  60 ++++-
>>>    16 files changed, 649 insertions(+), 184 deletions(-)
>>>


* Re: [Intel-xe] [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans
  2023-04-04  0:22 [Intel-xe] [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans Matthew Brost
                   ` (14 preceding siblings ...)
  2023-04-04  9:43 ` Tvrtko Ursulin
@ 2023-04-04 18:02 ` Zeng, Oak
  2023-04-04 18:08   ` Matthew Brost
  2023-04-18 15:10 ` Liviu Dudau
  16 siblings, 1 reply; 87+ messages in thread
From: Zeng, Oak @ 2023-04-04 18:02 UTC (permalink / raw)
  To: Brost, Matthew, dri-devel, intel-xe
  Cc: robdclark, airlied, lina, boris.brezillon, christian.koenig,
	faith.ekstrand

Hi Matt, Thomas,

Some very bold out of box thinking in this area:

1. So you want to use the drm scheduler and dma-fence for long-running workloads. Why do you want to do this in the first place? What is the benefit? The drm scheduler is pretty much a software scheduler. Modern GPUs have a scheduler built in at the fw/hw level; as you said below, for Intel this is the GuC. Can the xe driver just submit jobs directly to the GuC, bypassing the drm scheduler?

2. Using dma-fence for long-running workloads: I am well aware that a page fault (and the consequent memory allocation / lock acquisition to fix the fault) can cause a deadlock for a dma-fence wait. But I am not convinced that dma-fence can't be used purely because of the nature of the workload, i.e. that it runs very long (indefinitely). I did the math: the dma_fence_wait_timeout function's third param is the timeout, which is a signed long type. If HZ is 1000, this is about 23 days. If 23 days is not long enough, can we just change the timeout parameter to signed 64 bits so it is much longer than our lifetime...
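
A quick standalone check of that arithmetic (it assumes a 32-bit signed
jiffies count at HZ == 1000; on 64-bit kernels a signed long is already
64 bits, so the limit is far larger):

#include <limits.h>
#include <stdio.h>

/* 2^31 - 1 jiffies at HZ == 1000 comes out to roughly 24 days. */
int main(void)
{
	long long max_jiffies = INT_MAX;	/* 2^31 - 1 */
	long long hz = 1000;

	printf("max wait: ~%lld days\n", max_jiffies / hz / 86400);
	return 0;
}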

So I mainly argue that the reason we can't use dma-fence for long-running workloads is not that the workload runs very long, but rather the fact that we use page faults for long-running workloads. If we enabled page faults for short-running workloads, we couldn't use dma-fence either. Page faults are the key thing here.

Now, since we use page faults, which are *fundamentally* at odds with the dma-fence design, why not just introduce an independent concept such as a user-fence instead of extending the existing dma-fence?

I like a unified design. If the drm scheduler and dma-fence can be extended to work for everything, that is beautiful. But it seems we have some fundamental problems here.

Thanks,
Oak

> -----Original Message-----
> From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of
> Matthew Brost
> Sent: April 3, 2023 8:22 PM
> To: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org
> Cc: robdclark@chromium.org; thomas.hellstrom@linux.intel.com; airlied@linux.ie;
> lina@asahilina.net; boris.brezillon@collabora.com; Brost, Matthew
> <matthew.brost@intel.com>; christian.koenig@amd.com;
> faith.ekstrand@collabora.com
> Subject: [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans
> 
> Hello,
> 
> As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> have been asked to merge our common DRM scheduler patches first as well
> as develop a common solution for long running workloads with the DRM
> scheduler. This RFC series is our first attempt at doing this. We
> welcome any and all feedback.
> 
> This can we thought of as 4 parts detailed below.
> 
> - DRM scheduler changes for 1 to 1 relationship between scheduler and
> entity (patches 1-3)
> 
> In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> GuC) which is a new paradigm WRT to the DRM scheduler and presents
> severals problems as the DRM was originally designed to schedule jobs on
> hardware queues. The main problem being that DRM scheduler expects the
> submission order of jobs to be the completion order of jobs even across
> multiple entities. This assumption falls apart with a firmware scheduler
> as a firmware scheduler has no concept of jobs and jobs can complete out
> of order. A novel solution for was originally thought of by Faith during
> the initial prototype of Xe, create a 1 to 1 relationship between scheduler
> and entity. I believe the AGX driver [3] is using this approach and
> Boris may use approach as well for the Mali driver [4].
> 
> To support a 1 to 1 relationship we move the main execution function
> from a kthread to a work queue and add a new scheduling mode which
> bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
> The new scheduling mode should unify all drivers usage with a 1 to 1
> relationship and can be thought of as using scheduler as a dependency /
> infligt job tracker rather than a true scheduler.
> 
> - Generic messaging interface for DRM scheduler
> 
> Idea is to be able to communicate to the submission backend with in band
> (relative to main execution function) messages. Messages are backend
> defined and flexable enough for any use case. In Xe we use these
> messages to clean up entites, set properties for entites, and suspend /
> resume execution of an entity [5]. I suspect other driver can leverage
> this messaging concept too as it a convenient way to avoid races in the
> backend.
> 
> - Support for using TDR for all error paths of a scheduler / entity
> 
> Fix a few races / bugs, add function to dynamically set the TDR timeout.
> 
> - Annotate dma-fences for long running workloads.
> 
> The idea here is to use dma-fences only as sync points within the
> scheduler and never export them for long running workloads. By
> annotating these fences as long running we ensure that these dma-fences
> are never used in a way that breaks the dma-fence rules. A benefit of
> thus approach is the scheduler can still safely flow control the
> execution ring buffer via the job limit without breaking the dma-fence
> rules.
> 
> Again this a first draft and looking forward to feedback.
> 
> Enjoy - Matt
> 
> [1] https://gitlab.freedesktop.org/drm/xe/kernel
> [2] https://patchwork.freedesktop.org/series/112188/
> [3] https://patchwork.freedesktop.org/series/114772/
> [4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1
> [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-
> next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031
> 
> Matthew Brost (8):
>   drm/sched: Convert drm scheduler to use a work queue rather than
>     kthread
>   drm/sched: Move schedule policy to scheduler / entity
>   drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
>   drm/sched: Add generic scheduler message interface
>   drm/sched: Start run wq before TDR in drm_sched_start
>   drm/sched: Submit job before starting TDR
>   drm/sched: Add helper to set TDR timeout
>   drm/syncobj: Warn on long running dma-fences
> 
> Thomas Hellström (2):
>   dma-buf/dma-fence: Introduce long-running completion fences
>   drm/sched: Support long-running sched entities
> 
>  drivers/dma-buf/dma-fence.c                 | 142 +++++++---
>  drivers/dma-buf/dma-resv.c                  |   5 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
>  drivers/gpu/drm/drm_syncobj.c               |   5 +-
>  drivers/gpu/drm/etnaviv/etnaviv_sched.c     |   5 +-
>  drivers/gpu/drm/lima/lima_sched.c           |   5 +-
>  drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
>  drivers/gpu/drm/msm/msm_ringbuffer.c        |   5 +-
>  drivers/gpu/drm/panfrost/panfrost_job.c     |   5 +-
>  drivers/gpu/drm/scheduler/sched_entity.c    | 127 +++++++--
>  drivers/gpu/drm/scheduler/sched_fence.c     |   6 +-
>  drivers/gpu/drm/scheduler/sched_main.c      | 278 +++++++++++++++-----
>  drivers/gpu/drm/v3d/v3d_sched.c             |  25 +-
>  include/drm/gpu_scheduler.h                 | 130 +++++++--
>  include/linux/dma-fence.h                   |  60 ++++-
>  16 files changed, 649 insertions(+), 184 deletions(-)
> 
> --
> 2.34.1



* Re: [Intel-xe] [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans
  2023-04-04 18:02 ` Zeng, Oak
@ 2023-04-04 18:08   ` Matthew Brost
  2023-04-05  7:30     ` Christian König
  0 siblings, 1 reply; 87+ messages in thread
From: Matthew Brost @ 2023-04-04 18:08 UTC (permalink / raw)
  To: Zeng, Oak
  Cc: robdclark, airlied, lina, dri-devel, christian.koenig,
	boris.brezillon, intel-xe, faith.ekstrand

On Tue, Apr 04, 2023 at 12:02:03PM -0600, Zeng, Oak wrote:
> Hi Matt, Thomas,
> 
> Some very bold out of box thinking in this area:
> 
> 1. so you want to use drm scheduler and dma-fence for long running workload. Why you want to do this in the first place? What is the benefit? Drm scheduler is pretty much a software scheduler. Modern gpu has scheduler built at fw/hw level, as you said below for intel this is Guc. Can xe driver just directly submit job to Guc, bypassing drm scheduler? 
>

If we did that, we would now have 2 paths for dependency tracking, flow
controlling the ring, resets / error handling, and backend submission
implementations. We don't want this.
 
> 2. using dma-fence for long run workload: I am well aware that page fault (and the consequent memory allocation/lock acquiring to fix the fault) can cause deadlock for a dma-fence wait. But I am not convinced that dma-fence can't be used purely because the nature of the workload that it runs very long (indefinite). I did a math: the dma_fence_wait_timeout function's third param is the timeout which is a signed long type. If HZ is 1000, this is about 23 days. If 23 days is not long enough, can we just change the timeout parameter to signed 64 bits so it is much longer than our life time... 
> 
> So I mainly argue we can't use dma-fence for long-run workload is not because the workload runs very long, rather because of the fact that we use page fault for long-run workload. If we enable page fault for short-run workload, we can't use dma-fence either. Page fault is the key thing here.
> 
> Now since we use page fault which is *fundamentally* controversial with dma-fence design, why now just introduce a independent concept such as user-fence instead of extending existing dma-fence? 
> 
> I like unified design. If drm scheduler, dma-fence can be extended to work for everything, it is beautiful. But seems we have some fundamental problem here.
>

Thomas's patches turn a dma-fence into a KMD sync point (i.e. we just use
the signal / CB infrastructure) and enforce that we don't use these
dma-fences from the scheduler in memory reclaim paths or export them to
user space or other drivers. Think of this mode as a SW-only fence.
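
As a rough sketch of that enforcement idea (the flag bit and helper names
below are illustrative, not the API from Thomas's series), export points
would simply refuse such a fence:

#include <linux/bitops.h>
#include <linux/bug.h>
#include <linux/dma-fence.h>
#include <linux/dma-resv.h>

/* Illustrative driver-side flag; not the flag name used in the series. */
#define EXAMPLE_FENCE_FLAG_LONG_RUNNING_BIT	(DMA_FENCE_FLAG_USER_BITS + 0)

static inline bool example_fence_is_long_running(struct dma_fence *f)
{
	return test_bit(EXAMPLE_FENCE_FLAG_LONG_RUNNING_BIT, &f->flags);
}

/* Long-running fences must never end up in a dma_resv or leave the KMD. */
static void example_resv_add_fence(struct dma_resv *resv, struct dma_fence *f,
				   enum dma_resv_usage usage)
{
	if (WARN_ON(example_fence_is_long_running(f)))
		return;
	dma_resv_add_fence(resv, f, usage);
}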

Matt
 
> Thanks,
> Oak
> 
> > -----Original Message-----
> > From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of
> > Matthew Brost
> > Sent: April 3, 2023 8:22 PM
> > To: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org
> > Cc: robdclark@chromium.org; thomas.hellstrom@linux.intel.com; airlied@linux.ie;
> > lina@asahilina.net; boris.brezillon@collabora.com; Brost, Matthew
> > <matthew.brost@intel.com>; christian.koenig@amd.com;
> > faith.ekstrand@collabora.com
> > Subject: [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans
> > 
> > Hello,
> > 
> > As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> > have been asked to merge our common DRM scheduler patches first as well
> > as develop a common solution for long running workloads with the DRM
> > scheduler. This RFC series is our first attempt at doing this. We
> > welcome any and all feedback.
> > 
> > This can we thought of as 4 parts detailed below.
> > 
> > - DRM scheduler changes for 1 to 1 relationship between scheduler and
> > entity (patches 1-3)
> > 
> > In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> > GuC) which is a new paradigm WRT to the DRM scheduler and presents
> > severals problems as the DRM was originally designed to schedule jobs on
> > hardware queues. The main problem being that DRM scheduler expects the
> > submission order of jobs to be the completion order of jobs even across
> > multiple entities. This assumption falls apart with a firmware scheduler
> > as a firmware scheduler has no concept of jobs and jobs can complete out
> > of order. A novel solution for was originally thought of by Faith during
> > the initial prototype of Xe, create a 1 to 1 relationship between scheduler
> > and entity. I believe the AGX driver [3] is using this approach and
> > Boris may use approach as well for the Mali driver [4].
> > 
> > To support a 1 to 1 relationship we move the main execution function
> > from a kthread to a work queue and add a new scheduling mode which
> > bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
> > The new scheduling mode should unify all drivers usage with a 1 to 1
> > relationship and can be thought of as using scheduler as a dependency /
> > infligt job tracker rather than a true scheduler.
> > 
> > - Generic messaging interface for DRM scheduler
> > 
> > Idea is to be able to communicate to the submission backend with in band
> > (relative to main execution function) messages. Messages are backend
> > defined and flexable enough for any use case. In Xe we use these
> > messages to clean up entites, set properties for entites, and suspend /
> > resume execution of an entity [5]. I suspect other driver can leverage
> > this messaging concept too as it a convenient way to avoid races in the
> > backend.
> > 
> > - Support for using TDR for all error paths of a scheduler / entity
> > 
> > Fix a few races / bugs, add function to dynamically set the TDR timeout.
> > 
> > - Annotate dma-fences for long running workloads.
> > 
> > The idea here is to use dma-fences only as sync points within the
> > scheduler and never export them for long running workloads. By
> > annotating these fences as long running we ensure that these dma-fences
> > are never used in a way that breaks the dma-fence rules. A benefit of
> > thus approach is the scheduler can still safely flow control the
> > execution ring buffer via the job limit without breaking the dma-fence
> > rules.
> > 
> > Again this a first draft and looking forward to feedback.
> > 
> > Enjoy - Matt
> > 
> > [1] https://gitlab.freedesktop.org/drm/xe/kernel
> > [2] https://patchwork.freedesktop.org/series/112188/
> > [3] https://patchwork.freedesktop.org/series/114772/
> > [4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1
> > [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-
> > next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031
> > 
> > Matthew Brost (8):
> >   drm/sched: Convert drm scheduler to use a work queue rather than
> >     kthread
> >   drm/sched: Move schedule policy to scheduler / entity
> >   drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
> >   drm/sched: Add generic scheduler message interface
> >   drm/sched: Start run wq before TDR in drm_sched_start
> >   drm/sched: Submit job before starting TDR
> >   drm/sched: Add helper to set TDR timeout
> >   drm/syncobj: Warn on long running dma-fences
> > 
> > Thomas Hellström (2):
> >   dma-buf/dma-fence: Introduce long-running completion fences
> >   drm/sched: Support long-running sched entities
> > 
> >  drivers/dma-buf/dma-fence.c                 | 142 +++++++---
> >  drivers/dma-buf/dma-resv.c                  |   5 +
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
> >  drivers/gpu/drm/drm_syncobj.c               |   5 +-
> >  drivers/gpu/drm/etnaviv/etnaviv_sched.c     |   5 +-
> >  drivers/gpu/drm/lima/lima_sched.c           |   5 +-
> >  drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
> >  drivers/gpu/drm/msm/msm_ringbuffer.c        |   5 +-
> >  drivers/gpu/drm/panfrost/panfrost_job.c     |   5 +-
> >  drivers/gpu/drm/scheduler/sched_entity.c    | 127 +++++++--
> >  drivers/gpu/drm/scheduler/sched_fence.c     |   6 +-
> >  drivers/gpu/drm/scheduler/sched_main.c      | 278 +++++++++++++++-----
> >  drivers/gpu/drm/v3d/v3d_sched.c             |  25 +-
> >  include/drm/gpu_scheduler.h                 | 130 +++++++--
> >  include/linux/dma-fence.h                   |  60 ++++-
> >  16 files changed, 649 insertions(+), 184 deletions(-)
> > 
> > --
> > 2.34.1
> 


* Re: [Intel-xe] [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences
  2023-04-04 13:10       ` Christian König
@ 2023-04-04 18:14         ` Thomas Hellström (Intel)
  2023-04-04 19:02           ` Matthew Brost
  2023-04-04 19:00         ` Daniel Vetter
  1 sibling, 1 reply; 87+ messages in thread
From: Thomas Hellström (Intel) @ 2023-04-04 18:14 UTC (permalink / raw)
  To: Christian König, Thomas Hellström, Matthew Brost,
	dri-devel, intel-xe
  Cc: robdclark, airlied, boris.brezillon, faith.ekstrand, lina


On 4/4/23 15:10, Christian König wrote:
> Am 04.04.23 um 14:54 schrieb Thomas Hellström:
>> Hi, Christian,
>>
>> On 4/4/23 11:09, Christian König wrote:
>>> Am 04.04.23 um 02:22 schrieb Matthew Brost:
>>>> From: Thomas Hellström <thomas.hellstrom@linux.intel.com>
>>>>
>>>> For long-running workloads, drivers either need to open-code 
>>>> completion
>>>> waits, invent their own synchronization primitives or internally use
>>>> dma-fences that do not obey the cross-driver dma-fence protocol, but
>>>> without any lockdep annotation all these approaches are error prone.
>>>>
>>>> So since for example the drm scheduler uses dma-fences it is 
>>>> desirable for
>>>> a driver to be able to use it for throttling and error handling 
>>>> also with
>>>> internal dma-fences tha do not obey the cros-driver dma-fence 
>>>> protocol.
>>>>
>>>> Introduce long-running completion fences in form of dma-fences, and 
>>>> add
>>>> lockdep annotation for them. In particular:
>>>>
>>>> * Do not allow waiting under any memory management locks.
>>>> * Do not allow to attach them to a dma-resv object.
>>>> * Introduce a new interface for adding callbacks making the helper 
>>>> adding
>>>>    a callback sign off on that it is aware that the dma-fence may not
>>>>    complete anytime soon. Typically this will be the scheduler 
>>>> chaining
>>>>    a new long-running fence on another one.
>>>
>>> Well that's pretty much what I tried before: 
>>> https://lwn.net/Articles/893704/
>>>
>>> And the reasons why it was rejected haven't changed.
>>>
>>> Regards,
>>> Christian.
>>>
>> Yes, TBH this was mostly to get discussion going how we'd best tackle 
>> this problem while being able to reuse the scheduler for long-running 
>> workloads.
>>
>> I couldn't see any clear decision on your series, though, but one 
>> main difference I see is that this is intended for driver-internal 
>> use only. (I'm counting using the drm_scheduler as a helper for 
>> driver-private use). This is by no means a way to try tackle the 
>> indefinite fence problem.
>
> Well this was just my latest try to tackle this, but essentially the 
> problems are the same as with your approach: When we express such 
> operations as dma_fence there is always the change that we leak that 
> somewhere.
>
> My approach of adding a flag noting that this operation is dangerous 
> and can't be synced with something memory management depends on tried 
> to contain this as much as possible, but Daniel still pretty clearly 
> rejected it (for good reasons I think).
>
>>
>> We could ofc invent a completely different data-type that abstracts 
>> the synchronization the scheduler needs in the long-running case, or 
>> each driver could hack something up, like sleeping in the 
>> prepare_job() or run_job() callback for throttling, but those waits 
>> should still be annotated in one way or annotated one way or another 
>> (and probably in a similar way across drivers) to make sure we don't 
>> do anything bad.
>>
>>  So any suggestions as to what would be the better solution here 
>> would be appreciated.
>
> Mhm, do we really the the GPU scheduler for that?
>
> I mean in the 1 to 1 case  you basically just need a component which 
> collects the dependencies as dma_fence and if all of them are 
> fulfilled schedules a work item.
>
> As long as the work item itself doesn't produce a dma_fence it can 
> then still just wait for other none dma_fence dependencies.
>
> Then the work function could submit the work and wait for the result.
>
> The work item would then pretty much represent what you want, you can 
> wait for it to finish and pass it along as long running dependency.
>
> Maybe give it a funky name and wrap it up in a structure, but that's 
> basically it.
>
This very much sounds like an i915_sw_fence for the dependency tracking
and a dma_fence_work for the actual work, although its completion fence
is a dma_fence.

Although that goes against the whole idea that a condition for merging the
xe driver would be that we implement some sort of minimal scaffolding for
long-running workloads in the drm scheduler, and the thinking behind that
is to avoid implementing intel-specific solutions like those...

Thanks,

Thomas



> Regards,
> Christian.
>
>>
>> Thanks,
>>
>> Thomas
>>
>>
>>
>>
>>


* Re: [Intel-xe] [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences
  2023-04-04 13:10       ` Christian König
  2023-04-04 18:14         ` Thomas Hellström (Intel)
@ 2023-04-04 19:00         ` Daniel Vetter
  2023-04-04 20:03           ` Matthew Brost
  1 sibling, 1 reply; 87+ messages in thread
From: Daniel Vetter @ 2023-04-04 19:00 UTC (permalink / raw)
  To: Christian König
  Cc: airlied, lina, dri-devel, boris.brezillon, robdclark, intel-xe,
	faith.ekstrand

On Tue, 4 Apr 2023 at 15:10, Christian König <christian.koenig@amd.com> wrote:
>
> Am 04.04.23 um 14:54 schrieb Thomas Hellström:
> > Hi, Christian,
> >
> > On 4/4/23 11:09, Christian König wrote:
> >> Am 04.04.23 um 02:22 schrieb Matthew Brost:
> >>> From: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> >>>
> >>> For long-running workloads, drivers either need to open-code completion
> >>> waits, invent their own synchronization primitives or internally use
> >>> dma-fences that do not obey the cross-driver dma-fence protocol, but
> >>> without any lockdep annotation all these approaches are error prone.
> >>>
> >>> So since for example the drm scheduler uses dma-fences it is
> >>> desirable for
> >>> a driver to be able to use it for throttling and error handling also
> >>> with
> >>> internal dma-fences tha do not obey the cros-driver dma-fence protocol.
> >>>
> >>> Introduce long-running completion fences in form of dma-fences, and add
> >>> lockdep annotation for them. In particular:
> >>>
> >>> * Do not allow waiting under any memory management locks.
> >>> * Do not allow to attach them to a dma-resv object.
> >>> * Introduce a new interface for adding callbacks making the helper
> >>> adding
> >>>    a callback sign off on that it is aware that the dma-fence may not
> >>>    complete anytime soon. Typically this will be the scheduler chaining
> >>>    a new long-running fence on another one.
> >>
> >> Well that's pretty much what I tried before:
> >> https://lwn.net/Articles/893704/
> >>
> >> And the reasons why it was rejected haven't changed.
> >>
> >> Regards,
> >> Christian.
> >>
> > Yes, TBH this was mostly to get discussion going how we'd best tackle
> > this problem while being able to reuse the scheduler for long-running
> > workloads.
> >
> > I couldn't see any clear decision on your series, though, but one main
> > difference I see is that this is intended for driver-internal use
> > only. (I'm counting using the drm_scheduler as a helper for
> > driver-private use). This is by no means a way to try tackle the
> > indefinite fence problem.
>
> Well this was just my latest try to tackle this, but essentially the
> problems are the same as with your approach: When we express such
> operations as dma_fence there is always the change that we leak that
> somewhere.
>
> My approach of adding a flag noting that this operation is dangerous and
> can't be synced with something memory management depends on tried to
> contain this as much as possible, but Daniel still pretty clearly
> rejected it (for good reasons I think).

Yeah, I still don't like dma_fences that somehow have totally different
semantics in that critical piece of "will it complete or will it
deadlock?" :-)
>
> >
> > We could ofc invent a completely different data-type that abstracts
> > the synchronization the scheduler needs in the long-running case, or
> > each driver could hack something up, like sleeping in the
> > prepare_job() or run_job() callback for throttling, but those waits
> > should still be annotated in one way or annotated one way or another
> > (and probably in a similar way across drivers) to make sure we don't
> > do anything bad.
> >
> >  So any suggestions as to what would be the better solution here would
> > be appreciated.
>
> Mhm, do we really the the GPU scheduler for that?
>
> I mean in the 1 to 1 case  you basically just need a component which
> collects the dependencies as dma_fence and if all of them are fulfilled
> schedules a work item.
>
> As long as the work item itself doesn't produce a dma_fence it can then
> still just wait for other none dma_fence dependencies.

Yeah, that's the important thing: for long-running jobs, dependencies as
dma_fence should be totally fine. You're just not allowed to have any
outgoing dma_fences at all (except the magic preemption fence).

> Then the work function could submit the work and wait for the result.
>
> The work item would then pretty much represent what you want, you can
> wait for it to finish and pass it along as long running dependency.
>
> Maybe give it a funky name and wrap it up in a structure, but that's
> basically it.

Like, do we need this? If the kernel ever waits for a long-running
compute job to finish I'd call that a bug. Any functional
dependencies between engines or whatever are userspace's problem only,
which it needs to sort out using userspace memory fences.
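
At its core a userspace memory fence is just a seqno in memory shared
with the GPU that gets compared against a wait value, with no kernel
dma_fence involved. A minimal illustrative sketch (wait mechanics such
as futexes or dedicated ioctls left out):

#include <stdatomic.h>
#include <stdint.h>

/* The GPU (or another engine) writes *seqno; the waiter just compares
 * it against the value it is waiting for. */
static inline int umf_signaled(_Atomic uint64_t *seqno, uint64_t wait_value)
{
	return atomic_load_explicit(seqno, memory_order_acquire) >= wait_value;
}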

The only things the kernel needs are some way to track dependencies as
dma_fence (because memory management moves the memory away and we need
to move it back in, ideally pipelined). And it needs the special
preempt fence (if we don't have pagefaults) so that you have a fence
to attach to all the dma_resv for memory management purposes. Now the
scheduler already has almost all the pieces (at least if we assume
there's some magic fw which time-slices these contexts on its own),
and we just need a few minimal changes:
- allowing the scheduler to ignore the completion fence and just
immediately push the next "job" in if its dependencies are ready
- maybe minimal amounts of scaffolding to handle the preemption
dma_fence because that's not entirely trivial. I think ideally we'd
put that into drm_sched_entity since you can only ever have one active
preempt dma_fence per gpu ctx/entity.

None of this needs a dma_fence_is_lr anywhere at all.
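
One way to approximate the first bullet with the scheduler as it is today
is for the backend to return no hardware fence from run_job for
long-running contexts, so the job is treated as complete and the next one
is pushed as soon as its dependencies are met (the drawbacks of this are
discussed elsewhere in the thread). A sketch, with the submit helper below
being an illustrative stub:

#include <drm/gpu_scheduler.h>

/* Illustrative stub standing in for the real ring / firmware submission. */
static void example_backend_submit(struct drm_sched_job *sched_job)
{
	/* write ring instructions, ring the doorbell, etc. */
}

static struct dma_fence *example_lr_run_job(struct drm_sched_job *sched_job)
{
	example_backend_submit(sched_job);

	/* No outgoing dma_fence for long-running work: the scheduler
	 * considers the job done and moves on once dependencies are met. */
	return NULL;
}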

Of course there's the somewhat related issue of "how do we transport
these userspace memory fences around from app to compositor", and
that's a lot more gnarly. I still don't think dma_fence_is_lr is
anywhere near what the solution should look like for that.
-Daniel


> Regards,
> Christian.
>
> >
> > Thanks,
> >
> > Thomas
> >
> >
> >
> >
> >
>


-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


* Re: [Intel-xe] [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences
  2023-04-04 18:14         ` Thomas Hellström (Intel)
@ 2023-04-04 19:02           ` Matthew Brost
  2023-04-04 19:25             ` Daniel Vetter
  0 siblings, 1 reply; 87+ messages in thread
From: Matthew Brost @ 2023-04-04 19:02 UTC (permalink / raw)
  To: Thomas Hellström (Intel)
  Cc: robdclark, airlied, lina, dri-devel, intel-xe, boris.brezillon,
	Christian König, faith.ekstrand

On Tue, Apr 04, 2023 at 08:14:01PM +0200, Thomas Hellström (Intel) wrote:
> 
> On 4/4/23 15:10, Christian König wrote:
> > Am 04.04.23 um 14:54 schrieb Thomas Hellström:
> > > Hi, Christian,
> > > 
> > > On 4/4/23 11:09, Christian König wrote:
> > > > Am 04.04.23 um 02:22 schrieb Matthew Brost:
> > > > > From: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > > > 
> > > > > For long-running workloads, drivers either need to open-code
> > > > > completion
> > > > > waits, invent their own synchronization primitives or internally use
> > > > > dma-fences that do not obey the cross-driver dma-fence protocol, but
> > > > > without any lockdep annotation all these approaches are error prone.
> > > > > 
> > > > > So since for example the drm scheduler uses dma-fences it is
> > > > > desirable for
> > > > > a driver to be able to use it for throttling and error
> > > > > handling also with
> > > > > internal dma-fences tha do not obey the cros-driver
> > > > > dma-fence protocol.
> > > > > 
> > > > > Introduce long-running completion fences in form of
> > > > > dma-fences, and add
> > > > > lockdep annotation for them. In particular:
> > > > > 
> > > > > * Do not allow waiting under any memory management locks.
> > > > > * Do not allow to attach them to a dma-resv object.
> > > > > * Introduce a new interface for adding callbacks making the
> > > > > helper adding
> > > > >    a callback sign off on that it is aware that the dma-fence may not
> > > > >    complete anytime soon. Typically this will be the
> > > > > scheduler chaining
> > > > >    a new long-running fence on another one.
> > > > 
> > > > Well that's pretty much what I tried before:
> > > > https://lwn.net/Articles/893704/
> > > > 

I don't think this is quite the same; this explicitly enforces that we don't
break the dma-fence rules (in the path of memory allocations, exported in
any way). Essentially this is just a SW sync point reusing the dma-fence
infrastructure for signaling / callbacks. I believe your series tried to
export these fences to user space (admittedly I haven't fully read your
series).

In this use case we essentially just want to flow control the ring via
the DRM scheduler + maintain a list of pending jobs so the TDR can be
used for cleanup if an LR entity encounters an error. To me this seems
perfectly reasonable, but I know the dma-fence rules are akin to a holy war.

If we return NULL in run_job, now we have to be able to sink all jobs
in the backend regardless of ring space, maintain a list of jobs pending
for cleanup after errors, and write a different cleanup path as now the
TDR doesn't work. It seems very, very silly to duplicate all of this code
when the DRM scheduler provides all of this for us. Also, if we go this
route, all drivers are going to invent their own ways to handle LR jobs
with the DRM scheduler.

This solution is pretty clear: mark the scheduler as LR and don't
export any fences from the scheduler. If you try to export these fences,
a blow-up happens.

> > > > And the reasons why it was rejected haven't changed.
> > > > 
> > > > Regards,
> > > > Christian.
> > > > 
> > > Yes, TBH this was mostly to get discussion going how we'd best
> > > tackle this problem while being able to reuse the scheduler for
> > > long-running workloads.
> > > 
> > > I couldn't see any clear decision on your series, though, but one
> > > main difference I see is that this is intended for driver-internal
> > > use only. (I'm counting using the drm_scheduler as a helper for
> > > driver-private use). This is by no means a way to try tackle the
> > > indefinite fence problem.
> > 
> > Well this was just my latest try to tackle this, but essentially the
> > problems are the same as with your approach: When we express such
> > operations as dma_fence there is always the change that we leak that
> > somewhere.
> > 
> > My approach of adding a flag noting that this operation is dangerous and
> > can't be synced with something memory management depends on tried to
> > contain this as much as possible, but Daniel still pretty clearly
> > rejected it (for good reasons I think).
> > 
> > > 
> > > We could ofc invent a completely different data-type that abstracts
> > > the synchronization the scheduler needs in the long-running case, or
> > > each driver could hack something up, like sleeping in the
> > > prepare_job() or run_job() callback for throttling, but those waits
> > > should still be annotated in one way or annotated one way or another
> > > (and probably in a similar way across drivers) to make sure we don't
> > > do anything bad.
> > > 
> > >  So any suggestions as to what would be the better solution here
> > > would be appreciated.
> > 
> > Mhm, do we really the the GPU scheduler for that?
> > 

I think we need to solve this within the DRM scheduler one way or
another.

> > I mean in the 1 to 1 case  you basically just need a component which
> > collects the dependencies as dma_fence and if all of them are fulfilled
> > schedules a work item.
> > 
> > As long as the work item itself doesn't produce a dma_fence it can then
> > still just wait for other none dma_fence dependencies.
> > 
> > Then the work function could submit the work and wait for the result.
> > 
> > The work item would then pretty much represent what you want, you can
> > wait for it to finish and pass it along as long running dependency.
> > 
> > Maybe give it a funky name and wrap it up in a structure, but that's
> > basically it.
> > 
> This very much sounds like a i915_sw_fence for the dependency tracking and
> dma_fence_work for the actual work although it's completion fence is a
> dma_fence.
>

Agree, this does sound too i915-ish; as stated below, one of the mandates
in Xe was to use the DRM scheduler. Beyond that, as someone who has
written a submission backend in both the i915 and Xe, I love how the DRM
scheduler works (single entry point); it makes everything so much easier.

Matt

> Although that goes against the whole idea of a condition for merging the xe
> driver would be that we implement some sort of minimal scaffolding for
> long-running workloads in the drm scheduler, and the thinking behind that is
> to avoid implementing intel-specific solutions like those...
> 
> Thanks,
> 
> Thomas
> 
> 
> 
> > Regards,
> > Christian.
> > 
> > > 
> > > Thanks,
> > > 
> > > Thomas
> > > 
> > > 
> > > 
> > > 
> > > 


* Re: [Intel-xe] [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans
  2023-04-04 17:29     ` Tvrtko Ursulin
@ 2023-04-04 19:07       ` Daniel Vetter
  0 siblings, 0 replies; 87+ messages in thread
From: Daniel Vetter @ 2023-04-04 19:07 UTC (permalink / raw)
  To: Tvrtko Ursulin
  Cc: robdclark, airlied, lina, dri-devel, intel-xe, boris.brezillon,
	christian.koenig, faith.ekstrand

On Tue, 4 Apr 2023 at 19:29, Tvrtko Ursulin
<tvrtko.ursulin@linux.intel.com> wrote:
>
>
> On 04/04/2023 14:52, Matthew Brost wrote:
> > On Tue, Apr 04, 2023 at 10:43:03AM +0100, Tvrtko Ursulin wrote:
> >>
> >> On 04/04/2023 01:22, Matthew Brost wrote:
> >>> Hello,
> >>>
> >>> As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> >>> have been asked to merge our common DRM scheduler patches first as well
> >>> as develop a common solution for long running workloads with the DRM
> >>> scheduler. This RFC series is our first attempt at doing this. We
> >>> welcome any and all feedback.
> >>>
> >>> This can we thought of as 4 parts detailed below.
> >>>
> >>> - DRM scheduler changes for 1 to 1 relationship between scheduler and
> >>> entity (patches 1-3)
> >>>
> >>> In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> >>> GuC) which is a new paradigm WRT to the DRM scheduler and presents
> >>> severals problems as the DRM was originally designed to schedule jobs on
> >>> hardware queues. The main problem being that DRM scheduler expects the
> >>> submission order of jobs to be the completion order of jobs even across
> >>> multiple entities. This assumption falls apart with a firmware scheduler
> >>> as a firmware scheduler has no concept of jobs and jobs can complete out
> >>> of order. A novel solution for was originally thought of by Faith during
> >>> the initial prototype of Xe, create a 1 to 1 relationship between scheduler
> >>> and entity. I believe the AGX driver [3] is using this approach and
> >>> Boris may use approach as well for the Mali driver [4].
> >>>
> >>> To support a 1 to 1 relationship we move the main execution function
> >>> from a kthread to a work queue and add a new scheduling mode which
> >>> bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
> >>> The new scheduling mode should unify all drivers usage with a 1 to 1
> >>> relationship and can be thought of as using scheduler as a dependency /
> >>> infligt job tracker rather than a true scheduler.
> >>
> >> Once you add capability for a more proper 1:1 via
> >> DRM_SCHED_POLICY_SINGLE_ENTITY, do you still have further need to replace
> >> kthreads with a wq?
> >>
> >> Or in other words, what purpose does the offloading of a job picking code to
> >> a separate execution context serve? Could it be done directly in the 1:1
> >> mode and leave kthread setup for N:M?
> >>
> >
> > Addressed the other two on my reply to Christian...
> >
> > For this one basically the concept of a single entity point IMO is a
> > very good concept which I'd like to keep. But most important reason
> > being the main execution thread (now worker) is kicked when a dependency
> > for a job is resolved, dependencies are dma-fences signaled via a
> > callback, and these call backs can be signaled in IRQ contexts. We
> > absolutely do not want to enter the backend in an IRQ context for a
> > variety of reasons.
>
> Sounds like a fair enough requirement but if drivers will not be
> comfortable with the wq conversion, it is probably possible to introduce
> some vfuncs for the 1:1 case which would allow scheduler users override
> the scheduler wakeup and select a special "pick one job" path. That
> could allow 1:1 users do their thing, leaving rest as is. I mean you
> already have the special single entity scheduler, you'd just need to add
> some more specialization on the init, wake up, etc paths.
>
> And I will mention once more that I find a wq item with a loop such as:
>
>         while (!READ_ONCE(sched->pause_run_wq)) {
>         ...
>
> A bit dodgy. If you piggyback on any system_wq it smells of system-wide
> starvation, so for me any proposal with an option to use a system-shared
> wq is a no-go.

Yeah, I think the argument for a wq-based scheduler would need a
per-drm_scheduler wq, like we currently have a per-scheduler kthread.
It might still need some serious work to replace kthread_stop/start()
with a wq-native equivalent (because this really is the tricky stuff we
shouldn't hand-roll unless someone is willing to write a few papers on
the lockless design that's done), but it would look a bunch more
reasonable. Having a per-sched workqueue might also help with the big
sched_stop/start/fini state transitions, which I really think should
still go over all the per-entity schedulers even in the 1:1 case
(because otherwise you get some funky code in drivers that do the
iterations needed, which probably tosses the fairly nice design the
current scheduler has by relying on the kthread_stop/start primitives
for this).
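
A very rough sketch of that per-scheduler workqueue shape, with invented
names; the pause/resume pair is exactly the hand-rolled, kthread_park()-like
piece that needs care:

#include <linux/workqueue.h>

struct example_wq_sched {
        struct workqueue_struct *run_wq;        /* one per scheduler */
        struct work_struct run_work;            /* replaces the kthread loop */
        bool pause_run_wq;
};

/* Placeholder: pick a job whose dependencies have signalled and run it;
 * returns false when nothing is runnable. */
static bool example_run_one_job(struct example_wq_sched *sched)
{
        return false;
}

static void example_run_work(struct work_struct *w)
{
        struct example_wq_sched *sched =
                container_of(w, struct example_wq_sched, run_work);

        while (!READ_ONCE(sched->pause_run_wq)) {
                if (!example_run_one_job(sched))
                        break;  /* nothing runnable, wait for the next kick */
        }
}

static int example_sched_init(struct example_wq_sched *sched, const char *name)
{
        /* Ordered and per-scheduler: nothing shared with system_wq. */
        sched->run_wq = alloc_ordered_workqueue("%s", 0, name);
        if (!sched->run_wq)
                return -ENOMEM;
        INIT_WORK(&sched->run_work, example_run_work);
        return 0;
}

/* Rough equivalent of kthread_park(): stop and flush the worker. */
static void example_sched_pause(struct example_wq_sched *sched)
{
        WRITE_ONCE(sched->pause_run_wq, true);
        cancel_work_sync(&sched->run_work);
}

static void example_sched_resume(struct example_wq_sched *sched)
{
        WRITE_ONCE(sched->pause_run_wq, false);
        queue_work(sched->run_wq, &sched->run_work);
}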
-Daniel

>
> Regards,
>
> Tvrtko
>
>
> >> Apart from those design-level questions, a low-level open issue IMO is
> >> still that the default fallback of using the system_wq has the potential
> >> to affect latency for other drivers. But that's for those driver owners
> >> to approve.
> >>
> >> Regards,
> >>
> >> Tvrtko
> >>
> >>> - Generic messaging interface for DRM scheduler
> >>>
> >>> Idea is to be able to communicate to the submission backend with in band
> >>> (relative to main execution function) messages. Messages are backend
> >>> defined and flexable enough for any use case. In Xe we use these
> >>> messages to clean up entites, set properties for entites, and suspend /
> >>> resume execution of an entity [5]. I suspect other driver can leverage
> >>> this messaging concept too as it a convenient way to avoid races in the
> >>> backend.
> >>>
> >>> - Support for using TDR for all error paths of a scheduler / entity
> >>>
> >>> Fix a few races / bugs, add function to dynamically set the TDR timeout.
> >>>
> >>> - Annotate dma-fences for long running workloads.
> >>>
> >>> The idea here is to use dma-fences only as sync points within the
> >>> scheduler and never export them for long running workloads. By
> >>> annotating these fences as long running we ensure that these dma-fences
> >>> are never used in a way that breaks the dma-fence rules. A benefit of
> >>> thus approach is the scheduler can still safely flow control the
> >>> execution ring buffer via the job limit without breaking the dma-fence
> >>> rules.
> >>>
> >>> Again this a first draft and looking forward to feedback.
> >>>
> >>> Enjoy - Matt
> >>>
> >>> [1] https://gitlab.freedesktop.org/drm/xe/kernel
> >>> [2] https://patchwork.freedesktop.org/series/112188/
> >>> [3] https://patchwork.freedesktop.org/series/114772/
> >>> [4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1
> >>> [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031
> >>>
> >>> Matthew Brost (8):
> >>>     drm/sched: Convert drm scheduler to use a work queue rather than
> >>>       kthread
> >>>     drm/sched: Move schedule policy to scheduler / entity
> >>>     drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
> >>>     drm/sched: Add generic scheduler message interface
> >>>     drm/sched: Start run wq before TDR in drm_sched_start
> >>>     drm/sched: Submit job before starting TDR
> >>>     drm/sched: Add helper to set TDR timeout
> >>>     drm/syncobj: Warn on long running dma-fences
> >>>
> >>> Thomas Hellström (2):
> >>>     dma-buf/dma-fence: Introduce long-running completion fences
> >>>     drm/sched: Support long-running sched entities
> >>>
> >>>    drivers/dma-buf/dma-fence.c                 | 142 +++++++---
> >>>    drivers/dma-buf/dma-resv.c                  |   5 +
> >>>    drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
> >>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
> >>>    drivers/gpu/drm/drm_syncobj.c               |   5 +-
> >>>    drivers/gpu/drm/etnaviv/etnaviv_sched.c     |   5 +-
> >>>    drivers/gpu/drm/lima/lima_sched.c           |   5 +-
> >>>    drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
> >>>    drivers/gpu/drm/msm/msm_ringbuffer.c        |   5 +-
> >>>    drivers/gpu/drm/panfrost/panfrost_job.c     |   5 +-
> >>>    drivers/gpu/drm/scheduler/sched_entity.c    | 127 +++++++--
> >>>    drivers/gpu/drm/scheduler/sched_fence.c     |   6 +-
> >>>    drivers/gpu/drm/scheduler/sched_main.c      | 278 +++++++++++++++-----
> >>>    drivers/gpu/drm/v3d/v3d_sched.c             |  25 +-
> >>>    include/drm/gpu_scheduler.h                 | 130 +++++++--
> >>>    include/linux/dma-fence.h                   |  60 ++++-
> >>>    16 files changed, 649 insertions(+), 184 deletions(-)
> >>>



-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences
  2023-04-04 19:02           ` Matthew Brost
@ 2023-04-04 19:25             ` Daniel Vetter
  2023-04-04 19:48               ` Matthew Brost
  2023-04-05 12:35               ` Thomas Hellström
  0 siblings, 2 replies; 87+ messages in thread
From: Daniel Vetter @ 2023-04-04 19:25 UTC (permalink / raw)
  To: Matthew Brost
  Cc: robdclark, airlied, lina, Thomas Hellström (Intel),
	dri-devel, Christian König, boris.brezillon, intel-xe,
	faith.ekstrand

On Tue, Apr 04, 2023 at 07:02:23PM +0000, Matthew Brost wrote:
> On Tue, Apr 04, 2023 at 08:14:01PM +0200, Thomas Hellström (Intel) wrote:
> > 
> > On 4/4/23 15:10, Christian König wrote:
> > > Am 04.04.23 um 14:54 schrieb Thomas Hellström:
> > > > Hi, Christian,
> > > > 
> > > > On 4/4/23 11:09, Christian König wrote:
> > > > > Am 04.04.23 um 02:22 schrieb Matthew Brost:
> > > > > > From: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > > > > 
> > > > > > For long-running workloads, drivers either need to open-code
> > > > > > completion
> > > > > > waits, invent their own synchronization primitives or internally use
> > > > > > dma-fences that do not obey the cross-driver dma-fence protocol, but
> > > > > > without any lockdep annotation all these approaches are error prone.
> > > > > > 
> > > > > > So since for example the drm scheduler uses dma-fences it is
> > > > > > desirable for
> > > > > > a driver to be able to use it for throttling and error
> > > > > > handling also with
> > > > > > internal dma-fences tha do not obey the cros-driver
> > > > > > dma-fence protocol.
> > > > > > 
> > > > > > Introduce long-running completion fences in form of
> > > > > > dma-fences, and add
> > > > > > lockdep annotation for them. In particular:
> > > > > > 
> > > > > > * Do not allow waiting under any memory management locks.
> > > > > > * Do not allow to attach them to a dma-resv object.
> > > > > > * Introduce a new interface for adding callbacks making the
> > > > > > helper adding
> > > > > >    a callback sign off on that it is aware that the dma-fence may not
> > > > > >    complete anytime soon. Typically this will be the
> > > > > > scheduler chaining
> > > > > >    a new long-running fence on another one.
> > > > > 
> > > > > Well that's pretty much what I tried before:
> > > > > https://lwn.net/Articles/893704/
> > > > > 
> 
> I don't think this is quite the same: this explicitly enforces that we
> don't break the dma-fence rules (never in the path of memory allocations,
> never exported in any way); essentially this is just a SW sync point
> reusing the dma-fence infrastructure for signaling / callbacks. I believe
> your series tried to export these fences to user space (admittedly I
> haven't fully read your series).
> 
> In this use case we essentially just want to flow control the ring via
> the DRM scheduler + maintain a list of pending jobs so the TDR can be
> used for cleanup if an LR entity encounters an error. To me this seems
> perfectly reasonable, but I know the dma-fence rules are akin to a holy
> war.
> 
> If we return NULL in run_job, now we have to be able to sink all jobs
> in the backend regardless of ring space, maintain a list of jobs pending
> for cleanup after errors, and write a different cleanup path since now
> the TDR doesn't work. It seems very, very silly to duplicate all of this
> code when the DRM scheduler provides all of it for us. Also, if we go
> this route, all drivers are going to invent their own ways to handle LR
> jobs with the DRM scheduler.
> 
> This solution is pretty clear: mark the scheduler as LR and don't export
> any fences from the scheduler. If you try to export these fences, a blow
> up happens.
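
(For reference, that "blow up" would be roughly of the following shape at
the export points such as dma-resv or syncobj; example_dma_fence_is_lr()
below only stands in for whatever predicate the series actually adds, it
is not an existing API.)

#include <linux/dma-fence.h>

/* Stand-in predicate; a real one would test a flag set at fence init. */
static inline bool example_dma_fence_is_lr(struct dma_fence *fence)
{
        return test_bit(DMA_FENCE_FLAG_USER_BITS, &fence->flags);
}

/* Called at an export point, e.g. before adding a fence to a dma-resv. */
static inline void example_check_not_lr(struct dma_fence *fence)
{
        /* Long-running fences must never leave the scheduler. */
        WARN_ON_ONCE(example_dma_fence_is_lr(fence));
}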

The problem is if you mix things up. Like for resets you need all the
schedulers on an engine/set-of-engines to quiesce or things get
potentially hilarious. If you now have a scheduler in forever limbo, the
dma_fence guarantees are right out the window.

But the issue you're having is fairly specific if it's just about
ringspace. I think the dumbest fix is to just block in submit if you run
out of per-ctx ringspace, and call it a day. This notion that somehow the
kernel is supposed to provide a bottomless queue of anything userspace
submits simply doesn't hold up in reality (as much as userspace standards
committees would like it to), and as long as it doesn't have a real-world
perf impact it doesn't really matter why we end up blocking in the submit
ioctl. It might also be a simple memory allocation that hits a snag in
page reclaim.
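
I.e. something of this shape in the submit path, before any fence for the
new job has been created or published (names made up):

#include <linux/wait.h>

struct example_ctx {
        wait_queue_head_t ring_space_wq;        /* woken when ring space frees up */
        u32 ring_space;                         /* free bytes in the ring */
};

/*
 * Runs in the submit ioctl before the job's fences exist, so blocking
 * here does not by itself violate the dma-fence rules.
 */
static int example_wait_for_ring_space(struct example_ctx *ctx, u32 needed)
{
        return wait_event_interruptible(ctx->ring_space_wq,
                                        READ_ONCE(ctx->ring_space) >= needed);
}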

> > > > > And the reasons why it was rejected haven't changed.
> > > > > 
> > > > > Regards,
> > > > > Christian.
> > > > > 
> > > > Yes, TBH this was mostly to get discussion going how we'd best
> > > > tackle this problem while being able to reuse the scheduler for
> > > > long-running workloads.
> > > > 
> > > > I couldn't see any clear decision on your series, though, but one
> > > > main difference I see is that this is intended for driver-internal
> > > > use only. (I'm counting using the drm_scheduler as a helper for
> > > > driver-private use). This is by no means a way to try tackle the
> > > > indefinite fence problem.
> > > 
> > > Well this was just my latest try to tackle this, but essentially the
> > > problems are the same as with your approach: When we express such
> > > operations as dma_fence there is always the change that we leak that
> > > somewhere.
> > > 
> > > My approach of adding a flag noting that this operation is dangerous and
> > > can't be synced with something memory management depends on tried to
> > > contain this as much as possible, but Daniel still pretty clearly
> > > rejected it (for good reasons I think).
> > > 
> > > > 
> > > > We could ofc invent a completely different data-type that abstracts
> > > > the synchronization the scheduler needs in the long-running case, or
> > > > each driver could hack something up, like sleeping in the
> > > > prepare_job() or run_job() callback for throttling, but those waits
> > > > should still be annotated in one way or annotated one way or another
> > > > (and probably in a similar way across drivers) to make sure we don't
> > > > do anything bad.
> > > > 
> > > >  So any suggestions as to what would be the better solution here
> > > > would be appreciated.
> > > 
> > > Mhm, do we really the the GPU scheduler for that?
> > > 
> 
> I think we need to solve this within the DRM scheduler one way or
> another.

Yeah, so if we conclude that the queue really must be bottomless then I
agree drm-sched should help sort out the mess, because I'm guessing that
every driver will have this issue. But that's a big if.

I guess if we teach the drm scheduler that some jobs are fairly endless
then maybe it wouldn't be too far-fetched to also teach it to wait for a
previous one to finish (but not with the dma_fence that preempts, which we
put into the dma_resv for memory management, but some other struct
completion). The scheduler already has a concept of not stuffing too much
stuff into the same queue after all, so this should fit?
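
I.e. roughly the below, with a plain struct completion owned by the
entity and no dma_fence anywhere in the chain (names made up):

#include <linux/completion.h>

struct example_lr_entity {
        struct completion prev_job_done;        /* completed by the previous LR job */
};

/* Throttle the next long-running submission on the previous one. */
static int example_lr_throttle(struct example_lr_entity *e)
{
        /* Not a dma_fence, so nothing here can leak into a dma-resv. */
        return wait_for_completion_interruptible(&e->prev_job_done);
}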
-Daniel


> > > I mean in the 1 to 1 case  you basically just need a component which
> > > collects the dependencies as dma_fence and if all of them are fulfilled
> > > schedules a work item.
> > > 
> > > As long as the work item itself doesn't produce a dma_fence it can then
> > > still just wait for other none dma_fence dependencies.
> > > 
> > > Then the work function could submit the work and wait for the result.
> > > 
> > > The work item would then pretty much represent what you want, you can
> > > wait for it to finish and pass it along as long running dependency.
> > > 
> > > Maybe give it a funky name and wrap it up in a structure, but that's
> > > basically it.
> > > 
> > This very much sounds like a i915_sw_fence for the dependency tracking and
> > dma_fence_work for the actual work although it's completion fence is a
> > dma_fence.
> >
> 
> Agree, this does sound i915-ish; as stated below, one of the mandates in
> Xe was to use the DRM scheduler. Beyond that, as someone who has written
> a submission backend in both the i915 and Xe, I love how the DRM
> scheduler works (single entry point); it makes everything so much easier.
> 
> Matt
> 
> > Although that goes against the whole idea that a condition for merging the
> > xe driver would be that we implement some sort of minimal scaffolding for
> > long-running workloads in the drm scheduler, and the thinking behind that
> > is to avoid implementing intel-specific solutions like those...
> > 
> > Thanks,
> > 
> > Thomas
> > 
> > 
> > 
> > > Regards,
> > > Christian.
> > > 
> > > > 
> > > > Thanks,
> > > > 
> > > > Thomas
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences
  2023-04-04 19:25             ` Daniel Vetter
@ 2023-04-04 19:48               ` Matthew Brost
  2023-04-05 13:09                 ` Daniel Vetter
  2023-04-05 12:35               ` Thomas Hellström
  1 sibling, 1 reply; 87+ messages in thread
From: Matthew Brost @ 2023-04-04 19:48 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: robdclark, airlied, lina, Thomas Hellström (Intel),
	dri-devel, Christian König, boris.brezillon, intel-xe,
	faith.ekstrand

On Tue, Apr 04, 2023 at 09:25:52PM +0200, Daniel Vetter wrote:
> On Tue, Apr 04, 2023 at 07:02:23PM +0000, Matthew Brost wrote:
> > On Tue, Apr 04, 2023 at 08:14:01PM +0200, Thomas Hellström (Intel) wrote:
> > > 
> > > On 4/4/23 15:10, Christian König wrote:
> > > > Am 04.04.23 um 14:54 schrieb Thomas Hellström:
> > > > > Hi, Christian,
> > > > > 
> > > > > On 4/4/23 11:09, Christian König wrote:
> > > > > > Am 04.04.23 um 02:22 schrieb Matthew Brost:
> > > > > > > From: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > > > > > 
> > > > > > > For long-running workloads, drivers either need to open-code
> > > > > > > completion
> > > > > > > waits, invent their own synchronization primitives or internally use
> > > > > > > dma-fences that do not obey the cross-driver dma-fence protocol, but
> > > > > > > without any lockdep annotation all these approaches are error prone.
> > > > > > > 
> > > > > > > So since for example the drm scheduler uses dma-fences it is
> > > > > > > desirable for
> > > > > > > a driver to be able to use it for throttling and error
> > > > > > > handling also with
> > > > > > > internal dma-fences tha do not obey the cros-driver
> > > > > > > dma-fence protocol.
> > > > > > > 
> > > > > > > Introduce long-running completion fences in form of
> > > > > > > dma-fences, and add
> > > > > > > lockdep annotation for them. In particular:
> > > > > > > 
> > > > > > > * Do not allow waiting under any memory management locks.
> > > > > > > * Do not allow to attach them to a dma-resv object.
> > > > > > > * Introduce a new interface for adding callbacks making the
> > > > > > > helper adding
> > > > > > >    a callback sign off on that it is aware that the dma-fence may not
> > > > > > >    complete anytime soon. Typically this will be the
> > > > > > > scheduler chaining
> > > > > > >    a new long-running fence on another one.
> > > > > > 
> > > > > > Well that's pretty much what I tried before:
> > > > > > https://lwn.net/Articles/893704/
> > > > > > 
> > 
> > I don't think this quite the same, this explictly enforces that we don't
> > break the dma-fence rules (in path of memory allocations, exported in
> > any way), essentially this just SW sync point reusing dma-fence the
> > infrastructure for signaling / callbacks. I believe your series tried to
> > export these fences to user space (admittedly I haven't fully read your
> > series).
> > 
> > In this use case we essentially just want to flow control the ring via
> > the dma-scheduler + maintain a list of pending jobs so the TDR can be
> > used for cleanup if LR entity encounters an error. To me this seems
> > perfectly reasonable but I know dma-femce rules are akin to a holy war.
> > 
> > If we return NULL in run_job, now we have to be able to sink all jobs
> > in the backend regardless on ring space, maintain a list of jobs pending
> > for cleanup after errors, and write a different cleanup path as now the
> > TDR doesn't work. Seems very, very silly to duplicate all of this code
> > when the DRM scheduler provides all of this for us. Also if we go this
> > route, now all drivers are going to invent ways to handle LR jobs /w the
> > DRM scheduler.
> > 
> > This solution is pretty clear, mark the scheduler as LR, and don't
> > export any fences from the scheduler. If you try to export these fences
> > a blow up happens.
> 
> The problem is if you mix things up. Like for resets you need all the
> schedulers on an engine/set-of-engines to quiescent or things get
> potentially hilarious. If you now have a scheduler in forever limbo, the
> dma_fence guarantees are right out the window.
> 

Right, a GT reset on Xe is:

Stop all schedulers
Do a reset
Ban any schedulers which we think caused the GT reset
Resubmit all schedulers which we think were good
Restart all schedulers

None of this flow depends on LR dma-fences; all of it uses the DRM sched
infrastructure and works very well compared to the i915. Rewriting all of
this with a driver-specific implementation is what we are trying to
avoid.
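
For illustration, that flow maps onto the existing drm_sched helpers
roughly as below; the per-GT bookkeeping, the locking and the hardware
reset itself are invented placeholders:

#include <drm/gpu_scheduler.h>

struct example_gt {
        struct drm_gpu_scheduler **scheds;      /* every scheduler on this GT */
        unsigned int num_scheds;
};

/* Hypothetical driver helpers, stubbed out for the sketch. */
static void example_do_hw_reset(struct example_gt *gt) {}
static bool example_sched_caused_hang(struct drm_gpu_scheduler *sched)
{
        return false;
}
static void example_ban_sched(struct drm_gpu_scheduler *sched) {}

static void example_gt_reset(struct example_gt *gt)
{
        unsigned int i;

        /* Stop all schedulers. */
        for (i = 0; i < gt->num_scheds; i++)
                drm_sched_stop(gt->scheds[i], NULL);

        /* Do a reset. */
        example_do_hw_reset(gt);

        for (i = 0; i < gt->num_scheds; i++) {
                /* Ban the guilty, resubmit the good. */
                if (example_sched_caused_hang(gt->scheds[i]))
                        example_ban_sched(gt->scheds[i]);
                else
                        drm_sched_resubmit_jobs(gt->scheds[i]);

                /* Restart all schedulers. */
                drm_sched_start(gt->scheds[i], true);
        }
}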

Similarly, if an LR entity hangs on its own (not a GT reset; rather the
firmware does the reset for us) we use all the DRM scheduler
infrastructure to handle this. Again, this works rather well...

> But the issue you're having is fairly specific if it's just about
> ringspace. I think the dumbest fix is to just block in submit if you run
> out of per-ctx ringspace, and call it a day. This notion that somehow the

How does that not break the dma-fence rules? A job can publish its
finished fence after ARM; if the finished fence waits on ring space that
may not free up in a reasonable amount of time, we have now broken the
dma-fence rules. My understanding is that any dma-fence must only depend
on other dma-fences; Christian seems to agree and NAK'd just blocking if
no space is available [1]. IMO this series ensures we don't break the
dma-fence rules by restricting how the finished fence can be used.

> kernel is supposed to provide a bottomless queue of anything userspace
> submits simply doesn't hold up in reality (as much as userspace standards
> committees would like it to), and as long as it doesn't have a real-world
> perf impact it doesn't really matter why we end up blocking in the submit
> ioctl. It might also be a simple memory allocation that hits a snag in
> page reclaim.
> 
> > > > > > And the reasons why it was rejected haven't changed.
> > > > > > 
> > > > > > Regards,
> > > > > > Christian.
> > > > > > 
> > > > > Yes, TBH this was mostly to get discussion going how we'd best
> > > > > tackle this problem while being able to reuse the scheduler for
> > > > > long-running workloads.
> > > > > 
> > > > > I couldn't see any clear decision on your series, though, but one
> > > > > main difference I see is that this is intended for driver-internal
> > > > > use only. (I'm counting using the drm_scheduler as a helper for
> > > > > driver-private use). This is by no means a way to try tackle the
> > > > > indefinite fence problem.
> > > > 
> > > > Well this was just my latest try to tackle this, but essentially the
> > > > problems are the same as with your approach: When we express such
> > > > operations as dma_fence there is always the change that we leak that
> > > > somewhere.
> > > > 
> > > > My approach of adding a flag noting that this operation is dangerous and
> > > > can't be synced with something memory management depends on tried to
> > > > contain this as much as possible, but Daniel still pretty clearly
> > > > rejected it (for good reasons I think).
> > > > 
> > > > > 
> > > > > We could ofc invent a completely different data-type that abstracts
> > > > > the synchronization the scheduler needs in the long-running case, or
> > > > > each driver could hack something up, like sleeping in the
> > > > > prepare_job() or run_job() callback for throttling, but those waits
> > > > > should still be annotated in one way or annotated one way or another
> > > > > (and probably in a similar way across drivers) to make sure we don't
> > > > > do anything bad.
> > > > > 
> > > > >  So any suggestions as to what would be the better solution here
> > > > > would be appreciated.
> > > > 
> > > > Mhm, do we really the the GPU scheduler for that?
> > > > 
> > 
> > I think we need to solve this within the DRM scheduler one way or
> > another.
> 
> Yeah so if we conclude that the queue really must be bottomless then I
> agree drm-sched should help out sort out the mess. Because I'm guessing
> that every driver will have this issue. But that's a big if.
> 
> I guess if we teach the drm scheduler that some jobs are fairly endless
> then maybe it wouldn't be too far-fetched to also teach it to wait for a
> previous one to finish (but not with the dma_fence that preempts, which we
> put into the dma_resv for memory management, but some other struct
> completion). The scheduler already has a concept of not stuffing too much
> stuff into the same queue after all, so this should fit?

See above, it's the exact same situation as spinning on flow controlling
the ring; this IMO absolutely breaks the dma-fence rules. IMO the correct
solution is to have a DRM scheduler that doesn't export dma-fences, and
this is exactly what this series does: if we try to, boom, lockdep / WARN
blows up.

Matt

[1] https://patchwork.freedesktop.org/patch/525461/?series=114772&rev=2

> -Daniel
> 
> 
> > > > I mean in the 1 to 1 case  you basically just need a component which
> > > > collects the dependencies as dma_fence and if all of them are fulfilled
> > > > schedules a work item.
> > > > 
> > > > As long as the work item itself doesn't produce a dma_fence it can then
> > > > still just wait for other none dma_fence dependencies.
> > > > 
> > > > Then the work function could submit the work and wait for the result.
> > > > 
> > > > The work item would then pretty much represent what you want, you can
> > > > wait for it to finish and pass it along as long running dependency.
> > > > 
> > > > Maybe give it a funky name and wrap it up in a structure, but that's
> > > > basically it.
> > > > 
> > > This very much sounds like a i915_sw_fence for the dependency tracking and
> > > dma_fence_work for the actual work although it's completion fence is a
> > > dma_fence.
> > >
> > 
> > Agree this does sound to i915ish as stated below one of mandates in Xe
> > was to use the DRM scheduler. Beyond that as someone who a submission
> > backend in the i915 and Xe, I love how the DRM scheduler works (single
> > entry point), it makes everything so much easier.
> > 
> > Matt
> > 
> > > Although that goes against the whole idea of a condition for merging the xe
> > > driver would be that we implement some sort of minimal scaffolding for
> > > long-running workloads in the drm scheduler, and the thinking behind that is
> > > to avoid implementing intel-specific solutions like those...
> > > 
> > > Thanks,
> > > 
> > > Thomas
> > > 
> > > 
> > > 
> > > > Regards,
> > > > Christian.
> > > > 
> > > > > 
> > > > > Thanks,
> > > > > 
> > > > > Thomas
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences
  2023-04-04 19:00         ` Daniel Vetter
@ 2023-04-04 20:03           ` Matthew Brost
  2023-04-04 20:11             ` Daniel Vetter
  0 siblings, 1 reply; 87+ messages in thread
From: Matthew Brost @ 2023-04-04 20:03 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: robdclark, airlied, lina, dri-devel, intel-xe, boris.brezillon,
	Christian König, faith.ekstrand

On Tue, Apr 04, 2023 at 09:00:59PM +0200, Daniel Vetter wrote:
> On Tue, 4 Apr 2023 at 15:10, Christian König <christian.koenig@amd.com> wrote:
> >
> > Am 04.04.23 um 14:54 schrieb Thomas Hellström:
> > > Hi, Christian,
> > >
> > > On 4/4/23 11:09, Christian König wrote:
> > >> Am 04.04.23 um 02:22 schrieb Matthew Brost:
> > >>> From: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > >>>
> > >>> For long-running workloads, drivers either need to open-code completion
> > >>> waits, invent their own synchronization primitives or internally use
> > >>> dma-fences that do not obey the cross-driver dma-fence protocol, but
> > >>> without any lockdep annotation all these approaches are error prone.
> > >>>
> > >>> So since for example the drm scheduler uses dma-fences it is
> > >>> desirable for
> > >>> a driver to be able to use it for throttling and error handling also
> > >>> with
> > >>> internal dma-fences tha do not obey the cros-driver dma-fence protocol.
> > >>>
> > >>> Introduce long-running completion fences in form of dma-fences, and add
> > >>> lockdep annotation for them. In particular:
> > >>>
> > >>> * Do not allow waiting under any memory management locks.
> > >>> * Do not allow to attach them to a dma-resv object.
> > >>> * Introduce a new interface for adding callbacks making the helper
> > >>> adding
> > >>>    a callback sign off on that it is aware that the dma-fence may not
> > >>>    complete anytime soon. Typically this will be the scheduler chaining
> > >>>    a new long-running fence on another one.
> > >>
> > >> Well that's pretty much what I tried before:
> > >> https://lwn.net/Articles/893704/
> > >>
> > >> And the reasons why it was rejected haven't changed.
> > >>
> > >> Regards,
> > >> Christian.
> > >>
> > > Yes, TBH this was mostly to get discussion going how we'd best tackle
> > > this problem while being able to reuse the scheduler for long-running
> > > workloads.
> > >
> > > I couldn't see any clear decision on your series, though, but one main
> > > difference I see is that this is intended for driver-internal use
> > > only. (I'm counting using the drm_scheduler as a helper for
> > > driver-private use). This is by no means a way to try tackle the
> > > indefinite fence problem.
> >
> > Well this was just my latest try to tackle this, but essentially the
> > problems are the same as with your approach: When we express such
> > operations as dma_fence there is always the change that we leak that
> > somewhere.
> >
> > My approach of adding a flag noting that this operation is dangerous and
> > can't be synced with something memory management depends on tried to
> > contain this as much as possible, but Daniel still pretty clearly
> > rejected it (for good reasons I think).
> 
> Yeah I still don't like dma_fence that somehow have totally different
> semantics in that critical piece of "will it complete or will it
> deadlock?" :-)

Not going to touch LR dma-fences in this reply; I think we can continue
the LR fence discussion in the fork of this thread I just responded to.
I have a response to the preempt fence discussion below.

> >
> > >
> > > We could ofc invent a completely different data-type that abstracts
> > > the synchronization the scheduler needs in the long-running case, or
> > > each driver could hack something up, like sleeping in the
> > > prepare_job() or run_job() callback for throttling, but those waits
> > > should still be annotated in one way or annotated one way or another
> > > (and probably in a similar way across drivers) to make sure we don't
> > > do anything bad.
> > >
> > >  So any suggestions as to what would be the better solution here would
> > > be appreciated.
> >
> > Mhm, do we really the the GPU scheduler for that?
> >
> > I mean in the 1 to 1 case  you basically just need a component which
> > collects the dependencies as dma_fence and if all of them are fulfilled
> > schedules a work item.
> >
> > As long as the work item itself doesn't produce a dma_fence it can then
> > still just wait for other none dma_fence dependencies.
> 
> Yeah that's the important thing, for long-running jobs dependencies as
> dma_fence should be totally fine. You're just not allowed to have any
> outgoing dma_fences at all (except the magic preemption fence).
> 
> > Then the work function could submit the work and wait for the result.
> >
> > The work item would then pretty much represent what you want, you can
> > wait for it to finish and pass it along as long running dependency.
> >
> > Maybe give it a funky name and wrap it up in a structure, but that's
> > basically it.
> 
> Like do we need this? If the kernel ever waits for a long-running
> compute job to finnish I'd call that a bug. Any functional
> dependencies between engines or whatever are userspace's problem only,
> which it needs to sort out using userspace memory fences.
> 
> The only things the kernel needs are some way to track dependencies as
> dma_fence (because memory management move the memory away and we need
> to move it back in, ideally pipelined). And it needs the special
> preempt fence (if we don't have pagefaults) so that you have a fence
> to attach to all the dma_resv for memory management purposes. Now the
> scheduler already has almost all the pieces (at least if we assume
> there's some magic fw which time-slices these contexts on its own),
> and we just need a few minimal changes:
> - allowing the scheduler to ignore the completion fence and just
> immediately push the next "job" in if its dependencies are ready
> - maybe minimal amounts of scaffolding to handle the preemption
> dma_fence because that's not entirely trivial. I think ideally we'd
> put that into drm_sched_entity since you can only ever have one active
> preempt dma_fence per gpu ctx/entity.
> 

Yep, the preempt fence is per entity in Xe (xe_engine). We install these
into the dma-resv slots of the VM and of all external BOs mapped in the
VM. Wondering if we can make all of this very generic between the DRM
scheduler + GPUVA...
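
A sketch of what that generic installation step could look like on the
GPUVA side; the structures, the "locks already held" assumption and the
DMA_RESV_USAGE_BOOKKEEP slot choice are all assumptions of the sketch,
not necessarily what Xe does today:

#include <linux/dma-resv.h>
#include <linux/list.h>

struct example_external_bo {
        struct dma_resv *resv;
        struct list_head vm_link;
};

struct example_vm {
        struct dma_resv *resv;
        struct list_head external_bos;          /* example_external_bo.vm_link */
};

/*
 * Install a per-entity preempt fence into the VM's dma-resv and into the
 * dma-resv of every external BO mapped in the VM. Assumes all the resv
 * locks are already held by the caller.
 */
static int example_install_preempt_fence(struct example_vm *vm,
                                         struct dma_fence *preempt_fence)
{
        struct example_external_bo *ebo;
        int err;

        err = dma_resv_reserve_fences(vm->resv, 1);
        if (err)
                return err;
        dma_resv_add_fence(vm->resv, preempt_fence, DMA_RESV_USAGE_BOOKKEEP);

        list_for_each_entry(ebo, &vm->external_bos, vm_link) {
                err = dma_resv_reserve_fences(ebo->resv, 1);
                if (err)
                        return err;
                dma_resv_add_fence(ebo->resv, preempt_fence,
                                   DMA_RESV_USAGE_BOOKKEEP);
        }

        return 0;
}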

Matt

> None of this needs a dma_fence_is_lr anywhere at all.
> 
> Of course there's the somewhat related issue of "how do we transport
> these userspace memory fences around from app to compositor", and
> that's a lot more gnarly. I still don't think dma_fence_is_lr is
> anywhere near what the solution should look like for that.
> -Daniel
> 
> 
> > Regards,
> > Christian.
> >
> > >
> > > Thanks,
> > >
> > > Thomas
> > >
> > >
> > >
> > >
> > >
> >
> 
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences
  2023-04-04 20:03           ` Matthew Brost
@ 2023-04-04 20:11             ` Daniel Vetter
  2023-04-04 20:19               ` Matthew Brost
  0 siblings, 1 reply; 87+ messages in thread
From: Daniel Vetter @ 2023-04-04 20:11 UTC (permalink / raw)
  To: Matthew Brost
  Cc: robdclark, airlied, lina, dri-devel, intel-xe, boris.brezillon,
	Christian König, faith.ekstrand

On Tue, 4 Apr 2023 at 22:04, Matthew Brost <matthew.brost@intel.com> wrote:
>
> On Tue, Apr 04, 2023 at 09:00:59PM +0200, Daniel Vetter wrote:
> > On Tue, 4 Apr 2023 at 15:10, Christian König <christian.koenig@amd.com> wrote:
> > >
> > > Am 04.04.23 um 14:54 schrieb Thomas Hellström:
> > > > Hi, Christian,
> > > >
> > > > On 4/4/23 11:09, Christian König wrote:
> > > >> Am 04.04.23 um 02:22 schrieb Matthew Brost:
> > > >>> From: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > >>>
> > > >>> For long-running workloads, drivers either need to open-code completion
> > > >>> waits, invent their own synchronization primitives or internally use
> > > >>> dma-fences that do not obey the cross-driver dma-fence protocol, but
> > > >>> without any lockdep annotation all these approaches are error prone.
> > > >>>
> > > >>> So since for example the drm scheduler uses dma-fences it is
> > > >>> desirable for
> > > >>> a driver to be able to use it for throttling and error handling also
> > > >>> with
> > > >>> internal dma-fences tha do not obey the cros-driver dma-fence protocol.
> > > >>>
> > > >>> Introduce long-running completion fences in form of dma-fences, and add
> > > >>> lockdep annotation for them. In particular:
> > > >>>
> > > >>> * Do not allow waiting under any memory management locks.
> > > >>> * Do not allow to attach them to a dma-resv object.
> > > >>> * Introduce a new interface for adding callbacks making the helper
> > > >>> adding
> > > >>>    a callback sign off on that it is aware that the dma-fence may not
> > > >>>    complete anytime soon. Typically this will be the scheduler chaining
> > > >>>    a new long-running fence on another one.
> > > >>
> > > >> Well that's pretty much what I tried before:
> > > >> https://lwn.net/Articles/893704/
> > > >>
> > > >> And the reasons why it was rejected haven't changed.
> > > >>
> > > >> Regards,
> > > >> Christian.
> > > >>
> > > > Yes, TBH this was mostly to get discussion going how we'd best tackle
> > > > this problem while being able to reuse the scheduler for long-running
> > > > workloads.
> > > >
> > > > I couldn't see any clear decision on your series, though, but one main
> > > > difference I see is that this is intended for driver-internal use
> > > > only. (I'm counting using the drm_scheduler as a helper for
> > > > driver-private use). This is by no means a way to try tackle the
> > > > indefinite fence problem.
> > >
> > > Well this was just my latest try to tackle this, but essentially the
> > > problems are the same as with your approach: When we express such
> > > operations as dma_fence there is always the change that we leak that
> > > somewhere.
> > >
> > > My approach of adding a flag noting that this operation is dangerous and
> > > can't be synced with something memory management depends on tried to
> > > contain this as much as possible, but Daniel still pretty clearly
> > > rejected it (for good reasons I think).
> >
> > Yeah I still don't like dma_fence that somehow have totally different
> > semantics in that critical piece of "will it complete or will it
> > deadlock?" :-)
>
> Not going to touch LR dma-fences in this reply, I think we can continue
> the LR fence discussion of the fork of this thread I just responded to.
> Have a response the preempt fence discussion below.
>
> > >
> > > >
> > > > We could ofc invent a completely different data-type that abstracts
> > > > the synchronization the scheduler needs in the long-running case, or
> > > > each driver could hack something up, like sleeping in the
> > > > prepare_job() or run_job() callback for throttling, but those waits
> > > > should still be annotated in one way or annotated one way or another
> > > > (and probably in a similar way across drivers) to make sure we don't
> > > > do anything bad.
> > > >
> > > >  So any suggestions as to what would be the better solution here would
> > > > be appreciated.
> > >
> > > Mhm, do we really the the GPU scheduler for that?
> > >
> > > I mean in the 1 to 1 case  you basically just need a component which
> > > collects the dependencies as dma_fence and if all of them are fulfilled
> > > schedules a work item.
> > >
> > > As long as the work item itself doesn't produce a dma_fence it can then
> > > still just wait for other none dma_fence dependencies.
> >
> > Yeah that's the important thing, for long-running jobs dependencies as
> > dma_fence should be totally fine. You're just not allowed to have any
> > outgoing dma_fences at all (except the magic preemption fence).
> >
> > > Then the work function could submit the work and wait for the result.
> > >
> > > The work item would then pretty much represent what you want, you can
> > > wait for it to finish and pass it along as long running dependency.
> > >
> > > Maybe give it a funky name and wrap it up in a structure, but that's
> > > basically it.
> >
> > Like do we need this? If the kernel ever waits for a long-running
> > compute job to finnish I'd call that a bug. Any functional
> > dependencies between engines or whatever are userspace's problem only,
> > which it needs to sort out using userspace memory fences.
> >
> > The only things the kernel needs are some way to track dependencies as
> > dma_fence (because memory management move the memory away and we need
> > to move it back in, ideally pipelined). And it needs the special
> > preempt fence (if we don't have pagefaults) so that you have a fence
> > to attach to all the dma_resv for memory management purposes. Now the
> > scheduler already has almost all the pieces (at least if we assume
> > there's some magic fw which time-slices these contexts on its own),
> > and we just need a few minimal changes:
> > - allowing the scheduler to ignore the completion fence and just
> > immediately push the next "job" in if its dependencies are ready
> > - maybe minimal amounts of scaffolding to handle the preemption
> > dma_fence because that's not entirely trivial. I think ideally we'd
> > put that into drm_sched_entity since you can only ever have one active
> > preempt dma_fence per gpu ctx/entity.
> >
>
> Yep, preempt fence is per entity in Xe (xe_engine). We install these
> into the VM and all external BOs mapped in the VM dma-resv slots.
> Wondering if we can make all of this very generic between the DRM
> scheduler + GPUVA...

I think if the drm/sched just takes care of the preempt ctx dma_fence
(and still stores it in the same slot in the drm_sched_job struct like
an end-of-batch dma_fence job would), and then the gpuva shared code
just has functions to smash these into the right dma_resv structures,
then you have all the pieces. Maybe for a bit more flexibility the
gpuva code takes a dma_fence and not directly the drm_sched_job, but
maybe even that level of integration makes sense (but imo it's optional;
a bit of driver glue code is fine).

Yeah, that's roughly what I think we should at least aim for, since
there are quite a few drivers in flight that all need these pieces (more
or less at least).
-Daniel
>
> Matt
>
> > None of this needs a dma_fence_is_lr anywhere at all.
> >
> > Of course there's the somewhat related issue of "how do we transport
> > these userspace memory fences around from app to compositor", and
> > that's a lot more gnarly. I still don't think dma_fence_is_lr is
> > anywhere near what the solution should look like for that.
> > -Daniel
> >
> >
> > > Regards,
> > > Christian.
> > >
> > > >
> > > > Thanks,
> > > >
> > > > Thomas
> > > >
> > > >
> > > >
> > > >
> > > >
> > >
> >
> >
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch



-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences
  2023-04-04 20:11             ` Daniel Vetter
@ 2023-04-04 20:19               ` Matthew Brost
  2023-04-04 20:31                 ` Daniel Vetter
  0 siblings, 1 reply; 87+ messages in thread
From: Matthew Brost @ 2023-04-04 20:19 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: robdclark, airlied, lina, dri-devel, intel-xe, boris.brezillon,
	Christian König, faith.ekstrand

On Tue, Apr 04, 2023 at 10:11:59PM +0200, Daniel Vetter wrote:
> On Tue, 4 Apr 2023 at 22:04, Matthew Brost <matthew.brost@intel.com> wrote:
> >
> > On Tue, Apr 04, 2023 at 09:00:59PM +0200, Daniel Vetter wrote:
> > > On Tue, 4 Apr 2023 at 15:10, Christian König <christian.koenig@amd.com> wrote:
> > > >
> > > > Am 04.04.23 um 14:54 schrieb Thomas Hellström:
> > > > > Hi, Christian,
> > > > >
> > > > > On 4/4/23 11:09, Christian König wrote:
> > > > >> Am 04.04.23 um 02:22 schrieb Matthew Brost:
> > > > >>> From: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > > >>>
> > > > >>> For long-running workloads, drivers either need to open-code completion
> > > > >>> waits, invent their own synchronization primitives or internally use
> > > > >>> dma-fences that do not obey the cross-driver dma-fence protocol, but
> > > > >>> without any lockdep annotation all these approaches are error prone.
> > > > >>>
> > > > >>> So since for example the drm scheduler uses dma-fences it is
> > > > >>> desirable for
> > > > >>> a driver to be able to use it for throttling and error handling also
> > > > >>> with
> > > > >>> internal dma-fences tha do not obey the cros-driver dma-fence protocol.
> > > > >>>
> > > > >>> Introduce long-running completion fences in form of dma-fences, and add
> > > > >>> lockdep annotation for them. In particular:
> > > > >>>
> > > > >>> * Do not allow waiting under any memory management locks.
> > > > >>> * Do not allow to attach them to a dma-resv object.
> > > > >>> * Introduce a new interface for adding callbacks making the helper
> > > > >>> adding
> > > > >>>    a callback sign off on that it is aware that the dma-fence may not
> > > > >>>    complete anytime soon. Typically this will be the scheduler chaining
> > > > >>>    a new long-running fence on another one.
> > > > >>
> > > > >> Well that's pretty much what I tried before:
> > > > >> https://lwn.net/Articles/893704/
> > > > >>
> > > > >> And the reasons why it was rejected haven't changed.
> > > > >>
> > > > >> Regards,
> > > > >> Christian.
> > > > >>
> > > > > Yes, TBH this was mostly to get discussion going how we'd best tackle
> > > > > this problem while being able to reuse the scheduler for long-running
> > > > > workloads.
> > > > >
> > > > > I couldn't see any clear decision on your series, though, but one main
> > > > > difference I see is that this is intended for driver-internal use
> > > > > only. (I'm counting using the drm_scheduler as a helper for
> > > > > driver-private use). This is by no means a way to try tackle the
> > > > > indefinite fence problem.
> > > >
> > > > Well this was just my latest try to tackle this, but essentially the
> > > > problems are the same as with your approach: When we express such
> > > > operations as dma_fence there is always the change that we leak that
> > > > somewhere.
> > > >
> > > > My approach of adding a flag noting that this operation is dangerous and
> > > > can't be synced with something memory management depends on tried to
> > > > contain this as much as possible, but Daniel still pretty clearly
> > > > rejected it (for good reasons I think).
> > >
> > > Yeah I still don't like dma_fence that somehow have totally different
> > > semantics in that critical piece of "will it complete or will it
> > > deadlock?" :-)
> >
> > Not going to touch LR dma-fences in this reply, I think we can continue
> > the LR fence discussion of the fork of this thread I just responded to.
> > Have a response the preempt fence discussion below.
> >
> > > >
> > > > >
> > > > > We could ofc invent a completely different data-type that abstracts
> > > > > the synchronization the scheduler needs in the long-running case, or
> > > > > each driver could hack something up, like sleeping in the
> > > > > prepare_job() or run_job() callback for throttling, but those waits
> > > > > should still be annotated in one way or annotated one way or another
> > > > > (and probably in a similar way across drivers) to make sure we don't
> > > > > do anything bad.
> > > > >
> > > > >  So any suggestions as to what would be the better solution here would
> > > > > be appreciated.
> > > >
> > > > Mhm, do we really the the GPU scheduler for that?
> > > >
> > > > I mean in the 1 to 1 case  you basically just need a component which
> > > > collects the dependencies as dma_fence and if all of them are fulfilled
> > > > schedules a work item.
> > > >
> > > > As long as the work item itself doesn't produce a dma_fence it can then
> > > > still just wait for other none dma_fence dependencies.
> > >
> > > Yeah that's the important thing, for long-running jobs dependencies as
> > > dma_fence should be totally fine. You're just not allowed to have any
> > > outgoing dma_fences at all (except the magic preemption fence).
> > >
> > > > Then the work function could submit the work and wait for the result.
> > > >
> > > > The work item would then pretty much represent what you want, you can
> > > > wait for it to finish and pass it along as long running dependency.
> > > >
> > > > Maybe give it a funky name and wrap it up in a structure, but that's
> > > > basically it.
> > >
> > > Like do we need this? If the kernel ever waits for a long-running
> > > compute job to finnish I'd call that a bug. Any functional
> > > dependencies between engines or whatever are userspace's problem only,
> > > which it needs to sort out using userspace memory fences.
> > >
> > > The only things the kernel needs are some way to track dependencies as
> > > dma_fence (because memory management move the memory away and we need
> > > to move it back in, ideally pipelined). And it needs the special
> > > preempt fence (if we don't have pagefaults) so that you have a fence
> > > to attach to all the dma_resv for memory management purposes. Now the
> > > scheduler already has almost all the pieces (at least if we assume
> > > there's some magic fw which time-slices these contexts on its own),
> > > and we just need a few minimal changes:
> > > - allowing the scheduler to ignore the completion fence and just
> > > immediately push the next "job" in if its dependencies are ready
> > > - maybe minimal amounts of scaffolding to handle the preemption
> > > dma_fence because that's not entirely trivial. I think ideally we'd
> > > put that into drm_sched_entity since you can only ever have one active
> > > preempt dma_fence per gpu ctx/entity.
> > >
> >
> > Yep, preempt fence is per entity in Xe (xe_engine). We install these
> > into the VM and all external BOs mapped in the VM dma-resv slots.
> > Wondering if we can make all of this very generic between the DRM
> > scheduler + GPUVA...
> 
> I think if the drm/sched just takes care of the preempt ctx dma_fence
> (and still stores it in the same slot in the drm_sched_job struct like
> a end-of-batch dma_fence job would), and then the gpuva shared code
> just has functions to smash these into the right dma_resv structures
> then you have all the pieces. Maybe for a bit more flexibility the
> gpuva code takes dma_fence and not directly the drm_sched_job, but
> maybe even that level of integration makes sense (but imo optional, a
> bit of driver glue code is fine).
> 
> Yeah that's roughly what I think we should at least aim for since
> there's quiet a few drivers in-flight that all need these pieces (more
> or less at least).

That is very close to what I'm thinking too. We want to tackle userptr +
GPUVA as well; I think that will be next, but we can add this to the list
of things to do.

Matt

> -Daniel
> >
> > Matt
> >
> > > None of this needs a dma_fence_is_lr anywhere at all.
> > >
> > > Of course there's the somewhat related issue of "how do we transport
> > > these userspace memory fences around from app to compositor", and
> > > that's a lot more gnarly. I still don't think dma_fence_is_lr is
> > > anywhere near what the solution should look like for that.
> > > -Daniel
> > >
> > >
> > > > Regards,
> > > > Christian.
> > > >
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Thomas
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > >
> > >
> > > --
> > > Daniel Vetter
> > > Software Engineer, Intel Corporation
> > > http://blog.ffwll.ch
> 
> 
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences
  2023-04-04 20:19               ` Matthew Brost
@ 2023-04-04 20:31                 ` Daniel Vetter
  2023-04-04 20:46                   ` Matthew Brost
  0 siblings, 1 reply; 87+ messages in thread
From: Daniel Vetter @ 2023-04-04 20:31 UTC (permalink / raw)
  To: Matthew Brost
  Cc: robdclark, airlied, lina, dri-devel, intel-xe, boris.brezillon,
	Daniel Vetter, Christian König, faith.ekstrand

On Tue, Apr 04, 2023 at 08:19:37PM +0000, Matthew Brost wrote:
> On Tue, Apr 04, 2023 at 10:11:59PM +0200, Daniel Vetter wrote:
> > On Tue, 4 Apr 2023 at 22:04, Matthew Brost <matthew.brost@intel.com> wrote:
> > >
> > > On Tue, Apr 04, 2023 at 09:00:59PM +0200, Daniel Vetter wrote:
> > > > On Tue, 4 Apr 2023 at 15:10, Christian König <christian.koenig@amd.com> wrote:
> > > > >
> > > > > Am 04.04.23 um 14:54 schrieb Thomas Hellström:
> > > > > > Hi, Christian,
> > > > > >
> > > > > > On 4/4/23 11:09, Christian König wrote:
> > > > > >> Am 04.04.23 um 02:22 schrieb Matthew Brost:
> > > > > >>> From: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > > > >>>
> > > > > >>> For long-running workloads, drivers either need to open-code completion
> > > > > >>> waits, invent their own synchronization primitives or internally use
> > > > > >>> dma-fences that do not obey the cross-driver dma-fence protocol, but
> > > > > >>> without any lockdep annotation all these approaches are error prone.
> > > > > >>>
> > > > > >>> So since for example the drm scheduler uses dma-fences it is
> > > > > >>> desirable for
> > > > > >>> a driver to be able to use it for throttling and error handling also
> > > > > >>> with
> > > > > >>> internal dma-fences tha do not obey the cros-driver dma-fence protocol.
> > > > > >>>
> > > > > >>> Introduce long-running completion fences in form of dma-fences, and add
> > > > > >>> lockdep annotation for them. In particular:
> > > > > >>>
> > > > > >>> * Do not allow waiting under any memory management locks.
> > > > > >>> * Do not allow to attach them to a dma-resv object.
> > > > > >>> * Introduce a new interface for adding callbacks making the helper
> > > > > >>> adding
> > > > > >>>    a callback sign off on that it is aware that the dma-fence may not
> > > > > >>>    complete anytime soon. Typically this will be the scheduler chaining
> > > > > >>>    a new long-running fence on another one.
> > > > > >>
> > > > > >> Well that's pretty much what I tried before:
> > > > > >> https://lwn.net/Articles/893704/
> > > > > >>
> > > > > >> And the reasons why it was rejected haven't changed.
> > > > > >>
> > > > > >> Regards,
> > > > > >> Christian.
> > > > > >>
> > > > > > Yes, TBH this was mostly to get discussion going how we'd best tackle
> > > > > > this problem while being able to reuse the scheduler for long-running
> > > > > > workloads.
> > > > > >
> > > > > > I couldn't see any clear decision on your series, though, but one main
> > > > > > difference I see is that this is intended for driver-internal use
> > > > > > only. (I'm counting using the drm_scheduler as a helper for
> > > > > > driver-private use). This is by no means a way to try tackle the
> > > > > > indefinite fence problem.
> > > > >
> > > > > Well this was just my latest try to tackle this, but essentially the
> > > > > problems are the same as with your approach: When we express such
> > > > > operations as dma_fence there is always the change that we leak that
> > > > > somewhere.
> > > > >
> > > > > My approach of adding a flag noting that this operation is dangerous and
> > > > > can't be synced with something memory management depends on tried to
> > > > > contain this as much as possible, but Daniel still pretty clearly
> > > > > rejected it (for good reasons I think).
> > > >
> > > > Yeah I still don't like dma_fence that somehow have totally different
> > > > semantics in that critical piece of "will it complete or will it
> > > > deadlock?" :-)
> > >
> > > Not going to touch LR dma-fences in this reply, I think we can continue
> > > the LR fence discussion of the fork of this thread I just responded to.
> > > Have a response the preempt fence discussion below.
> > >
> > > > >
> > > > > >
> > > > > > We could ofc invent a completely different data-type that abstracts
> > > > > > the synchronization the scheduler needs in the long-running case, or
> > > > > > each driver could hack something up, like sleeping in the
> > > > > > prepare_job() or run_job() callback for throttling, but those waits
> > > > > > should still be annotated in one way or annotated one way or another
> > > > > > (and probably in a similar way across drivers) to make sure we don't
> > > > > > do anything bad.
> > > > > >
> > > > > >  So any suggestions as to what would be the better solution here would
> > > > > > be appreciated.
> > > > >
> > > > > Mhm, do we really the the GPU scheduler for that?
> > > > >
> > > > > I mean in the 1 to 1 case  you basically just need a component which
> > > > > collects the dependencies as dma_fence and if all of them are fulfilled
> > > > > schedules a work item.
> > > > >
> > > > > As long as the work item itself doesn't produce a dma_fence it can then
> > > > > still just wait for other none dma_fence dependencies.
> > > >
> > > > Yeah that's the important thing, for long-running jobs dependencies as
> > > > dma_fence should be totally fine. You're just not allowed to have any
> > > > outgoing dma_fences at all (except the magic preemption fence).
> > > >
> > > > > Then the work function could submit the work and wait for the result.
> > > > >
> > > > > The work item would then pretty much represent what you want, you can
> > > > > wait for it to finish and pass it along as long running dependency.
> > > > >
> > > > > Maybe give it a funky name and wrap it up in a structure, but that's
> > > > > basically it.
> > > >
> > > > Like do we need this? If the kernel ever waits for a long-running
> > > > compute job to finnish I'd call that a bug. Any functional
> > > > dependencies between engines or whatever are userspace's problem only,
> > > > which it needs to sort out using userspace memory fences.
> > > >
> > > > The only things the kernel needs are some way to track dependencies as
> > > > dma_fence (because memory management move the memory away and we need
> > > > to move it back in, ideally pipelined). And it needs the special
> > > > preempt fence (if we don't have pagefaults) so that you have a fence
> > > > to attach to all the dma_resv for memory management purposes. Now the
> > > > scheduler already has almost all the pieces (at least if we assume
> > > > there's some magic fw which time-slices these contexts on its own),
> > > > and we just need a few minimal changes:
> > > > - allowing the scheduler to ignore the completion fence and just
> > > > immediately push the next "job" in if its dependencies are ready
> > > > - maybe minimal amounts of scaffolding to handle the preemption
> > > > dma_fence because that's not entirely trivial. I think ideally we'd
> > > > put that into drm_sched_entity since you can only ever have one active
> > > > preempt dma_fence per gpu ctx/entity.
> > > >
> > >
> > > Yep, preempt fence is per entity in Xe (xe_engine). We install these
> > > into the VM and all external BOs mapped in the VM dma-resv slots.
> > > Wondering if we can make all of this very generic between the DRM
> > > scheduler + GPUVA...
> > 
> > I think if the drm/sched just takes care of the preempt ctx dma_fence
> > (and still stores it in the same slot in the drm_sched_job struct like
> > a end-of-batch dma_fence job would), and then the gpuva shared code
> > just has functions to smash these into the right dma_resv structures
> > then you have all the pieces. Maybe for a bit more flexibility the
> > gpuva code takes dma_fence and not directly the drm_sched_job, but
> > maybe even that level of integration makes sense (but imo optional, a
> > bit of driver glue code is fine).
> > 
> > Yeah that's roughly what I think we should at least aim for since
> > there's quiet a few drivers in-flight that all need these pieces (more
> > or less at least).
> 
> That is very close to what I'm thinking too, we want to tackle userptr +
> GPUVA too, think that will be next but can add this to the list of
> things to do.

I discussed userptr+gpuva a bit with Rodrigo (and maybe Thomas H not sure
anymore) and it sounded a bit like that's maybe a bridge too far. At least
until we have some other drivers that also need that combo. But can't hurt
to at least think how it would ideally integrate from xe's pov.
-Daniel

> 
> Matt
> 
> > -Daniel
> > >
> > > Matt
> > >
> > > > None of this needs a dma_fence_is_lr anywhere at all.
> > > >
> > > > Of course there's the somewhat related issue of "how do we transport
> > > > these userspace memory fences around from app to compositor", and
> > > > that's a lot more gnarly. I still don't think dma_fence_is_lr is
> > > > anywhere near what the solution should look like for that.
> > > > -Daniel
> > > >
> > > >
> > > > > Regards,
> > > > > Christian.
> > > > >
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Thomas
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Daniel Vetter
> > > > Software Engineer, Intel Corporation
> > > > http://blog.ffwll.ch
> > 
> > 
> > 
> > -- 
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences
  2023-04-04 20:31                 ` Daniel Vetter
@ 2023-04-04 20:46                   ` Matthew Brost
  0 siblings, 0 replies; 87+ messages in thread
From: Matthew Brost @ 2023-04-04 20:46 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: robdclark, airlied, lina, dri-devel, intel-xe, boris.brezillon,
	Christian König, faith.ekstrand

On Tue, Apr 04, 2023 at 10:31:58PM +0200, Daniel Vetter wrote:
> On Tue, Apr 04, 2023 at 08:19:37PM +0000, Matthew Brost wrote:
> > On Tue, Apr 04, 2023 at 10:11:59PM +0200, Daniel Vetter wrote:
> > > On Tue, 4 Apr 2023 at 22:04, Matthew Brost <matthew.brost@intel.com> wrote:
> > > >
> > > > On Tue, Apr 04, 2023 at 09:00:59PM +0200, Daniel Vetter wrote:
> > > > > On Tue, 4 Apr 2023 at 15:10, Christian König <christian.koenig@amd.com> wrote:
> > > > > >
> > > > > > Am 04.04.23 um 14:54 schrieb Thomas Hellström:
> > > > > > > Hi, Christian,
> > > > > > >
> > > > > > > On 4/4/23 11:09, Christian König wrote:
> > > > > > >> Am 04.04.23 um 02:22 schrieb Matthew Brost:
> > > > > > >>> From: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > > > > >>>
> > > > > > >>> For long-running workloads, drivers either need to open-code completion
> > > > > > >>> waits, invent their own synchronization primitives or internally use
> > > > > > >>> dma-fences that do not obey the cross-driver dma-fence protocol, but
> > > > > > >>> without any lockdep annotation all these approaches are error prone.
> > > > > > >>>
> > > > > > >>> So since for example the drm scheduler uses dma-fences it is
> > > > > > >>> desirable for
> > > > > > >>> a driver to be able to use it for throttling and error handling also
> > > > > > >>> with
> > > > > > >>> internal dma-fences tha do not obey the cros-driver dma-fence protocol.
> > > > > > >>>
> > > > > > >>> Introduce long-running completion fences in form of dma-fences, and add
> > > > > > >>> lockdep annotation for them. In particular:
> > > > > > >>>
> > > > > > >>> * Do not allow waiting under any memory management locks.
> > > > > > >>> * Do not allow to attach them to a dma-resv object.
> > > > > > >>> * Introduce a new interface for adding callbacks making the helper
> > > > > > >>> adding
> > > > > > >>>    a callback sign off on that it is aware that the dma-fence may not
> > > > > > >>>    complete anytime soon. Typically this will be the scheduler chaining
> > > > > > >>>    a new long-running fence on another one.
> > > > > > >>
> > > > > > >> Well that's pretty much what I tried before:
> > > > > > >> https://lwn.net/Articles/893704/
> > > > > > >>
> > > > > > >> And the reasons why it was rejected haven't changed.
> > > > > > >>
> > > > > > >> Regards,
> > > > > > >> Christian.
> > > > > > >>
> > > > > > > Yes, TBH this was mostly to get discussion going how we'd best tackle
> > > > > > > this problem while being able to reuse the scheduler for long-running
> > > > > > > workloads.
> > > > > > >
> > > > > > > I couldn't see any clear decision on your series, though, but one main
> > > > > > > difference I see is that this is intended for driver-internal use
> > > > > > > only. (I'm counting using the drm_scheduler as a helper for
> > > > > > > driver-private use). This is by no means a way to try tackle the
> > > > > > > indefinite fence problem.
> > > > > >
> > > > > > Well this was just my latest try to tackle this, but essentially the
> > > > > > problems are the same as with your approach: When we express such
> > > > > > operations as dma_fence there is always the change that we leak that
> > > > > > somewhere.
> > > > > >
> > > > > > My approach of adding a flag noting that this operation is dangerous and
> > > > > > can't be synced with something memory management depends on tried to
> > > > > > contain this as much as possible, but Daniel still pretty clearly
> > > > > > rejected it (for good reasons I think).
> > > > >
> > > > > Yeah I still don't like dma_fence that somehow have totally different
> > > > > semantics in that critical piece of "will it complete or will it
> > > > > deadlock?" :-)
> > > >
> > > > Not going to touch LR dma-fences in this reply, I think we can continue
> > > > the LR fence discussion of the fork of this thread I just responded to.
> > > > Have a response the preempt fence discussion below.
> > > >
> > > > > >
> > > > > > >
> > > > > > > We could ofc invent a completely different data-type that abstracts
> > > > > > > the synchronization the scheduler needs in the long-running case, or
> > > > > > > each driver could hack something up, like sleeping in the
> > > > > > > prepare_job() or run_job() callback for throttling, but those waits
> > > > > > > should still be annotated in one way or annotated one way or another
> > > > > > > (and probably in a similar way across drivers) to make sure we don't
> > > > > > > do anything bad.
> > > > > > >
> > > > > > >  So any suggestions as to what would be the better solution here would
> > > > > > > be appreciated.
> > > > > >
> > > > > > Mhm, do we really the the GPU scheduler for that?
> > > > > >
> > > > > > I mean in the 1 to 1 case  you basically just need a component which
> > > > > > collects the dependencies as dma_fence and if all of them are fulfilled
> > > > > > schedules a work item.
> > > > > >
> > > > > > As long as the work item itself doesn't produce a dma_fence it can then
> > > > > > still just wait for other none dma_fence dependencies.
> > > > >
> > > > > Yeah that's the important thing, for long-running jobs dependencies as
> > > > > dma_fence should be totally fine. You're just not allowed to have any
> > > > > outgoing dma_fences at all (except the magic preemption fence).
> > > > >
> > > > > > Then the work function could submit the work and wait for the result.
> > > > > >
> > > > > > The work item would then pretty much represent what you want, you can
> > > > > > wait for it to finish and pass it along as long running dependency.
> > > > > >
> > > > > > Maybe give it a funky name and wrap it up in a structure, but that's
> > > > > > basically it.
> > > > >
> > > > > Like do we need this? If the kernel ever waits for a long-running
> > > > > compute job to finnish I'd call that a bug. Any functional
> > > > > dependencies between engines or whatever are userspace's problem only,
> > > > > which it needs to sort out using userspace memory fences.
> > > > >
> > > > > The only things the kernel needs are some way to track dependencies as
> > > > > dma_fence (because memory management move the memory away and we need
> > > > > to move it back in, ideally pipelined). And it needs the special
> > > > > preempt fence (if we don't have pagefaults) so that you have a fence
> > > > > to attach to all the dma_resv for memory management purposes. Now the
> > > > > scheduler already has almost all the pieces (at least if we assume
> > > > > there's some magic fw which time-slices these contexts on its own),
> > > > > and we just need a few minimal changes:
> > > > > - allowing the scheduler to ignore the completion fence and just
> > > > > immediately push the next "job" in if its dependencies are ready
> > > > > - maybe minimal amounts of scaffolding to handle the preemption
> > > > > dma_fence because that's not entirely trivial. I think ideally we'd
> > > > > put that into drm_sched_entity since you can only ever have one active
> > > > > preempt dma_fence per gpu ctx/entity.
> > > > >
> > > >
> > > > Yep, preempt fence is per entity in Xe (xe_engine). We install these
> > > > into the VM and all external BOs mapped in the VM dma-resv slots.
> > > > Wondering if we can make all of this very generic between the DRM
> > > > scheduler + GPUVA...
> > > 
> > > I think if the drm/sched just takes care of the preempt ctx dma_fence
> > > (and still stores it in the same slot in the drm_sched_job struct like
> > > a end-of-batch dma_fence job would), and then the gpuva shared code
> > > just has functions to smash these into the right dma_resv structures
> > > then you have all the pieces. Maybe for a bit more flexibility the
> > > gpuva code takes dma_fence and not directly the drm_sched_job, but
> > > maybe even that level of integration makes sense (but imo optional, a
> > > bit of driver glue code is fine).
> > > 
> > > Yeah that's roughly what I think we should at least aim for since
> > > there's quiet a few drivers in-flight that all need these pieces (more
> > > or less at least).
> > 
> > That is very close to what I'm thinking too, we want to tackle userptr +
> > GPUVA too, think that will be next but can add this to the list of
> > things to do.
> 
> I discussed userptr+gpuva a bit with Rodrigo (and maybe Thomas H not sure
> anymore) and it sounded a bit like that's maybe a bridge too far. At least
> until we have some other drivers that also need that combo. But can't hurt
> to at least think how it would ideally integrate from xe's pov.
> -Daniel

I spoke with dakr about this on IRC today, Nouveau is going to implement
userptr soon. I think the idea would be for Xe and Nouveau to
collaborate on what we stick in GPUVA for userptr + whether we have common
DRM helper functions. We may land on something really small (e.g. we
store the userptr address with a NULL gem in the gpuva structure) or we
might land on common locking, page population, and MMU notifier handling.
Interested to see where we land.
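
Very rough sketch of the "really small" option (gem fields roughly as in
the in-flight GPUVA series, from memory; the userptr part is completely
invented here, just to illustrate):

struct drm_gpuva {
	/* ... existing fields: GPU VA, range, list links, ... */

	struct {
		struct drm_gem_object *obj;	/* NULL would mean userptr backed */
		u64 offset;
	} gem;

	/* invented: only meaningful when gem.obj == NULL */
	struct {
		u64 ptr;	/* CPU address of the userptr range */
	} userptr;
};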

Matt

> 
> > 
> > Matt
> > 
> > > -Daniel
> > > >
> > > > Matt
> > > >
> > > > > None of this needs a dma_fence_is_lr anywhere at all.
> > > > >
> > > > > Of course there's the somewhat related issue of "how do we transport
> > > > > these userspace memory fences around from app to compositor", and
> > > > > that's a lot more gnarly. I still don't think dma_fence_is_lr is
> > > > > anywhere near what the solution should look like for that.
> > > > > -Daniel
> > > > >
> > > > >
> > > > > > Regards,
> > > > > > Christian.
> > > > > >
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Thomas
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Daniel Vetter
> > > > > Software Engineer, Intel Corporation
> > > > > http://blog.ffwll.ch
> > > 
> > > 
> > > 
> > > -- 
> > > Daniel Vetter
> > > Software Engineer, Intel Corporation
> > > http://blog.ffwll.ch
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans
  2023-04-04 18:08   ` Matthew Brost
@ 2023-04-05  7:30     ` Christian König
  2023-04-05  8:42       ` Daniel Vetter
  2023-04-05 18:06       ` Zeng, Oak
  0 siblings, 2 replies; 87+ messages in thread
From: Christian König @ 2023-04-05  7:30 UTC (permalink / raw)
  To: Matthew Brost, Zeng, Oak
  Cc: robdclark, airlied, lina, dri-devel, boris.brezillon, intel-xe,
	faith.ekstrand

Am 04.04.23 um 20:08 schrieb Matthew Brost:
> On Tue, Apr 04, 2023 at 12:02:03PM -0600, Zeng, Oak wrote:
>> Hi Matt, Thomas,
>>
>> Some very bold out of box thinking in this area:
>>
>> 1. so you want to use drm scheduler and dma-fence for long running workload. Why you want to do this in the first place? What is the benefit? Drm scheduler is pretty much a software scheduler. Modern gpu has scheduler built at fw/hw level, as you said below for intel this is Guc. Can xe driver just directly submit job to Guc, bypassing drm scheduler?
>>
> If we did that now we have 2 paths for dependency track, flow controling
> the ring, resets / error handling / backend submission implementations.
> We don't want this.

Well exactly that's the point: Why?

As far as I can see those are two completely distinct use cases, so you
absolutely do want two completely distinct implementations for this.

>> 2. using dma-fence for long run workload: I am well aware that page fault (and the consequent memory allocation/lock acquiring to fix the fault) can cause deadlock for a dma-fence wait. But I am not convinced that dma-fence can't be used purely because the nature of the workload that it runs very long (indefinite). I did a math: the dma_fence_wait_timeout function's third param is the timeout which is a signed long type. If HZ is 1000, this is about 23 days. If 23 days is not long enough, can we just change the timeout parameter to signed 64 bits so it is much longer than our life time...
>>
>> So I mainly argue we can't use dma-fence for long-run workload is not because the workload runs very long, rather because of the fact that we use page fault for long-run workload. If we enable page fault for short-run workload, we can't use dma-fence either. Page fault is the key thing here.
>>
>> Now since we use page fault which is *fundamentally* controversial with dma-fence design, why now just introduce a independent concept such as user-fence instead of extending existing dma-fence?
>>
>> I like unified design. If drm scheduler, dma-fence can be extended to work for everything, it is beautiful. But seems we have some fundamental problem here.
>>
> Thomas's patches turn a dma-fence into KMD sync point (e.g. we just use
> the signal / CB infrastructure) and enforce we don't use use these
> dma-fences from the scheduler in memory reclaim paths or export these to
> user space or other drivers. Think of this mode as SW only fence.

Yeah and I truly think this is a really bad idea.

The signal/CB infrastructure in the dma_fence turned out to be the
absolute nightmare I initially predicted. Sorry to say that, but in
this case the "I've told you so" is appropriate in my opinion.

If we need infrastructure for long running dependency tracking we should 
encapsulate that in a new framework and not try to mangle the existing 
code for something it was never intended for.
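
Just to make it concrete what I mean by a new framework, a first cut
could be as small as this (names completely made up; the only point is
that it deliberately is not a dma_fence and therefore can never leak
into dma_resv or cross-driver sync):

/* driver-internal long-running sync point, intentionally NOT a dma_fence */
struct lr_sync_point {
	struct completion done;		/* signalled by the driver backend */
	struct list_head cb_list;	/* driver-internal callbacks only */
	spinlock_t lock;
};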

Christian.

>
> Matt
>   
>> Thanks,
>> Oak
>>
>>> -----Original Message-----
>>> From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of
>>> Matthew Brost
>>> Sent: April 3, 2023 8:22 PM
>>> To: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org
>>> Cc: robdclark@chromium.org; thomas.hellstrom@linux.intel.com; airlied@linux.ie;
>>> lina@asahilina.net; boris.brezillon@collabora.com; Brost, Matthew
>>> <matthew.brost@intel.com>; christian.koenig@amd.com;
>>> faith.ekstrand@collabora.com
>>> Subject: [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans
>>>
>>> Hello,
>>>
>>> As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
>>> have been asked to merge our common DRM scheduler patches first as well
>>> as develop a common solution for long running workloads with the DRM
>>> scheduler. This RFC series is our first attempt at doing this. We
>>> welcome any and all feedback.
>>>
>>> This can we thought of as 4 parts detailed below.
>>>
>>> - DRM scheduler changes for 1 to 1 relationship between scheduler and
>>> entity (patches 1-3)
>>>
>>> In Xe all of the scheduling of jobs is done by a firmware scheduler (the
>>> GuC) which is a new paradigm WRT to the DRM scheduler and presents
>>> severals problems as the DRM was originally designed to schedule jobs on
>>> hardware queues. The main problem being that DRM scheduler expects the
>>> submission order of jobs to be the completion order of jobs even across
>>> multiple entities. This assumption falls apart with a firmware scheduler
>>> as a firmware scheduler has no concept of jobs and jobs can complete out
>>> of order. A novel solution for was originally thought of by Faith during
>>> the initial prototype of Xe, create a 1 to 1 relationship between scheduler
>>> and entity. I believe the AGX driver [3] is using this approach and
>>> Boris may use approach as well for the Mali driver [4].
>>>
>>> To support a 1 to 1 relationship we move the main execution function
>>> from a kthread to a work queue and add a new scheduling mode which
>>> bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
>>> The new scheduling mode should unify all drivers usage with a 1 to 1
>>> relationship and can be thought of as using scheduler as a dependency /
>>> infligt job tracker rather than a true scheduler.
>>>
>>> - Generic messaging interface for DRM scheduler
>>>
>>> Idea is to be able to communicate to the submission backend with in band
>>> (relative to main execution function) messages. Messages are backend
>>> defined and flexable enough for any use case. In Xe we use these
>>> messages to clean up entites, set properties for entites, and suspend /
>>> resume execution of an entity [5]. I suspect other driver can leverage
>>> this messaging concept too as it a convenient way to avoid races in the
>>> backend.
>>>
>>> - Support for using TDR for all error paths of a scheduler / entity
>>>
>>> Fix a few races / bugs, add function to dynamically set the TDR timeout.
>>>
>>> - Annotate dma-fences for long running workloads.
>>>
>>> The idea here is to use dma-fences only as sync points within the
>>> scheduler and never export them for long running workloads. By
>>> annotating these fences as long running we ensure that these dma-fences
>>> are never used in a way that breaks the dma-fence rules. A benefit of
>>> thus approach is the scheduler can still safely flow control the
>>> execution ring buffer via the job limit without breaking the dma-fence
>>> rules.
>>>
>>> Again this a first draft and looking forward to feedback.
>>>
>>> Enjoy - Matt
>>>
>>> [1] https://gitlab.freedesktop.org/drm/xe/kernel
>>> [2] https://patchwork.freedesktop.org/series/112188/
>>> [3] https://patchwork.freedesktop.org/series/114772/
>>> [4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1
>>> [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-
>>> next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031
>>>
>>> Matthew Brost (8):
>>>    drm/sched: Convert drm scheduler to use a work queue rather than
>>>      kthread
>>>    drm/sched: Move schedule policy to scheduler / entity
>>>    drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
>>>    drm/sched: Add generic scheduler message interface
>>>    drm/sched: Start run wq before TDR in drm_sched_start
>>>    drm/sched: Submit job before starting TDR
>>>    drm/sched: Add helper to set TDR timeout
>>>    drm/syncobj: Warn on long running dma-fences
>>>
>>> Thomas Hellström (2):
>>>    dma-buf/dma-fence: Introduce long-running completion fences
>>>    drm/sched: Support long-running sched entities
>>>
>>>   drivers/dma-buf/dma-fence.c                 | 142 +++++++---
>>>   drivers/dma-buf/dma-resv.c                  |   5 +
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
>>>   drivers/gpu/drm/drm_syncobj.c               |   5 +-
>>>   drivers/gpu/drm/etnaviv/etnaviv_sched.c     |   5 +-
>>>   drivers/gpu/drm/lima/lima_sched.c           |   5 +-
>>>   drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
>>>   drivers/gpu/drm/msm/msm_ringbuffer.c        |   5 +-
>>>   drivers/gpu/drm/panfrost/panfrost_job.c     |   5 +-
>>>   drivers/gpu/drm/scheduler/sched_entity.c    | 127 +++++++--
>>>   drivers/gpu/drm/scheduler/sched_fence.c     |   6 +-
>>>   drivers/gpu/drm/scheduler/sched_main.c      | 278 +++++++++++++++-----
>>>   drivers/gpu/drm/v3d/v3d_sched.c             |  25 +-
>>>   include/drm/gpu_scheduler.h                 | 130 +++++++--
>>>   include/linux/dma-fence.h                   |  60 ++++-
>>>   16 files changed, 649 insertions(+), 184 deletions(-)
>>>
>>> --
>>> 2.34.1


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans
  2023-04-04 13:37   ` Matthew Brost
@ 2023-04-05  7:41     ` Christian König
  2023-04-05  8:34       ` Daniel Vetter
  0 siblings, 1 reply; 87+ messages in thread
From: Christian König @ 2023-04-05  7:41 UTC (permalink / raw)
  To: Matthew Brost
  Cc: robdclark, airlied, lina, dri-devel, boris.brezillon, intel-xe,
	faith.ekstrand

Am 04.04.23 um 15:37 schrieb Matthew Brost:
> On Tue, Apr 04, 2023 at 11:13:28AM +0200, Christian König wrote:
>> Hi,
>>
>> Am 04.04.23 um 02:22 schrieb Matthew Brost:
>>> Hello,
>>>
>>> As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
>>> have been asked to merge our common DRM scheduler patches first as well
>>> as develop a common solution for long running workloads with the DRM
>>> scheduler. This RFC series is our first attempt at doing this. We
>>> welcome any and all feedback.
>>>
>>> This can we thought of as 4 parts detailed below.
>>>
>>> - DRM scheduler changes for 1 to 1 relationship between scheduler and
>>> entity (patches 1-3)
>>>
>>> In Xe all of the scheduling of jobs is done by a firmware scheduler (the
>>> GuC) which is a new paradigm WRT to the DRM scheduler and presents
>>> severals problems as the DRM was originally designed to schedule jobs on
>>> hardware queues. The main problem being that DRM scheduler expects the
>>> submission order of jobs to be the completion order of jobs even across
>>> multiple entities. This assumption falls apart with a firmware scheduler
>>> as a firmware scheduler has no concept of jobs and jobs can complete out
>>> of order. A novel solution for was originally thought of by Faith during
>>> the initial prototype of Xe, create a 1 to 1 relationship between scheduler
>>> and entity. I believe the AGX driver [3] is using this approach and
>>> Boris may use approach as well for the Mali driver [4].
>>>
>>> To support a 1 to 1 relationship we move the main execution function
>>> from a kthread to a work queue and add a new scheduling mode which
>>> bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
>>> The new scheduling mode should unify all drivers usage with a 1 to 1
>>> relationship and can be thought of as using scheduler as a dependency /
>>> infligt job tracker rather than a true scheduler.
>>>
>>> - Generic messaging interface for DRM scheduler
>>>
>>> Idea is to be able to communicate to the submission backend with in band
>>> (relative to main execution function) messages. Messages are backend
>>> defined and flexable enough for any use case. In Xe we use these
>>> messages to clean up entites, set properties for entites, and suspend /
>>> resume execution of an entity [5]. I suspect other driver can leverage
>>> this messaging concept too as it a convenient way to avoid races in the
>>> backend.
>> Oh, please absolutely *don't* do this.
>>
>> This is basically the design which makes a bunch of stuff so horrible broken
>> on Windows.
>>
>> I can explain it in more detail if necessary, but I strongly recommend to
>> not go down this path.
>>
> I'm afraid we are going to have to discuss this further. Let me explain
> my reasoning, basically the idea is to have a single main entry point to
> backend - the work queue. This avoids the need for lock between run_job
> and any message that changes an entites state, also it really helps
> during the reset flows (either TDR or GT reset) as we can call
> drm_sched_run_wq_stop can ensure that nothing else is in the backend
> changing an entity state. It all works out really nicely actually, our
> GuC backend is incredibly stable (hasn't really had a bug pop in about a
> year) and way simpler than what we did in the i915. I think the simplity
> to largely due to this design of limiting the entry points.
>
> I personally don't see how this a poor design, limiting entry points
> absolutely makes sense to me, if it didn't why not just call cleanup_job
> bypassing the main execution thread (now worker), this is the exact same
> concept.

Well then I strongly suggest reading a few analyses of the failure of
the message processing loop on Windows.

Have you ever wondered why classic Win32 applications sometimes seem to
be stuck and don't do anything? This design pattern combined with
timeouts to solve deadlocks is the reason for that.

The major problem with this approach is that analysis tools like
lockdep have a hard time grasping the dependencies.

What you can do is offload all your operations which are supposed to
run in the same thread into a work queue as work items. This is
something lockdep understands and is able to scream out loud if someone
messes up the deadlock dependencies.
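
Roughly like this (just a sketch, names made up): every backend
operation becomes a work item on one ordered workqueue, so it naturally
serializes against run_job and lockdep sees each work item's
dependencies:

struct backend_op {
	struct work_struct work;
	/* op specific payload ... */
};

static void backend_op_work(struct work_struct *w)
{
	struct backend_op *op = container_of(w, struct backend_op, work);

	/* handle the operation in backend context */
	kfree(op);
}

static void backend_queue_op(struct workqueue_struct *ordered_wq,
			     struct backend_op *op)
{
	INIT_WORK(&op->work, backend_op_work);
	queue_work(ordered_wq, &op->work);
}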

Regards,
Christian.

>
> FWIW Asahi liked the idea as well and think it could be useful for AGX.
> Matt
>
>> Regards,
>> Christian.
>>
>>> - Support for using TDR for all error paths of a scheduler / entity
>>>
>>> Fix a few races / bugs, add function to dynamically set the TDR timeout.
>>>
>>> - Annotate dma-fences for long running workloads.
>>>
>>> The idea here is to use dma-fences only as sync points within the
>>> scheduler and never export them for long running workloads. By
>>> annotating these fences as long running we ensure that these dma-fences
>>> are never used in a way that breaks the dma-fence rules. A benefit of
>>> thus approach is the scheduler can still safely flow control the
>>> execution ring buffer via the job limit without breaking the dma-fence
>>> rules.
>>>
>>> Again this a first draft and looking forward to feedback.
>>>
>>> Enjoy - Matt
>>>
>>> [1] https://gitlab.freedesktop.org/drm/xe/kernel
>>> [2] https://patchwork.freedesktop.org/series/112188/
>>> [3] https://patchwork.freedesktop.org/series/114772/
>>> [4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1
>>> [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031
>>>
>>> Matthew Brost (8):
>>>     drm/sched: Convert drm scheduler to use a work queue rather than
>>>       kthread
>>>     drm/sched: Move schedule policy to scheduler / entity
>>>     drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
>>>     drm/sched: Add generic scheduler message interface
>>>     drm/sched: Start run wq before TDR in drm_sched_start
>>>     drm/sched: Submit job before starting TDR
>>>     drm/sched: Add helper to set TDR timeout
>>>     drm/syncobj: Warn on long running dma-fences
>>>
>>> Thomas Hellström (2):
>>>     dma-buf/dma-fence: Introduce long-running completion fences
>>>     drm/sched: Support long-running sched entities
>>>
>>>    drivers/dma-buf/dma-fence.c                 | 142 +++++++---
>>>    drivers/dma-buf/dma-resv.c                  |   5 +
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
>>>    drivers/gpu/drm/drm_syncobj.c               |   5 +-
>>>    drivers/gpu/drm/etnaviv/etnaviv_sched.c     |   5 +-
>>>    drivers/gpu/drm/lima/lima_sched.c           |   5 +-
>>>    drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
>>>    drivers/gpu/drm/msm/msm_ringbuffer.c        |   5 +-
>>>    drivers/gpu/drm/panfrost/panfrost_job.c     |   5 +-
>>>    drivers/gpu/drm/scheduler/sched_entity.c    | 127 +++++++--
>>>    drivers/gpu/drm/scheduler/sched_fence.c     |   6 +-
>>>    drivers/gpu/drm/scheduler/sched_main.c      | 278 +++++++++++++++-----
>>>    drivers/gpu/drm/v3d/v3d_sched.c             |  25 +-
>>>    include/drm/gpu_scheduler.h                 | 130 +++++++--
>>>    include/linux/dma-fence.h                   |  60 ++++-
>>>    16 files changed, 649 insertions(+), 184 deletions(-)
>>>


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans
  2023-04-05  7:41     ` Christian König
@ 2023-04-05  8:34       ` Daniel Vetter
  2023-04-05  8:53         ` Christian König
  0 siblings, 1 reply; 87+ messages in thread
From: Daniel Vetter @ 2023-04-05  8:34 UTC (permalink / raw)
  To: Christian König
  Cc: robdclark, airlied, lina, dri-devel, boris.brezillon, intel-xe,
	faith.ekstrand

On Wed, Apr 05, 2023 at 09:41:23AM +0200, Christian König wrote:
> Am 04.04.23 um 15:37 schrieb Matthew Brost:
> > On Tue, Apr 04, 2023 at 11:13:28AM +0200, Christian König wrote:
> > > Hi,
> > > 
> > > Am 04.04.23 um 02:22 schrieb Matthew Brost:
> > > > Hello,
> > > > 
> > > > As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> > > > have been asked to merge our common DRM scheduler patches first as well
> > > > as develop a common solution for long running workloads with the DRM
> > > > scheduler. This RFC series is our first attempt at doing this. We
> > > > welcome any and all feedback.
> > > > 
> > > > This can we thought of as 4 parts detailed below.
> > > > 
> > > > - DRM scheduler changes for 1 to 1 relationship between scheduler and
> > > > entity (patches 1-3)
> > > > 
> > > > In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> > > > GuC) which is a new paradigm WRT to the DRM scheduler and presents
> > > > severals problems as the DRM was originally designed to schedule jobs on
> > > > hardware queues. The main problem being that DRM scheduler expects the
> > > > submission order of jobs to be the completion order of jobs even across
> > > > multiple entities. This assumption falls apart with a firmware scheduler
> > > > as a firmware scheduler has no concept of jobs and jobs can complete out
> > > > of order. A novel solution for was originally thought of by Faith during
> > > > the initial prototype of Xe, create a 1 to 1 relationship between scheduler
> > > > and entity. I believe the AGX driver [3] is using this approach and
> > > > Boris may use approach as well for the Mali driver [4].
> > > > 
> > > > To support a 1 to 1 relationship we move the main execution function
> > > > from a kthread to a work queue and add a new scheduling mode which
> > > > bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
> > > > The new scheduling mode should unify all drivers usage with a 1 to 1
> > > > relationship and can be thought of as using scheduler as a dependency /
> > > > infligt job tracker rather than a true scheduler.
> > > > 
> > > > - Generic messaging interface for DRM scheduler
> > > > 
> > > > Idea is to be able to communicate to the submission backend with in band
> > > > (relative to main execution function) messages. Messages are backend
> > > > defined and flexable enough for any use case. In Xe we use these
> > > > messages to clean up entites, set properties for entites, and suspend /
> > > > resume execution of an entity [5]. I suspect other driver can leverage
> > > > this messaging concept too as it a convenient way to avoid races in the
> > > > backend.
> > > Oh, please absolutely *don't* do this.
> > > 
> > > This is basically the design which makes a bunch of stuff so horrible broken
> > > on Windows.
> > > 
> > > I can explain it in more detail if necessary, but I strongly recommend to
> > > not go down this path.
> > > 
> > I'm afraid we are going to have to discuss this further. Let me explain
> > my reasoning, basically the idea is to have a single main entry point to
> > backend - the work queue. This avoids the need for lock between run_job
> > and any message that changes an entites state, also it really helps
> > during the reset flows (either TDR or GT reset) as we can call
> > drm_sched_run_wq_stop can ensure that nothing else is in the backend
> > changing an entity state. It all works out really nicely actually, our
> > GuC backend is incredibly stable (hasn't really had a bug pop in about a
> > year) and way simpler than what we did in the i915. I think the simplity
> > to largely due to this design of limiting the entry points.
> > 
> > I personally don't see how this a poor design, limiting entry points
> > absolutely makes sense to me, if it didn't why not just call cleanup_job
> > bypassing the main execution thread (now worker), this is the exact same
> > concept.
> 
> Well then I strongly suggest reading a few analyses of the failure of the
> message processing loop on Windows.
> 
> Have you ever wondered why classic Win32 applications sometimes seem to be
> stuck and don't do anything? This design pattern combined with timeouts to
> solve deadlocks is the reason for that.
> 
> The major problem with this approach is that analysis tools like lockdep
> have a hard time grasping the dependencies.

wq is fully annotated and actually splats. Plain kthread doesn't, without
adding something like the dma_fence_signalling stuff.

But yeah if you block badly in the work items and stall the entire queue,
then things go sideways real bad. There aren't really any tools we have in
the kernel to enforce this, since we still want to allow mutexes and
sleeping and stuff like that.

> What you can do is offload all your operations which are supposed to run
> in the same thread into a work queue as work items. This is something
> lockdep understands and is able to scream out loud if someone messes up the
> deadlock dependencies.

I thought that's the plan here? Or at least what I thought the plan was,
and why I really think we need a per-engine workqueue to make it work well
(and also why I suggested the refactoring to split up drm_scheduler into
the driver api struct, which stays per-engine, and the internal backend
which would be per drm_sched_entity for fw schedulers that round-robin gpu
ctx on their own).

Also maybe we need to allow drivers to pass in the workqueue like we allow
for the tdr handling already, since that simplifies the locking.
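
Roughly thinking of something like the below, where timeout_wq is what we
have today and run_wq would be the new driver-supplied one (hypothetical,
doesn't exist yet, just to illustrate):

int drm_sched_init(struct drm_gpu_scheduler *sched,
		   const struct drm_sched_backend_ops *ops,
		   struct workqueue_struct *run_wq,	/* hypothetical addition */
		   unsigned int hw_submission, unsigned int hang_limit,
		   long timeout, struct workqueue_struct *timeout_wq,
		   atomic_t *score, const char *name, struct device *dev);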

At least for intel gpu I think this message passing design makes some
sense because fundamentally the fw only has a single blocking message
queue. And so intel/xe fundamentally needs to deal with the "stupid code
might block forward progress for everyone" problem you're describing, no
matter whether it's done with the help of drm/sched infra or not.

I do agree though that we shouldn't encourage drivers which don't have
that kind of fw command queue design to use this. So maybe a huge comment
would be needed to explain when (and _only_ when) it's ok to use that
message passing, and why in other cases it would make things a lot
worse?
-Daniel

> 
> Regards,
> Christian.
> 
> > 
> > FWIW Asahi liked the idea as well and think it could be useful for AGX.
> > Matt
> > 
> > > Regards,
> > > Christian.
> > > 
> > > > - Support for using TDR for all error paths of a scheduler / entity
> > > > 
> > > > Fix a few races / bugs, add function to dynamically set the TDR timeout.
> > > > 
> > > > - Annotate dma-fences for long running workloads.
> > > > 
> > > > The idea here is to use dma-fences only as sync points within the
> > > > scheduler and never export them for long running workloads. By
> > > > annotating these fences as long running we ensure that these dma-fences
> > > > are never used in a way that breaks the dma-fence rules. A benefit of
> > > > thus approach is the scheduler can still safely flow control the
> > > > execution ring buffer via the job limit without breaking the dma-fence
> > > > rules.
> > > > 
> > > > Again this a first draft and looking forward to feedback.
> > > > 
> > > > Enjoy - Matt
> > > > 
> > > > [1] https://gitlab.freedesktop.org/drm/xe/kernel
> > > > [2] https://patchwork.freedesktop.org/series/112188/
> > > > [3] https://patchwork.freedesktop.org/series/114772/
> > > > [4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1
> > > > [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031
> > > > 
> > > > Matthew Brost (8):
> > > >     drm/sched: Convert drm scheduler to use a work queue rather than
> > > >       kthread
> > > >     drm/sched: Move schedule policy to scheduler / entity
> > > >     drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
> > > >     drm/sched: Add generic scheduler message interface
> > > >     drm/sched: Start run wq before TDR in drm_sched_start
> > > >     drm/sched: Submit job before starting TDR
> > > >     drm/sched: Add helper to set TDR timeout
> > > >     drm/syncobj: Warn on long running dma-fences
> > > > 
> > > > Thomas Hellström (2):
> > > >     dma-buf/dma-fence: Introduce long-running completion fences
> > > >     drm/sched: Support long-running sched entities
> > > > 
> > > >    drivers/dma-buf/dma-fence.c                 | 142 +++++++---
> > > >    drivers/dma-buf/dma-resv.c                  |   5 +
> > > >    drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
> > > >    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
> > > >    drivers/gpu/drm/drm_syncobj.c               |   5 +-
> > > >    drivers/gpu/drm/etnaviv/etnaviv_sched.c     |   5 +-
> > > >    drivers/gpu/drm/lima/lima_sched.c           |   5 +-
> > > >    drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
> > > >    drivers/gpu/drm/msm/msm_ringbuffer.c        |   5 +-
> > > >    drivers/gpu/drm/panfrost/panfrost_job.c     |   5 +-
> > > >    drivers/gpu/drm/scheduler/sched_entity.c    | 127 +++++++--
> > > >    drivers/gpu/drm/scheduler/sched_fence.c     |   6 +-
> > > >    drivers/gpu/drm/scheduler/sched_main.c      | 278 +++++++++++++++-----
> > > >    drivers/gpu/drm/v3d/v3d_sched.c             |  25 +-
> > > >    include/drm/gpu_scheduler.h                 | 130 +++++++--
> > > >    include/linux/dma-fence.h                   |  60 ++++-
> > > >    16 files changed, 649 insertions(+), 184 deletions(-)
> > > > 
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans
  2023-04-05  7:30     ` Christian König
@ 2023-04-05  8:42       ` Daniel Vetter
  2023-04-05 18:06       ` Zeng, Oak
  1 sibling, 0 replies; 87+ messages in thread
From: Daniel Vetter @ 2023-04-05  8:42 UTC (permalink / raw)
  To: Christian König
  Cc: robdclark, airlied, lina, Zeng, Oak, boris.brezillon, dri-devel,
	intel-xe, faith.ekstrand

On Wed, Apr 05, 2023 at 09:30:11AM +0200, Christian König wrote:
> Am 04.04.23 um 20:08 schrieb Matthew Brost:
> > On Tue, Apr 04, 2023 at 12:02:03PM -0600, Zeng, Oak wrote:
> > > Hi Matt, Thomas,
> > > 
> > > Some very bold out of box thinking in this area:
> > > 
> > > 1. so you want to use drm scheduler and dma-fence for long running workload. Why you want to do this in the first place? What is the benefit? Drm scheduler is pretty much a software scheduler. Modern gpu has scheduler built at fw/hw level, as you said below for intel this is Guc. Can xe driver just directly submit job to Guc, bypassing drm scheduler?
> > > 
> > If we did that now we have 2 paths for dependency track, flow controling
> > the ring, resets / error handling / backend submission implementations.
> > We don't want this.
> 
> Well exactly that's the point: Why?
> 
> As far as I can see those are two completely distinct use cases, so you
> absolutely do want two completely distinct implementations for this.
> 
> > > 2. using dma-fence for long run workload: I am well aware that page fault (and the consequent memory allocation/lock acquiring to fix the fault) can cause deadlock for a dma-fence wait. But I am not convinced that dma-fence can't be used purely because the nature of the workload that it runs very long (indefinite). I did a math: the dma_fence_wait_timeout function's third param is the timeout which is a signed long type. If HZ is 1000, this is about 23 days. If 23 days is not long enough, can we just change the timeout parameter to signed 64 bits so it is much longer than our life time...
> > > 
> > > So I mainly argue we can't use dma-fence for long-run workload is not because the workload runs very long, rather because of the fact that we use page fault for long-run workload. If we enable page fault for short-run workload, we can't use dma-fence either. Page fault is the key thing here.
> > > 
> > > Now since we use page fault which is *fundamentally* controversial with dma-fence design, why now just introduce a independent concept such as user-fence instead of extending existing dma-fence?
> > > 
> > > I like unified design. If drm scheduler, dma-fence can be extended to work for everything, it is beautiful. But seems we have some fundamental problem here.
> > > 
> > Thomas's patches turn a dma-fence into KMD sync point (e.g. we just use
> > the signal / CB infrastructure) and enforce we don't use use these
> > dma-fences from the scheduler in memory reclaim paths or export these to
> > user space or other drivers. Think of this mode as SW only fence.
> 
> Yeah and I truly think this is a really bad idea.
> 
> The signal/CB infrastructure in the dma_fence turned out to be the
> absolute nightmare I initially predicted. Sorry to say that, but in this
> case the "I've told you so" is appropriate in my opinion.
> 
> If we need infrastructure for long running dependency tracking we should
> encapsulate that in a new framework and not try to mangle the existing code
> for something it was never intended for.

Concurring hard (already typed that up somewhere else). I'd go one step
further and ask whether the kernel really has to handle dependencies for
these long-running compute jobs. The entire design with userspace memory
fences assumes that this is userspace's job.

For drm_syncobj we've also pushed a lot of the dependency handling to
userspace, with submit threads in mesa. So if there is any blocking to be
done (running out of ring space), why can't we sort that out the same way?
Meaning (rough userspace sketch further below):
1. superfast direct-to-hw submit path (using doorbells or whatever)
2. submit ioctl which only succeeds if it doesn't have to do any userspace
memory fence waits, otherwise you get EWOULDBLOCK
3. userspace sorts out the mess in a submit thread if it gets an
EWOULDBLOCK, because fundamentally the kernel cannot guarantee a
bottomless queue. If userspace wants bottomless, it needs to handle the
allocation and delaying imo

You can even make 3 entirely as-needed, which means for the usual
fast-path you'll never see the userspace thread created unless you do hit
an EWOULDBLOCK.
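
Userspace side could look roughly like this (the exec ioctl number, the
EWOULDBLOCK contract and the submit thread helper are all made up here,
it's just the control flow):

#include <errno.h>
#include <sys/ioctl.h>

/* made-up helper: hands the job to the lazily created submit thread */
int queue_to_submit_thread(int fd, unsigned long exec_ioctl, void *args);

static int submit_job(int fd, unsigned long exec_ioctl, void *args)
{
	/* steps 1/2: ask the kernel to take the job without any waits */
	if (ioctl(fd, exec_ioctl, args) == 0)
		return 0;
	if (errno != EWOULDBLOCK)
		return -errno;

	/*
	 * step 3: the kernel would have had to wait (userspace memory
	 * fence, ring space), so park the job on the submit thread which
	 * waits in userspace and then retries.
	 */
	return queue_to_submit_thread(fd, exec_ioctl, args);
}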

If we insist that the kernel handles the long-running dependencies fully
then all we end up doing is implementing step 3, but entirely in the
kernel instead of userspace. And in the kernel every bug gets you halfway
to a CVE, and I just don't think that makes much sense for something which
is the fallback of the fallback - once you run out of ring space you're
not going to have a great day no matter what.

I'd go as far as to say that if we want step 3 in the kernel, someone
needs to supply the real-world (i.e. real application running real
workloads, not some microbenchmark) benchmark to prove it's actually
worth the pain. Otherwise: on-demand userspace submit thread.
-Daniel

> 
> Christian.
> 
> > 
> > Matt
> > > Thanks,
> > > Oak
> > > 
> > > > -----Original Message-----
> > > > From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of
> > > > Matthew Brost
> > > > Sent: April 3, 2023 8:22 PM
> > > > To: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org
> > > > Cc: robdclark@chromium.org; thomas.hellstrom@linux.intel.com; airlied@linux.ie;
> > > > lina@asahilina.net; boris.brezillon@collabora.com; Brost, Matthew
> > > > <matthew.brost@intel.com>; christian.koenig@amd.com;
> > > > faith.ekstrand@collabora.com
> > > > Subject: [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans
> > > > 
> > > > Hello,
> > > > 
> > > > As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> > > > have been asked to merge our common DRM scheduler patches first as well
> > > > as develop a common solution for long running workloads with the DRM
> > > > scheduler. This RFC series is our first attempt at doing this. We
> > > > welcome any and all feedback.
> > > > 
> > > > This can we thought of as 4 parts detailed below.
> > > > 
> > > > - DRM scheduler changes for 1 to 1 relationship between scheduler and
> > > > entity (patches 1-3)
> > > > 
> > > > In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> > > > GuC) which is a new paradigm WRT to the DRM scheduler and presents
> > > > severals problems as the DRM was originally designed to schedule jobs on
> > > > hardware queues. The main problem being that DRM scheduler expects the
> > > > submission order of jobs to be the completion order of jobs even across
> > > > multiple entities. This assumption falls apart with a firmware scheduler
> > > > as a firmware scheduler has no concept of jobs and jobs can complete out
> > > > of order. A novel solution for was originally thought of by Faith during
> > > > the initial prototype of Xe, create a 1 to 1 relationship between scheduler
> > > > and entity. I believe the AGX driver [3] is using this approach and
> > > > Boris may use approach as well for the Mali driver [4].
> > > > 
> > > > To support a 1 to 1 relationship we move the main execution function
> > > > from a kthread to a work queue and add a new scheduling mode which
> > > > bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
> > > > The new scheduling mode should unify all drivers usage with a 1 to 1
> > > > relationship and can be thought of as using scheduler as a dependency /
> > > > infligt job tracker rather than a true scheduler.
> > > > 
> > > > - Generic messaging interface for DRM scheduler
> > > > 
> > > > Idea is to be able to communicate to the submission backend with in band
> > > > (relative to main execution function) messages. Messages are backend
> > > > defined and flexable enough for any use case. In Xe we use these
> > > > messages to clean up entites, set properties for entites, and suspend /
> > > > resume execution of an entity [5]. I suspect other driver can leverage
> > > > this messaging concept too as it a convenient way to avoid races in the
> > > > backend.
> > > > 
> > > > - Support for using TDR for all error paths of a scheduler / entity
> > > > 
> > > > Fix a few races / bugs, add function to dynamically set the TDR timeout.
> > > > 
> > > > - Annotate dma-fences for long running workloads.
> > > > 
> > > > The idea here is to use dma-fences only as sync points within the
> > > > scheduler and never export them for long running workloads. By
> > > > annotating these fences as long running we ensure that these dma-fences
> > > > are never used in a way that breaks the dma-fence rules. A benefit of
> > > > thus approach is the scheduler can still safely flow control the
> > > > execution ring buffer via the job limit without breaking the dma-fence
> > > > rules.
> > > > 
> > > > Again this a first draft and looking forward to feedback.
> > > > 
> > > > Enjoy - Matt
> > > > 
> > > > [1] https://gitlab.freedesktop.org/drm/xe/kernel
> > > > [2] https://patchwork.freedesktop.org/series/112188/
> > > > [3] https://patchwork.freedesktop.org/series/114772/
> > > > [4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1
> > > > [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-
> > > > next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031
> > > > 
> > > > Matthew Brost (8):
> > > >    drm/sched: Convert drm scheduler to use a work queue rather than
> > > >      kthread
> > > >    drm/sched: Move schedule policy to scheduler / entity
> > > >    drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
> > > >    drm/sched: Add generic scheduler message interface
> > > >    drm/sched: Start run wq before TDR in drm_sched_start
> > > >    drm/sched: Submit job before starting TDR
> > > >    drm/sched: Add helper to set TDR timeout
> > > >    drm/syncobj: Warn on long running dma-fences
> > > > 
> > > > Thomas Hellström (2):
> > > >    dma-buf/dma-fence: Introduce long-running completion fences
> > > >    drm/sched: Support long-running sched entities
> > > > 
> > > >   drivers/dma-buf/dma-fence.c                 | 142 +++++++---
> > > >   drivers/dma-buf/dma-resv.c                  |   5 +
> > > >   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
> > > >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
> > > >   drivers/gpu/drm/drm_syncobj.c               |   5 +-
> > > >   drivers/gpu/drm/etnaviv/etnaviv_sched.c     |   5 +-
> > > >   drivers/gpu/drm/lima/lima_sched.c           |   5 +-
> > > >   drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
> > > >   drivers/gpu/drm/msm/msm_ringbuffer.c        |   5 +-
> > > >   drivers/gpu/drm/panfrost/panfrost_job.c     |   5 +-
> > > >   drivers/gpu/drm/scheduler/sched_entity.c    | 127 +++++++--
> > > >   drivers/gpu/drm/scheduler/sched_fence.c     |   6 +-
> > > >   drivers/gpu/drm/scheduler/sched_main.c      | 278 +++++++++++++++-----
> > > >   drivers/gpu/drm/v3d/v3d_sched.c             |  25 +-
> > > >   include/drm/gpu_scheduler.h                 | 130 +++++++--
> > > >   include/linux/dma-fence.h                   |  60 ++++-
> > > >   16 files changed, 649 insertions(+), 184 deletions(-)
> > > > 
> > > > --
> > > > 2.34.1
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans
  2023-04-05  8:34       ` Daniel Vetter
@ 2023-04-05  8:53         ` Christian König
  2023-04-05  9:07           ` Daniel Vetter
  0 siblings, 1 reply; 87+ messages in thread
From: Christian König @ 2023-04-05  8:53 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: robdclark, airlied, lina, dri-devel, boris.brezillon, intel-xe,
	faith.ekstrand

Am 05.04.23 um 10:34 schrieb Daniel Vetter:
> On Wed, Apr 05, 2023 at 09:41:23AM +0200, Christian König wrote:
>> Am 04.04.23 um 15:37 schrieb Matthew Brost:
>>> On Tue, Apr 04, 2023 at 11:13:28AM +0200, Christian König wrote:
>>>> Hi,
>>>>
>>>> Am 04.04.23 um 02:22 schrieb Matthew Brost:
>>>>> Hello,
>>>>>
>>>>> As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
>>>>> have been asked to merge our common DRM scheduler patches first as well
>>>>> as develop a common solution for long running workloads with the DRM
>>>>> scheduler. This RFC series is our first attempt at doing this. We
>>>>> welcome any and all feedback.
>>>>>
>>>>> This can we thought of as 4 parts detailed below.
>>>>>
>>>>> - DRM scheduler changes for 1 to 1 relationship between scheduler and
>>>>> entity (patches 1-3)
>>>>>
>>>>> In Xe all of the scheduling of jobs is done by a firmware scheduler (the
>>>>> GuC) which is a new paradigm WRT to the DRM scheduler and presents
>>>>> severals problems as the DRM was originally designed to schedule jobs on
>>>>> hardware queues. The main problem being that DRM scheduler expects the
>>>>> submission order of jobs to be the completion order of jobs even across
>>>>> multiple entities. This assumption falls apart with a firmware scheduler
>>>>> as a firmware scheduler has no concept of jobs and jobs can complete out
>>>>> of order. A novel solution for was originally thought of by Faith during
>>>>> the initial prototype of Xe, create a 1 to 1 relationship between scheduler
>>>>> and entity. I believe the AGX driver [3] is using this approach and
>>>>> Boris may use approach as well for the Mali driver [4].
>>>>>
>>>>> To support a 1 to 1 relationship we move the main execution function
>>>>> from a kthread to a work queue and add a new scheduling mode which
>>>>> bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
>>>>> The new scheduling mode should unify all drivers usage with a 1 to 1
>>>>> relationship and can be thought of as using scheduler as a dependency /
>>>>> infligt job tracker rather than a true scheduler.
>>>>>
>>>>> - Generic messaging interface for DRM scheduler
>>>>>
>>>>> Idea is to be able to communicate to the submission backend with in band
>>>>> (relative to main execution function) messages. Messages are backend
>>>>> defined and flexable enough for any use case. In Xe we use these
>>>>> messages to clean up entites, set properties for entites, and suspend /
>>>>> resume execution of an entity [5]. I suspect other driver can leverage
>>>>> this messaging concept too as it a convenient way to avoid races in the
>>>>> backend.
>>>> Oh, please absolutely *don't* do this.
>>>>
>>>> This is basically the design which makes a bunch of stuff so horrible broken
>>>> on Windows.
>>>>
>>>> I can explain it in more detail if necessary, but I strongly recommend to
>>>> not go down this path.
>>>>
>>> I'm afraid we are going to have to discuss this further. Let me explain
>>> my reasoning, basically the idea is to have a single main entry point to
>>> backend - the work queue. This avoids the need for lock between run_job
>>> and any message that changes an entites state, also it really helps
>>> during the reset flows (either TDR or GT reset) as we can call
>>> drm_sched_run_wq_stop can ensure that nothing else is in the backend
>>> changing an entity state. It all works out really nicely actually, our
>>> GuC backend is incredibly stable (hasn't really had a bug pop in about a
>>> year) and way simpler than what we did in the i915. I think the simplity
>>> to largely due to this design of limiting the entry points.
>>>
>>> I personally don't see how this a poor design, limiting entry points
>>> absolutely makes sense to me, if it didn't why not just call cleanup_job
>>> bypassing the main execution thread (now worker), this is the exact same
>>> concept.
>> Well then I strongly suggest reading a few analyses of the failure of the
>> message processing loop on Windows.
>>
>> Have you ever wondered why classic Win32 applications sometimes seem to be
>> stuck and don't do anything? This design pattern combined with timeouts to
>> solve deadlocks is the reason for that.
>>
>> The major problem with this approach is that analysis tools like lockdep
>> have a hard time grasping the dependencies.
> wq is fully annotated and actually splats. Plain kthread doesn't, without
> adding something like the dma_fence_signalling stuff.
>
> But yeah if you block badly in the work items and stall the entire queue,
> then things go sideways real bad. There aren't really any tools we have in
> the kernel to enforce this, since we still want to allow mutexes and
> sleeping and stuff like that.
>
>> What you can do is offload all your operations which are supposed to run
>> in the same thread into a work queue as work items. This is something
>> lockdep understands and is able to scream out loud if someone messes up the
>> deadlock dependencies.
> I thought that's the plan here?

At least from my impression that didn't look like what was implemented
here.

>   Or at least what I thought the plan was,
> and why I really think we need a per-engine workqueue to make it work well
> (and also why I suggested the refactoring to split up drm_scheduler into
> the driver api struct, which stays per-engine, and the internal backend
> which would be per drm_sched_entity for fw schedulers that round-robin gpu
> ctx on their own).
>
> Also maybe we need to allow drivers to pass in the workqueue like we allow
> for the tdr handling already, since that simplifies the locking.
>
> At least for intel gpu I think this message passing design makes some
> sense because fundamentally the fw only has a single blocking message
> queue. And so intel/xe fundamentally needs to deal with the "stupid code
> might block forward progress for everyone" problem you're describing, no
> matter whether it's done with the help of drm/sched infra or not.
>
> I do agree though that we shouldn't encourage drivers which don't have
> that kind of fw command queue design to use this. So maybe a huge comment
> would be needed to explain when (and _only_ when) it's ok to use that
> message passing, and why in other cases it would make things a lot
> worse?

I would approach it from the completely opposite side. This component here 
is a tool to decide what job should run next.

How that is then signaled and run should not be part of the scheduler, 
but of another, higher-level component.

This way you also don't have a problem with not using DMA-fences, either 
as dependencies or as constraints for running more jobs.
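
To make that concrete, here is a rough sketch of the split I have in 
mind (all names below are invented for illustration, this is not an 
interface proposal): the scheduling part only decides what runs next, 
while a driver side component does the actual execution and can pick 
whatever completion primitive fits, e.g. a plain struct completion for 
long running work instead of a dma_fence:

struct driver_job {
	struct drm_sched_job base;	/* dependency tracking / ordering only */
	struct work_struct run_work;	/* execution lives outside the scheduler */
	struct completion done;		/* long running: never exported as a dma_fence */
};

/* queued by the "decide what runs next" component once all
 * dependencies have signaled */
static void driver_run_job_work(struct work_struct *w)
{
	struct driver_job *job = container_of(w, struct driver_job, run_work);

	driver_hw_submit(job);	/* invented submission helper */
	/* complete(&job->done) is then called from the driver's IRQ handler */
}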

Christian.

> -Daniel
>
>> Regards,
>> Christian.
>>
>>> FWIW Asahi liked the idea as well and think it could be useful for AGX.
>>> Matt
>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> - Support for using TDR for all error paths of a scheduler / entity
>>>>>
>>>>> Fix a few races / bugs, add function to dynamically set the TDR timeout.
>>>>>
>>>>> - Annotate dma-fences for long running workloads.
>>>>>
>>>>> The idea here is to use dma-fences only as sync points within the
>>>>> scheduler and never export them for long running workloads. By
>>>>> annotating these fences as long running we ensure that these dma-fences
>>>>> are never used in a way that breaks the dma-fence rules. A benefit of
>>>>> thus approach is the scheduler can still safely flow control the
>>>>> execution ring buffer via the job limit without breaking the dma-fence
>>>>> rules.
>>>>>
>>>>> Again this a first draft and looking forward to feedback.
>>>>>
>>>>> Enjoy - Matt
>>>>>
>>>>> [1] https://gitlab.freedesktop.org/drm/xe/kernel
>>>>> [2] https://patchwork.freedesktop.org/series/112188/
>>>>> [3] https://patchwork.freedesktop.org/series/114772/
>>>>> [4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1
>>>>> [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031
>>>>>
>>>>> Matthew Brost (8):
>>>>>      drm/sched: Convert drm scheduler to use a work queue rather than
>>>>>        kthread
>>>>>      drm/sched: Move schedule policy to scheduler / entity
>>>>>      drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
>>>>>      drm/sched: Add generic scheduler message interface
>>>>>      drm/sched: Start run wq before TDR in drm_sched_start
>>>>>      drm/sched: Submit job before starting TDR
>>>>>      drm/sched: Add helper to set TDR timeout
>>>>>      drm/syncobj: Warn on long running dma-fences
>>>>>
>>>>> Thomas Hellström (2):
>>>>>      dma-buf/dma-fence: Introduce long-running completion fences
>>>>>      drm/sched: Support long-running sched entities
>>>>>
>>>>>     drivers/dma-buf/dma-fence.c                 | 142 +++++++---
>>>>>     drivers/dma-buf/dma-resv.c                  |   5 +
>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
>>>>>     drivers/gpu/drm/drm_syncobj.c               |   5 +-
>>>>>     drivers/gpu/drm/etnaviv/etnaviv_sched.c     |   5 +-
>>>>>     drivers/gpu/drm/lima/lima_sched.c           |   5 +-
>>>>>     drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
>>>>>     drivers/gpu/drm/msm/msm_ringbuffer.c        |   5 +-
>>>>>     drivers/gpu/drm/panfrost/panfrost_job.c     |   5 +-
>>>>>     drivers/gpu/drm/scheduler/sched_entity.c    | 127 +++++++--
>>>>>     drivers/gpu/drm/scheduler/sched_fence.c     |   6 +-
>>>>>     drivers/gpu/drm/scheduler/sched_main.c      | 278 +++++++++++++++-----
>>>>>     drivers/gpu/drm/v3d/v3d_sched.c             |  25 +-
>>>>>     include/drm/gpu_scheduler.h                 | 130 +++++++--
>>>>>     include/linux/dma-fence.h                   |  60 ++++-
>>>>>     16 files changed, 649 insertions(+), 184 deletions(-)
>>>>>


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans
  2023-04-05  8:53         ` Christian König
@ 2023-04-05  9:07           ` Daniel Vetter
  2023-04-05  9:57             ` Christian König
  0 siblings, 1 reply; 87+ messages in thread
From: Daniel Vetter @ 2023-04-05  9:07 UTC (permalink / raw)
  To: Christian König
  Cc: robdclark, airlied, lina, dri-devel, boris.brezillon,
	Daniel Vetter, intel-xe, faith.ekstrand

On Wed, Apr 05, 2023 at 10:53:26AM +0200, Christian König wrote:
> Am 05.04.23 um 10:34 schrieb Daniel Vetter:
> > On Wed, Apr 05, 2023 at 09:41:23AM +0200, Christian König wrote:
> > > Am 04.04.23 um 15:37 schrieb Matthew Brost:
> > > > On Tue, Apr 04, 2023 at 11:13:28AM +0200, Christian König wrote:
> > > > > Hi,
> > > > > 
> > > > > Am 04.04.23 um 02:22 schrieb Matthew Brost:
> > > > > > Hello,
> > > > > > 
> > > > > > As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> > > > > > have been asked to merge our common DRM scheduler patches first as well
> > > > > > as develop a common solution for long running workloads with the DRM
> > > > > > scheduler. This RFC series is our first attempt at doing this. We
> > > > > > welcome any and all feedback.
> > > > > > 
> > > > > > This can we thought of as 4 parts detailed below.
> > > > > > 
> > > > > > - DRM scheduler changes for 1 to 1 relationship between scheduler and
> > > > > > entity (patches 1-3)
> > > > > > 
> > > > > > In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> > > > > > GuC) which is a new paradigm WRT to the DRM scheduler and presents
> > > > > > severals problems as the DRM was originally designed to schedule jobs on
> > > > > > hardware queues. The main problem being that DRM scheduler expects the
> > > > > > submission order of jobs to be the completion order of jobs even across
> > > > > > multiple entities. This assumption falls apart with a firmware scheduler
> > > > > > as a firmware scheduler has no concept of jobs and jobs can complete out
> > > > > > of order. A novel solution for was originally thought of by Faith during
> > > > > > the initial prototype of Xe, create a 1 to 1 relationship between scheduler
> > > > > > and entity. I believe the AGX driver [3] is using this approach and
> > > > > > Boris may use approach as well for the Mali driver [4].
> > > > > > 
> > > > > > To support a 1 to 1 relationship we move the main execution function
> > > > > > from a kthread to a work queue and add a new scheduling mode which
> > > > > > bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
> > > > > > The new scheduling mode should unify all drivers usage with a 1 to 1
> > > > > > relationship and can be thought of as using scheduler as a dependency /
> > > > > > infligt job tracker rather than a true scheduler.
> > > > > > 
> > > > > > - Generic messaging interface for DRM scheduler
> > > > > > 
> > > > > > Idea is to be able to communicate to the submission backend with in band
> > > > > > (relative to main execution function) messages. Messages are backend
> > > > > > defined and flexable enough for any use case. In Xe we use these
> > > > > > messages to clean up entites, set properties for entites, and suspend /
> > > > > > resume execution of an entity [5]. I suspect other driver can leverage
> > > > > > this messaging concept too as it a convenient way to avoid races in the
> > > > > > backend.
> > > > > Oh, please absolutely *don't* do this.
> > > > > 
> > > > > This is basically the design which makes a bunch of stuff so horrible broken
> > > > > on Windows.
> > > > > 
> > > > > I can explain it in more detail if necessary, but I strongly recommend to
> > > > > not go down this path.
> > > > > 
> > > > I'm afraid we are going to have to discuss this further. Let me explain
> > > > my reasoning, basically the idea is to have a single main entry point to
> > > > backend - the work queue. This avoids the need for lock between run_job
> > > > and any message that changes an entites state, also it really helps
> > > > during the reset flows (either TDR or GT reset) as we can call
> > > > drm_sched_run_wq_stop can ensure that nothing else is in the backend
> > > > changing an entity state. It all works out really nicely actually, our
> > > > GuC backend is incredibly stable (hasn't really had a bug pop in about a
> > > > year) and way simpler than what we did in the i915. I think the simplity
> > > > to largely due to this design of limiting the entry points.
> > > > 
> > > > I personally don't see how this a poor design, limiting entry points
> > > > absolutely makes sense to me, if it didn't why not just call cleanup_job
> > > > bypassing the main execution thread (now worker), this is the exact same
> > > > concept.
> > > Well then I strongly suggest to read a few analyses on the failure of the
> > > message processing loop on Windows.
> > > 
> > > Have you ever wondered why classic Win32 applications sometimes seems to be
> > > stuck and don't do anything? This design pattern combine with timeouts to
> > > solve deadlocks is the reason for that.
> > > 
> > > The major problem with this approach is that analyzing tools like lockdep
> > > have a hard time grasping the dependencies.
> > wq is fully annotated and actually splats. Plain kthread doesn't, without
> > adding something like the dma_fence_signalling stuff.
> > 
> > But yeah if you block badly in the work items and stall the entire queue,
> > then things go sideways real bad. There's not really any tools we have in
> > the kernel to enforce this, since we still want to allow mutex and
> > sleeping and stuff like that.
> > 
> > > What you can do is to offload all your operations which are supposed to be
> > > run in the same thread as work items into a work queue. This is something
> > > lockdep understands and is able to scream out lout if someone messes up the
> > > deadlock dependencies.
> > I thought that's the plan here?
> 
> At least from my impression that didn't look like what was implemented
> here.

Yup, the current patches aren't what we want, I think, at least not in
these details.

> 
> >   Or at least what I thought the plan was,
> > and why I really think we need a per engine worqqueue to make it work well
> > (and also why I suggested the refactoring to split up drm_scheduler into
> > the driver api struct, which stays per-engine, and the internal backend
> > which would be per drm_sched_entity for fw schedulers that round-robin gpu
> > ctx on their own).
> > 
> > Also maybe we need to allow drivers to pass in the workqueue like we allow
> > for the tdr handling already, since that simplifies the locking.
> > 
> > At least for intel gpu I think this message passing design makes some
> > sense because fundamentally the fw only has a single blocking message
> > queue. And so intel/xe fundamentally needs to deal with the "stupid code
> > might block forward progress for everyone" problem you're describing, not
> > matter whether it's done with the help of drm/sched infra or not.
> > 
> > I do agree though that we shouldn't encourage drivers to use this which
> > don't have that kind of fw command queue design. So maybe a huge comment
> > to explain when (and _only_ when) it's ok to use that message passing
> > would be needed, and explaining why in other cases it would make things a
> > lot worse?
> 
> I would approach it from the completely opposite side. This component here
> is a tool to decide what job should run next.
> 
> How that is then signaled and run should not be part of the scheduler, but
> of another, higher-level component.
> 
> This way you also don't have a problem with not using DMA-fences, either as
> dependencies or as constraints for running more jobs.

I think we're talking about two things here and mixing them up.

For the dependencies I agree with you, and imo that higher level tool
should probably just be an on-demand submit thread in userspace for the
rare case where the kernel would need to sort out a dependency otherwise
(due to running out of ringspace in the per-ctx ringbuffer).

The other thing is the message passing stuff, and this is what I was
talking about above. This has nothing to do with handling dependencies,
but with talking to the gpu fw. Here the intel design issue is that the fw
only provides a single queue, and it's in-order. Which means it
fundamentally has the stalling issue you describe as a point against a
message passing design. And fundamentally we need to be able to talk to
the fw in the scheduler ->run_job callback.

The proposal here for the message passing part is that since it has the
stalling issue already anyway, and the scheduler needs to be involved
anyway, it makes sense to integrate this (as an optional thing, only for
drivers which have this kind of fw interface) into the scheduler.
Otherwise you just end up with two layers for no reason and more ping-pong
delay because the ->run_job needs to kick off the subordinate driver layer
first. Note that for this case the optional message passing support in the
drm/scheduler actually makes things better, because it allows you to cut
out one layer.

Of course if a driver with a better fw interface uses this message passing
support, then that's bad. Hence the big warning in the kerneldoc.
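
To illustrate (hand-wavy sketch only: fw_queue_send(), to_fw_job(), 
FW_MSG_SUBMIT and the drm_sched_msg fields below are placeholders, not 
the actual RFC code): with a single in-order fw queue, both run_job and 
an entity-state message funnel into the same potentially blocking send 
anyway, so routing the message through the scheduler's worker doesn't 
add a stall that isn't already there:

static struct dma_fence *fw_sched_run_job(struct drm_sched_job *job)
{
	/* may block if the single in-order fw queue is full */
	fw_queue_send(to_fw_job(job)->queue, FW_MSG_SUBMIT, job);

	return dma_fence_get(to_fw_job(job)->hw_fence);
}

static void fw_sched_process_msg(struct drm_sched_msg *msg)
{
	/* same single queue, same blocking behaviour, but now serialized
	 * against run_job by construction instead of with extra locking */
	fw_queue_send(msg->private_data, msg->opcode, msg->payload);
}
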
-Daniel

> 
> Christian.
> 
> > -Daniel
> > 
> > > Regards,
> > > Christian.
> > > 
> > > > FWIW Asahi liked the idea as well and think it could be useful for AGX.
> > > > Matt
> > > > 
> > > > > Regards,
> > > > > Christian.
> > > > > 
> > > > > > - Support for using TDR for all error paths of a scheduler / entity
> > > > > > 
> > > > > > Fix a few races / bugs, add function to dynamically set the TDR timeout.
> > > > > > 
> > > > > > - Annotate dma-fences for long running workloads.
> > > > > > 
> > > > > > The idea here is to use dma-fences only as sync points within the
> > > > > > scheduler and never export them for long running workloads. By
> > > > > > annotating these fences as long running we ensure that these dma-fences
> > > > > > are never used in a way that breaks the dma-fence rules. A benefit of
> > > > > > thus approach is the scheduler can still safely flow control the
> > > > > > execution ring buffer via the job limit without breaking the dma-fence
> > > > > > rules.
> > > > > > 
> > > > > > Again this a first draft and looking forward to feedback.
> > > > > > 
> > > > > > Enjoy - Matt
> > > > > > 
> > > > > > [1] https://gitlab.freedesktop.org/drm/xe/kernel
> > > > > > [2] https://patchwork.freedesktop.org/series/112188/
> > > > > > [3] https://patchwork.freedesktop.org/series/114772/
> > > > > > [4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1
> > > > > > [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031
> > > > > > 
> > > > > > Matthew Brost (8):
> > > > > >      drm/sched: Convert drm scheduler to use a work queue rather than
> > > > > >        kthread
> > > > > >      drm/sched: Move schedule policy to scheduler / entity
> > > > > >      drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
> > > > > >      drm/sched: Add generic scheduler message interface
> > > > > >      drm/sched: Start run wq before TDR in drm_sched_start
> > > > > >      drm/sched: Submit job before starting TDR
> > > > > >      drm/sched: Add helper to set TDR timeout
> > > > > >      drm/syncobj: Warn on long running dma-fences
> > > > > > 
> > > > > > Thomas Hellström (2):
> > > > > >      dma-buf/dma-fence: Introduce long-running completion fences
> > > > > >      drm/sched: Support long-running sched entities
> > > > > > 
> > > > > >     drivers/dma-buf/dma-fence.c                 | 142 +++++++---
> > > > > >     drivers/dma-buf/dma-resv.c                  |   5 +
> > > > > >     drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
> > > > > >     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
> > > > > >     drivers/gpu/drm/drm_syncobj.c               |   5 +-
> > > > > >     drivers/gpu/drm/etnaviv/etnaviv_sched.c     |   5 +-
> > > > > >     drivers/gpu/drm/lima/lima_sched.c           |   5 +-
> > > > > >     drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
> > > > > >     drivers/gpu/drm/msm/msm_ringbuffer.c        |   5 +-
> > > > > >     drivers/gpu/drm/panfrost/panfrost_job.c     |   5 +-
> > > > > >     drivers/gpu/drm/scheduler/sched_entity.c    | 127 +++++++--
> > > > > >     drivers/gpu/drm/scheduler/sched_fence.c     |   6 +-
> > > > > >     drivers/gpu/drm/scheduler/sched_main.c      | 278 +++++++++++++++-----
> > > > > >     drivers/gpu/drm/v3d/v3d_sched.c             |  25 +-
> > > > > >     include/drm/gpu_scheduler.h                 | 130 +++++++--
> > > > > >     include/linux/dma-fence.h                   |  60 ++++-
> > > > > >     16 files changed, 649 insertions(+), 184 deletions(-)
> > > > > > 
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans
  2023-04-05  9:07           ` Daniel Vetter
@ 2023-04-05  9:57             ` Christian König
  2023-04-05 10:12               ` Daniel Vetter
  0 siblings, 1 reply; 87+ messages in thread
From: Christian König @ 2023-04-05  9:57 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: robdclark, airlied, lina, dri-devel, boris.brezillon, intel-xe,
	faith.ekstrand

Am 05.04.23 um 11:07 schrieb Daniel Vetter:
> [SNIP]
>> I would approach it from the completely opposite side. This component here
>> is a tool to decide what job should run next.
>>
>> How that is then signaled and run should not be part of the scheduler, but
>> of another, higher-level component.
>>
>> This way you also don't have a problem with not using DMA-fences, either as
>> dependencies or as constraints for running more jobs.
> I think we're talking about two things here and mixing them up.
>
> For the dependencies I agree with you, and imo that higher level tool
> should probably just be an on-demand submit thread in userspace for the
> rare case where the kernel would need to sort out a dependency otherwise
> (due to running out of ringspace in the per-ctx ringbuffer).
>
> The other thing is the message passing stuff, and this is what I was
> talking about above. This has nothing to do with handling dependencies,
> but with talking to the gpu fw. Here the intel design issue is that the fw
> only provides a single queue, and it's in-order. Which means it
> fundamentally has the stalling issue you describe as a point against a
> message passing design. And fundamentally we need to be able to talk to
> the fw in the scheduler ->run_job callback.
>
> The proposal here for the message passing part is that since it has the
> stalling issue already anyway, and the scheduler needs to be involved
> anyway, it makes sense to integrate this (as an optional thing, only for
> drivers which have this kind of fw interface) into the scheduler.
> Otherwise you just end up with two layers for no reason and more ping-pong
> delay because the ->run_job needs to kick off the subordinate driver layer
> first. Note that for this case the optional message passing support in the
> drm/scheduler actually makes things better, because it allows you to cut
> out one layer.
>
> Of course if a driver with a better fw interface uses this message passing
> support, then that's bad. Hence the big warning in the kerneldoc.

Well what I wanted to say is that if you design the dependency handling 
/ scheduler properly you don't need the message passing through it.

For example if the GPU scheduler component uses a work item to do its 
handling instead of a kthread, you could also let the driver specify the 
work queue this work item is executed on.

When you design it like this, the driver specifies the thread context of 
execution for its jobs. In other words it can specify a single threaded 
firmware work queue as well.

When you then have other messages which need to be passed to the 
firmware, you can also use the same single threaded workqueue for this.

Drivers which have a different firmware interface would just use one of 
the system work queues instead.

This approach basically decouples the GPU scheduler component from the 
message passing functionality.
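
Something like this, assuming drm_sched_init() is extended to take a 
driver provided workqueue (the extra drm_sched_init() parameter and 
send_fw_msg_work() are made up, alloc_ordered_workqueue()/queue_work() 
are the real kernel APIs):

/* one single threaded, ordered queue per firmware interface */
struct workqueue_struct *submit_wq = alloc_ordered_workqueue("fw-submit", 0);

/* hypothetical: the scheduler runs its submission work item on it ... */
drm_sched_init(&sched, &sched_ops, submit_wq, ...);

/* ... and every other firmware message uses the very same queue, so it
 * is ordered against run_job by construction and lockdep sees the work
 * item dependencies */
INIT_WORK(&msg->work, send_fw_msg_work);
queue_work(submit_wq, &msg->work);

/* drivers without that single-queue firmware constraint would just
 * pass system_wq or their own unordered workqueue instead */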

Regards,
Christian.


> -Daniel
>


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans
  2023-04-05  9:57             ` Christian König
@ 2023-04-05 10:12               ` Daniel Vetter
  2023-04-06  2:08                 ` Matthew Brost
  0 siblings, 1 reply; 87+ messages in thread
From: Daniel Vetter @ 2023-04-05 10:12 UTC (permalink / raw)
  To: Christian König
  Cc: robdclark, airlied, lina, dri-devel, boris.brezillon, intel-xe,
	faith.ekstrand

On Wed, 5 Apr 2023 at 11:57, Christian König <christian.koenig@amd.com> wrote:
>
> Am 05.04.23 um 11:07 schrieb Daniel Vetter:
> > [SNIP]
> >> I would approach it from the completely opposite side. This component here
> >> is a tool to decide what job should run next.
> >>
> >> How that is then signaled and run should not be part of the scheduler, but
> >> of another, higher-level component.
> >>
> >> This way you also don't have a problem with not using DMA-fences, either as
> >> dependencies or as constraints for running more jobs.
> > I think we're talking about two things here and mixing them up.
> >
> > For the dependencies I agree with you, and imo that higher level tool
> > should probably just be an on-demand submit thread in userspace for the
> > rare case where the kernel would need to sort out a dependency otherwise
> > (due to running out of ringspace in the per-ctx ringbuffer).
> >
> > The other thing is the message passing stuff, and this is what I was
> > talking about above. This has nothing to do with handling dependencies,
> > but with talking to the gpu fw. Here the intel design issue is that the fw
> > only provides a single queue, and it's in-order. Which means it
> > fundamentally has the stalling issue you describe as a point against a
> > message passing design. And fundamentally we need to be able to talk to
> > the fw in the scheduler ->run_job callback.
> >
> > The proposal here for the message passing part is that since it has the
> > stalling issue already anyway, and the scheduler needs to be involved
> > anyway, it makes sense to integrate this (as an optional thing, only for
> > drivers which have this kind of fw interface) into the scheduler.
> > Otherwise you just end up with two layers for no reason and more ping-pong
> > delay because the ->run_job needs to kick off the subordinate driver layer
> > first. Note that for this case the optional message passing support in the
> > drm/scheduler actually makes things better, because it allows you to cut
> > out one layer.
> >
> > Of course if a driver with a better fw interface uses this message passing
> > support, then that's bad. Hence the big warning in the kerneldoc.
>
> Well what I wanted to say is that if you design the dependency handling
> / scheduler properly you don't need the message passing through it.
>
> For example if the GPU scheduler component uses a work item to do its
> handling instead of a kthread, you could also let the driver specify the
> work queue this work item is executed on.
>
> When you design it like this, the driver specifies the thread context of
> execution for its jobs. In other words it can specify a single threaded
> firmware work queue as well.
>
> When you then have other messages which need to be passed to the
> firmware, you can also use the same single threaded workqueue for this.
>
> Drivers which have a different firmware interface would just use one of
> the system work queues instead.
>
> This approach basically decouples the GPU scheduler component from the
> message passing functionality.

Hm, I guess we've been talking past each other big time, because
that's really what I thought was under discussion? Essentially the
current RFC, but implemented with some polish.

iow I agree with you (I think, at least).
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences
  2023-04-04 19:25             ` Daniel Vetter
  2023-04-04 19:48               ` Matthew Brost
@ 2023-04-05 12:35               ` Thomas Hellström
  2023-04-05 12:39                 ` Christian König
  1 sibling, 1 reply; 87+ messages in thread
From: Thomas Hellström @ 2023-04-05 12:35 UTC (permalink / raw)
  To: Daniel Vetter, Matthew Brost, Christian König
  Cc: robdclark, airlied, lina, Thomas Hellström (Intel),
	dri-devel, Christian König, boris.brezillon, intel-xe,
	faith.ekstrand

Hi,

On 4/4/23 21:25, Daniel Vetter wrote:
> On Tue, Apr 04, 2023 at 07:02:23PM +0000, Matthew Brost wrote:
>> On Tue, Apr 04, 2023 at 08:14:01PM +0200, Thomas Hellström (Intel) wrote:
>>> On 4/4/23 15:10, Christian König wrote:
>>>> Am 04.04.23 um 14:54 schrieb Thomas Hellström:
>>>>> Hi, Christian,
>>>>>
>>>>> On 4/4/23 11:09, Christian König wrote:
>>>>>> Am 04.04.23 um 02:22 schrieb Matthew Brost:
>>>>>>> From: Thomas Hellström <thomas.hellstrom@linux.intel.com>
>>>>>>>
>>>>>>> For long-running workloads, drivers either need to open-code
>>>>>>> completion
>>>>>>> waits, invent their own synchronization primitives or internally use
>>>>>>> dma-fences that do not obey the cross-driver dma-fence protocol, but
>>>>>>> without any lockdep annotation all these approaches are error prone.
>>>>>>>
>>>>>>> So since for example the drm scheduler uses dma-fences it is
>>>>>>> desirable for
>>>>>>> a driver to be able to use it for throttling and error
>>>>>>> handling also with
>>>>>>> internal dma-fences tha do not obey the cros-driver
>>>>>>> dma-fence protocol.
>>>>>>>
>>>>>>> Introduce long-running completion fences in form of
>>>>>>> dma-fences, and add
>>>>>>> lockdep annotation for them. In particular:
>>>>>>>
>>>>>>> * Do not allow waiting under any memory management locks.
>>>>>>> * Do not allow to attach them to a dma-resv object.
>>>>>>> * Introduce a new interface for adding callbacks making the
>>>>>>> helper adding
>>>>>>>     a callback sign off on that it is aware that the dma-fence may not
>>>>>>>     complete anytime soon. Typically this will be the
>>>>>>> scheduler chaining
>>>>>>>     a new long-running fence on another one.
>>>>>> Well that's pretty much what I tried before:
>>>>>> https://lwn.net/Articles/893704/
>>>>>>
>> I don't think this quite the same, this explictly enforces that we don't
>> break the dma-fence rules (in path of memory allocations, exported in
>> any way), essentially this just SW sync point reusing dma-fence the
>> infrastructure for signaling / callbacks. I believe your series tried to
>> export these fences to user space (admittedly I haven't fully read your
>> series).
>>
>> In this use case we essentially just want to flow control the ring via
>> the dma-scheduler + maintain a list of pending jobs so the TDR can be
>> used for cleanup if LR entity encounters an error. To me this seems
>> perfectly reasonable but I know dma-femce rules are akin to a holy war.
>>
>> If we return NULL in run_job, now we have to be able to sink all jobs
>> in the backend regardless on ring space, maintain a list of jobs pending
>> for cleanup after errors, and write a different cleanup path as now the
>> TDR doesn't work. Seems very, very silly to duplicate all of this code
>> when the DRM scheduler provides all of this for us. Also if we go this
>> route, now all drivers are going to invent ways to handle LR jobs /w the
>> DRM scheduler.
>>
>> This solution is pretty clear, mark the scheduler as LR, and don't
>> export any fences from the scheduler. If you try to export these fences
>> a blow up happens.
> The problem is if you mix things up. Like for resets you need all the
> schedulers on an engine/set-of-engines to quiescent or things get
> potentially hilarious. If you now have a scheduler in forever limbo, the
> dma_fence guarantees are right out the window.
>
> But the issue you're having is fairly specific if it's just about
> ringspace. I think the dumbest fix is to just block in submit if you run
> out of per-ctx ringspace, and call it a day. This notion that somehow the
> kernel is supposed to provide a bottomless queue of anything userspace
> submits simply doesn't hold up in reality (as much as userspace standards
> committees would like it to), and as long as it doesn't have a real-world
> perf impact it doesn't really matter why we end up blocking in the submit
> ioctl. It might also be a simple memory allocation that hits a snag in
> page reclaim.

So it seems the discussion around the long-running synchronization 
diverged a bit between threads and this thread was hijacked for 
preempt-fences and userptr.

Do I understand it correctly that the recommendation from both Daniel 
and Christian is to *not* use the drm scheduler for long-running compute 
jobs, but track any internal dma-fence dependencies (pipelined clearing 
or whatever) in a separate mechanism and handle unresolved dependencies 
on other long-running jobs using -EWOULDBLOCK?

Thanks,
Thomas





>>>>>> And the reasons why it was rejected haven't changed.
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>> Yes, TBH this was mostly to get discussion going how we'd best
>>>>> tackle this problem while being able to reuse the scheduler for
>>>>> long-running workloads.
>>>>>
>>>>> I couldn't see any clear decision on your series, though, but one
>>>>> main difference I see is that this is intended for driver-internal
>>>>> use only. (I'm counting using the drm_scheduler as a helper for
>>>>> driver-private use). This is by no means a way to try tackle the
>>>>> indefinite fence problem.
>>>> Well this was just my latest try to tackle this, but essentially the
>>>> problems are the same as with your approach: When we express such
>>>> operations as dma_fence there is always the change that we leak that
>>>> somewhere.
>>>>
>>>> My approach of adding a flag noting that this operation is dangerous and
>>>> can't be synced with something memory management depends on tried to
>>>> contain this as much as possible, but Daniel still pretty clearly
>>>> rejected it (for good reasons I think).
>>>>
>>>>> We could ofc invent a completely different data-type that abstracts
>>>>> the synchronization the scheduler needs in the long-running case, or
>>>>> each driver could hack something up, like sleeping in the
>>>>> prepare_job() or run_job() callback for throttling, but those waits
>>>>> should still be annotated in one way or annotated one way or another
>>>>> (and probably in a similar way across drivers) to make sure we don't
>>>>> do anything bad.
>>>>>
>>>>>   So any suggestions as to what would be the better solution here
>>>>> would be appreciated.
>>>> Mhm, do we really the the GPU scheduler for that?
>>>>
>> I think we need to solve this within the DRM scheduler one way or
>> another.
> Yeah so if we conclude that the queue really must be bottomless then I
> agree drm-sched should help out sort out the mess. Because I'm guessing
> that every driver will have this issue. But that's a big if.
>
> I guess if we teach the drm scheduler that some jobs are fairly endless
> then maybe it wouldn't be too far-fetched to also teach it to wait for a
> previous one to finish (but not with the dma_fence that preempts, which we
> put into the dma_resv for memory management, but some other struct
> completion). The scheduler already has a concept of not stuffing too much
> stuff into the same queue after all, so this should fit?
> -Daniel
>
>
>>>> I mean in the 1 to 1 case  you basically just need a component which
>>>> collects the dependencies as dma_fence and if all of them are fulfilled
>>>> schedules a work item.
>>>>
>>>> As long as the work item itself doesn't produce a dma_fence it can then
>>>> still just wait for other none dma_fence dependencies.
>>>>
>>>> Then the work function could submit the work and wait for the result.
>>>>
>>>> The work item would then pretty much represent what you want, you can
>>>> wait for it to finish and pass it along as long running dependency.
>>>>
>>>> Maybe give it a funky name and wrap it up in a structure, but that's
>>>> basically it.
>>>>
>>> This very much sounds like a i915_sw_fence for the dependency tracking and
>>> dma_fence_work for the actual work although it's completion fence is a
>>> dma_fence.
>>>
>> Agree this does sound to i915ish as stated below one of mandates in Xe
>> was to use the DRM scheduler. Beyond that as someone who a submission
>> backend in the i915 and Xe, I love how the DRM scheduler works (single
>> entry point), it makes everything so much easier.
>>
>> Matt
>>
>>> Although that goes against the whole idea of a condition for merging the xe
>>> driver would be that we implement some sort of minimal scaffolding for
>>> long-running workloads in the drm scheduler, and the thinking behind that is
>>> to avoid implementing intel-specific solutions like those...
>>>
>>> Thanks,
>>>
>>> Thomas
>>>
>>>
>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> Thanks,
>>>>>
>>>>> Thomas
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences
  2023-04-05 12:35               ` Thomas Hellström
@ 2023-04-05 12:39                 ` Christian König
  2023-04-05 12:45                   ` Daniel Vetter
  0 siblings, 1 reply; 87+ messages in thread
From: Christian König @ 2023-04-05 12:39 UTC (permalink / raw)
  To: Thomas Hellström, Daniel Vetter, Matthew Brost
  Cc: robdclark, airlied, lina, Thomas Hellström (Intel),
	dri-devel, boris.brezillon, intel-xe, faith.ekstrand

Am 05.04.23 um 14:35 schrieb Thomas Hellström:
> Hi,
>
> On 4/4/23 21:25, Daniel Vetter wrote:
>> On Tue, Apr 04, 2023 at 07:02:23PM +0000, Matthew Brost wrote:
>>> On Tue, Apr 04, 2023 at 08:14:01PM +0200, Thomas Hellström (Intel) 
>>> wrote:
>>>> On 4/4/23 15:10, Christian König wrote:
>>>>> Am 04.04.23 um 14:54 schrieb Thomas Hellström:
>>>>>> Hi, Christian,
>>>>>>
>>>>>> On 4/4/23 11:09, Christian König wrote:
>>>>>>> Am 04.04.23 um 02:22 schrieb Matthew Brost:
>>>>>>>> From: Thomas Hellström <thomas.hellstrom@linux.intel.com>
>>>>>>>>
>>>>>>>> For long-running workloads, drivers either need to open-code
>>>>>>>> completion
>>>>>>>> waits, invent their own synchronization primitives or 
>>>>>>>> internally use
>>>>>>>> dma-fences that do not obey the cross-driver dma-fence 
>>>>>>>> protocol, but
>>>>>>>> without any lockdep annotation all these approaches are error 
>>>>>>>> prone.
>>>>>>>>
>>>>>>>> So since for example the drm scheduler uses dma-fences it is
>>>>>>>> desirable for
>>>>>>>> a driver to be able to use it for throttling and error
>>>>>>>> handling also with
>>>>>>>> internal dma-fences tha do not obey the cros-driver
>>>>>>>> dma-fence protocol.
>>>>>>>>
>>>>>>>> Introduce long-running completion fences in form of
>>>>>>>> dma-fences, and add
>>>>>>>> lockdep annotation for them. In particular:
>>>>>>>>
>>>>>>>> * Do not allow waiting under any memory management locks.
>>>>>>>> * Do not allow to attach them to a dma-resv object.
>>>>>>>> * Introduce a new interface for adding callbacks making the
>>>>>>>> helper adding
>>>>>>>>     a callback sign off on that it is aware that the dma-fence 
>>>>>>>> may not
>>>>>>>>     complete anytime soon. Typically this will be the
>>>>>>>> scheduler chaining
>>>>>>>>     a new long-running fence on another one.
>>>>>>> Well that's pretty much what I tried before:
>>>>>>> https://lwn.net/Articles/893704/
>>>>>>>
>>> I don't think this quite the same, this explictly enforces that we 
>>> don't
>>> break the dma-fence rules (in path of memory allocations, exported in
>>> any way), essentially this just SW sync point reusing dma-fence the
>>> infrastructure for signaling / callbacks. I believe your series 
>>> tried to
>>> export these fences to user space (admittedly I haven't fully read your
>>> series).
>>>
>>> In this use case we essentially just want to flow control the ring via
>>> the dma-scheduler + maintain a list of pending jobs so the TDR can be
>>> used for cleanup if LR entity encounters an error. To me this seems
>>> perfectly reasonable but I know dma-femce rules are akin to a holy war.
>>>
>>> If we return NULL in run_job, now we have to be able to sink all jobs
>>> in the backend regardless on ring space, maintain a list of jobs 
>>> pending
>>> for cleanup after errors, and write a different cleanup path as now the
>>> TDR doesn't work. Seems very, very silly to duplicate all of this code
>>> when the DRM scheduler provides all of this for us. Also if we go this
>>> route, now all drivers are going to invent ways to handle LR jobs /w 
>>> the
>>> DRM scheduler.
>>>
>>> This solution is pretty clear, mark the scheduler as LR, and don't
>>> export any fences from the scheduler. If you try to export these fences
>>> a blow up happens.
>> The problem is if you mix things up. Like for resets you need all the
>> schedulers on an engine/set-of-engines to quiescent or things get
>> potentially hilarious. If you now have a scheduler in forever limbo, the
>> dma_fence guarantees are right out the window.
>>
>> But the issue you're having is fairly specific if it's just about
>> ringspace. I think the dumbest fix is to just block in submit if you run
>> out of per-ctx ringspace, and call it a day. This notion that somehow 
>> the
>> kernel is supposed to provide a bottomless queue of anything userspace
>> submits simply doesn't hold up in reality (as much as userspace 
>> standards
>> committees would like it to), and as long as it doesn't have a 
>> real-world
>> perf impact it doesn't really matter why we end up blocking in the 
>> submit
>> ioctl. It might also be a simple memory allocation that hits a snag in
>> page reclaim.
>
> So it seems the discussion around the long-running synchronization 
> diverged a bit between threads and this thread was hijacked for 
> preempt-fences and userptr.
>
> Do I understand it correctly that the recommendation from both Daniel 
> and Christian is to *not* use the drm scheduler for long-running 
> compute jobs, but track any internal dma-fence dependencies (pipelined 
> clearing or whatever) in a separate mechanism and handle unresolved 
> dependencies on other long-running jobs using -EWOULDBLOCK?

Yeah, I think that's a good summary.

If needed we could extract some scheduler functionality into separate 
components, but the fundamental problem is that the GPU scheduler 
provides a dma_fence interface to the outside to signal job completion, 
and Daniel and I seem to agree that you really don't want that.
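
Just to spell out what that interface is: the scheduler's "job done" 
signal is drm_sched_fence::finished, a plain dma_fence, and consumers 
treat it as such. Sketch (locking and fence slot reservation elided, 
gem_obj/job are just illustrative variables):

/* what the scheduler hands out for job completion */
struct dma_fence *done = &job->s_fence->finished;

/* perfectly fine for end-of-batch work, exactly what must never
 * happen for a long running job */
dma_resv_add_fence(gem_obj->resv, done, DMA_RESV_USAGE_WRITE);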

Regards,
Christian.

>
> Thanks,
> Thomas
>
>
>
>
>
>>>>>>> And the reasons why it was rejected haven't changed.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Christian.
>>>>>>>
>>>>>> Yes, TBH this was mostly to get discussion going how we'd best
>>>>>> tackle this problem while being able to reuse the scheduler for
>>>>>> long-running workloads.
>>>>>>
>>>>>> I couldn't see any clear decision on your series, though, but one
>>>>>> main difference I see is that this is intended for driver-internal
>>>>>> use only. (I'm counting using the drm_scheduler as a helper for
>>>>>> driver-private use). This is by no means a way to try tackle the
>>>>>> indefinite fence problem.
>>>>> Well this was just my latest try to tackle this, but essentially the
>>>>> problems are the same as with your approach: When we express such
>>>>> operations as dma_fence there is always the change that we leak that
>>>>> somewhere.
>>>>>
>>>>> My approach of adding a flag noting that this operation is 
>>>>> dangerous and
>>>>> can't be synced with something memory management depends on tried to
>>>>> contain this as much as possible, but Daniel still pretty clearly
>>>>> rejected it (for good reasons I think).
>>>>>
>>>>>> We could ofc invent a completely different data-type that abstracts
>>>>>> the synchronization the scheduler needs in the long-running case, or
>>>>>> each driver could hack something up, like sleeping in the
>>>>>> prepare_job() or run_job() callback for throttling, but those waits
>>>>>> should still be annotated in one way or annotated one way or another
>>>>>> (and probably in a similar way across drivers) to make sure we don't
>>>>>> do anything bad.
>>>>>>
>>>>>>   So any suggestions as to what would be the better solution here
>>>>>> would be appreciated.
>>>>> Mhm, do we really the the GPU scheduler for that?
>>>>>
>>> I think we need to solve this within the DRM scheduler one way or
>>> another.
>> Yeah so if we conclude that the queue really must be bottomless then I
>> agree drm-sched should help out sort out the mess. Because I'm guessing
>> that every driver will have this issue. But that's a big if.
>>
>> I guess if we teach the drm scheduler that some jobs are fairly endless
>> then maybe it wouldn't be too far-fetched to also teach it to wait for a
>> previous one to finish (but not with the dma_fence that preempts, 
>> which we
>> put into the dma_resv for memory management, but some other struct
>> completion). The scheduler already has a concept of not stuffing too 
>> much
>> stuff into the same queue after all, so this should fit?
>> -Daniel
>>
>>
>>>>> I mean in the 1 to 1 case  you basically just need a component which
>>>>> collects the dependencies as dma_fence and if all of them are 
>>>>> fulfilled
>>>>> schedules a work item.
>>>>>
>>>>> As long as the work item itself doesn't produce a dma_fence it can 
>>>>> then
>>>>> still just wait for other none dma_fence dependencies.
>>>>>
>>>>> Then the work function could submit the work and wait for the result.
>>>>>
>>>>> The work item would then pretty much represent what you want, you can
>>>>> wait for it to finish and pass it along as long running dependency.
>>>>>
>>>>> Maybe give it a funky name and wrap it up in a structure, but that's
>>>>> basically it.
>>>>>
>>>> This very much sounds like a i915_sw_fence for the dependency 
>>>> tracking and
>>>> dma_fence_work for the actual work although it's completion fence is a
>>>> dma_fence.
>>>>
>>> Agree this does sound to i915ish as stated below one of mandates in Xe
>>> was to use the DRM scheduler. Beyond that as someone who a submission
>>> backend in the i915 and Xe, I love how the DRM scheduler works (single
>>> entry point), it makes everything so much easier.
>>>
>>> Matt
>>>
>>>> Although that goes against the whole idea of a condition for 
>>>> merging the xe
>>>> driver would be that we implement some sort of minimal scaffolding for
>>>> long-running workloads in the drm scheduler, and the thinking 
>>>> behind that is
>>>> to avoid implementing intel-specific solutions like those...
>>>>
>>>> Thanks,
>>>>
>>>> Thomas
>>>>
>>>>
>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Thomas
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences
  2023-04-05 12:39                 ` Christian König
@ 2023-04-05 12:45                   ` Daniel Vetter
  2023-04-05 14:08                     ` Christian König
  0 siblings, 1 reply; 87+ messages in thread
From: Daniel Vetter @ 2023-04-05 12:45 UTC (permalink / raw)
  To: Christian König
  Cc: airlied, lina, Thomas Hellström (Intel),
	dri-devel, boris.brezillon, Daniel Vetter, robdclark, intel-xe,
	faith.ekstrand

On Wed, Apr 05, 2023 at 02:39:35PM +0200, Christian König wrote:
> Am 05.04.23 um 14:35 schrieb Thomas Hellström:
> > Hi,
> > 
> > On 4/4/23 21:25, Daniel Vetter wrote:
> > > On Tue, Apr 04, 2023 at 07:02:23PM +0000, Matthew Brost wrote:
> > > > On Tue, Apr 04, 2023 at 08:14:01PM +0200, Thomas Hellström
> > > > (Intel) wrote:
> > > > > On 4/4/23 15:10, Christian König wrote:
> > > > > > Am 04.04.23 um 14:54 schrieb Thomas Hellström:
> > > > > > > Hi, Christian,
> > > > > > > 
> > > > > > > On 4/4/23 11:09, Christian König wrote:
> > > > > > > > Am 04.04.23 um 02:22 schrieb Matthew Brost:
> > > > > > > > > From: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > > > > > > > 
> > > > > > > > > For long-running workloads, drivers either need to open-code
> > > > > > > > > completion
> > > > > > > > > waits, invent their own synchronization
> > > > > > > > > primitives or internally use
> > > > > > > > > dma-fences that do not obey the cross-driver
> > > > > > > > > dma-fence protocol, but
> > > > > > > > > without any lockdep annotation all these
> > > > > > > > > approaches are error prone.
> > > > > > > > > 
> > > > > > > > > So since for example the drm scheduler uses dma-fences it is
> > > > > > > > > desirable for
> > > > > > > > > a driver to be able to use it for throttling and error
> > > > > > > > > handling also with
> > > > > > > > > internal dma-fences tha do not obey the cros-driver
> > > > > > > > > dma-fence protocol.
> > > > > > > > > 
> > > > > > > > > Introduce long-running completion fences in form of
> > > > > > > > > dma-fences, and add
> > > > > > > > > lockdep annotation for them. In particular:
> > > > > > > > > 
> > > > > > > > > * Do not allow waiting under any memory management locks.
> > > > > > > > > * Do not allow to attach them to a dma-resv object.
> > > > > > > > > * Introduce a new interface for adding callbacks making the
> > > > > > > > > helper adding
> > > > > > > > >     a callback sign off on that it is aware
> > > > > > > > > that the dma-fence may not
> > > > > > > > >     complete anytime soon. Typically this will be the
> > > > > > > > > scheduler chaining
> > > > > > > > >     a new long-running fence on another one.
> > > > > > > > Well that's pretty much what I tried before:
> > > > > > > > https://lwn.net/Articles/893704/
> > > > > > > > 
> > > > I don't think this quite the same, this explictly enforces that
> > > > we don't
> > > > break the dma-fence rules (in path of memory allocations, exported in
> > > > any way), essentially this just SW sync point reusing dma-fence the
> > > > infrastructure for signaling / callbacks. I believe your series
> > > > tried to
> > > > export these fences to user space (admittedly I haven't fully read your
> > > > series).
> > > > 
> > > > In this use case we essentially just want to flow control the ring via
> > > > the dma-scheduler + maintain a list of pending jobs so the TDR can be
> > > > used for cleanup if LR entity encounters an error. To me this seems
> > > > perfectly reasonable but I know dma-femce rules are akin to a holy war.
> > > > 
> > > > If we return NULL in run_job, now we have to be able to sink all jobs
> > > > in the backend regardless on ring space, maintain a list of jobs
> > > > pending
> > > > for cleanup after errors, and write a different cleanup path as now the
> > > > TDR doesn't work. Seems very, very silly to duplicate all of this code
> > > > when the DRM scheduler provides all of this for us. Also if we go this
> > > > route, now all drivers are going to invent ways to handle LR
> > > > jobs /w the
> > > > DRM scheduler.
> > > > 
> > > > This solution is pretty clear, mark the scheduler as LR, and don't
> > > > export any fences from the scheduler. If you try to export these fences
> > > > a blow up happens.
> > > The problem is if you mix things up. Like for resets you need all the
> > > schedulers on an engine/set-of-engines to quiescent or things get
> > > potentially hilarious. If you now have a scheduler in forever limbo, the
> > > dma_fence guarantees are right out the window.
> > > 
> > > But the issue you're having is fairly specific if it's just about
> > > ringspace. I think the dumbest fix is to just block in submit if you run
> > > out of per-ctx ringspace, and call it a day. This notion that
> > > somehow the
> > > kernel is supposed to provide a bottomless queue of anything userspace
> > > submits simply doesn't hold up in reality (as much as userspace
> > > standards
> > > committees would like it to), and as long as it doesn't have a
> > > real-world
> > > perf impact it doesn't really matter why we end up blocking in the
> > > submit
> > > ioctl. It might also be a simple memory allocation that hits a snag in
> > > page reclaim.
> > 
> > So it seems the discussion around the long-running synchronization
> > diverged a bit between threads and this thread was hijacked for
> > preempt-fences and userptr.
> > 
> > Do I understand it correctly that the recommendation from both Daniel
> > and Christian is to *not* use the drm scheduler for long-running compute
> > jobs, but track any internal dma-fence dependencies (pipelined clearing
> > or whatever) in a separate mechanism and handle unresolved dependencies
> > on other long-running jobs using -EWOULDBLOCK?
> 
> Yeah, I think that's a good summary.
> 
> If needed we could extract some scheduler functionality into separate
> components, but the fundamental problem is that the GPU scheduler
> provides a dma_fence interface to the outside to signal job completion,
> and Daniel and I seem to agree that you really don't want that.

I think I'm on something slightly different:

- For anything which semantically is not a dma_fence I agree it probably
  should be handled with EWOULDBLOCK and passed to userspace. Either with
  a submit thread or userspace memory fences. Note that in practice you
  will have a bunch of blocking left in the ioctl, stuff like mutexes or
  memory allocations when things get really tight and you end up in
  synchronous reclaim. Not any different from userspace ending up in
  synchronous reclaim due to a page fault really. Trying to shoehorn
  userspace memory fences or anything else long-running into drm/sched
  dependency handling is just way too much of a can of worms.

- For the memory management dependencies, which are all dma_fence when
  pipelined, I do think pushing them through drm/sched makes sense. It
  has all the stuff to handle that already, plus it's imo also the ideal
  place to handle the preempt-ctx dma_fence scaffolding/semantics. Which
  would give you a really neatly unified command submission interface,
  since in both cases (end-of-batch and long-running) you fish the
  dma_fence you need to stuff into all the right dma_resv objects (for
  memory management purposes) out of the same place: the drm_sched_job
  struct.

So I'm _not_ on the "do not use drm/sched for long-running jobs at all".
That doesn't make much sense to me because you'd just be reinventing the
exact same dma_fence dependency handling and memory management shuffling
we already have. That seems silly.
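
Rough sketch of that split (error handling elided, 
is_long_running_fence() is a stand-in for however the driver 
classifies these):

static int submit_add_in_fence(struct drm_sched_job *job,
			       struct dma_fence *fence)
{
	/* semantically not a real dma_fence: don't shoehorn it into
	 * drm/sched, let userspace (or a submit thread) sort it out */
	if (is_long_running_fence(fence))
		return -EWOULDBLOCK;

	/* memory management / end-of-batch fences go through the
	 * scheduler's existing dependency handling */
	return drm_sched_job_add_dependency(job, fence);
}
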
-Daniel

> 
> Regards,
> Christian.
> 
> > 
> > Thanks,
> > Thomas
> > 
> > 
> > 
> > 
> > 
> > > > > > > > And the reasons why it was rejected haven't changed.
> > > > > > > > 
> > > > > > > > Regards,
> > > > > > > > Christian.
> > > > > > > > 
> > > > > > > Yes, TBH this was mostly to get discussion going how we'd best
> > > > > > > tackle this problem while being able to reuse the scheduler for
> > > > > > > long-running workloads.
> > > > > > > 
> > > > > > > I couldn't see any clear decision on your series, though, but one
> > > > > > > main difference I see is that this is intended for driver-internal
> > > > > > > use only. (I'm counting using the drm_scheduler as a helper for
> > > > > > > driver-private use). This is by no means a way to try tackle the
> > > > > > > indefinite fence problem.
> > > > > > Well this was just my latest try to tackle this, but essentially the
> > > > > > problems are the same as with your approach: When we express such
> > > > > > operations as dma_fence there is always the change that we leak that
> > > > > > somewhere.
> > > > > > 
> > > > > > My approach of adding a flag noting that this operation
> > > > > > is dangerous and
> > > > > > can't be synced with something memory management depends on tried to
> > > > > > contain this as much as possible, but Daniel still pretty clearly
> > > > > > rejected it (for good reasons I think).
> > > > > > 
> > > > > > > We could ofc invent a completely different data-type that abstracts
> > > > > > > the synchronization the scheduler needs in the long-running case, or
> > > > > > > each driver could hack something up, like sleeping in the
> > > > > > > prepare_job() or run_job() callback for throttling, but those waits
> > > > > > > should still be annotated in one way or annotated one way or another
> > > > > > > (and probably in a similar way across drivers) to make sure we don't
> > > > > > > do anything bad.
> > > > > > > 
> > > > > > >   So any suggestions as to what would be the better solution here
> > > > > > > would be appreciated.
> > > > > > Mhm, do we really the the GPU scheduler for that?
> > > > > > 
> > > > I think we need to solve this within the DRM scheduler one way or
> > > > another.
> > > Yeah so if we conclude that the queue really must be bottomless then I
> > > agree drm-sched should help out sort out the mess. Because I'm guessing
> > > that every driver will have this issue. But that's a big if.
> > > 
> > > I guess if we teach the drm scheduler that some jobs are fairly endless
> > > then maybe it wouldn't be too far-fetched to also teach it to wait for a
> > > previous one to finish (but not with the dma_fence that preempts,
> > > which we
> > > put into the dma_resv for memory management, but some other struct
> > > completion). The scheduler already has a concept of not stuffing too
> > > much
> > > stuff into the same queue after all, so this should fit?
> > > -Daniel
> > > 
> > > 
> > > > > > I mean in the 1 to 1 case  you basically just need a component which
> > > > > > collects the dependencies as dma_fence and if all of
> > > > > > them are fulfilled
> > > > > > schedules a work item.
> > > > > > 
> > > > > > As long as the work item itself doesn't produce a
> > > > > > dma_fence it can then
> > > > > > still just wait for other none dma_fence dependencies.
> > > > > > 
> > > > > > Then the work function could submit the work and wait for the result.
> > > > > > 
> > > > > > The work item would then pretty much represent what you want, you can
> > > > > > wait for it to finish and pass it along as long running dependency.
> > > > > > 
> > > > > > Maybe give it a funky name and wrap it up in a structure, but that's
> > > > > > basically it.
> > > > > > 
> > > > > This very much sounds like a i915_sw_fence for the
> > > > > dependency tracking and
> > > > > dma_fence_work for the actual work although it's completion fence is a
> > > > > dma_fence.
> > > > > 
> > > > Agree this does sound to i915ish as stated below one of mandates in Xe
> > > > was to use the DRM scheduler. Beyond that as someone who a submission
> > > > backend in the i915 and Xe, I love how the DRM scheduler works (single
> > > > entry point), it makes everything so much easier.
> > > > 
> > > > Matt
> > > > 
> > > > > Although that goes against the whole idea of a condition for
> > > > > merging the xe
> > > > > driver would be that we implement some sort of minimal scaffolding for
> > > > > long-running workloads in the drm scheduler, and the
> > > > > thinking behind that is
> > > > > to avoid implementing intel-specific solutions like those...
> > > > > 
> > > > > Thanks,
> > > > > 
> > > > > Thomas
> > > > > 
> > > > > 
> > > > > 
> > > > > > Regards,
> > > > > > Christian.
> > > > > > 
> > > > > > > Thanks,
> > > > > > > 
> > > > > > > Thomas
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > 
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences
  2023-04-04 19:48               ` Matthew Brost
@ 2023-04-05 13:09                 ` Daniel Vetter
  2023-04-05 23:58                   ` Matthew Brost
  0 siblings, 1 reply; 87+ messages in thread
From: Daniel Vetter @ 2023-04-05 13:09 UTC (permalink / raw)
  To: Matthew Brost
  Cc: robdclark, airlied, lina, Thomas Hellström (Intel),
	dri-devel, Christian König, boris.brezillon, Daniel Vetter,
	intel-xe, faith.ekstrand

On Tue, Apr 04, 2023 at 07:48:27PM +0000, Matthew Brost wrote:
> On Tue, Apr 04, 2023 at 09:25:52PM +0200, Daniel Vetter wrote:
> > On Tue, Apr 04, 2023 at 07:02:23PM +0000, Matthew Brost wrote:
> > > On Tue, Apr 04, 2023 at 08:14:01PM +0200, Thomas Hellström (Intel) wrote:
> > > > 
> > > > On 4/4/23 15:10, Christian König wrote:
> > > > > Am 04.04.23 um 14:54 schrieb Thomas Hellström:
> > > > > > Hi, Christian,
> > > > > > 
> > > > > > On 4/4/23 11:09, Christian König wrote:
> > > > > > > Am 04.04.23 um 02:22 schrieb Matthew Brost:
> > > > > > > > From: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > > > > > > 
> > > > > > > > For long-running workloads, drivers either need to open-code
> > > > > > > > completion
> > > > > > > > waits, invent their own synchronization primitives or internally use
> > > > > > > > dma-fences that do not obey the cross-driver dma-fence protocol, but
> > > > > > > > without any lockdep annotation all these approaches are error prone.
> > > > > > > > 
> > > > > > > > So since for example the drm scheduler uses dma-fences it is
> > > > > > > > desirable for
> > > > > > > > a driver to be able to use it for throttling and error
> > > > > > > > handling also with
> > > > > > > > internal dma-fences tha do not obey the cros-driver
> > > > > > > > dma-fence protocol.
> > > > > > > > 
> > > > > > > > Introduce long-running completion fences in form of
> > > > > > > > dma-fences, and add
> > > > > > > > lockdep annotation for them. In particular:
> > > > > > > > 
> > > > > > > > * Do not allow waiting under any memory management locks.
> > > > > > > > * Do not allow to attach them to a dma-resv object.
> > > > > > > > * Introduce a new interface for adding callbacks making the
> > > > > > > > helper adding
> > > > > > > >    a callback sign off on that it is aware that the dma-fence may not
> > > > > > > >    complete anytime soon. Typically this will be the
> > > > > > > > scheduler chaining
> > > > > > > >    a new long-running fence on another one.
> > > > > > > 
> > > > > > > Well that's pretty much what I tried before:
> > > > > > > https://lwn.net/Articles/893704/
> > > > > > > 
> > > 
> > > I don't think this is quite the same; this explicitly enforces that we don't
> > > break the dma-fence rules (in the path of memory allocations, exported in
> > > any way), essentially this is just a SW sync point reusing the dma-fence
> > > infrastructure for signaling / callbacks. I believe your series tried to
> > > export these fences to user space (admittedly I haven't fully read your
> > > series).
> > > 
> > > In this use case we essentially just want to flow control the ring via
> > > the drm scheduler + maintain a list of pending jobs so the TDR can be
> > > used for cleanup if an LR entity encounters an error. To me this seems
> > > perfectly reasonable but I know dma-fence rules are akin to a holy war.
> > > 
> > > If we return NULL in run_job, now we have to be able to sink all jobs
> > > in the backend regardless of ring space, maintain a list of jobs pending
> > > for cleanup after errors, and write a different cleanup path as now the
> > > TDR doesn't work. Seems very, very silly to duplicate all of this code
> > > when the DRM scheduler provides all of this for us. Also if we go this
> > > route, now all drivers are going to invent ways to handle LR jobs w/ the
> > > DRM scheduler.
> > > 
> > > This solution is pretty clear: mark the scheduler as LR, and don't
> > > export any fences from the scheduler. If you try to export these fences,
> > > a blow up happens.
> > 
> > The problem is if you mix things up. Like for resets you need all the
> > schedulers on an engine/set-of-engines to quiescent or things get
> > potentially hilarious. If you now have a scheduler in forever limbo, the
> > dma_fence guarantees are right out the window.
> > 
> 
> Right, a GT reset on Xe is:
> 
> Stop all schedulers
> Do a reset
> Ban any schedulers which we think caused the GT reset
> Resubmit all schedulers which we think were good
> Restart all schedulers
> 
> None of this flow depends on LR dma-fences; all of this uses the DRM
> sched infrastructure and works very well compared to the i915. Rewriting
> all this with a driver-specific implementation is what we are trying to
> avoid.
> 
> Similarly, if an LR entity hangs on its own (not a GT reset, rather the
> firmware does the reset for us) we use all the DRM scheduler
> infrastructure to handle this. Again this works rather well...

Yeah this is why I don't think duplicating everything that long-running
jobs need makes any sense. iow I agree with you.
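
(And for the record, that reset flow maps pretty much one-to-one onto the
existing drm_sched helpers. Rough sketch only: the gt / sched-list plumbing
and the xe_* helpers below are invented, the drm_sched_* calls are the real
ones.)

static void gt_reset_sketch(struct xe_gt *gt)
{
        struct drm_gpu_scheduler *sched;

        /* Park every scheduler feeding this GT so nothing new hits the hw. */
        list_for_each_entry(sched, &gt->sched_list, reset_link)
                drm_sched_stop(sched, NULL);

        xe_gt_do_hw_reset(gt);  /* hypothetical: the actual GuC/hw reset */

        list_for_each_entry(sched, &gt->sched_list, reset_link) {
                if (xe_sched_is_banned(sched))  /* hypothetical guilt check */
                        continue;               /* banned: driver errors out its jobs */
                drm_sched_resubmit_jobs(sched); /* replay the innocent jobs */
                drm_sched_start(sched, true);   /* unpark and re-arm the TDR */
        }
}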

> > But the issue you're having is fairly specific if it's just about
> > ringspace. I think the dumbest fix is to just block in submit if you run
> > out of per-ctx ringspace, and call it a day. This notion that somehow the
> 
> How does that not break the dma-fence rules? A job can publish its
> finished fence after ARM; if the finished fence waits on ring
> space that may not free up in a reasonable amount of time, we now have
> broken the dma-fence rules. My understanding is any dma-fence must only
> depend on other dma-fences; Christian seems to agree and NAK'd just blocking if
> no space is available [1]. IMO this series ensures we don't break dma-fence
> rules by restricting how the finished fence can be used.

Oh I meant in the submit ioctl, _before_ you even call
drm_sched_job_arm(). It's ok to block in there indefinitely.
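
Rough sketch of what I mean, with made-up ring-space helpers (only the
drm_sched_job_* calls are the real API):

        ret = drm_sched_job_init(&job->drm, &q->entity, q);    /* q, job: driver structs */
        if (ret)
                return ret;

        /* No fence has been published yet, so an indefinite sleep here is fine. */
        ret = wait_event_interruptible(q->ring_space_wq,        /* made-up waitqueue */
                                       xe_ring_space(q) >= job->ring_bytes);
        if (ret) {
                drm_sched_job_cleanup(&job->drm);
                return ret;     /* -ERESTARTSYS, userspace restarts the ioctl */
        }

        drm_sched_job_arm(&job->drm);   /* only now the finished fence exists */
        /* ... install fences in dma_resv, drm_sched_entity_push_job(), etc. ... */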

> > kernel is supposed to provide a bottomless queue of anything userspace
> > submits simply doesn't hold up in reality (as much as userspace standards
> > committees would like it to), and as long as it doesn't have a real-world
> > perf impact it doesn't really matter why we end up blocking in the submit
> > ioctl. It might also be a simple memory allocation that hits a snag in
> > page reclaim.
> > 
> > > > > > > And the reasons why it was rejected haven't changed.
> > > > > > > 
> > > > > > > Regards,
> > > > > > > Christian.
> > > > > > > 
> > > > > > Yes, TBH this was mostly to get discussion going how we'd best
> > > > > > tackle this problem while being able to reuse the scheduler for
> > > > > > long-running workloads.
> > > > > > 
> > > > > > I couldn't see any clear decision on your series, though, but one
> > > > > > main difference I see is that this is intended for driver-internal
> > > > > > use only. (I'm counting using the drm_scheduler as a helper for
> > > > > > driver-private use). This is by no means a way to try tackle the
> > > > > > indefinite fence problem.
> > > > > 
> > > > > Well this was just my latest try to tackle this, but essentially the
> > > > > problems are the same as with your approach: When we express such
> > > > > operations as dma_fence there is always the chance that we leak that
> > > > > somewhere.
> > > > > 
> > > > > My approach of adding a flag noting that this operation is dangerous and
> > > > > can't be synced with something memory management depends on tried to
> > > > > contain this as much as possible, but Daniel still pretty clearly
> > > > > rejected it (for good reasons I think).
> > > > > 
> > > > > > 
> > > > > > We could ofc invent a completely different data-type that abstracts
> > > > > > the synchronization the scheduler needs in the long-running case, or
> > > > > > each driver could hack something up, like sleeping in the
> > > > > > prepare_job() or run_job() callback for throttling, but those waits
> > > > > > should still be annotated in one way or another
> > > > > > (and probably in a similar way across drivers) to make sure we don't
> > > > > > do anything bad.
> > > > > > 
> > > > > >  So any suggestions as to what would be the better solution here
> > > > > > would be appreciated.
> > > > > 
> > > > > Mhm, do we really need the GPU scheduler for that?
> > > > > 
> > > 
> > > I think we need to solve this within the DRM scheduler one way or
> > > another.
> > 
> > Yeah so if we conclude that the queue really must be bottomless then I
> > agree drm-sched should help out sort out the mess. Because I'm guessing
> > that every driver will have this issue. But that's a big if.
> > 
> > I guess if we teach the drm scheduler that some jobs are fairly endless
> > then maybe it wouldn't be too far-fetched to also teach it to wait for a
> > previous one to finish (but not with the dma_fence that preempts, which we
> > put into the dma_resv for memory management, but some other struct
> > completion). The scheduler already has a concept of not stuffing too much
> > stuff into the same queue after all, so this should fit?
> 
> See above, exact same situation as spinning on flow controling the ring,
> this IMO absolutely breaks the dma-fence rules. IMO the correct solution
> is to have a DRM that doesn't export dma-fences, this is exactly what
> this series does as if we try to, boom lockdep / warn on blow up.

I don't think it's impossible to do this correctly, but definitely very,
very hard. Which is why neither Christian nor I like the idea :-)

Essentially you'd have to make sure that any indefinite wait still
reacts to the scheduler being stopped, so that you're not holding up a gt
reset or anything like that, but only ever holds up forward progress for this
specific scheduler/drm_sched_entity. Which you can do as long as (and again,
another hugely tricky detail) you still obey the preempt-ctx dma_fence and
manage to preempt the underlying long-running ctx even when the drm/sched
is stuck waiting for an indefinite fence (like waiting for ringspace or
something like that).
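
(Sketch of the shape I mean; every name except the dma_fence_*/workqueue
ones is invented. The point is that the preempt-ctx fence makes progress on
its own worker, so it keeps working even while the drm/sched worker is
blocked.)

struct preempt_fence {                  /* hypothetical */
        struct dma_fence base;
        struct work_struct preempt_work;
        struct exec_queue *q;           /* hypothetical long-running queue */
};

/* Wired into a dma_fence_ops as .enable_signaling. */
static bool preempt_fence_enable_signaling(struct dma_fence *fence)
{
        struct preempt_fence *pf = container_of(fence, typeof(*pf), base);

        /* Called when someone needs the memory back; kick our own worker,
         * never the (possibly stuck) scheduler worker. */
        queue_work(system_unbound_wq, &pf->preempt_work);
        return true;
}

static void preempt_work_fn(struct work_struct *w)
{
        struct preempt_fence *pf = container_of(w, typeof(*pf), preempt_work);

        exec_queue_suspend_and_wait(pf->q);     /* hypothetical call into the fw */
        dma_fence_signal(&pf->base);
}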

So I don't think it's impossible, but very far away from "a good idea" :-)

Hence the proposal to bail out of this entire mess by throwing EWOULDBLOCK
back to userspace directly from the ioctl function, where you can still do
that without breaking any dma_fence rules. Or, if it's not a case that
matters in practice, simply block in the ioctl handler instead of
returning EWOULDBLOCK.
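
I.e. in the ioctl, before anything is armed (helper and flag names made up):

        if (!ring_space_available(q, job)) {            /* hypothetical helper */
                if (args->flags & EXEC_FLAG_NONBLOCK)   /* hypothetical uapi flag */
                        return -EWOULDBLOCK;
                /* otherwise just sleep here, as sketched further up */
        }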
-Daniel

> 
> Matt
> 
> [1] https://patchwork.freedesktop.org/patch/525461/?series=114772&rev=2
> 
> > -Daniel
> > 
> > 
> > > > > I mean in the 1 to 1 case  you basically just need a component which
> > > > > collects the dependencies as dma_fence and if all of them are fulfilled
> > > > > schedules a work item.
> > > > > 
> > > > > As long as the work item itself doesn't produce a dma_fence it can then
> > > > > still just wait for other non-dma_fence dependencies.
> > > > > 
> > > > > Then the work function could submit the work and wait for the result.
> > > > > 
> > > > > The work item would then pretty much represent what you want, you can
> > > > > wait for it to finish and pass it along as long running dependency.
> > > > > 
> > > > > Maybe give it a funky name and wrap it up in a structure, but that's
> > > > > basically it.
> > > > > 
> > > > This very much sounds like an i915_sw_fence for the dependency tracking and
> > > > dma_fence_work for the actual work, although its completion fence is a
> > > > dma_fence.
> > > >
> > > 
> > > Agree this does sound too i915ish; as stated below, one of the mandates in Xe
> > > was to use the DRM scheduler. Beyond that, as someone who wrote a submission
> > > backend in both the i915 and Xe, I love how the DRM scheduler works (single
> > > entry point), it makes everything so much easier.
> > > 
> > > Matt
> > > 
> > > > Although that goes against the whole idea of a condition for merging the xe
> > > > driver would be that we implement some sort of minimal scaffolding for
> > > > long-running workloads in the drm scheduler, and the thinking behind that is
> > > > to avoid implementing intel-specific solutions like those...
> > > > 
> > > > Thanks,
> > > > 
> > > > Thomas
> > > > 
> > > > 
> > > > 
> > > > > Regards,
> > > > > Christian.
> > > > > 
> > > > > > 
> > > > > > Thanks,
> > > > > > 
> > > > > > Thomas
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > 
> > 
> > -- 
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences
  2023-04-05 12:45                   ` Daniel Vetter
@ 2023-04-05 14:08                     ` Christian König
  0 siblings, 0 replies; 87+ messages in thread
From: Christian König @ 2023-04-05 14:08 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: airlied, lina, Thomas Hellström (Intel),
	dri-devel, boris.brezillon, robdclark, intel-xe, faith.ekstrand

Am 05.04.23 um 14:45 schrieb Daniel Vetter:
> On Wed, Apr 05, 2023 at 02:39:35PM +0200, Christian König wrote:
>> Am 05.04.23 um 14:35 schrieb Thomas Hellström:
>>> Hi,
>>>
>>> On 4/4/23 21:25, Daniel Vetter wrote:
>>>> On Tue, Apr 04, 2023 at 07:02:23PM +0000, Matthew Brost wrote:
>>>>> On Tue, Apr 04, 2023 at 08:14:01PM +0200, Thomas Hellström
>>>>> (Intel) wrote:
>>>>>> On 4/4/23 15:10, Christian König wrote:
>>>>>>> Am 04.04.23 um 14:54 schrieb Thomas Hellström:
>>>>>>>> Hi, Christian,
>>>>>>>>
>>>>>>>> On 4/4/23 11:09, Christian König wrote:
>>>>>>>>> Am 04.04.23 um 02:22 schrieb Matthew Brost:
>>>>>>>>>> From: Thomas Hellström <thomas.hellstrom@linux.intel.com>
>>>>>>>>>>
>>>>>>>>>> For long-running workloads, drivers either need to open-code
>>>>>>>>>> completion
>>>>>>>>>> waits, invent their own synchronization
>>>>>>>>>> primitives or internally use
>>>>>>>>>> dma-fences that do not obey the cross-driver
>>>>>>>>>> dma-fence protocol, but
>>>>>>>>>> without any lockdep annotation all these
>>>>>>>>>> approaches are error prone.
>>>>>>>>>>
>>>>>>>>>> So since for example the drm scheduler uses dma-fences it is
>>>>>>>>>> desirable for
>>>>>>>>>> a driver to be able to use it for throttling and error
>>>>>>>>>> handling also with
>>>>>>>>>> internal dma-fences tha do not obey the cros-driver
>>>>>>>>>> dma-fence protocol.
>>>>>>>>>>
>>>>>>>>>> Introduce long-running completion fences in form of
>>>>>>>>>> dma-fences, and add
>>>>>>>>>> lockdep annotation for them. In particular:
>>>>>>>>>>
>>>>>>>>>> * Do not allow waiting under any memory management locks.
>>>>>>>>>> * Do not allow to attach them to a dma-resv object.
>>>>>>>>>> * Introduce a new interface for adding callbacks making the
>>>>>>>>>> helper adding
>>>>>>>>>>      a callback sign off on that it is aware
>>>>>>>>>> that the dma-fence may not
>>>>>>>>>>      complete anytime soon. Typically this will be the
>>>>>>>>>> scheduler chaining
>>>>>>>>>>      a new long-running fence on another one.
>>>>>>>>> Well that's pretty much what I tried before:
>>>>>>>>> https://lwn.net/Articles/893704/
>>>>>>>>>
>>>>> I don't think this is quite the same; this explicitly enforces that
>>>>> we don't
>>>>> break the dma-fence rules (in the path of memory allocations, exported in
>>>>> any way), essentially this is just a SW sync point reusing the dma-fence
>>>>> infrastructure for signaling / callbacks. I believe your series
>>>>> tried to
>>>>> export these fences to user space (admittedly I haven't fully read your
>>>>> series).
>>>>>
>>>>> In this use case we essentially just want to flow control the ring via
>>>>> the drm scheduler + maintain a list of pending jobs so the TDR can be
>>>>> used for cleanup if an LR entity encounters an error. To me this seems
>>>>> perfectly reasonable but I know dma-fence rules are akin to a holy war.
>>>>>
>>>>> If we return NULL in run_job, now we have to be able to sink all jobs
>>>>> in the backend regardless of ring space, maintain a list of jobs
>>>>> pending
>>>>> for cleanup after errors, and write a different cleanup path as now the
>>>>> TDR doesn't work. Seems very, very silly to duplicate all of this code
>>>>> when the DRM scheduler provides all of this for us. Also if we go this
>>>>> route, now all drivers are going to invent ways to handle LR
>>>>> jobs w/ the
>>>>> DRM scheduler.
>>>>>
>>>>> This solution is pretty clear: mark the scheduler as LR, and don't
>>>>> export any fences from the scheduler. If you try to export these fences,
>>>>> a blow up happens.
>>>> The problem is if you mix things up. Like for resets you need all the
>>>> schedulers on an engine/set-of-engines to quiescent or things get
>>>> potentially hilarious. If you now have a scheduler in forever limbo, the
>>>> dma_fence guarantees are right out the window.
>>>>
>>>> But the issue you're having is fairly specific if it's just about
>>>> ringspace. I think the dumbest fix is to just block in submit if you run
>>>> out of per-ctx ringspace, and call it a day. This notion that
>>>> somehow the
>>>> kernel is supposed to provide a bottomless queue of anything userspace
>>>> submits simply doesn't hold up in reality (as much as userspace
>>>> standards
>>>> committees would like it to), and as long as it doesn't have a
>>>> real-world
>>>> perf impact it doesn't really matter why we end up blocking in the
>>>> submit
>>>> ioctl. It might also be a simple memory allocation that hits a snag in
>>>> page reclaim.
>>> So it seems the discussion around the long-running synchronization
>>> diverged a bit between threads and this thread was hijacked for
>>> preempt-fences and userptr.
>>>
>>> Do I understand it correctly that the recommendation from both Daniel
>>> and Christian is to *not* use the drm scheduler for long-running compute
>>> jobs, but track any internal dma-fence dependencies (pipelined clearing
>>> or whatever) in a separate mechanism and handle unresolved dependencies
>>> on other long-running jobs using -EWOULDBLOCK?
>> Yeah, I think that's a good summary.
>>
>> If needed we could extract some scheduler functionality into separate
>> components, but the fundamental problem is that the GPU scheduler
>> provides a dma_fence interface to the outside to signal job completion, and
>> Daniel and I seem to agree that you really don't want that.
> I think I'm on something slightly different:
>
> - For anything which semantically is not a dma_fence I agree it probably
>    should be handled with EWOULDBLOCK and passed to userspace. Either with
>    a submit thread or userspace memory fences. Note that in practice you
>    will have a bunch of blocking left in the ioctl, stuff like mutexes or
>    memory allocations when things get really tight and you end up in
>    synchronous reclaim. Not any different from userspace ending up in
>    synchronous reclaim due to a page fault really. Trying to shoehorn
>    userspace memory fences or anything else long-running into drm/sched
>    dependency handling is just way too much a can of worms.
>
> - For the memory management dependencies, which are all dma_fence when
>    pipelined, I do think pushing them through the drm/sched makes sense. It
>    has all the stuff to handle that already, plus it's imo also the ideal
>    place to handle the preempt-ctx dma_fence scaffolding/semantics. Which
>    would give you a really neatly unified command submission interface
>    since in both cases (end-of-batch and long-running) you fish the
>    dma_fence you need to stuff in all the right dma_resv object (for memory
>    management purpose) out of the same place: The drm_sched_job struct.
>
> So I'm _not_ on the "do not use drm/sched for long-running jobs at all".
> That doesn't make much sense to me because you'd just be reinventing the
> exact same dma_fence dependency handling and memory management shuffling
> we already have. That seems silly.

How about we stuff the functionality we still want to have into a
drm_job object?

I mean that really isn't that much; basically just looking at
drm_syncobj, dma_resv etc. and extracting all the dependencies.
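
Strawman of what I mean, basically what drm_sched_job_add_dependency() does
today minus the scheduler attached to it. All the drm_job_* names are
invented; the drm_syncobj/dma_resv/xarray calls are the existing ones.

struct drm_job {                        /* strawman only */
        struct xarray deps;             /* xa_init()'ed dma_fence dependencies */
};

static int drm_job_add_dep(struct drm_job *job, struct dma_fence *fence)
{
        u32 id;

        /* Takes ownership of the fence reference. */
        return xa_alloc(&job->deps, &id, fence, xa_limit_32b, GFP_KERNEL);
}

static int drm_job_add_syncobj_dep(struct drm_job *job, struct drm_file *file,
                                   u32 handle, u64 point)
{
        struct dma_fence *fence;
        int err = drm_syncobj_find_fence(file, handle, point, 0, &fence);

        return err ?: drm_job_add_dep(job, fence);
}

static int drm_job_add_resv_deps(struct drm_job *job, struct dma_resv *resv,
                                 enum dma_resv_usage usage)
{
        struct dma_resv_iter cursor;
        struct dma_fence *fence;
        int err;

        dma_resv_for_each_fence(&cursor, resv, usage, fence) {
                err = drm_job_add_dep(job, dma_fence_get(fence));
                if (err)
                        return err;
        }
        return 0;
}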

Christian.

> -Daniel
>
>> Regards,
>> Christian.
>>
>>> Thanks,
>>> Thomas
>>>
>>>
>>>
>>>
>>>
>>>>>>>>> And the reasons why it was rejected haven't changed.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Christian.
>>>>>>>>>
>>>>>>>> Yes, TBH this was mostly to get discussion going how we'd best
>>>>>>>> tackle this problem while being able to reuse the scheduler for
>>>>>>>> long-running workloads.
>>>>>>>>
>>>>>>>> I couldn't see any clear decision on your series, though, but one
>>>>>>>> main difference I see is that this is intended for driver-internal
>>>>>>>> use only. (I'm counting using the drm_scheduler as a helper for
>>>>>>>> driver-private use). This is by no means a way to try tackle the
>>>>>>>> indefinite fence problem.
>>>>>>> Well this was just my latest try to tackle this, but essentially the
>>>>>>> problems are the same as with your approach: When we express such
>>>>>>> operations as dma_fence there is always the chance that we leak that
>>>>>>> somewhere.
>>>>>>>
>>>>>>> My approach of adding a flag noting that this operation
>>>>>>> is dangerous and
>>>>>>> can't be synced with something memory management depends on tried to
>>>>>>> contain this as much as possible, but Daniel still pretty clearly
>>>>>>> rejected it (for good reasons I think).
>>>>>>>
>>>>>>>> We could ofc invent a completely different data-type that abstracts
>>>>>>>> the synchronization the scheduler needs in the long-running case, or
>>>>>>>> each driver could hack something up, like sleeping in the
>>>>>>>> prepare_job() or run_job() callback for throttling, but those waits
>>>>>>>> should still be annotated in one way or another
>>>>>>>> (and probably in a similar way across drivers) to make sure we don't
>>>>>>>> do anything bad.
>>>>>>>>
>>>>>>>>    So any suggestions as to what would be the better solution here
>>>>>>>> would be appreciated.
>>>>>>> Mhm, do we really need the GPU scheduler for that?
>>>>>>>
>>>>> I think we need to solve this within the DRM scheduler one way or
>>>>> another.
>>>> Yeah so if we conclude that the queue really must be bottomless then I
>>>> agree drm-sched should help out sort out the mess. Because I'm guessing
>>>> that every driver will have this issue. But that's a big if.
>>>>
>>>> I guess if we teach the drm scheduler that some jobs are fairly endless
>>>> then maybe it wouldn't be too far-fetched to also teach it to wait for a
>>>> previous one to finish (but not with the dma_fence that preempts,
>>>> which we
>>>> put into the dma_resv for memory management, but some other struct
>>>> completion). The scheduler already has a concept of not stuffing too
>>>> much
>>>> stuff into the same queue after all, so this should fit?
>>>> -Daniel
>>>>
>>>>
>>>>>>> I mean in the 1 to 1 case  you basically just need a component which
>>>>>>> collects the dependencies as dma_fence and if all of
>>>>>>> them are fulfilled
>>>>>>> schedules a work item.
>>>>>>>
>>>>>>> As long as the work item itself doesn't produce a
>>>>>>> dma_fence it can then
>>>>>>> still just wait for other non-dma_fence dependencies.
>>>>>>>
>>>>>>> Then the work function could submit the work and wait for the result.
>>>>>>>
>>>>>>> The work item would then pretty much represent what you want, you can
>>>>>>> wait for it to finish and pass it along as long running dependency.
>>>>>>>
>>>>>>> Maybe give it a funky name and wrap it up in a structure, but that's
>>>>>>> basically it.
>>>>>>>
>>>>>> This very much sounds like an i915_sw_fence for the
>>>>>> dependency tracking and
>>>>>> dma_fence_work for the actual work, although its completion fence is a
>>>>>> dma_fence.
>>>>>>
>>>>> Agree this does sound too i915ish; as stated below, one of the mandates in Xe
>>>>> was to use the DRM scheduler. Beyond that, as someone who wrote a submission
>>>>> backend in both the i915 and Xe, I love how the DRM scheduler works (single
>>>>> entry point), it makes everything so much easier.
>>>>>
>>>>> Matt
>>>>>
>>>>>> Although that goes against the whole idea of a condition for
>>>>>> merging the xe
>>>>>> driver would be that we implement some sort of minimal scaffolding for
>>>>>> long-running workloads in the drm scheduler, and the
>>>>>> thinking behind that is
>>>>>> to avoid implementing intel-specific solutions like those...
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Thomas
>>>>>>
>>>>>>
>>>>>>
>>>>>>> Regards,
>>>>>>> Christian.
>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Thomas
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 02/10] drm/sched: Move schedule policy to scheduler / entity
  2023-04-04  0:22 ` [Intel-xe] [RFC PATCH 02/10] drm/sched: Move schedule policy to scheduler / entity Matthew Brost
@ 2023-04-05 17:37   ` Luben Tuikov
  2023-04-05 18:29     ` Matthew Brost
  0 siblings, 1 reply; 87+ messages in thread
From: Luben Tuikov @ 2023-04-05 17:37 UTC (permalink / raw)
  To: Matthew Brost, dri-devel, intel-xe
  Cc: robdclark, airlied, lina, boris.brezillon, christian.koenig,
	faith.ekstrand

Hi,

Inlined:

On 2023-04-03 20:22, Matthew Brost wrote:
> Rather than a global modparam for scheduling policy, move the scheduling
> policy to scheduler / entity so user can control each scheduler / entity
> policy.
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  1 +
>  drivers/gpu/drm/etnaviv/etnaviv_sched.c    |  3 ++-
>  drivers/gpu/drm/lima/lima_sched.c          |  3 ++-
>  drivers/gpu/drm/msm/msm_ringbuffer.c       |  3 ++-
>  drivers/gpu/drm/panfrost/panfrost_job.c    |  3 ++-
>  drivers/gpu/drm/scheduler/sched_entity.c   | 25 ++++++++++++++++++----
>  drivers/gpu/drm/scheduler/sched_main.c     | 21 +++++++++++++-----
>  drivers/gpu/drm/v3d/v3d_sched.c            | 15 ++++++++-----
>  include/drm/gpu_scheduler.h                | 23 ++++++++++++++------
>  9 files changed, 73 insertions(+), 24 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 00c9c03c8f94..4df0fca5a74c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -2368,6 +2368,7 @@ static int amdgpu_device_init_schedulers(struct amdgpu_device *adev)
>  				   ring->num_hw_submission, amdgpu_job_hang_limit,
>  				   timeout, adev->reset_domain->wq,
>  				   ring->sched_score, ring->name,
> +				   DRM_SCHED_POLICY_DEFAULT,
>  				   adev->dev);
>  		if (r) {
>  			DRM_ERROR("Failed to create scheduler on ring %s.\n",
> diff --git a/drivers/gpu/drm/etnaviv/etnaviv_sched.c b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
> index 8486a2923f1b..61204a3f8b0b 100644
> --- a/drivers/gpu/drm/etnaviv/etnaviv_sched.c
> +++ b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
> @@ -136,7 +136,8 @@ int etnaviv_sched_init(struct etnaviv_gpu *gpu)
>  	ret = drm_sched_init(&gpu->sched, &etnaviv_sched_ops, NULL,
>  			     etnaviv_hw_jobs_limit, etnaviv_job_hang_limit,
>  			     msecs_to_jiffies(500), NULL, NULL,
> -			     dev_name(gpu->dev), gpu->dev);
> +			     dev_name(gpu->dev), DRM_SCHED_POLICY_DEFAULT,
> +			     gpu->dev);
>  	if (ret)
>  		return ret;
>  
> diff --git a/drivers/gpu/drm/lima/lima_sched.c b/drivers/gpu/drm/lima/lima_sched.c
> index 54f53bece27c..33042ba6ae93 100644
> --- a/drivers/gpu/drm/lima/lima_sched.c
> +++ b/drivers/gpu/drm/lima/lima_sched.c
> @@ -491,7 +491,8 @@ int lima_sched_pipe_init(struct lima_sched_pipe *pipe, const char *name)
>  	return drm_sched_init(&pipe->base, &lima_sched_ops, NULL, 1,
>  			      lima_job_hang_limit,
>  			      msecs_to_jiffies(timeout), NULL,
> -			      NULL, name, pipe->ldev->dev);
> +			      NULL, name, DRM_SCHED_POLICY_DEFAULT,
> +			      pipe->ldev->dev);
>  }
>  
>  void lima_sched_pipe_fini(struct lima_sched_pipe *pipe)
> diff --git a/drivers/gpu/drm/msm/msm_ringbuffer.c b/drivers/gpu/drm/msm/msm_ringbuffer.c
> index 5879fc262047..f408a9097315 100644
> --- a/drivers/gpu/drm/msm/msm_ringbuffer.c
> +++ b/drivers/gpu/drm/msm/msm_ringbuffer.c
> @@ -97,7 +97,8 @@ struct msm_ringbuffer *msm_ringbuffer_new(struct msm_gpu *gpu, int id,
>  
>  	ret = drm_sched_init(&ring->sched, &msm_sched_ops, NULL,
>  			num_hw_submissions, 0, sched_timeout,
> -			NULL, NULL, to_msm_bo(ring->bo)->name, gpu->dev->dev);
> +			NULL, NULL, to_msm_bo(ring->bo)->name,
> +			DRM_SCHED_POLICY_DEFAULT, gpu->dev->dev);
>  	if (ret) {
>  		goto fail;
>  	}
> diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
> index f48b07056a16..effa48b33dce 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_job.c
> +++ b/drivers/gpu/drm/panfrost/panfrost_job.c
> @@ -819,7 +819,8 @@ int panfrost_job_init(struct panfrost_device *pfdev)
>  				     nentries, 0,
>  				     msecs_to_jiffies(JOB_TIMEOUT_MS),
>  				     pfdev->reset.wq,
> -				     NULL, "pan_js", pfdev->dev);
> +				     NULL, "pan_js", DRM_SCHED_POLICY_DEFAULT,
> +				     pfdev->dev);
>  		if (ret) {
>  			dev_err(pfdev->dev, "Failed to create scheduler: %d.", ret);
>  			goto err_sched;
> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> index 15d04a0ec623..f1299e51860b 100644
> --- a/drivers/gpu/drm/scheduler/sched_entity.c
> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> @@ -33,6 +33,20 @@
>  #define to_drm_sched_job(sched_job)		\
>  		container_of((sched_job), struct drm_sched_job, queue_node)
>  
> +static bool bad_policies(struct drm_gpu_scheduler **sched_list,
> +			 unsigned int num_sched_list)
> +{
> +	enum drm_sched_policy sched_policy = sched_list[0]->sched_policy;
> +	unsigned int i;
> +
> +	/* All scdedule policies must match */
> +	for (i = 1; i < num_sched_list; ++i)
> +		if (sched_policy != sched_list[i]->sched_policy)
> +			return true;
> +
> +	return false;
> +}
> +
>  /**
>   * drm_sched_entity_init - Init a context entity used by scheduler when
>   * submit to HW ring.
> @@ -62,7 +76,8 @@ int drm_sched_entity_init(struct drm_sched_entity *entity,
>  			  unsigned int num_sched_list,
>  			  atomic_t *guilty)
>  {
> -	if (!(entity && sched_list && (num_sched_list == 0 || sched_list[0])))
> +	if (!(entity && sched_list && (num_sched_list == 0 || sched_list[0])) ||
> +	    bad_policies(sched_list, num_sched_list))
>  		return -EINVAL;
>  
>  	memset(entity, 0, sizeof(struct drm_sched_entity));
> @@ -75,8 +90,10 @@ int drm_sched_entity_init(struct drm_sched_entity *entity,
>  	entity->last_scheduled = NULL;
>  	RB_CLEAR_NODE(&entity->rb_tree_node);
>  
> -	if(num_sched_list)
> +	if(num_sched_list) {
>  		entity->rq = &sched_list[0]->sched_rq[entity->priority];
> +		entity->sched_policy = sched_list[0]->sched_policy;
> +	}
>  
>  	init_completion(&entity->entity_idle);
>  
> @@ -440,7 +457,7 @@ struct drm_sched_job *drm_sched_entity_pop_job(struct drm_sched_entity *entity)
>  	 * Update the entity's location in the min heap according to
>  	 * the timestamp of the next job, if any.
>  	 */
> -	if (drm_sched_policy == DRM_SCHED_POLICY_FIFO) {
> +	if (entity->sched_policy == DRM_SCHED_POLICY_FIFO) {

The entity (context) shouldn't have the "sched_policy" property.
That property belongs only to the scheduler.
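I.e. just chase the pointer through the run-queue where the entity code
needs it; a minimal sketch, assuming entity->rq is set:

	if (entity->rq->sched->sched_policy == DRM_SCHED_POLICY_FIFO)
		drm_sched_rq_update_fifo(entity, sched_job->submit_ts);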

>  		struct drm_sched_job *next;
>  
>  		next = to_drm_sched_job(spsc_queue_peek(&entity->job_queue));
> @@ -528,7 +545,7 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
>  		drm_sched_rq_add_entity(entity->rq, entity);
>  		spin_unlock(&entity->rq_lock);
>  
> -		if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
> +		if (entity->sched_policy == DRM_SCHED_POLICY_FIFO)
>  			drm_sched_rq_update_fifo(entity, sched_job->submit_ts);
>  
>  		drm_sched_wakeup(entity->rq->sched);
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 808008990721..77894976fa55 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -62,14 +62,14 @@
>  #define to_drm_sched_job(sched_job)		\
>  		container_of((sched_job), struct drm_sched_job, queue_node)
>  
> -int drm_sched_policy = DRM_SCHED_POLICY_FIFO;
> +int default_drm_sched_policy = DRM_SCHED_POLICY_FIFO;
>  
>  /**
>   * DOC: sched_policy (int)
>   * Used to override default entities scheduling policy in a run queue.
>   */
>  MODULE_PARM_DESC(sched_policy, "Specify the scheduling policy for entities on a run-queue, " __stringify(DRM_SCHED_POLICY_RR) " = Round Robin, " __stringify(DRM_SCHED_POLICY_FIFO) " = FIFO (default).");
> -module_param_named(sched_policy, drm_sched_policy, int, 0444);
> +module_param_named(sched_policy, default_drm_sched_policy, int, 0444);
>  
>  static __always_inline bool drm_sched_entity_compare_before(struct rb_node *a,
>  							    const struct rb_node *b)
> @@ -173,7 +173,7 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
>  	if (rq->current_entity == entity)
>  		rq->current_entity = NULL;
>  
> -	if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
> +	if (entity->sched_policy == DRM_SCHED_POLICY_FIFO)
>  		drm_sched_rq_remove_fifo_locked(entity);
>  
>  	spin_unlock(&rq->lock);
> @@ -931,7 +931,7 @@ drm_sched_select_entity(struct drm_gpu_scheduler *sched)
>  
>  	/* Kernel run queue has higher priority than normal run queue*/
>  	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
> -		entity = drm_sched_policy == DRM_SCHED_POLICY_FIFO ?
> +		entity = sched->sched_policy == DRM_SCHED_POLICY_FIFO ?
>  			drm_sched_rq_select_entity_fifo(&sched->sched_rq[i]) :
>  			drm_sched_rq_select_entity_rr(&sched->sched_rq[i]);
>  		if (entity)
> @@ -1106,6 +1106,7 @@ static void drm_sched_main(struct work_struct *w)
>   *		used
>   * @score: optional score atomic shared with other schedulers
>   * @name: name used for debugging
> + * @sched_policy: schedule policy
>   * @dev: target &struct device
>   *
>   * Return 0 on success, otherwise error code.
> @@ -1115,9 +1116,15 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
>  		   struct workqueue_struct *run_wq,
>  		   unsigned hw_submission, unsigned hang_limit,
>  		   long timeout, struct workqueue_struct *timeout_wq,
> -		   atomic_t *score, const char *name, struct device *dev)
> +		   atomic_t *score, const char *name,
> +		   enum drm_sched_policy sched_policy,
> +		   struct device *dev)
>  {
>  	int i;
> +
> +	if (sched_policy >= DRM_SCHED_POLICY_MAX)
> +		return -EINVAL;
> +
>  	sched->ops = ops;
>  	sched->hw_submission_limit = hw_submission;
>  	sched->name = name;
> @@ -1127,6 +1134,10 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
>  	sched->hang_limit = hang_limit;
>  	sched->score = score ? score : &sched->_score;
>  	sched->dev = dev;
> +	if (sched_policy == DRM_SCHED_POLICY_DEFAULT)
> +		sched->sched_policy = default_drm_sched_policy;
> +	else
> +		sched->sched_policy = sched_policy;
>  	for (i = DRM_SCHED_PRIORITY_MIN; i < DRM_SCHED_PRIORITY_COUNT; i++)
>  		drm_sched_rq_init(sched, &sched->sched_rq[i]);
>  
> diff --git a/drivers/gpu/drm/v3d/v3d_sched.c b/drivers/gpu/drm/v3d/v3d_sched.c
> index 38e092ea41e6..5e3fe77fa991 100644
> --- a/drivers/gpu/drm/v3d/v3d_sched.c
> +++ b/drivers/gpu/drm/v3d/v3d_sched.c
> @@ -391,7 +391,8 @@ v3d_sched_init(struct v3d_dev *v3d)
>  			     &v3d_bin_sched_ops, NULL,
>  			     hw_jobs_limit, job_hang_limit,
>  			     msecs_to_jiffies(hang_limit_ms), NULL,
> -			     NULL, "v3d_bin", v3d->drm.dev);
> +			     NULL, "v3d_bin", DRM_SCHED_POLICY_DEFAULT,
> +			     v3d->drm.dev);
>  	if (ret)
>  		return ret;
>  
> @@ -399,7 +400,8 @@ v3d_sched_init(struct v3d_dev *v3d)
>  			     &v3d_render_sched_ops, NULL,
>  			     hw_jobs_limit, job_hang_limit,
>  			     msecs_to_jiffies(hang_limit_ms), NULL,
> -			     NULL, "v3d_render", v3d->drm.dev);
> +			     ULL, "v3d_render", DRM_SCHED_POLICY_DEFAULT,
> +			     v3d->drm.dev);
>  	if (ret)
>  		goto fail;
>  
> @@ -407,7 +409,8 @@ v3d_sched_init(struct v3d_dev *v3d)
>  			     &v3d_tfu_sched_ops, NULL,
>  			     hw_jobs_limit, job_hang_limit,
>  			     msecs_to_jiffies(hang_limit_ms), NULL,
> -			     NULL, "v3d_tfu", v3d->drm.dev);
> +			     NULL, "v3d_tfu", DRM_SCHED_POLICY_DEFAULT,
> +			     v3d->drm.dev);
>  	if (ret)
>  		goto fail;
>  
> @@ -416,7 +419,8 @@ v3d_sched_init(struct v3d_dev *v3d)
>  				     &v3d_csd_sched_ops, NULL,
>  				     hw_jobs_limit, job_hang_limit,
>  				     msecs_to_jiffies(hang_limit_ms), NULL,
> -				     NULL, "v3d_csd", v3d->drm.dev);
> +				     NULL, "v3d_csd", DRM_SCHED_POLICY_DEFAULT,
> +				     v3d->drm.dev);
>  		if (ret)
>  			goto fail;
>  
> @@ -424,7 +428,8 @@ v3d_sched_init(struct v3d_dev *v3d)
>  				     &v3d_cache_clean_sched_ops, NULL,
>  				     hw_jobs_limit, job_hang_limit,
>  				     msecs_to_jiffies(hang_limit_ms), NULL,
> -				     NULL, "v3d_cache_clean", v3d->drm.dev);
> +				     NULL, "v3d_cache_clean",
> +				     DRM_SCHED_POLICY_DEFAULT, v3d->drm.dev);
>  		if (ret)
>  			goto fail;
>  	}
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index 98fb5f85eba6..39cb72b7fe5d 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -72,11 +72,15 @@ enum drm_sched_priority {
>  	DRM_SCHED_PRIORITY_UNSET = -2
>  };
>  
> -/* Used to chose between FIFO and RR jobs scheduling */
> -extern int drm_sched_policy;
> -
> -#define DRM_SCHED_POLICY_RR    0
> -#define DRM_SCHED_POLICY_FIFO  1
> +/* Used to chose default scheduling policy*/
> +extern int default_drm_sched_policy;
> +
> +enum drm_sched_policy {
> +	DRM_SCHED_POLICY_DEFAULT,
> +	DRM_SCHED_POLICY_RR,
> +	DRM_SCHED_POLICY_FIFO,
> +	DRM_SCHED_POLICY_MAX,
> +};

Please don't use MAX. It is very confusing, as maximum and minimum values
are values which can be attained, in literature and common use.
For instance, "the maximum temperature today is 287K, also expect rains"
means that that temperature will actually be attained.

Use DRM_SCHED_POLICY_COUNT instead: with 0-based indexing, as in C enums,
the last element in the set is in fact the number of elements, i.e. the
count of the set. (_NUM is also bad as it means "number", which could
really be anything.)

So using DRM_SCHED_POLICY_COUNT is most clear.
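
I.e. something like:

	enum drm_sched_policy {
		DRM_SCHED_POLICY_DEFAULT,
		DRM_SCHED_POLICY_RR,
		DRM_SCHED_POLICY_FIFO,
		DRM_SCHED_POLICY_COUNT, /* the number of policies, not a policy */
	};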

>  
>  /**
>   * struct drm_sched_entity - A wrapper around a job queue (typically
> @@ -217,6 +221,9 @@ struct drm_sched_entity {
>  	 */
>  	bool 				stopped;
>  
> +	/** @sched_policy: Schedule policy for entity */
> +	enum drm_sched_policy		sched_policy;
> +

This creates data redundancy. "sched_policy" should only be found
in the drm_gpu_scheduler structure. The context's tasks then get to run
on a scheduler with such and such a policy. We shouldn't have this here,
only in the drm_gpu_scheduler structure.

Regards,
Luben

>  	/**
>  	 * @entity_idle:
>  	 *
> @@ -489,6 +496,7 @@ struct drm_sched_backend_ops {
>   *              guilty and it will no longer be considered for scheduling.
>   * @score: score to help loadbalancer pick a idle sched
>   * @_score: score used when the driver doesn't provide one
> + * @sched_policy: Schedule policy for scheduler
>   * @ready: marks if the underlying HW is ready to work
>   * @free_guilty: A hit to time out handler to free the guilty job.
>   * @pause_run_wq: pause queuing of @work_run on @run_wq
> @@ -514,6 +522,7 @@ struct drm_gpu_scheduler {
>  	int				hang_limit;
>  	atomic_t                        *score;
>  	atomic_t                        _score;
> +	enum drm_sched_policy		sched_policy;
>  	bool				ready;
>  	bool				free_guilty;
>  	bool				pause_run_wq;
> @@ -525,7 +534,9 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
>  		   struct workqueue_struct *run_wq,
>  		   uint32_t hw_submission, unsigned hang_limit,
>  		   long timeout, struct workqueue_struct *timeout_wq,
> -		   atomic_t *score, const char *name, struct device *dev);
> +		   atomic_t *score, const char *name,
> +		   enum drm_sched_policy sched_policy,
> +		   struct device *dev);
>  
>  void drm_sched_fini(struct drm_gpu_scheduler *sched);
>  int drm_sched_job_init(struct drm_sched_job *job,


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans
  2023-04-05  7:30     ` Christian König
  2023-04-05  8:42       ` Daniel Vetter
@ 2023-04-05 18:06       ` Zeng, Oak
  2023-04-05 18:53         ` Matthew Brost
  1 sibling, 1 reply; 87+ messages in thread
From: Zeng, Oak @ 2023-04-05 18:06 UTC (permalink / raw)
  To: Christian König, Brost, Matthew, Vetter, Daniel,
	Thomas Hellström
  Cc: robdclark, airlied, lina, dri-devel, boris.brezillon, intel-xe,
	faith.ekstrand

Hi,

Using dma-fence for completion/dependency tracking for long-run workloads (more precisely, on-demand paging / page-fault-enabled workloads) can cause deadlock. That seems to be the significant issue here. Other issues, such as the drm scheduler completion-order implications, are minor and can be solved inside the framework of the drm scheduler. We need to evaluate the paths below:

	1) Still use the drm scheduler for job submission, and use dma-fence for job completion waiting / dependency tracking. This is the solution proposed in this series. Annotate dma-fences for long-run workloads: the user can still wait on a dma-fence for job completion but can't wait on a dma-fence while holding any memory management locks. We still use dma-fence for dependency tracking, but it very easily runs into deadlock once on-demand paging is in the picture. The annotation helps us detect deadlocks but does not solve them. Seems *not* a complete solution: it is almost impossible to completely avoid dependency deadlocks in a complex runtime environment.
	
	2) Still use the drm scheduler but don't use dma-fence for completion signaling and dependency tracking. This way we still get some functions for free (reset, error handling, ring flow control, as Matt said) from the drm scheduler, but push the dependency/completion tracking completely to user space using techniques such as user-space fences. User space doesn't get a chance to wait on a fence while holding a kernel memory management lock, so the dma-fence deadlock issue is solved.
	
	3) Completely discard the drm scheduler and dma-fence for long-run workloads. Use a user queue/doorbell for super fast submission, directly interacting with the fw scheduler. Use a user fence for completion/dependency tracking (a rough sketch of what such a user fence could look like follows below).
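
To illustrate the user fence idea in 2)/3), all names invented: conceptually just a 64-bit seqno in memory shared with the GPU; the engine writes the value on completion, and any kernel-side wait is a plain poll/sleep that is never wrapped in a dma-fence and never taken under memory-management locks.

struct user_fence {             /* hypothetical */
        u64 __user *addr;       /* mapped to both CPU and GPU */
        u64 value;              /* signaled once *addr >= value */
};

static int user_fence_wait(struct user_fence *uf, unsigned long timeout)
{
        unsigned long end = jiffies + timeout;
        u64 cur;

        for (;;) {
                if (get_user(cur, uf->addr))
                        return -EFAULT;
                if (cur >= uf->value)
                        return 0;
                if (time_after(jiffies, end))
                        return -ETIME;
                if (msleep_interruptible(1))
                        return -ERESTARTSYS;
        }
}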

Thanks,
Oak

> -----Original Message-----
> From: Christian König <christian.koenig@amd.com>
> Sent: April 5, 2023 3:30 AM
> To: Brost, Matthew <matthew.brost@intel.com>; Zeng, Oak
> <oak.zeng@intel.com>
> Cc: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org;
> robdclark@chromium.org; thomas.hellstrom@linux.intel.com; airlied@linux.ie;
> lina@asahilina.net; boris.brezillon@collabora.com; faith.ekstrand@collabora.com
> Subject: Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
> plans
> 
> Am 04.04.23 um 20:08 schrieb Matthew Brost:
> > On Tue, Apr 04, 2023 at 12:02:03PM -0600, Zeng, Oak wrote:
> >> Hi Matt, Thomas,
> >>
> >> Some very bold out of box thinking in this area:
> >>
> >> 1. so you want to use drm scheduler and dma-fence for long running workload.
> Why you want to do this in the first place? What is the benefit? Drm scheduler is
> pretty much a software scheduler. Modern gpu has scheduler built at fw/hw
> level, as you said below for intel this is Guc. Can xe driver just directly submit job
> to Guc, bypassing drm scheduler?
> >>
> > If we did that now we have 2 paths for dependency tracking, flow controlling
> > the ring, resets / error handling / backend submission implementations.
> > We don't want this.
> 
> Well exactly that's the point: Why?
> 
> As far as I can see that are two completely distinct use cases, so you
> absolutely do want two completely distinct implementations for this.
> 
> >> 2. using dma-fence for long run workload: I am well aware that page fault (and
> the consequent memory allocation/lock acquiring to fix the fault) can cause
> deadlock for a dma-fence wait. But I am not convinced that dma-fence can't be
> used purely because the nature of the workload that it runs very long (indefinite).
> I did a math: the dma_fence_wait_timeout function's third param is the timeout
> which is a signed long type. If HZ is 1000, this is about 23 days. If 23 days is not long
> enough, can we just change the timeout parameter to signed 64 bits so it is much
> longer than our life time...
> >>
> >> So I mainly argue we can't use dma-fence for long-run workload is not
> because the workload runs very long, rather because of the fact that we use
> page fault for long-run workload. If we enable page fault for short-run workload,
> we can't use dma-fence either. Page fault is the key thing here.
> >>
> >> Now since we use page fault which is *fundamentally* controversial with
> dma-fence design, why not just introduce an independent concept such as user-
> fence instead of extending existing dma-fence?
> >>
> >> I like unified design. If drm scheduler, dma-fence can be extended to work for
> everything, it is beautiful. But seems we have some fundamental problem here.
> >>
> > Thomas's patches turn a dma-fence into KMD sync point (e.g. we just use
> > the signal / CB infrastructure) and enforce we don't use these
> > dma-fences from the scheduler in memory reclaim paths or export these to
> > user space or other drivers. Think of this mode as SW only fence.
> 
> Yeah and I truly think this is an really bad idea.
> 
> The signal/CB infrastructure in the dma_fence turned out to be the
> absolutely nightmare I initially predicted. Sorry to say that, but in
> this case the "I've told you so" is appropriate in my opinion.
> 
> If we need infrastructure for long running dependency tracking we should
> encapsulate that in a new framework and not try to mangle the existing
> code for something it was never intended for.
> 
> Christian.
> 
> >
> > Matt
> >
> >> Thanks,
> >> Oak
> >>
> >>> -----Original Message-----
> >>> From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of
> >>> Matthew Brost
> >>> Sent: April 3, 2023 8:22 PM
> >>> To: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org
> >>> Cc: robdclark@chromium.org; thomas.hellstrom@linux.intel.com;
> airlied@linux.ie;
> >>> lina@asahilina.net; boris.brezillon@collabora.com; Brost, Matthew
> >>> <matthew.brost@intel.com>; christian.koenig@amd.com;
> >>> faith.ekstrand@collabora.com
> >>> Subject: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
> plans
> >>>
> >>> Hello,
> >>>
> >>> As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> >>> have been asked to merge our common DRM scheduler patches first as well
> >>> as develop a common solution for long running workloads with the DRM
> >>> scheduler. This RFC series is our first attempt at doing this. We
> >>> welcome any and all feedback.
> >>>
> >>> This can we thought of as 4 parts detailed below.
> >>>
> >>> - DRM scheduler changes for 1 to 1 relationship between scheduler and
> >>> entity (patches 1-3)
> >>>
> >>> In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> >>> GuC) which is a new paradigm WRT to the DRM scheduler and presents
> >>> severals problems as the DRM was originally designed to schedule jobs on
> >>> hardware queues. The main problem being that DRM scheduler expects the
> >>> submission order of jobs to be the completion order of jobs even across
> >>> multiple entities. This assumption falls apart with a firmware scheduler
> >>> as a firmware scheduler has no concept of jobs and jobs can complete out
> >>> of order. A novel solution for was originally thought of by Faith during
> >>> the initial prototype of Xe, create a 1 to 1 relationship between scheduler
> >>> and entity. I believe the AGX driver [3] is using this approach and
> >>> Boris may use approach as well for the Mali driver [4].
> >>>
> >>> To support a 1 to 1 relationship we move the main execution function
> >>> from a kthread to a work queue and add a new scheduling mode which
> >>> bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
> >>> The new scheduling mode should unify all drivers usage with a 1 to 1
> >>> relationship and can be thought of as using scheduler as a dependency /
> >>> infligt job tracker rather than a true scheduler.
> >>>
> >>> - Generic messaging interface for DRM scheduler
> >>>
> >>> Idea is to be able to communicate to the submission backend with in band
> >>> (relative to main execution function) messages. Messages are backend
> >>> defined and flexable enough for any use case. In Xe we use these
> >>> messages to clean up entites, set properties for entites, and suspend /
> >>> resume execution of an entity [5]. I suspect other driver can leverage
> >>> this messaging concept too as it a convenient way to avoid races in the
> >>> backend.
> >>>
> >>> - Support for using TDR for all error paths of a scheduler / entity
> >>>
> >>> Fix a few races / bugs, add function to dynamically set the TDR timeout.
> >>>
> >>> - Annotate dma-fences for long running workloads.
> >>>
> >>> The idea here is to use dma-fences only as sync points within the
> >>> scheduler and never export them for long running workloads. By
> >>> annotating these fences as long running we ensure that these dma-fences
> >>> are never used in a way that breaks the dma-fence rules. A benefit of
> >>> thus approach is the scheduler can still safely flow control the
> >>> execution ring buffer via the job limit without breaking the dma-fence
> >>> rules.
> >>>
> >>> Again this a first draft and looking forward to feedback.
> >>>
> >>> Enjoy - Matt
> >>>
> >>> [1] https://gitlab.freedesktop.org/drm/xe/kernel
> >>> [2] https://patchwork.freedesktop.org/series/112188/
> >>> [3] https://patchwork.freedesktop.org/series/114772/
> >>> [4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1
> >>> [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-
> >>> next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031
> >>>
> >>> Matthew Brost (8):
> >>>    drm/sched: Convert drm scheduler to use a work queue rather than
> >>>      kthread
> >>>    drm/sched: Move schedule policy to scheduler / entity
> >>>    drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
> >>>    drm/sched: Add generic scheduler message interface
> >>>    drm/sched: Start run wq before TDR in drm_sched_start
> >>>    drm/sched: Submit job before starting TDR
> >>>    drm/sched: Add helper to set TDR timeout
> >>>    drm/syncobj: Warn on long running dma-fences
> >>>
> >>> Thomas Hellström (2):
> >>>    dma-buf/dma-fence: Introduce long-running completion fences
> >>>    drm/sched: Support long-running sched entities
> >>>
> >>>   drivers/dma-buf/dma-fence.c                 | 142 +++++++---
> >>>   drivers/dma-buf/dma-resv.c                  |   5 +
> >>>   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
> >>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
> >>>   drivers/gpu/drm/drm_syncobj.c               |   5 +-
> >>>   drivers/gpu/drm/etnaviv/etnaviv_sched.c     |   5 +-
> >>>   drivers/gpu/drm/lima/lima_sched.c           |   5 +-
> >>>   drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
> >>>   drivers/gpu/drm/msm/msm_ringbuffer.c        |   5 +-
> >>>   drivers/gpu/drm/panfrost/panfrost_job.c     |   5 +-
> >>>   drivers/gpu/drm/scheduler/sched_entity.c    | 127 +++++++--
> >>>   drivers/gpu/drm/scheduler/sched_fence.c     |   6 +-
> >>>   drivers/gpu/drm/scheduler/sched_main.c      | 278 +++++++++++++++-----
> >>>   drivers/gpu/drm/v3d/v3d_sched.c             |  25 +-
> >>>   include/drm/gpu_scheduler.h                 | 130 +++++++--
> >>>   include/linux/dma-fence.h                   |  60 ++++-
> >>>   16 files changed, 649 insertions(+), 184 deletions(-)
> >>>
> >>> --
> >>> 2.34.1


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 02/10] drm/sched: Move schedule policy to scheduler / entity
  2023-04-05 17:37   ` Luben Tuikov
@ 2023-04-05 18:29     ` Matthew Brost
  0 siblings, 0 replies; 87+ messages in thread
From: Matthew Brost @ 2023-04-05 18:29 UTC (permalink / raw)
  To: Luben Tuikov
  Cc: robdclark, airlied, lina, dri-devel, christian.koenig,
	boris.brezillon, intel-xe, faith.ekstrand

On Wed, Apr 05, 2023 at 01:37:22PM -0400, Luben Tuikov wrote:
> Hi,
> 
> Inlined:
> 

Thanks for the feedback.

> On 2023-04-03 20:22, Matthew Brost wrote:
> > Rather than a global modparam for scheduling policy, move the scheduling
> > policy to scheduler / entity so user can control each scheduler / entity
> > policy.
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  1 +
> >  drivers/gpu/drm/etnaviv/etnaviv_sched.c    |  3 ++-
> >  drivers/gpu/drm/lima/lima_sched.c          |  3 ++-
> >  drivers/gpu/drm/msm/msm_ringbuffer.c       |  3 ++-
> >  drivers/gpu/drm/panfrost/panfrost_job.c    |  3 ++-
> >  drivers/gpu/drm/scheduler/sched_entity.c   | 25 ++++++++++++++++++----
> >  drivers/gpu/drm/scheduler/sched_main.c     | 21 +++++++++++++-----
> >  drivers/gpu/drm/v3d/v3d_sched.c            | 15 ++++++++-----
> >  include/drm/gpu_scheduler.h                | 23 ++++++++++++++------
> >  9 files changed, 73 insertions(+), 24 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index 00c9c03c8f94..4df0fca5a74c 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -2368,6 +2368,7 @@ static int amdgpu_device_init_schedulers(struct amdgpu_device *adev)
> >  				   ring->num_hw_submission, amdgpu_job_hang_limit,
> >  				   timeout, adev->reset_domain->wq,
> >  				   ring->sched_score, ring->name,
> > +				   DRM_SCHED_POLICY_DEFAULT,
> >  				   adev->dev);
> >  		if (r) {
> >  			DRM_ERROR("Failed to create scheduler on ring %s.\n",
> > diff --git a/drivers/gpu/drm/etnaviv/etnaviv_sched.c b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
> > index 8486a2923f1b..61204a3f8b0b 100644
> > --- a/drivers/gpu/drm/etnaviv/etnaviv_sched.c
> > +++ b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
> > @@ -136,7 +136,8 @@ int etnaviv_sched_init(struct etnaviv_gpu *gpu)
> >  	ret = drm_sched_init(&gpu->sched, &etnaviv_sched_ops, NULL,
> >  			     etnaviv_hw_jobs_limit, etnaviv_job_hang_limit,
> >  			     msecs_to_jiffies(500), NULL, NULL,
> > -			     dev_name(gpu->dev), gpu->dev);
> > +			     dev_name(gpu->dev), DRM_SCHED_POLICY_DEFAULT,
> > +			     gpu->dev);
> >  	if (ret)
> >  		return ret;
> >  
> > diff --git a/drivers/gpu/drm/lima/lima_sched.c b/drivers/gpu/drm/lima/lima_sched.c
> > index 54f53bece27c..33042ba6ae93 100644
> > --- a/drivers/gpu/drm/lima/lima_sched.c
> > +++ b/drivers/gpu/drm/lima/lima_sched.c
> > @@ -491,7 +491,8 @@ int lima_sched_pipe_init(struct lima_sched_pipe *pipe, const char *name)
> >  	return drm_sched_init(&pipe->base, &lima_sched_ops, NULL, 1,
> >  			      lima_job_hang_limit,
> >  			      msecs_to_jiffies(timeout), NULL,
> > -			      NULL, name, pipe->ldev->dev);
> > +			      NULL, name, DRM_SCHED_POLICY_DEFAULT,
> > +			      pipe->ldev->dev);
> >  }
> >  
> >  void lima_sched_pipe_fini(struct lima_sched_pipe *pipe)
> > diff --git a/drivers/gpu/drm/msm/msm_ringbuffer.c b/drivers/gpu/drm/msm/msm_ringbuffer.c
> > index 5879fc262047..f408a9097315 100644
> > --- a/drivers/gpu/drm/msm/msm_ringbuffer.c
> > +++ b/drivers/gpu/drm/msm/msm_ringbuffer.c
> > @@ -97,7 +97,8 @@ struct msm_ringbuffer *msm_ringbuffer_new(struct msm_gpu *gpu, int id,
> >  
> >  	ret = drm_sched_init(&ring->sched, &msm_sched_ops, NULL,
> >  			num_hw_submissions, 0, sched_timeout,
> > -			NULL, NULL, to_msm_bo(ring->bo)->name, gpu->dev->dev);
> > +			NULL, NULL, to_msm_bo(ring->bo)->name,
> > +			DRM_SCHED_POLICY_DEFAULT, gpu->dev->dev);
> >  	if (ret) {
> >  		goto fail;
> >  	}
> > diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
> > index f48b07056a16..effa48b33dce 100644
> > --- a/drivers/gpu/drm/panfrost/panfrost_job.c
> > +++ b/drivers/gpu/drm/panfrost/panfrost_job.c
> > @@ -819,7 +819,8 @@ int panfrost_job_init(struct panfrost_device *pfdev)
> >  				     nentries, 0,
> >  				     msecs_to_jiffies(JOB_TIMEOUT_MS),
> >  				     pfdev->reset.wq,
> > -				     NULL, "pan_js", pfdev->dev);
> > +				     NULL, "pan_js", DRM_SCHED_POLICY_DEFAULT,
> > +				     pfdev->dev);
> >  		if (ret) {
> >  			dev_err(pfdev->dev, "Failed to create scheduler: %d.", ret);
> >  			goto err_sched;
> > diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> > index 15d04a0ec623..f1299e51860b 100644
> > --- a/drivers/gpu/drm/scheduler/sched_entity.c
> > +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> > @@ -33,6 +33,20 @@
> >  #define to_drm_sched_job(sched_job)		\
> >  		container_of((sched_job), struct drm_sched_job, queue_node)
> >  
> > +static bool bad_policies(struct drm_gpu_scheduler **sched_list,
> > +			 unsigned int num_sched_list)
> > +{
> > +	enum drm_sched_policy sched_policy = sched_list[0]->sched_policy;
> > +	unsigned int i;
> > +
> > +	/* All scdedule policies must match */
> > +	for (i = 1; i < num_sched_list; ++i)
> > +		if (sched_policy != sched_list[i]->sched_policy)
> > +			return true;
> > +
> > +	return false;
> > +}
> > +
> >  /**
> >   * drm_sched_entity_init - Init a context entity used by scheduler when
> >   * submit to HW ring.
> > @@ -62,7 +76,8 @@ int drm_sched_entity_init(struct drm_sched_entity *entity,
> >  			  unsigned int num_sched_list,
> >  			  atomic_t *guilty)
> >  {
> > -	if (!(entity && sched_list && (num_sched_list == 0 || sched_list[0])))
> > +	if (!(entity && sched_list && (num_sched_list == 0 || sched_list[0])) ||
> > +	    bad_policies(sched_list, num_sched_list))
> >  		return -EINVAL;
> >  
> >  	memset(entity, 0, sizeof(struct drm_sched_entity));
> > @@ -75,8 +90,10 @@ int drm_sched_entity_init(struct drm_sched_entity *entity,
> >  	entity->last_scheduled = NULL;
> >  	RB_CLEAR_NODE(&entity->rb_tree_node);
> >  
> > -	if(num_sched_list)
> > +	if(num_sched_list) {
> >  		entity->rq = &sched_list[0]->sched_rq[entity->priority];
> > +		entity->sched_policy = sched_list[0]->sched_policy;
> > +	}
> >  
> >  	init_completion(&entity->entity_idle);
> >  
> > @@ -440,7 +457,7 @@ struct drm_sched_job *drm_sched_entity_pop_job(struct drm_sched_entity *entity)
> >  	 * Update the entity's location in the min heap according to
> >  	 * the timestamp of the next job, if any.
> >  	 */
> > -	if (drm_sched_policy == DRM_SCHED_POLICY_FIFO) {
> > +	if (entity->sched_policy == DRM_SCHED_POLICY_FIFO) {
> 
> The entity (context) shouldn't have the "sched_policy" property.
> That property belongs only to the scheduler.
> 

Sure. Will have to drop the union of sched_main & rq then.
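
Concretely, the FIFO checks would then read the policy through the
run-queue's scheduler instead of a cached copy, something like this
rough sketch (assuming entity->rq is still valid at these call sites,
i.e. the union is dropped as mentioned above):

	if (entity->rq->sched->sched_policy == DRM_SCHED_POLICY_FIFO) {
		struct drm_sched_job *next;

		next = to_drm_sched_job(spsc_queue_peek(&entity->job_queue));
		if (next)
			drm_sched_rq_update_fifo(entity, next->submit_ts);
	}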

> >  		struct drm_sched_job *next;
> >  
> >  		next = to_drm_sched_job(spsc_queue_peek(&entity->job_queue));
> > @@ -528,7 +545,7 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
> >  		drm_sched_rq_add_entity(entity->rq, entity);
> >  		spin_unlock(&entity->rq_lock);
> >  
> > -		if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
> > +		if (entity->sched_policy == DRM_SCHED_POLICY_FIFO)
> >  			drm_sched_rq_update_fifo(entity, sched_job->submit_ts);
> >  
> >  		drm_sched_wakeup(entity->rq->sched);
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > index 808008990721..77894976fa55 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -62,14 +62,14 @@
> >  #define to_drm_sched_job(sched_job)		\
> >  		container_of((sched_job), struct drm_sched_job, queue_node)
> >  
> > -int drm_sched_policy = DRM_SCHED_POLICY_FIFO;
> > +int default_drm_sched_policy = DRM_SCHED_POLICY_FIFO;
> >  
> >  /**
> >   * DOC: sched_policy (int)
> >   * Used to override default entities scheduling policy in a run queue.
> >   */
> >  MODULE_PARM_DESC(sched_policy, "Specify the scheduling policy for entities on a run-queue, " __stringify(DRM_SCHED_POLICY_RR) " = Round Robin, " __stringify(DRM_SCHED_POLICY_FIFO) " = FIFO (default).");
> > -module_param_named(sched_policy, drm_sched_policy, int, 0444);
> > +module_param_named(sched_policy, default_drm_sched_policy, int, 0444);
> >  
> >  static __always_inline bool drm_sched_entity_compare_before(struct rb_node *a,
> >  							    const struct rb_node *b)
> > @@ -173,7 +173,7 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
> >  	if (rq->current_entity == entity)
> >  		rq->current_entity = NULL;
> >  
> > -	if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
> > +	if (entity->sched_policy == DRM_SCHED_POLICY_FIFO)
> >  		drm_sched_rq_remove_fifo_locked(entity);
> >  
> >  	spin_unlock(&rq->lock);
> > @@ -931,7 +931,7 @@ drm_sched_select_entity(struct drm_gpu_scheduler *sched)
> >  
> >  	/* Kernel run queue has higher priority than normal run queue*/
> >  	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
> > -		entity = drm_sched_policy == DRM_SCHED_POLICY_FIFO ?
> > +		entity = sched->sched_policy == DRM_SCHED_POLICY_FIFO ?
> >  			drm_sched_rq_select_entity_fifo(&sched->sched_rq[i]) :
> >  			drm_sched_rq_select_entity_rr(&sched->sched_rq[i]);
> >  		if (entity)
> > @@ -1106,6 +1106,7 @@ static void drm_sched_main(struct work_struct *w)
> >   *		used
> >   * @score: optional score atomic shared with other schedulers
> >   * @name: name used for debugging
> > + * @sched_policy: schedule policy
> >   * @dev: target &struct device
> >   *
> >   * Return 0 on success, otherwise error code.
> > @@ -1115,9 +1116,15 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
> >  		   struct workqueue_struct *run_wq,
> >  		   unsigned hw_submission, unsigned hang_limit,
> >  		   long timeout, struct workqueue_struct *timeout_wq,
> > -		   atomic_t *score, const char *name, struct device *dev)
> > +		   atomic_t *score, const char *name,
> > +		   enum drm_sched_policy sched_policy,
> > +		   struct device *dev)
> >  {
> >  	int i;
> > +
> > +	if (sched_policy >= DRM_SCHED_POLICY_MAX)
> > +		return -EINVAL;
> > +
> >  	sched->ops = ops;
> >  	sched->hw_submission_limit = hw_submission;
> >  	sched->name = name;
> > @@ -1127,6 +1134,10 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
> >  	sched->hang_limit = hang_limit;
> >  	sched->score = score ? score : &sched->_score;
> >  	sched->dev = dev;
> > +	if (sched_policy == DRM_SCHED_POLICY_DEFAULT)
> > +		sched->sched_policy = default_drm_sched_policy;
> > +	else
> > +		sched->sched_policy = sched_policy;
> >  	for (i = DRM_SCHED_PRIORITY_MIN; i < DRM_SCHED_PRIORITY_COUNT; i++)
> >  		drm_sched_rq_init(sched, &sched->sched_rq[i]);
> >  
> > diff --git a/drivers/gpu/drm/v3d/v3d_sched.c b/drivers/gpu/drm/v3d/v3d_sched.c
> > index 38e092ea41e6..5e3fe77fa991 100644
> > --- a/drivers/gpu/drm/v3d/v3d_sched.c
> > +++ b/drivers/gpu/drm/v3d/v3d_sched.c
> > @@ -391,7 +391,8 @@ v3d_sched_init(struct v3d_dev *v3d)
> >  			     &v3d_bin_sched_ops, NULL,
> >  			     hw_jobs_limit, job_hang_limit,
> >  			     msecs_to_jiffies(hang_limit_ms), NULL,
> > -			     NULL, "v3d_bin", v3d->drm.dev);
> > +			     NULL, "v3d_bin", DRM_SCHED_POLICY_DEFAULT,
> > +			     v3d->drm.dev);
> >  	if (ret)
> >  		return ret;
> >  
> > @@ -399,7 +400,8 @@ v3d_sched_init(struct v3d_dev *v3d)
> >  			     &v3d_render_sched_ops, NULL,
> >  			     hw_jobs_limit, job_hang_limit,
> >  			     msecs_to_jiffies(hang_limit_ms), NULL,
> > -			     NULL, "v3d_render", v3d->drm.dev);
> > +			     NULL, "v3d_render", DRM_SCHED_POLICY_DEFAULT,
> > +			     v3d->drm.dev);
> >  	if (ret)
> >  		goto fail;
> >  
> > @@ -407,7 +409,8 @@ v3d_sched_init(struct v3d_dev *v3d)
> >  			     &v3d_tfu_sched_ops, NULL,
> >  			     hw_jobs_limit, job_hang_limit,
> >  			     msecs_to_jiffies(hang_limit_ms), NULL,
> > -			     NULL, "v3d_tfu", v3d->drm.dev);
> > +			     NULL, "v3d_tfu", DRM_SCHED_POLICY_DEFAULT,
> > +			     v3d->drm.dev);
> >  	if (ret)
> >  		goto fail;
> >  
> > @@ -416,7 +419,8 @@ v3d_sched_init(struct v3d_dev *v3d)
> >  				     &v3d_csd_sched_ops, NULL,
> >  				     hw_jobs_limit, job_hang_limit,
> >  				     msecs_to_jiffies(hang_limit_ms), NULL,
> > -				     NULL, "v3d_csd", v3d->drm.dev);
> > +				     NULL, "v3d_csd", DRM_SCHED_POLICY_DEFAULT,
> > +				     v3d->drm.dev);
> >  		if (ret)
> >  			goto fail;
> >  
> > @@ -424,7 +428,8 @@ v3d_sched_init(struct v3d_dev *v3d)
> >  				     &v3d_cache_clean_sched_ops, NULL,
> >  				     hw_jobs_limit, job_hang_limit,
> >  				     msecs_to_jiffies(hang_limit_ms), NULL,
> > -				     NULL, "v3d_cache_clean", v3d->drm.dev);
> > +				     NULL, "v3d_cache_clean",
> > +				     DRM_SCHED_POLICY_DEFAULT, v3d->drm.dev);
> >  		if (ret)
> >  			goto fail;
> >  	}
> > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> > index 98fb5f85eba6..39cb72b7fe5d 100644
> > --- a/include/drm/gpu_scheduler.h
> > +++ b/include/drm/gpu_scheduler.h
> > @@ -72,11 +72,15 @@ enum drm_sched_priority {
> >  	DRM_SCHED_PRIORITY_UNSET = -2
> >  };
> >  
> > -/* Used to chose between FIFO and RR jobs scheduling */
> > -extern int drm_sched_policy;
> > -
> > -#define DRM_SCHED_POLICY_RR    0
> > -#define DRM_SCHED_POLICY_FIFO  1
> > +/* Used to chose default scheduling policy*/
> > +extern int default_drm_sched_policy;
> > +
> > +enum drm_sched_policy {
> > +	DRM_SCHED_POLICY_DEFAULT,
> > +	DRM_SCHED_POLICY_RR,
> > +	DRM_SCHED_POLICY_FIFO,
> > +	DRM_SCHED_POLICY_MAX,
> > +};
> 
> Please don't use MAX. It is very confusing, as maximum and minimum values
> are values which can be attained, in literature and common use.
> For instance, "the maximum temperature today is 287K, also expect rains"
> means that that temperature will actually be attained.
> 
> Use DRM_SCHED_POLICY_COUNT for instance, since for 0-based indexing,
> as that of C enums, the last element in the set is in fact the number of
> elements, i.e. the count of the set. (_NUM is also bad as it means
> "number" which could really be anything.)
> 
> So using DRM_SCHED_POLICY_COUNT is most clear.
>

Got it, will change.
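
i.e. just illustrating the rename, nothing else changes:

enum drm_sched_policy {
	DRM_SCHED_POLICY_DEFAULT,
	DRM_SCHED_POLICY_RR,
	DRM_SCHED_POLICY_FIFO,
	DRM_SCHED_POLICY_COUNT,
};

and the sanity check in drm_sched_init() becomes:

	if (sched_policy >= DRM_SCHED_POLICY_COUNT)
		return -EINVAL;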
 
> >  
> >  /**
> >   * struct drm_sched_entity - A wrapper around a job queue (typically
> > @@ -217,6 +221,9 @@ struct drm_sched_entity {
> >  	 */
> >  	bool 				stopped;
> >  
> > +	/** @sched_policy: Schedule policy for entity */
> > +	enum drm_sched_policy		sched_policy;
> > +
> 
> This creates data redundancy. "sched_policy" should only be found
> in the drm_gpu_scheduler structure. The context's tasks then get to run
> on a scheduler with such and such priority. We shouldn't have this here,
> only in drm_gpu_scheduler structure.
> 

Addressed above, will do.

Matt

> Regards,
> Luben
> 
> >  	/**
> >  	 * @entity_idle:
> >  	 *
> > @@ -489,6 +496,7 @@ struct drm_sched_backend_ops {
> >   *              guilty and it will no longer be considered for scheduling.
> >   * @score: score to help loadbalancer pick a idle sched
> >   * @_score: score used when the driver doesn't provide one
> > + * @sched_policy: Schedule policy for scheduler
> >   * @ready: marks if the underlying HW is ready to work
> >   * @free_guilty: A hit to time out handler to free the guilty job.
> >   * @pause_run_wq: pause queuing of @work_run on @run_wq
> > @@ -514,6 +522,7 @@ struct drm_gpu_scheduler {
> >  	int				hang_limit;
> >  	atomic_t                        *score;
> >  	atomic_t                        _score;
> > +	enum drm_sched_policy		sched_policy;
> >  	bool				ready;
> >  	bool				free_guilty;
> >  	bool				pause_run_wq;
> > @@ -525,7 +534,9 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
> >  		   struct workqueue_struct *run_wq,
> >  		   uint32_t hw_submission, unsigned hang_limit,
> >  		   long timeout, struct workqueue_struct *timeout_wq,
> > -		   atomic_t *score, const char *name, struct device *dev);
> > +		   atomic_t *score, const char *name,
> > +		   enum drm_sched_policy sched_policy,
> > +		   struct device *dev);
> >  
> >  void drm_sched_fini(struct drm_gpu_scheduler *sched);
> >  int drm_sched_job_init(struct drm_sched_job *job,
> 

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans
  2023-04-05 18:06       ` Zeng, Oak
@ 2023-04-05 18:53         ` Matthew Brost
  2023-04-06 10:04           ` Christian König
  2023-04-07  0:20           ` Zeng, Oak
  0 siblings, 2 replies; 87+ messages in thread
From: Matthew Brost @ 2023-04-05 18:53 UTC (permalink / raw)
  To: Zeng, Oak
  Cc: robdclark, airlied, lina, dri-devel, Christian König,
	boris.brezillon, Vetter,  Daniel, intel-xe, faith.ekstrand

On Wed, Apr 05, 2023 at 12:06:53PM -0600, Zeng, Oak wrote:
> Hi,
> 
> Using dma-fence for completion/dependency tracking for long-run workloads (more precisely, on-demand paging / page-fault-enabled workloads) can cause deadlock. This seems to be the significant issue here. Other issues, such as the drm scheduler completion order implication etc., are minor and can be solved inside the framework of the drm scheduler. We need to evaluate the paths below:
> 
> 	1) Still use the drm scheduler for job submission, and use dma-fence for job completion waiting/dependency tracking. This is the solution proposed in this series. Annotate dma-fences for long-run workloads: user space can still wait on a dma-fence for job completion but can't wait on a dma-fence while holding any memory management locks. We still use dma-fence for dependency tracking. But it is just very easy to run into deadlock when on-demand paging is in the picture. The annotation helps us to detect deadlock but not solve deadlock problems. Seems *not* a complete solution: it is almost impossible to completely avoid dependency deadlocks in a complex runtime environment.
>

No one can wait on an LR fence, so it is impossible to deadlock. The
annotations enforce this. Literally this is only for flow controlling the
ring / holding pending jobs in the DRM scheduler's list.
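
The enforcement boils down to a check like the following (rough sketch
only, the flag name here is illustrative rather than what the patches
literally add) in every path that matters for the cross-driver rules,
e.g. attaching to a dma-resv object, the generic wait helpers and
sync_file / syncobj export:

	if (WARN_ON_ONCE(test_bit(DMA_FENCE_FLAG_LR_BIT, &fence->flags)))
		return -EINVAL;

so nothing outside the scheduler can ever end up waiting on an LR fence.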

> 	2) Still use the drm scheduler but don't use dma-fence for completion signaling and dependency tracking. This way we still get some free functions (reset, error handling, ring flow control as Matt said) from the drm scheduler, but push the dependency/completion tracking completely to user space using techniques such as user space fences. User space doesn't have a chance to wait on a fence while holding a kernel memory management lock, thus the dma-fence deadlock issue is solved.
>

We use user space fences for syncs.

> 	3) Completely discard the drm scheduler and dma-fence for long-run workloads. Use a user queue/doorbell for super fast submission, directly interacting with the fw scheduler. Use user fences for completion/dependency tracking.
> 

This is a hard no from me, I want 1 submission path in Xe. Either we use
the DRM scheduler or we don't.

Matt

> Thanks,
> Oak
> 
> > -----Original Message-----
> > From: Christian König <christian.koenig@amd.com>
> > Sent: April 5, 2023 3:30 AM
> > To: Brost, Matthew <matthew.brost@intel.com>; Zeng, Oak
> > <oak.zeng@intel.com>
> > Cc: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org;
> > robdclark@chromium.org; thomas.hellstrom@linux.intel.com; airlied@linux.ie;
> > lina@asahilina.net; boris.brezillon@collabora.com; faith.ekstrand@collabora.com
> > Subject: Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
> > plans
> > 
> > Am 04.04.23 um 20:08 schrieb Matthew Brost:
> > > On Tue, Apr 04, 2023 at 12:02:03PM -0600, Zeng, Oak wrote:
> > >> Hi Matt, Thomas,
> > >>
> > >> Some very bold out of box thinking in this area:
> > >>
> > >> 1. so you want to use drm scheduler and dma-fence for long running workload.
> > Why you want to do this in the first place? What is the benefit? Drm scheduler is
> > pretty much a software scheduler. Modern gpu has scheduler built at fw/hw
> > level, as you said below for intel this is Guc. Can xe driver just directly submit job
> > to Guc, bypassing drm scheduler?
> > >>
> > > If we did that now we have 2 paths for dependency track, flow controling
> > > the ring, resets / error handling / backend submission implementations.
> > > We don't want this.
> > 
> > Well exactly that's the point: Why?
> > 
> > As far as I can see that are two completely distinct use cases, so you
> > absolutely do want two completely distinct implementations for this.
> > 
> > >> 2. using dma-fence for long run workload: I am well aware that page fault (and
> > the consequent memory allocation/lock acquiring to fix the fault) can cause
> > deadlock for a dma-fence wait. But I am not convinced that dma-fence can't be
> > used purely because the nature of the workload that it runs very long (indefinite).
> > I did a math: the dma_fence_wait_timeout function's third param is the timeout
> > which is a signed long type. If HZ is 1000, this is about 23 days. If 23 days is not long
> > enough, can we just change the timeout parameter to signed 64 bits so it is much
> > longer than our life time...
> > >>
> > >> So I mainly argue we can't use dma-fence for long-run workload is not
> > because the workload runs very long, rather because of the fact that we use
> > page fault for long-run workload. If we enable page fault for short-run workload,
> > we can't use dma-fence either. Page fault is the key thing here.
> > >>
> > >> Now since we use page fault which is *fundamentally* controversial with
> > dma-fence design, why now just introduce a independent concept such as user-
> > fence instead of extending existing dma-fence?
> > >>
> > >> I like unified design. If drm scheduler, dma-fence can be extended to work for
> > everything, it is beautiful. But seems we have some fundamental problem here.
> > >>
> > > Thomas's patches turn a dma-fence into KMD sync point (e.g. we just use
> > > the signal / CB infrastructure) and enforce we don't use use these
> > > dma-fences from the scheduler in memory reclaim paths or export these to
> > > user space or other drivers. Think of this mode as SW only fence.
> > 
> > Yeah and I truly think this is an really bad idea.
> > 
> > The signal/CB infrastructure in the dma_fence turned out to be the
> > absolutely nightmare I initially predicted. Sorry to say that, but in
> > this case the "I've told you so" is appropriate in my opinion.
> > 
> > If we need infrastructure for long running dependency tracking we should
> > encapsulate that in a new framework and not try to mangle the existing
> > code for something it was never intended for.
> > 
> > Christian.
> > 
> > >
> > > Matt
> > >
> > >> Thanks,
> > >> Oak
> > >>
> > >>> -----Original Message-----
> > >>> From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of
> > >>> Matthew Brost
> > >>> Sent: April 3, 2023 8:22 PM
> > >>> To: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org
> > >>> Cc: robdclark@chromium.org; thomas.hellstrom@linux.intel.com;
> > airlied@linux.ie;
> > >>> lina@asahilina.net; boris.brezillon@collabora.com; Brost, Matthew
> > >>> <matthew.brost@intel.com>; christian.koenig@amd.com;
> > >>> faith.ekstrand@collabora.com
> > >>> Subject: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
> > plans
> > >>>
> > >>> Hello,
> > >>>
> > >>> As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> > >>> have been asked to merge our common DRM scheduler patches first as well
> > >>> as develop a common solution for long running workloads with the DRM
> > >>> scheduler. This RFC series is our first attempt at doing this. We
> > >>> welcome any and all feedback.
> > >>>
> > >>> This can we thought of as 4 parts detailed below.
> > >>>
> > >>> - DRM scheduler changes for 1 to 1 relationship between scheduler and
> > >>> entity (patches 1-3)
> > >>>
> > >>> In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> > >>> GuC) which is a new paradigm WRT to the DRM scheduler and presents
> > >>> severals problems as the DRM was originally designed to schedule jobs on
> > >>> hardware queues. The main problem being that DRM scheduler expects the
> > >>> submission order of jobs to be the completion order of jobs even across
> > >>> multiple entities. This assumption falls apart with a firmware scheduler
> > >>> as a firmware scheduler has no concept of jobs and jobs can complete out
> > >>> of order. A novel solution for was originally thought of by Faith during
> > >>> the initial prototype of Xe, create a 1 to 1 relationship between scheduler
> > >>> and entity. I believe the AGX driver [3] is using this approach and
> > >>> Boris may use approach as well for the Mali driver [4].
> > >>>
> > >>> To support a 1 to 1 relationship we move the main execution function
> > >>> from a kthread to a work queue and add a new scheduling mode which
> > >>> bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
> > >>> The new scheduling mode should unify all drivers usage with a 1 to 1
> > >>> relationship and can be thought of as using scheduler as a dependency /
> > >>> infligt job tracker rather than a true scheduler.
> > >>>
> > >>> - Generic messaging interface for DRM scheduler
> > >>>
> > >>> Idea is to be able to communicate to the submission backend with in band
> > >>> (relative to main execution function) messages. Messages are backend
> > >>> defined and flexable enough for any use case. In Xe we use these
> > >>> messages to clean up entites, set properties for entites, and suspend /
> > >>> resume execution of an entity [5]. I suspect other driver can leverage
> > >>> this messaging concept too as it a convenient way to avoid races in the
> > >>> backend.
> > >>>
> > >>> - Support for using TDR for all error paths of a scheduler / entity
> > >>>
> > >>> Fix a few races / bugs, add function to dynamically set the TDR timeout.
> > >>>
> > >>> - Annotate dma-fences for long running workloads.
> > >>>
> > >>> The idea here is to use dma-fences only as sync points within the
> > >>> scheduler and never export them for long running workloads. By
> > >>> annotating these fences as long running we ensure that these dma-fences
> > >>> are never used in a way that breaks the dma-fence rules. A benefit of
> > >>> thus approach is the scheduler can still safely flow control the
> > >>> execution ring buffer via the job limit without breaking the dma-fence
> > >>> rules.
> > >>>
> > >>> Again this a first draft and looking forward to feedback.
> > >>>
> > >>> Enjoy - Matt
> > >>>
> > >>> [1] https://gitlab.freedesktop.org/drm/xe/kernel
> > >>> [2] https://patchwork.freedesktop.org/series/112188/
> > >>> [3] https://patchwork.freedesktop.org/series/114772/
> > >>> [4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1
> > >>> [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-
> > >>> next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031
> > >>>
> > >>> Matthew Brost (8):
> > >>>    drm/sched: Convert drm scheduler to use a work queue rather than
> > >>>      kthread
> > >>>    drm/sched: Move schedule policy to scheduler / entity
> > >>>    drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
> > >>>    drm/sched: Add generic scheduler message interface
> > >>>    drm/sched: Start run wq before TDR in drm_sched_start
> > >>>    drm/sched: Submit job before starting TDR
> > >>>    drm/sched: Add helper to set TDR timeout
> > >>>    drm/syncobj: Warn on long running dma-fences
> > >>>
> > >>> Thomas Hellström (2):
> > >>>    dma-buf/dma-fence: Introduce long-running completion fences
> > >>>    drm/sched: Support long-running sched entities
> > >>>
> > >>>   drivers/dma-buf/dma-fence.c                 | 142 +++++++---
> > >>>   drivers/dma-buf/dma-resv.c                  |   5 +
> > >>>   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
> > >>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
> > >>>   drivers/gpu/drm/drm_syncobj.c               |   5 +-
> > >>>   drivers/gpu/drm/etnaviv/etnaviv_sched.c     |   5 +-
> > >>>   drivers/gpu/drm/lima/lima_sched.c           |   5 +-
> > >>>   drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
> > >>>   drivers/gpu/drm/msm/msm_ringbuffer.c        |   5 +-
> > >>>   drivers/gpu/drm/panfrost/panfrost_job.c     |   5 +-
> > >>>   drivers/gpu/drm/scheduler/sched_entity.c    | 127 +++++++--
> > >>>   drivers/gpu/drm/scheduler/sched_fence.c     |   6 +-
> > >>>   drivers/gpu/drm/scheduler/sched_main.c      | 278 +++++++++++++++-----
> > >>>   drivers/gpu/drm/v3d/v3d_sched.c             |  25 +-
> > >>>   include/drm/gpu_scheduler.h                 | 130 +++++++--
> > >>>   include/linux/dma-fence.h                   |  60 ++++-
> > >>>   16 files changed, 649 insertions(+), 184 deletions(-)
> > >>>
> > >>> --
> > >>> 2.34.1
> 

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences
  2023-04-05 13:09                 ` Daniel Vetter
@ 2023-04-05 23:58                   ` Matthew Brost
  2023-04-06  6:32                     ` Daniel Vetter
  0 siblings, 1 reply; 87+ messages in thread
From: Matthew Brost @ 2023-04-05 23:58 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: robdclark, airlied, lina, Thomas Hellström (Intel),
	dri-devel, Christian König, boris.brezillon, intel-xe,
	faith.ekstrand

On Wed, Apr 05, 2023 at 03:09:08PM +0200, Daniel Vetter wrote:
> On Tue, Apr 04, 2023 at 07:48:27PM +0000, Matthew Brost wrote:
> > On Tue, Apr 04, 2023 at 09:25:52PM +0200, Daniel Vetter wrote:
> > > On Tue, Apr 04, 2023 at 07:02:23PM +0000, Matthew Brost wrote:
> > > > On Tue, Apr 04, 2023 at 08:14:01PM +0200, Thomas Hellström (Intel) wrote:
> > > > > 
> > > > > On 4/4/23 15:10, Christian König wrote:
> > > > > > Am 04.04.23 um 14:54 schrieb Thomas Hellström:
> > > > > > > Hi, Christian,
> > > > > > > 
> > > > > > > On 4/4/23 11:09, Christian König wrote:
> > > > > > > > Am 04.04.23 um 02:22 schrieb Matthew Brost:
> > > > > > > > > From: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > > > > > > > 
> > > > > > > > > For long-running workloads, drivers either need to open-code
> > > > > > > > > completion
> > > > > > > > > waits, invent their own synchronization primitives or internally use
> > > > > > > > > dma-fences that do not obey the cross-driver dma-fence protocol, but
> > > > > > > > > without any lockdep annotation all these approaches are error prone.
> > > > > > > > > 
> > > > > > > > > So since for example the drm scheduler uses dma-fences it is
> > > > > > > > > desirable for
> > > > > > > > > a driver to be able to use it for throttling and error
> > > > > > > > > handling also with
> > > > > > > > > internal dma-fences tha do not obey the cros-driver
> > > > > > > > > dma-fence protocol.
> > > > > > > > > 
> > > > > > > > > Introduce long-running completion fences in form of
> > > > > > > > > dma-fences, and add
> > > > > > > > > lockdep annotation for them. In particular:
> > > > > > > > > 
> > > > > > > > > * Do not allow waiting under any memory management locks.
> > > > > > > > > * Do not allow to attach them to a dma-resv object.
> > > > > > > > > * Introduce a new interface for adding callbacks making the
> > > > > > > > > helper adding
> > > > > > > > >    a callback sign off on that it is aware that the dma-fence may not
> > > > > > > > >    complete anytime soon. Typically this will be the
> > > > > > > > > scheduler chaining
> > > > > > > > >    a new long-running fence on another one.
> > > > > > > > 
> > > > > > > > Well that's pretty much what I tried before:
> > > > > > > > https://lwn.net/Articles/893704/
> > > > > > > > 
> > > > 
> > > > I don't think this quite the same, this explictly enforces that we don't
> > > > break the dma-fence rules (in path of memory allocations, exported in
> > > > any way), essentially this just SW sync point reusing dma-fence the
> > > > infrastructure for signaling / callbacks. I believe your series tried to
> > > > export these fences to user space (admittedly I haven't fully read your
> > > > series).
> > > > 
> > > > In this use case we essentially just want to flow control the ring via
> > > > the dma-scheduler + maintain a list of pending jobs so the TDR can be
> > > > used for cleanup if LR entity encounters an error. To me this seems
> > > > perfectly reasonable but I know dma-femce rules are akin to a holy war.
> > > > 
> > > > If we return NULL in run_job, now we have to be able to sink all jobs
> > > > in the backend regardless on ring space, maintain a list of jobs pending
> > > > for cleanup after errors, and write a different cleanup path as now the
> > > > TDR doesn't work. Seems very, very silly to duplicate all of this code
> > > > when the DRM scheduler provides all of this for us. Also if we go this
> > > > route, now all drivers are going to invent ways to handle LR jobs /w the
> > > > DRM scheduler.
> > > > 
> > > > This solution is pretty clear, mark the scheduler as LR, and don't
> > > > export any fences from the scheduler. If you try to export these fences
> > > > a blow up happens.
> > > 
> > > The problem is if you mix things up. Like for resets you need all the
> > > schedulers on an engine/set-of-engines to quiescent or things get
> > > potentially hilarious. If you now have a scheduler in forever limbo, the
> > > dma_fence guarantees are right out the window.
> > > 
> > 
> > Right, a GT reset on Xe is:
> > 
> > Stop all schedulers
> > Do a reset
> > Ban any schedulers which we think caused the GT reset
> > Resubmit all schedulers which we think were good
> > Restart all schedulers
> > 
> > None of this flow depends on LR dma-fences, all of this uses the DRM
> > sched infrastructure and work very well compared to the i915. Rewriting
> > all this with a driver specific implementation is what we are trying to
> > avoid.
> > 
> > Similarly if LR entity hangs on its own (not a GT reset, rather the
> > firmware does the reset for us) we use all the DRM scheduler
> > infrastructure to handle this. Again this works rather well...
> 
> Yeah this is why I don't think duplicating everything that long-running
> jobs need makes any sense. iow I agree with you.
> 

Glad we agree.

> > > But the issue you're having is fairly specific if it's just about
> > > ringspace. I think the dumbest fix is to just block in submit if you run
> > > out of per-ctx ringspace, and call it a day. This notion that somehow the
> > 
> > How does that not break the dma-fence rules? A job can publish its
> > finished fence after ARM, if the finished fence fence waits on ring
> > space that may not free up in a reasonable amount of time we now have
> > broken the dma-dence rules. My understanding is any dma-fence must only
> > on other dma-fence, Christian seems to agree and NAK'd just blocking if
> > no space available [1]. IMO this series ensures we don't break dma-fence
> > rules by restricting how the finished fence can be used.
> 
> Oh I meant in the submit ioctl, _before_ you even call
> drm_sched_job_arm(). It's ok to block in there indefinitely.
>

Ok, but how do we determine if there is ring space? By waiting on an
xe_hw_fence, which is a dma-fence. We just move a wait from the scheduler
to the exec IOCTL and I really fail to see the point of that.

> > > kernel is supposed to provide a bottomless queue of anything userspace
> > > submits simply doesn't hold up in reality (as much as userspace standards
> > > committees would like it to), and as long as it doesn't have a real-world
> > > perf impact it doesn't really matter why we end up blocking in the submit
> > > ioctl. It might also be a simple memory allocation that hits a snag in
> > > page reclaim.
> > > 
> > > > > > > > And the reasons why it was rejected haven't changed.
> > > > > > > > 
> > > > > > > > Regards,
> > > > > > > > Christian.
> > > > > > > > 
> > > > > > > Yes, TBH this was mostly to get discussion going how we'd best
> > > > > > > tackle this problem while being able to reuse the scheduler for
> > > > > > > long-running workloads.
> > > > > > > 
> > > > > > > I couldn't see any clear decision on your series, though, but one
> > > > > > > main difference I see is that this is intended for driver-internal
> > > > > > > use only. (I'm counting using the drm_scheduler as a helper for
> > > > > > > driver-private use). This is by no means a way to try tackle the
> > > > > > > indefinite fence problem.
> > > > > > 
> > > > > > Well this was just my latest try to tackle this, but essentially the
> > > > > > problems are the same as with your approach: When we express such
> > > > > > operations as dma_fence there is always the change that we leak that
> > > > > > somewhere.
> > > > > > 
> > > > > > My approach of adding a flag noting that this operation is dangerous and
> > > > > > can't be synced with something memory management depends on tried to
> > > > > > contain this as much as possible, but Daniel still pretty clearly
> > > > > > rejected it (for good reasons I think).
> > > > > > 
> > > > > > > 
> > > > > > > We could ofc invent a completely different data-type that abstracts
> > > > > > > the synchronization the scheduler needs in the long-running case, or
> > > > > > > each driver could hack something up, like sleeping in the
> > > > > > > prepare_job() or run_job() callback for throttling, but those waits
> > > > > > > should still be annotated in one way or annotated one way or another
> > > > > > > (and probably in a similar way across drivers) to make sure we don't
> > > > > > > do anything bad.
> > > > > > > 
> > > > > > >  So any suggestions as to what would be the better solution here
> > > > > > > would be appreciated.
> > > > > > 
> > > > > > Mhm, do we really the the GPU scheduler for that?
> > > > > > 
> > > > 
> > > > I think we need to solve this within the DRM scheduler one way or
> > > > another.
> > > 
> > > Yeah so if we conclude that the queue really must be bottomless then I
> > > agree drm-sched should help out sort out the mess. Because I'm guessing
> > > that every driver will have this issue. But that's a big if.
> > > 
> > > I guess if we teach the drm scheduler that some jobs are fairly endless
> > > then maybe it wouldn't be too far-fetched to also teach it to wait for a
> > > previous one to finish (but not with the dma_fence that preempts, which we
> > > put into the dma_resv for memory management, but some other struct
> > > completion). The scheduler already has a concept of not stuffing too much
> > > stuff into the same queue after all, so this should fit?
> > 
> > See above, exact same situation as spinning on flow controling the ring,
> > this IMO absolutely breaks the dma-fence rules. IMO the correct solution
> > is to have a DRM that doesn't export dma-fences, this is exactly what
> > this series does as if we try to, boom lockdep / warn on blow up.
> 
> I dont think it's impossible to do this correctly, but definitely very,
> very hard. Which is why neither Christian nor me like the idea :-)
> 
> Essentially you'd have to make sure that any indefinite way will still
> react to drm_sched_job, so that you're not holding up a gt reset or
> anything like that, but only ever hold up forward progress for this
> specific scheduler/drm_sched_entity. Which you can do as long (and again,
> another hugely tricky detail) you still obey the preempt-ctx dma_fence and
> manage to preempt the underlying long-running ctx even when the drm/sched
> is stuck waiting for an indefinite fence (like waiting for ringspace or
> something like that).
> 
> So I don't think it's impossible, but very far away from "a good idea" :-)
> 
> Hence to proposal to bail out of this entire mess by throwing EWOULDBLCK
> back to userspace directly from the ioctl function, where you still can do
> that without breaking any dma_fence rules. Or if it's not a case that
> matters in practice, simply block in the ioctl handler instead of
> returning EWOULDBLCK.

Returning EWOULDBLOCK on a full ring is reasonable I guess, but again,
without returning a fence in run_job the TDR can't be used for cleanup
on LR entities, which will result in duplicate code open coded by each
driver. Same goes for checking for a full ring in exec.

How about this:
- We mark xe_hw_fence as LR to ensure it can't be exported, return this
  in run_job which gives flow control on the ring + the handy TDR
  functionality
- When a scheduler is marked as LR, we do not generate finished fences
  for jobs (rough sketch below)
- We heavily, heavily scrutinize any usage of the LR fence flag going
  forward
- We document all of this very loudly
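
Rough sketch of the second point (the helper and the long_running flag
are purely illustrative, not what the patches would literally add):

struct dma_fence *drm_sched_job_finished_fence(struct drm_sched_job *job)
{
	/* An LR scheduler never hands out a finished fence. */
	if (WARN_ON_ONCE(job->sched->long_running))
		return NULL;

	return dma_fence_get(&job->s_fence->finished);
}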

Is this reasonable?

Matt

> -Daniel
> 
> > 
> > Matt
> > 
> > [1] https://patchwork.freedesktop.org/patch/525461/?series=114772&rev=2
> > 
> > > -Daniel
> > > 
> > > 
> > > > > > I mean in the 1 to 1 case  you basically just need a component which
> > > > > > collects the dependencies as dma_fence and if all of them are fulfilled
> > > > > > schedules a work item.
> > > > > > 
> > > > > > As long as the work item itself doesn't produce a dma_fence it can then
> > > > > > still just wait for other none dma_fence dependencies.
> > > > > > 
> > > > > > Then the work function could submit the work and wait for the result.
> > > > > > 
> > > > > > The work item would then pretty much represent what you want, you can
> > > > > > wait for it to finish and pass it along as long running dependency.
> > > > > > 
> > > > > > Maybe give it a funky name and wrap it up in a structure, but that's
> > > > > > basically it.
> > > > > > 
> > > > > This very much sounds like a i915_sw_fence for the dependency tracking and
> > > > > dma_fence_work for the actual work although it's completion fence is a
> > > > > dma_fence.
> > > > >
> > > > 
> > > > Agree this does sound to i915ish as stated below one of mandates in Xe
> > > > was to use the DRM scheduler. Beyond that as someone who a submission
> > > > backend in the i915 and Xe, I love how the DRM scheduler works (single
> > > > entry point), it makes everything so much easier.
> > > > 
> > > > Matt
> > > > 
> > > > > Although that goes against the whole idea of a condition for merging the xe
> > > > > driver would be that we implement some sort of minimal scaffolding for
> > > > > long-running workloads in the drm scheduler, and the thinking behind that is
> > > > > to avoid implementing intel-specific solutions like those...
> > > > > 
> > > > > Thanks,
> > > > > 
> > > > > Thomas
> > > > > 
> > > > > 
> > > > > 
> > > > > > Regards,
> > > > > > Christian.
> > > > > > 
> > > > > > > 
> > > > > > > Thanks,
> > > > > > > 
> > > > > > > Thomas
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > 
> > > -- 
> > > Daniel Vetter
> > > Software Engineer, Intel Corporation
> > > http://blog.ffwll.ch
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans
  2023-04-05 10:12               ` Daniel Vetter
@ 2023-04-06  2:08                 ` Matthew Brost
  2023-04-06  6:37                   ` Daniel Vetter
  0 siblings, 1 reply; 87+ messages in thread
From: Matthew Brost @ 2023-04-06  2:08 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: robdclark, airlied, lina, dri-devel, intel-xe, boris.brezillon,
	Christian König, faith.ekstrand

On Wed, Apr 05, 2023 at 12:12:27PM +0200, Daniel Vetter wrote:
> On Wed, 5 Apr 2023 at 11:57, Christian König <christian.koenig@amd.com> wrote:
> >
> > Am 05.04.23 um 11:07 schrieb Daniel Vetter:
> > > [SNIP]
> > >> I would approach it from the complete other side. This component here is a
> > >> tool to decide what job should run next.
> > >>
> > >> How that is then signaled and run should not be part of the scheduler, but
> > >> another more higher level component.
> > >>
> > >> This way you also don't have a problem with not using DMA-fences as
> > >> dependencies as well as constrains for running more jobs.
> > > I think we're talking about two things here and mixing them up.
> > >
> > > For the dependencies I agree with you, and imo that higher level tool
> > > should probably just be an on-demand submit thread in userspace for the
> > > rare case where the kernel would need to sort out a dependency otherwise
> > > (due to running out of ringspace in the per-ctx ringbuffer).
> > >
> > > The other thing is the message passing stuff, and this is what I was
> > > talking about above. This has nothing to do with handling dependencies,
> > > but with talking to the gpu fw. Here the intel design issue is that the fw
> > > only provides a single queue, and it's in-order. Which means it
> > > fundamentally has the stalling issue you describe as a point against a
> > > message passing design. And fundamentally we need to be able to talk to
> > > the fw in the scheduler ->run_job callback.
> > >
> > > The proposal here for the message passing part is that since it has the
> > > stalling issue already anyway, and the scheduler needs to be involved
> > > anyway, it makes sense to integrated this (as an optional thing, only for
> > > drivers which have this kind of fw interface) into the scheduler.
> > > Otherwise you just end up with two layers for no reason and more ping-pong
> > > delay because the ->run_job needs to kick off the subordinate driver layer
> > > first. Note that for this case the optional message passing support in the
> > > drm/scheduler actually makes things better, because it allows you to cut
> > > out one layer.
> > >
> > > Of course if a driver with better fw interface uses this message passing
> > > support, then that's bad. Hence the big warning in the kerneldoc.
> >
> > Well what I wanted to say is that if you design the dependency handling
> > / scheduler properly you don't need the message passing through it.
> >
> > For example if the GPU scheduler component uses a work item to do it's
> > handling instead of a kthread you could also let the driver specify the
> > work queue where this work item is executed on.
> >
> > When you design it like this the driver specifies the thread context of
> > execution for it's job. In other words it can specify a single threaded
> > firmware work queue as well.
> >
> > When you then have other messages which needs to be passed to the
> > firmware you can also use the same single threaded workqueue for this.
> >
> > Drivers which have a different firmware interface would just use one of
> > the system work queues instead.
> >
> > This approach basically decouples the GPU scheduler component from the
> > message passing functionality.
> 
> Hm I guess we've been talking past each another big time, because
> that's really what I thought was under discussions? Essentially the
> current rfc, but implementing with some polish.
>

I think Daniel pretty much nailed it here (thanks), to recap:

1. I want the messages in the same worker so run_job / free_job /
process_msg execution is mutually exclusive, and also so that during
reset paths, when the worker is stopped, none of these entry points can
be entered (rough sketch after this list).

If this is a NAK, then another worker is fine I guess. A lock between
run_job / free_job + process_msg should solve the exclusion issue, and
the reset paths can stop this new worker too. That being said, I'd
rather leave this as is but will not fight this point.

2. process_msg is just used to communicate with the firmware using the
same queue as submission. Waiting for space in this queue is the only
place this function can block (same as submission); well, actually we
have the concept of a preempt time slice, but that sleeps for 10 ms by
default. Also, preempt is only used in LR entities so I don't think it
is relevant in this case either.

3. Agree this is in the dma-fence signaling path (if process_msg is in
the submission worker) so we can't block indefinitely or for an
unreasonable period of time (i.e. we must obey the dma-fence rules).

4. Agree the documentation for the usage of the messaging interface
needs to be clear.

5. Agree that my code could always use polishing.
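
For reference, the message path in the RFC is roughly the following
(member and helper names are approximations of the series, treat this
as a sketch): a message gets queued on the scheduler and is then
handled from the same ordered work item that runs run_job / free_job,
which is what gives the mutual exclusion described in #1.

struct drm_sched_msg {
	/** @link: list link into the scheduler's message list */
	struct list_head	link;
	/** @private_data: opaque pointer, backend defined */
	void			*private_data;
	/** @opcode: message opcode, backend defined */
	unsigned int		opcode;
};

void drm_sched_add_msg(struct drm_gpu_scheduler *sched,
		       struct drm_sched_msg *msg)
{
	spin_lock(&sched->job_list_lock);
	list_add_tail(&msg->link, &sched->msgs);
	spin_unlock(&sched->job_list_lock);

	/* kicks the same worker that runs run_job / free_job */
	drm_sched_run_wq_queue(sched);
}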

Let's close on #1, and then can I get a general Ack on this part of the
RFC and apply the polish in the full review process?

Matt

> iow I agree with you (I think at least).
> -Daniel
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences
  2023-04-05 23:58                   ` Matthew Brost
@ 2023-04-06  6:32                     ` Daniel Vetter
  2023-04-06 16:58                       ` Matthew Brost
  0 siblings, 1 reply; 87+ messages in thread
From: Daniel Vetter @ 2023-04-06  6:32 UTC (permalink / raw)
  To: Matthew Brost
  Cc: robdclark, airlied, lina, Thomas Hellström (Intel),
	dri-devel, Christian König, boris.brezillon, Daniel Vetter,
	intel-xe, faith.ekstrand

On Wed, Apr 05, 2023 at 11:58:44PM +0000, Matthew Brost wrote:
> On Wed, Apr 05, 2023 at 03:09:08PM +0200, Daniel Vetter wrote:
> > On Tue, Apr 04, 2023 at 07:48:27PM +0000, Matthew Brost wrote:
> > > On Tue, Apr 04, 2023 at 09:25:52PM +0200, Daniel Vetter wrote:
> > > > On Tue, Apr 04, 2023 at 07:02:23PM +0000, Matthew Brost wrote:
> > > > > On Tue, Apr 04, 2023 at 08:14:01PM +0200, Thomas Hellström (Intel) wrote:
> > > > > > 
> > > > > > On 4/4/23 15:10, Christian König wrote:
> > > > > > > Am 04.04.23 um 14:54 schrieb Thomas Hellström:
> > > > > > > > Hi, Christian,
> > > > > > > > 
> > > > > > > > On 4/4/23 11:09, Christian König wrote:
> > > > > > > > > Am 04.04.23 um 02:22 schrieb Matthew Brost:
> > > > > > > > > > From: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > > > > > > > > 
> > > > > > > > > > For long-running workloads, drivers either need to open-code
> > > > > > > > > > completion
> > > > > > > > > > waits, invent their own synchronization primitives or internally use
> > > > > > > > > > dma-fences that do not obey the cross-driver dma-fence protocol, but
> > > > > > > > > > without any lockdep annotation all these approaches are error prone.
> > > > > > > > > > 
> > > > > > > > > > So since for example the drm scheduler uses dma-fences it is
> > > > > > > > > > desirable for
> > > > > > > > > > a driver to be able to use it for throttling and error
> > > > > > > > > > handling also with
> > > > > > > > > > internal dma-fences tha do not obey the cros-driver
> > > > > > > > > > dma-fence protocol.
> > > > > > > > > > 
> > > > > > > > > > Introduce long-running completion fences in form of
> > > > > > > > > > dma-fences, and add
> > > > > > > > > > lockdep annotation for them. In particular:
> > > > > > > > > > 
> > > > > > > > > > * Do not allow waiting under any memory management locks.
> > > > > > > > > > * Do not allow to attach them to a dma-resv object.
> > > > > > > > > > * Introduce a new interface for adding callbacks making the
> > > > > > > > > > helper adding
> > > > > > > > > >    a callback sign off on that it is aware that the dma-fence may not
> > > > > > > > > >    complete anytime soon. Typically this will be the
> > > > > > > > > > scheduler chaining
> > > > > > > > > >    a new long-running fence on another one.
> > > > > > > > > 
> > > > > > > > > Well that's pretty much what I tried before:
> > > > > > > > > https://lwn.net/Articles/893704/
> > > > > > > > > 
> > > > > 
> > > > > I don't think this quite the same, this explictly enforces that we don't
> > > > > break the dma-fence rules (in path of memory allocations, exported in
> > > > > any way), essentially this just SW sync point reusing dma-fence the
> > > > > infrastructure for signaling / callbacks. I believe your series tried to
> > > > > export these fences to user space (admittedly I haven't fully read your
> > > > > series).
> > > > > 
> > > > > In this use case we essentially just want to flow control the ring via
> > > > > the dma-scheduler + maintain a list of pending jobs so the TDR can be
> > > > > used for cleanup if LR entity encounters an error. To me this seems
> > > > > perfectly reasonable but I know dma-femce rules are akin to a holy war.
> > > > > 
> > > > > If we return NULL in run_job, now we have to be able to sink all jobs
> > > > > in the backend regardless on ring space, maintain a list of jobs pending
> > > > > for cleanup after errors, and write a different cleanup path as now the
> > > > > TDR doesn't work. Seems very, very silly to duplicate all of this code
> > > > > when the DRM scheduler provides all of this for us. Also if we go this
> > > > > route, now all drivers are going to invent ways to handle LR jobs /w the
> > > > > DRM scheduler.
> > > > > 
> > > > > This solution is pretty clear, mark the scheduler as LR, and don't
> > > > > export any fences from the scheduler. If you try to export these fences
> > > > > a blow up happens.
> > > > 
> > > > The problem is if you mix things up. Like for resets you need all the
> > > > schedulers on an engine/set-of-engines to quiescent or things get
> > > > potentially hilarious. If you now have a scheduler in forever limbo, the
> > > > dma_fence guarantees are right out the window.
> > > > 
> > > 
> > > Right, a GT reset on Xe is:
> > > 
> > > Stop all schedulers
> > > Do a reset
> > > Ban any schedulers which we think caused the GT reset
> > > Resubmit all schedulers which we think were good
> > > Restart all schedulers
> > > 
> > > None of this flow depends on LR dma-fences, all of this uses the DRM
> > > sched infrastructure and work very well compared to the i915. Rewriting
> > > all this with a driver specific implementation is what we are trying to
> > > avoid.
> > > 
> > > Similarly if LR entity hangs on its own (not a GT reset, rather the
> > > firmware does the reset for us) we use all the DRM scheduler
> > > infrastructure to handle this. Again this works rather well...
> > 
> > Yeah this is why I don't think duplicating everything that long-running
> > jobs need makes any sense. iow I agree with you.
> > 
> 
> Glad we agree.
> 
> > > > But the issue you're having is fairly specific if it's just about
> > > > ringspace. I think the dumbest fix is to just block in submit if you run
> > > > out of per-ctx ringspace, and call it a day. This notion that somehow the
> > > 
> > > How does that not break the dma-fence rules? A job can publish its
> > > finished fence after ARM, if the finished fence fence waits on ring
> > > space that may not free up in a reasonable amount of time we now have
> > > broken the dma-dence rules. My understanding is any dma-fence must only
> > > on other dma-fence, Christian seems to agree and NAK'd just blocking if
> > > no space available [1]. IMO this series ensures we don't break dma-fence
> > > rules by restricting how the finished fence can be used.
> > 
> > Oh I meant in the submit ioctl, _before_ you even call
> > drm_sched_job_arm(). It's ok to block in there indefinitely.
> >
> 
> Ok, but how do we determine if their is ring space, wait on xe_hw_fence
> which is a dma-fence. We just move a wait from the scheduler to the exec
> IOCTL and I realy fail to see the point of that.

Fill in anything you need into the ring at ioctl time, but don't update
the tail pointers? If there's no space, then EWOULDBLOCK.
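
Roughly like this, all names below are made up, the only point being
that the check happens in the ioctl before drm_sched_job_arm(), where
blocking or bailing out is still allowed:

static int exec_reserve_ring(struct exec_ring *ring, u32 ndwords,
			     bool nonblock)
{
	/* Not yet in the dma-fence signalling path, blocking is fine. */
	if (ring_space(ring) >= ndwords)
		return 0;

	if (nonblock)
		return -EWOULDBLOCK;

	return wait_event_interruptible(ring->space_wq,
					ring_space(ring) >= ndwords);
}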

> > > > kernel is supposed to provide a bottomless queue of anything userspace
> > > > submits simply doesn't hold up in reality (as much as userspace standards
> > > > committees would like it to), and as long as it doesn't have a real-world
> > > > perf impact it doesn't really matter why we end up blocking in the submit
> > > > ioctl. It might also be a simple memory allocation that hits a snag in
> > > > page reclaim.
> > > > 
> > > > > > > > > And the reasons why it was rejected haven't changed.
> > > > > > > > > 
> > > > > > > > > Regards,
> > > > > > > > > Christian.
> > > > > > > > > 
> > > > > > > > Yes, TBH this was mostly to get discussion going how we'd best
> > > > > > > > tackle this problem while being able to reuse the scheduler for
> > > > > > > > long-running workloads.
> > > > > > > > 
> > > > > > > > I couldn't see any clear decision on your series, though, but one
> > > > > > > > main difference I see is that this is intended for driver-internal
> > > > > > > > use only. (I'm counting using the drm_scheduler as a helper for
> > > > > > > > driver-private use). This is by no means a way to try tackle the
> > > > > > > > indefinite fence problem.
> > > > > > > 
> > > > > > > Well this was just my latest try to tackle this, but essentially the
> > > > > > > problems are the same as with your approach: When we express such
> > > > > > > operations as dma_fence there is always the change that we leak that
> > > > > > > somewhere.
> > > > > > > 
> > > > > > > My approach of adding a flag noting that this operation is dangerous and
> > > > > > > can't be synced with something memory management depends on tried to
> > > > > > > contain this as much as possible, but Daniel still pretty clearly
> > > > > > > rejected it (for good reasons I think).
> > > > > > > 
> > > > > > > > 
> > > > > > > > We could ofc invent a completely different data-type that abstracts
> > > > > > > > the synchronization the scheduler needs in the long-running case, or
> > > > > > > > each driver could hack something up, like sleeping in the
> > > > > > > > prepare_job() or run_job() callback for throttling, but those waits
> > > > > > > > should still be annotated in one way or annotated one way or another
> > > > > > > > (and probably in a similar way across drivers) to make sure we don't
> > > > > > > > do anything bad.
> > > > > > > > 
> > > > > > > >  So any suggestions as to what would be the better solution here
> > > > > > > > would be appreciated.
> > > > > > > 
> > > > > > > Mhm, do we really the the GPU scheduler for that?
> > > > > > > 
> > > > > 
> > > > > I think we need to solve this within the DRM scheduler one way or
> > > > > another.
> > > > 
> > > > Yeah so if we conclude that the queue really must be bottomless then I
> > > > agree drm-sched should help out sort out the mess. Because I'm guessing
> > > > that every driver will have this issue. But that's a big if.
> > > > 
> > > > I guess if we teach the drm scheduler that some jobs are fairly endless
> > > > then maybe it wouldn't be too far-fetched to also teach it to wait for a
> > > > previous one to finish (but not with the dma_fence that preempts, which we
> > > > put into the dma_resv for memory management, but some other struct
> > > > completion). The scheduler already has a concept of not stuffing too much
> > > > stuff into the same queue after all, so this should fit?
> > > 
> > > See above, exact same situation as spinning on flow controling the ring,
> > > this IMO absolutely breaks the dma-fence rules. IMO the correct solution
> > > is to have a DRM that doesn't export dma-fences, this is exactly what
> > > this series does as if we try to, boom lockdep / warn on blow up.
> > 
> > I dont think it's impossible to do this correctly, but definitely very,
> > very hard. Which is why neither Christian nor me like the idea :-)
> > 
> > Essentially you'd have to make sure that any indefinite way will still
> > react to drm_sched_job, so that you're not holding up a gt reset or
> > anything like that, but only ever hold up forward progress for this
> > specific scheduler/drm_sched_entity. Which you can do as long (and again,
> > another hugely tricky detail) you still obey the preempt-ctx dma_fence and
> > manage to preempt the underlying long-running ctx even when the drm/sched
> > is stuck waiting for an indefinite fence (like waiting for ringspace or
> > something like that).
> > 
> > So I don't think it's impossible, but very far away from "a good idea" :-)
> > 
> > Hence to proposal to bail out of this entire mess by throwing EWOULDBLCK
> > back to userspace directly from the ioctl function, where you still can do
> > that without breaking any dma_fence rules. Or if it's not a case that
> > matters in practice, simply block in the ioctl handler instead of
> > returning EWOULDBLCK.
> 
> Returning EWOULDBLCK on a full ring is reasonsible I guess but again
> without returning a fence in run job the TDR can't be used for clean up
> on LR entities which will result in duplicate code open coded by each
> driver. Same goes for checking ring full in exec.
> 
> How about this:
> - We mark xe_hw_fence as LR to ensure it can't be exported, return this
>   in run_job which gives flow control on the ring + the handy TDR
>   functionality
> - When a scheduler is marked as LR, we do not generate finished fences
>   for jobs
> - We heavily, heavily scrutinize any usage of the LR fence flag going
>   foward
> - We document all of this very loudly
> 
> Is this reasonable?

I'm not seeing why it's needed. If you're worried about TDR duplication
then I think we need something else, because for a long-running ctx we
never have a timeout of the ctx itself (by definition). The only thing
we time out on is the preempt, so I guess what could be done is (rough
sketch after the list):
- have the minimal scaffolding to support the preempt-ctx fence in
  drm_sched_entity
- when the preempt ctx fence enables signalling, a) call back into the
  driver to start the preempt (which should signal the fence) and
  b) start a timer, which should catch the case where the preempt takes
  too long
- if the timer fires first (importantly, we only enable it when the
  preemption is triggered, not by default), kick off the normal
  drm/sched TDR flow. This maybe needs some adjustments in case
  different handling is needed for when a preemption times out compared
  to just a job timing out
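
Very roughly, and everything below is hypothetical, just to illustrate
the flow:

static void drm_sched_entity_start_preempt(struct drm_sched_entity *entity)
{
	/* a) ask the driver to preempt the long-running context ... */
	entity->preempt(entity);

	/* ... b) and catch a preemption that never completes */
	mod_timer(&entity->preempt_timer,
		  jiffies + msecs_to_jiffies(PREEMPT_TIMEOUT_MS));
}

static void drm_sched_entity_preempt_timed_out(struct timer_list *t)
{
	struct drm_sched_entity *entity =
		from_timer(entity, t, preempt_timer);

	/* preemption timed out: fall back to the normal TDR machinery */
	drm_sched_fault(entity->rq->sched);
}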

I think this might make sense for sharing the timeout handling needs of
long-running contexts. As for what you proposed, I don't really follow
why it should exist, because that kind of timeout handling should never
happen for long-running jobs.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans
  2023-04-06  2:08                 ` Matthew Brost
@ 2023-04-06  6:37                   ` Daniel Vetter
  2023-04-06 10:14                     ` Christian König
  0 siblings, 1 reply; 87+ messages in thread
From: Daniel Vetter @ 2023-04-06  6:37 UTC (permalink / raw)
  To: Matthew Brost
  Cc: robdclark, airlied, lina, dri-devel, intel-xe, boris.brezillon,
	Daniel Vetter, Christian König, faith.ekstrand

On Thu, Apr 06, 2023 at 02:08:10AM +0000, Matthew Brost wrote:
> On Wed, Apr 05, 2023 at 12:12:27PM +0200, Daniel Vetter wrote:
> > On Wed, 5 Apr 2023 at 11:57, Christian König <christian.koenig@amd.com> wrote:
> > >
> > > Am 05.04.23 um 11:07 schrieb Daniel Vetter:
> > > > [SNIP]
> > > >> I would approach it from the complete other side. This component here is a
> > > >> tool to decide what job should run next.
> > > >>
> > > >> How that is then signaled and run should not be part of the scheduler, but
> > > >> another more higher level component.
> > > >>
> > > >> This way you also don't have a problem with not using DMA-fences as
> > > >> dependencies as well as constrains for running more jobs.
> > > > I think we're talking about two things here and mixing them up.
> > > >
> > > > For the dependencies I agree with you, and imo that higher level tool
> > > > should probably just be an on-demand submit thread in userspace for the
> > > > rare case where the kernel would need to sort out a dependency otherwise
> > > > (due to running out of ringspace in the per-ctx ringbuffer).
> > > >
> > > > The other thing is the message passing stuff, and this is what I was
> > > > talking about above. This has nothing to do with handling dependencies,
> > > > but with talking to the gpu fw. Here the intel design issue is that the fw
> > > > only provides a single queue, and it's in-order. Which means it
> > > > fundamentally has the stalling issue you describe as a point against a
> > > > message passing design. And fundamentally we need to be able to talk to
> > > > the fw in the scheduler ->run_job callback.
> > > >
> > > > The proposal here for the message passing part is that since it has the
> > > > stalling issue already anyway, and the scheduler needs to be involved
> > > > anyway, it makes sense to integrated this (as an optional thing, only for
> > > > drivers which have this kind of fw interface) into the scheduler.
> > > > Otherwise you just end up with two layers for no reason and more ping-pong
> > > > delay because the ->run_job needs to kick off the subordinate driver layer
> > > > first. Note that for this case the optional message passing support in the
> > > > drm/scheduler actually makes things better, because it allows you to cut
> > > > out one layer.
> > > >
> > > > Of course if a driver with better fw interface uses this message passing
> > > > support, then that's bad. Hence the big warning in the kerneldoc.
> > >
> > > Well what I wanted to say is that if you design the dependency handling
> > > / scheduler properly you don't need the message passing through it.
> > >
> > > For example if the GPU scheduler component uses a work item to do it's
> > > handling instead of a kthread you could also let the driver specify the
> > > work queue where this work item is executed on.
> > >
> > > When you design it like this the driver specifies the thread context of
> > > execution for it's job. In other words it can specify a single threaded
> > > firmware work queue as well.
> > >
> > > When you then have other messages which needs to be passed to the
> > > firmware you can also use the same single threaded workqueue for this.
> > >
> > > Drivers which have a different firmware interface would just use one of
> > > the system work queues instead.
> > >
> > > This approach basically decouples the GPU scheduler component from the
> > > message passing functionality.
> > 
> > Hm I guess we've been talking past each another big time, because
> > that's really what I thought was under discussions? Essentially the
> > current rfc, but implementing with some polish.
> >
> 
> I think Daniel pretty much nailed it here (thanks), to recap:
> 
> 1. I want the messages in the same worker so run_job / free_job /
> process_msg execution is mutually exclusive, and also so that during reset
> paths, if the worker is stopped, none of the entry points can be entered.
> 
> If this is a NAK, then another worker is fine I guess. A lock between
> run_job / free_job + process_msg should solve the exclusion issue and the
> reset paths can also stop this new worker too. That being said I'd
> rather leave this as is but will not fight this point.
> 
> 2. process_msg is just used to communicate with the firmware using the
> same queue as submission. Waiting for space in this queue is the only
> place this function can block (same as submission); well, actually we
> have the concept of a preempt time slice, but that sleeps for 10 ms by
> default. Also, preempt is only used in LR entities so I don't think it is
> relevant in this case either.
> 
> 3. Agree this is in the dma-fence signaling path (if process_msg is in
> the submission worker) so we can't block indefinitely or for an
> unreasonable period of time (i.e. we must obey the dma-fence rules).

Just to hammer this in: it's not only process_msg that is in the dma_fence
signalling path, but the entire fw queue that everything is funneled through,
including whatever the fw is doing to process these.

Yes this is terrible and blew up a few times already :-/

But it's also probably something that the docs really need to hammer in, to
make sure people don't look at this and think "hey, this seems to be the
recommended way to do this on Linux". We don't want hw people to build
more of these designs; they're an absolute pain to deal with under Linux'
dma_fence signalling and gpu job scheduling rules.

It's just that if you're stuck with such fw, then integrating the flow
into drm/sched instead of having an extra layer of workers seems the
better of two pretty bad solutions.
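
For illustration, the shape being argued about is roughly the one below.
Everything in the sketch (struct fw_sched, process_msg, run_next_job) is a
made-up stand-in for the RFC's interface, not actual code:

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/workqueue.h>

struct backend_msg {
	struct list_head link;
	void *payload;				/* backend-defined */
};

struct fw_sched {
	spinlock_t lock;
	struct list_head msgs;			/* pending backend messages */
	struct work_struct work;		/* single worker: jobs + messages */
	void (*process_msg)(struct backend_msg *msg); /* driver hook */
	void (*run_next_job)(struct fw_sched *s);     /* driver hook */
};

static void fw_sched_work(struct work_struct *w)
{
	struct fw_sched *s = container_of(w, struct fw_sched, work);
	struct backend_msg *msg;

	/* Messages and jobs are mutually exclusive by construction, and both
	 * sit in the dma_fence signalling path of the fw queue. */
	spin_lock(&s->lock);
	while (!list_empty(&s->msgs)) {
		msg = list_first_entry(&s->msgs, struct backend_msg, link);
		list_del(&msg->link);
		spin_unlock(&s->lock);
		s->process_msg(msg);	/* may only block on fw queue space */
		spin_lock(&s->lock);
	}
	spin_unlock(&s->lock);

	s->run_next_job(s);
}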
-Daniel

 
> 4. Agree the documentation for the usage of the messaging interface
> needs to be clear.
> 
> 5. Agree that my code could always use polishing.
> 
> Let's close on #1; then can I get a general Ack on this part of the RFC
> and apply the polish in the full review process?
> 
> Matt
> 
> > iow I agree with you (I think at least).
> > -Daniel
> > -- 
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans
  2023-04-05 18:53         ` Matthew Brost
@ 2023-04-06 10:04           ` Christian König
  2023-04-07  0:20           ` Zeng, Oak
  1 sibling, 0 replies; 87+ messages in thread
From: Christian König @ 2023-04-06 10:04 UTC (permalink / raw)
  To: Matthew Brost, Zeng, Oak
  Cc: robdclark, airlied, lina, dri-devel, boris.brezillon, Vetter,
	Daniel, intel-xe, faith.ekstrand

Am 05.04.23 um 20:53 schrieb Matthew Brost:
> On Wed, Apr 05, 2023 at 12:06:53PM -0600, Zeng, Oak wrote:
>> Hi,
>>
>> Using dma-fence for completion/dependency tracking for long-run workloads (more precisely, on-demand paging/page-fault-enabled workloads) can cause deadlocks. This seems to be the significant issue here. Other issues, such as the drm scheduler completion order implication etc., are minor and can be solved inside the framework of the drm scheduler. We need to evaluate the paths below:
>>
>> 	1) Still use the drm scheduler for job submission, and use dma-fence for job completion waiting/dependency tracking. This is the solution proposed in this series. Annotate dma-fences for long-run workloads: user can still wait on a dma-fence for job completion but can't wait on a dma-fence while holding any memory management locks. We still use dma-fence for dependency tracking, but it very easily runs into deadlocks when on-demand paging is in the picture. The annotation helps us detect deadlocks but not solve the deadlock problem. Seems *not* a complete solution: it is almost impossible to completely avoid dependency deadlocks in a complex runtime environment.
>>
> No one can wait on an LR fence, so it is impossible to deadlock. The
> annotations enforce this. Literally this is only for flow controlling the
> ring / holding pending jobs in the DRM scheduler list.

You can still have someone depend on the LR fence and cause a deadlock
without ever waiting on it.

See my attempted solution to this problem. It has a fence inherit the LR
flag when something it depends on carries it.

For example, if you create a fence container and one of the fences inside
the container is an LR fence, then the container itself would be an LR
fence as well.
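
A rough sketch of that inheritance idea, with a hypothetical
DMA_FENCE_FLAG_LR_BIT standing in for whatever flag we'd end up with:

#include <linux/bitops.h>
#include <linux/dma-fence.h>
#include <linux/dma-fence-array.h>

/* Hypothetical flag, not in upstream dma-fence. */
#define DMA_FENCE_FLAG_LR_BIT	(DMA_FENCE_FLAG_USER_BITS + 0)

/*
 * Any LR member taints the whole container, so depending on the container
 * is treated as depending on an LR fence as well.
 */
static void fence_array_inherit_lr(struct dma_fence_array *array)
{
	unsigned int i;

	for (i = 0; i < array->num_fences; i++) {
		if (test_bit(DMA_FENCE_FLAG_LR_BIT, &array->fences[i]->flags)) {
			set_bit(DMA_FENCE_FLAG_LR_BIT, &array->base.flags);
			break;
		}
	}
}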

>> 	2) Still use the drm scheduler but don't use dma-fence for completion signaling and dependency tracking. This way we still get some free functionality (reset, error handling, ring flow control as Matt said) from the drm scheduler, but push the dependency/completion tracking completely to user space using techniques such as user-space fences. User space doesn't have a chance to wait on a fence while holding a kernel memory management lock, thus the dma-fence deadlock issue is solved.
>>
> We use user-space fences for syncs.
>
>> 	3) Completely discard the drm scheduler and dma-fence for long-run workloads. Use a user queue/doorbell for super fast submission, directly interacting with the fw scheduler. Use user fences for completion/dependency tracking.
>>
> This is a hard no from me; I want one submission path in Xe. Either we use
> the DRM scheduler or we don't.

Well, I don't think that this will be acceptable, especially if you not
only have long-running submission, but also things like page faults/HMM in
those jobs.

Regards,
Christian.

>
> Matt
>
>> Thanks,
>> Oak
>>
>>> -----Original Message-----
>>> From: Christian König <christian.koenig@amd.com>
>>> Sent: April 5, 2023 3:30 AM
>>> To: Brost, Matthew <matthew.brost@intel.com>; Zeng, Oak
>>> <oak.zeng@intel.com>
>>> Cc: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org;
>>> robdclark@chromium.org; thomas.hellstrom@linux.intel.com; airlied@linux.ie;
>>> lina@asahilina.net; boris.brezillon@collabora.com; faith.ekstrand@collabora.com
>>> Subject: Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
>>> plans
>>>
>>> Am 04.04.23 um 20:08 schrieb Matthew Brost:
>>>> On Tue, Apr 04, 2023 at 12:02:03PM -0600, Zeng, Oak wrote:
>>>>> Hi Matt, Thomas,
>>>>>
>>>>> Some very bold out of box thinking in this area:
>>>>>
>>>>> 1. so you want to use drm scheduler and dma-fence for long running workload.
>>> Why you want to do this in the first place? What is the benefit? Drm scheduler is
>>> pretty much a software scheduler. Modern gpu has scheduler built at fw/hw
>>> level, as you said below for intel this is Guc. Can xe driver just directly submit job
>>> to Guc, bypassing drm scheduler?
>>>> If we did that now we have 2 paths for dependency track, flow controling
>>>> the ring, resets / error handling / backend submission implementations.
>>>> We don't want this.
>>> Well exactly that's the point: Why?
>>>
>>> As far as I can see that are two completely distinct use cases, so you
>>> absolutely do want two completely distinct implementations for this.
>>>
>>>>> 2. using dma-fence for long run workload: I am well aware that page fault (and
>>> the consequent memory allocation/lock acquiring to fix the fault) can cause
>>> deadlock for a dma-fence wait. But I am not convinced that dma-fence can't be
>>> used purely because the nature of the workload that it runs very long (indefinite).
>>> I did a math: the dma_fence_wait_timeout function's third param is the timeout
>>> which is a signed long type. If HZ is 1000, this is about 23 days. If 23 days is not long
>>> enough, can we just change the timeout parameter to signed 64 bits so it is much
>>> longer than our life time...
>>>>> So I mainly argue we can't use dma-fence for long-run workload is not
>>> because the workload runs very long, rather because of the fact that we use
>>> page fault for long-run workload. If we enable page fault for short-run workload,
>>> we can't use dma-fence either. Page fault is the key thing here.
>>>>> Now since we use page fault which is *fundamentally* controversial with
>>> dma-fence design, why now just introduce a independent concept such as user-
>>> fence instead of extending existing dma-fence?
>>>>> I like unified design. If drm scheduler, dma-fence can be extended to work for
>>> everything, it is beautiful. But seems we have some fundamental problem here.
>>>> Thomas's patches turn a dma-fence into KMD sync point (e.g. we just use
>>>> the signal / CB infrastructure) and enforce we don't use use these
>>>> dma-fences from the scheduler in memory reclaim paths or export these to
>>>> user space or other drivers. Think of this mode as SW only fence.
>>> Yeah and I truly think this is an really bad idea.
>>>
>>> The signal/CB infrastructure in the dma_fence turned out to be the
>>> absolutely nightmare I initially predicted. Sorry to say that, but in
>>> this case the "I've told you so" is appropriate in my opinion.
>>>
>>> If we need infrastructure for long running dependency tracking we should
>>> encapsulate that in a new framework and not try to mangle the existing
>>> code for something it was never intended for.
>>>
>>> Christian.
>>>
>>>> Matt
>>>>
>>>>> Thanks,
>>>>> Oak
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of
>>>>>> Matthew Brost
>>>>>> Sent: April 3, 2023 8:22 PM
>>>>>> To: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org
>>>>>> Cc: robdclark@chromium.org; thomas.hellstrom@linux.intel.com;
>>> airlied@linux.ie;
>>>>>> lina@asahilina.net; boris.brezillon@collabora.com; Brost, Matthew
>>>>>> <matthew.brost@intel.com>; christian.koenig@amd.com;
>>>>>> faith.ekstrand@collabora.com
>>>>>> Subject: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
>>> plans
>>>>>> Hello,
>>>>>>
>>>>>> As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
>>>>>> have been asked to merge our common DRM scheduler patches first as well
>>>>>> as develop a common solution for long running workloads with the DRM
>>>>>> scheduler. This RFC series is our first attempt at doing this. We
>>>>>> welcome any and all feedback.
>>>>>>
>>>>>> This can we thought of as 4 parts detailed below.
>>>>>>
>>>>>> - DRM scheduler changes for 1 to 1 relationship between scheduler and
>>>>>> entity (patches 1-3)
>>>>>>
>>>>>> In Xe all of the scheduling of jobs is done by a firmware scheduler (the
>>>>>> GuC) which is a new paradigm WRT to the DRM scheduler and presents
>>>>>> severals problems as the DRM was originally designed to schedule jobs on
>>>>>> hardware queues. The main problem being that DRM scheduler expects the
>>>>>> submission order of jobs to be the completion order of jobs even across
>>>>>> multiple entities. This assumption falls apart with a firmware scheduler
>>>>>> as a firmware scheduler has no concept of jobs and jobs can complete out
>>>>>> of order. A novel solution for was originally thought of by Faith during
>>>>>> the initial prototype of Xe, create a 1 to 1 relationship between scheduler
>>>>>> and entity. I believe the AGX driver [3] is using this approach and
>>>>>> Boris may use approach as well for the Mali driver [4].
>>>>>>
>>>>>> To support a 1 to 1 relationship we move the main execution function
>>>>>> from a kthread to a work queue and add a new scheduling mode which
>>>>>> bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
>>>>>> The new scheduling mode should unify all drivers usage with a 1 to 1
>>>>>> relationship and can be thought of as using scheduler as a dependency /
>>>>>> infligt job tracker rather than a true scheduler.
>>>>>>
>>>>>> - Generic messaging interface for DRM scheduler
>>>>>>
>>>>>> Idea is to be able to communicate to the submission backend with in band
>>>>>> (relative to main execution function) messages. Messages are backend
>>>>>> defined and flexable enough for any use case. In Xe we use these
>>>>>> messages to clean up entites, set properties for entites, and suspend /
>>>>>> resume execution of an entity [5]. I suspect other driver can leverage
>>>>>> this messaging concept too as it a convenient way to avoid races in the
>>>>>> backend.
>>>>>>
>>>>>> - Support for using TDR for all error paths of a scheduler / entity
>>>>>>
>>>>>> Fix a few races / bugs, add function to dynamically set the TDR timeout.
>>>>>>
>>>>>> - Annotate dma-fences for long running workloads.
>>>>>>
>>>>>> The idea here is to use dma-fences only as sync points within the
>>>>>> scheduler and never export them for long running workloads. By
>>>>>> annotating these fences as long running we ensure that these dma-fences
>>>>>> are never used in a way that breaks the dma-fence rules. A benefit of
>>>>>> thus approach is the scheduler can still safely flow control the
>>>>>> execution ring buffer via the job limit without breaking the dma-fence
>>>>>> rules.
>>>>>>
>>>>>> Again this a first draft and looking forward to feedback.
>>>>>>
>>>>>> Enjoy - Matt
>>>>>>
>>>>>> [1] https://gitlab.freedesktop.org/drm/xe/kernel
>>>>>> [2] https://patchwork.freedesktop.org/series/112188/
>>>>>> [3] https://patchwork.freedesktop.org/series/114772/
>>>>>> [4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1
>>>>>> [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-
>>>>>> next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031
>>>>>>
>>>>>> Matthew Brost (8):
>>>>>>     drm/sched: Convert drm scheduler to use a work queue rather than
>>>>>>       kthread
>>>>>>     drm/sched: Move schedule policy to scheduler / entity
>>>>>>     drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
>>>>>>     drm/sched: Add generic scheduler message interface
>>>>>>     drm/sched: Start run wq before TDR in drm_sched_start
>>>>>>     drm/sched: Submit job before starting TDR
>>>>>>     drm/sched: Add helper to set TDR timeout
>>>>>>     drm/syncobj: Warn on long running dma-fences
>>>>>>
>>>>>> Thomas Hellström (2):
>>>>>>     dma-buf/dma-fence: Introduce long-running completion fences
>>>>>>     drm/sched: Support long-running sched entities
>>>>>>
>>>>>>    drivers/dma-buf/dma-fence.c                 | 142 +++++++---
>>>>>>    drivers/dma-buf/dma-resv.c                  |   5 +
>>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
>>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
>>>>>>    drivers/gpu/drm/drm_syncobj.c               |   5 +-
>>>>>>    drivers/gpu/drm/etnaviv/etnaviv_sched.c     |   5 +-
>>>>>>    drivers/gpu/drm/lima/lima_sched.c           |   5 +-
>>>>>>    drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
>>>>>>    drivers/gpu/drm/msm/msm_ringbuffer.c        |   5 +-
>>>>>>    drivers/gpu/drm/panfrost/panfrost_job.c     |   5 +-
>>>>>>    drivers/gpu/drm/scheduler/sched_entity.c    | 127 +++++++--
>>>>>>    drivers/gpu/drm/scheduler/sched_fence.c     |   6 +-
>>>>>>    drivers/gpu/drm/scheduler/sched_main.c      | 278 +++++++++++++++-----
>>>>>>    drivers/gpu/drm/v3d/v3d_sched.c             |  25 +-
>>>>>>    include/drm/gpu_scheduler.h                 | 130 +++++++--
>>>>>>    include/linux/dma-fence.h                   |  60 ++++-
>>>>>>    16 files changed, 649 insertions(+), 184 deletions(-)
>>>>>>
>>>>>> --
>>>>>> 2.34.1


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans
  2023-04-06  6:37                   ` Daniel Vetter
@ 2023-04-06 10:14                     ` Christian König
  2023-04-06 10:32                       ` Daniel Vetter
  0 siblings, 1 reply; 87+ messages in thread
From: Christian König @ 2023-04-06 10:14 UTC (permalink / raw)
  To: Daniel Vetter, Matthew Brost
  Cc: robdclark, airlied, lina, dri-devel, boris.brezillon, intel-xe,
	faith.ekstrand

Am 06.04.23 um 08:37 schrieb Daniel Vetter:
> On Thu, Apr 06, 2023 at 02:08:10AM +0000, Matthew Brost wrote:
>> On Wed, Apr 05, 2023 at 12:12:27PM +0200, Daniel Vetter wrote:
>>> On Wed, 5 Apr 2023 at 11:57, Christian König <christian.koenig@amd.com> wrote:
>>>> Am 05.04.23 um 11:07 schrieb Daniel Vetter:
>>>>> [SNIP]
>>>>>> I would approach it from the complete other side. This component here is a
>>>>>> tool to decide what job should run next.
>>>>>>
>>>>>> How that is then signaled and run should not be part of the scheduler, but
>>>>>> another more higher level component.
>>>>>>
>>>>>> This way you also don't have a problem with not using DMA-fences as
>>>>>> dependencies as well as constrains for running more jobs.
>>>>> I think we're talking about two things here and mixing them up.
>>>>>
>>>>> For the dependencies I agree with you, and imo that higher level tool
>>>>> should probably just be an on-demand submit thread in userspace for the
>>>>> rare case where the kernel would need to sort out a dependency otherwise
>>>>> (due to running out of ringspace in the per-ctx ringbuffer).
>>>>>
>>>>> The other thing is the message passing stuff, and this is what I was
>>>>> talking about above. This has nothing to do with handling dependencies,
>>>>> but with talking to the gpu fw. Here the intel design issue is that the fw
>>>>> only provides a single queue, and it's in-order. Which means it
>>>>> fundamentally has the stalling issue you describe as a point against a
>>>>> message passing design. And fundamentally we need to be able to talk to
>>>>> the fw in the scheduler ->run_job callback.
>>>>>
>>>>> The proposal here for the message passing part is that since it has the
>>>>> stalling issue already anyway, and the scheduler needs to be involved
>>>>> anyway, it makes sense to integrated this (as an optional thing, only for
>>>>> drivers which have this kind of fw interface) into the scheduler.
>>>>> Otherwise you just end up with two layers for no reason and more ping-pong
>>>>> delay because the ->run_job needs to kick off the subordinate driver layer
>>>>> first. Note that for this case the optional message passing support in the
>>>>> drm/scheduler actually makes things better, because it allows you to cut
>>>>> out one layer.
>>>>>
>>>>> Of course if a driver with better fw interface uses this message passing
>>>>> support, then that's bad. Hence the big warning in the kerneldoc.
>>>> Well what I wanted to say is that if you design the dependency handling
>>>> / scheduler properly you don't need the message passing through it.
>>>>
>>>> For example if the GPU scheduler component uses a work item to do it's
>>>> handling instead of a kthread you could also let the driver specify the
>>>> work queue where this work item is executed on.
>>>>
>>>> When you design it like this the driver specifies the thread context of
>>>> execution for it's job. In other words it can specify a single threaded
>>>> firmware work queue as well.
>>>>
>>>> When you then have other messages which needs to be passed to the
>>>> firmware you can also use the same single threaded workqueue for this.
>>>>
>>>> Drivers which have a different firmware interface would just use one of
>>>> the system work queues instead.
>>>>
>>>> This approach basically decouples the GPU scheduler component from the
>>>> message passing functionality.
>>> Hm I guess we've been talking past each another big time, because
>>> that's really what I thought was under discussions? Essentially the
>>> current rfc, but implementing with some polish.
>>>
>> I think Daniel pretty much nailed it here (thanks), to recap:
>>
>> 1. I want the messages in the same worker so run_job / free_job /
>> process_msg execution is mutual exclusive and also so during reset paths
>> if the worker is stopped all the entry points can't be entered.
>>
>> If this is a NAK, then another worker is fine I guess. A lock between
>> run_job / free_job + process_msg should solve the exclusion issue and the
>> reset paths can also stop this new worker too. That being said I'd
>> rather leave this as is but will not fight this point.
>>
>> 2. process_msg is just used to communicate with the firmware using the
>> same queue as submission. Waiting for space in this queue is the only
>> place this function can block (same as submission), well actually we
>> have the concept a preempt time slice but that sleeps for 10 ms by
>> default. Also preempt is only used in LR entities so I don't think it is
>> relavent in this case either.
>>
>> 3. Agree this is in the dma-fence signaling path (if process_msg is in
>> the submission worker) so we can't block indefinitely or an unreasonable
>> period of time (i.e. we must obey dma-fence rules).
> Just to hammer this in: Not just process_msg is in the dma_fence signaling
> path, but the entire fw queue where everything is being funneled through,
> including whatever the fw is doing to process these.
>
> Yes this is terrible and blew up a few times already :-/
>
> But also, probably something that the docs really need to hammer in, to
> make sure people don't look at this and thinkg "hey this seems to be the
> recommended way to do this on linux". We don't want hw people to build
> more of these designs, they're an absolute pain to deal with with Linux'
> dma_fence signalling and gpu job scheduling rules.
>
> It's just that if you're stuck with such fw, then integrating the flow
> into drm/sched instead of having an extra layer of workers seems the
> better of two pretty bad solutions.

Yeah, and if you have such fw limitations, make sure that you feed it
through something which is understood by lockdep.

In other words, either locks or a work item/queue, and not some
message-passing functionality through the scheduler.

As far as I can see, the approach with the work item/queue should fit
your needs here.
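
Something along these lines, sketched with made-up names (struct my_sched
and my_sched_init() are hypothetical, this is not the current
drm_sched_init() signature, just the shape of the idea):

#include <linux/workqueue.h>

struct my_sched {
	struct workqueue_struct *run_wq;	/* driver-chosen execution context */
	struct work_struct run_work;		/* scheduler main loop as a work item */
};

static void my_sched_run(struct work_struct *w)
{
	/* run_job() / free_job() handling would live here */
}

static int my_sched_init(struct my_sched *s, struct workqueue_struct *run_wq)
{
	/*
	 * A fw-limited driver passes its single-threaded / ordered fw
	 * submission workqueue here; other drivers pass NULL and get a
	 * system workqueue. Either way lockdep only ever sees ordinary
	 * work items.
	 */
	s->run_wq = run_wq ?: system_wq;
	INIT_WORK(&s->run_work, my_sched_run);
	return 0;
}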

Christian.

> -Daniel
>
>   
>> 4. Agree the documentation for thw usage of the messaging interface
>> needs to be clear.
>>
>> 5. Agree that my code could alway use polishing.
>>
>> Lets close on #1 then can I get on general Ack on this part of the RFC
>> and apply the polish in the full review process?
>>
>> Matt
>>
>>> iow I agree with you (I think at least).
>>> -Daniel
>>> -- 
>>> Daniel Vetter
>>> Software Engineer, Intel Corporation
>>> http://blog.ffwll.ch


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans
  2023-04-06 10:14                     ` Christian König
@ 2023-04-06 10:32                       ` Daniel Vetter
  0 siblings, 0 replies; 87+ messages in thread
From: Daniel Vetter @ 2023-04-06 10:32 UTC (permalink / raw)
  To: Christian König
  Cc: robdclark, airlied, lina, dri-devel, boris.brezillon,
	Daniel Vetter, intel-xe, faith.ekstrand

On Thu, Apr 06, 2023 at 12:14:36PM +0200, Christian König wrote:
> Am 06.04.23 um 08:37 schrieb Daniel Vetter:
> > On Thu, Apr 06, 2023 at 02:08:10AM +0000, Matthew Brost wrote:
> > > On Wed, Apr 05, 2023 at 12:12:27PM +0200, Daniel Vetter wrote:
> > > > On Wed, 5 Apr 2023 at 11:57, Christian König <christian.koenig@amd.com> wrote:
> > > > > Am 05.04.23 um 11:07 schrieb Daniel Vetter:
> > > > > > [SNIP]
> > > > > > > I would approach it from the complete other side. This component here is a
> > > > > > > tool to decide what job should run next.
> > > > > > > 
> > > > > > > How that is then signaled and run should not be part of the scheduler, but
> > > > > > > another more higher level component.
> > > > > > > 
> > > > > > > This way you also don't have a problem with not using DMA-fences as
> > > > > > > dependencies as well as constrains for running more jobs.
> > > > > > I think we're talking about two things here and mixing them up.
> > > > > > 
> > > > > > For the dependencies I agree with you, and imo that higher level tool
> > > > > > should probably just be an on-demand submit thread in userspace for the
> > > > > > rare case where the kernel would need to sort out a dependency otherwise
> > > > > > (due to running out of ringspace in the per-ctx ringbuffer).
> > > > > > 
> > > > > > The other thing is the message passing stuff, and this is what I was
> > > > > > talking about above. This has nothing to do with handling dependencies,
> > > > > > but with talking to the gpu fw. Here the intel design issue is that the fw
> > > > > > only provides a single queue, and it's in-order. Which means it
> > > > > > fundamentally has the stalling issue you describe as a point against a
> > > > > > message passing design. And fundamentally we need to be able to talk to
> > > > > > the fw in the scheduler ->run_job callback.
> > > > > > 
> > > > > > The proposal here for the message passing part is that since it has the
> > > > > > stalling issue already anyway, and the scheduler needs to be involved
> > > > > > anyway, it makes sense to integrated this (as an optional thing, only for
> > > > > > drivers which have this kind of fw interface) into the scheduler.
> > > > > > Otherwise you just end up with two layers for no reason and more ping-pong
> > > > > > delay because the ->run_job needs to kick off the subordinate driver layer
> > > > > > first. Note that for this case the optional message passing support in the
> > > > > > drm/scheduler actually makes things better, because it allows you to cut
> > > > > > out one layer.
> > > > > > 
> > > > > > Of course if a driver with better fw interface uses this message passing
> > > > > > support, then that's bad. Hence the big warning in the kerneldoc.
> > > > > Well what I wanted to say is that if you design the dependency handling
> > > > > / scheduler properly you don't need the message passing through it.
> > > > > 
> > > > > For example if the GPU scheduler component uses a work item to do it's
> > > > > handling instead of a kthread you could also let the driver specify the
> > > > > work queue where this work item is executed on.
> > > > > 
> > > > > When you design it like this the driver specifies the thread context of
> > > > > execution for it's job. In other words it can specify a single threaded
> > > > > firmware work queue as well.
> > > > > 
> > > > > When you then have other messages which needs to be passed to the
> > > > > firmware you can also use the same single threaded workqueue for this.
> > > > > 
> > > > > Drivers which have a different firmware interface would just use one of
> > > > > the system work queues instead.
> > > > > 
> > > > > This approach basically decouples the GPU scheduler component from the
> > > > > message passing functionality.
> > > > Hm I guess we've been talking past each another big time, because
> > > > that's really what I thought was under discussions? Essentially the
> > > > current rfc, but implementing with some polish.
> > > > 
> > > I think Daniel pretty much nailed it here (thanks), to recap:
> > > 
> > > 1. I want the messages in the same worker so run_job / free_job /
> > > process_msg execution is mutual exclusive and also so during reset paths
> > > if the worker is stopped all the entry points can't be entered.
> > > 
> > > If this is a NAK, then another worker is fine I guess. A lock between
> > > run_job / free_job + process_msg should solve the exclusion issue and the
> > > reset paths can also stop this new worker too. That being said I'd
> > > rather leave this as is but will not fight this point.
> > > 
> > > 2. process_msg is just used to communicate with the firmware using the
> > > same queue as submission. Waiting for space in this queue is the only
> > > place this function can block (same as submission), well actually we
> > > have the concept a preempt time slice but that sleeps for 10 ms by
> > > default. Also preempt is only used in LR entities so I don't think it is
> > > relavent in this case either.
> > > 
> > > 3. Agree this is in the dma-fence signaling path (if process_msg is in
> > > the submission worker) so we can't block indefinitely or an unreasonable
> > > period of time (i.e. we must obey dma-fence rules).
> > Just to hammer this in: Not just process_msg is in the dma_fence signaling
> > path, but the entire fw queue where everything is being funneled through,
> > including whatever the fw is doing to process these.
> > 
> > Yes this is terrible and blew up a few times already :-/
> > 
> > But also, probably something that the docs really need to hammer in, to
> > make sure people don't look at this and thinkg "hey this seems to be the
> > recommended way to do this on linux". We don't want hw people to build
> > more of these designs, they're an absolute pain to deal with with Linux'
> > dma_fence signalling and gpu job scheduling rules.
> > 
> > It's just that if you're stuck with such fw, then integrating the flow
> > into drm/sched instead of having an extra layer of workers seems the
> > better of two pretty bad solutions.
> 
> Yeah and if you have such fw limitations, make sure that you use something
> which is understood by lockdep to feed into it.
> 
> In other words, either locks or work item/queue and not some message passing
> functionality through the scheduler.
> 
> As far as I can see the approach with the work item/queue should fit your
> needs here.

dma_fence signalling annotations would also make the scheduler thread
visible to lockdep, just to make double-sure it can catch issues.
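
Roughly like this, as a sketch (the worker function and its body are
made-up stand-ins; only the begin/end signalling annotations are real
dma-fence API):

#include <linux/dma-fence.h>
#include <linux/workqueue.h>

static void push_jobs_and_msgs_to_fw(void)
{
	/* run_job / free_job / process_msg: everything that must
	 * eventually lead to dma_fence signalling goes here. (stub) */
}

static void submit_work_fn(struct work_struct *w)
{
	bool cookie = dma_fence_begin_signalling();

	push_jobs_and_msgs_to_fw();
	dma_fence_end_signalling(cookie);
}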
-Daniel

> 
> Christian.
> 
> > -Daniel
> > 
> > > 4. Agree the documentation for thw usage of the messaging interface
> > > needs to be clear.
> > > 
> > > 5. Agree that my code could alway use polishing.
> > > 
> > > Lets close on #1 then can I get on general Ack on this part of the RFC
> > > and apply the polish in the full review process?
> > > 
> > > Matt
> > > 
> > > > iow I agree with you (I think at least).
> > > > -Daniel
> > > > -- 
> > > > Daniel Vetter
> > > > Software Engineer, Intel Corporation
> > > > http://blog.ffwll.ch
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences
  2023-04-06  6:32                     ` Daniel Vetter
@ 2023-04-06 16:58                       ` Matthew Brost
  2023-04-06 17:09                         ` Daniel Vetter
  0 siblings, 1 reply; 87+ messages in thread
From: Matthew Brost @ 2023-04-06 16:58 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: robdclark, airlied, lina, Thomas Hellström (Intel),
	dri-devel, intel-xe, boris.brezillon, Christian König,
	faith.ekstrand

On Thu, Apr 06, 2023 at 08:32:59AM +0200, Daniel Vetter wrote:
> On Wed, Apr 05, 2023 at 11:58:44PM +0000, Matthew Brost wrote:
> > On Wed, Apr 05, 2023 at 03:09:08PM +0200, Daniel Vetter wrote:
> > > On Tue, Apr 04, 2023 at 07:48:27PM +0000, Matthew Brost wrote:
> > > > On Tue, Apr 04, 2023 at 09:25:52PM +0200, Daniel Vetter wrote:
> > > > > On Tue, Apr 04, 2023 at 07:02:23PM +0000, Matthew Brost wrote:
> > > > > > On Tue, Apr 04, 2023 at 08:14:01PM +0200, Thomas Hellström (Intel) wrote:
> > > > > > > 
> > > > > > > On 4/4/23 15:10, Christian König wrote:
> > > > > > > > Am 04.04.23 um 14:54 schrieb Thomas Hellström:
> > > > > > > > > Hi, Christian,
> > > > > > > > > 
> > > > > > > > > On 4/4/23 11:09, Christian König wrote:
> > > > > > > > > > Am 04.04.23 um 02:22 schrieb Matthew Brost:
> > > > > > > > > > > From: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > > > > > > > > > 
> > > > > > > > > > > For long-running workloads, drivers either need to open-code
> > > > > > > > > > > completion
> > > > > > > > > > > waits, invent their own synchronization primitives or internally use
> > > > > > > > > > > dma-fences that do not obey the cross-driver dma-fence protocol, but
> > > > > > > > > > > without any lockdep annotation all these approaches are error prone.
> > > > > > > > > > > 
> > > > > > > > > > > So since for example the drm scheduler uses dma-fences it is
> > > > > > > > > > > desirable for
> > > > > > > > > > > a driver to be able to use it for throttling and error
> > > > > > > > > > > handling also with
> > > > > > > > > > > internal dma-fences tha do not obey the cros-driver
> > > > > > > > > > > dma-fence protocol.
> > > > > > > > > > > 
> > > > > > > > > > > Introduce long-running completion fences in form of
> > > > > > > > > > > dma-fences, and add
> > > > > > > > > > > lockdep annotation for them. In particular:
> > > > > > > > > > > 
> > > > > > > > > > > * Do not allow waiting under any memory management locks.
> > > > > > > > > > > * Do not allow to attach them to a dma-resv object.
> > > > > > > > > > > * Introduce a new interface for adding callbacks making the
> > > > > > > > > > > helper adding
> > > > > > > > > > >    a callback sign off on that it is aware that the dma-fence may not
> > > > > > > > > > >    complete anytime soon. Typically this will be the
> > > > > > > > > > > scheduler chaining
> > > > > > > > > > >    a new long-running fence on another one.
> > > > > > > > > > 
> > > > > > > > > > Well that's pretty much what I tried before:
> > > > > > > > > > https://lwn.net/Articles/893704/
> > > > > > > > > > 
> > > > > > 
> > > > > > I don't think this quite the same, this explictly enforces that we don't
> > > > > > break the dma-fence rules (in path of memory allocations, exported in
> > > > > > any way), essentially this just SW sync point reusing dma-fence the
> > > > > > infrastructure for signaling / callbacks. I believe your series tried to
> > > > > > export these fences to user space (admittedly I haven't fully read your
> > > > > > series).
> > > > > > 
> > > > > > In this use case we essentially just want to flow control the ring via
> > > > > > the dma-scheduler + maintain a list of pending jobs so the TDR can be
> > > > > > used for cleanup if LR entity encounters an error. To me this seems
> > > > > > perfectly reasonable but I know dma-femce rules are akin to a holy war.
> > > > > > 
> > > > > > If we return NULL in run_job, now we have to be able to sink all jobs
> > > > > > in the backend regardless on ring space, maintain a list of jobs pending
> > > > > > for cleanup after errors, and write a different cleanup path as now the
> > > > > > TDR doesn't work. Seems very, very silly to duplicate all of this code
> > > > > > when the DRM scheduler provides all of this for us. Also if we go this
> > > > > > route, now all drivers are going to invent ways to handle LR jobs /w the
> > > > > > DRM scheduler.
> > > > > > 
> > > > > > This solution is pretty clear, mark the scheduler as LR, and don't
> > > > > > export any fences from the scheduler. If you try to export these fences
> > > > > > a blow up happens.
> > > > > 
> > > > > The problem is if you mix things up. Like for resets you need all the
> > > > > schedulers on an engine/set-of-engines to quiescent or things get
> > > > > potentially hilarious. If you now have a scheduler in forever limbo, the
> > > > > dma_fence guarantees are right out the window.
> > > > > 
> > > > 
> > > > Right, a GT reset on Xe is:
> > > > 
> > > > Stop all schedulers
> > > > Do a reset
> > > > Ban any schedulers which we think caused the GT reset
> > > > Resubmit all schedulers which we think were good
> > > > Restart all schedulers
> > > > 
> > > > None of this flow depends on LR dma-fences, all of this uses the DRM
> > > > sched infrastructure and work very well compared to the i915. Rewriting
> > > > all this with a driver specific implementation is what we are trying to
> > > > avoid.
> > > > 
> > > > Similarly if LR entity hangs on its own (not a GT reset, rather the
> > > > firmware does the reset for us) we use all the DRM scheduler
> > > > infrastructure to handle this. Again this works rather well...
> > > 
> > > Yeah this is why I don't think duplicating everything that long-running
> > > jobs need makes any sense. iow I agree with you.
> > > 
> > 
> > Glad we agree.
> > 
> > > > > But the issue you're having is fairly specific if it's just about
> > > > > ringspace. I think the dumbest fix is to just block in submit if you run
> > > > > out of per-ctx ringspace, and call it a day. This notion that somehow the
> > > > 
> > > > How does that not break the dma-fence rules? A job can publish its
> > > > finished fence after ARM, if the finished fence fence waits on ring
> > > > space that may not free up in a reasonable amount of time we now have
> > > > broken the dma-dence rules. My understanding is any dma-fence must only
> > > > on other dma-fence, Christian seems to agree and NAK'd just blocking if
> > > > no space available [1]. IMO this series ensures we don't break dma-fence
> > > > rules by restricting how the finished fence can be used.
> > > 
> > > Oh I meant in the submit ioctl, _before_ you even call
> > > drm_sched_job_arm(). It's ok to block in there indefinitely.
> > >
> > 
> > Ok, but how do we determine if their is ring space, wait on xe_hw_fence
> > which is a dma-fence. We just move a wait from the scheduler to the exec
> > IOCTL and I realy fail to see the point of that.
> 
> > Fill in anything you need into the ring at ioctl time, but don't update
> > the tail pointers? If there's no space, then EWOULDBLOCK.
> 

Ok, I can maybe buy this approach and it is fairly easy to do. I'm
going to do this for LR jobs only though (non-LR jobs will still flow
control the ring via the scheduler + write the ring in run_job). A bit of
duplicate code but I can live with this.

> > > > > kernel is supposed to provide a bottomless queue of anything userspace
> > > > > submits simply doesn't hold up in reality (as much as userspace standards
> > > > > committees would like it to), and as long as it doesn't have a real-world
> > > > > perf impact it doesn't really matter why we end up blocking in the submit
> > > > > ioctl. It might also be a simple memory allocation that hits a snag in
> > > > > page reclaim.
> > > > > 
> > > > > > > > > > And the reasons why it was rejected haven't changed.
> > > > > > > > > > 
> > > > > > > > > > Regards,
> > > > > > > > > > Christian.
> > > > > > > > > > 
> > > > > > > > > Yes, TBH this was mostly to get discussion going how we'd best
> > > > > > > > > tackle this problem while being able to reuse the scheduler for
> > > > > > > > > long-running workloads.
> > > > > > > > > 
> > > > > > > > > I couldn't see any clear decision on your series, though, but one
> > > > > > > > > main difference I see is that this is intended for driver-internal
> > > > > > > > > use only. (I'm counting using the drm_scheduler as a helper for
> > > > > > > > > driver-private use). This is by no means a way to try tackle the
> > > > > > > > > indefinite fence problem.
> > > > > > > > 
> > > > > > > > Well this was just my latest try to tackle this, but essentially the
> > > > > > > > problems are the same as with your approach: When we express such
> > > > > > > > operations as dma_fence there is always the change that we leak that
> > > > > > > > somewhere.
> > > > > > > > 
> > > > > > > > My approach of adding a flag noting that this operation is dangerous and
> > > > > > > > can't be synced with something memory management depends on tried to
> > > > > > > > contain this as much as possible, but Daniel still pretty clearly
> > > > > > > > rejected it (for good reasons I think).
> > > > > > > > 
> > > > > > > > > 
> > > > > > > > > We could ofc invent a completely different data-type that abstracts
> > > > > > > > > the synchronization the scheduler needs in the long-running case, or
> > > > > > > > > each driver could hack something up, like sleeping in the
> > > > > > > > > prepare_job() or run_job() callback for throttling, but those waits
> > > > > > > > > should still be annotated in one way or annotated one way or another
> > > > > > > > > (and probably in a similar way across drivers) to make sure we don't
> > > > > > > > > do anything bad.
> > > > > > > > > 
> > > > > > > > >  So any suggestions as to what would be the better solution here
> > > > > > > > > would be appreciated.
> > > > > > > > 
> > > > > > > > Mhm, do we really the the GPU scheduler for that?
> > > > > > > > 
> > > > > > 
> > > > > > I think we need to solve this within the DRM scheduler one way or
> > > > > > another.
> > > > > 
> > > > > Yeah so if we conclude that the queue really must be bottomless then I
> > > > > agree drm-sched should help out sort out the mess. Because I'm guessing
> > > > > that every driver will have this issue. But that's a big if.
> > > > > 
> > > > > I guess if we teach the drm scheduler that some jobs are fairly endless
> > > > > then maybe it wouldn't be too far-fetched to also teach it to wait for a
> > > > > previous one to finish (but not with the dma_fence that preempts, which we
> > > > > put into the dma_resv for memory management, but some other struct
> > > > > completion). The scheduler already has a concept of not stuffing too much
> > > > > stuff into the same queue after all, so this should fit?
> > > > 
> > > > See above, exact same situation as spinning on flow controling the ring,
> > > > this IMO absolutely breaks the dma-fence rules. IMO the correct solution
> > > > is to have a DRM that doesn't export dma-fences, this is exactly what
> > > > this series does as if we try to, boom lockdep / warn on blow up.
> > > 
> > > I dont think it's impossible to do this correctly, but definitely very,
> > > very hard. Which is why neither Christian nor me like the idea :-)
> > > 
> > > Essentially you'd have to make sure that any indefinite way will still
> > > react to drm_sched_job, so that you're not holding up a gt reset or
> > > anything like that, but only ever hold up forward progress for this
> > > specific scheduler/drm_sched_entity. Which you can do as long (and again,
> > > another hugely tricky detail) you still obey the preempt-ctx dma_fence and
> > > manage to preempt the underlying long-running ctx even when the drm/sched
> > > is stuck waiting for an indefinite fence (like waiting for ringspace or
> > > something like that).
> > > 
> > > So I don't think it's impossible, but very far away from "a good idea" :-)
> > > 
> > > Hence to proposal to bail out of this entire mess by throwing EWOULDBLCK
> > > back to userspace directly from the ioctl function, where you still can do
> > > that without breaking any dma_fence rules. Or if it's not a case that
> > > matters in practice, simply block in the ioctl handler instead of
> > > returning EWOULDBLCK.
> > 
> > Returning EWOULDBLCK on a full ring is reasonsible I guess but again
> > without returning a fence in run job the TDR can't be used for clean up
> > on LR entities which will result in duplicate code open coded by each
> > driver. Same goes for checking ring full in exec.
> > 
> > How about this:
> > - We mark xe_hw_fence as LR to ensure it can't be exported, return this
> >   in run_job which gives flow control on the ring + the handy TDR
> >   functionality
> > - When a scheduler is marked as LR, we do not generate finished fences
> >   for jobs
> > - We heavily, heavily scrutinize any usage of the LR fence flag going
> >   foward
> > - We document all of this very loudly
> > 
> > Is this reasonable?
> 
> I'm not seeing why it's needed? If you're worried about TDR duplication
> then I think we need something else. Because for long-running ctx we never
> have a timeout of the ctx itself (by definition). The only thing we time
> out on is the preempt, so I guess what could be done:
> - have the minimal scaffolding to support the preempt-ctx fence in
>   drm_sched_entity
> - when the preempt ctx fence enables signalling a) callback to the driver
>   to start the preempt (which should signal the fence) b) start a timer,
>   which should catch if the preempt takes too long

The GuC does this for us, no need.

> - if the timeout first (importantly we only enable that when the
>   preemption is trigger, not by default), kick of the normal drm/sched tdr
>   flow. maybe needs some adjustements in case there's different handling
>   needed for when a preemption times out compared to just a job timing out
>

The GuC informs us of this and yes, we then kick the TDR.

> I think this might make sense for sharing timeout handling needs for
> long-running context. What you proposed I don't really follow why it
> should exist, because that kind of timeout handling should not ever happen
> for long-running jobs.

We use the TDR as a single cleanup point for all entities. In the
case of an LR entity this occurs if the GuC issues a reset on the
entity (likely a preempt timeout), the entity takes a non-recoverable page
fault, or the entity is the root cause of a GT reset. The pending job
list is handy here; that's why I wanted to return the xe_hw_fence in
run_job, to hold the job in the scheduler's pending list. The TDR won't
fire if the pending list is empty.

Based on what you are saying, my new proposal:

1. Write the ring in exec for LR jobs, returning -EWOULDBLOCK if there is
no space in the ring
2. Return NULL in run_job (or alternatively a signaled fence)
3. Have a special-cased cleanup flow for LR entities (not the TDR, rather
likely a different worker we kick, owned by the xe_engine)
4. Document all of this so that this is how drivers are expected to do LR
workloads with the DRM scheduler

1 & 3 are pretty clear duplicates of code, but I can live with that if
I can get an Ack on the plan + move on. The coding will not be all that
difficult either, I am just being difficult. It is probably 100ish
lines of code.
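
Roughly, those 100ish lines would have the following shape (a sketch with
made-up names, ring-wrap handling omitted, not the actual Xe code):

#include <linux/dma-fence.h>
#include <linux/errno.h>
#include <linux/string.h>
#include <linux/types.h>
#include <drm/gpu_scheduler.h>

struct lr_ring {			/* made-up, stands in for the xe ring */
	u32 head, tail, size;		/* size assumed to be a power of two */
	void *vaddr;
};

static u32 lr_ring_space(struct lr_ring *r)
{
	return (r->head - r->tail - 1) & (r->size - 1);
}

/* Exec (ioctl) path for LR jobs: write the ring here, fail early if full. */
static int lr_exec_write_ring(struct lr_ring *r, const void *cmds, u32 bytes)
{
	if (lr_ring_space(r) < bytes)
		return -EWOULDBLOCK;	/* userspace retries the exec ioctl */

	memcpy((char *)r->vaddr + r->tail, cmds, bytes);
	r->tail = (r->tail + bytes) & (r->size - 1); /* publish only now */
	return 0;
}

/* run_job for LR entities: nothing to return, the ring is already written. */
static struct dma_fence *lr_run_job(struct drm_sched_job *sched_job)
{
	return NULL;	/* no fence exported; cleanup goes via a driver worker */
}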

What do you think Daniel, does this seem like a reasonable plan?

Matt

> -Daniel
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences
  2023-04-06 16:58                       ` Matthew Brost
@ 2023-04-06 17:09                         ` Daniel Vetter
  0 siblings, 0 replies; 87+ messages in thread
From: Daniel Vetter @ 2023-04-06 17:09 UTC (permalink / raw)
  To: Matthew Brost
  Cc: robdclark, airlied, lina, Thomas Hellström (Intel),
	dri-devel, intel-xe, boris.brezillon, Christian König,
	faith.ekstrand

On Thu, 6 Apr 2023 at 18:58, Matthew Brost <matthew.brost@intel.com> wrote:
>
> On Thu, Apr 06, 2023 at 08:32:59AM +0200, Daniel Vetter wrote:
> > On Wed, Apr 05, 2023 at 11:58:44PM +0000, Matthew Brost wrote:
> > > On Wed, Apr 05, 2023 at 03:09:08PM +0200, Daniel Vetter wrote:
> > > > On Tue, Apr 04, 2023 at 07:48:27PM +0000, Matthew Brost wrote:
> > > > > On Tue, Apr 04, 2023 at 09:25:52PM +0200, Daniel Vetter wrote:
> > > > > > On Tue, Apr 04, 2023 at 07:02:23PM +0000, Matthew Brost wrote:
> > > > > > > On Tue, Apr 04, 2023 at 08:14:01PM +0200, Thomas Hellström (Intel) wrote:
> > > > > > > >
> > > > > > > > On 4/4/23 15:10, Christian König wrote:
> > > > > > > > > Am 04.04.23 um 14:54 schrieb Thomas Hellström:
> > > > > > > > > > Hi, Christian,
> > > > > > > > > >
> > > > > > > > > > On 4/4/23 11:09, Christian König wrote:
> > > > > > > > > > > Am 04.04.23 um 02:22 schrieb Matthew Brost:
> > > > > > > > > > > > From: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > > > > > > > > > >
> > > > > > > > > > > > For long-running workloads, drivers either need to open-code
> > > > > > > > > > > > completion
> > > > > > > > > > > > waits, invent their own synchronization primitives or internally use
> > > > > > > > > > > > dma-fences that do not obey the cross-driver dma-fence protocol, but
> > > > > > > > > > > > without any lockdep annotation all these approaches are error prone.
> > > > > > > > > > > >
> > > > > > > > > > > > So since for example the drm scheduler uses dma-fences it is
> > > > > > > > > > > > desirable for
> > > > > > > > > > > > a driver to be able to use it for throttling and error
> > > > > > > > > > > > handling also with
> > > > > > > > > > > > internal dma-fences tha do not obey the cros-driver
> > > > > > > > > > > > dma-fence protocol.
> > > > > > > > > > > >
> > > > > > > > > > > > Introduce long-running completion fences in form of
> > > > > > > > > > > > dma-fences, and add
> > > > > > > > > > > > lockdep annotation for them. In particular:
> > > > > > > > > > > >
> > > > > > > > > > > > * Do not allow waiting under any memory management locks.
> > > > > > > > > > > > * Do not allow to attach them to a dma-resv object.
> > > > > > > > > > > > * Introduce a new interface for adding callbacks making the
> > > > > > > > > > > > helper adding
> > > > > > > > > > > >    a callback sign off on that it is aware that the dma-fence may not
> > > > > > > > > > > >    complete anytime soon. Typically this will be the
> > > > > > > > > > > > scheduler chaining
> > > > > > > > > > > >    a new long-running fence on another one.
> > > > > > > > > > >
> > > > > > > > > > > Well that's pretty much what I tried before:
> > > > > > > > > > > https://lwn.net/Articles/893704/
> > > > > > > > > > >
> > > > > > >
> > > > > > > I don't think this quite the same, this explictly enforces that we don't
> > > > > > > break the dma-fence rules (in path of memory allocations, exported in
> > > > > > > any way), essentially this just SW sync point reusing dma-fence the
> > > > > > > infrastructure for signaling / callbacks. I believe your series tried to
> > > > > > > export these fences to user space (admittedly I haven't fully read your
> > > > > > > series).
> > > > > > >
> > > > > > > In this use case we essentially just want to flow control the ring via
> > > > > > > the dma-scheduler + maintain a list of pending jobs so the TDR can be
> > > > > > > used for cleanup if LR entity encounters an error. To me this seems
> > > > > > > perfectly reasonable but I know dma-femce rules are akin to a holy war.
> > > > > > >
> > > > > > > If we return NULL in run_job, now we have to be able to sink all jobs
> > > > > > > in the backend regardless on ring space, maintain a list of jobs pending
> > > > > > > for cleanup after errors, and write a different cleanup path as now the
> > > > > > > TDR doesn't work. Seems very, very silly to duplicate all of this code
> > > > > > > when the DRM scheduler provides all of this for us. Also if we go this
> > > > > > > route, now all drivers are going to invent ways to handle LR jobs /w the
> > > > > > > DRM scheduler.
> > > > > > >
> > > > > > > This solution is pretty clear, mark the scheduler as LR, and don't
> > > > > > > export any fences from the scheduler. If you try to export these fences
> > > > > > > a blow up happens.
> > > > > >
> > > > > > The problem is if you mix things up. Like for resets you need all the
> > > > > > schedulers on an engine/set-of-engines to quiescent or things get
> > > > > > potentially hilarious. If you now have a scheduler in forever limbo, the
> > > > > > dma_fence guarantees are right out the window.
> > > > > >
> > > > >
> > > > > Right, a GT reset on Xe is:
> > > > >
> > > > > Stop all schedulers
> > > > > Do a reset
> > > > > Ban any schedulers which we think caused the GT reset
> > > > > Resubmit all schedulers which we think were good
> > > > > Restart all schedulers
> > > > >
> > > > > None of this flow depends on LR dma-fences, all of this uses the DRM
> > > > > sched infrastructure and work very well compared to the i915. Rewriting
> > > > > all this with a driver specific implementation is what we are trying to
> > > > > avoid.
> > > > >
> > > > > Similarly if LR entity hangs on its own (not a GT reset, rather the
> > > > > firmware does the reset for us) we use all the DRM scheduler
> > > > > infrastructure to handle this. Again this works rather well...
> > > >
> > > > Yeah this is why I don't think duplicating everything that long-running
> > > > jobs need makes any sense. iow I agree with you.
> > > >
> > >
> > > Glad we agree.
> > >
> > > > > > But the issue you're having is fairly specific if it's just about
> > > > > > ringspace. I think the dumbest fix is to just block in submit if you run
> > > > > > out of per-ctx ringspace, and call it a day. This notion that somehow the
> > > > >
> > > > > How does that not break the dma-fence rules? A job can publish its
> > > > > finished fence after ARM, if the finished fence fence waits on ring
> > > > > space that may not free up in a reasonable amount of time we now have
> > > > > broken the dma-dence rules. My understanding is any dma-fence must only
> > > > > on other dma-fence, Christian seems to agree and NAK'd just blocking if
> > > > > no space available [1]. IMO this series ensures we don't break dma-fence
> > > > > rules by restricting how the finished fence can be used.
> > > >
> > > > Oh I meant in the submit ioctl, _before_ you even call
> > > > drm_sched_job_arm(). It's ok to block in there indefinitely.
> > > >
> > >
> > > Ok, but how do we determine if their is ring space, wait on xe_hw_fence
> > > which is a dma-fence. We just move a wait from the scheduler to the exec
> > > IOCTL and I realy fail to see the point of that.
> >
> > Fill in anything you need into the ring at ioctl time, but don't update
> > the tail pointers? If there's no space, then EWOULDBLCK.
> >
>
> Ok, I can maybe buy this approach and this is fairly easy to do. I'm
> going to do this for LR jobs only though (non-LR job will still flow
> control on the ring via the scheduler + write ring in run_job). A bit of
> duplicate code but I can live with this.
>
> > > > > > kernel is supposed to provide a bottomless queue of anything userspace
> > > > > > submits simply doesn't hold up in reality (as much as userspace standards
> > > > > > committees would like it to), and as long as it doesn't have a real-world
> > > > > > perf impact it doesn't really matter why we end up blocking in the submit
> > > > > > ioctl. It might also be a simple memory allocation that hits a snag in
> > > > > > page reclaim.
> > > > > >
> > > > > > > > > > > And the reasons why it was rejected haven't changed.
> > > > > > > > > > >
> > > > > > > > > > > Regards,
> > > > > > > > > > > Christian.
> > > > > > > > > > >
> > > > > > > > > > Yes, TBH this was mostly to get discussion going how we'd best
> > > > > > > > > > tackle this problem while being able to reuse the scheduler for
> > > > > > > > > > long-running workloads.
> > > > > > > > > >
> > > > > > > > > > I couldn't see any clear decision on your series, though, but one
> > > > > > > > > > main difference I see is that this is intended for driver-internal
> > > > > > > > > > use only. (I'm counting using the drm_scheduler as a helper for
> > > > > > > > > > driver-private use). This is by no means a way to try tackle the
> > > > > > > > > > indefinite fence problem.
> > > > > > > > >
> > > > > > > > > Well this was just my latest try to tackle this, but essentially the
> > > > > > > > > problems are the same as with your approach: When we express such
> > > > > > > > > operations as dma_fence there is always the change that we leak that
> > > > > > > > > somewhere.
> > > > > > > > >
> > > > > > > > > My approach of adding a flag noting that this operation is dangerous and
> > > > > > > > > can't be synced with something memory management depends on tried to
> > > > > > > > > contain this as much as possible, but Daniel still pretty clearly
> > > > > > > > > rejected it (for good reasons I think).
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > We could ofc invent a completely different data-type that abstracts
> > > > > > > > > > the synchronization the scheduler needs in the long-running case, or
> > > > > > > > > > each driver could hack something up, like sleeping in the
> > > > > > > > > > prepare_job() or run_job() callback for throttling, but those waits
> > > > > > > > > > should still be annotated one way or another
> > > > > > > > > > (and probably in a similar way across drivers) to make sure we don't
> > > > > > > > > > do anything bad.
> > > > > > > > > >
> > > > > > > > > >  So any suggestions as to what would be the better solution here
> > > > > > > > > > would be appreciated.
> > > > > > > > >
> > > > > > > > > Mhm, do we really need the GPU scheduler for that?
> > > > > > > > >
> > > > > > >
> > > > > > > I think we need to solve this within the DRM scheduler one way or
> > > > > > > another.
> > > > > >
> > > > > > Yeah so if we conclude that the queue really must be bottomless then I
> > > > > > agree drm-sched should help sort out the mess. Because I'm guessing
> > > > > > that every driver will have this issue. But that's a big if.
> > > > > >
> > > > > > I guess if we teach the drm scheduler that some jobs are fairly endless
> > > > > > then maybe it wouldn't be too far-fetched to also teach it to wait for a
> > > > > > previous one to finish (but not with the dma_fence that preempts, which we
> > > > > > put into the dma_resv for memory management, but some other struct
> > > > > > completion). The scheduler already has a concept of not stuffing too much
> > > > > > stuff into the same queue after all, so this should fit?
> > > > >
> > > > > See above, exact same situation as spinning on flow controlling the ring;
> > > > > this IMO absolutely breaks the dma-fence rules. IMO the correct solution
> > > > > is to have a DRM scheduler that doesn't export dma-fences, which is exactly
> > > > > what this series does: if we try to, boom, lockdep / WARN_ON blows up.
> > > >
> > > > I dont think it's impossible to do this correctly, but definitely very,
> > > > very hard. Which is why neither Christian nor me like the idea :-)
> > > >
> > > > Essentially you'd have to make sure that any indefinite wait will still
> > > > react to drm_sched_job, so that you're not holding up a gt reset or
> > > > anything like that, but only ever hold up forward progress for this
> > > > specific scheduler/drm_sched_entity. Which you can do as long (and again,
> > > > another hugely tricky detail) you still obey the preempt-ctx dma_fence and
> > > > manage to preempt the underlying long-running ctx even when the drm/sched
> > > > is stuck waiting for an indefinite fence (like waiting for ringspace or
> > > > something like that).
> > > >
> > > > So I don't think it's impossible, but very far away from "a good idea" :-)
> > > >
> > > > Hence the proposal to bail out of this entire mess by throwing EWOULDBLOCK
> > > > back to userspace directly from the ioctl function, where you still can do
> > > > that without breaking any dma_fence rules. Or if it's not a case that
> > > > matters in practice, simply block in the ioctl handler instead of
> > > > returning EWOULDBLOCK.
> > >
> > > Returning EWOULDBLOCK on a full ring is reasonable I guess but again
> > > without returning a fence in run job the TDR can't be used for clean up
> > > on LR entities which will result in duplicate code open coded by each
> > > driver. Same goes for checking ring full in exec.
> > >
> > > How about this:
> > > - We mark xe_hw_fence as LR to ensure it can't be exported, return this
> > >   in run_job which gives flow control on the ring + the handy TDR
> > >   functionality
> > > - When a scheduler is marked as LR, we do not generate finished fences
> > >   for jobs
> > > - We heavily, heavily scrutinize any usage of the LR fence flag going
> > >   forward
> > > - We document all of this very loudly
> > >
> > > Is this reasonable?
> >
> > I'm not seeing why it's needed? If you're worried about TDR duplication
> > then I think we need something else. Because for long-running ctx we never
> > have a timeout of the ctx itself (by definition). The only thing we time
> > out on is the preempt, so I guess what could be done:
> > - have the minimal scaffolding to support the preempt-ctx fence in
> >   drm_sched_entity
> > - when the preempt ctx fence enables signalling a) callback to the driver
> >   to start the preempt (which should signal the fence) b) start a timer,
> >   which should catch if the preempt takes too long
>
> The GuC does this for us, no need.
>
> > - if the timeout fires first (importantly we only enable that when the
> >   preemption is triggered, not by default), kick off the normal drm/sched tdr
> >   flow. maybe needs some adjustments in case there's different handling
> >   needed for when a preemption times out compared to just a job timing out
> >
>
> The GuC informs us of this and yeah we kick the TDR.

You might still need the kernel fallback when guc dies? But yeah not
sure how much similarity is then left with the end-of-batch timeout handling.

> > I think this might make sense for sharing timeout handling needs for
> > long-running context. What you proposed I don't really follow why it
> > should exist, because that kind of timeout handling should not ever happen
> > for long-running jobs.
>
> We just use the TDR as a single cleanup point for all entities. In the
> case of an LR entity this occurs if the GuC issues a reset on the
> entity (likely a preempt timeout), the entity takes a non-recoverable page
> fault, or the entity is the root cause of a GT reset. The pending job
> list here is handy; that's why I wanted to return xe_hw_fence in run_job,
> to hold the job in the scheduler pending list. The TDR won't
> fire if the pending list is empty.
>
> Based on what you are saying my new proposal:
>
> 1. Write the ring in exec for LR jobs, return -EWOULDBLOCK if there is no
> space in the ring
> 2. Return NULL in run_job (or alternatively a signaled fence, as sketched
> below)
> 3. Have a special-cased cleanup flow for LR entities (not the TDR, rather
> likely a different worker we kick, owned by the xe_engine)
> 4. Document this so that this is how drivers are expected to do LR
> workloads plus the DRM scheduler
>
> 1 & 3 are pretty clear duplicates of code but I can live with that if
> I can get an Ack on the plan + move on. The coding will not be all that
> difficult either, I am just being difficult. In the end it is probably 100ish
> lines of code.
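
A sketch of what point 2 could look like in a driver backend, using the
stub fence helper; lr_job and lr_submit() are hypothetical names:

#include <drm/gpu_scheduler.h>
#include <linux/dma-fence.h>

struct lr_job {
	struct drm_sched_job base;
	/* driver-specific payload */
};

void lr_submit(struct lr_job *job);	/* pushes the job to the firmware */

static struct dma_fence *lr_run_job(struct drm_sched_job *sched_job)
{
	struct lr_job *job = container_of(sched_job, struct lr_job, base);

	lr_submit(job);

	/*
	 * No end-of-batch dma-fence for long-running jobs: hand back an
	 * already-signalled stub so nothing meaningful can leak out of the
	 * scheduler (or NULL, once the scheduler accepts that).
	 */
	return dma_fence_get_stub();
}
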

For 1 I guess you could also write the ring stuff for end-of-batch
code from the ioctl (and then just block for the ring to advance
enough if needed). That is how we had i915-gem operate before
i915-scheduler happened.
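
Either way the exec-side check stays simple; a rough sketch of the
EWOULDBLOCK variant, where the lr_ring structure and helpers are made up
for illustration and not actual Xe (or i915) code:

#include <linux/errno.h>
#include <linux/types.h>

/* Hypothetical per-exec-queue software ring state. */
struct lr_ring {
	u32 head;	/* last head position sampled from the hardware */
	u32 tail;	/* software tail, only published once the copy is done */
	u32 size;	/* ring size in dwords, power of two */
	u32 *vaddr;	/* CPU mapping of the ring buffer */
};

static u32 lr_ring_space(const struct lr_ring *ring)
{
	/* Standard circular-buffer space calculation, one slot kept free. */
	return (ring->head - ring->tail - 1) & (ring->size - 1);
}

/* Called from the exec ioctl, before drm_sched_job_arm(). */
static int lr_exec_emit(struct lr_ring *ring, const u32 *cmds, u32 ndw)
{
	u32 i;

	if (lr_ring_space(ring) < ndw)
		return -EWOULDBLOCK;	/* ring full: bounce back to userspace */

	for (i = 0; i < ndw; i++)
		ring->vaddr[(ring->tail + i) & (ring->size - 1)] = cmds[i];

	/* Only now publish the new commands by advancing the tail. */
	ring->tail = (ring->tail + ndw) & (ring->size - 1);
	return 0;
}
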

For 3 I'm not sure there's really that much to share, the end-of-batch
and preempt-ctx dma_fence are just rather different. I'm not too clear
on why LR needs a special cleanup flow, isn't it roughly the same:
- stop drm/sched from pushing in new jobs
- preempt the ctx to kill it

With xe I don't think we ever want to let existing jobs complete,
neither for legacy nor for lr ctx.
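
Those two steps stay pretty small; a minimal sketch, where lr_ctx and
lr_ctx_preempt() are hypothetical stand-ins for the driver/firmware side:

#include <drm/gpu_scheduler.h>

struct lr_ctx;				/* driver-specific context, hypothetical */
void lr_ctx_preempt(struct lr_ctx *ctx);	/* kicks the ctx off the hardware */

static void lr_ctx_kill(struct drm_sched_entity *entity, struct lr_ctx *ctx)
{
	/* Stop drm/sched from pushing any more jobs to the backend. */
	drm_sched_entity_destroy(entity);

	/* Preempt the context to kill it; never wait for jobs to complete. */
	lr_ctx_preempt(ctx);
}
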


> What do you think Daniel, seem like a reasonable plan?

Yeah my comments above are just minor side questions, I think this is
roughly what we want. Plus/minus polish details of how much code
sharing between legacy/lr ctx in xe and for lr with other drivers
makes sense or not. I think the in-fences handling (because that ties
> into memory management when evicting and restoring a vm) is the big
part which ideally has as much shared as possible. Which I think this
achieves.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans
  2023-04-05 18:53         ` Matthew Brost
  2023-04-06 10:04           ` Christian König
@ 2023-04-07  0:20           ` Zeng, Oak
  2023-04-11  9:02             ` Christian König
  1 sibling, 1 reply; 87+ messages in thread
From: Zeng, Oak @ 2023-04-07  0:20 UTC (permalink / raw)
  To: Brost, Matthew
  Cc: robdclark, airlied, lina, dri-devel, Christian König,
	boris.brezillon, Vetter,  Daniel, intel-xe, faith.ekstrand

So this series basically goes with option 2. The part of option 2 that makes me uncomfortable is: dma-fence doesn't work for long running workloads, so why do we generate it in the first place? As long as a dma-fence is generated, it will become a source of confusion in the future, no matter how much you annotate it or document it. So if we decide to go with option 2, the bottom line is: don't generate a dma-fence for long running workloads during job submission. This requires some rework in the drm scheduler.

The cleanest solution to me is option 3. Dma-fence is a very old technology; when it was created, no GPU supported page faults. Obviously this is not a good fit for a modern GPU with page fault support. I think the best way is to create a new scheduler and dependency tracking mechanism that works for both page-fault-enabled and page-fault-disabled contexts. I think this matches what Christian said below. Maybe nobody thinks this is easy?

Thanks,
Oak

> -----Original Message-----
> From: Brost, Matthew <matthew.brost@intel.com>
> Sent: April 5, 2023 2:53 PM
> To: Zeng, Oak <oak.zeng@intel.com>
> Cc: Christian König <christian.koenig@amd.com>; Vetter, Daniel
> <daniel.vetter@intel.com>; Thomas Hellström
> <thomas.hellstrom@linux.intel.com>; dri-devel@lists.freedesktop.org; intel-
> xe@lists.freedesktop.org; robdclark@chromium.org; airlied@linux.ie;
> lina@asahilina.net; boris.brezillon@collabora.com; faith.ekstrand@collabora.com
> Subject: Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
> plans
> 
> On Wed, Apr 05, 2023 at 12:06:53PM -0600, Zeng, Oak wrote:
> > Hi,
> >
> > Using dma-fence for completion/dependency tracking for long-run
> workload(more precisely on-demand paging/page fault enabled workload) can
> cause deadlock. This seems the significant issue here. Other issues such as the
> drm scheduler completion order implication etc are minors which can be solve
> inside the framework of drm scheduler. We need to evaluate below paths:
> >
> > 	1) still use drm scheduler for job submission, and use dma-fence for job
> completion waiting/dependency tracking. This is solution proposed in this series.
> Annotate dma-fence for long-run workload: user can still wait dma-fence for job
> completion but can't wait dma-fence while holding any memory management
> locks.  We still use dma-fence for dependency tracking. But it is just very easily
> run into deadlock when on-demand paging is in the picture. The annotation helps
> us to detect deadlock but not solve deadlock problems. Seems *not* a complete
> solution: It is almost impossible to completely avoid dependency deadlock in
> complex runtime environment
> >
> 
> No one can wait on LR fence, so it is impossible to deadlock. The
> annotations enforce this. Literally this is only for flow controling the
> ring / hold pending jobs in in the DRM schedule list.
> 
> > 	2) Still use drm scheduler but not use dma-fence for completion signaling
> and dependency tracking. This way we still get some free functions (reset, err
> handling ring flow control as Matt said)from drm scheduler, but push the
> dependency/completion tracking completely to user space using techniques such
> as user space fence. User space doesn't have chance to wait fence while holding
> a kernel memory management lock, thus the dma-fence deadlock issue is solved.
> >
> 
> We use user space fence for syncs.
> 
> > 	3) Completely discard drm scheduler and dma-fence for long-run
> workload. Use user queue/doorbell for super fast submission, directly interact
> with fw scheduler. Use user fence for completion/dependency tracking.
> >
> 
> This is a hard no from me, I want 1 submission path in Xe. Either we use
> the DRM scheduler or we don't.
> 
> Matt
> 
> > Thanks,
> > Oak
> >
> > > -----Original Message-----
> > > From: Christian König <christian.koenig@amd.com>
> > > Sent: April 5, 2023 3:30 AM
> > > To: Brost, Matthew <matthew.brost@intel.com>; Zeng, Oak
> > > <oak.zeng@intel.com>
> > > Cc: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org;
> > > robdclark@chromium.org; thomas.hellstrom@linux.intel.com;
> airlied@linux.ie;
> > > lina@asahilina.net; boris.brezillon@collabora.com;
> faith.ekstrand@collabora.com
> > > Subject: Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
> > > plans
> > >
> > > Am 04.04.23 um 20:08 schrieb Matthew Brost:
> > > > On Tue, Apr 04, 2023 at 12:02:03PM -0600, Zeng, Oak wrote:
> > > >> Hi Matt, Thomas,
> > > >>
> > > >> Some very bold out of box thinking in this area:
> > > >>
> > > >> 1. so you want to use drm scheduler and dma-fence for long running
> workload.
> > > Why you want to do this in the first place? What is the benefit? Drm scheduler
> is
> > > pretty much a software scheduler. Modern gpu has scheduler built at fw/hw
> > > level, as you said below for intel this is Guc. Can xe driver just directly submit
> job
> > > to Guc, bypassing drm scheduler?
> > > >>
> > > > If we did that now we have 2 paths for dependency track, flow controling
> > > > the ring, resets / error handling / backend submission implementations.
> > > > We don't want this.
> > >
> > > Well exactly that's the point: Why?
> > >
> > > As far as I can see that are two completely distinct use cases, so you
> > > absolutely do want two completely distinct implementations for this.
> > >
> > > >> 2. using dma-fence for long run workload: I am well aware that page fault
> (and
> > > the consequent memory allocation/lock acquiring to fix the fault) can cause
> > > deadlock for a dma-fence wait. But I am not convinced that dma-fence can't
> be
> > > used purely because the nature of the workload that it runs very long
> (indefinite).
> > > I did a math: the dma_fence_wait_timeout function's third param is the
> timeout
> > > which is a signed long type. If HZ is 1000, this is about 23 days. If 23 days is not
> long
> > > enough, can we just change the timeout parameter to signed 64 bits so it is
> much
> > > longer than our life time...
> > > >>
> > > >> So I mainly argue we can't use dma-fence for long-run workload is not
> > > because the workload runs very long, rather because of the fact that we use
> > > page fault for long-run workload. If we enable page fault for short-run
> workload,
> > > we can't use dma-fence either. Page fault is the key thing here.
> > > >>
> > > >> Now since we use page fault which is *fundamentally* controversial with
> > > dma-fence design, why now just introduce a independent concept such as
> user-
> > > fence instead of extending existing dma-fence?
> > > >>
> > > >> I like unified design. If drm scheduler, dma-fence can be extended to work
> for
> > > everything, it is beautiful. But seems we have some fundamental problem
> here.
> > > >>
> > > > Thomas's patches turn a dma-fence into KMD sync point (e.g. we just use
> > > > the signal / CB infrastructure) and enforce we don't use use these
> > > > dma-fences from the scheduler in memory reclaim paths or export these to
> > > > user space or other drivers. Think of this mode as SW only fence.
> > >
> > > Yeah and I truly think this is an really bad idea.
> > >
> > > The signal/CB infrastructure in the dma_fence turned out to be the
> > > absolutely nightmare I initially predicted. Sorry to say that, but in
> > > this case the "I've told you so" is appropriate in my opinion.
> > >
> > > If we need infrastructure for long running dependency tracking we should
> > > encapsulate that in a new framework and not try to mangle the existing
> > > code for something it was never intended for.
> > >
> > > Christian.
> > >
> > > >
> > > > Matt
> > > >
> > > >> Thanks,
> > > >> Oak
> > > >>
> > > >>> -----Original Message-----
> > > >>> From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of
> > > >>> Matthew Brost
> > > >>> Sent: April 3, 2023 8:22 PM
> > > >>> To: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org
> > > >>> Cc: robdclark@chromium.org; thomas.hellstrom@linux.intel.com;
> > > airlied@linux.ie;
> > > >>> lina@asahilina.net; boris.brezillon@collabora.com; Brost, Matthew
> > > >>> <matthew.brost@intel.com>; christian.koenig@amd.com;
> > > >>> faith.ekstrand@collabora.com
> > > >>> Subject: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
> > > plans
> > > >>>
> > > >>> Hello,
> > > >>>
> > > >>> As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> > > >>> have been asked to merge our common DRM scheduler patches first as
> well
> > > >>> as develop a common solution for long running workloads with the DRM
> > > >>> scheduler. This RFC series is our first attempt at doing this. We
> > > >>> welcome any and all feedback.
> > > >>>
> > > >>> This can we thought of as 4 parts detailed below.
> > > >>>
> > > >>> - DRM scheduler changes for 1 to 1 relationship between scheduler and
> > > >>> entity (patches 1-3)
> > > >>>
> > > >>> In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> > > >>> GuC) which is a new paradigm WRT to the DRM scheduler and presents
> > > >>> severals problems as the DRM was originally designed to schedule jobs
> on
> > > >>> hardware queues. The main problem being that DRM scheduler expects
> the
> > > >>> submission order of jobs to be the completion order of jobs even across
> > > >>> multiple entities. This assumption falls apart with a firmware scheduler
> > > >>> as a firmware scheduler has no concept of jobs and jobs can complete
> out
> > > >>> of order. A novel solution for was originally thought of by Faith during
> > > >>> the initial prototype of Xe, create a 1 to 1 relationship between scheduler
> > > >>> and entity. I believe the AGX driver [3] is using this approach and
> > > >>> Boris may use approach as well for the Mali driver [4].
> > > >>>
> > > >>> To support a 1 to 1 relationship we move the main execution function
> > > >>> from a kthread to a work queue and add a new scheduling mode which
> > > >>> bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
> > > >>> The new scheduling mode should unify all drivers usage with a 1 to 1
> > > >>> relationship and can be thought of as using scheduler as a dependency /
> > > >>> infligt job tracker rather than a true scheduler.
> > > >>>
> > > >>> - Generic messaging interface for DRM scheduler
> > > >>>
> > > >>> Idea is to be able to communicate to the submission backend with in
> band
> > > >>> (relative to main execution function) messages. Messages are backend
> > > >>> defined and flexable enough for any use case. In Xe we use these
> > > >>> messages to clean up entites, set properties for entites, and suspend /
> > > >>> resume execution of an entity [5]. I suspect other driver can leverage
> > > >>> this messaging concept too as it a convenient way to avoid races in the
> > > >>> backend.
> > > >>>
> > > >>> - Support for using TDR for all error paths of a scheduler / entity
> > > >>>
> > > >>> Fix a few races / bugs, add function to dynamically set the TDR timeout.
> > > >>>
> > > >>> - Annotate dma-fences for long running workloads.
> > > >>>
> > > >>> The idea here is to use dma-fences only as sync points within the
> > > >>> scheduler and never export them for long running workloads. By
> > > >>> annotating these fences as long running we ensure that these dma-
> fences
> > > >>> are never used in a way that breaks the dma-fence rules. A benefit of
> > > >>> thus approach is the scheduler can still safely flow control the
> > > >>> execution ring buffer via the job limit without breaking the dma-fence
> > > >>> rules.
> > > >>>
> > > >>> Again this a first draft and looking forward to feedback.
> > > >>>
> > > >>> Enjoy - Matt
> > > >>>
> > > >>> [1] https://gitlab.freedesktop.org/drm/xe/kernel
> > > >>> [2] https://patchwork.freedesktop.org/series/112188/
> > > >>> [3] https://patchwork.freedesktop.org/series/114772/
> > > >>> [4]
> https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1
> > > >>> [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-
> > > >>> next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031
> > > >>>
> > > >>> Matthew Brost (8):
> > > >>>    drm/sched: Convert drm scheduler to use a work queue rather than
> > > >>>      kthread
> > > >>>    drm/sched: Move schedule policy to scheduler / entity
> > > >>>    drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling
> policy
> > > >>>    drm/sched: Add generic scheduler message interface
> > > >>>    drm/sched: Start run wq before TDR in drm_sched_start
> > > >>>    drm/sched: Submit job before starting TDR
> > > >>>    drm/sched: Add helper to set TDR timeout
> > > >>>    drm/syncobj: Warn on long running dma-fences
> > > >>>
> > > >>> Thomas Hellström (2):
> > > >>>    dma-buf/dma-fence: Introduce long-running completion fences
> > > >>>    drm/sched: Support long-running sched entities
> > > >>>
> > > >>>   drivers/dma-buf/dma-fence.c                 | 142 +++++++---
> > > >>>   drivers/dma-buf/dma-resv.c                  |   5 +
> > > >>>   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
> > > >>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
> > > >>>   drivers/gpu/drm/drm_syncobj.c               |   5 +-
> > > >>>   drivers/gpu/drm/etnaviv/etnaviv_sched.c     |   5 +-
> > > >>>   drivers/gpu/drm/lima/lima_sched.c           |   5 +-
> > > >>>   drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
> > > >>>   drivers/gpu/drm/msm/msm_ringbuffer.c        |   5 +-
> > > >>>   drivers/gpu/drm/panfrost/panfrost_job.c     |   5 +-
> > > >>>   drivers/gpu/drm/scheduler/sched_entity.c    | 127 +++++++--
> > > >>>   drivers/gpu/drm/scheduler/sched_fence.c     |   6 +-
> > > >>>   drivers/gpu/drm/scheduler/sched_main.c      | 278 +++++++++++++++--
> ---
> > > >>>   drivers/gpu/drm/v3d/v3d_sched.c             |  25 +-
> > > >>>   include/drm/gpu_scheduler.h                 | 130 +++++++--
> > > >>>   include/linux/dma-fence.h                   |  60 ++++-
> > > >>>   16 files changed, 649 insertions(+), 184 deletions(-)
> > > >>>
> > > >>> --
> > > >>> 2.34.1
> >

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans
  2023-04-04  1:58   ` Matthew Brost
@ 2023-04-08  7:05     ` Asahi Lina
  2023-04-11 14:07       ` Daniel Vetter
  2023-04-17  0:03       ` Matthew Brost
  0 siblings, 2 replies; 87+ messages in thread
From: Asahi Lina @ 2023-04-08  7:05 UTC (permalink / raw)
  To: Matthew Brost
  Cc: robdclark, airlied, dri-devel, christian.koenig, boris.brezillon,
	daniel, intel-xe, faith.ekstrand

On 04/04/2023 10.58, Matthew Brost wrote:
> On Tue, Apr 04, 2023 at 10:07:48AM +0900, Asahi Lina wrote:
>> Hi, thanks for the Cc!
>>
> 
> No problem.
> 
>> On 04/04/2023 09.22, Matthew Brost wrote:
>>> Hello,
>>>
>>> As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
>>> have been asked to merge our common DRM scheduler patches first as well
>>> as develop a common solution for long running workloads with the DRM
>>> scheduler. This RFC series is our first attempt at doing this. We
>>> welcome any and all feedback.
>>>
>>> This can we thought of as 4 parts detailed below.
>>>
>>> - DRM scheduler changes for 1 to 1 relationship between scheduler and
>>> entity (patches 1-3)
>>>
>>> In Xe all of the scheduling of jobs is done by a firmware scheduler (the
>>> GuC) which is a new paradigm WRT to the DRM scheduler and presents
>>> severals problems as the DRM was originally designed to schedule jobs on
>>> hardware queues. The main problem being that DRM scheduler expects the
>>> submission order of jobs to be the completion order of jobs even across
>>> multiple entities. This assumption falls apart with a firmware scheduler
>>> as a firmware scheduler has no concept of jobs and jobs can complete out
>>> of order. A novel solution for was originally thought of by Faith during
>>> the initial prototype of Xe, create a 1 to 1 relationship between scheduler
>>> and entity. I believe the AGX driver [3] is using this approach and
>>> Boris may use approach as well for the Mali driver [4].
>>>
>>> To support a 1 to 1 relationship we move the main execution function
>>> from a kthread to a work queue and add a new scheduling mode which
>>> bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
>>> The new scheduling mode should unify all drivers usage with a 1 to 1
>>> relationship and can be thought of as using scheduler as a dependency /
>>> infligt job tracker rather than a true scheduler.
>>
>> Yup, we're in the exact same situation with drm/asahi, so this is very
>> welcome! We've been using the existing scheduler as-is, but this should help
>> remove some unneeded complexity in this use case.
>>
> 
> That's the idea.
> 
>> Do you want me to pull in this series into our tree and make sure this all
>> works out for us?
>>
> 
> We tested this in Xe and it definitely works for us but the more testing
> the better.
> 

I haven't gotten around to testing this series yet, but after more 
debugging of drm_sched issues I want to hear more about how Xe uses the 
scheduler.

From what I can tell, and from what Christian says, drm_sched has the
hidden requirement that all job objects outlive the scheduler. I've run 
into several UAF bugs due to this. Not only that, it also currently has 
the requirement that all drm_sched fences outlive the scheduler object.

These requirements are subtle and only manifest as kernel oopses in rare 
corner cases, so it wasn't at all obvious to me that this was somehow a 
fundamental design assumption when I started using it.

As far as I can tell, this design is going to work in 99% of cases for 
global-schedulers-per-GPU models, where those corner cases would have to 
be hit on top of a GPU removal scenario (and GPU remove is... well, not 
the most tested/exercised use case). When the scheduler basically lives 
forever, none of this really matters.

But with a one-scheduler-per-queue model, how do you deal with this when 
the queue goes away? So far, without any of the partial bugfixes I have 
sent so far (which Christian objected to):

- If you try to tear down a scheduler with any jobs currently scheduled 
at the hardware, drm_sched will oops when those jobs complete and the hw 
fences signal.
- If you try to tear down an entity (which should cancel all its pending 
jobs) and then the scheduler it was attached to without actually waiting 
for all the free_job() callbacks to be called on every job that ever 
existed for that entity, you can oops (entity cleanup is asynchronous in 
some cases like killed processes, so it will return before all jobs are 
freed and then that asynchronous process will crash and burn if the 
scheduler goes away out from under its feet). Waiting for job completion 
fences is not enough for this, you have to wait until free_job() has 
actually been called for all jobs.
- Even if you actually wait for all jobs to be truly gone and then tear 
down the scheduler, if any scheduler job fences remain alive, that will 
then oops if you try to call the debug functions on them (like cat 
/sys/kernel/debug/dma_buf/bufinfo).

I tried to fix these things, but Christian objected implying it was the 
driver's job to keep a reference from jobs and hw fences to the 
scheduler. But I find that completely broken, because besides the extra 
memory/resource usage keeping the scheduler alive when you're trying to 
free resources as fast as possible when a process goes away, you can't 
even use normal reference counting for that: if you try to drop the last 
drm_sched reference from within a free_job() callback, the whole thing 
deadlocks since that will be running in the scheduler's thread/workqueue 
context, which can't free itself. So now you both reference count the 
scheduler from jobs and fences, and on top of that you need to outsource 
drm_sched freeing to a workqueue in the driver to make sure you don't 
deadlock.
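
For what it's worth, the workaround ends up looking roughly like this; a
sketch, with fw_queue as a hypothetical driver wrapper around the scheduler:

#include <drm/gpu_scheduler.h>
#include <linux/kref.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

/* Hypothetical driver wrapper: one scheduler per firmware queue. */
struct fw_queue {
	struct kref ref;
	struct drm_gpu_scheduler sched;
	struct work_struct destroy_work;	/* INIT_WORK'd at creation time */
};

static void fw_queue_destroy_work(struct work_struct *w)
{
	struct fw_queue *q = container_of(w, struct fw_queue, destroy_work);

	/* Safe here: no longer running in the scheduler's own work context. */
	drm_sched_fini(&q->sched);
	kfree(q);
}

static void fw_queue_release(struct kref *ref)
{
	struct fw_queue *q = container_of(ref, struct fw_queue, ref);

	/* The last put may come from free_job(), so never tear down inline. */
	queue_work(system_unbound_wq, &q->destroy_work);
}

/* Jobs and hw fences each hold a reference and drop it with this. */
static void fw_queue_put(struct fw_queue *q)
{
	kref_put(&q->ref, fw_queue_release);
}
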

For job fences this is particularly broken, because those fences can 
live forever signaled and attached to shared buffers and there is no 
guarantee that they will be freed in any kind of reasonable time frame. 
If they have to keep the scheduler that created them alive, that creates 
a lot of dead object junk we have to drag around just because a signaled 
fence exists somewhere.

For a Rust abstraction we have to do all that tracking and refcounting 
in the abstraction itself to make it safe, which is starting to sound 
like reimplementing half of the job tracking drm_sched itself does just 
to fix the lifetime issues, which really tells me the existing design is 
not sound nor easy to use correctly in general.

How does Xe deal with this (does it deal with it at all)? What happens 
when you kill -9 a process using the GPU? Does freeing all of this wait 
for all jobs to complete *and be freed* with free_job()? What about 
exported dma_bufs with fences attached from that scheduler? Do you keep 
the scheduler alive for those?

Personally, after running into all this, and after seeing Christian's 
reaction to me trying to improve the state of things, I'm starting to 
doubt that drm_sched is the right solution at all for 
firmware-scheduling drivers.

If you want a workload to try to see if you run into any of these 
things, running and killing lots of things in parallel is a good thing 
to try (mess with the numbers and let it run for a while to see if you 
can hit any corner cases):

while true; do for i in $(seq 1 10); do timeout -k 0.01 0.05 glxgears & 
done; sleep 0.1; done

~~ Lina


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans
  2023-04-07  0:20           ` Zeng, Oak
@ 2023-04-11  9:02             ` Christian König
  2023-04-11 14:13               ` Daniel Vetter
  0 siblings, 1 reply; 87+ messages in thread
From: Christian König @ 2023-04-11  9:02 UTC (permalink / raw)
  To: Zeng, Oak, Brost, Matthew
  Cc: robdclark, airlied, lina, dri-devel, boris.brezillon, Vetter,
	Daniel, intel-xe, faith.ekstrand

The point is that this not only requires some work in the drm_scheduler;
rather, it then makes little sense to use the drm_scheduler in the first
place.

The whole point of the drm_scheduler is to provide dma_fence 
implementation for the submitted jobs.

We also have dependency handling, but as Daniel and I said this can be 
easily extracted into a separate object/component.

Regards,
Christian.

Am 07.04.23 um 02:20 schrieb Zeng, Oak:
> So this series basically go with option 2. The part that option2 makes me uncomfortable is, dma-fence doesn't work for long running workload, why we generate it in the first place? As long as dma-fence is generated, it will become a source of confusion in the future. It doesn't matter how much you annotate it/document it. So if we decide to go with option2, the bottom line is, don't generate dma-fence for long running workload during job submission. This requires some rework in drm scheduler.
>
> The cleanest solution to me is option3. Dma-fence is a very old technology. When it was created, no gpu support page fault. Obviously this is not a good technology for modern gpu with page fault support. I think the best way is to create a new scheduler and dependency tracking mechanism works for both page fault enabled and page fault disabled context. I think this matches what Christian said below. Maybe nobody think this is easy?
>
> Thanks,
> Oak
>
>> -----Original Message-----
>> From: Brost, Matthew <matthew.brost@intel.com>
>> Sent: April 5, 2023 2:53 PM
>> To: Zeng, Oak <oak.zeng@intel.com>
>> Cc: Christian König <christian.koenig@amd.com>; Vetter, Daniel
>> <daniel.vetter@intel.com>; Thomas Hellström
>> <thomas.hellstrom@linux.intel.com>; dri-devel@lists.freedesktop.org; intel-
>> xe@lists.freedesktop.org; robdclark@chromium.org; airlied@linux.ie;
>> lina@asahilina.net; boris.brezillon@collabora.com; faith.ekstrand@collabora.com
>> Subject: Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
>> plans
>>
>> On Wed, Apr 05, 2023 at 12:06:53PM -0600, Zeng, Oak wrote:
>>> Hi,
>>>
>>> Using dma-fence for completion/dependency tracking for long-run
>> workload(more precisely on-demand paging/page fault enabled workload) can
>> cause deadlock. This seems the significant issue here. Other issues such as the
>> drm scheduler completion order implication etc are minors which can be solve
>> inside the framework of drm scheduler. We need to evaluate below paths:
>>> 	1) still use drm scheduler for job submission, and use dma-fence for job
>> completion waiting/dependency tracking. This is solution proposed in this series.
>> Annotate dma-fence for long-run workload: user can still wait dma-fence for job
>> completion but can't wait dma-fence while holding any memory management
>> locks.  We still use dma-fence for dependency tracking. But it is just very easily
>> run into deadlock when on-demand paging is in the picture. The annotation helps
>> us to detect deadlock but not solve deadlock problems. Seems *not* a complete
>> solution: It is almost impossible to completely avoid dependency deadlock in
>> complex runtime environment
>> No one can wait on LR fence, so it is impossible to deadlock. The
>> annotations enforce this. Literally this is only for flow controling the
>> ring / hold pending jobs in in the DRM schedule list.
>>
>>> 	2) Still use drm scheduler but not use dma-fence for completion signaling
>> and dependency tracking. This way we still get some free functions (reset, err
>> handling ring flow control as Matt said)from drm scheduler, but push the
>> dependency/completion tracking completely to user space using techniques such
>> as user space fence. User space doesn't have chance to wait fence while holding
>> a kernel memory management lock, thus the dma-fence deadlock issue is solved.
>> We use user space fence for syncs.
>>
>>> 	3) Completely discard drm scheduler and dma-fence for long-run
>> workload. Use user queue/doorbell for super fast submission, directly interact
>> with fw scheduler. Use user fence for completion/dependency tracking.
>> This is a hard no from me, I want 1 submission path in Xe. Either we use
>> the DRM scheduler or we don't.
>>
>> Matt
>>
>>> Thanks,
>>> Oak
>>>
>>>> -----Original Message-----
>>>> From: Christian König <christian.koenig@amd.com>
>>>> Sent: April 5, 2023 3:30 AM
>>>> To: Brost, Matthew <matthew.brost@intel.com>; Zeng, Oak
>>>> <oak.zeng@intel.com>
>>>> Cc: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org;
>>>> robdclark@chromium.org; thomas.hellstrom@linux.intel.com;
>> airlied@linux.ie;
>>>> lina@asahilina.net; boris.brezillon@collabora.com;
>> faith.ekstrand@collabora.com
>>>> Subject: Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
>>>> plans
>>>>
>>>> Am 04.04.23 um 20:08 schrieb Matthew Brost:
>>>>> On Tue, Apr 04, 2023 at 12:02:03PM -0600, Zeng, Oak wrote:
>>>>>> Hi Matt, Thomas,
>>>>>>
>>>>>> Some very bold out of box thinking in this area:
>>>>>>
>>>>>> 1. so you want to use drm scheduler and dma-fence for long running
>> workload.
>>>> Why you want to do this in the first place? What is the benefit? Drm scheduler
>> is
>>>> pretty much a software scheduler. Modern gpu has scheduler built at fw/hw
>>>> level, as you said below for intel this is Guc. Can xe driver just directly submit
>> job
>>>> to Guc, bypassing drm scheduler?
>>>>> If we did that now we have 2 paths for dependency track, flow controling
>>>>> the ring, resets / error handling / backend submission implementations.
>>>>> We don't want this.
>>>> Well exactly that's the point: Why?
>>>>
>>>> As far as I can see that are two completely distinct use cases, so you
>>>> absolutely do want two completely distinct implementations for this.
>>>>
>>>>>> 2. using dma-fence for long run workload: I am well aware that page fault
>> (and
>>>> the consequent memory allocation/lock acquiring to fix the fault) can cause
>>>> deadlock for a dma-fence wait. But I am not convinced that dma-fence can't
>> be
>>>> used purely because the nature of the workload that it runs very long
>> (indefinite).
>>>> I did a math: the dma_fence_wait_timeout function's third param is the
>> timeout
>>>> which is a signed long type. If HZ is 1000, this is about 23 days. If 23 days is not
>> long
>>>> enough, can we just change the timeout parameter to signed 64 bits so it is
>> much
>>>> longer than our life time...
>>>>>> So I mainly argue we can't use dma-fence for long-run workload is not
>>>> because the workload runs very long, rather because of the fact that we use
>>>> page fault for long-run workload. If we enable page fault for short-run
>> workload,
>>>> we can't use dma-fence either. Page fault is the key thing here.
>>>>>> Now since we use page fault which is *fundamentally* controversial with
>>>> dma-fence design, why now just introduce a independent concept such as
>> user-
>>>> fence instead of extending existing dma-fence?
>>>>>> I like unified design. If drm scheduler, dma-fence can be extended to work
>> for
>>>> everything, it is beautiful. But seems we have some fundamental problem
>> here.
>>>>> Thomas's patches turn a dma-fence into KMD sync point (e.g. we just use
>>>>> the signal / CB infrastructure) and enforce we don't use use these
>>>>> dma-fences from the scheduler in memory reclaim paths or export these to
>>>>> user space or other drivers. Think of this mode as SW only fence.
>>>> Yeah and I truly think this is an really bad idea.
>>>>
>>>> The signal/CB infrastructure in the dma_fence turned out to be the
>>>> absolutely nightmare I initially predicted. Sorry to say that, but in
>>>> this case the "I've told you so" is appropriate in my opinion.
>>>>
>>>> If we need infrastructure for long running dependency tracking we should
>>>> encapsulate that in a new framework and not try to mangle the existing
>>>> code for something it was never intended for.
>>>>
>>>> Christian.
>>>>
>>>>> Matt
>>>>>
>>>>>> Thanks,
>>>>>> Oak
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of
>>>>>>> Matthew Brost
>>>>>>> Sent: April 3, 2023 8:22 PM
>>>>>>> To: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org
>>>>>>> Cc: robdclark@chromium.org; thomas.hellstrom@linux.intel.com;
>>>> airlied@linux.ie;
>>>>>>> lina@asahilina.net; boris.brezillon@collabora.com; Brost, Matthew
>>>>>>> <matthew.brost@intel.com>; christian.koenig@amd.com;
>>>>>>> faith.ekstrand@collabora.com
>>>>>>> Subject: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
>>>> plans
>>>>>>> Hello,
>>>>>>>
>>>>>>> As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
>>>>>>> have been asked to merge our common DRM scheduler patches first as
>> well
>>>>>>> as develop a common solution for long running workloads with the DRM
>>>>>>> scheduler. This RFC series is our first attempt at doing this. We
>>>>>>> welcome any and all feedback.
>>>>>>>
>>>>>>> This can we thought of as 4 parts detailed below.
>>>>>>>
>>>>>>> - DRM scheduler changes for 1 to 1 relationship between scheduler and
>>>>>>> entity (patches 1-3)
>>>>>>>
>>>>>>> In Xe all of the scheduling of jobs is done by a firmware scheduler (the
>>>>>>> GuC) which is a new paradigm WRT to the DRM scheduler and presents
>>>>>>> severals problems as the DRM was originally designed to schedule jobs
>> on
>>>>>>> hardware queues. The main problem being that DRM scheduler expects
>> the
>>>>>>> submission order of jobs to be the completion order of jobs even across
>>>>>>> multiple entities. This assumption falls apart with a firmware scheduler
>>>>>>> as a firmware scheduler has no concept of jobs and jobs can complete
>> out
>>>>>>> of order. A novel solution for was originally thought of by Faith during
>>>>>>> the initial prototype of Xe, create a 1 to 1 relationship between scheduler
>>>>>>> and entity. I believe the AGX driver [3] is using this approach and
>>>>>>> Boris may use approach as well for the Mali driver [4].
>>>>>>>
>>>>>>> To support a 1 to 1 relationship we move the main execution function
>>>>>>> from a kthread to a work queue and add a new scheduling mode which
>>>>>>> bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
>>>>>>> The new scheduling mode should unify all drivers usage with a 1 to 1
>>>>>>> relationship and can be thought of as using scheduler as a dependency /
>>>>>>> infligt job tracker rather than a true scheduler.
>>>>>>>
>>>>>>> - Generic messaging interface for DRM scheduler
>>>>>>>
>>>>>>> Idea is to be able to communicate to the submission backend with in
>> band
>>>>>>> (relative to main execution function) messages. Messages are backend
>>>>>>> defined and flexable enough for any use case. In Xe we use these
>>>>>>> messages to clean up entites, set properties for entites, and suspend /
>>>>>>> resume execution of an entity [5]. I suspect other driver can leverage
>>>>>>> this messaging concept too as it a convenient way to avoid races in the
>>>>>>> backend.
>>>>>>>
>>>>>>> - Support for using TDR for all error paths of a scheduler / entity
>>>>>>>
>>>>>>> Fix a few races / bugs, add function to dynamically set the TDR timeout.
>>>>>>>
>>>>>>> - Annotate dma-fences for long running workloads.
>>>>>>>
>>>>>>> The idea here is to use dma-fences only as sync points within the
>>>>>>> scheduler and never export them for long running workloads. By
>>>>>>> annotating these fences as long running we ensure that these dma-
>> fences
>>>>>>> are never used in a way that breaks the dma-fence rules. A benefit of
>>>>>>> thus approach is the scheduler can still safely flow control the
>>>>>>> execution ring buffer via the job limit without breaking the dma-fence
>>>>>>> rules.
>>>>>>>
>>>>>>> Again this a first draft and looking forward to feedback.
>>>>>>>
>>>>>>> Enjoy - Matt
>>>>>>>
>>>>>>> [1] https://gitlab.freedesktop.org/drm/xe/kernel
>>>>>>> [2] https://patchwork.freedesktop.org/series/112188/
>>>>>>> [3] https://patchwork.freedesktop.org/series/114772/
>>>>>>> [4]
>> https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1
>>>>>>> [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-
>>>>>>> next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031
>>>>>>>
>>>>>>> Matthew Brost (8):
>>>>>>>     drm/sched: Convert drm scheduler to use a work queue rather than
>>>>>>>       kthread
>>>>>>>     drm/sched: Move schedule policy to scheduler / entity
>>>>>>>     drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling
>> policy
>>>>>>>     drm/sched: Add generic scheduler message interface
>>>>>>>     drm/sched: Start run wq before TDR in drm_sched_start
>>>>>>>     drm/sched: Submit job before starting TDR
>>>>>>>     drm/sched: Add helper to set TDR timeout
>>>>>>>     drm/syncobj: Warn on long running dma-fences
>>>>>>>
>>>>>>> Thomas Hellström (2):
>>>>>>>     dma-buf/dma-fence: Introduce long-running completion fences
>>>>>>>     drm/sched: Support long-running sched entities
>>>>>>>
>>>>>>>    drivers/dma-buf/dma-fence.c                 | 142 +++++++---
>>>>>>>    drivers/dma-buf/dma-resv.c                  |   5 +
>>>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
>>>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
>>>>>>>    drivers/gpu/drm/drm_syncobj.c               |   5 +-
>>>>>>>    drivers/gpu/drm/etnaviv/etnaviv_sched.c     |   5 +-
>>>>>>>    drivers/gpu/drm/lima/lima_sched.c           |   5 +-
>>>>>>>    drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
>>>>>>>    drivers/gpu/drm/msm/msm_ringbuffer.c        |   5 +-
>>>>>>>    drivers/gpu/drm/panfrost/panfrost_job.c     |   5 +-
>>>>>>>    drivers/gpu/drm/scheduler/sched_entity.c    | 127 +++++++--
>>>>>>>    drivers/gpu/drm/scheduler/sched_fence.c     |   6 +-
>>>>>>>    drivers/gpu/drm/scheduler/sched_main.c      | 278 +++++++++++++++--
>> ---
>>>>>>>    drivers/gpu/drm/v3d/v3d_sched.c             |  25 +-
>>>>>>>    include/drm/gpu_scheduler.h                 | 130 +++++++--
>>>>>>>    include/linux/dma-fence.h                   |  60 ++++-
>>>>>>>    16 files changed, 649 insertions(+), 184 deletions(-)
>>>>>>>
>>>>>>> --
>>>>>>> 2.34.1


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans
  2023-04-08  7:05     ` Asahi Lina
@ 2023-04-11 14:07       ` Daniel Vetter
  2023-04-12  5:47         ` Asahi Lina
  2023-04-17  0:03       ` Matthew Brost
  1 sibling, 1 reply; 87+ messages in thread
From: Daniel Vetter @ 2023-04-11 14:07 UTC (permalink / raw)
  To: Asahi Lina
  Cc: airlied, dri-devel, christian.koenig, boris.brezillon, daniel,
	robdclark, intel-xe, faith.ekstrand

On Sat, Apr 08, 2023 at 04:05:20PM +0900, Asahi Lina wrote:
> On 04/04/2023 10.58, Matthew Brost wrote:
> > On Tue, Apr 04, 2023 at 10:07:48AM +0900, Asahi Lina wrote:
> > > Hi, thanks for the Cc!
> > > 
> > 
> > No problem.
> > 
> > > On 04/04/2023 09.22, Matthew Brost wrote:
> > > > Hello,
> > > > 
> > > > As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> > > > have been asked to merge our common DRM scheduler patches first as well
> > > > as develop a common solution for long running workloads with the DRM
> > > > scheduler. This RFC series is our first attempt at doing this. We
> > > > welcome any and all feedback.
> > > > 
> > > > This can we thought of as 4 parts detailed below.
> > > > 
> > > > - DRM scheduler changes for 1 to 1 relationship between scheduler and
> > > > entity (patches 1-3)
> > > > 
> > > > In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> > > > GuC) which is a new paradigm WRT to the DRM scheduler and presents
> > > > severals problems as the DRM was originally designed to schedule jobs on
> > > > hardware queues. The main problem being that DRM scheduler expects the
> > > > submission order of jobs to be the completion order of jobs even across
> > > > multiple entities. This assumption falls apart with a firmware scheduler
> > > > as a firmware scheduler has no concept of jobs and jobs can complete out
> > > > of order. A novel solution for was originally thought of by Faith during
> > > > the initial prototype of Xe, create a 1 to 1 relationship between scheduler
> > > > and entity. I believe the AGX driver [3] is using this approach and
> > > > Boris may use approach as well for the Mali driver [4].
> > > > 
> > > > To support a 1 to 1 relationship we move the main execution function
> > > > from a kthread to a work queue and add a new scheduling mode which
> > > > bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
> > > > The new scheduling mode should unify all drivers usage with a 1 to 1
> > > > relationship and can be thought of as using scheduler as a dependency /
> > > > infligt job tracker rather than a true scheduler.
> > > 
> > > Yup, we're in the exact same situation with drm/asahi, so this is very
> > > welcome! We've been using the existing scheduler as-is, but this should help
> > > remove some unneeded complexity in this use case.
> > > 
> > 
> > That's the idea.
> > 
> > > Do you want me to pull in this series into our tree and make sure this all
> > > works out for us?
> > > 
> > 
> > We tested this in Xe and it definitely works for us but the more testing
> > the better.
> > 
> 
> I haven't gotten around to testing this series yet, but after more debugging
> of drm_sched issues I want to hear more about how Xe uses the scheduler.
> 
> From what I can tell, and from what Christian says, drm_sched has the hidden
> requirement that all job objects outlive the scheduler. I've run into
> several UAF bugs due to this. Not only that, it also currently has the
> requirement that all drm_sched fences outlive the scheduler object.
> 
> These requirements are subtle and only manifest as kernel oopses in rare
> corner cases, so it wasn't at all obvious to me that this was somehow a
> fundamental design assumption when I started using it.
> 
> As far as I can tell, this design is going to work in 99% of cases for
> global-schedulers-per-GPU models, where those corner cases would have to be
> hit on top of a GPU removal scenario (and GPU remove is... well, not the
> most tested/exercised use case). When the scheduler basically lives forever,
> none of this really matters.
> 
> But with a one-scheduler-per-queue model, how do you deal with this when the
> queue goes away? So far, without any of the partial bugfixes I have sent so
> far (which Christian objected to):
> 
> - If you try to tear down a scheduler with any jobs currently scheduled at
> the hardware, drm_sched will oops when those jobs complete and the hw fences
> signal.
> - If you try to tear down an entity (which should cancel all its pending
> jobs) and then the scheduler it was attached to without actually waiting for
> all the free_job() callbacks to be called on every job that ever existed for
> that entity, you can oops (entity cleanup is asynchronous in some cases like
> killed processes, so it will return before all jobs are freed and then that
> asynchronous process will crash and burn if the scheduler goes away out from
> under its feet). Waiting for job completion fences is not enough for this,
> you have to wait until free_job() has actually been called for all jobs.
> - Even if you actually wait for all jobs to be truly gone and then tear down
> the scheduler, if any scheduler job fences remain alive, that will then oops
> if you try to call the debug functions on them (like cat
> /sys/kernel/debug/dma_buf/bufinfo).
> 
> I tried to fix these things, but Christian objected implying it was the
> driver's job to keep a reference from jobs and hw fences to the scheduler.
> But I find that completely broken, because besides the extra memory/resource
> usage keeping the scheduler alive when you're trying to free resources as
> fast as possible when a process goes away, you can't even use normal
> reference counting for that: if you try to drop the last drm_sched reference
> from within a free_job() callback, the whole thing deadlocks since that will
> be running in the scheduler's thread/workqueue context, which can't free
> itself. So now you both reference count the scheduler from jobs and fences,
> and on top of that you need to outsource drm_sched freeing to a workqueue in
> the driver to make sure you don't deadlock.
> 
> For job fences this is particularly broken, because those fences can live
> forever signaled and attached to shared buffers and there is no guarantee
> that they will be freed in any kind of reasonable time frame. If they have
> to keep the scheduler that created them alive, that creates a lot of dead
> object junk we have to drag around just because a signaled fence exists
> somewhere.
> 
> For a Rust abstraction we have to do all that tracking and refcounting in
> the abstraction itself to make it safe, which is starting to sound like
> reimplementing half of the job tracking drm_sched itself does just to fix
> the lifetime issues, which really tells me the existing design is not sound
> nor easy to use correctly in general.
> 
> How does Xe deal with this (does it deal with it at all)? What happens when
> you kill -9 a process using the GPU? Does freeing all of this wait for all
> jobs to complete *and be freed* with free_job()? What about exported
> dma_bufs with fences attached from that scheduler? Do you keep the scheduler
> alive for those?
> 
> Personally, after running into all this, and after seeing Christian's
> reaction to me trying to improve the state of things, I'm starting to doubt
> that drm_sched is the right solution at all for firmware-scheduling drivers.

Bit of a wash-up reply on the more fundamental thing here:
 
For the current scheduler the issues you've found are indeed all driver
bugs (or most I think at least).

Which is why I think we shouldn't just try to shoehorn fundamentally new
semantics without updating the driver interfaces (the drm_sched split into
the driver interface part and the internal scheduler part). Once we have
that, including kerneldoc update and what the rules are, then all the
various UAFs you've discovered become real bugs and I don't see any issue
merging all the fixes.

Without that we do have a chicken/egg problem between:

"here's a bunch of hacks to make the problems disappear I've hit in my
reuse of drm/sched for fw schedulers"

vs.

"this makes no sense for the current drm/sched interfaces and how current
upstream drivers use it"

I don't think there's a lot needed in terms of drm/sched driver api
rework, but I think it's also pretty clearly not ever going to get
anywhere with just nothing at all. Writing an entire new scheduler lib
instead of at least trying minimal semantic changes first (rather than just
a pile of hacks without even doc changes for the new rules) does not sound
like a good idea to me :-)

> If you want a workload to try to see if you run into any of these things,
> running and killing lots of things in parallel is a good thing to try (mess
> with the numbers and let it run for a while to see if you can hit any corner
> cases):
> 
> while true; do for i in $(seq 1 10); do timeout -k 0.01 0.05 glxgears &
> done; sleep 0.1; done

Maybe xe gets away with this due to synchronously killing everything
related to a ctx, but yeah I'd expect this to go boom in fun ways.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans
  2023-04-11  9:02             ` Christian König
@ 2023-04-11 14:13               ` Daniel Vetter
  2023-04-17  6:47                 ` Christian König
  0 siblings, 1 reply; 87+ messages in thread
From: Daniel Vetter @ 2023-04-11 14:13 UTC (permalink / raw)
  To: Christian König
  Cc: robdclark, airlied, lina, Zeng, Oak, boris.brezillon, dri-devel,
	Vetter, Daniel, intel-xe, faith.ekstrand

On Tue, Apr 11, 2023 at 11:02:55AM +0200, Christian König wrote:
> The point is that this not only requires some work in the drm_scheduler, but
> rather it then makes only little sense to use the drm_scheduler in the first
> place.
> 
> The whole point of the drm_scheduler is to provide dma_fence implementation
> for the submitted jobs.
> 
> We also have dependency handling, but as Daniel and I said this can be
> easily extracted into a separate object/component.

Uh that's not what I meant. My take is that minimally patching drm/sched
to make the out-fence either optional, or complete it right away, is the
simplest way to get at the dependency handling. For me at least the major
part of drm/sched is the dep handling and timeout stuff. And the latter can
be reused with some glue to handle preempt timeouts and other things too:
since the tdr is a work struct you can just issue any other gpu timeouts on
the same workqueue, using roughly the same pattern as the ->timed_out
hook, and it'll just work.
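
Roughly what that allows on the driver side - just a sketch, all the
xe_* names below are invented, the drm_sched_stop()/drm_sched_start()
calls use the signatures from kernels of roughly this vintage, and it
assumes the driver passed an ordered workqueue as timeout_wq to
drm_sched_init():

#include <linux/workqueue.h>
#include <drm/gpu_scheduler.h>

struct xe_engine_sketch {
	struct drm_gpu_scheduler sched;
	struct workqueue_struct *timeout_wq;	/* also given to drm_sched_init() */
	struct delayed_work preempt_timeout;	/* INIT_DELAYED_WORK() at engine init */
};

static void xe_preempt_timed_out(struct work_struct *w)
{
	struct xe_engine_sketch *e =
		container_of(w, struct xe_engine_sketch, preempt_timeout.work);

	/*
	 * Runs on the same ordered workqueue as the scheduler's TDR, so it
	 * is serialized against the ->timedout_job() handler and can reuse
	 * the same stop / clean up / restart pattern without extra locking.
	 */
	drm_sched_stop(&e->sched, NULL);
	/* ... ask the firmware to kill / ban the context here ... */
	drm_sched_start(&e->sched, true);
}

/* Arm when asking the firmware to preempt, cancel when it acks in time. */
static void xe_arm_preempt_timeout(struct xe_engine_sketch *e, long timeout)
{
	queue_delayed_work(e->timeout_wq, &e->preempt_timeout, timeout);
}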

The entire "oh we also make sure your hw fence doesn't leak into public
fences and causes lifetime mayhem" seems pretty minor. And maybe also
something we want to replicate for the preempt-ctx dma_fence that some
long-running context need (but more as part of drm_sched_entity I guess).

We can of course bikeshed how much flexibility really should be in the
different parts of drm/sched, but imo that's a bikeshed.
-Daniel


> 
> Regards,
> Christian.
> 
> Am 07.04.23 um 02:20 schrieb Zeng, Oak:
> > So this series basically go with option 2. The part that option2 makes me uncomfortable is, dma-fence doesn't work for long running workload, why we generate it in the first place? As long as dma-fence is generated, it will become a source of confusion in the future. It doesn't matter how much you annotate it/document it. So if we decide to go with option2, the bottom line is, don't generate dma-fence for long running workload during job submission. This requires some rework in drm scheduler.
> > 
> > The cleanest solution to me is option3. Dma-fence is a very old technology. When it was created, no gpu support page fault. Obviously this is not a good technology for modern gpu with page fault support. I think the best way is to create a new scheduler and dependency tracking mechanism works for both page fault enabled and page fault disabled context. I think this matches what Christian said below. Maybe nobody think this is easy?
> > 
> > Thanks,
> > Oak
> > 
> > > -----Original Message-----
> > > From: Brost, Matthew <matthew.brost@intel.com>
> > > Sent: April 5, 2023 2:53 PM
> > > To: Zeng, Oak <oak.zeng@intel.com>
> > > Cc: Christian König <christian.koenig@amd.com>; Vetter, Daniel
> > > <daniel.vetter@intel.com>; Thomas Hellström
> > > <thomas.hellstrom@linux.intel.com>; dri-devel@lists.freedesktop.org; intel-
> > > xe@lists.freedesktop.org; robdclark@chromium.org; airlied@linux.ie;
> > > lina@asahilina.net; boris.brezillon@collabora.com; faith.ekstrand@collabora.com
> > > Subject: Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
> > > plans
> > > 
> > > On Wed, Apr 05, 2023 at 12:06:53PM -0600, Zeng, Oak wrote:
> > > > Hi,
> > > > 
> > > > Using dma-fence for completion/dependency tracking for long-run
> > > workload(more precisely on-demand paging/page fault enabled workload) can
> > > cause deadlock. This seems the significant issue here. Other issues such as the
> > > drm scheduler completion order implication etc are minors which can be solve
> > > inside the framework of drm scheduler. We need to evaluate below paths:
> > > > 	1) still use drm scheduler for job submission, and use dma-fence for job
> > > completion waiting/dependency tracking. This is solution proposed in this series.
> > > Annotate dma-fence for long-run workload: user can still wait dma-fence for job
> > > completion but can't wait dma-fence while holding any memory management
> > > locks.  We still use dma-fence for dependency tracking. But it is just very easily
> > > run into deadlock when on-demand paging is in the picture. The annotation helps
> > > us to detect deadlock but not solve deadlock problems. Seems *not* a complete
> > > solution: It is almost impossible to completely avoid dependency deadlock in
> > > complex runtime environment
> > > No one can wait on LR fence, so it is impossible to deadlock. The
> > > annotations enforce this. Literally this is only for flow controling the
> > > ring / hold pending jobs in in the DRM schedule list.
> > > 
> > > > 	2) Still use drm scheduler but not use dma-fence for completion signaling
> > > and dependency tracking. This way we still get some free functions (reset, err
> > > handling ring flow control as Matt said)from drm scheduler, but push the
> > > dependency/completion tracking completely to user space using techniques such
> > > as user space fence. User space doesn't have chance to wait fence while holding
> > > a kernel memory management lock, thus the dma-fence deadlock issue is solved.
> > > We use user space fence for syncs.
> > > 
> > > > 	3) Completely discard drm scheduler and dma-fence for long-run
> > > workload. Use user queue/doorbell for super fast submission, directly interact
> > > with fw scheduler. Use user fence for completion/dependency tracking.
> > > This is a hard no from me, I want 1 submission path in Xe. Either we use
> > > the DRM scheduler or we don't.
> > > 
> > > Matt
> > > 
> > > > Thanks,
> > > > Oak
> > > > 
> > > > > -----Original Message-----
> > > > > From: Christian König <christian.koenig@amd.com>
> > > > > Sent: April 5, 2023 3:30 AM
> > > > > To: Brost, Matthew <matthew.brost@intel.com>; Zeng, Oak
> > > > > <oak.zeng@intel.com>
> > > > > Cc: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org;
> > > > > robdclark@chromium.org; thomas.hellstrom@linux.intel.com;
> > > airlied@linux.ie;
> > > > > lina@asahilina.net; boris.brezillon@collabora.com;
> > > faith.ekstrand@collabora.com
> > > > > Subject: Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
> > > > > plans
> > > > > 
> > > > > Am 04.04.23 um 20:08 schrieb Matthew Brost:
> > > > > > On Tue, Apr 04, 2023 at 12:02:03PM -0600, Zeng, Oak wrote:
> > > > > > > Hi Matt, Thomas,
> > > > > > > 
> > > > > > > Some very bold out of box thinking in this area:
> > > > > > > 
> > > > > > > 1. so you want to use drm scheduler and dma-fence for long running
> > > workload.
> > > > > Why you want to do this in the first place? What is the benefit? Drm scheduler
> > > is
> > > > > pretty much a software scheduler. Modern gpu has scheduler built at fw/hw
> > > > > level, as you said below for intel this is Guc. Can xe driver just directly submit
> > > job
> > > > > to Guc, bypassing drm scheduler?
> > > > > > If we did that now we have 2 paths for dependency track, flow controling
> > > > > > the ring, resets / error handling / backend submission implementations.
> > > > > > We don't want this.
> > > > > Well exactly that's the point: Why?
> > > > > 
> > > > > As far as I can see that are two completely distinct use cases, so you
> > > > > absolutely do want two completely distinct implementations for this.
> > > > > 
> > > > > > > 2. using dma-fence for long run workload: I am well aware that page fault
> > > (and
> > > > > the consequent memory allocation/lock acquiring to fix the fault) can cause
> > > > > deadlock for a dma-fence wait. But I am not convinced that dma-fence can't
> > > be
> > > > > used purely because the nature of the workload that it runs very long
> > > (indefinite).
> > > > > I did a math: the dma_fence_wait_timeout function's third param is the
> > > timeout
> > > > > which is a signed long type. If HZ is 1000, this is about 23 days. If 23 days is not
> > > long
> > > > > enough, can we just change the timeout parameter to signed 64 bits so it is
> > > much
> > > > > longer than our life time...
> > > > > > > So I mainly argue we can't use dma-fence for long-run workload is not
> > > > > because the workload runs very long, rather because of the fact that we use
> > > > > page fault for long-run workload. If we enable page fault for short-run
> > > workload,
> > > > > we can't use dma-fence either. Page fault is the key thing here.
> > > > > > > Now since we use page fault which is *fundamentally* controversial with
> > > > > dma-fence design, why now just introduce a independent concept such as
> > > user-
> > > > > fence instead of extending existing dma-fence?
> > > > > > > I like unified design. If drm scheduler, dma-fence can be extended to work
> > > for
> > > > > everything, it is beautiful. But seems we have some fundamental problem
> > > here.
> > > > > > Thomas's patches turn a dma-fence into KMD sync point (e.g. we just use
> > > > > > the signal / CB infrastructure) and enforce we don't use use these
> > > > > > dma-fences from the scheduler in memory reclaim paths or export these to
> > > > > > user space or other drivers. Think of this mode as SW only fence.
> > > > > Yeah and I truly think this is an really bad idea.
> > > > > 
> > > > > The signal/CB infrastructure in the dma_fence turned out to be the
> > > > > absolutely nightmare I initially predicted. Sorry to say that, but in
> > > > > this case the "I've told you so" is appropriate in my opinion.
> > > > > 
> > > > > If we need infrastructure for long running dependency tracking we should
> > > > > encapsulate that in a new framework and not try to mangle the existing
> > > > > code for something it was never intended for.
> > > > > 
> > > > > Christian.
> > > > > 
> > > > > > Matt
> > > > > > 
> > > > > > > Thanks,
> > > > > > > Oak
> > > > > > > 
> > > > > > > > -----Original Message-----
> > > > > > > > From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of
> > > > > > > > Matthew Brost
> > > > > > > > Sent: April 3, 2023 8:22 PM
> > > > > > > > To: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org
> > > > > > > > Cc: robdclark@chromium.org; thomas.hellstrom@linux.intel.com;
> > > > > airlied@linux.ie;
> > > > > > > > lina@asahilina.net; boris.brezillon@collabora.com; Brost, Matthew
> > > > > > > > <matthew.brost@intel.com>; christian.koenig@amd.com;
> > > > > > > > faith.ekstrand@collabora.com
> > > > > > > > Subject: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
> > > > > plans
> > > > > > > > Hello,
> > > > > > > > 
> > > > > > > > As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> > > > > > > > have been asked to merge our common DRM scheduler patches first as
> > > well
> > > > > > > > as develop a common solution for long running workloads with the DRM
> > > > > > > > scheduler. This RFC series is our first attempt at doing this. We
> > > > > > > > welcome any and all feedback.
> > > > > > > > 
> > > > > > > > This can we thought of as 4 parts detailed below.
> > > > > > > > 
> > > > > > > > - DRM scheduler changes for 1 to 1 relationship between scheduler and
> > > > > > > > entity (patches 1-3)
> > > > > > > > 
> > > > > > > > In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> > > > > > > > GuC) which is a new paradigm WRT to the DRM scheduler and presents
> > > > > > > > severals problems as the DRM was originally designed to schedule jobs
> > > on
> > > > > > > > hardware queues. The main problem being that DRM scheduler expects
> > > the
> > > > > > > > submission order of jobs to be the completion order of jobs even across
> > > > > > > > multiple entities. This assumption falls apart with a firmware scheduler
> > > > > > > > as a firmware scheduler has no concept of jobs and jobs can complete
> > > out
> > > > > > > > of order. A novel solution for was originally thought of by Faith during
> > > > > > > > the initial prototype of Xe, create a 1 to 1 relationship between scheduler
> > > > > > > > and entity. I believe the AGX driver [3] is using this approach and
> > > > > > > > Boris may use approach as well for the Mali driver [4].
> > > > > > > > 
> > > > > > > > To support a 1 to 1 relationship we move the main execution function
> > > > > > > > from a kthread to a work queue and add a new scheduling mode which
> > > > > > > > bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
> > > > > > > > The new scheduling mode should unify all drivers usage with a 1 to 1
> > > > > > > > relationship and can be thought of as using scheduler as a dependency /
> > > > > > > > infligt job tracker rather than a true scheduler.
> > > > > > > > 
> > > > > > > > - Generic messaging interface for DRM scheduler
> > > > > > > > 
> > > > > > > > Idea is to be able to communicate to the submission backend with in
> > > band
> > > > > > > > (relative to main execution function) messages. Messages are backend
> > > > > > > > defined and flexable enough for any use case. In Xe we use these
> > > > > > > > messages to clean up entites, set properties for entites, and suspend /
> > > > > > > > resume execution of an entity [5]. I suspect other driver can leverage
> > > > > > > > this messaging concept too as it a convenient way to avoid races in the
> > > > > > > > backend.
> > > > > > > > 
> > > > > > > > - Support for using TDR for all error paths of a scheduler / entity
> > > > > > > > 
> > > > > > > > Fix a few races / bugs, add function to dynamically set the TDR timeout.
> > > > > > > > 
> > > > > > > > - Annotate dma-fences for long running workloads.
> > > > > > > > 
> > > > > > > > The idea here is to use dma-fences only as sync points within the
> > > > > > > > scheduler and never export them for long running workloads. By
> > > > > > > > annotating these fences as long running we ensure that these dma-
> > > fences
> > > > > > > > are never used in a way that breaks the dma-fence rules. A benefit of
> > > > > > > > thus approach is the scheduler can still safely flow control the
> > > > > > > > execution ring buffer via the job limit without breaking the dma-fence
> > > > > > > > rules.
> > > > > > > > 
> > > > > > > > Again this a first draft and looking forward to feedback.
> > > > > > > > 
> > > > > > > > Enjoy - Matt
> > > > > > > > 
> > > > > > > > [1] https://gitlab.freedesktop.org/drm/xe/kernel
> > > > > > > > [2] https://patchwork.freedesktop.org/series/112188/
> > > > > > > > [3] https://patchwork.freedesktop.org/series/114772/
> > > > > > > > [4]
> > > https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1
> > > > > > > > [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-
> > > > > > > > next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031
> > > > > > > > 
> > > > > > > > Matthew Brost (8):
> > > > > > > >     drm/sched: Convert drm scheduler to use a work queue rather than
> > > > > > > >       kthread
> > > > > > > >     drm/sched: Move schedule policy to scheduler / entity
> > > > > > > >     drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling
> > > policy
> > > > > > > >     drm/sched: Add generic scheduler message interface
> > > > > > > >     drm/sched: Start run wq before TDR in drm_sched_start
> > > > > > > >     drm/sched: Submit job before starting TDR
> > > > > > > >     drm/sched: Add helper to set TDR timeout
> > > > > > > >     drm/syncobj: Warn on long running dma-fences
> > > > > > > > 
> > > > > > > > Thomas Hellström (2):
> > > > > > > >     dma-buf/dma-fence: Introduce long-running completion fences
> > > > > > > >     drm/sched: Support long-running sched entities
> > > > > > > > 
> > > > > > > >    drivers/dma-buf/dma-fence.c                 | 142 +++++++---
> > > > > > > >    drivers/dma-buf/dma-resv.c                  |   5 +
> > > > > > > >    drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
> > > > > > > >    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
> > > > > > > >    drivers/gpu/drm/drm_syncobj.c               |   5 +-
> > > > > > > >    drivers/gpu/drm/etnaviv/etnaviv_sched.c     |   5 +-
> > > > > > > >    drivers/gpu/drm/lima/lima_sched.c           |   5 +-
> > > > > > > >    drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
> > > > > > > >    drivers/gpu/drm/msm/msm_ringbuffer.c        |   5 +-
> > > > > > > >    drivers/gpu/drm/panfrost/panfrost_job.c     |   5 +-
> > > > > > > >    drivers/gpu/drm/scheduler/sched_entity.c    | 127 +++++++--
> > > > > > > >    drivers/gpu/drm/scheduler/sched_fence.c     |   6 +-
> > > > > > > >    drivers/gpu/drm/scheduler/sched_main.c      | 278 +++++++++++++++--
> > > ---
> > > > > > > >    drivers/gpu/drm/v3d/v3d_sched.c             |  25 +-
> > > > > > > >    include/drm/gpu_scheduler.h                 | 130 +++++++--
> > > > > > > >    include/linux/dma-fence.h                   |  60 ++++-
> > > > > > > >    16 files changed, 649 insertions(+), 184 deletions(-)
> > > > > > > > 
> > > > > > > > --
> > > > > > > > 2.34.1
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans
  2023-04-11 14:07       ` Daniel Vetter
@ 2023-04-12  5:47         ` Asahi Lina
  2023-04-12  8:18           ` Daniel Vetter
  0 siblings, 1 reply; 87+ messages in thread
From: Asahi Lina @ 2023-04-12  5:47 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: airlied, dri-devel, christian.koenig, boris.brezillon, robdclark,
	intel-xe, faith.ekstrand

On 11/04/2023 23.07, Daniel Vetter wrote:
> On Sat, Apr 08, 2023 at 04:05:20PM +0900, Asahi Lina wrote:
>> On 04/04/2023 10.58, Matthew Brost wrote:
>>> On Tue, Apr 04, 2023 at 10:07:48AM +0900, Asahi Lina wrote:
>>>> Hi, thanks for the Cc!
>>>>
>>>
>>> No problem.
>>>
>>>> On 04/04/2023 09.22, Matthew Brost wrote:
>>>>> Hello,
>>>>>
>>>>> As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
>>>>> have been asked to merge our common DRM scheduler patches first as well
>>>>> as develop a common solution for long running workloads with the DRM
>>>>> scheduler. This RFC series is our first attempt at doing this. We
>>>>> welcome any and all feedback.
>>>>>
>>>>> This can we thought of as 4 parts detailed below.
>>>>>
>>>>> - DRM scheduler changes for 1 to 1 relationship between scheduler and
>>>>> entity (patches 1-3)
>>>>>
>>>>> In Xe all of the scheduling of jobs is done by a firmware scheduler (the
>>>>> GuC) which is a new paradigm WRT to the DRM scheduler and presents
>>>>> severals problems as the DRM was originally designed to schedule jobs on
>>>>> hardware queues. The main problem being that DRM scheduler expects the
>>>>> submission order of jobs to be the completion order of jobs even across
>>>>> multiple entities. This assumption falls apart with a firmware scheduler
>>>>> as a firmware scheduler has no concept of jobs and jobs can complete out
>>>>> of order. A novel solution for was originally thought of by Faith during
>>>>> the initial prototype of Xe, create a 1 to 1 relationship between scheduler
>>>>> and entity. I believe the AGX driver [3] is using this approach and
>>>>> Boris may use approach as well for the Mali driver [4].
>>>>>
>>>>> To support a 1 to 1 relationship we move the main execution function
>>>>> from a kthread to a work queue and add a new scheduling mode which
>>>>> bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
>>>>> The new scheduling mode should unify all drivers usage with a 1 to 1
>>>>> relationship and can be thought of as using scheduler as a dependency /
>>>>> infligt job tracker rather than a true scheduler.
>>>>
>>>> Yup, we're in the exact same situation with drm/asahi, so this is very
>>>> welcome! We've been using the existing scheduler as-is, but this should help
>>>> remove some unneeded complexity in this use case.
>>>>
>>>
>>> That's the idea.
>>>
>>>> Do you want me to pull in this series into our tree and make sure this all
>>>> works out for us?
>>>>
>>>
>>> We tested this in Xe and it definitely works for us but the more testing
>>> the better.
>>>
>>
>> I haven't gotten around to testing this series yet, but after more debugging
>> of drm_sched issues I want to hear more about how Xe uses the scheduler.
>>
>>  From what I can tell, and from what Christian says, drm_sched has the hidden
>> requirement that all job objects outlive the scheduler. I've run into
>> several UAF bugs due to this. Not only that, it also currently has the
>> requirement that all drm_sched fences outlive the scheduler object.
>>
>> These requirements are subtle and only manifest as kernel oopses in rare
>> corner cases, so it wasn't at all obvious to me that this was somehow a
>> fundamental design assumption when I started using it.
>>
>> As far as I can tell, this design is going to work in 99% of cases for
>> global-schedulers-per-GPU models, where those corner cases would have to be
>> hit on top of a GPU removal scenario (and GPU remove is... well, not the
>> most tested/exercised use case). When the scheduler basically lives forever,
>> none of this really matters.
>>
>> But with a one-scheduler-per-queue model, how do you deal with this when the
>> queue goes away? So far, without any of the partial bugfixes I have sent so
>> far (which Christian objected to):
>>
>> - If you try to tear down a scheduler with any jobs currently scheduled at
>> the hardware, drm_sched will oops when those jobs complete and the hw fences
>> signal.
>> - If you try to tear down an entity (which should cancel all its pending
>> jobs) and then the scheduler it was attached to without actually waiting for
>> all the free_job() callbacks to be called on every job that ever existed for
>> that entity, you can oops (entity cleanup is asynchronous in some cases like
>> killed processes, so it will return before all jobs are freed and then that
>> asynchronous process will crash and burn if the scheduler goes away out from
>> under its feet). Waiting for job completion fences is not enough for this,
>> you have to wait until free_job() has actually been called for all jobs.
>> - Even if you actually wait for all jobs to be truly gone and then tear down
>> the scheduler, if any scheduler job fences remain alive, that will then oops
>> if you try to call the debug functions on them (like cat
>> /sys/kernel/debug/dma_buf/bufinfo).
>>
>> I tried to fix these things, but Christian objected implying it was the
>> driver's job to keep a reference from jobs and hw fences to the scheduler.
>> But I find that completely broken, because besides the extra memory/resource
>> usage keeping the scheduler alive when you're trying to free resources as
>> fast as possible when a process goes away, you can't even use normal
>> reference counting for that: if you try to drop the last drm_sched reference
>> from within a free_job() callback, the whole thing deadlocks since that will
>> be running in the scheduler's thread/workqueue context, which can't free
>> itself. So now you both reference count the scheduler from jobs and fences,
>> and on top of that you need to outsource drm_sched freeing to a workqueue in
>> the driver to make sure you don't deadlock.
>>
>> For job fences this is particularly broken, because those fences can live
>> forever signaled and attached to shared buffers and there is no guarantee
>> that they will be freed in any kind of reasonable time frame. If they have
>> to keep the scheduler that created them alive, that creates a lot of dead
>> object junk we have to drag around just because a signaled fence exists
>> somewhere.
>>
>> For a Rust abstraction we have to do all that tracking and refcounting in
>> the abstraction itself to make it safe, which is starting to sound like
>> reimplementing half of the job tracking drm_sched itself does just to fix
>> the lifetime issues, which really tells me the existing design is not sound
>> nor easy to use correctly in general.
>>
>> How does Xe deal with this (does it deal with it at all)? What happens when
>> you kill -9 a process using the GPU? Does freeing all of this wait for all
>> jobs to complete *and be freed* with free_job()? What about exported
>> dma_bufs with fences attached from that scheduler? Do you keep the scheduler
>> alive for those?
>>
>> Personally, after running into all this, and after seeing Christian's
>> reaction to me trying to improve the state of things, I'm starting to doubt
>> that drm_sched is the right solution at all for firmware-scheduling drivers.
> 
> Bit a wash-up reply on the more fundamental thing here:
>   
> For the current scheduler the issues you've found are indeed all driver
> bugs (or most I think at least).

Even the last one with the fences? I can't see how that could be 
implemented correctly today by any driver, short of having the driver 
live until any buffers it has touched and installed a fence into go 
away, which doesn't sound right, since that would block cleanup (and 
module unloading) possibly forever, and that itself sounds like a bug...

This is why I'm a bit disappointed here, because even that one got me a 
"you're doing it wrong" response from Christian... but if scheduler 
fences are supposed to be outlived by the driver and its fences, what is 
even the point of having separate fences?

> 
> Which is why I think we shouldn't just try to shoehorn fundamentally new
> semantics without updating the driver interfaces (the drm_sched split into
> the driver interface part and the internal scheduler part). Once we have
> that, including kerneldoc update and what the rules are, then all the
> various uaf you've discovered become real bugs and I don't see any issue
> merging all the fixes.
> 
> Without that we do have a chicken/egg problem between:
> 
> "here's a bunch of hacks to make the problems disappear I've hit in my
> reuse of drm/sched for fw schedulers"
> 
> vs.
> 
> "this makes no sense for the current drm/sched interfaces and how current
> upstream drivers use it"
> 
> I don't think there's a lot needed in terms of drm/sched driver api
> rework, but I think it's also pretty clearly not ever going to get
> anywhere with just nothing at all. Writing an entire new scheduler lib
> instead of at least trying what minimal semantic changes (instead of just
> a pile of hacks without even doc changes for the new rules) does not sound
> like a good idea to me :-)

I wish I knew what the old rules were, since they're still not documented...

It's frustrating following what few rules are written down, running into 
a bug, writing a patch to fix it, and being told "no, you're just not 
following the unwritten rules"... several times now.

> 
>> If you want a workload to try to see if you run into any of these things,
>> running and killing lots of things in parallel is a good thing to try (mess
>> with the numbers and let it run for a while to see if you can hit any corner
>> cases):
>>
>> while true; do for i in $(seq 1 10); do timeout -k 0.01 0.05 glxgears &
>> done; sleep 0.1; done
> 
> Maybe xe gets away with this due to synchronously killing everything
> related to a ctx, but yeah I'd expect this to go boom in fun ways.

It'd have to explicitly refcount all the jobs and block killing the ctx 
until all jobs are freed (not just signaled) for that not to go boom 
right now, but even then you'd still have the issue with dangling fences 
in buffers making `cat /sys/kernel/debug/dma_buf/bufinfo` oops... and 
you can't synchronously kill those as far as I know.

~~ Lina


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans
  2023-04-12  5:47         ` Asahi Lina
@ 2023-04-12  8:18           ` Daniel Vetter
  0 siblings, 0 replies; 87+ messages in thread
From: Daniel Vetter @ 2023-04-12  8:18 UTC (permalink / raw)
  To: Asahi Lina
  Cc: airlied, dri-devel, christian.koenig, boris.brezillon,
	Daniel Vetter, robdclark, intel-xe, faith.ekstrand

On Wed, Apr 12, 2023 at 02:47:52PM +0900, Asahi Lina wrote:
> On 11/04/2023 23.07, Daniel Vetter wrote:
> > On Sat, Apr 08, 2023 at 04:05:20PM +0900, Asahi Lina wrote:
> > > On 04/04/2023 10.58, Matthew Brost wrote:
> > > > On Tue, Apr 04, 2023 at 10:07:48AM +0900, Asahi Lina wrote:
> > > > > Hi, thanks for the Cc!
> > > > > 
> > > > 
> > > > No problem.
> > > > 
> > > > > On 04/04/2023 09.22, Matthew Brost wrote:
> > > > > > Hello,
> > > > > > 
> > > > > > As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> > > > > > have been asked to merge our common DRM scheduler patches first as well
> > > > > > as develop a common solution for long running workloads with the DRM
> > > > > > scheduler. This RFC series is our first attempt at doing this. We
> > > > > > welcome any and all feedback.
> > > > > > 
> > > > > > This can we thought of as 4 parts detailed below.
> > > > > > 
> > > > > > - DRM scheduler changes for 1 to 1 relationship between scheduler and
> > > > > > entity (patches 1-3)
> > > > > > 
> > > > > > In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> > > > > > GuC) which is a new paradigm WRT to the DRM scheduler and presents
> > > > > > severals problems as the DRM was originally designed to schedule jobs on
> > > > > > hardware queues. The main problem being that DRM scheduler expects the
> > > > > > submission order of jobs to be the completion order of jobs even across
> > > > > > multiple entities. This assumption falls apart with a firmware scheduler
> > > > > > as a firmware scheduler has no concept of jobs and jobs can complete out
> > > > > > of order. A novel solution for was originally thought of by Faith during
> > > > > > the initial prototype of Xe, create a 1 to 1 relationship between scheduler
> > > > > > and entity. I believe the AGX driver [3] is using this approach and
> > > > > > Boris may use approach as well for the Mali driver [4].
> > > > > > 
> > > > > > To support a 1 to 1 relationship we move the main execution function
> > > > > > from a kthread to a work queue and add a new scheduling mode which
> > > > > > bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
> > > > > > The new scheduling mode should unify all drivers usage with a 1 to 1
> > > > > > relationship and can be thought of as using scheduler as a dependency /
> > > > > > infligt job tracker rather than a true scheduler.
> > > > > 
> > > > > Yup, we're in the exact same situation with drm/asahi, so this is very
> > > > > welcome! We've been using the existing scheduler as-is, but this should help
> > > > > remove some unneeded complexity in this use case.
> > > > > 
> > > > 
> > > > That's the idea.
> > > > 
> > > > > Do you want me to pull in this series into our tree and make sure this all
> > > > > works out for us?
> > > > > 
> > > > 
> > > > We tested this in Xe and it definitely works for us but the more testing
> > > > the better.
> > > > 
> > > 
> > > I haven't gotten around to testing this series yet, but after more debugging
> > > of drm_sched issues I want to hear more about how Xe uses the scheduler.
> > > 
> > >  From what I can tell, and from what Christian says, drm_sched has the hidden
> > > requirement that all job objects outlive the scheduler. I've run into
> > > several UAF bugs due to this. Not only that, it also currently has the
> > > requirement that all drm_sched fences outlive the scheduler object.
> > > 
> > > These requirements are subtle and only manifest as kernel oopses in rare
> > > corner cases, so it wasn't at all obvious to me that this was somehow a
> > > fundamental design assumption when I started using it.
> > > 
> > > As far as I can tell, this design is going to work in 99% of cases for
> > > global-schedulers-per-GPU models, where those corner cases would have to be
> > > hit on top of a GPU removal scenario (and GPU remove is... well, not the
> > > most tested/exercised use case). When the scheduler basically lives forever,
> > > none of this really matters.
> > > 
> > > But with a one-scheduler-per-queue model, how do you deal with this when the
> > > queue goes away? So far, without any of the partial bugfixes I have sent so
> > > far (which Christian objected to):
> > > 
> > > - If you try to tear down a scheduler with any jobs currently scheduled at
> > > the hardware, drm_sched will oops when those jobs complete and the hw fences
> > > signal.
> > > - If you try to tear down an entity (which should cancel all its pending
> > > jobs) and then the scheduler it was attached to without actually waiting for
> > > all the free_job() callbacks to be called on every job that ever existed for
> > > that entity, you can oops (entity cleanup is asynchronous in some cases like
> > > killed processes, so it will return before all jobs are freed and then that
> > > asynchronous process will crash and burn if the scheduler goes away out from
> > > under its feet). Waiting for job completion fences is not enough for this,
> > > you have to wait until free_job() has actually been called for all jobs.
> > > - Even if you actually wait for all jobs to be truly gone and then tear down
> > > the scheduler, if any scheduler job fences remain alive, that will then oops
> > > if you try to call the debug functions on them (like cat
> > > /sys/kernel/debug/dma_buf/bufinfo).
> > > 
> > > I tried to fix these things, but Christian objected implying it was the
> > > driver's job to keep a reference from jobs and hw fences to the scheduler.
> > > But I find that completely broken, because besides the extra memory/resource
> > > usage keeping the scheduler alive when you're trying to free resources as
> > > fast as possible when a process goes away, you can't even use normal
> > > reference counting for that: if you try to drop the last drm_sched reference
> > > from within a free_job() callback, the whole thing deadlocks since that will
> > > be running in the scheduler's thread/workqueue context, which can't free
> > > itself. So now you both reference count the scheduler from jobs and fences,
> > > and on top of that you need to outsource drm_sched freeing to a workqueue in
> > > the driver to make sure you don't deadlock.
> > > 
> > > For job fences this is particularly broken, because those fences can live
> > > forever signaled and attached to shared buffers and there is no guarantee
> > > that they will be freed in any kind of reasonable time frame. If they have
> > > to keep the scheduler that created them alive, that creates a lot of dead
> > > object junk we have to drag around just because a signaled fence exists
> > > somewhere.
> > > 
> > > For a Rust abstraction we have to do all that tracking and refcounting in
> > > the abstraction itself to make it safe, which is starting to sound like
> > > reimplementing half of the job tracking drm_sched itself does just to fix
> > > the lifetime issues, which really tells me the existing design is not sound
> > > nor easy to use correctly in general.
> > > 
> > > How does Xe deal with this (does it deal with it at all)? What happens when
> > > you kill -9 a process using the GPU? Does freeing all of this wait for all
> > > jobs to complete *and be freed* with free_job()? What about exported
> > > dma_bufs with fences attached from that scheduler? Do you keep the scheduler
> > > alive for those?
> > > 
> > > Personally, after running into all this, and after seeing Christian's
> > > reaction to me trying to improve the state of things, I'm starting to doubt
> > > that drm_sched is the right solution at all for firmware-scheduling drivers.
> > 
> > Bit a wash-up reply on the more fundamental thing here:
> > For the current scheduler the issues you've found are indeed all driver
> > bugs (or most I think at least).
> 
> Even the last one with the fences? I can't see how that could be implemented
> correctly today by any driver, short of having the driver live until any
> buffers it has touched and installed a fence into go away, which doesn't
> sound right, since that would block cleanup (and module unloading) possibly
> forever, and that itself sounds like a bug...
> 
> This is why I'm a bit disappointed here, because even that one got me a
> "you're doing it wrong" response from Christian... but if scheduler fences
> are supposed to be outlived by the driver and its fences, what is even the
> point of having separate fences?

Yeah that one sounds like a bug. Not sure what Christian was thinking, but
the point of the split between hw fence and public drm_job fence is that
the latter is supposed to be free-standing.

Otherwise we'd need to refcount the world and have an enormous space
leak. i915-gem tried that, and it's not even close to pretty because you
get to untie all kinds of weak pointers in all kinds of really scary ways.

Maybe try resending that, but put a patch in front that fixes the kerneldoc
to explain this in really clear terms & why it's needed? Then the 2nd patch
is more of a "we fix the driver api contract here".
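
Something like this for the kerneldoc bit, maybe (wording entirely made
up here, just to spell out the rule under discussion):

/**
 * DOC: scheduler fence lifetime
 *
 * The fences embedded in &struct drm_sched_fence are what the scheduler
 * hands out to the rest of the world (dma_resv, sync_file, drm_syncobj,
 * other schedulers as dependencies). They are free-standing: once
 * created they must remain valid even after the job, the hardware fence
 * returned from &drm_sched_backend_ops.run_job and the scheduler itself
 * are gone. Neither drivers nor the scheduler core may reach back from
 * such a fence into job, ring or scheduler state without holding their
 * own references.
 */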

> > Which is why I think we shouldn't just try to shoehorn fundamentally new
> > semantics without updating the driver interfaces (the drm_sched split into
> > the driver interface part and the internal scheduler part). Once we have
> > that, including kerneldoc update and what the rules are, then all the
> > various uaf you've discovered become real bugs and I don't see any issue
> > merging all the fixes.
> > 
> > Without that we do have a chicken/egg problem between:
> > 
> > "here's a bunch of hacks to make the problems disappear I've hit in my
> > reuse of drm/sched for fw schedulers"
> > 
> > vs.
> > 
> > "this makes no sense for the current drm/sched interfaces and how current
> > upstream drivers use it"
> > 
> > I don't think there's a lot needed in terms of drm/sched driver api
> > rework, but I think it's also pretty clearly not ever going to get
> > anywhere with just nothing at all. Writing an entire new scheduler lib
> > instead of at least trying what minimal semantic changes (instead of just
> > a pile of hacks without even doc changes for the new rules) does not sound
> > like a good idea to me :-)
> 
> I wish I knew what the old rules were, since they're still not documented...
> 
> It's frustrating following what few rules are written down, running into a
> bug, writing a patch to fix it, and being told "no, you're just not
> following the unwritten rules"... several times now.

Yeah I get that. And I think I've pointed out in a few places in the threads
where things derailed that the more constructive approach (instead of
random naks and shit like that) is to ask to improve the docs. Because right
now they're really not great for the scheduler :-/

> > > If you want a workload to try to see if you run into any of these things,
> > > running and killing lots of things in parallel is a good thing to try (mess
> > > with the numbers and let it run for a while to see if you can hit any corner
> > > cases):
> > > 
> > > while true; do for i in $(seq 1 10); do timeout -k 0.01 0.05 glxgears &
> > > done; sleep 0.1; done
> > 
> > Maybe xe gets away with this due to synchronously killing everything
> > related to a ctx, but yeah I'd expect this to go boom in fun ways.
> 
> It'd have to explicitly refcount all the jobs and block killing the ctx
> until all jobs are freed (not just signaled) for that not to go boom right
> now, but even then you'd still have the issue with dangling fences in
> buffers making `cat /sys/kernel/debug/dma_buf/bufinfo` oops... and you can't
> synchronously kill those as far as I know.

Yeah that sounds like a full-on bug. A doc patch to sharpen the semantics we
want + a fix should get this going - or I'll jump in and get it going :-)

For the bigger issue I guess we might just land on "drm/sched needs to
refcount a lot more for fw scheduler", which is again why I think some doc
patches + minimal driver api rework might be best. Sometimes just
submitting code isn't the best way to communicate.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans
  2023-04-08  7:05     ` Asahi Lina
  2023-04-11 14:07       ` Daniel Vetter
@ 2023-04-17  0:03       ` Matthew Brost
  1 sibling, 0 replies; 87+ messages in thread
From: Matthew Brost @ 2023-04-17  0:03 UTC (permalink / raw)
  To: Asahi Lina
  Cc: robdclark, airlied, dri-devel, intel-xe, boris.brezillon,
	christian.koenig, faith.ekstrand

On Sat, Apr 08, 2023 at 04:05:20PM +0900, Asahi Lina wrote:
> On 04/04/2023 10.58, Matthew Brost wrote:
> > On Tue, Apr 04, 2023 at 10:07:48AM +0900, Asahi Lina wrote:
> > > Hi, thanks for the Cc!
> > > 
> > 
> > No problem.
> > 
> > > On 04/04/2023 09.22, Matthew Brost wrote:
> > > > Hello,
> > > > 
> > > > As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> > > > have been asked to merge our common DRM scheduler patches first as well
> > > > as develop a common solution for long running workloads with the DRM
> > > > scheduler. This RFC series is our first attempt at doing this. We
> > > > welcome any and all feedback.
> > > > 
> > > > This can we thought of as 4 parts detailed below.
> > > > 
> > > > - DRM scheduler changes for 1 to 1 relationship between scheduler and
> > > > entity (patches 1-3)
> > > > 
> > > > In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> > > > GuC) which is a new paradigm WRT to the DRM scheduler and presents
> > > > severals problems as the DRM was originally designed to schedule jobs on
> > > > hardware queues. The main problem being that DRM scheduler expects the
> > > > submission order of jobs to be the completion order of jobs even across
> > > > multiple entities. This assumption falls apart with a firmware scheduler
> > > > as a firmware scheduler has no concept of jobs and jobs can complete out
> > > > of order. A novel solution for was originally thought of by Faith during
> > > > the initial prototype of Xe, create a 1 to 1 relationship between scheduler
> > > > and entity. I believe the AGX driver [3] is using this approach and
> > > > Boris may use approach as well for the Mali driver [4].
> > > > 
> > > > To support a 1 to 1 relationship we move the main execution function
> > > > from a kthread to a work queue and add a new scheduling mode which
> > > > bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
> > > > The new scheduling mode should unify all drivers usage with a 1 to 1
> > > > relationship and can be thought of as using scheduler as a dependency /
> > > > infligt job tracker rather than a true scheduler.
> > > 
> > > Yup, we're in the exact same situation with drm/asahi, so this is very
> > > welcome! We've been using the existing scheduler as-is, but this should help
> > > remove some unneeded complexity in this use case.
> > > 
> > 
> > That's the idea.
> > 
> > > Do you want me to pull in this series into our tree and make sure this all
> > > works out for us?
> > > 
> > 
> > We tested this in Xe and it definitely works for us but the more testing
> > the better.
> > 
> 
> I haven't gotten around to testing this series yet, but after more debugging
> of drm_sched issues I want to hear more about how Xe uses the scheduler.
> 
> From what I can tell, and from what Christian says, drm_sched has the hidden
> requirement that all job objects outlive the scheduler. I've run into
> several UAF bugs due to this. Not only that, it also currently has the
> requirement that all drm_sched fences outlive the scheduler object.
> 
> These requirements are subtle and only manifest as kernel oopses in rare
> corner cases, so it wasn't at all obvious to me that this was somehow a
> fundamental design assumption when I started using it.
> 
> As far as I can tell, this design is going to work in 99% of cases for
> global-schedulers-per-GPU models, where those corner cases would have to be
> hit on top of a GPU removal scenario (and GPU remove is... well, not the
> most tested/exercised use case). When the scheduler basically lives forever,
> none of this really matters.
> 
> But with a one-scheduler-per-queue model, how do you deal with this when the
> queue goes away? So far, without any of the partial bugfixes I have sent so
> far (which Christian objected to):
> 
> - If you try to tear down a scheduler with any jobs currently scheduled at
> the hardware, drm_sched will oops when those jobs complete and the hw fences
> signal.
> - If you try to tear down an entity (which should cancel all its pending
> jobs) and then the scheduler it was attached to without actually waiting for
> all the free_job() callbacks to be called on every job that ever existed for
> that entity, you can oops (entity cleanup is asynchronous in some cases like
> killed processes, so it will return before all jobs are freed and then that
> asynchronous process will crash and burn if the scheduler goes away out from
> under its feet). Waiting for job completion fences is not enough for this,
> you have to wait until free_job() has actually been called for all jobs.
> - Even if you actually wait for all jobs to be truly gone and then tear down
> the scheduler, if any scheduler job fences remain alive, that will then oops
> if you try to call the debug functions on them (like cat
> /sys/kernel/debug/dma_buf/bufinfo).
> 
> I tried to fix these things, but Christian objected implying it was the
> driver's job to keep a reference from jobs and hw fences to the scheduler.
> But I find that completely broken, because besides the extra memory/resource
> usage keeping the scheduler alive when you're trying to free resources as
> fast as possible when a process goes away, you can't even use normal
> reference counting for that: if you try to drop the last drm_sched reference
> from within a free_job() callback, the whole thing deadlocks since that will
> be running in the scheduler's thread/workqueue context, which can't free
> itself. So now you both reference count the scheduler from jobs and fences,
> and on top of that you need to outsource drm_sched freeing to a workqueue in
> the driver to make sure you don't deadlock.
> 

This is what Xe does: jobs reference the scheduler / entity (xe_engine).
When the reference count of an xe_engine goes to zero we trigger the
teardown process (a ping / pong with the firmware) via a CLEANUP message,
and when teardown is done the last step of killing the scheduler is indeed
done by an async worker, as you suggest.

To kill a queue, we just kick the TDR, which in turn kills any
outstanding jobs, resulting in the xe_engine ref count (at least the
references held by jobs) going to zero.

If a user holds a ref to the dma-fence of a job, then yes, the scheduler isn't
going to be freed (though it can be killed beforehand as described above).

This all seems to work just fine for Xe.
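
Roughly, that lifetime scheme looks like the sketch below (all names
invented here, this is not the actual Xe code):

#include <linux/kref.h>
#include <linux/slab.h>
#include <linux/workqueue.h>
#include <drm/gpu_scheduler.h>

struct engine_sketch {
	struct kref refcount;		/* each in-flight job holds one ref */
	struct drm_gpu_scheduler sched;
	struct drm_sched_entity entity;
	struct work_struct destroy_work;	/* INIT_WORK() at engine creation */
};

static void engine_destroy_work(struct work_struct *w)
{
	struct engine_sketch *e =
		container_of(w, struct engine_sketch, destroy_work);

	/* The firmware handshake (CLEANUP ping / pong) has finished by now. */
	drm_sched_entity_fini(&e->entity);
	drm_sched_fini(&e->sched);
	kfree(e);
}

static void engine_release(struct kref *ref)
{
	struct engine_sketch *e =
		container_of(ref, struct engine_sketch, refcount);

	/*
	 * The final put may come from free_job() running in the scheduler's
	 * own work context, which cannot free the scheduler underneath
	 * itself, so the actual teardown is pushed to another worker.
	 */
	queue_work(system_unbound_wq, &e->destroy_work);
}

/* Jobs do kref_get() when armed and call this from their free_job(). */
static void engine_put(struct engine_sketch *e)
{
	kref_put(&e->refcount, engine_release);
}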

> For job fences this is particularly broken, because those fences can live
> forever signaled and attached to shared buffers and there is no guarantee
> that they will be freed in any kind of reasonable time frame. If they have
> to keep the scheduler that created them alive, that creates a lot of dead
> object junk we have to drag around just because a signaled fence exists
> somewhere.
> 
> For a Rust abstraction we have to do all that tracking and refcounting in
> the abstraction itself to make it safe, which is starting to sound like
> reimplementing half of the job tracking drm_sched itself does just to fix
> the lifetime issues, which really tells me the existing design is not sound
> nor easy to use correctly in general.
> 
> How does Xe deal with this (does it deal with it at all)? What happens when
> you kill -9 a process using the GPU? Does freeing all of this wait for all
> jobs to complete *and be freed* with free_job()? What about exported
> dma_bufs with fences attached from that scheduler? Do you keep the scheduler
> alive for those?

kill -9 would trigger the queue teardown described above. Yes, if
fences are exported the scheduler might hold onto some firmware
resources for a bit.

> 
> Personally, after running into all this, and after seeing Christian's
> reaction to me trying to improve the state of things, I'm starting to doubt
> that drm_sched is the right solution at all for firmware-scheduling drivers.
> 
> If you want a workload to try to see if you run into any of these things,
> running and killing lots of things in parallel is a good thing to try (mess
> with the numbers and let it run for a while to see if you can hit any corner
> cases):
> 
> while true; do for i in $(seq 1 10); do timeout -k 0.01 0.05 glxgears &
> done; sleep 0.1; done
>

Tested this and it works in Xe.

Feel free to ping me on IRC if you want to chat more about this.

Matt

> ~~ Lina
> 

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans
  2023-04-11 14:13               ` Daniel Vetter
@ 2023-04-17  6:47                 ` Christian König
  2023-04-17  8:39                   ` Daniel Vetter
  0 siblings, 1 reply; 87+ messages in thread
From: Christian König @ 2023-04-17  6:47 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: robdclark, airlied, lina, Zeng, Oak, boris.brezillon, dri-devel,
	Vetter, Daniel, intel-xe, faith.ekstrand

Am 11.04.23 um 16:13 schrieb Daniel Vetter:
> On Tue, Apr 11, 2023 at 11:02:55AM +0200, Christian König wrote:
>> The point is that this not only requires some work in the drm_scheduler, but
>> rather it then makes only little sense to use the drm_scheduler in the first
>> place.
>>
>> The whole point of the drm_scheduler is to provide dma_fence implementation
>> for the submitted jobs.
>>
>> We also have dependency handling, but as Daniel and I said this can be
>> easily extracted into a separate object/component.
> Uh that's not what I meant. My take is that minimally patching drm/sched
> to make the out-fence either optional, or complete it right away, is the
> simplest way to get at the dependency handling. For me at least the major
> part of drm/sched is the dep handling and timeout stuff. And the later can
> be reused with some glue to handle preempt timeouts too and other things,
> since tdr is a work struct you can just issue any other gpu timeouts on
> the same workqueue and using the roughly same pattern as the ->timed_out
> hook and it'll just work.

Well that strongly sounds like what I had in mind as well.

Whether we move the dependency/timeout functionality into a new component or 
move the scheduler fence into a new component doesn't seem to 
matter; the high level goal is that we separate the two 
functionalities, and both approaches will work for that.
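
Purely to make the "separate component" idea concrete, a rough sketch of
what a split-out dependency helper could look like (all names invented
here; drm_sched today keeps job dependencies in an xarray in much the
same way):

#include <linux/xarray.h>
#include <linux/dma-fence.h>

struct deps_sketch {
	struct xarray fences;	/* dependency fences, owned by the tracker */
};

static inline void deps_init(struct deps_sketch *d)
{
	xa_init_flags(&d->fences, XA_FLAGS_ALLOC);
}

static inline int deps_add(struct deps_sketch *d, struct dma_fence *fence)
{
	u32 id;
	int ret;

	ret = xa_alloc(&d->fences, &id, fence, xa_limit_32b, GFP_KERNEL);
	if (!ret)
		dma_fence_get(fence);	/* dropped again once it signals */
	return ret;
}

/*
 * Return the next unsignaled dependency to wait on, or NULL when all
 * dependencies have signaled; signaled entries are dropped as we go.
 */
static inline struct dma_fence *deps_next(struct deps_sketch *d)
{
	struct dma_fence *fence;
	unsigned long index;

	xa_for_each(&d->fences, index, fence) {
		if (!dma_fence_is_signaled(fence))
			return fence;
		xa_erase(&d->fences, index);
		dma_fence_put(fence);
	}
	return NULL;
}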

> The entire "oh we also make sure your hw fence doesn't leak into public
> fences and causes lifetime mayhem" seems pretty minor. And maybe also
> something we want to replicate for the preempt-ctx dma_fence that some
> long-running context need (but more as part of drm_sched_entity I guess).
>
> We can of course bikeshed how much flexibility really should be in the
> different parts of drm/sched, but imo that's a bikeshed.

Well the dependency handling in a separate component would still be 
interesting to have since we need something similar for user queues as well.

Christian.

> -Daniel
>
>
>> Regards,
>> Christian.
>>
>> Am 07.04.23 um 02:20 schrieb Zeng, Oak:
>>> So this series basically go with option 2. The part that option2 makes me uncomfortable is, dma-fence doesn't work for long running workload, why we generate it in the first place? As long as dma-fence is generated, it will become a source of confusion in the future. It doesn't matter how much you annotate it/document it. So if we decide to go with option2, the bottom line is, don't generate dma-fence for long running workload during job submission. This requires some rework in drm scheduler.
>>>
>>> The cleanest solution to me is option3. Dma-fence is a very old technology. When it was created, no gpu support page fault. Obviously this is not a good technology for modern gpu with page fault support. I think the best way is to create a new scheduler and dependency tracking mechanism works for both page fault enabled and page fault disabled context. I think this matches what Christian said below. Maybe nobody think this is easy?
>>>
>>> Thanks,
>>> Oak
>>>
>>>> -----Original Message-----
>>>> From: Brost, Matthew <matthew.brost@intel.com>
>>>> Sent: April 5, 2023 2:53 PM
>>>> To: Zeng, Oak <oak.zeng@intel.com>
>>>> Cc: Christian König <christian.koenig@amd.com>; Vetter, Daniel
>>>> <daniel.vetter@intel.com>; Thomas Hellström
>>>> <thomas.hellstrom@linux.intel.com>; dri-devel@lists.freedesktop.org; intel-
>>>> xe@lists.freedesktop.org; robdclark@chromium.org; airlied@linux.ie;
>>>> lina@asahilina.net; boris.brezillon@collabora.com; faith.ekstrand@collabora.com
>>>> Subject: Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
>>>> plans
>>>>
>>>> On Wed, Apr 05, 2023 at 12:06:53PM -0600, Zeng, Oak wrote:
>>>>> Hi,
>>>>>
>>>>> Using dma-fence for completion/dependency tracking for long-run
>>>> workload(more precisely on-demand paging/page fault enabled workload) can
>>>> cause deadlock. This seems the significant issue here. Other issues such as the
>>>> drm scheduler completion order implication etc are minors which can be solve
>>>> inside the framework of drm scheduler. We need to evaluate below paths:
>>>>> 	1) still use drm scheduler for job submission, and use dma-fence for job
>>>> completion waiting/dependency tracking. This is solution proposed in this series.
>>>> Annotate dma-fence for long-run workload: user can still wait dma-fence for job
>>>> completion but can't wait dma-fence while holding any memory management
>>>> locks.  We still use dma-fence for dependency tracking. But it is just very easily
>>>> run into deadlock when on-demand paging is in the picture. The annotation helps
>>>> us to detect deadlock but not solve deadlock problems. Seems *not* a complete
>>>> solution: It is almost impossible to completely avoid dependency deadlock in
>>>> complex runtime environment
>>>> No one can wait on LR fence, so it is impossible to deadlock. The
>>>> annotations enforce this. Literally this is only for flow controling the
>>>> ring / hold pending jobs in in the DRM schedule list.
>>>>
>>>>> 	2) Still use drm scheduler but not use dma-fence for completion signaling
>>>> and dependency tracking. This way we still get some free functions (reset, err
>>>> handling ring flow control as Matt said)from drm scheduler, but push the
>>>> dependency/completion tracking completely to user space using techniques such
>>>> as user space fence. User space doesn't have chance to wait fence while holding
>>>> a kernel memory management lock, thus the dma-fence deadlock issue is solved.
>>>> We use user space fence for syncs.
>>>>
>>>>> 	3) Completely discard drm scheduler and dma-fence for long-run
>>>> workload. Use user queue/doorbell for super fast submission, directly interact
>>>> with fw scheduler. Use user fence for completion/dependency tracking.
>>>> This is a hard no from me, I want 1 submission path in Xe. Either we use
>>>> the DRM scheduler or we don't.
>>>>
>>>> Matt
>>>>
>>>>> Thanks,
>>>>> Oak
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Christian König <christian.koenig@amd.com>
>>>>>> Sent: April 5, 2023 3:30 AM
>>>>>> To: Brost, Matthew <matthew.brost@intel.com>; Zeng, Oak
>>>>>> <oak.zeng@intel.com>
>>>>>> Cc: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org;
>>>>>> robdclark@chromium.org; thomas.hellstrom@linux.intel.com;
>>>> airlied@linux.ie;
>>>>>> lina@asahilina.net; boris.brezillon@collabora.com;
>>>> faith.ekstrand@collabora.com
>>>>>> Subject: Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
>>>>>> plans
>>>>>>
>>>>>> Am 04.04.23 um 20:08 schrieb Matthew Brost:
>>>>>>> On Tue, Apr 04, 2023 at 12:02:03PM -0600, Zeng, Oak wrote:
>>>>>>>> Hi Matt, Thomas,
>>>>>>>>
>>>>>>>> Some very bold out of box thinking in this area:
>>>>>>>>
>>>>>>>> 1. so you want to use drm scheduler and dma-fence for long running
>>>> workload.
>>>>>> Why you want to do this in the first place? What is the benefit? Drm scheduler
>>>> is
>>>>>> pretty much a software scheduler. Modern gpu has scheduler built at fw/hw
>>>>>> level, as you said below for intel this is Guc. Can xe driver just directly submit
>>>> job
>>>>>> to Guc, bypassing drm scheduler?
>>>>>>> If we did that now we have 2 paths for dependency track, flow controling
>>>>>>> the ring, resets / error handling / backend submission implementations.
>>>>>>> We don't want this.
>>>>>> Well exactly that's the point: Why?
>>>>>>
>>>>>> As far as I can see that are two completely distinct use cases, so you
>>>>>> absolutely do want two completely distinct implementations for this.
>>>>>>
>>>>>>>> 2. using dma-fence for long run workload: I am well aware that page fault
>>>> (and
>>>>>> the consequent memory allocation/lock acquiring to fix the fault) can cause
>>>>>> deadlock for a dma-fence wait. But I am not convinced that dma-fence can't
>>>> be
>>>>>> used purely because the nature of the workload that it runs very long
>>>> (indefinite).
>>>>>> I did a math: the dma_fence_wait_timeout function's third param is the
>>>> timeout
>>>>>> which is a signed long type. If HZ is 1000, this is about 23 days. If 23 days is not
>>>> long
>>>>>> enough, can we just change the timeout parameter to signed 64 bits so it is
>>>> much
>>>>>> longer than our life time...
>>>>>>>> So I mainly argue we can't use dma-fence for long-run workload is not
>>>>>> because the workload runs very long, rather because of the fact that we use
>>>>>> page fault for long-run workload. If we enable page fault for short-run
>>>> workload,
>>>>>> we can't use dma-fence either. Page fault is the key thing here.
>>>>>>>> Now since we use page fault which is *fundamentally* controversial with
>>>>>> dma-fence design, why now just introduce a independent concept such as
>>>> user-
>>>>>> fence instead of extending existing dma-fence?
>>>>>>>> I like unified design. If drm scheduler, dma-fence can be extended to work
>>>> for
>>>>>> everything, it is beautiful. But seems we have some fundamental problem
>>>> here.
>>>>>>> Thomas's patches turn a dma-fence into KMD sync point (e.g. we just use
>>>>>>> the signal / CB infrastructure) and enforce we don't use use these
>>>>>>> dma-fences from the scheduler in memory reclaim paths or export these to
>>>>>>> user space or other drivers. Think of this mode as SW only fence.
>>>>>> Yeah and I truly think this is an really bad idea.
>>>>>>
>>>>>> The signal/CB infrastructure in the dma_fence turned out to be the
>>>>>> absolutely nightmare I initially predicted. Sorry to say that, but in
>>>>>> this case the "I've told you so" is appropriate in my opinion.
>>>>>>
>>>>>> If we need infrastructure for long running dependency tracking we should
>>>>>> encapsulate that in a new framework and not try to mangle the existing
>>>>>> code for something it was never intended for.
>>>>>>
>>>>>> Christian.
>>>>>>
>>>>>>> Matt
>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Oak
>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of
>>>>>>>>> Matthew Brost
>>>>>>>>> Sent: April 3, 2023 8:22 PM
>>>>>>>>> To: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org
>>>>>>>>> Cc: robdclark@chromium.org; thomas.hellstrom@linux.intel.com;
>>>>>> airlied@linux.ie;
>>>>>>>>> lina@asahilina.net; boris.brezillon@collabora.com; Brost, Matthew
>>>>>>>>> <matthew.brost@intel.com>; christian.koenig@amd.com;
>>>>>>>>> faith.ekstrand@collabora.com
>>>>>>>>> Subject: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
>>>>>> plans
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
>>>>>>>>> have been asked to merge our common DRM scheduler patches first as
>>>> well
>>>>>>>>> as develop a common solution for long running workloads with the DRM
>>>>>>>>> scheduler. This RFC series is our first attempt at doing this. We
>>>>>>>>> welcome any and all feedback.
>>>>>>>>>
>>>>>>>>> This can we thought of as 4 parts detailed below.
>>>>>>>>>
>>>>>>>>> - DRM scheduler changes for 1 to 1 relationship between scheduler and
>>>>>>>>> entity (patches 1-3)
>>>>>>>>>
>>>>>>>>> In Xe all of the scheduling of jobs is done by a firmware scheduler (the
>>>>>>>>> GuC) which is a new paradigm WRT to the DRM scheduler and presents
>>>>>>>>> severals problems as the DRM was originally designed to schedule jobs
>>>> on
>>>>>>>>> hardware queues. The main problem being that DRM scheduler expects
>>>> the
>>>>>>>>> submission order of jobs to be the completion order of jobs even across
>>>>>>>>> multiple entities. This assumption falls apart with a firmware scheduler
>>>>>>>>> as a firmware scheduler has no concept of jobs and jobs can complete
>>>> out
>>>>>>>>> of order. A novel solution for was originally thought of by Faith during
>>>>>>>>> the initial prototype of Xe, create a 1 to 1 relationship between scheduler
>>>>>>>>> and entity. I believe the AGX driver [3] is using this approach and
>>>>>>>>> Boris may use approach as well for the Mali driver [4].
>>>>>>>>>
>>>>>>>>> To support a 1 to 1 relationship we move the main execution function
>>>>>>>>> from a kthread to a work queue and add a new scheduling mode which
>>>>>>>>> bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
>>>>>>>>> The new scheduling mode should unify all drivers usage with a 1 to 1
>>>>>>>>> relationship and can be thought of as using scheduler as a dependency /
>>>>>>>>> infligt job tracker rather than a true scheduler.
>>>>>>>>>
>>>>>>>>> - Generic messaging interface for DRM scheduler
>>>>>>>>>
>>>>>>>>> Idea is to be able to communicate to the submission backend with in
>>>> band
>>>>>>>>> (relative to main execution function) messages. Messages are backend
>>>>>>>>> defined and flexable enough for any use case. In Xe we use these
>>>>>>>>> messages to clean up entites, set properties for entites, and suspend /
>>>>>>>>> resume execution of an entity [5]. I suspect other driver can leverage
>>>>>>>>> this messaging concept too as it a convenient way to avoid races in the
>>>>>>>>> backend.
>>>>>>>>>
>>>>>>>>> - Support for using TDR for all error paths of a scheduler / entity
>>>>>>>>>
>>>>>>>>> Fix a few races / bugs, add function to dynamically set the TDR timeout.
>>>>>>>>>
>>>>>>>>> - Annotate dma-fences for long running workloads.
>>>>>>>>>
>>>>>>>>> The idea here is to use dma-fences only as sync points within the
>>>>>>>>> scheduler and never export them for long running workloads. By
>>>>>>>>> annotating these fences as long running we ensure that these dma-
>>>> fences
>>>>>>>>> are never used in a way that breaks the dma-fence rules. A benefit of
>>>>>>>>> thus approach is the scheduler can still safely flow control the
>>>>>>>>> execution ring buffer via the job limit without breaking the dma-fence
>>>>>>>>> rules.
>>>>>>>>>
>>>>>>>>> Again this a first draft and looking forward to feedback.
>>>>>>>>>
>>>>>>>>> Enjoy - Matt
>>>>>>>>>
>>>>>>>>> [1] https://gitlab.freedesktop.org/drm/xe/kernel
>>>>>>>>> [2] https://patchwork.freedesktop.org/series/112188/
>>>>>>>>> [3] https://patchwork.freedesktop.org/series/114772/
>>>>>>>>> [4]
>>>> https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1
>>>>>>>>> [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-
>>>>>>>>> next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031
>>>>>>>>>
>>>>>>>>> Matthew Brost (8):
>>>>>>>>>      drm/sched: Convert drm scheduler to use a work queue rather than
>>>>>>>>>        kthread
>>>>>>>>>      drm/sched: Move schedule policy to scheduler / entity
>>>>>>>>>      drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling
>>>> policy
>>>>>>>>>      drm/sched: Add generic scheduler message interface
>>>>>>>>>      drm/sched: Start run wq before TDR in drm_sched_start
>>>>>>>>>      drm/sched: Submit job before starting TDR
>>>>>>>>>      drm/sched: Add helper to set TDR timeout
>>>>>>>>>      drm/syncobj: Warn on long running dma-fences
>>>>>>>>>
>>>>>>>>> Thomas Hellström (2):
>>>>>>>>>      dma-buf/dma-fence: Introduce long-running completion fences
>>>>>>>>>      drm/sched: Support long-running sched entities
>>>>>>>>>
>>>>>>>>>     drivers/dma-buf/dma-fence.c                 | 142 +++++++---
>>>>>>>>>     drivers/dma-buf/dma-resv.c                  |   5 +
>>>>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
>>>>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
>>>>>>>>>     drivers/gpu/drm/drm_syncobj.c               |   5 +-
>>>>>>>>>     drivers/gpu/drm/etnaviv/etnaviv_sched.c     |   5 +-
>>>>>>>>>     drivers/gpu/drm/lima/lima_sched.c           |   5 +-
>>>>>>>>>     drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
>>>>>>>>>     drivers/gpu/drm/msm/msm_ringbuffer.c        |   5 +-
>>>>>>>>>     drivers/gpu/drm/panfrost/panfrost_job.c     |   5 +-
>>>>>>>>>     drivers/gpu/drm/scheduler/sched_entity.c    | 127 +++++++--
>>>>>>>>>     drivers/gpu/drm/scheduler/sched_fence.c     |   6 +-
>>>>>>>>>     drivers/gpu/drm/scheduler/sched_main.c      | 278 +++++++++++++++--
>>>> ---
>>>>>>>>>     drivers/gpu/drm/v3d/v3d_sched.c             |  25 +-
>>>>>>>>>     include/drm/gpu_scheduler.h                 | 130 +++++++--
>>>>>>>>>     include/linux/dma-fence.h                   |  60 ++++-
>>>>>>>>>     16 files changed, 649 insertions(+), 184 deletions(-)
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> 2.34.1


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans
  2023-04-17  6:47                 ` Christian König
@ 2023-04-17  8:39                   ` Daniel Vetter
  0 siblings, 0 replies; 87+ messages in thread
From: Daniel Vetter @ 2023-04-17  8:39 UTC (permalink / raw)
  To: Christian König
  Cc: robdclark, airlied, lina, Zeng, Oak, boris.brezillon, dri-devel,
	Daniel Vetter, Vetter, Daniel, intel-xe, faith.ekstrand

On Mon, Apr 17, 2023 at 08:47:19AM +0200, Christian König wrote:
> Am 11.04.23 um 16:13 schrieb Daniel Vetter:
> > On Tue, Apr 11, 2023 at 11:02:55AM +0200, Christian König wrote:
> > > The point is that this not only requires some work in the drm_scheduler, but
> > > rather it then makes only little sense to use the drm_scheduler in the first
> > > place.
> > > 
> > > The whole point of the drm_scheduler is to provide dma_fence implementation
> > > for the submitted jobs.
> > > 
> > > We also have dependency handling, but as Daniel and I said this can be
> > > easily extracted into a separate object/component.
> > Uh that's not what I meant. My take is that minimally patching drm/sched
> > to make the out-fence either optional, or complete it right away, is the
> > simplest way to get at the dependency handling. For me at least the major
> > part of drm/sched is the dep handling and timeout stuff. And the later can
> > be reused with some glue to handle preempt timeouts too and other things,
> > since tdr is a work struct you can just issue any other gpu timeouts on
> > the same workqueue and using the roughly same pattern as the ->timed_out
> > hook and it'll just work.
> 
> Well that strongly sounds like what I had in mind as well.
> 
> If we move the dependency/timeout functionality into a new component or if
> we move the scheduler fence into a new component doesn't seem to matter, the
> high level goal is that we have separated the two functionalities and both
> approach will work for that.

Ah ok, I guess I just got confused about your wording then.

> > The entire "oh we also make sure your hw fence doesn't leak into public
> > fences and causes lifetime mayhem" seems pretty minor. And maybe also
> > something we want to replicate for the preempt-ctx dma_fence that some
> > long-running context need (but more as part of drm_sched_entity I guess).
> > 
> > We can of course bikeshed how much flexibility really should be in the
> > different parts of drm/sched, but imo that's a bikeshed.
> 
> Well the dependency handling in a separate component would still be
> interesting to have since we need something similar for user queues as well.

Yeah it might be neater to refactor, but I think that part is optional at
least near-term. There's always room to polish shared code, and often it's
better to do that once you have at least some in-tree users for the new
need :-)
-Daniel

> 
> Christian.
> 
> > -Daniel
> > 
> > 
> > > Regards,
> > > Christian.
> > > 
> > > Am 07.04.23 um 02:20 schrieb Zeng, Oak:
> > > > So this series basically go with option 2. The part that option2 makes me uncomfortable is, dma-fence doesn't work for long running workload, why we generate it in the first place? As long as dma-fence is generated, it will become a source of confusion in the future. It doesn't matter how much you annotate it/document it. So if we decide to go with option2, the bottom line is, don't generate dma-fence for long running workload during job submission. This requires some rework in drm scheduler.
> > > > 
> > > > The cleanest solution to me is option3. Dma-fence is a very old technology. When it was created, no gpu support page fault. Obviously this is not a good technology for modern gpu with page fault support. I think the best way is to create a new scheduler and dependency tracking mechanism works for both page fault enabled and page fault disabled context. I think this matches what Christian said below. Maybe nobody think this is easy?
> > > > 
> > > > Thanks,
> > > > Oak
> > > > 
> > > > > -----Original Message-----
> > > > > From: Brost, Matthew <matthew.brost@intel.com>
> > > > > Sent: April 5, 2023 2:53 PM
> > > > > To: Zeng, Oak <oak.zeng@intel.com>
> > > > > Cc: Christian König <christian.koenig@amd.com>; Vetter, Daniel
> > > > > <daniel.vetter@intel.com>; Thomas Hellström
> > > > > <thomas.hellstrom@linux.intel.com>; dri-devel@lists.freedesktop.org; intel-
> > > > > xe@lists.freedesktop.org; robdclark@chromium.org; airlied@linux.ie;
> > > > > lina@asahilina.net; boris.brezillon@collabora.com; faith.ekstrand@collabora.com
> > > > > Subject: Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
> > > > > plans
> > > > > 
> > > > > On Wed, Apr 05, 2023 at 12:06:53PM -0600, Zeng, Oak wrote:
> > > > > > Hi,
> > > > > > 
> > > > > > Using dma-fence for completion/dependency tracking for long-run
> > > > > workload(more precisely on-demand paging/page fault enabled workload) can
> > > > > cause deadlock. This seems the significant issue here. Other issues such as the
> > > > > drm scheduler completion order implication etc are minors which can be solve
> > > > > inside the framework of drm scheduler. We need to evaluate below paths:
> > > > > > 	1) still use drm scheduler for job submission, and use dma-fence for job
> > > > > completion waiting/dependency tracking. This is solution proposed in this series.
> > > > > Annotate dma-fence for long-run workload: user can still wait dma-fence for job
> > > > > completion but can't wait dma-fence while holding any memory management
> > > > > locks.  We still use dma-fence for dependency tracking. But it is just very easily
> > > > > run into deadlock when on-demand paging is in the picture. The annotation helps
> > > > > us to detect deadlock but not solve deadlock problems. Seems *not* a complete
> > > > > solution: It is almost impossible to completely avoid dependency deadlock in
> > > > > complex runtime environment
> > > > > No one can wait on LR fence, so it is impossible to deadlock. The
> > > > > annotations enforce this. Literally this is only for flow controling the
> > > > > ring / hold pending jobs in in the DRM schedule list.
> > > > > 
> > > > > > 	2) Still use drm scheduler but not use dma-fence for completion signaling
> > > > > and dependency tracking. This way we still get some free functions (reset, err
> > > > > handling ring flow control as Matt said)from drm scheduler, but push the
> > > > > dependency/completion tracking completely to user space using techniques such
> > > > > as user space fence. User space doesn't have chance to wait fence while holding
> > > > > a kernel memory management lock, thus the dma-fence deadlock issue is solved.
> > > > > We use user space fence for syncs.
> > > > > 
> > > > > > 	3) Completely discard drm scheduler and dma-fence for long-run
> > > > > workload. Use user queue/doorbell for super fast submission, directly interact
> > > > > with fw scheduler. Use user fence for completion/dependency tracking.
> > > > > This is a hard no from me, I want 1 submission path in Xe. Either we use
> > > > > the DRM scheduler or we don't.
> > > > > 
> > > > > Matt
> > > > > 
> > > > > > Thanks,
> > > > > > Oak
> > > > > > 
> > > > > > > -----Original Message-----
> > > > > > > From: Christian König <christian.koenig@amd.com>
> > > > > > > Sent: April 5, 2023 3:30 AM
> > > > > > > To: Brost, Matthew <matthew.brost@intel.com>; Zeng, Oak
> > > > > > > <oak.zeng@intel.com>
> > > > > > > Cc: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org;
> > > > > > > robdclark@chromium.org; thomas.hellstrom@linux.intel.com;
> > > > > airlied@linux.ie;
> > > > > > > lina@asahilina.net; boris.brezillon@collabora.com;
> > > > > faith.ekstrand@collabora.com
> > > > > > > Subject: Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
> > > > > > > plans
> > > > > > > 
> > > > > > > Am 04.04.23 um 20:08 schrieb Matthew Brost:
> > > > > > > > On Tue, Apr 04, 2023 at 12:02:03PM -0600, Zeng, Oak wrote:
> > > > > > > > > Hi Matt, Thomas,
> > > > > > > > > 
> > > > > > > > > Some very bold out of box thinking in this area:
> > > > > > > > > 
> > > > > > > > > 1. so you want to use drm scheduler and dma-fence for long running
> > > > > workload.
> > > > > > > Why you want to do this in the first place? What is the benefit? Drm scheduler
> > > > > is
> > > > > > > pretty much a software scheduler. Modern gpu has scheduler built at fw/hw
> > > > > > > level, as you said below for intel this is Guc. Can xe driver just directly submit
> > > > > job
> > > > > > > to Guc, bypassing drm scheduler?
> > > > > > > > If we did that now we have 2 paths for dependency track, flow controling
> > > > > > > > the ring, resets / error handling / backend submission implementations.
> > > > > > > > We don't want this.
> > > > > > > Well exactly that's the point: Why?
> > > > > > > 
> > > > > > > As far as I can see that are two completely distinct use cases, so you
> > > > > > > absolutely do want two completely distinct implementations for this.
> > > > > > > 
> > > > > > > > > 2. using dma-fence for long run workload: I am well aware that page fault
> > > > > (and
> > > > > > > the consequent memory allocation/lock acquiring to fix the fault) can cause
> > > > > > > deadlock for a dma-fence wait. But I am not convinced that dma-fence can't
> > > > > be
> > > > > > > used purely because the nature of the workload that it runs very long
> > > > > (indefinite).
> > > > > > > I did a math: the dma_fence_wait_timeout function's third param is the
> > > > > timeout
> > > > > > > which is a signed long type. If HZ is 1000, this is about 23 days. If 23 days is not
> > > > > long
> > > > > > > enough, can we just change the timeout parameter to signed 64 bits so it is
> > > > > much
> > > > > > > longer than our life time...
> > > > > > > > > So I mainly argue we can't use dma-fence for long-run workload is not
> > > > > > > because the workload runs very long, rather because of the fact that we use
> > > > > > > page fault for long-run workload. If we enable page fault for short-run
> > > > > workload,
> > > > > > > we can't use dma-fence either. Page fault is the key thing here.
> > > > > > > > > Now since we use page fault which is *fundamentally* controversial with
> > > > > > > dma-fence design, why now just introduce a independent concept such as
> > > > > user-
> > > > > > > fence instead of extending existing dma-fence?
> > > > > > > > > I like unified design. If drm scheduler, dma-fence can be extended to work
> > > > > for
> > > > > > > everything, it is beautiful. But seems we have some fundamental problem
> > > > > here.
> > > > > > > > Thomas's patches turn a dma-fence into KMD sync point (e.g. we just use
> > > > > > > > the signal / CB infrastructure) and enforce we don't use use these
> > > > > > > > dma-fences from the scheduler in memory reclaim paths or export these to
> > > > > > > > user space or other drivers. Think of this mode as SW only fence.
> > > > > > > Yeah and I truly think this is an really bad idea.
> > > > > > > 
> > > > > > > The signal/CB infrastructure in the dma_fence turned out to be the
> > > > > > > absolutely nightmare I initially predicted. Sorry to say that, but in
> > > > > > > this case the "I've told you so" is appropriate in my opinion.
> > > > > > > 
> > > > > > > If we need infrastructure for long running dependency tracking we should
> > > > > > > encapsulate that in a new framework and not try to mangle the existing
> > > > > > > code for something it was never intended for.
> > > > > > > 
> > > > > > > Christian.
> > > > > > > 
> > > > > > > > Matt
> > > > > > > > 
> > > > > > > > > Thanks,
> > > > > > > > > Oak
> > > > > > > > > 
> > > > > > > > > > -----Original Message-----
> > > > > > > > > > From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of
> > > > > > > > > > Matthew Brost
> > > > > > > > > > Sent: April 3, 2023 8:22 PM
> > > > > > > > > > To: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org
> > > > > > > > > > Cc: robdclark@chromium.org; thomas.hellstrom@linux.intel.com;
> > > > > > > airlied@linux.ie;
> > > > > > > > > > lina@asahilina.net; boris.brezillon@collabora.com; Brost, Matthew
> > > > > > > > > > <matthew.brost@intel.com>; christian.koenig@amd.com;
> > > > > > > > > > faith.ekstrand@collabora.com
> > > > > > > > > > Subject: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
> > > > > > > plans
> > > > > > > > > > Hello,
> > > > > > > > > > 
> > > > > > > > > > As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> > > > > > > > > > have been asked to merge our common DRM scheduler patches first as
> > > > > well
> > > > > > > > > > as develop a common solution for long running workloads with the DRM
> > > > > > > > > > scheduler. This RFC series is our first attempt at doing this. We
> > > > > > > > > > welcome any and all feedback.
> > > > > > > > > > 
> > > > > > > > > > This can we thought of as 4 parts detailed below.
> > > > > > > > > > 
> > > > > > > > > > - DRM scheduler changes for 1 to 1 relationship between scheduler and
> > > > > > > > > > entity (patches 1-3)
> > > > > > > > > > 
> > > > > > > > > > In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> > > > > > > > > > GuC) which is a new paradigm WRT to the DRM scheduler and presents
> > > > > > > > > > severals problems as the DRM was originally designed to schedule jobs
> > > > > on
> > > > > > > > > > hardware queues. The main problem being that DRM scheduler expects
> > > > > the
> > > > > > > > > > submission order of jobs to be the completion order of jobs even across
> > > > > > > > > > multiple entities. This assumption falls apart with a firmware scheduler
> > > > > > > > > > as a firmware scheduler has no concept of jobs and jobs can complete
> > > > > out
> > > > > > > > > > of order. A novel solution for was originally thought of by Faith during
> > > > > > > > > > the initial prototype of Xe, create a 1 to 1 relationship between scheduler
> > > > > > > > > > and entity. I believe the AGX driver [3] is using this approach and
> > > > > > > > > > Boris may use approach as well for the Mali driver [4].
> > > > > > > > > > 
> > > > > > > > > > To support a 1 to 1 relationship we move the main execution function
> > > > > > > > > > from a kthread to a work queue and add a new scheduling mode which
> > > > > > > > > > bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
> > > > > > > > > > The new scheduling mode should unify all drivers usage with a 1 to 1
> > > > > > > > > > relationship and can be thought of as using scheduler as a dependency /
> > > > > > > > > > infligt job tracker rather than a true scheduler.
> > > > > > > > > > 
> > > > > > > > > > - Generic messaging interface for DRM scheduler
> > > > > > > > > > 
> > > > > > > > > > Idea is to be able to communicate to the submission backend with in
> > > > > band
> > > > > > > > > > (relative to main execution function) messages. Messages are backend
> > > > > > > > > > defined and flexable enough for any use case. In Xe we use these
> > > > > > > > > > messages to clean up entites, set properties for entites, and suspend /
> > > > > > > > > > resume execution of an entity [5]. I suspect other driver can leverage
> > > > > > > > > > this messaging concept too as it a convenient way to avoid races in the
> > > > > > > > > > backend.
> > > > > > > > > > 
> > > > > > > > > > - Support for using TDR for all error paths of a scheduler / entity
> > > > > > > > > > 
> > > > > > > > > > Fix a few races / bugs, add function to dynamically set the TDR timeout.
> > > > > > > > > > 
> > > > > > > > > > - Annotate dma-fences for long running workloads.
> > > > > > > > > > 
> > > > > > > > > > The idea here is to use dma-fences only as sync points within the
> > > > > > > > > > scheduler and never export them for long running workloads. By
> > > > > > > > > > annotating these fences as long running we ensure that these dma-
> > > > > fences
> > > > > > > > > > are never used in a way that breaks the dma-fence rules. A benefit of
> > > > > > > > > > thus approach is the scheduler can still safely flow control the
> > > > > > > > > > execution ring buffer via the job limit without breaking the dma-fence
> > > > > > > > > > rules.
> > > > > > > > > > 
> > > > > > > > > > Again this a first draft and looking forward to feedback.
> > > > > > > > > > 
> > > > > > > > > > Enjoy - Matt
> > > > > > > > > > 
> > > > > > > > > > [1] https://gitlab.freedesktop.org/drm/xe/kernel
> > > > > > > > > > [2] https://patchwork.freedesktop.org/series/112188/
> > > > > > > > > > [3] https://patchwork.freedesktop.org/series/114772/
> > > > > > > > > > [4]
> > > > > https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1
> > > > > > > > > > [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-
> > > > > > > > > > next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031
> > > > > > > > > > 
> > > > > > > > > > Matthew Brost (8):
> > > > > > > > > >      drm/sched: Convert drm scheduler to use a work queue rather than
> > > > > > > > > >        kthread
> > > > > > > > > >      drm/sched: Move schedule policy to scheduler / entity
> > > > > > > > > >      drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling
> > > > > policy
> > > > > > > > > >      drm/sched: Add generic scheduler message interface
> > > > > > > > > >      drm/sched: Start run wq before TDR in drm_sched_start
> > > > > > > > > >      drm/sched: Submit job before starting TDR
> > > > > > > > > >      drm/sched: Add helper to set TDR timeout
> > > > > > > > > >      drm/syncobj: Warn on long running dma-fences
> > > > > > > > > > 
> > > > > > > > > > Thomas Hellström (2):
> > > > > > > > > >      dma-buf/dma-fence: Introduce long-running completion fences
> > > > > > > > > >      drm/sched: Support long-running sched entities
> > > > > > > > > > 
> > > > > > > > > >     drivers/dma-buf/dma-fence.c                 | 142 +++++++---
> > > > > > > > > >     drivers/dma-buf/dma-resv.c                  |   5 +
> > > > > > > > > >     drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
> > > > > > > > > >     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
> > > > > > > > > >     drivers/gpu/drm/drm_syncobj.c               |   5 +-
> > > > > > > > > >     drivers/gpu/drm/etnaviv/etnaviv_sched.c     |   5 +-
> > > > > > > > > >     drivers/gpu/drm/lima/lima_sched.c           |   5 +-
> > > > > > > > > >     drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
> > > > > > > > > >     drivers/gpu/drm/msm/msm_ringbuffer.c        |   5 +-
> > > > > > > > > >     drivers/gpu/drm/panfrost/panfrost_job.c     |   5 +-
> > > > > > > > > >     drivers/gpu/drm/scheduler/sched_entity.c    | 127 +++++++--
> > > > > > > > > >     drivers/gpu/drm/scheduler/sched_fence.c     |   6 +-
> > > > > > > > > >     drivers/gpu/drm/scheduler/sched_main.c      | 278 +++++++++++++++--
> > > > > ---
> > > > > > > > > >     drivers/gpu/drm/v3d/v3d_sched.c             |  25 +-
> > > > > > > > > >     include/drm/gpu_scheduler.h                 | 130 +++++++--
> > > > > > > > > >     include/linux/dma-fence.h                   |  60 ++++-
> > > > > > > > > >     16 files changed, 649 insertions(+), 184 deletions(-)
> > > > > > > > > > 
> > > > > > > > > > --
> > > > > > > > > > 2.34.1
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans
  2023-04-04  0:22 [Intel-xe] [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans Matthew Brost
                   ` (15 preceding siblings ...)
  2023-04-04 18:02 ` Zeng, Oak
@ 2023-04-18 15:10 ` Liviu Dudau
  16 siblings, 0 replies; 87+ messages in thread
From: Liviu Dudau @ 2023-04-18 15:10 UTC (permalink / raw)
  To: Matthew Brost
  Cc: robdclark, airlied, lina, dri-devel, christian.koenig,
	ketil.johnsen, john.reitan, boris.brezillon, intel-xe,
	faith.ekstrand

On Mon, Apr 03, 2023 at 05:22:01PM -0700, Matthew Brost wrote:
> Hello,

Hello,

Jumping in a bit late on this thread as I was waiting on some approvals and
then holidays kicked in, but I would like to (re)introduce myself and the
people I work with, and to let you know that we are interested in the changes
proposed here and would like to help where we can.

I currently maintain a number of Arm Mali Display drivers, but I have
recently moved to the Mali GPU team and we now have approval to start making
contributions to the upstream driver(s). We're planning to collaborate on
Boris' new Mali driver and make it work well on Mali GPUs. One of the first
things to look at (besides bringing the driver up on internal dev platforms)
is the scheduler changes proposed here.

As such, I would like to ask that people start including John Reitan,
Ketil Johnsen and me on patches. As soon as we have something working that
we can comment on, we will do so.

Best regards,
Liviu


> 
> As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> have been asked to merge our common DRM scheduler patches first as well
> as develop a common solution for long running workloads with the DRM
> scheduler. This RFC series is our first attempt at doing this. We
> welcome any and all feedback.
> 
> This can we thought of as 4 parts detailed below.
> 
> - DRM scheduler changes for 1 to 1 relationship between scheduler and
> entity (patches 1-3)
> 
> In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> GuC) which is a new paradigm WRT to the DRM scheduler and presents
> severals problems as the DRM was originally designed to schedule jobs on
> hardware queues. The main problem being that DRM scheduler expects the
> submission order of jobs to be the completion order of jobs even across
> multiple entities. This assumption falls apart with a firmware scheduler
> as a firmware scheduler has no concept of jobs and jobs can complete out
> of order. A novel solution for was originally thought of by Faith during
> the initial prototype of Xe, create a 1 to 1 relationship between scheduler
> and entity. I believe the AGX driver [3] is using this approach and
> Boris may use approach as well for the Mali driver [4].
> 
> To support a 1 to 1 relationship we move the main execution function
> from a kthread to a work queue and add a new scheduling mode which
> bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
> The new scheduling mode should unify all drivers usage with a 1 to 1
> relationship and can be thought of as using scheduler as a dependency /
> infligt job tracker rather than a true scheduler.
> 
> - Generic messaging interface for DRM scheduler
> 
> Idea is to be able to communicate to the submission backend with in band
> (relative to main execution function) messages. Messages are backend
> defined and flexable enough for any use case. In Xe we use these
> messages to clean up entites, set properties for entites, and suspend /
> resume execution of an entity [5]. I suspect other driver can leverage
> this messaging concept too as it a convenient way to avoid races in the
> backend.
> 
> - Support for using TDR for all error paths of a scheduler / entity
> 
> Fix a few races / bugs, add function to dynamically set the TDR timeout.
> 
> - Annotate dma-fences for long running workloads.
> 
> The idea here is to use dma-fences only as sync points within the
> scheduler and never export them for long running workloads. By
> annotating these fences as long running we ensure that these dma-fences
> are never used in a way that breaks the dma-fence rules. A benefit of
> thus approach is the scheduler can still safely flow control the
> execution ring buffer via the job limit without breaking the dma-fence
> rules.
> 
> Again this a first draft and looking forward to feedback.
> 
> Enjoy - Matt
> 
> [1] https://gitlab.freedesktop.org/drm/xe/kernel
> [2] https://patchwork.freedesktop.org/series/112188/ 
> [3] https://patchwork.freedesktop.org/series/114772/
> [4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1
> [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031
> 
> Matthew Brost (8):
>   drm/sched: Convert drm scheduler to use a work queue rather than
>     kthread
>   drm/sched: Move schedule policy to scheduler / entity
>   drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
>   drm/sched: Add generic scheduler message interface
>   drm/sched: Start run wq before TDR in drm_sched_start
>   drm/sched: Submit job before starting TDR
>   drm/sched: Add helper to set TDR timeout
>   drm/syncobj: Warn on long running dma-fences
> 
> Thomas Hellström (2):
>   dma-buf/dma-fence: Introduce long-running completion fences
>   drm/sched: Support long-running sched entities
> 
>  drivers/dma-buf/dma-fence.c                 | 142 +++++++---
>  drivers/dma-buf/dma-resv.c                  |   5 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
>  drivers/gpu/drm/drm_syncobj.c               |   5 +-
>  drivers/gpu/drm/etnaviv/etnaviv_sched.c     |   5 +-
>  drivers/gpu/drm/lima/lima_sched.c           |   5 +-
>  drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
>  drivers/gpu/drm/msm/msm_ringbuffer.c        |   5 +-
>  drivers/gpu/drm/panfrost/panfrost_job.c     |   5 +-
>  drivers/gpu/drm/scheduler/sched_entity.c    | 127 +++++++--
>  drivers/gpu/drm/scheduler/sched_fence.c     |   6 +-
>  drivers/gpu/drm/scheduler/sched_main.c      | 278 +++++++++++++++-----
>  drivers/gpu/drm/v3d/v3d_sched.c             |  25 +-
>  include/drm/gpu_scheduler.h                 | 130 +++++++--
>  include/linux/dma-fence.h                   |  60 ++++-
>  16 files changed, 649 insertions(+), 184 deletions(-)
> 
> -- 
> 2.34.1
> 

-- 
====================
| I would like to |
| fix the world,  |
| but they're not |
| giving me the   |
 \ source code!  /
  ---------------
    ¯\_(ツ)_/¯

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 06/10] drm/sched: Submit job before starting TDR
  2023-04-04  0:22 ` [Intel-xe] [RFC PATCH 06/10] drm/sched: Submit job before starting TDR Matthew Brost
@ 2023-05-04  5:23   ` Luben Tuikov
  2023-07-31  1:00     ` Matthew Brost
  0 siblings, 1 reply; 87+ messages in thread
From: Luben Tuikov @ 2023-05-04  5:23 UTC (permalink / raw)
  To: Matthew Brost, dri-devel, intel-xe
  Cc: robdclark, airlied, lina, boris.brezillon, christian.koenig,
	faith.ekstrand

On 2023-04-03 20:22, Matthew Brost wrote:
> If the TDR is set to a value, it can fire before a job is submitted in
> drm_sched_main. The job should be always be submitted before the TDR
> fires, fix this ordering.
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 6ae710017024..4eac02d212c1 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -1150,10 +1150,10 @@ static void drm_sched_main(struct work_struct *w)
>  		s_fence = sched_job->s_fence;
>  
>  		atomic_inc(&sched->hw_rq_count);
> -		drm_sched_job_begin(sched_job);
>  
>  		trace_drm_run_job(sched_job, entity);
>  		fence = sched->ops->run_job(sched_job);
> +		drm_sched_job_begin(sched_job);
>  		complete_all(&entity->entity_idle);
>  		drm_sched_fence_scheduled(s_fence);
>  

Not sure if this is correct. In drm_sched_job_begin() we add the job to the "pending_list"
(meaning it is pending execution in the hardware) and we also start a timeout timer. Both
of those should happen before the job is given to the hardware.

If the timeout is set to too small a value, then that should probably be fixed instead.

Regards,
Luben

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 07/10] drm/sched: Add helper to set TDR timeout
  2023-04-04  0:22 ` [Intel-xe] [RFC PATCH 07/10] drm/sched: Add helper to set TDR timeout Matthew Brost
@ 2023-05-04  5:28   ` Luben Tuikov
  2023-07-31  1:09     ` Matthew Brost
  0 siblings, 1 reply; 87+ messages in thread
From: Luben Tuikov @ 2023-05-04  5:28 UTC (permalink / raw)
  To: Matthew Brost, dri-devel, intel-xe
  Cc: robdclark, airlied, lina, boris.brezillon, christian.koenig,
	faith.ekstrand

On 2023-04-03 20:22, Matthew Brost wrote:
> Add helper to set TDR timeout and restart the TDR with new timeout
> value. This will be used in XE, new Intel GPU driver, to trigger the TDR
> to cleanup drm_sched_entity that encounter errors.
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 18 ++++++++++++++++++
>  include/drm/gpu_scheduler.h            |  1 +
>  2 files changed, 19 insertions(+)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 4eac02d212c1..d61880315d8d 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -370,6 +370,24 @@ static void drm_sched_start_timeout(struct drm_gpu_scheduler *sched)
>  		queue_delayed_work(sched->timeout_wq, &sched->work_tdr, sched->timeout);
>  }
>  
> +/**
> + * drm_sched_set_timeout - set timeout for reset worker
> + *
> + * @sched: scheduler instance to set and (re)-start the worker for
> + * @timeout: timeout period
> + *
> + * Set and (re)-start the timeout for the given scheduler.
> + */
> +void drm_sched_set_timeout(struct drm_gpu_scheduler *sched, long timeout)
> +{
> +	spin_lock(&sched->job_list_lock);
> +	sched->timeout = timeout;
> +	cancel_delayed_work(&sched->work_tdr);

I see that the comment says "(re-)start"(sic). Is the rest of the logic
stable in that we don't need to use the _sync() version, and/or at least
inspect the return value of the one currently used?

Regards,
Luben

> +	drm_sched_start_timeout(sched);
> +	spin_unlock(&sched->job_list_lock);
> +}
> +EXPORT_SYMBOL(drm_sched_set_timeout);
> +
>  /**
>   * drm_sched_fault - immediately start timeout handler
>   *
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index 18172ae63ab7..6258e324bd7c 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -593,6 +593,7 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
>  				    struct drm_gpu_scheduler **sched_list,
>                                     unsigned int num_sched_list);
>  
> +void drm_sched_set_timeout(struct drm_gpu_scheduler *sched, long timeout);
>  void drm_sched_job_cleanup(struct drm_sched_job *job);
>  void drm_sched_wakeup(struct drm_gpu_scheduler *sched);
>  void drm_sched_add_msg(struct drm_gpu_scheduler *sched,


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 04/10] drm/sched: Add generic scheduler message interface
  2023-04-04  0:22 ` [Intel-xe] [RFC PATCH 04/10] drm/sched: Add generic scheduler message interface Matthew Brost
@ 2023-05-04  5:28   ` Luben Tuikov
  2023-07-31  2:42     ` Matthew Brost
  0 siblings, 1 reply; 87+ messages in thread
From: Luben Tuikov @ 2023-05-04  5:28 UTC (permalink / raw)
  To: Matthew Brost, dri-devel, intel-xe
  Cc: robdclark, airlied, lina, boris.brezillon, christian.koenig,
	faith.ekstrand

On 2023-04-03 20:22, Matthew Brost wrote:
> Add generic schedule message interface which sends messages to backend
> from the drm_gpu_scheduler main submission thread. The idea is some of
> these messages modify some state in drm_sched_entity which is also
> modified during submission. By scheduling these messages and submission
> in the same thread their is not race changing states in
> drm_sched_entity.

"... there is no race when changing ..." or better yet,
"... we eliminate races due to drm_sched_entity state changes."

> 
> This interface will be used in XE, new Intel GPU driver, to cleanup,

"Xe"?

Regards,
Luben

> suspend, resume, and change scheduling properties of a drm_sched_entity.
> 
> The interface is designed to be generic and extendable with only the
> backend understanding the messages.
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 58 +++++++++++++++++++++++++-
>  include/drm/gpu_scheduler.h            | 29 ++++++++++++-
>  2 files changed, 84 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 2795021efe7b..9dc3378e9c5e 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -1055,6 +1055,54 @@ drm_sched_pick_best(struct drm_gpu_scheduler **sched_list,
>  }
>  EXPORT_SYMBOL(drm_sched_pick_best);
>  
> +/**
> + * drm_sched_add_msg - add scheduler message
> + *
> + * @sched: scheduler instance
> + * @msg: message to be added
> + *
> + * Can and will pass an jobs waiting on dependencies or in a runnable queue.
> + * Messages processing will stop if schedule run wq is stopped and resume when
> + * run wq is started.
> + */
> +void drm_sched_add_msg(struct drm_gpu_scheduler *sched,
> +		       struct drm_sched_msg *msg)
> +{
> +	spin_lock(&sched->job_list_lock);
> +	list_add_tail(&msg->link, &sched->msgs);
> +	spin_unlock(&sched->job_list_lock);
> +
> +	/*
> +	 * Same as above in drm_sched_run_wq_queue, try to kick worker if
> +	 * paused, harmless if this races
> +	 */
> +	if (!sched->pause_run_wq)
> +		queue_work(sched->run_wq, &sched->work_run);
> +}
> +EXPORT_SYMBOL(drm_sched_add_msg);
> +
> +/**
> + * drm_sched_get_msg - get scheduler message
> + *
> + * @sched: scheduler instance
> + *
> + * Returns NULL or message
> + */
> +static struct drm_sched_msg *
> +drm_sched_get_msg(struct drm_gpu_scheduler *sched)
> +{
> +	struct drm_sched_msg *msg;
> +
> +	spin_lock(&sched->job_list_lock);
> +	msg = list_first_entry_or_null(&sched->msgs,
> +				       struct drm_sched_msg, link);
> +	if (msg)
> +		list_del(&msg->link);
> +	spin_unlock(&sched->job_list_lock);
> +
> +	return msg;
> +}
> +
>  /**
>   * drm_sched_main - main scheduler thread
>   *
> @@ -1068,6 +1116,7 @@ static void drm_sched_main(struct work_struct *w)
>  
>  	while (!READ_ONCE(sched->pause_run_wq)) {
>  		struct drm_sched_entity *entity;
> +		struct drm_sched_msg *msg;
>  		struct drm_sched_fence *s_fence;
>  		struct drm_sched_job *sched_job;
>  		struct dma_fence *fence;
> @@ -1075,12 +1124,16 @@ static void drm_sched_main(struct work_struct *w)
>  
>  		cleanup_job = drm_sched_get_cleanup_job(sched);
>  		entity = drm_sched_select_entity(sched);
> +		msg = drm_sched_get_msg(sched);
>  
>  		if (cleanup_job)
>  			sched->ops->free_job(cleanup_job);
>  
> +		if (msg)
> +			sched->ops->process_msg(msg);
> +
>  		if (!entity) {
> -			if (!cleanup_job)
> +			if (!cleanup_job && !msg)
>  				break;
>  			continue;
>  		}
> @@ -1089,7 +1142,7 @@ static void drm_sched_main(struct work_struct *w)
>  
>  		if (!sched_job) {
>  			complete_all(&entity->entity_idle);
> -			if (!cleanup_job)
> +			if (!cleanup_job && !msg)
>  				break;
>  			continue;
>  		}
> @@ -1181,6 +1234,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
>  
>  	init_waitqueue_head(&sched->job_scheduled);
>  	INIT_LIST_HEAD(&sched->pending_list);
> +	INIT_LIST_HEAD(&sched->msgs);
>  	spin_lock_init(&sched->job_list_lock);
>  	atomic_set(&sched->hw_rq_count, 0);
>  	INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout);
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index 3e421f5a710c..18172ae63ab7 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -398,6 +398,23 @@ enum drm_gpu_sched_stat {
>  	DRM_GPU_SCHED_STAT_ENODEV,
>  };
>  
> +/**
> + * struct drm_sched_msg - an in-band (relative to GPU scheduler run queue)
> + * message
> + *
> + * Generic enough for backend defined messages, backend can expand if needed.
> + */
> +struct drm_sched_msg {
> +	/** @link: list link into the gpu scheduler list of messages */
> +	struct list_head		link;
> +	/**
> +	 * @private_data: opaque pointer to message private data (backend defined)
> +	 */
> +	void				*private_data;
> +	/** @opcode: opcode of message (backend defined) */
> +	unsigned int			opcode;
> +};
> +
>  /**
>   * struct drm_sched_backend_ops - Define the backend operations
>   *	called by the scheduler
> @@ -475,6 +492,12 @@ struct drm_sched_backend_ops {
>           * and it's time to clean it up.
>  	 */
>  	void (*free_job)(struct drm_sched_job *sched_job);
> +
> +	/**
> +	 * @process_msg: Process a message. Allowed to block, it is this
> +	 * function's responsibility to free message if dynamically allocated.
> +	 */
> +	void (*process_msg)(struct drm_sched_msg *msg);
>  };
>  
>  /**
> @@ -486,6 +509,7 @@ struct drm_sched_backend_ops {
>   * @timeout: the time after which a job is removed from the scheduler.
>   * @name: name of the ring for which this scheduler is being used.
>   * @sched_rq: priority wise array of run queues.
> + * @msgs: list of messages to be processed in @work_run
>   * @job_scheduled: once @drm_sched_entity_do_release is called the scheduler
>   *                 waits on this wait queue until all the scheduled jobs are
>   *                 finished.
> @@ -493,7 +517,7 @@ struct drm_sched_backend_ops {
>   * @job_id_count: used to assign unique id to the each job.
>   * @run_wq: workqueue used to queue @work_run
>   * @timeout_wq: workqueue used to queue @work_tdr
> - * @work_run: schedules jobs and cleans up entities
> + * @work_run: schedules jobs, cleans up jobs, and processes messages
>   * @work_tdr: schedules a delayed call to @drm_sched_job_timedout after the
>   *            timeout interval is over.
>   * @pending_list: the list of jobs which are currently in the job queue.
> @@ -517,6 +541,7 @@ struct drm_gpu_scheduler {
>  	long				timeout;
>  	const char			*name;
>  	struct drm_sched_rq		sched_rq[DRM_SCHED_PRIORITY_COUNT];
> +	struct list_head		msgs;
>  	wait_queue_head_t		job_scheduled;
>  	atomic_t			hw_rq_count;
>  	atomic64_t			job_id_count;
> @@ -570,6 +595,8 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
>  
>  void drm_sched_job_cleanup(struct drm_sched_job *job);
>  void drm_sched_wakeup(struct drm_gpu_scheduler *sched);
> +void drm_sched_add_msg(struct drm_gpu_scheduler *sched,
> +		       struct drm_sched_msg *msg);
>  void drm_sched_run_wq_stop(struct drm_gpu_scheduler *sched);
>  void drm_sched_run_wq_start(struct drm_gpu_scheduler *sched);
>  void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad);


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 01/10] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-04-04  0:22 ` [Intel-xe] [RFC PATCH 01/10] drm/sched: Convert drm scheduler to use a work queue rather than kthread Matthew Brost
@ 2023-06-09  6:58   ` Boris Brezillon
  2023-07-31  0:56     ` Matthew Brost
  0 siblings, 1 reply; 87+ messages in thread
From: Boris Brezillon @ 2023-06-09  6:58 UTC (permalink / raw)
  To: Matthew Brost
  Cc: robdclark, Sarah Walker, airlied, lina, Frank Binns, dri-devel,
	christian.koenig, Donald Robson, daniel, intel-xe,
	faith.ekstrand

Hi Matthew,

On Mon,  3 Apr 2023 17:22:02 -0700
Matthew Brost <matthew.brost@intel.com> wrote:

> -static int drm_sched_main(void *param)
> +static void drm_sched_main(struct work_struct *w)
>  {
> -	struct drm_gpu_scheduler *sched = (struct drm_gpu_scheduler *)param;
> +	struct drm_gpu_scheduler *sched =
> +		container_of(w, struct drm_gpu_scheduler, work_run);
>  	int r;
>  
> -	sched_set_fifo_low(current);
> -
> -	while (!kthread_should_stop()) {
> -		struct drm_sched_entity *entity = NULL;
> +	while (!READ_ONCE(sched->pause_run_wq)) {

During an informal discussion on IRC I mentioned that this loop might
become problematic if all the 1:1 entities share the same wq
(especially if it's an ordered wq), and one of them is getting passed a
lot of requests. Just wanted to tell you that we've hit that case in
PowerVR:

Geometry and fragment queues get passed X requests respectively, each
pair of requests corresponding to a rendering operation. Because we're
using an ordered wq (which I know we shouldn't do, and I intend to
fix that, but I think it shows the problem exists by making it more
visible), all geometry requests get submitted first, then come the
fragment requests. It turns out the submission time is non-negligible
compared to the geometry job execution time, and geometry jobs end up
generating data for the fragment jobs that is not consumed fast enough
by the fragment jobs to allow the following geom jobs to re-use the same
portion of memory, leading to on-demand allocation of extra memory
chunks which wouldn't happen if submissions were interleaved.

I know you were not fundamentally opposed to killing this loop and doing
one iteration at a time (you even provided a patch doing that), just
wanted to share my findings to prove this is not just a theoretical
issue, and the lack of fairness in the submission path can cause trouble
in practice.
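
To make the failure mode concrete, our setup is roughly the snippet below
(made-up names; the scheduler normally queues work_run itself, queueing it
by hand here is just to show the ordering the ordered wq imposes):

	struct drm_gpu_scheduler geom_sched, frag_sched; /* two 1:1 schedulers */
	struct workqueue_struct *submit_wq =
		alloc_ordered_workqueue("pvr-submit", 0);

	/*
	 * Both schedulers run drm_sched_main() on the same ordered wq. With
	 * the while() loop above, the first work item to run drains every
	 * job queued to its scheduler before the second work item is even
	 * started, so geometry and fragment submissions never interleave.
	 */
	queue_work(submit_wq, &geom_sched.work_run);
	queue_work(submit_wq, &frag_sched.work_run);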

Best Regards,

Boris

> +		struct drm_sched_entity *entity;
>  		struct drm_sched_fence *s_fence;
>  		struct drm_sched_job *sched_job;
>  		struct dma_fence *fence;
> -		struct drm_sched_job *cleanup_job = NULL;
> +		struct drm_sched_job *cleanup_job;
>  
> -		wait_event_interruptible(sched->wake_up_worker,
> -					 (cleanup_job = drm_sched_get_cleanup_job(sched)) ||
> -					 (!drm_sched_blocked(sched) &&
> -					  (entity = drm_sched_select_entity(sched))) ||
> -					 kthread_should_stop());
> +		cleanup_job = drm_sched_get_cleanup_job(sched);
> +		entity = drm_sched_select_entity(sched);
>  
>  		if (cleanup_job)
>  			sched->ops->free_job(cleanup_job);
>  
> -		if (!entity)
> +		if (!entity) {
> +			if (!cleanup_job)
> +				break;
>  			continue;
> +		}
>  
>  		sched_job = drm_sched_entity_pop_job(entity);
>  
>  		if (!sched_job) {
>  			complete_all(&entity->entity_idle);
> +			if (!cleanup_job)
> +				break;
>  			continue;
>  		}
>  
> @@ -1055,14 +1083,14 @@ static int drm_sched_main(void *param)
>  					  r);
>  		} else {
>  			if (IS_ERR(fence))
> -				dma_fence_set_error(&s_fence->finished, PTR_ERR(fence));
> +				dma_fence_set_error(&s_fence->finished,
> +						    PTR_ERR(fence));
>  
>  			drm_sched_job_done(sched_job);
>  		}
>  
>  		wake_up(&sched->job_scheduled);
>  	}
> -	return 0;
>  }

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 01/10] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-06-09  6:58   ` Boris Brezillon
@ 2023-07-31  0:56     ` Matthew Brost
  0 siblings, 0 replies; 87+ messages in thread
From: Matthew Brost @ 2023-07-31  0:56 UTC (permalink / raw)
  To: Boris Brezillon
  Cc: robdclark, Sarah Walker, airlied, lina, Frank Binns, dri-devel,
	christian.koenig, Donald Robson, daniel, intel-xe,
	faith.ekstrand

On Fri, Jun 09, 2023 at 08:58:39AM +0200, Boris Brezillon wrote:
> Hi Matthew,
> 
> On Mon,  3 Apr 2023 17:22:02 -0700
> Matthew Brost <matthew.brost@intel.com> wrote:
> 
> > -static int drm_sched_main(void *param)
> > +static void drm_sched_main(struct work_struct *w)
> >  {
> > -	struct drm_gpu_scheduler *sched = (struct drm_gpu_scheduler *)param;
> > +	struct drm_gpu_scheduler *sched =
> > +		container_of(w, struct drm_gpu_scheduler, work_run);
> >  	int r;
> >  
> > -	sched_set_fifo_low(current);
> > -
> > -	while (!kthread_should_stop()) {
> > -		struct drm_sched_entity *entity = NULL;
> > +	while (!READ_ONCE(sched->pause_run_wq)) {
> 
> During an informal discussion on IRC I mentioned that this loop might
> become problematic if all the 1:1 entities share the same wq
> (especially if it's an ordered wq), and one of them is getting passed a
> lot of requests. Just wanted to tell you that we've hit that case in
> PowerVR:
> 
> Geometry and fragment queues get passed X requests respectively, each
> pair of request corresponding to a rendering operation. Because we're
> using an ordered wq (which I know we shouldn't do, and I intend to
> fix that, but I think it shows the problem exists by making it more
> visible), all geometry requests get submitted first, then come the
> fragment requests. It turns out the submission time is non-negligible
> compared to the geometry job execution time, and geometry jobs end up
> generating data for the fragment jobs that are not consumed fast enough
> by the fragment job to allow the following geom jobs to re-use the same
> portion of memory, leading to on-demand allocation of extra memory
> chunks which wouldn't happen if submissions were interleaved.
> 
> I know you were not fundamentally opposed to killing this loop and doing
> one iteration at a time (you even provided a patch doing that), just
> wanted to share my findings to prove this is not just a theoretical
> issue, and the lack of fairness in the submission path can cause trouble
> in practice.
> 
> Best Regards,
> 
> Boris
> 

Thanks for the info Boris, about to revive this series in a non-RFC form.

This loop seems controversial, so let me drop it. Going to cook up a patch
for the Xe branch and get this merged so CI / UMD benchmarks can absorb it
and we can see if there are any noticeable differences.
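
Rough sketch of the direction I have in mind (just a sketch against the
code in this series, not the actual patch): do a single pass per work item
and re-queue ourselves if there was work to do, so schedulers sharing a wq
get interleaved instead of one of them draining its whole queue in a single
drm_sched_main() invocation.

static void drm_sched_main(struct work_struct *w)
{
	struct drm_gpu_scheduler *sched =
		container_of(w, struct drm_gpu_scheduler, work_run);
	struct drm_sched_entity *entity;
	struct drm_sched_job *cleanup_job;

	if (READ_ONCE(sched->pause_run_wq))
		return;

	/* One pass only: free at most one job, submit at most one job. */
	cleanup_job = drm_sched_get_cleanup_job(sched);
	entity = drm_sched_select_entity(sched);

	if (cleanup_job)
		sched->ops->free_job(cleanup_job);

	if (entity) {
		/* ... pop the job and call sched->ops->run_job() as today ... */
	}

	/* Re-queue ourselves; work items already on the wq get a turn first. */
	if (cleanup_job || entity)
		queue_work(sched->run_wq, &sched->work_run);
}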

Also be on the lookout for a new rev of this series hopefully this week.

Matt

> > +		struct drm_sched_entity *entity;
> >  		struct drm_sched_fence *s_fence;
> >  		struct drm_sched_job *sched_job;
> >  		struct dma_fence *fence;
> > -		struct drm_sched_job *cleanup_job = NULL;
> > +		struct drm_sched_job *cleanup_job;
> >  
> > -		wait_event_interruptible(sched->wake_up_worker,
> > -					 (cleanup_job = drm_sched_get_cleanup_job(sched)) ||
> > -					 (!drm_sched_blocked(sched) &&
> > -					  (entity = drm_sched_select_entity(sched))) ||
> > -					 kthread_should_stop());
> > +		cleanup_job = drm_sched_get_cleanup_job(sched);
> > +		entity = drm_sched_select_entity(sched);
> >  
> >  		if (cleanup_job)
> >  			sched->ops->free_job(cleanup_job);
> >  
> > -		if (!entity)
> > +		if (!entity) {
> > +			if (!cleanup_job)
> > +				break;
> >  			continue;
> > +		}
> >  
> >  		sched_job = drm_sched_entity_pop_job(entity);
> >  
> >  		if (!sched_job) {
> >  			complete_all(&entity->entity_idle);
> > +			if (!cleanup_job)
> > +				break;
> >  			continue;
> >  		}
> >  
> > @@ -1055,14 +1083,14 @@ static int drm_sched_main(void *param)
> >  					  r);
> >  		} else {
> >  			if (IS_ERR(fence))
> > -				dma_fence_set_error(&s_fence->finished, PTR_ERR(fence));
> > +				dma_fence_set_error(&s_fence->finished,
> > +						    PTR_ERR(fence));
> >  
> >  			drm_sched_job_done(sched_job);
> >  		}
> >  
> >  		wake_up(&sched->job_scheduled);
> >  	}
> > -	return 0;
> >  }

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 06/10] drm/sched: Submit job before starting TDR
  2023-05-04  5:23   ` Luben Tuikov
@ 2023-07-31  1:00     ` Matthew Brost
  2023-07-31  7:26       ` Boris Brezillon
  0 siblings, 1 reply; 87+ messages in thread
From: Matthew Brost @ 2023-07-31  1:00 UTC (permalink / raw)
  To: Luben Tuikov
  Cc: robdclark, airlied, lina, dri-devel, christian.koenig,
	boris.brezillon, intel-xe, faith.ekstrand

On Thu, May 04, 2023 at 01:23:05AM -0400, Luben Tuikov wrote:
> On 2023-04-03 20:22, Matthew Brost wrote:
> > If the TDR is set to a value, it can fire before a job is submitted in
> > drm_sched_main. The job should be always be submitted before the TDR
> > fires, fix this ordering.
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >  drivers/gpu/drm/scheduler/sched_main.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > index 6ae710017024..4eac02d212c1 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -1150,10 +1150,10 @@ static void drm_sched_main(struct work_struct *w)
> >  		s_fence = sched_job->s_fence;
> >  
> >  		atomic_inc(&sched->hw_rq_count);
> > -		drm_sched_job_begin(sched_job);
> >  
> >  		trace_drm_run_job(sched_job, entity);
> >  		fence = sched->ops->run_job(sched_job);
> > +		drm_sched_job_begin(sched_job);
> >  		complete_all(&entity->entity_idle);
> >  		drm_sched_fence_scheduled(s_fence);
> >  
> 
> Not sure if this is correct. In drm_sched_job_begin() we add the job to the "pending_list"
> (meaning it is pending execution in the hardware) and we also start a timeout timer. Both
> of those should be started before the job is given to the hardware.
> 

The correct solution is probably to add the job to the pending list
before run_job() and kick the TDR after run_job().
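
Something along these lines (rough sketch only; the two helper names
below are made up, they are not part of this series):

	atomic_inc(&sched->hw_rq_count);
	/* hypothetical: list_add_tail() to pending_list under job_list_lock */
	drm_sched_job_add_pending(sched_job);

	trace_drm_run_job(sched_job, entity);
	fence = sched->ops->run_job(sched_job);

	/* hypothetical: arm the TDR only once the job is on the hardware */
	drm_sched_job_start_timeout(sched_job);

	complete_all(&entity->entity_idle);
	drm_sched_fence_scheduled(s_fence);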

> If the timeout is set to too small a value, then that should probably be fixed instead.
>

Disagree; a user should be able to set the TDR value to anything they
want without breaking the DRM scheduler.

Matt

> Regards,
> Luben

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 07/10] drm/sched: Add helper to set TDR timeout
  2023-05-04  5:28   ` Luben Tuikov
@ 2023-07-31  1:09     ` Matthew Brost
  2023-08-31 19:52       ` Luben Tuikov
  0 siblings, 1 reply; 87+ messages in thread
From: Matthew Brost @ 2023-07-31  1:09 UTC (permalink / raw)
  To: Luben Tuikov
  Cc: robdclark, airlied, lina, dri-devel, christian.koenig,
	boris.brezillon, intel-xe, faith.ekstrand

On Thu, May 04, 2023 at 01:28:12AM -0400, Luben Tuikov wrote:
> On 2023-04-03 20:22, Matthew Brost wrote:
> > Add helper to set TDR timeout and restart the TDR with new timeout
> > value. This will be used in XE, new Intel GPU driver, to trigger the TDR
> > to cleanup drm_sched_entity that encounter errors.
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >  drivers/gpu/drm/scheduler/sched_main.c | 18 ++++++++++++++++++
> >  include/drm/gpu_scheduler.h            |  1 +
> >  2 files changed, 19 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > index 4eac02d212c1..d61880315d8d 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -370,6 +370,24 @@ static void drm_sched_start_timeout(struct drm_gpu_scheduler *sched)
> >  		queue_delayed_work(sched->timeout_wq, &sched->work_tdr, sched->timeout);
> >  }
> >  
> > +/**
> > + * drm_sched_set_timeout - set timeout for reset worker
> > + *
> > + * @sched: scheduler instance to set and (re)-start the worker for
> > + * @timeout: timeout period
> > + *
> > + * Set and (re)-start the timeout for the given scheduler.
> > + */
> > +void drm_sched_set_timeout(struct drm_gpu_scheduler *sched, long timeout)
> > +{
> > +	spin_lock(&sched->job_list_lock);
> > +	sched->timeout = timeout;
> > +	cancel_delayed_work(&sched->work_tdr);
> 
> I see that the comment says "(re-)start"(sic). Is the rest of the logic
> stable in that we don't need to use _sync() version, and/or at least
> inspect the return value of the one currently used?
> 

Sorry for the delayed response; I'm just reviving this series now and
seeing this comment.

We don't care if the TDR is currently executing (at least in Xe, which
makes use of this function); that is totally fine, as we only care
about changing future timeout values. I believe we actually call this
from the TDR in Xe to set the timeout value to zero, so using a _sync()
version would deadlock. We do this as a mechanism to kill the
drm_gpu_scheduler and immediately time out all remaining jobs. We also
call this in a few other places with a value of zero for the same
reason (to kill the drm_gpu_scheduler).
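
To illustrate (a made-up driver callback, not the actual Xe code):

	static enum drm_gpu_sched_stat
	example_timedout_job(struct drm_sched_job *job)
	{
		struct drm_gpu_scheduler *sched = job->sched;

		/*
		 * We are running on sched->timeout_wq here, so a _sync()
		 * cancel inside drm_sched_set_timeout() would wait on
		 * ourselves; the plain cancel_delayed_work() it uses is
		 * what we want.
		 *
		 * A zero timeout makes every remaining job on the
		 * pending list time out immediately, effectively
		 * killing the scheduler.
		 */
		drm_sched_set_timeout(sched, 0);

		/* ... driver-specific cleanup ... */

		return DRM_GPU_SCHED_STAT_ENODEV;
	}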

Matt

> Regards,
> Luben
> 
> > +	drm_sched_start_timeout(sched);
> > +	spin_unlock(&sched->job_list_lock);
> > +}
> > +EXPORT_SYMBOL(drm_sched_set_timeout);
> > +
> >  /**
> >   * drm_sched_fault - immediately start timeout handler
> >   *
> > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> > index 18172ae63ab7..6258e324bd7c 100644
> > --- a/include/drm/gpu_scheduler.h
> > +++ b/include/drm/gpu_scheduler.h
> > @@ -593,6 +593,7 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
> >  				    struct drm_gpu_scheduler **sched_list,
> >                                     unsigned int num_sched_list);
> >  
> > +void drm_sched_set_timeout(struct drm_gpu_scheduler *sched, long timeout);
> >  void drm_sched_job_cleanup(struct drm_sched_job *job);
> >  void drm_sched_wakeup(struct drm_gpu_scheduler *sched);
> >  void drm_sched_add_msg(struct drm_gpu_scheduler *sched,
> 

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 04/10] drm/sched: Add generic scheduler message interface
  2023-05-04  5:28   ` Luben Tuikov
@ 2023-07-31  2:42     ` Matthew Brost
  0 siblings, 0 replies; 87+ messages in thread
From: Matthew Brost @ 2023-07-31  2:42 UTC (permalink / raw)
  To: Luben Tuikov
  Cc: robdclark, airlied, lina, dri-devel, christian.koenig,
	boris.brezillon, intel-xe, faith.ekstrand

On Thu, May 04, 2023 at 01:28:52AM -0400, Luben Tuikov wrote:
> On 2023-04-03 20:22, Matthew Brost wrote:
> > Add generic schedule message interface which sends messages to backend
> > from the drm_gpu_scheduler main submission thread. The idea is some of
> > these messages modify some state in drm_sched_entity which is also
> > modified during submission. By scheduling these messages and submission
> > in the same thread their is not race changing states in
> > drm_sched_entity.
> 
> "... there is no race when changing ..." or better yet,
> "... we eliminate races due to drm_sched_entity state changes."
> 
> > 
> > This interface will be used in XE, new Intel GPU driver, to cleanup,
> 
> "Xe"?
> 

Will fix both.
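
For context, a rough sketch of how a backend could use this interface
(everything with an "example_" prefix is made up, this is not Xe code):

	enum example_msg_opcode { EXAMPLE_MSG_SUSPEND };

	struct example_msg {
		struct drm_sched_msg base;
		struct drm_sched_entity *entity;
	};

	static void example_queue_suspend(struct drm_gpu_scheduler *sched,
					  struct example_msg *m)
	{
		m->base.opcode = EXAMPLE_MSG_SUSPEND;
		m->base.private_data = m;
		drm_sched_add_msg(sched, &m->base);
	}

	/*
	 * ops->process_msg: runs in the same worker as job submission,
	 * so it cannot race with entity state changes made there.
	 */
	static void example_process_msg(struct drm_sched_msg *msg)
	{
		struct example_msg *m = msg->private_data;

		if (msg->opcode == EXAMPLE_MSG_SUSPEND)
			example_do_suspend(m->entity);
	}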

Matt

> Regards,
> Luben
> 
> > suspend, resume, and change scheduling properties of a drm_sched_entity.
> > 
> > The interface is designed to be generic and extendable with only the
> > backend understanding the messages.
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >  drivers/gpu/drm/scheduler/sched_main.c | 58 +++++++++++++++++++++++++-
> >  include/drm/gpu_scheduler.h            | 29 ++++++++++++-
> >  2 files changed, 84 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > index 2795021efe7b..9dc3378e9c5e 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -1055,6 +1055,54 @@ drm_sched_pick_best(struct drm_gpu_scheduler **sched_list,
> >  }
> >  EXPORT_SYMBOL(drm_sched_pick_best);
> >  
> > +/**
> > + * drm_sched_add_msg - add scheduler message
> > + *
> > + * @sched: scheduler instance
> > + * @msg: message to be added
> > + *
> > + * Can and will pass an jobs waiting on dependencies or in a runnable queue.
> > + * Messages processing will stop if schedule run wq is stopped and resume when
> > + * run wq is started.
> > + */
> > +void drm_sched_add_msg(struct drm_gpu_scheduler *sched,
> > +		       struct drm_sched_msg *msg)
> > +{
> > +	spin_lock(&sched->job_list_lock);
> > +	list_add_tail(&msg->link, &sched->msgs);
> > +	spin_unlock(&sched->job_list_lock);
> > +
> > +	/*
> > +	 * Same as above in drm_sched_run_wq_queue, try to kick worker if
> > +	 * paused, harmless if this races
> > +	 */
> > +	if (!sched->pause_run_wq)
> > +		queue_work(sched->run_wq, &sched->work_run);
> > +}
> > +EXPORT_SYMBOL(drm_sched_add_msg);
> > +
> > +/**
> > + * drm_sched_get_msg - get scheduler message
> > + *
> > + * @sched: scheduler instance
> > + *
> > + * Returns NULL or message
> > + */
> > +static struct drm_sched_msg *
> > +drm_sched_get_msg(struct drm_gpu_scheduler *sched)
> > +{
> > +	struct drm_sched_msg *msg;
> > +
> > +	spin_lock(&sched->job_list_lock);
> > +	msg = list_first_entry_or_null(&sched->msgs,
> > +				       struct drm_sched_msg, link);
> > +	if (msg)
> > +		list_del(&msg->link);
> > +	spin_unlock(&sched->job_list_lock);
> > +
> > +	return msg;
> > +}
> > +
> >  /**
> >   * drm_sched_main - main scheduler thread
> >   *
> > @@ -1068,6 +1116,7 @@ static void drm_sched_main(struct work_struct *w)
> >  
> >  	while (!READ_ONCE(sched->pause_run_wq)) {
> >  		struct drm_sched_entity *entity;
> > +		struct drm_sched_msg *msg;
> >  		struct drm_sched_fence *s_fence;
> >  		struct drm_sched_job *sched_job;
> >  		struct dma_fence *fence;
> > @@ -1075,12 +1124,16 @@ static void drm_sched_main(struct work_struct *w)
> >  
> >  		cleanup_job = drm_sched_get_cleanup_job(sched);
> >  		entity = drm_sched_select_entity(sched);
> > +		msg = drm_sched_get_msg(sched);
> >  
> >  		if (cleanup_job)
> >  			sched->ops->free_job(cleanup_job);
> >  
> > +		if (msg)
> > +			sched->ops->process_msg(msg);
> > +
> >  		if (!entity) {
> > -			if (!cleanup_job)
> > +			if (!cleanup_job && !msg)
> >  				break;
> >  			continue;
> >  		}
> > @@ -1089,7 +1142,7 @@ static void drm_sched_main(struct work_struct *w)
> >  
> >  		if (!sched_job) {
> >  			complete_all(&entity->entity_idle);
> > -			if (!cleanup_job)
> > +			if (!cleanup_job && !msg)
> >  				break;
> >  			continue;
> >  		}
> > @@ -1181,6 +1234,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
> >  
> >  	init_waitqueue_head(&sched->job_scheduled);
> >  	INIT_LIST_HEAD(&sched->pending_list);
> > +	INIT_LIST_HEAD(&sched->msgs);
> >  	spin_lock_init(&sched->job_list_lock);
> >  	atomic_set(&sched->hw_rq_count, 0);
> >  	INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout);
> > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> > index 3e421f5a710c..18172ae63ab7 100644
> > --- a/include/drm/gpu_scheduler.h
> > +++ b/include/drm/gpu_scheduler.h
> > @@ -398,6 +398,23 @@ enum drm_gpu_sched_stat {
> >  	DRM_GPU_SCHED_STAT_ENODEV,
> >  };
> >  
> > +/**
> > + * struct drm_sched_msg - an in-band (relative to GPU scheduler run queue)
> > + * message
> > + *
> > + * Generic enough for backend defined messages, backend can expand if needed.
> > + */
> > +struct drm_sched_msg {
> > +	/** @link: list link into the gpu scheduler list of messages */
> > +	struct list_head		link;
> > +	/**
> > +	 * @private_data: opaque pointer to message private data (backend defined)
> > +	 */
> > +	void				*private_data;
> > +	/** @opcode: opcode of message (backend defined) */
> > +	unsigned int			opcode;
> > +};
> > +
> >  /**
> >   * struct drm_sched_backend_ops - Define the backend operations
> >   *	called by the scheduler
> > @@ -475,6 +492,12 @@ struct drm_sched_backend_ops {
> >           * and it's time to clean it up.
> >  	 */
> >  	void (*free_job)(struct drm_sched_job *sched_job);
> > +
> > +	/**
> > +	 * @process_msg: Process a message. Allowed to block, it is this
> > +	 * function's responsibility to free message if dynamically allocated.
> > +	 */
> > +	void (*process_msg)(struct drm_sched_msg *msg);
> >  };
> >  
> >  /**
> > @@ -486,6 +509,7 @@ struct drm_sched_backend_ops {
> >   * @timeout: the time after which a job is removed from the scheduler.
> >   * @name: name of the ring for which this scheduler is being used.
> >   * @sched_rq: priority wise array of run queues.
> > + * @msgs: list of messages to be processed in @work_run
> >   * @job_scheduled: once @drm_sched_entity_do_release is called the scheduler
> >   *                 waits on this wait queue until all the scheduled jobs are
> >   *                 finished.
> > @@ -493,7 +517,7 @@ struct drm_sched_backend_ops {
> >   * @job_id_count: used to assign unique id to the each job.
> >   * @run_wq: workqueue used to queue @work_run
> >   * @timeout_wq: workqueue used to queue @work_tdr
> > - * @work_run: schedules jobs and cleans up entities
> > + * @work_run: schedules jobs, cleans up jobs, and processes messages
> >   * @work_tdr: schedules a delayed call to @drm_sched_job_timedout after the
> >   *            timeout interval is over.
> >   * @pending_list: the list of jobs which are currently in the job queue.
> > @@ -517,6 +541,7 @@ struct drm_gpu_scheduler {
> >  	long				timeout;
> >  	const char			*name;
> >  	struct drm_sched_rq		sched_rq[DRM_SCHED_PRIORITY_COUNT];
> > +	struct list_head		msgs;
> >  	wait_queue_head_t		job_scheduled;
> >  	atomic_t			hw_rq_count;
> >  	atomic64_t			job_id_count;
> > @@ -570,6 +595,8 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
> >  
> >  void drm_sched_job_cleanup(struct drm_sched_job *job);
> >  void drm_sched_wakeup(struct drm_gpu_scheduler *sched);
> > +void drm_sched_add_msg(struct drm_gpu_scheduler *sched,
> > +		       struct drm_sched_msg *msg);
> >  void drm_sched_run_wq_stop(struct drm_gpu_scheduler *sched);
> >  void drm_sched_run_wq_start(struct drm_gpu_scheduler *sched);
> >  void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad);
> 

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 06/10] drm/sched: Submit job before starting TDR
  2023-07-31  1:00     ` Matthew Brost
@ 2023-07-31  7:26       ` Boris Brezillon
  2023-08-31 19:48         ` Luben Tuikov
  0 siblings, 1 reply; 87+ messages in thread
From: Boris Brezillon @ 2023-07-31  7:26 UTC (permalink / raw)
  To: Matthew Brost
  Cc: robdclark, Sarah Walker, airlied, lina, Frank Binns, dri-devel,
	christian.koenig, Luben Tuikov, Donald Robson, intel-xe,
	faith.ekstrand

+the PVR devs

On Mon, 31 Jul 2023 01:00:59 +0000
Matthew Brost <matthew.brost@intel.com> wrote:

> On Thu, May 04, 2023 at 01:23:05AM -0400, Luben Tuikov wrote:
> > On 2023-04-03 20:22, Matthew Brost wrote:  
> > > If the TDR is set to a value, it can fire before a job is submitted in
> > > drm_sched_main. The job should always be submitted before the TDR
> > > fires, fix this ordering.
> > > 
> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > ---
> > >  drivers/gpu/drm/scheduler/sched_main.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > > index 6ae710017024..4eac02d212c1 100644
> > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > @@ -1150,10 +1150,10 @@ static void drm_sched_main(struct work_struct *w)
> > >  		s_fence = sched_job->s_fence;
> > >  
> > >  		atomic_inc(&sched->hw_rq_count);
> > > -		drm_sched_job_begin(sched_job);
> > >  
> > >  		trace_drm_run_job(sched_job, entity);
> > >  		fence = sched->ops->run_job(sched_job);
> > > +		drm_sched_job_begin(sched_job);
> > >  		complete_all(&entity->entity_idle);
> > >  		drm_sched_fence_scheduled(s_fence);
> > >    
> > 
> > Not sure if this is correct. In drm_sched_job_begin() we add the job to the "pending_list"
> > (meaning it is pending execution in the hardware) and we also start a timeout timer. Both
> > of those should be started before the job is given to the hardware.
> >   
> 
> The correct solution is probably add to pending list before run_job()
> and kick TDR after run_job().

This would make the PVR driver simpler too. Right now, the driver
iterates over the pending job list to signal jobs done_fences, but
there's a race between the interrupt handler (that's iterating over
this list to signal fences) and the drm_sched logic (that's inserting
the job in the pending_list after run_job() returns). The race is taken
care of with an additional field pointing to the last submitted
job [1], but if we can get rid of that logic, that's for the best.

[1]https://gitlab.freedesktop.org/frankbinns/powervr/-/blob/powervr-next/drivers/gpu/drm/imagination/pvr_queue.h#L119

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 06/10] drm/sched: Submit job before starting TDR
  2023-07-31  7:26       ` Boris Brezillon
@ 2023-08-31 19:48         ` Luben Tuikov
  0 siblings, 0 replies; 87+ messages in thread
From: Luben Tuikov @ 2023-08-31 19:48 UTC (permalink / raw)
  To: Boris Brezillon, Matthew Brost
  Cc: robdclark, Sarah Walker, airlied, lina, Frank Binns, dri-devel,
	christian.koenig, Donald Robson, intel-xe, faith.ekstrand

On 2023-07-31 03:26, Boris Brezillon wrote:
> +the PVR devs
> 
> On Mon, 31 Jul 2023 01:00:59 +0000
> Matthew Brost <matthew.brost@intel.com> wrote:
> 
>> On Thu, May 04, 2023 at 01:23:05AM -0400, Luben Tuikov wrote:
>>> On 2023-04-03 20:22, Matthew Brost wrote:  
>>>> If the TDR is set to a value, it can fire before a job is submitted in
>>>> drm_sched_main. The job should always be submitted before the TDR
>>>> fires, fix this ordering.
>>>>
>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>> ---
>>>>  drivers/gpu/drm/scheduler/sched_main.c | 2 +-
>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>> index 6ae710017024..4eac02d212c1 100644
>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>> @@ -1150,10 +1150,10 @@ static void drm_sched_main(struct work_struct *w)
>>>>  		s_fence = sched_job->s_fence;
>>>>  
>>>>  		atomic_inc(&sched->hw_rq_count);
>>>> -		drm_sched_job_begin(sched_job);
>>>>  
>>>>  		trace_drm_run_job(sched_job, entity);
>>>>  		fence = sched->ops->run_job(sched_job);
>>>> +		drm_sched_job_begin(sched_job);
>>>>  		complete_all(&entity->entity_idle);
>>>>  		drm_sched_fence_scheduled(s_fence);
>>>>    
>>>
>>> Not sure if this is correct. In drm_sched_job_begin() we add the job to the "pending_list"
>>> (meaning it is pending execution in the hardware) and we also start a timeout timer. Both
>>> of those should be started before the job is given to the hardware.
>>>   
>>
>> The correct solution is probably add to pending list before run_job()
>> and kick TDR after run_job().
> 
> This would make the PVR driver simpler too. Right now, the driver
> iterates over the pending job list to signal jobs done_fences, but
> there's a race between the interrupt handler (that's iterating over
> this list to signal fences) and the drm_sched logic (that's inserting
> the job in the pending_list after run_job() returns). The race is taken
> care of with an additional field pointing to the last submitted
> job [1], but if we can get rid of that logic, that's for the best.
> 
> [1]https://gitlab.freedesktop.org/frankbinns/powervr/-/blob/powervr-next/drivers/gpu/drm/imagination/pvr_queue.h#L119

(Catching up, chronologically, after vacation...)

I agree with both emails above. I'm aware of this race in the DRM
scheduler, but have been careful not to open a can of worms by fixing
it.

But, yes, the classic way (which would avoid races) is indeed to add to
the pending list before run_job(), as we cannot guarantee the state of
the job after run_job(). Also, ideally we want to stop all submissions,
then call the TDR, recover/reset/etc., and then resume incoming
submissions.
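
Very roughly, that ordering in a driver's timedout_job handler could
look like the sketch below ("example_" names are made up, error
handling is elided, and the run_wq calls may be redundant with what
drm_sched_stop()/drm_sched_start() already do in this series):

	static enum drm_gpu_sched_stat
	example_reset_handler(struct drm_sched_job *bad)
	{
		struct drm_gpu_scheduler *sched = bad->sched;

		drm_sched_run_wq_stop(sched);	/* stop new submissions */
		drm_sched_stop(sched, bad);	/* park pending jobs */

		example_hw_reset(sched);	/* recover / reset */

		drm_sched_resubmit_jobs(sched);
		drm_sched_start(sched, true);	/* re-arm the TDR */
		drm_sched_run_wq_start(sched);	/* resume submissions */

		return DRM_GPU_SCHED_STAT_NOMINAL;
	}
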
-- 
Regards,
Luben


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Intel-xe] [RFC PATCH 07/10] drm/sched: Add helper to set TDR timeout
  2023-07-31  1:09     ` Matthew Brost
@ 2023-08-31 19:52       ` Luben Tuikov
  0 siblings, 0 replies; 87+ messages in thread
From: Luben Tuikov @ 2023-08-31 19:52 UTC (permalink / raw)
  To: Matthew Brost
  Cc: robdclark, airlied, lina, dri-devel, christian.koenig,
	boris.brezillon, intel-xe, faith.ekstrand

On 2023-07-30 21:09, Matthew Brost wrote:
> On Thu, May 04, 2023 at 01:28:12AM -0400, Luben Tuikov wrote:
>> On 2023-04-03 20:22, Matthew Brost wrote:
>>> Add helper to set TDR timeout and restart the TDR with new timeout
>>> value. This will be used in XE, new Intel GPU driver, to trigger the TDR
>>> to cleanup drm_sched_entity that encounter errors.
>>>
>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>> ---
>>>  drivers/gpu/drm/scheduler/sched_main.c | 18 ++++++++++++++++++
>>>  include/drm/gpu_scheduler.h            |  1 +
>>>  2 files changed, 19 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>> index 4eac02d212c1..d61880315d8d 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -370,6 +370,24 @@ static void drm_sched_start_timeout(struct drm_gpu_scheduler *sched)
>>>  		queue_delayed_work(sched->timeout_wq, &sched->work_tdr, sched->timeout);
>>>  }
>>>  
>>> +/**
>>> + * drm_sched_set_timeout - set timeout for reset worker
>>> + *
>>> + * @sched: scheduler instance to set and (re)-start the worker for
>>> + * @timeout: timeout period
>>> + *
>>> + * Set and (re)-start the timeout for the given scheduler.
>>> + */
>>> +void drm_sched_set_timeout(struct drm_gpu_scheduler *sched, long timeout)
>>> +{
>>> +	spin_lock(&sched->job_list_lock);
>>> +	sched->timeout = timeout;
>>> +	cancel_delayed_work(&sched->work_tdr);
>>
>> I see that the comment says "(re-)start"(sic). Is the rest of the logic
>> stable in that we don't need to use _sync() version, and/or at least
>> inspect the return value of the one currently used?
>>
> 
> Sorry for the delayed response, just reviving this series now and seeing
> this comment.
> 
> We don't care if the TDR is currently executing (at least in Xe which
> makes use of this function), that is totally fine we only care to change
> the future timeout values. I believe we actually call this from the TDR
> in Xe to set the timeout value to zero so using a sync version would
> deadlock. We do this as a mechanism to kill the drm_gpu_scheduler and
> immediately timeout all remaining jobs. We also call this in a few other
> places too with a value of zero for the same reason (kill the
> drm_gpu_scheduler).

(Catching up chronologically after vacation...)

Okay, that's fine, but this shows a need for an interface/logic to
simply kill the DRM GPU scheduler. So perhaps we need to provide that
kind of functionality, as opposed to gaming the scheduler by setting
the timeout to 0 to kill it. Perhaps that would be simpler?
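
Something like the below, perhaps (a hypothetical helper, not part of
this series; internally it could still reuse the timeout machinery,
but the intent would be explicit at the call sites):

	void drm_sched_kill(struct drm_gpu_scheduler *sched)
	{
		/* time out all pending jobs immediately */
		drm_sched_set_timeout(sched, 0);
	}
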
-- 
Regards,
Luben

> 
> Matt
> 
>> Regards,
>> Luben
>>
>>> +	drm_sched_start_timeout(sched);
>>> +	spin_unlock(&sched->job_list_lock);
>>> +}
>>> +EXPORT_SYMBOL(drm_sched_set_timeout);
>>> +
>>>  /**
>>>   * drm_sched_fault - immediately start timeout handler
>>>   *
>>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>>> index 18172ae63ab7..6258e324bd7c 100644
>>> --- a/include/drm/gpu_scheduler.h
>>> +++ b/include/drm/gpu_scheduler.h
>>> @@ -593,6 +593,7 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
>>>  				    struct drm_gpu_scheduler **sched_list,
>>>                                     unsigned int num_sched_list);
>>>  
>>> +void drm_sched_set_timeout(struct drm_gpu_scheduler *sched, long timeout);
>>>  void drm_sched_job_cleanup(struct drm_sched_job *job);
>>>  void drm_sched_wakeup(struct drm_gpu_scheduler *sched);
>>>  void drm_sched_add_msg(struct drm_gpu_scheduler *sched,
>>


^ permalink raw reply	[flat|nested] 87+ messages in thread

end of thread, other threads:[~2023-08-31 19:52 UTC | newest]

Thread overview: 87+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-04-04  0:22 [Intel-xe] [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans Matthew Brost
2023-04-04  0:22 ` [Intel-xe] [RFC PATCH 01/10] drm/sched: Convert drm scheduler to use a work queue rather than kthread Matthew Brost
2023-06-09  6:58   ` Boris Brezillon
2023-07-31  0:56     ` Matthew Brost
2023-04-04  0:22 ` [Intel-xe] [RFC PATCH 02/10] drm/sched: Move schedule policy to scheduler / entity Matthew Brost
2023-04-05 17:37   ` Luben Tuikov
2023-04-05 18:29     ` Matthew Brost
2023-04-04  0:22 ` [Intel-xe] [RFC PATCH 03/10] drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy Matthew Brost
2023-04-04  0:22 ` [Intel-xe] [RFC PATCH 04/10] drm/sched: Add generic scheduler message interface Matthew Brost
2023-05-04  5:28   ` Luben Tuikov
2023-07-31  2:42     ` Matthew Brost
2023-04-04  0:22 ` [Intel-xe] [RFC PATCH 05/10] drm/sched: Start run wq before TDR in drm_sched_start Matthew Brost
2023-04-04  0:22 ` [Intel-xe] [RFC PATCH 06/10] drm/sched: Submit job before starting TDR Matthew Brost
2023-05-04  5:23   ` Luben Tuikov
2023-07-31  1:00     ` Matthew Brost
2023-07-31  7:26       ` Boris Brezillon
2023-08-31 19:48         ` Luben Tuikov
2023-04-04  0:22 ` [Intel-xe] [RFC PATCH 07/10] drm/sched: Add helper to set TDR timeout Matthew Brost
2023-05-04  5:28   ` Luben Tuikov
2023-07-31  1:09     ` Matthew Brost
2023-08-31 19:52       ` Luben Tuikov
2023-04-04  0:22 ` [Intel-xe] [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences Matthew Brost
2023-04-04  9:09   ` Christian König
2023-04-04 12:54     ` Thomas Hellström
2023-04-04 13:10       ` Christian König
2023-04-04 18:14         ` Thomas Hellström (Intel)
2023-04-04 19:02           ` Matthew Brost
2023-04-04 19:25             ` Daniel Vetter
2023-04-04 19:48               ` Matthew Brost
2023-04-05 13:09                 ` Daniel Vetter
2023-04-05 23:58                   ` Matthew Brost
2023-04-06  6:32                     ` Daniel Vetter
2023-04-06 16:58                       ` Matthew Brost
2023-04-06 17:09                         ` Daniel Vetter
2023-04-05 12:35               ` Thomas Hellström
2023-04-05 12:39                 ` Christian König
2023-04-05 12:45                   ` Daniel Vetter
2023-04-05 14:08                     ` Christian König
2023-04-04 19:00         ` Daniel Vetter
2023-04-04 20:03           ` Matthew Brost
2023-04-04 20:11             ` Daniel Vetter
2023-04-04 20:19               ` Matthew Brost
2023-04-04 20:31                 ` Daniel Vetter
2023-04-04 20:46                   ` Matthew Brost
2023-04-04  0:22 ` [Intel-xe] [RFC PATCH 09/10] drm/sched: Support long-running sched entities Matthew Brost
2023-04-04  0:22 ` [Intel-xe] [RFC PATCH 10/10] drm/syncobj: Warn on long running dma-fences Matthew Brost
2023-04-04  0:24 ` [Intel-xe] ✗ CI.Patch_applied: failure for Xe DRM scheduler and long running workload plans Patchwork
2023-04-04  1:07 ` [Intel-xe] [RFC PATCH 00/10] " Asahi Lina
2023-04-04  1:58   ` Matthew Brost
2023-04-08  7:05     ` Asahi Lina
2023-04-11 14:07       ` Daniel Vetter
2023-04-12  5:47         ` Asahi Lina
2023-04-12  8:18           ` Daniel Vetter
2023-04-17  0:03       ` Matthew Brost
2023-04-04  9:04 ` Christian König
2023-04-04 13:23   ` Matthew Brost
2023-04-04  9:13 ` Christian König
2023-04-04 13:37   ` Matthew Brost
2023-04-05  7:41     ` Christian König
2023-04-05  8:34       ` Daniel Vetter
2023-04-05  8:53         ` Christian König
2023-04-05  9:07           ` Daniel Vetter
2023-04-05  9:57             ` Christian König
2023-04-05 10:12               ` Daniel Vetter
2023-04-06  2:08                 ` Matthew Brost
2023-04-06  6:37                   ` Daniel Vetter
2023-04-06 10:14                     ` Christian König
2023-04-06 10:32                       ` Daniel Vetter
2023-04-04  9:43 ` Tvrtko Ursulin
2023-04-04  9:48   ` Christian König
2023-04-04 13:43     ` Matthew Brost
2023-04-04 13:52   ` Matthew Brost
2023-04-04 17:29     ` Tvrtko Ursulin
2023-04-04 19:07       ` Daniel Vetter
2023-04-04 18:02 ` Zeng, Oak
2023-04-04 18:08   ` Matthew Brost
2023-04-05  7:30     ` Christian König
2023-04-05  8:42       ` Daniel Vetter
2023-04-05 18:06       ` Zeng, Oak
2023-04-05 18:53         ` Matthew Brost
2023-04-06 10:04           ` Christian König
2023-04-07  0:20           ` Zeng, Oak
2023-04-11  9:02             ` Christian König
2023-04-11 14:13               ` Daniel Vetter
2023-04-17  6:47                 ` Christian König
2023-04-17  8:39                   ` Daniel Vetter
2023-04-18 15:10 ` Liviu Dudau
