All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/3] DRM scheduler documentation & bug fixes
@ 2023-07-14  8:21 ` Asahi Lina
  0 siblings, 0 replies; 86+ messages in thread
From: Asahi Lina @ 2023-07-14  8:21 UTC (permalink / raw)
  To: Luben Tuikov, David Airlie, Daniel Vetter, Sumit Semwal,
	Christian König
  Cc: Faith Ekstrand, Alyssa Rosenzweig, dri-devel, linux-kernel,
	linux-media, asahi, Asahi Lina

Based on the previous discussion while I was writing the Rust
abstractions for the DRM scheduler, it looks like we're overdue for some
documentation.

This series first attempts to document what I've learned about the
scheduler and what I believe should be the *intended* lifetime
semantics, and then fixes a few bugs that result from that:

1. The DRM scheduler fences cannot be required to be outlived by the
   scheduler. This is non-negotiable. The whole point of these fences is
   to decouple the underlying hardware/driver from consumers, such as
   dma-bufs with an attached fence. If this requirement were not met,
   then we'd have to somehow keep the scheduler and all the driver
   components associated with it alive as long as a dma-buf with an
   attached drm_sched fence is alive, which could be indefinitely even
   after the hardware that produced that dma-buf is long gone. Consider,
   for example, using a hot-pluggable GPU to write to a dma-buf in main
   memory, which gets presented on an integrated display controller, and
   then the GPU is unplugged. That buffer could potentially live
   forever, we can't block GPU driver cleanup on that.

2. Make the DRM scheduler properly clean up jobs on shutdown, such that
   we can support the use case of tearing down the scheduler with
   in-flight jobs. This is important to cleanly support the firmware
   scheduling use case, where the DRM scheduler is attached to a file
   (which we want to be able to tear down quickly when userspace closes
   it) while firmware could continue to (attempt to) run in-flight jobs
   after that point. The major missing codepath to make this work is
   detaching jobs from their HW fences on scheduler shutdown, so
   implement that. This also makes writing a safe Rust abstraction
   plausible, since otherwise we'd have to add a huge pile of complexity
   to that side in order to enforce the invariant that the scheduler
   outlives its jobs (including setting up a workqueue to handle
   scheduler teardown and other craziness, which is an unacceptable
   level of complexity for what should be a lightweight abstraction).

I believe there *may* still be at least one UAF-type bug related to case
2 above, but it's very hard to trigger and I wasn't able to figure out
what causes it the one time I saw it recently. Other than that, things
look quite robust on the Asahi driver with these patches, even when
trying to break things by killing GPU consumers in a tight loop and
things like that. If we agree this is a good way forward, I think this
is a good start even if there's still a bug lurking somewhere.

Aside (but related to the previous discussion): the can_run_job thing
is gone, I'm using fences returned from prepare() now and that works
well (and actually fixes one corner case related to wait contexts I'd
missed), so hopefully that's OK with everyone ^^

Changes from the previous version of patch #2: explicitly signal
detached job fences with an error. I'd missed that and I think it's what
was causing us some rare lockups due to fences never getting signaled.

Signed-off-by: Asahi Lina <lina@asahilina.net>
---
Asahi Lina (3):
      drm/scheduler: Add more documentation
      drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name
      drm/scheduler: Clean up jobs when the scheduler is torn down.
 drivers/gpu/drm/scheduler/sched_entity.c |  7 ++-
 drivers/gpu/drm/scheduler/sched_fence.c  |  4 +-
 drivers/gpu/drm/scheduler/sched_main.c   | 90 ++++++++++++++++++++++++++++++--
 include/drm/gpu_scheduler.h              |  5 ++
 4 files changed, 99 insertions(+), 7 deletions(-)
---
base-commit: 06c2afb862f9da8dc5efa4b6076a0e48c3fbaaa5
change-id: 20230714-drm-sched-fixes-94bea043bbe7

Thank you,
~~ Lina


^ permalink raw reply	[flat|nested] 86+ messages in thread

* [PATCH 0/3] DRM scheduler documentation & bug fixes
@ 2023-07-14  8:21 ` Asahi Lina
  0 siblings, 0 replies; 86+ messages in thread
From: Asahi Lina @ 2023-07-14  8:21 UTC (permalink / raw)
  To: Luben Tuikov, David Airlie, Daniel Vetter, Sumit Semwal,
	Christian König
  Cc: Faith Ekstrand, Asahi Lina, linux-kernel, dri-devel, asahi,
	Alyssa Rosenzweig, linux-media

Based on the previous discussion while I was writing the Rust
abstractions for the DRM scheduler, it looks like we're overdue for some
documentation.

This series first attempts to document what I've learned about the
scheduler and what I believe should be the *intended* lifetime
semantics, and then fixes a few bugs that result from that:

1. The DRM scheduler fences cannot be required to be outlived by the
   scheduler. This is non-negotiable. The whole point of these fences is
   to decouple the underlying hardware/driver from consumers, such as
   dma-bufs with an attached fence. If this requirement were not met,
   then we'd have to somehow keep the scheduler and all the driver
   components associated with it alive as long as a dma-buf with an
   attached drm_sched fence is alive, which could be indefinitely even
   after the hardware that produced that dma-buf is long gone. Consider,
   for example, using a hot-pluggable GPU to write to a dma-buf in main
   memory, which gets presented on an integrated display controller, and
   then the GPU is unplugged. That buffer could potentially live
   forever, we can't block GPU driver cleanup on that.

2. Make the DRM scheduler properly clean up jobs on shutdown, such that
   we can support the use case of tearing down the scheduler with
   in-flight jobs. This is important to cleanly support the firmware
   scheduling use case, where the DRM scheduler is attached to a file
   (which we want to be able to tear down quickly when userspace closes
   it) while firmware could continue to (attempt to) run in-flight jobs
   after that point. The major missing codepath to make this work is
   detaching jobs from their HW fences on scheduler shutdown, so
   implement that. This also makes writing a safe Rust abstraction
   plausible, since otherwise we'd have to add a huge pile of complexity
   to that side in order to enforce the invariant that the scheduler
   outlives its jobs (including setting up a workqueue to handle
   scheduler teardown and other craziness, which is an unacceptable
   level of complexity for what should be a lightweight abstraction).

I believe there *may* still be at least one UAF-type bug related to case
2 above, but it's very hard to trigger and I wasn't able to figure out
what causes it the one time I saw it recently. Other than that, things
look quite robust on the Asahi driver with these patches, even when
trying to break things by killing GPU consumers in a tight loop and
things like that. If we agree this is a good way forward, I think this
is a good start even if there's still a bug lurking somewhere.

Aside (but related to the previous discussion): the can_run_job thing
is gone, I'm using fences returned from prepare() now and that works
well (and actually fixes one corner case related to wait contexts I'd
missed), so hopefully that's OK with everyone ^^

Changes from the previous version of patch #2: explicitly signal
detached job fences with an error. I'd missed that and I think it's what
was causing us some rare lockups due to fences never getting signaled.

Signed-off-by: Asahi Lina <lina@asahilina.net>
---
Asahi Lina (3):
      drm/scheduler: Add more documentation
      drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name
      drm/scheduler: Clean up jobs when the scheduler is torn down.
 drivers/gpu/drm/scheduler/sched_entity.c |  7 ++-
 drivers/gpu/drm/scheduler/sched_fence.c  |  4 +-
 drivers/gpu/drm/scheduler/sched_main.c   | 90 ++++++++++++++++++++++++++++++--
 include/drm/gpu_scheduler.h              |  5 ++
 4 files changed, 99 insertions(+), 7 deletions(-)
---
base-commit: 06c2afb862f9da8dc5efa4b6076a0e48c3fbaaa5
change-id: 20230714-drm-sched-fixes-94bea043bbe7

Thank you,
~~ Lina


^ permalink raw reply	[flat|nested] 86+ messages in thread

* [PATCH 1/3] drm/scheduler: Add more documentation
  2023-07-14  8:21 ` Asahi Lina
@ 2023-07-14  8:21   ` Asahi Lina
  -1 siblings, 0 replies; 86+ messages in thread
From: Asahi Lina @ 2023-07-14  8:21 UTC (permalink / raw)
  To: Luben Tuikov, David Airlie, Daniel Vetter, Sumit Semwal,
	Christian König
  Cc: Faith Ekstrand, Alyssa Rosenzweig, dri-devel, linux-kernel,
	linux-media, asahi, Asahi Lina

Document the implied lifetime rules of the scheduler (or at least the
intended ones), as well as the expectations of how resource acquisition
should be handled.

Signed-off-by: Asahi Lina <lina@asahilina.net>
---
 drivers/gpu/drm/scheduler/sched_main.c | 58 ++++++++++++++++++++++++++++++++--
 1 file changed, 55 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 7b2bfc10c1a5..1f3bc3606239 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -43,9 +43,61 @@
  *
  * The jobs in a entity are always scheduled in the order that they were pushed.
  *
- * Note that once a job was taken from the entities queue and pushed to the
- * hardware, i.e. the pending queue, the entity must not be referenced anymore
- * through the jobs entity pointer.
+ * Lifetime rules
+ * --------------
+ *
+ * Getting object lifetimes right across the stack is critical to avoid UAF
+ * issues. The DRM scheduler has the following lifetime rules:
+ *
+ * - The scheduler must outlive all of its entities.
+ * - Jobs pushed to the scheduler are owned by it, and must only be freed
+ *   after the free_job() callback is called.
+ * - Scheduler fences are reference-counted and may outlive the scheduler.
+ * - The scheduler *may* be destroyed while jobs are still in flight.
+ * - There is no guarantee that all jobs have been freed when all entities
+ *   and the scheduled have been destroyed. Jobs may be freed asynchronously
+ *   after this point.
+ * - Once a job is taken from the entity's queue and pushed to the hardware,
+ *   i.e. the pending queue, the entity must not be referenced any more
+ *   through the job's entity pointer. In other words, entities are not
+ *   required to outlive job execution.
+ *
+ * If the scheduler is destroyed with jobs in flight, the following
+ * happens:
+ *
+ * - Jobs that were pushed but have not yet run will be destroyed as part
+ *   of the entity cleanup (which must happen before the scheduler itself
+ *   is destroyed, per the first rule above). This signals the job
+ *   finished fence with an error flag. This process runs asynchronously
+ *   after drm_sched_entity_destroy() returns.
+ * - Jobs that are in-flight on the hardware are "detached" from their
+ *   driver fence (the fence returned from the run_job() callback). In
+ *   this case, it is up to the driver to ensure that any bookkeeping or
+ *   internal data structures have separately managed lifetimes and that
+ *   the hardware either cancels the jobs or runs them to completion.
+ *   The DRM scheduler itself will immediately signal the job complete
+ *   fence (with an error flag) and then call free_job() as part of the
+ *   cleanup process.
+ *
+ * After the scheduler is destroyed, drivers *may* (but are not required to)
+ * skip signaling their remaining driver fences, as long as they have only ever
+ * been returned to the scheduler being destroyed as the return value from
+ * run_job() and not passed anywhere else. If these fences are used in any other
+ * context, then the driver *must* signal them, per the usual fence signaling
+ * rules.
+ *
+ * Resource management
+ * -------------------
+ *
+ * Drivers may need to acquire certain hardware resources (e.g. VM IDs) in order
+ * to run a job. This process must happen during the job's prepare() callback,
+ * not in the run() callback. If any resource is unavailable at job prepare time,
+ * the driver must return a suitable fence that can be waited on to wait for the
+ * resource to (potentially) become available.
+ *
+ * In order to avoid deadlocks, drivers must always acquire resources in the
+ * same order, and release them in opposite order when a job completes or if
+ * resource acquisition fails.
  */
 
 #include <linux/kthread.h>

-- 
2.40.1


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH 1/3] drm/scheduler: Add more documentation
@ 2023-07-14  8:21   ` Asahi Lina
  0 siblings, 0 replies; 86+ messages in thread
From: Asahi Lina @ 2023-07-14  8:21 UTC (permalink / raw)
  To: Luben Tuikov, David Airlie, Daniel Vetter, Sumit Semwal,
	Christian König
  Cc: Faith Ekstrand, Asahi Lina, linux-kernel, dri-devel, asahi,
	Alyssa Rosenzweig, linux-media

Document the implied lifetime rules of the scheduler (or at least the
intended ones), as well as the expectations of how resource acquisition
should be handled.

Signed-off-by: Asahi Lina <lina@asahilina.net>
---
 drivers/gpu/drm/scheduler/sched_main.c | 58 ++++++++++++++++++++++++++++++++--
 1 file changed, 55 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 7b2bfc10c1a5..1f3bc3606239 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -43,9 +43,61 @@
  *
  * The jobs in a entity are always scheduled in the order that they were pushed.
  *
- * Note that once a job was taken from the entities queue and pushed to the
- * hardware, i.e. the pending queue, the entity must not be referenced anymore
- * through the jobs entity pointer.
+ * Lifetime rules
+ * --------------
+ *
+ * Getting object lifetimes right across the stack is critical to avoid UAF
+ * issues. The DRM scheduler has the following lifetime rules:
+ *
+ * - The scheduler must outlive all of its entities.
+ * - Jobs pushed to the scheduler are owned by it, and must only be freed
+ *   after the free_job() callback is called.
+ * - Scheduler fences are reference-counted and may outlive the scheduler.
+ * - The scheduler *may* be destroyed while jobs are still in flight.
+ * - There is no guarantee that all jobs have been freed when all entities
+ *   and the scheduled have been destroyed. Jobs may be freed asynchronously
+ *   after this point.
+ * - Once a job is taken from the entity's queue and pushed to the hardware,
+ *   i.e. the pending queue, the entity must not be referenced any more
+ *   through the job's entity pointer. In other words, entities are not
+ *   required to outlive job execution.
+ *
+ * If the scheduler is destroyed with jobs in flight, the following
+ * happens:
+ *
+ * - Jobs that were pushed but have not yet run will be destroyed as part
+ *   of the entity cleanup (which must happen before the scheduler itself
+ *   is destroyed, per the first rule above). This signals the job
+ *   finished fence with an error flag. This process runs asynchronously
+ *   after drm_sched_entity_destroy() returns.
+ * - Jobs that are in-flight on the hardware are "detached" from their
+ *   driver fence (the fence returned from the run_job() callback). In
+ *   this case, it is up to the driver to ensure that any bookkeeping or
+ *   internal data structures have separately managed lifetimes and that
+ *   the hardware either cancels the jobs or runs them to completion.
+ *   The DRM scheduler itself will immediately signal the job complete
+ *   fence (with an error flag) and then call free_job() as part of the
+ *   cleanup process.
+ *
+ * After the scheduler is destroyed, drivers *may* (but are not required to)
+ * skip signaling their remaining driver fences, as long as they have only ever
+ * been returned to the scheduler being destroyed as the return value from
+ * run_job() and not passed anywhere else. If these fences are used in any other
+ * context, then the driver *must* signal them, per the usual fence signaling
+ * rules.
+ *
+ * Resource management
+ * -------------------
+ *
+ * Drivers may need to acquire certain hardware resources (e.g. VM IDs) in order
+ * to run a job. This process must happen during the job's prepare() callback,
+ * not in the run() callback. If any resource is unavailable at job prepare time,
+ * the driver must return a suitable fence that can be waited on to wait for the
+ * resource to (potentially) become available.
+ *
+ * In order to avoid deadlocks, drivers must always acquire resources in the
+ * same order, and release them in opposite order when a job completes or if
+ * resource acquisition fails.
  */
 
 #include <linux/kthread.h>

-- 
2.40.1


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name
  2023-07-14  8:21 ` Asahi Lina
@ 2023-07-14  8:21   ` Asahi Lina
  -1 siblings, 0 replies; 86+ messages in thread
From: Asahi Lina @ 2023-07-14  8:21 UTC (permalink / raw)
  To: Luben Tuikov, David Airlie, Daniel Vetter, Sumit Semwal,
	Christian König
  Cc: Faith Ekstrand, Alyssa Rosenzweig, dri-devel, linux-kernel,
	linux-media, asahi, Asahi Lina

A signaled scheduler fence can outlive its scheduler, since fences are
independencly reference counted. Therefore, we can't reference the
scheduler in the get_timeline_name() implementation.

Fixes oopses on `cat /sys/kernel/debug/dma_buf/bufinfo` when shared
dma-bufs reference fences from GPU schedulers that no longer exist.

Signed-off-by: Asahi Lina <lina@asahilina.net>
---
 drivers/gpu/drm/scheduler/sched_entity.c | 7 ++++++-
 drivers/gpu/drm/scheduler/sched_fence.c  | 4 +++-
 include/drm/gpu_scheduler.h              | 5 +++++
 3 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index b2bbc8a68b30..17f35b0b005a 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -389,7 +389,12 @@ static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
 
 		/*
 		 * Fence is from the same scheduler, only need to wait for
-		 * it to be scheduled
+		 * it to be scheduled.
+		 *
+		 * Note: s_fence->sched could have been freed and reallocated
+		 * as another scheduler. This false positive case is okay, as if
+		 * the old scheduler was freed all of its jobs must have
+		 * signaled their completion fences.
 		 */
 		fence = dma_fence_get(&s_fence->scheduled);
 		dma_fence_put(entity->dependency);
diff --git a/drivers/gpu/drm/scheduler/sched_fence.c b/drivers/gpu/drm/scheduler/sched_fence.c
index ef120475e7c6..06a0eebcca10 100644
--- a/drivers/gpu/drm/scheduler/sched_fence.c
+++ b/drivers/gpu/drm/scheduler/sched_fence.c
@@ -68,7 +68,7 @@ static const char *drm_sched_fence_get_driver_name(struct dma_fence *fence)
 static const char *drm_sched_fence_get_timeline_name(struct dma_fence *f)
 {
 	struct drm_sched_fence *fence = to_drm_sched_fence(f);
-	return (const char *)fence->sched->name;
+	return (const char *)fence->sched_name;
 }
 
 static void drm_sched_fence_free_rcu(struct rcu_head *rcu)
@@ -216,6 +216,8 @@ void drm_sched_fence_init(struct drm_sched_fence *fence,
 	unsigned seq;
 
 	fence->sched = entity->rq->sched;
+	strlcpy(fence->sched_name, entity->rq->sched->name,
+		sizeof(fence->sched_name));
 	seq = atomic_inc_return(&entity->fence_seq);
 	dma_fence_init(&fence->scheduled, &drm_sched_fence_ops_scheduled,
 		       &fence->lock, entity->fence_context, seq);
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index e95b4837e5a3..4fa9523bd47d 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -305,6 +305,11 @@ struct drm_sched_fence {
          * @lock: the lock used by the scheduled and the finished fences.
          */
 	spinlock_t			lock;
+        /**
+         * @sched_name: the name of the scheduler that owns this fence. We
+	 * keep a copy here since fences can outlive their scheduler.
+         */
+	char sched_name[16];
         /**
          * @owner: job owner for debugging
          */

-- 
2.40.1


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name
@ 2023-07-14  8:21   ` Asahi Lina
  0 siblings, 0 replies; 86+ messages in thread
From: Asahi Lina @ 2023-07-14  8:21 UTC (permalink / raw)
  To: Luben Tuikov, David Airlie, Daniel Vetter, Sumit Semwal,
	Christian König
  Cc: Faith Ekstrand, Asahi Lina, linux-kernel, dri-devel, asahi,
	Alyssa Rosenzweig, linux-media

A signaled scheduler fence can outlive its scheduler, since fences are
independencly reference counted. Therefore, we can't reference the
scheduler in the get_timeline_name() implementation.

Fixes oopses on `cat /sys/kernel/debug/dma_buf/bufinfo` when shared
dma-bufs reference fences from GPU schedulers that no longer exist.

Signed-off-by: Asahi Lina <lina@asahilina.net>
---
 drivers/gpu/drm/scheduler/sched_entity.c | 7 ++++++-
 drivers/gpu/drm/scheduler/sched_fence.c  | 4 +++-
 include/drm/gpu_scheduler.h              | 5 +++++
 3 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index b2bbc8a68b30..17f35b0b005a 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -389,7 +389,12 @@ static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
 
 		/*
 		 * Fence is from the same scheduler, only need to wait for
-		 * it to be scheduled
+		 * it to be scheduled.
+		 *
+		 * Note: s_fence->sched could have been freed and reallocated
+		 * as another scheduler. This false positive case is okay, as if
+		 * the old scheduler was freed all of its jobs must have
+		 * signaled their completion fences.
 		 */
 		fence = dma_fence_get(&s_fence->scheduled);
 		dma_fence_put(entity->dependency);
diff --git a/drivers/gpu/drm/scheduler/sched_fence.c b/drivers/gpu/drm/scheduler/sched_fence.c
index ef120475e7c6..06a0eebcca10 100644
--- a/drivers/gpu/drm/scheduler/sched_fence.c
+++ b/drivers/gpu/drm/scheduler/sched_fence.c
@@ -68,7 +68,7 @@ static const char *drm_sched_fence_get_driver_name(struct dma_fence *fence)
 static const char *drm_sched_fence_get_timeline_name(struct dma_fence *f)
 {
 	struct drm_sched_fence *fence = to_drm_sched_fence(f);
-	return (const char *)fence->sched->name;
+	return (const char *)fence->sched_name;
 }
 
 static void drm_sched_fence_free_rcu(struct rcu_head *rcu)
@@ -216,6 +216,8 @@ void drm_sched_fence_init(struct drm_sched_fence *fence,
 	unsigned seq;
 
 	fence->sched = entity->rq->sched;
+	strlcpy(fence->sched_name, entity->rq->sched->name,
+		sizeof(fence->sched_name));
 	seq = atomic_inc_return(&entity->fence_seq);
 	dma_fence_init(&fence->scheduled, &drm_sched_fence_ops_scheduled,
 		       &fence->lock, entity->fence_context, seq);
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index e95b4837e5a3..4fa9523bd47d 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -305,6 +305,11 @@ struct drm_sched_fence {
          * @lock: the lock used by the scheduled and the finished fences.
          */
 	spinlock_t			lock;
+        /**
+         * @sched_name: the name of the scheduler that owns this fence. We
+	 * keep a copy here since fences can outlive their scheduler.
+         */
+	char sched_name[16];
         /**
          * @owner: job owner for debugging
          */

-- 
2.40.1


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH 3/3] drm/scheduler: Clean up jobs when the scheduler is torn down.
  2023-07-14  8:21 ` Asahi Lina
@ 2023-07-14  8:21   ` Asahi Lina
  -1 siblings, 0 replies; 86+ messages in thread
From: Asahi Lina @ 2023-07-14  8:21 UTC (permalink / raw)
  To: Luben Tuikov, David Airlie, Daniel Vetter, Sumit Semwal,
	Christian König
  Cc: Faith Ekstrand, Alyssa Rosenzweig, dri-devel, linux-kernel,
	linux-media, asahi, Asahi Lina

drm_sched_fini() currently leaves any pending jobs dangling, which
causes segfaults and other badness when job completion fences are
signaled after the scheduler is torn down.

Explicitly detach all jobs from their completion callbacks and free
them. This makes it possible to write a sensible safe abstraction for
drm_sched, without having to externally duplicate the tracking of
in-flight jobs.

This shouldn't regress any existing drivers, since calling
drm_sched_fini() with any pending jobs is broken and this change should
be a no-op if there are no pending jobs.

Signed-off-by: Asahi Lina <lina@asahilina.net>
---
 drivers/gpu/drm/scheduler/sched_main.c | 32 ++++++++++++++++++++++++++++++--
 1 file changed, 30 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 1f3bc3606239..a4da4aac0efd 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1186,10 +1186,38 @@ EXPORT_SYMBOL(drm_sched_init);
 void drm_sched_fini(struct drm_gpu_scheduler *sched)
 {
 	struct drm_sched_entity *s_entity;
+	struct drm_sched_job *s_job, *tmp;
 	int i;
 
-	if (sched->thread)
-		kthread_stop(sched->thread);
+	if (!sched->thread)
+		return;
+
+	/*
+	 * Stop the scheduler, detaching all jobs from their hardware callbacks
+	 * and cleaning up complete jobs.
+	 */
+	drm_sched_stop(sched, NULL);
+
+	/*
+	 * Iterate through the pending job list and free all jobs.
+	 * This assumes the driver has either guaranteed jobs are already stopped, or that
+	 * otherwise it is responsible for keeping any necessary data structures for
+	 * in-progress jobs alive even when the free_job() callback is called early (e.g. by
+	 * putting them in its own queue or doing its own refcounting).
+	 */
+	list_for_each_entry_safe(s_job, tmp, &sched->pending_list, list) {
+		spin_lock(&sched->job_list_lock);
+		list_del_init(&s_job->list);
+		spin_unlock(&sched->job_list_lock);
+
+		dma_fence_set_error(&s_job->s_fence->finished, -ESRCH);
+		drm_sched_fence_finished(s_job->s_fence);
+
+		WARN_ON(s_job->s_fence->parent);
+		sched->ops->free_job(s_job);
+	}
+
+	kthread_stop(sched->thread);
 
 	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
 		struct drm_sched_rq *rq = &sched->sched_rq[i];

-- 
2.40.1


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH 3/3] drm/scheduler: Clean up jobs when the scheduler is torn down.
@ 2023-07-14  8:21   ` Asahi Lina
  0 siblings, 0 replies; 86+ messages in thread
From: Asahi Lina @ 2023-07-14  8:21 UTC (permalink / raw)
  To: Luben Tuikov, David Airlie, Daniel Vetter, Sumit Semwal,
	Christian König
  Cc: Faith Ekstrand, Asahi Lina, linux-kernel, dri-devel, asahi,
	Alyssa Rosenzweig, linux-media

drm_sched_fini() currently leaves any pending jobs dangling, which
causes segfaults and other badness when job completion fences are
signaled after the scheduler is torn down.

Explicitly detach all jobs from their completion callbacks and free
them. This makes it possible to write a sensible safe abstraction for
drm_sched, without having to externally duplicate the tracking of
in-flight jobs.

This shouldn't regress any existing drivers, since calling
drm_sched_fini() with any pending jobs is broken and this change should
be a no-op if there are no pending jobs.

Signed-off-by: Asahi Lina <lina@asahilina.net>
---
 drivers/gpu/drm/scheduler/sched_main.c | 32 ++++++++++++++++++++++++++++++--
 1 file changed, 30 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 1f3bc3606239..a4da4aac0efd 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1186,10 +1186,38 @@ EXPORT_SYMBOL(drm_sched_init);
 void drm_sched_fini(struct drm_gpu_scheduler *sched)
 {
 	struct drm_sched_entity *s_entity;
+	struct drm_sched_job *s_job, *tmp;
 	int i;
 
-	if (sched->thread)
-		kthread_stop(sched->thread);
+	if (!sched->thread)
+		return;
+
+	/*
+	 * Stop the scheduler, detaching all jobs from their hardware callbacks
+	 * and cleaning up complete jobs.
+	 */
+	drm_sched_stop(sched, NULL);
+
+	/*
+	 * Iterate through the pending job list and free all jobs.
+	 * This assumes the driver has either guaranteed jobs are already stopped, or that
+	 * otherwise it is responsible for keeping any necessary data structures for
+	 * in-progress jobs alive even when the free_job() callback is called early (e.g. by
+	 * putting them in its own queue or doing its own refcounting).
+	 */
+	list_for_each_entry_safe(s_job, tmp, &sched->pending_list, list) {
+		spin_lock(&sched->job_list_lock);
+		list_del_init(&s_job->list);
+		spin_unlock(&sched->job_list_lock);
+
+		dma_fence_set_error(&s_job->s_fence->finished, -ESRCH);
+		drm_sched_fence_finished(s_job->s_fence);
+
+		WARN_ON(s_job->s_fence->parent);
+		sched->ops->free_job(s_job);
+	}
+
+	kthread_stop(sched->thread);
 
 	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
 		struct drm_sched_rq *rq = &sched->sched_rq[i];

-- 
2.40.1


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* Re: [PATCH 1/3] drm/scheduler: Add more documentation
  2023-07-14  8:21   ` Asahi Lina
@ 2023-07-14  8:40     ` Christian König
  -1 siblings, 0 replies; 86+ messages in thread
From: Christian König @ 2023-07-14  8:40 UTC (permalink / raw)
  To: Asahi Lina, Luben Tuikov, David Airlie, Daniel Vetter, Sumit Semwal
  Cc: Faith Ekstrand, Alyssa Rosenzweig, dri-devel, linux-kernel,
	linux-media, asahi

Am 14.07.23 um 10:21 schrieb Asahi Lina:
> Document the implied lifetime rules of the scheduler (or at least the
> intended ones), as well as the expectations of how resource acquisition
> should be handled.
>
> Signed-off-by: Asahi Lina <lina@asahilina.net>
> ---
>   drivers/gpu/drm/scheduler/sched_main.c | 58 ++++++++++++++++++++++++++++++++--
>   1 file changed, 55 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 7b2bfc10c1a5..1f3bc3606239 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -43,9 +43,61 @@
>    *
>    * The jobs in a entity are always scheduled in the order that they were pushed.
>    *
> - * Note that once a job was taken from the entities queue and pushed to the
> - * hardware, i.e. the pending queue, the entity must not be referenced anymore
> - * through the jobs entity pointer.
> + * Lifetime rules
> + * --------------
> + *
> + * Getting object lifetimes right across the stack is critical to avoid UAF
> + * issues. The DRM scheduler has the following lifetime rules:
> + *
> + * - The scheduler must outlive all of its entities.
> + * - Jobs pushed to the scheduler are owned by it, and must only be freed
> + *   after the free_job() callback is called.
> + * - Scheduler fences are reference-counted and may outlive the scheduler.

> + * - The scheduler *may* be destroyed while jobs are still in flight.

That's not correct. The scheduler can only be destroyed after all the 
entities serving it have been destroyed as well as all the jobs already 
pushed to the hw finished.

What might be possible to add is that the hw is still working on the 
already pushed jobs, but so far that was rejected as undesirable.

> + * - There is no guarantee that all jobs have been freed when all entities
> + *   and the scheduled have been destroyed. Jobs may be freed asynchronously
> + *   after this point.
> + * - Once a job is taken from the entity's queue and pushed to the hardware,
> + *   i.e. the pending queue, the entity must not be referenced any more
> + *   through the job's entity pointer. In other words, entities are not
> + *   required to outlive job execution.
> + *
> + * If the scheduler is destroyed with jobs in flight, the following
> + * happens:
> + *
> + * - Jobs that were pushed but have not yet run will be destroyed as part
> + *   of the entity cleanup (which must happen before the scheduler itself
> + *   is destroyed, per the first rule above). This signals the job
> + *   finished fence with an error flag. This process runs asynchronously
> + *   after drm_sched_entity_destroy() returns.
> + * - Jobs that are in-flight on the hardware are "detached" from their
> + *   driver fence (the fence returned from the run_job() callback). In
> + *   this case, it is up to the driver to ensure that any bookkeeping or
> + *   internal data structures have separately managed lifetimes and that
> + *   the hardware either cancels the jobs or runs them to completion.
> + *   The DRM scheduler itself will immediately signal the job complete
> + *   fence (with an error flag) and then call free_job() as part of the
> + *   cleanup process.
> + *
> + * After the scheduler is destroyed, drivers *may* (but are not required to)
> + * skip signaling their remaining driver fences, as long as they have only ever
> + * been returned to the scheduler being destroyed as the return value from
> + * run_job() and not passed anywhere else.

This is an outright NAK to this. Fences must always be cleanly signaled.

IIRC Daniel documented this as mandatory on the dma_fence behavior.

Regards,
Christian.

>   If these fences are used in any other
> + * context, then the driver *must* signal them, per the usual fence signaling
> + * rules.
> + *
> + * Resource management
> + * -------------------
> + *
> + * Drivers may need to acquire certain hardware resources (e.g. VM IDs) in order
> + * to run a job. This process must happen during the job's prepare() callback,
> + * not in the run() callback. If any resource is unavailable at job prepare time,
> + * the driver must return a suitable fence that can be waited on to wait for the
> + * resource to (potentially) become available.
> + *
> + * In order to avoid deadlocks, drivers must always acquire resources in the
> + * same order, and release them in opposite order when a job completes or if
> + * resource acquisition fails.
>    */
>   
>   #include <linux/kthread.h>
>


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 1/3] drm/scheduler: Add more documentation
@ 2023-07-14  8:40     ` Christian König
  0 siblings, 0 replies; 86+ messages in thread
From: Christian König @ 2023-07-14  8:40 UTC (permalink / raw)
  To: Asahi Lina, Luben Tuikov, David Airlie, Daniel Vetter, Sumit Semwal
  Cc: Faith Ekstrand, linux-kernel, dri-devel, asahi,
	Alyssa Rosenzweig, linux-media

Am 14.07.23 um 10:21 schrieb Asahi Lina:
> Document the implied lifetime rules of the scheduler (or at least the
> intended ones), as well as the expectations of how resource acquisition
> should be handled.
>
> Signed-off-by: Asahi Lina <lina@asahilina.net>
> ---
>   drivers/gpu/drm/scheduler/sched_main.c | 58 ++++++++++++++++++++++++++++++++--
>   1 file changed, 55 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 7b2bfc10c1a5..1f3bc3606239 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -43,9 +43,61 @@
>    *
>    * The jobs in a entity are always scheduled in the order that they were pushed.
>    *
> - * Note that once a job was taken from the entities queue and pushed to the
> - * hardware, i.e. the pending queue, the entity must not be referenced anymore
> - * through the jobs entity pointer.
> + * Lifetime rules
> + * --------------
> + *
> + * Getting object lifetimes right across the stack is critical to avoid UAF
> + * issues. The DRM scheduler has the following lifetime rules:
> + *
> + * - The scheduler must outlive all of its entities.
> + * - Jobs pushed to the scheduler are owned by it, and must only be freed
> + *   after the free_job() callback is called.
> + * - Scheduler fences are reference-counted and may outlive the scheduler.

> + * - The scheduler *may* be destroyed while jobs are still in flight.

That's not correct. The scheduler can only be destroyed after all the 
entities serving it have been destroyed as well as all the jobs already 
pushed to the hw finished.

What might be possible to add is that the hw is still working on the 
already pushed jobs, but so far that was rejected as undesirable.

> + * - There is no guarantee that all jobs have been freed when all entities
> + *   and the scheduled have been destroyed. Jobs may be freed asynchronously
> + *   after this point.
> + * - Once a job is taken from the entity's queue and pushed to the hardware,
> + *   i.e. the pending queue, the entity must not be referenced any more
> + *   through the job's entity pointer. In other words, entities are not
> + *   required to outlive job execution.
> + *
> + * If the scheduler is destroyed with jobs in flight, the following
> + * happens:
> + *
> + * - Jobs that were pushed but have not yet run will be destroyed as part
> + *   of the entity cleanup (which must happen before the scheduler itself
> + *   is destroyed, per the first rule above). This signals the job
> + *   finished fence with an error flag. This process runs asynchronously
> + *   after drm_sched_entity_destroy() returns.
> + * - Jobs that are in-flight on the hardware are "detached" from their
> + *   driver fence (the fence returned from the run_job() callback). In
> + *   this case, it is up to the driver to ensure that any bookkeeping or
> + *   internal data structures have separately managed lifetimes and that
> + *   the hardware either cancels the jobs or runs them to completion.
> + *   The DRM scheduler itself will immediately signal the job complete
> + *   fence (with an error flag) and then call free_job() as part of the
> + *   cleanup process.
> + *
> + * After the scheduler is destroyed, drivers *may* (but are not required to)
> + * skip signaling their remaining driver fences, as long as they have only ever
> + * been returned to the scheduler being destroyed as the return value from
> + * run_job() and not passed anywhere else.

This is an outright NAK to this. Fences must always be cleanly signaled.

IIRC Daniel documented this as mandatory on the dma_fence behavior.

Regards,
Christian.

>   If these fences are used in any other
> + * context, then the driver *must* signal them, per the usual fence signaling
> + * rules.
> + *
> + * Resource management
> + * -------------------
> + *
> + * Drivers may need to acquire certain hardware resources (e.g. VM IDs) in order
> + * to run a job. This process must happen during the job's prepare() callback,
> + * not in the run() callback. If any resource is unavailable at job prepare time,
> + * the driver must return a suitable fence that can be waited on to wait for the
> + * resource to (potentially) become available.
> + *
> + * In order to avoid deadlocks, drivers must always acquire resources in the
> + * same order, and release them in opposite order when a job completes or if
> + * resource acquisition fails.
>    */
>   
>   #include <linux/kthread.h>
>


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name
  2023-07-14  8:21   ` Asahi Lina
@ 2023-07-14  8:43     ` Christian König
  -1 siblings, 0 replies; 86+ messages in thread
From: Christian König @ 2023-07-14  8:43 UTC (permalink / raw)
  To: Asahi Lina, Luben Tuikov, David Airlie, Daniel Vetter, Sumit Semwal
  Cc: Faith Ekstrand, Alyssa Rosenzweig, dri-devel, linux-kernel,
	linux-media, asahi

Am 14.07.23 um 10:21 schrieb Asahi Lina:
> A signaled scheduler fence can outlive its scheduler, since fences are
> independencly reference counted. Therefore, we can't reference the
> scheduler in the get_timeline_name() implementation.
>
> Fixes oopses on `cat /sys/kernel/debug/dma_buf/bufinfo` when shared
> dma-bufs reference fences from GPU schedulers that no longer exist.
>
> Signed-off-by: Asahi Lina <lina@asahilina.net>
> ---
>   drivers/gpu/drm/scheduler/sched_entity.c | 7 ++++++-
>   drivers/gpu/drm/scheduler/sched_fence.c  | 4 +++-
>   include/drm/gpu_scheduler.h              | 5 +++++
>   3 files changed, 14 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> index b2bbc8a68b30..17f35b0b005a 100644
> --- a/drivers/gpu/drm/scheduler/sched_entity.c
> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> @@ -389,7 +389,12 @@ static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
>   
>   		/*
>   		 * Fence is from the same scheduler, only need to wait for
> -		 * it to be scheduled
> +		 * it to be scheduled.
> +		 *
> +		 * Note: s_fence->sched could have been freed and reallocated
> +		 * as another scheduler. This false positive case is okay, as if
> +		 * the old scheduler was freed all of its jobs must have
> +		 * signaled their completion fences.

This is outright nonsense. As long as an entity for a scheduler exists 
it is not allowed to free up this scheduler.

So this function can't be called like this.

>   		 */
>   		fence = dma_fence_get(&s_fence->scheduled);
>   		dma_fence_put(entity->dependency);
> diff --git a/drivers/gpu/drm/scheduler/sched_fence.c b/drivers/gpu/drm/scheduler/sched_fence.c
> index ef120475e7c6..06a0eebcca10 100644
> --- a/drivers/gpu/drm/scheduler/sched_fence.c
> +++ b/drivers/gpu/drm/scheduler/sched_fence.c
> @@ -68,7 +68,7 @@ static const char *drm_sched_fence_get_driver_name(struct dma_fence *fence)
>   static const char *drm_sched_fence_get_timeline_name(struct dma_fence *f)
>   {
>   	struct drm_sched_fence *fence = to_drm_sched_fence(f);
> -	return (const char *)fence->sched->name;
> +	return (const char *)fence->sched_name;
>   }
>   
>   static void drm_sched_fence_free_rcu(struct rcu_head *rcu)
> @@ -216,6 +216,8 @@ void drm_sched_fence_init(struct drm_sched_fence *fence,
>   	unsigned seq;
>   
>   	fence->sched = entity->rq->sched;
> +	strlcpy(fence->sched_name, entity->rq->sched->name,
> +		sizeof(fence->sched_name));
>   	seq = atomic_inc_return(&entity->fence_seq);
>   	dma_fence_init(&fence->scheduled, &drm_sched_fence_ops_scheduled,
>   		       &fence->lock, entity->fence_context, seq);
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index e95b4837e5a3..4fa9523bd47d 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -305,6 +305,11 @@ struct drm_sched_fence {
>            * @lock: the lock used by the scheduled and the finished fences.
>            */
>   	spinlock_t			lock;
> +        /**
> +         * @sched_name: the name of the scheduler that owns this fence. We
> +	 * keep a copy here since fences can outlive their scheduler.
> +         */
> +	char sched_name[16];

This just mitigates the problem, but doesn't fix it.

The real issue is that the hw fence is kept around much longer than that.

Additional to that I'm not willing to increase the scheduler fence size 
like that just to decouple them from the scheduler.

Regards,
Christian.

>           /**
>            * @owner: job owner for debugging
>            */
>


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name
@ 2023-07-14  8:43     ` Christian König
  0 siblings, 0 replies; 86+ messages in thread
From: Christian König @ 2023-07-14  8:43 UTC (permalink / raw)
  To: Asahi Lina, Luben Tuikov, David Airlie, Daniel Vetter, Sumit Semwal
  Cc: Faith Ekstrand, linux-kernel, dri-devel, asahi,
	Alyssa Rosenzweig, linux-media

Am 14.07.23 um 10:21 schrieb Asahi Lina:
> A signaled scheduler fence can outlive its scheduler, since fences are
> independencly reference counted. Therefore, we can't reference the
> scheduler in the get_timeline_name() implementation.
>
> Fixes oopses on `cat /sys/kernel/debug/dma_buf/bufinfo` when shared
> dma-bufs reference fences from GPU schedulers that no longer exist.
>
> Signed-off-by: Asahi Lina <lina@asahilina.net>
> ---
>   drivers/gpu/drm/scheduler/sched_entity.c | 7 ++++++-
>   drivers/gpu/drm/scheduler/sched_fence.c  | 4 +++-
>   include/drm/gpu_scheduler.h              | 5 +++++
>   3 files changed, 14 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> index b2bbc8a68b30..17f35b0b005a 100644
> --- a/drivers/gpu/drm/scheduler/sched_entity.c
> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> @@ -389,7 +389,12 @@ static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
>   
>   		/*
>   		 * Fence is from the same scheduler, only need to wait for
> -		 * it to be scheduled
> +		 * it to be scheduled.
> +		 *
> +		 * Note: s_fence->sched could have been freed and reallocated
> +		 * as another scheduler. This false positive case is okay, as if
> +		 * the old scheduler was freed all of its jobs must have
> +		 * signaled their completion fences.

This is outright nonsense. As long as an entity for a scheduler exists 
it is not allowed to free up this scheduler.

So this function can't be called like this.

>   		 */
>   		fence = dma_fence_get(&s_fence->scheduled);
>   		dma_fence_put(entity->dependency);
> diff --git a/drivers/gpu/drm/scheduler/sched_fence.c b/drivers/gpu/drm/scheduler/sched_fence.c
> index ef120475e7c6..06a0eebcca10 100644
> --- a/drivers/gpu/drm/scheduler/sched_fence.c
> +++ b/drivers/gpu/drm/scheduler/sched_fence.c
> @@ -68,7 +68,7 @@ static const char *drm_sched_fence_get_driver_name(struct dma_fence *fence)
>   static const char *drm_sched_fence_get_timeline_name(struct dma_fence *f)
>   {
>   	struct drm_sched_fence *fence = to_drm_sched_fence(f);
> -	return (const char *)fence->sched->name;
> +	return (const char *)fence->sched_name;
>   }
>   
>   static void drm_sched_fence_free_rcu(struct rcu_head *rcu)
> @@ -216,6 +216,8 @@ void drm_sched_fence_init(struct drm_sched_fence *fence,
>   	unsigned seq;
>   
>   	fence->sched = entity->rq->sched;
> +	strlcpy(fence->sched_name, entity->rq->sched->name,
> +		sizeof(fence->sched_name));
>   	seq = atomic_inc_return(&entity->fence_seq);
>   	dma_fence_init(&fence->scheduled, &drm_sched_fence_ops_scheduled,
>   		       &fence->lock, entity->fence_context, seq);
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index e95b4837e5a3..4fa9523bd47d 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -305,6 +305,11 @@ struct drm_sched_fence {
>            * @lock: the lock used by the scheduled and the finished fences.
>            */
>   	spinlock_t			lock;
> +        /**
> +         * @sched_name: the name of the scheduler that owns this fence. We
> +	 * keep a copy here since fences can outlive their scheduler.
> +         */
> +	char sched_name[16];

This just mitigates the problem, but doesn't fix it.

The real issue is that the hw fence is kept around much longer than that.

Additional to that I'm not willing to increase the scheduler fence size 
like that just to decouple them from the scheduler.

Regards,
Christian.

>           /**
>            * @owner: job owner for debugging
>            */
>


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 1/3] drm/scheduler: Add more documentation
  2023-07-14  8:40     ` Christian König
@ 2023-07-14  9:39       ` Asahi Lina
  -1 siblings, 0 replies; 86+ messages in thread
From: Asahi Lina @ 2023-07-14  9:39 UTC (permalink / raw)
  To: Christian König, Luben Tuikov, David Airlie, Daniel Vetter,
	Sumit Semwal
  Cc: Faith Ekstrand, Alyssa Rosenzweig, dri-devel, linux-kernel,
	linux-media, asahi

On 14/07/2023 17.40, Christian König wrote:
> Am 14.07.23 um 10:21 schrieb Asahi Lina:
>> Document the implied lifetime rules of the scheduler (or at least the
>> intended ones), as well as the expectations of how resource acquisition
>> should be handled.
>>
>> Signed-off-by: Asahi Lina <lina@asahilina.net>
>> ---
>>    drivers/gpu/drm/scheduler/sched_main.c | 58 ++++++++++++++++++++++++++++++++--
>>    1 file changed, 55 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>> index 7b2bfc10c1a5..1f3bc3606239 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -43,9 +43,61 @@
>>     *
>>     * The jobs in a entity are always scheduled in the order that they were pushed.
>>     *
>> - * Note that once a job was taken from the entities queue and pushed to the
>> - * hardware, i.e. the pending queue, the entity must not be referenced anymore
>> - * through the jobs entity pointer.
>> + * Lifetime rules
>> + * --------------
>> + *
>> + * Getting object lifetimes right across the stack is critical to avoid UAF
>> + * issues. The DRM scheduler has the following lifetime rules:
>> + *
>> + * - The scheduler must outlive all of its entities.
>> + * - Jobs pushed to the scheduler are owned by it, and must only be freed
>> + *   after the free_job() callback is called.
>> + * - Scheduler fences are reference-counted and may outlive the scheduler.
> 
>> + * - The scheduler *may* be destroyed while jobs are still in flight.
> 
> That's not correct. The scheduler can only be destroyed after all the
> entities serving it have been destroyed as well as all the jobs already
> pushed to the hw finished.

The point of this series is to change this behavior so I can actually 
use the scheduler in my use case, and that begins with formally 
documenting it as Daniel suggested. That is, I need it to be safe for 
jobs to not be yet complete before the scheduler is destroyed (the 
entities do get destroyed first, that's the first bullet point).

We already had this discussion. Without this guarantee, I cannot build a 
reasonable safe Rust abstraction. Unless you have another suggestion, as 
far as I can tell it's either this or I give up on using the DRM 
scheduler entirely and reimplement something else on my own.

> What might be possible to add is that the hw is still working on the
> already pushed jobs, but so far that was rejected as undesirable.

Where was this rejected?

>> + * - There is no guarantee that all jobs have been freed when all entities
>> + *   and the scheduled have been destroyed. Jobs may be freed asynchronously
>> + *   after this point.
>> + * - Once a job is taken from the entity's queue and pushed to the hardware,
>> + *   i.e. the pending queue, the entity must not be referenced any more
>> + *   through the job's entity pointer. In other words, entities are not
>> + *   required to outlive job execution.
>> + *
>> + * If the scheduler is destroyed with jobs in flight, the following
>> + * happens:
>> + *
>> + * - Jobs that were pushed but have not yet run will be destroyed as part
>> + *   of the entity cleanup (which must happen before the scheduler itself
>> + *   is destroyed, per the first rule above). This signals the job
>> + *   finished fence with an error flag. This process runs asynchronously
>> + *   after drm_sched_entity_destroy() returns.
>> + * - Jobs that are in-flight on the hardware are "detached" from their
>> + *   driver fence (the fence returned from the run_job() callback). In
>> + *   this case, it is up to the driver to ensure that any bookkeeping or
>> + *   internal data structures have separately managed lifetimes and that
>> + *   the hardware either cancels the jobs or runs them to completion.
>> + *   The DRM scheduler itself will immediately signal the job complete
>> + *   fence (with an error flag) and then call free_job() as part of the
>> + *   cleanup process.
>> + *
>> + * After the scheduler is destroyed, drivers *may* (but are not required to)
>> + * skip signaling their remaining driver fences, as long as they have only ever
>> + * been returned to the scheduler being destroyed as the return value from
>> + * run_job() and not passed anywhere else.
> 
> This is an outright NAK to this. Fences must always be cleanly signaled.

This is just documenting the fact that the DRM scheduler no longer cares 
about the fences after it is destroyed. I can remove it from the docs if 
you want, I don't rely on this behavior.

> IIRC Daniel documented this as mandatory on the dma_fence behavior.

Right, in the general case all dma_fences must be signaled, that's why I 
explicitly said this only applies if the scheduler is the *only* user of 
those fences.

If you don't think this should be a guarantee the scheduler officially 
makes, I'll remove it from the text.

~~ Lina


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 1/3] drm/scheduler: Add more documentation
@ 2023-07-14  9:39       ` Asahi Lina
  0 siblings, 0 replies; 86+ messages in thread
From: Asahi Lina @ 2023-07-14  9:39 UTC (permalink / raw)
  To: Christian König, Luben Tuikov, David Airlie, Daniel Vetter,
	Sumit Semwal
  Cc: Faith Ekstrand, linux-kernel, dri-devel, asahi,
	Alyssa Rosenzweig, linux-media

On 14/07/2023 17.40, Christian König wrote:
> Am 14.07.23 um 10:21 schrieb Asahi Lina:
>> Document the implied lifetime rules of the scheduler (or at least the
>> intended ones), as well as the expectations of how resource acquisition
>> should be handled.
>>
>> Signed-off-by: Asahi Lina <lina@asahilina.net>
>> ---
>>    drivers/gpu/drm/scheduler/sched_main.c | 58 ++++++++++++++++++++++++++++++++--
>>    1 file changed, 55 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>> index 7b2bfc10c1a5..1f3bc3606239 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -43,9 +43,61 @@
>>     *
>>     * The jobs in a entity are always scheduled in the order that they were pushed.
>>     *
>> - * Note that once a job was taken from the entities queue and pushed to the
>> - * hardware, i.e. the pending queue, the entity must not be referenced anymore
>> - * through the jobs entity pointer.
>> + * Lifetime rules
>> + * --------------
>> + *
>> + * Getting object lifetimes right across the stack is critical to avoid UAF
>> + * issues. The DRM scheduler has the following lifetime rules:
>> + *
>> + * - The scheduler must outlive all of its entities.
>> + * - Jobs pushed to the scheduler are owned by it, and must only be freed
>> + *   after the free_job() callback is called.
>> + * - Scheduler fences are reference-counted and may outlive the scheduler.
> 
>> + * - The scheduler *may* be destroyed while jobs are still in flight.
> 
> That's not correct. The scheduler can only be destroyed after all the
> entities serving it have been destroyed as well as all the jobs already
> pushed to the hw finished.

The point of this series is to change this behavior so I can actually 
use the scheduler in my use case, and that begins with formally 
documenting it as Daniel suggested. That is, I need it to be safe for 
jobs to not be yet complete before the scheduler is destroyed (the 
entities do get destroyed first, that's the first bullet point).

We already had this discussion. Without this guarantee, I cannot build a 
reasonable safe Rust abstraction. Unless you have another suggestion, as 
far as I can tell it's either this or I give up on using the DRM 
scheduler entirely and reimplement something else on my own.

> What might be possible to add is that the hw is still working on the
> already pushed jobs, but so far that was rejected as undesirable.

Where was this rejected?

>> + * - There is no guarantee that all jobs have been freed when all entities
>> + *   and the scheduled have been destroyed. Jobs may be freed asynchronously
>> + *   after this point.
>> + * - Once a job is taken from the entity's queue and pushed to the hardware,
>> + *   i.e. the pending queue, the entity must not be referenced any more
>> + *   through the job's entity pointer. In other words, entities are not
>> + *   required to outlive job execution.
>> + *
>> + * If the scheduler is destroyed with jobs in flight, the following
>> + * happens:
>> + *
>> + * - Jobs that were pushed but have not yet run will be destroyed as part
>> + *   of the entity cleanup (which must happen before the scheduler itself
>> + *   is destroyed, per the first rule above). This signals the job
>> + *   finished fence with an error flag. This process runs asynchronously
>> + *   after drm_sched_entity_destroy() returns.
>> + * - Jobs that are in-flight on the hardware are "detached" from their
>> + *   driver fence (the fence returned from the run_job() callback). In
>> + *   this case, it is up to the driver to ensure that any bookkeeping or
>> + *   internal data structures have separately managed lifetimes and that
>> + *   the hardware either cancels the jobs or runs them to completion.
>> + *   The DRM scheduler itself will immediately signal the job complete
>> + *   fence (with an error flag) and then call free_job() as part of the
>> + *   cleanup process.
>> + *
>> + * After the scheduler is destroyed, drivers *may* (but are not required to)
>> + * skip signaling their remaining driver fences, as long as they have only ever
>> + * been returned to the scheduler being destroyed as the return value from
>> + * run_job() and not passed anywhere else.
> 
> This is an outright NAK to this. Fences must always be cleanly signaled.

This is just documenting the fact that the DRM scheduler no longer cares 
about the fences after it is destroyed. I can remove it from the docs if 
you want, I don't rely on this behavior.

> IIRC Daniel documented this as mandatory on the dma_fence behavior.

Right, in the general case all dma_fences must be signaled, that's why I 
explicitly said this only applies if the scheduler is the *only* user of 
those fences.

If you don't think this should be a guarantee the scheduler officially 
makes, I'll remove it from the text.

~~ Lina


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name
  2023-07-14  8:43     ` Christian König
@ 2023-07-14  9:44       ` Asahi Lina
  -1 siblings, 0 replies; 86+ messages in thread
From: Asahi Lina @ 2023-07-14  9:44 UTC (permalink / raw)
  To: Christian König, Luben Tuikov, David Airlie, Daniel Vetter,
	Sumit Semwal
  Cc: Faith Ekstrand, Alyssa Rosenzweig, dri-devel, linux-kernel,
	linux-media, asahi

On 14/07/2023 17.43, Christian König wrote:
> Am 14.07.23 um 10:21 schrieb Asahi Lina:
>> A signaled scheduler fence can outlive its scheduler, since fences are
>> independencly reference counted. Therefore, we can't reference the
>> scheduler in the get_timeline_name() implementation.
>>
>> Fixes oopses on `cat /sys/kernel/debug/dma_buf/bufinfo` when shared
>> dma-bufs reference fences from GPU schedulers that no longer exist.
>>
>> Signed-off-by: Asahi Lina <lina@asahilina.net>
>> ---
>>    drivers/gpu/drm/scheduler/sched_entity.c | 7 ++++++-
>>    drivers/gpu/drm/scheduler/sched_fence.c  | 4 +++-
>>    include/drm/gpu_scheduler.h              | 5 +++++
>>    3 files changed, 14 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
>> index b2bbc8a68b30..17f35b0b005a 100644
>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>> @@ -389,7 +389,12 @@ static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
>>    
>>    		/*
>>    		 * Fence is from the same scheduler, only need to wait for
>> -		 * it to be scheduled
>> +		 * it to be scheduled.
>> +		 *
>> +		 * Note: s_fence->sched could have been freed and reallocated
>> +		 * as another scheduler. This false positive case is okay, as if
>> +		 * the old scheduler was freed all of its jobs must have
>> +		 * signaled their completion fences.
> 
> This is outright nonsense. As long as an entity for a scheduler exists
> it is not allowed to free up this scheduler.
> 
> So this function can't be called like this.
> 
>>    		 */
>>    		fence = dma_fence_get(&s_fence->scheduled);
>>    		dma_fence_put(entity->dependency);
>> diff --git a/drivers/gpu/drm/scheduler/sched_fence.c b/drivers/gpu/drm/scheduler/sched_fence.c
>> index ef120475e7c6..06a0eebcca10 100644
>> --- a/drivers/gpu/drm/scheduler/sched_fence.c
>> +++ b/drivers/gpu/drm/scheduler/sched_fence.c
>> @@ -68,7 +68,7 @@ static const char *drm_sched_fence_get_driver_name(struct dma_fence *fence)
>>    static const char *drm_sched_fence_get_timeline_name(struct dma_fence *f)
>>    {
>>    	struct drm_sched_fence *fence = to_drm_sched_fence(f);
>> -	return (const char *)fence->sched->name;
>> +	return (const char *)fence->sched_name;
>>    }
>>    
>>    static void drm_sched_fence_free_rcu(struct rcu_head *rcu)
>> @@ -216,6 +216,8 @@ void drm_sched_fence_init(struct drm_sched_fence *fence,
>>    	unsigned seq;
>>    
>>    	fence->sched = entity->rq->sched;
>> +	strlcpy(fence->sched_name, entity->rq->sched->name,
>> +		sizeof(fence->sched_name));
>>    	seq = atomic_inc_return(&entity->fence_seq);
>>    	dma_fence_init(&fence->scheduled, &drm_sched_fence_ops_scheduled,
>>    		       &fence->lock, entity->fence_context, seq);
>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>> index e95b4837e5a3..4fa9523bd47d 100644
>> --- a/include/drm/gpu_scheduler.h
>> +++ b/include/drm/gpu_scheduler.h
>> @@ -305,6 +305,11 @@ struct drm_sched_fence {
>>             * @lock: the lock used by the scheduled and the finished fences.
>>             */
>>    	spinlock_t			lock;
>> +        /**
>> +         * @sched_name: the name of the scheduler that owns this fence. We
>> +	 * keep a copy here since fences can outlive their scheduler.
>> +         */
>> +	char sched_name[16];
> 
> This just mitigates the problem, but doesn't fix it.

Could you point out any remaining issues so we can fix them? Right now 
this absolutely *is* broken and this fixes the breakage I observed. If 
there are other bugs remaining, I'd like to know what they are so I can 
fix them.

> The real issue is that the hw fence is kept around much longer than that.

As far as I know, the whole point of scheduler fences is to decouple the 
hw fences from the consumers. I already talked with Daniel about this. 
The current behavior is broken. These fences can live forever. It is 
broken to require that they outlive the driver that produced them.

> Additional to that I'm not willing to increase the scheduler fence size
> like that just to decouple them from the scheduler.

Did you read my explanation on the cover letter as to how this is just 
broken right now? We need to fix this. If you have a better suggestion 
I'll take it. Doing nothing is not an option.

~~ Lina


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name
@ 2023-07-14  9:44       ` Asahi Lina
  0 siblings, 0 replies; 86+ messages in thread
From: Asahi Lina @ 2023-07-14  9:44 UTC (permalink / raw)
  To: Christian König, Luben Tuikov, David Airlie, Daniel Vetter,
	Sumit Semwal
  Cc: Faith Ekstrand, linux-kernel, dri-devel, asahi,
	Alyssa Rosenzweig, linux-media

On 14/07/2023 17.43, Christian König wrote:
> Am 14.07.23 um 10:21 schrieb Asahi Lina:
>> A signaled scheduler fence can outlive its scheduler, since fences are
>> independencly reference counted. Therefore, we can't reference the
>> scheduler in the get_timeline_name() implementation.
>>
>> Fixes oopses on `cat /sys/kernel/debug/dma_buf/bufinfo` when shared
>> dma-bufs reference fences from GPU schedulers that no longer exist.
>>
>> Signed-off-by: Asahi Lina <lina@asahilina.net>
>> ---
>>    drivers/gpu/drm/scheduler/sched_entity.c | 7 ++++++-
>>    drivers/gpu/drm/scheduler/sched_fence.c  | 4 +++-
>>    include/drm/gpu_scheduler.h              | 5 +++++
>>    3 files changed, 14 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
>> index b2bbc8a68b30..17f35b0b005a 100644
>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>> @@ -389,7 +389,12 @@ static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
>>    
>>    		/*
>>    		 * Fence is from the same scheduler, only need to wait for
>> -		 * it to be scheduled
>> +		 * it to be scheduled.
>> +		 *
>> +		 * Note: s_fence->sched could have been freed and reallocated
>> +		 * as another scheduler. This false positive case is okay, as if
>> +		 * the old scheduler was freed all of its jobs must have
>> +		 * signaled their completion fences.
> 
> This is outright nonsense. As long as an entity for a scheduler exists
> it is not allowed to free up this scheduler.
> 
> So this function can't be called like this.
> 
>>    		 */
>>    		fence = dma_fence_get(&s_fence->scheduled);
>>    		dma_fence_put(entity->dependency);
>> diff --git a/drivers/gpu/drm/scheduler/sched_fence.c b/drivers/gpu/drm/scheduler/sched_fence.c
>> index ef120475e7c6..06a0eebcca10 100644
>> --- a/drivers/gpu/drm/scheduler/sched_fence.c
>> +++ b/drivers/gpu/drm/scheduler/sched_fence.c
>> @@ -68,7 +68,7 @@ static const char *drm_sched_fence_get_driver_name(struct dma_fence *fence)
>>    static const char *drm_sched_fence_get_timeline_name(struct dma_fence *f)
>>    {
>>    	struct drm_sched_fence *fence = to_drm_sched_fence(f);
>> -	return (const char *)fence->sched->name;
>> +	return (const char *)fence->sched_name;
>>    }
>>    
>>    static void drm_sched_fence_free_rcu(struct rcu_head *rcu)
>> @@ -216,6 +216,8 @@ void drm_sched_fence_init(struct drm_sched_fence *fence,
>>    	unsigned seq;
>>    
>>    	fence->sched = entity->rq->sched;
>> +	strlcpy(fence->sched_name, entity->rq->sched->name,
>> +		sizeof(fence->sched_name));
>>    	seq = atomic_inc_return(&entity->fence_seq);
>>    	dma_fence_init(&fence->scheduled, &drm_sched_fence_ops_scheduled,
>>    		       &fence->lock, entity->fence_context, seq);
>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>> index e95b4837e5a3..4fa9523bd47d 100644
>> --- a/include/drm/gpu_scheduler.h
>> +++ b/include/drm/gpu_scheduler.h
>> @@ -305,6 +305,11 @@ struct drm_sched_fence {
>>             * @lock: the lock used by the scheduled and the finished fences.
>>             */
>>    	spinlock_t			lock;
>> +        /**
>> +         * @sched_name: the name of the scheduler that owns this fence. We
>> +	 * keep a copy here since fences can outlive their scheduler.
>> +         */
>> +	char sched_name[16];
> 
> This just mitigates the problem, but doesn't fix it.

Could you point out any remaining issues so we can fix them? Right now 
this absolutely *is* broken and this fixes the breakage I observed. If 
there are other bugs remaining, I'd like to know what they are so I can 
fix them.

> The real issue is that the hw fence is kept around much longer than that.

As far as I know, the whole point of scheduler fences is to decouple the 
hw fences from the consumers. I already talked with Daniel about this. 
The current behavior is broken. These fences can live forever. It is 
broken to require that they outlive the driver that produced them.

> Additional to that I'm not willing to increase the scheduler fence size
> like that just to decouple them from the scheduler.

Did you read my explanation on the cover letter as to how this is just 
broken right now? We need to fix this. If you have a better suggestion 
I'll take it. Doing nothing is not an option.

~~ Lina


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 1/3] drm/scheduler: Add more documentation
  2023-07-14  9:39       ` Asahi Lina
@ 2023-07-14  9:47         ` Christian König
  -1 siblings, 0 replies; 86+ messages in thread
From: Christian König @ 2023-07-14  9:47 UTC (permalink / raw)
  To: Asahi Lina, Luben Tuikov, David Airlie, Daniel Vetter, Sumit Semwal
  Cc: Faith Ekstrand, Alyssa Rosenzweig, dri-devel, linux-kernel,
	linux-media, asahi

Am 14.07.23 um 11:39 schrieb Asahi Lina:
> On 14/07/2023 17.40, Christian König wrote:
>> Am 14.07.23 um 10:21 schrieb Asahi Lina:
>>> Document the implied lifetime rules of the scheduler (or at least the
>>> intended ones), as well as the expectations of how resource acquisition
>>> should be handled.
>>>
>>> Signed-off-by: Asahi Lina <lina@asahilina.net>
>>> ---
>>>    drivers/gpu/drm/scheduler/sched_main.c | 58 
>>> ++++++++++++++++++++++++++++++++--
>>>    1 file changed, 55 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>> index 7b2bfc10c1a5..1f3bc3606239 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -43,9 +43,61 @@
>>>     *
>>>     * The jobs in a entity are always scheduled in the order that 
>>> they were pushed.
>>>     *
>>> - * Note that once a job was taken from the entities queue and 
>>> pushed to the
>>> - * hardware, i.e. the pending queue, the entity must not be 
>>> referenced anymore
>>> - * through the jobs entity pointer.
>>> + * Lifetime rules
>>> + * --------------
>>> + *
>>> + * Getting object lifetimes right across the stack is critical to 
>>> avoid UAF
>>> + * issues. The DRM scheduler has the following lifetime rules:
>>> + *
>>> + * - The scheduler must outlive all of its entities.
>>> + * - Jobs pushed to the scheduler are owned by it, and must only be 
>>> freed
>>> + *   after the free_job() callback is called.
>>> + * - Scheduler fences are reference-counted and may outlive the 
>>> scheduler.
>>
>>> + * - The scheduler *may* be destroyed while jobs are still in flight.
>>
>> That's not correct. The scheduler can only be destroyed after all the
>> entities serving it have been destroyed as well as all the jobs already
>> pushed to the hw finished.
>
> The point of this series is to change this behavior so I can actually 
> use the scheduler in my use case, and that begins with formally 
> documenting it as Daniel suggested. That is, I need it to be safe for 
> jobs to not be yet complete before the scheduler is destroyed (the 
> entities do get destroyed first, that's the first bullet point).

Yeah, but you need to document the current situation not how you like it 
to be.

Extending that comes when the functionality for this is implemented.

>
> We already had this discussion. Without this guarantee, I cannot build 
> a reasonable safe Rust abstraction. Unless you have another 
> suggestion, as far as I can tell it's either this or I give up on 
> using the DRM scheduler entirely and reimplement something else on my 
> own.
>
>> What might be possible to add is that the hw is still working on the
>> already pushed jobs, but so far that was rejected as undesirable.
>
> Where was this rejected?

Years ago. Our initial driver suspend/resume design relied on that. 
Turned out not to be a good idea

>
>>> + * - There is no guarantee that all jobs have been freed when all 
>>> entities
>>> + *   and the scheduled have been destroyed. Jobs may be freed 
>>> asynchronously
>>> + *   after this point.
>>> + * - Once a job is taken from the entity's queue and pushed to the 
>>> hardware,
>>> + *   i.e. the pending queue, the entity must not be referenced any 
>>> more
>>> + *   through the job's entity pointer. In other words, entities are 
>>> not
>>> + *   required to outlive job execution.
>>> + *
>>> + * If the scheduler is destroyed with jobs in flight, the following
>>> + * happens:
>>> + *
>>> + * - Jobs that were pushed but have not yet run will be destroyed 
>>> as part
>>> + *   of the entity cleanup (which must happen before the scheduler 
>>> itself
>>> + *   is destroyed, per the first rule above). This signals the job
>>> + *   finished fence with an error flag. This process runs 
>>> asynchronously
>>> + *   after drm_sched_entity_destroy() returns.
>>> + * - Jobs that are in-flight on the hardware are "detached" from their
>>> + *   driver fence (the fence returned from the run_job() callback). In
>>> + *   this case, it is up to the driver to ensure that any 
>>> bookkeeping or
>>> + *   internal data structures have separately managed lifetimes and 
>>> that
>>> + *   the hardware either cancels the jobs or runs them to completion.
>>> + *   The DRM scheduler itself will immediately signal the job complete
>>> + *   fence (with an error flag) and then call free_job() as part of 
>>> the
>>> + *   cleanup process.
>>> + *
>>> + * After the scheduler is destroyed, drivers *may* (but are not 
>>> required to)
>>> + * skip signaling their remaining driver fences, as long as they 
>>> have only ever
>>> + * been returned to the scheduler being destroyed as the return 
>>> value from
>>> + * run_job() and not passed anywhere else.
>>
>> This is an outright NAK to this. Fences must always be cleanly signaled.
>
> This is just documenting the fact that the DRM scheduler no longer 
> cares about the fences after it is destroyed. I can remove it from the 
> docs if you want, I don't rely on this behavior.
>
>> IIRC Daniel documented this as mandatory on the dma_fence behavior.
>
> Right, in the general case all dma_fences must be signaled, that's why 
> I explicitly said this only applies if the scheduler is the *only* 
> user of those fences.
>
> If you don't think this should be a guarantee the scheduler officially 
> makes, I'll remove it from the text.

Please drop that.

When you want to cancel fences already pushed to the hw then do so in 
the driver and signal that through the dma_fence error code.

What we can certainly add is a big warning in drm_sched_fini() when the 
hw hasn't finished it's processing.

Christian.

>
> ~~ Lina
>


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 1/3] drm/scheduler: Add more documentation
@ 2023-07-14  9:47         ` Christian König
  0 siblings, 0 replies; 86+ messages in thread
From: Christian König @ 2023-07-14  9:47 UTC (permalink / raw)
  To: Asahi Lina, Luben Tuikov, David Airlie, Daniel Vetter, Sumit Semwal
  Cc: Faith Ekstrand, linux-kernel, dri-devel, asahi,
	Alyssa Rosenzweig, linux-media

Am 14.07.23 um 11:39 schrieb Asahi Lina:
> On 14/07/2023 17.40, Christian König wrote:
>> Am 14.07.23 um 10:21 schrieb Asahi Lina:
>>> Document the implied lifetime rules of the scheduler (or at least the
>>> intended ones), as well as the expectations of how resource acquisition
>>> should be handled.
>>>
>>> Signed-off-by: Asahi Lina <lina@asahilina.net>
>>> ---
>>>    drivers/gpu/drm/scheduler/sched_main.c | 58 
>>> ++++++++++++++++++++++++++++++++--
>>>    1 file changed, 55 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>> index 7b2bfc10c1a5..1f3bc3606239 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -43,9 +43,61 @@
>>>     *
>>>     * The jobs in a entity are always scheduled in the order that 
>>> they were pushed.
>>>     *
>>> - * Note that once a job was taken from the entities queue and 
>>> pushed to the
>>> - * hardware, i.e. the pending queue, the entity must not be 
>>> referenced anymore
>>> - * through the jobs entity pointer.
>>> + * Lifetime rules
>>> + * --------------
>>> + *
>>> + * Getting object lifetimes right across the stack is critical to 
>>> avoid UAF
>>> + * issues. The DRM scheduler has the following lifetime rules:
>>> + *
>>> + * - The scheduler must outlive all of its entities.
>>> + * - Jobs pushed to the scheduler are owned by it, and must only be 
>>> freed
>>> + *   after the free_job() callback is called.
>>> + * - Scheduler fences are reference-counted and may outlive the 
>>> scheduler.
>>
>>> + * - The scheduler *may* be destroyed while jobs are still in flight.
>>
>> That's not correct. The scheduler can only be destroyed after all the
>> entities serving it have been destroyed as well as all the jobs already
>> pushed to the hw finished.
>
> The point of this series is to change this behavior so I can actually 
> use the scheduler in my use case, and that begins with formally 
> documenting it as Daniel suggested. That is, I need it to be safe for 
> jobs to not be yet complete before the scheduler is destroyed (the 
> entities do get destroyed first, that's the first bullet point).

Yeah, but you need to document the current situation not how you like it 
to be.

Extending that comes when the functionality for this is implemented.

>
> We already had this discussion. Without this guarantee, I cannot build 
> a reasonable safe Rust abstraction. Unless you have another 
> suggestion, as far as I can tell it's either this or I give up on 
> using the DRM scheduler entirely and reimplement something else on my 
> own.
>
>> What might be possible to add is that the hw is still working on the
>> already pushed jobs, but so far that was rejected as undesirable.
>
> Where was this rejected?

Years ago. Our initial driver suspend/resume design relied on that. 
Turned out not to be a good idea

>
>>> + * - There is no guarantee that all jobs have been freed when all 
>>> entities
>>> + *   and the scheduled have been destroyed. Jobs may be freed 
>>> asynchronously
>>> + *   after this point.
>>> + * - Once a job is taken from the entity's queue and pushed to the 
>>> hardware,
>>> + *   i.e. the pending queue, the entity must not be referenced any 
>>> more
>>> + *   through the job's entity pointer. In other words, entities are 
>>> not
>>> + *   required to outlive job execution.
>>> + *
>>> + * If the scheduler is destroyed with jobs in flight, the following
>>> + * happens:
>>> + *
>>> + * - Jobs that were pushed but have not yet run will be destroyed 
>>> as part
>>> + *   of the entity cleanup (which must happen before the scheduler 
>>> itself
>>> + *   is destroyed, per the first rule above). This signals the job
>>> + *   finished fence with an error flag. This process runs 
>>> asynchronously
>>> + *   after drm_sched_entity_destroy() returns.
>>> + * - Jobs that are in-flight on the hardware are "detached" from their
>>> + *   driver fence (the fence returned from the run_job() callback). In
>>> + *   this case, it is up to the driver to ensure that any 
>>> bookkeeping or
>>> + *   internal data structures have separately managed lifetimes and 
>>> that
>>> + *   the hardware either cancels the jobs or runs them to completion.
>>> + *   The DRM scheduler itself will immediately signal the job complete
>>> + *   fence (with an error flag) and then call free_job() as part of 
>>> the
>>> + *   cleanup process.
>>> + *
>>> + * After the scheduler is destroyed, drivers *may* (but are not 
>>> required to)
>>> + * skip signaling their remaining driver fences, as long as they 
>>> have only ever
>>> + * been returned to the scheduler being destroyed as the return 
>>> value from
>>> + * run_job() and not passed anywhere else.
>>
>> This is an outright NAK to this. Fences must always be cleanly signaled.
>
> This is just documenting the fact that the DRM scheduler no longer 
> cares about the fences after it is destroyed. I can remove it from the 
> docs if you want, I don't rely on this behavior.
>
>> IIRC Daniel documented this as mandatory on the dma_fence behavior.
>
> Right, in the general case all dma_fences must be signaled, that's why 
> I explicitly said this only applies if the scheduler is the *only* 
> user of those fences.
>
> If you don't think this should be a guarantee the scheduler officially 
> makes, I'll remove it from the text.

Please drop that.

When you want to cancel fences already pushed to the hw then do so in 
the driver and signal that through the dma_fence error code.

What we can certainly add is a big warning in drm_sched_fini() when the 
hw hasn't finished it's processing.

Christian.

>
> ~~ Lina
>


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name
  2023-07-14  8:43     ` Christian König
@ 2023-07-14  9:49       ` Asahi Lina
  -1 siblings, 0 replies; 86+ messages in thread
From: Asahi Lina @ 2023-07-14  9:49 UTC (permalink / raw)
  To: Christian König, Luben Tuikov, David Airlie, Daniel Vetter,
	Sumit Semwal
  Cc: Faith Ekstrand, linux-kernel, dri-devel, asahi,
	Alyssa Rosenzweig, linux-media

On 14/07/2023 17.43, Christian König wrote:
> Am 14.07.23 um 10:21 schrieb Asahi Lina:
>> A signaled scheduler fence can outlive its scheduler, since fences are
>> independencly reference counted. Therefore, we can't reference the
>> scheduler in the get_timeline_name() implementation.
>>
>> Fixes oopses on `cat /sys/kernel/debug/dma_buf/bufinfo` when shared
>> dma-bufs reference fences from GPU schedulers that no longer exist.
>>
>> Signed-off-by: Asahi Lina <lina@asahilina.net>
>> ---
>>    drivers/gpu/drm/scheduler/sched_entity.c | 7 ++++++-
>>    drivers/gpu/drm/scheduler/sched_fence.c  | 4 +++-
>>    include/drm/gpu_scheduler.h              | 5 +++++
>>    3 files changed, 14 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
>> index b2bbc8a68b30..17f35b0b005a 100644
>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>> @@ -389,7 +389,12 @@ static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
>>    
>>    		/*
>>    		 * Fence is from the same scheduler, only need to wait for
>> -		 * it to be scheduled
>> +		 * it to be scheduled.
>> +		 *
>> +		 * Note: s_fence->sched could have been freed and reallocated
>> +		 * as another scheduler. This false positive case is okay, as if
>> +		 * the old scheduler was freed all of its jobs must have
>> +		 * signaled their completion fences.
> 
> This is outright nonsense. As long as an entity for a scheduler exists
> it is not allowed to free up this scheduler.
> 
> So this function can't be called like this.

As I already explained, the fences can outlive their scheduler. That 
means *this* entity certainly exists for *this* scheduler, but the 
*dependency* fence might have come from a past scheduler that was 
already destroyed along with all of its entities, and its address reused.

Christian, I'm really getting tired of your tone. I don't appreciate 
being told my comments are "outright nonsense" when you don't even take 
the time to understand what the issue is and what I'm trying to 
do/document. If you aren't interested in working with me, I'm just going 
to give up on drm_sched, wait until Rust gets workqueue support, and 
reimplement it in Rust. You can keep your broken fence lifetime 
semantics and I'll do my own thing.

~~ Lina


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name
@ 2023-07-14  9:49       ` Asahi Lina
  0 siblings, 0 replies; 86+ messages in thread
From: Asahi Lina @ 2023-07-14  9:49 UTC (permalink / raw)
  To: Christian König, Luben Tuikov, David Airlie, Daniel Vetter,
	Sumit Semwal
  Cc: Faith Ekstrand, Alyssa Rosenzweig, dri-devel, linux-kernel,
	linux-media, asahi

On 14/07/2023 17.43, Christian König wrote:
> Am 14.07.23 um 10:21 schrieb Asahi Lina:
>> A signaled scheduler fence can outlive its scheduler, since fences are
>> independencly reference counted. Therefore, we can't reference the
>> scheduler in the get_timeline_name() implementation.
>>
>> Fixes oopses on `cat /sys/kernel/debug/dma_buf/bufinfo` when shared
>> dma-bufs reference fences from GPU schedulers that no longer exist.
>>
>> Signed-off-by: Asahi Lina <lina@asahilina.net>
>> ---
>>    drivers/gpu/drm/scheduler/sched_entity.c | 7 ++++++-
>>    drivers/gpu/drm/scheduler/sched_fence.c  | 4 +++-
>>    include/drm/gpu_scheduler.h              | 5 +++++
>>    3 files changed, 14 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
>> index b2bbc8a68b30..17f35b0b005a 100644
>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>> @@ -389,7 +389,12 @@ static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
>>    
>>    		/*
>>    		 * Fence is from the same scheduler, only need to wait for
>> -		 * it to be scheduled
>> +		 * it to be scheduled.
>> +		 *
>> +		 * Note: s_fence->sched could have been freed and reallocated
>> +		 * as another scheduler. This false positive case is okay, as if
>> +		 * the old scheduler was freed all of its jobs must have
>> +		 * signaled their completion fences.
> 
> This is outright nonsense. As long as an entity for a scheduler exists
> it is not allowed to free up this scheduler.
> 
> So this function can't be called like this.

As I already explained, the fences can outlive their scheduler. That 
means *this* entity certainly exists for *this* scheduler, but the 
*dependency* fence might have come from a past scheduler that was 
already destroyed along with all of its entities, and its address reused.

Christian, I'm really getting tired of your tone. I don't appreciate 
being told my comments are "outright nonsense" when you don't even take 
the time to understand what the issue is and what I'm trying to 
do/document. If you aren't interested in working with me, I'm just going 
to give up on drm_sched, wait until Rust gets workqueue support, and 
reimplement it in Rust. You can keep your broken fence lifetime 
semantics and I'll do my own thing.

~~ Lina


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 1/3] drm/scheduler: Add more documentation
  2023-07-14  9:47         ` Christian König
@ 2023-07-14  9:51           ` Asahi Lina
  -1 siblings, 0 replies; 86+ messages in thread
From: Asahi Lina @ 2023-07-14  9:51 UTC (permalink / raw)
  To: Christian König, Luben Tuikov, David Airlie, Daniel Vetter,
	Sumit Semwal
  Cc: Faith Ekstrand, Alyssa Rosenzweig, dri-devel, linux-kernel,
	linux-media, asahi

On 14/07/2023 18.47, Christian König wrote:
> Am 14.07.23 um 11:39 schrieb Asahi Lina:
>> On 14/07/2023 17.40, Christian König wrote:
>>> Am 14.07.23 um 10:21 schrieb Asahi Lina:
>>>> Document the implied lifetime rules of the scheduler (or at least the
>>>> intended ones), as well as the expectations of how resource acquisition
>>>> should be handled.
>>>>
>>>> Signed-off-by: Asahi Lina <lina@asahilina.net>
>>>> ---
>>>>     drivers/gpu/drm/scheduler/sched_main.c | 58
>>>> ++++++++++++++++++++++++++++++++--
>>>>     1 file changed, 55 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>> index 7b2bfc10c1a5..1f3bc3606239 100644
>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>> @@ -43,9 +43,61 @@
>>>>      *
>>>>      * The jobs in a entity are always scheduled in the order that
>>>> they were pushed.
>>>>      *
>>>> - * Note that once a job was taken from the entities queue and
>>>> pushed to the
>>>> - * hardware, i.e. the pending queue, the entity must not be
>>>> referenced anymore
>>>> - * through the jobs entity pointer.
>>>> + * Lifetime rules
>>>> + * --------------
>>>> + *
>>>> + * Getting object lifetimes right across the stack is critical to
>>>> avoid UAF
>>>> + * issues. The DRM scheduler has the following lifetime rules:
>>>> + *
>>>> + * - The scheduler must outlive all of its entities.
>>>> + * - Jobs pushed to the scheduler are owned by it, and must only be
>>>> freed
>>>> + *   after the free_job() callback is called.
>>>> + * - Scheduler fences are reference-counted and may outlive the
>>>> scheduler.
>>>
>>>> + * - The scheduler *may* be destroyed while jobs are still in flight.
>>>
>>> That's not correct. The scheduler can only be destroyed after all the
>>> entities serving it have been destroyed as well as all the jobs already
>>> pushed to the hw finished.
>>
>> The point of this series is to change this behavior so I can actually
>> use the scheduler in my use case, and that begins with formally
>> documenting it as Daniel suggested. That is, I need it to be safe for
>> jobs to not be yet complete before the scheduler is destroyed (the
>> entities do get destroyed first, that's the first bullet point).
> 
> Yeah, but you need to document the current situation not how you like it
> to be.

Daniel told me to document how I think it should be, then fix the bugs 
that make it not so. That's what this series does.

>> We already had this discussion. Without this guarantee, I cannot build
>> a reasonable safe Rust abstraction. Unless you have another
>> suggestion, as far as I can tell it's either this or I give up on
>> using the DRM scheduler entirely and reimplement something else on my
>> own.
>>
>>> What might be possible to add is that the hw is still working on the
>>> already pushed jobs, but so far that was rejected as undesirable.
>>
>> Where was this rejected?
> 
> Years ago. Our initial driver suspend/resume design relied on that.
> Turned out not to be a good idea

Times change, maybe it's time to revisit that decision?

~~ Lina


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 1/3] drm/scheduler: Add more documentation
@ 2023-07-14  9:51           ` Asahi Lina
  0 siblings, 0 replies; 86+ messages in thread
From: Asahi Lina @ 2023-07-14  9:51 UTC (permalink / raw)
  To: Christian König, Luben Tuikov, David Airlie, Daniel Vetter,
	Sumit Semwal
  Cc: Faith Ekstrand, linux-kernel, dri-devel, asahi,
	Alyssa Rosenzweig, linux-media

On 14/07/2023 18.47, Christian König wrote:
> Am 14.07.23 um 11:39 schrieb Asahi Lina:
>> On 14/07/2023 17.40, Christian König wrote:
>>> Am 14.07.23 um 10:21 schrieb Asahi Lina:
>>>> Document the implied lifetime rules of the scheduler (or at least the
>>>> intended ones), as well as the expectations of how resource acquisition
>>>> should be handled.
>>>>
>>>> Signed-off-by: Asahi Lina <lina@asahilina.net>
>>>> ---
>>>>     drivers/gpu/drm/scheduler/sched_main.c | 58
>>>> ++++++++++++++++++++++++++++++++--
>>>>     1 file changed, 55 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>> index 7b2bfc10c1a5..1f3bc3606239 100644
>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>> @@ -43,9 +43,61 @@
>>>>      *
>>>>      * The jobs in a entity are always scheduled in the order that
>>>> they were pushed.
>>>>      *
>>>> - * Note that once a job was taken from the entities queue and
>>>> pushed to the
>>>> - * hardware, i.e. the pending queue, the entity must not be
>>>> referenced anymore
>>>> - * through the jobs entity pointer.
>>>> + * Lifetime rules
>>>> + * --------------
>>>> + *
>>>> + * Getting object lifetimes right across the stack is critical to
>>>> avoid UAF
>>>> + * issues. The DRM scheduler has the following lifetime rules:
>>>> + *
>>>> + * - The scheduler must outlive all of its entities.
>>>> + * - Jobs pushed to the scheduler are owned by it, and must only be
>>>> freed
>>>> + *   after the free_job() callback is called.
>>>> + * - Scheduler fences are reference-counted and may outlive the
>>>> scheduler.
>>>
>>>> + * - The scheduler *may* be destroyed while jobs are still in flight.
>>>
>>> That's not correct. The scheduler can only be destroyed after all the
>>> entities serving it have been destroyed as well as all the jobs already
>>> pushed to the hw finished.
>>
>> The point of this series is to change this behavior so I can actually
>> use the scheduler in my use case, and that begins with formally
>> documenting it as Daniel suggested. That is, I need it to be safe for
>> jobs to not be yet complete before the scheduler is destroyed (the
>> entities do get destroyed first, that's the first bullet point).
> 
> Yeah, but you need to document the current situation not how you like it
> to be.

Daniel told me to document how I think it should be, then fix the bugs 
that make it not so. That's what this series does.

>> We already had this discussion. Without this guarantee, I cannot build
>> a reasonable safe Rust abstraction. Unless you have another
>> suggestion, as far as I can tell it's either this or I give up on
>> using the DRM scheduler entirely and reimplement something else on my
>> own.
>>
>>> What might be possible to add is that the hw is still working on the
>>> already pushed jobs, but so far that was rejected as undesirable.
>>
>> Where was this rejected?
> 
> Years ago. Our initial driver suspend/resume design relied on that.
> Turned out not to be a good idea

Times change, maybe it's time to revisit that decision?

~~ Lina


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name
  2023-07-14  9:44       ` Asahi Lina
@ 2023-07-14  9:51         ` Christian König
  -1 siblings, 0 replies; 86+ messages in thread
From: Christian König @ 2023-07-14  9:51 UTC (permalink / raw)
  To: Asahi Lina, Luben Tuikov, David Airlie, Daniel Vetter, Sumit Semwal
  Cc: Faith Ekstrand, Alyssa Rosenzweig, dri-devel, linux-kernel,
	linux-media, asahi

Am 14.07.23 um 11:44 schrieb Asahi Lina:
> On 14/07/2023 17.43, Christian König wrote:
>> Am 14.07.23 um 10:21 schrieb Asahi Lina:
>>> A signaled scheduler fence can outlive its scheduler, since fences are
>>> independencly reference counted. Therefore, we can't reference the
>>> scheduler in the get_timeline_name() implementation.
>>>
>>> Fixes oopses on `cat /sys/kernel/debug/dma_buf/bufinfo` when shared
>>> dma-bufs reference fences from GPU schedulers that no longer exist.
>>>
>>> Signed-off-by: Asahi Lina <lina@asahilina.net>
>>> ---
>>>    drivers/gpu/drm/scheduler/sched_entity.c | 7 ++++++-
>>>    drivers/gpu/drm/scheduler/sched_fence.c  | 4 +++-
>>>    include/drm/gpu_scheduler.h              | 5 +++++
>>>    3 files changed, 14 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c 
>>> b/drivers/gpu/drm/scheduler/sched_entity.c
>>> index b2bbc8a68b30..17f35b0b005a 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>> @@ -389,7 +389,12 @@ static bool 
>>> drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
>>>               /*
>>>             * Fence is from the same scheduler, only need to wait for
>>> -         * it to be scheduled
>>> +         * it to be scheduled.
>>> +         *
>>> +         * Note: s_fence->sched could have been freed and reallocated
>>> +         * as another scheduler. This false positive case is okay, 
>>> as if
>>> +         * the old scheduler was freed all of its jobs must have
>>> +         * signaled their completion fences.
>>
>> This is outright nonsense. As long as an entity for a scheduler exists
>> it is not allowed to free up this scheduler.
>>
>> So this function can't be called like this.
>>
>>>             */
>>>            fence = dma_fence_get(&s_fence->scheduled);
>>>            dma_fence_put(entity->dependency);
>>> diff --git a/drivers/gpu/drm/scheduler/sched_fence.c 
>>> b/drivers/gpu/drm/scheduler/sched_fence.c
>>> index ef120475e7c6..06a0eebcca10 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_fence.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_fence.c
>>> @@ -68,7 +68,7 @@ static const char 
>>> *drm_sched_fence_get_driver_name(struct dma_fence *fence)
>>>    static const char *drm_sched_fence_get_timeline_name(struct 
>>> dma_fence *f)
>>>    {
>>>        struct drm_sched_fence *fence = to_drm_sched_fence(f);
>>> -    return (const char *)fence->sched->name;
>>> +    return (const char *)fence->sched_name;
>>>    }
>>>       static void drm_sched_fence_free_rcu(struct rcu_head *rcu)
>>> @@ -216,6 +216,8 @@ void drm_sched_fence_init(struct drm_sched_fence 
>>> *fence,
>>>        unsigned seq;
>>>           fence->sched = entity->rq->sched;
>>> +    strlcpy(fence->sched_name, entity->rq->sched->name,
>>> +        sizeof(fence->sched_name));
>>>        seq = atomic_inc_return(&entity->fence_seq);
>>>        dma_fence_init(&fence->scheduled, 
>>> &drm_sched_fence_ops_scheduled,
>>>                   &fence->lock, entity->fence_context, seq);
>>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>>> index e95b4837e5a3..4fa9523bd47d 100644
>>> --- a/include/drm/gpu_scheduler.h
>>> +++ b/include/drm/gpu_scheduler.h
>>> @@ -305,6 +305,11 @@ struct drm_sched_fence {
>>>             * @lock: the lock used by the scheduled and the finished 
>>> fences.
>>>             */
>>>        spinlock_t            lock;
>>> +        /**
>>> +         * @sched_name: the name of the scheduler that owns this 
>>> fence. We
>>> +     * keep a copy here since fences can outlive their scheduler.
>>> +         */
>>> +    char sched_name[16];
>>
>> This just mitigates the problem, but doesn't fix it.
>
> Could you point out any remaining issues so we can fix them? Right now 
> this absolutely *is* broken and this fixes the breakage I observed. If 
> there are other bugs remaining, I'd like to know what they are so I 
> can fix them.
>
>> The real issue is that the hw fence is kept around much longer than 
>> that.
>
> As far as I know, the whole point of scheduler fences is to decouple 
> the hw fences from the consumers.

Well yes and no. The decoupling is for the signaling, it's not 
decoupling the lifetime.

> I already talked with Daniel about this. The current behavior is 
> broken. These fences can live forever. It is broken to require that 
> they outlive the driver that produced them.
>
>> Additional to that I'm not willing to increase the scheduler fence size
>> like that just to decouple them from the scheduler.
>
> Did you read my explanation on the cover letter as to how this is just 
> broken right now? We need to fix this. If you have a better suggestion 
> I'll take it. Doing nothing is not an option.

Well this isn't broken at all. This works exactly like intended, you 
just want to use it for something it wasn't made for.

That scheduler fences could be changed to outlive the scheduler which 
issued them is possible, but this is certainly a new requirement.

Especially since we need to grab additional references to make sure that 
the module isn't unloaded in such a case.

Christian.

>
> ~~ Lina
>


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name
@ 2023-07-14  9:51         ` Christian König
  0 siblings, 0 replies; 86+ messages in thread
From: Christian König @ 2023-07-14  9:51 UTC (permalink / raw)
  To: Asahi Lina, Luben Tuikov, David Airlie, Daniel Vetter, Sumit Semwal
  Cc: Faith Ekstrand, linux-kernel, dri-devel, asahi,
	Alyssa Rosenzweig, linux-media

Am 14.07.23 um 11:44 schrieb Asahi Lina:
> On 14/07/2023 17.43, Christian König wrote:
>> Am 14.07.23 um 10:21 schrieb Asahi Lina:
>>> A signaled scheduler fence can outlive its scheduler, since fences are
>>> independencly reference counted. Therefore, we can't reference the
>>> scheduler in the get_timeline_name() implementation.
>>>
>>> Fixes oopses on `cat /sys/kernel/debug/dma_buf/bufinfo` when shared
>>> dma-bufs reference fences from GPU schedulers that no longer exist.
>>>
>>> Signed-off-by: Asahi Lina <lina@asahilina.net>
>>> ---
>>>    drivers/gpu/drm/scheduler/sched_entity.c | 7 ++++++-
>>>    drivers/gpu/drm/scheduler/sched_fence.c  | 4 +++-
>>>    include/drm/gpu_scheduler.h              | 5 +++++
>>>    3 files changed, 14 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c 
>>> b/drivers/gpu/drm/scheduler/sched_entity.c
>>> index b2bbc8a68b30..17f35b0b005a 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>> @@ -389,7 +389,12 @@ static bool 
>>> drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
>>>               /*
>>>             * Fence is from the same scheduler, only need to wait for
>>> -         * it to be scheduled
>>> +         * it to be scheduled.
>>> +         *
>>> +         * Note: s_fence->sched could have been freed and reallocated
>>> +         * as another scheduler. This false positive case is okay, 
>>> as if
>>> +         * the old scheduler was freed all of its jobs must have
>>> +         * signaled their completion fences.
>>
>> This is outright nonsense. As long as an entity for a scheduler exists
>> it is not allowed to free up this scheduler.
>>
>> So this function can't be called like this.
>>
>>>             */
>>>            fence = dma_fence_get(&s_fence->scheduled);
>>>            dma_fence_put(entity->dependency);
>>> diff --git a/drivers/gpu/drm/scheduler/sched_fence.c 
>>> b/drivers/gpu/drm/scheduler/sched_fence.c
>>> index ef120475e7c6..06a0eebcca10 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_fence.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_fence.c
>>> @@ -68,7 +68,7 @@ static const char 
>>> *drm_sched_fence_get_driver_name(struct dma_fence *fence)
>>>    static const char *drm_sched_fence_get_timeline_name(struct 
>>> dma_fence *f)
>>>    {
>>>        struct drm_sched_fence *fence = to_drm_sched_fence(f);
>>> -    return (const char *)fence->sched->name;
>>> +    return (const char *)fence->sched_name;
>>>    }
>>>       static void drm_sched_fence_free_rcu(struct rcu_head *rcu)
>>> @@ -216,6 +216,8 @@ void drm_sched_fence_init(struct drm_sched_fence 
>>> *fence,
>>>        unsigned seq;
>>>           fence->sched = entity->rq->sched;
>>> +    strlcpy(fence->sched_name, entity->rq->sched->name,
>>> +        sizeof(fence->sched_name));
>>>        seq = atomic_inc_return(&entity->fence_seq);
>>>        dma_fence_init(&fence->scheduled, 
>>> &drm_sched_fence_ops_scheduled,
>>>                   &fence->lock, entity->fence_context, seq);
>>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>>> index e95b4837e5a3..4fa9523bd47d 100644
>>> --- a/include/drm/gpu_scheduler.h
>>> +++ b/include/drm/gpu_scheduler.h
>>> @@ -305,6 +305,11 @@ struct drm_sched_fence {
>>>             * @lock: the lock used by the scheduled and the finished 
>>> fences.
>>>             */
>>>        spinlock_t            lock;
>>> +        /**
>>> +         * @sched_name: the name of the scheduler that owns this 
>>> fence. We
>>> +     * keep a copy here since fences can outlive their scheduler.
>>> +         */
>>> +    char sched_name[16];
>>
>> This just mitigates the problem, but doesn't fix it.
>
> Could you point out any remaining issues so we can fix them? Right now 
> this absolutely *is* broken and this fixes the breakage I observed. If 
> there are other bugs remaining, I'd like to know what they are so I 
> can fix them.
>
>> The real issue is that the hw fence is kept around much longer than 
>> that.
>
> As far as I know, the whole point of scheduler fences is to decouple 
> the hw fences from the consumers.

Well yes and no. The decoupling is for the signaling, it's not 
decoupling the lifetime.

> I already talked with Daniel about this. The current behavior is 
> broken. These fences can live forever. It is broken to require that 
> they outlive the driver that produced them.
>
>> Additional to that I'm not willing to increase the scheduler fence size
>> like that just to decouple them from the scheduler.
>
> Did you read my explanation on the cover letter as to how this is just 
> broken right now? We need to fix this. If you have a better suggestion 
> I'll take it. Doing nothing is not an option.

Well this isn't broken at all. This works exactly like intended, you 
just want to use it for something it wasn't made for.

That scheduler fences could be changed to outlive the scheduler which 
issued them is possible, but this is certainly a new requirement.

Especially since we need to grab additional references to make sure that 
the module isn't unloaded in such a case.

Christian.

>
> ~~ Lina
>


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name
  2023-07-14  9:49       ` Asahi Lina
@ 2023-07-14  9:57         ` Christian König
  -1 siblings, 0 replies; 86+ messages in thread
From: Christian König @ 2023-07-14  9:57 UTC (permalink / raw)
  To: Asahi Lina, Luben Tuikov, David Airlie, Daniel Vetter, Sumit Semwal
  Cc: Faith Ekstrand, Alyssa Rosenzweig, dri-devel, linux-kernel,
	linux-media, asahi

Am 14.07.23 um 11:49 schrieb Asahi Lina:
> On 14/07/2023 17.43, Christian König wrote:
>> Am 14.07.23 um 10:21 schrieb Asahi Lina:
>>> A signaled scheduler fence can outlive its scheduler, since fences are
>>> independencly reference counted. Therefore, we can't reference the
>>> scheduler in the get_timeline_name() implementation.
>>>
>>> Fixes oopses on `cat /sys/kernel/debug/dma_buf/bufinfo` when shared
>>> dma-bufs reference fences from GPU schedulers that no longer exist.
>>>
>>> Signed-off-by: Asahi Lina <lina@asahilina.net>
>>> ---
>>>    drivers/gpu/drm/scheduler/sched_entity.c | 7 ++++++-
>>>    drivers/gpu/drm/scheduler/sched_fence.c  | 4 +++-
>>>    include/drm/gpu_scheduler.h              | 5 +++++
>>>    3 files changed, 14 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c 
>>> b/drivers/gpu/drm/scheduler/sched_entity.c
>>> index b2bbc8a68b30..17f35b0b005a 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>> @@ -389,7 +389,12 @@ static bool 
>>> drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
>>>               /*
>>>             * Fence is from the same scheduler, only need to wait for
>>> -         * it to be scheduled
>>> +         * it to be scheduled.
>>> +         *
>>> +         * Note: s_fence->sched could have been freed and reallocated
>>> +         * as another scheduler. This false positive case is okay, 
>>> as if
>>> +         * the old scheduler was freed all of its jobs must have
>>> +         * signaled their completion fences.
>>
>> This is outright nonsense. As long as an entity for a scheduler exists
>> it is not allowed to free up this scheduler.
>>
>> So this function can't be called like this.
>
> As I already explained, the fences can outlive their scheduler. That 
> means *this* entity certainly exists for *this* scheduler, but the 
> *dependency* fence might have come from a past scheduler that was 
> already destroyed along with all of its entities, and its address reused.

Well this is function is not about fences, this function is a callback 
for the entity.

>
> Christian, I'm really getting tired of your tone. I don't appreciate 
> being told my comments are "outright nonsense" when you don't even 
> take the time to understand what the issue is and what I'm trying to 
> do/document. If you aren't interested in working with me, I'm just 
> going to give up on drm_sched, wait until Rust gets workqueue support, 
> and reimplement it in Rust. You can keep your broken fence lifetime 
> semantics and I'll do my own thing.

I'm certainly trying to help here, but you seem to have unrealistic 
expectations.

I perfectly understand what you are trying to do, but you don't seem to 
understand that this functionality here isn't made for your use case.

We can adjust the functionality to better match your requirements, but 
you can't say it is broken because it doesn't work when you use it not 
in the way it is intended to be used.

You can go ahead and try to re-implement the functionality in Rust, but 
then I would reject that pointing out that this should probably be an 
extension to the existing code.

Christian.

>
> ~~ Lina
>


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name
@ 2023-07-14  9:57         ` Christian König
  0 siblings, 0 replies; 86+ messages in thread
From: Christian König @ 2023-07-14  9:57 UTC (permalink / raw)
  To: Asahi Lina, Luben Tuikov, David Airlie, Daniel Vetter, Sumit Semwal
  Cc: Faith Ekstrand, linux-kernel, dri-devel, asahi,
	Alyssa Rosenzweig, linux-media

Am 14.07.23 um 11:49 schrieb Asahi Lina:
> On 14/07/2023 17.43, Christian König wrote:
>> Am 14.07.23 um 10:21 schrieb Asahi Lina:
>>> A signaled scheduler fence can outlive its scheduler, since fences are
>>> independencly reference counted. Therefore, we can't reference the
>>> scheduler in the get_timeline_name() implementation.
>>>
>>> Fixes oopses on `cat /sys/kernel/debug/dma_buf/bufinfo` when shared
>>> dma-bufs reference fences from GPU schedulers that no longer exist.
>>>
>>> Signed-off-by: Asahi Lina <lina@asahilina.net>
>>> ---
>>>    drivers/gpu/drm/scheduler/sched_entity.c | 7 ++++++-
>>>    drivers/gpu/drm/scheduler/sched_fence.c  | 4 +++-
>>>    include/drm/gpu_scheduler.h              | 5 +++++
>>>    3 files changed, 14 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c 
>>> b/drivers/gpu/drm/scheduler/sched_entity.c
>>> index b2bbc8a68b30..17f35b0b005a 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>> @@ -389,7 +389,12 @@ static bool 
>>> drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
>>>               /*
>>>             * Fence is from the same scheduler, only need to wait for
>>> -         * it to be scheduled
>>> +         * it to be scheduled.
>>> +         *
>>> +         * Note: s_fence->sched could have been freed and reallocated
>>> +         * as another scheduler. This false positive case is okay, 
>>> as if
>>> +         * the old scheduler was freed all of its jobs must have
>>> +         * signaled their completion fences.
>>
>> This is outright nonsense. As long as an entity for a scheduler exists
>> it is not allowed to free up this scheduler.
>>
>> So this function can't be called like this.
>
> As I already explained, the fences can outlive their scheduler. That 
> means *this* entity certainly exists for *this* scheduler, but the 
> *dependency* fence might have come from a past scheduler that was 
> already destroyed along with all of its entities, and its address reused.

Well this is function is not about fences, this function is a callback 
for the entity.

>
> Christian, I'm really getting tired of your tone. I don't appreciate 
> being told my comments are "outright nonsense" when you don't even 
> take the time to understand what the issue is and what I'm trying to 
> do/document. If you aren't interested in working with me, I'm just 
> going to give up on drm_sched, wait until Rust gets workqueue support, 
> and reimplement it in Rust. You can keep your broken fence lifetime 
> semantics and I'll do my own thing.

I'm certainly trying to help here, but you seem to have unrealistic 
expectations.

I perfectly understand what you are trying to do, but you don't seem to 
understand that this functionality here isn't made for your use case.

We can adjust the functionality to better match your requirements, but 
you can't say it is broken because it doesn't work when you use it not 
in the way it is intended to be used.

You can go ahead and try to re-implement the functionality in Rust, but 
then I would reject that pointing out that this should probably be an 
extension to the existing code.

Christian.

>
> ~~ Lina
>


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name
  2023-07-14  9:57         ` Christian König
@ 2023-07-14 10:06           ` Asahi Lina
  -1 siblings, 0 replies; 86+ messages in thread
From: Asahi Lina @ 2023-07-14 10:06 UTC (permalink / raw)
  To: Christian König, Luben Tuikov, David Airlie, Daniel Vetter,
	Sumit Semwal
  Cc: Faith Ekstrand, Alyssa Rosenzweig, dri-devel, linux-kernel,
	linux-media, asahi

On 14/07/2023 18.57, Christian König wrote:
> Am 14.07.23 um 11:49 schrieb Asahi Lina:
>> On 14/07/2023 17.43, Christian König wrote:
>>> Am 14.07.23 um 10:21 schrieb Asahi Lina:
>>>> A signaled scheduler fence can outlive its scheduler, since fences are
>>>> independencly reference counted. Therefore, we can't reference the
>>>> scheduler in the get_timeline_name() implementation.
>>>>
>>>> Fixes oopses on `cat /sys/kernel/debug/dma_buf/bufinfo` when shared
>>>> dma-bufs reference fences from GPU schedulers that no longer exist.
>>>>
>>>> Signed-off-by: Asahi Lina <lina@asahilina.net>
>>>> ---
>>>>     drivers/gpu/drm/scheduler/sched_entity.c | 7 ++++++-
>>>>     drivers/gpu/drm/scheduler/sched_fence.c  | 4 +++-
>>>>     include/drm/gpu_scheduler.h              | 5 +++++
>>>>     3 files changed, 14 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
>>>> b/drivers/gpu/drm/scheduler/sched_entity.c
>>>> index b2bbc8a68b30..17f35b0b005a 100644
>>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>>> @@ -389,7 +389,12 @@ static bool
>>>> drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
>>>>                /*
>>>>              * Fence is from the same scheduler, only need to wait for
>>>> -         * it to be scheduled
>>>> +         * it to be scheduled.
>>>> +         *
>>>> +         * Note: s_fence->sched could have been freed and reallocated
>>>> +         * as another scheduler. This false positive case is okay,
>>>> as if
>>>> +         * the old scheduler was freed all of its jobs must have
>>>> +         * signaled their completion fences.
>>>
>>> This is outright nonsense. As long as an entity for a scheduler exists
>>> it is not allowed to free up this scheduler.
>>>
>>> So this function can't be called like this.
>>
>> As I already explained, the fences can outlive their scheduler. That
>> means *this* entity certainly exists for *this* scheduler, but the
>> *dependency* fence might have come from a past scheduler that was
>> already destroyed along with all of its entities, and its address reused.
> 
> Well this is function is not about fences, this function is a callback
> for the entity.

That deals with dependency fences, which could have come from any 
arbitrary source, including another entity and another scheduler.

>>
>> Christian, I'm really getting tired of your tone. I don't appreciate
>> being told my comments are "outright nonsense" when you don't even
>> take the time to understand what the issue is and what I'm trying to
>> do/document. If you aren't interested in working with me, I'm just
>> going to give up on drm_sched, wait until Rust gets workqueue support,
>> and reimplement it in Rust. You can keep your broken fence lifetime
>> semantics and I'll do my own thing.
> 
> I'm certainly trying to help here, but you seem to have unrealistic
> expectations.

I don't think expecting not to be told my changes are "outright 
nonsense" is an unrealistic expectation. If you think it is, maybe 
that's yet another indicator of the culture problems the kernel 
community has...

> I perfectly understand what you are trying to do, but you don't seem to
> understand that this functionality here isn't made for your use case.

I do, that's why I'm trying to change things. Right now, this 
functionality isn't even properly documented, which is why I thought it 
could be used for my use case, and slowly discovered otherwise. Daniel 
suggested documenting it, then fixing the mismatches between 
documentation and reality, which is what I'm doing here.

> We can adjust the functionality to better match your requirements, but
> you can't say it is broken because it doesn't work when you use it not
> in the way it is intended to be used.

I'm saying the idea that a random dma-buf holds onto a chain of 
references that prevents unloading a driver module that wrote into it 
(and keeps a bunch of random unrelated objects alive) is a broken state 
of affairs. It may or may not trickle down to actual problems for users 
(I would bet it does in some cases but I don't know for sure), but it's 
a situation that doesn't make any sense.

I know I'm triggering actual breakage with my new use case due to this, 
which is why I'm trying to improve things. But the current state of 
affairs just doesn't make any sense even if it isn't causing kernel 
oopses today with other drivers.

> You can go ahead and try to re-implement the functionality in Rust, but
> then I would reject that pointing out that this should probably be an
> extension to the existing code.

You keep rejecting my attempts at extending the existing code...

~~ Lina


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name
@ 2023-07-14 10:06           ` Asahi Lina
  0 siblings, 0 replies; 86+ messages in thread
From: Asahi Lina @ 2023-07-14 10:06 UTC (permalink / raw)
  To: Christian König, Luben Tuikov, David Airlie, Daniel Vetter,
	Sumit Semwal
  Cc: Faith Ekstrand, linux-kernel, dri-devel, asahi,
	Alyssa Rosenzweig, linux-media

On 14/07/2023 18.57, Christian König wrote:
> Am 14.07.23 um 11:49 schrieb Asahi Lina:
>> On 14/07/2023 17.43, Christian König wrote:
>>> Am 14.07.23 um 10:21 schrieb Asahi Lina:
>>>> A signaled scheduler fence can outlive its scheduler, since fences are
>>>> independencly reference counted. Therefore, we can't reference the
>>>> scheduler in the get_timeline_name() implementation.
>>>>
>>>> Fixes oopses on `cat /sys/kernel/debug/dma_buf/bufinfo` when shared
>>>> dma-bufs reference fences from GPU schedulers that no longer exist.
>>>>
>>>> Signed-off-by: Asahi Lina <lina@asahilina.net>
>>>> ---
>>>>     drivers/gpu/drm/scheduler/sched_entity.c | 7 ++++++-
>>>>     drivers/gpu/drm/scheduler/sched_fence.c  | 4 +++-
>>>>     include/drm/gpu_scheduler.h              | 5 +++++
>>>>     3 files changed, 14 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
>>>> b/drivers/gpu/drm/scheduler/sched_entity.c
>>>> index b2bbc8a68b30..17f35b0b005a 100644
>>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>>> @@ -389,7 +389,12 @@ static bool
>>>> drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
>>>>                /*
>>>>              * Fence is from the same scheduler, only need to wait for
>>>> -         * it to be scheduled
>>>> +         * it to be scheduled.
>>>> +         *
>>>> +         * Note: s_fence->sched could have been freed and reallocated
>>>> +         * as another scheduler. This false positive case is okay,
>>>> as if
>>>> +         * the old scheduler was freed all of its jobs must have
>>>> +         * signaled their completion fences.
>>>
>>> This is outright nonsense. As long as an entity for a scheduler exists
>>> it is not allowed to free up this scheduler.
>>>
>>> So this function can't be called like this.
>>
>> As I already explained, the fences can outlive their scheduler. That
>> means *this* entity certainly exists for *this* scheduler, but the
>> *dependency* fence might have come from a past scheduler that was
>> already destroyed along with all of its entities, and its address reused.
> 
> Well this is function is not about fences, this function is a callback
> for the entity.

That deals with dependency fences, which could have come from any 
arbitrary source, including another entity and another scheduler.

>>
>> Christian, I'm really getting tired of your tone. I don't appreciate
>> being told my comments are "outright nonsense" when you don't even
>> take the time to understand what the issue is and what I'm trying to
>> do/document. If you aren't interested in working with me, I'm just
>> going to give up on drm_sched, wait until Rust gets workqueue support,
>> and reimplement it in Rust. You can keep your broken fence lifetime
>> semantics and I'll do my own thing.
> 
> I'm certainly trying to help here, but you seem to have unrealistic
> expectations.

I don't think expecting not to be told my changes are "outright 
nonsense" is an unrealistic expectation. If you think it is, maybe 
that's yet another indicator of the culture problems the kernel 
community has...

> I perfectly understand what you are trying to do, but you don't seem to
> understand that this functionality here isn't made for your use case.

I do, that's why I'm trying to change things. Right now, this 
functionality isn't even properly documented, which is why I thought it 
could be used for my use case, and slowly discovered otherwise. Daniel 
suggested documenting it, then fixing the mismatches between 
documentation and reality, which is what I'm doing here.

> We can adjust the functionality to better match your requirements, but
> you can't say it is broken because it doesn't work when you use it not
> in the way it is intended to be used.

I'm saying the idea that a random dma-buf holds onto a chain of 
references that prevents unloading a driver module that wrote into it 
(and keeps a bunch of random unrelated objects alive) is a broken state 
of affairs. It may or may not trickle down to actual problems for users 
(I would bet it does in some cases but I don't know for sure), but it's 
a situation that doesn't make any sense.

I know I'm triggering actual breakage with my new use case due to this, 
which is why I'm trying to improve things. But the current state of 
affairs just doesn't make any sense even if it isn't causing kernel 
oopses today with other drivers.

> You can go ahead and try to re-implement the functionality in Rust, but
> then I would reject that pointing out that this should probably be an
> extension to the existing code.

You keep rejecting my attempts at extending the existing code...

~~ Lina


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name
  2023-07-14  9:51         ` Christian König
@ 2023-07-14 10:07           ` Asahi Lina
  -1 siblings, 0 replies; 86+ messages in thread
From: Asahi Lina @ 2023-07-14 10:07 UTC (permalink / raw)
  To: Christian König, Luben Tuikov, David Airlie, Daniel Vetter,
	Sumit Semwal
  Cc: Faith Ekstrand, Alyssa Rosenzweig, dri-devel, linux-kernel,
	linux-media, asahi

On 14/07/2023 18.51, Christian König wrote:
> Am 14.07.23 um 11:44 schrieb Asahi Lina:
>> On 14/07/2023 17.43, Christian König wrote:
>>> Am 14.07.23 um 10:21 schrieb Asahi Lina:
>>>> A signaled scheduler fence can outlive its scheduler, since fences are
>>>> independencly reference counted. Therefore, we can't reference the
>>>> scheduler in the get_timeline_name() implementation.
>>>>
>>>> Fixes oopses on `cat /sys/kernel/debug/dma_buf/bufinfo` when shared
>>>> dma-bufs reference fences from GPU schedulers that no longer exist.
>>>>
>>>> Signed-off-by: Asahi Lina <lina@asahilina.net>
>>>> ---
>>>>     drivers/gpu/drm/scheduler/sched_entity.c | 7 ++++++-
>>>>     drivers/gpu/drm/scheduler/sched_fence.c  | 4 +++-
>>>>     include/drm/gpu_scheduler.h              | 5 +++++
>>>>     3 files changed, 14 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
>>>> b/drivers/gpu/drm/scheduler/sched_entity.c
>>>> index b2bbc8a68b30..17f35b0b005a 100644
>>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>>> @@ -389,7 +389,12 @@ static bool
>>>> drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
>>>>                /*
>>>>              * Fence is from the same scheduler, only need to wait for
>>>> -         * it to be scheduled
>>>> +         * it to be scheduled.
>>>> +         *
>>>> +         * Note: s_fence->sched could have been freed and reallocated
>>>> +         * as another scheduler. This false positive case is okay,
>>>> as if
>>>> +         * the old scheduler was freed all of its jobs must have
>>>> +         * signaled their completion fences.
>>>
>>> This is outright nonsense. As long as an entity for a scheduler exists
>>> it is not allowed to free up this scheduler.
>>>
>>> So this function can't be called like this.
>>>
>>>>              */
>>>>             fence = dma_fence_get(&s_fence->scheduled);
>>>>             dma_fence_put(entity->dependency);
>>>> diff --git a/drivers/gpu/drm/scheduler/sched_fence.c
>>>> b/drivers/gpu/drm/scheduler/sched_fence.c
>>>> index ef120475e7c6..06a0eebcca10 100644
>>>> --- a/drivers/gpu/drm/scheduler/sched_fence.c
>>>> +++ b/drivers/gpu/drm/scheduler/sched_fence.c
>>>> @@ -68,7 +68,7 @@ static const char
>>>> *drm_sched_fence_get_driver_name(struct dma_fence *fence)
>>>>     static const char *drm_sched_fence_get_timeline_name(struct
>>>> dma_fence *f)
>>>>     {
>>>>         struct drm_sched_fence *fence = to_drm_sched_fence(f);
>>>> -    return (const char *)fence->sched->name;
>>>> +    return (const char *)fence->sched_name;
>>>>     }
>>>>        static void drm_sched_fence_free_rcu(struct rcu_head *rcu)
>>>> @@ -216,6 +216,8 @@ void drm_sched_fence_init(struct drm_sched_fence
>>>> *fence,
>>>>         unsigned seq;
>>>>            fence->sched = entity->rq->sched;
>>>> +    strlcpy(fence->sched_name, entity->rq->sched->name,
>>>> +        sizeof(fence->sched_name));
>>>>         seq = atomic_inc_return(&entity->fence_seq);
>>>>         dma_fence_init(&fence->scheduled,
>>>> &drm_sched_fence_ops_scheduled,
>>>>                    &fence->lock, entity->fence_context, seq);
>>>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>>>> index e95b4837e5a3..4fa9523bd47d 100644
>>>> --- a/include/drm/gpu_scheduler.h
>>>> +++ b/include/drm/gpu_scheduler.h
>>>> @@ -305,6 +305,11 @@ struct drm_sched_fence {
>>>>              * @lock: the lock used by the scheduled and the finished
>>>> fences.
>>>>              */
>>>>         spinlock_t            lock;
>>>> +        /**
>>>> +         * @sched_name: the name of the scheduler that owns this
>>>> fence. We
>>>> +     * keep a copy here since fences can outlive their scheduler.
>>>> +         */
>>>> +    char sched_name[16];
>>>
>>> This just mitigates the problem, but doesn't fix it.
>>
>> Could you point out any remaining issues so we can fix them? Right now
>> this absolutely *is* broken and this fixes the breakage I observed. If
>> there are other bugs remaining, I'd like to know what they are so I
>> can fix them.
>>
>>> The real issue is that the hw fence is kept around much longer than
>>> that.
>>
>> As far as I know, the whole point of scheduler fences is to decouple
>> the hw fences from the consumers.
> 
> Well yes and no. The decoupling is for the signaling, it's not
> decoupling the lifetime.

When I spoke with Daniel I understood the intent was also to decouple 
the lifetime.

>> I already talked with Daniel about this. The current behavior is
>> broken. These fences can live forever. It is broken to require that
>> they outlive the driver that produced them.
>>
>>> Additional to that I'm not willing to increase the scheduler fence size
>>> like that just to decouple them from the scheduler.
>>
>> Did you read my explanation on the cover letter as to how this is just
>> broken right now? We need to fix this. If you have a better suggestion
>> I'll take it. Doing nothing is not an option.
> 
> Well this isn't broken at all. This works exactly like intended, you
> just want to use it for something it wasn't made for.
> 
> That scheduler fences could be changed to outlive the scheduler which
> issued them is possible, but this is certainly a new requirement.
> 
> Especially since we need to grab additional references to make sure that
> the module isn't unloaded in such a case.

Yes, that's a remaining issue. The fences need to grab a module 
reference to make sure drm_sched doesn't get unloaded until they're all 
really gone. I can add that in v2.

It would also be desirable to drop the hw fence as soon as it signals, 
instead of keeping a reference to it forever.

~~ Lina


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name
@ 2023-07-14 10:07           ` Asahi Lina
  0 siblings, 0 replies; 86+ messages in thread
From: Asahi Lina @ 2023-07-14 10:07 UTC (permalink / raw)
  To: Christian König, Luben Tuikov, David Airlie, Daniel Vetter,
	Sumit Semwal
  Cc: Faith Ekstrand, linux-kernel, dri-devel, asahi,
	Alyssa Rosenzweig, linux-media

On 14/07/2023 18.51, Christian König wrote:
> Am 14.07.23 um 11:44 schrieb Asahi Lina:
>> On 14/07/2023 17.43, Christian König wrote:
>>> Am 14.07.23 um 10:21 schrieb Asahi Lina:
>>>> A signaled scheduler fence can outlive its scheduler, since fences are
>>>> independencly reference counted. Therefore, we can't reference the
>>>> scheduler in the get_timeline_name() implementation.
>>>>
>>>> Fixes oopses on `cat /sys/kernel/debug/dma_buf/bufinfo` when shared
>>>> dma-bufs reference fences from GPU schedulers that no longer exist.
>>>>
>>>> Signed-off-by: Asahi Lina <lina@asahilina.net>
>>>> ---
>>>>     drivers/gpu/drm/scheduler/sched_entity.c | 7 ++++++-
>>>>     drivers/gpu/drm/scheduler/sched_fence.c  | 4 +++-
>>>>     include/drm/gpu_scheduler.h              | 5 +++++
>>>>     3 files changed, 14 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
>>>> b/drivers/gpu/drm/scheduler/sched_entity.c
>>>> index b2bbc8a68b30..17f35b0b005a 100644
>>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>>> @@ -389,7 +389,12 @@ static bool
>>>> drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
>>>>                /*
>>>>              * Fence is from the same scheduler, only need to wait for
>>>> -         * it to be scheduled
>>>> +         * it to be scheduled.
>>>> +         *
>>>> +         * Note: s_fence->sched could have been freed and reallocated
>>>> +         * as another scheduler. This false positive case is okay,
>>>> as if
>>>> +         * the old scheduler was freed all of its jobs must have
>>>> +         * signaled their completion fences.
>>>
>>> This is outright nonsense. As long as an entity for a scheduler exists
>>> it is not allowed to free up this scheduler.
>>>
>>> So this function can't be called like this.
>>>
>>>>              */
>>>>             fence = dma_fence_get(&s_fence->scheduled);
>>>>             dma_fence_put(entity->dependency);
>>>> diff --git a/drivers/gpu/drm/scheduler/sched_fence.c
>>>> b/drivers/gpu/drm/scheduler/sched_fence.c
>>>> index ef120475e7c6..06a0eebcca10 100644
>>>> --- a/drivers/gpu/drm/scheduler/sched_fence.c
>>>> +++ b/drivers/gpu/drm/scheduler/sched_fence.c
>>>> @@ -68,7 +68,7 @@ static const char
>>>> *drm_sched_fence_get_driver_name(struct dma_fence *fence)
>>>>     static const char *drm_sched_fence_get_timeline_name(struct
>>>> dma_fence *f)
>>>>     {
>>>>         struct drm_sched_fence *fence = to_drm_sched_fence(f);
>>>> -    return (const char *)fence->sched->name;
>>>> +    return (const char *)fence->sched_name;
>>>>     }
>>>>        static void drm_sched_fence_free_rcu(struct rcu_head *rcu)
>>>> @@ -216,6 +216,8 @@ void drm_sched_fence_init(struct drm_sched_fence
>>>> *fence,
>>>>         unsigned seq;
>>>>            fence->sched = entity->rq->sched;
>>>> +    strlcpy(fence->sched_name, entity->rq->sched->name,
>>>> +        sizeof(fence->sched_name));
>>>>         seq = atomic_inc_return(&entity->fence_seq);
>>>>         dma_fence_init(&fence->scheduled,
>>>> &drm_sched_fence_ops_scheduled,
>>>>                    &fence->lock, entity->fence_context, seq);
>>>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>>>> index e95b4837e5a3..4fa9523bd47d 100644
>>>> --- a/include/drm/gpu_scheduler.h
>>>> +++ b/include/drm/gpu_scheduler.h
>>>> @@ -305,6 +305,11 @@ struct drm_sched_fence {
>>>>              * @lock: the lock used by the scheduled and the finished
>>>> fences.
>>>>              */
>>>>         spinlock_t            lock;
>>>> +        /**
>>>> +         * @sched_name: the name of the scheduler that owns this
>>>> fence. We
>>>> +     * keep a copy here since fences can outlive their scheduler.
>>>> +         */
>>>> +    char sched_name[16];
>>>
>>> This just mitigates the problem, but doesn't fix it.
>>
>> Could you point out any remaining issues so we can fix them? Right now
>> this absolutely *is* broken and this fixes the breakage I observed. If
>> there are other bugs remaining, I'd like to know what they are so I
>> can fix them.
>>
>>> The real issue is that the hw fence is kept around much longer than
>>> that.
>>
>> As far as I know, the whole point of scheduler fences is to decouple
>> the hw fences from the consumers.
> 
> Well yes and no. The decoupling is for the signaling, it's not
> decoupling the lifetime.

When I spoke with Daniel I understood the intent was also to decouple 
the lifetime.

>> I already talked with Daniel about this. The current behavior is
>> broken. These fences can live forever. It is broken to require that
>> they outlive the driver that produced them.
>>
>>> Additional to that I'm not willing to increase the scheduler fence size
>>> like that just to decouple them from the scheduler.
>>
>> Did you read my explanation on the cover letter as to how this is just
>> broken right now? We need to fix this. If you have a better suggestion
>> I'll take it. Doing nothing is not an option.
> 
> Well this isn't broken at all. This works exactly like intended, you
> just want to use it for something it wasn't made for.
> 
> That scheduler fences could be changed to outlive the scheduler which
> issued them is possible, but this is certainly a new requirement.
> 
> Especially since we need to grab additional references to make sure that
> the module isn't unloaded in such a case.

Yes, that's a remaining issue. The fences need to grab a module 
reference to make sure drm_sched doesn't get unloaded until they're all 
really gone. I can add that in v2.

It would also be desirable to drop the hw fence as soon as it signals, 
instead of keeping a reference to it forever.

~~ Lina


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name
  2023-07-14 10:06           ` Asahi Lina
@ 2023-07-14 10:18             ` Christian König
  -1 siblings, 0 replies; 86+ messages in thread
From: Christian König @ 2023-07-14 10:18 UTC (permalink / raw)
  To: Asahi Lina, Luben Tuikov, David Airlie, Daniel Vetter, Sumit Semwal
  Cc: Faith Ekstrand, Alyssa Rosenzweig, dri-devel, linux-kernel,
	linux-media, asahi

Am 14.07.23 um 12:06 schrieb Asahi Lina:
> On 14/07/2023 18.57, Christian König wrote:
>> Am 14.07.23 um 11:49 schrieb Asahi Lina:
>>> On 14/07/2023 17.43, Christian König wrote:
>>>> Am 14.07.23 um 10:21 schrieb Asahi Lina:
>>>>> A signaled scheduler fence can outlive its scheduler, since fences 
>>>>> are
>>>>> independencly reference counted. Therefore, we can't reference the
>>>>> scheduler in the get_timeline_name() implementation.
>>>>>
>>>>> Fixes oopses on `cat /sys/kernel/debug/dma_buf/bufinfo` when shared
>>>>> dma-bufs reference fences from GPU schedulers that no longer exist.
>>>>>
>>>>> Signed-off-by: Asahi Lina <lina@asahilina.net>
>>>>> ---
>>>>>     drivers/gpu/drm/scheduler/sched_entity.c | 7 ++++++-
>>>>>     drivers/gpu/drm/scheduler/sched_fence.c  | 4 +++-
>>>>>     include/drm/gpu_scheduler.h              | 5 +++++
>>>>>     3 files changed, 14 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
>>>>> b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>> index b2bbc8a68b30..17f35b0b005a 100644
>>>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>> @@ -389,7 +389,12 @@ static bool
>>>>> drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
>>>>>                /*
>>>>>              * Fence is from the same scheduler, only need to wait 
>>>>> for
>>>>> -         * it to be scheduled
>>>>> +         * it to be scheduled.
>>>>> +         *
>>>>> +         * Note: s_fence->sched could have been freed and 
>>>>> reallocated
>>>>> +         * as another scheduler. This false positive case is okay,
>>>>> as if
>>>>> +         * the old scheduler was freed all of its jobs must have
>>>>> +         * signaled their completion fences.
>>>>
>>>> This is outright nonsense. As long as an entity for a scheduler exists
>>>> it is not allowed to free up this scheduler.
>>>>
>>>> So this function can't be called like this.
>>>
>>> As I already explained, the fences can outlive their scheduler. That
>>> means *this* entity certainly exists for *this* scheduler, but the
>>> *dependency* fence might have come from a past scheduler that was
>>> already destroyed along with all of its entities, and its address 
>>> reused.
>>
>> Well this is function is not about fences, this function is a callback
>> for the entity.
>
> That deals with dependency fences, which could have come from any 
> arbitrary source, including another entity and another scheduler.

No, they can't. Signaling is certainly mandatory to happen before things 
are released even if we allow to decouple the dma_fence from it's issuer.

>
>>>
>>> Christian, I'm really getting tired of your tone. I don't appreciate
>>> being told my comments are "outright nonsense" when you don't even
>>> take the time to understand what the issue is and what I'm trying to
>>> do/document. If you aren't interested in working with me, I'm just
>>> going to give up on drm_sched, wait until Rust gets workqueue support,
>>> and reimplement it in Rust. You can keep your broken fence lifetime
>>> semantics and I'll do my own thing.
>>
>> I'm certainly trying to help here, but you seem to have unrealistic
>> expectations.
>
> I don't think expecting not to be told my changes are "outright 
> nonsense" is an unrealistic expectation. If you think it is, maybe 
> that's yet another indicator of the culture problems the kernel 
> community has...

Well I'm just pointing out that you don't seem to understand the 
background of the things and just think this is a bug instead of 
intentional behavior.

>
>> I perfectly understand what you are trying to do, but you don't seem to
>> understand that this functionality here isn't made for your use case.
>
> I do, that's why I'm trying to change things. Right now, this 
> functionality isn't even properly documented, which is why I thought 
> it could be used for my use case, and slowly discovered otherwise. 
> Daniel suggested documenting it, then fixing the mismatches between 
> documentation and reality, which is what I'm doing here.

Well I know Daniel for something like 10-15 years or so, I'm pretty sure 
that he meant that you document the existing state because otherwise 
this goes against usual patch submission approaches.

>
>> We can adjust the functionality to better match your requirements, but
>> you can't say it is broken because it doesn't work when you use it not
>> in the way it is intended to be used.
>
> I'm saying the idea that a random dma-buf holds onto a chain of 
> references that prevents unloading a driver module that wrote into it 
> (and keeps a bunch of random unrelated objects alive) is a broken 
> state of affairs.

Well no, this is intentional design. Otherwise the module and with it 
the operations pointer the fences rely on go away. We already discussed 
that over 10 years ago when Marten came up with the initial dma_fence 
design.

The resulting problems are very well known and I completely agree that 
they are undesirable, but this is how the framework works and not just 
the scheduler but the rest of the DMA-buf framework as well.

> It may or may not trickle down to actual problems for users (I would 
> bet it does in some cases but I don't know for sure), but it's a 
> situation that doesn't make any sense.
>
> I know I'm triggering actual breakage with my new use case due to 
> this, which is why I'm trying to improve things. But the current state 
> of affairs just doesn't make any sense even if it isn't causing kernel 
> oopses today with other drivers.
>
>> You can go ahead and try to re-implement the functionality in Rust, but
>> then I would reject that pointing out that this should probably be an
>> extension to the existing code.
>
> You keep rejecting my attempts at extending the existing code...

Well I will try to improve here and push you into the right direction 
instead.

Regards,
Christian.

>
> ~~ Lina
>


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name
@ 2023-07-14 10:18             ` Christian König
  0 siblings, 0 replies; 86+ messages in thread
From: Christian König @ 2023-07-14 10:18 UTC (permalink / raw)
  To: Asahi Lina, Luben Tuikov, David Airlie, Daniel Vetter, Sumit Semwal
  Cc: Faith Ekstrand, linux-kernel, dri-devel, asahi,
	Alyssa Rosenzweig, linux-media

Am 14.07.23 um 12:06 schrieb Asahi Lina:
> On 14/07/2023 18.57, Christian König wrote:
>> Am 14.07.23 um 11:49 schrieb Asahi Lina:
>>> On 14/07/2023 17.43, Christian König wrote:
>>>> Am 14.07.23 um 10:21 schrieb Asahi Lina:
>>>>> A signaled scheduler fence can outlive its scheduler, since fences 
>>>>> are
>>>>> independencly reference counted. Therefore, we can't reference the
>>>>> scheduler in the get_timeline_name() implementation.
>>>>>
>>>>> Fixes oopses on `cat /sys/kernel/debug/dma_buf/bufinfo` when shared
>>>>> dma-bufs reference fences from GPU schedulers that no longer exist.
>>>>>
>>>>> Signed-off-by: Asahi Lina <lina@asahilina.net>
>>>>> ---
>>>>>     drivers/gpu/drm/scheduler/sched_entity.c | 7 ++++++-
>>>>>     drivers/gpu/drm/scheduler/sched_fence.c  | 4 +++-
>>>>>     include/drm/gpu_scheduler.h              | 5 +++++
>>>>>     3 files changed, 14 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
>>>>> b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>> index b2bbc8a68b30..17f35b0b005a 100644
>>>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>> @@ -389,7 +389,12 @@ static bool
>>>>> drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
>>>>>                /*
>>>>>              * Fence is from the same scheduler, only need to wait 
>>>>> for
>>>>> -         * it to be scheduled
>>>>> +         * it to be scheduled.
>>>>> +         *
>>>>> +         * Note: s_fence->sched could have been freed and 
>>>>> reallocated
>>>>> +         * as another scheduler. This false positive case is okay,
>>>>> as if
>>>>> +         * the old scheduler was freed all of its jobs must have
>>>>> +         * signaled their completion fences.
>>>>
>>>> This is outright nonsense. As long as an entity for a scheduler exists
>>>> it is not allowed to free up this scheduler.
>>>>
>>>> So this function can't be called like this.
>>>
>>> As I already explained, the fences can outlive their scheduler. That
>>> means *this* entity certainly exists for *this* scheduler, but the
>>> *dependency* fence might have come from a past scheduler that was
>>> already destroyed along with all of its entities, and its address 
>>> reused.
>>
>> Well this is function is not about fences, this function is a callback
>> for the entity.
>
> That deals with dependency fences, which could have come from any 
> arbitrary source, including another entity and another scheduler.

No, they can't. Signaling is certainly mandatory to happen before things 
are released even if we allow to decouple the dma_fence from it's issuer.

>
>>>
>>> Christian, I'm really getting tired of your tone. I don't appreciate
>>> being told my comments are "outright nonsense" when you don't even
>>> take the time to understand what the issue is and what I'm trying to
>>> do/document. If you aren't interested in working with me, I'm just
>>> going to give up on drm_sched, wait until Rust gets workqueue support,
>>> and reimplement it in Rust. You can keep your broken fence lifetime
>>> semantics and I'll do my own thing.
>>
>> I'm certainly trying to help here, but you seem to have unrealistic
>> expectations.
>
> I don't think expecting not to be told my changes are "outright 
> nonsense" is an unrealistic expectation. If you think it is, maybe 
> that's yet another indicator of the culture problems the kernel 
> community has...

Well I'm just pointing out that you don't seem to understand the 
background of the things and just think this is a bug instead of 
intentional behavior.

>
>> I perfectly understand what you are trying to do, but you don't seem to
>> understand that this functionality here isn't made for your use case.
>
> I do, that's why I'm trying to change things. Right now, this 
> functionality isn't even properly documented, which is why I thought 
> it could be used for my use case, and slowly discovered otherwise. 
> Daniel suggested documenting it, then fixing the mismatches between 
> documentation and reality, which is what I'm doing here.

Well I know Daniel for something like 10-15 years or so, I'm pretty sure 
that he meant that you document the existing state because otherwise 
this goes against usual patch submission approaches.

>
>> We can adjust the functionality to better match your requirements, but
>> you can't say it is broken because it doesn't work when you use it not
>> in the way it is intended to be used.
>
> I'm saying the idea that a random dma-buf holds onto a chain of 
> references that prevents unloading a driver module that wrote into it 
> (and keeps a bunch of random unrelated objects alive) is a broken 
> state of affairs.

Well no, this is intentional design. Otherwise the module and with it 
the operations pointer the fences rely on go away. We already discussed 
that over 10 years ago when Marten came up with the initial dma_fence 
design.

The resulting problems are very well known and I completely agree that 
they are undesirable, but this is how the framework works and not just 
the scheduler but the rest of the DMA-buf framework as well.

> It may or may not trickle down to actual problems for users (I would 
> bet it does in some cases but I don't know for sure), but it's a 
> situation that doesn't make any sense.
>
> I know I'm triggering actual breakage with my new use case due to 
> this, which is why I'm trying to improve things. But the current state 
> of affairs just doesn't make any sense even if it isn't causing kernel 
> oopses today with other drivers.
>
>> You can go ahead and try to re-implement the functionality in Rust, but
>> then I would reject that pointing out that this should probably be an
>> extension to the existing code.
>
> You keep rejecting my attempts at extending the existing code...

Well I will try to improve here and push you into the right direction 
instead.

Regards,
Christian.

>
> ~~ Lina
>


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name
  2023-07-14 10:07           ` Asahi Lina
@ 2023-07-14 10:29             ` Christian König
  -1 siblings, 0 replies; 86+ messages in thread
From: Christian König @ 2023-07-14 10:29 UTC (permalink / raw)
  To: Asahi Lina, Luben Tuikov, David Airlie, Daniel Vetter, Sumit Semwal
  Cc: Faith Ekstrand, Alyssa Rosenzweig, dri-devel, linux-kernel,
	linux-media, asahi

Am 14.07.23 um 12:07 schrieb Asahi Lina:
> On 14/07/2023 18.51, Christian König wrote:
>> Am 14.07.23 um 11:44 schrieb Asahi Lina:
>>> On 14/07/2023 17.43, Christian König wrote:
>>>> Am 14.07.23 um 10:21 schrieb Asahi Lina:
>>>>> A signaled scheduler fence can outlive its scheduler, since fences 
>>>>> are
>>>>> independencly reference counted. Therefore, we can't reference the
>>>>> scheduler in the get_timeline_name() implementation.
>>>>>
>>>>> Fixes oopses on `cat /sys/kernel/debug/dma_buf/bufinfo` when shared
>>>>> dma-bufs reference fences from GPU schedulers that no longer exist.
>>>>>
>>>>> Signed-off-by: Asahi Lina <lina@asahilina.net>
>>>>> ---
>>>>>     drivers/gpu/drm/scheduler/sched_entity.c | 7 ++++++-
>>>>>     drivers/gpu/drm/scheduler/sched_fence.c  | 4 +++-
>>>>>     include/drm/gpu_scheduler.h              | 5 +++++
>>>>>     3 files changed, 14 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
>>>>> b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>> index b2bbc8a68b30..17f35b0b005a 100644
>>>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>> @@ -389,7 +389,12 @@ static bool
>>>>> drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
>>>>>                /*
>>>>>              * Fence is from the same scheduler, only need to wait 
>>>>> for
>>>>> -         * it to be scheduled
>>>>> +         * it to be scheduled.
>>>>> +         *
>>>>> +         * Note: s_fence->sched could have been freed and 
>>>>> reallocated
>>>>> +         * as another scheduler. This false positive case is okay,
>>>>> as if
>>>>> +         * the old scheduler was freed all of its jobs must have
>>>>> +         * signaled their completion fences.
>>>>
>>>> This is outright nonsense. As long as an entity for a scheduler exists
>>>> it is not allowed to free up this scheduler.
>>>>
>>>> So this function can't be called like this.
>>>>
>>>>>              */
>>>>>             fence = dma_fence_get(&s_fence->scheduled);
>>>>>             dma_fence_put(entity->dependency);
>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_fence.c
>>>>> b/drivers/gpu/drm/scheduler/sched_fence.c
>>>>> index ef120475e7c6..06a0eebcca10 100644
>>>>> --- a/drivers/gpu/drm/scheduler/sched_fence.c
>>>>> +++ b/drivers/gpu/drm/scheduler/sched_fence.c
>>>>> @@ -68,7 +68,7 @@ static const char
>>>>> *drm_sched_fence_get_driver_name(struct dma_fence *fence)
>>>>>     static const char *drm_sched_fence_get_timeline_name(struct
>>>>> dma_fence *f)
>>>>>     {
>>>>>         struct drm_sched_fence *fence = to_drm_sched_fence(f);
>>>>> -    return (const char *)fence->sched->name;
>>>>> +    return (const char *)fence->sched_name;
>>>>>     }
>>>>>        static void drm_sched_fence_free_rcu(struct rcu_head *rcu)
>>>>> @@ -216,6 +216,8 @@ void drm_sched_fence_init(struct drm_sched_fence
>>>>> *fence,
>>>>>         unsigned seq;
>>>>>            fence->sched = entity->rq->sched;
>>>>> +    strlcpy(fence->sched_name, entity->rq->sched->name,
>>>>> +        sizeof(fence->sched_name));
>>>>>         seq = atomic_inc_return(&entity->fence_seq);
>>>>>         dma_fence_init(&fence->scheduled,
>>>>> &drm_sched_fence_ops_scheduled,
>>>>>                    &fence->lock, entity->fence_context, seq);
>>>>> diff --git a/include/drm/gpu_scheduler.h 
>>>>> b/include/drm/gpu_scheduler.h
>>>>> index e95b4837e5a3..4fa9523bd47d 100644
>>>>> --- a/include/drm/gpu_scheduler.h
>>>>> +++ b/include/drm/gpu_scheduler.h
>>>>> @@ -305,6 +305,11 @@ struct drm_sched_fence {
>>>>>              * @lock: the lock used by the scheduled and the finished
>>>>> fences.
>>>>>              */
>>>>>         spinlock_t            lock;
>>>>> +        /**
>>>>> +         * @sched_name: the name of the scheduler that owns this
>>>>> fence. We
>>>>> +     * keep a copy here since fences can outlive their scheduler.
>>>>> +         */
>>>>> +    char sched_name[16];
>>>>
>>>> This just mitigates the problem, but doesn't fix it.
>>>
>>> Could you point out any remaining issues so we can fix them? Right now
>>> this absolutely *is* broken and this fixes the breakage I observed. If
>>> there are other bugs remaining, I'd like to know what they are so I
>>> can fix them.
>>>
>>>> The real issue is that the hw fence is kept around much longer than
>>>> that.
>>>
>>> As far as I know, the whole point of scheduler fences is to decouple
>>> the hw fences from the consumers.
>>
>> Well yes and no. The decoupling is for the signaling, it's not
>> decoupling the lifetime.
>
> When I spoke with Daniel I understood the intent was also to decouple 
> the lifetime.

Yes, we discussed that a long long time ago as well.

We *wanted* to decouple the dma_fence lifetime, it's just not done that 
way because of problems with that approach.

>
>>> I already talked with Daniel about this. The current behavior is
>>> broken. These fences can live forever. It is broken to require that
>>> they outlive the driver that produced them.
>>>
>>>> Additional to that I'm not willing to increase the scheduler fence 
>>>> size
>>>> like that just to decouple them from the scheduler.
>>>
>>> Did you read my explanation on the cover letter as to how this is just
>>> broken right now? We need to fix this. If you have a better suggestion
>>> I'll take it. Doing nothing is not an option.
>>
>> Well this isn't broken at all. This works exactly like intended, you
>> just want to use it for something it wasn't made for.
>>
>> That scheduler fences could be changed to outlive the scheduler which
>> issued them is possible, but this is certainly a new requirement.
>>
>> Especially since we need to grab additional references to make sure that
>> the module isn't unloaded in such a case.
>
> Yes, that's a remaining issue. The fences need to grab a module 
> reference to make sure drm_sched doesn't get unloaded until they're 
> all really gone. I can add that in v2.

You also need to come up with an idea to prevent races with the deadline 
handling. See drm_sched_fence_set_deadline_finished().

>
> It would also be desirable to drop the hw fence as soon as it signals, 
> instead of keeping a reference to it forever.

Yeah, agree. Problem here again is that this is easier said than done in 
a non-racy way.

Christian.

>
> ~~ Lina
>


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name
@ 2023-07-14 10:29             ` Christian König
  0 siblings, 0 replies; 86+ messages in thread
From: Christian König @ 2023-07-14 10:29 UTC (permalink / raw)
  To: Asahi Lina, Luben Tuikov, David Airlie, Daniel Vetter, Sumit Semwal
  Cc: Faith Ekstrand, linux-kernel, dri-devel, asahi,
	Alyssa Rosenzweig, linux-media

Am 14.07.23 um 12:07 schrieb Asahi Lina:
> On 14/07/2023 18.51, Christian König wrote:
>> Am 14.07.23 um 11:44 schrieb Asahi Lina:
>>> On 14/07/2023 17.43, Christian König wrote:
>>>> Am 14.07.23 um 10:21 schrieb Asahi Lina:
>>>>> A signaled scheduler fence can outlive its scheduler, since fences 
>>>>> are
>>>>> independencly reference counted. Therefore, we can't reference the
>>>>> scheduler in the get_timeline_name() implementation.
>>>>>
>>>>> Fixes oopses on `cat /sys/kernel/debug/dma_buf/bufinfo` when shared
>>>>> dma-bufs reference fences from GPU schedulers that no longer exist.
>>>>>
>>>>> Signed-off-by: Asahi Lina <lina@asahilina.net>
>>>>> ---
>>>>>     drivers/gpu/drm/scheduler/sched_entity.c | 7 ++++++-
>>>>>     drivers/gpu/drm/scheduler/sched_fence.c  | 4 +++-
>>>>>     include/drm/gpu_scheduler.h              | 5 +++++
>>>>>     3 files changed, 14 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
>>>>> b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>> index b2bbc8a68b30..17f35b0b005a 100644
>>>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>> @@ -389,7 +389,12 @@ static bool
>>>>> drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
>>>>>                /*
>>>>>              * Fence is from the same scheduler, only need to wait 
>>>>> for
>>>>> -         * it to be scheduled
>>>>> +         * it to be scheduled.
>>>>> +         *
>>>>> +         * Note: s_fence->sched could have been freed and 
>>>>> reallocated
>>>>> +         * as another scheduler. This false positive case is okay,
>>>>> as if
>>>>> +         * the old scheduler was freed all of its jobs must have
>>>>> +         * signaled their completion fences.
>>>>
>>>> This is outright nonsense. As long as an entity for a scheduler exists
>>>> it is not allowed to free up this scheduler.
>>>>
>>>> So this function can't be called like this.
>>>>
>>>>>              */
>>>>>             fence = dma_fence_get(&s_fence->scheduled);
>>>>>             dma_fence_put(entity->dependency);
>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_fence.c
>>>>> b/drivers/gpu/drm/scheduler/sched_fence.c
>>>>> index ef120475e7c6..06a0eebcca10 100644
>>>>> --- a/drivers/gpu/drm/scheduler/sched_fence.c
>>>>> +++ b/drivers/gpu/drm/scheduler/sched_fence.c
>>>>> @@ -68,7 +68,7 @@ static const char
>>>>> *drm_sched_fence_get_driver_name(struct dma_fence *fence)
>>>>>     static const char *drm_sched_fence_get_timeline_name(struct
>>>>> dma_fence *f)
>>>>>     {
>>>>>         struct drm_sched_fence *fence = to_drm_sched_fence(f);
>>>>> -    return (const char *)fence->sched->name;
>>>>> +    return (const char *)fence->sched_name;
>>>>>     }
>>>>>        static void drm_sched_fence_free_rcu(struct rcu_head *rcu)
>>>>> @@ -216,6 +216,8 @@ void drm_sched_fence_init(struct drm_sched_fence
>>>>> *fence,
>>>>>         unsigned seq;
>>>>>            fence->sched = entity->rq->sched;
>>>>> +    strlcpy(fence->sched_name, entity->rq->sched->name,
>>>>> +        sizeof(fence->sched_name));
>>>>>         seq = atomic_inc_return(&entity->fence_seq);
>>>>>         dma_fence_init(&fence->scheduled,
>>>>> &drm_sched_fence_ops_scheduled,
>>>>>                    &fence->lock, entity->fence_context, seq);
>>>>> diff --git a/include/drm/gpu_scheduler.h 
>>>>> b/include/drm/gpu_scheduler.h
>>>>> index e95b4837e5a3..4fa9523bd47d 100644
>>>>> --- a/include/drm/gpu_scheduler.h
>>>>> +++ b/include/drm/gpu_scheduler.h
>>>>> @@ -305,6 +305,11 @@ struct drm_sched_fence {
>>>>>              * @lock: the lock used by the scheduled and the finished
>>>>> fences.
>>>>>              */
>>>>>         spinlock_t            lock;
>>>>> +        /**
>>>>> +         * @sched_name: the name of the scheduler that owns this
>>>>> fence. We
>>>>> +     * keep a copy here since fences can outlive their scheduler.
>>>>> +         */
>>>>> +    char sched_name[16];
>>>>
>>>> This just mitigates the problem, but doesn't fix it.
>>>
>>> Could you point out any remaining issues so we can fix them? Right now
>>> this absolutely *is* broken and this fixes the breakage I observed. If
>>> there are other bugs remaining, I'd like to know what they are so I
>>> can fix them.
>>>
>>>> The real issue is that the hw fence is kept around much longer than
>>>> that.
>>>
>>> As far as I know, the whole point of scheduler fences is to decouple
>>> the hw fences from the consumers.
>>
>> Well yes and no. The decoupling is for the signaling, it's not
>> decoupling the lifetime.
>
> When I spoke with Daniel I understood the intent was also to decouple 
> the lifetime.

Yes, we discussed that a long long time ago as well.

We *wanted* to decouple the dma_fence lifetime, it's just not done that 
way because of problems with that approach.

>
>>> I already talked with Daniel about this. The current behavior is
>>> broken. These fences can live forever. It is broken to require that
>>> they outlive the driver that produced them.
>>>
>>>> Additional to that I'm not willing to increase the scheduler fence 
>>>> size
>>>> like that just to decouple them from the scheduler.
>>>
>>> Did you read my explanation on the cover letter as to how this is just
>>> broken right now? We need to fix this. If you have a better suggestion
>>> I'll take it. Doing nothing is not an option.
>>
>> Well this isn't broken at all. This works exactly like intended, you
>> just want to use it for something it wasn't made for.
>>
>> That scheduler fences could be changed to outlive the scheduler which
>> issued them is possible, but this is certainly a new requirement.
>>
>> Especially since we need to grab additional references to make sure that
>> the module isn't unloaded in such a case.
>
> Yes, that's a remaining issue. The fences need to grab a module 
> reference to make sure drm_sched doesn't get unloaded until they're 
> all really gone. I can add that in v2.

You also need to come up with an idea to prevent races with the deadline 
handling. See drm_sched_fence_set_deadline_finished().

>
> It would also be desirable to drop the hw fence as soon as it signals, 
> instead of keeping a reference to it forever.

Yeah, agree. Problem here again is that this is easier said than done in 
a non-racy way.

Christian.

>
> ~~ Lina
>


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name
  2023-07-14 10:18             ` Christian König
@ 2023-07-14 12:13               ` Asahi Lina
  -1 siblings, 0 replies; 86+ messages in thread
From: Asahi Lina @ 2023-07-14 12:13 UTC (permalink / raw)
  To: Christian König, Luben Tuikov, David Airlie, Daniel Vetter,
	Sumit Semwal
  Cc: Faith Ekstrand, Alyssa Rosenzweig, dri-devel, linux-kernel,
	linux-media, asahi

On 14/07/2023 19.18, Christian König wrote:
> Am 14.07.23 um 12:06 schrieb Asahi Lina:
>> On 14/07/2023 18.57, Christian König wrote:
>>> Am 14.07.23 um 11:49 schrieb Asahi Lina:
>>>> On 14/07/2023 17.43, Christian König wrote:
>>>>> Am 14.07.23 um 10:21 schrieb Asahi Lina:
>>>>>> A signaled scheduler fence can outlive its scheduler, since fences
>>>>>> are
>>>>>> independencly reference counted. Therefore, we can't reference the
>>>>>> scheduler in the get_timeline_name() implementation.
>>>>>>
>>>>>> Fixes oopses on `cat /sys/kernel/debug/dma_buf/bufinfo` when shared
>>>>>> dma-bufs reference fences from GPU schedulers that no longer exist.
>>>>>>
>>>>>> Signed-off-by: Asahi Lina <lina@asahilina.net>
>>>>>> ---
>>>>>>      drivers/gpu/drm/scheduler/sched_entity.c | 7 ++++++-
>>>>>>      drivers/gpu/drm/scheduler/sched_fence.c  | 4 +++-
>>>>>>      include/drm/gpu_scheduler.h              | 5 +++++
>>>>>>      3 files changed, 14 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>> b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>> index b2bbc8a68b30..17f35b0b005a 100644
>>>>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>> @@ -389,7 +389,12 @@ static bool
>>>>>> drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
>>>>>>                 /*
>>>>>>               * Fence is from the same scheduler, only need to wait
>>>>>> for
>>>>>> -         * it to be scheduled
>>>>>> +         * it to be scheduled.
>>>>>> +         *
>>>>>> +         * Note: s_fence->sched could have been freed and
>>>>>> reallocated
>>>>>> +         * as another scheduler. This false positive case is okay,
>>>>>> as if
>>>>>> +         * the old scheduler was freed all of its jobs must have
>>>>>> +         * signaled their completion fences.
>>>>>
>>>>> This is outright nonsense. As long as an entity for a scheduler exists
>>>>> it is not allowed to free up this scheduler.
>>>>>
>>>>> So this function can't be called like this.
>>>>
>>>> As I already explained, the fences can outlive their scheduler. That
>>>> means *this* entity certainly exists for *this* scheduler, but the
>>>> *dependency* fence might have come from a past scheduler that was
>>>> already destroyed along with all of its entities, and its address
>>>> reused.
>>>
>>> Well this is function is not about fences, this function is a callback
>>> for the entity.
>>
>> That deals with dependency fences, which could have come from any
>> arbitrary source, including another entity and another scheduler.
> 
> No, they can't. Signaling is certainly mandatory to happen before things
> are released even if we allow to decouple the dma_fence from it's issuer.

That's exactly what I'm saying in my comment. That the fence must be 
signaled if its creator no longer exists, therefore it's okay to 
inadvertently wait on its scheduled fence instead of its finished fence 
(if that one was intended) since everything needs to be signaled at that 
point anyway.

>>
>>>>
>>>> Christian, I'm really getting tired of your tone. I don't appreciate
>>>> being told my comments are "outright nonsense" when you don't even
>>>> take the time to understand what the issue is and what I'm trying to
>>>> do/document. If you aren't interested in working with me, I'm just
>>>> going to give up on drm_sched, wait until Rust gets workqueue support,
>>>> and reimplement it in Rust. You can keep your broken fence lifetime
>>>> semantics and I'll do my own thing.
>>>
>>> I'm certainly trying to help here, but you seem to have unrealistic
>>> expectations.
>>
>> I don't think expecting not to be told my changes are "outright
>> nonsense" is an unrealistic expectation. If you think it is, maybe
>> that's yet another indicator of the culture problems the kernel
>> community has...
> 
> Well I'm just pointing out that you don't seem to understand the
> background of the things and just think this is a bug instead of
> intentional behavior.

I made a change, I explained why that change works with a portion of the 
existing code by updating a comment, and you called that nonsense. It's 
not even a bug, I'm trying to explain why this part isn't a bug even 
with the expectation that fences don't outlive the scheduler. This is 
because I went through the code trying to find problems this approach 
would cause, ran into this tricky case, thought about it for a while, 
realized it wasn't a problem, and figured it needed a comment.

>>> I perfectly understand what you are trying to do, but you don't seem to
>>> understand that this functionality here isn't made for your use case.
>>
>> I do, that's why I'm trying to change things. Right now, this
>> functionality isn't even properly documented, which is why I thought
>> it could be used for my use case, and slowly discovered otherwise.
>> Daniel suggested documenting it, then fixing the mismatches between
>> documentation and reality, which is what I'm doing here.
> 
> Well I know Daniel for something like 10-15 years or so, I'm pretty sure
> that he meant that you document the existing state because otherwise
> this goes against usual patch submission approaches.
> 
>>
>>> We can adjust the functionality to better match your requirements, but
>>> you can't say it is broken because it doesn't work when you use it not
>>> in the way it is intended to be used.
>>
>> I'm saying the idea that a random dma-buf holds onto a chain of
>> references that prevents unloading a driver module that wrote into it
>> (and keeps a bunch of random unrelated objects alive) is a broken
>> state of affairs.
> 
> Well no, this is intentional design. Otherwise the module and with it
> the operations pointer the fences rely on go away.

But this is a drm_sched fence, not a driver fence. That's the point, 
that they should be decoupled. The driver is free to unload and only 
drm_sched would need to stay loaded so its fences continue to be valid. 
Except that's not what happens right now. Right now the drm_sched fence 
hangs onto the hw fence and the whole thing is supposed to keep the 
whole scheduler alive for things not to go boom.

> We already discussed
> that over 10 years ago when Marten came up with the initial dma_fence
> design.
> 
> The resulting problems are very well known and I completely agree that
> they are undesirable, but this is how the framework works and not just
> the scheduler but the rest of the DMA-buf framework as well.

So it's undesirable but you don't want me to change things...

> 
>> It may or may not trickle down to actual problems for users (I would
>> bet it does in some cases but I don't know for sure), but it's a
>> situation that doesn't make any sense.
>>
>> I know I'm triggering actual breakage with my new use case due to
>> this, which is why I'm trying to improve things. But the current state
>> of affairs just doesn't make any sense even if it isn't causing kernel
>> oopses today with other drivers.
>>
>>> You can go ahead and try to re-implement the functionality in Rust, but
>>> then I would reject that pointing out that this should probably be an
>>> extension to the existing code.
>>
>> You keep rejecting my attempts at extending the existing code...
> 
> Well I will try to improve here and push you into the right direction
> instead.

What is the right direction?

So far it's looking more and more like wait until we get workqueues in 
Rust, write a trivial scheduler in the driver, and give up on this whole 
drm_sched thing. Please tell me if there is a better way, because so far 
all you've done is tell me my attempts are not the right way, and 
demotivated me from working on drm_sched at all.

~~ Lina


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name
@ 2023-07-14 12:13               ` Asahi Lina
  0 siblings, 0 replies; 86+ messages in thread
From: Asahi Lina @ 2023-07-14 12:13 UTC (permalink / raw)
  To: Christian König, Luben Tuikov, David Airlie, Daniel Vetter,
	Sumit Semwal
  Cc: Faith Ekstrand, linux-kernel, dri-devel, asahi,
	Alyssa Rosenzweig, linux-media

On 14/07/2023 19.18, Christian König wrote:
> Am 14.07.23 um 12:06 schrieb Asahi Lina:
>> On 14/07/2023 18.57, Christian König wrote:
>>> Am 14.07.23 um 11:49 schrieb Asahi Lina:
>>>> On 14/07/2023 17.43, Christian König wrote:
>>>>> Am 14.07.23 um 10:21 schrieb Asahi Lina:
>>>>>> A signaled scheduler fence can outlive its scheduler, since fences
>>>>>> are
>>>>>> independencly reference counted. Therefore, we can't reference the
>>>>>> scheduler in the get_timeline_name() implementation.
>>>>>>
>>>>>> Fixes oopses on `cat /sys/kernel/debug/dma_buf/bufinfo` when shared
>>>>>> dma-bufs reference fences from GPU schedulers that no longer exist.
>>>>>>
>>>>>> Signed-off-by: Asahi Lina <lina@asahilina.net>
>>>>>> ---
>>>>>>      drivers/gpu/drm/scheduler/sched_entity.c | 7 ++++++-
>>>>>>      drivers/gpu/drm/scheduler/sched_fence.c  | 4 +++-
>>>>>>      include/drm/gpu_scheduler.h              | 5 +++++
>>>>>>      3 files changed, 14 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>> b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>> index b2bbc8a68b30..17f35b0b005a 100644
>>>>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>> @@ -389,7 +389,12 @@ static bool
>>>>>> drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
>>>>>>                 /*
>>>>>>               * Fence is from the same scheduler, only need to wait
>>>>>> for
>>>>>> -         * it to be scheduled
>>>>>> +         * it to be scheduled.
>>>>>> +         *
>>>>>> +         * Note: s_fence->sched could have been freed and
>>>>>> reallocated
>>>>>> +         * as another scheduler. This false positive case is okay,
>>>>>> as if
>>>>>> +         * the old scheduler was freed all of its jobs must have
>>>>>> +         * signaled their completion fences.
>>>>>
>>>>> This is outright nonsense. As long as an entity for a scheduler exists
>>>>> it is not allowed to free up this scheduler.
>>>>>
>>>>> So this function can't be called like this.
>>>>
>>>> As I already explained, the fences can outlive their scheduler. That
>>>> means *this* entity certainly exists for *this* scheduler, but the
>>>> *dependency* fence might have come from a past scheduler that was
>>>> already destroyed along with all of its entities, and its address
>>>> reused.
>>>
>>> Well this is function is not about fences, this function is a callback
>>> for the entity.
>>
>> That deals with dependency fences, which could have come from any
>> arbitrary source, including another entity and another scheduler.
> 
> No, they can't. Signaling is certainly mandatory to happen before things
> are released even if we allow to decouple the dma_fence from it's issuer.

That's exactly what I'm saying in my comment. That the fence must be 
signaled if its creator no longer exists, therefore it's okay to 
inadvertently wait on its scheduled fence instead of its finished fence 
(if that one was intended) since everything needs to be signaled at that 
point anyway.

>>
>>>>
>>>> Christian, I'm really getting tired of your tone. I don't appreciate
>>>> being told my comments are "outright nonsense" when you don't even
>>>> take the time to understand what the issue is and what I'm trying to
>>>> do/document. If you aren't interested in working with me, I'm just
>>>> going to give up on drm_sched, wait until Rust gets workqueue support,
>>>> and reimplement it in Rust. You can keep your broken fence lifetime
>>>> semantics and I'll do my own thing.
>>>
>>> I'm certainly trying to help here, but you seem to have unrealistic
>>> expectations.
>>
>> I don't think expecting not to be told my changes are "outright
>> nonsense" is an unrealistic expectation. If you think it is, maybe
>> that's yet another indicator of the culture problems the kernel
>> community has...
> 
> Well I'm just pointing out that you don't seem to understand the
> background of the things and just think this is a bug instead of
> intentional behavior.

I made a change, I explained why that change works with a portion of the 
existing code by updating a comment, and you called that nonsense. It's 
not even a bug, I'm trying to explain why this part isn't a bug even 
with the expectation that fences don't outlive the scheduler. This is 
because I went through the code trying to find problems this approach 
would cause, ran into this tricky case, thought about it for a while, 
realized it wasn't a problem, and figured it needed a comment.

>>> I perfectly understand what you are trying to do, but you don't seem to
>>> understand that this functionality here isn't made for your use case.
>>
>> I do, that's why I'm trying to change things. Right now, this
>> functionality isn't even properly documented, which is why I thought
>> it could be used for my use case, and slowly discovered otherwise.
>> Daniel suggested documenting it, then fixing the mismatches between
>> documentation and reality, which is what I'm doing here.
> 
> Well I know Daniel for something like 10-15 years or so, I'm pretty sure
> that he meant that you document the existing state because otherwise
> this goes against usual patch submission approaches.
> 
>>
>>> We can adjust the functionality to better match your requirements, but
>>> you can't say it is broken because it doesn't work when you use it not
>>> in the way it is intended to be used.
>>
>> I'm saying the idea that a random dma-buf holds onto a chain of
>> references that prevents unloading a driver module that wrote into it
>> (and keeps a bunch of random unrelated objects alive) is a broken
>> state of affairs.
> 
> Well no, this is intentional design. Otherwise the module and with it
> the operations pointer the fences rely on go away.

But this is a drm_sched fence, not a driver fence. That's the point, 
that they should be decoupled. The driver is free to unload and only 
drm_sched would need to stay loaded so its fences continue to be valid. 
Except that's not what happens right now. Right now the drm_sched fence 
hangs onto the hw fence and the whole thing is supposed to keep the 
whole scheduler alive for things not to go boom.

> We already discussed
> that over 10 years ago when Marten came up with the initial dma_fence
> design.
> 
> The resulting problems are very well known and I completely agree that
> they are undesirable, but this is how the framework works and not just
> the scheduler but the rest of the DMA-buf framework as well.

So it's undesirable but you don't want me to change things...

> 
>> It may or may not trickle down to actual problems for users (I would
>> bet it does in some cases but I don't know for sure), but it's a
>> situation that doesn't make any sense.
>>
>> I know I'm triggering actual breakage with my new use case due to
>> this, which is why I'm trying to improve things. But the current state
>> of affairs just doesn't make any sense even if it isn't causing kernel
>> oopses today with other drivers.
>>
>>> You can go ahead and try to re-implement the functionality in Rust, but
>>> then I would reject that pointing out that this should probably be an
>>> extension to the existing code.
>>
>> You keep rejecting my attempts at extending the existing code...
> 
> Well I will try to improve here and push you into the right direction
> instead.

What is the right direction?

So far it's looking more and more like wait until we get workqueues in 
Rust, write a trivial scheduler in the driver, and give up on this whole 
drm_sched thing. Please tell me if there is a better way, because so far 
all you've done is tell me my attempts are not the right way, and 
demotivated me from working on drm_sched at all.

~~ Lina


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name
  2023-07-14  9:57         ` Christian König
@ 2023-07-15  4:03           ` Luben Tuikov
  -1 siblings, 0 replies; 86+ messages in thread
From: Luben Tuikov @ 2023-07-15  4:03 UTC (permalink / raw)
  To: Christian König, Asahi Lina, David Airlie, Daniel Vetter,
	Sumit Semwal
  Cc: Faith Ekstrand, Alyssa Rosenzweig, dri-devel, linux-kernel,
	linux-media, asahi

On 2023-07-14 05:57, Christian König wrote:
> Am 14.07.23 um 11:49 schrieb Asahi Lina:
>> On 14/07/2023 17.43, Christian König wrote:
>>> Am 14.07.23 um 10:21 schrieb Asahi Lina:
>>>> A signaled scheduler fence can outlive its scheduler, since fences are
>>>> independencly reference counted. Therefore, we can't reference the
>>>> scheduler in the get_timeline_name() implementation.
>>>>
>>>> Fixes oopses on `cat /sys/kernel/debug/dma_buf/bufinfo` when shared
>>>> dma-bufs reference fences from GPU schedulers that no longer exist.
>>>>
>>>> Signed-off-by: Asahi Lina <lina@asahilina.net>
>>>> ---
>>>>    drivers/gpu/drm/scheduler/sched_entity.c | 7 ++++++-
>>>>    drivers/gpu/drm/scheduler/sched_fence.c  | 4 +++-
>>>>    include/drm/gpu_scheduler.h              | 5 +++++
>>>>    3 files changed, 14 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c 
>>>> b/drivers/gpu/drm/scheduler/sched_entity.c
>>>> index b2bbc8a68b30..17f35b0b005a 100644
>>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>>> @@ -389,7 +389,12 @@ static bool 
>>>> drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
>>>>               /*
>>>>             * Fence is from the same scheduler, only need to wait for
>>>> -         * it to be scheduled
>>>> +         * it to be scheduled.
>>>> +         *
>>>> +         * Note: s_fence->sched could have been freed and reallocated
>>>> +         * as another scheduler. This false positive case is okay, 
>>>> as if
>>>> +         * the old scheduler was freed all of its jobs must have
>>>> +         * signaled their completion fences.
>>>
>>> This is outright nonsense. As long as an entity for a scheduler exists
>>> it is not allowed to free up this scheduler.
>>>
>>> So this function can't be called like this.
>>
>> As I already explained, the fences can outlive their scheduler. That 
>> means *this* entity certainly exists for *this* scheduler, but the 
>> *dependency* fence might have come from a past scheduler that was 
>> already destroyed along with all of its entities, and its address reused.
> 
> Well this is function is not about fences, this function is a callback 
> for the entity.
> 
>>
>> Christian, I'm really getting tired of your tone. I don't appreciate 
>> being told my comments are "outright nonsense" when you don't even 
>> take the time to understand what the issue is and what I'm trying to 
>> do/document. If you aren't interested in working with me, I'm just 
>> going to give up on drm_sched, wait until Rust gets workqueue support, 
>> and reimplement it in Rust. You can keep your broken fence lifetime 
>> semantics and I'll do my own thing.
> 
> I'm certainly trying to help here, but you seem to have unrealistic 
> expectations.
> 
> I perfectly understand what you are trying to do, but you don't seem to 
> understand that this functionality here isn't made for your use case.
> 
> We can adjust the functionality to better match your requirements, but 
> you can't say it is broken because it doesn't work when you use it not 
> in the way it is intended to be used.

I believe "adjusting" functionality to fit some external requirements,
may have unintended consequences, requiring yet more and more "adjustments".
(Or may allow (new) drivers to do wild things which may lead to wild results. :-) )

We need to be extra careful and wary of this.
-- 
Regards,
Luben


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name
@ 2023-07-15  4:03           ` Luben Tuikov
  0 siblings, 0 replies; 86+ messages in thread
From: Luben Tuikov @ 2023-07-15  4:03 UTC (permalink / raw)
  To: Christian König, Asahi Lina, David Airlie, Daniel Vetter,
	Sumit Semwal
  Cc: Faith Ekstrand, linux-kernel, dri-devel, asahi,
	Alyssa Rosenzweig, linux-media

On 2023-07-14 05:57, Christian König wrote:
> Am 14.07.23 um 11:49 schrieb Asahi Lina:
>> On 14/07/2023 17.43, Christian König wrote:
>>> Am 14.07.23 um 10:21 schrieb Asahi Lina:
>>>> A signaled scheduler fence can outlive its scheduler, since fences are
>>>> independencly reference counted. Therefore, we can't reference the
>>>> scheduler in the get_timeline_name() implementation.
>>>>
>>>> Fixes oopses on `cat /sys/kernel/debug/dma_buf/bufinfo` when shared
>>>> dma-bufs reference fences from GPU schedulers that no longer exist.
>>>>
>>>> Signed-off-by: Asahi Lina <lina@asahilina.net>
>>>> ---
>>>>    drivers/gpu/drm/scheduler/sched_entity.c | 7 ++++++-
>>>>    drivers/gpu/drm/scheduler/sched_fence.c  | 4 +++-
>>>>    include/drm/gpu_scheduler.h              | 5 +++++
>>>>    3 files changed, 14 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c 
>>>> b/drivers/gpu/drm/scheduler/sched_entity.c
>>>> index b2bbc8a68b30..17f35b0b005a 100644
>>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>>> @@ -389,7 +389,12 @@ static bool 
>>>> drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
>>>>               /*
>>>>             * Fence is from the same scheduler, only need to wait for
>>>> -         * it to be scheduled
>>>> +         * it to be scheduled.
>>>> +         *
>>>> +         * Note: s_fence->sched could have been freed and reallocated
>>>> +         * as another scheduler. This false positive case is okay, 
>>>> as if
>>>> +         * the old scheduler was freed all of its jobs must have
>>>> +         * signaled their completion fences.
>>>
>>> This is outright nonsense. As long as an entity for a scheduler exists
>>> it is not allowed to free up this scheduler.
>>>
>>> So this function can't be called like this.
>>
>> As I already explained, the fences can outlive their scheduler. That 
>> means *this* entity certainly exists for *this* scheduler, but the 
>> *dependency* fence might have come from a past scheduler that was 
>> already destroyed along with all of its entities, and its address reused.
> 
> Well this is function is not about fences, this function is a callback 
> for the entity.
> 
>>
>> Christian, I'm really getting tired of your tone. I don't appreciate 
>> being told my comments are "outright nonsense" when you don't even 
>> take the time to understand what the issue is and what I'm trying to 
>> do/document. If you aren't interested in working with me, I'm just 
>> going to give up on drm_sched, wait until Rust gets workqueue support, 
>> and reimplement it in Rust. You can keep your broken fence lifetime 
>> semantics and I'll do my own thing.
> 
> I'm certainly trying to help here, but you seem to have unrealistic 
> expectations.
> 
> I perfectly understand what you are trying to do, but you don't seem to 
> understand that this functionality here isn't made for your use case.
> 
> We can adjust the functionality to better match your requirements, but 
> you can't say it is broken because it doesn't work when you use it not 
> in the way it is intended to be used.

I believe "adjusting" functionality to fit some external requirements,
may have unintended consequences, requiring yet more and more "adjustments".
(Or may allow (new) drivers to do wild things which may lead to wild results. :-) )

We need to be extra careful and wary of this.
-- 
Regards,
Luben


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 3/3] drm/scheduler: Clean up jobs when the scheduler is torn down.
  2023-07-14  8:21   ` Asahi Lina
@ 2023-07-15  7:14     ` Luben Tuikov
  -1 siblings, 0 replies; 86+ messages in thread
From: Luben Tuikov @ 2023-07-15  7:14 UTC (permalink / raw)
  To: Asahi Lina, David Airlie, Daniel Vetter, Sumit Semwal,
	Christian König
  Cc: Faith Ekstrand, Alyssa Rosenzweig, dri-devel, linux-kernel,
	linux-media, asahi

On 2023-07-14 04:21, Asahi Lina wrote:
> drm_sched_fini() currently leaves any pending jobs dangling, which
> causes segfaults and other badness when job completion fences are
> signaled after the scheduler is torn down.

If there are pending jobs, ideally we want to call into the driver,
so that it can release resources it may be holding for those.
The idea behind "pending" is that they are pending in the hardware
and we don't know their state until signalled/the callback called.
(Or unless the device is reset and we get a notification of that fact.)

> Explicitly detach all jobs from their completion callbacks and free
> them. This makes it possible to write a sensible safe abstraction for
> drm_sched, without having to externally duplicate the tracking of
> in-flight jobs.
> 
> This shouldn't regress any existing drivers, since calling
> drm_sched_fini() with any pending jobs is broken and this change should
> be a no-op if there are no pending jobs.

While this statement is true on its own, it kind of contradicts
the premise of the first paragraph.

> Signed-off-by: Asahi Lina <lina@asahilina.net>
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 32 ++++++++++++++++++++++++++++++--
>  1 file changed, 30 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 1f3bc3606239..a4da4aac0efd 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -1186,10 +1186,38 @@ EXPORT_SYMBOL(drm_sched_init);
>  void drm_sched_fini(struct drm_gpu_scheduler *sched)
>  {
>  	struct drm_sched_entity *s_entity;
> +	struct drm_sched_job *s_job, *tmp;
>  	int i;
>  
> -	if (sched->thread)
> -		kthread_stop(sched->thread);
> +	if (!sched->thread)
> +		return;
> +
> +	/*
> +	 * Stop the scheduler, detaching all jobs from their hardware callbacks
> +	 * and cleaning up complete jobs.
> +	 */
> +	drm_sched_stop(sched, NULL);
> +
> +	/*
> +	 * Iterate through the pending job list and free all jobs.
> +	 * This assumes the driver has either guaranteed jobs are already stopped, or that
> +	 * otherwise it is responsible for keeping any necessary data structures for
> +	 * in-progress jobs alive even when the free_job() callback is called early (e.g. by
> +	 * putting them in its own queue or doing its own refcounting).
> +	 */
> +	list_for_each_entry_safe(s_job, tmp, &sched->pending_list, list) {
> +		spin_lock(&sched->job_list_lock);
> +		list_del_init(&s_job->list);
> +		spin_unlock(&sched->job_list_lock);
> +
> +		dma_fence_set_error(&s_job->s_fence->finished, -ESRCH);
> +		drm_sched_fence_finished(s_job->s_fence);

I'd imagine it's better to rebase this on top of drm-misc-next where
drm_sched_fence_finished() takes one more parameter--the error.

> +
> +		WARN_ON(s_job->s_fence->parent);
> +		sched->ops->free_job(s_job);
> +	}
> +
> +	kthread_stop(sched->thread);
>  
>  	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
>  		struct drm_sched_rq *rq = &sched->sched_rq[i];
> 

Conceptually I don't mind this patch--I see what it is trying to achieve,
but technically, we want the driver to detect GPU removal and return shared
resources back, such as "jobs", which DRM is also aware of.

In the case where we're initiating the tear, we should notify the driver that
we're about to forget jobs (resources), so that it knows to return them back
or that it shouldn't notify us for them (since we've notified we're forgetting them.)

(Note also that in this latter case, traditionally, the device would be reset,
so that we can guarantee that it has forgotten all shared resources which
we are to tear up. This is somewhat more complicated with GPUs, thus the method
pointed out above.)
-- 
Regards,
Luben


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 3/3] drm/scheduler: Clean up jobs when the scheduler is torn down.
@ 2023-07-15  7:14     ` Luben Tuikov
  0 siblings, 0 replies; 86+ messages in thread
From: Luben Tuikov @ 2023-07-15  7:14 UTC (permalink / raw)
  To: Asahi Lina, David Airlie, Daniel Vetter, Sumit Semwal,
	Christian König
  Cc: Faith Ekstrand, linux-kernel, dri-devel, asahi,
	Alyssa Rosenzweig, linux-media

On 2023-07-14 04:21, Asahi Lina wrote:
> drm_sched_fini() currently leaves any pending jobs dangling, which
> causes segfaults and other badness when job completion fences are
> signaled after the scheduler is torn down.

If there are pending jobs, ideally we want to call into the driver,
so that it can release resources it may be holding for those.
The idea behind "pending" is that they are pending in the hardware
and we don't know their state until signalled/the callback called.
(Or unless the device is reset and we get a notification of that fact.)

> Explicitly detach all jobs from their completion callbacks and free
> them. This makes it possible to write a sensible safe abstraction for
> drm_sched, without having to externally duplicate the tracking of
> in-flight jobs.
> 
> This shouldn't regress any existing drivers, since calling
> drm_sched_fini() with any pending jobs is broken and this change should
> be a no-op if there are no pending jobs.

While this statement is true on its own, it kind of contradicts
the premise of the first paragraph.

> Signed-off-by: Asahi Lina <lina@asahilina.net>
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 32 ++++++++++++++++++++++++++++++--
>  1 file changed, 30 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 1f3bc3606239..a4da4aac0efd 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -1186,10 +1186,38 @@ EXPORT_SYMBOL(drm_sched_init);
>  void drm_sched_fini(struct drm_gpu_scheduler *sched)
>  {
>  	struct drm_sched_entity *s_entity;
> +	struct drm_sched_job *s_job, *tmp;
>  	int i;
>  
> -	if (sched->thread)
> -		kthread_stop(sched->thread);
> +	if (!sched->thread)
> +		return;
> +
> +	/*
> +	 * Stop the scheduler, detaching all jobs from their hardware callbacks
> +	 * and cleaning up complete jobs.
> +	 */
> +	drm_sched_stop(sched, NULL);
> +
> +	/*
> +	 * Iterate through the pending job list and free all jobs.
> +	 * This assumes the driver has either guaranteed jobs are already stopped, or that
> +	 * otherwise it is responsible for keeping any necessary data structures for
> +	 * in-progress jobs alive even when the free_job() callback is called early (e.g. by
> +	 * putting them in its own queue or doing its own refcounting).
> +	 */
> +	list_for_each_entry_safe(s_job, tmp, &sched->pending_list, list) {
> +		spin_lock(&sched->job_list_lock);
> +		list_del_init(&s_job->list);
> +		spin_unlock(&sched->job_list_lock);
> +
> +		dma_fence_set_error(&s_job->s_fence->finished, -ESRCH);
> +		drm_sched_fence_finished(s_job->s_fence);

I'd imagine it's better to rebase this on top of drm-misc-next where
drm_sched_fence_finished() takes one more parameter--the error.

> +
> +		WARN_ON(s_job->s_fence->parent);
> +		sched->ops->free_job(s_job);
> +	}
> +
> +	kthread_stop(sched->thread);
>  
>  	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
>  		struct drm_sched_rq *rq = &sched->sched_rq[i];
> 

Conceptually I don't mind this patch--I see what it is trying to achieve,
but technically, we want the driver to detect GPU removal and return shared
resources back, such as "jobs", which DRM is also aware of.

In the case where we're initiating the tear, we should notify the driver that
we're about to forget jobs (resources), so that it knows to return them back
or that it shouldn't notify us for them (since we've notified we're forgetting them.)

(Note also that in this latter case, traditionally, the device would be reset,
so that we can guarantee that it has forgotten all shared resources which
we are to tear up. This is somewhat more complicated with GPUs, thus the method
pointed out above.)
-- 
Regards,
Luben


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name
  2023-07-14  9:57         ` Christian König
@ 2023-07-15 14:14           ` alyssa
  -1 siblings, 0 replies; 86+ messages in thread
From: alyssa @ 2023-07-15 14:14 UTC (permalink / raw)
  To: Luben Tuikov, Christian König, Asahi Lina, David Airlie,
	Daniel Vetter, Sumit Semwal
  Cc: Faith Ekstrand, dri-devel, linux-kernel, linux-media, asahi

15 July 2023 at 00:03, "Luben Tuikov" <luben.tuikov@amd.com> wrote:


> 
> On 2023-07-14 05:57, Christian König wrote:
> 
> > 
> > Am 14.07.23 um 11:49 schrieb Asahi Lina:
> > 
> > > 
> > > On 14/07/2023 17.43, Christian König wrote:
> > > 
> > 
> >  Am 14.07.23 um 10:21 schrieb Asahi Lina:
> >  A signaled scheduler fence can outlive its scheduler, since fences are
> >  independencly reference counted. Therefore, we can't reference the
> >  scheduler in the get_timeline_name() implementation.
> > 
> >  Fixes oopses on `cat /sys/kernel/debug/dma_buf/bufinfo` when shared
> >  dma-bufs reference fences from GPU schedulers that no longer exist.
> > 
> >  Signed-off-by: Asahi Lina <lina@asahilina.net>
> >  ---
> >     drivers/gpu/drm/scheduler/sched_entity.c | 7 ++++++-
> >     drivers/gpu/drm/scheduler/sched_fence.c  | 4 +++-
> >     include/drm/gpu_scheduler.h              | 5 +++++
> >     3 files changed, 14 insertions(+), 2 deletions(-)
> > 
> >  diff --git a/drivers/gpu/drm/scheduler/sched_entity.c 
> >  b/drivers/gpu/drm/scheduler/sched_entity.c
> >  index b2bbc8a68b30..17f35b0b005a 100644
> >  --- a/drivers/gpu/drm/scheduler/sched_entity.c
> >  +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> >  @@ -389,7 +389,12 @@ static bool 
> >  drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
> >                /*
> >              * Fence is from the same scheduler, only need to wait for
> >  -         * it to be scheduled
> >  +         * it to be scheduled.
> >  +         *
> >  +         * Note: s_fence->sched could have been freed and reallocated
> >  +         * as another scheduler. This false positive case is okay, 
> >  as if
> >  +         * the old scheduler was freed all of its jobs must have
> >  +         * signaled their completion fences.
> > 
> >  This is outright nonsense. As long as an entity for a scheduler exists
> >  it is not allowed to free up this scheduler.
> > 
> >  So this function can't be called like this.
> > 
> > > 
> > > As I already explained, the fences can outlive their scheduler. That 
> > >  means *this* entity certainly exists for *this* scheduler, but the 
> > >  *dependency* fence might have come from a past scheduler that was 
> > >  already destroyed along with all of its entities, and its address reused.
> > > 
> > 
> >  
> >  Well this is function is not about fences, this function is a callback 
> >  for the entity.
> >  
> > 
> > > 
> > > Christian, I'm really getting tired of your tone. I don't appreciate 
> > >  being told my comments are "outright nonsense" when you don't even 
> > >  take the time to understand what the issue is and what I'm trying to 
> > >  do/document. If you aren't interested in working with me, I'm just 
> > >  going to give up on drm_sched, wait until Rust gets workqueue support, 
> > >  and reimplement it in Rust. You can keep your broken fence lifetime 
> > >  semantics and I'll do my own thing.
> > > 
> > 
> >  
> >  I'm certainly trying to help here, but you seem to have unrealistic 
> >  expectations.
> >  
> >  I perfectly understand what you are trying to do, but you don't seem to 
> >  understand that this functionality here isn't made for your use case.
> >  
> >  We can adjust the functionality to better match your requirements, but 
> >  you can't say it is broken because it doesn't work when you use it not 
> >  in the way it is intended to be used.
> > 
> 
> I believe "adjusting" functionality to fit some external requirements,
> may have unintended consequences, requiring yet more and more "adjustments".
> (Or may allow (new) drivers to do wild things which may lead to wild results. :-) )
> 
> We need to be extra careful and wary of this.

Either drm/scheduler is common code that we should use for our driver, in which case we need to "adjust" it to fit the requirements of a safe Rust abstraction usable for AGX. Or, drm/scheduler is not common code intended for drivers with our requirements, and then we need to be able to write our own scheduler.

AMD has NAK'd both options, effectively NAK'ing the driver.

I will ask a simple yes/no question: Should we use drm/sched?

If yes, it will need patches like these, and AMD needs to be ok with that and stop NAK'ing them on sight becuase they don't match the existing requirements.

If no, we will write our own scheduler in Rust, and AMD needs to be ok with that and not NAK it on sight because it's not drm/sched.

Which is it?

Note if we write a Rust scheduler, drm/sched and amdgpu will be unaffected. If we do that and AMD comes back and NAKs it -- as said in this thread would "probably" happen -- then it is impossible for us to upstream a driver regardless of whether we use drm/sched.

Lina has been polite and accommodating while AMD calls her code "outright nonsense" and gets "outright NAK"s, and puts her into an impossible catch-22 where no matter what she does it's NAK'd.

That's not ok.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name
@ 2023-07-15 14:14           ` alyssa
  0 siblings, 0 replies; 86+ messages in thread
From: alyssa @ 2023-07-15 14:14 UTC (permalink / raw)
  To: Luben Tuikov, Christian König, Asahi Lina, David Airlie,
	Daniel Vetter, Sumit Semwal
  Cc: asahi, linux-media, dri-devel, Faith Ekstrand, linux-kernel

15 July 2023 at 00:03, "Luben Tuikov" <luben.tuikov@amd.com> wrote:


> 
> On 2023-07-14 05:57, Christian König wrote:
> 
> > 
> > Am 14.07.23 um 11:49 schrieb Asahi Lina:
> > 
> > > 
> > > On 14/07/2023 17.43, Christian König wrote:
> > > 
> > 
> >  Am 14.07.23 um 10:21 schrieb Asahi Lina:
> >  A signaled scheduler fence can outlive its scheduler, since fences are
> >  independencly reference counted. Therefore, we can't reference the
> >  scheduler in the get_timeline_name() implementation.
> > 
> >  Fixes oopses on `cat /sys/kernel/debug/dma_buf/bufinfo` when shared
> >  dma-bufs reference fences from GPU schedulers that no longer exist.
> > 
> >  Signed-off-by: Asahi Lina <lina@asahilina.net>
> >  ---
> >     drivers/gpu/drm/scheduler/sched_entity.c | 7 ++++++-
> >     drivers/gpu/drm/scheduler/sched_fence.c  | 4 +++-
> >     include/drm/gpu_scheduler.h              | 5 +++++
> >     3 files changed, 14 insertions(+), 2 deletions(-)
> > 
> >  diff --git a/drivers/gpu/drm/scheduler/sched_entity.c 
> >  b/drivers/gpu/drm/scheduler/sched_entity.c
> >  index b2bbc8a68b30..17f35b0b005a 100644
> >  --- a/drivers/gpu/drm/scheduler/sched_entity.c
> >  +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> >  @@ -389,7 +389,12 @@ static bool 
> >  drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
> >                /*
> >              * Fence is from the same scheduler, only need to wait for
> >  -         * it to be scheduled
> >  +         * it to be scheduled.
> >  +         *
> >  +         * Note: s_fence->sched could have been freed and reallocated
> >  +         * as another scheduler. This false positive case is okay, 
> >  as if
> >  +         * the old scheduler was freed all of its jobs must have
> >  +         * signaled their completion fences.
> > 
> >  This is outright nonsense. As long as an entity for a scheduler exists
> >  it is not allowed to free up this scheduler.
> > 
> >  So this function can't be called like this.
> > 
> > > 
> > > As I already explained, the fences can outlive their scheduler. That 
> > >  means *this* entity certainly exists for *this* scheduler, but the 
> > >  *dependency* fence might have come from a past scheduler that was 
> > >  already destroyed along with all of its entities, and its address reused.
> > > 
> > 
> >  
> >  Well this is function is not about fences, this function is a callback 
> >  for the entity.
> >  
> > 
> > > 
> > > Christian, I'm really getting tired of your tone. I don't appreciate 
> > >  being told my comments are "outright nonsense" when you don't even 
> > >  take the time to understand what the issue is and what I'm trying to 
> > >  do/document. If you aren't interested in working with me, I'm just 
> > >  going to give up on drm_sched, wait until Rust gets workqueue support, 
> > >  and reimplement it in Rust. You can keep your broken fence lifetime 
> > >  semantics and I'll do my own thing.
> > > 
> > 
> >  
> >  I'm certainly trying to help here, but you seem to have unrealistic 
> >  expectations.
> >  
> >  I perfectly understand what you are trying to do, but you don't seem to 
> >  understand that this functionality here isn't made for your use case.
> >  
> >  We can adjust the functionality to better match your requirements, but 
> >  you can't say it is broken because it doesn't work when you use it not 
> >  in the way it is intended to be used.
> > 
> 
> I believe "adjusting" functionality to fit some external requirements,
> may have unintended consequences, requiring yet more and more "adjustments".
> (Or may allow (new) drivers to do wild things which may lead to wild results. :-) )
> 
> We need to be extra careful and wary of this.

Either drm/scheduler is common code that we should use for our driver, in which case we need to "adjust" it to fit the requirements of a safe Rust abstraction usable for AGX. Or, drm/scheduler is not common code intended for drivers with our requirements, and then we need to be able to write our own scheduler.

AMD has NAK'd both options, effectively NAK'ing the driver.

I will ask a simple yes/no question: Should we use drm/sched?

If yes, it will need patches like these, and AMD needs to be ok with that and stop NAK'ing them on sight becuase they don't match the existing requirements.

If no, we will write our own scheduler in Rust, and AMD needs to be ok with that and not NAK it on sight because it's not drm/sched.

Which is it?

Note if we write a Rust scheduler, drm/sched and amdgpu will be unaffected. If we do that and AMD comes back and NAKs it -- as said in this thread would "probably" happen -- then it is impossible for us to upstream a driver regardless of whether we use drm/sched.

Lina has been polite and accommodating while AMD calls her code "outright nonsense" and gets "outright NAK"s, and puts her into an impossible catch-22 where no matter what she does it's NAK'd.

That's not ok.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 3/3] drm/scheduler: Clean up jobs when the scheduler is torn down.
  2023-07-15  7:14     ` Luben Tuikov
@ 2023-07-16  7:51       ` Asahi Lina
  -1 siblings, 0 replies; 86+ messages in thread
From: Asahi Lina @ 2023-07-16  7:51 UTC (permalink / raw)
  To: Luben Tuikov, David Airlie, Daniel Vetter, Sumit Semwal,
	Christian König
  Cc: Faith Ekstrand, Alyssa Rosenzweig, dri-devel, linux-kernel,
	linux-media, asahi

On 15/07/2023 16.14, Luben Tuikov wrote:
> On 2023-07-14 04:21, Asahi Lina wrote:
>> drm_sched_fini() currently leaves any pending jobs dangling, which
>> causes segfaults and other badness when job completion fences are
>> signaled after the scheduler is torn down.
> 
> If there are pending jobs, ideally we want to call into the driver,
> so that it can release resources it may be holding for those.
> The idea behind "pending" is that they are pending in the hardware
> and we don't know their state until signalled/the callback called.
> (Or unless the device is reset and we get a notification of that fact.)

That's what the job->free_job() callback does, then the driver is free 
to do whatever it wants with those jobs. A driver could opt to 
synchronously kill those jobs (if it can) or account for them 
separately/asynchronously.

What this patch basically says is that if you destroy a scheduler with 
pending jobs, it immediately considers them terminated with an error, 
and returns ownership back to the driver for freeing. Then the driver 
can decide how to handle the rest and whatever the underlying hardware 
state is.

>> Explicitly detach all jobs from their completion callbacks and free
>> them. This makes it possible to write a sensible safe abstraction for
>> drm_sched, without having to externally duplicate the tracking of
>> in-flight jobs.
>>
>> This shouldn't regress any existing drivers, since calling
>> drm_sched_fini() with any pending jobs is broken and this change should
>> be a no-op if there are no pending jobs.
> 
> While this statement is true on its own, it kind of contradicts
> the premise of the first paragraph.

I mean right *now* it's broken, before this patch. I'm trying to make it 
safe, but it shouldn't regress any exiting drivers since if they trigger 
this code path they are broken today.

> 
>> Signed-off-by: Asahi Lina <lina@asahilina.net>
>> ---
>>   drivers/gpu/drm/scheduler/sched_main.c | 32 ++++++++++++++++++++++++++++++--
>>   1 file changed, 30 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>> index 1f3bc3606239..a4da4aac0efd 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -1186,10 +1186,38 @@ EXPORT_SYMBOL(drm_sched_init);
>>   void drm_sched_fini(struct drm_gpu_scheduler *sched)
>>   {
>>   	struct drm_sched_entity *s_entity;
>> +	struct drm_sched_job *s_job, *tmp;
>>   	int i;
>>   
>> -	if (sched->thread)
>> -		kthread_stop(sched->thread);
>> +	if (!sched->thread)
>> +		return;
>> +
>> +	/*
>> +	 * Stop the scheduler, detaching all jobs from their hardware callbacks
>> +	 * and cleaning up complete jobs.
>> +	 */
>> +	drm_sched_stop(sched, NULL);
>> +
>> +	/*
>> +	 * Iterate through the pending job list and free all jobs.
>> +	 * This assumes the driver has either guaranteed jobs are already stopped, or that
>> +	 * otherwise it is responsible for keeping any necessary data structures for
>> +	 * in-progress jobs alive even when the free_job() callback is called early (e.g. by
>> +	 * putting them in its own queue or doing its own refcounting).
>> +	 */
>> +	list_for_each_entry_safe(s_job, tmp, &sched->pending_list, list) {
>> +		spin_lock(&sched->job_list_lock);
>> +		list_del_init(&s_job->list);
>> +		spin_unlock(&sched->job_list_lock);
>> +
>> +		dma_fence_set_error(&s_job->s_fence->finished, -ESRCH);
>> +		drm_sched_fence_finished(s_job->s_fence);
> 
> I'd imagine it's better to rebase this on top of drm-misc-next where
> drm_sched_fence_finished() takes one more parameter--the error.

Ah, sure! I can do that.

> 
>> +
>> +		WARN_ON(s_job->s_fence->parent);
>> +		sched->ops->free_job(s_job);
>> +	}
>> +
>> +	kthread_stop(sched->thread);
>>   
>>   	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
>>   		struct drm_sched_rq *rq = &sched->sched_rq[i];
>>
> 
> Conceptually I don't mind this patch--I see what it is trying to achieve,
> but technically, we want the driver to detect GPU removal and return shared
> resources back, such as "jobs", which DRM is also aware of.

I think you missed the context of why I'm doing this, so in short: my 
use case (like Xe's) involves using a separate drm_sched instance *per 
file* since these queues are scheduled directly by the firmware. So this 
isn't about GPU removal, but rather about a GPU context going away while 
jobs are in flight (e.g. the process got killed). We want that to 
quickly kill the "DRM view" of the world, including signaling all the 
fences with an error and freeing resources like the scheduler itself.

In the case of this particular GPU, there is no known way to actively 
and synchronously abort GPU jobs, so we need to let them run to 
completion (or failure), but we don't want that to block process cleanup 
and freeing a bunch of high-level resources. The driver is architected 
roughly along the lines of a firmware abstraction layer that maps to the 
firmware shared memory structures, and then a layer on top that 
implements the DRM view. When a process gets killed, the DRM side (which 
includes the scheduler, etc.) gets torn down immediately, and it makes 
sense to handle this cleanup inside drm_sched since it already has a 
view into what jobs are in flight. Otherwise, I would have to duplicate 
job tracking in the driver (actually worse: in the Rust abstraction for 
safety), which doesn't make much sense.

But what I *do* have in the driver is tracking of the firmware 
structures. So when the drm_sched gets torn down and all the jobs 
killed, the underlying firmware jobs do run to completion, and the 
resources they use are all cleaned up after that (it's all reference 
counted). The primitive involved here is that in-flight firmware jobs 
are assigned an event completion slot, and that keeps a reference to 
them from a global array until the events fire and all the jobs are 
known to have completed. This keeps things memory-safe, since we 
absolutely cannot free/destroy firmware structures while they are in use 
(otherwise the firmware crashes, which is fatal on these GPUs - requires 
a full system reboot to recover).

In practice, with the VM map model that we use, what ends up happening 
when a process gets killed is that all the user objects for in-flight 
jobs get unmapped, which usually causes the GPU hardware (not firmware) 
to fault. This then triggers early termination of jobs anyway via the 
firmware fault recovery flow. But even that takes some short amount of 
time, and by then all the drm_sched stuff is long gone and we're just 
dealing with the in-flight firmware stuff.

> In the case where we're initiating the tear, we should notify the driver that
> we're about to forget jobs (resources), so that it knows to return them back
> or that it shouldn't notify us for them (since we've notified we're forgetting them.)

That contradicts Christian's comment. I tried to document that (after 
this patch) the scheduler no longer cares about hw fences and whether 
they are signaled or not after it's destroyed, and I got a strongly 
worded NAK for it. Sooo... which is it? Is it okay for drivers not to 
signal the hw fence after a scheduler teardown, or not?

But really, I don't see a use case for an explicit "about to forget job" 
callback. The job free callback already serves the purpose of telling 
the driver to clean up resources associated with a job. If it wants to 
synchronously abort things there, it could easily take over its own 
fence signaling and do something with the underlying stuff if the fence 
is not signaled yet.

In my case, since the driver is written in Rust and free_job() just maps 
to the destructor (Drop impl), that just ends up freeing a bunch of 
memory and other objects, and I don't particularly care about the state 
of the firmware side any more after that. The flow is the same whether 
it was a successful job completion, a failure, or an early destruction 
due to the drm_sched getting torn down.

> (Note also that in this latter case, traditionally, the device would be reset,
> so that we can guarantee that it has forgotten all shared resources which
> we are to tear up. This is somewhat more complicated with GPUs, thus the method
> pointed out above.)

Yeah, in the firmware scheduling case we can't do this at all unless the 
firmware has an explicit teardown/forget op (which I'm not aware of) and 
a full GPU reset isn't something we can do either. Hence we just let the 
underlying jobs complete. In practice they tend to die pretty quickly 
anyway once all the buffers are unmapped.

~~ Lina


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 3/3] drm/scheduler: Clean up jobs when the scheduler is torn down.
@ 2023-07-16  7:51       ` Asahi Lina
  0 siblings, 0 replies; 86+ messages in thread
From: Asahi Lina @ 2023-07-16  7:51 UTC (permalink / raw)
  To: Luben Tuikov, David Airlie, Daniel Vetter, Sumit Semwal,
	Christian König
  Cc: Faith Ekstrand, linux-kernel, dri-devel, asahi,
	Alyssa Rosenzweig, linux-media

On 15/07/2023 16.14, Luben Tuikov wrote:
> On 2023-07-14 04:21, Asahi Lina wrote:
>> drm_sched_fini() currently leaves any pending jobs dangling, which
>> causes segfaults and other badness when job completion fences are
>> signaled after the scheduler is torn down.
> 
> If there are pending jobs, ideally we want to call into the driver,
> so that it can release resources it may be holding for those.
> The idea behind "pending" is that they are pending in the hardware
> and we don't know their state until signalled/the callback called.
> (Or unless the device is reset and we get a notification of that fact.)

That's what the job->free_job() callback does, then the driver is free 
to do whatever it wants with those jobs. A driver could opt to 
synchronously kill those jobs (if it can) or account for them 
separately/asynchronously.

What this patch basically says is that if you destroy a scheduler with 
pending jobs, it immediately considers them terminated with an error, 
and returns ownership back to the driver for freeing. Then the driver 
can decide how to handle the rest and whatever the underlying hardware 
state is.

>> Explicitly detach all jobs from their completion callbacks and free
>> them. This makes it possible to write a sensible safe abstraction for
>> drm_sched, without having to externally duplicate the tracking of
>> in-flight jobs.
>>
>> This shouldn't regress any existing drivers, since calling
>> drm_sched_fini() with any pending jobs is broken and this change should
>> be a no-op if there are no pending jobs.
> 
> While this statement is true on its own, it kind of contradicts
> the premise of the first paragraph.

I mean right *now* it's broken, before this patch. I'm trying to make it 
safe, but it shouldn't regress any exiting drivers since if they trigger 
this code path they are broken today.

> 
>> Signed-off-by: Asahi Lina <lina@asahilina.net>
>> ---
>>   drivers/gpu/drm/scheduler/sched_main.c | 32 ++++++++++++++++++++++++++++++--
>>   1 file changed, 30 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>> index 1f3bc3606239..a4da4aac0efd 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -1186,10 +1186,38 @@ EXPORT_SYMBOL(drm_sched_init);
>>   void drm_sched_fini(struct drm_gpu_scheduler *sched)
>>   {
>>   	struct drm_sched_entity *s_entity;
>> +	struct drm_sched_job *s_job, *tmp;
>>   	int i;
>>   
>> -	if (sched->thread)
>> -		kthread_stop(sched->thread);
>> +	if (!sched->thread)
>> +		return;
>> +
>> +	/*
>> +	 * Stop the scheduler, detaching all jobs from their hardware callbacks
>> +	 * and cleaning up complete jobs.
>> +	 */
>> +	drm_sched_stop(sched, NULL);
>> +
>> +	/*
>> +	 * Iterate through the pending job list and free all jobs.
>> +	 * This assumes the driver has either guaranteed jobs are already stopped, or that
>> +	 * otherwise it is responsible for keeping any necessary data structures for
>> +	 * in-progress jobs alive even when the free_job() callback is called early (e.g. by
>> +	 * putting them in its own queue or doing its own refcounting).
>> +	 */
>> +	list_for_each_entry_safe(s_job, tmp, &sched->pending_list, list) {
>> +		spin_lock(&sched->job_list_lock);
>> +		list_del_init(&s_job->list);
>> +		spin_unlock(&sched->job_list_lock);
>> +
>> +		dma_fence_set_error(&s_job->s_fence->finished, -ESRCH);
>> +		drm_sched_fence_finished(s_job->s_fence);
> 
> I'd imagine it's better to rebase this on top of drm-misc-next where
> drm_sched_fence_finished() takes one more parameter--the error.

Ah, sure! I can do that.

> 
>> +
>> +		WARN_ON(s_job->s_fence->parent);
>> +		sched->ops->free_job(s_job);
>> +	}
>> +
>> +	kthread_stop(sched->thread);
>>   
>>   	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
>>   		struct drm_sched_rq *rq = &sched->sched_rq[i];
>>
> 
> Conceptually I don't mind this patch--I see what it is trying to achieve,
> but technically, we want the driver to detect GPU removal and return shared
> resources back, such as "jobs", which DRM is also aware of.

I think you missed the context of why I'm doing this, so in short: my 
use case (like Xe's) involves using a separate drm_sched instance *per 
file* since these queues are scheduled directly by the firmware. So this 
isn't about GPU removal, but rather about a GPU context going away while 
jobs are in flight (e.g. the process got killed). We want that to 
quickly kill the "DRM view" of the world, including signaling all the 
fences with an error and freeing resources like the scheduler itself.

In the case of this particular GPU, there is no known way to actively 
and synchronously abort GPU jobs, so we need to let them run to 
completion (or failure), but we don't want that to block process cleanup 
and freeing a bunch of high-level resources. The driver is architected 
roughly along the lines of a firmware abstraction layer that maps to the 
firmware shared memory structures, and then a layer on top that 
implements the DRM view. When a process gets killed, the DRM side (which 
includes the scheduler, etc.) gets torn down immediately, and it makes 
sense to handle this cleanup inside drm_sched since it already has a 
view into what jobs are in flight. Otherwise, I would have to duplicate 
job tracking in the driver (actually worse: in the Rust abstraction for 
safety), which doesn't make much sense.

But what I *do* have in the driver is tracking of the firmware 
structures. So when the drm_sched gets torn down and all the jobs 
killed, the underlying firmware jobs do run to completion, and the 
resources they use are all cleaned up after that (it's all reference 
counted). The primitive involved here is that in-flight firmware jobs 
are assigned an event completion slot, and that keeps a reference to 
them from a global array until the events fire and all the jobs are 
known to have completed. This keeps things memory-safe, since we 
absolutely cannot free/destroy firmware structures while they are in use 
(otherwise the firmware crashes, which is fatal on these GPUs - requires 
a full system reboot to recover).

In practice, with the VM map model that we use, what ends up happening 
when a process gets killed is that all the user objects for in-flight 
jobs get unmapped, which usually causes the GPU hardware (not firmware) 
to fault. This then triggers early termination of jobs anyway via the 
firmware fault recovery flow. But even that takes some short amount of 
time, and by then all the drm_sched stuff is long gone and we're just 
dealing with the in-flight firmware stuff.

> In the case where we're initiating the tear, we should notify the driver that
> we're about to forget jobs (resources), so that it knows to return them back
> or that it shouldn't notify us for them (since we've notified we're forgetting them.)

That contradicts Christian's comment. I tried to document that (after 
this patch) the scheduler no longer cares about hw fences and whether 
they are signaled or not after it's destroyed, and I got a strongly 
worded NAK for it. Sooo... which is it? Is it okay for drivers not to 
signal the hw fence after a scheduler teardown, or not?

But really, I don't see a use case for an explicit "about to forget job" 
callback. The job free callback already serves the purpose of telling 
the driver to clean up resources associated with a job. If it wants to 
synchronously abort things there, it could easily take over its own 
fence signaling and do something with the underlying stuff if the fence 
is not signaled yet.

In my case, since the driver is written in Rust and free_job() just maps 
to the destructor (Drop impl), that just ends up freeing a bunch of 
memory and other objects, and I don't particularly care about the state 
of the firmware side any more after that. The flow is the same whether 
it was a successful job completion, a failure, or an early destruction 
due to the drm_sched getting torn down.

> (Note also that in this latter case, traditionally, the device would be reset,
> so that we can guarantee that it has forgotten all shared resources which
> we are to tear up. This is somewhat more complicated with GPUs, thus the method
> pointed out above.)

Yeah, in the firmware scheduling case we can't do this at all unless the 
firmware has an explicit teardown/forget op (which I'm not aware of) and 
a full GPU reset isn't something we can do either. Hence we just let the 
underlying jobs complete. In practice they tend to die pretty quickly 
anyway once all the buffers are unmapped.

~~ Lina


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name
  2023-07-15 14:14           ` alyssa
@ 2023-07-17 15:55             ` Christian König
  -1 siblings, 0 replies; 86+ messages in thread
From: Christian König @ 2023-07-17 15:55 UTC (permalink / raw)
  To: alyssa, Luben Tuikov, Asahi Lina, David Airlie, Daniel Vetter,
	Sumit Semwal
  Cc: asahi, linux-media, dri-devel, Faith Ekstrand, linux-kernel

Am 15.07.23 um 16:14 schrieb alyssa@rosenzweig.io:
> 15 July 2023 at 00:03, "Luben Tuikov" <luben.tuikov@amd.com> wrote:
>> On 2023-07-14 05:57, Christian König wrote:
>>
>>> Am 14.07.23 um 11:49 schrieb Asahi Lina:
>>>
>>>> On 14/07/2023 17.43, Christian König wrote:
>>>>
>>>   Am 14.07.23 um 10:21 schrieb Asahi Lina:
>>>   A signaled scheduler fence can outlive its scheduler, since fences are
>>>   independencly reference counted. Therefore, we can't reference the
>>>   scheduler in the get_timeline_name() implementation.
>>>
>>>   Fixes oopses on `cat /sys/kernel/debug/dma_buf/bufinfo` when shared
>>>   dma-bufs reference fences from GPU schedulers that no longer exist.
>>>
>>>   Signed-off-by: Asahi Lina <lina@asahilina.net>
>>>   ---
>>>      drivers/gpu/drm/scheduler/sched_entity.c | 7 ++++++-
>>>      drivers/gpu/drm/scheduler/sched_fence.c  | 4 +++-
>>>      include/drm/gpu_scheduler.h              | 5 +++++
>>>      3 files changed, 14 insertions(+), 2 deletions(-)
>>>
>>>   diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
>>>   b/drivers/gpu/drm/scheduler/sched_entity.c
>>>   index b2bbc8a68b30..17f35b0b005a 100644
>>>   --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>>   +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>>   @@ -389,7 +389,12 @@ static bool
>>>   drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
>>>                 /*
>>>               * Fence is from the same scheduler, only need to wait for
>>>   -         * it to be scheduled
>>>   +         * it to be scheduled.
>>>   +         *
>>>   +         * Note: s_fence->sched could have been freed and reallocated
>>>   +         * as another scheduler. This false positive case is okay,
>>>   as if
>>>   +         * the old scheduler was freed all of its jobs must have
>>>   +         * signaled their completion fences.
>>>
>>>   This is outright nonsense. As long as an entity for a scheduler exists
>>>   it is not allowed to free up this scheduler.
>>>
>>>   So this function can't be called like this.
>>>
>>>> As I already explained, the fences can outlive their scheduler. That
>>>>   means *this* entity certainly exists for *this* scheduler, but the
>>>>   *dependency* fence might have come from a past scheduler that was
>>>>   already destroyed along with all of its entities, and its address reused.
>>>>
>>>   
>>>   Well this is function is not about fences, this function is a callback
>>>   for the entity.
>>>   
>>>
>>>> Christian, I'm really getting tired of your tone. I don't appreciate
>>>>   being told my comments are "outright nonsense" when you don't even
>>>>   take the time to understand what the issue is and what I'm trying to
>>>>   do/document. If you aren't interested in working with me, I'm just
>>>>   going to give up on drm_sched, wait until Rust gets workqueue support,
>>>>   and reimplement it in Rust. You can keep your broken fence lifetime
>>>>   semantics and I'll do my own thing.
>>>>
>>>   
>>>   I'm certainly trying to help here, but you seem to have unrealistic
>>>   expectations.
>>>   
>>>   I perfectly understand what you are trying to do, but you don't seem to
>>>   understand that this functionality here isn't made for your use case.
>>>   
>>>   We can adjust the functionality to better match your requirements, but
>>>   you can't say it is broken because it doesn't work when you use it not
>>>   in the way it is intended to be used.
>>>
>> I believe "adjusting" functionality to fit some external requirements,
>> may have unintended consequences, requiring yet more and more "adjustments".
>> (Or may allow (new) drivers to do wild things which may lead to wild results. :-) )
>>
>> We need to be extra careful and wary of this.
> Either drm/scheduler is common code that we should use for our driver, in which case we need to "adjust" it to fit the requirements of a safe Rust abstraction usable for AGX.

Well this is the fundamental disagreement we have. As far as I can see 
you don't need to adjust anything in the common drm/scheduler code.

That code works with quite a bunch of different drivers, including the 
Intel XE which has similar requirements to your work here.

We can talk about gradually improving the common code, but as Luben 
already wrote as well this needs to be done very carefully.

>   Or, drm/scheduler is not common code intended for drivers with our requirements, and then we need to be able to write our own scheduler.
>
> AMD has NAK'd both options, effectively NAK'ing the driver.
>
> I will ask a simple yes/no question: Should we use drm/sched?

Well, yes.

>
> If yes, it will need patches like these,

No, you don't.

First of all you need to try to adjust your driver to match the 
requirements of drm/scheduler and *not* the other way around.

>   and AMD needs to be ok with that and stop NAK'ing them on sight becuase they don't match the existing requirements.
>
> If no, we will write our own scheduler in Rust, and AMD needs to be ok with that and not NAK it on sight because it's not drm/sched.
>
> Which is it?
>
> Note if we write a Rust scheduler, drm/sched and amdgpu will be unaffected. If we do that and AMD comes back and NAKs it -- as said in this thread would "probably" happen -- then it is impossible for us to upstream a driver regardless of whether we use drm/sched.
>
> Lina has been polite and accommodating while AMD calls her code "outright nonsense" and gets "outright NAK"s, and puts her into an impossible catch-22 where no matter what she does it's NAK'd.

Well as far as I can see I'm totally polite as well.

Pointing out that approaches doesn't seem to make sense and NAKing 
patches is a perfectly normal part of the review process.

What you need to to is to take a step back and ask yourself why this 
here is facing so much rejection from our side. I have to admit that I 
don't seem to be good at explaining that, cause we are obviously talking 
past each other, but you don't seem to try hard to understand what I'm 
pointing out either.

> That's not ok.

As far as I can see it is.

As maintainer of a commonly used component my first duty is to preserve 
the status quo and prevent modifications which are not well thought 
through. And to be honest those changes here strongly looks like Lina is 
just adjusting the code to match her requirements without looking left 
and right first.

Regards,
Christian.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name
@ 2023-07-17 15:55             ` Christian König
  0 siblings, 0 replies; 86+ messages in thread
From: Christian König @ 2023-07-17 15:55 UTC (permalink / raw)
  To: alyssa, Luben Tuikov, Asahi Lina, David Airlie, Daniel Vetter,
	Sumit Semwal
  Cc: Faith Ekstrand, dri-devel, linux-kernel, linux-media, asahi

Am 15.07.23 um 16:14 schrieb alyssa@rosenzweig.io:
> 15 July 2023 at 00:03, "Luben Tuikov" <luben.tuikov@amd.com> wrote:
>> On 2023-07-14 05:57, Christian König wrote:
>>
>>> Am 14.07.23 um 11:49 schrieb Asahi Lina:
>>>
>>>> On 14/07/2023 17.43, Christian König wrote:
>>>>
>>>   Am 14.07.23 um 10:21 schrieb Asahi Lina:
>>>   A signaled scheduler fence can outlive its scheduler, since fences are
>>>   independencly reference counted. Therefore, we can't reference the
>>>   scheduler in the get_timeline_name() implementation.
>>>
>>>   Fixes oopses on `cat /sys/kernel/debug/dma_buf/bufinfo` when shared
>>>   dma-bufs reference fences from GPU schedulers that no longer exist.
>>>
>>>   Signed-off-by: Asahi Lina <lina@asahilina.net>
>>>   ---
>>>      drivers/gpu/drm/scheduler/sched_entity.c | 7 ++++++-
>>>      drivers/gpu/drm/scheduler/sched_fence.c  | 4 +++-
>>>      include/drm/gpu_scheduler.h              | 5 +++++
>>>      3 files changed, 14 insertions(+), 2 deletions(-)
>>>
>>>   diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
>>>   b/drivers/gpu/drm/scheduler/sched_entity.c
>>>   index b2bbc8a68b30..17f35b0b005a 100644
>>>   --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>>   +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>>   @@ -389,7 +389,12 @@ static bool
>>>   drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
>>>                 /*
>>>               * Fence is from the same scheduler, only need to wait for
>>>   -         * it to be scheduled
>>>   +         * it to be scheduled.
>>>   +         *
>>>   +         * Note: s_fence->sched could have been freed and reallocated
>>>   +         * as another scheduler. This false positive case is okay,
>>>   as if
>>>   +         * the old scheduler was freed all of its jobs must have
>>>   +         * signaled their completion fences.
>>>
>>>   This is outright nonsense. As long as an entity for a scheduler exists
>>>   it is not allowed to free up this scheduler.
>>>
>>>   So this function can't be called like this.
>>>
>>>> As I already explained, the fences can outlive their scheduler. That
>>>>   means *this* entity certainly exists for *this* scheduler, but the
>>>>   *dependency* fence might have come from a past scheduler that was
>>>>   already destroyed along with all of its entities, and its address reused.
>>>>
>>>   
>>>   Well this is function is not about fences, this function is a callback
>>>   for the entity.
>>>   
>>>
>>>> Christian, I'm really getting tired of your tone. I don't appreciate
>>>>   being told my comments are "outright nonsense" when you don't even
>>>>   take the time to understand what the issue is and what I'm trying to
>>>>   do/document. If you aren't interested in working with me, I'm just
>>>>   going to give up on drm_sched, wait until Rust gets workqueue support,
>>>>   and reimplement it in Rust. You can keep your broken fence lifetime
>>>>   semantics and I'll do my own thing.
>>>>
>>>   
>>>   I'm certainly trying to help here, but you seem to have unrealistic
>>>   expectations.
>>>   
>>>   I perfectly understand what you are trying to do, but you don't seem to
>>>   understand that this functionality here isn't made for your use case.
>>>   
>>>   We can adjust the functionality to better match your requirements, but
>>>   you can't say it is broken because it doesn't work when you use it not
>>>   in the way it is intended to be used.
>>>
>> I believe "adjusting" functionality to fit some external requirements,
>> may have unintended consequences, requiring yet more and more "adjustments".
>> (Or may allow (new) drivers to do wild things which may lead to wild results. :-) )
>>
>> We need to be extra careful and wary of this.
> Either drm/scheduler is common code that we should use for our driver, in which case we need to "adjust" it to fit the requirements of a safe Rust abstraction usable for AGX.

Well this is the fundamental disagreement we have. As far as I can see 
you don't need to adjust anything in the common drm/scheduler code.

That code works with quite a bunch of different drivers, including the 
Intel XE which has similar requirements to your work here.

We can talk about gradually improving the common code, but as Luben 
already wrote as well this needs to be done very carefully.

>   Or, drm/scheduler is not common code intended for drivers with our requirements, and then we need to be able to write our own scheduler.
>
> AMD has NAK'd both options, effectively NAK'ing the driver.
>
> I will ask a simple yes/no question: Should we use drm/sched?

Well, yes.

>
> If yes, it will need patches like these,

No, you don't.

First of all you need to try to adjust your driver to match the 
requirements of drm/scheduler and *not* the other way around.

>   and AMD needs to be ok with that and stop NAK'ing them on sight becuase they don't match the existing requirements.
>
> If no, we will write our own scheduler in Rust, and AMD needs to be ok with that and not NAK it on sight because it's not drm/sched.
>
> Which is it?
>
> Note if we write a Rust scheduler, drm/sched and amdgpu will be unaffected. If we do that and AMD comes back and NAKs it -- as said in this thread would "probably" happen -- then it is impossible for us to upstream a driver regardless of whether we use drm/sched.
>
> Lina has been polite and accommodating while AMD calls her code "outright nonsense" and gets "outright NAK"s, and puts her into an impossible catch-22 where no matter what she does it's NAK'd.

Well as far as I can see I'm totally polite as well.

Pointing out that approaches doesn't seem to make sense and NAKing 
patches is a perfectly normal part of the review process.

What you need to to is to take a step back and ask yourself why this 
here is facing so much rejection from our side. I have to admit that I 
don't seem to be good at explaining that, cause we are obviously talking 
past each other, but you don't seem to try hard to understand what I'm 
pointing out either.

> That's not ok.

As far as I can see it is.

As maintainer of a commonly used component my first duty is to preserve 
the status quo and prevent modifications which are not well thought 
through. And to be honest those changes here strongly looks like Lina is 
just adjusting the code to match her requirements without looking left 
and right first.

Regards,
Christian.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 3/3] drm/scheduler: Clean up jobs when the scheduler is torn down.
  2023-07-16  7:51       ` Asahi Lina
@ 2023-07-17 17:40         ` Luben Tuikov
  -1 siblings, 0 replies; 86+ messages in thread
From: Luben Tuikov @ 2023-07-17 17:40 UTC (permalink / raw)
  To: Asahi Lina, David Airlie, Daniel Vetter, Sumit Semwal,
	Christian König
  Cc: Faith Ekstrand, Alyssa Rosenzweig, dri-devel, linux-kernel,
	linux-media, asahi

On 2023-07-16 03:51, Asahi Lina wrote:
> On 15/07/2023 16.14, Luben Tuikov wrote:
>> On 2023-07-14 04:21, Asahi Lina wrote:
>>> drm_sched_fini() currently leaves any pending jobs dangling, which
>>> causes segfaults and other badness when job completion fences are
>>> signaled after the scheduler is torn down.
>>
>> If there are pending jobs, ideally we want to call into the driver,
>> so that it can release resources it may be holding for those.
>> The idea behind "pending" is that they are pending in the hardware
>> and we don't know their state until signalled/the callback called.
>> (Or unless the device is reset and we get a notification of that fact.)
> 
> That's what the job->free_job() callback does, then the driver is free 
> to do whatever it wants with those jobs. A driver could opt to 
> synchronously kill those jobs (if it can) or account for them 
> separately/asynchronously.
> 
> What this patch basically says is that if you destroy a scheduler with 
> pending jobs, it immediately considers them terminated with an error, 
> and returns ownership back to the driver for freeing. Then the driver 
> can decide how to handle the rest and whatever the underlying hardware 
> state is.
> 
>>> Explicitly detach all jobs from their completion callbacks and free
>>> them. This makes it possible to write a sensible safe abstraction for
>>> drm_sched, without having to externally duplicate the tracking of
>>> in-flight jobs.
>>>
>>> This shouldn't regress any existing drivers, since calling
>>> drm_sched_fini() with any pending jobs is broken and this change should
>>> be a no-op if there are no pending jobs.
>>
>> While this statement is true on its own, it kind of contradicts
>> the premise of the first paragraph.
> 
> I mean right *now* it's broken, before this patch. I'm trying to make it 
> safe, but it shouldn't regress any exiting drivers since if they trigger 
> this code path they are broken today.

Not sure about other drivers--they can speak for themselves and the CC list
should include them--please use "dim add-missing-cc" and make sure
that the Git commit description contains the Cc tags--then git send-email
will populate the SMTP CC. Feel free to add more Cc tags on top of that.

> 
>>
>>> Signed-off-by: Asahi Lina <lina@asahilina.net>
>>> ---
>>>   drivers/gpu/drm/scheduler/sched_main.c | 32 ++++++++++++++++++++++++++++++--
>>>   1 file changed, 30 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>> index 1f3bc3606239..a4da4aac0efd 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -1186,10 +1186,38 @@ EXPORT_SYMBOL(drm_sched_init);
>>>   void drm_sched_fini(struct drm_gpu_scheduler *sched)
>>>   {
>>>   	struct drm_sched_entity *s_entity;
>>> +	struct drm_sched_job *s_job, *tmp;
>>>   	int i;
>>>   
>>> -	if (sched->thread)
>>> -		kthread_stop(sched->thread);
>>> +	if (!sched->thread)
>>> +		return;
>>> +
>>> +	/*
>>> +	 * Stop the scheduler, detaching all jobs from their hardware callbacks
>>> +	 * and cleaning up complete jobs.
>>> +	 */
>>> +	drm_sched_stop(sched, NULL);
>>> +
>>> +	/*
>>> +	 * Iterate through the pending job list and free all jobs.
>>> +	 * This assumes the driver has either guaranteed jobs are already stopped, or that
>>> +	 * otherwise it is responsible for keeping any necessary data structures for
>>> +	 * in-progress jobs alive even when the free_job() callback is called early (e.g. by
>>> +	 * putting them in its own queue or doing its own refcounting).
>>> +	 */
>>> +	list_for_each_entry_safe(s_job, tmp, &sched->pending_list, list) {
>>> +		spin_lock(&sched->job_list_lock);
>>> +		list_del_init(&s_job->list);
>>> +		spin_unlock(&sched->job_list_lock);
>>> +
>>> +		dma_fence_set_error(&s_job->s_fence->finished, -ESRCH);
>>> +		drm_sched_fence_finished(s_job->s_fence);
>>
>> I'd imagine it's better to rebase this on top of drm-misc-next where
>> drm_sched_fence_finished() takes one more parameter--the error.
> 
> Ah, sure! I can do that.

It's worth posting it as a stand-alone patch. Please make sure to add Cc tags
into the commit description--use "dim add-missing-cc", perhaps also
git-blame and git-log might help with additional Cc. "scripts/get_maintainer.pl"
for files unaffected by this commit. (dim add-missing-cc uses get_maintainer.pl
for affected files.)

Feel free to post it stand-alone and we'll let the natural review process take over. :-)

> 
>>
>>> +
>>> +		WARN_ON(s_job->s_fence->parent);
>>> +		sched->ops->free_job(s_job);
>>> +	}
>>> +
>>> +	kthread_stop(sched->thread);
>>>   
>>>   	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
>>>   		struct drm_sched_rq *rq = &sched->sched_rq[i];
>>>
>>
>> Conceptually I don't mind this patch--I see what it is trying to achieve,
>> but technically, we want the driver to detect GPU removal and return shared
>> resources back, such as "jobs", which DRM is also aware of.
> 
> I think you missed the context of why I'm doing this, so in short: my

As a general rule of thumb, in my writing emails I try to avoid using
"you" and "I" as much as possible--it sets this divisive stage, and it
can get misrepresented, especially in email.

As is the case in research literature, if I absolutely have to use a pronoun--which
rarely happens, I always use "we", and this is the most number of "I"-s I've used
in a long while.

> use case (like Xe's) involves using a separate drm_sched instance *per 
> file* since these queues are scheduled directly by the firmware. So this 
> isn't about GPU removal, but rather about a GPU context going away while 
> jobs are in flight (e.g. the process got killed). We want that to 
> quickly kill the "DRM view" of the world, including signaling all the 
> fences with an error and freeing resources like the scheduler itself.
> 
> In the case of this particular GPU, there is no known way to actively 
> and synchronously abort GPU jobs, so we need to let them run to 
> completion (or failure), but we don't want that to block process cleanup 
> and freeing a bunch of high-level resources. The driver is architected 
> roughly along the lines of a firmware abstraction layer that maps to the 
> firmware shared memory structures, and then a layer on top that 
> implements the DRM view. When a process gets killed, the DRM side (which 
> includes the scheduler, etc.) gets torn down immediately, and it makes 
> sense to handle this cleanup inside drm_sched since it already has a 
> view into what jobs are in flight. Otherwise, I would have to duplicate 
> job tracking in the driver (actually worse: in the Rust abstraction for 
> safety), which doesn't make much sense.
> 
> But what I *do* have in the driver is tracking of the firmware 
> structures. So when the drm_sched gets torn down and all the jobs 
> killed, the underlying firmware jobs do run to completion, and the 
> resources they use are all cleaned up after that (it's all reference 
> counted).

The ref-count definitely helps here.

> The primitive involved here is that in-flight firmware jobs 
> are assigned an event completion slot, and that keeps a reference to 
> them from a global array until the events fire and all the jobs are 
> known to have completed. This keeps things memory-safe, since we 
> absolutely cannot free/destroy firmware structures while they are in use 
> (otherwise the firmware crashes, which is fatal on these GPUs - requires 
> a full system reboot to recover).
> 
> In practice, with the VM map model that we use, what ends up happening 
> when a process gets killed is that all the user objects for in-flight 
> jobs get unmapped, which usually causes the GPU hardware (not firmware) 
> to fault. This then triggers early termination of jobs anyway via the 
> firmware fault recovery flow. But even that takes some short amount of 
> time, and by then all the drm_sched stuff is long gone and we're just 
> dealing with the in-flight firmware stuff.
> 
>> In the case where we're initiating the tear, we should notify the driver that
>> we're about to forget jobs (resources), so that it knows to return them back
>> or that it shouldn't notify us for them (since we've notified we're forgetting them.)
> 
> That contradicts Christian's comment. I tried to document that (after 
> this patch) the scheduler no longer cares about hw fences and whether 
> they are signaled or not after it's destroyed, and I got a strongly 
> worded NAK for it. Sooo... which is it? Is it okay for drivers not to 
> signal the hw fence after a scheduler teardown, or not?

Christian is correct in that we don't want to hang upstream control
to the whims of a low-level device driver.

> But really, I don't see a use case for an explicit "about to forget job" 
> callback. The job free callback already serves the purpose of telling 

Long time ago, in a galaxy far far away, this was needed in order
to prevent device write-DMA into non-existing (random) memory. As
this is not the case anymore, go with Christian's comment.

> the driver to clean up resources associated with a job. If it wants to 
> synchronously abort things there, it could easily take over its own 
> fence signaling and do something with the underlying stuff if the fence 
> is not signaled yet.
> 
> In my case, since the driver is written in Rust and free_job() just maps 
> to the destructor (Drop impl), that just ends up freeing a bunch of 
> memory and other objects, and I don't particularly care about the state 
> of the firmware side any more after that. The flow is the same whether 
> it was a successful job completion, a failure, or an early destruction 
> due to the drm_sched getting torn down.
> 
>> (Note also that in this latter case, traditionally, the device would be reset,
>> so that we can guarantee that it has forgotten all shared resources which
>> we are to tear up. This is somewhat more complicated with GPUs, thus the method
>> pointed out above.)
> 
> Yeah, in the firmware scheduling case we can't do this at all unless the 
> firmware has an explicit teardown/forget op (which I'm not aware of) and 
> a full GPU reset isn't something we can do either. Hence we just let the 
> underlying jobs complete. In practice they tend to die pretty quickly 
> anyway once all the buffers are unmapped.

Perhaps in the future, as more complex workloads are deferred to this
hardware and driver, a real-time requirement might be needed for this
"tend to die pretty quickly", that that there's some guarantee of
work resuming in some finite time.
-- 
Regards,
Luben

> 
> ~~ Lina
> 


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 3/3] drm/scheduler: Clean up jobs when the scheduler is torn down.
@ 2023-07-17 17:40         ` Luben Tuikov
  0 siblings, 0 replies; 86+ messages in thread
From: Luben Tuikov @ 2023-07-17 17:40 UTC (permalink / raw)
  To: Asahi Lina, David Airlie, Daniel Vetter, Sumit Semwal,
	Christian König
  Cc: Faith Ekstrand, linux-kernel, dri-devel, asahi,
	Alyssa Rosenzweig, linux-media

On 2023-07-16 03:51, Asahi Lina wrote:
> On 15/07/2023 16.14, Luben Tuikov wrote:
>> On 2023-07-14 04:21, Asahi Lina wrote:
>>> drm_sched_fini() currently leaves any pending jobs dangling, which
>>> causes segfaults and other badness when job completion fences are
>>> signaled after the scheduler is torn down.
>>
>> If there are pending jobs, ideally we want to call into the driver,
>> so that it can release resources it may be holding for those.
>> The idea behind "pending" is that they are pending in the hardware
>> and we don't know their state until signalled/the callback called.
>> (Or unless the device is reset and we get a notification of that fact.)
> 
> That's what the job->free_job() callback does, then the driver is free 
> to do whatever it wants with those jobs. A driver could opt to 
> synchronously kill those jobs (if it can) or account for them 
> separately/asynchronously.
> 
> What this patch basically says is that if you destroy a scheduler with 
> pending jobs, it immediately considers them terminated with an error, 
> and returns ownership back to the driver for freeing. Then the driver 
> can decide how to handle the rest and whatever the underlying hardware 
> state is.
> 
>>> Explicitly detach all jobs from their completion callbacks and free
>>> them. This makes it possible to write a sensible safe abstraction for
>>> drm_sched, without having to externally duplicate the tracking of
>>> in-flight jobs.
>>>
>>> This shouldn't regress any existing drivers, since calling
>>> drm_sched_fini() with any pending jobs is broken and this change should
>>> be a no-op if there are no pending jobs.
>>
>> While this statement is true on its own, it kind of contradicts
>> the premise of the first paragraph.
> 
> I mean right *now* it's broken, before this patch. I'm trying to make it 
> safe, but it shouldn't regress any exiting drivers since if they trigger 
> this code path they are broken today.

Not sure about other drivers--they can speak for themselves and the CC list
should include them--please use "dim add-missing-cc" and make sure
that the Git commit description contains the Cc tags--then git send-email
will populate the SMTP CC. Feel free to add more Cc tags on top of that.

> 
>>
>>> Signed-off-by: Asahi Lina <lina@asahilina.net>
>>> ---
>>>   drivers/gpu/drm/scheduler/sched_main.c | 32 ++++++++++++++++++++++++++++++--
>>>   1 file changed, 30 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>> index 1f3bc3606239..a4da4aac0efd 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -1186,10 +1186,38 @@ EXPORT_SYMBOL(drm_sched_init);
>>>   void drm_sched_fini(struct drm_gpu_scheduler *sched)
>>>   {
>>>   	struct drm_sched_entity *s_entity;
>>> +	struct drm_sched_job *s_job, *tmp;
>>>   	int i;
>>>   
>>> -	if (sched->thread)
>>> -		kthread_stop(sched->thread);
>>> +	if (!sched->thread)
>>> +		return;
>>> +
>>> +	/*
>>> +	 * Stop the scheduler, detaching all jobs from their hardware callbacks
>>> +	 * and cleaning up complete jobs.
>>> +	 */
>>> +	drm_sched_stop(sched, NULL);
>>> +
>>> +	/*
>>> +	 * Iterate through the pending job list and free all jobs.
>>> +	 * This assumes the driver has either guaranteed jobs are already stopped, or that
>>> +	 * otherwise it is responsible for keeping any necessary data structures for
>>> +	 * in-progress jobs alive even when the free_job() callback is called early (e.g. by
>>> +	 * putting them in its own queue or doing its own refcounting).
>>> +	 */
>>> +	list_for_each_entry_safe(s_job, tmp, &sched->pending_list, list) {
>>> +		spin_lock(&sched->job_list_lock);
>>> +		list_del_init(&s_job->list);
>>> +		spin_unlock(&sched->job_list_lock);
>>> +
>>> +		dma_fence_set_error(&s_job->s_fence->finished, -ESRCH);
>>> +		drm_sched_fence_finished(s_job->s_fence);
>>
>> I'd imagine it's better to rebase this on top of drm-misc-next where
>> drm_sched_fence_finished() takes one more parameter--the error.
> 
> Ah, sure! I can do that.

It's worth posting it as a stand-alone patch. Please make sure to add Cc tags
into the commit description--use "dim add-missing-cc", perhaps also
git-blame and git-log might help with additional Cc. "scripts/get_maintainer.pl"
for files unaffected by this commit. (dim add-missing-cc uses get_maintainer.pl
for affected files.)

Feel free to post it stand-alone and we'll let the natural review process take over. :-)

> 
>>
>>> +
>>> +		WARN_ON(s_job->s_fence->parent);
>>> +		sched->ops->free_job(s_job);
>>> +	}
>>> +
>>> +	kthread_stop(sched->thread);
>>>   
>>>   	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
>>>   		struct drm_sched_rq *rq = &sched->sched_rq[i];
>>>
>>
>> Conceptually I don't mind this patch--I see what it is trying to achieve,
>> but technically, we want the driver to detect GPU removal and return shared
>> resources back, such as "jobs", which DRM is also aware of.
> 
> I think you missed the context of why I'm doing this, so in short: my

As a general rule of thumb, in my writing emails I try to avoid using
"you" and "I" as much as possible--it sets this divisive stage, and it
can get misrepresented, especially in email.

As is the case in research literature, if I absolutely have to use a pronoun--which
rarely happens, I always use "we", and this is the most number of "I"-s I've used
in a long while.

> use case (like Xe's) involves using a separate drm_sched instance *per 
> file* since these queues are scheduled directly by the firmware. So this 
> isn't about GPU removal, but rather about a GPU context going away while 
> jobs are in flight (e.g. the process got killed). We want that to 
> quickly kill the "DRM view" of the world, including signaling all the 
> fences with an error and freeing resources like the scheduler itself.
> 
> In the case of this particular GPU, there is no known way to actively 
> and synchronously abort GPU jobs, so we need to let them run to 
> completion (or failure), but we don't want that to block process cleanup 
> and freeing a bunch of high-level resources. The driver is architected 
> roughly along the lines of a firmware abstraction layer that maps to the 
> firmware shared memory structures, and then a layer on top that 
> implements the DRM view. When a process gets killed, the DRM side (which 
> includes the scheduler, etc.) gets torn down immediately, and it makes 
> sense to handle this cleanup inside drm_sched since it already has a 
> view into what jobs are in flight. Otherwise, I would have to duplicate 
> job tracking in the driver (actually worse: in the Rust abstraction for 
> safety), which doesn't make much sense.
> 
> But what I *do* have in the driver is tracking of the firmware 
> structures. So when the drm_sched gets torn down and all the jobs 
> killed, the underlying firmware jobs do run to completion, and the 
> resources they use are all cleaned up after that (it's all reference 
> counted).

The ref-count definitely helps here.

> The primitive involved here is that in-flight firmware jobs 
> are assigned an event completion slot, and that keeps a reference to 
> them from a global array until the events fire and all the jobs are 
> known to have completed. This keeps things memory-safe, since we 
> absolutely cannot free/destroy firmware structures while they are in use 
> (otherwise the firmware crashes, which is fatal on these GPUs - requires 
> a full system reboot to recover).
> 
> In practice, with the VM map model that we use, what ends up happening 
> when a process gets killed is that all the user objects for in-flight 
> jobs get unmapped, which usually causes the GPU hardware (not firmware) 
> to fault. This then triggers early termination of jobs anyway via the 
> firmware fault recovery flow. But even that takes some short amount of 
> time, and by then all the drm_sched stuff is long gone and we're just 
> dealing with the in-flight firmware stuff.
> 
>> In the case where we're initiating the tear, we should notify the driver that
>> we're about to forget jobs (resources), so that it knows to return them back
>> or that it shouldn't notify us for them (since we've notified we're forgetting them.)
> 
> That contradicts Christian's comment. I tried to document that (after 
> this patch) the scheduler no longer cares about hw fences and whether 
> they are signaled or not after it's destroyed, and I got a strongly 
> worded NAK for it. Sooo... which is it? Is it okay for drivers not to 
> signal the hw fence after a scheduler teardown, or not?

Christian is correct in that we don't want to hang upstream control
to the whims of a low-level device driver.

> But really, I don't see a use case for an explicit "about to forget job" 
> callback. The job free callback already serves the purpose of telling 

Long time ago, in a galaxy far far away, this was needed in order
to prevent device write-DMA into non-existing (random) memory. As
this is not the case anymore, go with Christian's comment.

> the driver to clean up resources associated with a job. If it wants to 
> synchronously abort things there, it could easily take over its own 
> fence signaling and do something with the underlying stuff if the fence 
> is not signaled yet.
> 
> In my case, since the driver is written in Rust and free_job() just maps 
> to the destructor (Drop impl), that just ends up freeing a bunch of 
> memory and other objects, and I don't particularly care about the state 
> of the firmware side any more after that. The flow is the same whether 
> it was a successful job completion, a failure, or an early destruction 
> due to the drm_sched getting torn down.
> 
>> (Note also that in this latter case, traditionally, the device would be reset,
>> so that we can guarantee that it has forgotten all shared resources which
>> we are to tear up. This is somewhat more complicated with GPUs, thus the method
>> pointed out above.)
> 
> Yeah, in the firmware scheduling case we can't do this at all unless the 
> firmware has an explicit teardown/forget op (which I'm not aware of) and 
> a full GPU reset isn't something we can do either. Hence we just let the 
> underlying jobs complete. In practice they tend to die pretty quickly 
> anyway once all the buffers are unmapped.

Perhaps in the future, as more complex workloads are deferred to this
hardware and driver, a real-time requirement might be needed for this
"tend to die pretty quickly", that that there's some guarantee of
work resuming in some finite time.
-- 
Regards,
Luben

> 
> ~~ Lina
> 


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 3/3] drm/scheduler: Clean up jobs when the scheduler is torn down.
  2023-07-17 17:40         ` Luben Tuikov
@ 2023-07-17 22:45           ` Asahi Lina
  -1 siblings, 0 replies; 86+ messages in thread
From: Asahi Lina @ 2023-07-17 22:45 UTC (permalink / raw)
  To: Luben Tuikov, David Airlie, Daniel Vetter, Sumit Semwal,
	Christian König
  Cc: Faith Ekstrand, Alyssa Rosenzweig, dri-devel, linux-kernel,
	linux-media, asahi

On 18/07/2023 02.40, Luben Tuikov wrote:
> On 2023-07-16 03:51, Asahi Lina wrote:
>> On 15/07/2023 16.14, Luben Tuikov wrote:
>>> On 2023-07-14 04:21, Asahi Lina wrote:
>>>> drm_sched_fini() currently leaves any pending jobs dangling, which
>>>> causes segfaults and other badness when job completion fences are
>>>> signaled after the scheduler is torn down.
>>>
>>> If there are pending jobs, ideally we want to call into the driver,
>>> so that it can release resources it may be holding for those.
>>> The idea behind "pending" is that they are pending in the hardware
>>> and we don't know their state until signalled/the callback called.
>>> (Or unless the device is reset and we get a notification of that fact.)
>>
>> That's what the job->free_job() callback does, then the driver is free
>> to do whatever it wants with those jobs. A driver could opt to
>> synchronously kill those jobs (if it can) or account for them
>> separately/asynchronously.
>>
>> What this patch basically says is that if you destroy a scheduler with
>> pending jobs, it immediately considers them terminated with an error,
>> and returns ownership back to the driver for freeing. Then the driver
>> can decide how to handle the rest and whatever the underlying hardware
>> state is.
>>
>>>> Explicitly detach all jobs from their completion callbacks and free
>>>> them. This makes it possible to write a sensible safe abstraction for
>>>> drm_sched, without having to externally duplicate the tracking of
>>>> in-flight jobs.
>>>>
>>>> This shouldn't regress any existing drivers, since calling
>>>> drm_sched_fini() with any pending jobs is broken and this change should
>>>> be a no-op if there are no pending jobs.
>>>
>>> While this statement is true on its own, it kind of contradicts
>>> the premise of the first paragraph.
>>
>> I mean right *now* it's broken, before this patch. I'm trying to make it
>> safe, but it shouldn't regress any exiting drivers since if they trigger
>> this code path they are broken today.
> 
> Not sure about other drivers--they can speak for themselves and the CC list
> should include them--please use "dim add-missing-cc" and make sure
> that the Git commit description contains the Cc tags--then git send-email
> will populate the SMTP CC. Feel free to add more Cc tags on top of that.

I use `b4 prep -c` which I think does the same thing? I just ran it 
again and it only added 'linaro-mm-sig@lists.linaro.org', not sure why 
that one wasn't there. Am I missing anything else?

>>
>>>
>>>> Signed-off-by: Asahi Lina <lina@asahilina.net>
>>>> ---
>>>>    drivers/gpu/drm/scheduler/sched_main.c | 32 ++++++++++++++++++++++++++++++--
>>>>    1 file changed, 30 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>> index 1f3bc3606239..a4da4aac0efd 100644
>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>> @@ -1186,10 +1186,38 @@ EXPORT_SYMBOL(drm_sched_init);
>>>>    void drm_sched_fini(struct drm_gpu_scheduler *sched)
>>>>    {
>>>>    	struct drm_sched_entity *s_entity;
>>>> +	struct drm_sched_job *s_job, *tmp;
>>>>    	int i;
>>>>    
>>>> -	if (sched->thread)
>>>> -		kthread_stop(sched->thread);
>>>> +	if (!sched->thread)
>>>> +		return;
>>>> +
>>>> +	/*
>>>> +	 * Stop the scheduler, detaching all jobs from their hardware callbacks
>>>> +	 * and cleaning up complete jobs.
>>>> +	 */
>>>> +	drm_sched_stop(sched, NULL);
>>>> +
>>>> +	/*
>>>> +	 * Iterate through the pending job list and free all jobs.
>>>> +	 * This assumes the driver has either guaranteed jobs are already stopped, or that
>>>> +	 * otherwise it is responsible for keeping any necessary data structures for
>>>> +	 * in-progress jobs alive even when the free_job() callback is called early (e.g. by
>>>> +	 * putting them in its own queue or doing its own refcounting).
>>>> +	 */
>>>> +	list_for_each_entry_safe(s_job, tmp, &sched->pending_list, list) {
>>>> +		spin_lock(&sched->job_list_lock);
>>>> +		list_del_init(&s_job->list);
>>>> +		spin_unlock(&sched->job_list_lock);
>>>> +
>>>> +		dma_fence_set_error(&s_job->s_fence->finished, -ESRCH);
>>>> +		drm_sched_fence_finished(s_job->s_fence);
>>>
>>> I'd imagine it's better to rebase this on top of drm-misc-next where
>>> drm_sched_fence_finished() takes one more parameter--the error.
>>
>> Ah, sure! I can do that.
> 
> It's worth posting it as a stand-alone patch. Please make sure to add Cc tags
> into the commit description--use "dim add-missing-cc", perhaps also
> git-blame and git-log might help with additional Cc. "scripts/get_maintainer.pl"
> for files unaffected by this commit. (dim add-missing-cc uses get_maintainer.pl
> for affected files.)
> 
> Feel free to post it stand-alone and we'll let the natural review process take over. :-)

I already posted this one as part of the bindings RFC and the other one 
stand-alone, and they got NAKed by Christian, that's why it's a specific 
series for sched now with the docs, per Daniel's suggestion... now 
you're saying I should post them stand-alone again... ?

>>
>>>
>>>> +
>>>> +		WARN_ON(s_job->s_fence->parent);
>>>> +		sched->ops->free_job(s_job);
>>>> +	}
>>>> +
>>>> +	kthread_stop(sched->thread);
>>>>    
>>>>    	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
>>>>    		struct drm_sched_rq *rq = &sched->sched_rq[i];
>>>>
>>>
>>> Conceptually I don't mind this patch--I see what it is trying to achieve,
>>> but technically, we want the driver to detect GPU removal and return shared
>>> resources back, such as "jobs", which DRM is also aware of.
>>
>> I think you missed the context of why I'm doing this, so in short: my
> 
> As a general rule of thumb, in my writing emails I try to avoid using
> "you" and "I" as much as possible--it sets this divisive stage, and it
> can get misrepresented, especially in email.
> 
> As is the case in research literature, if I absolutely have to use a pronoun--which
> rarely happens, I always use "we", and this is the most number of "I"-s I've used
> in a long while.
> 
>> use case (like Xe's) involves using a separate drm_sched instance *per
>> file* since these queues are scheduled directly by the firmware. So this
>> isn't about GPU removal, but rather about a GPU context going away while
>> jobs are in flight (e.g. the process got killed). We want that to
>> quickly kill the "DRM view" of the world, including signaling all the
>> fences with an error and freeing resources like the scheduler itself.
>>
>> In the case of this particular GPU, there is no known way to actively
>> and synchronously abort GPU jobs, so we need to let them run to
>> completion (or failure), but we don't want that to block process cleanup
>> and freeing a bunch of high-level resources. The driver is architected
>> roughly along the lines of a firmware abstraction layer that maps to the
>> firmware shared memory structures, and then a layer on top that
>> implements the DRM view. When a process gets killed, the DRM side (which
>> includes the scheduler, etc.) gets torn down immediately, and it makes
>> sense to handle this cleanup inside drm_sched since it already has a
>> view into what jobs are in flight. Otherwise, I would have to duplicate
>> job tracking in the driver (actually worse: in the Rust abstraction for
>> safety), which doesn't make much sense.
>>
>> But what I *do* have in the driver is tracking of the firmware
>> structures. So when the drm_sched gets torn down and all the jobs
>> killed, the underlying firmware jobs do run to completion, and the
>> resources they use are all cleaned up after that (it's all reference
>> counted).
> 
> The ref-count definitely helps here.
> 
>> The primitive involved here is that in-flight firmware jobs
>> are assigned an event completion slot, and that keeps a reference to
>> them from a global array until the events fire and all the jobs are
>> known to have completed. This keeps things memory-safe, since we
>> absolutely cannot free/destroy firmware structures while they are in use
>> (otherwise the firmware crashes, which is fatal on these GPUs - requires
>> a full system reboot to recover).
>>
>> In practice, with the VM map model that we use, what ends up happening
>> when a process gets killed is that all the user objects for in-flight
>> jobs get unmapped, which usually causes the GPU hardware (not firmware)
>> to fault. This then triggers early termination of jobs anyway via the
>> firmware fault recovery flow. But even that takes some short amount of
>> time, and by then all the drm_sched stuff is long gone and we're just
>> dealing with the in-flight firmware stuff.
>>
>>> In the case where we're initiating the tear, we should notify the driver that
>>> we're about to forget jobs (resources), so that it knows to return them back
>>> or that it shouldn't notify us for them (since we've notified we're forgetting them.)
>>
>> That contradicts Christian's comment. I tried to document that (after
>> this patch) the scheduler no longer cares about hw fences and whether
>> they are signaled or not after it's destroyed, and I got a strongly
>> worded NAK for it. Sooo... which is it? Is it okay for drivers not to
>> signal the hw fence after a scheduler teardown, or not?
> 
> Christian is correct in that we don't want to hang upstream control
> to the whims of a low-level device driver.
> 
>> But really, I don't see a use case for an explicit "about to forget job"
>> callback. The job free callback already serves the purpose of telling
> 
> Long time ago, in a galaxy far far away, this was needed in order
> to prevent device write-DMA into non-existing (random) memory. As
> this is not the case anymore, go with Christian's comment.
> 
>> the driver to clean up resources associated with a job. If it wants to
>> synchronously abort things there, it could easily take over its own
>> fence signaling and do something with the underlying stuff if the fence
>> is not signaled yet.
>>
>> In my case, since the driver is written in Rust and free_job() just maps
>> to the destructor (Drop impl), that just ends up freeing a bunch of
>> memory and other objects, and I don't particularly care about the state
>> of the firmware side any more after that. The flow is the same whether
>> it was a successful job completion, a failure, or an early destruction
>> due to the drm_sched getting torn down.
>>
>>> (Note also that in this latter case, traditionally, the device would be reset,
>>> so that we can guarantee that it has forgotten all shared resources which
>>> we are to tear up. This is somewhat more complicated with GPUs, thus the method
>>> pointed out above.)
>>
>> Yeah, in the firmware scheduling case we can't do this at all unless the
>> firmware has an explicit teardown/forget op (which I'm not aware of) and
>> a full GPU reset isn't something we can do either. Hence we just let the
>> underlying jobs complete. In practice they tend to die pretty quickly
>> anyway once all the buffers are unmapped.
> 
> Perhaps in the future, as more complex workloads are deferred to this
> hardware and driver, a real-time requirement might be needed for this
> "tend to die pretty quickly", that that there's some guarantee of
> work resuming in some finite time.

That's not something we can control. This hardware is reverse-engineered 
and we don't get to write the firmware (it's signed). Maybe there is a 
job cancel op, and maybe we'll find it some day, or maybe not. I've 
certainly never seen macOS do anything like that, including in very 
blatant cases like a 30-second compute job. On macOS it kept running to 
completion even after I killed the process. We can't make the 
hardware/firmware do something it can't do.

At least there's firmware preemption though, so a rogue long-running job 
shouldn't block everything else.

~~ Lina


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 3/3] drm/scheduler: Clean up jobs when the scheduler is torn down.
@ 2023-07-17 22:45           ` Asahi Lina
  0 siblings, 0 replies; 86+ messages in thread
From: Asahi Lina @ 2023-07-17 22:45 UTC (permalink / raw)
  To: Luben Tuikov, David Airlie, Daniel Vetter, Sumit Semwal,
	Christian König
  Cc: Faith Ekstrand, linux-kernel, dri-devel, asahi,
	Alyssa Rosenzweig, linux-media

On 18/07/2023 02.40, Luben Tuikov wrote:
> On 2023-07-16 03:51, Asahi Lina wrote:
>> On 15/07/2023 16.14, Luben Tuikov wrote:
>>> On 2023-07-14 04:21, Asahi Lina wrote:
>>>> drm_sched_fini() currently leaves any pending jobs dangling, which
>>>> causes segfaults and other badness when job completion fences are
>>>> signaled after the scheduler is torn down.
>>>
>>> If there are pending jobs, ideally we want to call into the driver,
>>> so that it can release resources it may be holding for those.
>>> The idea behind "pending" is that they are pending in the hardware
>>> and we don't know their state until signalled/the callback called.
>>> (Or unless the device is reset and we get a notification of that fact.)
>>
>> That's what the job->free_job() callback does, then the driver is free
>> to do whatever it wants with those jobs. A driver could opt to
>> synchronously kill those jobs (if it can) or account for them
>> separately/asynchronously.
>>
>> What this patch basically says is that if you destroy a scheduler with
>> pending jobs, it immediately considers them terminated with an error,
>> and returns ownership back to the driver for freeing. Then the driver
>> can decide how to handle the rest and whatever the underlying hardware
>> state is.
>>
>>>> Explicitly detach all jobs from their completion callbacks and free
>>>> them. This makes it possible to write a sensible safe abstraction for
>>>> drm_sched, without having to externally duplicate the tracking of
>>>> in-flight jobs.
>>>>
>>>> This shouldn't regress any existing drivers, since calling
>>>> drm_sched_fini() with any pending jobs is broken and this change should
>>>> be a no-op if there are no pending jobs.
>>>
>>> While this statement is true on its own, it kind of contradicts
>>> the premise of the first paragraph.
>>
>> I mean right *now* it's broken, before this patch. I'm trying to make it
>> safe, but it shouldn't regress any exiting drivers since if they trigger
>> this code path they are broken today.
> 
> Not sure about other drivers--they can speak for themselves and the CC list
> should include them--please use "dim add-missing-cc" and make sure
> that the Git commit description contains the Cc tags--then git send-email
> will populate the SMTP CC. Feel free to add more Cc tags on top of that.

I use `b4 prep -c` which I think does the same thing? I just ran it 
again and it only added 'linaro-mm-sig@lists.linaro.org', not sure why 
that one wasn't there. Am I missing anything else?

>>
>>>
>>>> Signed-off-by: Asahi Lina <lina@asahilina.net>
>>>> ---
>>>>    drivers/gpu/drm/scheduler/sched_main.c | 32 ++++++++++++++++++++++++++++++--
>>>>    1 file changed, 30 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>> index 1f3bc3606239..a4da4aac0efd 100644
>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>> @@ -1186,10 +1186,38 @@ EXPORT_SYMBOL(drm_sched_init);
>>>>    void drm_sched_fini(struct drm_gpu_scheduler *sched)
>>>>    {
>>>>    	struct drm_sched_entity *s_entity;
>>>> +	struct drm_sched_job *s_job, *tmp;
>>>>    	int i;
>>>>    
>>>> -	if (sched->thread)
>>>> -		kthread_stop(sched->thread);
>>>> +	if (!sched->thread)
>>>> +		return;
>>>> +
>>>> +	/*
>>>> +	 * Stop the scheduler, detaching all jobs from their hardware callbacks
>>>> +	 * and cleaning up complete jobs.
>>>> +	 */
>>>> +	drm_sched_stop(sched, NULL);
>>>> +
>>>> +	/*
>>>> +	 * Iterate through the pending job list and free all jobs.
>>>> +	 * This assumes the driver has either guaranteed jobs are already stopped, or that
>>>> +	 * otherwise it is responsible for keeping any necessary data structures for
>>>> +	 * in-progress jobs alive even when the free_job() callback is called early (e.g. by
>>>> +	 * putting them in its own queue or doing its own refcounting).
>>>> +	 */
>>>> +	list_for_each_entry_safe(s_job, tmp, &sched->pending_list, list) {
>>>> +		spin_lock(&sched->job_list_lock);
>>>> +		list_del_init(&s_job->list);
>>>> +		spin_unlock(&sched->job_list_lock);
>>>> +
>>>> +		dma_fence_set_error(&s_job->s_fence->finished, -ESRCH);
>>>> +		drm_sched_fence_finished(s_job->s_fence);
>>>
>>> I'd imagine it's better to rebase this on top of drm-misc-next where
>>> drm_sched_fence_finished() takes one more parameter--the error.
>>
>> Ah, sure! I can do that.
> 
> It's worth posting it as a stand-alone patch. Please make sure to add Cc tags
> into the commit description--use "dim add-missing-cc", perhaps also
> git-blame and git-log might help with additional Cc. "scripts/get_maintainer.pl"
> for files unaffected by this commit. (dim add-missing-cc uses get_maintainer.pl
> for affected files.)
> 
> Feel free to post it stand-alone and we'll let the natural review process take over. :-)

I already posted this one as part of the bindings RFC and the other one 
stand-alone, and they got NAKed by Christian, that's why it's a specific 
series for sched now with the docs, per Daniel's suggestion... now 
you're saying I should post them stand-alone again... ?

>>
>>>
>>>> +
>>>> +		WARN_ON(s_job->s_fence->parent);
>>>> +		sched->ops->free_job(s_job);
>>>> +	}
>>>> +
>>>> +	kthread_stop(sched->thread);
>>>>    
>>>>    	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
>>>>    		struct drm_sched_rq *rq = &sched->sched_rq[i];
>>>>
>>>
>>> Conceptually I don't mind this patch--I see what it is trying to achieve,
>>> but technically, we want the driver to detect GPU removal and return shared
>>> resources back, such as "jobs", which DRM is also aware of.
>>
>> I think you missed the context of why I'm doing this, so in short: my
> 
> As a general rule of thumb, in my writing emails I try to avoid using
> "you" and "I" as much as possible--it sets this divisive stage, and it
> can get misrepresented, especially in email.
> 
> As is the case in research literature, if I absolutely have to use a pronoun--which
> rarely happens, I always use "we", and this is the most number of "I"-s I've used
> in a long while.
> 
>> use case (like Xe's) involves using a separate drm_sched instance *per
>> file* since these queues are scheduled directly by the firmware. So this
>> isn't about GPU removal, but rather about a GPU context going away while
>> jobs are in flight (e.g. the process got killed). We want that to
>> quickly kill the "DRM view" of the world, including signaling all the
>> fences with an error and freeing resources like the scheduler itself.
>>
>> In the case of this particular GPU, there is no known way to actively
>> and synchronously abort GPU jobs, so we need to let them run to
>> completion (or failure), but we don't want that to block process cleanup
>> and freeing a bunch of high-level resources. The driver is architected
>> roughly along the lines of a firmware abstraction layer that maps to the
>> firmware shared memory structures, and then a layer on top that
>> implements the DRM view. When a process gets killed, the DRM side (which
>> includes the scheduler, etc.) gets torn down immediately, and it makes
>> sense to handle this cleanup inside drm_sched since it already has a
>> view into what jobs are in flight. Otherwise, I would have to duplicate
>> job tracking in the driver (actually worse: in the Rust abstraction for
>> safety), which doesn't make much sense.
>>
>> But what I *do* have in the driver is tracking of the firmware
>> structures. So when the drm_sched gets torn down and all the jobs
>> killed, the underlying firmware jobs do run to completion, and the
>> resources they use are all cleaned up after that (it's all reference
>> counted).
> 
> The ref-count definitely helps here.
> 
>> The primitive involved here is that in-flight firmware jobs
>> are assigned an event completion slot, and that keeps a reference to
>> them from a global array until the events fire and all the jobs are
>> known to have completed. This keeps things memory-safe, since we
>> absolutely cannot free/destroy firmware structures while they are in use
>> (otherwise the firmware crashes, which is fatal on these GPUs - requires
>> a full system reboot to recover).
>>
>> In practice, with the VM map model that we use, what ends up happening
>> when a process gets killed is that all the user objects for in-flight
>> jobs get unmapped, which usually causes the GPU hardware (not firmware)
>> to fault. This then triggers early termination of jobs anyway via the
>> firmware fault recovery flow. But even that takes some short amount of
>> time, and by then all the drm_sched stuff is long gone and we're just
>> dealing with the in-flight firmware stuff.
>>
>>> In the case where we're initiating the tear, we should notify the driver that
>>> we're about to forget jobs (resources), so that it knows to return them back
>>> or that it shouldn't notify us for them (since we've notified we're forgetting them.)
>>
>> That contradicts Christian's comment. I tried to document that (after
>> this patch) the scheduler no longer cares about hw fences and whether
>> they are signaled or not after it's destroyed, and I got a strongly
>> worded NAK for it. Sooo... which is it? Is it okay for drivers not to
>> signal the hw fence after a scheduler teardown, or not?
> 
> Christian is correct in that we don't want to hang upstream control
> to the whims of a low-level device driver.
> 
>> But really, I don't see a use case for an explicit "about to forget job"
>> callback. The job free callback already serves the purpose of telling
> 
> Long time ago, in a galaxy far far away, this was needed in order
> to prevent device write-DMA into non-existing (random) memory. As
> this is not the case anymore, go with Christian's comment.
> 
>> the driver to clean up resources associated with a job. If it wants to
>> synchronously abort things there, it could easily take over its own
>> fence signaling and do something with the underlying stuff if the fence
>> is not signaled yet.
>>
>> In my case, since the driver is written in Rust and free_job() just maps
>> to the destructor (Drop impl), that just ends up freeing a bunch of
>> memory and other objects, and I don't particularly care about the state
>> of the firmware side any more after that. The flow is the same whether
>> it was a successful job completion, a failure, or an early destruction
>> due to the drm_sched getting torn down.
>>
>>> (Note also that in this latter case, traditionally, the device would be reset,
>>> so that we can guarantee that it has forgotten all shared resources which
>>> we are to tear up. This is somewhat more complicated with GPUs, thus the method
>>> pointed out above.)
>>
>> Yeah, in the firmware scheduling case we can't do this at all unless the
>> firmware has an explicit teardown/forget op (which I'm not aware of) and
>> a full GPU reset isn't something we can do either. Hence we just let the
>> underlying jobs complete. In practice they tend to die pretty quickly
>> anyway once all the buffers are unmapped.
> 
> Perhaps in the future, as more complex workloads are deferred to this
> hardware and driver, a real-time requirement might be needed for this
> "tend to die pretty quickly", that that there's some guarantee of
> work resuming in some finite time.

That's not something we can control. This hardware is reverse-engineered 
and we don't get to write the firmware (it's signed). Maybe there is a 
job cancel op, and maybe we'll find it some day, or maybe not. I've 
certainly never seen macOS do anything like that, including in very 
blatant cases like a 30-second compute job. On macOS it kept running to 
completion even after I killed the process. We can't make the 
hardware/firmware do something it can't do.

At least there's firmware preemption though, so a rogue long-running job 
shouldn't block everything else.

~~ Lina


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name
  2023-07-17 15:55             ` Christian König
@ 2023-07-18  2:35               ` Asahi Lina
  -1 siblings, 0 replies; 86+ messages in thread
From: Asahi Lina @ 2023-07-18  2:35 UTC (permalink / raw)
  To: Christian König, alyssa, Luben Tuikov, David Airlie,
	Daniel Vetter, Sumit Semwal
  Cc: Faith Ekstrand, dri-devel, linux-kernel, linux-media, asahi

On 18/07/2023 00.55, Christian König wrote:
> Am 15.07.23 um 16:14 schrieb alyssa@rosenzweig.io:
>> 15 July 2023 at 00:03, "Luben Tuikov" <luben.tuikov@amd.com> wrote:
>>> On 2023-07-14 05:57, Christian König wrote:
>>>
>>>> Am 14.07.23 um 11:49 schrieb Asahi Lina:
>>>>
>>>>> On 14/07/2023 17.43, Christian König wrote:
>>>>>
>>>>    Am 14.07.23 um 10:21 schrieb Asahi Lina:
>>>>    A signaled scheduler fence can outlive its scheduler, since fences are
>>>>    independencly reference counted. Therefore, we can't reference the
>>>>    scheduler in the get_timeline_name() implementation.
>>>>
>>>>    Fixes oopses on `cat /sys/kernel/debug/dma_buf/bufinfo` when shared
>>>>    dma-bufs reference fences from GPU schedulers that no longer exist.
>>>>
>>>>    Signed-off-by: Asahi Lina <lina@asahilina.net>
>>>>    ---
>>>>       drivers/gpu/drm/scheduler/sched_entity.c | 7 ++++++-
>>>>       drivers/gpu/drm/scheduler/sched_fence.c  | 4 +++-
>>>>       include/drm/gpu_scheduler.h              | 5 +++++
>>>>       3 files changed, 14 insertions(+), 2 deletions(-)
>>>>
>>>>    diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
>>>>    b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>    index b2bbc8a68b30..17f35b0b005a 100644
>>>>    --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>>>    +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>    @@ -389,7 +389,12 @@ static bool
>>>>    drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
>>>>                  /*
>>>>                * Fence is from the same scheduler, only need to wait for
>>>>    -         * it to be scheduled
>>>>    +         * it to be scheduled.
>>>>    +         *
>>>>    +         * Note: s_fence->sched could have been freed and reallocated
>>>>    +         * as another scheduler. This false positive case is okay,
>>>>    as if
>>>>    +         * the old scheduler was freed all of its jobs must have
>>>>    +         * signaled their completion fences.
>>>>
>>>>    This is outright nonsense. As long as an entity for a scheduler exists
>>>>    it is not allowed to free up this scheduler.
>>>>
>>>>    So this function can't be called like this.
>>>>
>>>>> As I already explained, the fences can outlive their scheduler. That
>>>>>    means *this* entity certainly exists for *this* scheduler, but the
>>>>>    *dependency* fence might have come from a past scheduler that was
>>>>>    already destroyed along with all of its entities, and its address reused.
>>>>>
>>>>    
>>>>    Well this is function is not about fences, this function is a callback
>>>>    for the entity.
>>>>    
>>>>
>>>>> Christian, I'm really getting tired of your tone. I don't appreciate
>>>>>    being told my comments are "outright nonsense" when you don't even
>>>>>    take the time to understand what the issue is and what I'm trying to
>>>>>    do/document. If you aren't interested in working with me, I'm just
>>>>>    going to give up on drm_sched, wait until Rust gets workqueue support,
>>>>>    and reimplement it in Rust. You can keep your broken fence lifetime
>>>>>    semantics and I'll do my own thing.
>>>>>
>>>>    
>>>>    I'm certainly trying to help here, but you seem to have unrealistic
>>>>    expectations.
>>>>    
>>>>    I perfectly understand what you are trying to do, but you don't seem to
>>>>    understand that this functionality here isn't made for your use case.
>>>>    
>>>>    We can adjust the functionality to better match your requirements, but
>>>>    you can't say it is broken because it doesn't work when you use it not
>>>>    in the way it is intended to be used.
>>>>
>>> I believe "adjusting" functionality to fit some external requirements,
>>> may have unintended consequences, requiring yet more and more "adjustments".
>>> (Or may allow (new) drivers to do wild things which may lead to wild results. :-) )
>>>
>>> We need to be extra careful and wary of this.
>> Either drm/scheduler is common code that we should use for our driver, in which case we need to "adjust" it to fit the requirements of a safe Rust abstraction usable for AGX.
> 
> Well this is the fundamental disagreement we have. As far as I can see
> you don't need to adjust anything in the common drm/scheduler code.
> 
> That code works with quite a bunch of different drivers, including the
> Intel XE which has similar requirements to your work here.
> 
> We can talk about gradually improving the common code, but as Luben
> already wrote as well this needs to be done very carefully.
> 
>>    Or, drm/scheduler is not common code intended for drivers with our requirements, and then we need to be able to write our own scheduler.
>>
>> AMD has NAK'd both options, effectively NAK'ing the driver.
>>
>> I will ask a simple yes/no question: Should we use drm/sched?
> 
> Well, yes.
> 
>>
>> If yes, it will need patches like these,
> 
> No, you don't.
> 
> First of all you need to try to adjust your driver to match the
> requirements of drm/scheduler and *not* the other way around.
> 
>>    and AMD needs to be ok with that and stop NAK'ing them on sight becuase they don't match the existing requirements.
>>
>> If no, we will write our own scheduler in Rust, and AMD needs to be ok with that and not NAK it on sight because it's not drm/sched.
>>
>> Which is it?
>>
>> Note if we write a Rust scheduler, drm/sched and amdgpu will be unaffected. If we do that and AMD comes back and NAKs it -- as said in this thread would "probably" happen -- then it is impossible for us to upstream a driver regardless of whether we use drm/sched.
>>
>> Lina has been polite and accommodating while AMD calls her code "outright nonsense" and gets "outright NAK"s, and puts her into an impossible catch-22 where no matter what she does it's NAK'd.
> 
> Well as far as I can see I'm totally polite as well.
> 
> Pointing out that approaches doesn't seem to make sense and NAKing
> patches is a perfectly normal part of the review process.
> 
> What you need to to is to take a step back and ask yourself why this
> here is facing so much rejection from our side. I have to admit that I
> don't seem to be good at explaining that, cause we are obviously talking
> past each other, but you don't seem to try hard to understand what I'm
> pointing out either.
> 
>> That's not ok.
> 
> As far as I can see it is.
> 
> As maintainer of a commonly used component my first duty is to preserve
> the status quo and prevent modifications which are not well thought
> through. And to be honest those changes here strongly looks like Lina is
> just adjusting the code to match her requirements without looking left
> and right first.
> 
> Regards,
> Christian.
> 
> 

I give up. You are ignoring everything we say, and rejecting everything 
we suggest. We've already explained why drm_sched doesn't work for us. 
I'm tired of repeating the same explanation over and over again only to 
be ignored and told I'm wrong.

I'll start working on a new, much simpler Rust-native scheduler based on 
the workqueue Rust abstractions which are in review.

~~ Lina


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name
@ 2023-07-18  2:35               ` Asahi Lina
  0 siblings, 0 replies; 86+ messages in thread
From: Asahi Lina @ 2023-07-18  2:35 UTC (permalink / raw)
  To: Christian König, alyssa, Luben Tuikov, David Airlie,
	Daniel Vetter, Sumit Semwal
  Cc: asahi, linux-media, dri-devel, Faith Ekstrand, linux-kernel

On 18/07/2023 00.55, Christian König wrote:
> Am 15.07.23 um 16:14 schrieb alyssa@rosenzweig.io:
>> 15 July 2023 at 00:03, "Luben Tuikov" <luben.tuikov@amd.com> wrote:
>>> On 2023-07-14 05:57, Christian König wrote:
>>>
>>>> Am 14.07.23 um 11:49 schrieb Asahi Lina:
>>>>
>>>>> On 14/07/2023 17.43, Christian König wrote:
>>>>>
>>>>    Am 14.07.23 um 10:21 schrieb Asahi Lina:
>>>>    A signaled scheduler fence can outlive its scheduler, since fences are
>>>>    independencly reference counted. Therefore, we can't reference the
>>>>    scheduler in the get_timeline_name() implementation.
>>>>
>>>>    Fixes oopses on `cat /sys/kernel/debug/dma_buf/bufinfo` when shared
>>>>    dma-bufs reference fences from GPU schedulers that no longer exist.
>>>>
>>>>    Signed-off-by: Asahi Lina <lina@asahilina.net>
>>>>    ---
>>>>       drivers/gpu/drm/scheduler/sched_entity.c | 7 ++++++-
>>>>       drivers/gpu/drm/scheduler/sched_fence.c  | 4 +++-
>>>>       include/drm/gpu_scheduler.h              | 5 +++++
>>>>       3 files changed, 14 insertions(+), 2 deletions(-)
>>>>
>>>>    diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
>>>>    b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>    index b2bbc8a68b30..17f35b0b005a 100644
>>>>    --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>>>    +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>    @@ -389,7 +389,12 @@ static bool
>>>>    drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
>>>>                  /*
>>>>                * Fence is from the same scheduler, only need to wait for
>>>>    -         * it to be scheduled
>>>>    +         * it to be scheduled.
>>>>    +         *
>>>>    +         * Note: s_fence->sched could have been freed and reallocated
>>>>    +         * as another scheduler. This false positive case is okay,
>>>>    as if
>>>>    +         * the old scheduler was freed all of its jobs must have
>>>>    +         * signaled their completion fences.
>>>>
>>>>    This is outright nonsense. As long as an entity for a scheduler exists
>>>>    it is not allowed to free up this scheduler.
>>>>
>>>>    So this function can't be called like this.
>>>>
>>>>> As I already explained, the fences can outlive their scheduler. That
>>>>>    means *this* entity certainly exists for *this* scheduler, but the
>>>>>    *dependency* fence might have come from a past scheduler that was
>>>>>    already destroyed along with all of its entities, and its address reused.
>>>>>
>>>>    
>>>>    Well this is function is not about fences, this function is a callback
>>>>    for the entity.
>>>>    
>>>>
>>>>> Christian, I'm really getting tired of your tone. I don't appreciate
>>>>>    being told my comments are "outright nonsense" when you don't even
>>>>>    take the time to understand what the issue is and what I'm trying to
>>>>>    do/document. If you aren't interested in working with me, I'm just
>>>>>    going to give up on drm_sched, wait until Rust gets workqueue support,
>>>>>    and reimplement it in Rust. You can keep your broken fence lifetime
>>>>>    semantics and I'll do my own thing.
>>>>>
>>>>    
>>>>    I'm certainly trying to help here, but you seem to have unrealistic
>>>>    expectations.
>>>>    
>>>>    I perfectly understand what you are trying to do, but you don't seem to
>>>>    understand that this functionality here isn't made for your use case.
>>>>    
>>>>    We can adjust the functionality to better match your requirements, but
>>>>    you can't say it is broken because it doesn't work when you use it not
>>>>    in the way it is intended to be used.
>>>>
>>> I believe "adjusting" functionality to fit some external requirements,
>>> may have unintended consequences, requiring yet more and more "adjustments".
>>> (Or may allow (new) drivers to do wild things which may lead to wild results. :-) )
>>>
>>> We need to be extra careful and wary of this.
>> Either drm/scheduler is common code that we should use for our driver, in which case we need to "adjust" it to fit the requirements of a safe Rust abstraction usable for AGX.
> 
> Well this is the fundamental disagreement we have. As far as I can see
> you don't need to adjust anything in the common drm/scheduler code.
> 
> That code works with quite a bunch of different drivers, including the
> Intel XE which has similar requirements to your work here.
> 
> We can talk about gradually improving the common code, but as Luben
> already wrote as well this needs to be done very carefully.
> 
>>    Or, drm/scheduler is not common code intended for drivers with our requirements, and then we need to be able to write our own scheduler.
>>
>> AMD has NAK'd both options, effectively NAK'ing the driver.
>>
>> I will ask a simple yes/no question: Should we use drm/sched?
> 
> Well, yes.
> 
>>
>> If yes, it will need patches like these,
> 
> No, you don't.
> 
> First of all you need to try to adjust your driver to match the
> requirements of drm/scheduler and *not* the other way around.
> 
>>    and AMD needs to be ok with that and stop NAK'ing them on sight becuase they don't match the existing requirements.
>>
>> If no, we will write our own scheduler in Rust, and AMD needs to be ok with that and not NAK it on sight because it's not drm/sched.
>>
>> Which is it?
>>
>> Note if we write a Rust scheduler, drm/sched and amdgpu will be unaffected. If we do that and AMD comes back and NAKs it -- as said in this thread would "probably" happen -- then it is impossible for us to upstream a driver regardless of whether we use drm/sched.
>>
>> Lina has been polite and accommodating while AMD calls her code "outright nonsense" and gets "outright NAK"s, and puts her into an impossible catch-22 where no matter what she does it's NAK'd.
> 
> Well as far as I can see I'm totally polite as well.
> 
> Pointing out that approaches doesn't seem to make sense and NAKing
> patches is a perfectly normal part of the review process.
> 
> What you need to to is to take a step back and ask yourself why this
> here is facing so much rejection from our side. I have to admit that I
> don't seem to be good at explaining that, cause we are obviously talking
> past each other, but you don't seem to try hard to understand what I'm
> pointing out either.
> 
>> That's not ok.
> 
> As far as I can see it is.
> 
> As maintainer of a commonly used component my first duty is to preserve
> the status quo and prevent modifications which are not well thought
> through. And to be honest those changes here strongly looks like Lina is
> just adjusting the code to match her requirements without looking left
> and right first.
> 
> Regards,
> Christian.
> 
> 

I give up. You are ignoring everything we say, and rejecting everything 
we suggest. We've already explained why drm_sched doesn't work for us. 
I'm tired of repeating the same explanation over and over again only to 
be ignored and told I'm wrong.

I'll start working on a new, much simpler Rust-native scheduler based on 
the workqueue Rust abstractions which are in review.

~~ Lina


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 3/3] drm/scheduler: Clean up jobs when the scheduler is torn down.
  2023-07-17 22:45           ` Asahi Lina
@ 2023-07-18  5:14             ` Luben Tuikov
  -1 siblings, 0 replies; 86+ messages in thread
From: Luben Tuikov @ 2023-07-18  5:14 UTC (permalink / raw)
  To: Asahi Lina, David Airlie, Daniel Vetter, Sumit Semwal,
	Christian König
  Cc: Faith Ekstrand, Alyssa Rosenzweig, dri-devel, linux-kernel,
	linux-media, asahi

On 2023-07-17 18:45, Asahi Lina wrote:
> On 18/07/2023 02.40, Luben Tuikov wrote:
>> On 2023-07-16 03:51, Asahi Lina wrote:
>>> On 15/07/2023 16.14, Luben Tuikov wrote:
>>>> On 2023-07-14 04:21, Asahi Lina wrote:
>>>>> drm_sched_fini() currently leaves any pending jobs dangling, which
>>>>> causes segfaults and other badness when job completion fences are
>>>>> signaled after the scheduler is torn down.
>>>>
>>>> If there are pending jobs, ideally we want to call into the driver,
>>>> so that it can release resources it may be holding for those.
>>>> The idea behind "pending" is that they are pending in the hardware
>>>> and we don't know their state until signalled/the callback called.
>>>> (Or unless the device is reset and we get a notification of that fact.)
>>>
>>> That's what the job->free_job() callback does, then the driver is free
>>> to do whatever it wants with those jobs. A driver could opt to
>>> synchronously kill those jobs (if it can) or account for them
>>> separately/asynchronously.
>>>
>>> What this patch basically says is that if you destroy a scheduler with
>>> pending jobs, it immediately considers them terminated with an error,
>>> and returns ownership back to the driver for freeing. Then the driver
>>> can decide how to handle the rest and whatever the underlying hardware
>>> state is.
>>>
>>>>> Explicitly detach all jobs from their completion callbacks and free
>>>>> them. This makes it possible to write a sensible safe abstraction for
>>>>> drm_sched, without having to externally duplicate the tracking of
>>>>> in-flight jobs.
>>>>>
>>>>> This shouldn't regress any existing drivers, since calling
>>>>> drm_sched_fini() with any pending jobs is broken and this change should
>>>>> be a no-op if there are no pending jobs.
>>>>
>>>> While this statement is true on its own, it kind of contradicts
>>>> the premise of the first paragraph.
>>>
>>> I mean right *now* it's broken, before this patch. I'm trying to make it
>>> safe, but it shouldn't regress any exiting drivers since if they trigger
>>> this code path they are broken today.
>>
>> Not sure about other drivers--they can speak for themselves and the CC list
>> should include them--please use "dim add-missing-cc" and make sure
>> that the Git commit description contains the Cc tags--then git send-email
>> will populate the SMTP CC. Feel free to add more Cc tags on top of that.
> 
> I use `b4 prep -c` which I think does the same thing? I just ran it 
> again and it only added 'linaro-mm-sig@lists.linaro.org', not sure why 
> that one wasn't there. Am I missing anything else?

Not sure about "b4 prep -c"--using "git send-email" instead, but what is
important is to add the Cc: tags as part of the commit message. A "git log" of
drm-misc-next shows the proper format. Then maintainers add Link:
tag to the correct email thread, which is usually completely automated
by "dim" or by "git am", or both.

I never do any of this stuff manually and it's all done by tools
like "dim", and the such. Sometimes I'd run "scripts/get_maintainer.pl"
manually ("dim add-missing-cc" runs that script too), as well
as "git blame" and "git log -- <file>" to see if I can add more Cc:
tags to the commit message to keep people well informed. Then
let "git send-email" add them to the SMTP CC, when the patch
is actually emailed out.

> 
>>>
>>>>
>>>>> Signed-off-by: Asahi Lina <lina@asahilina.net>
>>>>> ---
>>>>>    drivers/gpu/drm/scheduler/sched_main.c | 32 ++++++++++++++++++++++++++++++--
>>>>>    1 file changed, 30 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>>> index 1f3bc3606239..a4da4aac0efd 100644
>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>> @@ -1186,10 +1186,38 @@ EXPORT_SYMBOL(drm_sched_init);
>>>>>    void drm_sched_fini(struct drm_gpu_scheduler *sched)
>>>>>    {
>>>>>    	struct drm_sched_entity *s_entity;
>>>>> +	struct drm_sched_job *s_job, *tmp;
>>>>>    	int i;
>>>>>    
>>>>> -	if (sched->thread)
>>>>> -		kthread_stop(sched->thread);
>>>>> +	if (!sched->thread)
>>>>> +		return;
>>>>> +
>>>>> +	/*
>>>>> +	 * Stop the scheduler, detaching all jobs from their hardware callbacks
>>>>> +	 * and cleaning up complete jobs.
>>>>> +	 */
>>>>> +	drm_sched_stop(sched, NULL);
>>>>> +
>>>>> +	/*
>>>>> +	 * Iterate through the pending job list and free all jobs.
>>>>> +	 * This assumes the driver has either guaranteed jobs are already stopped, or that
>>>>> +	 * otherwise it is responsible for keeping any necessary data structures for
>>>>> +	 * in-progress jobs alive even when the free_job() callback is called early (e.g. by
>>>>> +	 * putting them in its own queue or doing its own refcounting).
>>>>> +	 */
>>>>> +	list_for_each_entry_safe(s_job, tmp, &sched->pending_list, list) {
>>>>> +		spin_lock(&sched->job_list_lock);
>>>>> +		list_del_init(&s_job->list);
>>>>> +		spin_unlock(&sched->job_list_lock);
>>>>> +
>>>>> +		dma_fence_set_error(&s_job->s_fence->finished, -ESRCH);
>>>>> +		drm_sched_fence_finished(s_job->s_fence);
>>>>
>>>> I'd imagine it's better to rebase this on top of drm-misc-next where
>>>> drm_sched_fence_finished() takes one more parameter--the error.
>>>
>>> Ah, sure! I can do that.
>>
>> It's worth posting it as a stand-alone patch. Please make sure to add Cc tags
>> into the commit description--use "dim add-missing-cc", perhaps also
>> git-blame and git-log might help with additional Cc. "scripts/get_maintainer.pl"
>> for files unaffected by this commit. (dim add-missing-cc uses get_maintainer.pl
>> for affected files.)
>>
>> Feel free to post it stand-alone and we'll let the natural review process take over. :-)
> 
> I already posted this one as part of the bindings RFC and the other one 
> stand-alone, and they got NAKed by Christian, that's why it's a specific 
> series for sched now with the docs, per Daniel's suggestion... now 
> you're saying I should post them stand-alone again... ?

Oh, I see. I don't remember why Christian NAK-ed it--do you have a link by any chance?

As I said, conceptually I don't mind this patch as there is some merit to what it is
trying to do, but this does beg the question why no drivers seem to have wanted it thus far?

However, it is worth nothing that there is some logic to this patch, so I'd say,
if driver writers agree with it (we do call their free_job() method after all--do
we need to check that it is non-null?), we can at least try to cement
whether this is something they think is good to have, or is redundant, or breaks
some assumption, and so on.
-- 
Regards,
Luben

> 
>>>
>>>>
>>>>> +
>>>>> +		WARN_ON(s_job->s_fence->parent);
>>>>> +		sched->ops->free_job(s_job);
>>>>> +	}
>>>>> +
>>>>> +	kthread_stop(sched->thread);
>>>>>    
>>>>>    	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
>>>>>    		struct drm_sched_rq *rq = &sched->sched_rq[i];
>>>>>
>>>>
>>>> Conceptually I don't mind this patch--I see what it is trying to achieve,
>>>> but technically, we want the driver to detect GPU removal and return shared
>>>> resources back, such as "jobs", which DRM is also aware of.
>>>
>>> I think you missed the context of why I'm doing this, so in short: my
>>
>> As a general rule of thumb, in my writing emails I try to avoid using
>> "you" and "I" as much as possible--it sets this divisive stage, and it
>> can get misrepresented, especially in email.
>>
>> As is the case in research literature, if I absolutely have to use a pronoun--which
>> rarely happens, I always use "we", and this is the most number of "I"-s I've used
>> in a long while.
>>
>>> use case (like Xe's) involves using a separate drm_sched instance *per
>>> file* since these queues are scheduled directly by the firmware. So this
>>> isn't about GPU removal, but rather about a GPU context going away while
>>> jobs are in flight (e.g. the process got killed). We want that to
>>> quickly kill the "DRM view" of the world, including signaling all the
>>> fences with an error and freeing resources like the scheduler itself.
>>>
>>> In the case of this particular GPU, there is no known way to actively
>>> and synchronously abort GPU jobs, so we need to let them run to
>>> completion (or failure), but we don't want that to block process cleanup
>>> and freeing a bunch of high-level resources. The driver is architected
>>> roughly along the lines of a firmware abstraction layer that maps to the
>>> firmware shared memory structures, and then a layer on top that
>>> implements the DRM view. When a process gets killed, the DRM side (which
>>> includes the scheduler, etc.) gets torn down immediately, and it makes
>>> sense to handle this cleanup inside drm_sched since it already has a
>>> view into what jobs are in flight. Otherwise, I would have to duplicate
>>> job tracking in the driver (actually worse: in the Rust abstraction for
>>> safety), which doesn't make much sense.
>>>
>>> But what I *do* have in the driver is tracking of the firmware
>>> structures. So when the drm_sched gets torn down and all the jobs
>>> killed, the underlying firmware jobs do run to completion, and the
>>> resources they use are all cleaned up after that (it's all reference
>>> counted).
>>
>> The ref-count definitely helps here.
>>
>>> The primitive involved here is that in-flight firmware jobs
>>> are assigned an event completion slot, and that keeps a reference to
>>> them from a global array until the events fire and all the jobs are
>>> known to have completed. This keeps things memory-safe, since we
>>> absolutely cannot free/destroy firmware structures while they are in use
>>> (otherwise the firmware crashes, which is fatal on these GPUs - requires
>>> a full system reboot to recover).
>>>
>>> In practice, with the VM map model that we use, what ends up happening
>>> when a process gets killed is that all the user objects for in-flight
>>> jobs get unmapped, which usually causes the GPU hardware (not firmware)
>>> to fault. This then triggers early termination of jobs anyway via the
>>> firmware fault recovery flow. But even that takes some short amount of
>>> time, and by then all the drm_sched stuff is long gone and we're just
>>> dealing with the in-flight firmware stuff.
>>>
>>>> In the case where we're initiating the tear, we should notify the driver that
>>>> we're about to forget jobs (resources), so that it knows to return them back
>>>> or that it shouldn't notify us for them (since we've notified we're forgetting them.)
>>>
>>> That contradicts Christian's comment. I tried to document that (after
>>> this patch) the scheduler no longer cares about hw fences and whether
>>> they are signaled or not after it's destroyed, and I got a strongly
>>> worded NAK for it. Sooo... which is it? Is it okay for drivers not to
>>> signal the hw fence after a scheduler teardown, or not?
>>
>> Christian is correct in that we don't want to hang upstream control
>> to the whims of a low-level device driver.
>>
>>> But really, I don't see a use case for an explicit "about to forget job"
>>> callback. The job free callback already serves the purpose of telling
>>
>> Long time ago, in a galaxy far far away, this was needed in order
>> to prevent device write-DMA into non-existing (random) memory. As
>> this is not the case anymore, go with Christian's comment.
>>
>>> the driver to clean up resources associated with a job. If it wants to
>>> synchronously abort things there, it could easily take over its own
>>> fence signaling and do something with the underlying stuff if the fence
>>> is not signaled yet.
>>>
>>> In my case, since the driver is written in Rust and free_job() just maps
>>> to the destructor (Drop impl), that just ends up freeing a bunch of
>>> memory and other objects, and I don't particularly care about the state
>>> of the firmware side any more after that. The flow is the same whether
>>> it was a successful job completion, a failure, or an early destruction
>>> due to the drm_sched getting torn down.
>>>
>>>> (Note also that in this latter case, traditionally, the device would be reset,
>>>> so that we can guarantee that it has forgotten all shared resources which
>>>> we are to tear up. This is somewhat more complicated with GPUs, thus the method
>>>> pointed out above.)
>>>
>>> Yeah, in the firmware scheduling case we can't do this at all unless the
>>> firmware has an explicit teardown/forget op (which I'm not aware of) and
>>> a full GPU reset isn't something we can do either. Hence we just let the
>>> underlying jobs complete. In practice they tend to die pretty quickly
>>> anyway once all the buffers are unmapped.
>>
>> Perhaps in the future, as more complex workloads are deferred to this
>> hardware and driver, a real-time requirement might be needed for this
>> "tend to die pretty quickly", that that there's some guarantee of
>> work resuming in some finite time.
> 
> That's not something we can control. This hardware is reverse-engineered 
> and we don't get to write the firmware (it's signed). Maybe there is a 
> job cancel op, and maybe we'll find it some day, or maybe not. I've 
> certainly never seen macOS do anything like that, including in very 
> blatant cases like a 30-second compute job. On macOS it kept running to 
> completion even after I killed the process. We can't make the 
> hardware/firmware do something it can't do.
> 
> At least there's firmware preemption though, so a rogue long-running job 
> shouldn't block everything else.
> 
> ~~ Lina
> 


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 3/3] drm/scheduler: Clean up jobs when the scheduler is torn down.
@ 2023-07-18  5:14             ` Luben Tuikov
  0 siblings, 0 replies; 86+ messages in thread
From: Luben Tuikov @ 2023-07-18  5:14 UTC (permalink / raw)
  To: Asahi Lina, David Airlie, Daniel Vetter, Sumit Semwal,
	Christian König
  Cc: Faith Ekstrand, linux-kernel, dri-devel, asahi,
	Alyssa Rosenzweig, linux-media

On 2023-07-17 18:45, Asahi Lina wrote:
> On 18/07/2023 02.40, Luben Tuikov wrote:
>> On 2023-07-16 03:51, Asahi Lina wrote:
>>> On 15/07/2023 16.14, Luben Tuikov wrote:
>>>> On 2023-07-14 04:21, Asahi Lina wrote:
>>>>> drm_sched_fini() currently leaves any pending jobs dangling, which
>>>>> causes segfaults and other badness when job completion fences are
>>>>> signaled after the scheduler is torn down.
>>>>
>>>> If there are pending jobs, ideally we want to call into the driver,
>>>> so that it can release resources it may be holding for those.
>>>> The idea behind "pending" is that they are pending in the hardware
>>>> and we don't know their state until signalled/the callback called.
>>>> (Or unless the device is reset and we get a notification of that fact.)
>>>
>>> That's what the job->free_job() callback does, then the driver is free
>>> to do whatever it wants with those jobs. A driver could opt to
>>> synchronously kill those jobs (if it can) or account for them
>>> separately/asynchronously.
>>>
>>> What this patch basically says is that if you destroy a scheduler with
>>> pending jobs, it immediately considers them terminated with an error,
>>> and returns ownership back to the driver for freeing. Then the driver
>>> can decide how to handle the rest and whatever the underlying hardware
>>> state is.
>>>
>>>>> Explicitly detach all jobs from their completion callbacks and free
>>>>> them. This makes it possible to write a sensible safe abstraction for
>>>>> drm_sched, without having to externally duplicate the tracking of
>>>>> in-flight jobs.
>>>>>
>>>>> This shouldn't regress any existing drivers, since calling
>>>>> drm_sched_fini() with any pending jobs is broken and this change should
>>>>> be a no-op if there are no pending jobs.
>>>>
>>>> While this statement is true on its own, it kind of contradicts
>>>> the premise of the first paragraph.
>>>
>>> I mean right *now* it's broken, before this patch. I'm trying to make it
>>> safe, but it shouldn't regress any exiting drivers since if they trigger
>>> this code path they are broken today.
>>
>> Not sure about other drivers--they can speak for themselves and the CC list
>> should include them--please use "dim add-missing-cc" and make sure
>> that the Git commit description contains the Cc tags--then git send-email
>> will populate the SMTP CC. Feel free to add more Cc tags on top of that.
> 
> I use `b4 prep -c` which I think does the same thing? I just ran it 
> again and it only added 'linaro-mm-sig@lists.linaro.org', not sure why 
> that one wasn't there. Am I missing anything else?

Not sure about "b4 prep -c"--using "git send-email" instead, but what is
important is to add the Cc: tags as part of the commit message. A "git log" of
drm-misc-next shows the proper format. Then maintainers add Link:
tag to the correct email thread, which is usually completely automated
by "dim" or by "git am", or both.

I never do any of this stuff manually and it's all done by tools
like "dim", and the such. Sometimes I'd run "scripts/get_maintainer.pl"
manually ("dim add-missing-cc" runs that script too), as well
as "git blame" and "git log -- <file>" to see if I can add more Cc:
tags to the commit message to keep people well informed. Then
let "git send-email" add them to the SMTP CC, when the patch
is actually emailed out.

> 
>>>
>>>>
>>>>> Signed-off-by: Asahi Lina <lina@asahilina.net>
>>>>> ---
>>>>>    drivers/gpu/drm/scheduler/sched_main.c | 32 ++++++++++++++++++++++++++++++--
>>>>>    1 file changed, 30 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>>> index 1f3bc3606239..a4da4aac0efd 100644
>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>> @@ -1186,10 +1186,38 @@ EXPORT_SYMBOL(drm_sched_init);
>>>>>    void drm_sched_fini(struct drm_gpu_scheduler *sched)
>>>>>    {
>>>>>    	struct drm_sched_entity *s_entity;
>>>>> +	struct drm_sched_job *s_job, *tmp;
>>>>>    	int i;
>>>>>    
>>>>> -	if (sched->thread)
>>>>> -		kthread_stop(sched->thread);
>>>>> +	if (!sched->thread)
>>>>> +		return;
>>>>> +
>>>>> +	/*
>>>>> +	 * Stop the scheduler, detaching all jobs from their hardware callbacks
>>>>> +	 * and cleaning up complete jobs.
>>>>> +	 */
>>>>> +	drm_sched_stop(sched, NULL);
>>>>> +
>>>>> +	/*
>>>>> +	 * Iterate through the pending job list and free all jobs.
>>>>> +	 * This assumes the driver has either guaranteed jobs are already stopped, or that
>>>>> +	 * otherwise it is responsible for keeping any necessary data structures for
>>>>> +	 * in-progress jobs alive even when the free_job() callback is called early (e.g. by
>>>>> +	 * putting them in its own queue or doing its own refcounting).
>>>>> +	 */
>>>>> +	list_for_each_entry_safe(s_job, tmp, &sched->pending_list, list) {
>>>>> +		spin_lock(&sched->job_list_lock);
>>>>> +		list_del_init(&s_job->list);
>>>>> +		spin_unlock(&sched->job_list_lock);
>>>>> +
>>>>> +		dma_fence_set_error(&s_job->s_fence->finished, -ESRCH);
>>>>> +		drm_sched_fence_finished(s_job->s_fence);
>>>>
>>>> I'd imagine it's better to rebase this on top of drm-misc-next where
>>>> drm_sched_fence_finished() takes one more parameter--the error.
>>>
>>> Ah, sure! I can do that.
>>
>> It's worth posting it as a stand-alone patch. Please make sure to add Cc tags
>> into the commit description--use "dim add-missing-cc", perhaps also
>> git-blame and git-log might help with additional Cc. "scripts/get_maintainer.pl"
>> for files unaffected by this commit. (dim add-missing-cc uses get_maintainer.pl
>> for affected files.)
>>
>> Feel free to post it stand-alone and we'll let the natural review process take over. :-)
> 
> I already posted this one as part of the bindings RFC and the other one 
> stand-alone, and they got NAKed by Christian, that's why it's a specific 
> series for sched now with the docs, per Daniel's suggestion... now 
> you're saying I should post them stand-alone again... ?

Oh, I see. I don't remember why Christian NAK-ed it--do you have a link by any chance?

As I said, conceptually I don't mind this patch as there is some merit to what it is
trying to do, but this does beg the question why no drivers seem to have wanted it thus far?

However, it is worth nothing that there is some logic to this patch, so I'd say,
if driver writers agree with it (we do call their free_job() method after all--do
we need to check that it is non-null?), we can at least try to cement
whether this is something they think is good to have, or is redundant, or breaks
some assumption, and so on.
-- 
Regards,
Luben

> 
>>>
>>>>
>>>>> +
>>>>> +		WARN_ON(s_job->s_fence->parent);
>>>>> +		sched->ops->free_job(s_job);
>>>>> +	}
>>>>> +
>>>>> +	kthread_stop(sched->thread);
>>>>>    
>>>>>    	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
>>>>>    		struct drm_sched_rq *rq = &sched->sched_rq[i];
>>>>>
>>>>
>>>> Conceptually I don't mind this patch--I see what it is trying to achieve,
>>>> but technically, we want the driver to detect GPU removal and return shared
>>>> resources back, such as "jobs", which DRM is also aware of.
>>>
>>> I think you missed the context of why I'm doing this, so in short: my
>>
>> As a general rule of thumb, in my writing emails I try to avoid using
>> "you" and "I" as much as possible--it sets this divisive stage, and it
>> can get misrepresented, especially in email.
>>
>> As is the case in research literature, if I absolutely have to use a pronoun--which
>> rarely happens, I always use "we", and this is the most number of "I"-s I've used
>> in a long while.
>>
>>> use case (like Xe's) involves using a separate drm_sched instance *per
>>> file* since these queues are scheduled directly by the firmware. So this
>>> isn't about GPU removal, but rather about a GPU context going away while
>>> jobs are in flight (e.g. the process got killed). We want that to
>>> quickly kill the "DRM view" of the world, including signaling all the
>>> fences with an error and freeing resources like the scheduler itself.
>>>
>>> In the case of this particular GPU, there is no known way to actively
>>> and synchronously abort GPU jobs, so we need to let them run to
>>> completion (or failure), but we don't want that to block process cleanup
>>> and freeing a bunch of high-level resources. The driver is architected
>>> roughly along the lines of a firmware abstraction layer that maps to the
>>> firmware shared memory structures, and then a layer on top that
>>> implements the DRM view. When a process gets killed, the DRM side (which
>>> includes the scheduler, etc.) gets torn down immediately, and it makes
>>> sense to handle this cleanup inside drm_sched since it already has a
>>> view into what jobs are in flight. Otherwise, I would have to duplicate
>>> job tracking in the driver (actually worse: in the Rust abstraction for
>>> safety), which doesn't make much sense.
>>>
>>> But what I *do* have in the driver is tracking of the firmware
>>> structures. So when the drm_sched gets torn down and all the jobs
>>> killed, the underlying firmware jobs do run to completion, and the
>>> resources they use are all cleaned up after that (it's all reference
>>> counted).
>>
>> The ref-count definitely helps here.
>>
>>> The primitive involved here is that in-flight firmware jobs
>>> are assigned an event completion slot, and that keeps a reference to
>>> them from a global array until the events fire and all the jobs are
>>> known to have completed. This keeps things memory-safe, since we
>>> absolutely cannot free/destroy firmware structures while they are in use
>>> (otherwise the firmware crashes, which is fatal on these GPUs - requires
>>> a full system reboot to recover).
>>>
>>> In practice, with the VM map model that we use, what ends up happening
>>> when a process gets killed is that all the user objects for in-flight
>>> jobs get unmapped, which usually causes the GPU hardware (not firmware)
>>> to fault. This then triggers early termination of jobs anyway via the
>>> firmware fault recovery flow. But even that takes some short amount of
>>> time, and by then all the drm_sched stuff is long gone and we're just
>>> dealing with the in-flight firmware stuff.
>>>
>>>> In the case where we're initiating the tear, we should notify the driver that
>>>> we're about to forget jobs (resources), so that it knows to return them back
>>>> or that it shouldn't notify us for them (since we've notified we're forgetting them.)
>>>
>>> That contradicts Christian's comment. I tried to document that (after
>>> this patch) the scheduler no longer cares about hw fences and whether
>>> they are signaled or not after it's destroyed, and I got a strongly
>>> worded NAK for it. Sooo... which is it? Is it okay for drivers not to
>>> signal the hw fence after a scheduler teardown, or not?
>>
>> Christian is correct in that we don't want to hang upstream control
>> to the whims of a low-level device driver.
>>
>>> But really, I don't see a use case for an explicit "about to forget job"
>>> callback. The job free callback already serves the purpose of telling
>>
>> Long time ago, in a galaxy far far away, this was needed in order
>> to prevent device write-DMA into non-existing (random) memory. As
>> this is not the case anymore, go with Christian's comment.
>>
>>> the driver to clean up resources associated with a job. If it wants to
>>> synchronously abort things there, it could easily take over its own
>>> fence signaling and do something with the underlying stuff if the fence
>>> is not signaled yet.
>>>
>>> In my case, since the driver is written in Rust and free_job() just maps
>>> to the destructor (Drop impl), that just ends up freeing a bunch of
>>> memory and other objects, and I don't particularly care about the state
>>> of the firmware side any more after that. The flow is the same whether
>>> it was a successful job completion, a failure, or an early destruction
>>> due to the drm_sched getting torn down.
>>>
>>>> (Note also that in this latter case, traditionally, the device would be reset,
>>>> so that we can guarantee that it has forgotten all shared resources which
>>>> we are to tear up. This is somewhat more complicated with GPUs, thus the method
>>>> pointed out above.)
>>>
>>> Yeah, in the firmware scheduling case we can't do this at all unless the
>>> firmware has an explicit teardown/forget op (which I'm not aware of) and
>>> a full GPU reset isn't something we can do either. Hence we just let the
>>> underlying jobs complete. In practice they tend to die pretty quickly
>>> anyway once all the buffers are unmapped.
>>
>> Perhaps in the future, as more complex workloads are deferred to this
>> hardware and driver, a real-time requirement might be needed for this
>> "tend to die pretty quickly", that that there's some guarantee of
>> work resuming in some finite time.
> 
> That's not something we can control. This hardware is reverse-engineered 
> and we don't get to write the firmware (it's signed). Maybe there is a 
> job cancel op, and maybe we'll find it some day, or maybe not. I've 
> certainly never seen macOS do anything like that, including in very 
> blatant cases like a 30-second compute job. On macOS it kept running to 
> completion even after I killed the process. We can't make the 
> hardware/firmware do something it can't do.
> 
> At least there's firmware preemption though, so a rogue long-running job 
> shouldn't block everything else.
> 
> ~~ Lina
> 


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name
  2023-07-18  2:35               ` Asahi Lina
@ 2023-07-18  5:45                 ` Luben Tuikov
  -1 siblings, 0 replies; 86+ messages in thread
From: Luben Tuikov @ 2023-07-18  5:45 UTC (permalink / raw)
  To: Asahi Lina, Christian König, alyssa, David Airlie,
	Daniel Vetter, Sumit Semwal
  Cc: Faith Ekstrand, dri-devel, linux-kernel, linux-media, asahi

On 2023-07-17 22:35, Asahi Lina wrote:
> On 18/07/2023 00.55, Christian König wrote:
>> Am 15.07.23 um 16:14 schrieb alyssa@rosenzweig.io:
>>> 15 July 2023 at 00:03, "Luben Tuikov" <luben.tuikov@amd.com> wrote:
>>>> On 2023-07-14 05:57, Christian König wrote:
>>>>
>>>>> Am 14.07.23 um 11:49 schrieb Asahi Lina:
>>>>>
>>>>>> On 14/07/2023 17.43, Christian König wrote:
>>>>>>
>>>>>    Am 14.07.23 um 10:21 schrieb Asahi Lina:
>>>>>    A signaled scheduler fence can outlive its scheduler, since fences are
>>>>>    independencly reference counted. Therefore, we can't reference the
>>>>>    scheduler in the get_timeline_name() implementation.
>>>>>
>>>>>    Fixes oopses on `cat /sys/kernel/debug/dma_buf/bufinfo` when shared
>>>>>    dma-bufs reference fences from GPU schedulers that no longer exist.
>>>>>
>>>>>    Signed-off-by: Asahi Lina <lina@asahilina.net>
>>>>>    ---
>>>>>       drivers/gpu/drm/scheduler/sched_entity.c | 7 ++++++-
>>>>>       drivers/gpu/drm/scheduler/sched_fence.c  | 4 +++-
>>>>>       include/drm/gpu_scheduler.h              | 5 +++++
>>>>>       3 files changed, 14 insertions(+), 2 deletions(-)
>>>>>
>>>>>    diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>    b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>    index b2bbc8a68b30..17f35b0b005a 100644
>>>>>    --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>    +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>    @@ -389,7 +389,12 @@ static bool
>>>>>    drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
>>>>>                  /*
>>>>>                * Fence is from the same scheduler, only need to wait for
>>>>>    -         * it to be scheduled
>>>>>    +         * it to be scheduled.
>>>>>    +         *
>>>>>    +         * Note: s_fence->sched could have been freed and reallocated
>>>>>    +         * as another scheduler. This false positive case is okay,
>>>>>    as if
>>>>>    +         * the old scheduler was freed all of its jobs must have
>>>>>    +         * signaled their completion fences.
>>>>>
>>>>>    This is outright nonsense. As long as an entity for a scheduler exists
>>>>>    it is not allowed to free up this scheduler.
>>>>>
>>>>>    So this function can't be called like this.
>>>>>
>>>>>> As I already explained, the fences can outlive their scheduler. That
>>>>>>    means *this* entity certainly exists for *this* scheduler, but the
>>>>>>    *dependency* fence might have come from a past scheduler that was
>>>>>>    already destroyed along with all of its entities, and its address reused.
>>>>>>
>>>>>    
>>>>>    Well this is function is not about fences, this function is a callback
>>>>>    for the entity.
>>>>>    
>>>>>
>>>>>> Christian, I'm really getting tired of your tone. I don't appreciate
>>>>>>    being told my comments are "outright nonsense" when you don't even
>>>>>>    take the time to understand what the issue is and what I'm trying to
>>>>>>    do/document. If you aren't interested in working with me, I'm just
>>>>>>    going to give up on drm_sched, wait until Rust gets workqueue support,
>>>>>>    and reimplement it in Rust. You can keep your broken fence lifetime
>>>>>>    semantics and I'll do my own thing.
>>>>>>
>>>>>    
>>>>>    I'm certainly trying to help here, but you seem to have unrealistic
>>>>>    expectations.
>>>>>    
>>>>>    I perfectly understand what you are trying to do, but you don't seem to
>>>>>    understand that this functionality here isn't made for your use case.
>>>>>    
>>>>>    We can adjust the functionality to better match your requirements, but
>>>>>    you can't say it is broken because it doesn't work when you use it not
>>>>>    in the way it is intended to be used.
>>>>>
>>>> I believe "adjusting" functionality to fit some external requirements,
>>>> may have unintended consequences, requiring yet more and more "adjustments".
>>>> (Or may allow (new) drivers to do wild things which may lead to wild results. :-) )
>>>>
>>>> We need to be extra careful and wary of this.
>>> Either drm/scheduler is common code that we should use for our driver, in which case we need to "adjust" it to fit the requirements of a safe Rust abstraction usable for AGX.
>>
>> Well this is the fundamental disagreement we have. As far as I can see
>> you don't need to adjust anything in the common drm/scheduler code.
>>
>> That code works with quite a bunch of different drivers, including the
>> Intel XE which has similar requirements to your work here.
>>
>> We can talk about gradually improving the common code, but as Luben
>> already wrote as well this needs to be done very carefully.
>>
>>>    Or, drm/scheduler is not common code intended for drivers with our requirements, and then we need to be able to write our own scheduler.
>>>
>>> AMD has NAK'd both options, effectively NAK'ing the driver.
>>>
>>> I will ask a simple yes/no question: Should we use drm/sched?
>>
>> Well, yes.
>>
>>>
>>> If yes, it will need patches like these,
>>
>> No, you don't.
>>
>> First of all you need to try to adjust your driver to match the
>> requirements of drm/scheduler and *not* the other way around.
>>
>>>    and AMD needs to be ok with that and stop NAK'ing them on sight becuase they don't match the existing requirements.
>>>
>>> If no, we will write our own scheduler in Rust, and AMD needs to be ok with that and not NAK it on sight because it's not drm/sched.
>>>
>>> Which is it?
>>>
>>> Note if we write a Rust scheduler, drm/sched and amdgpu will be unaffected. If we do that and AMD comes back and NAKs it -- as said in this thread would "probably" happen -- then it is impossible for us to upstream a driver regardless of whether we use drm/sched.
>>>
>>> Lina has been polite and accommodating while AMD calls her code "outright nonsense" and gets "outright NAK"s, and puts her into an impossible catch-22 where no matter what she does it's NAK'd.
>>
>> Well as far as I can see I'm totally polite as well.
>>
>> Pointing out that approaches doesn't seem to make sense and NAKing
>> patches is a perfectly normal part of the review process.
>>
>> What you need to to is to take a step back and ask yourself why this
>> here is facing so much rejection from our side. I have to admit that I
>> don't seem to be good at explaining that, cause we are obviously talking
>> past each other, but you don't seem to try hard to understand what I'm
>> pointing out either.
>>
>>> That's not ok.
>>
>> As far as I can see it is.
>>
>> As maintainer of a commonly used component my first duty is to preserve
>> the status quo and prevent modifications which are not well thought
>> through. And to be honest those changes here strongly looks like Lina is
>> just adjusting the code to match her requirements without looking left
>> and right first.
>>
>> Regards,
>> Christian.
>>
>>
> 
> I give up. You are ignoring everything we say, and rejecting everything 
> we suggest. We've already explained why drm_sched doesn't work for us. 
> I'm tired of repeating the same explanation over and over again only to 
> be ignored and told I'm wrong.
> 
> I'll start working on a new, much simpler Rust-native scheduler based on 
> the workqueue Rust abstractions which are in review.
> 
> ~~ Lina
> 

Perhaps it is worth having a reset and just trying to clarify requirements
one at a time, even if that involves describing a change on a single line
in a single file.

The maintainer discourse is quite common. Its ultimate goal is to keep
things working. If we let some dependencies loose, or change some requirements,
it's conceivable that this may lead to further problems with new development
of current drivers, as well as new drivers. This will lead to more one-off fixes,
and more "adjustments" to the point where the core requirement is lost,
and the code has lost its purpose and meaning.

Maintainers usually see 10 moves ahead, while driver developers as such in their
role, are expressly concerned with their immediate need to get this and that
driver and its feature working.

We should perhaps concentrate on the core of the requirements--the very
root of what is deemed to need to be changed, to understand why, and
if there's a better way to achieve this, and in general a good reason
for the way to proceed forward, whichever way is taken.

Let's take it one thing at a time, slow and perhaps in a ELI5 manner to make
sure no one misses the other person's point.

Perhaps short and lucid, but complete emails would be best, with code quoted,
and scenarios explained. I know this takes a long time--it's not unusual for me
to take hours to write a single email and I'm exhausted after that.

We all mean well here. Hope we can make something good.
-- 
Regards,
Luben


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name
@ 2023-07-18  5:45                 ` Luben Tuikov
  0 siblings, 0 replies; 86+ messages in thread
From: Luben Tuikov @ 2023-07-18  5:45 UTC (permalink / raw)
  To: Asahi Lina, Christian König, alyssa, David Airlie,
	Daniel Vetter, Sumit Semwal
  Cc: asahi, linux-media, dri-devel, Faith Ekstrand, linux-kernel

On 2023-07-17 22:35, Asahi Lina wrote:
> On 18/07/2023 00.55, Christian König wrote:
>> Am 15.07.23 um 16:14 schrieb alyssa@rosenzweig.io:
>>> 15 July 2023 at 00:03, "Luben Tuikov" <luben.tuikov@amd.com> wrote:
>>>> On 2023-07-14 05:57, Christian König wrote:
>>>>
>>>>> Am 14.07.23 um 11:49 schrieb Asahi Lina:
>>>>>
>>>>>> On 14/07/2023 17.43, Christian König wrote:
>>>>>>
>>>>>    Am 14.07.23 um 10:21 schrieb Asahi Lina:
>>>>>    A signaled scheduler fence can outlive its scheduler, since fences are
>>>>>    independencly reference counted. Therefore, we can't reference the
>>>>>    scheduler in the get_timeline_name() implementation.
>>>>>
>>>>>    Fixes oopses on `cat /sys/kernel/debug/dma_buf/bufinfo` when shared
>>>>>    dma-bufs reference fences from GPU schedulers that no longer exist.
>>>>>
>>>>>    Signed-off-by: Asahi Lina <lina@asahilina.net>
>>>>>    ---
>>>>>       drivers/gpu/drm/scheduler/sched_entity.c | 7 ++++++-
>>>>>       drivers/gpu/drm/scheduler/sched_fence.c  | 4 +++-
>>>>>       include/drm/gpu_scheduler.h              | 5 +++++
>>>>>       3 files changed, 14 insertions(+), 2 deletions(-)
>>>>>
>>>>>    diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>    b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>    index b2bbc8a68b30..17f35b0b005a 100644
>>>>>    --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>    +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>    @@ -389,7 +389,12 @@ static bool
>>>>>    drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
>>>>>                  /*
>>>>>                * Fence is from the same scheduler, only need to wait for
>>>>>    -         * it to be scheduled
>>>>>    +         * it to be scheduled.
>>>>>    +         *
>>>>>    +         * Note: s_fence->sched could have been freed and reallocated
>>>>>    +         * as another scheduler. This false positive case is okay,
>>>>>    as if
>>>>>    +         * the old scheduler was freed all of its jobs must have
>>>>>    +         * signaled their completion fences.
>>>>>
>>>>>    This is outright nonsense. As long as an entity for a scheduler exists
>>>>>    it is not allowed to free up this scheduler.
>>>>>
>>>>>    So this function can't be called like this.
>>>>>
>>>>>> As I already explained, the fences can outlive their scheduler. That
>>>>>>    means *this* entity certainly exists for *this* scheduler, but the
>>>>>>    *dependency* fence might have come from a past scheduler that was
>>>>>>    already destroyed along with all of its entities, and its address reused.
>>>>>>
>>>>>    
>>>>>    Well this is function is not about fences, this function is a callback
>>>>>    for the entity.
>>>>>    
>>>>>
>>>>>> Christian, I'm really getting tired of your tone. I don't appreciate
>>>>>>    being told my comments are "outright nonsense" when you don't even
>>>>>>    take the time to understand what the issue is and what I'm trying to
>>>>>>    do/document. If you aren't interested in working with me, I'm just
>>>>>>    going to give up on drm_sched, wait until Rust gets workqueue support,
>>>>>>    and reimplement it in Rust. You can keep your broken fence lifetime
>>>>>>    semantics and I'll do my own thing.
>>>>>>
>>>>>    
>>>>>    I'm certainly trying to help here, but you seem to have unrealistic
>>>>>    expectations.
>>>>>    
>>>>>    I perfectly understand what you are trying to do, but you don't seem to
>>>>>    understand that this functionality here isn't made for your use case.
>>>>>    
>>>>>    We can adjust the functionality to better match your requirements, but
>>>>>    you can't say it is broken because it doesn't work when you use it not
>>>>>    in the way it is intended to be used.
>>>>>
>>>> I believe "adjusting" functionality to fit some external requirements,
>>>> may have unintended consequences, requiring yet more and more "adjustments".
>>>> (Or may allow (new) drivers to do wild things which may lead to wild results. :-) )
>>>>
>>>> We need to be extra careful and wary of this.
>>> Either drm/scheduler is common code that we should use for our driver, in which case we need to "adjust" it to fit the requirements of a safe Rust abstraction usable for AGX.
>>
>> Well this is the fundamental disagreement we have. As far as I can see
>> you don't need to adjust anything in the common drm/scheduler code.
>>
>> That code works with quite a bunch of different drivers, including the
>> Intel XE which has similar requirements to your work here.
>>
>> We can talk about gradually improving the common code, but as Luben
>> already wrote as well this needs to be done very carefully.
>>
>>>    Or, drm/scheduler is not common code intended for drivers with our requirements, and then we need to be able to write our own scheduler.
>>>
>>> AMD has NAK'd both options, effectively NAK'ing the driver.
>>>
>>> I will ask a simple yes/no question: Should we use drm/sched?
>>
>> Well, yes.
>>
>>>
>>> If yes, it will need patches like these,
>>
>> No, you don't.
>>
>> First of all you need to try to adjust your driver to match the
>> requirements of drm/scheduler and *not* the other way around.
>>
>>>    and AMD needs to be ok with that and stop NAK'ing them on sight becuase they don't match the existing requirements.
>>>
>>> If no, we will write our own scheduler in Rust, and AMD needs to be ok with that and not NAK it on sight because it's not drm/sched.
>>>
>>> Which is it?
>>>
>>> Note if we write a Rust scheduler, drm/sched and amdgpu will be unaffected. If we do that and AMD comes back and NAKs it -- as said in this thread would "probably" happen -- then it is impossible for us to upstream a driver regardless of whether we use drm/sched.
>>>
>>> Lina has been polite and accommodating while AMD calls her code "outright nonsense" and gets "outright NAK"s, and puts her into an impossible catch-22 where no matter what she does it's NAK'd.
>>
>> Well as far as I can see I'm totally polite as well.
>>
>> Pointing out that approaches doesn't seem to make sense and NAKing
>> patches is a perfectly normal part of the review process.
>>
>> What you need to to is to take a step back and ask yourself why this
>> here is facing so much rejection from our side. I have to admit that I
>> don't seem to be good at explaining that, cause we are obviously talking
>> past each other, but you don't seem to try hard to understand what I'm
>> pointing out either.
>>
>>> That's not ok.
>>
>> As far as I can see it is.
>>
>> As maintainer of a commonly used component my first duty is to preserve
>> the status quo and prevent modifications which are not well thought
>> through. And to be honest those changes here strongly looks like Lina is
>> just adjusting the code to match her requirements without looking left
>> and right first.
>>
>> Regards,
>> Christian.
>>
>>
> 
> I give up. You are ignoring everything we say, and rejecting everything 
> we suggest. We've already explained why drm_sched doesn't work for us. 
> I'm tired of repeating the same explanation over and over again only to 
> be ignored and told I'm wrong.
> 
> I'll start working on a new, much simpler Rust-native scheduler based on 
> the workqueue Rust abstractions which are in review.
> 
> ~~ Lina
> 

Perhaps it is worth having a reset and just trying to clarify requirements
one at a time, even if that involves describing a change on a single line
in a single file.

The maintainer discourse is quite common. Its ultimate goal is to keep
things working. If we let some dependencies loose, or change some requirements,
it's conceivable that this may lead to further problems with new development
of current drivers, as well as new drivers. This will lead to more one-off fixes,
and more "adjustments" to the point where the core requirement is lost,
and the code has lost its purpose and meaning.

Maintainers usually see 10 moves ahead, while driver developers as such in their
role, are expressly concerned with their immediate need to get this and that
driver and its feature working.

We should perhaps concentrate on the core of the requirements--the very
root of what is deemed to need to be changed, to understand why, and
if there's a better way to achieve this, and in general a good reason
for the way to proceed forward, whichever way is taken.

Let's take it one thing at a time, slow and perhaps in a ELI5 manner to make
sure no one misses the other person's point.

Perhaps short and lucid, but complete emails would be best, with code quoted,
and scenarios explained. I know this takes a long time--it's not unusual for me
to take hours to write a single email and I'm exhausted after that.

We all mean well here. Hope we can make something good.
-- 
Regards,
Luben


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name
  2023-07-17 15:55             ` Christian König
@ 2023-07-18  8:21               ` Pekka Paalanen
  -1 siblings, 0 replies; 86+ messages in thread
From: Pekka Paalanen @ 2023-07-18  8:21 UTC (permalink / raw)
  To: Christian König
  Cc: alyssa, Luben Tuikov, Asahi Lina, David Airlie, Daniel Vetter,
	Sumit Semwal, Faith Ekstrand, dri-devel, linux-kernel,
	linux-media, asahi

[-- Attachment #1: Type: text/plain, Size: 2838 bytes --]

On Mon, 17 Jul 2023 17:55:04 +0200
Christian König <christian.koenig@amd.com> wrote:

> Am 15.07.23 um 16:14 schrieb alyssa@rosenzweig.io:

...

> > Lina has been polite and accommodating while AMD calls her code
> > "outright nonsense" and gets "outright NAK"s, and puts her into an
> > impossible catch-22 where no matter what she does it's NAK'd.  
> 
> Well as far as I can see I'm totally polite as well.

Christian,

politeness is in the eye of the beholder. You do not get to decide how
other people feel.

I consider myself a very blunt and difficult reviewer on my own area
(which I consider mostly as a negative trait), and while I have
alienated some people over the years, I try hard to not intentionally
hurt anyone. Sometimes it means that writing one email takes an hour or
two. It can take a tremendous amount of energy. Like this email here.

If people have the courage to repeatedly tell someone that the someone
comes out as off-putting, it cannot be dismissed. It really means
coming out as off-putting. There does not need to be anything malicious
related to it from either side, it could as well be a cultural
difference that one cannot know in advance, or it could be a personal
hurt inside the offending person lashing out.

When told, it is time to reflect.

> Pointing out that approaches doesn't seem to make sense and NAKing 
> patches is a perfectly normal part of the review process.

Yes. You don't have to change your message.

One only needs to make an effort to try to change their tone. Otherwise
they lose and alienate developers by choosing to hurt them. It was an
accident before one knew about it, but now it is known, so how one
communicates is a decision. It's no longer an accident.

> What you need to to is to take a step back and ask yourself why this 
> here is facing so much rejection from our side. I have to admit that I 
> don't seem to be good at explaining that, cause we are obviously talking 
> past each other, but you don't seem to try hard to understand what I'm 
> pointing out either.

Maybe try using a softer tone for a start? Lina has reiterated the
restrictions imposed by the hardware, the firmware they cannot change,
and Rust design principles. How do *you* fit those with unchanged
drm/sched?

> > That's not ok.  
> 
> As far as I can see it is.

Hurting people is not ok.

Not even if the kernel community culture traditionally does so.

> As maintainer of a commonly used component my first duty is to preserve 
> the status quo and prevent modifications which are not well thought 
> through.

Of course.

Accidentally hurting someone is eventually unavoidable. Defending the
communication style that hurt someone in order to keep on doing that
just makes one look like a d...


Thanks,
pq

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name
@ 2023-07-18  8:21               ` Pekka Paalanen
  0 siblings, 0 replies; 86+ messages in thread
From: Pekka Paalanen @ 2023-07-18  8:21 UTC (permalink / raw)
  To: Christian König
  Cc: asahi, Asahi Lina, linux-kernel, dri-devel, Luben Tuikov,
	Faith Ekstrand, Sumit Semwal, alyssa, linux-media

[-- Attachment #1: Type: text/plain, Size: 2838 bytes --]

On Mon, 17 Jul 2023 17:55:04 +0200
Christian König <christian.koenig@amd.com> wrote:

> Am 15.07.23 um 16:14 schrieb alyssa@rosenzweig.io:

...

> > Lina has been polite and accommodating while AMD calls her code
> > "outright nonsense" and gets "outright NAK"s, and puts her into an
> > impossible catch-22 where no matter what she does it's NAK'd.  
> 
> Well as far as I can see I'm totally polite as well.

Christian,

politeness is in the eye of the beholder. You do not get to decide how
other people feel.

I consider myself a very blunt and difficult reviewer on my own area
(which I consider mostly as a negative trait), and while I have
alienated some people over the years, I try hard to not intentionally
hurt anyone. Sometimes it means that writing one email takes an hour or
two. It can take a tremendous amount of energy. Like this email here.

If people have the courage to repeatedly tell someone that the someone
comes out as off-putting, it cannot be dismissed. It really means
coming out as off-putting. There does not need to be anything malicious
related to it from either side, it could as well be a cultural
difference that one cannot know in advance, or it could be a personal
hurt inside the offending person lashing out.

When told, it is time to reflect.

> Pointing out that approaches doesn't seem to make sense and NAKing 
> patches is a perfectly normal part of the review process.

Yes. You don't have to change your message.

One only needs to make an effort to try to change their tone. Otherwise
they lose and alienate developers by choosing to hurt them. It was an
accident before one knew about it, but now it is known, so how one
communicates is a decision. It's no longer an accident.

> What you need to to is to take a step back and ask yourself why this 
> here is facing so much rejection from our side. I have to admit that I 
> don't seem to be good at explaining that, cause we are obviously talking 
> past each other, but you don't seem to try hard to understand what I'm 
> pointing out either.

Maybe try using a softer tone for a start? Lina has reiterated the
restrictions imposed by the hardware, the firmware they cannot change,
and Rust design principles. How do *you* fit those with unchanged
drm/sched?

> > That's not ok.  
> 
> As far as I can see it is.

Hurting people is not ok.

Not even if the kernel community culture traditionally does so.

> As maintainer of a commonly used component my first duty is to preserve 
> the status quo and prevent modifications which are not well thought 
> through.

Of course.

Accidentally hurting someone is eventually unavoidable. Defending the
communication style that hurt someone in order to keep on doing that
just makes one look like a d...


Thanks,
pq

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 3/3] drm/scheduler: Clean up jobs when the scheduler is torn down.
  2023-07-16  7:51       ` Asahi Lina
@ 2023-07-19  8:45         ` Christian König
  -1 siblings, 0 replies; 86+ messages in thread
From: Christian König @ 2023-07-19  8:45 UTC (permalink / raw)
  To: Asahi Lina, Luben Tuikov, David Airlie, Daniel Vetter, Sumit Semwal
  Cc: Faith Ekstrand, Alyssa Rosenzweig, dri-devel, linux-kernel,
	linux-media, asahi

Am 16.07.23 um 09:51 schrieb Asahi Lina:
> On 15/07/2023 16.14, Luben Tuikov wrote:
>> On 2023-07-14 04:21, Asahi Lina wrote:
>>> drm_sched_fini() currently leaves any pending jobs dangling, which
>>> causes segfaults and other badness when job completion fences are
>>> signaled after the scheduler is torn down.
>>
>> If there are pending jobs, ideally we want to call into the driver,
>> so that it can release resources it may be holding for those.
>> The idea behind "pending" is that they are pending in the hardware
>> and we don't know their state until signalled/the callback called.
>> (Or unless the device is reset and we get a notification of that fact.)
>
> That's what the job->free_job() callback does, then the driver is free 
> to do whatever it wants with those jobs. A driver could opt to 
> synchronously kill those jobs (if it can) or account for them 
> separately/asynchronously.
>
> What this patch basically says is that if you destroy a scheduler with 
> pending jobs, it immediately considers them terminated with an error, 
> and returns ownership back to the driver for freeing. Then the driver 
> can decide how to handle the rest and whatever the underlying hardware 
> state is.

Yeah, and exactly that is absolutely *not* a good idea. Keep in mind 
that memory management depends on all this stuff and signal a dma_fence 
always requires that the hw give a go for that.

If you want to cleanup a scheduler with pending jobs what needs to 
happen instead is that the driver cancels the processing and signals the 
hw fence.

>
>>> Explicitly detach all jobs from their completion callbacks and free
>>> them. This makes it possible to write a sensible safe abstraction for
>>> drm_sched, without having to externally duplicate the tracking of
>>> in-flight jobs.
>>>
>>> This shouldn't regress any existing drivers, since calling
>>> drm_sched_fini() with any pending jobs is broken and this change should
>>> be a no-op if there are no pending jobs.
>>
>> While this statement is true on its own, it kind of contradicts
>> the premise of the first paragraph.
>
> I mean right *now* it's broken, before this patch. I'm trying to make 
> it safe, but it shouldn't regress any exiting drivers since if they 
> trigger this code path they are broken today.

Yes and exactly that's intentional.

What you can do is to issue a *big* warning here when there are still 
pending unsignaled hw fences when the driver calls drm_sched_fini().

But setting the scheduler fence to signaled without getting a signaled 
state from the hw fence is a complete NO-GO.

Regards,
Christian.

>
>>
>>> Signed-off-by: Asahi Lina <lina@asahilina.net>
>>> ---
>>>   drivers/gpu/drm/scheduler/sched_main.c | 32 
>>> ++++++++++++++++++++++++++++++--
>>>   1 file changed, 30 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>> index 1f3bc3606239..a4da4aac0efd 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -1186,10 +1186,38 @@ EXPORT_SYMBOL(drm_sched_init);
>>>   void drm_sched_fini(struct drm_gpu_scheduler *sched)
>>>   {
>>>       struct drm_sched_entity *s_entity;
>>> +    struct drm_sched_job *s_job, *tmp;
>>>       int i;
>>>   -    if (sched->thread)
>>> -        kthread_stop(sched->thread);
>>> +    if (!sched->thread)
>>> +        return;
>>> +
>>> +    /*
>>> +     * Stop the scheduler, detaching all jobs from their hardware 
>>> callbacks
>>> +     * and cleaning up complete jobs.
>>> +     */
>>> +    drm_sched_stop(sched, NULL);
>>> +
>>> +    /*
>>> +     * Iterate through the pending job list and free all jobs.
>>> +     * This assumes the driver has either guaranteed jobs are 
>>> already stopped, or that
>>> +     * otherwise it is responsible for keeping any necessary data 
>>> structures for
>>> +     * in-progress jobs alive even when the free_job() callback is 
>>> called early (e.g. by
>>> +     * putting them in its own queue or doing its own refcounting).
>>> +     */
>>> +    list_for_each_entry_safe(s_job, tmp, &sched->pending_list, list) {
>>> +        spin_lock(&sched->job_list_lock);
>>> +        list_del_init(&s_job->list);
>>> +        spin_unlock(&sched->job_list_lock);
>>> +
>>> + dma_fence_set_error(&s_job->s_fence->finished, -ESRCH);
>>> +        drm_sched_fence_finished(s_job->s_fence);
>>
>> I'd imagine it's better to rebase this on top of drm-misc-next where
>> drm_sched_fence_finished() takes one more parameter--the error.
>
> Ah, sure! I can do that.
>
>>
>>> +
>>> +        WARN_ON(s_job->s_fence->parent);
>>> +        sched->ops->free_job(s_job);
>>> +    }
>>> +
>>> +    kthread_stop(sched->thread);
>>>         for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= 
>>> DRM_SCHED_PRIORITY_MIN; i--) {
>>>           struct drm_sched_rq *rq = &sched->sched_rq[i];
>>>
>>
>> Conceptually I don't mind this patch--I see what it is trying to 
>> achieve,
>> but technically, we want the driver to detect GPU removal and return 
>> shared
>> resources back, such as "jobs", which DRM is also aware of.
>
> I think you missed the context of why I'm doing this, so in short: my 
> use case (like Xe's) involves using a separate drm_sched instance *per 
> file* since these queues are scheduled directly by the firmware. So 
> this isn't about GPU removal, but rather about a GPU context going 
> away while jobs are in flight (e.g. the process got killed). We want 
> that to quickly kill the "DRM view" of the world, including signaling 
> all the fences with an error and freeing resources like the scheduler 
> itself.
>
> In the case of this particular GPU, there is no known way to actively 
> and synchronously abort GPU jobs, so we need to let them run to 
> completion (or failure), but we don't want that to block process 
> cleanup and freeing a bunch of high-level resources. The driver is 
> architected roughly along the lines of a firmware abstraction layer 
> that maps to the firmware shared memory structures, and then a layer 
> on top that implements the DRM view. When a process gets killed, the 
> DRM side (which includes the scheduler, etc.) gets torn down 
> immediately, and it makes sense to handle this cleanup inside 
> drm_sched since it already has a view into what jobs are in flight. 
> Otherwise, I would have to duplicate job tracking in the driver 
> (actually worse: in the Rust abstraction for safety), which doesn't 
> make much sense.
>
> But what I *do* have in the driver is tracking of the firmware 
> structures. So when the drm_sched gets torn down and all the jobs 
> killed, the underlying firmware jobs do run to completion, and the 
> resources they use are all cleaned up after that (it's all reference 
> counted). The primitive involved here is that in-flight firmware jobs 
> are assigned an event completion slot, and that keeps a reference to 
> them from a global array until the events fire and all the jobs are 
> known to have completed. This keeps things memory-safe, since we 
> absolutely cannot free/destroy firmware structures while they are in 
> use (otherwise the firmware crashes, which is fatal on these GPUs - 
> requires a full system reboot to recover).
>
> In practice, with the VM map model that we use, what ends up happening 
> when a process gets killed is that all the user objects for in-flight 
> jobs get unmapped, which usually causes the GPU hardware (not 
> firmware) to fault. This then triggers early termination of jobs 
> anyway via the firmware fault recovery flow. But even that takes some 
> short amount of time, and by then all the drm_sched stuff is long gone 
> and we're just dealing with the in-flight firmware stuff.
>
>> In the case where we're initiating the tear, we should notify the 
>> driver that
>> we're about to forget jobs (resources), so that it knows to return 
>> them back
>> or that it shouldn't notify us for them (since we've notified we're 
>> forgetting them.)
>
> That contradicts Christian's comment. I tried to document that (after 
> this patch) the scheduler no longer cares about hw fences and whether 
> they are signaled or not after it's destroyed, and I got a strongly 
> worded NAK for it. Sooo... which is it? Is it okay for drivers not to 
> signal the hw fence after a scheduler teardown, or not?
>
> But really, I don't see a use case for an explicit "about to forget 
> job" callback. The job free callback already serves the purpose of 
> telling the driver to clean up resources associated with a job. If it 
> wants to synchronously abort things there, it could easily take over 
> its own fence signaling and do something with the underlying stuff if 
> the fence is not signaled yet.
>
> In my case, since the driver is written in Rust and free_job() just 
> maps to the destructor (Drop impl), that just ends up freeing a bunch 
> of memory and other objects, and I don't particularly care about the 
> state of the firmware side any more after that. The flow is the same 
> whether it was a successful job completion, a failure, or an early 
> destruction due to the drm_sched getting torn down.
>
>> (Note also that in this latter case, traditionally, the device would 
>> be reset,
>> so that we can guarantee that it has forgotten all shared resources 
>> which
>> we are to tear up. This is somewhat more complicated with GPUs, thus 
>> the method
>> pointed out above.)
>
> Yeah, in the firmware scheduling case we can't do this at all unless 
> the firmware has an explicit teardown/forget op (which I'm not aware 
> of) and a full GPU reset isn't something we can do either. Hence we 
> just let the underlying jobs complete. In practice they tend to die 
> pretty quickly anyway once all the buffers are unmapped.
>
> ~~ Lina
>


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 3/3] drm/scheduler: Clean up jobs when the scheduler is torn down.
@ 2023-07-19  8:45         ` Christian König
  0 siblings, 0 replies; 86+ messages in thread
From: Christian König @ 2023-07-19  8:45 UTC (permalink / raw)
  To: Asahi Lina, Luben Tuikov, David Airlie, Daniel Vetter, Sumit Semwal
  Cc: Faith Ekstrand, linux-kernel, dri-devel, asahi,
	Alyssa Rosenzweig, linux-media

Am 16.07.23 um 09:51 schrieb Asahi Lina:
> On 15/07/2023 16.14, Luben Tuikov wrote:
>> On 2023-07-14 04:21, Asahi Lina wrote:
>>> drm_sched_fini() currently leaves any pending jobs dangling, which
>>> causes segfaults and other badness when job completion fences are
>>> signaled after the scheduler is torn down.
>>
>> If there are pending jobs, ideally we want to call into the driver,
>> so that it can release resources it may be holding for those.
>> The idea behind "pending" is that they are pending in the hardware
>> and we don't know their state until signalled/the callback called.
>> (Or unless the device is reset and we get a notification of that fact.)
>
> That's what the job->free_job() callback does, then the driver is free 
> to do whatever it wants with those jobs. A driver could opt to 
> synchronously kill those jobs (if it can) or account for them 
> separately/asynchronously.
>
> What this patch basically says is that if you destroy a scheduler with 
> pending jobs, it immediately considers them terminated with an error, 
> and returns ownership back to the driver for freeing. Then the driver 
> can decide how to handle the rest and whatever the underlying hardware 
> state is.

Yeah, and exactly that is absolutely *not* a good idea. Keep in mind 
that memory management depends on all this stuff and signal a dma_fence 
always requires that the hw give a go for that.

If you want to cleanup a scheduler with pending jobs what needs to 
happen instead is that the driver cancels the processing and signals the 
hw fence.

>
>>> Explicitly detach all jobs from their completion callbacks and free
>>> them. This makes it possible to write a sensible safe abstraction for
>>> drm_sched, without having to externally duplicate the tracking of
>>> in-flight jobs.
>>>
>>> This shouldn't regress any existing drivers, since calling
>>> drm_sched_fini() with any pending jobs is broken and this change should
>>> be a no-op if there are no pending jobs.
>>
>> While this statement is true on its own, it kind of contradicts
>> the premise of the first paragraph.
>
> I mean right *now* it's broken, before this patch. I'm trying to make 
> it safe, but it shouldn't regress any exiting drivers since if they 
> trigger this code path they are broken today.

Yes and exactly that's intentional.

What you can do is to issue a *big* warning here when there are still 
pending unsignaled hw fences when the driver calls drm_sched_fini().

But setting the scheduler fence to signaled without getting a signaled 
state from the hw fence is a complete NO-GO.

Regards,
Christian.

>
>>
>>> Signed-off-by: Asahi Lina <lina@asahilina.net>
>>> ---
>>>   drivers/gpu/drm/scheduler/sched_main.c | 32 
>>> ++++++++++++++++++++++++++++++--
>>>   1 file changed, 30 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>> index 1f3bc3606239..a4da4aac0efd 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -1186,10 +1186,38 @@ EXPORT_SYMBOL(drm_sched_init);
>>>   void drm_sched_fini(struct drm_gpu_scheduler *sched)
>>>   {
>>>       struct drm_sched_entity *s_entity;
>>> +    struct drm_sched_job *s_job, *tmp;
>>>       int i;
>>>   -    if (sched->thread)
>>> -        kthread_stop(sched->thread);
>>> +    if (!sched->thread)
>>> +        return;
>>> +
>>> +    /*
>>> +     * Stop the scheduler, detaching all jobs from their hardware 
>>> callbacks
>>> +     * and cleaning up complete jobs.
>>> +     */
>>> +    drm_sched_stop(sched, NULL);
>>> +
>>> +    /*
>>> +     * Iterate through the pending job list and free all jobs.
>>> +     * This assumes the driver has either guaranteed jobs are 
>>> already stopped, or that
>>> +     * otherwise it is responsible for keeping any necessary data 
>>> structures for
>>> +     * in-progress jobs alive even when the free_job() callback is 
>>> called early (e.g. by
>>> +     * putting them in its own queue or doing its own refcounting).
>>> +     */
>>> +    list_for_each_entry_safe(s_job, tmp, &sched->pending_list, list) {
>>> +        spin_lock(&sched->job_list_lock);
>>> +        list_del_init(&s_job->list);
>>> +        spin_unlock(&sched->job_list_lock);
>>> +
>>> + dma_fence_set_error(&s_job->s_fence->finished, -ESRCH);
>>> +        drm_sched_fence_finished(s_job->s_fence);
>>
>> I'd imagine it's better to rebase this on top of drm-misc-next where
>> drm_sched_fence_finished() takes one more parameter--the error.
>
> Ah, sure! I can do that.
>
>>
>>> +
>>> +        WARN_ON(s_job->s_fence->parent);
>>> +        sched->ops->free_job(s_job);
>>> +    }
>>> +
>>> +    kthread_stop(sched->thread);
>>>         for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= 
>>> DRM_SCHED_PRIORITY_MIN; i--) {
>>>           struct drm_sched_rq *rq = &sched->sched_rq[i];
>>>
>>
>> Conceptually I don't mind this patch--I see what it is trying to 
>> achieve,
>> but technically, we want the driver to detect GPU removal and return 
>> shared
>> resources back, such as "jobs", which DRM is also aware of.
>
> I think you missed the context of why I'm doing this, so in short: my 
> use case (like Xe's) involves using a separate drm_sched instance *per 
> file* since these queues are scheduled directly by the firmware. So 
> this isn't about GPU removal, but rather about a GPU context going 
> away while jobs are in flight (e.g. the process got killed). We want 
> that to quickly kill the "DRM view" of the world, including signaling 
> all the fences with an error and freeing resources like the scheduler 
> itself.
>
> In the case of this particular GPU, there is no known way to actively 
> and synchronously abort GPU jobs, so we need to let them run to 
> completion (or failure), but we don't want that to block process 
> cleanup and freeing a bunch of high-level resources. The driver is 
> architected roughly along the lines of a firmware abstraction layer 
> that maps to the firmware shared memory structures, and then a layer 
> on top that implements the DRM view. When a process gets killed, the 
> DRM side (which includes the scheduler, etc.) gets torn down 
> immediately, and it makes sense to handle this cleanup inside 
> drm_sched since it already has a view into what jobs are in flight. 
> Otherwise, I would have to duplicate job tracking in the driver 
> (actually worse: in the Rust abstraction for safety), which doesn't 
> make much sense.
>
> But what I *do* have in the driver is tracking of the firmware 
> structures. So when the drm_sched gets torn down and all the jobs 
> killed, the underlying firmware jobs do run to completion, and the 
> resources they use are all cleaned up after that (it's all reference 
> counted). The primitive involved here is that in-flight firmware jobs 
> are assigned an event completion slot, and that keeps a reference to 
> them from a global array until the events fire and all the jobs are 
> known to have completed. This keeps things memory-safe, since we 
> absolutely cannot free/destroy firmware structures while they are in 
> use (otherwise the firmware crashes, which is fatal on these GPUs - 
> requires a full system reboot to recover).
>
> In practice, with the VM map model that we use, what ends up happening 
> when a process gets killed is that all the user objects for in-flight 
> jobs get unmapped, which usually causes the GPU hardware (not 
> firmware) to fault. This then triggers early termination of jobs 
> anyway via the firmware fault recovery flow. But even that takes some 
> short amount of time, and by then all the drm_sched stuff is long gone 
> and we're just dealing with the in-flight firmware stuff.
>
>> In the case where we're initiating the tear, we should notify the 
>> driver that
>> we're about to forget jobs (resources), so that it knows to return 
>> them back
>> or that it shouldn't notify us for them (since we've notified we're 
>> forgetting them.)
>
> That contradicts Christian's comment. I tried to document that (after 
> this patch) the scheduler no longer cares about hw fences and whether 
> they are signaled or not after it's destroyed, and I got a strongly 
> worded NAK for it. Sooo... which is it? Is it okay for drivers not to 
> signal the hw fence after a scheduler teardown, or not?
>
> But really, I don't see a use case for an explicit "about to forget 
> job" callback. The job free callback already serves the purpose of 
> telling the driver to clean up resources associated with a job. If it 
> wants to synchronously abort things there, it could easily take over 
> its own fence signaling and do something with the underlying stuff if 
> the fence is not signaled yet.
>
> In my case, since the driver is written in Rust and free_job() just 
> maps to the destructor (Drop impl), that just ends up freeing a bunch 
> of memory and other objects, and I don't particularly care about the 
> state of the firmware side any more after that. The flow is the same 
> whether it was a successful job completion, a failure, or an early 
> destruction due to the drm_sched getting torn down.
>
>> (Note also that in this latter case, traditionally, the device would 
>> be reset,
>> so that we can guarantee that it has forgotten all shared resources 
>> which
>> we are to tear up. This is somewhat more complicated with GPUs, thus 
>> the method
>> pointed out above.)
>
> Yeah, in the firmware scheduling case we can't do this at all unless 
> the firmware has an explicit teardown/forget op (which I'm not aware 
> of) and a full GPU reset isn't something we can do either. Hence we 
> just let the underlying jobs complete. In practice they tend to die 
> pretty quickly anyway once all the buffers are unmapped.
>
> ~~ Lina
>


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 3/3] drm/scheduler: Clean up jobs when the scheduler is torn down.
  2023-07-19  8:45         ` Christian König
@ 2023-07-19 15:05           ` Luben Tuikov
  -1 siblings, 0 replies; 86+ messages in thread
From: Luben Tuikov @ 2023-07-19 15:05 UTC (permalink / raw)
  To: Christian König, Asahi Lina, David Airlie, Daniel Vetter,
	Sumit Semwal
  Cc: Faith Ekstrand, Alyssa Rosenzweig, dri-devel, linux-kernel,
	linux-media, asahi

On 2023-07-19 04:45, Christian König wrote:
> Am 16.07.23 um 09:51 schrieb Asahi Lina:
>> On 15/07/2023 16.14, Luben Tuikov wrote:
>>> On 2023-07-14 04:21, Asahi Lina wrote:
>>>> drm_sched_fini() currently leaves any pending jobs dangling, which
>>>> causes segfaults and other badness when job completion fences are
>>>> signaled after the scheduler is torn down.
>>>
>>> If there are pending jobs, ideally we want to call into the driver,
>>> so that it can release resources it may be holding for those.
>>> The idea behind "pending" is that they are pending in the hardware
>>> and we don't know their state until signalled/the callback called.
>>> (Or unless the device is reset and we get a notification of that fact.)
>>
>> That's what the job->free_job() callback does, then the driver is free 
>> to do whatever it wants with those jobs. A driver could opt to 
>> synchronously kill those jobs (if it can) or account for them 
>> separately/asynchronously.
>>
>> What this patch basically says is that if you destroy a scheduler with 
>> pending jobs, it immediately considers them terminated with an error, 
>> and returns ownership back to the driver for freeing. Then the driver 
>> can decide how to handle the rest and whatever the underlying hardware 
>> state is.
> 
> Yeah, and exactly that is absolutely *not* a good idea. Keep in mind 
> that memory management depends on all this stuff and signal a dma_fence 
> always requires that the hw give a go for that.
> 
> If you want to cleanup a scheduler with pending jobs what needs to 
> happen instead is that the driver cancels the processing and signals the 
> hw fence.
> 
>>
>>>> Explicitly detach all jobs from their completion callbacks and free
>>>> them. This makes it possible to write a sensible safe abstraction for
>>>> drm_sched, without having to externally duplicate the tracking of
>>>> in-flight jobs.
>>>>
>>>> This shouldn't regress any existing drivers, since calling
>>>> drm_sched_fini() with any pending jobs is broken and this change should
>>>> be a no-op if there are no pending jobs.
>>>
>>> While this statement is true on its own, it kind of contradicts
>>> the premise of the first paragraph.
>>
>> I mean right *now* it's broken, before this patch. I'm trying to make 
>> it safe, but it shouldn't regress any exiting drivers since if they 
>> trigger this code path they are broken today.
> 
> Yes and exactly that's intentional.
> 
> What you can do is to issue a *big* warning here when there are still 
> pending unsignaled hw fences when the driver calls drm_sched_fini().
> 
> But setting the scheduler fence to signaled without getting a signaled 
> state from the hw fence is a complete NO-GO.

Okay, so we have the requirement (how). If we can also get a reason behind
it (why), perhaps we can add the requirement and the reason as a lucid comment
to drm_sched_fini() to come with this patch when reworked, so that future
drivers whether they be in Rust or C, can take note.

Perhaps this will also help future development in DRM itself.
-- 
Regards,
Luben

> 
> Regards,
> Christian.
> 
>>
>>>
>>>> Signed-off-by: Asahi Lina <lina@asahilina.net>
>>>> ---
>>>>   drivers/gpu/drm/scheduler/sched_main.c | 32 
>>>> ++++++++++++++++++++++++++++++--
>>>>   1 file changed, 30 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>> index 1f3bc3606239..a4da4aac0efd 100644
>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>> @@ -1186,10 +1186,38 @@ EXPORT_SYMBOL(drm_sched_init);
>>>>   void drm_sched_fini(struct drm_gpu_scheduler *sched)
>>>>   {
>>>>       struct drm_sched_entity *s_entity;
>>>> +    struct drm_sched_job *s_job, *tmp;
>>>>       int i;
>>>>   -    if (sched->thread)
>>>> -        kthread_stop(sched->thread);
>>>> +    if (!sched->thread)
>>>> +        return;
>>>> +
>>>> +    /*
>>>> +     * Stop the scheduler, detaching all jobs from their hardware 
>>>> callbacks
>>>> +     * and cleaning up complete jobs.
>>>> +     */
>>>> +    drm_sched_stop(sched, NULL);
>>>> +
>>>> +    /*
>>>> +     * Iterate through the pending job list and free all jobs.
>>>> +     * This assumes the driver has either guaranteed jobs are 
>>>> already stopped, or that
>>>> +     * otherwise it is responsible for keeping any necessary data 
>>>> structures for
>>>> +     * in-progress jobs alive even when the free_job() callback is 
>>>> called early (e.g. by
>>>> +     * putting them in its own queue or doing its own refcounting).
>>>> +     */
>>>> +    list_for_each_entry_safe(s_job, tmp, &sched->pending_list, list) {
>>>> +        spin_lock(&sched->job_list_lock);
>>>> +        list_del_init(&s_job->list);
>>>> +        spin_unlock(&sched->job_list_lock);
>>>> +
>>>> + dma_fence_set_error(&s_job->s_fence->finished, -ESRCH);
>>>> +        drm_sched_fence_finished(s_job->s_fence);
>>>
>>> I'd imagine it's better to rebase this on top of drm-misc-next where
>>> drm_sched_fence_finished() takes one more parameter--the error.
>>
>> Ah, sure! I can do that.
>>
>>>
>>>> +
>>>> +        WARN_ON(s_job->s_fence->parent);
>>>> +        sched->ops->free_job(s_job);
>>>> +    }
>>>> +
>>>> +    kthread_stop(sched->thread);
>>>>         for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= 
>>>> DRM_SCHED_PRIORITY_MIN; i--) {
>>>>           struct drm_sched_rq *rq = &sched->sched_rq[i];
>>>>
>>>
>>> Conceptually I don't mind this patch--I see what it is trying to 
>>> achieve,
>>> but technically, we want the driver to detect GPU removal and return 
>>> shared
>>> resources back, such as "jobs", which DRM is also aware of.
>>
>> I think you missed the context of why I'm doing this, so in short: my 
>> use case (like Xe's) involves using a separate drm_sched instance *per 
>> file* since these queues are scheduled directly by the firmware. So 
>> this isn't about GPU removal, but rather about a GPU context going 
>> away while jobs are in flight (e.g. the process got killed). We want 
>> that to quickly kill the "DRM view" of the world, including signaling 
>> all the fences with an error and freeing resources like the scheduler 
>> itself.
>>
>> In the case of this particular GPU, there is no known way to actively 
>> and synchronously abort GPU jobs, so we need to let them run to 
>> completion (or failure), but we don't want that to block process 
>> cleanup and freeing a bunch of high-level resources. The driver is 
>> architected roughly along the lines of a firmware abstraction layer 
>> that maps to the firmware shared memory structures, and then a layer 
>> on top that implements the DRM view. When a process gets killed, the 
>> DRM side (which includes the scheduler, etc.) gets torn down 
>> immediately, and it makes sense to handle this cleanup inside 
>> drm_sched since it already has a view into what jobs are in flight. 
>> Otherwise, I would have to duplicate job tracking in the driver 
>> (actually worse: in the Rust abstraction for safety), which doesn't 
>> make much sense.
>>
>> But what I *do* have in the driver is tracking of the firmware 
>> structures. So when the drm_sched gets torn down and all the jobs 
>> killed, the underlying firmware jobs do run to completion, and the 
>> resources they use are all cleaned up after that (it's all reference 
>> counted). The primitive involved here is that in-flight firmware jobs 
>> are assigned an event completion slot, and that keeps a reference to 
>> them from a global array until the events fire and all the jobs are 
>> known to have completed. This keeps things memory-safe, since we 
>> absolutely cannot free/destroy firmware structures while they are in 
>> use (otherwise the firmware crashes, which is fatal on these GPUs - 
>> requires a full system reboot to recover).
>>
>> In practice, with the VM map model that we use, what ends up happening 
>> when a process gets killed is that all the user objects for in-flight 
>> jobs get unmapped, which usually causes the GPU hardware (not 
>> firmware) to fault. This then triggers early termination of jobs 
>> anyway via the firmware fault recovery flow. But even that takes some 
>> short amount of time, and by then all the drm_sched stuff is long gone 
>> and we're just dealing with the in-flight firmware stuff.
>>
>>> In the case where we're initiating the tear, we should notify the 
>>> driver that
>>> we're about to forget jobs (resources), so that it knows to return 
>>> them back
>>> or that it shouldn't notify us for them (since we've notified we're 
>>> forgetting them.)
>>
>> That contradicts Christian's comment. I tried to document that (after 
>> this patch) the scheduler no longer cares about hw fences and whether 
>> they are signaled or not after it's destroyed, and I got a strongly 
>> worded NAK for it. Sooo... which is it? Is it okay for drivers not to 
>> signal the hw fence after a scheduler teardown, or not?
>>
>> But really, I don't see a use case for an explicit "about to forget 
>> job" callback. The job free callback already serves the purpose of 
>> telling the driver to clean up resources associated with a job. If it 
>> wants to synchronously abort things there, it could easily take over 
>> its own fence signaling and do something with the underlying stuff if 
>> the fence is not signaled yet.
>>
>> In my case, since the driver is written in Rust and free_job() just 
>> maps to the destructor (Drop impl), that just ends up freeing a bunch 
>> of memory and other objects, and I don't particularly care about the 
>> state of the firmware side any more after that. The flow is the same 
>> whether it was a successful job completion, a failure, or an early 
>> destruction due to the drm_sched getting torn down.
>>
>>> (Note also that in this latter case, traditionally, the device would 
>>> be reset,
>>> so that we can guarantee that it has forgotten all shared resources 
>>> which
>>> we are to tear up. This is somewhat more complicated with GPUs, thus 
>>> the method
>>> pointed out above.)
>>
>> Yeah, in the firmware scheduling case we can't do this at all unless 
>> the firmware has an explicit teardown/forget op (which I'm not aware 
>> of) and a full GPU reset isn't something we can do either. Hence we 
>> just let the underlying jobs complete. In practice they tend to die 
>> pretty quickly anyway once all the buffers are unmapped.
>>
>> ~~ Lina
>>
> 


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 3/3] drm/scheduler: Clean up jobs when the scheduler is torn down.
@ 2023-07-19 15:05           ` Luben Tuikov
  0 siblings, 0 replies; 86+ messages in thread
From: Luben Tuikov @ 2023-07-19 15:05 UTC (permalink / raw)
  To: Christian König, Asahi Lina, David Airlie, Daniel Vetter,
	Sumit Semwal
  Cc: Faith Ekstrand, linux-kernel, dri-devel, asahi,
	Alyssa Rosenzweig, linux-media

On 2023-07-19 04:45, Christian König wrote:
> Am 16.07.23 um 09:51 schrieb Asahi Lina:
>> On 15/07/2023 16.14, Luben Tuikov wrote:
>>> On 2023-07-14 04:21, Asahi Lina wrote:
>>>> drm_sched_fini() currently leaves any pending jobs dangling, which
>>>> causes segfaults and other badness when job completion fences are
>>>> signaled after the scheduler is torn down.
>>>
>>> If there are pending jobs, ideally we want to call into the driver,
>>> so that it can release resources it may be holding for those.
>>> The idea behind "pending" is that they are pending in the hardware
>>> and we don't know their state until signalled/the callback called.
>>> (Or unless the device is reset and we get a notification of that fact.)
>>
>> That's what the job->free_job() callback does, then the driver is free 
>> to do whatever it wants with those jobs. A driver could opt to 
>> synchronously kill those jobs (if it can) or account for them 
>> separately/asynchronously.
>>
>> What this patch basically says is that if you destroy a scheduler with 
>> pending jobs, it immediately considers them terminated with an error, 
>> and returns ownership back to the driver for freeing. Then the driver 
>> can decide how to handle the rest and whatever the underlying hardware 
>> state is.
> 
> Yeah, and exactly that is absolutely *not* a good idea. Keep in mind 
> that memory management depends on all this stuff and signal a dma_fence 
> always requires that the hw give a go for that.
> 
> If you want to cleanup a scheduler with pending jobs what needs to 
> happen instead is that the driver cancels the processing and signals the 
> hw fence.
> 
>>
>>>> Explicitly detach all jobs from their completion callbacks and free
>>>> them. This makes it possible to write a sensible safe abstraction for
>>>> drm_sched, without having to externally duplicate the tracking of
>>>> in-flight jobs.
>>>>
>>>> This shouldn't regress any existing drivers, since calling
>>>> drm_sched_fini() with any pending jobs is broken and this change should
>>>> be a no-op if there are no pending jobs.
>>>
>>> While this statement is true on its own, it kind of contradicts
>>> the premise of the first paragraph.
>>
>> I mean right *now* it's broken, before this patch. I'm trying to make 
>> it safe, but it shouldn't regress any exiting drivers since if they 
>> trigger this code path they are broken today.
> 
> Yes and exactly that's intentional.
> 
> What you can do is to issue a *big* warning here when there are still 
> pending unsignaled hw fences when the driver calls drm_sched_fini().
> 
> But setting the scheduler fence to signaled without getting a signaled 
> state from the hw fence is a complete NO-GO.

Okay, so we have the requirement (how). If we can also get a reason behind
it (why), perhaps we can add the requirement and the reason as a lucid comment
to drm_sched_fini() to come with this patch when reworked, so that future
drivers whether they be in Rust or C, can take note.

Perhaps this will also help future development in DRM itself.
-- 
Regards,
Luben

> 
> Regards,
> Christian.
> 
>>
>>>
>>>> Signed-off-by: Asahi Lina <lina@asahilina.net>
>>>> ---
>>>>   drivers/gpu/drm/scheduler/sched_main.c | 32 
>>>> ++++++++++++++++++++++++++++++--
>>>>   1 file changed, 30 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>> index 1f3bc3606239..a4da4aac0efd 100644
>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>> @@ -1186,10 +1186,38 @@ EXPORT_SYMBOL(drm_sched_init);
>>>>   void drm_sched_fini(struct drm_gpu_scheduler *sched)
>>>>   {
>>>>       struct drm_sched_entity *s_entity;
>>>> +    struct drm_sched_job *s_job, *tmp;
>>>>       int i;
>>>>   -    if (sched->thread)
>>>> -        kthread_stop(sched->thread);
>>>> +    if (!sched->thread)
>>>> +        return;
>>>> +
>>>> +    /*
>>>> +     * Stop the scheduler, detaching all jobs from their hardware 
>>>> callbacks
>>>> +     * and cleaning up complete jobs.
>>>> +     */
>>>> +    drm_sched_stop(sched, NULL);
>>>> +
>>>> +    /*
>>>> +     * Iterate through the pending job list and free all jobs.
>>>> +     * This assumes the driver has either guaranteed jobs are 
>>>> already stopped, or that
>>>> +     * otherwise it is responsible for keeping any necessary data 
>>>> structures for
>>>> +     * in-progress jobs alive even when the free_job() callback is 
>>>> called early (e.g. by
>>>> +     * putting them in its own queue or doing its own refcounting).
>>>> +     */
>>>> +    list_for_each_entry_safe(s_job, tmp, &sched->pending_list, list) {
>>>> +        spin_lock(&sched->job_list_lock);
>>>> +        list_del_init(&s_job->list);
>>>> +        spin_unlock(&sched->job_list_lock);
>>>> +
>>>> + dma_fence_set_error(&s_job->s_fence->finished, -ESRCH);
>>>> +        drm_sched_fence_finished(s_job->s_fence);
>>>
>>> I'd imagine it's better to rebase this on top of drm-misc-next where
>>> drm_sched_fence_finished() takes one more parameter--the error.
>>
>> Ah, sure! I can do that.
>>
>>>
>>>> +
>>>> +        WARN_ON(s_job->s_fence->parent);
>>>> +        sched->ops->free_job(s_job);
>>>> +    }
>>>> +
>>>> +    kthread_stop(sched->thread);
>>>>         for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= 
>>>> DRM_SCHED_PRIORITY_MIN; i--) {
>>>>           struct drm_sched_rq *rq = &sched->sched_rq[i];
>>>>
>>>
>>> Conceptually I don't mind this patch--I see what it is trying to 
>>> achieve,
>>> but technically, we want the driver to detect GPU removal and return 
>>> shared
>>> resources back, such as "jobs", which DRM is also aware of.
>>
>> I think you missed the context of why I'm doing this, so in short: my 
>> use case (like Xe's) involves using a separate drm_sched instance *per 
>> file* since these queues are scheduled directly by the firmware. So 
>> this isn't about GPU removal, but rather about a GPU context going 
>> away while jobs are in flight (e.g. the process got killed). We want 
>> that to quickly kill the "DRM view" of the world, including signaling 
>> all the fences with an error and freeing resources like the scheduler 
>> itself.
>>
>> In the case of this particular GPU, there is no known way to actively 
>> and synchronously abort GPU jobs, so we need to let them run to 
>> completion (or failure), but we don't want that to block process 
>> cleanup and freeing a bunch of high-level resources. The driver is 
>> architected roughly along the lines of a firmware abstraction layer 
>> that maps to the firmware shared memory structures, and then a layer 
>> on top that implements the DRM view. When a process gets killed, the 
>> DRM side (which includes the scheduler, etc.) gets torn down 
>> immediately, and it makes sense to handle this cleanup inside 
>> drm_sched since it already has a view into what jobs are in flight. 
>> Otherwise, I would have to duplicate job tracking in the driver 
>> (actually worse: in the Rust abstraction for safety), which doesn't 
>> make much sense.
>>
>> But what I *do* have in the driver is tracking of the firmware 
>> structures. So when the drm_sched gets torn down and all the jobs 
>> killed, the underlying firmware jobs do run to completion, and the 
>> resources they use are all cleaned up after that (it's all reference 
>> counted). The primitive involved here is that in-flight firmware jobs 
>> are assigned an event completion slot, and that keeps a reference to 
>> them from a global array until the events fire and all the jobs are 
>> known to have completed. This keeps things memory-safe, since we 
>> absolutely cannot free/destroy firmware structures while they are in 
>> use (otherwise the firmware crashes, which is fatal on these GPUs - 
>> requires a full system reboot to recover).
>>
>> In practice, with the VM map model that we use, what ends up happening 
>> when a process gets killed is that all the user objects for in-flight 
>> jobs get unmapped, which usually causes the GPU hardware (not 
>> firmware) to fault. This then triggers early termination of jobs 
>> anyway via the firmware fault recovery flow. But even that takes some 
>> short amount of time, and by then all the drm_sched stuff is long gone 
>> and we're just dealing with the in-flight firmware stuff.
>>
>>> In the case where we're initiating the tear, we should notify the 
>>> driver that
>>> we're about to forget jobs (resources), so that it knows to return 
>>> them back
>>> or that it shouldn't notify us for them (since we've notified we're 
>>> forgetting them.)
>>
>> That contradicts Christian's comment. I tried to document that (after 
>> this patch) the scheduler no longer cares about hw fences and whether 
>> they are signaled or not after it's destroyed, and I got a strongly 
>> worded NAK for it. Sooo... which is it? Is it okay for drivers not to 
>> signal the hw fence after a scheduler teardown, or not?
>>
>> But really, I don't see a use case for an explicit "about to forget 
>> job" callback. The job free callback already serves the purpose of 
>> telling the driver to clean up resources associated with a job. If it 
>> wants to synchronously abort things there, it could easily take over 
>> its own fence signaling and do something with the underlying stuff if 
>> the fence is not signaled yet.
>>
>> In my case, since the driver is written in Rust and free_job() just 
>> maps to the destructor (Drop impl), that just ends up freeing a bunch 
>> of memory and other objects, and I don't particularly care about the 
>> state of the firmware side any more after that. The flow is the same 
>> whether it was a successful job completion, a failure, or an early 
>> destruction due to the drm_sched getting torn down.
>>
>>> (Note also that in this latter case, traditionally, the device would 
>>> be reset,
>>> so that we can guarantee that it has forgotten all shared resources 
>>> which
>>> we are to tear up. This is somewhat more complicated with GPUs, thus 
>>> the method
>>> pointed out above.)
>>
>> Yeah, in the firmware scheduling case we can't do this at all unless 
>> the firmware has an explicit teardown/forget op (which I'm not aware 
>> of) and a full GPU reset isn't something we can do either. Hence we 
>> just let the underlying jobs complete. In practice they tend to die 
>> pretty quickly anyway once all the buffers are unmapped.
>>
>> ~~ Lina
>>
> 


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 3/3] drm/scheduler: Clean up jobs when the scheduler is torn down.
  2023-07-17 22:45           ` Asahi Lina
@ 2023-07-19 18:16             ` Konstantin Ryabitsev
  -1 siblings, 0 replies; 86+ messages in thread
From: Konstantin Ryabitsev @ 2023-07-19 18:16 UTC (permalink / raw)
  To: Luben Tuikov, Asahi Lina, David Airlie, Daniel Vetter,
	Sumit Semwal, Christian König
  Cc: Faith Ekstrand, linux-kernel, dri-devel, asahi,
	Alyssa Rosenzweig, linux-media

July 18, 2023 at 1:14 AM, "Luben Tuikov" <luben.tuikov@amd.com> wrote:
> > > Not sure about other drivers--they can speak for themselves and the CC list
> > >  should include them--please use "dim add-missing-cc" and make sure
> > >  that the Git commit description contains the Cc tags--then git send-email
> > >  will populate the SMTP CC. Feel free to add more Cc tags on top of that.
> >  
> >  I use `b4 prep -c` which I think does the same thing? I just ran it 
> >  again and it only added 'linaro-mm-sig@lists.linaro.org', not sure why 
> >  that one wasn't there. Am I missing anything else?
> 
> Not sure about "b4 prep -c"--using "git send-email" instead, but what is
> important is to add the Cc: tags as part of the commit message. A "git log" of
> drm-misc-next shows the proper format. Then maintainers add Link:
> tag to the correct email thread, which is usually completely automated
> by "dim" or by "git am", or both.

It's useful to note here that this is not standard practice across the entirety of the Linux tree. In general, Cc: trailers are added to individual commits when get_maintainer.pl wouldn't otherwise include someone in the recipient list. The "dim" tool mentioned here is specific to the DRM subsystem (the "d" stands for "DRM"). Since both tools work on git series, you can use it alongside b4.

DRM folks, if get_maintainer.pl isn't finding someone who should be included on a series of patches, should the MAINTAINERS file be updated to make it easier to submit valid patches without needing to know of "dim"?

Regards,
-K

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 3/3] drm/scheduler: Clean up jobs when the scheduler is torn down.
@ 2023-07-19 18:16             ` Konstantin Ryabitsev
  0 siblings, 0 replies; 86+ messages in thread
From: Konstantin Ryabitsev @ 2023-07-19 18:16 UTC (permalink / raw)
  To: Luben Tuikov, Asahi Lina, David Airlie, Daniel Vetter,
	Sumit Semwal, Christian König
  Cc: asahi, linux-kernel, dri-devel, Faith Ekstrand,
	Alyssa Rosenzweig, linux-media

July 18, 2023 at 1:14 AM, "Luben Tuikov" <luben.tuikov@amd.com> wrote:
> > > Not sure about other drivers--they can speak for themselves and the CC list
> > >  should include them--please use "dim add-missing-cc" and make sure
> > >  that the Git commit description contains the Cc tags--then git send-email
> > >  will populate the SMTP CC. Feel free to add more Cc tags on top of that.
> >  
> >  I use `b4 prep -c` which I think does the same thing? I just ran it 
> >  again and it only added 'linaro-mm-sig@lists.linaro.org', not sure why 
> >  that one wasn't there. Am I missing anything else?
> 
> Not sure about "b4 prep -c"--using "git send-email" instead, but what is
> important is to add the Cc: tags as part of the commit message. A "git log" of
> drm-misc-next shows the proper format. Then maintainers add Link:
> tag to the correct email thread, which is usually completely automated
> by "dim" or by "git am", or both.

It's useful to note here that this is not standard practice across the entirety of the Linux tree. In general, Cc: trailers are added to individual commits when get_maintainer.pl wouldn't otherwise include someone in the recipient list. The "dim" tool mentioned here is specific to the DRM subsystem (the "d" stands for "DRM"). Since both tools work on git series, you can use it alongside b4.

DRM folks, if get_maintainer.pl isn't finding someone who should be included on a series of patches, should the MAINTAINERS file be updated to make it easier to submit valid patches without needing to know of "dim"?

Regards,
-K

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 3/3] drm/scheduler: Clean up jobs when the scheduler is torn down.
  2023-07-19 18:16             ` Konstantin Ryabitsev
@ 2023-07-19 18:58               ` Luben Tuikov
  -1 siblings, 0 replies; 86+ messages in thread
From: Luben Tuikov @ 2023-07-19 18:58 UTC (permalink / raw)
  To: Konstantin Ryabitsev, Asahi Lina, David Airlie, Daniel Vetter,
	Sumit Semwal, Christian König
  Cc: Faith Ekstrand, linux-kernel, dri-devel, asahi,
	Alyssa Rosenzweig, linux-media

On 2023-07-19 14:16, Konstantin Ryabitsev wrote:
> July 18, 2023 at 1:14 AM, "Luben Tuikov" <luben.tuikov@amd.com> wrote:
>>>> Not sure about other drivers--they can speak for themselves and the CC list
>>>>  should include them--please use "dim add-missing-cc" and make sure
>>>>  that the Git commit description contains the Cc tags--then git send-email
>>>>  will populate the SMTP CC. Feel free to add more Cc tags on top of that.
>>>  
>>>  I use `b4 prep -c` which I think does the same thing? I just ran it 
>>>  again and it only added 'linaro-mm-sig@lists.linaro.org', not sure why 
>>>  that one wasn't there. Am I missing anything else?
>>
>> Not sure about "b4 prep -c"--using "git send-email" instead, but what is
>> important is to add the Cc: tags as part of the commit message. A "git log" of
>> drm-misc-next shows the proper format. Then maintainers add Link:
>> tag to the correct email thread, which is usually completely automated
>> by "dim" or by "git am", or both.
> 
> It's useful to note here that this is not standard practice across the entirety of the Linux tree. In general, Cc: trailers are added to individual commits when get_maintainer.pl wouldn't otherwise include someone in the recipient list. The "dim" tool mentioned here is specific to the DRM subsystem (the "d" stands for "DRM"). Since both tools work on git series, you can use it alongside b4.
> 

In DRM we use "dim"--it's just how we do things and everyone complies with this.
"dim" also includes the Link: tag (which "git am" can also be made add), and this adds
certain amount of accountability, which is a good thing.

This is why I suggested that a subsequent version of these patches, include
the Cc: tags, which would normally come from "dim add-missing-cc", which uses
"scripts/get_maintainer.pl".

DRM maintainers regularly use `git rebase --exec "dim add-missing-cc" ...'.

> DRM folks, if get_maintainer.pl isn't finding someone who should be included on a series of patches, should the MAINTAINERS file be updated to make it easier to submit valid patches without needing to know of "dim"?
"scripts/get_maintainer.pl" does consult the MAINTAINERS file. There's been no immediate need
to update the MAINTAINERS file.

Sometimes a single function or a single line in a function (as in some kind of complex calculation),
might be coming from someone who doesn't normally commit to the subsystem. This is where "git blame"
and "git log" are helpful to inspect and add a Cc: tag with that email to the commit message, and this
of course depends on the nature of the incoming patch.
-- 
Regards,
Luben


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 3/3] drm/scheduler: Clean up jobs when the scheduler is torn down.
@ 2023-07-19 18:58               ` Luben Tuikov
  0 siblings, 0 replies; 86+ messages in thread
From: Luben Tuikov @ 2023-07-19 18:58 UTC (permalink / raw)
  To: Konstantin Ryabitsev, Asahi Lina, David Airlie, Daniel Vetter,
	Sumit Semwal, Christian König
  Cc: asahi, linux-kernel, dri-devel, Faith Ekstrand,
	Alyssa Rosenzweig, linux-media

On 2023-07-19 14:16, Konstantin Ryabitsev wrote:
> July 18, 2023 at 1:14 AM, "Luben Tuikov" <luben.tuikov@amd.com> wrote:
>>>> Not sure about other drivers--they can speak for themselves and the CC list
>>>>  should include them--please use "dim add-missing-cc" and make sure
>>>>  that the Git commit description contains the Cc tags--then git send-email
>>>>  will populate the SMTP CC. Feel free to add more Cc tags on top of that.
>>>  
>>>  I use `b4 prep -c` which I think does the same thing? I just ran it 
>>>  again and it only added 'linaro-mm-sig@lists.linaro.org', not sure why 
>>>  that one wasn't there. Am I missing anything else?
>>
>> Not sure about "b4 prep -c"--using "git send-email" instead, but what is
>> important is to add the Cc: tags as part of the commit message. A "git log" of
>> drm-misc-next shows the proper format. Then maintainers add Link:
>> tag to the correct email thread, which is usually completely automated
>> by "dim" or by "git am", or both.
> 
> It's useful to note here that this is not standard practice across the entirety of the Linux tree. In general, Cc: trailers are added to individual commits when get_maintainer.pl wouldn't otherwise include someone in the recipient list. The "dim" tool mentioned here is specific to the DRM subsystem (the "d" stands for "DRM"). Since both tools work on git series, you can use it alongside b4.
> 

In DRM we use "dim"--it's just how we do things and everyone complies with this.
"dim" also includes the Link: tag (which "git am" can also be made add), and this adds
certain amount of accountability, which is a good thing.

This is why I suggested that a subsequent version of these patches, include
the Cc: tags, which would normally come from "dim add-missing-cc", which uses
"scripts/get_maintainer.pl".

DRM maintainers regularly use `git rebase --exec "dim add-missing-cc" ...'.

> DRM folks, if get_maintainer.pl isn't finding someone who should be included on a series of patches, should the MAINTAINERS file be updated to make it easier to submit valid patches without needing to know of "dim"?
"scripts/get_maintainer.pl" does consult the MAINTAINERS file. There's been no immediate need
to update the MAINTAINERS file.

Sometimes a single function or a single line in a function (as in some kind of complex calculation),
might be coming from someone who doesn't normally commit to the subsystem. This is where "git blame"
and "git log" are helpful to inspect and add a Cc: tag with that email to the commit message, and this
of course depends on the nature of the incoming patch.
-- 
Regards,
Luben


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name
  2023-07-18  5:45                 ` Luben Tuikov
@ 2023-07-21 10:33                   ` Asahi Lina
  -1 siblings, 0 replies; 86+ messages in thread
From: Asahi Lina @ 2023-07-21 10:33 UTC (permalink / raw)
  To: Luben Tuikov, Christian König, alyssa, David Airlie,
	Daniel Vetter, Sumit Semwal
  Cc: Faith Ekstrand, dri-devel, linux-kernel, linux-media, asahi

On 18/07/2023 14.45, Luben Tuikov wrote:
> On 2023-07-17 22:35, Asahi Lina wrote:
>> On 18/07/2023 00.55, Christian König wrote:
>>> Am 15.07.23 um 16:14 schrieb alyssa@rosenzweig.io:
>>>> 15 July 2023 at 00:03, "Luben Tuikov" <luben.tuikov@amd.com> wrote:
>>>>> On 2023-07-14 05:57, Christian König wrote:
>>>>>
>>>>>> Am 14.07.23 um 11:49 schrieb Asahi Lina:
>>>>>>
>>>>>>> On 14/07/2023 17.43, Christian König wrote:
>>>>>>>
>>>>>>     Am 14.07.23 um 10:21 schrieb Asahi Lina:
>>>>>>     A signaled scheduler fence can outlive its scheduler, since fences are
>>>>>>     independencly reference counted. Therefore, we can't reference the
>>>>>>     scheduler in the get_timeline_name() implementation.
>>>>>>
>>>>>>     Fixes oopses on `cat /sys/kernel/debug/dma_buf/bufinfo` when shared
>>>>>>     dma-bufs reference fences from GPU schedulers that no longer exist.
>>>>>>
>>>>>>     Signed-off-by: Asahi Lina <lina@asahilina.net>
>>>>>>     ---
>>>>>>        drivers/gpu/drm/scheduler/sched_entity.c | 7 ++++++-
>>>>>>        drivers/gpu/drm/scheduler/sched_fence.c  | 4 +++-
>>>>>>        include/drm/gpu_scheduler.h              | 5 +++++
>>>>>>        3 files changed, 14 insertions(+), 2 deletions(-)
>>>>>>
>>>>>>     diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>>     b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>>     index b2bbc8a68b30..17f35b0b005a 100644
>>>>>>     --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>>     +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>>     @@ -389,7 +389,12 @@ static bool
>>>>>>     drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
>>>>>>                   /*
>>>>>>                 * Fence is from the same scheduler, only need to wait for
>>>>>>     -         * it to be scheduled
>>>>>>     +         * it to be scheduled.
>>>>>>     +         *
>>>>>>     +         * Note: s_fence->sched could have been freed and reallocated
>>>>>>     +         * as another scheduler. This false positive case is okay,
>>>>>>     as if
>>>>>>     +         * the old scheduler was freed all of its jobs must have
>>>>>>     +         * signaled their completion fences.
>>>>>>
>>>>>>     This is outright nonsense. As long as an entity for a scheduler exists
>>>>>>     it is not allowed to free up this scheduler.
>>>>>>
>>>>>>     So this function can't be called like this.
>>>>>>
>>>>>>> As I already explained, the fences can outlive their scheduler. That
>>>>>>>     means *this* entity certainly exists for *this* scheduler, but the
>>>>>>>     *dependency* fence might have come from a past scheduler that was
>>>>>>>     already destroyed along with all of its entities, and its address reused.
>>>>>>>
>>>>>>     
>>>>>>     Well this is function is not about fences, this function is a callback
>>>>>>     for the entity.
>>>>>>     
>>>>>>
>>>>>>> Christian, I'm really getting tired of your tone. I don't appreciate
>>>>>>>     being told my comments are "outright nonsense" when you don't even
>>>>>>>     take the time to understand what the issue is and what I'm trying to
>>>>>>>     do/document. If you aren't interested in working with me, I'm just
>>>>>>>     going to give up on drm_sched, wait until Rust gets workqueue support,
>>>>>>>     and reimplement it in Rust. You can keep your broken fence lifetime
>>>>>>>     semantics and I'll do my own thing.
>>>>>>>
>>>>>>     
>>>>>>     I'm certainly trying to help here, but you seem to have unrealistic
>>>>>>     expectations.
>>>>>>     
>>>>>>     I perfectly understand what you are trying to do, but you don't seem to
>>>>>>     understand that this functionality here isn't made for your use case.
>>>>>>     
>>>>>>     We can adjust the functionality to better match your requirements, but
>>>>>>     you can't say it is broken because it doesn't work when you use it not
>>>>>>     in the way it is intended to be used.
>>>>>>
>>>>> I believe "adjusting" functionality to fit some external requirements,
>>>>> may have unintended consequences, requiring yet more and more "adjustments".
>>>>> (Or may allow (new) drivers to do wild things which may lead to wild results. :-) )
>>>>>
>>>>> We need to be extra careful and wary of this.
>>>> Either drm/scheduler is common code that we should use for our driver, in which case we need to "adjust" it to fit the requirements of a safe Rust abstraction usable for AGX.
>>>
>>> Well this is the fundamental disagreement we have. As far as I can see
>>> you don't need to adjust anything in the common drm/scheduler code.
>>>
>>> That code works with quite a bunch of different drivers, including the
>>> Intel XE which has similar requirements to your work here.
>>>
>>> We can talk about gradually improving the common code, but as Luben
>>> already wrote as well this needs to be done very carefully.
>>>
>>>>     Or, drm/scheduler is not common code intended for drivers with our requirements, and then we need to be able to write our own scheduler.
>>>>
>>>> AMD has NAK'd both options, effectively NAK'ing the driver.
>>>>
>>>> I will ask a simple yes/no question: Should we use drm/sched?
>>>
>>> Well, yes.
>>>
>>>>
>>>> If yes, it will need patches like these,
>>>
>>> No, you don't.
>>>
>>> First of all you need to try to adjust your driver to match the
>>> requirements of drm/scheduler and *not* the other way around.
>>>
>>>>     and AMD needs to be ok with that and stop NAK'ing them on sight becuase they don't match the existing requirements.
>>>>
>>>> If no, we will write our own scheduler in Rust, and AMD needs to be ok with that and not NAK it on sight because it's not drm/sched.
>>>>
>>>> Which is it?
>>>>
>>>> Note if we write a Rust scheduler, drm/sched and amdgpu will be unaffected. If we do that and AMD comes back and NAKs it -- as said in this thread would "probably" happen -- then it is impossible for us to upstream a driver regardless of whether we use drm/sched.
>>>>
>>>> Lina has been polite and accommodating while AMD calls her code "outright nonsense" and gets "outright NAK"s, and puts her into an impossible catch-22 where no matter what she does it's NAK'd.
>>>
>>> Well as far as I can see I'm totally polite as well.
>>>
>>> Pointing out that approaches doesn't seem to make sense and NAKing
>>> patches is a perfectly normal part of the review process.
>>>
>>> What you need to to is to take a step back and ask yourself why this
>>> here is facing so much rejection from our side. I have to admit that I
>>> don't seem to be good at explaining that, cause we are obviously talking
>>> past each other, but you don't seem to try hard to understand what I'm
>>> pointing out either.
>>>
>>>> That's not ok.
>>>
>>> As far as I can see it is.
>>>
>>> As maintainer of a commonly used component my first duty is to preserve
>>> the status quo and prevent modifications which are not well thought
>>> through. And to be honest those changes here strongly looks like Lina is
>>> just adjusting the code to match her requirements without looking left
>>> and right first.
>>>
>>> Regards,
>>> Christian.
>>>
>>>
>>
>> I give up. You are ignoring everything we say, and rejecting everything
>> we suggest. We've already explained why drm_sched doesn't work for us.
>> I'm tired of repeating the same explanation over and over again only to
>> be ignored and told I'm wrong.
>>
>> I'll start working on a new, much simpler Rust-native scheduler based on
>> the workqueue Rust abstractions which are in review.
>>
>> ~~ Lina
>>
> 
> Perhaps it is worth having a reset and just trying to clarify requirements
> one at a time, even if that involves describing a change on a single line
> in a single file.

I've already tried to explain the issue. The DRM scheduler design, as it 
stands today, makes it impractical to write a safe Rust abstraction for 
it. This is a fact. Christian has repeatedly NAKed my attempts at 
changing it to make such a safe abstraction possible, and is clearly 
opposed to the fundamental lifetime requirements change I am trying to 
make here. Therefore, we are at an impasse.

It's unfortunate, but given this situation, the DRM scheduler will not 
be available to Rust DRM drivers. I thought the goal of the DRM 
subsystem common code was to cater to multiple drivers and usage 
approaches, with an emphasis on doing things "right" and avoiding design 
issues that are common mistakes in driver design. Clearly, this is not 
the case for all of DRM, at least not the DRM scheduler.

In software engineering, complex lifetime requirements are an 
anti-pattern, which is one reason why Rust draws a line between safe and 
unsafe APIs. For a safe API, it is up to the API developer to design it 
such that it cannot be misused in a way that causes memory safety 
issues, and the only lifetime requirements it can impose are those that 
can be expressed in the Rust type system and statically checked at 
compile time. The DRM scheduler's complex chain of lifetime requirements 
cannot, and wrapping it in enough glue to remove those lifetime 
requirements would require about as much code as just rewriting it, as 
well as add unacceptable duplication and overhead.

In kernel Rust, we strive to only have safe APIs for components which 
have no fundamental reason for being unsafe (things like direct memory 
mapping and raw hardware access). The DRM scheduler does not fall into 
one of those "inherently unsafe" categories, so it doesn't make sense to 
provide a raw unsafe API for it. Doing so would just expose Rust code to 
the kind of subtle safety implications that cause memory problems every 
day in C. If I were to use drm_sched's unsafe API "as is" in my driver, 
it would *by far* be the least auditable, most complex usage of unsafe 
code in the entire driver, and I have no confidence that I would be able 
to get it right and keep it right as the driver changes.

I don't see a reason why this cannot be simply fixed in drm_sched, but 
Christian disagrees, and has repeatedly (and strongly) NAKed my attempts 
at doing so, and indeed NAKed the entire premise of the change in 
lifetime requirements itself. So here we are. There is no solution that 
will work for everyone that involves drm_sched.

So I don't have a choice other than to just not use drm_sched and roll 
my own. I will happily move that Rust implementation to common code if 
another Rust driver comes along and wants to use it. I'm not sure if 
that kind of thing can be easily made available to C drivers in reverse, 
but I guess we'll cross that bridge when a C driver expresses interest 
in using it.

So far it seems existing C drivers are happy with drm_sched's design and 
complex lifetime requirements, even though they aren't even documented. 
Perhaps they managed to reverse engineer them from the source, or 
someone told the authors about it (certainly nobody told me when I 
started using drm_sched). Or maybe a bunch of the drm_scheduler users 
are actually broken and have memory safety issues in corner cases. I 
don't know, though if I had to bet, I'd bet on the second option.

Actually, let's do a quick search and find out!

panfrost_remove() -> panfrost_device_fini() -> panfrost_job_fini() -> 
drm_sched_fini()

There is a direct, synchronous path between device removal and 
destroying the DRM scheduler. At no point does it wait for any jobs to 
complete. That's what patch #3 in this series tries to fix.

In fact, it doesn't even keep the entities alive! It calls 
drm_dev_unregister() before everything else, but as the docs for the DRM 
driver lifetimes clearly say (see, docs!), objects visible to userspace 
must survive that and only be released from the release callback. 
drm_sched entities are created/destroyed from 
panfrost_job_open()/panfrost_job_close(), which are called from 
panfrost_open() and panfrost_postclose(), which are the userspace file 
open/close functions.

That one I fix in the Rust abstraction already (that one's relatively 
easy to fix), so it doesn't need a drm_sched patch from my point of 
view, but it is yet another subtle, undocumented lifetime requirement 
that is, evidently, impossible to know about and get right without 
documentation.

Meanwhile, panfrost_fence_ops has no remove() callback, which means 
there is no reference path stopping device removal (and therefore 
scheduler teardown) or even module unload while driver fences are still 
alive. That leads to the UAF patch #2 in this series tries to fix.

In other words, as far as I can tell, the panfrost driver gets 
*everything* wrong when it comes to the DRM scheduler lifetime 
requirements, and will crash and burn if the driver is unbound while 
jobs are in flight, anyone has a file descriptor open at all, or even if 
any shared buffer holding a DRM scheduler fence from it is bound to the 
display controller at that time.

This is why this kind of design is not allowed in Rust. Because nobody 
gets it right. *Especially* not without docs. I assumed, like the 
authors of the Panfrost driver clearly assumed, that the DRM scheduler 
API would not have these crazy undocumented hidden requirements. The 
only reason I found out the hard way is I happen to create and destroy 
schedulers all the time, not just once globally, so I would hit the bugs 
and dangling pointers much more often than Panfrost users who, most 
likely, never unbind their devices. But we both have the same problem.

I think I've done all I can to explain the issues and try to fix them, 
so the ball is in your court now. If you want to keep the current 
design, that's fine, but Rust code will not be a drm_sched user in that 
case. And the rest of the DRM folks in the C world will have to contend 
with these issues and fix all the problems in the C drivers (I'm sure 
panfrost isn't the only one, it's just literally the first one I looked at).

As for me, I'm happy to write a simple workqueue-based Rust scheduler 
suitable for firmware-managed scheduler devices. Honestly, at this 
point, I have very little faith left in my ability to fix all these 
issues in drm_sched (I know there's at least one more lurking, I saw a 
NULL deref but I wasn't able to reproduce it nor divine how it possibly 
happened). That, combined with the hostility from the AMD folks about my 
attempts to improve drm_sched even just a little bit, makes that 
decision very easy.

Farewell, DRM scheduler. It was nice trying to work with you, but things 
just didn't work out. I won't be submitting a v2 to this series myself. 
Please ping me if you fix all these fundamental design issues with 
drm_sched and think I might actually be able to use it safely in Rust 
one day. If the API design is solid and safe and the implementation done 
in a way that inspires confidence at that time maybe we can yank out my 
Rust solution when the time comes and switch back to drm_sched.

Just please don't expect me to do the work any more, I've done 
everything I can and this now has to come from you, not me. I've spent 
way more time understanding drm_sched, auditing its code, understanding 
its design intent, trying to fix it, and getting yelled at for it than 
it would take to write a new, clean, safe Rust scheduler. I don't regret 
some of the time spent (some of the implementation details of drm_sched 
have taught me useful things), but I'm not prepared to spend any more, 
sorry.

~~ Lina


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name
@ 2023-07-21 10:33                   ` Asahi Lina
  0 siblings, 0 replies; 86+ messages in thread
From: Asahi Lina @ 2023-07-21 10:33 UTC (permalink / raw)
  To: Luben Tuikov, Christian König, alyssa, David Airlie,
	Daniel Vetter, Sumit Semwal
  Cc: asahi, linux-media, dri-devel, Faith Ekstrand, linux-kernel

On 18/07/2023 14.45, Luben Tuikov wrote:
> On 2023-07-17 22:35, Asahi Lina wrote:
>> On 18/07/2023 00.55, Christian König wrote:
>>> Am 15.07.23 um 16:14 schrieb alyssa@rosenzweig.io:
>>>> 15 July 2023 at 00:03, "Luben Tuikov" <luben.tuikov@amd.com> wrote:
>>>>> On 2023-07-14 05:57, Christian König wrote:
>>>>>
>>>>>> Am 14.07.23 um 11:49 schrieb Asahi Lina:
>>>>>>
>>>>>>> On 14/07/2023 17.43, Christian König wrote:
>>>>>>>
>>>>>>     Am 14.07.23 um 10:21 schrieb Asahi Lina:
>>>>>>     A signaled scheduler fence can outlive its scheduler, since fences are
>>>>>>     independencly reference counted. Therefore, we can't reference the
>>>>>>     scheduler in the get_timeline_name() implementation.
>>>>>>
>>>>>>     Fixes oopses on `cat /sys/kernel/debug/dma_buf/bufinfo` when shared
>>>>>>     dma-bufs reference fences from GPU schedulers that no longer exist.
>>>>>>
>>>>>>     Signed-off-by: Asahi Lina <lina@asahilina.net>
>>>>>>     ---
>>>>>>        drivers/gpu/drm/scheduler/sched_entity.c | 7 ++++++-
>>>>>>        drivers/gpu/drm/scheduler/sched_fence.c  | 4 +++-
>>>>>>        include/drm/gpu_scheduler.h              | 5 +++++
>>>>>>        3 files changed, 14 insertions(+), 2 deletions(-)
>>>>>>
>>>>>>     diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>>     b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>>     index b2bbc8a68b30..17f35b0b005a 100644
>>>>>>     --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>>     +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>>     @@ -389,7 +389,12 @@ static bool
>>>>>>     drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
>>>>>>                   /*
>>>>>>                 * Fence is from the same scheduler, only need to wait for
>>>>>>     -         * it to be scheduled
>>>>>>     +         * it to be scheduled.
>>>>>>     +         *
>>>>>>     +         * Note: s_fence->sched could have been freed and reallocated
>>>>>>     +         * as another scheduler. This false positive case is okay,
>>>>>>     as if
>>>>>>     +         * the old scheduler was freed all of its jobs must have
>>>>>>     +         * signaled their completion fences.
>>>>>>
>>>>>>     This is outright nonsense. As long as an entity for a scheduler exists
>>>>>>     it is not allowed to free up this scheduler.
>>>>>>
>>>>>>     So this function can't be called like this.
>>>>>>
>>>>>>> As I already explained, the fences can outlive their scheduler. That
>>>>>>>     means *this* entity certainly exists for *this* scheduler, but the
>>>>>>>     *dependency* fence might have come from a past scheduler that was
>>>>>>>     already destroyed along with all of its entities, and its address reused.
>>>>>>>
>>>>>>     
>>>>>>     Well this is function is not about fences, this function is a callback
>>>>>>     for the entity.
>>>>>>     
>>>>>>
>>>>>>> Christian, I'm really getting tired of your tone. I don't appreciate
>>>>>>>     being told my comments are "outright nonsense" when you don't even
>>>>>>>     take the time to understand what the issue is and what I'm trying to
>>>>>>>     do/document. If you aren't interested in working with me, I'm just
>>>>>>>     going to give up on drm_sched, wait until Rust gets workqueue support,
>>>>>>>     and reimplement it in Rust. You can keep your broken fence lifetime
>>>>>>>     semantics and I'll do my own thing.
>>>>>>>
>>>>>>     
>>>>>>     I'm certainly trying to help here, but you seem to have unrealistic
>>>>>>     expectations.
>>>>>>     
>>>>>>     I perfectly understand what you are trying to do, but you don't seem to
>>>>>>     understand that this functionality here isn't made for your use case.
>>>>>>     
>>>>>>     We can adjust the functionality to better match your requirements, but
>>>>>>     you can't say it is broken because it doesn't work when you use it not
>>>>>>     in the way it is intended to be used.
>>>>>>
>>>>> I believe "adjusting" functionality to fit some external requirements,
>>>>> may have unintended consequences, requiring yet more and more "adjustments".
>>>>> (Or may allow (new) drivers to do wild things which may lead to wild results. :-) )
>>>>>
>>>>> We need to be extra careful and wary of this.
>>>> Either drm/scheduler is common code that we should use for our driver, in which case we need to "adjust" it to fit the requirements of a safe Rust abstraction usable for AGX.
>>>
>>> Well this is the fundamental disagreement we have. As far as I can see
>>> you don't need to adjust anything in the common drm/scheduler code.
>>>
>>> That code works with quite a bunch of different drivers, including the
>>> Intel XE which has similar requirements to your work here.
>>>
>>> We can talk about gradually improving the common code, but as Luben
>>> already wrote as well this needs to be done very carefully.
>>>
>>>>     Or, drm/scheduler is not common code intended for drivers with our requirements, and then we need to be able to write our own scheduler.
>>>>
>>>> AMD has NAK'd both options, effectively NAK'ing the driver.
>>>>
>>>> I will ask a simple yes/no question: Should we use drm/sched?
>>>
>>> Well, yes.
>>>
>>>>
>>>> If yes, it will need patches like these,
>>>
>>> No, you don't.
>>>
>>> First of all you need to try to adjust your driver to match the
>>> requirements of drm/scheduler and *not* the other way around.
>>>
>>>>     and AMD needs to be ok with that and stop NAK'ing them on sight becuase they don't match the existing requirements.
>>>>
>>>> If no, we will write our own scheduler in Rust, and AMD needs to be ok with that and not NAK it on sight because it's not drm/sched.
>>>>
>>>> Which is it?
>>>>
>>>> Note if we write a Rust scheduler, drm/sched and amdgpu will be unaffected. If we do that and AMD comes back and NAKs it -- as said in this thread would "probably" happen -- then it is impossible for us to upstream a driver regardless of whether we use drm/sched.
>>>>
>>>> Lina has been polite and accommodating while AMD calls her code "outright nonsense" and gets "outright NAK"s, and puts her into an impossible catch-22 where no matter what she does it's NAK'd.
>>>
>>> Well as far as I can see I'm totally polite as well.
>>>
>>> Pointing out that approaches doesn't seem to make sense and NAKing
>>> patches is a perfectly normal part of the review process.
>>>
>>> What you need to to is to take a step back and ask yourself why this
>>> here is facing so much rejection from our side. I have to admit that I
>>> don't seem to be good at explaining that, cause we are obviously talking
>>> past each other, but you don't seem to try hard to understand what I'm
>>> pointing out either.
>>>
>>>> That's not ok.
>>>
>>> As far as I can see it is.
>>>
>>> As maintainer of a commonly used component my first duty is to preserve
>>> the status quo and prevent modifications which are not well thought
>>> through. And to be honest those changes here strongly looks like Lina is
>>> just adjusting the code to match her requirements without looking left
>>> and right first.
>>>
>>> Regards,
>>> Christian.
>>>
>>>
>>
>> I give up. You are ignoring everything we say, and rejecting everything
>> we suggest. We've already explained why drm_sched doesn't work for us.
>> I'm tired of repeating the same explanation over and over again only to
>> be ignored and told I'm wrong.
>>
>> I'll start working on a new, much simpler Rust-native scheduler based on
>> the workqueue Rust abstractions which are in review.
>>
>> ~~ Lina
>>
> 
> Perhaps it is worth having a reset and just trying to clarify requirements
> one at a time, even if that involves describing a change on a single line
> in a single file.

I've already tried to explain the issue. The DRM scheduler design, as it 
stands today, makes it impractical to write a safe Rust abstraction for 
it. This is a fact. Christian has repeatedly NAKed my attempts at 
changing it to make such a safe abstraction possible, and is clearly 
opposed to the fundamental lifetime requirements change I am trying to 
make here. Therefore, we are at an impasse.

It's unfortunate, but given this situation, the DRM scheduler will not 
be available to Rust DRM drivers. I thought the goal of the DRM 
subsystem common code was to cater to multiple drivers and usage 
approaches, with an emphasis on doing things "right" and avoiding design 
issues that are common mistakes in driver design. Clearly, this is not 
the case for all of DRM, at least not the DRM scheduler.

In software engineering, complex lifetime requirements are an 
anti-pattern, which is one reason why Rust draws a line between safe and 
unsafe APIs. For a safe API, it is up to the API developer to design it 
such that it cannot be misused in a way that causes memory safety 
issues, and the only lifetime requirements it can impose are those that 
can be expressed in the Rust type system and statically checked at 
compile time. The DRM scheduler's complex chain of lifetime requirements 
cannot, and wrapping it in enough glue to remove those lifetime 
requirements would require about as much code as just rewriting it, as 
well as add unacceptable duplication and overhead.

In kernel Rust, we strive to only have safe APIs for components which 
have no fundamental reason for being unsafe (things like direct memory 
mapping and raw hardware access). The DRM scheduler does not fall into 
one of those "inherently unsafe" categories, so it doesn't make sense to 
provide a raw unsafe API for it. Doing so would just expose Rust code to 
the kind of subtle safety implications that cause memory problems every 
day in C. If I were to use drm_sched's unsafe API "as is" in my driver, 
it would *by far* be the least auditable, most complex usage of unsafe 
code in the entire driver, and I have no confidence that I would be able 
to get it right and keep it right as the driver changes.

I don't see a reason why this cannot be simply fixed in drm_sched, but 
Christian disagrees, and has repeatedly (and strongly) NAKed my attempts 
at doing so, and indeed NAKed the entire premise of the change in 
lifetime requirements itself. So here we are. There is no solution that 
will work for everyone that involves drm_sched.

So I don't have a choice other than to just not use drm_sched and roll 
my own. I will happily move that Rust implementation to common code if 
another Rust driver comes along and wants to use it. I'm not sure if 
that kind of thing can be easily made available to C drivers in reverse, 
but I guess we'll cross that bridge when a C driver expresses interest 
in using it.

So far it seems existing C drivers are happy with drm_sched's design and 
complex lifetime requirements, even though they aren't even documented. 
Perhaps they managed to reverse engineer them from the source, or 
someone told the authors about it (certainly nobody told me when I 
started using drm_sched). Or maybe a bunch of the drm_scheduler users 
are actually broken and have memory safety issues in corner cases. I 
don't know, though if I had to bet, I'd bet on the second option.

Actually, let's do a quick search and find out!

panfrost_remove() -> panfrost_device_fini() -> panfrost_job_fini() -> 
drm_sched_fini()

There is a direct, synchronous path between device removal and 
destroying the DRM scheduler. At no point does it wait for any jobs to 
complete. That's what patch #3 in this series tries to fix.

In fact, it doesn't even keep the entities alive! It calls 
drm_dev_unregister() before everything else, but as the docs for the DRM 
driver lifetimes clearly say (see, docs!), objects visible to userspace 
must survive that and only be released from the release callback. 
drm_sched entities are created/destroyed from 
panfrost_job_open()/panfrost_job_close(), which are called from 
panfrost_open() and panfrost_postclose(), which are the userspace file 
open/close functions.

That one I fix in the Rust abstraction already (that one's relatively 
easy to fix), so it doesn't need a drm_sched patch from my point of 
view, but it is yet another subtle, undocumented lifetime requirement 
that is, evidently, impossible to know about and get right without 
documentation.

Meanwhile, panfrost_fence_ops has no remove() callback, which means 
there is no reference path stopping device removal (and therefore 
scheduler teardown) or even module unload while driver fences are still 
alive. That leads to the UAF patch #2 in this series tries to fix.

In other words, as far as I can tell, the panfrost driver gets 
*everything* wrong when it comes to the DRM scheduler lifetime 
requirements, and will crash and burn if the driver is unbound while 
jobs are in flight, anyone has a file descriptor open at all, or even if 
any shared buffer holding a DRM scheduler fence from it is bound to the 
display controller at that time.

This is why this kind of design is not allowed in Rust. Because nobody 
gets it right. *Especially* not without docs. I assumed, like the 
authors of the Panfrost driver clearly assumed, that the DRM scheduler 
API would not have these crazy undocumented hidden requirements. The 
only reason I found out the hard way is I happen to create and destroy 
schedulers all the time, not just once globally, so I would hit the bugs 
and dangling pointers much more often than Panfrost users who, most 
likely, never unbind their devices. But we both have the same problem.

I think I've done all I can to explain the issues and try to fix them, 
so the ball is in your court now. If you want to keep the current 
design, that's fine, but Rust code will not be a drm_sched user in that 
case. And the rest of the DRM folks in the C world will have to contend 
with these issues and fix all the problems in the C drivers (I'm sure 
panfrost isn't the only one, it's just literally the first one I looked at).

As for me, I'm happy to write a simple workqueue-based Rust scheduler 
suitable for firmware-managed scheduler devices. Honestly, at this 
point, I have very little faith left in my ability to fix all these 
issues in drm_sched (I know there's at least one more lurking, I saw a 
NULL deref but I wasn't able to reproduce it nor divine how it possibly 
happened). That, combined with the hostility from the AMD folks about my 
attempts to improve drm_sched even just a little bit, makes that 
decision very easy.

Farewell, DRM scheduler. It was nice trying to work with you, but things 
just didn't work out. I won't be submitting a v2 to this series myself. 
Please ping me if you fix all these fundamental design issues with 
drm_sched and think I might actually be able to use it safely in Rust 
one day. If the API design is solid and safe and the implementation done 
in a way that inspires confidence at that time maybe we can yank out my 
Rust solution when the time comes and switch back to drm_sched.

Just please don't expect me to do the work any more, I've done 
everything I can and this now has to come from you, not me. I've spent 
way more time understanding drm_sched, auditing its code, understanding 
its design intent, trying to fix it, and getting yelled at for it than 
it would take to write a new, clean, safe Rust scheduler. I don't regret 
some of the time spent (some of the implementation details of drm_sched 
have taught me useful things), but I'm not prepared to spend any more, 
sorry.

~~ Lina


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name
  2023-07-18  2:35               ` Asahi Lina
@ 2023-07-28  7:48                 ` Christian König
  -1 siblings, 0 replies; 86+ messages in thread
From: Christian König @ 2023-07-28  7:48 UTC (permalink / raw)
  To: Asahi Lina, alyssa, Luben Tuikov, David Airlie, Daniel Vetter,
	Sumit Semwal
  Cc: Faith Ekstrand, dri-devel, linux-kernel, linux-media, asahi

Am 18.07.23 um 04:35 schrieb Asahi Lina:
> On 18/07/2023 00.55, Christian König wrote:
>> [SNIP]
>
> I give up. You are ignoring everything we say, and rejecting 
> everything we suggest. We've already explained why drm_sched doesn't 
> work for us. I'm tired of repeating the same explanation over and over 
> again only to be ignored and told I'm wrong.

I'm not telling you that you are wrong in any way. What I'm pointing out 
is that your solution won't work upstream and you need to take a step 
back and think about why this won't work.

>
> I'll start working on a new, much simpler Rust-native scheduler based 
> on the workqueue Rust abstractions which are in review.

Please note that when you are implementing a dma_fence interface you 
also need my acknowledgement to get this upstream.

Additional to that Dave and Daniel might have objections to this as well.

Regards,
Christian.

>
> ~~ Lina
>


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name
@ 2023-07-28  7:48                 ` Christian König
  0 siblings, 0 replies; 86+ messages in thread
From: Christian König @ 2023-07-28  7:48 UTC (permalink / raw)
  To: Asahi Lina, alyssa, Luben Tuikov, David Airlie, Daniel Vetter,
	Sumit Semwal
  Cc: asahi, linux-media, dri-devel, Faith Ekstrand, linux-kernel

Am 18.07.23 um 04:35 schrieb Asahi Lina:
> On 18/07/2023 00.55, Christian König wrote:
>> [SNIP]
>
> I give up. You are ignoring everything we say, and rejecting 
> everything we suggest. We've already explained why drm_sched doesn't 
> work for us. I'm tired of repeating the same explanation over and over 
> again only to be ignored and told I'm wrong.

I'm not telling you that you are wrong in any way. What I'm pointing out 
is that your solution won't work upstream and you need to take a step 
back and think about why this won't work.

>
> I'll start working on a new, much simpler Rust-native scheduler based 
> on the workqueue Rust abstractions which are in review.

Please note that when you are implementing a dma_fence interface you 
also need my acknowledgement to get this upstream.

Additional to that Dave and Daniel might have objections to this as well.

Regards,
Christian.

>
> ~~ Lina
>


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name
  2023-07-21 10:33                   ` Asahi Lina
@ 2023-07-31  8:09                     ` Christian König
  -1 siblings, 0 replies; 86+ messages in thread
From: Christian König @ 2023-07-31  8:09 UTC (permalink / raw)
  To: Asahi Lina, Luben Tuikov, alyssa, David Airlie, Daniel Vetter,
	Sumit Semwal
  Cc: asahi, linux-media, dri-devel, Faith Ekstrand, linux-kernel

Am 21.07.23 um 12:33 schrieb Asahi Lina:
> [SNIP]

> I've already tried to explain the issue. The DRM scheduler design, as 
> it stands today, makes it impractical to write a safe Rust abstraction 
> for it. This is a fact. Christian has repeatedly NAKed my attempts at 
> changing it to make such a safe abstraction possible, and is clearly 
> opposed to the fundamental lifetime requirements change I am trying to 
> make here. Therefore, we are at an impasse.
>
> It's unfortunate, but given this situation, the DRM scheduler will not 
> be available to Rust DRM drivers. I thought the goal of the DRM 
> subsystem common code was to cater to multiple drivers and usage 
> approaches, with an emphasis on doing things "right" and avoiding 
> design issues that are common mistakes in driver design. Clearly, this 
> is not the case for all of DRM, at least not the DRM scheduler.
>
> In software engineering, complex lifetime requirements are an 
> anti-pattern, which is one reason why Rust draws a line between safe 
> and unsafe APIs. For a safe API, it is up to the API developer to 
> design it such that it cannot be misused in a way that causes memory 
> safety issues, and the only lifetime requirements it can impose are 
> those that can be expressed in the Rust type system and statically 
> checked at compile time. The DRM scheduler's complex chain of lifetime 
> requirements cannot, and wrapping it in enough glue to remove those 
> lifetime requirements would require about as much code as just 
> rewriting it, as well as add unacceptable duplication and overhead.
>
> In kernel Rust, we strive to only have safe APIs for components which 
> have no fundamental reason for being unsafe (things like direct memory 
> mapping and raw hardware access). The DRM scheduler does not fall into 
> one of those "inherently unsafe" categories, so it doesn't make sense 
> to provide a raw unsafe API for it.

This is not completely correct. The DRM scheduler provides a dma_fence 
interface as wrapper around hardware dma_fence interfaces.

This interface is made to allow core Linux memory management to query 
the progress of hardware operations.

So you are working with something very low level here and have to follow 
restrictions which Rust can't enforce from the language because it 
simply can't know about that at compile time.

> Doing so would just expose Rust code to the kind of subtle safety 
> implications that cause memory problems every day in C. If I were to 
> use drm_sched's unsafe API "as is" in my driver, it would *by far* be 
> the least auditable, most complex usage of unsafe code in the entire 
> driver, and I have no confidence that I would be able to get it right 
> and keep it right as the driver changes.
>
> I don't see a reason why this cannot be simply fixed in drm_sched, but 
> Christian disagrees, and has repeatedly (and strongly) NAKed my 
> attempts at doing so, and indeed NAKed the entire premise of the 
> change in lifetime requirements itself. So here we are. There is no 
> solution that will work for everyone that involves drm_sched.
>
> So I don't have a choice other than to just not use drm_sched and roll 
> my own. I will happily move that Rust implementation to common code if 
> another Rust driver comes along and wants to use it. I'm not sure if 
> that kind of thing can be easily made available to C drivers in 
> reverse, but I guess we'll cross that bridge when a C driver expresses 
> interest in using it.

Well, to make it clear once more: Signaling a dma_fence from the 
destructor of a reference counted object is very problematic! This will 
be rejected no matter if you do that in C or in Rust.

What we can do is to make it safe in the sense that you don't access 
freed up memory by using the scheduler fences even more as wrapper 
around the hardware fence as we do now. But this quite a change and 
requires a bit more than just hacking around 
drm_sched_fence_get_timeline_name().

>
> So far it seems existing C drivers are happy with drm_sched's design 
> and complex lifetime requirements, even though they aren't even 
> documented. Perhaps they managed to reverse engineer them from the 
> source, or someone told the authors about it (certainly nobody told me 
> when I started using drm_sched). Or maybe a bunch of the drm_scheduler 
> users are actually broken and have memory safety issues in corner 
> cases. I don't know, though if I had to bet, I'd bet on the second 
> option.
>
> Actually, let's do a quick search and find out!
>
> panfrost_remove() -> panfrost_device_fini() -> panfrost_job_fini() -> 
> drm_sched_fini()
>
> There is a direct, synchronous path between device removal and 
> destroying the DRM scheduler. At no point does it wait for any jobs to 
> complete. That's what patch #3 in this series tries to fix.
>
> In fact, it doesn't even keep the entities alive! It calls 
> drm_dev_unregister() before everything else, but as the docs for the 
> DRM driver lifetimes clearly say (see, docs!), objects visible to 
> userspace must survive that and only be released from the release 
> callback. drm_sched entities are created/destroyed from 
> panfrost_job_open()/panfrost_job_close(), which are called from 
> panfrost_open() and panfrost_postclose(), which are the userspace file 
> open/close functions.
>
> That one I fix in the Rust abstraction already (that one's relatively 
> easy to fix), so it doesn't need a drm_sched patch from my point of 
> view, but it is yet another subtle, undocumented lifetime requirement 
> that is, evidently, impossible to know about and get right without 
> documentation.
>
> Meanwhile, panfrost_fence_ops has no remove() callback, which means 
> there is no reference path stopping device removal (and therefore 
> scheduler teardown) or even module unload while driver fences are 
> still alive. That leads to the UAF patch #2 in this series tries to fix.
>
> In other words, as far as I can tell, the panfrost driver gets 
> *everything* wrong when it comes to the DRM scheduler lifetime 
> requirements, and will crash and burn if the driver is unbound while 
> jobs are in flight, anyone has a file descriptor open at all, or even 
> if any shared buffer holding a DRM scheduler fence from it is bound to 
> the display controller at that time.

Yeah, that is perfectly correct what you wrote.

Daniel and I have gone back and forth multiple times how we should 
design this and we opted to not use direct pointers for the contexts 
because that allows for simpler driver implementations.

The DRM scheduler doesn't document the lifetime requirements because it 
doesn't define the lifetime requirements. From the design the DRM 
scheduler is supposed to be an component wrapping around DMA fences. And 
those DMA fences have the necessary lifetime definition.

Now DMA fences have their live cycles explained in the structure 
documentation, but it doesn't really talk much about the requirements 
for dma_fence_ops implementations. We should probably improve that.

So yes, drivers need to keep the structures which might be accessed by 
userspace alive even after the underlying device is removed. But 
signaling dma_fences is completely independent from that.

>
> This is why this kind of design is not allowed in Rust.

Well it is allowed, it's just not safe.

Regards,
Christian.

> Because nobody gets it right. *Especially* not without docs. I 
> assumed, like the authors of the Panfrost driver clearly assumed, that 
> the DRM scheduler API would not have these crazy undocumented hidden 
> requirements. The only reason I found out the hard way is I happen to 
> create and destroy schedulers all the time, not just once globally, so 
> I would hit the bugs and dangling pointers much more often than 
> Panfrost users who, most likely, never unbind their devices. But we 
> both have the same problem.
>
> I think I've done all I can to explain the issues and try to fix them, 
> so the ball is in your court now. If you want to keep the current 
> design, that's fine, but Rust code will not be a drm_sched user in 
> that case. And the rest of the DRM folks in the C world will have to 
> contend with these issues and fix all the problems in the C drivers 
> (I'm sure panfrost isn't the only one, it's just literally the first 
> one I looked at).
>
> As for me, I'm happy to write a simple workqueue-based Rust scheduler 
> suitable for firmware-managed scheduler devices. Honestly, at this 
> point, I have very little faith left in my ability to fix all these 
> issues in drm_sched (I know there's at least one more lurking, I saw a 
> NULL deref but I wasn't able to reproduce it nor divine how it 
> possibly happened). That, combined with the hostility from the AMD 
> folks about my attempts to improve drm_sched even just a little bit, 
> makes that decision very easy.
>
> Farewell, DRM scheduler. It was nice trying to work with you, but 
> things just didn't work out. I won't be submitting a v2 to this series 
> myself. Please ping me if you fix all these fundamental design issues 
> with drm_sched and think I might actually be able to use it safely in 
> Rust one day. If the API design is solid and safe and the 
> implementation done in a way that inspires confidence at that time 
> maybe we can yank out my Rust solution when the time comes and switch 
> back to drm_sched.
>
> Just please don't expect me to do the work any more, I've done 
> everything I can and this now has to come from you, not me. I've spent 
> way more time understanding drm_sched, auditing its code, 
> understanding its design intent, trying to fix it, and getting yelled 
> at for it than it would take to write a new, clean, safe Rust 
> scheduler. I don't regret some of the time spent (some of the 
> implementation details of drm_sched have taught me useful things), but 
> I'm not prepared to spend any more, sorry.
>
> ~~ Lina
>


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name
@ 2023-07-31  8:09                     ` Christian König
  0 siblings, 0 replies; 86+ messages in thread
From: Christian König @ 2023-07-31  8:09 UTC (permalink / raw)
  To: Asahi Lina, Luben Tuikov, alyssa, David Airlie, Daniel Vetter,
	Sumit Semwal
  Cc: Faith Ekstrand, dri-devel, linux-kernel, linux-media, asahi

Am 21.07.23 um 12:33 schrieb Asahi Lina:
> [SNIP]

> I've already tried to explain the issue. The DRM scheduler design, as 
> it stands today, makes it impractical to write a safe Rust abstraction 
> for it. This is a fact. Christian has repeatedly NAKed my attempts at 
> changing it to make such a safe abstraction possible, and is clearly 
> opposed to the fundamental lifetime requirements change I am trying to 
> make here. Therefore, we are at an impasse.
>
> It's unfortunate, but given this situation, the DRM scheduler will not 
> be available to Rust DRM drivers. I thought the goal of the DRM 
> subsystem common code was to cater to multiple drivers and usage 
> approaches, with an emphasis on doing things "right" and avoiding 
> design issues that are common mistakes in driver design. Clearly, this 
> is not the case for all of DRM, at least not the DRM scheduler.
>
> In software engineering, complex lifetime requirements are an 
> anti-pattern, which is one reason why Rust draws a line between safe 
> and unsafe APIs. For a safe API, it is up to the API developer to 
> design it such that it cannot be misused in a way that causes memory 
> safety issues, and the only lifetime requirements it can impose are 
> those that can be expressed in the Rust type system and statically 
> checked at compile time. The DRM scheduler's complex chain of lifetime 
> requirements cannot, and wrapping it in enough glue to remove those 
> lifetime requirements would require about as much code as just 
> rewriting it, as well as add unacceptable duplication and overhead.
>
> In kernel Rust, we strive to only have safe APIs for components which 
> have no fundamental reason for being unsafe (things like direct memory 
> mapping and raw hardware access). The DRM scheduler does not fall into 
> one of those "inherently unsafe" categories, so it doesn't make sense 
> to provide a raw unsafe API for it.

This is not completely correct. The DRM scheduler provides a dma_fence 
interface as wrapper around hardware dma_fence interfaces.

This interface is made to allow core Linux memory management to query 
the progress of hardware operations.

So you are working with something very low level here and have to follow 
restrictions which Rust can't enforce from the language because it 
simply can't know about that at compile time.

> Doing so would just expose Rust code to the kind of subtle safety 
> implications that cause memory problems every day in C. If I were to 
> use drm_sched's unsafe API "as is" in my driver, it would *by far* be 
> the least auditable, most complex usage of unsafe code in the entire 
> driver, and I have no confidence that I would be able to get it right 
> and keep it right as the driver changes.
>
> I don't see a reason why this cannot be simply fixed in drm_sched, but 
> Christian disagrees, and has repeatedly (and strongly) NAKed my 
> attempts at doing so, and indeed NAKed the entire premise of the 
> change in lifetime requirements itself. So here we are. There is no 
> solution that will work for everyone that involves drm_sched.
>
> So I don't have a choice other than to just not use drm_sched and roll 
> my own. I will happily move that Rust implementation to common code if 
> another Rust driver comes along and wants to use it. I'm not sure if 
> that kind of thing can be easily made available to C drivers in 
> reverse, but I guess we'll cross that bridge when a C driver expresses 
> interest in using it.

Well, to make it clear once more: Signaling a dma_fence from the 
destructor of a reference counted object is very problematic! This will 
be rejected no matter if you do that in C or in Rust.

What we can do is to make it safe in the sense that you don't access 
freed up memory by using the scheduler fences even more as wrapper 
around the hardware fence as we do now. But this quite a change and 
requires a bit more than just hacking around 
drm_sched_fence_get_timeline_name().

>
> So far it seems existing C drivers are happy with drm_sched's design 
> and complex lifetime requirements, even though they aren't even 
> documented. Perhaps they managed to reverse engineer them from the 
> source, or someone told the authors about it (certainly nobody told me 
> when I started using drm_sched). Or maybe a bunch of the drm_scheduler 
> users are actually broken and have memory safety issues in corner 
> cases. I don't know, though if I had to bet, I'd bet on the second 
> option.
>
> Actually, let's do a quick search and find out!
>
> panfrost_remove() -> panfrost_device_fini() -> panfrost_job_fini() -> 
> drm_sched_fini()
>
> There is a direct, synchronous path between device removal and 
> destroying the DRM scheduler. At no point does it wait for any jobs to 
> complete. That's what patch #3 in this series tries to fix.
>
> In fact, it doesn't even keep the entities alive! It calls 
> drm_dev_unregister() before everything else, but as the docs for the 
> DRM driver lifetimes clearly say (see, docs!), objects visible to 
> userspace must survive that and only be released from the release 
> callback. drm_sched entities are created/destroyed from 
> panfrost_job_open()/panfrost_job_close(), which are called from 
> panfrost_open() and panfrost_postclose(), which are the userspace file 
> open/close functions.
>
> That one I fix in the Rust abstraction already (that one's relatively 
> easy to fix), so it doesn't need a drm_sched patch from my point of 
> view, but it is yet another subtle, undocumented lifetime requirement 
> that is, evidently, impossible to know about and get right without 
> documentation.
>
> Meanwhile, panfrost_fence_ops has no remove() callback, which means 
> there is no reference path stopping device removal (and therefore 
> scheduler teardown) or even module unload while driver fences are 
> still alive. That leads to the UAF patch #2 in this series tries to fix.
>
> In other words, as far as I can tell, the panfrost driver gets 
> *everything* wrong when it comes to the DRM scheduler lifetime 
> requirements, and will crash and burn if the driver is unbound while 
> jobs are in flight, anyone has a file descriptor open at all, or even 
> if any shared buffer holding a DRM scheduler fence from it is bound to 
> the display controller at that time.

Yeah, that is perfectly correct what you wrote.

Daniel and I have gone back and forth multiple times how we should 
design this and we opted to not use direct pointers for the contexts 
because that allows for simpler driver implementations.

The DRM scheduler doesn't document the lifetime requirements because it 
doesn't define the lifetime requirements. From the design the DRM 
scheduler is supposed to be an component wrapping around DMA fences. And 
those DMA fences have the necessary lifetime definition.

Now DMA fences have their live cycles explained in the structure 
documentation, but it doesn't really talk much about the requirements 
for dma_fence_ops implementations. We should probably improve that.

So yes, drivers need to keep the structures which might be accessed by 
userspace alive even after the underlying device is removed. But 
signaling dma_fences is completely independent from that.

>
> This is why this kind of design is not allowed in Rust.

Well it is allowed, it's just not safe.

Regards,
Christian.

> Because nobody gets it right. *Especially* not without docs. I 
> assumed, like the authors of the Panfrost driver clearly assumed, that 
> the DRM scheduler API would not have these crazy undocumented hidden 
> requirements. The only reason I found out the hard way is I happen to 
> create and destroy schedulers all the time, not just once globally, so 
> I would hit the bugs and dangling pointers much more often than 
> Panfrost users who, most likely, never unbind their devices. But we 
> both have the same problem.
>
> I think I've done all I can to explain the issues and try to fix them, 
> so the ball is in your court now. If you want to keep the current 
> design, that's fine, but Rust code will not be a drm_sched user in 
> that case. And the rest of the DRM folks in the C world will have to 
> contend with these issues and fix all the problems in the C drivers 
> (I'm sure panfrost isn't the only one, it's just literally the first 
> one I looked at).
>
> As for me, I'm happy to write a simple workqueue-based Rust scheduler 
> suitable for firmware-managed scheduler devices. Honestly, at this 
> point, I have very little faith left in my ability to fix all these 
> issues in drm_sched (I know there's at least one more lurking, I saw a 
> NULL deref but I wasn't able to reproduce it nor divine how it 
> possibly happened). That, combined with the hostility from the AMD 
> folks about my attempts to improve drm_sched even just a little bit, 
> makes that decision very easy.
>
> Farewell, DRM scheduler. It was nice trying to work with you, but 
> things just didn't work out. I won't be submitting a v2 to this series 
> myself. Please ping me if you fix all these fundamental design issues 
> with drm_sched and think I might actually be able to use it safely in 
> Rust one day. If the API design is solid and safe and the 
> implementation done in a way that inspires confidence at that time 
> maybe we can yank out my Rust solution when the time comes and switch 
> back to drm_sched.
>
> Just please don't expect me to do the work any more, I've done 
> everything I can and this now has to come from you, not me. I've spent 
> way more time understanding drm_sched, auditing its code, 
> understanding its design intent, trying to fix it, and getting yelled 
> at for it than it would take to write a new, clean, safe Rust 
> scheduler. I don't regret some of the time spent (some of the 
> implementation details of drm_sched have taught me useful things), but 
> I'm not prepared to spend any more, sorry.
>
> ~~ Lina
>


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 3/3] drm/scheduler: Clean up jobs when the scheduler is torn down.
  2023-07-17 17:40         ` Luben Tuikov
@ 2023-08-02  4:06           ` Matthew Brost
  -1 siblings, 0 replies; 86+ messages in thread
From: Matthew Brost @ 2023-08-02  4:06 UTC (permalink / raw)
  To: Luben Tuikov
  Cc: Asahi Lina, David Airlie, Daniel Vetter, Sumit Semwal,
	Christian König, Faith Ekstrand, linux-kernel, dri-devel,
	asahi, Alyssa Rosenzweig, linux-media

On Mon, Jul 17, 2023 at 01:40:38PM -0400, Luben Tuikov wrote:
> On 2023-07-16 03:51, Asahi Lina wrote:
> > On 15/07/2023 16.14, Luben Tuikov wrote:
> >> On 2023-07-14 04:21, Asahi Lina wrote:
> >>> drm_sched_fini() currently leaves any pending jobs dangling, which
> >>> causes segfaults and other badness when job completion fences are
> >>> signaled after the scheduler is torn down.
> >>
> >> If there are pending jobs, ideally we want to call into the driver,
> >> so that it can release resources it may be holding for those.
> >> The idea behind "pending" is that they are pending in the hardware
> >> and we don't know their state until signalled/the callback called.
> >> (Or unless the device is reset and we get a notification of that fact.)
> > 
> > That's what the job->free_job() callback does, then the driver is free 
> > to do whatever it wants with those jobs. A driver could opt to 
> > synchronously kill those jobs (if it can) or account for them 
> > separately/asynchronously.
> > 
> > What this patch basically says is that if you destroy a scheduler with 
> > pending jobs, it immediately considers them terminated with an error, 
> > and returns ownership back to the driver for freeing. Then the driver 
> > can decide how to handle the rest and whatever the underlying hardware 
> > state is.
> > 
> >>> Explicitly detach all jobs from their completion callbacks and free
> >>> them. This makes it possible to write a sensible safe abstraction for
> >>> drm_sched, without having to externally duplicate the tracking of
> >>> in-flight jobs.
> >>>
> >>> This shouldn't regress any existing drivers, since calling
> >>> drm_sched_fini() with any pending jobs is broken and this change should
> >>> be a no-op if there are no pending jobs.
> >>
> >> While this statement is true on its own, it kind of contradicts
> >> the premise of the first paragraph.
> > 
> > I mean right *now* it's broken, before this patch. I'm trying to make it 
> > safe, but it shouldn't regress any exiting drivers since if they trigger 
> > this code path they are broken today.
> 
> Not sure about other drivers--they can speak for themselves and the CC list
> should include them--please use "dim add-missing-cc" and make sure
> that the Git commit description contains the Cc tags--then git send-email
> will populate the SMTP CC. Feel free to add more Cc tags on top of that.
> 

Xe doesn't need this as our reference counting scheme doesn't allow
drm_sched_fini to be called when jobs are pending. If we want to
teardown a drm_sched we set TDR timeout to zero and all pending jobs
gets cleaned up that way, the ref of sched will go to zero, and
drm_sched_fini is called. The caveat here being I think we need a worker
to call drm_sched_fini as the last ref to scheduler might be dropped
from within scheduler main thread.

That being said, I doubt this patch breaks anything in Xe so do not a
real strong opinion on this.

Matt

> > 
> >>
> >>> Signed-off-by: Asahi Lina <lina@asahilina.net>
> >>> ---
> >>>   drivers/gpu/drm/scheduler/sched_main.c | 32 ++++++++++++++++++++++++++++++--
> >>>   1 file changed, 30 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> >>> index 1f3bc3606239..a4da4aac0efd 100644
> >>> --- a/drivers/gpu/drm/scheduler/sched_main.c
> >>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> >>> @@ -1186,10 +1186,38 @@ EXPORT_SYMBOL(drm_sched_init);
> >>>   void drm_sched_fini(struct drm_gpu_scheduler *sched)
> >>>   {
> >>>   	struct drm_sched_entity *s_entity;
> >>> +	struct drm_sched_job *s_job, *tmp;
> >>>   	int i;
> >>>   
> >>> -	if (sched->thread)
> >>> -		kthread_stop(sched->thread);
> >>> +	if (!sched->thread)
> >>> +		return;
> >>> +
> >>> +	/*
> >>> +	 * Stop the scheduler, detaching all jobs from their hardware callbacks
> >>> +	 * and cleaning up complete jobs.
> >>> +	 */
> >>> +	drm_sched_stop(sched, NULL);
> >>> +
> >>> +	/*
> >>> +	 * Iterate through the pending job list and free all jobs.
> >>> +	 * This assumes the driver has either guaranteed jobs are already stopped, or that
> >>> +	 * otherwise it is responsible for keeping any necessary data structures for
> >>> +	 * in-progress jobs alive even when the free_job() callback is called early (e.g. by
> >>> +	 * putting them in its own queue or doing its own refcounting).
> >>> +	 */
> >>> +	list_for_each_entry_safe(s_job, tmp, &sched->pending_list, list) {
> >>> +		spin_lock(&sched->job_list_lock);
> >>> +		list_del_init(&s_job->list);
> >>> +		spin_unlock(&sched->job_list_lock);
> >>> +
> >>> +		dma_fence_set_error(&s_job->s_fence->finished, -ESRCH);
> >>> +		drm_sched_fence_finished(s_job->s_fence);
> >>
> >> I'd imagine it's better to rebase this on top of drm-misc-next where
> >> drm_sched_fence_finished() takes one more parameter--the error.
> > 
> > Ah, sure! I can do that.
> 
> It's worth posting it as a stand-alone patch. Please make sure to add Cc tags
> into the commit description--use "dim add-missing-cc", perhaps also
> git-blame and git-log might help with additional Cc. "scripts/get_maintainer.pl"
> for files unaffected by this commit. (dim add-missing-cc uses get_maintainer.pl
> for affected files.)
> 
> Feel free to post it stand-alone and we'll let the natural review process take over. :-)
> 
> > 
> >>
> >>> +
> >>> +		WARN_ON(s_job->s_fence->parent);
> >>> +		sched->ops->free_job(s_job);
> >>> +	}
> >>> +
> >>> +	kthread_stop(sched->thread);
> >>>   
> >>>   	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
> >>>   		struct drm_sched_rq *rq = &sched->sched_rq[i];
> >>>
> >>
> >> Conceptually I don't mind this patch--I see what it is trying to achieve,
> >> but technically, we want the driver to detect GPU removal and return shared
> >> resources back, such as "jobs", which DRM is also aware of.
> > 
> > I think you missed the context of why I'm doing this, so in short: my
> 
> As a general rule of thumb, in my writing emails I try to avoid using
> "you" and "I" as much as possible--it sets this divisive stage, and it
> can get misrepresented, especially in email.
> 
> As is the case in research literature, if I absolutely have to use a pronoun--which
> rarely happens, I always use "we", and this is the most number of "I"-s I've used
> in a long while.
> 
> > use case (like Xe's) involves using a separate drm_sched instance *per 
> > file* since these queues are scheduled directly by the firmware. So this 
> > isn't about GPU removal, but rather about a GPU context going away while 
> > jobs are in flight (e.g. the process got killed). We want that to 
> > quickly kill the "DRM view" of the world, including signaling all the 
> > fences with an error and freeing resources like the scheduler itself.
> > 
> > In the case of this particular GPU, there is no known way to actively 
> > and synchronously abort GPU jobs, so we need to let them run to 
> > completion (or failure), but we don't want that to block process cleanup 
> > and freeing a bunch of high-level resources. The driver is architected 
> > roughly along the lines of a firmware abstraction layer that maps to the 
> > firmware shared memory structures, and then a layer on top that 
> > implements the DRM view. When a process gets killed, the DRM side (which 
> > includes the scheduler, etc.) gets torn down immediately, and it makes 
> > sense to handle this cleanup inside drm_sched since it already has a 
> > view into what jobs are in flight. Otherwise, I would have to duplicate 
> > job tracking in the driver (actually worse: in the Rust abstraction for 
> > safety), which doesn't make much sense.
> > 
> > But what I *do* have in the driver is tracking of the firmware 
> > structures. So when the drm_sched gets torn down and all the jobs 
> > killed, the underlying firmware jobs do run to completion, and the 
> > resources they use are all cleaned up after that (it's all reference 
> > counted).
> 
> The ref-count definitely helps here.
> 
> > The primitive involved here is that in-flight firmware jobs 
> > are assigned an event completion slot, and that keeps a reference to 
> > them from a global array until the events fire and all the jobs are 
> > known to have completed. This keeps things memory-safe, since we 
> > absolutely cannot free/destroy firmware structures while they are in use 
> > (otherwise the firmware crashes, which is fatal on these GPUs - requires 
> > a full system reboot to recover).
> > 
> > In practice, with the VM map model that we use, what ends up happening 
> > when a process gets killed is that all the user objects for in-flight 
> > jobs get unmapped, which usually causes the GPU hardware (not firmware) 
> > to fault. This then triggers early termination of jobs anyway via the 
> > firmware fault recovery flow. But even that takes some short amount of 
> > time, and by then all the drm_sched stuff is long gone and we're just 
> > dealing with the in-flight firmware stuff.
> > 
> >> In the case where we're initiating the tear, we should notify the driver that
> >> we're about to forget jobs (resources), so that it knows to return them back
> >> or that it shouldn't notify us for them (since we've notified we're forgetting them.)
> > 
> > That contradicts Christian's comment. I tried to document that (after 
> > this patch) the scheduler no longer cares about hw fences and whether 
> > they are signaled or not after it's destroyed, and I got a strongly 
> > worded NAK for it. Sooo... which is it? Is it okay for drivers not to 
> > signal the hw fence after a scheduler teardown, or not?
> 
> Christian is correct in that we don't want to hang upstream control
> to the whims of a low-level device driver.
> 
> > But really, I don't see a use case for an explicit "about to forget job" 
> > callback. The job free callback already serves the purpose of telling 
> 
> Long time ago, in a galaxy far far away, this was needed in order
> to prevent device write-DMA into non-existing (random) memory. As
> this is not the case anymore, go with Christian's comment.
> 
> > the driver to clean up resources associated with a job. If it wants to 
> > synchronously abort things there, it could easily take over its own 
> > fence signaling and do something with the underlying stuff if the fence 
> > is not signaled yet.
> > 
> > In my case, since the driver is written in Rust and free_job() just maps 
> > to the destructor (Drop impl), that just ends up freeing a bunch of 
> > memory and other objects, and I don't particularly care about the state 
> > of the firmware side any more after that. The flow is the same whether 
> > it was a successful job completion, a failure, or an early destruction 
> > due to the drm_sched getting torn down.
> > 
> >> (Note also that in this latter case, traditionally, the device would be reset,
> >> so that we can guarantee that it has forgotten all shared resources which
> >> we are to tear up. This is somewhat more complicated with GPUs, thus the method
> >> pointed out above.)
> > 
> > Yeah, in the firmware scheduling case we can't do this at all unless the 
> > firmware has an explicit teardown/forget op (which I'm not aware of) and 
> > a full GPU reset isn't something we can do either. Hence we just let the 
> > underlying jobs complete. In practice they tend to die pretty quickly 
> > anyway once all the buffers are unmapped.
> 
> Perhaps in the future, as more complex workloads are deferred to this
> hardware and driver, a real-time requirement might be needed for this
> "tend to die pretty quickly", that that there's some guarantee of
> work resuming in some finite time.
> -- 
> Regards,
> Luben
> 
> > 
> > ~~ Lina
> > 
> 

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 3/3] drm/scheduler: Clean up jobs when the scheduler is torn down.
@ 2023-08-02  4:06           ` Matthew Brost
  0 siblings, 0 replies; 86+ messages in thread
From: Matthew Brost @ 2023-08-02  4:06 UTC (permalink / raw)
  To: Luben Tuikov
  Cc: Alyssa Rosenzweig, Asahi Lina, linux-kernel, dri-devel,
	Christian König, asahi, Sumit Semwal, Faith Ekstrand,
	linux-media

On Mon, Jul 17, 2023 at 01:40:38PM -0400, Luben Tuikov wrote:
> On 2023-07-16 03:51, Asahi Lina wrote:
> > On 15/07/2023 16.14, Luben Tuikov wrote:
> >> On 2023-07-14 04:21, Asahi Lina wrote:
> >>> drm_sched_fini() currently leaves any pending jobs dangling, which
> >>> causes segfaults and other badness when job completion fences are
> >>> signaled after the scheduler is torn down.
> >>
> >> If there are pending jobs, ideally we want to call into the driver,
> >> so that it can release resources it may be holding for those.
> >> The idea behind "pending" is that they are pending in the hardware
> >> and we don't know their state until signalled/the callback called.
> >> (Or unless the device is reset and we get a notification of that fact.)
> > 
> > That's what the job->free_job() callback does, then the driver is free 
> > to do whatever it wants with those jobs. A driver could opt to 
> > synchronously kill those jobs (if it can) or account for them 
> > separately/asynchronously.
> > 
> > What this patch basically says is that if you destroy a scheduler with 
> > pending jobs, it immediately considers them terminated with an error, 
> > and returns ownership back to the driver for freeing. Then the driver 
> > can decide how to handle the rest and whatever the underlying hardware 
> > state is.
> > 
> >>> Explicitly detach all jobs from their completion callbacks and free
> >>> them. This makes it possible to write a sensible safe abstraction for
> >>> drm_sched, without having to externally duplicate the tracking of
> >>> in-flight jobs.
> >>>
> >>> This shouldn't regress any existing drivers, since calling
> >>> drm_sched_fini() with any pending jobs is broken and this change should
> >>> be a no-op if there are no pending jobs.
> >>
> >> While this statement is true on its own, it kind of contradicts
> >> the premise of the first paragraph.
> > 
> > I mean right *now* it's broken, before this patch. I'm trying to make it 
> > safe, but it shouldn't regress any exiting drivers since if they trigger 
> > this code path they are broken today.
> 
> Not sure about other drivers--they can speak for themselves and the CC list
> should include them--please use "dim add-missing-cc" and make sure
> that the Git commit description contains the Cc tags--then git send-email
> will populate the SMTP CC. Feel free to add more Cc tags on top of that.
> 

Xe doesn't need this as our reference counting scheme doesn't allow
drm_sched_fini to be called when jobs are pending. If we want to
teardown a drm_sched we set TDR timeout to zero and all pending jobs
gets cleaned up that way, the ref of sched will go to zero, and
drm_sched_fini is called. The caveat here being I think we need a worker
to call drm_sched_fini as the last ref to scheduler might be dropped
from within scheduler main thread.

That being said, I doubt this patch breaks anything in Xe so do not a
real strong opinion on this.

Matt

> > 
> >>
> >>> Signed-off-by: Asahi Lina <lina@asahilina.net>
> >>> ---
> >>>   drivers/gpu/drm/scheduler/sched_main.c | 32 ++++++++++++++++++++++++++++++--
> >>>   1 file changed, 30 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> >>> index 1f3bc3606239..a4da4aac0efd 100644
> >>> --- a/drivers/gpu/drm/scheduler/sched_main.c
> >>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> >>> @@ -1186,10 +1186,38 @@ EXPORT_SYMBOL(drm_sched_init);
> >>>   void drm_sched_fini(struct drm_gpu_scheduler *sched)
> >>>   {
> >>>   	struct drm_sched_entity *s_entity;
> >>> +	struct drm_sched_job *s_job, *tmp;
> >>>   	int i;
> >>>   
> >>> -	if (sched->thread)
> >>> -		kthread_stop(sched->thread);
> >>> +	if (!sched->thread)
> >>> +		return;
> >>> +
> >>> +	/*
> >>> +	 * Stop the scheduler, detaching all jobs from their hardware callbacks
> >>> +	 * and cleaning up complete jobs.
> >>> +	 */
> >>> +	drm_sched_stop(sched, NULL);
> >>> +
> >>> +	/*
> >>> +	 * Iterate through the pending job list and free all jobs.
> >>> +	 * This assumes the driver has either guaranteed jobs are already stopped, or that
> >>> +	 * otherwise it is responsible for keeping any necessary data structures for
> >>> +	 * in-progress jobs alive even when the free_job() callback is called early (e.g. by
> >>> +	 * putting them in its own queue or doing its own refcounting).
> >>> +	 */
> >>> +	list_for_each_entry_safe(s_job, tmp, &sched->pending_list, list) {
> >>> +		spin_lock(&sched->job_list_lock);
> >>> +		list_del_init(&s_job->list);
> >>> +		spin_unlock(&sched->job_list_lock);
> >>> +
> >>> +		dma_fence_set_error(&s_job->s_fence->finished, -ESRCH);
> >>> +		drm_sched_fence_finished(s_job->s_fence);
> >>
> >> I'd imagine it's better to rebase this on top of drm-misc-next where
> >> drm_sched_fence_finished() takes one more parameter--the error.
> > 
> > Ah, sure! I can do that.
> 
> It's worth posting it as a stand-alone patch. Please make sure to add Cc tags
> into the commit description--use "dim add-missing-cc", perhaps also
> git-blame and git-log might help with additional Cc. "scripts/get_maintainer.pl"
> for files unaffected by this commit. (dim add-missing-cc uses get_maintainer.pl
> for affected files.)
> 
> Feel free to post it stand-alone and we'll let the natural review process take over. :-)
> 
> > 
> >>
> >>> +
> >>> +		WARN_ON(s_job->s_fence->parent);
> >>> +		sched->ops->free_job(s_job);
> >>> +	}
> >>> +
> >>> +	kthread_stop(sched->thread);
> >>>   
> >>>   	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
> >>>   		struct drm_sched_rq *rq = &sched->sched_rq[i];
> >>>
> >>
> >> Conceptually I don't mind this patch--I see what it is trying to achieve,
> >> but technically, we want the driver to detect GPU removal and return shared
> >> resources back, such as "jobs", which DRM is also aware of.
> > 
> > I think you missed the context of why I'm doing this, so in short: my
> 
> As a general rule of thumb, in my writing emails I try to avoid using
> "you" and "I" as much as possible--it sets this divisive stage, and it
> can get misrepresented, especially in email.
> 
> As is the case in research literature, if I absolutely have to use a pronoun--which
> rarely happens, I always use "we", and this is the most number of "I"-s I've used
> in a long while.
> 
> > use case (like Xe's) involves using a separate drm_sched instance *per 
> > file* since these queues are scheduled directly by the firmware. So this 
> > isn't about GPU removal, but rather about a GPU context going away while 
> > jobs are in flight (e.g. the process got killed). We want that to 
> > quickly kill the "DRM view" of the world, including signaling all the 
> > fences with an error and freeing resources like the scheduler itself.
> > 
> > In the case of this particular GPU, there is no known way to actively 
> > and synchronously abort GPU jobs, so we need to let them run to 
> > completion (or failure), but we don't want that to block process cleanup 
> > and freeing a bunch of high-level resources. The driver is architected 
> > roughly along the lines of a firmware abstraction layer that maps to the 
> > firmware shared memory structures, and then a layer on top that 
> > implements the DRM view. When a process gets killed, the DRM side (which 
> > includes the scheduler, etc.) gets torn down immediately, and it makes 
> > sense to handle this cleanup inside drm_sched since it already has a 
> > view into what jobs are in flight. Otherwise, I would have to duplicate 
> > job tracking in the driver (actually worse: in the Rust abstraction for 
> > safety), which doesn't make much sense.
> > 
> > But what I *do* have in the driver is tracking of the firmware 
> > structures. So when the drm_sched gets torn down and all the jobs 
> > killed, the underlying firmware jobs do run to completion, and the 
> > resources they use are all cleaned up after that (it's all reference 
> > counted).
> 
> The ref-count definitely helps here.
> 
> > The primitive involved here is that in-flight firmware jobs 
> > are assigned an event completion slot, and that keeps a reference to 
> > them from a global array until the events fire and all the jobs are 
> > known to have completed. This keeps things memory-safe, since we 
> > absolutely cannot free/destroy firmware structures while they are in use 
> > (otherwise the firmware crashes, which is fatal on these GPUs - requires 
> > a full system reboot to recover).
> > 
> > In practice, with the VM map model that we use, what ends up happening 
> > when a process gets killed is that all the user objects for in-flight 
> > jobs get unmapped, which usually causes the GPU hardware (not firmware) 
> > to fault. This then triggers early termination of jobs anyway via the 
> > firmware fault recovery flow. But even that takes some short amount of 
> > time, and by then all the drm_sched stuff is long gone and we're just 
> > dealing with the in-flight firmware stuff.
> > 
> >> In the case where we're initiating the tear, we should notify the driver that
> >> we're about to forget jobs (resources), so that it knows to return them back
> >> or that it shouldn't notify us for them (since we've notified we're forgetting them.)
> > 
> > That contradicts Christian's comment. I tried to document that (after 
> > this patch) the scheduler no longer cares about hw fences and whether 
> > they are signaled or not after it's destroyed, and I got a strongly 
> > worded NAK for it. Sooo... which is it? Is it okay for drivers not to 
> > signal the hw fence after a scheduler teardown, or not?
> 
> Christian is correct in that we don't want to hang upstream control
> to the whims of a low-level device driver.
> 
> > But really, I don't see a use case for an explicit "about to forget job" 
> > callback. The job free callback already serves the purpose of telling 
> 
> Long time ago, in a galaxy far far away, this was needed in order
> to prevent device write-DMA into non-existing (random) memory. As
> this is not the case anymore, go with Christian's comment.
> 
> > the driver to clean up resources associated with a job. If it wants to 
> > synchronously abort things there, it could easily take over its own 
> > fence signaling and do something with the underlying stuff if the fence 
> > is not signaled yet.
> > 
> > In my case, since the driver is written in Rust and free_job() just maps 
> > to the destructor (Drop impl), that just ends up freeing a bunch of 
> > memory and other objects, and I don't particularly care about the state 
> > of the firmware side any more after that. The flow is the same whether 
> > it was a successful job completion, a failure, or an early destruction 
> > due to the drm_sched getting torn down.
> > 
> >> (Note also that in this latter case, traditionally, the device would be reset,
> >> so that we can guarantee that it has forgotten all shared resources which
> >> we are to tear up. This is somewhat more complicated with GPUs, thus the method
> >> pointed out above.)
> > 
> > Yeah, in the firmware scheduling case we can't do this at all unless the 
> > firmware has an explicit teardown/forget op (which I'm not aware of) and 
> > a full GPU reset isn't something we can do either. Hence we just let the 
> > underlying jobs complete. In practice they tend to die pretty quickly 
> > anyway once all the buffers are unmapped.
> 
> Perhaps in the future, as more complex workloads are deferred to this
> hardware and driver, a real-time requirement might be needed for this
> "tend to die pretty quickly", that that there's some guarantee of
> work resuming in some finite time.
> -- 
> Regards,
> Luben
> 
> > 
> > ~~ Lina
> > 
> 

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 3/3] drm/scheduler: Clean up jobs when the scheduler is torn down.
  2023-08-02  4:06           ` Matthew Brost
@ 2023-08-02 14:12             ` Luben Tuikov
  -1 siblings, 0 replies; 86+ messages in thread
From: Luben Tuikov @ 2023-08-02 14:12 UTC (permalink / raw)
  To: Matthew Brost
  Cc: Asahi Lina, David Airlie, Daniel Vetter, Sumit Semwal,
	Christian König, Faith Ekstrand, linux-kernel, dri-devel,
	asahi, Alyssa Rosenzweig, linux-media

On 2023-08-02 00:06, Matthew Brost wrote:
> On Mon, Jul 17, 2023 at 01:40:38PM -0400, Luben Tuikov wrote:
>> On 2023-07-16 03:51, Asahi Lina wrote:
>>> On 15/07/2023 16.14, Luben Tuikov wrote:
>>>> On 2023-07-14 04:21, Asahi Lina wrote:
>>>>> drm_sched_fini() currently leaves any pending jobs dangling, which
>>>>> causes segfaults and other badness when job completion fences are
>>>>> signaled after the scheduler is torn down.
>>>>
>>>> If there are pending jobs, ideally we want to call into the driver,
>>>> so that it can release resources it may be holding for those.
>>>> The idea behind "pending" is that they are pending in the hardware
>>>> and we don't know their state until signalled/the callback called.
>>>> (Or unless the device is reset and we get a notification of that fact.)
>>>
>>> That's what the job->free_job() callback does, then the driver is free 
>>> to do whatever it wants with those jobs. A driver could opt to 
>>> synchronously kill those jobs (if it can) or account for them 
>>> separately/asynchronously.
>>>
>>> What this patch basically says is that if you destroy a scheduler with 
>>> pending jobs, it immediately considers them terminated with an error, 
>>> and returns ownership back to the driver for freeing. Then the driver 
>>> can decide how to handle the rest and whatever the underlying hardware 
>>> state is.
>>>
>>>>> Explicitly detach all jobs from their completion callbacks and free
>>>>> them. This makes it possible to write a sensible safe abstraction for
>>>>> drm_sched, without having to externally duplicate the tracking of
>>>>> in-flight jobs.
>>>>>
>>>>> This shouldn't regress any existing drivers, since calling
>>>>> drm_sched_fini() with any pending jobs is broken and this change should
>>>>> be a no-op if there are no pending jobs.
>>>>
>>>> While this statement is true on its own, it kind of contradicts
>>>> the premise of the first paragraph.
>>>
>>> I mean right *now* it's broken, before this patch. I'm trying to make it 
>>> safe, but it shouldn't regress any exiting drivers since if they trigger 
>>> this code path they are broken today.
>>
>> Not sure about other drivers--they can speak for themselves and the CC list
>> should include them--please use "dim add-missing-cc" and make sure
>> that the Git commit description contains the Cc tags--then git send-email
>> will populate the SMTP CC. Feel free to add more Cc tags on top of that.
>>
> 
> Xe doesn't need this as our reference counting scheme doesn't allow
> drm_sched_fini to be called when jobs are pending. If we want to
> teardown a drm_sched we set TDR timeout to zero and all pending jobs
> gets cleaned up that way, the ref of sched will go to zero, and
> drm_sched_fini is called. The caveat here being I think we need a worker
> to call drm_sched_fini as the last ref to scheduler might be dropped
> from within scheduler main thread.
> 
> That being said, I doubt this patch breaks anything in Xe so do not a
> real strong opinion on this.

Yes, that's my understanding as well. If the drivers call drm_sched_fini()
then they are responsible for cleaning up _before_ calling drm_sched_fini().
The Xe driver seems to be doing the right thing. All in all, since
drm_sched_fini() is called by the driver, the driver is supposed to have
cleaned up before calling it, so I don't see much need for this patch
after all.
-- 
Regards,
Luben


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 3/3] drm/scheduler: Clean up jobs when the scheduler is torn down.
@ 2023-08-02 14:12             ` Luben Tuikov
  0 siblings, 0 replies; 86+ messages in thread
From: Luben Tuikov @ 2023-08-02 14:12 UTC (permalink / raw)
  To: Matthew Brost
  Cc: Alyssa Rosenzweig, Asahi Lina, linux-kernel, dri-devel,
	Christian König, asahi, Sumit Semwal, Faith Ekstrand,
	linux-media

On 2023-08-02 00:06, Matthew Brost wrote:
> On Mon, Jul 17, 2023 at 01:40:38PM -0400, Luben Tuikov wrote:
>> On 2023-07-16 03:51, Asahi Lina wrote:
>>> On 15/07/2023 16.14, Luben Tuikov wrote:
>>>> On 2023-07-14 04:21, Asahi Lina wrote:
>>>>> drm_sched_fini() currently leaves any pending jobs dangling, which
>>>>> causes segfaults and other badness when job completion fences are
>>>>> signaled after the scheduler is torn down.
>>>>
>>>> If there are pending jobs, ideally we want to call into the driver,
>>>> so that it can release resources it may be holding for those.
>>>> The idea behind "pending" is that they are pending in the hardware
>>>> and we don't know their state until signalled/the callback called.
>>>> (Or unless the device is reset and we get a notification of that fact.)
>>>
>>> That's what the job->free_job() callback does, then the driver is free 
>>> to do whatever it wants with those jobs. A driver could opt to 
>>> synchronously kill those jobs (if it can) or account for them 
>>> separately/asynchronously.
>>>
>>> What this patch basically says is that if you destroy a scheduler with 
>>> pending jobs, it immediately considers them terminated with an error, 
>>> and returns ownership back to the driver for freeing. Then the driver 
>>> can decide how to handle the rest and whatever the underlying hardware 
>>> state is.
>>>
>>>>> Explicitly detach all jobs from their completion callbacks and free
>>>>> them. This makes it possible to write a sensible safe abstraction for
>>>>> drm_sched, without having to externally duplicate the tracking of
>>>>> in-flight jobs.
>>>>>
>>>>> This shouldn't regress any existing drivers, since calling
>>>>> drm_sched_fini() with any pending jobs is broken and this change should
>>>>> be a no-op if there are no pending jobs.
>>>>
>>>> While this statement is true on its own, it kind of contradicts
>>>> the premise of the first paragraph.
>>>
>>> I mean right *now* it's broken, before this patch. I'm trying to make it 
>>> safe, but it shouldn't regress any exiting drivers since if they trigger 
>>> this code path they are broken today.
>>
>> Not sure about other drivers--they can speak for themselves and the CC list
>> should include them--please use "dim add-missing-cc" and make sure
>> that the Git commit description contains the Cc tags--then git send-email
>> will populate the SMTP CC. Feel free to add more Cc tags on top of that.
>>
> 
> Xe doesn't need this as our reference counting scheme doesn't allow
> drm_sched_fini to be called when jobs are pending. If we want to
> teardown a drm_sched we set TDR timeout to zero and all pending jobs
> gets cleaned up that way, the ref of sched will go to zero, and
> drm_sched_fini is called. The caveat here being I think we need a worker
> to call drm_sched_fini as the last ref to scheduler might be dropped
> from within scheduler main thread.
> 
> That being said, I doubt this patch breaks anything in Xe so do not a
> real strong opinion on this.

Yes, that's my understanding as well. If the drivers call drm_sched_fini()
then they are responsible for cleaning up _before_ calling drm_sched_fini().
The Xe driver seems to be doing the right thing. All in all, since
drm_sched_fini() is called by the driver, the driver is supposed to have
cleaned up before calling it, so I don't see much need for this patch
after all.
-- 
Regards,
Luben


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name
  2023-07-31  8:09                     ` Christian König
@ 2023-11-01  6:59                       ` Dave Airlie
  -1 siblings, 0 replies; 86+ messages in thread
From: Dave Airlie @ 2023-11-01  6:59 UTC (permalink / raw)
  To: Christian König
  Cc: Asahi Lina, Luben Tuikov, alyssa, Daniel Vetter, Sumit Semwal,
	Faith Ekstrand, dri-devel, linux-kernel, linux-media, asahi

>
> Well, to make it clear once more: Signaling a dma_fence from the
> destructor of a reference counted object is very problematic! This will
> be rejected no matter if you do that in C or in Rust.
>
> What we can do is to make it safe in the sense that you don't access
> freed up memory by using the scheduler fences even more as wrapper
> around the hardware fence as we do now. But this quite a change and
> requires a bit more than just hacking around
> drm_sched_fence_get_timeline_name().

I really think this needs to be documented if nothing else out of this thread.

Clearly nobody is going to get it right and hidden here in this
thread, this info isn't useful.

Can we have some sort of design document for the dma-fence/scheduler
interactions written and we can try and refine it with solutions on
the list, because I'm tired of people proposing things and NAK's
getting thrown around without anything to point people at.

The next NAK I see on the list will mean I block all patches from the
sender until they write a documentation patch, because seriously this
stuff is too hard for someone to just keep it in their head and expect
everyone else to understand from reading the code.

Dave.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name
@ 2023-11-01  6:59                       ` Dave Airlie
  0 siblings, 0 replies; 86+ messages in thread
From: Dave Airlie @ 2023-11-01  6:59 UTC (permalink / raw)
  To: Christian König
  Cc: asahi, Asahi Lina, linux-kernel, dri-devel, Luben Tuikov, alyssa,
	Sumit Semwal, Faith Ekstrand, linux-media

>
> Well, to make it clear once more: Signaling a dma_fence from the
> destructor of a reference counted object is very problematic! This will
> be rejected no matter if you do that in C or in Rust.
>
> What we can do is to make it safe in the sense that you don't access
> freed up memory by using the scheduler fences even more as wrapper
> around the hardware fence as we do now. But this quite a change and
> requires a bit more than just hacking around
> drm_sched_fence_get_timeline_name().

I really think this needs to be documented if nothing else out of this thread.

Clearly nobody is going to get it right and hidden here in this
thread, this info isn't useful.

Can we have some sort of design document for the dma-fence/scheduler
interactions written and we can try and refine it with solutions on
the list, because I'm tired of people proposing things and NAK's
getting thrown around without anything to point people at.

The next NAK I see on the list will mean I block all patches from the
sender until they write a documentation patch, because seriously this
stuff is too hard for someone to just keep it in their head and expect
everyone else to understand from reading the code.

Dave.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name
  2023-11-01  6:59                       ` Dave Airlie
@ 2023-11-01  8:13                         ` Daniel Vetter
  -1 siblings, 0 replies; 86+ messages in thread
From: Daniel Vetter @ 2023-11-01  8:13 UTC (permalink / raw)
  To: Dave Airlie
  Cc: Christian König, Asahi Lina, Luben Tuikov, alyssa,
	Sumit Semwal, Faith Ekstrand, dri-devel, linux-kernel,
	linux-media, asahi

On Wed, 1 Nov 2023 at 07:59, Dave Airlie <airlied@gmail.com> wrote:
>
> >
> > Well, to make it clear once more: Signaling a dma_fence from the
> > destructor of a reference counted object is very problematic! This will
> > be rejected no matter if you do that in C or in Rust.
> >
> > What we can do is to make it safe in the sense that you don't access
> > freed up memory by using the scheduler fences even more as wrapper
> > around the hardware fence as we do now. But this quite a change and
> > requires a bit more than just hacking around
> > drm_sched_fence_get_timeline_name().
>
> I really think this needs to be documented if nothing else out of this thread.
>
> Clearly nobody is going to get it right and hidden here in this
> thread, this info isn't useful.
>
> Can we have some sort of design document for the dma-fence/scheduler
> interactions written and we can try and refine it with solutions on
> the list, because I'm tired of people proposing things and NAK's
> getting thrown around without anything to point people at.
>
> The next NAK I see on the list will mean I block all patches from the
> sender until they write a documentation patch, because seriously this
> stuff is too hard for someone to just keep it in their head and expect
> everyone else to understand from reading the code.

I very much like the idea that NAK replies are counted as "you've just
volunteered yourself for some documentation patches so that next time
around you can reply with a link to the docs instead of just a NAK".

I don't think we'll get out of these discussions otherwise, since
currently we have undocumented, but very tricky semantics of the
drm/sched codebase for ringbuffer scheduling which is extended to fw
scheduling in also very tricky ways, with not entirely clear impacts
on semantics of all the drm/sched things. And as a result we just pile
up enormous amounts of threads where I think the only thing assured is
that people talk past each another.

Converting NAKs into doc patches should at least eventually get rid of
the worst confusions we're dealing with here.

Cheers, Sima
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name
@ 2023-11-01  8:13                         ` Daniel Vetter
  0 siblings, 0 replies; 86+ messages in thread
From: Daniel Vetter @ 2023-11-01  8:13 UTC (permalink / raw)
  To: Dave Airlie
  Cc: asahi, Asahi Lina, linux-kernel, dri-devel, Sumit Semwal,
	Luben Tuikov, alyssa, Christian König, Faith Ekstrand,
	linux-media

On Wed, 1 Nov 2023 at 07:59, Dave Airlie <airlied@gmail.com> wrote:
>
> >
> > Well, to make it clear once more: Signaling a dma_fence from the
> > destructor of a reference counted object is very problematic! This will
> > be rejected no matter if you do that in C or in Rust.
> >
> > What we can do is to make it safe in the sense that you don't access
> > freed up memory by using the scheduler fences even more as wrapper
> > around the hardware fence as we do now. But this quite a change and
> > requires a bit more than just hacking around
> > drm_sched_fence_get_timeline_name().
>
> I really think this needs to be documented if nothing else out of this thread.
>
> Clearly nobody is going to get it right and hidden here in this
> thread, this info isn't useful.
>
> Can we have some sort of design document for the dma-fence/scheduler
> interactions written and we can try and refine it with solutions on
> the list, because I'm tired of people proposing things and NAK's
> getting thrown around without anything to point people at.
>
> The next NAK I see on the list will mean I block all patches from the
> sender until they write a documentation patch, because seriously this
> stuff is too hard for someone to just keep it in their head and expect
> everyone else to understand from reading the code.

I very much like the idea that NAK replies are counted as "you've just
volunteered yourself for some documentation patches so that next time
around you can reply with a link to the docs instead of just a NAK".

I don't think we'll get out of these discussions otherwise, since
currently we have undocumented, but very tricky semantics of the
drm/sched codebase for ringbuffer scheduling which is extended to fw
scheduling in also very tricky ways, with not entirely clear impacts
on semantics of all the drm/sched things. And as a result we just pile
up enormous amounts of threads where I think the only thing assured is
that people talk past each another.

Converting NAKs into doc patches should at least eventually get rid of
the worst confusions we're dealing with here.

Cheers, Sima
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name
  2023-11-01  8:13                         ` Daniel Vetter
@ 2023-11-02 10:48                           ` Christian König
  -1 siblings, 0 replies; 86+ messages in thread
From: Christian König @ 2023-11-02 10:48 UTC (permalink / raw)
  To: Daniel Vetter, Dave Airlie
  Cc: Asahi Lina, alyssa, Sumit Semwal, Faith Ekstrand, dri-devel,
	linux-kernel, linux-media, asahi, Luben Tuikov

Am 01.11.23 um 09:13 schrieb Daniel Vetter:
> On Wed, 1 Nov 2023 at 07:59, Dave Airlie <airlied@gmail.com> wrote:
>>> Well, to make it clear once more: Signaling a dma_fence from the
>>> destructor of a reference counted object is very problematic! This will
>>> be rejected no matter if you do that in C or in Rust.
>>>
>>> What we can do is to make it safe in the sense that you don't access
>>> freed up memory by using the scheduler fences even more as wrapper
>>> around the hardware fence as we do now. But this quite a change and
>>> requires a bit more than just hacking around
>>> drm_sched_fence_get_timeline_name().
>> I really think this needs to be documented if nothing else out of this thread.
>>
>> Clearly nobody is going to get it right and hidden here in this
>> thread, this info isn't useful.
>>
>> Can we have some sort of design document for the dma-fence/scheduler
>> interactions written and we can try and refine it with solutions on
>> the list, because I'm tired of people proposing things and NAK's
>> getting thrown around without anything to point people at.
>>
>> The next NAK I see on the list will mean I block all patches from the
>> sender until they write a documentation patch, because seriously this
>> stuff is too hard for someone to just keep it in their head and expect
>> everyone else to understand from reading the code.
> I very much like the idea that NAK replies are counted as "you've just
> volunteered yourself for some documentation patches so that next time
> around you can reply with a link to the docs instead of just a NAK".

Yeah, that sounds like a great idea to me as well :)

Especially when I can use it to convince managers that we need to have 
more work force on writing documentation.

> I don't think we'll get out of these discussions otherwise, since
> currently we have undocumented, but very tricky semantics of the
> drm/sched codebase for ringbuffer scheduling which is extended to fw
> scheduling in also very tricky ways, with not entirely clear impacts
> on semantics of all the drm/sched things. And as a result we just pile
> up enormous amounts of threads where I think the only thing assured is
> that people talk past each another.

The scheduler is certainly the ugliest part, but it's unfortunately 
still only the tip of the iceberg.

I have seen at least halve a dozen approach in the last two years where 
people tried to signal a dma_fence from userspace or similar.

Fortunately it was mostly prototyping and I could jump in early enough 
to stop that, but basically this is a fight against windmills.

I was considering to change the dma_fence semantics so that 
dma_fence_signal() could only be called from the interrupt contexts of 
devices and then put a big fat WARN_ON(!in_interrupt()) in there.

It's a sledgehammer, but as far as I can see the only thing which might 
help. Opinions?

Thanks,
Christian.

>
> Converting NAKs into doc patches should at least eventually get rid of
> the worst confusions we're dealing with here.
>
> Cheers, Sima


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name
@ 2023-11-02 10:48                           ` Christian König
  0 siblings, 0 replies; 86+ messages in thread
From: Christian König @ 2023-11-02 10:48 UTC (permalink / raw)
  To: Daniel Vetter, Dave Airlie
  Cc: asahi, Asahi Lina, Luben Tuikov, linux-kernel, dri-devel, alyssa,
	Sumit Semwal, Faith Ekstrand, linux-media

Am 01.11.23 um 09:13 schrieb Daniel Vetter:
> On Wed, 1 Nov 2023 at 07:59, Dave Airlie <airlied@gmail.com> wrote:
>>> Well, to make it clear once more: Signaling a dma_fence from the
>>> destructor of a reference counted object is very problematic! This will
>>> be rejected no matter if you do that in C or in Rust.
>>>
>>> What we can do is to make it safe in the sense that you don't access
>>> freed up memory by using the scheduler fences even more as wrapper
>>> around the hardware fence as we do now. But this quite a change and
>>> requires a bit more than just hacking around
>>> drm_sched_fence_get_timeline_name().
>> I really think this needs to be documented if nothing else out of this thread.
>>
>> Clearly nobody is going to get it right and hidden here in this
>> thread, this info isn't useful.
>>
>> Can we have some sort of design document for the dma-fence/scheduler
>> interactions written and we can try and refine it with solutions on
>> the list, because I'm tired of people proposing things and NAK's
>> getting thrown around without anything to point people at.
>>
>> The next NAK I see on the list will mean I block all patches from the
>> sender until they write a documentation patch, because seriously this
>> stuff is too hard for someone to just keep it in their head and expect
>> everyone else to understand from reading the code.
> I very much like the idea that NAK replies are counted as "you've just
> volunteered yourself for some documentation patches so that next time
> around you can reply with a link to the docs instead of just a NAK".

Yeah, that sounds like a great idea to me as well :)

Especially when I can use it to convince managers that we need to have 
more work force on writing documentation.

> I don't think we'll get out of these discussions otherwise, since
> currently we have undocumented, but very tricky semantics of the
> drm/sched codebase for ringbuffer scheduling which is extended to fw
> scheduling in also very tricky ways, with not entirely clear impacts
> on semantics of all the drm/sched things. And as a result we just pile
> up enormous amounts of threads where I think the only thing assured is
> that people talk past each another.

The scheduler is certainly the ugliest part, but it's unfortunately 
still only the tip of the iceberg.

I have seen at least halve a dozen approach in the last two years where 
people tried to signal a dma_fence from userspace or similar.

Fortunately it was mostly prototyping and I could jump in early enough 
to stop that, but basically this is a fight against windmills.

I was considering to change the dma_fence semantics so that 
dma_fence_signal() could only be called from the interrupt contexts of 
devices and then put a big fat WARN_ON(!in_interrupt()) in there.

It's a sledgehammer, but as far as I can see the only thing which might 
help. Opinions?

Thanks,
Christian.

>
> Converting NAKs into doc patches should at least eventually get rid of
> the worst confusions we're dealing with here.
>
> Cheers, Sima


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name
  2023-11-02 10:48                           ` Christian König
@ 2023-11-02 11:19                             ` Lucas Stach
  -1 siblings, 0 replies; 86+ messages in thread
From: Lucas Stach @ 2023-11-02 11:19 UTC (permalink / raw)
  To: Christian König, Daniel Vetter, Dave Airlie
  Cc: asahi, Asahi Lina, Luben Tuikov, linux-kernel, dri-devel, alyssa,
	Sumit Semwal, Faith Ekstrand, linux-media

Am Donnerstag, dem 02.11.2023 um 11:48 +0100 schrieb Christian König:
[...]
> I was considering to change the dma_fence semantics so that 
> dma_fence_signal() could only be called from the interrupt contexts of 
> devices and then put a big fat WARN_ON(!in_interrupt()) in there.
> 
> It's a sledgehammer, but as far as I can see the only thing which might 
> help. Opinions?

That's not going to fly. As soon as you are dealing with device drivers
that use IRQ threads, either voluntarily or even involuntarily on RT
kernels, the dma_fence_signal will be from process context.

Regards,
Lucas

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name
@ 2023-11-02 11:19                             ` Lucas Stach
  0 siblings, 0 replies; 86+ messages in thread
From: Lucas Stach @ 2023-11-02 11:19 UTC (permalink / raw)
  To: Christian König, Daniel Vetter, Dave Airlie
  Cc: Faith Ekstrand, Luben Tuikov, Asahi Lina, linux-kernel,
	dri-devel, asahi, Sumit Semwal, alyssa, linux-media

Am Donnerstag, dem 02.11.2023 um 11:48 +0100 schrieb Christian König:
[...]
> I was considering to change the dma_fence semantics so that 
> dma_fence_signal() could only be called from the interrupt contexts of 
> devices and then put a big fat WARN_ON(!in_interrupt()) in there.
> 
> It's a sledgehammer, but as far as I can see the only thing which might 
> help. Opinions?

That's not going to fly. As soon as you are dealing with device drivers
that use IRQ threads, either voluntarily or even involuntarily on RT
kernels, the dma_fence_signal will be from process context.

Regards,
Lucas

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name
  2023-11-02 11:19                             ` Lucas Stach
@ 2023-11-02 12:39                               ` Christian König
  -1 siblings, 0 replies; 86+ messages in thread
From: Christian König @ 2023-11-02 12:39 UTC (permalink / raw)
  To: Lucas Stach, Daniel Vetter, Dave Airlie
  Cc: asahi, Asahi Lina, Luben Tuikov, linux-kernel, dri-devel, alyssa,
	Sumit Semwal, Faith Ekstrand, linux-media

Am 02.11.23 um 12:19 schrieb Lucas Stach:
> Am Donnerstag, dem 02.11.2023 um 11:48 +0100 schrieb Christian König:
> [...]
>> I was considering to change the dma_fence semantics so that
>> dma_fence_signal() could only be called from the interrupt contexts of
>> devices and then put a big fat WARN_ON(!in_interrupt()) in there.
>>
>> It's a sledgehammer, but as far as I can see the only thing which might
>> help. Opinions?
> That's not going to fly. As soon as you are dealing with device drivers
> that use IRQ threads, either voluntarily or even involuntarily on RT
> kernels, the dma_fence_signal will be from process context.

Ah shit, yeah of course. We use IRQ threads in amdgpu for the second 
interrupt ring as well.

Ok, nail that coffin. Any other ideas how we could enforce this?

Thanks,
Christian.

>
> Regards,
> Lucas


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name
@ 2023-11-02 12:39                               ` Christian König
  0 siblings, 0 replies; 86+ messages in thread
From: Christian König @ 2023-11-02 12:39 UTC (permalink / raw)
  To: Lucas Stach, Daniel Vetter, Dave Airlie
  Cc: Faith Ekstrand, Luben Tuikov, Asahi Lina, linux-kernel,
	dri-devel, asahi, Sumit Semwal, alyssa, linux-media

Am 02.11.23 um 12:19 schrieb Lucas Stach:
> Am Donnerstag, dem 02.11.2023 um 11:48 +0100 schrieb Christian König:
> [...]
>> I was considering to change the dma_fence semantics so that
>> dma_fence_signal() could only be called from the interrupt contexts of
>> devices and then put a big fat WARN_ON(!in_interrupt()) in there.
>>
>> It's a sledgehammer, but as far as I can see the only thing which might
>> help. Opinions?
> That's not going to fly. As soon as you are dealing with device drivers
> that use IRQ threads, either voluntarily or even involuntarily on RT
> kernels, the dma_fence_signal will be from process context.

Ah shit, yeah of course. We use IRQ threads in amdgpu for the second 
interrupt ring as well.

Ok, nail that coffin. Any other ideas how we could enforce this?

Thanks,
Christian.

>
> Regards,
> Lucas


^ permalink raw reply	[flat|nested] 86+ messages in thread

end of thread, other threads:[~2023-11-02 12:39 UTC | newest]

Thread overview: 86+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-07-14  8:21 [PATCH 0/3] DRM scheduler documentation & bug fixes Asahi Lina
2023-07-14  8:21 ` Asahi Lina
2023-07-14  8:21 ` [PATCH 1/3] drm/scheduler: Add more documentation Asahi Lina
2023-07-14  8:21   ` Asahi Lina
2023-07-14  8:40   ` Christian König
2023-07-14  8:40     ` Christian König
2023-07-14  9:39     ` Asahi Lina
2023-07-14  9:39       ` Asahi Lina
2023-07-14  9:47       ` Christian König
2023-07-14  9:47         ` Christian König
2023-07-14  9:51         ` Asahi Lina
2023-07-14  9:51           ` Asahi Lina
2023-07-14  8:21 ` [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name Asahi Lina
2023-07-14  8:21   ` Asahi Lina
2023-07-14  8:43   ` Christian König
2023-07-14  8:43     ` Christian König
2023-07-14  9:44     ` Asahi Lina
2023-07-14  9:44       ` Asahi Lina
2023-07-14  9:51       ` Christian König
2023-07-14  9:51         ` Christian König
2023-07-14 10:07         ` Asahi Lina
2023-07-14 10:07           ` Asahi Lina
2023-07-14 10:29           ` Christian König
2023-07-14 10:29             ` Christian König
2023-07-14  9:49     ` Asahi Lina
2023-07-14  9:49       ` Asahi Lina
2023-07-14  9:57       ` Christian König
2023-07-14  9:57         ` Christian König
2023-07-14 10:06         ` Asahi Lina
2023-07-14 10:06           ` Asahi Lina
2023-07-14 10:18           ` Christian König
2023-07-14 10:18             ` Christian König
2023-07-14 12:13             ` Asahi Lina
2023-07-14 12:13               ` Asahi Lina
2023-07-15  4:03         ` Luben Tuikov
2023-07-15  4:03           ` Luben Tuikov
2023-07-15 14:14         ` alyssa
2023-07-15 14:14           ` alyssa
2023-07-17 15:55           ` Christian König
2023-07-17 15:55             ` Christian König
2023-07-18  2:35             ` Asahi Lina
2023-07-18  2:35               ` Asahi Lina
2023-07-18  5:45               ` Luben Tuikov
2023-07-18  5:45                 ` Luben Tuikov
2023-07-21 10:33                 ` Asahi Lina
2023-07-21 10:33                   ` Asahi Lina
2023-07-31  8:09                   ` Christian König
2023-07-31  8:09                     ` Christian König
2023-11-01  6:59                     ` Dave Airlie
2023-11-01  6:59                       ` Dave Airlie
2023-11-01  8:13                       ` Daniel Vetter
2023-11-01  8:13                         ` Daniel Vetter
2023-11-02 10:48                         ` Christian König
2023-11-02 10:48                           ` Christian König
2023-11-02 11:19                           ` Lucas Stach
2023-11-02 11:19                             ` Lucas Stach
2023-11-02 12:39                             ` Christian König
2023-11-02 12:39                               ` Christian König
2023-07-28  7:48               ` Christian König
2023-07-28  7:48                 ` Christian König
2023-07-18  8:21             ` Pekka Paalanen
2023-07-18  8:21               ` Pekka Paalanen
2023-07-14  8:21 ` [PATCH 3/3] drm/scheduler: Clean up jobs when the scheduler is torn down Asahi Lina
2023-07-14  8:21   ` Asahi Lina
2023-07-15  7:14   ` Luben Tuikov
2023-07-15  7:14     ` Luben Tuikov
2023-07-16  7:51     ` Asahi Lina
2023-07-16  7:51       ` Asahi Lina
2023-07-17 17:40       ` Luben Tuikov
2023-07-17 17:40         ` Luben Tuikov
2023-07-17 22:45         ` Asahi Lina
2023-07-17 22:45           ` Asahi Lina
2023-07-18  5:14           ` Luben Tuikov
2023-07-18  5:14             ` Luben Tuikov
2023-07-19 18:16           ` Konstantin Ryabitsev
2023-07-19 18:16             ` Konstantin Ryabitsev
2023-07-19 18:58             ` Luben Tuikov
2023-07-19 18:58               ` Luben Tuikov
2023-08-02  4:06         ` Matthew Brost
2023-08-02  4:06           ` Matthew Brost
2023-08-02 14:12           ` Luben Tuikov
2023-08-02 14:12             ` Luben Tuikov
2023-07-19  8:45       ` Christian König
2023-07-19  8:45         ` Christian König
2023-07-19 15:05         ` Luben Tuikov
2023-07-19 15:05           ` Luben Tuikov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.