* [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait
@ 2020-07-15 11:50 Chris Wilson
  2020-07-15 11:50 ` [Intel-gfx] [PATCH 02/66] drm/i915: Remove i915_request.lock requirement for execution callbacks Chris Wilson
                   ` (72 more replies)
  0 siblings, 73 replies; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:50 UTC (permalink / raw)
  To: intel-gfx; +Cc: Matthew Auld, Chris Wilson

Currently, we use i915_request_completed() directly in
i915_request_wait() and follow up with a manual invocation of
dma_fence_signal(). This appears to cause a large amount of contention
on i915_request.lock: when the process is woken up after the fence is
signaled by an interrupt, we then try to call dma_fence_signal()
ourselves while the signaler is still holding the lock.
dma_fence_is_signaled() has the benefit of checking
DMA_FENCE_FLAG_SIGNALED_BIT prior to calling dma_fence_signal() and so
avoids most of that contention.
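
For illustration only, the fast path that dma_fence_is_signaled() gives
us amounts to testing the signaled bit before ever touching fence->lock;
a simplified sketch of that shape (not the driver code itself):

  #include <linux/dma-fence.h>

  /*
   * Simplified sketch of the dma_fence_is_signaled() fast path: the
   * signaled bit is tested first, and fence->lock is only taken if we
   * are the ones transitioning the fence to the signaled state.
   */
  static bool wait_fast_path(struct dma_fence *fence)
  {
          if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags))
                  return true;    /* already signaled, no lock required */

          if (fence->ops->signaled && fence->ops->signaled(fence)) {
                  dma_fence_signal(fence);  /* takes fence->lock once */
                  return true;
          }

          return false;
  }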

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/i915/i915_request.c | 12 ++++--------
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index 0b2fe55e6194..bb4eb1a8780e 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -1640,7 +1640,7 @@ static bool busywait_stop(unsigned long timeout, unsigned int cpu)
 	return this_cpu != cpu;
 }
 
-static bool __i915_spin_request(const struct i915_request * const rq, int state)
+static bool __i915_spin_request(struct i915_request * const rq, int state)
 {
 	unsigned long timeout_ns;
 	unsigned int cpu;
@@ -1673,7 +1673,7 @@ static bool __i915_spin_request(const struct i915_request * const rq, int state)
 	timeout_ns = READ_ONCE(rq->engine->props.max_busywait_duration_ns);
 	timeout_ns += local_clock_ns(&cpu);
 	do {
-		if (i915_request_completed(rq))
+		if (dma_fence_is_signaled(&rq->fence))
 			return true;
 
 		if (signal_pending_state(state, current))
@@ -1766,10 +1766,8 @@ long i915_request_wait(struct i915_request *rq,
 	 * duration, which we currently lack.
 	 */
 	if (IS_ACTIVE(CONFIG_DRM_I915_MAX_REQUEST_BUSYWAIT) &&
-	    __i915_spin_request(rq, state)) {
-		dma_fence_signal(&rq->fence);
+	    __i915_spin_request(rq, state))
 		goto out;
-	}
 
 	/*
 	 * This client is about to stall waiting for the GPU. In many cases
@@ -1796,10 +1794,8 @@ long i915_request_wait(struct i915_request *rq,
 	for (;;) {
 		set_current_state(state);
 
-		if (i915_request_completed(rq)) {
-			dma_fence_signal(&rq->fence);
+		if (dma_fence_is_signaled(&rq->fence))
 			break;
-		}
 
 		intel_engine_flush_submission(rq->engine);
 
-- 
2.20.1

* [Intel-gfx] [PATCH 02/66] drm/i915: Remove i915_request.lock requirement for execution callbacks
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
@ 2020-07-15 11:50 ` Chris Wilson
  2020-07-15 11:50 ` [Intel-gfx] [PATCH 03/66] drm/i915: Remove requirement for holding i915_request.lock for breadcrumbs Chris Wilson
                   ` (71 subsequent siblings)
  72 siblings, 0 replies; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:50 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

We are using the i915_request.lock to serialise adding an execution
callback with __i915_request_submit. However, if we use an atomic
llist_add to serialise multiple waiters and then check to see if the
request is already executing, we can remove the irq-spinlock.
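
As an illustration of the llist pattern relied on here (the names below
are invented for the example, not the driver's): llist_add() reports
whether the list was previously empty, and llist_del_all() atomically
takes ownership of every queued entry, so no spinlock is needed to hand
callbacks from waiters to the submitter.

  #include <linux/llist.h>
  #include <linux/irq_work.h>

  struct cb_example {
          struct irq_work work;           /* initialised by the caller */
          struct llist_node llnode;
  };

  /* Waiter side: queue a callback; if we were the first to queue and the
   * request is already executing, run the notification ourselves. */
  static void queue_cb(struct llist_head *list, struct cb_example *cb,
                       bool already_executing,
                       void (*notify)(struct llist_head *))
  {
          if (llist_add(&cb->llnode, list) && already_executing)
                  notify(list);
  }

  /* Submit side: atomically claim everything queued so far and fire it. */
  static void notify_cbs(struct llist_head *list)
  {
          struct cb_example *cb, *cn;

          llist_for_each_entry_safe(cb, cn, llist_del_all(list), llnode)
                  irq_work_queue(&cb->work);
  }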

Fixes: 1d9221e9d395 ("drm/i915: Skip signaling a signaled request")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/i915/i915_request.c | 43 ++++++++---------------------
 1 file changed, 12 insertions(+), 31 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index bb4eb1a8780e..d13dd013acb4 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -190,13 +190,11 @@ static void __notify_execute_cb(struct i915_request *rq)
 {
 	struct execute_cb *cb, *cn;
 
-	lockdep_assert_held(&rq->lock);
-
-	GEM_BUG_ON(!i915_request_is_active(rq));
 	if (llist_empty(&rq->execute_cb))
 		return;
 
-	llist_for_each_entry_safe(cb, cn, rq->execute_cb.first, work.llnode)
+	llist_for_each_entry_safe(cb, cn,
+				  llist_del_all(&rq->execute_cb), work.llnode)
 		irq_work_queue(&cb->work);
 
 	/*
@@ -209,7 +207,6 @@ static void __notify_execute_cb(struct i915_request *rq)
 	 * preempt-to-idle cycle on the target engine, all the while the
 	 * master execute_cb may refire.
 	 */
-	init_llist_head(&rq->execute_cb);
 }
 
 static inline void
@@ -274,9 +271,11 @@ static void remove_from_engine(struct i915_request *rq)
 		locked = engine;
 	}
 	list_del_init(&rq->sched.link);
+	spin_unlock_irq(&locked->active.lock);
+
 	clear_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
 	clear_bit(I915_FENCE_FLAG_HOLD, &rq->fence.flags);
-	spin_unlock_irq(&locked->active.lock);
+	set_bit(I915_FENCE_FLAG_ACTIVE, &rq->fence.flags);
 }
 
 bool i915_request_retire(struct i915_request *rq)
@@ -288,6 +287,7 @@ bool i915_request_retire(struct i915_request *rq)
 
 	GEM_BUG_ON(!i915_sw_fence_signaled(&rq->submit));
 	trace_i915_request_retire(rq);
+	i915_request_mark_complete(rq);
 
 	/*
 	 * We know the GPU must have read the request to have
@@ -314,7 +314,6 @@ bool i915_request_retire(struct i915_request *rq)
 	remove_from_engine(rq);
 
 	spin_lock_irq(&rq->lock);
-	i915_request_mark_complete(rq);
 	if (!i915_request_signaled(rq))
 		dma_fence_signal_locked(&rq->fence);
 	if (test_bit(DMA_FENCE_FLAG_ENABLE_SIGNAL_BIT, &rq->fence.flags))
@@ -323,12 +322,8 @@ bool i915_request_retire(struct i915_request *rq)
 		GEM_BUG_ON(!atomic_read(&rq->engine->gt->rps.num_waiters));
 		atomic_dec(&rq->engine->gt->rps.num_waiters);
 	}
-	if (!test_bit(I915_FENCE_FLAG_ACTIVE, &rq->fence.flags)) {
-		set_bit(I915_FENCE_FLAG_ACTIVE, &rq->fence.flags);
-		__notify_execute_cb(rq);
-	}
-	GEM_BUG_ON(!llist_empty(&rq->execute_cb));
 	spin_unlock_irq(&rq->lock);
+	__notify_execute_cb(rq);
 
 	remove_from_client(rq);
 	__list_del_entry(&rq->link); /* poison neither prev/next (RCU walks) */
@@ -357,12 +352,6 @@ void i915_request_retire_upto(struct i915_request *rq)
 	} while (i915_request_retire(tmp) && tmp != rq);
 }
 
-static void __llist_add(struct llist_node *node, struct llist_head *head)
-{
-	node->next = head->first;
-	head->first = node;
-}
-
 static struct i915_request * const *
 __engine_active(struct intel_engine_cs *engine)
 {
@@ -439,18 +428,11 @@ __await_execution(struct i915_request *rq,
 		cb->work.func = irq_execute_cb_hook;
 	}
 
-	spin_lock_irq(&signal->lock);
-	if (i915_request_is_active(signal) || __request_in_flight(signal)) {
-		if (hook) {
-			hook(rq, &signal->fence);
-			i915_request_put(signal);
-		}
-		i915_sw_fence_complete(cb->fence);
-		kmem_cache_free(global.slab_execute_cbs, cb);
-	} else {
-		__llist_add(&cb->work.llnode, &signal->execute_cb);
+	if (llist_add(&cb->work.llnode, &signal->execute_cb)) {
+		if (i915_request_is_active(signal) ||
+		    __request_in_flight(signal))
+			__notify_execute_cb(signal);
 	}
-	spin_unlock_irq(&signal->lock);
 
 	return 0;
 }
@@ -565,19 +547,18 @@ bool __i915_request_submit(struct i915_request *request)
 		list_move_tail(&request->sched.link, &engine->active.requests);
 		clear_bit(I915_FENCE_FLAG_PQUEUE, &request->fence.flags);
 	}
+	__notify_execute_cb(request);
 
 	/* We may be recursing from the signal callback of another i915 fence */
 	if (!i915_request_signaled(request)) {
 		spin_lock_nested(&request->lock, SINGLE_DEPTH_NESTING);
 
-		__notify_execute_cb(request);
 		if (test_bit(DMA_FENCE_FLAG_ENABLE_SIGNAL_BIT,
 			     &request->fence.flags) &&
 		    !i915_request_enable_breadcrumb(request))
 			intel_engine_signal_breadcrumbs(engine);
 
 		spin_unlock(&request->lock);
-		GEM_BUG_ON(!llist_empty(&request->execute_cb));
 	}
 
 	return result;
-- 
2.20.1

* [Intel-gfx] [PATCH 03/66] drm/i915: Remove requirement for holding i915_request.lock for breadcrumbs
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
  2020-07-15 11:50 ` [Intel-gfx] [PATCH 02/66] drm/i915: Remove i915_request.lock requirement for execution callbacks Chris Wilson
@ 2020-07-15 11:50 ` Chris Wilson
  2020-07-15 11:50 ` [Intel-gfx] [PATCH 04/66] drm/i915: Add a couple of missing i915_active_fini() Chris Wilson
                   ` (70 subsequent siblings)
  72 siblings, 0 replies; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:50 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

Since the breadcrumb enabling/cancelling itself is serialised by the
breadcrumbs.irq_lock, with a bit of care we can remove the outer
serialisation with i915_request.lock for concurrent
dma_fence_enable_signaling(). This has the important side-effect of
eliminating the nested i915_request.lock within request submission.
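
The subtle part, visible in the i915_request.c hunk below, is taking the
breadcrumbs lock of whatever engine the request currently points at and
then re-reading rq->engine, because a virtual request may migrate between
engines while we wait for the lock. In isolation, and with minimal
invented types, the pattern is:

  #include <linux/spinlock.h>
  #include <linux/compiler.h>

  struct engine_example { spinlock_t lock; };
  struct request_example { struct engine_example *engine; };

  /* Lock the engine the request currently points at; if the request
   * moved while we were acquiring the lock, drop it and retry on the
   * new engine. The caller unlocks the returned engine's lock. */
  static struct engine_example *
  lock_current_engine(struct request_example *rq)
  {
          struct engine_example *e = READ_ONCE(rq->engine);

          spin_lock(&e->lock);
          while (unlikely(e != READ_ONCE(rq->engine))) {
                  spin_unlock(&e->lock);
                  e = READ_ONCE(rq->engine);
                  spin_lock(&e->lock);
          }

          return e;
  }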

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/gt/intel_breadcrumbs.c | 100 +++++++++++---------
 drivers/gpu/drm/i915/gt/intel_lrc.c         |  14 ---
 drivers/gpu/drm/i915/i915_request.c         |  30 ++----
 3 files changed, 63 insertions(+), 81 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_breadcrumbs.c b/drivers/gpu/drm/i915/gt/intel_breadcrumbs.c
index 91786310c114..87fd06d3eb3f 100644
--- a/drivers/gpu/drm/i915/gt/intel_breadcrumbs.c
+++ b/drivers/gpu/drm/i915/gt/intel_breadcrumbs.c
@@ -220,17 +220,17 @@ static void signal_irq_work(struct irq_work *work)
 	}
 }
 
-static bool __intel_breadcrumbs_arm_irq(struct intel_breadcrumbs *b)
+static void __intel_breadcrumbs_arm_irq(struct intel_breadcrumbs *b)
 {
 	struct intel_engine_cs *engine =
 		container_of(b, struct intel_engine_cs, breadcrumbs);
 
 	lockdep_assert_held(&b->irq_lock);
 	if (b->irq_armed)
-		return true;
+		return;
 
 	if (!intel_gt_pm_get_if_awake(engine->gt))
-		return false;
+		return;
 
 	/*
 	 * The breadcrumb irq will be disarmed on the interrupt after the
@@ -250,8 +250,6 @@ static bool __intel_breadcrumbs_arm_irq(struct intel_breadcrumbs *b)
 
 	if (!b->irq_enabled++)
 		irq_enable(engine);
-
-	return true;
 }
 
 void intel_engine_init_breadcrumbs(struct intel_engine_cs *engine)
@@ -310,57 +308,69 @@ void intel_engine_fini_breadcrumbs(struct intel_engine_cs *engine)
 {
 }
 
-bool i915_request_enable_breadcrumb(struct i915_request *rq)
+static void insert_breadcrumb(struct i915_request *rq,
+			      struct intel_breadcrumbs *b)
 {
-	lockdep_assert_held(&rq->lock);
+	struct intel_context *ce = rq->context;
+	struct list_head *pos;
 
-	if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &rq->fence.flags))
-		return true;
+	if (test_bit(I915_FENCE_FLAG_SIGNAL, &rq->fence.flags))
+		return;
 
-	if (test_bit(I915_FENCE_FLAG_ACTIVE, &rq->fence.flags)) {
-		struct intel_breadcrumbs *b = &rq->engine->breadcrumbs;
-		struct intel_context *ce = rq->context;
-		struct list_head *pos;
+	__intel_breadcrumbs_arm_irq(b);
 
-		spin_lock(&b->irq_lock);
+	/*
+	 * We keep the seqno in retirement order, so we can break
+	 * inside intel_engine_signal_breadcrumbs as soon as we've
+	 * passed the last completed request (or seen a request that
+	 * hasn't event started). We could walk the timeline->requests,
+	 * but keeping a separate signalers_list has the advantage of
+	 * hopefully being much smaller than the full list and so
+	 * provides faster iteration and detection when there are no
+	 * more interrupts required for this context.
+	 *
+	 * We typically expect to add new signalers in order, so we
+	 * start looking for our insertion point from the tail of
+	 * the list.
+	 */
+	list_for_each_prev(pos, &ce->signals) {
+		struct i915_request *it =
+			list_entry(pos, typeof(*it), signal_link);
+
+		if (i915_seqno_passed(rq->fence.seqno, it->fence.seqno))
+			break;
+	}
+	list_add(&rq->signal_link, pos);
+	if (pos == &ce->signals) /* catch transitions from empty list */
+		list_move_tail(&ce->signal_link, &b->signalers);
+	GEM_BUG_ON(!check_signal_order(ce, rq));
 
-		if (test_bit(I915_FENCE_FLAG_SIGNAL, &rq->fence.flags))
-			goto unlock;
+	set_bit(I915_FENCE_FLAG_SIGNAL, &rq->fence.flags);
+}
 
-		if (!__intel_breadcrumbs_arm_irq(b))
-			goto unlock;
+bool i915_request_enable_breadcrumb(struct i915_request *rq)
+{
+	struct intel_breadcrumbs *b;
 
-		/*
-		 * We keep the seqno in retirement order, so we can break
-		 * inside intel_engine_signal_breadcrumbs as soon as we've
-		 * passed the last completed request (or seen a request that
-		 * hasn't event started). We could walk the timeline->requests,
-		 * but keeping a separate signalers_list has the advantage of
-		 * hopefully being much smaller than the full list and so
-		 * provides faster iteration and detection when there are no
-		 * more interrupts required for this context.
-		 *
-		 * We typically expect to add new signalers in order, so we
-		 * start looking for our insertion point from the tail of
-		 * the list.
-		 */
-		list_for_each_prev(pos, &ce->signals) {
-			struct i915_request *it =
-				list_entry(pos, typeof(*it), signal_link);
+	if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &rq->fence.flags))
+		return true;
 
-			if (i915_seqno_passed(rq->fence.seqno, it->fence.seqno))
-				break;
-		}
-		list_add(&rq->signal_link, pos);
-		if (pos == &ce->signals) /* catch transitions from empty list */
-			list_move_tail(&ce->signal_link, &b->signalers);
-		GEM_BUG_ON(!check_signal_order(ce, rq));
+	if (!test_bit(I915_FENCE_FLAG_ACTIVE, &rq->fence.flags))
+		return true;
 
-		set_bit(I915_FENCE_FLAG_SIGNAL, &rq->fence.flags);
-unlock:
+	b = &READ_ONCE(rq->engine)->breadcrumbs;
+	spin_lock(&b->irq_lock);
+	while (unlikely(b != &READ_ONCE(rq->engine)->breadcrumbs)) {
 		spin_unlock(&b->irq_lock);
+		b = &READ_ONCE(rq->engine)->breadcrumbs;
+		spin_lock(&b->irq_lock);
 	}
 
+	if (test_bit(I915_FENCE_FLAG_ACTIVE, &rq->fence.flags))
+		insert_breadcrumb(rq, b);
+
+	spin_unlock(&b->irq_lock);
+
 	return !__request_completed(rq);
 }
 
@@ -368,8 +378,6 @@ void i915_request_cancel_breadcrumb(struct i915_request *rq)
 {
 	struct intel_breadcrumbs *b = &rq->engine->breadcrumbs;
 
-	lockdep_assert_held(&rq->lock);
-
 	/*
 	 * We must wait for b->irq_lock so that we know the interrupt handler
 	 * has released its reference to the intel_context and has completed
diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index e0280a672f1d..aa7be7f05f8c 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -1148,20 +1148,6 @@ __unwind_incomplete_requests(struct intel_engine_cs *engine)
 		} else {
 			struct intel_engine_cs *owner = rq->context->engine;
 
-			/*
-			 * Decouple the virtual breadcrumb before moving it
-			 * back to the virtual engine -- we don't want the
-			 * request to complete in the background and try
-			 * and cancel the breadcrumb on the virtual engine
-			 * (instead of the old engine where it is linked)!
-			 */
-			if (test_bit(DMA_FENCE_FLAG_ENABLE_SIGNAL_BIT,
-				     &rq->fence.flags)) {
-				spin_lock_nested(&rq->lock,
-						 SINGLE_DEPTH_NESTING);
-				i915_request_cancel_breadcrumb(rq);
-				spin_unlock(&rq->lock);
-			}
 			WRITE_ONCE(rq->engine, owner);
 			owner->submit_request(rq);
 			active = NULL;
diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index d13dd013acb4..29b5e71307e3 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -318,12 +318,13 @@ bool i915_request_retire(struct i915_request *rq)
 		dma_fence_signal_locked(&rq->fence);
 	if (test_bit(DMA_FENCE_FLAG_ENABLE_SIGNAL_BIT, &rq->fence.flags))
 		i915_request_cancel_breadcrumb(rq);
+	spin_unlock_irq(&rq->lock);
+
+	__notify_execute_cb(rq);
 	if (i915_request_has_waitboost(rq)) {
 		GEM_BUG_ON(!atomic_read(&rq->engine->gt->rps.num_waiters));
 		atomic_dec(&rq->engine->gt->rps.num_waiters);
 	}
-	spin_unlock_irq(&rq->lock);
-	__notify_execute_cb(rq);
 
 	remove_from_client(rq);
 	__list_del_entry(&rq->link); /* poison neither prev/next (RCU walks) */
@@ -549,17 +550,9 @@ bool __i915_request_submit(struct i915_request *request)
 	}
 	__notify_execute_cb(request);
 
-	/* We may be recursing from the signal callback of another i915 fence */
-	if (!i915_request_signaled(request)) {
-		spin_lock_nested(&request->lock, SINGLE_DEPTH_NESTING);
-
-		if (test_bit(DMA_FENCE_FLAG_ENABLE_SIGNAL_BIT,
-			     &request->fence.flags) &&
-		    !i915_request_enable_breadcrumb(request))
-			intel_engine_signal_breadcrumbs(engine);
-
-		spin_unlock(&request->lock);
-	}
+	if (test_bit(DMA_FENCE_FLAG_ENABLE_SIGNAL_BIT, &request->fence.flags) &&
+	    !i915_request_enable_breadcrumb(request))
+		intel_engine_signal_breadcrumbs(engine);
 
 	return result;
 }
@@ -591,16 +584,11 @@ void __i915_request_unsubmit(struct i915_request *request)
 	 * is kept in seqno/ring order.
 	 */
 
-	/* We may be recursing from the signal callback of another i915 fence */
-	spin_lock_nested(&request->lock, SINGLE_DEPTH_NESTING);
-
-	if (test_bit(DMA_FENCE_FLAG_ENABLE_SIGNAL_BIT, &request->fence.flags))
-		i915_request_cancel_breadcrumb(request);
-
 	GEM_BUG_ON(!test_bit(I915_FENCE_FLAG_ACTIVE, &request->fence.flags));
-	clear_bit(I915_FENCE_FLAG_ACTIVE, &request->fence.flags);
+	clear_bit_unlock(I915_FENCE_FLAG_ACTIVE, &request->fence.flags);
 
-	spin_unlock(&request->lock);
+	if (test_bit(I915_FENCE_FLAG_SIGNAL, &request->fence.flags))
+		i915_request_cancel_breadcrumb(request);
 
 	/* We've already spun, don't charge on resubmitting. */
 	if (request->sched.semaphores && i915_request_started(request))
-- 
2.20.1

* [Intel-gfx] [PATCH 04/66] drm/i915: Add a couple of missing i915_active_fini()
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
  2020-07-15 11:50 ` [Intel-gfx] [PATCH 02/66] drm/i915: Remove i915_request.lock requirement for execution callbacks Chris Wilson
  2020-07-15 11:50 ` [Intel-gfx] [PATCH 03/66] drm/i915: Remove requirement for holding i915_request.lock for breadcrumbs Chris Wilson
@ 2020-07-15 11:50 ` Chris Wilson
  2020-07-17 12:00   ` Tvrtko Ursulin
  2020-07-21 12:23   ` Thomas Hellström (Intel)
  2020-07-15 11:50 ` [Intel-gfx] [PATCH 05/66] drm/i915: Skip taking acquire mutex for no ref->active callback Chris Wilson
                   ` (69 subsequent siblings)
  72 siblings, 2 replies; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:50 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

We use i915_active_fini() as a debug check on the i915_active state
before freeing. If we forget to call it, we may end up angering the
debugobjects contained within.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/display/intel_frontbuffer.c    | 2 ++
 drivers/gpu/drm/i915/gt/selftest_engine_heartbeat.c | 5 ++++-
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/display/intel_frontbuffer.c b/drivers/gpu/drm/i915/display/intel_frontbuffer.c
index 2979ed2588eb..d898b370d7a4 100644
--- a/drivers/gpu/drm/i915/display/intel_frontbuffer.c
+++ b/drivers/gpu/drm/i915/display/intel_frontbuffer.c
@@ -232,6 +232,8 @@ static void frontbuffer_release(struct kref *ref)
 	RCU_INIT_POINTER(obj->frontbuffer, NULL);
 	spin_unlock(&to_i915(obj->base.dev)->fb_tracking.lock);
 
+	i915_active_fini(&front->write);
+
 	i915_gem_object_put(obj);
 	kfree_rcu(front, rcu);
 }
diff --git a/drivers/gpu/drm/i915/gt/selftest_engine_heartbeat.c b/drivers/gpu/drm/i915/gt/selftest_engine_heartbeat.c
index 73243ba59c7d..e73854dd2fe0 100644
--- a/drivers/gpu/drm/i915/gt/selftest_engine_heartbeat.c
+++ b/drivers/gpu/drm/i915/gt/selftest_engine_heartbeat.c
@@ -47,7 +47,10 @@ static int pulse_active(struct i915_active *active)
 
 static void pulse_free(struct kref *kref)
 {
-	kfree(container_of(kref, struct pulse, kref));
+	struct pulse *p = container_of(kref, typeof(*p), kref);
+
+	i915_active_fini(&p->active);
+	kfree(p);
 }
 
 static void pulse_put(struct pulse *p)
-- 
2.20.1

* [Intel-gfx] [PATCH 05/66] drm/i915: Skip taking acquire mutex for no ref->active callback
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (2 preceding siblings ...)
  2020-07-15 11:50 ` [Intel-gfx] [PATCH 04/66] drm/i915: Add a couple of missing i915_active_fini() Chris Wilson
@ 2020-07-15 11:50 ` Chris Wilson
  2020-07-17 12:04   ` Tvrtko Ursulin
  2020-07-21 12:32   ` Thomas Hellström (Intel)
  2020-07-15 11:50 ` [Intel-gfx] [PATCH 06/66] drm/i915: Export a preallocate variant of i915_active_acquire() Chris Wilson
                   ` (68 subsequent siblings)
  72 siblings, 2 replies; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:50 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

If no active callback is defined for i915_active, we do not need to
serialise its enabling with the mutex. We still want to perform the debug
activation only once, and must still serialise with a concurrent retire.
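
Reduced to its essentials, the resulting acquire path looks roughly as
below (an illustrative sketch only; the real code also takes
ref->tree_lock around the debug hook on the 0 -> 1 transition):

  #include <linux/atomic.h>
  #include <linux/mutex.h>

  static int acquire_sketch(atomic_t *count, struct mutex *m,
                            int (*activate)(void *data), void *data)
  {
          int err = 0;

          /* Fast path: bump an existing reference without any locking. */
          if (atomic_add_unless(count, 1, 0))
                  return 0;

          /* No activation callback: nothing needs the mutex. */
          if (!activate) {
                  atomic_inc(count);
                  return 0;
          }

          /* Slow path: serialise the first activation with other acquirers. */
          err = mutex_lock_interruptible(m);
          if (err)
                  return err;

          if (!atomic_add_unless(count, 1, 0)) {
                  err = activate(data);
                  if (!err)
                          atomic_inc(count);
          }

          mutex_unlock(m);
          return err;
  }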

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/i915_active.c | 25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_active.c b/drivers/gpu/drm/i915/i915_active.c
index d960d0be5bd2..841b5c30950a 100644
--- a/drivers/gpu/drm/i915/i915_active.c
+++ b/drivers/gpu/drm/i915/i915_active.c
@@ -416,6 +416,14 @@ bool i915_active_acquire_if_busy(struct i915_active *ref)
 	return atomic_add_unless(&ref->count, 1, 0);
 }
 
+static void __i915_active_activate(struct i915_active *ref)
+{
+	spin_lock_irq(&ref->tree_lock); /* __active_retire() */
+	if (!atomic_fetch_inc(&ref->count))
+		debug_active_activate(ref);
+	spin_unlock_irq(&ref->tree_lock);
+}
+
 int i915_active_acquire(struct i915_active *ref)
 {
 	int err;
@@ -423,23 +431,22 @@ int i915_active_acquire(struct i915_active *ref)
 	if (i915_active_acquire_if_busy(ref))
 		return 0;
 
+	if (!ref->active) {
+		__i915_active_activate(ref);
+		return 0;
+	}
+
 	err = mutex_lock_interruptible(&ref->mutex);
 	if (err)
 		return err;
 
 	if (likely(!i915_active_acquire_if_busy(ref))) {
-		if (ref->active)
-			err = ref->active(ref);
-		if (!err) {
-			spin_lock_irq(&ref->tree_lock); /* __active_retire() */
-			debug_active_activate(ref);
-			atomic_inc(&ref->count);
-			spin_unlock_irq(&ref->tree_lock);
-		}
+		err = ref->active(ref);
+		if (!err)
+			__i915_active_activate(ref);
 	}
 
 	mutex_unlock(&ref->mutex);
-
 	return err;
 }
 
-- 
2.20.1

* [Intel-gfx] [PATCH 06/66] drm/i915: Export a preallocate variant of i915_active_acquire()
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (3 preceding siblings ...)
  2020-07-15 11:50 ` [Intel-gfx] [PATCH 05/66] drm/i915: Skip taking acquire mutex for no ref->active callback Chris Wilson
@ 2020-07-15 11:50 ` Chris Wilson
  2020-07-17 12:21   ` Tvrtko Ursulin
  2020-07-21 15:33   ` Thomas Hellström (Intel)
  2020-07-15 11:50 ` [Intel-gfx] [PATCH 07/66] drm/i915: Keep the most recently used active-fence upon discard Chris Wilson
                   ` (67 subsequent siblings)
  72 siblings, 2 replies; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:50 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

Sometimes we have to be very careful not to allocate underneath a mutex
(or spinlock) and yet still want to track activity. Enter
i915_active_acquire_for_context(). This raises the activity counter on
i915_active prior to use and ensures that the fence-tree contains a slot
for the context.
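
A hypothetical caller (the function and lock below are invented for the
example) preallocates the slot for its timeline while it is still allowed
to allocate, and then publishes the fence from inside the atomic section
without touching the allocator:

  #include "i915_active.h"

  /* Sketch only: relies on the interfaces added by this patch. */
  static int track_under_spinlock(struct i915_active *ref, u64 idx,
                                  struct dma_fence *fence, spinlock_t *lock)
  {
          struct dma_fence *prev;
          int err;

          /* May allocate: do it before entering the critical section. */
          err = i915_active_acquire_for_context(ref, idx);
          if (err)
                  return err;

          spin_lock(lock);
          prev = __i915_active_ref(ref, idx, fence); /* no allocation here */
          spin_unlock(lock);

          if (prev)
                  dma_fence_put(prev);

          i915_active_release(ref);
          return 0;
  }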

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c    |   2 +-
 drivers/gpu/drm/i915/gt/intel_timeline.c      |   4 +-
 drivers/gpu/drm/i915/i915_active.c            | 136 +++++++++++++++---
 drivers/gpu/drm/i915/i915_active.h            |  12 +-
 4 files changed, 126 insertions(+), 28 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index 6b4ec66cb558..719ba9fe3e85 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -1729,7 +1729,7 @@ __parser_mark_active(struct i915_vma *vma,
 {
 	struct intel_gt_buffer_pool_node *node = vma->private;
 
-	return i915_active_ref(&node->active, tl, fence);
+	return i915_active_ref(&node->active, tl->fence_context, fence);
 }
 
 static int
diff --git a/drivers/gpu/drm/i915/gt/intel_timeline.c b/drivers/gpu/drm/i915/gt/intel_timeline.c
index 46d20f5f3ddc..acb43aebd669 100644
--- a/drivers/gpu/drm/i915/gt/intel_timeline.c
+++ b/drivers/gpu/drm/i915/gt/intel_timeline.c
@@ -484,7 +484,9 @@ __intel_timeline_get_seqno(struct intel_timeline *tl,
 	 * free it after the current request is retired, which ensures that
 	 * all writes into the cacheline from previous requests are complete.
 	 */
-	err = i915_active_ref(&tl->hwsp_cacheline->active, tl, &rq->fence);
+	err = i915_active_ref(&tl->hwsp_cacheline->active,
+			      tl->fence_context,
+			      &rq->fence);
 	if (err)
 		goto err_cacheline;
 
diff --git a/drivers/gpu/drm/i915/i915_active.c b/drivers/gpu/drm/i915/i915_active.c
index 841b5c30950a..799282fb1bb9 100644
--- a/drivers/gpu/drm/i915/i915_active.c
+++ b/drivers/gpu/drm/i915/i915_active.c
@@ -28,12 +28,14 @@ static struct i915_global_active {
 } global;
 
 struct active_node {
+	struct rb_node node;
 	struct i915_active_fence base;
 	struct i915_active *ref;
-	struct rb_node node;
 	u64 timeline;
 };
 
+#define fetch_node(x) rb_entry(READ_ONCE(x), typeof(struct active_node), node)
+
 static inline struct active_node *
 node_from_active(struct i915_active_fence *active)
 {
@@ -216,12 +218,40 @@ excl_retire(struct dma_fence *fence, struct dma_fence_cb *cb)
 		active_retire(container_of(cb, struct i915_active, excl.cb));
 }
 
+static struct active_node *__active_lookup(struct i915_active *ref, u64 idx)
+{
+	struct active_node *it;
+
+	it = READ_ONCE(ref->cache);
+	if (it && it->timeline == idx)
+		return it;
+
+	BUILD_BUG_ON(offsetof(typeof(*it), node));
+
+	/* While active, the tree can only be built; not destroyed */
+	GEM_BUG_ON(i915_active_is_idle(ref));
+
+	it = fetch_node(ref->tree.rb_node);
+	while (it) {
+		if (it->timeline < idx) {
+			it = fetch_node(it->node.rb_right);
+		} else if (it->timeline > idx) {
+			it = fetch_node(it->node.rb_left);
+		} else {
+			WRITE_ONCE(ref->cache, it);
+			break;
+		}
+	}
+
+	/* NB: If the tree rotated beneath us, we may miss our target. */
+	return it;
+}
+
 static struct i915_active_fence *
-active_instance(struct i915_active *ref, struct intel_timeline *tl)
+active_instance(struct i915_active *ref, u64 idx)
 {
 	struct active_node *node, *prealloc;
 	struct rb_node **p, *parent;
-	u64 idx = tl->fence_context;
 
 	/*
 	 * We track the most recently used timeline to skip a rbtree search
@@ -230,8 +260,8 @@ active_instance(struct i915_active *ref, struct intel_timeline *tl)
 	 * after the previous activity has been retired, or if it matches the
 	 * current timeline.
 	 */
-	node = READ_ONCE(ref->cache);
-	if (node && node->timeline == idx)
+	node = __active_lookup(ref, idx);
+	if (likely(node))
 		return &node->base;
 
 	/* Preallocate a replacement, just in case */
@@ -268,10 +298,9 @@ active_instance(struct i915_active *ref, struct intel_timeline *tl)
 	rb_insert_color(&node->node, &ref->tree);
 
 out:
-	ref->cache = node;
+	WRITE_ONCE(ref->cache, node);
 	spin_unlock_irq(&ref->tree_lock);
 
-	BUILD_BUG_ON(offsetof(typeof(*node), base));
 	return &node->base;
 }
 
@@ -353,21 +382,17 @@ __active_del_barrier(struct i915_active *ref, struct active_node *node)
 	return ____active_del_barrier(ref, node, barrier_to_engine(node));
 }
 
-int i915_active_ref(struct i915_active *ref,
-		    struct intel_timeline *tl,
-		    struct dma_fence *fence)
+int i915_active_ref(struct i915_active *ref, u64 idx, struct dma_fence *fence)
 {
 	struct i915_active_fence *active;
 	int err;
 
-	lockdep_assert_held(&tl->mutex);
-
 	/* Prevent reaping in case we malloc/wait while building the tree */
 	err = i915_active_acquire(ref);
 	if (err)
 		return err;
 
-	active = active_instance(ref, tl);
+	active = active_instance(ref, idx);
 	if (!active) {
 		err = -ENOMEM;
 		goto out;
@@ -384,32 +409,81 @@ int i915_active_ref(struct i915_active *ref,
 		atomic_dec(&ref->count);
 	}
 	if (!__i915_active_fence_set(active, fence))
-		atomic_inc(&ref->count);
+		__i915_active_acquire(ref);
 
 out:
 	i915_active_release(ref);
 	return err;
 }
 
-struct dma_fence *
-i915_active_set_exclusive(struct i915_active *ref, struct dma_fence *f)
+static struct dma_fence *
+__i915_active_set_fence(struct i915_active *ref,
+			struct i915_active_fence *active,
+			struct dma_fence *fence)
 {
 	struct dma_fence *prev;
 
-	/* We expect the caller to manage the exclusive timeline ordering */
-	GEM_BUG_ON(i915_active_is_idle(ref));
+	if (is_barrier(active)) { /* proto-node used by our idle barrier */
+		/*
+		 * This request is on the kernel_context timeline, and so
+		 * we can use it to substitute for the pending idle-barrer
+		 * request that we want to emit on the kernel_context.
+		 */
+		__active_del_barrier(ref, node_from_active(active));
+		RCU_INIT_POINTER(active->fence, fence);
+		return NULL;
+	}
 
 	rcu_read_lock();
-	prev = __i915_active_fence_set(&ref->excl, f);
+	prev = __i915_active_fence_set(active, fence);
 	if (prev)
 		prev = dma_fence_get_rcu(prev);
 	else
-		atomic_inc(&ref->count);
+		__i915_active_acquire(ref);
 	rcu_read_unlock();
 
 	return prev;
 }
 
+static struct i915_active_fence *__active_fence(struct i915_active *ref, u64 idx)
+{
+	struct active_node *it;
+
+	it = __active_lookup(ref, idx);
+	if (unlikely(!it)) { /* Contention with parallel tree builders! */
+		spin_lock_irq(&ref->tree_lock);
+		it = fetch_node(ref->tree.rb_node);
+		while (it) {
+			if (it->timeline < idx) {
+				it = fetch_node(it->node.rb_right);
+			} else if (it->timeline > idx) {
+				it = fetch_node(it->node.rb_left);
+			} else {
+				WRITE_ONCE(ref->cache, it);
+				break;
+			}
+		}
+		spin_unlock_irq(&ref->tree_lock);
+	}
+	GEM_BUG_ON(!it); /* slot must be preallocated */
+
+	return &it->base;
+}
+
+struct dma_fence *
+__i915_active_ref(struct i915_active *ref, u64 idx, struct dma_fence *fence)
+{
+	/* Only valid while active, see i915_active_acquire_for_context() */
+	return __i915_active_set_fence(ref, __active_fence(ref, idx), fence);
+}
+
+struct dma_fence *
+i915_active_set_exclusive(struct i915_active *ref, struct dma_fence *f)
+{
+	/* We expect the caller to manage the exclusive timeline ordering */
+	return __i915_active_set_fence(ref, &ref->excl, f);
+}
+
 bool i915_active_acquire_if_busy(struct i915_active *ref)
 {
 	debug_active_assert(ref);
@@ -450,6 +524,24 @@ int i915_active_acquire(struct i915_active *ref)
 	return err;
 }
 
+int i915_active_acquire_for_context(struct i915_active *ref, u64 idx)
+{
+	struct i915_active_fence *active;
+	int err;
+
+	err = i915_active_acquire(ref);
+	if (err)
+		return err;
+
+	active = active_instance(ref, idx);
+	if (!active) {
+		i915_active_release(ref);
+		return -ENOMEM;
+	}
+
+	return 0; /* return with active ref */
+}
+
 void i915_active_release(struct i915_active *ref)
 {
 	debug_active_assert(ref);
@@ -753,7 +845,7 @@ static struct active_node *reuse_idle_barrier(struct i915_active *ref, u64 idx)
 match:
 	rb_erase(p, &ref->tree); /* Hide from waits and sibling allocations */
 	if (p == &ref->cache->node)
-		ref->cache = NULL;
+		WRITE_ONCE(ref->cache, NULL);
 	spin_unlock_irq(&ref->tree_lock);
 
 	return rb_entry(p, struct active_node, node);
@@ -811,7 +903,7 @@ int i915_active_acquire_preallocate_barrier(struct i915_active *ref,
 			 */
 			RCU_INIT_POINTER(node->base.fence, ERR_PTR(-EAGAIN));
 			node->base.cb.node.prev = (void *)engine;
-			atomic_inc(&ref->count);
+			__i915_active_acquire(ref);
 		}
 		GEM_BUG_ON(rcu_access_pointer(node->base.fence) != ERR_PTR(-EAGAIN));
 
diff --git a/drivers/gpu/drm/i915/i915_active.h b/drivers/gpu/drm/i915/i915_active.h
index cf4058150966..73ded3c52a04 100644
--- a/drivers/gpu/drm/i915/i915_active.h
+++ b/drivers/gpu/drm/i915/i915_active.h
@@ -163,14 +163,16 @@ void __i915_active_init(struct i915_active *ref,
 	__i915_active_init(ref, active, retire, &__mkey, &__wkey);	\
 } while (0)
 
-int i915_active_ref(struct i915_active *ref,
-		    struct intel_timeline *tl,
-		    struct dma_fence *fence);
+struct dma_fence *
+__i915_active_ref(struct i915_active *ref, u64 idx, struct dma_fence *fence);
+int i915_active_ref(struct i915_active *ref, u64 idx, struct dma_fence *fence);
 
 static inline int
 i915_active_add_request(struct i915_active *ref, struct i915_request *rq)
 {
-	return i915_active_ref(ref, i915_request_timeline(rq), &rq->fence);
+	return i915_active_ref(ref,
+			       i915_request_timeline(rq)->fence_context,
+			       &rq->fence);
 }
 
 struct dma_fence *
@@ -198,7 +200,9 @@ int i915_request_await_active(struct i915_request *rq,
 #define I915_ACTIVE_AWAIT_BARRIER BIT(2)
 
 int i915_active_acquire(struct i915_active *ref);
+int i915_active_acquire_for_context(struct i915_active *ref, u64 idx);
 bool i915_active_acquire_if_busy(struct i915_active *ref);
+
 void i915_active_release(struct i915_active *ref);
 
 static inline void __i915_active_acquire(struct i915_active *ref)
-- 
2.20.1

* [Intel-gfx] [PATCH 07/66] drm/i915: Keep the most recently used active-fence upon discard
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (4 preceding siblings ...)
  2020-07-15 11:50 ` [Intel-gfx] [PATCH 06/66] drm/i915: Export a preallocate variant of i915_active_acquire() Chris Wilson
@ 2020-07-15 11:50 ` Chris Wilson
  2020-07-17 12:38   ` Tvrtko Ursulin
  2020-07-22  9:46   ` Thomas Hellström (Intel)
  2020-07-15 11:50 ` [Intel-gfx] [PATCH 08/66] drm/i915: Make the stale cached active node available for any timeline Chris Wilson
                   ` (66 subsequent siblings)
  72 siblings, 2 replies; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:50 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

Whenever an i915_active idles, we prune its tree of old fence slots to
prevent a gradual leak should it be used to track many, many timelines.
The downside is that we then have to frequently reallocate the rbtree.
A compromise is that we keep the most recently used fence slot, and
reuse that for the next active reference as that is the most likely
timeline to be reused.
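
Reduced to the rbtree manipulation alone, keeping a single node across the
teardown looks roughly like the sketch below (illustrative only, not the
driver code):

  #include <linux/rbtree.h>

  /* Detach everything except @keep from @tree, returning the detached
   * nodes as a root for the caller to free; @tree is left containing
   * only @keep so the next user of that key avoids an allocation. */
  static struct rb_root keep_only(struct rb_root *tree, struct rb_node *keep)
  {
          struct rb_root discard;

          rb_erase(keep, tree);
          discard = *tree;

          *tree = RB_ROOT;
          rb_link_node(keep, NULL, &tree->rb_node);
          rb_insert_color(keep, tree);

          return discard;
  }

The caller then walks the returned root and frees each node, which is what
the rbtree_postorder_for_each_entry_safe() loop in the hunk below does.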

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/i915_active.c | 27 ++++++++++++++++++++-------
 drivers/gpu/drm/i915/i915_active.h |  4 ----
 2 files changed, 20 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_active.c b/drivers/gpu/drm/i915/i915_active.c
index 799282fb1bb9..0854b1552bc1 100644
--- a/drivers/gpu/drm/i915/i915_active.c
+++ b/drivers/gpu/drm/i915/i915_active.c
@@ -130,8 +130,8 @@ static inline void debug_active_assert(struct i915_active *ref) { }
 static void
 __active_retire(struct i915_active *ref)
 {
+	struct rb_root root = RB_ROOT;
 	struct active_node *it, *n;
-	struct rb_root root;
 	unsigned long flags;
 
 	GEM_BUG_ON(i915_active_is_idle(ref));
@@ -143,9 +143,21 @@ __active_retire(struct i915_active *ref)
 	GEM_BUG_ON(rcu_access_pointer(ref->excl.fence));
 	debug_active_deactivate(ref);
 
-	root = ref->tree;
-	ref->tree = RB_ROOT;
-	ref->cache = NULL;
+	/* Even if we have not used the cache, we may still have a barrier */
+	if (!ref->cache)
+		ref->cache = fetch_node(ref->tree.rb_node);
+
+	/* Keep the MRU cached node for reuse */
+	if (ref->cache) {
+		/* Discard all other nodes in the tree */
+		rb_erase(&ref->cache->node, &ref->tree);
+		root = ref->tree;
+
+		/* Rebuild the tree with only the cached node */
+		rb_link_node(&ref->cache->node, NULL, &ref->tree.rb_node);
+		rb_insert_color(&ref->cache->node, &ref->tree);
+		GEM_BUG_ON(ref->tree.rb_node != &ref->cache->node);
+	}
 
 	spin_unlock_irqrestore(&ref->tree_lock, flags);
 
@@ -156,6 +168,7 @@ __active_retire(struct i915_active *ref)
 	/* ... except if you wait on it, you must manage your own references! */
 	wake_up_var(ref);
 
+	/* Finally free the discarded timeline tree  */
 	rbtree_postorder_for_each_entry_safe(it, n, &root, node) {
 		GEM_BUG_ON(i915_active_fence_isset(&it->base));
 		kmem_cache_free(global.slab_cache, it);
@@ -750,16 +763,16 @@ int i915_sw_fence_await_active(struct i915_sw_fence *fence,
 	return await_active(ref, flags, sw_await_fence, fence, fence);
 }
 
-#if IS_ENABLED(CONFIG_DRM_I915_DEBUG_GEM)
 void i915_active_fini(struct i915_active *ref)
 {
 	debug_active_fini(ref);
 	GEM_BUG_ON(atomic_read(&ref->count));
 	GEM_BUG_ON(work_pending(&ref->work));
-	GEM_BUG_ON(!RB_EMPTY_ROOT(&ref->tree));
 	mutex_destroy(&ref->mutex);
+
+	if (ref->cache)
+		kmem_cache_free(global.slab_cache, ref->cache);
 }
-#endif
 
 static inline bool is_idle_barrier(struct active_node *node, u64 idx)
 {
diff --git a/drivers/gpu/drm/i915/i915_active.h b/drivers/gpu/drm/i915/i915_active.h
index 73ded3c52a04..b9e0394e2975 100644
--- a/drivers/gpu/drm/i915/i915_active.h
+++ b/drivers/gpu/drm/i915/i915_active.h
@@ -217,11 +217,7 @@ i915_active_is_idle(const struct i915_active *ref)
 	return !atomic_read(&ref->count);
 }
 
-#if IS_ENABLED(CONFIG_DRM_I915_DEBUG_GEM)
 void i915_active_fini(struct i915_active *ref);
-#else
-static inline void i915_active_fini(struct i915_active *ref) { }
-#endif
 
 int i915_active_acquire_preallocate_barrier(struct i915_active *ref,
 					    struct intel_engine_cs *engine);
-- 
2.20.1

* [Intel-gfx] [PATCH 08/66] drm/i915: Make the stale cached active node available for any timeline
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (5 preceding siblings ...)
  2020-07-15 11:50 ` [Intel-gfx] [PATCH 07/66] drm/i915: Keep the most recently used active-fence upon discard Chris Wilson
@ 2020-07-15 11:50 ` Chris Wilson
  2020-07-17 13:04   ` Tvrtko Ursulin
  2020-07-22 11:19   ` Thomas Hellström (Intel)
  2020-07-15 11:50 ` [Intel-gfx] [PATCH 09/66] drm/i915: Provide a fastpath for waiting on vma bindings Chris Wilson
                   ` (65 subsequent siblings)
  72 siblings, 2 replies; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:50 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

Rather than require the next timeline after idling to match the MRU
before idling, reset the index on the node and allow it to match the
first request. However, this requires cmpxchg(u64) and so is not trivial
on 32b, so for compatibility we just fall back to keeping the cached node
pointing to the MRU timeline.
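
For illustration, claiming the recycled cache slot reduces to a single
cmpxchg() on the 64-bit timeline index, with 0 reserved to mean "unowned"
(a sketch only; on 32-bit builds the driver keeps the MRU behaviour
instead):

  #include <linux/atomic.h>
  #include <linux/compiler.h>

  static bool claim_cached_slot(u64 *timeline, u64 idx)
  {
          u64 cached = READ_ONCE(*timeline);

          if (cached == idx)
                  return true;    /* slot already belongs to us */

          /* First timeline to race for the zeroed slot wins it. */
          return !cached && cmpxchg(timeline, 0, idx) == 0;
  }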

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/i915_active.c | 21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_active.c b/drivers/gpu/drm/i915/i915_active.c
index 0854b1552bc1..6737b5615c0c 100644
--- a/drivers/gpu/drm/i915/i915_active.c
+++ b/drivers/gpu/drm/i915/i915_active.c
@@ -157,6 +157,10 @@ __active_retire(struct i915_active *ref)
 		rb_link_node(&ref->cache->node, NULL, &ref->tree.rb_node);
 		rb_insert_color(&ref->cache->node, &ref->tree);
 		GEM_BUG_ON(ref->tree.rb_node != &ref->cache->node);
+
+		/* Make the cached node available for reuse with any timeline */
+		if (IS_ENABLED(CONFIG_64BIT))
+			ref->cache->timeline = 0; /* needs cmpxchg(u64) */
 	}
 
 	spin_unlock_irqrestore(&ref->tree_lock, flags);
@@ -235,9 +239,22 @@ static struct active_node *__active_lookup(struct i915_active *ref, u64 idx)
 {
 	struct active_node *it;
 
+	GEM_BUG_ON(idx == 0); /* 0 is the unordered timeline, rsvd for cache */
+
 	it = READ_ONCE(ref->cache);
-	if (it && it->timeline == idx)
-		return it;
+	if (it) {
+		u64 cached = READ_ONCE(it->timeline);
+
+		if (cached == idx)
+			return it;
+
+#ifdef CONFIG_64BIT /* for cmpxchg(u64) */
+		if (!cached && !cmpxchg(&it->timeline, 0, idx)) {
+			GEM_BUG_ON(i915_active_fence_isset(&it->base));
+			return it;
+		}
+#endif
+	}
 
 	BUILD_BUG_ON(offsetof(typeof(*it), node));
 
-- 
2.20.1

* [Intel-gfx] [PATCH 09/66] drm/i915: Provide a fastpath for waiting on vma bindings
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (6 preceding siblings ...)
  2020-07-15 11:50 ` [Intel-gfx] [PATCH 08/66] drm/i915: Make the stale cached active node available for any timeline Chris Wilson
@ 2020-07-15 11:50 ` Chris Wilson
  2020-07-17 13:23   ` Tvrtko Ursulin
  2020-07-22 15:07   ` Thomas Hellström (Intel)
  2020-07-15 11:50 ` [Intel-gfx] [PATCH 10/66] drm/i915: Soften the tasklet flush frequency before waits Chris Wilson
                   ` (64 subsequent siblings)
  72 siblings, 2 replies; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:50 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

Before we can execute a request, we must wait for all of its vma to be
bound. This is a frequent operation for which we can optimise away a
few atomic operations (notably a cmpxchg) by relying on the RCU
protection instead.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/i915_active.h | 15 +++++++++++++++
 drivers/gpu/drm/i915/i915_vma.c    |  9 +++++++--
 2 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_active.h b/drivers/gpu/drm/i915/i915_active.h
index b9e0394e2975..fb165d3f01cf 100644
--- a/drivers/gpu/drm/i915/i915_active.h
+++ b/drivers/gpu/drm/i915/i915_active.h
@@ -231,4 +231,19 @@ struct i915_active *i915_active_create(void);
 struct i915_active *i915_active_get(struct i915_active *ref);
 void i915_active_put(struct i915_active *ref);
 
+static inline int __i915_request_await_exclusive(struct i915_request *rq,
+						 struct i915_active *active)
+{
+	struct dma_fence *fence;
+	int err = 0;
+
+	fence = i915_active_fence_get(&active->excl);
+	if (fence) {
+		err = i915_request_await_dma_fence(rq, fence);
+		dma_fence_put(fence);
+	}
+
+	return err;
+}
+
 #endif /* _I915_ACTIVE_H_ */
diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c
index bc64f773dcdb..cd12047c7791 100644
--- a/drivers/gpu/drm/i915/i915_vma.c
+++ b/drivers/gpu/drm/i915/i915_vma.c
@@ -1167,6 +1167,12 @@ void i915_vma_revoke_mmap(struct i915_vma *vma)
 		list_del(&vma->obj->userfault_link);
 }
 
+static int
+__i915_request_await_bind(struct i915_request *rq, struct i915_vma *vma)
+{
+	return __i915_request_await_exclusive(rq, &vma->active);
+}
+
 int __i915_vma_move_to_active(struct i915_vma *vma, struct i915_request *rq)
 {
 	int err;
@@ -1174,8 +1180,7 @@ int __i915_vma_move_to_active(struct i915_vma *vma, struct i915_request *rq)
 	GEM_BUG_ON(!i915_vma_is_pinned(vma));
 
 	/* Wait for the vma to be bound before we start! */
-	err = i915_request_await_active(rq, &vma->active,
-					I915_ACTIVE_AWAIT_EXCL);
+	err = __i915_request_await_bind(rq, vma);
 	if (err)
 		return err;
 
-- 
2.20.1

* [Intel-gfx] [PATCH 10/66] drm/i915: Soften the tasklet flush frequency before waits
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (7 preceding siblings ...)
  2020-07-15 11:50 ` [Intel-gfx] [PATCH 09/66] drm/i915: Provide a fastpath for waiting on vma bindings Chris Wilson
@ 2020-07-15 11:50 ` Chris Wilson
  2020-07-16 14:23   ` Mika Kuoppala
  2020-07-22 15:10   ` Thomas Hellström (Intel)
  2020-07-15 11:50 ` [Intel-gfx] [PATCH 11/66] drm/i915: Preallocate stashes for vma page-directories Chris Wilson
                   ` (63 subsequent siblings)
  72 siblings, 2 replies; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:50 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

We include a tasklet flush before waiting on a request as a precaution
against the HW being lax in event signaling. We now have a precautionary
flush in the engine's heartbeat and so do not need to be quite so
zealous on every request wait. If we focus on the request, the only
tasklet flush that matters is one that would hasten submitting this
request to the HW; if the request is not yet ready to be executed,
nothing is gained by running the tasklet, and there is little point in
doing busy work for no result.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_request.c | 20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index 29b5e71307e3..f58beff5e859 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -1760,14 +1760,30 @@ long i915_request_wait(struct i915_request *rq,
 	if (dma_fence_add_callback(&rq->fence, &wait.cb, request_wait_wake))
 		goto out;
 
+	/*
+	 * Flush the submission tasklet, but only if it may help this request.
+	 *
+	 * We sometimes experience some latency between the HW interrupts and
+	 * tasklet execution (mostly due to ksoftirqd latency, but it can also
+	 * be due to lazy CS events), so lets run the tasklet manually if there
+	 * is a chance it may submit this request. If the request is not ready
+	 * to run, as it is waiting for other fences to be signaled, flushing
+	 * the tasklet is busy work without any advantage for this client.
+	 *
+	 * If the HW is being lazy, this is the last chance before we go to
+	 * sleep to catch any pending events. We will check periodically in
+	 * the heartbeat to flush the submission tasklets as a last resort
+	 * for unhappy HW.
+	 */
+	if (i915_request_is_ready(rq))
+		intel_engine_flush_submission(rq->engine);
+
 	for (;;) {
 		set_current_state(state);
 
 		if (dma_fence_is_signaled(&rq->fence))
 			break;
 
-		intel_engine_flush_submission(rq->engine);
-
 		if (signal_pending_state(state, current)) {
 			timeout = -ERESTARTSYS;
 			break;
-- 
2.20.1

* [Intel-gfx] [PATCH 11/66] drm/i915: Preallocate stashes for vma page-directories
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (8 preceding siblings ...)
  2020-07-15 11:50 ` [Intel-gfx] [PATCH 10/66] drm/i915: Soften the tasklet flush frequency before waits Chris Wilson
@ 2020-07-15 11:50 ` Chris Wilson
  2020-07-20 10:35   ` Matthew Auld
                     ` (2 more replies)
  2020-07-15 11:50 ` [Intel-gfx] [PATCH 12/66] drm/i915: Switch to object allocations for page directories Chris Wilson
                   ` (62 subsequent siblings)
  72 siblings, 3 replies; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:50 UTC (permalink / raw)
  To: intel-gfx; +Cc: Matthew Auld, Chris Wilson

We need to perform the DMA allocations used for page directories up front
so that we can include those allocations in our
memory reservation pass. The downside is that we have to assume the
worst case, even before we know the final layout, and always allocate
enough page directories for this object, even when there will be overlap.
This unfortunately can be quite expensive, especially as we have to
clear/reset the page directories and DMA pages, but it should only be
required during early phases of a workload when new objects are being
discovered, or after memory/eviction pressure when we need to rebind.
Once we reach steady state, the objects should not be moved and we no
longer need to preallocate the page tables.

It should be noted that the lifetime for the page directories DMA is
more or less decoupled from individual fences as they will be shared
across objects across timelines.

v2: Only allocate enough PD space for the PTE we may use, we do not need
to allocate PD that will be left as scratch.
v3: Store the shift into the first PD level to encapsulate the different
PTE counts for gen6/gen8.
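
Conceptually the stash is just a pile of preallocated nodes chained
through a pointer: filled while we may still sleep, drained later without
calling the allocator. A rough sketch with invented types (not the
driver's i915_vm_pt_stash):

  /* Illustrative types only. */
  struct pt_example {
          struct pt_example *stash;       /* next free entry in the stash */
  };

  struct pt_stash_example {
          struct pt_example *free;
  };

  /* Fill phase: allocate the worst-case count up front (may sleep). */
  static void stash_push(struct pt_stash_example *stash, struct pt_example *pt)
  {
          pt->stash = stash->free;
          stash->free = pt;
  }

  /* Use phase: pop under the vm lock without touching the allocator. */
  static struct pt_example *stash_pop(struct pt_stash_example *stash)
  {
          struct pt_example *pt = stash->free;

          if (pt)
                  stash->free = pt->stash;
          return pt;
  }

In the patch the equivalent fields are pt->stash and stash->pt[], with one
chain per page-directory level.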

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Matthew Auld <matthew.auld@intel.com>
---
 .../gpu/drm/i915/gem/i915_gem_client_blt.c    | 11 +--
 drivers/gpu/drm/i915/gt/gen6_ppgtt.c          | 40 ++++-----
 drivers/gpu/drm/i915/gt/gen8_ppgtt.c          | 78 +++++------------
 drivers/gpu/drm/i915/gt/intel_ggtt.c          | 60 ++++++--------
 drivers/gpu/drm/i915/gt/intel_gtt.h           | 46 ++++++----
 drivers/gpu/drm/i915/gt/intel_ppgtt.c         | 83 ++++++++++++++++---
 drivers/gpu/drm/i915/i915_vma.c               | 27 +++---
 drivers/gpu/drm/i915/selftests/i915_gem_gtt.c | 60 ++++++++------
 drivers/gpu/drm/i915/selftests/mock_gtt.c     | 22 ++---
 9 files changed, 237 insertions(+), 190 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_client_blt.c b/drivers/gpu/drm/i915/gem/i915_gem_client_blt.c
index 278664f831e7..947c8aa8e13e 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_client_blt.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_client_blt.c
@@ -32,12 +32,13 @@ static void vma_clear_pages(struct i915_vma *vma)
 	vma->pages = NULL;
 }
 
-static int vma_bind(struct i915_address_space *vm,
-		    struct i915_vma *vma,
-		    enum i915_cache_level cache_level,
-		    u32 flags)
+static void vma_bind(struct i915_address_space *vm,
+		     struct i915_vm_pt_stash *stash,
+		     struct i915_vma *vma,
+		     enum i915_cache_level cache_level,
+		     u32 flags)
 {
-	return vm->vma_ops.bind_vma(vm, vma, cache_level, flags);
+	vm->vma_ops.bind_vma(vm, stash, vma, cache_level, flags);
 }
 
 static void vma_unbind(struct i915_address_space *vm, struct i915_vma *vma)
diff --git a/drivers/gpu/drm/i915/gt/gen6_ppgtt.c b/drivers/gpu/drm/i915/gt/gen6_ppgtt.c
index cdc0b9c54305..ee2e149454cb 100644
--- a/drivers/gpu/drm/i915/gt/gen6_ppgtt.c
+++ b/drivers/gpu/drm/i915/gt/gen6_ppgtt.c
@@ -177,16 +177,16 @@ static void gen6_flush_pd(struct gen6_ppgtt *ppgtt, u64 start, u64 end)
 	mutex_unlock(&ppgtt->flush);
 }
 
-static int gen6_alloc_va_range(struct i915_address_space *vm,
-			       u64 start, u64 length)
+static void gen6_alloc_va_range(struct i915_address_space *vm,
+				struct i915_vm_pt_stash *stash,
+				u64 start, u64 length)
 {
 	struct gen6_ppgtt *ppgtt = to_gen6_ppgtt(i915_vm_to_ppgtt(vm));
 	struct i915_page_directory * const pd = ppgtt->base.pd;
-	struct i915_page_table *pt, *alloc = NULL;
+	struct i915_page_table *pt;
 	bool flush = false;
 	u64 from = start;
 	unsigned int pde;
-	int ret = 0;
 
 	spin_lock(&pd->lock);
 	gen6_for_each_pde(pt, pd, start, length, pde) {
@@ -195,21 +195,17 @@ static int gen6_alloc_va_range(struct i915_address_space *vm,
 		if (px_base(pt) == px_base(&vm->scratch[1])) {
 			spin_unlock(&pd->lock);
 
-			pt = fetch_and_zero(&alloc);
-			if (!pt)
-				pt = alloc_pt(vm);
-			if (IS_ERR(pt)) {
-				ret = PTR_ERR(pt);
-				goto unwind_out;
-			}
+			pt = stash->pt[0];
+			GEM_BUG_ON(!pt);
 
 			fill32_px(pt, vm->scratch[0].encode);
 
 			spin_lock(&pd->lock);
 			if (pd->entry[pde] == &vm->scratch[1]) {
+				stash->pt[0] = pt->stash;
+				atomic_set(&pt->used, 0);
 				pd->entry[pde] = pt;
 			} else {
-				alloc = pt;
 				pt = pd->entry[pde];
 			}
 
@@ -226,15 +222,6 @@ static int gen6_alloc_va_range(struct i915_address_space *vm,
 		with_intel_runtime_pm(&vm->i915->runtime_pm, wakeref)
 			gen6_flush_pd(ppgtt, from, start);
 	}
-
-	goto out;
-
-unwind_out:
-	gen6_ppgtt_clear_range(vm, from, start - from);
-out:
-	if (alloc)
-		free_px(vm, alloc);
-	return ret;
 }
 
 static int gen6_ppgtt_init_scratch(struct gen6_ppgtt *ppgtt)
@@ -302,10 +289,11 @@ static void pd_vma_clear_pages(struct i915_vma *vma)
 	vma->pages = NULL;
 }
 
-static int pd_vma_bind(struct i915_address_space *vm,
-		       struct i915_vma *vma,
-		       enum i915_cache_level cache_level,
-		       u32 unused)
+static void pd_vma_bind(struct i915_address_space *vm,
+			struct i915_vm_pt_stash *stash,
+			struct i915_vma *vma,
+			enum i915_cache_level cache_level,
+			u32 unused)
 {
 	struct i915_ggtt *ggtt = i915_vm_to_ggtt(vm);
 	struct gen6_ppgtt *ppgtt = vma->private;
@@ -315,7 +303,6 @@ static int pd_vma_bind(struct i915_address_space *vm,
 	ppgtt->pd_addr = (gen6_pte_t __iomem *)ggtt->gsm + ggtt_offset;
 
 	gen6_flush_pd(ppgtt, 0, ppgtt->base.vm.total);
-	return 0;
 }
 
 static void pd_vma_unbind(struct i915_address_space *vm, struct i915_vma *vma)
@@ -448,6 +435,7 @@ struct i915_ppgtt *gen6_ppgtt_create(struct intel_gt *gt)
 	mutex_init(&ppgtt->pin_mutex);
 
 	ppgtt_init(&ppgtt->base, gt);
+	ppgtt->base.vm.pd_shift = 22;
 	ppgtt->base.vm.top = 1;
 
 	ppgtt->base.vm.bind_async_flags = I915_VMA_LOCAL_BIND;
diff --git a/drivers/gpu/drm/i915/gt/gen8_ppgtt.c b/drivers/gpu/drm/i915/gt/gen8_ppgtt.c
index 699125928272..c3cadc70dae2 100644
--- a/drivers/gpu/drm/i915/gt/gen8_ppgtt.c
+++ b/drivers/gpu/drm/i915/gt/gen8_ppgtt.c
@@ -269,14 +269,12 @@ static void gen8_ppgtt_clear(struct i915_address_space *vm,
 			   start, start + length, vm->top);
 }
 
-static int __gen8_ppgtt_alloc(struct i915_address_space * const vm,
-			      struct i915_page_directory * const pd,
-			      u64 * const start, const u64 end, int lvl)
+static void __gen8_ppgtt_alloc(struct i915_address_space * const vm,
+			       struct i915_vm_pt_stash *stash,
+			       struct i915_page_directory * const pd,
+			       u64 * const start, const u64 end, int lvl)
 {
-	const struct i915_page_scratch * const scratch = &vm->scratch[lvl];
-	struct i915_page_table *alloc = NULL;
 	unsigned int idx, len;
-	int ret = 0;
 
 	GEM_BUG_ON(end > vm->total >> GEN8_PTE_SHIFT);
 
@@ -297,49 +295,30 @@ static int __gen8_ppgtt_alloc(struct i915_address_space * const vm,
 			DBG("%s(%p):{ lvl:%d, idx:%d } allocating new tree\n",
 			    __func__, vm, lvl + 1, idx);
 
-			pt = fetch_and_zero(&alloc);
-			if (lvl) {
-				if (!pt) {
-					pt = &alloc_pd(vm)->pt;
-					if (IS_ERR(pt)) {
-						ret = PTR_ERR(pt);
-						goto out;
-					}
-				}
+			pt = stash->pt[!!lvl];
+			GEM_BUG_ON(!pt);
 
+			if (lvl ||
+			    gen8_pt_count(*start, end) < I915_PDES ||
+			    intel_vgpu_active(vm->i915))
 				fill_px(pt, vm->scratch[lvl].encode);
-			} else {
-				if (!pt) {
-					pt = alloc_pt(vm);
-					if (IS_ERR(pt)) {
-						ret = PTR_ERR(pt);
-						goto out;
-					}
-				}
-
-				if (intel_vgpu_active(vm->i915) ||
-				    gen8_pt_count(*start, end) < I915_PDES)
-					fill_px(pt, vm->scratch[lvl].encode);
-			}
 
 			spin_lock(&pd->lock);
-			if (likely(!pd->entry[idx]))
+			if (likely(!pd->entry[idx])) {
+				stash->pt[!!lvl] = pt->stash;
+				atomic_set(&pt->used, 0);
 				set_pd_entry(pd, idx, pt);
-			else
-				alloc = pt, pt = pd->entry[idx];
+			} else {
+				pt = pd->entry[idx];
+			}
 		}
 
 		if (lvl) {
 			atomic_inc(&pt->used);
 			spin_unlock(&pd->lock);
 
-			ret = __gen8_ppgtt_alloc(vm, as_pd(pt),
-						 start, end, lvl);
-			if (unlikely(ret)) {
-				if (release_pd_entry(pd, idx, pt, scratch))
-					free_px(vm, pt);
-				goto out;
-			}
+			__gen8_ppgtt_alloc(vm, stash,
+					   as_pd(pt), start, end, lvl);
 
 			spin_lock(&pd->lock);
 			atomic_dec(&pt->used);
@@ -359,18 +338,12 @@ static int __gen8_ppgtt_alloc(struct i915_address_space * const vm,
 		}
 	} while (idx++, --len);
 	spin_unlock(&pd->lock);
-out:
-	if (alloc)
-		free_px(vm, alloc);
-	return ret;
 }
 
-static int gen8_ppgtt_alloc(struct i915_address_space *vm,
-			    u64 start, u64 length)
+static void gen8_ppgtt_alloc(struct i915_address_space *vm,
+			     struct i915_vm_pt_stash *stash,
+			     u64 start, u64 length)
 {
-	u64 from;
-	int err;
-
 	GEM_BUG_ON(!IS_ALIGNED(start, BIT_ULL(GEN8_PTE_SHIFT)));
 	GEM_BUG_ON(!IS_ALIGNED(length, BIT_ULL(GEN8_PTE_SHIFT)));
 	GEM_BUG_ON(range_overflows(start, length, vm->total));
@@ -378,15 +351,9 @@ static int gen8_ppgtt_alloc(struct i915_address_space *vm,
 	start >>= GEN8_PTE_SHIFT;
 	length >>= GEN8_PTE_SHIFT;
 	GEM_BUG_ON(length == 0);
-	from = start;
-
-	err = __gen8_ppgtt_alloc(vm, i915_vm_to_ppgtt(vm)->pd,
-				 &start, start + length, vm->top);
-	if (unlikely(err && from != start))
-		__gen8_ppgtt_clear(vm, i915_vm_to_ppgtt(vm)->pd,
-				   from, start, vm->top);
 
-	return err;
+	__gen8_ppgtt_alloc(vm, stash, i915_vm_to_ppgtt(vm)->pd,
+			   &start, start + length, vm->top);
 }
 
 static __always_inline void
@@ -703,6 +670,7 @@ struct i915_ppgtt *gen8_ppgtt_create(struct intel_gt *gt)
 
 	ppgtt_init(ppgtt, gt);
 	ppgtt->vm.top = i915_vm_is_4lvl(&ppgtt->vm) ? 3 : 2;
+	ppgtt->vm.pd_shift = 21;
 
 	/*
 	 * From bdw, there is hw support for read-only pages in the PPGTT.
diff --git a/drivers/gpu/drm/i915/gt/intel_ggtt.c b/drivers/gpu/drm/i915/gt/intel_ggtt.c
index 62979ea591f0..5a33056ab976 100644
--- a/drivers/gpu/drm/i915/gt/intel_ggtt.c
+++ b/drivers/gpu/drm/i915/gt/intel_ggtt.c
@@ -436,16 +436,17 @@ static void i915_ggtt_clear_range(struct i915_address_space *vm,
 	intel_gtt_clear_range(start >> PAGE_SHIFT, length >> PAGE_SHIFT);
 }
 
-static int ggtt_bind_vma(struct i915_address_space *vm,
-			 struct i915_vma *vma,
-			 enum i915_cache_level cache_level,
-			 u32 flags)
+static void ggtt_bind_vma(struct i915_address_space *vm,
+			  struct i915_vm_pt_stash *stash,
+			  struct i915_vma *vma,
+			  enum i915_cache_level cache_level,
+			  u32 flags)
 {
 	struct drm_i915_gem_object *obj = vma->obj;
 	u32 pte_flags;
 
 	if (i915_vma_is_bound(vma, ~flags & I915_VMA_BIND_MASK))
-		return 0;
+		return;
 
 	/* Applicable to VLV (gen8+ do not support RO in the GGTT) */
 	pte_flags = 0;
@@ -454,8 +455,6 @@ static int ggtt_bind_vma(struct i915_address_space *vm,
 
 	vm->insert_entries(vm, vma, cache_level, pte_flags);
 	vma->page_sizes.gtt = I915_GTT_PAGE_SIZE;
-
-	return 0;
 }
 
 static void ggtt_unbind_vma(struct i915_address_space *vm, struct i915_vma *vma)
@@ -568,31 +567,25 @@ static int init_ggtt(struct i915_ggtt *ggtt)
 	return ret;
 }
 
-static int aliasing_gtt_bind_vma(struct i915_address_space *vm,
-				 struct i915_vma *vma,
-				 enum i915_cache_level cache_level,
-				 u32 flags)
+static void aliasing_gtt_bind_vma(struct i915_address_space *vm,
+				  struct i915_vm_pt_stash *stash,
+				  struct i915_vma *vma,
+				  enum i915_cache_level cache_level,
+				  u32 flags)
 {
 	u32 pte_flags;
-	int ret;
 
 	/* Currently applicable only to VLV */
 	pte_flags = 0;
 	if (i915_gem_object_is_readonly(vma->obj))
 		pte_flags |= PTE_READ_ONLY;
 
-	if (flags & I915_VMA_LOCAL_BIND) {
-		struct i915_ppgtt *alias = i915_vm_to_ggtt(vm)->alias;
-
-		ret = ppgtt_bind_vma(&alias->vm, vma, cache_level, flags);
-		if (ret)
-			return ret;
-	}
+	if (flags & I915_VMA_LOCAL_BIND)
+		ppgtt_bind_vma(&i915_vm_to_ggtt(vm)->alias->vm,
+			       stash, vma, cache_level, flags);
 
 	if (flags & I915_VMA_GLOBAL_BIND)
 		vm->insert_entries(vm, vma, cache_level, pte_flags);
-
-	return 0;
 }
 
 static void aliasing_gtt_unbind_vma(struct i915_address_space *vm,
@@ -607,6 +600,7 @@ static void aliasing_gtt_unbind_vma(struct i915_address_space *vm,
 
 static int init_aliasing_ppgtt(struct i915_ggtt *ggtt)
 {
+	struct i915_vm_pt_stash stash = {};
 	struct i915_ppgtt *ppgtt;
 	int err;
 
@@ -619,15 +613,17 @@ static int init_aliasing_ppgtt(struct i915_ggtt *ggtt)
 		goto err_ppgtt;
 	}
 
+	err = i915_vm_alloc_pt_stash(&ppgtt->vm, &stash, ggtt->vm.total);
+	if (err)
+		goto err_ppgtt;
+
 	/*
 	 * Note we only pre-allocate as far as the end of the global
 	 * GTT. On 48b / 4-level page-tables, the difference is very,
 	 * very significant! We have to preallocate as GVT/vgpu does
 	 * not like the page directory disappearing.
 	 */
-	err = ppgtt->vm.allocate_va_range(&ppgtt->vm, 0, ggtt->vm.total);
-	if (err)
-		goto err_ppgtt;
+	ppgtt->vm.allocate_va_range(&ppgtt->vm, &stash, 0, ggtt->vm.total);
 
 	ggtt->alias = ppgtt;
 	ggtt->vm.bind_async_flags |= ppgtt->vm.bind_async_flags;
@@ -638,6 +634,7 @@ static int init_aliasing_ppgtt(struct i915_ggtt *ggtt)
 	GEM_BUG_ON(ggtt->vm.vma_ops.unbind_vma != ggtt_unbind_vma);
 	ggtt->vm.vma_ops.unbind_vma = aliasing_gtt_unbind_vma;
 
+	i915_vm_free_pt_stash(&ppgtt->vm, &stash);
 	return 0;
 
 err_ppgtt:
@@ -1165,11 +1162,6 @@ void i915_ggtt_disable_guc(struct i915_ggtt *ggtt)
 	ggtt->invalidate(ggtt);
 }
 
-static unsigned int clear_bind(struct i915_vma *vma)
-{
-	return atomic_fetch_and(~I915_VMA_BIND_MASK, &vma->flags);
-}
-
 void i915_ggtt_resume(struct i915_ggtt *ggtt)
 {
 	struct i915_vma *vma;
@@ -1187,11 +1179,13 @@ void i915_ggtt_resume(struct i915_ggtt *ggtt)
 	/* clflush objects bound into the GGTT and rebind them. */
 	list_for_each_entry(vma, &ggtt->vm.bound_list, vm_link) {
 		struct drm_i915_gem_object *obj = vma->obj;
-		unsigned int was_bound = clear_bind(vma);
+		unsigned int was_bound =
+			atomic_read(&vma->flags) & I915_VMA_BIND_MASK;
 
-		WARN_ON(i915_vma_bind(vma,
-				      obj ? obj->cache_level : 0,
-				      was_bound, NULL));
+		GEM_BUG_ON(!was_bound);
+		vma->ops->bind_vma(&ggtt->vm, NULL, vma,
+				   obj ? obj->cache_level : 0,
+				   was_bound);
 		if (obj) { /* only used during resume => exclusive access */
 			flush |= fetch_and_zero(&obj->write_domain);
 			obj->read_domains |= I915_GEM_DOMAIN_GTT;
diff --git a/drivers/gpu/drm/i915/gt/intel_gtt.h b/drivers/gpu/drm/i915/gt/intel_gtt.h
index f2b75078e05f..0d9f29aea6b4 100644
--- a/drivers/gpu/drm/i915/gt/intel_gtt.h
+++ b/drivers/gpu/drm/i915/gt/intel_gtt.h
@@ -159,7 +159,10 @@ struct i915_page_scratch {
 
 struct i915_page_table {
 	struct i915_page_dma base;
-	atomic_t used;
+	union {
+		atomic_t used;
+		struct i915_page_table *stash;
+	};
 };
 
 struct i915_page_directory {
@@ -196,12 +199,18 @@ struct drm_i915_gem_object;
 struct i915_vma;
 struct intel_gt;
 
+struct i915_vm_pt_stash {
+	/* preallocated chains of page tables/directories */
+	struct i915_page_table *pt[2];
+};
+
 struct i915_vma_ops {
 	/* Map an object into an address space with the given cache flags. */
-	int (*bind_vma)(struct i915_address_space *vm,
-			struct i915_vma *vma,
-			enum i915_cache_level cache_level,
-			u32 flags);
+	void (*bind_vma)(struct i915_address_space *vm,
+			 struct i915_vm_pt_stash *stash,
+			 struct i915_vma *vma,
+			 enum i915_cache_level cache_level,
+			 u32 flags);
 	/*
 	 * Unmap an object from an address space. This usually consists of
 	 * setting the valid PTE entries to a reserved scratch page.
@@ -257,9 +266,6 @@ struct i915_address_space {
 #define VM_CLASS_PPGTT 1
 
 	struct i915_page_scratch scratch[4];
-	unsigned int scratch_order;
-	unsigned int top;
-
 	/**
 	 * List of vma currently bound.
 	 */
@@ -276,13 +282,18 @@ struct i915_address_space {
 	/* Some systems support read-only mappings for GGTT and/or PPGTT */
 	bool has_read_only:1;
 
+	u8 top;
+	u8 pd_shift;
+	u8 scratch_order;
+
 	u64 (*pte_encode)(dma_addr_t addr,
 			  enum i915_cache_level level,
 			  u32 flags); /* Create a valid PTE */
 #define PTE_READ_ONLY	BIT(0)
 
-	int (*allocate_va_range)(struct i915_address_space *vm,
-				 u64 start, u64 length);
+	void (*allocate_va_range)(struct i915_address_space *vm,
+				  struct i915_vm_pt_stash *stash,
+				  u64 start, u64 length);
 	void (*clear_range)(struct i915_address_space *vm,
 			    u64 start, u64 length);
 	void (*insert_page)(struct i915_address_space *vm,
@@ -568,10 +579,11 @@ int ggtt_set_pages(struct i915_vma *vma);
 int ppgtt_set_pages(struct i915_vma *vma);
 void clear_pages(struct i915_vma *vma);
 
-int ppgtt_bind_vma(struct i915_address_space *vm,
-		   struct i915_vma *vma,
-		   enum i915_cache_level cache_level,
-		   u32 flags);
+void ppgtt_bind_vma(struct i915_address_space *vm,
+		    struct i915_vm_pt_stash *stash,
+		    struct i915_vma *vma,
+		    enum i915_cache_level cache_level,
+		    u32 flags);
 void ppgtt_unbind_vma(struct i915_address_space *vm,
 		      struct i915_vma *vma);
 
@@ -579,6 +591,12 @@ void gtt_write_workarounds(struct intel_gt *gt);
 
 void setup_private_pat(struct intel_uncore *uncore);
 
+int i915_vm_alloc_pt_stash(struct i915_address_space *vm,
+			   struct i915_vm_pt_stash *stash,
+			   u64 size);
+void i915_vm_free_pt_stash(struct i915_address_space *vm,
+			   struct i915_vm_pt_stash *stash);
+
 static inline struct sgt_dma {
 	struct scatterlist *sg;
 	dma_addr_t dma, max;
diff --git a/drivers/gpu/drm/i915/gt/intel_ppgtt.c b/drivers/gpu/drm/i915/gt/intel_ppgtt.c
index f0862e924d11..7c3f50948829 100644
--- a/drivers/gpu/drm/i915/gt/intel_ppgtt.c
+++ b/drivers/gpu/drm/i915/gt/intel_ppgtt.c
@@ -155,19 +155,16 @@ struct i915_ppgtt *i915_ppgtt_create(struct intel_gt *gt)
 	return ppgtt;
 }
 
-int ppgtt_bind_vma(struct i915_address_space *vm,
-		   struct i915_vma *vma,
-		   enum i915_cache_level cache_level,
-		   u32 flags)
+void ppgtt_bind_vma(struct i915_address_space *vm,
+		    struct i915_vm_pt_stash *stash,
+		    struct i915_vma *vma,
+		    enum i915_cache_level cache_level,
+		    u32 flags)
 {
 	u32 pte_flags;
-	int err;
 
 	if (!test_bit(I915_VMA_ALLOC_BIT, __i915_vma_flags(vma))) {
-		err = vm->allocate_va_range(vm, vma->node.start, vma->size);
-		if (err)
-			return err;
-
+		vm->allocate_va_range(vm, stash, vma->node.start, vma->size);
 		set_bit(I915_VMA_ALLOC_BIT, __i915_vma_flags(vma));
 	}
 
@@ -178,8 +175,6 @@ int ppgtt_bind_vma(struct i915_address_space *vm,
 
 	vm->insert_entries(vm, vma, cache_level, pte_flags);
 	wmb();
-
-	return 0;
 }
 
 void ppgtt_unbind_vma(struct i915_address_space *vm, struct i915_vma *vma)
@@ -188,12 +183,76 @@ void ppgtt_unbind_vma(struct i915_address_space *vm, struct i915_vma *vma)
 		vm->clear_range(vm, vma->node.start, vma->size);
 }
 
+static unsigned long pd_count(u64 size, int shift)
+{
+	/* Beware later misalignment */
+	return (size + 2 * (BIT_ULL(shift) - 1)) >> shift;
+}
+
+int i915_vm_alloc_pt_stash(struct i915_address_space *vm,
+			   struct i915_vm_pt_stash *stash,
+			   u64 size)
+{
+	unsigned long count;
+	int shift, n;
+
+	shift = vm->pd_shift;
+	if (!shift)
+		return 0;
+
+	count = pd_count(size, shift);
+	while (count--) {
+		struct i915_page_table *pt;
+
+		pt = alloc_pt(vm);
+		if (IS_ERR(pt)) {
+			i915_vm_free_pt_stash(vm, stash);
+			return PTR_ERR(pt);
+		}
+
+		pt->stash = stash->pt[0];
+		stash->pt[0] = pt;
+	}
+
+	for (n = 1; n < vm->top; n++) {
+		shift += 9; /* Each PD holds 512 entries */
+		count = pd_count(size, shift);
+		while (count--) {
+			struct i915_page_directory *pd;
+
+			pd = alloc_pd(vm);
+			if (IS_ERR(pd)) {
+				i915_vm_free_pt_stash(vm, stash);
+				return PTR_ERR(pd);
+			}
+
+			pd->pt.stash = stash->pt[1];
+			stash->pt[1] = &pd->pt;
+		}
+	}
+
+	return 0;
+}
+
+void i915_vm_free_pt_stash(struct i915_address_space *vm,
+			   struct i915_vm_pt_stash *stash)
+{
+	struct i915_page_table *pt;
+	int n;
+
+	for (n = 0; n < ARRAY_SIZE(stash->pt); n++) {
+		while ((pt = stash->pt[n])) {
+			stash->pt[n] = pt->stash;
+			free_px(vm, pt);
+		}
+	}
+}
+
 int ppgtt_set_pages(struct i915_vma *vma)
 {
 	GEM_BUG_ON(vma->pages);
 
 	vma->pages = vma->obj->mm.pages;
-
 	vma->page_sizes = vma->obj->mm.page_sizes;
 
 	return 0;
diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c
index cd12047c7791..a9e79b67035e 100644
--- a/drivers/gpu/drm/i915/i915_vma.c
+++ b/drivers/gpu/drm/i915/i915_vma.c
@@ -291,6 +291,8 @@ i915_vma_instance(struct drm_i915_gem_object *obj,
 
 struct i915_vma_work {
 	struct dma_fence_work base;
+	struct i915_address_space *vm;
+	struct i915_vm_pt_stash stash;
 	struct i915_vma *vma;
 	struct drm_i915_gem_object *pinned;
 	struct i915_sw_dma_fence_cb cb;
@@ -302,13 +304,10 @@ static int __vma_bind(struct dma_fence_work *work)
 {
 	struct i915_vma_work *vw = container_of(work, typeof(*vw), base);
 	struct i915_vma *vma = vw->vma;
-	int err;
-
-	err = vma->ops->bind_vma(vma->vm, vma, vw->cache_level, vw->flags);
-	if (err)
-		atomic_or(I915_VMA_ERROR, &vma->flags);
 
-	return err;
+	vma->ops->bind_vma(vw->vm, &vw->stash,
+			   vma, vw->cache_level, vw->flags);
+	return 0;
 }
 
 static void __vma_release(struct dma_fence_work *work)
@@ -317,6 +316,9 @@ static void __vma_release(struct dma_fence_work *work)
 
 	if (vw->pinned)
 		__i915_gem_object_unpin_pages(vw->pinned);
+
+	i915_vm_free_pt_stash(vw->vm, &vw->stash);
+	i915_vm_put(vw->vm);
 }
 
 static const struct dma_fence_work_ops bind_ops = {
@@ -376,7 +378,6 @@ int i915_vma_bind(struct i915_vma *vma,
 {
 	u32 bind_flags;
 	u32 vma_flags;
-	int ret;
 
 	GEM_BUG_ON(!drm_mm_node_allocated(&vma->node));
 	GEM_BUG_ON(vma->size > vma->node.size);
@@ -433,9 +434,7 @@ int i915_vma_bind(struct i915_vma *vma,
 			work->pinned = vma->obj;
 		}
 	} else {
-		ret = vma->ops->bind_vma(vma->vm, vma, cache_level, bind_flags);
-		if (ret)
-			return ret;
+		vma->ops->bind_vma(vma->vm, NULL, vma, cache_level, bind_flags);
 	}
 
 	atomic_or(bind_flags, &vma->flags);
@@ -879,6 +878,14 @@ int i915_vma_pin(struct i915_vma *vma, u64 size, u64 alignment, u64 flags)
 			err = -ENOMEM;
 			goto err_pages;
 		}
+
+		work->vm = i915_vm_get(vma->vm);
+
+		/* Allocate enough page directories to cover the used PTEs */
+		if (vma->vm->allocate_va_range)
+			i915_vm_alloc_pt_stash(vma->vm,
+					       &work->stash,
+					       vma->size);
 	}
 
 	if (flags & PIN_GLOBAL)
diff --git a/drivers/gpu/drm/i915/selftests/i915_gem_gtt.c b/drivers/gpu/drm/i915/selftests/i915_gem_gtt.c
index 0016ffc7d914..9b8fc990e9ef 100644
--- a/drivers/gpu/drm/i915/selftests/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/selftests/i915_gem_gtt.c
@@ -172,35 +172,33 @@ static int igt_ppgtt_alloc(void *arg)
 
 	/* Check we can allocate the entire range */
 	for (size = 4096; size <= limit; size <<= 2) {
-		err = ppgtt->vm.allocate_va_range(&ppgtt->vm, 0, size);
-		if (err) {
-			if (err == -ENOMEM) {
-				pr_info("[1] Ran out of memory for va_range [0 + %llx] [bit %d]\n",
-					size, ilog2(size));
-				err = 0; /* virtual space too large! */
-			}
+		struct i915_vm_pt_stash stash = {};
+
+		err = i915_vm_alloc_pt_stash(&ppgtt->vm, &stash, size);
+		if (err)
 			goto err_ppgtt_cleanup;
-		}
 
+		ppgtt->vm.allocate_va_range(&ppgtt->vm, &stash, 0, size);
 		cond_resched();
 
 		ppgtt->vm.clear_range(&ppgtt->vm, 0, size);
+
+		i915_vm_free_pt_stash(&ppgtt->vm, &stash);
 	}
 
 	/* Check we can incrementally allocate the entire range */
 	for (last = 0, size = 4096; size <= limit; last = size, size <<= 2) {
-		err = ppgtt->vm.allocate_va_range(&ppgtt->vm,
-						  last, size - last);
-		if (err) {
-			if (err == -ENOMEM) {
-				pr_info("[2] Ran out of memory for va_range [%llx + %llx] [bit %d]\n",
-					last, size - last, ilog2(size));
-				err = 0; /* virtual space too large! */
-			}
+		struct i915_vm_pt_stash stash = {};
+
+		err = i915_vm_alloc_pt_stash(&ppgtt->vm, &stash, size - last);
+		if (err)
 			goto err_ppgtt_cleanup;
-		}
 
+		ppgtt->vm.allocate_va_range(&ppgtt->vm, &stash,
+					    last, size - last);
 		cond_resched();
+
+		i915_vm_free_pt_stash(&ppgtt->vm, &stash);
 	}
 
 err_ppgtt_cleanup:
@@ -284,9 +282,18 @@ static int lowlevel_hole(struct i915_address_space *vm,
 				break;
 			}
 
-			if (vm->allocate_va_range &&
-			    vm->allocate_va_range(vm, addr, BIT_ULL(size)))
-				break;
+			if (vm->allocate_va_range) {
+				struct i915_vm_pt_stash stash = {};
+
+				if (i915_vm_alloc_pt_stash(vm, &stash,
+							   BIT_ULL(size)))
+					break;
+
+				vm->allocate_va_range(vm, &stash,
+						      addr, BIT_ULL(size));
+
+				i915_vm_free_pt_stash(vm, &stash);
+			}
 
 			mock_vma->pages = obj->mm.pages;
 			mock_vma->node.size = BIT_ULL(size);
@@ -1881,6 +1888,7 @@ static int igt_cs_tlb(void *arg)
 			continue;
 
 		while (!__igt_timeout(end_time, NULL)) {
+			struct i915_vm_pt_stash stash = {};
 			struct i915_request *rq;
 			u64 offset;
 
@@ -1888,10 +1896,6 @@ static int igt_cs_tlb(void *arg)
 						   0, vm->total - PAGE_SIZE,
 						   chunk_size, PAGE_SIZE);
 
-			err = vm->allocate_va_range(vm, offset, chunk_size);
-			if (err)
-				goto end;
-
 			memset32(result, STACK_MAGIC, PAGE_SIZE / sizeof(u32));
 
 			vma = i915_vma_instance(bbe, vm, NULL);
@@ -1904,6 +1908,14 @@ static int igt_cs_tlb(void *arg)
 			if (err)
 				goto end;
 
+			err = i915_vm_alloc_pt_stash(vm, &stash, chunk_size);
+			if (err)
+				goto end;
+
+			vm->allocate_va_range(vm, &stash, offset, chunk_size);
+
+			i915_vm_free_pt_stash(vm, &stash);
+
 			/* Prime the TLB with the dummy pages */
 			for (i = 0; i < count; i++) {
 				vma->node.start = offset + i * PAGE_SIZE;
diff --git a/drivers/gpu/drm/i915/selftests/mock_gtt.c b/drivers/gpu/drm/i915/selftests/mock_gtt.c
index b173086411ef..5e4fb0fba34b 100644
--- a/drivers/gpu/drm/i915/selftests/mock_gtt.c
+++ b/drivers/gpu/drm/i915/selftests/mock_gtt.c
@@ -38,14 +38,14 @@ static void mock_insert_entries(struct i915_address_space *vm,
 {
 }
 
-static int mock_bind_ppgtt(struct i915_address_space *vm,
-			   struct i915_vma *vma,
-			   enum i915_cache_level cache_level,
-			   u32 flags)
+static void mock_bind_ppgtt(struct i915_address_space *vm,
+			    struct i915_vm_pt_stash *stash,
+			    struct i915_vma *vma,
+			    enum i915_cache_level cache_level,
+			    u32 flags)
 {
 	GEM_BUG_ON(flags & I915_VMA_GLOBAL_BIND);
 	set_bit(I915_VMA_LOCAL_BIND_BIT, __i915_vma_flags(vma));
-	return 0;
 }
 
 static void mock_unbind_ppgtt(struct i915_address_space *vm,
@@ -74,6 +74,7 @@ struct i915_ppgtt *mock_ppgtt(struct drm_i915_private *i915, const char *name)
 	ppgtt->vm.i915 = i915;
 	ppgtt->vm.total = round_down(U64_MAX, PAGE_SIZE);
 	ppgtt->vm.file = ERR_PTR(-ENODEV);
+	ppgtt->vm.dma = &i915->drm.pdev->dev;
 
 	i915_address_space_init(&ppgtt->vm, VM_CLASS_PPGTT);
 
@@ -90,13 +91,12 @@ struct i915_ppgtt *mock_ppgtt(struct drm_i915_private *i915, const char *name)
 	return ppgtt;
 }
 
-static int mock_bind_ggtt(struct i915_address_space *vm,
-			  struct i915_vma *vma,
-			  enum i915_cache_level cache_level,
-			  u32 flags)
+static void mock_bind_ggtt(struct i915_address_space *vm,
+			   struct i915_vm_pt_stash *stash,
+			   struct i915_vma *vma,
+			   enum i915_cache_level cache_level,
+			   u32 flags)
 {
-	atomic_or(I915_VMA_GLOBAL_BIND | I915_VMA_LOCAL_BIND, &vma->flags);
-	return 0;
 }
 
 static void mock_unbind_ggtt(struct i915_address_space *vm,
-- 
2.20.1
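
For context, the interface change above replaces the old fallible
vm->allocate_va_range() with a reserve-then-fill pattern: enough page
tables/directories are stashed up front, and the fill step itself can no
longer fail. A minimal caller sketch based on the selftest hunks above
(vm, start and size are placeholders):

	struct i915_vm_pt_stash stash = {};
	int err;

	/* Preallocate every page table/directory the range could need */
	err = i915_vm_alloc_pt_stash(vm, &stash, size);
	if (err)
		return err;

	/* Infallible: pulls preallocated entries off the stash as needed */
	vm->allocate_va_range(vm, &stash, start, size);

	/* Release whatever was not consumed */
	i915_vm_free_pt_stash(vm, &stash);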


^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [Intel-gfx] [PATCH 12/66] drm/i915: Switch to object allocations for page directories
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (9 preceding siblings ...)
  2020-07-15 11:50 ` [Intel-gfx] [PATCH 11/66] drm/i915: Preallocate stashes for vma page-directories Chris Wilson
@ 2020-07-15 11:50 ` Chris Wilson
  2020-07-20 10:34   ` Matthew Auld
  2020-07-15 11:50 ` [Intel-gfx] [PATCH 13/66] drm/i915/gem: Don't drop the timeline lock during execbuf Chris Wilson
                   ` (61 subsequent siblings)
  72 siblings, 1 reply; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:50 UTC (permalink / raw)
  To: intel-gfx; +Cc: Matthew Auld, Chris Wilson

The GEM object is grossly overweight for the practicality of tracking
large numbers of individual pages, yet it is currently our only
abstraction for tracking DMA allocations. Since those allocations need
to be reserved upfront before an operation, and since we need to break
away from simple system memory, we need to ditch using plain struct page
wrappers.

In the process, we drop the WC mapping as we ended up clflushing
everything anyway due to various issues across a wider range of
platforms. In a future step, though, we will need to drop the
kmap_atomic approach, which suggests pre-mapping all the pages and
keeping them mapped.

v2: Verify that our large scratch page is suitably DMA aligned, and
manually clear the scratch since we are allocating random struct pages.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Matthew Auld <matthew.auld@intel.com>
---
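
A minimal sketch of what the new backing store amounts to for a single
page table (the helper name here is made up for illustration; alloc_pt_dma,
pin_pt_dma and struct i915_page_table are taken from the hunks below):

	/* Illustrative only: allocate and pin one GEM-object-backed PT */
	static int example_alloc_one_pt(struct i915_address_space *vm,
					struct i915_page_table *pt)
	{
		int err;

		/* The backing store is now an internal GEM object */
		pt->base = vm->alloc_pt_dma(vm, I915_GTT_PAGE_SIZE_4K);
		if (IS_ERR(pt->base))
			return PTR_ERR(pt->base);

		/*
		 * Pinning provides the backing pages and the DMA address; it
		 * can fail, which is why the real flow pins the whole stash
		 * up front (see i915_vm_pin_pt_stash() below) before the
		 * unfailing fill step.
		 */
		err = pin_pt_dma(vm, pt->base);
		if (err) {
			i915_gem_object_put(pt->base);
			return err;
		}

		return 0;
	}
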
 .../gpu/drm/i915/gem/i915_gem_object_types.h  |   1 +
 .../gpu/drm/i915/gem/selftests/huge_pages.c   |   2 +-
 .../drm/i915/gem/selftests/i915_gem_context.c |   2 +-
 drivers/gpu/drm/i915/gt/gen6_ppgtt.c          |  53 +--
 drivers/gpu/drm/i915/gt/gen6_ppgtt.h          |   1 +
 drivers/gpu/drm/i915/gt/gen8_ppgtt.c          |  89 ++---
 drivers/gpu/drm/i915/gt/intel_ggtt.c          |  37 ++-
 drivers/gpu/drm/i915/gt/intel_gtt.c           | 303 ++++--------------
 drivers/gpu/drm/i915/gt/intel_gtt.h           |  94 ++----
 drivers/gpu/drm/i915/gt/intel_ppgtt.c         |  42 ++-
 .../gpu/drm/i915/gt/intel_ring_submission.c   |  16 +-
 drivers/gpu/drm/i915/gvt/scheduler.c          |  17 +-
 drivers/gpu/drm/i915/i915_drv.c               |   1 +
 drivers/gpu/drm/i915/i915_drv.h               |   5 -
 drivers/gpu/drm/i915/i915_vma.c               |  18 +-
 drivers/gpu/drm/i915/selftests/i915_gem_gtt.c |  23 ++
 drivers/gpu/drm/i915/selftests/i915_perf.c    |   4 +-
 drivers/gpu/drm/i915/selftests/mock_gtt.c     |   4 +
 18 files changed, 289 insertions(+), 423 deletions(-)
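
One behavioural change visible in the gen8 insert paths below: the
per-PTE write_pte() helper (store, wmb, clflush, wmb for every entry) is
removed, and the insert loops instead issue plain stores and flush each
mapped page of PTEs once. A rough sketch of the resulting pattern (pt,
count, pte_encode and dma_addr[] are placeholders):

	gen8_pte_t *vaddr;
	unsigned int i;

	vaddr = kmap_atomic_px(pt);
	for (i = 0; i < count; i++)
		vaddr[i] = pte_encode | dma_addr[i];	/* plain stores */
	clflush_cache_range(vaddr, PAGE_SIZE);	/* one flush per PTE page */
	kunmap_atomic(vaddr);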

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_object_types.h b/drivers/gpu/drm/i915/gem/i915_gem_object_types.h
index 5335f799b548..d0847d7896f9 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_object_types.h
+++ b/drivers/gpu/drm/i915/gem/i915_gem_object_types.h
@@ -282,6 +282,7 @@ struct drm_i915_gem_object {
 		} userptr;
 
 		unsigned long scratch;
+		u64 encode;
 
 		void *gvt_info;
 	};
diff --git a/drivers/gpu/drm/i915/gem/selftests/huge_pages.c b/drivers/gpu/drm/i915/gem/selftests/huge_pages.c
index 8291ede6902c..e2f3d014acb2 100644
--- a/drivers/gpu/drm/i915/gem/selftests/huge_pages.c
+++ b/drivers/gpu/drm/i915/gem/selftests/huge_pages.c
@@ -393,7 +393,7 @@ static int igt_mock_exhaust_device_supported_pages(void *arg)
 	 */
 
 	for (i = 1; i < BIT(ARRAY_SIZE(page_sizes)); i++) {
-		unsigned int combination = 0;
+		unsigned int combination = SZ_4K; /* Required for ppGTT */
 
 		for (j = 0; j < ARRAY_SIZE(page_sizes); j++) {
 			if (i & BIT(j))
diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c b/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c
index 7ffc3c751432..d176b015353f 100644
--- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c
@@ -1748,7 +1748,7 @@ static int check_scratch_page(struct i915_gem_context *ctx, u32 *out)
 	if (!vm)
 		return -ENODEV;
 
-	page = vm->scratch[0].base.page;
+	page = __px_page(vm->scratch[0]);
 	if (!page) {
 		pr_err("No scratch page!\n");
 		return -EINVAL;
diff --git a/drivers/gpu/drm/i915/gt/gen6_ppgtt.c b/drivers/gpu/drm/i915/gt/gen6_ppgtt.c
index ee2e149454cb..a823d2e3c39c 100644
--- a/drivers/gpu/drm/i915/gt/gen6_ppgtt.c
+++ b/drivers/gpu/drm/i915/gt/gen6_ppgtt.c
@@ -16,8 +16,10 @@ static inline void gen6_write_pde(const struct gen6_ppgtt *ppgtt,
 				  const unsigned int pde,
 				  const struct i915_page_table *pt)
 {
+	dma_addr_t addr = pt ? px_dma(pt) : px_dma(ppgtt->base.vm.scratch[1]);
+
 	/* Caller needs to make sure the write completes if necessary */
-	iowrite32(GEN6_PDE_ADDR_ENCODE(px_dma(pt)) | GEN6_PDE_VALID,
+	iowrite32(GEN6_PDE_ADDR_ENCODE(addr) | GEN6_PDE_VALID,
 		  ppgtt->pd_addr + pde);
 }
 
@@ -79,7 +81,7 @@ static void gen6_ppgtt_clear_range(struct i915_address_space *vm,
 {
 	struct gen6_ppgtt * const ppgtt = to_gen6_ppgtt(i915_vm_to_ppgtt(vm));
 	const unsigned int first_entry = start / I915_GTT_PAGE_SIZE;
-	const gen6_pte_t scratch_pte = vm->scratch[0].encode;
+	const gen6_pte_t scratch_pte = vm->scratch[0]->encode;
 	unsigned int pde = first_entry / GEN6_PTES;
 	unsigned int pte = first_entry % GEN6_PTES;
 	unsigned int num_entries = length / I915_GTT_PAGE_SIZE;
@@ -90,8 +92,6 @@ static void gen6_ppgtt_clear_range(struct i915_address_space *vm,
 		const unsigned int count = min(num_entries, GEN6_PTES - pte);
 		gen6_pte_t *vaddr;
 
-		GEM_BUG_ON(px_base(pt) == px_base(&vm->scratch[1]));
-
 		num_entries -= count;
 
 		GEM_BUG_ON(count > atomic_read(&pt->used));
@@ -127,7 +127,7 @@ static void gen6_ppgtt_insert_entries(struct i915_address_space *vm,
 	struct sgt_dma iter = sgt_dma(vma);
 	gen6_pte_t *vaddr;
 
-	GEM_BUG_ON(pd->entry[act_pt] == &vm->scratch[1]);
+	GEM_BUG_ON(!pd->entry[act_pt]);
 
 	vaddr = kmap_atomic_px(i915_pt_entry(pd, act_pt));
 	do {
@@ -192,16 +192,17 @@ static void gen6_alloc_va_range(struct i915_address_space *vm,
 	gen6_for_each_pde(pt, pd, start, length, pde) {
 		const unsigned int count = gen6_pte_count(start, length);
 
-		if (px_base(pt) == px_base(&vm->scratch[1])) {
+		if (!pt) {
 			spin_unlock(&pd->lock);
 
 			pt = stash->pt[0];
-			GEM_BUG_ON(!pt);
+			__i915_gem_object_pin_pages(pt->base);
+			i915_gem_object_make_unshrinkable(pt->base);
 
-			fill32_px(pt, vm->scratch[0].encode);
+			fill32_px(pt, vm->scratch[0]->encode);
 
 			spin_lock(&pd->lock);
-			if (pd->entry[pde] == &vm->scratch[1]) {
+			if (!pd->entry[pde]) {
 				stash->pt[0] = pt->stash;
 				atomic_set(&pt->used, 0);
 				pd->entry[pde] = pt;
@@ -227,24 +228,27 @@ static void gen6_alloc_va_range(struct i915_address_space *vm,
 static int gen6_ppgtt_init_scratch(struct gen6_ppgtt *ppgtt)
 {
 	struct i915_address_space * const vm = &ppgtt->base.vm;
-	struct i915_page_directory * const pd = ppgtt->base.pd;
 	int ret;
 
-	ret = setup_scratch_page(vm, __GFP_HIGHMEM);
+	ret = setup_scratch_page(vm);
 	if (ret)
 		return ret;
 
-	vm->scratch[0].encode =
-		vm->pte_encode(px_dma(&vm->scratch[0]),
+	vm->scratch[0]->encode =
+		vm->pte_encode(px_dma(vm->scratch[0]),
 			       I915_CACHE_NONE, PTE_READ_ONLY);
 
-	if (unlikely(setup_page_dma(vm, px_base(&vm->scratch[1])))) {
-		cleanup_scratch_page(vm);
-		return -ENOMEM;
+	vm->scratch[1] = vm->alloc_pt_dma(vm, I915_GTT_PAGE_SIZE_4K);
+	if (IS_ERR(vm->scratch[1]))
+		return PTR_ERR(vm->scratch[1]);
+
+	ret = pin_pt_dma(vm, vm->scratch[1]);
+	if (ret) {
+		i915_gem_object_put(vm->scratch[1]);
+		return ret;
 	}
 
-	fill32_px(&vm->scratch[1], vm->scratch[0].encode);
-	memset_p(pd->entry, &vm->scratch[1], I915_PDES);
+	fill32_px(vm->scratch[1], vm->scratch[0]->encode);
 
 	return 0;
 }
@@ -252,13 +256,11 @@ static int gen6_ppgtt_init_scratch(struct gen6_ppgtt *ppgtt)
 static void gen6_ppgtt_free_pd(struct gen6_ppgtt *ppgtt)
 {
 	struct i915_page_directory * const pd = ppgtt->base.pd;
-	struct i915_page_dma * const scratch =
-		px_base(&ppgtt->base.vm.scratch[1]);
 	struct i915_page_table *pt;
 	u32 pde;
 
 	gen6_for_all_pdes(pt, pd, pde)
-		if (px_base(pt) != scratch)
+		if (pt)
 			free_px(&ppgtt->base.vm, pt);
 }
 
@@ -299,7 +301,7 @@ static void pd_vma_bind(struct i915_address_space *vm,
 	struct gen6_ppgtt *ppgtt = vma->private;
 	u32 ggtt_offset = i915_ggtt_offset(vma) / I915_GTT_PAGE_SIZE;
 
-	px_base(ppgtt->base.pd)->ggtt_offset = ggtt_offset * sizeof(gen6_pte_t);
+	ppgtt->pp_dir = ggtt_offset * sizeof(gen6_pte_t) << 10;
 	ppgtt->pd_addr = (gen6_pte_t __iomem *)ggtt->gsm + ggtt_offset;
 
 	gen6_flush_pd(ppgtt, 0, ppgtt->base.vm.total);
@@ -309,8 +311,6 @@ static void pd_vma_unbind(struct i915_address_space *vm, struct i915_vma *vma)
 {
 	struct gen6_ppgtt *ppgtt = vma->private;
 	struct i915_page_directory * const pd = ppgtt->base.pd;
-	struct i915_page_dma * const scratch =
-		px_base(&ppgtt->base.vm.scratch[1]);
 	struct i915_page_table *pt;
 	unsigned int pde;
 
@@ -319,11 +319,11 @@ static void pd_vma_unbind(struct i915_address_space *vm, struct i915_vma *vma)
 
 	/* Free all no longer used page tables */
 	gen6_for_all_pdes(pt, ppgtt->base.pd, pde) {
-		if (px_base(pt) == scratch || atomic_read(&pt->used))
+		if (!pt || atomic_read(&pt->used))
 			continue;
 
 		free_px(&ppgtt->base.vm, pt);
-		pd->entry[pde] = scratch;
+		pd->entry[pde] = NULL;
 	}
 
 	ppgtt->scan_for_unused_pt = false;
@@ -444,6 +444,7 @@ struct i915_ppgtt *gen6_ppgtt_create(struct intel_gt *gt)
 	ppgtt->base.vm.insert_entries = gen6_ppgtt_insert_entries;
 	ppgtt->base.vm.cleanup = gen6_ppgtt_cleanup;
 
+	ppgtt->base.vm.alloc_pt_dma = alloc_pt_dma;
 	ppgtt->base.vm.pte_encode = ggtt->vm.pte_encode;
 
 	ppgtt->base.pd = __alloc_pd(sizeof(*ppgtt->base.pd));
diff --git a/drivers/gpu/drm/i915/gt/gen6_ppgtt.h b/drivers/gpu/drm/i915/gt/gen6_ppgtt.h
index 72e481806c96..7249672e5802 100644
--- a/drivers/gpu/drm/i915/gt/gen6_ppgtt.h
+++ b/drivers/gpu/drm/i915/gt/gen6_ppgtt.h
@@ -14,6 +14,7 @@ struct gen6_ppgtt {
 	struct mutex flush;
 	struct i915_vma *vma;
 	gen6_pte_t __iomem *pd_addr;
+	u32 pp_dir;
 
 	atomic_t pin_count;
 	struct mutex pin_mutex;
diff --git a/drivers/gpu/drm/i915/gt/gen8_ppgtt.c b/drivers/gpu/drm/i915/gt/gen8_ppgtt.c
index c3cadc70dae2..e3afd250cd7f 100644
--- a/drivers/gpu/drm/i915/gt/gen8_ppgtt.c
+++ b/drivers/gpu/drm/i915/gt/gen8_ppgtt.c
@@ -199,7 +199,7 @@ static u64 __gen8_ppgtt_clear(struct i915_address_space * const vm,
 			      struct i915_page_directory * const pd,
 			      u64 start, const u64 end, int lvl)
 {
-	const struct i915_page_scratch * const scratch = &vm->scratch[lvl];
+	const struct drm_i915_gem_object * const scratch = vm->scratch[lvl];
 	unsigned int idx, len;
 
 	GEM_BUG_ON(end > vm->total >> GEN8_PTE_SHIFT);
@@ -239,7 +239,7 @@ static u64 __gen8_ppgtt_clear(struct i915_address_space * const vm,
 
 			vaddr = kmap_atomic_px(pt);
 			memset64(vaddr + gen8_pd_index(start, 0),
-				 vm->scratch[0].encode,
+				 vm->scratch[0]->encode,
 				 count);
 			kunmap_atomic(vaddr);
 
@@ -296,12 +296,13 @@ static void __gen8_ppgtt_alloc(struct i915_address_space * const vm,
 			    __func__, vm, lvl + 1, idx);
 
 			pt = stash->pt[!!lvl];
-			GEM_BUG_ON(!pt);
+			__i915_gem_object_pin_pages(pt->base);
+			i915_gem_object_make_unshrinkable(pt->base);
 
 			if (lvl ||
 			    gen8_pt_count(*start, end) < I915_PDES ||
 			    intel_vgpu_active(vm->i915))
-				fill_px(pt, vm->scratch[lvl].encode);
+				fill_px(pt, vm->scratch[lvl]->encode);
 
 			spin_lock(&pd->lock);
 			if (likely(!pd->entry[idx])) {
@@ -356,16 +357,6 @@ static void gen8_ppgtt_alloc(struct i915_address_space *vm,
 			   &start, start + length, vm->top);
 }
 
-static __always_inline void
-write_pte(gen8_pte_t *pte, const gen8_pte_t val)
-{
-	/* Magic delays? Or can we refine these to flush all in one pass? */
-	*pte = val;
-	wmb(); /* cpu to cache */
-	clflush(pte); /* cache to memory */
-	wmb(); /* visible to all */
-}
-
 static __always_inline u64
 gen8_ppgtt_insert_pte(struct i915_ppgtt *ppgtt,
 		      struct i915_page_directory *pdp,
@@ -382,8 +373,7 @@ gen8_ppgtt_insert_pte(struct i915_ppgtt *ppgtt,
 	vaddr = kmap_atomic_px(i915_pt_entry(pd, gen8_pd_index(idx, 1)));
 	do {
 		GEM_BUG_ON(iter->sg->length < I915_GTT_PAGE_SIZE);
-		write_pte(&vaddr[gen8_pd_index(idx, 0)],
-			  pte_encode | iter->dma);
+		vaddr[gen8_pd_index(idx, 0)] = pte_encode | iter->dma;
 
 		iter->dma += I915_GTT_PAGE_SIZE;
 		if (iter->dma >= iter->max) {
@@ -406,10 +396,12 @@ gen8_ppgtt_insert_pte(struct i915_ppgtt *ppgtt,
 				pd = pdp->entry[gen8_pd_index(idx, 2)];
 			}
 
+			clflush_cache_range(vaddr, PAGE_SIZE);
 			kunmap_atomic(vaddr);
 			vaddr = kmap_atomic_px(i915_pt_entry(pd, gen8_pd_index(idx, 1)));
 		}
 	} while (1);
+	clflush_cache_range(vaddr, PAGE_SIZE);
 	kunmap_atomic(vaddr);
 
 	return idx;
@@ -465,7 +457,7 @@ static void gen8_ppgtt_insert_huge(struct i915_vma *vma,
 
 		do {
 			GEM_BUG_ON(iter->sg->length < page_size);
-			write_pte(&vaddr[index++], encode | iter->dma);
+			vaddr[index++] = encode | iter->dma;
 
 			start += page_size;
 			iter->dma += page_size;
@@ -490,6 +482,7 @@ static void gen8_ppgtt_insert_huge(struct i915_vma *vma,
 			}
 		} while (rem >= page_size && index < I915_PDES);
 
+		clflush_cache_range(vaddr, PAGE_SIZE);
 		kunmap_atomic(vaddr);
 
 		/*
@@ -521,7 +514,7 @@ static void gen8_ppgtt_insert_huge(struct i915_vma *vma,
 			if (I915_SELFTEST_ONLY(vma->vm->scrub_64K)) {
 				u16 i;
 
-				encode = vma->vm->scratch[0].encode;
+				encode = vma->vm->scratch[0]->encode;
 				vaddr = kmap_atomic_px(i915_pt_entry(pd, maybe_64K));
 
 				for (i = 1; i < index; i += 16)
@@ -575,27 +568,37 @@ static int gen8_init_scratch(struct i915_address_space *vm)
 		GEM_BUG_ON(!clone->has_read_only);
 
 		vm->scratch_order = clone->scratch_order;
-		memcpy(vm->scratch, clone->scratch, sizeof(vm->scratch));
-		px_dma(&vm->scratch[0]) = 0; /* no xfer of ownership */
+		for (i = 0; i <= vm->top; i++)
+			vm->scratch[i] = i915_gem_object_get(clone->scratch[i]);
+
 		return 0;
 	}
 
-	ret = setup_scratch_page(vm, __GFP_HIGHMEM);
+	ret = setup_scratch_page(vm);
 	if (ret)
 		return ret;
 
-	vm->scratch[0].encode =
-		gen8_pte_encode(px_dma(&vm->scratch[0]),
+	vm->scratch[0]->encode =
+		gen8_pte_encode(px_dma(vm->scratch[0]),
 				I915_CACHE_LLC, vm->has_read_only);
 
 	for (i = 1; i <= vm->top; i++) {
-		if (unlikely(setup_page_dma(vm, px_base(&vm->scratch[i]))))
+		struct drm_i915_gem_object *obj;
+
+		obj = vm->alloc_pt_dma(vm, I915_GTT_PAGE_SIZE_4K);
+		if (IS_ERR(obj))
 			goto free_scratch;
 
-		fill_px(&vm->scratch[i], vm->scratch[i - 1].encode);
-		vm->scratch[i].encode =
-			gen8_pde_encode(px_dma(&vm->scratch[i]),
-					I915_CACHE_LLC);
+		ret = pin_pt_dma(vm, obj);
+		if (ret) {
+			i915_gem_object_put(obj);
+			goto free_scratch;
+		}
+
+		fill_px(obj, vm->scratch[i - 1]->encode);
+		obj->encode = gen8_pde_encode(px_dma(obj), I915_CACHE_LLC);
+
+		vm->scratch[i] = obj;
 	}
 
 	return 0;
@@ -616,12 +619,20 @@ static int gen8_preallocate_top_level_pdp(struct i915_ppgtt *ppgtt)
 
 	for (idx = 0; idx < GEN8_3LVL_PDPES; idx++) {
 		struct i915_page_directory *pde;
+		int err;
 
 		pde = alloc_pd(vm);
 		if (IS_ERR(pde))
 			return PTR_ERR(pde);
 
-		fill_px(pde, vm->scratch[1].encode);
+		err = pin_pt_dma(vm, pde->pt.base);
+		if (err) {
+			i915_gem_object_put(pde->pt.base);
+			kfree(pde);
+			return err;
+		}
+
+		fill_px(pde, vm->scratch[1]->encode);
 		set_pd_entry(pd, idx, pde);
 		atomic_inc(px_used(pde)); /* keep pinned */
 	}
@@ -635,6 +646,7 @@ gen8_alloc_top_pd(struct i915_address_space *vm)
 {
 	const unsigned int count = gen8_pd_top_count(vm);
 	struct i915_page_directory *pd;
+	int err;
 
 	GEM_BUG_ON(count > ARRAY_SIZE(pd->entry));
 
@@ -642,12 +654,20 @@ gen8_alloc_top_pd(struct i915_address_space *vm)
 	if (unlikely(!pd))
 		return ERR_PTR(-ENOMEM);
 
-	if (unlikely(setup_page_dma(vm, px_base(pd)))) {
+	pd->pt.base = vm->alloc_pt_dma(vm, I915_GTT_PAGE_SIZE_4K);
+	if (IS_ERR(pd->pt.base)) {
 		kfree(pd);
 		return ERR_PTR(-ENOMEM);
 	}
 
-	fill_page_dma(px_base(pd), vm->scratch[vm->top].encode, count);
+	err = pin_pt_dma(vm, pd->pt.base);
+	if (err) {
+		i915_gem_object_put(pd->pt.base);
+		kfree(pd);
+		return ERR_PTR(err);
+	}
+
+	fill_page_dma(px_base(pd), vm->scratch[vm->top]->encode, count);
 	atomic_inc(px_used(pd)); /* mark as pinned */
 	return pd;
 }
@@ -682,12 +702,7 @@ struct i915_ppgtt *gen8_ppgtt_create(struct intel_gt *gt)
 	 */
 	ppgtt->vm.has_read_only = !IS_GEN_RANGE(gt->i915, 11, 12);
 
-	/*
-	 * There are only few exceptions for gen >=6. chv and bxt.
-	 * And we are not sure about the latter so play safe for now.
-	 */
-	if (IS_CHERRYVIEW(gt->i915) || IS_BROXTON(gt->i915))
-		ppgtt->vm.pt_kmap_wc = true;
+	ppgtt->vm.alloc_pt_dma = alloc_pt_dma;
 
 	err = gen8_init_scratch(&ppgtt->vm);
 	if (err)
diff --git a/drivers/gpu/drm/i915/gt/intel_ggtt.c b/drivers/gpu/drm/i915/gt/intel_ggtt.c
index 5a33056ab976..33a3f627ddb1 100644
--- a/drivers/gpu/drm/i915/gt/intel_ggtt.c
+++ b/drivers/gpu/drm/i915/gt/intel_ggtt.c
@@ -78,8 +78,6 @@ int i915_ggtt_init_hw(struct drm_i915_private *i915)
 {
 	int ret;
 
-	stash_init(&i915->mm.wc_stash);
-
 	/*
 	 * Note that we use page colouring to enforce a guard page at the
 	 * end of the address space. This is required as the CS may prefetch
@@ -232,7 +230,7 @@ static void gen8_ggtt_insert_entries(struct i915_address_space *vm,
 
 	/* Fill the allocated but "unused" space beyond the end of the buffer */
 	while (gte < end)
-		gen8_set_pte(gte++, vm->scratch[0].encode);
+		gen8_set_pte(gte++, vm->scratch[0]->encode);
 
 	/*
 	 * We want to flush the TLBs only after we're certain all the PTE
@@ -283,7 +281,7 @@ static void gen6_ggtt_insert_entries(struct i915_address_space *vm,
 
 	/* Fill the allocated but "unused" space beyond the end of the buffer */
 	while (gte < end)
-		iowrite32(vm->scratch[0].encode, gte++);
+		iowrite32(vm->scratch[0]->encode, gte++);
 
 	/*
 	 * We want to flush the TLBs only after we're certain all the PTE
@@ -303,7 +301,7 @@ static void gen8_ggtt_clear_range(struct i915_address_space *vm,
 	struct i915_ggtt *ggtt = i915_vm_to_ggtt(vm);
 	unsigned int first_entry = start / I915_GTT_PAGE_SIZE;
 	unsigned int num_entries = length / I915_GTT_PAGE_SIZE;
-	const gen8_pte_t scratch_pte = vm->scratch[0].encode;
+	const gen8_pte_t scratch_pte = vm->scratch[0]->encode;
 	gen8_pte_t __iomem *gtt_base =
 		(gen8_pte_t __iomem *)ggtt->gsm + first_entry;
 	const int max_entries = ggtt_total_entries(ggtt) - first_entry;
@@ -401,7 +399,7 @@ static void gen6_ggtt_clear_range(struct i915_address_space *vm,
 		 first_entry, num_entries, max_entries))
 		num_entries = max_entries;
 
-	scratch_pte = vm->scratch[0].encode;
+	scratch_pte = vm->scratch[0]->encode;
 	for (i = 0; i < num_entries; i++)
 		iowrite32(scratch_pte, &gtt_base[i]);
 }
@@ -617,6 +615,10 @@ static int init_aliasing_ppgtt(struct i915_ggtt *ggtt)
 	if (err)
 		goto err_ppgtt;
 
+	err = i915_vm_pin_pt_stash(&ppgtt->vm, &stash);
+	if (err)
+		goto err_stash;
+
 	/*
 	 * Note we only pre-allocate as far as the end of the global
 	 * GTT. On 48b / 4-level page-tables, the difference is very,
@@ -637,6 +639,8 @@ static int init_aliasing_ppgtt(struct i915_ggtt *ggtt)
 	i915_vm_free_pt_stash(&ppgtt->vm, &stash);
 	return 0;
 
+err_stash:
+	i915_vm_free_pt_stash(&ppgtt->vm, &stash);
 err_ppgtt:
 	i915_vm_put(&ppgtt->vm);
 	return err;
@@ -712,18 +716,11 @@ static void ggtt_cleanup_hw(struct i915_ggtt *ggtt)
 void i915_ggtt_driver_release(struct drm_i915_private *i915)
 {
 	struct i915_ggtt *ggtt = &i915->ggtt;
-	struct pagevec *pvec;
 
 	fini_aliasing_ppgtt(ggtt);
 
 	intel_ggtt_fini_fences(ggtt);
 	ggtt_cleanup_hw(ggtt);
-
-	pvec = &i915->mm.wc_stash.pvec;
-	if (pvec->nr) {
-		set_pages_array_wb(pvec->pages, pvec->nr);
-		__pagevec_release(pvec);
-	}
 }
 
 static unsigned int gen6_get_total_gtt_size(u16 snb_gmch_ctl)
@@ -786,7 +783,7 @@ static int ggtt_probe_common(struct i915_ggtt *ggtt, u64 size)
 		return -ENOMEM;
 	}
 
-	ret = setup_scratch_page(&ggtt->vm, GFP_DMA32);
+	ret = setup_scratch_page(&ggtt->vm);
 	if (ret) {
 		drm_err(&i915->drm, "Scratch setup failed\n");
 		/* iounmap will also get called at remove, but meh */
@@ -794,8 +791,8 @@ static int ggtt_probe_common(struct i915_ggtt *ggtt, u64 size)
 		return ret;
 	}
 
-	ggtt->vm.scratch[0].encode =
-		ggtt->vm.pte_encode(px_dma(&ggtt->vm.scratch[0]),
+	ggtt->vm.scratch[0]->encode =
+		ggtt->vm.pte_encode(px_dma(ggtt->vm.scratch[0]),
 				    I915_CACHE_NONE, 0);
 
 	return 0;
@@ -821,7 +818,7 @@ static void gen6_gmch_remove(struct i915_address_space *vm)
 	struct i915_ggtt *ggtt = i915_vm_to_ggtt(vm);
 
 	iounmap(ggtt->gsm);
-	cleanup_scratch_page(vm);
+	free_scratch(vm);
 }
 
 static struct resource pci_resource(struct pci_dev *pdev, int bar)
@@ -849,6 +846,8 @@ static int gen8_gmch_probe(struct i915_ggtt *ggtt)
 	else
 		size = gen8_get_total_gtt_size(snb_gmch_ctl);
 
+	ggtt->vm.alloc_pt_dma = alloc_pt_dma;
+
 	ggtt->vm.total = (size / sizeof(gen8_pte_t)) * I915_GTT_PAGE_SIZE;
 	ggtt->vm.cleanup = gen6_gmch_remove;
 	ggtt->vm.insert_page = gen8_ggtt_insert_page;
@@ -997,6 +996,8 @@ static int gen6_gmch_probe(struct i915_ggtt *ggtt)
 	size = gen6_get_total_gtt_size(snb_gmch_ctl);
 	ggtt->vm.total = (size / sizeof(gen6_pte_t)) * I915_GTT_PAGE_SIZE;
 
+	ggtt->vm.alloc_pt_dma = alloc_pt_dma;
+
 	ggtt->vm.clear_range = nop_clear_range;
 	if (!HAS_FULL_PPGTT(i915) || intel_scanout_needs_vtd_wa(i915))
 		ggtt->vm.clear_range = gen6_ggtt_clear_range;
@@ -1047,6 +1048,8 @@ static int i915_gmch_probe(struct i915_ggtt *ggtt)
 	ggtt->gmadr =
 		(struct resource)DEFINE_RES_MEM(gmadr_base, ggtt->mappable_end);
 
+	ggtt->vm.alloc_pt_dma = alloc_pt_dma;
+
 	ggtt->do_idle_maps = needs_idle_maps(i915);
 	ggtt->vm.insert_page = i915_ggtt_insert_page;
 	ggtt->vm.insert_entries = i915_ggtt_insert_entries;
diff --git a/drivers/gpu/drm/i915/gt/intel_gtt.c b/drivers/gpu/drm/i915/gt/intel_gtt.c
index 2a72cce63fd9..795ed81ba358 100644
--- a/drivers/gpu/drm/i915/gt/intel_gtt.c
+++ b/drivers/gpu/drm/i915/gt/intel_gtt.c
@@ -11,160 +11,21 @@
 #include "intel_gt.h"
 #include "intel_gtt.h"
 
-void stash_init(struct pagestash *stash)
+struct drm_i915_gem_object *alloc_pt_dma(struct i915_address_space *vm, int sz)
 {
-	pagevec_init(&stash->pvec);
-	spin_lock_init(&stash->lock);
+	return i915_gem_object_create_internal(vm->i915, sz);
 }
 
-static struct page *stash_pop_page(struct pagestash *stash)
+int pin_pt_dma(struct i915_address_space *vm, struct drm_i915_gem_object *obj)
 {
-	struct page *page = NULL;
+	int err;
 
-	spin_lock(&stash->lock);
-	if (likely(stash->pvec.nr))
-		page = stash->pvec.pages[--stash->pvec.nr];
-	spin_unlock(&stash->lock);
+	err = i915_gem_object_pin_pages(obj);
+	if (err)
+		return err;
 
-	return page;
-}
-
-static void stash_push_pagevec(struct pagestash *stash, struct pagevec *pvec)
-{
-	unsigned int nr;
-
-	spin_lock_nested(&stash->lock, SINGLE_DEPTH_NESTING);
-
-	nr = min_t(typeof(nr), pvec->nr, pagevec_space(&stash->pvec));
-	memcpy(stash->pvec.pages + stash->pvec.nr,
-	       pvec->pages + pvec->nr - nr,
-	       sizeof(pvec->pages[0]) * nr);
-	stash->pvec.nr += nr;
-
-	spin_unlock(&stash->lock);
-
-	pvec->nr -= nr;
-}
-
-static struct page *vm_alloc_page(struct i915_address_space *vm, gfp_t gfp)
-{
-	struct pagevec stack;
-	struct page *page;
-
-	if (I915_SELFTEST_ONLY(should_fail(&vm->fault_attr, 1)))
-		i915_gem_shrink_all(vm->i915);
-
-	page = stash_pop_page(&vm->free_pages);
-	if (page)
-		return page;
-
-	if (!vm->pt_kmap_wc)
-		return alloc_page(gfp);
-
-	/* Look in our global stash of WC pages... */
-	page = stash_pop_page(&vm->i915->mm.wc_stash);
-	if (page)
-		return page;
-
-	/*
-	 * Otherwise batch allocate pages to amortize cost of set_pages_wc.
-	 *
-	 * We have to be careful as page allocation may trigger the shrinker
-	 * (via direct reclaim) which will fill up the WC stash underneath us.
-	 * So we add our WB pages into a temporary pvec on the stack and merge
-	 * them into the WC stash after all the allocations are complete.
-	 */
-	pagevec_init(&stack);
-	do {
-		struct page *page;
-
-		page = alloc_page(gfp);
-		if (unlikely(!page))
-			break;
-
-		stack.pages[stack.nr++] = page;
-	} while (pagevec_space(&stack));
-
-	if (stack.nr && !set_pages_array_wc(stack.pages, stack.nr)) {
-		page = stack.pages[--stack.nr];
-
-		/* Merge spare WC pages to the global stash */
-		if (stack.nr)
-			stash_push_pagevec(&vm->i915->mm.wc_stash, &stack);
-
-		/* Push any surplus WC pages onto the local VM stash */
-		if (stack.nr)
-			stash_push_pagevec(&vm->free_pages, &stack);
-	}
-
-	/* Return unwanted leftovers */
-	if (unlikely(stack.nr)) {
-		WARN_ON_ONCE(set_pages_array_wb(stack.pages, stack.nr));
-		__pagevec_release(&stack);
-	}
-
-	return page;
-}
-
-static void vm_free_pages_release(struct i915_address_space *vm,
-				  bool immediate)
-{
-	struct pagevec *pvec = &vm->free_pages.pvec;
-	struct pagevec stack;
-
-	lockdep_assert_held(&vm->free_pages.lock);
-	GEM_BUG_ON(!pagevec_count(pvec));
-
-	if (vm->pt_kmap_wc) {
-		/*
-		 * When we use WC, first fill up the global stash and then
-		 * only if full immediately free the overflow.
-		 */
-		stash_push_pagevec(&vm->i915->mm.wc_stash, pvec);
-
-		/*
-		 * As we have made some room in the VM's free_pages,
-		 * we can wait for it to fill again. Unless we are
-		 * inside i915_address_space_fini() and must
-		 * immediately release the pages!
-		 */
-		if (pvec->nr <= (immediate ? 0 : PAGEVEC_SIZE - 1))
-			return;
-
-		/*
-		 * We have to drop the lock to allow ourselves to sleep,
-		 * so take a copy of the pvec and clear the stash for
-		 * others to use it as we sleep.
-		 */
-		stack = *pvec;
-		pagevec_reinit(pvec);
-		spin_unlock(&vm->free_pages.lock);
-
-		pvec = &stack;
-		set_pages_array_wb(pvec->pages, pvec->nr);
-
-		spin_lock(&vm->free_pages.lock);
-	}
-
-	__pagevec_release(pvec);
-}
-
-static void vm_free_page(struct i915_address_space *vm, struct page *page)
-{
-	/*
-	 * On !llc, we need to change the pages back to WB. We only do so
-	 * in bulk, so we rarely need to change the page attributes here,
-	 * but doing so requires a stop_machine() from deep inside arch/x86/mm.
-	 * To make detection of the possible sleep more likely, use an
-	 * unconditional might_sleep() for everybody.
-	 */
-	might_sleep();
-	spin_lock(&vm->free_pages.lock);
-	while (!pagevec_space(&vm->free_pages.pvec))
-		vm_free_pages_release(vm, false);
-	GEM_BUG_ON(pagevec_count(&vm->free_pages.pvec) >= PAGEVEC_SIZE);
-	pagevec_add(&vm->free_pages.pvec, page);
-	spin_unlock(&vm->free_pages.lock);
+	i915_gem_object_make_unshrinkable(obj);
+	return 0;
 }
 
 void __i915_vm_close(struct i915_address_space *vm)
@@ -194,14 +55,7 @@ void __i915_vm_close(struct i915_address_space *vm)
 
 void i915_address_space_fini(struct i915_address_space *vm)
 {
-	spin_lock(&vm->free_pages.lock);
-	if (pagevec_count(&vm->free_pages.pvec))
-		vm_free_pages_release(vm, true);
-	GEM_BUG_ON(pagevec_count(&vm->free_pages.pvec));
-	spin_unlock(&vm->free_pages.lock);
-
 	drm_mm_takedown(&vm->mm);
-
 	mutex_destroy(&vm->mutex);
 }
 
@@ -246,8 +100,6 @@ void i915_address_space_init(struct i915_address_space *vm, int subclass)
 	drm_mm_init(&vm->mm, 0, vm->total);
 	vm->mm.head_node.color = I915_COLOR_UNEVICTABLE;
 
-	stash_init(&vm->free_pages);
-
 	INIT_LIST_HEAD(&vm->bound_list);
 }
 
@@ -264,64 +116,50 @@ void clear_pages(struct i915_vma *vma)
 	memset(&vma->page_sizes, 0, sizeof(vma->page_sizes));
 }
 
-static int __setup_page_dma(struct i915_address_space *vm,
-			    struct i915_page_dma *p,
-			    gfp_t gfp)
-{
-	p->page = vm_alloc_page(vm, gfp | I915_GFP_ALLOW_FAIL);
-	if (unlikely(!p->page))
-		return -ENOMEM;
-
-	p->daddr = dma_map_page_attrs(vm->dma,
-				      p->page, 0, PAGE_SIZE,
-				      PCI_DMA_BIDIRECTIONAL,
-				      DMA_ATTR_SKIP_CPU_SYNC |
-				      DMA_ATTR_NO_WARN);
-	if (unlikely(dma_mapping_error(vm->dma, p->daddr))) {
-		vm_free_page(vm, p->page);
-		return -ENOMEM;
-	}
-
-	return 0;
-}
-
-int setup_page_dma(struct i915_address_space *vm, struct i915_page_dma *p)
+dma_addr_t __px_dma(struct drm_i915_gem_object *p)
 {
-	return __setup_page_dma(vm, p, __GFP_HIGHMEM);
+	GEM_BUG_ON(!i915_gem_object_has_pages(p));
+	return sg_dma_address(p->mm.pages->sgl);
 }
 
-void cleanup_page_dma(struct i915_address_space *vm, struct i915_page_dma *p)
+struct page *__px_page(struct drm_i915_gem_object *p)
 {
-	dma_unmap_page(vm->dma, p->daddr, PAGE_SIZE, PCI_DMA_BIDIRECTIONAL);
-	vm_free_page(vm, p->page);
+	GEM_BUG_ON(!i915_gem_object_has_pages(p));
+	return sg_page(p->mm.pages->sgl);
 }
 
 void
-fill_page_dma(const struct i915_page_dma *p, const u64 val, unsigned int count)
+fill_page_dma(struct drm_i915_gem_object *p, const u64 val, unsigned int count)
 {
-	kunmap_atomic(memset64(kmap_atomic(p->page), val, count));
+	struct page *page = __px_page(p);
+	void *vaddr;
+
+	vaddr = kmap(page);
+	memset64(vaddr, val, count);
+	clflush_cache_range(vaddr, PAGE_SIZE);
+	kunmap(page);
 }
 
-static void poison_scratch_page(struct page *page, unsigned long size)
+static void poison_scratch_page(struct drm_i915_gem_object *scratch)
 {
-	if (!IS_ENABLED(CONFIG_DRM_I915_DEBUG_GEM))
-		return;
+	struct sgt_iter sgt;
+	struct page *page;
+	u8 val;
 
-	GEM_BUG_ON(!IS_ALIGNED(size, PAGE_SIZE));
+	val = 0;
+	if (IS_ENABLED(CONFIG_DRM_I915_DEBUG_GEM))
+		val = POISON_FREE;
 
-	do {
+	for_each_sgt_page(page, sgt, scratch->mm.pages) {
 		void *vaddr;
 
 		vaddr = kmap(page);
-		memset(vaddr, POISON_FREE, PAGE_SIZE);
+		memset(vaddr, val, PAGE_SIZE);
 		kunmap(page);
-
-		page = pfn_to_page(page_to_pfn(page) + 1);
-		size -= PAGE_SIZE;
-	} while (size);
+	}
 }
 
-int setup_scratch_page(struct i915_address_space *vm, gfp_t gfp)
+int setup_scratch_page(struct i915_address_space *vm)
 {
 	unsigned long size;
 
@@ -338,21 +176,27 @@ int setup_scratch_page(struct i915_address_space *vm, gfp_t gfp)
 	 */
 	size = I915_GTT_PAGE_SIZE_4K;
 	if (i915_vm_is_4lvl(vm) &&
-	    HAS_PAGE_SIZES(vm->i915, I915_GTT_PAGE_SIZE_64K)) {
+	    HAS_PAGE_SIZES(vm->i915, I915_GTT_PAGE_SIZE_64K))
 		size = I915_GTT_PAGE_SIZE_64K;
-		gfp |= __GFP_NOWARN;
-	}
-	gfp |= __GFP_ZERO | __GFP_RETRY_MAYFAIL;
 
 	do {
-		unsigned int order = get_order(size);
-		struct page *page;
-		dma_addr_t addr;
+		struct drm_i915_gem_object *obj;
 
-		page = alloc_pages(gfp, order);
-		if (unlikely(!page))
+		obj = vm->alloc_pt_dma(vm, size);
+		if (IS_ERR(obj))
 			goto skip;
 
+		if (pin_pt_dma(vm, obj))
+			goto skip_obj;
+
+		/* We need a single contiguous page for our scratch */
+		if (obj->mm.page_sizes.sg < size)
+			goto skip_obj;
+
+		/* And it needs to be correspondingly aligned */
+		if (__px_dma(obj) & (size - 1))
+			goto skip_obj;
+
 		/*
 		 * Use a non-zero scratch page for debugging.
 		 *
@@ -362,61 +206,28 @@ int setup_scratch_page(struct i915_address_space *vm, gfp_t gfp)
 		 * should it ever be accidentally used, the effect should be
 		 * fairly benign.
 		 */
-		poison_scratch_page(page, size);
-
-		addr = dma_map_page_attrs(vm->dma,
-					  page, 0, size,
-					  PCI_DMA_BIDIRECTIONAL,
-					  DMA_ATTR_SKIP_CPU_SYNC |
-					  DMA_ATTR_NO_WARN);
-		if (unlikely(dma_mapping_error(vm->dma, addr)))
-			goto free_page;
-
-		if (unlikely(!IS_ALIGNED(addr, size)))
-			goto unmap_page;
-
-		vm->scratch[0].base.page = page;
-		vm->scratch[0].base.daddr = addr;
-		vm->scratch_order = order;
+		poison_scratch_page(obj);
+
+		vm->scratch[0] = obj;
+		vm->scratch_order = get_order(size);
 		return 0;
 
-unmap_page:
-		dma_unmap_page(vm->dma, addr, size, PCI_DMA_BIDIRECTIONAL);
-free_page:
-		__free_pages(page, order);
+skip_obj:
+		i915_gem_object_put(obj);
 skip:
 		if (size == I915_GTT_PAGE_SIZE_4K)
 			return -ENOMEM;
 
 		size = I915_GTT_PAGE_SIZE_4K;
-		gfp &= ~__GFP_NOWARN;
 	} while (1);
 }
 
-void cleanup_scratch_page(struct i915_address_space *vm)
-{
-	struct i915_page_dma *p = px_base(&vm->scratch[0]);
-	unsigned int order = vm->scratch_order;
-
-	dma_unmap_page(vm->dma, p->daddr, BIT(order) << PAGE_SHIFT,
-		       PCI_DMA_BIDIRECTIONAL);
-	__free_pages(p->page, order);
-}
-
 void free_scratch(struct i915_address_space *vm)
 {
 	int i;
 
-	if (!px_dma(&vm->scratch[0])) /* set to 0 on clones */
-		return;
-
-	for (i = 1; i <= vm->top; i++) {
-		if (!px_dma(&vm->scratch[i]))
-			break;
-		cleanup_page_dma(vm, px_base(&vm->scratch[i]));
-	}
-
-	cleanup_scratch_page(vm);
+	for (i = 0; i <= vm->top; i++)
+		i915_gem_object_put(vm->scratch[i]);
 }
 
 void gtt_write_workarounds(struct intel_gt *gt)
diff --git a/drivers/gpu/drm/i915/gt/intel_gtt.h b/drivers/gpu/drm/i915/gt/intel_gtt.h
index 0d9f29aea6b4..6abab2d37b6f 100644
--- a/drivers/gpu/drm/i915/gt/intel_gtt.h
+++ b/drivers/gpu/drm/i915/gt/intel_gtt.h
@@ -134,31 +134,19 @@ typedef u64 gen8_pte_t;
 #define GEN8_PDE_IPS_64K BIT(11)
 #define GEN8_PDE_PS_2M   BIT(7)
 
+enum i915_cache_level;
+
+struct drm_i915_file_private;
+struct drm_i915_gem_object;
 struct i915_fence_reg;
+struct i915_vma;
+struct intel_gt;
 
 #define for_each_sgt_daddr(__dp, __iter, __sgt) \
 	__for_each_sgt_daddr(__dp, __iter, __sgt, I915_GTT_PAGE_SIZE)
 
-struct i915_page_dma {
-	struct page *page;
-	union {
-		dma_addr_t daddr;
-
-		/*
-		 * For gen6/gen7 only. This is the offset in the GGTT
-		 * where the page directory entries for PPGTT begin
-		 */
-		u32 ggtt_offset;
-	};
-};
-
-struct i915_page_scratch {
-	struct i915_page_dma base;
-	u64 encode;
-};
-
 struct i915_page_table {
-	struct i915_page_dma base;
+	struct drm_i915_gem_object *base;
 	union {
 		atomic_t used;
 		struct i915_page_table *stash;
@@ -179,12 +167,14 @@ struct i915_page_directory {
 	other)
 
 #define px_base(px) \
-	__px_choose_expr(px, struct i915_page_dma *, __x, \
-	__px_choose_expr(px, struct i915_page_scratch *, &__x->base, \
-	__px_choose_expr(px, struct i915_page_table *, &__x->base, \
-	__px_choose_expr(px, struct i915_page_directory *, &__x->pt.base, \
-	(void)0))))
-#define px_dma(px) (px_base(px)->daddr)
+	__px_choose_expr(px, struct drm_i915_gem_object *, __x, \
+	__px_choose_expr(px, struct i915_page_table *, __x->base, \
+	__px_choose_expr(px, struct i915_page_directory *, __x->pt.base, \
+	(void)0)))
+
+struct page *__px_page(struct drm_i915_gem_object *p);
+dma_addr_t __px_dma(struct drm_i915_gem_object *p);
+#define px_dma(px) (__px_dma(px_base(px)))
 
 #define px_pt(px) \
 	__px_choose_expr(px, struct i915_page_table *, __x, \
@@ -192,13 +182,6 @@ struct i915_page_directory {
 	(void)0))
 #define px_used(px) (&px_pt(px)->used)
 
-enum i915_cache_level;
-
-struct drm_i915_file_private;
-struct drm_i915_gem_object;
-struct i915_vma;
-struct intel_gt;
-
 struct i915_vm_pt_stash {
 	/* preallocated chains of page tables/directories */
 	struct i915_page_table *pt[2];
@@ -222,13 +205,6 @@ struct i915_vma_ops {
 	void (*clear_pages)(struct i915_vma *vma);
 };
 
-struct pagestash {
-	spinlock_t lock;
-	struct pagevec pvec;
-};
-
-void stash_init(struct pagestash *stash);
-
 struct i915_address_space {
 	struct kref ref;
 	struct rcu_work rcu;
@@ -265,20 +241,15 @@ struct i915_address_space {
 #define VM_CLASS_GGTT 0
 #define VM_CLASS_PPGTT 1
 
-	struct i915_page_scratch scratch[4];
+	struct drm_i915_gem_object *scratch[4];
 	/**
 	 * List of vma currently bound.
 	 */
 	struct list_head bound_list;
 
-	struct pagestash free_pages;
-
 	/* Global GTT */
 	bool is_ggtt:1;
 
-	/* Some systems require uncached updates of the page directories */
-	bool pt_kmap_wc:1;
-
 	/* Some systems support read-only mappings for GGTT and/or PPGTT */
 	bool has_read_only:1;
 
@@ -286,6 +257,9 @@ struct i915_address_space {
 	u8 pd_shift;
 	u8 scratch_order;
 
+	struct drm_i915_gem_object *
+		(*alloc_pt_dma)(struct i915_address_space *vm, int sz);
+
 	u64 (*pte_encode)(dma_addr_t addr,
 			  enum i915_cache_level level,
 			  u32 flags); /* Create a valid PTE */
@@ -501,9 +475,9 @@ i915_pd_entry(const struct i915_page_directory * const pdp,
 static inline dma_addr_t
 i915_page_dir_dma_addr(const struct i915_ppgtt *ppgtt, const unsigned int n)
 {
-	struct i915_page_dma *pt = ppgtt->pd->entry[n];
+	struct i915_page_table *pt = ppgtt->pd->entry[n];
 
-	return px_dma(pt ?: px_base(&ppgtt->vm.scratch[ppgtt->vm.top]));
+	return __px_dma(pt ? px_base(pt) : ppgtt->vm.scratch[ppgtt->vm.top]);
 }
 
 void ppgtt_init(struct i915_ppgtt *ppgtt, struct intel_gt *gt);
@@ -528,13 +502,10 @@ struct i915_ppgtt *i915_ppgtt_create(struct intel_gt *gt);
 void i915_ggtt_suspend(struct i915_ggtt *gtt);
 void i915_ggtt_resume(struct i915_ggtt *ggtt);
 
-int setup_page_dma(struct i915_address_space *vm, struct i915_page_dma *p);
-void cleanup_page_dma(struct i915_address_space *vm, struct i915_page_dma *p);
-
-#define kmap_atomic_px(px) kmap_atomic(px_base(px)->page)
+#define kmap_atomic_px(px) kmap_atomic(__px_page(px_base(px)))
 
 void
-fill_page_dma(const struct i915_page_dma *p, const u64 val, unsigned int count);
+fill_page_dma(struct drm_i915_gem_object *p, const u64 val, unsigned int count);
 
 #define fill_px(px, v) fill_page_dma(px_base(px), (v), PAGE_SIZE / sizeof(u64))
 #define fill32_px(px, v) do {						\
@@ -542,37 +513,38 @@ fill_page_dma(const struct i915_page_dma *p, const u64 val, unsigned int count);
 	fill_px((px), v__ << 32 | v__);					\
 } while (0)
 
-int setup_scratch_page(struct i915_address_space *vm, gfp_t gfp);
-void cleanup_scratch_page(struct i915_address_space *vm);
+int setup_scratch_page(struct i915_address_space *vm);
 void free_scratch(struct i915_address_space *vm);
 
+struct drm_i915_gem_object *alloc_pt_dma(struct i915_address_space *vm, int sz);
 struct i915_page_table *alloc_pt(struct i915_address_space *vm);
 struct i915_page_directory *alloc_pd(struct i915_address_space *vm);
 struct i915_page_directory *__alloc_pd(size_t sz);
 
-void free_pd(struct i915_address_space *vm, struct i915_page_dma *pd);
+int pin_pt_dma(struct i915_address_space *vm, struct drm_i915_gem_object *obj);
 
-#define free_px(vm, px) free_pd(vm, px_base(px))
+void free_pt(struct i915_address_space *vm, struct i915_page_table *pt);
+#define free_px(vm, px) free_pt(vm, px_pt(px))
 
 void
 __set_pd_entry(struct i915_page_directory * const pd,
 	       const unsigned short idx,
-	       struct i915_page_dma * const to,
+	       struct i915_page_table *pt,
 	       u64 (*encode)(const dma_addr_t, const enum i915_cache_level));
 
 #define set_pd_entry(pd, idx, to) \
-	__set_pd_entry((pd), (idx), px_base(to), gen8_pde_encode)
+	__set_pd_entry((pd), (idx), px_pt(to), gen8_pde_encode)
 
 void
 clear_pd_entry(struct i915_page_directory * const pd,
 	       const unsigned short idx,
-	       const struct i915_page_scratch * const scratch);
+	       const struct drm_i915_gem_object * const scratch);
 
 bool
 release_pd_entry(struct i915_page_directory * const pd,
 		 const unsigned short idx,
 		 struct i915_page_table * const pt,
-		 const struct i915_page_scratch * const scratch);
+		 const struct drm_i915_gem_object * const scratch);
 void gen6_ggtt_invalidate(struct i915_ggtt *ggtt);
 
 int ggtt_set_pages(struct i915_vma *vma);
@@ -594,6 +566,8 @@ void setup_private_pat(struct intel_uncore *uncore);
 int i915_vm_alloc_pt_stash(struct i915_address_space *vm,
 			   struct i915_vm_pt_stash *stash,
 			   u64 size);
+int i915_vm_pin_pt_stash(struct i915_address_space *vm,
+			 struct i915_vm_pt_stash *stash);
 void i915_vm_free_pt_stash(struct i915_address_space *vm,
 			   struct i915_vm_pt_stash *stash);
 
diff --git a/drivers/gpu/drm/i915/gt/intel_ppgtt.c b/drivers/gpu/drm/i915/gt/intel_ppgtt.c
index 7c3f50948829..1f80d79a6588 100644
--- a/drivers/gpu/drm/i915/gt/intel_ppgtt.c
+++ b/drivers/gpu/drm/i915/gt/intel_ppgtt.c
@@ -18,7 +18,8 @@ struct i915_page_table *alloc_pt(struct i915_address_space *vm)
 	if (unlikely(!pt))
 		return ERR_PTR(-ENOMEM);
 
-	if (unlikely(setup_page_dma(vm, &pt->base))) {
+	pt->base = vm->alloc_pt_dma(vm, I915_GTT_PAGE_SIZE_4K);
+	if (IS_ERR(pt->base)) {
 		kfree(pt);
 		return ERR_PTR(-ENOMEM);
 	}
@@ -47,7 +48,8 @@ struct i915_page_directory *alloc_pd(struct i915_address_space *vm)
 	if (unlikely(!pd))
 		return ERR_PTR(-ENOMEM);
 
-	if (unlikely(setup_page_dma(vm, px_base(pd)))) {
+	pd->pt.base = vm->alloc_pt_dma(vm, I915_GTT_PAGE_SIZE_4K);
+	if (IS_ERR(pd->pt.base)) {
 		kfree(pd);
 		return ERR_PTR(-ENOMEM);
 	}
@@ -55,27 +57,28 @@ struct i915_page_directory *alloc_pd(struct i915_address_space *vm)
 	return pd;
 }
 
-void free_pd(struct i915_address_space *vm, struct i915_page_dma *pd)
+void free_pt(struct i915_address_space *vm, struct i915_page_table *pt)
 {
-	cleanup_page_dma(vm, pd);
-	kfree(pd);
+	i915_gem_object_put(pt->base);
+	kfree(pt);
 }
 
 static inline void
-write_dma_entry(struct i915_page_dma * const pdma,
+write_dma_entry(struct drm_i915_gem_object * const pdma,
 		const unsigned short idx,
 		const u64 encoded_entry)
 {
-	u64 * const vaddr = kmap_atomic(pdma->page);
+	u64 * const vaddr = kmap_atomic(__px_page(pdma));
 
 	vaddr[idx] = encoded_entry;
+	clflush_cache_range(&vaddr[idx], sizeof(u64));
 	kunmap_atomic(vaddr);
 }
 
 void
 __set_pd_entry(struct i915_page_directory * const pd,
 	       const unsigned short idx,
-	       struct i915_page_dma * const to,
+	       struct i915_page_table * const to,
 	       u64 (*encode)(const dma_addr_t, const enum i915_cache_level))
 {
 	/* Each thread pre-pins the pd, and we may have a thread per pde. */
@@ -83,13 +86,13 @@ __set_pd_entry(struct i915_page_directory * const pd,
 
 	atomic_inc(px_used(pd));
 	pd->entry[idx] = to;
-	write_dma_entry(px_base(pd), idx, encode(to->daddr, I915_CACHE_LLC));
+	write_dma_entry(px_base(pd), idx, encode(px_dma(to), I915_CACHE_LLC));
 }
 
 void
 clear_pd_entry(struct i915_page_directory * const pd,
 	       const unsigned short idx,
-	       const struct i915_page_scratch * const scratch)
+	       const struct drm_i915_gem_object * const scratch)
 {
 	GEM_BUG_ON(atomic_read(px_used(pd)) == 0);
 
@@ -102,7 +105,7 @@ bool
 release_pd_entry(struct i915_page_directory * const pd,
 		 const unsigned short idx,
 		 struct i915_page_table * const pt,
-		 const struct i915_page_scratch * const scratch)
+		 const struct drm_i915_gem_object * const scratch)
 {
 	bool free = false;
 
@@ -234,6 +237,23 @@ int i915_vm_alloc_pt_stash(struct i915_address_space *vm,
 	return 0;
 }
 
+int i915_vm_pin_pt_stash(struct i915_address_space *vm,
+			 struct i915_vm_pt_stash *stash)
+{
+	struct i915_page_table *pt;
+	int n, err;
+
+	for (n = 0; n < ARRAY_SIZE(stash->pt); n++) {
+		for (pt = stash->pt[n]; pt; pt = pt->stash) {
+			err = pin_pt_dma(vm, pt->base);
+			if (err)
+				return err;
+		}
+	}
+
+	return 0;
+}
+
 void i915_vm_free_pt_stash(struct i915_address_space *vm,
 			   struct i915_vm_pt_stash *stash)
 {
diff --git a/drivers/gpu/drm/i915/gt/intel_ring_submission.c b/drivers/gpu/drm/i915/gt/intel_ring_submission.c
index 94915f668715..9a126ad517c1 100644
--- a/drivers/gpu/drm/i915/gt/intel_ring_submission.c
+++ b/drivers/gpu/drm/i915/gt/intel_ring_submission.c
@@ -201,16 +201,18 @@ static struct i915_address_space *vm_alias(struct i915_address_space *vm)
 	return vm;
 }
 
+static u32 pp_dir(struct i915_address_space *vm)
+{
+	return to_gen6_ppgtt(i915_vm_to_ppgtt(vm))->pp_dir;
+}
+
 static void set_pp_dir(struct intel_engine_cs *engine)
 {
 	struct i915_address_space *vm = vm_alias(engine->gt->vm);
 
 	if (vm) {
-		struct i915_ppgtt *ppgtt = i915_vm_to_ppgtt(vm);
-
 		ENGINE_WRITE(engine, RING_PP_DIR_DCLV, PP_DIR_DCLV_2G);
-		ENGINE_WRITE(engine, RING_PP_DIR_BASE,
-			     px_base(ppgtt->pd)->ggtt_offset << 10);
+		ENGINE_WRITE(engine, RING_PP_DIR_BASE, pp_dir(vm));
 	}
 }
 
@@ -608,7 +610,7 @@ static const struct intel_context_ops ring_context_ops = {
 };
 
 static int load_pd_dir(struct i915_request *rq,
-		       const struct i915_ppgtt *ppgtt,
+		       struct i915_address_space *vm,
 		       u32 valid)
 {
 	const struct intel_engine_cs * const engine = rq->engine;
@@ -624,7 +626,7 @@ static int load_pd_dir(struct i915_request *rq,
 
 	*cs++ = MI_LOAD_REGISTER_IMM(1);
 	*cs++ = i915_mmio_reg_offset(RING_PP_DIR_BASE(engine->mmio_base));
-	*cs++ = px_base(ppgtt->pd)->ggtt_offset << 10;
+	*cs++ = pp_dir(vm);
 
 	/* Stall until the page table load is complete? */
 	*cs++ = MI_STORE_REGISTER_MEM | MI_SRM_LRM_GLOBAL_GTT;
@@ -826,7 +828,7 @@ static int switch_mm(struct i915_request *rq, struct i915_address_space *vm)
 	 * post-sync op, this extra pass appears vital before a
 	 * mm switch!
 	 */
-	ret = load_pd_dir(rq, i915_vm_to_ppgtt(vm), PP_DIR_DCLV_2G);
+	ret = load_pd_dir(rq, vm, PP_DIR_DCLV_2G);
 	if (ret)
 		return ret;
 
diff --git a/drivers/gpu/drm/i915/gvt/scheduler.c b/drivers/gpu/drm/i915/gvt/scheduler.c
index 3c3b9842bbbd..1570eb8aa978 100644
--- a/drivers/gpu/drm/i915/gvt/scheduler.c
+++ b/drivers/gpu/drm/i915/gvt/scheduler.c
@@ -403,6 +403,14 @@ static void release_shadow_wa_ctx(struct intel_shadow_wa_ctx *wa_ctx)
 	wa_ctx->indirect_ctx.shadow_va = NULL;
 }
 
+static void set_dma_address(struct i915_page_directory *pd, dma_addr_t addr)
+{
+	struct scatterlist *sg = pd->pt.base->mm.pages->sgl;
+
+	/* This is not a good idea */
+	sg->dma_address = addr;
+}
+
 static void set_context_ppgtt_from_shadow(struct intel_vgpu_workload *workload,
 					  struct intel_context *ce)
 {
@@ -411,7 +419,7 @@ static void set_context_ppgtt_from_shadow(struct intel_vgpu_workload *workload,
 	int i = 0;
 
 	if (mm->ppgtt_mm.root_entry_type == GTT_TYPE_PPGTT_ROOT_L4_ENTRY) {
-		px_dma(ppgtt->pd) = mm->ppgtt_mm.shadow_pdps[0];
+		set_dma_address(ppgtt->pd, mm->ppgtt_mm.shadow_pdps[0]);
 	} else {
 		for (i = 0; i < GVT_RING_CTX_NR_PDPS; i++) {
 			struct i915_page_directory * const pd =
@@ -421,7 +429,8 @@ static void set_context_ppgtt_from_shadow(struct intel_vgpu_workload *workload,
 			   shadow ppgtt. */
 			if (!pd)
 				break;
-			px_dma(pd) = mm->ppgtt_mm.shadow_pdps[i];
+
+			set_dma_address(pd, mm->ppgtt_mm.shadow_pdps[i]);
 		}
 	}
 }
@@ -1240,13 +1249,13 @@ i915_context_ppgtt_root_restore(struct intel_vgpu_submission *s,
 	int i;
 
 	if (i915_vm_is_4lvl(&ppgtt->vm)) {
-		px_dma(ppgtt->pd) = s->i915_context_pml4;
+		set_dma_address(ppgtt->pd, s->i915_context_pml4);
 	} else {
 		for (i = 0; i < GEN8_3LVL_PDPES; i++) {
 			struct i915_page_directory * const pd =
 				i915_pd_entry(ppgtt->pd, i);
 
-			px_dma(pd) = s->i915_context_pdps[i];
+			set_dma_address(pd, s->i915_context_pdps[i]);
 		}
 	}
 }
diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 5fd5af4bc855..503a89e0ea09 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1075,6 +1075,7 @@ static void i915_driver_release(struct drm_device *dev)
 
 	intel_memory_regions_driver_release(dev_priv);
 	i915_ggtt_driver_release(dev_priv);
+	i915_gem_drain_freed_objects(dev_priv);
 
 	i915_driver_mmio_release(dev_priv);
 
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 9ba6cfff9e3f..bd7ff2ad6514 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -591,11 +591,6 @@ struct i915_gem_mm {
 	 */
 	atomic_t free_count;
 
-	/**
-	 * Small stash of WC pages
-	 */
-	struct pagestash wc_stash;
-
 	/**
 	 * tmpfs instance used for shmem backed objects
 	 */
diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c
index a9e79b67035e..c6bf04ca2032 100644
--- a/drivers/gpu/drm/i915/i915_vma.c
+++ b/drivers/gpu/drm/i915/i915_vma.c
@@ -872,24 +872,30 @@ int i915_vma_pin(struct i915_vma *vma, u64 size, u64 alignment, u64 flags)
 	if (err)
 		return err;
 
+	if (flags & PIN_GLOBAL)
+		wakeref = intel_runtime_pm_get(&vma->vm->i915->runtime_pm);
+
 	if (flags & vma->vm->bind_async_flags) {
 		work = i915_vma_work();
 		if (!work) {
 			err = -ENOMEM;
-			goto err_pages;
+			goto err_rpm;
 		}
 
 		work->vm = i915_vm_get(vma->vm);
 
 		/* Allocate enough page directories to used PTE */
-		if (vma->vm->allocate_va_range)
+		if (vma->vm->allocate_va_range) {
 			i915_vm_alloc_pt_stash(vma->vm,
 					       &work->stash,
 					       vma->size);
-	}
 
-	if (flags & PIN_GLOBAL)
-		wakeref = intel_runtime_pm_get(&vma->vm->i915->runtime_pm);
+			err = i915_vm_pin_pt_stash(vma->vm,
+						   &work->stash);
+			if (err)
+				goto err_fence;
+		}
+	}
 
 	/*
 	 * Differentiate between user/kernel vma inside the aliasing-ppgtt.
@@ -978,9 +984,9 @@ int i915_vma_pin(struct i915_vma *vma, u64 size, u64 alignment, u64 flags)
 err_fence:
 	if (work)
 		dma_fence_work_commit_imm(&work->base);
+err_rpm:
 	if (wakeref)
 		intel_runtime_pm_put(&vma->vm->i915->runtime_pm, wakeref);
-err_pages:
 	vma_put_pages(vma);
 	return err;
 }
diff --git a/drivers/gpu/drm/i915/selftests/i915_gem_gtt.c b/drivers/gpu/drm/i915/selftests/i915_gem_gtt.c
index 9b8fc990e9ef..af8205a2bd8f 100644
--- a/drivers/gpu/drm/i915/selftests/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/selftests/i915_gem_gtt.c
@@ -178,6 +178,12 @@ static int igt_ppgtt_alloc(void *arg)
 		if (err)
 			goto err_ppgtt_cleanup;
 
+		err = i915_vm_pin_pt_stash(&ppgtt->vm, &stash);
+		if (err) {
+			i915_vm_free_pt_stash(&ppgtt->vm, &stash);
+			goto err_ppgtt_cleanup;
+		}
+
 		ppgtt->vm.allocate_va_range(&ppgtt->vm, &stash, 0, size);
 		cond_resched();
 
@@ -194,6 +200,12 @@ static int igt_ppgtt_alloc(void *arg)
 		if (err)
 			goto err_ppgtt_cleanup;
 
+		err = i915_vm_pin_pt_stash(&ppgtt->vm, &stash);
+		if (err) {
+			i915_vm_free_pt_stash(&ppgtt->vm, &stash);
+			goto err_ppgtt_cleanup;
+		}
+
 		ppgtt->vm.allocate_va_range(&ppgtt->vm, &stash,
 					    last, size - last);
 		cond_resched();
@@ -289,6 +301,11 @@ static int lowlevel_hole(struct i915_address_space *vm,
 							   BIT_ULL(size)))
 					break;
 
+				if (i915_vm_pin_pt_stash(vm, &stash)) {
+					i915_vm_free_pt_stash(vm, &stash);
+					break;
+				}
+
 				vm->allocate_va_range(vm, &stash,
 						      addr, BIT_ULL(size));
 
@@ -1912,6 +1929,12 @@ static int igt_cs_tlb(void *arg)
 			if (err)
 				goto end;
 
+			err = i915_vm_pin_pt_stash(vm, &stash);
+			if (err) {
+				i915_vm_free_pt_stash(vm, &stash);
+				goto end;
+			}
+
 			vm->allocate_va_range(vm, &stash, offset, chunk_size);
 
 			i915_vm_free_pt_stash(vm, &stash);
diff --git a/drivers/gpu/drm/i915/selftests/i915_perf.c b/drivers/gpu/drm/i915/selftests/i915_perf.c
index c2d001d9c0ec..debbac660519 100644
--- a/drivers/gpu/drm/i915/selftests/i915_perf.c
+++ b/drivers/gpu/drm/i915/selftests/i915_perf.c
@@ -307,7 +307,7 @@ static int live_noa_gpr(void *arg)
 	}
 
 	/* Poison the ce->vm so we detect writes not to the GGTT gt->scratch */
-	scratch = kmap(ce->vm->scratch[0].base.page);
+	scratch = kmap(__px_page(ce->vm->scratch[0]));
 	memset(scratch, POISON_FREE, PAGE_SIZE);
 
 	rq = intel_context_create_request(ce);
@@ -405,7 +405,7 @@ static int live_noa_gpr(void *arg)
 out_rq:
 	i915_request_put(rq);
 out_ce:
-	kunmap(ce->vm->scratch[0].base.page);
+	kunmap(__px_page(ce->vm->scratch[0]));
 	intel_context_put(ce);
 out:
 	stream_destroy(stream);
diff --git a/drivers/gpu/drm/i915/selftests/mock_gtt.c b/drivers/gpu/drm/i915/selftests/mock_gtt.c
index 5e4fb0fba34b..7270fc8ca801 100644
--- a/drivers/gpu/drm/i915/selftests/mock_gtt.c
+++ b/drivers/gpu/drm/i915/selftests/mock_gtt.c
@@ -78,6 +78,8 @@ struct i915_ppgtt *mock_ppgtt(struct drm_i915_private *i915, const char *name)
 
 	i915_address_space_init(&ppgtt->vm, VM_CLASS_PPGTT);
 
+	ppgtt->vm.alloc_pt_dma = alloc_pt_dma;
+
 	ppgtt->vm.clear_range = mock_clear_range;
 	ppgtt->vm.insert_page = mock_insert_page;
 	ppgtt->vm.insert_entries = mock_insert_entries;
@@ -116,6 +118,8 @@ void mock_init_ggtt(struct drm_i915_private *i915, struct i915_ggtt *ggtt)
 	ggtt->mappable_end = resource_size(&ggtt->gmadr);
 	ggtt->vm.total = 4096 * PAGE_SIZE;
 
+	ggtt->vm.alloc_pt_dma = alloc_pt_dma;
+
 	ggtt->vm.clear_range = mock_clear_range;
 	ggtt->vm.insert_page = mock_insert_page;
 	ggtt->vm.insert_entries = mock_insert_entries;
-- 
2.20.1

* [Intel-gfx] [PATCH 13/66] drm/i915/gem: Don't drop the timeline lock during execbuf
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (10 preceding siblings ...)
  2020-07-15 11:50 ` [Intel-gfx] [PATCH 12/66] drm/i915: Switch to object allocations for page directories Chris Wilson
@ 2020-07-15 11:50 ` Chris Wilson
  2020-07-23 16:09   ` Thomas Hellström (Intel)
  2020-07-31  8:09   ` Thomas Hellström (Intel)
  2020-07-15 11:50 ` [Intel-gfx] [PATCH 14/66] drm/i915/gem: Rename execbuf.bind_link to unbound_link Chris Wilson
                   ` (60 subsequent siblings)
  72 siblings, 2 replies; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:50 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

Our timeline lock is our defence against a concurrent execbuf
interrupting our request construction. We need to hold it throughout or,
for example, a second thread may interject a relocation request in
between our own relocation request and execution in the ring.

A second, major benefit is that it allows us to preserve a large chunk
of the ringbuffer for our exclusive use, which should virtually
eliminate the threat of hitting a wait_for_space during request
construction -- although we should have already dropped other
contentious locks at that point.
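
As a rough standalone illustration (a toy userspace model, not i915
code; the ring array and ids below are invented), holding one lock
across every emission is what keeps a submission's pieces adjacent:

/*
 * Two submitters each emit a "relocation" and an "execution" entry.
 * Because the mutex is held across both emissions, the pairs can never
 * interleave -- the property the timeline mutex gives request
 * construction.
 */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t timeline = PTHREAD_MUTEX_INITIALIZER;
static int ring[8];
static int head;

static void *submit(void *arg)
{
	int id = (int)(long)arg;

	pthread_mutex_lock(&timeline);	/* held for the whole construction */
	ring[head++] = id * 10 + 1;	/* relocation request */
	ring[head++] = id * 10 + 2;	/* execution */
	pthread_mutex_unlock(&timeline);
	return NULL;
}

int main(void)
{
	pthread_t a, b;
	int i;

	pthread_create(&a, NULL, submit, (void *)(long)1);
	pthread_create(&b, NULL, submit, (void *)(long)2);
	pthread_join(a, NULL);
	pthread_join(b, NULL);

	for (i = 0; i < head; i++)
		printf("%d ", ring[i]);	/* 11 12 21 22 or 21 22 11 12 */
	printf("\n");
	return 0;
}

Built with -pthread, the pairs always come out adjacent; dropping the
lock between the two emissions is what would let another thread
interject.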

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c    | 413 +++++++++++-------
 .../i915/gem/selftests/i915_gem_execbuffer.c  |  24 +-
 2 files changed, 281 insertions(+), 156 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index 719ba9fe3e85..af3499aafd22 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -259,6 +259,8 @@ struct i915_execbuffer {
 		bool has_fence : 1;
 		bool needs_unfenced : 1;
 
+		struct intel_context *ce;
+
 		struct i915_vma *target;
 		struct i915_request *rq;
 		struct i915_vma *rq_vma;
@@ -639,6 +641,35 @@ static int eb_reserve_vma(const struct i915_execbuffer *eb,
 	return 0;
 }
 
+static void retire_requests(struct intel_timeline *tl)
+{
+	struct i915_request *rq, *rn;
+
+	list_for_each_entry_safe(rq, rn, &tl->requests, link)
+		if (!i915_request_retire(rq))
+			break;
+}
+
+static int wait_for_timeline(struct intel_timeline *tl)
+{
+	do {
+		struct dma_fence *fence;
+		int err;
+
+		fence = i915_active_fence_get(&tl->last_request);
+		if (!fence)
+			return 0;
+
+		err = dma_fence_wait(fence, true);
+		dma_fence_put(fence);
+		if (err)
+			return err;
+
+		/* Retiring may trigger a barrier, requiring an extra pass */
+		retire_requests(tl);
+	} while (1);
+}
+
 static int eb_reserve(struct i915_execbuffer *eb)
 {
 	const unsigned int count = eb->buffer_count;
@@ -646,7 +677,6 @@ static int eb_reserve(struct i915_execbuffer *eb)
 	struct list_head last;
 	struct eb_vma *ev;
 	unsigned int i, pass;
-	int err = 0;
 
 	/*
 	 * Attempt to pin all of the buffers into the GTT.
@@ -662,18 +692,37 @@ static int eb_reserve(struct i915_execbuffer *eb)
 	 * room for the earlier objects *unless* we need to defragment.
 	 */
 
-	if (mutex_lock_interruptible(&eb->i915->drm.struct_mutex))
-		return -EINTR;
-
 	pass = 0;
 	do {
+		int err = 0;
+
+		/*
+		 * We need to hold one lock as we bind all the vma so that
+		 * we have a consistent view of the entire vm and can plan
+		 * evictions to fill the whole GTT. If we allow a second
+		 * thread to run as we do this, it will either unbind
+		 * everything we want pinned, or steal space that we need for
+		 * ourselves. The closer we are to a full GTT, the more likely
+		 * such contention will cause us to fail to bind the workload
+		 * for this batch. Since we know at this point we need to
+		 * find space for new buffers, we know that extra pressure
+		 * from contention is likely.
+		 *
+		 * In lieu of being able to hold vm->mutex for the entire
+		 * sequence (it's complicated!), we opt for struct_mutex.
+		 */
+		if (mutex_lock_interruptible(&eb->i915->drm.struct_mutex))
+			return -EINTR;
+
 		list_for_each_entry(ev, &eb->unbound, bind_link) {
 			err = eb_reserve_vma(eb, ev, pin_flags);
 			if (err)
 				break;
 		}
-		if (!(err == -ENOSPC || err == -EAGAIN))
-			break;
+		if (!(err == -ENOSPC || err == -EAGAIN)) {
+			mutex_unlock(&eb->i915->drm.struct_mutex);
+			return err;
+		}
 
 		/* Resort *all* the objects into priority order */
 		INIT_LIST_HEAD(&eb->unbound);
@@ -702,38 +751,50 @@ static int eb_reserve(struct i915_execbuffer *eb)
 				list_add_tail(&ev->bind_link, &last);
 		}
 		list_splice_tail(&last, &eb->unbound);
+		mutex_unlock(&eb->i915->drm.struct_mutex);
 
 		if (err == -EAGAIN) {
-			mutex_unlock(&eb->i915->drm.struct_mutex);
 			flush_workqueue(eb->i915->mm.userptr_wq);
-			mutex_lock(&eb->i915->drm.struct_mutex);
 			continue;
 		}
 
+		/*
+		 * We failed to bind our workload; there's not enough space.
+		 *
+		 * This could be due to userspace trying to submit a workload
+		 * that requires more space than is available in an empty GTT,
+		 * but more likely it means that some client is temporarily
+		 * holding onto pressure space. If we wait and flush the
+		 * timeline, that will reduce the concurrent pressure
+		 * giving us a clean shot at allocating our workload.
+		 *
+		 * However, after waiting we may compete once more with new
+		 * clients. Without a ticketlock or some other mechanism,
+		 * there is no guarantee that we will succeed in claiming
+		 * total ownership of the vm.
+		 */
 		switch (pass++) {
 		case 0:
 			break;
 
 		case 1:
-			/* Too fragmented, unbind everything and retry */
-			mutex_lock(&eb->context->vm->mutex);
-			err = i915_gem_evict_vm(eb->context->vm);
-			mutex_unlock(&eb->context->vm->mutex);
+			/*
+			 * Too fragmented, retire everything on the timeline
+			 * and so make it all [contexts included] available to
+			 * evict.
+			 */
+			err = wait_for_timeline(eb->context->timeline);
 			if (err)
-				goto unlock;
+				return err;
+
 			break;
 
 		default:
-			err = -ENOSPC;
-			goto unlock;
+			return -ENOSPC;
 		}
 
 		pin_flags = PIN_USER;
 	} while (1);
-
-unlock:
-	mutex_unlock(&eb->i915->drm.struct_mutex);
-	return err;
 }
 
 static unsigned int eb_batch_index(const struct i915_execbuffer *eb)
@@ -1007,13 +1068,44 @@ static int reloc_gpu_chain(struct reloc_cache *cache)
 	return err;
 }
 
+static struct i915_request *
+nested_request_create(struct intel_context *ce)
+{
+	struct i915_request *rq;
+
+	/* XXX This only works once; replace with shared timeline */
+	mutex_lock_nested(&ce->timeline->mutex, SINGLE_DEPTH_NESTING);
+	intel_context_enter(ce);
+
+	rq = __i915_request_create(ce, GFP_KERNEL);
+
+	intel_context_exit(ce);
+	if (IS_ERR(rq))
+		mutex_unlock(&ce->timeline->mutex);
+
+	return rq;
+}
+
+static void __i915_request_add(struct i915_request *rq,
+			       struct i915_sched_attr *attr)
+{
+	struct intel_timeline * const tl = i915_request_timeline(rq);
+
+	lockdep_assert_held(&tl->mutex);
+	lockdep_unpin_lock(&tl->mutex, rq->cookie);
+
+	__i915_request_commit(rq);
+	__i915_request_queue(rq, attr);
+}
+
 static unsigned int reloc_bb_flags(const struct reloc_cache *cache)
 {
 	return cache->gen > 5 ? 0 : I915_DISPATCH_SECURE;
 }
 
-static int reloc_gpu_flush(struct reloc_cache *cache)
+static int reloc_gpu_flush(struct i915_execbuffer *eb)
 {
+	struct reloc_cache *cache = &eb->reloc_cache;
 	struct i915_request *rq;
 	int err;
 
@@ -1044,7 +1136,9 @@ static int reloc_gpu_flush(struct reloc_cache *cache)
 		i915_request_set_error_once(rq, err);
 
 	intel_gt_chipset_flush(rq->engine->gt);
-	i915_request_add(rq);
+	__i915_request_add(rq, &eb->gem_context->sched);
+	if (i915_request_timeline(rq) != eb->context->timeline)
+		mutex_unlock(&i915_request_timeline(rq)->mutex);
 
 	return err;
 }
@@ -1103,27 +1197,15 @@ static int __reloc_gpu_alloc(struct i915_execbuffer *eb,
 	if (err)
 		goto err_unmap;
 
-	if (engine == eb->context->engine) {
-		rq = i915_request_create(eb->context);
-	} else {
-		struct intel_context *ce;
-
-		ce = intel_context_create(engine);
-		if (IS_ERR(ce)) {
-			err = PTR_ERR(ce);
-			goto err_unpin;
-		}
-
-		i915_vm_put(ce->vm);
-		ce->vm = i915_vm_get(eb->context->vm);
-
-		rq = intel_context_create_request(ce);
-		intel_context_put(ce);
-	}
+	if (cache->ce == eb->context)
+		rq = __i915_request_create(cache->ce, GFP_KERNEL);
+	else
+		rq = nested_request_create(cache->ce);
 	if (IS_ERR(rq)) {
 		err = PTR_ERR(rq);
 		goto err_unpin;
 	}
+	rq->cookie = lockdep_pin_lock(&i915_request_timeline(rq)->mutex);
 
 	err = intel_gt_buffer_pool_mark_active(pool, rq);
 	if (err)
@@ -1151,7 +1233,9 @@ static int __reloc_gpu_alloc(struct i915_execbuffer *eb,
 skip_request:
 	i915_request_set_error_once(rq, err);
 err_request:
-	i915_request_add(rq);
+	__i915_request_add(rq, &eb->gem_context->sched);
+	if (i915_request_timeline(rq) != eb->context->timeline)
+		mutex_unlock(&i915_request_timeline(rq)->mutex);
 err_unpin:
 	i915_vma_unpin(batch);
 err_unmap:
@@ -1161,11 +1245,6 @@ static int __reloc_gpu_alloc(struct i915_execbuffer *eb,
 	return err;
 }
 
-static bool reloc_can_use_engine(const struct intel_engine_cs *engine)
-{
-	return engine->class != VIDEO_DECODE_CLASS || !IS_GEN(engine->i915, 6);
-}
-
 static u32 *reloc_gpu(struct i915_execbuffer *eb,
 		      struct i915_vma *vma,
 		      unsigned int len)
@@ -1177,12 +1256,6 @@ static u32 *reloc_gpu(struct i915_execbuffer *eb,
 	if (unlikely(!cache->rq)) {
 		struct intel_engine_cs *engine = eb->engine;
 
-		if (!reloc_can_use_engine(engine)) {
-			engine = engine->gt->engine_class[COPY_ENGINE_CLASS][0];
-			if (!engine)
-				return ERR_PTR(-ENODEV);
-		}
-
 		err = __reloc_gpu_alloc(eb, engine, len);
 		if (unlikely(err))
 			return ERR_PTR(err);
@@ -1513,7 +1586,7 @@ static int eb_relocate(struct i915_execbuffer *eb)
 				break;
 		}
 
-		flush = reloc_gpu_flush(&eb->reloc_cache);
+		flush = reloc_gpu_flush(eb);
 		if (!err)
 			err = flush;
 	}
@@ -1706,20 +1779,9 @@ static int __eb_parse(struct dma_fence_work *work)
 				       pw->trampoline);
 }
 
-static void __eb_parse_release(struct dma_fence_work *work)
-{
-	struct eb_parse_work *pw = container_of(work, typeof(*pw), base);
-
-	if (pw->trampoline)
-		i915_active_release(&pw->trampoline->active);
-	i915_active_release(&pw->shadow->active);
-	i915_active_release(&pw->batch->active);
-}
-
 static const struct dma_fence_work_ops eb_parse_ops = {
 	.name = "eb_parse",
 	.work = __eb_parse,
-	.release = __eb_parse_release,
 };
 
 static inline int
@@ -1737,21 +1799,23 @@ parser_mark_active(struct eb_parse_work *pw, struct intel_timeline *tl)
 {
 	int err;
 
-	mutex_lock(&tl->mutex);
+	err = i915_active_ref(&pw->batch->active,
+			      tl->fence_context,
+			      &pw->base.dma);
+	if (err)
+		return err;
 
 	err = __parser_mark_active(pw->shadow, tl, &pw->base.dma);
 	if (err)
-		goto unlock;
+		return err;
 
 	if (pw->trampoline) {
 		err = __parser_mark_active(pw->trampoline, tl, &pw->base.dma);
 		if (err)
-			goto unlock;
+			return err;
 	}
 
-unlock:
-	mutex_unlock(&tl->mutex);
-	return err;
+	return 0;
 }
 
 static int eb_parse_pipeline(struct i915_execbuffer *eb,
@@ -1765,20 +1829,6 @@ static int eb_parse_pipeline(struct i915_execbuffer *eb,
 	if (!pw)
 		return -ENOMEM;
 
-	err = i915_active_acquire(&eb->batch->vma->active);
-	if (err)
-		goto err_free;
-
-	err = i915_active_acquire(&shadow->active);
-	if (err)
-		goto err_batch;
-
-	if (trampoline) {
-		err = i915_active_acquire(&trampoline->active);
-		if (err)
-			goto err_shadow;
-	}
-
 	dma_fence_work_init(&pw->base, &eb_parse_ops);
 
 	pw->engine = eb->engine;
@@ -1827,14 +1877,6 @@ static int eb_parse_pipeline(struct i915_execbuffer *eb,
 	i915_sw_fence_set_error_once(&pw->base.chain, err);
 	dma_fence_work_commit_imm(&pw->base);
 	return err;
-
-err_shadow:
-	i915_active_release(&shadow->active);
-err_batch:
-	i915_active_release(&eb->batch->vma->active);
-err_free:
-	kfree(pw);
-	return err;
 }
 
 static int eb_parse(struct i915_execbuffer *eb)
@@ -2043,32 +2085,61 @@ static struct i915_request *eb_throttle(struct intel_context *ce)
 	return i915_request_get(rq);
 }
 
-static int __eb_pin_engine(struct i915_execbuffer *eb, struct intel_context *ce)
+static bool reloc_can_use_engine(const struct intel_engine_cs *engine)
 {
-	struct intel_timeline *tl;
-	struct i915_request *rq;
+	return engine->class != VIDEO_DECODE_CLASS || !IS_GEN(engine->i915, 6);
+}
+
+static int __eb_pin_reloc_engine(struct i915_execbuffer *eb)
+{
+	struct intel_engine_cs *engine = eb->engine;
+	struct intel_context *ce;
 	int err;
 
-	/*
-	 * ABI: Before userspace accesses the GPU (e.g. execbuffer), report
-	 * EIO if the GPU is already wedged.
-	 */
-	err = intel_gt_terminally_wedged(ce->engine->gt);
-	if (err)
-		return err;
+	if (reloc_can_use_engine(engine)) {
+		eb->reloc_cache.ce = eb->context;
+		return 0;
+	}
 
-	if (unlikely(intel_context_is_banned(ce)))
-		return -EIO;
+	engine = engine->gt->engine_class[COPY_ENGINE_CLASS][0];
+	if (!engine)
+		return -ENODEV;
+
+	ce = intel_context_create(engine);
+	if (IS_ERR(ce))
+		return PTR_ERR(ce);
+
+	/* Reuse eb->context->timeline with scheduler! */
+
+	i915_vm_put(ce->vm);
+	ce->vm = i915_vm_get(eb->context->vm);
 
-	/*
-	 * Pinning the contexts may generate requests in order to acquire
-	 * GGTT space, so do this first before we reserve a seqno for
-	 * ourselves.
-	 */
 	err = intel_context_pin(ce);
 	if (err)
 		return err;
 
+	eb->reloc_cache.ce = ce;
+	return 0;
+}
+
+static void __eb_unpin_reloc_engine(struct i915_execbuffer *eb)
+{
+	struct intel_context *ce = eb->reloc_cache.ce;
+
+	if (ce == eb->context)
+		return;
+
+	intel_context_unpin(ce);
+	intel_context_put(ce);
+}
+
+static int eb_lock_engine(struct i915_execbuffer *eb)
+{
+	struct intel_context *ce = eb->context;
+	struct intel_timeline *tl;
+	struct i915_request *rq;
+	int err;
+
 	/*
 	 * Take a local wakeref for preparing to dispatch the execbuf as
 	 * we expect to access the hardware fairly frequently in the
@@ -2078,17 +2149,17 @@ static int __eb_pin_engine(struct i915_execbuffer *eb, struct intel_context *ce)
 	 * taken on the engine, and the parent device.
 	 */
 	tl = intel_context_timeline_lock(ce);
-	if (IS_ERR(tl)) {
-		err = PTR_ERR(tl);
-		goto err_unpin;
-	}
+	if (IS_ERR(tl))
+		return PTR_ERR(tl);
 
 	intel_context_enter(ce);
-	rq = eb_throttle(ce);
-
-	intel_context_timeline_unlock(tl);
 
-	if (rq) {
+	/*
+	 * Before we begin, make sure there is enough space in the ring to
+	 * build the mightiest of requests, and to ratelimit those hogs
+	 * who do succeed in flooding the rings.
+	 */
+	while ((rq = eb_throttle(ce))) {
 		bool nonblock = eb->file->filp->f_flags & O_NONBLOCK;
 		long timeout;
 
@@ -2096,40 +2167,51 @@ static int __eb_pin_engine(struct i915_execbuffer *eb, struct intel_context *ce)
 		if (nonblock)
 			timeout = 0;
 
+		mutex_unlock(&tl->mutex);
+
 		timeout = i915_request_wait(rq,
 					    I915_WAIT_INTERRUPTIBLE,
 					    timeout);
 		i915_request_put(rq);
 
+		mutex_lock(&tl->mutex);
+
 		if (timeout < 0) {
 			err = nonblock ? -EWOULDBLOCK : timeout;
 			goto err_exit;
 		}
+
+		retire_requests(tl);
 	}
 
-	eb->engine = ce->engine;
-	eb->context = ce;
+	err = __eb_pin_reloc_engine(eb);
+	if (err)
+		goto err_exit;
+
 	return 0;
 
 err_exit:
-	mutex_lock(&tl->mutex);
 	intel_context_exit(ce);
 	intel_context_timeline_unlock(tl);
-err_unpin:
-	intel_context_unpin(ce);
 	return err;
 }
 
-static void eb_unpin_engine(struct i915_execbuffer *eb)
+static void eb_unlock_engine(struct i915_execbuffer *eb)
 {
 	struct intel_context *ce = eb->context;
-	struct intel_timeline *tl = ce->timeline;
 
-	mutex_lock(&tl->mutex);
+	__eb_unpin_reloc_engine(eb);
+
+	/* Try to clean up the client's timeline after submitting the request */
+	retire_requests(ce->timeline);
+
 	intel_context_exit(ce);
-	mutex_unlock(&tl->mutex);
+	intel_context_timeline_unlock(ce->timeline);
+}
 
-	intel_context_unpin(ce);
+static void eb_unpin_engine(struct i915_execbuffer *eb)
+{
+	intel_context_unpin(eb->context);
 }
 
 static unsigned int
@@ -2176,6 +2258,35 @@ eb_select_legacy_ring(struct i915_execbuffer *eb,
 	return user_ring_map[user_ring_id];
 }
 
+static int __eb_pin_engine(struct i915_execbuffer *eb, struct intel_context *ce)
+{
+	int err;
+
+	/*
+	 * ABI: Before userspace accesses the GPU (e.g. execbuffer), report
+	 * EIO if the GPU is already wedged.
+	 */
+	err = intel_gt_terminally_wedged(ce->engine->gt);
+	if (err)
+		return err;
+
+	if (unlikely(intel_context_is_banned(ce)))
+		return -EIO;
+
+	/*
+	 * Pinning the contexts may generate requests in order to acquire
+	 * GGTT space, so do this first before we reserve a seqno for
+	 * ourselves.
+	 */
+	err = intel_context_pin(ce);
+	if (err)
+		return err;
+
+	eb->engine = ce->engine;
+	eb->context = ce;
+	return 0;
+}
+
 static int
 eb_pin_engine(struct i915_execbuffer *eb,
 	      struct drm_file *file,
@@ -2329,28 +2440,18 @@ signal_fence_array(struct i915_execbuffer *eb,
 	}
 }
 
-static void retire_requests(struct intel_timeline *tl, struct i915_request *end)
-{
-	struct i915_request *rq, *rn;
-
-	list_for_each_entry_safe(rq, rn, &tl->requests, link)
-		if (rq == end || !i915_request_retire(rq))
-			break;
-}
-
 static void eb_request_add(struct i915_execbuffer *eb)
 {
 	struct i915_request *rq = eb->request;
 	struct intel_timeline * const tl = i915_request_timeline(rq);
 	struct i915_sched_attr attr = {};
-	struct i915_request *prev;
 
 	lockdep_assert_held(&tl->mutex);
 	lockdep_unpin_lock(&tl->mutex, rq->cookie);
 
 	trace_i915_request_add(rq);
 
-	prev = __i915_request_commit(rq);
+	__i915_request_commit(rq);
 
 	/* Check that the context wasn't destroyed before submission */
 	if (likely(!intel_context_is_closed(eb->context))) {
@@ -2362,12 +2463,6 @@ static void eb_request_add(struct i915_execbuffer *eb)
 	}
 
 	__i915_request_queue(rq, &attr);
-
-	/* Try to clean up the client's timeline after submitting the request */
-	if (prev)
-		retire_requests(tl, prev);
-
-	mutex_unlock(&tl->mutex);
 }
 
 static int
@@ -2455,6 +2550,12 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 	if (unlikely(err))
 		goto err_context;
 
+	/* *** TIMELINE LOCK *** */
+	err = eb_lock_engine(&eb);
+	if (unlikely(err))
+		goto err_engine;
+	lockdep_assert_held(&eb.context->timeline->mutex);
+
 	err = eb_relocate(&eb);
 	if (err) {
 		/*
@@ -2521,11 +2622,12 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 	GEM_BUG_ON(eb.reloc_cache.rq);
 
 	/* Allocate a request for this batch buffer nice and early. */
-	eb.request = i915_request_create(eb.context);
+	eb.request = __i915_request_create(eb.context, GFP_KERNEL);
 	if (IS_ERR(eb.request)) {
 		err = PTR_ERR(eb.request);
 		goto err_batch_unpin;
 	}
+	eb.request->cookie = lockdep_pin_lock(&eb.context->timeline->mutex);
 
 	if (in_fence) {
 		if (args->flags & I915_EXEC_FENCE_SUBMIT)
@@ -2567,23 +2669,13 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 	trace_i915_request_queue(eb.request, eb.batch_flags);
 	err = eb_submit(&eb, batch);
 err_request:
-	add_to_client(eb.request, file);
 	i915_request_get(eb.request);
 	eb_request_add(&eb);
 
 	if (fences)
 		signal_fence_array(&eb, fences);
 
-	if (out_fence) {
-		if (err == 0) {
-			fd_install(out_fence_fd, out_fence->file);
-			args->rsvd2 &= GENMASK_ULL(31, 0); /* keep in-fence */
-			args->rsvd2 |= (u64)out_fence_fd << 32;
-			out_fence_fd = -1;
-		} else {
-			fput(out_fence->file);
-		}
-	}
+	add_to_client(eb.request, file);
 	i915_request_put(eb.request);
 
 err_batch_unpin:
@@ -2595,12 +2687,25 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 err_vma:
 	if (eb.trampoline)
 		i915_vma_unpin(eb.trampoline);
+	eb_unlock_engine(&eb);
+	/* *** TIMELINE UNLOCK *** */
+err_engine:
 	eb_unpin_engine(&eb);
 err_context:
 	i915_gem_context_put(eb.gem_context);
 err_destroy:
 	eb_destroy(&eb);
 err_out_fence:
+	if (out_fence) {
+		if (err == 0) {
+			fd_install(out_fence_fd, out_fence->file);
+			args->rsvd2 &= GENMASK_ULL(31, 0); /* keep in-fence */
+			args->rsvd2 |= (u64)out_fence_fd << 32;
+			out_fence_fd = -1;
+		} else {
+			fput(out_fence->file);
+		}
+	}
 	if (out_fence_fd != -1)
 		put_unused_fd(out_fence_fd);
 err_in_fence:
diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/selftests/i915_gem_execbuffer.c
index 57c14d3340cd..992d46db1b33 100644
--- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_execbuffer.c
@@ -7,6 +7,9 @@
 
 #include "gt/intel_engine_pm.h"
 #include "selftests/igt_flush_test.h"
+#include "selftests/mock_drm.h"
+
+#include "mock_context.h"
 
 static u64 read_reloc(const u32 *map, int x, const u64 mask)
 {
@@ -60,7 +63,7 @@ static int __igt_gpu_reloc(struct i915_execbuffer *eb,
 
 	GEM_BUG_ON(!eb->reloc_cache.rq);
 	rq = i915_request_get(eb->reloc_cache.rq);
-	err = reloc_gpu_flush(&eb->reloc_cache);
+	err = reloc_gpu_flush(eb);
 	if (err)
 		goto put_rq;
 	GEM_BUG_ON(eb->reloc_cache.rq);
@@ -100,14 +103,22 @@ static int igt_gpu_reloc(void *arg)
 {
 	struct i915_execbuffer eb;
 	struct drm_i915_gem_object *scratch;
+	struct file *file;
 	int err = 0;
 	u32 *map;
 
+	file = mock_file(arg);
+	if (IS_ERR(file))
+		return PTR_ERR(file);
+
 	eb.i915 = arg;
+	eb.gem_context = live_context(arg, file);
+	if (IS_ERR(eb.gem_context))
+		goto err_file;
 
 	scratch = i915_gem_object_create_internal(eb.i915, 4096);
 	if (IS_ERR(scratch))
-		return PTR_ERR(scratch);
+		goto err_file;
 
 	map = i915_gem_object_pin_map(scratch, I915_MAP_WC);
 	if (IS_ERR(map)) {
@@ -130,8 +141,15 @@ static int igt_gpu_reloc(void *arg)
 		if (err)
 			goto err_put;
 
+		mutex_lock(&eb.context->timeline->mutex);
+		intel_context_enter(eb.context);
+		eb.reloc_cache.ce = eb.context;
+
 		err = __igt_gpu_reloc(&eb, scratch);
 
+		intel_context_exit(eb.context);
+		mutex_unlock(&eb.context->timeline->mutex);
+
 		intel_context_unpin(eb.context);
 err_put:
 		intel_context_put(eb.context);
@@ -146,6 +164,8 @@ static int igt_gpu_reloc(void *arg)
 
 err_scratch:
 	i915_gem_object_put(scratch);
+err_file:
+	fput(file);
 	return err;
 }
 
-- 
2.20.1

* [Intel-gfx] [PATCH 14/66] drm/i915/gem: Rename execbuf.bind_link to unbound_link
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (11 preceding siblings ...)
  2020-07-15 11:50 ` [Intel-gfx] [PATCH 13/66] drm/i915/gem: Don't drop the timeline lock during execbuf Chris Wilson
@ 2020-07-15 11:50 ` Chris Wilson
  2020-07-31  8:11   ` Thomas Hellström (Intel)
  2020-07-15 11:50 ` [Intel-gfx] [PATCH 15/66] drm/i915/gem: Break apart the early i915_vma_pin from execbuf object lookup Chris Wilson
                   ` (59 subsequent siblings)
  72 siblings, 1 reply; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:50 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

Rename the current list of unbound objects so that we can keep track of all
objects that we need to bind, as well as the list of currently unbound
[unprocessed] objects.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index af3499aafd22..40ee2718007e 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -33,7 +33,7 @@ struct eb_vma {
 
 	/** This vma's place in the execbuf reservation list */
 	struct drm_i915_gem_exec_object2 *exec;
-	struct list_head bind_link;
+	struct list_head unbound_link;
 	struct list_head reloc_link;
 
 	struct hlist_node node;
@@ -594,7 +594,7 @@ eb_add_vma(struct i915_execbuffer *eb,
 		}
 	} else {
 		eb_unreserve_vma(ev);
-		list_add_tail(&ev->bind_link, &eb->unbound);
+		list_add_tail(&ev->unbound_link, &eb->unbound);
 	}
 }
 
@@ -714,7 +714,7 @@ static int eb_reserve(struct i915_execbuffer *eb)
 		if (mutex_lock_interruptible(&eb->i915->drm.struct_mutex))
 			return -EINTR;
 
-		list_for_each_entry(ev, &eb->unbound, bind_link) {
+		list_for_each_entry(ev, &eb->unbound, unbound_link) {
 			err = eb_reserve_vma(eb, ev, pin_flags);
 			if (err)
 				break;
@@ -740,15 +740,15 @@ static int eb_reserve(struct i915_execbuffer *eb)
 
 			if (flags & EXEC_OBJECT_PINNED)
 				/* Pinned must have their slot */
-				list_add(&ev->bind_link, &eb->unbound);
+				list_add(&ev->unbound_link, &eb->unbound);
 			else if (flags & __EXEC_OBJECT_NEEDS_MAP)
 				/* Map require the lowest 256MiB (aperture) */
-				list_add_tail(&ev->bind_link, &eb->unbound);
+				list_add_tail(&ev->unbound_link, &eb->unbound);
 			else if (!(flags & EXEC_OBJECT_SUPPORTS_48B_ADDRESS))
 				/* Prioritise 4GiB region for restricted bo */
-				list_add(&ev->bind_link, &last);
+				list_add(&ev->unbound_link, &last);
 			else
-				list_add_tail(&ev->bind_link, &last);
+				list_add_tail(&ev->unbound_link, &last);
 		}
 		list_splice_tail(&last, &eb->unbound);
 		mutex_unlock(&eb->i915->drm.struct_mutex);
-- 
2.20.1

* [Intel-gfx] [PATCH 15/66] drm/i915/gem: Break apart the early i915_vma_pin from execbuf object lookup
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (12 preceding siblings ...)
  2020-07-15 11:50 ` [Intel-gfx] [PATCH 14/66] drm/i915/gem: Rename execbuf.bind_link to unbound_link Chris Wilson
@ 2020-07-15 11:50 ` Chris Wilson
  2020-07-31  8:51   ` Thomas Hellström (Intel)
  2020-07-15 11:50 ` [Intel-gfx] [PATCH 16/66] drm/i915/gem: Remove the call for no-evict i915_vma_pin Chris Wilson
                   ` (58 subsequent siblings)
  72 siblings, 1 reply; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:50 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

As a prelude to the next step, where we want to perform all the object
allocations together under the same lock, we must first delay the
i915_vma_pin() as that implicitly does the allocations for us, one by
one. As it only does the allocations one by one, it is not allowed to
wait/evict, whereas by pulling all the allocations together the entire
set can be scheduled as one.
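
A minimal sketch of the intent (plain C, not driver code; the sizes and
free-space figure are made up): classify first, then reserve with the
whole demand known, rather than discovering the shortfall one object at
a time.

#include <stdbool.h>
#include <stdio.h>

struct obj {
	unsigned int size;
	bool bound;		/* already has a usable placement */
};

int main(void)
{
	struct obj objs[] = {
		{ .size = 4, .bound = true },
		{ .size = 8, .bound = false },
		{ .size = 2, .bound = false },
	};
	unsigned int need = 0, avail = 8;
	unsigned int i;

	/* Pass 1: keep in-place placements, tally everything else. */
	for (i = 0; i < sizeof(objs) / sizeof(objs[0]); i++)
		if (!objs[i].bound)
			need += objs[i].size;

	/* Pass 2: with the total known up front, eviction (or waiting)
	 * can be planned once for the entire set.
	 */
	printf("need %u, available %u -> %s\n", need, avail,
	       need > avail ? "plan eviction for the whole set" : "bind now");
	return 0;
}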

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c    | 74 ++++++++++---------
 1 file changed, 41 insertions(+), 33 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index 40ee2718007e..28cf28fcf80a 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -33,6 +33,8 @@ struct eb_vma {
 
 	/** This vma's place in the execbuf reservation list */
 	struct drm_i915_gem_exec_object2 *exec;
+
+	struct list_head bind_link;
 	struct list_head unbound_link;
 	struct list_head reloc_link;
 
@@ -240,8 +242,8 @@ struct i915_execbuffer {
 	/** actual size of execobj[] as we may extend it for the cmdparser */
 	unsigned int buffer_count;
 
-	/** list of vma not yet bound during reservation phase */
-	struct list_head unbound;
+	/** list of all vma required to be bound for this execbuf */
+	struct list_head bind_list;
 
 	/** list of vma that have execobj.relocation_count */
 	struct list_head relocs;
@@ -565,6 +567,8 @@ eb_add_vma(struct i915_execbuffer *eb,
 						    eb->lut_size)]);
 	}
 
+	list_add_tail(&ev->bind_link, &eb->bind_list);
+
 	if (entry->relocation_count)
 		list_add_tail(&ev->reloc_link, &eb->relocs);
 
@@ -586,16 +590,6 @@ eb_add_vma(struct i915_execbuffer *eb,
 
 		eb->batch = ev;
 	}
-
-	if (eb_pin_vma(eb, entry, ev)) {
-		if (entry->offset != vma->node.start) {
-			entry->offset = vma->node.start | UPDATE;
-			eb->args->flags |= __EXEC_HAS_RELOC;
-		}
-	} else {
-		eb_unreserve_vma(ev);
-		list_add_tail(&ev->unbound_link, &eb->unbound);
-	}
 }
 
 static int eb_reserve_vma(const struct i915_execbuffer *eb,
@@ -670,13 +664,31 @@ static int wait_for_timeline(struct intel_timeline *tl)
 	} while (1);
 }
 
-static int eb_reserve(struct i915_execbuffer *eb)
+static int eb_reserve_vm(struct i915_execbuffer *eb)
 {
-	const unsigned int count = eb->buffer_count;
 	unsigned int pin_flags = PIN_USER | PIN_NONBLOCK;
-	struct list_head last;
+	struct list_head last, unbound;
 	struct eb_vma *ev;
-	unsigned int i, pass;
+	unsigned int pass;
+
+	INIT_LIST_HEAD(&unbound);
+	list_for_each_entry(ev, &eb->bind_list, bind_link) {
+		struct drm_i915_gem_exec_object2 *entry = ev->exec;
+		struct i915_vma *vma = ev->vma;
+
+		if (eb_pin_vma(eb, entry, ev)) {
+			if (entry->offset != vma->node.start) {
+				entry->offset = vma->node.start | UPDATE;
+				eb->args->flags |= __EXEC_HAS_RELOC;
+			}
+		} else {
+			eb_unreserve_vma(ev);
+			list_add_tail(&ev->unbound_link, &unbound);
+		}
+	}
+
+	if (list_empty(&unbound))
+		return 0;
 
 	/*
 	 * Attempt to pin all of the buffers into the GTT.
@@ -714,7 +726,7 @@ static int eb_reserve(struct i915_execbuffer *eb)
 		if (mutex_lock_interruptible(&eb->i915->drm.struct_mutex))
 			return -EINTR;
 
-		list_for_each_entry(ev, &eb->unbound, unbound_link) {
+		list_for_each_entry(ev, &unbound, unbound_link) {
 			err = eb_reserve_vma(eb, ev, pin_flags);
 			if (err)
 				break;
@@ -725,13 +737,11 @@ static int eb_reserve(struct i915_execbuffer *eb)
 		}
 
 		/* Resort *all* the objects into priority order */
-		INIT_LIST_HEAD(&eb->unbound);
+		INIT_LIST_HEAD(&unbound);
 		INIT_LIST_HEAD(&last);
-		for (i = 0; i < count; i++) {
-			unsigned int flags;
+		list_for_each_entry(ev, &eb->bind_list, bind_link) {
+			unsigned int flags = ev->flags;
 
-			ev = &eb->vma[i];
-			flags = ev->flags;
 			if (flags & EXEC_OBJECT_PINNED &&
 			    flags & __EXEC_OBJECT_HAS_PIN)
 				continue;
@@ -740,17 +750,17 @@ static int eb_reserve(struct i915_execbuffer *eb)
 
 			if (flags & EXEC_OBJECT_PINNED)
 				/* Pinned must have their slot */
-				list_add(&ev->unbound_link, &eb->unbound);
+				list_add(&ev->unbound_link, &unbound);
 			else if (flags & __EXEC_OBJECT_NEEDS_MAP)
 				/* Map require the lowest 256MiB (aperture) */
-				list_add_tail(&ev->unbound_link, &eb->unbound);
+				list_add_tail(&ev->unbound_link, &unbound);
 			else if (!(flags & EXEC_OBJECT_SUPPORTS_48B_ADDRESS))
 				/* Prioritise 4GiB region for restricted bo */
 				list_add(&ev->unbound_link, &last);
 			else
 				list_add_tail(&ev->unbound_link, &last);
 		}
-		list_splice_tail(&last, &eb->unbound);
+		list_splice_tail(&last, &unbound);
 		mutex_unlock(&eb->i915->drm.struct_mutex);
 
 		if (err == -EAGAIN) {
@@ -921,8 +931,8 @@ static int eb_lookup_vmas(struct i915_execbuffer *eb)
 	unsigned int i;
 	int err = 0;
 
+	INIT_LIST_HEAD(&eb->bind_list);
 	INIT_LIST_HEAD(&eb->relocs);
-	INIT_LIST_HEAD(&eb->unbound);
 
 	for (i = 0; i < eb->buffer_count; i++) {
 		struct i915_vma *vma;
@@ -1565,16 +1575,10 @@ static int eb_relocate(struct i915_execbuffer *eb)
 {
 	int err;
 
-	err = eb_lookup_vmas(eb);
+	err = eb_reserve_vm(eb);
 	if (err)
 		return err;
 
-	if (!list_empty(&eb->unbound)) {
-		err = eb_reserve(eb);
-		if (err)
-			return err;
-	}
-
 	/* The objects are in their final locations, apply the relocations. */
 	if (eb->args->flags & __EXEC_HAS_RELOC) {
 		struct eb_vma *ev;
@@ -2550,6 +2554,10 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 	if (unlikely(err))
 		goto err_context;
 
+	err = eb_lookup_vmas(&eb);
+	if (unlikely(err))
+		goto err_engine;
+
 	/* *** TIMELINE LOCK *** */
 	err = eb_lock_engine(&eb);
 	if (unlikely(err))
-- 
2.20.1

* [Intel-gfx] [PATCH 16/66] drm/i915/gem: Remove the call for no-evict i915_vma_pin
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (13 preceding siblings ...)
  2020-07-15 11:50 ` [Intel-gfx] [PATCH 15/66] drm/i915/gem: Break apart the early i915_vma_pin from execbuf object lookup Chris Wilson
@ 2020-07-15 11:50 ` Chris Wilson
  2020-07-17 14:36   ` Tvrtko Ursulin
  2020-07-28  9:46   ` Thomas Hellström (Intel)
  2020-07-15 11:50 ` [Intel-gfx] [PATCH 17/66] drm/i915: Add list_for_each_entry_safe_continue_reverse Chris Wilson
                   ` (57 subsequent siblings)
  72 siblings, 2 replies; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:50 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

Remove the stub i915_vma_pin() used for incrementally pinning objects
for execbuf (under the severe restriction that they must not wait on a
resource as we may have already pinned it) and replace it with an
i915_vma_pin_inplace() that is only allowed to reclaim the currently
bound location for the vma (and will never wait for a pinned resource).
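
The heart of an in-place pin is a lock-free "only if already bound"
check. Roughly, as a standalone C11 model (not the driver code; the
flag layout here is invented):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define BOUND		0x1u	/* pretend bind-type bit */
#define PIN_UNIT	0x100u	/* pretend pin count in the upper bits */

/* Take a pin only when the required binding already exists; never
 * bind, wait or evict from this path. */
static bool pin_inplace(atomic_uint *flags, unsigned int want)
{
	unsigned int old = atomic_load(flags);

	do {
		if (want & ~old)	/* required binding not present */
			return false;
	} while (!atomic_compare_exchange_weak(flags, &old, old + PIN_UNIT));

	return true;
}

int main(void)
{
	atomic_uint flags = BOUND;

	printf("%d\n", pin_inplace(&flags, BOUND));	/* 1: reuse binding */
	printf("%d\n", pin_inplace(&flags, 0x2u));	/* 0: would need rebind */
	return 0;
}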

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c    | 69 +++++++++++--------
 drivers/gpu/drm/i915/i915_vma.c               |  6 +-
 drivers/gpu/drm/i915/i915_vma.h               |  2 +
 3 files changed, 45 insertions(+), 32 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index 28cf28fcf80a..0b8a26da26e5 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -452,49 +452,55 @@ static u64 eb_pin_flags(const struct drm_i915_gem_exec_object2 *entry,
 	return pin_flags;
 }
 
+static bool eb_pin_vma_fence_inplace(struct eb_vma *ev)
+{
+	struct i915_vma *vma = ev->vma;
+	struct i915_fence_reg *reg = vma->fence;
+
+	if (reg) {
+		if (READ_ONCE(reg->dirty))
+			return false;
+
+		atomic_inc(&reg->pin_count);
+		ev->flags |= __EXEC_OBJECT_HAS_FENCE;
+	} else {
+		if (i915_gem_object_is_tiled(vma->obj))
+			return false;
+	}
+
+	return true;
+}
+
 static inline bool
-eb_pin_vma(struct i915_execbuffer *eb,
-	   const struct drm_i915_gem_exec_object2 *entry,
-	   struct eb_vma *ev)
+eb_pin_vma_inplace(struct i915_execbuffer *eb,
+		   const struct drm_i915_gem_exec_object2 *entry,
+		   struct eb_vma *ev)
 {
 	struct i915_vma *vma = ev->vma;
-	u64 pin_flags;
+	unsigned int pin_flags;
 
-	if (vma->node.size)
-		pin_flags = vma->node.start;
-	else
-		pin_flags = entry->offset & PIN_OFFSET_MASK;
+	if (eb_vma_misplaced(entry, vma, ev->flags))
+		return false;
 
-	pin_flags |= PIN_USER | PIN_NOEVICT | PIN_OFFSET_FIXED;
+	pin_flags = PIN_USER;
 	if (unlikely(ev->flags & EXEC_OBJECT_NEEDS_GTT))
 		pin_flags |= PIN_GLOBAL;
 
 	/* Attempt to reuse the current location if available */
-	if (unlikely(i915_vma_pin(vma, 0, 0, pin_flags))) {
-		if (entry->flags & EXEC_OBJECT_PINNED)
-			return false;
-
-		/* Failing that pick any _free_ space if suitable */
-		if (unlikely(i915_vma_pin(vma,
-					  entry->pad_to_size,
-					  entry->alignment,
-					  eb_pin_flags(entry, ev->flags) |
-					  PIN_USER | PIN_NOEVICT)))
-			return false;
-	}
+	if (!i915_vma_pin_inplace(vma, pin_flags))
+		return false;
 
 	if (unlikely(ev->flags & EXEC_OBJECT_NEEDS_FENCE)) {
-		if (unlikely(i915_vma_pin_fence(vma))) {
-			i915_vma_unpin(vma);
+		if (!eb_pin_vma_fence_inplace(ev)) {
+			__i915_vma_unpin(vma);
 			return false;
 		}
-
-		if (vma->fence)
-			ev->flags |= __EXEC_OBJECT_HAS_FENCE;
 	}
 
+	GEM_BUG_ON(eb_vma_misplaced(entry, vma, ev->flags));
+
 	ev->flags |= __EXEC_OBJECT_HAS_PIN;
-	return !eb_vma_misplaced(entry, vma, ev->flags);
+	return true;
 }
 
 static int
@@ -676,14 +682,17 @@ static int eb_reserve_vm(struct i915_execbuffer *eb)
 		struct drm_i915_gem_exec_object2 *entry = ev->exec;
 		struct i915_vma *vma = ev->vma;
 
-		if (eb_pin_vma(eb, entry, ev)) {
+		if (eb_pin_vma_inplace(eb, entry, ev)) {
 			if (entry->offset != vma->node.start) {
 				entry->offset = vma->node.start | UPDATE;
 				eb->args->flags |= __EXEC_HAS_RELOC;
 			}
 		} else {
-			eb_unreserve_vma(ev);
-			list_add_tail(&ev->unbound_link, &unbound);
+			/* Lightly sort user placed objects to the fore */
+			if (ev->flags & EXEC_OBJECT_PINNED)
+				list_add(&ev->unbound_link, &unbound);
+			else
+				list_add_tail(&ev->unbound_link, &unbound);
 		}
 	}
 
diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c
index c6bf04ca2032..dbe11b349175 100644
--- a/drivers/gpu/drm/i915/i915_vma.c
+++ b/drivers/gpu/drm/i915/i915_vma.c
@@ -740,11 +740,13 @@ i915_vma_detach(struct i915_vma *vma)
 	list_del(&vma->vm_link);
 }
 
-static bool try_qad_pin(struct i915_vma *vma, unsigned int flags)
+bool i915_vma_pin_inplace(struct i915_vma *vma, unsigned int flags)
 {
 	unsigned int bound;
 	bool pinned = true;
 
+	GEM_BUG_ON(flags & ~I915_VMA_BIND_MASK);
+
 	bound = atomic_read(&vma->flags);
 	do {
 		if (unlikely(flags & ~bound))
@@ -865,7 +867,7 @@ int i915_vma_pin(struct i915_vma *vma, u64 size, u64 alignment, u64 flags)
 	GEM_BUG_ON(!(flags & (PIN_USER | PIN_GLOBAL)));
 
 	/* First try and grab the pin without rebinding the vma */
-	if (try_qad_pin(vma, flags & I915_VMA_BIND_MASK))
+	if (i915_vma_pin_inplace(vma, flags & I915_VMA_BIND_MASK))
 		return 0;
 
 	err = vma_get_pages(vma);
diff --git a/drivers/gpu/drm/i915/i915_vma.h b/drivers/gpu/drm/i915/i915_vma.h
index d0d01f909548..03fea54fd573 100644
--- a/drivers/gpu/drm/i915/i915_vma.h
+++ b/drivers/gpu/drm/i915/i915_vma.h
@@ -236,6 +236,8 @@ static inline void i915_vma_unlock(struct i915_vma *vma)
 	dma_resv_unlock(vma->resv);
 }
 
+bool i915_vma_pin_inplace(struct i915_vma *vma, unsigned int flags);
+
 int __must_check
 i915_vma_pin(struct i915_vma *vma, u64 size, u64 alignment, u64 flags);
 int i915_ggtt_pin(struct i915_vma *vma, u32 align, unsigned int flags);
-- 
2.20.1

* [Intel-gfx] [PATCH 17/66] drm/i915: Add list_for_each_entry_safe_continue_reverse
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (14 preceding siblings ...)
  2020-07-15 11:50 ` [Intel-gfx] [PATCH 16/66] drm/i915/gem: Remove the call for no-evict i915_vma_pin Chris Wilson
@ 2020-07-15 11:50 ` Chris Wilson
  2020-07-31  8:59   ` Thomas Hellström (Intel)
  2020-07-15 11:50 ` [Intel-gfx] [PATCH 18/66] drm/i915: Always defer fenced work to the worker Chris Wilson
                   ` (56 subsequent siblings)
  72 siblings, 1 reply; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:50 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

One more list iterator variant, for when we want to unwind from inside
one list iterator with the intention of restarting from the current
entry as the new head of the list.
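
A hypothetical usage sketch (kernel-style pseudo-C; struct item with an
embedded list_head named link, acquire() and release() are invented --
only the list macros are real):

static int acquire_all(struct list_head *head)
{
	struct item *it, *n;
	int err = 0;

	list_for_each_entry_safe(it, n, head, link) {
		err = acquire(it);
		if (err)
			goto unwind;
	}
	return 0;

unwind:
	/* 'it' is the entry that failed; release everything before it,
	 * walking back towards the original head. */
	list_for_each_entry_safe_continue_reverse(it, n, head, link)
		release(it);
	return err;
}

Presumably this is the kind of partial unwind the execbuf reservation
path wants, without having to restart the walk from the list head.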

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/i915/i915_utils.h | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_utils.h b/drivers/gpu/drm/i915/i915_utils.h
index 54773371e6bd..32cecd8583b1 100644
--- a/drivers/gpu/drm/i915/i915_utils.h
+++ b/drivers/gpu/drm/i915/i915_utils.h
@@ -266,6 +266,12 @@ static inline int list_is_last_rcu(const struct list_head *list,
 	return READ_ONCE(list->next) == head;
 }
 
+#define list_for_each_entry_safe_continue_reverse(pos, n, head, member)	\
+	for (pos = list_prev_entry(pos, member),			\
+	     n = list_prev_entry(pos, member);				\
+	     &pos->member != (head);					\
+	     pos = n, n = list_prev_entry(n, member))
+
 static inline unsigned long msecs_to_jiffies_timeout(const unsigned int m)
 {
 	unsigned long j = msecs_to_jiffies(m);
-- 
2.20.1

* [Intel-gfx] [PATCH 18/66] drm/i915: Always defer fenced work to the worker
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (15 preceding siblings ...)
  2020-07-15 11:50 ` [Intel-gfx] [PATCH 17/66] drm/i915: Add list_for_each_entry_safe_continue_reverse Chris Wilson
@ 2020-07-15 11:50 ` Chris Wilson
  2020-07-31  9:03   ` Thomas Hellström (Intel)
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 19/66] drm/i915/gem: Assign context id for async work Chris Wilson
                   ` (55 subsequent siblings)
  72 siblings, 1 reply; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:50 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

Currently, if an error is raised we always call the cleanup locally
[and skip the main work callback]. However, some future users may need
to take a mutex to clean up, and so we cannot immediately execute the
cleanup as we may still be in interrupt context.

With the execute-immediate flag, for most cases this should result in
immediate cleanup of an error.
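
A hedged sketch of the kind of future user this is aimed at (my_work,
my_work_fn, my_release and the lock are hypothetical; the ops layout
matches dma_fence_work_ops): the release step needs a sleeping lock, so
it must run from the worker (or the immediate-commit path in process
context), never from the irq-context fence notification.

struct my_work {
	struct dma_fence_work base;
	struct mutex *lock;	/* resource needing locked cleanup */
};

static int my_work_fn(struct dma_fence_work *f)
{
	return 0;	/* main step; skipped when f->dma.error is set */
}

static void my_release(struct dma_fence_work *f)
{
	struct my_work *w = container_of(f, typeof(*w), base);

	mutex_lock(w->lock);	/* sleeping lock: illegal from irq context */
	mutex_unlock(w->lock);
}

static const struct dma_fence_work_ops my_ops = {
	.name = "my_work",
	.work = my_work_fn,
	.release = my_release,
};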

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/i915_sw_fence_work.c | 25 +++++++++++------------
 1 file changed, 12 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_sw_fence_work.c b/drivers/gpu/drm/i915/i915_sw_fence_work.c
index a3a81bb8f2c3..29f63ebc24e8 100644
--- a/drivers/gpu/drm/i915/i915_sw_fence_work.c
+++ b/drivers/gpu/drm/i915/i915_sw_fence_work.c
@@ -16,11 +16,14 @@ static void fence_complete(struct dma_fence_work *f)
 static void fence_work(struct work_struct *work)
 {
 	struct dma_fence_work *f = container_of(work, typeof(*f), work);
-	int err;
 
-	err = f->ops->work(f);
-	if (err)
-		dma_fence_set_error(&f->dma, err);
+	if (!f->dma.error) {
+		int err;
+
+		err = f->ops->work(f);
+		if (err)
+			dma_fence_set_error(&f->dma, err);
+	}
 
 	fence_complete(f);
 	dma_fence_put(&f->dma);
@@ -36,15 +39,11 @@ fence_notify(struct i915_sw_fence *fence, enum i915_sw_fence_notify state)
 		if (fence->error)
 			dma_fence_set_error(&f->dma, fence->error);
 
-		if (!f->dma.error) {
-			dma_fence_get(&f->dma);
-			if (test_bit(DMA_FENCE_WORK_IMM, &f->dma.flags))
-				fence_work(&f->work);
-			else
-				queue_work(system_unbound_wq, &f->work);
-		} else {
-			fence_complete(f);
-		}
+		dma_fence_get(&f->dma);
+		if (test_bit(DMA_FENCE_WORK_IMM, &f->dma.flags))
+			fence_work(&f->work);
+		else
+			queue_work(system_unbound_wq, &f->work);
 		break;
 
 	case FENCE_FREE:
-- 
2.20.1


* [Intel-gfx] [PATCH 19/66] drm/i915/gem: Assign context id for async work
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (16 preceding siblings ...)
  2020-07-15 11:50 ` [Intel-gfx] [PATCH 18/66] drm/i915: Always defer fenced work to the worker Chris Wilson
@ 2020-07-15 11:51 ` Chris Wilson
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 20/66] drm/i915/gem: Separate the ww_mutex walker into its own list Chris Wilson
                   ` (54 subsequent siblings)
  72 siblings, 0 replies; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:51 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

Allocate a few dma-fence context ids that we can use to associate async
work [for the CPU] launched on behalf of this context. For extra fun, we
allow a configurable concurrency width.

A current example would be that we spawn an unbound worker for every
userptr get_pages. In the future, we wish to charge this work to the
context that initiated the async work and to impose concurrency limits
based on the context.
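
A worked sketch of the arithmetic (not from the patch): with 8 online
CPUs we allocate 8 fence contexts and store width = 7 as a mask, so
i915_gem_context_async_id() cycles through context .. context + 7. A
hypothetical async task would then stamp its fence with one of them:

	/* w, my_fence_ops and w->lock are placeholders for the async task */
	dma_fence_init(&w->dma, &my_fence_ops, &w->lock,
		       i915_gem_context_async_id(ctx), /* 1 of width + 1 queues */
		       0);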

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/gem/i915_gem_context.c       | 4 ++++
 drivers/gpu/drm/i915/gem/i915_gem_context.h       | 6 ++++++
 drivers/gpu/drm/i915/gem/i915_gem_context_types.h | 6 ++++++
 3 files changed, 16 insertions(+)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
index d0bdb6d447ed..b5f6dc2333ab 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -714,6 +714,10 @@ __create_context(struct drm_i915_private *i915)
 	ctx->sched.priority = I915_USER_PRIORITY(I915_PRIORITY_NORMAL);
 	mutex_init(&ctx->mutex);
 
+	ctx->async.width = rounddown_pow_of_two(num_online_cpus());
+	ctx->async.context = dma_fence_context_alloc(ctx->async.width);
+	ctx->async.width--;
+
 	spin_lock_init(&ctx->stale.lock);
 	INIT_LIST_HEAD(&ctx->stale.engines);
 
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.h b/drivers/gpu/drm/i915/gem/i915_gem_context.h
index a133f92bbedb..f254458a795e 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.h
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.h
@@ -134,6 +134,12 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
 int i915_gem_context_reset_stats_ioctl(struct drm_device *dev, void *data,
 				       struct drm_file *file);
 
+static inline u64 i915_gem_context_async_id(struct i915_gem_context *ctx)
+{
+	return (ctx->async.context +
+		(atomic_fetch_inc(&ctx->async.cur) & ctx->async.width));
+}
+
 static inline struct i915_gem_context *
 i915_gem_context_get(struct i915_gem_context *ctx)
 {
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
index ae14ca24a11f..52561f98000f 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
@@ -85,6 +85,12 @@ struct i915_gem_context {
 
 	struct intel_timeline *timeline;
 
+	struct {
+		u64 context;
+		atomic_t cur;
+		unsigned int width;
+	} async;
+
 	/**
 	 * @vm: unique address space (GTT)
 	 *
-- 
2.20.1


* [Intel-gfx] [PATCH 20/66] drm/i915/gem: Separate the ww_mutex walker into its own list
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (17 preceding siblings ...)
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 19/66] drm/i915/gem: Assign context id for async work Chris Wilson
@ 2020-07-15 11:51 ` Chris Wilson
  2020-07-31  9:23   ` Thomas Hellström (Intel)
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 21/66] drm/i915/gem: Asynchronous GTT unbinding Chris Wilson
                   ` (53 subsequent siblings)
  72 siblings, 1 reply; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:51 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

In preparation for making eb_vma bigger and heavier in order to run in
parallel, we need to stop applying an in-place swap() to reorder around
ww_mutex deadlocks. Keep the array intact and reorder the locks using a
dedicated list.
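
The backoff pattern this enables is sketched below (condensed from
eb_lock_vma() in the diff, with the unwind reduced to comments): on
-EDEADLK we drop the locks we already hold and rotate them to the tail,
so the contended vma is left at the head of submit_list for the
ww_mutex slow path, while eb->vma[] itself is never reordered.

	list_for_each_entry(ev, &eb->submit_list, submit_link) {
		err = ww_mutex_lock_interruptible(&ev->vma->resv->lock, acquire);
		if (err == -EDEADLK) {
			/* unlock + move the prior entries behind us ... */
			/* ... then ww_mutex_lock_slow_interruptible(ev) */
		}
	}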

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c    | 83 ++++++++++++-------
 1 file changed, 54 insertions(+), 29 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index 0b8a26da26e5..430b2d4dc747 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -37,6 +37,7 @@ struct eb_vma {
 	struct list_head bind_link;
 	struct list_head unbound_link;
 	struct list_head reloc_link;
+	struct list_head submit_link;
 
 	struct hlist_node node;
 	u32 handle;
@@ -248,6 +249,8 @@ struct i915_execbuffer {
 	/** list of vma that have execobj.relocation_count */
 	struct list_head relocs;
 
+	struct list_head submit_list;
+
 	/**
 	 * Track the most recently used object for relocations, as we
 	 * frequently have to perform multiple relocations within the same
@@ -341,6 +344,42 @@ static void eb_vma_array_put(struct eb_vma_array *arr)
 	kref_put(&arr->kref, eb_vma_array_destroy);
 }
 
+static int
+eb_lock_vma(struct i915_execbuffer *eb, struct ww_acquire_ctx *acquire)
+{
+	struct eb_vma *ev;
+	int err = 0;
+
+	list_for_each_entry(ev, &eb->submit_list, submit_link) {
+		struct i915_vma *vma = ev->vma;
+
+		err = ww_mutex_lock_interruptible(&vma->resv->lock, acquire);
+		if (err == -EDEADLK) {
+			struct eb_vma *unlock = ev, *en;
+
+			list_for_each_entry_safe_continue_reverse(unlock, en,
+								  &eb->submit_list,
+								  submit_link) {
+				ww_mutex_unlock(&unlock->vma->resv->lock);
+				list_move_tail(&unlock->submit_link, &eb->submit_list);
+			}
+
+			GEM_BUG_ON(!list_is_first(&ev->submit_link, &eb->submit_list));
+			err = ww_mutex_lock_slow_interruptible(&vma->resv->lock,
+							       acquire);
+		}
+		if (err) {
+			list_for_each_entry_continue_reverse(ev,
+							     &eb->submit_list,
+							     submit_link)
+				ww_mutex_unlock(&ev->vma->resv->lock);
+			break;
+		}
+	}
+
+	return err;
+}
+
 static int eb_create(struct i915_execbuffer *eb)
 {
 	/* Allocate an extra slot for use by the command parser + sentinel */
@@ -393,6 +432,10 @@ static int eb_create(struct i915_execbuffer *eb)
 		eb->lut_size = -eb->buffer_count;
 	}
 
+	INIT_LIST_HEAD(&eb->bind_list);
+	INIT_LIST_HEAD(&eb->submit_list);
+	INIT_LIST_HEAD(&eb->relocs);
+
 	return 0;
 }
 
@@ -574,6 +617,7 @@ eb_add_vma(struct i915_execbuffer *eb,
 	}
 
 	list_add_tail(&ev->bind_link, &eb->bind_list);
+	list_add_tail(&ev->submit_link, &eb->submit_list);
 
 	if (entry->relocation_count)
 		list_add_tail(&ev->reloc_link, &eb->relocs);
@@ -940,9 +984,6 @@ static int eb_lookup_vmas(struct i915_execbuffer *eb)
 	unsigned int i;
 	int err = 0;
 
-	INIT_LIST_HEAD(&eb->bind_list);
-	INIT_LIST_HEAD(&eb->relocs);
-
 	for (i = 0; i < eb->buffer_count; i++) {
 		struct i915_vma *vma;
 
@@ -1609,38 +1650,19 @@ static int eb_relocate(struct i915_execbuffer *eb)
 
 static int eb_move_to_gpu(struct i915_execbuffer *eb)
 {
-	const unsigned int count = eb->buffer_count;
 	struct ww_acquire_ctx acquire;
-	unsigned int i;
+	struct eb_vma *ev;
 	int err = 0;
 
 	ww_acquire_init(&acquire, &reservation_ww_class);
 
-	for (i = 0; i < count; i++) {
-		struct eb_vma *ev = &eb->vma[i];
-		struct i915_vma *vma = ev->vma;
-
-		err = ww_mutex_lock_interruptible(&vma->resv->lock, &acquire);
-		if (err == -EDEADLK) {
-			GEM_BUG_ON(i == 0);
-			do {
-				int j = i - 1;
-
-				ww_mutex_unlock(&eb->vma[j].vma->resv->lock);
-
-				swap(eb->vma[i],  eb->vma[j]);
-			} while (--i);
+	err = eb_lock_vma(eb, &acquire);
+	if (err)
+		goto err_fini;
 
-			err = ww_mutex_lock_slow_interruptible(&vma->resv->lock,
-							       &acquire);
-		}
-		if (err)
-			break;
-	}
 	ww_acquire_done(&acquire);
 
-	while (i--) {
-		struct eb_vma *ev = &eb->vma[i];
+	list_for_each_entry(ev, &eb->submit_list, submit_link) {
 		struct i915_vma *vma = ev->vma;
 		unsigned int flags = ev->flags;
 		struct drm_i915_gem_object *obj = vma->obj;
@@ -1697,6 +1719,8 @@ static int eb_move_to_gpu(struct i915_execbuffer *eb)
 	intel_gt_chipset_flush(eb->engine->gt);
 	return 0;
 
+err_fini:
+	ww_acquire_fini(&acquire);
 err_skip:
 	i915_request_set_error_once(eb->request, err);
 	return err;
@@ -1951,9 +1975,10 @@ static int eb_parse(struct i915_execbuffer *eb)
 	if (err)
 		goto err_trampoline;
 
-	eb->vma[eb->buffer_count].vma = i915_vma_get(shadow);
-	eb->vma[eb->buffer_count].flags = __EXEC_OBJECT_HAS_PIN;
 	eb->batch = &eb->vma[eb->buffer_count++];
+	eb->batch->vma = i915_vma_get(shadow);
+	eb->batch->flags = __EXEC_OBJECT_HAS_PIN;
+	list_add_tail(&eb->batch->submit_link, &eb->submit_list);
 	eb->vma[eb->buffer_count].vma = NULL;
 
 	eb->trampoline = trampoline;
-- 
2.20.1


* [Intel-gfx] [PATCH 21/66] drm/i915/gem: Asynchronous GTT unbinding
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (18 preceding siblings ...)
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 20/66] drm/i915/gem: Separate the ww_mutex walker into its own list Chris Wilson
@ 2020-07-15 11:51 ` Chris Wilson
  2020-07-31 13:09   ` Thomas Hellström (Intel)
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 22/66] drm/i915/gem: Bind the fence async for execbuf Chris Wilson
                   ` (52 subsequent siblings)
  72 siblings, 1 reply; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:51 UTC (permalink / raw)
  To: intel-gfx; +Cc: Matthew Auld, Chris Wilson

It is reasonably common for userspace (even modern drivers like iris) to
reuse an active address for a new buffer. This would cause the
application to stall under its mutex (originally struct_mutex) until the
old batches were idle and it could synchronously remove the stale PTE.
However, we can queue up a job that waits on the signal for the old
nodes to complete and upon those signals, remove the old nodes replacing
them with the new ones for the batch. This is still CPU driven, but in
theory we can do the GTT patching from the GPU. The job itself has a
completion signal allowing the execbuf to wait upon the rebinding, and
also other observers to coordinate with the common VM activity.

Letting userspace queue up more work lets it do more without blocking
other clients. In turn, we take care not to let it queue up too much
concurrent work, creating a small number of queues for each context to
limit the number of concurrent tasks.

The implementation relies on only scheduling one unbind operation per
vma as we use the unbound vma->node location to track the stale PTE.
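
Condensed sketch of the resulting flow (function names are taken from
the diff below; loops and error handling are omitted, so this is
illustrative rather than literal):

	/* 1: outside vm->mutex, preallocate everything we might need */
	work = eb_vm_work(eb, count);			/* dma_fence_work + bind[] */
	err = eb_prepare_vma(work, n, ev);		/* per vma: pages + pt stash */
	err = i915_vm_pin_pt_stash(work->vm, &work->stash);

	/* 2: under vm->mutex, no allocations: claim nodes, queue evictions */
	mutex_lock(&vm->mutex);
	err = eb_vm_throttle(work);			/* order behind context's queue */
	err = eb_reserve_vma(work, &work->bind[n]);	/* stale nodes -> evict_list */
	mutex_unlock(&vm->mutex);

	/* 3: commit; the worker runs __eb_bind_vma() once prior fences signal */
	dma_fence_work_commit_imm(&work->base);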

Closes: https://gitlab.freedesktop.org/drm/intel/issues/1402
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Andi Shyti <andi.shyti@intel.com>
---
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c    | 919 ++++++++++++++++--
 drivers/gpu/drm/i915/gt/gen6_ppgtt.c          |   1 +
 drivers/gpu/drm/i915/gt/intel_gtt.c           |   4 +
 drivers/gpu/drm/i915/gt/intel_gtt.h           |   2 +
 drivers/gpu/drm/i915/i915_gem.c               |   7 +
 drivers/gpu/drm/i915/i915_gem_gtt.c           |   5 +
 drivers/gpu/drm/i915/i915_vma.c               |  71 +-
 drivers/gpu/drm/i915/i915_vma.h               |   4 +
 8 files changed, 883 insertions(+), 130 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index 430b2d4dc747..bdcbb82bfc3d 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -18,6 +18,7 @@
 #include "gt/intel_gt.h"
 #include "gt/intel_gt_buffer_pool.h"
 #include "gt/intel_gt_pm.h"
+#include "gt/intel_gt_requests.h"
 #include "gt/intel_ring.h"
 
 #include "i915_drv.h"
@@ -43,6 +44,12 @@ struct eb_vma {
 	u32 handle;
 };
 
+struct eb_bind_vma {
+	struct eb_vma *ev;
+	struct drm_mm_node hole;
+	unsigned int bind_flags;
+};
+
 struct eb_vma_array {
 	struct kref kref;
 	struct eb_vma vma[];
@@ -66,11 +73,12 @@ struct eb_vma_array {
 	 I915_EXEC_RESOURCE_STREAMER)
 
 /* Catch emission of unexpected errors for CI! */
+#define __EINVAL__ 22
 #if IS_ENABLED(CONFIG_DRM_I915_DEBUG_GEM)
 #undef EINVAL
 #define EINVAL ({ \
 	DRM_DEBUG_DRIVER("EINVAL at %s:%d\n", __func__, __LINE__); \
-	22; \
+	__EINVAL__; \
 })
 #endif
 
@@ -311,6 +319,12 @@ static struct eb_vma_array *eb_vma_array_create(unsigned int count)
 	return arr;
 }
 
+static struct eb_vma_array *eb_vma_array_get(struct eb_vma_array *arr)
+{
+	kref_get(&arr->kref);
+	return arr;
+}
+
 static inline void eb_unreserve_vma(struct eb_vma *ev)
 {
 	struct i915_vma *vma = ev->vma;
@@ -444,7 +458,10 @@ eb_vma_misplaced(const struct drm_i915_gem_exec_object2 *entry,
 		 const struct i915_vma *vma,
 		 unsigned int flags)
 {
-	if (vma->node.size < entry->pad_to_size)
+	if (test_bit(I915_VMA_ERROR_BIT, __i915_vma_flags(vma)))
+		return true;
+
+	if (vma->node.size < max(vma->size, entry->pad_to_size))
 		return true;
 
 	if (entry->alignment && !IS_ALIGNED(vma->node.start, entry->alignment))
@@ -469,32 +486,6 @@ eb_vma_misplaced(const struct drm_i915_gem_exec_object2 *entry,
 	return false;
 }
 
-static u64 eb_pin_flags(const struct drm_i915_gem_exec_object2 *entry,
-			unsigned int exec_flags)
-{
-	u64 pin_flags = 0;
-
-	if (exec_flags & EXEC_OBJECT_NEEDS_GTT)
-		pin_flags |= PIN_GLOBAL;
-
-	/*
-	 * Wa32bitGeneralStateOffset & Wa32bitInstructionBaseOffset,
-	 * limit address to the first 4GBs for unflagged objects.
-	 */
-	if (!(exec_flags & EXEC_OBJECT_SUPPORTS_48B_ADDRESS))
-		pin_flags |= PIN_ZONE_4G;
-
-	if (exec_flags & __EXEC_OBJECT_NEEDS_MAP)
-		pin_flags |= PIN_MAPPABLE;
-
-	if (exec_flags & EXEC_OBJECT_PINNED)
-		pin_flags |= entry->offset | PIN_OFFSET_FIXED;
-	else if (exec_flags & __EXEC_OBJECT_NEEDS_BIAS)
-		pin_flags |= BATCH_OFFSET_BIAS | PIN_OFFSET_BIAS;
-
-	return pin_flags;
-}
-
 static bool eb_pin_vma_fence_inplace(struct eb_vma *ev)
 {
 	struct i915_vma *vma = ev->vma;
@@ -522,6 +513,10 @@ eb_pin_vma_inplace(struct i915_execbuffer *eb,
 	struct i915_vma *vma = ev->vma;
 	unsigned int pin_flags;
 
+	/* Concurrent async binds in progress, get in the queue */
+	if (!i915_active_is_idle(&vma->vm->binding))
+		return false;
+
 	if (eb_vma_misplaced(entry, vma, ev->flags))
 		return false;
 
@@ -642,45 +637,463 @@ eb_add_vma(struct i915_execbuffer *eb,
 	}
 }
 
-static int eb_reserve_vma(const struct i915_execbuffer *eb,
-			  struct eb_vma *ev,
-			  u64 pin_flags)
+struct eb_vm_work {
+	struct dma_fence_work base;
+	struct eb_vma_array *array;
+	struct eb_bind_vma *bind;
+	struct i915_address_space *vm;
+	struct i915_vm_pt_stash stash;
+	struct list_head evict_list;
+	u64 *p_flags;
+	u64 id;
+	unsigned long count;
+};
+
+static inline u64 node_end(const struct drm_mm_node *node)
+{
+	return node->start + node->size;
+}
+
+static int set_bind_fence(struct i915_vma *vma, struct eb_vm_work *work)
+{
+	struct dma_fence *prev;
+	int err = 0;
+
+	lockdep_assert_held(&vma->vm->mutex);
+	prev = i915_active_set_exclusive(&vma->active, &work->base.dma);
+	if (unlikely(prev)) {
+		err = i915_sw_fence_await_dma_fence(&work->base.chain, prev, 0,
+						    GFP_NOWAIT | __GFP_NOWARN);
+		dma_fence_put(prev);
+	}
+
+	return err < 0 ? err : 0;
+}
+
+static int await_evict(struct eb_vm_work *work, struct i915_vma *vma)
 {
-	struct drm_i915_gem_exec_object2 *entry = ev->exec;
-	struct i915_vma *vma = ev->vma;
 	int err;
 
-	if (drm_mm_node_allocated(&vma->node) &&
-	    eb_vma_misplaced(entry, vma, ev->flags)) {
-		err = i915_vma_unbind(vma);
+	GEM_BUG_ON(rcu_access_pointer(vma->active.excl.fence) == &work->base.dma);
+
+	/* Wait for all other previous activity */
+	err = i915_sw_fence_await_active(&work->base.chain,
+					 &vma->active,
+					 I915_ACTIVE_AWAIT_ACTIVE);
+	/* Then insert along the exclusive vm->mutex timeline */
+	if (err == 0)
+		err = set_bind_fence(vma, work);
+
+	return err;
+}
+
+static int
+evict_for_node(struct eb_vm_work *work,
+	       struct eb_bind_vma *const target,
+	       unsigned int flags)
+{
+	struct i915_vma *target_vma = target->ev->vma;
+	struct i915_address_space *vm = target_vma->vm;
+	const unsigned long color = target_vma->node.color;
+	const u64 start = target_vma->node.start;
+	const u64 end = start + target_vma->node.size;
+	u64 hole_start = start, hole_end = end;
+	struct i915_vma *vma, *next;
+	struct drm_mm_node *node;
+	LIST_HEAD(evict_list);
+	LIST_HEAD(steal_list);
+	int err = 0;
+
+	lockdep_assert_held(&vm->mutex);
+	GEM_BUG_ON(drm_mm_node_allocated(&target_vma->node));
+	GEM_BUG_ON(!IS_ALIGNED(start, I915_GTT_PAGE_SIZE));
+	GEM_BUG_ON(!IS_ALIGNED(end, I915_GTT_PAGE_SIZE));
+
+	if (i915_vm_has_cache_coloring(vm)) {
+		/* Expand search to cover neighbouring guard pages (or lack!) */
+		if (hole_start)
+			hole_start -= I915_GTT_PAGE_SIZE;
+
+		/* Always look at the page afterwards to avoid the end-of-GTT */
+		hole_end += I915_GTT_PAGE_SIZE;
+	}
+	GEM_BUG_ON(hole_start >= hole_end);
+
+	drm_mm_for_each_node_in_range(node, &vm->mm, hole_start, hole_end) {
+		GEM_BUG_ON(node == &target_vma->node);
+		err = -ENOSPC;
+
+		/* If we find any non-objects (!vma), we cannot evict them */
+		if (node->color == I915_COLOR_UNEVICTABLE)
+			goto err;
+
+		/*
+		 * If we are using coloring to insert guard pages between
+		 * different cache domains within the address space, we have
+		 * to check whether the objects on either side of our range
+		 * abutt and conflict. If they are in conflict, then we evict
+		 * those as well to make room for our guard pages.
+		 */
+		if (i915_vm_has_cache_coloring(vm)) {
+			if (node_end(node) == start && node->color == color)
+				continue;
+
+			if (node->start == end && node->color == color)
+				continue;
+		}
+
+		GEM_BUG_ON(!drm_mm_node_allocated(node));
+		vma = container_of(node, typeof(*vma), node);
+
+		if (flags & PIN_NOEVICT || i915_vma_is_pinned(vma))
+			goto err;
+
+		/* If this VMA is already being freed, or idle, steal it! */
+		if (!i915_active_acquire_if_busy(&vma->active)) {
+			list_move(&vma->vm_link, &steal_list);
+			continue;
+		}
+
+		if (!(flags & PIN_NONBLOCK))
+			err = await_evict(work, vma);
+		i915_active_release(&vma->active);
 		if (err)
-			return err;
+			goto err;
+
+		GEM_BUG_ON(!i915_vma_is_active(vma));
+		list_move(&vma->vm_link, &evict_list);
 	}
 
-	err = i915_vma_pin(vma,
-			   entry->pad_to_size, entry->alignment,
-			   eb_pin_flags(entry, ev->flags) | pin_flags);
-	if (err)
-		return err;
+	list_for_each_entry_safe(vma, next, &steal_list, vm_link) {
+		GEM_BUG_ON(i915_vma_is_pinned(vma));
+		GEM_BUG_ON(i915_vma_is_active(vma));
+		__i915_vma_evict(vma);
+		drm_mm_remove_node(&vma->node);
+		/* No ref held; vma may now be concurrently freed */
+	}
 
-	if (entry->offset != vma->node.start) {
-		entry->offset = vma->node.start | UPDATE;
-		eb->args->flags |= __EXEC_HAS_RELOC;
+	/* No overlapping nodes to evict, claim the slot for ourselves! */
+	if (list_empty(&evict_list))
+		return drm_mm_reserve_node(&vm->mm, &target_vma->node);
+
+	/*
+	 * Mark this range as reserved.
+	 *
+	 * We have not yet removed the PTEs for the old evicted nodes, so
+	 * must prevent this range from being reused for anything else. The
+	 * PTE will be cleared when the range is idle (during the rebind
+	 * phase in the worker).
+	 */
+	target->hole.color = I915_COLOR_UNEVICTABLE;
+	target->hole.start = start;
+	target->hole.size = end;
+
+	list_for_each_entry(vma, &evict_list, vm_link) {
+		target->hole.start =
+			min(target->hole.start, vma->node.start);
+		target->hole.size =
+			max(target->hole.size, node_end(&vma->node));
+
+		GEM_BUG_ON(!i915_vma_is_active(vma));
+		GEM_BUG_ON(vma->node.mm != &vm->mm);
+		set_bit(I915_VMA_ERROR_BIT, __i915_vma_flags(vma));
+		drm_mm_remove_node(&vma->node);
+		GEM_BUG_ON(i915_vma_is_pinned(vma));
 	}
+	list_splice(&evict_list, &work->evict_list);
 
-	if (unlikely(ev->flags & EXEC_OBJECT_NEEDS_FENCE)) {
-		err = i915_vma_pin_fence(vma);
-		if (unlikely(err)) {
-			i915_vma_unpin(vma);
-			return err;
+	target->hole.size -= target->hole.start;
+
+	return drm_mm_reserve_node(&vm->mm, &target->hole);
+
+err:
+	list_splice(&evict_list, &vm->bound_list);
+	list_splice(&steal_list, &vm->bound_list);
+	return err;
+}
+
+static int
+evict_in_range(struct eb_vm_work *work,
+	       struct eb_bind_vma * const target,
+	       u64 start, u64 end, u64 align)
+{
+	struct i915_vma *target_vma = target->ev->vma;
+	struct i915_address_space *vm = target_vma->vm;
+	struct i915_vma *active = NULL;
+	struct i915_vma *vma, *next;
+	struct drm_mm_scan scan;
+	LIST_HEAD(evict_list);
+	bool found = false;
+
+	lockdep_assert_held(&vm->mutex);
+
+	drm_mm_scan_init_with_range(&scan, &vm->mm,
+				    target_vma->node.size,
+				    align,
+				    target_vma->node.color,
+				    start, end,
+				    DRM_MM_INSERT_BEST);
+
+	list_for_each_entry_safe(vma, next, &vm->bound_list, vm_link) {
+		if (i915_vma_is_pinned(vma))
+			continue;
+
+		if (vma == active)
+			active = ERR_PTR(-EAGAIN);
+
+		/* Prefer to reuse idle nodes; push all active vma to the end */
+		if (active != ERR_PTR(-EAGAIN) && i915_vma_is_active(vma)) {
+			if (!active)
+				active = vma;
+
+			list_move_tail(&vma->vm_link, &vm->bound_list);
+			continue;
 		}
 
+		list_move(&vma->vm_link, &evict_list);
+		if (drm_mm_scan_add_block(&scan, &vma->node)) {
+			target_vma->node.start =
+				round_up(scan.hit_start, align);
+			found = true;
+			break;
+		}
+	}
+
+	list_for_each_entry(vma, &evict_list, vm_link)
+		drm_mm_scan_remove_block(&scan, &vma->node);
+	list_splice(&evict_list, &vm->bound_list);
+	if (!found)
+		return -ENOSPC;
+
+	return evict_for_node(work, target, 0);
+}
+
+static u64 random_offset(u64 start, u64 end, u64 len, u64 align)
+{
+	u64 range, addr;
+
+	GEM_BUG_ON(range_overflows(start, len, end));
+	GEM_BUG_ON(round_up(start, align) > round_down(end - len, align));
+
+	range = round_down(end - len, align) - round_up(start, align);
+	if (range) {
+		if (sizeof(unsigned long) == sizeof(u64)) {
+			addr = get_random_long();
+		} else {
+			addr = get_random_int();
+			if (range > U32_MAX) {
+				addr <<= 32;
+				addr |= get_random_int();
+			}
+		}
+		div64_u64_rem(addr, range, &addr);
+		start += addr;
+	}
+
+	return round_up(start, align);
+}
+
+static u64 align0(u64 align)
+{
+	return align <= I915_GTT_MIN_ALIGNMENT ? 0 : align;
+}
+
+static struct drm_mm_node *__best_hole(struct drm_mm *mm, u64 size)
+{
+	struct rb_node *rb = mm->holes_size.rb_root.rb_node;
+	struct drm_mm_node *best = NULL;
+
+	while (rb) {
+		struct drm_mm_node *node =
+			rb_entry(rb, struct drm_mm_node, rb_hole_size);
+
+		if (size <= node->hole_size) {
+			best = node;
+			rb = rb->rb_right;
+		} else {
+			rb = rb->rb_left;
+		}
+	}
+
+	return best;
+}
+
+static int best_hole(struct drm_mm *mm, struct drm_mm_node *node,
+		     u64 start, u64 end, u64 align)
+{
+	struct drm_mm_node *hole;
+	u64 size = node->size;
+
+	do {
+		hole = __best_hole(mm, size);
+		if (!hole)
+			return -ENOSPC;
+
+		node->start = round_up(max(start, drm_mm_hole_node_start(hole)),
+				       align);
+		if (min(drm_mm_hole_node_end(hole), end) >=
+		    node->start + node->size)
+			return drm_mm_reserve_node(mm, node);
+
+		/*
+		 * Too expensive to search for every single hole every time,
+		 * so just look for the next bigger hole, introducing enough
+		 * space for alignments. Finding the smallest hole with ideal
+		 * alignment scales very poorly, so we choose to waste space
+		 * if an alignment is forced. On the other hand, simply
+		 * randomly selecting an offset in 48b space will cause us
+		 * to use the majority of that space and exhaust all memory
+		 * in storing the page directories. Compromise is required.
+		 */
+		size = hole->hole_size + align;
+	} while (1);
+}
+
+static int eb_reserve_vma(struct eb_vm_work *work, struct eb_bind_vma *bind)
+{
+	struct drm_i915_gem_exec_object2 *entry = bind->ev->exec;
+	const unsigned int exec_flags = bind->ev->flags;
+	struct i915_vma *vma = bind->ev->vma;
+	struct i915_address_space *vm = vma->vm;
+	u64 start = 0, end = vm->total;
+	u64 align = entry->alignment ?: I915_GTT_MIN_ALIGNMENT;
+	unsigned int bind_flags;
+	int err;
+
+	lockdep_assert_held(&vm->mutex);
+
+	bind_flags = PIN_USER;
+	if (exec_flags & EXEC_OBJECT_NEEDS_GTT)
+		bind_flags |= PIN_GLOBAL;
+
+	if (drm_mm_node_allocated(&vma->node))
+		goto pin;
+
+	GEM_BUG_ON(i915_vma_is_pinned(vma));
+	GEM_BUG_ON(i915_vma_is_bound(vma, I915_VMA_BIND_MASK));
+	GEM_BUG_ON(i915_active_fence_isset(&vma->active.excl));
+	GEM_BUG_ON(!vma->size);
+
+	/* Reuse old address (if it doesn't conflict with new requirements) */
+	if (eb_vma_misplaced(entry, vma, exec_flags)) {
+		vma->node.start = entry->offset & PIN_OFFSET_MASK;
+		vma->node.size = max(entry->pad_to_size, vma->size);
+		vma->node.color = 0;
+		if (i915_vm_has_cache_coloring(vm))
+			vma->node.color = vma->obj->cache_level;
+	}
+
+	/*
+	 * Wa32bitGeneralStateOffset & Wa32bitInstructionBaseOffset,
+	 * limit address to the first 4GBs for unflagged objects.
+	 */
+	if (!(exec_flags & EXEC_OBJECT_SUPPORTS_48B_ADDRESS))
+		end = min_t(u64, end, (1ULL << 32) - I915_GTT_PAGE_SIZE);
+
+	align = max(align, vma->display_alignment);
+	if (exec_flags & __EXEC_OBJECT_NEEDS_MAP) {
+		vma->node.size = max_t(u64, vma->node.size, vma->fence_size);
+		end = min_t(u64, end, i915_vm_to_ggtt(vm)->mappable_end);
+		align = max_t(u64, align, vma->fence_alignment);
+	}
+
+	if (exec_flags & __EXEC_OBJECT_NEEDS_BIAS)
+		start = BATCH_OFFSET_BIAS;
+
+	GEM_BUG_ON(!vma->node.size);
+	if (vma->node.size > end - start)
+		return -E2BIG;
+
+	/* Try the user's preferred location first (mandatory if soft-pinned) */
+	err = -__EINVAL__;
+	if (vma->node.start >= start &&
+	    IS_ALIGNED(vma->node.start, align) &&
+	    !range_overflows(vma->node.start, vma->node.size, end)) {
+		unsigned int pin_flags;
+
+		/*
+		 * Prefer to relocate and spread objects around.
+		 *
+		 * If we relocate and continue to use the new location in
+		 * future batches, we only pay the relocation cost once.
+		 *
+		 * If we instead keep reusing the same address for different
+		 * objects, each batch must remove/insert objects into the GTT,
+		 * which is more expensive than performing a relocation.
+		 */
+		pin_flags = 0;
+		if (!(exec_flags & EXEC_OBJECT_PINNED))
+			pin_flags = PIN_NOEVICT;
+
+		err = evict_for_node(work, bind, pin_flags);
+		if (err == 0)
+			goto pin;
+	}
+	if (exec_flags & EXEC_OBJECT_PINNED)
+		return err;
+
+	/* Try the first available free space */
+	if (!best_hole(&vm->mm, &vma->node, start, end, align))
+		goto pin;
+
+	/* Pick a random slot and see if it's available [O(N) worst case] */
+	vma->node.start = random_offset(start, end, vma->node.size, align);
+	if (evict_for_node(work, bind, PIN_NONBLOCK) == 0)
+		goto pin;
+
+	/* Otherwise search all free space [degrades to O(N^2)] */
+	if (drm_mm_insert_node_in_range(&vm->mm, &vma->node,
+					vma->node.size,
+					align0(align),
+					vma->node.color,
+					start, end,
+					DRM_MM_INSERT_BEST) == 0)
+		goto pin;
+
+	/* Pretty busy! Loop over "LRU" and evict oldest in our search range */
+	err = evict_in_range(work, bind, start, end, align);
+	if (unlikely(err))
+		return err;
+
+pin:
+	if (unlikely(exec_flags & EXEC_OBJECT_NEEDS_FENCE)) {
+		err = __i915_vma_pin_fence(vma); /* XXX no waiting */
+		if (unlikely(err))
+			return err;
+
 		if (vma->fence)
-			ev->flags |= __EXEC_OBJECT_HAS_FENCE;
+			bind->ev->flags |= __EXEC_OBJECT_HAS_FENCE;
 	}
 
-	ev->flags |= __EXEC_OBJECT_HAS_PIN;
-	GEM_BUG_ON(eb_vma_misplaced(entry, vma, ev->flags));
+	bind_flags &= ~atomic_read(&vma->flags);
+	if (bind_flags) {
+		err = set_bind_fence(vma, work);
+		if (unlikely(err))
+			return err;
+
+		atomic_add(I915_VMA_PAGES_ACTIVE, &vma->pages_count);
+		atomic_or(bind_flags, &vma->flags);
+
+		if (i915_vma_is_ggtt(vma))
+			__i915_vma_set_map_and_fenceable(vma);
+
+		GEM_BUG_ON(!i915_vma_is_active(vma));
+		list_move_tail(&vma->vm_link, &vm->bound_list);
+		bind->bind_flags = bind_flags;
+	}
+	__i915_vma_pin(vma); /* and release */
+
+	GEM_BUG_ON(!bind_flags && !drm_mm_node_allocated(&vma->node));
+	GEM_BUG_ON(!(drm_mm_node_allocated(&vma->node) ^
+		     drm_mm_node_allocated(&bind->hole)));
+
+	if (entry->offset != vma->node.start) {
+		entry->offset = vma->node.start | UPDATE;
+		*work->p_flags |= __EXEC_HAS_RELOC;
+	}
+
+	bind->ev->flags |= __EXEC_OBJECT_HAS_PIN;
+	GEM_BUG_ON(eb_vma_misplaced(entry, vma, bind->ev->flags));
 
 	return 0;
 }
@@ -714,13 +1127,241 @@ static int wait_for_timeline(struct intel_timeline *tl)
 	} while (1);
 }
 
+static void __eb_bind_vma(struct eb_vm_work *work)
+{
+	struct i915_address_space *vm = work->vm;
+	unsigned long n;
+
+	GEM_BUG_ON(!intel_gt_pm_is_awake(vm->gt));
+
+	/*
+	 * We have to wait until the stale nodes are completely idle before
+	 * we can remove their PTE and unbind their pages. Hence, after
+	 * claiming their slot in the drm_mm, we defer their removal to
+	 * after the fences are signaled.
+	 */
+	if (!list_empty(&work->evict_list)) {
+		struct i915_vma *vma, *vn;
+
+		mutex_lock(&vm->mutex);
+		list_for_each_entry_safe(vma, vn, &work->evict_list, vm_link) {
+			GEM_BUG_ON(vma->vm != vm);
+			__i915_vma_evict(vma);
+			GEM_BUG_ON(!i915_vma_is_active(vma));
+		}
+		mutex_unlock(&vm->mutex);
+	}
+
+	/*
+	 * Now we know the nodes we require in drm_mm are idle, we can
+	 * replace the PTE in those ranges with our own.
+	 */
+	for (n = 0; n < work->count; n++) {
+		struct eb_bind_vma *bind = &work->bind[n];
+		struct i915_vma *vma = bind->ev->vma;
+
+		if (!bind->bind_flags)
+			goto put;
+
+		GEM_BUG_ON(vma->vm != vm);
+		GEM_BUG_ON(!i915_vma_is_active(vma));
+
+		vma->ops->bind_vma(vm, &work->stash, vma,
+				   vma->obj->cache_level, bind->bind_flags);
+
+		if (drm_mm_node_allocated(&bind->hole)) {
+			mutex_lock(&vm->mutex);
+			GEM_BUG_ON(bind->hole.mm != &vm->mm);
+			GEM_BUG_ON(bind->hole.color != I915_COLOR_UNEVICTABLE);
+			GEM_BUG_ON(drm_mm_node_allocated(&vma->node));
+
+			drm_mm_remove_node(&bind->hole);
+			drm_mm_reserve_node(&vm->mm, &vma->node);
+
+			GEM_BUG_ON(!drm_mm_node_allocated(&vma->node));
+			mutex_unlock(&vm->mutex);
+		}
+		bind->bind_flags = 0;
+
+put:
+		GEM_BUG_ON(drm_mm_node_allocated(&bind->hole));
+		i915_vma_put_pages(vma);
+	}
+	work->count = 0;
+}
+
+static int eb_bind_vma(struct dma_fence_work *base)
+{
+	struct eb_vm_work *work = container_of(base, typeof(*work), base);
+
+	__eb_bind_vma(work);
+	return 0;
+}
+
+static void eb_vma_work_release(struct dma_fence_work *base)
+{
+	struct eb_vm_work *work = container_of(base, typeof(*work), base);
+
+	__eb_bind_vma(work);
+	kvfree(work->bind);
+
+	if (work->id)
+		i915_active_release(&work->vm->binding);
+
+	eb_vma_array_put(work->array);
+
+	i915_vm_free_pt_stash(work->vm, &work->stash);
+	i915_vm_put(work->vm);
+}
+
+static const struct dma_fence_work_ops eb_bind_ops = {
+	.name = "eb_bind",
+	.work = eb_bind_vma,
+	.release = eb_vma_work_release,
+};
+
+static int eb_vm_work_cancel(struct eb_vm_work *work, int err)
+{
+	work->base.dma.error = err;
+	dma_fence_work_commit_imm(&work->base);
+
+	return err;
+}
+
+static struct eb_vm_work *eb_vm_work(struct i915_execbuffer *eb,
+				     unsigned long count)
+{
+	struct eb_vm_work *work;
+
+	work = kmalloc(sizeof(*work), GFP_KERNEL);
+	if (!work)
+		return NULL;
+
+	work->bind = kvmalloc(sizeof(*work->bind) * count, GFP_KERNEL);
+	if (!work->bind) {
+		kfree(work->bind);
+		return NULL;
+	}
+	work->count = count;
+
+	INIT_LIST_HEAD(&work->evict_list);
+
+	dma_fence_work_init(&work->base, &eb_bind_ops);
+	work->array = eb_vma_array_get(eb->array);
+	work->p_flags = &eb->args->flags;
+	work->vm = i915_vm_get(eb->context->vm);
+	memset(&work->stash, 0, sizeof(work->stash));
+
+	/* Preallocate our slot in vm->binding, outside of vm->mutex */
+	work->id = i915_gem_context_async_id(eb->gem_context);
+	if (i915_active_acquire_for_context(&work->vm->binding, work->id)) {
+		work->id = 0;
+		eb_vm_work_cancel(work, -ENOMEM);
+		return NULL;
+	}
+
+	return work;
+}
+
+static int eb_vm_throttle(struct eb_vm_work *work)
+{
+	struct dma_fence *p;
+	int err;
+
+	/* Keep async work queued per context */
+	p = __i915_active_ref(&work->vm->binding, work->id, &work->base.dma);
+	if (IS_ERR_OR_NULL(p))
+		return PTR_ERR_OR_ZERO(p);
+
+	err = i915_sw_fence_await_dma_fence(&work->base.chain, p, 0,
+					    GFP_NOWAIT | __GFP_NOWARN);
+	dma_fence_put(p);
+
+	return err < 0 ? err : 0;
+}
+
+static int eb_prepare_vma(struct eb_vm_work *work,
+			  unsigned long idx,
+			  struct eb_vma *ev)
+{
+	struct eb_bind_vma *bind = &work->bind[idx];
+	struct i915_vma *vma = ev->vma;
+	int err;
+
+	bind->ev = ev;
+	bind->hole.flags = 0;
+	bind->bind_flags = 0;
+
+	/* Allocate enough page directories to cover PTE used */
+	if (work->vm->allocate_va_range) {
+		err = i915_vm_alloc_pt_stash(work->vm, &work->stash, vma->size);
+		if (err)
+			return err;
+	}
+
+	return i915_vma_get_pages(vma);
+}
+
+static int wait_for_unbinds(struct i915_execbuffer *eb,
+			    struct list_head *unbound,
+			    int pass)
+{
+	struct eb_vma *ev;
+	int err;
+
+	list_for_each_entry(ev, unbound, unbound_link) {
+		struct i915_vma *vma = ev->vma;
+
+		GEM_BUG_ON(ev->flags & __EXEC_OBJECT_HAS_PIN);
+
+		if (drm_mm_node_allocated(&vma->node) &&
+		    eb_vma_misplaced(ev->exec, vma, ev->flags)) {
+			err = i915_vma_unbind(vma);
+			if (err)
+				return err;
+		}
+
+		/* Wait for previous to avoid reusing vma->node */
+		err = i915_vma_wait_for_unbind(vma);
+		if (err)
+			return err;
+	}
+
+	switch (pass) {
+	default:
+		return -ENOSPC;
+
+	case 2:
+		/*
+		 * Too fragmented, retire everything on the timeline and so
+		 * make it all [contexts included] available to evict.
+		 */
+		err = wait_for_timeline(eb->context->timeline);
+		if (err)
+			return err;
+
+		fallthrough;
+	case 1:
+		/* XXX ticket lock */
+		if (i915_active_wait(&eb->context->vm->binding))
+			return -EINTR;
+
+		fallthrough;
+	case 0:
+		return 0;
+	}
+}
+
 static int eb_reserve_vm(struct i915_execbuffer *eb)
 {
-	unsigned int pin_flags = PIN_USER | PIN_NONBLOCK;
+	struct i915_address_space *vm = eb->context->vm;
 	struct list_head last, unbound;
+	unsigned long count;
 	struct eb_vma *ev;
 	unsigned int pass;
+	int err = 0;
 
+	count = 0;
 	INIT_LIST_HEAD(&unbound);
 	list_for_each_entry(ev, &eb->bind_list, bind_link) {
 		struct drm_i915_gem_exec_object2 *entry = ev->exec;
@@ -737,29 +1378,15 @@ static int eb_reserve_vm(struct i915_execbuffer *eb)
 				list_add(&ev->unbound_link, &unbound);
 			else
 				list_add_tail(&ev->unbound_link, &unbound);
+			count++;
 		}
 	}
-
-	if (list_empty(&unbound))
+	if (count == 0)
 		return 0;
 
-	/*
-	 * Attempt to pin all of the buffers into the GTT.
-	 * This is done in 3 phases:
-	 *
-	 * 1a. Unbind all objects that do not match the GTT constraints for
-	 *     the execbuffer (fenceable, mappable, alignment etc).
-	 * 1b. Increment pin count for already bound objects.
-	 * 2.  Bind new objects.
-	 * 3.  Decrement pin count.
-	 *
-	 * This avoid unnecessary unbinding of later objects in order to make
-	 * room for the earlier objects *unless* we need to defragment.
-	 */
-
 	pass = 0;
 	do {
-		int err = 0;
+		struct eb_vm_work *work;
 
 		/*
 		 * We need to hold one lock as we bind all the vma so that
@@ -773,23 +1400,87 @@ static int eb_reserve_vm(struct i915_execbuffer *eb)
 		 * find space for new buffers, we know that extra pressure
 		 * from contention is likely.
 		 *
-		 * In lieu of being able to hold vm->mutex for the entire
-		 * sequence (it's complicated!), we opt for struct_mutex.
+		 * vm->mutex is complicated, as we are not allowed to allocate
+		 * beneath it, so we have to stage and preallocate all the
+		 * resources we may require before taking the mutex.
 		 */
-		if (mutex_lock_interruptible(&eb->i915->drm.struct_mutex))
-			return -EINTR;
+		work = eb_vm_work(eb, count);
+		if (!work)
+			return -ENOMEM;
 
+		count = 0;
 		list_for_each_entry(ev, &unbound, unbound_link) {
-			err = eb_reserve_vma(eb, ev, pin_flags);
+			err = eb_prepare_vma(work, count++, ev);
+			if (err) {
+				work->count = count - 1;
+
+				if (eb_vm_work_cancel(work, err) == -EAGAIN)
+					goto retry;
+
+				return err;
+			}
+		}
+
+		err = i915_vm_pin_pt_stash(work->vm, &work->stash);
+		if (err)
+			return eb_vm_work_cancel(work, err);
+
+		/* No allocations allowed beyond this point */
+		if (mutex_lock_interruptible(&vm->mutex))
+			return eb_vm_work_cancel(work, -EINTR);
+
+		err = eb_vm_throttle(work);
+		if (err) {
+			mutex_unlock(&vm->mutex);
+			return eb_vm_work_cancel(work, err);
+		}
+
+		for (count = 0; count < work->count; count++) {
+			struct eb_bind_vma *bind = &work->bind[count];
+			struct i915_vma *vma;
+
+			ev = bind->ev;
+			vma = ev->vma;
+
+			/*
+			 * Check if this node is being evicted or must be.
+			 *
+			 * As we use the single node inside the vma to track
+			 * both the eviction and where to insert the new node,
+			 * we cannot handle migrating the vma inside the worker.
+			 */
+			if (drm_mm_node_allocated(&vma->node)) {
+				if (eb_vma_misplaced(ev->exec, vma, ev->flags)) {
+					err = -ENOSPC;
+					break;
+				}
+			} else {
+				if (i915_vma_is_active(vma)) {
+					err = -ENOSPC;
+					break;
+				}
+			}
+
+			err = i915_active_acquire(&vma->active);
+			if (!err) {
+				err = eb_reserve_vma(work, bind);
+				i915_active_release(&vma->active);
+			}
 			if (err)
 				break;
+
+			GEM_BUG_ON(!i915_vma_is_pinned(vma));
 		}
-		if (!(err == -ENOSPC || err == -EAGAIN)) {
-			mutex_unlock(&eb->i915->drm.struct_mutex);
+
+		mutex_unlock(&vm->mutex);
+
+		dma_fence_work_commit_imm(&work->base);
+		if (err != -ENOSPC)
 			return err;
-		}
 
+retry:
 		/* Resort *all* the objects into priority order */
+		count = 0;
 		INIT_LIST_HEAD(&unbound);
 		INIT_LIST_HEAD(&last);
 		list_for_each_entry(ev, &eb->bind_list, bind_link) {
@@ -800,6 +1491,7 @@ static int eb_reserve_vm(struct i915_execbuffer *eb)
 				continue;
 
 			eb_unreserve_vma(ev);
+			count++;
 
 			if (flags & EXEC_OBJECT_PINNED)
 				/* Pinned must have their slot */
@@ -814,11 +1506,16 @@ static int eb_reserve_vm(struct i915_execbuffer *eb)
 				list_add_tail(&ev->unbound_link, &last);
 		}
 		list_splice_tail(&last, &unbound);
-		mutex_unlock(&eb->i915->drm.struct_mutex);
+		GEM_BUG_ON(!count);
+
+		if (signal_pending(current))
+			return -EINTR;
+
+		/* Now safe to wait with no reservations held */
 
 		if (err == -EAGAIN) {
 			flush_workqueue(eb->i915->mm.userptr_wq);
-			continue;
+			pass = 0;
 		}
 
 		/*
@@ -836,27 +1533,9 @@ static int eb_reserve_vm(struct i915_execbuffer *eb)
 		 * there is no guarantee that we will succeed in claiming
 		 * total ownership of the vm.
 		 */
-		switch (pass++) {
-		case 0:
-			break;
-
-		case 1:
-			/*
-			 * Too fragmented, retire everything on the timeline
-			 * and so make it all [contexts included] available to
-			 * evict.
-			 */
-			err = wait_for_timeline(eb->context->timeline);
-			if (err)
-				return err;
-
-			break;
-
-		default:
-			return -ENOSPC;
-		}
-
-		pin_flags = PIN_USER;
+		err = wait_for_unbinds(eb, &unbound, pass++);
+		if (err)
+			return err;
 	} while (1);
 }
 
@@ -1448,6 +2127,29 @@ relocate_entry(struct i915_execbuffer *eb,
 	return target->node.start | UPDATE;
 }
 
+static int gen6_fixup_ggtt(struct i915_vma *vma)
+{
+	int err;
+
+	if (i915_vma_is_bound(vma, I915_VMA_GLOBAL_BIND))
+		return 0;
+
+	err = i915_vma_wait_for_bind(vma);
+	if (err)
+		return err;
+
+	mutex_lock(&vma->vm->mutex);
+	if (!(atomic_fetch_or(I915_VMA_GLOBAL_BIND, &vma->flags) & I915_VMA_GLOBAL_BIND)) {
+		__i915_gem_object_pin_pages(vma->obj);
+		vma->ops->bind_vma(vma->vm, NULL, vma,
+				   vma->obj->cache_level,
+				   I915_VMA_GLOBAL_BIND);
+	}
+	mutex_unlock(&vma->vm->mutex);
+
+	return 0;
+}
+
 static u64
 eb_relocate_entry(struct i915_execbuffer *eb,
 		  struct eb_vma *ev,
@@ -1462,6 +2164,8 @@ eb_relocate_entry(struct i915_execbuffer *eb,
 	if (unlikely(!target))
 		return -ENOENT;
 
+	GEM_BUG_ON(!i915_vma_is_pinned(target->vma));
+
 	/* Validate that the target is in a valid r/w GPU domain */
 	if (unlikely(reloc->write_domain & (reloc->write_domain - 1))) {
 		drm_dbg(&i915->drm, "reloc with multiple write domains: "
@@ -1496,9 +2200,7 @@ eb_relocate_entry(struct i915_execbuffer *eb,
 		 */
 		if (reloc->write_domain == I915_GEM_DOMAIN_INSTRUCTION &&
 		    IS_GEN(eb->i915, 6)) {
-			err = i915_vma_bind(target->vma,
-					    target->vma->obj->cache_level,
-					    PIN_GLOBAL, NULL);
+			err = gen6_fixup_ggtt(target->vma);
 			if (err)
 				return err;
 		}
@@ -1668,6 +2370,8 @@ static int eb_move_to_gpu(struct i915_execbuffer *eb)
 		struct drm_i915_gem_object *obj = vma->obj;
 
 		assert_vma_held(vma);
+		GEM_BUG_ON(!(flags & __EXEC_OBJECT_HAS_PIN));
+		GEM_BUG_ON(!i915_vma_is_bound(vma, I915_VMA_LOCAL_BIND));
 
 		if (flags & EXEC_OBJECT_CAPTURE) {
 			struct i915_capture_list *capture;
@@ -1706,7 +2410,6 @@ static int eb_move_to_gpu(struct i915_execbuffer *eb)
 			err = i915_vma_move_to_active(vma, eb->request, flags);
 
 		i915_vma_unlock(vma);
-		eb_unreserve_vma(ev);
 	}
 	ww_acquire_fini(&acquire);
 
@@ -2637,7 +3340,7 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 	 * snb/ivb/vlv conflate the "batch in ppgtt" bit with the "non-secure
 	 * batch" bit. Hence we need to pin secure batches into the global gtt.
 	 * hsw should have this fixed, but bdw mucks it up again. */
-	batch = eb.batch->vma;
+	batch = i915_vma_get(eb.batch->vma);
 	if (eb.batch_flags & I915_DISPATCH_SECURE) {
 		struct i915_vma *vma;
 
@@ -2657,6 +3360,7 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 			goto err_parse;
 		}
 
+		GEM_BUG_ON(vma->obj != batch->obj);
 		batch = vma;
 	}
 
@@ -2726,6 +3430,7 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 err_parse:
 	if (batch->private)
 		intel_gt_buffer_pool_put(batch->private);
+	i915_vma_put(batch);
 err_vma:
 	if (eb.trampoline)
 		i915_vma_unpin(eb.trampoline);
diff --git a/drivers/gpu/drm/i915/gt/gen6_ppgtt.c b/drivers/gpu/drm/i915/gt/gen6_ppgtt.c
index a823d2e3c39c..71baf2f8bdf3 100644
--- a/drivers/gpu/drm/i915/gt/gen6_ppgtt.c
+++ b/drivers/gpu/drm/i915/gt/gen6_ppgtt.c
@@ -361,6 +361,7 @@ static struct i915_vma *pd_vma_create(struct gen6_ppgtt *ppgtt, int size)
 	atomic_set(&vma->flags, I915_VMA_GGTT);
 	vma->ggtt_view.type = I915_GGTT_VIEW_ROTATED; /* prevent fencing */
 
+	INIT_LIST_HEAD(&vma->vm_link);
 	INIT_LIST_HEAD(&vma->obj_link);
 	INIT_LIST_HEAD(&vma->closed_link);
 
diff --git a/drivers/gpu/drm/i915/gt/intel_gtt.c b/drivers/gpu/drm/i915/gt/intel_gtt.c
index 795ed81ba358..31dc0fdc183b 100644
--- a/drivers/gpu/drm/i915/gt/intel_gtt.c
+++ b/drivers/gpu/drm/i915/gt/intel_gtt.c
@@ -55,6 +55,8 @@ void __i915_vm_close(struct i915_address_space *vm)
 
 void i915_address_space_fini(struct i915_address_space *vm)
 {
+	i915_active_fini(&vm->binding);
+
 	drm_mm_takedown(&vm->mm);
 	mutex_destroy(&vm->mutex);
 }
@@ -100,6 +102,8 @@ void i915_address_space_init(struct i915_address_space *vm, int subclass)
 	drm_mm_init(&vm->mm, 0, vm->total);
 	vm->mm.head_node.color = I915_COLOR_UNEVICTABLE;
 
+	i915_active_init(&vm->binding, NULL, NULL);
+
 	INIT_LIST_HEAD(&vm->bound_list);
 }
 
diff --git a/drivers/gpu/drm/i915/gt/intel_gtt.h b/drivers/gpu/drm/i915/gt/intel_gtt.h
index 6abab2d37b6f..496f8236ca09 100644
--- a/drivers/gpu/drm/i915/gt/intel_gtt.h
+++ b/drivers/gpu/drm/i915/gt/intel_gtt.h
@@ -247,6 +247,8 @@ struct i915_address_space {
 	 */
 	struct list_head bound_list;
 
+	struct i915_active binding;
+
 	/* Global GTT */
 	bool is_ggtt:1;
 
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 9aa3066cb75d..e998f25f30a3 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -997,6 +997,9 @@ i915_gem_object_ggtt_pin(struct drm_i915_gem_object *obj,
 		return vma;
 
 	if (i915_vma_misplaced(vma, size, alignment, flags)) {
+		if (flags & PIN_NOEVICT)
+			return ERR_PTR(-ENOSPC);
+
 		if (flags & PIN_NONBLOCK) {
 			if (i915_vma_is_pinned(vma) || i915_vma_is_active(vma))
 				return ERR_PTR(-ENOSPC);
@@ -1016,6 +1019,10 @@ i915_gem_object_ggtt_pin(struct drm_i915_gem_object *obj,
 			return ERR_PTR(ret);
 	}
 
+	if (flags & PIN_NONBLOCK &&
+	    i915_active_fence_isset(&vma->active.excl))
+		return ERR_PTR(-EAGAIN);
+
 	ret = i915_vma_pin(vma, size, alignment, flags | PIN_GLOBAL);
 	if (ret)
 		return ERR_PTR(ret);
diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
index c5ee1567f3d1..356c492b80e0 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -221,6 +221,8 @@ int i915_gem_gtt_insert(struct i915_address_space *vm,
 		mode = DRM_MM_INSERT_HIGHEST;
 	if (flags & PIN_MAPPABLE)
 		mode = DRM_MM_INSERT_LOW;
+	if (flags & PIN_NOSEARCH)
+		mode |= DRM_MM_INSERT_ONCE;
 
 	/* We only allocate in PAGE_SIZE/GTT_PAGE_SIZE (4096) chunks,
 	 * so we know that we always have a minimum alignment of 4096.
@@ -238,6 +240,9 @@ int i915_gem_gtt_insert(struct i915_address_space *vm,
 	if (err != -ENOSPC)
 		return err;
 
+	if (flags & PIN_NOSEARCH)
+		return -ENOSPC;
+
 	if (mode & DRM_MM_INSERT_ONCE) {
 		err = drm_mm_insert_node_in_range(&vm->mm, node,
 						  size, alignment, color,
diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c
index dbe11b349175..7278cc7c40b9 100644
--- a/drivers/gpu/drm/i915/i915_vma.c
+++ b/drivers/gpu/drm/i915/i915_vma.c
@@ -133,6 +133,7 @@ vma_create(struct drm_i915_gem_object *obj,
 		fs_reclaim_release(GFP_KERNEL);
 	}
 
+	INIT_LIST_HEAD(&vma->vm_link);
 	INIT_LIST_HEAD(&vma->closed_link);
 
 	if (view && view->type != I915_GGTT_VIEW_NORMAL) {
@@ -341,25 +342,37 @@ struct i915_vma_work *i915_vma_work(void)
 	return vw;
 }
 
-int i915_vma_wait_for_bind(struct i915_vma *vma)
+static int
+__i915_vma_wait_excl(struct i915_vma *vma, bool bound, unsigned int flags)
 {
+	struct dma_fence *fence;
 	int err = 0;
 
-	if (rcu_access_pointer(vma->active.excl.fence)) {
-		struct dma_fence *fence;
+	fence = i915_active_fence_get(&vma->active.excl);
+	if (!fence)
+		return 0;
 
-		rcu_read_lock();
-		fence = dma_fence_get_rcu_safe(&vma->active.excl.fence);
-		rcu_read_unlock();
-		if (fence) {
-			err = dma_fence_wait(fence, MAX_SCHEDULE_TIMEOUT);
-			dma_fence_put(fence);
-		}
+	if (drm_mm_node_allocated(&vma->node) == bound) {
+		if (flags & PIN_NOEVICT)
+			err = -EBUSY;
+		else
+			err = dma_fence_wait(fence, true);
 	}
 
+	dma_fence_put(fence);
 	return err;
 }
 
+int i915_vma_wait_for_bind(struct i915_vma *vma)
+{
+	return __i915_vma_wait_excl(vma, true, 0);
+}
+
+int i915_vma_wait_for_unbind(struct i915_vma *vma)
+{
+	return __i915_vma_wait_excl(vma, false, 0);
+}
+
 /**
  * i915_vma_bind - Sets up PTEs for an VMA in it's corresponding address space.
  * @vma: VMA to map
@@ -624,8 +637,9 @@ i915_vma_insert(struct i915_vma *vma, u64 size, u64 alignment, u64 flags)
 	u64 start, end;
 	int ret;
 
-	GEM_BUG_ON(i915_vma_is_bound(vma, I915_VMA_GLOBAL_BIND | I915_VMA_LOCAL_BIND));
+	GEM_BUG_ON(i915_vma_is_bound(vma, I915_VMA_BIND_MASK));
 	GEM_BUG_ON(drm_mm_node_allocated(&vma->node));
+	GEM_BUG_ON(i915_active_fence_isset(&vma->active.excl));
 
 	size = max(size, vma->size);
 	alignment = max(alignment, vma->display_alignment);
@@ -721,7 +735,7 @@ i915_vma_insert(struct i915_vma *vma, u64 size, u64 alignment, u64 flags)
 	GEM_BUG_ON(!drm_mm_node_allocated(&vma->node));
 	GEM_BUG_ON(!i915_gem_valid_gtt_space(vma, color));
 
-	list_add_tail(&vma->vm_link, &vma->vm->bound_list);
+	list_move_tail(&vma->vm_link, &vma->vm->bound_list);
 
 	return 0;
 }
@@ -729,15 +743,12 @@ i915_vma_insert(struct i915_vma *vma, u64 size, u64 alignment, u64 flags)
 static void
 i915_vma_detach(struct i915_vma *vma)
 {
-	GEM_BUG_ON(!drm_mm_node_allocated(&vma->node));
-	GEM_BUG_ON(i915_vma_is_bound(vma, I915_VMA_GLOBAL_BIND | I915_VMA_LOCAL_BIND));
-
 	/*
 	 * And finally now the object is completely decoupled from this
 	 * vma, we can drop its hold on the backing storage and allow
 	 * it to be reaped by the shrinker.
 	 */
-	list_del(&vma->vm_link);
+	list_del_init(&vma->vm_link);
 }
 
 bool i915_vma_pin_inplace(struct i915_vma *vma, unsigned int flags)
@@ -785,7 +796,7 @@ bool i915_vma_pin_inplace(struct i915_vma *vma, unsigned int flags)
 	return pinned;
 }
 
-static int vma_get_pages(struct i915_vma *vma)
+int i915_vma_get_pages(struct i915_vma *vma)
 {
 	int err = 0;
 
@@ -832,7 +843,7 @@ static void __vma_put_pages(struct i915_vma *vma, unsigned int count)
 	mutex_unlock(&vma->pages_mutex);
 }
 
-static void vma_put_pages(struct i915_vma *vma)
+void i915_vma_put_pages(struct i915_vma *vma)
 {
 	if (atomic_add_unless(&vma->pages_count, -1, 1))
 		return;
@@ -849,9 +860,13 @@ static void vma_unbind_pages(struct i915_vma *vma)
 	/* The upper portion of pages_count is the number of bindings */
 	count = atomic_read(&vma->pages_count);
 	count >>= I915_VMA_PAGES_BIAS;
-	GEM_BUG_ON(!count);
+	if (count)
+		__vma_put_pages(vma, count | count << I915_VMA_PAGES_BIAS);
+}
 
-	__vma_put_pages(vma, count | count << I915_VMA_PAGES_BIAS);
+static int __wait_for_unbind(struct i915_vma *vma, unsigned int flags)
+{
+	return __i915_vma_wait_excl(vma, false, flags);
 }
 
 int i915_vma_pin(struct i915_vma *vma, u64 size, u64 alignment, u64 flags)
@@ -870,13 +885,17 @@ int i915_vma_pin(struct i915_vma *vma, u64 size, u64 alignment, u64 flags)
 	if (i915_vma_pin_inplace(vma, flags & I915_VMA_BIND_MASK))
 		return 0;
 
-	err = vma_get_pages(vma);
+	err = i915_vma_get_pages(vma);
 	if (err)
 		return err;
 
 	if (flags & PIN_GLOBAL)
 		wakeref = intel_runtime_pm_get(&vma->vm->i915->runtime_pm);
 
+	err = __wait_for_unbind(vma, flags);
+	if (err)
+		goto err_rpm;
+
 	if (flags & vma->vm->bind_async_flags) {
 		work = i915_vma_work();
 		if (!work) {
@@ -949,6 +968,10 @@ int i915_vma_pin(struct i915_vma *vma, u64 size, u64 alignment, u64 flags)
 		goto err_unlock;
 
 	if (!(bound & I915_VMA_BIND_MASK)) {
+		err = __wait_for_unbind(vma, flags);
+		if (err)
+			goto err_active;
+
 		err = i915_vma_insert(vma, size, alignment, flags);
 		if (err)
 			goto err_active;
@@ -968,6 +991,7 @@ int i915_vma_pin(struct i915_vma *vma, u64 size, u64 alignment, u64 flags)
 	GEM_BUG_ON(bound + I915_VMA_PAGES_ACTIVE < bound);
 	atomic_add(I915_VMA_PAGES_ACTIVE, &vma->pages_count);
 	list_move_tail(&vma->vm_link, &vma->vm->bound_list);
+	GEM_BUG_ON(!i915_vma_is_active(vma));
 
 	__i915_vma_pin(vma);
 	GEM_BUG_ON(!i915_vma_is_pinned(vma));
@@ -989,7 +1013,7 @@ int i915_vma_pin(struct i915_vma *vma, u64 size, u64 alignment, u64 flags)
 err_rpm:
 	if (wakeref)
 		intel_runtime_pm_put(&vma->vm->i915->runtime_pm, wakeref);
-	vma_put_pages(vma);
+	i915_vma_put_pages(vma);
 	return err;
 }
 
@@ -1093,6 +1117,7 @@ void i915_vma_release(struct kref *ref)
 		GEM_BUG_ON(drm_mm_node_allocated(&vma->node));
 	}
 	GEM_BUG_ON(i915_vma_is_active(vma));
+	GEM_BUG_ON(!list_empty(&vma->vm_link));
 
 	if (vma->obj) {
 		struct drm_i915_gem_object *obj = vma->obj;
@@ -1152,7 +1177,7 @@ static void __i915_vma_iounmap(struct i915_vma *vma)
 {
 	GEM_BUG_ON(i915_vma_is_pinned(vma));
 
-	if (vma->iomap == NULL)
+	if (!vma->iomap)
 		return;
 
 	io_mapping_unmap(vma->iomap);
diff --git a/drivers/gpu/drm/i915/i915_vma.h b/drivers/gpu/drm/i915/i915_vma.h
index 03fea54fd573..9a26e6cbe8cd 100644
--- a/drivers/gpu/drm/i915/i915_vma.h
+++ b/drivers/gpu/drm/i915/i915_vma.h
@@ -236,6 +236,9 @@ static inline void i915_vma_unlock(struct i915_vma *vma)
 	dma_resv_unlock(vma->resv);
 }
 
+int i915_vma_get_pages(struct i915_vma *vma);
+void i915_vma_put_pages(struct i915_vma *vma);
+
 bool i915_vma_pin_inplace(struct i915_vma *vma, unsigned int flags);
 
 int __must_check
@@ -379,6 +382,7 @@ void i915_vma_make_shrinkable(struct i915_vma *vma);
 void i915_vma_make_purgeable(struct i915_vma *vma);
 
 int i915_vma_wait_for_bind(struct i915_vma *vma);
+int i915_vma_wait_for_unbind(struct i915_vma *vma);
 
 static inline int i915_vma_sync(struct i915_vma *vma)
 {
-- 
2.20.1


* [Intel-gfx] [PATCH 22/66] drm/i915/gem: Bind the fence async for execbuf
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (19 preceding siblings ...)
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 21/66] drm/i915/gem: Asynchronous GTT unbinding Chris Wilson
@ 2020-07-15 11:51 ` Chris Wilson
  2020-07-27 18:19   ` Thomas Hellström (Intel)
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 23/66] drm/i915/gem: Include cmdparser in common execbuf pinning Chris Wilson
                   ` (51 subsequent siblings)
  72 siblings, 1 reply; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:51 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

It is illegal to wait on another vma while holding the vm->mutex, as
that easily leads to ABBA deadlocks (we wait on a second vma that waits
on us to release the vm->mutex). So while the vm->mutex exists, move the
waiting outside of the lock into the async binding pipeline.
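
Condensed before/after sketch (write_fence_reg() stands in for the real
register update and is hypothetical; the await call matches the one used
in the diff):

	/* Before: a synchronous wait while holding vm->mutex (ABBA prone) */
	mutex_lock(&vm->mutex);
	err = dma_fence_wait(prev, true); /* prev may need vm->mutex to signal */
	write_fence_reg(fence);
	mutex_unlock(&vm->mutex);

	/* After: only record the dependency while under the lock ... */
	mutex_lock(&vm->mutex);
	err = i915_sw_fence_await_dma_fence(&work->chain, prev, 0,
					    GFP_NOWAIT | __GFP_NOWARN);
	mutex_unlock(&vm->mutex);
	/* ... and let the async worker apply the fence once prev signals */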

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c    |  21 +--
 drivers/gpu/drm/i915/gt/intel_ggtt_fencing.c  | 137 +++++++++++++++++-
 drivers/gpu/drm/i915/gt/intel_ggtt_fencing.h  |   5 +
 3 files changed, 151 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index bdcbb82bfc3d..af2b4aeb6df0 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -1056,15 +1056,6 @@ static int eb_reserve_vma(struct eb_vm_work *work, struct eb_bind_vma *bind)
 		return err;
 
 pin:
-	if (unlikely(exec_flags & EXEC_OBJECT_NEEDS_FENCE)) {
-		err = __i915_vma_pin_fence(vma); /* XXX no waiting */
-		if (unlikely(err))
-			return err;
-
-		if (vma->fence)
-			bind->ev->flags |= __EXEC_OBJECT_HAS_FENCE;
-	}
-
 	bind_flags &= ~atomic_read(&vma->flags);
 	if (bind_flags) {
 		err = set_bind_fence(vma, work);
@@ -1095,6 +1086,15 @@ static int eb_reserve_vma(struct eb_vm_work *work, struct eb_bind_vma *bind)
 	bind->ev->flags |= __EXEC_OBJECT_HAS_PIN;
 	GEM_BUG_ON(eb_vma_misplaced(entry, vma, bind->ev->flags));
 
+	if (unlikely(exec_flags & EXEC_OBJECT_NEEDS_FENCE)) {
+		err = __i915_vma_pin_fence_async(vma, &work->base);
+		if (unlikely(err))
+			return err;
+
+		if (vma->fence)
+			bind->ev->flags |= __EXEC_OBJECT_HAS_FENCE;
+	}
+
 	return 0;
 }
 
@@ -1160,6 +1160,9 @@ static void __eb_bind_vma(struct eb_vm_work *work)
 		struct eb_bind_vma *bind = &work->bind[n];
 		struct i915_vma *vma = bind->ev->vma;
 
+		if (bind->ev->flags & __EXEC_OBJECT_HAS_FENCE)
+			__i915_vma_apply_fence_async(vma);
+
 		if (!bind->bind_flags)
 			goto put;
 
diff --git a/drivers/gpu/drm/i915/gt/intel_ggtt_fencing.c b/drivers/gpu/drm/i915/gt/intel_ggtt_fencing.c
index 7fb36b12fe7a..734b6aa61809 100644
--- a/drivers/gpu/drm/i915/gt/intel_ggtt_fencing.c
+++ b/drivers/gpu/drm/i915/gt/intel_ggtt_fencing.c
@@ -21,10 +21,13 @@
  * IN THE SOFTWARE.
  */
 
+#include "i915_active.h"
 #include "i915_drv.h"
 #include "i915_scatterlist.h"
+#include "i915_sw_fence_work.h"
 #include "i915_pvinfo.h"
 #include "i915_vgpu.h"
+#include "i915_vma.h"
 
 /**
  * DOC: fence register handling
@@ -340,19 +343,37 @@ static struct i915_fence_reg *fence_find(struct i915_ggtt *ggtt)
 	return ERR_PTR(-EDEADLK);
 }
 
+static int fence_wait_bind(struct i915_fence_reg *reg)
+{
+	struct dma_fence *fence;
+	int err = 0;
+
+	fence = i915_active_fence_get(&reg->active.excl);
+	if (fence) {
+		err = dma_fence_wait(fence, true);
+		dma_fence_put(fence);
+	}
+
+	return err;
+}
+
 int __i915_vma_pin_fence(struct i915_vma *vma)
 {
 	struct i915_ggtt *ggtt = i915_vm_to_ggtt(vma->vm);
-	struct i915_fence_reg *fence;
+	struct i915_fence_reg *fence = vma->fence;
 	struct i915_vma *set = i915_gem_object_is_tiled(vma->obj) ? vma : NULL;
 	int err;
 
 	lockdep_assert_held(&vma->vm->mutex);
 
 	/* Just update our place in the LRU if our fence is getting reused. */
-	if (vma->fence) {
-		fence = vma->fence;
+	if (fence) {
 		GEM_BUG_ON(fence->vma != vma);
+
+		err = fence_wait_bind(fence);
+		if (err)
+			return err;
+
 		atomic_inc(&fence->pin_count);
 		if (!fence->dirty) {
 			list_move_tail(&fence->link, &ggtt->fence_list);
@@ -384,6 +405,116 @@ int __i915_vma_pin_fence(struct i915_vma *vma)
 	return err;
 }
 
+static int set_bind_fence(struct i915_fence_reg *fence,
+			  struct dma_fence_work *work)
+{
+	struct dma_fence *prev;
+	int err;
+
+	if (rcu_access_pointer(fence->active.excl.fence) == &work->dma)
+		return 0;
+
+	err = i915_sw_fence_await_active(&work->chain,
+					 &fence->active,
+					 I915_ACTIVE_AWAIT_ACTIVE);
+	if (err)
+		return err;
+
+	if (i915_active_acquire(&fence->active))
+		return -ENOENT;
+
+	prev = i915_active_set_exclusive(&fence->active, &work->dma);
+	if (unlikely(prev)) {
+		err = i915_sw_fence_await_dma_fence(&work->chain, prev, 0,
+						    GFP_NOWAIT | __GFP_NOWARN);
+		dma_fence_put(prev);
+	}
+
+	i915_active_release(&fence->active);
+	return err < 0 ? err : 0;
+}
+
+int __i915_vma_pin_fence_async(struct i915_vma *vma,
+			       struct dma_fence_work *work)
+{
+	struct i915_ggtt *ggtt = i915_vm_to_ggtt(vma->vm);
+	struct i915_vma *set = i915_gem_object_is_tiled(vma->obj) ? vma : NULL;
+	struct i915_fence_reg *fence = vma->fence;
+	int err;
+
+	lockdep_assert_held(&vma->vm->mutex);
+
+	/* Just update our place in the LRU if our fence is getting reused. */
+	if (fence) {
+		GEM_BUG_ON(fence->vma != vma);
+		GEM_BUG_ON(!i915_vma_is_map_and_fenceable(vma));
+	} else if (set) {
+		if (!i915_vma_is_map_and_fenceable(vma))
+			return -EINVAL;
+
+		fence = fence_find(ggtt);
+		if (IS_ERR(fence))
+			return -ENOSPC;
+
+		GEM_BUG_ON(atomic_read(&fence->pin_count));
+		fence->dirty = true;
+	} else {
+		return 0;
+	}
+
+	atomic_inc(&fence->pin_count);
+	list_move_tail(&fence->link, &ggtt->fence_list);
+	if (!fence->dirty)
+		return 0;
+
+	if (INTEL_GEN(fence_to_i915(fence)) < 4 &&
+	    rcu_access_pointer(vma->active.excl.fence) != &work->dma) {
+		/* implicit 'unfenced' GPU blits */
+		err = i915_sw_fence_await_active(&work->chain,
+						 &vma->active,
+						 I915_ACTIVE_AWAIT_ACTIVE);
+		if (err)
+			goto err_unpin;
+	}
+
+	err = set_bind_fence(fence, work);
+	if (err)
+		goto err_unpin;
+
+	if (set) {
+		fence->start = vma->node.start;
+		fence->size  = vma->fence_size;
+		fence->stride = i915_gem_object_get_stride(vma->obj);
+		fence->tiling = i915_gem_object_get_tiling(vma->obj);
+
+		vma->fence = fence;
+	} else {
+		fence->tiling = 0;
+		vma->fence = NULL;
+	}
+
+	set = xchg(&fence->vma, set);
+	if (set && set != vma) {
+		GEM_BUG_ON(set->fence != fence);
+		WRITE_ONCE(set->fence, NULL);
+		i915_vma_revoke_mmap(set);
+	}
+
+	return 0;
+
+err_unpin:
+	atomic_dec(&fence->pin_count);
+	return err;
+}
+
+void __i915_vma_apply_fence_async(struct i915_vma *vma)
+{
+	struct i915_fence_reg *fence = vma->fence;
+
+	if (fence->dirty)
+		fence_write(fence);
+}
+
 /**
  * i915_vma_pin_fence - set up fencing for a vma
  * @vma: vma to map through a fence reg
diff --git a/drivers/gpu/drm/i915/gt/intel_ggtt_fencing.h b/drivers/gpu/drm/i915/gt/intel_ggtt_fencing.h
index 9eef679e1311..d306ac14d47e 100644
--- a/drivers/gpu/drm/i915/gt/intel_ggtt_fencing.h
+++ b/drivers/gpu/drm/i915/gt/intel_ggtt_fencing.h
@@ -30,6 +30,7 @@
 
 #include "i915_active.h"
 
+struct dma_fence_work;
 struct drm_i915_gem_object;
 struct i915_ggtt;
 struct i915_vma;
@@ -70,6 +71,10 @@ void i915_gem_object_do_bit_17_swizzle(struct drm_i915_gem_object *obj,
 void i915_gem_object_save_bit_17_swizzle(struct drm_i915_gem_object *obj,
 					 struct sg_table *pages);
 
+int __i915_vma_pin_fence_async(struct i915_vma *vma,
+			       struct dma_fence_work *work);
+void __i915_vma_apply_fence_async(struct i915_vma *vma);
+
 void intel_ggtt_init_fences(struct i915_ggtt *ggtt);
 void intel_ggtt_fini_fences(struct i915_ggtt *ggtt);
 
-- 
2.20.1


* [Intel-gfx] [PATCH 23/66] drm/i915/gem: Include cmdparser in common execbuf pinning
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (20 preceding siblings ...)
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 22/66] drm/i915/gem: Bind the fence async for execbuf Chris Wilson
@ 2020-07-15 11:51 ` Chris Wilson
  2020-07-31  9:43   ` Thomas Hellström (Intel)
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 24/66] drm/i915/gem: Include secure batch " Chris Wilson
                   ` (50 subsequent siblings)
  72 siblings, 1 reply; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:51 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

Pull the cmdparser allocations into the reservation phase, so that
they are included in the common vma pinning pass.
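
The shape of the change, as a simplified standalone sketch (plain C
with invented names such as "struct obj" and bind_all(); not the real
eb_vma/bind_list machinery): the shadow buffer is allocated during
reservation and threaded onto the same list as the user objects, so a
single loop does all the pinning.

#include <stdio.h>

struct obj {
	const char *name;
	int pinned;
	struct obj *next;
};

static void add_to_bind_list(struct obj **list, struct obj *o)
{
	o->next = *list;
	*list = o;
}

static void bind_all(struct obj *list)
{
	struct obj *o;

	for (o = list; o; o = o->next) {
		o->pinned = 1;	/* stand-in for the common vma pinning */
		printf("pinned %s\n", o->name);
	}
}

int main(void)
{
	struct obj batch = { .name = "user batch" };
	struct obj shadow = { .name = "cmdparser shadow" };
	struct obj *bind_list = NULL;

	add_to_bind_list(&bind_list, &batch);
	/* allocated up front, before the common pinning pass runs */
	add_to_bind_list(&bind_list, &shadow);

	bind_all(bind_list);
	return 0;
}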

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c    | 360 +++++++++++-------
 drivers/gpu/drm/i915/gem/i915_gem_object.h    |  10 +
 drivers/gpu/drm/i915/i915_cmd_parser.c        |  21 +-
 3 files changed, 230 insertions(+), 161 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index af2b4aeb6df0..8c1f3528b1e9 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -25,6 +25,7 @@
 #include "i915_gem_clflush.h"
 #include "i915_gem_context.h"
 #include "i915_gem_ioctls.h"
+#include "i915_memcpy.h"
 #include "i915_sw_fence_work.h"
 #include "i915_trace.h"
 
@@ -52,6 +53,7 @@ struct eb_bind_vma {
 
 struct eb_vma_array {
 	struct kref kref;
+	struct list_head aux_list;
 	struct eb_vma vma[];
 };
 
@@ -246,7 +248,6 @@ struct i915_execbuffer {
 
 	struct i915_request *request; /** our request to build */
 	struct eb_vma *batch; /** identity of the batch obj/vma */
-	struct i915_vma *trampoline; /** trampoline used for chaining */
 
 	/** actual size of execobj[] as we may extend it for the cmdparser */
 	unsigned int buffer_count;
@@ -281,6 +282,11 @@ struct i915_execbuffer {
 		unsigned int rq_size;
 	} reloc_cache;
 
+	struct eb_cmdparser {
+		struct eb_vma *shadow;
+		struct eb_vma *trampoline;
+	} parser;
+
 	u64 invalid_flags; /** Set of execobj.flags that are invalid */
 	u32 context_flags; /** Set of execobj.flags to insert from the ctx */
 
@@ -298,6 +304,10 @@ struct i915_execbuffer {
 	struct eb_vma_array *array;
 };
 
+static struct drm_i915_gem_exec_object2 no_entry = {
+	.offset = -1ull
+};
+
 static inline bool eb_use_cmdparser(const struct i915_execbuffer *eb)
 {
 	return intel_engine_requires_cmd_parser(eb->engine) ||
@@ -314,6 +324,7 @@ static struct eb_vma_array *eb_vma_array_create(unsigned int count)
 		return NULL;
 
 	kref_init(&arr->kref);
+	INIT_LIST_HEAD(&arr->aux_list);
 	arr->vma[0].vma = NULL;
 
 	return arr;
@@ -339,16 +350,31 @@ static inline void eb_unreserve_vma(struct eb_vma *ev)
 		       __EXEC_OBJECT_HAS_FENCE);
 }
 
+static void eb_vma_destroy(struct eb_vma *ev)
+{
+	eb_unreserve_vma(ev);
+	i915_vma_put(ev->vma);
+}
+
+static void eb_destroy_aux(struct eb_vma_array *arr)
+{
+	struct eb_vma *ev, *en;
+
+	list_for_each_entry_safe(ev, en, &arr->aux_list, reloc_link) {
+		eb_vma_destroy(ev);
+		kfree(ev);
+	}
+}
+
 static void eb_vma_array_destroy(struct kref *kref)
 {
 	struct eb_vma_array *arr = container_of(kref, typeof(*arr), kref);
-	struct eb_vma *ev = arr->vma;
+	struct eb_vma *ev;
 
-	while (ev->vma) {
-		eb_unreserve_vma(ev);
-		i915_vma_put(ev->vma);
-		ev++;
-	}
+	eb_destroy_aux(arr);
+
+	for (ev = arr->vma; ev->vma; ev++)
+		eb_vma_destroy(ev);
 
 	kvfree(arr);
 }
@@ -396,8 +422,8 @@ eb_lock_vma(struct i915_execbuffer *eb, struct ww_acquire_ctx *acquire)
 
 static int eb_create(struct i915_execbuffer *eb)
 {
-	/* Allocate an extra slot for use by the command parser + sentinel */
-	eb->array = eb_vma_array_create(eb->buffer_count + 2);
+	/* Allocate an extra slot for use by the sentinel */
+	eb->array = eb_vma_array_create(eb->buffer_count + 1);
 	if (!eb->array)
 		return -ENOMEM;
 
@@ -1078,7 +1104,7 @@ static int eb_reserve_vma(struct eb_vm_work *work, struct eb_bind_vma *bind)
 	GEM_BUG_ON(!(drm_mm_node_allocated(&vma->node) ^
 		     drm_mm_node_allocated(&bind->hole)));
 
-	if (entry->offset != vma->node.start) {
+	if (entry != &no_entry && entry->offset != vma->node.start) {
 		entry->offset = vma->node.start | UPDATE;
 		*work->p_flags |= __EXEC_HAS_RELOC;
 	}
@@ -1371,7 +1397,8 @@ static int eb_reserve_vm(struct i915_execbuffer *eb)
 		struct i915_vma *vma = ev->vma;
 
 		if (eb_pin_vma_inplace(eb, entry, ev)) {
-			if (entry->offset != vma->node.start) {
+			if (entry != &no_entry &&
+			    entry->offset != vma->node.start) {
 				entry->offset = vma->node.start | UPDATE;
 				eb->args->flags |= __EXEC_HAS_RELOC;
 			}
@@ -1542,6 +1569,113 @@ static int eb_reserve_vm(struct i915_execbuffer *eb)
 	} while (1);
 }
 
+static int eb_alloc_cmdparser(struct i915_execbuffer *eb)
+{
+	struct intel_gt_buffer_pool_node *pool;
+	struct i915_vma *vma;
+	struct eb_vma *ev;
+	unsigned int len;
+	int err;
+
+	if (range_overflows_t(u64,
+			      eb->batch_start_offset, eb->batch_len,
+			      eb->batch->vma->size)) {
+		drm_dbg(&eb->i915->drm,
+			"Attempting to use out-of-bounds batch\n");
+		return -EINVAL;
+	}
+
+	if (eb->batch_len == 0)
+		eb->batch_len = eb->batch->vma->size - eb->batch_start_offset;
+
+	if (!eb_use_cmdparser(eb))
+		return 0;
+
+	len = eb->batch_len;
+	if (!CMDPARSER_USES_GGTT(eb->i915)) {
+		/*
+		 * ppGTT backed shadow buffers must be mapped RO, to prevent
+		 * post-scan tampering
+		 */
+		if (!eb->context->vm->has_read_only) {
+			drm_dbg(&eb->i915->drm,
+				"Cannot prevent post-scan tampering without RO capable vm\n");
+			return -EINVAL;
+		}
+	} else {
+		len += I915_CMD_PARSER_TRAMPOLINE_SIZE;
+	}
+
+	pool = intel_gt_get_buffer_pool(eb->engine->gt, len);
+	if (IS_ERR(pool))
+		return PTR_ERR(pool);
+
+	ev = kzalloc(sizeof(*ev), GFP_KERNEL);
+	if (!ev) {
+		err = -ENOMEM;
+		goto err_pool;
+	}
+
+	vma = i915_vma_instance(pool->obj, eb->context->vm, NULL);
+	if (IS_ERR(vma)) {
+		err = PTR_ERR(vma);
+		goto err_ev;
+	}
+	i915_gem_object_set_readonly(vma->obj);
+	i915_gem_object_set_cache_coherency(vma->obj, I915_CACHE_LLC);
+	vma->private = pool;
+
+	ev->vma = i915_vma_get(vma);
+	ev->exec = &no_entry;
+	list_add(&ev->reloc_link, &eb->array->aux_list);
+	list_add(&ev->bind_link, &eb->bind_list);
+	list_add(&ev->submit_link, &eb->submit_list);
+
+	if (CMDPARSER_USES_GGTT(eb->i915)) {
+		eb->parser.trampoline = ev;
+
+		/*
+		 * Special care when binding will be required for full-ppgtt
+		 * as there will be distinct vm involved, and we will need to
+		 * separate the binding/eviction passes (different vm->mutex).
+		 */
+		if (GEM_WARN_ON(eb->context->vm != &eb->engine->gt->ggtt->vm)) {
+			ev = kzalloc(sizeof(*ev), GFP_KERNEL);
+			if (!ev) {
+				err = -ENOMEM;
+				goto err_pool;
+			}
+
+			vma = i915_vma_instance(pool->obj,
+						&eb->engine->gt->ggtt->vm,
+						NULL);
+			if (IS_ERR(vma)) {
+				err = PTR_ERR(vma);
+				goto err_ev;
+			}
+			vma->private = pool;
+
+			ev->vma = i915_vma_get(vma);
+			ev->exec = &no_entry;
+			list_add(&ev->reloc_link, &eb->array->aux_list);
+			list_add(&ev->bind_link, &eb->bind_list);
+			list_add(&ev->submit_link, &eb->submit_list);
+		}
+
+		ev->flags = EXEC_OBJECT_NEEDS_GTT;
+		eb->batch_flags |= I915_DISPATCH_SECURE;
+	}
+
+	eb->parser.shadow = ev;
+	return 0;
+
+err_ev:
+	kfree(ev);
+err_pool:
+	intel_gt_buffer_pool_put(pool);
+	return err;
+}
+
 static unsigned int eb_batch_index(const struct i915_execbuffer *eb)
 {
 	if (eb->args->flags & I915_EXEC_BATCH_FIRST)
@@ -1683,9 +1817,15 @@ static int eb_lookup_vmas(struct i915_execbuffer *eb)
 
 		eb_add_vma(eb, i, batch, vma);
 	}
-
 	eb->vma[i].vma = NULL;
-	return err;
+	if (err)
+		return err;
+
+	err = eb_alloc_cmdparser(eb);
+	if (err)
+		return err;
+
+	return 0;
 }
 
 static struct eb_vma *
@@ -1712,9 +1852,7 @@ static void eb_destroy(const struct i915_execbuffer *eb)
 {
 	GEM_BUG_ON(eb->reloc_cache.rq);
 
-	if (eb->array)
-		eb_vma_array_put(eb->array);
-
+	eb_vma_array_put(eb->array);
 	if (eb->lut_size > 0)
 		kfree(eb->buckets);
 }
@@ -2416,8 +2554,6 @@ static int eb_move_to_gpu(struct i915_execbuffer *eb)
 	}
 	ww_acquire_fini(&acquire);
 
-	eb_vma_array_put(fetch_and_zero(&eb->array));
-
 	if (unlikely(err))
 		goto err_skip;
 
@@ -2481,25 +2617,6 @@ static int i915_reset_gen7_sol_offsets(struct i915_request *rq)
 	return 0;
 }
 
-static struct i915_vma *
-shadow_batch_pin(struct drm_i915_gem_object *obj,
-		 struct i915_address_space *vm,
-		 unsigned int flags)
-{
-	struct i915_vma *vma;
-	int err;
-
-	vma = i915_vma_instance(obj, vm, NULL);
-	if (IS_ERR(vma))
-		return vma;
-
-	err = i915_vma_pin(vma, 0, 0, flags);
-	if (err)
-		return ERR_PTR(err);
-
-	return vma;
-}
-
 struct eb_parse_work {
 	struct dma_fence_work base;
 	struct intel_engine_cs *engine;
@@ -2522,9 +2639,18 @@ static int __eb_parse(struct dma_fence_work *work)
 				       pw->trampoline);
 }
 
+static void __eb_parse_release(struct dma_fence_work *work)
+{
+	struct eb_parse_work *pw = container_of(work, typeof(*pw), base);
+
+	i915_gem_object_unpin_pages(pw->shadow->obj);
+	i915_gem_object_unpin_pages(pw->batch->obj);
+}
+
 static const struct dma_fence_work_ops eb_parse_ops = {
 	.name = "eb_parse",
 	.work = __eb_parse,
+	.release = __eb_parse_release,
 };
 
 static inline int
@@ -2542,36 +2668,51 @@ parser_mark_active(struct eb_parse_work *pw, struct intel_timeline *tl)
 {
 	int err;
 
+	GEM_BUG_ON(pw->trampoline &&
+		   pw->trampoline->private != pw->shadow->private);
+
 	err = i915_active_ref(&pw->batch->active,
 			      tl->fence_context,
 			      &pw->base.dma);
 	if (err)
 		return err;
 
-	err = __parser_mark_active(pw->shadow, tl, &pw->base.dma);
-	if (err)
-		return err;
-
-	if (pw->trampoline) {
-		err = __parser_mark_active(pw->trampoline, tl, &pw->base.dma);
-		if (err)
-			return err;
-	}
-
-	return 0;
+	return __parser_mark_active(pw->shadow, tl, &pw->base.dma);
 }
 
 static int eb_parse_pipeline(struct i915_execbuffer *eb,
 			     struct i915_vma *shadow,
 			     struct i915_vma *trampoline)
 {
+	struct i915_vma *batch = eb->batch->vma;
 	struct eb_parse_work *pw;
+	void *ptr;
 	int err;
 
+	GEM_BUG_ON(!i915_vma_is_pinned(shadow));
+	GEM_BUG_ON(trampoline && !i915_vma_is_pinned(trampoline));
+
 	pw = kzalloc(sizeof(*pw), GFP_KERNEL);
 	if (!pw)
 		return -ENOMEM;
 
+	ptr = i915_gem_object_pin_map(shadow->obj, I915_MAP_FORCE_WB);
+	if (IS_ERR(ptr)) {
+		err = PTR_ERR(ptr);
+		goto err_free;
+	}
+
+	if (!(batch->obj->cache_coherent & I915_BO_CACHE_COHERENT_FOR_READ) &&
+	    i915_has_memcpy_from_wc()) {
+		ptr = i915_gem_object_pin_map(batch->obj, I915_MAP_WC);
+		if (IS_ERR(ptr)) {
+			err = PTR_ERR(ptr);
+			goto err_dst;
+		}
+	} else {
+		__i915_gem_object_pin_pages(batch->obj);
+	}
+
 	dma_fence_work_init(&pw->base, &eb_parse_ops);
 
 	pw->engine = eb->engine;
@@ -2620,86 +2761,36 @@ static int eb_parse_pipeline(struct i915_execbuffer *eb,
 	i915_sw_fence_set_error_once(&pw->base.chain, err);
 	dma_fence_work_commit_imm(&pw->base);
 	return err;
+
+err_dst:
+	i915_gem_object_unpin_pages(shadow->obj);
+err_free:
+	kfree(pw);
+	return err;
 }
 
 static int eb_parse(struct i915_execbuffer *eb)
 {
-	struct drm_i915_private *i915 = eb->i915;
-	struct intel_gt_buffer_pool_node *pool;
-	struct i915_vma *shadow, *trampoline;
-	unsigned int len;
 	int err;
 
-	if (!eb_use_cmdparser(eb))
-		return 0;
-
-	len = eb->batch_len;
-	if (!CMDPARSER_USES_GGTT(eb->i915)) {
-		/*
-		 * ppGTT backed shadow buffers must be mapped RO, to prevent
-		 * post-scan tampering
-		 */
-		if (!eb->context->vm->has_read_only) {
-			drm_dbg(&i915->drm,
-				"Cannot prevent post-scan tampering without RO capable vm\n");
-			return -EINVAL;
-		}
-	} else {
-		len += I915_CMD_PARSER_TRAMPOLINE_SIZE;
-	}
-
-	pool = intel_gt_get_buffer_pool(eb->engine->gt, len);
-	if (IS_ERR(pool))
-		return PTR_ERR(pool);
-
-	shadow = shadow_batch_pin(pool->obj, eb->context->vm, PIN_USER);
-	if (IS_ERR(shadow)) {
-		err = PTR_ERR(shadow);
-		goto err;
+	if (unlikely(eb->batch->flags & EXEC_OBJECT_WRITE)) {
+		drm_dbg(&eb->i915->drm,
+			"Attempting to use self-modifying batch buffer\n");
+		return -EINVAL;
 	}
-	i915_gem_object_set_readonly(shadow->obj);
-	shadow->private = pool;
-
-	trampoline = NULL;
-	if (CMDPARSER_USES_GGTT(eb->i915)) {
-		trampoline = shadow;
-
-		shadow = shadow_batch_pin(pool->obj,
-					  &eb->engine->gt->ggtt->vm,
-					  PIN_GLOBAL);
-		if (IS_ERR(shadow)) {
-			err = PTR_ERR(shadow);
-			shadow = trampoline;
-			goto err_shadow;
-		}
-		shadow->private = pool;
 
-		eb->batch_flags |= I915_DISPATCH_SECURE;
-	}
+	if (!eb->parser.shadow)
+		return 0;
 
-	err = eb_parse_pipeline(eb, shadow, trampoline);
+	err = eb_parse_pipeline(eb,
+				eb->parser.shadow->vma,
+				eb->parser.trampoline ? eb->parser.trampoline->vma : NULL);
 	if (err)
-		goto err_trampoline;
-
-	eb->batch = &eb->vma[eb->buffer_count++];
-	eb->batch->vma = i915_vma_get(shadow);
-	eb->batch->flags = __EXEC_OBJECT_HAS_PIN;
-	list_add_tail(&eb->batch->submit_link, &eb->submit_list);
-	eb->vma[eb->buffer_count].vma = NULL;
+		return err;
 
-	eb->trampoline = trampoline;
+	eb->batch = eb->parser.shadow;
 	eb->batch_start_offset = 0;
-
 	return 0;
-
-err_trampoline:
-	if (trampoline)
-		i915_vma_unpin(trampoline);
-err_shadow:
-	i915_vma_unpin(shadow);
-err:
-	intel_gt_buffer_pool_put(pool);
-	return err;
 }
 
 static void
@@ -2748,10 +2839,10 @@ static int eb_submit(struct i915_execbuffer *eb, struct i915_vma *batch)
 	if (err)
 		return err;
 
-	if (eb->trampoline) {
+	if (eb->parser.trampoline) {
 		GEM_BUG_ON(eb->batch_start_offset);
 		err = eb->engine->emit_bb_start(eb->request,
-						eb->trampoline->node.start +
+						eb->parser.trampoline->vma->node.start +
 						eb->batch_len,
 						0, 0);
 		if (err)
@@ -3242,7 +3333,7 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 	eb.buffer_count = args->buffer_count;
 	eb.batch_start_offset = args->batch_start_offset;
 	eb.batch_len = args->batch_len;
-	eb.trampoline = NULL;
+	memset(&eb.parser, 0, sizeof(eb.parser));
 
 	eb.batch_flags = 0;
 	if (args->flags & I915_EXEC_SECURE) {
@@ -3317,24 +3408,6 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 		goto err_vma;
 	}
 
-	if (unlikely(eb.batch->flags & EXEC_OBJECT_WRITE)) {
-		drm_dbg(&i915->drm,
-			"Attempting to use self-modifying batch buffer\n");
-		err = -EINVAL;
-		goto err_vma;
-	}
-
-	if (range_overflows_t(u64,
-			      eb.batch_start_offset, eb.batch_len,
-			      eb.batch->vma->size)) {
-		drm_dbg(&i915->drm, "Attempting to use out-of-bounds batch\n");
-		err = -EINVAL;
-		goto err_vma;
-	}
-
-	if (eb.batch_len == 0)
-		eb.batch_len = eb.batch->vma->size - eb.batch_start_offset;
-
 	err = eb_parse(&eb);
 	if (err)
 		goto err_vma;
@@ -3360,7 +3433,7 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 		vma = i915_gem_object_ggtt_pin(batch->obj, NULL, 0, 0, 0);
 		if (IS_ERR(vma)) {
 			err = PTR_ERR(vma);
-			goto err_parse;
+			goto err_vma;
 		}
 
 		GEM_BUG_ON(vma->obj != batch->obj);
@@ -3412,8 +3485,9 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 	 * to explicitly hold another reference here.
 	 */
 	eb.request->batch = batch;
-	if (batch->private)
-		intel_gt_buffer_pool_mark_active(batch->private, eb.request);
+	if (eb.parser.shadow)
+		intel_gt_buffer_pool_mark_active(eb.parser.shadow->vma->private,
+						 eb.request);
 
 	trace_i915_request_queue(eb.request, eb.batch_flags);
 	err = eb_submit(&eb, batch);
@@ -3430,18 +3504,14 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 err_batch_unpin:
 	if (eb.batch_flags & I915_DISPATCH_SECURE)
 		i915_vma_unpin(batch);
-err_parse:
-	if (batch->private)
-		intel_gt_buffer_pool_put(batch->private);
-	i915_vma_put(batch);
 err_vma:
-	if (eb.trampoline)
-		i915_vma_unpin(eb.trampoline);
 	eb_unlock_engine(&eb);
 	/* *** TIMELINE UNLOCK *** */
 err_engine:
 	eb_unpin_engine(&eb);
 err_context:
+	if (eb.parser.shadow)
+		intel_gt_buffer_pool_put(eb.parser.shadow->vma->private);
 	i915_gem_context_put(eb.gem_context);
 err_destroy:
 	eb_destroy(&eb);
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_object.h b/drivers/gpu/drm/i915/gem/i915_gem_object.h
index e5b9276d254c..6f60687b6be2 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_object.h
+++ b/drivers/gpu/drm/i915/gem/i915_gem_object.h
@@ -368,6 +368,16 @@ enum i915_map_type {
 void *__must_check i915_gem_object_pin_map(struct drm_i915_gem_object *obj,
 					   enum i915_map_type type);
 
+static inline void *__i915_gem_object_mapping(struct drm_i915_gem_object *obj)
+{
+	return page_mask_bits(obj->mm.mapping);
+}
+
+static inline int __i915_gem_object_mapping_type(struct drm_i915_gem_object *obj)
+{
+	return page_unmask_bits(obj->mm.mapping);
+}
+
 void __i915_gem_object_flush_map(struct drm_i915_gem_object *obj,
 				 unsigned long offset,
 				 unsigned long size);
diff --git a/drivers/gpu/drm/i915/i915_cmd_parser.c b/drivers/gpu/drm/i915/i915_cmd_parser.c
index 372354d33f55..dc8770206bb8 100644
--- a/drivers/gpu/drm/i915/i915_cmd_parser.c
+++ b/drivers/gpu/drm/i915/i915_cmd_parser.c
@@ -1140,29 +1140,22 @@ static u32 *copy_batch(struct drm_i915_gem_object *dst_obj,
 {
 	bool needs_clflush;
 	void *dst, *src;
-	int ret;
 
-	dst = i915_gem_object_pin_map(dst_obj, I915_MAP_FORCE_WB);
-	if (IS_ERR(dst))
-		return dst;
+	GEM_BUG_ON(!i915_gem_object_has_pages(src_obj));
 
-	ret = i915_gem_object_pin_pages(src_obj);
-	if (ret) {
-		i915_gem_object_unpin_map(dst_obj);
-		return ERR_PTR(ret);
-	}
+	dst = __i915_gem_object_mapping(dst_obj);
+	GEM_BUG_ON(!dst);
 
 	needs_clflush =
 		!(src_obj->cache_coherent & I915_BO_CACHE_COHERENT_FOR_READ);
 
 	src = ERR_PTR(-ENODEV);
 	if (needs_clflush && i915_has_memcpy_from_wc()) {
-		src = i915_gem_object_pin_map(src_obj, I915_MAP_WC);
-		if (!IS_ERR(src)) {
+		if (__i915_gem_object_mapping_type(src_obj) == I915_MAP_WC) {
+			src = __i915_gem_object_mapping(src_obj);
 			i915_unaligned_memcpy_from_wc(dst,
 						      src + offset,
 						      length);
-			i915_gem_object_unpin_map(src_obj);
 		}
 	}
 	if (IS_ERR(src)) {
@@ -1198,9 +1191,6 @@ static u32 *copy_batch(struct drm_i915_gem_object *dst_obj,
 		}
 	}
 
-	i915_gem_object_unpin_pages(src_obj);
-
-	/* dst_obj is returned with vmap pinned */
 	return dst;
 }
 
@@ -1546,7 +1536,6 @@ int intel_engine_cmd_parser(struct intel_engine_cs *engine,
 
 	if (!IS_ERR_OR_NULL(jump_whitelist))
 		kfree(jump_whitelist);
-	i915_gem_object_unpin_map(shadow->obj);
 	return ret;
 }
 
-- 
2.20.1


* [Intel-gfx] [PATCH 24/66] drm/i915/gem: Include secure batch in common execbuf pinning
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (21 preceding siblings ...)
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 23/66] drm/i915/gem: Include cmdparser in common execbuf pinning Chris Wilson
@ 2020-07-15 11:51 ` Chris Wilson
  2020-07-31  9:47   ` Thomas Hellström (Intel)
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 25/66] drm/i915/gem: Reintroduce multiple passes for reloc processing Chris Wilson
                   ` (49 subsequent siblings)
  72 siblings, 1 reply; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:51 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

Pull the GGTT binding for the secure batch dispatch into the common vma
pinning routine for execbuf, so that there is a single central place
for all i915_vma_pin() calls.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c    | 88 +++++++++++--------
 1 file changed, 51 insertions(+), 37 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index 8c1f3528b1e9..b6290c2b99c8 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -1676,6 +1676,48 @@ static int eb_alloc_cmdparser(struct i915_execbuffer *eb)
 	return err;
 }
 
+static int eb_secure_batch(struct i915_execbuffer *eb)
+{
+	struct i915_vma *vma = eb->batch->vma;
+
+	/*
+	 * snb/ivb/vlv conflate the "batch in ppgtt" bit with the "non-secure
+	 * batch" bit. Hence we need to pin secure batches into the global gtt.
+	 * hsw should have this fixed, but bdw mucks it up again.
+	 */
+	if (!(eb->batch_flags & I915_DISPATCH_SECURE))
+		return 0;
+
+	if (GEM_WARN_ON(vma->vm != &eb->engine->gt->ggtt->vm)) {
+		struct eb_vma *ev;
+
+		ev = kzalloc(sizeof(*ev), GFP_KERNEL);
+		if (!ev)
+			return -ENOMEM;
+
+		vma = i915_vma_instance(vma->obj,
+					&eb->engine->gt->ggtt->vm,
+					NULL);
+		if (IS_ERR(vma)) {
+			kfree(ev);
+			return PTR_ERR(vma);
+		}
+
+		ev->vma = i915_vma_get(vma);
+		ev->exec = &no_entry;
+
+		list_add(&ev->submit_link, &eb->submit_list);
+		list_add(&ev->reloc_link, &eb->array->aux_list);
+		list_add(&ev->bind_link, &eb->bind_list);
+
+		GEM_BUG_ON(eb->batch->vma->private);
+		eb->batch = ev;
+	}
+
+	eb->batch->flags |= EXEC_OBJECT_NEEDS_GTT;
+	return 0;
+}
+
 static unsigned int eb_batch_index(const struct i915_execbuffer *eb)
 {
 	if (eb->args->flags & I915_EXEC_BATCH_FIRST)
@@ -1825,6 +1867,10 @@ static int eb_lookup_vmas(struct i915_execbuffer *eb)
 	if (err)
 		return err;
 
+	err = eb_secure_batch(eb);
+	if (err)
+		return err;
+
 	return 0;
 }
 
@@ -2805,7 +2851,7 @@ add_to_client(struct i915_request *rq, struct drm_file *file)
 	spin_unlock(&file_priv->mm.lock);
 }
 
-static int eb_submit(struct i915_execbuffer *eb, struct i915_vma *batch)
+static int eb_submit(struct i915_execbuffer *eb)
 {
 	int err;
 
@@ -2832,7 +2878,7 @@ static int eb_submit(struct i915_execbuffer *eb, struct i915_vma *batch)
 	}
 
 	err = eb->engine->emit_bb_start(eb->request,
-					batch->node.start +
+					eb->batch->vma->node.start +
 					eb->batch_start_offset,
 					eb->batch_len,
 					eb->batch_flags);
@@ -3311,7 +3357,6 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 	struct i915_execbuffer eb;
 	struct dma_fence *in_fence = NULL;
 	struct sync_file *out_fence = NULL;
-	struct i915_vma *batch;
 	int out_fence_fd = -1;
 	int err;
 
@@ -3412,34 +3457,6 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 	if (err)
 		goto err_vma;
 
-	/*
-	 * snb/ivb/vlv conflate the "batch in ppgtt" bit with the "non-secure
-	 * batch" bit. Hence we need to pin secure batches into the global gtt.
-	 * hsw should have this fixed, but bdw mucks it up again. */
-	batch = i915_vma_get(eb.batch->vma);
-	if (eb.batch_flags & I915_DISPATCH_SECURE) {
-		struct i915_vma *vma;
-
-		/*
-		 * So on first glance it looks freaky that we pin the batch here
-		 * outside of the reservation loop. But:
-		 * - The batch is already pinned into the relevant ppgtt, so we
-		 *   already have the backing storage fully allocated.
-		 * - No other BO uses the global gtt (well contexts, but meh),
-		 *   so we don't really have issues with multiple objects not
-		 *   fitting due to fragmentation.
-		 * So this is actually safe.
-		 */
-		vma = i915_gem_object_ggtt_pin(batch->obj, NULL, 0, 0, 0);
-		if (IS_ERR(vma)) {
-			err = PTR_ERR(vma);
-			goto err_vma;
-		}
-
-		GEM_BUG_ON(vma->obj != batch->obj);
-		batch = vma;
-	}
-
 	/* All GPU relocation batches must be submitted prior to the user rq */
 	GEM_BUG_ON(eb.reloc_cache.rq);
 
@@ -3447,7 +3464,7 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 	eb.request = __i915_request_create(eb.context, GFP_KERNEL);
 	if (IS_ERR(eb.request)) {
 		err = PTR_ERR(eb.request);
-		goto err_batch_unpin;
+		goto err_vma;
 	}
 	eb.request->cookie = lockdep_pin_lock(&eb.context->timeline->mutex);
 
@@ -3484,13 +3501,13 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 	 * inactive_list and lose its active reference. Hence we do not need
 	 * to explicitly hold another reference here.
 	 */
-	eb.request->batch = batch;
+	eb.request->batch = eb.batch->vma;
 	if (eb.parser.shadow)
 		intel_gt_buffer_pool_mark_active(eb.parser.shadow->vma->private,
 						 eb.request);
 
 	trace_i915_request_queue(eb.request, eb.batch_flags);
-	err = eb_submit(&eb, batch);
+	err = eb_submit(&eb);
 err_request:
 	i915_request_get(eb.request);
 	eb_request_add(&eb);
@@ -3501,9 +3518,6 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 	add_to_client(eb.request, file);
 	i915_request_put(eb.request);
 
-err_batch_unpin:
-	if (eb.batch_flags & I915_DISPATCH_SECURE)
-		i915_vma_unpin(batch);
 err_vma:
 	eb_unlock_engine(&eb);
 	/* *** TIMELINE UNLOCK *** */
-- 
2.20.1


* [Intel-gfx] [PATCH 25/66] drm/i915/gem: Reintroduce multiple passes for reloc processing
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (22 preceding siblings ...)
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 24/66] drm/i915/gem: Include secure batch " Chris Wilson
@ 2020-07-15 11:51 ` Chris Wilson
  2020-07-31 10:05   ` Thomas Hellström (Intel)
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 26/66] drm/i915: Add an implementation for i915_gem_ww_ctx locking, v2 Chris Wilson
                   ` (48 subsequent siblings)
  72 siblings, 1 reply; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:51 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

The prospect of locking the entire submission sequence under a wide
ww_mutex re-imposes some key restrictions, in particular that we must
not call copy_(from|to)_user underneath the mutex (as the fault
handlers themselves may need to take the ww_mutex). To satisfy this
requirement, we need to split the relocation handling into multiple
phases again. After dropping the reservations, we need to allocate
enough buffer space both to copy the relocations from userspace into
and to serve as the relocation command buffer. Once we have finished
copying the relocations, we can then re-acquire all the objects for
the execbuf and rebind them, including our new relocation objects.
After we have bound all the new and old objects into their final
locations, we can then convert the relocation entries into the GPU
commands to update the relocated vma. Finally, once it is all over and
we have dropped the ww_mutex for the last time, we can then complete
the update of the user relocation entries.
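
A minimal sketch of the resulting ordering (plain C; a pthread mutex
approximates the ww_mutex and memcpy() approximates
copy_from_user()/copy_to_user(), so none of the names below are the
real execbuf code): user memory is only touched before the lock is
taken and after it has been dropped.

#include <pthread.h>
#include <stdio.h>
#include <string.h>

struct reloc {
	unsigned long offset;
	unsigned long presumed_offset;
};

static pthread_mutex_t object_lock = PTHREAD_MUTEX_INITIALIZER;

int main(void)
{
	struct reloc user[2] = { { 0x40, 0 }, { 0x80, 0 } }; /* "userspace" */
	struct reloc staging[2];
	int i;

	/* Phase 1: copy the relocations with no locks held (may fault). */
	memcpy(staging, user, sizeof(staging));

	/* Phase 2: under the lock, bind the objects and emit the commands. */
	pthread_mutex_lock(&object_lock);
	for (i = 0; i < 2; i++)
		staging[i].presumed_offset = 0x10000 + staging[i].offset;
	pthread_mutex_unlock(&object_lock);

	/* Phase 3: after the final unlock, update the user entries. */
	memcpy(user, staging, sizeof(user));

	for (i = 0; i < 2; i++)
		printf("reloc[%d] -> %#lx\n", i, user[i].presumed_offset);

	return 0;
}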

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c    | 887 +++++++++---------
 .../i915/gem/selftests/i915_gem_execbuffer.c  | 201 ++--
 2 files changed, 564 insertions(+), 524 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index b6290c2b99c8..ebabc0746d50 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -35,6 +35,7 @@ struct eb_vma {
 
 	/** This vma's place in the execbuf reservation list */
 	struct drm_i915_gem_exec_object2 *exec;
+	u32 bias;
 
 	struct list_head bind_link;
 	struct list_head unbound_link;
@@ -60,15 +61,12 @@ struct eb_vma_array {
 #define __EXEC_OBJECT_HAS_PIN		BIT(31)
 #define __EXEC_OBJECT_HAS_FENCE		BIT(30)
 #define __EXEC_OBJECT_NEEDS_MAP		BIT(29)
-#define __EXEC_OBJECT_NEEDS_BIAS	BIT(28)
-#define __EXEC_OBJECT_INTERNAL_FLAGS	(~0u << 28) /* all of the above */
+#define __EXEC_OBJECT_INTERNAL_FLAGS	(~0u << 29) /* all of the above */
 
 #define __EXEC_HAS_RELOC	BIT(31)
 #define __EXEC_INTERNAL_FLAGS	(~0u << 31)
 #define UPDATE			PIN_OFFSET_FIXED
 
-#define BATCH_OFFSET_BIAS (256*1024)
-
 #define __I915_EXEC_ILLEGAL_FLAGS \
 	(__I915_EXEC_UNKNOWN_FLAGS | \
 	 I915_EXEC_CONSTANTS_MASK  | \
@@ -266,20 +264,21 @@ struct i915_execbuffer {
 	 * obj/page
 	 */
 	struct reloc_cache {
-		struct drm_mm_node node; /** temporary GTT binding */
 		unsigned int gen; /** Cached value of INTEL_GEN */
 		bool use_64bit_reloc : 1;
-		bool has_llc : 1;
 		bool has_fence : 1;
 		bool needs_unfenced : 1;
 
 		struct intel_context *ce;
 
-		struct i915_vma *target;
-		struct i915_request *rq;
-		struct i915_vma *rq_vma;
-		u32 *rq_cmd;
-		unsigned int rq_size;
+		struct eb_relocs_link {
+			struct i915_vma *vma;
+		} head;
+		struct drm_i915_gem_relocation_entry *map;
+		unsigned int pos;
+		unsigned int max;
+
+		unsigned long bufsz;
 	} reloc_cache;
 
 	struct eb_cmdparser {
@@ -288,7 +287,7 @@ struct i915_execbuffer {
 	} parser;
 
 	u64 invalid_flags; /** Set of execobj.flags that are invalid */
-	u32 context_flags; /** Set of execobj.flags to insert from the ctx */
+	u32 context_bias;
 
 	u32 batch_start_offset; /** Location within object of batch */
 	u32 batch_len; /** Length of batch within object */
@@ -308,6 +307,12 @@ static struct drm_i915_gem_exec_object2 no_entry = {
 	.offset = -1ull
 };
 
+static u64 noncanonical_addr(u64 addr, const struct i915_address_space *vm)
+{
+	GEM_BUG_ON(!is_power_of_2(vm->total));
+	return addr & (vm->total - 1);
+}
+
 static inline bool eb_use_cmdparser(const struct i915_execbuffer *eb)
 {
 	return intel_engine_requires_cmd_parser(eb->engine) ||
@@ -479,11 +484,12 @@ static int eb_create(struct i915_execbuffer *eb)
 	return 0;
 }
 
-static bool
-eb_vma_misplaced(const struct drm_i915_gem_exec_object2 *entry,
-		 const struct i915_vma *vma,
-		 unsigned int flags)
+static bool eb_vma_misplaced(const struct eb_vma *ev)
 {
+	const struct drm_i915_gem_exec_object2 *entry = ev->exec;
+	const struct i915_vma *vma = ev->vma;
+	unsigned int flags = ev->flags;
+
 	if (test_bit(I915_VMA_ERROR_BIT, __i915_vma_flags(vma)))
 		return true;
 
@@ -497,8 +503,7 @@ eb_vma_misplaced(const struct drm_i915_gem_exec_object2 *entry,
 	    vma->node.start != entry->offset)
 		return true;
 
-	if (flags & __EXEC_OBJECT_NEEDS_BIAS &&
-	    vma->node.start < BATCH_OFFSET_BIAS)
+	if (vma->node.start < ev->bias)
 		return true;
 
 	if (!(flags & EXEC_OBJECT_SUPPORTS_48B_ADDRESS) &&
@@ -532,9 +537,7 @@ static bool eb_pin_vma_fence_inplace(struct eb_vma *ev)
 }
 
 static inline bool
-eb_pin_vma_inplace(struct i915_execbuffer *eb,
-		   const struct drm_i915_gem_exec_object2 *entry,
-		   struct eb_vma *ev)
+eb_pin_vma_inplace(struct i915_execbuffer *eb, struct eb_vma *ev)
 {
 	struct i915_vma *vma = ev->vma;
 	unsigned int pin_flags;
@@ -543,7 +546,7 @@ eb_pin_vma_inplace(struct i915_execbuffer *eb,
 	if (!i915_active_is_idle(&vma->vm->binding))
 		return false;
 
-	if (eb_vma_misplaced(entry, vma, ev->flags))
+	if (eb_vma_misplaced(ev))
 		return false;
 
 	pin_flags = PIN_USER;
@@ -561,7 +564,7 @@ eb_pin_vma_inplace(struct i915_execbuffer *eb,
 		}
 	}
 
-	GEM_BUG_ON(eb_vma_misplaced(entry, vma, ev->flags));
+	GEM_BUG_ON(eb_vma_misplaced(ev));
 
 	ev->flags |= __EXEC_OBJECT_HAS_PIN;
 	return true;
@@ -599,7 +602,7 @@ eb_validate_vma(struct i915_execbuffer *eb,
 	 * so from this point we're always using non-canonical
 	 * form internally.
 	 */
-	entry->offset = gen8_noncanonical_addr(entry->offset);
+	entry->offset = noncanonical_addr(entry->offset, eb->context->vm);
 
 	if (!eb->reloc_cache.has_fence) {
 		entry->flags &= ~EXEC_OBJECT_NEEDS_FENCE;
@@ -610,9 +613,6 @@ eb_validate_vma(struct i915_execbuffer *eb,
 			entry->flags |= EXEC_OBJECT_NEEDS_GTT | __EXEC_OBJECT_NEEDS_MAP;
 	}
 
-	if (!(entry->flags & EXEC_OBJECT_PINNED))
-		entry->flags |= eb->context_flags;
-
 	return 0;
 }
 
@@ -629,7 +629,9 @@ eb_add_vma(struct i915_execbuffer *eb,
 	ev->vma = vma;
 	ev->exec = entry;
 	ev->flags = entry->flags;
+	ev->bias = eb->context_bias;
 
+	ev->handle = entry->handle;
 	if (eb->lut_size > 0) {
 		ev->handle = entry->handle;
 		hlist_add_head(&ev->node,
@@ -640,8 +642,10 @@ eb_add_vma(struct i915_execbuffer *eb,
 	list_add_tail(&ev->bind_link, &eb->bind_list);
 	list_add_tail(&ev->submit_link, &eb->submit_list);
 
-	if (entry->relocation_count)
+	if (entry->relocation_count) {
 		list_add_tail(&ev->reloc_link, &eb->relocs);
+		eb->reloc_cache.bufsz += entry->relocation_count;
+	}
 
 	/*
 	 * SNA is doing fancy tricks with compressing batch buffers, which leads
@@ -655,7 +659,8 @@ eb_add_vma(struct i915_execbuffer *eb,
 	if (i == batch_idx) {
 		if (entry->relocation_count &&
 		    !(ev->flags & EXEC_OBJECT_PINNED))
-			ev->flags |= __EXEC_OBJECT_NEEDS_BIAS;
+			ev->bias = max_t(u32, ev->bias, SZ_256K);
+
 		if (eb->reloc_cache.has_fence)
 			ev->flags |= EXEC_OBJECT_NEEDS_FENCE;
 
@@ -981,7 +986,8 @@ static int eb_reserve_vma(struct eb_vm_work *work, struct eb_bind_vma *bind)
 	const unsigned int exec_flags = bind->ev->flags;
 	struct i915_vma *vma = bind->ev->vma;
 	struct i915_address_space *vm = vma->vm;
-	u64 start = 0, end = vm->total;
+	u64 start = round_up(bind->ev->bias, I915_GTT_MIN_ALIGNMENT);
+	u64 end = vm->total;
 	u64 align = entry->alignment ?: I915_GTT_MIN_ALIGNMENT;
 	unsigned int bind_flags;
 	int err;
@@ -1001,7 +1007,7 @@ static int eb_reserve_vma(struct eb_vm_work *work, struct eb_bind_vma *bind)
 	GEM_BUG_ON(!vma->size);
 
 	/* Reuse old address (if it doesn't conflict with new requirements) */
-	if (eb_vma_misplaced(entry, vma, exec_flags)) {
+	if (eb_vma_misplaced(bind->ev)) {
 		vma->node.start = entry->offset & PIN_OFFSET_MASK;
 		vma->node.size = max(entry->pad_to_size, vma->size);
 		vma->node.color = 0;
@@ -1023,11 +1029,8 @@ static int eb_reserve_vma(struct eb_vm_work *work, struct eb_bind_vma *bind)
 		align = max_t(u64, align, vma->fence_alignment);
 	}
 
-	if (exec_flags & __EXEC_OBJECT_NEEDS_BIAS)
-		start = BATCH_OFFSET_BIAS;
-
 	GEM_BUG_ON(!vma->node.size);
-	if (vma->node.size > end - start)
+	if (start > end || vma->node.size > end - start)
 		return -E2BIG;
 
 	/* Try the user's preferred location first (mandatory if soft-pinned) */
@@ -1110,7 +1113,7 @@ static int eb_reserve_vma(struct eb_vm_work *work, struct eb_bind_vma *bind)
 	}
 
 	bind->ev->flags |= __EXEC_OBJECT_HAS_PIN;
-	GEM_BUG_ON(eb_vma_misplaced(entry, vma, bind->ev->flags));
+	GEM_BUG_ON(eb_vma_misplaced(bind->ev));
 
 	if (unlikely(exec_flags & EXEC_OBJECT_NEEDS_FENCE)) {
 		err = __i915_vma_pin_fence_async(vma, &work->base);
@@ -1343,8 +1346,7 @@ static int wait_for_unbinds(struct i915_execbuffer *eb,
 
 		GEM_BUG_ON(ev->flags & __EXEC_OBJECT_HAS_PIN);
 
-		if (drm_mm_node_allocated(&vma->node) &&
-		    eb_vma_misplaced(ev->exec, vma, ev->flags)) {
+		if (drm_mm_node_allocated(&vma->node) && eb_vma_misplaced(ev)) {
 			err = i915_vma_unbind(vma);
 			if (err)
 				return err;
@@ -1393,10 +1395,10 @@ static int eb_reserve_vm(struct i915_execbuffer *eb)
 	count = 0;
 	INIT_LIST_HEAD(&unbound);
 	list_for_each_entry(ev, &eb->bind_list, bind_link) {
-		struct drm_i915_gem_exec_object2 *entry = ev->exec;
-		struct i915_vma *vma = ev->vma;
+		if (eb_pin_vma_inplace(eb, ev)) {
+			struct drm_i915_gem_exec_object2 *entry = ev->exec;
+			struct i915_vma *vma = ev->vma;
 
-		if (eb_pin_vma_inplace(eb, entry, ev)) {
 			if (entry != &no_entry &&
 			    entry->offset != vma->node.start) {
 				entry->offset = vma->node.start | UPDATE;
@@ -1480,7 +1482,7 @@ static int eb_reserve_vm(struct i915_execbuffer *eb)
 			 * we cannot handle migrating the vma inside the worker.
 			 */
 			if (drm_mm_node_allocated(&vma->node)) {
-				if (eb_vma_misplaced(ev->exec, vma, ev->flags)) {
+				if (eb_vma_misplaced(ev)) {
 					err = -ENOSPC;
 					break;
 				}
@@ -1738,9 +1740,9 @@ static int eb_select_context(struct i915_execbuffer *eb)
 	if (rcu_access_pointer(ctx->vm))
 		eb->invalid_flags |= EXEC_OBJECT_NEEDS_GTT;
 
-	eb->context_flags = 0;
+	eb->context_bias = 0;
 	if (test_bit(UCONTEXT_NO_ZEROMAP, &ctx->user_flags))
-		eb->context_flags |= __EXEC_OBJECT_NEEDS_BIAS;
+		eb->context_bias = I915_GTT_MIN_ALIGNMENT;
 
 	return 0;
 }
@@ -1896,8 +1898,6 @@ eb_get_vma(const struct i915_execbuffer *eb, unsigned long handle)
 
 static void eb_destroy(const struct i915_execbuffer *eb)
 {
-	GEM_BUG_ON(eb->reloc_cache.rq);
-
 	eb_vma_array_put(eb->array);
 	if (eb->lut_size > 0)
 		kfree(eb->buckets);
@@ -1915,98 +1915,27 @@ static void reloc_cache_init(struct reloc_cache *cache,
 {
 	/* Must be a variable in the struct to allow GCC to unroll. */
 	cache->gen = INTEL_GEN(i915);
-	cache->has_llc = HAS_LLC(i915);
 	cache->use_64bit_reloc = HAS_64BIT_RELOC(i915);
 	cache->has_fence = cache->gen < 4;
 	cache->needs_unfenced = INTEL_INFO(i915)->unfenced_needs_alignment;
-	cache->node.flags = 0;
-	cache->rq = NULL;
-	cache->target = NULL;
-}
 
-#define RELOC_TAIL 4
-
-static int reloc_gpu_chain(struct reloc_cache *cache)
-{
-	struct intel_gt_buffer_pool_node *pool;
-	struct i915_request *rq = cache->rq;
-	struct i915_vma *batch;
-	u32 *cmd;
-	int err;
-
-	pool = intel_gt_get_buffer_pool(rq->engine->gt, PAGE_SIZE);
-	if (IS_ERR(pool))
-		return PTR_ERR(pool);
-
-	batch = i915_vma_instance(pool->obj, rq->context->vm, NULL);
-	if (IS_ERR(batch)) {
-		err = PTR_ERR(batch);
-		goto out_pool;
-	}
-
-	err = i915_vma_pin(batch, 0, 0, PIN_USER | PIN_NONBLOCK);
-	if (err)
-		goto out_pool;
-
-	GEM_BUG_ON(cache->rq_size + RELOC_TAIL > PAGE_SIZE  / sizeof(u32));
-	cmd = cache->rq_cmd + cache->rq_size;
-	*cmd++ = MI_ARB_CHECK;
-	if (cache->gen >= 8)
-		*cmd++ = MI_BATCH_BUFFER_START_GEN8;
-	else if (cache->gen >= 6)
-		*cmd++ = MI_BATCH_BUFFER_START;
-	else
-		*cmd++ = MI_BATCH_BUFFER_START | MI_BATCH_GTT;
-	*cmd++ = lower_32_bits(batch->node.start);
-	*cmd++ = upper_32_bits(batch->node.start); /* Always 0 for gen<8 */
-	i915_gem_object_flush_map(cache->rq_vma->obj);
-	i915_gem_object_unpin_map(cache->rq_vma->obj);
-	cache->rq_vma = NULL;
-
-	err = intel_gt_buffer_pool_mark_active(pool, rq);
-	if (err == 0) {
-		i915_vma_lock(batch);
-		err = i915_request_await_object(rq, batch->obj, false);
-		if (err == 0)
-			err = i915_vma_move_to_active(batch, rq, 0);
-		i915_vma_unlock(batch);
-	}
-	i915_vma_unpin(batch);
-	if (err)
-		goto out_pool;
-
-	cmd = i915_gem_object_pin_map(batch->obj,
-				      cache->has_llc ?
-				      I915_MAP_FORCE_WB :
-				      I915_MAP_FORCE_WC);
-	if (IS_ERR(cmd)) {
-		err = PTR_ERR(cmd);
-		goto out_pool;
-	}
-
-	/* Return with batch mapping (cmd) still pinned */
-	cache->rq_cmd = cmd;
-	cache->rq_size = 0;
-	cache->rq_vma = batch;
-
-out_pool:
-	intel_gt_buffer_pool_put(pool);
-	return err;
+	cache->bufsz = 0;
 }
 
 static struct i915_request *
-nested_request_create(struct intel_context *ce)
+nested_request_create(struct intel_context *ce, struct i915_execbuffer *eb)
 {
 	struct i915_request *rq;
 
 	/* XXX This only works once; replace with shared timeline */
-	mutex_lock_nested(&ce->timeline->mutex, SINGLE_DEPTH_NESTING);
+	if (ce->timeline != eb->context->timeline)
+		mutex_lock_nested(&ce->timeline->mutex, SINGLE_DEPTH_NESTING);
 	intel_context_enter(ce);
 
 	rq = __i915_request_create(ce, GFP_KERNEL);
 
 	intel_context_exit(ce);
-	if (IS_ERR(rq))
+	if (IS_ERR(rq) && ce->timeline != eb->context->timeline)
 		mutex_unlock(&ce->timeline->mutex);
 
 	return rq;
@@ -2029,28 +1958,18 @@ static unsigned int reloc_bb_flags(const struct reloc_cache *cache)
 	return cache->gen > 5 ? 0 : I915_DISPATCH_SECURE;
 }
 
-static int reloc_gpu_flush(struct i915_execbuffer *eb)
+static int
+reloc_gpu_flush(struct i915_execbuffer *eb, struct i915_request *rq, int err)
 {
 	struct reloc_cache *cache = &eb->reloc_cache;
-	struct i915_request *rq;
-	int err;
-
-	rq = fetch_and_zero(&cache->rq);
-	if (!rq)
-		return 0;
-
-	if (cache->rq_vma) {
-		struct drm_i915_gem_object *obj = cache->rq_vma->obj;
-
-		GEM_BUG_ON(cache->rq_size >= obj->base.size / sizeof(u32));
-		cache->rq_cmd[cache->rq_size++] = MI_BATCH_BUFFER_END;
+	u32 *cs;
 
-		__i915_gem_object_flush_map(obj,
-					    0, sizeof(u32) * cache->rq_size);
-		i915_gem_object_unpin_map(obj);
-	}
+	cs = (u32 *)(cache->map + cache->pos);
+	*cs++ = MI_BATCH_BUFFER_END;
+	__i915_gem_object_flush_map(cache->head.vma->obj,
+				    0, (void *)cs - (void *)cache->map);
+	i915_gem_object_unpin_map(cache->head.vma->obj);
 
-	err = 0;
 	if (rq->engine->emit_init_breadcrumb)
 		err = rq->engine->emit_init_breadcrumb(rq);
 	if (!err)
@@ -2063,6 +1982,7 @@ static int reloc_gpu_flush(struct i915_execbuffer *eb)
 
 	intel_gt_chipset_flush(rq->engine->gt);
 	__i915_request_add(rq, &eb->gem_context->sched);
+
 	if (i915_request_timeline(rq) != eb->context->timeline)
 		mutex_unlock(&i915_request_timeline(rq)->mutex);
 
@@ -2080,7 +2000,7 @@ static int reloc_move_to_gpu(struct i915_request *rq, struct i915_vma *vma)
 		i915_gem_clflush_object(obj, 0);
 	obj->write_domain = 0;
 
-	err = i915_request_await_object(rq, vma->obj, true);
+	err = i915_request_await_object(rq, obj, true);
 	if (err == 0)
 		err = i915_vma_move_to_active(vma, rq, EXEC_OBJECT_WRITE);
 
@@ -2089,130 +2009,6 @@ static int reloc_move_to_gpu(struct i915_request *rq, struct i915_vma *vma)
 	return err;
 }
 
-static int __reloc_gpu_alloc(struct i915_execbuffer *eb,
-			     struct intel_engine_cs *engine,
-			     unsigned int len)
-{
-	struct reloc_cache *cache = &eb->reloc_cache;
-	struct intel_gt_buffer_pool_node *pool;
-	struct i915_request *rq;
-	struct i915_vma *batch;
-	u32 *cmd;
-	int err;
-
-	pool = intel_gt_get_buffer_pool(engine->gt, PAGE_SIZE);
-	if (IS_ERR(pool))
-		return PTR_ERR(pool);
-
-	cmd = i915_gem_object_pin_map(pool->obj,
-				      cache->has_llc ?
-				      I915_MAP_FORCE_WB :
-				      I915_MAP_FORCE_WC);
-	if (IS_ERR(cmd)) {
-		err = PTR_ERR(cmd);
-		goto out_pool;
-	}
-
-	batch = i915_vma_instance(pool->obj, eb->context->vm, NULL);
-	if (IS_ERR(batch)) {
-		err = PTR_ERR(batch);
-		goto err_unmap;
-	}
-
-	err = i915_vma_pin(batch, 0, 0, PIN_USER | PIN_NONBLOCK);
-	if (err)
-		goto err_unmap;
-
-	if (cache->ce == eb->context)
-		rq = __i915_request_create(cache->ce, GFP_KERNEL);
-	else
-		rq = nested_request_create(cache->ce);
-	if (IS_ERR(rq)) {
-		err = PTR_ERR(rq);
-		goto err_unpin;
-	}
-	rq->cookie = lockdep_pin_lock(&i915_request_timeline(rq)->mutex);
-
-	err = intel_gt_buffer_pool_mark_active(pool, rq);
-	if (err)
-		goto err_request;
-
-	i915_vma_lock(batch);
-	err = i915_request_await_object(rq, batch->obj, false);
-	if (err == 0)
-		err = i915_vma_move_to_active(batch, rq, 0);
-	i915_vma_unlock(batch);
-	if (err)
-		goto skip_request;
-
-	rq->batch = batch;
-	i915_vma_unpin(batch);
-
-	cache->rq = rq;
-	cache->rq_cmd = cmd;
-	cache->rq_size = 0;
-	cache->rq_vma = batch;
-
-	/* Return with batch mapping (cmd) still pinned */
-	goto out_pool;
-
-skip_request:
-	i915_request_set_error_once(rq, err);
-err_request:
-	__i915_request_add(rq, &eb->gem_context->sched);
-	if (i915_request_timeline(rq) != eb->context->timeline)
-		mutex_unlock(&i915_request_timeline(rq)->mutex);
-err_unpin:
-	i915_vma_unpin(batch);
-err_unmap:
-	i915_gem_object_unpin_map(pool->obj);
-out_pool:
-	intel_gt_buffer_pool_put(pool);
-	return err;
-}
-
-static u32 *reloc_gpu(struct i915_execbuffer *eb,
-		      struct i915_vma *vma,
-		      unsigned int len)
-{
-	struct reloc_cache *cache = &eb->reloc_cache;
-	u32 *cmd;
-	int err;
-
-	if (unlikely(!cache->rq)) {
-		struct intel_engine_cs *engine = eb->engine;
-
-		err = __reloc_gpu_alloc(eb, engine, len);
-		if (unlikely(err))
-			return ERR_PTR(err);
-	}
-
-	if (vma != cache->target) {
-		err = reloc_move_to_gpu(cache->rq, vma);
-		if (unlikely(err)) {
-			i915_request_set_error_once(cache->rq, err);
-			return ERR_PTR(err);
-		}
-
-		cache->target = vma;
-	}
-
-	if (unlikely(cache->rq_size + len >
-		     PAGE_SIZE / sizeof(u32) - RELOC_TAIL)) {
-		err = reloc_gpu_chain(cache);
-		if (unlikely(err)) {
-			i915_request_set_error_once(cache->rq, err);
-			return ERR_PTR(err);
-		}
-	}
-
-	GEM_BUG_ON(cache->rq_size + len >= PAGE_SIZE  / sizeof(u32));
-	cmd = cache->rq_cmd + cache->rq_size;
-	cache->rq_size += len;
-
-	return cmd;
-}
-
 static unsigned long vma_phys_addr(struct i915_vma *vma, u32 offset)
 {
 	struct page *page;
@@ -2227,30 +2023,30 @@ static unsigned long vma_phys_addr(struct i915_vma *vma, u32 offset)
 	return addr + offset_in_page(offset);
 }
 
-static int __reloc_entry_gpu(struct i915_execbuffer *eb,
-			     struct i915_vma *vma,
-			     u64 offset,
-			     u64 target_addr)
+static bool
+eb_relocs_vma_entry(struct i915_execbuffer *eb,
+		    const struct eb_vma *ev,
+		    struct drm_i915_gem_relocation_entry *reloc)
 {
 	const unsigned int gen = eb->reloc_cache.gen;
-	unsigned int len;
+	struct i915_vma *target = eb_get_vma(eb, reloc->target_handle)->vma;
+	const u64 target_addr = relocation_target(reloc, target);
+	const u64 presumed =
+		noncanonical_addr(reloc->presumed_offset, target->vm);
+	u64 offset = reloc->offset;
 	u32 *batch;
-	u64 addr;
 
-	if (gen >= 8)
-		len = offset & 7 ? 8 : 5;
-	else if (gen >= 4)
-		len = 4;
-	else
-		len = 3;
+	GEM_BUG_ON(!i915_vma_is_pinned(target));
 
-	batch = reloc_gpu(eb, vma, len);
-	if (IS_ERR(batch))
-		return PTR_ERR(batch);
+	/* Replace the reloc entry with the GPU commands */
+	batch = memset(reloc, 0, sizeof(*reloc));
+	if (presumed == target->node.start)
+		return false;
 
-	addr = gen8_canonical_addr(vma->node.start + offset);
 	if (gen >= 8) {
-		if (offset & 7) {
+		u64 addr = gen8_canonical_addr(ev->vma->node.start + offset);
+
+		if (addr & 7) {
 			*batch++ = MI_STORE_DWORD_IMM_GEN4;
 			*batch++ = lower_32_bits(addr);
 			*batch++ = upper_32_bits(addr);
@@ -2272,107 +2068,65 @@ static int __reloc_entry_gpu(struct i915_execbuffer *eb,
 	} else if (gen >= 6) {
 		*batch++ = MI_STORE_DWORD_IMM_GEN4;
 		*batch++ = 0;
-		*batch++ = addr;
+		*batch++ = ev->vma->node.start + offset;
 		*batch++ = target_addr;
 	} else if (IS_I965G(eb->i915)) {
 		*batch++ = MI_STORE_DWORD_IMM_GEN4;
 		*batch++ = 0;
-		*batch++ = vma_phys_addr(vma, offset);
+		*batch++ = vma_phys_addr(ev->vma, offset);
 		*batch++ = target_addr;
 	} else if (gen >= 4) {
 		*batch++ = MI_STORE_DWORD_IMM_GEN4 | MI_USE_GGTT;
 		*batch++ = 0;
-		*batch++ = addr;
+		*batch++ = ev->vma->node.start + offset;
 		*batch++ = target_addr;
-	} else if (gen >= 3 &&
-		   !(IS_I915G(eb->i915) || IS_I915GM(eb->i915))) {
+	} else if (gen >= 3 && !(IS_I915G(eb->i915) || IS_I915GM(eb->i915))) {
 		*batch++ = MI_STORE_DWORD_IMM | MI_MEM_VIRTUAL;
-		*batch++ = addr;
+		*batch++ = ev->vma->node.start + offset;
 		*batch++ = target_addr;
 	} else {
 		*batch++ = MI_STORE_DWORD_IMM;
-		*batch++ = vma_phys_addr(vma, offset);
+		*batch++ = vma_phys_addr(ev->vma, offset);
 		*batch++ = target_addr;
 	}
+	GEM_BUG_ON(batch > (u32 *)(reloc + 1));
 
-	return 0;
-}
-
-static u64
-relocate_entry(struct i915_execbuffer *eb,
-	       struct i915_vma *vma,
-	       const struct drm_i915_gem_relocation_entry *reloc,
-	       const struct i915_vma *target)
-{
-	u64 target_addr = relocation_target(reloc, target);
-	int err;
-
-	err = __reloc_entry_gpu(eb, vma, reloc->offset, target_addr);
-	if (err)
-		return err;
-
-	return target->node.start | UPDATE;
-}
-
-static int gen6_fixup_ggtt(struct i915_vma *vma)
-{
-	int err;
-
-	if (i915_vma_is_bound(vma, I915_VMA_GLOBAL_BIND))
-		return 0;
-
-	err = i915_vma_wait_for_bind(vma);
-	if (err)
-		return err;
-
-	mutex_lock(&vma->vm->mutex);
-	if (!(atomic_fetch_or(I915_VMA_GLOBAL_BIND, &vma->flags) & I915_VMA_GLOBAL_BIND)) {
-		__i915_gem_object_pin_pages(vma->obj);
-		vma->ops->bind_vma(vma->vm, NULL, vma,
-				   vma->obj->cache_level,
-				   I915_VMA_GLOBAL_BIND);
-	}
-	mutex_unlock(&vma->vm->mutex);
-
-	return 0;
+	return true;
 }
 
-static u64
-eb_relocate_entry(struct i915_execbuffer *eb,
-		  struct eb_vma *ev,
-		  const struct drm_i915_gem_relocation_entry *reloc)
+static int
+eb_relocs_check_entry(struct i915_execbuffer *eb,
+		      const struct eb_vma *ev,
+		      const struct drm_i915_gem_relocation_entry *reloc)
 {
 	struct drm_i915_private *i915 = eb->i915;
 	struct eb_vma *target;
-	int err;
 
 	/* we've already hold a reference to all valid objects */
 	target = eb_get_vma(eb, reloc->target_handle);
 	if (unlikely(!target))
 		return -ENOENT;
 
-	GEM_BUG_ON(!i915_vma_is_pinned(target->vma));
-
 	/* Validate that the target is in a valid r/w GPU domain */
 	if (unlikely(reloc->write_domain & (reloc->write_domain - 1))) {
 		drm_dbg(&i915->drm, "reloc with multiple write domains: "
-			  "target %d offset %d "
-			  "read %08x write %08x",
-			  reloc->target_handle,
-			  (int) reloc->offset,
-			  reloc->read_domains,
-			  reloc->write_domain);
+			"target %d offset %llu "
+			"read %08x write %08x",
+			reloc->target_handle,
+			reloc->offset,
+			reloc->read_domains,
+			reloc->write_domain);
 		return -EINVAL;
 	}
 	if (unlikely((reloc->write_domain | reloc->read_domains)
 		     & ~I915_GEM_GPU_DOMAINS)) {
 		drm_dbg(&i915->drm, "reloc with read/write non-GPU domains: "
-			  "target %d offset %d "
-			  "read %08x write %08x",
-			  reloc->target_handle,
-			  (int) reloc->offset,
-			  reloc->read_domains,
-			  reloc->write_domain);
+			"target %d offset %llu "
+			"read %08x write %08x",
+			reloc->target_handle,
+			reloc->offset,
+			reloc->read_domains,
+			reloc->write_domain);
 		return -EINVAL;
 	}
 
@@ -2386,155 +2140,379 @@ eb_relocate_entry(struct i915_execbuffer *eb,
 		 * batchbuffers.
 		 */
 		if (reloc->write_domain == I915_GEM_DOMAIN_INSTRUCTION &&
-		    IS_GEN(eb->i915, 6)) {
-			err = gen6_fixup_ggtt(target->vma);
-			if (err)
-				return err;
-		}
+		    IS_GEN(eb->i915, 6))
+			target->flags |= EXEC_OBJECT_NEEDS_GTT;
 	}
 
-	/*
-	 * If the relocation already has the right value in it, no
-	 * more work needs to be done.
-	 */
-	if (gen8_canonical_addr(target->vma->node.start) == reloc->presumed_offset)
-		return 0;
+	if ((int)reloc->delta < 0)
+		target->bias = max_t(u32, target->bias, -(int)reloc->delta);
 
 	/* Check that the relocation address is valid... */
 	if (unlikely(reloc->offset >
 		     ev->vma->size - (eb->reloc_cache.use_64bit_reloc ? 8 : 4))) {
 		drm_dbg(&i915->drm, "Relocation beyond object bounds: "
-			  "target %d offset %d size %d.\n",
-			  reloc->target_handle,
-			  (int)reloc->offset,
-			  (int)ev->vma->size);
+			"target %d offset %llu size %llu.\n",
+			reloc->target_handle,
+			reloc->offset,
+			ev->vma->size);
 		return -EINVAL;
 	}
 	if (unlikely(reloc->offset & 3)) {
 		drm_dbg(&i915->drm, "Relocation not 4-byte aligned: "
-			  "target %d offset %d.\n",
-			  reloc->target_handle,
-			  (int)reloc->offset);
+			"target %d offset %llu.\n",
+			reloc->target_handle,
+			reloc->offset);
 		return -EINVAL;
 	}
 
-	/*
-	 * If we write into the object, we need to force the synchronisation
-	 * barrier, either with an asynchronous clflush or if we executed the
-	 * patching using the GPU (though that should be serialised by the
-	 * timeline). To be completely sure, and since we are required to
-	 * do relocations we are already stalling, disable the user's opt
-	 * out of our synchronisation.
-	 */
-	ev->flags &= ~EXEC_OBJECT_ASYNC;
+	return 0;
+}
+
+static struct drm_i915_gem_relocation_entry *
+eb_relocs_grow(struct i915_execbuffer *eb, unsigned long *count)
+{
+	struct reloc_cache *c = &eb->reloc_cache;
+	struct drm_i915_gem_relocation_entry *r;
+	unsigned long remain;
+
+	GEM_BUG_ON(c->pos > c->max);
+	remain = c->max - c->pos;
+	if (remain == 0) {
+		struct drm_i915_gem_object *obj;
+		struct i915_vma *vma;
+		struct eb_vma *ev;
+
+		obj = i915_gem_object_create_internal(eb->i915, c->bufsz);
+		if (IS_ERR(obj))
+			return ERR_CAST(obj);
 
-	/* and update the user's relocation entry */
-	return relocate_entry(eb, ev->vma, reloc, target->vma);
+		if (c->gen >= 6)
+			i915_gem_object_set_cache_coherency(obj,
+							    I915_CACHE_LLC);
+
+		vma = i915_vma_instance(obj, eb->context->vm, NULL);
+		if (IS_ERR(vma)) {
+			i915_gem_object_put(obj);
+			return ERR_CAST(vma);
+		}
+
+		ev = kzalloc(sizeof(*ev), GFP_KERNEL);
+		if (!ev) {
+			i915_gem_object_put(obj);
+			return ERR_PTR(-ENOMEM);
+		}
+
+		vma->private = ev;
+		ev->vma = vma;
+		ev->exec = &no_entry;
+		ev->flags = EXEC_OBJECT_SUPPORTS_48B_ADDRESS;
+		list_add_tail(&ev->bind_link, &eb->bind_list);
+		list_add(&ev->reloc_link, &eb->array->aux_list);
+
+		if (!c->head.vma) {
+			c->head.vma = vma;
+		} else {
+			struct eb_relocs_link *link;
+
+			link = (struct eb_relocs_link *)(c->map + c->pos);
+			link->vma = vma;
+		}
+
+		c->pos = 0;
+		c->map = i915_gem_object_pin_map(obj, I915_MAP_WB);
+		if (IS_ERR(c->map))
+			return ERR_CAST(c->map);
+
+		remain = c->max;
+	}
+	*count = min(remain, *count);
+
+	GEM_BUG_ON(!c->map);
+	r = c->map + c->pos;
+	c->pos += *count;
+	GEM_BUG_ON(c->pos > c->max);
+
+	return r;
 }
 
-static int eb_relocate_vma(struct i915_execbuffer *eb, struct eb_vma *ev)
+static int
+eb_relocs_copy_vma(struct i915_execbuffer *eb, const struct eb_vma *ev)
 {
-#define N_RELOC(x) ((x) / sizeof(struct drm_i915_gem_relocation_entry))
-	struct drm_i915_gem_relocation_entry stack[N_RELOC(512)];
 	const struct drm_i915_gem_exec_object2 *entry = ev->exec;
-	struct drm_i915_gem_relocation_entry __user *urelocs =
+	const struct drm_i915_gem_relocation_entry __user *ureloc =
 		u64_to_user_ptr(entry->relocs_ptr);
 	unsigned long remain = entry->relocation_count;
 
-	if (unlikely(remain > N_RELOC(ULONG_MAX)))
+	if (unlikely(remain > ULONG_MAX / sizeof(*ureloc)))
 		return -EINVAL;
 
-	/*
-	 * We must check that the entire relocation array is safe
-	 * to read. However, if the array is not writable the user loses
-	 * the updated relocation values.
-	 */
-	if (unlikely(!access_ok(urelocs, remain * sizeof(*urelocs))))
-		return -EFAULT;
-
 	do {
-		struct drm_i915_gem_relocation_entry *r = stack;
-		unsigned int count =
-			min_t(unsigned long, remain, ARRAY_SIZE(stack));
-		unsigned int copied;
+		struct drm_i915_gem_relocation_entry *r;
+		unsigned long count = remain;
+		int err;
 
-		/*
-		 * This is the fast path and we cannot handle a pagefault
-		 * whilst holding the struct mutex lest the user pass in the
-		 * relocations contained within a mmaped bo. For in such a case
-		 * we, the page fault handler would call i915_gem_fault() and
-		 * we would try to acquire the struct mutex again. Obviously
-		 * this is bad and so lockdep complains vehemently.
-		 */
-		copied = __copy_from_user(r, urelocs, count * sizeof(r[0]));
-		if (unlikely(copied))
+		r = eb_relocs_grow(eb, &count);
+		if (IS_ERR(r))
+			return PTR_ERR(r);
+
+		GEM_BUG_ON(!count);
+		if (unlikely(copy_from_user(r, ureloc, count * sizeof(r[0]))))
 			return -EFAULT;
 
 		remain -= count;
-		do {
-			u64 offset = eb_relocate_entry(eb, ev, r);
+		ureloc += count;
 
-			if (likely(offset == 0)) {
-			} else if ((s64)offset < 0) {
-				return (int)offset;
-			} else {
-				/*
-				 * Note that reporting an error now
-				 * leaves everything in an inconsistent
-				 * state as we have *already* changed
-				 * the relocation value inside the
-				 * object. As we have not changed the
-				 * reloc.presumed_offset or will not
-				 * change the execobject.offset, on the
-				 * call we may not rewrite the value
-				 * inside the object, leaving it
-				 * dangling and causing a GPU hang. Unless
-				 * userspace dynamically rebuilds the
-				 * relocations on each execbuf rather than
-				 * presume a static tree.
-				 *
-				 * We did previously check if the relocations
-				 * were writable (access_ok), an error now
-				 * would be a strange race with mprotect,
-				 * having already demonstrated that we
-				 * can read from this userspace address.
-				 */
-				offset = gen8_canonical_addr(offset & ~UPDATE);
-				__put_user(offset,
-					   &urelocs[r - stack].presumed_offset);
-			}
-		} while (r++, --count);
-		urelocs += ARRAY_SIZE(stack);
+		do {
+			err = eb_relocs_check_entry(eb, ev, r++);
+			if (err)
+				return err;
+		} while (--count);
 	} while (remain);
 
 	return 0;
 }
 
+static int eb_relocs_copy_user(struct i915_execbuffer *eb)
+{
+	struct eb_vma *ev;
+	int err;
+
+	eb->reloc_cache.head.vma = NULL;
+	eb->reloc_cache.pos = eb->reloc_cache.max;
+
+	list_for_each_entry(ev, &eb->relocs, reloc_link) {
+		err = eb_relocs_copy_vma(eb, ev);
+		if (err)
+			return err;
+	}
+
+	return 0;
+}
+
+static struct drm_i915_gem_relocation_entry *
+get_gpu_relocs(struct i915_execbuffer *eb,
+	       struct i915_request *rq,
+	       unsigned long *count)
+{
+	struct reloc_cache *c = &eb->reloc_cache;
+	struct drm_i915_gem_relocation_entry *r;
+	unsigned long remain;
+
+	GEM_BUG_ON(c->pos > c->max);
+	remain = c->max - c->pos;
+	if (remain == 0) {
+		struct eb_relocs_link link;
+		const int gen = c->gen;
+		u32 *cs;
+
+		GEM_BUG_ON(!c->head.vma);
+		GEM_BUG_ON(!c->map);
+
+		link = *(struct eb_relocs_link *)(c->map + c->pos);
+		GEM_BUG_ON(!link.vma);
+		GEM_BUG_ON(!i915_vma_is_pinned(link.vma));
+
+		cs = (u32 *)(c->map + c->pos);
+		*cs++ = MI_ARB_CHECK;
+		if (gen >= 8)
+			*cs++ = MI_BATCH_BUFFER_START_GEN8;
+		else if (gen >= 6)
+			*cs++ = MI_BATCH_BUFFER_START;
+		else
+			*cs++ = MI_BATCH_BUFFER_START | MI_BATCH_GTT;
+		*cs++ = lower_32_bits(link.vma->node.start);
+		*cs++ = upper_32_bits(link.vma->node.start);
+		i915_gem_object_flush_map(c->head.vma->obj);
+		i915_gem_object_unpin_map(c->head.vma->obj);
+
+		c->head = link;
+		c->map = NULL;
+	}
+
+	if (!c->map) {
+		struct i915_vma *vma = c->head.vma;
+		int err;
+
+		GEM_BUG_ON(!vma);
+		i915_vma_lock(vma);
+		err = i915_request_await_object(rq, vma->obj, false);
+		if (err == 0)
+			err = i915_vma_move_to_active(vma, rq, 0);
+		i915_vma_unlock(vma);
+		if (err)
+			return ERR_PTR(err);
+
+		GEM_BUG_ON(!i915_gem_object_has_pinned_pages(vma->obj));
+		c->map = page_mask_bits(vma->obj->mm.mapping);
+		c->pos = 0;
+
+		remain = c->max;
+	}
+
+	*count = min(remain, *count);
+
+	GEM_BUG_ON(!c->map);
+	r = c->map + c->pos;
+	c->pos += *count;
+	GEM_BUG_ON(c->pos > c->max);
+
+	return r;
+}
+
+static int eb_relocs_gpu_vma(struct i915_execbuffer *eb,
+			     struct i915_request *rq,
+			     const struct eb_vma *ev)
+{
+	const struct drm_i915_gem_exec_object2 *entry = ev->exec;
+	unsigned long remain = entry->relocation_count;
+	bool write = false;
+	int err = 0;
+
+	do {
+		struct drm_i915_gem_relocation_entry *r;
+		unsigned long count = remain;
+
+		r = get_gpu_relocs(eb, rq, &count);
+		if (IS_ERR(r))
+			return PTR_ERR(r);
+
+		GEM_BUG_ON(!count);
+		remain -= count;
+		do {
+			write |= eb_relocs_vma_entry(eb, ev, r++);
+		} while (--count);
+	} while (remain);
+
+	if (write)
+		err = reloc_move_to_gpu(rq, ev->vma);
+
+	return err;
+}
+
+static struct i915_request *reloc_gpu_alloc(struct i915_execbuffer *eb)
+{
+	struct reloc_cache *cache = &eb->reloc_cache;
+	struct i915_request *rq;
+
+	if (cache->ce == eb->context)
+		rq = __i915_request_create(cache->ce, GFP_KERNEL);
+	else
+		rq = nested_request_create(cache->ce, eb);
+	if (IS_ERR(rq))
+		return rq;
+
+	rq->cookie = lockdep_pin_lock(&i915_request_timeline(rq)->mutex);
+	return rq;
+}
+
+static int eb_relocs_gpu(struct i915_execbuffer *eb)
+{
+	struct i915_request *rq;
+	struct eb_vma *ev;
+	int err;
+
+	rq = reloc_gpu_alloc(eb);
+	if (IS_ERR(rq))
+		return PTR_ERR(rq);
+
+	rq->batch = eb->reloc_cache.head.vma;
+
+	eb->reloc_cache.map = NULL;
+	eb->reloc_cache.pos = 0;
+
+	err = 0;
+	list_for_each_entry(ev, &eb->relocs, reloc_link) {
+		err = eb_relocs_gpu_vma(eb, rq, ev);
+		if (err)
+			break;
+	}
+
+	return reloc_gpu_flush(eb, rq, err);
+}
+
+static void eb_relocs_update_vma(struct i915_execbuffer *eb, struct eb_vma *ev)
+{
+	const struct drm_i915_gem_exec_object2 *entry = ev->exec;
+	struct drm_i915_gem_relocation_entry __user *ureloc =
+		u64_to_user_ptr(entry->relocs_ptr);
+	unsigned long count = entry->relocation_count;
+
+	do {
+		u32 handle;
+
+		if (get_user(handle, &ureloc->target_handle) == 0) {
+			struct i915_vma *vma = eb_get_vma(eb, handle)->vma;
+			u64 offset = gen8_canonical_addr(vma->node.start);
+
+			if (put_user(offset, &ureloc->presumed_offset))
+				return;
+		}
+	} while (ureloc++, --count);
+}
+
+static void eb_relocs_update_user(struct i915_execbuffer *eb)
+{
+	struct eb_vma *ev;
+
+	if (!(eb->args->flags & __EXEC_HAS_RELOC))
+		return;
+
+	list_for_each_entry(ev, &eb->relocs, reloc_link)
+		eb_relocs_update_vma(eb, ev);
+}
+
 static int eb_relocate(struct i915_execbuffer *eb)
 {
+	struct reloc_cache *c = &eb->reloc_cache;
+	struct eb_vma *ev;
 	int err;
 
+	/* Drop everything before we copy_from_user */
+	list_for_each_entry(ev, &eb->bind_list, bind_link)
+		eb_unreserve_vma(ev);
+
+	/* Pick a single buffer for all relocs, within reason */
+	c->bufsz *= sizeof(struct drm_i915_gem_relocation_entry);
+	c->bufsz += sizeof(struct drm_i915_gem_relocation_entry);
+	c->bufsz = round_up(c->bufsz, SZ_4K);
+	c->bufsz = clamp_val(c->bufsz, SZ_4K, SZ_256K);
+
+	/* We leave the final slot for chaining together or termination */
+	c->max = c->bufsz / sizeof(struct drm_i915_gem_relocation_entry) - 1;
+
+	/* Copy the user's relocations into plain system memory */
+	err = eb_relocs_copy_user(eb);
+	if (err)
+		return err;
+
+	/* Now reacquire everything, including the extra reloc bo */
 	err = eb_reserve_vm(eb);
 	if (err)
 		return err;
 
-	/* The objects are in their final locations, apply the relocations. */
-	if (eb->args->flags & __EXEC_HAS_RELOC) {
-		struct eb_vma *ev;
-		int flush;
+	/* The objects are now final, convert the relocations into commands. */
+	err = eb_relocs_gpu(eb);
+	if (err)
+		return err;
 
-		list_for_each_entry(ev, &eb->relocs, reloc_link) {
-			err = eb_relocate_vma(eb, ev);
-			if (err)
-				break;
-		}
+	return 0;
+}
+
+static int eb_reserve(struct i915_execbuffer *eb)
+{
+	int err;
 
-		flush = reloc_gpu_flush(eb);
-		if (!err)
-			err = flush;
+	err = eb_reserve_vm(eb);
+	if (err)
+		return err;
+
+	if (eb->args->flags & __EXEC_HAS_RELOC && !list_empty(&eb->relocs)) {
+		err = eb_relocate(eb);
+		if (err)
+			return err;
 	}
 
-	return err;
+	return 0;
 }
 
 static int eb_move_to_gpu(struct i915_execbuffer *eb)
@@ -2991,6 +2969,8 @@ static int __eb_pin_reloc_engine(struct i915_execbuffer *eb)
 		return PTR_ERR(ce);
 
 	/* Reuse eb->context->timeline with scheduler! */
+	if (engine->schedule)
+		ce->timeline = intel_timeline_get(eb->context->timeline);
 
 	i915_vm_put(ce->vm);
 	ce->vm = i915_vm_get(eb->context->vm);
@@ -3440,7 +3420,7 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 		goto err_engine;
 	lockdep_assert_held(&eb.context->timeline->mutex);
 
-	err = eb_relocate(&eb);
+	err = eb_reserve(&eb);
 	if (err) {
 		/*
 		 * If the user expects the execobject.offset and
@@ -3457,9 +3437,6 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 	if (err)
 		goto err_vma;
 
-	/* All GPU relocation batches must be submitted prior to the user rq */
-	GEM_BUG_ON(eb.reloc_cache.rq);
-
 	/* Allocate a request for this batch buffer nice and early. */
 	eb.request = __i915_request_create(eb.context, GFP_KERNEL);
 	if (IS_ERR(eb.request)) {
@@ -3521,6 +3498,8 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 err_vma:
 	eb_unlock_engine(&eb);
 	/* *** TIMELINE UNLOCK *** */
+
+	eb_relocs_update_user(&eb);
 err_engine:
 	eb_unpin_engine(&eb);
 err_context:
diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/selftests/i915_gem_execbuffer.c
index 992d46db1b33..8776f2750fa7 100644
--- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_execbuffer.c
@@ -11,7 +11,7 @@
 
 #include "mock_context.h"
 
-static u64 read_reloc(const u32 *map, int x, const u64 mask)
+static u64 read_reloc(const char *map, int x, const u64 mask)
 {
 	u64 reloc;
 
@@ -19,90 +19,111 @@ static u64 read_reloc(const u32 *map, int x, const u64 mask)
 	return reloc & mask;
 }
 
-static int __igt_gpu_reloc(struct i915_execbuffer *eb,
-			   struct drm_i915_gem_object *obj)
+static int mock_relocs_copy_user(struct i915_execbuffer *eb, struct eb_vma *ev)
 {
-	const unsigned int offsets[] = { 8, 3, 0 };
-	const u64 mask =
-		GENMASK_ULL(eb->reloc_cache.use_64bit_reloc ? 63 : 31, 0);
-	const u32 *map = page_mask_bits(obj->mm.mapping);
-	struct i915_request *rq;
-	struct i915_vma *vma;
-	int err;
-	int i;
+	const int stride = 2 * sizeof(u64);
+	struct drm_i915_gem_object *obj = ev->vma->obj;
+	void *last = NULL;
+	int n, total = 0;
+
+	eb->reloc_cache.head.vma = NULL;
+	eb->reloc_cache.pos = eb->reloc_cache.max;
+
+	for (n = 0; n < obj->base.size / stride; n++) {
+		struct drm_i915_gem_relocation_entry *r;
+		unsigned long count = 1;
+
+		r = eb_relocs_grow(eb, &count);
+		if (IS_ERR(r))
+			return PTR_ERR(r);
+
+		if (!count)
+			return -EINVAL;
+
+		if (eb->reloc_cache.map != last) {
+			pr_info("%s New reloc buffer @ %d\n",
+				eb->engine->name, n);
+			last = eb->reloc_cache.map;
+			total++;
+		}
 
-	vma = i915_vma_instance(obj, eb->context->vm, NULL);
-	if (IS_ERR(vma))
-		return PTR_ERR(vma);
+		r->target_handle = 0;
+		r->offset = n * stride;
+		if (n & 1)
+			r->offset += sizeof(u32);
+		r->delta = n;
+	}
 
-	err = i915_vma_pin(vma, 0, 0, PIN_USER | PIN_HIGH);
-	if (err)
-		return err;
+	pr_info("%s: %d relocs, %d buffers\n", eb->engine->name, n, total);
 
-	/* 8-Byte aligned */
-	err = __reloc_entry_gpu(eb, vma, offsets[0] * sizeof(u32), 0);
-	if (err)
-		goto unpin_vma;
+	return n;
+}
 
-	/* !8-Byte aligned */
-	err = __reloc_entry_gpu(eb, vma, offsets[1] * sizeof(u32), 1);
+static int check_relocs(struct i915_execbuffer *eb, struct eb_vma *ev)
+{
+	const u64 mask =
+		GENMASK_ULL(eb->reloc_cache.use_64bit_reloc ? 63 : 31, 0);
+	const int stride = 2 * sizeof(u64);
+	struct drm_i915_gem_object *obj = ev->vma->obj;
+	const void *map = __i915_gem_object_mapping(obj);
+	int n, err = 0;
+
+	for (n = 0; n < obj->base.size / stride; n++) {
+		unsigned int offset;
+		u64 address, reloc;
+
+		address = gen8_canonical_addr(ev->vma->node.start + n);
+		address &= mask;
+
+		offset = n * stride;
+		if (n & 1)
+			offset += sizeof(u32);
+
+		reloc = read_reloc(map, offset, mask);
+		if (reloc != address) {
+			pr_err("%s[%d]: map[%x] %llx != %llx\n",
+			       eb->engine->name, n, offset, reloc, address);
+			err = -EINVAL;
+		}
+	}
 	if (err)
-		goto unpin_vma;
+		igt_hexdump(map, obj->base.size);
+
+	return err;
+}
 
-	/* Skip to the end of the cmd page */
-	i = PAGE_SIZE / sizeof(u32) - RELOC_TAIL - 1;
-	i -= eb->reloc_cache.rq_size;
-	memset32(eb->reloc_cache.rq_cmd + eb->reloc_cache.rq_size,
-		 MI_NOOP, i);
-	eb->reloc_cache.rq_size += i;
+static int __igt_gpu_reloc(struct i915_execbuffer *eb, struct eb_vma *ev)
+{
+	int err;
+
+	err = mock_relocs_copy_user(eb, ev);
+	if (err < 0)
+		return err;
+	ev->exec->relocation_count = err;
 
-	/* Force batch chaining */
-	err = __reloc_entry_gpu(eb, vma, offsets[2] * sizeof(u32), 2);
+	err = eb_reserve_vm(eb);
 	if (err)
-		goto unpin_vma;
+		return err;
 
-	GEM_BUG_ON(!eb->reloc_cache.rq);
-	rq = i915_request_get(eb->reloc_cache.rq);
-	err = reloc_gpu_flush(eb);
+	err = eb_relocs_gpu(eb);
 	if (err)
-		goto put_rq;
-	GEM_BUG_ON(eb->reloc_cache.rq);
+		return err;
 
-	err = i915_gem_object_wait(obj, I915_WAIT_INTERRUPTIBLE, HZ / 2);
-	if (err) {
+	if (i915_gem_object_wait(ev->vma->obj,
+				 I915_WAIT_INTERRUPTIBLE, HZ / 2)) {
 		intel_gt_set_wedged(eb->engine->gt);
-		goto put_rq;
+		return -EIO;
 	}
 
-	if (!i915_request_completed(rq)) {
-		pr_err("%s: did not wait for relocations!\n", eb->engine->name);
-		err = -EINVAL;
-		goto put_rq;
-	}
-
-	for (i = 0; i < ARRAY_SIZE(offsets); i++) {
-		u64 reloc = read_reloc(map, offsets[i], mask);
-
-		if (reloc != i) {
-			pr_err("%s[%d]: map[%d] %llx != %x\n",
-			       eb->engine->name, i, offsets[i], reloc, i);
-			err = -EINVAL;
-		}
-	}
-	if (err)
-		igt_hexdump(map, 4096);
-
-put_rq:
-	i915_request_put(rq);
-unpin_vma:
-	i915_vma_unpin(vma);
-	return err;
+	return check_relocs(eb, ev);
 }
 
 static int igt_gpu_reloc(void *arg)
 {
 	struct i915_execbuffer eb;
 	struct drm_i915_gem_object *scratch;
+	struct drm_i915_gem_exec_object2 exec;
+	struct eb_vma ev = { .exec = &exec };
 	struct file *file;
 	int err = 0;
 	u32 *map;
@@ -112,11 +133,13 @@ static int igt_gpu_reloc(void *arg)
 		return PTR_ERR(file);
 
 	eb.i915 = arg;
+	INIT_LIST_HEAD(&eb.relocs);
+
 	eb.gem_context = live_context(arg, file);
 	if (IS_ERR(eb.gem_context))
 		goto err_file;
 
-	scratch = i915_gem_object_create_internal(eb.i915, 4096);
+	scratch = i915_gem_object_create_internal(eb.i915, SZ_32K);
 	if (IS_ERR(scratch))
 		goto err_file;
 
@@ -126,33 +149,71 @@ static int igt_gpu_reloc(void *arg)
 		goto err_scratch;
 	}
 
+	eb.lut_size = -1;
+	eb.vma = &ev;
+	list_add(&ev.reloc_link, &eb.relocs);
+	GEM_BUG_ON(eb_get_vma(&eb, 0) != &ev);
+
 	for_each_uabi_engine(eb.engine, eb.i915) {
+		INIT_LIST_HEAD(&eb.bind_list);
 		reloc_cache_init(&eb.reloc_cache, eb.i915);
-		memset(map, POISON_INUSE, 4096);
+		memset(map, POISON_INUSE, scratch->base.size);
+		wmb();
+
+		eb.reloc_cache.bufsz = SZ_4K;
+		eb.reloc_cache.max = eb.reloc_cache.bufsz;
+		eb.reloc_cache.max /=
+			sizeof(struct drm_i915_gem_relocation_entry);
+		eb.reloc_cache.max--; /* leave room for terminator */
 
 		intel_engine_pm_get(eb.engine);
+
+		eb.array = eb_vma_array_create(1);
+		if (!eb.array) {
+			err = -ENOMEM;
+			goto err_pm;
+		}
+
 		eb.context = intel_context_create(eb.engine);
 		if (IS_ERR(eb.context)) {
 			err = PTR_ERR(eb.context);
-			goto err_pm;
+			goto err_array;
 		}
 
 		err = intel_context_pin(eb.context);
 		if (err)
 			goto err_put;
 
+		ev.vma = i915_vma_instance(scratch, eb.context->vm, NULL);
+		if (IS_ERR(ev.vma)) {
+			err = PTR_ERR(ev.vma);
+			goto err_unpin;
+		}
+
+		err = i915_vma_pin(ev.vma, 0, 0, PIN_USER | PIN_HIGH);
+		if (err)
+			goto err_unpin;
+
 		mutex_lock(&eb.context->timeline->mutex);
 		intel_context_enter(eb.context);
-		eb.reloc_cache.ce = eb.context;
 
-		err = __igt_gpu_reloc(&eb, scratch);
+		err = __eb_pin_reloc_engine(&eb);
+		if (err)
+			goto err_exit;
+
+		err = __igt_gpu_reloc(&eb, &ev);
 
+		__eb_unpin_reloc_engine(&eb);
+err_exit:
 		intel_context_exit(eb.context);
 		mutex_unlock(&eb.context->timeline->mutex);
-
+		i915_vma_unpin(ev.vma);
+err_unpin:
 		intel_context_unpin(eb.context);
 err_put:
 		intel_context_put(eb.context);
+err_array:
+		eb_vma_array_put(eb.array);
 err_pm:
 		intel_engine_pm_put(eb.engine);
 		if (err)
-- 
2.20.1


* [Intel-gfx] [PATCH 26/66] drm/i915: Add an implementation for i915_gem_ww_ctx locking, v2.
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (23 preceding siblings ...)
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 25/66] drm/i915/gem: Reintroduce multiple passes for reloc processing Chris Wilson
@ 2020-07-15 11:51 ` Chris Wilson
  2020-07-31 10:07   ` Thomas Hellström (Intel)
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 27/66] drm/i915/gem: Pull execbuf dma resv under a single critical section Chris Wilson
                   ` (47 subsequent siblings)
  72 siblings, 1 reply; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:51 UTC (permalink / raw)
  To: intel-gfx

From: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>

i915_gem_ww_ctx is used to lock all gem bo's for pinning and memory
eviction. We don't use it yet, but let's start by adding the definition
first.

To use it, we have to pass a non-NULL ww to gem_object_lock and must
not unlock the object directly; the unlock is performed for us by
i915_gem_ww_ctx_fini.
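
As wired up in this particular patch the interface is spelled
i915_acquire_ctx; a minimal sketch of the intended usage (mirroring the
igt_acquire_lock selftest below, with obj_a/obj_b standing in for real
objects):

	struct i915_acquire_ctx acquire;
	int err;

	i915_acquire_ctx_init(&acquire);

	/* Lock every object up front; -EDEADLK backoff happens inside */
	err = i915_acquire_ctx_lock(&acquire, obj_a);
	if (err == 0)
		err = i915_acquire_ctx_lock(&acquire, obj_b);
	if (err == 0) {
		i915_acquire_ctx_done(&acquire); /* no new locks after this */
		/* ... reserve backing storage, build the request ... */
	}

	/* Drops every dma-resv lock and the references taken by _lock() */
	i915_acquire_ctx_fini(&acquire);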

Changes since v1:
- Change ww_ctx and obj order in locking functions (Jonas Lahtinen)

Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
---
 drivers/gpu/drm/i915/Makefile                 |   4 +
 drivers/gpu/drm/i915/i915_globals.c           |   1 +
 drivers/gpu/drm/i915/i915_globals.h           |   1 +
 drivers/gpu/drm/i915/mm/i915_acquire_ctx.c    | 139 ++++++++++
 drivers/gpu/drm/i915/mm/i915_acquire_ctx.h    |  34 +++
 drivers/gpu/drm/i915/mm/st_acquire_ctx.c      | 242 ++++++++++++++++++
 .../drm/i915/selftests/i915_mock_selftests.h  |   1 +
 7 files changed, 422 insertions(+)
 create mode 100644 drivers/gpu/drm/i915/mm/i915_acquire_ctx.c
 create mode 100644 drivers/gpu/drm/i915/mm/i915_acquire_ctx.h
 create mode 100644 drivers/gpu/drm/i915/mm/st_acquire_ctx.c

diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
index bda4c0e408f8..a3a4c8a555ec 100644
--- a/drivers/gpu/drm/i915/Makefile
+++ b/drivers/gpu/drm/i915/Makefile
@@ -125,6 +125,10 @@ gt-y += \
 	gt/gen9_renderstate.o
 i915-y += $(gt-y)
 
+# Memory + DMA management
+i915-y += \
+	mm/i915_acquire_ctx.o
+
 # GEM (Graphics Execution Management) code
 gem-y += \
 	gem/i915_gem_busy.o \
diff --git a/drivers/gpu/drm/i915/i915_globals.c b/drivers/gpu/drm/i915/i915_globals.c
index 3aa213684293..51ec42a14694 100644
--- a/drivers/gpu/drm/i915/i915_globals.c
+++ b/drivers/gpu/drm/i915/i915_globals.c
@@ -87,6 +87,7 @@ static void __i915_globals_cleanup(void)
 
 static __initconst int (* const initfn[])(void) = {
 	i915_global_active_init,
+	i915_global_acquire_init,
 	i915_global_buddy_init,
 	i915_global_context_init,
 	i915_global_gem_context_init,
diff --git a/drivers/gpu/drm/i915/i915_globals.h b/drivers/gpu/drm/i915/i915_globals.h
index b2f5cd9b9b1a..11227abf2769 100644
--- a/drivers/gpu/drm/i915/i915_globals.h
+++ b/drivers/gpu/drm/i915/i915_globals.h
@@ -27,6 +27,7 @@ void i915_globals_exit(void);
 
 /* constructors */
 int i915_global_active_init(void);
+int i915_global_acquire_init(void);
 int i915_global_buddy_init(void);
 int i915_global_context_init(void);
 int i915_global_gem_context_init(void);
diff --git a/drivers/gpu/drm/i915/mm/i915_acquire_ctx.c b/drivers/gpu/drm/i915/mm/i915_acquire_ctx.c
new file mode 100644
index 000000000000..d1c3b958c15d
--- /dev/null
+++ b/drivers/gpu/drm/i915/mm/i915_acquire_ctx.c
@@ -0,0 +1,139 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2020 Intel Corporation
+ */
+
+#include <linux/dma-resv.h>
+
+#include "i915_globals.h"
+#include "gem/i915_gem_object.h"
+
+#include "i915_acquire_ctx.h"
+
+static struct i915_global_acquire {
+	struct i915_global base;
+	struct kmem_cache *slab_acquires;
+} global;
+
+struct i915_acquire {
+	struct drm_i915_gem_object *obj;
+	struct i915_acquire *next;
+};
+
+static struct i915_acquire *i915_acquire_alloc(void)
+{
+	return kmem_cache_alloc(global.slab_acquires, GFP_KERNEL);
+}
+
+static void i915_acquire_free(struct i915_acquire *lnk)
+{
+	kmem_cache_free(global.slab_acquires, lnk);
+}
+
+void i915_acquire_ctx_init(struct i915_acquire_ctx *ctx)
+{
+	ww_acquire_init(&ctx->ctx, &reservation_ww_class);
+	ctx->locked = NULL;
+}
+
+int i915_acquire_ctx_lock(struct i915_acquire_ctx *ctx,
+			  struct drm_i915_gem_object *obj)
+{
+	struct i915_acquire *lock, *lnk;
+	int err;
+
+	lock = i915_acquire_alloc();
+	if (!lock)
+		return -ENOMEM;
+
+	lock->obj = i915_gem_object_get(obj);
+	lock->next = NULL;
+
+	while ((lnk = lock)) {
+		obj = lnk->obj;
+		lock = lnk->next;
+
+		err = dma_resv_lock_interruptible(obj->base.resv, &ctx->ctx);
+		if (err == -EDEADLK) {
+			struct i915_acquire *old;
+
+			while ((old = ctx->locked)) {
+				i915_gem_object_unlock(old->obj);
+				ctx->locked = old->next;
+				old->next = lock;
+				lock = old;
+			}
+
+			err = dma_resv_lock_slow_interruptible(obj->base.resv,
+							       &ctx->ctx);
+		}
+		if (!err) {
+			lnk->next = ctx->locked;
+			ctx->locked = lnk;
+		} else {
+			i915_gem_object_put(obj);
+			i915_acquire_free(lnk);
+		}
+		if (err == -EALREADY)
+			err = 0;
+		if (err)
+			break;
+	}
+
+	while ((lnk = lock)) {
+		lock = lnk->next;
+		i915_gem_object_put(lnk->obj);
+		i915_acquire_free(lnk);
+	}
+
+	return err;
+}
+
+int i915_acquire_mm(struct i915_acquire_ctx *acquire)
+{
+	return 0;
+}
+
+void i915_acquire_ctx_fini(struct i915_acquire_ctx *ctx)
+{
+	struct i915_acquire *lnk;
+
+	while ((lnk = ctx->locked)) {
+		i915_gem_object_unlock(lnk->obj);
+		i915_gem_object_put(lnk->obj);
+
+		ctx->locked = lnk->next;
+		i915_acquire_free(lnk);
+	}
+
+	ww_acquire_fini(&ctx->ctx);
+}
+
+#if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
+#include "st_acquire_ctx.c"
+#endif
+
+static void i915_global_acquire_shrink(void)
+{
+	kmem_cache_shrink(global.slab_acquires);
+}
+
+static void i915_global_acquire_exit(void)
+{
+	kmem_cache_destroy(global.slab_acquires);
+}
+
+static struct i915_global_acquire global = { {
+	.shrink = i915_global_acquire_shrink,
+	.exit = i915_global_acquire_exit,
+} };
+
+int __init i915_global_acquire_init(void)
+{
+	global.slab_acquires = KMEM_CACHE(i915_acquire, 0);
+	if (!global.slab_acquires)
+		return -ENOMEM;
+
+	i915_global_register(&global.base);
+	return 0;
+}
diff --git a/drivers/gpu/drm/i915/mm/i915_acquire_ctx.h b/drivers/gpu/drm/i915/mm/i915_acquire_ctx.h
new file mode 100644
index 000000000000..2d263ac1460d
--- /dev/null
+++ b/drivers/gpu/drm/i915/mm/i915_acquire_ctx.h
@@ -0,0 +1,34 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2020 Intel Corporation
+ */
+
+#ifndef __I915_ACQUIRE_CTX_H__
+#define __I915_ACQUIRE_CTX_H__
+
+#include <linux/list.h>
+#include <linux/ww_mutex.h>
+
+struct drm_i915_gem_object;
+struct i915_acquire;
+
+struct i915_acquire_ctx {
+	struct ww_acquire_ctx ctx;
+	struct i915_acquire *locked;
+};
+
+void i915_acquire_ctx_init(struct i915_acquire_ctx *acquire);
+
+static inline void i915_acquire_ctx_done(struct i915_acquire_ctx *acquire)
+{
+	ww_acquire_done(&acquire->ctx);
+}
+
+void i915_acquire_ctx_fini(struct i915_acquire_ctx *acquire);
+
+int __must_check i915_acquire_ctx_lock(struct i915_acquire_ctx *acquire,
+				       struct drm_i915_gem_object *obj);
+
+int i915_acquire_mm(struct i915_acquire_ctx *acquire);
+
+#endif /* __I915_ACQUIRE_CTX_H__ */
diff --git a/drivers/gpu/drm/i915/mm/st_acquire_ctx.c b/drivers/gpu/drm/i915/mm/st_acquire_ctx.c
new file mode 100644
index 000000000000..6e94bdbb3265
--- /dev/null
+++ b/drivers/gpu/drm/i915/mm/st_acquire_ctx.c
@@ -0,0 +1,242 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2020 Intel Corporation
+ */
+
+#include "i915_drv.h"
+#include "i915_selftest.h"
+
+#include "selftests/i915_random.h"
+#include "selftests/mock_gem_device.h"
+
+static int checked_acquire_lock(struct i915_acquire_ctx *acquire,
+				struct drm_i915_gem_object *obj,
+				const char *name)
+{
+	int err;
+
+	err = i915_acquire_ctx_lock(acquire, obj);
+	if (err) {
+		pr_err("i915_acquire_lock(%s) failed, err:%d\n", name, err);
+		return err;
+	}
+
+	if (!mutex_is_locked(&obj->base.resv->lock.base)) {
+		pr_err("Failed to lock %s!\n", name);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int igt_acquire_lock(void *arg)
+{
+	struct drm_i915_private *i915 = arg;
+	struct drm_i915_gem_object *a, *b;
+	struct i915_acquire_ctx acquire;
+	int err;
+
+	a = i915_gem_object_create_internal(i915, PAGE_SIZE);
+	if (IS_ERR(a))
+		return PTR_ERR(a);
+
+	b = i915_gem_object_create_internal(i915, PAGE_SIZE);
+	if (IS_ERR(b)) {
+		err = PTR_ERR(b);
+		goto out_a;
+	}
+
+	i915_acquire_ctx_init(&acquire);
+
+	err = checked_acquire_lock(&acquire, a, "A");
+	if (err)
+		goto out_fini;
+
+	err = checked_acquire_lock(&acquire, b, "B");
+	if (err)
+		goto out_fini;
+
+	/* Again for EALREADY */
+
+	err = checked_acquire_lock(&acquire, a, "A");
+	if (err)
+		goto out_fini;
+
+	err = checked_acquire_lock(&acquire, b, "B");
+	if (err)
+		goto out_fini;
+
+	i915_acquire_ctx_done(&acquire);
+
+	if (!mutex_is_locked(&a->base.resv->lock.base)) {
+		pr_err("Failed to lock A, after i915_acquire_done\n");
+		err = -EINVAL;
+	}
+	if (!mutex_is_locked(&b->base.resv->lock.base)) {
+		pr_err("Failed to lock B, after i915_acquire_done\n");
+		err = -EINVAL;
+	}
+
+out_fini:
+	i915_acquire_ctx_fini(&acquire);
+
+	if (mutex_is_locked(&a->base.resv->lock.base)) {
+		pr_err("A is still locked!\n");
+		err = -EINVAL;
+	}
+	if (mutex_is_locked(&b->base.resv->lock.base)) {
+		pr_err("B is still locked!\n");
+		err = -EINVAL;
+	}
+
+	i915_gem_object_put(b);
+out_a:
+	i915_gem_object_put(a);
+	return err;
+}
+
+struct deadlock {
+	struct drm_i915_gem_object *obj[64];
+};
+
+static int __igt_acquire_deadlock(void *arg)
+{
+	struct deadlock *dl = arg;
+	const unsigned int total = ARRAY_SIZE(dl->obj);
+	I915_RND_STATE(prng);
+	unsigned int *order;
+	int n, count, err = 0;
+
+	order = i915_random_order(total, &prng);
+	if (!order)
+		return -ENOMEM;
+
+	while (!kthread_should_stop()) {
+		struct i915_acquire_ctx acquire;
+
+		i915_random_reorder(order, total, &prng);
+		count = i915_prandom_u32_max_state(total, &prng);
+
+		i915_acquire_ctx_init(&acquire);
+
+		for (n = 0; n < count; n++) {
+			struct drm_i915_gem_object *obj = dl->obj[order[n]];
+
+			err = checked_acquire_lock(&acquire, obj, "dl");
+			if (err) {
+				i915_acquire_ctx_fini(&acquire);
+				goto out;
+			}
+		}
+
+		i915_acquire_ctx_done(&acquire);
+
+#if IS_ENABLED(CONFIG_LOCKDEP)
+		for (n = 0; n < count; n++) {
+			struct drm_i915_gem_object *obj = dl->obj[order[n]];
+
+			if (!lockdep_is_held(&obj->base.resv->lock.base)) {
+				pr_err("lock not taken!\n");
+				i915_acquire_ctx_fini(&acquire);
+				err = -EINVAL;
+				goto out;
+			}
+		}
+#endif
+
+		i915_acquire_ctx_fini(&acquire);
+
+#if IS_ENABLED(CONFIG_LOCKDEP)
+		for (n = 0; n < count; n++) {
+			struct drm_i915_gem_object *obj = dl->obj[order[n]];
+
+			if (lockdep_is_held(&obj->base.resv->lock.base)) {
+				pr_err("lock still held after fini!\n");
+				err = -EINVAL;
+				goto out;
+			}
+		}
+#endif
+	}
+
+out:
+	kfree(order);
+	return err;
+}
+
+static int igt_acquire_deadlock(void *arg)
+{
+	unsigned int ncpus = num_online_cpus();
+	struct drm_i915_private *i915 = arg;
+	struct task_struct **threads;
+	struct deadlock dl;
+	int ret = 0, n;
+
+	threads = kcalloc(ncpus, sizeof(*threads), GFP_KERNEL);
+	if (!threads)
+		return -ENOMEM;
+
+	for (n = 0; n < ARRAY_SIZE(dl.obj); n += 2) {
+		dl.obj[n] = i915_gem_object_create_internal(i915, PAGE_SIZE);
+		if (IS_ERR(dl.obj[n])) {
+			ret = PTR_ERR(dl.obj[n]);
+			goto out_obj;
+		}
+
+		/* Repeat the objects for -EALREADY */
+		dl.obj[n + 1] = i915_gem_object_get(dl.obj[n]);
+	}
+
+	for (n = 0; n < ncpus; n++) {
+		threads[n] = kthread_run(__igt_acquire_deadlock,
+					 &dl, "igt/%d", n);
+		if (IS_ERR(threads[n])) {
+			ret = PTR_ERR(threads[n]);
+			ncpus = n;
+			break;
+		}
+
+		get_task_struct(threads[n]);
+	}
+
+	yield(); /* start all threads before we begin */
+	msleep(jiffies_to_msecs(i915_selftest.timeout_jiffies));
+
+	for (n = 0; n < ncpus; n++) {
+		int err;
+
+		err = kthread_stop(threads[n]);
+		if (err < 0 && !ret)
+			ret = err;
+
+		put_task_struct(threads[n]);
+	}
+
+out_obj:
+	for (n = 0; n < ARRAY_SIZE(dl.obj); n++) {
+		if (IS_ERR(dl.obj[n]))
+			break;
+		i915_gem_object_put(dl.obj[n]);
+	}
+	kfree(threads);
+	return ret;
+}
+
+int i915_acquire_mock_selftests(void)
+{
+	static const struct i915_subtest tests[] = {
+		SUBTEST(igt_acquire_lock),
+		SUBTEST(igt_acquire_deadlock),
+	};
+	struct drm_i915_private *i915;
+	int err = 0;
+
+	i915 = mock_gem_device();
+	if (!i915)
+		return -ENOMEM;
+
+	err = i915_subtests(tests, i915);
+	drm_dev_put(&i915->drm);
+
+	return err;
+}
diff --git a/drivers/gpu/drm/i915/selftests/i915_mock_selftests.h b/drivers/gpu/drm/i915/selftests/i915_mock_selftests.h
index 3db34d3eea58..cb6f94633356 100644
--- a/drivers/gpu/drm/i915/selftests/i915_mock_selftests.h
+++ b/drivers/gpu/drm/i915/selftests/i915_mock_selftests.h
@@ -26,6 +26,7 @@ selftest(engine, intel_engine_cs_mock_selftests)
 selftest(timelines, intel_timeline_mock_selftests)
 selftest(requests, i915_request_mock_selftests)
 selftest(objects, i915_gem_object_mock_selftests)
+selftest(acquire, i915_acquire_mock_selftests)
 selftest(phys, i915_gem_phys_mock_selftests)
 selftest(dmabuf, i915_gem_dmabuf_mock_selftests)
 selftest(vma, i915_vma_mock_selftests)
-- 
2.20.1


* [Intel-gfx] [PATCH 27/66] drm/i915/gem: Pull execbuf dma resv under a single critical section
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (24 preceding siblings ...)
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 26/66] drm/i915: Add an implementation for i915_gem_ww_ctx locking, v2 Chris Wilson
@ 2020-07-15 11:51 ` Chris Wilson
  2020-07-27 18:08   ` Thomas Hellström (Intel)
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 28/66] drm/i915/gem: Replace i915_gem_object.mm.mutex with reservation_ww_class Chris Wilson
                   ` (46 subsequent siblings)
  72 siblings, 1 reply; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:51 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

Acquire all the objects, their backing storage and the page
directories used by execbuf under a single common ww_mutex. We do,
however, have to restart the critical section a few times in order to
handle various restrictions (such as avoiding copy_(from|to)_user and
mmap_sem inside the critical section).
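
The resulting control flow is roughly the following (a condensed sketch
of the eb_reserve_vm() changes in the diff; the bookkeeping of which
vmas still need binding is elided as comments):

	int err;

	/* First pass: dma-resv lock every execbuf object, then close it */
	err = eb_lock_mm(eb);
	if (err == 0)
		err = eb_acquire_mm(eb);	/* ww_acquire_done() */
	if (err)
		return err;

	/* ... if every vma is already bound, we are done ... */

	/*
	 * Otherwise page directories must be reserved as well, so release
	 * everything, preallocate, and retake the locks from scratch.
	 */
	i915_acquire_ctx_fini(&eb->acquire);
	i915_acquire_ctx_init(&eb->acquire);

	err = eb_lock_mm(eb);
	if (err)
		return err;
	/* ... stage the binding work, eb_acquire_mm(), then bind ... */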

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c    | 168 +++++++++---------
 .../i915/gem/selftests/i915_gem_execbuffer.c  |   8 +-
 2 files changed, 87 insertions(+), 89 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index ebabc0746d50..db433f3f18ec 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -20,6 +20,7 @@
 #include "gt/intel_gt_pm.h"
 #include "gt/intel_gt_requests.h"
 #include "gt/intel_ring.h"
+#include "mm/i915_acquire_ctx.h"
 
 #include "i915_drv.h"
 #include "i915_gem_clflush.h"
@@ -244,6 +245,8 @@ struct i915_execbuffer {
 	struct intel_context *context; /* logical state for the request */
 	struct i915_gem_context *gem_context; /** caller's context */
 
+	struct i915_acquire_ctx acquire; /** lock for _all_ DMA reservations */
+
 	struct i915_request *request; /** our request to build */
 	struct eb_vma *batch; /** identity of the batch obj/vma */
 
@@ -389,42 +392,6 @@ static void eb_vma_array_put(struct eb_vma_array *arr)
 	kref_put(&arr->kref, eb_vma_array_destroy);
 }
 
-static int
-eb_lock_vma(struct i915_execbuffer *eb, struct ww_acquire_ctx *acquire)
-{
-	struct eb_vma *ev;
-	int err = 0;
-
-	list_for_each_entry(ev, &eb->submit_list, submit_link) {
-		struct i915_vma *vma = ev->vma;
-
-		err = ww_mutex_lock_interruptible(&vma->resv->lock, acquire);
-		if (err == -EDEADLK) {
-			struct eb_vma *unlock = ev, *en;
-
-			list_for_each_entry_safe_continue_reverse(unlock, en,
-								  &eb->submit_list,
-								  submit_link) {
-				ww_mutex_unlock(&unlock->vma->resv->lock);
-				list_move_tail(&unlock->submit_link, &eb->submit_list);
-			}
-
-			GEM_BUG_ON(!list_is_first(&ev->submit_link, &eb->submit_list));
-			err = ww_mutex_lock_slow_interruptible(&vma->resv->lock,
-							       acquire);
-		}
-		if (err) {
-			list_for_each_entry_continue_reverse(ev,
-							     &eb->submit_list,
-							     submit_link)
-				ww_mutex_unlock(&ev->vma->resv->lock);
-			break;
-		}
-	}
-
-	return err;
-}
-
 static int eb_create(struct i915_execbuffer *eb)
 {
 	/* Allocate an extra slot for use by the sentinel */
@@ -668,6 +635,25 @@ eb_add_vma(struct i915_execbuffer *eb,
 	}
 }
 
+static int eb_lock_mm(struct i915_execbuffer *eb)
+{
+	struct eb_vma *ev;
+	int err;
+
+	list_for_each_entry(ev, &eb->bind_list, bind_link) {
+		err = i915_acquire_ctx_lock(&eb->acquire, ev->vma->obj);
+		if (err)
+			return err;
+	}
+
+	return 0;
+}
+
+static int eb_acquire_mm(struct i915_execbuffer *eb)
+{
+	return i915_acquire_mm(&eb->acquire);
+}
+
 struct eb_vm_work {
 	struct dma_fence_work base;
 	struct eb_vma_array *array;
@@ -1390,7 +1376,15 @@ static int eb_reserve_vm(struct i915_execbuffer *eb)
 	unsigned long count;
 	struct eb_vma *ev;
 	unsigned int pass;
-	int err = 0;
+	int err;
+
+	err = eb_lock_mm(eb);
+	if (err)
+		return err;
+
+	err = eb_acquire_mm(eb);
+	if (err)
+		return err;
 
 	count = 0;
 	INIT_LIST_HEAD(&unbound);
@@ -1416,10 +1410,15 @@ static int eb_reserve_vm(struct i915_execbuffer *eb)
 	if (count == 0)
 		return 0;
 
+	/* We need to reserve page directories, release all, start over */
+	i915_acquire_ctx_fini(&eb->acquire);
+
 	pass = 0;
 	do {
 		struct eb_vm_work *work;
 
+		i915_acquire_ctx_init(&eb->acquire);
+
 		/*
 		 * We need to hold one lock as we bind all the vma so that
 		 * we have a consistent view of the entire vm and can plan
@@ -1436,6 +1435,11 @@ static int eb_reserve_vm(struct i915_execbuffer *eb)
 		 * beneath it, so we have to stage and preallocate all the
 		 * resources we may require before taking the mutex.
 		 */
+
+		err = eb_lock_mm(eb);
+		if (err)
+			return err;
+
 		work = eb_vm_work(eb, count);
 		if (!work)
 			return -ENOMEM;
@@ -1453,6 +1457,10 @@ static int eb_reserve_vm(struct i915_execbuffer *eb)
 			}
 		}
 
+		err = eb_acquire_mm(eb);
+		if (err)
+			return eb_vm_work_cancel(work, err);
+
 		err = i915_vm_pin_pt_stash(work->vm, &work->stash);
 		if (err)
 			return eb_vm_work_cancel(work, err);
@@ -1543,6 +1551,8 @@ static int eb_reserve_vm(struct i915_execbuffer *eb)
 		if (signal_pending(current))
 			return -EINTR;
 
+		i915_acquire_ctx_fini(&eb->acquire);
+
 		/* Now safe to wait with no reservations held */
 
 		if (err == -EAGAIN) {
@@ -1566,8 +1576,10 @@ static int eb_reserve_vm(struct i915_execbuffer *eb)
 		 * total ownership of the vm.
 		 */
 		err = wait_for_unbinds(eb, &unbound, pass++);
-		if (err)
+		if (err) {
+			i915_acquire_ctx_init(&eb->acquire);
 			return err;
+		}
 	} while (1);
 }
 
@@ -1994,8 +2006,6 @@ static int reloc_move_to_gpu(struct i915_request *rq, struct i915_vma *vma)
 	struct drm_i915_gem_object *obj = vma->obj;
 	int err;
 
-	i915_vma_lock(vma);
-
 	if (obj->cache_dirty & ~obj->cache_coherent)
 		i915_gem_clflush_object(obj, 0);
 	obj->write_domain = 0;
@@ -2004,8 +2014,6 @@ static int reloc_move_to_gpu(struct i915_request *rq, struct i915_vma *vma)
 	if (err == 0)
 		err = i915_vma_move_to_active(vma, rq, EXEC_OBJECT_WRITE);
 
-	i915_vma_unlock(vma);
-
 	return err;
 }
 
@@ -2334,11 +2342,9 @@ get_gpu_relocs(struct i915_execbuffer *eb,
 		int err;
 
 		GEM_BUG_ON(!vma);
-		i915_vma_lock(vma);
 		err = i915_request_await_object(rq, vma->obj, false);
 		if (err == 0)
 			err = i915_vma_move_to_active(vma, rq, 0);
-		i915_vma_unlock(vma);
 		if (err)
 			return ERR_PTR(err);
 
@@ -2470,6 +2476,7 @@ static int eb_relocate(struct i915_execbuffer *eb)
 	/* Drop everything before we copy_from_user */
 	list_for_each_entry(ev, &eb->bind_list, bind_link)
 		eb_unreserve_vma(ev);
+	i915_acquire_ctx_fini(&eb->acquire);
 
 	/* Pick a single buffer for all relocs, within reason */
 	c->bufsz *= sizeof(struct drm_i915_gem_relocation_entry);
@@ -2482,6 +2489,7 @@ static int eb_relocate(struct i915_execbuffer *eb)
 
 	/* Copy the user's relocations into plain system memory */
 	err = eb_relocs_copy_user(eb);
+	i915_acquire_ctx_init(&eb->acquire);
 	if (err)
 		return err;
 
@@ -2517,17 +2525,8 @@ static int eb_reserve(struct i915_execbuffer *eb)
 
 static int eb_move_to_gpu(struct i915_execbuffer *eb)
 {
-	struct ww_acquire_ctx acquire;
 	struct eb_vma *ev;
-	int err = 0;
-
-	ww_acquire_init(&acquire, &reservation_ww_class);
-
-	err = eb_lock_vma(eb, &acquire);
-	if (err)
-		goto err_fini;
-
-	ww_acquire_done(&acquire);
+	int err;
 
 	list_for_each_entry(ev, &eb->submit_list, submit_link) {
 		struct i915_vma *vma = ev->vma;
@@ -2566,27 +2565,22 @@ static int eb_move_to_gpu(struct i915_execbuffer *eb)
 				flags &= ~EXEC_OBJECT_ASYNC;
 		}
 
-		if (err == 0 && !(flags & EXEC_OBJECT_ASYNC)) {
+		if (!(flags & EXEC_OBJECT_ASYNC)) {
 			err = i915_request_await_object
 				(eb->request, obj, flags & EXEC_OBJECT_WRITE);
+			if (unlikely(err))
+				goto err_skip;
 		}
 
-		if (err == 0)
-			err = i915_vma_move_to_active(vma, eb->request, flags);
-
-		i915_vma_unlock(vma);
+		err = i915_vma_move_to_active(vma, eb->request, flags);
+		if (unlikely(err))
+			goto err_skip;
 	}
-	ww_acquire_fini(&acquire);
-
-	if (unlikely(err))
-		goto err_skip;
 
 	/* Unconditionally flush any chipset caches (for streaming writes). */
 	intel_gt_chipset_flush(eb->engine->gt);
 	return 0;
 
-err_fini:
-	ww_acquire_fini(&acquire);
 err_skip:
 	i915_request_set_error_once(eb->request, err);
 	return err;
@@ -2749,39 +2743,27 @@ static int eb_parse_pipeline(struct i915_execbuffer *eb,
 	/* Mark active refs early for this worker, in case we get interrupted */
 	err = parser_mark_active(pw, eb->context->timeline);
 	if (err)
-		goto err_commit;
-
-	err = dma_resv_lock_interruptible(pw->batch->resv, NULL);
-	if (err)
-		goto err_commit;
+		goto out;
 
 	err = dma_resv_reserve_shared(pw->batch->resv, 1);
 	if (err)
-		goto err_commit_unlock;
+		goto out;
 
 	/* Wait for all writes (and relocs) into the batch to complete */
 	err = i915_sw_fence_await_reservation(&pw->base.chain,
 					      pw->batch->resv, NULL, false,
 					      0, I915_FENCE_GFP);
 	if (err < 0)
-		goto err_commit_unlock;
+		goto out;
 
 	/* Keep the batch alive and unwritten as we parse */
 	dma_resv_add_shared_fence(pw->batch->resv, &pw->base.dma);
 
-	dma_resv_unlock(pw->batch->resv);
-
 	/* Force execution to wait for completion of the parser */
-	dma_resv_lock(shadow->resv, NULL);
 	dma_resv_add_excl_fence(shadow->resv, &pw->base.dma);
-	dma_resv_unlock(shadow->resv);
 
-	dma_fence_work_commit_imm(&pw->base);
-	return 0;
-
-err_commit_unlock:
-	dma_resv_unlock(pw->batch->resv);
-err_commit:
+	err = 0;
+out:
 	i915_sw_fence_set_error_once(&pw->base.chain, err);
 	dma_fence_work_commit_imm(&pw->base);
 	return err;
@@ -2833,10 +2815,6 @@ static int eb_submit(struct i915_execbuffer *eb)
 {
 	int err;
 
-	err = eb_move_to_gpu(eb);
-	if (err)
-		return err;
-
 	if (eb->args->flags & I915_EXEC_GEN7_SOL_RESET) {
 		err = i915_reset_gen7_sol_offsets(eb->request);
 		if (err)
@@ -3420,6 +3398,9 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 		goto err_engine;
 	lockdep_assert_held(&eb.context->timeline->mutex);
 
+	/* *** DMA-RESV LOCK *** */
+	i915_acquire_ctx_init(&eb.acquire);
+
 	err = eb_reserve(&eb);
 	if (err) {
 		/*
@@ -3433,6 +3414,8 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 		goto err_vma;
 	}
 
+	/* *** DMA-RESV SEALED *** */
+
 	err = eb_parse(&eb);
 	if (err)
 		goto err_vma;
@@ -3483,9 +3466,20 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 		intel_gt_buffer_pool_mark_active(eb.parser.shadow->vma->private,
 						 eb.request);
 
+	err = eb_move_to_gpu(&eb);
+	if (err)
+		goto err_request;
+
+	/* *** DMA-RESV PUBLISHED *** */
+
 	trace_i915_request_queue(eb.request, eb.batch_flags);
 	err = eb_submit(&eb);
+
 err_request:
+	i915_acquire_ctx_fini(&eb.acquire);
+	eb.acquire.locked = ERR_PTR(-1);
+	/* *** DMA-RESV UNLOCK *** */
+
 	i915_request_get(eb.request);
 	eb_request_add(&eb);
 
@@ -3496,6 +3490,8 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 	i915_request_put(eb.request);
 
 err_vma:
+	if (eb.acquire.locked != ERR_PTR(-1))
+		i915_acquire_ctx_fini(&eb.acquire);
 	eb_unlock_engine(&eb);
 	/* *** TIMELINE UNLOCK *** */
 
diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/selftests/i915_gem_execbuffer.c
index 8776f2750fa7..57181718acb1 100644
--- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_execbuffer.c
@@ -101,11 +101,13 @@ static int __igt_gpu_reloc(struct i915_execbuffer *eb, struct eb_vma *ev)
 		return err;
 	ev->exec->relocation_count = err;
 
+	i915_acquire_ctx_init(&eb->acquire);
+
 	err = eb_reserve_vm(eb);
-	if (err)
-		return err;
+	if (err == 0)
+		err = eb_relocs_gpu(eb);
 
-	err = eb_relocs_gpu(eb);
+	i915_acquire_ctx_fini(&eb->acquire);
 	if (err)
 		return err;
 
-- 
2.20.1


* [Intel-gfx] [PATCH 28/66] drm/i915/gem: Replace i915_gem_object.mm.mutex with reservation_ww_class
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (25 preceding siblings ...)
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 27/66] drm/i915/gem: Pull execbuf dma resv under a single critical section Chris Wilson
@ 2020-07-15 11:51 ` Chris Wilson
  2020-07-15 15:43   ` Maarten Lankhorst
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 29/66] drm/i915: Hold wakeref for the duration of the vma GGTT binding Chris Wilson
                   ` (45 subsequent siblings)
  72 siblings, 1 reply; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:51 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

Our goal is to pull all memory reservations (in the next iteration,
obj->ops->get_pages()) under a ww_mutex, and to align those reservations
with other drivers, i.e. control all such allocations with the
reservation_ww_class. Currently, this is under the purview of
obj->mm.mutex, and while obj->mm remains an embedded struct we can
"simply" switch to using the reservation_ww_class lock,
obj->base.resv->lock.

The major consequence is the impact on the shrinker paths: the
reservation_ww_class is used to wrap allocations, and a ww_mutex does
not support subclassing, so we cannot use our usual trick of knowing
that we never recurse inside the shrinker; instead we have to finish
the reclaim with a trylock. This may result in us failing to release
the pages after having released the vma. That will have to do until a
better idea comes along.

However, this step only converts the mutex over and continues to treat
everything as a single allocation, with the pages pinned. Once the
ww_mutex is in place, we can remove the temporary pinning, as we can
then reserve all storage en masse.

One last thing to do: kill the implicit page pinning for active vma.
This will require us to invalidate vma->pages when the backing store is
removed (and we expect that while the vma is active, we mark the
backing store as active so that it cannot be removed while the HW is
busy).
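
Concretely, the shrinker side of the above reduces to something like
this (a sketch only, using the generic dma_resv_trylock()/dma_resv_unlock()
helpers; the surrounding scan loop and the actual page-release call are
elided):

	/*
	 * Reclaim may run underneath an allocation that already holds the
	 * reservation_ww_class, and a ww_mutex has no subclasses, so the
	 * shrinker must not block on the object lock.
	 */
	if (!dma_resv_trylock(obj->base.resv))
		return false;	/* skip this object, revisit it later */

	/* ... drop the backing pages as before ... */

	dma_resv_unlock(obj->base.resv);
	return true;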

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/gem/i915_gem_clflush.c   |  20 +-
 drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c    |  18 +-
 drivers/gpu/drm/i915/gem/i915_gem_domain.c    |  65 ++----
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c    |  40 +++-
 drivers/gpu/drm/i915/gem/i915_gem_object.c    |   8 +-
 drivers/gpu/drm/i915/gem/i915_gem_object.h    |  37 +--
 .../gpu/drm/i915/gem/i915_gem_object_types.h  |   1 -
 drivers/gpu/drm/i915/gem/i915_gem_pages.c     | 134 ++++++-----
 drivers/gpu/drm/i915/gem/i915_gem_phys.c      |   8 +-
 drivers/gpu/drm/i915/gem/i915_gem_shrinker.c  |  13 +-
 drivers/gpu/drm/i915/gem/i915_gem_tiling.c    |   2 -
 drivers/gpu/drm/i915/gem/i915_gem_userptr.c   |  15 +-
 .../gpu/drm/i915/gem/selftests/huge_pages.c   |  32 ++-
 .../i915/gem/selftests/i915_gem_coherency.c   |  14 +-
 .../drm/i915/gem/selftests/i915_gem_context.c |  10 +-
 .../drm/i915/gem/selftests/i915_gem_mman.c    |   2 +
 drivers/gpu/drm/i915/gt/gen6_ppgtt.c          |   2 -
 drivers/gpu/drm/i915/gt/gen8_ppgtt.c          |   1 -
 drivers/gpu/drm/i915/gt/intel_ggtt.c          |   5 +-
 drivers/gpu/drm/i915/gt/intel_gtt.h           |   2 -
 drivers/gpu/drm/i915/gt/intel_ppgtt.c         |   1 +
 drivers/gpu/drm/i915/i915_gem.c               |  16 +-
 drivers/gpu/drm/i915/i915_vma.c               | 217 +++++++-----------
 drivers/gpu/drm/i915/i915_vma_types.h         |   6 -
 drivers/gpu/drm/i915/mm/i915_acquire_ctx.c    |  12 +-
 drivers/gpu/drm/i915/selftests/i915_gem_gtt.c |   4 +-
 .../drm/i915/selftests/intel_memory_region.c  |  17 +-
 27 files changed, 313 insertions(+), 389 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_clflush.c b/drivers/gpu/drm/i915/gem/i915_gem_clflush.c
index bc0223716906..a32fd0d5570b 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_clflush.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_clflush.c
@@ -27,16 +27,8 @@ static void __do_clflush(struct drm_i915_gem_object *obj)
 static int clflush_work(struct dma_fence_work *base)
 {
 	struct clflush *clflush = container_of(base, typeof(*clflush), base);
-	struct drm_i915_gem_object *obj = clflush->obj;
-	int err;
-
-	err = i915_gem_object_pin_pages(obj);
-	if (err)
-		return err;
-
-	__do_clflush(obj);
-	i915_gem_object_unpin_pages(obj);
 
+	__do_clflush(clflush->obj);
 	return 0;
 }
 
@@ -44,7 +36,7 @@ static void clflush_release(struct dma_fence_work *base)
 {
 	struct clflush *clflush = container_of(base, typeof(*clflush), base);
 
-	i915_gem_object_put(clflush->obj);
+	i915_gem_object_unpin_pages(clflush->obj);
 }
 
 static const struct dma_fence_work_ops clflush_ops = {
@@ -63,8 +55,14 @@ static struct clflush *clflush_work_create(struct drm_i915_gem_object *obj)
 	if (!clflush)
 		return NULL;
 
+	if (__i915_gem_object_get_pages_locked(obj)) {
+		kfree(clflush);
+		return NULL;
+	}
+
 	dma_fence_work_init(&clflush->base, &clflush_ops);
-	clflush->obj = i915_gem_object_get(obj); /* obj <-> clflush cycle */
+	__i915_gem_object_pin_pages(obj);
+	clflush->obj = obj; /* Beware the obj.resv <-> clflush fence cycle */
 
 	return clflush;
 }
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c b/drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c
index 2679380159fc..049a15e6b496 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c
@@ -124,19 +124,12 @@ static int i915_gem_begin_cpu_access(struct dma_buf *dma_buf, enum dma_data_dire
 	bool write = (direction == DMA_BIDIRECTIONAL || direction == DMA_TO_DEVICE);
 	int err;
 
-	err = i915_gem_object_pin_pages(obj);
-	if (err)
-		return err;
-
 	err = i915_gem_object_lock_interruptible(obj);
 	if (err)
-		goto out;
+		return err;
 
 	err = i915_gem_object_set_to_cpu_domain(obj, write);
 	i915_gem_object_unlock(obj);
-
-out:
-	i915_gem_object_unpin_pages(obj);
 	return err;
 }
 
@@ -145,19 +138,12 @@ static int i915_gem_end_cpu_access(struct dma_buf *dma_buf, enum dma_data_direct
 	struct drm_i915_gem_object *obj = dma_buf_to_obj(dma_buf);
 	int err;
 
-	err = i915_gem_object_pin_pages(obj);
-	if (err)
-		return err;
-
 	err = i915_gem_object_lock_interruptible(obj);
 	if (err)
-		goto out;
+		return err;
 
 	err = i915_gem_object_set_to_gtt_domain(obj, false);
 	i915_gem_object_unlock(obj);
-
-out:
-	i915_gem_object_unpin_pages(obj);
 	return err;
 }
 
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_domain.c b/drivers/gpu/drm/i915/gem/i915_gem_domain.c
index 7f76fc68f498..30e4b163588b 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_domain.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_domain.c
@@ -70,7 +70,7 @@ i915_gem_object_set_to_wc_domain(struct drm_i915_gem_object *obj, bool write)
 	 * continue to assume that the obj remained out of the CPU cached
 	 * domain.
 	 */
-	ret = i915_gem_object_pin_pages(obj);
+	ret = __i915_gem_object_get_pages_locked(obj);
 	if (ret)
 		return ret;
 
@@ -94,7 +94,6 @@ i915_gem_object_set_to_wc_domain(struct drm_i915_gem_object *obj, bool write)
 		obj->mm.dirty = true;
 	}
 
-	i915_gem_object_unpin_pages(obj);
 	return 0;
 }
 
@@ -131,7 +130,7 @@ i915_gem_object_set_to_gtt_domain(struct drm_i915_gem_object *obj, bool write)
 	 * continue to assume that the obj remained out of the CPU cached
 	 * domain.
 	 */
-	ret = i915_gem_object_pin_pages(obj);
+	ret = __i915_gem_object_get_pages_locked(obj);
 	if (ret)
 		return ret;
 
@@ -163,7 +162,6 @@ i915_gem_object_set_to_gtt_domain(struct drm_i915_gem_object *obj, bool write)
 		spin_unlock(&obj->vma.lock);
 	}
 
-	i915_gem_object_unpin_pages(obj);
 	return 0;
 }
 
@@ -532,13 +530,9 @@ i915_gem_set_domain_ioctl(struct drm_device *dev, void *data,
 	 * continue to assume that the obj remained out of the CPU cached
 	 * domain.
 	 */
-	err = i915_gem_object_pin_pages(obj);
-	if (err)
-		goto out;
-
 	err = i915_gem_object_lock_interruptible(obj);
 	if (err)
-		goto out_unpin;
+		goto out;
 
 	if (read_domains & I915_GEM_DOMAIN_WC)
 		err = i915_gem_object_set_to_wc_domain(obj, write_domain);
@@ -555,8 +549,6 @@ i915_gem_set_domain_ioctl(struct drm_device *dev, void *data,
 	if (write_domain)
 		i915_gem_object_invalidate_frontbuffer(obj, ORIGIN_CPU);
 
-out_unpin:
-	i915_gem_object_unpin_pages(obj);
 out:
 	i915_gem_object_put(obj);
 	return err;
@@ -572,11 +564,13 @@ int i915_gem_object_prepare_read(struct drm_i915_gem_object *obj,
 {
 	int ret;
 
+	assert_object_held(obj);
+
 	*needs_clflush = 0;
 	if (!i915_gem_object_has_struct_page(obj))
 		return -ENODEV;
 
-	ret = i915_gem_object_lock_interruptible(obj);
+	ret = __i915_gem_object_get_pages_locked(obj);
 	if (ret)
 		return ret;
 
@@ -584,19 +578,11 @@ int i915_gem_object_prepare_read(struct drm_i915_gem_object *obj,
 				   I915_WAIT_INTERRUPTIBLE,
 				   MAX_SCHEDULE_TIMEOUT);
 	if (ret)
-		goto err_unlock;
-
-	ret = i915_gem_object_pin_pages(obj);
-	if (ret)
-		goto err_unlock;
+		return ret;
 
 	if (obj->cache_coherent & I915_BO_CACHE_COHERENT_FOR_READ ||
 	    !static_cpu_has(X86_FEATURE_CLFLUSH)) {
-		ret = i915_gem_object_set_to_cpu_domain(obj, false);
-		if (ret)
-			goto err_unpin;
-		else
-			goto out;
+		return i915_gem_object_set_to_cpu_domain(obj, false);
 	}
 
 	i915_gem_object_flush_write_domain(obj, ~I915_GEM_DOMAIN_CPU);
@@ -610,15 +596,7 @@ int i915_gem_object_prepare_read(struct drm_i915_gem_object *obj,
 	    !(obj->read_domains & I915_GEM_DOMAIN_CPU))
 		*needs_clflush = CLFLUSH_BEFORE;
 
-out:
-	/* return with the pages pinned */
 	return 0;
-
-err_unpin:
-	i915_gem_object_unpin_pages(obj);
-err_unlock:
-	i915_gem_object_unlock(obj);
-	return ret;
 }
 
 int i915_gem_object_prepare_write(struct drm_i915_gem_object *obj,
@@ -626,11 +604,13 @@ int i915_gem_object_prepare_write(struct drm_i915_gem_object *obj,
 {
 	int ret;
 
+	assert_object_held(obj);
+
 	*needs_clflush = 0;
 	if (!i915_gem_object_has_struct_page(obj))
 		return -ENODEV;
 
-	ret = i915_gem_object_lock_interruptible(obj);
+	ret = __i915_gem_object_get_pages_locked(obj);
 	if (ret)
 		return ret;
 
@@ -639,20 +619,11 @@ int i915_gem_object_prepare_write(struct drm_i915_gem_object *obj,
 				   I915_WAIT_ALL,
 				   MAX_SCHEDULE_TIMEOUT);
 	if (ret)
-		goto err_unlock;
-
-	ret = i915_gem_object_pin_pages(obj);
-	if (ret)
-		goto err_unlock;
+		return ret;
 
 	if (obj->cache_coherent & I915_BO_CACHE_COHERENT_FOR_WRITE ||
-	    !static_cpu_has(X86_FEATURE_CLFLUSH)) {
-		ret = i915_gem_object_set_to_cpu_domain(obj, true);
-		if (ret)
-			goto err_unpin;
-		else
-			goto out;
-	}
+	    !static_cpu_has(X86_FEATURE_CLFLUSH))
+		return i915_gem_object_set_to_cpu_domain(obj, true);
 
 	i915_gem_object_flush_write_domain(obj, ~I915_GEM_DOMAIN_CPU);
 
@@ -672,15 +643,7 @@ int i915_gem_object_prepare_write(struct drm_i915_gem_object *obj,
 			*needs_clflush |= CLFLUSH_BEFORE;
 	}
 
-out:
 	i915_gem_object_invalidate_frontbuffer(obj, ORIGIN_CPU);
 	obj->mm.dirty = true;
-	/* return with the pages pinned */
 	return 0;
-
-err_unpin:
-	i915_gem_object_unpin_pages(obj);
-err_unlock:
-	i915_gem_object_unlock(obj);
-	return ret;
 }
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index db433f3f18ec..b07c508812ad 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -966,6 +966,13 @@ static int best_hole(struct drm_mm *mm, struct drm_mm_node *node,
 	} while (1);
 }
 
+static void eb_pin_vma_pages(struct i915_vma *vma, unsigned int count)
+{
+	count = hweight32(count);
+	while (count--)
+		__i915_gem_object_pin_pages(vma->obj);
+}
+
 static int eb_reserve_vma(struct eb_vm_work *work, struct eb_bind_vma *bind)
 {
 	struct drm_i915_gem_exec_object2 *entry = bind->ev->exec;
@@ -1077,7 +1084,6 @@ static int eb_reserve_vma(struct eb_vm_work *work, struct eb_bind_vma *bind)
 		if (unlikely(err))
 			return err;
 
-		atomic_add(I915_VMA_PAGES_ACTIVE, &vma->pages_count);
 		atomic_or(bind_flags, &vma->flags);
 
 		if (i915_vma_is_ggtt(vma))
@@ -1184,9 +1190,14 @@ static void __eb_bind_vma(struct eb_vm_work *work)
 		GEM_BUG_ON(vma->vm != vm);
 		GEM_BUG_ON(!i915_vma_is_active(vma));
 
+		if (!vma->pages)
+			vma->ops->set_pages(vma); /* plain assignment */
+
 		vma->ops->bind_vma(vm, &work->stash, vma,
 				   vma->obj->cache_level, bind->bind_flags);
 
+		eb_pin_vma_pages(vma, bind->bind_flags);
+
 		if (drm_mm_node_allocated(&bind->hole)) {
 			mutex_lock(&vm->mutex);
 			GEM_BUG_ON(bind->hole.mm != &vm->mm);
@@ -1203,7 +1214,6 @@ static void __eb_bind_vma(struct eb_vm_work *work)
 
 put:
 		GEM_BUG_ON(drm_mm_node_allocated(&bind->hole));
-		i915_vma_put_pages(vma);
 	}
 	work->count = 0;
 }
@@ -1316,8 +1326,24 @@ static int eb_prepare_vma(struct eb_vm_work *work,
 		if (err)
 			return err;
 	}
+	return 0;
+}
+
+static int eb_lock_pt(struct i915_execbuffer *eb,
+		      struct i915_vm_pt_stash *stash)
+{
+	struct i915_page_table *pt;
+	int n, err;
 
-	return i915_vma_get_pages(vma);
+	for (n = 0; n < ARRAY_SIZE(stash->pt); n++) {
+		for (pt = stash->pt[n]; pt; pt = pt->stash) {
+			err = i915_acquire_ctx_lock(&eb->acquire, pt->base);
+			if (err)
+				return err;
+		}
+	}
+
+	return 0;
 }
 
 static int wait_for_unbinds(struct i915_execbuffer *eb,
@@ -1457,11 +1483,11 @@ static int eb_reserve_vm(struct i915_execbuffer *eb)
 			}
 		}
 
-		err = eb_acquire_mm(eb);
+		err = eb_lock_pt(eb, &work->stash);
 		if (err)
 			return eb_vm_work_cancel(work, err);
 
-		err = i915_vm_pin_pt_stash(work->vm, &work->stash);
+		err = eb_acquire_mm(eb);
 		if (err)
 			return eb_vm_work_cancel(work, err);
 
@@ -2714,7 +2740,7 @@ static int eb_parse_pipeline(struct i915_execbuffer *eb,
 	if (!pw)
 		return -ENOMEM;
 
-	ptr = i915_gem_object_pin_map(shadow->obj, I915_MAP_FORCE_WB);
+	ptr = __i915_gem_object_pin_map_locked(shadow->obj, I915_MAP_FORCE_WB);
 	if (IS_ERR(ptr)) {
 		err = PTR_ERR(ptr);
 		goto err_free;
@@ -2722,7 +2748,7 @@ static int eb_parse_pipeline(struct i915_execbuffer *eb,
 
 	if (!(batch->obj->cache_coherent & I915_BO_CACHE_COHERENT_FOR_READ) &&
 	    i915_has_memcpy_from_wc()) {
-		ptr = i915_gem_object_pin_map(batch->obj, I915_MAP_WC);
+		ptr = __i915_gem_object_pin_map_locked(batch->obj, I915_MAP_WC);
 		if (IS_ERR(ptr)) {
 			err = PTR_ERR(ptr);
 			goto err_dst;
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_object.c b/drivers/gpu/drm/i915/gem/i915_gem_object.c
index c8421fd9d2dc..799ad4e648aa 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_object.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_object.c
@@ -53,8 +53,6 @@ void i915_gem_object_init(struct drm_i915_gem_object *obj,
 			  const struct drm_i915_gem_object_ops *ops,
 			  struct lock_class_key *key)
 {
-	__mutex_init(&obj->mm.lock, ops->name ?: "obj->mm.lock", key);
-
 	spin_lock_init(&obj->vma.lock);
 	INIT_LIST_HEAD(&obj->vma.list);
 
@@ -73,10 +71,6 @@ void i915_gem_object_init(struct drm_i915_gem_object *obj,
 	obj->mm.madv = I915_MADV_WILLNEED;
 	INIT_RADIX_TREE(&obj->mm.get_page.radix, GFP_KERNEL | __GFP_NOWARN);
 	mutex_init(&obj->mm.get_page.lock);
-
-	if (IS_ENABLED(CONFIG_LOCKDEP) && i915_gem_object_is_shrinkable(obj))
-		i915_gem_shrinker_taints_mutex(to_i915(obj->base.dev),
-					       &obj->mm.lock);
 }
 
 /**
@@ -229,10 +223,12 @@ static void __i915_gem_free_objects(struct drm_i915_private *i915,
 
 		GEM_BUG_ON(!list_empty(&obj->lut_list));
 
+		i915_gem_object_lock(obj);
 		atomic_set(&obj->mm.pages_pin_count, 0);
 		__i915_gem_object_put_pages(obj);
 		GEM_BUG_ON(i915_gem_object_has_pages(obj));
 		bitmap_free(obj->bit_17);
+		i915_gem_object_unlock(obj);
 
 		if (obj->base.import_attach)
 			drm_prime_gem_destroy(&obj->base, NULL);
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_object.h b/drivers/gpu/drm/i915/gem/i915_gem_object.h
index 6f60687b6be2..26f53321443b 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_object.h
+++ b/drivers/gpu/drm/i915/gem/i915_gem_object.h
@@ -271,36 +271,9 @@ void __i915_gem_object_set_pages(struct drm_i915_gem_object *obj,
 				 struct sg_table *pages,
 				 unsigned int sg_page_sizes);
 
-int ____i915_gem_object_get_pages(struct drm_i915_gem_object *obj);
-int __i915_gem_object_get_pages(struct drm_i915_gem_object *obj);
-
-enum i915_mm_subclass { /* lockdep subclass for obj->mm.lock/struct_mutex */
-	I915_MM_NORMAL = 0,
-	/*
-	 * Only used by struct_mutex, when called "recursively" from
-	 * direct-reclaim-esque. Safe because there is only every one
-	 * struct_mutex in the entire system.
-	 */
-	I915_MM_SHRINKER = 1,
-	/*
-	 * Used for obj->mm.lock when allocating pages. Safe because the object
-	 * isn't yet on any LRU, and therefore the shrinker can't deadlock on
-	 * it. As soon as the object has pages, obj->mm.lock nests within
-	 * fs_reclaim.
-	 */
-	I915_MM_GET_PAGES = 1,
-};
-
-static inline int __must_check
-i915_gem_object_pin_pages(struct drm_i915_gem_object *obj)
-{
-	might_lock_nested(&obj->mm.lock, I915_MM_GET_PAGES);
+int __i915_gem_object_get_pages_locked(struct drm_i915_gem_object *obj);
 
-	if (atomic_inc_not_zero(&obj->mm.pages_pin_count))
-		return 0;
-
-	return __i915_gem_object_get_pages(obj);
-}
+int i915_gem_object_pin_pages(struct drm_i915_gem_object *obj);
 
 static inline bool
 i915_gem_object_has_pages(struct drm_i915_gem_object *obj)
@@ -368,6 +341,9 @@ enum i915_map_type {
 void *__must_check i915_gem_object_pin_map(struct drm_i915_gem_object *obj,
 					   enum i915_map_type type);
 
+void *__i915_gem_object_pin_map_locked(struct drm_i915_gem_object *obj,
+				       enum i915_map_type type);
+
 static inline void *__i915_gem_object_mapping(struct drm_i915_gem_object *obj)
 {
 	return page_mask_bits(obj->mm.mapping);
@@ -417,8 +393,7 @@ int i915_gem_object_prepare_write(struct drm_i915_gem_object *obj,
 static inline void
 i915_gem_object_finish_access(struct drm_i915_gem_object *obj)
 {
-	i915_gem_object_unpin_pages(obj);
-	i915_gem_object_unlock(obj);
+	assert_object_held(obj);
 }
 
 static inline struct intel_engine_cs *
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_object_types.h b/drivers/gpu/drm/i915/gem/i915_gem_object_types.h
index d0847d7896f9..ae3303ba272c 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_object_types.h
+++ b/drivers/gpu/drm/i915/gem/i915_gem_object_types.h
@@ -187,7 +187,6 @@ struct drm_i915_gem_object {
 		 * Protects the pages and their use. Do not use directly, but
 		 * instead go through the pin/unpin interfaces.
 		 */
-		struct mutex lock;
 		atomic_t pages_pin_count;
 		atomic_t shrink_pin;
 
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_pages.c b/drivers/gpu/drm/i915/gem/i915_gem_pages.c
index 7050519c87a4..76d53e535f42 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_pages.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_pages.c
@@ -18,7 +18,7 @@ void __i915_gem_object_set_pages(struct drm_i915_gem_object *obj,
 	unsigned long supported = INTEL_INFO(i915)->page_sizes;
 	int i;
 
-	lockdep_assert_held(&obj->mm.lock);
+	assert_object_held(obj);
 
 	if (i915_gem_object_is_volatile(obj))
 		obj->mm.madv = I915_MADV_DONTNEED;
@@ -81,13 +81,17 @@ void __i915_gem_object_set_pages(struct drm_i915_gem_object *obj,
 	}
 }
 
-int ____i915_gem_object_get_pages(struct drm_i915_gem_object *obj)
+int __i915_gem_object_get_pages_locked(struct drm_i915_gem_object *obj)
 {
-	struct drm_i915_private *i915 = to_i915(obj->base.dev);
 	int err;
 
+	assert_object_held(obj);
+
+	if (i915_gem_object_has_pages(obj))
+		return 0;
+
 	if (unlikely(obj->mm.madv != I915_MADV_WILLNEED)) {
-		drm_dbg(&i915->drm,
+		drm_dbg(obj->base.dev,
 			"Attempting to obtain a purgeable object\n");
 		return -EFAULT;
 	}
@@ -98,34 +102,33 @@ int ____i915_gem_object_get_pages(struct drm_i915_gem_object *obj)
 	return err;
 }
 
-/* Ensure that the associated pages are gathered from the backing storage
+/*
+ * Ensure that the associated pages are gathered from the backing storage
  * and pinned into our object. i915_gem_object_pin_pages() may be called
  * multiple times before they are released by a single call to
  * i915_gem_object_unpin_pages() - once the pages are no longer referenced
  * either as a result of memory pressure (reaping pages under the shrinker)
  * or as the object is itself released.
  */
-int __i915_gem_object_get_pages(struct drm_i915_gem_object *obj)
+int i915_gem_object_pin_pages(struct drm_i915_gem_object *obj)
 {
 	int err;
 
-	err = mutex_lock_interruptible_nested(&obj->mm.lock, I915_MM_GET_PAGES);
+	might_lock(&obj->base.resv->lock.base);
+
+	if (atomic_inc_not_zero(&obj->mm.pages_pin_count))
+		return 0;
+
+	err = i915_gem_object_lock_interruptible(obj);
 	if (err)
 		return err;
 
-	if (unlikely(!i915_gem_object_has_pages(obj))) {
-		GEM_BUG_ON(i915_gem_object_has_pinned_pages(obj));
-
-		err = ____i915_gem_object_get_pages(obj);
-		if (err)
-			goto unlock;
+	err = __i915_gem_object_get_pages_locked(obj);
+	if (err == 0)
+		atomic_inc(&obj->mm.pages_pin_count);
 
-		smp_mb__before_atomic();
-	}
-	atomic_inc(&obj->mm.pages_pin_count);
+	i915_gem_object_unlock(obj);
 
-unlock:
-	mutex_unlock(&obj->mm.lock);
 	return err;
 }
 
@@ -140,7 +143,7 @@ void i915_gem_object_truncate(struct drm_i915_gem_object *obj)
 /* Try to discard unwanted pages */
 void i915_gem_object_writeback(struct drm_i915_gem_object *obj)
 {
-	lockdep_assert_held(&obj->mm.lock);
+	assert_object_held(obj);
 	GEM_BUG_ON(i915_gem_object_has_pages(obj));
 
 	if (obj->ops->writeback)
@@ -194,17 +197,15 @@ __i915_gem_object_unset_pages(struct drm_i915_gem_object *obj)
 int __i915_gem_object_put_pages(struct drm_i915_gem_object *obj)
 {
 	struct sg_table *pages;
-	int err;
+
+	/* May be called by shrinker from within get_pages() (on another bo) */
+	assert_object_held(obj);
 
 	if (i915_gem_object_has_pinned_pages(obj))
 		return -EBUSY;
 
-	/* May be called by shrinker from within get_pages() (on another bo) */
-	mutex_lock(&obj->mm.lock);
-	if (unlikely(atomic_read(&obj->mm.pages_pin_count))) {
-		err = -EBUSY;
-		goto unlock;
-	}
+	if (unlikely(atomic_read(&obj->mm.pages_pin_count)))
+		return -EBUSY;
 
 	i915_gem_object_release_mmap_offset(obj);
 
@@ -227,11 +228,7 @@ int __i915_gem_object_put_pages(struct drm_i915_gem_object *obj)
 	if (!IS_ERR(pages))
 		obj->ops->put_pages(obj, pages);
 
-	err = 0;
-unlock:
-	mutex_unlock(&obj->mm.lock);
-
-	return err;
+	return 0;
 }
 
 static inline pte_t iomap_pte(resource_size_t base,
@@ -311,48 +308,28 @@ static void *i915_gem_object_map(struct drm_i915_gem_object *obj,
 	return area->addr;
 }
 
-/* get, pin, and map the pages of the object into kernel space */
-void *i915_gem_object_pin_map(struct drm_i915_gem_object *obj,
-			      enum i915_map_type type)
+void *__i915_gem_object_pin_map_locked(struct drm_i915_gem_object *obj,
+				       enum i915_map_type type)
 {
 	enum i915_map_type has_type;
 	unsigned int flags;
 	bool pinned;
 	void *ptr;
-	int err;
+
+	assert_object_held(obj);
+	GEM_BUG_ON(!i915_gem_object_has_pages(obj));
 
 	flags = I915_GEM_OBJECT_HAS_STRUCT_PAGE | I915_GEM_OBJECT_HAS_IOMEM;
 	if (!i915_gem_object_type_has(obj, flags))
 		return ERR_PTR(-ENXIO);
 
-	err = mutex_lock_interruptible_nested(&obj->mm.lock, I915_MM_GET_PAGES);
-	if (err)
-		return ERR_PTR(err);
-
 	pinned = !(type & I915_MAP_OVERRIDE);
 	type &= ~I915_MAP_OVERRIDE;
 
-	if (!atomic_inc_not_zero(&obj->mm.pages_pin_count)) {
-		if (unlikely(!i915_gem_object_has_pages(obj))) {
-			GEM_BUG_ON(i915_gem_object_has_pinned_pages(obj));
-
-			err = ____i915_gem_object_get_pages(obj);
-			if (err)
-				goto err_unlock;
-
-			smp_mb__before_atomic();
-		}
-		atomic_inc(&obj->mm.pages_pin_count);
-		pinned = false;
-	}
-	GEM_BUG_ON(!i915_gem_object_has_pages(obj));
-
 	ptr = page_unpack_bits(obj->mm.mapping, &has_type);
 	if (ptr && has_type != type) {
-		if (pinned) {
-			err = -EBUSY;
-			goto err_unpin;
-		}
+		if (pinned)
+			return ERR_PTR(-EBUSY);
 
 		unmap_object(obj, ptr);
 
@@ -361,23 +338,38 @@ void *i915_gem_object_pin_map(struct drm_i915_gem_object *obj,
 
 	if (!ptr) {
 		ptr = i915_gem_object_map(obj, type);
-		if (!ptr) {
-			err = -ENOMEM;
-			goto err_unpin;
-		}
+		if (!ptr)
+			return ERR_PTR(-ENOMEM);
 
 		obj->mm.mapping = page_pack_bits(ptr, type);
 	}
 
-out_unlock:
-	mutex_unlock(&obj->mm.lock);
+	__i915_gem_object_pin_pages(obj);
 	return ptr;
+}
+
+/* get, pin, and map the pages of the object into kernel space */
+void *i915_gem_object_pin_map(struct drm_i915_gem_object *obj,
+			      enum i915_map_type type)
+{
+	void *ptr;
+	int err;
+
+	err = i915_gem_object_lock_interruptible(obj);
+	if (err)
+		return ERR_PTR(err);
 
-err_unpin:
-	atomic_dec(&obj->mm.pages_pin_count);
-err_unlock:
-	ptr = ERR_PTR(err);
-	goto out_unlock;
+	err = __i915_gem_object_get_pages_locked(obj);
+	if (err) {
+		ptr = ERR_PTR(err);
+		goto out;
+	}
+
+	ptr = __i915_gem_object_pin_map_locked(obj, type);
+
+out:
+	i915_gem_object_unlock(obj);
+	return ptr;
 }
 
 void __i915_gem_object_flush_map(struct drm_i915_gem_object *obj,
@@ -434,7 +426,9 @@ i915_gem_object_get_sg(struct drm_i915_gem_object *obj,
 
 	might_sleep();
 	GEM_BUG_ON(n >= obj->base.size >> PAGE_SHIFT);
-	GEM_BUG_ON(!i915_gem_object_has_pinned_pages(obj));
+	GEM_BUG_ON(!i915_gem_object_has_pages(obj));
+	GEM_BUG_ON(!mutex_is_locked(&obj->base.resv->lock.base) &&
+		   !i915_gem_object_has_pinned_pages(obj));
 
 	/* As we iterate forward through the sg, we record each entry in a
 	 * radixtree for quick repeated (backwards) lookups. If we have seen
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_phys.c b/drivers/gpu/drm/i915/gem/i915_gem_phys.c
index 28147aab47b9..f7f93b68b7c1 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_phys.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_phys.c
@@ -165,7 +165,7 @@ int i915_gem_object_attach_phys(struct drm_i915_gem_object *obj, int align)
 	if (err)
 		return err;
 
-	mutex_lock_nested(&obj->mm.lock, I915_MM_GET_PAGES);
+	i915_gem_object_lock(obj);
 
 	if (obj->mm.madv != I915_MADV_WILLNEED) {
 		err = -EFAULT;
@@ -186,7 +186,7 @@ int i915_gem_object_attach_phys(struct drm_i915_gem_object *obj, int align)
 
 	obj->ops = &i915_gem_phys_ops;
 
-	err = ____i915_gem_object_get_pages(obj);
+	err = __i915_gem_object_get_pages_locked(obj);
 	if (err)
 		goto err_xfer;
 
@@ -198,7 +198,7 @@ int i915_gem_object_attach_phys(struct drm_i915_gem_object *obj, int align)
 
 	i915_gem_object_release_memory_region(obj);
 
-	mutex_unlock(&obj->mm.lock);
+	i915_gem_object_unlock(obj);
 	return 0;
 
 err_xfer:
@@ -209,7 +209,7 @@ int i915_gem_object_attach_phys(struct drm_i915_gem_object *obj, int align)
 		__i915_gem_object_set_pages(obj, pages, sg_page_sizes);
 	}
 err_unlock:
-	mutex_unlock(&obj->mm.lock);
+	i915_gem_object_unlock(obj);
 	return err;
 }
 
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c b/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c
index dc8f052a0ffe..4e928103a38f 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c
@@ -47,10 +47,7 @@ static bool unsafe_drop_pages(struct drm_i915_gem_object *obj,
 	if (!(shrink & I915_SHRINK_BOUND))
 		flags = I915_GEM_OBJECT_UNBIND_TEST;
 
-	if (i915_gem_object_unbind(obj, flags) == 0)
-		__i915_gem_object_put_pages(obj);
-
-	return !i915_gem_object_has_pages(obj);
+	return i915_gem_object_unbind(obj, flags) == 0;
 }
 
 static void try_to_writeback(struct drm_i915_gem_object *obj,
@@ -199,14 +196,14 @@ i915_gem_shrink(struct drm_i915_private *i915,
 
 			spin_unlock_irqrestore(&i915->mm.obj_lock, flags);
 
-			if (unsafe_drop_pages(obj, shrink)) {
-				/* May arrive from get_pages on another bo */
-				mutex_lock(&obj->mm.lock);
+			if (unsafe_drop_pages(obj, shrink) &&
+			    i915_gem_object_trylock(obj)) {
+				__i915_gem_object_put_pages(obj);
 				if (!i915_gem_object_has_pages(obj)) {
 					try_to_writeback(obj, shrink);
 					count += obj->base.size >> PAGE_SHIFT;
 				}
-				mutex_unlock(&obj->mm.lock);
+				i915_gem_object_unlock(obj);
 			}
 
 			scanned += obj->base.size >> PAGE_SHIFT;
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_tiling.c b/drivers/gpu/drm/i915/gem/i915_gem_tiling.c
index ff72ee2fd9cd..ac12e1c20e66 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_tiling.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_tiling.c
@@ -265,7 +265,6 @@ i915_gem_object_set_tiling(struct drm_i915_gem_object *obj,
 	 * pages to prevent them being swapped out and causing corruption
 	 * due to the change in swizzling.
 	 */
-	mutex_lock(&obj->mm.lock);
 	if (i915_gem_object_has_pages(obj) &&
 	    obj->mm.madv == I915_MADV_WILLNEED &&
 	    i915->quirks & QUIRK_PIN_SWIZZLED_PAGES) {
@@ -280,7 +279,6 @@ i915_gem_object_set_tiling(struct drm_i915_gem_object *obj,
 			obj->mm.quirked = true;
 		}
 	}
-	mutex_unlock(&obj->mm.lock);
 
 	spin_lock(&obj->vma.lock);
 	for_each_ggtt_vma(vma, obj) {
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
index e946032b13e4..80907c00c6fd 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
@@ -129,8 +129,15 @@ userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
 		ret = i915_gem_object_unbind(obj,
 					     I915_GEM_OBJECT_UNBIND_ACTIVE |
 					     I915_GEM_OBJECT_UNBIND_BARRIER);
-		if (ret == 0)
-			ret = __i915_gem_object_put_pages(obj);
+		if (ret == 0) {
+			/* ww_mutex and mmu_notifier is fs_reclaim tainted */
+			if (i915_gem_object_trylock(obj)) {
+				ret = __i915_gem_object_put_pages(obj);
+				i915_gem_object_unlock(obj);
+			} else {
+				ret = -EAGAIN;
+			}
+		}
 		i915_gem_object_put(obj);
 		if (ret)
 			return ret;
@@ -485,7 +492,7 @@ __i915_gem_userptr_get_pages_worker(struct work_struct *_work)
 		}
 	}
 
-	mutex_lock_nested(&obj->mm.lock, I915_MM_GET_PAGES);
+	i915_gem_object_lock(obj);
 	if (obj->userptr.work == &work->work) {
 		struct sg_table *pages = ERR_PTR(ret);
 
@@ -502,7 +509,7 @@ __i915_gem_userptr_get_pages_worker(struct work_struct *_work)
 		if (IS_ERR(pages))
 			__i915_gem_userptr_set_active(obj, false);
 	}
-	mutex_unlock(&obj->mm.lock);
+	i915_gem_object_unlock(obj);
 
 	unpin_user_pages(pvec, pinned);
 	kvfree(pvec);
diff --git a/drivers/gpu/drm/i915/gem/selftests/huge_pages.c b/drivers/gpu/drm/i915/gem/selftests/huge_pages.c
index e2f3d014acb2..eb12d444d2cc 100644
--- a/drivers/gpu/drm/i915/gem/selftests/huge_pages.c
+++ b/drivers/gpu/drm/i915/gem/selftests/huge_pages.c
@@ -452,6 +452,15 @@ static int igt_mock_exhaust_device_supported_pages(void *arg)
 	return err;
 }
 
+static void close_object(struct drm_i915_gem_object *obj)
+{
+	i915_gem_object_lock(obj);
+	__i915_gem_object_put_pages(obj);
+	i915_gem_object_unlock(obj);
+
+	i915_gem_object_put(obj);
+}
+
 static int igt_mock_memory_region_huge_pages(void *arg)
 {
 	const unsigned int flags[] = { 0, I915_BO_ALLOC_CONTIGUOUS };
@@ -514,8 +523,7 @@ static int igt_mock_memory_region_huge_pages(void *arg)
 			}
 
 			i915_vma_unpin(vma);
-			__i915_gem_object_put_pages(obj);
-			i915_gem_object_put(obj);
+			close_object(obj);
 		}
 	}
 
@@ -633,8 +641,7 @@ static int igt_mock_ppgtt_misaligned_dma(void *arg)
 		}
 
 		i915_gem_object_unpin_pages(obj);
-		__i915_gem_object_put_pages(obj);
-		i915_gem_object_put(obj);
+		close_object(obj);
 	}
 
 	return 0;
@@ -655,8 +662,7 @@ static void close_object_list(struct list_head *objects,
 	list_for_each_entry_safe(obj, on, objects, st_link) {
 		list_del(&obj->st_link);
 		i915_gem_object_unpin_pages(obj);
-		__i915_gem_object_put_pages(obj);
-		i915_gem_object_put(obj);
+		close_object(obj);
 	}
 }
 
@@ -923,8 +929,7 @@ static int igt_mock_ppgtt_64K(void *arg)
 
 			i915_vma_unpin(vma);
 			i915_gem_object_unpin_pages(obj);
-			__i915_gem_object_put_pages(obj);
-			i915_gem_object_put(obj);
+			close_object(obj);
 		}
 	}
 
@@ -964,9 +969,10 @@ __cpu_check_shmem(struct drm_i915_gem_object *obj, u32 dword, u32 val)
 	unsigned long n;
 	int err;
 
+	i915_gem_object_lock(obj);
 	err = i915_gem_object_prepare_read(obj, &needs_flush);
 	if (err)
-		return err;
+		goto unlock;
 
 	for (n = 0; n < obj->base.size >> PAGE_SHIFT; ++n) {
 		u32 *ptr = kmap_atomic(i915_gem_object_get_page(obj, n));
@@ -986,7 +992,8 @@ __cpu_check_shmem(struct drm_i915_gem_object *obj, u32 dword, u32 val)
 	}
 
 	i915_gem_object_finish_access(obj);
-
+unlock:
+	i915_gem_object_unlock(obj);
 	return err;
 }
 
@@ -1304,7 +1311,9 @@ static int igt_ppgtt_smoke_huge(void *arg)
 		}
 out_unpin:
 		i915_gem_object_unpin_pages(obj);
+		i915_gem_object_lock(obj);
 		__i915_gem_object_put_pages(obj);
+		i915_gem_object_unlock(obj);
 out_put:
 		i915_gem_object_put(obj);
 
@@ -1392,8 +1401,7 @@ static int igt_ppgtt_sanity_check(void *arg)
 			err = igt_write_huge(ctx, obj);
 
 			i915_gem_object_unpin_pages(obj);
-			__i915_gem_object_put_pages(obj);
-			i915_gem_object_put(obj);
+			close_object(obj);
 
 			if (err) {
 				pr_err("%s write-huge failed with size=%u pages=%u i=%d, j=%d\n",
diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_coherency.c b/drivers/gpu/drm/i915/gem/selftests/i915_gem_coherency.c
index 87d7d8aa080f..b8dd6fabe70a 100644
--- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_coherency.c
+++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_coherency.c
@@ -27,9 +27,10 @@ static int cpu_set(struct context *ctx, unsigned long offset, u32 v)
 	u32 *cpu;
 	int err;
 
+	i915_gem_object_lock(ctx->obj);
 	err = i915_gem_object_prepare_write(ctx->obj, &needs_clflush);
 	if (err)
-		return err;
+		goto unlock;
 
 	page = i915_gem_object_get_page(ctx->obj, offset >> PAGE_SHIFT);
 	map = kmap_atomic(page);
@@ -46,7 +47,9 @@ static int cpu_set(struct context *ctx, unsigned long offset, u32 v)
 	kunmap_atomic(map);
 	i915_gem_object_finish_access(ctx->obj);
 
-	return 0;
+unlock:
+	i915_gem_object_unlock(ctx->obj);
+	return err;
 }
 
 static int cpu_get(struct context *ctx, unsigned long offset, u32 *v)
@@ -57,9 +60,10 @@ static int cpu_get(struct context *ctx, unsigned long offset, u32 *v)
 	u32 *cpu;
 	int err;
 
+	i915_gem_object_lock(ctx->obj);
 	err = i915_gem_object_prepare_read(ctx->obj, &needs_clflush);
 	if (err)
-		return err;
+		goto unlock;
 
 	page = i915_gem_object_get_page(ctx->obj, offset >> PAGE_SHIFT);
 	map = kmap_atomic(page);
@@ -73,7 +77,9 @@ static int cpu_get(struct context *ctx, unsigned long offset, u32 *v)
 	kunmap_atomic(map);
 	i915_gem_object_finish_access(ctx->obj);
 
-	return 0;
+unlock:
+	i915_gem_object_unlock(ctx->obj);
+	return err;
 }
 
 static int gtt_set(struct context *ctx, unsigned long offset, u32 v)
diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c b/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c
index d176b015353f..f2a307b4146e 100644
--- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c
@@ -461,9 +461,10 @@ static int cpu_fill(struct drm_i915_gem_object *obj, u32 value)
 	unsigned int n, m, need_flush;
 	int err;
 
+	i915_gem_object_lock(obj);
 	err = i915_gem_object_prepare_write(obj, &need_flush);
 	if (err)
-		return err;
+		goto unlock;
 
 	for (n = 0; n < real_page_count(obj); n++) {
 		u32 *map;
@@ -479,6 +480,8 @@ static int cpu_fill(struct drm_i915_gem_object *obj, u32 value)
 	i915_gem_object_finish_access(obj);
 	obj->read_domains = I915_GEM_DOMAIN_GTT | I915_GEM_DOMAIN_CPU;
 	obj->write_domain = 0;
+unlock:
+	i915_gem_object_unlock(obj);
 	return 0;
 }
 
@@ -488,9 +491,10 @@ static noinline int cpu_check(struct drm_i915_gem_object *obj,
 	unsigned int n, m, needs_flush;
 	int err;
 
+	i915_gem_object_lock(obj);
 	err = i915_gem_object_prepare_read(obj, &needs_flush);
 	if (err)
-		return err;
+		goto unlock;
 
 	for (n = 0; n < real_page_count(obj); n++) {
 		u32 *map;
@@ -527,6 +531,8 @@ static noinline int cpu_check(struct drm_i915_gem_object *obj,
 	}
 
 	i915_gem_object_finish_access(obj);
+unlock:
+	i915_gem_object_unlock(obj);
 	return err;
 }
 
diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c b/drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c
index 9c7402ce5bf9..11f734fea3ab 100644
--- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c
+++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c
@@ -1297,7 +1297,9 @@ static int __igt_mmap_revoke(struct drm_i915_private *i915,
 	}
 
 	if (type != I915_MMAP_TYPE_GTT) {
+		i915_gem_object_lock(obj);
 		__i915_gem_object_put_pages(obj);
+		i915_gem_object_unlock(obj);
 		if (i915_gem_object_has_pages(obj)) {
 			pr_err("Failed to put-pages object!\n");
 			err = -EINVAL;
diff --git a/drivers/gpu/drm/i915/gt/gen6_ppgtt.c b/drivers/gpu/drm/i915/gt/gen6_ppgtt.c
index 71baf2f8bdf3..3eab2cc751bc 100644
--- a/drivers/gpu/drm/i915/gt/gen6_ppgtt.c
+++ b/drivers/gpu/drm/i915/gt/gen6_ppgtt.c
@@ -351,7 +351,6 @@ static struct i915_vma *pd_vma_create(struct gen6_ppgtt *ppgtt, int size)
 	i915_active_init(&vma->active, NULL, NULL);
 
 	kref_init(&vma->ref);
-	mutex_init(&vma->pages_mutex);
 	vma->vm = i915_vm_get(&ggtt->vm);
 	vma->ops = &pd_vma_ops;
 	vma->private = ppgtt;
@@ -439,7 +438,6 @@ struct i915_ppgtt *gen6_ppgtt_create(struct intel_gt *gt)
 	ppgtt->base.vm.pd_shift = 22;
 	ppgtt->base.vm.top = 1;
 
-	ppgtt->base.vm.bind_async_flags = I915_VMA_LOCAL_BIND;
 	ppgtt->base.vm.allocate_va_range = gen6_alloc_va_range;
 	ppgtt->base.vm.clear_range = gen6_ppgtt_clear_range;
 	ppgtt->base.vm.insert_entries = gen6_ppgtt_insert_entries;
diff --git a/drivers/gpu/drm/i915/gt/gen8_ppgtt.c b/drivers/gpu/drm/i915/gt/gen8_ppgtt.c
index e3afd250cd7f..203aa1f9aec7 100644
--- a/drivers/gpu/drm/i915/gt/gen8_ppgtt.c
+++ b/drivers/gpu/drm/i915/gt/gen8_ppgtt.c
@@ -720,7 +720,6 @@ struct i915_ppgtt *gen8_ppgtt_create(struct intel_gt *gt)
 			goto err_free_pd;
 	}
 
-	ppgtt->vm.bind_async_flags = I915_VMA_LOCAL_BIND;
 	ppgtt->vm.insert_entries = gen8_ppgtt_insert;
 	ppgtt->vm.allocate_va_range = gen8_ppgtt_alloc;
 	ppgtt->vm.clear_range = gen8_ppgtt_clear;
diff --git a/drivers/gpu/drm/i915/gt/intel_ggtt.c b/drivers/gpu/drm/i915/gt/intel_ggtt.c
index 33a3f627ddb1..59a4a3ab6bfd 100644
--- a/drivers/gpu/drm/i915/gt/intel_ggtt.c
+++ b/drivers/gpu/drm/i915/gt/intel_ggtt.c
@@ -628,7 +628,6 @@ static int init_aliasing_ppgtt(struct i915_ggtt *ggtt)
 	ppgtt->vm.allocate_va_range(&ppgtt->vm, &stash, 0, ggtt->vm.total);
 
 	ggtt->alias = ppgtt;
-	ggtt->vm.bind_async_flags |= ppgtt->vm.bind_async_flags;
 
 	GEM_BUG_ON(ggtt->vm.vma_ops.bind_vma != ggtt_bind_vma);
 	ggtt->vm.vma_ops.bind_vma = aliasing_gtt_bind_vma;
@@ -862,8 +861,6 @@ static int gen8_gmch_probe(struct i915_ggtt *ggtt)
 	    IS_CHERRYVIEW(i915) /* fails with concurrent use/update */) {
 		ggtt->vm.insert_entries = bxt_vtd_ggtt_insert_entries__BKL;
 		ggtt->vm.insert_page    = bxt_vtd_ggtt_insert_page__BKL;
-		ggtt->vm.bind_async_flags =
-			I915_VMA_GLOBAL_BIND | I915_VMA_LOCAL_BIND;
 	}
 
 	ggtt->invalidate = gen8_ggtt_invalidate;
@@ -1429,7 +1426,7 @@ i915_get_ggtt_vma_pages(struct i915_vma *vma)
 	 * must be the vma->pages. A simple rule is that vma->pages must only
 	 * be accessed when the obj->mm.pages are pinned.
 	 */
-	GEM_BUG_ON(!i915_gem_object_has_pinned_pages(vma->obj));
+	GEM_BUG_ON(!i915_gem_object_has_pages(vma->obj));
 
 	switch (vma->ggtt_view.type) {
 	default:
diff --git a/drivers/gpu/drm/i915/gt/intel_gtt.h b/drivers/gpu/drm/i915/gt/intel_gtt.h
index 496f8236ca09..1bb447ef824b 100644
--- a/drivers/gpu/drm/i915/gt/intel_gtt.h
+++ b/drivers/gpu/drm/i915/gt/intel_gtt.h
@@ -226,8 +226,6 @@ struct i915_address_space {
 	u64 total;		/* size addr space maps (ex. 2GB for ggtt) */
 	u64 reserved;		/* size addr space reserved */
 
-	unsigned int bind_async_flags;
-
 	/*
 	 * Each active user context has its own address space (in full-ppgtt).
 	 * Since the vm may be shared between multiple contexts, we count how
diff --git a/drivers/gpu/drm/i915/gt/intel_ppgtt.c b/drivers/gpu/drm/i915/gt/intel_ppgtt.c
index 1f80d79a6588..68dd3f8b79d0 100644
--- a/drivers/gpu/drm/i915/gt/intel_ppgtt.c
+++ b/drivers/gpu/drm/i915/gt/intel_ppgtt.c
@@ -271,6 +271,7 @@ void i915_vm_free_pt_stash(struct i915_address_space *vm,
 int ppgtt_set_pages(struct i915_vma *vma)
 {
 	GEM_BUG_ON(vma->pages);
+	GEM_BUG_ON(IS_ERR_OR_NULL(vma->obj->mm.pages));
 
 	vma->pages = vma->obj->mm.pages;
 	vma->page_sizes = vma->obj->mm.page_sizes;
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index e998f25f30a3..0fbe438c4523 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -335,12 +335,16 @@ i915_gem_shmem_pread(struct drm_i915_gem_object *obj,
 	u64 remain;
 	int ret;
 
+	i915_gem_object_lock(obj);
 	ret = i915_gem_object_prepare_read(obj, &needs_clflush);
-	if (ret)
+	if (ret) {
+		i915_gem_object_unlock(obj);
 		return ret;
+	}
 
 	fence = i915_gem_object_lock_fence(obj);
 	i915_gem_object_finish_access(obj);
+	i915_gem_object_unlock(obj);
 	if (!fence)
 		return -ENOMEM;
 
@@ -734,12 +738,16 @@ i915_gem_shmem_pwrite(struct drm_i915_gem_object *obj,
 	u64 remain;
 	int ret;
 
+	i915_gem_object_lock(obj);
 	ret = i915_gem_object_prepare_write(obj, &needs_clflush);
-	if (ret)
+	if (ret) {
+		i915_gem_object_unlock(obj);
 		return ret;
+	}
 
 	fence = i915_gem_object_lock_fence(obj);
 	i915_gem_object_finish_access(obj);
+	i915_gem_object_unlock(obj);
 	if (!fence)
 		return -ENOMEM;
 
@@ -1063,7 +1071,7 @@ i915_gem_madvise_ioctl(struct drm_device *dev, void *data,
 	if (!obj)
 		return -ENOENT;
 
-	err = mutex_lock_interruptible(&obj->mm.lock);
+	err = i915_gem_object_lock_interruptible(obj);
 	if (err)
 		goto out;
 
@@ -1109,7 +1117,7 @@ i915_gem_madvise_ioctl(struct drm_device *dev, void *data,
 		i915_gem_object_truncate(obj);
 
 	args->retained = obj->mm.madv != __I915_MADV_PURGED;
-	mutex_unlock(&obj->mm.lock);
+	i915_gem_object_unlock(obj);
 
 out:
 	i915_gem_object_put(obj);
diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c
index 7278cc7c40b9..633f335ce892 100644
--- a/drivers/gpu/drm/i915/i915_vma.c
+++ b/drivers/gpu/drm/i915/i915_vma.c
@@ -116,7 +116,6 @@ vma_create(struct drm_i915_gem_object *obj,
 		return ERR_PTR(-ENOMEM);
 
 	kref_init(&vma->ref);
-	mutex_init(&vma->pages_mutex);
 	vma->vm = i915_vm_get(vm);
 	vma->ops = &vm->vma_ops;
 	vma->obj = obj;
@@ -295,16 +294,31 @@ struct i915_vma_work {
 	struct i915_address_space *vm;
 	struct i915_vm_pt_stash stash;
 	struct i915_vma *vma;
-	struct drm_i915_gem_object *pinned;
 	struct i915_sw_dma_fence_cb cb;
 	enum i915_cache_level cache_level;
 	unsigned int flags;
 };
 
+static void pin_pages(struct i915_vma *vma, unsigned int bind)
+{
+	bind = hweight32(bind & I915_VMA_BIND_MASK);
+	while (bind--)
+		__i915_gem_object_pin_pages(vma->obj);
+}
+
 static int __vma_bind(struct dma_fence_work *work)
 {
 	struct i915_vma_work *vw = container_of(work, typeof(*vw), base);
 	struct i915_vma *vma = vw->vma;
+	int err;
+
+	if (!vma->pages) {
+		err = vma->ops->set_pages(vma);
+		if (err) {
+			atomic_or(I915_VMA_ERROR, &vma->flags);
+			return err;
+		}
+	}
 
 	vma->ops->bind_vma(vw->vm, &vw->stash,
 			   vma, vw->cache_level, vw->flags);
@@ -315,8 +329,8 @@ static void __vma_release(struct dma_fence_work *work)
 {
 	struct i915_vma_work *vw = container_of(work, typeof(*vw), base);
 
-	if (vw->pinned)
-		__i915_gem_object_unpin_pages(vw->pinned);
+	if (work->dma.error && vw->flags)
+		atomic_or(I915_VMA_ERROR, &vw->vma->flags);
 
 	i915_vm_free_pt_stash(vw->vm, &vw->stash);
 	i915_vm_put(vw->vm);
@@ -389,6 +403,7 @@ int i915_vma_bind(struct i915_vma *vma,
 		  u32 flags,
 		  struct i915_vma_work *work)
 {
+	struct dma_fence *prev;
 	u32 bind_flags;
 	u32 vma_flags;
 
@@ -413,41 +428,39 @@ int i915_vma_bind(struct i915_vma *vma,
 	if (bind_flags == 0)
 		return 0;
 
-	GEM_BUG_ON(!vma->pages);
-
 	trace_i915_vma_bind(vma, bind_flags);
-	if (work && bind_flags & vma->vm->bind_async_flags) {
-		struct dma_fence *prev;
 
-		work->vma = vma;
-		work->cache_level = cache_level;
-		work->flags = bind_flags;
+	work->vma = vma;
+	work->cache_level = cache_level;
+	work->flags = bind_flags;
+	work->base.dma.error = 0; /* enable the queue_work() */
 
-		/*
-		 * Note we only want to chain up to the migration fence on
-		 * the pages (not the object itself). As we don't track that,
-		 * yet, we have to use the exclusive fence instead.
-		 *
-		 * Also note that we do not want to track the async vma as
-		 * part of the obj->resv->excl_fence as it only affects
-		 * execution and not content or object's backing store lifetime.
-		 */
-		prev = i915_active_set_exclusive(&vma->active, &work->base.dma);
-		if (prev) {
-			__i915_sw_fence_await_dma_fence(&work->base.chain,
-							prev,
-							&work->cb);
-			dma_fence_put(prev);
-		}
-
-		work->base.dma.error = 0; /* enable the queue_work() */
+	/*
+	 * Note we only want to chain up to the migration fence on
+	 * the pages (not the object itself). As we don't track that,
+	 * yet, we have to use the exclusive fence instead.
+	 *
+	 * Also note that we do not want to track the async vma as
+	 * part of the obj->resv->excl_fence as it only affects
+	 * execution and not content or object's backing store lifetime.
+	 */
+	prev = i915_active_set_exclusive(&vma->active, &work->base.dma);
+	if (prev) {
+		__i915_sw_fence_await_dma_fence(&work->base.chain,
+						prev,
+						&work->cb);
+		dma_fence_put(prev);
+	}
 
-		if (vma->obj) {
-			__i915_gem_object_pin_pages(vma->obj);
-			work->pinned = vma->obj;
+	if (vma->obj) {
+		if (IS_ERR(vma->obj->mm.pages)) {
+			i915_sw_fence_set_error_once(&work->base.chain,
+						     PTR_ERR(vma->obj->mm.pages));
+			atomic_or(I915_VMA_ERROR, &vma->flags);
+			bind_flags = 0;
 		}
-	} else {
-		vma->ops->bind_vma(vma->vm, NULL, vma, cache_level, bind_flags);
+
+		pin_pages(vma, bind_flags);
 	}
 
 	atomic_or(bind_flags, &vma->flags);
@@ -690,6 +703,9 @@ i915_vma_insert(struct i915_vma *vma, u64 size, u64 alignment, u64 flags)
 		if (ret)
 			return ret;
 	} else {
+		const unsigned long page_sizes =
+			INTEL_INFO(vma->vm->i915)->page_sizes;
+
 		/*
 		 * We only support huge gtt pages through the 48b PPGTT,
 		 * however we also don't want to force any alignment for
@@ -699,7 +715,7 @@ i915_vma_insert(struct i915_vma *vma, u64 size, u64 alignment, u64 flags)
 		 * forseeable future. See also i915_ggtt_offset().
 		 */
 		if (upper_32_bits(end - 1) &&
-		    vma->page_sizes.sg > I915_GTT_PAGE_SIZE) {
+		    page_sizes > I915_GTT_PAGE_SIZE) {
 			/*
 			 * We can't mix 64K and 4K PTEs in the same page-table
 			 * (2M block), and so to avoid the ugliness and
@@ -707,7 +723,7 @@ i915_vma_insert(struct i915_vma *vma, u64 size, u64 alignment, u64 flags)
 			 * objects to 2M.
 			 */
 			u64 page_alignment =
-				rounddown_pow_of_two(vma->page_sizes.sg |
+				rounddown_pow_of_two(page_sizes |
 						     I915_GTT_PAGE_SIZE_2M);
 
 			/*
@@ -719,7 +735,7 @@ i915_vma_insert(struct i915_vma *vma, u64 size, u64 alignment, u64 flags)
 
 			alignment = max(alignment, page_alignment);
 
-			if (vma->page_sizes.sg & I915_GTT_PAGE_SIZE_64K)
+			if (page_sizes & I915_GTT_PAGE_SIZE_64K)
 				size = round_up(size, I915_GTT_PAGE_SIZE_2M);
 		}
 
@@ -796,74 +812,6 @@ bool i915_vma_pin_inplace(struct i915_vma *vma, unsigned int flags)
 	return pinned;
 }
 
-int i915_vma_get_pages(struct i915_vma *vma)
-{
-	int err = 0;
-
-	if (atomic_add_unless(&vma->pages_count, 1, 0))
-		return 0;
-
-	/* Allocations ahoy! */
-	if (mutex_lock_interruptible(&vma->pages_mutex))
-		return -EINTR;
-
-	if (!atomic_read(&vma->pages_count)) {
-		if (vma->obj) {
-			err = i915_gem_object_pin_pages(vma->obj);
-			if (err)
-				goto unlock;
-		}
-
-		err = vma->ops->set_pages(vma);
-		if (err) {
-			if (vma->obj)
-				i915_gem_object_unpin_pages(vma->obj);
-			goto unlock;
-		}
-	}
-	atomic_inc(&vma->pages_count);
-
-unlock:
-	mutex_unlock(&vma->pages_mutex);
-
-	return err;
-}
-
-static void __vma_put_pages(struct i915_vma *vma, unsigned int count)
-{
-	/* We allocate under vma_get_pages, so beware the shrinker */
-	mutex_lock_nested(&vma->pages_mutex, SINGLE_DEPTH_NESTING);
-	GEM_BUG_ON(atomic_read(&vma->pages_count) < count);
-	if (atomic_sub_return(count, &vma->pages_count) == 0) {
-		vma->ops->clear_pages(vma);
-		GEM_BUG_ON(vma->pages);
-		if (vma->obj)
-			i915_gem_object_unpin_pages(vma->obj);
-	}
-	mutex_unlock(&vma->pages_mutex);
-}
-
-void i915_vma_put_pages(struct i915_vma *vma)
-{
-	if (atomic_add_unless(&vma->pages_count, -1, 1))
-		return;
-
-	__vma_put_pages(vma, 1);
-}
-
-static void vma_unbind_pages(struct i915_vma *vma)
-{
-	unsigned int count;
-
-	lockdep_assert_held(&vma->vm->mutex);
-
-	/* The upper portion of pages_count is the number of bindings */
-	count = atomic_read(&vma->pages_count);
-	count >>= I915_VMA_PAGES_BIAS;
-	if (count)
-		__vma_put_pages(vma, count | count << I915_VMA_PAGES_BIAS);
-}
-
 static int __wait_for_unbind(struct i915_vma *vma, unsigned int flags)
 {
 	return __i915_vma_wait_excl(vma, false, flags);
@@ -885,9 +833,11 @@ int i915_vma_pin(struct i915_vma *vma, u64 size, u64 alignment, u64 flags)
 	if (i915_vma_pin_inplace(vma, flags & I915_VMA_BIND_MASK))
 		return 0;
 
-	err = i915_vma_get_pages(vma);
-	if (err)
-		return err;
+	if (vma->obj) {
+		err = i915_gem_object_pin_pages(vma->obj);
+		if (err)
+			return err;
+	}
 
 	if (flags & PIN_GLOBAL)
 		wakeref = intel_runtime_pm_get(&vma->vm->i915->runtime_pm);
@@ -896,26 +846,21 @@ int i915_vma_pin(struct i915_vma *vma, u64 size, u64 alignment, u64 flags)
 	if (err)
 		goto err_rpm;
 
-	if (flags & vma->vm->bind_async_flags) {
-		work = i915_vma_work();
-		if (!work) {
-			err = -ENOMEM;
-			goto err_rpm;
-		}
+	work = i915_vma_work();
+	if (!work) {
+		err = -ENOMEM;
+		goto err_rpm;
+	}
 
-		work->vm = i915_vm_get(vma->vm);
+	work->vm = i915_vm_get(vma->vm);
 
-		/* Allocate enough page directories to used PTE */
-		if (vma->vm->allocate_va_range) {
-			i915_vm_alloc_pt_stash(vma->vm,
-					       &work->stash,
-					       vma->size);
+	/* Allocate enough page directories to used PTE */
+	if (vma->vm->allocate_va_range) {
+		i915_vm_alloc_pt_stash(vma->vm, &work->stash, vma->size);
 
-			err = i915_vm_pin_pt_stash(vma->vm,
-						   &work->stash);
-			if (err)
-				goto err_fence;
-		}
+		err = i915_vm_pin_pt_stash(vma->vm, &work->stash);
+		if (err)
+			goto err_fence;
 	}
 
 	/*
@@ -980,16 +925,12 @@ int i915_vma_pin(struct i915_vma *vma, u64 size, u64 alignment, u64 flags)
 			__i915_vma_set_map_and_fenceable(vma);
 	}
 
-	GEM_BUG_ON(!vma->pages);
 	err = i915_vma_bind(vma,
 			    vma->obj ? vma->obj->cache_level : 0,
 			    flags, work);
 	if (err)
 		goto err_remove;
 
-	/* There should only be at most 2 active bindings (user, global) */
-	GEM_BUG_ON(bound + I915_VMA_PAGES_ACTIVE < bound);
-	atomic_add(I915_VMA_PAGES_ACTIVE, &vma->pages_count);
 	list_move_tail(&vma->vm_link, &vma->vm->bound_list);
 	GEM_BUG_ON(!i915_vma_is_active(vma));
 
@@ -1008,12 +949,12 @@ int i915_vma_pin(struct i915_vma *vma, u64 size, u64 alignment, u64 flags)
 err_unlock:
 	mutex_unlock(&vma->vm->mutex);
 err_fence:
-	if (work)
-		dma_fence_work_commit_imm(&work->base);
+	dma_fence_work_commit_imm(&work->base);
 err_rpm:
 	if (wakeref)
 		intel_runtime_pm_put(&vma->vm->i915->runtime_pm, wakeref);
-	i915_vma_put_pages(vma);
+	if (vma->obj)
+		i915_gem_object_unpin_pages(vma->obj);
 	return err;
 }
 
@@ -1274,6 +1215,8 @@ int i915_vma_move_to_active(struct i915_vma *vma,
 
 void __i915_vma_evict(struct i915_vma *vma)
 {
+	int count;
+
 	GEM_BUG_ON(i915_vma_is_pinned(vma));
 
 	if (i915_vma_is_map_and_fenceable(vma)) {
@@ -1308,11 +1251,19 @@ void __i915_vma_evict(struct i915_vma *vma)
 		trace_i915_vma_unbind(vma);
 		vma->ops->unbind_vma(vma->vm, vma);
 	}
+	count = hweight32(atomic_read(&vma->flags) & I915_VMA_BIND_MASK);
 	atomic_and(~(I915_VMA_BIND_MASK | I915_VMA_ERROR | I915_VMA_GGTT_WRITE),
 		   &vma->flags);
 
 	i915_vma_detach(vma);
-	vma_unbind_pages(vma);
+
+	if (vma->pages)
+		vma->ops->clear_pages(vma);
+
+	if (vma->obj) {
+		while (count--)
+			__i915_gem_object_unpin_pages(vma->obj);
+	}
 }
 
 int __i915_vma_unbind(struct i915_vma *vma)
diff --git a/drivers/gpu/drm/i915/i915_vma_types.h b/drivers/gpu/drm/i915/i915_vma_types.h
index 9e9082dc8f4b..02c1640bb034 100644
--- a/drivers/gpu/drm/i915/i915_vma_types.h
+++ b/drivers/gpu/drm/i915/i915_vma_types.h
@@ -251,11 +251,6 @@ struct i915_vma {
 
 	struct i915_active active;
 
-#define I915_VMA_PAGES_BIAS 24
-#define I915_VMA_PAGES_ACTIVE (BIT(24) | 1)
-	atomic_t pages_count; /* number of active binds to the pages */
-	struct mutex pages_mutex; /* protect acquire/release of backing pages */
-
 	/**
 	 * Support different GGTT views into the same object.
 	 * This means there can be multiple VMA mappings per object and per VM.
@@ -279,4 +274,3 @@ struct i915_vma {
 };
 
 #endif
-
diff --git a/drivers/gpu/drm/i915/mm/i915_acquire_ctx.c b/drivers/gpu/drm/i915/mm/i915_acquire_ctx.c
index d1c3b958c15d..02b653328b9d 100644
--- a/drivers/gpu/drm/i915/mm/i915_acquire_ctx.c
+++ b/drivers/gpu/drm/i915/mm/i915_acquire_ctx.c
@@ -89,8 +89,18 @@ int i915_acquire_ctx_lock(struct i915_acquire_ctx *ctx,
 	return err;
 }
 
-int i915_acquire_mm(struct i915_acquire_ctx *acquire)
+int i915_acquire_mm(struct i915_acquire_ctx *ctx)
 {
+	struct i915_acquire *lnk;
+	int err;
+
+	for (lnk = ctx->locked; lnk; lnk = lnk->next) {
+		err = __i915_gem_object_get_pages_locked(lnk->obj);
+		if (err)
+			return err;
+	}
+
+	i915_acquire_ctx_done(ctx);
 	return 0;
 }
 
diff --git a/drivers/gpu/drm/i915/selftests/i915_gem_gtt.c b/drivers/gpu/drm/i915/selftests/i915_gem_gtt.c
index af8205a2bd8f..e5e6973eb6ea 100644
--- a/drivers/gpu/drm/i915/selftests/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/selftests/i915_gem_gtt.c
@@ -1245,9 +1245,9 @@ static void track_vma_bind(struct i915_vma *vma)
 	__i915_gem_object_pin_pages(obj);
 
 	GEM_BUG_ON(vma->pages);
-	atomic_set(&vma->pages_count, I915_VMA_PAGES_ACTIVE);
-	__i915_gem_object_pin_pages(obj);
 	vma->pages = obj->mm.pages;
+	__i915_gem_object_pin_pages(obj);
+	atomic_or(I915_VMA_GLOBAL_BIND, &vma->flags);
 
 	mutex_lock(&vma->vm->mutex);
 	list_add_tail(&vma->vm_link, &vma->vm->bound_list);
diff --git a/drivers/gpu/drm/i915/selftests/intel_memory_region.c b/drivers/gpu/drm/i915/selftests/intel_memory_region.c
index 6e80d99048e4..8d9fdf591514 100644
--- a/drivers/gpu/drm/i915/selftests/intel_memory_region.c
+++ b/drivers/gpu/drm/i915/selftests/intel_memory_region.c
@@ -24,6 +24,15 @@
 #include "selftests/igt_flush_test.h"
 #include "selftests/i915_random.h"
 
+static void close_object(struct drm_i915_gem_object *obj)
+{
+	i915_gem_object_lock(obj);
+	__i915_gem_object_put_pages(obj);
+	i915_gem_object_unlock(obj);
+
+	i915_gem_object_put(obj);
+}
+
 static void close_objects(struct intel_memory_region *mem,
 			  struct list_head *objects)
 {
@@ -33,10 +42,9 @@ static void close_objects(struct intel_memory_region *mem,
 	list_for_each_entry_safe(obj, on, objects, st_link) {
 		if (i915_gem_object_has_pinned_pages(obj))
 			i915_gem_object_unpin_pages(obj);
-		/* No polluting the memory region between tests */
-		__i915_gem_object_put_pages(obj);
 		list_del(&obj->st_link);
-		i915_gem_object_put(obj);
+		/* No polluting the memory region between tests */
+		close_object(obj);
 	}
 
 	cond_resched();
@@ -124,9 +132,8 @@ igt_object_create(struct intel_memory_region *mem,
 static void igt_object_release(struct drm_i915_gem_object *obj)
 {
 	i915_gem_object_unpin_pages(obj);
-	__i915_gem_object_put_pages(obj);
 	list_del(&obj->st_link);
-	i915_gem_object_put(obj);
+	close_object(obj);
 }
 
 static int igt_mock_contiguous(void *arg)
-- 
2.20.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [Intel-gfx] [PATCH 29/66] drm/i915: Hold wakeref for the duration of the vma GGTT binding
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (26 preceding siblings ...)
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 28/66] drm/i915/gem: Replace i915_gem_object.mm.mutex with reservation_ww_class Chris Wilson
@ 2020-07-15 11:51 ` Chris Wilson
  2020-07-31 10:09   ` Thomas Hellström (Intel)
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 30/66] drm/i915: Specialise " Chris Wilson
                   ` (44 subsequent siblings)
  72 siblings, 1 reply; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:51 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

Now that we have pushed the binding itself outside of the vm->mutex, we
are clear of the potential wakeref inversions and can take the wakeref
around the actual duration of the HW interaction.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/gt/intel_ggtt.c | 39 ++++++++++++++++------------
 drivers/gpu/drm/i915/i915_vma.c      |  6 -----
 2 files changed, 22 insertions(+), 23 deletions(-)
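
An illustrative sketch of the resulting shape (not part of the patch;
example_bind() is a placeholder name, the real change is to
__ggtt_bind_vma() in the hunk below):

static void example_bind(struct i915_address_space *vm,
			 struct i915_vma *vma,
			 enum i915_cache_level cache_level,
			 u32 pte_flags)
{
	intel_wakeref_t wakeref;

	/* HW access happens only inside this scope; rpm is released on exit */
	with_intel_runtime_pm(vm->gt->uncore->rpm, wakeref)
		vm->insert_entries(vm, vma, cache_level, pte_flags);
}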

diff --git a/drivers/gpu/drm/i915/gt/intel_ggtt.c b/drivers/gpu/drm/i915/gt/intel_ggtt.c
index 59a4a3ab6bfd..a78ae2733fd6 100644
--- a/drivers/gpu/drm/i915/gt/intel_ggtt.c
+++ b/drivers/gpu/drm/i915/gt/intel_ggtt.c
@@ -434,27 +434,39 @@ static void i915_ggtt_clear_range(struct i915_address_space *vm,
 	intel_gtt_clear_range(start >> PAGE_SHIFT, length >> PAGE_SHIFT);
 }
 
-static void ggtt_bind_vma(struct i915_address_space *vm,
-			  struct i915_vm_pt_stash *stash,
-			  struct i915_vma *vma,
-			  enum i915_cache_level cache_level,
-			  u32 flags)
+static void __ggtt_bind_vma(struct i915_address_space *vm,
+			    struct i915_vm_pt_stash *stash,
+			    struct i915_vma *vma,
+			    enum i915_cache_level cache_level,
+			    u32 flags)
 {
 	struct drm_i915_gem_object *obj = vma->obj;
+	intel_wakeref_t wakeref;
 	u32 pte_flags;
 
-	if (i915_vma_is_bound(vma, ~flags & I915_VMA_BIND_MASK))
-		return;
-
 	/* Applicable to VLV (gen8+ do not support RO in the GGTT) */
 	pte_flags = 0;
 	if (i915_gem_object_is_readonly(obj))
 		pte_flags |= PTE_READ_ONLY;
 
-	vm->insert_entries(vm, vma, cache_level, pte_flags);
+	with_intel_runtime_pm(vm->gt->uncore->rpm, wakeref)
+		vm->insert_entries(vm, vma, cache_level, pte_flags);
+
 	vma->page_sizes.gtt = I915_GTT_PAGE_SIZE;
 }
 
+static void ggtt_bind_vma(struct i915_address_space *vm,
+			  struct i915_vm_pt_stash *stash,
+			  struct i915_vma *vma,
+			  enum i915_cache_level cache_level,
+			  u32 flags)
+{
+	if (i915_vma_is_bound(vma, ~flags & I915_VMA_BIND_MASK))
+		return;
+
+	__ggtt_bind_vma(vm, stash, vma, cache_level, flags);
+}
+
 static void ggtt_unbind_vma(struct i915_address_space *vm, struct i915_vma *vma)
 {
 	vm->clear_range(vm, vma->node.start, vma->size);
@@ -571,19 +583,12 @@ static void aliasing_gtt_bind_vma(struct i915_address_space *vm,
 				  enum i915_cache_level cache_level,
 				  u32 flags)
 {
-	u32 pte_flags;
-
-	/* Currently applicable only to VLV */
-	pte_flags = 0;
-	if (i915_gem_object_is_readonly(vma->obj))
-		pte_flags |= PTE_READ_ONLY;
-
 	if (flags & I915_VMA_LOCAL_BIND)
 		ppgtt_bind_vma(&i915_vm_to_ggtt(vm)->alias->vm,
 			       stash, vma, cache_level, flags);
 
 	if (flags & I915_VMA_GLOBAL_BIND)
-		vm->insert_entries(vm, vma, cache_level, pte_flags);
+		__ggtt_bind_vma(vm, stash, vma, cache_level, flags);
 }
 
 static void aliasing_gtt_unbind_vma(struct i915_address_space *vm,
diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c
index 633f335ce892..e584a3355911 100644
--- a/drivers/gpu/drm/i915/i915_vma.c
+++ b/drivers/gpu/drm/i915/i915_vma.c
@@ -820,7 +820,6 @@ static int __wait_for_unbind(struct i915_vma *vma, unsigned int flags)
 int i915_vma_pin(struct i915_vma *vma, u64 size, u64 alignment, u64 flags)
 {
 	struct i915_vma_work *work = NULL;
-	intel_wakeref_t wakeref = 0;
 	unsigned int bound;
 	int err;
 
@@ -839,9 +838,6 @@ int i915_vma_pin(struct i915_vma *vma, u64 size, u64 alignment, u64 flags)
 			return err;
 	}
 
-	if (flags & PIN_GLOBAL)
-		wakeref = intel_runtime_pm_get(&vma->vm->i915->runtime_pm);
-
 	err = __wait_for_unbind(vma, flags);
 	if (err)
 		goto err_rpm;
@@ -951,8 +947,6 @@ int i915_vma_pin(struct i915_vma *vma, u64 size, u64 alignment, u64 flags)
 err_fence:
 	dma_fence_work_commit_imm(&work->base);
 err_rpm:
-	if (wakeref)
-		intel_runtime_pm_put(&vma->vm->i915->runtime_pm, wakeref);
 	if (vma->obj)
 		i915_gem_object_unpin_pages(vma->obj);
 	return err;
-- 
2.20.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [Intel-gfx] [PATCH 30/66] drm/i915: Specialise GGTT binding
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (27 preceding siblings ...)
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 29/66] drm/i915: Hold wakeref for the duration of the vma GGTT binding Chris Wilson
@ 2020-07-15 11:51 ` Chris Wilson
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 31/66] drm/i915/gt: Acquire backing storage for the context Chris Wilson
                   ` (43 subsequent siblings)
  72 siblings, 0 replies; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:51 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

The Global GTT mappings do not require any backing storage for the page
directories and so do not need extensive support for preallocations, or
for handling multiple bindings en masse. The Global GTT bindings also
need to take into account an eviction strategy for pinned vma that we
want to explicitly avoid for user bindings. It is easier to specialise
i915_ggtt_pin() to keep the pages/address alive while they are in use
by the HW in its private GTT, while we deconstruct and rebuild
i915_vma_pin().

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/gt/gen6_ppgtt.c |   7 +-
 drivers/gpu/drm/i915/i915_vma.c      | 125 +++++++++++++++++++++++----
 drivers/gpu/drm/i915/i915_vma.h      |   1 +
 3 files changed, 113 insertions(+), 20 deletions(-)
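
An illustrative caller-side sketch of the new split (not part of the
patch; example_ggtt_pin_locked_user() is a placeholder, and it assumes
the caller has already locked the object and acquired its backing
pages, which is what i915_ggtt_pin() itself arranges for the unlocked
path in the hunk below):

static int example_ggtt_pin_locked_user(struct i915_vma *vma,
					u32 align, unsigned int flags)
{
	int err;

	/* assumption: vma->obj's reservation lock is held, pages acquired */
	assert_object_held(vma->obj);

	err = i915_ggtt_pin_locked(vma, align, flags);
	if (err)
		return err;

	/* the bind may still be queued; wait before relying on the GGTT */
	err = i915_vma_wait_for_bind(vma);
	if (err)
		i915_vma_unpin(vma);

	return err;
}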

diff --git a/drivers/gpu/drm/i915/gt/gen6_ppgtt.c b/drivers/gpu/drm/i915/gt/gen6_ppgtt.c
index 3eab2cc751bc..308f7f4f7bd7 100644
--- a/drivers/gpu/drm/i915/gt/gen6_ppgtt.c
+++ b/drivers/gpu/drm/i915/gt/gen6_ppgtt.c
@@ -392,8 +392,11 @@ int gen6_ppgtt_pin(struct i915_ppgtt *base)
 	 * size. We allocate at the top of the GTT to avoid fragmentation.
 	 */
 	err = 0;
-	if (!atomic_read(&ppgtt->pin_count))
-		err = i915_ggtt_pin(ppgtt->vma, GEN6_PD_ALIGN, PIN_HIGH);
+	if (!atomic_read(&ppgtt->pin_count)) {
+		err = i915_ggtt_pin_locked(ppgtt->vma, GEN6_PD_ALIGN, PIN_HIGH);
+		if (err == 0)
+			err = i915_vma_wait_for_bind(ppgtt->vma);
+	}
 	if (!err)
 		atomic_inc(&ppgtt->pin_count);
 	mutex_unlock(&ppgtt->pin_mutex);
diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c
index e584a3355911..4993fa99cb71 100644
--- a/drivers/gpu/drm/i915/i915_vma.c
+++ b/drivers/gpu/drm/i915/i915_vma.c
@@ -952,7 +952,7 @@ int i915_vma_pin(struct i915_vma *vma, u64 size, u64 alignment, u64 flags)
 	return err;
 }
 
-static void flush_idle_contexts(struct intel_gt *gt)
+static void unpin_idle_contexts(struct intel_gt *gt)
 {
 	struct intel_engine_cs *engine;
 	enum intel_engine_id id;
@@ -963,31 +963,120 @@ static void flush_idle_contexts(struct intel_gt *gt)
 	intel_gt_wait_for_idle(gt, MAX_SCHEDULE_TIMEOUT);
 }
 
+int i915_ggtt_pin_locked(struct i915_vma *vma, u32 align, unsigned int flags)
+{
+	struct i915_vma_work *work = NULL;
+	unsigned int bound;
+	int err;
+
+	GEM_BUG_ON(vma->vm->allocate_va_range);
+	GEM_BUG_ON(i915_vma_is_closed(vma));
+
+	/* First try and grab the pin without rebinding the vma */
+	if (i915_vma_pin_inplace(vma, I915_VMA_GLOBAL_BIND))
+		return 0;
+
+	work = i915_vma_work();
+	if (!work)
+		return -ENOMEM;
+	work->vm = i915_vm_get(vma->vm);
+
+	err = mutex_lock_interruptible(&vma->vm->mutex);
+	if (err)
+		goto err_fence;
+
+	/* No more allocations allowed now we hold vm->mutex */
+
+	bound = atomic_read(&vma->flags);
+	if (unlikely(bound & I915_VMA_ERROR)) {
+		err = -ENOMEM;
+		goto err_unlock;
+	}
+
+	if (unlikely(!((bound + 1) & I915_VMA_PIN_MASK))) {
+		err = -EAGAIN; /* pins are meant to be fairly temporary */
+		goto err_unlock;
+	}
+
+	if (unlikely(bound & I915_VMA_GLOBAL_BIND)) {
+		__i915_vma_pin(vma);
+		goto err_unlock;
+	}
+
+	err = i915_active_acquire(&vma->active);
+	if (err)
+		goto err_unlock;
+
+	if (!(bound & I915_VMA_BIND_MASK)) {
+		err = __wait_for_unbind(vma, flags);
+		if (err)
+			goto err_active;
+
+		err = i915_vma_insert(vma, 0, align,
+				      flags | I915_VMA_GLOBAL_BIND);
+		if (err == -ENOSPC) {
+			unpin_idle_contexts(vma->vm->gt);
+			err = i915_vma_insert(vma, 0, align,
+					      flags | I915_VMA_GLOBAL_BIND);
+		}
+		if (err)
+			goto err_active;
+
+		__i915_vma_set_map_and_fenceable(vma);
+	}
+
+	err = i915_vma_bind(vma,
+			    vma->obj ? vma->obj->cache_level : 0,
+			    I915_VMA_GLOBAL_BIND,
+			    work);
+	if (err)
+		goto err_remove;
+	GEM_BUG_ON(!i915_vma_is_bound(vma, I915_VMA_GLOBAL_BIND));
+
+	list_move_tail(&vma->vm_link, &vma->vm->bound_list);
+	GEM_BUG_ON(!i915_vma_is_active(vma));
+
+	__i915_vma_pin(vma);
+
+err_remove:
+	if (!i915_vma_is_bound(vma, I915_VMA_BIND_MASK)) {
+		i915_vma_detach(vma);
+		drm_mm_remove_node(&vma->node);
+	}
+err_active:
+	i915_active_release(&vma->active);
+err_unlock:
+	mutex_unlock(&vma->vm->mutex);
+err_fence:
+	dma_fence_work_commit_imm(&work->base);
+	return err;
+}
+
 int i915_ggtt_pin(struct i915_vma *vma, u32 align, unsigned int flags)
 {
-	struct i915_address_space *vm = vma->vm;
 	int err;
 
 	GEM_BUG_ON(!i915_vma_is_ggtt(vma));
 
-	do {
-		err = i915_vma_pin(vma, 0, align, flags | PIN_GLOBAL);
-		if (err != -ENOSPC) {
-			if (!err) {
-				err = i915_vma_wait_for_bind(vma);
-				if (err)
-					i915_vma_unpin(vma);
-			}
+	if (!i915_vma_pin_inplace(vma, I915_VMA_GLOBAL_BIND)) {
+		err = i915_gem_object_lock_interruptible(vma->obj);
+		if (err)
 			return err;
-		}
 
-		/* Unlike i915_vma_pin, we don't take no for an answer! */
-		flush_idle_contexts(vm->gt);
-		if (mutex_lock_interruptible(&vm->mutex) == 0) {
-			i915_gem_evict_vm(vm);
-			mutex_unlock(&vm->mutex);
-		}
-	} while (1);
+		err = __i915_gem_object_get_pages_locked(vma->obj);
+		if (err == 0)
+			err = i915_ggtt_pin_locked(vma, align, flags);
+
+		i915_gem_object_unlock(vma->obj);
+		if (err)
+			return err;
+	}
+
+	err = i915_vma_wait_for_bind(vma);
+	if (err)
+		i915_vma_unpin(vma);
+
+	return err;
 }
 
 static void __vma_close(struct i915_vma *vma, struct intel_gt *gt)
diff --git a/drivers/gpu/drm/i915/i915_vma.h b/drivers/gpu/drm/i915/i915_vma.h
index 9a26e6cbe8cd..1049d80dc47f 100644
--- a/drivers/gpu/drm/i915/i915_vma.h
+++ b/drivers/gpu/drm/i915/i915_vma.h
@@ -244,6 +244,7 @@ bool i915_vma_pin_inplace(struct i915_vma *vma, unsigned int flags);
 int __must_check
 i915_vma_pin(struct i915_vma *vma, u64 size, u64 alignment, u64 flags);
 int i915_ggtt_pin(struct i915_vma *vma, u32 align, unsigned int flags);
+int i915_ggtt_pin_locked(struct i915_vma *vma, u32 align, unsigned int flags);
 
 static inline int i915_vma_pin_count(const struct i915_vma *vma)
 {
-- 
2.20.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [Intel-gfx] [PATCH 31/66] drm/i915/gt: Acquire backing storage for the context
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (28 preceding siblings ...)
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 30/66] drm/i915: Specialise " Chris Wilson
@ 2020-07-15 11:51 ` Chris Wilson
  2020-07-31 10:27   ` Thomas Hellström (Intel)
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 32/66] drm/i915/gt: Push the wait for the context to be bound to the request Chris Wilson
                   ` (42 subsequent siblings)
  72 siblings, 1 reply; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:51 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

Pull the individual acquisition of the context objects (state, ring,
timeline) under a common i915_acquire_ctx in preparation for allowing
the context to evict memory (or rather, for the i915_acquire_ctx to do
so on its behalf).

The context objects retain their semi-permanent status; that is, they
are assumed to be accessible by the HW at all times until we receive a
signal from the HW that they are no longer in use. Currently, we
generate such a signal ourselves from the context switch following the
final use of the objects. This means that they will remain on the HW
for an indefinite amount of time, and we keep using pinning to hold
them in place. As they are pinned, they can be processed outside of the
working set for the requests within the context. This is useful, as the
contexts share some global state, causing them to incur a global lock
via their objects. By only requiring that lock when the context is
activated, the lock is taken both less frequently and for a shorter
duration (as compared to execbuf).
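
As a rough sketch (illustrative only, not part of the patch), the
activation path now takes the shape below, built from the
i915_acquire_ctx helpers used in this series; the wrapper name is
invented and the error unwinding of the real __intel_context_active()
is trimmed:

  static int example_context_activate(struct intel_context *ce)
  {
  	struct i915_acquire_ctx acquire;
  	int err;

  	i915_acquire_ctx_init(&acquire);

  	/* Lock the ring, state and timeline backing objects together */
  	err = i915_acquire_ctx_lock(&acquire, ce->ring->vma->obj);
  	if (err == 0 && ce->state)
  		err = i915_acquire_ctx_lock(&acquire, ce->state->obj);
  	if (err == 0)
  		err = i915_acquire_ctx_lock(&acquire,
  					    ce->timeline->hwsp_ggtt->obj);

  	/* Acquire backing storage for everything just locked */
  	if (err == 0)
  		err = i915_acquire_mm(&acquire);

  	/* No further allocations are expected past this point */
  	if (err == 0)
  		err = intel_ring_pin_locked(ce->ring);
  	if (err == 0)
  		err = intel_timeline_pin_locked(ce->timeline);

  	i915_acquire_ctx_fini(&acquire);
  	return err;
  }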

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/gt/intel_context.c       | 108 ++++++++++++++--
 drivers/gpu/drm/i915/gt/intel_ring.c          |  17 ++-
 drivers/gpu/drm/i915/gt/intel_ring.h          |   5 +-
 .../gpu/drm/i915/gt/intel_ring_submission.c   | 117 ++++++++++++------
 drivers/gpu/drm/i915/gt/intel_timeline.c      |  14 ++-
 drivers/gpu/drm/i915/gt/intel_timeline.h      |  10 +-
 drivers/gpu/drm/i915/gt/mock_engine.c         |   2 +
 drivers/gpu/drm/i915/gt/selftest_timeline.c   |  30 ++++-
 8 files changed, 237 insertions(+), 66 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
index 52db2bde44a3..2f1606365f63 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.c
+++ b/drivers/gpu/drm/i915/gt/intel_context.c
@@ -6,6 +6,7 @@
 
 #include "gem/i915_gem_context.h"
 #include "gem/i915_gem_pm.h"
+#include "mm/i915_acquire_ctx.h"
 
 #include "i915_drv.h"
 #include "i915_globals.h"
@@ -93,6 +94,27 @@ static void intel_context_active_release(struct intel_context *ce)
 	i915_active_release(&ce->active);
 }
 
+static int __intel_context_sync(struct intel_context *ce)
+{
+	int err;
+
+	err = i915_vma_wait_for_bind(ce->ring->vma);
+	if (err)
+		return err;
+
+	err = i915_vma_wait_for_bind(ce->timeline->hwsp_ggtt);
+	if (err)
+		return err;
+
+	if (ce->state) {
+		err = i915_vma_wait_for_bind(ce->state);
+		if (err)
+			return err;
+	}
+
+	return 0;
+}
+
 int __intel_context_do_pin(struct intel_context *ce)
 {
 	int err;
@@ -118,6 +140,10 @@ int __intel_context_do_pin(struct intel_context *ce)
 	}
 
 	if (likely(!atomic_add_unless(&ce->pin_count, 1, 0))) {
+		err = __intel_context_sync(ce);
+		if (unlikely(err))
+			goto out_unlock;
+
 		err = intel_context_active_acquire(ce);
 		if (unlikely(err))
 			goto out_unlock;
@@ -166,12 +192,12 @@ void intel_context_unpin(struct intel_context *ce)
 	intel_context_put(ce);
 }
 
-static int __context_pin_state(struct i915_vma *vma)
+static int __context_active_locked(struct i915_vma *vma)
 {
 	unsigned int bias = i915_ggtt_pin_bias(vma) | PIN_OFFSET_BIAS;
 	int err;
 
-	err = i915_ggtt_pin(vma, 0, bias | PIN_HIGH);
+	err = i915_ggtt_pin_locked(vma, 0, bias | PIN_HIGH);
 	if (err)
 		return err;
 
@@ -200,11 +226,11 @@ static void __context_unpin_state(struct i915_vma *vma)
 	__i915_vma_unpin(vma);
 }
 
-static int __ring_active(struct intel_ring *ring)
+static int __ring_active_locked(struct intel_ring *ring)
 {
 	int err;
 
-	err = intel_ring_pin(ring);
+	err = intel_ring_pin_locked(ring);
 	if (err)
 		return err;
 
@@ -244,27 +270,53 @@ static void __intel_context_retire(struct i915_active *active)
 	intel_context_put(ce);
 }
 
-static int __intel_context_active(struct i915_active *active)
+static int
+__intel_context_acquire_lock(struct intel_context *ce,
+			     struct i915_acquire_ctx *ctx)
+{
+	return i915_acquire_ctx_lock(ctx, ce->state->obj);
+}
+
+static int
+intel_context_acquire_lock(struct intel_context *ce,
+			   struct i915_acquire_ctx *ctx)
 {
-	struct intel_context *ce = container_of(active, typeof(*ce), active);
 	int err;
 
-	CE_TRACE(ce, "active\n");
+	err = intel_ring_acquire_lock(ce->ring, ctx);
+	if (err)
+		return err;
 
-	intel_context_get(ce);
+	if (ce->state) {
+		err = __intel_context_acquire_lock(ce, ctx);
+		if (err)
+			return err;
+	}
 
-	err = __ring_active(ce->ring);
+	/* Note that the timeline will migrate as the seqno wrap around */
+	err = intel_timeline_acquire_lock(ce->timeline, ctx);
 	if (err)
-		goto err_put;
+		return err;
+
+	return 0;
+}
 
-	err = intel_timeline_pin(ce->timeline);
+static int intel_context_active_locked(struct intel_context *ce)
+{
+	int err;
+
+	err = __ring_active_locked(ce->ring);
+	if (err)
+		return err;
+
+	err = intel_timeline_pin_locked(ce->timeline);
 	if (err)
 		goto err_ring;
 
 	if (!ce->state)
 		return 0;
 
-	err = __context_pin_state(ce->state);
+	err = __context_active_locked(ce->state);
 	if (err)
 		goto err_timeline;
 
@@ -274,7 +326,37 @@ static int __intel_context_active(struct i915_active *active)
 	intel_timeline_unpin(ce->timeline);
 err_ring:
 	__ring_retire(ce->ring);
-err_put:
+	return err;
+}
+
+static int __intel_context_active(struct i915_active *active)
+{
+	struct intel_context *ce = container_of(active, typeof(*ce), active);
+	struct i915_acquire_ctx acquire;
+	int err;
+
+	CE_TRACE(ce, "active\n");
+
+	intel_context_get(ce);
+	i915_acquire_ctx_init(&acquire);
+
+	err = intel_context_acquire_lock(ce, &acquire);
+	if (err)
+		goto err;
+
+	err = i915_acquire_mm(&acquire);
+	if (err)
+		goto err;
+
+	err = intel_context_active_locked(ce);
+	if (err)
+		goto err;
+
+	i915_acquire_ctx_fini(&acquire);
+	return 0;
+
+err:
+	i915_acquire_ctx_fini(&acquire);
 	intel_context_put(ce);
 	return err;
 }
diff --git a/drivers/gpu/drm/i915/gt/intel_ring.c b/drivers/gpu/drm/i915/gt/intel_ring.c
index bdb324167ef3..1c21f5725731 100644
--- a/drivers/gpu/drm/i915/gt/intel_ring.c
+++ b/drivers/gpu/drm/i915/gt/intel_ring.c
@@ -5,6 +5,8 @@
  */
 
 #include "gem/i915_gem_object.h"
+#include "mm/i915_acquire_ctx.h"
+
 #include "i915_drv.h"
 #include "i915_vma.h"
 #include "intel_engine.h"
@@ -21,9 +23,16 @@ unsigned int intel_ring_update_space(struct intel_ring *ring)
 	return space;
 }
 
-int intel_ring_pin(struct intel_ring *ring)
+int intel_ring_acquire_lock(struct intel_ring *ring,
+			    struct i915_acquire_ctx *ctx)
+{
+	return i915_acquire_ctx_lock(ctx, ring->vma->obj);
+}
+
+int intel_ring_pin_locked(struct intel_ring *ring)
 {
 	struct i915_vma *vma = ring->vma;
+	enum i915_map_type type;
 	unsigned int flags;
 	void *addr;
 	int ret;
@@ -39,15 +48,15 @@ int intel_ring_pin(struct intel_ring *ring)
 	else
 		flags |= PIN_HIGH;
 
-	ret = i915_ggtt_pin(vma, 0, flags);
+	ret = i915_ggtt_pin_locked(vma, 0, flags);
 	if (unlikely(ret))
 		goto err_unpin;
 
+	type = i915_coherent_map_type(vma->vm->i915);
 	if (i915_vma_is_map_and_fenceable(vma))
 		addr = (void __force *)i915_vma_pin_iomap(vma);
 	else
-		addr = i915_gem_object_pin_map(vma->obj,
-					       i915_coherent_map_type(vma->vm->i915));
+		addr = __i915_gem_object_pin_map_locked(vma->obj, type);
 	if (IS_ERR(addr)) {
 		ret = PTR_ERR(addr);
 		goto err_ring;
diff --git a/drivers/gpu/drm/i915/gt/intel_ring.h b/drivers/gpu/drm/i915/gt/intel_ring.h
index cc0ebca65167..34134a0b80b3 100644
--- a/drivers/gpu/drm/i915/gt/intel_ring.h
+++ b/drivers/gpu/drm/i915/gt/intel_ring.h
@@ -11,6 +11,7 @@
 #include "i915_request.h"
 #include "intel_ring_types.h"
 
+struct i915_acquire_ctx;
 struct intel_engine_cs;
 
 struct intel_ring *
@@ -21,7 +22,9 @@ int intel_ring_cacheline_align(struct i915_request *rq);
 
 unsigned int intel_ring_update_space(struct intel_ring *ring);
 
-int intel_ring_pin(struct intel_ring *ring);
+int intel_ring_acquire_lock(struct intel_ring *ring,
+			    struct i915_acquire_ctx *ctx);
+int intel_ring_pin_locked(struct intel_ring *ring);
 void intel_ring_unpin(struct intel_ring *ring);
 void intel_ring_reset(struct intel_ring *ring, u32 tail);
 
diff --git a/drivers/gpu/drm/i915/gt/intel_ring_submission.c b/drivers/gpu/drm/i915/gt/intel_ring_submission.c
index 9a126ad517c1..ec54ff029699 100644
--- a/drivers/gpu/drm/i915/gt/intel_ring_submission.c
+++ b/drivers/gpu/drm/i915/gt/intel_ring_submission.c
@@ -27,6 +27,8 @@
  *
  */
 
+#include "mm/i915_acquire_ctx.h"
+
 #include "gen2_engine_cs.h"
 #include "gen6_engine_cs.h"
 #include "gen6_ppgtt.h"
@@ -1009,6 +1011,15 @@ static void gen6_bsd_set_default_submission(struct intel_engine_cs *engine)
 	engine->submit_request = gen6_bsd_submit_request;
 }
 
+static void ring_release_global_timeline(struct intel_engine_cs *engine)
+{
+	intel_ring_unpin(engine->legacy.ring);
+	intel_ring_put(engine->legacy.ring);
+
+	intel_timeline_unpin(engine->legacy.timeline);
+	intel_timeline_put(engine->legacy.timeline);
+}
+
 static void ring_release(struct intel_engine_cs *engine)
 {
 	struct drm_i915_private *dev_priv = engine->i915;
@@ -1023,11 +1034,7 @@ static void ring_release(struct intel_engine_cs *engine)
 		i915_vma_unpin_and_release(&engine->wa_ctx.vma, 0);
 	}
 
-	intel_ring_unpin(engine->legacy.ring);
-	intel_ring_put(engine->legacy.ring);
-
-	intel_timeline_unpin(engine->legacy.timeline);
-	intel_timeline_put(engine->legacy.timeline);
+	ring_release_global_timeline(engine);
 }
 
 static void setup_irq(struct intel_engine_cs *engine)
@@ -1226,12 +1233,69 @@ static int gen7_ctx_switch_bb_init(struct intel_engine_cs *engine)
 	return err;
 }
 
-int intel_ring_submission_setup(struct intel_engine_cs *engine)
+static int ring_setup_global_timeline(struct intel_engine_cs *engine)
 {
+	struct i915_acquire_ctx acquire;
 	struct intel_timeline *timeline;
 	struct intel_ring *ring;
 	int err;
 
+	timeline = intel_timeline_create(engine->gt, engine->status_page.vma);
+	if (IS_ERR(timeline))
+		return PTR_ERR(timeline);
+	GEM_BUG_ON(timeline->has_initial_breadcrumb);
+
+	ring = intel_engine_create_ring(engine, SZ_16K);
+	if (IS_ERR(ring)) {
+		err = PTR_ERR(ring);
+		goto err_timeline;
+	}
+
+	i915_acquire_ctx_init(&acquire);
+
+	err = intel_ring_acquire_lock(ring, &acquire);
+	if (err)
+		goto err_acquire;
+
+	err = intel_timeline_acquire_lock(timeline, &acquire);
+	if (err)
+		goto err_acquire;
+
+	err = i915_acquire_mm(&acquire);
+	if (err)
+		goto err_acquire;
+
+	err = intel_timeline_pin_locked(timeline);
+	if (err)
+		goto err_acquire;
+
+	err = intel_ring_pin_locked(ring);
+	if (err)
+		goto err_timeline_unpin;
+
+	i915_acquire_ctx_fini(&acquire);
+
+	GEM_BUG_ON(engine->legacy.ring);
+	engine->legacy.ring = ring;
+	engine->legacy.timeline = timeline;
+
+	GEM_BUG_ON(timeline->hwsp_ggtt != engine->status_page.vma);
+	return 0;
+
+err_timeline_unpin:
+	intel_timeline_unpin(timeline);
+err_acquire:
+	i915_acquire_ctx_fini(&acquire);
+	intel_ring_put(ring);
+err_timeline:
+	intel_timeline_put(timeline);
+	return err;
+}
+
+int intel_ring_submission_setup(struct intel_engine_cs *engine)
+{
+	int err;
+
 	setup_common(engine);
 
 	switch (engine->class) {
@@ -1252,37 +1316,14 @@ int intel_ring_submission_setup(struct intel_engine_cs *engine)
 		return -ENODEV;
 	}
 
-	timeline = intel_timeline_create(engine->gt, engine->status_page.vma);
-	if (IS_ERR(timeline)) {
-		err = PTR_ERR(timeline);
-		goto err;
-	}
-	GEM_BUG_ON(timeline->has_initial_breadcrumb);
-
-	err = intel_timeline_pin(timeline);
-	if (err)
-		goto err_timeline;
-
-	ring = intel_engine_create_ring(engine, SZ_16K);
-	if (IS_ERR(ring)) {
-		err = PTR_ERR(ring);
-		goto err_timeline_unpin;
-	}
-
-	err = intel_ring_pin(ring);
+	err = ring_setup_global_timeline(engine);
 	if (err)
-		goto err_ring;
-
-	GEM_BUG_ON(engine->legacy.ring);
-	engine->legacy.ring = ring;
-	engine->legacy.timeline = timeline;
-
-	GEM_BUG_ON(timeline->hwsp_ggtt != engine->status_page.vma);
+		goto err_common;
 
 	if (IS_HASWELL(engine->i915) && engine->class == RENDER_CLASS) {
 		err = gen7_ctx_switch_bb_init(engine);
 		if (err)
-			goto err_ring_unpin;
+			goto err_global;
 	}
 
 	/* Finally, take ownership and responsibility for cleanup! */
@@ -1290,15 +1331,9 @@ int intel_ring_submission_setup(struct intel_engine_cs *engine)
 
 	return 0;
 
-err_ring_unpin:
-	intel_ring_unpin(ring);
-err_ring:
-	intel_ring_put(ring);
-err_timeline_unpin:
-	intel_timeline_unpin(timeline);
-err_timeline:
-	intel_timeline_put(timeline);
-err:
+err_global:
+	ring_release_global_timeline(engine);
+err_common:
 	intel_engine_cleanup_common(engine);
 	return err;
 }
diff --git a/drivers/gpu/drm/i915/gt/intel_timeline.c b/drivers/gpu/drm/i915/gt/intel_timeline.c
index acb43aebd669..c63ea46ce71b 100644
--- a/drivers/gpu/drm/i915/gt/intel_timeline.c
+++ b/drivers/gpu/drm/i915/gt/intel_timeline.c
@@ -4,9 +4,10 @@
  * Copyright © 2016-2018 Intel Corporation
  */
 
-#include "i915_drv.h"
+#include "mm/i915_acquire_ctx.h"
 
 #include "i915_active.h"
+#include "i915_drv.h"
 #include "i915_syncmap.h"
 #include "intel_gt.h"
 #include "intel_ring.h"
@@ -315,14 +316,21 @@ intel_timeline_create(struct intel_gt *gt, struct i915_vma *global_hwsp)
 	return timeline;
 }
 
-int intel_timeline_pin(struct intel_timeline *tl)
+int
+intel_timeline_acquire_lock(struct intel_timeline *tl,
+			    struct i915_acquire_ctx *ctx)
+{
+	return i915_acquire_ctx_lock(ctx, tl->hwsp_ggtt->obj);
+}
+
+int intel_timeline_pin_locked(struct intel_timeline *tl)
 {
 	int err;
 
 	if (atomic_add_unless(&tl->pin_count, 1, 0))
 		return 0;
 
-	err = i915_ggtt_pin(tl->hwsp_ggtt, 0, PIN_HIGH);
+	err = i915_ggtt_pin_locked(tl->hwsp_ggtt, 0, PIN_HIGH);
 	if (err)
 		return err;
 
diff --git a/drivers/gpu/drm/i915/gt/intel_timeline.h b/drivers/gpu/drm/i915/gt/intel_timeline.h
index 4298b9ac7327..073c94cd8160 100644
--- a/drivers/gpu/drm/i915/gt/intel_timeline.h
+++ b/drivers/gpu/drm/i915/gt/intel_timeline.h
@@ -29,7 +29,9 @@
 
 #include "i915_active.h"
 #include "i915_syncmap.h"
-#include "gt/intel_timeline_types.h"
+#include "intel_timeline_types.h"
+
+struct i915_acquire_ctx;
 
 struct intel_timeline *
 intel_timeline_create(struct intel_gt *gt, struct i915_vma *global_hwsp);
@@ -71,7 +73,11 @@ static inline bool intel_timeline_sync_is_later(struct intel_timeline *tl,
 	return __intel_timeline_sync_is_later(tl, fence->context, fence->seqno);
 }
 
-int intel_timeline_pin(struct intel_timeline *tl);
+int
+intel_timeline_acquire_lock(struct intel_timeline *tl,
+			    struct i915_acquire_ctx *ctx);
+int intel_timeline_pin_locked(struct intel_timeline *tl);
+
 void intel_timeline_enter(struct intel_timeline *tl);
 int intel_timeline_get_seqno(struct intel_timeline *tl,
 			     struct i915_request *rq,
diff --git a/drivers/gpu/drm/i915/gt/mock_engine.c b/drivers/gpu/drm/i915/gt/mock_engine.c
index b8dd3cbc8696..283cfa2912b3 100644
--- a/drivers/gpu/drm/i915/gt/mock_engine.c
+++ b/drivers/gpu/drm/i915/gt/mock_engine.c
@@ -67,6 +67,7 @@ static struct intel_ring *mock_ring(struct intel_engine_cs *engine)
 	__set_bit(I915_VMA_GGTT_BIT, __i915_vma_flags(ring->vma));
 	__set_bit(DRM_MM_NODE_ALLOCATED_BIT, &ring->vma->node.flags);
 	ring->vma->node.size = sz;
+	ring->vma->obj = i915_gem_object_create_internal(engine->i915, 4096);
 
 	intel_ring_update_space(ring);
 
@@ -75,6 +76,7 @@ static struct intel_ring *mock_ring(struct intel_engine_cs *engine)
 
 static void mock_ring_free(struct intel_ring *ring)
 {
+	i915_gem_object_put(ring->vma->obj);
 	i915_active_fini(&ring->vma->active);
 	i915_vma_free(ring->vma);
 
diff --git a/drivers/gpu/drm/i915/gt/selftest_timeline.c b/drivers/gpu/drm/i915/gt/selftest_timeline.c
index fb5b7d3498a6..ddd3a435387c 100644
--- a/drivers/gpu/drm/i915/gt/selftest_timeline.c
+++ b/drivers/gpu/drm/i915/gt/selftest_timeline.c
@@ -6,6 +6,8 @@
 
 #include <linux/prime_numbers.h>
 
+#include "mm/i915_acquire_ctx.h"
+
 #include "intel_context.h"
 #include "intel_engine_heartbeat.h"
 #include "intel_engine_pm.h"
@@ -449,13 +451,37 @@ static int emit_ggtt_store_dw(struct i915_request *rq, u32 addr, u32 value)
 	return 0;
 }
 
+static int tl_pin(struct intel_timeline *tl)
+{
+	struct i915_acquire_ctx acquire;
+	int err;
+
+	i915_acquire_ctx_init(&acquire);
+
+	err = intel_timeline_acquire_lock(tl, &acquire);
+	if (err)
+		goto out;
+
+	err = i915_acquire_mm(&acquire);
+	if (err)
+		goto out;
+
+	err = intel_timeline_pin_locked(tl);
+	if (err == 0)
+		err = i915_vma_wait_for_bind(tl->hwsp_ggtt);
+
+out:
+	i915_acquire_ctx_fini(&acquire);
+	return err;
+}
+
 static struct i915_request *
 tl_write(struct intel_timeline *tl, struct intel_engine_cs *engine, u32 value)
 {
 	struct i915_request *rq;
 	int err;
 
-	err = intel_timeline_pin(tl);
+	err = tl_pin(tl);
 	if (err) {
 		rq = ERR_PTR(err);
 		goto out;
@@ -667,7 +693,7 @@ static int live_hwsp_wrap(void *arg)
 	if (!tl->has_initial_breadcrumb || !tl->hwsp_cacheline)
 		goto out_free;
 
-	err = intel_timeline_pin(tl);
+	err = tl_pin(tl);
 	if (err)
 		goto out_free;
 
-- 
2.20.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [Intel-gfx] [PATCH 32/66] drm/i915/gt: Push the wait for the context to be bound to the request
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (29 preceding siblings ...)
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 31/66] drm/i915/gt: Acquire backing storage for the context Chris Wilson
@ 2020-07-15 11:51 ` Chris Wilson
  2020-07-31 10:48   ` Thomas Hellström (Intel)
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 33/66] drm/i915: Remove unused i915_gem_evict_vm() Chris Wilson
                   ` (41 subsequent siblings)
  72 siblings, 1 reply; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:51 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

Rather than synchronously waiting for the context to be bound within
intel_context_pin(), we can track the pending completion of the bind
fence and only submit requests along the context once it has signaled.
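
A condensed sketch of the mechanism (illustrative; the helper name is
invented, and it mirrors the await_bind()/intel_context_active_locked()
changes below): any outstanding bind fence on a backing vma is gathered
into a composite dma_fence_await, which is then installed as the
context's exclusive fence so that new requests wait on it.

  static int example_track_bind(struct intel_context *ce,
  				struct dma_fence_await *fence)
  {
  	struct dma_fence *bind;
  	int err = 0;

  	/* Collect the pending bind of the ring's backing store, if any */
  	bind = i915_active_fence_get(&ce->ring->vma->active.excl);
  	if (bind) {
  		err = i915_sw_fence_await_dma_fence(&fence->await, bind,
  						    0, GFP_KERNEL);
  		dma_fence_put(bind);
  	}
  	if (err < 0)
  		return err;

  	/*
  	 * Expose the composite as ce->active's exclusive fence so that
  	 * requests created on ce await the bind before execution.
  	 */
  	if (atomic_read(&fence->await.pending) > 1)
  		i915_active_set_exclusive(&ce->active, &fence->dma);

  	return 0;
  }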

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/Makefile              |  1 +
 drivers/gpu/drm/i915/gt/intel_context.c    | 80 +++++++++++++---------
 drivers/gpu/drm/i915/gt/intel_context.h    |  6 ++
 drivers/gpu/drm/i915/i915_active.h         |  1 -
 drivers/gpu/drm/i915/i915_request.c        |  4 ++
 drivers/gpu/drm/i915/i915_sw_fence_await.c | 62 +++++++++++++++++
 drivers/gpu/drm/i915/i915_sw_fence_await.h | 19 +++++
 7 files changed, 140 insertions(+), 33 deletions(-)
 create mode 100644 drivers/gpu/drm/i915/i915_sw_fence_await.c
 create mode 100644 drivers/gpu/drm/i915/i915_sw_fence_await.h

diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
index a3a4c8a555ec..2cf54db8b847 100644
--- a/drivers/gpu/drm/i915/Makefile
+++ b/drivers/gpu/drm/i915/Makefile
@@ -61,6 +61,7 @@ i915-y += \
 	i915_memcpy.o \
 	i915_mm.o \
 	i915_sw_fence.o \
+	i915_sw_fence_await.o \
 	i915_sw_fence_work.o \
 	i915_syncmap.o \
 	i915_user_extensions.o
diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
index 2f1606365f63..9ba1c15114d7 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.c
+++ b/drivers/gpu/drm/i915/gt/intel_context.c
@@ -10,6 +10,7 @@
 
 #include "i915_drv.h"
 #include "i915_globals.h"
+#include "i915_sw_fence_await.h"
 
 #include "intel_context.h"
 #include "intel_engine.h"
@@ -94,27 +95,6 @@ static void intel_context_active_release(struct intel_context *ce)
 	i915_active_release(&ce->active);
 }
 
-static int __intel_context_sync(struct intel_context *ce)
-{
-	int err;
-
-	err = i915_vma_wait_for_bind(ce->ring->vma);
-	if (err)
-		return err;
-
-	err = i915_vma_wait_for_bind(ce->timeline->hwsp_ggtt);
-	if (err)
-		return err;
-
-	if (ce->state) {
-		err = i915_vma_wait_for_bind(ce->state);
-		if (err)
-			return err;
-	}
-
-	return 0;
-}
-
 int __intel_context_do_pin(struct intel_context *ce)
 {
 	int err;
@@ -140,10 +120,6 @@ int __intel_context_do_pin(struct intel_context *ce)
 	}
 
 	if (likely(!atomic_add_unless(&ce->pin_count, 1, 0))) {
-		err = __intel_context_sync(ce);
-		if (unlikely(err))
-			goto out_unlock;
-
 		err = intel_context_active_acquire(ce);
 		if (unlikely(err))
 			goto out_unlock;
@@ -301,31 +277,71 @@ intel_context_acquire_lock(struct intel_context *ce,
 	return 0;
 }
 
+static int await_bind(struct dma_fence_await *fence, struct i915_vma *vma)
+{
+	struct dma_fence *bind;
+	int err = 0;
+
+	bind = i915_active_fence_get(&vma->active.excl);
+	if (bind) {
+		err = i915_sw_fence_await_dma_fence(&fence->await, bind,
+						    0, GFP_KERNEL);
+		dma_fence_put(bind);
+	}
+
+	return err;
+}
+
 static int intel_context_active_locked(struct intel_context *ce)
 {
+	struct dma_fence_await *fence;
 	int err;
 
+	fence = dma_fence_await_create(GFP_KERNEL);
+	if (!fence)
+		return -ENOMEM;
+
 	err = __ring_active_locked(ce->ring);
 	if (err)
-		return err;
+		goto out_fence;
+
+	err = await_bind(fence, ce->ring->vma);
+	if (err < 0)
+		goto err_ring;
 
 	err = intel_timeline_pin_locked(ce->timeline);
 	if (err)
 		goto err_ring;
 
-	if (!ce->state)
-		return 0;
-
-	err = __context_active_locked(ce->state);
-	if (err)
+	err = await_bind(fence, ce->timeline->hwsp_ggtt);
+	if (err < 0)
 		goto err_timeline;
 
-	return 0;
+	if (ce->state) {
+		err = __context_active_locked(ce->state);
+		if (err)
+			goto err_timeline;
+
+		err = await_bind(fence, ce->state);
+		if (err < 0)
+			goto err_state;
+	}
+
+	/* Must be the last action as it *releases* the ce->active */
+	if (atomic_read(&fence->await.pending) > 1)
+		i915_active_set_exclusive(&ce->active, &fence->dma);
+
+	err = 0;
+	goto out_fence;
 
+err_state:
+	__context_unpin_state(ce->state);
 err_timeline:
 	intel_timeline_unpin(ce->timeline);
 err_ring:
 	__ring_retire(ce->ring);
+out_fence:
+	i915_sw_fence_commit(&fence->await);
 	return err;
 }
 
diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
index 07be021882cc..f48df2784a6c 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.h
+++ b/drivers/gpu/drm/i915/gt/intel_context.h
@@ -249,4 +249,10 @@ static inline u64 intel_context_get_avg_runtime_ns(struct intel_context *ce)
 	return mul_u32_u32(ewma_runtime_read(&ce->runtime.avg), period);
 }
 
+static inline int i915_request_await_context(struct i915_request *rq,
+					     struct intel_context *ce)
+{
+	return __i915_request_await_exclusive(rq, &ce->active);
+}
+
 #endif /* __INTEL_CONTEXT_H__ */
diff --git a/drivers/gpu/drm/i915/i915_active.h b/drivers/gpu/drm/i915/i915_active.h
index fb165d3f01cf..4edf4bb92121 100644
--- a/drivers/gpu/drm/i915/i915_active.h
+++ b/drivers/gpu/drm/i915/i915_active.h
@@ -207,7 +207,6 @@ void i915_active_release(struct i915_active *ref);
 
 static inline void __i915_active_acquire(struct i915_active *ref)
 {
-	GEM_BUG_ON(!atomic_read(&ref->count));
 	atomic_inc(&ref->count);
 }
 
diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index f58beff5e859..83696955ddf7 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -847,6 +847,10 @@ __i915_request_create(struct intel_context *ce, gfp_t gfp)
 	if (ret)
 		goto err_unwind;
 
+	ret = i915_request_await_context(rq, ce);
+	if (ret)
+		goto err_unwind;
+
 	rq->infix = rq->ring->emit; /* end of header; start of user payload */
 
 	intel_context_mark_active(ce);
diff --git a/drivers/gpu/drm/i915/i915_sw_fence_await.c b/drivers/gpu/drm/i915/i915_sw_fence_await.c
new file mode 100644
index 000000000000..431d324e5591
--- /dev/null
+++ b/drivers/gpu/drm/i915/i915_sw_fence_await.c
@@ -0,0 +1,62 @@
+// SPDX-License-Identifier: MIT
+/*
+ * (C) Copyright 2020 Intel Corporation
+ */
+
+#include <linux/slab.h>
+#include <linux/dma-fence.h>
+
+#include "i915_sw_fence_await.h"
+
+static int __i915_sw_fence_call
+fence_notify(struct i915_sw_fence *fence, enum i915_sw_fence_notify state)
+{
+	struct dma_fence_await *f = container_of(fence, typeof(*f), await);
+
+	switch (state) {
+	case FENCE_COMPLETE:
+		dma_fence_signal(&f->dma);
+		break;
+
+	case FENCE_FREE:
+		dma_fence_put(&f->dma);
+		break;
+	}
+
+	return NOTIFY_DONE;
+}
+
+static const char *fence_name(struct dma_fence *fence)
+{
+	return "dma-fence-await";
+}
+
+static void fence_release(struct dma_fence *fence)
+{
+	struct dma_fence_await *f = container_of(fence, typeof(*f), dma);
+
+	i915_sw_fence_fini(&f->await);
+
+	BUILD_BUG_ON(offsetof(typeof(*f), dma));
+	dma_fence_free(&f->dma);
+}
+
+static const struct dma_fence_ops fence_ops = {
+	.get_driver_name = fence_name,
+	.get_timeline_name = fence_name,
+	.release = fence_release,
+};
+
+struct dma_fence_await *dma_fence_await_create(gfp_t gfp)
+{
+	struct dma_fence_await *f;
+
+	f = kmalloc(sizeof(*f), gfp);
+	if (!f)
+		return NULL;
+
+	i915_sw_fence_init(&f->await, fence_notify);
+	dma_fence_init(&f->dma, &fence_ops, &f->await.wait.lock, 0, 0);
+
+	return f;
+}
diff --git a/drivers/gpu/drm/i915/i915_sw_fence_await.h b/drivers/gpu/drm/i915/i915_sw_fence_await.h
new file mode 100644
index 000000000000..71882a5ed443
--- /dev/null
+++ b/drivers/gpu/drm/i915/i915_sw_fence_await.h
@@ -0,0 +1,19 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * (C) Copyright 2020 Intel Corporation
+ */
+
+#ifndef I915_SW_FENCE_AWAIT_H
+#define I915_SW_FENCE_AWAIT_H
+
+#include <linux/dma-fence.h>
+#include <linux/slab.h>
+
+#include "i915_sw_fence.h"
+
+struct dma_fence_await {
+	struct dma_fence dma;
+	struct i915_sw_fence await;
+} *dma_fence_await_create(gfp_t gfp);
+
+#endif /* I915_SW_FENCE_AWAIT_H */
-- 
2.20.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [Intel-gfx] [PATCH 33/66] drm/i915: Remove unused i915_gem_evict_vm()
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (30 preceding siblings ...)
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 32/66] drm/i915/gt: Push the wait for the context to be bound to the request Chris Wilson
@ 2020-07-15 11:51 ` Chris Wilson
  2020-07-31 10:51   ` Thomas Hellström (Intel)
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 34/66] drm/i915/gt: Decouple completed requests on unwind Chris Wilson
                   ` (40 subsequent siblings)
  72 siblings, 1 reply; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:51 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

Obsolete, last user removed.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/i915_drv.h               |  1 -
 drivers/gpu/drm/i915/i915_gem_evict.c         | 57 -------------------
 .../gpu/drm/i915/selftests/i915_gem_evict.c   | 40 -------------
 3 files changed, 98 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index bd7ff2ad6514..2c1a9b74af8d 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1865,7 +1865,6 @@ int __must_check i915_gem_evict_something(struct i915_address_space *vm,
 int __must_check i915_gem_evict_for_node(struct i915_address_space *vm,
 					 struct drm_mm_node *node,
 					 unsigned int flags);
-int i915_gem_evict_vm(struct i915_address_space *vm);
 
 /* i915_gem_internal.c */
 struct drm_i915_gem_object *
diff --git a/drivers/gpu/drm/i915/i915_gem_evict.c b/drivers/gpu/drm/i915/i915_gem_evict.c
index 6501939929d5..e35f0ba5e245 100644
--- a/drivers/gpu/drm/i915/i915_gem_evict.c
+++ b/drivers/gpu/drm/i915/i915_gem_evict.c
@@ -343,63 +343,6 @@ int i915_gem_evict_for_node(struct i915_address_space *vm,
 	return ret;
 }
 
-/**
- * i915_gem_evict_vm - Evict all idle vmas from a vm
- * @vm: Address space to cleanse
- *
- * This function evicts all vmas from a vm.
- *
- * This is used by the execbuf code as a last-ditch effort to defragment the
- * address space.
- *
- * To clarify: This is for freeing up virtual address space, not for freeing
- * memory in e.g. the shrinker.
- */
-int i915_gem_evict_vm(struct i915_address_space *vm)
-{
-	int ret = 0;
-
-	lockdep_assert_held(&vm->mutex);
-	trace_i915_gem_evict_vm(vm);
-
-	/* Switch back to the default context in order to unpin
-	 * the existing context objects. However, such objects only
-	 * pin themselves inside the global GTT and performing the
-	 * switch otherwise is ineffective.
-	 */
-	if (i915_is_ggtt(vm)) {
-		ret = ggtt_flush(vm->gt);
-		if (ret)
-			return ret;
-	}
-
-	do {
-		struct i915_vma *vma, *vn;
-		LIST_HEAD(eviction_list);
-
-		list_for_each_entry(vma, &vm->bound_list, vm_link) {
-			if (i915_vma_is_pinned(vma))
-				continue;
-
-			__i915_vma_pin(vma);
-			list_add(&vma->evict_link, &eviction_list);
-		}
-		if (list_empty(&eviction_list))
-			break;
-
-		ret = 0;
-		list_for_each_entry_safe(vma, vn, &eviction_list, evict_link) {
-			__i915_vma_unpin(vma);
-			if (ret == 0)
-				ret = __i915_vma_unbind(vma);
-			if (ret != -EINTR) /* "Get me out of here!" */
-				ret = 0;
-		}
-	} while (ret == 0);
-
-	return ret;
-}
-
 #if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
 #include "selftests/i915_gem_evict.c"
 #endif
diff --git a/drivers/gpu/drm/i915/selftests/i915_gem_evict.c b/drivers/gpu/drm/i915/selftests/i915_gem_evict.c
index 028baae9631f..773cecacba82 100644
--- a/drivers/gpu/drm/i915/selftests/i915_gem_evict.c
+++ b/drivers/gpu/drm/i915/selftests/i915_gem_evict.c
@@ -327,45 +327,6 @@ static int igt_evict_for_cache_color(void *arg)
 	return err;
 }
 
-static int igt_evict_vm(void *arg)
-{
-	struct intel_gt *gt = arg;
-	struct i915_ggtt *ggtt = gt->ggtt;
-	LIST_HEAD(objects);
-	int err;
-
-	/* Fill the GGTT with pinned objects and try to evict everything. */
-
-	err = populate_ggtt(ggtt, &objects);
-	if (err)
-		goto cleanup;
-
-	/* Everything is pinned, nothing should happen */
-	mutex_lock(&ggtt->vm.mutex);
-	err = i915_gem_evict_vm(&ggtt->vm);
-	mutex_unlock(&ggtt->vm.mutex);
-	if (err) {
-		pr_err("i915_gem_evict_vm on a full GGTT returned err=%d]\n",
-		       err);
-		goto cleanup;
-	}
-
-	unpin_ggtt(ggtt);
-
-	mutex_lock(&ggtt->vm.mutex);
-	err = i915_gem_evict_vm(&ggtt->vm);
-	mutex_unlock(&ggtt->vm.mutex);
-	if (err) {
-		pr_err("i915_gem_evict_vm on a full GGTT returned err=%d]\n",
-		       err);
-		goto cleanup;
-	}
-
-cleanup:
-	cleanup_objects(ggtt, &objects);
-	return err;
-}
-
 static int igt_evict_contexts(void *arg)
 {
 	const u64 PRETEND_GGTT_SIZE = 16ull << 20;
@@ -522,7 +483,6 @@ int i915_gem_evict_mock_selftests(void)
 		SUBTEST(igt_evict_something),
 		SUBTEST(igt_evict_for_vma),
 		SUBTEST(igt_evict_for_cache_color),
-		SUBTEST(igt_evict_vm),
 		SUBTEST(igt_overcommit),
 	};
 	struct drm_i915_private *i915;
-- 
2.20.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [Intel-gfx] [PATCH 34/66] drm/i915/gt: Decouple completed requests on unwind
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (31 preceding siblings ...)
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 33/66] drm/i915: Remove unused i915_gem_evict_vm() Chris Wilson
@ 2020-07-15 11:51 ` Chris Wilson
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 35/66] drm/i915/gt: Check for a completed last request once Chris Wilson
                   ` (39 subsequent siblings)
  72 siblings, 0 replies; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:51 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

Since the introduction of preempt-to-busy, requests can complete in the
background, even while they are not on the engine->active.requests
list. As such, the engine->active.requests list itself is not in strict
retirement order, and we have to scan the entire list while unwinding
so as not to miss any. However, if a request has completed we currently
leave it on the list [until retirement], whereas we could just as
easily remove it and stop treating it as active. We would then only
have to traverse the list once while unwinding in quick succession.
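
Put together, the intent is roughly the following (condensed from the
hunks below, not a complete listing): completed requests are dropped
from the engine list during unwind, and retirement only detaches a
request that is still linked.

  /* Unwind: drop completed requests instead of skipping them */
  list_for_each_entry_safe_reverse(rq, rn,
  				   &engine->active.requests,
  				   sched.link) {
  	if (i915_request_completed(rq)) {
  		list_del_init(&rq->sched.link);
  		continue;
  	}

  	__i915_request_unsubmit(rq);
  	/* ... requeue rq for later resubmission ... */
  }

  /* Retire: the request may already have been unlinked above */
  if (!list_empty(&rq->sched.link))
  	remove_from_engine(rq);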

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/gt/intel_lrc.c | 6 ++++--
 drivers/gpu/drm/i915/i915_request.c | 3 ++-
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index aa7be7f05f8c..f52b52a7b1d3 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -1114,8 +1114,10 @@ __unwind_incomplete_requests(struct intel_engine_cs *engine)
 	list_for_each_entry_safe_reverse(rq, rn,
 					 &engine->active.requests,
 					 sched.link) {
-		if (i915_request_completed(rq))
-			continue; /* XXX */
+		if (i915_request_completed(rq)) {
+			list_del_init(&rq->sched.link);
+			continue;
+		}
 
 		__i915_request_unsubmit(rq);
 
diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index 83696955ddf7..31c60e6c5c7a 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -311,7 +311,8 @@ bool i915_request_retire(struct i915_request *rq)
 	 * request that we have removed from the HW and put back on a run
 	 * queue.
 	 */
-	remove_from_engine(rq);
+	if (!list_empty(&rq->sched.link))
+		remove_from_engine(rq);
 
 	spin_lock_irq(&rq->lock);
 	if (!i915_request_signaled(rq))
-- 
2.20.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [Intel-gfx] [PATCH 35/66] drm/i915/gt: Check for a completed last request once
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (32 preceding siblings ...)
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 34/66] drm/i915/gt: Decouple completed requests on unwind Chris Wilson
@ 2020-07-15 11:51 ` Chris Wilson
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 36/66] drm/i915/gt: Replace direct submit with direct call to tasklet Chris Wilson
                   ` (38 subsequent siblings)
  72 siblings, 0 replies; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:51 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

Pull the repeated check for the last active request being completed
into a single spot, used when deciding whether or not execlists
preemption is required.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/gt/intel_lrc.c | 15 ++++-----------
 1 file changed, 4 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index f52b52a7b1d3..0c478187f9ba 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -2123,12 +2123,9 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
 	 */
 
 	if ((last = *active)) {
-		if (need_preempt(engine, last, rb)) {
-			if (i915_request_completed(last)) {
-				tasklet_hi_schedule(&execlists->tasklet);
-				return;
-			}
-
+		if (i915_request_completed(last)) {
+			goto check_secondary;
+		} else if (need_preempt(engine, last, rb)) {
 			ENGINE_TRACE(engine,
 				     "preempting last=%llx:%lld, prio=%d, hint=%d\n",
 				     last->fence.context,
@@ -2156,11 +2153,6 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
 			last = NULL;
 		} else if (need_timeslice(engine, last, rb) &&
 			   timeslice_expired(execlists, last)) {
-			if (i915_request_completed(last)) {
-				tasklet_hi_schedule(&execlists->tasklet);
-				return;
-			}
-
 			ENGINE_TRACE(engine,
 				     "expired last=%llx:%lld, prio=%d, hint=%d, yield?=%s\n",
 				     last->fence.context,
@@ -2196,6 +2188,7 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
 			 * we hopefully coalesce several updates into a single
 			 * submission.
 			 */
+check_secondary:
 			if (!list_is_last(&last->sched.link,
 					  &engine->active.requests)) {
 				/*
-- 
2.20.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [Intel-gfx] [PATCH 36/66] drm/i915/gt: Replace direct submit with direct call to tasklet
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (33 preceding siblings ...)
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 35/66] drm/i915/gt: Check for a completed last request once Chris Wilson
@ 2020-07-15 11:51 ` Chris Wilson
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 37/66] drm/i915/gt: Free stale request on destroying the virtual engine Chris Wilson
                   ` (37 subsequent siblings)
  72 siblings, 0 replies; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:51 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

Rather than having special-case code for opportunistically calling
process_csb() and performing a direct submit while holding the engine
spinlock to submit the request, simply call the tasklet directly. This
allows us to retain the direct submission path, including draining the
CS to allow fast/immediate submissions, without requiring any
duplicated code paths.

The trickiest part here is to ensure that paired operations (such as
schedule_in/schedule_out) remain under consistent locking domains,
e.g. when pulled outside of the engine->active.lock.

v2: Use bh kicking, see commit 3c53776e29f8 ("Mark HI and TASKLET
softirq synchronous").
v3: Update engine-reset to be tasklet aware
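
As a small illustration of the submission-side pattern (hypothetical
wrapper name; the calls mirror the hunks below), queueing is now
bracketed with local_bh_disable()/local_bh_enable() so that the
submission tasklet runs on enable rather than dequeuing directly under
the engine lock:

  static void example_queue_and_kick(struct i915_request *rq,
  				     const struct i915_sched_attr *attr)
  {
  	local_bh_disable();	/* defer softirqs while we queue */
  	__i915_request_queue(rq, attr);
  	local_bh_enable();	/* process any kicked tasklet before returning */
  }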

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
---
 drivers/gpu/drm/i915/gem/i915_gem_context.c   |   4 +-
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c    |   2 +
 drivers/gpu/drm/i915/gt/intel_engine_cs.c     |  35 +++--
 .../gpu/drm/i915/gt/intel_engine_heartbeat.c  |  20 ++-
 drivers/gpu/drm/i915/gt/intel_engine_types.h  |   3 +-
 drivers/gpu/drm/i915/gt/intel_lrc.c           | 120 ++++++------------
 drivers/gpu/drm/i915/gt/intel_reset.c         |  60 +++++----
 drivers/gpu/drm/i915/gt/intel_reset.h         |   2 +
 drivers/gpu/drm/i915/gt/selftest_hangcheck.c  |   7 +-
 drivers/gpu/drm/i915/gt/selftest_lrc.c        |  27 ++--
 drivers/gpu/drm/i915/gt/selftest_reset.c      |   8 +-
 drivers/gpu/drm/i915/i915_request.c           |   2 +
 drivers/gpu/drm/i915/selftests/i915_request.c |   6 +-
 13 files changed, 152 insertions(+), 144 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
index b5f6dc2333ab..901b2f5614ea 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -398,12 +398,14 @@ static bool __reset_engine(struct intel_engine_cs *engine)
 	if (!intel_has_reset_engine(gt))
 		return false;
 
+	local_bh_disable();
 	if (!test_and_set_bit(I915_RESET_ENGINE + engine->id,
 			      &gt->reset.flags)) {
-		success = intel_engine_reset(engine, NULL) == 0;
+		success = __intel_engine_reset_bh(engine, NULL) == 0;
 		clear_and_wake_up_bit(I915_RESET_ENGINE + engine->id,
 				      &gt->reset.flags);
 	}
+	local_bh_enable();
 
 	return success;
 }
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index b07c508812ad..7ad65612e4a0 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -3327,7 +3327,9 @@ static void eb_request_add(struct i915_execbuffer *eb)
 		__i915_request_skip(rq);
 	}
 
+	local_bh_disable();
 	__i915_request_queue(rq, &attr);
+	local_bh_enable();
 }
 
 static int
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index dd1a42c4d344..c10521fdbbe4 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -978,32 +978,39 @@ static unsigned long stop_timeout(const struct intel_engine_cs *engine)
 	return READ_ONCE(engine->props.stop_timeout_ms);
 }
 
-int intel_engine_stop_cs(struct intel_engine_cs *engine)
+static int __intel_engine_stop_cs(struct intel_engine_cs *engine,
+				  int fast_timeout_us,
+				  int slow_timeout_ms)
 {
 	struct intel_uncore *uncore = engine->uncore;
-	const u32 base = engine->mmio_base;
-	const i915_reg_t mode = RING_MI_MODE(base);
+	const i915_reg_t mode = RING_MI_MODE(engine->mmio_base);
 	int err;
 
+	intel_uncore_write_fw(uncore, mode, _MASKED_BIT_ENABLE(STOP_RING));
+	err = __intel_wait_for_register_fw(engine->uncore, mode,
+					   MODE_IDLE, MODE_IDLE,
+					   fast_timeout_us,
+					   slow_timeout_ms,
+					   NULL);
+
+	/* A final mmio read to let GPU writes be hopefully flushed to memory */
+	intel_uncore_posting_read_fw(uncore, mode);
+	return err;
+}
+
+int intel_engine_stop_cs(struct intel_engine_cs *engine)
+{
+	int err = 0;
+
 	if (INTEL_GEN(engine->i915) < 3)
 		return -ENODEV;
 
 	ENGINE_TRACE(engine, "\n");
-
-	intel_uncore_write_fw(uncore, mode, _MASKED_BIT_ENABLE(STOP_RING));
-
-	err = 0;
-	if (__intel_wait_for_register_fw(uncore,
-					 mode, MODE_IDLE, MODE_IDLE,
-					 1000, stop_timeout(engine),
-					 NULL)) {
+	if (__intel_engine_stop_cs(engine, 1000, stop_timeout(engine))) {
 		ENGINE_TRACE(engine, "timed out on STOP_RING -> IDLE\n");
 		err = -ETIMEDOUT;
 	}
 
-	/* A final mmio read to let GPU writes be hopefully flushed to memory */
-	intel_uncore_posting_read_fw(uncore, mode);
-
 	return err;
 }
 
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
index 8ffdf676c0a0..be5d78472f18 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
@@ -43,6 +43,17 @@ static void idle_pulse(struct intel_engine_cs *engine, struct i915_request *rq)
 	i915_request_add_active_barriers(rq);
 }
 
+static void heartbeat_commit(struct i915_request *rq,
+			     const struct i915_sched_attr *attr)
+{
+	idle_pulse(rq->engine, rq);
+	__i915_request_commit(rq);
+
+	local_bh_disable();
+	__i915_request_queue(rq, attr);
+	local_bh_enable();
+}
+
 static void show_heartbeat(const struct i915_request *rq,
 			   struct intel_engine_cs *engine)
 {
@@ -143,12 +154,10 @@ static void heartbeat(struct work_struct *wrk)
 	if (IS_ERR(rq))
 		goto unlock;
 
-	idle_pulse(engine, rq);
 	if (engine->i915->params.enable_hangcheck)
 		engine->heartbeat.systole = i915_request_get(rq);
 
-	__i915_request_commit(rq);
-	__i915_request_queue(rq, &attr);
+	heartbeat_commit(rq, &attr);
 
 unlock:
 	mutex_unlock(&ce->timeline->mutex);
@@ -229,10 +238,7 @@ int intel_engine_pulse(struct intel_engine_cs *engine)
 	}
 
 	__set_bit(I915_FENCE_FLAG_SENTINEL, &rq->fence.flags);
-	idle_pulse(engine, rq);
-
-	__i915_request_commit(rq);
-	__i915_request_queue(rq, &attr);
+	heartbeat_commit(rq, &attr);
 	GEM_BUG_ON(rq->sched.attr.priority < I915_PRIORITY_BARRIER);
 	err = 0;
 
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h b/drivers/gpu/drm/i915/gt/intel_engine_types.h
index 8de92fd7d392..36981ba1db75 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
@@ -182,7 +182,8 @@ struct intel_engine_execlists {
 	 * Reserve the upper 16b for tracking internal errors.
 	 */
 	u32 error_interrupt;
-#define ERROR_CSB BIT(31)
+#define ERROR_CSB	BIT(31)
+#define ERROR_PREEMPT	BIT(30)
 
 	/**
 	 * @reset_ccid: Active CCID [EXECLISTS_STATUS_HI] at the time of reset
diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index 0c478187f9ba..4e770274ea8f 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -1376,8 +1376,7 @@ __execlists_schedule_in(struct i915_request *rq)
 	return engine;
 }
 
-static inline struct i915_request *
-execlists_schedule_in(struct i915_request *rq, int idx)
+static inline void execlists_schedule_in(struct i915_request *rq, int idx)
 {
 	struct intel_context * const ce = rq->context;
 	struct intel_engine_cs *old;
@@ -1394,7 +1393,6 @@ execlists_schedule_in(struct i915_request *rq, int idx)
 	} while (!try_cmpxchg(&ce->inflight, &old, ptr_inc(old)));
 
 	GEM_BUG_ON(intel_context_inflight(ce) != rq->engine);
-	return i915_request_get(rq);
 }
 
 static void kick_siblings(struct i915_request *rq, struct intel_context *ce)
@@ -2053,8 +2051,9 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
 	struct intel_engine_execlists * const execlists = &engine->execlists;
 	struct i915_request **port = execlists->pending;
 	struct i915_request ** const last_port = port + execlists->port_mask;
-	struct i915_request * const *active;
+	struct i915_request * const *active = execlists->active;
 	struct i915_request *last;
+	unsigned long flags;
 	struct rb_node *rb;
 	bool submit = false;
 
@@ -2079,6 +2078,7 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
 	 * sequence of requests as being the most optimal (fewest wake ups
 	 * and context switches) submission.
 	 */
+	spin_lock_irqsave(&engine->active.lock, flags);
 
 	for (rb = rb_first_cached(&execlists->virtual); rb; ) {
 		struct virtual_engine *ve =
@@ -2107,10 +2107,7 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
 	 * the active context to interject the preemption request,
 	 * i.e. we will retrigger preemption following the ack in case
 	 * of trouble.
-	 */
-	active = READ_ONCE(execlists->active);
-
-	/*
+	 *
 	 * In theory we can skip over completed contexts that have not
 	 * yet been processed by events (as those events are in flight):
 	 *
@@ -2195,6 +2192,8 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
 				 * Even if ELSP[1] is occupied and not worthy
 				 * of timeslices, our queue might be.
 				 */
+				spin_unlock_irqrestore(&engine->active.lock,
+						       flags);
 				start_timeslice(engine, queue_prio(execlists));
 				return;
 			}
@@ -2230,6 +2229,7 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
 
 			if (last && !can_merge_rq(last, rq)) {
 				spin_unlock(&ve->base.active.lock);
+				spin_unlock_irqrestore(&engine->active.lock, flags);
 				start_timeslice(engine, rq_prio(rq));
 				return; /* leave this for another sibling */
 			}
@@ -2362,8 +2362,7 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
 
 			if (__i915_request_submit(rq)) {
 				if (!merge) {
-					*port = execlists_schedule_in(last, port - execlists->pending);
-					port++;
+					*port++ = i915_request_get(last);
 					last = NULL;
 				}
 
@@ -2382,8 +2381,9 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
 		rb_erase_cached(&p->node, &execlists->queue);
 		i915_priolist_free(p);
 	}
-
 done:
+	*port++ = i915_request_get(last);
+
 	/*
 	 * Here be a bit of magic! Or sleight-of-hand, whichever you prefer.
 	 *
@@ -2401,25 +2401,23 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
 	 * interrupt for secondary ports).
 	 */
 	execlists->queue_priority_hint = queue_prio(execlists);
+	spin_unlock_irqrestore(&engine->active.lock, flags);
 
 	if (submit) {
-		*port = execlists_schedule_in(last, port - execlists->pending);
-		execlists->switch_priority_hint =
-			switch_prio(engine, *execlists->pending);
-
 		/*
 		 * Skip if we ended up with exactly the same set of requests,
 		 * e.g. trying to timeslice a pair of ordered contexts
 		 */
 		if (!memcmp(active, execlists->pending,
-			    (port - execlists->pending + 1) * sizeof(*port))) {
-			do
-				execlists_schedule_out(fetch_and_zero(port));
-			while (port-- != execlists->pending);
-
+			    (port - execlists->pending) * sizeof(*port)))
 			goto skip_submit;
-		}
-		clear_ports(port + 1, last_port - port);
+
+		*port = NULL;
+		while (port-- != execlists->pending)
+			execlists_schedule_in(*port, port - execlists->pending);
+
+		execlists->switch_priority_hint =
+			switch_prio(engine, *execlists->pending);
 
 		WRITE_ONCE(execlists->yield, -1);
 		set_preempt_timeout(engine, *active);
@@ -2428,6 +2426,7 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
 		start_timeslice(engine, execlists->queue_priority_hint);
 skip_submit:
 		ring_set_paused(engine, 0);
+		*execlists->pending = NULL;
 	}
 }
 
@@ -2708,16 +2707,6 @@ static void process_csb(struct intel_engine_cs *engine)
 	invalidate_csb_entries(&buf[0], &buf[num_entries - 1]);
 }
 
-static void __execlists_submission_tasklet(struct intel_engine_cs *const engine)
-{
-	lockdep_assert_held(&engine->active.lock);
-	if (!READ_ONCE(engine->execlists.pending[0])) {
-		rcu_read_lock(); /* protect peeking at execlists->active */
-		execlists_dequeue(engine);
-		rcu_read_unlock();
-	}
-}
-
 static void __execlists_hold(struct i915_request *rq)
 {
 	LIST_HEAD(list);
@@ -3104,7 +3093,7 @@ static bool preempt_timeout(const struct intel_engine_cs *const engine)
 	if (!timer_expired(t))
 		return false;
 
-	return READ_ONCE(engine->execlists.pending[0]);
+	return engine->execlists.pending[0];
 }
 
 /*
@@ -3114,10 +3103,12 @@ static bool preempt_timeout(const struct intel_engine_cs *const engine)
 static void execlists_submission_tasklet(unsigned long data)
 {
 	struct intel_engine_cs * const engine = (struct intel_engine_cs *)data;
-	bool timeout = preempt_timeout(engine);
 
 	process_csb(engine);
 
+	if (unlikely(preempt_timeout(engine)))
+		engine->execlists.error_interrupt |= ERROR_PREEMPT;
+
 	if (unlikely(READ_ONCE(engine->execlists.error_interrupt))) {
 		const char *msg;
 
@@ -3126,6 +3117,8 @@ static void execlists_submission_tasklet(unsigned long data)
 			msg = "CS error"; /* thrown by a user payload */
 		else if (engine->execlists.error_interrupt & ERROR_CSB)
 			msg = "invalid CSB event";
+		else if (engine->execlists.error_interrupt & ERROR_PREEMPT)
+			msg = "preemption time out";
 		else
 			msg = "internal error";
 
@@ -3133,17 +3126,8 @@ static void execlists_submission_tasklet(unsigned long data)
 		execlists_reset(engine, msg);
 	}
 
-	if (!READ_ONCE(engine->execlists.pending[0]) || timeout) {
-		unsigned long flags;
-
-		spin_lock_irqsave(&engine->active.lock, flags);
-		__execlists_submission_tasklet(engine);
-		spin_unlock_irqrestore(&engine->active.lock, flags);
-
-		/* Recheck after serialising with direct-submission */
-		if (unlikely(timeout && preempt_timeout(engine)))
-			execlists_reset(engine, "preemption time out");
-	}
+	if (!engine->execlists.pending[0])
+		execlists_dequeue(engine);
 }
 
 static void __execlists_kick(struct intel_engine_execlists *execlists)
@@ -3174,26 +3158,16 @@ static void queue_request(struct intel_engine_cs *engine,
 	set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
 }
 
-static void __submit_queue_imm(struct intel_engine_cs *engine)
-{
-	struct intel_engine_execlists * const execlists = &engine->execlists;
-
-	if (reset_in_progress(execlists))
-		return; /* defer until we restart the engine following reset */
-
-	__execlists_submission_tasklet(engine);
-}
-
-static void submit_queue(struct intel_engine_cs *engine,
+static bool submit_queue(struct intel_engine_cs *engine,
 			 const struct i915_request *rq)
 {
 	struct intel_engine_execlists *execlists = &engine->execlists;
 
 	if (rq_prio(rq) <= execlists->queue_priority_hint)
-		return;
+		return false;
 
 	execlists->queue_priority_hint = rq_prio(rq);
-	__submit_queue_imm(engine);
+	return true;
 }
 
 static bool ancestor_on_hold(const struct intel_engine_cs *engine,
@@ -3203,25 +3177,11 @@ static bool ancestor_on_hold(const struct intel_engine_cs *engine,
 	return !list_empty(&engine->active.hold) && hold_request(rq);
 }
 
-static void flush_csb(struct intel_engine_cs *engine)
-{
-	struct intel_engine_execlists *el = &engine->execlists;
-
-	if (READ_ONCE(el->pending[0]) && tasklet_trylock(&el->tasklet)) {
-		if (!reset_in_progress(el))
-			process_csb(engine);
-		tasklet_unlock(&el->tasklet);
-	}
-}
-
 static void execlists_submit_request(struct i915_request *request)
 {
 	struct intel_engine_cs *engine = request->engine;
 	unsigned long flags;
 
-	/* Hopefully we clear execlists->pending[] to let us through */
-	flush_csb(engine);
-
 	/* Will be called from irq-context when using foreign fences. */
 	spin_lock_irqsave(&engine->active.lock, flags);
 
@@ -3235,7 +3195,8 @@ static void execlists_submit_request(struct i915_request *request)
 		GEM_BUG_ON(RB_EMPTY_ROOT(&engine->execlists.queue.rb_root));
 		GEM_BUG_ON(list_empty(&request->sched.link));
 
-		submit_queue(engine, request);
+		if (submit_queue(engine, request))
+			__execlists_kick(&engine->execlists);
 	}
 
 	spin_unlock_irqrestore(&engine->active.lock, flags);
@@ -4123,7 +4084,6 @@ static int execlists_resume(struct intel_engine_cs *engine)
 static void execlists_reset_prepare(struct intel_engine_cs *engine)
 {
 	struct intel_engine_execlists * const execlists = &engine->execlists;
-	unsigned long flags;
 
 	ENGINE_TRACE(engine, "depth<-%d\n",
 		     atomic_read(&execlists->tasklet.count));
@@ -4140,10 +4100,6 @@ static void execlists_reset_prepare(struct intel_engine_cs *engine)
 	__tasklet_disable_sync_once(&execlists->tasklet);
 	GEM_BUG_ON(!reset_in_progress(execlists));
 
-	/* And flush any current direct submission. */
-	spin_lock_irqsave(&engine->active.lock, flags);
-	spin_unlock_irqrestore(&engine->active.lock, flags);
-
 	/*
 	 * We stop engines, otherwise we might get failed reset and a
 	 * dead gpu (on elk). Also as modern gpu as kbl can suffer
@@ -4387,12 +4343,12 @@ static void execlists_reset_finish(struct intel_engine_cs *engine)
 	 * to sleep before we restart and reload a context.
 	 */
 	GEM_BUG_ON(!reset_in_progress(execlists));
-	if (!RB_EMPTY_ROOT(&execlists->queue.rb_root))
-		execlists->tasklet.func(execlists->tasklet.data);
+	GEM_BUG_ON(engine->execlists.pending[0]);
 
+	/* And kick in case we missed a new request submission. */
 	if (__tasklet_enable(&execlists->tasklet))
-		/* And kick in case we missed a new request submission. */
-		tasklet_hi_schedule(&execlists->tasklet);
+		__execlists_kick(execlists);
+
 	ENGINE_TRACE(engine, "depth->%d\n",
 		     atomic_read(&execlists->tasklet.count));
 }
diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c
index 46a5ceffc22f..990354bc0ee1 100644
--- a/drivers/gpu/drm/i915/gt/intel_reset.c
+++ b/drivers/gpu/drm/i915/gt/intel_reset.c
@@ -38,20 +38,19 @@ static void rmw_clear_fw(struct intel_uncore *uncore, i915_reg_t reg, u32 clr)
 	intel_uncore_rmw_fw(uncore, reg, clr, 0);
 }
 
-static void engine_skip_context(struct i915_request *rq)
+static void skip_context(struct i915_request *rq)
 {
-	struct intel_engine_cs *engine = rq->engine;
 	struct intel_context *hung_ctx = rq->context;
 
-	if (!i915_request_is_active(rq))
-		return;
+	list_for_each_entry_from_rcu(rq, &hung_ctx->timeline->requests, link) {
+		if (!i915_request_is_active(rq))
+			return;
 
-	lockdep_assert_held(&engine->active.lock);
-	list_for_each_entry_continue(rq, &engine->active.requests, sched.link)
 		if (rq->context == hung_ctx) {
 			i915_request_set_error_once(rq, -EIO);
 			__i915_request_skip(rq);
 		}
+	}
 }
 
 static void client_mark_guilty(struct i915_gem_context *ctx, bool banned)
@@ -158,7 +157,7 @@ void __i915_request_reset(struct i915_request *rq, bool guilty)
 		i915_request_set_error_once(rq, -EIO);
 		__i915_request_skip(rq);
 		if (mark_guilty(rq))
-			engine_skip_context(rq);
+			skip_context(rq);
 	} else {
 		i915_request_set_error_once(rq, -EAGAIN);
 		mark_innocent(rq);
@@ -752,8 +751,10 @@ static int gt_reset(struct intel_gt *gt, intel_engine_mask_t stalled_mask)
 	if (err)
 		return err;
 
+	local_bh_disable();
 	for_each_engine(engine, gt, id)
 		__intel_engine_reset(engine, stalled_mask & engine->mask);
+	local_bh_enable();
 
 	intel_ggtt_restore_fences(gt->ggtt);
 
@@ -831,9 +832,11 @@ static void __intel_gt_set_wedged(struct intel_gt *gt)
 	set_bit(I915_WEDGED, &gt->reset.flags);
 
 	/* Mark all executing requests as skipped */
+	local_bh_disable();
 	for_each_engine(engine, gt, id)
 		if (engine->reset.cancel)
 			engine->reset.cancel(engine);
+	local_bh_enable();
 
 	reset_finish(gt, awake);
 
@@ -1108,20 +1111,7 @@ static inline int intel_gt_reset_engine(struct intel_engine_cs *engine)
 	return __intel_gt_reset(engine->gt, engine->mask);
 }
 
-/**
- * intel_engine_reset - reset GPU engine to recover from a hang
- * @engine: engine to reset
- * @msg: reason for GPU reset; or NULL for no drm_notice()
- *
- * Reset a specific GPU engine. Useful if a hang is detected.
- * Returns zero on successful reset or otherwise an error code.
- *
- * Procedure is:
- *  - identifies the request that caused the hang and it is dropped
- *  - reset engine (which will force the engine to idle)
- *  - re-init/configure engine
- */
-int intel_engine_reset(struct intel_engine_cs *engine, const char *msg)
+int __intel_engine_reset_bh(struct intel_engine_cs *engine, const char *msg)
 {
 	struct intel_gt *gt = engine->gt;
 	bool uses_guc = intel_engine_in_guc_submission_mode(engine);
@@ -1172,6 +1162,30 @@ int intel_engine_reset(struct intel_engine_cs *engine, const char *msg)
 	return ret;
 }
 
+/**
+ * intel_engine_reset - reset GPU engine to recover from a hang
+ * @engine: engine to reset
+ * @msg: reason for GPU reset; or NULL for no drm_notice()
+ *
+ * Reset a specific GPU engine. Useful if a hang is detected.
+ * Returns zero on successful reset or otherwise an error code.
+ *
+ * Procedure is:
+ *  - identifies the request that caused the hang and it is dropped
+ *  - reset engine (which will force the engine to idle)
+ *  - re-init/configure engine
+ */
+int intel_engine_reset(struct intel_engine_cs *engine, const char *msg)
+{
+	int err;
+
+	local_bh_disable();
+	err = __intel_engine_reset_bh(engine, msg);
+	local_bh_enable();
+
+	return err;
+}
+
 static void intel_gt_reset_global(struct intel_gt *gt,
 				  u32 engine_mask,
 				  const char *reason)
@@ -1258,18 +1272,20 @@ void intel_gt_handle_error(struct intel_gt *gt,
 	 * single reset fails.
 	 */
 	if (intel_has_reset_engine(gt) && !intel_gt_is_wedged(gt)) {
+		local_bh_disable();
 		for_each_engine_masked(engine, gt, engine_mask, tmp) {
 			BUILD_BUG_ON(I915_RESET_MODESET >= I915_RESET_ENGINE);
 			if (test_and_set_bit(I915_RESET_ENGINE + engine->id,
 					     &gt->reset.flags))
 				continue;
 
-			if (intel_engine_reset(engine, msg) == 0)
+			if (__intel_engine_reset_bh(engine, msg) == 0)
 				engine_mask &= ~engine->mask;
 
 			clear_and_wake_up_bit(I915_RESET_ENGINE + engine->id,
 					      &gt->reset.flags);
 		}
+		local_bh_enable();
 	}
 
 	if (!engine_mask)
diff --git a/drivers/gpu/drm/i915/gt/intel_reset.h b/drivers/gpu/drm/i915/gt/intel_reset.h
index a0eec7c11c0c..7dbf5cc8a333 100644
--- a/drivers/gpu/drm/i915/gt/intel_reset.h
+++ b/drivers/gpu/drm/i915/gt/intel_reset.h
@@ -34,6 +34,8 @@ void intel_gt_reset(struct intel_gt *gt,
 		    const char *reason);
 int intel_engine_reset(struct intel_engine_cs *engine,
 		       const char *reason);
+int __intel_engine_reset_bh(struct intel_engine_cs *engine,
+			    const char *reason);
 
 void __i915_request_reset(struct i915_request *rq, bool guilty);
 
diff --git a/drivers/gpu/drm/i915/gt/selftest_hangcheck.c b/drivers/gpu/drm/i915/gt/selftest_hangcheck.c
index fb5ebf930ab2..c28d1fcad673 100644
--- a/drivers/gpu/drm/i915/gt/selftest_hangcheck.c
+++ b/drivers/gpu/drm/i915/gt/selftest_hangcheck.c
@@ -1576,12 +1576,17 @@ static int __igt_atomic_reset_engine(struct intel_engine_cs *engine,
 		  engine->name, mode, p->name);
 
 	tasklet_disable(t);
+	if (strcmp(p->name, "softirq"))
+		local_bh_disable();
 	p->critical_section_begin();
 
-	err = intel_engine_reset(engine, NULL);
+	err = __intel_engine_reset_bh(engine, NULL);
 
 	p->critical_section_end();
+	if (strcmp(p->name, "softirq"))
+		local_bh_enable();
 	tasklet_enable(t);
+	tasklet_hi_schedule(t);
 
 	if (err)
 		pr_err("i915_reset_engine(%s:%s) failed under %s\n",
diff --git a/drivers/gpu/drm/i915/gt/selftest_lrc.c b/drivers/gpu/drm/i915/gt/selftest_lrc.c
index 3fc5de961280..93b576ee4203 100644
--- a/drivers/gpu/drm/i915/gt/selftest_lrc.c
+++ b/drivers/gpu/drm/i915/gt/selftest_lrc.c
@@ -623,8 +623,10 @@ static int live_hold_reset(void *arg)
 
 		/* We have our request executing, now remove it and reset */
 
+		local_bh_disable();
 		if (test_and_set_bit(I915_RESET_ENGINE + id,
 				     &gt->reset.flags)) {
+			local_bh_enable();
 			intel_gt_set_wedged(gt);
 			err = -EBUSY;
 			goto out;
@@ -638,12 +640,13 @@ static int live_hold_reset(void *arg)
 		execlists_hold(engine, rq);
 		GEM_BUG_ON(!i915_request_on_hold(rq));
 
-		intel_engine_reset(engine, NULL);
+		__intel_engine_reset_bh(engine, NULL);
 		GEM_BUG_ON(rq->fence.error != -EIO);
 
 		tasklet_enable(&engine->execlists.tasklet);
 		clear_and_wake_up_bit(I915_RESET_ENGINE + id,
 				      &gt->reset.flags);
+		local_bh_enable();
 
 		/* Check that we do not resubmit the held request */
 		if (!i915_request_wait(rq, 0, HZ / 5)) {
@@ -4570,8 +4573,10 @@ static int reset_virtual_engine(struct intel_gt *gt,
 	GEM_BUG_ON(engine == ve->engine);
 
 	/* Take ownership of the reset and tasklet */
+	local_bh_disable();
 	if (test_and_set_bit(I915_RESET_ENGINE + engine->id,
 			     &gt->reset.flags)) {
+		local_bh_enable();
 		intel_gt_set_wedged(gt);
 		err = -EBUSY;
 		goto out_heartbeat;
@@ -4591,12 +4596,13 @@ static int reset_virtual_engine(struct intel_gt *gt,
 	execlists_hold(engine, rq);
 	GEM_BUG_ON(!i915_request_on_hold(rq));
 
-	intel_engine_reset(engine, NULL);
+	__intel_engine_reset_bh(engine, NULL);
 	GEM_BUG_ON(rq->fence.error != -EIO);
 
 	/* Release our grasp on the engine, letting CS flow again */
 	tasklet_enable(&engine->execlists.tasklet);
 	clear_and_wake_up_bit(I915_RESET_ENGINE + engine->id, &gt->reset.flags);
+	local_bh_enable();
 
 	/* Check that we do not resubmit the held request */
 	i915_request_get(rq);
@@ -6234,16 +6240,17 @@ static void garbage_reset(struct intel_engine_cs *engine,
 	const unsigned int bit = I915_RESET_ENGINE + engine->id;
 	unsigned long *lock = &engine->gt->reset.flags;
 
-	if (test_and_set_bit(bit, lock))
-		return;
-
-	tasklet_disable(&engine->execlists.tasklet);
+	local_bh_disable();
+	if (!test_and_set_bit(bit, lock)) {
+		tasklet_disable(&engine->execlists.tasklet);
 
-	if (!rq->fence.error)
-		intel_engine_reset(engine, NULL);
+		if (!rq->fence.error)
+			__intel_engine_reset_bh(engine, NULL);
 
-	tasklet_enable(&engine->execlists.tasklet);
-	clear_and_wake_up_bit(bit, lock);
+		tasklet_enable(&engine->execlists.tasklet);
+		clear_and_wake_up_bit(bit, lock);
+	}
+	local_bh_enable();
 }
 
 static struct i915_request *garbage(struct intel_context *ce,
diff --git a/drivers/gpu/drm/i915/gt/selftest_reset.c b/drivers/gpu/drm/i915/gt/selftest_reset.c
index 35406ecdf0b2..19dd0c347874 100644
--- a/drivers/gpu/drm/i915/gt/selftest_reset.c
+++ b/drivers/gpu/drm/i915/gt/selftest_reset.c
@@ -132,11 +132,16 @@ static int igt_atomic_engine_reset(void *arg)
 		for (p = igt_atomic_phases; p->name; p++) {
 			GEM_TRACE("intel_engine_reset(%s) under %s\n",
 				  engine->name, p->name);
+			if (strcmp(p->name, "softirq"))
+				local_bh_disable();
 
 			p->critical_section_begin();
-			err = intel_engine_reset(engine, NULL);
+			err = __intel_engine_reset_bh(engine, NULL);
 			p->critical_section_end();
 
+			if (strcmp(p->name, "softirq"))
+				local_bh_enable();
+
 			if (err) {
 				pr_err("intel_engine_reset(%s) failed under %s\n",
 				       engine->name, p->name);
@@ -146,6 +151,7 @@ static int igt_atomic_engine_reset(void *arg)
 
 		intel_engine_pm_put(engine);
 		tasklet_enable(&engine->execlists.tasklet);
+		tasklet_hi_schedule(&engine->execlists.tasklet);
 		if (err)
 			break;
 	}
diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index 31c60e6c5c7a..025666a6c67f 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -1576,7 +1576,9 @@ void i915_request_add(struct i915_request *rq)
 		attr = ctx->sched;
 	rcu_read_unlock();
 
+	local_bh_disable();
 	__i915_request_queue(rq, &attr);
+	local_bh_enable();
 
 	mutex_unlock(&tl->mutex);
 }
diff --git a/drivers/gpu/drm/i915/selftests/i915_request.c b/drivers/gpu/drm/i915/selftests/i915_request.c
index 57dd6f5122ee..236f9fda8f31 100644
--- a/drivers/gpu/drm/i915/selftests/i915_request.c
+++ b/drivers/gpu/drm/i915/selftests/i915_request.c
@@ -1926,9 +1926,7 @@ static int measure_inter_request(struct intel_context *ce)
 		intel_ring_advance(rq, cs);
 		i915_request_add(rq);
 	}
-	local_bh_disable();
 	i915_sw_fence_commit(submit);
-	local_bh_enable();
 	intel_engine_flush_submission(ce->engine);
 	heap_fence_put(submit);
 
@@ -2214,11 +2212,9 @@ static int measure_completion(struct intel_context *ce)
 		intel_ring_advance(rq, cs);
 
 		dma_fence_add_callback(&rq->fence, &cb.base, signal_cb);
-
-		local_bh_disable();
 		i915_request_add(rq);
-		local_bh_enable();
 
+		intel_engine_flush_submission(ce->engine);
 		if (wait_for(READ_ONCE(sema[i]) == -1, 50)) {
 			err = -EIO;
 			goto err;
-- 
2.20.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [Intel-gfx] [PATCH 37/66] drm/i915/gt: Free stale request on destroying the virtual engine
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (34 preceding siblings ...)
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 36/66] drm/i915/gt: Replace direct submit with direct call to tasklet Chris Wilson
@ 2020-07-15 11:51 ` Chris Wilson
  2020-07-15 11:51   ` [Intel-gfx] " Chris Wilson
                   ` (36 subsequent siblings)
  72 siblings, 0 replies; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:51 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

Since preempt-to-busy, we may unsubmit a request while it is still on
the HW, and it then completes asynchronously. That means it may be
retired and in the process destroy the virtual engine (as the user has
closed their context), but that engine may still be holding onto the
unsubmitted, completed request. Therefore we need to potentially clean
up the old request when destroying the virtual engine. We also have to
keep the virtual_engine alive until after the siblings'
execlists_dequeue() has finished peeking into the virtual engines, for
which we serialise with RCU.
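
As a rough illustration of the lifetime rule being relied upon here, a
standalone sketch of the kfree_rcu() pattern (the demo_* names are made
up and this is not the driver code): readers peek at the object under
rcu_read_lock(), so the final free must be deferred until those readers
have drained.

#include <linux/rcupdate.h>
#include <linux/slab.h>

struct demo_obj {
	struct rcu_head rcu;
	int payload;
};

static struct demo_obj __rcu *demo_slot;

/* read side, cf. a sibling's execlists_dequeue() peeking at the ve */
static int demo_peek(void)
{
	struct demo_obj *obj;
	int val = -1;

	rcu_read_lock();
	obj = rcu_dereference(demo_slot);
	if (obj)
		val = obj->payload;
	rcu_read_unlock();

	return val;
}

/* destroy side: unpublish first, then free only after readers finish */
static void demo_destroy(struct demo_obj *obj)
{
	RCU_INIT_POINTER(demo_slot, NULL);
	kfree_rcu(obj, rcu);
}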

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_lrc.c | 22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index 4e770274ea8f..fabb20a6800b 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -179,6 +179,7 @@
 #define EXECLISTS_REQUEST_SIZE 64 /* bytes */
 
 struct virtual_engine {
+	struct rcu_head rcu;
 	struct intel_engine_cs base;
 	struct intel_context context;
 
@@ -5319,10 +5320,25 @@ static void virtual_context_destroy(struct kref *kref)
 		container_of(kref, typeof(*ve), context.ref);
 	unsigned int n;
 
-	GEM_BUG_ON(!list_empty(virtual_queue(ve)));
-	GEM_BUG_ON(ve->request);
 	GEM_BUG_ON(ve->context.inflight);
 
+	if (unlikely(ve->request)) {
+		struct i915_request *old;
+		unsigned long flags;
+
+		spin_lock_irqsave(&ve->base.active.lock, flags);
+
+		old = fetch_and_zero(&ve->request);
+		if (old) {
+			GEM_BUG_ON(!i915_request_completed(old));
+			__i915_request_submit(old);
+			i915_request_put(old);
+		}
+
+		spin_unlock_irqrestore(&ve->base.active.lock, flags);
+	}
+	GEM_BUG_ON(!list_empty(virtual_queue(ve)));
+
 	for (n = 0; n < ve->num_siblings; n++) {
 		struct intel_engine_cs *sibling = ve->siblings[n];
 		struct rb_node *node = &ve->nodes[sibling->id].rb;
@@ -5348,7 +5364,7 @@ static void virtual_context_destroy(struct kref *kref)
 	intel_engine_free_request_pool(&ve->base);
 
 	kfree(ve->bonds);
-	kfree(ve);
+	kfree_rcu(ve, rcu);
 }
 
 static void virtual_engine_initial_hint(struct virtual_engine *ve)
-- 
2.20.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [PATCH 38/66] drm/i915/gt: Use virtual_engine during execlists_dequeue
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
@ 2020-07-15 11:51   ` Chris Wilson
  2020-07-15 11:50 ` [Intel-gfx] [PATCH 03/66] drm/i915: Remove requirement for holding i915_request.lock for breadcrumbs Chris Wilson
                     ` (71 subsequent siblings)
  72 siblings, 0 replies; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:51 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson, Tvrtko Ursulin, stable

Rather than going back and forth between the rb_node entry and the
virtual_engine type, store the ve in a local variable and reuse it. As
the container_of conversion from rb_node to virtual_engine requires a
variable offset, performing that conversion just once shaves off a bit
of code.

v2: Keep a single virtual engine lookup, for typical use.
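
For reference, the variable offset comes from the rb_node living inside
a per-engine array element; a minimal sketch of that conversion, using
hypothetical demo_* types rather than the driver structs:

#include <linux/kernel.h>
#include <linux/rbtree.h>

struct demo_node {
	struct rb_node rb;
	int prio;
};

struct demo_ve {
	struct demo_node nodes[8];	/* one slot per sibling engine */
};

/*
 * offsetof(struct demo_ve, nodes[id].rb) depends on the runtime value
 * of id, so it cannot be folded into a constant; caching the container
 * pointer means the computation is done once instead of per use.
 */
static struct demo_ve *demo_from_rb(struct rb_node *rb, unsigned int id)
{
	return container_of(rb, struct demo_ve, nodes[id].rb);
}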

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: <stable@vger.kernel.org> # v5.4+
---
 drivers/gpu/drm/i915/gt/intel_lrc.c | 254 ++++++++++++----------------
 1 file changed, 111 insertions(+), 143 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index fabb20a6800b..ec533dfe3be9 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -453,9 +453,15 @@ static int queue_prio(const struct intel_engine_execlists *execlists)
 	return ((p->priority + 1) << I915_USER_PRIORITY_SHIFT) - ffs(p->used);
 }
 
+static int virtual_prio(const struct intel_engine_execlists *el)
+{
+	struct rb_node *rb = rb_first_cached(&el->virtual);
+
+	return rb ? rb_entry(rb, struct ve_node, rb)->prio : INT_MIN;
+}
+
 static inline bool need_preempt(const struct intel_engine_cs *engine,
-				const struct i915_request *rq,
-				struct rb_node *rb)
+				const struct i915_request *rq)
 {
 	int last_prio;
 
@@ -492,25 +498,6 @@ static inline bool need_preempt(const struct intel_engine_cs *engine,
 	    rq_prio(list_next_entry(rq, sched.link)) > last_prio)
 		return true;
 
-	if (rb) {
-		struct virtual_engine *ve =
-			rb_entry(rb, typeof(*ve), nodes[engine->id].rb);
-		bool preempt = false;
-
-		if (engine == ve->siblings[0]) { /* only preempt one sibling */
-			struct i915_request *next;
-
-			rcu_read_lock();
-			next = READ_ONCE(ve->request);
-			if (next)
-				preempt = rq_prio(next) > last_prio;
-			rcu_read_unlock();
-		}
-
-		if (preempt)
-			return preempt;
-	}
-
 	/*
 	 * If the inflight context did not trigger the preemption, then maybe
 	 * it was the set of queued requests? Pick the highest priority in
@@ -521,7 +508,8 @@ static inline bool need_preempt(const struct intel_engine_cs *engine,
 	 * ELSP[0] or ELSP[1] as, thanks again to PI, if it was the same
 	 * context, it's priority would not exceed ELSP[0] aka last_prio.
 	 */
-	return queue_prio(&engine->execlists) > last_prio;
+	return max(virtual_prio(&engine->execlists),
+		   queue_prio(&engine->execlists)) > last_prio;
 }
 
 __maybe_unused static inline bool
@@ -1806,6 +1794,35 @@ static bool virtual_matches(const struct virtual_engine *ve,
 	return true;
 }
 
+static struct virtual_engine *
+first_virtual_engine(struct intel_engine_cs *engine)
+{
+	struct intel_engine_execlists *el = &engine->execlists;
+	struct rb_node *rb = rb_first_cached(&el->virtual);
+
+	while (rb) {
+		struct virtual_engine *ve =
+			rb_entry(rb, typeof(*ve), nodes[engine->id].rb);
+		struct i915_request *rq = READ_ONCE(ve->request);
+
+		/* lazily cleanup after another engine handled rq */
+		if (!rq) {
+			rb_erase_cached(rb, &el->virtual);
+			RB_CLEAR_NODE(rb);
+			rb = rb_first_cached(&el->virtual);
+			continue;
+		}
+
+		if (!virtual_matches(ve, rq, engine)) {
+			rb = rb_next(rb);
+			continue;
+		}
+		return ve;
+	}
+
+	return NULL;
+}
+
 static void virtual_xfer_breadcrumbs(struct virtual_engine *ve)
 {
 	/*
@@ -1889,32 +1906,15 @@ static void defer_active(struct intel_engine_cs *engine)
 
 static bool
 need_timeslice(const struct intel_engine_cs *engine,
-	       const struct i915_request *rq,
-	       const struct rb_node *rb)
+	       const struct i915_request *rq)
 {
 	int hint;
 
 	if (!intel_engine_has_timeslices(engine))
 		return false;
 
-	hint = engine->execlists.queue_priority_hint;
-
-	if (rb) {
-		const struct virtual_engine *ve =
-			rb_entry(rb, typeof(*ve), nodes[engine->id].rb);
-		const struct intel_engine_cs *inflight =
-			intel_context_inflight(&ve->context);
-
-		if (!inflight || inflight == engine) {
-			struct i915_request *next;
-
-			rcu_read_lock();
-			next = READ_ONCE(ve->request);
-			if (next)
-				hint = max(hint, rq_prio(next));
-			rcu_read_unlock();
-		}
-	}
+	hint = max(engine->execlists.queue_priority_hint,
+		   virtual_prio(&engine->execlists));
 
 	if (!list_is_last(&rq->sched.link, &engine->active.requests))
 		hint = max(hint, rq_prio(list_next_entry(rq, sched.link)));
@@ -2053,6 +2053,7 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
 	struct i915_request **port = execlists->pending;
 	struct i915_request ** const last_port = port + execlists->port_mask;
 	struct i915_request * const *active = execlists->active;
+	struct virtual_engine *ve;
 	struct i915_request *last;
 	unsigned long flags;
 	struct rb_node *rb;
@@ -2081,26 +2082,6 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
 	 */
 	spin_lock_irqsave(&engine->active.lock, flags);
 
-	for (rb = rb_first_cached(&execlists->virtual); rb; ) {
-		struct virtual_engine *ve =
-			rb_entry(rb, typeof(*ve), nodes[engine->id].rb);
-		struct i915_request *rq = READ_ONCE(ve->request);
-
-		if (!rq) { /* lazily cleanup after another engine handled rq */
-			rb_erase_cached(rb, &execlists->virtual);
-			RB_CLEAR_NODE(rb);
-			rb = rb_first_cached(&execlists->virtual);
-			continue;
-		}
-
-		if (!virtual_matches(ve, rq, engine)) {
-			rb = rb_next(rb);
-			continue;
-		}
-
-		break;
-	}
-
 	/*
 	 * If the queue is higher priority than the last
 	 * request in the currently active context, submit afresh.
@@ -2123,7 +2104,7 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
 	if ((last = *active)) {
 		if (i915_request_completed(last)) {
 			goto check_secondary;
-		} else if (need_preempt(engine, last, rb)) {
+		} else if (need_preempt(engine, last)) {
 			ENGINE_TRACE(engine,
 				     "preempting last=%llx:%lld, prio=%d, hint=%d\n",
 				     last->fence.context,
@@ -2149,7 +2130,7 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
 			__unwind_incomplete_requests(engine);
 
 			last = NULL;
-		} else if (need_timeslice(engine, last, rb) &&
+		} else if (need_timeslice(engine, last) &&
 			   timeslice_expired(execlists, last)) {
 			ENGINE_TRACE(engine,
 				     "expired last=%llx:%lld, prio=%d, hint=%d, yield?=%s\n",
@@ -2201,111 +2182,98 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
 		}
 	}
 
-	while (rb) { /* XXX virtual is always taking precedence */
-		struct virtual_engine *ve =
-			rb_entry(rb, typeof(*ve), nodes[engine->id].rb);
+	/* XXX virtual is always taking precedence */
+	while ((ve = first_virtual_engine(engine))) {
 		struct i915_request *rq;
 
 		spin_lock(&ve->base.active.lock);
 
 		rq = ve->request;
-		if (unlikely(!rq)) { /* lost the race to a sibling */
-			spin_unlock(&ve->base.active.lock);
-			rb_erase_cached(rb, &execlists->virtual);
-			RB_CLEAR_NODE(rb);
-			rb = rb_first_cached(&execlists->virtual);
-			continue;
-		}
+		if (unlikely(!rq)) /* lost the race to a sibling */
+			goto unlock;
 
-		GEM_BUG_ON(rq != ve->request);
 		GEM_BUG_ON(rq->engine != &ve->base);
 		GEM_BUG_ON(rq->context != &ve->context);
 
-		if (rq_prio(rq) >= queue_prio(execlists)) {
-			if (!virtual_matches(ve, rq, engine)) {
-				spin_unlock(&ve->base.active.lock);
-				rb = rb_next(rb);
-				continue;
-			}
+		if (unlikely(rq_prio(rq) < queue_prio(execlists))) {
+			spin_unlock(&ve->base.active.lock);
+			break;
+		}
 
-			if (last && !can_merge_rq(last, rq)) {
-				spin_unlock(&ve->base.active.lock);
-				spin_unlock_irqrestore(&engine->active.lock, flags);
-				start_timeslice(engine, rq_prio(rq));
-				return; /* leave this for another sibling */
-			}
+		GEM_BUG_ON(!virtual_matches(ve, rq, engine));
 
-			ENGINE_TRACE(engine,
-				     "virtual rq=%llx:%lld%s, new engine? %s\n",
-				     rq->fence.context,
-				     rq->fence.seqno,
-				     i915_request_completed(rq) ? "!" :
-				     i915_request_started(rq) ? "*" :
-				     "",
-				     yesno(engine != ve->siblings[0]));
-
-			WRITE_ONCE(ve->request, NULL);
-			WRITE_ONCE(ve->base.execlists.queue_priority_hint,
-				   INT_MIN);
-			rb_erase_cached(rb, &execlists->virtual);
-			RB_CLEAR_NODE(rb);
+		if (last && !can_merge_rq(last, rq)) {
+			spin_unlock(&ve->base.active.lock);
+			spin_unlock_irqrestore(&engine->active.lock, flags);
+			start_timeslice(engine, rq_prio(rq));
+			return; /* leave this for another sibling */
+		}
 
-			GEM_BUG_ON(!(rq->execution_mask & engine->mask));
-			WRITE_ONCE(rq->engine, engine);
+		ENGINE_TRACE(engine,
+			     "virtual rq=%llx:%lld%s, new engine? %s\n",
+			     rq->fence.context,
+			     rq->fence.seqno,
+			     i915_request_completed(rq) ? "!" :
+			     i915_request_started(rq) ? "*" :
+			     "",
+			     yesno(engine != ve->siblings[0]));
 
-			if (engine != ve->siblings[0]) {
-				u32 *regs = ve->context.lrc_reg_state;
-				unsigned int n;
+		WRITE_ONCE(ve->request, NULL);
+		WRITE_ONCE(ve->base.execlists.queue_priority_hint, INT_MIN);
 
-				GEM_BUG_ON(READ_ONCE(ve->context.inflight));
+		rb = &ve->nodes[engine->id].rb;
+		rb_erase_cached(rb, &execlists->virtual);
+		RB_CLEAR_NODE(rb);
 
-				if (!intel_engine_has_relative_mmio(engine))
-					virtual_update_register_offsets(regs,
-									engine);
+		GEM_BUG_ON(!(rq->execution_mask & engine->mask));
+		WRITE_ONCE(rq->engine, engine);
 
-				if (!list_empty(&ve->context.signals))
-					virtual_xfer_breadcrumbs(ve);
+		if (engine != ve->siblings[0]) {
+			u32 *regs = ve->context.lrc_reg_state;
+			unsigned int n;
 
-				/*
-				 * Move the bound engine to the top of the list
-				 * for future execution. We then kick this
-				 * tasklet first before checking others, so that
-				 * we preferentially reuse this set of bound
-				 * registers.
-				 */
-				for (n = 1; n < ve->num_siblings; n++) {
-					if (ve->siblings[n] == engine) {
-						swap(ve->siblings[n],
-						     ve->siblings[0]);
-						break;
-					}
-				}
+			GEM_BUG_ON(READ_ONCE(ve->context.inflight));
 
-				GEM_BUG_ON(ve->siblings[0] != engine);
-			}
+			if (!intel_engine_has_relative_mmio(engine))
+				virtual_update_register_offsets(regs, engine);
 
-			if (__i915_request_submit(rq)) {
-				submit = true;
-				last = rq;
-			}
-			i915_request_put(rq);
+			if (!list_empty(&ve->context.signals))
+				virtual_xfer_breadcrumbs(ve);
 
 			/*
-			 * Hmm, we have a bunch of virtual engine requests,
-			 * but the first one was already completed (thanks
-			 * preempt-to-busy!). Keep looking at the veng queue
-			 * until we have no more relevant requests (i.e.
-			 * the normal submit queue has higher priority).
+			 * Move the bound engine to the top of the list for
+			 * future execution. We then kick this tasklet first
+			 * before checking others, so that we preferentially
+			 * reuse this set of bound registers.
 			 */
-			if (!submit) {
-				spin_unlock(&ve->base.active.lock);
-				rb = rb_first_cached(&execlists->virtual);
-				continue;
+			for (n = 1; n < ve->num_siblings; n++) {
+				if (ve->siblings[n] == engine) {
+					swap(ve->siblings[n], ve->siblings[0]);
+					break;
+				}
 			}
+
+			GEM_BUG_ON(ve->siblings[0] != engine);
+		}
+
+		if (__i915_request_submit(rq)) {
+			submit = true;
+			last = rq;
 		}
 
+		i915_request_put(rq);
+unlock:
 		spin_unlock(&ve->base.active.lock);
-		break;
+
+		/*
+		 * Hmm, we have a bunch of virtual engine requests,
+		 * but the first one was already completed (thanks
+		 * preempt-to-busy!). Keep looking at the veng queue
+		 * until we have no more relevant requests (i.e.
+		 * the normal submit queue has higher priority).
+		 */
+		if (submit)
+			break;
 	}
 
 	while ((rb = rb_first_cached(&execlists->queue))) {
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [Intel-gfx] [PATCH 39/66] drm/i915/gt: Decouple inflight virtual engines
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (36 preceding siblings ...)
  2020-07-15 11:51   ` [Intel-gfx] " Chris Wilson
@ 2020-07-15 11:51 ` Chris Wilson
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 40/66] drm/i915/gt: Defer schedule_out until after the next dequeue Chris Wilson
                   ` (34 subsequent siblings)
  72 siblings, 0 replies; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:51 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

Once a virtual engine has been bound to a sibling, it will remain bound
until we finally schedule out the last active request. We cannot rebind
the context to a new sibling while it is inflight as the context save
will conflict, hence we wait. As we cannot then use any other sibling
while the context is inflight, only kick the bound sibling while it is
inflight, and upon scheduling out kick the rest (so that we can swap
engines on timeslicing if the previously bound engine becomes
oversubscribed).
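
A condensed sketch of that policy with hypothetical demo_* types (not
the tasklet itself): the walk over the siblings stops as soon as the
context is seen to be inflight, since no other engine can pick the
request up until it has been scheduled out.

#include <linux/types.h>

struct demo_engine {
	bool kicked;
};

struct demo_ve {
	struct demo_engine *siblings[4];
	unsigned int num_siblings;
	bool inflight;		/* context bound to siblings[0] while set */
};

static void demo_kick_siblings(struct demo_ve *ve)
{
	unsigned int n;

	for (n = 0; n < ve->num_siblings; n++) {
		ve->siblings[n]->kicked = true;
		if (ve->inflight)	/* only the bound engine can progress */
			break;
	}
}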

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/gt/intel_lrc.c | 28 ++++++++++++----------------
 1 file changed, 12 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index ec533dfe3be9..2f35aceea778 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -1387,9 +1387,8 @@ static inline void execlists_schedule_in(struct i915_request *rq, int idx)
 static void kick_siblings(struct i915_request *rq, struct intel_context *ce)
 {
 	struct virtual_engine *ve = container_of(ce, typeof(*ve), context);
-	struct i915_request *next = READ_ONCE(ve->request);
 
-	if (next == rq || (next && next->execution_mask & ~rq->execution_mask))
+	if (READ_ONCE(ve->request))
 		tasklet_hi_schedule(&ve->base.execlists.tasklet);
 }
 
@@ -1806,17 +1805,13 @@ first_virtual_engine(struct intel_engine_cs *engine)
 		struct i915_request *rq = READ_ONCE(ve->request);
 
 		/* lazily cleanup after another engine handled rq */
-		if (!rq) {
+		if (!rq || !virtual_matches(ve, rq, engine)) {
 			rb_erase_cached(rb, &el->virtual);
 			RB_CLEAR_NODE(rb);
 			rb = rb_first_cached(&el->virtual);
 			continue;
 		}
 
-		if (!virtual_matches(ve, rq, engine)) {
-			rb = rb_next(rb);
-			continue;
-		}
 		return ve;
 	}
 
@@ -5443,7 +5438,6 @@ static void virtual_submission_tasklet(unsigned long data)
 	if (unlikely(!mask))
 		return;
 
-	local_irq_disable();
 	for (n = 0; n < ve->num_siblings; n++) {
 		struct intel_engine_cs *sibling = READ_ONCE(ve->siblings[n]);
 		struct ve_node * const node = &ve->nodes[sibling->id];
@@ -5453,20 +5447,19 @@ static void virtual_submission_tasklet(unsigned long data)
 		if (!READ_ONCE(ve->request))
 			break; /* already handled by a sibling's tasklet */
 
+		spin_lock_irq(&sibling->active.lock);
+
 		if (unlikely(!(mask & sibling->mask))) {
 			if (!RB_EMPTY_NODE(&node->rb)) {
-				spin_lock(&sibling->active.lock);
 				rb_erase_cached(&node->rb,
 						&sibling->execlists.virtual);
 				RB_CLEAR_NODE(&node->rb);
-				spin_unlock(&sibling->active.lock);
 			}
-			continue;
-		}
 
-		spin_lock(&sibling->active.lock);
+			goto unlock_engine;
+		}
 
-		if (!RB_EMPTY_NODE(&node->rb)) {
+		if (unlikely(!RB_EMPTY_NODE(&node->rb))) {
 			/*
 			 * Cheat and avoid rebalancing the tree if we can
 			 * reuse this node in situ.
@@ -5506,9 +5499,12 @@ static void virtual_submission_tasklet(unsigned long data)
 		if (first && prio > sibling->execlists.queue_priority_hint)
 			tasklet_hi_schedule(&sibling->execlists.tasklet);
 
-		spin_unlock(&sibling->active.lock);
+unlock_engine:
+		spin_unlock_irq(&sibling->active.lock);
+
+		if (intel_context_inflight(&ve->context))
+			break;
 	}
-	local_irq_enable();
 }
 
 static void virtual_submit_request(struct i915_request *rq)
-- 
2.20.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [Intel-gfx] [PATCH 40/66] drm/i915/gt: Defer schedule_out until after the next dequeue
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (37 preceding siblings ...)
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 39/66] drm/i915/gt: Decouple inflight virtual engines Chris Wilson
@ 2020-07-15 11:51 ` Chris Wilson
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 41/66] drm/i915/gt: Resubmit the virtual engine on schedule-out Chris Wilson
                   ` (33 subsequent siblings)
  72 siblings, 0 replies; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:51 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

Inside schedule_out, we do extra work upon idling the context, such as
updating the runtime, kicking off retires and kicking virtual engines.
However, if we are processing a series of single requests per context,
we may find ourselves scheduling out the context, only to
immediately schedule it back in during dequeue. This is just extra work
that we can avoid if we keep the context marked as inflight across the
dequeue. This becomes more significant later on for minimising virtual
engine misses.
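
The shape of the change is a collect-then-flush split, sketched below
with made-up demo_* names: event processing only gathers the
no-longer-active requests into a caller-provided array, and the actual
schedule-out work runs after the next dequeue has had a chance to reuse
the context.

#include <linux/types.h>

struct demo_req {
	bool active;
};

/* gather phase: record what went idle, but do no work yet */
static struct demo_req **
demo_process_events(struct demo_req **events, struct demo_req **inactive)
{
	while (*events)		/* events[] is NULL-terminated, like the ports */
		*inactive++ = *events++;

	return inactive;
}

/* flush phase: run the deferred idle work once dequeue is done */
static void demo_post_process(struct demo_req **port, struct demo_req **last)
{
	while (port != last) {
		struct demo_req *rq = *port++;

		rq->active = false;	/* stand-in for execlists_schedule_out() */
	}
}

A caller stacks a small array, e.g. struct demo_req *post[8], fills it
via demo_process_events(), attempts the dequeue, and only then runs
demo_post_process() over the filled portion.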

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/gt/intel_context_types.h |   4 +-
 drivers/gpu/drm/i915/gt/intel_lrc.c           | 111 ++++++++++++------
 2 files changed, 78 insertions(+), 37 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
index 4954b0df4864..b63db45bab7b 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
@@ -45,8 +45,8 @@ struct intel_context {
 
 	struct intel_engine_cs *engine;
 	struct intel_engine_cs *inflight;
-#define intel_context_inflight(ce) ptr_mask_bits(READ_ONCE((ce)->inflight), 2)
-#define intel_context_inflight_count(ce) ptr_unmask_bits(READ_ONCE((ce)->inflight), 2)
+#define intel_context_inflight(ce) ptr_mask_bits(READ_ONCE((ce)->inflight), 3)
+#define intel_context_inflight_count(ce) ptr_unmask_bits(READ_ONCE((ce)->inflight), 3)
 
 	struct i915_address_space *vm;
 	struct i915_gem_context __rcu *gem_context;
diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index 2f35aceea778..aa3233702613 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -1362,6 +1362,8 @@ __execlists_schedule_in(struct i915_request *rq)
 	execlists_context_status_change(rq, INTEL_CONTEXT_SCHEDULE_IN);
 	intel_engine_context_in(engine);
 
+	CE_TRACE(ce, "schedule-in, ccid:%x\n", ce->lrc.ccid);
+
 	return engine;
 }
 
@@ -1405,6 +1407,8 @@ __execlists_schedule_out(struct i915_request *rq,
 	 * refrain from doing non-trivial work here.
 	 */
 
+	CE_TRACE(ce, "schedule-out, ccid:%x\n", ccid);
+
 	/*
 	 * If we have just completed this context, the engine may now be
 	 * idle and we want to re-enter powersaving.
@@ -2037,11 +2041,6 @@ static void set_preempt_timeout(struct intel_engine_cs *engine,
 		     active_preempt_timeout(engine, rq));
 }
 
-static inline void clear_ports(struct i915_request **ports, int count)
-{
-	memset_p((void **)ports, NULL, count);
-}
-
 static void execlists_dequeue(struct intel_engine_cs *engine)
 {
 	struct intel_engine_execlists * const execlists = &engine->execlists;
@@ -2390,26 +2389,36 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
 		start_timeslice(engine, execlists->queue_priority_hint);
 skip_submit:
 		ring_set_paused(engine, 0);
+		while (port-- != execlists->pending)
+			i915_request_put(*port);
 		*execlists->pending = NULL;
 	}
 }
 
-static void
-cancel_port_requests(struct intel_engine_execlists * const execlists)
+static inline void clear_ports(struct i915_request **ports, int count)
+{
+	memset_p((void **)ports, NULL, count);
+}
+
+static struct i915_request **
+cancel_port_requests(struct intel_engine_execlists * const execlists,
+		     struct i915_request **inactive)
 {
 	struct i915_request * const *port;
 
 	for (port = execlists->pending; *port; port++)
-		execlists_schedule_out(*port);
+		*inactive++ = *port;
 	clear_ports(execlists->pending, ARRAY_SIZE(execlists->pending));
 
 	/* Mark the end of active before we overwrite *active */
 	for (port = xchg(&execlists->active, execlists->pending); *port; port++)
-		execlists_schedule_out(*port);
+		*inactive++ = *port;
 	clear_ports(execlists->inflight, ARRAY_SIZE(execlists->inflight));
 
 	smp_wmb(); /* complete the seqlock for execlists_active() */
 	WRITE_ONCE(execlists->active, execlists->inflight);
+
+	return inactive;
 }
 
 static inline void
@@ -2481,7 +2490,8 @@ gen8_csb_parse(const struct intel_engine_execlists *execlists, const u32 *csb)
 	return *csb & (GEN8_CTX_STATUS_IDLE_ACTIVE | GEN8_CTX_STATUS_PREEMPTED);
 }
 
-static void process_csb(struct intel_engine_cs *engine)
+static struct i915_request **
+process_csb(struct intel_engine_cs *engine, struct i915_request **inactive)
 {
 	struct intel_engine_execlists * const execlists = &engine->execlists;
 	const u32 * const buf = execlists->csb_status;
@@ -2510,7 +2520,7 @@ static void process_csb(struct intel_engine_cs *engine)
 	head = execlists->csb_head;
 	tail = READ_ONCE(*execlists->csb_write);
 	if (unlikely(head == tail))
-		return;
+		return inactive;
 
 	/*
 	 * We will consume all events from HW, or at least pretend to.
@@ -2588,7 +2598,7 @@ static void process_csb(struct intel_engine_cs *engine)
 			/* cancel old inflight, prepare for switch */
 			trace_ports(execlists, "preempted", old);
 			while (*old)
-				execlists_schedule_out(*old++);
+				*inactive++ = *old++;
 
 			/* switch pending to inflight */
 			GEM_BUG_ON(!assert_pending_valid(execlists, "promote"));
@@ -2648,7 +2658,7 @@ static void process_csb(struct intel_engine_cs *engine)
 					     regs[CTX_RING_TAIL]);
 			}
 
-			execlists_schedule_out(*execlists->active++);
+			*inactive++ = *execlists->active++;
 
 			GEM_BUG_ON(execlists->active - execlists->inflight >
 				   execlists_num_ports(execlists));
@@ -2669,6 +2679,15 @@ static void process_csb(struct intel_engine_cs *engine)
 	 * invalidation before.
 	 */
 	invalidate_csb_entries(&buf[0], &buf[num_entries - 1]);
+
+	return inactive;
+}
+
+static void post_process_csb(struct i915_request **port,
+			     struct i915_request **last)
+{
+	while (port != last)
+		execlists_schedule_out(*port++);
 }
 
 static void __execlists_hold(struct i915_request *rq)
@@ -2939,8 +2958,8 @@ active_context(struct intel_engine_cs *engine, u32 ccid)
 	for (port = el->active; (rq = *port); port++) {
 		if (rq->context->lrc.ccid == ccid) {
 			ENGINE_TRACE(engine,
-				     "ccid found at active:%zd\n",
-				     port - el->active);
+				     "ccid:%x found at active:%zd\n",
+				     ccid, port - el->active);
 			return rq;
 		}
 	}
@@ -2948,8 +2967,8 @@ active_context(struct intel_engine_cs *engine, u32 ccid)
 	for (port = el->pending; (rq = *port); port++) {
 		if (rq->context->lrc.ccid == ccid) {
 			ENGINE_TRACE(engine,
-				     "ccid found at pending:%zd\n",
-				     port - el->pending);
+				     "ccid:%x found at pending:%zd\n",
+				     ccid, port - el->pending);
 			return rq;
 		}
 	}
@@ -3067,8 +3086,11 @@ static bool preempt_timeout(const struct intel_engine_cs *const engine)
 static void execlists_submission_tasklet(unsigned long data)
 {
 	struct intel_engine_cs * const engine = (struct intel_engine_cs *)data;
+	struct i915_request *post[2 * EXECLIST_MAX_PORTS];
+	struct i915_request **inactive;
 
-	process_csb(engine);
+	inactive = process_csb(engine, post);
+	GEM_BUG_ON(inactive - post > ARRAY_SIZE(post));
 
 	if (unlikely(preempt_timeout(engine)))
 		engine->execlists.error_interrupt |= ERROR_PREEMPT;
@@ -3092,6 +3114,8 @@ static void execlists_submission_tasklet(unsigned long data)
 
 	if (!engine->execlists.pending[0])
 		execlists_dequeue(engine);
+
+	post_process_csb(post, inactive);
 }
 
 static void __execlists_kick(struct intel_engine_execlists *execlists)
@@ -4011,8 +4035,6 @@ static void enable_execlists(struct intel_engine_cs *engine)
 	ENGINE_POSTING_READ(engine, RING_HWS_PGA);
 
 	enable_error_interrupt(engine);
-
-	engine->context_tag = GENMASK(BITS_PER_LONG - 2, 0);
 }
 
 static bool unexpected_starting_state(struct intel_engine_cs *engine)
@@ -4101,22 +4123,29 @@ static void __execlists_reset_reg_state(const struct intel_context *ce,
 	__reset_stop_ring(regs, engine);
 }
 
-static void __execlists_reset(struct intel_engine_cs *engine, bool stalled)
+static struct i915_request **reset_csb(struct intel_engine_cs *engine,
+				       struct i915_request **inactive)
 {
 	struct intel_engine_execlists * const execlists = &engine->execlists;
-	struct intel_context *ce;
-	struct i915_request *rq;
-	u32 head;
 
 	mb(); /* paranoia: read the CSB pointers from after the reset */
 	clflush(execlists->csb_write);
 	mb();
 
-	process_csb(engine); /* drain preemption events */
+	inactive = process_csb(engine, inactive); /* drain preemption events */
 
 	/* Following the reset, we need to reload the CSB read/write pointers */
 	reset_csb_pointers(engine);
 
+	return inactive;
+}
+
+static void execlists_reset_active(struct intel_engine_cs *engine, bool stalled)
+{
+	struct intel_context *ce;
+	struct i915_request *rq;
+	u32 head;
+
 	/*
 	 * Save the currently executing context, even if we completed
 	 * its request, it was still running at the time of the
@@ -4124,7 +4153,7 @@ static void __execlists_reset(struct intel_engine_cs *engine, bool stalled)
 	 */
 	rq = active_context(engine, engine->execlists.reset_ccid);
 	if (!rq)
-		goto unwind;
+		return;
 
 	ce = rq->context;
 	GEM_BUG_ON(!i915_vma_is_pinned(ce->state));
@@ -4187,11 +4216,20 @@ static void __execlists_reset(struct intel_engine_cs *engine, bool stalled)
 	__execlists_reset_reg_state(ce, engine);
 	__execlists_update_reg_state(ce, engine, head);
 	ce->lrc.desc |= CTX_DESC_FORCE_RESTORE; /* paranoid: GPU was reset! */
+}
 
-unwind:
-	/* Push back any incomplete requests for replay after the reset. */
-	cancel_port_requests(execlists);
-	__unwind_incomplete_requests(engine);
+static void execlists_reset_csb(struct intel_engine_cs *engine, bool stalled)
+{
+	struct intel_engine_execlists * const execlists = &engine->execlists;
+	struct i915_request *post[2 * EXECLIST_MAX_PORTS];
+	struct i915_request **inactive;
+
+	inactive = reset_csb(engine, post);
+
+	execlists_reset_active(engine, true);
+
+	inactive = cancel_port_requests(execlists, inactive);
+	post_process_csb(post, inactive);
 }
 
 static void execlists_reset_rewind(struct intel_engine_cs *engine, bool stalled)
@@ -4200,10 +4238,12 @@ static void execlists_reset_rewind(struct intel_engine_cs *engine, bool stalled)
 
 	ENGINE_TRACE(engine, "\n");
 
-	spin_lock_irqsave(&engine->active.lock, flags);
-
-	__execlists_reset(engine, stalled);
+	/* Process the csb, find the guilty context and throw away */
+	execlists_reset_csb(engine, stalled);
 
+	/* Push back any incomplete requests for replay after the reset. */
+	spin_lock_irqsave(&engine->active.lock, flags);
+	__unwind_incomplete_requests(engine);
 	spin_unlock_irqrestore(&engine->active.lock, flags);
 }
 
@@ -4238,9 +4278,9 @@ static void execlists_reset_cancel(struct intel_engine_cs *engine)
 	 * submission's irq state, we also wish to remind ourselves that
 	 * it is irq state.)
 	 */
-	spin_lock_irqsave(&engine->active.lock, flags);
+	execlists_reset_csb(engine, true);
 
-	__execlists_reset(engine, true);
+	spin_lock_irqsave(&engine->active.lock, flags);
 
 	/* Mark all executing requests as skipped. */
 	list_for_each_entry(rq, &engine->active.requests, sched.link)
@@ -5054,6 +5094,7 @@ int intel_execlists_submission_setup(struct intel_engine_cs *engine)
 	else
 		execlists->csb_size = GEN11_CSB_ENTRIES;
 
+	engine->context_tag = GENMASK(BITS_PER_LONG - 2, 0);
 	if (INTEL_GEN(engine->i915) >= 11) {
 		execlists->ccid |= engine->instance << (GEN11_ENGINE_INSTANCE_SHIFT - 32);
 		execlists->ccid |= engine->class << (GEN11_ENGINE_CLASS_SHIFT - 32);
-- 
2.20.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [Intel-gfx] [PATCH 41/66] drm/i915/gt: Resubmit the virtual engine on schedule-out
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (38 preceding siblings ...)
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 40/66] drm/i915/gt: Defer schedule_out until after the next dequeue Chris Wilson
@ 2020-07-15 11:51 ` Chris Wilson
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 42/66] drm/i915/gt: Simplify virtual engine handling for execlists_hold() Chris Wilson
                   ` (32 subsequent siblings)
  72 siblings, 0 replies; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:51 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

Having recognised that we do not change the sibling until we schedule
out, we can then defer the decision to resubmit the virtual engine from
the unwind of the active queue to scheduling out of the virtual context.

By keeping the unwind order intact on the local engine, we can preserve
data dependency ordering while doing a preempt-to-busy pass until we
have determined the new ELSP. This means that if we try to timeslice
between a virtual engine and a data-dependent ordinary request, the pair
will maintain their relative ordering and we will avoid the
resubmission, cancelling the timeslicing until further change.

The dilemma, though, is that we may then end up in a situation where the
'demotion' of the virtual request to an ordinary request in the engine
queue results in filling the ELSP[] with virtual requests instead of
spreading the load across the engines. To compensate for this, we mark
each virtual request and refuse to resubmit a virtual request in the
secondary ELSP slots, thus forcing subsequent virtual requests to be
scheduled out after timeslicing. By delaying the decision until we
schedule out, we will avoid unnecessary resubmission.
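
The schedule-out decision itself reduces to a simple predicate, shown
here with hypothetical demo_* names rather than the real request
fields: a request that is still sitting in the priority queue and whose
execution_mask spans more than the engine it is queued on is handed
back to the virtual engine.

#include <linux/types.h>

struct demo_rq {
	unsigned int execution_mask;	/* engines this request may run on */
	bool in_priority_queue;		/* unsubmitted, still re-routable */
};

static bool demo_should_resubmit(const struct demo_rq *rq,
				 unsigned int engine_mask)
{
	return rq->in_priority_queue && rq->execution_mask != engine_mask;
}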

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_lrc.c    | 92 +++++++++++++++++---------
 drivers/gpu/drm/i915/gt/selftest_lrc.c |  2 +-
 2 files changed, 63 insertions(+), 31 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index aa3233702613..062185116e13 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -1110,39 +1110,23 @@ __unwind_incomplete_requests(struct intel_engine_cs *engine)
 
 		__i915_request_unsubmit(rq);
 
-		/*
-		 * Push the request back into the queue for later resubmission.
-		 * If this request is not native to this physical engine (i.e.
-		 * it came from a virtual source), push it back onto the virtual
-		 * engine so that it can be moved across onto another physical
-		 * engine as load dictates.
-		 */
-		if (likely(rq->execution_mask == engine->mask)) {
-			GEM_BUG_ON(rq_prio(rq) == I915_PRIORITY_INVALID);
-			if (rq_prio(rq) != prio) {
-				prio = rq_prio(rq);
-				pl = i915_sched_lookup_priolist(engine, prio);
-			}
-			GEM_BUG_ON(RB_EMPTY_ROOT(&engine->execlists.queue.rb_root));
-
-			list_move(&rq->sched.link, pl);
-			set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
+		GEM_BUG_ON(rq_prio(rq) == I915_PRIORITY_INVALID);
+		if (rq_prio(rq) != prio) {
+			prio = rq_prio(rq);
+			pl = i915_sched_lookup_priolist(engine, prio);
+		}
+		GEM_BUG_ON(RB_EMPTY_ROOT(&engine->execlists.queue.rb_root));
 
-			/* Check in case we rollback so far we wrap [size/2] */
-			if (intel_ring_direction(rq->ring,
-						 intel_ring_wrap(rq->ring,
-								 rq->tail),
-						 rq->ring->tail) > 0)
-				rq->context->lrc.desc |= CTX_DESC_FORCE_RESTORE;
+		list_move(&rq->sched.link, pl);
+		set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
 
-			active = rq;
-		} else {
-			struct intel_engine_cs *owner = rq->context->engine;
+		/* Check in case we rollback so far we wrap [size/2] */
+		if (intel_ring_direction(rq->ring,
+					 intel_ring_wrap(rq->ring, rq->tail),
+					 rq->ring->tail) > 0)
+			rq->context->lrc.desc |= CTX_DESC_FORCE_RESTORE;
 
-			WRITE_ONCE(rq->engine, owner);
-			owner->submit_request(rq);
-			active = NULL;
-		}
+		active = rq;
 	}
 
 	return active;
@@ -1386,12 +1370,37 @@ static inline void execlists_schedule_in(struct i915_request *rq, int idx)
 	GEM_BUG_ON(intel_context_inflight(ce) != rq->engine);
 }
 
+static void
+resubmit_virtual_request(struct i915_request *rq, struct virtual_engine *ve)
+{
+	struct intel_engine_cs *engine = rq->engine;
+	unsigned long flags;
+
+	spin_lock_irqsave(&engine->active.lock, flags);
+
+	clear_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
+	WRITE_ONCE(rq->engine, &ve->base);
+	ve->base.submit_request(rq);
+
+	spin_unlock_irqrestore(&engine->active.lock, flags);
+}
+
 static void kick_siblings(struct i915_request *rq, struct intel_context *ce)
 {
 	struct virtual_engine *ve = container_of(ce, typeof(*ve), context);
 
 	if (READ_ONCE(ve->request))
 		tasklet_hi_schedule(&ve->base.execlists.tasklet);
+
+	/*
+	 * This engine is now too busy to run this virtual request, so
+	 * see if we can find an alternative engine for it to execute on.
+	 * Once a request has become bonded to this engine, we treat it the
+	 * same as other native request.
+	 */
+	if (i915_request_in_priority_queue(rq) &&
+	    rq->execution_mask != rq->engine->mask)
+		resubmit_virtual_request(rq, ve);
 }
 
 static inline void
@@ -1634,6 +1643,20 @@ assert_pending_valid(const struct intel_engine_execlists *execlists,
 		}
 		sentinel = i915_request_has_sentinel(rq);
 
+		/*
+		 * We want virtual requests to only be in the first slot so
+		 * that they are never stuck behind a hog and can be immediately
+		 * transferred onto the next idle engine.
+		 */
+		if (rq->execution_mask != engine->mask &&
+		    port != execlists->pending) {
+			GEM_TRACE_ERR("%s: virtual engine:%llx not in prime position[%zd]\n",
+				      engine->name,
+				      ce->timeline->fence_context,
+				      port - execlists->pending);
+			return false;
+		}
+
 		/* Hold tightly onto the lock to prevent concurrent retires! */
 		if (!spin_trylock_irqsave(&rq->lock, flags))
 			continue;
@@ -2309,6 +2332,15 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
 				if (i915_request_has_sentinel(last))
 					goto done;
 
+				/*
+				 * We avoid submitting virtual requests into
+				 * the secondary ports so that we can migrate
+				 * the request immediately to another engine
+				 * rather than wait for the primary request.
+				 */
+				if (rq->execution_mask != engine->mask)
+					goto done;
+
 				/*
 				 * If GVT overrides us we only ever submit
 				 * port[0], leaving port[1] empty. Note that we
diff --git a/drivers/gpu/drm/i915/gt/selftest_lrc.c b/drivers/gpu/drm/i915/gt/selftest_lrc.c
index 93b576ee4203..e05c750452be 100644
--- a/drivers/gpu/drm/i915/gt/selftest_lrc.c
+++ b/drivers/gpu/drm/i915/gt/selftest_lrc.c
@@ -4590,7 +4590,7 @@ static int reset_virtual_engine(struct intel_gt *gt,
 	spin_lock_irq(&engine->active.lock);
 	__unwind_incomplete_requests(engine);
 	spin_unlock_irq(&engine->active.lock);
-	GEM_BUG_ON(rq->engine != ve->engine);
+	GEM_BUG_ON(rq->engine != engine);
 
 	/* Reset the engine while keeping our active request on hold */
 	execlists_hold(engine, rq);
-- 
2.20.1


* [Intel-gfx] [PATCH 42/66] drm/i915/gt: Simplify virtual engine handling for execlists_hold()
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (39 preceding siblings ...)
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 41/66] drm/i915/gt: Resubmit the virtual engine on schedule-out Chris Wilson
@ 2020-07-15 11:51 ` Chris Wilson
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 43/66] drm/i915/gt: ce->inflight updates are now serialised Chris Wilson
                   ` (31 subsequent siblings)
  72 siblings, 0 replies; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:51 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

Now that the tasklet completely controls scheduling of the requests, and
we postpone scheduling out the old requests, we can keep a hanging
virtual request bound to the engine on which it hung, and remove it from
te queue. On release, it will be returned to the same engine and remain
in its queue until it is scheduled; after which point it will become
eligible for transfer to a sibling. Instead, we could opt to resubmit the
request along the virtual engine on unhold, making it eligible for load
balancing immediately -- but that seems like a pointless optimisation
for a hanging context.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/gt/intel_lrc.c | 29 -----------------------------
 1 file changed, 29 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index 062185116e13..0020fc77b3da 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -2771,35 +2771,6 @@ static bool execlists_hold(struct intel_engine_cs *engine,
 		goto unlock;
 	}
 
-	if (rq->engine != engine) { /* preempted virtual engine */
-		struct virtual_engine *ve = to_virtual_engine(rq->engine);
-
-		/*
-		 * intel_context_inflight() is only protected by virtue
-		 * of process_csb() being called only by the tasklet (or
-		 * directly from inside reset while the tasklet is suspended).
-		 * Assert that neither of those are allowed to run while we
-		 * poke at the request queues.
-		 */
-		GEM_BUG_ON(!reset_in_progress(&engine->execlists));
-
-		/*
-		 * An unsubmitted request along a virtual engine will
-		 * remain on the active (this) engine until we are able
-		 * to process the context switch away (and so mark the
-		 * context as no longer in flight). That cannot have happened
-		 * yet, otherwise we would not be hanging!
-		 */
-		spin_lock(&ve->base.active.lock);
-		GEM_BUG_ON(intel_context_inflight(rq->context) != engine);
-		GEM_BUG_ON(ve->request != rq);
-		ve->request = NULL;
-		spin_unlock(&ve->base.active.lock);
-		i915_request_put(rq);
-
-		rq->engine = engine;
-	}
-
 	/*
 	 * Transfer this request onto the hold queue to prevent it
 	 * being resumbitted to HW (and potentially completed) before we have
-- 
2.20.1


* [Intel-gfx] [PATCH 43/66] drm/i915/gt: ce->inflight updates are now serialised
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (40 preceding siblings ...)
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 42/66] drm/i915/gt: Simplify virtual engine handling for execlists_hold() Chris Wilson
@ 2020-07-15 11:51 ` Chris Wilson
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 44/66] drm/i915/gt: Drop atomic for engine->fw_active tracking Chris Wilson
                   ` (30 subsequent siblings)
  72 siblings, 0 replies; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:51 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

Since schedule-in and schedule-out are now both always under the tasklet
bitlock, we can reduce the individual atomic operations to simple
instructions and worry less.
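
For example, the accounting of ce->inflight collapses from try_cmpxchg()
loops into plain updates, roughly (a condensed sketch of the hunks below):

	/* schedule-in: only the first user performs the full schedule-in */
	old = ce->inflight;
	if (!old)
		old = __execlists_schedule_in(rq);
	WRITE_ONCE(ce->inflight, ptr_inc(old));

	/* schedule-out: plain decrement; the last user does the real work */
	ce->inflight = ptr_dec(ce->inflight);
	if (!intel_context_inflight_count(ce)) {
		__execlists_schedule_out(rq);
		WRITE_ONCE(ce->inflight, NULL);
		intel_context_put(ce);
	}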

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/gt/intel_lrc.c | 44 +++++++++++++----------------
 1 file changed, 19 insertions(+), 25 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index 0020fc77b3da..a59332f28cd3 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -1332,7 +1332,7 @@ __execlists_schedule_in(struct i915_request *rq)
 		unsigned int tag = ffs(READ_ONCE(engine->context_tag));
 
 		GEM_BUG_ON(tag == 0 || tag >= BITS_PER_LONG);
-		clear_bit(tag - 1, &engine->context_tag);
+		__clear_bit(tag - 1, &engine->context_tag);
 		ce->lrc.ccid = tag << (GEN11_SW_CTX_ID_SHIFT - 32);
 
 		BUILD_BUG_ON(BITS_PER_LONG > GEN12_MAX_CONTEXT_HW_ID);
@@ -1359,13 +1359,10 @@ static inline void execlists_schedule_in(struct i915_request *rq, int idx)
 	GEM_BUG_ON(!intel_engine_pm_is_awake(rq->engine));
 	trace_i915_request_in(rq, idx);
 
-	old = READ_ONCE(ce->inflight);
-	do {
-		if (!old) {
-			WRITE_ONCE(ce->inflight, __execlists_schedule_in(rq));
-			break;
-		}
-	} while (!try_cmpxchg(&ce->inflight, &old, ptr_inc(old)));
+	old = ce->inflight;
+	if (!old)
+		old = __execlists_schedule_in(rq);
+	WRITE_ONCE(ce->inflight, ptr_inc(old));
 
 	GEM_BUG_ON(intel_context_inflight(ce) != rq->engine);
 }
@@ -1403,12 +1400,11 @@ static void kick_siblings(struct i915_request *rq, struct intel_context *ce)
 		resubmit_virtual_request(rq, ve);
 }
 
-static inline void
-__execlists_schedule_out(struct i915_request *rq,
-			 struct intel_engine_cs * const engine,
-			 unsigned int ccid)
+static inline void __execlists_schedule_out(struct i915_request *rq)
 {
 	struct intel_context * const ce = rq->context;
+	struct intel_engine_cs * const engine = rq->engine;
+	unsigned int ccid;
 
 	/*
 	 * NB process_csb() is not under the engine->active.lock and hence
@@ -1416,7 +1412,7 @@ __execlists_schedule_out(struct i915_request *rq,
 	 * refrain from doing non-trivial work here.
 	 */
 
-	CE_TRACE(ce, "schedule-out, ccid:%x\n", ccid);
+	CE_TRACE(ce, "schedule-out, ccid:%x\n", ce->lrc.ccid);
 
 	/*
 	 * If we have just completed this context, the engine may now be
@@ -1426,12 +1422,13 @@ __execlists_schedule_out(struct i915_request *rq,
 	    i915_request_completed(rq))
 		intel_engine_add_retire(engine, ce->timeline);
 
+	ccid = ce->lrc.ccid;
 	ccid >>= GEN11_SW_CTX_ID_SHIFT - 32;
 	ccid &= GEN12_MAX_CONTEXT_HW_ID;
 	if (ccid < BITS_PER_LONG) {
 		GEM_BUG_ON(ccid == 0);
 		GEM_BUG_ON(test_bit(ccid - 1, &engine->context_tag));
-		set_bit(ccid - 1, &engine->context_tag);
+		__set_bit(ccid - 1, &engine->context_tag);
 	}
 
 	intel_context_update_runtime(ce);
@@ -1452,26 +1449,23 @@ __execlists_schedule_out(struct i915_request *rq,
 	 */
 	if (ce->engine != engine)
 		kick_siblings(rq, ce);
-
-	intel_context_put(ce);
 }
 
 static inline void
 execlists_schedule_out(struct i915_request *rq)
 {
 	struct intel_context * const ce = rq->context;
-	struct intel_engine_cs *cur, *old;
-	u32 ccid;
 
 	trace_i915_request_out(rq);
 
-	ccid = rq->context->lrc.ccid;
-	old = READ_ONCE(ce->inflight);
-	do
-		cur = ptr_unmask_bits(old, 2) ? ptr_dec(old) : NULL;
-	while (!try_cmpxchg(&ce->inflight, &old, cur));
-	if (!cur)
-		__execlists_schedule_out(rq, old, ccid);
+	GEM_BUG_ON(!ce->inflight);
+	ce->inflight = ptr_dec(ce->inflight);
+	if (!intel_context_inflight_count(ce)) {
+		GEM_BUG_ON(ce->inflight != rq->engine);
+		__execlists_schedule_out(rq);
+		WRITE_ONCE(ce->inflight, NULL);
+		intel_context_put(ce);
+	}
 
 	i915_request_put(rq);
 }
-- 
2.20.1


* [Intel-gfx] [PATCH 44/66] drm/i915/gt: Drop atomic for engine->fw_active tracking
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (41 preceding siblings ...)
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 43/66] drm/i915/gt: ce->inflight updates are now serialised Chris Wilson
@ 2020-07-15 11:51 ` Chris Wilson
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 45/66] drm/i915/gt: Extract busy-stats for ring-scheduler Chris Wilson
                   ` (29 subsequent siblings)
  72 siblings, 0 replies; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:51 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

Since schedule-in/out is now entirely serialised by the tasklet bitlock,
we do not need to worry about concurrent in/out operations and so reduce
the atomic operations to plain instructions.
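
The forcewake reference count then pairs up as ordinary pre/post
increment and decrement (a sketch of the two hunks below; correctness
relies entirely on the external serialisation):

	/* schedule-in */
	if (engine->fw_domain && !engine->fw_active++)
		intel_uncore_forcewake_get(engine->uncore, engine->fw_domain);

	/* schedule-out */
	if (engine->fw_domain && !--engine->fw_active)
		intel_uncore_forcewake_put(engine->uncore, engine->fw_domain);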

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/gt/intel_engine_cs.c    | 2 +-
 drivers/gpu/drm/i915/gt/intel_engine_types.h | 2 +-
 drivers/gpu/drm/i915/gt/intel_lrc.c          | 4 ++--
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index c10521fdbbe4..10997cae5e41 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -1615,7 +1615,7 @@ void intel_engine_dump(struct intel_engine_cs *engine,
 			   ktime_to_ms(intel_engine_get_busy_time(engine,
 								  &dummy)));
 	drm_printf(m, "\tForcewake: %x domains, %d active\n",
-		   engine->fw_domain, atomic_read(&engine->fw_active));
+		   engine->fw_domain, READ_ONCE(engine->fw_active));
 
 	rcu_read_lock();
 	rq = READ_ONCE(engine->heartbeat.systole);
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h b/drivers/gpu/drm/i915/gt/intel_engine_types.h
index 36981ba1db75..f86efafd385f 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
@@ -327,7 +327,7 @@ struct intel_engine_cs {
 	 * as possible.
 	 */
 	enum forcewake_domains fw_domain;
-	atomic_t fw_active;
+	unsigned int fw_active;
 
 	unsigned long context_tag;
 
diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index a59332f28cd3..72b343242251 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -1341,7 +1341,7 @@ __execlists_schedule_in(struct i915_request *rq)
 	ce->lrc.ccid |= engine->execlists.ccid;
 
 	__intel_gt_pm_get(engine->gt);
-	if (engine->fw_domain && !atomic_fetch_inc(&engine->fw_active))
+	if (engine->fw_domain && !engine->fw_active++)
 		intel_uncore_forcewake_get(engine->uncore, engine->fw_domain);
 	execlists_context_status_change(rq, INTEL_CONTEXT_SCHEDULE_IN);
 	intel_engine_context_in(engine);
@@ -1434,7 +1434,7 @@ static inline void __execlists_schedule_out(struct i915_request *rq)
 	intel_context_update_runtime(ce);
 	intel_engine_context_out(engine);
 	execlists_context_status_change(rq, INTEL_CONTEXT_SCHEDULE_OUT);
-	if (engine->fw_domain && !atomic_dec_return(&engine->fw_active))
+	if (engine->fw_domain && !--engine->fw_active)
 		intel_uncore_forcewake_put(engine->uncore, engine->fw_domain);
 	intel_gt_pm_put_async(engine->gt);
 
-- 
2.20.1


* [Intel-gfx] [PATCH 45/66] drm/i915/gt: Extract busy-stats for ring-scheduler
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (42 preceding siblings ...)
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 44/66] drm/i915/gt: Drop atomic for engine->fw_active tracking Chris Wilson
@ 2020-07-15 11:51 ` Chris Wilson
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 46/66] drm/i915/gt: Convert stats.active to plain unsigned int Chris Wilson
                   ` (28 subsequent siblings)
  72 siblings, 0 replies; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:51 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

Lift the busy-stats context-in/out implementation out of intel_lrc, so
that we can reuse it for other scheduler implementations.
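
Another backend would then include the new header and bracket its own
submission path with the same helpers, along the lines of (a usage
sketch; the helpers keep their own nesting count, so the engine is only
marked busy on the first context in and idle on the last context out):

	#include "intel_engine_stats.h"

	/* when a context is submitted to the hardware */
	intel_engine_context_in(engine);

	/* when that context leaves the hardware */
	intel_engine_context_out(engine);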

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/gt/intel_engine_stats.h | 49 ++++++++++++++++++++
 drivers/gpu/drm/i915/gt/intel_lrc.c          | 34 +-------------
 2 files changed, 50 insertions(+), 33 deletions(-)
 create mode 100644 drivers/gpu/drm/i915/gt/intel_engine_stats.h

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_stats.h b/drivers/gpu/drm/i915/gt/intel_engine_stats.h
new file mode 100644
index 000000000000..58491eae3482
--- /dev/null
+++ b/drivers/gpu/drm/i915/gt/intel_engine_stats.h
@@ -0,0 +1,49 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2020 Intel Corporation
+ */
+
+#ifndef __INTEL_ENGINE_STATS_H__
+#define __INTEL_ENGINE_STATS_H__
+
+#include <linux/atomic.h>
+#include <linux/ktime.h>
+#include <linux/seqlock.h>
+
+#include "i915_gem.h" /* GEM_BUG_ON */
+#include "intel_engine.h"
+
+static inline void intel_engine_context_in(struct intel_engine_cs *engine)
+{
+	unsigned long flags;
+
+	if (atomic_add_unless(&engine->stats.active, 1, 0))
+		return;
+
+	write_seqlock_irqsave(&engine->stats.lock, flags);
+	if (!atomic_add_unless(&engine->stats.active, 1, 0)) {
+		engine->stats.start = ktime_get();
+		atomic_inc(&engine->stats.active);
+	}
+	write_sequnlock_irqrestore(&engine->stats.lock, flags);
+}
+
+static inline void intel_engine_context_out(struct intel_engine_cs *engine)
+{
+	unsigned long flags;
+
+	GEM_BUG_ON(!atomic_read(&engine->stats.active));
+
+	if (atomic_add_unless(&engine->stats.active, -1, 1))
+		return;
+
+	write_seqlock_irqsave(&engine->stats.lock, flags);
+	if (atomic_dec_and_test(&engine->stats.active)) {
+		engine->stats.total =
+			ktime_add(engine->stats.total,
+				  ktime_sub(ktime_get(), engine->stats.start));
+	}
+	write_sequnlock_irqrestore(&engine->stats.lock, flags);
+}
+
+#endif /* __INTEL_ENGINE_STATS_H__ */
diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index 72b343242251..534adfdc42fe 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -139,6 +139,7 @@
 #include "i915_vgpu.h"
 #include "intel_context.h"
 #include "intel_engine_pm.h"
+#include "intel_engine_stats.h"
 #include "intel_gt.h"
 #include "intel_gt_pm.h"
 #include "intel_gt_requests.h"
@@ -1155,39 +1156,6 @@ execlists_context_status_change(struct i915_request *rq, unsigned long status)
 				   status, rq);
 }
 
-static void intel_engine_context_in(struct intel_engine_cs *engine)
-{
-	unsigned long flags;
-
-	if (atomic_add_unless(&engine->stats.active, 1, 0))
-		return;
-
-	write_seqlock_irqsave(&engine->stats.lock, flags);
-	if (!atomic_add_unless(&engine->stats.active, 1, 0)) {
-		engine->stats.start = ktime_get();
-		atomic_inc(&engine->stats.active);
-	}
-	write_sequnlock_irqrestore(&engine->stats.lock, flags);
-}
-
-static void intel_engine_context_out(struct intel_engine_cs *engine)
-{
-	unsigned long flags;
-
-	GEM_BUG_ON(!atomic_read(&engine->stats.active));
-
-	if (atomic_add_unless(&engine->stats.active, -1, 1))
-		return;
-
-	write_seqlock_irqsave(&engine->stats.lock, flags);
-	if (atomic_dec_and_test(&engine->stats.active)) {
-		engine->stats.total =
-			ktime_add(engine->stats.total,
-				  ktime_sub(ktime_get(), engine->stats.start));
-	}
-	write_sequnlock_irqrestore(&engine->stats.lock, flags);
-}
-
 static void
 execlists_check_context(const struct intel_context *ce,
 			const struct intel_engine_cs *engine)
-- 
2.20.1


* [Intel-gfx] [PATCH 46/66] drm/i915/gt: Convert stats.active to plain unsigned int
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (43 preceding siblings ...)
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 45/66] drm/i915/gt: Extract busy-stats for ring-scheduler Chris Wilson
@ 2020-07-15 11:51 ` Chris Wilson
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 47/66] drm/i915: Lift waiter/signaler iterators Chris Wilson
                   ` (27 subsequent siblings)
  72 siblings, 0 replies; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:51 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

As context-in/out is now always serialised, we do not have to worry
about concurrent enabling/disabling of the busy-stats and can reduce the
atomic_t active to a plain unsigned int, and the seqlock to a seqcount.
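
The writer is serialised externally, but the PMU may read from hardirq
context, so the write side keeps interrupts disabled around the seqcount
(a sketch condensed from the hunks below; the read side keeps its shape):

	/* writer: 0 -> 1 transition, under the tasklet serialisation */
	local_irq_save(flags);
	write_seqcount_begin(&engine->stats.lock);
	engine->stats.start = ktime_get();
	engine->stats.active++;
	write_seqcount_end(&engine->stats.lock);
	local_irq_restore(flags);

	/* reader: resample if it raced with a writer */
	do {
		seq = read_seqcount_begin(&engine->stats.lock);
		total = __intel_engine_get_busy_time(engine, now);
	} while (read_seqcount_retry(&engine->stats.lock, seq));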

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/gt/intel_engine_cs.c    |  8 ++--
 drivers/gpu/drm/i915/gt/intel_engine_stats.h | 45 ++++++++++++--------
 drivers/gpu/drm/i915/gt/intel_engine_types.h |  4 +-
 3 files changed, 34 insertions(+), 23 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index 10997cae5e41..fcdf336ebf43 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -338,7 +338,7 @@ static int intel_engine_setup(struct intel_gt *gt, enum intel_engine_id id)
 	engine->schedule = NULL;
 
 	ewma__engine_latency_init(&engine->latency);
-	seqlock_init(&engine->stats.lock);
+	seqcount_init(&engine->stats.lock);
 
 	ATOMIC_INIT_NOTIFIER_HEAD(&engine->context_status_notifier);
 
@@ -1692,7 +1692,7 @@ static ktime_t __intel_engine_get_busy_time(struct intel_engine_cs *engine,
 	 * add it to the total.
 	 */
 	*now = ktime_get();
-	if (atomic_read(&engine->stats.active))
+	if (READ_ONCE(engine->stats.active))
 		total = ktime_add(total, ktime_sub(*now, engine->stats.start));
 
 	return total;
@@ -1711,9 +1711,9 @@ ktime_t intel_engine_get_busy_time(struct intel_engine_cs *engine, ktime_t *now)
 	ktime_t total;
 
 	do {
-		seq = read_seqbegin(&engine->stats.lock);
+		seq = read_seqcount_begin(&engine->stats.lock);
 		total = __intel_engine_get_busy_time(engine, now);
-	} while (read_seqretry(&engine->stats.lock, seq));
+	} while (read_seqcount_retry(&engine->stats.lock, seq));
 
 	return total;
 }
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_stats.h b/drivers/gpu/drm/i915/gt/intel_engine_stats.h
index 58491eae3482..24fbdd94351a 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_stats.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine_stats.h
@@ -17,33 +17,44 @@ static inline void intel_engine_context_in(struct intel_engine_cs *engine)
 {
 	unsigned long flags;
 
-	if (atomic_add_unless(&engine->stats.active, 1, 0))
+	if (engine->stats.active) {
+		engine->stats.active++;
 		return;
-
-	write_seqlock_irqsave(&engine->stats.lock, flags);
-	if (!atomic_add_unless(&engine->stats.active, 1, 0)) {
-		engine->stats.start = ktime_get();
-		atomic_inc(&engine->stats.active);
 	}
-	write_sequnlock_irqrestore(&engine->stats.lock, flags);
+
+	/* The writer is serialised; but the pmu reader may be from hardirq */
+	local_irq_save(flags);
+	write_seqcount_begin(&engine->stats.lock);
+
+	engine->stats.start = ktime_get();
+	engine->stats.active++;
+
+	write_seqcount_end(&engine->stats.lock);
+	local_irq_restore(flags);
+
+	GEM_BUG_ON(!engine->stats.active);
 }
 
 static inline void intel_engine_context_out(struct intel_engine_cs *engine)
 {
 	unsigned long flags;
 
-	GEM_BUG_ON(!atomic_read(&engine->stats.active));
-
-	if (atomic_add_unless(&engine->stats.active, -1, 1))
+	GEM_BUG_ON(!engine->stats.active);
+	if (engine->stats.active > 1) {
+		engine->stats.active--;
 		return;
-
-	write_seqlock_irqsave(&engine->stats.lock, flags);
-	if (atomic_dec_and_test(&engine->stats.active)) {
-		engine->stats.total =
-			ktime_add(engine->stats.total,
-				  ktime_sub(ktime_get(), engine->stats.start));
 	}
-	write_sequnlock_irqrestore(&engine->stats.lock, flags);
+
+	local_irq_save(flags);
+	write_seqcount_begin(&engine->stats.lock);
+
+	engine->stats.active--;
+	engine->stats.total =
+		ktime_add(engine->stats.total,
+			  ktime_sub(ktime_get(), engine->stats.start));
+
+	write_seqcount_end(&engine->stats.lock);
+	local_irq_restore(flags);
 }
 
 #endif /* __INTEL_ENGINE_STATS_H__ */
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h b/drivers/gpu/drm/i915/gt/intel_engine_types.h
index f86efafd385f..7be475315fa9 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
@@ -550,12 +550,12 @@ struct intel_engine_cs {
 		/**
 		 * @active: Number of contexts currently scheduled in.
 		 */
-		atomic_t active;
+		unsigned int active;
 
 		/**
 		 * @lock: Lock protecting the below fields.
 		 */
-		seqlock_t lock;
+		seqcount_t lock;
 
 		/**
 		 * @total: Total time this engine was busy.
-- 
2.20.1


* [Intel-gfx] [PATCH 47/66] drm/i915: Lift waiter/signaler iterators
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (44 preceding siblings ...)
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 46/66] drm/i915/gt: Convert stats.active to plain unsigned int Chris Wilson
@ 2020-07-15 11:51 ` Chris Wilson
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 48/66] drm/i915: Strip out internal priorities Chris Wilson
                   ` (26 subsequent siblings)
  72 siblings, 0 replies; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:51 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

Lift the list iteration defines for traversing the signaler/waiter lists
into i915_scheduler_types.h for reuse.
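
For reference, the intended usage is a plain dependency walk, e.g. (a
hypothetical sketch; it assumes the usual i915_dependency layout with a
->signaler backpointer and a ->flags field):

	struct i915_dependency *p;

	/* everything this request must wait upon */
	for_each_signaler(p, rq) {
		if (p->flags & I915_DEPENDENCY_WEAK)
			continue;
		/* p->signaler is the i915_sched_node we depend upon */
	}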

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/gt/intel_lrc.c         | 10 ----------
 drivers/gpu/drm/i915/i915_scheduler_types.h | 10 ++++++++++
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index 534adfdc42fe..78dad751c187 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -1819,16 +1819,6 @@ static void virtual_xfer_breadcrumbs(struct virtual_engine *ve)
 	intel_engine_transfer_stale_breadcrumbs(ve->siblings[0], &ve->context);
 }
 
-#define for_each_waiter(p__, rq__) \
-	list_for_each_entry_lockless(p__, \
-				     &(rq__)->sched.waiters_list, \
-				     wait_link)
-
-#define for_each_signaler(p__, rq__) \
-	list_for_each_entry_rcu(p__, \
-				&(rq__)->sched.signalers_list, \
-				signal_link)
-
 static void defer_request(struct i915_request *rq, struct list_head * const pl)
 {
 	LIST_HEAD(list);
diff --git a/drivers/gpu/drm/i915/i915_scheduler_types.h b/drivers/gpu/drm/i915/i915_scheduler_types.h
index f72e6c397b08..343ed44d5ed4 100644
--- a/drivers/gpu/drm/i915/i915_scheduler_types.h
+++ b/drivers/gpu/drm/i915/i915_scheduler_types.h
@@ -81,4 +81,14 @@ struct i915_dependency {
 #define I915_DEPENDENCY_WEAK		BIT(2)
 };
 
+#define for_each_waiter(p__, rq__) \
+	list_for_each_entry_lockless(p__, \
+				     &(rq__)->sched.waiters_list, \
+				     wait_link)
+
+#define for_each_signaler(p__, rq__) \
+	list_for_each_entry_rcu(p__, \
+				&(rq__)->sched.signalers_list, \
+				signal_link)
+
 #endif /* _I915_SCHEDULER_TYPES_H_ */
-- 
2.20.1


* [Intel-gfx] [PATCH 48/66] drm/i915: Strip out internal priorities
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (45 preceding siblings ...)
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 47/66] drm/i915: Lift waiter/signaler iterators Chris Wilson
@ 2020-07-15 11:51 ` Chris Wilson
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 49/66] drm/i915: Remove I915_USER_PRIORITY_SHIFT Chris Wilson
                   ` (25 subsequent siblings)
  72 siblings, 0 replies; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:51 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

Since we are not using any internal priority levels, and in the next few
patches will introduce a new index for which the optimisation is not so
clear cut, discard the small table within the priolist.
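
With a single request list per priority level, draining the queue becomes
a plain walk, roughly (a sketch in the style of the reset paths below;
removing the exhausted node from the rbtree is elided):

	while ((rb = rb_first_cached(&execlists->queue))) {
		struct i915_priolist *p = to_priolist(rb);
		struct i915_request *rq, *rn;

		priolist_for_each_request_consume(rq, rn, p) {
			list_del_init(&rq->sched.link);
			__i915_request_submit(rq);
		}
		/* ... then erase and free the empty priolist ... */
	}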

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 .../gpu/drm/i915/gt/intel_engine_heartbeat.c  |  2 +-
 drivers/gpu/drm/i915/gt/intel_lrc.c           | 22 ++------
 drivers/gpu/drm/i915/gt/selftest_lrc.c        |  2 -
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c |  6 +--
 drivers/gpu/drm/i915/i915_priolist_types.h    |  8 +--
 drivers/gpu/drm/i915/i915_scheduler.c         | 51 +++----------------
 drivers/gpu/drm/i915/i915_scheduler.h         | 18 ++-----
 7 files changed, 21 insertions(+), 88 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
index be5d78472f18..addab2d922b7 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
@@ -113,7 +113,7 @@ static void heartbeat(struct work_struct *wrk)
 			 * low latency and no jitter] the chance to naturally
 			 * complete before being preempted.
 			 */
-			attr.priority = I915_PRIORITY_MASK;
+			attr.priority = 0;
 			if (rq->sched.attr.priority >= attr.priority)
 				attr.priority |= I915_USER_PRIORITY(I915_PRIORITY_HEARTBEAT);
 			if (rq->sched.attr.priority >= attr.priority)
diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index 78dad751c187..e3d7647a8514 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -436,22 +436,13 @@ static int effective_prio(const struct i915_request *rq)
 
 static int queue_prio(const struct intel_engine_execlists *execlists)
 {
-	struct i915_priolist *p;
 	struct rb_node *rb;
 
 	rb = rb_first_cached(&execlists->queue);
 	if (!rb)
 		return INT_MIN;
 
-	/*
-	 * As the priolist[] are inverted, with the highest priority in [0],
-	 * we have to flip the index value to become priority.
-	 */
-	p = to_priolist(rb);
-	if (!I915_USER_PRIORITY_SHIFT)
-		return p->priority;
-
-	return ((p->priority + 1) << I915_USER_PRIORITY_SHIFT) - ffs(p->used);
+	return to_priolist(rb)->priority;
 }
 
 static int virtual_prio(const struct intel_engine_execlists *el)
@@ -2248,9 +2239,8 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
 	while ((rb = rb_first_cached(&execlists->queue))) {
 		struct i915_priolist *p = to_priolist(rb);
 		struct i915_request *rq, *rn;
-		int i;
 
-		priolist_for_each_request_consume(rq, rn, p, i) {
+		priolist_for_each_request_consume(rq, rn, p) {
 			bool merge = true;
 
 			/*
@@ -4244,9 +4234,8 @@ static void execlists_reset_cancel(struct intel_engine_cs *engine)
 	/* Flush the queued requests to the timeline list (for retiring). */
 	while ((rb = rb_first_cached(&execlists->queue))) {
 		struct i915_priolist *p = to_priolist(rb);
-		int i;
 
-		priolist_for_each_request_consume(rq, rn, p, i) {
+		priolist_for_each_request_consume(rq, rn, p) {
 			mark_eio(rq);
 			__i915_request_submit(rq);
 		}
@@ -5270,7 +5259,7 @@ static int __execlists_context_alloc(struct intel_context *ce,
 
 static struct list_head *virtual_queue(struct virtual_engine *ve)
 {
-	return &ve->base.execlists.default_priolist.requests[0];
+	return &ve->base.execlists.default_priolist.requests;
 }
 
 static void virtual_context_destroy(struct kref *kref)
@@ -5835,9 +5824,8 @@ void intel_execlists_show_requests(struct intel_engine_cs *engine,
 	count = 0;
 	for (rb = rb_first_cached(&execlists->queue); rb; rb = rb_next(rb)) {
 		struct i915_priolist *p = rb_entry(rb, typeof(*p), node);
-		int i;
 
-		priolist_for_each_request(rq, p, i) {
+		priolist_for_each_request(rq, p) {
 			if (count++ < max - 1)
 				show_request(m, rq, "\t\tQ ");
 			else
diff --git a/drivers/gpu/drm/i915/gt/selftest_lrc.c b/drivers/gpu/drm/i915/gt/selftest_lrc.c
index e05c750452be..3843c69ac8a3 100644
--- a/drivers/gpu/drm/i915/gt/selftest_lrc.c
+++ b/drivers/gpu/drm/i915/gt/selftest_lrc.c
@@ -1102,7 +1102,6 @@ create_rewinder(struct intel_context *ce,
 
 	intel_ring_advance(rq, cs);
 
-	rq->sched.attr.priority = I915_PRIORITY_MASK;
 	err = 0;
 err:
 	i915_request_get(rq);
@@ -5363,7 +5362,6 @@ create_timestamp(struct intel_context *ce, void *slot, int idx)
 
 	intel_ring_advance(rq, cs);
 
-	rq->sched.attr.priority = I915_PRIORITY_MASK;
 	err = 0;
 err:
 	i915_request_get(rq);
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index fdfeb4b9b0f5..8b56cf0d970e 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -312,9 +312,8 @@ static void __guc_dequeue(struct intel_engine_cs *engine)
 	while ((rb = rb_first_cached(&execlists->queue))) {
 		struct i915_priolist *p = to_priolist(rb);
 		struct i915_request *rq, *rn;
-		int i;
 
-		priolist_for_each_request_consume(rq, rn, p, i) {
+		priolist_for_each_request_consume(rq, rn, p) {
 			if (last && rq->context != last->context) {
 				if (port == last_port)
 					goto done;
@@ -463,9 +462,8 @@ static void guc_reset_cancel(struct intel_engine_cs *engine)
 	/* Flush the queued requests to the timeline list (for retiring). */
 	while ((rb = rb_first_cached(&execlists->queue))) {
 		struct i915_priolist *p = to_priolist(rb);
-		int i;
 
-		priolist_for_each_request_consume(rq, rn, p, i) {
+		priolist_for_each_request_consume(rq, rn, p) {
 			list_del_init(&rq->sched.link);
 			__i915_request_submit(rq);
 			dma_fence_set_error(&rq->fence, -EIO);
diff --git a/drivers/gpu/drm/i915/i915_priolist_types.h b/drivers/gpu/drm/i915/i915_priolist_types.h
index 8aa7866ec6b6..9a7657bb002e 100644
--- a/drivers/gpu/drm/i915/i915_priolist_types.h
+++ b/drivers/gpu/drm/i915/i915_priolist_types.h
@@ -27,11 +27,8 @@ enum {
 #define I915_USER_PRIORITY_SHIFT 0
 #define I915_USER_PRIORITY(x) ((x) << I915_USER_PRIORITY_SHIFT)
 
-#define I915_PRIORITY_COUNT BIT(I915_USER_PRIORITY_SHIFT)
-#define I915_PRIORITY_MASK (I915_PRIORITY_COUNT - 1)
-
 /* Smallest priority value that cannot be bumped. */
-#define I915_PRIORITY_INVALID (INT_MIN | (u8)I915_PRIORITY_MASK)
+#define I915_PRIORITY_INVALID (INT_MIN)
 
 /*
  * Requests containing performance queries must not be preempted by
@@ -45,9 +42,8 @@ enum {
 #define I915_PRIORITY_BARRIER (I915_PRIORITY_UNPREEMPTABLE - 1)
 
 struct i915_priolist {
-	struct list_head requests[I915_PRIORITY_COUNT];
+	struct list_head requests;
 	struct rb_node node;
-	unsigned long used;
 	int priority;
 };
 
diff --git a/drivers/gpu/drm/i915/i915_scheduler.c b/drivers/gpu/drm/i915/i915_scheduler.c
index cbb880b10c65..805c5e062004 100644
--- a/drivers/gpu/drm/i915/i915_scheduler.c
+++ b/drivers/gpu/drm/i915/i915_scheduler.c
@@ -43,7 +43,7 @@ static inline struct i915_priolist *to_priolist(struct rb_node *rb)
 static void assert_priolists(struct intel_engine_execlists * const execlists)
 {
 	struct rb_node *rb;
-	long last_prio, i;
+	long last_prio;
 
 	if (!IS_ENABLED(CONFIG_DRM_I915_DEBUG_GEM))
 		return;
@@ -57,14 +57,6 @@ static void assert_priolists(struct intel_engine_execlists * const execlists)
 
 		GEM_BUG_ON(p->priority > last_prio);
 		last_prio = p->priority;
-
-		GEM_BUG_ON(!p->used);
-		for (i = 0; i < ARRAY_SIZE(p->requests); i++) {
-			if (list_empty(&p->requests[i]))
-				continue;
-
-			GEM_BUG_ON(!(p->used & BIT(i)));
-		}
 	}
 }
 
@@ -75,13 +67,10 @@ i915_sched_lookup_priolist(struct intel_engine_cs *engine, int prio)
 	struct i915_priolist *p;
 	struct rb_node **parent, *rb;
 	bool first = true;
-	int idx, i;
 
 	lockdep_assert_held(&engine->active.lock);
 	assert_priolists(execlists);
 
-	/* buckets sorted from highest [in slot 0] to lowest priority */
-	idx = I915_PRIORITY_COUNT - (prio & I915_PRIORITY_MASK) - 1;
 	prio >>= I915_USER_PRIORITY_SHIFT;
 	if (unlikely(execlists->no_priolist))
 		prio = I915_PRIORITY_NORMAL;
@@ -99,7 +88,7 @@ i915_sched_lookup_priolist(struct intel_engine_cs *engine, int prio)
 			parent = &rb->rb_right;
 			first = false;
 		} else {
-			goto out;
+			return &p->requests;
 		}
 	}
 
@@ -125,15 +114,12 @@ i915_sched_lookup_priolist(struct intel_engine_cs *engine, int prio)
 	}
 
 	p->priority = prio;
-	for (i = 0; i < ARRAY_SIZE(p->requests); i++)
-		INIT_LIST_HEAD(&p->requests[i]);
+	INIT_LIST_HEAD(&p->requests);
+
 	rb_link_node(&p->node, rb, parent);
 	rb_insert_color_cached(&p->node, &execlists->queue, first);
-	p->used = 0;
 
-out:
-	p->used |= BIT(idx);
-	return &p->requests[idx];
+	return &p->requests;
 }
 
 void __i915_priolist_free(struct i915_priolist *p)
@@ -363,30 +349,6 @@ void i915_schedule(struct i915_request *rq, const struct i915_sched_attr *attr)
 	spin_unlock_irq(&schedule_lock);
 }
 
-static void __bump_priority(struct i915_sched_node *node, unsigned int bump)
-{
-	struct i915_sched_attr attr = node->attr;
-
-	if (attr.priority & bump)
-		return;
-
-	attr.priority |= bump;
-	__i915_schedule(node, &attr);
-}
-
-void i915_schedule_bump_priority(struct i915_request *rq, unsigned int bump)
-{
-	unsigned long flags;
-
-	GEM_BUG_ON(bump & ~I915_PRIORITY_MASK);
-	if (READ_ONCE(rq->sched.attr.priority) & bump)
-		return;
-
-	spin_lock_irqsave(&schedule_lock, flags);
-	__bump_priority(&rq->sched, bump);
-	spin_unlock_irqrestore(&schedule_lock, flags);
-}
-
 void i915_sched_node_init(struct i915_sched_node *node)
 {
 	INIT_LIST_HEAD(&node->signalers_list);
@@ -529,8 +491,7 @@ int __init i915_global_scheduler_init(void)
 	if (!global.slab_dependencies)
 		return -ENOMEM;
 
-	global.slab_priorities = KMEM_CACHE(i915_priolist,
-					    SLAB_HWCACHE_ALIGN);
+	global.slab_priorities = KMEM_CACHE(i915_priolist, 0);
 	if (!global.slab_priorities)
 		goto err_priorities;
 
diff --git a/drivers/gpu/drm/i915/i915_scheduler.h b/drivers/gpu/drm/i915/i915_scheduler.h
index 6f0bf00fc569..b089d5cace1d 100644
--- a/drivers/gpu/drm/i915/i915_scheduler.h
+++ b/drivers/gpu/drm/i915/i915_scheduler.h
@@ -13,17 +13,11 @@
 
 #include "i915_scheduler_types.h"
 
-#define priolist_for_each_request(it, plist, idx) \
-	for (idx = 0; idx < ARRAY_SIZE((plist)->requests); idx++) \
-		list_for_each_entry(it, &(plist)->requests[idx], sched.link)
-
-#define priolist_for_each_request_consume(it, n, plist, idx) \
-	for (; \
-	     (plist)->used ? (idx = __ffs((plist)->used)), 1 : 0; \
-	     (plist)->used &= ~BIT(idx)) \
-		list_for_each_entry_safe(it, n, \
-					 &(plist)->requests[idx], \
-					 sched.link)
+#define priolist_for_each_request(it, plist) \
+	list_for_each_entry(it, &(plist)->requests, sched.link)
+
+#define priolist_for_each_request_consume(it, n, plist) \
+	list_for_each_entry_safe(it, n, &(plist)->requests, sched.link)
 
 void i915_sched_node_init(struct i915_sched_node *node);
 void i915_sched_node_reinit(struct i915_sched_node *node);
@@ -42,8 +36,6 @@ void i915_sched_node_fini(struct i915_sched_node *node);
 void i915_schedule(struct i915_request *request,
 		   const struct i915_sched_attr *attr);
 
-void i915_schedule_bump_priority(struct i915_request *rq, unsigned int bump);
-
 struct list_head *
 i915_sched_lookup_priolist(struct intel_engine_cs *engine, int prio);
 
-- 
2.20.1


* [Intel-gfx] [PATCH 49/66] drm/i915: Remove I915_USER_PRIORITY_SHIFT
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (46 preceding siblings ...)
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 48/66] drm/i915: Strip out internal priorities Chris Wilson
@ 2020-07-15 11:51 ` Chris Wilson
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 50/66] drm/i915: Replace engine->schedule() with a known request operation Chris Wilson
                   ` (24 subsequent siblings)
  72 siblings, 0 replies; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:51 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

As we do not have any internal priority levels, the priority can be set
directly from the user values.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/display/intel_display.c  |  4 +-
 drivers/gpu/drm/i915/gem/i915_gem_context.c   |  6 +--
 .../i915/gem/selftests/i915_gem_object_blt.c  |  4 +-
 .../gpu/drm/i915/gt/intel_engine_heartbeat.c  |  6 +--
 drivers/gpu/drm/i915/gt/selftest_lrc.c        | 44 +++++++------------
 drivers/gpu/drm/i915/i915_priolist_types.h    |  3 --
 drivers/gpu/drm/i915/i915_scheduler.c         |  1 -
 7 files changed, 23 insertions(+), 45 deletions(-)

diff --git a/drivers/gpu/drm/i915/display/intel_display.c b/drivers/gpu/drm/i915/display/intel_display.c
index 729ec6e0d43a..b1120d49d44e 100644
--- a/drivers/gpu/drm/i915/display/intel_display.c
+++ b/drivers/gpu/drm/i915/display/intel_display.c
@@ -15909,9 +15909,7 @@ static void intel_plane_unpin_fb(struct intel_plane_state *old_plane_state)
 
 static void fb_obj_bump_render_priority(struct drm_i915_gem_object *obj)
 {
-	struct i915_sched_attr attr = {
-		.priority = I915_USER_PRIORITY(I915_PRIORITY_DISPLAY),
-	};
+	struct i915_sched_attr attr = { .priority = I915_PRIORITY_DISPLAY };
 
 	i915_gem_object_wait_priority(obj, 0, &attr);
 }
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
index 901b2f5614ea..e30f7dbc5700 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -713,7 +713,7 @@ __create_context(struct drm_i915_private *i915)
 
 	kref_init(&ctx->ref);
 	ctx->i915 = i915;
-	ctx->sched.priority = I915_USER_PRIORITY(I915_PRIORITY_NORMAL);
+	ctx->sched.priority = I915_PRIORITY_NORMAL;
 	mutex_init(&ctx->mutex);
 
 	ctx->async.width = rounddown_pow_of_two(num_online_cpus());
@@ -2002,7 +2002,7 @@ static int set_priority(struct i915_gem_context *ctx,
 	    !capable(CAP_SYS_NICE))
 		return -EPERM;
 
-	ctx->sched.priority = I915_USER_PRIORITY(priority);
+	ctx->sched.priority = priority;
 	context_apply_all(ctx, __apply_priority, ctx);
 
 	return 0;
@@ -2505,7 +2505,7 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
 
 	case I915_CONTEXT_PARAM_PRIORITY:
 		args->size = 0;
-		args->value = ctx->sched.priority >> I915_USER_PRIORITY_SHIFT;
+		args->value = ctx->sched.priority;
 		break;
 
 	case I915_CONTEXT_PARAM_SSEU:
diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_object_blt.c b/drivers/gpu/drm/i915/gem/selftests/i915_gem_object_blt.c
index 23b6e11bbc3e..c4c04fb97d14 100644
--- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_object_blt.c
+++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_object_blt.c
@@ -220,7 +220,7 @@ static int igt_fill_blt_thread(void *arg)
 			return PTR_ERR(ctx);
 
 		prio = i915_prandom_u32_max_state(I915_PRIORITY_MAX, prng);
-		ctx->sched.priority = I915_USER_PRIORITY(prio);
+		ctx->sched.priority = prio;
 	}
 
 	ce = i915_gem_context_get_engine(ctx, 0);
@@ -338,7 +338,7 @@ static int igt_copy_blt_thread(void *arg)
 			return PTR_ERR(ctx);
 
 		prio = i915_prandom_u32_max_state(I915_PRIORITY_MAX, prng);
-		ctx->sched.priority = I915_USER_PRIORITY(prio);
+		ctx->sched.priority = prio;
 	}
 
 	ce = i915_gem_context_get_engine(ctx, 0);
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
index addab2d922b7..58a5c43156f4 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
@@ -69,9 +69,7 @@ static void show_heartbeat(const struct i915_request *rq,
 
 static void heartbeat(struct work_struct *wrk)
 {
-	struct i915_sched_attr attr = {
-		.priority = I915_USER_PRIORITY(I915_PRIORITY_MIN),
-	};
+	struct i915_sched_attr attr = { .priority = I915_PRIORITY_MIN };
 	struct intel_engine_cs *engine =
 		container_of(wrk, typeof(*engine), heartbeat.work.work);
 	struct intel_context *ce = engine->kernel_context;
@@ -115,7 +113,7 @@ static void heartbeat(struct work_struct *wrk)
 			 */
 			attr.priority = 0;
 			if (rq->sched.attr.priority >= attr.priority)
-				attr.priority |= I915_USER_PRIORITY(I915_PRIORITY_HEARTBEAT);
+				attr.priority = I915_PRIORITY_HEARTBEAT;
 			if (rq->sched.attr.priority >= attr.priority)
 				attr.priority = I915_PRIORITY_BARRIER;
 
diff --git a/drivers/gpu/drm/i915/gt/selftest_lrc.c b/drivers/gpu/drm/i915/gt/selftest_lrc.c
index 3843c69ac8a3..8a395b885b54 100644
--- a/drivers/gpu/drm/i915/gt/selftest_lrc.c
+++ b/drivers/gpu/drm/i915/gt/selftest_lrc.c
@@ -345,7 +345,7 @@ static int live_unlite_switch(void *arg)
 
 static int live_unlite_preempt(void *arg)
 {
-	return live_unlite_restore(arg, I915_USER_PRIORITY(I915_PRIORITY_MAX));
+	return live_unlite_restore(arg, I915_PRIORITY_MAX);
 }
 
 static int live_unlite_ring(void *arg)
@@ -1332,9 +1332,7 @@ static int live_timeslice_queue(void *arg)
 		goto err_pin;
 
 	for_each_engine(engine, gt, id) {
-		struct i915_sched_attr attr = {
-			.priority = I915_USER_PRIORITY(I915_PRIORITY_MAX),
-		};
+		struct i915_sched_attr attr = { .priority = I915_PRIORITY_MAX };
 		struct i915_request *rq, *nop;
 
 		if (!intel_engine_has_preemption(engine))
@@ -1549,14 +1547,12 @@ static int live_busywait_preempt(void *arg)
 	ctx_hi = kernel_context(gt->i915);
 	if (!ctx_hi)
 		return -ENOMEM;
-	ctx_hi->sched.priority =
-		I915_USER_PRIORITY(I915_CONTEXT_MAX_USER_PRIORITY);
+	ctx_hi->sched.priority = I915_CONTEXT_MAX_USER_PRIORITY;
 
 	ctx_lo = kernel_context(gt->i915);
 	if (!ctx_lo)
 		goto err_ctx_hi;
-	ctx_lo->sched.priority =
-		I915_USER_PRIORITY(I915_CONTEXT_MIN_USER_PRIORITY);
+	ctx_lo->sched.priority = I915_CONTEXT_MIN_USER_PRIORITY;
 
 	obj = i915_gem_object_create_internal(gt->i915, PAGE_SIZE);
 	if (IS_ERR(obj)) {
@@ -1759,14 +1755,12 @@ static int live_preempt(void *arg)
 	ctx_hi = kernel_context(gt->i915);
 	if (!ctx_hi)
 		goto err_spin_lo;
-	ctx_hi->sched.priority =
-		I915_USER_PRIORITY(I915_CONTEXT_MAX_USER_PRIORITY);
+	ctx_hi->sched.priority = I915_CONTEXT_MAX_USER_PRIORITY;
 
 	ctx_lo = kernel_context(gt->i915);
 	if (!ctx_lo)
 		goto err_ctx_hi;
-	ctx_lo->sched.priority =
-		I915_USER_PRIORITY(I915_CONTEXT_MIN_USER_PRIORITY);
+	ctx_lo->sched.priority = I915_CONTEXT_MIN_USER_PRIORITY;
 
 	for_each_engine(engine, gt, id) {
 		struct igt_live_test t;
@@ -1862,7 +1856,7 @@ static int live_late_preempt(void *arg)
 		goto err_ctx_hi;
 
 	/* Make sure ctx_lo stays before ctx_hi until we trigger preemption. */
-	ctx_lo->sched.priority = I915_USER_PRIORITY(1);
+	ctx_lo->sched.priority = 1;
 
 	for_each_engine(engine, gt, id) {
 		struct igt_live_test t;
@@ -1903,7 +1897,7 @@ static int live_late_preempt(void *arg)
 			goto err_wedged;
 		}
 
-		attr.priority = I915_USER_PRIORITY(I915_PRIORITY_MAX);
+		attr.priority = I915_PRIORITY_MAX;
 		engine->schedule(rq, &attr);
 
 		if (!igt_wait_for_spinner(&spin_hi, rq)) {
@@ -1987,7 +1981,7 @@ static int live_nopreempt(void *arg)
 		return -ENOMEM;
 	if (preempt_client_init(gt, &b))
 		goto err_client_a;
-	b.ctx->sched.priority = I915_USER_PRIORITY(I915_PRIORITY_MAX);
+	b.ctx->sched.priority = I915_PRIORITY_MAX;
 
 	for_each_engine(engine, gt, id) {
 		struct i915_request *rq_a, *rq_b;
@@ -2380,11 +2374,9 @@ static int live_preempt_cancel(void *arg)
 
 static int live_suppress_self_preempt(void *arg)
 {
+	struct i915_sched_attr attr = { .priority = I915_PRIORITY_MAX };
 	struct intel_gt *gt = arg;
 	struct intel_engine_cs *engine;
-	struct i915_sched_attr attr = {
-		.priority = I915_USER_PRIORITY(I915_PRIORITY_MAX)
-	};
 	struct preempt_client a, b;
 	enum intel_engine_id id;
 	int err = -ENOMEM;
@@ -2521,9 +2513,7 @@ static int live_chain_preempt(void *arg)
 		goto err_client_hi;
 
 	for_each_engine(engine, gt, id) {
-		struct i915_sched_attr attr = {
-			.priority = I915_USER_PRIORITY(I915_PRIORITY_MAX),
-		};
+		struct i915_sched_attr attr = { .priority = I915_PRIORITY_MAX };
 		struct igt_live_test t;
 		struct i915_request *rq;
 		int ring_size, count, i;
@@ -2941,9 +2931,7 @@ static int live_preempt_gang(void *arg)
 			return -EIO;
 
 		do {
-			struct i915_sched_attr attr = {
-				.priority = I915_USER_PRIORITY(prio++),
-			};
+			struct i915_sched_attr attr = { .priority = prio++ };
 
 			err = create_gang(engine, &rq);
 			if (err)
@@ -2980,7 +2968,7 @@ static int live_preempt_gang(void *arg)
 					drm_info_printer(engine->i915->drm.dev);
 
 				pr_err("Failed to flush chain of %d requests, at %d\n",
-				       prio, rq_prio(rq) >> I915_USER_PRIORITY_SHIFT);
+				       prio, rq_prio(rq));
 				intel_engine_dump(engine, &p,
 						  "%s\n", engine->name);
 
@@ -3354,14 +3342,12 @@ static int live_preempt_timeout(void *arg)
 	ctx_hi = kernel_context(gt->i915);
 	if (!ctx_hi)
 		goto err_spin_lo;
-	ctx_hi->sched.priority =
-		I915_USER_PRIORITY(I915_CONTEXT_MAX_USER_PRIORITY);
+	ctx_hi->sched.priority = I915_CONTEXT_MAX_USER_PRIORITY;
 
 	ctx_lo = kernel_context(gt->i915);
 	if (!ctx_lo)
 		goto err_ctx_hi;
-	ctx_lo->sched.priority =
-		I915_USER_PRIORITY(I915_CONTEXT_MIN_USER_PRIORITY);
+	ctx_lo->sched.priority = I915_CONTEXT_MIN_USER_PRIORITY;
 
 	for_each_engine(engine, gt, id) {
 		unsigned long saved_timeout;
diff --git a/drivers/gpu/drm/i915/i915_priolist_types.h b/drivers/gpu/drm/i915/i915_priolist_types.h
index 9a7657bb002e..bc2fa84f98a8 100644
--- a/drivers/gpu/drm/i915/i915_priolist_types.h
+++ b/drivers/gpu/drm/i915/i915_priolist_types.h
@@ -24,9 +24,6 @@ enum {
 	I915_PRIORITY_DISPLAY,
 };
 
-#define I915_USER_PRIORITY_SHIFT 0
-#define I915_USER_PRIORITY(x) ((x) << I915_USER_PRIORITY_SHIFT)
-
 /* Smallest priority value that cannot be bumped. */
 #define I915_PRIORITY_INVALID (INT_MIN)
 
diff --git a/drivers/gpu/drm/i915/i915_scheduler.c b/drivers/gpu/drm/i915/i915_scheduler.c
index 805c5e062004..a9973d7a724c 100644
--- a/drivers/gpu/drm/i915/i915_scheduler.c
+++ b/drivers/gpu/drm/i915/i915_scheduler.c
@@ -71,7 +71,6 @@ i915_sched_lookup_priolist(struct intel_engine_cs *engine, int prio)
 	lockdep_assert_held(&engine->active.lock);
 	assert_priolists(execlists);
 
-	prio >>= I915_USER_PRIORITY_SHIFT;
 	if (unlikely(execlists->no_priolist))
 		prio = I915_PRIORITY_NORMAL;
 
-- 
2.20.1


* [Intel-gfx] [PATCH 50/66] drm/i915: Replace engine->schedule() with a known request operation
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (47 preceding siblings ...)
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 49/66] drm/i915: Remove I915_USER_PRIORITY_SHIFT Chris Wilson
@ 2020-07-15 11:51 ` Chris Wilson
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 51/66] drm/i915/gt: Do not suspend bonded requests if one hangs Chris Wilson
                   ` (23 subsequent siblings)
  72 siblings, 0 replies; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:51 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

Looking to the future, we want to set the scheduling attributes
explicitly and so replace the generic engine->schedule() with the more
direct i915_request_set_priority().

What it loses in removing the 'schedule' name from the function, it
gains in having an explicit entry point with a stated goal.
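
At the call sites the conversion is mechanical, e.g. (a sketch drawn from
the heartbeat hunk below):

	/* before: indirect, and only if the backend supplies a hook */
	if (engine->schedule)
		engine->schedule(rq, &attr);

	/* after: a direct request operation, gated on the capability */
	if (intel_engine_has_scheduler(engine))
		i915_request_set_priority(rq, attr.priority);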

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/display/intel_display.c  |  9 +----
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c    |  2 +-
 drivers/gpu/drm/i915/gem/i915_gem_object.h    |  2 +-
 drivers/gpu/drm/i915/gem/i915_gem_wait.c      | 27 +++++----------
 drivers/gpu/drm/i915/gt/intel_engine_cs.c     |  3 --
 .../gpu/drm/i915/gt/intel_engine_heartbeat.c  |  4 +--
 drivers/gpu/drm/i915/gt/intel_engine_types.h  | 29 ++++++++--------
 drivers/gpu/drm/i915/gt/intel_engine_user.c   |  2 +-
 drivers/gpu/drm/i915/gt/intel_lrc.c           |  3 +-
 drivers/gpu/drm/i915/gt/selftest_hangcheck.c  | 11 +++----
 drivers/gpu/drm/i915/gt/selftest_lrc.c        | 33 +++++--------------
 drivers/gpu/drm/i915/i915_request.c           | 11 ++++---
 drivers/gpu/drm/i915/i915_scheduler.c         | 15 +++++----
 drivers/gpu/drm/i915/i915_scheduler.h         |  3 +-
 14 files changed, 58 insertions(+), 96 deletions(-)

diff --git a/drivers/gpu/drm/i915/display/intel_display.c b/drivers/gpu/drm/i915/display/intel_display.c
index b1120d49d44e..c74e664a3759 100644
--- a/drivers/gpu/drm/i915/display/intel_display.c
+++ b/drivers/gpu/drm/i915/display/intel_display.c
@@ -15907,13 +15907,6 @@ static void intel_plane_unpin_fb(struct intel_plane_state *old_plane_state)
 		intel_unpin_fb_vma(vma, old_plane_state->flags);
 }
 
-static void fb_obj_bump_render_priority(struct drm_i915_gem_object *obj)
-{
-	struct i915_sched_attr attr = { .priority = I915_PRIORITY_DISPLAY };
-
-	i915_gem_object_wait_priority(obj, 0, &attr);
-}
-
 /**
  * intel_prepare_plane_fb - Prepare fb for usage on plane
  * @_plane: drm plane to prepare for
@@ -15990,7 +15983,7 @@ intel_prepare_plane_fb(struct drm_plane *_plane,
 	if (ret)
 		return ret;
 
-	fb_obj_bump_render_priority(obj);
+	i915_gem_object_wait_priority(obj, 0, I915_PRIORITY_DISPLAY);
 	i915_gem_object_flush_frontbuffer(obj, ORIGIN_DIRTYFB);
 
 	if (!new_plane_state->uapi.fence) { /* implicit fencing */
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index 7ad65612e4a0..d9f1403ddfa4 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -2973,7 +2973,7 @@ static int __eb_pin_reloc_engine(struct i915_execbuffer *eb)
 		return PTR_ERR(ce);
 
 	/* Reuse eb->context->timeline with scheduler! */
-	if (engine->schedule)
+	if (intel_engine_has_scheduler(engine))
 		ce->timeline = intel_timeline_get(eb->context->timeline);
 
 	i915_vm_put(ce->vm);
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_object.h b/drivers/gpu/drm/i915/gem/i915_gem_object.h
index 26f53321443b..d916155b0c52 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_object.h
+++ b/drivers/gpu/drm/i915/gem/i915_gem_object.h
@@ -459,7 +459,7 @@ int i915_gem_object_wait(struct drm_i915_gem_object *obj,
 			 long timeout);
 int i915_gem_object_wait_priority(struct drm_i915_gem_object *obj,
 				  unsigned int flags,
-				  const struct i915_sched_attr *attr);
+				  int prio);
 
 void __i915_gem_object_flush_frontbuffer(struct drm_i915_gem_object *obj,
 					 enum fb_op_origin origin);
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_wait.c b/drivers/gpu/drm/i915/gem/i915_gem_wait.c
index 8af55cd3e690..cefbbb3d9b52 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_wait.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_wait.c
@@ -93,28 +93,17 @@ i915_gem_object_wait_reservation(struct dma_resv *resv,
 	return timeout;
 }
 
-static void __fence_set_priority(struct dma_fence *fence,
-				 const struct i915_sched_attr *attr)
+static void __fence_set_priority(struct dma_fence *fence, int prio)
 {
-	struct i915_request *rq;
-	struct intel_engine_cs *engine;
-
 	if (dma_fence_is_signaled(fence) || !dma_fence_is_i915(fence))
 		return;
 
-	rq = to_request(fence);
-	engine = rq->engine;
-
 	local_bh_disable();
-	rcu_read_lock(); /* RCU serialisation for set-wedged protection */
-	if (engine->schedule)
-		engine->schedule(rq, attr);
-	rcu_read_unlock();
+	i915_request_set_priority(to_request(fence), prio);
 	local_bh_enable(); /* kick the tasklets if queues were reprioritised */
 }
 
-static void fence_set_priority(struct dma_fence *fence,
-			       const struct i915_sched_attr *attr)
+static void fence_set_priority(struct dma_fence *fence, int prio)
 {
 	/* Recurse once into a fence-array */
 	if (dma_fence_is_array(fence)) {
@@ -122,16 +111,16 @@ static void fence_set_priority(struct dma_fence *fence,
 		int i;
 
 		for (i = 0; i < array->num_fences; i++)
-			__fence_set_priority(array->fences[i], attr);
+			__fence_set_priority(array->fences[i], prio);
 	} else {
-		__fence_set_priority(fence, attr);
+		__fence_set_priority(fence, prio);
 	}
 }
 
 int
 i915_gem_object_wait_priority(struct drm_i915_gem_object *obj,
 			      unsigned int flags,
-			      const struct i915_sched_attr *attr)
+			      int prio)
 {
 	struct dma_fence *excl;
 
@@ -146,7 +135,7 @@ i915_gem_object_wait_priority(struct drm_i915_gem_object *obj,
 			return ret;
 
 		for (i = 0; i < count; i++) {
-			fence_set_priority(shared[i], attr);
+			fence_set_priority(shared[i], prio);
 			dma_fence_put(shared[i]);
 		}
 
@@ -156,7 +145,7 @@ i915_gem_object_wait_priority(struct drm_i915_gem_object *obj,
 	}
 
 	if (excl) {
-		fence_set_priority(excl, attr);
+		fence_set_priority(excl, prio);
 		dma_fence_put(excl);
 	}
 	return 0;
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index fcdf336ebf43..8bd87ca918d0 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -334,9 +334,6 @@ static int intel_engine_setup(struct intel_gt *gt, enum intel_engine_id id)
 	if (engine->context_size)
 		DRIVER_CAPS(i915)->has_logical_contexts = true;
 
-	/* Nothing to do here, execute in order of dependencies */
-	engine->schedule = NULL;
-
 	ewma__engine_latency_init(&engine->latency);
 	seqcount_init(&engine->stats.lock);
 
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
index 58a5c43156f4..96ebf61038d9 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
@@ -103,7 +103,7 @@ static void heartbeat(struct work_struct *wrk)
 			 * but all other contexts, including the kernel
 			 * context are stuck waiting for the signal.
 			 */
-		} else if (engine->schedule &&
+		} else if (intel_engine_has_scheduler(engine) &&
 			   rq->sched.attr.priority < I915_PRIORITY_BARRIER) {
 			/*
 			 * Gradually raise the priority of the heartbeat to
@@ -118,7 +118,7 @@ static void heartbeat(struct work_struct *wrk)
 				attr.priority = I915_PRIORITY_BARRIER;
 
 			local_bh_disable();
-			engine->schedule(rq, &attr);
+			i915_request_set_priority(rq, attr.priority);
 			local_bh_enable();
 		} else {
 			if (IS_ENABLED(CONFIG_DRM_I915_DEBUG_GEM))
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h b/drivers/gpu/drm/i915/gt/intel_engine_types.h
index 7be475315fa9..a0ed041cfab4 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
@@ -488,14 +488,6 @@ struct intel_engine_cs {
 	void            (*bond_execute)(struct i915_request *rq,
 					struct dma_fence *signal);
 
-	/*
-	 * Call when the priority on a request has changed and it and its
-	 * dependencies may need rescheduling. Note the request itself may
-	 * not be ready to run!
-	 */
-	void		(*schedule)(struct i915_request *request,
-				    const struct i915_sched_attr *attr);
-
 	void		(*release)(struct intel_engine_cs *engine);
 
 	struct intel_engine_execlists execlists;
@@ -513,13 +505,14 @@ struct intel_engine_cs {
 
 #define I915_ENGINE_USING_CMD_PARSER BIT(0)
 #define I915_ENGINE_SUPPORTS_STATS   BIT(1)
-#define I915_ENGINE_HAS_PREEMPTION   BIT(2)
-#define I915_ENGINE_HAS_SEMAPHORES   BIT(3)
-#define I915_ENGINE_HAS_TIMESLICES   BIT(4)
-#define I915_ENGINE_NEEDS_BREADCRUMB_TASKLET BIT(5)
-#define I915_ENGINE_IS_VIRTUAL       BIT(6)
-#define I915_ENGINE_HAS_RELATIVE_MMIO BIT(7)
-#define I915_ENGINE_REQUIRES_CMD_PARSER BIT(8)
+#define I915_ENGINE_HAS_SCHEDULER    BIT(2)
+#define I915_ENGINE_HAS_PREEMPTION   BIT(3)
+#define I915_ENGINE_HAS_SEMAPHORES   BIT(4)
+#define I915_ENGINE_HAS_TIMESLICES   BIT(5)
+#define I915_ENGINE_NEEDS_BREADCRUMB_TASKLET BIT(6)
+#define I915_ENGINE_IS_VIRTUAL       BIT(7)
+#define I915_ENGINE_HAS_RELATIVE_MMIO BIT(8)
+#define I915_ENGINE_REQUIRES_CMD_PARSER BIT(9)
 	unsigned int flags;
 
 	/*
@@ -605,6 +598,12 @@ intel_engine_supports_stats(const struct intel_engine_cs *engine)
 	return engine->flags & I915_ENGINE_SUPPORTS_STATS;
 }
 
+static inline bool
+intel_engine_has_scheduler(const struct intel_engine_cs *engine)
+{
+	return engine->flags & I915_ENGINE_HAS_SCHEDULER;
+}
+
 static inline bool
 intel_engine_has_preemption(const struct intel_engine_cs *engine)
 {
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_user.c b/drivers/gpu/drm/i915/gt/intel_engine_user.c
index 34e6096f196e..6b5a4fdc14a0 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_user.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_user.c
@@ -108,7 +108,7 @@ static void set_scheduler_caps(struct drm_i915_private *i915)
 	for_each_uabi_engine(engine, i915) { /* all engines must agree! */
 		int i;
 
-		if (engine->schedule)
+		if (intel_engine_has_scheduler(engine))
 			enabled |= (I915_SCHEDULER_CAP_ENABLED |
 				    I915_SCHEDULER_CAP_PRIORITY);
 		else
diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index e3d7647a8514..dca6f8165ec7 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -4870,7 +4870,6 @@ static void execlists_park(struct intel_engine_cs *engine)
 void intel_execlists_set_default_submission(struct intel_engine_cs *engine)
 {
 	engine->submit_request = execlists_submit_request;
-	engine->schedule = i915_schedule;
 	engine->execlists.tasklet.func = execlists_submission_tasklet;
 
 	engine->reset.prepare = execlists_reset_prepare;
@@ -4881,6 +4880,7 @@ void intel_execlists_set_default_submission(struct intel_engine_cs *engine)
 	engine->park = execlists_park;
 	engine->unpark = NULL;
 
+	engine->flags |= I915_ENGINE_HAS_SCHEDULER;
 	engine->flags |= I915_ENGINE_SUPPORTS_STATS;
 	if (!intel_vgpu_active(engine->i915)) {
 		engine->flags |= I915_ENGINE_HAS_SEMAPHORES;
@@ -5620,7 +5620,6 @@ intel_execlists_create_virtual(struct intel_engine_cs **siblings,
 	ve->base.cops = &virtual_context_ops;
 	ve->base.request_alloc = execlists_request_alloc;
 
-	ve->base.schedule = i915_schedule;
 	ve->base.submit_request = virtual_submit_request;
 	ve->base.bond_execute = virtual_bond_execute;
 
diff --git a/drivers/gpu/drm/i915/gt/selftest_hangcheck.c b/drivers/gpu/drm/i915/gt/selftest_hangcheck.c
index c28d1fcad673..927d54c702f4 100644
--- a/drivers/gpu/drm/i915/gt/selftest_hangcheck.c
+++ b/drivers/gpu/drm/i915/gt/selftest_hangcheck.c
@@ -726,12 +726,11 @@ static int active_engine(void *data)
 		rq[idx] = i915_request_get(new);
 		i915_request_add(new);
 
-		if (engine->schedule && arg->flags & TEST_PRIORITY) {
-			struct i915_sched_attr attr = {
-				.priority =
-					i915_prandom_u32_max_state(512, &prng),
-			};
-			engine->schedule(rq[idx], &attr);
+		if (intel_engine_has_scheduler(engine) &&
+		    arg->flags & TEST_PRIORITY) {
+			int prio = i915_prandom_u32_max_state(512, &prng);
+
+			i915_request_set_priority(rq[idx], prio);
 		}
 
 		err = active_request_put(old);
diff --git a/drivers/gpu/drm/i915/gt/selftest_lrc.c b/drivers/gpu/drm/i915/gt/selftest_lrc.c
index 8a395b885b54..b23234ae2572 100644
--- a/drivers/gpu/drm/i915/gt/selftest_lrc.c
+++ b/drivers/gpu/drm/i915/gt/selftest_lrc.c
@@ -293,12 +293,8 @@ static int live_unlite_restore(struct intel_gt *gt, int prio)
 		i915_request_put(rq[0]);
 
 		if (prio) {
-			struct i915_sched_attr attr = {
-				.priority = prio,
-			};
-
 			/* Alternatively preempt the spinner with ce[1] */
-			engine->schedule(rq[1], &attr);
+			i915_request_set_priority(rq[1], prio);
 		}
 
 		/* And switch back to ce[0] for good measure */
@@ -898,9 +894,6 @@ release_queue(struct intel_engine_cs *engine,
 	      struct i915_vma *vma,
 	      int idx, int prio)
 {
-	struct i915_sched_attr attr = {
-		.priority = prio,
-	};
 	struct i915_request *rq;
 	u32 *cs;
 
@@ -925,7 +918,7 @@ release_queue(struct intel_engine_cs *engine,
 	i915_request_add(rq);
 
 	local_bh_disable();
-	engine->schedule(rq, &attr);
+	i915_request_set_priority(rq, prio);
 	local_bh_enable(); /* kick tasklet */
 
 	i915_request_put(rq);
@@ -1332,7 +1325,6 @@ static int live_timeslice_queue(void *arg)
 		goto err_pin;
 
 	for_each_engine(engine, gt, id) {
-		struct i915_sched_attr attr = { .priority = I915_PRIORITY_MAX };
 		struct i915_request *rq, *nop;
 
 		if (!intel_engine_has_preemption(engine))
@@ -1347,7 +1339,7 @@ static int live_timeslice_queue(void *arg)
 			err = PTR_ERR(rq);
 			goto err_heartbeat;
 		}
-		engine->schedule(rq, &attr);
+		i915_request_set_priority(rq, I915_PRIORITY_MAX);
 		err = wait_for_submit(engine, rq, HZ / 2);
 		if (err) {
 			pr_err("%s: Timed out trying to submit semaphores\n",
@@ -1834,7 +1826,6 @@ static int live_late_preempt(void *arg)
 	struct i915_gem_context *ctx_hi, *ctx_lo;
 	struct igt_spinner spin_hi, spin_lo;
 	struct intel_engine_cs *engine;
-	struct i915_sched_attr attr = {};
 	enum intel_engine_id id;
 	int err = -ENOMEM;
 
@@ -1897,8 +1888,7 @@ static int live_late_preempt(void *arg)
 			goto err_wedged;
 		}
 
-		attr.priority = I915_PRIORITY_MAX;
-		engine->schedule(rq, &attr);
+		i915_request_set_priority(rq, I915_PRIORITY_MAX);
 
 		if (!igt_wait_for_spinner(&spin_hi, rq)) {
 			pr_err("High priority context failed to preempt the low priority context\n");
@@ -2374,7 +2364,6 @@ static int live_preempt_cancel(void *arg)
 
 static int live_suppress_self_preempt(void *arg)
 {
-	struct i915_sched_attr attr = { .priority = I915_PRIORITY_MAX };
 	struct intel_gt *gt = arg;
 	struct intel_engine_cs *engine;
 	struct preempt_client a, b;
@@ -2445,7 +2434,7 @@ static int live_suppress_self_preempt(void *arg)
 			i915_request_add(rq_b);
 
 			GEM_BUG_ON(i915_request_completed(rq_a));
-			engine->schedule(rq_a, &attr);
+			i915_request_set_priority(rq_a, I915_PRIORITY_MAX);
 			igt_spinner_end(&a.spin);
 
 			if (!igt_wait_for_spinner(&b.spin, rq_b)) {
@@ -2513,7 +2502,6 @@ static int live_chain_preempt(void *arg)
 		goto err_client_hi;
 
 	for_each_engine(engine, gt, id) {
-		struct i915_sched_attr attr = { .priority = I915_PRIORITY_MAX };
 		struct igt_live_test t;
 		struct i915_request *rq;
 		int ring_size, count, i;
@@ -2580,7 +2568,7 @@ static int live_chain_preempt(void *arg)
 
 			i915_request_get(rq);
 			i915_request_add(rq);
-			engine->schedule(rq, &attr);
+			i915_request_set_priority(rq, I915_PRIORITY_MAX);
 
 			igt_spinner_end(&hi.spin);
 			if (i915_request_wait(rq, 0, HZ / 5) < 0) {
@@ -2931,14 +2919,12 @@ static int live_preempt_gang(void *arg)
 			return -EIO;
 
 		do {
-			struct i915_sched_attr attr = { .priority = prio++ };
-
 			err = create_gang(engine, &rq);
 			if (err)
 				break;
 
 			/* Submit each spinner at increasing priority */
-			engine->schedule(rq, &attr);
+			i915_request_set_priority(rq, prio++);
 		} while (prio <= I915_PRIORITY_MAX &&
 			 !__igt_timeout(end_time, NULL));
 		pr_debug("%s: Preempt chain of %d requests\n",
@@ -3160,9 +3146,6 @@ static int preempt_user(struct intel_engine_cs *engine,
 			struct i915_vma *global,
 			int id)
 {
-	struct i915_sched_attr attr = {
-		.priority = I915_PRIORITY_MAX
-	};
 	struct i915_request *rq;
 	int err = 0;
 	u32 *cs;
@@ -3187,7 +3170,7 @@ static int preempt_user(struct intel_engine_cs *engine,
 	i915_request_get(rq);
 	i915_request_add(rq);
 
-	engine->schedule(rq, &attr);
+	i915_request_set_priority(rq, I915_PRIORITY_MAX);
 
 	if (i915_request_wait(rq, 0, HZ / 2) < 0)
 		err = -ETIME;
diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index 025666a6c67f..1c00edf427f0 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -1157,7 +1157,7 @@ __i915_request_await_execution(struct i915_request *to,
 	}
 
 	/* Couple the dependency tree for PI on this exposed to->fence */
-	if (to->engine->schedule) {
+	if (intel_engine_has_scheduler(to->engine)) {
 		err = i915_sched_node_add_dependency(&to->sched,
 						     &from->sched,
 						     I915_DEPENDENCY_WEAK);
@@ -1298,7 +1298,7 @@ i915_request_await_request(struct i915_request *to, struct i915_request *from)
 		return 0;
 	}
 
-	if (to->engine->schedule) {
+	if (intel_engine_has_scheduler(to->engine)) {
 		ret = i915_sched_node_add_dependency(&to->sched,
 						     &from->sched,
 						     I915_DEPENDENCY_EXTERNAL);
@@ -1485,7 +1485,7 @@ __i915_request_add_to_timeline(struct i915_request *rq)
 			__i915_sw_fence_await_dma_fence(&rq->submit,
 							&prev->fence,
 							&rq->dmaq);
-		if (rq->engine->schedule)
+		if (intel_engine_has_scheduler(rq->engine))
 			__i915_sched_node_add_dependency(&rq->sched,
 							 &prev->sched,
 							 &rq->dep,
@@ -1551,8 +1551,9 @@ void __i915_request_queue(struct i915_request *rq,
 	 * decide whether to preempt the entire chain so that it is ready to
 	 * run at the earliest possible convenience.
 	 */
-	if (attr && rq->engine->schedule)
-		rq->engine->schedule(rq, attr);
+	if (attr)
+		i915_request_set_priority(rq, attr->priority);
+
 	i915_sw_fence_commit(&rq->semaphore);
 	i915_sw_fence_commit(&rq->submit);
 }
diff --git a/drivers/gpu/drm/i915/i915_scheduler.c b/drivers/gpu/drm/i915/i915_scheduler.c
index a9973d7a724c..9f744f470556 100644
--- a/drivers/gpu/drm/i915/i915_scheduler.c
+++ b/drivers/gpu/drm/i915/i915_scheduler.c
@@ -216,10 +216,8 @@ static void kick_submission(struct intel_engine_cs *engine,
 	rcu_read_unlock();
 }
 
-static void __i915_schedule(struct i915_sched_node *node,
-			    const struct i915_sched_attr *attr)
+static void __i915_schedule(struct i915_sched_node *node, int prio)
 {
-	const int prio = max(attr->priority, node->attr.priority);
 	struct intel_engine_cs *engine;
 	struct i915_dependency *dep, *p;
 	struct i915_dependency stack;
@@ -233,6 +231,8 @@ static void __i915_schedule(struct i915_sched_node *node,
 	if (node_signaled(node))
 		return;
 
+	prio = max(prio, node->attr.priority);
+
 	stack.signaler = node;
 	list_add(&stack.dfs_link, &dfs);
 
@@ -286,7 +286,7 @@ static void __i915_schedule(struct i915_sched_node *node,
 	 */
 	if (node->attr.priority == I915_PRIORITY_INVALID) {
 		GEM_BUG_ON(!list_empty(&node->link));
-		node->attr = *attr;
+		node->attr.priority = prio;
 
 		if (stack.dfs_link.next == stack.dfs_link.prev)
 			return;
@@ -341,10 +341,13 @@ static void __i915_schedule(struct i915_sched_node *node,
 	spin_unlock(&engine->active.lock);
 }
 
-void i915_schedule(struct i915_request *rq, const struct i915_sched_attr *attr)
+void i915_request_set_priority(struct i915_request *rq, int prio)
 {
+	if (!intel_engine_has_scheduler(rq->engine))
+		return;
+
 	spin_lock_irq(&schedule_lock);
-	__i915_schedule(&rq->sched, attr);
+	__i915_schedule(&rq->sched, prio);
 	spin_unlock_irq(&schedule_lock);
 }
 
diff --git a/drivers/gpu/drm/i915/i915_scheduler.h b/drivers/gpu/drm/i915/i915_scheduler.h
index b089d5cace1d..c30bf8af045d 100644
--- a/drivers/gpu/drm/i915/i915_scheduler.h
+++ b/drivers/gpu/drm/i915/i915_scheduler.h
@@ -33,8 +33,7 @@ int i915_sched_node_add_dependency(struct i915_sched_node *node,
 
 void i915_sched_node_fini(struct i915_sched_node *node);
 
-void i915_schedule(struct i915_request *request,
-		   const struct i915_sched_attr *attr);
+void i915_request_set_priority(struct i915_request *request, int prio);
 
 struct list_head *
 i915_sched_lookup_priolist(struct intel_engine_cs *engine, int prio);
-- 
2.20.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [Intel-gfx] [PATCH 51/66] drm/i915/gt: Do not suspend bonded requests if one hangs
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (48 preceding siblings ...)
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 50/66] drm/i915: Replace engine->schedule() with a known request operation Chris Wilson
@ 2020-07-15 11:51 ` Chris Wilson
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 52/66] drm/i915: Teach the i915_dependency to use a double-lock Chris Wilson
                   ` (22 subsequent siblings)
  72 siblings, 0 replies; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:51 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

Treat the dependency between bonded requests as weak and leave the
remainder of the pair on the GPU if one hangs.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/gt/intel_lrc.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index dca6f8165ec7..fdeeed8b45d5 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -2683,6 +2683,9 @@ static void __execlists_hold(struct i915_request *rq)
 			struct i915_request *w =
 				container_of(p->waiter, typeof(*w), sched);
 
+			if (p->flags & I915_DEPENDENCY_WEAK)
+				continue;
+
 			/* Leave semaphores spinning on the other engines */
 			if (w->engine != rq->engine)
 				continue;
@@ -2778,6 +2781,9 @@ static void __execlists_unhold(struct i915_request *rq)
 			struct i915_request *w =
 				container_of(p->waiter, typeof(*w), sched);
 
+			if (p->flags & I915_DEPENDENCY_WEAK)
+				continue;
+
 			/* Propagate any change in error status */
 			if (rq->fence.error)
 				i915_request_set_error_once(w, rq->fence.error);
-- 
2.20.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [Intel-gfx] [PATCH 52/66] drm/i915: Teach the i915_dependency to use a double-lock
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (49 preceding siblings ...)
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 51/66] drm/i915/gt: Do not suspend bonded requests if one hangs Chris Wilson
@ 2020-07-15 11:51 ` Chris Wilson
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 53/66] drm/i915: Restructure priority inheritance Chris Wilson
                   ` (21 subsequent siblings)
  72 siblings, 0 replies; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:51 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

Currently, we construct and tear down the i915_dependency chains using a
global spinlock. As the lists are entirely local, it should be possible
to use a double-lock with an explicit nesting [signaler -> waiter,
always] and so avoid the costly convenience of a global spinlock.
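
For reference, the locking pattern being introduced is an ordered
double-lock: the signaler's lock is always the outer lock, and the
waiter's lock is taken nested inside it. A minimal sketch of that shape
(the helper name is illustrative only; the real change is in
__i915_sched_node_add_dependency below):

    /* Sketch only: lock order is always signaler (outer) -> waiter (inner). */
    static void add_dependency_locked(struct i915_sched_node *waiter,
                                      struct i915_sched_node *signal,
                                      struct i915_dependency *dep)
    {
            spin_lock_irq(&signal->lock);
            spin_lock_nested(&waiter->lock, SINGLE_DEPTH_NESTING);

            /* Publish both ends of the edge while holding both locks. */
            list_add_rcu(&dep->signal_link, &waiter->signalers_list);
            list_add_rcu(&dep->wait_link, &signal->waiters_list);

            spin_unlock(&waiter->lock);
            spin_unlock_irq(&signal->lock);
    }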

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/gt/intel_lrc.c         |  6 +--
 drivers/gpu/drm/i915/i915_request.c         |  2 +-
 drivers/gpu/drm/i915/i915_scheduler.c       | 44 +++++++++++++--------
 drivers/gpu/drm/i915/i915_scheduler.h       |  2 +-
 drivers/gpu/drm/i915/i915_scheduler_types.h |  1 +
 5 files changed, 34 insertions(+), 21 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index fdeeed8b45d5..2dd116c0d2a1 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -1831,7 +1831,7 @@ static void defer_request(struct i915_request *rq, struct list_head * const pl)
 			struct i915_request *w =
 				container_of(p->waiter, typeof(*w), sched);
 
-			if (p->flags & I915_DEPENDENCY_WEAK)
+			if (!p->waiter || p->flags & I915_DEPENDENCY_WEAK)
 				continue;
 
 			/* Leave semaphores spinning on the other engines */
@@ -2683,7 +2683,7 @@ static void __execlists_hold(struct i915_request *rq)
 			struct i915_request *w =
 				container_of(p->waiter, typeof(*w), sched);
 
-			if (p->flags & I915_DEPENDENCY_WEAK)
+			if (!p->waiter || p->flags & I915_DEPENDENCY_WEAK)
 				continue;
 
 			/* Leave semaphores spinning on the other engines */
@@ -2781,7 +2781,7 @@ static void __execlists_unhold(struct i915_request *rq)
 			struct i915_request *w =
 				container_of(p->waiter, typeof(*w), sched);
 
-			if (p->flags & I915_DEPENDENCY_WEAK)
+			if (!p->waiter || p->flags & I915_DEPENDENCY_WEAK)
 				continue;
 
 			/* Propagate any change in error status */
diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index 1c00edf427f0..6528ace4c0b7 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -334,7 +334,7 @@ bool i915_request_retire(struct i915_request *rq)
 	intel_context_unpin(rq->context);
 
 	free_capture_list(rq);
-	i915_sched_node_fini(&rq->sched);
+	i915_sched_node_retire(&rq->sched);
 	i915_request_put(rq);
 
 	return true;
diff --git a/drivers/gpu/drm/i915/i915_scheduler.c b/drivers/gpu/drm/i915/i915_scheduler.c
index 9f744f470556..2e4d512e61d8 100644
--- a/drivers/gpu/drm/i915/i915_scheduler.c
+++ b/drivers/gpu/drm/i915/i915_scheduler.c
@@ -353,6 +353,8 @@ void i915_request_set_priority(struct i915_request *rq, int prio)
 
 void i915_sched_node_init(struct i915_sched_node *node)
 {
+	spin_lock_init(&node->lock);
+
 	INIT_LIST_HEAD(&node->signalers_list);
 	INIT_LIST_HEAD(&node->waiters_list);
 	INIT_LIST_HEAD(&node->link);
@@ -390,7 +392,8 @@ bool __i915_sched_node_add_dependency(struct i915_sched_node *node,
 {
 	bool ret = false;
 
-	spin_lock_irq(&schedule_lock);
+	/* The signal->lock is always the outer lock in this double-lock. */
+	spin_lock_irq(&signal->lock);
 
 	if (!node_signaled(signal)) {
 		INIT_LIST_HEAD(&dep->dfs_link);
@@ -399,15 +402,17 @@ bool __i915_sched_node_add_dependency(struct i915_sched_node *node,
 		dep->flags = flags;
 
 		/* All set, now publish. Beware the lockless walkers. */
+		spin_lock_nested(&node->lock, SINGLE_DEPTH_NESTING);
 		list_add_rcu(&dep->signal_link, &node->signalers_list);
 		list_add_rcu(&dep->wait_link, &signal->waiters_list);
+		spin_unlock(&node->lock);
 
 		/* Propagate the chains */
 		node->flags |= signal->flags;
 		ret = true;
 	}
 
-	spin_unlock_irq(&schedule_lock);
+	spin_unlock_irq(&signal->lock);
 
 	return ret;
 }
@@ -433,39 +438,46 @@ int i915_sched_node_add_dependency(struct i915_sched_node *node,
 	return 0;
 }
 
-void i915_sched_node_fini(struct i915_sched_node *node)
+void i915_sched_node_retire(struct i915_sched_node *node)
 {
 	struct i915_dependency *dep, *tmp;
 
-	spin_lock_irq(&schedule_lock);
+	spin_lock_irq(&node->lock);
 
 	/*
 	 * Everyone we depended upon (the fences we wait to be signaled)
 	 * should retire before us and remove themselves from our list.
 	 * However, retirement is run independently on each timeline and
-	 * so we may be called out-of-order.
+	 * so we may be called out-of-order. As we need to avoid taking
+	 * the signaler's lock, just mark up our completion and be wary
+	 * in traversing the signalers->waiters_list.
 	 */
-	list_for_each_entry_safe(dep, tmp, &node->signalers_list, signal_link) {
-		GEM_BUG_ON(!list_empty(&dep->dfs_link));
-
-		list_del_rcu(&dep->wait_link);
-		if (dep->flags & I915_DEPENDENCY_ALLOC)
-			i915_dependency_free(dep);
+	list_for_each_entry(dep, &node->signalers_list, signal_link) {
+		GEM_BUG_ON(dep->waiter != node);
+		WRITE_ONCE(dep->waiter, NULL);
 	}
-	INIT_LIST_HEAD(&node->signalers_list);
+	INIT_LIST_HEAD_RCU(&node->signalers_list);
 
 	/* Remove ourselves from everyone who depends upon us */
 	list_for_each_entry_safe(dep, tmp, &node->waiters_list, wait_link) {
+		struct i915_sched_node *w;
+
 		GEM_BUG_ON(dep->signaler != node);
-		GEM_BUG_ON(!list_empty(&dep->dfs_link));
 
-		list_del_rcu(&dep->signal_link);
+		w = READ_ONCE(dep->waiter);
+		if (w) {
+			spin_lock_nested(&w->lock, SINGLE_DEPTH_NESTING);
+			if (READ_ONCE(dep->waiter))
+				list_del_rcu(&dep->signal_link);
+			spin_unlock(&w->lock);
+		}
+
 		if (dep->flags & I915_DEPENDENCY_ALLOC)
 			i915_dependency_free(dep);
 	}
-	INIT_LIST_HEAD(&node->waiters_list);
+	INIT_LIST_HEAD_RCU(&node->waiters_list);
 
-	spin_unlock_irq(&schedule_lock);
+	spin_unlock_irq(&node->lock);
 }
 
 static void i915_global_scheduler_shrink(void)
diff --git a/drivers/gpu/drm/i915/i915_scheduler.h b/drivers/gpu/drm/i915/i915_scheduler.h
index c30bf8af045d..53ac819cc786 100644
--- a/drivers/gpu/drm/i915/i915_scheduler.h
+++ b/drivers/gpu/drm/i915/i915_scheduler.h
@@ -31,7 +31,7 @@ int i915_sched_node_add_dependency(struct i915_sched_node *node,
 				   struct i915_sched_node *signal,
 				   unsigned long flags);
 
-void i915_sched_node_fini(struct i915_sched_node *node);
+void i915_sched_node_retire(struct i915_sched_node *node);
 
 void i915_request_set_priority(struct i915_request *request, int prio);
 
diff --git a/drivers/gpu/drm/i915/i915_scheduler_types.h b/drivers/gpu/drm/i915/i915_scheduler_types.h
index 343ed44d5ed4..3246430eb1c1 100644
--- a/drivers/gpu/drm/i915/i915_scheduler_types.h
+++ b/drivers/gpu/drm/i915/i915_scheduler_types.h
@@ -60,6 +60,7 @@ struct i915_sched_attr {
  * others.
  */
 struct i915_sched_node {
+	spinlock_t lock; /* protect the lists */
 	struct list_head signalers_list; /* those before us, we depend upon */
 	struct list_head waiters_list; /* those after us, they depend upon us */
 	struct list_head link;
-- 
2.20.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [Intel-gfx] [PATCH 53/66] drm/i915: Restructure priority inheritance
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (50 preceding siblings ...)
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 52/66] drm/i915: Teach the i915_dependency to use a double-lock Chris Wilson
@ 2020-07-15 11:51 ` Chris Wilson
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 54/66] drm/i915/gt: Remove timeslice suppression Chris Wilson
                   ` (20 subsequent siblings)
  72 siblings, 0 replies; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:51 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

In anticipation of wanting to be able to apply priority inheritance (pi)
from underneath an engine's active.lock, rework the priority inheritance
to work primarily along an engine's priority queue, delegating to a
worker any other engine that the chain may traverse. This reduces the
global spinlock from governing the entire priority inheritance
depth-first search to a small lock around a single list.
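
As a condensed sketch of the cross-engine hand-off (names follow the
diff below; the completed/error checks are omitted): when the
depth-first search meets a signaler that lives on another engine, the
edge is parked on a small global list and an irq_work re-applies the
bump from that engine's side:

    /*
     * Sketch: called with irqs off under the current engine's active.lock;
     * defer bumping a signaler owned by a different engine.
     */
    static void defer_cross_engine_bump(struct i915_dependency *p, int prio)
    {
            spin_lock(&ipi_lock);
            if (prio > p->ipi_priority) {
                    p->ipi_priority = prio;
                    list_move(&p->ipi_link, &ipi_list);
                    irq_work_queue(&ipi_work); /* ipi_schedule() finishes the job */
            }
            spin_unlock(&ipi_lock);
    }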

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/i915_scheduler.c       | 277 +++++++++++---------
 drivers/gpu/drm/i915/i915_scheduler_types.h |   6 +-
 2 files changed, 162 insertions(+), 121 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_scheduler.c b/drivers/gpu/drm/i915/i915_scheduler.c
index 2e4d512e61d8..3f261b4fee66 100644
--- a/drivers/gpu/drm/i915/i915_scheduler.c
+++ b/drivers/gpu/drm/i915/i915_scheduler.c
@@ -17,7 +17,65 @@ static struct i915_global_scheduler {
 	struct kmem_cache *slab_priorities;
 } global;
 
-static DEFINE_SPINLOCK(schedule_lock);
+static DEFINE_SPINLOCK(ipi_lock);
+static LIST_HEAD(ipi_list);
+
+static inline int rq_prio(const struct i915_request *rq)
+{
+	return READ_ONCE(rq->sched.attr.priority);
+}
+
+static void ipi_schedule(struct irq_work *wrk)
+{
+	rcu_read_lock();
+	do {
+		struct i915_dependency *p;
+		struct i915_request *rq;
+		unsigned long flags;
+		int prio;
+
+		spin_lock_irqsave(&ipi_lock, flags);
+		p = list_first_entry_or_null(&ipi_list, typeof(*p), ipi_link);
+		if (p) {
+			rq = container_of(p->signaler, typeof(*rq), sched);
+			list_del_init(&p->ipi_link);
+
+			prio = p->ipi_priority;
+			p->ipi_priority = I915_PRIORITY_INVALID;
+		}
+		spin_unlock_irqrestore(&ipi_lock, flags);
+		if (!p)
+			break;
+
+		if (i915_request_completed(rq))
+			continue;
+
+		i915_request_set_priority(rq, prio);
+	} while (1);
+	rcu_read_unlock();
+}
+
+static DEFINE_IRQ_WORK(ipi_work, ipi_schedule);
+
+/*
+ * Virtual engines complicate acquiring the engine timeline lock,
+ * as their rq->engine pointer is not stable until under that
+ * engine lock. The simple ploy we use is to take the lock then
+ * check that the rq still belongs to the newly locked engine.
+ */
+#define lock_engine_irqsave(rq, flags) ({ \
+	struct i915_request * const rq__ = (rq); \
+	struct intel_engine_cs *engine__ = READ_ONCE(rq__->engine); \
+\
+	spin_lock_irqsave(&engine__->active.lock, (flags)); \
+	while (engine__ != READ_ONCE((rq__)->engine)) { \
+		spin_unlock(&engine__->active.lock); \
+		engine__ = READ_ONCE(rq__->engine); \
+		spin_lock(&engine__->active.lock); \
+	} \
+\
+	engine__; \
+})
 
 static const struct i915_request *
 node_to_request(const struct i915_sched_node *node)
@@ -126,42 +184,6 @@ void __i915_priolist_free(struct i915_priolist *p)
 	kmem_cache_free(global.slab_priorities, p);
 }
 
-struct sched_cache {
-	struct list_head *priolist;
-};
-
-static struct intel_engine_cs *
-sched_lock_engine(const struct i915_sched_node *node,
-		  struct intel_engine_cs *locked,
-		  struct sched_cache *cache)
-{
-	const struct i915_request *rq = node_to_request(node);
-	struct intel_engine_cs *engine;
-
-	GEM_BUG_ON(!locked);
-
-	/*
-	 * Virtual engines complicate acquiring the engine timeline lock,
-	 * as their rq->engine pointer is not stable until under that
-	 * engine lock. The simple ploy we use is to take the lock then
-	 * check that the rq still belongs to the newly locked engine.
-	 */
-	while (locked != (engine = READ_ONCE(rq->engine))) {
-		spin_unlock(&locked->active.lock);
-		memset(cache, 0, sizeof(*cache));
-		spin_lock(&engine->active.lock);
-		locked = engine;
-	}
-
-	GEM_BUG_ON(locked != engine);
-	return locked;
-}
-
-static inline int rq_prio(const struct i915_request *rq)
-{
-	return rq->sched.attr.priority;
-}
-
 static inline bool need_preempt(int prio, int active)
 {
 	/*
@@ -216,25 +238,15 @@ static void kick_submission(struct intel_engine_cs *engine,
 	rcu_read_unlock();
 }
 
-static void __i915_schedule(struct i915_sched_node *node, int prio)
+static void __i915_request_set_priority(struct i915_request *rq, int prio)
 {
-	struct intel_engine_cs *engine;
-	struct i915_dependency *dep, *p;
-	struct i915_dependency stack;
-	struct sched_cache cache;
+	struct intel_engine_cs *engine = rq->engine;
+	struct i915_request *rn;
+	struct list_head *plist;
 	LIST_HEAD(dfs);
 
-	/* Needed in order to use the temporary link inside i915_dependency */
-	lockdep_assert_held(&schedule_lock);
-	GEM_BUG_ON(prio == I915_PRIORITY_INVALID);
-
-	if (node_signaled(node))
-		return;
-
-	prio = max(prio, node->attr.priority);
-
-	stack.signaler = node;
-	list_add(&stack.dfs_link, &dfs);
+	lockdep_assert_held(&engine->active.lock);
+	list_add(&rq->sched.dfs, &dfs);
 
 	/*
 	 * Recursively bump all dependent priorities to match the new request.
@@ -254,66 +266,47 @@ static void __i915_schedule(struct i915_sched_node *node, int prio)
 	 * end result is a topological list of requests in reverse order, the
 	 * last element in the list is the request we must execute first.
 	 */
-	list_for_each_entry(dep, &dfs, dfs_link) {
-		struct i915_sched_node *node = dep->signaler;
+	list_for_each_entry(rq, &dfs, sched.dfs) {
+		struct i915_dependency *p;
 
-		/* If we are already flying, we know we have no signalers */
-		if (node_started(node))
-			continue;
+		/* Also release any children on this engine that are ready */
+		GEM_BUG_ON(rq->engine != engine);
 
-		/*
-		 * Within an engine, there can be no cycle, but we may
-		 * refer to the same dependency chain multiple times
-		 * (redundant dependencies are not eliminated) and across
-		 * engines.
-		 */
-		list_for_each_entry(p, &node->signalers_list, signal_link) {
-			GEM_BUG_ON(p == dep); /* no cycles! */
+		for_each_signaler(p, rq) {
+			struct i915_request *s =
+				container_of(p->signaler, typeof(*s), sched);
 
-			if (node_signaled(p->signaler))
-				continue;
+			GEM_BUG_ON(s == rq);
 
-			if (prio > READ_ONCE(p->signaler->attr.priority))
-				list_move_tail(&p->dfs_link, &dfs);
-		}
-	}
+			if (rq_prio(s) >= prio)
+				continue;
 
-	/*
-	 * If we didn't need to bump any existing priorities, and we haven't
-	 * yet submitted this request (i.e. there is no potential race with
-	 * execlists_submit_request()), we can set our own priority and skip
-	 * acquiring the engine locks.
-	 */
-	if (node->attr.priority == I915_PRIORITY_INVALID) {
-		GEM_BUG_ON(!list_empty(&node->link));
-		node->attr.priority = prio;
+			if (i915_request_completed(s))
+				continue;
 
-		if (stack.dfs_link.next == stack.dfs_link.prev)
-			return;
+			if (s->engine != rq->engine) {
+				spin_lock(&ipi_lock);
+				if (prio > p->ipi_priority) {
+					p->ipi_priority = prio;
+					list_move(&p->ipi_link, &ipi_list);
+					irq_work_queue(&ipi_work);
+				}
+				spin_unlock(&ipi_lock);
+				continue;
+			}
 
-		__list_del_entry(&stack.dfs_link);
+			list_move_tail(&s->sched.dfs, &dfs);
+		}
 	}
 
-	memset(&cache, 0, sizeof(cache));
-	engine = node_to_request(node)->engine;
-	spin_lock(&engine->active.lock);
+	plist = i915_sched_lookup_priolist(engine, prio);
 
-	/* Fifo and depth-first replacement ensure our deps execute before us */
-	engine = sched_lock_engine(node, engine, &cache);
-	list_for_each_entry_safe_reverse(dep, p, &dfs, dfs_link) {
-		INIT_LIST_HEAD(&dep->dfs_link);
+	/* Fifo and depth-first replacement ensure our deps execute first */
+	list_for_each_entry_safe_reverse(rq, rn, &dfs, sched.dfs) {
+		GEM_BUG_ON(rq->engine != engine);
 
-		node = dep->signaler;
-		engine = sched_lock_engine(node, engine, &cache);
-		lockdep_assert_held(&engine->active.lock);
-
-		/* Recheck after acquiring the engine->timeline.lock */
-		if (prio <= node->attr.priority || node_signaled(node))
-			continue;
-
-		GEM_BUG_ON(node_to_request(node)->engine != engine);
-
-		WRITE_ONCE(node->attr.priority, prio);
+		INIT_LIST_HEAD(&rq->sched.dfs);
+		WRITE_ONCE(rq->sched.attr.priority, prio);
 
 		/*
 		 * Once the request is ready, it will be placed into the
@@ -323,32 +316,70 @@ static void __i915_schedule(struct i915_sched_node *node, int prio)
 		 * any preemption required, be dealt with upon submission.
 		 * See engine->submit_request()
 		 */
-		if (list_empty(&node->link))
+		if (!i915_request_is_ready(rq))
 			continue;
 
-		if (i915_request_in_priority_queue(node_to_request(node))) {
-			if (!cache.priolist)
-				cache.priolist =
-					i915_sched_lookup_priolist(engine,
-								   prio);
-			list_move_tail(&node->link, cache.priolist);
-		}
+		if (i915_request_in_priority_queue(rq))
+			list_move_tail(&rq->sched.link, plist);
 
-		/* Defer (tasklet) submission until after all of our updates. */
-		kick_submission(engine, node_to_request(node), prio);
+		/* Defer (tasklet) submission until after all updates. */
+		kick_submission(engine, rq, prio);
 	}
-
-	spin_unlock(&engine->active.lock);
 }
 
 void i915_request_set_priority(struct i915_request *rq, int prio)
 {
-	if (!intel_engine_has_scheduler(rq->engine))
+	struct intel_engine_cs *engine;
+	unsigned long flags;
+
+	if (prio <= rq_prio(rq))
 		return;
 
-	spin_lock_irq(&schedule_lock);
-	__i915_schedule(&rq->sched, prio);
-	spin_unlock_irq(&schedule_lock);
+	/*
+	 * If we are setting the priority before being submitted, see if we
+	 * can quickly adjust our own priority in-situ and avoid taking
+	 * the contended engine->active.lock. If we need priority inheritance,
+	 * take the slow route.
+	 */
+	if (rq_prio(rq) == I915_PRIORITY_INVALID) {
+		struct i915_dependency *p;
+
+		rcu_read_lock();
+		for_each_signaler(p, rq) {
+			struct i915_request *s =
+				container_of(p->signaler, typeof(*s), sched);
+
+			if (rq_prio(s) >= prio)
+				continue;
+
+			if (i915_request_completed(s))
+				continue;
+
+			break;
+		}
+		rcu_read_unlock();
+
+		if (&p->signal_link == &rq->sched.signalers_list &&
+		    cmpxchg(&rq->sched.attr.priority,
+			    I915_PRIORITY_INVALID,
+			    prio) == I915_PRIORITY_INVALID)
+			return;
+	}
+
+	engine = lock_engine_irqsave(rq, flags);
+	if (!intel_engine_has_scheduler(engine))
+		goto unlock;
+
+	if (i915_request_completed(rq))
+		goto unlock;
+
+	if (prio <= rq_prio(rq))
+		goto unlock;
+
+	__i915_request_set_priority(rq, prio);
+
+unlock:
+	spin_unlock_irqrestore(&engine->active.lock, flags);
 }
 
 void i915_sched_node_init(struct i915_sched_node *node)
@@ -358,6 +389,7 @@ void i915_sched_node_init(struct i915_sched_node *node)
 	INIT_LIST_HEAD(&node->signalers_list);
 	INIT_LIST_HEAD(&node->waiters_list);
 	INIT_LIST_HEAD(&node->link);
+	INIT_LIST_HEAD(&node->dfs);
 
 	i915_sched_node_reinit(node);
 }
@@ -396,7 +428,8 @@ bool __i915_sched_node_add_dependency(struct i915_sched_node *node,
 	spin_lock_irq(&signal->lock);
 
 	if (!node_signaled(signal)) {
-		INIT_LIST_HEAD(&dep->dfs_link);
+		INIT_LIST_HEAD(&dep->ipi_link);
+		dep->ipi_priority = I915_PRIORITY_INVALID;
 		dep->signaler = signal;
 		dep->waiter = node;
 		dep->flags = flags;
@@ -464,6 +497,12 @@ void i915_sched_node_retire(struct i915_sched_node *node)
 
 		GEM_BUG_ON(dep->signaler != node);
 
+		if (unlikely(!list_empty(&dep->ipi_link))) {
+			spin_lock(&ipi_lock);
+			list_del(&dep->ipi_link);
+			spin_unlock(&ipi_lock);
+		}
+
 		w = READ_ONCE(dep->waiter);
 		if (w) {
 			spin_lock_nested(&w->lock, SINGLE_DEPTH_NESTING);
diff --git a/drivers/gpu/drm/i915/i915_scheduler_types.h b/drivers/gpu/drm/i915/i915_scheduler_types.h
index 3246430eb1c1..ce60577df2bf 100644
--- a/drivers/gpu/drm/i915/i915_scheduler_types.h
+++ b/drivers/gpu/drm/i915/i915_scheduler_types.h
@@ -63,7 +63,8 @@ struct i915_sched_node {
 	spinlock_t lock; /* protect the lists */
 	struct list_head signalers_list; /* those before us, we depend upon */
 	struct list_head waiters_list; /* those after us, they depend upon us */
-	struct list_head link;
+	struct list_head link; /* guarded by engine->active.lock */
+	struct list_head dfs; /* guarded by engine->active.lock */
 	struct i915_sched_attr attr;
 	unsigned int flags;
 #define I915_SCHED_HAS_EXTERNAL_CHAIN	BIT(0)
@@ -75,11 +76,12 @@ struct i915_dependency {
 	struct i915_sched_node *waiter;
 	struct list_head signal_link;
 	struct list_head wait_link;
-	struct list_head dfs_link;
+	struct list_head ipi_link;
 	unsigned long flags;
 #define I915_DEPENDENCY_ALLOC		BIT(0)
 #define I915_DEPENDENCY_EXTERNAL	BIT(1)
 #define I915_DEPENDENCY_WEAK		BIT(2)
+	int ipi_priority;
 };
 
 #define for_each_waiter(p__, rq__) \
-- 
2.20.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [Intel-gfx] [PATCH 54/66] drm/i915/gt: Remove timeslice suppression
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (51 preceding siblings ...)
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 53/66] drm/i915: Restructure priority inheritance Chris Wilson
@ 2020-07-15 11:51 ` Chris Wilson
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 55/66] drm/i915: Fair low-latency scheduling Chris Wilson
                   ` (19 subsequent siblings)
  72 siblings, 0 replies; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:51 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

In the next patch, we remove the strict priority system and continuously
re-evaluate the relative priority of tasks. As such, we need to enable
the timeslice whenever there is more than one context in the pipeline.
This simplifies the decision and removes some of the tweaks to suppress
timeslicing, allowing us to lift the timeslice enabling to a common spot
at the end of the submission tasklet.
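
The enabling decision then reduces to "is there anything else that could
run on this engine?"; condensed from the needs_timeslice() added below:

    /* Condensed form of the check added by this patch. */
    static bool needs_timeslice(const struct intel_engine_cs *engine,
                                const struct i915_request *rq)
    {
            if (!rq || i915_request_completed(rq))
                    return false; /* idle or switching, wait for the next event */

            if (READ_ONCE(engine->execlists.pending[0]))
                    return false; /* wait for the HW to ack the submission */

            /* Slice whenever ELSP[1], the queue or a virtual engine waits. */
            return !list_is_last_rcu(&rq->sched.link, &engine->active.requests) ||
                   !RB_EMPTY_ROOT(&engine->execlists.queue.rb_root) ||
                   !RB_EMPTY_ROOT(&engine->execlists.virtual.rb_root);
    }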

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/gt/intel_engine_types.h |  10 --
 drivers/gpu/drm/i915/gt/intel_lrc.c          | 146 +++++++------------
 2 files changed, 52 insertions(+), 104 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h b/drivers/gpu/drm/i915/gt/intel_engine_types.h
index a0ed041cfab4..354e01c560f2 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
@@ -236,16 +236,6 @@ struct intel_engine_execlists {
 	 */
 	unsigned int port_mask;
 
-	/**
-	 * @switch_priority_hint: Second context priority.
-	 *
-	 * We submit multiple contexts to the HW simultaneously and would
-	 * like to occasionally switch between them to emulate timeslicing.
-	 * To know when timeslicing is suitable, we track the priority of
-	 * the context submitted second.
-	 */
-	int switch_priority_hint;
-
 	/**
 	 * @queue_priority_hint: Highest pending priority.
 	 *
diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index 2dd116c0d2a1..29072215635e 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -1869,25 +1869,6 @@ static void defer_active(struct intel_engine_cs *engine)
 	defer_request(rq, i915_sched_lookup_priolist(engine, rq_prio(rq)));
 }
 
-static bool
-need_timeslice(const struct intel_engine_cs *engine,
-	       const struct i915_request *rq)
-{
-	int hint;
-
-	if (!intel_engine_has_timeslices(engine))
-		return false;
-
-	hint = max(engine->execlists.queue_priority_hint,
-		   virtual_prio(&engine->execlists));
-
-	if (!list_is_last(&rq->sched.link, &engine->active.requests))
-		hint = max(hint, rq_prio(list_next_entry(rq, sched.link)));
-
-	GEM_BUG_ON(hint >= I915_PRIORITY_UNPREEMPTABLE);
-	return hint >= effective_prio(rq);
-}
-
 static bool
 timeslice_yield(const struct intel_engine_execlists *el,
 		const struct i915_request *rq)
@@ -1907,76 +1888,63 @@ timeslice_yield(const struct intel_engine_execlists *el,
 	return rq->context->lrc.ccid == READ_ONCE(el->yield);
 }
 
-static bool
-timeslice_expired(const struct intel_engine_execlists *el,
-		  const struct i915_request *rq)
+static bool needs_timeslice(const struct intel_engine_cs *engine,
+			    const struct i915_request *rq)
 {
-	return timer_expired(&el->timer) || timeslice_yield(el, rq);
-}
+	/* If not currently active, or about to switch, wait for next event */
+	if (!rq || i915_request_completed(rq))
+		return false;
 
-static int
-switch_prio(struct intel_engine_cs *engine, const struct i915_request *rq)
-{
-	if (list_is_last(&rq->sched.link, &engine->active.requests))
-		return engine->execlists.queue_priority_hint;
+	/* We do not need to start the timeslice until after the ACK */
+	if (READ_ONCE(engine->execlists.pending[0]))
+		return false;
 
-	return rq_prio(list_next_entry(rq, sched.link));
-}
+	/* If ELSP[1] is occupied, always check to see if worth slicing */
+	if (!list_is_last_rcu(&rq->sched.link, &engine->active.requests))
+		return true;
 
-static inline unsigned long
-timeslice(const struct intel_engine_cs *engine)
-{
-	return READ_ONCE(engine->props.timeslice_duration_ms);
+	/* Otherwise, ELSP[0] is by itself, but may be waiting in the queue */
+	if (!RB_EMPTY_ROOT(&engine->execlists.queue.rb_root))
+		return true;
+
+	return !RB_EMPTY_ROOT(&engine->execlists.virtual.rb_root);
 }
 
-static unsigned long active_timeslice(const struct intel_engine_cs *engine)
+static bool
+timeslice_expired(struct intel_engine_cs *engine, const struct i915_request *rq)
 {
-	const struct intel_engine_execlists *execlists = &engine->execlists;
-	const struct i915_request *rq = *execlists->active;
+	const struct intel_engine_execlists *el = &engine->execlists;
 
-	if (!rq || i915_request_completed(rq))
-		return 0;
+	if (!intel_engine_has_timeslices(engine))
+		return false;
 
-	if (READ_ONCE(execlists->switch_priority_hint) < effective_prio(rq))
-		return 0;
+	if (i915_request_has_nopreempt(rq) && i915_request_started(rq))
+		return false;
+
+	if (!needs_timeslice(engine, rq))
+		return false;
 
-	return timeslice(engine);
+	return timer_expired(&el->timer) || timeslice_yield(el, rq);
 }
 
-static void set_timeslice(struct intel_engine_cs *engine)
+static unsigned long timeslice(const struct intel_engine_cs *engine)
 {
-	unsigned long duration;
-
-	if (!intel_engine_has_timeslices(engine))
-		return;
-
-	duration = active_timeslice(engine);
-	ENGINE_TRACE(engine, "bump timeslicing, interval:%lu", duration);
-
-	set_timer_ms(&engine->execlists.timer, duration);
+	return READ_ONCE(engine->props.timeslice_duration_ms);
 }
 
-static void start_timeslice(struct intel_engine_cs *engine, int prio)
+static void start_timeslice(struct intel_engine_cs *engine)
 {
-	struct intel_engine_execlists *execlists = &engine->execlists;
 	unsigned long duration;
 
 	if (!intel_engine_has_timeslices(engine))
 		return;
 
-	WRITE_ONCE(execlists->switch_priority_hint, prio);
-	if (prio == INT_MIN)
-		return;
-
-	if (timer_pending(&execlists->timer))
-		return;
-
-	duration = timeslice(engine);
-	ENGINE_TRACE(engine,
-		     "start timeslicing, prio:%d, interval:%lu",
-		     prio, duration);
+	/* Disable the timer if there is nothing to switch to */
+	duration = 0;
+	if (needs_timeslice(engine, execlists_active(&engine->execlists)))
+		duration = timeslice(engine);
 
-	set_timer_ms(&execlists->timer, duration);
+	set_timer_ms(&engine->execlists.timer, duration);
 }
 
 static void record_preemption(struct intel_engine_execlists *execlists)
@@ -2090,13 +2058,12 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
 			__unwind_incomplete_requests(engine);
 
 			last = NULL;
-		} else if (need_timeslice(engine, last) &&
-			   timeslice_expired(execlists, last)) {
+		} else if (timeslice_expired(engine, last)) {
 			ENGINE_TRACE(engine,
-				     "expired last=%llx:%lld, prio=%d, hint=%d, yield?=%s\n",
-				     last->fence.context,
-				     last->fence.seqno,
-				     last->sched.attr.priority,
+				     "expired:%s last=%llx:%lld, prio=%d, hint=%d, yield?=%s\n",
+				     yesno(timer_expired(&execlists->timer)),
+				     last->fence.context, last->fence.seqno,
+				     rq_prio(last),
 				     execlists->queue_priority_hint,
 				     yesno(timeslice_yield(execlists, last)));
 
@@ -2136,7 +2103,6 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
 				 */
 				spin_unlock_irqrestore(&engine->active.lock,
 						       flags);
-				start_timeslice(engine, queue_prio(execlists));
 				return;
 			}
 		}
@@ -2165,7 +2131,6 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
 		if (last && !can_merge_rq(last, rq)) {
 			spin_unlock(&ve->base.active.lock);
 			spin_unlock_irqrestore(&engine->active.lock, flags);
-			start_timeslice(engine, rq_prio(rq));
 			return; /* leave this for another sibling */
 		}
 
@@ -2340,28 +2305,22 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
 	execlists->queue_priority_hint = queue_prio(execlists);
 	spin_unlock_irqrestore(&engine->active.lock, flags);
 
-	if (submit) {
-		/*
-		 * Skip if we ended up with exactly the same set of requests,
-		 * e.g. trying to timeslice a pair of ordered contexts
-		 */
-		if (!memcmp(active, execlists->pending,
-			    (port - execlists->pending) * sizeof(*port)))
-			goto skip_submit;
-
+	/*
+	 * We can skip poking the HW if we ended up with exactly the same set
+	 * of requests as currently running, e.g. trying to timeslice a pair
+	 * of ordered contexts.
+	 */
+	if (submit &&
+	    memcmp(active, execlists->pending,
+		   (port - execlists->pending) * sizeof(*port))) {
 		*port = NULL;
 		while (port-- != execlists->pending)
 			execlists_schedule_in(*port, port - execlists->pending);
 
-		execlists->switch_priority_hint =
-			switch_prio(engine, *execlists->pending);
-
 		WRITE_ONCE(execlists->yield, -1);
 		set_preempt_timeout(engine, *active);
 		execlists_submit_ports(engine);
 	} else {
-		start_timeslice(engine, execlists->queue_priority_hint);
-skip_submit:
 		ring_set_paused(engine, 0);
 		while (port-- != execlists->pending)
 			i915_request_put(*port);
@@ -2639,8 +2598,6 @@ process_csb(struct intel_engine_cs *engine, struct i915_request **inactive)
 		}
 	} while (head != tail);
 
-	set_timeslice(engine);
-
 	/*
 	 * Gen11 has proven to fail wrt global observation point between
 	 * entry and tail update, failing on the ordering and thus
@@ -3067,6 +3024,7 @@ static void execlists_submission_tasklet(unsigned long data)
 		execlists_dequeue(engine);
 
 	post_process_csb(post, inactive);
+	start_timeslice(engine);
 }
 
 static void __execlists_kick(struct intel_engine_execlists *execlists)
@@ -3139,6 +3097,9 @@ static void execlists_submit_request(struct i915_request *request)
 	}
 
 	spin_unlock_irqrestore(&engine->active.lock, flags);
+
+	if (!timer_pending(&engine->execlists.timer))
+		start_timeslice(engine);
 }
 
 static void __execlists_context_fini(struct intel_context *ce)
@@ -5818,9 +5779,6 @@ void intel_execlists_show_requests(struct intel_engine_cs *engine,
 		show_request(m, last, "\t\tE ");
 	}
 
-	if (execlists->switch_priority_hint != INT_MIN)
-		drm_printf(m, "\t\tSwitch priority hint: %d\n",
-			   READ_ONCE(execlists->switch_priority_hint));
 	if (execlists->queue_priority_hint != INT_MIN)
 		drm_printf(m, "\t\tQueue priority hint: %d\n",
 			   READ_ONCE(execlists->queue_priority_hint));
-- 
2.20.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [Intel-gfx] [PATCH 55/66] drm/i915: Fair low-latency scheduling
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (52 preceding siblings ...)
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 54/66] drm/i915/gt: Remove timeslice suppression Chris Wilson
@ 2020-07-15 11:51 ` Chris Wilson
  2020-07-15 15:33   ` [Intel-gfx] [PATCH] " Chris Wilson
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 56/66] drm/i915/gt: Specify a deadline for the heartbeat Chris Wilson
                   ` (18 subsequent siblings)
  72 siblings, 1 reply; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:51 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

The first "scheduler" was a topographical sorting of requests into
priority order. The execution order was deterministic, the earliest
submitted, highest priority request would be executed first. Priority
inherited ensured that inversions were kept at bay, and allowed us to
dynamically boost priorities (e.g. for interactive pageflips).

The minimalistic timeslicing scheme was an attempt to introduce fairness
between long running requests, by evicting the active request at the end
of a timeslice and moving it to the back of its priority queue (while
ensuring that dependencies were kept in order). For short running
requests from many clients of equal priority, the scheme is still very
much FIFO submission ordering, and as unfair as before.

To impose fairness, we need an external metric that ensures that clients
are interspersed, so that we don't execute one long chain from client A
before executing any of client B. This could be imposed by the clients
using fences based on an external clock; that is, they only submit work
for a "frame" at frame-interval, instead of submitting as much work as
they are able to. The standard SwapBuffers approach is akin to double
buffering, where, as one frame is being executed, the next is being
submitted, such that there is always a maximum of two frames per client
in the pipeline. Even this scheme exhibits unfairness under load, as a
single client will execute two frames back to back before the next, and
with enough clients, deadlines will be missed.

The idea introduced by BFS/MuQSS is that fairness is achieved by
metering with an external clock. Every request, when it becomes ready to
execute, is assigned a virtual deadline, and execution order is then
determined by earliest deadline. Priority is used as a hint, rather than
strict ordering, where high priority requests have earlier deadlines,
but not necessarily earlier than outstanding work. Thus work is executed
in order of 'readiness', with timeslicing to demote long running work.
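
As a rough illustration only (the example_* helpers below are invented
for this sketch and are not the mapping used by the patch, which lives
in the i915_scheduler.c changes), the sort key has the shape "time of
readiness plus a priority-scaled slack", so higher priority earns an
earlier deadline without leapfrogging work that became ready long ago:

    /* Illustrative only -- not the formula used by this patch. */
    static u64 example_slack_ns(int prio)
    {
            u64 base = NSEC_PER_MSEC; /* nominal slack at default priority */

            /* Higher priority => smaller slack => earlier deadline. */
            return prio >= 0 ? base >> min(prio, 20) : base << min(-prio, 20);
    }

    static u64 example_deadline_ns(u64 ready_ns, int prio)
    {
            return ready_ns + example_slack_ns(prio);
    }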

The Achilles' heel of this scheduler is its strong preference for
low latency and its favouring of new queues. Whereas it was easy to
dominate the old scheduler by flooding it with many requests over a
short period of time, the new scheduler can be dominated by a
'synchronous' client that waits for each of its requests to complete
before submitting the next. As such a client has no history, it is
always considered ready-to-run and receives an earlier deadline than
the long running requests.

To check the impact on throughput (often the downfall of latency
sensitive schedulers), we used gem_wsim to simulate various transcode
workloads with different load balancers, while varying the number of
competing [heterogeneous] clients.

+mB--------------------------------------------------------------------+
|                               a                                      |
|                             cda                                      |
|                             c.a                                      |
|                             ..aa                                     |
|                           ..---.                                     |
|                           -.--+-.                                    |
|                        .c.-.-+++.  b                                 |
|               b    bb.d-c-+--+++.aab aa    b b                       |
|b  b   b   b  b.  b ..---+++-+++++....a. b. b b   b       b    b     b|
1                               A|                                     |
2                         |___AM____|                                  |
3                            |A__|                                     |
4                            |MA_|                                     |
+----------------------------------------------------------------------+
Clients   Min       Max     Median           Avg        Stddev
1       -8.20       5.4     -0.045      -0.02375   0.094722134
2      -15.96     19.28      -0.64         -1.05     2.2428076
4       -5.11      2.95      -1.15    -1.0683333    0.72382651
8       -5.63      1.85     -0.905   -0.87122449    0.73390971

The impact was on average 1% under contention, due to the change in
context execution order and the number of context switches.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/gt/intel_engine_cs.c     |  12 +-
 .../gpu/drm/i915/gt/intel_engine_heartbeat.c  |   1 +
 drivers/gpu/drm/i915/gt/intel_engine_pm.c     |   4 +-
 drivers/gpu/drm/i915/gt/intel_engine_types.h  |  14 -
 drivers/gpu/drm/i915/gt/intel_lrc.c           | 230 +++++-------
 drivers/gpu/drm/i915/gt/selftest_hangcheck.c  |   5 +-
 drivers/gpu/drm/i915/gt/selftest_lrc.c        |  41 +-
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c |   6 +-
 drivers/gpu/drm/i915/i915_priolist_types.h    |   7 +-
 drivers/gpu/drm/i915/i915_request.c           |   1 +
 drivers/gpu/drm/i915/i915_scheduler.c         | 352 +++++++++++++-----
 drivers/gpu/drm/i915/i915_scheduler.h         |  24 +-
 drivers/gpu/drm/i915/i915_scheduler_types.h   |  17 +
 .../drm/i915/selftests/i915_mock_selftests.h  |   1 +
 drivers/gpu/drm/i915/selftests/i915_request.c |   1 +
 .../gpu/drm/i915/selftests/i915_scheduler.c   |  49 +++
 16 files changed, 501 insertions(+), 264 deletions(-)
 create mode 100644 drivers/gpu/drm/i915/selftests/i915_scheduler.c

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index 8bd87ca918d0..af9cc42d3061 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -588,7 +588,6 @@ void intel_engine_init_execlists(struct intel_engine_cs *engine)
 	execlists->active =
 		memset(execlists->inflight, 0, sizeof(execlists->inflight));
 
-	execlists->queue_priority_hint = INT_MIN;
 	execlists->queue = RB_ROOT_CACHED;
 }
 
@@ -1274,14 +1273,15 @@ bool intel_engine_can_store_dword(struct intel_engine_cs *engine)
 	}
 }
 
-static int print_sched_attr(const struct i915_sched_attr *attr,
-			    char *buf, int x, int len)
+static int print_sched(const struct i915_sched_node *node,
+		       char *buf, int x, int len)
 {
-	if (attr->priority == I915_PRIORITY_INVALID)
+	if (node->attr.priority == I915_PRIORITY_INVALID)
 		return x;
 
 	x += snprintf(buf + x, len - x,
-		      " prio=%d", attr->priority);
+		      " prio=%d, dl=%llu",
+		      node->attr.priority, node->deadline);
 
 	return x;
 }
@@ -1294,7 +1294,7 @@ static void print_request(struct drm_printer *m,
 	char buf[80] = "";
 	int x = 0;
 
-	x = print_sched_attr(&rq->sched.attr, buf, x, sizeof(buf));
+	x = print_sched(&rq->sched, buf, x, sizeof(buf));
 
 	drm_printf(m, "%s %llx:%llx%s%s %s @ %dms: %s\n",
 		   prefix,
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
index 96ebf61038d9..9fdc8223007f 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
@@ -235,6 +235,7 @@ int intel_engine_pulse(struct intel_engine_cs *engine)
 		goto out_unlock;
 	}
 
+	rq->sched.deadline = 0;
 	__set_bit(I915_FENCE_FLAG_SENTINEL, &rq->fence.flags);
 	heartbeat_commit(rq, &attr);
 	GEM_BUG_ON(rq->sched.attr.priority < I915_PRIORITY_BARRIER);
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.c b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
index 8ec3eecf3e39..a95099b7b759 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_pm.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
@@ -189,6 +189,7 @@ static bool switch_to_kernel_context(struct intel_engine_cs *engine)
 	i915_request_add_active_barriers(rq);
 
 	/* Install ourselves as a preemption barrier */
+	rq->sched.deadline = 0;
 	rq->sched.attr.priority = I915_PRIORITY_BARRIER;
 	if (likely(!__i915_request_commit(rq))) { /* engine should be idle! */
 		/*
@@ -249,9 +250,6 @@ static int __engine_park(struct intel_wakeref *wf)
 	intel_engine_park_heartbeat(engine);
 	intel_engine_disarm_breadcrumbs(engine);
 
-	/* Must be reset upon idling, or we may miss the busy wakeup. */
-	GEM_BUG_ON(engine->execlists.queue_priority_hint != INT_MIN);
-
 	if (engine->park)
 		engine->park(engine);
 
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h b/drivers/gpu/drm/i915/gt/intel_engine_types.h
index 354e01c560f2..af6f1154200a 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
@@ -236,20 +236,6 @@ struct intel_engine_execlists {
 	 */
 	unsigned int port_mask;
 
-	/**
-	 * @queue_priority_hint: Highest pending priority.
-	 *
-	 * When we add requests into the queue, or adjust the priority of
-	 * executing requests, we compute the maximum priority of those
-	 * pending requests. We can then use this value to determine if
-	 * we need to preempt the executing requests to service the queue.
-	 * However, since the we may have recorded the priority of an inflight
-	 * request we wanted to preempt but since completed, at the time of
-	 * dequeuing the priority hint may no longer may match the highest
-	 * available request priority.
-	 */
-	int queue_priority_hint;
-
 	/**
 	 * @queue: queue of requests, in priority lists
 	 */
diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index 29072215635e..6054695611ad 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -202,7 +202,7 @@ struct virtual_engine {
 	 */
 	struct ve_node {
 		struct rb_node rb;
-		int prio;
+		u64 deadline;
 	} nodes[I915_NUM_ENGINES];
 
 	/*
@@ -413,12 +413,17 @@ static inline struct i915_priolist *to_priolist(struct rb_node *rb)
 
 static inline int rq_prio(const struct i915_request *rq)
 {
-	return READ_ONCE(rq->sched.attr.priority);
+	return rq->sched.attr.priority;
 }
 
-static int effective_prio(const struct i915_request *rq)
+static inline u64 rq_deadline(const struct i915_request *rq)
 {
-	int prio = rq_prio(rq);
+	return rq->sched.deadline;
+}
+
+static u64 effective_deadline(const struct i915_request *rq)
+{
+	u64 deadline = rq_deadline(rq);
 
 	/*
 	 * If this request is special and must not be interrupted at any
@@ -429,33 +434,45 @@ static int effective_prio(const struct i915_request *rq)
 	 * nopreempt for as long as desired).
 	 */
 	if (i915_request_has_nopreempt(rq))
-		prio = I915_PRIORITY_UNPREEMPTABLE;
+		deadline = 0;
 
-	return prio;
+	return deadline;
 }
 
-static int queue_prio(const struct intel_engine_execlists *execlists)
+static u64 queue_deadline(struct intel_engine_execlists *el)
 {
-	struct rb_node *rb;
+	do {
+		struct rb_node *rb;
+		struct i915_priolist *p;
 
-	rb = rb_first_cached(&execlists->queue);
-	if (!rb)
-		return INT_MIN;
+		rb = rb_first_cached(&el->queue);
+		if (!rb)
+			return I915_DEADLINE_NEVER;
+
+		p = to_priolist(rb);
+		if (likely(!list_empty(&p->requests)))
+			return p->deadline;
 
-	return to_priolist(rb)->priority;
+		rb_erase_cached(&p->node, &el->queue);
+		i915_priolist_free(p);
+	} while (1);
 }
 
-static int virtual_prio(const struct intel_engine_execlists *el)
+static u64 virtual_deadline(const struct intel_engine_execlists *el)
 {
-	struct rb_node *rb = rb_first_cached(&el->virtual);
+	struct rb_node *rb;
 
-	return rb ? rb_entry(rb, struct ve_node, rb)->prio : INT_MIN;
+	rb = rb_first_cached(&el->virtual);
+	if (!rb)
+		return I915_DEADLINE_NEVER;
+
+	return rb_entry(rb, struct ve_node, rb)->deadline;
 }
 
-static inline bool need_preempt(const struct intel_engine_cs *engine,
+static inline bool need_preempt(struct intel_engine_cs *engine,
 				const struct i915_request *rq)
 {
-	int last_prio;
+	u64 last_deadline;
 
 	if (!intel_engine_has_semaphores(engine))
 		return false;
@@ -478,16 +495,14 @@ static inline bool need_preempt(const struct intel_engine_cs *engine,
 	 * priority level: the task that is running should remain running
 	 * to preserve FIFO ordering of dependencies.
 	 */
-	last_prio = max(effective_prio(rq), I915_PRIORITY_NORMAL - 1);
-	if (engine->execlists.queue_priority_hint <= last_prio)
-		return false;
+	last_deadline = effective_deadline(rq);
 
 	/*
 	 * Check against the first request in ELSP[1], it will, thanks to the
 	 * power of PI, be the highest priority of that context.
 	 */
 	if (!list_is_last(&rq->sched.link, &engine->active.requests) &&
-	    rq_prio(list_next_entry(rq, sched.link)) > last_prio)
+	    rq_deadline(list_next_entry(rq, sched.link)) < last_deadline)
 		return true;
 
 	/*
@@ -500,8 +515,8 @@ static inline bool need_preempt(const struct intel_engine_cs *engine,
 	 * ELSP[0] or ELSP[1] as, thanks again to PI, if it was the same
 	 * context, it's priority would not exceed ELSP[0] aka last_prio.
 	 */
-	return max(virtual_prio(&engine->execlists),
-		   queue_prio(&engine->execlists)) > last_prio;
+	return min(virtual_deadline(&engine->execlists),
+		   queue_deadline(&engine->execlists)) < last_deadline;
 }
 
 __maybe_unused static inline bool
@@ -518,7 +533,7 @@ assert_priority_queue(const struct i915_request *prev,
 	if (i915_request_is_active(prev))
 		return true;
 
-	return rq_prio(prev) >= rq_prio(next);
+	return rq_deadline(prev) <= rq_deadline(next);
 }
 
 /*
@@ -1088,7 +1103,7 @@ __unwind_incomplete_requests(struct intel_engine_cs *engine)
 {
 	struct i915_request *rq, *rn, *active = NULL;
 	struct list_head *uninitialized_var(pl);
-	int prio = I915_PRIORITY_INVALID;
+	u64 deadline = I915_DEADLINE_NEVER;
 
 	lockdep_assert_held(&engine->active.lock);
 
@@ -1102,10 +1117,15 @@ __unwind_incomplete_requests(struct intel_engine_cs *engine)
 
 		__i915_request_unsubmit(rq);
 
-		GEM_BUG_ON(rq_prio(rq) == I915_PRIORITY_INVALID);
-		if (rq_prio(rq) != prio) {
-			prio = rq_prio(rq);
-			pl = i915_sched_lookup_priolist(engine, prio);
+		if (i915_request_started(rq)) {
+			u64 deadline =
+				i915_scheduler_next_virtual_deadline(rq_prio(rq));
+			rq->sched.deadline = min(rq_deadline(rq), deadline);
+		}
+
+		if (rq_deadline(rq) != deadline) {
+			deadline = rq_deadline(rq);
+			pl = i915_sched_lookup_priolist(engine, deadline);
 		}
 		GEM_BUG_ON(RB_EMPTY_ROOT(&engine->execlists.queue.rb_root));
 
@@ -1377,9 +1397,12 @@ static inline void __execlists_schedule_out(struct i915_request *rq)
 	 * If we have just completed this context, the engine may now be
 	 * idle and we want to re-enter powersaving.
 	 */
-	if (list_is_last_rcu(&rq->link, &ce->timeline->requests) &&
-	    i915_request_completed(rq))
-		intel_engine_add_retire(engine, ce->timeline);
+	if (i915_request_completed(rq)) {
+		if (!list_is_last_rcu(&rq->link, &ce->timeline->requests))
+			i915_request_update_deadline(list_next_entry(rq, link));
+		else
+			intel_engine_add_retire(engine, ce->timeline);
+	}
 
 	ccid = ce->lrc.ccid;
 	ccid >>= GEN11_SW_CTX_ID_SHIFT - 32;
@@ -1493,14 +1516,14 @@ dump_port(char *buf, int buflen, const char *prefix, struct i915_request *rq)
 	if (!rq)
 		return "";
 
-	snprintf(buf, buflen, "%sccid:%x %llx:%lld%s prio %d",
+	snprintf(buf, buflen, "%sccid:%x %llx:%lld%s dl:%llu",
 		 prefix,
 		 rq->context->lrc.ccid,
 		 rq->fence.context, rq->fence.seqno,
 		 i915_request_completed(rq) ? "!" :
 		 i915_request_started(rq) ? "*" :
 		 "",
-		 rq_prio(rq));
+		 rq_deadline(rq));
 
 	return buf;
 }
@@ -1810,7 +1833,9 @@ static void virtual_xfer_breadcrumbs(struct virtual_engine *ve)
 	intel_engine_transfer_stale_breadcrumbs(ve->siblings[0], &ve->context);
 }
 
-static void defer_request(struct i915_request *rq, struct list_head * const pl)
+static void defer_request(struct i915_request *rq,
+			  struct list_head * const pl,
+			  u64 deadline)
 {
 	LIST_HEAD(list);
 
@@ -1825,6 +1850,7 @@ static void defer_request(struct i915_request *rq, struct list_head * const pl)
 		struct i915_dependency *p;
 
 		GEM_BUG_ON(i915_request_is_active(rq));
+		rq->sched.deadline = deadline;
 		list_move_tail(&rq->sched.link, pl);
 
 		for_each_waiter(p, rq) {
@@ -1847,10 +1873,9 @@ static void defer_request(struct i915_request *rq, struct list_head * const pl)
 			if (!i915_request_is_ready(w))
 				continue;
 
-			if (rq_prio(w) < rq_prio(rq))
+			if (rq_deadline(w) > deadline)
 				continue;
 
-			GEM_BUG_ON(rq_prio(w) > rq_prio(rq));
 			list_move_tail(&w->sched.link, &list);
 		}
 
@@ -1861,12 +1886,21 @@ static void defer_request(struct i915_request *rq, struct list_head * const pl)
 static void defer_active(struct intel_engine_cs *engine)
 {
 	struct i915_request *rq;
+	u64 deadline;
 
 	rq = __unwind_incomplete_requests(engine);
 	if (!rq)
 		return;
 
-	defer_request(rq, i915_sched_lookup_priolist(engine, rq_prio(rq)));
+	deadline = max(rq_deadline(rq),
+		       i915_scheduler_next_virtual_deadline(rq_prio(rq)));
+	ENGINE_TRACE(engine, "defer %llx:%lld, dl:%llu -> %llu\n",
+		     rq->fence.context, rq->fence.seqno,
+		     rq_deadline(rq), deadline);
+
+	defer_request(rq,
+		      i915_sched_lookup_priolist(engine, deadline),
+		      deadline);
 }
 
 static bool
@@ -2034,11 +2068,10 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
 			goto check_secondary;
 		} else if (need_preempt(engine, last)) {
 			ENGINE_TRACE(engine,
-				     "preempting last=%llx:%lld, prio=%d, hint=%d\n",
+				     "preempting last=%llx:%llu, dl=%llu\n",
 				     last->fence.context,
 				     last->fence.seqno,
-				     last->sched.attr.priority,
-				     execlists->queue_priority_hint);
+				     rq_deadline(last));
 			record_preemption(execlists);
 
 			/*
@@ -2060,11 +2093,11 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
 			last = NULL;
 		} else if (timeslice_expired(engine, last)) {
 			ENGINE_TRACE(engine,
-				     "expired:%s last=%llx:%lld, prio=%d, hint=%d, yield?=%s\n",
+				     "expired:%s last=%llx:%llu, deadline=%llu, now=%llu, yield?=%s\n",
 				     yesno(timer_expired(&execlists->timer)),
 				     last->fence.context, last->fence.seqno,
-				     rq_prio(last),
-				     execlists->queue_priority_hint,
+				     rq_deadline(last),
+				     i915_sched_to_ticks(ktime_get()),
 				     yesno(timeslice_yield(execlists, last)));
 
 			ring_set_paused(engine, 1);
@@ -2121,7 +2154,7 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
 		GEM_BUG_ON(rq->engine != &ve->base);
 		GEM_BUG_ON(rq->context != &ve->context);
 
-		if (unlikely(rq_prio(rq) < queue_prio(execlists))) {
+		if (unlikely(rq_deadline(rq) > queue_deadline(execlists))) {
 			spin_unlock(&ve->base.active.lock);
 			break;
 		}
@@ -2142,9 +2175,7 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
 			     i915_request_started(rq) ? "*" :
 			     "",
 			     yesno(engine != ve->siblings[0]));
-
 		WRITE_ONCE(ve->request, NULL);
-		WRITE_ONCE(ve->base.execlists.queue_priority_hint, INT_MIN);
 
 		rb = &ve->nodes[engine->id].rb;
 		rb_erase_cached(rb, &execlists->virtual);
@@ -2285,24 +2316,6 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
 	}
 done:
 	*port++ = i915_request_get(last);
-
-	/*
-	 * Here be a bit of magic! Or sleight-of-hand, whichever you prefer.
-	 *
-	 * We choose the priority hint such that if we add a request of greater
-	 * priority than this, we kick the submission tasklet to decide on
-	 * the right order of submitting the requests to hardware. We must
-	 * also be prepared to reorder requests as they are in-flight on the
-	 * HW. We derive the priority hint then as the first "hole" in
-	 * the HW submission ports and if there are no available slots,
-	 * the priority of the lowest executing request, i.e. last.
-	 *
-	 * When we do receive a higher priority request ready to run from the
-	 * user, see queue_request(), the priority hint is bumped to that
-	 * request triggering preemption on the next dequeue (or subsequent
-	 * interrupt for secondary ports).
-	 */
-	execlists->queue_priority_hint = queue_prio(execlists);
 	spin_unlock_irqrestore(&engine->active.lock, flags);
 
 	/*
@@ -2715,9 +2728,10 @@ static bool hold_request(const struct i915_request *rq)
 	return result;
 }
 
-static void __execlists_unhold(struct i915_request *rq)
+static bool __execlists_unhold(struct i915_request *rq)
 {
 	LIST_HEAD(list);
+	bool submit = false;
 
 	do {
 		struct i915_dependency *p;
@@ -2728,10 +2742,7 @@ static void __execlists_unhold(struct i915_request *rq)
 		GEM_BUG_ON(!i915_sw_fence_signaled(&rq->submit));
 
 		i915_request_clear_hold(rq);
-		list_move_tail(&rq->sched.link,
-			       i915_sched_lookup_priolist(rq->engine,
-							  rq_prio(rq)));
-		set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
+		submit |= intel_engine_queue_request(rq->engine, rq);
 
 		/* Also release any children on this engine that are ready */
 		for_each_waiter(p, rq) {
@@ -2760,6 +2771,8 @@ static void __execlists_unhold(struct i915_request *rq)
 
 		rq = list_first_entry_or_null(&list, typeof(*rq), sched.link);
 	} while (rq);
+
+	return submit;
 }
 
 static void execlists_unhold(struct intel_engine_cs *engine,
@@ -2771,12 +2784,8 @@ static void execlists_unhold(struct intel_engine_cs *engine,
 	 * Move this request back to the priority queue, and all of its
 	 * children and grandchildren that were suspended along with it.
 	 */
-	__execlists_unhold(rq);
-
-	if (rq_prio(rq) > engine->execlists.queue_priority_hint) {
-		engine->execlists.queue_priority_hint = rq_prio(rq);
+	if (__execlists_unhold(rq))
 		tasklet_hi_schedule(&engine->execlists.tasklet);
-	}
 
 	spin_unlock_irq(&engine->active.lock);
 }
@@ -3046,27 +3055,6 @@ static void execlists_preempt(struct timer_list *timer)
 	execlists_kick(timer, preempt);
 }
 
-static void queue_request(struct intel_engine_cs *engine,
-			  struct i915_request *rq)
-{
-	GEM_BUG_ON(!list_empty(&rq->sched.link));
-	list_add_tail(&rq->sched.link,
-		      i915_sched_lookup_priolist(engine, rq_prio(rq)));
-	set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
-}
-
-static bool submit_queue(struct intel_engine_cs *engine,
-			 const struct i915_request *rq)
-{
-	struct intel_engine_execlists *execlists = &engine->execlists;
-
-	if (rq_prio(rq) <= execlists->queue_priority_hint)
-		return false;
-
-	execlists->queue_priority_hint = rq_prio(rq);
-	return true;
-}
-
 static bool ancestor_on_hold(const struct intel_engine_cs *engine,
 			     const struct i915_request *rq)
 {
@@ -3087,12 +3075,7 @@ static void execlists_submit_request(struct i915_request *request)
 		list_add_tail(&request->sched.link, &engine->active.hold);
 		i915_request_set_hold(request);
 	} else {
-		queue_request(engine, request);
-
-		GEM_BUG_ON(RB_EMPTY_ROOT(&engine->execlists.queue.rb_root));
-		GEM_BUG_ON(list_empty(&request->sched.link));
-
-		if (submit_queue(engine, request))
+		if (intel_engine_queue_request(engine, request))
 			__execlists_kick(&engine->execlists);
 	}
 
@@ -4161,10 +4144,6 @@ static void execlists_reset_rewind(struct intel_engine_cs *engine, bool stalled)
 
 static void nop_submission_tasklet(unsigned long data)
 {
-	struct intel_engine_cs * const engine = (struct intel_engine_cs *)data;
-
-	/* The driver is wedged; don't process any more events. */
-	WRITE_ONCE(engine->execlists.queue_priority_hint, INT_MIN);
 }
 
 static void execlists_reset_cancel(struct intel_engine_cs *engine)
@@ -4210,6 +4189,7 @@ static void execlists_reset_cancel(struct intel_engine_cs *engine)
 		rb_erase_cached(&p->node, &execlists->queue);
 		i915_priolist_free(p);
 	}
+	GEM_BUG_ON(!RB_EMPTY_ROOT(&execlists->queue.rb_root));
 
 	/* On-hold requests will be flushed to timeline upon their release */
 	list_for_each_entry(rq, &engine->active.hold, sched.link)
@@ -4231,17 +4211,12 @@ static void execlists_reset_cancel(struct intel_engine_cs *engine)
 			rq->engine = engine;
 			__i915_request_submit(rq);
 			i915_request_put(rq);
-
-			ve->base.execlists.queue_priority_hint = INT_MIN;
 		}
 		spin_unlock(&ve->base.active.lock);
 	}
 
 	/* Remaining _unready_ requests will be nop'ed when submitted */
 
-	execlists->queue_priority_hint = INT_MIN;
-	execlists->queue = RB_ROOT_CACHED;
-
 	GEM_BUG_ON(__tasklet_is_enabled(&execlists->tasklet));
 	execlists->tasklet.func = nop_submission_tasklet;
 
@@ -5353,7 +5328,8 @@ static const struct intel_context_ops virtual_context_ops = {
 	.destroy = virtual_context_destroy,
 };
 
-static intel_engine_mask_t virtual_submission_mask(struct virtual_engine *ve)
+static intel_engine_mask_t
+virtual_submission_mask(struct virtual_engine *ve, u64 *deadline)
 {
 	struct i915_request *rq;
 	intel_engine_mask_t mask;
@@ -5370,9 +5346,11 @@ static intel_engine_mask_t virtual_submission_mask(struct virtual_engine *ve)
 		mask = ve->siblings[0]->mask;
 	}
 
-	ENGINE_TRACE(&ve->base, "rq=%llx:%lld, mask=%x, prio=%d\n",
+	*deadline = rq_deadline(rq);
+
+	ENGINE_TRACE(&ve->base, "rq=%llx:%llu, mask=%x, dl=%llu\n",
 		     rq->fence.context, rq->fence.seqno,
-		     mask, ve->base.execlists.queue_priority_hint);
+		     mask, *deadline);
 
 	return mask;
 }
@@ -5380,12 +5358,12 @@ static intel_engine_mask_t virtual_submission_mask(struct virtual_engine *ve)
 static void virtual_submission_tasklet(unsigned long data)
 {
 	struct virtual_engine * const ve = (struct virtual_engine *)data;
-	const int prio = READ_ONCE(ve->base.execlists.queue_priority_hint);
 	intel_engine_mask_t mask;
+	u64 deadline;
 	unsigned int n;
 
 	rcu_read_lock();
-	mask = virtual_submission_mask(ve);
+	mask = virtual_submission_mask(ve, &deadline);
 	rcu_read_unlock();
 	if (unlikely(!mask))
 		return;
@@ -5418,7 +5396,8 @@ static void virtual_submission_tasklet(unsigned long data)
 			 */
 			first = rb_first_cached(&sibling->execlists.virtual) ==
 				&node->rb;
-			if (prio == node->prio || (prio > node->prio && first))
+			if (deadline == node->deadline ||
+			    (deadline < node->deadline && first))
 				goto submit_engine;
 
 			rb_erase_cached(&node->rb, &sibling->execlists.virtual);
@@ -5432,7 +5411,7 @@ static void virtual_submission_tasklet(unsigned long data)
 
 			rb = *parent;
 			other = rb_entry(rb, typeof(*other), rb);
-			if (prio > other->prio) {
+			if (deadline < other->deadline) {
 				parent = &rb->rb_left;
 			} else {
 				parent = &rb->rb_right;
@@ -5447,8 +5426,8 @@ static void virtual_submission_tasklet(unsigned long data)
 
 submit_engine:
 		GEM_BUG_ON(RB_EMPTY_NODE(&node->rb));
-		node->prio = prio;
-		if (first && prio > sibling->execlists.queue_priority_hint)
+		node->deadline = deadline;
+		if (first)
 			tasklet_hi_schedule(&sibling->execlists.tasklet);
 
 unlock_engine:
@@ -5482,11 +5461,11 @@ static void virtual_submit_request(struct i915_request *rq)
 
 	if (i915_request_completed(rq)) {
 		__i915_request_submit(rq);
-
-		ve->base.execlists.queue_priority_hint = INT_MIN;
 		ve->request = NULL;
 	} else {
-		ve->base.execlists.queue_priority_hint = rq_prio(rq);
+		rq->sched.deadline =
+			min(rq->sched.deadline,
+			    i915_scheduler_next_virtual_deadline(rq_prio(rq)));
 		ve->request = i915_request_get(rq);
 
 		GEM_BUG_ON(!list_empty(virtual_queue(ve)));
@@ -5591,7 +5570,6 @@ intel_execlists_create_virtual(struct intel_engine_cs **siblings,
 	ve->base.bond_execute = virtual_bond_execute;
 
 	INIT_LIST_HEAD(virtual_queue(ve));
-	ve->base.execlists.queue_priority_hint = INT_MIN;
 	tasklet_init(&ve->base.execlists.tasklet,
 		     virtual_submission_tasklet,
 		     (unsigned long)ve);
@@ -5779,10 +5757,6 @@ void intel_execlists_show_requests(struct intel_engine_cs *engine,
 		show_request(m, last, "\t\tE ");
 	}
 
-	if (execlists->queue_priority_hint != INT_MIN)
-		drm_printf(m, "\t\tQueue priority hint: %d\n",
-			   READ_ONCE(execlists->queue_priority_hint));
-
 	last = NULL;
 	count = 0;
 	for (rb = rb_first_cached(&execlists->queue); rb; rb = rb_next(rb)) {
diff --git a/drivers/gpu/drm/i915/gt/selftest_hangcheck.c b/drivers/gpu/drm/i915/gt/selftest_hangcheck.c
index 927d54c702f4..b0eb426d26fe 100644
--- a/drivers/gpu/drm/i915/gt/selftest_hangcheck.c
+++ b/drivers/gpu/drm/i915/gt/selftest_hangcheck.c
@@ -878,7 +878,10 @@ static int __igt_reset_engines(struct intel_gt *gt,
 					break;
 				}
 
-				if (i915_request_wait(rq, 0, HZ / 5) < 0) {
+				/* With deadlines, no strict priority */
+				i915_request_set_deadline(rq, 0);
+
+				if (i915_request_wait(rq, 0, HZ / 2) < 0) {
 					struct drm_printer p =
 						drm_info_printer(gt->i915->drm.dev);
 
diff --git a/drivers/gpu/drm/i915/gt/selftest_lrc.c b/drivers/gpu/drm/i915/gt/selftest_lrc.c
index b23234ae2572..ec648b61b2cc 100644
--- a/drivers/gpu/drm/i915/gt/selftest_lrc.c
+++ b/drivers/gpu/drm/i915/gt/selftest_lrc.c
@@ -70,6 +70,9 @@ static int wait_for_submit(struct intel_engine_cs *engine,
 			   struct i915_request *rq,
 			   unsigned long timeout)
 {
+	/* Ignore our own attempts to suppress excess tasklets */
+	tasklet_hi_schedule(&engine->execlists.tasklet);
+
 	timeout += jiffies;
 	do {
 		bool done = time_after(jiffies, timeout);
@@ -892,7 +895,7 @@ semaphore_queue(struct intel_engine_cs *engine, struct i915_vma *vma, int idx)
 static int
 release_queue(struct intel_engine_cs *engine,
 	      struct i915_vma *vma,
-	      int idx, int prio)
+	      int idx, u64 deadline)
 {
 	struct i915_request *rq;
 	u32 *cs;
@@ -917,10 +920,7 @@ release_queue(struct intel_engine_cs *engine,
 	i915_request_get(rq);
 	i915_request_add(rq);
 
-	local_bh_disable();
-	i915_request_set_priority(rq, prio);
-	local_bh_enable(); /* kick tasklet */
-
+	i915_request_set_deadline(rq, deadline);
 	i915_request_put(rq);
 
 	return 0;
@@ -934,6 +934,7 @@ slice_semaphore_queue(struct intel_engine_cs *outer,
 	struct intel_engine_cs *engine;
 	struct i915_request *head;
 	enum intel_engine_id id;
+	long timeout;
 	int err, i, n = 0;
 
 	head = semaphore_queue(outer, vma, n++);
@@ -954,12 +955,16 @@ slice_semaphore_queue(struct intel_engine_cs *outer,
 		}
 	}
 
-	err = release_queue(outer, vma, n, I915_PRIORITY_BARRIER);
+	err = release_queue(outer, vma, n, 0);
 	if (err)
 		goto out;
 
-	if (i915_request_wait(head, 0,
-			      2 * outer->gt->info.num_engines * (count + 2) * (count + 3)) < 0) {
+	/* Expected number of pessimal slices required */
+	timeout = outer->gt->info.num_engines * (count + 2) * (count + 3);
+	timeout *= 4; /* safety factor, including bucketing */
+	timeout += HZ / 2; /* and include the request completion */
+
+	if (i915_request_wait(head, 0, timeout) < 0) {
 		pr_err("Failed to slice along semaphore chain of length (%d, %d)!\n",
 		       count, n);
 		GEM_TRACE_DUMP();
@@ -1064,6 +1069,8 @@ create_rewinder(struct intel_context *ce,
 		err = i915_request_await_dma_fence(rq, &wait->fence);
 		if (err)
 			goto err;
+
+		i915_request_set_deadline(rq, rq_deadline(wait));
 	}
 
 	cs = intel_ring_begin(rq, 14);
@@ -1339,7 +1346,7 @@ static int live_timeslice_queue(void *arg)
 			err = PTR_ERR(rq);
 			goto err_heartbeat;
 		}
-		i915_request_set_priority(rq, I915_PRIORITY_MAX);
+		i915_request_set_deadline(rq, 0);
 		err = wait_for_submit(engine, rq, HZ / 2);
 		if (err) {
 			pr_err("%s: Timed out trying to submit semaphores\n",
@@ -1362,10 +1369,9 @@ static int live_timeslice_queue(void *arg)
 		}
 
 		GEM_BUG_ON(i915_request_completed(rq));
-		GEM_BUG_ON(execlists_active(&engine->execlists) != rq);
 
 		/* Queue: semaphore signal, matching priority as semaphore */
-		err = release_queue(engine, vma, 1, effective_prio(rq));
+		err = release_queue(engine, vma, 1, effective_deadline(rq));
 		if (err)
 			goto err_rq;
 
@@ -1476,6 +1482,7 @@ static int live_timeslice_nopreempt(void *arg)
 			goto out_spin;
 		}
 
+		rq->sched.deadline = 0;
 		rq->sched.attr.priority = I915_PRIORITY_BARRIER;
 		i915_request_get(rq);
 		i915_request_add(rq);
@@ -1848,6 +1855,7 @@ static int live_late_preempt(void *arg)
 
 	/* Make sure ctx_lo stays before ctx_hi until we trigger preemption. */
 	ctx_lo->sched.priority = 1;
+	ctx_hi->sched.priority = I915_PRIORITY_MIN;
 
 	for_each_engine(engine, gt, id) {
 		struct igt_live_test t;
@@ -2949,6 +2957,9 @@ static int live_preempt_gang(void *arg)
 			struct i915_request *n =
 				list_next_entry(rq, client_link);
 
+			/* With deadlines, no strict priority ordering */
+			i915_request_set_deadline(rq, 0);
+
 			if (err == 0 && i915_request_wait(rq, 0, HZ / 5) < 0) {
 				struct drm_printer p =
 					drm_info_printer(engine->i915->drm.dev);
@@ -3170,7 +3181,7 @@ static int preempt_user(struct intel_engine_cs *engine,
 	i915_request_get(rq);
 	i915_request_add(rq);
 
-	i915_request_set_priority(rq, I915_PRIORITY_MAX);
+	i915_request_set_deadline(rq, 0);
 
 	if (i915_request_wait(rq, 0, HZ / 2) < 0)
 		err = -ETIME;
@@ -4706,6 +4717,7 @@ static int emit_semaphore_signal(struct intel_context *ce, void *slot)
 
 	intel_ring_advance(rq, cs);
 
+	rq->sched.deadline = 0;
 	rq->sched.attr.priority = I915_PRIORITY_BARRIER;
 	i915_request_add(rq);
 	return 0;
@@ -5215,6 +5227,10 @@ static int __live_lrc_gpr(struct intel_engine_cs *engine,
 		err = emit_semaphore_signal(engine->kernel_context, slot);
 		if (err)
 			goto err_rq;
+
+		err = wait_for_submit(engine, rq, HZ / 2);
+		if (err)
+			goto err_rq;
 	} else {
 		slot[0] = 1;
 		wmb();
@@ -5772,6 +5788,7 @@ static int poison_registers(struct intel_context *ce, u32 poison, u32 *sema)
 
 	intel_ring_advance(rq, cs);
 
+	rq->sched.deadline = 0;
 	rq->sched.attr.priority = I915_PRIORITY_BARRIER;
 err_rq:
 	i915_request_add(rq);
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 8b56cf0d970e..e31f9b2c12cc 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -333,8 +333,6 @@ static void __guc_dequeue(struct intel_engine_cs *engine)
 		i915_priolist_free(p);
 	}
 done:
-	execlists->queue_priority_hint =
-		rb ? to_priolist(rb)->priority : INT_MIN;
 	if (submit) {
 		*port = schedule_in(last, port - execlists->inflight);
 		*++port = NULL;
@@ -473,12 +471,10 @@ static void guc_reset_cancel(struct intel_engine_cs *engine)
 		rb_erase_cached(&p->node, &execlists->queue);
 		i915_priolist_free(p);
 	}
+	GEM_BUG_ON(!RB_EMPTY_ROOT(&execlists->queue.rb_root));
 
 	/* Remaining _unready_ requests will be nop'ed when submitted */
 
-	execlists->queue_priority_hint = INT_MIN;
-	execlists->queue = RB_ROOT_CACHED;
-
 	spin_unlock_irqrestore(&engine->active.lock, flags);
 }
 
diff --git a/drivers/gpu/drm/i915/i915_priolist_types.h b/drivers/gpu/drm/i915/i915_priolist_types.h
index bc2fa84f98a8..43a0ac45295f 100644
--- a/drivers/gpu/drm/i915/i915_priolist_types.h
+++ b/drivers/gpu/drm/i915/i915_priolist_types.h
@@ -22,6 +22,8 @@ enum {
 
 	/* Interactive workload, scheduled for immediate pageflipping */
 	I915_PRIORITY_DISPLAY,
+
+	__I915_PRIORITY_KERNEL__
 };
 
 /* Smallest priority value that cannot be bumped. */
@@ -35,13 +37,12 @@ enum {
  * i.e. nothing can have higher priority and force us to usurp the
  * active request.
  */
-#define I915_PRIORITY_UNPREEMPTABLE INT_MAX
-#define I915_PRIORITY_BARRIER (I915_PRIORITY_UNPREEMPTABLE - 1)
+#define I915_PRIORITY_BARRIER INT_MAX
 
 struct i915_priolist {
 	struct list_head requests;
 	struct rb_node node;
-	int priority;
+	u64 deadline;
 };
 
 #endif /* _I915_PRIOLIST_TYPES_H_ */
diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index 6528ace4c0b7..a90e90e96c19 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -658,6 +658,7 @@ semaphore_notify(struct i915_sw_fence *fence, enum i915_sw_fence_notify state)
 
 	switch (state) {
 	case FENCE_COMPLETE:
+		i915_request_update_deadline(rq);
 		break;
 
 	case FENCE_FREE:
diff --git a/drivers/gpu/drm/i915/i915_scheduler.c b/drivers/gpu/drm/i915/i915_scheduler.c
index 3f261b4fee66..806f76651635 100644
--- a/drivers/gpu/drm/i915/i915_scheduler.c
+++ b/drivers/gpu/drm/i915/i915_scheduler.c
@@ -20,6 +20,11 @@ static struct i915_global_scheduler {
 static DEFINE_SPINLOCK(ipi_lock);
 static LIST_HEAD(ipi_list);
 
+static inline u64 rq_deadline(const struct i915_request *rq)
+{
+	return READ_ONCE(rq->sched.deadline);
+}
+
 static inline int rq_prio(const struct i915_request *rq)
 {
 	return READ_ONCE(rq->sched.attr.priority);
@@ -32,6 +37,7 @@ static void ipi_schedule(struct irq_work *wrk)
 		struct i915_dependency *p;
 		struct i915_request *rq;
 		unsigned long flags;
+		u64 deadline;
 		int prio;
 
 		spin_lock_irqsave(&ipi_lock, flags);
@@ -40,7 +46,10 @@ static void ipi_schedule(struct irq_work *wrk)
 			rq = container_of(p->signaler, typeof(*rq), sched);
 			list_del_init(&p->ipi_link);
 
+			deadline = p->ipi_deadline;
 			prio = p->ipi_priority;
+
+			p->ipi_deadline = I915_DEADLINE_NEVER;
 			p->ipi_priority = I915_PRIORITY_INVALID;
 		}
 		spin_unlock_irqrestore(&ipi_lock, flags);
@@ -51,6 +60,7 @@ static void ipi_schedule(struct irq_work *wrk)
 			continue;
 
 		i915_request_set_priority(rq, prio);
+		i915_request_set_deadline(rq, deadline);
 	} while (1);
 	rcu_read_unlock();
 }
@@ -98,28 +108,8 @@ static inline struct i915_priolist *to_priolist(struct rb_node *rb)
 	return rb_entry(rb, struct i915_priolist, node);
 }
 
-static void assert_priolists(struct intel_engine_execlists * const execlists)
-{
-	struct rb_node *rb;
-	long last_prio;
-
-	if (!IS_ENABLED(CONFIG_DRM_I915_DEBUG_GEM))
-		return;
-
-	GEM_BUG_ON(rb_first_cached(&execlists->queue) !=
-		   rb_first(&execlists->queue.rb_root));
-
-	last_prio = INT_MAX;
-	for (rb = rb_first_cached(&execlists->queue); rb; rb = rb_next(rb)) {
-		const struct i915_priolist *p = to_priolist(rb);
-
-		GEM_BUG_ON(p->priority > last_prio);
-		last_prio = p->priority;
-	}
-}
-
 struct list_head *
-i915_sched_lookup_priolist(struct intel_engine_cs *engine, int prio)
+i915_sched_lookup_priolist(struct intel_engine_cs *engine, u64 deadline)
 {
 	struct intel_engine_execlists * const execlists = &engine->execlists;
 	struct i915_priolist *p;
@@ -127,10 +117,9 @@ i915_sched_lookup_priolist(struct intel_engine_cs *engine, int prio)
 	bool first = true;
 
 	lockdep_assert_held(&engine->active.lock);
-	assert_priolists(execlists);
 
 	if (unlikely(execlists->no_priolist))
-		prio = I915_PRIORITY_NORMAL;
+		deadline = 0;
 
 find_priolist:
 	/* most positive priority is scheduled first, equal priorities fifo */
@@ -139,9 +128,9 @@ i915_sched_lookup_priolist(struct intel_engine_cs *engine, int prio)
 	while (*parent) {
 		rb = *parent;
 		p = to_priolist(rb);
-		if (prio > p->priority) {
+		if (deadline < p->deadline) {
 			parent = &rb->rb_left;
-		} else if (prio < p->priority) {
+		} else if (deadline > p->deadline) {
 			parent = &rb->rb_right;
 			first = false;
 		} else {
@@ -149,13 +138,13 @@ i915_sched_lookup_priolist(struct intel_engine_cs *engine, int prio)
 		}
 	}
 
-	if (prio == I915_PRIORITY_NORMAL) {
+	if (!deadline) {
 		p = &execlists->default_priolist;
 	} else {
 		p = kmem_cache_alloc(global.slab_priorities, GFP_ATOMIC);
 		/* Convert an allocation failure to a priority bump */
 		if (unlikely(!p)) {
-			prio = I915_PRIORITY_NORMAL; /* recurses just once */
+			deadline = 0; /* recurses just once */
 
 			/* To maintain ordering with all rendering, after an
 			 * allocation failure we have to disable all scheduling.
@@ -170,7 +159,7 @@ i915_sched_lookup_priolist(struct intel_engine_cs *engine, int prio)
 		}
 	}
 
-	p->priority = prio;
+	p->deadline = deadline;
 	INIT_LIST_HEAD(&p->requests);
 
 	rb_link_node(&p->node, rb, parent);
@@ -179,70 +168,234 @@ i915_sched_lookup_priolist(struct intel_engine_cs *engine, int prio)
 	return &p->requests;
 }
 
-void __i915_priolist_free(struct i915_priolist *p)
+void i915_priolist_free(struct i915_priolist *p)
 {
-	kmem_cache_free(global.slab_priorities, p);
+	if (p->deadline)
+		kmem_cache_free(global.slab_priorities, p);
 }
 
-static inline bool need_preempt(int prio, int active)
+static bool kick_submission(const struct intel_engine_cs *engine, u64 deadline)
 {
-	/*
-	 * Allow preemption of low -> normal -> high, but we do
-	 * not allow low priority tasks to preempt other low priority
-	 * tasks under the impression that latency for low priority
-	 * tasks does not matter (as much as background throughput),
-	 * so kiss.
-	 */
-	return prio >= max(I915_PRIORITY_NORMAL, active);
+	const struct intel_engine_execlists *el = &engine->execlists;
+	const struct i915_request *inflight;
+	bool kick = true;
+
+	if (to_priolist(rb_first_cached(&el->queue))->deadline < deadline)
+		return false;
+
+	rcu_read_lock();
+	inflight = execlists_active(el);
+	if (inflight)
+		kick = deadline < rq_deadline(inflight);
+	rcu_read_unlock();
+
+	return kick;
 }
 
-static void kick_submission(struct intel_engine_cs *engine,
-			    const struct i915_request *rq,
-			    int prio)
+static bool __i915_request_set_deadline(struct i915_request *rq, u64 deadline)
 {
-	const struct i915_request *inflight;
+	struct intel_engine_cs *engine = rq->engine;
+	struct i915_request *rn;
+	struct list_head *plist;
+	LIST_HEAD(dfs);
 
-	/*
-	 * We only need to kick the tasklet once for the high priority
-	 * new context we add into the queue.
-	 */
-	if (prio <= engine->execlists.queue_priority_hint)
+	lockdep_assert_held(&engine->active.lock);
+	list_add(&rq->sched.dfs, &dfs);
+
+	list_for_each_entry(rq, &dfs, sched.dfs) {
+		struct i915_dependency *p;
+
+		GEM_BUG_ON(rq->engine != engine);
+
+		for_each_signaler(p, rq) {
+			struct i915_request *s =
+				container_of(p->signaler, typeof(*s), sched);
+
+			GEM_BUG_ON(s == rq);
+
+			if (rq_deadline(s) <= deadline)
+				continue;
+
+			if (i915_request_completed(s))
+				continue;
+
+			if (s->engine != rq->engine) {
+				spin_lock(&ipi_lock);
+				if (deadline < p->ipi_deadline) {
+					p->ipi_deadline = deadline;
+					list_move(&p->ipi_link, &ipi_list);
+					irq_work_queue(&ipi_work);
+				}
+				spin_unlock(&ipi_lock);
+				continue;
+			}
+
+			list_move_tail(&s->sched.dfs, &dfs);
+		}
+	}
+
+	plist = i915_sched_lookup_priolist(engine, deadline);
+
+	/* Fifo and depth-first replacement ensure our deps execute first */
+	list_for_each_entry_safe_reverse(rq, rn, &dfs, sched.dfs) {
+		GEM_BUG_ON(rq->engine != engine);
+		GEM_BUG_ON(deadline > rq_deadline(rq));
+
+		INIT_LIST_HEAD(&rq->sched.dfs);
+		WRITE_ONCE(rq->sched.deadline, deadline);
+		RQ_TRACE(rq, "set-deadline:%llu\n", deadline);
+
+		/*
+		 * Once the request is ready, it will be placed into the
+		 * priority lists and then onto the HW runlist. Before the
+		 * request is ready, it does not contribute to our preemption
+		 * decisions and we can safely ignore it, as it will, and
+		 * any preemption required, be dealt with upon submission.
+		 * See engine->submit_request()
+		 */
+
+		if (i915_request_in_priority_queue(rq))
+			list_move_tail(&rq->sched.link, plist);
+	}
+
+	return kick_submission(engine, deadline);
+}
+
+void i915_request_set_deadline(struct i915_request *rq, u64 deadline)
+{
+	struct intel_engine_cs *engine;
+	unsigned long flags;
+
+	if (deadline >= rq_deadline(rq))
 		return;
 
-	rcu_read_lock();
+	if (!i915_request_is_ready(rq))
+		return;
 
-	/* Nothing currently active? We're overdue for a submission! */
-	inflight = execlists_active(&engine->execlists);
-	if (!inflight)
+	engine = lock_engine_irqsave(rq, flags);
+	if (!intel_engine_has_scheduler(engine))
 		goto unlock;
 
-	/*
-	 * If we are already the currently executing context, don't
-	 * bother evaluating if we should preempt ourselves.
-	 */
-	if (inflight->context == rq->context)
+	if (i915_request_completed(rq))
 		goto unlock;
 
-	ENGINE_TRACE(engine,
-		     "bumping queue-priority-hint:%d for rq:%llx:%lld, inflight:%llx:%lld prio %d\n",
-		     prio,
-		     rq->fence.context, rq->fence.seqno,
-		     inflight->fence.context, inflight->fence.seqno,
-		     inflight->sched.attr.priority);
+	if (deadline >= rq_deadline(rq))
+		goto unlock;
 
-	engine->execlists.queue_priority_hint = prio;
-	if (need_preempt(prio, rq_prio(inflight)))
+	if (__i915_request_set_deadline(rq, deadline))
 		tasklet_hi_schedule(&engine->execlists.tasklet);
 
 unlock:
+	spin_unlock_irqrestore(&engine->active.lock, flags);
+}
+
+static u64 prio_slice(int prio)
+{
+	u64 slice;
+	int sf;
+
+	/*
+	 * This is the central heuristic to the virtual deadlines. By
+	 * imposing that each task takes an equal amount of time, we
+	 * let each client have an equal slice of the GPU time. By
+	 * bringing the virtual deadline forward, that client will then
+	 * have more GPU time, and vice versa a lower priority client will
+	 * have a later deadline and receive less GPU time.
+	 *
+	 * In BFS/MuQSS, the prio_ratios[] are based on the task nice range of
+	 * [-20, 20], with each lower priority having a ~10% longer deadline,
+	 * with the note that the proportion of CPU time between two clients
+	 * of different priority will be the square of the relative prio_slice.
+	 *
+	 * In contrast, this prio_slice() curve was chosen because it gave good
+	 * results with igt/gem_exec_schedule. It may not be the best choice!
+	 *
+	 * With a 1ms scheduling quantum:
+	 *
+	 *   MAX USER:  ~32us deadline
+	 *   0:         ~16ms deadline
+	 *   MIN_USER: 1000ms deadline
+	 */
+
+	if (prio >= __I915_PRIORITY_KERNEL__)
+		return INT_MAX - prio;
+
+	slice = __I915_PRIORITY_KERNEL__ - prio;
+	if (prio >= 0)
+		sf = 20 - 6;
+	else
+		sf = 20 - 1;
+
+	return slice << sf;
+}
+
+u64 i915_scheduler_virtual_deadline(u64 kt, int priority)
+{
+	return i915_sched_to_ticks(kt + prio_slice(priority));
+}
+
+u64 i915_scheduler_next_virtual_deadline(int priority)
+{
+	return i915_scheduler_virtual_deadline(ktime_get(), priority);
+}
+
+static u64 signal_deadline(const struct i915_request *rq)
+{
+	u64 last = ktime_to_ns(ktime_get());
+	const struct i915_dependency *p;
+
+	/*
+	 * Find the earliest point at which we will become 'ready',
+	 * which we infer from the deadline of all active signalers.
+	 * We will position ourselves at the end of that chain of work.
+	 */
+
+	rcu_read_lock();
+	for_each_signaler(p, rq) {
+		const struct i915_request *s =
+			container_of(p->signaler, typeof(*s), sched);
+		u64 deadline;
+
+		if (i915_request_completed(s))
+			continue;
+
+		if (rq_prio(s) < rq_prio(rq))
+			continue;
+
+		deadline = i915_sched_to_ns(rq_deadline(s));
+		if (p->flags & I915_DEPENDENCY_WEAK)
+			deadline -= prio_slice(rq_prio(s));
+
+		last = max(last, deadline);
+	}
 	rcu_read_unlock();
+
+	return last;
+}
+
+static u64 earliest_deadline(const struct i915_request *rq)
+{
+	return i915_scheduler_virtual_deadline(signal_deadline(rq),
+					       rq_prio(rq));
+}
+
+static bool set_earliest_deadline(struct i915_request *rq, u64 old)
+{
+	u64 dl;
+
+	/* Recompute our deadlines and promote after a priority change */
+	dl = min(earliest_deadline(rq), rq_deadline(rq));
+	if (dl >= old)
+		return false;
+
+	return __i915_request_set_deadline(rq, dl);
 }
 
-static void __i915_request_set_priority(struct i915_request *rq, int prio)
+static bool __i915_request_set_priority(struct i915_request *rq, int prio)
 {
 	struct intel_engine_cs *engine = rq->engine;
 	struct i915_request *rn;
-	struct list_head *plist;
+	bool kick = false;
 	LIST_HEAD(dfs);
 
 	lockdep_assert_held(&engine->active.lock);
@@ -299,32 +452,20 @@ static void __i915_request_set_priority(struct i915_request *rq, int prio)
 		}
 	}
 
-	plist = i915_sched_lookup_priolist(engine, prio);
-
-	/* Fifo and depth-first replacement ensure our deps execute first */
 	list_for_each_entry_safe_reverse(rq, rn, &dfs, sched.dfs) {
 		GEM_BUG_ON(rq->engine != engine);
+		GEM_BUG_ON(prio < rq_prio(rq));
 
 		INIT_LIST_HEAD(&rq->sched.dfs);
 		WRITE_ONCE(rq->sched.attr.priority, prio);
+		RQ_TRACE(rq, "set-priority:%d\n", prio);
 
-		/*
-		 * Once the request is ready, it will be placed into the
-		 * priority lists and then onto the HW runlist. Before the
-		 * request is ready, it does not contribute to our preemption
-		 * decisions and we can safely ignore it, as it will, and
-		 * any preemption required, be dealt with upon submission.
-		 * See engine->submit_request()
-		 */
-		if (!i915_request_is_ready(rq))
-			continue;
-
-		if (i915_request_in_priority_queue(rq))
-			list_move_tail(&rq->sched.link, plist);
-
-		/* Defer (tasklet) submission until after all updates. */
-		kick_submission(engine, rq, prio);
+		if (i915_request_is_ready(rq) &&
+		    set_earliest_deadline(rq, rq_deadline(rq)))
+			kick = true;
 	}
+
+	return kick;
 }
 
 void i915_request_set_priority(struct i915_request *rq, int prio)
@@ -376,7 +517,38 @@ void i915_request_set_priority(struct i915_request *rq, int prio)
 	if (prio <= rq_prio(rq))
 		goto unlock;
 
-	__i915_request_set_priority(rq, prio);
+	if (__i915_request_set_priority(rq, prio))
+		tasklet_hi_schedule(&engine->execlists.tasklet);
+
+unlock:
+	spin_unlock_irqrestore(&engine->active.lock, flags);
+}
+
+bool intel_engine_queue_request(struct intel_engine_cs *engine,
+				struct i915_request *rq)
+{
+	lockdep_assert_held(&engine->active.lock);
+	set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
+	return set_earliest_deadline(rq, I915_DEADLINE_NEVER);
+}
+
+void i915_request_update_deadline(struct i915_request *rq)
+{
+	struct intel_engine_cs *engine;
+	unsigned long flags;
+
+	if (!i915_request_is_ready(rq))
+		return;
+
+	engine = lock_engine_irqsave(rq, flags);
+	if (!intel_engine_has_scheduler(engine))
+		goto unlock;
+
+	if (i915_request_completed(rq))
+		goto unlock;
+
+	if (set_earliest_deadline(rq, rq_deadline(rq)))
+		tasklet_hi_schedule(&engine->execlists.tasklet);
 
 unlock:
 	spin_unlock_irqrestore(&engine->active.lock, flags);
@@ -397,6 +569,7 @@ void i915_sched_node_init(struct i915_sched_node *node)
 void i915_sched_node_reinit(struct i915_sched_node *node)
 {
 	node->attr.priority = I915_PRIORITY_INVALID;
+	node->deadline = I915_DEADLINE_NEVER;
 	node->semaphores = 0;
 	node->flags = 0;
 
@@ -429,6 +602,7 @@ bool __i915_sched_node_add_dependency(struct i915_sched_node *node,
 
 	if (!node_signaled(signal)) {
 		INIT_LIST_HEAD(&dep->ipi_link);
+		dep->ipi_deadline = I915_DEADLINE_NEVER;
 		dep->ipi_priority = I915_PRIORITY_INVALID;
 		dep->signaler = signal;
 		dep->waiter = node;
@@ -519,6 +693,10 @@ void i915_sched_node_retire(struct i915_sched_node *node)
 	spin_unlock_irq(&node->lock);
 }
 
+#if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
+#include "selftests/i915_scheduler.c"
+#endif
+
 static void i915_global_scheduler_shrink(void)
 {
 	kmem_cache_shrink(global.slab_dependencies);
diff --git a/drivers/gpu/drm/i915/i915_scheduler.h b/drivers/gpu/drm/i915/i915_scheduler.h
index 53ac819cc786..89875ea3fb20 100644
--- a/drivers/gpu/drm/i915/i915_scheduler.h
+++ b/drivers/gpu/drm/i915/i915_scheduler.h
@@ -34,15 +34,29 @@ int i915_sched_node_add_dependency(struct i915_sched_node *node,
 void i915_sched_node_retire(struct i915_sched_node *node);
 
 void i915_request_set_priority(struct i915_request *request, int prio);
+void i915_request_set_deadline(struct i915_request *request, u64 deadline);
+
+void i915_request_update_deadline(struct i915_request *request);
+
+u64 i915_scheduler_virtual_deadline(u64 kt, int priority);
+u64 i915_scheduler_next_virtual_deadline(int priority);
+
+bool intel_engine_queue_request(struct intel_engine_cs *engine,
+				struct i915_request *rq);
 
 struct list_head *
-i915_sched_lookup_priolist(struct intel_engine_cs *engine, int prio);
+i915_sched_lookup_priolist(struct intel_engine_cs *engine, u64 deadline);
+
+void i915_priolist_free(struct i915_priolist *p);
+
+static inline u64 i915_sched_to_ticks(ktime_t kt)
+{
+	return ktime_to_ns(kt) >> I915_SCHED_DEADLINE_SHIFT;
+}
 
-void __i915_priolist_free(struct i915_priolist *p);
-static inline void i915_priolist_free(struct i915_priolist *p)
+static inline u64 i915_sched_to_ns(u64 deadline)
 {
-	if (p->priority != I915_PRIORITY_NORMAL)
-		__i915_priolist_free(p);
+	return deadline << I915_SCHED_DEADLINE_SHIFT;
 }
 
 #endif /* _I915_SCHEDULER_H_ */
diff --git a/drivers/gpu/drm/i915/i915_scheduler_types.h b/drivers/gpu/drm/i915/i915_scheduler_types.h
index ce60577df2bf..ae7ca78a88c8 100644
--- a/drivers/gpu/drm/i915/i915_scheduler_types.h
+++ b/drivers/gpu/drm/i915/i915_scheduler_types.h
@@ -69,6 +69,22 @@ struct i915_sched_node {
 	unsigned int flags;
 #define I915_SCHED_HAS_EXTERNAL_CHAIN	BIT(0)
 	intel_engine_mask_t semaphores;
+
+	/**
+	 * @deadline: [virtual] deadline
+	 *
+	 * When the request is ready for execution, it is given a quota
+	 * (the engine's timeslice) and a virtual deadline. The virtual
+	 * deadline is derived from the current time:
+	 *     ktime_get() + (prio_ratio * timeslice)
+	 *
+	 * Requests are then executed in order of deadline completion.
+	 * Requests with earlier deadlines than currently executing on
+	 * the engine will preempt the active requests.
+	 */
+	u64 deadline;
+#define I915_SCHED_DEADLINE_SHIFT 19 /* i.e. roughly 500us buckets */
+#define I915_DEADLINE_NEVER U64_MAX
 };
 
 struct i915_dependency {
@@ -81,6 +97,7 @@ struct i915_dependency {
 #define I915_DEPENDENCY_ALLOC		BIT(0)
 #define I915_DEPENDENCY_EXTERNAL	BIT(1)
 #define I915_DEPENDENCY_WEAK		BIT(2)
+	u64 ipi_deadline;
 	int ipi_priority;
 };
 
diff --git a/drivers/gpu/drm/i915/selftests/i915_mock_selftests.h b/drivers/gpu/drm/i915/selftests/i915_mock_selftests.h
index cb6f94633356..3782cdd281cc 100644
--- a/drivers/gpu/drm/i915/selftests/i915_mock_selftests.h
+++ b/drivers/gpu/drm/i915/selftests/i915_mock_selftests.h
@@ -25,6 +25,7 @@ selftest(ring, intel_ring_mock_selftests)
 selftest(engine, intel_engine_cs_mock_selftests)
 selftest(timelines, intel_timeline_mock_selftests)
 selftest(requests, i915_request_mock_selftests)
+selftest(scheduler, i915_scheduler_mock_selftests)
 selftest(objects, i915_gem_object_mock_selftests)
 selftest(acquire, i915_acquire_mock_selftests)
 selftest(phys, i915_gem_phys_mock_selftests)
diff --git a/drivers/gpu/drm/i915/selftests/i915_request.c b/drivers/gpu/drm/i915/selftests/i915_request.c
index 236f9fda8f31..ff21d8de7689 100644
--- a/drivers/gpu/drm/i915/selftests/i915_request.c
+++ b/drivers/gpu/drm/i915/selftests/i915_request.c
@@ -2123,6 +2123,7 @@ static int measure_preemption(struct intel_context *ce)
 
 		intel_ring_advance(rq, cs);
 		rq->sched.attr.priority = I915_PRIORITY_BARRIER;
+		rq->sched.deadline = 0;
 
 		elapsed[i - 1] = ENGINE_READ_FW(ce->engine, RING_TIMESTAMP);
 		i915_request_add(rq);
diff --git a/drivers/gpu/drm/i915/selftests/i915_scheduler.c b/drivers/gpu/drm/i915/selftests/i915_scheduler.c
new file mode 100644
index 000000000000..9ca50db81034
--- /dev/null
+++ b/drivers/gpu/drm/i915/selftests/i915_scheduler.c
@@ -0,0 +1,49 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2020 Intel Corporation
+ */
+
+#include "i915_selftest.h"
+
+static int mock_scheduler_slices(void *dummy)
+{
+	u64 min, max, normal, kernel;
+
+	min = prio_slice(I915_PRIORITY_MIN);
+	pr_info("%8s slice: %lluus\n", "min", min >> 10);
+
+	normal = prio_slice(0);
+	pr_info("%8s slice: %lluus\n", "normal", normal >> 10);
+
+	max = prio_slice(I915_PRIORITY_MAX);
+	pr_info("%8s slice: %lluus\n", "max", max >> 10);
+
+	kernel = prio_slice(I915_PRIORITY_BARRIER);
+	pr_info("%8s slice: %lluus\n", "kernel", kernel >> 10);
+
+	if (kernel != 0) {
+		pr_err("kernel prio slice should be 0\n");
+		return -EINVAL;
+	}
+
+	if (max >= normal) {
+		pr_err("maximum prio slice should be shorter than normal\n");
+		return -EINVAL;
+	}
+
+	if (min <= normal) {
+		pr_err("minimum prio slice should be longer than normal\n");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+int i915_scheduler_mock_selftests(void)
+{
+	static const struct i915_subtest tests[] = {
+		SUBTEST(mock_scheduler_slices),
+	};
+
+	return i915_subtests(tests, NULL);
+}
-- 
2.20.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [Intel-gfx] [PATCH 56/66] drm/i915/gt: Specify a deadline for the heartbeat
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (53 preceding siblings ...)
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 55/66] drm/i915: Fair low-latency scheduling Chris Wilson
@ 2020-07-15 11:51 ` Chris Wilson
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 57/66] drm/i915: Replace the priority boosting for the display with a deadline Chris Wilson
                   ` (17 subsequent siblings)
  72 siblings, 0 replies; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:51 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

As we know when we expect the heartbeat to be checked for completion,
pass this information along as its deadline. We still do not complain if
the deadline is missed, at least until we have tried a few times, but it
will allow for quicker hang detection on systems where deadlines are
adhered to.
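
One note on the arithmetic used below: the deadline is set roughly
heartbeat_interval_ms into the future, with the millisecond-to-nanosecond
conversion approximated by a left shift (2^20 is roughly 10^6). A minimal
sketch of that approximation, purely for illustration:

  /* Illustration: interval_ms << 20 approximates interval_ms * 1000000. */
  #include <stdio.h>

  int main(void)
  {
          unsigned long long interval_ms = 2500; /* example heartbeat interval */
          unsigned long long approx_ns = interval_ms << 20;
          unsigned long long exact_ns = interval_ms * 1000000ull;

          /* ~4.9% overestimate, acceptable slack for a coarse heartbeat deadline. */
          printf("approx=%lluns exact=%lluns\n", approx_ns, exact_ns);
          return 0;
  }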

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
index 9fdc8223007f..41199254b2b5 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
@@ -54,6 +54,16 @@ static void heartbeat_commit(struct i915_request *rq,
 	local_bh_enable();
 }
 
+static void set_heartbeat_deadline(struct intel_engine_cs *engine,
+				   struct i915_request *rq)
+{
+	unsigned long interval;
+
+	interval = READ_ONCE(engine->props.heartbeat_interval_ms);
+	if (interval)
+		i915_request_set_deadline(rq, ktime_get() + (interval << 20));
+}
+
 static void show_heartbeat(const struct i915_request *rq,
 			   struct intel_engine_cs *engine)
 {
@@ -119,6 +129,8 @@ static void heartbeat(struct work_struct *wrk)
 
 			local_bh_disable();
 			i915_request_set_priority(rq, attr.priority);
+			if (attr.priority == I915_PRIORITY_BARRIER)
+				i915_request_set_deadline(rq, 0);
 			local_bh_enable();
 		} else {
 			if (IS_ENABLED(CONFIG_DRM_I915_DEBUG_GEM))
@@ -155,6 +167,7 @@ static void heartbeat(struct work_struct *wrk)
 	if (engine->i915->params.enable_hangcheck)
 		engine->heartbeat.systole = i915_request_get(rq);
 
+	set_heartbeat_deadline(engine, rq);
 	heartbeat_commit(rq, &attr);
 
 unlock:
-- 
2.20.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [Intel-gfx] [PATCH 57/66] drm/i915: Replace the priority boosting for the display with a deadline
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (54 preceding siblings ...)
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 56/66] drm/i915/gt: Specify a deadline for the heartbeat Chris Wilson
@ 2020-07-15 11:51 ` Chris Wilson
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 58/66] drm/i915: Move saturated workload detection to the GT Chris Wilson
                   ` (16 subsequent siblings)
  72 siblings, 0 replies; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:51 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

For a modeset/pageflip, there is a very precise deadline by which the
frame must be completed in order to hit the vblank and be shown. While
we don't pass along that exact information, we can at least inform the
scheduler that this request-chain needs to be completed asap.
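
The change below simply uses the current time as the deadline, i.e. the
request-chain is treated as due immediately; the exact target, if it were
ever plumbed through, would conceptually be the next vblank. A hypothetical
sketch of deriving such a deadline from the refresh rate, not something
this patch implements:

  /* Hypothetical: derive a pageflip deadline from the display refresh rate. */
  #include <stdint.h>
  #include <stdio.h>

  static uint64_t next_vblank_deadline(uint64_t now_ns, uint64_t last_vblank_ns,
                                       unsigned int refresh_hz)
  {
          uint64_t period_ns = 1000000000ull / refresh_hz;
          uint64_t elapsed_ns = now_ns - last_vblank_ns;

          /* Round up to the first vblank boundary after 'now'. */
          return last_vblank_ns + (elapsed_ns / period_ns + 1) * period_ns;
  }

  int main(void)
  {
          /* Example: 60Hz panel, 5ms past the previous vblank. */
          printf("deadline=%lluns\n",
                 (unsigned long long)next_vblank_deadline(5000000, 0, 60));
          return 0;
  }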

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/display/intel_display.c |  2 +-
 drivers/gpu/drm/i915/gem/i915_gem_object.h   |  4 ++--
 drivers/gpu/drm/i915/gem/i915_gem_wait.c     | 19 ++++++++++---------
 drivers/gpu/drm/i915/i915_priolist_types.h   |  3 ---
 4 files changed, 13 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/i915/display/intel_display.c b/drivers/gpu/drm/i915/display/intel_display.c
index c74e664a3759..1c644b613246 100644
--- a/drivers/gpu/drm/i915/display/intel_display.c
+++ b/drivers/gpu/drm/i915/display/intel_display.c
@@ -15983,7 +15983,7 @@ intel_prepare_plane_fb(struct drm_plane *_plane,
 	if (ret)
 		return ret;
 
-	i915_gem_object_wait_priority(obj, 0, I915_PRIORITY_DISPLAY);
+	i915_gem_object_wait_deadline(obj, 0, ktime_get() /* next vblank? */);
 	i915_gem_object_flush_frontbuffer(obj, ORIGIN_DIRTYFB);
 
 	if (!new_plane_state->uapi.fence) { /* implicit fencing */
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_object.h b/drivers/gpu/drm/i915/gem/i915_gem_object.h
index d916155b0c52..d2d7e6bd099d 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_object.h
+++ b/drivers/gpu/drm/i915/gem/i915_gem_object.h
@@ -457,9 +457,9 @@ static inline void __start_cpu_write(struct drm_i915_gem_object *obj)
 int i915_gem_object_wait(struct drm_i915_gem_object *obj,
 			 unsigned int flags,
 			 long timeout);
-int i915_gem_object_wait_priority(struct drm_i915_gem_object *obj,
+int i915_gem_object_wait_deadline(struct drm_i915_gem_object *obj,
 				  unsigned int flags,
-				  int prio);
+				  ktime_t deadline);
 
 void __i915_gem_object_flush_frontbuffer(struct drm_i915_gem_object *obj,
 					 enum fb_op_origin origin);
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_wait.c b/drivers/gpu/drm/i915/gem/i915_gem_wait.c
index cefbbb3d9b52..3334817183f6 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_wait.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_wait.c
@@ -93,17 +93,18 @@ i915_gem_object_wait_reservation(struct dma_resv *resv,
 	return timeout;
 }
 
-static void __fence_set_priority(struct dma_fence *fence, int prio)
+static void __fence_set_deadline(struct dma_fence *fence, ktime_t deadline)
 {
 	if (dma_fence_is_signaled(fence) || !dma_fence_is_i915(fence))
 		return;
 
 	local_bh_disable();
-	i915_request_set_priority(to_request(fence), prio);
+	i915_request_set_deadline(to_request(fence),
+				  i915_sched_to_ticks(deadline));
 	local_bh_enable(); /* kick the tasklets if queues were reprioritised */
 }
 
-static void fence_set_priority(struct dma_fence *fence, int prio)
+static void fence_set_deadline(struct dma_fence *fence, ktime_t deadline)
 {
 	/* Recurse once into a fence-array */
 	if (dma_fence_is_array(fence)) {
@@ -111,16 +112,16 @@ static void fence_set_priority(struct dma_fence *fence, int prio)
 		int i;
 
 		for (i = 0; i < array->num_fences; i++)
-			__fence_set_priority(array->fences[i], prio);
+			__fence_set_deadline(array->fences[i], deadline);
 	} else {
-		__fence_set_priority(fence, prio);
+		__fence_set_deadline(fence, deadline);
 	}
 }
 
 int
-i915_gem_object_wait_priority(struct drm_i915_gem_object *obj,
+i915_gem_object_wait_deadline(struct drm_i915_gem_object *obj,
 			      unsigned int flags,
-			      int prio)
+			      ktime_t deadline)
 {
 	struct dma_fence *excl;
 
@@ -135,7 +136,7 @@ i915_gem_object_wait_priority(struct drm_i915_gem_object *obj,
 			return ret;
 
 		for (i = 0; i < count; i++) {
-			fence_set_priority(shared[i], prio);
+			fence_set_deadline(shared[i], deadline);
 			dma_fence_put(shared[i]);
 		}
 
@@ -145,7 +146,7 @@ i915_gem_object_wait_priority(struct drm_i915_gem_object *obj,
 	}
 
 	if (excl) {
-		fence_set_priority(excl, prio);
+		fence_set_deadline(excl, deadline);
 		dma_fence_put(excl);
 	}
 	return 0;
diff --git a/drivers/gpu/drm/i915/i915_priolist_types.h b/drivers/gpu/drm/i915/i915_priolist_types.h
index 43a0ac45295f..ac6d9614ea23 100644
--- a/drivers/gpu/drm/i915/i915_priolist_types.h
+++ b/drivers/gpu/drm/i915/i915_priolist_types.h
@@ -20,9 +20,6 @@ enum {
 	/* A preemptive pulse used to monitor the health of each engine */
 	I915_PRIORITY_HEARTBEAT,
 
-	/* Interactive workload, scheduled for immediate pageflipping */
-	I915_PRIORITY_DISPLAY,
-
 	__I915_PRIORITY_KERNEL__
 };
 
-- 
2.20.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [Intel-gfx] [PATCH 58/66] drm/i915: Move saturated workload detection to the GT
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (55 preceding siblings ...)
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 57/66] drm/i915: Replace the priority boosting for the display with a deadline Chris Wilson
@ 2020-07-15 11:51 ` Chris Wilson
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 59/66] Restore "drm/i915: drop engine_pin/unpin_breadcrumbs_irq" Chris Wilson
                   ` (15 subsequent siblings)
  72 siblings, 0 replies; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:51 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

When we introduced the saturated workload detection to tell us to back
off from semaphore usage [semaphores have a noticeable impact on
contended bus cycles with the CPU for some heavy workloads], we first
introduced it as a per-context tracker. This allowed individual contexts
to try to optimise their own usage, but we found that with the local
tracking and the no-semaphore boosting, the first context to disable
semaphores got a massive priority boost and so would starve the rest and
all new contexts (as they started with semaphores enabled and lower
priority). Hence we moved the saturated workload detection to the
engine, and as a consequence had to disable semaphores on virtual
engines.

Now that we do not have semaphore priority boosting, we can move the
tracking to the GT and virtual engines can now utilise the faster
inter-engine synchronisation, while maintaining the global information
to back off on saturation.
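
For illustration, the tracking reduces to a single bitmask on the GT: an
engine flags itself when one of its semaphores was signaled before the
request could even be submitted, and every engine (including the siblings
behind a virtual engine) consults that shared mask before emitting another
busywait. A rough sketch of the idea, mirroring the hunks below:

    /* sketch: per-GT saturation tracking shared by all engines */
    static void mark_saturated(struct intel_engine_cs *engine)
    {
            /* the semaphore fired before we were submitted; too late */
            set_bit(engine->id, &engine->gt->saturated);
    }

    static bool engine_saturated(const struct intel_engine_cs *engine)
    {
            /* has any engine backing this submission gone saturated? */
            return engine->mask & READ_ONCE(engine->gt->saturated);
    }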

References: 44d89409a12e ("drm/i915: Make the semaphore saturation mask global")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/gt/intel_engine_pm.c    |  2 +-
 drivers/gpu/drm/i915/gt/intel_engine_types.h |  2 --
 drivers/gpu/drm/i915/gt/intel_gt_types.h     |  2 ++
 drivers/gpu/drm/i915/gt/intel_lrc.c          | 15 ---------------
 drivers/gpu/drm/i915/i915_request.c          | 13 ++++++++-----
 5 files changed, 11 insertions(+), 23 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.c b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
index a95099b7b759..630c0cf8cffd 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_pm.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
@@ -231,7 +231,7 @@ static int __engine_park(struct intel_wakeref *wf)
 	struct intel_engine_cs *engine =
 		container_of(wf, typeof(*engine), wakeref);
 
-	engine->saturated = 0;
+	clear_bit(engine->id, &engine->gt->saturated);
 
 	/*
 	 * If one and only one request is completed between pm events,
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h b/drivers/gpu/drm/i915/gt/intel_engine_types.h
index af6f1154200a..8c502cf34de7 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
@@ -324,8 +324,6 @@ struct intel_engine_cs {
 
 	struct intel_context *kernel_context; /* pinned */
 
-	intel_engine_mask_t saturated; /* submitting semaphores too late? */
-
 	struct {
 		struct delayed_work work;
 		struct i915_request *systole;
diff --git a/drivers/gpu/drm/i915/gt/intel_gt_types.h b/drivers/gpu/drm/i915/gt/intel_gt_types.h
index 6d39a4a11bf3..6e7719082add 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_gt_types.h
@@ -74,6 +74,8 @@ struct intel_gt {
 	 */
 	intel_wakeref_t awake;
 
+	unsigned long saturated; /* submitting semaphores too late? */
+
 	u32 clock_frequency;
 
 	struct intel_llc llc;
diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index 6054695611ad..a22d24a5696e 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -5541,21 +5541,6 @@ intel_execlists_create_virtual(struct intel_engine_cs **siblings,
 	ve->base.instance = I915_ENGINE_CLASS_INVALID_VIRTUAL;
 	ve->base.uabi_instance = I915_ENGINE_CLASS_INVALID_VIRTUAL;
 
-	/*
-	 * The decision on whether to submit a request using semaphores
-	 * depends on the saturated state of the engine. We only compute
-	 * this during HW submission of the request, and we need for this
-	 * state to be globally applied to all requests being submitted
-	 * to this engine. Virtual engines encompass more than one physical
-	 * engine and so we cannot accurately tell in advance if one of those
-	 * engines is already saturated and so cannot afford to use a semaphore
-	 * and be pessimized in priority for doing so -- if we are the only
-	 * context using semaphores after all other clients have stopped, we
-	 * will be starved on the saturated system. Such a global switch for
-	 * semaphores is less than ideal, but alas is the current compromise.
-	 */
-	ve->base.saturated = ALL_ENGINES;
-
 	snprintf(ve->base.name, sizeof(ve->base.name), "virtual");
 
 	intel_engine_init_active(&ve->base, ENGINE_VIRTUAL);
diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index a90e90e96c19..e4bafd90432b 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -535,7 +535,7 @@ bool __i915_request_submit(struct i915_request *request)
 	 */
 	if (request->sched.semaphores &&
 	    i915_sw_fence_signaled(&request->semaphore))
-		engine->saturated |= request->sched.semaphores;
+		set_bit(engine->id, &engine->gt->saturated);
 
 	engine->emit_fini_breadcrumb(request,
 				     request->ring->vaddr + request->postfix);
@@ -965,8 +965,7 @@ i915_request_await_start(struct i915_request *rq, struct i915_request *signal)
 	return err;
 }
 
-static intel_engine_mask_t
-already_busywaiting(struct i915_request *rq)
+static bool engine_saturated(struct intel_engine_cs *engine)
 {
 	/*
 	 * Polling a semaphore causes bus traffic, delaying other users of
@@ -980,7 +979,7 @@ already_busywaiting(struct i915_request *rq)
 	 *
 	 * See the are-we-too-late? check in __i915_request_submit().
 	 */
-	return rq->sched.semaphores | READ_ONCE(rq->engine->saturated);
+	return engine->mask & READ_ONCE(engine->gt->saturated);
 }
 
 static int
@@ -1061,7 +1060,11 @@ emit_semaphore_wait(struct i915_request *to,
 		goto await_fence;
 
 	/* Just emit the first semaphore we see as request space is limited. */
-	if (already_busywaiting(to) & mask)
+	if (to->sched.semaphores & mask)
+		goto await_fence;
+
+	/* Don't over use semaphores, they may be fast but not free. */
+	if (engine_saturated(to->engine))
 		goto await_fence;
 
 	if (i915_request_await_start(to, from) < 0)
-- 
2.20.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [Intel-gfx] [PATCH 59/66] Restore "drm/i915: drop engine_pin/unpin_breadcrumbs_irq"
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (56 preceding siblings ...)
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 58/66] drm/i915: Move saturated workload detection to the GT Chris Wilson
@ 2020-07-15 11:51 ` Chris Wilson
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 60/66] drm/i915/gt: Couple tasklet scheduling for all CS interrupts Chris Wilson
                   ` (14 subsequent siblings)
  72 siblings, 0 replies; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:51 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

This was removed in commit 478ffad6d690 ("drm/i915: drop
engine_pin/unpin_breadcrumbs_irq") as the last user had been removed,
but now there is a promise of a new user in the next patch.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
---
 drivers/gpu/drm/i915/gt/intel_breadcrumbs.c | 22 +++++++++++++++++++++
 drivers/gpu/drm/i915/gt/intel_engine.h      |  3 +++
 2 files changed, 25 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_breadcrumbs.c b/drivers/gpu/drm/i915/gt/intel_breadcrumbs.c
index 87fd06d3eb3f..5a7a4853cbba 100644
--- a/drivers/gpu/drm/i915/gt/intel_breadcrumbs.c
+++ b/drivers/gpu/drm/i915/gt/intel_breadcrumbs.c
@@ -220,6 +220,28 @@ static void signal_irq_work(struct irq_work *work)
 	}
 }
 
+void intel_engine_pin_breadcrumbs_irq(struct intel_engine_cs *engine)
+{
+	struct intel_breadcrumbs *b = &engine->breadcrumbs;
+
+	spin_lock_irq(&b->irq_lock);
+	if (!b->irq_enabled++)
+		irq_enable(engine);
+	GEM_BUG_ON(!b->irq_enabled); /* no overflow! */
+	spin_unlock_irq(&b->irq_lock);
+}
+
+void intel_engine_unpin_breadcrumbs_irq(struct intel_engine_cs *engine)
+{
+	struct intel_breadcrumbs *b = &engine->breadcrumbs;
+
+	spin_lock_irq(&b->irq_lock);
+	GEM_BUG_ON(!b->irq_enabled); /* no underflow! */
+	if (!--b->irq_enabled)
+		irq_disable(engine);
+	spin_unlock_irq(&b->irq_lock);
+}
+
 static void __intel_breadcrumbs_arm_irq(struct intel_breadcrumbs *b)
 {
 	struct intel_engine_cs *engine =
diff --git a/drivers/gpu/drm/i915/gt/intel_engine.h b/drivers/gpu/drm/i915/gt/intel_engine.h
index a9249a23903a..dcc2fc22ea37 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine.h
@@ -226,6 +226,9 @@ void intel_engine_init_execlists(struct intel_engine_cs *engine);
 void intel_engine_init_breadcrumbs(struct intel_engine_cs *engine);
 void intel_engine_fini_breadcrumbs(struct intel_engine_cs *engine);
 
+void intel_engine_pin_breadcrumbs_irq(struct intel_engine_cs *engine);
+void intel_engine_unpin_breadcrumbs_irq(struct intel_engine_cs *engine);
+
 void intel_engine_disarm_breadcrumbs(struct intel_engine_cs *engine);
 
 static inline void
-- 
2.20.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [Intel-gfx] [PATCH 60/66] drm/i915/gt: Couple tasklet scheduling for all CS interrupts
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (57 preceding siblings ...)
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 59/66] Restore "drm/i915: drop engine_pin/unpin_breadcrumbs_irq" Chris Wilson
@ 2020-07-15 11:51 ` Chris Wilson
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 61/66] drm/i915/gt: Support creation of 'internal' rings Chris Wilson
                   ` (13 subsequent siblings)
  72 siblings, 0 replies; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:51 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

If any engine asks for the tasklet to be kicked from the CS interrupt,
do so. Currently, this is used by the execlists scheduler backends to
feed in the next request to the HW, and similarly could be used by a
ring scheduler, as will be seen in upcoming patches.
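
The coupling is strictly opt-in: only engines that advertise
I915_ENGINE_NEEDS_BREADCRUMB_TASKLET get the extra tasklet kick, so the
existing execlists and ring submission paths behave exactly as before. A
sketch of the test this relies on (assuming the usual engine->flags
helper, which is not part of this diff):

    static inline bool
    intel_engine_needs_breadcrumb_tasklet(const struct intel_engine_cs *engine)
    {
            return engine->flags & I915_ENGINE_NEEDS_BREADCRUMB_TASKLET;
    }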

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
---
 drivers/gpu/drm/i915/gt/intel_gt_irq.c | 19 ++++++++++++++-----
 drivers/gpu/drm/i915/gt/intel_gt_irq.h |  3 +++
 drivers/gpu/drm/i915/gt/intel_rps.c    |  2 +-
 drivers/gpu/drm/i915/i915_irq.c        |  8 ++++----
 4 files changed, 22 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_gt_irq.c b/drivers/gpu/drm/i915/gt/intel_gt_irq.c
index b05da68e52f4..b825b93b4b05 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_irq.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt_irq.c
@@ -61,6 +61,15 @@ cs_irq_handler(struct intel_engine_cs *engine, u32 iir)
 		tasklet_hi_schedule(&engine->execlists.tasklet);
 }
 
+void gen2_engine_cs_irq(struct intel_engine_cs *engine)
+{
+	if (!list_empty(&engine->breadcrumbs.signalers))
+		intel_engine_signal_breadcrumbs(engine);
+
+	if (intel_engine_needs_breadcrumb_tasklet(engine))
+		tasklet_hi_schedule(&engine->execlists.tasklet);
+}
+
 static u32
 gen11_gt_engine_identity(struct intel_gt *gt,
 			 const unsigned int bank, const unsigned int bit)
@@ -274,9 +283,9 @@ void gen11_gt_irq_postinstall(struct intel_gt *gt)
 void gen5_gt_irq_handler(struct intel_gt *gt, u32 gt_iir)
 {
 	if (gt_iir & GT_RENDER_USER_INTERRUPT)
-		intel_engine_signal_breadcrumbs(gt->engine_class[RENDER_CLASS][0]);
+		gen2_engine_cs_irq(gt->engine_class[RENDER_CLASS][0]);
 	if (gt_iir & ILK_BSD_USER_INTERRUPT)
-		intel_engine_signal_breadcrumbs(gt->engine_class[VIDEO_DECODE_CLASS][0]);
+		gen2_engine_cs_irq(gt->engine_class[VIDEO_DECODE_CLASS][0]);
 }
 
 static void gen7_parity_error_irq_handler(struct intel_gt *gt, u32 iir)
@@ -300,11 +309,11 @@ static void gen7_parity_error_irq_handler(struct intel_gt *gt, u32 iir)
 void gen6_gt_irq_handler(struct intel_gt *gt, u32 gt_iir)
 {
 	if (gt_iir & GT_RENDER_USER_INTERRUPT)
-		intel_engine_signal_breadcrumbs(gt->engine_class[RENDER_CLASS][0]);
+		gen2_engine_cs_irq(gt->engine_class[RENDER_CLASS][0]);
 	if (gt_iir & GT_BSD_USER_INTERRUPT)
-		intel_engine_signal_breadcrumbs(gt->engine_class[VIDEO_DECODE_CLASS][0]);
+		gen2_engine_cs_irq(gt->engine_class[VIDEO_DECODE_CLASS][0]);
 	if (gt_iir & GT_BLT_USER_INTERRUPT)
-		intel_engine_signal_breadcrumbs(gt->engine_class[COPY_ENGINE_CLASS][0]);
+		gen2_engine_cs_irq(gt->engine_class[COPY_ENGINE_CLASS][0]);
 
 	if (gt_iir & (GT_BLT_CS_ERROR_INTERRUPT |
 		      GT_BSD_CS_ERROR_INTERRUPT |
diff --git a/drivers/gpu/drm/i915/gt/intel_gt_irq.h b/drivers/gpu/drm/i915/gt/intel_gt_irq.h
index 886c5cf408a2..6c69cd563fe1 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_irq.h
+++ b/drivers/gpu/drm/i915/gt/intel_gt_irq.h
@@ -9,6 +9,7 @@
 
 #include <linux/types.h>
 
+struct intel_engine_cs;
 struct intel_gt;
 
 #define GEN8_GT_IRQS (GEN8_GT_RCS_IRQ | \
@@ -19,6 +20,8 @@ struct intel_gt;
 		      GEN8_GT_PM_IRQ | \
 		      GEN8_GT_GUC_IRQ)
 
+void gen2_engine_cs_irq(struct intel_engine_cs *engine);
+
 void gen11_gt_irq_reset(struct intel_gt *gt);
 void gen11_gt_irq_postinstall(struct intel_gt *gt);
 void gen11_gt_irq_handler(struct intel_gt *gt, const u32 master_ctl);
diff --git a/drivers/gpu/drm/i915/gt/intel_rps.c b/drivers/gpu/drm/i915/gt/intel_rps.c
index 97ba14ad52e4..49910425e986 100644
--- a/drivers/gpu/drm/i915/gt/intel_rps.c
+++ b/drivers/gpu/drm/i915/gt/intel_rps.c
@@ -1741,7 +1741,7 @@ void gen6_rps_irq_handler(struct intel_rps *rps, u32 pm_iir)
 		return;
 
 	if (pm_iir & PM_VEBOX_USER_INTERRUPT)
-		intel_engine_signal_breadcrumbs(gt->engine[VECS0]);
+		gen2_engine_cs_irq(gt->engine[VECS0]);
 
 	if (pm_iir & PM_VEBOX_CS_ERROR_INTERRUPT)
 		DRM_DEBUG("Command parser error, pm_iir 0x%08x\n", pm_iir);
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 1fa67700d8f4..27a0b3b89ddf 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -3734,7 +3734,7 @@ static irqreturn_t i8xx_irq_handler(int irq, void *arg)
 		intel_uncore_write16(&dev_priv->uncore, GEN2_IIR, iir);
 
 		if (iir & I915_USER_INTERRUPT)
-			intel_engine_signal_breadcrumbs(dev_priv->gt.engine[RCS0]);
+			gen2_engine_cs_irq(dev_priv->gt.engine[RCS0]);
 
 		if (iir & I915_MASTER_ERROR_INTERRUPT)
 			i8xx_error_irq_handler(dev_priv, eir, eir_stuck);
@@ -3839,7 +3839,7 @@ static irqreturn_t i915_irq_handler(int irq, void *arg)
 		I915_WRITE(GEN2_IIR, iir);
 
 		if (iir & I915_USER_INTERRUPT)
-			intel_engine_signal_breadcrumbs(dev_priv->gt.engine[RCS0]);
+			gen2_engine_cs_irq(dev_priv->gt.engine[RCS0]);
 
 		if (iir & I915_MASTER_ERROR_INTERRUPT)
 			i9xx_error_irq_handler(dev_priv, eir, eir_stuck);
@@ -3981,10 +3981,10 @@ static irqreturn_t i965_irq_handler(int irq, void *arg)
 		I915_WRITE(GEN2_IIR, iir);
 
 		if (iir & I915_USER_INTERRUPT)
-			intel_engine_signal_breadcrumbs(dev_priv->gt.engine[RCS0]);
+			gen2_engine_cs_irq(dev_priv->gt.engine[RCS0]);
 
 		if (iir & I915_BSD_USER_INTERRUPT)
-			intel_engine_signal_breadcrumbs(dev_priv->gt.engine[VCS0]);
+			gen2_engine_cs_irq(dev_priv->gt.engine[VCS0]);
 
 		if (iir & I915_MASTER_ERROR_INTERRUPT)
 			i9xx_error_irq_handler(dev_priv, eir, eir_stuck);
-- 
2.20.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [Intel-gfx] [PATCH 61/66] drm/i915/gt: Support creation of 'internal' rings
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (58 preceding siblings ...)
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 60/66] drm/i915/gt: Couple tasklet scheduling for all CS interrupts Chris Wilson
@ 2020-07-15 11:51 ` Chris Wilson
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 62/66] drm/i915/gt: Use client timeline address for seqno writes Chris Wilson
                   ` (12 subsequent siblings)
  72 siblings, 0 replies; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:51 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

To support legacy ring buffer scheduling, we want a virtual ringbuffer
for each client. These rings are purely for holding the requests as they
are being constructed on the CPU and never accessed by the GPU, so they
should not be bound into the GGTT, and we can use plain old WB mapped
pages.

As they are not bound, we need to nerf a few assumptions that a rq->ring
is in the GGTT.
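
Since a ring is always a power-of-two size of at least a page, the new
creation flag can be packed into the low bits of the size argument. A
caller wanting one of these CPU-only staging rings would do something
like (sketch only):

    struct intel_ring *ring;

    /* a WB-mapped ring for staging requests; never bound into the GGTT */
    ring = intel_engine_create_ring(engine,
                                    SZ_16K | INTEL_RING_CREATE_INTERNAL);
    if (IS_ERR(ring))
            return PTR_ERR(ring);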

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/gt/intel_context.c    |  2 +-
 drivers/gpu/drm/i915/gt/intel_engine_cs.c  | 17 ++----
 drivers/gpu/drm/i915/gt/intel_ring.c       | 63 ++++++++++++++--------
 drivers/gpu/drm/i915/gt/intel_ring.h       | 12 ++++-
 drivers/gpu/drm/i915/gt/intel_ring_types.h |  2 +
 5 files changed, 57 insertions(+), 39 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
index 9ba1c15114d7..fb32b6c92f29 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.c
+++ b/drivers/gpu/drm/i915/gt/intel_context.c
@@ -129,7 +129,7 @@ int __intel_context_do_pin(struct intel_context *ce)
 			goto err_active;
 
 		CE_TRACE(ce, "pin ring:{start:%08x, head:%04x, tail:%04x}\n",
-			 i915_ggtt_offset(ce->ring->vma),
+			 intel_ring_address(ce->ring),
 			 ce->ring->head, ce->ring->tail);
 
 		smp_mb__before_atomic(); /* flush pin before it is visible */
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index af9cc42d3061..df234ce10907 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -1342,7 +1342,7 @@ static int print_ring(char *buf, int sz, struct i915_request *rq)
 
 		len = scnprintf(buf, sz,
 				"ring:{start:%08x, hwsp:%08x, seqno:%08x, runtime:%llums}, ",
-				i915_ggtt_offset(rq->ring->vma),
+				intel_ring_address(rq->ring),
 				tl ? tl->hwsp_offset : 0,
 				hwsp_seqno(rq),
 				DIV_ROUND_CLOSEST_ULL(intel_context_get_total_runtime_ns(rq->context),
@@ -1634,7 +1634,7 @@ void intel_engine_dump(struct intel_engine_cs *engine,
 		print_request(m, rq, "\t\tactive ");
 
 		drm_printf(m, "\t\tring->start:  0x%08x\n",
-			   i915_ggtt_offset(rq->ring->vma));
+			   intel_ring_address(rq->ring));
 		drm_printf(m, "\t\tring->head:   0x%08x\n",
 			   rq->ring->head);
 		drm_printf(m, "\t\tring->tail:   0x%08x\n",
@@ -1715,13 +1715,6 @@ ktime_t intel_engine_get_busy_time(struct intel_engine_cs *engine, ktime_t *now)
 	return total;
 }
 
-static bool match_ring(struct i915_request *rq)
-{
-	u32 ring = ENGINE_READ(rq->engine, RING_START);
-
-	return ring == i915_ggtt_offset(rq->ring->vma);
-}
-
 struct i915_request *
 intel_engine_find_active_request(struct intel_engine_cs *engine)
 {
@@ -1761,11 +1754,7 @@ intel_engine_find_active_request(struct intel_engine_cs *engine)
 			continue;
 
 		if (!i915_request_started(request))
-			continue;
-
-		/* More than one preemptible request may match! */
-		if (!match_ring(request))
-			continue;
+			break;
 
 		active = request;
 		break;
diff --git a/drivers/gpu/drm/i915/gt/intel_ring.c b/drivers/gpu/drm/i915/gt/intel_ring.c
index 1c21f5725731..9aeb4025c485 100644
--- a/drivers/gpu/drm/i915/gt/intel_ring.c
+++ b/drivers/gpu/drm/i915/gt/intel_ring.c
@@ -33,33 +33,42 @@ int intel_ring_pin_locked(struct intel_ring *ring)
 {
 	struct i915_vma *vma = ring->vma;
 	enum i915_map_type type;
-	unsigned int flags;
 	void *addr;
 	int ret;
 
 	if (atomic_fetch_inc(&ring->pin_count))
 		return 0;
 
-	/* Ring wraparound at offset 0 sometimes hangs. No idea why. */
-	flags = PIN_OFFSET_BIAS | i915_ggtt_pin_bias(vma);
+	if (!(ring->flags & INTEL_RING_CREATE_INTERNAL)) {
+		unsigned int pin;
 
-	if (vma->obj->stolen)
-		flags |= PIN_MAPPABLE;
-	else
-		flags |= PIN_HIGH;
+		/* Ring wraparound at offset 0 sometimes hangs. No idea why. */
+		pin = PIN_OFFSET_BIAS | i915_ggtt_pin_bias(vma);
 
-	ret = i915_ggtt_pin_locked(vma, 0, flags);
-	if (unlikely(ret))
-		goto err_unpin;
+		if (vma->obj->stolen)
+			pin |= PIN_MAPPABLE;
+		else
+			pin |= PIN_HIGH;
 
-	type = i915_coherent_map_type(vma->vm->i915);
-	if (i915_vma_is_map_and_fenceable(vma))
-		addr = (void __force *)i915_vma_pin_iomap(vma);
-	else
-		addr = __i915_gem_object_pin_map_locked(vma->obj, type);
-	if (IS_ERR(addr)) {
-		ret = PTR_ERR(addr);
-		goto err_ring;
+		ret = i915_ggtt_pin_locked(vma, 0, pin);
+		if (unlikely(ret))
+			goto err_unpin;
+
+		type = i915_coherent_map_type(vma->vm->i915);
+		if (i915_vma_is_map_and_fenceable(vma))
+			addr = (void __force *)i915_vma_pin_iomap(vma);
+		else
+			addr = __i915_gem_object_pin_map_locked(vma->obj, type);
+		if (IS_ERR(addr)) {
+			ret = PTR_ERR(addr);
+			goto err_ring;
+		}
+	} else {
+		addr = __i915_gem_object_pin_map_locked(vma->obj, I915_MAP_WB);
+		if (IS_ERR(addr)) {
+			ret = PTR_ERR(addr);
+			goto err_ring;
+		}
 	}
 
 	i915_vma_make_unshrinkable(vma);
@@ -100,10 +109,12 @@ void intel_ring_unpin(struct intel_ring *ring)
 		i915_gem_object_unpin_map(vma->obj);
 
 	i915_vma_make_purgeable(vma);
-	i915_vma_unpin(vma);
+	if (!(ring->flags & INTEL_RING_CREATE_INTERNAL))
+		i915_vma_unpin(vma);
 }
 
-static struct i915_vma *create_ring_vma(struct i915_ggtt *ggtt, int size)
+static struct i915_vma *
+create_ring_vma(struct i915_ggtt *ggtt, int size, unsigned int flags)
 {
 	struct i915_address_space *vm = &ggtt->vm;
 	struct drm_i915_private *i915 = vm->i915;
@@ -111,7 +122,8 @@ static struct i915_vma *create_ring_vma(struct i915_ggtt *ggtt, int size)
 	struct i915_vma *vma;
 
 	obj = ERR_PTR(-ENODEV);
-	if (i915_ggtt_has_aperture(ggtt))
+	if (!(flags & INTEL_RING_CREATE_INTERNAL) &&
+	    i915_ggtt_has_aperture(ggtt))
 		obj = i915_gem_object_create_stolen(i915, size);
 	if (IS_ERR(obj))
 		obj = i915_gem_object_create_internal(i915, size);
@@ -137,12 +149,14 @@ static struct i915_vma *create_ring_vma(struct i915_ggtt *ggtt, int size)
 }
 
 struct intel_ring *
-intel_engine_create_ring(struct intel_engine_cs *engine, int size)
+intel_engine_create_ring(struct intel_engine_cs *engine, unsigned int size)
 {
 	struct drm_i915_private *i915 = engine->i915;
+	unsigned int flags = size & GENMASK(11, 0);
 	struct intel_ring *ring;
 	struct i915_vma *vma;
 
+	size ^= flags;
 	GEM_BUG_ON(!is_power_of_2(size));
 	GEM_BUG_ON(RING_CTL_SIZE(size) & ~RING_NR_PAGES);
 
@@ -151,8 +165,10 @@ intel_engine_create_ring(struct intel_engine_cs *engine, int size)
 		return ERR_PTR(-ENOMEM);
 
 	kref_init(&ring->ref);
+
 	ring->size = size;
 	ring->wrap = BITS_PER_TYPE(ring->size) - ilog2(size);
+	ring->flags = flags;
 
 	/*
 	 * Workaround an erratum on the i830 which causes a hang if
@@ -165,11 +181,12 @@ intel_engine_create_ring(struct intel_engine_cs *engine, int size)
 
 	intel_ring_update_space(ring);
 
-	vma = create_ring_vma(engine->gt->ggtt, size);
+	vma = create_ring_vma(engine->gt->ggtt, size, flags);
 	if (IS_ERR(vma)) {
 		kfree(ring);
 		return ERR_CAST(vma);
 	}
+
 	ring->vma = vma;
 
 	return ring;
diff --git a/drivers/gpu/drm/i915/gt/intel_ring.h b/drivers/gpu/drm/i915/gt/intel_ring.h
index 34134a0b80b3..55eb1e139c12 100644
--- a/drivers/gpu/drm/i915/gt/intel_ring.h
+++ b/drivers/gpu/drm/i915/gt/intel_ring.h
@@ -9,13 +9,15 @@
 
 #include "i915_gem.h" /* GEM_BUG_ON */
 #include "i915_request.h"
+#include "i915_vma.h"
 #include "intel_ring_types.h"
 
 struct i915_acquire_ctx;
 struct intel_engine_cs;
 
 struct intel_ring *
-intel_engine_create_ring(struct intel_engine_cs *engine, int size);
+intel_engine_create_ring(struct intel_engine_cs *engine, unsigned int size);
+#define INTEL_RING_CREATE_INTERNAL BIT(0)
 
 u32 *intel_ring_begin(struct i915_request *rq, unsigned int num_dwords);
 int intel_ring_cacheline_align(struct i915_request *rq);
@@ -140,4 +142,12 @@ __intel_ring_space(unsigned int head, unsigned int tail, unsigned int size)
 	return (head - tail - CACHELINE_BYTES) & (size - 1);
 }
 
+static inline u32 intel_ring_address(const struct intel_ring *ring)
+{
+	if (ring->flags & INTEL_RING_CREATE_INTERNAL)
+		return -1;
+
+	return i915_ggtt_offset(ring->vma);
+}
+
 #endif /* INTEL_RING_H */
diff --git a/drivers/gpu/drm/i915/gt/intel_ring_types.h b/drivers/gpu/drm/i915/gt/intel_ring_types.h
index 1a189ea00fd8..d927deafcb33 100644
--- a/drivers/gpu/drm/i915/gt/intel_ring_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_ring_types.h
@@ -47,6 +47,8 @@ struct intel_ring {
 	u32 size;
 	u32 wrap;
 	u32 effective_size;
+
+	unsigned long flags;
 };
 
 #endif /* INTEL_RING_TYPES_H */
-- 
2.20.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [Intel-gfx] [PATCH 62/66] drm/i915/gt: Use client timeline address for seqno writes
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (59 preceding siblings ...)
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 61/66] drm/i915/gt: Support creation of 'internal' rings Chris Wilson
@ 2020-07-15 11:51 ` Chris Wilson
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 63/66] drm/i915/gt: Infrastructure for ring scheduling Chris Wilson
                   ` (11 subsequent siblings)
  72 siblings, 0 replies; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:51 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

If we allow for per-client timelines, even with legacy ring submission,
we open the door to a world full of possibilities [scheduling and
semaphores].

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/gt/gen6_engine_cs.c | 21 +++++++++------------
 1 file changed, 9 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/gen6_engine_cs.c b/drivers/gpu/drm/i915/gt/gen6_engine_cs.c
index ce38d1bcaba3..fa11174bb13b 100644
--- a/drivers/gpu/drm/i915/gt/gen6_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/gen6_engine_cs.c
@@ -373,11 +373,10 @@ u32 *gen7_emit_breadcrumb_rcs(struct i915_request *rq, u32 *cs)
 
 u32 *gen6_emit_breadcrumb_xcs(struct i915_request *rq, u32 *cs)
 {
-	GEM_BUG_ON(i915_request_active_timeline(rq)->hwsp_ggtt != rq->engine->status_page.vma);
-	GEM_BUG_ON(offset_in_page(i915_request_active_timeline(rq)->hwsp_offset) != I915_GEM_HWS_SEQNO_ADDR);
+	u32 addr = i915_request_active_timeline(rq)->hwsp_offset;
 
-	*cs++ = MI_FLUSH_DW | MI_FLUSH_DW_OP_STOREDW | MI_FLUSH_DW_STORE_INDEX;
-	*cs++ = I915_GEM_HWS_SEQNO_ADDR | MI_FLUSH_DW_USE_GTT;
+	*cs++ = MI_FLUSH_DW | MI_FLUSH_DW_OP_STOREDW;
+	*cs++ = addr | MI_FLUSH_DW_USE_GTT;
 	*cs++ = rq->fence.seqno;
 
 	*cs++ = MI_USER_INTERRUPT;
@@ -391,19 +390,17 @@ u32 *gen6_emit_breadcrumb_xcs(struct i915_request *rq, u32 *cs)
 #define GEN7_XCS_WA 32
 u32 *gen7_emit_breadcrumb_xcs(struct i915_request *rq, u32 *cs)
 {
+	u32 addr = i915_request_active_timeline(rq)->hwsp_offset;
 	int i;
 
-	GEM_BUG_ON(i915_request_active_timeline(rq)->hwsp_ggtt != rq->engine->status_page.vma);
-	GEM_BUG_ON(offset_in_page(i915_request_active_timeline(rq)->hwsp_offset) != I915_GEM_HWS_SEQNO_ADDR);
-
-	*cs++ = MI_FLUSH_DW | MI_INVALIDATE_TLB |
-		MI_FLUSH_DW_OP_STOREDW | MI_FLUSH_DW_STORE_INDEX;
-	*cs++ = I915_GEM_HWS_SEQNO_ADDR | MI_FLUSH_DW_USE_GTT;
+	*cs++ = MI_FLUSH_DW | MI_FLUSH_DW_OP_STOREDW;
+	*cs++ = addr | MI_FLUSH_DW_USE_GTT;
 	*cs++ = rq->fence.seqno;
 
 	for (i = 0; i < GEN7_XCS_WA; i++) {
-		*cs++ = MI_STORE_DWORD_INDEX;
-		*cs++ = I915_GEM_HWS_SEQNO_ADDR;
+		*cs++ = MI_STORE_DWORD_IMM_GEN4 | MI_USE_GGTT;
+		*cs++ = 0;
+		*cs++ = addr;
 		*cs++ = rq->fence.seqno;
 	}
 
-- 
2.20.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [Intel-gfx] [PATCH 63/66] drm/i915/gt: Infrastructure for ring scheduling
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (60 preceding siblings ...)
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 62/66] drm/i915/gt: Use client timeline address for seqno writes Chris Wilson
@ 2020-07-15 11:51 ` Chris Wilson
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 64/66] drm/i915/gt: Implement ring scheduler for gen6/7 Chris Wilson
                   ` (10 subsequent siblings)
  72 siblings, 0 replies; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:51 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

Build a bare-bones scheduler to sit on top of the global legacy ringbuffer
submission. This virtual execlists scheme should be applicable to all
older platforms.

A key problem we have with the legacy ring buffer submission is that it
only allows for FIFO queuing. All clients share the global request queue
and must contend for its lock when submitting. As any client may need to
wait for external events, all clients must then wait. However, if we
stage each client into their own virtual ringbuffer with their own
timelines, we can copy the client requests into the global ringbuffer
only when they are ready, reordering the submission around stalls.
Furthermore, the ability to reorder gives us rudimentary priority
sorting -- although without preemption support, once something is on the
GPU it stays on the GPU, and so it is still possible for a hog to delay
a high priority request (such as updating the display). However, it does
mean that by keeping the submission queue short, the high priority
request will be next. This design resembles the old guc submission
scheduler, which likewise reordered requests onto a global workqueue.
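
In other words, request payloads are written into the client's own ring
at construction time and only copied into the single HW ring once the
scheduler decides they are ready. The submit step then boils down to
something like the sketch below (the real dequeue() also batches several
requests per port and inserts the context switch):

    /* sketch: move one ready request from its client ring to the HW ring */
    static void submit_to_hw(struct intel_engine_cs *engine,
                             struct i915_request *rq)
    {
            struct intel_ring *hw = engine->legacy.ring;

            ring_copy(hw, rq->ring, rq->head, rq->tail);
            ENGINE_WRITE(engine, RING_TAIL, hw->tail);
    }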

The implementation uses the MI_USER_INTERRUPT at the end of every
request to track completion, so it is more interrupt-happy than execlists
[which has an interrupt for each context event, albeit two]. Our
interrupts on these systems are relatively heavy, and in the past we have
been able to completely starve Sandybridge with the interrupt traffic. Our
interrupt handlers have become much better (in part offloading the work to
bottom halves, leaving the interrupt itself only dealing with acking the
registers), but we can still see the impact of starvation in the uneven
submission latency on a saturated system.

Overall though, the short submission queues and extra interrupts do not
appear to affect throughput (+-10%; some tasks even improve due to the
reduced request overheads) and improve latency. [Which is a massive
improvement since the introduction of Sandybridge!]

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/Makefile                 |   1 +
 drivers/gpu/drm/i915/gt/intel_engine.h        |   1 +
 drivers/gpu/drm/i915/gt/intel_engine_types.h  |   1 +
 .../gpu/drm/i915/gt/intel_ring_scheduler.c    | 762 ++++++++++++++++++
 .../gpu/drm/i915/gt/intel_ring_submission.c   |  13 +-
 .../gpu/drm/i915/gt/intel_ring_submission.h   |  16 +
 6 files changed, 788 insertions(+), 6 deletions(-)
 create mode 100644 drivers/gpu/drm/i915/gt/intel_ring_scheduler.c
 create mode 100644 drivers/gpu/drm/i915/gt/intel_ring_submission.h

diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
index 2cf54db8b847..e4eea4980129 100644
--- a/drivers/gpu/drm/i915/Makefile
+++ b/drivers/gpu/drm/i915/Makefile
@@ -110,6 +110,7 @@ gt-y += \
 	gt/intel_renderstate.o \
 	gt/intel_reset.o \
 	gt/intel_ring.o \
+	gt/intel_ring_scheduler.o \
 	gt/intel_ring_submission.o \
 	gt/intel_rps.o \
 	gt/intel_sseu.o \
diff --git a/drivers/gpu/drm/i915/gt/intel_engine.h b/drivers/gpu/drm/i915/gt/intel_engine.h
index dcc2fc22ea37..b816581b95d3 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine.h
@@ -209,6 +209,7 @@ void intel_engine_cleanup_common(struct intel_engine_cs *engine);
 int intel_engine_resume(struct intel_engine_cs *engine);
 
 int intel_ring_submission_setup(struct intel_engine_cs *engine);
+int intel_ring_scheduler_setup(struct intel_engine_cs *engine);
 
 int intel_engine_stop_cs(struct intel_engine_cs *engine);
 void intel_engine_cancel_stop_cs(struct intel_engine_cs *engine);
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h b/drivers/gpu/drm/i915/gt/intel_engine_types.h
index 8c502cf34de7..78a57879aef8 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
@@ -339,6 +339,7 @@ struct intel_engine_cs {
 	struct {
 		struct intel_ring *ring;
 		struct intel_timeline *timeline;
+		struct intel_context *context;
 	} legacy;
 
 	/*
diff --git a/drivers/gpu/drm/i915/gt/intel_ring_scheduler.c b/drivers/gpu/drm/i915/gt/intel_ring_scheduler.c
new file mode 100644
index 000000000000..d3c22037f17d
--- /dev/null
+++ b/drivers/gpu/drm/i915/gt/intel_ring_scheduler.c
@@ -0,0 +1,762 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2020 Intel Corporation
+ */
+
+#include <linux/log2.h>
+
+#include <drm/i915_drm.h>
+
+#include "mm/i915_acquire_ctx.h"
+
+#include "i915_drv.h"
+#include "intel_context.h"
+#include "intel_engine_stats.h"
+#include "intel_gt.h"
+#include "intel_gt_pm.h"
+#include "intel_gt_requests.h"
+#include "intel_reset.h"
+#include "intel_ring.h"
+#include "intel_ring_submission.h"
+#include "shmem_utils.h"
+
+/*
+ * Rough estimate of the typical request size, performing a flush,
+ * set-context and then emitting the batch.
+ */
+#define LEGACY_REQUEST_SIZE 200
+
+static inline int rq_prio(const struct i915_request *rq)
+{
+	return rq->sched.attr.priority;
+}
+
+static inline u64 rq_deadline(const struct i915_request *rq)
+{
+	return rq->sched.deadline;
+}
+
+static inline struct i915_priolist *to_priolist(struct rb_node *rb)
+{
+	return rb_entry(rb, struct i915_priolist, node);
+}
+
+static inline bool reset_in_progress(const struct intel_engine_execlists *el)
+{
+	return unlikely(!__tasklet_is_enabled(&el->tasklet));
+}
+
+static void
+set_current_context(struct intel_context **ptr, struct intel_context *ce)
+{
+	if (ce)
+		intel_context_get(ce);
+
+	ce = xchg(ptr, ce);
+
+	if (ce)
+		intel_context_put(ce);
+}
+
+static void schedule_in(struct intel_engine_cs *engine, struct i915_request *rq)
+{
+	__intel_gt_pm_get(engine->gt);
+	intel_engine_context_in(engine);
+	i915_request_get(rq);
+}
+
+static void schedule_in_new(struct intel_engine_cs *engine,
+			    struct i915_request **port)
+{
+	while (*port)
+		schedule_in(engine, *port++);
+}
+
+static void schedule_out(struct i915_request *rq)
+{
+	struct intel_engine_cs *engine = rq->engine;
+	struct intel_context *ce = rq->context;
+
+	if (i915_request_completed(rq)) {
+		if (!list_is_last_rcu(&rq->link, &ce->timeline->requests))
+			i915_request_update_deadline(list_next_entry(rq, link));
+		else
+			intel_engine_add_retire(engine, ce->timeline);
+	}
+
+	i915_request_put(rq);
+	intel_engine_context_out(engine);
+	intel_gt_pm_put_async(engine->gt);
+}
+
+static inline void *clear_ports(struct i915_request **ports, int count)
+{
+	return memset_p((void **)ports, NULL, count);
+}
+
+static u32 *ring_map(struct intel_ring *ring, u32 len)
+{
+	u32 *va;
+
+	if (unlikely(ring->tail + len > ring->effective_size)) {
+		memset(ring->vaddr + ring->tail, 0, ring->size - ring->tail);
+		ring->tail = 0;
+	}
+
+	va = ring->vaddr + ring->tail;
+	ring->tail = intel_ring_wrap(ring, ring->tail + len);
+
+	return va;
+}
+
+static inline u32 *ring_map_dw(struct intel_ring *ring, u32 len)
+{
+	return ring_map(ring, len * sizeof(u32));
+}
+
+static void ring_copy(struct intel_ring *dst,
+		      const struct intel_ring *src,
+		      u32 start, u32 end)
+{
+	unsigned int len;
+	void *out;
+
+	len = end - start;
+	if (end < start)
+		len += src->size;
+	out = ring_map(dst, len);
+
+	if (end < start) {
+		len = src->size - start;
+		memcpy(out, src->vaddr + start, len);
+		out += len;
+		start = 0;
+	}
+
+	memcpy(out, src->vaddr + start, end - start);
+}
+
+static void switch_context(struct intel_ring *ring, struct i915_request *rq)
+{
+}
+
+static struct i915_request *ring_submit(struct i915_request *rq)
+{
+	struct intel_ring *ring = rq->engine->legacy.ring;
+
+	__i915_request_submit(rq);
+
+	if (rq->engine->legacy.context != rq->context) {
+		switch_context(ring, rq);
+		set_current_context(&rq->engine->legacy.context, rq->context);
+	}
+
+	ring_copy(ring, rq->ring, rq->head, rq->tail);
+	return rq;
+}
+
+static struct i915_request **
+copy_active(struct i915_request **port, struct i915_request * const *active)
+{
+	while (*active)
+		*port++ = *active++;
+
+	return port;
+}
+
+static void dequeue(struct intel_engine_cs *engine)
+{
+	struct intel_engine_execlists * const el = &engine->execlists;
+	struct i915_request ** const last_port = el->pending + el->port_mask;
+	struct i915_request **port, **first;
+	unsigned long flags;
+	struct rb_node *rb;
+
+	first = copy_active(el->pending, el->active);
+	if (first > last_port)
+		return;
+
+	port = first;
+	*port = NULL;
+	spin_lock_irqsave(&engine->active.lock, flags);
+	while ((rb = rb_first_cached(&el->queue))) {
+		struct i915_priolist *p = to_priolist(rb);
+		struct i915_request *rq, *rn;
+
+		priolist_for_each_request_consume(rq, rn, p) {
+			GEM_BUG_ON(rq == *port);
+			if (*port && rq->context != (*port)->context) {
+				if (port == last_port)
+					goto done;
+
+				port++;
+			}
+
+			*port = ring_submit(rq);
+		}
+
+		rb_erase_cached(&p->node, &el->queue);
+		i915_priolist_free(p);
+	}
+done:
+	spin_unlock_irqrestore(&engine->active.lock, flags);
+	if (!*port)
+		return;
+
+	*++port = NULL;
+	schedule_in_new(engine, first);
+	WRITE_ONCE(el->active, el->pending);
+
+	wmb(); /* paranoid flush of WCB before RING_TAIL write */
+	ENGINE_WRITE(engine, RING_TAIL, engine->legacy.ring->tail);
+	memcpy(el->inflight, el->pending,
+	       (port - el->pending + 1) * sizeof(*port));
+
+	WRITE_ONCE(el->active, el->inflight);
+	GEM_BUG_ON(!*el->active);
+}
+
+static struct i915_request **
+process_q(struct intel_engine_cs *engine, struct i915_request **inactive)
+{
+	struct intel_engine_execlists * const el = &engine->execlists;
+
+	while (*el->active) {
+		if (!i915_request_completed(*el->active))
+			break;
+
+		*inactive++ = *el->active++;
+	}
+
+	return inactive;
+}
+
+static void post_process_q(struct i915_request **port,
+			   struct i915_request * const *inactive)
+{
+	while (port != inactive)
+		schedule_out(*port++);
+}
+
+static void submission_tasklet(unsigned long data)
+{
+	struct intel_engine_cs *engine = (struct intel_engine_cs *)data;
+	struct i915_request *post[EXECLIST_MAX_PORTS];
+	struct i915_request **inactive;
+
+	inactive = process_q(engine, post);
+	GEM_BUG_ON(inactive - post > ARRAY_SIZE(post));
+
+	if (rb_first_cached(&engine->execlists.queue))
+		dequeue(engine);
+
+	post_process_q(post, inactive);
+}
+
+static void submit_request(struct i915_request *rq)
+{
+	struct intel_engine_cs *engine = rq->engine;
+	unsigned long flags;
+
+	spin_lock_irqsave(&engine->active.lock, flags);
+	if (intel_engine_queue_request(engine, rq))
+		tasklet_hi_schedule(&engine->execlists.tasklet);
+	spin_unlock_irqrestore(&engine->active.lock, flags);
+}
+
+static void reset_prepare(struct intel_engine_cs *engine)
+{
+	struct intel_engine_execlists * const el = &engine->execlists;
+
+	GEM_TRACE("%s\n", engine->name);
+
+	__tasklet_disable_sync_once(&el->tasklet);
+	GEM_BUG_ON(!reset_in_progress(el));
+
+	intel_ring_submission_reset_prepare(engine);
+}
+
+static struct i915_request *
+__unwind_incomplete_requests(struct intel_engine_cs *engine)
+{
+	struct i915_request *rq, *rn, *active = NULL;
+	struct list_head *uninitialized_var(pl);
+	u64 deadline = I915_DEADLINE_NEVER;
+
+	lockdep_assert_held(&engine->active.lock);
+
+	list_for_each_entry_safe_reverse(rq, rn,
+					 &engine->active.requests,
+					 sched.link) {
+		if (i915_request_completed(rq))
+			break;
+
+		__i915_request_unsubmit(rq);
+
+		if (i915_request_started(rq)) {
+			u64 deadline =
+				i915_scheduler_next_virtual_deadline(rq_prio(rq));
+			rq->sched.deadline = min(rq_deadline(rq), deadline);
+		}
+
+		if (rq_deadline(rq) != deadline) {
+			deadline = rq_deadline(rq);
+			pl = i915_sched_lookup_priolist(engine, deadline);
+		}
+		GEM_BUG_ON(RB_EMPTY_ROOT(&engine->execlists.queue.rb_root));
+
+		list_move(&rq->sched.link, pl);
+		set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
+
+		active = rq;
+	}
+
+	return active;
+}
+
+static void cancel_port_requests(struct intel_engine_cs *engine)
+{
+	struct intel_engine_execlists * const el = &engine->execlists;
+	struct i915_request * const *port;
+
+	clear_ports(el->pending, ARRAY_SIZE(el->pending));
+	for (port = xchg(&el->active, el->pending); *port; port++)
+		schedule_out(*port);
+	clear_ports(el->inflight, ARRAY_SIZE(el->inflight));
+
+	smp_wmb(); /* complete the seqlock for execlists_active() */
+	WRITE_ONCE(el->active, el->inflight);
+}
+
+static void __ring_rewind(struct intel_engine_cs *engine, bool stalled)
+{
+	struct i915_request *rq;
+
+	rq = __unwind_incomplete_requests(engine);
+	if (rq && i915_request_started(rq))
+		__i915_request_reset(rq, stalled);
+
+	/* Clear the global submission state, we will submit from scratch */
+	intel_ring_reset(engine->legacy.ring, 0);
+	set_current_context(&engine->legacy.context, NULL);
+}
+
+static void ring_reset_rewind(struct intel_engine_cs *engine, bool stalled)
+{
+	unsigned long flags;
+
+	cancel_port_requests(engine);
+
+	spin_lock_irqsave(&engine->active.lock, flags);
+	__ring_rewind(engine, stalled);
+	spin_unlock_irqrestore(&engine->active.lock, flags);
+}
+
+static void nop_submission_tasklet(unsigned long data)
+{
+}
+
+static void ring_reset_cancel(struct intel_engine_cs *engine)
+{
+	struct intel_engine_execlists * const el = &engine->execlists;
+	struct i915_request *rq, *rn;
+	unsigned long flags;
+	struct rb_node *rb;
+
+	cancel_port_requests(engine);
+
+	spin_lock_irqsave(&engine->active.lock, flags);
+
+	__ring_rewind(engine, true);
+
+	/* Mark all submitted requests as skipped. */
+	list_for_each_entry(rq, &engine->active.requests, sched.link) {
+		i915_request_set_error_once(rq, -EIO);
+		i915_request_mark_complete(rq);
+	}
+
+	/* Flush the queued requests to the timeline list (for retiring). */
+	while ((rb = rb_first_cached(&el->queue))) {
+		struct i915_priolist *p = to_priolist(rb);
+
+		priolist_for_each_request_consume(rq, rn, p) {
+			i915_request_set_error_once(rq, -EIO);
+			i915_request_mark_complete(rq);
+			__i915_request_submit(rq);
+		}
+
+		rb_erase_cached(&p->node, &el->queue);
+		i915_priolist_free(p);
+	}
+	GEM_BUG_ON(!RB_EMPTY_ROOT(&el->queue.rb_root));
+
+	/* Remaining _unready_ requests will be nop'ed when submitted */
+
+	GEM_BUG_ON(__tasklet_is_enabled(&el->tasklet));
+	el->tasklet.func = nop_submission_tasklet;
+
+	spin_unlock_irqrestore(&engine->active.lock, flags);
+}
+
+static void reset_finish(struct intel_engine_cs *engine)
+{
+	intel_ring_submission_reset_finish(engine);
+
+	if (__tasklet_enable(&engine->execlists.tasklet))
+		tasklet_hi_schedule(&engine->execlists.tasklet);
+}
+
+static void submission_park(struct intel_engine_cs *engine)
+{
+	intel_engine_unpin_breadcrumbs_irq(engine);
+	/* drain the submit queue */
+	tasklet_hi_schedule(&engine->execlists.tasklet);
+}
+
+static void submission_unpark(struct intel_engine_cs *engine)
+{
+	intel_engine_pin_breadcrumbs_irq(engine);
+}
+
+static void ring_context_destroy(struct kref *ref)
+{
+	struct intel_context *ce = container_of(ref, typeof(*ce), ref);
+
+	GEM_BUG_ON(intel_context_is_pinned(ce));
+
+	if (ce->state)
+		i915_vma_put(ce->state);
+	if (test_bit(CONTEXT_ALLOC_BIT, &ce->flags))
+		intel_ring_put(ce->ring);
+
+	intel_context_fini(ce);
+	intel_context_free(ce);
+}
+
+static void ring_context_unpin(struct intel_context *ce)
+{
+}
+
+static int alloc_context_vma(struct intel_context *ce)
+
+{
+	struct intel_engine_cs *engine = ce->engine;
+	struct drm_i915_gem_object *obj;
+	struct i915_vma *vma;
+	int err;
+
+	obj = i915_gem_object_create_shmem(engine->i915, engine->context_size);
+	if (IS_ERR(obj))
+		return PTR_ERR(obj);
+
+	/*
+	 * Try to make the context utilize L3 as well as LLC.
+	 *
+	 * On VLV we don't have L3 controls in the PTEs so we
+	 * shouldn't touch the cache level, especially as that
+	 * would make the object snooped which might have a
+	 * negative performance impact.
+	 *
+	 * Snooping is required on non-llc platforms in execlist
+	 * mode, but since all GGTT accesses use PAT entry 0 we
+	 * get snooping anyway regardless of cache_level.
+	 *
+	 * This is only applicable for Ivy Bridge devices since
+	 * later platforms don't have L3 control bits in the PTE.
+	 */
+	if (IS_IVYBRIDGE(engine->i915))
+		i915_gem_object_set_cache_coherency(obj, I915_CACHE_L3_LLC);
+
+	if (engine->default_state) {
+		void *vaddr;
+
+		vaddr = i915_gem_object_pin_map(obj, I915_MAP_WB);
+		if (IS_ERR(vaddr)) {
+			err = PTR_ERR(vaddr);
+			goto err_obj;
+		}
+
+		shmem_read(engine->default_state, 0,
+			   vaddr, engine->context_size);
+		__set_bit(CONTEXT_VALID_BIT, &ce->flags);
+
+		i915_gem_object_flush_map(obj);
+		i915_gem_object_unpin_map(obj);
+	}
+
+	vma = i915_vma_instance(obj, &engine->gt->ggtt->vm, NULL);
+	if (IS_ERR(vma)) {
+		err = PTR_ERR(vma);
+		goto err_obj;
+	}
+
+	ce->state = vma;
+	return 0;
+
+err_obj:
+	i915_gem_object_put(obj);
+	return err;
+}
+
+static int alloc_timeline(struct intel_context *ce)
+{
+	struct intel_engine_cs *engine = ce->engine;
+	struct intel_timeline *tl;
+	struct i915_vma *hwsp;
+
+	/*
+	 * Use the static global HWSP for the kernel context, and
+	 * a dynamically allocated cacheline for everyone else.
+	 */
+	hwsp = NULL;
+	if (unlikely(intel_context_is_barrier(ce)))
+		hwsp = engine->status_page.vma;
+
+	tl = intel_timeline_create(engine->gt, hwsp);
+	if (IS_ERR(tl))
+		return PTR_ERR(tl);
+
+	ce->timeline = tl;
+	return 0;
+}
+
+static int ring_context_alloc(struct intel_context *ce)
+{
+	struct intel_engine_cs *engine = ce->engine;
+	struct intel_ring *ring;
+	int err;
+
+	GEM_BUG_ON(ce->state);
+	if (engine->context_size) {
+		err = alloc_context_vma(ce);
+		if (err)
+			return err;
+	}
+
+	if (!ce->timeline) {
+		err = alloc_timeline(ce);
+		if (err)
+			goto err_vma;
+	}
+
+	ring = intel_engine_create_ring(engine,
+					(unsigned long)ce->ring |
+					INTEL_RING_CREATE_INTERNAL);
+	if (IS_ERR(ring)) {
+		err = PTR_ERR(ring);
+		goto err_timeline;
+	}
+	ce->ring = ring;
+
+	return 0;
+
+err_timeline:
+	intel_timeline_put(ce->timeline);
+err_vma:
+	if (ce->state) {
+		i915_vma_put(ce->state);
+		ce->state = NULL;
+	}
+	return err;
+}
+
+static int ring_context_pin(struct intel_context *ce)
+{
+	return 0;
+}
+
+static void ring_context_reset(struct intel_context *ce)
+{
+	intel_ring_reset(ce->ring, 0);
+	clear_bit(CONTEXT_VALID_BIT, &ce->flags);
+}
+
+static const struct intel_context_ops ring_context_ops = {
+	.alloc = ring_context_alloc,
+
+	.pin = ring_context_pin,
+	.unpin = ring_context_unpin,
+
+	.enter = intel_context_enter_engine,
+	.exit = intel_context_exit_engine,
+
+	.reset = ring_context_reset,
+	.destroy = ring_context_destroy,
+};
+
+static int ring_request_alloc(struct i915_request *rq)
+{
+	int ret;
+
+	GEM_BUG_ON(!intel_context_is_pinned(rq->context));
+
+	/*
+	 * Flush enough space to reduce the likelihood of waiting after
+	 * we start building the request - in which case we will just
+	 * have to repeat work.
+	 */
+	rq->reserved_space += LEGACY_REQUEST_SIZE;
+
+	/* Unconditionally invalidate GPU caches and TLBs. */
+	ret = rq->engine->emit_flush(rq, EMIT_INVALIDATE);
+	if (ret)
+		return ret;
+
+	rq->reserved_space -= LEGACY_REQUEST_SIZE;
+	return 0;
+}
+
+static void set_default_submission(struct intel_engine_cs *engine)
+{
+	engine->submit_request = submit_request;
+	engine->execlists.tasklet.func = submission_tasklet;
+}
+
+static void ring_release_global_submission(struct intel_engine_cs *engine)
+{
+	intel_ring_unpin(engine->legacy.ring);
+	intel_ring_put(engine->legacy.ring);
+}
+
+static void ring_release(struct intel_engine_cs *engine)
+{
+	intel_engine_cleanup_common(engine);
+
+	set_current_context(&engine->legacy.context, NULL);
+
+	ring_release_global_submission(engine);
+}
+
+static void setup_irq(struct intel_engine_cs *engine)
+{
+}
+
+static void setup_common(struct intel_engine_cs *engine)
+{
+	struct drm_i915_private *i915 = engine->i915;
+
+	/* gen8+ are only supported with execlists */
+	GEM_BUG_ON(INTEL_GEN(i915) >= 8);
+	GEM_BUG_ON(INTEL_GEN(i915) < 8);
+
+	setup_irq(engine);
+
+	engine->park = submission_park;
+	engine->unpark = submission_unpark;
+
+	engine->resume = intel_ring_submission_resume;
+	engine->reset.prepare = reset_prepare;
+	engine->reset.rewind = ring_reset_rewind;
+	engine->reset.cancel = ring_reset_cancel;
+	engine->reset.finish = reset_finish;
+
+	engine->cops = &ring_context_ops;
+	engine->request_alloc = ring_request_alloc;
+
+	engine->set_default_submission = set_default_submission;
+}
+
+static void setup_rcs(struct intel_engine_cs *engine)
+{
+}
+
+static void setup_vcs(struct intel_engine_cs *engine)
+{
+}
+
+static void setup_bcs(struct intel_engine_cs *engine)
+{
+}
+
+static void setup_vecs(struct intel_engine_cs *engine)
+{
+	GEM_BUG_ON(!IS_HASWELL(engine->i915));
+}
+
+static unsigned int global_ring_size(void)
+{
+	/* Enough space to hold 2 clients and the context switch */
+	return roundup_pow_of_two(EXECLIST_MAX_PORTS * SZ_16K + SZ_4K);
+}
+
+static int ring_setup_global_submission(struct intel_engine_cs *engine)
+{
+	struct i915_acquire_ctx acquire;
+	struct intel_ring *ring;
+	int err;
+
+	ring = intel_engine_create_ring(engine, global_ring_size());
+	if (IS_ERR(ring))
+		return PTR_ERR(ring);
+
+	i915_acquire_ctx_init(&acquire);
+
+	err = intel_ring_acquire_lock(ring, &acquire);
+	if (err)
+		goto err_acquire;
+
+	err = i915_acquire_mm(&acquire);
+	if (err)
+		goto err_acquire;
+
+	err = intel_ring_pin_locked(ring);
+	if (err)
+		goto err_acquire;
+
+	i915_acquire_ctx_fini(&acquire);
+
+	GEM_BUG_ON(engine->legacy.ring);
+	engine->legacy.ring = ring;
+	return 0;
+
+err_acquire:
+	i915_acquire_ctx_fini(&acquire);
+	intel_ring_put(ring);
+	return err;
+}
+
+int intel_ring_scheduler_setup(struct intel_engine_cs *engine)
+{
+	int err;
+
+	GEM_BUG_ON(HAS_EXECLISTS(engine->i915));
+
+	tasklet_init(&engine->execlists.tasklet,
+		     submission_tasklet, (unsigned long)engine);
+
+	setup_common(engine);
+
+	switch (engine->class) {
+	case RENDER_CLASS:
+		setup_rcs(engine);
+		break;
+	case VIDEO_DECODE_CLASS:
+		setup_vcs(engine);
+		break;
+	case COPY_ENGINE_CLASS:
+		setup_bcs(engine);
+		break;
+	case VIDEO_ENHANCEMENT_CLASS:
+		setup_vecs(engine);
+		break;
+	default:
+		MISSING_CASE(engine->class);
+		return -ENODEV;
+	}
+
+	err = ring_setup_global_submission(engine);
+	if (err)
+		goto err_common;
+
+	engine->flags |= I915_ENGINE_HAS_SCHEDULER;
+	engine->flags |= I915_ENGINE_NEEDS_BREADCRUMB_TASKLET;
+	engine->flags |= I915_ENGINE_SUPPORTS_STATS;
+
+	/* Finally, take ownership and responsibility for cleanup! */
+	engine->release = ring_release;
+	return 0;
+
+err_common:
+	intel_engine_cleanup_common(engine);
+	return err;
+}
diff --git a/drivers/gpu/drm/i915/gt/intel_ring_submission.c b/drivers/gpu/drm/i915/gt/intel_ring_submission.c
index ec54ff029699..4c3b75fbc899 100644
--- a/drivers/gpu/drm/i915/gt/intel_ring_submission.c
+++ b/drivers/gpu/drm/i915/gt/intel_ring_submission.c
@@ -38,6 +38,7 @@
 #include "intel_gt.h"
 #include "intel_reset.h"
 #include "intel_ring.h"
+#include "intel_ring_submission.h"
 #include "shmem_utils.h"
 
 /* Rough estimate of the typical request size, performing a flush,
@@ -218,7 +219,7 @@ static void set_pp_dir(struct intel_engine_cs *engine)
 	}
 }
 
-static int xcs_resume(struct intel_engine_cs *engine)
+int intel_ring_submission_resume(struct intel_engine_cs *engine)
 {
 	struct drm_i915_private *dev_priv = engine->i915;
 	struct intel_ring *ring = engine->legacy.ring;
@@ -322,7 +323,7 @@ static int xcs_resume(struct intel_engine_cs *engine)
 	return ret;
 }
 
-static void reset_prepare(struct intel_engine_cs *engine)
+void intel_ring_submission_reset_prepare(struct intel_engine_cs *engine)
 {
 	struct intel_uncore *uncore = engine->uncore;
 	const u32 base = engine->mmio_base;
@@ -429,7 +430,7 @@ static void reset_rewind(struct intel_engine_cs *engine, bool stalled)
 	spin_unlock_irqrestore(&engine->active.lock, flags);
 }
 
-static void reset_finish(struct intel_engine_cs *engine)
+void intel_ring_submission_reset_finish(struct intel_engine_cs *engine)
 {
 }
 
@@ -1065,11 +1066,11 @@ static void setup_common(struct intel_engine_cs *engine)
 
 	setup_irq(engine);
 
-	engine->resume = xcs_resume;
-	engine->reset.prepare = reset_prepare;
+	engine->resume = intel_ring_submission_resume;
+	engine->reset.prepare = intel_ring_submission_reset_prepare;
 	engine->reset.rewind = reset_rewind;
 	engine->reset.cancel = reset_cancel;
-	engine->reset.finish = reset_finish;
+	engine->reset.finish = intel_ring_submission_reset_finish;
 
 	engine->cops = &ring_context_ops;
 	engine->request_alloc = ring_request_alloc;
diff --git a/drivers/gpu/drm/i915/gt/intel_ring_submission.h b/drivers/gpu/drm/i915/gt/intel_ring_submission.h
new file mode 100644
index 000000000000..701eb033e055
--- /dev/null
+++ b/drivers/gpu/drm/i915/gt/intel_ring_submission.h
@@ -0,0 +1,16 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2020 Intel Corporation
+ */
+
+#ifndef __INTEL_RING_SUBMISSION_H__
+#define __INTEL_RING_SUBMISSION_H__
+
+struct intel_engine_cs;
+
+void intel_ring_submission_reset_prepare(struct intel_engine_cs *engine);
+void intel_ring_submission_reset_finish(struct intel_engine_cs *engine);
+
+int intel_ring_submission_resume(struct intel_engine_cs *engine);
+
+#endif /* __INTEL_RING_SUBMISSION_H__ */
-- 
2.20.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [Intel-gfx] [PATCH 64/66] drm/i915/gt: Implement ring scheduler for gen6/7
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (61 preceding siblings ...)
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 63/66] drm/i915/gt: Infrastructure for ring scheduling Chris Wilson
@ 2020-07-15 11:51 ` Chris Wilson
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 65/66] drm/i915/gt: Enable ring scheduling " Chris Wilson
                   ` (9 subsequent siblings)
  72 siblings, 0 replies; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:51 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

A key problem with legacy ring buffer submission is that it is an inherent
FIFO queue across all clients; if one blocks, they all block. A
scheduler allows us to avoid that limitation, and ensures that all
clients can submit in parallel, removing the resource contention of the
global ringbuffer.

Having built the ring scheduler infrastructure on top of the global
ringbuffer submission, we now need to provide the HW knowledge required
to build command packets and implement context switching.
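
In particular, the switch_context() stub from the previous patch now has
to emit the MI_SET_CONTEXT (and, for gen6/7 ppgtt, the page-directory
reload) into the global ring whenever the incoming request runs in a
different context to the one currently loaded. In outline it is roughly
the sketch below (simplified; the hunks add the per-platform workarounds
and flag selection on top):

    /* sketch: what switch_context() now emits into the global ring */
    static void switch_context(struct intel_ring *ring,
                               struct i915_request *rq)
    {
            struct intel_context *ce = rq->context;

            /* load the new logical HW state, if the context carries any */
            if (ce->state)
                    mi_set_context(ring, rq->engine, ce, MI_MM_SPACE_GTT);

            /* and point the engine at the new address space */
            load_pd_dir(ring, rq->engine, vm_alias(ce->vm));
    }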

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 .../gpu/drm/i915/gt/intel_ring_scheduler.c    | 428 +++++++++++++++++-
 drivers/gpu/drm/i915/i915_reg.h               |   1 +
 2 files changed, 426 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_ring_scheduler.c b/drivers/gpu/drm/i915/gt/intel_ring_scheduler.c
index d3c22037f17d..2d26d62e0135 100644
--- a/drivers/gpu/drm/i915/gt/intel_ring_scheduler.c
+++ b/drivers/gpu/drm/i915/gt/intel_ring_scheduler.c
@@ -9,6 +9,10 @@
 
 #include "mm/i915_acquire_ctx.h"
 
+#include "gen2_engine_cs.h"
+#include "gen6_engine_cs.h"
+#include "gen6_ppgtt.h"
+#include "gen7_renderclear.h"
 #include "i915_drv.h"
 #include "intel_context.h"
 #include "intel_engine_stats.h"
@@ -136,8 +140,263 @@ static void ring_copy(struct intel_ring *dst,
 	memcpy(out, src->vaddr + start, end - start);
 }
 
+static void mi_set_context(struct intel_ring *ring,
+			   struct intel_engine_cs *engine,
+			   struct intel_context *ce,
+			   u32 flags)
+{
+	struct drm_i915_private *i915 = engine->i915;
+	enum intel_engine_id id;
+	const int num_engines =
+		IS_HASWELL(i915) ? engine->gt->info.num_engines - 1 : 0;
+	int len;
+	u32 *cs;
+
+	len = 4;
+	if (IS_GEN(i915, 7))
+		len += 2 + (num_engines ? 4 * num_engines + 6 : 0);
+	else if (IS_GEN(i915, 5))
+		len += 2;
+
+	cs = ring_map_dw(ring, len);
+
+	/* WaProgramMiArbOnOffAroundMiSetContext:ivb,vlv,hsw,bdw,chv */
+	if (IS_GEN(i915, 7)) {
+		*cs++ = MI_ARB_ON_OFF | MI_ARB_DISABLE;
+		if (num_engines) {
+			struct intel_engine_cs *signaller;
+
+			*cs++ = MI_LOAD_REGISTER_IMM(num_engines);
+			for_each_engine(signaller, engine->gt, id) {
+				if (signaller == engine)
+					continue;
+
+				*cs++ = i915_mmio_reg_offset(
+					   RING_PSMI_CTL(signaller->mmio_base));
+				*cs++ = _MASKED_BIT_ENABLE(
+						GEN6_PSMI_SLEEP_MSG_DISABLE);
+			}
+		}
+	} else if (IS_GEN(i915, 5)) {
+		/*
+		 * This w/a is only listed for pre-production ilk a/b steppings,
+		 * but is also mentioned for programming the powerctx. To be
+		 * safe, just apply the workaround; we do not use SyncFlush so
+		 * this should never take effect and so be a no-op!
+		 */
+		*cs++ = MI_SUSPEND_FLUSH | MI_SUSPEND_FLUSH_EN;
+	}
+
+	*cs++ = MI_NOOP;
+	*cs++ = MI_SET_CONTEXT;
+	*cs++ = i915_ggtt_offset(ce->state) | flags;
+	/*
+	 * w/a: MI_SET_CONTEXT must always be followed by MI_NOOP
+	 * WaMiSetContext_Hang:snb,ivb,vlv
+	 */
+	*cs++ = MI_NOOP;
+
+	if (IS_GEN(i915, 7)) {
+		if (num_engines) {
+			struct intel_engine_cs *signaller;
+			i915_reg_t last_reg = {}; /* keep gcc quiet */
+
+			*cs++ = MI_LOAD_REGISTER_IMM(num_engines);
+			for_each_engine(signaller, engine->gt, id) {
+				if (signaller == engine)
+					continue;
+
+				last_reg = RING_PSMI_CTL(signaller->mmio_base);
+				*cs++ = i915_mmio_reg_offset(last_reg);
+				*cs++ = _MASKED_BIT_DISABLE(
+						GEN6_PSMI_SLEEP_MSG_DISABLE);
+			}
+
+			/* Insert a delay before the next switch! */
+			*cs++ = MI_STORE_REGISTER_MEM | MI_SRM_LRM_GLOBAL_GTT;
+			*cs++ = i915_mmio_reg_offset(last_reg);
+			*cs++ = intel_gt_scratch_offset(engine->gt,
+							INTEL_GT_SCRATCH_FIELD_DEFAULT);
+			*cs++ = MI_NOOP;
+		}
+		*cs++ = MI_ARB_ON_OFF | MI_ARB_ENABLE;
+	} else if (IS_GEN(i915, 5)) {
+		*cs++ = MI_SUSPEND_FLUSH;
+	}
+}
+
+static struct i915_address_space *vm_alias(struct i915_address_space *vm)
+{
+	if (i915_is_ggtt(vm))
+		vm = &i915_vm_to_ggtt(vm)->alias->vm;
+
+	return vm;
+}
+
+static u32 pp_dir(struct i915_address_space *vm)
+{
+	return to_gen6_ppgtt(i915_vm_to_ppgtt(vm))->pp_dir;
+}
+
+static void load_pd_dir(struct intel_ring *ring,
+			struct intel_engine_cs *engine,
+			struct i915_address_space *vm)
+{
+	u32 *cs = ring_map_dw(ring, 10);
+
+	*cs++ = MI_LOAD_REGISTER_IMM(1);
+	*cs++ = i915_mmio_reg_offset(RING_PP_DIR_DCLV(engine->mmio_base));
+	*cs++ = PP_DIR_DCLV_2G;
+
+	*cs++ = MI_LOAD_REGISTER_IMM(1);
+	*cs++ = i915_mmio_reg_offset(RING_PP_DIR_BASE(engine->mmio_base));
+	*cs++ = pp_dir(vm);
+
+	/* Stall until the page table load is complete? */
+	*cs++ = MI_STORE_REGISTER_MEM | MI_SRM_LRM_GLOBAL_GTT;
+	*cs++ = i915_mmio_reg_offset(RING_PP_DIR_BASE(engine->mmio_base));
+	*cs++ = intel_gt_scratch_offset(engine->gt,
+					INTEL_GT_SCRATCH_FIELD_DEFAULT);
+	*cs++ = MI_NOOP;
+}
+
+static struct i915_address_space *current_vm(struct intel_engine_cs *engine)
+{
+	struct intel_context *old = engine->legacy.context;
+
+	return old ? vm_alias(old->vm) : NULL;
+}
+
+static void gen6_emit_invalidate_rcs(struct intel_ring *ring,
+				     struct intel_engine_cs *engine)
+{
+	u32 addr, flags;
+	u32 *cs;
+
+	addr = intel_gt_scratch_offset(engine->gt,
+				       INTEL_GT_SCRATCH_FIELD_RENDER_FLUSH);
+
+	flags = PIPE_CONTROL_QW_WRITE | PIPE_CONTROL_CS_STALL;
+	flags |= PIPE_CONTROL_TLB_INVALIDATE;
+
+	if (INTEL_GEN(engine->i915) >= 7)
+		flags |= PIPE_CONTROL_GLOBAL_GTT_IVB;
+	else
+		addr |= PIPE_CONTROL_GLOBAL_GTT;
+
+	cs = ring_map_dw(ring, 4);
+	*cs++ = GFX_OP_PIPE_CONTROL(4);
+	*cs++ = flags;
+	*cs++ = addr;
+	*cs++ = 0;
+}
+
+static struct i915_address_space *
+clear_residuals(struct intel_ring *ring, struct intel_engine_cs *engine)
+{
+	struct intel_context *ce = engine->kernel_context;
+	struct i915_address_space *vm = vm_alias(engine->gt->vm);
+	u32 flags;
+
+	if (vm != current_vm(engine))
+		load_pd_dir(ring, engine, vm);
+
+	if (ce->state)
+		mi_set_context(ring, engine, ce,
+			       MI_MM_SPACE_GTT | MI_RESTORE_INHIBIT);
+
+	if (IS_HASWELL(engine->i915))
+		flags = MI_BATCH_PPGTT_HSW | MI_BATCH_NON_SECURE_HSW;
+	else
+		flags = MI_BATCH_NON_SECURE_I965;
+
+	__gen6_emit_bb_start(ring_map_dw(ring, 2),
+			     engine->wa_ctx.vma->node.start, flags);
+
+	return vm;
+}
+
+static void remap_l3_slice(struct intel_ring *ring,
+			   struct intel_engine_cs *engine,
+			   int slice)
+{
+	u32 *cs, *remap_info = engine->i915->l3_parity.remap_info[slice];
+	int i;
+
+	if (!remap_info)
+		return;
+
+	/*
+	 * Note: We do not worry about the concurrent register cacheline hang
+	 * here because no other code should access these registers other than
+	 * at initialization time.
+	 */
+	cs = ring_map_dw(ring, GEN7_L3LOG_SIZE / 4 * 2 + 2);
+	*cs++ = MI_LOAD_REGISTER_IMM(GEN7_L3LOG_SIZE / 4);
+	for (i = 0; i < GEN7_L3LOG_SIZE / 4; i++) {
+		*cs++ = i915_mmio_reg_offset(GEN7_L3LOG(slice, i));
+		*cs++ = remap_info[i];
+	}
+	*cs++ = MI_NOOP;
+}
+
+static void remap_l3(struct intel_ring *ring,
+		     struct intel_engine_cs *engine,
+		     struct intel_context *ce)
+{
+	struct i915_gem_context *ctx =
+		rcu_dereference_protected(ce->gem_context, true);
+	int bit, idx = -1;
+
+	if (!ctx || !ctx->remap_slice)
+		return;
+
+	do {
+		bit = ffs(ctx->remap_slice);
+		remap_l3_slice(ring, engine, idx += bit);
+	} while (ctx->remap_slice >>= bit);
+}
+
 static void switch_context(struct intel_ring *ring, struct i915_request *rq)
 {
+	struct intel_engine_cs *engine = rq->engine;
+	struct i915_address_space *cvm = current_vm(engine);
+	struct intel_context *ce = rq->context;
+	struct i915_address_space *vm;
+
+	if (engine->wa_ctx.vma && ce != engine->kernel_context) {
+		if (engine->wa_ctx.vma->private != ce) {
+			cvm = clear_residuals(ring, engine);
+			intel_context_put(engine->wa_ctx.vma->private);
+			engine->wa_ctx.vma->private = intel_context_get(ce);
+		}
+	}
+
+	vm = vm_alias(ce->vm);
+	if (vm != cvm)
+		load_pd_dir(ring, engine, vm);
+
+	if (ce->state) {
+		u32 flags;
+
+		GEM_BUG_ON(engine->id != RCS0);
+
+		/* For resource streamer on HSW+ and power context elsewhere */
+		BUILD_BUG_ON(HSW_MI_RS_SAVE_STATE_EN != MI_SAVE_EXT_STATE_EN);
+		BUILD_BUG_ON(HSW_MI_RS_RESTORE_STATE_EN != MI_RESTORE_EXT_STATE_EN);
+
+		flags = MI_SAVE_EXT_STATE_EN | MI_MM_SPACE_GTT;
+		if (test_bit(CONTEXT_VALID_BIT, &ce->flags)) {
+			gen6_emit_invalidate_rcs(ring, engine);
+			flags |= MI_RESTORE_EXT_STATE_EN;
+		} else {
+			flags |= MI_RESTORE_INHIBIT;
+		}
+
+		mi_set_context(ring, engine, ce, flags);
+	}
+
+	remap_l3(ring, engine, ce);
 }
 
 static struct i915_request *ring_submit(struct i915_request *rq)
@@ -164,6 +423,15 @@ copy_active(struct i915_request **port, struct i915_request * const *active)
 	return port;
 }
 
+static void write_tail(struct intel_engine_cs *engine, u32 tail)
+{
+	/* Clear the context id. Here be magic! */
+	if (engine->fw_domain)
+		ENGINE_WRITE_FW(engine, RING_RNCID, 0);
+
+	ENGINE_WRITE(engine, RING_TAIL, tail);
+}
+
 static void dequeue(struct intel_engine_cs *engine)
 {
 	struct intel_engine_execlists * const el = &engine->execlists;
@@ -208,7 +476,7 @@ static void dequeue(struct intel_engine_cs *engine)
 	WRITE_ONCE(el->active, el->pending);
 
 	wmb(); /* paranoid flush of WCB before RING_TAIL write */
-	ENGINE_WRITE(engine, RING_TAIL, engine->legacy.ring->tail);
+	write_tail(engine, engine->legacy.ring->tail);
 	memcpy(el->inflight, el->pending,
 	       (port - el->pending + 1) * sizeof(*port));
 
@@ -418,6 +686,33 @@ static void submission_unpark(struct intel_engine_cs *engine)
 	intel_engine_pin_breadcrumbs_irq(engine);
 }
 
+static int gen4_emit_init_breadcrumb(struct i915_request *rq)
+{
+	struct intel_timeline *tl = i915_request_timeline(rq);
+	u32 *cs;
+
+	GEM_BUG_ON(i915_request_has_initial_breadcrumb(rq));
+	if (!tl->has_initial_breadcrumb)
+		return 0;
+
+	cs = intel_ring_begin(rq, 4);
+	if (IS_ERR(cs))
+		return PTR_ERR(cs);
+
+	*cs++ = MI_STORE_DWORD_IMM_GEN4 | MI_USE_GGTT;
+	*cs++ = 0;
+	*cs++ = tl->hwsp_offset;
+	*cs++ = rq->fence.seqno - 1;
+
+	intel_ring_advance(rq, cs);
+
+	/* Record the updated position of the request's payload */
+	rq->infix = intel_ring_offset(rq, cs);
+
+	__set_bit(I915_FENCE_FLAG_INITIAL_BREADCRUMB, &rq->fence.flags);
+	return 0;
+}
+
 static void ring_context_destroy(struct kref *ref)
 {
 	struct intel_context *ce = container_of(ref, typeof(*ce), ref);
@@ -433,8 +728,30 @@ static void ring_context_destroy(struct kref *ref)
 	intel_context_free(ce);
 }
 
+static int __context_pin_ppgtt(struct intel_context *ce)
+{
+	struct i915_address_space *vm;
+	int err = 0;
+
+	vm = vm_alias(ce->vm);
+	if (vm)
+		err = gen6_ppgtt_pin(i915_vm_to_ppgtt((vm)));
+
+	return err;
+}
+
+static void __context_unpin_ppgtt(struct intel_context *ce)
+{
+	struct i915_address_space *vm;
+
+	vm = vm_alias(ce->vm);
+	if (vm)
+		gen6_ppgtt_unpin(i915_vm_to_ppgtt(vm));
+}
+
 static void ring_context_unpin(struct intel_context *ce)
 {
+	__context_unpin_ppgtt(ce);
 }
 
 static int alloc_context_vma(struct intel_context *ce)
@@ -562,7 +879,7 @@ static int ring_context_alloc(struct intel_context *ce)
 
 static int ring_context_pin(struct intel_context *ce)
 {
-	return 0;
+	return __context_pin_ppgtt(ce);
 }
 
 static void ring_context_reset(struct intel_context *ce)
@@ -624,11 +941,18 @@ static void ring_release(struct intel_engine_cs *engine)
 
 	set_current_context(&engine->legacy.context, NULL);
 
+	if (engine->wa_ctx.vma) {
+		intel_context_put(engine->wa_ctx.vma->private);
+		i915_vma_unpin_and_release(&engine->wa_ctx.vma, 0);
+	}
+
 	ring_release_global_submission(engine);
 }
 
 static void setup_irq(struct intel_engine_cs *engine)
 {
+	engine->irq_enable = gen6_irq_enable;
+	engine->irq_disable = gen6_irq_disable;
 }
 
 static void setup_common(struct intel_engine_cs *engine)
@@ -637,7 +961,7 @@ static void setup_common(struct intel_engine_cs *engine)
 
 	/* gen8+ are only supported with execlists */
 	GEM_BUG_ON(INTEL_GEN(i915) >= 8);
-	GEM_BUG_ON(INTEL_GEN(i915) < 8);
+	GEM_BUG_ON(INTEL_GEN(i915) < 6);
 
 	setup_irq(engine);
 
@@ -653,24 +977,62 @@ static void setup_common(struct intel_engine_cs *engine)
 	engine->cops = &ring_context_ops;
 	engine->request_alloc = ring_request_alloc;
 
+	engine->emit_init_breadcrumb = gen4_emit_init_breadcrumb;
+	if (INTEL_GEN(i915) >= 7)
+		engine->emit_fini_breadcrumb = gen7_emit_breadcrumb_xcs;
+	else if (INTEL_GEN(i915) >= 6)
+		engine->emit_fini_breadcrumb = gen6_emit_breadcrumb_xcs;
+	else
+		engine->emit_fini_breadcrumb = gen3_emit_breadcrumb;
+
 	engine->set_default_submission = set_default_submission;
+
+	engine->emit_bb_start = gen6_emit_bb_start;
 }
 
 static void setup_rcs(struct intel_engine_cs *engine)
 {
+	struct drm_i915_private *i915 = engine->i915;
+
+	if (HAS_L3_DPF(i915))
+		engine->irq_keep_mask = GT_RENDER_L3_PARITY_ERROR_INTERRUPT;
+
+	engine->irq_enable_mask = GT_RENDER_USER_INTERRUPT;
+
+	if (INTEL_GEN(i915) >= 7) {
+		engine->emit_flush = gen7_emit_flush_rcs;
+		engine->emit_fini_breadcrumb = gen7_emit_breadcrumb_rcs;
+		if (IS_HASWELL(i915))
+			engine->emit_bb_start = hsw_emit_bb_start;
+	} else {
+		engine->emit_flush = gen6_emit_flush_rcs;
+		engine->emit_fini_breadcrumb = gen6_emit_breadcrumb_rcs;
+	}
 }
 
 static void setup_vcs(struct intel_engine_cs *engine)
 {
+	engine->emit_flush = gen6_emit_flush_vcs;
+	engine->irq_enable_mask = GT_BSD_USER_INTERRUPT;
+
+	if (IS_GEN(engine->i915, 6))
+		engine->fw_domain = FORCEWAKE_ALL;
 }
 
 static void setup_bcs(struct intel_engine_cs *engine)
 {
+	engine->emit_flush = gen6_emit_flush_xcs;
+	engine->irq_enable_mask = GT_BLT_USER_INTERRUPT;
 }
 
 static void setup_vecs(struct intel_engine_cs *engine)
 {
 	GEM_BUG_ON(!IS_HASWELL(engine->i915));
+
+	engine->emit_flush = gen6_emit_flush_xcs;
+	engine->irq_enable_mask = PM_VEBOX_USER_INTERRUPT;
+	engine->irq_enable = hsw_irq_enable_vecs;
+	engine->irq_disable = hsw_irq_disable_vecs;
 }
 
 static unsigned int global_ring_size(void)
@@ -715,6 +1077,58 @@ static int ring_setup_global_submission(struct intel_engine_cs *engine)
 	return err;
 }
 
+static int gen7_ctx_switch_bb_init(struct intel_engine_cs *engine)
+{
+	struct drm_i915_gem_object *obj;
+	struct i915_vma *vma;
+	int size;
+	int err;
+
+	size = gen7_setup_clear_gpr_bb(engine, NULL /* probe size */);
+	if (size <= 0)
+		return size;
+
+	size = ALIGN(size, PAGE_SIZE);
+	obj = i915_gem_object_create_internal(engine->i915, size);
+	if (IS_ERR(obj))
+		return PTR_ERR(obj);
+
+	vma = i915_vma_instance(obj, engine->gt->vm, NULL);
+	if (IS_ERR(vma)) {
+		err = PTR_ERR(vma);
+		goto err_obj;
+	}
+
+	vma->private = intel_context_create(engine); /* dummy residuals */
+	if (IS_ERR(vma->private)) {
+		err = PTR_ERR(vma->private);
+		goto err_obj;
+	}
+
+	err = i915_vma_pin(vma, 0, 0, PIN_USER | PIN_HIGH);
+	if (err)
+		goto err_private;
+
+	err = i915_vma_sync(vma);
+	if (err)
+		goto err_unpin;
+
+	err = gen7_setup_clear_gpr_bb(engine, vma);
+	if (err)
+		goto err_unpin;
+
+	engine->wa_ctx.vma = vma;
+	return 0;
+
+err_unpin:
+	i915_vma_unpin(vma);
+err_private:
+	intel_context_put(vma->private);
+err_obj:
+	i915_gem_object_put(obj);
+	return err;
+}
+
 int intel_ring_scheduler_setup(struct intel_engine_cs *engine)
 {
 	int err;
@@ -748,6 +1162,12 @@ int intel_ring_scheduler_setup(struct intel_engine_cs *engine)
 	if (err)
 		goto err_common;
 
+	if (IS_HASWELL(engine->i915) && engine->class == RENDER_CLASS) {
+		err = gen7_ctx_switch_bb_init(engine);
+		if (err)
+			goto err_global;
+	}
+
 	engine->flags |= I915_ENGINE_HAS_SCHEDULER;
 	engine->flags |= I915_ENGINE_NEEDS_BREADCRUMB_TASKLET;
 	engine->flags |= I915_ENGINE_SUPPORTS_STATS;
@@ -756,6 +1176,8 @@ int intel_ring_scheduler_setup(struct intel_engine_cs *engine)
 	engine->release = ring_release;
 	return 0;
 
+err_global:
+	ring_release_global_submission(engine);
 err_common:
 	intel_engine_cleanup_common(engine);
 	return err;
diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
index 4e796ff4d7d0..7755206d06e3 100644
--- a/drivers/gpu/drm/i915/i915_reg.h
+++ b/drivers/gpu/drm/i915/i915_reg.h
@@ -2531,6 +2531,7 @@ static inline bool i915_mmio_reg_valid(i915_reg_t reg)
 #define   RESET_CTL_CAT_ERROR	   REG_BIT(2)
 #define   RESET_CTL_READY_TO_RESET REG_BIT(1)
 #define   RESET_CTL_REQUEST_RESET  REG_BIT(0)
+#define RING_RNCID(base)	_MMIO((base) + 0x198)
 
 #define RING_SEMA_WAIT_POLL(base) _MMIO((base) + 0x24c)
 
-- 
2.20.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [Intel-gfx] [PATCH 65/66] drm/i915/gt: Enable ring scheduling for gen6/7
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (62 preceding siblings ...)
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 64/66] drm/i915/gt: Implement ring scheduler for gen6/7 Chris Wilson
@ 2020-07-15 11:51 ` Chris Wilson
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 66/66] drm/i915/gem: Remove timeline nesting from snb relocs Chris Wilson
                   ` (8 subsequent siblings)
  72 siblings, 0 replies; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:51 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

Switch over from FIFO global submission to the priority-sorted
topological scheduler. At the cost of more busy work on the CPU to
keep the GPU supplied with the next packet of requests, this allows us
to reorder requests around submission stalls.

This also enables the timer-based RPS, with the exception of Valleyview,
whose PCU doesn't take kindly to our interference.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c | 2 +-
 drivers/gpu/drm/i915/gt/intel_engine_cs.c             | 2 ++
 drivers/gpu/drm/i915/gt/intel_rps.c                   | 6 ++----
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c b/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c
index f2a307b4146e..55f09ab7136a 100644
--- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c
@@ -94,7 +94,7 @@ static int live_nop_switch(void *arg)
 			rq = i915_request_get(this);
 			i915_request_add(this);
 		}
-		if (i915_request_wait(rq, 0, HZ / 5) < 0) {
+		if (i915_request_wait(rq, 0, HZ) < 0) {
 			pr_err("Failed to populated %d contexts\n", nctx);
 			intel_gt_set_wedged(&i915->gt);
 			i915_request_put(rq);
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index df234ce10907..c9db59b9bacf 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -863,6 +863,8 @@ int intel_engines_init(struct intel_gt *gt)
 
 	if (HAS_EXECLISTS(gt->i915))
 		setup = intel_execlists_submission_setup;
+	else if (INTEL_GEN(gt->i915) >= 6)
+		setup = intel_ring_scheduler_setup;
 	else
 		setup = intel_ring_submission_setup;
 
diff --git a/drivers/gpu/drm/i915/gt/intel_rps.c b/drivers/gpu/drm/i915/gt/intel_rps.c
index 49910425e986..bf923df212d1 100644
--- a/drivers/gpu/drm/i915/gt/intel_rps.c
+++ b/drivers/gpu/drm/i915/gt/intel_rps.c
@@ -1052,9 +1052,7 @@ static bool gen6_rps_enable(struct intel_rps *rps)
 	intel_uncore_write_fw(uncore, GEN6_RP_DOWN_TIMEOUT, 50000);
 	intel_uncore_write_fw(uncore, GEN6_RP_IDLE_HYSTERSIS, 10);
 
-	rps->pm_events = (GEN6_PM_RP_UP_THRESHOLD |
-			  GEN6_PM_RP_DOWN_THRESHOLD |
-			  GEN6_PM_RP_DOWN_TIMEOUT);
+	rps->pm_events = GEN6_PM_RP_UP_THRESHOLD | GEN6_PM_RP_DOWN_THRESHOLD;
 
 	return rps_reset(rps);
 }
@@ -1362,7 +1360,7 @@ void intel_rps_enable(struct intel_rps *rps)
 	GEM_BUG_ON(rps->efficient_freq < rps->min_freq);
 	GEM_BUG_ON(rps->efficient_freq > rps->max_freq);
 
-	if (has_busy_stats(rps))
+	if (has_busy_stats(rps) && !IS_VALLEYVIEW(i915))
 		intel_rps_set_timer(rps);
 	else if (INTEL_GEN(i915) >= 6)
 		intel_rps_set_interrupts(rps);
-- 
2.20.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [Intel-gfx] [PATCH 66/66] drm/i915/gem: Remove timeline nesting from snb relocs
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (63 preceding siblings ...)
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 65/66] drm/i915/gt: Enable ring scheduling " Chris Wilson
@ 2020-07-15 11:51 ` Chris Wilson
  2020-07-15 13:27 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for series starting with [01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Patchwork
                   ` (7 subsequent siblings)
  72 siblings, 0 replies; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 11:51 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

As snb is the only platform that requires an alternative engine for
performing relocations, we know that we can reuse a common timeline
between engines.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c    | 22 +++++--------------
 1 file changed, 5 insertions(+), 17 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index d9f1403ddfa4..28f5c28a9449 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -1965,16 +1965,9 @@ nested_request_create(struct intel_context *ce, struct i915_execbuffer *eb)
 {
 	struct i915_request *rq;
 
-	/* XXX This only works once; replace with shared timeline */
-	if (ce->timeline != eb->context->timeline)
-		mutex_lock_nested(&ce->timeline->mutex, SINGLE_DEPTH_NESTING);
 	intel_context_enter(ce);
-
 	rq = __i915_request_create(ce, GFP_KERNEL);
-
 	intel_context_exit(ce);
-	if (IS_ERR(rq) && ce->timeline != eb->context->timeline)
-		mutex_unlock(&ce->timeline->mutex);
 
 	return rq;
 }
@@ -2021,9 +2014,6 @@ reloc_gpu_flush(struct i915_execbuffer *eb, struct i915_request *rq, int err)
 	intel_gt_chipset_flush(rq->engine->gt);
 	__i915_request_add(rq, &eb->gem_context->sched);
 
-	if (i915_request_timeline(rq) != eb->context->timeline)
-		mutex_unlock(&i915_request_timeline(rq)->mutex);
-
 	return err;
 }
 
@@ -2426,10 +2416,7 @@ static struct i915_request *reloc_gpu_alloc(struct i915_execbuffer *eb)
 	struct reloc_cache *cache = &eb->reloc_cache;
 	struct i915_request *rq;
 
-	if (cache->ce == eb->context)
-		rq = __i915_request_create(cache->ce, GFP_KERNEL);
-	else
-		rq = nested_request_create(cache->ce, eb);
+	rq = nested_request_create(cache->ce, eb);
 	if (IS_ERR(rq))
 		return rq;
 
@@ -2968,13 +2955,14 @@ static int __eb_pin_reloc_engine(struct i915_execbuffer *eb)
 	if (!engine)
 		return -ENODEV;
 
+	if (!intel_engine_has_scheduler(engine))
+		return -ENODEV;
+
 	ce = intel_context_create(engine);
 	if (IS_ERR(ce))
 		return PTR_ERR(ce);
 
-	/* Reuse eb->context->timeline with scheduler! */
-	if (intel_engine_has_scheduler(engine))
-		ce->timeline = intel_timeline_get(eb->context->timeline);
+	ce->timeline = intel_timeline_get(eb->context->timeline);
 
 	i915_vm_put(ce->vm);
 	ce->vm = i915_vm_get(eb->context->vm);
-- 
2.20.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for series starting with [01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (64 preceding siblings ...)
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 66/66] drm/i915/gem: Remove timeline nesting from snb relocs Chris Wilson
@ 2020-07-15 13:27 ` Patchwork
  2020-07-15 13:28 ` [Intel-gfx] ✗ Fi.CI.SPARSE: " Patchwork
                   ` (6 subsequent siblings)
  72 siblings, 0 replies; 156+ messages in thread
From: Patchwork @ 2020-07-15 13:27 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: series starting with [01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait
URL   : https://patchwork.freedesktop.org/series/79517/
State : warning

== Summary ==

$ dim checkpatch origin/drm-tip
91bde1c5e935 drm/i915: Reduce i915_request.lock contention for i915_request_wait
13de08637a53 drm/i915: Remove i915_request.lock requirement for execution callbacks
6fa430293a65 drm/i915: Remove requirement for holding i915_request.lock for breadcrumbs
e919aa95911c drm/i915: Add a couple of missing i915_active_fini()
379852a0d374 drm/i915: Skip taking acquire mutex for no ref->active callback
8628c1a1b800 drm/i915: Export a preallocate variant of i915_active_acquire()
383c7cbfc8d5 drm/i915: Keep the most recently used active-fence upon discard
00c92cc2c921 drm/i915: Make the stale cached active node available for any timeline
a938c15b20b3 drm/i915: Provide a fastpath for waiting on vma bindings
63c73191100c drm/i915: Soften the tasklet flush frequency before waits
7c20ceade654 drm/i915: Preallocate stashes for vma page-directories
f62b56fc3c3f drm/i915: Switch to object allocations for page directories
d9e5a405669b drm/i915/gem: Don't drop the timeline lock during execbuf
a488eb2ac0c6 drm/i915/gem: Rename execbuf.bind_link to unbound_link
bf7156efc5fe drm/i915/gem: Break apart the early i915_vma_pin from execbuf object lookup
b336351c77f1 drm/i915/gem: Remove the call for no-evict i915_vma_pin
482c12431055 drm/i915: Add list_for_each_entry_safe_continue_reverse
-:21: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'pos' - possible side-effects?
#21: FILE: drivers/gpu/drm/i915/i915_utils.h:269:
+#define list_for_each_entry_safe_continue_reverse(pos, n, head, member)	\
+	for (pos = list_prev_entry(pos, member),			\
+	     n = list_prev_entry(pos, member);				\
+	     &pos->member != (head);					\
+	     pos = n, n = list_prev_entry(n, member))

-:21: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'n' - possible side-effects?
#21: FILE: drivers/gpu/drm/i915/i915_utils.h:269:
+#define list_for_each_entry_safe_continue_reverse(pos, n, head, member)	\
+	for (pos = list_prev_entry(pos, member),			\
+	     n = list_prev_entry(pos, member);				\
+	     &pos->member != (head);					\
+	     pos = n, n = list_prev_entry(n, member))

-:21: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'member' - possible side-effects?
#21: FILE: drivers/gpu/drm/i915/i915_utils.h:269:
+#define list_for_each_entry_safe_continue_reverse(pos, n, head, member)	\
+	for (pos = list_prev_entry(pos, member),			\
+	     n = list_prev_entry(pos, member);				\
+	     &pos->member != (head);					\
+	     pos = n, n = list_prev_entry(n, member))

total: 0 errors, 0 warnings, 3 checks, 12 lines checked
03f0ac6d0aa7 drm/i915: Always defer fenced work to the worker
0ffcc361f56b drm/i915/gem: Assign context id for async work
82ed2ac041bb drm/i915/gem: Separate the ww_mutex walker into its own list
8ed473e9ebb5 drm/i915/gem: Asynchronous GTT unbinding
289e4f21e37d drm/i915/gem: Bind the fence async for execbuf
3994c1315802 drm/i915/gem: Include cmdparser in common execbuf pinning
d59dcbda945a drm/i915/gem: Include secure batch in common execbuf pinning
a2d67581b7cc drm/i915/gem: Reintroduce multiple passes for reloc processing
-:1512: WARNING:MEMORY_BARRIER: memory barrier without comment
#1512: FILE: drivers/gpu/drm/i915/gem/selftests/i915_gem_execbuffer.c:161:
+		wmb();

total: 0 errors, 1 warnings, 0 checks, 1502 lines checked
34f8c8b06fb1 drm/i915: Add an implementation for i915_gem_ww_ctx locking, v2.
-:59: WARNING:FILE_PATH_CHANGES: added, moved or deleted file(s), does MAINTAINERS need updating?
#59: 
new file mode 100644

-:354: WARNING:LINE_SPACING: Missing a blank line after declarations
#354: FILE: drivers/gpu/drm/i915/mm/st_acquire_ctx.c:106:
+	const unsigned int total = ARRAY_SIZE(dl->obj);
+	I915_RND_STATE(prng);

-:450: WARNING:YIELD: Using yield() is generally wrong. See yield() kernel-doc (sched/core.c)
#450: FILE: drivers/gpu/drm/i915/mm/st_acquire_ctx.c:202:
+	yield(); /* start all threads before we begin */

total: 0 errors, 3 warnings, 0 checks, 446 lines checked
fe0b2e8f9867 drm/i915/gem: Pull execbuf dma resv under a single critical section
0ca5333286a3 drm/i915/gem: Replace i915_gem_object.mm.mutex with reservation_ww_class
009379133a04 drm/i915: Hold wakeref for the duration of the vma GGTT binding
30196b1778f2 drm/i915: Specialise GGTT binding
7ef75aca0d57 drm/i915/gt: Acquire backing storage for the context
0e4a431e2f5e drm/i915/gt: Push the wait for the context to bound to the request
-:198: WARNING:FILE_PATH_CHANGES: added, moved or deleted file(s), does MAINTAINERS need updating?
#198: 
new file mode 100644

total: 0 errors, 1 warnings, 0 checks, 237 lines checked
6296fe868377 drm/i915: Remove unused i915_gem_evict_vm()
4ddd5ececb22 drm/i915/gt: Decouple completed requests on unwind
b1fba621ff00 drm/i915/gt: Check for a completed last request once
9fd59fa35cdc drm/i915/gt: Replace direct submit with direct call to tasklet
bf6e26068597 drm/i915/gt: Free stale request on destroying the virtual engine
067ea4919a58 drm/i915/gt: Use virtual_engine during execlists_dequeue
bf89b50c5a71 drm/i915/gt: Decouple inflight virtual engines
ce7f6e2bf488 drm/i915/gt: Defer schedule_out until after the next dequeue
f0e7e4b98971 drm/i915/gt: Resubmit the virtual engine on schedule-out
08be9ee137a1 drm/i915/gt: Simplify virtual engine handling for execlists_hold()
11f1bf554545 drm/i915/gt: ce->inflight updates are now serialised
79b47cba6b9c drm/i915/gt: Drop atomic for engine->fw_active tracking
488c89851fae drm/i915/gt: Extract busy-stats for ring-scheduler
-:12: WARNING:FILE_PATH_CHANGES: added, moved or deleted file(s), does MAINTAINERS need updating?
#12: 
new file mode 100644

total: 0 errors, 1 warnings, 0 checks, 95 lines checked
ff6f1d863327 drm/i915/gt: Convert stats.active to plain unsigned int
e716fd9ed81d drm/i915: Lift waiter/signaler iterators
657226e71789 drm/i915: Strip out internal priorities
ec6f07931628 drm/i915: Remove I915_USER_PRIORITY_SHIFT
109dafe14a1a drm/i915: Replace engine->schedule() with a known request operation
b8563a88df20 drm/i915/gt: Do not suspend bonded requests if one hangs
64cc265ade5a drm/i915: Teach the i915_dependency to use a double-lock
cdaf10b00d13 drm/i915: Restructure priority inheritance
96a6eabdd904 drm/i915/gt: Remove timeslice suppression
5dbb21007170 drm/i915: Fair low-latency scheduling
-:1570: WARNING:FILE_PATH_CHANGES: added, moved or deleted file(s), does MAINTAINERS need updating?
#1570: 
new file mode 100644

total: 0 errors, 1 warnings, 0 checks, 1394 lines checked
e27a7ce87a79 drm/i915/gt: Specify a deadline for the heartbeat
bb9eb77ea30a drm/i915: Replace the priority boosting for the display with a deadline
09a11cb682d1 drm/i915: Move saturated workload detection to the GT
-:22: WARNING:COMMIT_LOG_LONG_LINE: Possible unwrapped commit description (prefer a maximum 75 chars per line)
#22: 
References: 44d89409a12e ("drm/i915: Make the semaphore saturation mask global")

-:22: ERROR:GIT_COMMIT_ID: Please use git commit description style 'commit <12+ chars of sha1> ("<title line>")' - ie: 'commit 44d89409a12e ("drm/i915: Make the semaphore saturation mask global")'
#22: 
References: 44d89409a12e ("drm/i915: Make the semaphore saturation mask global")

total: 1 errors, 1 warnings, 0 checks, 82 lines checked
d23e78b07d59 Restore "drm/i915: drop engine_pin/unpin_breadcrumbs_irq"
6645f5fddc72 drm/i915/gt: Couple tasklet scheduling for all CS interrupts
f28f48cf03af drm/i915/gt: Support creation of 'internal' rings
2b5546dd8abd drm/i915/gt: Use client timeline address for seqno writes
863da3117a20 drm/i915/gt: Infrastructure for ring scheduling
-:79: WARNING:FILE_PATH_CHANGES: added, moved or deleted file(s), does MAINTAINERS need updating?
#79: 
new file mode 100644

total: 0 errors, 1 warnings, 0 checks, 844 lines checked
87c21a67aa37 drm/i915/gt: Implement ring scheduler for gen6/7
-:68: CHECK:OPEN_ENDED_LINE: Lines should not end with a '('
#68: FILE: drivers/gpu/drm/i915/gt/intel_ring_scheduler.c:174:
+				*cs++ = i915_mmio_reg_offset(

-:70: CHECK:OPEN_ENDED_LINE: Lines should not end with a '('
#70: FILE: drivers/gpu/drm/i915/gt/intel_ring_scheduler.c:176:
+				*cs++ = _MASKED_BIT_ENABLE(

-:105: CHECK:OPEN_ENDED_LINE: Lines should not end with a '('
#105: FILE: drivers/gpu/drm/i915/gt/intel_ring_scheduler.c:211:
+				*cs++ = _MASKED_BIT_DISABLE(

total: 0 errors, 0 warnings, 3 checks, 540 lines checked
b3c1e59841ea drm/i915/gt: Enable ring scheduling for gen6/7
823526e73ba8 drm/i915/gem: Remove timeline nesting from snb relocs


_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread

* [Intel-gfx] ✗ Fi.CI.SPARSE: warning for series starting with [01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (65 preceding siblings ...)
  2020-07-15 13:27 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for series starting with [01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Patchwork
@ 2020-07-15 13:28 ` Patchwork
  2020-07-15 14:20 ` [Intel-gfx] ✗ Fi.CI.BAT: failure " Patchwork
                   ` (5 subsequent siblings)
  72 siblings, 0 replies; 156+ messages in thread
From: Patchwork @ 2020-07-15 13:28 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: series starting with [01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait
URL   : https://patchwork.freedesktop.org/series/79517/
State : warning

== Summary ==

$ dim sparse --fast origin/drm-tip
Sparse version: v0.6.0
Fast mode used, each commit won't be checked separately.


_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread

* [Intel-gfx] ✗ Fi.CI.BAT: failure for series starting with [01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (66 preceding siblings ...)
  2020-07-15 13:28 ` [Intel-gfx] ✗ Fi.CI.SPARSE: " Patchwork
@ 2020-07-15 14:20 ` Patchwork
  2020-07-15 15:41 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for series starting with [01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait (rev2) Patchwork
                   ` (4 subsequent siblings)
  72 siblings, 0 replies; 156+ messages in thread
From: Patchwork @ 2020-07-15 14:20 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx


[-- Attachment #1.1: Type: text/plain, Size: 20044 bytes --]

== Series Details ==

Series: series starting with [01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait
URL   : https://patchwork.freedesktop.org/series/79517/
State : failure

== Summary ==

CI Bug Log - changes from CI_DRM_8750 -> Patchwork_18177
====================================================

Summary
-------

  **FAILURE**

  Serious unknown changes coming with Patchwork_18177 absolutely need to be
  verified manually.
  
  If you think the reported changes have nothing to do with the changes
  introduced in Patchwork_18177, please notify your bug team to allow them
  to document this new failure mode, which will reduce false positives in CI.

  External URL: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18177/index.html

Possible new issues
-------------------

  Here are the unknown changes that may have been introduced in Patchwork_18177:

### IGT changes ###

#### Possible regressions ####

  * igt@i915_selftest@live@execlists:
    - fi-cfl-8109u:       [PASS][1] -> [INCOMPLETE][2]
   [1]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/fi-cfl-8109u/igt@i915_selftest@live@execlists.html
   [2]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18177/fi-cfl-8109u/igt@i915_selftest@live@execlists.html
    - fi-icl-u2:          [PASS][3] -> [INCOMPLETE][4]
   [3]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/fi-icl-u2/igt@i915_selftest@live@execlists.html
   [4]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18177/fi-icl-u2/igt@i915_selftest@live@execlists.html
    - fi-tgl-y:           [PASS][5] -> [INCOMPLETE][6]
   [5]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/fi-tgl-y/igt@i915_selftest@live@execlists.html
   [6]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18177/fi-tgl-y/igt@i915_selftest@live@execlists.html
    - fi-cfl-8700k:       [PASS][7] -> [INCOMPLETE][8]
   [7]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/fi-cfl-8700k/igt@i915_selftest@live@execlists.html
   [8]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18177/fi-cfl-8700k/igt@i915_selftest@live@execlists.html
    - fi-tgl-u2:          [PASS][9] -> [INCOMPLETE][10]
   [9]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/fi-tgl-u2/igt@i915_selftest@live@execlists.html
   [10]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18177/fi-tgl-u2/igt@i915_selftest@live@execlists.html
    - fi-cml-s:           [PASS][11] -> [INCOMPLETE][12]
   [11]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/fi-cml-s/igt@i915_selftest@live@execlists.html
   [12]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18177/fi-cml-s/igt@i915_selftest@live@execlists.html
    - fi-cfl-guc:         [PASS][13] -> [INCOMPLETE][14]
   [13]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/fi-cfl-guc/igt@i915_selftest@live@execlists.html
   [14]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18177/fi-cfl-guc/igt@i915_selftest@live@execlists.html
    - fi-icl-y:           [PASS][15] -> [INCOMPLETE][16]
   [15]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/fi-icl-y/igt@i915_selftest@live@execlists.html
   [16]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18177/fi-icl-y/igt@i915_selftest@live@execlists.html
    - fi-whl-u:           [PASS][17] -> [INCOMPLETE][18]
   [17]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/fi-whl-u/igt@i915_selftest@live@execlists.html
   [18]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18177/fi-whl-u/igt@i915_selftest@live@execlists.html
    - fi-cml-u2:          [PASS][19] -> [INCOMPLETE][20]
   [19]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/fi-cml-u2/igt@i915_selftest@live@execlists.html
   [20]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18177/fi-cml-u2/igt@i915_selftest@live@execlists.html

  
#### Suppressed ####

  The following results come from untrusted machines, tests, or statuses.
  They do not affect the overall result.

  * igt@i915_selftest@live@execlists:
    - {fi-ehl-1}:         [PASS][21] -> [INCOMPLETE][22]
   [21]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/fi-ehl-1/igt@i915_selftest@live@execlists.html
   [22]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18177/fi-ehl-1/igt@i915_selftest@live@execlists.html
    - {fi-tgl-dsi}:       [PASS][23] -> [INCOMPLETE][24]
   [23]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/fi-tgl-dsi/igt@i915_selftest@live@execlists.html
   [24]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18177/fi-tgl-dsi/igt@i915_selftest@live@execlists.html

  
Known issues
------------

  Here are the changes found in Patchwork_18177 that come from known issues:

### IGT changes ###

#### Issues hit ####

  * igt@gem_exec_suspend@basic-s3:
    - fi-tgl-u2:          [PASS][25] -> [FAIL][26] ([i915#1888]) +1 similar issue
   [25]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/fi-tgl-u2/igt@gem_exec_suspend@basic-s3.html
   [26]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18177/fi-tgl-u2/igt@gem_exec_suspend@basic-s3.html

  * igt@gem_flink_basic@basic:
    - fi-tgl-y:           [PASS][27] -> [DMESG-WARN][28] ([i915#402])
   [27]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/fi-tgl-y/igt@gem_flink_basic@basic.html
   [28]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18177/fi-tgl-y/igt@gem_flink_basic@basic.html

  * igt@i915_pm_rpm@basic-pci-d3-state:
    - fi-tgl-y:           [PASS][29] -> [DMESG-WARN][30] ([i915#1982])
   [29]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/fi-tgl-y/igt@i915_pm_rpm@basic-pci-d3-state.html
   [30]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18177/fi-tgl-y/igt@i915_pm_rpm@basic-pci-d3-state.html

  * igt@i915_selftest@live@execlists:
    - fi-kbl-r:           [PASS][31] -> [INCOMPLETE][32] ([i915#794])
   [31]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/fi-kbl-r/igt@i915_selftest@live@execlists.html
   [32]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18177/fi-kbl-r/igt@i915_selftest@live@execlists.html
    - fi-apl-guc:         [PASS][33] -> [INCOMPLETE][34] ([i915#1635])
   [33]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/fi-apl-guc/igt@i915_selftest@live@execlists.html
   [34]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18177/fi-apl-guc/igt@i915_selftest@live@execlists.html
    - fi-skl-lmem:        [PASS][35] -> [INCOMPLETE][36] ([i915#1795])
   [35]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/fi-skl-lmem/igt@i915_selftest@live@execlists.html
   [36]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18177/fi-skl-lmem/igt@i915_selftest@live@execlists.html
    - fi-kbl-x1275:       [PASS][37] -> [INCOMPLETE][38] ([i915#794])
   [37]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/fi-kbl-x1275/igt@i915_selftest@live@execlists.html
   [38]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18177/fi-kbl-x1275/igt@i915_selftest@live@execlists.html
    - fi-skl-6600u:       [PASS][39] -> [INCOMPLETE][40] ([i915#1795])
   [39]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/fi-skl-6600u/igt@i915_selftest@live@execlists.html
   [40]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18177/fi-skl-6600u/igt@i915_selftest@live@execlists.html
    - fi-skl-guc:         [PASS][41] -> [INCOMPLETE][42] ([i915#1795])
   [41]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/fi-skl-guc/igt@i915_selftest@live@execlists.html
   [42]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18177/fi-skl-guc/igt@i915_selftest@live@execlists.html
    - fi-skl-6700k2:      [PASS][43] -> [INCOMPLETE][44] ([i915#1795])
   [43]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/fi-skl-6700k2/igt@i915_selftest@live@execlists.html
   [44]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18177/fi-skl-6700k2/igt@i915_selftest@live@execlists.html
    - fi-bxt-dsi:         [PASS][45] -> [INCOMPLETE][46] ([i915#1635])
   [45]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/fi-bxt-dsi/igt@i915_selftest@live@execlists.html
   [46]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18177/fi-bxt-dsi/igt@i915_selftest@live@execlists.html
    - fi-kbl-soraka:      [PASS][47] -> [INCOMPLETE][48] ([i915#794])
   [47]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/fi-kbl-soraka/igt@i915_selftest@live@execlists.html
   [48]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18177/fi-kbl-soraka/igt@i915_selftest@live@execlists.html
    - fi-kbl-guc:         [PASS][49] -> [INCOMPLETE][50] ([i915#794])
   [49]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/fi-kbl-guc/igt@i915_selftest@live@execlists.html
   [50]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18177/fi-kbl-guc/igt@i915_selftest@live@execlists.html
    - fi-kbl-7500u:       [PASS][51] -> [INCOMPLETE][52] ([i915#794])
   [51]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/fi-kbl-7500u/igt@i915_selftest@live@execlists.html
   [52]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18177/fi-kbl-7500u/igt@i915_selftest@live@execlists.html
    - fi-kbl-8809g:       [PASS][53] -> [INCOMPLETE][54] ([i915#794])
   [53]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/fi-kbl-8809g/igt@i915_selftest@live@execlists.html
   [54]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18177/fi-kbl-8809g/igt@i915_selftest@live@execlists.html

  * igt@kms_busy@basic@flip:
    - fi-kbl-soraka:      [PASS][55] -> [DMESG-WARN][56] ([i915#1982])
   [55]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/fi-kbl-soraka/igt@kms_busy@basic@flip.html
   [56]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18177/fi-kbl-soraka/igt@kms_busy@basic@flip.html

  * igt@kms_pipe_crc_basic@read-crc-pipe-a-frame-sequence:
    - fi-tgl-u2:          [PASS][57] -> [DMESG-WARN][58] ([i915#402])
   [57]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/fi-tgl-u2/igt@kms_pipe_crc_basic@read-crc-pipe-a-frame-sequence.html
   [58]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18177/fi-tgl-u2/igt@kms_pipe_crc_basic@read-crc-pipe-a-frame-sequence.html

  
#### Possible fixes ####

  * igt@i915_module_load@reload:
    - fi-byt-j1900:       [DMESG-WARN][59] ([i915#1982]) -> [PASS][60]
   [59]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/fi-byt-j1900/igt@i915_module_load@reload.html
   [60]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18177/fi-byt-j1900/igt@i915_module_load@reload.html
    - fi-bxt-dsi:         [DMESG-WARN][61] ([i915#1635] / [i915#1982]) -> [PASS][62]
   [61]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/fi-bxt-dsi/igt@i915_module_load@reload.html
   [62]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18177/fi-bxt-dsi/igt@i915_module_load@reload.html
    - fi-tgl-u2:          [DMESG-WARN][63] ([i915#402]) -> [PASS][64]
   [63]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/fi-tgl-u2/igt@i915_module_load@reload.html
   [64]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18177/fi-tgl-u2/igt@i915_module_load@reload.html
    - fi-tgl-y:           [DMESG-WARN][65] ([i915#1982]) -> [PASS][66]
   [65]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/fi-tgl-y/igt@i915_module_load@reload.html
   [66]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18177/fi-tgl-y/igt@i915_module_load@reload.html

  * igt@i915_selftest@live@gt_lrc:
    - fi-tgl-u2:          [DMESG-FAIL][67] ([i915#1233]) -> [PASS][68]
   [67]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/fi-tgl-u2/igt@i915_selftest@live@gt_lrc.html
   [68]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18177/fi-tgl-u2/igt@i915_selftest@live@gt_lrc.html

  * igt@kms_cursor_legacy@basic-busy-flip-before-cursor-atomic:
    - {fi-kbl-7560u}:     [DMESG-WARN][69] ([i915#1982]) -> [PASS][70]
   [69]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/fi-kbl-7560u/igt@kms_cursor_legacy@basic-busy-flip-before-cursor-atomic.html
   [70]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18177/fi-kbl-7560u/igt@kms_cursor_legacy@basic-busy-flip-before-cursor-atomic.html
    - fi-bsw-kefka:       [DMESG-WARN][71] ([i915#1982]) -> [PASS][72]
   [71]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/fi-bsw-kefka/igt@kms_cursor_legacy@basic-busy-flip-before-cursor-atomic.html
   [72]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18177/fi-bsw-kefka/igt@kms_cursor_legacy@basic-busy-flip-before-cursor-atomic.html

  * igt@vgem_basic@setversion:
    - fi-tgl-y:           [DMESG-WARN][73] ([i915#402]) -> [PASS][74]
   [73]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/fi-tgl-y/igt@vgem_basic@setversion.html
   [74]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18177/fi-tgl-y/igt@vgem_basic@setversion.html

  
#### Warnings ####

  * igt@i915_pm_rpm@module-reload:
    - fi-kbl-x1275:       [DMESG-FAIL][75] ([i915#62]) -> [SKIP][76] ([fdo#109271])
   [75]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/fi-kbl-x1275/igt@i915_pm_rpm@module-reload.html
   [76]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18177/fi-kbl-x1275/igt@i915_pm_rpm@module-reload.html

  * igt@kms_flip@basic-flip-vs-wf_vblank@a-dp1:
    - fi-kbl-x1275:       [DMESG-WARN][77] ([i915#62] / [i915#92]) -> [DMESG-WARN][78] ([i915#62] / [i915#92] / [i915#95]) +3 similar issues
   [77]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/fi-kbl-x1275/igt@kms_flip@basic-flip-vs-wf_vblank@a-dp1.html
   [78]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18177/fi-kbl-x1275/igt@kms_flip@basic-flip-vs-wf_vblank@a-dp1.html

  * igt@kms_force_connector_basic@force-edid:
    - fi-kbl-x1275:       [DMESG-WARN][79] ([i915#62] / [i915#92] / [i915#95]) -> [DMESG-WARN][80] ([i915#62] / [i915#92]) +6 similar issues
   [79]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/fi-kbl-x1275/igt@kms_force_connector_basic@force-edid.html
   [80]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18177/fi-kbl-x1275/igt@kms_force_connector_basic@force-edid.html

  
  {name}: This element is suppressed. This means it is ignored when computing
          the status of the difference (SUCCESS, WARNING, or FAILURE).

  [fdo#109271]: https://bugs.freedesktop.org/show_bug.cgi?id=109271
  [i915#1233]: https://gitlab.freedesktop.org/drm/intel/issues/1233
  [i915#1635]: https://gitlab.freedesktop.org/drm/intel/issues/1635
  [i915#1795]: https://gitlab.freedesktop.org/drm/intel/issues/1795
  [i915#1888]: https://gitlab.freedesktop.org/drm/intel/issues/1888
  [i915#1982]: https://gitlab.freedesktop.org/drm/intel/issues/1982
  [i915#402]: https://gitlab.freedesktop.org/drm/intel/issues/402
  [i915#62]: https://gitlab.freedesktop.org/drm/intel/issues/62
  [i915#794]: https://gitlab.freedesktop.org/drm/intel/issues/794
  [i915#92]: https://gitlab.freedesktop.org/drm/intel/issues/92
  [i915#95]: https://gitlab.freedesktop.org/drm/intel/issues/95


Participating hosts (47 -> 40)
------------------------------

  Missing    (7): fi-ilk-m540 fi-hsw-4200u fi-byt-squawks fi-bsw-cyan fi-ctg-p8600 fi-byt-clapper fi-bdw-samus 


Build changes
-------------

  * Linux: CI_DRM_8750 -> Patchwork_18177

  CI-20190529: 20190529
  CI_DRM_8750: 0714e0ca72205b9c38c4b2a09d8d5981637af2fb @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_5735: 21f8204e54c122e4a0f8ca4b59e4b2db8d1ba687 @ git://anongit.freedesktop.org/xorg/app/intel-gpu-tools
  Patchwork_18177: 823526e73ba85199f5a978bd9ea6156bf852f3d6 @ git://anongit.freedesktop.org/gfx-ci/linux


== Linux commits ==

823526e73ba8 drm/i915/gem: Remove timeline nesting from snb relocs
b3c1e59841ea drm/i915/gt: Enable ring scheduling for gen6/7
87c21a67aa37 drm/i915/gt: Implement ring scheduler for gen6/7
863da3117a20 drm/i915/gt: Infrastructure for ring scheduling
2b5546dd8abd drm/i915/gt: Use client timeline address for seqno writes
f28f48cf03af drm/i915/gt: Support creation of 'internal' rings
6645f5fddc72 drm/i915/gt: Couple tasklet scheduling for all CS interrupts
d23e78b07d59 Restore "drm/i915: drop engine_pin/unpin_breadcrumbs_irq"
09a11cb682d1 drm/i915: Move saturated workload detection to the GT
bb9eb77ea30a drm/i915: Replace the priority boosting for the display with a deadline
e27a7ce87a79 drm/i915/gt: Specify a deadline for the heartbeat
5dbb21007170 drm/i915: Fair low-latency scheduling
96a6eabdd904 drm/i915/gt: Remove timeslice suppression
cdaf10b00d13 drm/i915: Restructure priority inheritance
64cc265ade5a drm/i915: Teach the i915_dependency to use a double-lock
b8563a88df20 drm/i915/gt: Do not suspend bonded requests if one hangs
109dafe14a1a drm/i915: Replace engine->schedule() with a known request operation
ec6f07931628 drm/i915: Remove I915_USER_PRIORITY_SHIFT
657226e71789 drm/i915: Strip out internal priorities
e716fd9ed81d drm/i915: Lift waiter/signaler iterators
ff6f1d863327 drm/i915/gt: Convert stats.active to plain unsigned int
488c89851fae drm/i915/gt: Extract busy-stats for ring-scheduler
79b47cba6b9c drm/i915/gt: Drop atomic for engine->fw_active tracking
11f1bf554545 drm/i915/gt: ce->inflight updates are now serialised
08be9ee137a1 drm/i915/gt: Simplify virtual engine handling for execlists_hold()
f0e7e4b98971 drm/i915/gt: Resubmit the virtual engine on schedule-out
ce7f6e2bf488 drm/i915/gt: Defer schedule_out until after the next dequeue
bf89b50c5a71 drm/i915/gt: Decouple inflight virtual engines
067ea4919a58 drm/i915/gt: Use virtual_engine during execlists_dequeue
bf6e26068597 drm/i915/gt: Free stale request on destroying the virtual engine
9fd59fa35cdc drm/i915/gt: Replace direct submit with direct call to tasklet
b1fba621ff00 drm/i915/gt: Check for a completed last request once
4ddd5ececb22 drm/i915/gt: Decouple completed requests on unwind
6296fe868377 drm/i915: Remove unused i915_gem_evict_vm()
0e4a431e2f5e drm/i915/gt: Push the wait for the context to bound to the request
7ef75aca0d57 drm/i915/gt: Acquire backing storage for the context
30196b1778f2 drm/i915: Specialise GGTT binding
009379133a04 drm/i915: Hold wakeref for the duration of the vma GGTT binding
0ca5333286a3 drm/i915/gem: Replace i915_gem_object.mm.mutex with reservation_ww_class
fe0b2e8f9867 drm/i915/gem: Pull execbuf dma resv under a single critical section
34f8c8b06fb1 drm/i915: Add an implementation for i915_gem_ww_ctx locking, v2.
a2d67581b7cc drm/i915/gem: Reintroduce multiple passes for reloc processing
d59dcbda945a drm/i915/gem: Include secure batch in common execbuf pinning
3994c1315802 drm/i915/gem: Include cmdparser in common execbuf pinning
289e4f21e37d drm/i915/gem: Bind the fence async for execbuf
8ed473e9ebb5 drm/i915/gem: Asynchronous GTT unbinding
82ed2ac041bb drm/i915/gem: Separate the ww_mutex walker into its own list
0ffcc361f56b drm/i915/gem: Assign context id for async work
03f0ac6d0aa7 drm/i915: Always defer fenced work to the worker
482c12431055 drm/i915: Add list_for_each_entry_safe_continue_reverse
b336351c77f1 drm/i915/gem: Remove the call for no-evict i915_vma_pin
bf7156efc5fe drm/i915/gem: Break apart the early i915_vma_pin from execbuf object lookup
a488eb2ac0c6 drm/i915/gem: Rename execbuf.bind_link to unbound_link
d9e5a405669b drm/i915/gem: Don't drop the timeline lock during execbuf
f62b56fc3c3f drm/i915: Switch to object allocations for page directories
7c20ceade654 drm/i915: Preallocate stashes for vma page-directories
63c73191100c drm/i915: Soften the tasklet flush frequency before waits
a938c15b20b3 drm/i915: Provide a fastpath for waiting on vma bindings
00c92cc2c921 drm/i915: Make the stale cached active node available for any timeline
383c7cbfc8d5 drm/i915: Keep the most recently used active-fence upon discard
8628c1a1b800 drm/i915: Export a preallocate variant of i915_active_acquire()
379852a0d374 drm/i915: Skip taking acquire mutex for no ref->active callback
e919aa95911c drm/i915: Add a couple of missing i915_active_fini()
6fa430293a65 drm/i915: Remove requirement for holding i915_request.lock for breadcrumbs
13de08637a53 drm/i915: Remove i915_request.lock requirement for execution callbacks
91bde1c5e935 drm/i915: Reduce i915_request.lock contention for i915_request_wait

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18177/index.html

[-- Attachment #1.2: Type: text/html, Size: 23587 bytes --]

[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread

* [Intel-gfx] [PATCH] drm/i915: Fair low-latency scheduling
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 55/66] drm/i915: Fair low-latency scheduling Chris Wilson
@ 2020-07-15 15:33   ` Chris Wilson
  0 siblings, 0 replies; 156+ messages in thread
From: Chris Wilson @ 2020-07-15 15:33 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

The first "scheduler" was a topographical sorting of requests into
priority order. The execution order was deterministic, the earliest
submitted, highest priority request would be executed first. Priority
inherited ensured that inversions were kept at bay, and allowed us to
dynamically boost priorities (e.g. for interactive pageflips).

The minimalistic timeslicing scheme was an attempt to introduce fairness
between long running requests, by evicting the active request at the end
of a timeslice and moving it to the back of its priority queue (while
ensuring that dependencies were kept in order). For short running
requests from many clients of equal priority, the scheme is still very
much FIFO submission ordering, and as unfair as before.

To impose fairness, we need an external metric that ensures that clients
are interspersed: we must not execute one long chain from client A before
executing any of client B. This could be imposed by the clients using
fences based on an external clock, that is, they only submit work for a
"frame" at the frame interval, instead of submitting as much work as they
are able to. The standard SwapBuffers approach is akin to double
buffering, where, as one frame is being executed, the next is being
submitted, such that there is always a maximum of two frames per client
in the pipeline. Even this scheme exhibits unfairness under load, as a
single client will execute two frames back-to-back before the next, and
with enough clients, deadlines will be missed.

The idea taken from BFS/MuQSS is that fairness is introduced by
metering with an external clock. Every request, when it becomes ready to
execute, is assigned a virtual deadline, and execution order is then
determined by earliest deadline. Priority is used as a hint, rather than
strict ordering, where high priority requests have earlier deadlines,
but not necessarily earlier than outstanding work. Thus work is executed
in order of 'readiness', with timeslicing to demote long running work.
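
As a rough illustration of that ordering rule, here is a minimal
user-space sketch (not the kernel code in this patch; struct
fake_request, prio_slice_ns() and all of the numbers are invented for
the example). It only shows how "ready time + priority-derived slice"
produces an execution order that interleaves clients while still
letting priority pull a deadline forward:

/* evdf.c: gcc -Wall -o evdf evdf.c && ./evdf */
#include <stdint.h>
#include <stdio.h>

struct fake_request {
	const char *client;
	int prio;		/* hint only: higher prio => shorter slice */
	uint64_t ready_ns;	/* when the request became ready to run */
	uint64_t deadline_ns;	/* virtual deadline = ready + slice */
};

/* Invented mapping: ~16ms at priority 0, halving per priority level. */
static uint64_t prio_slice_ns(int prio)
{
	return 16000000ull >> (prio > 0 ? prio : 0);
}

static void assign_deadline(struct fake_request *rq)
{
	rq->deadline_ns = rq->ready_ns + prio_slice_ns(rq->prio);
}

int main(void)
{
	struct fake_request rqs[] = {
		{ "A", 0, 0 },		/* submitted first */
		{ "B", 0, 1000000 },	/* same prio, ready 1ms later */
		{ "C", 2, 2000000 },	/* ready later, but higher prio */
	};
	unsigned int i, j, n = sizeof(rqs) / sizeof(rqs[0]);

	for (i = 0; i < n; i++)
		assign_deadline(&rqs[i]);

	/* Pick by earliest virtual deadline, not by submission order. */
	for (i = 0; i < n; i++) {
		unsigned int best = i;

		for (j = i + 1; j < n; j++)
			if (rqs[j].deadline_ns < rqs[best].deadline_ns)
				best = j;
		if (best != i) {
			struct fake_request tmp = rqs[i];

			rqs[i] = rqs[best];
			rqs[best] = tmp;
		}
		printf("run %s, deadline %llu ns\n", rqs[i].client,
		       (unsigned long long)rqs[i].deadline_ns);
	}

	return 0;
}

With these made-up numbers C (deadline 6ms) runs before A (16ms) and B
(17ms) even though it was submitted last, while with equal priorities
the earlier-ready request keeps its earlier deadline, so clients are
interleaved rather than drained one chain at a time.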

The Achilles' heel of this scheduler is its strong preference for
low-latency and favouring of new queues. Whereas it was easy to dominate
the old scheduler by flooding it with many requests over a short period
of time, the new scheduler can be dominated by a 'synchronous' client
that waits for each of its requests to complete before submitting the
next. As such a client has no history, it is always considered
ready-to-run and receives an earlier deadline than the long running
requests.

To check the impact on throughput (often the downfall of latency
sensitive schedulers), we used gem_wsim to simulate various transcode
workloads with different load balancers, and varying the number of
competing [heterogeneous] clients.

+mB--------------------------------------------------------------------+
|                               a                                      |
|                             cda                                      |
|                             c.a                                      |
|                             ..aa                                     |
|                           ..---.                                     |
|                           -.--+-.                                    |
|                        .c.-.-+++.  b                                 |
|               b    bb.d-c-+--+++.aab aa    b b                       |
|b  b   b   b  b.  b ..---+++-+++++....a. b. b b   b       b    b     b|
1                               A|                                     |
2                         |___AM____|                                  |
3                            |A__|                                     |
4                            |MA_|                                     |
+----------------------------------------------------------------------+
Clients   Min       Max     Median           Avg        Stddev
1       -8.20       5.4     -0.045      -0.02375   0.094722134
2      -15.96     19.28      -0.64         -1.05     2.2428076
4       -5.11      2.95      -1.15    -1.0683333    0.72382651
8       -5.63      1.85     -0.905   -0.87122449    0.73390971

The impact on throughput was, on average, about a 1% reduction under
contention, due to the change in context execution order and the number
of context switches.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/gt/intel_engine_cs.c     |  12 +-
 .../gpu/drm/i915/gt/intel_engine_heartbeat.c  |   1 +
 drivers/gpu/drm/i915/gt/intel_engine_pm.c     |   4 +-
 drivers/gpu/drm/i915/gt/intel_engine_types.h  |  14 -
 drivers/gpu/drm/i915/gt/intel_lrc.c           | 230 +++++-------
 drivers/gpu/drm/i915/gt/selftest_hangcheck.c  |   5 +-
 drivers/gpu/drm/i915/gt/selftest_lrc.c        |  41 +-
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c |   6 +-
 drivers/gpu/drm/i915/i915_priolist_types.h    |   7 +-
 drivers/gpu/drm/i915/i915_request.c           |   1 +
 drivers/gpu/drm/i915/i915_scheduler.c         | 351 +++++++++++++-----
 drivers/gpu/drm/i915/i915_scheduler.h         |  24 +-
 drivers/gpu/drm/i915/i915_scheduler_types.h   |  17 +
 .../drm/i915/selftests/i915_mock_selftests.h  |   1 +
 drivers/gpu/drm/i915/selftests/i915_request.c |   1 +
 .../gpu/drm/i915/selftests/i915_scheduler.c   |  49 +++
 16 files changed, 499 insertions(+), 265 deletions(-)
 create mode 100644 drivers/gpu/drm/i915/selftests/i915_scheduler.c

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index 8bd87ca918d0..af9cc42d3061 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -588,7 +588,6 @@ void intel_engine_init_execlists(struct intel_engine_cs *engine)
 	execlists->active =
 		memset(execlists->inflight, 0, sizeof(execlists->inflight));
 
-	execlists->queue_priority_hint = INT_MIN;
 	execlists->queue = RB_ROOT_CACHED;
 }
 
@@ -1274,14 +1273,15 @@ bool intel_engine_can_store_dword(struct intel_engine_cs *engine)
 	}
 }
 
-static int print_sched_attr(const struct i915_sched_attr *attr,
-			    char *buf, int x, int len)
+static int print_sched(const struct i915_sched_node *node,
+		       char *buf, int x, int len)
 {
-	if (attr->priority == I915_PRIORITY_INVALID)
+	if (node->attr.priority == I915_PRIORITY_INVALID)
 		return x;
 
 	x += snprintf(buf + x, len - x,
-		      " prio=%d", attr->priority);
+		      " prio=%d, dl=%llu",
+		      node->attr.priority, node->deadline);
 
 	return x;
 }
@@ -1294,7 +1294,7 @@ static void print_request(struct drm_printer *m,
 	char buf[80] = "";
 	int x = 0;
 
-	x = print_sched_attr(&rq->sched.attr, buf, x, sizeof(buf));
+	x = print_sched(&rq->sched, buf, x, sizeof(buf));
 
 	drm_printf(m, "%s %llx:%llx%s%s %s @ %dms: %s\n",
 		   prefix,
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
index 96ebf61038d9..9fdc8223007f 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
@@ -235,6 +235,7 @@ int intel_engine_pulse(struct intel_engine_cs *engine)
 		goto out_unlock;
 	}
 
+	rq->sched.deadline = 0;
 	__set_bit(I915_FENCE_FLAG_SENTINEL, &rq->fence.flags);
 	heartbeat_commit(rq, &attr);
 	GEM_BUG_ON(rq->sched.attr.priority < I915_PRIORITY_BARRIER);
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.c b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
index 8ec3eecf3e39..a95099b7b759 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_pm.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
@@ -189,6 +189,7 @@ static bool switch_to_kernel_context(struct intel_engine_cs *engine)
 	i915_request_add_active_barriers(rq);
 
 	/* Install ourselves as a preemption barrier */
+	rq->sched.deadline = 0;
 	rq->sched.attr.priority = I915_PRIORITY_BARRIER;
 	if (likely(!__i915_request_commit(rq))) { /* engine should be idle! */
 		/*
@@ -249,9 +250,6 @@ static int __engine_park(struct intel_wakeref *wf)
 	intel_engine_park_heartbeat(engine);
 	intel_engine_disarm_breadcrumbs(engine);
 
-	/* Must be reset upon idling, or we may miss the busy wakeup. */
-	GEM_BUG_ON(engine->execlists.queue_priority_hint != INT_MIN);
-
 	if (engine->park)
 		engine->park(engine);
 
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h b/drivers/gpu/drm/i915/gt/intel_engine_types.h
index 354e01c560f2..af6f1154200a 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
@@ -236,20 +236,6 @@ struct intel_engine_execlists {
 	 */
 	unsigned int port_mask;
 
-	/**
-	 * @queue_priority_hint: Highest pending priority.
-	 *
-	 * When we add requests into the queue, or adjust the priority of
-	 * executing requests, we compute the maximum priority of those
-	 * pending requests. We can then use this value to determine if
-	 * we need to preempt the executing requests to service the queue.
-	 * However, since the we may have recorded the priority of an inflight
-	 * request we wanted to preempt but since completed, at the time of
-	 * dequeuing the priority hint may no longer may match the highest
-	 * available request priority.
-	 */
-	int queue_priority_hint;
-
 	/**
 	 * @queue: queue of requests, in priority lists
 	 */
diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index 29072215635e..6054695611ad 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -202,7 +202,7 @@ struct virtual_engine {
 	 */
 	struct ve_node {
 		struct rb_node rb;
-		int prio;
+		u64 deadline;
 	} nodes[I915_NUM_ENGINES];
 
 	/*
@@ -413,12 +413,17 @@ static inline struct i915_priolist *to_priolist(struct rb_node *rb)
 
 static inline int rq_prio(const struct i915_request *rq)
 {
-	return READ_ONCE(rq->sched.attr.priority);
+	return rq->sched.attr.priority;
 }
 
-static int effective_prio(const struct i915_request *rq)
+static inline u64 rq_deadline(const struct i915_request *rq)
 {
-	int prio = rq_prio(rq);
+	return rq->sched.deadline;
+}
+
+static u64 effective_deadline(const struct i915_request *rq)
+{
+	u64 deadline = rq_deadline(rq);
 
 	/*
 	 * If this request is special and must not be interrupted at any
@@ -429,33 +434,45 @@ static int effective_prio(const struct i915_request *rq)
 	 * nopreempt for as long as desired).
 	 */
 	if (i915_request_has_nopreempt(rq))
-		prio = I915_PRIORITY_UNPREEMPTABLE;
+		deadline = 0;
 
-	return prio;
+	return deadline;
 }
 
-static int queue_prio(const struct intel_engine_execlists *execlists)
+static u64 queue_deadline(struct intel_engine_execlists *el)
 {
-	struct rb_node *rb;
+	do {
+		struct rb_node *rb;
+		struct i915_priolist *p;
 
-	rb = rb_first_cached(&execlists->queue);
-	if (!rb)
-		return INT_MIN;
+		rb = rb_first_cached(&el->queue);
+		if (!rb)
+			return I915_DEADLINE_NEVER;
+
+		p = to_priolist(rb);
+		if (likely(!list_empty(&p->requests)))
+			return p->deadline;
 
-	return to_priolist(rb)->priority;
+		rb_erase_cached(&p->node, &el->queue);
+		i915_priolist_free(p);
+	} while (1);
 }
 
-static int virtual_prio(const struct intel_engine_execlists *el)
+static u64 virtual_deadline(const struct intel_engine_execlists *el)
 {
-	struct rb_node *rb = rb_first_cached(&el->virtual);
+	struct rb_node *rb;
 
-	return rb ? rb_entry(rb, struct ve_node, rb)->prio : INT_MIN;
+	rb = rb_first_cached(&el->virtual);
+	if (!rb)
+		return I915_DEADLINE_NEVER;
+
+	return rb_entry(rb, struct ve_node, rb)->deadline;
 }
 
-static inline bool need_preempt(const struct intel_engine_cs *engine,
+static inline bool need_preempt(struct intel_engine_cs *engine,
 				const struct i915_request *rq)
 {
-	int last_prio;
+	u64 last_deadline;
 
 	if (!intel_engine_has_semaphores(engine))
 		return false;
@@ -478,16 +495,14 @@ static inline bool need_preempt(const struct intel_engine_cs *engine,
 	 * priority level: the task that is running should remain running
 	 * to preserve FIFO ordering of dependencies.
 	 */
-	last_prio = max(effective_prio(rq), I915_PRIORITY_NORMAL - 1);
-	if (engine->execlists.queue_priority_hint <= last_prio)
-		return false;
+	last_deadline = effective_deadline(rq);
 
 	/*
 	 * Check against the first request in ELSP[1], it will, thanks to the
 	 * power of PI, be the highest priority of that context.
 	 */
 	if (!list_is_last(&rq->sched.link, &engine->active.requests) &&
-	    rq_prio(list_next_entry(rq, sched.link)) > last_prio)
+	    rq_deadline(list_next_entry(rq, sched.link)) < last_deadline)
 		return true;
 
 	/*
@@ -500,8 +515,8 @@ static inline bool need_preempt(const struct intel_engine_cs *engine,
 	 * ELSP[0] or ELSP[1] as, thanks again to PI, if it was the same
 	 * context, it's priority would not exceed ELSP[0] aka last_prio.
 	 */
-	return max(virtual_prio(&engine->execlists),
-		   queue_prio(&engine->execlists)) > last_prio;
+	return min(virtual_deadline(&engine->execlists),
+		   queue_deadline(&engine->execlists)) < last_deadline;
 }
 
 __maybe_unused static inline bool
@@ -518,7 +533,7 @@ assert_priority_queue(const struct i915_request *prev,
 	if (i915_request_is_active(prev))
 		return true;
 
-	return rq_prio(prev) >= rq_prio(next);
+	return rq_deadline(prev) <= rq_deadline(next);
 }
 
 /*
@@ -1088,7 +1103,7 @@ __unwind_incomplete_requests(struct intel_engine_cs *engine)
 {
 	struct i915_request *rq, *rn, *active = NULL;
 	struct list_head *uninitialized_var(pl);
-	int prio = I915_PRIORITY_INVALID;
+	u64 deadline = I915_DEADLINE_NEVER;
 
 	lockdep_assert_held(&engine->active.lock);
 
@@ -1102,10 +1117,15 @@ __unwind_incomplete_requests(struct intel_engine_cs *engine)
 
 		__i915_request_unsubmit(rq);
 
-		GEM_BUG_ON(rq_prio(rq) == I915_PRIORITY_INVALID);
-		if (rq_prio(rq) != prio) {
-			prio = rq_prio(rq);
-			pl = i915_sched_lookup_priolist(engine, prio);
+		if (i915_request_started(rq)) {
+			u64 deadline =
+				i915_scheduler_next_virtual_deadline(rq_prio(rq));
+			rq->sched.deadline = min(rq_deadline(rq), deadline);
+		}
+
+		if (rq_deadline(rq) != deadline) {
+			deadline = rq_deadline(rq);
+			pl = i915_sched_lookup_priolist(engine, deadline);
 		}
 		GEM_BUG_ON(RB_EMPTY_ROOT(&engine->execlists.queue.rb_root));
 
@@ -1377,9 +1397,12 @@ static inline void __execlists_schedule_out(struct i915_request *rq)
 	 * If we have just completed this context, the engine may now be
 	 * idle and we want to re-enter powersaving.
 	 */
-	if (list_is_last_rcu(&rq->link, &ce->timeline->requests) &&
-	    i915_request_completed(rq))
-		intel_engine_add_retire(engine, ce->timeline);
+	if (i915_request_completed(rq)) {
+		if (!list_is_last_rcu(&rq->link, &ce->timeline->requests))
+			i915_request_update_deadline(list_next_entry(rq, link));
+		else
+			intel_engine_add_retire(engine, ce->timeline);
+	}
 
 	ccid = ce->lrc.ccid;
 	ccid >>= GEN11_SW_CTX_ID_SHIFT - 32;
@@ -1493,14 +1516,14 @@ dump_port(char *buf, int buflen, const char *prefix, struct i915_request *rq)
 	if (!rq)
 		return "";
 
-	snprintf(buf, buflen, "%sccid:%x %llx:%lld%s prio %d",
+	snprintf(buf, buflen, "%sccid:%x %llx:%lld%s dl:%llu",
 		 prefix,
 		 rq->context->lrc.ccid,
 		 rq->fence.context, rq->fence.seqno,
 		 i915_request_completed(rq) ? "!" :
 		 i915_request_started(rq) ? "*" :
 		 "",
-		 rq_prio(rq));
+		 rq_deadline(rq));
 
 	return buf;
 }
@@ -1810,7 +1833,9 @@ static void virtual_xfer_breadcrumbs(struct virtual_engine *ve)
 	intel_engine_transfer_stale_breadcrumbs(ve->siblings[0], &ve->context);
 }
 
-static void defer_request(struct i915_request *rq, struct list_head * const pl)
+static void defer_request(struct i915_request *rq,
+			  struct list_head * const pl,
+			  u64 deadline)
 {
 	LIST_HEAD(list);
 
@@ -1825,6 +1850,7 @@ static void defer_request(struct i915_request *rq, struct list_head * const pl)
 		struct i915_dependency *p;
 
 		GEM_BUG_ON(i915_request_is_active(rq));
+		rq->sched.deadline = deadline;
 		list_move_tail(&rq->sched.link, pl);
 
 		for_each_waiter(p, rq) {
@@ -1847,10 +1873,9 @@ static void defer_request(struct i915_request *rq, struct list_head * const pl)
 			if (!i915_request_is_ready(w))
 				continue;
 
-			if (rq_prio(w) < rq_prio(rq))
+			if (rq_deadline(w) > deadline)
 				continue;
 
-			GEM_BUG_ON(rq_prio(w) > rq_prio(rq));
 			list_move_tail(&w->sched.link, &list);
 		}
 
@@ -1861,12 +1886,21 @@ static void defer_request(struct i915_request *rq, struct list_head * const pl)
 static void defer_active(struct intel_engine_cs *engine)
 {
 	struct i915_request *rq;
+	u64 deadline;
 
 	rq = __unwind_incomplete_requests(engine);
 	if (!rq)
 		return;
 
-	defer_request(rq, i915_sched_lookup_priolist(engine, rq_prio(rq)));
+	deadline = max(rq_deadline(rq),
+		       i915_scheduler_next_virtual_deadline(rq_prio(rq)));
+	ENGINE_TRACE(engine, "defer %llx:%lld, dl:%llu -> %llu\n",
+		     rq->fence.context, rq->fence.seqno,
+		     rq_deadline(rq), deadline);
+
+	defer_request(rq,
+		      i915_sched_lookup_priolist(engine, deadline),
+		      deadline);
 }
 
 static bool
@@ -2034,11 +2068,10 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
 			goto check_secondary;
 		} else if (need_preempt(engine, last)) {
 			ENGINE_TRACE(engine,
-				     "preempting last=%llx:%lld, prio=%d, hint=%d\n",
+				     "preempting last=%llx:%llu, dl=%llu\n",
 				     last->fence.context,
 				     last->fence.seqno,
-				     last->sched.attr.priority,
-				     execlists->queue_priority_hint);
+				     rq_deadline(last));
 			record_preemption(execlists);
 
 			/*
@@ -2060,11 +2093,11 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
 			last = NULL;
 		} else if (timeslice_expired(engine, last)) {
 			ENGINE_TRACE(engine,
-				     "expired:%s last=%llx:%lld, prio=%d, hint=%d, yield?=%s\n",
+				     "expired:%s last=%llx:%llu, deadline=%llu, now=%llu, yield?=%s\n",
 				     yesno(timer_expired(&execlists->timer)),
 				     last->fence.context, last->fence.seqno,
-				     rq_prio(last),
-				     execlists->queue_priority_hint,
+				     rq_deadline(last),
+				     i915_sched_to_ticks(ktime_get()),
 				     yesno(timeslice_yield(execlists, last)));
 
 			ring_set_paused(engine, 1);
@@ -2121,7 +2154,7 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
 		GEM_BUG_ON(rq->engine != &ve->base);
 		GEM_BUG_ON(rq->context != &ve->context);
 
-		if (unlikely(rq_prio(rq) < queue_prio(execlists))) {
+		if (unlikely(rq_deadline(rq) > queue_deadline(execlists))) {
 			spin_unlock(&ve->base.active.lock);
 			break;
 		}
@@ -2142,9 +2175,7 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
 			     i915_request_started(rq) ? "*" :
 			     "",
 			     yesno(engine != ve->siblings[0]));
-
 		WRITE_ONCE(ve->request, NULL);
-		WRITE_ONCE(ve->base.execlists.queue_priority_hint, INT_MIN);
 
 		rb = &ve->nodes[engine->id].rb;
 		rb_erase_cached(rb, &execlists->virtual);
@@ -2285,24 +2316,6 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
 	}
 done:
 	*port++ = i915_request_get(last);
-
-	/*
-	 * Here be a bit of magic! Or sleight-of-hand, whichever you prefer.
-	 *
-	 * We choose the priority hint such that if we add a request of greater
-	 * priority than this, we kick the submission tasklet to decide on
-	 * the right order of submitting the requests to hardware. We must
-	 * also be prepared to reorder requests as they are in-flight on the
-	 * HW. We derive the priority hint then as the first "hole" in
-	 * the HW submission ports and if there are no available slots,
-	 * the priority of the lowest executing request, i.e. last.
-	 *
-	 * When we do receive a higher priority request ready to run from the
-	 * user, see queue_request(), the priority hint is bumped to that
-	 * request triggering preemption on the next dequeue (or subsequent
-	 * interrupt for secondary ports).
-	 */
-	execlists->queue_priority_hint = queue_prio(execlists);
 	spin_unlock_irqrestore(&engine->active.lock, flags);
 
 	/*
@@ -2715,9 +2728,10 @@ static bool hold_request(const struct i915_request *rq)
 	return result;
 }
 
-static void __execlists_unhold(struct i915_request *rq)
+static bool __execlists_unhold(struct i915_request *rq)
 {
 	LIST_HEAD(list);
+	bool submit = false;
 
 	do {
 		struct i915_dependency *p;
@@ -2728,10 +2742,7 @@ static void __execlists_unhold(struct i915_request *rq)
 		GEM_BUG_ON(!i915_sw_fence_signaled(&rq->submit));
 
 		i915_request_clear_hold(rq);
-		list_move_tail(&rq->sched.link,
-			       i915_sched_lookup_priolist(rq->engine,
-							  rq_prio(rq)));
-		set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
+		submit |= intel_engine_queue_request(rq->engine, rq);
 
 		/* Also release any children on this engine that are ready */
 		for_each_waiter(p, rq) {
@@ -2760,6 +2771,8 @@ static void __execlists_unhold(struct i915_request *rq)
 
 		rq = list_first_entry_or_null(&list, typeof(*rq), sched.link);
 	} while (rq);
+
+	return submit;
 }
 
 static void execlists_unhold(struct intel_engine_cs *engine,
@@ -2771,12 +2784,8 @@ static void execlists_unhold(struct intel_engine_cs *engine,
 	 * Move this request back to the priority queue, and all of its
 	 * children and grandchildren that were suspended along with it.
 	 */
-	__execlists_unhold(rq);
-
-	if (rq_prio(rq) > engine->execlists.queue_priority_hint) {
-		engine->execlists.queue_priority_hint = rq_prio(rq);
+	if (__execlists_unhold(rq))
 		tasklet_hi_schedule(&engine->execlists.tasklet);
-	}
 
 	spin_unlock_irq(&engine->active.lock);
 }
@@ -3046,27 +3055,6 @@ static void execlists_preempt(struct timer_list *timer)
 	execlists_kick(timer, preempt);
 }
 
-static void queue_request(struct intel_engine_cs *engine,
-			  struct i915_request *rq)
-{
-	GEM_BUG_ON(!list_empty(&rq->sched.link));
-	list_add_tail(&rq->sched.link,
-		      i915_sched_lookup_priolist(engine, rq_prio(rq)));
-	set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
-}
-
-static bool submit_queue(struct intel_engine_cs *engine,
-			 const struct i915_request *rq)
-{
-	struct intel_engine_execlists *execlists = &engine->execlists;
-
-	if (rq_prio(rq) <= execlists->queue_priority_hint)
-		return false;
-
-	execlists->queue_priority_hint = rq_prio(rq);
-	return true;
-}
-
 static bool ancestor_on_hold(const struct intel_engine_cs *engine,
 			     const struct i915_request *rq)
 {
@@ -3087,12 +3075,7 @@ static void execlists_submit_request(struct i915_request *request)
 		list_add_tail(&request->sched.link, &engine->active.hold);
 		i915_request_set_hold(request);
 	} else {
-		queue_request(engine, request);
-
-		GEM_BUG_ON(RB_EMPTY_ROOT(&engine->execlists.queue.rb_root));
-		GEM_BUG_ON(list_empty(&request->sched.link));
-
-		if (submit_queue(engine, request))
+		if (intel_engine_queue_request(engine, request))
 			__execlists_kick(&engine->execlists);
 	}
 
@@ -4161,10 +4144,6 @@ static void execlists_reset_rewind(struct intel_engine_cs *engine, bool stalled)
 
 static void nop_submission_tasklet(unsigned long data)
 {
-	struct intel_engine_cs * const engine = (struct intel_engine_cs *)data;
-
-	/* The driver is wedged; don't process any more events. */
-	WRITE_ONCE(engine->execlists.queue_priority_hint, INT_MIN);
 }
 
 static void execlists_reset_cancel(struct intel_engine_cs *engine)
@@ -4210,6 +4189,7 @@ static void execlists_reset_cancel(struct intel_engine_cs *engine)
 		rb_erase_cached(&p->node, &execlists->queue);
 		i915_priolist_free(p);
 	}
+	GEM_BUG_ON(!RB_EMPTY_ROOT(&execlists->queue.rb_root));
 
 	/* On-hold requests will be flushed to timeline upon their release */
 	list_for_each_entry(rq, &engine->active.hold, sched.link)
@@ -4231,17 +4211,12 @@ static void execlists_reset_cancel(struct intel_engine_cs *engine)
 			rq->engine = engine;
 			__i915_request_submit(rq);
 			i915_request_put(rq);
-
-			ve->base.execlists.queue_priority_hint = INT_MIN;
 		}
 		spin_unlock(&ve->base.active.lock);
 	}
 
 	/* Remaining _unready_ requests will be nop'ed when submitted */
 
-	execlists->queue_priority_hint = INT_MIN;
-	execlists->queue = RB_ROOT_CACHED;
-
 	GEM_BUG_ON(__tasklet_is_enabled(&execlists->tasklet));
 	execlists->tasklet.func = nop_submission_tasklet;
 
@@ -5353,7 +5328,8 @@ static const struct intel_context_ops virtual_context_ops = {
 	.destroy = virtual_context_destroy,
 };
 
-static intel_engine_mask_t virtual_submission_mask(struct virtual_engine *ve)
+static intel_engine_mask_t
+virtual_submission_mask(struct virtual_engine *ve, u64 *deadline)
 {
 	struct i915_request *rq;
 	intel_engine_mask_t mask;
@@ -5370,9 +5346,11 @@ static intel_engine_mask_t virtual_submission_mask(struct virtual_engine *ve)
 		mask = ve->siblings[0]->mask;
 	}
 
-	ENGINE_TRACE(&ve->base, "rq=%llx:%lld, mask=%x, prio=%d\n",
+	*deadline = rq_deadline(rq);
+
+	ENGINE_TRACE(&ve->base, "rq=%llx:%llu, mask=%x, dl=%llu\n",
 		     rq->fence.context, rq->fence.seqno,
-		     mask, ve->base.execlists.queue_priority_hint);
+		     mask, *deadline);
 
 	return mask;
 }
@@ -5380,12 +5358,12 @@ static intel_engine_mask_t virtual_submission_mask(struct virtual_engine *ve)
 static void virtual_submission_tasklet(unsigned long data)
 {
 	struct virtual_engine * const ve = (struct virtual_engine *)data;
-	const int prio = READ_ONCE(ve->base.execlists.queue_priority_hint);
 	intel_engine_mask_t mask;
+	u64 deadline;
 	unsigned int n;
 
 	rcu_read_lock();
-	mask = virtual_submission_mask(ve);
+	mask = virtual_submission_mask(ve, &deadline);
 	rcu_read_unlock();
 	if (unlikely(!mask))
 		return;
@@ -5418,7 +5396,8 @@ static void virtual_submission_tasklet(unsigned long data)
 			 */
 			first = rb_first_cached(&sibling->execlists.virtual) ==
 				&node->rb;
-			if (prio == node->prio || (prio > node->prio && first))
+			if (deadline == node->deadline ||
+			    (deadline < node->deadline && first))
 				goto submit_engine;
 
 			rb_erase_cached(&node->rb, &sibling->execlists.virtual);
@@ -5432,7 +5411,7 @@ static void virtual_submission_tasklet(unsigned long data)
 
 			rb = *parent;
 			other = rb_entry(rb, typeof(*other), rb);
-			if (prio > other->prio) {
+			if (deadline < other->deadline) {
 				parent = &rb->rb_left;
 			} else {
 				parent = &rb->rb_right;
@@ -5447,8 +5426,8 @@ static void virtual_submission_tasklet(unsigned long data)
 
 submit_engine:
 		GEM_BUG_ON(RB_EMPTY_NODE(&node->rb));
-		node->prio = prio;
-		if (first && prio > sibling->execlists.queue_priority_hint)
+		node->deadline = deadline;
+		if (first)
 			tasklet_hi_schedule(&sibling->execlists.tasklet);
 
 unlock_engine:
@@ -5482,11 +5461,11 @@ static void virtual_submit_request(struct i915_request *rq)
 
 	if (i915_request_completed(rq)) {
 		__i915_request_submit(rq);
-
-		ve->base.execlists.queue_priority_hint = INT_MIN;
 		ve->request = NULL;
 	} else {
-		ve->base.execlists.queue_priority_hint = rq_prio(rq);
+		rq->sched.deadline =
+			min(rq->sched.deadline,
+			    i915_scheduler_next_virtual_deadline(rq_prio(rq)));
 		ve->request = i915_request_get(rq);
 
 		GEM_BUG_ON(!list_empty(virtual_queue(ve)));
@@ -5591,7 +5570,6 @@ intel_execlists_create_virtual(struct intel_engine_cs **siblings,
 	ve->base.bond_execute = virtual_bond_execute;
 
 	INIT_LIST_HEAD(virtual_queue(ve));
-	ve->base.execlists.queue_priority_hint = INT_MIN;
 	tasklet_init(&ve->base.execlists.tasklet,
 		     virtual_submission_tasklet,
 		     (unsigned long)ve);
@@ -5779,10 +5757,6 @@ void intel_execlists_show_requests(struct intel_engine_cs *engine,
 		show_request(m, last, "\t\tE ");
 	}
 
-	if (execlists->queue_priority_hint != INT_MIN)
-		drm_printf(m, "\t\tQueue priority hint: %d\n",
-			   READ_ONCE(execlists->queue_priority_hint));
-
 	last = NULL;
 	count = 0;
 	for (rb = rb_first_cached(&execlists->queue); rb; rb = rb_next(rb)) {
diff --git a/drivers/gpu/drm/i915/gt/selftest_hangcheck.c b/drivers/gpu/drm/i915/gt/selftest_hangcheck.c
index 927d54c702f4..b0eb426d26fe 100644
--- a/drivers/gpu/drm/i915/gt/selftest_hangcheck.c
+++ b/drivers/gpu/drm/i915/gt/selftest_hangcheck.c
@@ -878,7 +878,10 @@ static int __igt_reset_engines(struct intel_gt *gt,
 					break;
 				}
 
-				if (i915_request_wait(rq, 0, HZ / 5) < 0) {
+				/* With deadlines, no strict priority */
+				i915_request_set_deadline(rq, 0);
+
+				if (i915_request_wait(rq, 0, HZ / 2) < 0) {
 					struct drm_printer p =
 						drm_info_printer(gt->i915->drm.dev);
 
diff --git a/drivers/gpu/drm/i915/gt/selftest_lrc.c b/drivers/gpu/drm/i915/gt/selftest_lrc.c
index b23234ae2572..ec648b61b2cc 100644
--- a/drivers/gpu/drm/i915/gt/selftest_lrc.c
+++ b/drivers/gpu/drm/i915/gt/selftest_lrc.c
@@ -70,6 +70,9 @@ static int wait_for_submit(struct intel_engine_cs *engine,
 			   struct i915_request *rq,
 			   unsigned long timeout)
 {
+	/* Ignore our own attempts to suppress excess tasklets */
+	tasklet_hi_schedule(&engine->execlists.tasklet);
+
 	timeout += jiffies;
 	do {
 		bool done = time_after(jiffies, timeout);
@@ -892,7 +895,7 @@ semaphore_queue(struct intel_engine_cs *engine, struct i915_vma *vma, int idx)
 static int
 release_queue(struct intel_engine_cs *engine,
 	      struct i915_vma *vma,
-	      int idx, int prio)
+	      int idx, u64 deadline)
 {
 	struct i915_request *rq;
 	u32 *cs;
@@ -917,10 +920,7 @@ release_queue(struct intel_engine_cs *engine,
 	i915_request_get(rq);
 	i915_request_add(rq);
 
-	local_bh_disable();
-	i915_request_set_priority(rq, prio);
-	local_bh_enable(); /* kick tasklet */
-
+	i915_request_set_deadline(rq, deadline);
 	i915_request_put(rq);
 
 	return 0;
@@ -934,6 +934,7 @@ slice_semaphore_queue(struct intel_engine_cs *outer,
 	struct intel_engine_cs *engine;
 	struct i915_request *head;
 	enum intel_engine_id id;
+	long timeout;
 	int err, i, n = 0;
 
 	head = semaphore_queue(outer, vma, n++);
@@ -954,12 +955,16 @@ slice_semaphore_queue(struct intel_engine_cs *outer,
 		}
 	}
 
-	err = release_queue(outer, vma, n, I915_PRIORITY_BARRIER);
+	err = release_queue(outer, vma, n, 0);
 	if (err)
 		goto out;
 
-	if (i915_request_wait(head, 0,
-			      2 * outer->gt->info.num_engines * (count + 2) * (count + 3)) < 0) {
+	/* Expected number of pessimal slices required */
+	timeout = outer->gt->info.num_engines * (count + 2) * (count + 3);
+	timeout *= 4; /* safety factor, including bucketing */
+	timeout += HZ / 2; /* and include the request completion */
+
+	if (i915_request_wait(head, 0, timeout) < 0) {
 		pr_err("Failed to slice along semaphore chain of length (%d, %d)!\n",
 		       count, n);
 		GEM_TRACE_DUMP();
@@ -1064,6 +1069,8 @@ create_rewinder(struct intel_context *ce,
 		err = i915_request_await_dma_fence(rq, &wait->fence);
 		if (err)
 			goto err;
+
+		i915_request_set_deadline(rq, rq_deadline(wait));
 	}
 
 	cs = intel_ring_begin(rq, 14);
@@ -1339,7 +1346,7 @@ static int live_timeslice_queue(void *arg)
 			err = PTR_ERR(rq);
 			goto err_heartbeat;
 		}
-		i915_request_set_priority(rq, I915_PRIORITY_MAX);
+		i915_request_set_deadline(rq, 0);
 		err = wait_for_submit(engine, rq, HZ / 2);
 		if (err) {
 			pr_err("%s: Timed out trying to submit semaphores\n",
@@ -1362,10 +1369,9 @@ static int live_timeslice_queue(void *arg)
 		}
 
 		GEM_BUG_ON(i915_request_completed(rq));
-		GEM_BUG_ON(execlists_active(&engine->execlists) != rq);
 
 		/* Queue: semaphore signal, matching priority as semaphore */
-		err = release_queue(engine, vma, 1, effective_prio(rq));
+		err = release_queue(engine, vma, 1, effective_deadline(rq));
 		if (err)
 			goto err_rq;
 
@@ -1476,6 +1482,7 @@ static int live_timeslice_nopreempt(void *arg)
 			goto out_spin;
 		}
 
+		rq->sched.deadline = 0;
 		rq->sched.attr.priority = I915_PRIORITY_BARRIER;
 		i915_request_get(rq);
 		i915_request_add(rq);
@@ -1848,6 +1855,7 @@ static int live_late_preempt(void *arg)
 
 	/* Make sure ctx_lo stays before ctx_hi until we trigger preemption. */
 	ctx_lo->sched.priority = 1;
+	ctx_hi->sched.priority = I915_PRIORITY_MIN;
 
 	for_each_engine(engine, gt, id) {
 		struct igt_live_test t;
@@ -2949,6 +2957,9 @@ static int live_preempt_gang(void *arg)
 			struct i915_request *n =
 				list_next_entry(rq, client_link);
 
+			/* With deadlines, no strict priority ordering */
+			i915_request_set_deadline(rq, 0);
+
 			if (err == 0 && i915_request_wait(rq, 0, HZ / 5) < 0) {
 				struct drm_printer p =
 					drm_info_printer(engine->i915->drm.dev);
@@ -3170,7 +3181,7 @@ static int preempt_user(struct intel_engine_cs *engine,
 	i915_request_get(rq);
 	i915_request_add(rq);
 
-	i915_request_set_priority(rq, I915_PRIORITY_MAX);
+	i915_request_set_deadline(rq, 0);
 
 	if (i915_request_wait(rq, 0, HZ / 2) < 0)
 		err = -ETIME;
@@ -4706,6 +4717,7 @@ static int emit_semaphore_signal(struct intel_context *ce, void *slot)
 
 	intel_ring_advance(rq, cs);
 
+	rq->sched.deadline = 0;
 	rq->sched.attr.priority = I915_PRIORITY_BARRIER;
 	i915_request_add(rq);
 	return 0;
@@ -5215,6 +5227,10 @@ static int __live_lrc_gpr(struct intel_engine_cs *engine,
 		err = emit_semaphore_signal(engine->kernel_context, slot);
 		if (err)
 			goto err_rq;
+
+		err = wait_for_submit(engine, rq, HZ / 2);
+		if (err)
+			goto err_rq;
 	} else {
 		slot[0] = 1;
 		wmb();
@@ -5772,6 +5788,7 @@ static int poison_registers(struct intel_context *ce, u32 poison, u32 *sema)
 
 	intel_ring_advance(rq, cs);
 
+	rq->sched.deadline = 0;
 	rq->sched.attr.priority = I915_PRIORITY_BARRIER;
 err_rq:
 	i915_request_add(rq);
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 8b56cf0d970e..e31f9b2c12cc 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -333,8 +333,6 @@ static void __guc_dequeue(struct intel_engine_cs *engine)
 		i915_priolist_free(p);
 	}
 done:
-	execlists->queue_priority_hint =
-		rb ? to_priolist(rb)->priority : INT_MIN;
 	if (submit) {
 		*port = schedule_in(last, port - execlists->inflight);
 		*++port = NULL;
@@ -473,12 +471,10 @@ static void guc_reset_cancel(struct intel_engine_cs *engine)
 		rb_erase_cached(&p->node, &execlists->queue);
 		i915_priolist_free(p);
 	}
+	GEM_BUG_ON(!RB_EMPTY_ROOT(&execlists->queue.rb_root));
 
 	/* Remaining _unready_ requests will be nop'ed when submitted */
 
-	execlists->queue_priority_hint = INT_MIN;
-	execlists->queue = RB_ROOT_CACHED;
-
 	spin_unlock_irqrestore(&engine->active.lock, flags);
 }
 
diff --git a/drivers/gpu/drm/i915/i915_priolist_types.h b/drivers/gpu/drm/i915/i915_priolist_types.h
index bc2fa84f98a8..43a0ac45295f 100644
--- a/drivers/gpu/drm/i915/i915_priolist_types.h
+++ b/drivers/gpu/drm/i915/i915_priolist_types.h
@@ -22,6 +22,8 @@ enum {
 
 	/* Interactive workload, scheduled for immediate pageflipping */
 	I915_PRIORITY_DISPLAY,
+
+	__I915_PRIORITY_KERNEL__
 };
 
 /* Smallest priority value that cannot be bumped. */
@@ -35,13 +37,12 @@ enum {
  * i.e. nothing can have higher priority and force us to usurp the
  * active request.
  */
-#define I915_PRIORITY_UNPREEMPTABLE INT_MAX
-#define I915_PRIORITY_BARRIER (I915_PRIORITY_UNPREEMPTABLE - 1)
+#define I915_PRIORITY_BARRIER INT_MAX
 
 struct i915_priolist {
 	struct list_head requests;
 	struct rb_node node;
-	int priority;
+	u64 deadline;
 };
 
 #endif /* _I915_PRIOLIST_TYPES_H_ */
diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index 7c93f52ee8c1..9555885f343c 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -665,6 +665,7 @@ semaphore_notify(struct i915_sw_fence *fence, enum i915_sw_fence_notify state)
 
 	switch (state) {
 	case FENCE_COMPLETE:
+		i915_request_update_deadline(rq);
 		break;
 
 	case FENCE_FREE:
diff --git a/drivers/gpu/drm/i915/i915_scheduler.c b/drivers/gpu/drm/i915/i915_scheduler.c
index 3f261b4fee66..e14807cff226 100644
--- a/drivers/gpu/drm/i915/i915_scheduler.c
+++ b/drivers/gpu/drm/i915/i915_scheduler.c
@@ -20,6 +20,11 @@ static struct i915_global_scheduler {
 static DEFINE_SPINLOCK(ipi_lock);
 static LIST_HEAD(ipi_list);
 
+static inline u64 rq_deadline(const struct i915_request *rq)
+{
+	return READ_ONCE(rq->sched.deadline);
+}
+
 static inline int rq_prio(const struct i915_request *rq)
 {
 	return READ_ONCE(rq->sched.attr.priority);
@@ -32,6 +37,7 @@ static void ipi_schedule(struct irq_work *wrk)
 		struct i915_dependency *p;
 		struct i915_request *rq;
 		unsigned long flags;
+		u64 deadline;
 		int prio;
 
 		spin_lock_irqsave(&ipi_lock, flags);
@@ -40,7 +46,10 @@ static void ipi_schedule(struct irq_work *wrk)
 			rq = container_of(p->signaler, typeof(*rq), sched);
 			list_del_init(&p->ipi_link);
 
+			deadline = p->ipi_deadline;
 			prio = p->ipi_priority;
+
+			p->ipi_deadline = I915_DEADLINE_NEVER;
 			p->ipi_priority = I915_PRIORITY_INVALID;
 		}
 		spin_unlock_irqrestore(&ipi_lock, flags);
@@ -51,6 +60,7 @@ static void ipi_schedule(struct irq_work *wrk)
 			continue;
 
 		i915_request_set_priority(rq, prio);
+		i915_request_set_deadline(rq, deadline);
 	} while (1);
 	rcu_read_unlock();
 }
@@ -98,28 +108,8 @@ static inline struct i915_priolist *to_priolist(struct rb_node *rb)
 	return rb_entry(rb, struct i915_priolist, node);
 }
 
-static void assert_priolists(struct intel_engine_execlists * const execlists)
-{
-	struct rb_node *rb;
-	long last_prio;
-
-	if (!IS_ENABLED(CONFIG_DRM_I915_DEBUG_GEM))
-		return;
-
-	GEM_BUG_ON(rb_first_cached(&execlists->queue) !=
-		   rb_first(&execlists->queue.rb_root));
-
-	last_prio = INT_MAX;
-	for (rb = rb_first_cached(&execlists->queue); rb; rb = rb_next(rb)) {
-		const struct i915_priolist *p = to_priolist(rb);
-
-		GEM_BUG_ON(p->priority > last_prio);
-		last_prio = p->priority;
-	}
-}
-
 struct list_head *
-i915_sched_lookup_priolist(struct intel_engine_cs *engine, int prio)
+i915_sched_lookup_priolist(struct intel_engine_cs *engine, u64 deadline)
 {
 	struct intel_engine_execlists * const execlists = &engine->execlists;
 	struct i915_priolist *p;
@@ -127,10 +117,9 @@ i915_sched_lookup_priolist(struct intel_engine_cs *engine, int prio)
 	bool first = true;
 
 	lockdep_assert_held(&engine->active.lock);
-	assert_priolists(execlists);
 
 	if (unlikely(execlists->no_priolist))
-		prio = I915_PRIORITY_NORMAL;
+		deadline = 0;
 
 find_priolist:
 	/* most positive priority is scheduled first, equal priorities fifo */
@@ -139,9 +128,9 @@ i915_sched_lookup_priolist(struct intel_engine_cs *engine, int prio)
 	while (*parent) {
 		rb = *parent;
 		p = to_priolist(rb);
-		if (prio > p->priority) {
+		if (deadline < p->deadline) {
 			parent = &rb->rb_left;
-		} else if (prio < p->priority) {
+		} else if (deadline > p->deadline) {
 			parent = &rb->rb_right;
 			first = false;
 		} else {
@@ -149,13 +138,13 @@ i915_sched_lookup_priolist(struct intel_engine_cs *engine, int prio)
 		}
 	}
 
-	if (prio == I915_PRIORITY_NORMAL) {
+	if (!deadline) {
 		p = &execlists->default_priolist;
 	} else {
 		p = kmem_cache_alloc(global.slab_priorities, GFP_ATOMIC);
 		/* Convert an allocation failure to a priority bump */
 		if (unlikely(!p)) {
-			prio = I915_PRIORITY_NORMAL; /* recurses just once */
+			deadline = 0; /* recurses just once */
 
 			/* To maintain ordering with all rendering, after an
 			 * allocation failure we have to disable all scheduling.
@@ -170,7 +159,7 @@ i915_sched_lookup_priolist(struct intel_engine_cs *engine, int prio)
 		}
 	}
 
-	p->priority = prio;
+	p->deadline = deadline;
 	INIT_LIST_HEAD(&p->requests);
 
 	rb_link_node(&p->node, rb, parent);
@@ -179,70 +168,231 @@ i915_sched_lookup_priolist(struct intel_engine_cs *engine, int prio)
 	return &p->requests;
 }
 
-void __i915_priolist_free(struct i915_priolist *p)
+void i915_priolist_free(struct i915_priolist *p)
 {
-	kmem_cache_free(global.slab_priorities, p);
+	if (p->deadline)
+		kmem_cache_free(global.slab_priorities, p);
 }
 
-static inline bool need_preempt(int prio, int active)
+static bool kick_submission(const struct intel_engine_cs *engine, u64 deadline)
 {
-	/*
-	 * Allow preemption of low -> normal -> high, but we do
-	 * not allow low priority tasks to preempt other low priority
-	 * tasks under the impression that latency for low priority
-	 * tasks does not matter (as much as background throughput),
-	 * so kiss.
-	 */
-	return prio >= max(I915_PRIORITY_NORMAL, active);
+	const struct intel_engine_execlists *el = &engine->execlists;
+	const struct i915_request *inflight;
+	bool kick = true;
+
+	if (to_priolist(rb_first_cached(&el->queue))->deadline < deadline)
+		return false;
+
+	rcu_read_lock();
+	inflight = execlists_active(el);
+	if (inflight)
+		kick = deadline < rq_deadline(inflight);
+	rcu_read_unlock();
+
+	return kick;
 }
 
-static void kick_submission(struct intel_engine_cs *engine,
-			    const struct i915_request *rq,
-			    int prio)
+static bool __i915_request_set_deadline(struct i915_request *rq, u64 deadline)
 {
-	const struct i915_request *inflight;
+	struct intel_engine_cs *engine = rq->engine;
+	struct i915_request *rn;
+	struct list_head *plist;
+	LIST_HEAD(dfs);
 
-	/*
-	 * We only need to kick the tasklet once for the high priority
-	 * new context we add into the queue.
-	 */
-	if (prio <= engine->execlists.queue_priority_hint)
-		return;
+	lockdep_assert_held(&engine->active.lock);
+	list_add(&rq->sched.dfs, &dfs);
 
-	rcu_read_lock();
+	list_for_each_entry(rq, &dfs, sched.dfs) {
+		struct i915_dependency *p;
+
+		GEM_BUG_ON(rq->engine != engine);
+
+		for_each_signaler(p, rq) {
+			struct i915_request *s =
+				container_of(p->signaler, typeof(*s), sched);
+
+			GEM_BUG_ON(s == rq);
+
+			if (rq_deadline(s) <= deadline)
+				continue;
+
+			if (i915_request_completed(s))
+				continue;
+
+			if (s->engine != rq->engine) {
+				spin_lock(&ipi_lock);
+				if (deadline < p->ipi_deadline) {
+					p->ipi_deadline = deadline;
+					list_move(&p->ipi_link, &ipi_list);
+					irq_work_queue(&ipi_work);
+				}
+				spin_unlock(&ipi_lock);
+				continue;
+			}
 
-	/* Nothing currently active? We're overdue for a submission! */
-	inflight = execlists_active(&engine->execlists);
-	if (!inflight)
+			list_move_tail(&s->sched.dfs, &dfs);
+		}
+	}
+
+	plist = i915_sched_lookup_priolist(engine, deadline);
+
+	/* Fifo and depth-first replacement ensure our deps execute first */
+	list_for_each_entry_safe_reverse(rq, rn, &dfs, sched.dfs) {
+		GEM_BUG_ON(rq->engine != engine);
+		GEM_BUG_ON(deadline > rq_deadline(rq));
+
+		INIT_LIST_HEAD(&rq->sched.dfs);
+		WRITE_ONCE(rq->sched.deadline, deadline);
+		RQ_TRACE(rq, "set-deadline:%llu\n", deadline);
+
+		/*
+		 * Once the request is ready, it will be placed into the
+		 * priority lists and then onto the HW runlist. Before the
+		 * request is ready, it does not contribute to our preemption
+		 * decisions and we can safely ignore it, as it will, and
+		 * any preemption required, be dealt with upon submission.
+		 * See engine->submit_request()
+		 */
+
+		if (i915_request_in_priority_queue(rq))
+			list_move_tail(&rq->sched.link, plist);
+	}
+
+	return kick_submission(engine, deadline);
+}
+
+void i915_request_set_deadline(struct i915_request *rq, u64 deadline)
+{
+	struct intel_engine_cs *engine;
+	unsigned long flags;
+
+	if (deadline >= rq_deadline(rq))
+		return;
+
+	engine = lock_engine_irqsave(rq, flags);
+	if (!intel_engine_has_scheduler(engine))
 		goto unlock;
 
-	/*
-	 * If we are already the currently executing context, don't
-	 * bother evaluating if we should preempt ourselves.
-	 */
-	if (inflight->context == rq->context)
+	if (i915_request_completed(rq))
 		goto unlock;
 
-	ENGINE_TRACE(engine,
-		     "bumping queue-priority-hint:%d for rq:%llx:%lld, inflight:%llx:%lld prio %d\n",
-		     prio,
-		     rq->fence.context, rq->fence.seqno,
-		     inflight->fence.context, inflight->fence.seqno,
-		     inflight->sched.attr.priority);
+	if (deadline >= rq_deadline(rq))
+		goto unlock;
 
-	engine->execlists.queue_priority_hint = prio;
-	if (need_preempt(prio, rq_prio(inflight)))
+	if (__i915_request_set_deadline(rq, deadline))
 		tasklet_hi_schedule(&engine->execlists.tasklet);
 
 unlock:
+	spin_unlock_irqrestore(&engine->active.lock, flags);
+}
+
+static u64 prio_slice(int prio)
+{
+	u64 slice;
+	int sf;
+
+	/*
+	 * This is the central heuristic to the virtual deadlines. By
+	 * imposing that each task takes an equal amount of time, we
+	 * let each client have an equal slice of the GPU time. By
+	 * bringing the virtual deadline forward, that client will then
+	 * have more GPU time, and vice versa a lower priority client will
+	 * have a later deadline and receive less GPU time.
+	 *
+	 * In BFS/MuQSS, the prio_ratios[] are based on the task nice range of
+	 * [-20, 20], with each lower priority having a ~10% longer deadline,
+	 * with the note that the proportion of CPU time between two clients
+	 * of different priority will be the square of the relative prio_slice.
+	 *
+	 * In contrast, this prio_slice() curve was chosen because it gave good
+	 * results with igt/gem_exec_schedule. It may not be the best choice!
+	 *
+	 * With a 1ms scheduling quantum:
+	 *
+	 *   MAX USER:  ~32us deadline
+	 *   0:         ~16ms deadline
+	 *   MIN_USER: 1000ms deadline
+	 */
+
+	if (prio >= __I915_PRIORITY_KERNEL__)
+		return INT_MAX - prio;
+
+	slice = __I915_PRIORITY_KERNEL__ - prio;
+	if (prio >= 0)
+		sf = 20 - 6;
+	else
+		sf = 20 - 1;
+
+	return slice << sf;
+}
+
+u64 i915_scheduler_virtual_deadline(u64 kt, int priority)
+{
+	return i915_sched_to_ticks(kt + prio_slice(priority));
+}
+
+u64 i915_scheduler_next_virtual_deadline(int priority)
+{
+	return i915_scheduler_virtual_deadline(ktime_get(), priority);
+}
+
+static u64 signal_deadline(const struct i915_request *rq)
+{
+	u64 last = ktime_to_ns(ktime_get());
+	const struct i915_dependency *p;
+
+	/*
+	 * Find the earliest point at which we will become 'ready',
+	 * which we infer from the deadline of all active signalers.
+	 * We will position ourselves at the end of that chain of work.
+	 */
+
+	rcu_read_lock();
+	for_each_signaler(p, rq) {
+		const struct i915_request *s =
+			container_of(p->signaler, typeof(*s), sched);
+		u64 deadline;
+
+		if (i915_request_completed(s))
+			continue;
+
+		if (rq_prio(s) < rq_prio(rq))
+			continue;
+
+		deadline = i915_sched_to_ns(rq_deadline(s));
+		if (p->flags & I915_DEPENDENCY_WEAK)
+			deadline -= prio_slice(rq_prio(s));
+
+		last = max(last, deadline);
+	}
 	rcu_read_unlock();
+
+	return last;
+}
+
+static u64 earliest_deadline(const struct i915_request *rq)
+{
+	return i915_scheduler_virtual_deadline(signal_deadline(rq),
+					       rq_prio(rq));
+}
+
+static bool set_earliest_deadline(struct i915_request *rq, u64 old)
+{
+	u64 dl;
+
+	/* Recompute our deadlines and promote after a priority change */
+	dl = min(earliest_deadline(rq), rq_deadline(rq));
+	if (dl >= old)
+		return false;
+
+	return __i915_request_set_deadline(rq, dl);
 }
 
-static void __i915_request_set_priority(struct i915_request *rq, int prio)
+static bool __i915_request_set_priority(struct i915_request *rq, int prio)
 {
 	struct intel_engine_cs *engine = rq->engine;
 	struct i915_request *rn;
-	struct list_head *plist;
+	bool kick = false;
 	LIST_HEAD(dfs);
 
 	lockdep_assert_held(&engine->active.lock);
@@ -299,32 +449,20 @@ static void __i915_request_set_priority(struct i915_request *rq, int prio)
 		}
 	}
 
-	plist = i915_sched_lookup_priolist(engine, prio);
-
-	/* Fifo and depth-first replacement ensure our deps execute first */
 	list_for_each_entry_safe_reverse(rq, rn, &dfs, sched.dfs) {
 		GEM_BUG_ON(rq->engine != engine);
+		GEM_BUG_ON(prio < rq_prio(rq));
 
 		INIT_LIST_HEAD(&rq->sched.dfs);
 		WRITE_ONCE(rq->sched.attr.priority, prio);
+		RQ_TRACE(rq, "set-priority:%d\n", prio);
 
-		/*
-		 * Once the request is ready, it will be placed into the
-		 * priority lists and then onto the HW runlist. Before the
-		 * request is ready, it does not contribute to our preemption
-		 * decisions and we can safely ignore it, as it will, and
-		 * any preemption required, be dealt with upon submission.
-		 * See engine->submit_request()
-		 */
-		if (!i915_request_is_ready(rq))
-			continue;
-
-		if (i915_request_in_priority_queue(rq))
-			list_move_tail(&rq->sched.link, plist);
-
-		/* Defer (tasklet) submission until after all updates. */
-		kick_submission(engine, rq, prio);
+		if (i915_request_is_ready(rq) &&
+		    set_earliest_deadline(rq, rq_deadline(rq)))
+			kick = true;
 	}
+
+	return kick;
 }
 
 void i915_request_set_priority(struct i915_request *rq, int prio)
@@ -376,7 +514,38 @@ void i915_request_set_priority(struct i915_request *rq, int prio)
 	if (prio <= rq_prio(rq))
 		goto unlock;
 
-	__i915_request_set_priority(rq, prio);
+	if (__i915_request_set_priority(rq, prio))
+		tasklet_hi_schedule(&engine->execlists.tasklet);
+
+unlock:
+	spin_unlock_irqrestore(&engine->active.lock, flags);
+}
+
+bool intel_engine_queue_request(struct intel_engine_cs *engine,
+				struct i915_request *rq)
+{
+	lockdep_assert_held(&engine->active.lock);
+	set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
+	return set_earliest_deadline(rq, I915_DEADLINE_NEVER);
+}
+
+void i915_request_update_deadline(struct i915_request *rq)
+{
+	struct intel_engine_cs *engine;
+	unsigned long flags;
+
+	if (!i915_request_is_ready(rq))
+		return;
+
+	engine = lock_engine_irqsave(rq, flags);
+	if (!intel_engine_has_scheduler(engine))
+		goto unlock;
+
+	if (i915_request_completed(rq))
+		goto unlock;
+
+	if (set_earliest_deadline(rq, rq_deadline(rq)))
+		tasklet_hi_schedule(&engine->execlists.tasklet);
 
 unlock:
 	spin_unlock_irqrestore(&engine->active.lock, flags);
@@ -397,6 +566,7 @@ void i915_sched_node_init(struct i915_sched_node *node)
 void i915_sched_node_reinit(struct i915_sched_node *node)
 {
 	node->attr.priority = I915_PRIORITY_INVALID;
+	node->deadline = I915_DEADLINE_NEVER;
 	node->semaphores = 0;
 	node->flags = 0;
 
@@ -429,6 +599,7 @@ bool __i915_sched_node_add_dependency(struct i915_sched_node *node,
 
 	if (!node_signaled(signal)) {
 		INIT_LIST_HEAD(&dep->ipi_link);
+		dep->ipi_deadline = I915_DEADLINE_NEVER;
 		dep->ipi_priority = I915_PRIORITY_INVALID;
 		dep->signaler = signal;
 		dep->waiter = node;
@@ -519,6 +690,10 @@ void i915_sched_node_retire(struct i915_sched_node *node)
 	spin_unlock_irq(&node->lock);
 }
 
+#if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
+#include "selftests/i915_scheduler.c"
+#endif
+
 static void i915_global_scheduler_shrink(void)
 {
 	kmem_cache_shrink(global.slab_dependencies);
diff --git a/drivers/gpu/drm/i915/i915_scheduler.h b/drivers/gpu/drm/i915/i915_scheduler.h
index 53ac819cc786..89875ea3fb20 100644
--- a/drivers/gpu/drm/i915/i915_scheduler.h
+++ b/drivers/gpu/drm/i915/i915_scheduler.h
@@ -34,15 +34,29 @@ int i915_sched_node_add_dependency(struct i915_sched_node *node,
 void i915_sched_node_retire(struct i915_sched_node *node);
 
 void i915_request_set_priority(struct i915_request *request, int prio);
+void i915_request_set_deadline(struct i915_request *request, u64 deadline);
+
+void i915_request_update_deadline(struct i915_request *request);
+
+u64 i915_scheduler_virtual_deadline(u64 kt, int priority);
+u64 i915_scheduler_next_virtual_deadline(int priority);
+
+bool intel_engine_queue_request(struct intel_engine_cs *engine,
+				struct i915_request *rq);
 
 struct list_head *
-i915_sched_lookup_priolist(struct intel_engine_cs *engine, int prio);
+i915_sched_lookup_priolist(struct intel_engine_cs *engine, u64 deadline);
+
+void i915_priolist_free(struct i915_priolist *p);
+
+static inline u64 i915_sched_to_ticks(ktime_t kt)
+{
+	return ktime_to_ns(kt) >> I915_SCHED_DEADLINE_SHIFT;
+}
 
-void __i915_priolist_free(struct i915_priolist *p);
-static inline void i915_priolist_free(struct i915_priolist *p)
+static inline u64 i915_sched_to_ns(u64 deadline)
 {
-	if (p->priority != I915_PRIORITY_NORMAL)
-		__i915_priolist_free(p);
+	return deadline << I915_SCHED_DEADLINE_SHIFT;
 }
 
 #endif /* _I915_SCHEDULER_H_ */
diff --git a/drivers/gpu/drm/i915/i915_scheduler_types.h b/drivers/gpu/drm/i915/i915_scheduler_types.h
index ce60577df2bf..ae7ca78a88c8 100644
--- a/drivers/gpu/drm/i915/i915_scheduler_types.h
+++ b/drivers/gpu/drm/i915/i915_scheduler_types.h
@@ -69,6 +69,22 @@ struct i915_sched_node {
 	unsigned int flags;
 #define I915_SCHED_HAS_EXTERNAL_CHAIN	BIT(0)
 	intel_engine_mask_t semaphores;
+
+	/**
+	 * @deadline: [virtual] deadline
+	 *
+	 * When the request is ready for execution, it is given a quota
+	 * (the engine's timeslice) and a virtual deadline. The virtual
+	 * deadline is derived from the current time:
+	 *     ktime_get() + (prio_ratio * timeslice)
+	 *
+	 * Requests are then executed in order of deadline completion.
+	 * Requests with earlier deadlines than currently executing on
+	 * the engine will preempt the active requests.
+	 */
+	u64 deadline;
+#define I915_SCHED_DEADLINE_SHIFT 19 /* i.e. roughly 500us buckets */
+#define I915_DEADLINE_NEVER U64_MAX
 };
 
 struct i915_dependency {
@@ -81,6 +97,7 @@ struct i915_dependency {
 #define I915_DEPENDENCY_ALLOC		BIT(0)
 #define I915_DEPENDENCY_EXTERNAL	BIT(1)
 #define I915_DEPENDENCY_WEAK		BIT(2)
+	u64 ipi_deadline;
 	int ipi_priority;
 };
 
diff --git a/drivers/gpu/drm/i915/selftests/i915_mock_selftests.h b/drivers/gpu/drm/i915/selftests/i915_mock_selftests.h
index cb6f94633356..3782cdd281cc 100644
--- a/drivers/gpu/drm/i915/selftests/i915_mock_selftests.h
+++ b/drivers/gpu/drm/i915/selftests/i915_mock_selftests.h
@@ -25,6 +25,7 @@ selftest(ring, intel_ring_mock_selftests)
 selftest(engine, intel_engine_cs_mock_selftests)
 selftest(timelines, intel_timeline_mock_selftests)
 selftest(requests, i915_request_mock_selftests)
+selftest(scheduler, i915_scheduler_mock_selftests)
 selftest(objects, i915_gem_object_mock_selftests)
 selftest(acquire, i915_acquire_mock_selftests)
 selftest(phys, i915_gem_phys_mock_selftests)
diff --git a/drivers/gpu/drm/i915/selftests/i915_request.c b/drivers/gpu/drm/i915/selftests/i915_request.c
index 236f9fda8f31..ff21d8de7689 100644
--- a/drivers/gpu/drm/i915/selftests/i915_request.c
+++ b/drivers/gpu/drm/i915/selftests/i915_request.c
@@ -2123,6 +2123,7 @@ static int measure_preemption(struct intel_context *ce)
 
 		intel_ring_advance(rq, cs);
 		rq->sched.attr.priority = I915_PRIORITY_BARRIER;
+		rq->sched.deadline = 0;
 
 		elapsed[i - 1] = ENGINE_READ_FW(ce->engine, RING_TIMESTAMP);
 		i915_request_add(rq);
diff --git a/drivers/gpu/drm/i915/selftests/i915_scheduler.c b/drivers/gpu/drm/i915/selftests/i915_scheduler.c
new file mode 100644
index 000000000000..9ca50db81034
--- /dev/null
+++ b/drivers/gpu/drm/i915/selftests/i915_scheduler.c
@@ -0,0 +1,49 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2020 Intel Corporation
+ */
+
+#include "i915_selftest.h"
+
+static int mock_scheduler_slices(void *dummy)
+{
+	u64 min, max, normal, kernel;
+
+	min = prio_slice(I915_PRIORITY_MIN);
+	pr_info("%8s slice: %lluus\n", "min", min >> 10);
+
+	normal = prio_slice(0);
+	pr_info("%8s slice: %lluus\n", "normal", normal >> 10);
+
+	max = prio_slice(I915_PRIORITY_MAX);
+	pr_info("%8s slice: %lluus\n", "max", max >> 10);
+
+	kernel = prio_slice(I915_PRIORITY_BARRIER);
+	pr_info("%8s slice: %lluus\n", "kernel", kernel >> 10);
+
+	if (kernel != 0) {
+		pr_err("kernel prio slice should be 0\n");
+		return -EINVAL;
+	}
+
+	if (max >= normal) {
+		pr_err("maximum prio slice should be shorter than normal\n");
+		return -EINVAL;
+	}
+
+	if (min <= normal) {
+		pr_err("minimum prio slice should be longer than normal\n");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+int i915_scheduler_mock_selftests(void)
+{
+	static const struct i915_subtest tests[] = {
+		SUBTEST(mock_scheduler_slices),
+	};
+
+	return i915_subtests(tests, NULL);
+}
-- 
2.20.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for series starting with [01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait (rev2)
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (67 preceding siblings ...)
  2020-07-15 14:20 ` [Intel-gfx] ✗ Fi.CI.BAT: failure " Patchwork
@ 2020-07-15 15:41 ` Patchwork
  2020-07-15 15:42 ` [Intel-gfx] ✗ Fi.CI.SPARSE: " Patchwork
                   ` (3 subsequent siblings)
  72 siblings, 0 replies; 156+ messages in thread
From: Patchwork @ 2020-07-15 15:41 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: series starting with [01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait (rev2)
URL   : https://patchwork.freedesktop.org/series/79517/
State : warning

== Summary ==

$ dim checkpatch origin/drm-tip
130497fc98af drm/i915: Reduce i915_request.lock contention for i915_request_wait
78b2b4227c1a drm/i915: Remove i915_request.lock requirement for execution callbacks
c0b29c38984e drm/i915: Remove requirement for holding i915_request.lock for breadcrumbs
96f54f935986 drm/i915: Add a couple of missing i915_active_fini()
e0279073e5d3 drm/i915: Skip taking acquire mutex for no ref->active callback
74f1c669d951 drm/i915: Export a preallocate variant of i915_active_acquire()
db2f42ce7206 drm/i915: Keep the most recently used active-fence upon discard
3bfed8a701e1 drm/i915: Make the stale cached active node available for any timeline
8d7b14253127 drm/i915: Provide a fastpath for waiting on vma bindings
638a715d2c30 drm/i915: Soften the tasklet flush frequency before waits
5250859a3025 drm/i915: Preallocate stashes for vma page-directories
2747217ec2ad drm/i915: Switch to object allocations for page directories
6bdab21d6597 drm/i915/gem: Don't drop the timeline lock during execbuf
f3c8a75f8f0d drm/i915/gem: Rename execbuf.bind_link to unbound_link
1608dda568ca drm/i915/gem: Break apart the early i915_vma_pin from execbuf object lookup
63d098780e45 drm/i915/gem: Remove the call for no-evict i915_vma_pin
6ac79cc3b262 drm/i915: Add list_for_each_entry_safe_continue_reverse
-:21: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'pos' - possible side-effects?
#21: FILE: drivers/gpu/drm/i915/i915_utils.h:269:
+#define list_for_each_entry_safe_continue_reverse(pos, n, head, member)	\
+	for (pos = list_prev_entry(pos, member),			\
+	     n = list_prev_entry(pos, member);				\
+	     &pos->member != (head);					\
+	     pos = n, n = list_prev_entry(n, member))

-:21: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'n' - possible side-effects?
#21: FILE: drivers/gpu/drm/i915/i915_utils.h:269:
+#define list_for_each_entry_safe_continue_reverse(pos, n, head, member)	\
+	for (pos = list_prev_entry(pos, member),			\
+	     n = list_prev_entry(pos, member);				\
+	     &pos->member != (head);					\
+	     pos = n, n = list_prev_entry(n, member))

-:21: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'member' - possible side-effects?
#21: FILE: drivers/gpu/drm/i915/i915_utils.h:269:
+#define list_for_each_entry_safe_continue_reverse(pos, n, head, member)	\
+	for (pos = list_prev_entry(pos, member),			\
+	     n = list_prev_entry(pos, member);				\
+	     &pos->member != (head);					\
+	     pos = n, n = list_prev_entry(n, member))

total: 0 errors, 0 warnings, 3 checks, 12 lines checked
69b95b2f1b58 drm/i915: Always defer fenced work to the worker
339dd5f35caf drm/i915/gem: Assign context id for async work
dbe18ef2f769 drm/i915/gem: Separate the ww_mutex walker into its own list
d6442826e675 drm/i915/gem: Asynchronous GTT unbinding
2ee5a36ee1d9 drm/i915/gem: Bind the fence async for execbuf
2d3e21b38717 drm/i915/gem: Include cmdparser in common execbuf pinning
c180ca0649ab drm/i915/gem: Include secure batch in common execbuf pinning
9040f3b37f3b drm/i915/gem: Reintroduce multiple passes for reloc processing
-:1512: WARNING:MEMORY_BARRIER: memory barrier without comment
#1512: FILE: drivers/gpu/drm/i915/gem/selftests/i915_gem_execbuffer.c:161:
+		wmb();

total: 0 errors, 1 warnings, 0 checks, 1502 lines checked
d3fe2f8baa11 drm/i915: Add an implementation for i915_gem_ww_ctx locking, v2.
-:59: WARNING:FILE_PATH_CHANGES: added, moved or deleted file(s), does MAINTAINERS need updating?
#59: 
new file mode 100644

-:354: WARNING:LINE_SPACING: Missing a blank line after declarations
#354: FILE: drivers/gpu/drm/i915/mm/st_acquire_ctx.c:106:
+	const unsigned int total = ARRAY_SIZE(dl->obj);
+	I915_RND_STATE(prng);

-:450: WARNING:YIELD: Using yield() is generally wrong. See yield() kernel-doc (sched/core.c)
#450: FILE: drivers/gpu/drm/i915/mm/st_acquire_ctx.c:202:
+	yield(); /* start all threads before we begin */

total: 0 errors, 3 warnings, 0 checks, 446 lines checked
90ec819f7195 drm/i915/gem: Pull execbuf dma resv under a single critical section
69013a1a8c42 drm/i915/gem: Replace i915_gem_object.mm.mutex with reservation_ww_class
6cb053b52a1e drm/i915: Hold wakeref for the duration of the vma GGTT binding
22e6c1fa7455 drm/i915: Specialise GGTT binding
f89d50b33136 drm/i915/gt: Acquire backing storage for the context
cd9d5d1c7e79 drm/i915/gt: Push the wait for the context to bound to the request
-:198: WARNING:FILE_PATH_CHANGES: added, moved or deleted file(s), does MAINTAINERS need updating?
#198: 
new file mode 100644

total: 0 errors, 1 warnings, 0 checks, 237 lines checked
35a1391f3c4b drm/i915: Remove unused i915_gem_evict_vm()
8cdadd182cff drm/i915/gt: Decouple completed requests on unwind
c4ccca3c8899 drm/i915/gt: Check for a completed last request once
364ad467bb9c drm/i915/gt: Replace direct submit with direct call to tasklet
a97efde96c6f drm/i915/gt: Free stale request on destroying the virtual engine
28a117fb0f10 drm/i915/gt: Use virtual_engine during execlists_dequeue
655eecd42876 drm/i915/gt: Decouple inflight virtual engines
0782f33abf51 drm/i915/gt: Defer schedule_out until after the next dequeue
9a3abbadb311 drm/i915/gt: Resubmit the virtual engine on schedule-out
c7b81ba5771a drm/i915/gt: Simplify virtual engine handling for execlists_hold()
9bc1aca39d8f drm/i915/gt: ce->inflight updates are now serialised
f0a8817be8e7 drm/i915/gt: Drop atomic for engine->fw_active tracking
f21e9ee6ef05 drm/i915/gt: Extract busy-stats for ring-scheduler
-:12: WARNING:FILE_PATH_CHANGES: added, moved or deleted file(s), does MAINTAINERS need updating?
#12: 
new file mode 100644

total: 0 errors, 1 warnings, 0 checks, 95 lines checked
d66ffb59c681 drm/i915/gt: Convert stats.active to plain unsigned int
0618152c0ed3 drm/i915: Lift waiter/signaler iterators
000a31affd74 drm/i915: Strip out internal priorities
bce06e971029 drm/i915: Remove I915_USER_PRIORITY_SHIFT
2383c96eac81 drm/i915: Replace engine->schedule() with a known request operation
56ddf5a9a91a drm/i915/gt: Do not suspend bonded requests if one hangs
6423ee0d3c8a drm/i915: Teach the i915_dependency to use a double-lock
53f0f5814c60 drm/i915: Restructure priority inheritance
fffa992d3a28 drm/i915/gt: Remove timeslice suppression
55f8528025f2 drm/i915: Fair low-latency scheduling
-:1568: WARNING:FILE_PATH_CHANGES: added, moved or deleted file(s), does MAINTAINERS need updating?
#1568: 
new file mode 100644

total: 0 errors, 1 warnings, 0 checks, 1392 lines checked
c1be9968e1e1 drm/i915/gt: Specify a deadline for the heartbeat
ba8e433eb1ad drm/i915: Replace the priority boosting for the display with a deadline
dea960c9f9bc drm/i915: Move saturated workload detection to the GT
-:22: WARNING:COMMIT_LOG_LONG_LINE: Possible unwrapped commit description (prefer a maximum 75 chars per line)
#22: 
References: 44d89409a12e ("drm/i915: Make the semaphore saturation mask global")

-:22: ERROR:GIT_COMMIT_ID: Please use git commit description style 'commit <12+ chars of sha1> ("<title line>")' - ie: 'commit 44d89409a12e ("drm/i915: Make the semaphore saturation mask global")'
#22: 
References: 44d89409a12e ("drm/i915: Make the semaphore saturation mask global")

total: 1 errors, 1 warnings, 0 checks, 82 lines checked
b8ed262e9007 Restore "drm/i915: drop engine_pin/unpin_breadcrumbs_irq"
84f48ea3e305 drm/i915/gt: Couple tasklet scheduling for all CS interrupts
0c68e10ca35a drm/i915/gt: Support creation of 'internal' rings
e7bfa5f7c2ca drm/i915/gt: Use client timeline address for seqno writes
3e60e3380101 drm/i915/gt: Infrastructure for ring scheduling
-:79: WARNING:FILE_PATH_CHANGES: added, moved or deleted file(s), does MAINTAINERS need updating?
#79: 
new file mode 100644

total: 0 errors, 1 warnings, 0 checks, 844 lines checked
eaa788663d0a drm/i915/gt: Implement ring scheduler for gen6/7
-:68: CHECK:OPEN_ENDED_LINE: Lines should not end with a '('
#68: FILE: drivers/gpu/drm/i915/gt/intel_ring_scheduler.c:174:
+				*cs++ = i915_mmio_reg_offset(

-:70: CHECK:OPEN_ENDED_LINE: Lines should not end with a '('
#70: FILE: drivers/gpu/drm/i915/gt/intel_ring_scheduler.c:176:
+				*cs++ = _MASKED_BIT_ENABLE(

-:105: CHECK:OPEN_ENDED_LINE: Lines should not end with a '('
#105: FILE: drivers/gpu/drm/i915/gt/intel_ring_scheduler.c:211:
+				*cs++ = _MASKED_BIT_DISABLE(

total: 0 errors, 0 warnings, 3 checks, 540 lines checked
e0e91286252d drm/i915/gt: Enable ring scheduling for gen6/7
85f3b20e2a32 drm/i915/gem: Remove timeline nesting from snb relocs



^ permalink raw reply	[flat|nested] 156+ messages in thread

* [Intel-gfx] ✗ Fi.CI.SPARSE: warning for series starting with [01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait (rev2)
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (68 preceding siblings ...)
  2020-07-15 15:41 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for series starting with [01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait (rev2) Patchwork
@ 2020-07-15 15:42 ` Patchwork
  2020-07-15 16:03 ` [Intel-gfx] ✓ Fi.CI.BAT: success " Patchwork
                   ` (2 subsequent siblings)
  72 siblings, 0 replies; 156+ messages in thread
From: Patchwork @ 2020-07-15 15:42 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: series starting with [01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait (rev2)
URL   : https://patchwork.freedesktop.org/series/79517/
State : warning

== Summary ==

$ dim sparse --fast origin/drm-tip
Sparse version: v0.6.0
Fast mode used, each commit won't be checked separately.



^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [Intel-gfx] [PATCH 28/66] drm/i915/gem: Replace i915_gem_object.mm.mutex with reservation_ww_class
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 28/66] drm/i915/gem: Replace i915_gem_object.mm.mutex with reservation_ww_class Chris Wilson
@ 2020-07-15 15:43   ` Maarten Lankhorst
  2020-07-16 15:53     ` Tvrtko Ursulin
  0 siblings, 1 reply; 156+ messages in thread
From: Maarten Lankhorst @ 2020-07-15 15:43 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

On 15-07-2020 13:51, Chris Wilson wrote:
> Our goal is to pull all memory reservations (next iteration
> obj->ops->get_pages()) under a ww_mutex, and to align those reservations
> with other drivers, i.e. control all such allocations with the
> reservation_ww_class. Currently, this is under the purview of the
> obj->mm.mutex, and while obj->mm remains an embedded struct we can
> "simply" switch to using the reservation_ww_class lock, obj->base.resv->lock.
>
> The major consequence is the impact on the shrinker paths as the
> reservation_ww_class is used to wrap allocations, and a ww_mutex does
> not support subclassing, so we cannot rely on our usual trick of knowing
> that we never recurse inside the shrinker; instead we have to finish the
> reclaim with a trylock. This may result in us failing to release the
> pages after having released the vma. This will have to do until a better
> idea comes along.
>
> However, this step only converts the mutex over and continues to treat
> everything as a single allocation and to pin the pages. With the
> ww_mutex in place we can remove the temporary pinning, as we can then
> reserve all storage en masse.
>
> One last thing to do: kill the implicit page pinning for active vma.
> This will require us to invalidate the vma->pages when the backing store
> is removed (and we expect that while the vma is active, we mark the
> backing store as active so that it cannot be removed while the HW is
> busy.)
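
For reference, the "reserve all storage en masse" goal maps onto the generic
dma_resv/ww_acquire_ctx pattern. A minimal sketch only, not code from this
series: it assumes a flat array of reservation objects and leaves the
-EDEADLK backoff as a comment.

#include <linux/dma-resv.h>

static int acquire_all(struct dma_resv **resvs, unsigned int count,
                       struct ww_acquire_ctx *ctx)
{
        unsigned int i;
        int err;

        ww_acquire_init(ctx, &reservation_ww_class);

        for (i = 0; i < count; i++) {
                err = dma_resv_lock(resvs[i], ctx);
                if (err) {
                        /* a full version handles -EDEADLK here by calling
                         * dma_resv_lock_slow() on the contended lock and
                         * retrying the whole sequence
                         */
                        while (i--)
                                dma_resv_unlock(resvs[i]);
                        ww_acquire_fini(ctx);
                        return err;
                }
        }

        ww_acquire_done(ctx);
        return 0;
}

With every object locked up front, the backing store for the whole operation
can be acquired before any work is queued, which is what lets the temporary
pinning go away.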
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> ---
>  drivers/gpu/drm/i915/gem/i915_gem_clflush.c   |  20 +-
>  drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c    |  18 +-
>  drivers/gpu/drm/i915/gem/i915_gem_domain.c    |  65 ++----
>  .../gpu/drm/i915/gem/i915_gem_execbuffer.c    |  40 +++-
>  drivers/gpu/drm/i915/gem/i915_gem_object.c    |   8 +-
>  drivers/gpu/drm/i915/gem/i915_gem_object.h    |  37 +--
>  .../gpu/drm/i915/gem/i915_gem_object_types.h  |   1 -
>  drivers/gpu/drm/i915/gem/i915_gem_pages.c     | 134 ++++++-----
>  drivers/gpu/drm/i915/gem/i915_gem_phys.c      |   8 +-
>  drivers/gpu/drm/i915/gem/i915_gem_shrinker.c  |  13 +-
>  drivers/gpu/drm/i915/gem/i915_gem_tiling.c    |   2 -
>  drivers/gpu/drm/i915/gem/i915_gem_userptr.c   |  15 +-
>  .../gpu/drm/i915/gem/selftests/huge_pages.c   |  32 ++-
>  .../i915/gem/selftests/i915_gem_coherency.c   |  14 +-
>  .../drm/i915/gem/selftests/i915_gem_context.c |  10 +-
>  .../drm/i915/gem/selftests/i915_gem_mman.c    |   2 +
>  drivers/gpu/drm/i915/gt/gen6_ppgtt.c          |   2 -
>  drivers/gpu/drm/i915/gt/gen8_ppgtt.c          |   1 -
>  drivers/gpu/drm/i915/gt/intel_ggtt.c          |   5 +-
>  drivers/gpu/drm/i915/gt/intel_gtt.h           |   2 -
>  drivers/gpu/drm/i915/gt/intel_ppgtt.c         |   1 +
>  drivers/gpu/drm/i915/i915_gem.c               |  16 +-
>  drivers/gpu/drm/i915/i915_vma.c               | 217 +++++++-----------
>  drivers/gpu/drm/i915/i915_vma_types.h         |   6 -
>  drivers/gpu/drm/i915/mm/i915_acquire_ctx.c    |  12 +-
>  drivers/gpu/drm/i915/selftests/i915_gem_gtt.c |   4 +-
>  .../drm/i915/selftests/intel_memory_region.c  |  17 +-
>  27 files changed, 313 insertions(+), 389 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_clflush.c b/drivers/gpu/drm/i915/gem/i915_gem_clflush.c
> index bc0223716906..a32fd0d5570b 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_clflush.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_clflush.c
> @@ -27,16 +27,8 @@ static void __do_clflush(struct drm_i915_gem_object *obj)
>  static int clflush_work(struct dma_fence_work *base)
>  {
>  	struct clflush *clflush = container_of(base, typeof(*clflush), base);
> -	struct drm_i915_gem_object *obj = clflush->obj;
> -	int err;
> -
> -	err = i915_gem_object_pin_pages(obj);
> -	if (err)
> -		return err;
> -
> -	__do_clflush(obj);
> -	i915_gem_object_unpin_pages(obj);
>  
> +	__do_clflush(clflush->obj);
>  	return 0;
>  }
>  
> @@ -44,7 +36,7 @@ static void clflush_release(struct dma_fence_work *base)
>  {
>  	struct clflush *clflush = container_of(base, typeof(*clflush), base);
>  
> -	i915_gem_object_put(clflush->obj);
> +	i915_gem_object_unpin_pages(clflush->obj);
>  }
>  
>  static const struct dma_fence_work_ops clflush_ops = {
> @@ -63,8 +55,14 @@ static struct clflush *clflush_work_create(struct drm_i915_gem_object *obj)
>  	if (!clflush)
>  		return NULL;
>  
> +	if (__i915_gem_object_get_pages_locked(obj)) {
> +		kfree(clflush);
> +		return NULL;
> +	}
> +
>  	dma_fence_work_init(&clflush->base, &clflush_ops);
> -	clflush->obj = i915_gem_object_get(obj); /* obj <-> clflush cycle */
> +	__i915_gem_object_pin_pages(obj);
> +	clflush->obj = obj; /* Beware the obj.resv <-> clflush fence cycle */
>  
>  	return clflush;
>  }
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c b/drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c
> index 2679380159fc..049a15e6b496 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c
> @@ -124,19 +124,12 @@ static int i915_gem_begin_cpu_access(struct dma_buf *dma_buf, enum dma_data_dire
>  	bool write = (direction == DMA_BIDIRECTIONAL || direction == DMA_TO_DEVICE);
>  	int err;
>  
> -	err = i915_gem_object_pin_pages(obj);
> -	if (err)
> -		return err;
> -
>  	err = i915_gem_object_lock_interruptible(obj);
>  	if (err)
> -		goto out;
> +		return err;
>  
>  	err = i915_gem_object_set_to_cpu_domain(obj, write);
>  	i915_gem_object_unlock(obj);
> -
> -out:
> -	i915_gem_object_unpin_pages(obj);
>  	return err;
>  }
>  
> @@ -145,19 +138,12 @@ static int i915_gem_end_cpu_access(struct dma_buf *dma_buf, enum dma_data_direct
>  	struct drm_i915_gem_object *obj = dma_buf_to_obj(dma_buf);
>  	int err;
>  
> -	err = i915_gem_object_pin_pages(obj);
> -	if (err)
> -		return err;
> -
>  	err = i915_gem_object_lock_interruptible(obj);
>  	if (err)
> -		goto out;
> +		return err;
>  
>  	err = i915_gem_object_set_to_gtt_domain(obj, false);
>  	i915_gem_object_unlock(obj);
> -
> -out:
> -	i915_gem_object_unpin_pages(obj);
>  	return err;
>  }
>  
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_domain.c b/drivers/gpu/drm/i915/gem/i915_gem_domain.c
> index 7f76fc68f498..30e4b163588b 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_domain.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_domain.c
> @@ -70,7 +70,7 @@ i915_gem_object_set_to_wc_domain(struct drm_i915_gem_object *obj, bool write)
>  	 * continue to assume that the obj remained out of the CPU cached
>  	 * domain.
>  	 */
> -	ret = i915_gem_object_pin_pages(obj);
> +	ret = __i915_gem_object_get_pages_locked(obj);
>  	if (ret)
>  		return ret;
>  
> @@ -94,7 +94,6 @@ i915_gem_object_set_to_wc_domain(struct drm_i915_gem_object *obj, bool write)
>  		obj->mm.dirty = true;
>  	}
>  
> -	i915_gem_object_unpin_pages(obj);
>  	return 0;
>  }
>  
> @@ -131,7 +130,7 @@ i915_gem_object_set_to_gtt_domain(struct drm_i915_gem_object *obj, bool write)
>  	 * continue to assume that the obj remained out of the CPU cached
>  	 * domain.
>  	 */
> -	ret = i915_gem_object_pin_pages(obj);
> +	ret = __i915_gem_object_get_pages_locked(obj);
>  	if (ret)
>  		return ret;
>  
> @@ -163,7 +162,6 @@ i915_gem_object_set_to_gtt_domain(struct drm_i915_gem_object *obj, bool write)
>  		spin_unlock(&obj->vma.lock);
>  	}
>  
> -	i915_gem_object_unpin_pages(obj);
>  	return 0;
>  }
>  
> @@ -532,13 +530,9 @@ i915_gem_set_domain_ioctl(struct drm_device *dev, void *data,
>  	 * continue to assume that the obj remained out of the CPU cached
>  	 * domain.
>  	 */
> -	err = i915_gem_object_pin_pages(obj);
> -	if (err)
> -		goto out;
> -
>  	err = i915_gem_object_lock_interruptible(obj);
>  	if (err)
> -		goto out_unpin;
> +		goto out;
>  
>  	if (read_domains & I915_GEM_DOMAIN_WC)
>  		err = i915_gem_object_set_to_wc_domain(obj, write_domain);
> @@ -555,8 +549,6 @@ i915_gem_set_domain_ioctl(struct drm_device *dev, void *data,
>  	if (write_domain)
>  		i915_gem_object_invalidate_frontbuffer(obj, ORIGIN_CPU);
>  
> -out_unpin:
> -	i915_gem_object_unpin_pages(obj);
>  out:
>  	i915_gem_object_put(obj);
>  	return err;
> @@ -572,11 +564,13 @@ int i915_gem_object_prepare_read(struct drm_i915_gem_object *obj,
>  {
>  	int ret;
>  
> +	assert_object_held(obj);
> +
>  	*needs_clflush = 0;
>  	if (!i915_gem_object_has_struct_page(obj))
>  		return -ENODEV;
>  
> -	ret = i915_gem_object_lock_interruptible(obj);
> +	ret = __i915_gem_object_get_pages_locked(obj);
>  	if (ret)
>  		return ret;
>  
> @@ -584,19 +578,11 @@ int i915_gem_object_prepare_read(struct drm_i915_gem_object *obj,
>  				   I915_WAIT_INTERRUPTIBLE,
>  				   MAX_SCHEDULE_TIMEOUT);
>  	if (ret)
> -		goto err_unlock;
> -
> -	ret = i915_gem_object_pin_pages(obj);
> -	if (ret)
> -		goto err_unlock;
> +		return ret;
>  
>  	if (obj->cache_coherent & I915_BO_CACHE_COHERENT_FOR_READ ||
>  	    !static_cpu_has(X86_FEATURE_CLFLUSH)) {
> -		ret = i915_gem_object_set_to_cpu_domain(obj, false);
> -		if (ret)
> -			goto err_unpin;
> -		else
> -			goto out;
> +		return i915_gem_object_set_to_cpu_domain(obj, false);
>  	}
>  
>  	i915_gem_object_flush_write_domain(obj, ~I915_GEM_DOMAIN_CPU);
> @@ -610,15 +596,7 @@ int i915_gem_object_prepare_read(struct drm_i915_gem_object *obj,
>  	    !(obj->read_domains & I915_GEM_DOMAIN_CPU))
>  		*needs_clflush = CLFLUSH_BEFORE;
>  
> -out:
> -	/* return with the pages pinned */
>  	return 0;
> -
> -err_unpin:
> -	i915_gem_object_unpin_pages(obj);
> -err_unlock:
> -	i915_gem_object_unlock(obj);
> -	return ret;
>  }
>  
>  int i915_gem_object_prepare_write(struct drm_i915_gem_object *obj,
> @@ -626,11 +604,13 @@ int i915_gem_object_prepare_write(struct drm_i915_gem_object *obj,
>  {
>  	int ret;
>  
> +	assert_object_held(obj);
> +
>  	*needs_clflush = 0;
>  	if (!i915_gem_object_has_struct_page(obj))
>  		return -ENODEV;
>  
> -	ret = i915_gem_object_lock_interruptible(obj);
> +	ret = __i915_gem_object_get_pages_locked(obj);
>  	if (ret)
>  		return ret;
>  
> @@ -639,20 +619,11 @@ int i915_gem_object_prepare_write(struct drm_i915_gem_object *obj,
>  				   I915_WAIT_ALL,
>  				   MAX_SCHEDULE_TIMEOUT);
>  	if (ret)
> -		goto err_unlock;
> -
> -	ret = i915_gem_object_pin_pages(obj);
> -	if (ret)
> -		goto err_unlock;
> +		return ret;
>  
>  	if (obj->cache_coherent & I915_BO_CACHE_COHERENT_FOR_WRITE ||
> -	    !static_cpu_has(X86_FEATURE_CLFLUSH)) {
> -		ret = i915_gem_object_set_to_cpu_domain(obj, true);
> -		if (ret)
> -			goto err_unpin;
> -		else
> -			goto out;
> -	}
> +	    !static_cpu_has(X86_FEATURE_CLFLUSH))
> +		return i915_gem_object_set_to_cpu_domain(obj, true);
>  
>  	i915_gem_object_flush_write_domain(obj, ~I915_GEM_DOMAIN_CPU);
>  
> @@ -672,15 +643,7 @@ int i915_gem_object_prepare_write(struct drm_i915_gem_object *obj,
>  			*needs_clflush |= CLFLUSH_BEFORE;
>  	}
>  
> -out:
>  	i915_gem_object_invalidate_frontbuffer(obj, ORIGIN_CPU);
>  	obj->mm.dirty = true;
> -	/* return with the pages pinned */
>  	return 0;
> -
> -err_unpin:
> -	i915_gem_object_unpin_pages(obj);
> -err_unlock:
> -	i915_gem_object_unlock(obj);
> -	return ret;
>  }
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> index db433f3f18ec..b07c508812ad 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> @@ -966,6 +966,13 @@ static int best_hole(struct drm_mm *mm, struct drm_mm_node *node,
>  	} while (1);
>  }
>  
> +static void eb_pin_vma_pages(struct i915_vma *vma, unsigned int count)
> +{
> +	count = hweight32(count);
> +	while (count--)
> +		__i915_gem_object_pin_pages(vma->obj);
> +}
> +
>  static int eb_reserve_vma(struct eb_vm_work *work, struct eb_bind_vma *bind)
>  {
>  	struct drm_i915_gem_exec_object2 *entry = bind->ev->exec;
> @@ -1077,7 +1084,6 @@ static int eb_reserve_vma(struct eb_vm_work *work, struct eb_bind_vma *bind)
>  		if (unlikely(err))
>  			return err;
>  
> -		atomic_add(I915_VMA_PAGES_ACTIVE, &vma->pages_count);
>  		atomic_or(bind_flags, &vma->flags);
>  
>  		if (i915_vma_is_ggtt(vma))
> @@ -1184,9 +1190,14 @@ static void __eb_bind_vma(struct eb_vm_work *work)
>  		GEM_BUG_ON(vma->vm != vm);
>  		GEM_BUG_ON(!i915_vma_is_active(vma));
>  
> +		if (!vma->pages)
> +			vma->ops->set_pages(vma); /* plain assignment */
> +
>  		vma->ops->bind_vma(vm, &work->stash, vma,
>  				   vma->obj->cache_level, bind->bind_flags);
>  
> +		eb_pin_vma_pages(vma, bind->bind_flags);
> +
>  		if (drm_mm_node_allocated(&bind->hole)) {
>  			mutex_lock(&vm->mutex);
>  			GEM_BUG_ON(bind->hole.mm != &vm->mm);
> @@ -1203,7 +1214,6 @@ static void __eb_bind_vma(struct eb_vm_work *work)
>  
>  put:
>  		GEM_BUG_ON(drm_mm_node_allocated(&bind->hole));
> -		i915_vma_put_pages(vma);
>  	}
>  	work->count = 0;
>  }
> @@ -1316,8 +1326,24 @@ static int eb_prepare_vma(struct eb_vm_work *work,
>  		if (err)
>  			return err;
>  	}
> +	return 0;
> +}
> +
> +static int eb_lock_pt(struct i915_execbuffer *eb,
> +		      struct i915_vm_pt_stash *stash)
> +{
> +	struct i915_page_table *pt;
> +	int n, err;
>  
> -	return i915_vma_get_pages(vma);
> +	for (n = 0; n < ARRAY_SIZE(stash->pt); n++) {
> +		for (pt = stash->pt[n]; pt; pt = pt->stash) {
> +			err = i915_acquire_ctx_lock(&eb->acquire, pt->base);
> +			if (err)
> +				return err;
> +		}
> +	}
> +
> +	return 0;
>  }
>  
>  static int wait_for_unbinds(struct i915_execbuffer *eb,
> @@ -1457,11 +1483,11 @@ static int eb_reserve_vm(struct i915_execbuffer *eb)
>  			}
>  		}
>  
> -		err = eb_acquire_mm(eb);
> +		err = eb_lock_pt(eb, &work->stash);
>  		if (err)
>  			return eb_vm_work_cancel(work, err);
>  
> -		err = i915_vm_pin_pt_stash(work->vm, &work->stash);
> +		err = eb_acquire_mm(eb);
>  		if (err)
>  			return eb_vm_work_cancel(work, err);
>  
> @@ -2714,7 +2740,7 @@ static int eb_parse_pipeline(struct i915_execbuffer *eb,
>  	if (!pw)
>  		return -ENOMEM;
>  
> -	ptr = i915_gem_object_pin_map(shadow->obj, I915_MAP_FORCE_WB);
> +	ptr = __i915_gem_object_pin_map_locked(shadow->obj, I915_MAP_FORCE_WB);
>  	if (IS_ERR(ptr)) {
>  		err = PTR_ERR(ptr);
>  		goto err_free;
> @@ -2722,7 +2748,7 @@ static int eb_parse_pipeline(struct i915_execbuffer *eb,
>  
>  	if (!(batch->obj->cache_coherent & I915_BO_CACHE_COHERENT_FOR_READ) &&
>  	    i915_has_memcpy_from_wc()) {
> -		ptr = i915_gem_object_pin_map(batch->obj, I915_MAP_WC);
> +		ptr = __i915_gem_object_pin_map_locked(batch->obj, I915_MAP_WC);
>  		if (IS_ERR(ptr)) {
>  			err = PTR_ERR(ptr);
>  			goto err_dst;
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_object.c b/drivers/gpu/drm/i915/gem/i915_gem_object.c
> index c8421fd9d2dc..799ad4e648aa 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_object.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_object.c
> @@ -53,8 +53,6 @@ void i915_gem_object_init(struct drm_i915_gem_object *obj,
>  			  const struct drm_i915_gem_object_ops *ops,
>  			  struct lock_class_key *key)
>  {
> -	__mutex_init(&obj->mm.lock, ops->name ?: "obj->mm.lock", key);
> -
>  	spin_lock_init(&obj->vma.lock);
>  	INIT_LIST_HEAD(&obj->vma.list);
>  
> @@ -73,10 +71,6 @@ void i915_gem_object_init(struct drm_i915_gem_object *obj,
>  	obj->mm.madv = I915_MADV_WILLNEED;
>  	INIT_RADIX_TREE(&obj->mm.get_page.radix, GFP_KERNEL | __GFP_NOWARN);
>  	mutex_init(&obj->mm.get_page.lock);
> -
> -	if (IS_ENABLED(CONFIG_LOCKDEP) && i915_gem_object_is_shrinkable(obj))
> -		i915_gem_shrinker_taints_mutex(to_i915(obj->base.dev),
> -					       &obj->mm.lock);
>  }
>  
>  /**
> @@ -229,10 +223,12 @@ static void __i915_gem_free_objects(struct drm_i915_private *i915,
>  
>  		GEM_BUG_ON(!list_empty(&obj->lut_list));
>  
> +		i915_gem_object_lock(obj);
>  		atomic_set(&obj->mm.pages_pin_count, 0);
>  		__i915_gem_object_put_pages(obj);
>  		GEM_BUG_ON(i915_gem_object_has_pages(obj));
>  		bitmap_free(obj->bit_17);
> +		i915_gem_object_unlock(obj);
>  
>  		if (obj->base.import_attach)
>  			drm_prime_gem_destroy(&obj->base, NULL);
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_object.h b/drivers/gpu/drm/i915/gem/i915_gem_object.h
> index 6f60687b6be2..26f53321443b 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_object.h
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_object.h
> @@ -271,36 +271,9 @@ void __i915_gem_object_set_pages(struct drm_i915_gem_object *obj,
>  				 struct sg_table *pages,
>  				 unsigned int sg_page_sizes);
>  
> -int ____i915_gem_object_get_pages(struct drm_i915_gem_object *obj);
> -int __i915_gem_object_get_pages(struct drm_i915_gem_object *obj);
> -
> -enum i915_mm_subclass { /* lockdep subclass for obj->mm.lock/struct_mutex */
> -	I915_MM_NORMAL = 0,
> -	/*
> -	 * Only used by struct_mutex, when called "recursively" from
> -	 * direct-reclaim-esque. Safe because there is only every one
> -	 * struct_mutex in the entire system.
> -	 */
> -	I915_MM_SHRINKER = 1,
> -	/*
> -	 * Used for obj->mm.lock when allocating pages. Safe because the object
> -	 * isn't yet on any LRU, and therefore the shrinker can't deadlock on
> -	 * it. As soon as the object has pages, obj->mm.lock nests within
> -	 * fs_reclaim.
> -	 */
> -	I915_MM_GET_PAGES = 1,
> -};
> -
> -static inline int __must_check
> -i915_gem_object_pin_pages(struct drm_i915_gem_object *obj)
> -{
> -	might_lock_nested(&obj->mm.lock, I915_MM_GET_PAGES);
> +int __i915_gem_object_get_pages_locked(struct drm_i915_gem_object *obj);
>  
> -	if (atomic_inc_not_zero(&obj->mm.pages_pin_count))
> -		return 0;
> -
> -	return __i915_gem_object_get_pages(obj);
> -}
> +int i915_gem_object_pin_pages(struct drm_i915_gem_object *obj);
>  
>  static inline bool
>  i915_gem_object_has_pages(struct drm_i915_gem_object *obj)
> @@ -368,6 +341,9 @@ enum i915_map_type {
>  void *__must_check i915_gem_object_pin_map(struct drm_i915_gem_object *obj,
>  					   enum i915_map_type type);
>  
> +void *__i915_gem_object_pin_map_locked(struct drm_i915_gem_object *obj,
> +				       enum i915_map_type type);
> +
>  static inline void *__i915_gem_object_mapping(struct drm_i915_gem_object *obj)
>  {
>  	return page_mask_bits(obj->mm.mapping);
> @@ -417,8 +393,7 @@ int i915_gem_object_prepare_write(struct drm_i915_gem_object *obj,
>  static inline void
>  i915_gem_object_finish_access(struct drm_i915_gem_object *obj)
>  {
> -	i915_gem_object_unpin_pages(obj);
> -	i915_gem_object_unlock(obj);
> +	assert_object_held(obj);
>  }
>  
>  static inline struct intel_engine_cs *
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_object_types.h b/drivers/gpu/drm/i915/gem/i915_gem_object_types.h
> index d0847d7896f9..ae3303ba272c 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_object_types.h
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_object_types.h
> @@ -187,7 +187,6 @@ struct drm_i915_gem_object {
>  		 * Protects the pages and their use. Do not use directly, but
>  		 * instead go through the pin/unpin interfaces.
>  		 */
> -		struct mutex lock;
>  		atomic_t pages_pin_count;
>  		atomic_t shrink_pin;
>  
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_pages.c b/drivers/gpu/drm/i915/gem/i915_gem_pages.c
> index 7050519c87a4..76d53e535f42 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_pages.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_pages.c
> @@ -18,7 +18,7 @@ void __i915_gem_object_set_pages(struct drm_i915_gem_object *obj,
>  	unsigned long supported = INTEL_INFO(i915)->page_sizes;
>  	int i;
>  
> -	lockdep_assert_held(&obj->mm.lock);
> +	assert_object_held(obj);
>  
>  	if (i915_gem_object_is_volatile(obj))
>  		obj->mm.madv = I915_MADV_DONTNEED;
> @@ -81,13 +81,17 @@ void __i915_gem_object_set_pages(struct drm_i915_gem_object *obj,
>  	}
>  }
>  
> -int ____i915_gem_object_get_pages(struct drm_i915_gem_object *obj)
> +int __i915_gem_object_get_pages_locked(struct drm_i915_gem_object *obj)
>  {
> -	struct drm_i915_private *i915 = to_i915(obj->base.dev);
>  	int err;
>  
> +	assert_object_held(obj);
> +
> +	if (i915_gem_object_has_pages(obj))
> +		return 0;
> +
>  	if (unlikely(obj->mm.madv != I915_MADV_WILLNEED)) {
> -		drm_dbg(&i915->drm,
> +		drm_dbg(obj->base.dev,
>  			"Attempting to obtain a purgeable object\n");
>  		return -EFAULT;
>  	}
> @@ -98,34 +102,33 @@ int ____i915_gem_object_get_pages(struct drm_i915_gem_object *obj)
>  	return err;
>  }
>  
> -/* Ensure that the associated pages are gathered from the backing storage
> +/*
> + * Ensure that the associated pages are gathered from the backing storage
>   * and pinned into our object. i915_gem_object_pin_pages() may be called
>   * multiple times before they are released by a single call to
>   * i915_gem_object_unpin_pages() - once the pages are no longer referenced
>   * either as a result of memory pressure (reaping pages under the shrinker)
>   * or as the object is itself released.
>   */
> -int __i915_gem_object_get_pages(struct drm_i915_gem_object *obj)
> +int i915_gem_object_pin_pages(struct drm_i915_gem_object *obj)
>  {
>  	int err;
>  
> -	err = mutex_lock_interruptible_nested(&obj->mm.lock, I915_MM_GET_PAGES);
> +	might_lock(&obj->base.resv->lock.base);
> +
> +	if (atomic_inc_not_zero(&obj->mm.pages_pin_count))
> +		return 0;
> +
> +	err = i915_gem_object_lock_interruptible(obj);
>  	if (err)
>  		return err;
>  
> -	if (unlikely(!i915_gem_object_has_pages(obj))) {
> -		GEM_BUG_ON(i915_gem_object_has_pinned_pages(obj));
> -
> -		err = ____i915_gem_object_get_pages(obj);
> -		if (err)
> -			goto unlock;
> +	err = __i915_gem_object_get_pages_locked(obj);
> +	if (err == 0)
> +		atomic_inc(&obj->mm.pages_pin_count);
>  
> -		smp_mb__before_atomic();
> -	}
> -	atomic_inc(&obj->mm.pages_pin_count);
> +	i915_gem_object_unlock(obj);
>  
> -unlock:
> -	mutex_unlock(&obj->mm.lock);
>  	return err;
>  }
>  
> @@ -140,7 +143,7 @@ void i915_gem_object_truncate(struct drm_i915_gem_object *obj)
>  /* Try to discard unwanted pages */
>  void i915_gem_object_writeback(struct drm_i915_gem_object *obj)
>  {
> -	lockdep_assert_held(&obj->mm.lock);
> +	assert_object_held(obj);
>  	GEM_BUG_ON(i915_gem_object_has_pages(obj));
>  
>  	if (obj->ops->writeback)
> @@ -194,17 +197,15 @@ __i915_gem_object_unset_pages(struct drm_i915_gem_object *obj)
>  int __i915_gem_object_put_pages(struct drm_i915_gem_object *obj)
>  {
>  	struct sg_table *pages;
> -	int err;
> +
> +	/* May be called by shrinker from within get_pages() (on another bo) */
> +	assert_object_held(obj);
>  
>  	if (i915_gem_object_has_pinned_pages(obj))
>  		return -EBUSY;
>  
> -	/* May be called by shrinker from within get_pages() (on another bo) */
> -	mutex_lock(&obj->mm.lock);
> -	if (unlikely(atomic_read(&obj->mm.pages_pin_count))) {
> -		err = -EBUSY;
> -		goto unlock;
> -	}
> +	if (unlikely(atomic_read(&obj->mm.pages_pin_count)))
> +		return -EBUSY;
>  
>  	i915_gem_object_release_mmap_offset(obj);
>  
> @@ -227,11 +228,7 @@ int __i915_gem_object_put_pages(struct drm_i915_gem_object *obj)
>  	if (!IS_ERR(pages))
>  		obj->ops->put_pages(obj, pages);
>  
> -	err = 0;
> -unlock:
> -	mutex_unlock(&obj->mm.lock);
> -
> -	return err;
> +	return 0;
>  }
>  
>  static inline pte_t iomap_pte(resource_size_t base,
> @@ -311,48 +308,28 @@ static void *i915_gem_object_map(struct drm_i915_gem_object *obj,
>  	return area->addr;
>  }
>  
> -/* get, pin, and map the pages of the object into kernel space */
> -void *i915_gem_object_pin_map(struct drm_i915_gem_object *obj,
> -			      enum i915_map_type type)
> +void *__i915_gem_object_pin_map_locked(struct drm_i915_gem_object *obj,
> +				       enum i915_map_type type)
>  {
>  	enum i915_map_type has_type;
>  	unsigned int flags;
>  	bool pinned;
>  	void *ptr;
> -	int err;
> +
> +	assert_object_held(obj);
> +	GEM_BUG_ON(!i915_gem_object_has_pages(obj));
>  
>  	flags = I915_GEM_OBJECT_HAS_STRUCT_PAGE | I915_GEM_OBJECT_HAS_IOMEM;
>  	if (!i915_gem_object_type_has(obj, flags))
>  		return ERR_PTR(-ENXIO);
>  
> -	err = mutex_lock_interruptible_nested(&obj->mm.lock, I915_MM_GET_PAGES);
> -	if (err)
> -		return ERR_PTR(err);
> -
>  	pinned = !(type & I915_MAP_OVERRIDE);
>  	type &= ~I915_MAP_OVERRIDE;
>  
> -	if (!atomic_inc_not_zero(&obj->mm.pages_pin_count)) {
> -		if (unlikely(!i915_gem_object_has_pages(obj))) {
> -			GEM_BUG_ON(i915_gem_object_has_pinned_pages(obj));
> -
> -			err = ____i915_gem_object_get_pages(obj);
> -			if (err)
> -				goto err_unlock;
> -
> -			smp_mb__before_atomic();
> -		}
> -		atomic_inc(&obj->mm.pages_pin_count);
> -		pinned = false;
> -	}
> -	GEM_BUG_ON(!i915_gem_object_has_pages(obj));
> -
>  	ptr = page_unpack_bits(obj->mm.mapping, &has_type);
>  	if (ptr && has_type != type) {
> -		if (pinned) {
> -			err = -EBUSY;
> -			goto err_unpin;
> -		}
> +		if (pinned)
> +			return ERR_PTR(-EBUSY);
>  
>  		unmap_object(obj, ptr);
>  
> @@ -361,23 +338,38 @@ void *i915_gem_object_pin_map(struct drm_i915_gem_object *obj,
>  
>  	if (!ptr) {
>  		ptr = i915_gem_object_map(obj, type);
> -		if (!ptr) {
> -			err = -ENOMEM;
> -			goto err_unpin;
> -		}
> +		if (!ptr)
> +			return ERR_PTR(-ENOMEM);
>  
>  		obj->mm.mapping = page_pack_bits(ptr, type);
>  	}
>  
> -out_unlock:
> -	mutex_unlock(&obj->mm.lock);
> +	__i915_gem_object_pin_pages(obj);
>  	return ptr;
> +}
> +
> +/* get, pin, and map the pages of the object into kernel space */
> +void *i915_gem_object_pin_map(struct drm_i915_gem_object *obj,
> +			      enum i915_map_type type)
> +{
> +	void *ptr;
> +	int err;
> +
> +	err = i915_gem_object_lock_interruptible(obj);
> +	if (err)
> +		return ERR_PTR(err);
>  
> -err_unpin:
> -	atomic_dec(&obj->mm.pages_pin_count);
> -err_unlock:
> -	ptr = ERR_PTR(err);
> -	goto out_unlock;
> +	err = __i915_gem_object_get_pages_locked(obj);
> +	if (err) {
> +		ptr = ERR_PTR(err);
> +		goto out;
> +	}
> +
> +	ptr = __i915_gem_object_pin_map_locked(obj, type);
> +
> +out:
> +	i915_gem_object_unlock(obj);
> +	return ptr;
>  }
>  
>  void __i915_gem_object_flush_map(struct drm_i915_gem_object *obj,
> @@ -434,7 +426,9 @@ i915_gem_object_get_sg(struct drm_i915_gem_object *obj,
>  
>  	might_sleep();
>  	GEM_BUG_ON(n >= obj->base.size >> PAGE_SHIFT);
> -	GEM_BUG_ON(!i915_gem_object_has_pinned_pages(obj));
> +	GEM_BUG_ON(!i915_gem_object_has_pages(obj));
> +	GEM_BUG_ON(!mutex_is_locked(&obj->base.resv->lock.base) &&
> +		   !i915_gem_object_has_pinned_pages(obj));
>  
>  	/* As we iterate forward through the sg, we record each entry in a
>  	 * radixtree for quick repeated (backwards) lookups. If we have seen
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_phys.c b/drivers/gpu/drm/i915/gem/i915_gem_phys.c
> index 28147aab47b9..f7f93b68b7c1 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_phys.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_phys.c
> @@ -165,7 +165,7 @@ int i915_gem_object_attach_phys(struct drm_i915_gem_object *obj, int align)
>  	if (err)
>  		return err;
>  
> -	mutex_lock_nested(&obj->mm.lock, I915_MM_GET_PAGES);
> +	i915_gem_object_lock(obj);
>  
>  	if (obj->mm.madv != I915_MADV_WILLNEED) {
>  		err = -EFAULT;
> @@ -186,7 +186,7 @@ int i915_gem_object_attach_phys(struct drm_i915_gem_object *obj, int align)
>  
>  	obj->ops = &i915_gem_phys_ops;
>  
> -	err = ____i915_gem_object_get_pages(obj);
> +	err = __i915_gem_object_get_pages_locked(obj);
>  	if (err)
>  		goto err_xfer;
>  
> @@ -198,7 +198,7 @@ int i915_gem_object_attach_phys(struct drm_i915_gem_object *obj, int align)
>  
>  	i915_gem_object_release_memory_region(obj);
>  
> -	mutex_unlock(&obj->mm.lock);
> +	i915_gem_object_unlock(obj);
>  	return 0;
>  
>  err_xfer:
> @@ -209,7 +209,7 @@ int i915_gem_object_attach_phys(struct drm_i915_gem_object *obj, int align)
>  		__i915_gem_object_set_pages(obj, pages, sg_page_sizes);
>  	}
>  err_unlock:
> -	mutex_unlock(&obj->mm.lock);
> +	i915_gem_object_unlock(obj);
>  	return err;
>  }
>  
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c b/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c
> index dc8f052a0ffe..4e928103a38f 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c
> @@ -47,10 +47,7 @@ static bool unsafe_drop_pages(struct drm_i915_gem_object *obj,
>  	if (!(shrink & I915_SHRINK_BOUND))
>  		flags = I915_GEM_OBJECT_UNBIND_TEST;
>  
> -	if (i915_gem_object_unbind(obj, flags) == 0)
> -		__i915_gem_object_put_pages(obj);
> -
> -	return !i915_gem_object_has_pages(obj);
> +	return i915_gem_object_unbind(obj, flags) == 0;
>  }
>  
>  static void try_to_writeback(struct drm_i915_gem_object *obj,
> @@ -199,14 +196,14 @@ i915_gem_shrink(struct drm_i915_private *i915,
>  
>  			spin_unlock_irqrestore(&i915->mm.obj_lock, flags);
>  
> -			if (unsafe_drop_pages(obj, shrink)) {
> -				/* May arrive from get_pages on another bo */
> -				mutex_lock(&obj->mm.lock);
> +			if (unsafe_drop_pages(obj, shrink) &&
> +			    i915_gem_object_trylock(obj)) {
Why trylock? Because of the nesting? In that case, please still use the ww ctx when one is provided.
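
In code, that suggestion looks roughly like the helper below; a sketch only,
with a hypothetical shrinker_lock() and the assumption that an acquire
context can be plumbed into the shrink path in the first place:

#include <linux/dma-resv.h>

/* hypothetical helper, not part of this series */
static bool shrinker_lock(struct dma_resv *resv, struct ww_acquire_ctx *ww)
{
        if (!ww)        /* context-free reclaim: only a trylock is safe */
                return dma_resv_trylock(resv);

        /* with an acquire context we may sleep; a false return here for
         * -EDEADLK means the caller must back off and retry
         */
        return dma_resv_lock(resv, ww) == 0;
}
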
> +				__i915_gem_object_put_pages(obj);
>  				if (!i915_gem_object_has_pages(obj)) {
>  					try_to_writeback(obj, shrink);
>  					count += obj->base.size >> PAGE_SHIFT;
>  				}
> -				mutex_unlock(&obj->mm.lock);
> +				i915_gem_object_unlock(obj);
>  			}
>  
>  			scanned += obj->base.size >> PAGE_SHIFT;
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_tiling.c b/drivers/gpu/drm/i915/gem/i915_gem_tiling.c
> index ff72ee2fd9cd..ac12e1c20e66 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_tiling.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_tiling.c
> @@ -265,7 +265,6 @@ i915_gem_object_set_tiling(struct drm_i915_gem_object *obj,
>  	 * pages to prevent them being swapped out and causing corruption
>  	 * due to the change in swizzling.
>  	 */
> -	mutex_lock(&obj->mm.lock);
>  	if (i915_gem_object_has_pages(obj) &&
>  	    obj->mm.madv == I915_MADV_WILLNEED &&
>  	    i915->quirks & QUIRK_PIN_SWIZZLED_PAGES) {
> @@ -280,7 +279,6 @@ i915_gem_object_set_tiling(struct drm_i915_gem_object *obj,
>  			obj->mm.quirked = true;
>  		}
>  	}
> -	mutex_unlock(&obj->mm.lock);
>  
>  	spin_lock(&obj->vma.lock);
>  	for_each_ggtt_vma(vma, obj) {
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
> index e946032b13e4..80907c00c6fd 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
> @@ -129,8 +129,15 @@ userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
>  		ret = i915_gem_object_unbind(obj,
>  					     I915_GEM_OBJECT_UNBIND_ACTIVE |
>  					     I915_GEM_OBJECT_UNBIND_BARRIER);
> -		if (ret == 0)
> -			ret = __i915_gem_object_put_pages(obj);
> +		if (ret == 0) {
> +			/* ww_mutex and mmu_notifier is fs_reclaim tainted */
> +			if (i915_gem_object_trylock(obj)) {
> +				ret = __i915_gem_object_put_pages(obj);
> +				i915_gem_object_unlock(obj);
> +			} else {
> +				ret = -EAGAIN;
> +			}
> +		}

I'm not sure upstream will agree with this kind of API:

1. It will deadlock when RT tasks are used.

2. You start throwing -EAGAIN because you don't have the correct ordering of locking; this needs fixing first.
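
One reading of the ordering point: with a proper ww acquire context the
contention is resolved with the usual backoff dance (drop everything, sleep
on the contended lock, retry in the new order) instead of bubbling -EAGAIN
up to the caller. A minimal two-lock sketch with a hypothetical lock_pair()
helper, not code from the series:

#include <linux/kernel.h>       /* swap() */
#include <linux/dma-resv.h>

static int lock_pair(struct dma_resv *a, struct dma_resv *b,
                     struct ww_acquire_ctx *ctx)
{
        int err;

        err = dma_resv_lock(a, ctx);
        if (err)
                return err;

        while ((err = dma_resv_lock(b, ctx)) == -EDEADLK) {
                /* back off: release what we hold, wait on the contended
                 * lock so the older context can finish, then retry with
                 * the roles swapped
                 */
                dma_resv_unlock(a);
                dma_resv_lock_slow(b, ctx);
                swap(a, b);
        }
        if (err)
                dma_resv_unlock(a);

        return err;
}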

>  		i915_gem_object_put(obj);
>  		if (ret)
>  			return ret;
> @@ -485,7 +492,7 @@ __i915_gem_userptr_get_pages_worker(struct work_struct *_work)
>  		}
>  	}
>  
> -	mutex_lock_nested(&obj->mm.lock, I915_MM_GET_PAGES);
> +	i915_gem_object_lock(obj);
>  	if (obj->userptr.work == &work->work) {
>  		struct sg_table *pages = ERR_PTR(ret);
>  
> @@ -502,7 +509,7 @@ __i915_gem_userptr_get_pages_worker(struct work_struct *_work)
>  		if (IS_ERR(pages))
>  			__i915_gem_userptr_set_active(obj, false);
>  	}
> -	mutex_unlock(&obj->mm.lock);
> +	i915_gem_object_unlock(obj);
>  
>  	unpin_user_pages(pvec, pinned);
>  	kvfree(pvec);
> diff --git a/drivers/gpu/drm/i915/gem/selftests/huge_pages.c b/drivers/gpu/drm/i915/gem/selftests/huge_pages.c
> index e2f3d014acb2..eb12d444d2cc 100644
> --- a/drivers/gpu/drm/i915/gem/selftests/huge_pages.c
> +++ b/drivers/gpu/drm/i915/gem/selftests/huge_pages.c
> @@ -452,6 +452,15 @@ static int igt_mock_exhaust_device_supported_pages(void *arg)
>  	return err;
>  }
>  
> +static void close_object(struct drm_i915_gem_object *obj)
> +{
> +	i915_gem_object_lock(obj);
> +	__i915_gem_object_put_pages(obj);
> +	i915_gem_object_unlock(obj);
> +
> +	i915_gem_object_put(obj);
> +}
> +
>  static int igt_mock_memory_region_huge_pages(void *arg)
>  {
>  	const unsigned int flags[] = { 0, I915_BO_ALLOC_CONTIGUOUS };
> @@ -514,8 +523,7 @@ static int igt_mock_memory_region_huge_pages(void *arg)
>  			}
>  
>  			i915_vma_unpin(vma);
> -			__i915_gem_object_put_pages(obj);
> -			i915_gem_object_put(obj);
> +			close_object(obj);
>  		}
>  	}
>  
> @@ -633,8 +641,7 @@ static int igt_mock_ppgtt_misaligned_dma(void *arg)
>  		}
>  
>  		i915_gem_object_unpin_pages(obj);
> -		__i915_gem_object_put_pages(obj);
> -		i915_gem_object_put(obj);
> +		close_object(obj);
>  	}
>  
>  	return 0;
> @@ -655,8 +662,7 @@ static void close_object_list(struct list_head *objects,
>  	list_for_each_entry_safe(obj, on, objects, st_link) {
>  		list_del(&obj->st_link);
>  		i915_gem_object_unpin_pages(obj);
> -		__i915_gem_object_put_pages(obj);
> -		i915_gem_object_put(obj);
> +		close_object(obj);
>  	}
>  }
>  
> @@ -923,8 +929,7 @@ static int igt_mock_ppgtt_64K(void *arg)
>  
>  			i915_vma_unpin(vma);
>  			i915_gem_object_unpin_pages(obj);
> -			__i915_gem_object_put_pages(obj);
> -			i915_gem_object_put(obj);
> +			close_object(obj);
>  		}
>  	}
>  
> @@ -964,9 +969,10 @@ __cpu_check_shmem(struct drm_i915_gem_object *obj, u32 dword, u32 val)
>  	unsigned long n;
>  	int err;
>  
> +	i915_gem_object_lock(obj);
>  	err = i915_gem_object_prepare_read(obj, &needs_flush);
>  	if (err)
> -		return err;
> +		goto unlock;
>  
>  	for (n = 0; n < obj->base.size >> PAGE_SHIFT; ++n) {
>  		u32 *ptr = kmap_atomic(i915_gem_object_get_page(obj, n));
> @@ -986,7 +992,8 @@ __cpu_check_shmem(struct drm_i915_gem_object *obj, u32 dword, u32 val)
>  	}
>  
>  	i915_gem_object_finish_access(obj);
> -
> +unlock:
> +	i915_gem_object_unlock(obj);
>  	return err;
>  }
>  
> @@ -1304,7 +1311,9 @@ static int igt_ppgtt_smoke_huge(void *arg)
>  		}
>  out_unpin:
>  		i915_gem_object_unpin_pages(obj);
> +		i915_gem_object_lock(obj);
>  		__i915_gem_object_put_pages(obj);
> +		i915_gem_object_unlock(obj);
>  out_put:
>  		i915_gem_object_put(obj);
>  
> @@ -1392,8 +1401,7 @@ static int igt_ppgtt_sanity_check(void *arg)
>  			err = igt_write_huge(ctx, obj);
>  
>  			i915_gem_object_unpin_pages(obj);
> -			__i915_gem_object_put_pages(obj);
> -			i915_gem_object_put(obj);
> +			close_object(obj);
>  
>  			if (err) {
>  				pr_err("%s write-huge failed with size=%u pages=%u i=%d, j=%d\n",
> diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_coherency.c b/drivers/gpu/drm/i915/gem/selftests/i915_gem_coherency.c
> index 87d7d8aa080f..b8dd6fabe70a 100644
> --- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_coherency.c
> +++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_coherency.c
> @@ -27,9 +27,10 @@ static int cpu_set(struct context *ctx, unsigned long offset, u32 v)
>  	u32 *cpu;
>  	int err;
>  
> +	i915_gem_object_lock(ctx->obj);
>  	err = i915_gem_object_prepare_write(ctx->obj, &needs_clflush);
>  	if (err)
> -		return err;
> +		goto unlock;
>  
>  	page = i915_gem_object_get_page(ctx->obj, offset >> PAGE_SHIFT);
>  	map = kmap_atomic(page);
> @@ -46,7 +47,9 @@ static int cpu_set(struct context *ctx, unsigned long offset, u32 v)
>  	kunmap_atomic(map);
>  	i915_gem_object_finish_access(ctx->obj);
>  
> -	return 0;
> +unlock:
> +	i915_gem_object_unlock(ctx->obj);
> +	return err;
>  }
>  
>  static int cpu_get(struct context *ctx, unsigned long offset, u32 *v)
> @@ -57,9 +60,10 @@ static int cpu_get(struct context *ctx, unsigned long offset, u32 *v)
>  	u32 *cpu;
>  	int err;
>  
> +	i915_gem_object_lock(ctx->obj);
>  	err = i915_gem_object_prepare_read(ctx->obj, &needs_clflush);
>  	if (err)
> -		return err;
> +		goto unlock;
>  
>  	page = i915_gem_object_get_page(ctx->obj, offset >> PAGE_SHIFT);
>  	map = kmap_atomic(page);
> @@ -73,7 +77,9 @@ static int cpu_get(struct context *ctx, unsigned long offset, u32 *v)
>  	kunmap_atomic(map);
>  	i915_gem_object_finish_access(ctx->obj);
>  
> -	return 0;
> +unlock:
> +	i915_gem_object_unlock(ctx->obj);
> +	return err;
>  }
>  
>  static int gtt_set(struct context *ctx, unsigned long offset, u32 v)
> diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c b/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c
> index d176b015353f..f2a307b4146e 100644
> --- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c
> @@ -461,9 +461,10 @@ static int cpu_fill(struct drm_i915_gem_object *obj, u32 value)
>  	unsigned int n, m, need_flush;
>  	int err;
>  
> +	i915_gem_object_lock(obj);
>  	err = i915_gem_object_prepare_write(obj, &need_flush);
>  	if (err)
> -		return err;
> +		goto unlock;
>  
>  	for (n = 0; n < real_page_count(obj); n++) {
>  		u32 *map;
> @@ -479,6 +480,8 @@ static int cpu_fill(struct drm_i915_gem_object *obj, u32 value)
>  	i915_gem_object_finish_access(obj);
>  	obj->read_domains = I915_GEM_DOMAIN_GTT | I915_GEM_DOMAIN_CPU;
>  	obj->write_domain = 0;
> +unlock:
> +	i915_gem_object_unlock(obj);
>  	return 0;
>  }
>  
> @@ -488,9 +491,10 @@ static noinline int cpu_check(struct drm_i915_gem_object *obj,
>  	unsigned int n, m, needs_flush;
>  	int err;
>  
> +	i915_gem_object_lock(obj);
>  	err = i915_gem_object_prepare_read(obj, &needs_flush);
>  	if (err)
> -		return err;
> +		goto unlock;
>  
>  	for (n = 0; n < real_page_count(obj); n++) {
>  		u32 *map;
> @@ -527,6 +531,8 @@ static noinline int cpu_check(struct drm_i915_gem_object *obj,
>  	}
>  
>  	i915_gem_object_finish_access(obj);
> +unlock:
> +	i915_gem_object_unlock(obj);
>  	return err;
>  }
>  
> diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c b/drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c
> index 9c7402ce5bf9..11f734fea3ab 100644
> --- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c
> +++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c
> @@ -1297,7 +1297,9 @@ static int __igt_mmap_revoke(struct drm_i915_private *i915,
>  	}
>  
>  	if (type != I915_MMAP_TYPE_GTT) {
> +		i915_gem_object_lock(obj);
>  		__i915_gem_object_put_pages(obj);
> +		i915_gem_object_unlock(obj);
>  		if (i915_gem_object_has_pages(obj)) {
>  			pr_err("Failed to put-pages object!\n");
>  			err = -EINVAL;
> diff --git a/drivers/gpu/drm/i915/gt/gen6_ppgtt.c b/drivers/gpu/drm/i915/gt/gen6_ppgtt.c
> index 71baf2f8bdf3..3eab2cc751bc 100644
> --- a/drivers/gpu/drm/i915/gt/gen6_ppgtt.c
> +++ b/drivers/gpu/drm/i915/gt/gen6_ppgtt.c
> @@ -351,7 +351,6 @@ static struct i915_vma *pd_vma_create(struct gen6_ppgtt *ppgtt, int size)
>  	i915_active_init(&vma->active, NULL, NULL);
>  
>  	kref_init(&vma->ref);
> -	mutex_init(&vma->pages_mutex);
>  	vma->vm = i915_vm_get(&ggtt->vm);
>  	vma->ops = &pd_vma_ops;
>  	vma->private = ppgtt;
> @@ -439,7 +438,6 @@ struct i915_ppgtt *gen6_ppgtt_create(struct intel_gt *gt)
>  	ppgtt->base.vm.pd_shift = 22;
>  	ppgtt->base.vm.top = 1;
>  
> -	ppgtt->base.vm.bind_async_flags = I915_VMA_LOCAL_BIND;
>  	ppgtt->base.vm.allocate_va_range = gen6_alloc_va_range;
>  	ppgtt->base.vm.clear_range = gen6_ppgtt_clear_range;
>  	ppgtt->base.vm.insert_entries = gen6_ppgtt_insert_entries;
> diff --git a/drivers/gpu/drm/i915/gt/gen8_ppgtt.c b/drivers/gpu/drm/i915/gt/gen8_ppgtt.c
> index e3afd250cd7f..203aa1f9aec7 100644
> --- a/drivers/gpu/drm/i915/gt/gen8_ppgtt.c
> +++ b/drivers/gpu/drm/i915/gt/gen8_ppgtt.c
> @@ -720,7 +720,6 @@ struct i915_ppgtt *gen8_ppgtt_create(struct intel_gt *gt)
>  			goto err_free_pd;
>  	}
>  
> -	ppgtt->vm.bind_async_flags = I915_VMA_LOCAL_BIND;
>  	ppgtt->vm.insert_entries = gen8_ppgtt_insert;
>  	ppgtt->vm.allocate_va_range = gen8_ppgtt_alloc;
>  	ppgtt->vm.clear_range = gen8_ppgtt_clear;
> diff --git a/drivers/gpu/drm/i915/gt/intel_ggtt.c b/drivers/gpu/drm/i915/gt/intel_ggtt.c
> index 33a3f627ddb1..59a4a3ab6bfd 100644
> --- a/drivers/gpu/drm/i915/gt/intel_ggtt.c
> +++ b/drivers/gpu/drm/i915/gt/intel_ggtt.c
> @@ -628,7 +628,6 @@ static int init_aliasing_ppgtt(struct i915_ggtt *ggtt)
>  	ppgtt->vm.allocate_va_range(&ppgtt->vm, &stash, 0, ggtt->vm.total);
>  
>  	ggtt->alias = ppgtt;
> -	ggtt->vm.bind_async_flags |= ppgtt->vm.bind_async_flags;
>  
>  	GEM_BUG_ON(ggtt->vm.vma_ops.bind_vma != ggtt_bind_vma);
>  	ggtt->vm.vma_ops.bind_vma = aliasing_gtt_bind_vma;
> @@ -862,8 +861,6 @@ static int gen8_gmch_probe(struct i915_ggtt *ggtt)
>  	    IS_CHERRYVIEW(i915) /* fails with concurrent use/update */) {
>  		ggtt->vm.insert_entries = bxt_vtd_ggtt_insert_entries__BKL;
>  		ggtt->vm.insert_page    = bxt_vtd_ggtt_insert_page__BKL;
> -		ggtt->vm.bind_async_flags =
> -			I915_VMA_GLOBAL_BIND | I915_VMA_LOCAL_BIND;
>  	}
>  
>  	ggtt->invalidate = gen8_ggtt_invalidate;
> @@ -1429,7 +1426,7 @@ i915_get_ggtt_vma_pages(struct i915_vma *vma)
>  	 * must be the vma->pages. A simple rule is that vma->pages must only
>  	 * be accessed when the obj->mm.pages are pinned.
>  	 */
> -	GEM_BUG_ON(!i915_gem_object_has_pinned_pages(vma->obj));
> +	GEM_BUG_ON(!i915_gem_object_has_pages(vma->obj));
>  
>  	switch (vma->ggtt_view.type) {
>  	default:
> diff --git a/drivers/gpu/drm/i915/gt/intel_gtt.h b/drivers/gpu/drm/i915/gt/intel_gtt.h
> index 496f8236ca09..1bb447ef824b 100644
> --- a/drivers/gpu/drm/i915/gt/intel_gtt.h
> +++ b/drivers/gpu/drm/i915/gt/intel_gtt.h
> @@ -226,8 +226,6 @@ struct i915_address_space {
>  	u64 total;		/* size addr space maps (ex. 2GB for ggtt) */
>  	u64 reserved;		/* size addr space reserved */
>  
> -	unsigned int bind_async_flags;
> -
>  	/*
>  	 * Each active user context has its own address space (in full-ppgtt).
>  	 * Since the vm may be shared between multiple contexts, we count how
> diff --git a/drivers/gpu/drm/i915/gt/intel_ppgtt.c b/drivers/gpu/drm/i915/gt/intel_ppgtt.c
> index 1f80d79a6588..68dd3f8b79d0 100644
> --- a/drivers/gpu/drm/i915/gt/intel_ppgtt.c
> +++ b/drivers/gpu/drm/i915/gt/intel_ppgtt.c
> @@ -271,6 +271,7 @@ void i915_vm_free_pt_stash(struct i915_address_space *vm,
>  int ppgtt_set_pages(struct i915_vma *vma)
>  {
>  	GEM_BUG_ON(vma->pages);
> +	GEM_BUG_ON(IS_ERR_OR_NULL(vma->obj->mm.pages));
>  
>  	vma->pages = vma->obj->mm.pages;
>  	vma->page_sizes = vma->obj->mm.page_sizes;
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index e998f25f30a3..0fbe438c4523 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -335,12 +335,16 @@ i915_gem_shmem_pread(struct drm_i915_gem_object *obj,
>  	u64 remain;
>  	int ret;
>  
> +	i915_gem_object_lock(obj);
>  	ret = i915_gem_object_prepare_read(obj, &needs_clflush);
> -	if (ret)
> +	if (ret) {
> +		i915_gem_object_unlock(obj);
>  		return ret;
> +	}
>  
>  	fence = i915_gem_object_lock_fence(obj);
>  	i915_gem_object_finish_access(obj);
> +	i915_gem_object_unlock(obj);
>  	if (!fence)
>  		return -ENOMEM;
>  
> @@ -734,12 +738,16 @@ i915_gem_shmem_pwrite(struct drm_i915_gem_object *obj,
>  	u64 remain;
>  	int ret;
>  
> +	i915_gem_object_lock(obj);
>  	ret = i915_gem_object_prepare_write(obj, &needs_clflush);
> -	if (ret)
> +	if (ret) {
> +		i915_gem_object_unlock(obj);
>  		return ret;
> +	}
>  
>  	fence = i915_gem_object_lock_fence(obj);
>  	i915_gem_object_finish_access(obj);
> +	i915_gem_object_unlock(obj);
>  	if (!fence)
>  		return -ENOMEM;
>  
> @@ -1063,7 +1071,7 @@ i915_gem_madvise_ioctl(struct drm_device *dev, void *data,
>  	if (!obj)
>  		return -ENOENT;
>  
> -	err = mutex_lock_interruptible(&obj->mm.lock);
> +	err = i915_gem_object_lock_interruptible(obj);
>  	if (err)
>  		goto out;
>  
> @@ -1109,7 +1117,7 @@ i915_gem_madvise_ioctl(struct drm_device *dev, void *data,
>  		i915_gem_object_truncate(obj);
>  
>  	args->retained = obj->mm.madv != __I915_MADV_PURGED;
> -	mutex_unlock(&obj->mm.lock);
> +	i915_gem_object_unlock(obj);
>  
>  out:
>  	i915_gem_object_put(obj);
> diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c
> index 7278cc7c40b9..633f335ce892 100644
> --- a/drivers/gpu/drm/i915/i915_vma.c
> +++ b/drivers/gpu/drm/i915/i915_vma.c
> @@ -116,7 +116,6 @@ vma_create(struct drm_i915_gem_object *obj,
>  		return ERR_PTR(-ENOMEM);
>  
>  	kref_init(&vma->ref);
> -	mutex_init(&vma->pages_mutex);
>  	vma->vm = i915_vm_get(vm);
>  	vma->ops = &vm->vma_ops;
>  	vma->obj = obj;
> @@ -295,16 +294,31 @@ struct i915_vma_work {
>  	struct i915_address_space *vm;
>  	struct i915_vm_pt_stash stash;
>  	struct i915_vma *vma;
> -	struct drm_i915_gem_object *pinned;
>  	struct i915_sw_dma_fence_cb cb;
>  	enum i915_cache_level cache_level;
>  	unsigned int flags;
>  };
>  
> +static void pin_pages(struct i915_vma *vma, unsigned int bind)
> +{
> +	bind = hweight32(bind & I915_VMA_BIND_MASK);
> +	while (bind--)
> +		__i915_gem_object_pin_pages(vma->obj);
> +}
> +
>  static int __vma_bind(struct dma_fence_work *work)
>  {
>  	struct i915_vma_work *vw = container_of(work, typeof(*vw), base);
>  	struct i915_vma *vma = vw->vma;
> +	int err;
> +
> +	if (!vma->pages) {
> +		err = vma->ops->set_pages(vma);
> +		if (err) {
> +			atomic_or(I915_VMA_ERROR, &vma->flags);
> +			return err;
> +		}
> +	}
>  
>  	vma->ops->bind_vma(vw->vm, &vw->stash,
>  			   vma, vw->cache_level, vw->flags);
> @@ -315,8 +329,8 @@ static void __vma_release(struct dma_fence_work *work)
>  {
>  	struct i915_vma_work *vw = container_of(work, typeof(*vw), base);
>  
> -	if (vw->pinned)
> -		__i915_gem_object_unpin_pages(vw->pinned);
> +	if (work->dma.error && vw->flags)
> +		atomic_or(I915_VMA_ERROR, &vw->vma->flags);
>  
>  	i915_vm_free_pt_stash(vw->vm, &vw->stash);
>  	i915_vm_put(vw->vm);
> @@ -389,6 +403,7 @@ int i915_vma_bind(struct i915_vma *vma,
>  		  u32 flags,
>  		  struct i915_vma_work *work)
>  {
> +	struct dma_fence *prev;
>  	u32 bind_flags;
>  	u32 vma_flags;
>  
> @@ -413,41 +428,39 @@ int i915_vma_bind(struct i915_vma *vma,
>  	if (bind_flags == 0)
>  		return 0;
>  
> -	GEM_BUG_ON(!vma->pages);
> -
>  	trace_i915_vma_bind(vma, bind_flags);
> -	if (work && bind_flags & vma->vm->bind_async_flags) {
> -		struct dma_fence *prev;
>  
> -		work->vma = vma;
> -		work->cache_level = cache_level;
> -		work->flags = bind_flags;
> +	work->vma = vma;
> +	work->cache_level = cache_level;
> +	work->flags = bind_flags;
> +	work->base.dma.error = 0; /* enable the queue_work() */
>  
> -		/*
> -		 * Note we only want to chain up to the migration fence on
> -		 * the pages (not the object itself). As we don't track that,
> -		 * yet, we have to use the exclusive fence instead.
> -		 *
> -		 * Also note that we do not want to track the async vma as
> -		 * part of the obj->resv->excl_fence as it only affects
> -		 * execution and not content or object's backing store lifetime.
> -		 */
> -		prev = i915_active_set_exclusive(&vma->active, &work->base.dma);
> -		if (prev) {
> -			__i915_sw_fence_await_dma_fence(&work->base.chain,
> -							prev,
> -							&work->cb);
> -			dma_fence_put(prev);
> -		}
> -
> -		work->base.dma.error = 0; /* enable the queue_work() */
> +	/*
> +	 * Note we only want to chain up to the migration fence on
> +	 * the pages (not the object itself). As we don't track that,
> +	 * yet, we have to use the exclusive fence instead.
> +	 *
> +	 * Also note that we do not want to track the async vma as
> +	 * part of the obj->resv->excl_fence as it only affects
> +	 * execution and not content or object's backing store lifetime.
> +	 */
> +	prev = i915_active_set_exclusive(&vma->active, &work->base.dma);
> +	if (prev) {
> +		__i915_sw_fence_await_dma_fence(&work->base.chain,
> +						prev,
> +						&work->cb);
> +		dma_fence_put(prev);
> +	}
>  
> -		if (vma->obj) {
> -			__i915_gem_object_pin_pages(vma->obj);
> -			work->pinned = vma->obj;
> +	if (vma->obj) {
> +		if (IS_ERR(vma->obj->mm.pages)) {
> +			i915_sw_fence_set_error_once(&work->base.chain,
> +						     PTR_ERR(vma->obj->mm.pages));
> +			atomic_or(I915_VMA_ERROR, &vma->flags);
> +			bind_flags = 0;
>  		}
> -	} else {
> -		vma->ops->bind_vma(vma->vm, NULL, vma, cache_level, bind_flags);
> +
> +		pin_pages(vma, bind_flags);
>  	}
>  
>  	atomic_or(bind_flags, &vma->flags);
> @@ -690,6 +703,9 @@ i915_vma_insert(struct i915_vma *vma, u64 size, u64 alignment, u64 flags)
>  		if (ret)
>  			return ret;
>  	} else {
> +		const unsigned long page_sizes =
> +			INTEL_INFO(vma->vm->i915)->page_sizes;
> +
>  		/*
>  		 * We only support huge gtt pages through the 48b PPGTT,
>  		 * however we also don't want to force any alignment for
> @@ -699,7 +715,7 @@ i915_vma_insert(struct i915_vma *vma, u64 size, u64 alignment, u64 flags)
>  		 * forseeable future. See also i915_ggtt_offset().
>  		 */
>  		if (upper_32_bits(end - 1) &&
> -		    vma->page_sizes.sg > I915_GTT_PAGE_SIZE) {
> +		    page_sizes > I915_GTT_PAGE_SIZE) {
>  			/*
>  			 * We can't mix 64K and 4K PTEs in the same page-table
>  			 * (2M block), and so to avoid the ugliness and
> @@ -707,7 +723,7 @@ i915_vma_insert(struct i915_vma *vma, u64 size, u64 alignment, u64 flags)
>  			 * objects to 2M.
>  			 */
>  			u64 page_alignment =
> -				rounddown_pow_of_two(vma->page_sizes.sg |
> +				rounddown_pow_of_two(page_sizes |
>  						     I915_GTT_PAGE_SIZE_2M);
>  
>  			/*
> @@ -719,7 +735,7 @@ i915_vma_insert(struct i915_vma *vma, u64 size, u64 alignment, u64 flags)
>  
>  			alignment = max(alignment, page_alignment);
>  
> -			if (vma->page_sizes.sg & I915_GTT_PAGE_SIZE_64K)
> +			if (page_sizes & I915_GTT_PAGE_SIZE_64K)
>  				size = round_up(size, I915_GTT_PAGE_SIZE_2M);
>  		}
>  
> @@ -796,74 +812,6 @@ bool i915_vma_pin_inplace(struct i915_vma *vma, unsigned int flags)
>  	return pinned;
>  }
>  
> -int i915_vma_get_pages(struct i915_vma *vma)
> -{
> -	int err = 0;
> -
> -	if (atomic_add_unless(&vma->pages_count, 1, 0))
> -		return 0;
> -
> -	/* Allocations ahoy! */
> -	if (mutex_lock_interruptible(&vma->pages_mutex))
> -		return -EINTR;
> -
> -	if (!atomic_read(&vma->pages_count)) {
> -		if (vma->obj) {
> -			err = i915_gem_object_pin_pages(vma->obj);
> -			if (err)
> -				goto unlock;
> -		}
> -
> -		err = vma->ops->set_pages(vma);
> -		if (err) {
> -			if (vma->obj)
> -				i915_gem_object_unpin_pages(vma->obj);
> -			goto unlock;
> -		}
> -	}
> -	atomic_inc(&vma->pages_count);
> -
> -unlock:
> -	mutex_unlock(&vma->pages_mutex);
> -
> -	return err;
> -}
> -
> -static void __vma_put_pages(struct i915_vma *vma, unsigned int count)
> -{
> -	/* We allocate under vma_get_pages, so beware the shrinker */
> -	mutex_lock_nested(&vma->pages_mutex, SINGLE_DEPTH_NESTING);
> -	GEM_BUG_ON(atomic_read(&vma->pages_count) < count);
> -	if (atomic_sub_return(count, &vma->pages_count) == 0) {
> -		vma->ops->clear_pages(vma);
> -		GEM_BUG_ON(vma->pages);
> -		if (vma->obj)
> -			i915_gem_object_unpin_pages(vma->obj);
> -	}
> -	mutex_unlock(&vma->pages_mutex);
> -}
> -
> -void i915_vma_put_pages(struct i915_vma *vma)
> -{
> -	if (atomic_add_unless(&vma->pages_count, -1, 1))
> -		return;
> -
> -	__vma_put_pages(vma, 1);
> -}
> -
> -static void vma_unbind_pages(struct i915_vma *vma)
> -{
> -	unsigned int count;
> -
> -	lockdep_assert_held(&vma->vm->mutex);
> -
> -	/* The upper portion of pages_count is the number of bindings */
> -	count = atomic_read(&vma->pages_count);
> -	count >>= I915_VMA_PAGES_BIAS;
> -	if (count)
> -		__vma_put_pages(vma, count | count << I915_VMA_PAGES_BIAS);
> -}
> -
>  static int __wait_for_unbind(struct i915_vma *vma, unsigned int flags)
>  {
>  	return __i915_vma_wait_excl(vma, false, flags);
> @@ -885,9 +833,11 @@ int i915_vma_pin(struct i915_vma *vma, u64 size, u64 alignment, u64 flags)
>  	if (i915_vma_pin_inplace(vma, flags & I915_VMA_BIND_MASK))
>  		return 0;
>  
> -	err = i915_vma_get_pages(vma);
> -	if (err)
> -		return err;
> +	if (vma->obj) {
> +		err = i915_gem_object_pin_pages(vma->obj);
> +		if (err)
> +			return err;
> +	}
>  
>  	if (flags & PIN_GLOBAL)
>  		wakeref = intel_runtime_pm_get(&vma->vm->i915->runtime_pm);
> @@ -896,26 +846,21 @@ int i915_vma_pin(struct i915_vma *vma, u64 size, u64 alignment, u64 flags)
>  	if (err)
>  		goto err_rpm;
>  
> -	if (flags & vma->vm->bind_async_flags) {
> -		work = i915_vma_work();
> -		if (!work) {
> -			err = -ENOMEM;
> -			goto err_rpm;
> -		}
> +	work = i915_vma_work();
> +	if (!work) {
> +		err = -ENOMEM;
> +		goto err_rpm;
> +	}
>  
> -		work->vm = i915_vm_get(vma->vm);
> +	work->vm = i915_vm_get(vma->vm);
>  
> -		/* Allocate enough page directories to used PTE */
> -		if (vma->vm->allocate_va_range) {
> -			i915_vm_alloc_pt_stash(vma->vm,
> -					       &work->stash,
> -					       vma->size);
> +	/* Allocate enough page directories to used PTE */
> +	if (vma->vm->allocate_va_range) {
> +		i915_vm_alloc_pt_stash(vma->vm, &work->stash, vma->size);
>  
> -			err = i915_vm_pin_pt_stash(vma->vm,
> -						   &work->stash);
> -			if (err)
> -				goto err_fence;
> -		}
> +		err = i915_vm_pin_pt_stash(vma->vm, &work->stash);
> +		if (err)
> +			goto err_fence;
>  	}
>  
>  	/*
> @@ -980,16 +925,12 @@ int i915_vma_pin(struct i915_vma *vma, u64 size, u64 alignment, u64 flags)
>  			__i915_vma_set_map_and_fenceable(vma);
>  	}
>  
> -	GEM_BUG_ON(!vma->pages);
>  	err = i915_vma_bind(vma,
>  			    vma->obj ? vma->obj->cache_level : 0,
>  			    flags, work);
>  	if (err)
>  		goto err_remove;
>  
> -	/* There should only be at most 2 active bindings (user, global) */
> -	GEM_BUG_ON(bound + I915_VMA_PAGES_ACTIVE < bound);
> -	atomic_add(I915_VMA_PAGES_ACTIVE, &vma->pages_count);
>  	list_move_tail(&vma->vm_link, &vma->vm->bound_list);
>  	GEM_BUG_ON(!i915_vma_is_active(vma));
>  
> @@ -1008,12 +949,12 @@ int i915_vma_pin(struct i915_vma *vma, u64 size, u64 alignment, u64 flags)
>  err_unlock:
>  	mutex_unlock(&vma->vm->mutex);
>  err_fence:
> -	if (work)
> -		dma_fence_work_commit_imm(&work->base);
> +	dma_fence_work_commit_imm(&work->base);
>  err_rpm:
>  	if (wakeref)
>  		intel_runtime_pm_put(&vma->vm->i915->runtime_pm, wakeref);
> -	i915_vma_put_pages(vma);
> +	if (vma->obj)
> +		i915_gem_object_unpin_pages(vma->obj);
>  	return err;
>  }
>  
> @@ -1274,6 +1215,8 @@ int i915_vma_move_to_active(struct i915_vma *vma,
>  
>  void __i915_vma_evict(struct i915_vma *vma)
>  {
> +	int count;
> +
>  	GEM_BUG_ON(i915_vma_is_pinned(vma));
>  
>  	if (i915_vma_is_map_and_fenceable(vma)) {
> @@ -1308,11 +1251,19 @@ void __i915_vma_evict(struct i915_vma *vma)
>  		trace_i915_vma_unbind(vma);
>  		vma->ops->unbind_vma(vma->vm, vma);
>  	}
> +	count = hweight32(atomic_read(&vma->flags) & I915_VMA_BIND_MASK);
>  	atomic_and(~(I915_VMA_BIND_MASK | I915_VMA_ERROR | I915_VMA_GGTT_WRITE),
>  		   &vma->flags);
>  
>  	i915_vma_detach(vma);
> -	vma_unbind_pages(vma);
> +
> +	if (vma->pages)
> +		vma->ops->clear_pages(vma);
> +
> +	if (vma->obj) {
> +		while (count--)
> +			__i915_gem_object_unpin_pages(vma->obj);
> +	}
>  }
>  
>  int __i915_vma_unbind(struct i915_vma *vma)
> diff --git a/drivers/gpu/drm/i915/i915_vma_types.h b/drivers/gpu/drm/i915/i915_vma_types.h
> index 9e9082dc8f4b..02c1640bb034 100644
> --- a/drivers/gpu/drm/i915/i915_vma_types.h
> +++ b/drivers/gpu/drm/i915/i915_vma_types.h
> @@ -251,11 +251,6 @@ struct i915_vma {
>  
>  	struct i915_active active;
>  
> -#define I915_VMA_PAGES_BIAS 24
> -#define I915_VMA_PAGES_ACTIVE (BIT(24) | 1)
> -	atomic_t pages_count; /* number of active binds to the pages */
> -	struct mutex pages_mutex; /* protect acquire/release of backing pages */
> -
>  	/**
>  	 * Support different GGTT views into the same object.
>  	 * This means there can be multiple VMA mappings per object and per VM.
> @@ -279,4 +274,3 @@ struct i915_vma {
>  };
>  
>  #endif
> -
> diff --git a/drivers/gpu/drm/i915/mm/i915_acquire_ctx.c b/drivers/gpu/drm/i915/mm/i915_acquire_ctx.c
> index d1c3b958c15d..02b653328b9d 100644
> --- a/drivers/gpu/drm/i915/mm/i915_acquire_ctx.c
> +++ b/drivers/gpu/drm/i915/mm/i915_acquire_ctx.c
> @@ -89,8 +89,18 @@ int i915_acquire_ctx_lock(struct i915_acquire_ctx *ctx,
>  	return err;
>  }
>  
> -int i915_acquire_mm(struct i915_acquire_ctx *acquire)
> +int i915_acquire_mm(struct i915_acquire_ctx *ctx)
>  {
> +	struct i915_acquire *lnk;
> +	int err;
> +
> +	for (lnk = ctx->locked; lnk; lnk = lnk->next) {
> +		err = __i915_gem_object_get_pages_locked(lnk->obj);
> +		if (err)
> +			return err;
> +	}
> +
> +	i915_acquire_ctx_done(ctx);
>  	return 0;
>  }
>  
> diff --git a/drivers/gpu/drm/i915/selftests/i915_gem_gtt.c b/drivers/gpu/drm/i915/selftests/i915_gem_gtt.c
> index af8205a2bd8f..e5e6973eb6ea 100644
> --- a/drivers/gpu/drm/i915/selftests/i915_gem_gtt.c
> +++ b/drivers/gpu/drm/i915/selftests/i915_gem_gtt.c
> @@ -1245,9 +1245,9 @@ static void track_vma_bind(struct i915_vma *vma)
>  	__i915_gem_object_pin_pages(obj);
>  
>  	GEM_BUG_ON(vma->pages);
> -	atomic_set(&vma->pages_count, I915_VMA_PAGES_ACTIVE);
> -	__i915_gem_object_pin_pages(obj);
>  	vma->pages = obj->mm.pages;
> +	__i915_gem_object_pin_pages(obj);
> +	atomic_or(I915_VMA_GLOBAL_BIND, &vma->flags);
>  
>  	mutex_lock(&vma->vm->mutex);
>  	list_add_tail(&vma->vm_link, &vma->vm->bound_list);
> diff --git a/drivers/gpu/drm/i915/selftests/intel_memory_region.c b/drivers/gpu/drm/i915/selftests/intel_memory_region.c
> index 6e80d99048e4..8d9fdf591514 100644
> --- a/drivers/gpu/drm/i915/selftests/intel_memory_region.c
> +++ b/drivers/gpu/drm/i915/selftests/intel_memory_region.c
> @@ -24,6 +24,15 @@
>  #include "selftests/igt_flush_test.h"
>  #include "selftests/i915_random.h"
>  
> +static void close_object(struct drm_i915_gem_object *obj)
> +{
> +	i915_gem_object_lock(obj);
> +	__i915_gem_object_put_pages(obj);
> +	i915_gem_object_unlock(obj);
> +
> +	i915_gem_object_put(obj);
> +}
> +
>  static void close_objects(struct intel_memory_region *mem,
>  			  struct list_head *objects)
>  {
> @@ -33,10 +42,9 @@ static void close_objects(struct intel_memory_region *mem,
>  	list_for_each_entry_safe(obj, on, objects, st_link) {
>  		if (i915_gem_object_has_pinned_pages(obj))
>  			i915_gem_object_unpin_pages(obj);
> -		/* No polluting the memory region between tests */
> -		__i915_gem_object_put_pages(obj);
>  		list_del(&obj->st_link);
> -		i915_gem_object_put(obj);
> +		/* No polluting the memory region between tests */
> +		close_object(obj);
>  	}
>  
>  	cond_resched();
> @@ -124,9 +132,8 @@ igt_object_create(struct intel_memory_region *mem,
>  static void igt_object_release(struct drm_i915_gem_object *obj)
>  {
>  	i915_gem_object_unpin_pages(obj);
> -	__i915_gem_object_put_pages(obj);
>  	list_del(&obj->st_link);
> -	i915_gem_object_put(obj);
> +	close_object(obj);
>  }
>  
>  static int igt_mock_contiguous(void *arg)


_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread

* [Intel-gfx] ✓ Fi.CI.BAT: success for series starting with [01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait (rev2)
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (69 preceding siblings ...)
  2020-07-15 15:42 ` [Intel-gfx] ✗ Fi.CI.SPARSE: " Patchwork
@ 2020-07-15 16:03 ` Patchwork
  2020-07-15 19:55 ` [Intel-gfx] ✗ Fi.CI.IGT: failure " Patchwork
  2020-07-23 20:32 ` [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Dave Airlie
  72 siblings, 0 replies; 156+ messages in thread
From: Patchwork @ 2020-07-15 16:03 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx


== Series Details ==

Series: series starting with [01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait (rev2)
URL   : https://patchwork.freedesktop.org/series/79517/
State : success

== Summary ==

CI Bug Log - changes from CI_DRM_8750 -> Patchwork_18179
====================================================

Summary
-------

  **SUCCESS**

  No regressions found.

  External URL: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/index.html

Known issues
------------

  Here are the changes found in Patchwork_18179 that come from known issues:

### IGT changes ###

#### Issues hit ####

  * igt@gem_exec_suspend@basic-s3:
    - fi-tgl-u2:          [PASS][1] -> [FAIL][2] ([i915#1888]) +1 similar issue
   [1]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/fi-tgl-u2/igt@gem_exec_suspend@basic-s3.html
   [2]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/fi-tgl-u2/igt@gem_exec_suspend@basic-s3.html

  * igt@gem_flink_basic@flink-lifetime:
    - fi-tgl-y:           [PASS][3] -> [DMESG-WARN][4] ([i915#402]) +1 similar issue
   [3]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/fi-tgl-y/igt@gem_flink_basic@flink-lifetime.html
   [4]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/fi-tgl-y/igt@gem_flink_basic@flink-lifetime.html

  * igt@i915_pm_rpm@basic-pci-d3-state:
    - fi-tgl-y:           [PASS][5] -> [DMESG-WARN][6] ([i915#1982])
   [5]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/fi-tgl-y/igt@i915_pm_rpm@basic-pci-d3-state.html
   [6]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/fi-tgl-y/igt@i915_pm_rpm@basic-pci-d3-state.html

  * igt@kms_flip@basic-flip-vs-wf_vblank@c-edp1:
    - fi-icl-u2:          [PASS][7] -> [DMESG-WARN][8] ([i915#1982]) +1 similar issue
   [7]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/fi-icl-u2/igt@kms_flip@basic-flip-vs-wf_vblank@c-edp1.html
   [8]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/fi-icl-u2/igt@kms_flip@basic-flip-vs-wf_vblank@c-edp1.html

  
#### Possible fixes ####

  * igt@i915_module_load@reload:
    - fi-byt-j1900:       [DMESG-WARN][9] ([i915#1982]) -> [PASS][10]
   [9]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/fi-byt-j1900/igt@i915_module_load@reload.html
   [10]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/fi-byt-j1900/igt@i915_module_load@reload.html
    - fi-bxt-dsi:         [DMESG-WARN][11] ([i915#1635] / [i915#1982]) -> [PASS][12]
   [11]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/fi-bxt-dsi/igt@i915_module_load@reload.html
   [12]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/fi-bxt-dsi/igt@i915_module_load@reload.html
    - fi-tgl-u2:          [DMESG-WARN][13] ([i915#402]) -> [PASS][14]
   [13]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/fi-tgl-u2/igt@i915_module_load@reload.html
   [14]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/fi-tgl-u2/igt@i915_module_load@reload.html
    - fi-tgl-y:           [DMESG-WARN][15] ([i915#1982]) -> [PASS][16]
   [15]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/fi-tgl-y/igt@i915_module_load@reload.html
   [16]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/fi-tgl-y/igt@i915_module_load@reload.html

  * igt@i915_pm_rpm@basic-pci-d3-state:
    - {fi-tgl-dsi}:       [DMESG-WARN][17] ([i915#1982]) -> [PASS][18]
   [17]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/fi-tgl-dsi/igt@i915_pm_rpm@basic-pci-d3-state.html
   [18]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/fi-tgl-dsi/igt@i915_pm_rpm@basic-pci-d3-state.html

  * igt@i915_selftest@live@gt_lrc:
    - fi-tgl-u2:          [DMESG-FAIL][19] ([i915#1233]) -> [PASS][20]
   [19]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/fi-tgl-u2/igt@i915_selftest@live@gt_lrc.html
   [20]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/fi-tgl-u2/igt@i915_selftest@live@gt_lrc.html

  * igt@kms_cursor_legacy@basic-busy-flip-before-cursor-atomic:
    - {fi-kbl-7560u}:     [DMESG-WARN][21] ([i915#1982]) -> [PASS][22]
   [21]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/fi-kbl-7560u/igt@kms_cursor_legacy@basic-busy-flip-before-cursor-atomic.html
   [22]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/fi-kbl-7560u/igt@kms_cursor_legacy@basic-busy-flip-before-cursor-atomic.html
    - fi-bsw-kefka:       [DMESG-WARN][23] ([i915#1982]) -> [PASS][24] +1 similar issue
   [23]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/fi-bsw-kefka/igt@kms_cursor_legacy@basic-busy-flip-before-cursor-atomic.html
   [24]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/fi-bsw-kefka/igt@kms_cursor_legacy@basic-busy-flip-before-cursor-atomic.html

  * igt@vgem_basic@setversion:
    - fi-tgl-y:           [DMESG-WARN][25] ([i915#402]) -> [PASS][26]
   [25]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/fi-tgl-y/igt@vgem_basic@setversion.html
   [26]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/fi-tgl-y/igt@vgem_basic@setversion.html

  
#### Warnings ####

  * igt@i915_pm_rpm@module-reload:
    - fi-kbl-x1275:       [DMESG-FAIL][27] ([i915#62]) -> [DMESG-FAIL][28] ([i915#62] / [i915#95])
   [27]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/fi-kbl-x1275/igt@i915_pm_rpm@module-reload.html
   [28]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/fi-kbl-x1275/igt@i915_pm_rpm@module-reload.html

  * igt@kms_cursor_legacy@basic-flip-before-cursor-legacy:
    - fi-kbl-x1275:       [DMESG-WARN][29] ([i915#62] / [i915#92]) -> [DMESG-WARN][30] ([i915#62] / [i915#92] / [i915#95]) +2 similar issues
   [29]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/fi-kbl-x1275/igt@kms_cursor_legacy@basic-flip-before-cursor-legacy.html
   [30]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/fi-kbl-x1275/igt@kms_cursor_legacy@basic-flip-before-cursor-legacy.html

  * igt@kms_flip@basic-flip-vs-modeset@a-dp1:
    - fi-kbl-x1275:       [DMESG-WARN][31] ([i915#62] / [i915#92] / [i915#95]) -> [DMESG-WARN][32] ([i915#62] / [i915#92]) +4 similar issues
   [31]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/fi-kbl-x1275/igt@kms_flip@basic-flip-vs-modeset@a-dp1.html
   [32]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/fi-kbl-x1275/igt@kms_flip@basic-flip-vs-modeset@a-dp1.html

  
  {name}: This element is suppressed. This means it is ignored when computing
          the status of the difference (SUCCESS, WARNING, or FAILURE).

  [i915#1233]: https://gitlab.freedesktop.org/drm/intel/issues/1233
  [i915#1635]: https://gitlab.freedesktop.org/drm/intel/issues/1635
  [i915#1888]: https://gitlab.freedesktop.org/drm/intel/issues/1888
  [i915#1982]: https://gitlab.freedesktop.org/drm/intel/issues/1982
  [i915#402]: https://gitlab.freedesktop.org/drm/intel/issues/402
  [i915#62]: https://gitlab.freedesktop.org/drm/intel/issues/62
  [i915#92]: https://gitlab.freedesktop.org/drm/intel/issues/92
  [i915#95]: https://gitlab.freedesktop.org/drm/intel/issues/95


Participating hosts (47 -> 40)
------------------------------

  Missing    (7): fi-ilk-m540 fi-hsw-4200u fi-byt-squawks fi-bsw-cyan fi-ctg-p8600 fi-byt-clapper fi-bdw-samus 


Build changes
-------------

  * Linux: CI_DRM_8750 -> Patchwork_18179

  CI-20190529: 20190529
  CI_DRM_8750: 0714e0ca72205b9c38c4b2a09d8d5981637af2fb @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_5735: 21f8204e54c122e4a0f8ca4b59e4b2db8d1ba687 @ git://anongit.freedesktop.org/xorg/app/intel-gpu-tools
  Patchwork_18179: 85f3b20e2a3296847de34a044cd16516b5798d70 @ git://anongit.freedesktop.org/gfx-ci/linux


== Linux commits ==

85f3b20e2a32 drm/i915/gem: Remove timeline nesting from snb relocs
e0e91286252d drm/i915/gt: Enable ring scheduling for gen6/7
eaa788663d0a drm/i915/gt: Implement ring scheduler for gen6/7
3e60e3380101 drm/i915/gt: Infrastructure for ring scheduling
e7bfa5f7c2ca drm/i915/gt: Use client timeline address for seqno writes
0c68e10ca35a drm/i915/gt: Support creation of 'internal' rings
84f48ea3e305 drm/i915/gt: Couple tasklet scheduling for all CS interrupts
b8ed262e9007 Restore "drm/i915: drop engine_pin/unpin_breadcrumbs_irq"
dea960c9f9bc drm/i915: Move saturated workload detection to the GT
ba8e433eb1ad drm/i915: Replace the priority boosting for the display with a deadline
c1be9968e1e1 drm/i915/gt: Specify a deadline for the heartbeat
55f8528025f2 drm/i915: Fair low-latency scheduling
fffa992d3a28 drm/i915/gt: Remove timeslice suppression
53f0f5814c60 drm/i915: Restructure priority inheritance
6423ee0d3c8a drm/i915: Teach the i915_dependency to use a double-lock
56ddf5a9a91a drm/i915/gt: Do not suspend bonded requests if one hangs
2383c96eac81 drm/i915: Replace engine->schedule() with a known request operation
bce06e971029 drm/i915: Remove I915_USER_PRIORITY_SHIFT
000a31affd74 drm/i915: Strip out internal priorities
0618152c0ed3 drm/i915: Lift waiter/signaler iterators
d66ffb59c681 drm/i915/gt: Convert stats.active to plain unsigned int
f21e9ee6ef05 drm/i915/gt: Extract busy-stats for ring-scheduler
f0a8817be8e7 drm/i915/gt: Drop atomic for engine->fw_active tracking
9bc1aca39d8f drm/i915/gt: ce->inflight updates are now serialised
c7b81ba5771a drm/i915/gt: Simplify virtual engine handling for execlists_hold()
9a3abbadb311 drm/i915/gt: Resubmit the virtual engine on schedule-out
0782f33abf51 drm/i915/gt: Defer schedule_out until after the next dequeue
655eecd42876 drm/i915/gt: Decouple inflight virtual engines
28a117fb0f10 drm/i915/gt: Use virtual_engine during execlists_dequeue
a97efde96c6f drm/i915/gt: Free stale request on destroying the virtual engine
364ad467bb9c drm/i915/gt: Replace direct submit with direct call to tasklet
c4ccca3c8899 drm/i915/gt: Check for a completed last request once
8cdadd182cff drm/i915/gt: Decouple completed requests on unwind
35a1391f3c4b drm/i915: Remove unused i915_gem_evict_vm()
cd9d5d1c7e79 drm/i915/gt: Push the wait for the context to bound to the request
f89d50b33136 drm/i915/gt: Acquire backing storage for the context
22e6c1fa7455 drm/i915: Specialise GGTT binding
6cb053b52a1e drm/i915: Hold wakeref for the duration of the vma GGTT binding
69013a1a8c42 drm/i915/gem: Replace i915_gem_object.mm.mutex with reservation_ww_class
90ec819f7195 drm/i915/gem: Pull execbuf dma resv under a single critical section
d3fe2f8baa11 drm/i915: Add an implementation for i915_gem_ww_ctx locking, v2.
9040f3b37f3b drm/i915/gem: Reintroduce multiple passes for reloc processing
c180ca0649ab drm/i915/gem: Include secure batch in common execbuf pinning
2d3e21b38717 drm/i915/gem: Include cmdparser in common execbuf pinning
2ee5a36ee1d9 drm/i915/gem: Bind the fence async for execbuf
d6442826e675 drm/i915/gem: Asynchronous GTT unbinding
dbe18ef2f769 drm/i915/gem: Separate the ww_mutex walker into its own list
339dd5f35caf drm/i915/gem: Assign context id for async work
69b95b2f1b58 drm/i915: Always defer fenced work to the worker
6ac79cc3b262 drm/i915: Add list_for_each_entry_safe_continue_reverse
63d098780e45 drm/i915/gem: Remove the call for no-evict i915_vma_pin
1608dda568ca drm/i915/gem: Break apart the early i915_vma_pin from execbuf object lookup
f3c8a75f8f0d drm/i915/gem: Rename execbuf.bind_link to unbound_link
6bdab21d6597 drm/i915/gem: Don't drop the timeline lock during execbuf
2747217ec2ad drm/i915: Switch to object allocations for page directories
5250859a3025 drm/i915: Preallocate stashes for vma page-directories
638a715d2c30 drm/i915: Soften the tasklet flush frequency before waits
8d7b14253127 drm/i915: Provide a fastpath for waiting on vma bindings
3bfed8a701e1 drm/i915: Make the stale cached active node available for any timeline
db2f42ce7206 drm/i915: Keep the most recently used active-fence upon discard
74f1c669d951 drm/i915: Export a preallocate variant of i915_active_acquire()
e0279073e5d3 drm/i915: Skip taking acquire mutex for no ref->active callback
96f54f935986 drm/i915: Add a couple of missing i915_active_fini()
c0b29c38984e drm/i915: Remove requirement for holding i915_request.lock for breadcrumbs
78b2b4227c1a drm/i915: Remove i915_request.lock requirement for execution callbacks
130497fc98af drm/i915: Reduce i915_request.lock contention for i915_request_wait

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/index.html

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread

* [Intel-gfx] ✗ Fi.CI.IGT: failure for series starting with [01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait (rev2)
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (70 preceding siblings ...)
  2020-07-15 16:03 ` [Intel-gfx] ✓ Fi.CI.BAT: success " Patchwork
@ 2020-07-15 19:55 ` Patchwork
  2020-07-23 20:32 ` [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Dave Airlie
  72 siblings, 0 replies; 156+ messages in thread
From: Patchwork @ 2020-07-15 19:55 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx


== Series Details ==

Series: series starting with [01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait (rev2)
URL   : https://patchwork.freedesktop.org/series/79517/
State : failure

== Summary ==

CI Bug Log - changes from CI_DRM_8750_full -> Patchwork_18179_full
====================================================

Summary
-------

  **FAILURE**

  Serious unknown changes coming with Patchwork_18179_full absolutely need to be
  verified manually.
  
  If you think the reported changes have nothing to do with the changes
  introduced in Patchwork_18179_full, please notify your bug team to allow them
  to document this new failure mode, which will reduce false positives in CI.

  

Possible new issues
-------------------

  Here are the unknown changes that may have been introduced in Patchwork_18179_full:

### IGT changes ###

#### Possible regressions ####

  * igt@gem_ctx_exec@basic-nohangcheck:
    - shard-tglb:         [PASS][1] -> [FAIL][2]
   [1]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/shard-tglb7/igt@gem_ctx_exec@basic-nohangcheck.html
   [2]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/shard-tglb5/igt@gem_ctx_exec@basic-nohangcheck.html

  * igt@gem_ctx_persistence@engines-queued@vecs0:
    - shard-iclb:         [PASS][3] -> [FAIL][4]
   [3]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/shard-iclb1/igt@gem_ctx_persistence@engines-queued@vecs0.html
   [4]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/shard-iclb3/igt@gem_ctx_persistence@engines-queued@vecs0.html

  
Known issues
------------

  Here are the changes found in Patchwork_18179_full that come from known issues:

### IGT changes ###

#### Issues hit ####

  * igt@gem_ctx_param@set-priority-not-supported:
    - shard-snb:          [PASS][5] -> [SKIP][6] ([fdo#109271])
   [5]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/shard-snb4/igt@gem_ctx_param@set-priority-not-supported.html
   [6]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/shard-snb1/igt@gem_ctx_param@set-priority-not-supported.html

  * igt@i915_pm_dc@dc6-psr:
    - shard-iclb:         [PASS][7] -> [FAIL][8] ([i915#454])
   [7]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/shard-iclb3/igt@i915_pm_dc@dc6-psr.html
   [8]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/shard-iclb6/igt@i915_pm_dc@dc6-psr.html

  * igt@kms_big_fb@x-tiled-64bpp-rotate-180:
    - shard-glk:          [PASS][9] -> [DMESG-FAIL][10] ([i915#118] / [i915#95])
   [9]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/shard-glk4/igt@kms_big_fb@x-tiled-64bpp-rotate-180.html
   [10]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/shard-glk8/igt@kms_big_fb@x-tiled-64bpp-rotate-180.html

  * igt@kms_cursor_crc@pipe-a-cursor-suspend:
    - shard-kbl:          [PASS][11] -> [DMESG-WARN][12] ([i915#180]) +6 similar issues
   [11]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/shard-kbl4/igt@kms_cursor_crc@pipe-a-cursor-suspend.html
   [12]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/shard-kbl3/igt@kms_cursor_crc@pipe-a-cursor-suspend.html

  * igt@kms_cursor_legacy@flip-vs-cursor-busy-crc-legacy:
    - shard-skl:          [PASS][13] -> [FAIL][14] ([IGT#5])
   [13]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/shard-skl5/igt@kms_cursor_legacy@flip-vs-cursor-busy-crc-legacy.html
   [14]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/shard-skl6/igt@kms_cursor_legacy@flip-vs-cursor-busy-crc-legacy.html

  * igt@kms_draw_crc@draw-method-xrgb2101010-mmap-gtt-ytiled:
    - shard-glk:          [PASS][15] -> [DMESG-WARN][16] ([i915#1982])
   [15]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/shard-glk5/igt@kms_draw_crc@draw-method-xrgb2101010-mmap-gtt-ytiled.html
   [16]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/shard-glk3/igt@kms_draw_crc@draw-method-xrgb2101010-mmap-gtt-ytiled.html

  * igt@kms_flip@basic-plain-flip@a-edp1:
    - shard-skl:          [PASS][17] -> [DMESG-WARN][18] ([i915#1982]) +9 similar issues
   [17]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/shard-skl8/igt@kms_flip@basic-plain-flip@a-edp1.html
   [18]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/shard-skl5/igt@kms_flip@basic-plain-flip@a-edp1.html

  * igt@kms_flip@flip-vs-expired-vblank-interruptible@a-edp1:
    - shard-skl:          [PASS][19] -> [FAIL][20] ([i915#79]) +1 similar issue
   [19]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/shard-skl5/igt@kms_flip@flip-vs-expired-vblank-interruptible@a-edp1.html
   [20]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/shard-skl6/igt@kms_flip@flip-vs-expired-vblank-interruptible@a-edp1.html

  * igt@kms_flip@flip-vs-expired-vblank-interruptible@c-dp1:
    - shard-apl:          [PASS][21] -> [FAIL][22] ([i915#1635] / [i915#79])
   [21]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/shard-apl8/igt@kms_flip@flip-vs-expired-vblank-interruptible@c-dp1.html
   [22]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/shard-apl4/igt@kms_flip@flip-vs-expired-vblank-interruptible@c-dp1.html

  * igt@kms_flip@flip-vs-expired-vblank@b-hdmi-a1:
    - shard-glk:          [PASS][23] -> [FAIL][24] ([i915#79])
   [23]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/shard-glk2/igt@kms_flip@flip-vs-expired-vblank@b-hdmi-a1.html
   [24]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/shard-glk8/igt@kms_flip@flip-vs-expired-vblank@b-hdmi-a1.html

  * igt@kms_flip@plain-flip-fb-recreate@a-edp1:
    - shard-skl:          [PASS][25] -> [FAIL][26] ([i915#2122])
   [25]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/shard-skl4/igt@kms_flip@plain-flip-fb-recreate@a-edp1.html
   [26]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/shard-skl1/igt@kms_flip@plain-flip-fb-recreate@a-edp1.html

  * igt@kms_frontbuffer_tracking@fbc-1p-offscren-pri-shrfb-draw-blt:
    - shard-kbl:          [PASS][27] -> [DMESG-WARN][28] ([i915#1982])
   [27]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/shard-kbl6/igt@kms_frontbuffer_tracking@fbc-1p-offscren-pri-shrfb-draw-blt.html
   [28]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/shard-kbl3/igt@kms_frontbuffer_tracking@fbc-1p-offscren-pri-shrfb-draw-blt.html
    - shard-tglb:         [PASS][29] -> [DMESG-WARN][30] ([i915#1982]) +1 similar issue
   [29]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/shard-tglb8/igt@kms_frontbuffer_tracking@fbc-1p-offscren-pri-shrfb-draw-blt.html
   [30]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/shard-tglb2/igt@kms_frontbuffer_tracking@fbc-1p-offscren-pri-shrfb-draw-blt.html

  * igt@kms_hdr@bpc-switch-dpms:
    - shard-skl:          [PASS][31] -> [FAIL][32] ([i915#1188])
   [31]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/shard-skl4/igt@kms_hdr@bpc-switch-dpms.html
   [32]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/shard-skl4/igt@kms_hdr@bpc-switch-dpms.html

  * igt@kms_pipe_crc_basic@suspend-read-crc-pipe-a:
    - shard-apl:          [PASS][33] -> [FAIL][34] ([fdo#103375] / [i915#1635])
   [33]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/shard-apl1/igt@kms_pipe_crc_basic@suspend-read-crc-pipe-a.html
   [34]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/shard-apl7/igt@kms_pipe_crc_basic@suspend-read-crc-pipe-a.html

  * igt@kms_psr2_su@page_flip:
    - shard-iclb:         [PASS][35] -> [SKIP][36] ([fdo#109642] / [fdo#111068])
   [35]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/shard-iclb2/igt@kms_psr2_su@page_flip.html
   [36]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/shard-iclb4/igt@kms_psr2_su@page_flip.html

  * igt@kms_psr@psr2_cursor_mmap_gtt:
    - shard-iclb:         [PASS][37] -> [SKIP][38] ([fdo#109441])
   [37]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/shard-iclb2/igt@kms_psr@psr2_cursor_mmap_gtt.html
   [38]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/shard-iclb7/igt@kms_psr@psr2_cursor_mmap_gtt.html

  * igt@kms_psr@psr2_dpms:
    - shard-tglb:         [PASS][39] -> [DMESG-WARN][40] ([i915#402])
   [39]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/shard-tglb7/igt@kms_psr@psr2_dpms.html
   [40]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/shard-tglb5/igt@kms_psr@psr2_dpms.html

  * igt@perf_pmu@semaphore-busy@vcs0:
    - shard-kbl:          [PASS][41] -> [FAIL][42] ([i915#1820]) +3 similar issues
   [41]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/shard-kbl3/igt@perf_pmu@semaphore-busy@vcs0.html
   [42]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/shard-kbl7/igt@perf_pmu@semaphore-busy@vcs0.html

  
#### Possible fixes ####

  * igt@gem_exec_reloc@basic-concurrent0:
    - shard-tglb:         [FAIL][43] ([i915#1930]) -> [PASS][44]
   [43]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/shard-tglb8/igt@gem_exec_reloc@basic-concurrent0.html
   [44]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/shard-tglb5/igt@gem_exec_reloc@basic-concurrent0.html
    - shard-glk:          [FAIL][45] ([i915#1930]) -> [PASS][46]
   [45]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/shard-glk6/igt@gem_exec_reloc@basic-concurrent0.html
   [46]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/shard-glk6/igt@gem_exec_reloc@basic-concurrent0.html
    - shard-apl:          [FAIL][47] ([i915#1635] / [i915#1930]) -> [PASS][48]
   [47]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/shard-apl8/igt@gem_exec_reloc@basic-concurrent0.html
   [48]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/shard-apl4/igt@gem_exec_reloc@basic-concurrent0.html
    - shard-kbl:          [FAIL][49] ([i915#1930]) -> [PASS][50]
   [49]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/shard-kbl1/igt@gem_exec_reloc@basic-concurrent0.html
   [50]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/shard-kbl6/igt@gem_exec_reloc@basic-concurrent0.html
    - shard-iclb:         [FAIL][51] ([i915#1930]) -> [PASS][52]
   [51]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/shard-iclb8/igt@gem_exec_reloc@basic-concurrent0.html
   [52]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/shard-iclb1/igt@gem_exec_reloc@basic-concurrent0.html
    - shard-snb:          [FAIL][53] ([i915#1930]) -> [PASS][54]
   [53]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/shard-snb2/igt@gem_exec_reloc@basic-concurrent0.html
   [54]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/shard-snb4/igt@gem_exec_reloc@basic-concurrent0.html
    - shard-skl:          [FAIL][55] ([i915#1930]) -> [PASS][56]
   [55]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/shard-skl5/igt@gem_exec_reloc@basic-concurrent0.html
   [56]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/shard-skl6/igt@gem_exec_reloc@basic-concurrent0.html

  * igt@gem_exec_reloc@basic-concurrent16:
    - shard-snb:          [TIMEOUT][57] ([i915#1958] / [i915#2119]) -> [PASS][58] +1 similar issue
   [57]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/shard-snb6/igt@gem_exec_reloc@basic-concurrent16.html
   [58]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/shard-snb2/igt@gem_exec_reloc@basic-concurrent16.html
    - shard-iclb:         [INCOMPLETE][59] ([i915#1958]) -> [PASS][60]
   [59]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/shard-iclb6/igt@gem_exec_reloc@basic-concurrent16.html
   [60]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/shard-iclb6/igt@gem_exec_reloc@basic-concurrent16.html
    - shard-skl:          [INCOMPLETE][61] ([i915#1958]) -> [PASS][62]
   [61]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/shard-skl3/igt@gem_exec_reloc@basic-concurrent16.html
   [62]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/shard-skl2/igt@gem_exec_reloc@basic-concurrent16.html
    - shard-kbl:          [INCOMPLETE][63] ([i915#1958]) -> [PASS][64]
   [63]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/shard-kbl6/igt@gem_exec_reloc@basic-concurrent16.html
   [64]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/shard-kbl3/igt@gem_exec_reloc@basic-concurrent16.html
    - shard-apl:          [INCOMPLETE][65] ([i915#1635] / [i915#1958]) -> [PASS][66]
   [65]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/shard-apl3/igt@gem_exec_reloc@basic-concurrent16.html
   [66]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/shard-apl8/igt@gem_exec_reloc@basic-concurrent16.html
    - shard-tglb:         [INCOMPLETE][67] ([i915#1958]) -> [PASS][68]
   [67]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/shard-tglb8/igt@gem_exec_reloc@basic-concurrent16.html
   [68]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/shard-tglb2/igt@gem_exec_reloc@basic-concurrent16.html
    - shard-glk:          [INCOMPLETE][69] ([i915#1958] / [i915#58] / [k.org#198133]) -> [PASS][70]
   [69]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/shard-glk5/igt@gem_exec_reloc@basic-concurrent16.html
   [70]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/shard-glk3/igt@gem_exec_reloc@basic-concurrent16.html

  * igt@gem_exec_whisper@basic-queues-priority:
    - shard-glk:          [DMESG-WARN][71] ([i915#118] / [i915#95]) -> [PASS][72]
   [71]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/shard-glk4/igt@gem_exec_whisper@basic-queues-priority.html
   [72]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/shard-glk7/igt@gem_exec_whisper@basic-queues-priority.html

  * igt@gem_softpin@noreloc-interruptible:
    - shard-snb:          [INCOMPLETE][73] ([i915#82]) -> [PASS][74]
   [73]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/shard-snb6/igt@gem_softpin@noreloc-interruptible.html
   [74]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/shard-snb2/igt@gem_softpin@noreloc-interruptible.html

  * igt@i915_module_load@reload:
    - shard-tglb:         [DMESG-WARN][75] ([i915#402]) -> [PASS][76]
   [75]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/shard-tglb8/igt@i915_module_load@reload.html
   [76]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/shard-tglb6/igt@i915_module_load@reload.html

  * igt@i915_suspend@debugfs-reader:
    - shard-kbl:          [DMESG-WARN][77] ([i915#180]) -> [PASS][78] +3 similar issues
   [77]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/shard-kbl4/igt@i915_suspend@debugfs-reader.html
   [78]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/shard-kbl6/igt@i915_suspend@debugfs-reader.html

  * igt@kms_cursor_crc@pipe-b-cursor-64x64-sliding:
    - shard-skl:          [FAIL][79] ([i915#54]) -> [PASS][80]
   [79]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/shard-skl1/igt@kms_cursor_crc@pipe-b-cursor-64x64-sliding.html
   [80]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/shard-skl4/igt@kms_cursor_crc@pipe-b-cursor-64x64-sliding.html

  * igt@kms_draw_crc@draw-method-xrgb2101010-mmap-gtt-untiled:
    - shard-glk:          [DMESG-WARN][81] ([i915#1982]) -> [PASS][82] +1 similar issue
   [81]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/shard-glk7/igt@kms_draw_crc@draw-method-xrgb2101010-mmap-gtt-untiled.html
   [82]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/shard-glk3/igt@kms_draw_crc@draw-method-xrgb2101010-mmap-gtt-untiled.html

  * igt@kms_flip@2x-flip-vs-expired-vblank-interruptible@ac-hdmi-a1-hdmi-a2:
    - shard-glk:          [FAIL][83] ([i915#79]) -> [PASS][84]
   [83]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/shard-glk4/igt@kms_flip@2x-flip-vs-expired-vblank-interruptible@ac-hdmi-a1-hdmi-a2.html
   [84]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/shard-glk7/igt@kms_flip@2x-flip-vs-expired-vblank-interruptible@ac-hdmi-a1-hdmi-a2.html

  * igt@kms_flip@flip-vs-expired-vblank-interruptible@b-dp1:
    - shard-apl:          [FAIL][85] ([i915#1635] / [i915#79]) -> [PASS][86]
   [85]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/shard-apl8/igt@kms_flip@flip-vs-expired-vblank-interruptible@b-dp1.html
   [86]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/shard-apl4/igt@kms_flip@flip-vs-expired-vblank-interruptible@b-dp1.html

  * igt@kms_flip@flip-vs-expired-vblank@c-edp1:
    - shard-skl:          [FAIL][87] ([i915#79]) -> [PASS][88]
   [87]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/shard-skl3/igt@kms_flip@flip-vs-expired-vblank@c-edp1.html
   [88]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/shard-skl7/igt@kms_flip@flip-vs-expired-vblank@c-edp1.html

  * igt@kms_frontbuffer_tracking@psr-1p-pri-indfb-multidraw:
    - shard-tglb:         [DMESG-WARN][89] ([i915#1982]) -> [PASS][90]
   [89]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/shard-tglb5/igt@kms_frontbuffer_tracking@psr-1p-pri-indfb-multidraw.html
   [90]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/shard-tglb7/igt@kms_frontbuffer_tracking@psr-1p-pri-indfb-multidraw.html

  * igt@kms_plane_alpha_blend@pipe-b-constant-alpha-min:
    - shard-skl:          [FAIL][91] ([fdo#108145] / [i915#265]) -> [PASS][92]
   [91]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/shard-skl1/igt@kms_plane_alpha_blend@pipe-b-constant-alpha-min.html
   [92]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/shard-skl4/igt@kms_plane_alpha_blend@pipe-b-constant-alpha-min.html

  * igt@kms_plane_cursor@pipe-c-viewport-size-128:
    - shard-skl:          [DMESG-WARN][93] ([i915#1982]) -> [PASS][94] +7 similar issues
   [93]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/shard-skl4/igt@kms_plane_cursor@pipe-c-viewport-size-128.html
   [94]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/shard-skl1/igt@kms_plane_cursor@pipe-c-viewport-size-128.html

  * igt@kms_plane_scaling@pipe-a-scaler-with-clipping-clamping:
    - shard-iclb:         [DMESG-WARN][95] ([i915#1982]) -> [PASS][96]
   [95]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/shard-iclb3/igt@kms_plane_scaling@pipe-a-scaler-with-clipping-clamping.html
   [96]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/shard-iclb5/igt@kms_plane_scaling@pipe-a-scaler-with-clipping-clamping.html

  * igt@kms_psr@psr2_cursor_render:
    - shard-iclb:         [SKIP][97] ([fdo#109441]) -> [PASS][98] +3 similar issues
   [97]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/shard-iclb7/igt@kms_psr@psr2_cursor_render.html
   [98]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/shard-iclb2/igt@kms_psr@psr2_cursor_render.html

  * igt@perf_pmu@busy-accuracy-2@bcs0:
    - shard-snb:          [SKIP][99] ([fdo#109271]) -> [PASS][100] +21 similar issues
   [99]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/shard-snb6/igt@perf_pmu@busy-accuracy-2@bcs0.html
   [100]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/shard-snb2/igt@perf_pmu@busy-accuracy-2@bcs0.html

  
#### Warnings ####

  * igt@kms_plane_cursor@pipe-d-overlay-size-128:
    - shard-snb:          [TIMEOUT][101] ([i915#1958] / [i915#2119]) -> [SKIP][102] ([fdo#109271]) +3 similar issues
   [101]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8750/shard-snb6/igt@kms_plane_cursor@pipe-d-overlay-size-128.html
   [102]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/shard-snb2/igt@kms_plane_cursor@pipe-d-overlay-size-128.html

  
  [IGT#5]: https://gitlab.freedesktop.org/drm/igt-gpu-tools/issues/5
  [fdo#103375]: https://bugs.freedesktop.org/show_bug.cgi?id=103375
  [fdo#108145]: https://bugs.freedesktop.org/show_bug.cgi?id=108145
  [fdo#109271]: https://bugs.freedesktop.org/show_bug.cgi?id=109271
  [fdo#109441]: https://bugs.freedesktop.org/show_bug.cgi?id=109441
  [fdo#109642]: https://bugs.freedesktop.org/show_bug.cgi?id=109642
  [fdo#111068]: https://bugs.freedesktop.org/show_bug.cgi?id=111068
  [i915#118]: https://gitlab.freedesktop.org/drm/intel/issues/118
  [i915#1188]: https://gitlab.freedesktop.org/drm/intel/issues/1188
  [i915#1635]: https://gitlab.freedesktop.org/drm/intel/issues/1635
  [i915#180]: https://gitlab.freedesktop.org/drm/intel/issues/180
  [i915#1820]: https://gitlab.freedesktop.org/drm/intel/issues/1820
  [i915#1930]: https://gitlab.freedesktop.org/drm/intel/issues/1930
  [i915#1958]: https://gitlab.freedesktop.org/drm/intel/issues/1958
  [i915#1982]: https://gitlab.freedesktop.org/drm/intel/issues/1982
  [i915#2119]: https://gitlab.freedesktop.org/drm/intel/issues/2119
  [i915#2122]: https://gitlab.freedesktop.org/drm/intel/issues/2122
  [i915#265]: https://gitlab.freedesktop.org/drm/intel/issues/265
  [i915#402]: https://gitlab.freedesktop.org/drm/intel/issues/402
  [i915#454]: https://gitlab.freedesktop.org/drm/intel/issues/454
  [i915#54]: https://gitlab.freedesktop.org/drm/intel/issues/54
  [i915#58]: https://gitlab.freedesktop.org/drm/intel/issues/58
  [i915#79]: https://gitlab.freedesktop.org/drm/intel/issues/79
  [i915#82]: https://gitlab.freedesktop.org/drm/intel/issues/82
  [i915#95]: https://gitlab.freedesktop.org/drm/intel/issues/95
  [k.org#198133]: https://bugzilla.kernel.org/show_bug.cgi?id=198133


Participating hosts (10 -> 10)
------------------------------

  No changes in participating hosts


Build changes
-------------

  * Linux: CI_DRM_8750 -> Patchwork_18179

  CI-20190529: 20190529
  CI_DRM_8750: 0714e0ca72205b9c38c4b2a09d8d5981637af2fb @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_5735: 21f8204e54c122e4a0f8ca4b59e4b2db8d1ba687 @ git://anongit.freedesktop.org/xorg/app/intel-gpu-tools
  Patchwork_18179: 85f3b20e2a3296847de34a044cd16516b5798d70 @ git://anongit.freedesktop.org/gfx-ci/linux
  piglit_4509: fdc5a4ca11124ab8413c7988896eec4c97336694 @ git://anongit.freedesktop.org/piglit

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18179/index.html

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [Intel-gfx] [PATCH 10/66] drm/i915: Soften the tasklet flush frequency before waits
  2020-07-15 11:50 ` [Intel-gfx] [PATCH 10/66] drm/i915: Soften the tasklet flush frequency before waits Chris Wilson
@ 2020-07-16 14:23   ` Mika Kuoppala
  2020-07-22 15:10   ` Thomas Hellström (Intel)
  1 sibling, 0 replies; 156+ messages in thread
From: Mika Kuoppala @ 2020-07-16 14:23 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx; +Cc: Chris Wilson

Chris Wilson <chris@chris-wilson.co.uk> writes:

> We include a tasklet flush before waiting on a request as a precaution
> against the HW being lax in event signaling. We now have a precautionary
> flush in the engine's heartbeat and so do not need to be quite so
> zealous on every request wait. If we focus on the request, the only
> tasklet flush that matters is if there is a delay in submitting this
> request to HW, so if the request is not ready to be executed no
> advantage in reducing this wait can be gained by running the tasklet.
> And there is little point in doing busy work for no result.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>

Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>

> ---
>  drivers/gpu/drm/i915/i915_request.c | 20 ++++++++++++++++++--
>  1 file changed, 18 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
> index 29b5e71307e3..f58beff5e859 100644
> --- a/drivers/gpu/drm/i915/i915_request.c
> +++ b/drivers/gpu/drm/i915/i915_request.c
> @@ -1760,14 +1760,30 @@ long i915_request_wait(struct i915_request *rq,
>  	if (dma_fence_add_callback(&rq->fence, &wait.cb, request_wait_wake))
>  		goto out;
>  
> +	/*
> +	 * Flush the submission tasklet, but only if it may help this request.
> +	 *
> +	 * We sometimes experience some latency between the HW interrupts and
> +	 * tasklet execution (mostly due to ksoftirqd latency, but it can also
> +	 * be due to lazy CS events), so lets run the tasklet manually if there
> +	 * is a chance it may submit this request. If the request is not ready
> +	 * to run, as it is waiting for other fences to be signaled, flushing
> +	 * the tasklet is busy work without any advantage for this client.
> +	 *
> +	 * If the HW is being lazy, this is the last chance before we go to
> +	 * sleep to catch any pending events. We will check periodically in
> +	 * the heartbeat to flush the submission tasklets as a last resort
> +	 * for unhappy HW.
> +	 */
> +	if (i915_request_is_ready(rq))
> +		intel_engine_flush_submission(rq->engine);
> +
>  	for (;;) {
>  		set_current_state(state);
>  
>  		if (dma_fence_is_signaled(&rq->fence))
>  			break;
>  
> -		intel_engine_flush_submission(rq->engine);
> -
>  		if (signal_pending_state(state, current)) {
>  			timeout = -ERESTARTSYS;
>  			break;
> -- 
> 2.20.1
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [Intel-gfx] [PATCH 28/66] drm/i915/gem: Replace i915_gem_object.mm.mutex with reservation_ww_class
  2020-07-15 15:43   ` Maarten Lankhorst
@ 2020-07-16 15:53     ` Tvrtko Ursulin
  2020-07-28 11:17       ` Thomas Hellström (Intel)
  0 siblings, 1 reply; 156+ messages in thread
From: Tvrtko Ursulin @ 2020-07-16 15:53 UTC (permalink / raw)
  To: Maarten Lankhorst, Chris Wilson, intel-gfx


On 15/07/2020 16:43, Maarten Lankhorst wrote:
> On 15-07-2020 at 13:51, Chris Wilson wrote:
>> Our goal is to pull all memory reservations (next iteration
>> obj->ops->get_pages()) under a ww_mutex, and to align those reservations
>> with other drivers, i.e. control all such allocations with the
>> reservation_ww_class. Currently, this is under the purview of the
>> obj->mm.mutex, and while obj->mm remains an embedded struct we can
>> "simply" switch to using the reservation_ww_class obj->base.resv->lock
>>
>> The major consequence is the impact on the shrinker paths as the
>> reservation_ww_class is used to wrap allocations, and a ww_mutex does
>> not support subclassing so we cannot do our usual trick of knowing that
>> we never recurse inside the shrinker and instead have to finish the
>> reclaim with a trylock. This may result in us failing to release the
>> pages after having released the vma. This will have to do until a better
>> idea comes along.
>>
>> However, this step only converts the mutex over and continues to treat
>> everything as a single allocation and pinning the pages. With the
>> ww_mutex in place we can remove the temporary pinning, as we can then
>> reserve all storage en masse.
>>
>> One last thing to do: kill the implicit page pinning for active vma.
>> This will require us to invalidate the vma->pages when the backing store
>> is removed (and we expect that while the vma is active, we mark the
>> backing store as active so that it cannot be removed while the HW is
>> busy.)
>>
>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

[snip]

>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c b/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c
>> index dc8f052a0ffe..4e928103a38f 100644
>> --- a/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c
>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c
>> @@ -47,10 +47,7 @@ static bool unsafe_drop_pages(struct drm_i915_gem_object *obj,
>>   	if (!(shrink & I915_SHRINK_BOUND))
>>   		flags = I915_GEM_OBJECT_UNBIND_TEST;
>>   
>> -	if (i915_gem_object_unbind(obj, flags) == 0)
>> -		__i915_gem_object_put_pages(obj);
>> -
>> -	return !i915_gem_object_has_pages(obj);
>> +	return i915_gem_object_unbind(obj, flags) == 0;
>>   }
>>   
>>   static void try_to_writeback(struct drm_i915_gem_object *obj,
>> @@ -199,14 +196,14 @@ i915_gem_shrink(struct drm_i915_private *i915,
>>   
>>   			spin_unlock_irqrestore(&i915->mm.obj_lock, flags);
>>   
>> -			if (unsafe_drop_pages(obj, shrink)) {
>> -				/* May arrive from get_pages on another bo */
>> -				mutex_lock(&obj->mm.lock);
>> +			if (unsafe_drop_pages(obj, shrink) &&
>> +			    i915_gem_object_trylock(obj)) {

> Why trylock? Because of the nesting? In that case, still use ww ctx if provided please

By "if provided", do you mean the code paths where we are calling the 
shrinker ourselves, as opposed to reclaim, like shmem_get_pages?

That indeed sounds like the right thing to do, since all the get_pages 
from execbuf are in the reservation phase, collecting a list of GEM 
objects to lock, and the ones to shrink sound like they should be on 
that list.
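
Something like the below is what I am picturing, just so we are talking 
about the same thing. A rough sketch only: shrink_one() and the way the 
ww context is passed in are made-up names, and the i915_gem_ww_ctx 
plumbing is assumed from later patches in this series; only the 
dma_resv_lock()/dma_resv_trylock()/dma_resv_unlock() calls are the real 
API.

static int shrink_one(struct drm_i915_gem_object *obj,
		      struct i915_gem_ww_ctx *ww)
{
	int err;

	if (!ww) {
		/* Reclaim path: the lock may be held above us, only trylock */
		if (!dma_resv_trylock(obj->base.resv))
			return -EAGAIN;
	} else {
		/* Our own get_pages path: join the caller's acquire context */
		err = dma_resv_lock(obj->base.resv, &ww->ctx);
		if (err) /* -EDEADLK: the caller backs off and restarts */
			return err;
	}

	err = __i915_gem_object_put_pages(obj);
	dma_resv_unlock(obj->base.resv);

	return err;
}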

>> +				__i915_gem_object_put_pages(obj);
>>   				if (!i915_gem_object_has_pages(obj)) {
>>   					try_to_writeback(obj, shrink);
>>   					count += obj->base.size >> PAGE_SHIFT;
>>   				}
>> -				mutex_unlock(&obj->mm.lock);
>> +				i915_gem_object_unlock(obj);
>>   			}
>>   
>>   			scanned += obj->base.size >> PAGE_SHIFT;
>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_tiling.c b/drivers/gpu/drm/i915/gem/i915_gem_tiling.c
>> index ff72ee2fd9cd..ac12e1c20e66 100644
>> --- a/drivers/gpu/drm/i915/gem/i915_gem_tiling.c
>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_tiling.c
>> @@ -265,7 +265,6 @@ i915_gem_object_set_tiling(struct drm_i915_gem_object *obj,
>>   	 * pages to prevent them being swapped out and causing corruption
>>   	 * due to the change in swizzling.
>>   	 */
>> -	mutex_lock(&obj->mm.lock);
>>   	if (i915_gem_object_has_pages(obj) &&
>>   	    obj->mm.madv == I915_MADV_WILLNEED &&
>>   	    i915->quirks & QUIRK_PIN_SWIZZLED_PAGES) {
>> @@ -280,7 +279,6 @@ i915_gem_object_set_tiling(struct drm_i915_gem_object *obj,
>>   			obj->mm.quirked = true;
>>   		}
>>   	}
>> -	mutex_unlock(&obj->mm.lock);
>>   
>>   	spin_lock(&obj->vma.lock);
>>   	for_each_ggtt_vma(vma, obj) {
>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
>> index e946032b13e4..80907c00c6fd 100644
>> --- a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
>> @@ -129,8 +129,15 @@ userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
>>   		ret = i915_gem_object_unbind(obj,
>>   					     I915_GEM_OBJECT_UNBIND_ACTIVE |
>>   					     I915_GEM_OBJECT_UNBIND_BARRIER);
>> -		if (ret == 0)
>> -			ret = __i915_gem_object_put_pages(obj);
>> +		if (ret == 0) {
>> +			/* ww_mutex and mmu_notifier is fs_reclaim tainted */
>> +			if (i915_gem_object_trylock(obj)) {
>> +				ret = __i915_gem_object_put_pages(obj);
>> +				i915_gem_object_unlock(obj);
>> +			} else {
>> +				ret = -EAGAIN;
>> +			}
>> +		}
> 
> I'm not sure upstream will agree with this kind of API:
> 
> 1. It will deadlock when RT tasks are used.

It will or it can? Which part? It will break out of the loop if trylock 
fails.

> 
> 2. You start throwing -EAGAIN because you don't have the correct ordering of locking, this needs fixing first.

Is it about correct ordering of locks or something else? If memory 
allocation is allowed under dma_resv.lock, then the opposite order 
cannot be taken in any case.
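
To spell out the inversion I have in mind (illustrative only, not real 
code):

  get_pages:  dma_resv_lock(obj) -> kmalloc(GFP_KERNEL) -> fs_reclaim
  shrinker:   fs_reclaim -> dma_resv_lock(obj)

Once we allow allocations under the object lock, I would expect lockdep's 
fs_reclaim annotation to flag taking that same lock unconditionally from 
the shrinker, which is why it ends up as a trylock there.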

I've had a brief look at the amdgpu solution and maybe I misunderstood 
something, but it looks like a BKL-style approach with the device-level 
notifier_lock. Their userptr notifier blocks on that one, not on the 
dma_resv lock, but that also means their command submission 
(amdgpu_cs_submit) blocks on the same lock while obtaining backing store.

So it looks like a big-hammer approach not directly related to the story 
of dma_resv locking. Maybe we could do the same big-hammer approach, 
although I am not sure how it is deadlock-free.

What happens, for instance, if someone submits a userptr batch which gets 
unmapped while amdgpu_cs_submit is holding the notifier_lock?

If you understand amdgpu better, please share some insights. I certainly 
only looked at it briefly today, so I may be wrong.

Regards,

Tvrtko

* Re: [Intel-gfx] [PATCH 04/66] drm/i915: Add a couple of missing i915_active_fini()
  2020-07-15 11:50 ` [Intel-gfx] [PATCH 04/66] drm/i915: Add a couple of missing i915_active_fini() Chris Wilson
@ 2020-07-17 12:00   ` Tvrtko Ursulin
  2020-07-21 12:23   ` Thomas Hellström (Intel)
  1 sibling, 0 replies; 156+ messages in thread
From: Tvrtko Ursulin @ 2020-07-17 12:00 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 15/07/2020 12:50, Chris Wilson wrote:
> We use i915_active_fini() as a debug check on the i915_active state
> before freeing. If we forget to call it, we may end up angering the
> debugobjects contained within.
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> ---
>   drivers/gpu/drm/i915/display/intel_frontbuffer.c    | 2 ++
>   drivers/gpu/drm/i915/gt/selftest_engine_heartbeat.c | 5 ++++-
>   2 files changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/display/intel_frontbuffer.c b/drivers/gpu/drm/i915/display/intel_frontbuffer.c
> index 2979ed2588eb..d898b370d7a4 100644
> --- a/drivers/gpu/drm/i915/display/intel_frontbuffer.c
> +++ b/drivers/gpu/drm/i915/display/intel_frontbuffer.c
> @@ -232,6 +232,8 @@ static void frontbuffer_release(struct kref *ref)
>   	RCU_INIT_POINTER(obj->frontbuffer, NULL);
>   	spin_unlock(&to_i915(obj->base.dev)->fb_tracking.lock);
>   
> +	i915_active_fini(&front->write);
> +
>   	i915_gem_object_put(obj);
>   	kfree_rcu(front, rcu);
>   }
> diff --git a/drivers/gpu/drm/i915/gt/selftest_engine_heartbeat.c b/drivers/gpu/drm/i915/gt/selftest_engine_heartbeat.c
> index 73243ba59c7d..e73854dd2fe0 100644
> --- a/drivers/gpu/drm/i915/gt/selftest_engine_heartbeat.c
> +++ b/drivers/gpu/drm/i915/gt/selftest_engine_heartbeat.c
> @@ -47,7 +47,10 @@ static int pulse_active(struct i915_active *active)
>   
>   static void pulse_free(struct kref *kref)
>   {
> -	kfree(container_of(kref, struct pulse, kref));
> +	struct pulse *p = container_of(kref, typeof(*p), kref);
> +
> +	i915_active_fini(&p->active);
> +	kfree(p);
>   }
>   
>   static void pulse_put(struct pulse *p)
> 

Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Regards,

Tvrtko

* Re: [Intel-gfx] [PATCH 05/66] drm/i915: Skip taking acquire mutex for no ref->active callback
  2020-07-15 11:50 ` [Intel-gfx] [PATCH 05/66] drm/i915: Skip taking acquire mutex for no ref->active callback Chris Wilson
@ 2020-07-17 12:04   ` Tvrtko Ursulin
  2020-07-21 12:32   ` Thomas Hellström (Intel)
  1 sibling, 0 replies; 156+ messages in thread
From: Tvrtko Ursulin @ 2020-07-17 12:04 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 15/07/2020 12:50, Chris Wilson wrote:
> If no active callback is defined for i915_active, we do not need to
> serialise its enabling with the mutex. We still do only want to call the
> debug activate once, and must still serialise with a concurrent retire.
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> ---
>   drivers/gpu/drm/i915/i915_active.c | 25 ++++++++++++++++---------
>   1 file changed, 16 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_active.c b/drivers/gpu/drm/i915/i915_active.c
> index d960d0be5bd2..841b5c30950a 100644
> --- a/drivers/gpu/drm/i915/i915_active.c
> +++ b/drivers/gpu/drm/i915/i915_active.c
> @@ -416,6 +416,14 @@ bool i915_active_acquire_if_busy(struct i915_active *ref)
>   	return atomic_add_unless(&ref->count, 1, 0);
>   }
>   
> +static void __i915_active_activate(struct i915_active *ref)
> +{
> +	spin_lock_irq(&ref->tree_lock); /* __active_retire() */
> +	if (!atomic_fetch_inc(&ref->count))
> +		debug_active_activate(ref);
> +	spin_unlock_irq(&ref->tree_lock);
> +}
> +
>   int i915_active_acquire(struct i915_active *ref)
>   {
>   	int err;
> @@ -423,23 +431,22 @@ int i915_active_acquire(struct i915_active *ref)
>   	if (i915_active_acquire_if_busy(ref))
>   		return 0;
>   
> +	if (!ref->active) {
> +		__i915_active_activate(ref);
> +		return 0;
> +	}
> +
>   	err = mutex_lock_interruptible(&ref->mutex);
>   	if (err)
>   		return err;
>   
>   	if (likely(!i915_active_acquire_if_busy(ref))) {
> -		if (ref->active)
> -			err = ref->active(ref);
> -		if (!err) {
> -			spin_lock_irq(&ref->tree_lock); /* __active_retire() */
> -			debug_active_activate(ref);
> -			atomic_inc(&ref->count);
> -			spin_unlock_irq(&ref->tree_lock);
> -		}
> +		err = ref->active(ref);
> +		if (!err)
> +			__i915_active_activate(ref);
>   	}
>   
>   	mutex_unlock(&ref->mutex);
> -

Blank line was nice there.

>   	return err;
>   }
>   
> 

Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Regards,

Tvrtko

* Re: [Intel-gfx] [PATCH 06/66] drm/i915: Export a preallocate variant of i915_active_acquire()
  2020-07-15 11:50 ` [Intel-gfx] [PATCH 06/66] drm/i915: Export a preallocate variant of i915_active_acquire() Chris Wilson
@ 2020-07-17 12:21   ` Tvrtko Ursulin
  2020-07-17 12:45     ` Chris Wilson
  2020-07-21 15:33   ` Thomas Hellström (Intel)
  1 sibling, 1 reply; 156+ messages in thread
From: Tvrtko Ursulin @ 2020-07-17 12:21 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 15/07/2020 12:50, Chris Wilson wrote:
> Sometimes we have to be very careful not to allocate underneath a mutex
> (or spinlock) and yet still want to track activity. Enter
> i915_active_acquire_for_context(). This raises the activity counter on
> i915_active prior to use and ensures that the fence-tree contains a slot
> for the context.

Changelog?
 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> ---
>   .../gpu/drm/i915/gem/i915_gem_execbuffer.c    |   2 +-
>   drivers/gpu/drm/i915/gt/intel_timeline.c      |   4 +-
>   drivers/gpu/drm/i915/i915_active.c            | 136 +++++++++++++++---
>   drivers/gpu/drm/i915/i915_active.h            |  12 +-
>   4 files changed, 126 insertions(+), 28 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> index 6b4ec66cb558..719ba9fe3e85 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> @@ -1729,7 +1729,7 @@ __parser_mark_active(struct i915_vma *vma,
>   {
>   	struct intel_gt_buffer_pool_node *node = vma->private;
>   
> -	return i915_active_ref(&node->active, tl, fence);
> +	return i915_active_ref(&node->active, tl->fence_context, fence);
>   }
>   
>   static int
> diff --git a/drivers/gpu/drm/i915/gt/intel_timeline.c b/drivers/gpu/drm/i915/gt/intel_timeline.c
> index 46d20f5f3ddc..acb43aebd669 100644
> --- a/drivers/gpu/drm/i915/gt/intel_timeline.c
> +++ b/drivers/gpu/drm/i915/gt/intel_timeline.c
> @@ -484,7 +484,9 @@ __intel_timeline_get_seqno(struct intel_timeline *tl,
>   	 * free it after the current request is retired, which ensures that
>   	 * all writes into the cacheline from previous requests are complete.
>   	 */
> -	err = i915_active_ref(&tl->hwsp_cacheline->active, tl, &rq->fence);
> +	err = i915_active_ref(&tl->hwsp_cacheline->active,
> +			      tl->fence_context,
> +			      &rq->fence);
>   	if (err)
>   		goto err_cacheline;
>   
> diff --git a/drivers/gpu/drm/i915/i915_active.c b/drivers/gpu/drm/i915/i915_active.c
> index 841b5c30950a..799282fb1bb9 100644
> --- a/drivers/gpu/drm/i915/i915_active.c
> +++ b/drivers/gpu/drm/i915/i915_active.c
> @@ -28,12 +28,14 @@ static struct i915_global_active {
>   } global;
>   
>   struct active_node {
> +	struct rb_node node;
>   	struct i915_active_fence base;
>   	struct i915_active *ref;
> -	struct rb_node node;
>   	u64 timeline;
>   };
>   
> +#define fetch_node(x) rb_entry(READ_ONCE(x), typeof(struct active_node), node)
> +
>   static inline struct active_node *
>   node_from_active(struct i915_active_fence *active)
>   {
> @@ -216,12 +218,40 @@ excl_retire(struct dma_fence *fence, struct dma_fence_cb *cb)
>   		active_retire(container_of(cb, struct i915_active, excl.cb));
>   }
>   
> +static struct active_node *__active_lookup(struct i915_active *ref, u64 idx)
> +{
> +	struct active_node *it;
> +
> +	it = READ_ONCE(ref->cache);
> +	if (it && it->timeline == idx)
> +		return it;
> +
> +	BUILD_BUG_ON(offsetof(typeof(*it), node));
> +
> +	/* While active, the tree can only be built; not destroyed */
> +	GEM_BUG_ON(i915_active_is_idle(ref));
> +
> +	it = fetch_node(ref->tree.rb_node);
> +	while (it) {
> +		if (it->timeline < idx) {
> +			it = fetch_node(it->node.rb_right);
> +		} else if (it->timeline > idx) {
> +			it = fetch_node(it->node.rb_left);
> +		} else {
> +			WRITE_ONCE(ref->cache, it);
> +			break;
> +		}
> +	}
> +
> +	/* NB: If the tree rotated beneath us, we may miss our target. */
> +	return it;
> +}
> +
>   static struct i915_active_fence *
> -active_instance(struct i915_active *ref, struct intel_timeline *tl)
> +active_instance(struct i915_active *ref, u64 idx)
>   {
>   	struct active_node *node, *prealloc;
>   	struct rb_node **p, *parent;
> -	u64 idx = tl->fence_context;
>   
>   	/*
>   	 * We track the most recently used timeline to skip a rbtree search
> @@ -230,8 +260,8 @@ active_instance(struct i915_active *ref, struct intel_timeline *tl)
>   	 * after the previous activity has been retired, or if it matches the
>   	 * current timeline.
>   	 */
> -	node = READ_ONCE(ref->cache);
> -	if (node && node->timeline == idx)
> +	node = __active_lookup(ref, idx);
> +	if (likely(node))
>   		return &node->base;
>   
>   	/* Preallocate a replacement, just in case */
> @@ -268,10 +298,9 @@ active_instance(struct i915_active *ref, struct intel_timeline *tl)
>   	rb_insert_color(&node->node, &ref->tree);
>   
>   out:
> -	ref->cache = node;
> +	WRITE_ONCE(ref->cache, node);
>   	spin_unlock_irq(&ref->tree_lock);
>   
> -	BUILD_BUG_ON(offsetof(typeof(*node), base));
>   	return &node->base;
>   }
>   
> @@ -353,21 +382,17 @@ __active_del_barrier(struct i915_active *ref, struct active_node *node)
>   	return ____active_del_barrier(ref, node, barrier_to_engine(node));
>   }
>   
> -int i915_active_ref(struct i915_active *ref,
> -		    struct intel_timeline *tl,
> -		    struct dma_fence *fence)
> +int i915_active_ref(struct i915_active *ref, u64 idx, struct dma_fence *fence)
>   {
>   	struct i915_active_fence *active;
>   	int err;
>   
> -	lockdep_assert_held(&tl->mutex);
> -
>   	/* Prevent reaping in case we malloc/wait while building the tree */
>   	err = i915_active_acquire(ref);
>   	if (err)
>   		return err;
>   
> -	active = active_instance(ref, tl);
> +	active = active_instance(ref, idx);
>   	if (!active) {
>   		err = -ENOMEM;
>   		goto out;
> @@ -384,32 +409,81 @@ int i915_active_ref(struct i915_active *ref,
>   		atomic_dec(&ref->count);
>   	}
>   	if (!__i915_active_fence_set(active, fence))
> -		atomic_inc(&ref->count);
> +		__i915_active_acquire(ref);
>   
>   out:
>   	i915_active_release(ref);
>   	return err;
>   }
>   
> -struct dma_fence *
> -i915_active_set_exclusive(struct i915_active *ref, struct dma_fence *f)
> +static struct dma_fence *
> +__i915_active_set_fence(struct i915_active *ref,
> +			struct i915_active_fence *active,
> +			struct dma_fence *fence)
>   {
>   	struct dma_fence *prev;
>   
> -	/* We expect the caller to manage the exclusive timeline ordering */
> -	GEM_BUG_ON(i915_active_is_idle(ref));
> +	if (is_barrier(active)) { /* proto-node used by our idle barrier */
> +		/*
> +		 * This request is on the kernel_context timeline, and so
> +		 * we can use it to substitute for the pending idle-barrer
> +		 * request that we want to emit on the kernel_context.
> +		 */
> +		__active_del_barrier(ref, node_from_active(active));
> +		RCU_INIT_POINTER(active->fence, fence);

There is still some duplication between i915_active_ref and here. Maybe something like:

static bool __active_check_delete_barrier(active, fence)
{
	if (is_barrier(active)) { /* proto-node used by our idle barrier */
		/*
		 * This request is on the kernel_context timeline, and so
		 * we can use it to substitute for the pending idle-barrer
		 * request that we want to emit on the kernel_context.
		 */
		__active_del_barrier(ref, node_from_active(active));
		RCU_INIT_POINTER(active->fence, NULL);
		return true;
	} else {
		return false;
	}
}

And then here:

	if (__active_check_delete_barrier(active, fence))
		return NULL;

And in i915_active_ref:

	if (__active_check_delete_barrier(active, fence))
		atomic_dec(&ref->count);

?

> +		return NULL;
> +	}
>   
>   	rcu_read_lock();
> -	prev = __i915_active_fence_set(&ref->excl, f);
> +	prev = __i915_active_fence_set(active, fence);
>   	if (prev)
>   		prev = dma_fence_get_rcu(prev);
>   	else
> -		atomic_inc(&ref->count);
> +		__i915_active_acquire(ref);
>   	rcu_read_unlock();
>   
>   	return prev;
>   }
>   
> +static struct i915_active_fence *__active_fence(struct i915_active *ref, u64 idx)
> +{
> +	struct active_node *it;
> +
> +	it = __active_lookup(ref, idx);
> +	if (unlikely(!it)) { /* Contention with parallel tree builders! */
> +		spin_lock_irq(&ref->tree_lock);
> +		it = fetch_node(ref->tree.rb_node);
> +		while (it) {
> +			if (it->timeline < idx) {
> +				it = fetch_node(it->node.rb_right);
> +			} else if (it->timeline > idx) {
> +				it = fetch_node(it->node.rb_left);
> +			} else {
> +				WRITE_ONCE(ref->cache, it);
> +				break;
> +			}
> +		}

The part between the spinlocks is the same as the lookup in __active_lookup. Even if it means adding three underscores, I'd share it for readability.
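
Something like this, say (untested sketch, just moving the walk you
already have above into a shared helper):

static struct active_node *____active_lookup(struct i915_active *ref, u64 idx)
{
	struct active_node *it;

	/* The bare tree walk; used both locklessly and under tree_lock */
	it = fetch_node(ref->tree.rb_node);
	while (it) {
		if (it->timeline < idx) {
			it = fetch_node(it->node.rb_right);
		} else if (it->timeline > idx) {
			it = fetch_node(it->node.rb_left);
		} else {
			WRITE_ONCE(ref->cache, it);
			break;
		}
	}

	/* NB: if the tree rotated beneath us, we may miss our target */
	return it;
}

__active_lookup() then becomes the ref->cache check plus a call to the
helper, and the contention path here calls the same helper with
tree_lock held.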

> +		spin_unlock_irq(&ref->tree_lock);
> +	}
> +	GEM_BUG_ON(!it); /* slot must be preallocated */
> +
> +	return &it->base;
> +}
> +
> +struct dma_fence *
> +__i915_active_ref(struct i915_active *ref, u64 idx, struct dma_fence *fence)
> +{
> +	/* Only valid while active, see i915_active_acquire_for_context() */
> +	return __i915_active_set_fence(ref, __active_fence(ref, idx), fence);
> +}

What is the difference between i915_active_ref and __i915_active_ref?

> +
> +struct dma_fence *
> +i915_active_set_exclusive(struct i915_active *ref, struct dma_fence *f)
> +{
> +	/* We expect the caller to manage the exclusive timeline ordering */
> +	return __i915_active_set_fence(ref, &ref->excl, f);
> +}
> +
>   bool i915_active_acquire_if_busy(struct i915_active *ref)
>   {
>   	debug_active_assert(ref);
> @@ -450,6 +524,24 @@ int i915_active_acquire(struct i915_active *ref)
>   	return err;
>   }
>   
> +int i915_active_acquire_for_context(struct i915_active *ref, u64 idx)
> +{
> +	struct i915_active_fence *active;
> +	int err;
> +
> +	err = i915_active_acquire(ref);
> +	if (err)
> +		return err;
> +
> +	active = active_instance(ref, idx);
> +	if (!active) {
> +		i915_active_release(ref);
> +		return -ENOMEM;
> +	}
> +
> +	return 0; /* return with active ref */
> +}
> +
>   void i915_active_release(struct i915_active *ref)
>   {
>   	debug_active_assert(ref);
> @@ -753,7 +845,7 @@ static struct active_node *reuse_idle_barrier(struct i915_active *ref, u64 idx)
>   match:
>   	rb_erase(p, &ref->tree); /* Hide from waits and sibling allocations */
>   	if (p == &ref->cache->node)
> -		ref->cache = NULL;
> +		WRITE_ONCE(ref->cache, NULL);
>   	spin_unlock_irq(&ref->tree_lock);
>   
>   	return rb_entry(p, struct active_node, node);
> @@ -811,7 +903,7 @@ int i915_active_acquire_preallocate_barrier(struct i915_active *ref,
>   			 */
>   			RCU_INIT_POINTER(node->base.fence, ERR_PTR(-EAGAIN));
>   			node->base.cb.node.prev = (void *)engine;
> -			atomic_inc(&ref->count);
> +			__i915_active_acquire(ref);
>   		}
>   		GEM_BUG_ON(rcu_access_pointer(node->base.fence) != ERR_PTR(-EAGAIN));
>   
> diff --git a/drivers/gpu/drm/i915/i915_active.h b/drivers/gpu/drm/i915/i915_active.h
> index cf4058150966..73ded3c52a04 100644
> --- a/drivers/gpu/drm/i915/i915_active.h
> +++ b/drivers/gpu/drm/i915/i915_active.h
> @@ -163,14 +163,16 @@ void __i915_active_init(struct i915_active *ref,
>   	__i915_active_init(ref, active, retire, &__mkey, &__wkey);	\
>   } while (0)
>   
> -int i915_active_ref(struct i915_active *ref,
> -		    struct intel_timeline *tl,
> -		    struct dma_fence *fence);
> +struct dma_fence *
> +__i915_active_ref(struct i915_active *ref, u64 idx, struct dma_fence *fence);
> +int i915_active_ref(struct i915_active *ref, u64 idx, struct dma_fence *fence);
>   
>   static inline int
>   i915_active_add_request(struct i915_active *ref, struct i915_request *rq)
>   {
> -	return i915_active_ref(ref, i915_request_timeline(rq), &rq->fence);
> +	return i915_active_ref(ref,
> +			       i915_request_timeline(rq)->fence_context,
> +			       &rq->fence);
>   }
>   
>   struct dma_fence *
> @@ -198,7 +200,9 @@ int i915_request_await_active(struct i915_request *rq,
>   #define I915_ACTIVE_AWAIT_BARRIER BIT(2)
>   
>   int i915_active_acquire(struct i915_active *ref);
> +int i915_active_acquire_for_context(struct i915_active *ref, u64 idx);
>   bool i915_active_acquire_if_busy(struct i915_active *ref);
> +
>   void i915_active_release(struct i915_active *ref);
>   
>   static inline void __i915_active_acquire(struct i915_active *ref)
> 

Regards,

Tvrtko

* Re: [Intel-gfx] [PATCH 07/66] drm/i915: Keep the most recently used active-fence upon discard
  2020-07-15 11:50 ` [Intel-gfx] [PATCH 07/66] drm/i915: Keep the most recently used active-fence upon discard Chris Wilson
@ 2020-07-17 12:38   ` Tvrtko Ursulin
  2020-07-28 14:22     ` Chris Wilson
  2020-07-22  9:46   ` Thomas Hellström (Intel)
  1 sibling, 1 reply; 156+ messages in thread
From: Tvrtko Ursulin @ 2020-07-17 12:38 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 15/07/2020 12:50, Chris Wilson wrote:
> Whenever an i915_active idles, we prune its tree of old fence slots to
> prevent a gradual leak should it be used to track many, many timelines.
> The downside is that we then have to frequently reallocate the rbtree.
> A compromise is that we keep the most recently used fence slot, and
> reuse that for the next active reference as that is the most likely
> timeline to be reused.
 >
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> ---
>   drivers/gpu/drm/i915/i915_active.c | 27 ++++++++++++++++++++-------
>   drivers/gpu/drm/i915/i915_active.h |  4 ----
>   2 files changed, 20 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_active.c b/drivers/gpu/drm/i915/i915_active.c
> index 799282fb1bb9..0854b1552bc1 100644
> --- a/drivers/gpu/drm/i915/i915_active.c
> +++ b/drivers/gpu/drm/i915/i915_active.c
> @@ -130,8 +130,8 @@ static inline void debug_active_assert(struct i915_active *ref) { }
>   static void
>   __active_retire(struct i915_active *ref)
>   {
> +	struct rb_root root = RB_ROOT;
>   	struct active_node *it, *n;
> -	struct rb_root root;
>   	unsigned long flags;
>   
>   	GEM_BUG_ON(i915_active_is_idle(ref));
> @@ -143,9 +143,21 @@ __active_retire(struct i915_active *ref)
>   	GEM_BUG_ON(rcu_access_pointer(ref->excl.fence));
>   	debug_active_deactivate(ref);
>   
> -	root = ref->tree;
> -	ref->tree = RB_ROOT;
> -	ref->cache = NULL;
> +	/* Even if we have not used the cache, we may still have a barrier */
> +	if (!ref->cache)
> +		ref->cache = fetch_node(ref->tree.rb_node);
> +
> +	/* Keep the MRU cached node for reuse */
> +	if (ref->cache) {
> +		/* Discard all other nodes in the tree */
> +		rb_erase(&ref->cache->node, &ref->tree);
> +		root = ref->tree;
> +
> +		/* Rebuild the tree with only the cached node */
> +		rb_link_node(&ref->cache->node, NULL, &ref->tree.rb_node);
> +		rb_insert_color(&ref->cache->node, &ref->tree);
> +		GEM_BUG_ON(ref->tree.rb_node != &ref->cache->node);
> +	}
>   
>   	spin_unlock_irqrestore(&ref->tree_lock, flags);
>   
> @@ -156,6 +168,7 @@ __active_retire(struct i915_active *ref)
>   	/* ... except if you wait on it, you must manage your own references! */
>   	wake_up_var(ref);
>   
> +	/* Finally free the discarded timeline tree  */
>   	rbtree_postorder_for_each_entry_safe(it, n, &root, node) {
>   		GEM_BUG_ON(i915_active_fence_isset(&it->base));
>   		kmem_cache_free(global.slab_cache, it);

Here it frees everything... so how does ref->cache, being in the tree, 
survive?

> @@ -750,16 +763,16 @@ int i915_sw_fence_await_active(struct i915_sw_fence *fence,
>   	return await_active(ref, flags, sw_await_fence, fence, fence);
>   }
>   
> -#if IS_ENABLED(CONFIG_DRM_I915_DEBUG_GEM)
>   void i915_active_fini(struct i915_active *ref)
>   {
>   	debug_active_fini(ref);
>   	GEM_BUG_ON(atomic_read(&ref->count));
>   	GEM_BUG_ON(work_pending(&ref->work));
> -	GEM_BUG_ON(!RB_EMPTY_ROOT(&ref->tree));
>   	mutex_destroy(&ref->mutex);
> +
> +	if (ref->cache)
> +		kmem_cache_free(global.slab_cache, ref->cache);
>   }
> -#endif
>   
>   static inline bool is_idle_barrier(struct active_node *node, u64 idx)
>   {
> diff --git a/drivers/gpu/drm/i915/i915_active.h b/drivers/gpu/drm/i915/i915_active.h
> index 73ded3c52a04..b9e0394e2975 100644
> --- a/drivers/gpu/drm/i915/i915_active.h
> +++ b/drivers/gpu/drm/i915/i915_active.h
> @@ -217,11 +217,7 @@ i915_active_is_idle(const struct i915_active *ref)
>   	return !atomic_read(&ref->count);
>   }
>   
> -#if IS_ENABLED(CONFIG_DRM_I915_DEBUG_GEM)
>   void i915_active_fini(struct i915_active *ref);
> -#else
> -static inline void i915_active_fini(struct i915_active *ref) { }
> -#endif
>   
>   int i915_active_acquire_preallocate_barrier(struct i915_active *ref,
>   					    struct intel_engine_cs *engine);
> 

* Re: [Intel-gfx] [PATCH 06/66] drm/i915: Export a preallocate variant of i915_active_acquire()
  2020-07-17 12:21   ` Tvrtko Ursulin
@ 2020-07-17 12:45     ` Chris Wilson
  2020-07-17 13:06       ` Tvrtko Ursulin
  0 siblings, 1 reply; 156+ messages in thread
From: Chris Wilson @ 2020-07-17 12:45 UTC (permalink / raw)
  To: Tvrtko Ursulin, intel-gfx

Quoting Tvrtko Ursulin (2020-07-17 13:21:54)
> 
> On 15/07/2020 12:50, Chris Wilson wrote:
> > Sometimes we have to be very careful not to allocate underneath a mutex
> > (or spinlock) and yet still want to track activity. Enter
> > i915_active_acquire_for_context(). This raises the activity counter on
> > i915_active prior to use and ensures that the fence-tree contains a slot
> > for the context.
> 
> Changelog?

"Time spent with perf trying to reduce holdtime of certain mutexes we
will introduce, in particular focusing on reducing the number of atomics
required for typical i915_active lookups"
-Chris

* Re: [Intel-gfx] [PATCH 08/66] drm/i915: Make the stale cached active node available for any timeline
  2020-07-15 11:50 ` [Intel-gfx] [PATCH 08/66] drm/i915: Make the stale cached active node available for any timeline Chris Wilson
@ 2020-07-17 13:04   ` Tvrtko Ursulin
  2020-07-28 14:28     ` Chris Wilson
  2020-07-22 11:19   ` Thomas Hellström (Intel)
  1 sibling, 1 reply; 156+ messages in thread
From: Tvrtko Ursulin @ 2020-07-17 13:04 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 15/07/2020 12:50, Chris Wilson wrote:
> Rather than require the next timeline after idling to match the MRU
> before idling, reset the index on the node and allow it to match the
> first request. However, this requires cmpxchg(u64) and so is not trivial
> on 32b, so for compatibility we just fallback to keeping the cached node
> pointing to the MRU timeline.
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> ---
>   drivers/gpu/drm/i915/i915_active.c | 21 +++++++++++++++++++--
>   1 file changed, 19 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_active.c b/drivers/gpu/drm/i915/i915_active.c
> index 0854b1552bc1..6737b5615c0c 100644
> --- a/drivers/gpu/drm/i915/i915_active.c
> +++ b/drivers/gpu/drm/i915/i915_active.c
> @@ -157,6 +157,10 @@ __active_retire(struct i915_active *ref)
>   		rb_link_node(&ref->cache->node, NULL, &ref->tree.rb_node);
>   		rb_insert_color(&ref->cache->node, &ref->tree);
>   		GEM_BUG_ON(ref->tree.rb_node != &ref->cache->node);
> +
> +		/* Make the cached node available for reuse with any timeline */
> +		if (IS_ENABLED(CONFIG_64BIT))
> +			ref->cache->timeline = 0; /* needs cmpxchg(u64) */

Or when the fence context wraps, shock horror.

>   	}
>   
>   	spin_unlock_irqrestore(&ref->tree_lock, flags);
> @@ -235,9 +239,22 @@ static struct active_node *__active_lookup(struct i915_active *ref, u64 idx)
>   {
>   	struct active_node *it;
>   
> +	GEM_BUG_ON(idx == 0); /* 0 is the unordered timeline, rsvd for cache */
> +
>   	it = READ_ONCE(ref->cache);
> -	if (it && it->timeline == idx)
> -		return it;
> +	if (it) {
> +		u64 cached = READ_ONCE(it->timeline);
> +
> +		if (cached == idx)
> +			return it;
> +
> +#ifdef CONFIG_64BIT /* for cmpxchg(u64) */
> +		if (!cached && !cmpxchg(&it->timeline, 0, idx)) {
> +			GEM_BUG_ON(i915_active_fence_isset(&it->base));
> +			return it;

cmpxchg suggests this needs to be atomic; however, above, the check for 
equality comes from a separate read.

Since there is a lookup code path under the spinlock, perhaps the 
unlocked lookup could just fail, and then locked lookup could re-assign 
the timeline without the need for cmpxchg?
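
Roughly what I have in mind, as an untested sketch (whether a plain
store is sufficient against the lockless readers would need checking):

static struct active_node *
__active_claim_idle_cache(struct i915_active *ref, u64 idx)
{
	struct active_node *it;

	spin_lock_irq(&ref->tree_lock);

	it = ref->cache;
	if (it && !it->timeline) {
		/* Idle cached slot, its timeline reset to 0 on retire */
		GEM_BUG_ON(i915_active_fence_isset(&it->base));
		WRITE_ONCE(it->timeline, idx);
	} else {
		it = NULL;
	}

	spin_unlock_irq(&ref->tree_lock);

	return it;
}

That is, the lockless lookup only ever matches an exact idx, and
claiming the idle slot is serialised by the existing tree_lock rather
than by cmpxchg.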

Regards,

Tvrtko

> +		}
> +#endif
> +	}
>   
>   	BUILD_BUG_ON(offsetof(typeof(*it), node));
>   
> 

* Re: [Intel-gfx] [PATCH 06/66] drm/i915: Export a preallocate variant of i915_active_acquire()
  2020-07-17 12:45     ` Chris Wilson
@ 2020-07-17 13:06       ` Tvrtko Ursulin
  0 siblings, 0 replies; 156+ messages in thread
From: Tvrtko Ursulin @ 2020-07-17 13:06 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 17/07/2020 13:45, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2020-07-17 13:21:54)
>>
>> On 15/07/2020 12:50, Chris Wilson wrote:
>>> Sometimes we have to be very careful not to allocate underneath a mutex
>>> (or spinlock) and yet still want to track activity. Enter
>>> i915_active_acquire_for_context(). This raises the activity counter on
>>> i915_active prior to use and ensures that the fence-tree contains a slot
>>> for the context.
>>
>> Changelog?
> 
> "Time spent with perf trying to reduce holdtime of certain mutexes we
> will introduce, in particular focusing on reducing the number of atomics
> required for typical i915_active lookups"

Why not? Then I would at least have known that nothing from my previous 
round of feedback changed, making for an easier review.

Regards,

Tvrtko

* Re: [Intel-gfx] [PATCH 09/66] drm/i915: Provide a fastpath for waiting on vma bindings
  2020-07-15 11:50 ` [Intel-gfx] [PATCH 09/66] drm/i915: Provide a fastpath for waiting on vma bindings Chris Wilson
@ 2020-07-17 13:23   ` Tvrtko Ursulin
  2020-07-28 14:35     ` Chris Wilson
  2020-07-22 15:07   ` Thomas Hellström (Intel)
  1 sibling, 1 reply; 156+ messages in thread
From: Tvrtko Ursulin @ 2020-07-17 13:23 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 15/07/2020 12:50, Chris Wilson wrote:
> Before we can execute a request, we must wait for all of its vma to be
> bound. This is a frequent operation for which we can optimise away a
> few atomic operations (notably a cmpxchg) in lieu of the RCU protection.
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> ---
>   drivers/gpu/drm/i915/i915_active.h | 15 +++++++++++++++
>   drivers/gpu/drm/i915/i915_vma.c    |  9 +++++++--
>   2 files changed, 22 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_active.h b/drivers/gpu/drm/i915/i915_active.h
> index b9e0394e2975..fb165d3f01cf 100644
> --- a/drivers/gpu/drm/i915/i915_active.h
> +++ b/drivers/gpu/drm/i915/i915_active.h
> @@ -231,4 +231,19 @@ struct i915_active *i915_active_create(void);
>   struct i915_active *i915_active_get(struct i915_active *ref);
>   void i915_active_put(struct i915_active *ref);
>   
> +static inline int __i915_request_await_exclusive(struct i915_request *rq,
> +						 struct i915_active *active)
> +{
> +	struct dma_fence *fence;
> +	int err = 0;
> +
> +	fence = i915_active_fence_get(&active->excl);
> +	if (fence) {
> +		err = i915_request_await_dma_fence(rq, fence);
> +		dma_fence_put(fence);
> +	}
> +
> +	return err;
> +}
> +
>   #endif /* _I915_ACTIVE_H_ */
> diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c
> index bc64f773dcdb..cd12047c7791 100644
> --- a/drivers/gpu/drm/i915/i915_vma.c
> +++ b/drivers/gpu/drm/i915/i915_vma.c
> @@ -1167,6 +1167,12 @@ void i915_vma_revoke_mmap(struct i915_vma *vma)
>   		list_del(&vma->obj->userfault_link);
>   }
>   
> +static int
> +__i915_request_await_bind(struct i915_request *rq, struct i915_vma *vma)
> +{
> +	return __i915_request_await_exclusive(rq, &vma->active);
> +}
> +
>   int __i915_vma_move_to_active(struct i915_vma *vma, struct i915_request *rq)
>   {
>   	int err;
> @@ -1174,8 +1180,7 @@ int __i915_vma_move_to_active(struct i915_vma *vma, struct i915_request *rq)
>   	GEM_BUG_ON(!i915_vma_is_pinned(vma));
>   
>   	/* Wait for the vma to be bound before we start! */
> -	err = i915_request_await_active(rq, &vma->active,
> -					I915_ACTIVE_AWAIT_EXCL);
> +	err = __i915_request_await_bind(rq, vma);
>   	if (err)
>   		return err;
>   
> 

Looks like-for-like, apart from the missing i915_active_acquire_if_busy 
across the operation. Remind me, please, what is acquire/release 
protecting against? :)

Regards,

Tvrtko

* Re: [Intel-gfx] [PATCH 16/66] drm/i915/gem: Remove the call for no-evict i915_vma_pin
  2020-07-15 11:50 ` [Intel-gfx] [PATCH 16/66] drm/i915/gem: Remove the call for no-evict i915_vma_pin Chris Wilson
@ 2020-07-17 14:36   ` Tvrtko Ursulin
  2020-07-28 15:04     ` Chris Wilson
  2020-07-28  9:46   ` Thomas Hellström (Intel)
  1 sibling, 1 reply; 156+ messages in thread
From: Tvrtko Ursulin @ 2020-07-17 14:36 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 15/07/2020 12:50, Chris Wilson wrote:
> Remove the stub i915_vma_pin() used for incrementally pining objects for
> execbuf (under the severe restriction that they must not wait on a
> resource as we may have already pinned it) and replace it with a
> i915_vma_pin_inplace() that is only allowed to reclaim the currently
> bound location for the vma (and will never wait for a pinned resource).
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> ---
>   .../gpu/drm/i915/gem/i915_gem_execbuffer.c    | 69 +++++++++++--------
>   drivers/gpu/drm/i915/i915_vma.c               |  6 +-
>   drivers/gpu/drm/i915/i915_vma.h               |  2 +
>   3 files changed, 45 insertions(+), 32 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> index 28cf28fcf80a..0b8a26da26e5 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> @@ -452,49 +452,55 @@ static u64 eb_pin_flags(const struct drm_i915_gem_exec_object2 *entry,
>   	return pin_flags;
>   }
>   
> +static bool eb_pin_vma_fence_inplace(struct eb_vma *ev)
> +{
> +	struct i915_vma *vma = ev->vma;
> +	struct i915_fence_reg *reg = vma->fence;
> +
> +	if (reg) {
> +		if (READ_ONCE(reg->dirty))
> +			return false;
> +
> +		atomic_inc(&reg->pin_count);

Why is this safe outside the vm->mutex? It otherwise seems to be 
protecting this pin count.

Regards,

Tvrtko

> +		ev->flags |= __EXEC_OBJECT_HAS_FENCE;
> +	} else {
> +		if (i915_gem_object_is_tiled(vma->obj))
> +			return false;
> +	}
> +
> +	return true;
> +}
> +
>   static inline bool
> -eb_pin_vma(struct i915_execbuffer *eb,
> -	   const struct drm_i915_gem_exec_object2 *entry,
> -	   struct eb_vma *ev)
> +eb_pin_vma_inplace(struct i915_execbuffer *eb,
> +		   const struct drm_i915_gem_exec_object2 *entry,
> +		   struct eb_vma *ev)
>   {
>   	struct i915_vma *vma = ev->vma;
> -	u64 pin_flags;
> +	unsigned int pin_flags;
>   
> -	if (vma->node.size)
> -		pin_flags = vma->node.start;
> -	else
> -		pin_flags = entry->offset & PIN_OFFSET_MASK;
> +	if (eb_vma_misplaced(entry, vma, ev->flags))
> +		return false;
>   
> -	pin_flags |= PIN_USER | PIN_NOEVICT | PIN_OFFSET_FIXED;
> +	pin_flags = PIN_USER;
>   	if (unlikely(ev->flags & EXEC_OBJECT_NEEDS_GTT))
>   		pin_flags |= PIN_GLOBAL;
>   
>   	/* Attempt to reuse the current location if available */
> -	if (unlikely(i915_vma_pin(vma, 0, 0, pin_flags))) {
> -		if (entry->flags & EXEC_OBJECT_PINNED)
> -			return false;
> -
> -		/* Failing that pick any _free_ space if suitable */
> -		if (unlikely(i915_vma_pin(vma,
> -					  entry->pad_to_size,
> -					  entry->alignment,
> -					  eb_pin_flags(entry, ev->flags) |
> -					  PIN_USER | PIN_NOEVICT)))
> -			return false;
> -	}
> +	if (!i915_vma_pin_inplace(vma, pin_flags))
> +		return false;
>   
>   	if (unlikely(ev->flags & EXEC_OBJECT_NEEDS_FENCE)) {
> -		if (unlikely(i915_vma_pin_fence(vma))) {
> -			i915_vma_unpin(vma);
> +		if (!eb_pin_vma_fence_inplace(ev)) {
> +			__i915_vma_unpin(vma);
>   			return false;
>   		}
> -
> -		if (vma->fence)
> -			ev->flags |= __EXEC_OBJECT_HAS_FENCE;
>   	}
>   
> +	GEM_BUG_ON(eb_vma_misplaced(entry, vma, ev->flags));
> +
>   	ev->flags |= __EXEC_OBJECT_HAS_PIN;
> -	return !eb_vma_misplaced(entry, vma, ev->flags);
> +	return true;
>   }
>   
>   static int
> @@ -676,14 +682,17 @@ static int eb_reserve_vm(struct i915_execbuffer *eb)
>   		struct drm_i915_gem_exec_object2 *entry = ev->exec;
>   		struct i915_vma *vma = ev->vma;
>   
> -		if (eb_pin_vma(eb, entry, ev)) {
> +		if (eb_pin_vma_inplace(eb, entry, ev)) {
>   			if (entry->offset != vma->node.start) {
>   				entry->offset = vma->node.start | UPDATE;
>   				eb->args->flags |= __EXEC_HAS_RELOC;
>   			}
>   		} else {
> -			eb_unreserve_vma(ev);
> -			list_add_tail(&ev->unbound_link, &unbound);
> +			/* Lightly sort user placed objects to the fore */
> +			if (ev->flags & EXEC_OBJECT_PINNED)
> +				list_add(&ev->unbound_link, &unbound);
> +			else
> +				list_add_tail(&ev->unbound_link, &unbound);
>   		}
>   	}
>   
> diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c
> index c6bf04ca2032..dbe11b349175 100644
> --- a/drivers/gpu/drm/i915/i915_vma.c
> +++ b/drivers/gpu/drm/i915/i915_vma.c
> @@ -740,11 +740,13 @@ i915_vma_detach(struct i915_vma *vma)
>   	list_del(&vma->vm_link);
>   }
>   
> -static bool try_qad_pin(struct i915_vma *vma, unsigned int flags)
> +bool i915_vma_pin_inplace(struct i915_vma *vma, unsigned int flags)
>   {
>   	unsigned int bound;
>   	bool pinned = true;
>   
> +	GEM_BUG_ON(flags & ~I915_VMA_BIND_MASK);
> +
>   	bound = atomic_read(&vma->flags);
>   	do {
>   		if (unlikely(flags & ~bound))
> @@ -865,7 +867,7 @@ int i915_vma_pin(struct i915_vma *vma, u64 size, u64 alignment, u64 flags)
>   	GEM_BUG_ON(!(flags & (PIN_USER | PIN_GLOBAL)));
>   
>   	/* First try and grab the pin without rebinding the vma */
> -	if (try_qad_pin(vma, flags & I915_VMA_BIND_MASK))
> +	if (i915_vma_pin_inplace(vma, flags & I915_VMA_BIND_MASK))
>   		return 0;
>   
>   	err = vma_get_pages(vma);
> diff --git a/drivers/gpu/drm/i915/i915_vma.h b/drivers/gpu/drm/i915/i915_vma.h
> index d0d01f909548..03fea54fd573 100644
> --- a/drivers/gpu/drm/i915/i915_vma.h
> +++ b/drivers/gpu/drm/i915/i915_vma.h
> @@ -236,6 +236,8 @@ static inline void i915_vma_unlock(struct i915_vma *vma)
>   	dma_resv_unlock(vma->resv);
>   }
>   
> +bool i915_vma_pin_inplace(struct i915_vma *vma, unsigned int flags);
> +
>   int __must_check
>   i915_vma_pin(struct i915_vma *vma, u64 size, u64 alignment, u64 flags);
>   int i915_ggtt_pin(struct i915_vma *vma, u32 align, unsigned int flags);
> 

* Re: [Intel-gfx] [PATCH 12/66] drm/i915: Switch to object allocations for page directories
  2020-07-15 11:50 ` [Intel-gfx] [PATCH 12/66] drm/i915: Switch to object allocations for page directories Chris Wilson
@ 2020-07-20 10:34   ` Matthew Auld
  2020-07-20 10:40     ` Chris Wilson
  0 siblings, 1 reply; 156+ messages in thread
From: Matthew Auld @ 2020-07-20 10:34 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

On 15/07/2020 12:50, Chris Wilson wrote:
> The GEM object is grossly overweight for the practicality of tracking
> large numbers of individual pages, yet it is currently our only
> abstraction for tracking DMA allocations. Since those allocations need
> to be reserved upfront before an operation, and that we need to break
> away from simple system memory, we need to ditch using plain struct page
> wrappers.
> 
> In the process, we drop the WC mapping as we ended up clflushing
> everything anyway due to various issues across a wider range of
> platforms. Though in a future step, we need to drop the kmap_atomic
> approach which suggests we need to pre-map all the pages and keep them
> mapped.
> 
> v2: Verify our large scratch page is suitably DMA aligned; and manually
> clear the scratch since we are allocating random struct pages.
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Matthew Auld <matthew.auld@intel.com>
> ---

<snip>

> -
> -static struct page *vm_alloc_page(struct i915_address_space *vm, gfp_t gfp)
> -{
> -	struct pagevec stack;
> -	struct page *page;
> -
> -	if (I915_SELFTEST_ONLY(should_fail(&vm->fault_attr, 1)))
> -		i915_gem_shrink_all(vm->i915);


I guess shrink_boom et al are now mostly irrelevant in this new scheme.

Fwiw,
Reviewed-by: Matthew Auld <matthew.auld@intel.com>

* Re: [Intel-gfx] [PATCH 11/66] drm/i915: Preallocate stashes for vma page-directories
  2020-07-15 11:50 ` [Intel-gfx] [PATCH 11/66] drm/i915: Preallocate stashes for vma page-directories Chris Wilson
@ 2020-07-20 10:35   ` Matthew Auld
  2020-07-23 14:33   ` Thomas Hellström (Intel)
  2020-07-27  9:24   ` Thomas Hellström (Intel)
  2 siblings, 0 replies; 156+ messages in thread
From: Matthew Auld @ 2020-07-20 10:35 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

On 15/07/2020 12:50, Chris Wilson wrote:
> We need to make the DMA allocations used for page directories to be
> performed up front so that we can include those allocations in our
> memory reservation pass. The downside is that we have to assume the
> worst case, even before we know the final layout, and always allocate
> enough page directories for this object, even when there will be overlap.
> This unfortunately can be quite expensive, especially as we have to
> clear/reset the page directories and DMA pages, but it should only be
> required during early phases of a workload when new objects are being
> discovered, or after memory/eviction pressure when we need to rebind.
> Once we reach steady state, the objects should not be moved and we no
> longer need to preallocating the pages tables.
> 
> It should be noted that the lifetime for the page directories DMA is
> more or less decoupled from individual fences as they will be shared
> across objects across timelines.
> 
> v2: Only allocate enough PD space for the PTE we may use, we do not need
> to allocate PD that will be left as scratch.
> v3: Store the shift unto the first PD level to encapsulate the different
> PTE counts for gen6/gen8.
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Matthew Auld <matthew.auld@intel.com>

Fwiw,
Reviewed-by: Matthew Auld <matthew.auld@intel.com>

* Re: [Intel-gfx] [PATCH 12/66] drm/i915: Switch to object allocations for page directories
  2020-07-20 10:34   ` Matthew Auld
@ 2020-07-20 10:40     ` Chris Wilson
  0 siblings, 0 replies; 156+ messages in thread
From: Chris Wilson @ 2020-07-20 10:40 UTC (permalink / raw)
  To: Matthew Auld, intel-gfx

Quoting Matthew Auld (2020-07-20 11:34:10)
> On 15/07/2020 12:50, Chris Wilson wrote:
> > The GEM object is grossly overweight for the practicality of tracking
> > large numbers of individual pages, yet it is currently our only
> > abstraction for tracking DMA allocations. Since those allocations need
> > to be reserved upfront before an operation, and that we need to break
> > away from simple system memory, we need to ditch using plain struct page
> > wrappers.
> > 
> > In the process, we drop the WC mapping as we ended up clflushing
> > everything anyway due to various issues across a wider range of
> > platforms. Though in a future step, we need to drop the kmap_atomic
> > approach which suggests we need to pre-map all the pages and keep them
> > mapped.
> > 
> > v2: Verify our large scratch page is suitably DMA aligned; and manually
> > clear the scratch since we are allocating random struct pages.
> > 
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Matthew Auld <matthew.auld@intel.com>
> > ---
> 
> <snip>
> 
> > -
> > -static struct page *vm_alloc_page(struct i915_address_space *vm, gfp_t gfp)
> > -{
> > -     struct pagevec stack;
> > -     struct page *page;
> > -
> > -     if (I915_SELFTEST_ONLY(should_fail(&vm->fault_attr, 1)))
> > -             i915_gem_shrink_all(vm->i915);
> 
> 
> I guess shrink_boom et al are now mostly irrelevant in this new scheme.

The failures now occur at a less precarious time, but we can copy the
should_fail across, maybe even add more injection sites.
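
Along these lines, say (a sketch only; alloc_pt_dma() is a placeholder
for whatever the new allocation entry point ends up being called, not
necessarily its final signature):

struct drm_i915_gem_object *alloc_pt_dma(struct i915_address_space *vm, int sz)
{
	if (I915_SELFTEST_ONLY(should_fail(&vm->fault_attr, 1)))
		i915_gem_shrink_all(vm->i915);

	return i915_gem_object_create_internal(vm->i915, sz);
}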
-Chris

* Re: [Intel-gfx] [PATCH 04/66] drm/i915: Add a couple of missing i915_active_fini()
  2020-07-15 11:50 ` [Intel-gfx] [PATCH 04/66] drm/i915: Add a couple of missing i915_active_fini() Chris Wilson
  2020-07-17 12:00   ` Tvrtko Ursulin
@ 2020-07-21 12:23   ` Thomas Hellström (Intel)
  1 sibling, 0 replies; 156+ messages in thread
From: Thomas Hellström (Intel) @ 2020-07-21 12:23 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 2020-07-15 13:50, Chris Wilson wrote:
> We use i915_active_fini() as a debug check on the i915_active state
> before freeing. If we forget to call it, we may end up angering the
> debugobjects contained within.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Reviewed-by: Thomas Hellström <thomas.hellstrom@intel.com>



* Re: [Intel-gfx] [PATCH 05/66] drm/i915: Skip taking acquire mutex for no ref->active callback
  2020-07-15 11:50 ` [Intel-gfx] [PATCH 05/66] drm/i915: Skip taking acquire mutex for no ref->active callback Chris Wilson
  2020-07-17 12:04   ` Tvrtko Ursulin
@ 2020-07-21 12:32   ` Thomas Hellström (Intel)
  1 sibling, 0 replies; 156+ messages in thread
From: Thomas Hellström (Intel) @ 2020-07-21 12:32 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 2020-07-15 13:50, Chris Wilson wrote:
> If no active callback is defined for i915_active, we do not need to
> serialise its enabling with the mutex. We still do only want to call the
> debug activate once, and must still serialise with a concurrent retire.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Minor nit below,

Otherwise

Reviewed-by: Thomas Hellström <thomas.hellstrom@intel.com>

> ---
>   drivers/gpu/drm/i915/i915_active.c | 25 ++++++++++++++++---------
>   1 file changed, 16 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_active.c b/drivers/gpu/drm/i915/i915_active.c
> index d960d0be5bd2..841b5c30950a 100644
> --- a/drivers/gpu/drm/i915/i915_active.c
> +++ b/drivers/gpu/drm/i915/i915_active.c
> @@ -416,6 +416,14 @@ bool i915_active_acquire_if_busy(struct i915_active *ref)
>   	return atomic_add_unless(&ref->count, 1, 0);
>   }
>   
> +static void __i915_active_activate(struct i915_active *ref)
> +{
> +	spin_lock_irq(&ref->tree_lock); /* __active_retire() */
> +	if (!atomic_fetch_inc(&ref->count))
> +		debug_active_activate(ref);
> +	spin_unlock_irq(&ref->tree_lock);
> +}
> +
>   int i915_active_acquire(struct i915_active *ref)
>   {
>   	int err;
> @@ -423,23 +431,22 @@ int i915_active_acquire(struct i915_active *ref)
>   	if (i915_active_acquire_if_busy(ref))
>   		return 0;
>   
> +	if (!ref->active) {
> +		__i915_active_activate(ref);
> +		return 0;
> +	}
> +
>   	err = mutex_lock_interruptible(&ref->mutex);
>   	if (err)
>   		return err;
>   
>   	if (likely(!i915_active_acquire_if_busy(ref))) {
> -		if (ref->active)
> -			err = ref->active(ref);
> -		if (!err) {
> -			spin_lock_irq(&ref->tree_lock); /* __active_retire() */
> -			debug_active_activate(ref);
> -			atomic_inc(&ref->count);
> -			spin_unlock_irq(&ref->tree_lock);
> -		}
> +		err = ref->active(ref);
> +		if (!err)
> +			__i915_active_activate(ref);
>   	}
>   
>   	mutex_unlock(&ref->mutex);
> -

Unrelated

>   	return err;
>   }
>   

* Re: [Intel-gfx] [PATCH 06/66] drm/i915: Export a preallocate variant of i915_active_acquire()
  2020-07-15 11:50 ` [Intel-gfx] [PATCH 06/66] drm/i915: Export a preallocate variant of i915_active_acquire() Chris Wilson
  2020-07-17 12:21   ` Tvrtko Ursulin
@ 2020-07-21 15:33   ` Thomas Hellström (Intel)
  1 sibling, 0 replies; 156+ messages in thread
From: Thomas Hellström (Intel) @ 2020-07-21 15:33 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 2020-07-15 13:50, Chris Wilson wrote:
> Sometimes we have to be very careful not to allocate underneath a mutex
> (or spinlock) and yet still want to track activity. Enter
> i915_active_acquire_for_context(). This raises the activity counter on
> i915_active prior to use and ensures that the fence-tree contains a slot
> for the context.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> ---
>   .../gpu/drm/i915/gem/i915_gem_execbuffer.c    |   2 +-
>   drivers/gpu/drm/i915/gt/intel_timeline.c      |   4 +-
>   drivers/gpu/drm/i915/i915_active.c            | 136 +++++++++++++++---
>   drivers/gpu/drm/i915/i915_active.h            |  12 +-
>   4 files changed, 126 insertions(+), 28 deletions(-)
>
...
>   	/*
>   	 * We track the most recently used timeline to skip a rbtree search
> @@ -230,8 +260,8 @@ active_instance(struct i915_active *ref, struct intel_timeline *tl)
>   	 * after the previous activity has been retired, or if it matches the
>   	 * current timeline.
>   	 */


In addition to Tvrtko's comments, should this comment be moved to 
__active_lookup()?

/Thomas



* Re: [Intel-gfx] [PATCH 07/66] drm/i915: Keep the most recently used active-fence upon discard
  2020-07-15 11:50 ` [Intel-gfx] [PATCH 07/66] drm/i915: Keep the most recently used active-fence upon discard Chris Wilson
  2020-07-17 12:38   ` Tvrtko Ursulin
@ 2020-07-22  9:46   ` Thomas Hellström (Intel)
  1 sibling, 0 replies; 156+ messages in thread
From: Thomas Hellström (Intel) @ 2020-07-22  9:46 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 2020-07-15 13:50, Chris Wilson wrote:
> Whenever an i915_active idles, we prune its tree of old fence slots to
> prevent a gradual leak should it be used to track many, many timelines.
> The downside is that we then have to frequently reallocate the rbtree.
> A compromise is that we keep the most recently used fence slot, and
> reuse that for the next active reference as that is the most likely
> timeline to be reused.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> ---
>   drivers/gpu/drm/i915/i915_active.c | 27 ++++++++++++++++++++-------
>   drivers/gpu/drm/i915/i915_active.h |  4 ----
>   2 files changed, 20 insertions(+), 11 deletions(-)

Lgtm. Reviewed-by: Thomas Hellström <thomas.hellstrom@intel.com>



* Re: [Intel-gfx] [PATCH 08/66] drm/i915: Make the stale cached active node available for any timeline
  2020-07-15 11:50 ` [Intel-gfx] [PATCH 08/66] drm/i915: Make the stale cached active node available for any timeline Chris Wilson
  2020-07-17 13:04   ` Tvrtko Ursulin
@ 2020-07-22 11:19   ` Thomas Hellström (Intel)
  2020-07-28 14:31     ` Chris Wilson
  1 sibling, 1 reply; 156+ messages in thread
From: Thomas Hellström (Intel) @ 2020-07-22 11:19 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 2020-07-15 13:50, Chris Wilson wrote:
> Rather than require the next timeline after idling to match the MRU
> before idling, reset the index on the node and allow it to match the
> first request. However, this requires cmpxchg(u64) and so is not trivial
> on 32b, so for compatibility we just fallback to keeping the cached node
> pointing to the MRU timeline.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> ---
>   drivers/gpu/drm/i915/i915_active.c | 21 +++++++++++++++++++--
>   1 file changed, 19 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_active.c b/drivers/gpu/drm/i915/i915_active.c
> index 0854b1552bc1..6737b5615c0c 100644
> --- a/drivers/gpu/drm/i915/i915_active.c
> +++ b/drivers/gpu/drm/i915/i915_active.c
> @@ -157,6 +157,10 @@ __active_retire(struct i915_active *ref)
>   		rb_link_node(&ref->cache->node, NULL, &ref->tree.rb_node);
>   		rb_insert_color(&ref->cache->node, &ref->tree);
>   		GEM_BUG_ON(ref->tree.rb_node != &ref->cache->node);
> +
> +		/* Make the cached node available for reuse with any timeline */
> +		if (IS_ENABLED(CONFIG_64BIT))
> +			ref->cache->timeline = 0; /* needs cmpxchg(u64) */
>   	}
>   
>   	spin_unlock_irqrestore(&ref->tree_lock, flags);
> @@ -235,9 +239,22 @@ static struct active_node *__active_lookup(struct i915_active *ref, u64 idx)
>   {
>   	struct active_node *it;
>   
> +	GEM_BUG_ON(idx == 0); /* 0 is the unordered timeline, rsvd for cache */
> +
>   	it = READ_ONCE(ref->cache);
> -	if (it && it->timeline == idx)
> -		return it;
> +	if (it) {
> +		u64 cached = READ_ONCE(it->timeline);
> +
> +		if (cached == idx)
> +			return it;
> +
> +#ifdef CONFIG_64BIT /* for cmpxchg(u64) */
> +		if (!cached && !cmpxchg(&it->timeline, 0, idx)) {

Doesn't cmpxchg() already do an unlocked compare before attempting the 
locked cycle?

Otherwise lgtm.

Reviewed-by: Thomas Hellström <thomas.hellstrom@intel.com>


> +			GEM_BUG_ON(i915_active_fence_isset(&it->base));
> +			return it;
> +		}
> +#endif
> +	}
>   
>   	BUILD_BUG_ON(offsetof(typeof(*it), node));
>   

* Re: [Intel-gfx] [PATCH 09/66] drm/i915: Provide a fastpath for waiting on vma bindings
  2020-07-15 11:50 ` [Intel-gfx] [PATCH 09/66] drm/i915: Provide a fastpath for waiting on vma bindings Chris Wilson
  2020-07-17 13:23   ` Tvrtko Ursulin
@ 2020-07-22 15:07   ` Thomas Hellström (Intel)
  1 sibling, 0 replies; 156+ messages in thread
From: Thomas Hellström (Intel) @ 2020-07-22 15:07 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 2020-07-15 13:50, Chris Wilson wrote:
> Before we can execute a request, we must wait for all of its vma to be
> bound. This is a frequent operation for which we can optimise away a
> few atomic operations (notably a cmpxchg) in lieu of the RCU protection.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

LGTM. Reviewed-by: Thomas Hellström <thomas.hellstrom@intel.com>


> ---
>   drivers/gpu/drm/i915/i915_active.h | 15 +++++++++++++++
>   drivers/gpu/drm/i915/i915_vma.c    |  9 +++++++--
>   2 files changed, 22 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_active.h b/drivers/gpu/drm/i915/i915_active.h
> index b9e0394e2975..fb165d3f01cf 100644
> --- a/drivers/gpu/drm/i915/i915_active.h
> +++ b/drivers/gpu/drm/i915/i915_active.h
> @@ -231,4 +231,19 @@ struct i915_active *i915_active_create(void);
>   struct i915_active *i915_active_get(struct i915_active *ref);
>   void i915_active_put(struct i915_active *ref);
>   
> +static inline int __i915_request_await_exclusive(struct i915_request *rq,
> +						 struct i915_active *active)
> +{
> +	struct dma_fence *fence;
> +	int err = 0;
> +
> +	fence = i915_active_fence_get(&active->excl);
> +	if (fence) {
> +		err = i915_request_await_dma_fence(rq, fence);
> +		dma_fence_put(fence);
> +	}
> +
> +	return err;
> +}
> +
>   #endif /* _I915_ACTIVE_H_ */
> diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c
> index bc64f773dcdb..cd12047c7791 100644
> --- a/drivers/gpu/drm/i915/i915_vma.c
> +++ b/drivers/gpu/drm/i915/i915_vma.c
> @@ -1167,6 +1167,12 @@ void i915_vma_revoke_mmap(struct i915_vma *vma)
>   		list_del(&vma->obj->userfault_link);
>   }
>   
> +static int
> +__i915_request_await_bind(struct i915_request *rq, struct i915_vma *vma)
> +{
> +	return __i915_request_await_exclusive(rq, &vma->active);
> +}
> +
>   int __i915_vma_move_to_active(struct i915_vma *vma, struct i915_request *rq)
>   {
>   	int err;
> @@ -1174,8 +1180,7 @@ int __i915_vma_move_to_active(struct i915_vma *vma, struct i915_request *rq)
>   	GEM_BUG_ON(!i915_vma_is_pinned(vma));
>   
>   	/* Wait for the vma to be bound before we start! */
> -	err = i915_request_await_active(rq, &vma->active,
> -					I915_ACTIVE_AWAIT_EXCL);
> +	err = __i915_request_await_bind(rq, vma);
>   	if (err)
>   		return err;
>   

* Re: [Intel-gfx] [PATCH 10/66] drm/i915: Soften the tasklet flush frequency before waits
  2020-07-15 11:50 ` [Intel-gfx] [PATCH 10/66] drm/i915: Soften the tasklet flush frequency before waits Chris Wilson
  2020-07-16 14:23   ` Mika Kuoppala
@ 2020-07-22 15:10   ` Thomas Hellström (Intel)
  1 sibling, 0 replies; 156+ messages in thread
From: Thomas Hellström (Intel) @ 2020-07-22 15:10 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 2020-07-15 13:50, Chris Wilson wrote:
> We include a tasklet flush before waiting on a request as a precaution
> against the HW being lax in event signaling. We now have a precautionary
> flush in the engine's heartbeat and so do not need to be quite so
> zealous on every request wait. If we focus on the request, the only
> tasklet flush that matters is if there is a delay in submitting this
> request to HW, so if the request is not ready to be executed no
> advantage in reducing this wait can be gained by running the tasklet.
> And there is little point in doing busy work for no result.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>

Reviewed-by: Thomas Hellström <thomas.hellstrom@intel.com>


_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [Intel-gfx] [PATCH 11/66] drm/i915: Preallocate stashes for vma page-directories
  2020-07-15 11:50 ` [Intel-gfx] [PATCH 11/66] drm/i915: Preallocate stashes for vma page-directories Chris Wilson
  2020-07-20 10:35   ` Matthew Auld
@ 2020-07-23 14:33   ` Thomas Hellström (Intel)
  2020-07-28 14:42     ` Chris Wilson
  2020-07-27  9:24   ` Thomas Hellström (Intel)
  2 siblings, 1 reply; 156+ messages in thread
From: Thomas Hellström (Intel) @ 2020-07-23 14:33 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx; +Cc: Matthew Auld


On 2020-07-15 13:50, Chris Wilson wrote:
> We need the DMA allocations used for page directories to be
> performed up front so that we can include those allocations in our
> memory reservation pass. The downside is that we have to assume the
> worst case, even before we know the final layout, and always allocate
> enough page directories for this object, even when there will be overlap.
> This unfortunately can be quite expensive, especially as we have to
> clear/reset the page directories and DMA pages, but it should only be
> required during early phases of a workload when new objects are being
> discovered, or after memory/eviction pressure when we need to rebind.
> Once we reach steady state, the objects should not be moved and we no
> longer need to preallocate the page tables.
>
> It should be noted that the lifetime for the page directories DMA is
> more or less decoupled from individual fences as they will be shared
> across objects across timelines.
>
> v2: Only allocate enough PD space for the PTE we may use, we do not need
> to allocate PD that will be left as scratch.
> v3: Store the shift into the first PD level to encapsulate the different
> PTE counts for gen6/gen8.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Matthew Auld <matthew.auld@intel.com>
> ---
>   .../gpu/drm/i915/gem/i915_gem_client_blt.c    | 11 +--
>   drivers/gpu/drm/i915/gt/gen6_ppgtt.c          | 40 ++++-----
>   drivers/gpu/drm/i915/gt/gen8_ppgtt.c          | 78 +++++------------
>   drivers/gpu/drm/i915/gt/intel_ggtt.c          | 60 ++++++--------
>   drivers/gpu/drm/i915/gt/intel_gtt.h           | 46 ++++++----
>   drivers/gpu/drm/i915/gt/intel_ppgtt.c         | 83 ++++++++++++++++---
>   drivers/gpu/drm/i915/i915_vma.c               | 27 +++---
>   drivers/gpu/drm/i915/selftests/i915_gem_gtt.c | 60 ++++++++------
>   drivers/gpu/drm/i915/selftests/mock_gtt.c     | 22 ++---
>   9 files changed, 237 insertions(+), 190 deletions(-)

Hi, Chris,

Overall looks good, but a question: Why can't we perform page-table 
memory allocation on demand when needed?

Are we then under a mutex that we also take during reclaim?

/Thomas



_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [Intel-gfx] [PATCH 13/66] drm/i915/gem: Don't drop the timeline lock during execbuf
  2020-07-15 11:50 ` [Intel-gfx] [PATCH 13/66] drm/i915/gem: Don't drop the timeline lock during execbuf Chris Wilson
@ 2020-07-23 16:09   ` Thomas Hellström (Intel)
  2020-07-28 14:46     ` Thomas Hellström (Intel)
  2020-07-28 14:51     ` Chris Wilson
  2020-07-31  8:09   ` Thomas Hellström (Intel)
  1 sibling, 2 replies; 156+ messages in thread
From: Thomas Hellström (Intel) @ 2020-07-23 16:09 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 2020-07-15 13:50, Chris Wilson wrote:
> Our timeline lock is our defence against a concurrent execbuf
> interrupting our request construction. We need to hold it throughout or,
> for example, a second thread may interject a relocation request in
> between our own relocation request and execution in the ring.
>
> A second, major benefit, is that it allows us to preserve a large chunk
> of the ringbuffer for our exclusive use; which should virtually
> eliminate the threat of hitting a wait_for_space during request
> construction -- although we should have already dropped other
> contentious locks at that point.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> ---
>   .../gpu/drm/i915/gem/i915_gem_execbuffer.c    | 413 +++++++++++-------
>   .../i915/gem/selftests/i915_gem_execbuffer.c  |  24 +-
>   2 files changed, 281 insertions(+), 156 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> index 719ba9fe3e85..af3499aafd22 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> @@ -259,6 +259,8 @@ struct i915_execbuffer {
>   		bool has_fence : 1;
>   		bool needs_unfenced : 1;
>   
> +		struct intel_context *ce;
> +
>   		struct i915_vma *target;
>   		struct i915_request *rq;
>   		struct i915_vma *rq_vma;
> @@ -639,6 +641,35 @@ static int eb_reserve_vma(const struct i915_execbuffer *eb,
>   	return 0;
>   }
>   
> +static void retire_requests(struct intel_timeline *tl)
> +{
> +	struct i915_request *rq, *rn;
> +
> +	list_for_each_entry_safe(rq, rn, &tl->requests, link)
> +		if (!i915_request_retire(rq))
> +			break;
> +}
> +
> +static int wait_for_timeline(struct intel_timeline *tl)
> +{
> +	do {
> +		struct dma_fence *fence;
> +		int err;
> +
> +		fence = i915_active_fence_get(&tl->last_request);
> +		if (!fence)
> +			return 0;
> +
> +		err = dma_fence_wait(fence, true);
> +		dma_fence_put(fence);
> +		if (err)
> +			return err;
> +
> +		/* Retiring may trigger a barrier, requiring an extra pass */
> +		retire_requests(tl);
> +	} while (1);
> +}
> +
>   static int eb_reserve(struct i915_execbuffer *eb)
>   {
>   	const unsigned int count = eb->buffer_count;
> @@ -646,7 +677,6 @@ static int eb_reserve(struct i915_execbuffer *eb)
>   	struct list_head last;
>   	struct eb_vma *ev;
>   	unsigned int i, pass;
> -	int err = 0;
>   
>   	/*
>   	 * Attempt to pin all of the buffers into the GTT.
> @@ -662,18 +692,37 @@ static int eb_reserve(struct i915_execbuffer *eb)
>   	 * room for the earlier objects *unless* we need to defragment.
>   	 */
>   
> -	if (mutex_lock_interruptible(&eb->i915->drm.struct_mutex))
> -		return -EINTR;
> -
>   	pass = 0;
>   	do {
> +		int err = 0;
> +
> +		/*
> +		 * We need to hold one lock as we bind all the vma so that
> +		 * we have a consistent view of the entire vm and can plan
> +		 * evictions to fill the whole GTT. If we allow a second
> +		 * thread to run as we do this, it will either unbind
> +		 * everything we want pinned, or steal space that we need for
> +		 * ourselves. The closer we are to a full GTT, the more likely
> +		 * such contention will cause us to fail to bind the workload
> +		 * for this batch. Since we know at this point we need to
> +		 * find space for new buffers, we know that extra pressure
> +		 * from contention is likely.
> +		 *
> +		 * In lieu of being able to hold vm->mutex for the entire
> +		 * sequence (it's complicated!), we opt for struct_mutex.
> +		 */
> +		if (mutex_lock_interruptible(&eb->i915->drm.struct_mutex))
> +			return -EINTR;
> +

With TTM, an idea that has been around for a long time is to let the 
reservations resolve this. I don't think that's in place yet, though, 
due to the fact that eviction / unbinding still requires a trylock 
reservation and also because the evictions are not batched but performed 
one by one with the evicted objects' reservations dropped immediately 
after eviction. Having reservations resolve this could perhaps be 
something we could aim for in the long run as well? Unrelated batches 
would then never contend.

In the meantime, would it make sense to introduce a new device-wide mutex
to avoid completely unrelated contention with the struct_mutex?


_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait
  2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
                   ` (71 preceding siblings ...)
  2020-07-15 19:55 ` [Intel-gfx] ✗ Fi.CI.IGT: failure " Patchwork
@ 2020-07-23 20:32 ` Dave Airlie
  2020-07-27  9:35   ` Tvrtko Ursulin
  72 siblings, 1 reply; 156+ messages in thread
From: Dave Airlie @ 2020-07-23 20:32 UTC (permalink / raw)
  To: Chris Wilson; +Cc: Intel Graphics Development, Matthew Auld

I've got a 66-patch series here; does it have a cover letter I missed?

Does it state what the goal of this series is? Does it tell the
reviewer where things are headed and why this is a good idea from a
high level?

The problem with series like these is that they are impossible to review
from a "WTF does it do" standpoint; they force people to review at a patch
level, and the high-level concepts and implications get missed.

There is no world where this will be landing like this in my tree.

Dave.

On Wed, 15 Jul 2020 at 21:52, Chris Wilson <chris@chris-wilson.co.uk> wrote:
>
> Currently, we use i915_request_completed() directly in
> i915_request_wait() and follow up with a manual invocation of
> dma_fence_signal(). This appears to cause a large number of contentions
> on i915_request.lock as when the process is woken up after the fence is
> signaled by an interrupt, we will then try and call dma_fence_signal()
> ourselves while the signaler is still holding the lock.
> dma_fence_is_signaled() has the benefit of checking the
> DMA_FENCE_FLAG_SIGNALED_BIT prior to calling dma_fence_signal() and so
> avoids most of that contention.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Matthew Auld <matthew.auld@intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_request.c | 12 ++++--------
>  1 file changed, 4 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
> index 0b2fe55e6194..bb4eb1a8780e 100644
> --- a/drivers/gpu/drm/i915/i915_request.c
> +++ b/drivers/gpu/drm/i915/i915_request.c
> @@ -1640,7 +1640,7 @@ static bool busywait_stop(unsigned long timeout, unsigned int cpu)
>         return this_cpu != cpu;
>  }
>
> -static bool __i915_spin_request(const struct i915_request * const rq, int state)
> +static bool __i915_spin_request(struct i915_request * const rq, int state)
>  {
>         unsigned long timeout_ns;
>         unsigned int cpu;
> @@ -1673,7 +1673,7 @@ static bool __i915_spin_request(const struct i915_request * const rq, int state)
>         timeout_ns = READ_ONCE(rq->engine->props.max_busywait_duration_ns);
>         timeout_ns += local_clock_ns(&cpu);
>         do {
> -               if (i915_request_completed(rq))
> +               if (dma_fence_is_signaled(&rq->fence))
>                         return true;
>
>                 if (signal_pending_state(state, current))
> @@ -1766,10 +1766,8 @@ long i915_request_wait(struct i915_request *rq,
>          * duration, which we currently lack.
>          */
>         if (IS_ACTIVE(CONFIG_DRM_I915_MAX_REQUEST_BUSYWAIT) &&
> -           __i915_spin_request(rq, state)) {
> -               dma_fence_signal(&rq->fence);
> +           __i915_spin_request(rq, state))
>                 goto out;
> -       }
>
>         /*
>          * This client is about to stall waiting for the GPU. In many cases
> @@ -1796,10 +1794,8 @@ long i915_request_wait(struct i915_request *rq,
>         for (;;) {
>                 set_current_state(state);
>
> -               if (i915_request_completed(rq)) {
> -                       dma_fence_signal(&rq->fence);
> +               if (dma_fence_is_signaled(&rq->fence))
>                         break;
> -               }
>
>                 intel_engine_flush_submission(rq->engine);
>
> --
> 2.20.1
>
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [Intel-gfx] [PATCH 11/66] drm/i915: Preallocate stashes for vma page-directories
  2020-07-15 11:50 ` [Intel-gfx] [PATCH 11/66] drm/i915: Preallocate stashes for vma page-directories Chris Wilson
  2020-07-20 10:35   ` Matthew Auld
  2020-07-23 14:33   ` Thomas Hellström (Intel)
@ 2020-07-27  9:24   ` Thomas Hellström (Intel)
  2020-07-28 14:50     ` Chris Wilson
  2 siblings, 1 reply; 156+ messages in thread
From: Thomas Hellström (Intel) @ 2020-07-27  9:24 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

Hi, Chris,

It appears to me like this series is doing a lot of different things:

- Various optimizations
- Locking rework
- Adding schedulers
- Other misc fixes

Could you please separate out, as much as possible, the locking rework 
prerequisites into one series with a cover letter, and most importantly the 
major part of the locking rework (only) with a more elaborate cover letter 
discussing, where not trivial, how each patch fits in, and the design and 
future directions. Questions that I have stumbled on so far (as a 
new-to-the-driver reviewer):

- When are memory allocations disallowed? If we need to pre-allocate in 
execbuf, when? why?
- When is the request dma-fence published?
- Do we need to keep cpu asynchronous execbuf tasks after this? why?
- What about userptr pinning ending up in the dma_fence critical path?

And then move anything non-related to separate series?

Thanks,

Thomas


_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait
  2020-07-23 20:32 ` [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Dave Airlie
@ 2020-07-27  9:35   ` Tvrtko Ursulin
  0 siblings, 0 replies; 156+ messages in thread
From: Tvrtko Ursulin @ 2020-07-27  9:35 UTC (permalink / raw)
  To: Dave Airlie, Chris Wilson; +Cc: Intel Graphics Development, Matthew Auld


On 23/07/2020 21:32, Dave Airlie wrote:
> I've got a 66-patch series here; does it have a cover letter I missed?
 >
 > Does it state what the goal of this series is? Does it tell the
 > reviewer where things are headed and why this is a good idea from a
 > high level?

Chris sent it on one of the previous rounds upon my request - please see 
https://www.spinics.net/lists/intel-gfx/msg243461.html. First paragraph 
is the key.

This series of 66 includes some other unrelated work, which is a bit 
misleading, but that's the usual. :) The real patch count is more like 
20, as in that posting which had a cover letter.

> The problem with series like these is that they are impossible to review
> from a "WTF does it do" standpoint; they force people to review at a patch
> level, and the high-level concepts and implications get missed.

I've been reviewing both implementations so in case it helps I'll write 
a few words... We had internal discussions and meetings on two different 
approaches. With this in mind, I agree it is hard to get the full 
picture looking from the outside when only a limited amount of information 
went out (in the form of the cover letter).

In short, the core idea of the series is splitting out object backing-store 
reservation from address space management. This way it is able to 
collect all possible backing store (and kernel memory allocations) into 
this first stage, and it also does not have to feed the ww context down 
the stack. (Because parts lower in the stack can therefore never try to 
obtain new buffer objects, or do memory allocation.)

To me that sounds a solid approach which is in line with obj dma_resv 
locking rules.

And it definitely is not to be reviewed (just) on a patch-per-patch 
basis. Applying all of it and looking at the end result is what is 
needed, and what I did first before proceeding to look at individual 
patches.

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [Intel-gfx] [PATCH 27/66] drm/i915/gem: Pull execbuf dma resv under a single critical section
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 27/66] drm/i915/gem: Pull execbuf dma resv under a single critical section Chris Wilson
@ 2020-07-27 18:08   ` Thomas Hellström (Intel)
  2020-07-28 15:16     ` Chris Wilson
  0 siblings, 1 reply; 156+ messages in thread
From: Thomas Hellström (Intel) @ 2020-07-27 18:08 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 7/15/20 1:51 PM, Chris Wilson wrote:
> Acquire all the objects and their backing storage, and page directories,
> as used by execbuf under a single common ww_mutex. Albeit we have to
> restart the critical section a few times in order to handle various
> restrictions (such as avoiding copy_(from|to)_user and mmap_sem).
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> ---
>   .../gpu/drm/i915/gem/i915_gem_execbuffer.c    | 168 +++++++++---------
>   .../i915/gem/selftests/i915_gem_execbuffer.c  |   8 +-
>   2 files changed, 87 insertions(+), 89 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> index ebabc0746d50..db433f3f18ec 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> @@ -20,6 +20,7 @@
>   #include "gt/intel_gt_pm.h"
>   #include "gt/intel_gt_requests.h"
>   #include "gt/intel_ring.h"
> +#include "mm/i915_acquire_ctx.h"
>   
>   #include "i915_drv.h"
>   #include "i915_gem_clflush.h"
> @@ -244,6 +245,8 @@ struct i915_execbuffer {
>   	struct intel_context *context; /* logical state for the request */
>   	struct i915_gem_context *gem_context; /** caller's context */
>   
> +	struct i915_acquire_ctx acquire; /** lock for _all_ DMA reservations */
> +
>   	struct i915_request *request; /** our request to build */
>   	struct eb_vma *batch; /** identity of the batch obj/vma */
>   
> @@ -389,42 +392,6 @@ static void eb_vma_array_put(struct eb_vma_array *arr)
>   	kref_put(&arr->kref, eb_vma_array_destroy);
>   }
>   
> -static int
> -eb_lock_vma(struct i915_execbuffer *eb, struct ww_acquire_ctx *acquire)
> -{
> -	struct eb_vma *ev;
> -	int err = 0;
> -
> -	list_for_each_entry(ev, &eb->submit_list, submit_link) {
> -		struct i915_vma *vma = ev->vma;
> -
> -		err = ww_mutex_lock_interruptible(&vma->resv->lock, acquire);
> -		if (err == -EDEADLK) {
> -			struct eb_vma *unlock = ev, *en;
> -
> -			list_for_each_entry_safe_continue_reverse(unlock, en,
> -								  &eb->submit_list,
> -								  submit_link) {
> -				ww_mutex_unlock(&unlock->vma->resv->lock);
> -				list_move_tail(&unlock->submit_link, &eb->submit_list);
> -			}
> -
> -			GEM_BUG_ON(!list_is_first(&ev->submit_link, &eb->submit_list));
> -			err = ww_mutex_lock_slow_interruptible(&vma->resv->lock,
> -							       acquire);
> -		}
> -		if (err) {
> -			list_for_each_entry_continue_reverse(ev,
> -							     &eb->submit_list,
> -							     submit_link)
> -				ww_mutex_unlock(&ev->vma->resv->lock);
> -			break;
> -		}
> -	}
> -
> -	return err;
> -}
> -
>   static int eb_create(struct i915_execbuffer *eb)
>   {
>   	/* Allocate an extra slot for use by the sentinel */
> @@ -668,6 +635,25 @@ eb_add_vma(struct i915_execbuffer *eb,
>   	}
>   }
>   
> +static int eb_lock_mm(struct i915_execbuffer *eb)
> +{
> +	struct eb_vma *ev;
> +	int err;
> +
> +	list_for_each_entry(ev, &eb->bind_list, bind_link) {
> +		err = i915_acquire_ctx_lock(&eb->acquire, ev->vma->obj);
> +		if (err)
> +			return err;
> +	}
> +
> +	return 0;
> +}
> +
> +static int eb_acquire_mm(struct i915_execbuffer *eb)
> +{
> +	return i915_acquire_mm(&eb->acquire);
> +}
> +
>   struct eb_vm_work {
>   	struct dma_fence_work base;
>   	struct eb_vma_array *array;
> @@ -1390,7 +1376,15 @@ static int eb_reserve_vm(struct i915_execbuffer *eb)
>   	unsigned long count;
>   	struct eb_vma *ev;
>   	unsigned int pass;
> -	int err = 0;
> +	int err;
> +
> +	err = eb_lock_mm(eb);
> +	if (err)
> +		return err;
> +
> +	err = eb_acquire_mm(eb);
> +	if (err)
> +		return err;
>   
>   	count = 0;
>   	INIT_LIST_HEAD(&unbound);
> @@ -1416,10 +1410,15 @@ static int eb_reserve_vm(struct i915_execbuffer *eb)
>   	if (count == 0)
>   		return 0;
>   
> +	/* We need to reserve page directories, release all, start over */
> +	i915_acquire_ctx_fini(&eb->acquire);
> +
>   	pass = 0;
>   	do {
>   		struct eb_vm_work *work;
>   
> +		i915_acquire_ctx_init(&eb->acquire);

Couldn't we do a i915_acquire_ctx_rollback() here to avoid losing our 
ticket? That would mean deferring i915_acquire_ctx_done() until all 
potential rollbacks have been performed.

Or even better if we defer _ctx_done(), couldn't we just continue 
locking the pts here instead of dropping and re-acquiring everything?

> +
>   		/*
>   		 * We need to hold one lock as we bind all the vma so that
>   		 * we have a consistent view of the entire vm and can plan
> @@ -1436,6 +1435,11 @@ static int eb_reserve_vm(struct i915_execbuffer *eb)
>   		 * beneath it, so we have to stage and preallocate all the
>   		 * resources we may require before taking the mutex.
>   		 */
> +
> +		err = eb_lock_mm(eb);
> +		if (err)
> +			return err;
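
For reference, the rollback question above maps onto the standard ww_mutex
acquire/backoff loop. A rough sketch using only the core kernel API follows;
lock_all() and the locks[] array are invented for illustration and are not
the i915_acquire_ctx implementation:

static int lock_all(struct ww_acquire_ctx *ctx,
		    struct ww_mutex **locks, unsigned int n)
{
	struct ww_mutex *contended = NULL;
	unsigned int i;
	int err;

	ww_acquire_init(ctx, &reservation_ww_class);
retry:
	for (i = 0; i < n; i++) {
		if (locks[i] == contended) {
			/* already held from the slow path below */
			contended = NULL;
			continue;
		}

		err = ww_mutex_lock(locks[i], ctx);
		if (err == -EDEADLK) {
			struct ww_mutex *busy = locks[i];

			/* Roll back everything taken so far... */
			while (i--)
				ww_mutex_unlock(locks[i]);
			if (contended)
				ww_mutex_unlock(contended);

			/* ...sleep on the contended lock, keeping the
			 * same ticket, then restart the pass. */
			ww_mutex_lock_slow(busy, ctx);
			contended = busy;
			goto retry;
		}
	}

	/* Only now do we promise not to take any further locks. */
	ww_acquire_done(ctx);
	return 0;
}

The point being that ww_acquire_done() is deferred until every lock,
including any discovered late (such as the page directories here), has
been taken, so a late -EDEADLK can still be rolled back and retried on
the same ticket rather than dropping and re-initialising everything.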
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [Intel-gfx] [PATCH 22/66] drm/i915/gem: Bind the fence async for execbuf
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 22/66] drm/i915/gem: Bind the fence async for execbuf Chris Wilson
@ 2020-07-27 18:19   ` Thomas Hellström (Intel)
  2020-07-28 15:08     ` Chris Wilson
  0 siblings, 1 reply; 156+ messages in thread
From: Thomas Hellström (Intel) @ 2020-07-27 18:19 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 7/15/20 1:51 PM, Chris Wilson wrote:
> It is illegal to wait on another vma while holding the vm->mutex, as
> that easily leads to ABBA deadlocks (we wait on a second vma that waits
> on us to release the vm->mutex). So while the vm->mutex exists, move the
> waiting outside of the lock into the async binding pipeline.

Why is it we don't just move the fence binding to a separate loop after 
unlocking the vm->mutex in eb_reserve_vm()?

/Thomas

>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> ---
>   .../gpu/drm/i915/gem/i915_gem_execbuffer.c    |  21 +--
>   drivers/gpu/drm/i915/gt/intel_ggtt_fencing.c  | 137 +++++++++++++++++-
>   drivers/gpu/drm/i915/gt/intel_ggtt_fencing.h  |   5 +
>   3 files changed, 151 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> index bdcbb82bfc3d..af2b4aeb6df0 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> @@ -1056,15 +1056,6 @@ static int eb_reserve_vma(struct eb_vm_work *work, struct eb_bind_vma *bind)
>   		return err;
>   
>   pin:
> -	if (unlikely(exec_flags & EXEC_OBJECT_NEEDS_FENCE)) {
> -		err = __i915_vma_pin_fence(vma); /* XXX no waiting */
> -		if (unlikely(err))
> -			return err;
> -
> -		if (vma->fence)
> -			bind->ev->flags |= __EXEC_OBJECT_HAS_FENCE;
> -	}
> -
>   	bind_flags &= ~atomic_read(&vma->flags);
>   	if (bind_flags) {
>   		err = set_bind_fence(vma, work);
> @@ -1095,6 +1086,15 @@ static int eb_reserve_vma(struct eb_vm_work *work, struct eb_bind_vma *bind)
>   	bind->ev->flags |= __EXEC_OBJECT_HAS_PIN;
>   	GEM_BUG_ON(eb_vma_misplaced(entry, vma, bind->ev->flags));
>   
> +	if (unlikely(exec_flags & EXEC_OBJECT_NEEDS_FENCE)) {
> +		err = __i915_vma_pin_fence_async(vma, &work->base);
> +		if (unlikely(err))
> +			return err;
> +
> +		if (vma->fence)
> +			bind->ev->flags |= __EXEC_OBJECT_HAS_FENCE;
> +	}
> +
>   	return 0;
>   }
>   
> @@ -1160,6 +1160,9 @@ static void __eb_bind_vma(struct eb_vm_work *work)
>   		struct eb_bind_vma *bind = &work->bind[n];
>   		struct i915_vma *vma = bind->ev->vma;
>   
> +		if (bind->ev->flags & __EXEC_OBJECT_HAS_FENCE)
> +			__i915_vma_apply_fence_async(vma);
> +
>   		if (!bind->bind_flags)
>   			goto put;
>   
> diff --git a/drivers/gpu/drm/i915/gt/intel_ggtt_fencing.c b/drivers/gpu/drm/i915/gt/intel_ggtt_fencing.c
> index 7fb36b12fe7a..734b6aa61809 100644
> --- a/drivers/gpu/drm/i915/gt/intel_ggtt_fencing.c
> +++ b/drivers/gpu/drm/i915/gt/intel_ggtt_fencing.c
> @@ -21,10 +21,13 @@
>    * IN THE SOFTWARE.
>    */
>   
> +#include "i915_active.h"
>   #include "i915_drv.h"
>   #include "i915_scatterlist.h"
> +#include "i915_sw_fence_work.h"
>   #include "i915_pvinfo.h"
>   #include "i915_vgpu.h"
> +#include "i915_vma.h"
>   
>   /**
>    * DOC: fence register handling
> @@ -340,19 +343,37 @@ static struct i915_fence_reg *fence_find(struct i915_ggtt *ggtt)
>   	return ERR_PTR(-EDEADLK);
>   }
>   
> +static int fence_wait_bind(struct i915_fence_reg *reg)
> +{
> +	struct dma_fence *fence;
> +	int err = 0;
> +
> +	fence = i915_active_fence_get(&reg->active.excl);
> +	if (fence) {
> +		err = dma_fence_wait(fence, true);
> +		dma_fence_put(fence);
> +	}
> +
> +	return err;
> +}
> +
>   int __i915_vma_pin_fence(struct i915_vma *vma)
>   {
>   	struct i915_ggtt *ggtt = i915_vm_to_ggtt(vma->vm);
> -	struct i915_fence_reg *fence;
> +	struct i915_fence_reg *fence = vma->fence;
>   	struct i915_vma *set = i915_gem_object_is_tiled(vma->obj) ? vma : NULL;
>   	int err;
>   
>   	lockdep_assert_held(&vma->vm->mutex);
>   
>   	/* Just update our place in the LRU if our fence is getting reused. */
> -	if (vma->fence) {
> -		fence = vma->fence;
> +	if (fence) {
>   		GEM_BUG_ON(fence->vma != vma);
> +
> +		err = fence_wait_bind(fence);
> +		if (err)
> +			return err;
> +
>   		atomic_inc(&fence->pin_count);
>   		if (!fence->dirty) {
>   			list_move_tail(&fence->link, &ggtt->fence_list);
> @@ -384,6 +405,116 @@ int __i915_vma_pin_fence(struct i915_vma *vma)
>   	return err;
>   }
>   
> +static int set_bind_fence(struct i915_fence_reg *fence,
> +			  struct dma_fence_work *work)
> +{
> +	struct dma_fence *prev;
> +	int err;
> +
> +	if (rcu_access_pointer(fence->active.excl.fence) == &work->dma)
> +		return 0;
> +
> +	err = i915_sw_fence_await_active(&work->chain,
> +					 &fence->active,
> +					 I915_ACTIVE_AWAIT_ACTIVE);
> +	if (err)
> +		return err;
> +
> +	if (i915_active_acquire(&fence->active))
> +		return -ENOENT;
> +
> +	prev = i915_active_set_exclusive(&fence->active, &work->dma);
> +	if (unlikely(prev)) {
> +		err = i915_sw_fence_await_dma_fence(&work->chain, prev, 0,
> +						    GFP_NOWAIT | __GFP_NOWARN);
> +		dma_fence_put(prev);
> +	}
> +
> +	i915_active_release(&fence->active);
> +	return err < 0 ? err : 0;
> +}
> +
> +int __i915_vma_pin_fence_async(struct i915_vma *vma,
> +			       struct dma_fence_work *work)
> +{
> +	struct i915_ggtt *ggtt = i915_vm_to_ggtt(vma->vm);
> +	struct i915_vma *set = i915_gem_object_is_tiled(vma->obj) ? vma : NULL;
> +	struct i915_fence_reg *fence = vma->fence;
> +	int err;
> +
> +	lockdep_assert_held(&vma->vm->mutex);
> +
> +	/* Just update our place in the LRU if our fence is getting reused. */
> +	if (fence) {
> +		GEM_BUG_ON(fence->vma != vma);
> +		GEM_BUG_ON(!i915_vma_is_map_and_fenceable(vma));
> +	} else if (set) {
> +		if (!i915_vma_is_map_and_fenceable(vma))
> +			return -EINVAL;
> +
> +		fence = fence_find(ggtt);
> +		if (IS_ERR(fence))
> +			return -ENOSPC;
> +
> +		GEM_BUG_ON(atomic_read(&fence->pin_count));
> +		fence->dirty = true;
> +	} else {
> +		return 0;
> +	}
> +
> +	atomic_inc(&fence->pin_count);
> +	list_move_tail(&fence->link, &ggtt->fence_list);
> +	if (!fence->dirty)
> +		return 0;
> +
> +	if (INTEL_GEN(fence_to_i915(fence)) < 4 &&
> +	    rcu_access_pointer(vma->active.excl.fence) != &work->dma) {
> +		/* implicit 'unfenced' GPU blits */
> +		err = i915_sw_fence_await_active(&work->chain,
> +						 &vma->active,
> +						 I915_ACTIVE_AWAIT_ACTIVE);
> +		if (err)
> +			goto err_unpin;
> +	}
> +
> +	err = set_bind_fence(fence, work);
> +	if (err)
> +		goto err_unpin;
> +
> +	if (set) {
> +		fence->start = vma->node.start;
> +		fence->size  = vma->fence_size;
> +		fence->stride = i915_gem_object_get_stride(vma->obj);
> +		fence->tiling = i915_gem_object_get_tiling(vma->obj);
> +
> +		vma->fence = fence;
> +	} else {
> +		fence->tiling = 0;
> +		vma->fence = NULL;
> +	}
> +
> +	set = xchg(&fence->vma, set);
> +	if (set && set != vma) {
> +		GEM_BUG_ON(set->fence != fence);
> +		WRITE_ONCE(set->fence, NULL);
> +		i915_vma_revoke_mmap(set);
> +	}
> +
> +	return 0;
> +
> +err_unpin:
> +	atomic_dec(&fence->pin_count);
> +	return err;
> +}
> +
> +void __i915_vma_apply_fence_async(struct i915_vma *vma)
> +{
> +	struct i915_fence_reg *fence = vma->fence;
> +
> +	if (fence->dirty)
> +		fence_write(fence);
> +}
> +
>   /**
>    * i915_vma_pin_fence - set up fencing for a vma
>    * @vma: vma to map through a fence reg
> diff --git a/drivers/gpu/drm/i915/gt/intel_ggtt_fencing.h b/drivers/gpu/drm/i915/gt/intel_ggtt_fencing.h
> index 9eef679e1311..d306ac14d47e 100644
> --- a/drivers/gpu/drm/i915/gt/intel_ggtt_fencing.h
> +++ b/drivers/gpu/drm/i915/gt/intel_ggtt_fencing.h
> @@ -30,6 +30,7 @@
>   
>   #include "i915_active.h"
>   
> +struct dma_fence_work;
>   struct drm_i915_gem_object;
>   struct i915_ggtt;
>   struct i915_vma;
> @@ -70,6 +71,10 @@ void i915_gem_object_do_bit_17_swizzle(struct drm_i915_gem_object *obj,
>   void i915_gem_object_save_bit_17_swizzle(struct drm_i915_gem_object *obj,
>   					 struct sg_table *pages);
>   
> +int __i915_vma_pin_fence_async(struct i915_vma *vma,
> +			       struct dma_fence_work *work);
> +void __i915_vma_apply_fence_async(struct i915_vma *vma);
> +
>   void intel_ggtt_init_fences(struct i915_ggtt *ggtt);
>   void intel_ggtt_fini_fences(struct i915_ggtt *ggtt);
>   
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [Intel-gfx] [PATCH 16/66] drm/i915/gem: Remove the call for no-evict i915_vma_pin
  2020-07-15 11:50 ` [Intel-gfx] [PATCH 16/66] drm/i915/gem: Remove the call for no-evict i915_vma_pin Chris Wilson
  2020-07-17 14:36   ` Tvrtko Ursulin
@ 2020-07-28  9:46   ` Thomas Hellström (Intel)
  2020-07-28 15:05     ` Chris Wilson
  1 sibling, 1 reply; 156+ messages in thread
From: Thomas Hellström (Intel) @ 2020-07-28  9:46 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 7/15/20 1:50 PM, Chris Wilson wrote:
> Remove the stub i915_vma_pin() used for incrementally pining objects for

s/pining/pinning/

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [Intel-gfx] [PATCH 28/66] drm/i915/gem: Replace i915_gem_object.mm.mutex with reservation_ww_class
  2020-07-16 15:53     ` Tvrtko Ursulin
@ 2020-07-28 11:17       ` Thomas Hellström (Intel)
  2020-07-29  7:56         ` Thomas Hellström (Intel)
  2020-07-29 12:17         ` Tvrtko Ursulin
  0 siblings, 2 replies; 156+ messages in thread
From: Thomas Hellström (Intel) @ 2020-07-28 11:17 UTC (permalink / raw)
  To: Tvrtko Ursulin, Maarten Lankhorst, Chris Wilson, intel-gfx


On 7/16/20 5:53 PM, Tvrtko Ursulin wrote:
>
> On 15/07/2020 16:43, Maarten Lankhorst wrote:
>> On 15-07-2020 at 13:51, Chris Wilson wrote:
>>> Our goal is to pull all memory reservations (next iteration
>>> obj->ops->get_pages()) under a ww_mutex, and to align those 
>>> reservations
>>> with other drivers, i.e. control all such allocations with the
>>> reservation_ww_class. Currently, this is under the purview of the
>>> obj->mm.mutex, and while obj->mm remains an embedded struct we can
>>> "simply" switch to using the reservation_ww_class obj->base.resv->lock
>>>
>>> The major consequence is the impact on the shrinker paths as the
>>> reservation_ww_class is used to wrap allocations, and a ww_mutex does
>>> not support subclassing so we cannot do our usual trick of knowing that
>>> we never recurse inside the shrinker and instead have to finish the
>>> reclaim with a trylock. This may result in us failing to release the
>>> pages after having released the vma. This will have to do until a 
>>> better
>>> idea comes along.
>>>
>>> However, this step only converts the mutex over and continues to treat
>>> everything as a single allocation and pinning the pages. With the
>>> ww_mutex in place we can remove the temporary pinning, as we can then
>>> reserve all storage en masse.
>>>
>>> One last thing to do: kill the implicit page pinning for active vma.
>>> This will require us to invalidate the vma->pages when the backing 
>>> store
>>> is removed (and we expect that while the vma is active, we mark the
>>> backing store as active so that it cannot be removed while the HW is
>>> busy.)
>>>
>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>
> [snip]
>
>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c 
>>> b/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c
>>> index dc8f052a0ffe..4e928103a38f 100644
>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c
>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c
>>> @@ -47,10 +47,7 @@ static bool unsafe_drop_pages(struct 
>>> drm_i915_gem_object *obj,
>>>       if (!(shrink & I915_SHRINK_BOUND))
>>>           flags = I915_GEM_OBJECT_UNBIND_TEST;
>>>   -    if (i915_gem_object_unbind(obj, flags) == 0)
>>> -        __i915_gem_object_put_pages(obj);
>>> -
>>> -    return !i915_gem_object_has_pages(obj);
>>> +    return i915_gem_object_unbind(obj, flags) == 0;
>>>   }
>>>     static void try_to_writeback(struct drm_i915_gem_object *obj,
>>> @@ -199,14 +196,14 @@ i915_gem_shrink(struct drm_i915_private *i915,
>>> spin_unlock_irqrestore(&i915->mm.obj_lock, flags);
>>>   -            if (unsafe_drop_pages(obj, shrink)) {
>>> -                /* May arrive from get_pages on another bo */
>>> -                mutex_lock(&obj->mm.lock);
>>> +            if (unsafe_drop_pages(obj, shrink) &&
>>> +                i915_gem_object_trylock(obj)) {
>
>> Why trylock? Because of the nesting? In that case, still use ww ctx 
>> if provided please
>
> By "if provided" you mean for code paths where we are calling the 
> shrinker ourselves, as opposed to reclaim, like shmem_get_pages?
>
> That indeed sounds like the right thing to do, since all the get_pages 
> from execbuf are in the reservation phase, collecting a list of GEM 
> objects to lock, the ones to shrink sound like they should be on that list.
>
>>> + __i915_gem_object_put_pages(obj);
>>>                   if (!i915_gem_object_has_pages(obj)) {
>>>                       try_to_writeback(obj, shrink);
>>>                       count += obj->base.size >> PAGE_SHIFT;
>>>                   }
>>> -                mutex_unlock(&obj->mm.lock);
>>> +                i915_gem_object_unlock(obj);
>>>               }
>>>                 scanned += obj->base.size >> PAGE_SHIFT;
>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_tiling.c 
>>> b/drivers/gpu/drm/i915/gem/i915_gem_tiling.c
>>> index ff72ee2fd9cd..ac12e1c20e66 100644
>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_tiling.c
>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_tiling.c
>>> @@ -265,7 +265,6 @@ i915_gem_object_set_tiling(struct 
>>> drm_i915_gem_object *obj,
>>>        * pages to prevent them being swapped out and causing corruption
>>>        * due to the change in swizzling.
>>>        */
>>> -    mutex_lock(&obj->mm.lock);
>>>       if (i915_gem_object_has_pages(obj) &&
>>>           obj->mm.madv == I915_MADV_WILLNEED &&
>>>           i915->quirks & QUIRK_PIN_SWIZZLED_PAGES) {
>>> @@ -280,7 +279,6 @@ i915_gem_object_set_tiling(struct 
>>> drm_i915_gem_object *obj,
>>>               obj->mm.quirked = true;
>>>           }
>>>       }
>>> -    mutex_unlock(&obj->mm.lock);
>>>         spin_lock(&obj->vma.lock);
>>>       for_each_ggtt_vma(vma, obj) {
>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c 
>>> b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
>>> index e946032b13e4..80907c00c6fd 100644
>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
>>> @@ -129,8 +129,15 @@ userptr_mn_invalidate_range_start(struct 
>>> mmu_notifier *_mn,
>>>           ret = i915_gem_object_unbind(obj,
>>>                            I915_GEM_OBJECT_UNBIND_ACTIVE |
>>>                            I915_GEM_OBJECT_UNBIND_BARRIER);
>>> -        if (ret == 0)
>>> -            ret = __i915_gem_object_put_pages(obj);
>>> +        if (ret == 0) {
>>> +            /* ww_mutex and mmu_notifier is fs_reclaim tainted */
>>> +            if (i915_gem_object_trylock(obj)) {
>>> +                ret = __i915_gem_object_put_pages(obj);
>>> +                i915_gem_object_unlock(obj);
>>> +            } else {
>>> +                ret = -EAGAIN;
>>> +            }
>>> +        }
>>
>> I'm not sure upstream will agree with this kind of API:
>>
>> 1. It will deadlock when RT tasks are used.
>
> It will or it can? Which part? It will break out of the loop if 
> trylock fails.
>
>>
>> 2. You start throwing -EAGAIN because you don't have the correct 
>> ordering of locking, this needs fixing first.
>
> Is it about correct ordering of locks or something else? If memory 
> allocation is allowed under dma_resv.lock, then the opposite order 
> cannot be taken in any case.
>
> I've had a brief look at the amdgpu solution and maybe I misunderstood 
> something, but it looks like a BKL approach with the device level 
> notifier_lock. Their userptr notifier blocks on that one, not on 
> dma_resv lock, but that also means their command submission 
> (amdgpu_cs_submit) blocks on the same lock while obtaining backing store.

If I read Christian right, it blocks on that lock only just before 
command submission to validate that sequence number. If there is a 
mismatch, it needs to rerun CS. I'm not sure how common userptr buffers 
are, but if a device-wide mutex hurts too much, there are perhaps more 
fine-grained solutions. (Like an rw semaphore, and unlocking before the 
fence wait in the notifier: CS which are unaffected shouldn't need to 
wait...).
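
Roughly, the rw-semaphore variant sketched in words above could look like
the following; all the names (my_device, userptr_lock, notifier_seq and the
helpers) are made up for the sketch and are not actual i915 or amdgpu
symbols:

/* mmu notifier side: force any in-flight CS to revalidate its pages. */
static void userptr_invalidate(struct my_device *dev, struct my_range *range)
{
	down_write(&dev->userptr_lock);
	dev->notifier_seq++;
	up_write(&dev->userptr_lock);

	/* Wait for already-published work outside the lock so that
	 * unrelated submissions are not blocked behind this fence. */
	dma_fence_wait(range->last_fence, false);
}

/* submission side: only publish the fences if nothing has changed since
 * the pages were pinned (seq was sampled at pin time). */
static int cs_commit(struct my_device *dev, struct my_job *job,
		     unsigned long seq)
{
	int ret = 0;

	down_read(&dev->userptr_lock);
	if (seq != dev->notifier_seq)
		ret = -EAGAIN;	/* restart: re-pin and try again */
	else
		publish_job_fences(job);
	up_read(&dev->userptr_lock);

	return ret;
}

Unaffected CS only ever take the read side, so they do not serialise
against each other, and the notifier's fence wait happens after dropping
the write lock.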

>
> So it looks like a big hammer approach not directly related to the 
> story of dma_resv locking. Maybe we could do the same big hammer 
> approach, although I am not sure how it is deadlock free.
>
> What happens for instance if someone submits an userptr batch which 
> gets unmapped while amdgpu_cs_submit is holding the notifier_lock?

My understanding is the unmapping operation blocks on the notifier_lock 
in the mmu notifier?

/Thomas


>
> If you understand amdgpu better please share some insights. I 
> certainly only looked at it briefly today so may be wrong.
>
> Regards,
>
> Tvrtko
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [Intel-gfx] [PATCH 07/66] drm/i915: Keep the most recently used active-fence upon discard
  2020-07-17 12:38   ` Tvrtko Ursulin
@ 2020-07-28 14:22     ` Chris Wilson
  0 siblings, 0 replies; 156+ messages in thread
From: Chris Wilson @ 2020-07-28 14:22 UTC (permalink / raw)
  To: Tvrtko Ursulin, intel-gfx

Quoting Tvrtko Ursulin (2020-07-17 13:38:01)
> 
> On 15/07/2020 12:50, Chris Wilson wrote:
> > +     /* Even if we have not used the cache, we may still have a barrier */
> > +     if (!ref->cache)
> > +             ref->cache = fetch_node(ref->tree.rb_node);
> > +
> > +     /* Keep the MRU cached node for reuse */
> > +     if (ref->cache) {
> > +             /* Discard all other nodes in the tree */
> > +             rb_erase(&ref->cache->node, &ref->tree);
> > +             root = ref->tree;
> > +
> > +             /* Rebuild the tree with only the cached node */
> > +             rb_link_node(&ref->cache->node, NULL, &ref->tree.rb_node);
> > +             rb_insert_color(&ref->cache->node, &ref->tree);
> > +             GEM_BUG_ON(ref->tree.rb_node != &ref->cache->node);
> > +     }
> >   
> >       spin_unlock_irqrestore(&ref->tree_lock, flags);
> >   
> > @@ -156,6 +168,7 @@ __active_retire(struct i915_active *ref)
> >       /* ... except if you wait on it, you must manage your own references! */
> >       wake_up_var(ref);
> >   
> > +     /* Finally free the discarded timeline tree  */
> >       rbtree_postorder_for_each_entry_safe(it, n, &root, node) {
> >               GEM_BUG_ON(i915_active_fence_isset(&it->base));
> >               kmem_cache_free(global.slab_cache, it);
> 
> Here it frees everything... so how does ref->cache, being in the tree, 
> survive?

This is the old root which does not contain ref->cache, as we moved that
to become the new tree.

             /* Discard all other nodes in the tree */
             rb_erase(&ref->cache->node, &ref->tree);
             root = ref->tree;
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [Intel-gfx] [PATCH 08/66] drm/i915: Make the stale cached active node available for any timeline
  2020-07-17 13:04   ` Tvrtko Ursulin
@ 2020-07-28 14:28     ` Chris Wilson
  2020-07-29 12:40       ` Tvrtko Ursulin
  0 siblings, 1 reply; 156+ messages in thread
From: Chris Wilson @ 2020-07-28 14:28 UTC (permalink / raw)
  To: Tvrtko Ursulin, intel-gfx

Quoting Tvrtko Ursulin (2020-07-17 14:04:58)
> 
> On 15/07/2020 12:50, Chris Wilson wrote:
> > Rather than require the next timeline after idling to match the MRU
> > before idling, reset the index on the node and allow it to match the
> > first request. However, this requires cmpxchg(u64) and so is not trivial
> > on 32b, so for compatibility we just fallback to keeping the cached node
> > pointing to the MRU timeline.
> > 
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > ---
> >   drivers/gpu/drm/i915/i915_active.c | 21 +++++++++++++++++++--
> >   1 file changed, 19 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/i915_active.c b/drivers/gpu/drm/i915/i915_active.c
> > index 0854b1552bc1..6737b5615c0c 100644
> > --- a/drivers/gpu/drm/i915/i915_active.c
> > +++ b/drivers/gpu/drm/i915/i915_active.c
> > @@ -157,6 +157,10 @@ __active_retire(struct i915_active *ref)
> >               rb_link_node(&ref->cache->node, NULL, &ref->tree.rb_node);
> >               rb_insert_color(&ref->cache->node, &ref->tree);
> >               GEM_BUG_ON(ref->tree.rb_node != &ref->cache->node);
> > +
> > +             /* Make the cached node available for reuse with any timeline */
> > +             if (IS_ENABLED(CONFIG_64BIT))
> > +                     ref->cache->timeline = 0; /* needs cmpxchg(u64) */
> 
> Or when fence context wraps shock horror.

I'm more concerned that we use timeline:0 as a special unordered
timeline. It's reserved by its use in the dma_fence_stub, and everything
will start to break when the timelines wrap. The earliest casualties
will be the kernel_context timelines, which are also very special indices
for the barriers.

> 
> >       }
> >   
> >       spin_unlock_irqrestore(&ref->tree_lock, flags);
> > @@ -235,9 +239,22 @@ static struct active_node *__active_lookup(struct i915_active *ref, u64 idx)
> >   {
> >       struct active_node *it;
> >   
> > +     GEM_BUG_ON(idx == 0); /* 0 is the unordered timeline, rsvd for cache */
> > +
> >       it = READ_ONCE(ref->cache);
> > -     if (it && it->timeline == idx)
> > -             return it;
> > +     if (it) {
> > +             u64 cached = READ_ONCE(it->timeline);
> > +
> > +             if (cached == idx)
> > +                     return it;
> > +
> > +#ifdef CONFIG_64BIT /* for cmpxchg(u64) */
> > +             if (!cached && !cmpxchg(&it->timeline, 0, idx)) {
> > +                     GEM_BUG_ON(i915_active_fence_isset(&it->base));
> > +                     return it;
> 
> cmpxchg suggests this needs to be atomic; however, above, the check for 
> equality comes from a separate read.

That's fine, and quite common to avoid cmpxchg if the current value
already does not match the expected condition.

> Since there is a lookup code path under the spinlock, perhaps the 
> unlocked lookup could just fail, and then locked lookup could re-assign 
> the timeline without the need for cmpxchg?

The unlocked/locked lookup are the same routine. You pointed that out
:-p
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [Intel-gfx] [PATCH 08/66] drm/i915: Make the stale cached active node available for any timeline
  2020-07-22 11:19   ` Thomas Hellström (Intel)
@ 2020-07-28 14:31     ` Chris Wilson
  0 siblings, 0 replies; 156+ messages in thread
From: Chris Wilson @ 2020-07-28 14:31 UTC (permalink / raw)
  To: Thomas Hellström, intel-gfx

Quoting Thomas Hellström (Intel) (2020-07-22 12:19:28)
> 
> On 2020-07-15 13:50, Chris Wilson wrote:
> > Rather than require the next timeline after idling to match the MRU
> > before idling, reset the index on the node and allow it to match the
> > first request. However, this requires cmpxchg(u64) and so is not trivial
> > on 32b, so for compatibility we just fallback to keeping the cached node
> > pointing to the MRU timeline.
> >
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > ---
> >   drivers/gpu/drm/i915/i915_active.c | 21 +++++++++++++++++++--
> >   1 file changed, 19 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/i915/i915_active.c b/drivers/gpu/drm/i915/i915_active.c
> > index 0854b1552bc1..6737b5615c0c 100644
> > --- a/drivers/gpu/drm/i915/i915_active.c
> > +++ b/drivers/gpu/drm/i915/i915_active.c
> > @@ -157,6 +157,10 @@ __active_retire(struct i915_active *ref)
> >               rb_link_node(&ref->cache->node, NULL, &ref->tree.rb_node);
> >               rb_insert_color(&ref->cache->node, &ref->tree);
> >               GEM_BUG_ON(ref->tree.rb_node != &ref->cache->node);
> > +
> > +             /* Make the cached node available for reuse with any timeline */
> > +             if (IS_ENABLED(CONFIG_64BIT))
> > +                     ref->cache->timeline = 0; /* needs cmpxchg(u64) */
> >       }
> >   
> >       spin_unlock_irqrestore(&ref->tree_lock, flags);
> > @@ -235,9 +239,22 @@ static struct active_node *__active_lookup(struct i915_active *ref, u64 idx)
> >   {
> >       struct active_node *it;
> >   
> > +     GEM_BUG_ON(idx == 0); /* 0 is the unordered timeline, rsvd for cache */
> > +
> >       it = READ_ONCE(ref->cache);
> > -     if (it && it->timeline == idx)
> > -             return it;
> > +     if (it) {
> > +             u64 cached = READ_ONCE(it->timeline);
> > +
> > +             if (cached == idx)
> > +                     return it;
> > +
> > +#ifdef CONFIG_64BIT /* for cmpxchg(u64) */
> > +             if (!cached && !cmpxchg(&it->timeline, 0, idx)) {
> 
> Doesn't cmpxchg() already do an unlocked compare before attempting the 
> locked cycle?

It goes straight to the locked instruction, as it's usually used at
the end of a loop that looks at the old value (and now try_cmpxchg). You
can see the difference in perf, as the cmpxchg stands out like a sore
thumb.
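
Condensed, the pattern under discussion is the one in the hunk quoted
above; reuse_idle_node() here is just an illustrative wrapper around the
same fields:

static struct active_node *
reuse_idle_node(struct active_node *it, u64 idx)
{
	u64 cached = READ_ONCE(it->timeline);

	if (cached == idx) /* fast path: already ours, no atomic needed */
		return it;

	/*
	 * Only pay for the locked cmpxchg when the slot looks free;
	 * cmpxchg() issues the locked cycle whether or not the compare
	 * succeeds, which is why it stands out in the profiles.
	 */
	if (!cached && !cmpxchg(&it->timeline, 0, idx))
		return it;

	return NULL;
}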
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [Intel-gfx] [PATCH 09/66] drm/i915: Provide a fastpath for waiting on vma bindings
  2020-07-17 13:23   ` Tvrtko Ursulin
@ 2020-07-28 14:35     ` Chris Wilson
  2020-07-29 12:43       ` Tvrtko Ursulin
  0 siblings, 1 reply; 156+ messages in thread
From: Chris Wilson @ 2020-07-28 14:35 UTC (permalink / raw)
  To: Tvrtko Ursulin, intel-gfx

Quoting Tvrtko Ursulin (2020-07-17 14:23:22)
> 
> On 15/07/2020 12:50, Chris Wilson wrote:
> > Before we can execute a request, we must wait for all of its vma to be
> > bound. This is a frequent operation for which we can optimise away a
> > few atomic operations (notably a cmpxchg) in lieu of the RCU protection.
> > 
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > ---
> >   drivers/gpu/drm/i915/i915_active.h | 15 +++++++++++++++
> >   drivers/gpu/drm/i915/i915_vma.c    |  9 +++++++--
> >   2 files changed, 22 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/i915_active.h b/drivers/gpu/drm/i915/i915_active.h
> > index b9e0394e2975..fb165d3f01cf 100644
> > --- a/drivers/gpu/drm/i915/i915_active.h
> > +++ b/drivers/gpu/drm/i915/i915_active.h
> > @@ -231,4 +231,19 @@ struct i915_active *i915_active_create(void);
> >   struct i915_active *i915_active_get(struct i915_active *ref);
> >   void i915_active_put(struct i915_active *ref);
> >   
> > +static inline int __i915_request_await_exclusive(struct i915_request *rq,
> > +                                              struct i915_active *active)
> > +{
> > +     struct dma_fence *fence;
> > +     int err = 0;
> > +
> > +     fence = i915_active_fence_get(&active->excl);
> > +     if (fence) {
> > +             err = i915_request_await_dma_fence(rq, fence);
> > +             dma_fence_put(fence);
> > +     }
> > +
> > +     return err;
> > +}
> > +
> >   #endif /* _I915_ACTIVE_H_ */
> > diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c
> > index bc64f773dcdb..cd12047c7791 100644
> > --- a/drivers/gpu/drm/i915/i915_vma.c
> > +++ b/drivers/gpu/drm/i915/i915_vma.c
> > @@ -1167,6 +1167,12 @@ void i915_vma_revoke_mmap(struct i915_vma *vma)
> >               list_del(&vma->obj->userfault_link);
> >   }
> >   
> > +static int
> > +__i915_request_await_bind(struct i915_request *rq, struct i915_vma *vma)
> > +{
> > +     return __i915_request_await_exclusive(rq, &vma->active);
> > +}
> > +
> >   int __i915_vma_move_to_active(struct i915_vma *vma, struct i915_request *rq)
> >   {
> >       int err;
> > @@ -1174,8 +1180,7 @@ int __i915_vma_move_to_active(struct i915_vma *vma, struct i915_request *rq)
> >       GEM_BUG_ON(!i915_vma_is_pinned(vma));
> >   
> >       /* Wait for the vma to be bound before we start! */
> > -     err = i915_request_await_active(rq, &vma->active,
> > -                                     I915_ACTIVE_AWAIT_EXCL);
> > +     err = __i915_request_await_bind(rq, vma);
> >       if (err)
> >               return err;
> >   
> > 
> 
> Looks like for like, apart from missing i915_active_acquire_if_busy 
> across the operation. Remind me please what is acquire/release 
> protecting against? :)

To protect the rbtree walk. So, this is the function we started with for
active_await, but when we added the option to walk the entire rbtree as
well, we pulled it all under a single acquire/release. perf suggests
that was a mistake if all we frequently want to do is grab the exclusive
fence for an await.
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [Intel-gfx] [PATCH 11/66] drm/i915: Preallocate stashes for vma page-directories
  2020-07-23 14:33   ` Thomas Hellström (Intel)
@ 2020-07-28 14:42     ` Chris Wilson
  2020-07-31  7:43       ` Thomas Hellström (Intel)
  0 siblings, 1 reply; 156+ messages in thread
From: Chris Wilson @ 2020-07-28 14:42 UTC (permalink / raw)
  To: Thomas Hellström, intel-gfx; +Cc: Matthew Auld

Quoting Thomas Hellström (Intel) (2020-07-23 15:33:20)
> 
> On 2020-07-15 13:50, Chris Wilson wrote:
> > We need the DMA allocations used for page directories to be
> > performed up front so that we can include those allocations in our
> > memory reservation pass. The downside is that we have to assume the
> > worst case, even before we know the final layout, and always allocate
> > enough page directories for this object, even when there will be overlap.
> > This unfortunately can be quite expensive, especially as we have to
> > clear/reset the page directories and DMA pages, but it should only be
> > required during early phases of a workload when new objects are being
> > discovered, or after memory/eviction pressure when we need to rebind.
> > Once we reach steady state, the objects should not be moved and we no
> > longer need to preallocate the page tables.
> >
> > It should be noted that the lifetime for the page directories DMA is
> > more or less decoupled from individual fences as they will be shared
> > across objects across timelines.
> >
> > v2: Only allocate enough PD space for the PTE we may use, we do not need
> > to allocate PD that will be left as scratch.
> > v3: Store the shift into the first PD level to encapsulate the different
> > PTE counts for gen6/gen8.
> >
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Matthew Auld <matthew.auld@intel.com>
> > ---
> >   .../gpu/drm/i915/gem/i915_gem_client_blt.c    | 11 +--
> >   drivers/gpu/drm/i915/gt/gen6_ppgtt.c          | 40 ++++-----
> >   drivers/gpu/drm/i915/gt/gen8_ppgtt.c          | 78 +++++------------
> >   drivers/gpu/drm/i915/gt/intel_ggtt.c          | 60 ++++++--------
> >   drivers/gpu/drm/i915/gt/intel_gtt.h           | 46 ++++++----
> >   drivers/gpu/drm/i915/gt/intel_ppgtt.c         | 83 ++++++++++++++++---
> >   drivers/gpu/drm/i915/i915_vma.c               | 27 +++---
> >   drivers/gpu/drm/i915/selftests/i915_gem_gtt.c | 60 ++++++++------
> >   drivers/gpu/drm/i915/selftests/mock_gtt.c     | 22 ++---
> >   9 files changed, 237 insertions(+), 190 deletions(-)
> 
> Hi, Chris,
> 
> Overall looks good, but a question: Why can't we perform page-table 
> memory allocation on demand when needed?

We need to allocate device memory for the page tables. The intention
here is to gather up all the resource requirements for an operation and
reserve them in a single pass.
 
> Are we then under a mutex that we also take during reclaim?

Yes, the vm->mutex is used during the shrinker to revoke the GPU
bindings before returning memory to the system.
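
In sketch form, the flow being described is roughly the following; the
stash helpers and worst_case_pd_count() are invented names, not the API
added by the patch:

static int bind_vma_prealloc(struct i915_address_space *vm,
			     struct my_vma *vma)
{
	struct my_pt_stash stash = {};
	u64 count = worst_case_pd_count(vma->size);
	int err;

	/*
	 * Allocate (and clear) worst-case page directories up front,
	 * outside vm->mutex: allocating under it could recurse into
	 * reclaim, and the shrinker takes vm->mutex to revoke bindings.
	 */
	while (count--) {
		struct my_pt *pt = alloc_pt(vm);

		if (!pt) {
			free_stash(vm, &stash);
			return -ENOMEM;
		}
		stash_push(&stash, pt);
	}

	mutex_lock(&vm->mutex);
	err = insert_and_bind(vm, vma, &stash); /* only consumes the stash */
	mutex_unlock(&vm->mutex);

	free_stash(vm, &stash); /* return any page directories left over */
	return err;
}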
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [Intel-gfx] [PATCH 13/66] drm/i915/gem: Don't drop the timeline lock during execbuf
  2020-07-23 16:09   ` Thomas Hellström (Intel)
@ 2020-07-28 14:46     ` Thomas Hellström (Intel)
  2020-07-28 14:51     ` Chris Wilson
  1 sibling, 0 replies; 156+ messages in thread
From: Thomas Hellström (Intel) @ 2020-07-28 14:46 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 7/23/20 6:09 PM, Thomas Hellström (Intel) wrote:
>
> On 2020-07-15 13:50, Chris Wilson wrote:
>> Our timeline lock is our defence against a concurrent execbuf
>> interrupting our request construction. We need to hold it throughout or,
>> for example, a second thread may interject a relocation request in
>> between our own relocation request and execution in the ring.
>>
>> A second, major benefit, is that it allows us to preserve a large chunk
>> of the ringbuffer for our exclusive use; which should virtually
>> eliminate the threat of hitting a wait_for_space during request
>> construction -- although we should have already dropped other
>> contentious locks at that point.
>>
>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>> ---
>>   .../gpu/drm/i915/gem/i915_gem_execbuffer.c    | 413 +++++++++++-------
>>   .../i915/gem/selftests/i915_gem_execbuffer.c  |  24 +-
>>   2 files changed, 281 insertions(+), 156 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c 
>> b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
>> index 719ba9fe3e85..af3499aafd22 100644
>> --- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
>> @@ -259,6 +259,8 @@ struct i915_execbuffer {
>>           bool has_fence : 1;
>>           bool needs_unfenced : 1;
>>   +        struct intel_context *ce;
>> +
>>           struct i915_vma *target;
>>           struct i915_request *rq;
>>           struct i915_vma *rq_vma;
>> @@ -639,6 +641,35 @@ static int eb_reserve_vma(const struct 
>> i915_execbuffer *eb,
>>       return 0;
>>   }
>>   +static void retire_requests(struct intel_timeline *tl)
>> +{
>> +    struct i915_request *rq, *rn;
>> +
>> +    list_for_each_entry_safe(rq, rn, &tl->requests, link)
>> +        if (!i915_request_retire(rq))
>> +            break;
>> +}
>> +
>> +static int wait_for_timeline(struct intel_timeline *tl)
>> +{
>> +    do {
>> +        struct dma_fence *fence;
>> +        int err;
>> +
>> +        fence = i915_active_fence_get(&tl->last_request);
>> +        if (!fence)
>> +            return 0;
>> +
>> +        err = dma_fence_wait(fence, true);
>> +        dma_fence_put(fence);
>> +        if (err)
>> +            return err;
>> +
>> +        /* Retiring may trigger a barrier, requiring an extra pass */
>> +        retire_requests(tl);
>> +    } while (1);
>> +}
>> +
>>   static int eb_reserve(struct i915_execbuffer *eb)
>>   {
>>       const unsigned int count = eb->buffer_count;
>> @@ -646,7 +677,6 @@ static int eb_reserve(struct i915_execbuffer *eb)
>>       struct list_head last;
>>       struct eb_vma *ev;
>>       unsigned int i, pass;
>> -    int err = 0;
>>         /*
>>        * Attempt to pin all of the buffers into the GTT.
>> @@ -662,18 +692,37 @@ static int eb_reserve(struct i915_execbuffer *eb)
>>        * room for the earlier objects *unless* we need to defragment.
>>        */
>>   -    if (mutex_lock_interruptible(&eb->i915->drm.struct_mutex))
>> -        return -EINTR;
>> -
>>       pass = 0;
>>       do {
>> +        int err = 0;
>> +
>> +        /*
>> +         * We need to hold one lock as we bind all the vma so that
>> +         * we have a consistent view of the entire vm and can plan
>> +         * evictions to fill the whole GTT. If we allow a second
>> +         * thread to run as we do this, it will either unbind
>> +         * everything we want pinned, or steal space that we need for
>> +         * ourselves. The closer we are to a full GTT, the more likely
>> +         * such contention will cause us to fail to bind the workload
>> +         * for this batch. Since we know at this point we need to
>> +         * find space for new buffers, we know that extra pressure
>> +         * from contention is likely.
>> +         *
>> +         * In lieu of being able to hold vm->mutex for the entire
>> +         * sequence (it's complicated!), we opt for struct_mutex.
>> +         */
>> +        if (mutex_lock_interruptible(&eb->i915->drm.struct_mutex))
>> +            return -EINTR;
>> +
>
> With TTM, an idea that has been around for a long time is to let the 
> reservations resolve this. I don't think that's in place yet, though, 
> due to the fact that eviction / unbinding still requires a trylock 
> reservation and also because the evictions are not batched but 
> performed one by one with the evicted objects' reservations dropped 
> immediately after eviction. Having reservations resolve this could 
> perhaps be something we could aim for in the long run as well? 
> Unrelated batches would then never contend.
>
> In the meantime would it make sense to introduce a new device-wide mutex
> to avoid completely unrelated contention with the struct_mutex?
>
>
Actually I see this is changed later in the series...

/Thomas


_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [Intel-gfx] [PATCH 11/66] drm/i915: Preallocate stashes for vma page-directories
  2020-07-27  9:24   ` Thomas Hellström (Intel)
@ 2020-07-28 14:50     ` Chris Wilson
  2020-07-30 12:04       ` Thomas Hellström (Intel)
  2020-07-30 12:28       ` Thomas Hellström (Intel)
  0 siblings, 2 replies; 156+ messages in thread
From: Chris Wilson @ 2020-07-28 14:50 UTC (permalink / raw)
  To: Thomas Hellström, intel-gfx

Quoting Thomas Hellström (Intel) (2020-07-27 10:24:24)
> Hi, Chris,
> 
> It appears to me like this series is doing a lot of different things:
> 
> - Various optimizations
> - Locking rework
> - Adding schedulers
> - Other misc fixes
> 
> Could you please separate out as much as possible the locking rework 
> prerequisites into one series with a cover letter, and most importantly the 
> major part of the locking rework (only) with a more elaborate cover 
> letter discussing, if not trivial, how each patch fits in and the design 
> and future directions. Questions that I have stumbled on so far (being a 
> new-to-the-driver reviewer):

The locking depends on the former work to reduce its impact. It's still a
major issue that we introduce a broad lock that is held for several
hundred milliseconds across many objects and stalls the game & compositor.
 
> - When are memory allocations disallowed? If we need to pre-allocate in 
> execbuf, when? why?

That should be mentioned in the code.

> - When is the request dma-fence published?

There's a big comment to that effect.

> - Do we need to keep cpu asynchronous execbuf tasks after this? why?

Keep? Oh, you mean not immediately discarding them after publishing, but
why we need them. Same reason as we needed them before.

> - What about userptr pinning ending up in the dma_fence critical path?

It's in the user critical path (the shortest path to perform their
sequence of operations), but it's before the dma-fence itself. I'd say
it's a particularly nasty false claim that it is not on the critical
path, but being where it is circumvents the whole argument.
 
> And then move anything non-related to separate series?

Not related to what? Development of i915?
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [Intel-gfx] [PATCH 13/66] drm/i915/gem: Don't drop the timeline lock during execbuf
  2020-07-23 16:09   ` Thomas Hellström (Intel)
  2020-07-28 14:46     ` Thomas Hellström (Intel)
@ 2020-07-28 14:51     ` Chris Wilson
  1 sibling, 0 replies; 156+ messages in thread
From: Chris Wilson @ 2020-07-28 14:51 UTC (permalink / raw)
  To: Thomas Hellström, intel-gfx

Quoting Thomas Hellström (Intel) (2020-07-23 17:09:15)
> 
> On 2020-07-15 13:50, Chris Wilson wrote:
> > Our timeline lock is our defence against a concurrent execbuf
> > interrupting our request construction. We need to hold it throughout or,
> > for example, a second thread may interject a relocation request in
> > between our own relocation request and execution in the ring.
> >
> > A second, major benefit, is that it allows us to preserve a large chunk
> > of the ringbuffer for our exclusive use; which should virtually
> > eliminate the threat of hitting a wait_for_space during request
> > construction -- although we should have already dropped other
> > contentious locks at that point.
> >
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > ---
> >   .../gpu/drm/i915/gem/i915_gem_execbuffer.c    | 413 +++++++++++-------
> >   .../i915/gem/selftests/i915_gem_execbuffer.c  |  24 +-
> >   2 files changed, 281 insertions(+), 156 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> > index 719ba9fe3e85..af3499aafd22 100644
> > --- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> > +++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> > @@ -259,6 +259,8 @@ struct i915_execbuffer {
> >               bool has_fence : 1;
> >               bool needs_unfenced : 1;
> >   
> > +             struct intel_context *ce;
> > +
> >               struct i915_vma *target;
> >               struct i915_request *rq;
> >               struct i915_vma *rq_vma;
> > @@ -639,6 +641,35 @@ static int eb_reserve_vma(const struct i915_execbuffer *eb,
> >       return 0;
> >   }
> >   
> > +static void retire_requests(struct intel_timeline *tl)
> > +{
> > +     struct i915_request *rq, *rn;
> > +
> > +     list_for_each_entry_safe(rq, rn, &tl->requests, link)
> > +             if (!i915_request_retire(rq))
> > +                     break;
> > +}
> > +
> > +static int wait_for_timeline(struct intel_timeline *tl)
> > +{
> > +     do {
> > +             struct dma_fence *fence;
> > +             int err;
> > +
> > +             fence = i915_active_fence_get(&tl->last_request);
> > +             if (!fence)
> > +                     return 0;
> > +
> > +             err = dma_fence_wait(fence, true);
> > +             dma_fence_put(fence);
> > +             if (err)
> > +                     return err;
> > +
> > +             /* Retiring may trigger a barrier, requiring an extra pass */
> > +             retire_requests(tl);
> > +     } while (1);
> > +}
> > +
> >   static int eb_reserve(struct i915_execbuffer *eb)
> >   {
> >       const unsigned int count = eb->buffer_count;
> > @@ -646,7 +677,6 @@ static int eb_reserve(struct i915_execbuffer *eb)
> >       struct list_head last;
> >       struct eb_vma *ev;
> >       unsigned int i, pass;
> > -     int err = 0;
> >   
> >       /*
> >        * Attempt to pin all of the buffers into the GTT.
> > @@ -662,18 +692,37 @@ static int eb_reserve(struct i915_execbuffer *eb)
> >        * room for the earlier objects *unless* we need to defragment.
> >        */
> >   
> > -     if (mutex_lock_interruptible(&eb->i915->drm.struct_mutex))
> > -             return -EINTR;
> > -
> >       pass = 0;
> >       do {
> > +             int err = 0;
> > +
> > +             /*
> > +              * We need to hold one lock as we bind all the vma so that
> > +              * we have a consistent view of the entire vm and can plan
> > +              * evictions to fill the whole GTT. If we allow a second
> > +              * thread to run as we do this, it will either unbind
> > +              * everything we want pinned, or steal space that we need for
> > +              * ourselves. The closer we are to a full GTT, the more likely
> > +              * such contention will cause us to fail to bind the workload
> > +              * for this batch. Since we know at this point we need to
> > +              * find space for new buffers, we know that extra pressure
> > +              * from contention is likely.
> > +              *
> > +              * In lieu of being able to hold vm->mutex for the entire
> > +              * sequence (it's complicated!), we opt for struct_mutex.
> > +              */
> > +             if (mutex_lock_interruptible(&eb->i915->drm.struct_mutex))
> > +                     return -EINTR;
> > +
> 
> With TTM, an idea that has been around for a long time is to let the 
> reservations resolve this. I don't think that's in place yet, though, 
> due to the fact that eviction / unbinding still requires a trylock 
> reservation and also because the evictions are not batched but performed 
> one by one with the evicted objects' reservations dropped immediately 
> after eviction. Having reservations resolve this could perhaps be 
> something we could aim for in the long run as well? Unrelated batches 
> would then never contend.
> 
> In the meantime would it make sense to introduce a new device-wide mutex
> to avoid completely unrelated contention with the struct_mutex?
No.

The vma are not related to reservations.
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [Intel-gfx] [PATCH 16/66] drm/i915/gem: Remove the call for no-evict i915_vma_pin
  2020-07-17 14:36   ` Tvrtko Ursulin
@ 2020-07-28 15:04     ` Chris Wilson
  0 siblings, 0 replies; 156+ messages in thread
From: Chris Wilson @ 2020-07-28 15:04 UTC (permalink / raw)
  To: Tvrtko Ursulin, intel-gfx

Quoting Tvrtko Ursulin (2020-07-17 15:36:04)
> 
> On 15/07/2020 12:50, Chris Wilson wrote:
> > Remove the stub i915_vma_pin() used for incrementally pining objects for
> > execbuf (under the severe restriction that they must not wait on a
> > resource as we may have already pinned it) and replace it with a
> > i915_vma_pin_inplace() that is only allowed to reclaim the currently
> > bound location for the vma (and will never wait for a pinned resource).
> > 
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > ---
> >   .../gpu/drm/i915/gem/i915_gem_execbuffer.c    | 69 +++++++++++--------
> >   drivers/gpu/drm/i915/i915_vma.c               |  6 +-
> >   drivers/gpu/drm/i915/i915_vma.h               |  2 +
> >   3 files changed, 45 insertions(+), 32 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> > index 28cf28fcf80a..0b8a26da26e5 100644
> > --- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> > +++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> > @@ -452,49 +452,55 @@ static u64 eb_pin_flags(const struct drm_i915_gem_exec_object2 *entry,
> >       return pin_flags;
> >   }
> >   
> > +static bool eb_pin_vma_fence_inplace(struct eb_vma *ev)
> > +{
> > +     struct i915_vma *vma = ev->vma;
> > +     struct i915_fence_reg *reg = vma->fence;
> > +
> > +     if (reg) {
> > +             if (READ_ONCE(reg->dirty))
> > +                     return false;
> > +
> > +             atomic_inc(&reg->pin_count);
> 
> Why is this safe outside the vm->mutex? It otherwise seems to be 
> protecting this pin count.

I was working on having the fence protected by the vma. It's important
that we do avoid the fallback scheme -- although not strictly as
important for gen2/gen3 as they do not need the ppGTT preallocations.

If I adapt find_fence() to operate against a concurrent atomic_inc(),
that should dig me out of the hole. (Another cmpxchg, oh my.)
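
Something along these lines, perhaps (the FENCE_STEAL sentinel and both
helpers are invented here for illustration, not what would land):

#define FENCE_STEAL INT_MIN /* sentinel: register is being stolen */

static bool fence_try_steal(struct i915_fence_reg *reg)
{
	/* find_fence(): only claim a register with no in-place pins */
	return atomic_cmpxchg(&reg->pin_count, 0, FENCE_STEAL) == 0;
}

static bool fence_try_pin_inplace(struct i915_fence_reg *reg)
{
	int old = atomic_read(&reg->pin_count);

	do {
		if (old < 0) /* lost to a concurrent steal, take the slow path */
			return false;
	} while (!atomic_try_cmpxchg(&reg->pin_count, &old, old + 1));

	return true;
}

That keeps the fast path to a single atomic in the common case.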
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [Intel-gfx] [PATCH 16/66] drm/i915/gem: Remove the call for no-evict i915_vma_pin
  2020-07-28  9:46   ` Thomas Hellström (Intel)
@ 2020-07-28 15:05     ` Chris Wilson
  2020-07-31  8:58       ` Thomas Hellström (Intel)
  0 siblings, 1 reply; 156+ messages in thread
From: Chris Wilson @ 2020-07-28 15:05 UTC (permalink / raw)
  To: Thomas Hellström, intel-gfx

Quoting Thomas Hellström (Intel) (2020-07-28 10:46:51)
> 
> On 7/15/20 1:50 PM, Chris Wilson wrote:
> > Remove the stub i915_vma_pin() used for incrementally pining objects for
> 
> s/pining/pinning/

Pining for the fjords.
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [Intel-gfx] [PATCH 22/66] drm/i915/gem: Bind the fence async for execbuf
  2020-07-27 18:19   ` Thomas Hellström (Intel)
@ 2020-07-28 15:08     ` Chris Wilson
  2020-07-31 13:12       ` Thomas Hellström (Intel)
  0 siblings, 1 reply; 156+ messages in thread
From: Chris Wilson @ 2020-07-28 15:08 UTC (permalink / raw)
  To: Thomas Hellström, intel-gfx

Quoting Thomas Hellström (Intel) (2020-07-27 19:19:19)
> 
> On 7/15/20 1:51 PM, Chris Wilson wrote:
>> It is illegal to wait on another vma while holding the vm->mutex, as
> > that easily leads to ABBA deadlocks (we wait on a second vma that waits
> > on us to release the vm->mutex). So while the vm->mutex exists, move the
> > waiting outside of the lock into the async binding pipeline.
> 
> Why is it we don't just move the fence binding to a separate loop after 
> unlocking the vm->mutex in eb_reserve_vm()?

That is what is done. The work is called immediately when possible. Just
the loop may be deferred if what we need to unbind is still active.
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [Intel-gfx] [PATCH 27/66] drm/i915/gem: Pull execbuf dma resv under a single critical section
  2020-07-27 18:08   ` Thomas Hellström (Intel)
@ 2020-07-28 15:16     ` Chris Wilson
  2020-07-30 12:57       ` Thomas Hellström (Intel)
  0 siblings, 1 reply; 156+ messages in thread
From: Chris Wilson @ 2020-07-28 15:16 UTC (permalink / raw)
  To: Thomas Hellström, intel-gfx

Quoting Thomas Hellström (Intel) (2020-07-27 19:08:39)
> 
> On 7/15/20 1:51 PM, Chris Wilson wrote:
> > Acquire all the objects and their backing storage, and page directories,
> > as used by execbuf under a single common ww_mutex. Albeit we have to
> > restart the critical section a few times in order to handle various
> > restrictions (such as avoiding copy_(from|to)_user and mmap_sem).
> >
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > ---
> >   .../gpu/drm/i915/gem/i915_gem_execbuffer.c    | 168 +++++++++---------
> >   .../i915/gem/selftests/i915_gem_execbuffer.c  |   8 +-
> >   2 files changed, 87 insertions(+), 89 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> > index ebabc0746d50..db433f3f18ec 100644
> > --- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> > +++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> > @@ -20,6 +20,7 @@
> >   #include "gt/intel_gt_pm.h"
> >   #include "gt/intel_gt_requests.h"
> >   #include "gt/intel_ring.h"
> > +#include "mm/i915_acquire_ctx.h"
> >   
> >   #include "i915_drv.h"
> >   #include "i915_gem_clflush.h"
> > @@ -244,6 +245,8 @@ struct i915_execbuffer {
> >       struct intel_context *context; /* logical state for the request */
> >       struct i915_gem_context *gem_context; /** caller's context */
> >   
> > +     struct i915_acquire_ctx acquire; /** lock for _all_ DMA reservations */
> > +
> >       struct i915_request *request; /** our request to build */
> >       struct eb_vma *batch; /** identity of the batch obj/vma */
> >   
> > @@ -389,42 +392,6 @@ static void eb_vma_array_put(struct eb_vma_array *arr)
> >       kref_put(&arr->kref, eb_vma_array_destroy);
> >   }
> >   
> > -static int
> > -eb_lock_vma(struct i915_execbuffer *eb, struct ww_acquire_ctx *acquire)
> > -{
> > -     struct eb_vma *ev;
> > -     int err = 0;
> > -
> > -     list_for_each_entry(ev, &eb->submit_list, submit_link) {
> > -             struct i915_vma *vma = ev->vma;
> > -
> > -             err = ww_mutex_lock_interruptible(&vma->resv->lock, acquire);
> > -             if (err == -EDEADLK) {
> > -                     struct eb_vma *unlock = ev, *en;
> > -
> > -                     list_for_each_entry_safe_continue_reverse(unlock, en,
> > -                                                               &eb->submit_list,
> > -                                                               submit_link) {
> > -                             ww_mutex_unlock(&unlock->vma->resv->lock);
> > -                             list_move_tail(&unlock->submit_link, &eb->submit_list);
> > -                     }
> > -
> > -                     GEM_BUG_ON(!list_is_first(&ev->submit_link, &eb->submit_list));
> > -                     err = ww_mutex_lock_slow_interruptible(&vma->resv->lock,
> > -                                                            acquire);
> > -             }
> > -             if (err) {
> > -                     list_for_each_entry_continue_reverse(ev,
> > -                                                          &eb->submit_list,
> > -                                                          submit_link)
> > -                             ww_mutex_unlock(&ev->vma->resv->lock);
> > -                     break;
> > -             }
> > -     }
> > -
> > -     return err;
> > -}
> > -
> >   static int eb_create(struct i915_execbuffer *eb)
> >   {
> >       /* Allocate an extra slot for use by the sentinel */
> > @@ -668,6 +635,25 @@ eb_add_vma(struct i915_execbuffer *eb,
> >       }
> >   }
> >   
> > +static int eb_lock_mm(struct i915_execbuffer *eb)
> > +{
> > +     struct eb_vma *ev;
> > +     int err;
> > +
> > +     list_for_each_entry(ev, &eb->bind_list, bind_link) {
> > +             err = i915_acquire_ctx_lock(&eb->acquire, ev->vma->obj);
> > +             if (err)
> > +                     return err;
> > +     }
> > +
> > +     return 0;
> > +}
> > +
> > +static int eb_acquire_mm(struct i915_execbuffer *eb)
> > +{
> > +     return i915_acquire_mm(&eb->acquire);
> > +}
> > +
> >   struct eb_vm_work {
> >       struct dma_fence_work base;
> >       struct eb_vma_array *array;
> > @@ -1390,7 +1376,15 @@ static int eb_reserve_vm(struct i915_execbuffer *eb)
> >       unsigned long count;
> >       struct eb_vma *ev;
> >       unsigned int pass;
> > -     int err = 0;
> > +     int err;
> > +
> > +     err = eb_lock_mm(eb);
> > +     if (err)
> > +             return err;
> > +
> > +     err = eb_acquire_mm(eb);
> > +     if (err)
> > +             return err;
> >   
> >       count = 0;
> >       INIT_LIST_HEAD(&unbound);
> > @@ -1416,10 +1410,15 @@ static int eb_reserve_vm(struct i915_execbuffer *eb)
> >       if (count == 0)
> >               return 0;
> >   
> > +     /* We need to reserve page directories, release all, start over */
> > +     i915_acquire_ctx_fini(&eb->acquire);
> > +
> >       pass = 0;
> >       do {
> >               struct eb_vm_work *work;
> >   
> > +             i915_acquire_ctx_init(&eb->acquire);
> 
> Couldn't we do a i915_acquire_ctx_rollback() here to avoid losing our 
> ticket? That would mean deferring i915_acquire_ctx_done() until all 
> potential rollbacks have been performed.

We need to completely drop the acquire-class for catching up with userptr
(and anything else deferred that doesn't meet the current fence semantics).

I thought it was sensible to drop all around the waits in this loop, and
tidier to always reacquire at the beginning of each loop.

> Or even better if we defer _ctx_done(), couldn't we just continue 
> locking the pts here instead of dropping and re-acquiring everything?

Userptr would like to have a word. If you just mean these do lines, then
yes fini/init is overkill -- it just looked simpler than doing it at the
end of the loop. The steady-state load is not meant to get past the
optimistic fastpath.
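
In outline it's just the stock ww_mutex pattern reacquired per pass
(sketch only; eb_lock_and_reserve/eb_unlock_all/eb_wait_for_unbinds are
stand-ins for what the series does):

static int eb_reserve_sketch(struct i915_execbuffer *eb)
{
	struct ww_acquire_ctx ctx;
	unsigned int pass = 0;
	int err;

	do {
		ww_acquire_init(&ctx, &reservation_ww_class);
		/* dma_resv_lock every object, then reserve PD space */
		err = eb_lock_and_reserve(eb, &ctx);
		eb_unlock_all(eb, &ctx);
		ww_acquire_fini(&ctx);
		if (err != -ENOSPC)
			return err;

		/* all locks dropped: safe to wait on userptr, active unbinds */
		err = eb_wait_for_unbinds(eb);
		if (err)
			return err;
	} while (pass++ < 2);

	return -ENOSPC;
}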
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [Intel-gfx] [PATCH 28/66] drm/i915/gem: Replace i915_gem_object.mm.mutex with reservation_ww_class
  2020-07-28 11:17       ` Thomas Hellström (Intel)
@ 2020-07-29  7:56         ` Thomas Hellström (Intel)
  2020-07-29 12:17         ` Tvrtko Ursulin
  1 sibling, 0 replies; 156+ messages in thread
From: Thomas Hellström (Intel) @ 2020-07-29  7:56 UTC (permalink / raw)
  To: Tvrtko Ursulin, Maarten Lankhorst, Chris Wilson, intel-gfx


On 7/28/20 1:17 PM, Thomas Hellström (Intel) wrote:
>
> On 7/16/20 5:53 PM, Tvrtko Ursulin wrote:
>>
>> On 15/07/2020 16:43, Maarten Lankhorst wrote:
>>> Op 15-07-2020 om 13:51 schreef Chris Wilson:
>>>> Our goal is to pull all memory reservations (next iteration
>>>> obj->ops->get_pages()) under a ww_mutex, and to align those 
>>>> reservations
>>>> with other drivers, i.e. control all such allocations with the
>>>> reservation_ww_class. Currently, this is under the purview of the
>>>> obj->mm.mutex, and while obj->mm remains an embedded struct we can
>>>> "simply" switch to using the reservation_ww_class obj->base.resv->lock
>>>>
>>>> The major consequence is the impact on the shrinker paths as the
>>>> reservation_ww_class is used to wrap allocations, and a ww_mutex does
>>>> not support subclassing so we cannot do our usual trick of knowing 
>>>> that
>>>> we never recurse inside the shrinker and instead have to finish the
>>>> reclaim with a trylock. This may result in us failing to release the
>>>> pages after having released the vma. This will have to do until a 
>>>> better
>>>> idea comes along.
>>>>
>>>> However, this step only converts the mutex over and continues to treat
>>>> everything as a single allocation and pinning the pages. With the
>>>> ww_mutex in place we can remove the temporary pinning, as we can then
>>>> reserve all storage en masse.
>>>>
>>>> One last thing to do: kill the implicit page pinning for active vma.
>>>> This will require us to invalidate the vma->pages when the backing 
>>>> store
>>>> is removed (and we expect that while the vma is active, we mark the
>>>> backing store as active so that it cannot be removed while the HW is
>>>> busy.)
>>>>
>>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>
>> [snip]
>>
>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c 
>>>> b/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c
>>>> index dc8f052a0ffe..4e928103a38f 100644
>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c
>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c
>>>> @@ -47,10 +47,7 @@ static bool unsafe_drop_pages(struct 
>>>> drm_i915_gem_object *obj,
>>>>       if (!(shrink & I915_SHRINK_BOUND))
>>>>           flags = I915_GEM_OBJECT_UNBIND_TEST;
>>>>   -    if (i915_gem_object_unbind(obj, flags) == 0)
>>>> -        __i915_gem_object_put_pages(obj);
>>>> -
>>>> -    return !i915_gem_object_has_pages(obj);
>>>> +    return i915_gem_object_unbind(obj, flags) == 0;
>>>>   }
>>>>     static void try_to_writeback(struct drm_i915_gem_object *obj,
>>>> @@ -199,14 +196,14 @@ i915_gem_shrink(struct drm_i915_private *i915,
>>>> spin_unlock_irqrestore(&i915->mm.obj_lock, flags);
>>>>   -            if (unsafe_drop_pages(obj, shrink)) {
>>>> -                /* May arrive from get_pages on another bo */
>>>> -                mutex_lock(&obj->mm.lock);
>>>> +            if (unsafe_drop_pages(obj, shrink) &&
>>>> +                i915_gem_object_trylock(obj)) {
>>
>>> Why trylock? Because of the nesting? In that case, still use ww ctx 
>>> if provided please
>>
>> By "if provided" you mean for code paths where we are calling the 
>> shrinker ourselves, as opposed to reclaim, like shmem_get_pages?
>>
>> That indeed sounds like the right thing to do, since all the 
>> get_pages from execbuf are in the reservation phase, collecting a 
>> list of GEM objects to lock, the ones to shrink sound like should be 
>> on that list.
>>
>>>> + __i915_gem_object_put_pages(obj);
>>>>                   if (!i915_gem_object_has_pages(obj)) {
>>>>                       try_to_writeback(obj, shrink);
>>>>                       count += obj->base.size >> PAGE_SHIFT;
>>>>                   }
>>>> -                mutex_unlock(&obj->mm.lock);
>>>> +                i915_gem_object_unlock(obj);
>>>>               }
>>>>                 scanned += obj->base.size >> PAGE_SHIFT;
>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_tiling.c 
>>>> b/drivers/gpu/drm/i915/gem/i915_gem_tiling.c
>>>> index ff72ee2fd9cd..ac12e1c20e66 100644
>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_tiling.c
>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_tiling.c
>>>> @@ -265,7 +265,6 @@ i915_gem_object_set_tiling(struct 
>>>> drm_i915_gem_object *obj,
>>>>        * pages to prevent them being swapped out and causing 
>>>> corruption
>>>>        * due to the change in swizzling.
>>>>        */
>>>> -    mutex_lock(&obj->mm.lock);
>>>>       if (i915_gem_object_has_pages(obj) &&
>>>>           obj->mm.madv == I915_MADV_WILLNEED &&
>>>>           i915->quirks & QUIRK_PIN_SWIZZLED_PAGES) {
>>>> @@ -280,7 +279,6 @@ i915_gem_object_set_tiling(struct 
>>>> drm_i915_gem_object *obj,
>>>>               obj->mm.quirked = true;
>>>>           }
>>>>       }
>>>> -    mutex_unlock(&obj->mm.lock);
>>>>         spin_lock(&obj->vma.lock);
>>>>       for_each_ggtt_vma(vma, obj) {
>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c 
>>>> b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
>>>> index e946032b13e4..80907c00c6fd 100644
>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
>>>> @@ -129,8 +129,15 @@ userptr_mn_invalidate_range_start(struct 
>>>> mmu_notifier *_mn,
>>>>           ret = i915_gem_object_unbind(obj,
>>>>                            I915_GEM_OBJECT_UNBIND_ACTIVE |
>>>>                            I915_GEM_OBJECT_UNBIND_BARRIER);

Question: What happens above if someone is preparing cs with the above 
object, is holding its reservation object and is just about to submit? 
Doesn't the ppgtt binding go away from under it?


>>>> -        if (ret == 0)
>>>> -            ret = __i915_gem_object_put_pages(obj);
>>>> +        if (ret == 0) {
>>>> +            /* ww_mutex and mmu_notifier is fs_reclaim tainted */
>>>> +            if (i915_gem_object_trylock(obj)) {
>>>> +                ret = __i915_gem_object_put_pages(obj);
>>>> +                i915_gem_object_unlock(obj);
>>>> +            } else {
>>>> +                ret = -EAGAIN;
>>>> +            }
>>>> +        }
>>>
>>> I'm not sure upstream will agree with this kind of API:
>>>
>>> 1. It will deadlock when RT tasks are used.
>>
>> It will or it can? Which part? It will break out of the loop if 
>> trylock fails.
>>
>>>
>>> 2. You start throwing -EAGAIN because you don't have the correct 
>>> ordering of locking, this needs fixing first.
>>
>> Is it about correct ordering of locks or something else? If memory 
>> allocation is allowed under dma_resv.lock, then the opposite order 
>> cannot be taken in any case.
>>
>> I've had a brief look at the amdgpu solution and maybe I 
>> misunderstood something, but it looks like a BKL approach with the 
>> device level notifier_lock. Their userptr notifier blocks on that 
>> one, not on dma_resv lock, but that also means their command 
>> submission (amdgpu_cs_submit) blocks on the same lock while obtaining 
>> backing store.
>
> If I read Christian right, it blocks on that lock only just before 
> command submission to validate that sequence number. If there is a 
> mismatch, it needs to rerun CS. I'm not sure how common userptr 
> buffers are, but if a device-wide mutex hurts too much, there are 
> perhaps more fine-grained solutions. (Like an rw semaphore, and unlock 
> before the fence wait in the notifier: CS which are unaffected 
> shouldn't need to wait...).
>
>>
>> So it looks like a big hammer approach not directly related to the 
>> story of dma_resv locking. Maybe we could do the same big hammer 
>> approach, although I am not sure how it is deadlock free.
>>
>> What happens for instance if someone submits an userptr batch which 
>> gets unmapped while amdgpu_cs_submit is holding the notifier_lock?
>
> My understanding is the unmapping operation blocks on the 
> notifier_lock in the mmu notifier?
>
> /Thomas
>
>
>>
>> If you understand amdgpu better please share some insights. I 
>> certainly only looked at it briefly today so may be wrong.
>>
>> Regards,
>>
>> Tvrtko
>> _______________________________________________
>> Intel-gfx mailing list
>> Intel-gfx@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [Intel-gfx] [PATCH 28/66] drm/i915/gem: Replace i915_gem_object.mm.mutex with reservation_ww_class
  2020-07-28 11:17       ` Thomas Hellström (Intel)
  2020-07-29  7:56         ` Thomas Hellström (Intel)
@ 2020-07-29 12:17         ` Tvrtko Ursulin
  2020-07-29 13:44           ` Thomas Hellström (Intel)
  1 sibling, 1 reply; 156+ messages in thread
From: Tvrtko Ursulin @ 2020-07-29 12:17 UTC (permalink / raw)
  To: Thomas Hellström (Intel),
	Maarten Lankhorst, Chris Wilson, intel-gfx


On 28/07/2020 12:17, Thomas Hellström (Intel) wrote:
> On 7/16/20 5:53 PM, Tvrtko Ursulin wrote:
>> On 15/07/2020 16:43, Maarten Lankhorst wrote:
>>> Op 15-07-2020 om 13:51 schreef Chris Wilson:
>>>> Our goal is to pull all memory reservations (next iteration
>>>> obj->ops->get_pages()) under a ww_mutex, and to align those 
>>>> reservations
>>>> with other drivers, i.e. control all such allocations with the
>>>> reservation_ww_class. Currently, this is under the purview of the
>>>> obj->mm.mutex, and while obj->mm remains an embedded struct we can
>>>> "simply" switch to using the reservation_ww_class obj->base.resv->lock
>>>>
>>>> The major consequence is the impact on the shrinker paths as the
>>>> reservation_ww_class is used to wrap allocations, and a ww_mutex does
>>>> not support subclassing so we cannot do our usual trick of knowing that
>>>> we never recurse inside the shrinker and instead have to finish the
>>>> reclaim with a trylock. This may result in us failing to release the
>>>> pages after having released the vma. This will have to do until a 
>>>> better
>>>> idea comes along.
>>>>
>>>> However, this step only converts the mutex over and continues to treat
>>>> everything as a single allocation and pinning the pages. With the
>>>> ww_mutex in place we can remove the temporary pinning, as we can then
>>>> reserve all storage en masse.
>>>>
>>>> One last thing to do: kill the implicit page pinning for active vma.
>>>> This will require us to invalidate the vma->pages when the backing 
>>>> store
>>>> is removed (and we expect that while the vma is active, we mark the
>>>> backing store as active so that it cannot be removed while the HW is
>>>> busy.)
>>>>
>>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>
>> [snip]
>>
>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c 
>>>> b/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c
>>>> index dc8f052a0ffe..4e928103a38f 100644
>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c
>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c
>>>> @@ -47,10 +47,7 @@ static bool unsafe_drop_pages(struct 
>>>> drm_i915_gem_object *obj,
>>>>       if (!(shrink & I915_SHRINK_BOUND))
>>>>           flags = I915_GEM_OBJECT_UNBIND_TEST;
>>>>   -    if (i915_gem_object_unbind(obj, flags) == 0)
>>>> -        __i915_gem_object_put_pages(obj);
>>>> -
>>>> -    return !i915_gem_object_has_pages(obj);
>>>> +    return i915_gem_object_unbind(obj, flags) == 0;
>>>>   }
>>>>     static void try_to_writeback(struct drm_i915_gem_object *obj,
>>>> @@ -199,14 +196,14 @@ i915_gem_shrink(struct drm_i915_private *i915,
>>>> spin_unlock_irqrestore(&i915->mm.obj_lock, flags);
>>>>   -            if (unsafe_drop_pages(obj, shrink)) {
>>>> -                /* May arrive from get_pages on another bo */
>>>> -                mutex_lock(&obj->mm.lock);
>>>> +            if (unsafe_drop_pages(obj, shrink) &&
>>>> +                i915_gem_object_trylock(obj)) {
>>
>>> Why trylock? Because of the nesting? In that case, still use ww ctx 
>>> if provided please
>>
>> By "if provided" you mean for code paths where we are calling the 
>> shrinker ourselves, as opposed to reclaim, like shmem_get_pages?
>>
>> That indeed sounds like the right thing to do, since all the get_pages 
>> from execbuf are in the reservation phase, collecting a list of GEM 
>> objects to lock, the ones to shrink sound like should be on that list.
>>
>>>> + __i915_gem_object_put_pages(obj);
>>>>                   if (!i915_gem_object_has_pages(obj)) {
>>>>                       try_to_writeback(obj, shrink);
>>>>                       count += obj->base.size >> PAGE_SHIFT;
>>>>                   }
>>>> -                mutex_unlock(&obj->mm.lock);
>>>> +                i915_gem_object_unlock(obj);
>>>>               }
>>>>                 scanned += obj->base.size >> PAGE_SHIFT;
>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_tiling.c 
>>>> b/drivers/gpu/drm/i915/gem/i915_gem_tiling.c
>>>> index ff72ee2fd9cd..ac12e1c20e66 100644
>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_tiling.c
>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_tiling.c
>>>> @@ -265,7 +265,6 @@ i915_gem_object_set_tiling(struct 
>>>> drm_i915_gem_object *obj,
>>>>        * pages to prevent them being swapped out and causing corruption
>>>>        * due to the change in swizzling.
>>>>        */
>>>> -    mutex_lock(&obj->mm.lock);
>>>>       if (i915_gem_object_has_pages(obj) &&
>>>>           obj->mm.madv == I915_MADV_WILLNEED &&
>>>>           i915->quirks & QUIRK_PIN_SWIZZLED_PAGES) {
>>>> @@ -280,7 +279,6 @@ i915_gem_object_set_tiling(struct 
>>>> drm_i915_gem_object *obj,
>>>>               obj->mm.quirked = true;
>>>>           }
>>>>       }
>>>> -    mutex_unlock(&obj->mm.lock);
>>>>         spin_lock(&obj->vma.lock);
>>>>       for_each_ggtt_vma(vma, obj) {
>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c 
>>>> b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
>>>> index e946032b13e4..80907c00c6fd 100644
>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
>>>> @@ -129,8 +129,15 @@ userptr_mn_invalidate_range_start(struct 
>>>> mmu_notifier *_mn,
>>>>           ret = i915_gem_object_unbind(obj,
>>>>                            I915_GEM_OBJECT_UNBIND_ACTIVE |
>>>>                            I915_GEM_OBJECT_UNBIND_BARRIER);
>>>> -        if (ret == 0)
>>>> -            ret = __i915_gem_object_put_pages(obj);
>>>> +        if (ret == 0) {
>>>> +            /* ww_mutex and mmu_notifier is fs_reclaim tainted */
>>>> +            if (i915_gem_object_trylock(obj)) {
>>>> +                ret = __i915_gem_object_put_pages(obj);
>>>> +                i915_gem_object_unlock(obj);
>>>> +            } else {
>>>> +                ret = -EAGAIN;
>>>> +            }
>>>> +        }
>>>
>>> I'm not sure upstream will agree with this kind of API:
>>>
>>> 1. It will deadlock when RT tasks are used.
>>
>> It will or it can? Which part? It will break out of the loop if 
>> trylock fails.
>>
>>>
>>> 2. You start throwing -EAGAIN because you don't have the correct 
>>> ordering of locking, this needs fixing first.
>>
>> Is it about correct ordering of locks or something else? If memory 
>> allocation is allowed under dma_resv.lock, then the opposite order 
>> cannot be taken in any case.
>>
>> I've had a brief look at the amdgpu solution and maybe I misunderstood 
>> something, but it looks like a BKL approach with the device level 
>> notifier_lock. Their userptr notifier blocks on that one, not on 
>> dma_resv lock, but that also means their command submission 
>> (amdgpu_cs_submit) blocks on the same lock while obtaining backing store.
> 
> If I read Christian right, it blocks on that lock only just before 
> command submission to validate that sequence number. If there is a 
> mismatch, it needs to rerun CS. I'm not sure how common userptr buffers 
> are, but if a device-wide mutex hurts too much, there are perhaps more 
> fine-grained solutions. (Like an rw semaphore, and unlock before the 
> fence wait in the notifier: CS which are unaffected shouldn't need to 
> wait...).

Right, I wasn't familiar with the mmu notifier seqno business. Hm, i915 
and amdgpu seem to use different mmu notifier hooks. Could we use the 
same, and does it mean "locking order" is irrelevant to the sub-story of 
userptr?

> 
>>
>> So it looks like a big hammer approach not directly related to the 
>> story of dma_resv locking. Maybe we could do the same big hammer 
>> approach, although I am not sure how it is deadlock free.
>>
>> What happens for instance if someone submits an userptr batch which 
>> gets unmapped while amdgpu_cs_submit is holding the notifier_lock?
> 
> My understanding is the unmapping operation blocks on the notifier_lock 
> in the mmu notifier?

Yes, the key will be understanding the difference between
invalidate_range_start and invalidate+seqno and whether we can use the 
same from i915 to avoid the trylock.
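
For reference, the invalidate+seqno flow I mean is roughly this (a sketch
using the mmu_interval_notifier API; struct my_userptr and
pin_pages_and_build_request are made up for the example):

struct my_userptr {			/* made-up container for the sketch */
	struct mmu_interval_notifier notifier;
	/* pages, object, ... */
};

static bool userptr_invalidate(struct mmu_interval_notifier *mni,
			       const struct mmu_notifier_range *range,
			       unsigned long cur_seq)
{
	if (!mmu_notifier_range_blockable(range))
		return false;

	mmu_interval_set_seq(mni, cur_seq); /* bump the seqno */
	/* wait for, or cancel, work already using the pages */
	return true;
}

/* submission side: no trylock against the notifier needed */
static int userptr_submit(struct my_userptr *p)
{
	unsigned long seq = mmu_interval_read_begin(&p->notifier);
	int err;

	err = pin_pages_and_build_request(p); /* stand-in */
	if (err)
		return err;

	if (mmu_interval_read_retry(&p->notifier, seq))
		return -EAGAIN; /* range invalidated meanwhile: redo */

	return 0;
}

(The real thing takes a driver lock around the retry check and the fence
publish, which is what amdgpu's notifier_lock is for.)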

But given how the ww mutex is fs_reclaim tainted, does that mean we would 
also need to ensure no memory allocations under the ww mutex?

Regards,

Tvrtko


> 
> /Thomas
> 
> 
>>
>> If you understand amdgpu better please share some insights. I 
>> certainly only looked at it briefly today so may be wrong.
>>
>> Regards,
>>
>> Tvrtko
>> _______________________________________________
>> Intel-gfx mailing list
>> Intel-gfx@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [Intel-gfx] [PATCH 08/66] drm/i915: Make the stale cached active node available for any timeline
  2020-07-28 14:28     ` Chris Wilson
@ 2020-07-29 12:40       ` Tvrtko Ursulin
  2020-07-29 13:42         ` Chris Wilson
  0 siblings, 1 reply; 156+ messages in thread
From: Tvrtko Ursulin @ 2020-07-29 12:40 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 28/07/2020 15:28, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2020-07-17 14:04:58)
>>
>> On 15/07/2020 12:50, Chris Wilson wrote:
>>> Rather than require the next timeline after idling to match the MRU
>>> before idling, reset the index on the node and allow it to match the
>>> first request. However, this requires cmpxchg(u64) and so is not trivial
>>> on 32b, so for compatibility we just fallback to keeping the cached node
>>> pointing to the MRU timeline.
>>>
>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>> ---
>>>    drivers/gpu/drm/i915/i915_active.c | 21 +++++++++++++++++++--
>>>    1 file changed, 19 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/i915_active.c b/drivers/gpu/drm/i915/i915_active.c
>>> index 0854b1552bc1..6737b5615c0c 100644
>>> --- a/drivers/gpu/drm/i915/i915_active.c
>>> +++ b/drivers/gpu/drm/i915/i915_active.c
>>> @@ -157,6 +157,10 @@ __active_retire(struct i915_active *ref)
>>>                rb_link_node(&ref->cache->node, NULL, &ref->tree.rb_node);
>>>                rb_insert_color(&ref->cache->node, &ref->tree);
>>>                GEM_BUG_ON(ref->tree.rb_node != &ref->cache->node);
>>> +
>>> +             /* Make the cached node available for reuse with any timeline */
>>> +             if (IS_ENABLED(CONFIG_64BIT))
>>> +                     ref->cache->timeline = 0; /* needs cmpxchg(u64) */
>>
>> Or when fence context wraps shock horror.
> 
> I'm more concerned that we use timeline:0 as a special unordered
> timeline. It's reserved by its use in the dma_fence_stub, and everything
> will start to break when the timelines wrap. The earliest casualties
> will be the kernel_context timelines, which are also very special indices
> for the barriers.
> 
>>
>>>        }
>>>    
>>>        spin_unlock_irqrestore(&ref->tree_lock, flags);
>>> @@ -235,9 +239,22 @@ static struct active_node *__active_lookup(struct i915_active *ref, u64 idx)
>>>    {
>>>        struct active_node *it;
>>>    
>>> +     GEM_BUG_ON(idx == 0); /* 0 is the unordered timeline, rsvd for cache */
>>> +
>>>        it = READ_ONCE(ref->cache);
>>> -     if (it && it->timeline == idx)
>>> -             return it;
>>> +     if (it) {
>>> +             u64 cached = READ_ONCE(it->timeline);
>>> +
>>> +             if (cached == idx)
>>> +                     return it;
>>> +
>>> +#ifdef CONFIG_64BIT /* for cmpxchg(u64) */
>>> +             if (!cached && !cmpxchg(&it->timeline, 0, idx)) {
>>> +                     GEM_BUG_ON(i915_active_fence_isset(&it->base));
>>> +                     return it;
>>
>> cmpxchg suggests this needs to be atomic, however above the check for
>> equality comes from a separate read.
> 
> That's fine, and quite common to avoid cmpxchg if the current value
> already does not match the expected condition.

How? What if another thread is about to install its idx into 
it->timeline with cmpxchg and this thread does not see it because it 
just returned on the "cached == idx" condition?

> 
>> Since there is a lookup code path under the spinlock, perhaps the
>> unlocked lookup could just fail, and then locked lookup could re-assign
>> the timeline without the need for cmpxchg?
> 
> The unlocked/locked lookup are the same routine. You pointed that out
> :-p

Like I remember from ten days ago.. Anyway, I am pointing out it still 
doesn't smell right.

__active_lookup(...) -> lockless
{
...
	it = fetch_node(ref->tree.rb_node);
	while (it) {
		if (it->timeline < idx) {
			it = fetch_node(it->node.rb_right);
		} else if (it->timeline > idx) {
			it = fetch_node(it->node.rb_left);
		} else {
			WRITE_ONCE(ref->cache, it);
			break;
		}
	}
...
}

Then in active_instance, locked:

...
	parent = NULL;
	p = &ref->tree.rb_node;
	while (*p) {
		parent = *p;

		node = rb_entry(parent, struct active_node, node);
		if (node->timeline == idx) {
			kmem_cache_free(global.slab_cache, prealloc);
			goto out;
		}

		if (node->timeline < idx)
			p = &parent->rb_right;
		else
			p = &parent->rb_left;
	}
...

Tree walk could be consolidated between the two.

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [Intel-gfx] [PATCH 09/66] drm/i915: Provide a fastpath for waiting on vma bindings
  2020-07-28 14:35     ` Chris Wilson
@ 2020-07-29 12:43       ` Tvrtko Ursulin
  0 siblings, 0 replies; 156+ messages in thread
From: Tvrtko Ursulin @ 2020-07-29 12:43 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 28/07/2020 15:35, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2020-07-17 14:23:22)
>>
>> On 15/07/2020 12:50, Chris Wilson wrote:
>>> Before we can execute a request, we must wait for all of its vma to be
>>> bound. This is a frequent operation for which we can optimise away a
>>> few atomic operations (notably a cmpxchg) in lieu of the RCU protection.
>>>
>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>> ---
>>>    drivers/gpu/drm/i915/i915_active.h | 15 +++++++++++++++
>>>    drivers/gpu/drm/i915/i915_vma.c    |  9 +++++++--
>>>    2 files changed, 22 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/i915_active.h b/drivers/gpu/drm/i915/i915_active.h
>>> index b9e0394e2975..fb165d3f01cf 100644
>>> --- a/drivers/gpu/drm/i915/i915_active.h
>>> +++ b/drivers/gpu/drm/i915/i915_active.h
>>> @@ -231,4 +231,19 @@ struct i915_active *i915_active_create(void);
>>>    struct i915_active *i915_active_get(struct i915_active *ref);
>>>    void i915_active_put(struct i915_active *ref);
>>>    
>>> +static inline int __i915_request_await_exclusive(struct i915_request *rq,
>>> +                                              struct i915_active *active)
>>> +{
>>> +     struct dma_fence *fence;
>>> +     int err = 0;
>>> +
>>> +     fence = i915_active_fence_get(&active->excl);
>>> +     if (fence) {
>>> +             err = i915_request_await_dma_fence(rq, fence);
>>> +             dma_fence_put(fence);
>>> +     }
>>> +
>>> +     return err;
>>> +}
>>> +
>>>    #endif /* _I915_ACTIVE_H_ */
>>> diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c
>>> index bc64f773dcdb..cd12047c7791 100644
>>> --- a/drivers/gpu/drm/i915/i915_vma.c
>>> +++ b/drivers/gpu/drm/i915/i915_vma.c
>>> @@ -1167,6 +1167,12 @@ void i915_vma_revoke_mmap(struct i915_vma *vma)
>>>                list_del(&vma->obj->userfault_link);
>>>    }
>>>    
>>> +static int
>>> +__i915_request_await_bind(struct i915_request *rq, struct i915_vma *vma)
>>> +{
>>> +     return __i915_request_await_exclusive(rq, &vma->active);
>>> +}
>>> +
>>>    int __i915_vma_move_to_active(struct i915_vma *vma, struct i915_request *rq)
>>>    {
>>>        int err;
>>> @@ -1174,8 +1180,7 @@ int __i915_vma_move_to_active(struct i915_vma *vma, struct i915_request *rq)
>>>        GEM_BUG_ON(!i915_vma_is_pinned(vma));
>>>    
>>>        /* Wait for the vma to be bound before we start! */
>>> -     err = i915_request_await_active(rq, &vma->active,
>>> -                                     I915_ACTIVE_AWAIT_EXCL);
>>> +     err = __i915_request_await_bind(rq, vma);
>>>        if (err)
>>>                return err;
>>>    
>>>
>>
>> Looks like for like, apart from missing i915_active_acquire_if_busy
>> across the operation. Remind me please what is acquire/release
>> protecting against? :)
> 
> To protect the rbtree walk. So, this is the function we started with for
> active_await, but when we added the option to walk the entire rbtree as
> well, we pulled it all under a single acquire/release. perf suggests
> that was a mistake if all we frequently want to do is grab the exclusive
> fence for an await.

Ok!

Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Regards,

Tvrtko


_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [Intel-gfx] [PATCH 08/66] drm/i915: Make the stale cached active node available for any timeline
  2020-07-29 12:40       ` Tvrtko Ursulin
@ 2020-07-29 13:42         ` Chris Wilson
  2020-07-29 13:53           ` Chris Wilson
  2020-07-29 14:22           ` Tvrtko Ursulin
  0 siblings, 2 replies; 156+ messages in thread
From: Chris Wilson @ 2020-07-29 13:42 UTC (permalink / raw)
  To: Tvrtko Ursulin, intel-gfx

Quoting Tvrtko Ursulin (2020-07-29 13:40:38)
> 
> On 28/07/2020 15:28, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2020-07-17 14:04:58)
> >>
> >> On 15/07/2020 12:50, Chris Wilson wrote:
> >>> Rather than require the next timeline after idling to match the MRU
> >>> before idling, reset the index on the node and allow it to match the
> >>> first request. However, this requires cmpxchg(u64) and so is not trivial
> >>> on 32b, so for compatibility we just fallback to keeping the cached node
> >>> pointing to the MRU timeline.
> >>>
> >>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> >>> ---
> >>>    drivers/gpu/drm/i915/i915_active.c | 21 +++++++++++++++++++--
> >>>    1 file changed, 19 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/i915/i915_active.c b/drivers/gpu/drm/i915/i915_active.c
> >>> index 0854b1552bc1..6737b5615c0c 100644
> >>> --- a/drivers/gpu/drm/i915/i915_active.c
> >>> +++ b/drivers/gpu/drm/i915/i915_active.c
> >>> @@ -157,6 +157,10 @@ __active_retire(struct i915_active *ref)
> >>>                rb_link_node(&ref->cache->node, NULL, &ref->tree.rb_node);
> >>>                rb_insert_color(&ref->cache->node, &ref->tree);
> >>>                GEM_BUG_ON(ref->tree.rb_node != &ref->cache->node);
> >>> +
> >>> +             /* Make the cached node available for reuse with any timeline */
> >>> +             if (IS_ENABLED(CONFIG_64BIT))
> >>> +                     ref->cache->timeline = 0; /* needs cmpxchg(u64) */
> >>
> >> Or when fence context wraps shock horror.
> > 
> > I'm more concerned that we use timeline:0 as a special unordered
> > timeline. It's reserved by its use in the dma_fence_stub, and everything
> > will start to break when the timelines wrap. The earliest casualties
> > will be the kernel_context timelines, which are also very special indices
> > for the barriers.
> > 
> >>
> >>>        }
> >>>    
> >>>        spin_unlock_irqrestore(&ref->tree_lock, flags);
> >>> @@ -235,9 +239,22 @@ static struct active_node *__active_lookup(struct i915_active *ref, u64 idx)
> >>>    {
> >>>        struct active_node *it;
> >>>    
> >>> +     GEM_BUG_ON(idx == 0); /* 0 is the unordered timeline, rsvd for cache */
> >>> +
> >>>        it = READ_ONCE(ref->cache);
> >>> -     if (it && it->timeline == idx)
> >>> -             return it;
> >>> +     if (it) {
> >>> +             u64 cached = READ_ONCE(it->timeline);
> >>> +
> >>> +             if (cached == idx)
> >>> +                     return it;
> >>> +
> >>> +#ifdef CONFIG_64BIT /* for cmpxchg(u64) */
> >>> +             if (!cached && !cmpxchg(&it->timeline, 0, idx)) {
> >>> +                     GEM_BUG_ON(i915_active_fence_isset(&it->base));
> >>> +                     return it;
> >>
> >> cmpxchg suggests this needs to be atomic, however above the check for
> >> equality comes from a separate read.
> > 
> > That's fine, and quite common to avoid cmpxchg if the current value
> > already does not match the expected condition.
> 
> How? What if another thread is about to install its idx into 
> it->timeline with cmpxchg and this thread does not see it because it 
> just returned on the "cached == idx" condition?

Because it's nonzero.

If the idx is already non-zero, it will always remain non-zero until
everybody idles (and there are no more threads).

If the idx is zero, it can only transition to non-zero once, atomically
via cmpxchg. The first, and only the first, cmpxchg will see that the
previous value was 0, and so return with it->timeline == idx.
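
In miniature, the claim pattern is (generic sketch, not the actual i915
code):

/* 0 means "unclaimed"; once set, the slot never goes back to 0 */
static bool try_claim(u64 *slot, u64 idx)
{
	u64 cur = READ_ONCE(*slot);

	if (cur == idx)	/* stable: it can never change away from idx */
		return true;
	if (cur)	/* claimed by another timeline, and stays that way */
		return false;

	/* exactly one racing caller sees 0 here (64b cmpxchg) */
	return cmpxchg(slot, 0, idx) == 0;
}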

> 
> > 
> >> Since there is a lookup code path under the spinlock, perhaps the
> >> unlocked lookup could just fail, and then locked lookup could re-assign
> >> the timeline without the need for cmpxchg?
> > 
> > The unlocked/locked lookup are the same routine. You pointed that out
> > :-p
> 
> Like I remember from ten days ago.. Anyway, I am pointing out it still 
> doesn't smell right.
> 
> __active_lookup(...) -> lockless
> {
> ...
>         it = fetch_node(ref->tree.rb_node);
>         while (it) {
>                 if (it->timeline < idx) {
>                         it = fetch_node(it->node.rb_right);
>                 } else if (it->timeline > idx) {
>                         it = fetch_node(it->node.rb_left);
>                 } else {
>                         WRITE_ONCE(ref->cache, it);
>                         break;
>                 }
>         }
> ...
> }
> 
> Then in active_instance, locked:
> 
> ...
>         parent = NULL;
>         p = &ref->tree.rb_node;
>         while (*p) {
>                 parent = *p;
> 
>                 node = rb_entry(parent, struct active_node, node);
>                 if (node->timeline == idx) {
>                         kmem_cache_free(global.slab_cache, prealloc);
>                         goto out;
>                 }
> 
>                 if (node->timeline < idx)
>                         p = &parent->rb_right;
>                 else
>                         p = &parent->rb_left;
>                         WRITE_ONCE(ref->cache, it);
>                         break;
>                 }
>         }
> ...
> 
> Tree walk could be consolidated between the two.

This tree walk is subtly different, as we aren't just interested in the
node, but its parent. The exact repetitions have been consolidated into
__active_lookup.
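
Roughly, the locked path also wants the slot for rb_link_node(), something
like (a sketch, not the consolidated helper itself):

static struct active_node *
lookup_slot(struct rb_root *root, u64 idx,
	    struct rb_node **parent, struct rb_node ***slot)
{
	struct rb_node **p = &root->rb_node;

	*parent = NULL;
	while (*p) {
		struct active_node *node =
			rb_entry(*p, struct active_node, node);

		if (node->timeline == idx)
			return node; /* reuse the existing node */

		*parent = *p;
		if (node->timeline < idx)
			p = &(*p)->rb_right;
		else
			p = &(*p)->rb_left;
	}

	*slot = p; /* for rb_link_node(&new->node, *parent, *slot) */
	return NULL;
}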
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [Intel-gfx] [PATCH 28/66] drm/i915/gem: Replace i915_gem_object.mm.mutex with reservation_ww_class
  2020-07-29 12:17         ` Tvrtko Ursulin
@ 2020-07-29 13:44           ` Thomas Hellström (Intel)
  2020-08-05 12:12             ` Chris Wilson
  0 siblings, 1 reply; 156+ messages in thread
From: Thomas Hellström (Intel) @ 2020-07-29 13:44 UTC (permalink / raw)
  To: Tvrtko Ursulin, Maarten Lankhorst, Chris Wilson, intel-gfx


On 7/29/20 2:17 PM, Tvrtko Ursulin wrote:
>
> On 28/07/2020 12:17, Thomas Hellström (Intel) wrote:
>> On 7/16/20 5:53 PM, Tvrtko Ursulin wrote:
>>> On 15/07/2020 16:43, Maarten Lankhorst wrote:
>>>> Op 15-07-2020 om 13:51 schreef Chris Wilson:
>>>>> Our goal is to pull all memory reservations (next iteration
>>>>> obj->ops->get_pages()) under a ww_mutex, and to align those 
>>>>> reservations
>>>>> with other drivers, i.e. control all such allocations with the
>>>>> reservation_ww_class. Currently, this is under the purview of the
>>>>> obj->mm.mutex, and while obj->mm remains an embedded struct we can
>>>>> "simply" switch to using the reservation_ww_class 
>>>>> obj->base.resv->lock
>>>>>
>>>>> The major consequence is the impact on the shrinker paths as the
>>>>> reservation_ww_class is used to wrap allocations, and a ww_mutex does
>>>>> not support subclassing so we cannot do our usual trick of knowing 
>>>>> that
>>>>> we never recurse inside the shrinker and instead have to finish the
>>>>> reclaim with a trylock. This may result in us failing to release the
>>>>> pages after having released the vma. This will have to do until a 
>>>>> better
>>>>> idea comes along.
>>>>>
>>>>> However, this step only converts the mutex over and continues to 
>>>>> treat
>>>>> everything as a single allocation and pinning the pages. With the
>>>>> ww_mutex in place we can remove the temporary pinning, as we can then
>>>>> reserve all storage en masse.
>>>>>
>>>>> One last thing to do: kill the implicit page pinning for active vma.
>>>>> This will require us to invalidate the vma->pages when the backing 
>>>>> store
>>>>> is removed (and we expect that while the vma is active, we mark the
>>>>> backing store as active so that it cannot be removed while the HW is
>>>>> busy.)
>>>>>
>>>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>>
>>> [snip]
>>>
>>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c 
>>>>> b/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c
>>>>> index dc8f052a0ffe..4e928103a38f 100644
>>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c
>>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c
>>>>> @@ -47,10 +47,7 @@ static bool unsafe_drop_pages(struct 
>>>>> drm_i915_gem_object *obj,
>>>>>       if (!(shrink & I915_SHRINK_BOUND))
>>>>>           flags = I915_GEM_OBJECT_UNBIND_TEST;
>>>>>   -    if (i915_gem_object_unbind(obj, flags) == 0)
>>>>> -        __i915_gem_object_put_pages(obj);
>>>>> -
>>>>> -    return !i915_gem_object_has_pages(obj);
>>>>> +    return i915_gem_object_unbind(obj, flags) == 0;
>>>>>   }
>>>>>     static void try_to_writeback(struct drm_i915_gem_object *obj,
>>>>> @@ -199,14 +196,14 @@ i915_gem_shrink(struct drm_i915_private *i915,
>>>>> spin_unlock_irqrestore(&i915->mm.obj_lock, flags);
>>>>>   -            if (unsafe_drop_pages(obj, shrink)) {
>>>>> -                /* May arrive from get_pages on another bo */
>>>>> -                mutex_lock(&obj->mm.lock);
>>>>> +            if (unsafe_drop_pages(obj, shrink) &&
>>>>> +                i915_gem_object_trylock(obj)) {
>>>
>>>> Why trylock? Because of the nesting? In that case, still use ww ctx 
>>>> if provided please
>>>
>>> By "if provided" you mean for code paths where we are calling the 
>>> shrinker ourselves, as opposed to reclaim, like shmem_get_pages?
>>>
>>> That indeed sounds like the right thing to do, since all the 
>>> get_pages from execbuf are in the reservation phase, collecting a 
>>> list of GEM objects to lock, the ones to shrink sound like should be 
>>> on that list.
>>>
>>>>> + __i915_gem_object_put_pages(obj);
>>>>>                   if (!i915_gem_object_has_pages(obj)) {
>>>>>                       try_to_writeback(obj, shrink);
>>>>>                       count += obj->base.size >> PAGE_SHIFT;
>>>>>                   }
>>>>> -                mutex_unlock(&obj->mm.lock);
>>>>> +                i915_gem_object_unlock(obj);
>>>>>               }
>>>>>                 scanned += obj->base.size >> PAGE_SHIFT;
>>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_tiling.c 
>>>>> b/drivers/gpu/drm/i915/gem/i915_gem_tiling.c
>>>>> index ff72ee2fd9cd..ac12e1c20e66 100644
>>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_tiling.c
>>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_tiling.c
>>>>> @@ -265,7 +265,6 @@ i915_gem_object_set_tiling(struct 
>>>>> drm_i915_gem_object *obj,
>>>>>        * pages to prevent them being swapped out and causing 
>>>>> corruption
>>>>>        * due to the change in swizzling.
>>>>>        */
>>>>> -    mutex_lock(&obj->mm.lock);
>>>>>       if (i915_gem_object_has_pages(obj) &&
>>>>>           obj->mm.madv == I915_MADV_WILLNEED &&
>>>>>           i915->quirks & QUIRK_PIN_SWIZZLED_PAGES) {
>>>>> @@ -280,7 +279,6 @@ i915_gem_object_set_tiling(struct 
>>>>> drm_i915_gem_object *obj,
>>>>>               obj->mm.quirked = true;
>>>>>           }
>>>>>       }
>>>>> -    mutex_unlock(&obj->mm.lock);
>>>>>         spin_lock(&obj->vma.lock);
>>>>>       for_each_ggtt_vma(vma, obj) {
>>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c 
>>>>> b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
>>>>> index e946032b13e4..80907c00c6fd 100644
>>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
>>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
>>>>> @@ -129,8 +129,15 @@ userptr_mn_invalidate_range_start(struct 
>>>>> mmu_notifier *_mn,
>>>>>           ret = i915_gem_object_unbind(obj,
>>>>>                            I915_GEM_OBJECT_UNBIND_ACTIVE |
>>>>> I915_GEM_OBJECT_UNBIND_BARRIER);
>>>>> -        if (ret == 0)
>>>>> -            ret = __i915_gem_object_put_pages(obj);
>>>>> +        if (ret == 0) {
>>>>> +            /* ww_mutex and mmu_notifier is fs_reclaim tainted */
>>>>> +            if (i915_gem_object_trylock(obj)) {
>>>>> +                ret = __i915_gem_object_put_pages(obj);
>>>>> +                i915_gem_object_unlock(obj);
>>>>> +            } else {
>>>>> +                ret = -EAGAIN;
>>>>> +            }
>>>>> +        }
>>>>
>>>> I'm not sure upstream will agree with this kind of API:
>>>>
>>>> 1. It will deadlock when RT tasks are used.
>>>
>>> It will or it can? Which part? It will break out of the loop if 
>>> trylock fails.
>>>
>>>>
>>>> 2. You start throwing -EAGAIN because you don't have the correct 
>>>> ordering of locking, this needs fixing first.
>>>
>>> Is it about correct ordering of locks or something else? If memory 
>>> allocation is allowed under dma_resv.lock, then the opposite order 
>>> cannot be taken in any case.
>>>
>>> I've had a brief look at the amdgpu solution and maybe I 
>>> misunderstood something, but it looks like a BKL approach with the 
>>> device level notifier_lock. Their userptr notifier blocks on that 
>>> one, not on dma_resv lock, but that also means their command 
>>> submission (amdgpu_cs_submit) blocks on the same lock while 
>>> obtaining backing store.
>>
>> If I read Christian right, it blocks on that lock only just before 
>> command submission to validate that sequence number. If there is a 
>> mismatch, it needs to rerun CS. I'm not sure how common userptr 
>> buffers are, but if a device-wide mutex hurts too much, there are 
>> perhaps more fine-grained solutions. (Like an rw semaphore, and 
>> unlock before the fence wait in the notifier: CS which are unaffected 
>> shouldn't need to wait...).
>
> Right, I wasn't familiar with the mmu notifier seqno business. Hm i915 
> and amdgpu seem to use different mmu notifier hooks. Could we use the 
> same and does it mean "locking order" is irrelevant to the sub-story 
> of userptr?
>
>>
>>>
>>> So it looks like a big hammer approach not directly related to the 
>>> story of dma_resv locking. Maybe we could do the same big hammer 
>>> approach, although I am not sure how it is deadlock free.
>>>
>>> What happens for instance if someone submits an userptr batch which 
>>> gets unmapped while amdgpu_cs_submit is holding the notifier_lock?
>>
>> My understanding is the unmapping operation blocks on the 
>> notifier_lock in the mmu notifier?
>
> Yes, the key will be understanding the difference between
> invalidate_range_start and invalidate+seqno and whether we can use the 
> same from i915 to avoid the trylock.
>
> But given how the ww_mutex is fs_reclaim tainted, does that mean we would 
> also need to ensure no memory allocations under the ww_mutex?

So with fs_reclaim, we can do memory allocations under the ww_mutex, but 
we need trylock in shrinkers and mmu_notifiers. However, get_user_pages() 
must not run under the ww_mutex due to the mmap_sem inversion.

But with the AMD solution I figure we don't really need the ww_mutex in 
the mmu_notifier at all? That would possibly mean holding on to, and 
exposing to the GPU, pages that aren't used anymore, but a shrinker would 
notice the sequence number mismatch and consider the pages purgeable, 
and on the next CS the stale pages would be released.
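
For reference, the seqno scheme being discussed looks roughly like the
sketch below. It uses the generic mmu_interval_notifier API; the
i915_user_pages struct and the notifier_lock parameter are invented here
purely for illustration, this is not the current i915 userptr code.

	struct i915_user_pages {
		struct mmu_interval_notifier notifier;
		unsigned long start;
	};

	static int userptr_get_pages(struct i915_user_pages *p,
				     struct page **pages, int npages,
				     spinlock_t *notifier_lock)
	{
		unsigned long seq = mmu_interval_read_begin(&p->notifier);
		int pinned;

		/* Runs outside any ww_mutex, so no mmap_sem inversion. */
		pinned = pin_user_pages_fast(p->start, npages, FOLL_WRITE, pages);
		if (pinned < 0)
			return pinned;

		/* Re-validate just before publishing the dma-fence. */
		spin_lock(notifier_lock);
		if (mmu_interval_read_retry(&p->notifier, seq)) {
			spin_unlock(notifier_lock);
			unpin_user_pages(pages, pinned);
			return -EAGAIN; /* restart the CS with fresh pages */
		}
		spin_unlock(notifier_lock);

		return pinned;
	}

The invalidate callback then only bumps the sequence number and waits for
the fences already published; it never needs the ww_mutex.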

Still, if we can't live with that either, we can free the pages and kill 
the GPU bindings with async work from the mmu_notifier.
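
A compact sketch of that fallback (the userptr.release_work member is
hypothetical, named here only to illustrate the idea):

	/* in the mmu notifier: never block on the ww_mutex */
	if (i915_gem_object_trylock(obj)) {
		ret = __i915_gem_object_put_pages(obj);
		i915_gem_object_unlock(obj);
	} else {
		/* defer the teardown to process context */
		queue_work(system_unbound_wq, &obj->userptr.release_work);
		ret = 0; /* pages will be dropped asynchronously */
	}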

Danvet suggested at some point breaking out the synchronization part of 
the AMD solution and making drm helpers out of it. I think that makes sense.

/Thomas

>
> Regards,
>
> Tvrtko





>
>
>>
>> /Thomas
>>
>>
>>>
>>> If you understand amdgpu better please share some insights. I 
>>> certainly only looked at it briefly today so may be wrong.
>>>
>>> Regards,
>>>
>>> Tvrtko
>>> _______________________________________________
>>> Intel-gfx mailing list
>>> Intel-gfx@lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [Intel-gfx] [PATCH 08/66] drm/i915: Make the stale cached active node available for any timeline
  2020-07-29 13:42         ` Chris Wilson
@ 2020-07-29 13:53           ` Chris Wilson
  2020-07-29 14:22           ` Tvrtko Ursulin
  1 sibling, 0 replies; 156+ messages in thread
From: Chris Wilson @ 2020-07-29 13:53 UTC (permalink / raw)
  To: Tvrtko Ursulin, intel-gfx

Quoting Chris Wilson (2020-07-29 14:42:06)
> Quoting Tvrtko Ursulin (2020-07-29 13:40:38)
> > 
> > On 28/07/2020 15:28, Chris Wilson wrote:
> > > Quoting Tvrtko Ursulin (2020-07-17 14:04:58)
> > >>
> > >> On 15/07/2020 12:50, Chris Wilson wrote:
> > >>> Rather than require the next timeline after idling to match the MRU
> > >>> before idling, reset the index on the node and allow it to match the
> > >>> first request. However, this requires cmpxchg(u64) and so is not trivial
> > >>> on 32b, so for compatibility we just fallback to keeping the cached node
> > >>> pointing to the MRU timeline.
> > >>>
> > >>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > >>> ---
> > >>>    drivers/gpu/drm/i915/i915_active.c | 21 +++++++++++++++++++--
> > >>>    1 file changed, 19 insertions(+), 2 deletions(-)
> > >>>
> > >>> diff --git a/drivers/gpu/drm/i915/i915_active.c b/drivers/gpu/drm/i915/i915_active.c
> > >>> index 0854b1552bc1..6737b5615c0c 100644
> > >>> --- a/drivers/gpu/drm/i915/i915_active.c
> > >>> +++ b/drivers/gpu/drm/i915/i915_active.c
> > >>> @@ -157,6 +157,10 @@ __active_retire(struct i915_active *ref)
> > >>>                rb_link_node(&ref->cache->node, NULL, &ref->tree.rb_node);
> > >>>                rb_insert_color(&ref->cache->node, &ref->tree);
> > >>>                GEM_BUG_ON(ref->tree.rb_node != &ref->cache->node);
> > >>> +
> > >>> +             /* Make the cached node available for reuse with any timeline */
> > >>> +             if (IS_ENABLED(CONFIG_64BIT))
> > >>> +                     ref->cache->timeline = 0; /* needs cmpxchg(u64) */
> > >>
> > >> Or when fence context wraps shock horror.
> > > 
> > > I'm more concerned that we use timeline:0 as a special unordered
> > > timeline. It's reserved for use by the dma_fence_stub, and everything
> > > will start to break when the timelines wrap. The earliest casualties
> > > will be the kernel_context timelines which are also very special indices
> > > for the barriers.
> > > 
> > >>
> > >>>        }
> > >>>    
> > >>>        spin_unlock_irqrestore(&ref->tree_lock, flags);
> > >>> @@ -235,9 +239,22 @@ static struct active_node *__active_lookup(struct i915_active *ref, u64 idx)
> > >>>    {
> > >>>        struct active_node *it;
> > >>>    
> > >>> +     GEM_BUG_ON(idx == 0); /* 0 is the unordered timeline, rsvd for cache */
> > >>> +
> > >>>        it = READ_ONCE(ref->cache);
> > >>> -     if (it && it->timeline == idx)
> > >>> -             return it;
> > >>> +     if (it) {
> > >>> +             u64 cached = READ_ONCE(it->timeline);
> > >>> +
> > >>> +             if (cached == idx)
> > >>> +                     return it;
> > >>> +
> > >>> +#ifdef CONFIG_64BIT /* for cmpxchg(u64) */
> > >>> +             if (!cached && !cmpxchg(&it->timeline, 0, idx)) {
> > >>> +                     GEM_BUG_ON(i915_active_fence_isset(&it->base));
> > >>> +                     return it;
> > >>
> > >> cmpxchg suggests this needs to be atomic, however above the check for
> > >> equality comes from a separate read.
> > > 
> > > That's fine, and quite common to avoid cmpxchg if the current value
> > > already does not match the expected condition.
> > 
> > How? What if another thread is about to install its idx into 
> > it->timeline with cmpxchg and this thread does not see it because it 
> > just returned on the "cached == idx" condition.
> 
> Because it's nonzero.
> 
> If the idx is already non-zero, it will always remain non-zero until
> everybody idles (and there are no more threads).
> 
> If the idx is zero, it can only transition to non-zero once, atomically
> via cmpxchg. The first, and only the first, cmpxchg will see that the
> previous value was 0, and so only that thread returns with
> it->timeline == idx.

As for the case that two threads are attempting to install 2 different
fences into the same timeline slot -- that concurrency is controlled
by the timeline mutex, or some other agreed upon serialisation for the
slot [e.g. the exclusive slot doesn't have an intel_timeline associated
with it, and some ranges use mutexes other than intel_timeline.]
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [Intel-gfx] [PATCH 08/66] drm/i915: Make the stale cached active node available for any timeline
  2020-07-29 13:42         ` Chris Wilson
  2020-07-29 13:53           ` Chris Wilson
@ 2020-07-29 14:22           ` Tvrtko Ursulin
  2020-07-29 14:39             ` Chris Wilson
  2020-07-29 14:52             ` Chris Wilson
  1 sibling, 2 replies; 156+ messages in thread
From: Tvrtko Ursulin @ 2020-07-29 14:22 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 29/07/2020 14:42, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2020-07-29 13:40:38)
>>
>> On 28/07/2020 15:28, Chris Wilson wrote:
>>> Quoting Tvrtko Ursulin (2020-07-17 14:04:58)
>>>>
>>>> On 15/07/2020 12:50, Chris Wilson wrote:
>>>>> Rather than require the next timeline after idling to match the MRU
>>>>> before idling, reset the index on the node and allow it to match the
>>>>> first request. However, this requires cmpxchg(u64) and so is not trivial
>>>>> on 32b, so for compatibility we just fallback to keeping the cached node
>>>>> pointing to the MRU timeline.
>>>>>
>>>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>>>> ---
>>>>>     drivers/gpu/drm/i915/i915_active.c | 21 +++++++++++++++++++--
>>>>>     1 file changed, 19 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/i915/i915_active.c b/drivers/gpu/drm/i915/i915_active.c
>>>>> index 0854b1552bc1..6737b5615c0c 100644
>>>>> --- a/drivers/gpu/drm/i915/i915_active.c
>>>>> +++ b/drivers/gpu/drm/i915/i915_active.c
>>>>> @@ -157,6 +157,10 @@ __active_retire(struct i915_active *ref)
>>>>>                 rb_link_node(&ref->cache->node, NULL, &ref->tree.rb_node);
>>>>>                 rb_insert_color(&ref->cache->node, &ref->tree);
>>>>>                 GEM_BUG_ON(ref->tree.rb_node != &ref->cache->node);
>>>>> +
>>>>> +             /* Make the cached node available for reuse with any timeline */
>>>>> +             if (IS_ENABLED(CONFIG_64BIT))
>>>>> +                     ref->cache->timeline = 0; /* needs cmpxchg(u64) */
>>>>
>>>> Or when fence context wraps shock horror.
>>>
>>> I'm more concerned that we use timeline:0 as a special unordered
>>> timeline. It's reserved for use by the dma_fence_stub, and everything
>>> will start to break when the timelines wrap. The earliest casualties
>>> will be the kernel_context timelines which are also very special indices
>>> for the barriers.
>>>
>>>>
>>>>>         }
>>>>>     
>>>>>         spin_unlock_irqrestore(&ref->tree_lock, flags);
>>>>> @@ -235,9 +239,22 @@ static struct active_node *__active_lookup(struct i915_active *ref, u64 idx)
>>>>>     {
>>>>>         struct active_node *it;
>>>>>     
>>>>> +     GEM_BUG_ON(idx == 0); /* 0 is the unordered timeline, rsvd for cache */
>>>>> +
>>>>>         it = READ_ONCE(ref->cache);
>>>>> -     if (it && it->timeline == idx)
>>>>> -             return it;
>>>>> +     if (it) {
>>>>> +             u64 cached = READ_ONCE(it->timeline);
>>>>> +
>>>>> +             if (cached == idx)
>>>>> +                     return it;
>>>>> +
>>>>> +#ifdef CONFIG_64BIT /* for cmpxchg(u64) */
>>>>> +             if (!cached && !cmpxchg(&it->timeline, 0, idx)) {
>>>>> +                     GEM_BUG_ON(i915_active_fence_isset(&it->base));
>>>>> +                     return it;
>>>>
>>>> cmpxchg suggests this needs to be atomic, however above the check for
>>>> equality comes from a separate read.
>>>
>>> That's fine, and quite common to avoid cmpxchg if the current value
>>> already does not match the expected condition.
>>
>> How? What if another thread is about to install its idx into
>> it->timeline with cmpxchg and this thread does not see it because it
>> just returned on the "cached == idx" condition.
> 
> Because it's nonzero.
> 
> If the idx is already non-zero, it will always remain non-zero until
> everybody idles (and there are no more threads).
> 
> If the idx is zero, it can only transition to non-zero once, atomically
> via cmpxchg. The first, and only the first, cmpxchg will see that the
> previous value was 0, and so only that thread returns with
> it->timeline == idx.

I think this is worthy of a comment to avoid a future reader having to 
re-figure it all out.

>>>> Since there is a lookup code path under the spinlock, perhaps the
>>>> unlocked lookup could just fail, and then locked lookup could re-assign
>>>> the timeline without the need for cmpxchg?
>>>
>>> The unlocked/locked lookup are the same routine. You pointed that out
>>> :-p
>>
>> Like I remember from ten days ago.. Anyway, I am pointing out it still
>> doesn't smell right.
>>
>> __active_lookup(...) -> lockless
>> {
>> ...
>>          it = fetch_node(ref->tree.rb_node);
>>          while (it) {
>>                  if (it->timeline < idx) {
>>                          it = fetch_node(it->node.rb_right);
>>                  } else if (it->timeline > idx) {
>>                          it = fetch_node(it->node.rb_left);
>>                  } else {
>>                          WRITE_ONCE(ref->cache, it);
>>                          break;
>>                  }
>>          }
>> ...
>> }
>>
>> Then in active_instance, locked:
>>
>> ...
>>          parent = NULL;
>>          p = &ref->tree.rb_node;
>>          while (*p) {
>>                  parent = *p;
>>
>>                  node = rb_entry(parent, struct active_node, node);
>>                  if (node->timeline == idx) {
>>                          kmem_cache_free(global.slab_cache, prealloc);
>>                          goto out;
>>                  }
>>
>>                  if (node->timeline < idx)
>>                          p = &parent->rb_right;
>>                  else
>>                          p = &parent->rb_left;
>>          }
>> ...
>>
>> Tree walk could be consolidated between the two.
> 
> This tree walk is subtly different, as we aren't just interested in the
> node, but its parent. The exact repetitions have been consolidated into
> __active_lookup.

It returns the previous/parent node if idx is not found, so yeah, a common 
helper would need to have two out parameters. One returns the match, or 
NULL, another returns the previous/parent node. You think that is not 
worth it?

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [Intel-gfx] [PATCH 08/66] drm/i915: Make the stale cached active node available for any timeline
  2020-07-29 14:22           ` Tvrtko Ursulin
@ 2020-07-29 14:39             ` Chris Wilson
  2020-07-29 14:52             ` Chris Wilson
  1 sibling, 0 replies; 156+ messages in thread
From: Chris Wilson @ 2020-07-29 14:39 UTC (permalink / raw)
  To: Tvrtko Ursulin, intel-gfx

Quoting Tvrtko Ursulin (2020-07-29 15:22:26)
> 
> On 29/07/2020 14:42, Chris Wilson wrote:
> >>          parent = NULL;
> >>          p = &ref->tree.rb_node;
> >>          while (*p) {
> >>                  parent = *p;
> >>
> >>                  node = rb_entry(parent, struct active_node, node);
> >>                  if (node->timeline == idx) {
> >>                          kmem_cache_free(global.slab_cache, prealloc);
> >>                          goto out;
> >>                  }
> >>
> >>                  if (node->timeline < idx)
> >>                          p = &parent->rb_right;
> >>                  else
> >>                          p = &parent->rb_left;
> >>                          WRITE_ONCE(ref->cache, it);
> >>                          break;
> >>                  }
> >>          }
> >> ...
> >>
> >> Tree walk could be consolidated between the two.
> > 
> > This tree walk is subtly different, as we aren't just interested in the
> > node, but its parent. The exact repetitions have been consolidated into
> > __active_lookup.
> 
> It returns the previous/parent node if idx is not found, so yeah, a common 
> helper would need to have two out parameters. One returns the match, or 
> NULL, another returns the previous/parent node. You think that is not 
> worth it?

I remember in the past wanting to keep the lookup separate from
insertion as the compiler generated better code if we didn't track the
parent pointer.
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [Intel-gfx] [PATCH 08/66] drm/i915: Make the stale cached active node available for any timeline
  2020-07-29 14:22           ` Tvrtko Ursulin
  2020-07-29 14:39             ` Chris Wilson
@ 2020-07-29 14:52             ` Chris Wilson
  2020-07-29 15:31               ` Tvrtko Ursulin
  1 sibling, 1 reply; 156+ messages in thread
From: Chris Wilson @ 2020-07-29 14:52 UTC (permalink / raw)
  To: Tvrtko Ursulin, intel-gfx

Quoting Tvrtko Ursulin (2020-07-29 15:22:26)
> 
> On 29/07/2020 14:42, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2020-07-29 13:40:38)
> >>
> >> On 28/07/2020 15:28, Chris Wilson wrote:
> >>> Quoting Tvrtko Ursulin (2020-07-17 14:04:58)
> >>>>
> >>>> On 15/07/2020 12:50, Chris Wilson wrote:
> >>>>> @@ -235,9 +239,22 @@ static struct active_node *__active_lookup(struct i915_active *ref, u64 idx)
> >>>>>     {
> >>>>>         struct active_node *it;
> >>>>>     
> >>>>> +     GEM_BUG_ON(idx == 0); /* 0 is the unordered timeline, rsvd for cache */
> >>>>> +
> >>>>>         it = READ_ONCE(ref->cache);
> >>>>> -     if (it && it->timeline == idx)
> >>>>> -             return it;
> >>>>> +     if (it) {
> >>>>> +             u64 cached = READ_ONCE(it->timeline);
> >>>>> +
> >>>>> +             if (cached == idx)
> >>>>> +                     return it;
> >>>>> +
> >>>>> +#ifdef CONFIG_64BIT /* for cmpxchg(u64) */
> >>>>> +             if (!cached && !cmpxchg(&it->timeline, 0, idx)) {
> >>>>> +                     GEM_BUG_ON(i915_active_fence_isset(&it->base));
> >>>>> +                     return it;
> >>>>
> >>>> cmpxchg suggests this needs to be atomic, however above the check for
> >>>> equality comes from a separate read.
> >>>
> >>> That's fine, and quite common to avoid cmpxchg if the current value
> >>> already does not match the expected condition.
> >>
> >> How? What if another thread is about to install its idx into
> >> it->timeline with cmpxchg and this thread does not see it because it
> >> just returned on the "cached == idx" condition.
> > 
> > Because it's nonzero.
> > 
> > If the idx is already non-zero, it will always remain non-zero until
> > everybody idles (and there are no more threads).
> > 
> > If the idx is zero, it can only transition to non-zero once, atomically
> > via cmpxchg. The first, and only the first, cmpxchg will see that the
> > previous value was 0, and so only that thread returns with
> > it->timeline == idx.
> 
> I think this is worthy of a comment to avoid a future reader having to 
> re-figure it all out.

        if (it) {
                u64 cached = READ_ONCE(it->timeline);

+               /* Once claimed, this slot will only belong to this idx */
                if (cached == idx)
                        return it;

 #ifdef CONFIG_64BIT /* for cmpxchg(u64) */
+               /*
+                * An unclaimed cache [.timeline=0] can only be claimed once.
+                *
+                * If the value is already non-zero, some other thread has
+                * claimed the cache and we know that it does not match our
+                * idx. If, and only if, the timeline is currently zero is it
+                * worth competing to claim it atomically for ourselves
+                * (cmpxchg will return the old value of 0 only to the winner
+                * of that race).
+                */
                if (!cached && !cmpxchg(&it->timeline, 0, idx))
                        return it;
 #endif
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [Intel-gfx] [PATCH 08/66] drm/i915: Make the stale cached active node available for any timeline
  2020-07-29 14:52             ` Chris Wilson
@ 2020-07-29 15:31               ` Tvrtko Ursulin
  0 siblings, 0 replies; 156+ messages in thread
From: Tvrtko Ursulin @ 2020-07-29 15:31 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 29/07/2020 15:52, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2020-07-29 15:22:26)
>>
>> On 29/07/2020 14:42, Chris Wilson wrote:
>>> Quoting Tvrtko Ursulin (2020-07-29 13:40:38)
>>>>
>>>> On 28/07/2020 15:28, Chris Wilson wrote:
>>>>> Quoting Tvrtko Ursulin (2020-07-17 14:04:58)
>>>>>>
>>>>>> On 15/07/2020 12:50, Chris Wilson wrote:
>>>>>>> @@ -235,9 +239,22 @@ static struct active_node *__active_lookup(struct i915_active *ref, u64 idx)
>>>>>>>      {
>>>>>>>          struct active_node *it;
>>>>>>>      
>>>>>>> +     GEM_BUG_ON(idx == 0); /* 0 is the unordered timeline, rsvd for cache */
>>>>>>> +
>>>>>>>          it = READ_ONCE(ref->cache);
>>>>>>> -     if (it && it->timeline == idx)
>>>>>>> -             return it;
>>>>>>> +     if (it) {
>>>>>>> +             u64 cached = READ_ONCE(it->timeline);
>>>>>>> +
>>>>>>> +             if (cached == idx)
>>>>>>> +                     return it;
>>>>>>> +
>>>>>>> +#ifdef CONFIG_64BIT /* for cmpxchg(u64) */
>>>>>>> +             if (!cached && !cmpxchg(&it->timeline, 0, idx)) {
>>>>>>> +                     GEM_BUG_ON(i915_active_fence_isset(&it->base));
>>>>>>> +                     return it;
>>>>>>
>>>>>> cmpxchg suggests this needs to be atomic, however above the check for
>>>>>> equality comes from a separate read.
>>>>>
>>>>> That's fine, and quite common to avoid cmpxchg if the current value
>>>>> already does not match the expected condition.
>>>>
>>>> How? What if another thread is about to install its idx into
>>>> it->timeline with cmpxchg and this thread does not see it because it
>>>> just returned on the "cached == idx" condition.
>>>
>>> Because it's nonzero.
>>>
>>> If the idx is already non-zero, it will always remain non-zero until
>>> everybody idles (and there are no more threads).
>>>
>>> If the idx is zero, it can only transition to non-zero once, atomically
>>> via cmpxchg. The first, and only the first, cmpxchg will see that the
>>> previous value was 0, and so only that thread returns with
>>> it->timeline == idx.
>>
>> I think this is worthy of a comment to avoid a future reader having to
>> re-figure it all out.
> 
>          if (it) {
>                  u64 cached = READ_ONCE(it->timeline);
> 
> +               /* Once claimed, this slot will only belong to this idx */
>                  if (cached == idx)
>                          return it;
> 
>   #ifdef CONFIG_64BIT /* for cmpxchg(u64) */
> +               /*
> +                * An unclaimed cache [.timeline=0] can only be claimed once.
> +                *
> +                * If the value is already non-zero, some other thread has
> +                * claimed the cache and we know that it does not match our
> +                * idx. If, and only if, the timeline is currently zero is it
> +                * worth competing to claim it atomically for ourselves
> +                * (cmpxchg will return the old value of 0 only to the winner
> +                * of that race).
> +                */
>                  if (!cached && !cmpxchg(&it->timeline, 0, idx))
>                          return it;
>   #endif
> 

Sounds good. With that:

Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [Intel-gfx] [PATCH 11/66] drm/i915: Preallocate stashes for vma page-directories
  2020-07-28 14:50     ` Chris Wilson
@ 2020-07-30 12:04       ` Thomas Hellström (Intel)
  2020-07-30 12:28       ` Thomas Hellström (Intel)
  1 sibling, 0 replies; 156+ messages in thread
From: Thomas Hellström (Intel) @ 2020-07-30 12:04 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx



On 7/28/20 4:50 PM, Chris Wilson wrote:
> Quoting Thomas Hellström (Intel) (2020-07-27 10:24:24)
>> Hi, Chris,
>>
>> It appears to me like this series is doing a lot of different things:
>>
>> - Various optimizations
>> - Locking rework
>> - Adding schedulers
>> - Other misc fixes
>>
>> Could you please separate out as much as possible the locking rework
>> prerequisites in one series with cover letter, and most importantly the
>> major part of the locking rework (only) with a more elaborate cover
>> letter discussing, if not trivial, how each patch fits in and on design
>> and future directions. Questions that I have stumbled on so far (being a
>> new-to-the-driver reviewer):
> The locking depends on the former work to reduce the impact. It's still a
> major issue that we introduce a broad lock that is held for several
> hundred milliseconds across many objects and stalls the game & compositor.
>   
>> - When are memory allocations disallowed? If we need to pre-allocate in
>> execbuf, when? why?
> That should be mentioned in the code.
>
>> - When is the request dma-fence published?
> There a big comment to that effect.
>
>> - Do we need to keep cpu asynchronous execbuf tasks after this? why?
> Keep? Oh, you mean not immediately discard after publishing them, but
> why we need them. Same reason as we need them before.
>
>> - What about userptr pinning ending up in the dma_fence critical path?
> It's in the user critical path (the shortest path to perform their
> sequence of operations), but it's before the dma-fence itself. I say
> that's a particularly nasty false claim that it is not on the critical
> path, but being where it is circumvents the whole argument.
>   
>> And then move anything non-related to separate series?
> Not related to what? Development of i915?
> -Chris

So while it's true that a good prior understanding of the driver together 
with a detailed analysis of the code would provide answers to most of 
these questions, they were actually primarily intended to serve as 
inspiration for an elaborate cover letter.

I believe a discussion touching these items would be a good aid to 
people embarking on reviewing the series.

/Thomas




_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [Intel-gfx] [PATCH 11/66] drm/i915: Preallocate stashes for vma page-directories
  2020-07-28 14:50     ` Chris Wilson
  2020-07-30 12:04       ` Thomas Hellström (Intel)
@ 2020-07-30 12:28       ` Thomas Hellström (Intel)
  2020-08-04 14:08         ` Chris Wilson
  1 sibling, 1 reply; 156+ messages in thread
From: Thomas Hellström (Intel) @ 2020-07-30 12:28 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 7/28/20 4:50 PM, Chris Wilson wrote:
>
> It's in the user critical path (the shortest path to perform their
> sequence of operations), but it's before the dma-fence itself. I say
> that's a particularly nasty false claim that it is not on the critical
> path, but being where it is circumvents the whole argument.
>   

Couldn't the following situation happen?

1. CS spawns userptr pinning work.
2. CS creates and publishes a DMA-fence that depends on that pinning work.
3. Another driver CS creates and publishes a second DMA-fence that 
depends on that first DMA-fence.
4. Userptr pinning starts pinning pages, triggering a shrinker in the 
other driver.
5. The other driver's shrinker blocks on the second DMA-fence.
6. Deadlock.

Or do I misread the i915 userptr code?

/Thomas


_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [Intel-gfx] [PATCH 27/66] drm/i915/gem: Pull execbuf dma resv under a single critical section
  2020-07-28 15:16     ` Chris Wilson
@ 2020-07-30 12:57       ` Thomas Hellström (Intel)
  0 siblings, 0 replies; 156+ messages in thread
From: Thomas Hellström (Intel) @ 2020-07-30 12:57 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 7/28/20 5:16 PM, Chris Wilson wrote:
> Quoting Thomas Hellström (Intel) (2020-07-27 19:08:39)
>> On 7/15/20 1:51 PM, Chris Wilson wrote:
>>> Acquire all the objects and their backing storage, and page directories,
>>> as used by execbuf under a single common ww_mutex. Albeit we have to
>>> restart the critical section a few times in order to handle various
>>> restrictions (such as avoiding copy_(from|to)_user and mmap_sem).
>>>
>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>> ---
>>>    .../gpu/drm/i915/gem/i915_gem_execbuffer.c    | 168 +++++++++---------
>>>    .../i915/gem/selftests/i915_gem_execbuffer.c  |   8 +-
>>>    2 files changed, 87 insertions(+), 89 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
>>> index ebabc0746d50..db433f3f18ec 100644
>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
>>> @@ -20,6 +20,7 @@
>>>    #include "gt/intel_gt_pm.h"
>>>    #include "gt/intel_gt_requests.h"
>>>    #include "gt/intel_ring.h"
>>> +#include "mm/i915_acquire_ctx.h"
>>>    
>>>    #include "i915_drv.h"
>>>    #include "i915_gem_clflush.h"
>>> @@ -244,6 +245,8 @@ struct i915_execbuffer {
>>>        struct intel_context *context; /* logical state for the request */
>>>        struct i915_gem_context *gem_context; /** caller's context */
>>>    
>>> +     struct i915_acquire_ctx acquire; /** lock for _all_ DMA reservations */
>>> +
>>>        struct i915_request *request; /** our request to build */
>>>        struct eb_vma *batch; /** identity of the batch obj/vma */
>>>    
>>> @@ -389,42 +392,6 @@ static void eb_vma_array_put(struct eb_vma_array *arr)
>>>        kref_put(&arr->kref, eb_vma_array_destroy);
>>>    }
>>>    
>>> -static int
>>> -eb_lock_vma(struct i915_execbuffer *eb, struct ww_acquire_ctx *acquire)
>>> -{
>>> -     struct eb_vma *ev;
>>> -     int err = 0;
>>> -
>>> -     list_for_each_entry(ev, &eb->submit_list, submit_link) {
>>> -             struct i915_vma *vma = ev->vma;
>>> -
>>> -             err = ww_mutex_lock_interruptible(&vma->resv->lock, acquire);
>>> -             if (err == -EDEADLK) {
>>> -                     struct eb_vma *unlock = ev, *en;
>>> -
>>> -                     list_for_each_entry_safe_continue_reverse(unlock, en,
>>> -                                                               &eb->submit_list,
>>> -                                                               submit_link) {
>>> -                             ww_mutex_unlock(&unlock->vma->resv->lock);
>>> -                             list_move_tail(&unlock->submit_link, &eb->submit_list);
>>> -                     }
>>> -
>>> -                     GEM_BUG_ON(!list_is_first(&ev->submit_link, &eb->submit_list));
>>> -                     err = ww_mutex_lock_slow_interruptible(&vma->resv->lock,
>>> -                                                            acquire);
>>> -             }
>>> -             if (err) {
>>> -                     list_for_each_entry_continue_reverse(ev,
>>> -                                                          &eb->submit_list,
>>> -                                                          submit_link)
>>> -                             ww_mutex_unlock(&ev->vma->resv->lock);
>>> -                     break;
>>> -             }
>>> -     }
>>> -
>>> -     return err;
>>> -}
>>> -
>>>    static int eb_create(struct i915_execbuffer *eb)
>>>    {
>>>        /* Allocate an extra slot for use by the sentinel */
>>> @@ -668,6 +635,25 @@ eb_add_vma(struct i915_execbuffer *eb,
>>>        }
>>>    }
>>>    
>>> +static int eb_lock_mm(struct i915_execbuffer *eb)
>>> +{
>>> +     struct eb_vma *ev;
>>> +     int err;
>>> +
>>> +     list_for_each_entry(ev, &eb->bind_list, bind_link) {
>>> +             err = i915_acquire_ctx_lock(&eb->acquire, ev->vma->obj);
>>> +             if (err)
>>> +                     return err;
>>> +     }
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static int eb_acquire_mm(struct i915_execbuffer *eb)
>>> +{
>>> +     return i915_acquire_mm(&eb->acquire);
>>> +}
>>> +
>>>    struct eb_vm_work {
>>>        struct dma_fence_work base;
>>>        struct eb_vma_array *array;
>>> @@ -1390,7 +1376,15 @@ static int eb_reserve_vm(struct i915_execbuffer *eb)
>>>        unsigned long count;
>>>        struct eb_vma *ev;
>>>        unsigned int pass;
>>> -     int err = 0;
>>> +     int err;
>>> +
>>> +     err = eb_lock_mm(eb);
>>> +     if (err)
>>> +             return err;
>>> +
>>> +     err = eb_acquire_mm(eb);
>>> +     if (err)
>>> +             return err;
>>>    
>>>        count = 0;
>>>        INIT_LIST_HEAD(&unbound);
>>> @@ -1416,10 +1410,15 @@ static int eb_reserve_vm(struct i915_execbuffer *eb)
>>>        if (count == 0)
>>>                return 0;
>>>    
>>> +     /* We need to reserve page directories, release all, start over */
>>> +     i915_acquire_ctx_fini(&eb->acquire);
>>> +
>>>        pass = 0;
>>>        do {
>>>                struct eb_vm_work *work;
>>>    
>>> +             i915_acquire_ctx_init(&eb->acquire);
>> Couldn't we do a i915_acquire_ctx_rollback() here to avoid losing our
>> ticket? That would mean deferring i915_acquire_ctx_done() until all
>> potential rollbacks have been performed.
> We need to completely drop the acquire-class for catching up with userptr
> (and anything else deferred that doesn't meet the current fence semantics).
Hmm, what other deferred stuff are we doing that doesn't meet the 
current fence semantics? (Which I interpret as "everything deferred that 
we spawn during a CS needs to be either synced before we publish the 
fence or considered part of the fence signaling critical path")
>
> I thought it was sensible to drop all around the waits in this loop, and
> tidier to always reacquire at the beginning of each loop.
>
>> Or even better if we defer _ctx_done(), couldn't we just continue
>> locking the pts here instead of dropping and re-acquiring everything?
> Userptr would like to have a word. If you just mean these do lines, then
> yes fini/init is overkill -- it just looked simpler than doing it at the
> end of the loop. The steady-state load is not meant to get past the
> optimistic fastpath.

Indeed complicating the code to not drop these locks outside of the 
fastpath doesn't sound like a good idea.

However, if we drop the acquire ctx we might be subject to starvation, and 
I figure waiting for userptr would still work as long as we drop all the 
locks.

> -Chris

/Thomas


_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [Intel-gfx] [PATCH 11/66] drm/i915: Preallocate stashes for vma page-directories
  2020-07-28 14:42     ` Chris Wilson
@ 2020-07-31  7:43       ` Thomas Hellström (Intel)
  0 siblings, 0 replies; 156+ messages in thread
From: Thomas Hellström (Intel) @ 2020-07-31  7:43 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx; +Cc: Matthew Auld


On 7/28/20 4:42 PM, Chris Wilson wrote:
> Quoting Thomas Hellström (Intel) (2020-07-23 15:33:20)
>> On 2020-07-15 13:50, Chris Wilson wrote:
>>> We need to make the DMA allocations used for page directories to be
>>> performed up front so that we can include those allocations in our
>>> memory reservation pass. The downside is that we have to assume the
>>> worst case, even before we know the final layout, and always allocate
>>> enough page directories for this object, even when there will be overlap.
>>> This unfortunately can be quite expensive, especially as we have to
>>> clear/reset the page directories and DMA pages, but it should only be
>>> required during early phases of a workload when new objects are being
>>> discovered, or after memory/eviction pressure when we need to rebind.
>>> Once we reach steady state, the objects should not be moved and we no
>>> longer need to preallocate the page tables.
>>>
>>> It should be noted that the lifetime for the page directories DMA is
>>> more or less decoupled from individual fences as they will be shared
>>> across objects across timelines.
>>>
>>> v2: Only allocate enough PD space for the PTE we may use, we do not need
>>> to allocate PD that will be left as scratch.
>>> v3: Store the shift into the first PD level to encapsulate the different
>>> PTE counts for gen6/gen8.
>>>
>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>> Cc: Matthew Auld <matthew.auld@intel.com>
>>> ---
>>>    .../gpu/drm/i915/gem/i915_gem_client_blt.c    | 11 +--
>>>    drivers/gpu/drm/i915/gt/gen6_ppgtt.c          | 40 ++++-----
>>>    drivers/gpu/drm/i915/gt/gen8_ppgtt.c          | 78 +++++------------
>>>    drivers/gpu/drm/i915/gt/intel_ggtt.c          | 60 ++++++--------
>>>    drivers/gpu/drm/i915/gt/intel_gtt.h           | 46 ++++++----
>>>    drivers/gpu/drm/i915/gt/intel_ppgtt.c         | 83 ++++++++++++++++---
>>>    drivers/gpu/drm/i915/i915_vma.c               | 27 +++---
>>>    drivers/gpu/drm/i915/selftests/i915_gem_gtt.c | 60 ++++++++------
>>>    drivers/gpu/drm/i915/selftests/mock_gtt.c     | 22 ++---
>>>    9 files changed, 237 insertions(+), 190 deletions(-)
>> Hi, Chris,
>>
>> Overall looks good, but a question: Why can't we perform page-table
>> memory allocation on demand when needed?
> We need to allocate device memory for the page tables. The intention
> here is to gather up all the resource requirements for an operation and
> reserve them in a single pass.
>   
>> Are we then under a mutex that we also take during reclaim?
> Yes, the vm->mutex is used during the shrinker to revoke the GPU
> bindings before returning memory to the system.
> -Chris

Reviewed-by: Thomas Hellström <thomas.hellstrom@intel.com>


_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [Intel-gfx] [PATCH 13/66] drm/i915/gem: Don't drop the timeline lock during execbuf
  2020-07-15 11:50 ` [Intel-gfx] [PATCH 13/66] drm/i915/gem: Don't drop the timeline lock during execbuf Chris Wilson
  2020-07-23 16:09   ` Thomas Hellström (Intel)
@ 2020-07-31  8:09   ` Thomas Hellström (Intel)
  1 sibling, 0 replies; 156+ messages in thread
From: Thomas Hellström (Intel) @ 2020-07-31  8:09 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 7/15/20 1:50 PM, Chris Wilson wrote:
> Our timeline lock is our defence against a concurrent execbuf
> interrupting our request construction. We need to hold it throughout or,
> for example, a second thread may interject a relocation request in
> between our own relocation request and execution in the ring.
>
> A second, major benefit, is that it allows us to preserve a large chunk
> of the ringbuffer for our exclusive use; which should virtually
> eliminate the threat of hitting a wait_for_space during request
> construction -- although we should have already dropped other
> contentious locks at that point.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>
Reviewed-by: Thomas Hellström <thomas.hellstrom@intel.com>


_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [Intel-gfx] [PATCH 14/66] drm/i915/gem: Rename execbuf.bind_link to unbound_link
  2020-07-15 11:50 ` [Intel-gfx] [PATCH 14/66] drm/i915/gem: Rename execbuf.bind_link to unbound_link Chris Wilson
@ 2020-07-31  8:11   ` Thomas Hellström (Intel)
  0 siblings, 0 replies; 156+ messages in thread
From: Thomas Hellström (Intel) @ 2020-07-31  8:11 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 7/15/20 1:50 PM, Chris Wilson wrote:
> Rename the current list of unbound objects so that we can keep track of all
> objects that we need to bind, as well as the list of currently unbound
> [unprocessed] objects.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> ---
>   drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c | 14 +++++++-------
>   1 file changed, 7 insertions(+), 7 deletions(-)
>
Reviewed-by: Thomas Hellström <thomas.hellstrom@intel.com>


_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [Intel-gfx] [PATCH 15/66] drm/i915/gem: Break apart the early i915_vma_pin from execbuf object lookup
  2020-07-15 11:50 ` [Intel-gfx] [PATCH 15/66] drm/i915/gem: Break apart the early i915_vma_pin from execbuf object lookup Chris Wilson
@ 2020-07-31  8:51   ` Thomas Hellström (Intel)
  0 siblings, 0 replies; 156+ messages in thread
From: Thomas Hellström (Intel) @ 2020-07-31  8:51 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 7/15/20 1:50 PM, Chris Wilson wrote:
> As a prelude to the next step where we want to perform all the object
> allocations together under the same lock, we first must delay the
> i915_vma_pin() as that implicitly does the allocations for us, one by
> one. As it only does the allocations one by one, it is not allowed to
> wait/evict, whereas pulling all the allocations together the entire set
> can be scheduled as one.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Reviewed-by: Thomas Hellström <thomas.hellstrom@intel.com>


_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [Intel-gfx] [PATCH 16/66] drm/i915/gem: Remove the call for no-evict i915_vma_pin
  2020-07-28 15:05     ` Chris Wilson
@ 2020-07-31  8:58       ` Thomas Hellström (Intel)
  0 siblings, 0 replies; 156+ messages in thread
From: Thomas Hellström (Intel) @ 2020-07-31  8:58 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 7/28/20 5:05 PM, Chris Wilson wrote:
> Quoting Thomas Hellström (Intel) (2020-07-28 10:46:51)
>> On 7/15/20 1:50 PM, Chris Wilson wrote:
>>> Remove the stub i915_vma_pin() used for incrementally pining objects for
>> s/pining/pinning/
> Pining for the fjords.
> -Chris

Apart from that,

Reviewed-by: Thomas Hellström <thomas.hellstrom@intel.com>


_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [Intel-gfx] [PATCH 17/66] drm/i915: Add list_for_each_entry_safe_continue_reverse
  2020-07-15 11:50 ` [Intel-gfx] [PATCH 17/66] drm/i915: Add list_for_each_entry_safe_continue_reverse Chris Wilson
@ 2020-07-31  8:59   ` Thomas Hellström (Intel)
  0 siblings, 0 replies; 156+ messages in thread
From: Thomas Hellström (Intel) @ 2020-07-31  8:59 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 7/15/20 1:50 PM, Chris Wilson wrote:
> One more list iterator variant, for when we want to unwind from inside
> one list iterator with the intention of restarting from the current
> entry as the new head of the list.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Reviewed-by: Thomas Hellström <thomas.hellstrom@intel.com>

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [Intel-gfx] [PATCH 18/66] drm/i915: Always defer fenced work to the worker
  2020-07-15 11:50 ` [Intel-gfx] [PATCH 18/66] drm/i915: Always defer fenced work to the worker Chris Wilson
@ 2020-07-31  9:03   ` Thomas Hellström (Intel)
  2020-07-31 13:28     ` Chris Wilson
  0 siblings, 1 reply; 156+ messages in thread
From: Thomas Hellström (Intel) @ 2020-07-31  9:03 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 7/15/20 1:50 PM, Chris Wilson wrote:
> Currently, if an error is raised we always call the cleanup locally
> [and skip the main work callback]. However, some future users
Could you add an example of those future users?
> may need
> to take a mutex to cleanup and so we cannot immediately execute the
> cleanup as we may still be in interrupt context.
>
> With the execute-immediate flag, for most cases this should result in
> immediate cleanup of an error.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Otherwise Reviewed-by: Thomas Hellström <thomas.hellstrom@intel.com>


_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [Intel-gfx] [PATCH 20/66] drm/i915/gem: Separate the ww_mutex walker into its own list
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 20/66] drm/i915/gem: Separate the ww_mutex walker into its own list Chris Wilson
@ 2020-07-31  9:23   ` Thomas Hellström (Intel)
  0 siblings, 0 replies; 156+ messages in thread
From: Thomas Hellström (Intel) @ 2020-07-31  9:23 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 7/15/20 1:51 PM, Chris Wilson wrote:
> In preparation for making eb_vma bigger and heavy to run in parallel,
> we need to stop applying an in-place swap() to reorder around ww_mutex
> deadlocks. Keep the array intact and reorder the locks using a dedicated
> list.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Reviewed-by: Thomas Hellström <thomas.hellstrom@intel.com>


_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [Intel-gfx] [PATCH 23/66] drm/i915/gem: Include cmdparser in common execbuf pinning
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 23/66] drm/i915/gem: Include cmdparser in common execbuf pinning Chris Wilson
@ 2020-07-31  9:43   ` Thomas Hellström (Intel)
  0 siblings, 0 replies; 156+ messages in thread
From: Thomas Hellström (Intel) @ 2020-07-31  9:43 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 7/15/20 1:51 PM, Chris Wilson wrote:
> Pull the cmdparser allocations in to the reservation phase, and then
> they are included in the common vma pinning pass.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Reviewed-by: Thomas Hellström <thomas.hellstrom@intel.com>

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [Intel-gfx] [PATCH 24/66] drm/i915/gem: Include secure batch in common execbuf pinning
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 24/66] drm/i915/gem: Include secure batch " Chris Wilson
@ 2020-07-31  9:47   ` Thomas Hellström (Intel)
  0 siblings, 0 replies; 156+ messages in thread
From: Thomas Hellström (Intel) @ 2020-07-31  9:47 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 7/15/20 1:51 PM, Chris Wilson wrote:
> Pull the GGTT binding for the secure batch dispatch into the common vma
> pinning routine for execbuf, so that there is just a single central
> place for all i915_vma_pin().
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Reviewed-by: Thomas Hellström <thomas.hellstrom@intel.com>


_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [Intel-gfx] [PATCH 25/66] drm/i915/gem: Reintroduce multiple passes for reloc processing
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 25/66] drm/i915/gem: Reintroduce multiple passes for reloc processing Chris Wilson
@ 2020-07-31 10:05   ` Thomas Hellström (Intel)
  0 siblings, 0 replies; 156+ messages in thread
From: Thomas Hellström (Intel) @ 2020-07-31 10:05 UTC (permalink / raw)
  To: intel-gfx


On 7/15/20 1:51 PM, Chris Wilson wrote:
> The prospect of locking the entire submission sequence under a wide
> ww_mutex re-imposes some key restrictions, in particular that we must
> not call copy_(from|to)_user underneath the mutex (as the faulthandlers
> themselves may need to take the ww_mutex). To satisfy this requirement,
> we need to split the relocation handling into multiple phases again.
> After dropping the reservations, we need to allocate enough buffer space
> to both copy the relocations from userspace into, and serve as the
> relocation command buffer. Once we have finished copying the
> relocations, we can then re-acquire all the objects for the execbuf and
> rebind them, including our new relocations objects. After we have bound
> all the new and old objects into their final locations, we can then
> convert the relocation entries into the GPU commands to update the
> relocated vma. Finally, once it is all over and we have dropped the
> ww_mutex for the last time, we can then complete the update of the user
> relocation entries.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Tvrtko had some issues with this patch when submitted as part of a 
previous series, and they don't seem to have been addressed.

/Thomas


_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [Intel-gfx] [PATCH 26/66] drm/i915: Add an implementation for i915_gem_ww_ctx locking, v2.
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 26/66] drm/i915: Add an implementation for i915_gem_ww_ctx locking, v2 Chris Wilson
@ 2020-07-31 10:07   ` Thomas Hellström (Intel)
  0 siblings, 0 replies; 156+ messages in thread
From: Thomas Hellström (Intel) @ 2020-07-31 10:07 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 7/15/20 1:51 PM, Chris Wilson wrote:
> From: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
>
> i915_gem_ww_ctx is used to lock all gem bo's for pinning and memory
> eviction. We don't use it yet, but let's start adding the definition
> first.
>
> To use it, we have to pass a non-NULL ww to gem_object_lock, and must not
> unlock directly; the unlock is done in i915_gem_ww_ctx_fini.
>
> Changes since v1:
> - Change ww_ctx and obj order in locking functions (Jonas Lahtinen)
>
> Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Reviewed-by: Thomas Hellström <thomas.hellstrom@intel.com>

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread
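
For reference, the usage pattern implied by the commit message above is
roughly the following. That gem_object_lock takes a ww context and that the
unlock happens in i915_gem_ww_ctx_fini comes from the message itself; the
init/backoff helpers and do_work_under_lock() shown here are assumptions
about the rest of the API, not confirmed signatures:

    static int pin_with_ww(struct drm_i915_gem_object *obj)
    {
            struct i915_gem_ww_ctx ww;
            int err;

            i915_gem_ww_ctx_init(&ww, true);        /* assumed: bool == interruptible */
    retry:
            err = i915_gem_object_lock(obj, &ww);
            if (!err)
                    err = do_work_under_lock(obj);  /* hypothetical; may return -EDEADLK */
            if (err == -EDEADLK) {
                    err = i915_gem_ww_ctx_backoff(&ww);     /* assumed backoff helper */
                    if (!err)
                            goto retry;
            }
            i915_gem_ww_ctx_fini(&ww);      /* all locks taken with this ww are dropped here */
            return err;
    }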

* Re: [Intel-gfx] [PATCH 29/66] drm/i915: Hold wakeref for the duration of the vma GGTT binding
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 29/66] drm/i915: Hold wakeref for the duration of the vma GGTT binding Chris Wilson
@ 2020-07-31 10:09   ` Thomas Hellström (Intel)
  0 siblings, 0 replies; 156+ messages in thread
From: Thomas Hellström (Intel) @ 2020-07-31 10:09 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 7/15/20 1:51 PM, Chris Wilson wrote:
> Now that we have pushed the binding itself outside of the vm->mutex, we
> are clear of the potential wakeref inversions and can take the wakeref
> around the actual duration of the HW interaction.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Reviewed-by: Thomas Hellström <thomas.hellstrom@intel.com>


_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [Intel-gfx] [PATCH 31/66] drm/i915/gt: Acquire backing storage for the context
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 31/66] drm/i915/gt: Acquire backing storage for the context Chris Wilson
@ 2020-07-31 10:27   ` Thomas Hellström (Intel)
  0 siblings, 0 replies; 156+ messages in thread
From: Thomas Hellström (Intel) @ 2020-07-31 10:27 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 7/15/20 1:51 PM, Chris Wilson wrote:
> Pull the individual acquisition of the context objects (state, ring,
> timeline) under a common i915_acquire_ctx in preparation to allow the
> context to evict memory (or rather the i915_acquire_ctx on its behalf).
>
> The context objects maintain their semi-permanent status; that is they
> are assumed to be accessible by the HW at all times until we receive a
> signal from the HW that they are no longer in use. Currently, we
> generate such a signal ourselves from the context switch following the
> final use of the objects. This means that they will remain on the HW for
> an indefinite amount of time, and we retain the use of pinning to keep
> them in the same place. As they are pinned, they can be processed
> outside of the working set for the requests within the context. This is
> useful, as the context shares some global state, causing it to incur a
> global lock via its objects. By only requiring that lock when the context
> is activated, the locking is reduced in both frequency and duration
> (as compared to execbuf).
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Thomas Hellström <thomas.hellstrom@intel.com>

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [Intel-gfx] [PATCH 32/66] drm/i915/gt: Push the wait for the context to bound to the request
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 32/66] drm/i915/gt: Push the wait for the context to bound to the request Chris Wilson
@ 2020-07-31 10:48   ` Thomas Hellström (Intel)
  0 siblings, 0 replies; 156+ messages in thread
From: Thomas Hellström (Intel) @ 2020-07-31 10:48 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 7/15/20 1:51 PM, Chris Wilson wrote:
> Rather than synchronously waiting for the context to be bound within
> intel_context_pin(), we can track the pending completion of the bind
> fence and only submit requests along the context when signaled.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Reviewed-by: Thomas Hellström <thomas.hellstrom@intel.com>

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [Intel-gfx] [PATCH 33/66] drm/i915: Remove unused i915_gem_evict_vm()
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 33/66] drm/i915: Remove unused i915_gem_evict_vm() Chris Wilson
@ 2020-07-31 10:51   ` Thomas Hellström (Intel)
  0 siblings, 0 replies; 156+ messages in thread
From: Thomas Hellström (Intel) @ 2020-07-31 10:51 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 7/15/20 1:51 PM, Chris Wilson wrote:
> Obsolete, last user removed.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Reviewed-by: Thomas Hellström <thomas.hellstrom@intel.com>


_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [Intel-gfx] [PATCH 21/66] drm/i915/gem: Asynchronous GTT unbinding
  2020-07-15 11:51 ` [Intel-gfx] [PATCH 21/66] drm/i915/gem: Asynchronous GTT unbinding Chris Wilson
@ 2020-07-31 13:09   ` Thomas Hellström (Intel)
  0 siblings, 0 replies; 156+ messages in thread
From: Thomas Hellström (Intel) @ 2020-07-31 13:09 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx; +Cc: Matthew Auld


On 7/15/20 1:51 PM, Chris Wilson wrote:
> It is reasonably common for userspace (even modern drivers like iris) to
> reuse an active address for a new buffer. This would cause the
> application to stall under its mutex (originally struct_mutex) until the
> old batches were idle and it could synchronously remove the stale PTE.
> However, we can queue up a job that waits on the signals for the old
> nodes to complete and, upon those signals, removes the old nodes,
> replacing them with the new ones for the batch. This is still CPU driven, but in
> theory we can do the GTT patching from the GPU. The job itself has a
> completion signal allowing the execbuf to wait upon the rebinding, and
> also other observers to coordinate with the common VM activity.
>
> Letting userspace queue up more work lets it do more stuff without
> blocking other clients. In turn, we take care not to let it queue up too
> much concurrent work, creating a small number of queues for each context
> to limit the number of concurrent tasks.
>
> The implementation relies on only scheduling one unbind operation per
> vma as we use the unbound vma->node location to track the stale PTE.
>
> Closes: https://gitlab.freedesktop.org/drm/intel/issues/1402
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Matthew Auld <matthew.auld@intel.com>
> Cc: Andi Shyti <andi.shyti@intel.com>

Reviewed-by: Thomas Hellström <thomas.hellstrom@intel.com>


_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread
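
A hedged sketch of the "queue a job that waits for the old nodes" mechanism
described in the commit message above; struct gtt_rebind_work,
clear_stale_ptes() and insert_new_nodes() are illustrative names, not the
series' actual helpers:

    #include <linux/workqueue.h>
    #include <linux/dma-fence.h>

    struct gtt_rebind_work {
            struct work_struct base;
            struct dma_fence *old_active;   /* signals when the old nodes are idle */
            struct dma_fence *done;         /* signalled once the rebind is complete */
            struct i915_address_space *vm;
    };

    static void gtt_rebind_work_fn(struct work_struct *wrk)
    {
            struct gtt_rebind_work *w = container_of(wrk, typeof(*w), base);

            /* Wait for the last use of the stale PTEs before touching them. */
            dma_fence_wait(w->old_active, false);   /* errors ignored for brevity */

            clear_stale_ptes(w->vm);        /* remove the old nodes */
            insert_new_nodes(w->vm);        /* bind the new vma in their place */

            /* Execbuf (and other VM observers) can order against this signal. */
            dma_fence_signal(w->done);
    }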

* Re: [Intel-gfx] [PATCH 22/66] drm/i915/gem: Bind the fence async for execbuf
  2020-07-28 15:08     ` Chris Wilson
@ 2020-07-31 13:12       ` Thomas Hellström (Intel)
  0 siblings, 0 replies; 156+ messages in thread
From: Thomas Hellström (Intel) @ 2020-07-31 13:12 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 7/28/20 5:08 PM, Chris Wilson wrote:
> Quoting Thomas Hellström (Intel) (2020-07-27 19:19:19)
>> On 7/15/20 1:51 PM, Chris Wilson wrote:
>>> It is illegal to wait on another vma while holding the vm->mutex, as
>>> that easily leads to ABBA deadlocks (we wait on a second vma that waits
>>> on us to release the vm->mutex). So while the vm->mutex exists, move the
>>> waiting outside of the lock into the async binding pipeline.
>> Why is it we don't just move the fence binding to a separate loop after
>> unlocking the vm->mutex in eb_reserve_vm()?
> That is what is done. The work is called immediately when possible; the
> loop may only be deferred if what we need to unbind is still active.

OK, then

Reviewed-by: Thomas Hellström <thomas.hellstrom@intel.com>


> -Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [Intel-gfx] [PATCH 18/66] drm/i915: Always defer fenced work to the worker
  2020-07-31  9:03   ` Thomas Hellström (Intel)
@ 2020-07-31 13:28     ` Chris Wilson
  2020-07-31 13:31       ` Thomas Hellström (Intel)
  0 siblings, 1 reply; 156+ messages in thread
From: Chris Wilson @ 2020-07-31 13:28 UTC (permalink / raw)
  To: Thomas Hellström, intel-gfx

Quoting Thomas Hellström (Intel) (2020-07-31 10:03:59)
> 
> On 7/15/20 1:50 PM, Chris Wilson wrote:
> > Currently, if an error is raised we always call the cleanup locally
> > [and skip the main work callback]. However, some future users
> Could you add an example of those future users?

In the next patch (or two), the code needs to do the error cleanup from
process context. Since the error paths should be relatively infrequent,
and more often than not raised synchronously, I didn't see a reason to
build in a flag to say whether or not the release-on-error could be
performed immediately from the interrupt context.

The example in this series is that even if an error is thrown, we have
committed changes to the ppGTT layout (in particular marking PTE to be
evicted) and so we must complete unbinding the old pages from the ppGTT,
otherwise they may remain accessible.
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [Intel-gfx] [PATCH 18/66] drm/i915: Always defer fenced work to the worker
  2020-07-31 13:28     ` Chris Wilson
@ 2020-07-31 13:31       ` Thomas Hellström (Intel)
  0 siblings, 0 replies; 156+ messages in thread
From: Thomas Hellström (Intel) @ 2020-07-31 13:31 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 7/31/20 3:28 PM, Chris Wilson wrote:
> Quoting Thomas Hellström (Intel) (2020-07-31 10:03:59)
>> On 7/15/20 1:50 PM, Chris Wilson wrote:
>>> Currently, if an error is raised we always call the cleanup locally
>>> [and skip the main work callback]. However, some future users
>> Could you add an example of those future users?
> In the next patch (or two), the code needs to do the error cleanup from
> process context. Since the error paths should be relatively infrequent,
> and more often than not raised synchronously, I didn't see a reason to
> build in a flag to say whether or not the release-on-error could be
> performed immediately from the interrupt context.
>
> The example in this series is that even if an error is thrown, we have
> committed changes to the ppGTT layout (in particular marking PTE to be
> evicted) and so we must complete unbinding the old pages from the ppGTT,
> otherwise they may remain accessible.


Thanks.

>   I was mostly thinking if this or something similar could be added to the commit message to aid in understanding why the change is needed.

/Thomas




> -Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread
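
A minimal sketch of the behaviour discussed above: the fence callback only
records the error and queues the worker, and both the main callback and the
release then run from process context. The structure and callbacks here are
illustrative, not the series' actual i915_sw_fence plumbing:

    #include <linux/workqueue.h>

    struct fenced_work {
            struct work_struct work;
            int error;      /* set by the fence callback, possibly from irq context */
            void (*run)(struct fenced_work *fw);
            void (*release)(struct fenced_work *fw);
    };

    static void fenced_work_fn(struct work_struct *wrk)
    {
            struct fenced_work *fw = container_of(wrk, typeof(*fw), work);

            if (!READ_ONCE(fw->error))
                    fw->run(fw);

            /*
             * Always release from here: even on error we may have committed
             * changes (e.g. PTEs marked for eviction) that must be completed
             * from a context that can sleep.
             */
            fw->release(fw);
    }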

* Re: [Intel-gfx] [PATCH 11/66] drm/i915: Preallocate stashes for vma page-directories
  2020-07-30 12:28       ` Thomas Hellström (Intel)
@ 2020-08-04 14:08         ` Chris Wilson
  2020-08-04 16:14             ` Daniel Vetter
  0 siblings, 1 reply; 156+ messages in thread
From: Chris Wilson @ 2020-08-04 14:08 UTC (permalink / raw)
  To: Thomas Hellström, intel-gfx

Quoting Thomas Hellström (Intel) (2020-07-30 13:28:19)
> 
> On 7/28/20 4:50 PM, Chris Wilson wrote:
> >
> > It's in the user critical path (the shortest path to perform their
> > sequence of operations), but it's before the dma-fence itself. I say
> > that's a particularly nasty false claim that it is not on the critical
> > path, but being where it is circumvents the whole argument.
> >   
> 
> Couldn't the following situation happen?
> 
> 1. CS spawns userptr pinning work.
> 2. CS creates and publishes a DMA-fence that depends on that pinning work.

There's a break before 2 in that we do not publish a dma-fence on pending
userptr work. There's no async wait on the userptr: if the pages are not
available at the point of acquire, we hit -EAGAIN and take the
flush_workqueue path until we stop hitting -EAGAIN.

That is as painful as it sounds, and I claim that sitting and spinning in
a user path is no better in terms of critical path than having it inside
the dma-fence section. However, with this pretense we do not violate that
rule.
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread
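
An illustrative-only sketch of the retry behaviour described above: no fence
is published while the userptr pages are pending; the caller flushes the
pinning worker and retries. userptr_try_acquire() and userptr_wq are
hypothetical names used only for this sketch:

    #include <linux/workqueue.h>

    static int acquire_userptr_pages(struct drm_i915_gem_object *obj)
    {
            int err;

            do {
                    err = userptr_try_acquire(obj); /* -EAGAIN if pages not yet pinned */
                    if (err != -EAGAIN)
                            break;

                    /* Wait for the pinning worker rather than publish a fence. */
                    flush_workqueue(userptr_wq);
            } while (1);

            return err;
    }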

* Re: [Intel-gfx] [PATCH 11/66] drm/i915: Preallocate stashes for vma page-directories
  2020-08-04 14:08         ` Chris Wilson
@ 2020-08-04 16:14             ` Daniel Vetter
  0 siblings, 0 replies; 156+ messages in thread
From: Daniel Vetter @ 2020-08-04 16:14 UTC (permalink / raw)
  To: Chris Wilson, dri-devel, Christian König, Dave Airlie
  Cc: Thomas Hellström, intel-gfx

On Tue, Aug 4, 2020 at 4:08 PM Chris Wilson <chris@chris-wilson.co.uk> wrote:
>
> Quoting Thomas Hellström (Intel) (2020-07-30 13:28:19)
> >
> > On 7/28/20 4:50 PM, Chris Wilson wrote:
> > >
> > > It's in the user critical path (the shortest path to perform their
> > > sequence of operations), but it's before the dma-fence itself. I say
> > > that's a particularly nasty false claim that it is not on the critical
> > > path, but being where it is circumvents the whole argument.
> > >
> >
> > Couldn't the following situation happen?
> >
> > 1. CS spawns userptr pinning work.
> > 2. CS creates and publishes a DMA-fence that depends on that pinning work.
>
> There's a break before 2 in that we do not publish a dma-fence on pending
> userptr work. There's no async wait on the userptr: if the pages are not
> available at the point of acquire, we hit -EAGAIN and take the
> flush_workqueue path until we stop hitting -EAGAIN.
>
> That is as painful as it sounds, and I claim that sitting and spinning in
> a user path is no better in terms of critical path than having it inside
> the dma-fence section. However, with this pretense we do not violate that
> rule.

You trade a deadlock for a livelock, and the livelock is fully limited
to the offending process using (too much) userptr, and the user can
break out of it with ^C. That's a fairly significant difference. "Don't
overuse userptr" still applies.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [Intel-gfx] [PATCH 11/66] drm/i915: Preallocate stashes for vma page-directories
@ 2020-08-04 16:14             ` Daniel Vetter
  0 siblings, 0 replies; 156+ messages in thread
From: Daniel Vetter @ 2020-08-04 16:14 UTC (permalink / raw)
  To: Chris Wilson, dri-devel, Christian König, Dave Airlie; +Cc: intel-gfx

On Tue, Aug 4, 2020 at 4:08 PM Chris Wilson <chris@chris-wilson.co.uk> wrote:
>
> Quoting Thomas Hellström (Intel) (2020-07-30 13:28:19)
> >
> > On 7/28/20 4:50 PM, Chris Wilson wrote:
> > >
> > > It's in the user critical path (the shortest path to perform their
> > > sequence of operations), but it's before the dma-fence itself. I say
> > > that's a particularly nasty false claim that it is not on the critical
> > > path, but being where it is circumvents the whole argument.
> > >
> >
> > Couldn't the following situation happen?
> >
> > 1. CS spawns userptr pinning work.
> > 2. CS creates and publishes a DMA-fence that depends on that pinning work.
>
> There's a break before 2 in that we do not publish a dma-fence on pending
> userptr work. There's no async wait on the userptr: if the pages are not
> available at the point of acquire, we hit -EAGAIN and take the
> flush_workqueue path until we stop hitting -EAGAIN.
>
> That is as painful as it sounds, and I claim that sitting and spinning in
> a user path is no better in terms of critical path than having it inside
> the dma-fence section. However, with this pretense we do not violate that
> rule.

You trade a deadlock for a livelock, and the livelock is fully limited
to the offending process using (too much) userptr, and the user can
break out of it with ^C. That's a fairly significant difference. "Don't
overuse userptr" still applies.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [Intel-gfx] [PATCH 28/66] drm/i915/gem: Replace i915_gem_object.mm.mutex with reservation_ww_class
  2020-07-29 13:44           ` Thomas Hellström (Intel)
@ 2020-08-05 12:12             ` Chris Wilson
  0 siblings, 0 replies; 156+ messages in thread
From: Chris Wilson @ 2020-08-05 12:12 UTC (permalink / raw)
  To: Maarten Lankhorst, Thomas Hellström, Tvrtko Ursulin, intel-gfx

Quoting Thomas Hellström (Intel) (2020-07-29 14:44:41)
> 
> On 7/29/20 2:17 PM, Tvrtko Ursulin wrote:
> >
> > On 28/07/2020 12:17, Thomas Hellström (Intel) wrote:
> >> On 7/16/20 5:53 PM, Tvrtko Ursulin wrote:
> >>> On 15/07/2020 16:43, Maarten Lankhorst wrote:
> >>>> Op 15-07-2020 om 13:51 schreef Chris Wilson:
> >>>>> Our goal is to pull all memory reservations (next iteration
> >>>>> obj->ops->get_pages()) under a ww_mutex, and to align those 
> >>>>> reservations
> >>>>> with other drivers, i.e. control all such allocations with the
> >>>>> reservation_ww_class. Currently, this is under the purview of the
> >>>>> obj->mm.mutex, and while obj->mm remains an embedded struct we can
> >>>>> "simply" switch to using the reservation_ww_class 
> >>>>> obj->base.resv->lock
> >>>>>
> >>>>> The major consequence is the impact on the shrinker paths as the
> >>>>> reservation_ww_class is used to wrap allocations, and a ww_mutex does
> >>>>> not support subclassing so we cannot do our usual trick of knowing 
> >>>>> that
> >>>>> we never recurse inside the shrinker and instead have to finish the
> >>>>> reclaim with a trylock. This may result in us failing to release the
> >>>>> pages after having released the vma. This will have to do until a 
> >>>>> better
> >>>>> idea comes along.
> >>>>>
> >>>>> However, this step only converts the mutex over and continues to 
> >>>>> treat
> >>>>> everything as a single allocation and pinning the pages. With the
> >>>>> ww_mutex in place we can remove the temporary pinning, as we can then
> >>>>> reserve all storage en masse.
> >>>>>
> >>>>> One last thing to do: kill the implict page pinning for active vma.
> >>>>> This will require us to invalidate the vma->pages when the backing 
> >>>>> store
> >>>>> is removed (and we expect that while the vma is active, we mark the
> >>>>> backing store as active so that it cannot be removed while the HW is
> >>>>> busy.)
> >>>>>
> >>>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> >>>
> >>> [snip]
> >>>
> >>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c 
> >>>>> b/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c
> >>>>> index dc8f052a0ffe..4e928103a38f 100644
> >>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c
> >>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c
> >>>>> @@ -47,10 +47,7 @@ static bool unsafe_drop_pages(struct 
> >>>>> drm_i915_gem_object *obj,
> >>>>>       if (!(shrink & I915_SHRINK_BOUND))
> >>>>>           flags = I915_GEM_OBJECT_UNBIND_TEST;
> >>>>>   -    if (i915_gem_object_unbind(obj, flags) == 0)
> >>>>> -        __i915_gem_object_put_pages(obj);
> >>>>> -
> >>>>> -    return !i915_gem_object_has_pages(obj);
> >>>>> +    return i915_gem_object_unbind(obj, flags) == 0;
> >>>>>   }
> >>>>>     static void try_to_writeback(struct drm_i915_gem_object *obj,
> >>>>> @@ -199,14 +196,14 @@ i915_gem_shrink(struct drm_i915_private *i915,
> >>>>> spin_unlock_irqrestore(&i915->mm.obj_lock, flags);
> >>>>>   -            if (unsafe_drop_pages(obj, shrink)) {
> >>>>> -                /* May arrive from get_pages on another bo */
> >>>>> -                mutex_lock(&obj->mm.lock);
> >>>>> +            if (unsafe_drop_pages(obj, shrink) &&
> >>>>> +                i915_gem_object_trylock(obj)) {
> >>>
> >>>> Why trylock? Because of the nesting? In that case, still use ww ctx 
> >>>> if provided please
> >>>
> >>> By "if provided" you mean for code paths where we are calling the 
> >>> shrinker ourselves, as opposed to reclaim, like shmem_get_pages?
> >>>
> >>> That indeed sounds like the right thing to do, since all the 
> >>> get_pages from execbuf are in the reservation phase, collecting a 
> >>> list of GEM objects to lock, the ones to shrink sound like should be 
> >>> on that list.
> >>>
> >>>>> + __i915_gem_object_put_pages(obj);
> >>>>>                   if (!i915_gem_object_has_pages(obj)) {
> >>>>>                       try_to_writeback(obj, shrink);
> >>>>>                       count += obj->base.size >> PAGE_SHIFT;
> >>>>>                   }
> >>>>> -                mutex_unlock(&obj->mm.lock);
> >>>>> +                i915_gem_object_unlock(obj);
> >>>>>               }
> >>>>>                 scanned += obj->base.size >> PAGE_SHIFT;
> >>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_tiling.c 
> >>>>> b/drivers/gpu/drm/i915/gem/i915_gem_tiling.c
> >>>>> index ff72ee2fd9cd..ac12e1c20e66 100644
> >>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_tiling.c
> >>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_tiling.c
> >>>>> @@ -265,7 +265,6 @@ i915_gem_object_set_tiling(struct 
> >>>>> drm_i915_gem_object *obj,
> >>>>>        * pages to prevent them being swapped out and causing 
> >>>>> corruption
> >>>>>        * due to the change in swizzling.
> >>>>>        */
> >>>>> -    mutex_lock(&obj->mm.lock);
> >>>>>       if (i915_gem_object_has_pages(obj) &&
> >>>>>           obj->mm.madv == I915_MADV_WILLNEED &&
> >>>>>           i915->quirks & QUIRK_PIN_SWIZZLED_PAGES) {
> >>>>> @@ -280,7 +279,6 @@ i915_gem_object_set_tiling(struct 
> >>>>> drm_i915_gem_object *obj,
> >>>>>               obj->mm.quirked = true;
> >>>>>           }
> >>>>>       }
> >>>>> -    mutex_unlock(&obj->mm.lock);
> >>>>>         spin_lock(&obj->vma.lock);
> >>>>>       for_each_ggtt_vma(vma, obj) {
> >>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c 
> >>>>> b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
> >>>>> index e946032b13e4..80907c00c6fd 100644
> >>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
> >>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
> >>>>> @@ -129,8 +129,15 @@ userptr_mn_invalidate_range_start(struct 
> >>>>> mmu_notifier *_mn,
> >>>>>           ret = i915_gem_object_unbind(obj,
> >>>>>                            I915_GEM_OBJECT_UNBIND_ACTIVE |
> >>>>> I915_GEM_OBJECT_UNBIND_BARRIER);
> >>>>> -        if (ret == 0)
> >>>>> -            ret = __i915_gem_object_put_pages(obj);
> >>>>> +        if (ret == 0) {
> >>>>> +            /* ww_mutex and mmu_notifier is fs_reclaim tainted */
> >>>>> +            if (i915_gem_object_trylock(obj)) {
> >>>>> +                ret = __i915_gem_object_put_pages(obj);
> >>>>> +                i915_gem_object_unlock(obj);
> >>>>> +            } else {
> >>>>> +                ret = -EAGAIN;
> >>>>> +            }
> >>>>> +        }
> >>>>
> >>>> I'm not sure upstream will agree with this kind of API:
> >>>>
> >>>> 1. It will deadlock when RT tasks are used.
> >>>
> >>> It will or it can? Which part? It will break out of the loop if 
> >>> trylock fails.
> >>>
> >>>>
> >>>> 2. You start throwing -EAGAIN because you don't have the correct 
> >>>> ordering of locking, this needs fixing first.
> >>>
> >>> Is it about correct ordering of locks or something else? If memory 
> >>> allocation is allowed under dma_resv.lock, then the opposite order 
> >>> cannot be taken in any case.
> >>>
> >>> I've had a brief look at the amdgpu solution and maybe I 
> >>> misunderstood something, but it looks like a BKL approach with the 
> >>> device level notifier_lock. Their userptr notifier blocks on that 
> >>> one, not on dma_resv lock, but that also means their command 
> >>> submission (amdgpu_cs_submit) blocks on the same lock while 
> >>> obtaining backing store.
> >>
> >> If I read Christian right, it blocks on that lock only just before 
> >> command submission to validate that sequence number. If there is a 
> >> mismatch, it needs to rerun CS. I'm not sure how common userptr 
> >> buffers are, but if a device-wide mutex hurts too much, there are 
> >> perhaps more fine-grained solutions. (Like an rw semaphore, and 
> >> unlock before the fence wait in the notifier: CS which are unaffected 
> >> shouldn't need to wait...).
> >
> > Right, I wasn't familiar with the mmu notifier seqno business. Hm i915 
> > and amdgpu seem to use different mmu notifier hooks. Could we use the 
> > same and does it mean "locking order" is irrelevant to the sub-story 
> > of userptr?
> >
> >>
> >>>
> >>> So it looks like a big hammer approach not directly related to the 
> >>> story of dma_resv locking. Maybe we could do the same big hammer 
> >>> approach, although I am not sure how it is deadlock free.
> >>>
> >>> What happens for instance if someone submits an userptr batch which 
> >>> gets unmapped while amdgpu_cs_submit is holding the notifier_lock?
> >>
> >> My understanding is the unmapping operation blocks on the 
> >> notifier_lock in the mmu notifier?
> >
> > Yes, the key will be understanding the difference between
> > invalidate_range_start and invalidate+seqno and whether we can use the 
> > same from i915 to avoid the trylock.
> >
> > But given how ww mutex is tainted fs reclaim does that mean we would 
> > also need to ensure no memory allocations under ww mutex?
> 
> So with the fs_reclaim, we can do memory allocations under ww_mutex, but 
> need trylock in shrinkers and mmu_notifiers. However get_user_pages() 
> must not run under ww_mutex due to mmap_sem inversion.

There's no real reason for the shrinker to call into mmu-notifiers; it
certainly doesn't work for us where we handle the shrinking via another
shrinker. The debate is between adding a skip for page_maybe_dma_pinned to
the lru shrink and never adding pinned dma pages to the lru list
in the first place.
 
> But with the AMD solution I figure we don't really need the ww_mutex in 
> the mmu_notifier at all? That would possibly mean holding on to and 
> exposing to the GPU pages that aren't used anymore, but a shrinker would 
> notice the sequence number mismatch and consider the pages purgeable, 
> and on the next CS the stale pages would be released.

You need some lock to coordinate to give the guarantee required by the
mmu-notifier interface we use. HMM is likely the only way around that,
by moving the goalposts entirely.
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 156+ messages in thread
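
For context, the "invalidate + seqno" approach referred to above maps onto
the generic mmu_interval_notifier API roughly as follows. This is a
simplified sketch (the driver serialisation lock normally held around
set_seq/read_retry is omitted for brevity), not i915 or amdgpu code, and the
surrounding helper is hypothetical:

    #include <linux/mmu_notifier.h>

    static bool userptr_invalidate(struct mmu_interval_notifier *mni,
                                   const struct mmu_notifier_range *range,
                                   unsigned long cur_seq)
    {
            if (!mmu_notifier_range_blockable(range))
                    return false;

            /* Bump the seqno so a concurrent CS knows its pages are stale. */
            mmu_interval_set_seq(mni, cur_seq);
            return true;
    }

    /* On the CS side, just before publishing the fence: */
    static int check_userptr_pages(struct mmu_interval_notifier *mni)
    {
            unsigned long seq = mmu_interval_read_begin(mni);

            /* ... (re)pin the pages for this submission ... */

            if (mmu_interval_read_retry(mni, seq))
                    return -EAGAIN; /* pages were invalidated, restart the CS */
            return 0;
    }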

end of thread, other threads:[~2020-08-05 12:12 UTC | newest]

Thread overview: 156+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-15 11:50 [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Chris Wilson
2020-07-15 11:50 ` [Intel-gfx] [PATCH 02/66] drm/i915: Remove i915_request.lock requirement for execution callbacks Chris Wilson
2020-07-15 11:50 ` [Intel-gfx] [PATCH 03/66] drm/i915: Remove requirement for holding i915_request.lock for breadcrumbs Chris Wilson
2020-07-15 11:50 ` [Intel-gfx] [PATCH 04/66] drm/i915: Add a couple of missing i915_active_fini() Chris Wilson
2020-07-17 12:00   ` Tvrtko Ursulin
2020-07-21 12:23   ` Thomas Hellström (Intel)
2020-07-15 11:50 ` [Intel-gfx] [PATCH 05/66] drm/i915: Skip taking acquire mutex for no ref->active callback Chris Wilson
2020-07-17 12:04   ` Tvrtko Ursulin
2020-07-21 12:32   ` Thomas Hellström (Intel)
2020-07-15 11:50 ` [Intel-gfx] [PATCH 06/66] drm/i915: Export a preallocate variant of i915_active_acquire() Chris Wilson
2020-07-17 12:21   ` Tvrtko Ursulin
2020-07-17 12:45     ` Chris Wilson
2020-07-17 13:06       ` Tvrtko Ursulin
2020-07-21 15:33   ` Thomas Hellström (Intel)
2020-07-15 11:50 ` [Intel-gfx] [PATCH 07/66] drm/i915: Keep the most recently used active-fence upon discard Chris Wilson
2020-07-17 12:38   ` Tvrtko Ursulin
2020-07-28 14:22     ` Chris Wilson
2020-07-22  9:46   ` Thomas Hellström (Intel)
2020-07-15 11:50 ` [Intel-gfx] [PATCH 08/66] drm/i915: Make the stale cached active node available for any timeline Chris Wilson
2020-07-17 13:04   ` Tvrtko Ursulin
2020-07-28 14:28     ` Chris Wilson
2020-07-29 12:40       ` Tvrtko Ursulin
2020-07-29 13:42         ` Chris Wilson
2020-07-29 13:53           ` Chris Wilson
2020-07-29 14:22           ` Tvrtko Ursulin
2020-07-29 14:39             ` Chris Wilson
2020-07-29 14:52             ` Chris Wilson
2020-07-29 15:31               ` Tvrtko Ursulin
2020-07-22 11:19   ` Thomas Hellström (Intel)
2020-07-28 14:31     ` Chris Wilson
2020-07-15 11:50 ` [Intel-gfx] [PATCH 09/66] drm/i915: Provide a fastpath for waiting on vma bindings Chris Wilson
2020-07-17 13:23   ` Tvrtko Ursulin
2020-07-28 14:35     ` Chris Wilson
2020-07-29 12:43       ` Tvrtko Ursulin
2020-07-22 15:07   ` Thomas Hellström (Intel)
2020-07-15 11:50 ` [Intel-gfx] [PATCH 10/66] drm/i915: Soften the tasklet flush frequency before waits Chris Wilson
2020-07-16 14:23   ` Mika Kuoppala
2020-07-22 15:10   ` Thomas Hellström (Intel)
2020-07-15 11:50 ` [Intel-gfx] [PATCH 11/66] drm/i915: Preallocate stashes for vma page-directories Chris Wilson
2020-07-20 10:35   ` Matthew Auld
2020-07-23 14:33   ` Thomas Hellström (Intel)
2020-07-28 14:42     ` Chris Wilson
2020-07-31  7:43       ` Thomas Hellström (Intel)
2020-07-27  9:24   ` Thomas Hellström (Intel)
2020-07-28 14:50     ` Chris Wilson
2020-07-30 12:04       ` Thomas Hellström (Intel)
2020-07-30 12:28       ` Thomas Hellström (Intel)
2020-08-04 14:08         ` Chris Wilson
2020-08-04 16:14           ` Daniel Vetter
2020-08-04 16:14             ` Daniel Vetter
2020-07-15 11:50 ` [Intel-gfx] [PATCH 12/66] drm/i915: Switch to object allocations for page directories Chris Wilson
2020-07-20 10:34   ` Matthew Auld
2020-07-20 10:40     ` Chris Wilson
2020-07-15 11:50 ` [Intel-gfx] [PATCH 13/66] drm/i915/gem: Don't drop the timeline lock during execbuf Chris Wilson
2020-07-23 16:09   ` Thomas Hellström (Intel)
2020-07-28 14:46     ` Thomas Hellström (Intel)
2020-07-28 14:51     ` Chris Wilson
2020-07-31  8:09   ` Thomas Hellström (Intel)
2020-07-15 11:50 ` [Intel-gfx] [PATCH 14/66] drm/i915/gem: Rename execbuf.bind_link to unbound_link Chris Wilson
2020-07-31  8:11   ` Thomas Hellström (Intel)
2020-07-15 11:50 ` [Intel-gfx] [PATCH 15/66] drm/i915/gem: Break apart the early i915_vma_pin from execbuf object lookup Chris Wilson
2020-07-31  8:51   ` Thomas Hellström (Intel)
2020-07-15 11:50 ` [Intel-gfx] [PATCH 16/66] drm/i915/gem: Remove the call for no-evict i915_vma_pin Chris Wilson
2020-07-17 14:36   ` Tvrtko Ursulin
2020-07-28 15:04     ` Chris Wilson
2020-07-28  9:46   ` Thomas Hellström (Intel)
2020-07-28 15:05     ` Chris Wilson
2020-07-31  8:58       ` Thomas Hellström (Intel)
2020-07-15 11:50 ` [Intel-gfx] [PATCH 17/66] drm/i915: Add list_for_each_entry_safe_continue_reverse Chris Wilson
2020-07-31  8:59   ` Thomas Hellström (Intel)
2020-07-15 11:50 ` [Intel-gfx] [PATCH 18/66] drm/i915: Always defer fenced work to the worker Chris Wilson
2020-07-31  9:03   ` Thomas Hellström (Intel)
2020-07-31 13:28     ` Chris Wilson
2020-07-31 13:31       ` Thomas Hellström (Intel)
2020-07-15 11:51 ` [Intel-gfx] [PATCH 19/66] drm/i915/gem: Assign context id for async work Chris Wilson
2020-07-15 11:51 ` [Intel-gfx] [PATCH 20/66] drm/i915/gem: Separate the ww_mutex walker into its own list Chris Wilson
2020-07-31  9:23   ` Thomas Hellström (Intel)
2020-07-15 11:51 ` [Intel-gfx] [PATCH 21/66] drm/i915/gem: Asynchronous GTT unbinding Chris Wilson
2020-07-31 13:09   ` Thomas Hellström (Intel)
2020-07-15 11:51 ` [Intel-gfx] [PATCH 22/66] drm/i915/gem: Bind the fence async for execbuf Chris Wilson
2020-07-27 18:19   ` Thomas Hellström (Intel)
2020-07-28 15:08     ` Chris Wilson
2020-07-31 13:12       ` Thomas Hellström (Intel)
2020-07-15 11:51 ` [Intel-gfx] [PATCH 23/66] drm/i915/gem: Include cmdparser in common execbuf pinning Chris Wilson
2020-07-31  9:43   ` Thomas Hellström (Intel)
2020-07-15 11:51 ` [Intel-gfx] [PATCH 24/66] drm/i915/gem: Include secure batch " Chris Wilson
2020-07-31  9:47   ` Thomas Hellström (Intel)
2020-07-15 11:51 ` [Intel-gfx] [PATCH 25/66] drm/i915/gem: Reintroduce multiple passes for reloc processing Chris Wilson
2020-07-31 10:05   ` Thomas Hellström (Intel)
2020-07-15 11:51 ` [Intel-gfx] [PATCH 26/66] drm/i915: Add an implementation for i915_gem_ww_ctx locking, v2 Chris Wilson
2020-07-31 10:07   ` Thomas Hellström (Intel)
2020-07-15 11:51 ` [Intel-gfx] [PATCH 27/66] drm/i915/gem: Pull execbuf dma resv under a single critical section Chris Wilson
2020-07-27 18:08   ` Thomas Hellström (Intel)
2020-07-28 15:16     ` Chris Wilson
2020-07-30 12:57       ` Thomas Hellström (Intel)
2020-07-15 11:51 ` [Intel-gfx] [PATCH 28/66] drm/i915/gem: Replace i915_gem_object.mm.mutex with reservation_ww_class Chris Wilson
2020-07-15 15:43   ` Maarten Lankhorst
2020-07-16 15:53     ` Tvrtko Ursulin
2020-07-28 11:17       ` Thomas Hellström (Intel)
2020-07-29  7:56         ` Thomas Hellström (Intel)
2020-07-29 12:17         ` Tvrtko Ursulin
2020-07-29 13:44           ` Thomas Hellström (Intel)
2020-08-05 12:12             ` Chris Wilson
2020-07-15 11:51 ` [Intel-gfx] [PATCH 29/66] drm/i915: Hold wakeref for the duration of the vma GGTT binding Chris Wilson
2020-07-31 10:09   ` Thomas Hellström (Intel)
2020-07-15 11:51 ` [Intel-gfx] [PATCH 30/66] drm/i915: Specialise " Chris Wilson
2020-07-15 11:51 ` [Intel-gfx] [PATCH 31/66] drm/i915/gt: Acquire backing storage for the context Chris Wilson
2020-07-31 10:27   ` Thomas Hellström (Intel)
2020-07-15 11:51 ` [Intel-gfx] [PATCH 32/66] drm/i915/gt: Push the wait for the context to bound to the request Chris Wilson
2020-07-31 10:48   ` Thomas Hellström (Intel)
2020-07-15 11:51 ` [Intel-gfx] [PATCH 33/66] drm/i915: Remove unused i915_gem_evict_vm() Chris Wilson
2020-07-31 10:51   ` Thomas Hellström (Intel)
2020-07-15 11:51 ` [Intel-gfx] [PATCH 34/66] drm/i915/gt: Decouple completed requests on unwind Chris Wilson
2020-07-15 11:51 ` [Intel-gfx] [PATCH 35/66] drm/i915/gt: Check for a completed last request once Chris Wilson
2020-07-15 11:51 ` [Intel-gfx] [PATCH 36/66] drm/i915/gt: Replace direct submit with direct call to tasklet Chris Wilson
2020-07-15 11:51 ` [Intel-gfx] [PATCH 37/66] drm/i915/gt: Free stale request on destroying the virtual engine Chris Wilson
2020-07-15 11:51 ` [PATCH 38/66] drm/i915/gt: Use virtual_engine during execlists_dequeue Chris Wilson
2020-07-15 11:51   ` [Intel-gfx] " Chris Wilson
2020-07-15 11:51 ` [Intel-gfx] [PATCH 39/66] drm/i915/gt: Decouple inflight virtual engines Chris Wilson
2020-07-15 11:51 ` [Intel-gfx] [PATCH 40/66] drm/i915/gt: Defer schedule_out until after the next dequeue Chris Wilson
2020-07-15 11:51 ` [Intel-gfx] [PATCH 41/66] drm/i915/gt: Resubmit the virtual engine on schedule-out Chris Wilson
2020-07-15 11:51 ` [Intel-gfx] [PATCH 42/66] drm/i915/gt: Simplify virtual engine handling for execlists_hold() Chris Wilson
2020-07-15 11:51 ` [Intel-gfx] [PATCH 43/66] drm/i915/gt: ce->inflight updates are now serialised Chris Wilson
2020-07-15 11:51 ` [Intel-gfx] [PATCH 44/66] drm/i915/gt: Drop atomic for engine->fw_active tracking Chris Wilson
2020-07-15 11:51 ` [Intel-gfx] [PATCH 45/66] drm/i915/gt: Extract busy-stats for ring-scheduler Chris Wilson
2020-07-15 11:51 ` [Intel-gfx] [PATCH 46/66] drm/i915/gt: Convert stats.active to plain unsigned int Chris Wilson
2020-07-15 11:51 ` [Intel-gfx] [PATCH 47/66] drm/i915: Lift waiter/signaler iterators Chris Wilson
2020-07-15 11:51 ` [Intel-gfx] [PATCH 48/66] drm/i915: Strip out internal priorities Chris Wilson
2020-07-15 11:51 ` [Intel-gfx] [PATCH 49/66] drm/i915: Remove I915_USER_PRIORITY_SHIFT Chris Wilson
2020-07-15 11:51 ` [Intel-gfx] [PATCH 50/66] drm/i915: Replace engine->schedule() with a known request operation Chris Wilson
2020-07-15 11:51 ` [Intel-gfx] [PATCH 51/66] drm/i915/gt: Do not suspend bonded requests if one hangs Chris Wilson
2020-07-15 11:51 ` [Intel-gfx] [PATCH 52/66] drm/i915: Teach the i915_dependency to use a double-lock Chris Wilson
2020-07-15 11:51 ` [Intel-gfx] [PATCH 53/66] drm/i915: Restructure priority inheritance Chris Wilson
2020-07-15 11:51 ` [Intel-gfx] [PATCH 54/66] drm/i915/gt: Remove timeslice suppression Chris Wilson
2020-07-15 11:51 ` [Intel-gfx] [PATCH 55/66] drm/i915: Fair low-latency scheduling Chris Wilson
2020-07-15 15:33   ` [Intel-gfx] [PATCH] " Chris Wilson
2020-07-15 11:51 ` [Intel-gfx] [PATCH 56/66] drm/i915/gt: Specify a deadline for the heartbeat Chris Wilson
2020-07-15 11:51 ` [Intel-gfx] [PATCH 57/66] drm/i915: Replace the priority boosting for the display with a deadline Chris Wilson
2020-07-15 11:51 ` [Intel-gfx] [PATCH 58/66] drm/i915: Move saturated workload detection to the GT Chris Wilson
2020-07-15 11:51 ` [Intel-gfx] [PATCH 59/66] Restore "drm/i915: drop engine_pin/unpin_breadcrumbs_irq" Chris Wilson
2020-07-15 11:51 ` [Intel-gfx] [PATCH 60/66] drm/i915/gt: Couple tasklet scheduling for all CS interrupts Chris Wilson
2020-07-15 11:51 ` [Intel-gfx] [PATCH 61/66] drm/i915/gt: Support creation of 'internal' rings Chris Wilson
2020-07-15 11:51 ` [Intel-gfx] [PATCH 62/66] drm/i915/gt: Use client timeline address for seqno writes Chris Wilson
2020-07-15 11:51 ` [Intel-gfx] [PATCH 63/66] drm/i915/gt: Infrastructure for ring scheduling Chris Wilson
2020-07-15 11:51 ` [Intel-gfx] [PATCH 64/66] drm/i915/gt: Implement ring scheduler for gen6/7 Chris Wilson
2020-07-15 11:51 ` [Intel-gfx] [PATCH 65/66] drm/i915/gt: Enable ring scheduling " Chris Wilson
2020-07-15 11:51 ` [Intel-gfx] [PATCH 66/66] drm/i915/gem: Remove timeline nesting from snb relocs Chris Wilson
2020-07-15 13:27 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for series starting with [01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Patchwork
2020-07-15 13:28 ` [Intel-gfx] ✗ Fi.CI.SPARSE: " Patchwork
2020-07-15 14:20 ` [Intel-gfx] ✗ Fi.CI.BAT: failure " Patchwork
2020-07-15 15:41 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for series starting with [01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait (rev2) Patchwork
2020-07-15 15:42 ` [Intel-gfx] ✗ Fi.CI.SPARSE: " Patchwork
2020-07-15 16:03 ` [Intel-gfx] ✓ Fi.CI.BAT: success " Patchwork
2020-07-15 19:55 ` [Intel-gfx] ✗ Fi.CI.IGT: failure " Patchwork
2020-07-23 20:32 ` [Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait Dave Airlie
2020-07-27  9:35   ` Tvrtko Ursulin
