* [PATCH 1/3] drm/i915/execlists: Force write serialisation into context image vs execution
@ 2018-11-08  8:17 Chris Wilson
  2018-11-08  8:17 ` [PATCH 2/3] drm/i915: Return immediately if trylock fails for direct-reclaim Chris Wilson
                   ` (5 more replies)
  0 siblings, 6 replies; 19+ messages in thread
From: Chris Wilson @ 2018-11-08  8:17 UTC (permalink / raw)
  To: intel-gfx

Ensure that the writes into the context image are completed prior to the
register mmio to trigger execution. Although previously we were assured
by the SDM that all writes are flushed before an uncached memory
transaction (our mmio write to submit the context to HW for execution),
we have empirical evidence to believe that this is not actually the
case.

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=108656
References: https://bugs.freedesktop.org/show_bug.cgi?id=108315
References: https://bugs.freedesktop.org/show_bug.cgi?id=106887
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/i915/intel_lrc.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 22b57b8926fc..f7892ddb3f13 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -380,7 +380,8 @@ static u64 execlists_update_context(struct i915_request *rq)
 
 	reg_state[CTX_RING_TAIL+1] = intel_ring_set_tail(rq->ring, rq->tail);
 
-	/* True 32b PPGTT with dynamic page allocation: update PDP
+	/*
+	 * True 32b PPGTT with dynamic page allocation: update PDP
 	 * registers and point the unallocated PDPs to scratch page.
 	 * PML4 is allocated during ppgtt init, so this is not needed
 	 * in 48-bit mode.
@@ -388,6 +389,17 @@ static u64 execlists_update_context(struct i915_request *rq)
 	if (!i915_vm_is_48bit(&ppgtt->vm))
 		execlists_update_context_pdps(ppgtt, reg_state);
 
+	/*
+	 * Make sure the context image is complete before we submit it to HW.
+	 *
+	 * Ostensibly, writes (including the WCB) should be flushed prior to
+	 * an uncached write such as our mmio register access, the empirical
+	 * evidence (esp. on Braswell) suggests that the WC write into memory
+	 * may not be visible to the HW prior to the completion of the UC
+	 * register write and that we may begin execution from the context
+	 * before its image is complete leading to invalid PD chasing.
+	 */
+	wmb();
 	return ce->lrc_desc;
 }
 
-- 
2.19.1
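
A minimal sketch of the ordering the patch enforces: a write-combined
store into the context image, an explicit wmb(), then the uncached mmio
write that hands the context to the HW. The names (ctx_image,
RING_TAIL_IDX, elsp, submit_context) are illustrative placeholders, not
i915 symbols.

#include <linux/io.h>
#include <linux/kernel.h>
#include <linux/types.h>
#include <asm/barrier.h>

#define RING_TAIL_IDX 7	/* hypothetical dword index within the image */

static void submit_context(u32 *ctx_image, void __iomem *elsp,
			   u64 desc, u32 tail)
{
	/* Write-combined store into the context image backing store. */
	ctx_image[RING_TAIL_IDX] = tail;

	/*
	 * Flush the write-combining buffer before the uncached mmio
	 * writes below; without this the HW may begin execution from a
	 * stale context image.
	 */
	wmb();

	/* Uncached mmio writes that trigger execution from the image. */
	writel(upper_32_bits(desc), elsp);
	writel(lower_32_bits(desc), elsp);
}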


* [PATCH 2/3] drm/i915: Return immediately if trylock fails for direct-reclaim
  2018-11-08  8:17 [PATCH 1/3] drm/i915/execlists: Force write serialisation into context image vs execution Chris Wilson
@ 2018-11-08  8:17 ` Chris Wilson
  2018-11-08 16:23   ` Tvrtko Ursulin
  2018-11-08  8:17 ` [PATCH 3/3] drm/i915/userptr: Avoid struct_mutex recursion for mmu_invalidate_range_start Chris Wilson
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 19+ messages in thread
From: Chris Wilson @ 2018-11-08  8:17 UTC (permalink / raw)
  To: intel-gfx

Ignore trying to shrink from i915 if we fail to acquire the struct_mutex
in the shrinker while performing direct-reclaim. The trade-off being
(much) lower latency for non-i915 clients at an increased risk of being
unable to obtain a page from direct-reclaim without hitting the
oom-notifier. The proviso being that we still keep trying hard to
obtain the lock for oom so that we can reap under heavy memory pressure.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/i915/i915_gem_shrinker.c | 24 +++++++++++-------------
 1 file changed, 11 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_shrinker.c b/drivers/gpu/drm/i915/i915_gem_shrinker.c
index ea90d3a0d511..d461f458f4af 100644
--- a/drivers/gpu/drm/i915/i915_gem_shrinker.c
+++ b/drivers/gpu/drm/i915/i915_gem_shrinker.c
@@ -36,7 +36,9 @@
 #include "i915_drv.h"
 #include "i915_trace.h"
 
-static bool shrinker_lock(struct drm_i915_private *i915, bool *unlock)
+static bool shrinker_lock(struct drm_i915_private *i915,
+			  unsigned int flags,
+			  bool *unlock)
 {
 	switch (mutex_trylock_recursive(&i915->drm.struct_mutex)) {
 	case MUTEX_TRYLOCK_RECURSIVE:
@@ -45,15 +47,11 @@ static bool shrinker_lock(struct drm_i915_private *i915, bool *unlock)
 
 	case MUTEX_TRYLOCK_FAILED:
 		*unlock = false;
-		preempt_disable();
-		do {
-			cpu_relax();
-			if (mutex_trylock(&i915->drm.struct_mutex)) {
-				*unlock = true;
-				break;
-			}
-		} while (!need_resched());
-		preempt_enable();
+		if (flags & I915_SHRINK_ACTIVE) {
+			mutex_lock_nested(&i915->drm.struct_mutex,
+					  I915_MM_SHRINKER);
+			*unlock = true;
+		}
 		return *unlock;
 
 	case MUTEX_TRYLOCK_SUCCESS:
@@ -160,7 +158,7 @@ i915_gem_shrink(struct drm_i915_private *i915,
 	unsigned long scanned = 0;
 	bool unlock;
 
-	if (!shrinker_lock(i915, &unlock))
+	if (!shrinker_lock(i915, flags, &unlock))
 		return 0;
 
 	/*
@@ -357,7 +355,7 @@ i915_gem_shrinker_scan(struct shrinker *shrinker, struct shrink_control *sc)
 
 	sc->nr_scanned = 0;
 
-	if (!shrinker_lock(i915, &unlock))
+	if (!shrinker_lock(i915, 0, &unlock))
 		return SHRINK_STOP;
 
 	freed = i915_gem_shrink(i915,
@@ -397,7 +395,7 @@ shrinker_lock_uninterruptible(struct drm_i915_private *i915, bool *unlock,
 	do {
 		if (i915_gem_wait_for_idle(i915,
 					   0, MAX_SCHEDULE_TIMEOUT) == 0 &&
-		    shrinker_lock(i915, unlock))
+		    shrinker_lock(i915, 0, unlock))
 			break;
 
 		schedule_timeout_killable(1);
-- 
2.19.1
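
To illustrate the trade-off described in the commit message, here is a
minimal generic shrinker that follows the same policy: during direct
reclaim it gives up as soon as its lock is contended and lets the core
try other shrinkers, rather than spinning. Everything here (my_dev,
my_reap, my_scan) is hypothetical; in the actual patch only the
oom/I915_SHRINK_ACTIVE path still blocks on the lock.

#include <linux/kernel.h>
#include <linux/mutex.h>
#include <linux/shrinker.h>

struct my_dev {
	struct mutex lock;
	struct shrinker shrinker;
};

/* Placeholder reaper; a real driver would release freeable objects here. */
static unsigned long my_reap(struct my_dev *dev, unsigned long nr_to_scan)
{
	return 0;
}

static unsigned long my_scan(struct shrinker *s, struct shrink_control *sc)
{
	struct my_dev *dev = container_of(s, struct my_dev, shrinker);
	unsigned long freed;

	/* Direct reclaim: bail out immediately rather than stall on the lock. */
	if (!mutex_trylock(&dev->lock))
		return SHRINK_STOP;

	freed = my_reap(dev, sc->nr_to_scan);
	mutex_unlock(&dev->lock);

	return freed;
}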


* [PATCH 3/3] drm/i915/userptr: Avoid struct_mutex recursion for mmu_invalidate_range_start
  2018-11-08  8:17 [PATCH 1/3] drm/i915/execlists: Force write serialisation into context image vs execution Chris Wilson
  2018-11-08  8:17 ` [PATCH 2/3] drm/i915: Return immediately if trylock fails for direct-reclaim Chris Wilson
@ 2018-11-08  8:17 ` Chris Wilson
  2018-11-08 16:45   ` Tvrtko Ursulin
  2018-11-08  8:33 ` ✗ Fi.CI.CHECKPATCH: warning for series starting with [1/3] drm/i915/execlists: Force write serialisation into context image vs execution Patchwork
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 19+ messages in thread
From: Chris Wilson @ 2018-11-08  8:17 UTC (permalink / raw)
  To: intel-gfx

Since commit 93065ac753e4 ("mm, oom: distinguish blockable mode for mmu
notifiers") we have been able to report failure from
mmu_invalidate_range_start which allows us to use a trylock on the
struct_mutex to avoid potential recursion and report -EBUSY instead.
Furthermore, this allows us to pull the work into the main callback and
avoid the sleight-of-hand in using a workqueue to avoid lockdep.

However, not all paths to mmu_invalidate_range_start are prepared to
handle failure, so instead of reporting the recursion, deal with it by
propagating the failure upwards to the callers, who can decide for
themselves whether to handle or report it.

v2: Mark up the recursive lock behaviour and comment on the various weak
points.

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=108375
References: 93065ac753e4 ("mm, oom: distinguish blockable mode for mmu notifiers")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h         |   4 +-
 drivers/gpu/drm/i915/i915_gem.c         |  30 +++-
 drivers/gpu/drm/i915/i915_gem_object.h  |   7 +
 drivers/gpu/drm/i915/i915_gem_userptr.c | 221 +++++++++++-------------
 4 files changed, 136 insertions(+), 126 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 0c8438de3c1b..b95afa05976d 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -3076,8 +3076,8 @@ enum i915_mm_subclass { /* lockdep subclass for obj->mm.lock */
 	I915_MM_SHRINKER
 };
 
-void __i915_gem_object_put_pages(struct drm_i915_gem_object *obj,
-				 enum i915_mm_subclass subclass);
+int __i915_gem_object_put_pages(struct drm_i915_gem_object *obj,
+				enum i915_mm_subclass subclass);
 void __i915_gem_object_invalidate(struct drm_i915_gem_object *obj);
 
 enum i915_map_type {
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index dc120b5d8e05..53fb3605c4df 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2447,8 +2447,8 @@ __i915_gem_object_unset_pages(struct drm_i915_gem_object *obj)
 	struct sg_table *pages;
 
 	pages = fetch_and_zero(&obj->mm.pages);
-	if (!pages)
-		return NULL;
+	if (IS_ERR_OR_NULL(pages))
+		return pages;
 
 	spin_lock(&i915->mm.obj_lock);
 	list_del(&obj->mm.link);
@@ -2472,22 +2472,23 @@ __i915_gem_object_unset_pages(struct drm_i915_gem_object *obj)
 	return pages;
 }
 
-void __i915_gem_object_put_pages(struct drm_i915_gem_object *obj,
-				 enum i915_mm_subclass subclass)
+int __i915_gem_object_put_pages(struct drm_i915_gem_object *obj,
+				enum i915_mm_subclass subclass)
 {
 	struct sg_table *pages;
+	int ret;
 
 	if (i915_gem_object_has_pinned_pages(obj))
-		return;
+		return -EBUSY;
 
 	GEM_BUG_ON(obj->bind_count);
-	if (!i915_gem_object_has_pages(obj))
-		return;
 
 	/* May be called by shrinker from within get_pages() (on another bo) */
 	mutex_lock_nested(&obj->mm.lock, subclass);
-	if (unlikely(atomic_read(&obj->mm.pages_pin_count)))
+	if (unlikely(atomic_read(&obj->mm.pages_pin_count))) {
+		ret = -EBUSY;
 		goto unlock;
+	}
 
 	/*
 	 * ->put_pages might need to allocate memory for the bit17 swizzle
@@ -2495,11 +2496,24 @@ void __i915_gem_object_put_pages(struct drm_i915_gem_object *obj,
 	 * lists early.
 	 */
 	pages = __i915_gem_object_unset_pages(obj);
+
+	/*
+	 * XXX Temporary hijinx to avoid updating all backends to handle
+	 * NULL pages. In the future, when we have more asynchronous
+	 * get_pages backends we should be better able to handle the
+	 * cancellation of the async task in a more uniform manner.
+	 */
+	if (!pages && !i915_gem_object_needs_async_cancel(obj))
+		pages = ERR_PTR(-EINVAL);
+
 	if (!IS_ERR(pages))
 		obj->ops->put_pages(obj, pages);
 
+	ret = 0;
 unlock:
 	mutex_unlock(&obj->mm.lock);
+
+	return ret;
 }
 
 bool i915_sg_trim(struct sg_table *orig_st)
diff --git a/drivers/gpu/drm/i915/i915_gem_object.h b/drivers/gpu/drm/i915/i915_gem_object.h
index a6dd7c46de0d..49ce797173b5 100644
--- a/drivers/gpu/drm/i915/i915_gem_object.h
+++ b/drivers/gpu/drm/i915/i915_gem_object.h
@@ -56,6 +56,7 @@ struct drm_i915_gem_object_ops {
 #define I915_GEM_OBJECT_HAS_STRUCT_PAGE	BIT(0)
 #define I915_GEM_OBJECT_IS_SHRINKABLE	BIT(1)
 #define I915_GEM_OBJECT_IS_PROXY	BIT(2)
+#define I915_GEM_OBJECT_ASYNC_CANCEL	BIT(3)
 
 	/* Interface between the GEM object and its backing storage.
 	 * get_pages() is called once prior to the use of the associated set
@@ -386,6 +387,12 @@ i915_gem_object_is_proxy(const struct drm_i915_gem_object *obj)
 	return obj->ops->flags & I915_GEM_OBJECT_IS_PROXY;
 }
 
+static inline bool
+i915_gem_object_needs_async_cancel(const struct drm_i915_gem_object *obj)
+{
+	return obj->ops->flags & I915_GEM_OBJECT_ASYNC_CANCEL;
+}
+
 static inline bool
 i915_gem_object_is_active(const struct drm_i915_gem_object *obj)
 {
diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
index 2c9b284036d1..8b07fd44731f 100644
--- a/drivers/gpu/drm/i915/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
@@ -50,79 +50,84 @@ struct i915_mmu_notifier {
 	struct hlist_node node;
 	struct mmu_notifier mn;
 	struct rb_root_cached objects;
-	struct workqueue_struct *wq;
+	struct i915_mm_struct *mm;
 };
 
 struct i915_mmu_object {
 	struct i915_mmu_notifier *mn;
 	struct drm_i915_gem_object *obj;
 	struct interval_tree_node it;
-	struct list_head link;
-	struct work_struct work;
-	bool attached;
 };
 
-static void cancel_userptr(struct work_struct *work)
-{
-	struct i915_mmu_object *mo = container_of(work, typeof(*mo), work);
-	struct drm_i915_gem_object *obj = mo->obj;
-	struct work_struct *active;
-
-	/* Cancel any active worker and force us to re-evaluate gup */
-	mutex_lock(&obj->mm.lock);
-	active = fetch_and_zero(&obj->userptr.work);
-	mutex_unlock(&obj->mm.lock);
-	if (active)
-		goto out;
-
-	i915_gem_object_wait(obj, I915_WAIT_ALL, MAX_SCHEDULE_TIMEOUT, NULL);
-
-	mutex_lock(&obj->base.dev->struct_mutex);
-
-	/* We are inside a kthread context and can't be interrupted */
-	if (i915_gem_object_unbind(obj) == 0)
-		__i915_gem_object_put_pages(obj, I915_MM_NORMAL);
-	WARN_ONCE(i915_gem_object_has_pages(obj),
-		  "Failed to release pages: bind_count=%d, pages_pin_count=%d, pin_global=%d\n",
-		  obj->bind_count,
-		  atomic_read(&obj->mm.pages_pin_count),
-		  obj->pin_global);
-
-	mutex_unlock(&obj->base.dev->struct_mutex);
-
-out:
-	i915_gem_object_put(obj);
-}
-
 static void add_object(struct i915_mmu_object *mo)
 {
-	if (mo->attached)
+	if (!RB_EMPTY_NODE(&mo->it.rb))
 		return;
 
 	interval_tree_insert(&mo->it, &mo->mn->objects);
-	mo->attached = true;
 }
 
 static void del_object(struct i915_mmu_object *mo)
 {
-	if (!mo->attached)
+	if (RB_EMPTY_NODE(&mo->it.rb))
 		return;
 
 	interval_tree_remove(&mo->it, &mo->mn->objects);
-	mo->attached = false;
+	RB_CLEAR_NODE(&mo->it.rb);
+}
+
+static void
+__i915_gem_userptr_set_active(struct drm_i915_gem_object *obj, bool value)
+{
+	struct i915_mmu_object *mo = obj->userptr.mmu_object;
+
+	/*
+	 * During mm_invalidate_range we need to cancel any userptr that
+	 * overlaps the range being invalidated. Doing so requires the
+	 * struct_mutex, and that risks recursion. In order to cause
+	 * recursion, the user must alias the userptr address space with
+	 * a GTT mmapping (possible with a MAP_FIXED) - then when we have
+	 * to invalidate that mmaping, mm_invalidate_range is called with
+	 * the userptr address *and* the struct_mutex held.  To prevent that
+	 * we set a flag under the i915_mmu_notifier spinlock to indicate
+	 * whether this object is valid.
+	 */
+	if (!mo)
+		return;
+
+	spin_lock(&mo->mn->lock);
+	if (value)
+		add_object(mo);
+	else
+		del_object(mo);
+	spin_unlock(&mo->mn->lock);
+}
+
+static struct mutex *__i915_mutex_lock_recursive(struct mutex *m)
+{
+	switch (mutex_trylock_recursive(m)) {
+	default:
+	case MUTEX_TRYLOCK_FAILED:
+		mutex_lock_nested(m, I915_MM_SHRINKER);
+	case MUTEX_TRYLOCK_SUCCESS:
+		return m;
+
+	case MUTEX_TRYLOCK_RECURSIVE:
+		return ERR_PTR(-EEXIST);
+	}
 }
 
 static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
-						       struct mm_struct *mm,
-						       unsigned long start,
-						       unsigned long end,
-						       bool blockable)
+						      struct mm_struct *mm,
+						      unsigned long start,
+						      unsigned long end,
+						      bool blockable)
 {
 	struct i915_mmu_notifier *mn =
 		container_of(_mn, struct i915_mmu_notifier, mn);
-	struct i915_mmu_object *mo;
 	struct interval_tree_node *it;
-	LIST_HEAD(cancelled);
+	struct mutex *unlock = NULL;
+	int ret = 0;
 
 	if (RB_EMPTY_ROOT(&mn->objects.rb_root))
 		return 0;
@@ -133,11 +138,15 @@ static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
 	spin_lock(&mn->lock);
 	it = interval_tree_iter_first(&mn->objects, start, end);
 	while (it) {
+		struct drm_i915_gem_object *obj;
+
 		if (!blockable) {
-			spin_unlock(&mn->lock);
-			return -EAGAIN;
+			ret = -EAGAIN;
+			break;
 		}
-		/* The mmu_object is released late when destroying the
+
+		/*
+		 * The mmu_object is released late when destroying the
 		 * GEM object so it is entirely possible to gain a
 		 * reference on an object in the process of being freed
 		 * since our serialisation is via the spinlock and not
@@ -146,21 +155,39 @@ static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
 		 * use-after-free we only acquire a reference on the
 		 * object if it is not in the process of being destroyed.
 		 */
-		mo = container_of(it, struct i915_mmu_object, it);
-		if (kref_get_unless_zero(&mo->obj->base.refcount))
-			queue_work(mn->wq, &mo->work);
+		obj = container_of(it, struct i915_mmu_object, it)->obj;
+		if (!kref_get_unless_zero(&obj->base.refcount)) {
+			it = interval_tree_iter_next(it, start, end);
+			continue;
+		}
+		spin_unlock(&mn->lock);
 
-		list_add(&mo->link, &cancelled);
-		it = interval_tree_iter_next(it, start, end);
+		if (!unlock)
+			unlock = __i915_mutex_lock_recursive(&mn->mm->i915->drm.struct_mutex);
+		ret = i915_gem_object_unbind(obj);
+		if (ret == 0)
+			ret = __i915_gem_object_put_pages(obj, I915_MM_SHRINKER);
+		i915_gem_object_put(obj);
+		if (ret)
+			goto unlock;
+
+		spin_lock(&mn->lock);
+
+		/*
+		 * As we do not (yet) protect the mmu from concurrent insertion
+		 * over this range, there is no guarantee that this search will
+		 * terminate given a pathologic workload.
+		 */
+		it = interval_tree_iter_first(&mn->objects, start, end);
 	}
-	list_for_each_entry(mo, &cancelled, link)
-		del_object(mo);
 	spin_unlock(&mn->lock);
 
-	if (!list_empty(&cancelled))
-		flush_workqueue(mn->wq);
+unlock:
+	if (!IS_ERR_OR_NULL(unlock))
+		mutex_unlock(unlock);
+
+	return ret;
 
-	return 0;
 }
 
 static const struct mmu_notifier_ops i915_gem_userptr_notifier = {
@@ -168,7 +195,7 @@ static const struct mmu_notifier_ops i915_gem_userptr_notifier = {
 };
 
 static struct i915_mmu_notifier *
-i915_mmu_notifier_create(struct mm_struct *mm)
+i915_mmu_notifier_create(struct i915_mm_struct *mm)
 {
 	struct i915_mmu_notifier *mn;
 
@@ -179,13 +206,7 @@ i915_mmu_notifier_create(struct mm_struct *mm)
 	spin_lock_init(&mn->lock);
 	mn->mn.ops = &i915_gem_userptr_notifier;
 	mn->objects = RB_ROOT_CACHED;
-	mn->wq = alloc_workqueue("i915-userptr-release",
-				 WQ_UNBOUND | WQ_MEM_RECLAIM,
-				 0);
-	if (mn->wq == NULL) {
-		kfree(mn);
-		return ERR_PTR(-ENOMEM);
-	}
+	mn->mm = mm;
 
 	return mn;
 }
@@ -195,16 +216,14 @@ i915_gem_userptr_release__mmu_notifier(struct drm_i915_gem_object *obj)
 {
 	struct i915_mmu_object *mo;
 
-	mo = obj->userptr.mmu_object;
-	if (mo == NULL)
+	mo = fetch_and_zero(&obj->userptr.mmu_object);
+	if (!mo)
 		return;
 
 	spin_lock(&mo->mn->lock);
 	del_object(mo);
 	spin_unlock(&mo->mn->lock);
 	kfree(mo);
-
-	obj->userptr.mmu_object = NULL;
 }
 
 static struct i915_mmu_notifier *
@@ -217,7 +236,7 @@ i915_mmu_notifier_find(struct i915_mm_struct *mm)
 	if (mn)
 		return mn;
 
-	mn = i915_mmu_notifier_create(mm->mm);
+	mn = i915_mmu_notifier_create(mm);
 	if (IS_ERR(mn))
 		err = PTR_ERR(mn);
 
@@ -240,10 +259,8 @@ i915_mmu_notifier_find(struct i915_mm_struct *mm)
 	mutex_unlock(&mm->i915->mm_lock);
 	up_write(&mm->mm->mmap_sem);
 
-	if (mn && !IS_ERR(mn)) {
-		destroy_workqueue(mn->wq);
+	if (mn && !IS_ERR(mn))
 		kfree(mn);
-	}
 
 	return err ? ERR_PTR(err) : mm->mn;
 }
@@ -266,14 +283,14 @@ i915_gem_userptr_init__mmu_notifier(struct drm_i915_gem_object *obj,
 		return PTR_ERR(mn);
 
 	mo = kzalloc(sizeof(*mo), GFP_KERNEL);
-	if (mo == NULL)
+	if (!mo)
 		return -ENOMEM;
 
 	mo->mn = mn;
 	mo->obj = obj;
 	mo->it.start = obj->userptr.ptr;
 	mo->it.last = obj->userptr.ptr + obj->base.size - 1;
-	INIT_WORK(&mo->work, cancel_userptr);
+	RB_CLEAR_NODE(&mo->it.rb);
 
 	obj->userptr.mmu_object = mo;
 	return 0;
@@ -287,12 +304,16 @@ i915_mmu_notifier_free(struct i915_mmu_notifier *mn,
 		return;
 
 	mmu_notifier_unregister(&mn->mn, mm);
-	destroy_workqueue(mn->wq);
 	kfree(mn);
 }
 
 #else
 
+static void
+__i915_gem_userptr_set_active(struct drm_i915_gem_object *obj, bool value)
+{
+}
+
 static void
 i915_gem_userptr_release__mmu_notifier(struct drm_i915_gem_object *obj)
 {
@@ -461,42 +482,6 @@ __i915_gem_userptr_alloc_pages(struct drm_i915_gem_object *obj,
 	return st;
 }
 
-static int
-__i915_gem_userptr_set_active(struct drm_i915_gem_object *obj,
-			      bool value)
-{
-	int ret = 0;
-
-	/* During mm_invalidate_range we need to cancel any userptr that
-	 * overlaps the range being invalidated. Doing so requires the
-	 * struct_mutex, and that risks recursion. In order to cause
-	 * recursion, the user must alias the userptr address space with
-	 * a GTT mmapping (possible with a MAP_FIXED) - then when we have
-	 * to invalidate that mmaping, mm_invalidate_range is called with
-	 * the userptr address *and* the struct_mutex held.  To prevent that
-	 * we set a flag under the i915_mmu_notifier spinlock to indicate
-	 * whether this object is valid.
-	 */
-#if defined(CONFIG_MMU_NOTIFIER)
-	if (obj->userptr.mmu_object == NULL)
-		return 0;
-
-	spin_lock(&obj->userptr.mmu_object->mn->lock);
-	/* In order to serialise get_pages with an outstanding
-	 * cancel_userptr, we must drop the struct_mutex and try again.
-	 */
-	if (!value)
-		del_object(obj->userptr.mmu_object);
-	else if (!work_pending(&obj->userptr.mmu_object->work))
-		add_object(obj->userptr.mmu_object);
-	else
-		ret = -EAGAIN;
-	spin_unlock(&obj->userptr.mmu_object->mn->lock);
-#endif
-
-	return ret;
-}
-
 static void
 __i915_gem_userptr_get_pages_worker(struct work_struct *_work)
 {
@@ -682,8 +667,11 @@ i915_gem_userptr_put_pages(struct drm_i915_gem_object *obj,
 	struct sgt_iter sgt_iter;
 	struct page *page;
 
-	BUG_ON(obj->userptr.work != NULL);
+	/* Cancel any inflight work and force them to restart their gup */
+	obj->userptr.work = NULL;
 	__i915_gem_userptr_set_active(obj, false);
+	if (!pages)
+		return;
 
 	if (obj->mm.madv != I915_MADV_WILLNEED)
 		obj->mm.dirty = false;
@@ -721,7 +709,8 @@ i915_gem_userptr_dmabuf_export(struct drm_i915_gem_object *obj)
 
 static const struct drm_i915_gem_object_ops i915_gem_userptr_ops = {
 	.flags = I915_GEM_OBJECT_HAS_STRUCT_PAGE |
-		 I915_GEM_OBJECT_IS_SHRINKABLE,
+		 I915_GEM_OBJECT_IS_SHRINKABLE |
+		 I915_GEM_OBJECT_ASYNC_CANCEL,
 	.get_pages = i915_gem_userptr_get_pages,
 	.put_pages = i915_gem_userptr_put_pages,
 	.dmabuf_export = i915_gem_userptr_dmabuf_export,
-- 
2.19.1
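
A minimal sketch of the non-blocking pattern the patch relies on,
assuming the 4.20-era invalidate_range_start signature shown in the
diff; big_lock, cancel_range and my_invalidate_range_start are
hypothetical stand-ins, not i915 code:

#include <linux/errno.h>
#include <linux/mmu_notifier.h>
#include <linux/mutex.h>

static DEFINE_MUTEX(big_lock);	/* stand-in for i915's struct_mutex */

/* Placeholder: tear down whatever overlaps [start, end). */
static int cancel_range(unsigned long start, unsigned long end)
{
	return 0;
}

static int my_invalidate_range_start(struct mmu_notifier *mn,
				     struct mm_struct *mm,
				     unsigned long start,
				     unsigned long end,
				     bool blockable)
{
	int ret;

	if (!mutex_trylock(&big_lock)) {
		if (!blockable)
			return -EAGAIN;	/* report failure instead of sleeping */
		mutex_lock(&big_lock);
	}

	ret = cancel_range(start, end);	/* errors propagate to the mm core */
	mutex_unlock(&big_lock);

	return ret;
}

static const struct mmu_notifier_ops my_notifier_ops = {
	.invalidate_range_start = my_invalidate_range_start,
};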


* ✗ Fi.CI.CHECKPATCH: warning for series starting with [1/3] drm/i915/execlists: Force write serialisation into context image vs execution
  2018-11-08  8:17 [PATCH 1/3] drm/i915/execlists: Force write serialisation into context image vs execution Chris Wilson
  2018-11-08  8:17 ` [PATCH 2/3] drm/i915: Return immediately if trylock fails for direct-reclaim Chris Wilson
  2018-11-08  8:17 ` [PATCH 3/3] drm/i915/userptr: Avoid struct_mutex recursion for mmu_invalidate_range_start Chris Wilson
@ 2018-11-08  8:33 ` Patchwork
  2018-11-08  9:07 ` ✓ Fi.CI.BAT: success " Patchwork
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 19+ messages in thread
From: Patchwork @ 2018-11-08  8:33 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: series starting with [1/3] drm/i915/execlists: Force write serialisation into context image vs execution
URL   : https://patchwork.freedesktop.org/series/52209/
State : warning

== Summary ==

$ dim checkpatch origin/drm-tip
719262fc8e08 drm/i915/execlists: Force write serialisation into context image vs execution
60eed560af03 drm/i915: Return immediately if trylock fails for direct-reclaim
227c8bdf97a3 drm/i915/userptr: Avoid struct_mutex recursion for mmu_invalidate_range_start
-:23: WARNING:COMMIT_LOG_LONG_LINE: Possible unwrapped commit description (prefer a maximum 75 chars per line)
#23: 
References: 93065ac753e4 ("mm, oom: distinguish blockable mode for mmu notifiers")

-:23: ERROR:GIT_COMMIT_ID: Please use git commit description style 'commit <12+ chars of sha1> ("<title line>")' - ie: 'commit 93065ac753e4 ("mm, oom: distinguish blockable mode for mmu notifiers")'
#23: 
References: 93065ac753e4 ("mm, oom: distinguish blockable mode for mmu notifiers")

-:240: ERROR:LOCKING: recursive locking is bad, do not use this ever.
#240: FILE: drivers/gpu/drm/i915/i915_gem_userptr.c:108:
+	switch (mutex_trylock_recursive(m)) {

total: 2 errors, 1 warnings, 0 checks, 444 lines checked


* ✓ Fi.CI.BAT: success for series starting with [1/3] drm/i915/execlists: Force write serialisation into context image vs execution
  2018-11-08  8:17 [PATCH 1/3] drm/i915/execlists: Force write serialisation into context image vs execution Chris Wilson
                   ` (2 preceding siblings ...)
  2018-11-08  8:33 ` ✗ Fi.CI.CHECKPATCH: warning for series starting with [1/3] drm/i915/execlists: Force write serialisation into context image vs execution Patchwork
@ 2018-11-08  9:07 ` Patchwork
  2018-11-08 12:00 ` [PATCH 1/3] " Mika Kuoppala
  2018-11-08 19:51 ` ✓ Fi.CI.IGT: success for series starting with [1/3] " Patchwork
  5 siblings, 0 replies; 19+ messages in thread
From: Patchwork @ 2018-11-08  9:07 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: series starting with [1/3] drm/i915/execlists: Force write serialisation into context image vs execution
URL   : https://patchwork.freedesktop.org/series/52209/
State : success

== Summary ==

= CI Bug Log - changes from CI_DRM_5103 -> Patchwork_10769 =

== Summary - SUCCESS ==

  No regressions found.

  External URL: https://patchwork.freedesktop.org/api/1.0/series/52209/revisions/1/mbox/

== Known issues ==

  Here are the changes found in Patchwork_10769 that come from known issues:

  === IGT changes ===

    ==== Issues hit ====

    igt@kms_frontbuffer_tracking@basic:
      fi-byt-clapper:     PASS -> FAIL (fdo#103167)

    igt@kms_pipe_crc_basic@suspend-read-crc-pipe-a:
      fi-kbl-7567u:       PASS -> DMESG-WARN (fdo#105079, fdo#105602) +2

    igt@kms_pipe_crc_basic@suspend-read-crc-pipe-b:
      fi-cfl-8109u:       PASS -> INCOMPLETE (fdo#106070, fdo#108126)
      fi-icl-u:           PASS -> INCOMPLETE (fdo#107713)
      fi-blb-e6850:       PASS -> INCOMPLETE (fdo#107718)

    
    ==== Possible fixes ====

    igt@kms_flip@basic-flip-vs-modeset:
      fi-hsw-4770r:       DMESG-WARN (fdo#105602) -> PASS

    igt@kms_pipe_crc_basic@read-crc-pipe-b:
      fi-byt-clapper:     FAIL (fdo#107362) -> PASS

    igt@kms_pipe_crc_basic@suspend-read-crc-pipe-b:
      fi-byt-clapper:     FAIL (fdo#103191, fdo#107362) -> PASS

    igt@prime_vgem@basic-fence-flip:
      fi-ilk-650:         FAIL (fdo#104008) -> PASS

    
  fdo#103167 https://bugs.freedesktop.org/show_bug.cgi?id=103167
  fdo#103191 https://bugs.freedesktop.org/show_bug.cgi?id=103191
  fdo#104008 https://bugs.freedesktop.org/show_bug.cgi?id=104008
  fdo#105079 https://bugs.freedesktop.org/show_bug.cgi?id=105079
  fdo#105602 https://bugs.freedesktop.org/show_bug.cgi?id=105602
  fdo#106070 https://bugs.freedesktop.org/show_bug.cgi?id=106070
  fdo#107362 https://bugs.freedesktop.org/show_bug.cgi?id=107362
  fdo#107713 https://bugs.freedesktop.org/show_bug.cgi?id=107713
  fdo#107718 https://bugs.freedesktop.org/show_bug.cgi?id=107718
  fdo#108126 https://bugs.freedesktop.org/show_bug.cgi?id=108126


== Participating hosts (54 -> 46) ==

  Missing    (8): fi-kbl-soraka fi-ilk-m540 fi-hsw-4200u fi-byt-squawks fi-bsw-cyan fi-snb-2520m fi-ctg-p8600 fi-kbl-7560u 


== Build changes ==

    * Linux: CI_DRM_5103 -> Patchwork_10769

  CI_DRM_5103: 23c1138030ad65402f698ab0b356e2f55722bc77 @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_4712: a3ede1b535ac8137f6949c468edd7054453d5dae @ git://anongit.freedesktop.org/xorg/app/intel-gpu-tools
  Patchwork_10769: 227c8bdf97a34415367225dfc484370af453c477 @ git://anongit.freedesktop.org/gfx-ci/linux


== Linux commits ==

227c8bdf97a3 drm/i915/userptr: Avoid struct_mutex recursion for mmu_invalidate_range_start
60eed560af03 drm/i915: Return immediately if trylock fails for direct-reclaim
719262fc8e08 drm/i915/execlists: Force write serialisation into context image vs execution

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_10769/issues.html

* Re: [PATCH 1/3] drm/i915/execlists: Force write serialisation into context image vs execution
  2018-11-08  8:17 [PATCH 1/3] drm/i915/execlists: Force write serialisation into context image vs execution Chris Wilson
                   ` (3 preceding siblings ...)
  2018-11-08  9:07 ` ✓ Fi.CI.BAT: success " Patchwork
@ 2018-11-08 12:00 ` Mika Kuoppala
  2018-11-08 12:11   ` Chris Wilson
  2018-11-08 19:51 ` ✓ Fi.CI.IGT: success for series starting with [1/3] " Patchwork
  5 siblings, 1 reply; 19+ messages in thread
From: Mika Kuoppala @ 2018-11-08 12:00 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

Chris Wilson <chris@chris-wilson.co.uk> writes:

> Ensure that the writes into the context image are completed prior to the
> register mmio to trigger execution. Although previously we were assured
> by the SDM that all writes are flushed before an uncached memory
> transaction (our mmio write to submit the context to HW for execution),
> we have empirical evidence to believe that this is not actually the
> case.
>
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=108656

I would have marked this also as References.

> References: https://bugs.freedesktop.org/show_bug.cgi?id=108315
> References: https://bugs.freedesktop.org/show_bug.cgi?id=106887
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> ---
>  drivers/gpu/drm/i915/intel_lrc.c | 14 +++++++++++++-
>  1 file changed, 13 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 22b57b8926fc..f7892ddb3f13 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -380,7 +380,8 @@ static u64 execlists_update_context(struct i915_request *rq)
>  
>  	reg_state[CTX_RING_TAIL+1] = intel_ring_set_tail(rq->ring, rq->tail);
>  
> -	/* True 32b PPGTT with dynamic page allocation: update PDP
> +	/*
> +	 * True 32b PPGTT with dynamic page allocation: update PDP
>  	 * registers and point the unallocated PDPs to scratch page.
>  	 * PML4 is allocated during ppgtt init, so this is not needed
>  	 * in 48-bit mode.
> @@ -388,6 +389,17 @@ static u64 execlists_update_context(struct i915_request *rq)
>  	if (!i915_vm_is_48bit(&ppgtt->vm))
>  		execlists_update_context_pdps(ppgtt, reg_state);
>  
> +	/*
> +	 * Make sure the context image is complete before we submit it to HW.
> +	 *
> +	 * Ostensibly, writes (including the WCB) should be flushed prior to
> +	 * an uncached write such as our mmio register access, the empirical
> +	 * evidence (esp. on Braswell) suggests that the WC write into memory
> +	 * may not be visible to the HW prior to the completion of the UC
> +	 * register write and that we may begin execution from the context
> +	 * before its image is complete leading to invalid PD chasing.
> +	 */
> +	wmb();

Let's put it into use and gather more evidence.

Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>

>  	return ce->lrc_desc;
>  }
>  
> -- 
> 2.19.1

* Re: [PATCH 1/3] drm/i915/execlists: Force write serialisation into context image vs execution
  2018-11-08 12:00 ` [PATCH 1/3] " Mika Kuoppala
@ 2018-11-08 12:11   ` Chris Wilson
  2018-11-08 12:13     ` Mika Kuoppala
  2018-11-08 13:38     ` Chris Wilson
  0 siblings, 2 replies; 19+ messages in thread
From: Chris Wilson @ 2018-11-08 12:11 UTC (permalink / raw)
  To: Mika Kuoppala, intel-gfx

Quoting Mika Kuoppala (2018-11-08 12:00:39)
> Chris Wilson <chris@chris-wilson.co.uk> writes:
> 
> > Ensure that the writes into the context image are completed prior to the
> > register mmio to trigger execution. Although previously we were assured
> > by the SDM that all writes are flushed before an uncached memory
> > transaction (our mmio write to submit the context to HW for execution),
> > we have empirical evidence to believe that this is not actually the
> > case.
> >
> > Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=108656
> 
> I would have marked this also as References.

I'm confident in my local results indicating some success here, albeit
in not exactly the same quick death, but still out-of-bounds execution.
 
> > References: https://bugs.freedesktop.org/show_bug.cgi?id=108315
> > References: https://bugs.freedesktop.org/show_bug.cgi?id=106887
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > ---
> >  drivers/gpu/drm/i915/intel_lrc.c | 14 +++++++++++++-
> >  1 file changed, 13 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> > index 22b57b8926fc..f7892ddb3f13 100644
> > --- a/drivers/gpu/drm/i915/intel_lrc.c
> > +++ b/drivers/gpu/drm/i915/intel_lrc.c
> > @@ -380,7 +380,8 @@ static u64 execlists_update_context(struct i915_request *rq)
> >  
> >       reg_state[CTX_RING_TAIL+1] = intel_ring_set_tail(rq->ring, rq->tail);
> >  
> > -     /* True 32b PPGTT with dynamic page allocation: update PDP
> > +     /*
> > +      * True 32b PPGTT with dynamic page allocation: update PDP
> >        * registers and point the unallocated PDPs to scratch page.
> >        * PML4 is allocated during ppgtt init, so this is not needed
> >        * in 48-bit mode.
> > @@ -388,6 +389,17 @@ static u64 execlists_update_context(struct i915_request *rq)
> >       if (!i915_vm_is_48bit(&ppgtt->vm))
> >               execlists_update_context_pdps(ppgtt, reg_state);
> >  
> > +     /*
> > +      * Make sure the context image is complete before we submit it to HW.
> > +      *
> > +      * Ostensibly, writes (including the WCB) should be flushed prior to
> > +      * an uncached write such as our mmio register access, the empirical
> > +      * evidence (esp. on Braswell) suggests that the WC write into memory
> > +      * may not be visible to the HW prior to the completion of the UC
> > +      * register write and that we may begin execution from the context
> > +      * before its image is complete leading to invalid PD chasing.
> > +      */
> > +     wmb();
> 
> Let's put it into use and gather more evidence.
> 
> Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>

Aye. Sure about r-b? I'm quite happy to take an a-b since we're just
postulating to gather evidence.
-Chris

* Re: [PATCH 1/3] drm/i915/execlists: Force write serialisation into context image vs execution
  2018-11-08 12:11   ` Chris Wilson
@ 2018-11-08 12:13     ` Mika Kuoppala
  2018-11-08 12:26       ` Chris Wilson
  2018-11-08 13:38     ` Chris Wilson
  1 sibling, 1 reply; 19+ messages in thread
From: Mika Kuoppala @ 2018-11-08 12:13 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

Chris Wilson <chris@chris-wilson.co.uk> writes:

> Quoting Mika Kuoppala (2018-11-08 12:00:39)
>> Chris Wilson <chris@chris-wilson.co.uk> writes:
>> 
>> > Ensure that the writes into the context image are completed prior to the
>> > register mmio to trigger execution. Although previously we were assured
>> > by the SDM that all writes are flushed before an uncached memory
>> > transaction (our mmio write to submit the context to HW for execution),
>> > we have empirical evidence to believe that this is not actually the
>> > case.
>> >
>> > Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=108656
>> 
>> I would have marked this also as References.
>
> I'm confident in my local results indicating some success here, albeit
> in not exactly the same quick death, but still out-of-bounds execution.
>  
>> > References: https://bugs.freedesktop.org/show_bug.cgi?id=108315
>> > References: https://bugs.freedesktop.org/show_bug.cgi?id=106887
>> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>> > Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
>> > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>> > ---
>> >  drivers/gpu/drm/i915/intel_lrc.c | 14 +++++++++++++-
>> >  1 file changed, 13 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
>> > index 22b57b8926fc..f7892ddb3f13 100644
>> > --- a/drivers/gpu/drm/i915/intel_lrc.c
>> > +++ b/drivers/gpu/drm/i915/intel_lrc.c
>> > @@ -380,7 +380,8 @@ static u64 execlists_update_context(struct i915_request *rq)
>> >  
>> >       reg_state[CTX_RING_TAIL+1] = intel_ring_set_tail(rq->ring, rq->tail);
>> >  
>> > -     /* True 32b PPGTT with dynamic page allocation: update PDP
>> > +     /*
>> > +      * True 32b PPGTT with dynamic page allocation: update PDP
>> >        * registers and point the unallocated PDPs to scratch page.
>> >        * PML4 is allocated during ppgtt init, so this is not needed
>> >        * in 48-bit mode.
>> > @@ -388,6 +389,17 @@ static u64 execlists_update_context(struct i915_request *rq)
>> >       if (!i915_vm_is_48bit(&ppgtt->vm))
>> >               execlists_update_context_pdps(ppgtt, reg_state);
>> >  
>> > +     /*
>> > +      * Make sure the context image is complete before we submit it to HW.
>> > +      *
>> > +      * Ostensibly, writes (including the WCB) should be flushed prior to
>> > +      * an uncached write such as our mmio register access, the empirical
>> > +      * evidence (esp. on Braswell) suggests that the WC write into memory
>> > +      * may not be visible to the HW prior to the completion of the UC
>> > +      * register write and that we may begin execution from the context
>> > +      * before its image is complete leading to invalid PD chasing.
>> > +      */
>> > +     wmb();
>> 
>> Let's put it into use and gather more evidence.
>> 
>> Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
>
> Aye. Sure about r-b? I'm quite happy to take an a-b since we're just
> postulating to gather evidence.

Agreed that a-b is more accurate. r-b would indicate I know what the
heck is going on there under the hood.
-Mika

* Re: [PATCH 1/3] drm/i915/execlists: Force write serialisation into context image vs execution
  2018-11-08 12:13     ` Mika Kuoppala
@ 2018-11-08 12:26       ` Chris Wilson
  0 siblings, 0 replies; 19+ messages in thread
From: Chris Wilson @ 2018-11-08 12:26 UTC (permalink / raw)
  To: Mika Kuoppala, intel-gfx

Quoting Mika Kuoppala (2018-11-08 12:13:42)
> Chris Wilson <chris@chris-wilson.co.uk> writes:
> 
> > Quoting Mika Kuoppala (2018-11-08 12:00:39)
> >> Chris Wilson <chris@chris-wilson.co.uk> writes:
> >> > +     /*
> >> > +      * Make sure the context image is complete before we submit it to HW.
> >> > +      *
> >> > +      * Ostensibly, writes (including the WCB) should be flushed prior to
> >> > +      * an uncached write such as our mmio register access, the empirical
> >> > +      * evidence (esp. on Braswell) suggests that the WC write into memory
> >> > +      * may not be visible to the HW prior to the completion of the UC
> >> > +      * register write and that we may begin execution from the context
> >> > +      * before its image is complete leading to invalid PD chasing.
> >> > +      */
> >> > +     wmb();
> >> 
> >> Let's put it into use and gather more evidence.
> >> 
> >> Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> >
> > Aye. Sure about r-b? I'm quite happy to take an a-b since we're just
> > postulating to gather evidence.
> 
> Agreed that a-b is more accurate. r-b would indicate I know what the
> heck is going on there under the hood.

And pushed (this patch) to see if the next few months do quieten down.
-Chris

* Re: [PATCH 1/3] drm/i915/execlists: Force write serialisation into context image vs execution
  2018-11-08 12:11   ` Chris Wilson
  2018-11-08 12:13     ` Mika Kuoppala
@ 2018-11-08 13:38     ` Chris Wilson
  1 sibling, 0 replies; 19+ messages in thread
From: Chris Wilson @ 2018-11-08 13:38 UTC (permalink / raw)
  To: Mika Kuoppala, intel-gfx

Quoting Chris Wilson (2018-11-08 12:11:05)
> Quoting Mika Kuoppala (2018-11-08 12:00:39)
> > Chris Wilson <chris@chris-wilson.co.uk> writes:
> > 
> > > Ensure that the writes into the context image are completed prior to the
> > > register mmio to trigger execution. Although previously we were assured
> > > by the SDM that all writes are flushed before an uncached memory
> > > transaction (our mmio write to submit the context to HW for execution),
> > > we have empirical evidence to believe that this is not actually the
> > > case.
> > >
> > > Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=108656
> > 
> > I would have marked this also as References.
> 
> I'm confident in my local results indicating some success here, albeit
> in not exactly the same quick death, but still out-of-bounds execution.

For the record, bsw just hung after close to a day, with the same out of
bounds execution (BBADDR after the batch).
/o\
-Chris

* Re: [PATCH 2/3] drm/i915: Return immediately if trylock fails for direct-reclaim
  2018-11-08  8:17 ` [PATCH 2/3] drm/i915: Return immediately if trylock fails for direct-reclaim Chris Wilson
@ 2018-11-08 16:23   ` Tvrtko Ursulin
  2018-11-08 16:48     ` Chris Wilson
  0 siblings, 1 reply; 19+ messages in thread
From: Tvrtko Ursulin @ 2018-11-08 16:23 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 08/11/2018 08:17, Chris Wilson wrote:
> Ignore trying to shrink from i915 if we fail to acquire the struct_mutex
> in the shrinker while performing direct-reclaim. The trade-off being
> (much) lower latency for non-i915 clients at an increased risk of being
> unable to obtain a page from direct-reclaim without hitting the
> oom-notifier. The proviso being that we still keep trying hard to
> obtain the lock for oom so that we can reap under heavy memory pressure.
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> ---
>   drivers/gpu/drm/i915/i915_gem_shrinker.c | 24 +++++++++++-------------
>   1 file changed, 11 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem_shrinker.c b/drivers/gpu/drm/i915/i915_gem_shrinker.c
> index ea90d3a0d511..d461f458f4af 100644
> --- a/drivers/gpu/drm/i915/i915_gem_shrinker.c
> +++ b/drivers/gpu/drm/i915/i915_gem_shrinker.c
> @@ -36,7 +36,9 @@
>   #include "i915_drv.h"
>   #include "i915_trace.h"
>   
> -static bool shrinker_lock(struct drm_i915_private *i915, bool *unlock)
> +static bool shrinker_lock(struct drm_i915_private *i915,
> +			  unsigned int flags,
> +			  bool *unlock)
>   {
>   	switch (mutex_trylock_recursive(&i915->drm.struct_mutex)) {
>   	case MUTEX_TRYLOCK_RECURSIVE:
> @@ -45,15 +47,11 @@ static bool shrinker_lock(struct drm_i915_private *i915, bool *unlock)
>   
>   	case MUTEX_TRYLOCK_FAILED:
>   		*unlock = false;
> -		preempt_disable();
> -		do {
> -			cpu_relax();
> -			if (mutex_trylock(&i915->drm.struct_mutex)) {
> -				*unlock = true;
> -				break;
> -			}
> -		} while (!need_resched());
> -		preempt_enable();
> +		if (flags & I915_SHRINK_ACTIVE) {

So unless I915_SHRINK_ACTIVE is set, which is the last-ditch attempt to
shrink in the normal case (direct reclaim?) or oom, we bail out on the
first sign of struct_mutex contention. Doesn't this make our shrinker
much less effective at runtime, and why is that OK?

Or in other words, for what use cases, tests or benchmarks was the
existing approach of busy-looping a problem?

> +			mutex_lock_nested(&i915->drm.struct_mutex,
> +					  I915_MM_SHRINKER);

_nested is needed since abandoning the trylock would otherwise cause
lockdep issues? But is it really safe? I don't know... how can it be? It
is guaranteed to be a different process here, otherwise the result
wouldn't be MUTEX_TRYLOCK_FAILED.

Also, I915_MM_SHRINKER is so far only documented to apply to obj->mm.lock.

Regards,

Tvrtko

> +			*unlock = true;
> +		}
>   		return *unlock;
>   
>   	case MUTEX_TRYLOCK_SUCCESS:
> @@ -160,7 +158,7 @@ i915_gem_shrink(struct drm_i915_private *i915,
>   	unsigned long scanned = 0;
>   	bool unlock;
>   
> -	if (!shrinker_lock(i915, &unlock))
> +	if (!shrinker_lock(i915, flags, &unlock))
>   		return 0;
>   
>   	/*
> @@ -357,7 +355,7 @@ i915_gem_shrinker_scan(struct shrinker *shrinker, struct shrink_control *sc)
>   
>   	sc->nr_scanned = 0;
>   
> -	if (!shrinker_lock(i915, &unlock))
> +	if (!shrinker_lock(i915, 0, &unlock))
>   		return SHRINK_STOP;
>   
>   	freed = i915_gem_shrink(i915,
> @@ -397,7 +395,7 @@ shrinker_lock_uninterruptible(struct drm_i915_private *i915, bool *unlock,
>   	do {
>   		if (i915_gem_wait_for_idle(i915,
>   					   0, MAX_SCHEDULE_TIMEOUT) == 0 &&
> -		    shrinker_lock(i915, unlock))
> +		    shrinker_lock(i915, 0, unlock))
>   			break;
>   
>   		schedule_timeout_killable(1);
> 
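
For reference, a sketch of how the three mutex_trylock_recursive()
outcomes map onto the behaviour under discussion. This is illustrative
only (lock_for_reclaim and may_block are made-up names); the actual
patch uses mutex_lock_nested(..., I915_MM_SHRINKER) in the blocking
case to give lockdep a separate subclass, whereas this sketch uses a
plain mutex_lock() for brevity.

#include <linux/mutex.h>
#include <linux/types.h>

static bool lock_for_reclaim(struct mutex *m, bool may_block, bool *unlock)
{
	switch (mutex_trylock_recursive(m)) {
	case MUTEX_TRYLOCK_RECURSIVE:
		/* The current task already holds m: reuse it, do not unlock. */
		*unlock = false;
		return true;

	case MUTEX_TRYLOCK_SUCCESS:
		/* Uncontended: we now own m and must unlock it later. */
		*unlock = true;
		return true;

	case MUTEX_TRYLOCK_FAILED:
		/* Held by another task: block only if the caller allows it. */
		*unlock = false;
		if (may_block) {
			mutex_lock(m);
			*unlock = true;
		}
		return *unlock;
	}

	return false;
}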

* Re: [PATCH 3/3] drm/i915/userptr: Avoid struct_mutex recursion for mmu_invalidate_range_start
  2018-11-08  8:17 ` [PATCH 3/3] drm/i915/userptr: Avoid struct_mutex recursion for mmu_invalidate_range_start Chris Wilson
@ 2018-11-08 16:45   ` Tvrtko Ursulin
  0 siblings, 0 replies; 19+ messages in thread
From: Tvrtko Ursulin @ 2018-11-08 16:45 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 08/11/2018 08:17, Chris Wilson wrote:
> Since commit 93065ac753e4 ("mm, oom: distinguish blockable mode for mmu
> notifiers") we have been able to report failure from
> mmu_invalidate_range_start which allows us to use a trylock on the
> struct_mutex to avoid potential recursion and report -EBUSY instead.
> Furthermore, this allows us to pull the work into the main callback and
> avoid the sleight-of-hand in using a workqueue to avoid lockdep.
> 
> However, not all paths to mmu_invalidate_range_start are prepared to
> handle failure, so instead of reporting the recursion, deal with it by
> propagating the failure upwards to the callers, who can decide for
> themselves whether to handle or report it.
> 
> v2: Mark up the recursive lock behaviour and comment on the various weak
> points.
> 
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=108375
> References: 93065ac753e4 ("mm, oom: distinguish blockable mode for mmu notifiers")
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> ---
>   drivers/gpu/drm/i915/i915_drv.h         |   4 +-
>   drivers/gpu/drm/i915/i915_gem.c         |  30 +++-
>   drivers/gpu/drm/i915/i915_gem_object.h  |   7 +
>   drivers/gpu/drm/i915/i915_gem_userptr.c | 221 +++++++++++-------------
>   4 files changed, 136 insertions(+), 126 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 0c8438de3c1b..b95afa05976d 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -3076,8 +3076,8 @@ enum i915_mm_subclass { /* lockdep subclass for obj->mm.lock */
>   	I915_MM_SHRINKER
>   };
>   
> -void __i915_gem_object_put_pages(struct drm_i915_gem_object *obj,
> -				 enum i915_mm_subclass subclass);
> +int __i915_gem_object_put_pages(struct drm_i915_gem_object *obj,
> +				enum i915_mm_subclass subclass);
>   void __i915_gem_object_invalidate(struct drm_i915_gem_object *obj);
>   
>   enum i915_map_type {
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index dc120b5d8e05..53fb3605c4df 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -2447,8 +2447,8 @@ __i915_gem_object_unset_pages(struct drm_i915_gem_object *obj)
>   	struct sg_table *pages;
>   
>   	pages = fetch_and_zero(&obj->mm.pages);
> -	if (!pages)
> -		return NULL;
> +	if (IS_ERR_OR_NULL(pages))

Where can we set pages to some ERR_PTR value?

> +		return pages;
>   
>   	spin_lock(&i915->mm.obj_lock);
>   	list_del(&obj->mm.link);
> @@ -2472,22 +2472,23 @@ __i915_gem_object_unset_pages(struct drm_i915_gem_object *obj)
>   	return pages;
>   }
>   
> -void __i915_gem_object_put_pages(struct drm_i915_gem_object *obj,
> -				 enum i915_mm_subclass subclass)
> +int __i915_gem_object_put_pages(struct drm_i915_gem_object *obj,
> +				enum i915_mm_subclass subclass)
>   {
>   	struct sg_table *pages;
> +	int ret;
>   
>   	if (i915_gem_object_has_pinned_pages(obj))
> -		return;
> +		return -EBUSY;
>   
>   	GEM_BUG_ON(obj->bind_count);
> -	if (!i915_gem_object_has_pages(obj))
> -		return;
>   
>   	/* May be called by shrinker from within get_pages() (on another bo) */
>   	mutex_lock_nested(&obj->mm.lock, subclass);
> -	if (unlikely(atomic_read(&obj->mm.pages_pin_count)))
> +	if (unlikely(atomic_read(&obj->mm.pages_pin_count))) {
> +		ret = -EBUSY;
>   		goto unlock;
> +	}
>   
>   	/*
>   	 * ->put_pages might need to allocate memory for the bit17 swizzle
> @@ -2495,11 +2496,24 @@ void __i915_gem_object_put_pages(struct drm_i915_gem_object *obj,
>   	 * lists early.
>   	 */
>   	pages = __i915_gem_object_unset_pages(obj);
> +
> +	/*
> +	 * XXX Temporary hijinx to avoid updating all backends to handle
> +	 * NULL pages. In the future, when we have more asynchronous
> +	 * get_pages backends we should be better able to handle the
> +	 * cancellation of the async task in a more uniform manner.
> +	 */
> +	if (!pages && !i915_gem_object_needs_async_cancel(obj))
> +		pages = ERR_PTR(-EINVAL);
> +
>   	if (!IS_ERR(pages))
>   		obj->ops->put_pages(obj, pages);
>   
> +	ret = 0;
>   unlock:
>   	mutex_unlock(&obj->mm.lock);
> +
> +	return ret;
>   }
>   
>   bool i915_sg_trim(struct sg_table *orig_st)
> diff --git a/drivers/gpu/drm/i915/i915_gem_object.h b/drivers/gpu/drm/i915/i915_gem_object.h
> index a6dd7c46de0d..49ce797173b5 100644
> --- a/drivers/gpu/drm/i915/i915_gem_object.h
> +++ b/drivers/gpu/drm/i915/i915_gem_object.h
> @@ -56,6 +56,7 @@ struct drm_i915_gem_object_ops {
>   #define I915_GEM_OBJECT_HAS_STRUCT_PAGE	BIT(0)
>   #define I915_GEM_OBJECT_IS_SHRINKABLE	BIT(1)
>   #define I915_GEM_OBJECT_IS_PROXY	BIT(2)
> +#define I915_GEM_OBJECT_ASYNC_CANCEL	BIT(3)
>   
>   	/* Interface between the GEM object and its backing storage.
>   	 * get_pages() is called once prior to the use of the associated set
> @@ -386,6 +387,12 @@ i915_gem_object_is_proxy(const struct drm_i915_gem_object *obj)
>   	return obj->ops->flags & I915_GEM_OBJECT_IS_PROXY;
>   }
>   
> +static inline bool
> +i915_gem_object_needs_async_cancel(const struct drm_i915_gem_object *obj)
> +{
> +	return obj->ops->flags & I915_GEM_OBJECT_ASYNC_CANCEL;
> +}
> +
>   static inline bool
>   i915_gem_object_is_active(const struct drm_i915_gem_object *obj)
>   {
> diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
> index 2c9b284036d1..8b07fd44731f 100644
> --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> @@ -50,79 +50,84 @@ struct i915_mmu_notifier {
>   	struct hlist_node node;
>   	struct mmu_notifier mn;
>   	struct rb_root_cached objects;
> -	struct workqueue_struct *wq;
> +	struct i915_mm_struct *mm;
>   };
>   
>   struct i915_mmu_object {
>   	struct i915_mmu_notifier *mn;
>   	struct drm_i915_gem_object *obj;
>   	struct interval_tree_node it;
> -	struct list_head link;
> -	struct work_struct work;
> -	bool attached;
>   };
>   
> -static void cancel_userptr(struct work_struct *work)
> -{
> -	struct i915_mmu_object *mo = container_of(work, typeof(*mo), work);
> -	struct drm_i915_gem_object *obj = mo->obj;
> -	struct work_struct *active;
> -
> -	/* Cancel any active worker and force us to re-evaluate gup */
> -	mutex_lock(&obj->mm.lock);
> -	active = fetch_and_zero(&obj->userptr.work);
> -	mutex_unlock(&obj->mm.lock);
> -	if (active)
> -		goto out;
> -
> -	i915_gem_object_wait(obj, I915_WAIT_ALL, MAX_SCHEDULE_TIMEOUT, NULL);
> -
> -	mutex_lock(&obj->base.dev->struct_mutex);
> -
> -	/* We are inside a kthread context and can't be interrupted */
> -	if (i915_gem_object_unbind(obj) == 0)
> -		__i915_gem_object_put_pages(obj, I915_MM_NORMAL);
> -	WARN_ONCE(i915_gem_object_has_pages(obj),
> -		  "Failed to release pages: bind_count=%d, pages_pin_count=%d, pin_global=%d\n",
> -		  obj->bind_count,
> -		  atomic_read(&obj->mm.pages_pin_count),
> -		  obj->pin_global);
> -
> -	mutex_unlock(&obj->base.dev->struct_mutex);
> -
> -out:
> -	i915_gem_object_put(obj);
> -}
> -
>   static void add_object(struct i915_mmu_object *mo)
>   {
> -	if (mo->attached)
> +	if (!RB_EMPTY_NODE(&mo->it.rb))
>   		return;
>   
>   	interval_tree_insert(&mo->it, &mo->mn->objects);
> -	mo->attached = true;
>   }
>   
>   static void del_object(struct i915_mmu_object *mo)
>   {
> -	if (!mo->attached)
> +	if (RB_EMPTY_NODE(&mo->it.rb))
>   		return;
>   
>   	interval_tree_remove(&mo->it, &mo->mn->objects);
> -	mo->attached = false;
> +	RB_CLEAR_NODE(&mo->it.rb);
> +}
> +
> +static void
> +__i915_gem_userptr_set_active(struct drm_i915_gem_object *obj, bool value)
> +{
> +	struct i915_mmu_object *mo = obj->userptr.mmu_object;
> +
> +	/*
> +	 * During mm_invalidate_range we need to cancel any userptr that
> +	 * overlaps the range being invalidated. Doing so requires the
> +	 * struct_mutex, and that risks recursion. In order to cause
> +	 * recursion, the user must alias the userptr address space with
> +	 * a GTT mmapping (possible with a MAP_FIXED) - then when we have
> +	 * to invalidate that mmaping, mm_invalidate_range is called with
> +	 * the userptr address *and* the struct_mutex held.  To prevent that
> +	 * we set a flag under the i915_mmu_notifier spinlock to indicate
> +	 * whether this object is valid.
> +	 */
> +	if (!mo)
> +		return;
> +
> +	spin_lock(&mo->mn->lock);
> +	if (value)
> +		add_object(mo);
> +	else
> +		del_object(mo);
> +	spin_unlock(&mo->mn->lock);
> +}
> +
> +static struct mutex *__i915_mutex_lock_recursive(struct mutex *m)
> +{
> +	switch (mutex_trylock_recursive(m)) {
> +	default:
> +	case MUTEX_TRYLOCK_FAILED:
> +		mutex_lock_nested(m, I915_MM_SHRINKER);

I am thinking that perhaps we need a new namespace for struct mutex lock 
nesting, like I915_STRUCT_MUTEX...

And this one would then be called something other than shrinker, since
it is not triggered by the shrinker but by page migration, right?

> +	case MUTEX_TRYLOCK_SUCCESS:
> +		return m;
> +
> +	case MUTEX_TRYLOCK_RECURSIVE:
> +		return ERR_PTR(-EEXIST);
> +	}
>   }
>   
>   static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> -						       struct mm_struct *mm,
> -						       unsigned long start,
> -						       unsigned long end,
> -						       bool blockable)
> +						      struct mm_struct *mm,
> +						      unsigned long start,
> +						      unsigned long end,
> +						      bool blockable)
>   {
>   	struct i915_mmu_notifier *mn =
>   		container_of(_mn, struct i915_mmu_notifier, mn);
> -	struct i915_mmu_object *mo;
>   	struct interval_tree_node *it;
> -	LIST_HEAD(cancelled);
> +	struct mutex *unlock = NULL;
> +	int ret = 0;
>   
>   	if (RB_EMPTY_ROOT(&mn->objects.rb_root))
>   		return 0;
> @@ -133,11 +138,15 @@ static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
>   	spin_lock(&mn->lock);
>   	it = interval_tree_iter_first(&mn->objects, start, end);
>   	while (it) {
> +		struct drm_i915_gem_object *obj;
> +
>   		if (!blockable) {
> -			spin_unlock(&mn->lock);
> -			return -EAGAIN;
> +			ret = -EAGAIN;
> +			break;
>   		}
> -		/* The mmu_object is released late when destroying the
> +
> +		/*
> +		 * The mmu_object is released late when destroying the
>   		 * GEM object so it is entirely possible to gain a
>   		 * reference on an object in the process of being freed
>   		 * since our serialisation is via the spinlock and not
> @@ -146,21 +155,39 @@ static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
>   		 * use-after-free we only acquire a reference on the
>   		 * object if it is not in the process of being destroyed.
>   		 */
> -		mo = container_of(it, struct i915_mmu_object, it);
> -		if (kref_get_unless_zero(&mo->obj->base.refcount))
> -			queue_work(mn->wq, &mo->work);
> +		obj = container_of(it, struct i915_mmu_object, it)->obj;
> +		if (!kref_get_unless_zero(&obj->base.refcount)) {
> +			it = interval_tree_iter_next(it, start, end);
> +			continue;
> +		}
> +		spin_unlock(&mn->lock);
>   
> -		list_add(&mo->link, &cancelled);
> -		it = interval_tree_iter_next(it, start, end);
> +		if (!unlock)
> +			unlock = __i915_mutex_lock_recursive(&mn->mm->i915->drm.struct_mutex);
> +		ret = i915_gem_object_unbind(obj);
> +		if (ret == 0)
> +			ret = __i915_gem_object_put_pages(obj, I915_MM_SHRINKER);
> +		i915_gem_object_put(obj);
> +		if (ret)
> +			goto unlock;
> +
> +		spin_lock(&mn->lock);
> +
> +		/*
> +		 * As we do not (yet) protect the mmu from concurrent insertion
> +		 * over this range, there is no guarantee that this search will
> +		 * terminate given a pathologic workload.
> +		 */

Hmm, my concern from the other day. But it must exit, I think - some 
client will eventually stall for a bit. If not sooner, then when 
everything grinds to a halt due to everyone being stuck in this loop.

> +		it = interval_tree_iter_first(&mn->objects, start, end);
>   	}
> -	list_for_each_entry(mo, &cancelled, link)
> -		del_object(mo);
>   	spin_unlock(&mn->lock);
>   
> -	if (!list_empty(&cancelled))
> -		flush_workqueue(mn->wq);
> +unlock:
> +	if (!IS_ERR_OR_NULL(unlock))
> +		mutex_unlock(unlock);
> +
> +	return ret;
>   
> -	return 0;
>   }
>   
>   static const struct mmu_notifier_ops i915_gem_userptr_notifier = {
> @@ -168,7 +195,7 @@ static const struct mmu_notifier_ops i915_gem_userptr_notifier = {
>   };
>   
>   static struct i915_mmu_notifier *
> -i915_mmu_notifier_create(struct mm_struct *mm)
> +i915_mmu_notifier_create(struct i915_mm_struct *mm)
>   {
>   	struct i915_mmu_notifier *mn;
>   
> @@ -179,13 +206,7 @@ i915_mmu_notifier_create(struct mm_struct *mm)
>   	spin_lock_init(&mn->lock);
>   	mn->mn.ops = &i915_gem_userptr_notifier;
>   	mn->objects = RB_ROOT_CACHED;
> -	mn->wq = alloc_workqueue("i915-userptr-release",
> -				 WQ_UNBOUND | WQ_MEM_RECLAIM,
> -				 0);
> -	if (mn->wq == NULL) {
> -		kfree(mn);
> -		return ERR_PTR(-ENOMEM);
> -	}
> +	mn->mm = mm;
>   
>   	return mn;
>   }
> @@ -195,16 +216,14 @@ i915_gem_userptr_release__mmu_notifier(struct drm_i915_gem_object *obj)
>   {
>   	struct i915_mmu_object *mo;
>   
> -	mo = obj->userptr.mmu_object;
> -	if (mo == NULL)
> +	mo = fetch_and_zero(&obj->userptr.mmu_object);
> +	if (!mo)
>   		return;
>   
>   	spin_lock(&mo->mn->lock);
>   	del_object(mo);
>   	spin_unlock(&mo->mn->lock);
>   	kfree(mo);
> -
> -	obj->userptr.mmu_object = NULL;
>   }
>   
>   static struct i915_mmu_notifier *
> @@ -217,7 +236,7 @@ i915_mmu_notifier_find(struct i915_mm_struct *mm)
>   	if (mn)
>   		return mn;
>   
> -	mn = i915_mmu_notifier_create(mm->mm);
> +	mn = i915_mmu_notifier_create(mm);
>   	if (IS_ERR(mn))
>   		err = PTR_ERR(mn);
>   
> @@ -240,10 +259,8 @@ i915_mmu_notifier_find(struct i915_mm_struct *mm)
>   	mutex_unlock(&mm->i915->mm_lock);
>   	up_write(&mm->mm->mmap_sem);
>   
> -	if (mn && !IS_ERR(mn)) {
> -		destroy_workqueue(mn->wq);
> +	if (mn && !IS_ERR(mn))
>   		kfree(mn);
> -	}
>   
>   	return err ? ERR_PTR(err) : mm->mn;
>   }
> @@ -266,14 +283,14 @@ i915_gem_userptr_init__mmu_notifier(struct drm_i915_gem_object *obj,
>   		return PTR_ERR(mn);
>   
>   	mo = kzalloc(sizeof(*mo), GFP_KERNEL);
> -	if (mo == NULL)
> +	if (!mo)
>   		return -ENOMEM;
>   
>   	mo->mn = mn;
>   	mo->obj = obj;
>   	mo->it.start = obj->userptr.ptr;
>   	mo->it.last = obj->userptr.ptr + obj->base.size - 1;
> -	INIT_WORK(&mo->work, cancel_userptr);
> +	RB_CLEAR_NODE(&mo->it.rb);
>   
>   	obj->userptr.mmu_object = mo;
>   	return 0;
> @@ -287,12 +304,16 @@ i915_mmu_notifier_free(struct i915_mmu_notifier *mn,
>   		return;
>   
>   	mmu_notifier_unregister(&mn->mn, mm);
> -	destroy_workqueue(mn->wq);
>   	kfree(mn);
>   }
>   
>   #else
>   
> +static void
> +__i915_gem_userptr_set_active(struct drm_i915_gem_object *obj, bool value)
> +{
> +}
> +
>   static void
>   i915_gem_userptr_release__mmu_notifier(struct drm_i915_gem_object *obj)
>   {
> @@ -461,42 +482,6 @@ __i915_gem_userptr_alloc_pages(struct drm_i915_gem_object *obj,
>   	return st;
>   }
>   
> -static int
> -__i915_gem_userptr_set_active(struct drm_i915_gem_object *obj,
> -			      bool value)
> -{
> -	int ret = 0;
> -
> -	/* During mm_invalidate_range we need to cancel any userptr that
> -	 * overlaps the range being invalidated. Doing so requires the
> -	 * struct_mutex, and that risks recursion. In order to cause
> -	 * recursion, the user must alias the userptr address space with
> -	 * a GTT mmapping (possible with a MAP_FIXED) - then when we have
> -	 * to invalidate that mmaping, mm_invalidate_range is called with
> -	 * the userptr address *and* the struct_mutex held.  To prevent that
> -	 * we set a flag under the i915_mmu_notifier spinlock to indicate
> -	 * whether this object is valid.
> -	 */
> -#if defined(CONFIG_MMU_NOTIFIER)
> -	if (obj->userptr.mmu_object == NULL)
> -		return 0;
> -
> -	spin_lock(&obj->userptr.mmu_object->mn->lock);
> -	/* In order to serialise get_pages with an outstanding
> -	 * cancel_userptr, we must drop the struct_mutex and try again.
> -	 */
> -	if (!value)
> -		del_object(obj->userptr.mmu_object);
> -	else if (!work_pending(&obj->userptr.mmu_object->work))
> -		add_object(obj->userptr.mmu_object);
> -	else
> -		ret = -EAGAIN;
> -	spin_unlock(&obj->userptr.mmu_object->mn->lock);
> -#endif
> -
> -	return ret;
> -}
> -
>   static void
>   __i915_gem_userptr_get_pages_worker(struct work_struct *_work)
>   {
> @@ -682,8 +667,11 @@ i915_gem_userptr_put_pages(struct drm_i915_gem_object *obj,
>   	struct sgt_iter sgt_iter;
>   	struct page *page;
>   
> -	BUG_ON(obj->userptr.work != NULL);
> +	/* Cancel any inflight work and force them to restart their gup */
> +	obj->userptr.work = NULL;
>   	__i915_gem_userptr_set_active(obj, false);
> +	if (!pages)
> +		return;
>   
>   	if (obj->mm.madv != I915_MADV_WILLNEED)
>   		obj->mm.dirty = false;
> @@ -721,7 +709,8 @@ i915_gem_userptr_dmabuf_export(struct drm_i915_gem_object *obj)
>   
>   static const struct drm_i915_gem_object_ops i915_gem_userptr_ops = {
>   	.flags = I915_GEM_OBJECT_HAS_STRUCT_PAGE |
> -		 I915_GEM_OBJECT_IS_SHRINKABLE,
> +		 I915_GEM_OBJECT_IS_SHRINKABLE |
> +		 I915_GEM_OBJECT_ASYNC_CANCEL,
>   	.get_pages = i915_gem_userptr_get_pages,
>   	.put_pages = i915_gem_userptr_put_pages,
>   	.dmabuf_export = i915_gem_userptr_dmabuf_export,
> 

Regards,

Tvrtko

* Re: [PATCH 2/3] drm/i915: Return immediately if trylock fails for direct-reclaim
  2018-11-08 16:23   ` Tvrtko Ursulin
@ 2018-11-08 16:48     ` Chris Wilson
  2018-11-09  7:30       ` Tvrtko Ursulin
  0 siblings, 1 reply; 19+ messages in thread
From: Chris Wilson @ 2018-11-08 16:48 UTC (permalink / raw)
  To: Tvrtko Ursulin, intel-gfx

Quoting Tvrtko Ursulin (2018-11-08 16:23:08)
> 
> On 08/11/2018 08:17, Chris Wilson wrote:
> > Ignore trying to shrink from i915 if we fail to acquire the struct_mutex
> > in the shrinker while performing direct-reclaim. The trade-off being
> > (much) lower latency for non-i915 clients at an increased risk of being
> > unable to obtain a page from direct-reclaim without hitting the
> > oom-notifier. The proviso being that we still keep trying to hard
> > obtain the lock for oom so that we can reap under heavy memory pressure.
> > 
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > ---
> >   drivers/gpu/drm/i915/i915_gem_shrinker.c | 24 +++++++++++-------------
> >   1 file changed, 11 insertions(+), 13 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/i915_gem_shrinker.c b/drivers/gpu/drm/i915/i915_gem_shrinker.c
> > index ea90d3a0d511..d461f458f4af 100644
> > --- a/drivers/gpu/drm/i915/i915_gem_shrinker.c
> > +++ b/drivers/gpu/drm/i915/i915_gem_shrinker.c
> > @@ -36,7 +36,9 @@
> >   #include "i915_drv.h"
> >   #include "i915_trace.h"
> >   
> > -static bool shrinker_lock(struct drm_i915_private *i915, bool *unlock)
> > +static bool shrinker_lock(struct drm_i915_private *i915,
> > +                       unsigned int flags,
> > +                       bool *unlock)
> >   {
> >       switch (mutex_trylock_recursive(&i915->drm.struct_mutex)) {
> >       case MUTEX_TRYLOCK_RECURSIVE:
> > @@ -45,15 +47,11 @@ static bool shrinker_lock(struct drm_i915_private *i915, bool *unlock)
> >   
> >       case MUTEX_TRYLOCK_FAILED:
> >               *unlock = false;
> > -             preempt_disable();
> > -             do {
> > -                     cpu_relax();
> > -                     if (mutex_trylock(&i915->drm.struct_mutex)) {
> > -                             *unlock = true;
> > -                             break;
> > -                     }
> > -             } while (!need_resched());
> > -             preempt_enable();
> > +             if (flags & I915_SHRINK_ACTIVE) {
> 
> So until I915_SHRINK_ACTIVE, which is the last ditch attempt to shrink 
> in the normal case (direct reclaim?) or oom, we bail out on the first 
> sign of struct mutex contention. Doesn't this make our shrinker much 
> less effective at runtime and why is that OK?

As I said, it's a tradeoff between blocking others for _several_
_seconds_ while making no progress, and returning immediately while
making no progress. My argument is along the lines of: if direct-reclaim
is running in another process and something else is engaged in the
driver, hopefully the driver will be cleaning up as it goes along, or
else what remains is active and won't be reaped anyway. If direct
reclaim is failing, the delay before trying the oom path is
insignificant.

> Or in other words, for what use cases, tests or benchmark was the 
> existing approach of busy looping a problem?

Do something like 'find / -exec cat' while running i915 and see how long
you have to wait for a khungtaskd :|

gem_syslatency reports max latencies of over 600s, and I'm sure it's
pretty much unbounded. It's bad enough that I expect we are the cause of
significant hitching (mainly in other tasks) on the desktop running at
memory capacity.

> > +                     mutex_lock_nested(&i915->drm.struct_mutex,
> > +                                       I915_MM_SHRINKER);
> 
> _nested is needed since abandoning trylock would otherwise cause lockdep 
> issues? But is it really safe? I don't know.. how can it be? It is 
> guaranteed to be a different process here otherwise the result wouldn't 
> be MUTEX_TRYLOCK_FAILED.

The easy option was to only force the mutex_lock for kswapd. However,
noting that we do need to forcibly shrink before oom, I opted for the
more inclusive attempt to take it in both. For the oom approach, my only
handwaving is that we shouldn't be nested under oom serialisation, and
so should avoid an ABBA deadlock. It's survived some oom abuse, but the
machine still becomes unresponsive (but pingable) eventually (as it does
before the patch).

> Also, I915_MM_SHRINKER is so far only documented to apply to obj->mm.lock.

It suits very nicely, as it is being applied for the same purpose.
-Chris

* ✓ Fi.CI.IGT: success for series starting with [1/3] drm/i915/execlists: Force write serialisation into context image vs execution
  2018-11-08  8:17 [PATCH 1/3] drm/i915/execlists: Force write serialisation into context image vs execution Chris Wilson
                   ` (4 preceding siblings ...)
  2018-11-08 12:00 ` [PATCH 1/3] " Mika Kuoppala
@ 2018-11-08 19:51 ` Patchwork
  5 siblings, 0 replies; 19+ messages in thread
From: Patchwork @ 2018-11-08 19:51 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: series starting with [1/3] drm/i915/execlists: Force write serialisation into context image vs execution
URL   : https://patchwork.freedesktop.org/series/52209/
State : success

== Summary ==

= CI Bug Log - changes from CI_DRM_5103_full -> Patchwork_10769_full =

== Summary - WARNING ==

  Minor unknown changes coming with Patchwork_10769_full need to be verified
  manually.
  
  If you think the reported changes have nothing to do with the changes
  introduced in Patchwork_10769_full, please notify your bug team to allow them
  to document this new failure mode, which will reduce false positives in CI.

  

== Possible new issues ==

  Here are the unknown changes that may have been introduced in Patchwork_10769_full:

  === IGT changes ===

    ==== Warnings ====

    igt@perf_pmu@rc6:
      shard-kbl:          PASS -> SKIP

    
== Known issues ==

  Here are the changes found in Patchwork_10769_full that come from known issues:

  === IGT changes ===

    ==== Issues hit ====

    igt@drv_suspend@shrink:
      shard-skl:          PASS -> INCOMPLETE (fdo#106886)

    igt@gem_cpu_reloc@full:
      shard-skl:          NOTRUN -> INCOMPLETE (fdo#108073)

    igt@gem_exec_bad_domains@cpu-domain:
      shard-apl:          PASS -> DMESG-WARN (fdo#103558, fdo#105602) +7

    igt@gem_exec_schedule@pi-ringfull-bsd:
      shard-skl:          NOTRUN -> FAIL (fdo#103158)

    igt@kms_chv_cursor_fail@pipe-a-64x64-right-edge:
      shard-skl:          NOTRUN -> FAIL (fdo#104671)

    igt@kms_cursor_crc@cursor-64x64-random:
      shard-glk:          PASS -> FAIL (fdo#103232)

    igt@kms_cursor_crc@cursor-64x64-sliding:
      shard-skl:          NOTRUN -> FAIL (fdo#103232)

    igt@kms_frontbuffer_tracking@fbc-1p-primscrn-spr-indfb-draw-render:
      shard-apl:          PASS -> FAIL (fdo#103167) +2

    igt@kms_frontbuffer_tracking@fbc-1p-primscrn-spr-indfb-fullscreen:
      shard-glk:          PASS -> FAIL (fdo#103167)

    igt@kms_frontbuffer_tracking@fbc-indfb-scaledprimary:
      shard-skl:          PASS -> FAIL (fdo#105682)

    igt@kms_frontbuffer_tracking@fbc-suspend:
      shard-skl:          PASS -> INCOMPLETE (fdo#105959, fdo#107773, fdo#104108)

    igt@kms_frontbuffer_tracking@fbcpsr-1p-offscren-pri-indfb-draw-blt:
      shard-skl:          NOTRUN -> FAIL (fdo#105682)

    igt@kms_frontbuffer_tracking@psr-1p-primscrn-cur-indfb-draw-mmap-wc:
      shard-skl:          NOTRUN -> FAIL (fdo#103167) +2

    igt@kms_plane_alpha_blend@pipe-a-alpha-opaque-fb:
      shard-skl:          NOTRUN -> FAIL (fdo#108145) +2

    igt@kms_plane_alpha_blend@pipe-a-constant-alpha-max:
      shard-glk:          PASS -> FAIL (fdo#108145)

    igt@kms_plane_alpha_blend@pipe-a-coverage-7efc:
      shard-skl:          PASS -> FAIL (fdo#107815, fdo#108145)

    igt@kms_plane_multiple@atomic-pipe-a-tiling-y:
      shard-apl:          PASS -> FAIL (fdo#103166) +2

    igt@kms_plane_multiple@atomic-pipe-b-tiling-x:
      shard-glk:          PASS -> FAIL (fdo#103166) +1

    igt@kms_setmode@basic:
      shard-apl:          PASS -> FAIL (fdo#99912)

    igt@pm_rpm@i2c:
      shard-skl:          PASS -> INCOMPLETE (fdo#107807)

    igt@pm_rpm@modeset-non-lpsp-stress:
      shard-skl:          SKIP -> INCOMPLETE (fdo#107807)

    
    ==== Possible fixes ====

    igt@gem_exec_reuse@baggage:
      shard-apl:          DMESG-WARN (fdo#108690) -> PASS

    igt@kms_color@pipe-a-degamma:
      shard-apl:          FAIL (fdo#104782, fdo#108145) -> PASS

    igt@kms_cursor_crc@cursor-128x128-random:
      shard-apl:          FAIL (fdo#103232) -> PASS +1

    igt@kms_cursor_legacy@cursorb-vs-flipb-varying-size:
      shard-glk:          DMESG-WARN (fdo#106538, fdo#105763) -> PASS +1

    igt@kms_flip@2x-flip-vs-expired-vblank-interruptible:
      shard-glk:          FAIL (fdo#105363) -> PASS

    igt@kms_frontbuffer_tracking@fbc-1p-primscrn-cur-indfb-draw-pwrite:
      shard-apl:          FAIL (fdo#103167) -> PASS +1

    igt@kms_frontbuffer_tracking@fbc-2p-primscrn-spr-indfb-onoff:
      shard-glk:          FAIL (fdo#103167) -> PASS +3

    igt@kms_plane@pixel-format-pipe-a-planes:
      shard-apl:          FAIL (fdo#103166) -> PASS

    igt@kms_plane_multiple@atomic-pipe-a-tiling-x:
      shard-glk:          FAIL (fdo#103166) -> PASS

    igt@kms_setmode@basic:
      shard-kbl:          FAIL (fdo#99912) -> PASS

    igt@pm_rpm@drm-resources-equal:
      shard-skl:          INCOMPLETE (fdo#107807) -> PASS

    igt@pm_rpm@modeset-non-lpsp-stress-no-wait:
      shard-skl:          INCOMPLETE (fdo#107807) -> SKIP

    
    ==== Warnings ====

    igt@kms_ccs@pipe-b-crc-sprite-planes-basic:
      shard-apl:          FAIL (fdo#106510, fdo#105458) -> DMESG-WARN (fdo#103558, fdo#105602)

    igt@pm_backlight@fade_with_suspend:
      shard-skl:          FAIL (fdo#107847) -> INCOMPLETE (fdo#107773)

    
  fdo#103158 https://bugs.freedesktop.org/show_bug.cgi?id=103158
  fdo#103166 https://bugs.freedesktop.org/show_bug.cgi?id=103166
  fdo#103167 https://bugs.freedesktop.org/show_bug.cgi?id=103167
  fdo#103232 https://bugs.freedesktop.org/show_bug.cgi?id=103232
  fdo#103558 https://bugs.freedesktop.org/show_bug.cgi?id=103558
  fdo#104108 https://bugs.freedesktop.org/show_bug.cgi?id=104108
  fdo#104671 https://bugs.freedesktop.org/show_bug.cgi?id=104671
  fdo#104782 https://bugs.freedesktop.org/show_bug.cgi?id=104782
  fdo#105363 https://bugs.freedesktop.org/show_bug.cgi?id=105363
  fdo#105458 https://bugs.freedesktop.org/show_bug.cgi?id=105458
  fdo#105602 https://bugs.freedesktop.org/show_bug.cgi?id=105602
  fdo#105682 https://bugs.freedesktop.org/show_bug.cgi?id=105682
  fdo#105763 https://bugs.freedesktop.org/show_bug.cgi?id=105763
  fdo#105959 https://bugs.freedesktop.org/show_bug.cgi?id=105959
  fdo#106510 https://bugs.freedesktop.org/show_bug.cgi?id=106510
  fdo#106538 https://bugs.freedesktop.org/show_bug.cgi?id=106538
  fdo#106886 https://bugs.freedesktop.org/show_bug.cgi?id=106886
  fdo#107773 https://bugs.freedesktop.org/show_bug.cgi?id=107773
  fdo#107807 https://bugs.freedesktop.org/show_bug.cgi?id=107807
  fdo#107815 https://bugs.freedesktop.org/show_bug.cgi?id=107815
  fdo#107847 https://bugs.freedesktop.org/show_bug.cgi?id=107847
  fdo#108073 https://bugs.freedesktop.org/show_bug.cgi?id=108073
  fdo#108145 https://bugs.freedesktop.org/show_bug.cgi?id=108145
  fdo#108690 https://bugs.freedesktop.org/show_bug.cgi?id=108690
  fdo#99912 https://bugs.freedesktop.org/show_bug.cgi?id=99912


== Participating hosts (6 -> 6) ==

  No changes in participating hosts


== Build changes ==

    * Linux: CI_DRM_5103 -> Patchwork_10769

  CI_DRM_5103: 23c1138030ad65402f698ab0b356e2f55722bc77 @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_4712: a3ede1b535ac8137f6949c468edd7054453d5dae @ git://anongit.freedesktop.org/xorg/app/intel-gpu-tools
  Patchwork_10769: 227c8bdf97a34415367225dfc484370af453c477 @ git://anongit.freedesktop.org/gfx-ci/linux
  piglit_4509: fdc5a4ca11124ab8413c7988896eec4c97336694 @ git://anongit.freedesktop.org/piglit

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_10769/shards.html

* Re: [PATCH 2/3] drm/i915: Return immediately if trylock fails for direct-reclaim
  2018-11-08 16:48     ` Chris Wilson
@ 2018-11-09  7:30       ` Tvrtko Ursulin
  2018-11-09 11:44         ` Chris Wilson
  0 siblings, 1 reply; 19+ messages in thread
From: Tvrtko Ursulin @ 2018-11-09  7:30 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 08/11/2018 16:48, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2018-11-08 16:23:08)
>>
>> On 08/11/2018 08:17, Chris Wilson wrote:
>>> Ignore trying to shrink from i915 if we fail to acquire the struct_mutex
>>> in the shrinker while performing direct-reclaim. The trade-off being
>>> (much) lower latency for non-i915 clients at an increased risk of being
>>> unable to obtain a page from direct-reclaim without hitting the
>>> oom-notifier. The proviso being that we still keep trying to hard
>>> obtain the lock for oom so that we can reap under heavy memory pressure.
>>>
>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>> ---
>>>    drivers/gpu/drm/i915/i915_gem_shrinker.c | 24 +++++++++++-------------
>>>    1 file changed, 11 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/i915_gem_shrinker.c b/drivers/gpu/drm/i915/i915_gem_shrinker.c
>>> index ea90d3a0d511..d461f458f4af 100644
>>> --- a/drivers/gpu/drm/i915/i915_gem_shrinker.c
>>> +++ b/drivers/gpu/drm/i915/i915_gem_shrinker.c
>>> @@ -36,7 +36,9 @@
>>>    #include "i915_drv.h"
>>>    #include "i915_trace.h"
>>>    
>>> -static bool shrinker_lock(struct drm_i915_private *i915, bool *unlock)
>>> +static bool shrinker_lock(struct drm_i915_private *i915,
>>> +                       unsigned int flags,
>>> +                       bool *unlock)
>>>    {
>>>        switch (mutex_trylock_recursive(&i915->drm.struct_mutex)) {
>>>        case MUTEX_TRYLOCK_RECURSIVE:
>>> @@ -45,15 +47,11 @@ static bool shrinker_lock(struct drm_i915_private *i915, bool *unlock)
>>>    
>>>        case MUTEX_TRYLOCK_FAILED:
>>>                *unlock = false;
>>> -             preempt_disable();
>>> -             do {
>>> -                     cpu_relax();
>>> -                     if (mutex_trylock(&i915->drm.struct_mutex)) {
>>> -                             *unlock = true;
>>> -                             break;
>>> -                     }
>>> -             } while (!need_resched());
>>> -             preempt_enable();
>>> +             if (flags & I915_SHRINK_ACTIVE) {
>>
>> So until I915_SHRINK_ACTIVE, which is the last ditch attempt to shrink
>> in the normal case (direct reclaim?) or oom, we bail out on the first
>> sign of struct mutex contention. Doesn't this make our shrinker much
>> less effective at runtime and why is that OK?
> 
> As I said, it's a tradeoff between blocking others for _several_
> _seconds_ and making no progress and returning immediately and making no
> progress. My argument is along the lines of if direct-reclaim is running
> in another process and something else is engaged in the driver hopefully
> the driver will be cleaning up as it goes along or else what remains is
> active and won't be reaped anyway. If direct reclaim is failing, the
> delay before trying the oom path is insignificant.

What was the rationale behind busy looping there btw? Compared to 
perhaps an alternative of micro-sleeps and trying a few times, something 
like the sketch below? I know it would be the opposite of what this 
patch is trying to achieve, I just don't have a good judgment on what 
makes most sense for the shrinker. Is it better to perhaps try a little 
bit harder instead of giving up immediately, but try a little bit harder 
in a softer way? Or does that end up blocking the callers and have the 
same effect of making no progress?
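
For example (a completely untested sketch; the retry count and sleep 
range are made up, and it assumes sleeping is fine here, which it should 
be on the paths that would otherwise block on the mutex anyway):

	case MUTEX_TRYLOCK_FAILED:
		*unlock = false;
		for (i = 0; i < 5; i++) { /* i: an int declared earlier */
			if (mutex_trylock(&i915->drm.struct_mutex)) {
				*unlock = true;
				break;
			}
			usleep_range(100, 200);
		}
		break;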

>> Or in other words, for what use cases, tests or benchmark was the
>> existing approach of busy looping a problem?
> 
> Do something like 'find / -exec cat' while running i915 and see how long
> you have to wait for a khungtaskd :|

I couldn't reproduce anything strange with this. Assuming you meant 
something like -exec cat {} \; >/dev/null.

Either way, I think explanations like this should go into the commit 
message.

> gem_syslatency reports max latencies of over 600s, and I'm sure it's
> pretty much unbounded. It's bad enough that I expect we are the cause of
> significant hitching (mainly in other tasks) on the desktop running at
> memory capacity.

Running gem_syslatency in parallel with the find, or gem_syslatency -b, 
in parallel with it or not, did not do anything.

Then I tried gem_shrink but that doesn't seem to be making any progress 
with or without the patch. It's not even hitting i915 hard, so perhaps I 
need to turn off lockdep.. which would be a bummer since I wanted to 
have lockdep to check the next bit..

>>> +                     mutex_lock_nested(&i915->drm.struct_mutex,
>>> +                                       I915_MM_SHRINKER);
>>
>> _nested is needed since abandoning trylock would otherwise cause lockdep
>> issues? But is it really safe? I don't know.. how can it be? It is
>> guaranteed to be a different process here otherwise the result wouldn't
>> be MUTEX_TRYLOCK_FAILED.
> 
> The easy option was to only force the mutex_lock for kswapd. However,
> noting that we do need to forcibly shrink before oom, I opted for the
> more inclusive attempt to take it in both. For the oom approach, my only
> handwaving is we shouldn't be nested under oom serialisation and so
> avoid an ABBA deadlock. It's survived some oom abuse, but the machine
> still becomes unresponsive (but pingable) eventually (as it does before
> the patch).

...which I completely did not understand! :(

When you say kswapd - that means mutex->owner != current in the shrinker 
and otherwise it is direct reclaim?

If true, how is the nested annotation correct for the kswapd case?

> 
>> Also, I915_MM_SHRINKER is so far only documented to apply to obj->mm.lock.
> 
> It suits very nicely as it being applied for the same purpose.

Maybe rename or at least update the comment where they are defined?

Regards,

Tvrtko

* Re: [PATCH 2/3] drm/i915: Return immediately if trylock fails for direct-reclaim
  2018-11-09  7:30       ` Tvrtko Ursulin
@ 2018-11-09 11:44         ` Chris Wilson
  2018-11-13 10:24           ` Tvrtko Ursulin
  0 siblings, 1 reply; 19+ messages in thread
From: Chris Wilson @ 2018-11-09 11:44 UTC (permalink / raw)
  To: Tvrtko Ursulin, intel-gfx

Quoting Tvrtko Ursulin (2018-11-09 07:30:34)
> 
> On 08/11/2018 16:48, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2018-11-08 16:23:08)
> >>
> >> On 08/11/2018 08:17, Chris Wilson wrote:
> >>> Ignore trying to shrink from i915 if we fail to acquire the struct_mutex
> >>> in the shrinker while performing direct-reclaim. The trade-off being
> >>> (much) lower latency for non-i915 clients at an increased risk of being
> >>> unable to obtain a page from direct-reclaim without hitting the
> >>> oom-notifier. The proviso being that we still keep trying to hard
> >>> obtain the lock for oom so that we can reap under heavy memory pressure.
> >>>
> >>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> >>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >>> ---
> >>>    drivers/gpu/drm/i915/i915_gem_shrinker.c | 24 +++++++++++-------------
> >>>    1 file changed, 11 insertions(+), 13 deletions(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/i915/i915_gem_shrinker.c b/drivers/gpu/drm/i915/i915_gem_shrinker.c
> >>> index ea90d3a0d511..d461f458f4af 100644
> >>> --- a/drivers/gpu/drm/i915/i915_gem_shrinker.c
> >>> +++ b/drivers/gpu/drm/i915/i915_gem_shrinker.c
> >>> @@ -36,7 +36,9 @@
> >>>    #include "i915_drv.h"
> >>>    #include "i915_trace.h"
> >>>    
> >>> -static bool shrinker_lock(struct drm_i915_private *i915, bool *unlock)
> >>> +static bool shrinker_lock(struct drm_i915_private *i915,
> >>> +                       unsigned int flags,
> >>> +                       bool *unlock)
> >>>    {
> >>>        switch (mutex_trylock_recursive(&i915->drm.struct_mutex)) {
> >>>        case MUTEX_TRYLOCK_RECURSIVE:
> >>> @@ -45,15 +47,11 @@ static bool shrinker_lock(struct drm_i915_private *i915, bool *unlock)
> >>>    
> >>>        case MUTEX_TRYLOCK_FAILED:
> >>>                *unlock = false;
> >>> -             preempt_disable();
> >>> -             do {
> >>> -                     cpu_relax();
> >>> -                     if (mutex_trylock(&i915->drm.struct_mutex)) {
> >>> -                             *unlock = true;
> >>> -                             break;
> >>> -                     }
> >>> -             } while (!need_resched());
> >>> -             preempt_enable();
> >>> +             if (flags & I915_SHRINK_ACTIVE) {
> >>
> >> So until I915_SHRINK_ACTIVE, which is the last ditch attempt to shrink
> >> in the normal case (direct reclaim?) or oom, we bail out on the first
> >> sign of struct mutex contention. Doesn't this make our shrinker much
> >> less effective at runtime and why is that OK?
> > 
> > As I said, it's a tradeoff between blocking others for _several_
> > _seconds_ and making no progress and returning immediately and making no
> > progress. My argument is along the lines of if direct-reclaim is running
> > in another process and something else is engaged in the driver hopefully
> > the driver will be cleaning up as it goes along or else what remains is
> > active and won't be reaped anyway. If direct reclaim is failing, the
> > delay before trying the oom path is insignificant.
> 
> What was the rationale behind busy looping there btw?

Emulating the optimistic spin for mutex (my patches to expose it from
kernel/locking were kept hidden for public decency). My thinking was the
exact opposite to this patch, that direct reclaim was of paramount
importance and spending the time to try and ensure we grabbed the
struct_mutex to search for some pages to free was preferable.

It's just that, on the basis of looking at the actual syslatency and
realising the cause is this spinner, I want to swing the axe in the
other direction.

(There's probably a compromise, but honestly I'd prefer to sell the
struct_mutex free version of the shrinker first :)

> Compared to 
> perhaps an alternative of micro-sleeps and trying a few times? I know it 
> would be opposite from what this patch is trying to achieve, I Just 
> don't had a good judgment on what makes most sense for the shrinker. Is 
> it better to perhaps try a little bit harder instead of giving up 
> immediately, but try a little bit harder in a softer way? Or that ends 
> up blocking the callers and has the same effect of making no progress?

Exactly. We can definitely measure the impact of the spinner on
unrelated processes, but detecting the premature allocation failure is
harder (we wait for more dmesg-warns). The compromise that I've tried to
reach here is that if direct-reclaim isn't enough, then we should still
try hard to grab the struct_mutex. (That leaves __GFP_RETRY_MAYFAIL
vulnerable to not shrinking i915, but a worthwhile compromise as it's
allowed to fail?)

> >> Or in other words, for what use cases, tests or benchmark was the
> >> existing approach of busy looping a problem?
> > 
> > Do something like 'find / -exec cat' while running i915 and see how long
> > you have to wait for a khungtaskd :|
> 
> I couldn't reproduce anything strange with this. Assuming you meant 
> something like -exec cat { } \; >dev/null.
> 
> Either case I think explanations like this should go into the commit 
> message.

Weird, I spent so long over the last few weeks talking about the impact
on gem_syslatency, I thought it was mentioned here.

> > gem_syslatency reports max latencies of over 600s, and I'm sure it's
> > pretty much unbounded. It's bad enough that I expect we are the cause of
> > significant hitching (mainly in other tasks) on the desktop running at
> > memory capacity.
> 
> Running gem_syslatency in parallel to the find, or gem_syslatency -b in 
> parallel or not did not do anything.

gem_syslatency -b -m

+------------------------------------------------------------------------+
|      x                                                                 |
|      x                                                                 |
.
.
.
|      x                                                                 |
|      x xxx      xx x           x  x      x                            x|
||_____M_A_______|                                                       |
+------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x 120       2284928  6.752085e+08       3385097      20825362      80352645

Unpatched; similar has been observed on ivb/byt/bdw/bsw/skl/kbl,
including while we were fighting ksoftirqd. 675s maximum latency for an
RT thread to wake up.

Patched, the max is around 261ms, an order of magnitude better than the
best previously!

> Then I tries gem_shrink but that doesn't seem to be making any progress 
> with or without the patch. But it's not even hitting the i915 hard, but 
> perhaps I need to turn off lockdep.. Which would be a bummer since I 
> wanted to have a lockdep to check the next bit..

gem_shrink has never completed... It's not quite the right test for
this, as we need to hit the shrinker & oom on both i915 and non-i915
paths to try and hit all the different lock chains.

> >>> +                     mutex_lock_nested(&i915->drm.struct_mutex,
> >>> +                                       I915_MM_SHRINKER);
> >>
> >> _nested is needed since abandoning trylock would otherwise cause lockdep
> >> issues? But is it really safe? I don't know.. how can it be? It is
> >> guaranteed to be a different process here otherwise the result wouldn't
> >> be MUTEX_TRYLOCK_FAILED.
> > 
> > The easy option was to only force the mutex_lock for kswapd. However,
> > noting that we do need to forcibly shrink before oom, I opted for the
> > more inclusive attempt to take it in both. For the oom approach, my only
> > handwaving is we shouldn't be nested under oom serialisation and so
> > avoid an ABBA deadlock. It's survived some oom abuse, but the machine
> > still becomes unresponsive (but pingable) eventually (as it does before
> > the patch).
> 
> ...which I completely did not understand! :(
> 
> When you say kswapd - that means mutex->owner != current in the shrinker 
> and otherwise it is direct reclaim?
> 
> If true, how is the nested annotation correct for the kswapd case?

We have 3 paths,
direct-reclaim / oom from under i915 + struct_mutex
direct-reclaim / oom from elsewhere
kswapd

The nested annotation doesn't mean that the lock is nested per se, it
just means that here it is acting as a different subclass of the lock.
Since the kswapd path can't have struct_mutex, it is safe from being
confused. The complication is if there is serialisation on
direct-reclaim (which I think can be run concurrently, i.e. no
mutex/semaphore nesting) or oom. For oom, there is a lock but afaict the
guard is just a trylock so shouldn't cause an ABBA deadlock if we oom
under struct_mutex while fighting an oom in another process.
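
To make the subclass point concrete, roughly (a sketch only, mirroring
the names already used in the patch):

	/*
	 * 1) reclaim entered from under i915 while we already hold
	 *    struct_mutex: mutex_trylock_recursive() reports
	 *    MUTEX_TRYLOCK_RECURSIVE and we reuse the held lock;
	 * 2) reclaim from another process, and
	 * 3) kswapd: neither can already hold struct_mutex, so when the
	 *    trylock fails (and we must make progress) we take the lock
	 *    under a different lockdep subclass:
	 */
	mutex_lock_nested(&i915->drm.struct_mutex, I915_MM_SHRINKER);

Lockdep then tracks that acquisition separately from the ordinary
struct_mutex acquisitions, which is what stops it complaining.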
-Chris

* Re: [PATCH 2/3] drm/i915: Return immediately if trylock fails for direct-reclaim
  2018-11-09 11:44         ` Chris Wilson
@ 2018-11-13 10:24           ` Tvrtko Ursulin
  2018-11-13 17:10             ` Chris Wilson
  0 siblings, 1 reply; 19+ messages in thread
From: Tvrtko Ursulin @ 2018-11-13 10:24 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 09/11/2018 11:44, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2018-11-09 07:30:34)
>>
>> On 08/11/2018 16:48, Chris Wilson wrote:
>>> Quoting Tvrtko Ursulin (2018-11-08 16:23:08)
>>>>
>>>> On 08/11/2018 08:17, Chris Wilson wrote:
>>>>> Ignore trying to shrink from i915 if we fail to acquire the struct_mutex
>>>>> in the shrinker while performing direct-reclaim. The trade-off being
>>>>> (much) lower latency for non-i915 clients at an increased risk of being
>>>>> unable to obtain a page from direct-reclaim without hitting the
>>>>> oom-notifier. The proviso being that we still keep trying to hard
>>>>> obtain the lock for oom so that we can reap under heavy memory pressure.
>>>>>
>>>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>>>> ---
>>>>>     drivers/gpu/drm/i915/i915_gem_shrinker.c | 24 +++++++++++-------------
>>>>>     1 file changed, 11 insertions(+), 13 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/i915/i915_gem_shrinker.c b/drivers/gpu/drm/i915/i915_gem_shrinker.c
>>>>> index ea90d3a0d511..d461f458f4af 100644
>>>>> --- a/drivers/gpu/drm/i915/i915_gem_shrinker.c
>>>>> +++ b/drivers/gpu/drm/i915/i915_gem_shrinker.c
>>>>> @@ -36,7 +36,9 @@
>>>>>     #include "i915_drv.h"
>>>>>     #include "i915_trace.h"
>>>>>     
>>>>> -static bool shrinker_lock(struct drm_i915_private *i915, bool *unlock)
>>>>> +static bool shrinker_lock(struct drm_i915_private *i915,
>>>>> +                       unsigned int flags,
>>>>> +                       bool *unlock)
>>>>>     {
>>>>>         switch (mutex_trylock_recursive(&i915->drm.struct_mutex)) {
>>>>>         case MUTEX_TRYLOCK_RECURSIVE:
>>>>> @@ -45,15 +47,11 @@ static bool shrinker_lock(struct drm_i915_private *i915, bool *unlock)
>>>>>     
>>>>>         case MUTEX_TRYLOCK_FAILED:
>>>>>                 *unlock = false;
>>>>> -             preempt_disable();
>>>>> -             do {
>>>>> -                     cpu_relax();
>>>>> -                     if (mutex_trylock(&i915->drm.struct_mutex)) {
>>>>> -                             *unlock = true;
>>>>> -                             break;
>>>>> -                     }
>>>>> -             } while (!need_resched());
>>>>> -             preempt_enable();
>>>>> +             if (flags & I915_SHRINK_ACTIVE) {
>>>>
>>>> So until I915_SHRINK_ACTIVE, which is the last ditch attempt to shrink
>>>> in the normal case (direct reclaim?) or oom, we bail out on the first
>>>> sign of struct mutex contention. Doesn't this make our shrinker much
>>>> less effective at runtime and why is that OK?
>>>
>>> As I said, it's a tradeoff between blocking others for _several_
>>> _seconds_ and making no progress and returning immediately and making no
>>> progress. My argument is along the lines of if direct-reclaim is running
>>> in another process and something else is engaged in the driver hopefully
>>> the driver will be cleaning up as it goes along or else what remains is
>>> active and won't be reaped anyway. If direct reclaim is failing, the
>>> delay before trying the oom path is insignificant.
>>
>> What was the rationale behind busy looping there btw?
> 
> Emulating the optimistic spin for mutex (my patches to expose it from
> kernel/locking were kept hidden for public decency). My thinking was the
> exact opposite to this patch, that direct reclaim was of paramount
> importance and spending the time to try and ensure we grabbed the
> struct_mutex to search for some pages to free was preferable.
> 
> It's just on the basis of looking at the actual syslatency and realising
> the cause is this spinner, I want to swing the axe in other direction.
> 
> (There's probably a compromise, but honestly I'd prefer to sell the
> struct_mutex free version of the shrinker first :)
> 
>> Compared to
>> perhaps an alternative of micro-sleeps and trying a few times? I know it
>> would be opposite from what this patch is trying to achieve, I Just
>> don't had a good judgment on what makes most sense for the shrinker. Is
>> it better to perhaps try a little bit harder instead of giving up
>> immediately, but try a little bit harder in a softer way? Or that ends
>> up blocking the callers and has the same effect of making no progress?
> 
> Exactly. We can definitely measure the impact of the spinner on
> unrelated processes, but detecting the premature allocation failure is
> harder (we wait for more dmesg-warns). The compromise that I've tried to
> reach here is that if direct-reclaim isn't enough, then we should still
> try hard to grab the struct_mutex. (That leaves __GFP_RETRY_MAYFAIL
> vulnerable to not shrinking i915, but a worthwhile compromise as it's
> allowed to fail?)
> 
>>>> Or in other words, for what use cases, tests or benchmark was the
>>>> existing approach of busy looping a problem?
>>>
>>> Do something like 'find / -exec cat' while running i915 and see how long
>>> you have to wait for a khungtaskd :|
>>
>> I couldn't reproduce anything strange with this. Assuming you meant
>> something like -exec cat { } \; >dev/null.
>>
>> Either case I think explanations like this should go into the commit
>> message.
> 
> Weird, I spent so long over the last few weeks talking about the impact
> on gem_syslatency, I thought it was mentioned here.
> 
>>> gem_syslatency reports max latencies of over 600s, and I'm sure it's
>>> pretty much unbounded. It's bad enough that I expect we are the cause of
>>> significant hitching (mainly in other tasks) on the desktop running at
>>> memory capacity.
>>
>> Running gem_syslatency in parallel to the find, or gem_syslatency -b in
>> parallel or not did not do anything.
> 
> gem_syslatency -b -m
> 
> +------------------------------------------------------------------------+
> |      x                                                                 |
> |      x                                                                 |
> .
> .
> .
> |      x                                                                 |
> |      x xxx      xx x           x  x      x                            x|
> ||_____M_A_______|                                                       |
> +------------------------------------------------------------------------+
>      N           Min           Max        Median           Avg        Stddev
> x 120       2284928  6.752085e+08       3385097      20825362      80352645
> 
> Unpatched, and similar have been observed on ivb/byt/bdw/bsw/skl/kbl and
> while we were fighting ksoftirqd. 675s maximum latency for a RT thread to wake up.
> 
> Patched, max is around 261ms and order of magnitude better than the best
> previously!

Yes I have reproduced the ~20x better max latency myself. Mean latency 
stays roughly the same but has less variance. From that point of view 
this is very good.

> 
>> Then I tries gem_shrink but that doesn't seem to be making any progress
>> with or without the patch. But it's not even hitting the i915 hard, but
>> perhaps I need to turn off lockdep.. Which would be a bummer since I
>> wanted to have a lockdep to check the next bit..
> 
> gem_shrink never has completed... It's not quite the right test for this
> as we need to hit the shrinker & oom on both i915 and non-i915 paths to
> try and hit all the different lock chains.

Ok. :)

> 
>>>>> +                     mutex_lock_nested(&i915->drm.struct_mutex,
>>>>> +                                       I915_MM_SHRINKER);
>>>>
>>>> _nested is needed since abandoning trylock would otherwise cause lockdep
>>>> issues? But is it really safe? I don't know.. how can it be? It is
>>>> guaranteed to be a different process here otherwise the result wouldn't
>>>> be MUTEX_TRYLOCK_FAILED.
>>>
>>> The easy option was to only force the mutex_lock for kswapd. However,
>>> noting that we do need to forcibly shrink before oom, I opted for the
>>> more inclusive attempt to take it in both. For the oom approach, my only
>>> handwaving is we shouldn't be nested under oom serialisation and so
>>> avoid an ABBA deadlock. It's survived some oom abuse, but the machine
>>> still becomes unresponsive (but pingable) eventually (as it does before
>>> the patch).
>>
>> ...which I completely did not understand! :(
>>
>> When you say kswapd - that means mutex->owner != current in the shrinker
>> and otherwise it is direct reclaim?
>>
>> If true, how is the nested annotation correct for the kswapd case?
> 
> We have 3 paths,
> direct-reclaim / oom from under i915 + struct_mutex
> > direct-reclaim / oom from elsewhere
> > kswapd

Is the kernel guaranteeing that only one reclaim path at a time can 
happen? In other words, if it is doing direct reclaim, kswapd won't 
enter the same shrinking notifier and vice versa?

> 
> The nested annotation doesn't mean that the lock is nested per-se, it
> just means that here it is acting as a different subclass of the lock.
> Since the kswapd path can't have struct_mutex, it is safe from being
> confused. The complication is if there is serialisation on
> direct-reclaim (which I think can be run concurrently, i.e. no
> mutex/semaphore nesting) or oom. For oom, there is a lock but afaict the
> guard is just a trylock so shouldn't cause an ABBA deadlock if we oom
> under struct_mutex while fighting an oom in another process.

When you say direct reclaim can run concurrently, in that case we would 
have one thread potentially holding the struct_mutex and entering the 
shrinker. A second thread would grab the mutex in the shrinker with the 
nested annotation. But that is not nesting, it is overlapping. And you 
say nested is effectively a hack to enable different lock classes.

Actually in this case it is the same situation for any two simultaneous 
threads in direct reclaim, even if none holds struct mutex on entry. But 
in this case the lockdep annotation is also not needed.

Hm.. maybe to turn the question on its head.. if you had a normal 
mutex_lock in there, what deadlock would lockdep notice?

Regards,

Tvrtko

* Re: [PATCH 2/3] drm/i915: Return immediately if trylock fails for direct-reclaim
  2018-11-13 10:24           ` Tvrtko Ursulin
@ 2018-11-13 17:10             ` Chris Wilson
  2018-11-15 11:39               ` Tvrtko Ursulin
  0 siblings, 1 reply; 19+ messages in thread
From: Chris Wilson @ 2018-11-13 17:10 UTC (permalink / raw)
  To: Tvrtko Ursulin, intel-gfx

Quoting Tvrtko Ursulin (2018-11-13 10:24:43)
> 
> On 09/11/2018 11:44, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2018-11-09 07:30:34)
> >>
> >> On 08/11/2018 16:48, Chris Wilson wrote:
> >>> Quoting Tvrtko Ursulin (2018-11-08 16:23:08)
> >>>>
> >>>> On 08/11/2018 08:17, Chris Wilson wrote:
> >>>>> Ignore trying to shrink from i915 if we fail to acquire the struct_mutex
> >>>>> in the shrinker while performing direct-reclaim. The trade-off being
> >>>>> (much) lower latency for non-i915 clients at an increased risk of being
> >>>>> unable to obtain a page from direct-reclaim without hitting the
> >>>>> oom-notifier. The proviso being that we still keep trying to hard
> >>>>> obtain the lock for oom so that we can reap under heavy memory pressure.
> >>>>>
> >>>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> >>>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >>>>> ---
> >>>>>     drivers/gpu/drm/i915/i915_gem_shrinker.c | 24 +++++++++++-------------
> >>>>>     1 file changed, 11 insertions(+), 13 deletions(-)
> >>>>>
> >>>>> diff --git a/drivers/gpu/drm/i915/i915_gem_shrinker.c b/drivers/gpu/drm/i915/i915_gem_shrinker.c
> >>>>> index ea90d3a0d511..d461f458f4af 100644
> >>>>> --- a/drivers/gpu/drm/i915/i915_gem_shrinker.c
> >>>>> +++ b/drivers/gpu/drm/i915/i915_gem_shrinker.c
> >>>>> @@ -36,7 +36,9 @@
> >>>>>     #include "i915_drv.h"
> >>>>>     #include "i915_trace.h"
> >>>>>     
> >>>>> -static bool shrinker_lock(struct drm_i915_private *i915, bool *unlock)
> >>>>> +static bool shrinker_lock(struct drm_i915_private *i915,
> >>>>> +                       unsigned int flags,
> >>>>> +                       bool *unlock)
> >>>>>     {
> >>>>>         switch (mutex_trylock_recursive(&i915->drm.struct_mutex)) {
> >>>>>         case MUTEX_TRYLOCK_RECURSIVE:
> >>>>> @@ -45,15 +47,11 @@ static bool shrinker_lock(struct drm_i915_private *i915, bool *unlock)
> >>>>>     
> >>>>>         case MUTEX_TRYLOCK_FAILED:
> >>>>>                 *unlock = false;
> >>>>> -             preempt_disable();
> >>>>> -             do {
> >>>>> -                     cpu_relax();
> >>>>> -                     if (mutex_trylock(&i915->drm.struct_mutex)) {
> >>>>> -                             *unlock = true;
> >>>>> -                             break;
> >>>>> -                     }
> >>>>> -             } while (!need_resched());
> >>>>> -             preempt_enable();
> >>>>> +             if (flags & I915_SHRINK_ACTIVE) {
> >>>>
> >>>> So until I915_SHRINK_ACTIVE, which is the last ditch attempt to shrink
> >>>> in the normal case (direct reclaim?) or oom, we bail out on the first
> >>>> sign of struct mutex contention. Doesn't this make our shrinker much
> >>>> less effective at runtime and why is that OK?
> >>>
> >>> As I said, it's a tradeoff between blocking others for _several_
> >>> _seconds_ and making no progress and returning immediately and making no
> >>> progress. My argument is along the lines of if direct-reclaim is running
> >>> in another process and something else is engaged in the driver hopefully
> >>> the driver will be cleaning up as it goes along or else what remains is
> >>> active and won't be reaped anyway. If direct reclaim is failing, the
> >>> delay before trying the oom path is insignificant.
> >>
> >> What was the rationale behind busy looping there btw?
> > 
> > Emulating the optimistic spin for mutex (my patches to expose it from
> > kernel/locking were kept hidden for public decency). My thinking was the
> > exact opposite to this patch, that direct reclaim was of paramount
> > importance and spending the time to try and ensure we grabbed the
> > struct_mutex to search for some pages to free was preferable.
> > 
> > It's just on the basis of looking at the actual syslatency and realising
> > the cause is this spinner, I want to swing the axe in other direction.
> > 
> > (There's probably a compromise, but honestly I'd prefer to sell the
> > struct_mutex free version of the shrinker first :)
> > 
> >> Compared to
> >> perhaps an alternative of micro-sleeps and trying a few times? I know it
> >> would be opposite from what this patch is trying to achieve, I Just
> >> don't had a good judgment on what makes most sense for the shrinker. Is
> >> it better to perhaps try a little bit harder instead of giving up
> >> immediately, but try a little bit harder in a softer way? Or that ends
> >> up blocking the callers and has the same effect of making no progress?
> > 
> > Exactly. We can definitely measure the impact of the spinner on
> > unrelated processes, but detecting the premature allocation failure is
> > harder (we wait for more dmesg-warns). The compromise that I've tried to
> > reach here is that if direct-reclaim isn't enough, then we should still
> > try hard to grab the struct_mutex. (That leaves __GFP_RETRY_MAYFAIL
> > vulnerable to not shrinking i915, but a worthwhile compromise as it's
> > allowed to fail?)
> > 
> >>>> Or in other words, for what use cases, tests or benchmark was the
> >>>> existing approach of busy looping a problem?
> >>>
> >>> Do something like 'find / -exec cat' while running i915 and see how long
> >>> you have to wait for a khungtaskd :|
> >>
> >> I couldn't reproduce anything strange with this. Assuming you meant
> >> something like -exec cat { } \; >dev/null.
> >>
> >> Either case I think explanations like this should go into the commit
> >> message.
> > 
> > Weird, I spent so long over the last few weeks talking about the impact
> > on gem_syslatency, I thought it was mentioned here.
> > 
> >>> gem_syslatency reports max latencies of over 600s, and I'm sure it's
> >>> pretty much unbounded. It's bad enough that I expect we are the cause of
> >>> significant hitching (mainly in other tasks) on the desktop running at
> >>> memory capacity.
> >>
> >> Running gem_syslatency in parallel to the find, or gem_syslatency -b in
> >> parallel or not did not do anything.
> > 
> > gem_syslatency -b -m
> > 
> > +------------------------------------------------------------------------+
> > |      x                                                                 |
> > |      x                                                                 |
> > .
> > .
> > .
> > |      x                                                                 |
> > |      x xxx      xx x           x  x      x                            x|
> > ||_____M_A_______|                                                       |
> > +------------------------------------------------------------------------+
> >      N           Min           Max        Median           Avg        Stddev
> > x 120       2284928  6.752085e+08       3385097      20825362      80352645
> > 
> > Unpatched, and similar have been observed on ivb/byt/bdw/bsw/skl/kbl and
> > while we were fighting ksoftirqd. 675s maximum latency for a RT thread to wake up.
> > 
> > Patched, max is around 261ms and order of magnitude better than the best
> > previously!
> 
> Yes I have reproduced the ~20x better max latency myself. Mean latency 
> stays roughly the same but has less variance. From that point of view 
> this is very good.
> 
> > 
> >> Then I tries gem_shrink but that doesn't seem to be making any progress
> >> with or without the patch. But it's not even hitting the i915 hard, but
> >> perhaps I need to turn off lockdep.. Which would be a bummer since I
> >> wanted to have a lockdep to check the next bit..
> > 
> > gem_shrink never has completed... It's not quite the right test for this
> > as we need to hit the shrinker & oom on both i915 and non-i915 paths to
> > try and hit all the different lock chains.
> 
> Ok. :)
> 
> > 
> >>>>> +                     mutex_lock_nested(&i915->drm.struct_mutex,
> >>>>> +                                       I915_MM_SHRINKER);
> >>>>
> >>>> _nested is needed since abandoning trylock would otherwise cause lockdep
> >>>> issues? But is it really safe? I don't know.. how can it be? It is
> >>>> guaranteed to be a different process here otherwise the result wouldn't
> >>>> be MUTEX_TRYLOCK_FAILED.
> >>>
> >>> The easy option was to only force the mutex_lock for kswapd. However,
> >>> noting that we do need to forcibly shrink before oom, I opted for the
> >>> more inclusive attempt to take it in both. For the oom approach, my only
> >>> handwaving is we shouldn't be nested under oom serialisation and so
> >>> avoid an ABBA deadlock. It's survived some oom abuse, but the machine
> >>> still becomes unresponsive (but pingable) eventually (as it does before
> >>> the patch).
> >>
> >> ...which I completely did not understand! :(
> >>
> >> When you say kswapd - that means mutex->owner != current in the shrinker
> >> and otherwise it is direct reclaim?
> >>
> >> If true, how is the nested annotation correct for the kswapd case?
> > 
> > We have 3 paths,
> > direct-reclaim / oom from under i915 + struct_mutex
> > > direct-reclaim / oom from elsewhere
> > > kswapd
> 
> Is the kernel guaranteeing only one reclaim path at a time can happen? 
> In other words if it is doing direct reclaim, kswapd won't enter the 
> same shrinking notifier and vice versa?

direct-reclaim is parallelised; shrink_slab() is only guarded by a
read lock (to prevent the shrinker lists being modified). Multiple
callers into direct-reclaim are allowed, and a caller is not always
guaranteed to obtain the pages it itself freed. Afaik, there's no lock
there to cause contention and inversion for us.
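
If I remember the core correctly, shrink_slab() is roughly (paraphrased
from memory, not the exact code):

	if (!down_read_trylock(&shrinker_rwsem))
		goto out;
	list_for_each_entry(shrinker, &shrinker_list, list)
		do_shrink_slab(&sc, shrinker, priority);
	up_read(&shrinker_rwsem);

i.e. only the rwsem read side, so several direct reclaimers can be
inside our shrinker callbacks at the same time.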

> > The nested annotation doesn't mean that the lock is nested per-se, it
> > just means that here it is acting as a different subclass of the lock.
> > Since the kswapd path can't have struct_mutex, it is safe from being
> > confused. The complication is if there is serialisation on
> > direct-reclaim (which I think can be run concurrently, i.e. no
> > mutex/semaphore nesting) or oom. For oom, there is a lock but afaict the
> > guard is just a trylock so shouldn't cause an ABBA deadlock if we oom
> > under struct_mutex while fighting an oom in another process.
> 
> When you say direct reclaim can run concurrently, in that case we would 
> have one thread potentially holding the struct_mutex and entering 
> shrinker. Second thread would grab the mutex in the shrinker with the 
> nested annotation. But it is not nested but overlapping. And you say 
> nested is effectively a hack to enable different lock classes.
> 
> Actually in this case it is the same situation for any two simultaneous 
> threads in direct reclaim, even if none holds struct mutex on entry. But 
> in this case the lockdep annotation is also not needed.
> 
> Hm.. maybe to turn the question on its head.. if you had a normal
> mutex_lock in there, what deadlock would lockdep notice?

The direct-reclaim paths are explicitly marked up so as to avoid taking
a mutex inside the shrinker that may be held while allocating. That's
the contention we avoid by special casing the recursive mutex; but it
also means we can't use a plain mutex_lock() for the cases where we are
happy that we won't deadlock. Hence why previously it was a trylock to
avoid the same lockdep warning.
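
To make it concrete, these are the two orderings lockdep ends up combining
(illustrative sketch only; the function names are invented and are not the
real i915 call chains):

/*
 * Allocating while holding struct_mutex records the ordering
 * struct_mutex -> fs_reclaim via the reclaim annotations.
 */
static void alloc_under_struct_mutex(struct drm_i915_private *i915)
{
	void *p;

	mutex_lock(&i915->drm.struct_mutex);
	p = kmalloc(PAGE_SIZE, GFP_KERNEL);
	kfree(p);
	mutex_unlock(&i915->drm.struct_mutex);
}

/*
 * The shrinker runs inside reclaim, so a plain lock here records the
 * opposite ordering, fs_reclaim -> struct_mutex, and lockdep reports
 * the cycle even if the two paths never deadlock in practice.
 */
static unsigned long scan_under_reclaim(struct shrinker *shrinker,
					struct shrink_control *sc)
{
	struct drm_i915_private *i915 =
		container_of(shrinker, struct drm_i915_private, mm.shrinker);

	mutex_lock(&i915->drm.struct_mutex);
	/* reap the bound/unbound object lists */
	mutex_unlock(&i915->drm.struct_mutex);

	return 0;
}
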
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/3] drm/i915: Return immediately if trylock fails for direct-reclaim
  2018-11-13 17:10             ` Chris Wilson
@ 2018-11-15 11:39               ` Tvrtko Ursulin
  0 siblings, 0 replies; 19+ messages in thread
From: Tvrtko Ursulin @ 2018-11-15 11:39 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 13/11/2018 17:10, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2018-11-13 10:24:43)
>>
>> On 09/11/2018 11:44, Chris Wilson wrote:
>>> Quoting Tvrtko Ursulin (2018-11-09 07:30:34)
>>>>
>>>> On 08/11/2018 16:48, Chris Wilson wrote:
>>>>> Quoting Tvrtko Ursulin (2018-11-08 16:23:08)
>>>>>>
>>>>>> On 08/11/2018 08:17, Chris Wilson wrote:
>>>>>>> Ignore trying to shrink from i915 if we fail to acquire the struct_mutex
>>>>>>> in the shrinker while performing direct-reclaim. The trade-off being
>>>>>>> (much) lower latency for non-i915 clients at an increased risk of being
>>>>>>> unable to obtain a page from direct-reclaim without hitting the
>>>>>>> oom-notifier. The proviso being that we still keep trying hard to
>>>>>>> obtain the lock for oom so that we can reap under heavy memory pressure.
>>>>>>>
>>>>>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>>>>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>>>>>> ---
>>>>>>>      drivers/gpu/drm/i915/i915_gem_shrinker.c | 24 +++++++++++-------------
>>>>>>>      1 file changed, 11 insertions(+), 13 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/i915/i915_gem_shrinker.c b/drivers/gpu/drm/i915/i915_gem_shrinker.c
>>>>>>> index ea90d3a0d511..d461f458f4af 100644
>>>>>>> --- a/drivers/gpu/drm/i915/i915_gem_shrinker.c
>>>>>>> +++ b/drivers/gpu/drm/i915/i915_gem_shrinker.c
>>>>>>> @@ -36,7 +36,9 @@
>>>>>>>      #include "i915_drv.h"
>>>>>>>      #include "i915_trace.h"
>>>>>>>      
>>>>>>> -static bool shrinker_lock(struct drm_i915_private *i915, bool *unlock)
>>>>>>> +static bool shrinker_lock(struct drm_i915_private *i915,
>>>>>>> +                       unsigned int flags,
>>>>>>> +                       bool *unlock)
>>>>>>>      {
>>>>>>>          switch (mutex_trylock_recursive(&i915->drm.struct_mutex)) {
>>>>>>>          case MUTEX_TRYLOCK_RECURSIVE:
>>>>>>> @@ -45,15 +47,11 @@ static bool shrinker_lock(struct drm_i915_private *i915, bool *unlock)
>>>>>>>      
>>>>>>>          case MUTEX_TRYLOCK_FAILED:
>>>>>>>                  *unlock = false;
>>>>>>> -             preempt_disable();
>>>>>>> -             do {
>>>>>>> -                     cpu_relax();
>>>>>>> -                     if (mutex_trylock(&i915->drm.struct_mutex)) {
>>>>>>> -                             *unlock = true;
>>>>>>> -                             break;
>>>>>>> -                     }
>>>>>>> -             } while (!need_resched());
>>>>>>> -             preempt_enable();
>>>>>>> +             if (flags & I915_SHRINK_ACTIVE) {
>>>>>>
>>>>>> So until I915_SHRINK_ACTIVE, which is the last ditch attempt to shrink
>>>>>> in the normal case (direct reclaim?) or oom, we bail out on the first
>>>>>> sign of struct mutex contention. Doesn't this make our shrinker much
>>>>>> less effective at runtime and why is that OK?
>>>>>
>>>>> As I said, it's a tradeoff between blocking others for _several_
>>>>> _seconds_ and making no progress and returning immediately and making no
>>>>> progress. My argument is along the lines of if direct-reclaim is running
>>>>> in another process and something else is engaged in the driver hopefully
>>>>> the driver will be cleaning up as it goes along or else what remains is
>>>>> active and won't be reaped anyway. If direct reclaim is failing, the
>>>>> delay before trying the oom path is insignificant.
>>>>
>>>> What was the rationale behind busy looping there btw?
>>>
>>> Emulating the optimistic spin for mutex (my patches to expose it from
>>> kernel/locking were kept hidden for public decency). My thinking was the
>>> exact opposite to this patch, that direct reclaim was of paramount
>>> importance and spending the time to try and ensure we grabbed the
>>> struct_mutex to search for some pages to free was preferable.
>>>
>>> It's just that, on the basis of looking at the actual syslatency and realising
>>> the cause is this spinner, I want to swing the axe in the other direction.
>>>
>>> (There's probably a compromise, but honestly I'd prefer to sell the
>>> struct_mutex free version of the shrinker first :)
>>>
>>>> Compared to
>>>> perhaps an alternative of micro-sleeps and trying a few times? I know it
>>>> would be the opposite of what this patch is trying to achieve, I just
>>>> don't have good judgment on what makes most sense for the shrinker. Is
>>>> it better to perhaps try a little bit harder instead of giving up
>>>> immediately, but try a little bit harder in a softer way? Or that ends
>>>> up blocking the callers and has the same effect of making no progress?
>>>
>>> Exactly. We can definitely measure the impact of the spinner on
>>> unrelated processes, but detecting the premature allocation failure is
>>> harder (we wait for more dmesg-warns). The compromise that I've tried to
>>> reach here is that if direct-reclaim isn't enough, then we should still
>>> try hard to grab the struct_mutex. (That leaves __GFP_RETRY_MAYFAIL
>>> vulnerable to not shrinking i915, but a worthwhile compromise as it's
>>> allowed to fail?)
>>>
>>>>>> Or in other words, for what use cases, tests or benchmark was the
>>>>>> existing approach of busy looping a problem?
>>>>>
>>>>> Do something like 'find / -exec cat' while running i915 and see how long
>>>>> you have to wait for a khungtaskd :|
>>>>
>>>> I couldn't reproduce anything strange with this. Assuming you meant
>>>> something like -exec cat {} \; > /dev/null.
>>>>
>>>> Either case I think explanations like this should go into the commit
>>>> message.
>>>
>>> Weird, I spent so long over the last few weeks talking about the impact
>>> on gem_syslatency, I thought it was mentioned here.
>>>
>>>>> gem_syslatency reports max latencies of over 600s, and I'm sure it's
>>>>> pretty much unbounded. It's bad enough that I expect we are the cause of
>>>>> significant hitching (mainly in other tasks) on the desktop running at
>>>>> memory capacity.
>>>>
>>>> Running gem_syslatency in parallel to the find, or gem_syslatency -b in
>>>> parallel or not did not do anything.
>>>
>>> gem_syslatency -b -m
>>>
>>> +------------------------------------------------------------------------+
>>> |      x                                                                 |
>>> |      x                                                                 |
>>> .
>>> .
>>> .
>>> |      x                                                                 |
>>> |      x xxx      xx x           x  x      x                            x|
>>> ||_____M_A_______|                                                       |
>>> +------------------------------------------------------------------------+
>>>       N           Min           Max        Median           Avg        Stddev
>>> x 120       2284928  6.752085e+08       3385097      20825362      80352645
>>>
>>> Unpatched; similar results have been observed on ivb/byt/bdw/bsw/skl/kbl, and
>>> while we were fighting ksoftirqd. 675s maximum latency for an RT thread to wake up.
>>>
>>> Patched, max is around 261ms, an order of magnitude better than the best
>>> previously!
>>
>> Yes I have reproduced the ~20x better max latency myself. Mean latency
>> stays roughly the same but has less variance. From that point of view
>> this is very good.
>>
>>>
>>>> Then I tried gem_shrink but that doesn't seem to be making any progress
>>>> with or without the patch. But it's not even hitting the i915 hard, but
>>>> perhaps I need to turn off lockdep.. Which would be a bummer since I
>>>> wanted to have a lockdep to check the next bit..
>>>
>>> gem_shrink has never completed... It's not quite the right test for this
>>> as we need to hit the shrinker & oom on both i915 and non-i915 paths to
>>> try and hit all the different lock chains.
>>
>> Ok. :)
>>
>>>
>>>>>>> +                     mutex_lock_nested(&i915->drm.struct_mutex,
>>>>>>> +                                       I915_MM_SHRINKER);
>>>>>>
>>>>>> _nested is needed since abandoning trylock would otherwise cause lockdep
>>>>>> issues? But is it really safe? I don't know.. how can it be? It is
>>>>>> guaranteed to be a different process here otherwise the result wouldn't
>>>>>> be MUTEX_TRYLOCK_FAILED.
>>>>>
>>>>> The easy option was to only force the mutex_lock for kswapd. However,
>>>>> noting that we do need to forcibly shrink before oom, I opted for the
>>>>> more inclusive attempt to take it in both. For the oom approach, my only
>>>>> handwaving is we shouldn't be nested under oom serialisation and so
>>>>> avoid an ABBA deadlock. It's survived some oom abuse, but the machine
>>>>> still becomes unresponsive (but pingable) eventually (as it does before
>>>>> the patch).
>>>>
>>>> ...which I completely did not understand! :(
>>>>
>>>> When you say kswapd - that means mutex->owner != current in the shrinker
>>>> and otherwise it is direct reclaim?
>>>>
>>>> If true, how is the nested annotation correct for the kswapd case?
>>>
>>> We have 3 paths,
>>> direct-reclaim / oom from under i915 + struct_mutex
>>> direct-reclaim / oom from elsewhere
>>> kswapd
>>
>> Is the kernel guaranteeing only one reclaim path at a time can happen?
>> In other words if it is doing direct reclaim, kswapd won't enter the
>> same shrinking notifier and vice versa?
> 
> direct-reclaim is parallelised; the shrink_slab is only guarded by a
> read lock (to prevent the shrinker lists being modified). Multiple
> callers into direct-reclaim are allowed, and the caller is not always
> guaranteed to obtain the pages they themselves freed. Afaik, there's
> no lock there to cause contention and inversion for us.
> 
>>> The nested annotation doesn't mean that the lock is nested per-se, it
>>> just means that here it is acting as a different subclass of the lock.
>>> Since the kswapd path can't have struct_mutex, it is safe from being
>>> confused. The complication is if there is serialisation on
>>> direct-reclaim (which I think can be run concurrently, i.e. no
>>> mutex/semaphore nesting) or oom. For oom, there is a lock but afaict the
>>> guard is just a trylock so shouldn't cause an ABBA deadlock if we oom
>>> under struct_mutex while fighting an oom in another process.
>>
>> When you say direct reclaim can run concurrently, in that case we would
>> have one thread potentially holding the struct_mutex and entering
>> shrinker. Second thread would grab the mutex in the shrinker with the
>> nested annotation. But it is not nested but overlapping. And you say
>> nested is effectively a hack to enable different lock classes.
>>
>> Actually in this case it is the same situation for any two simultaneous
>> threads in direct reclaim, even if none holds struct mutex on entry. But
>> in this case the lockdep annotation is also not needed.
>>
>> Hm.. maybe to turn the question on its head.. if you had a normal
>> mutex_lock in there, what deadlock would lockdep notice?
> 
> The direct-reclaim paths are explicitly marked up so as to avoid taking
> a mutex inside the shrinker that may be held while allocating. That's
> the contention we avoid by special casing the recursive mutex; but it
> also means we can't use a plain mutex_lock() for the cases where we are
> happy that we won't deadlock. Hence why previously it was a trylock to
> avoid the same lockdep warning.

Should we now drop the use of mutex_trylock_recursive and employ the owner
trick directly in the shrinker? More or less like this:

static bool shrinker_lock(struct drm_i915_private *i915, bool *unlock)
{
	struct mutex *lock = &i915->drm.struct_mutex;

	/* Recursion from a path that already holds struct_mutex. */
	if (unlikely(__mutex_owner(lock) == current)) {
		*unlock = false;
		return true;
	}

	mutex_lock_nested(lock, I915_MM_SHRINKER);
	*unlock = true;

	return true;
}
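
The flags/I915_SHRINK_ACTIVE handling from your patch is elided in that
sketch; callers would keep pairing it with the existing unlock helper,
roughly:

static void shrinker_unlock(struct drm_i915_private *i915, bool unlock)
{
	if (unlock)
		mutex_unlock(&i915->drm.struct_mutex);
}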

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2018-11-15 11:39 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-11-08  8:17 [PATCH 1/3] drm/i915/execlists: Force write serialisation into context image vs execution Chris Wilson
2018-11-08  8:17 ` [PATCH 2/3] drm/i915: Return immediately if trylock fails for direct-reclaim Chris Wilson
2018-11-08 16:23   ` Tvrtko Ursulin
2018-11-08 16:48     ` Chris Wilson
2018-11-09  7:30       ` Tvrtko Ursulin
2018-11-09 11:44         ` Chris Wilson
2018-11-13 10:24           ` Tvrtko Ursulin
2018-11-13 17:10             ` Chris Wilson
2018-11-15 11:39               ` Tvrtko Ursulin
2018-11-08  8:17 ` [PATCH 3/3] drm/i915/userptr: Avoid struct_mutex recursion for mmu_invalidate_range_start Chris Wilson
2018-11-08 16:45   ` Tvrtko Ursulin
2018-11-08  8:33 ` ✗ Fi.CI.CHECKPATCH: warning for series starting with [1/3] drm/i915/execlists: Force write serialisation into context image vs execution Patchwork
2018-11-08  9:07 ` ✓ Fi.CI.BAT: success " Patchwork
2018-11-08 12:00 ` [PATCH 1/3] " Mika Kuoppala
2018-11-08 12:11   ` Chris Wilson
2018-11-08 12:13     ` Mika Kuoppala
2018-11-08 12:26       ` Chris Wilson
2018-11-08 13:38     ` Chris Wilson
2018-11-08 19:51 ` ✓ Fi.CI.IGT: success for series starting with [1/3] " Patchwork
