All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] drm/i915: Fix unhandled deadlock in grab_vma()
@ 2022-11-10  5:31 ` Mani Milani
  0 siblings, 0 replies; 19+ messages in thread
From: Mani Milani @ 2022-11-10  5:31 UTC (permalink / raw)
  To: LKML
  Cc: Tvrtko Ursulin, Maarten Lankhorst, Thomas Hellström,
	Mani Milani, Chris Wilson, Christian König, Daniel Vetter,
	David Airlie, Jani Nikula, Joonas Lahtinen, Matthew Auld,
	Niranjana Vishwanathapura, Nirmoy Das, Rodrigo Vivi, dri-devel,
	intel-gfx

At present, the gpu thread crashes at times when grab_vma() attempts to
acquire a gem object lock when in a deadlock state.

Problems:
I identified the following 4 issues in the current code:
1. Since grab_vma() calls i915_gem_object_trylock(), which consequently
   calls ww_mutex_trylock(), to acquire lock, it does not perform any
   -EDEADLK handling; And -EALREADY handling is also unreliable,
   according to the description of ww_mutex_trylock().
2. Since the return value of grab_vma() is a boolean showing
   success/failure, it does not provide any extra information on the
   failure reason, and therefore does not provide any mechanism to its
   caller to take any action to fix a potential deadlock.
3. Current grab_vma() implementation produces inconsistent behaviour
   depending on the refcount value, without informing the caller. If
   refcount is already zero, grab_vma() neither acquires lock nor
   increments the refcount, but still returns 'true' for success! This
   means that grab_vma() returning true (for success) does not always
   mean that the gem obj is actually safely accessible.
4. Currently, calling "i915_gem_object_lock(obj,ww)" is meant to be
   followed by a consequent "i915_gem_object_unlock(obj)" ONLY if the
   original 'ww' object pointer was NULL, or otherwise not be called and
   leave the houskeeping to "i915_gem_ww_ctx_fini(ww)". There are a few
   issues with this:
   - This is not documented anywhere in the code (that I could find),
     but only explained in an older commit message.
   - This produces an inconsistent usage of the lock/unlock functions,
     increasing the chance of mistakes and issues.
   - This is not a clean design as it requires any new code that calls
     these lock/unlock functions to know their internals, as well as the
     internals of the functions calling the new code being added.

Fix:
To fix the issues above, this patch:
1. Changes grab_vma() to call i915_gem_object_lock() instead of
   i915_gem_object_trylock(), to handle -EDEADLK and -EALREADY cases.
   This should not cause any issue since the PIN_NONBLOCK flag is
   checked beforehand in the 2 cases grab_vma() is called.
2. Changes grab_vma() to return the actual error code, instead of bool.
3. Changes grab_vma() to behave consistently when returning success, by
   both incrementing the refcount and acquiring lock at all times.
4. Changes i915_gem_object_unlock() to pair with i915_gem_object_lock()
   nicely in all cases and do the housekeeping without the need for the
   caller to do anything other than simply calling lock and unlock.
5. Ensures the gem obj->obj_link is initialized and deleted from the ww
   list such that it can be tested for emptiness using list_empty().

Signed-off-by: Mani Milani <mani@chromium.org>
---

 drivers/gpu/drm/i915/gem/i915_gem_object.c |  2 +
 drivers/gpu/drm/i915/gem/i915_gem_object.h | 10 ++++-
 drivers/gpu/drm/i915/i915_gem_evict.c      | 48 ++++++++++++----------
 drivers/gpu/drm/i915/i915_gem_ww.c         |  8 ++--
 4 files changed, 41 insertions(+), 27 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_object.c b/drivers/gpu/drm/i915/gem/i915_gem_object.c
index 369006c5317f..69d013b393fb 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_object.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_object.c
@@ -78,6 +78,8 @@ void i915_gem_object_init(struct drm_i915_gem_object *obj,
 
 	INIT_LIST_HEAD(&obj->mm.link);
 
+	INIT_LIST_HEAD(&obj->obj_link);
+
 	INIT_LIST_HEAD(&obj->lut_list);
 	spin_lock_init(&obj->lut_lock);
 
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_object.h b/drivers/gpu/drm/i915/gem/i915_gem_object.h
index 1723af9b0f6a..7e7a61bdf52c 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_object.h
+++ b/drivers/gpu/drm/i915/gem/i915_gem_object.h
@@ -219,7 +219,7 @@ static inline bool i915_gem_object_trylock(struct drm_i915_gem_object *obj,
 		return ww_mutex_trylock(&obj->base.resv->lock, &ww->ctx);
 }
 
-static inline void i915_gem_object_unlock(struct drm_i915_gem_object *obj)
+static inline void __i915_gem_object_unlock(struct drm_i915_gem_object *obj)
 {
 	if (obj->ops->adjust_lru)
 		obj->ops->adjust_lru(obj);
@@ -227,6 +227,14 @@ static inline void i915_gem_object_unlock(struct drm_i915_gem_object *obj)
 	dma_resv_unlock(obj->base.resv);
 }
 
+static inline void i915_gem_object_unlock(struct drm_i915_gem_object *obj)
+{
+	if (list_empty(&obj->obj_link))
+		__i915_gem_object_unlock(obj);
+	else
+		i915_gem_ww_unlock_single(obj);
+}
+
 static inline void
 i915_gem_object_set_readonly(struct drm_i915_gem_object *obj)
 {
diff --git a/drivers/gpu/drm/i915/i915_gem_evict.c b/drivers/gpu/drm/i915/i915_gem_evict.c
index f025ee4fa526..3eb514b4eddc 100644
--- a/drivers/gpu/drm/i915/i915_gem_evict.c
+++ b/drivers/gpu/drm/i915/i915_gem_evict.c
@@ -55,29 +55,33 @@ static int ggtt_flush(struct intel_gt *gt)
 	return intel_gt_wait_for_idle(gt, MAX_SCHEDULE_TIMEOUT);
 }
 
-static bool grab_vma(struct i915_vma *vma, struct i915_gem_ww_ctx *ww)
+static int grab_vma(struct i915_vma *vma, struct i915_gem_ww_ctx *ww)
 {
+	int err;
+
+	/* Dead objects don't need pins */
+	if (dying_vma(vma))
+		atomic_and(~I915_VMA_PIN_MASK, &vma->flags);
+
+	err = i915_gem_object_lock(vma->obj, ww);
+
 	/*
 	 * We add the extra refcount so the object doesn't drop to zero until
-	 * after ungrab_vma(), this way trylock is always paired with unlock.
+	 * after ungrab_vma(), this way lock is always paired with unlock.
 	 */
-	if (i915_gem_object_get_rcu(vma->obj)) {
-		if (!i915_gem_object_trylock(vma->obj, ww)) {
-			i915_gem_object_put(vma->obj);
-			return false;
-		}
-	} else {
-		/* Dead objects don't need pins */
-		atomic_and(~I915_VMA_PIN_MASK, &vma->flags);
-	}
+	if (!err)
+		i915_gem_object_get(vma->obj);
 
-	return true;
+	return err;
 }
 
 static void ungrab_vma(struct i915_vma *vma)
 {
-	if (dying_vma(vma))
+	if (dying_vma(vma)) {
+		/* Dead objects don't need pins */
+		atomic_and(~I915_VMA_PIN_MASK, &vma->flags);
 		return;
+	}
 
 	i915_gem_object_unlock(vma->obj);
 	i915_gem_object_put(vma->obj);
@@ -93,10 +97,11 @@ mark_free(struct drm_mm_scan *scan,
 	if (i915_vma_is_pinned(vma))
 		return false;
 
-	if (!grab_vma(vma, ww))
+	if (grab_vma(vma, ww))
 		return false;
 
 	list_add(&vma->evict_link, unwind);
+
 	return drm_mm_scan_add_block(scan, &vma->node);
 }
 
@@ -284,10 +289,12 @@ i915_gem_evict_something(struct i915_address_space *vm,
 		vma = container_of(node, struct i915_vma, node);
 
 		/* If we find any non-objects (!vma), we cannot evict them */
-		if (vma->node.color != I915_COLOR_UNEVICTABLE &&
-		    grab_vma(vma, ww)) {
-			ret = __i915_vma_unbind(vma);
-			ungrab_vma(vma);
+		if (vma->node.color != I915_COLOR_UNEVICTABLE) {
+			ret = grab_vma(vma, ww);
+			if (!ret) {
+				ret = __i915_vma_unbind(vma);
+				ungrab_vma(vma);
+			}
 		} else {
 			ret = -ENOSPC;
 		}
@@ -382,10 +389,9 @@ int i915_gem_evict_for_node(struct i915_address_space *vm,
 			break;
 		}
 
-		if (!grab_vma(vma, ww)) {
-			ret = -ENOSPC;
+		ret = grab_vma(vma, ww);
+		if (ret)
 			break;
-		}
 
 		/*
 		 * Never show fear in the face of dragons!
diff --git a/drivers/gpu/drm/i915/i915_gem_ww.c b/drivers/gpu/drm/i915/i915_gem_ww.c
index 3f6ff139478e..937b279f50fc 100644
--- a/drivers/gpu/drm/i915/i915_gem_ww.c
+++ b/drivers/gpu/drm/i915/i915_gem_ww.c
@@ -19,16 +19,14 @@ static void i915_gem_ww_ctx_unlock_all(struct i915_gem_ww_ctx *ww)
 	struct drm_i915_gem_object *obj;
 
 	while ((obj = list_first_entry_or_null(&ww->obj_list, struct drm_i915_gem_object, obj_link))) {
-		list_del(&obj->obj_link);
-		i915_gem_object_unlock(obj);
-		i915_gem_object_put(obj);
+		i915_gem_ww_unlock_single(obj);
 	}
 }
 
 void i915_gem_ww_unlock_single(struct drm_i915_gem_object *obj)
 {
-	list_del(&obj->obj_link);
-	i915_gem_object_unlock(obj);
+	list_del_init(&obj->obj_link);
+	__i915_gem_object_unlock(obj);
 	i915_gem_object_put(obj);
 }
 
-- 
2.38.1.431.g37b22c650d-goog


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [Intel-gfx] [PATCH] drm/i915: Fix unhandled deadlock in grab_vma()
@ 2022-11-10  5:31 ` Mani Milani
  0 siblings, 0 replies; 19+ messages in thread
From: Mani Milani @ 2022-11-10  5:31 UTC (permalink / raw)
  To: LKML
  Cc: dri-devel, intel-gfx, Thomas Hellström, Mani Milani,
	Chris Wilson, Matthew Auld, Daniel Vetter, Rodrigo Vivi,
	David Airlie, Christian König, Nirmoy Das

At present, the gpu thread crashes at times when grab_vma() attempts to
acquire a gem object lock when in a deadlock state.

Problems:
I identified the following 4 issues in the current code:
1. Since grab_vma() calls i915_gem_object_trylock(), which consequently
   calls ww_mutex_trylock(), to acquire lock, it does not perform any
   -EDEADLK handling; And -EALREADY handling is also unreliable,
   according to the description of ww_mutex_trylock().
2. Since the return value of grab_vma() is a boolean showing
   success/failure, it does not provide any extra information on the
   failure reason, and therefore does not provide any mechanism to its
   caller to take any action to fix a potential deadlock.
3. Current grab_vma() implementation produces inconsistent behaviour
   depending on the refcount value, without informing the caller. If
   refcount is already zero, grab_vma() neither acquires lock nor
   increments the refcount, but still returns 'true' for success! This
   means that grab_vma() returning true (for success) does not always
   mean that the gem obj is actually safely accessible.
4. Currently, calling "i915_gem_object_lock(obj,ww)" is meant to be
   followed by a consequent "i915_gem_object_unlock(obj)" ONLY if the
   original 'ww' object pointer was NULL, or otherwise not be called and
   leave the houskeeping to "i915_gem_ww_ctx_fini(ww)". There are a few
   issues with this:
   - This is not documented anywhere in the code (that I could find),
     but only explained in an older commit message.
   - This produces an inconsistent usage of the lock/unlock functions,
     increasing the chance of mistakes and issues.
   - This is not a clean design as it requires any new code that calls
     these lock/unlock functions to know their internals, as well as the
     internals of the functions calling the new code being added.

Fix:
To fix the issues above, this patch:
1. Changes grab_vma() to call i915_gem_object_lock() instead of
   i915_gem_object_trylock(), to handle -EDEADLK and -EALREADY cases.
   This should not cause any issue since the PIN_NONBLOCK flag is
   checked beforehand in the 2 cases grab_vma() is called.
2. Changes grab_vma() to return the actual error code, instead of bool.
3. Changes grab_vma() to behave consistently when returning success, by
   both incrementing the refcount and acquiring lock at all times.
4. Changes i915_gem_object_unlock() to pair with i915_gem_object_lock()
   nicely in all cases and do the housekeeping without the need for the
   caller to do anything other than simply calling lock and unlock.
5. Ensures the gem obj->obj_link is initialized and deleted from the ww
   list such that it can be tested for emptiness using list_empty().

Signed-off-by: Mani Milani <mani@chromium.org>
---

 drivers/gpu/drm/i915/gem/i915_gem_object.c |  2 +
 drivers/gpu/drm/i915/gem/i915_gem_object.h | 10 ++++-
 drivers/gpu/drm/i915/i915_gem_evict.c      | 48 ++++++++++++----------
 drivers/gpu/drm/i915/i915_gem_ww.c         |  8 ++--
 4 files changed, 41 insertions(+), 27 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_object.c b/drivers/gpu/drm/i915/gem/i915_gem_object.c
index 369006c5317f..69d013b393fb 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_object.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_object.c
@@ -78,6 +78,8 @@ void i915_gem_object_init(struct drm_i915_gem_object *obj,
 
 	INIT_LIST_HEAD(&obj->mm.link);
 
+	INIT_LIST_HEAD(&obj->obj_link);
+
 	INIT_LIST_HEAD(&obj->lut_list);
 	spin_lock_init(&obj->lut_lock);
 
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_object.h b/drivers/gpu/drm/i915/gem/i915_gem_object.h
index 1723af9b0f6a..7e7a61bdf52c 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_object.h
+++ b/drivers/gpu/drm/i915/gem/i915_gem_object.h
@@ -219,7 +219,7 @@ static inline bool i915_gem_object_trylock(struct drm_i915_gem_object *obj,
 		return ww_mutex_trylock(&obj->base.resv->lock, &ww->ctx);
 }
 
-static inline void i915_gem_object_unlock(struct drm_i915_gem_object *obj)
+static inline void __i915_gem_object_unlock(struct drm_i915_gem_object *obj)
 {
 	if (obj->ops->adjust_lru)
 		obj->ops->adjust_lru(obj);
@@ -227,6 +227,14 @@ static inline void i915_gem_object_unlock(struct drm_i915_gem_object *obj)
 	dma_resv_unlock(obj->base.resv);
 }
 
+static inline void i915_gem_object_unlock(struct drm_i915_gem_object *obj)
+{
+	if (list_empty(&obj->obj_link))
+		__i915_gem_object_unlock(obj);
+	else
+		i915_gem_ww_unlock_single(obj);
+}
+
 static inline void
 i915_gem_object_set_readonly(struct drm_i915_gem_object *obj)
 {
diff --git a/drivers/gpu/drm/i915/i915_gem_evict.c b/drivers/gpu/drm/i915/i915_gem_evict.c
index f025ee4fa526..3eb514b4eddc 100644
--- a/drivers/gpu/drm/i915/i915_gem_evict.c
+++ b/drivers/gpu/drm/i915/i915_gem_evict.c
@@ -55,29 +55,33 @@ static int ggtt_flush(struct intel_gt *gt)
 	return intel_gt_wait_for_idle(gt, MAX_SCHEDULE_TIMEOUT);
 }
 
-static bool grab_vma(struct i915_vma *vma, struct i915_gem_ww_ctx *ww)
+static int grab_vma(struct i915_vma *vma, struct i915_gem_ww_ctx *ww)
 {
+	int err;
+
+	/* Dead objects don't need pins */
+	if (dying_vma(vma))
+		atomic_and(~I915_VMA_PIN_MASK, &vma->flags);
+
+	err = i915_gem_object_lock(vma->obj, ww);
+
 	/*
 	 * We add the extra refcount so the object doesn't drop to zero until
-	 * after ungrab_vma(), this way trylock is always paired with unlock.
+	 * after ungrab_vma(), this way lock is always paired with unlock.
 	 */
-	if (i915_gem_object_get_rcu(vma->obj)) {
-		if (!i915_gem_object_trylock(vma->obj, ww)) {
-			i915_gem_object_put(vma->obj);
-			return false;
-		}
-	} else {
-		/* Dead objects don't need pins */
-		atomic_and(~I915_VMA_PIN_MASK, &vma->flags);
-	}
+	if (!err)
+		i915_gem_object_get(vma->obj);
 
-	return true;
+	return err;
 }
 
 static void ungrab_vma(struct i915_vma *vma)
 {
-	if (dying_vma(vma))
+	if (dying_vma(vma)) {
+		/* Dead objects don't need pins */
+		atomic_and(~I915_VMA_PIN_MASK, &vma->flags);
 		return;
+	}
 
 	i915_gem_object_unlock(vma->obj);
 	i915_gem_object_put(vma->obj);
@@ -93,10 +97,11 @@ mark_free(struct drm_mm_scan *scan,
 	if (i915_vma_is_pinned(vma))
 		return false;
 
-	if (!grab_vma(vma, ww))
+	if (grab_vma(vma, ww))
 		return false;
 
 	list_add(&vma->evict_link, unwind);
+
 	return drm_mm_scan_add_block(scan, &vma->node);
 }
 
@@ -284,10 +289,12 @@ i915_gem_evict_something(struct i915_address_space *vm,
 		vma = container_of(node, struct i915_vma, node);
 
 		/* If we find any non-objects (!vma), we cannot evict them */
-		if (vma->node.color != I915_COLOR_UNEVICTABLE &&
-		    grab_vma(vma, ww)) {
-			ret = __i915_vma_unbind(vma);
-			ungrab_vma(vma);
+		if (vma->node.color != I915_COLOR_UNEVICTABLE) {
+			ret = grab_vma(vma, ww);
+			if (!ret) {
+				ret = __i915_vma_unbind(vma);
+				ungrab_vma(vma);
+			}
 		} else {
 			ret = -ENOSPC;
 		}
@@ -382,10 +389,9 @@ int i915_gem_evict_for_node(struct i915_address_space *vm,
 			break;
 		}
 
-		if (!grab_vma(vma, ww)) {
-			ret = -ENOSPC;
+		ret = grab_vma(vma, ww);
+		if (ret)
 			break;
-		}
 
 		/*
 		 * Never show fear in the face of dragons!
diff --git a/drivers/gpu/drm/i915/i915_gem_ww.c b/drivers/gpu/drm/i915/i915_gem_ww.c
index 3f6ff139478e..937b279f50fc 100644
--- a/drivers/gpu/drm/i915/i915_gem_ww.c
+++ b/drivers/gpu/drm/i915/i915_gem_ww.c
@@ -19,16 +19,14 @@ static void i915_gem_ww_ctx_unlock_all(struct i915_gem_ww_ctx *ww)
 	struct drm_i915_gem_object *obj;
 
 	while ((obj = list_first_entry_or_null(&ww->obj_list, struct drm_i915_gem_object, obj_link))) {
-		list_del(&obj->obj_link);
-		i915_gem_object_unlock(obj);
-		i915_gem_object_put(obj);
+		i915_gem_ww_unlock_single(obj);
 	}
 }
 
 void i915_gem_ww_unlock_single(struct drm_i915_gem_object *obj)
 {
-	list_del(&obj->obj_link);
-	i915_gem_object_unlock(obj);
+	list_del_init(&obj->obj_link);
+	__i915_gem_object_unlock(obj);
 	i915_gem_object_put(obj);
 }
 
-- 
2.38.1.431.g37b22c650d-goog


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH] drm/i915: Fix unhandled deadlock in grab_vma()
@ 2022-11-10  5:31 ` Mani Milani
  0 siblings, 0 replies; 19+ messages in thread
From: Mani Milani @ 2022-11-10  5:31 UTC (permalink / raw)
  To: LKML
  Cc: Tvrtko Ursulin, dri-devel, intel-gfx, Thomas Hellström,
	Mani Milani, Chris Wilson, Matthew Auld, Rodrigo Vivi,
	Niranjana Vishwanathapura, Christian König, Nirmoy Das

At present, the gpu thread crashes at times when grab_vma() attempts to
acquire a gem object lock when in a deadlock state.

Problems:
I identified the following 4 issues in the current code:
1. Since grab_vma() calls i915_gem_object_trylock(), which consequently
   calls ww_mutex_trylock(), to acquire lock, it does not perform any
   -EDEADLK handling; And -EALREADY handling is also unreliable,
   according to the description of ww_mutex_trylock().
2. Since the return value of grab_vma() is a boolean showing
   success/failure, it does not provide any extra information on the
   failure reason, and therefore does not provide any mechanism to its
   caller to take any action to fix a potential deadlock.
3. Current grab_vma() implementation produces inconsistent behaviour
   depending on the refcount value, without informing the caller. If
   refcount is already zero, grab_vma() neither acquires lock nor
   increments the refcount, but still returns 'true' for success! This
   means that grab_vma() returning true (for success) does not always
   mean that the gem obj is actually safely accessible.
4. Currently, calling "i915_gem_object_lock(obj,ww)" is meant to be
   followed by a consequent "i915_gem_object_unlock(obj)" ONLY if the
   original 'ww' object pointer was NULL, or otherwise not be called and
   leave the houskeeping to "i915_gem_ww_ctx_fini(ww)". There are a few
   issues with this:
   - This is not documented anywhere in the code (that I could find),
     but only explained in an older commit message.
   - This produces an inconsistent usage of the lock/unlock functions,
     increasing the chance of mistakes and issues.
   - This is not a clean design as it requires any new code that calls
     these lock/unlock functions to know their internals, as well as the
     internals of the functions calling the new code being added.

Fix:
To fix the issues above, this patch:
1. Changes grab_vma() to call i915_gem_object_lock() instead of
   i915_gem_object_trylock(), to handle -EDEADLK and -EALREADY cases.
   This should not cause any issue since the PIN_NONBLOCK flag is
   checked beforehand in the 2 cases grab_vma() is called.
2. Changes grab_vma() to return the actual error code, instead of bool.
3. Changes grab_vma() to behave consistently when returning success, by
   both incrementing the refcount and acquiring lock at all times.
4. Changes i915_gem_object_unlock() to pair with i915_gem_object_lock()
   nicely in all cases and do the housekeeping without the need for the
   caller to do anything other than simply calling lock and unlock.
5. Ensures the gem obj->obj_link is initialized and deleted from the ww
   list such that it can be tested for emptiness using list_empty().

Signed-off-by: Mani Milani <mani@chromium.org>
---

 drivers/gpu/drm/i915/gem/i915_gem_object.c |  2 +
 drivers/gpu/drm/i915/gem/i915_gem_object.h | 10 ++++-
 drivers/gpu/drm/i915/i915_gem_evict.c      | 48 ++++++++++++----------
 drivers/gpu/drm/i915/i915_gem_ww.c         |  8 ++--
 4 files changed, 41 insertions(+), 27 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_object.c b/drivers/gpu/drm/i915/gem/i915_gem_object.c
index 369006c5317f..69d013b393fb 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_object.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_object.c
@@ -78,6 +78,8 @@ void i915_gem_object_init(struct drm_i915_gem_object *obj,
 
 	INIT_LIST_HEAD(&obj->mm.link);
 
+	INIT_LIST_HEAD(&obj->obj_link);
+
 	INIT_LIST_HEAD(&obj->lut_list);
 	spin_lock_init(&obj->lut_lock);
 
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_object.h b/drivers/gpu/drm/i915/gem/i915_gem_object.h
index 1723af9b0f6a..7e7a61bdf52c 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_object.h
+++ b/drivers/gpu/drm/i915/gem/i915_gem_object.h
@@ -219,7 +219,7 @@ static inline bool i915_gem_object_trylock(struct drm_i915_gem_object *obj,
 		return ww_mutex_trylock(&obj->base.resv->lock, &ww->ctx);
 }
 
-static inline void i915_gem_object_unlock(struct drm_i915_gem_object *obj)
+static inline void __i915_gem_object_unlock(struct drm_i915_gem_object *obj)
 {
 	if (obj->ops->adjust_lru)
 		obj->ops->adjust_lru(obj);
@@ -227,6 +227,14 @@ static inline void i915_gem_object_unlock(struct drm_i915_gem_object *obj)
 	dma_resv_unlock(obj->base.resv);
 }
 
+static inline void i915_gem_object_unlock(struct drm_i915_gem_object *obj)
+{
+	if (list_empty(&obj->obj_link))
+		__i915_gem_object_unlock(obj);
+	else
+		i915_gem_ww_unlock_single(obj);
+}
+
 static inline void
 i915_gem_object_set_readonly(struct drm_i915_gem_object *obj)
 {
diff --git a/drivers/gpu/drm/i915/i915_gem_evict.c b/drivers/gpu/drm/i915/i915_gem_evict.c
index f025ee4fa526..3eb514b4eddc 100644
--- a/drivers/gpu/drm/i915/i915_gem_evict.c
+++ b/drivers/gpu/drm/i915/i915_gem_evict.c
@@ -55,29 +55,33 @@ static int ggtt_flush(struct intel_gt *gt)
 	return intel_gt_wait_for_idle(gt, MAX_SCHEDULE_TIMEOUT);
 }
 
-static bool grab_vma(struct i915_vma *vma, struct i915_gem_ww_ctx *ww)
+static int grab_vma(struct i915_vma *vma, struct i915_gem_ww_ctx *ww)
 {
+	int err;
+
+	/* Dead objects don't need pins */
+	if (dying_vma(vma))
+		atomic_and(~I915_VMA_PIN_MASK, &vma->flags);
+
+	err = i915_gem_object_lock(vma->obj, ww);
+
 	/*
 	 * We add the extra refcount so the object doesn't drop to zero until
-	 * after ungrab_vma(), this way trylock is always paired with unlock.
+	 * after ungrab_vma(), this way lock is always paired with unlock.
 	 */
-	if (i915_gem_object_get_rcu(vma->obj)) {
-		if (!i915_gem_object_trylock(vma->obj, ww)) {
-			i915_gem_object_put(vma->obj);
-			return false;
-		}
-	} else {
-		/* Dead objects don't need pins */
-		atomic_and(~I915_VMA_PIN_MASK, &vma->flags);
-	}
+	if (!err)
+		i915_gem_object_get(vma->obj);
 
-	return true;
+	return err;
 }
 
 static void ungrab_vma(struct i915_vma *vma)
 {
-	if (dying_vma(vma))
+	if (dying_vma(vma)) {
+		/* Dead objects don't need pins */
+		atomic_and(~I915_VMA_PIN_MASK, &vma->flags);
 		return;
+	}
 
 	i915_gem_object_unlock(vma->obj);
 	i915_gem_object_put(vma->obj);
@@ -93,10 +97,11 @@ mark_free(struct drm_mm_scan *scan,
 	if (i915_vma_is_pinned(vma))
 		return false;
 
-	if (!grab_vma(vma, ww))
+	if (grab_vma(vma, ww))
 		return false;
 
 	list_add(&vma->evict_link, unwind);
+
 	return drm_mm_scan_add_block(scan, &vma->node);
 }
 
@@ -284,10 +289,12 @@ i915_gem_evict_something(struct i915_address_space *vm,
 		vma = container_of(node, struct i915_vma, node);
 
 		/* If we find any non-objects (!vma), we cannot evict them */
-		if (vma->node.color != I915_COLOR_UNEVICTABLE &&
-		    grab_vma(vma, ww)) {
-			ret = __i915_vma_unbind(vma);
-			ungrab_vma(vma);
+		if (vma->node.color != I915_COLOR_UNEVICTABLE) {
+			ret = grab_vma(vma, ww);
+			if (!ret) {
+				ret = __i915_vma_unbind(vma);
+				ungrab_vma(vma);
+			}
 		} else {
 			ret = -ENOSPC;
 		}
@@ -382,10 +389,9 @@ int i915_gem_evict_for_node(struct i915_address_space *vm,
 			break;
 		}
 
-		if (!grab_vma(vma, ww)) {
-			ret = -ENOSPC;
+		ret = grab_vma(vma, ww);
+		if (ret)
 			break;
-		}
 
 		/*
 		 * Never show fear in the face of dragons!
diff --git a/drivers/gpu/drm/i915/i915_gem_ww.c b/drivers/gpu/drm/i915/i915_gem_ww.c
index 3f6ff139478e..937b279f50fc 100644
--- a/drivers/gpu/drm/i915/i915_gem_ww.c
+++ b/drivers/gpu/drm/i915/i915_gem_ww.c
@@ -19,16 +19,14 @@ static void i915_gem_ww_ctx_unlock_all(struct i915_gem_ww_ctx *ww)
 	struct drm_i915_gem_object *obj;
 
 	while ((obj = list_first_entry_or_null(&ww->obj_list, struct drm_i915_gem_object, obj_link))) {
-		list_del(&obj->obj_link);
-		i915_gem_object_unlock(obj);
-		i915_gem_object_put(obj);
+		i915_gem_ww_unlock_single(obj);
 	}
 }
 
 void i915_gem_ww_unlock_single(struct drm_i915_gem_object *obj)
 {
-	list_del(&obj->obj_link);
-	i915_gem_object_unlock(obj);
+	list_del_init(&obj->obj_link);
+	__i915_gem_object_unlock(obj);
 	i915_gem_object_put(obj);
 }
 
-- 
2.38.1.431.g37b22c650d-goog


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH] drm/i915: Fix unhandled deadlock in grab_vma()
  2022-11-10  5:31 ` [Intel-gfx] " Mani Milani
  (?)
@ 2022-11-10 14:49   ` Matthew Auld
  -1 siblings, 0 replies; 19+ messages in thread
From: Matthew Auld @ 2022-11-10 14:49 UTC (permalink / raw)
  To: Mani Milani, LKML
  Cc: Tvrtko Ursulin, Maarten Lankhorst, Thomas Hellström,
	Chris Wilson, Christian König, Daniel Vetter, David Airlie,
	Jani Nikula, Joonas Lahtinen, Niranjana Vishwanathapura,
	Nirmoy Das, Rodrigo Vivi, dri-devel, intel-gfx

On 10/11/2022 05:31, Mani Milani wrote:
> At present, the gpu thread crashes at times when grab_vma() attempts to
> acquire a gem object lock when in a deadlock state.
> 
> Problems:
> I identified the following 4 issues in the current code:
> 1. Since grab_vma() calls i915_gem_object_trylock(), which consequently
>     calls ww_mutex_trylock(), to acquire lock, it does not perform any
>     -EDEADLK handling; And -EALREADY handling is also unreliable,
>     according to the description of ww_mutex_trylock().
> 2. Since the return value of grab_vma() is a boolean showing
>     success/failure, it does not provide any extra information on the
>     failure reason, and therefore does not provide any mechanism to its
>     caller to take any action to fix a potential deadlock.
> 3. Current grab_vma() implementation produces inconsistent behaviour
>     depending on the refcount value, without informing the caller. If
>     refcount is already zero, grab_vma() neither acquires lock nor
>     increments the refcount, but still returns 'true' for success! This
>     means that grab_vma() returning true (for success) does not always
>     mean that the gem obj is actually safely accessible.
> 4. Currently, calling "i915_gem_object_lock(obj,ww)" is meant to be
>     followed by a consequent "i915_gem_object_unlock(obj)" ONLY if the
>     original 'ww' object pointer was NULL, or otherwise not be called and
>     leave the houskeeping to "i915_gem_ww_ctx_fini(ww)". There are a few
>     issues with this:
>     - This is not documented anywhere in the code (that I could find),
>       but only explained in an older commit message.
>     - This produces an inconsistent usage of the lock/unlock functions,
>       increasing the chance of mistakes and issues.
>     - This is not a clean design as it requires any new code that calls
>       these lock/unlock functions to know their internals, as well as the
>       internals of the functions calling the new code being added.
> 
> Fix:
> To fix the issues above, this patch:
> 1. Changes grab_vma() to call i915_gem_object_lock() instead of
>     i915_gem_object_trylock(), to handle -EDEADLK and -EALREADY cases.
>     This should not cause any issue since the PIN_NONBLOCK flag is
>     checked beforehand in the 2 cases grab_vma() is called.
> 2. Changes grab_vma() to return the actual error code, instead of bool.
> 3. Changes grab_vma() to behave consistently when returning success, by
>     both incrementing the refcount and acquiring lock at all times.
> 4. Changes i915_gem_object_unlock() to pair with i915_gem_object_lock()
>     nicely in all cases and do the housekeeping without the need for the
>     caller to do anything other than simply calling lock and unlock.
> 5. Ensures the gem obj->obj_link is initialized and deleted from the ww
>     list such that it can be tested for emptiness using list_empty().
> 
> Signed-off-by: Mani Milani <mani@chromium.org>
> ---
> 
>   drivers/gpu/drm/i915/gem/i915_gem_object.c |  2 +
>   drivers/gpu/drm/i915/gem/i915_gem_object.h | 10 ++++-
>   drivers/gpu/drm/i915/i915_gem_evict.c      | 48 ++++++++++++----------
>   drivers/gpu/drm/i915/i915_gem_ww.c         |  8 ++--
>   4 files changed, 41 insertions(+), 27 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_object.c b/drivers/gpu/drm/i915/gem/i915_gem_object.c
> index 369006c5317f..69d013b393fb 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_object.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_object.c
> @@ -78,6 +78,8 @@ void i915_gem_object_init(struct drm_i915_gem_object *obj,
>   
>   	INIT_LIST_HEAD(&obj->mm.link);
>   
> +	INIT_LIST_HEAD(&obj->obj_link);
> +
>   	INIT_LIST_HEAD(&obj->lut_list);
>   	spin_lock_init(&obj->lut_lock);
>   
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_object.h b/drivers/gpu/drm/i915/gem/i915_gem_object.h
> index 1723af9b0f6a..7e7a61bdf52c 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_object.h
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_object.h
> @@ -219,7 +219,7 @@ static inline bool i915_gem_object_trylock(struct drm_i915_gem_object *obj,
>   		return ww_mutex_trylock(&obj->base.resv->lock, &ww->ctx);
>   }
>   
> -static inline void i915_gem_object_unlock(struct drm_i915_gem_object *obj)
> +static inline void __i915_gem_object_unlock(struct drm_i915_gem_object *obj)
>   {
>   	if (obj->ops->adjust_lru)
>   		obj->ops->adjust_lru(obj);
> @@ -227,6 +227,14 @@ static inline void i915_gem_object_unlock(struct drm_i915_gem_object *obj)
>   	dma_resv_unlock(obj->base.resv);
>   }
>   
> +static inline void i915_gem_object_unlock(struct drm_i915_gem_object *obj)
> +{
> +	if (list_empty(&obj->obj_link))
> +		__i915_gem_object_unlock(obj);
> +	else
> +		i915_gem_ww_unlock_single(obj);
> +}
> +
>   static inline void
>   i915_gem_object_set_readonly(struct drm_i915_gem_object *obj)
>   {
> diff --git a/drivers/gpu/drm/i915/i915_gem_evict.c b/drivers/gpu/drm/i915/i915_gem_evict.c
> index f025ee4fa526..3eb514b4eddc 100644
> --- a/drivers/gpu/drm/i915/i915_gem_evict.c
> +++ b/drivers/gpu/drm/i915/i915_gem_evict.c
> @@ -55,29 +55,33 @@ static int ggtt_flush(struct intel_gt *gt)
>   	return intel_gt_wait_for_idle(gt, MAX_SCHEDULE_TIMEOUT);
>   }
>   
> -static bool grab_vma(struct i915_vma *vma, struct i915_gem_ww_ctx *ww)
> +static int grab_vma(struct i915_vma *vma, struct i915_gem_ww_ctx *ww)
>   {
> +	int err;
> +
> +	/* Dead objects don't need pins */
> +	if (dying_vma(vma))
> +		atomic_and(~I915_VMA_PIN_MASK, &vma->flags);
> +
> +	err = i915_gem_object_lock(vma->obj, ww);

AFAIK the issue here is that we are already holding the vm->mutex, so 
this can potentially deadlock, which I guess is why this was trylock.

We typically grab a bunch of object locks during execbuf, and then grab 
the vm->mutex, before binding the vma for each object. So vm->mutex is 
always our inner lock, and the object lock is the outer one. Using a 
full lock here then inverts that locking AFAICT. Like say if one process 
is holding object A + vm->mutex and then tries to grab object B here in 
grab_vma(), but another process is already holding object B + waiting to 
grab vm->mutex?

> +
>   	/*
>   	 * We add the extra refcount so the object doesn't drop to zero until
> -	 * after ungrab_vma(), this way trylock is always paired with unlock.
> +	 * after ungrab_vma(), this way lock is always paired with unlock.
>   	 */
> -	if (i915_gem_object_get_rcu(vma->obj)) {
> -		if (!i915_gem_object_trylock(vma->obj, ww)) {
> -			i915_gem_object_put(vma->obj);
> -			return false;
> -		}
> -	} else {
> -		/* Dead objects don't need pins */
> -		atomic_and(~I915_VMA_PIN_MASK, &vma->flags);
> -	}
> +	if (!err)
> +		i915_gem_object_get(vma->obj);
>   
> -	return true;
> +	return err;
>   }
>   
>   static void ungrab_vma(struct i915_vma *vma)
>   {
> -	if (dying_vma(vma))
> +	if (dying_vma(vma)) {
> +		/* Dead objects don't need pins */
> +		atomic_and(~I915_VMA_PIN_MASK, &vma->flags);
>   		return;
> +	}
>   
>   	i915_gem_object_unlock(vma->obj);
>   	i915_gem_object_put(vma->obj);
> @@ -93,10 +97,11 @@ mark_free(struct drm_mm_scan *scan,
>   	if (i915_vma_is_pinned(vma))
>   		return false;
>   
> -	if (!grab_vma(vma, ww))
> +	if (grab_vma(vma, ww))
>   		return false;
>   
>   	list_add(&vma->evict_link, unwind);
> +
>   	return drm_mm_scan_add_block(scan, &vma->node);
>   }
>   
> @@ -284,10 +289,12 @@ i915_gem_evict_something(struct i915_address_space *vm,
>   		vma = container_of(node, struct i915_vma, node);
>   
>   		/* If we find any non-objects (!vma), we cannot evict them */
> -		if (vma->node.color != I915_COLOR_UNEVICTABLE &&
> -		    grab_vma(vma, ww)) {
> -			ret = __i915_vma_unbind(vma);
> -			ungrab_vma(vma);
> +		if (vma->node.color != I915_COLOR_UNEVICTABLE) {
> +			ret = grab_vma(vma, ww);
> +			if (!ret) {
> +				ret = __i915_vma_unbind(vma);
> +				ungrab_vma(vma);
> +			}
>   		} else {
>   			ret = -ENOSPC;
>   		}
> @@ -382,10 +389,9 @@ int i915_gem_evict_for_node(struct i915_address_space *vm,
>   			break;
>   		}
>   
> -		if (!grab_vma(vma, ww)) {
> -			ret = -ENOSPC;
> +		ret = grab_vma(vma, ww);
> +		if (ret)
>   			break;
> -		}
>   
>   		/*
>   		 * Never show fear in the face of dragons!
> diff --git a/drivers/gpu/drm/i915/i915_gem_ww.c b/drivers/gpu/drm/i915/i915_gem_ww.c
> index 3f6ff139478e..937b279f50fc 100644
> --- a/drivers/gpu/drm/i915/i915_gem_ww.c
> +++ b/drivers/gpu/drm/i915/i915_gem_ww.c
> @@ -19,16 +19,14 @@ static void i915_gem_ww_ctx_unlock_all(struct i915_gem_ww_ctx *ww)
>   	struct drm_i915_gem_object *obj;
>   
>   	while ((obj = list_first_entry_or_null(&ww->obj_list, struct drm_i915_gem_object, obj_link))) {
> -		list_del(&obj->obj_link);
> -		i915_gem_object_unlock(obj);
> -		i915_gem_object_put(obj);
> +		i915_gem_ww_unlock_single(obj);
>   	}
>   }
>   
>   void i915_gem_ww_unlock_single(struct drm_i915_gem_object *obj)
>   {
> -	list_del(&obj->obj_link);
> -	i915_gem_object_unlock(obj);
> +	list_del_init(&obj->obj_link);
> +	__i915_gem_object_unlock(obj);
>   	i915_gem_object_put(obj);
>   }
>   

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] drm/i915: Fix unhandled deadlock in grab_vma()
@ 2022-11-10 14:49   ` Matthew Auld
  0 siblings, 0 replies; 19+ messages in thread
From: Matthew Auld @ 2022-11-10 14:49 UTC (permalink / raw)
  To: Mani Milani, LKML
  Cc: Tvrtko Ursulin, Thomas Hellström, intel-gfx, Chris Wilson,
	dri-devel, Rodrigo Vivi, Niranjana Vishwanathapura,
	Christian König, Nirmoy Das

On 10/11/2022 05:31, Mani Milani wrote:
> At present, the gpu thread crashes at times when grab_vma() attempts to
> acquire a gem object lock when in a deadlock state.
> 
> Problems:
> I identified the following 4 issues in the current code:
> 1. Since grab_vma() calls i915_gem_object_trylock(), which consequently
>     calls ww_mutex_trylock(), to acquire lock, it does not perform any
>     -EDEADLK handling; And -EALREADY handling is also unreliable,
>     according to the description of ww_mutex_trylock().
> 2. Since the return value of grab_vma() is a boolean showing
>     success/failure, it does not provide any extra information on the
>     failure reason, and therefore does not provide any mechanism to its
>     caller to take any action to fix a potential deadlock.
> 3. Current grab_vma() implementation produces inconsistent behaviour
>     depending on the refcount value, without informing the caller. If
>     refcount is already zero, grab_vma() neither acquires lock nor
>     increments the refcount, but still returns 'true' for success! This
>     means that grab_vma() returning true (for success) does not always
>     mean that the gem obj is actually safely accessible.
> 4. Currently, calling "i915_gem_object_lock(obj,ww)" is meant to be
>     followed by a consequent "i915_gem_object_unlock(obj)" ONLY if the
>     original 'ww' object pointer was NULL, or otherwise not be called and
>     leave the houskeeping to "i915_gem_ww_ctx_fini(ww)". There are a few
>     issues with this:
>     - This is not documented anywhere in the code (that I could find),
>       but only explained in an older commit message.
>     - This produces an inconsistent usage of the lock/unlock functions,
>       increasing the chance of mistakes and issues.
>     - This is not a clean design as it requires any new code that calls
>       these lock/unlock functions to know their internals, as well as the
>       internals of the functions calling the new code being added.
> 
> Fix:
> To fix the issues above, this patch:
> 1. Changes grab_vma() to call i915_gem_object_lock() instead of
>     i915_gem_object_trylock(), to handle -EDEADLK and -EALREADY cases.
>     This should not cause any issue since the PIN_NONBLOCK flag is
>     checked beforehand in the 2 cases grab_vma() is called.
> 2. Changes grab_vma() to return the actual error code, instead of bool.
> 3. Changes grab_vma() to behave consistently when returning success, by
>     both incrementing the refcount and acquiring lock at all times.
> 4. Changes i915_gem_object_unlock() to pair with i915_gem_object_lock()
>     nicely in all cases and do the housekeeping without the need for the
>     caller to do anything other than simply calling lock and unlock.
> 5. Ensures the gem obj->obj_link is initialized and deleted from the ww
>     list such that it can be tested for emptiness using list_empty().
> 
> Signed-off-by: Mani Milani <mani@chromium.org>
> ---
> 
>   drivers/gpu/drm/i915/gem/i915_gem_object.c |  2 +
>   drivers/gpu/drm/i915/gem/i915_gem_object.h | 10 ++++-
>   drivers/gpu/drm/i915/i915_gem_evict.c      | 48 ++++++++++++----------
>   drivers/gpu/drm/i915/i915_gem_ww.c         |  8 ++--
>   4 files changed, 41 insertions(+), 27 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_object.c b/drivers/gpu/drm/i915/gem/i915_gem_object.c
> index 369006c5317f..69d013b393fb 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_object.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_object.c
> @@ -78,6 +78,8 @@ void i915_gem_object_init(struct drm_i915_gem_object *obj,
>   
>   	INIT_LIST_HEAD(&obj->mm.link);
>   
> +	INIT_LIST_HEAD(&obj->obj_link);
> +
>   	INIT_LIST_HEAD(&obj->lut_list);
>   	spin_lock_init(&obj->lut_lock);
>   
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_object.h b/drivers/gpu/drm/i915/gem/i915_gem_object.h
> index 1723af9b0f6a..7e7a61bdf52c 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_object.h
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_object.h
> @@ -219,7 +219,7 @@ static inline bool i915_gem_object_trylock(struct drm_i915_gem_object *obj,
>   		return ww_mutex_trylock(&obj->base.resv->lock, &ww->ctx);
>   }
>   
> -static inline void i915_gem_object_unlock(struct drm_i915_gem_object *obj)
> +static inline void __i915_gem_object_unlock(struct drm_i915_gem_object *obj)
>   {
>   	if (obj->ops->adjust_lru)
>   		obj->ops->adjust_lru(obj);
> @@ -227,6 +227,14 @@ static inline void i915_gem_object_unlock(struct drm_i915_gem_object *obj)
>   	dma_resv_unlock(obj->base.resv);
>   }
>   
> +static inline void i915_gem_object_unlock(struct drm_i915_gem_object *obj)
> +{
> +	if (list_empty(&obj->obj_link))
> +		__i915_gem_object_unlock(obj);
> +	else
> +		i915_gem_ww_unlock_single(obj);
> +}
> +
>   static inline void
>   i915_gem_object_set_readonly(struct drm_i915_gem_object *obj)
>   {
> diff --git a/drivers/gpu/drm/i915/i915_gem_evict.c b/drivers/gpu/drm/i915/i915_gem_evict.c
> index f025ee4fa526..3eb514b4eddc 100644
> --- a/drivers/gpu/drm/i915/i915_gem_evict.c
> +++ b/drivers/gpu/drm/i915/i915_gem_evict.c
> @@ -55,29 +55,33 @@ static int ggtt_flush(struct intel_gt *gt)
>   	return intel_gt_wait_for_idle(gt, MAX_SCHEDULE_TIMEOUT);
>   }
>   
> -static bool grab_vma(struct i915_vma *vma, struct i915_gem_ww_ctx *ww)
> +static int grab_vma(struct i915_vma *vma, struct i915_gem_ww_ctx *ww)
>   {
> +	int err;
> +
> +	/* Dead objects don't need pins */
> +	if (dying_vma(vma))
> +		atomic_and(~I915_VMA_PIN_MASK, &vma->flags);
> +
> +	err = i915_gem_object_lock(vma->obj, ww);

AFAIK the issue here is that we are already holding the vm->mutex, so 
this can potentially deadlock, which I guess is why this was trylock.

We typically grab a bunch of object locks during execbuf, and then grab 
the vm->mutex, before binding the vma for each object. So vm->mutex is 
always our inner lock, and the object lock is the outer one. Using a 
full lock here then inverts that locking AFAICT. Like say if one process 
is holding object A + vm->mutex and then tries to grab object B here in 
grab_vma(), but another process is already holding object B + waiting to 
grab vm->mutex?

> +
>   	/*
>   	 * We add the extra refcount so the object doesn't drop to zero until
> -	 * after ungrab_vma(), this way trylock is always paired with unlock.
> +	 * after ungrab_vma(), this way lock is always paired with unlock.
>   	 */
> -	if (i915_gem_object_get_rcu(vma->obj)) {
> -		if (!i915_gem_object_trylock(vma->obj, ww)) {
> -			i915_gem_object_put(vma->obj);
> -			return false;
> -		}
> -	} else {
> -		/* Dead objects don't need pins */
> -		atomic_and(~I915_VMA_PIN_MASK, &vma->flags);
> -	}
> +	if (!err)
> +		i915_gem_object_get(vma->obj);
>   
> -	return true;
> +	return err;
>   }
>   
>   static void ungrab_vma(struct i915_vma *vma)
>   {
> -	if (dying_vma(vma))
> +	if (dying_vma(vma)) {
> +		/* Dead objects don't need pins */
> +		atomic_and(~I915_VMA_PIN_MASK, &vma->flags);
>   		return;
> +	}
>   
>   	i915_gem_object_unlock(vma->obj);
>   	i915_gem_object_put(vma->obj);
> @@ -93,10 +97,11 @@ mark_free(struct drm_mm_scan *scan,
>   	if (i915_vma_is_pinned(vma))
>   		return false;
>   
> -	if (!grab_vma(vma, ww))
> +	if (grab_vma(vma, ww))
>   		return false;
>   
>   	list_add(&vma->evict_link, unwind);
> +
>   	return drm_mm_scan_add_block(scan, &vma->node);
>   }
>   
> @@ -284,10 +289,12 @@ i915_gem_evict_something(struct i915_address_space *vm,
>   		vma = container_of(node, struct i915_vma, node);
>   
>   		/* If we find any non-objects (!vma), we cannot evict them */
> -		if (vma->node.color != I915_COLOR_UNEVICTABLE &&
> -		    grab_vma(vma, ww)) {
> -			ret = __i915_vma_unbind(vma);
> -			ungrab_vma(vma);
> +		if (vma->node.color != I915_COLOR_UNEVICTABLE) {
> +			ret = grab_vma(vma, ww);
> +			if (!ret) {
> +				ret = __i915_vma_unbind(vma);
> +				ungrab_vma(vma);
> +			}
>   		} else {
>   			ret = -ENOSPC;
>   		}
> @@ -382,10 +389,9 @@ int i915_gem_evict_for_node(struct i915_address_space *vm,
>   			break;
>   		}
>   
> -		if (!grab_vma(vma, ww)) {
> -			ret = -ENOSPC;
> +		ret = grab_vma(vma, ww);
> +		if (ret)
>   			break;
> -		}
>   
>   		/*
>   		 * Never show fear in the face of dragons!
> diff --git a/drivers/gpu/drm/i915/i915_gem_ww.c b/drivers/gpu/drm/i915/i915_gem_ww.c
> index 3f6ff139478e..937b279f50fc 100644
> --- a/drivers/gpu/drm/i915/i915_gem_ww.c
> +++ b/drivers/gpu/drm/i915/i915_gem_ww.c
> @@ -19,16 +19,14 @@ static void i915_gem_ww_ctx_unlock_all(struct i915_gem_ww_ctx *ww)
>   	struct drm_i915_gem_object *obj;
>   
>   	while ((obj = list_first_entry_or_null(&ww->obj_list, struct drm_i915_gem_object, obj_link))) {
> -		list_del(&obj->obj_link);
> -		i915_gem_object_unlock(obj);
> -		i915_gem_object_put(obj);
> +		i915_gem_ww_unlock_single(obj);
>   	}
>   }
>   
>   void i915_gem_ww_unlock_single(struct drm_i915_gem_object *obj)
>   {
> -	list_del(&obj->obj_link);
> -	i915_gem_object_unlock(obj);
> +	list_del_init(&obj->obj_link);
> +	__i915_gem_object_unlock(obj);
>   	i915_gem_object_put(obj);
>   }
>   

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Intel-gfx] [PATCH] drm/i915: Fix unhandled deadlock in grab_vma()
@ 2022-11-10 14:49   ` Matthew Auld
  0 siblings, 0 replies; 19+ messages in thread
From: Matthew Auld @ 2022-11-10 14:49 UTC (permalink / raw)
  To: Mani Milani, LKML
  Cc: Thomas Hellström, intel-gfx, Chris Wilson, dri-devel,
	Daniel Vetter, Rodrigo Vivi, David Airlie, Christian König,
	Nirmoy Das

On 10/11/2022 05:31, Mani Milani wrote:
> At present, the gpu thread crashes at times when grab_vma() attempts to
> acquire a gem object lock when in a deadlock state.
> 
> Problems:
> I identified the following 4 issues in the current code:
> 1. Since grab_vma() calls i915_gem_object_trylock(), which consequently
>     calls ww_mutex_trylock(), to acquire lock, it does not perform any
>     -EDEADLK handling; And -EALREADY handling is also unreliable,
>     according to the description of ww_mutex_trylock().
> 2. Since the return value of grab_vma() is a boolean showing
>     success/failure, it does not provide any extra information on the
>     failure reason, and therefore does not provide any mechanism to its
>     caller to take any action to fix a potential deadlock.
> 3. Current grab_vma() implementation produces inconsistent behaviour
>     depending on the refcount value, without informing the caller. If
>     refcount is already zero, grab_vma() neither acquires lock nor
>     increments the refcount, but still returns 'true' for success! This
>     means that grab_vma() returning true (for success) does not always
>     mean that the gem obj is actually safely accessible.
> 4. Currently, calling "i915_gem_object_lock(obj,ww)" is meant to be
>     followed by a consequent "i915_gem_object_unlock(obj)" ONLY if the
>     original 'ww' object pointer was NULL, or otherwise not be called and
>     leave the houskeeping to "i915_gem_ww_ctx_fini(ww)". There are a few
>     issues with this:
>     - This is not documented anywhere in the code (that I could find),
>       but only explained in an older commit message.
>     - This produces an inconsistent usage of the lock/unlock functions,
>       increasing the chance of mistakes and issues.
>     - This is not a clean design as it requires any new code that calls
>       these lock/unlock functions to know their internals, as well as the
>       internals of the functions calling the new code being added.
> 
> Fix:
> To fix the issues above, this patch:
> 1. Changes grab_vma() to call i915_gem_object_lock() instead of
>     i915_gem_object_trylock(), to handle -EDEADLK and -EALREADY cases.
>     This should not cause any issue since the PIN_NONBLOCK flag is
>     checked beforehand in the 2 cases grab_vma() is called.
> 2. Changes grab_vma() to return the actual error code, instead of bool.
> 3. Changes grab_vma() to behave consistently when returning success, by
>     both incrementing the refcount and acquiring lock at all times.
> 4. Changes i915_gem_object_unlock() to pair with i915_gem_object_lock()
>     nicely in all cases and do the housekeeping without the need for the
>     caller to do anything other than simply calling lock and unlock.
> 5. Ensures the gem obj->obj_link is initialized and deleted from the ww
>     list such that it can be tested for emptiness using list_empty().
> 
> Signed-off-by: Mani Milani <mani@chromium.org>
> ---
> 
>   drivers/gpu/drm/i915/gem/i915_gem_object.c |  2 +
>   drivers/gpu/drm/i915/gem/i915_gem_object.h | 10 ++++-
>   drivers/gpu/drm/i915/i915_gem_evict.c      | 48 ++++++++++++----------
>   drivers/gpu/drm/i915/i915_gem_ww.c         |  8 ++--
>   4 files changed, 41 insertions(+), 27 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_object.c b/drivers/gpu/drm/i915/gem/i915_gem_object.c
> index 369006c5317f..69d013b393fb 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_object.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_object.c
> @@ -78,6 +78,8 @@ void i915_gem_object_init(struct drm_i915_gem_object *obj,
>   
>   	INIT_LIST_HEAD(&obj->mm.link);
>   
> +	INIT_LIST_HEAD(&obj->obj_link);
> +
>   	INIT_LIST_HEAD(&obj->lut_list);
>   	spin_lock_init(&obj->lut_lock);
>   
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_object.h b/drivers/gpu/drm/i915/gem/i915_gem_object.h
> index 1723af9b0f6a..7e7a61bdf52c 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_object.h
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_object.h
> @@ -219,7 +219,7 @@ static inline bool i915_gem_object_trylock(struct drm_i915_gem_object *obj,
>   		return ww_mutex_trylock(&obj->base.resv->lock, &ww->ctx);
>   }
>   
> -static inline void i915_gem_object_unlock(struct drm_i915_gem_object *obj)
> +static inline void __i915_gem_object_unlock(struct drm_i915_gem_object *obj)
>   {
>   	if (obj->ops->adjust_lru)
>   		obj->ops->adjust_lru(obj);
> @@ -227,6 +227,14 @@ static inline void i915_gem_object_unlock(struct drm_i915_gem_object *obj)
>   	dma_resv_unlock(obj->base.resv);
>   }
>   
> +static inline void i915_gem_object_unlock(struct drm_i915_gem_object *obj)
> +{
> +	if (list_empty(&obj->obj_link))
> +		__i915_gem_object_unlock(obj);
> +	else
> +		i915_gem_ww_unlock_single(obj);
> +}
> +
>   static inline void
>   i915_gem_object_set_readonly(struct drm_i915_gem_object *obj)
>   {
> diff --git a/drivers/gpu/drm/i915/i915_gem_evict.c b/drivers/gpu/drm/i915/i915_gem_evict.c
> index f025ee4fa526..3eb514b4eddc 100644
> --- a/drivers/gpu/drm/i915/i915_gem_evict.c
> +++ b/drivers/gpu/drm/i915/i915_gem_evict.c
> @@ -55,29 +55,33 @@ static int ggtt_flush(struct intel_gt *gt)
>   	return intel_gt_wait_for_idle(gt, MAX_SCHEDULE_TIMEOUT);
>   }
>   
> -static bool grab_vma(struct i915_vma *vma, struct i915_gem_ww_ctx *ww)
> +static int grab_vma(struct i915_vma *vma, struct i915_gem_ww_ctx *ww)
>   {
> +	int err;
> +
> +	/* Dead objects don't need pins */
> +	if (dying_vma(vma))
> +		atomic_and(~I915_VMA_PIN_MASK, &vma->flags);
> +
> +	err = i915_gem_object_lock(vma->obj, ww);

AFAIK the issue here is that we are already holding the vm->mutex, so 
this can potentially deadlock, which I guess is why this was trylock.

We typically grab a bunch of object locks during execbuf, and then grab 
the vm->mutex, before binding the vma for each object. So vm->mutex is 
always our inner lock, and the object lock is the outer one. Using a 
full lock here then inverts that locking AFAICT. Like say if one process 
is holding object A + vm->mutex and then tries to grab object B here in 
grab_vma(), but another process is already holding object B + waiting to 
grab vm->mutex?

> +
>   	/*
>   	 * We add the extra refcount so the object doesn't drop to zero until
> -	 * after ungrab_vma(), this way trylock is always paired with unlock.
> +	 * after ungrab_vma(), this way lock is always paired with unlock.
>   	 */
> -	if (i915_gem_object_get_rcu(vma->obj)) {
> -		if (!i915_gem_object_trylock(vma->obj, ww)) {
> -			i915_gem_object_put(vma->obj);
> -			return false;
> -		}
> -	} else {
> -		/* Dead objects don't need pins */
> -		atomic_and(~I915_VMA_PIN_MASK, &vma->flags);
> -	}
> +	if (!err)
> +		i915_gem_object_get(vma->obj);
>   
> -	return true;
> +	return err;
>   }
>   
>   static void ungrab_vma(struct i915_vma *vma)
>   {
> -	if (dying_vma(vma))
> +	if (dying_vma(vma)) {
> +		/* Dead objects don't need pins */
> +		atomic_and(~I915_VMA_PIN_MASK, &vma->flags);
>   		return;
> +	}
>   
>   	i915_gem_object_unlock(vma->obj);
>   	i915_gem_object_put(vma->obj);
> @@ -93,10 +97,11 @@ mark_free(struct drm_mm_scan *scan,
>   	if (i915_vma_is_pinned(vma))
>   		return false;
>   
> -	if (!grab_vma(vma, ww))
> +	if (grab_vma(vma, ww))
>   		return false;
>   
>   	list_add(&vma->evict_link, unwind);
> +
>   	return drm_mm_scan_add_block(scan, &vma->node);
>   }
>   
> @@ -284,10 +289,12 @@ i915_gem_evict_something(struct i915_address_space *vm,
>   		vma = container_of(node, struct i915_vma, node);
>   
>   		/* If we find any non-objects (!vma), we cannot evict them */
> -		if (vma->node.color != I915_COLOR_UNEVICTABLE &&
> -		    grab_vma(vma, ww)) {
> -			ret = __i915_vma_unbind(vma);
> -			ungrab_vma(vma);
> +		if (vma->node.color != I915_COLOR_UNEVICTABLE) {
> +			ret = grab_vma(vma, ww);
> +			if (!ret) {
> +				ret = __i915_vma_unbind(vma);
> +				ungrab_vma(vma);
> +			}
>   		} else {
>   			ret = -ENOSPC;
>   		}
> @@ -382,10 +389,9 @@ int i915_gem_evict_for_node(struct i915_address_space *vm,
>   			break;
>   		}
>   
> -		if (!grab_vma(vma, ww)) {
> -			ret = -ENOSPC;
> +		ret = grab_vma(vma, ww);
> +		if (ret)
>   			break;
> -		}
>   
>   		/*
>   		 * Never show fear in the face of dragons!
> diff --git a/drivers/gpu/drm/i915/i915_gem_ww.c b/drivers/gpu/drm/i915/i915_gem_ww.c
> index 3f6ff139478e..937b279f50fc 100644
> --- a/drivers/gpu/drm/i915/i915_gem_ww.c
> +++ b/drivers/gpu/drm/i915/i915_gem_ww.c
> @@ -19,16 +19,14 @@ static void i915_gem_ww_ctx_unlock_all(struct i915_gem_ww_ctx *ww)
>   	struct drm_i915_gem_object *obj;
>   
>   	while ((obj = list_first_entry_or_null(&ww->obj_list, struct drm_i915_gem_object, obj_link))) {
> -		list_del(&obj->obj_link);
> -		i915_gem_object_unlock(obj);
> -		i915_gem_object_put(obj);
> +		i915_gem_ww_unlock_single(obj);
>   	}
>   }
>   
>   void i915_gem_ww_unlock_single(struct drm_i915_gem_object *obj)
>   {
> -	list_del(&obj->obj_link);
> -	i915_gem_object_unlock(obj);
> +	list_del_init(&obj->obj_link);
> +	__i915_gem_object_unlock(obj);
>   	i915_gem_object_put(obj);
>   }
>   

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] drm/i915: Fix unhandled deadlock in grab_vma()
  2022-11-10 14:49   ` Matthew Auld
  (?)
@ 2022-11-10 15:21     ` Thomas Hellström
  -1 siblings, 0 replies; 19+ messages in thread
From: Thomas Hellström @ 2022-11-10 15:21 UTC (permalink / raw)
  To: Matthew Auld, Mani Milani, LKML
  Cc: Tvrtko Ursulin, intel-gfx, Chris Wilson, dri-devel, Rodrigo Vivi,
	Niranjana Vishwanathapura, Christian König, Nirmoy Das

On Thu, 2022-11-10 at 14:49 +0000, Matthew Auld wrote:
> On 10/11/2022 05:31, Mani Milani wrote:
> > At present, the gpu thread crashes at times when grab_vma()
> > attempts to
> > acquire a gem object lock when in a deadlock state.
> > 
> > Problems:
> > I identified the following 4 issues in the current code:
> > 1. Since grab_vma() calls i915_gem_object_trylock(), which
> > consequently
> >     calls ww_mutex_trylock(), to acquire lock, it does not perform
> > any
> >     -EDEADLK handling; And -EALREADY handling is also unreliable,
> >     according to the description of ww_mutex_trylock().
> > 2. Since the return value of grab_vma() is a boolean showing
> >     success/failure, it does not provide any extra information on
> > the
> >     failure reason, and therefore does not provide any mechanism to
> > its
> >     caller to take any action to fix a potential deadlock.
> > 3. Current grab_vma() implementation produces inconsistent
> > behaviour
> >     depending on the refcount value, without informing the caller.
> > If
> >     refcount is already zero, grab_vma() neither acquires lock nor
> >     increments the refcount, but still returns 'true' for success!
> > This
> >     means that grab_vma() returning true (for success) does not
> > always
> >     mean that the gem obj is actually safely accessible.
> > 4. Currently, calling "i915_gem_object_lock(obj,ww)" is meant to be
> >     followed by a consequent "i915_gem_object_unlock(obj)" ONLY if
> > the
> >     original 'ww' object pointer was NULL, or otherwise not be
> > called and
> >     leave the houskeeping to "i915_gem_ww_ctx_fini(ww)". There are
> > a few
> >     issues with this:
> >     - This is not documented anywhere in the code (that I could
> > find),
> >       but only explained in an older commit message.
> >     - This produces an inconsistent usage of the lock/unlock
> > functions,
> >       increasing the chance of mistakes and issues.
> >     - This is not a clean design as it requires any new code that
> > calls
> >       these lock/unlock functions to know their internals, as well
> > as the
> >       internals of the functions calling the new code being added.
> > 
> > Fix:
> > To fix the issues above, this patch:
> > 1. Changes grab_vma() to call i915_gem_object_lock() instead of
> >     i915_gem_object_trylock(), to handle -EDEADLK and -EALREADY
> > cases.
> >     This should not cause any issue since the PIN_NONBLOCK flag is
> >     checked beforehand in the 2 cases grab_vma() is called.
> > 2. Changes grab_vma() to return the actual error code, instead of
> > bool.
> > 3. Changes grab_vma() to behave consistently when returning
> > success, by
> >     both incrementing the refcount and acquiring lock at all times.
> > 4. Changes i915_gem_object_unlock() to pair with
> > i915_gem_object_lock()
> >     nicely in all cases and do the housekeeping without the need
> > for the
> >     caller to do anything other than simply calling lock and
> > unlock.
> > 5. Ensures the gem obj->obj_link is initialized and deleted from
> > the ww
> >     list such that it can be tested for emptiness using
> > list_empty().
> > 
> > Signed-off-by: Mani Milani <mani@chromium.org>
> > ---
> > 
> >   drivers/gpu/drm/i915/gem/i915_gem_object.c |  2 +
> >   drivers/gpu/drm/i915/gem/i915_gem_object.h | 10 ++++-
> >   drivers/gpu/drm/i915/i915_gem_evict.c      | 48 ++++++++++++-----
> > -----
> >   drivers/gpu/drm/i915/i915_gem_ww.c         |  8 ++--
> >   4 files changed, 41 insertions(+), 27 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gem/i915_gem_object.c
> > b/drivers/gpu/drm/i915/gem/i915_gem_object.c
> > index 369006c5317f..69d013b393fb 100644
> > --- a/drivers/gpu/drm/i915/gem/i915_gem_object.c
> > +++ b/drivers/gpu/drm/i915/gem/i915_gem_object.c
> > @@ -78,6 +78,8 @@ void i915_gem_object_init(struct
> > drm_i915_gem_object *obj,
> >   
> >         INIT_LIST_HEAD(&obj->mm.link);
> >   
> > +       INIT_LIST_HEAD(&obj->obj_link);
> > +
> >         INIT_LIST_HEAD(&obj->lut_list);
> >         spin_lock_init(&obj->lut_lock);
> >   
> > diff --git a/drivers/gpu/drm/i915/gem/i915_gem_object.h
> > b/drivers/gpu/drm/i915/gem/i915_gem_object.h
> > index 1723af9b0f6a..7e7a61bdf52c 100644
> > --- a/drivers/gpu/drm/i915/gem/i915_gem_object.h
> > +++ b/drivers/gpu/drm/i915/gem/i915_gem_object.h
> > @@ -219,7 +219,7 @@ static inline bool
> > i915_gem_object_trylock(struct drm_i915_gem_object *obj,
> >                 return ww_mutex_trylock(&obj->base.resv->lock, &ww-
> > >ctx);
> >   }
> >   
> > -static inline void i915_gem_object_unlock(struct
> > drm_i915_gem_object *obj)
> > +static inline void __i915_gem_object_unlock(struct
> > drm_i915_gem_object *obj)
> >   {
> >         if (obj->ops->adjust_lru)
> >                 obj->ops->adjust_lru(obj);
> > @@ -227,6 +227,14 @@ static inline void
> > i915_gem_object_unlock(struct drm_i915_gem_object *obj)
> >         dma_resv_unlock(obj->base.resv);
> >   }
> >   
> > +static inline void i915_gem_object_unlock(struct
> > drm_i915_gem_object *obj)
> > +{
> > +       if (list_empty(&obj->obj_link))
> > +               __i915_gem_object_unlock(obj);
> > +       else
> > +               i915_gem_ww_unlock_single(obj);
> > +}
> > +
> >   static inline void
> >   i915_gem_object_set_readonly(struct drm_i915_gem_object *obj)
> >   {
> > diff --git a/drivers/gpu/drm/i915/i915_gem_evict.c
> > b/drivers/gpu/drm/i915/i915_gem_evict.c
> > index f025ee4fa526..3eb514b4eddc 100644
> > --- a/drivers/gpu/drm/i915/i915_gem_evict.c
> > +++ b/drivers/gpu/drm/i915/i915_gem_evict.c
> > @@ -55,29 +55,33 @@ static int ggtt_flush(struct intel_gt *gt)
> >         return intel_gt_wait_for_idle(gt, MAX_SCHEDULE_TIMEOUT);
> >   }
> >   
> > -static bool grab_vma(struct i915_vma *vma, struct i915_gem_ww_ctx
> > *ww)
> > +static int grab_vma(struct i915_vma *vma, struct i915_gem_ww_ctx
> > *ww)
> >   {
> > +       int err;
> > +
> > +       /* Dead objects don't need pins */
> > +       if (dying_vma(vma))
> > +               atomic_and(~I915_VMA_PIN_MASK, &vma->flags);
> > +
> > +       err = i915_gem_object_lock(vma->obj, ww);
> 
> AFAIK the issue here is that we are already holding the vm->mutex, so
> this can potentially deadlock, which I guess is why this was trylock.
> 
> We typically grab a bunch of object locks during execbuf, and then
> grab 
> the vm->mutex, before binding the vma for each object. So vm->mutex
> is 
> always our inner lock, and the object lock is the outer one. Using a 
> full lock here then inverts that locking AFAICT. Like say if one
> process 
> is holding object A + vm->mutex and then tries to grab object B here
> in 
> grab_vma(), but another process is already holding object B + waiting
> to 
> grab vm->mutex?

Indeed. 

IIRC the assumption here was that a ww mutex trylock would be similar
in behaviour to the previous code which instead checked for pinned
vmas.

One difference, though, is that we lock most objects upfront and then
try to make space for VMAs avoiding evicting vmas that is needed anyway
for the batch. The old code would instead evict and rebind.

But there should be a catch-all evict everything and rebind if eviction
fails so it would be beneficial to see the "gpu thread crash". I think
there is an issue filed in gitlab where even  this catch-all evict
everything fails that might needs looking at.

/Thomas

> 
> > +
> >         /*
> >          * We add the extra refcount so the object doesn't drop to
> > zero until
> > -        * after ungrab_vma(), this way trylock is always paired
> > with unlock.
> > +        * after ungrab_vma(), this way lock is always paired with
> > unlock.
> >          */
> > -       if (i915_gem_object_get_rcu(vma->obj)) {
> > -               if (!i915_gem_object_trylock(vma->obj, ww)) {
> > -                       i915_gem_object_put(vma->obj);
> > -                       return false;
> > -               }
> > -       } else {
> > -               /* Dead objects don't need pins */
> > -               atomic_and(~I915_VMA_PIN_MASK, &vma->flags);
> > -       }
> > +       if (!err)
> > +               i915_gem_object_get(vma->obj);
> >   
> > -       return true;
> > +       return err;
> >   }
> >   
> >   static void ungrab_vma(struct i915_vma *vma)
> >   {
> > -       if (dying_vma(vma))
> > +       if (dying_vma(vma)) {
> > +               /* Dead objects don't need pins */
> > +               atomic_and(~I915_VMA_PIN_MASK, &vma->flags);
> >                 return;
> > +       }
> >   
> >         i915_gem_object_unlock(vma->obj);
> >         i915_gem_object_put(vma->obj);
> > @@ -93,10 +97,11 @@ mark_free(struct drm_mm_scan *scan,
> >         if (i915_vma_is_pinned(vma))
> >                 return false;
> >   
> > -       if (!grab_vma(vma, ww))
> > +       if (grab_vma(vma, ww))
> >                 return false;
> >   
> >         list_add(&vma->evict_link, unwind);
> > +
> >         return drm_mm_scan_add_block(scan, &vma->node);
> >   }
> >   
> > @@ -284,10 +289,12 @@ i915_gem_evict_something(struct
> > i915_address_space *vm,
> >                 vma = container_of(node, struct i915_vma, node);
> >   
> >                 /* If we find any non-objects (!vma), we cannot
> > evict them */
> > -               if (vma->node.color != I915_COLOR_UNEVICTABLE &&
> > -                   grab_vma(vma, ww)) {
> > -                       ret = __i915_vma_unbind(vma);
> > -                       ungrab_vma(vma);
> > +               if (vma->node.color != I915_COLOR_UNEVICTABLE) {
> > +                       ret = grab_vma(vma, ww);
> > +                       if (!ret) {
> > +                               ret = __i915_vma_unbind(vma);
> > +                               ungrab_vma(vma);
> > +                       }
> >                 } else {
> >                         ret = -ENOSPC;
> >                 }
> > @@ -382,10 +389,9 @@ int i915_gem_evict_for_node(struct
> > i915_address_space *vm,
> >                         break;
> >                 }
> >   
> > -               if (!grab_vma(vma, ww)) {
> > -                       ret = -ENOSPC;
> > +               ret = grab_vma(vma, ww);
> > +               if (ret)
> >                         break;
> > -               }
> >   
> >                 /*
> >                  * Never show fear in the face of dragons!
> > diff --git a/drivers/gpu/drm/i915/i915_gem_ww.c
> > b/drivers/gpu/drm/i915/i915_gem_ww.c
> > index 3f6ff139478e..937b279f50fc 100644
> > --- a/drivers/gpu/drm/i915/i915_gem_ww.c
> > +++ b/drivers/gpu/drm/i915/i915_gem_ww.c
> > @@ -19,16 +19,14 @@ static void i915_gem_ww_ctx_unlock_all(struct
> > i915_gem_ww_ctx *ww)
> >         struct drm_i915_gem_object *obj;
> >   
> >         while ((obj = list_first_entry_or_null(&ww->obj_list,
> > struct drm_i915_gem_object, obj_link))) {
> > -               list_del(&obj->obj_link);
> > -               i915_gem_object_unlock(obj);
> > -               i915_gem_object_put(obj);
> > +               i915_gem_ww_unlock_single(obj);
> >         }
> >   }
> >   
> >   void i915_gem_ww_unlock_single(struct drm_i915_gem_object *obj)
> >   {
> > -       list_del(&obj->obj_link);
> > -       i915_gem_object_unlock(obj);
> > +       list_del_init(&obj->obj_link);
> > +       __i915_gem_object_unlock(obj);
> >         i915_gem_object_put(obj);
> >   }
> >   


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Intel-gfx] [PATCH] drm/i915: Fix unhandled deadlock in grab_vma()
@ 2022-11-10 15:21     ` Thomas Hellström
  0 siblings, 0 replies; 19+ messages in thread
From: Thomas Hellström @ 2022-11-10 15:21 UTC (permalink / raw)
  To: Matthew Auld, Mani Milani, LKML
  Cc: intel-gfx, Chris Wilson, dri-devel, Daniel Vetter, Rodrigo Vivi,
	David Airlie, Christian König, Nirmoy Das

On Thu, 2022-11-10 at 14:49 +0000, Matthew Auld wrote:
> On 10/11/2022 05:31, Mani Milani wrote:
> > At present, the gpu thread crashes at times when grab_vma()
> > attempts to
> > acquire a gem object lock when in a deadlock state.
> > 
> > Problems:
> > I identified the following 4 issues in the current code:
> > 1. Since grab_vma() calls i915_gem_object_trylock(), which
> > consequently
> >     calls ww_mutex_trylock(), to acquire lock, it does not perform
> > any
> >     -EDEADLK handling; And -EALREADY handling is also unreliable,
> >     according to the description of ww_mutex_trylock().
> > 2. Since the return value of grab_vma() is a boolean showing
> >     success/failure, it does not provide any extra information on
> > the
> >     failure reason, and therefore does not provide any mechanism to
> > its
> >     caller to take any action to fix a potential deadlock.
> > 3. Current grab_vma() implementation produces inconsistent
> > behaviour
> >     depending on the refcount value, without informing the caller.
> > If
> >     refcount is already zero, grab_vma() neither acquires lock nor
> >     increments the refcount, but still returns 'true' for success!
> > This
> >     means that grab_vma() returning true (for success) does not
> > always
> >     mean that the gem obj is actually safely accessible.
> > 4. Currently, calling "i915_gem_object_lock(obj,ww)" is meant to be
> >     followed by a consequent "i915_gem_object_unlock(obj)" ONLY if
> > the
> >     original 'ww' object pointer was NULL, or otherwise not be
> > called and
> >     leave the houskeeping to "i915_gem_ww_ctx_fini(ww)". There are
> > a few
> >     issues with this:
> >     - This is not documented anywhere in the code (that I could
> > find),
> >       but only explained in an older commit message.
> >     - This produces an inconsistent usage of the lock/unlock
> > functions,
> >       increasing the chance of mistakes and issues.
> >     - This is not a clean design as it requires any new code that
> > calls
> >       these lock/unlock functions to know their internals, as well
> > as the
> >       internals of the functions calling the new code being added.
> > 
> > Fix:
> > To fix the issues above, this patch:
> > 1. Changes grab_vma() to call i915_gem_object_lock() instead of
> >     i915_gem_object_trylock(), to handle -EDEADLK and -EALREADY
> > cases.
> >     This should not cause any issue since the PIN_NONBLOCK flag is
> >     checked beforehand in the 2 cases grab_vma() is called.
> > 2. Changes grab_vma() to return the actual error code, instead of
> > bool.
> > 3. Changes grab_vma() to behave consistently when returning
> > success, by
> >     both incrementing the refcount and acquiring lock at all times.
> > 4. Changes i915_gem_object_unlock() to pair with
> > i915_gem_object_lock()
> >     nicely in all cases and do the housekeeping without the need
> > for the
> >     caller to do anything other than simply calling lock and
> > unlock.
> > 5. Ensures the gem obj->obj_link is initialized and deleted from
> > the ww
> >     list such that it can be tested for emptiness using
> > list_empty().
> > 
> > Signed-off-by: Mani Milani <mani@chromium.org>
> > ---
> > 
> >   drivers/gpu/drm/i915/gem/i915_gem_object.c |  2 +
> >   drivers/gpu/drm/i915/gem/i915_gem_object.h | 10 ++++-
> >   drivers/gpu/drm/i915/i915_gem_evict.c      | 48 ++++++++++++-----
> > -----
> >   drivers/gpu/drm/i915/i915_gem_ww.c         |  8 ++--
> >   4 files changed, 41 insertions(+), 27 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gem/i915_gem_object.c
> > b/drivers/gpu/drm/i915/gem/i915_gem_object.c
> > index 369006c5317f..69d013b393fb 100644
> > --- a/drivers/gpu/drm/i915/gem/i915_gem_object.c
> > +++ b/drivers/gpu/drm/i915/gem/i915_gem_object.c
> > @@ -78,6 +78,8 @@ void i915_gem_object_init(struct
> > drm_i915_gem_object *obj,
> >   
> >         INIT_LIST_HEAD(&obj->mm.link);
> >   
> > +       INIT_LIST_HEAD(&obj->obj_link);
> > +
> >         INIT_LIST_HEAD(&obj->lut_list);
> >         spin_lock_init(&obj->lut_lock);
> >   
> > diff --git a/drivers/gpu/drm/i915/gem/i915_gem_object.h
> > b/drivers/gpu/drm/i915/gem/i915_gem_object.h
> > index 1723af9b0f6a..7e7a61bdf52c 100644
> > --- a/drivers/gpu/drm/i915/gem/i915_gem_object.h
> > +++ b/drivers/gpu/drm/i915/gem/i915_gem_object.h
> > @@ -219,7 +219,7 @@ static inline bool
> > i915_gem_object_trylock(struct drm_i915_gem_object *obj,
> >                 return ww_mutex_trylock(&obj->base.resv->lock, &ww-
> > >ctx);
> >   }
> >   
> > -static inline void i915_gem_object_unlock(struct
> > drm_i915_gem_object *obj)
> > +static inline void __i915_gem_object_unlock(struct
> > drm_i915_gem_object *obj)
> >   {
> >         if (obj->ops->adjust_lru)
> >                 obj->ops->adjust_lru(obj);
> > @@ -227,6 +227,14 @@ static inline void
> > i915_gem_object_unlock(struct drm_i915_gem_object *obj)
> >         dma_resv_unlock(obj->base.resv);
> >   }
> >   
> > +static inline void i915_gem_object_unlock(struct
> > drm_i915_gem_object *obj)
> > +{
> > +       if (list_empty(&obj->obj_link))
> > +               __i915_gem_object_unlock(obj);
> > +       else
> > +               i915_gem_ww_unlock_single(obj);
> > +}
> > +
> >   static inline void
> >   i915_gem_object_set_readonly(struct drm_i915_gem_object *obj)
> >   {
> > diff --git a/drivers/gpu/drm/i915/i915_gem_evict.c
> > b/drivers/gpu/drm/i915/i915_gem_evict.c
> > index f025ee4fa526..3eb514b4eddc 100644
> > --- a/drivers/gpu/drm/i915/i915_gem_evict.c
> > +++ b/drivers/gpu/drm/i915/i915_gem_evict.c
> > @@ -55,29 +55,33 @@ static int ggtt_flush(struct intel_gt *gt)
> >         return intel_gt_wait_for_idle(gt, MAX_SCHEDULE_TIMEOUT);
> >   }
> >   
> > -static bool grab_vma(struct i915_vma *vma, struct i915_gem_ww_ctx
> > *ww)
> > +static int grab_vma(struct i915_vma *vma, struct i915_gem_ww_ctx
> > *ww)
> >   {
> > +       int err;
> > +
> > +       /* Dead objects don't need pins */
> > +       if (dying_vma(vma))
> > +               atomic_and(~I915_VMA_PIN_MASK, &vma->flags);
> > +
> > +       err = i915_gem_object_lock(vma->obj, ww);
> 
> AFAIK the issue here is that we are already holding the vm->mutex, so
> this can potentially deadlock, which I guess is why this was trylock.
> 
> We typically grab a bunch of object locks during execbuf, and then
> grab 
> the vm->mutex, before binding the vma for each object. So vm->mutex
> is 
> always our inner lock, and the object lock is the outer one. Using a 
> full lock here then inverts that locking AFAICT. Like say if one
> process 
> is holding object A + vm->mutex and then tries to grab object B here
> in 
> grab_vma(), but another process is already holding object B + waiting
> to 
> grab vm->mutex?

Indeed. 

IIRC the assumption here was that a ww mutex trylock would be similar
in behaviour to the previous code which instead checked for pinned
vmas.

One difference, though, is that we lock most objects upfront and then
try to make space for VMAs avoiding evicting vmas that is needed anyway
for the batch. The old code would instead evict and rebind.

But there should be a catch-all evict everything and rebind if eviction
fails so it would be beneficial to see the "gpu thread crash". I think
there is an issue filed in gitlab where even  this catch-all evict
everything fails that might needs looking at.

/Thomas

> 
> > +
> >         /*
> >          * We add the extra refcount so the object doesn't drop to
> > zero until
> > -        * after ungrab_vma(), this way trylock is always paired
> > with unlock.
> > +        * after ungrab_vma(), this way lock is always paired with
> > unlock.
> >          */
> > -       if (i915_gem_object_get_rcu(vma->obj)) {
> > -               if (!i915_gem_object_trylock(vma->obj, ww)) {
> > -                       i915_gem_object_put(vma->obj);
> > -                       return false;
> > -               }
> > -       } else {
> > -               /* Dead objects don't need pins */
> > -               atomic_and(~I915_VMA_PIN_MASK, &vma->flags);
> > -       }
> > +       if (!err)
> > +               i915_gem_object_get(vma->obj);
> >   
> > -       return true;
> > +       return err;
> >   }
> >   
> >   static void ungrab_vma(struct i915_vma *vma)
> >   {
> > -       if (dying_vma(vma))
> > +       if (dying_vma(vma)) {
> > +               /* Dead objects don't need pins */
> > +               atomic_and(~I915_VMA_PIN_MASK, &vma->flags);
> >                 return;
> > +       }
> >   
> >         i915_gem_object_unlock(vma->obj);
> >         i915_gem_object_put(vma->obj);
> > @@ -93,10 +97,11 @@ mark_free(struct drm_mm_scan *scan,
> >         if (i915_vma_is_pinned(vma))
> >                 return false;
> >   
> > -       if (!grab_vma(vma, ww))
> > +       if (grab_vma(vma, ww))
> >                 return false;
> >   
> >         list_add(&vma->evict_link, unwind);
> > +
> >         return drm_mm_scan_add_block(scan, &vma->node);
> >   }
> >   
> > @@ -284,10 +289,12 @@ i915_gem_evict_something(struct
> > i915_address_space *vm,
> >                 vma = container_of(node, struct i915_vma, node);
> >   
> >                 /* If we find any non-objects (!vma), we cannot
> > evict them */
> > -               if (vma->node.color != I915_COLOR_UNEVICTABLE &&
> > -                   grab_vma(vma, ww)) {
> > -                       ret = __i915_vma_unbind(vma);
> > -                       ungrab_vma(vma);
> > +               if (vma->node.color != I915_COLOR_UNEVICTABLE) {
> > +                       ret = grab_vma(vma, ww);
> > +                       if (!ret) {
> > +                               ret = __i915_vma_unbind(vma);
> > +                               ungrab_vma(vma);
> > +                       }
> >                 } else {
> >                         ret = -ENOSPC;
> >                 }
> > @@ -382,10 +389,9 @@ int i915_gem_evict_for_node(struct
> > i915_address_space *vm,
> >                         break;
> >                 }
> >   
> > -               if (!grab_vma(vma, ww)) {
> > -                       ret = -ENOSPC;
> > +               ret = grab_vma(vma, ww);
> > +               if (ret)
> >                         break;
> > -               }
> >   
> >                 /*
> >                  * Never show fear in the face of dragons!
> > diff --git a/drivers/gpu/drm/i915/i915_gem_ww.c
> > b/drivers/gpu/drm/i915/i915_gem_ww.c
> > index 3f6ff139478e..937b279f50fc 100644
> > --- a/drivers/gpu/drm/i915/i915_gem_ww.c
> > +++ b/drivers/gpu/drm/i915/i915_gem_ww.c
> > @@ -19,16 +19,14 @@ static void i915_gem_ww_ctx_unlock_all(struct
> > i915_gem_ww_ctx *ww)
> >         struct drm_i915_gem_object *obj;
> >   
> >         while ((obj = list_first_entry_or_null(&ww->obj_list,
> > struct drm_i915_gem_object, obj_link))) {
> > -               list_del(&obj->obj_link);
> > -               i915_gem_object_unlock(obj);
> > -               i915_gem_object_put(obj);
> > +               i915_gem_ww_unlock_single(obj);
> >         }
> >   }
> >   
> >   void i915_gem_ww_unlock_single(struct drm_i915_gem_object *obj)
> >   {
> > -       list_del(&obj->obj_link);
> > -       i915_gem_object_unlock(obj);
> > +       list_del_init(&obj->obj_link);
> > +       __i915_gem_object_unlock(obj);
> >         i915_gem_object_put(obj);
> >   }
> >   


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] drm/i915: Fix unhandled deadlock in grab_vma()
@ 2022-11-10 15:21     ` Thomas Hellström
  0 siblings, 0 replies; 19+ messages in thread
From: Thomas Hellström @ 2022-11-10 15:21 UTC (permalink / raw)
  To: Matthew Auld, Mani Milani, LKML
  Cc: Tvrtko Ursulin, Maarten Lankhorst, Chris Wilson,
	Christian König, Daniel Vetter, David Airlie, Jani Nikula,
	Joonas Lahtinen, Niranjana Vishwanathapura, Nirmoy Das,
	Rodrigo Vivi, dri-devel, intel-gfx

On Thu, 2022-11-10 at 14:49 +0000, Matthew Auld wrote:
> On 10/11/2022 05:31, Mani Milani wrote:
> > At present, the gpu thread crashes at times when grab_vma()
> > attempts to
> > acquire a gem object lock when in a deadlock state.
> > 
> > Problems:
> > I identified the following 4 issues in the current code:
> > 1. Since grab_vma() calls i915_gem_object_trylock(), which
> > consequently
> >     calls ww_mutex_trylock(), to acquire lock, it does not perform
> > any
> >     -EDEADLK handling; And -EALREADY handling is also unreliable,
> >     according to the description of ww_mutex_trylock().
> > 2. Since the return value of grab_vma() is a boolean showing
> >     success/failure, it does not provide any extra information on
> > the
> >     failure reason, and therefore does not provide any mechanism to
> > its
> >     caller to take any action to fix a potential deadlock.
> > 3. Current grab_vma() implementation produces inconsistent
> > behaviour
> >     depending on the refcount value, without informing the caller.
> > If
> >     refcount is already zero, grab_vma() neither acquires lock nor
> >     increments the refcount, but still returns 'true' for success!
> > This
> >     means that grab_vma() returning true (for success) does not
> > always
> >     mean that the gem obj is actually safely accessible.
> > 4. Currently, calling "i915_gem_object_lock(obj,ww)" is meant to be
> >     followed by a consequent "i915_gem_object_unlock(obj)" ONLY if
> > the
> >     original 'ww' object pointer was NULL, or otherwise not be
> > called and
> >     leave the houskeeping to "i915_gem_ww_ctx_fini(ww)". There are
> > a few
> >     issues with this:
> >     - This is not documented anywhere in the code (that I could
> > find),
> >       but only explained in an older commit message.
> >     - This produces an inconsistent usage of the lock/unlock
> > functions,
> >       increasing the chance of mistakes and issues.
> >     - This is not a clean design as it requires any new code that
> > calls
> >       these lock/unlock functions to know their internals, as well
> > as the
> >       internals of the functions calling the new code being added.
> > 
> > Fix:
> > To fix the issues above, this patch:
> > 1. Changes grab_vma() to call i915_gem_object_lock() instead of
> >     i915_gem_object_trylock(), to handle -EDEADLK and -EALREADY
> > cases.
> >     This should not cause any issue since the PIN_NONBLOCK flag is
> >     checked beforehand in the 2 cases grab_vma() is called.
> > 2. Changes grab_vma() to return the actual error code, instead of
> > bool.
> > 3. Changes grab_vma() to behave consistently when returning
> > success, by
> >     both incrementing the refcount and acquiring lock at all times.
> > 4. Changes i915_gem_object_unlock() to pair with
> > i915_gem_object_lock()
> >     nicely in all cases and do the housekeeping without the need
> > for the
> >     caller to do anything other than simply calling lock and
> > unlock.
> > 5. Ensures the gem obj->obj_link is initialized and deleted from
> > the ww
> >     list such that it can be tested for emptiness using
> > list_empty().
> > 
> > Signed-off-by: Mani Milani <mani@chromium.org>
> > ---
> > 
> >   drivers/gpu/drm/i915/gem/i915_gem_object.c |  2 +
> >   drivers/gpu/drm/i915/gem/i915_gem_object.h | 10 ++++-
> >   drivers/gpu/drm/i915/i915_gem_evict.c      | 48 ++++++++++++-----
> > -----
> >   drivers/gpu/drm/i915/i915_gem_ww.c         |  8 ++--
> >   4 files changed, 41 insertions(+), 27 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gem/i915_gem_object.c
> > b/drivers/gpu/drm/i915/gem/i915_gem_object.c
> > index 369006c5317f..69d013b393fb 100644
> > --- a/drivers/gpu/drm/i915/gem/i915_gem_object.c
> > +++ b/drivers/gpu/drm/i915/gem/i915_gem_object.c
> > @@ -78,6 +78,8 @@ void i915_gem_object_init(struct
> > drm_i915_gem_object *obj,
> >   
> >         INIT_LIST_HEAD(&obj->mm.link);
> >   
> > +       INIT_LIST_HEAD(&obj->obj_link);
> > +
> >         INIT_LIST_HEAD(&obj->lut_list);
> >         spin_lock_init(&obj->lut_lock);
> >   
> > diff --git a/drivers/gpu/drm/i915/gem/i915_gem_object.h
> > b/drivers/gpu/drm/i915/gem/i915_gem_object.h
> > index 1723af9b0f6a..7e7a61bdf52c 100644
> > --- a/drivers/gpu/drm/i915/gem/i915_gem_object.h
> > +++ b/drivers/gpu/drm/i915/gem/i915_gem_object.h
> > @@ -219,7 +219,7 @@ static inline bool
> > i915_gem_object_trylock(struct drm_i915_gem_object *obj,
> >                 return ww_mutex_trylock(&obj->base.resv->lock, &ww-
> > >ctx);
> >   }
> >   
> > -static inline void i915_gem_object_unlock(struct
> > drm_i915_gem_object *obj)
> > +static inline void __i915_gem_object_unlock(struct
> > drm_i915_gem_object *obj)
> >   {
> >         if (obj->ops->adjust_lru)
> >                 obj->ops->adjust_lru(obj);
> > @@ -227,6 +227,14 @@ static inline void
> > i915_gem_object_unlock(struct drm_i915_gem_object *obj)
> >         dma_resv_unlock(obj->base.resv);
> >   }
> >   
> > +static inline void i915_gem_object_unlock(struct
> > drm_i915_gem_object *obj)
> > +{
> > +       if (list_empty(&obj->obj_link))
> > +               __i915_gem_object_unlock(obj);
> > +       else
> > +               i915_gem_ww_unlock_single(obj);
> > +}
> > +
> >   static inline void
> >   i915_gem_object_set_readonly(struct drm_i915_gem_object *obj)
> >   {
> > diff --git a/drivers/gpu/drm/i915/i915_gem_evict.c
> > b/drivers/gpu/drm/i915/i915_gem_evict.c
> > index f025ee4fa526..3eb514b4eddc 100644
> > --- a/drivers/gpu/drm/i915/i915_gem_evict.c
> > +++ b/drivers/gpu/drm/i915/i915_gem_evict.c
> > @@ -55,29 +55,33 @@ static int ggtt_flush(struct intel_gt *gt)
> >         return intel_gt_wait_for_idle(gt, MAX_SCHEDULE_TIMEOUT);
> >   }
> >   
> > -static bool grab_vma(struct i915_vma *vma, struct i915_gem_ww_ctx
> > *ww)
> > +static int grab_vma(struct i915_vma *vma, struct i915_gem_ww_ctx
> > *ww)
> >   {
> > +       int err;
> > +
> > +       /* Dead objects don't need pins */
> > +       if (dying_vma(vma))
> > +               atomic_and(~I915_VMA_PIN_MASK, &vma->flags);
> > +
> > +       err = i915_gem_object_lock(vma->obj, ww);
> 
> AFAIK the issue here is that we are already holding the vm->mutex, so
> this can potentially deadlock, which I guess is why this was trylock.
> 
> We typically grab a bunch of object locks during execbuf, and then
> grab 
> the vm->mutex, before binding the vma for each object. So vm->mutex
> is 
> always our inner lock, and the object lock is the outer one. Using a 
> full lock here then inverts that locking AFAICT. Like say if one
> process 
> is holding object A + vm->mutex and then tries to grab object B here
> in 
> grab_vma(), but another process is already holding object B + waiting
> to 
> grab vm->mutex?

Indeed. 

IIRC the assumption here was that a ww mutex trylock would be similar
in behaviour to the previous code which instead checked for pinned
vmas.

One difference, though, is that we lock most objects upfront and then
try to make space for VMAs avoiding evicting vmas that is needed anyway
for the batch. The old code would instead evict and rebind.

But there should be a catch-all evict everything and rebind if eviction
fails so it would be beneficial to see the "gpu thread crash". I think
there is an issue filed in gitlab where even  this catch-all evict
everything fails that might needs looking at.

/Thomas

> 
> > +
> >         /*
> >          * We add the extra refcount so the object doesn't drop to
> > zero until
> > -        * after ungrab_vma(), this way trylock is always paired
> > with unlock.
> > +        * after ungrab_vma(), this way lock is always paired with
> > unlock.
> >          */
> > -       if (i915_gem_object_get_rcu(vma->obj)) {
> > -               if (!i915_gem_object_trylock(vma->obj, ww)) {
> > -                       i915_gem_object_put(vma->obj);
> > -                       return false;
> > -               }
> > -       } else {
> > -               /* Dead objects don't need pins */
> > -               atomic_and(~I915_VMA_PIN_MASK, &vma->flags);
> > -       }
> > +       if (!err)
> > +               i915_gem_object_get(vma->obj);
> >   
> > -       return true;
> > +       return err;
> >   }
> >   
> >   static void ungrab_vma(struct i915_vma *vma)
> >   {
> > -       if (dying_vma(vma))
> > +       if (dying_vma(vma)) {
> > +               /* Dead objects don't need pins */
> > +               atomic_and(~I915_VMA_PIN_MASK, &vma->flags);
> >                 return;
> > +       }
> >   
> >         i915_gem_object_unlock(vma->obj);
> >         i915_gem_object_put(vma->obj);
> > @@ -93,10 +97,11 @@ mark_free(struct drm_mm_scan *scan,
> >         if (i915_vma_is_pinned(vma))
> >                 return false;
> >   
> > -       if (!grab_vma(vma, ww))
> > +       if (grab_vma(vma, ww))
> >                 return false;
> >   
> >         list_add(&vma->evict_link, unwind);
> > +
> >         return drm_mm_scan_add_block(scan, &vma->node);
> >   }
> >   
> > @@ -284,10 +289,12 @@ i915_gem_evict_something(struct
> > i915_address_space *vm,
> >                 vma = container_of(node, struct i915_vma, node);
> >   
> >                 /* If we find any non-objects (!vma), we cannot
> > evict them */
> > -               if (vma->node.color != I915_COLOR_UNEVICTABLE &&
> > -                   grab_vma(vma, ww)) {
> > -                       ret = __i915_vma_unbind(vma);
> > -                       ungrab_vma(vma);
> > +               if (vma->node.color != I915_COLOR_UNEVICTABLE) {
> > +                       ret = grab_vma(vma, ww);
> > +                       if (!ret) {
> > +                               ret = __i915_vma_unbind(vma);
> > +                               ungrab_vma(vma);
> > +                       }
> >                 } else {
> >                         ret = -ENOSPC;
> >                 }
> > @@ -382,10 +389,9 @@ int i915_gem_evict_for_node(struct
> > i915_address_space *vm,
> >                         break;
> >                 }
> >   
> > -               if (!grab_vma(vma, ww)) {
> > -                       ret = -ENOSPC;
> > +               ret = grab_vma(vma, ww);
> > +               if (ret)
> >                         break;
> > -               }
> >   
> >                 /*
> >                  * Never show fear in the face of dragons!
> > diff --git a/drivers/gpu/drm/i915/i915_gem_ww.c
> > b/drivers/gpu/drm/i915/i915_gem_ww.c
> > index 3f6ff139478e..937b279f50fc 100644
> > --- a/drivers/gpu/drm/i915/i915_gem_ww.c
> > +++ b/drivers/gpu/drm/i915/i915_gem_ww.c
> > @@ -19,16 +19,14 @@ static void i915_gem_ww_ctx_unlock_all(struct
> > i915_gem_ww_ctx *ww)
> >         struct drm_i915_gem_object *obj;
> >   
> >         while ((obj = list_first_entry_or_null(&ww->obj_list,
> > struct drm_i915_gem_object, obj_link))) {
> > -               list_del(&obj->obj_link);
> > -               i915_gem_object_unlock(obj);
> > -               i915_gem_object_put(obj);
> > +               i915_gem_ww_unlock_single(obj);
> >         }
> >   }
> >   
> >   void i915_gem_ww_unlock_single(struct drm_i915_gem_object *obj)
> >   {
> > -       list_del(&obj->obj_link);
> > -       i915_gem_object_unlock(obj);
> > +       list_del_init(&obj->obj_link);
> > +       __i915_gem_object_unlock(obj);
> >         i915_gem_object_put(obj);
> >   }
> >   


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [Intel-gfx] ✗ Fi.CI.BUILD: failure for drm/i915: Fix unhandled deadlock in grab_vma()
  2022-11-10  5:31 ` [Intel-gfx] " Mani Milani
                   ` (2 preceding siblings ...)
  (?)
@ 2022-11-10 21:08 ` Patchwork
  -1 siblings, 0 replies; 19+ messages in thread
From: Patchwork @ 2022-11-10 21:08 UTC (permalink / raw)
  To: Mani Milani; +Cc: intel-gfx

== Series Details ==

Series: drm/i915: Fix unhandled deadlock in grab_vma()
URL   : https://patchwork.freedesktop.org/series/110733/
State : failure

== Summary ==

Error: make failed
  CALL    scripts/checksyscalls.sh
  DESCEND objtool
  MODPOST Module.symvers
ERROR: modpost: "i915_gem_ww_unlock_single" [drivers/gpu/drm/i915/kvmgt.ko] undefined!
scripts/Makefile.modpost:126: recipe for target 'Module.symvers' failed
make[1]: *** [Module.symvers] Error 1
Makefile:1944: recipe for target 'modpost' failed
make: *** [modpost] Error 2



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] drm/i915: Fix unhandled deadlock in grab_vma()
  2022-11-10 15:21     ` [Intel-gfx] " Thomas Hellström
  (?)
@ 2022-11-14  2:16       ` Mani Milani
  -1 siblings, 0 replies; 19+ messages in thread
From: Mani Milani @ 2022-11-14  2:16 UTC (permalink / raw)
  To: Thomas Hellström
  Cc: Matthew Auld, LKML, Tvrtko Ursulin, Maarten Lankhorst,
	Chris Wilson, Christian König, Daniel Vetter, David Airlie,
	Jani Nikula, Joonas Lahtinen, Niranjana Vishwanathapura,
	Nirmoy Das, Rodrigo Vivi, dri-devel, intel-gfx

Thank you for your comments.

To Thomas's point, the crash always seems to happen when the following
sequence of events occurs:

1. When inside "i915_gem_evict_vm()", the call to
"i915_gem_object_trylock(vma->obj, ww)" fails (due to deadlock), and
eviction of a vma is skipped as a result. Basically if the code
reaches here:
https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/i915/i915_gem_evict.c#L468
And here is the stack dump for this scenario:
    Call Trace:
    <TASK>
    dump_stack_lvl+0x68/0x95
    i915_gem_evict_vm+0x1d2/0x369
    eb_validate_vmas+0x54a/0x6ae
    eb_relocate_parse+0x4b/0xdb
    i915_gem_execbuffer2_ioctl+0x6f5/0xab6
    ? i915_gem_object_prepare_write+0xfb/0xfb
    drm_ioctl_kernel+0xda/0x14d
    drm_ioctl+0x27f/0x3b7
    ? i915_gem_object_prepare_write+0xfb/0xfb
    __se_sys_ioctl+0x7a/0xbc
    do_syscall_64+0x56/0xa1
    ? exit_to_user_mode_prepare+0x3d/0x8c
    entry_SYSCALL_64_after_hwframe+0x61/0xcb
    RIP: 0033:0x78302de5fae7
    Code: c0 0f 89 74 ff ff ff 48 83 c4 08 49 c7 c4 ff ff ff ff 5b 4c
89 e0 41 5c 41 5d 5d c3 0f 1f 80 00 00 00 00 b8 10 00 00 00 0f 05 <48>
3d 01 f0 ff ff 73 01 c3 48 8b 0d 51 c3 0c 00 f7 d8 64 89 01 48
    RSP: 002b:00007ffe64b87f78 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
    RAX: ffffffffffffffda RBX: 000003cc00470000 RCX: 000078302de5fae7
    RDX: 00007ffe64b87fd0 RSI: 0000000040406469 RDI: 000000000000000d
    RBP: 00007ffe64b87fa0 R08: 0000000000000013 R09: 000003cc004d0950
    R10: 0000000000000200 R11: 0000000000000246 R12: 000000000000000d
    R13: 0000000000000000 R14: 00007ffe64b87fd0 R15: 0000000040406469
    </TASK>
It is worth noting that "i915_gem_evict_vm()" still returns success in
this case.

2. After step 1 occurs, the next call to "grab_vma()" always fails
(with "i915_gem_object_trylock(vma->obj, ww)" failing also due to
deadlock), which then results in the crash.
Here is the stack dump for this scenario:
    Call Trace:
    <TASK>
    dump_stack_lvl+0x68/0x95
    grab_vma+0x6c/0xd0
    i915_gem_evict_for_node+0x178/0x23b
    i915_gem_gtt_reserve+0x5a/0x82
    i915_vma_insert+0x295/0x29e
    i915_vma_pin_ww+0x41e/0x5c7
    eb_validate_vmas+0x5f5/0x6ae
    eb_relocate_parse+0x4b/0xdb
    i915_gem_execbuffer2_ioctl+0x6f5/0xab6
    ? i915_gem_object_prepare_write+0xfb/0xfb
    drm_ioctl_kernel+0xda/0x14d
    drm_ioctl+0x27f/0x3b7
    ? i915_gem_object_prepare_write+0xfb/0xfb
    __se_sys_ioctl+0x7a/0xbc
    do_syscall_64+0x56/0xa1
    ? exit_to_user_mode_prepare+0x3d/0x8c
    entry_SYSCALL_64_after_hwframe+0x61/0xcb
    RIP: 0033:0x78302de5fae7
    Code: c0 0f 89 74 ff ff ff 48 83 c4 08 49 c7 c4 ff ff ff ff 5b 4c
89 e0 41 5c 41 5d 5d c3 0f 1f 80 00 00 00 00 b8 10 00 00 00 0f 05 <48>
3d 01 f0 ff ff 73 01 c3 48 8b 0d 51 c3 0c 00 f7 d8 64 89 01 48
    RSP: 002b:00007ffe64b87f78 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
    RAX: ffffffffffffffda RBX: 000003cc00470000 RCX: 000078302de5fae7
    RDX: 00007ffe64b87fd0 RSI: 0000000040406469 RDI: 000000000000000d
    RBP: 00007ffe64b87fa0 R08: 0000000000000013 R09: 000003cc004d0950
    R10: 0000000000000200 R11: 0000000000000246 R12: 000000000000000d
    R13: 0000000000000000 R14: 00007ffe64b87fd0 R15: 0000000040406469
    </TASK>

My Notes:
- I verified the two "i915_gem_object_trylock()" failures I mentioned
above are due to deadlock by slightly modifying the code to call
"i915_gem_object_lock()" only in those exact cases and subsequent to
the trylock failure, only to look at the return error code.
- The two cases mentioned above, are the only cases where
"i915_gem_object_trylock(obj, ww)" is called with the second argument
not being forced to NULL.
- When in either of the two cases above (i.e. inside "grab_vma()" or
"i915_gem_evict_vm") I replace calling "i915_gem_object_trylock" with
"i915_gem_object_lock", the issue gets resolved (because deadlock is
detected and resolved).

So if this could matches the design better, another solution could be
for "grab_vma" to continue to call "i915_gem_object_trylock", but for
"i915_gem_evict_vm" to call "i915_gem_object_lock" instead.

Further info:
- Would you like any further info on the crash? If so, could you
please advise 1) what exactly you need and 2) how I can share with you
especially if it is big dumps?

Thanks.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Intel-gfx] [PATCH] drm/i915: Fix unhandled deadlock in grab_vma()
@ 2022-11-14  2:16       ` Mani Milani
  0 siblings, 0 replies; 19+ messages in thread
From: Mani Milani @ 2022-11-14  2:16 UTC (permalink / raw)
  To: Thomas Hellström
  Cc: intel-gfx, dri-devel, LKML, Chris Wilson, Matthew Auld,
	Daniel Vetter, Rodrigo Vivi, David Airlie, Christian König,
	Nirmoy Das

Thank you for your comments.

To Thomas's point, the crash always seems to happen when the following
sequence of events occurs:

1. When inside "i915_gem_evict_vm()", the call to
"i915_gem_object_trylock(vma->obj, ww)" fails (due to deadlock), and
eviction of a vma is skipped as a result. Basically if the code
reaches here:
https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/i915/i915_gem_evict.c#L468
And here is the stack dump for this scenario:
    Call Trace:
    <TASK>
    dump_stack_lvl+0x68/0x95
    i915_gem_evict_vm+0x1d2/0x369
    eb_validate_vmas+0x54a/0x6ae
    eb_relocate_parse+0x4b/0xdb
    i915_gem_execbuffer2_ioctl+0x6f5/0xab6
    ? i915_gem_object_prepare_write+0xfb/0xfb
    drm_ioctl_kernel+0xda/0x14d
    drm_ioctl+0x27f/0x3b7
    ? i915_gem_object_prepare_write+0xfb/0xfb
    __se_sys_ioctl+0x7a/0xbc
    do_syscall_64+0x56/0xa1
    ? exit_to_user_mode_prepare+0x3d/0x8c
    entry_SYSCALL_64_after_hwframe+0x61/0xcb
    RIP: 0033:0x78302de5fae7
    Code: c0 0f 89 74 ff ff ff 48 83 c4 08 49 c7 c4 ff ff ff ff 5b 4c
89 e0 41 5c 41 5d 5d c3 0f 1f 80 00 00 00 00 b8 10 00 00 00 0f 05 <48>
3d 01 f0 ff ff 73 01 c3 48 8b 0d 51 c3 0c 00 f7 d8 64 89 01 48
    RSP: 002b:00007ffe64b87f78 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
    RAX: ffffffffffffffda RBX: 000003cc00470000 RCX: 000078302de5fae7
    RDX: 00007ffe64b87fd0 RSI: 0000000040406469 RDI: 000000000000000d
    RBP: 00007ffe64b87fa0 R08: 0000000000000013 R09: 000003cc004d0950
    R10: 0000000000000200 R11: 0000000000000246 R12: 000000000000000d
    R13: 0000000000000000 R14: 00007ffe64b87fd0 R15: 0000000040406469
    </TASK>
It is worth noting that "i915_gem_evict_vm()" still returns success in
this case.

2. After step 1 occurs, the next call to "grab_vma()" always fails
(with "i915_gem_object_trylock(vma->obj, ww)" failing also due to
deadlock), which then results in the crash.
Here is the stack dump for this scenario:
    Call Trace:
    <TASK>
    dump_stack_lvl+0x68/0x95
    grab_vma+0x6c/0xd0
    i915_gem_evict_for_node+0x178/0x23b
    i915_gem_gtt_reserve+0x5a/0x82
    i915_vma_insert+0x295/0x29e
    i915_vma_pin_ww+0x41e/0x5c7
    eb_validate_vmas+0x5f5/0x6ae
    eb_relocate_parse+0x4b/0xdb
    i915_gem_execbuffer2_ioctl+0x6f5/0xab6
    ? i915_gem_object_prepare_write+0xfb/0xfb
    drm_ioctl_kernel+0xda/0x14d
    drm_ioctl+0x27f/0x3b7
    ? i915_gem_object_prepare_write+0xfb/0xfb
    __se_sys_ioctl+0x7a/0xbc
    do_syscall_64+0x56/0xa1
    ? exit_to_user_mode_prepare+0x3d/0x8c
    entry_SYSCALL_64_after_hwframe+0x61/0xcb
    RIP: 0033:0x78302de5fae7
    Code: c0 0f 89 74 ff ff ff 48 83 c4 08 49 c7 c4 ff ff ff ff 5b 4c
89 e0 41 5c 41 5d 5d c3 0f 1f 80 00 00 00 00 b8 10 00 00 00 0f 05 <48>
3d 01 f0 ff ff 73 01 c3 48 8b 0d 51 c3 0c 00 f7 d8 64 89 01 48
    RSP: 002b:00007ffe64b87f78 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
    RAX: ffffffffffffffda RBX: 000003cc00470000 RCX: 000078302de5fae7
    RDX: 00007ffe64b87fd0 RSI: 0000000040406469 RDI: 000000000000000d
    RBP: 00007ffe64b87fa0 R08: 0000000000000013 R09: 000003cc004d0950
    R10: 0000000000000200 R11: 0000000000000246 R12: 000000000000000d
    R13: 0000000000000000 R14: 00007ffe64b87fd0 R15: 0000000040406469
    </TASK>

My Notes:
- I verified the two "i915_gem_object_trylock()" failures I mentioned
above are due to deadlock by slightly modifying the code to call
"i915_gem_object_lock()" only in those exact cases and subsequent to
the trylock failure, only to look at the return error code.
- The two cases mentioned above, are the only cases where
"i915_gem_object_trylock(obj, ww)" is called with the second argument
not being forced to NULL.
- When in either of the two cases above (i.e. inside "grab_vma()" or
"i915_gem_evict_vm") I replace calling "i915_gem_object_trylock" with
"i915_gem_object_lock", the issue gets resolved (because deadlock is
detected and resolved).

So if this could matches the design better, another solution could be
for "grab_vma" to continue to call "i915_gem_object_trylock", but for
"i915_gem_evict_vm" to call "i915_gem_object_lock" instead.

Further info:
- Would you like any further info on the crash? If so, could you
please advise 1) what exactly you need and 2) how I can share with you
especially if it is big dumps?

Thanks.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] drm/i915: Fix unhandled deadlock in grab_vma()
@ 2022-11-14  2:16       ` Mani Milani
  0 siblings, 0 replies; 19+ messages in thread
From: Mani Milani @ 2022-11-14  2:16 UTC (permalink / raw)
  To: Thomas Hellström
  Cc: Tvrtko Ursulin, intel-gfx, dri-devel, LKML, Chris Wilson,
	Matthew Auld, Rodrigo Vivi, Niranjana Vishwanathapura,
	Christian König, Nirmoy Das

Thank you for your comments.

To Thomas's point, the crash always seems to happen when the following
sequence of events occurs:

1. When inside "i915_gem_evict_vm()", the call to
"i915_gem_object_trylock(vma->obj, ww)" fails (due to deadlock), and
eviction of a vma is skipped as a result. Basically if the code
reaches here:
https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/i915/i915_gem_evict.c#L468
And here is the stack dump for this scenario:
    Call Trace:
    <TASK>
    dump_stack_lvl+0x68/0x95
    i915_gem_evict_vm+0x1d2/0x369
    eb_validate_vmas+0x54a/0x6ae
    eb_relocate_parse+0x4b/0xdb
    i915_gem_execbuffer2_ioctl+0x6f5/0xab6
    ? i915_gem_object_prepare_write+0xfb/0xfb
    drm_ioctl_kernel+0xda/0x14d
    drm_ioctl+0x27f/0x3b7
    ? i915_gem_object_prepare_write+0xfb/0xfb
    __se_sys_ioctl+0x7a/0xbc
    do_syscall_64+0x56/0xa1
    ? exit_to_user_mode_prepare+0x3d/0x8c
    entry_SYSCALL_64_after_hwframe+0x61/0xcb
    RIP: 0033:0x78302de5fae7
    Code: c0 0f 89 74 ff ff ff 48 83 c4 08 49 c7 c4 ff ff ff ff 5b 4c
89 e0 41 5c 41 5d 5d c3 0f 1f 80 00 00 00 00 b8 10 00 00 00 0f 05 <48>
3d 01 f0 ff ff 73 01 c3 48 8b 0d 51 c3 0c 00 f7 d8 64 89 01 48
    RSP: 002b:00007ffe64b87f78 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
    RAX: ffffffffffffffda RBX: 000003cc00470000 RCX: 000078302de5fae7
    RDX: 00007ffe64b87fd0 RSI: 0000000040406469 RDI: 000000000000000d
    RBP: 00007ffe64b87fa0 R08: 0000000000000013 R09: 000003cc004d0950
    R10: 0000000000000200 R11: 0000000000000246 R12: 000000000000000d
    R13: 0000000000000000 R14: 00007ffe64b87fd0 R15: 0000000040406469
    </TASK>
It is worth noting that "i915_gem_evict_vm()" still returns success in
this case.

2. After step 1 occurs, the next call to "grab_vma()" always fails
(with "i915_gem_object_trylock(vma->obj, ww)" failing also due to
deadlock), which then results in the crash.
Here is the stack dump for this scenario:
    Call Trace:
    <TASK>
    dump_stack_lvl+0x68/0x95
    grab_vma+0x6c/0xd0
    i915_gem_evict_for_node+0x178/0x23b
    i915_gem_gtt_reserve+0x5a/0x82
    i915_vma_insert+0x295/0x29e
    i915_vma_pin_ww+0x41e/0x5c7
    eb_validate_vmas+0x5f5/0x6ae
    eb_relocate_parse+0x4b/0xdb
    i915_gem_execbuffer2_ioctl+0x6f5/0xab6
    ? i915_gem_object_prepare_write+0xfb/0xfb
    drm_ioctl_kernel+0xda/0x14d
    drm_ioctl+0x27f/0x3b7
    ? i915_gem_object_prepare_write+0xfb/0xfb
    __se_sys_ioctl+0x7a/0xbc
    do_syscall_64+0x56/0xa1
    ? exit_to_user_mode_prepare+0x3d/0x8c
    entry_SYSCALL_64_after_hwframe+0x61/0xcb
    RIP: 0033:0x78302de5fae7
    Code: c0 0f 89 74 ff ff ff 48 83 c4 08 49 c7 c4 ff ff ff ff 5b 4c
89 e0 41 5c 41 5d 5d c3 0f 1f 80 00 00 00 00 b8 10 00 00 00 0f 05 <48>
3d 01 f0 ff ff 73 01 c3 48 8b 0d 51 c3 0c 00 f7 d8 64 89 01 48
    RSP: 002b:00007ffe64b87f78 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
    RAX: ffffffffffffffda RBX: 000003cc00470000 RCX: 000078302de5fae7
    RDX: 00007ffe64b87fd0 RSI: 0000000040406469 RDI: 000000000000000d
    RBP: 00007ffe64b87fa0 R08: 0000000000000013 R09: 000003cc004d0950
    R10: 0000000000000200 R11: 0000000000000246 R12: 000000000000000d
    R13: 0000000000000000 R14: 00007ffe64b87fd0 R15: 0000000040406469
    </TASK>

My Notes:
- I verified the two "i915_gem_object_trylock()" failures I mentioned
above are due to deadlock by slightly modifying the code to call
"i915_gem_object_lock()" only in those exact cases and subsequent to
the trylock failure, only to look at the return error code.
- The two cases mentioned above, are the only cases where
"i915_gem_object_trylock(obj, ww)" is called with the second argument
not being forced to NULL.
- When in either of the two cases above (i.e. inside "grab_vma()" or
"i915_gem_evict_vm") I replace calling "i915_gem_object_trylock" with
"i915_gem_object_lock", the issue gets resolved (because deadlock is
detected and resolved).

So if this could matches the design better, another solution could be
for "grab_vma" to continue to call "i915_gem_object_trylock", but for
"i915_gem_evict_vm" to call "i915_gem_object_lock" instead.

Further info:
- Would you like any further info on the crash? If so, could you
please advise 1) what exactly you need and 2) how I can share with you
especially if it is big dumps?

Thanks.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] drm/i915: Fix unhandled deadlock in grab_vma()
  2022-11-14  2:16       ` [Intel-gfx] " Mani Milani
  (?)
@ 2022-11-14 12:48         ` Thomas Hellström
  -1 siblings, 0 replies; 19+ messages in thread
From: Thomas Hellström @ 2022-11-14 12:48 UTC (permalink / raw)
  To: Mani Milani
  Cc: Matthew Auld, LKML, Tvrtko Ursulin, Maarten Lankhorst,
	Chris Wilson, Christian König, Daniel Vetter, David Airlie,
	Jani Nikula, Joonas Lahtinen, Niranjana Vishwanathapura,
	Nirmoy Das, Rodrigo Vivi, dri-devel, intel-gfx

Hi, Mani.

On 11/14/22 03:16, Mani Milani wrote:
> Thank you for your comments.
>
> To Thomas's point, the crash always seems to happen when the following
> sequence of events occurs:
>
> 1. When inside "i915_gem_evict_vm()", the call to
> "i915_gem_object_trylock(vma->obj, ww)" fails (due to deadlock), and
> eviction of a vma is skipped as a result. Basically if the code
> reaches here:
> https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/i915/i915_gem_evict.c#L468
> And here is the stack dump for this scenario:
>      Call Trace:
>      <TASK>
>      dump_stack_lvl+0x68/0x95
>      i915_gem_evict_vm+0x1d2/0x369
>      eb_validate_vmas+0x54a/0x6ae
>      eb_relocate_parse+0x4b/0xdb
>      i915_gem_execbuffer2_ioctl+0x6f5/0xab6
>      ? i915_gem_object_prepare_write+0xfb/0xfb
>      drm_ioctl_kernel+0xda/0x14d
>      drm_ioctl+0x27f/0x3b7
>      ? i915_gem_object_prepare_write+0xfb/0xfb
>      __se_sys_ioctl+0x7a/0xbc
>      do_syscall_64+0x56/0xa1
>      ? exit_to_user_mode_prepare+0x3d/0x8c
>      entry_SYSCALL_64_after_hwframe+0x61/0xcb
>      RIP: 0033:0x78302de5fae7
>      Code: c0 0f 89 74 ff ff ff 48 83 c4 08 49 c7 c4 ff ff ff ff 5b 4c
> 89 e0 41 5c 41 5d 5d c3 0f 1f 80 00 00 00 00 b8 10 00 00 00 0f 05 <48>
> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 51 c3 0c 00 f7 d8 64 89 01 48
>      RSP: 002b:00007ffe64b87f78 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
>      RAX: ffffffffffffffda RBX: 000003cc00470000 RCX: 000078302de5fae7
>      RDX: 00007ffe64b87fd0 RSI: 0000000040406469 RDI: 000000000000000d
>      RBP: 00007ffe64b87fa0 R08: 0000000000000013 R09: 000003cc004d0950
>      R10: 0000000000000200 R11: 0000000000000246 R12: 000000000000000d
>      R13: 0000000000000000 R14: 00007ffe64b87fd0 R15: 0000000040406469
>      </TASK>
> It is worth noting that "i915_gem_evict_vm()" still returns success in
> this case.
>
> 2. After step 1 occurs, the next call to "grab_vma()" always fails
> (with "i915_gem_object_trylock(vma->obj, ww)" failing also due to
> deadlock), which then results in the crash.
> Here is the stack dump for this scenario:
>      Call Trace:
>      <TASK>
>      dump_stack_lvl+0x68/0x95
>      grab_vma+0x6c/0xd0
>      i915_gem_evict_for_node+0x178/0x23b
>      i915_gem_gtt_reserve+0x5a/0x82
>      i915_vma_insert+0x295/0x29e
>      i915_vma_pin_ww+0x41e/0x5c7
>      eb_validate_vmas+0x5f5/0x6ae
>      eb_relocate_parse+0x4b/0xdb
>      i915_gem_execbuffer2_ioctl+0x6f5/0xab6
>      ? i915_gem_object_prepare_write+0xfb/0xfb
>      drm_ioctl_kernel+0xda/0x14d
>      drm_ioctl+0x27f/0x3b7
>      ? i915_gem_object_prepare_write+0xfb/0xfb
>      __se_sys_ioctl+0x7a/0xbc
>      do_syscall_64+0x56/0xa1
>      ? exit_to_user_mode_prepare+0x3d/0x8c
>      entry_SYSCALL_64_after_hwframe+0x61/0xcb
>      RIP: 0033:0x78302de5fae7
>      Code: c0 0f 89 74 ff ff ff 48 83 c4 08 49 c7 c4 ff ff ff ff 5b 4c
> 89 e0 41 5c 41 5d 5d c3 0f 1f 80 00 00 00 00 b8 10 00 00 00 0f 05 <48>
> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 51 c3 0c 00 f7 d8 64 89 01 48
>      RSP: 002b:00007ffe64b87f78 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
>      RAX: ffffffffffffffda RBX: 000003cc00470000 RCX: 000078302de5fae7
>      RDX: 00007ffe64b87fd0 RSI: 0000000040406469 RDI: 000000000000000d
>      RBP: 00007ffe64b87fa0 R08: 0000000000000013 R09: 000003cc004d0950
>      R10: 0000000000000200 R11: 0000000000000246 R12: 000000000000000d
>      R13: 0000000000000000 R14: 00007ffe64b87fd0 R15: 0000000040406469
>      </TASK>
>
> My Notes:
> - I verified the two "i915_gem_object_trylock()" failures I mentioned
> above are due to deadlock by slightly modifying the code to call
> "i915_gem_object_lock()" only in those exact cases and subsequent to
> the trylock failure, only to look at the return error code.
> - The two cases mentioned above, are the only cases where
> "i915_gem_object_trylock(obj, ww)" is called with the second argument
> not being forced to NULL.
> - When in either of the two cases above (i.e. inside "grab_vma()" or
> "i915_gem_evict_vm") I replace calling "i915_gem_object_trylock" with
> "i915_gem_object_lock", the issue gets resolved (because deadlock is
> detected and resolved).
>
> So if this could matches the design better, another solution could be
> for "grab_vma" to continue to call "i915_gem_object_trylock", but for
> "i915_gem_evict_vm" to call "i915_gem_object_lock" instead.

No, i915_gem_object_lock() is not allowed when the vm mutex is held.


>
> Further info:
> - Would you like any further info on the crash? If so, could you
> please advise 1) what exactly you need and 2) how I can share with you
> especially if it is big dumps?

Yes, I would like to know how the crash manifests itself. Is it a kernel 
BUG or a kernel WARNING or is it the user-space application that crashes 
due to receiveing an -ENOSPC?

Thanks,

Thomas



>
> Thanks.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] drm/i915: Fix unhandled deadlock in grab_vma()
@ 2022-11-14 12:48         ` Thomas Hellström
  0 siblings, 0 replies; 19+ messages in thread
From: Thomas Hellström @ 2022-11-14 12:48 UTC (permalink / raw)
  To: Mani Milani
  Cc: Tvrtko Ursulin, intel-gfx, dri-devel, LKML, Chris Wilson,
	Matthew Auld, Rodrigo Vivi, Niranjana Vishwanathapura,
	Christian König, Nirmoy Das

Hi, Mani.

On 11/14/22 03:16, Mani Milani wrote:
> Thank you for your comments.
>
> To Thomas's point, the crash always seems to happen when the following
> sequence of events occurs:
>
> 1. When inside "i915_gem_evict_vm()", the call to
> "i915_gem_object_trylock(vma->obj, ww)" fails (due to deadlock), and
> eviction of a vma is skipped as a result. Basically if the code
> reaches here:
> https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/i915/i915_gem_evict.c#L468
> And here is the stack dump for this scenario:
>      Call Trace:
>      <TASK>
>      dump_stack_lvl+0x68/0x95
>      i915_gem_evict_vm+0x1d2/0x369
>      eb_validate_vmas+0x54a/0x6ae
>      eb_relocate_parse+0x4b/0xdb
>      i915_gem_execbuffer2_ioctl+0x6f5/0xab6
>      ? i915_gem_object_prepare_write+0xfb/0xfb
>      drm_ioctl_kernel+0xda/0x14d
>      drm_ioctl+0x27f/0x3b7
>      ? i915_gem_object_prepare_write+0xfb/0xfb
>      __se_sys_ioctl+0x7a/0xbc
>      do_syscall_64+0x56/0xa1
>      ? exit_to_user_mode_prepare+0x3d/0x8c
>      entry_SYSCALL_64_after_hwframe+0x61/0xcb
>      RIP: 0033:0x78302de5fae7
>      Code: c0 0f 89 74 ff ff ff 48 83 c4 08 49 c7 c4 ff ff ff ff 5b 4c
> 89 e0 41 5c 41 5d 5d c3 0f 1f 80 00 00 00 00 b8 10 00 00 00 0f 05 <48>
> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 51 c3 0c 00 f7 d8 64 89 01 48
>      RSP: 002b:00007ffe64b87f78 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
>      RAX: ffffffffffffffda RBX: 000003cc00470000 RCX: 000078302de5fae7
>      RDX: 00007ffe64b87fd0 RSI: 0000000040406469 RDI: 000000000000000d
>      RBP: 00007ffe64b87fa0 R08: 0000000000000013 R09: 000003cc004d0950
>      R10: 0000000000000200 R11: 0000000000000246 R12: 000000000000000d
>      R13: 0000000000000000 R14: 00007ffe64b87fd0 R15: 0000000040406469
>      </TASK>
> It is worth noting that "i915_gem_evict_vm()" still returns success in
> this case.
>
> 2. After step 1 occurs, the next call to "grab_vma()" always fails
> (with "i915_gem_object_trylock(vma->obj, ww)" failing also due to
> deadlock), which then results in the crash.
> Here is the stack dump for this scenario:
>      Call Trace:
>      <TASK>
>      dump_stack_lvl+0x68/0x95
>      grab_vma+0x6c/0xd0
>      i915_gem_evict_for_node+0x178/0x23b
>      i915_gem_gtt_reserve+0x5a/0x82
>      i915_vma_insert+0x295/0x29e
>      i915_vma_pin_ww+0x41e/0x5c7
>      eb_validate_vmas+0x5f5/0x6ae
>      eb_relocate_parse+0x4b/0xdb
>      i915_gem_execbuffer2_ioctl+0x6f5/0xab6
>      ? i915_gem_object_prepare_write+0xfb/0xfb
>      drm_ioctl_kernel+0xda/0x14d
>      drm_ioctl+0x27f/0x3b7
>      ? i915_gem_object_prepare_write+0xfb/0xfb
>      __se_sys_ioctl+0x7a/0xbc
>      do_syscall_64+0x56/0xa1
>      ? exit_to_user_mode_prepare+0x3d/0x8c
>      entry_SYSCALL_64_after_hwframe+0x61/0xcb
>      RIP: 0033:0x78302de5fae7
>      Code: c0 0f 89 74 ff ff ff 48 83 c4 08 49 c7 c4 ff ff ff ff 5b 4c
> 89 e0 41 5c 41 5d 5d c3 0f 1f 80 00 00 00 00 b8 10 00 00 00 0f 05 <48>
> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 51 c3 0c 00 f7 d8 64 89 01 48
>      RSP: 002b:00007ffe64b87f78 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
>      RAX: ffffffffffffffda RBX: 000003cc00470000 RCX: 000078302de5fae7
>      RDX: 00007ffe64b87fd0 RSI: 0000000040406469 RDI: 000000000000000d
>      RBP: 00007ffe64b87fa0 R08: 0000000000000013 R09: 000003cc004d0950
>      R10: 0000000000000200 R11: 0000000000000246 R12: 000000000000000d
>      R13: 0000000000000000 R14: 00007ffe64b87fd0 R15: 0000000040406469
>      </TASK>
>
> My Notes:
> - I verified the two "i915_gem_object_trylock()" failures I mentioned
> above are due to deadlock by slightly modifying the code to call
> "i915_gem_object_lock()" only in those exact cases and subsequent to
> the trylock failure, only to look at the return error code.
> - The two cases mentioned above, are the only cases where
> "i915_gem_object_trylock(obj, ww)" is called with the second argument
> not being forced to NULL.
> - When in either of the two cases above (i.e. inside "grab_vma()" or
> "i915_gem_evict_vm") I replace calling "i915_gem_object_trylock" with
> "i915_gem_object_lock", the issue gets resolved (because deadlock is
> detected and resolved).
>
> So if this could matches the design better, another solution could be
> for "grab_vma" to continue to call "i915_gem_object_trylock", but for
> "i915_gem_evict_vm" to call "i915_gem_object_lock" instead.

No, i915_gem_object_lock() is not allowed when the vm mutex is held.


>
> Further info:
> - Would you like any further info on the crash? If so, could you
> please advise 1) what exactly you need and 2) how I can share with you
> especially if it is big dumps?

Yes, I would like to know how the crash manifests itself. Is it a kernel 
BUG or a kernel WARNING or is it the user-space application that crashes 
due to receiveing an -ENOSPC?

Thanks,

Thomas



>
> Thanks.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Intel-gfx] [PATCH] drm/i915: Fix unhandled deadlock in grab_vma()
@ 2022-11-14 12:48         ` Thomas Hellström
  0 siblings, 0 replies; 19+ messages in thread
From: Thomas Hellström @ 2022-11-14 12:48 UTC (permalink / raw)
  To: Mani Milani
  Cc: intel-gfx, dri-devel, LKML, Chris Wilson, Matthew Auld,
	Daniel Vetter, Rodrigo Vivi, David Airlie, Christian König,
	Nirmoy Das

Hi, Mani.

On 11/14/22 03:16, Mani Milani wrote:
> Thank you for your comments.
>
> To Thomas's point, the crash always seems to happen when the following
> sequence of events occurs:
>
> 1. When inside "i915_gem_evict_vm()", the call to
> "i915_gem_object_trylock(vma->obj, ww)" fails (due to deadlock), and
> eviction of a vma is skipped as a result. Basically if the code
> reaches here:
> https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/i915/i915_gem_evict.c#L468
> And here is the stack dump for this scenario:
>      Call Trace:
>      <TASK>
>      dump_stack_lvl+0x68/0x95
>      i915_gem_evict_vm+0x1d2/0x369
>      eb_validate_vmas+0x54a/0x6ae
>      eb_relocate_parse+0x4b/0xdb
>      i915_gem_execbuffer2_ioctl+0x6f5/0xab6
>      ? i915_gem_object_prepare_write+0xfb/0xfb
>      drm_ioctl_kernel+0xda/0x14d
>      drm_ioctl+0x27f/0x3b7
>      ? i915_gem_object_prepare_write+0xfb/0xfb
>      __se_sys_ioctl+0x7a/0xbc
>      do_syscall_64+0x56/0xa1
>      ? exit_to_user_mode_prepare+0x3d/0x8c
>      entry_SYSCALL_64_after_hwframe+0x61/0xcb
>      RIP: 0033:0x78302de5fae7
>      Code: c0 0f 89 74 ff ff ff 48 83 c4 08 49 c7 c4 ff ff ff ff 5b 4c
> 89 e0 41 5c 41 5d 5d c3 0f 1f 80 00 00 00 00 b8 10 00 00 00 0f 05 <48>
> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 51 c3 0c 00 f7 d8 64 89 01 48
>      RSP: 002b:00007ffe64b87f78 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
>      RAX: ffffffffffffffda RBX: 000003cc00470000 RCX: 000078302de5fae7
>      RDX: 00007ffe64b87fd0 RSI: 0000000040406469 RDI: 000000000000000d
>      RBP: 00007ffe64b87fa0 R08: 0000000000000013 R09: 000003cc004d0950
>      R10: 0000000000000200 R11: 0000000000000246 R12: 000000000000000d
>      R13: 0000000000000000 R14: 00007ffe64b87fd0 R15: 0000000040406469
>      </TASK>
> It is worth noting that "i915_gem_evict_vm()" still returns success in
> this case.
>
> 2. After step 1 occurs, the next call to "grab_vma()" always fails
> (with "i915_gem_object_trylock(vma->obj, ww)" failing also due to
> deadlock), which then results in the crash.
> Here is the stack dump for this scenario:
>      Call Trace:
>      <TASK>
>      dump_stack_lvl+0x68/0x95
>      grab_vma+0x6c/0xd0
>      i915_gem_evict_for_node+0x178/0x23b
>      i915_gem_gtt_reserve+0x5a/0x82
>      i915_vma_insert+0x295/0x29e
>      i915_vma_pin_ww+0x41e/0x5c7
>      eb_validate_vmas+0x5f5/0x6ae
>      eb_relocate_parse+0x4b/0xdb
>      i915_gem_execbuffer2_ioctl+0x6f5/0xab6
>      ? i915_gem_object_prepare_write+0xfb/0xfb
>      drm_ioctl_kernel+0xda/0x14d
>      drm_ioctl+0x27f/0x3b7
>      ? i915_gem_object_prepare_write+0xfb/0xfb
>      __se_sys_ioctl+0x7a/0xbc
>      do_syscall_64+0x56/0xa1
>      ? exit_to_user_mode_prepare+0x3d/0x8c
>      entry_SYSCALL_64_after_hwframe+0x61/0xcb
>      RIP: 0033:0x78302de5fae7
>      Code: c0 0f 89 74 ff ff ff 48 83 c4 08 49 c7 c4 ff ff ff ff 5b 4c
> 89 e0 41 5c 41 5d 5d c3 0f 1f 80 00 00 00 00 b8 10 00 00 00 0f 05 <48>
> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 51 c3 0c 00 f7 d8 64 89 01 48
>      RSP: 002b:00007ffe64b87f78 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
>      RAX: ffffffffffffffda RBX: 000003cc00470000 RCX: 000078302de5fae7
>      RDX: 00007ffe64b87fd0 RSI: 0000000040406469 RDI: 000000000000000d
>      RBP: 00007ffe64b87fa0 R08: 0000000000000013 R09: 000003cc004d0950
>      R10: 0000000000000200 R11: 0000000000000246 R12: 000000000000000d
>      R13: 0000000000000000 R14: 00007ffe64b87fd0 R15: 0000000040406469
>      </TASK>
>
> My Notes:
> - I verified the two "i915_gem_object_trylock()" failures I mentioned
> above are due to deadlock by slightly modifying the code to call
> "i915_gem_object_lock()" only in those exact cases and subsequent to
> the trylock failure, only to look at the return error code.
> - The two cases mentioned above, are the only cases where
> "i915_gem_object_trylock(obj, ww)" is called with the second argument
> not being forced to NULL.
> - When in either of the two cases above (i.e. inside "grab_vma()" or
> "i915_gem_evict_vm") I replace calling "i915_gem_object_trylock" with
> "i915_gem_object_lock", the issue gets resolved (because deadlock is
> detected and resolved).
>
> So if this could matches the design better, another solution could be
> for "grab_vma" to continue to call "i915_gem_object_trylock", but for
> "i915_gem_evict_vm" to call "i915_gem_object_lock" instead.

No, i915_gem_object_lock() is not allowed when the vm mutex is held.


>
> Further info:
> - Would you like any further info on the crash? If so, could you
> please advise 1) what exactly you need and 2) how I can share with you
> especially if it is big dumps?

Yes, I would like to know how the crash manifests itself. Is it a kernel 
BUG or a kernel WARNING or is it the user-space application that crashes 
due to receiveing an -ENOSPC?

Thanks,

Thomas



>
> Thanks.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] drm/i915: Fix unhandled deadlock in grab_vma()
  2022-11-14 12:48         ` Thomas Hellström
  (?)
@ 2022-11-15 23:54           ` Mani Milani
  -1 siblings, 0 replies; 19+ messages in thread
From: Mani Milani @ 2022-11-15 23:54 UTC (permalink / raw)
  To: Thomas Hellström
  Cc: Matthew Auld, LKML, Tvrtko Ursulin, Maarten Lankhorst,
	Chris Wilson, Christian König, Daniel Vetter, David Airlie,
	Jani Nikula, Joonas Lahtinen, Niranjana Vishwanathapura,
	Nirmoy Das, Rodrigo Vivi, dri-devel, intel-gfx

Hi Thomas,

It is a user-space application that crashes due to receiving an -ENOSPC.

This occurs after code reaches the line below where grab_vma() fails
(due to deadlock) and consequently i915_gem_evict_for_node() returns
-ENOSPC. I provided the call stack for when this happens in my
previous message:
https://github.com/torvalds/linux/blame/59d0d52c30d4991ac4b329f049cc37118e00f5b0/drivers/gpu/drm/i915/i915_gem_evict.c#L386

Context:
This crash is happening on an intel GeminiLake chromebook, when
running a video seek h264 stress test, and it is reproducible 100%. To
troubleshoot, I did a git bisect and found the following commit to be
the culprit (which is when grab_vma() has been introduced):
https://github.com/torvalds/linux/commit/7e00897be8bf13ef9c68c95a8e386b714c29ad95

I also have crash dumps and further logs that I can send you if
needed. But please let me know how to share those with you, since
pasting them here does not seem reasonable to me.

Thanks,
Mani

On Mon, Nov 14, 2022 at 11:48 PM Thomas Hellström
<thomas.hellstrom@linux.intel.com> wrote:
>
> Hi, Mani.
>
> On 11/14/22 03:16, Mani Milani wrote:
> > Thank you for your comments.
> >
> > To Thomas's point, the crash always seems to happen when the following
> > sequence of events occurs:
> >
> > 1. When inside "i915_gem_evict_vm()", the call to
> > "i915_gem_object_trylock(vma->obj, ww)" fails (due to deadlock), and
> > eviction of a vma is skipped as a result. Basically if the code
> > reaches here:
> > https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/i915/i915_gem_evict.c#L468
> > And here is the stack dump for this scenario:
> >      Call Trace:
> >      <TASK>
> >      dump_stack_lvl+0x68/0x95
> >      i915_gem_evict_vm+0x1d2/0x369
> >      eb_validate_vmas+0x54a/0x6ae
> >      eb_relocate_parse+0x4b/0xdb
> >      i915_gem_execbuffer2_ioctl+0x6f5/0xab6
> >      ? i915_gem_object_prepare_write+0xfb/0xfb
> >      drm_ioctl_kernel+0xda/0x14d
> >      drm_ioctl+0x27f/0x3b7
> >      ? i915_gem_object_prepare_write+0xfb/0xfb
> >      __se_sys_ioctl+0x7a/0xbc
> >      do_syscall_64+0x56/0xa1
> >      ? exit_to_user_mode_prepare+0x3d/0x8c
> >      entry_SYSCALL_64_after_hwframe+0x61/0xcb
> >      RIP: 0033:0x78302de5fae7
> >      Code: c0 0f 89 74 ff ff ff 48 83 c4 08 49 c7 c4 ff ff ff ff 5b 4c
> > 89 e0 41 5c 41 5d 5d c3 0f 1f 80 00 00 00 00 b8 10 00 00 00 0f 05 <48>
> > 3d 01 f0 ff ff 73 01 c3 48 8b 0d 51 c3 0c 00 f7 d8 64 89 01 48
> >      RSP: 002b:00007ffe64b87f78 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> >      RAX: ffffffffffffffda RBX: 000003cc00470000 RCX: 000078302de5fae7
> >      RDX: 00007ffe64b87fd0 RSI: 0000000040406469 RDI: 000000000000000d
> >      RBP: 00007ffe64b87fa0 R08: 0000000000000013 R09: 000003cc004d0950
> >      R10: 0000000000000200 R11: 0000000000000246 R12: 000000000000000d
> >      R13: 0000000000000000 R14: 00007ffe64b87fd0 R15: 0000000040406469
> >      </TASK>
> > It is worth noting that "i915_gem_evict_vm()" still returns success in
> > this case.
> >
> > 2. After step 1 occurs, the next call to "grab_vma()" always fails
> > (with "i915_gem_object_trylock(vma->obj, ww)" failing also due to
> > deadlock), which then results in the crash.
> > Here is the stack dump for this scenario:
> >      Call Trace:
> >      <TASK>
> >      dump_stack_lvl+0x68/0x95
> >      grab_vma+0x6c/0xd0
> >      i915_gem_evict_for_node+0x178/0x23b
> >      i915_gem_gtt_reserve+0x5a/0x82
> >      i915_vma_insert+0x295/0x29e
> >      i915_vma_pin_ww+0x41e/0x5c7
> >      eb_validate_vmas+0x5f5/0x6ae
> >      eb_relocate_parse+0x4b/0xdb
> >      i915_gem_execbuffer2_ioctl+0x6f5/0xab6
> >      ? i915_gem_object_prepare_write+0xfb/0xfb
> >      drm_ioctl_kernel+0xda/0x14d
> >      drm_ioctl+0x27f/0x3b7
> >      ? i915_gem_object_prepare_write+0xfb/0xfb
> >      __se_sys_ioctl+0x7a/0xbc
> >      do_syscall_64+0x56/0xa1
> >      ? exit_to_user_mode_prepare+0x3d/0x8c
> >      entry_SYSCALL_64_after_hwframe+0x61/0xcb
> >      RIP: 0033:0x78302de5fae7
> >      Code: c0 0f 89 74 ff ff ff 48 83 c4 08 49 c7 c4 ff ff ff ff 5b 4c
> > 89 e0 41 5c 41 5d 5d c3 0f 1f 80 00 00 00 00 b8 10 00 00 00 0f 05 <48>
> > 3d 01 f0 ff ff 73 01 c3 48 8b 0d 51 c3 0c 00 f7 d8 64 89 01 48
> >      RSP: 002b:00007ffe64b87f78 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> >      RAX: ffffffffffffffda RBX: 000003cc00470000 RCX: 000078302de5fae7
> >      RDX: 00007ffe64b87fd0 RSI: 0000000040406469 RDI: 000000000000000d
> >      RBP: 00007ffe64b87fa0 R08: 0000000000000013 R09: 000003cc004d0950
> >      R10: 0000000000000200 R11: 0000000000000246 R12: 000000000000000d
> >      R13: 0000000000000000 R14: 00007ffe64b87fd0 R15: 0000000040406469
> >      </TASK>
> >
> > My Notes:
> > - I verified the two "i915_gem_object_trylock()" failures I mentioned
> > above are due to deadlock by slightly modifying the code to call
> > "i915_gem_object_lock()" only in those exact cases and subsequent to
> > the trylock failure, only to look at the return error code.
> > - The two cases mentioned above, are the only cases where
> > "i915_gem_object_trylock(obj, ww)" is called with the second argument
> > not being forced to NULL.
> > - When in either of the two cases above (i.e. inside "grab_vma()" or
> > "i915_gem_evict_vm") I replace calling "i915_gem_object_trylock" with
> > "i915_gem_object_lock", the issue gets resolved (because deadlock is
> > detected and resolved).
> >
> > So if this could matches the design better, another solution could be
> > for "grab_vma" to continue to call "i915_gem_object_trylock", but for
> > "i915_gem_evict_vm" to call "i915_gem_object_lock" instead.
>
> No, i915_gem_object_lock() is not allowed when the vm mutex is held.
>
>
> >
> > Further info:
> > - Would you like any further info on the crash? If so, could you
> > please advise 1) what exactly you need and 2) how I can share with you
> > especially if it is big dumps?
>
> Yes, I would like to know how the crash manifests itself. Is it a kernel
> BUG or a kernel WARNING or is it the user-space application that crashes
> due to receiveing an -ENOSPC?
>
> Thanks,
>
> Thomas
>
>
>
> >
> > Thanks.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] drm/i915: Fix unhandled deadlock in grab_vma()
@ 2022-11-15 23:54           ` Mani Milani
  0 siblings, 0 replies; 19+ messages in thread
From: Mani Milani @ 2022-11-15 23:54 UTC (permalink / raw)
  To: Thomas Hellström
  Cc: Tvrtko Ursulin, intel-gfx, dri-devel, LKML, Chris Wilson,
	Matthew Auld, Rodrigo Vivi, Niranjana Vishwanathapura,
	Christian König, Nirmoy Das

Hi Thomas,

It is a user-space application that crashes due to receiving an -ENOSPC.

This occurs after code reaches the line below where grab_vma() fails
(due to deadlock) and consequently i915_gem_evict_for_node() returns
-ENOSPC. I provided the call stack for when this happens in my
previous message:
https://github.com/torvalds/linux/blame/59d0d52c30d4991ac4b329f049cc37118e00f5b0/drivers/gpu/drm/i915/i915_gem_evict.c#L386

Context:
This crash is happening on an intel GeminiLake chromebook, when
running a video seek h264 stress test, and it is reproducible 100%. To
troubleshoot, I did a git bisect and found the following commit to be
the culprit (which is when grab_vma() has been introduced):
https://github.com/torvalds/linux/commit/7e00897be8bf13ef9c68c95a8e386b714c29ad95

I also have crash dumps and further logs that I can send you if
needed. But please let me know how to share those with you, since
pasting them here does not seem reasonable to me.

Thanks,
Mani

On Mon, Nov 14, 2022 at 11:48 PM Thomas Hellström
<thomas.hellstrom@linux.intel.com> wrote:
>
> Hi, Mani.
>
> On 11/14/22 03:16, Mani Milani wrote:
> > Thank you for your comments.
> >
> > To Thomas's point, the crash always seems to happen when the following
> > sequence of events occurs:
> >
> > 1. When inside "i915_gem_evict_vm()", the call to
> > "i915_gem_object_trylock(vma->obj, ww)" fails (due to deadlock), and
> > eviction of a vma is skipped as a result. Basically if the code
> > reaches here:
> > https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/i915/i915_gem_evict.c#L468
> > And here is the stack dump for this scenario:
> >      Call Trace:
> >      <TASK>
> >      dump_stack_lvl+0x68/0x95
> >      i915_gem_evict_vm+0x1d2/0x369
> >      eb_validate_vmas+0x54a/0x6ae
> >      eb_relocate_parse+0x4b/0xdb
> >      i915_gem_execbuffer2_ioctl+0x6f5/0xab6
> >      ? i915_gem_object_prepare_write+0xfb/0xfb
> >      drm_ioctl_kernel+0xda/0x14d
> >      drm_ioctl+0x27f/0x3b7
> >      ? i915_gem_object_prepare_write+0xfb/0xfb
> >      __se_sys_ioctl+0x7a/0xbc
> >      do_syscall_64+0x56/0xa1
> >      ? exit_to_user_mode_prepare+0x3d/0x8c
> >      entry_SYSCALL_64_after_hwframe+0x61/0xcb
> >      RIP: 0033:0x78302de5fae7
> >      Code: c0 0f 89 74 ff ff ff 48 83 c4 08 49 c7 c4 ff ff ff ff 5b 4c
> > 89 e0 41 5c 41 5d 5d c3 0f 1f 80 00 00 00 00 b8 10 00 00 00 0f 05 <48>
> > 3d 01 f0 ff ff 73 01 c3 48 8b 0d 51 c3 0c 00 f7 d8 64 89 01 48
> >      RSP: 002b:00007ffe64b87f78 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> >      RAX: ffffffffffffffda RBX: 000003cc00470000 RCX: 000078302de5fae7
> >      RDX: 00007ffe64b87fd0 RSI: 0000000040406469 RDI: 000000000000000d
> >      RBP: 00007ffe64b87fa0 R08: 0000000000000013 R09: 000003cc004d0950
> >      R10: 0000000000000200 R11: 0000000000000246 R12: 000000000000000d
> >      R13: 0000000000000000 R14: 00007ffe64b87fd0 R15: 0000000040406469
> >      </TASK>
> > It is worth noting that "i915_gem_evict_vm()" still returns success in
> > this case.
> >
> > 2. After step 1 occurs, the next call to "grab_vma()" always fails
> > (with "i915_gem_object_trylock(vma->obj, ww)" failing also due to
> > deadlock), which then results in the crash.
> > Here is the stack dump for this scenario:
> >      Call Trace:
> >      <TASK>
> >      dump_stack_lvl+0x68/0x95
> >      grab_vma+0x6c/0xd0
> >      i915_gem_evict_for_node+0x178/0x23b
> >      i915_gem_gtt_reserve+0x5a/0x82
> >      i915_vma_insert+0x295/0x29e
> >      i915_vma_pin_ww+0x41e/0x5c7
> >      eb_validate_vmas+0x5f5/0x6ae
> >      eb_relocate_parse+0x4b/0xdb
> >      i915_gem_execbuffer2_ioctl+0x6f5/0xab6
> >      ? i915_gem_object_prepare_write+0xfb/0xfb
> >      drm_ioctl_kernel+0xda/0x14d
> >      drm_ioctl+0x27f/0x3b7
> >      ? i915_gem_object_prepare_write+0xfb/0xfb
> >      __se_sys_ioctl+0x7a/0xbc
> >      do_syscall_64+0x56/0xa1
> >      ? exit_to_user_mode_prepare+0x3d/0x8c
> >      entry_SYSCALL_64_after_hwframe+0x61/0xcb
> >      RIP: 0033:0x78302de5fae7
> >      Code: c0 0f 89 74 ff ff ff 48 83 c4 08 49 c7 c4 ff ff ff ff 5b 4c
> > 89 e0 41 5c 41 5d 5d c3 0f 1f 80 00 00 00 00 b8 10 00 00 00 0f 05 <48>
> > 3d 01 f0 ff ff 73 01 c3 48 8b 0d 51 c3 0c 00 f7 d8 64 89 01 48
> >      RSP: 002b:00007ffe64b87f78 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> >      RAX: ffffffffffffffda RBX: 000003cc00470000 RCX: 000078302de5fae7
> >      RDX: 00007ffe64b87fd0 RSI: 0000000040406469 RDI: 000000000000000d
> >      RBP: 00007ffe64b87fa0 R08: 0000000000000013 R09: 000003cc004d0950
> >      R10: 0000000000000200 R11: 0000000000000246 R12: 000000000000000d
> >      R13: 0000000000000000 R14: 00007ffe64b87fd0 R15: 0000000040406469
> >      </TASK>
> >
> > My Notes:
> > - I verified the two "i915_gem_object_trylock()" failures I mentioned
> > above are due to deadlock by slightly modifying the code to call
> > "i915_gem_object_lock()" only in those exact cases and subsequent to
> > the trylock failure, only to look at the return error code.
> > - The two cases mentioned above, are the only cases where
> > "i915_gem_object_trylock(obj, ww)" is called with the second argument
> > not being forced to NULL.
> > - When in either of the two cases above (i.e. inside "grab_vma()" or
> > "i915_gem_evict_vm") I replace calling "i915_gem_object_trylock" with
> > "i915_gem_object_lock", the issue gets resolved (because deadlock is
> > detected and resolved).
> >
> > So if this could matches the design better, another solution could be
> > for "grab_vma" to continue to call "i915_gem_object_trylock", but for
> > "i915_gem_evict_vm" to call "i915_gem_object_lock" instead.
>
> No, i915_gem_object_lock() is not allowed when the vm mutex is held.
>
>
> >
> > Further info:
> > - Would you like any further info on the crash? If so, could you
> > please advise 1) what exactly you need and 2) how I can share with you
> > especially if it is big dumps?
>
> Yes, I would like to know how the crash manifests itself. Is it a kernel
> BUG or a kernel WARNING or is it the user-space application that crashes
> due to receiveing an -ENOSPC?
>
> Thanks,
>
> Thomas
>
>
>
> >
> > Thanks.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Intel-gfx] [PATCH] drm/i915: Fix unhandled deadlock in grab_vma()
@ 2022-11-15 23:54           ` Mani Milani
  0 siblings, 0 replies; 19+ messages in thread
From: Mani Milani @ 2022-11-15 23:54 UTC (permalink / raw)
  To: Thomas Hellström
  Cc: intel-gfx, dri-devel, LKML, Chris Wilson, Matthew Auld,
	Daniel Vetter, Rodrigo Vivi, David Airlie, Christian König,
	Nirmoy Das

Hi Thomas,

It is a user-space application that crashes due to receiving an -ENOSPC.

This occurs after code reaches the line below where grab_vma() fails
(due to deadlock) and consequently i915_gem_evict_for_node() returns
-ENOSPC. I provided the call stack for when this happens in my
previous message:
https://github.com/torvalds/linux/blame/59d0d52c30d4991ac4b329f049cc37118e00f5b0/drivers/gpu/drm/i915/i915_gem_evict.c#L386

Context:
This crash is happening on an intel GeminiLake chromebook, when
running a video seek h264 stress test, and it is reproducible 100%. To
troubleshoot, I did a git bisect and found the following commit to be
the culprit (which is when grab_vma() has been introduced):
https://github.com/torvalds/linux/commit/7e00897be8bf13ef9c68c95a8e386b714c29ad95

I also have crash dumps and further logs that I can send you if
needed. But please let me know how to share those with you, since
pasting them here does not seem reasonable to me.

Thanks,
Mani

On Mon, Nov 14, 2022 at 11:48 PM Thomas Hellström
<thomas.hellstrom@linux.intel.com> wrote:
>
> Hi, Mani.
>
> On 11/14/22 03:16, Mani Milani wrote:
> > Thank you for your comments.
> >
> > To Thomas's point, the crash always seems to happen when the following
> > sequence of events occurs:
> >
> > 1. When inside "i915_gem_evict_vm()", the call to
> > "i915_gem_object_trylock(vma->obj, ww)" fails (due to deadlock), and
> > eviction of a vma is skipped as a result. Basically if the code
> > reaches here:
> > https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/i915/i915_gem_evict.c#L468
> > And here is the stack dump for this scenario:
> >      Call Trace:
> >      <TASK>
> >      dump_stack_lvl+0x68/0x95
> >      i915_gem_evict_vm+0x1d2/0x369
> >      eb_validate_vmas+0x54a/0x6ae
> >      eb_relocate_parse+0x4b/0xdb
> >      i915_gem_execbuffer2_ioctl+0x6f5/0xab6
> >      ? i915_gem_object_prepare_write+0xfb/0xfb
> >      drm_ioctl_kernel+0xda/0x14d
> >      drm_ioctl+0x27f/0x3b7
> >      ? i915_gem_object_prepare_write+0xfb/0xfb
> >      __se_sys_ioctl+0x7a/0xbc
> >      do_syscall_64+0x56/0xa1
> >      ? exit_to_user_mode_prepare+0x3d/0x8c
> >      entry_SYSCALL_64_after_hwframe+0x61/0xcb
> >      RIP: 0033:0x78302de5fae7
> >      Code: c0 0f 89 74 ff ff ff 48 83 c4 08 49 c7 c4 ff ff ff ff 5b 4c
> > 89 e0 41 5c 41 5d 5d c3 0f 1f 80 00 00 00 00 b8 10 00 00 00 0f 05 <48>
> > 3d 01 f0 ff ff 73 01 c3 48 8b 0d 51 c3 0c 00 f7 d8 64 89 01 48
> >      RSP: 002b:00007ffe64b87f78 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> >      RAX: ffffffffffffffda RBX: 000003cc00470000 RCX: 000078302de5fae7
> >      RDX: 00007ffe64b87fd0 RSI: 0000000040406469 RDI: 000000000000000d
> >      RBP: 00007ffe64b87fa0 R08: 0000000000000013 R09: 000003cc004d0950
> >      R10: 0000000000000200 R11: 0000000000000246 R12: 000000000000000d
> >      R13: 0000000000000000 R14: 00007ffe64b87fd0 R15: 0000000040406469
> >      </TASK>
> > It is worth noting that "i915_gem_evict_vm()" still returns success in
> > this case.
> >
> > 2. After step 1 occurs, the next call to "grab_vma()" always fails
> > (with "i915_gem_object_trylock(vma->obj, ww)" failing also due to
> > deadlock), which then results in the crash.
> > Here is the stack dump for this scenario:
> >      Call Trace:
> >      <TASK>
> >      dump_stack_lvl+0x68/0x95
> >      grab_vma+0x6c/0xd0
> >      i915_gem_evict_for_node+0x178/0x23b
> >      i915_gem_gtt_reserve+0x5a/0x82
> >      i915_vma_insert+0x295/0x29e
> >      i915_vma_pin_ww+0x41e/0x5c7
> >      eb_validate_vmas+0x5f5/0x6ae
> >      eb_relocate_parse+0x4b/0xdb
> >      i915_gem_execbuffer2_ioctl+0x6f5/0xab6
> >      ? i915_gem_object_prepare_write+0xfb/0xfb
> >      drm_ioctl_kernel+0xda/0x14d
> >      drm_ioctl+0x27f/0x3b7
> >      ? i915_gem_object_prepare_write+0xfb/0xfb
> >      __se_sys_ioctl+0x7a/0xbc
> >      do_syscall_64+0x56/0xa1
> >      ? exit_to_user_mode_prepare+0x3d/0x8c
> >      entry_SYSCALL_64_after_hwframe+0x61/0xcb
> >      RIP: 0033:0x78302de5fae7
> >      Code: c0 0f 89 74 ff ff ff 48 83 c4 08 49 c7 c4 ff ff ff ff 5b 4c
> > 89 e0 41 5c 41 5d 5d c3 0f 1f 80 00 00 00 00 b8 10 00 00 00 0f 05 <48>
> > 3d 01 f0 ff ff 73 01 c3 48 8b 0d 51 c3 0c 00 f7 d8 64 89 01 48
> >      RSP: 002b:00007ffe64b87f78 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> >      RAX: ffffffffffffffda RBX: 000003cc00470000 RCX: 000078302de5fae7
> >      RDX: 00007ffe64b87fd0 RSI: 0000000040406469 RDI: 000000000000000d
> >      RBP: 00007ffe64b87fa0 R08: 0000000000000013 R09: 000003cc004d0950
> >      R10: 0000000000000200 R11: 0000000000000246 R12: 000000000000000d
> >      R13: 0000000000000000 R14: 00007ffe64b87fd0 R15: 0000000040406469
> >      </TASK>
> >
> > My Notes:
> > - I verified the two "i915_gem_object_trylock()" failures I mentioned
> > above are due to deadlock by slightly modifying the code to call
> > "i915_gem_object_lock()" only in those exact cases and subsequent to
> > the trylock failure, only to look at the return error code.
> > - The two cases mentioned above, are the only cases where
> > "i915_gem_object_trylock(obj, ww)" is called with the second argument
> > not being forced to NULL.
> > - When in either of the two cases above (i.e. inside "grab_vma()" or
> > "i915_gem_evict_vm") I replace calling "i915_gem_object_trylock" with
> > "i915_gem_object_lock", the issue gets resolved (because deadlock is
> > detected and resolved).
> >
> > So if this could matches the design better, another solution could be
> > for "grab_vma" to continue to call "i915_gem_object_trylock", but for
> > "i915_gem_evict_vm" to call "i915_gem_object_lock" instead.
>
> No, i915_gem_object_lock() is not allowed when the vm mutex is held.
>
>
> >
> > Further info:
> > - Would you like any further info on the crash? If so, could you
> > please advise 1) what exactly you need and 2) how I can share with you
> > especially if it is big dumps?
>
> Yes, I would like to know how the crash manifests itself. Is it a kernel
> BUG or a kernel WARNING or is it the user-space application that crashes
> due to receiveing an -ENOSPC?
>
> Thanks,
>
> Thomas
>
>
>
> >
> > Thanks.

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2022-11-15 23:54 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-11-10  5:31 [PATCH] drm/i915: Fix unhandled deadlock in grab_vma() Mani Milani
2022-11-10  5:31 ` Mani Milani
2022-11-10  5:31 ` [Intel-gfx] " Mani Milani
2022-11-10 14:49 ` Matthew Auld
2022-11-10 14:49   ` [Intel-gfx] " Matthew Auld
2022-11-10 14:49   ` Matthew Auld
2022-11-10 15:21   ` Thomas Hellström
2022-11-10 15:21     ` Thomas Hellström
2022-11-10 15:21     ` [Intel-gfx] " Thomas Hellström
2022-11-14  2:16     ` Mani Milani
2022-11-14  2:16       ` Mani Milani
2022-11-14  2:16       ` [Intel-gfx] " Mani Milani
2022-11-14 12:48       ` Thomas Hellström
2022-11-14 12:48         ` [Intel-gfx] " Thomas Hellström
2022-11-14 12:48         ` Thomas Hellström
2022-11-15 23:54         ` Mani Milani
2022-11-15 23:54           ` [Intel-gfx] " Mani Milani
2022-11-15 23:54           ` Mani Milani
2022-11-10 21:08 ` [Intel-gfx] ✗ Fi.CI.BUILD: failure for " Patchwork

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.