All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/8] Stability improvements to error state capture
@ 2015-10-08 18:31 Tomas Elf
  2015-10-08 18:31 ` [PATCH 1/8] drm/i915: Early exit from semaphore_waits_for for execlist mode Tomas Elf
                   ` (7 more replies)
  0 siblings, 8 replies; 68+ messages in thread
From: Tomas Elf @ 2015-10-08 18:31 UTC (permalink / raw)
  To: Intel-GFX

In preparation for the upcoming TDR per-engine hang recovery enablement the
stability of the error state capture code needs to be addressed. The biggest
reason for this is that in order to test TDR a long-duration test needs to be
run for several hours during which a large number of hangs is handled together
with the associated error state captures. In its current state the i915 driver
experiences various forms of kernel panics and other kinds of fatal errors
within the first hour(s) of the hang testing. The patches in this series have
been tested with a long-duration hang testing clocking in at 12+ hours and
should suffice as an initial improvement.

The underlying issue of trying to capture the driver state without
synchronization is still a problem that remains to be fixed. One way of at
least further alleviating this problem that has been suggested by John Harrison
is to do a mutex_trylock() of the struct_mutex for a while (give it a second or
so) before going into the error state capture from i915_handle_error(). Then,
if nobody is holding the struct_mutex, the error state capture is considerably
more safe from sudden state changes. If some thread has hung while holding the
struct_mutex one could at least hope that there would be no sudden state
changes during error state capture due to the hung state (unless some thread
has been caught in a livelock or is perhaps not stuck at all but is simply
running for a very long time - still some improvements might be expected here).

One fix that has been omitted from this patch series is in regards to the
broken ring space calculation following a full GPU reset. Two independent
patches to solve this are: "[PATCH] drm/i915: Update ring space correctly on
lrc context reset" by Mika Kuoppala and "[51/70] drm/i915: Record the position
of the start of the request" by Chris Wilson. Since the solution is currently
in review I'll simply mention it here as a pre-requistite for long-duration
operations stability testing. Without a fix for this problem the ring space is
terminally depleted within the first iterations of the hang test, simply
because the ring space is miscalculated following every GPU hang recovery and
traversal of the GEM init hw path gradually leading to a terminally hung state.

Tomas Elf (8):
  drm/i915: Early exit from semaphore_waits_for for execlist mode.
  drm/i915: Migrate to safe iterators in error state capture
  drm/i915: Cope with request list state change during error state
    capture
  drm/i915: NULL checking when capturing buffer objects during error
    state capture
  drm/i915: vma NULL pointer check
  drm/i915: Use safe list iterators
  drm/i915: Grab execlist spinlock to avoid post-reset concurrency
    issues.
  drm/i915: NULL check of unpin_work

 drivers/gpu/drm/i915/i915_gem.c       | 18 ++++++++---
 drivers/gpu/drm/i915/i915_gpu_error.c | 61 +++++++++++++++++++++++------------
 drivers/gpu/drm/i915/i915_irq.c       | 20 ++++++++++++
 drivers/gpu/drm/i915/intel_display.c  |  5 +++
 4 files changed, 80 insertions(+), 24 deletions(-)

-- 
1.9.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 68+ messages in thread

* [PATCH 1/8] drm/i915: Early exit from semaphore_waits_for for execlist mode.
  2015-10-08 18:31 [PATCH 0/8] Stability improvements to error state capture Tomas Elf
@ 2015-10-08 18:31 ` Tomas Elf
  2015-10-08 18:31 ` [PATCH 2/8] drm/i915: Migrate to safe iterators in error state capture Tomas Elf
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 68+ messages in thread
From: Tomas Elf @ 2015-10-08 18:31 UTC (permalink / raw)
  To: Intel-GFX

When submitting semaphores in execlist mode the hang checker crashes in this
function because it is only runnable in ring submission mode. The reason this
is of particular interest to the TDR patch series is because we use semaphores
as a mean to induce hangs during testing (which is the recommended way to
induce hangs for gen8+). It's not clear how this is supposed to work in
execlist mode since:

1. This function requires a ring buffer.

2. Retrieving a ring buffer in execlist mode requires us to retrieve the
corresponding context, which we get from a request.

3. Retieving a request from the hang checker is not straight-forward since that
requires us to grab the struct_mutex in order to synchronize against the
request retirement thread.

4. Grabbing the struct_mutex from the hang checker is nothing that we will do
since that puts us at risk of deadlock since a hung thread might be holding the
struct_mutex already.

Therefore it's not obvious how we're supposed to deal with this. For now, we're
doing an early exit from this function, which avoids any kernel panic situation
when running our own internal TDR ULT.

* v2: (Chris Wilson)
Turned the execlist mode check into a ringbuffer NULL check to make it more
submission mode agnostic and less of a layering violation.

Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_irq.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 8ca772d..132e8f3 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -2782,6 +2782,26 @@ semaphore_waits_for(struct intel_engine_cs *ring, u32 *seqno)
 	u64 offset = 0;
 	int i, backwards;
 
+	/*
+	 * This function does not support execlist mode - any attempt to
+	 * proceed further into this function will result in a kernel panic
+	 * when dereferencing ring->buffer, which is not set up in execlist
+	 * mode.
+	 *
+	 * The correct way of doing it would be to derive the currently
+	 * executing ring buffer from the current context, which is derived
+	 * from the currently running request. Unfortunately, to get the
+	 * current request we would have to grab the struct_mutex before doing
+	 * anything else, which would be ill-advised since some other thread
+	 * might have grabbed it already and managed to hang itself, causing
+	 * the hang checker to deadlock.
+	 *
+	 * Therefore, this function does not support execlist mode in its
+	 * current form. Just return NULL and move on.
+	 */
+	if (ring->buffer == NULL)
+		return NULL;
+
 	ipehr = I915_READ(RING_IPEHR(ring->mmio_base));
 	if (!ipehr_is_semaphore_wait(ring->dev, ipehr))
 		return NULL;
-- 
1.9.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH 2/8] drm/i915: Migrate to safe iterators in error state capture
  2015-10-08 18:31 [PATCH 0/8] Stability improvements to error state capture Tomas Elf
  2015-10-08 18:31 ` [PATCH 1/8] drm/i915: Early exit from semaphore_waits_for for execlist mode Tomas Elf
@ 2015-10-08 18:31 ` Tomas Elf
  2015-10-09  7:49   ` Chris Wilson
  2015-10-09  8:27   ` Daniel Vetter
  2015-10-08 18:31 ` [PATCH 3/8] drm/i915: Cope with request list state change during " Tomas Elf
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 68+ messages in thread
From: Tomas Elf @ 2015-10-08 18:31 UTC (permalink / raw)
  To: Intel-GFX

Using safe list iterators alleviates the problem of unsynchronized driver list
manipulations while error state capture is in the process of traversing lists.

Signed-off-by: Tomas Elf <tomas.elf@intel.com>
---
 drivers/gpu/drm/i915/i915_gpu_error.c | 38 +++++++++++++++++------------------
 1 file changed, 19 insertions(+), 19 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index 2f04e4f..32c1799 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -718,10 +718,10 @@ static void capture_bo(struct drm_i915_error_buffer *err,
 static u32 capture_active_bo(struct drm_i915_error_buffer *err,
 			     int count, struct list_head *head)
 {
-	struct i915_vma *vma;
+	struct i915_vma *vma, *tmpvma;
 	int i = 0;
 
-	list_for_each_entry(vma, head, mm_list) {
+	list_for_each_entry_safe(vma, tmpvma, head, mm_list) {
 		capture_bo(err++, vma);
 		if (++i == count)
 			break;
@@ -734,17 +734,17 @@ static u32 capture_pinned_bo(struct drm_i915_error_buffer *err,
 			     int count, struct list_head *head,
 			     struct i915_address_space *vm)
 {
-	struct drm_i915_gem_object *obj;
+	struct drm_i915_gem_object *obj, *tmpobj;
 	struct drm_i915_error_buffer * const first = err;
 	struct drm_i915_error_buffer * const last = err + count;
 
-	list_for_each_entry(obj, head, global_list) {
-		struct i915_vma *vma;
+	list_for_each_entry_safe(obj, tmpobj, head, global_list) {
+		struct i915_vma *vma, *tmpvma;
 
 		if (err == last)
 			break;
 
-		list_for_each_entry(vma, &obj->vma_list, vma_link)
+		list_for_each_entry_safe(vma, tmpvma, &obj->vma_list, vma_link)
 			if (vma->vm == vm && vma->pin_count > 0)
 				capture_bo(err++, vma);
 	}
@@ -958,13 +958,13 @@ static void i915_gem_record_active_context(struct intel_engine_cs *ring,
 					   struct drm_i915_error_ring *ering)
 {
 	struct drm_i915_private *dev_priv = ring->dev->dev_private;
-	struct drm_i915_gem_object *obj;
+	struct drm_i915_gem_object *obj, *tmpobj;
 
 	/* Currently render ring is the only HW context user */
 	if (ring->id != RCS || !error->ccid)
 		return;
 
-	list_for_each_entry(obj, &dev_priv->mm.bound_list, global_list) {
+	list_for_each_entry_safe(obj, tmpobj, &dev_priv->mm.bound_list, global_list) {
 		if (!i915_gem_obj_ggtt_bound(obj))
 			continue;
 
@@ -979,7 +979,7 @@ static void i915_gem_record_rings(struct drm_device *dev,
 				  struct drm_i915_error_state *error)
 {
 	struct drm_i915_private *dev_priv = dev->dev_private;
-	struct drm_i915_gem_request *request;
+	struct drm_i915_gem_request *request, *tmpreq;
 	int i, count;
 
 	for (i = 0; i < I915_NUM_RINGS; i++) {
@@ -1055,7 +1055,7 @@ static void i915_gem_record_rings(struct drm_device *dev,
 		i915_gem_record_active_context(ring, error, &error->ring[i]);
 
 		count = 0;
-		list_for_each_entry(request, &ring->request_list, list)
+		list_for_each_entry_safe(request, tmpreq, &ring->request_list, list)
 			count++;
 
 		error->ring[i].num_requests = count;
@@ -1068,7 +1068,7 @@ static void i915_gem_record_rings(struct drm_device *dev,
 		}
 
 		count = 0;
-		list_for_each_entry(request, &ring->request_list, list) {
+		list_for_each_entry_safe(request, tmpreq, &ring->request_list, list) {
 			struct drm_i915_error_request *erq;
 
 			erq = &error->ring[i].requests[count++];
@@ -1088,17 +1088,17 @@ static void i915_gem_capture_vm(struct drm_i915_private *dev_priv,
 				const int ndx)
 {
 	struct drm_i915_error_buffer *active_bo = NULL, *pinned_bo = NULL;
-	struct drm_i915_gem_object *obj;
-	struct i915_vma *vma;
+	struct drm_i915_gem_object *obj, *tmpobj;
+	struct i915_vma *vma, *tmpvma;
 	int i;
 
 	i = 0;
-	list_for_each_entry(vma, &vm->active_list, mm_list)
+	list_for_each_entry_safe(vma, tmpvma, &vm->active_list, mm_list)
 		i++;
 	error->active_bo_count[ndx] = i;
 
-	list_for_each_entry(obj, &dev_priv->mm.bound_list, global_list) {
-		list_for_each_entry(vma, &obj->vma_list, vma_link)
+	list_for_each_entry_safe(obj, tmpobj, &dev_priv->mm.bound_list, global_list) {
+		list_for_each_entry_safe(vma, tmpvma, &obj->vma_list, vma_link)
 			if (vma->vm == vm && vma->pin_count > 0)
 				i++;
 	}
@@ -1128,10 +1128,10 @@ static void i915_gem_capture_vm(struct drm_i915_private *dev_priv,
 static void i915_gem_capture_buffers(struct drm_i915_private *dev_priv,
 				     struct drm_i915_error_state *error)
 {
-	struct i915_address_space *vm;
+	struct i915_address_space *vm, *tmpvm;
 	int cnt = 0, i = 0;
 
-	list_for_each_entry(vm, &dev_priv->vm_list, global_link)
+	list_for_each_entry_safe(vm, tmpvm, &dev_priv->vm_list, global_link)
 		cnt++;
 
 	error->active_bo = kcalloc(cnt, sizeof(*error->active_bo), GFP_ATOMIC);
@@ -1155,7 +1155,7 @@ static void i915_gem_capture_buffers(struct drm_i915_private *dev_priv,
 		error->pinned_bo = NULL;
 		error->pinned_bo_count = NULL;
 	} else {
-		list_for_each_entry(vm, &dev_priv->vm_list, global_link)
+		list_for_each_entry_safe(vm, tmpvm, &dev_priv->vm_list, global_link)
 			i915_gem_capture_vm(dev_priv, error, vm, i++);
 
 		error->vm_count = cnt;
-- 
1.9.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH 3/8] drm/i915: Cope with request list state change during error state capture
  2015-10-08 18:31 [PATCH 0/8] Stability improvements to error state capture Tomas Elf
  2015-10-08 18:31 ` [PATCH 1/8] drm/i915: Early exit from semaphore_waits_for for execlist mode Tomas Elf
  2015-10-08 18:31 ` [PATCH 2/8] drm/i915: Migrate to safe iterators in error state capture Tomas Elf
@ 2015-10-08 18:31 ` Tomas Elf
  2015-10-09  7:48   ` Chris Wilson
  2015-10-09  8:28   ` Daniel Vetter
  2015-10-08 18:31 ` [PATCH 4/8] drm/i915: NULL checking when capturing buffer objects " Tomas Elf
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 68+ messages in thread
From: Tomas Elf @ 2015-10-08 18:31 UTC (permalink / raw)
  To: Intel-GFX

Since we're not synchronizing the ring request list during error state capture
the request list state might change between the time the corresponding error
request list was allocated and dimensioned to the time when the ring request
list is actually captured into the error state. If this happens, throw a
WARNING and do early exit and be aware that the captured error state might not
be fully reliable.

Signed-off-by: Tomas Elf <tomas.elf@intel.com>
---
 drivers/gpu/drm/i915/i915_gpu_error.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index 32c1799..cc75ca4 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -1071,6 +1071,19 @@ static void i915_gem_record_rings(struct drm_device *dev,
 		list_for_each_entry_safe(request, tmpreq, &ring->request_list, list) {
 			struct drm_i915_error_request *erq;
 
+			if (WARN_ON(!request || count >= error->ring[i].num_requests)) {
+				/*
+				 * If the ring request list was changed in
+				 * between the point where the error request
+				 * list was created and dimensioned and this
+				 * point then just update the num_requests
+				 * field to reflect this.
+				 */
+				error->ring[i].num_requests =
+					min(count, error->ring[i].num_requests);
+				break;
+			}
+
 			erq = &error->ring[i].requests[count++];
 			erq->seqno = request->seqno;
 			erq->jiffies = request->emitted_jiffies;
-- 
1.9.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH 4/8] drm/i915: NULL checking when capturing buffer objects during error state capture
  2015-10-08 18:31 [PATCH 0/8] Stability improvements to error state capture Tomas Elf
                   ` (2 preceding siblings ...)
  2015-10-08 18:31 ` [PATCH 3/8] drm/i915: Cope with request list state change during " Tomas Elf
@ 2015-10-08 18:31 ` Tomas Elf
  2015-10-09  7:49   ` Chris Wilson
  2015-10-09  8:32   ` Daniel Vetter
  2015-10-08 18:31 ` [PATCH 5/8] drm/i915: vma NULL pointer check Tomas Elf
                   ` (3 subsequent siblings)
  7 siblings, 2 replies; 68+ messages in thread
From: Tomas Elf @ 2015-10-08 18:31 UTC (permalink / raw)
  To: Intel-GFX

In the past vmas have sometimes turned out to be NULL when capturing buffer
objects during error state capture. To avoid NULL pointer exceptions throw a
WARNING and exit early and be prepared for the error state not being fully
accurate following this point.

Signed-off-by: Tomas Elf <tomas.elf@intel.com>
---
 drivers/gpu/drm/i915/i915_gpu_error.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index cc75ca4..ae24971 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -691,9 +691,17 @@ unwind:
 static void capture_bo(struct drm_i915_error_buffer *err,
 		       struct i915_vma *vma)
 {
-	struct drm_i915_gem_object *obj = vma->obj;
+	struct drm_i915_gem_object *obj;
 	int i;
 
+	if (WARN_ON(!vma))
+	    return;
+
+	if (WARN_ON(!vma->obj))
+	    return;
+
+	obj = vma->obj;
+
 	err->size = obj->base.size;
 	err->name = obj->base.name;
 	for (i = 0; i < I915_NUM_RINGS; i++)
-- 
1.9.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH 5/8] drm/i915: vma NULL pointer check
  2015-10-08 18:31 [PATCH 0/8] Stability improvements to error state capture Tomas Elf
                   ` (3 preceding siblings ...)
  2015-10-08 18:31 ` [PATCH 4/8] drm/i915: NULL checking when capturing buffer objects " Tomas Elf
@ 2015-10-08 18:31 ` Tomas Elf
  2015-10-09  7:48   ` Chris Wilson
  2015-10-09  8:33   ` Daniel Vetter
  2015-10-08 18:31 ` [PATCH 6/8] drm/i915: Use safe list iterators Tomas Elf
                   ` (2 subsequent siblings)
  7 siblings, 2 replies; 68+ messages in thread
From: Tomas Elf @ 2015-10-08 18:31 UTC (permalink / raw)
  To: Intel-GFX

Sometimes the iterated vma objects are NULL apparently. Be aware of this while
iterating and do early exit instead of causing a NULL pointer exception.

Signed-off-by: Tomas Elf <tomas.elf@intel.com>
---
 drivers/gpu/drm/i915/i915_gem.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 52642af..8210ae7 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -5122,9 +5122,13 @@ unsigned long i915_gem_obj_size(struct drm_i915_gem_object *o,
 bool i915_gem_obj_is_pinned(struct drm_i915_gem_object *obj)
 {
 	struct i915_vma *vma;
-	list_for_each_entry(vma, &obj->vma_list, vma_link)
+	list_for_each_entry(vma, &obj->vma_list, vma_link) {
+		if (!vma)
+			continue;
+
 		if (vma->pin_count > 0)
 			return true;
+	}
 
 	return false;
 }
-- 
1.9.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH 6/8] drm/i915: Use safe list iterators
  2015-10-08 18:31 [PATCH 0/8] Stability improvements to error state capture Tomas Elf
                   ` (4 preceding siblings ...)
  2015-10-08 18:31 ` [PATCH 5/8] drm/i915: vma NULL pointer check Tomas Elf
@ 2015-10-08 18:31 ` Tomas Elf
  2015-10-09  7:41   ` Chris Wilson
  2015-10-08 18:31 ` [PATCH 7/8] drm/i915: Grab execlist spinlock to avoid post-reset concurrency issues Tomas Elf
  2015-10-08 18:31 ` [PATCH 8/8] drm/i915: NULL check of unpin_work Tomas Elf
  7 siblings, 1 reply; 68+ messages in thread
From: Tomas Elf @ 2015-10-08 18:31 UTC (permalink / raw)
  To: Intel-GFX

Error state capture is dependent on i915_gem_active_request() and
i915_gem_obj_is_pinned(). Since there is no synchronization between error state
capture and the driver state we at least need to use safe list iterators in
these functions to alleviate the problem of request/vma lists changing during
error state capture.

Signed-off-by: Tomas Elf <tomas.elf@intel.com>
---
 drivers/gpu/drm/i915/i915_gem.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 8210ae7..1666499 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2704,9 +2704,9 @@ void i915_gem_request_cancel(struct drm_i915_gem_request *req)
 struct drm_i915_gem_request *
 i915_gem_find_active_request(struct intel_engine_cs *ring)
 {
-	struct drm_i915_gem_request *request;
+	struct drm_i915_gem_request *request, *tmpreq;
 
-	list_for_each_entry(request, &ring->request_list, list) {
+	list_for_each_entry_safe(request, tmpreq, &ring->request_list, list) {
 		if (i915_gem_request_completed(request, false))
 			continue;
 
@@ -5121,8 +5121,9 @@ unsigned long i915_gem_obj_size(struct drm_i915_gem_object *o,
 
 bool i915_gem_obj_is_pinned(struct drm_i915_gem_object *obj)
 {
-	struct i915_vma *vma;
-	list_for_each_entry(vma, &obj->vma_list, vma_link) {
+	struct i915_vma *vma, *tmpvma;
+
+	list_for_each_entry_safe(vma, tmpvma, &obj->vma_list, vma_link) {
 		if (!vma)
 			continue;
 
-- 
1.9.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH 7/8] drm/i915: Grab execlist spinlock to avoid post-reset concurrency issues.
  2015-10-08 18:31 [PATCH 0/8] Stability improvements to error state capture Tomas Elf
                   ` (5 preceding siblings ...)
  2015-10-08 18:31 ` [PATCH 6/8] drm/i915: Use safe list iterators Tomas Elf
@ 2015-10-08 18:31 ` Tomas Elf
  2015-10-09  7:45   ` Chris Wilson
                     ` (2 more replies)
  2015-10-08 18:31 ` [PATCH 8/8] drm/i915: NULL check of unpin_work Tomas Elf
  7 siblings, 3 replies; 68+ messages in thread
From: Tomas Elf @ 2015-10-08 18:31 UTC (permalink / raw)
  To: Intel-GFX

Grab execlist lock when cleaning up execlist queues after GPU reset to avoid
concurrency problems between the context event interrupt handler and the reset
path immediately following a GPU reset.

Signed-off-by: Tomas Elf <tomas.elf@intel.com>
---
 drivers/gpu/drm/i915/i915_gem.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 1666499..50c2dcd 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2738,6 +2738,8 @@ static void i915_gem_reset_ring_status(struct drm_i915_private *dev_priv,
 static void i915_gem_reset_ring_cleanup(struct drm_i915_private *dev_priv,
 					struct intel_engine_cs *ring)
 {
+	unsigned long flags;
+
 	while (!list_empty(&ring->active_list)) {
 		struct drm_i915_gem_object *obj;
 
@@ -2753,6 +2755,8 @@ static void i915_gem_reset_ring_cleanup(struct drm_i915_private *dev_priv,
 	 * are the ones that keep the context and ringbuffer backing objects
 	 * pinned in place.
 	 */
+
+	spin_lock_irqsave(&ring->execlist_lock, flags);
 	while (!list_empty(&ring->execlist_queue)) {
 		struct drm_i915_gem_request *submit_req;
 
@@ -2766,6 +2770,7 @@ static void i915_gem_reset_ring_cleanup(struct drm_i915_private *dev_priv,
 
 		i915_gem_request_unreference(submit_req);
 	}
+	spin_unlock_irqrestore(&ring->execlist_lock, flags);
 
 	/*
 	 * We must free the requests after all the corresponding objects have
-- 
1.9.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH 8/8] drm/i915: NULL check of unpin_work
  2015-10-08 18:31 [PATCH 0/8] Stability improvements to error state capture Tomas Elf
                   ` (6 preceding siblings ...)
  2015-10-08 18:31 ` [PATCH 7/8] drm/i915: Grab execlist spinlock to avoid post-reset concurrency issues Tomas Elf
@ 2015-10-08 18:31 ` Tomas Elf
  2015-10-09  7:46   ` Chris Wilson
  7 siblings, 1 reply; 68+ messages in thread
From: Tomas Elf @ 2015-10-08 18:31 UTC (permalink / raw)
  To: Intel-GFX

Avoid NULL pointer exceptions in the display driver for certain critical cases
when unpin_work has turned out to be NULL.

Signed-off-by: Tomas Elf <tomas.elf@intel.com>
---
 drivers/gpu/drm/i915/intel_display.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
index 539c373..421bc80 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -10853,6 +10853,8 @@ static inline void intel_mark_page_flip_active(struct intel_crtc *intel_crtc)
 {
 	/* Ensure that the work item is consistent when activating it ... */
 	smp_wmb();
+	if (!intel_crtc->unpin_work)
+	    return;
 	atomic_set(&intel_crtc->unpin_work->pending, INTEL_FLIP_PENDING);
 	/* and that it is marked active as soon as the irq could fire. */
 	smp_wmb();
@@ -11187,6 +11189,9 @@ static void ilk_do_mmio_flip(struct intel_crtc *intel_crtc)
 
 	I915_WRITE(reg, dspcntr);
 
+	if (!intel_crtc->unpin_work)
+	    return;
+
 	I915_WRITE(DSPSURF(intel_crtc->plane),
 		   intel_crtc->unpin_work->gtt_offset);
 	POSTING_READ(DSPSURF(intel_crtc->plane));
-- 
1.9.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* Re: [PATCH 6/8] drm/i915: Use safe list iterators
  2015-10-08 18:31 ` [PATCH 6/8] drm/i915: Use safe list iterators Tomas Elf
@ 2015-10-09  7:41   ` Chris Wilson
  2015-10-09 10:27     ` Tomas Elf
  0 siblings, 1 reply; 68+ messages in thread
From: Chris Wilson @ 2015-10-09  7:41 UTC (permalink / raw)
  To: Tomas Elf; +Cc: Intel-GFX

On Thu, Oct 08, 2015 at 07:31:38PM +0100, Tomas Elf wrote:
> Error state capture is dependent on i915_gem_active_request() and
> i915_gem_obj_is_pinned(). Since there is no synchronization between error state
> capture and the driver state we at least need to use safe list iterators in
> these functions to alleviate the problem of request/vma lists changing during
> error state capture.

Does not make it safe.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 7/8] drm/i915: Grab execlist spinlock to avoid post-reset concurrency issues.
  2015-10-08 18:31 ` [PATCH 7/8] drm/i915: Grab execlist spinlock to avoid post-reset concurrency issues Tomas Elf
@ 2015-10-09  7:45   ` Chris Wilson
  2015-10-09 10:28     ` Tomas Elf
  2015-10-09  8:38   ` Daniel Vetter
  2015-10-19 15:32   ` [PATCH v2 " Tomas Elf
  2 siblings, 1 reply; 68+ messages in thread
From: Chris Wilson @ 2015-10-09  7:45 UTC (permalink / raw)
  To: Tomas Elf; +Cc: Intel-GFX

On Thu, Oct 08, 2015 at 07:31:39PM +0100, Tomas Elf wrote:
> Grab execlist lock when cleaning up execlist queues after GPU reset to avoid
> concurrency problems between the context event interrupt handler and the reset
> path immediately following a GPU reset.

Hmm, my tree just does:

static void i915_gem_reset_ring_cleanup(struct drm_i915_private *dev_priv,
                                        struct intel_engine_cs *ring)
{
        /*
         * Clear the execlists queue up before freeing the requests, as those
         * are the ones that keep the context and ringbuffer backing objects
         * pinned in place.
         */
        if (i915.enable_execlists) {
                spin_lock_irq(&ring->execlist_lock);
                list_splice_tail_init(&ring->execlist_queue,
                                      &ring->execlist_completed);
                memset(&ring->execlist_port, 0, sizeof(ring->execlist_port));
                spin_unlock_irq(&ring->execlist_lock);

                intel_execlists_retire_requests(ring);
        }


Oh, that's right another unreviewed patch from several months ago.

For your patch, you should not take the spinlock unconditionally as it is
only initialised for execlists, and you can use the simpler form of irq
locking as above.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 8/8] drm/i915: NULL check of unpin_work
  2015-10-08 18:31 ` [PATCH 8/8] drm/i915: NULL check of unpin_work Tomas Elf
@ 2015-10-09  7:46   ` Chris Wilson
  2015-10-09  8:39     ` Daniel Vetter
  2015-10-09 10:30     ` Tomas Elf
  0 siblings, 2 replies; 68+ messages in thread
From: Chris Wilson @ 2015-10-09  7:46 UTC (permalink / raw)
  To: Tomas Elf; +Cc: Intel-GFX

On Thu, Oct 08, 2015 at 07:31:40PM +0100, Tomas Elf wrote:
> Avoid NULL pointer exceptions in the display driver for certain critical cases
> when unpin_work has turned out to be NULL.

Nope, the machine reached a point where it cannot possibly reach, we
want the OOPS.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 3/8] drm/i915: Cope with request list state change during error state capture
  2015-10-08 18:31 ` [PATCH 3/8] drm/i915: Cope with request list state change during " Tomas Elf
@ 2015-10-09  7:48   ` Chris Wilson
  2015-10-09 11:25     ` Tomas Elf
  2015-10-09  8:28   ` Daniel Vetter
  1 sibling, 1 reply; 68+ messages in thread
From: Chris Wilson @ 2015-10-09  7:48 UTC (permalink / raw)
  To: Tomas Elf; +Cc: Intel-GFX

On Thu, Oct 08, 2015 at 07:31:35PM +0100, Tomas Elf wrote:
> Since we're not synchronizing the ring request list during error state capture
> the request list state might change between the time the corresponding error
> request list was allocated and dimensioned to the time when the ring request
> list is actually captured into the error state. If this happens, throw a
> WARNING and do early exit and be aware that the captured error state might not
> be fully reliable.
> 
> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_gpu_error.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
> index 32c1799..cc75ca4 100644
> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
> @@ -1071,6 +1071,19 @@ static void i915_gem_record_rings(struct drm_device *dev,
>  		list_for_each_entry_safe(request, tmpreq, &ring->request_list, list) {
>  			struct drm_i915_error_request *erq;
>  
> +			if (WARN_ON(!request || count >= error->ring[i].num_requests)) {

Request cannot be null, count can legitmately be more, the WARN on is
inappropriate. Again, I sent several patches over the past couple of
years to fix this.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 5/8] drm/i915: vma NULL pointer check
  2015-10-08 18:31 ` [PATCH 5/8] drm/i915: vma NULL pointer check Tomas Elf
@ 2015-10-09  7:48   ` Chris Wilson
  2015-10-09 11:30     ` Tomas Elf
  2015-10-09  8:33   ` Daniel Vetter
  1 sibling, 1 reply; 68+ messages in thread
From: Chris Wilson @ 2015-10-09  7:48 UTC (permalink / raw)
  To: Tomas Elf; +Cc: Intel-GFX

On Thu, Oct 08, 2015 at 07:31:37PM +0100, Tomas Elf wrote:
> Sometimes the iterated vma objects are NULL apparently. Be aware of this while
> iterating and do early exit instead of causing a NULL pointer exception.

Wrong.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 4/8] drm/i915: NULL checking when capturing buffer objects during error state capture
  2015-10-08 18:31 ` [PATCH 4/8] drm/i915: NULL checking when capturing buffer objects " Tomas Elf
@ 2015-10-09  7:49   ` Chris Wilson
  2015-10-09 11:34     ` Tomas Elf
  2015-10-09  8:32   ` Daniel Vetter
  1 sibling, 1 reply; 68+ messages in thread
From: Chris Wilson @ 2015-10-09  7:49 UTC (permalink / raw)
  To: Tomas Elf; +Cc: Intel-GFX

On Thu, Oct 08, 2015 at 07:31:36PM +0100, Tomas Elf wrote:
> In the past vmas have sometimes turned out to be NULL when capturing buffer
> objects during error state capture. To avoid NULL pointer exceptions throw a
> WARNING and exit early and be prepared for the error state not being fully
> accurate following this point.

No, fix the root cause.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 2/8] drm/i915: Migrate to safe iterators in error state capture
  2015-10-08 18:31 ` [PATCH 2/8] drm/i915: Migrate to safe iterators in error state capture Tomas Elf
@ 2015-10-09  7:49   ` Chris Wilson
  2015-10-09 11:38     ` Tomas Elf
  2015-10-09  8:27   ` Daniel Vetter
  1 sibling, 1 reply; 68+ messages in thread
From: Chris Wilson @ 2015-10-09  7:49 UTC (permalink / raw)
  To: Tomas Elf; +Cc: Intel-GFX

On Thu, Oct 08, 2015 at 07:31:34PM +0100, Tomas Elf wrote:
> Using safe list iterators alleviates the problem of unsynchronized driver list
> manipulations while error state capture is in the process of traversing lists.

Wrong, this does not make anything safe.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 2/8] drm/i915: Migrate to safe iterators in error state capture
  2015-10-08 18:31 ` [PATCH 2/8] drm/i915: Migrate to safe iterators in error state capture Tomas Elf
  2015-10-09  7:49   ` Chris Wilson
@ 2015-10-09  8:27   ` Daniel Vetter
  2015-10-09 11:40     ` Tomas Elf
  1 sibling, 1 reply; 68+ messages in thread
From: Daniel Vetter @ 2015-10-09  8:27 UTC (permalink / raw)
  To: Tomas Elf; +Cc: Intel-GFX

On Thu, Oct 08, 2015 at 07:31:34PM +0100, Tomas Elf wrote:
> Using safe list iterators alleviates the problem of unsynchronized driver list
> manipulations while error state capture is in the process of traversing lists.
> 
> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_gpu_error.c | 38 +++++++++++++++++------------------
>  1 file changed, 19 insertions(+), 19 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
> index 2f04e4f..32c1799 100644
> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
> @@ -718,10 +718,10 @@ static void capture_bo(struct drm_i915_error_buffer *err,
>  static u32 capture_active_bo(struct drm_i915_error_buffer *err,
>  			     int count, struct list_head *head)
>  {
> -	struct i915_vma *vma;
> +	struct i915_vma *vma, *tmpvma;
>  	int i = 0;
>  
> -	list_for_each_entry(vma, head, mm_list) {
> +	list_for_each_entry_safe(vma, tmpvma, head, mm_list) {

_safe isn't safe against anything, it's only safe against removal of the
current list item done with a list_del/list_move from this thread. I.e.
the only thing it does is have a temporary pointer to the next list
element so if you delete the current one it won't fall over.

It doesn't protect against anything else, and especially not against other
threads totally stomping the list.

So we need proper protection for these lists, independent of
dev->struct_mutex. The least invasive way to do that is probably rcu,
which only requires us that we release the memory for any object sitting
on these lists with kfree_rcu instead of normal kfree. Another option
might be a spinlock, but that would be a lot more invasive, and Chris
won't like the execbuf overhead this causes ;-)
-Daniel

>  		capture_bo(err++, vma);
>  		if (++i == count)
>  			break;
> @@ -734,17 +734,17 @@ static u32 capture_pinned_bo(struct drm_i915_error_buffer *err,
>  			     int count, struct list_head *head,
>  			     struct i915_address_space *vm)
>  {
> -	struct drm_i915_gem_object *obj;
> +	struct drm_i915_gem_object *obj, *tmpobj;
>  	struct drm_i915_error_buffer * const first = err;
>  	struct drm_i915_error_buffer * const last = err + count;
>  
> -	list_for_each_entry(obj, head, global_list) {
> -		struct i915_vma *vma;
> +	list_for_each_entry_safe(obj, tmpobj, head, global_list) {
> +		struct i915_vma *vma, *tmpvma;
>  
>  		if (err == last)
>  			break;
>  
> -		list_for_each_entry(vma, &obj->vma_list, vma_link)
> +		list_for_each_entry_safe(vma, tmpvma, &obj->vma_list, vma_link)
>  			if (vma->vm == vm && vma->pin_count > 0)
>  				capture_bo(err++, vma);
>  	}
> @@ -958,13 +958,13 @@ static void i915_gem_record_active_context(struct intel_engine_cs *ring,
>  					   struct drm_i915_error_ring *ering)
>  {
>  	struct drm_i915_private *dev_priv = ring->dev->dev_private;
> -	struct drm_i915_gem_object *obj;
> +	struct drm_i915_gem_object *obj, *tmpobj;
>  
>  	/* Currently render ring is the only HW context user */
>  	if (ring->id != RCS || !error->ccid)
>  		return;
>  
> -	list_for_each_entry(obj, &dev_priv->mm.bound_list, global_list) {
> +	list_for_each_entry_safe(obj, tmpobj, &dev_priv->mm.bound_list, global_list) {
>  		if (!i915_gem_obj_ggtt_bound(obj))
>  			continue;
>  
> @@ -979,7 +979,7 @@ static void i915_gem_record_rings(struct drm_device *dev,
>  				  struct drm_i915_error_state *error)
>  {
>  	struct drm_i915_private *dev_priv = dev->dev_private;
> -	struct drm_i915_gem_request *request;
> +	struct drm_i915_gem_request *request, *tmpreq;
>  	int i, count;
>  
>  	for (i = 0; i < I915_NUM_RINGS; i++) {
> @@ -1055,7 +1055,7 @@ static void i915_gem_record_rings(struct drm_device *dev,
>  		i915_gem_record_active_context(ring, error, &error->ring[i]);
>  
>  		count = 0;
> -		list_for_each_entry(request, &ring->request_list, list)
> +		list_for_each_entry_safe(request, tmpreq, &ring->request_list, list)
>  			count++;
>  
>  		error->ring[i].num_requests = count;
> @@ -1068,7 +1068,7 @@ static void i915_gem_record_rings(struct drm_device *dev,
>  		}
>  
>  		count = 0;
> -		list_for_each_entry(request, &ring->request_list, list) {
> +		list_for_each_entry_safe(request, tmpreq, &ring->request_list, list) {
>  			struct drm_i915_error_request *erq;
>  
>  			erq = &error->ring[i].requests[count++];
> @@ -1088,17 +1088,17 @@ static void i915_gem_capture_vm(struct drm_i915_private *dev_priv,
>  				const int ndx)
>  {
>  	struct drm_i915_error_buffer *active_bo = NULL, *pinned_bo = NULL;
> -	struct drm_i915_gem_object *obj;
> -	struct i915_vma *vma;
> +	struct drm_i915_gem_object *obj, *tmpobj;
> +	struct i915_vma *vma, *tmpvma;
>  	int i;
>  
>  	i = 0;
> -	list_for_each_entry(vma, &vm->active_list, mm_list)
> +	list_for_each_entry_safe(vma, tmpvma, &vm->active_list, mm_list)
>  		i++;
>  	error->active_bo_count[ndx] = i;
>  
> -	list_for_each_entry(obj, &dev_priv->mm.bound_list, global_list) {
> -		list_for_each_entry(vma, &obj->vma_list, vma_link)
> +	list_for_each_entry_safe(obj, tmpobj, &dev_priv->mm.bound_list, global_list) {
> +		list_for_each_entry_safe(vma, tmpvma, &obj->vma_list, vma_link)
>  			if (vma->vm == vm && vma->pin_count > 0)
>  				i++;
>  	}
> @@ -1128,10 +1128,10 @@ static void i915_gem_capture_vm(struct drm_i915_private *dev_priv,
>  static void i915_gem_capture_buffers(struct drm_i915_private *dev_priv,
>  				     struct drm_i915_error_state *error)
>  {
> -	struct i915_address_space *vm;
> +	struct i915_address_space *vm, *tmpvm;
>  	int cnt = 0, i = 0;
>  
> -	list_for_each_entry(vm, &dev_priv->vm_list, global_link)
> +	list_for_each_entry_safe(vm, tmpvm, &dev_priv->vm_list, global_link)
>  		cnt++;
>  
>  	error->active_bo = kcalloc(cnt, sizeof(*error->active_bo), GFP_ATOMIC);
> @@ -1155,7 +1155,7 @@ static void i915_gem_capture_buffers(struct drm_i915_private *dev_priv,
>  		error->pinned_bo = NULL;
>  		error->pinned_bo_count = NULL;
>  	} else {
> -		list_for_each_entry(vm, &dev_priv->vm_list, global_link)
> +		list_for_each_entry_safe(vm, tmpvm, &dev_priv->vm_list, global_link)
>  			i915_gem_capture_vm(dev_priv, error, vm, i++);
>  
>  		error->vm_count = cnt;
> -- 
> 1.9.1
> 
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/intel-gfx

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 3/8] drm/i915: Cope with request list state change during error state capture
  2015-10-08 18:31 ` [PATCH 3/8] drm/i915: Cope with request list state change during " Tomas Elf
  2015-10-09  7:48   ` Chris Wilson
@ 2015-10-09  8:28   ` Daniel Vetter
  2015-10-09 11:45     ` Tomas Elf
  1 sibling, 1 reply; 68+ messages in thread
From: Daniel Vetter @ 2015-10-09  8:28 UTC (permalink / raw)
  To: Tomas Elf; +Cc: Intel-GFX

On Thu, Oct 08, 2015 at 07:31:35PM +0100, Tomas Elf wrote:
> Since we're not synchronizing the ring request list during error state capture
> the request list state might change between the time the corresponding error
> request list was allocated and dimensioned to the time when the ring request
> list is actually captured into the error state. If this happens, throw a
> WARNING and do early exit and be aware that the captured error state might not
> be fully reliable.

Please don't throw a WARNING since this is expected to occasionally
happen. DRM_DEBUG_DRIVER is enough imo.
-Daniel

> 
> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_gpu_error.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
> index 32c1799..cc75ca4 100644
> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
> @@ -1071,6 +1071,19 @@ static void i915_gem_record_rings(struct drm_device *dev,
>  		list_for_each_entry_safe(request, tmpreq, &ring->request_list, list) {
>  			struct drm_i915_error_request *erq;
>  
> +			if (WARN_ON(!request || count >= error->ring[i].num_requests)) {
> +				/*
> +				 * If the ring request list was changed in
> +				 * between the point where the error request
> +				 * list was created and dimensioned and this
> +				 * point then just update the num_requests
> +				 * field to reflect this.
> +				 */
> +				error->ring[i].num_requests =
> +					min(count, error->ring[i].num_requests);
> +				break;
> +			}
> +
>  			erq = &error->ring[i].requests[count++];
>  			erq->seqno = request->seqno;
>  			erq->jiffies = request->emitted_jiffies;
> -- 
> 1.9.1
> 
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/intel-gfx

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 4/8] drm/i915: NULL checking when capturing buffer objects during error state capture
  2015-10-08 18:31 ` [PATCH 4/8] drm/i915: NULL checking when capturing buffer objects " Tomas Elf
  2015-10-09  7:49   ` Chris Wilson
@ 2015-10-09  8:32   ` Daniel Vetter
  2015-10-09  8:47     ` Chris Wilson
  2015-10-09 11:45     ` Tomas Elf
  1 sibling, 2 replies; 68+ messages in thread
From: Daniel Vetter @ 2015-10-09  8:32 UTC (permalink / raw)
  To: Tomas Elf; +Cc: Intel-GFX

On Thu, Oct 08, 2015 at 07:31:36PM +0100, Tomas Elf wrote:
> In the past vmas have sometimes turned out to be NULL when capturing buffer
> objects during error state capture. To avoid NULL pointer exceptions throw a
> WARNING and exit early and be prepared for the error state not being fully
> accurate following this point.
> 
> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_gpu_error.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
> index cc75ca4..ae24971 100644
> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
> @@ -691,9 +691,17 @@ unwind:
>  static void capture_bo(struct drm_i915_error_buffer *err,
>  		       struct i915_vma *vma)
>  {
> -	struct drm_i915_gem_object *obj = vma->obj;
> +	struct drm_i915_gem_object *obj;
>  	int i;
>  
> +	if (WARN_ON(!vma))
> +	    return;

If we rcu-protect the vma lists and vmas themselves then this can't
happen.

> +
> +	if (WARN_ON(!vma->obj))
> +	    return;

Again we probably need to rcu-protect this, and more important we need to
make sure the compiler doesn't do crap. So

	obj = rcu_derefence(vma->obj)
	if (!obj)
		return;

Ofc obj then can also only be released with kfree_rcu.
-Daniel

> +
> +	obj = vma->obj;
> +
>  	err->size = obj->base.size;
>  	err->name = obj->base.name;
>  	for (i = 0; i < I915_NUM_RINGS; i++)
> -- 
> 1.9.1
> 
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/intel-gfx

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 5/8] drm/i915: vma NULL pointer check
  2015-10-08 18:31 ` [PATCH 5/8] drm/i915: vma NULL pointer check Tomas Elf
  2015-10-09  7:48   ` Chris Wilson
@ 2015-10-09  8:33   ` Daniel Vetter
  2015-10-09 11:46     ` Tomas Elf
  1 sibling, 1 reply; 68+ messages in thread
From: Daniel Vetter @ 2015-10-09  8:33 UTC (permalink / raw)
  To: Tomas Elf; +Cc: Intel-GFX

On Thu, Oct 08, 2015 at 07:31:37PM +0100, Tomas Elf wrote:
> Sometimes the iterated vma objects are NULL apparently. Be aware of this while
> iterating and do early exit instead of causing a NULL pointer exception.
> 
> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_gem.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 52642af..8210ae7 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -5122,9 +5122,13 @@ unsigned long i915_gem_obj_size(struct drm_i915_gem_object *o,
>  bool i915_gem_obj_is_pinned(struct drm_i915_gem_object *obj)
>  {
>  	struct i915_vma *vma;
> -	list_for_each_entry(vma, &obj->vma_list, vma_link)
> +	list_for_each_entry(vma, &obj->vma_list, vma_link) {
> +		if (!vma)
> +			continue;

Again no need to check for that if we rcu-protect this list and vmas.
-Daniel

> +
>  		if (vma->pin_count > 0)
>  			return true;
> +	}
>  
>  	return false;
>  }
> -- 
> 1.9.1
> 
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/intel-gfx

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 7/8] drm/i915: Grab execlist spinlock to avoid post-reset concurrency issues.
  2015-10-08 18:31 ` [PATCH 7/8] drm/i915: Grab execlist spinlock to avoid post-reset concurrency issues Tomas Elf
  2015-10-09  7:45   ` Chris Wilson
@ 2015-10-09  8:38   ` Daniel Vetter
  2015-10-09  8:45     ` Chris Wilson
  2015-10-19 15:32   ` [PATCH v2 " Tomas Elf
  2 siblings, 1 reply; 68+ messages in thread
From: Daniel Vetter @ 2015-10-09  8:38 UTC (permalink / raw)
  To: Tomas Elf; +Cc: Intel-GFX

On Thu, Oct 08, 2015 at 07:31:39PM +0100, Tomas Elf wrote:
> Grab execlist lock when cleaning up execlist queues after GPU reset to avoid
> concurrency problems between the context event interrupt handler and the reset
> path immediately following a GPU reset.
> 
> Signed-off-by: Tomas Elf <tomas.elf@intel.com>

Should we instead just stop any irqs from the GT completely while we do
the reset (plus a synchronize_irq)? It's a bit heavy-weight, but probably
safer. Well not the entire GT, but all the rings under reset (as prep for
per-ring reset).

Just an idea which feels a bit safer design overall, but we can ofc also
lock down all the state properly with the irq-safe spinlocks.
-Daniel

> ---
>  drivers/gpu/drm/i915/i915_gem.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 1666499..50c2dcd 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -2738,6 +2738,8 @@ static void i915_gem_reset_ring_status(struct drm_i915_private *dev_priv,
>  static void i915_gem_reset_ring_cleanup(struct drm_i915_private *dev_priv,
>  					struct intel_engine_cs *ring)
>  {
> +	unsigned long flags;
> +
>  	while (!list_empty(&ring->active_list)) {
>  		struct drm_i915_gem_object *obj;
>  
> @@ -2753,6 +2755,8 @@ static void i915_gem_reset_ring_cleanup(struct drm_i915_private *dev_priv,
>  	 * are the ones that keep the context and ringbuffer backing objects
>  	 * pinned in place.
>  	 */
> +
> +	spin_lock_irqsave(&ring->execlist_lock, flags);
>  	while (!list_empty(&ring->execlist_queue)) {
>  		struct drm_i915_gem_request *submit_req;
>  
> @@ -2766,6 +2770,7 @@ static void i915_gem_reset_ring_cleanup(struct drm_i915_private *dev_priv,
>  
>  		i915_gem_request_unreference(submit_req);
>  	}
> +	spin_unlock_irqrestore(&ring->execlist_lock, flags);
>  
>  	/*
>  	 * We must free the requests after all the corresponding objects have
> -- 
> 1.9.1
> 
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/intel-gfx

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 8/8] drm/i915: NULL check of unpin_work
  2015-10-09  7:46   ` Chris Wilson
@ 2015-10-09  8:39     ` Daniel Vetter
  2015-10-09 11:50       ` Tomas Elf
  2015-10-09 10:30     ` Tomas Elf
  1 sibling, 1 reply; 68+ messages in thread
From: Daniel Vetter @ 2015-10-09  8:39 UTC (permalink / raw)
  To: Chris Wilson, Tomas Elf, Intel-GFX

On Fri, Oct 09, 2015 at 08:46:31AM +0100, Chris Wilson wrote:
> On Thu, Oct 08, 2015 at 07:31:40PM +0100, Tomas Elf wrote:
> > Avoid NULL pointer exceptions in the display driver for certain critical cases
> > when unpin_work has turned out to be NULL.
> 
> Nope, the machine reached a point where it cannot possibly reach, we
> want the OOPS.

Yeah this looks funky ... how did you get to that state? If it indeed
blows up in testing then we'd need at least a WARN_ON (and hopefully some
ideas what's going wrong).
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 7/8] drm/i915: Grab execlist spinlock to avoid post-reset concurrency issues.
  2015-10-09  8:38   ` Daniel Vetter
@ 2015-10-09  8:45     ` Chris Wilson
  2015-10-13 11:46       ` Daniel Vetter
  0 siblings, 1 reply; 68+ messages in thread
From: Chris Wilson @ 2015-10-09  8:45 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Intel-GFX

On Fri, Oct 09, 2015 at 10:38:18AM +0200, Daniel Vetter wrote:
> On Thu, Oct 08, 2015 at 07:31:39PM +0100, Tomas Elf wrote:
> > Grab execlist lock when cleaning up execlist queues after GPU reset to avoid
> > concurrency problems between the context event interrupt handler and the reset
> > path immediately following a GPU reset.
> > 
> > Signed-off-by: Tomas Elf <tomas.elf@intel.com>
> 
> Should we instead just stop any irqs from the GT completely while we do
> the reset (plus a synchronize_irq)? It's a bit heavy-weight, but probably
> safer. Well not the entire GT, but all the rings under reset (as prep for
> per-ring reset).

Bah, stopping IRQs is not enough for error state capture though since
requests complete asynchronously just by polling a memory address. (If
that is the goal here, this patch just makes execlist_queue access
consistent and should only be run once the GPU has been reset and so is
categorically idle.)
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 4/8] drm/i915: NULL checking when capturing buffer objects during error state capture
  2015-10-09  8:32   ` Daniel Vetter
@ 2015-10-09  8:47     ` Chris Wilson
  2015-10-09 11:52       ` Tomas Elf
  2015-10-09 11:45     ` Tomas Elf
  1 sibling, 1 reply; 68+ messages in thread
From: Chris Wilson @ 2015-10-09  8:47 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Intel-GFX

On Fri, Oct 09, 2015 at 10:32:11AM +0200, Daniel Vetter wrote:
> On Thu, Oct 08, 2015 at 07:31:36PM +0100, Tomas Elf wrote:
> > In the past vmas have sometimes turned out to be NULL when capturing buffer
> > objects during error state capture. To avoid NULL pointer exceptions throw a
> > WARNING and exit early and be prepared for the error state not being fully
> > accurate following this point.
> > 
> > Signed-off-by: Tomas Elf <tomas.elf@intel.com>
> > ---
> >  drivers/gpu/drm/i915/i915_gpu_error.c | 10 +++++++++-
> >  1 file changed, 9 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
> > index cc75ca4..ae24971 100644
> > --- a/drivers/gpu/drm/i915/i915_gpu_error.c
> > +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
> > @@ -691,9 +691,17 @@ unwind:
> >  static void capture_bo(struct drm_i915_error_buffer *err,
> >  		       struct i915_vma *vma)
> >  {
> > -	struct drm_i915_gem_object *obj = vma->obj;
> > +	struct drm_i915_gem_object *obj;
> >  	int i;
> >  
> > +	if (WARN_ON(!vma))
> > +	    return;
> 
> If we rcu-protect the vma lists and vmas themselves then this can't
> happen.
> 
> > +
> > +	if (WARN_ON(!vma->obj))
> > +	    return;
> 
> Again we probably need to rcu-protect this, and more important we need to
> make sure the compiler doesn't do crap. So
> 
> 	obj = rcu_derefence(vma->obj)
> 	if (!obj)
> 		return;

The other issue is that we don't actually want capture_bo() but capture_vma().
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 6/8] drm/i915: Use safe list iterators
  2015-10-09  7:41   ` Chris Wilson
@ 2015-10-09 10:27     ` Tomas Elf
  2015-10-09 10:38       ` Chris Wilson
  0 siblings, 1 reply; 68+ messages in thread
From: Tomas Elf @ 2015-10-09 10:27 UTC (permalink / raw)
  To: Chris Wilson, Intel-GFX

On 09/10/2015 08:41, Chris Wilson wrote:
> On Thu, Oct 08, 2015 at 07:31:38PM +0100, Tomas Elf wrote:
>> Error state capture is dependent on i915_gem_active_request() and
>> i915_gem_obj_is_pinned(). Since there is no synchronization between error state
>> capture and the driver state we at least need to use safe list iterators in
>> these functions to alleviate the problem of request/vma lists changing during
>> error state capture.
>
> Does not make it safe.
> -Chris
>

I know it doesn't make it safe, I didn't say this patch makes it safe.

Maybe I was unclear but I chose the word "alleviate" rather than "solve" 
to indicate that there are still problems but this change reduces the 
problem scope and makes crashes less likely to occur. I also used the 
formulation "at least" to indicate that we're not solving everything but 
we can do certain things to improve things somewhat.

The problems I've been seeing has been that the list state changes 
during iteration and that the error capture tries to read elements that 
are no longer part of the list - not that elements that the error 
capture code is dereferencing are deallocated by the driver or whatever. 
Using a safe iterator helps with that very particular problem. Or maybe 
I guess I've just been incredibly lucky for the last 2 months when I've 
been running this code as I've been able to get 12+ hours of stability 
during my tests instead of less than one hour in between crashes that 
was the case before I introduced these changes.

Thanks,
Tomas
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 7/8] drm/i915: Grab execlist spinlock to avoid post-reset concurrency issues.
  2015-10-09  7:45   ` Chris Wilson
@ 2015-10-09 10:28     ` Tomas Elf
  0 siblings, 0 replies; 68+ messages in thread
From: Tomas Elf @ 2015-10-09 10:28 UTC (permalink / raw)
  To: Chris Wilson, Intel-GFX

On 09/10/2015 08:45, Chris Wilson wrote:
> On Thu, Oct 08, 2015 at 07:31:39PM +0100, Tomas Elf wrote:
>> Grab execlist lock when cleaning up execlist queues after GPU reset to avoid
>> concurrency problems between the context event interrupt handler and the reset
>> path immediately following a GPU reset.
>
> Hmm, my tree just does:
>
> static void i915_gem_reset_ring_cleanup(struct drm_i915_private *dev_priv,
>                                          struct intel_engine_cs *ring)
> {
>          /*
>           * Clear the execlists queue up before freeing the requests, as those
>           * are the ones that keep the context and ringbuffer backing objects
>           * pinned in place.
>           */
>          if (i915.enable_execlists) {
>                  spin_lock_irq(&ring->execlist_lock);
>                  list_splice_tail_init(&ring->execlist_queue,
>                                        &ring->execlist_completed);
>                  memset(&ring->execlist_port, 0, sizeof(ring->execlist_port));
>                  spin_unlock_irq(&ring->execlist_lock);
>
>                  intel_execlists_retire_requests(ring);
>          }
>
>
> Oh, that's right another unreviewed patch from several months ago.
>
> For your patch, you should not take the spinlock unconditionally as it is
> only initialised for execlists, and you can use the simpler form of irq
> locking as above.

Those are fair points. Mind pointing me to your patch so that I can have 
a look at it?

Thanks,
Tomas

> -Chris
>

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 8/8] drm/i915: NULL check of unpin_work
  2015-10-09  7:46   ` Chris Wilson
  2015-10-09  8:39     ` Daniel Vetter
@ 2015-10-09 10:30     ` Tomas Elf
  2015-10-09 10:44       ` Chris Wilson
  1 sibling, 1 reply; 68+ messages in thread
From: Tomas Elf @ 2015-10-09 10:30 UTC (permalink / raw)
  To: Chris Wilson, Intel-GFX

On 09/10/2015 08:46, Chris Wilson wrote:
> On Thu, Oct 08, 2015 at 07:31:40PM +0100, Tomas Elf wrote:
>> Avoid NULL pointer exceptions in the display driver for certain critical cases
>> when unpin_work has turned out to be NULL.
>
> Nope, the machine reached a point where it cannot possibly reach, we
> want the OOPS.
> -Chris
>

Really? Because if I have this in there I can actually run the 
long-duration stability tests for 12+ hours rather than just a few hours 
(it doesn't happen all the time but I've seen it at least once). But, 
hey, if you want to investigate the reason why we have a NULL there then 
go ahead. I guess I'll just have to carry this patch for as long as my 
machine crashes during my TDR testing.

Thanks,
Tomas
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 6/8] drm/i915: Use safe list iterators
  2015-10-09 10:27     ` Tomas Elf
@ 2015-10-09 10:38       ` Chris Wilson
  2015-10-09 12:00         ` Tomas Elf
  0 siblings, 1 reply; 68+ messages in thread
From: Chris Wilson @ 2015-10-09 10:38 UTC (permalink / raw)
  To: Tomas Elf; +Cc: Intel-GFX

On Fri, Oct 09, 2015 at 11:27:48AM +0100, Tomas Elf wrote:
> On 09/10/2015 08:41, Chris Wilson wrote:
> >On Thu, Oct 08, 2015 at 07:31:38PM +0100, Tomas Elf wrote:
> >>Error state capture is dependent on i915_gem_active_request() and
> >>i915_gem_obj_is_pinned(). Since there is no synchronization between error state
> >>capture and the driver state we at least need to use safe list iterators in
> >>these functions to alleviate the problem of request/vma lists changing during
> >>error state capture.
> >
> >Does not make it safe.
> >-Chris
> >
> 
> I know it doesn't make it safe, I didn't say this patch makes it safe.
> 
> Maybe I was unclear but I chose the word "alleviate" rather than
> "solve" to indicate that there are still problems but this change
> reduces the problem scope and makes crashes less likely to occur. I
> also used the formulation "at least" to indicate that we're not
> solving everything but we can do certain things to improve things
> somewhat.
> 
> The problems I've been seeing has been that the list state changes
> during iteration and that the error capture tries to read elements
> that are no longer part of the list - not that elements that the
> error capture code is dereferencing are deallocated by the driver or
> whatever. Using a safe iterator helps with that very particular
> problem. Or maybe I guess I've just been incredibly lucky for the
> last 2 months when I've been running this code as I've been able to
> get 12+ hours of stability during my tests instead of less than one
> hour in between crashes that was the case before I introduced these
> changes.

You have been incredibily lucky (probably due to how the requests are
being cached now), but that the requests can be modified whilst error
capture runs and oops is well known. Just pretending the list iterator
is safe does nothing.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 8/8] drm/i915: NULL check of unpin_work
  2015-10-09 10:30     ` Tomas Elf
@ 2015-10-09 10:44       ` Chris Wilson
  2015-10-09 12:06         ` Tomas Elf
  0 siblings, 1 reply; 68+ messages in thread
From: Chris Wilson @ 2015-10-09 10:44 UTC (permalink / raw)
  To: Tomas Elf; +Cc: Intel-GFX

On Fri, Oct 09, 2015 at 11:30:34AM +0100, Tomas Elf wrote:
> On 09/10/2015 08:46, Chris Wilson wrote:
> >On Thu, Oct 08, 2015 at 07:31:40PM +0100, Tomas Elf wrote:
> >>Avoid NULL pointer exceptions in the display driver for certain critical cases
> >>when unpin_work has turned out to be NULL.
> >
> >Nope, the machine reached a point where it cannot possibly reach, we
> >want the OOPS.
> >-Chris
> >
> 
> Really? Because if I have this in there I can actually run the
> long-duration stability tests for 12+ hours rather than just a few
> hours (it doesn't happen all the time but I've seen it at least
> once). But, hey, if you want to investigate the reason why we have a
> NULL there then go ahead. I guess I'll just have to carry this patch
> for as long as my machine crashes during my TDR testing.

You've sat on a critical bug for how long? There's been one recent
report that appears to be fallout from Tvrtko's context cleanup, but
nothing older, in public at least.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 3/8] drm/i915: Cope with request list state change during error state capture
  2015-10-09  7:48   ` Chris Wilson
@ 2015-10-09 11:25     ` Tomas Elf
  2015-10-13 11:39       ` Daniel Vetter
  0 siblings, 1 reply; 68+ messages in thread
From: Tomas Elf @ 2015-10-09 11:25 UTC (permalink / raw)
  To: Chris Wilson, Intel-GFX

On 09/10/2015 08:48, Chris Wilson wrote:
> On Thu, Oct 08, 2015 at 07:31:35PM +0100, Tomas Elf wrote:
>> Since we're not synchronizing the ring request list during error state capture
>> the request list state might change between the time the corresponding error
>> request list was allocated and dimensioned to the time when the ring request
>> list is actually captured into the error state. If this happens, throw a
>> WARNING and do early exit and be aware that the captured error state might not
>> be fully reliable.
>>
>> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
>> ---
>>   drivers/gpu/drm/i915/i915_gpu_error.c | 13 +++++++++++++
>>   1 file changed, 13 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
>> index 32c1799..cc75ca4 100644
>> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
>> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
>> @@ -1071,6 +1071,19 @@ static void i915_gem_record_rings(struct drm_device *dev,
>>   		list_for_each_entry_safe(request, tmpreq, &ring->request_list, list) {
>>   			struct drm_i915_error_request *erq;
>>
>> +			if (WARN_ON(!request || count >= error->ring[i].num_requests)) {
>
> Request cannot be null, count can legitmately be more, the WARN on is
> inappropriate. Again, I sent several patches over the past couple of
> years to fix this.
> -Chris
>

Ok, so having the driver request list change grow in between the point 
where we allocate and set up the error state request list to the same 
size as the driver request list (since that's what count being larger 
than the list size implies) is legitimate? Traversing into unallocated 
memory seems pretty dodgy to me but if you say so.

Do you mind pointing me to your patch that solves this your way?

Thanks,
Tomas
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 5/8] drm/i915: vma NULL pointer check
  2015-10-09  7:48   ` Chris Wilson
@ 2015-10-09 11:30     ` Tomas Elf
  2015-10-09 11:59       ` Chris Wilson
  0 siblings, 1 reply; 68+ messages in thread
From: Tomas Elf @ 2015-10-09 11:30 UTC (permalink / raw)
  To: Chris Wilson, Intel-GFX

On 09/10/2015 08:48, Chris Wilson wrote:
> On Thu, Oct 08, 2015 at 07:31:37PM +0100, Tomas Elf wrote:
>> Sometimes the iterated vma objects are NULL apparently. Be aware of this while
>> iterating and do early exit instead of causing a NULL pointer exception.
>
> Wrong.
> -Chris
>

So the NULL pointer exception that I ran into multiple times during 
several different test runs on the line that says "vma->pin_count" was 
not because the vma pointer was NULL. Would you mind sharing your 
explanation to how this might have happened? Is it because we're not 
synchronizing and there's no protection against the driver deallocating 
vmas while we're trying to capture them? If this all ties into the 
aforementioned RCU-based solution then maybe we should go with that then.

Thanks,
Tomas
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 4/8] drm/i915: NULL checking when capturing buffer objects during error state capture
  2015-10-09  7:49   ` Chris Wilson
@ 2015-10-09 11:34     ` Tomas Elf
  0 siblings, 0 replies; 68+ messages in thread
From: Tomas Elf @ 2015-10-09 11:34 UTC (permalink / raw)
  To: Chris Wilson, Intel-GFX

On 09/10/2015 08:49, Chris Wilson wrote:
> On Thu, Oct 08, 2015 at 07:31:36PM +0100, Tomas Elf wrote:
>> In the past vmas have sometimes turned out to be NULL when capturing buffer
>> objects during error state capture. To avoid NULL pointer exceptions throw a
>> WARNING and exit early and be prepared for the error state not being fully
>> accurate following this point.
>
> No, fix the root cause.
> -Chris
>

I'd be glad to look that once TDR is upstreamed. Until then I need a 
system that can run long enough to validate TDR during long-duration 
operations testing. I need this patch in order to make that happen. But 
I guess I'll just carry this patch locally for now then.

Thanks,
Tomas
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 2/8] drm/i915: Migrate to safe iterators in error state capture
  2015-10-09  7:49   ` Chris Wilson
@ 2015-10-09 11:38     ` Tomas Elf
  0 siblings, 0 replies; 68+ messages in thread
From: Tomas Elf @ 2015-10-09 11:38 UTC (permalink / raw)
  To: Chris Wilson, Intel-GFX

On 09/10/2015 08:49, Chris Wilson wrote:
> On Thu, Oct 08, 2015 at 07:31:34PM +0100, Tomas Elf wrote:
>> Using safe list iterators alleviates the problem of unsynchronized driver list
>> manipulations while error state capture is in the process of traversing lists.
>
> Wrong, this does not make anything safe.
> -Chris
>

I know, I didn't say it would make it safe. As I said in an earlier 
mail: I use the word "alleviate" to indicate that we're reducing the 
problem scope and make fatal crashes less likely to occur but the 
underlying problem is still there. Since my focus is on TDR I didn't set 
out to solve the underlying problem, I only needed the error state 
capture code to not break my tests, which is why I set out to solve the 
crashes - not the underlying problem. If you have a solution for the 
underlying problem then that's another story.

Thanks,
Tomas
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 2/8] drm/i915: Migrate to safe iterators in error state capture
  2015-10-09  8:27   ` Daniel Vetter
@ 2015-10-09 11:40     ` Tomas Elf
  2015-10-13 11:37       ` Daniel Vetter
  0 siblings, 1 reply; 68+ messages in thread
From: Tomas Elf @ 2015-10-09 11:40 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Intel-GFX

On 09/10/2015 09:27, Daniel Vetter wrote:
> On Thu, Oct 08, 2015 at 07:31:34PM +0100, Tomas Elf wrote:
>> Using safe list iterators alleviates the problem of unsynchronized driver list
>> manipulations while error state capture is in the process of traversing lists.
>>
>> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
>> ---
>>   drivers/gpu/drm/i915/i915_gpu_error.c | 38 +++++++++++++++++------------------
>>   1 file changed, 19 insertions(+), 19 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
>> index 2f04e4f..32c1799 100644
>> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
>> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
>> @@ -718,10 +718,10 @@ static void capture_bo(struct drm_i915_error_buffer *err,
>>   static u32 capture_active_bo(struct drm_i915_error_buffer *err,
>>   			     int count, struct list_head *head)
>>   {
>> -	struct i915_vma *vma;
>> +	struct i915_vma *vma, *tmpvma;
>>   	int i = 0;
>>
>> -	list_for_each_entry(vma, head, mm_list) {
>> +	list_for_each_entry_safe(vma, tmpvma, head, mm_list) {
>
> _safe isn't safe against anything, it's only safe against removal of the
> current list item done with a list_del/list_move from this thread. I.e.
> the only thing it does is have a temporary pointer to the next list
> element so if you delete the current one it won't fall over.
>
> It doesn't protect against anything else, and especially not against other
> threads totally stomping the list.

Absolutely! But it's good enough to make the test not fall over within 
an hour of testing and instead allow it to run for 12+ hours during 
continuous long-duration testing, which is what I need for the TDR 
validation.

>
> So we need proper protection for these lists, independent of
> dev->struct_mutex. The least invasive way to do that is probably rcu,
> which only requires us that we release the memory for any object sitting
> on these lists with kfree_rcu instead of normal kfree. Another option
> might be a spinlock, but that would be a lot more invasive, and Chris
> won't like the execbuf overhead this causes ;-)
> -Daniel

Awesome! Let's go with that then.

Thanks,
Tomas

>
>>   		capture_bo(err++, vma);
>>   		if (++i == count)
>>   			break;
>> @@ -734,17 +734,17 @@ static u32 capture_pinned_bo(struct drm_i915_error_buffer *err,
>>   			     int count, struct list_head *head,
>>   			     struct i915_address_space *vm)
>>   {
>> -	struct drm_i915_gem_object *obj;
>> +	struct drm_i915_gem_object *obj, *tmpobj;
>>   	struct drm_i915_error_buffer * const first = err;
>>   	struct drm_i915_error_buffer * const last = err + count;
>>
>> -	list_for_each_entry(obj, head, global_list) {
>> -		struct i915_vma *vma;
>> +	list_for_each_entry_safe(obj, tmpobj, head, global_list) {
>> +		struct i915_vma *vma, *tmpvma;
>>
>>   		if (err == last)
>>   			break;
>>
>> -		list_for_each_entry(vma, &obj->vma_list, vma_link)
>> +		list_for_each_entry_safe(vma, tmpvma, &obj->vma_list, vma_link)
>>   			if (vma->vm == vm && vma->pin_count > 0)
>>   				capture_bo(err++, vma);
>>   	}
>> @@ -958,13 +958,13 @@ static void i915_gem_record_active_context(struct intel_engine_cs *ring,
>>   					   struct drm_i915_error_ring *ering)
>>   {
>>   	struct drm_i915_private *dev_priv = ring->dev->dev_private;
>> -	struct drm_i915_gem_object *obj;
>> +	struct drm_i915_gem_object *obj, *tmpobj;
>>
>>   	/* Currently render ring is the only HW context user */
>>   	if (ring->id != RCS || !error->ccid)
>>   		return;
>>
>> -	list_for_each_entry(obj, &dev_priv->mm.bound_list, global_list) {
>> +	list_for_each_entry_safe(obj, tmpobj, &dev_priv->mm.bound_list, global_list) {
>>   		if (!i915_gem_obj_ggtt_bound(obj))
>>   			continue;
>>
>> @@ -979,7 +979,7 @@ static void i915_gem_record_rings(struct drm_device *dev,
>>   				  struct drm_i915_error_state *error)
>>   {
>>   	struct drm_i915_private *dev_priv = dev->dev_private;
>> -	struct drm_i915_gem_request *request;
>> +	struct drm_i915_gem_request *request, *tmpreq;
>>   	int i, count;
>>
>>   	for (i = 0; i < I915_NUM_RINGS; i++) {
>> @@ -1055,7 +1055,7 @@ static void i915_gem_record_rings(struct drm_device *dev,
>>   		i915_gem_record_active_context(ring, error, &error->ring[i]);
>>
>>   		count = 0;
>> -		list_for_each_entry(request, &ring->request_list, list)
>> +		list_for_each_entry_safe(request, tmpreq, &ring->request_list, list)
>>   			count++;
>>
>>   		error->ring[i].num_requests = count;
>> @@ -1068,7 +1068,7 @@ static void i915_gem_record_rings(struct drm_device *dev,
>>   		}
>>
>>   		count = 0;
>> -		list_for_each_entry(request, &ring->request_list, list) {
>> +		list_for_each_entry_safe(request, tmpreq, &ring->request_list, list) {
>>   			struct drm_i915_error_request *erq;
>>
>>   			erq = &error->ring[i].requests[count++];
>> @@ -1088,17 +1088,17 @@ static void i915_gem_capture_vm(struct drm_i915_private *dev_priv,
>>   				const int ndx)
>>   {
>>   	struct drm_i915_error_buffer *active_bo = NULL, *pinned_bo = NULL;
>> -	struct drm_i915_gem_object *obj;
>> -	struct i915_vma *vma;
>> +	struct drm_i915_gem_object *obj, *tmpobj;
>> +	struct i915_vma *vma, *tmpvma;
>>   	int i;
>>
>>   	i = 0;
>> -	list_for_each_entry(vma, &vm->active_list, mm_list)
>> +	list_for_each_entry_safe(vma, tmpvma, &vm->active_list, mm_list)
>>   		i++;
>>   	error->active_bo_count[ndx] = i;
>>
>> -	list_for_each_entry(obj, &dev_priv->mm.bound_list, global_list) {
>> -		list_for_each_entry(vma, &obj->vma_list, vma_link)
>> +	list_for_each_entry_safe(obj, tmpobj, &dev_priv->mm.bound_list, global_list) {
>> +		list_for_each_entry_safe(vma, tmpvma, &obj->vma_list, vma_link)
>>   			if (vma->vm == vm && vma->pin_count > 0)
>>   				i++;
>>   	}
>> @@ -1128,10 +1128,10 @@ static void i915_gem_capture_vm(struct drm_i915_private *dev_priv,
>>   static void i915_gem_capture_buffers(struct drm_i915_private *dev_priv,
>>   				     struct drm_i915_error_state *error)
>>   {
>> -	struct i915_address_space *vm;
>> +	struct i915_address_space *vm, *tmpvm;
>>   	int cnt = 0, i = 0;
>>
>> -	list_for_each_entry(vm, &dev_priv->vm_list, global_link)
>> +	list_for_each_entry_safe(vm, tmpvm, &dev_priv->vm_list, global_link)
>>   		cnt++;
>>
>>   	error->active_bo = kcalloc(cnt, sizeof(*error->active_bo), GFP_ATOMIC);
>> @@ -1155,7 +1155,7 @@ static void i915_gem_capture_buffers(struct drm_i915_private *dev_priv,
>>   		error->pinned_bo = NULL;
>>   		error->pinned_bo_count = NULL;
>>   	} else {
>> -		list_for_each_entry(vm, &dev_priv->vm_list, global_link)
>> +		list_for_each_entry_safe(vm, tmpvm, &dev_priv->vm_list, global_link)
>>   			i915_gem_capture_vm(dev_priv, error, vm, i++);
>>
>>   		error->vm_count = cnt;
>> --
>> 1.9.1
>>
>> _______________________________________________
>> Intel-gfx mailing list
>> Intel-gfx@lists.freedesktop.org
>> http://lists.freedesktop.org/mailman/listinfo/intel-gfx
>

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 3/8] drm/i915: Cope with request list state change during error state capture
  2015-10-09  8:28   ` Daniel Vetter
@ 2015-10-09 11:45     ` Tomas Elf
  2015-10-13 11:40       ` Daniel Vetter
  0 siblings, 1 reply; 68+ messages in thread
From: Tomas Elf @ 2015-10-09 11:45 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Intel-GFX

On 09/10/2015 09:28, Daniel Vetter wrote:
> On Thu, Oct 08, 2015 at 07:31:35PM +0100, Tomas Elf wrote:
>> Since we're not synchronizing the ring request list during error state capture
>> the request list state might change between the time the corresponding error
>> request list was allocated and dimensioned to the time when the ring request
>> list is actually captured into the error state. If this happens, throw a
>> WARNING and do early exit and be aware that the captured error state might not
>> be fully reliable.
>
> Please don't throw a WARNING since this is expected to occasionally
> happen. DRM_DEBUG_DRIVER is enough imo.
> -Daniel

I still don't see how it could happen without leading to reads of 
unallocated memory. The error state request list has been allocated to a 
certain size equal to num_requests and this loop seems to assume that 
the error state request list maintains the same size as the driver 
request list, which is not the case - leading to crashes, which is how I 
happened to notice it.

I can obviously remove the warning but are you saying we shouldn't even 
take action if it happens? Such as early exit?

Thanks,
Tomas

>
>>
>> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
>> ---
>>   drivers/gpu/drm/i915/i915_gpu_error.c | 13 +++++++++++++
>>   1 file changed, 13 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
>> index 32c1799..cc75ca4 100644
>> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
>> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
>> @@ -1071,6 +1071,19 @@ static void i915_gem_record_rings(struct drm_device *dev,
>>   		list_for_each_entry_safe(request, tmpreq, &ring->request_list, list) {
>>   			struct drm_i915_error_request *erq;
>>
>> +			if (WARN_ON(!request || count >= error->ring[i].num_requests)) {
>> +				/*
>> +				 * If the ring request list was changed in
>> +				 * between the point where the error request
>> +				 * list was created and dimensioned and this
>> +				 * point then just update the num_requests
>> +				 * field to reflect this.
>> +				 */
>> +				error->ring[i].num_requests =
>> +					min(count, error->ring[i].num_requests);
>> +				break;
>> +			}
>> +
>>   			erq = &error->ring[i].requests[count++];
>>   			erq->seqno = request->seqno;
>>   			erq->jiffies = request->emitted_jiffies;
>> --
>> 1.9.1
>>
>> _______________________________________________
>> Intel-gfx mailing list
>> Intel-gfx@lists.freedesktop.org
>> http://lists.freedesktop.org/mailman/listinfo/intel-gfx
>

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 4/8] drm/i915: NULL checking when capturing buffer objects during error state capture
  2015-10-09  8:32   ` Daniel Vetter
  2015-10-09  8:47     ` Chris Wilson
@ 2015-10-09 11:45     ` Tomas Elf
  1 sibling, 0 replies; 68+ messages in thread
From: Tomas Elf @ 2015-10-09 11:45 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Intel-GFX

On 09/10/2015 09:32, Daniel Vetter wrote:
> On Thu, Oct 08, 2015 at 07:31:36PM +0100, Tomas Elf wrote:
>> In the past vmas have sometimes turned out to be NULL when capturing buffer
>> objects during error state capture. To avoid NULL pointer exceptions throw a
>> WARNING and exit early and be prepared for the error state not being fully
>> accurate following this point.
>>
>> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
>> ---
>>   drivers/gpu/drm/i915/i915_gpu_error.c | 10 +++++++++-
>>   1 file changed, 9 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
>> index cc75ca4..ae24971 100644
>> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
>> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
>> @@ -691,9 +691,17 @@ unwind:
>>   static void capture_bo(struct drm_i915_error_buffer *err,
>>   		       struct i915_vma *vma)
>>   {
>> -	struct drm_i915_gem_object *obj = vma->obj;
>> +	struct drm_i915_gem_object *obj;
>>   	int i;
>>
>> +	if (WARN_ON(!vma))
>> +	    return;
>
> If we rcu-protect the vma lists and vmas themselves then this can't
> happen.
>
>> +
>> +	if (WARN_ON(!vma->obj))
>> +	    return;
>
> Again we probably need to rcu-protect this, and more important we need to
> make sure the compiler doesn't do crap. So
>
> 	obj = rcu_derefence(vma->obj)
> 	if (!obj)
> 		return;
>
> Ofc obj then can also only be released with kfree_rcu.
> -Daniel

Awesome! Let's go with that solution then.

Thanks,
Tomas

>
>> +
>> +	obj = vma->obj;
>> +
>>   	err->size = obj->base.size;
>>   	err->name = obj->base.name;
>>   	for (i = 0; i < I915_NUM_RINGS; i++)
>> --
>> 1.9.1
>>
>> _______________________________________________
>> Intel-gfx mailing list
>> Intel-gfx@lists.freedesktop.org
>> http://lists.freedesktop.org/mailman/listinfo/intel-gfx
>

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 5/8] drm/i915: vma NULL pointer check
  2015-10-09  8:33   ` Daniel Vetter
@ 2015-10-09 11:46     ` Tomas Elf
  0 siblings, 0 replies; 68+ messages in thread
From: Tomas Elf @ 2015-10-09 11:46 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Intel-GFX

On 09/10/2015 09:33, Daniel Vetter wrote:
> On Thu, Oct 08, 2015 at 07:31:37PM +0100, Tomas Elf wrote:
>> Sometimes the iterated vma objects are NULL apparently. Be aware of this while
>> iterating and do early exit instead of causing a NULL pointer exception.
>>
>> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
>> ---
>>   drivers/gpu/drm/i915/i915_gem.c | 6 +++++-
>>   1 file changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
>> index 52642af..8210ae7 100644
>> --- a/drivers/gpu/drm/i915/i915_gem.c
>> +++ b/drivers/gpu/drm/i915/i915_gem.c
>> @@ -5122,9 +5122,13 @@ unsigned long i915_gem_obj_size(struct drm_i915_gem_object *o,
>>   bool i915_gem_obj_is_pinned(struct drm_i915_gem_object *obj)
>>   {
>>   	struct i915_vma *vma;
>> -	list_for_each_entry(vma, &obj->vma_list, vma_link)
>> +	list_for_each_entry(vma, &obj->vma_list, vma_link) {
>> +		if (!vma)
>> +			continue;
>
> Again no need to check for that if we rcu-protect this list and vmas.
> -Daniel

Sweet! Let's go with that solution!

Thanks,
Tomas

>
>> +
>>   		if (vma->pin_count > 0)
>>   			return true;
>> +	}
>>
>>   	return false;
>>   }
>> --
>> 1.9.1
>>
>> _______________________________________________
>> Intel-gfx mailing list
>> Intel-gfx@lists.freedesktop.org
>> http://lists.freedesktop.org/mailman/listinfo/intel-gfx
>

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 8/8] drm/i915: NULL check of unpin_work
  2015-10-09  8:39     ` Daniel Vetter
@ 2015-10-09 11:50       ` Tomas Elf
  0 siblings, 0 replies; 68+ messages in thread
From: Tomas Elf @ 2015-10-09 11:50 UTC (permalink / raw)
  To: Daniel Vetter, Chris Wilson, Intel-GFX

On 09/10/2015 09:39, Daniel Vetter wrote:
> On Fri, Oct 09, 2015 at 08:46:31AM +0100, Chris Wilson wrote:
>> On Thu, Oct 08, 2015 at 07:31:40PM +0100, Tomas Elf wrote:
>>> Avoid NULL pointer exceptions in the display driver for certain critical cases
>>> when unpin_work has turned out to be NULL.
>>
>> Nope, the machine reached a point where it cannot possibly reach, we
>> want the OOPS.
>
> Yeah this looks funky ... how did you get to that state? If it indeed
> blows up in testing then we'd need at least a WARN_ON (and hopefully some
> ideas what's going wrong).
> -Daniel
>

I have no idea but I know that it blew up my machine and ended my test 
ahead of time. Adding this check allows me to test for 12+ hours, which 
is what I wanted to do.

Sure, I'll add a WARN_ON. I don't know how much time I have to 
investigate the root cause considering the whole TDR upstreaming situation.

Thanks,
Tomas
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 4/8] drm/i915: NULL checking when capturing buffer objects during error state capture
  2015-10-09  8:47     ` Chris Wilson
@ 2015-10-09 11:52       ` Tomas Elf
  0 siblings, 0 replies; 68+ messages in thread
From: Tomas Elf @ 2015-10-09 11:52 UTC (permalink / raw)
  To: Chris Wilson, Daniel Vetter, Intel-GFX

On 09/10/2015 09:47, Chris Wilson wrote:
> On Fri, Oct 09, 2015 at 10:32:11AM +0200, Daniel Vetter wrote:
>> On Thu, Oct 08, 2015 at 07:31:36PM +0100, Tomas Elf wrote:
>>> In the past vmas have sometimes turned out to be NULL when capturing buffer
>>> objects during error state capture. To avoid NULL pointer exceptions throw a
>>> WARNING and exit early and be prepared for the error state not being fully
>>> accurate following this point.
>>>
>>> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
>>> ---
>>>   drivers/gpu/drm/i915/i915_gpu_error.c | 10 +++++++++-
>>>   1 file changed, 9 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
>>> index cc75ca4..ae24971 100644
>>> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
>>> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
>>> @@ -691,9 +691,17 @@ unwind:
>>>   static void capture_bo(struct drm_i915_error_buffer *err,
>>>   		       struct i915_vma *vma)
>>>   {
>>> -	struct drm_i915_gem_object *obj = vma->obj;
>>> +	struct drm_i915_gem_object *obj;
>>>   	int i;
>>>
>>> +	if (WARN_ON(!vma))
>>> +	    return;
>>
>> If we rcu-protect the vma lists and vmas themselves then this can't
>> happen.
>>
>>> +
>>> +	if (WARN_ON(!vma->obj))
>>> +	    return;
>>
>> Again we probably need to rcu-protect this, and more important we need to
>> make sure the compiler doesn't do crap. So
>>
>> 	obj = rcu_derefence(vma->obj)
>> 	if (!obj)
>> 		return;
>

Sweet! Let's go with that solution.

> The other issue is that we don't actually want capture_bo() but capture_vma().

Sure, but capturing BOs instead of VMAs is not going to crash anything 
and is therefore slightly outside of the scope of my current work. It 
would certainly be an improvement to make that change, though.

Thanks,
Tomas

> -Chris
>

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 5/8] drm/i915: vma NULL pointer check
  2015-10-09 11:30     ` Tomas Elf
@ 2015-10-09 11:59       ` Chris Wilson
  2015-10-13 11:43         ` Daniel Vetter
  0 siblings, 1 reply; 68+ messages in thread
From: Chris Wilson @ 2015-10-09 11:59 UTC (permalink / raw)
  To: Tomas Elf; +Cc: Intel-GFX

On Fri, Oct 09, 2015 at 12:30:29PM +0100, Tomas Elf wrote:
> On 09/10/2015 08:48, Chris Wilson wrote:
> >On Thu, Oct 08, 2015 at 07:31:37PM +0100, Tomas Elf wrote:
> >>Sometimes the iterated vma objects are NULL apparently. Be aware of this while
> >>iterating and do early exit instead of causing a NULL pointer exception.
> >
> >Wrong.
> >-Chris
> >
> 
> So the NULL pointer exception that I ran into multiple times during
> several different test runs on the line that says "vma->pin_count"
> was not because the vma pointer was NULL. Would you mind sharing
> your explanation to how this might have happened? Is it because
> we're not synchronizing and there's no protection against the driver
> deallocating vmas while we're trying to capture them? If this all
> ties into the aforementioned RCU-based solution then maybe we should
> go with that then.

Correct. The driver is retiring requests whilst the hang check worker is
running. And you chased a stale pointer, you could have equally chased a
stale vma->obj, vma->ctx etc pointers.

What I have done in the past is to serialise the retirement and the
hangcheck using a spinlock (as adding to the end of the list is safe),
but there are still weak spots when walking the list of bound vma.
What I would actually feel more comfortable with is to only record the
request->batch_vma, at the cost of losing information about the
currently bound buffers.

Or we could just stop_machine() whilst running the capture and have the
machine unresponsive for a few 100ms. I don't think simply RCU the lists
is enough (VM active_list, request list, bound list) as eventually we
chase a pointer from obj itself (which means we need to RCU pretty much
everything).
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 6/8] drm/i915: Use safe list iterators
  2015-10-09 10:38       ` Chris Wilson
@ 2015-10-09 12:00         ` Tomas Elf
  0 siblings, 0 replies; 68+ messages in thread
From: Tomas Elf @ 2015-10-09 12:00 UTC (permalink / raw)
  To: Chris Wilson, Intel-GFX

On 09/10/2015 11:38, Chris Wilson wrote:
> On Fri, Oct 09, 2015 at 11:27:48AM +0100, Tomas Elf wrote:
>> On 09/10/2015 08:41, Chris Wilson wrote:
>>> On Thu, Oct 08, 2015 at 07:31:38PM +0100, Tomas Elf wrote:
>>>> Error state capture is dependent on i915_gem_active_request() and
>>>> i915_gem_obj_is_pinned(). Since there is no synchronization between error state
>>>> capture and the driver state we at least need to use safe list iterators in
>>>> these functions to alleviate the problem of request/vma lists changing during
>>>> error state capture.
>>>
>>> Does not make it safe.
>>> -Chris
>>>
>>
>> I know it doesn't make it safe, I didn't say this patch makes it safe.
>>
>> Maybe I was unclear but I chose the word "alleviate" rather than
>> "solve" to indicate that there are still problems but this change
>> reduces the problem scope and makes crashes less likely to occur. I
>> also used the formulation "at least" to indicate that we're not
>> solving everything but we can do certain things to improve things
>> somewhat.
>>
>> The problems I've been seeing has been that the list state changes
>> during iteration and that the error capture tries to read elements
>> that are no longer part of the list - not that elements that the
>> error capture code is dereferencing are deallocated by the driver or
>> whatever. Using a safe iterator helps with that very particular
>> problem. Or maybe I guess I've just been incredibly lucky for the
>> last 2 months when I've been running this code as I've been able to
>> get 12+ hours of stability during my tests instead of less than one
>> hour in between crashes that was the case before I introduced these
>> changes.
>
> You have been incredibily lucky (probably due to how the requests are
> being cached now), but that the requests can be modified whilst error
> capture runs and oops is well known. Just pretending the list iterator
> is safe does nothing.
> -Chris
>

Does nothing except consistently extend meantime between failures from 
less than one hour to more than 12 hours during my TDR tests. That's a 
lot of luck right there. On a consistent and predictable basis. Also, 
I'm not trying to pretend, I thought I communicated clearly that this is 
not a solution but rather an improvement that makes certain types of 
tests actually possible to run, tests that are needed for the TDR 
validation.

But if you have another solution then let's go with that.

Thanks,
Tomas
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 8/8] drm/i915: NULL check of unpin_work
  2015-10-09 10:44       ` Chris Wilson
@ 2015-10-09 12:06         ` Tomas Elf
  2015-10-13 11:51           ` Daniel Vetter
  0 siblings, 1 reply; 68+ messages in thread
From: Tomas Elf @ 2015-10-09 12:06 UTC (permalink / raw)
  To: Chris Wilson, Intel-GFX

On 09/10/2015 11:44, Chris Wilson wrote:
> On Fri, Oct 09, 2015 at 11:30:34AM +0100, Tomas Elf wrote:
>> On 09/10/2015 08:46, Chris Wilson wrote:
>>> On Thu, Oct 08, 2015 at 07:31:40PM +0100, Tomas Elf wrote:
>>>> Avoid NULL pointer exceptions in the display driver for certain critical cases
>>>> when unpin_work has turned out to be NULL.
>>>
>>> Nope, the machine reached a point where it cannot possibly reach, we
>>> want the OOPS.
>>> -Chris
>>>
>>
>> Really? Because if I have this in there I can actually run the
>> long-duration stability tests for 12+ hours rather than just a few
>> hours (it doesn't happen all the time but I've seen it at least
>> once). But, hey, if you want to investigate the reason why we have a
>> NULL there then go ahead. I guess I'll just have to carry this patch
>> for as long as my machine crashes during my TDR testing.
>
> You've sat on a critical bug for how long? There's been one recent
> report that appears to be fallout from Tvrtko's context cleanup, but
> nothing older, in public at least.
> -Chris
>

Do people typically try to actively hang their machines several times a 
minute for a 12+ hours at a time? If not then maybe that's why they 
haven't seen this.

I haven't sat on anything, this has been part of my error state capture 
improvement patches which I've intended to upstream for several months 
now. I consider this part of the overall stability issue and grouped it 
together with all the other patches necessary to make the machine not 
crash for the duration of my test. I would've upstreamed this series a 
long time ago if I had actually been given time to work on it but that's 
another discussion.

Thanks,
Tomas
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 2/8] drm/i915: Migrate to safe iterators in error state capture
  2015-10-09 11:40     ` Tomas Elf
@ 2015-10-13 11:37       ` Daniel Vetter
  2015-10-13 11:47         ` Chris Wilson
  0 siblings, 1 reply; 68+ messages in thread
From: Daniel Vetter @ 2015-10-13 11:37 UTC (permalink / raw)
  To: Tomas Elf; +Cc: Intel-GFX

On Fri, Oct 09, 2015 at 12:40:51PM +0100, Tomas Elf wrote:
> On 09/10/2015 09:27, Daniel Vetter wrote:
> >On Thu, Oct 08, 2015 at 07:31:34PM +0100, Tomas Elf wrote:
> >>Using safe list iterators alleviates the problem of unsynchronized driver list
> >>manipulations while error state capture is in the process of traversing lists.
> >>
> >>Signed-off-by: Tomas Elf <tomas.elf@intel.com>
> >>---
> >>  drivers/gpu/drm/i915/i915_gpu_error.c | 38 +++++++++++++++++------------------
> >>  1 file changed, 19 insertions(+), 19 deletions(-)
> >>
> >>diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
> >>index 2f04e4f..32c1799 100644
> >>--- a/drivers/gpu/drm/i915/i915_gpu_error.c
> >>+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
> >>@@ -718,10 +718,10 @@ static void capture_bo(struct drm_i915_error_buffer *err,
> >>  static u32 capture_active_bo(struct drm_i915_error_buffer *err,
> >>  			     int count, struct list_head *head)
> >>  {
> >>-	struct i915_vma *vma;
> >>+	struct i915_vma *vma, *tmpvma;
> >>  	int i = 0;
> >>
> >>-	list_for_each_entry(vma, head, mm_list) {
> >>+	list_for_each_entry_safe(vma, tmpvma, head, mm_list) {
> >
> >_safe isn't safe against anything, it's only safe against removal of the
> >current list item done with a list_del/list_move from this thread. I.e.
> >the only thing it does is have a temporary pointer to the next list
> >element so if you delete the current one it won't fall over.
> >
> >It doesn't protect against anything else, and especially not against other
> >threads totally stomping the list.
> 
> Absolutely! But it's good enough to make the test not fall over within an
> hour of testing and instead allow it to run for 12+ hours during continuous
> long-duration testing, which is what I need for the TDR validation.

Well it looks really suspicious, since the only thing this protects
against is against deleting the element we look at right now. The only
sensible theory I can come up with is that this slightly helps when
starting the list walk, in case someone comes around and deletes the first
element in the list from the retire worker. Since the retire worker tends
to do more work per list item than we do that's enough to make the race
really unlikely. But it's still just duct-tape.

> >So we need proper protection for these lists, independent of
> >dev->struct_mutex. The least invasive way to do that is probably rcu,
> >which only requires us that we release the memory for any object sitting
> >on these lists with kfree_rcu instead of normal kfree. Another option
> >might be a spinlock, but that would be a lot more invasive, and Chris
> >won't like the execbuf overhead this causes ;-)
> >-Daniel
> 
> Awesome! Let's go with that then.

See our massive irc discussion ... it's more tricky :(

In short rcu works perfectly if you only have the following lifetime
sequence:
- kmalloc object
- add to list
- remove from list (eventually)
- free object

If you readd an object to a list, or even worse, move it then the rcu list
helpers won't work. What could happen on the read side is:

- you walk the rcu list, keeping track of the head element to know when to
  stop
- while looking at that list one of the yet untraversed items gets moved
  to a new list

-> you'll traverse the new list forever since that one will never hit the
head element.

Not a problem for requests/vmas, but a problem for some of the object lists.

Note that the problem isn't that we re-add the element (if that happens we
might end up walking parts of the list twice or parts of it not at all),
but moving to a different list where there's a different head element.

I haven't checked, but maybe we're lucky and all the object lists we're
looking at will always have the same head element. Then I think it'll work
out (albeit racy, but who cares about that for the error capture) and with
the guarantee (when using rcu delayed freeing) that'll we'll never chase
freed memory.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 3/8] drm/i915: Cope with request list state change during error state capture
  2015-10-09 11:25     ` Tomas Elf
@ 2015-10-13 11:39       ` Daniel Vetter
  2015-10-14 11:46         ` Tomas Elf
  0 siblings, 1 reply; 68+ messages in thread
From: Daniel Vetter @ 2015-10-13 11:39 UTC (permalink / raw)
  To: Tomas Elf; +Cc: Intel-GFX

On Fri, Oct 09, 2015 at 12:25:26PM +0100, Tomas Elf wrote:
> On 09/10/2015 08:48, Chris Wilson wrote:
> >On Thu, Oct 08, 2015 at 07:31:35PM +0100, Tomas Elf wrote:
> >>Since we're not synchronizing the ring request list during error state capture
> >>the request list state might change between the time the corresponding error
> >>request list was allocated and dimensioned to the time when the ring request
> >>list is actually captured into the error state. If this happens, throw a
> >>WARNING and do early exit and be aware that the captured error state might not
> >>be fully reliable.
> >>
> >>Signed-off-by: Tomas Elf <tomas.elf@intel.com>
> >>---
> >>  drivers/gpu/drm/i915/i915_gpu_error.c | 13 +++++++++++++
> >>  1 file changed, 13 insertions(+)
> >>
> >>diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
> >>index 32c1799..cc75ca4 100644
> >>--- a/drivers/gpu/drm/i915/i915_gpu_error.c
> >>+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
> >>@@ -1071,6 +1071,19 @@ static void i915_gem_record_rings(struct drm_device *dev,
> >>  		list_for_each_entry_safe(request, tmpreq, &ring->request_list, list) {
> >>  			struct drm_i915_error_request *erq;
> >>
> >>+			if (WARN_ON(!request || count >= error->ring[i].num_requests)) {
> >
> >Request cannot be null, count can legitmately be more, the WARN on is
> >inappropriate. Again, I sent several patches over the past couple of
> >years to fix this.
> >-Chris
> >
> 
> Ok, so having the driver request list change grow in between the point where
> we allocate and set up the error state request list to the same size as the
> driver request list (since that's what count being larger than the list size
> implies) is legitimate? Traversing into unallocated memory seems pretty
> dodgy to me but if you say so.

We still need to handle it ofc, but just not WARN on this condition since
it can happen.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 3/8] drm/i915: Cope with request list state change during error state capture
  2015-10-09 11:45     ` Tomas Elf
@ 2015-10-13 11:40       ` Daniel Vetter
  0 siblings, 0 replies; 68+ messages in thread
From: Daniel Vetter @ 2015-10-13 11:40 UTC (permalink / raw)
  To: Tomas Elf; +Cc: Intel-GFX

On Fri, Oct 09, 2015 at 12:45:07PM +0100, Tomas Elf wrote:
> On 09/10/2015 09:28, Daniel Vetter wrote:
> >On Thu, Oct 08, 2015 at 07:31:35PM +0100, Tomas Elf wrote:
> >>Since we're not synchronizing the ring request list during error state capture
> >>the request list state might change between the time the corresponding error
> >>request list was allocated and dimensioned to the time when the ring request
> >>list is actually captured into the error state. If this happens, throw a
> >>WARNING and do early exit and be aware that the captured error state might not
> >>be fully reliable.
> >
> >Please don't throw a WARNING since this is expected to occasionally
> >happen. DRM_DEBUG_DRIVER is enough imo.
> >-Daniel
> 
> I still don't see how it could happen without leading to reads of
> unallocated memory. The error state request list has been allocated to a
> certain size equal to num_requests and this loop seems to assume that the
> error state request list maintains the same size as the driver request list,
> which is not the case - leading to crashes, which is how I happened to
> notice it.
> 
> I can obviously remove the warning but are you saying we shouldn't even take
> action if it happens? Such as early exit?

Just remove the WARNING since this is a perfectly expected outcome when
error state capture races against request adding. If you want put a
DRM_DEBUG or similar in there, but WARN_ON for stuff that we expect might
happen isn't good - i915.ko is too noisy already as-is.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 5/8] drm/i915: vma NULL pointer check
  2015-10-09 11:59       ` Chris Wilson
@ 2015-10-13 11:43         ` Daniel Vetter
  0 siblings, 0 replies; 68+ messages in thread
From: Daniel Vetter @ 2015-10-13 11:43 UTC (permalink / raw)
  To: Chris Wilson, Tomas Elf, Intel-GFX

On Fri, Oct 09, 2015 at 12:59:44PM +0100, Chris Wilson wrote:
> On Fri, Oct 09, 2015 at 12:30:29PM +0100, Tomas Elf wrote:
> > On 09/10/2015 08:48, Chris Wilson wrote:
> > >On Thu, Oct 08, 2015 at 07:31:37PM +0100, Tomas Elf wrote:
> > >>Sometimes the iterated vma objects are NULL apparently. Be aware of this while
> > >>iterating and do early exit instead of causing a NULL pointer exception.
> > >
> > >Wrong.
> > >-Chris
> > >
> > 
> > So the NULL pointer exception that I ran into multiple times during
> > several different test runs on the line that says "vma->pin_count"
> > was not because the vma pointer was NULL. Would you mind sharing
> > your explanation to how this might have happened? Is it because
> > we're not synchronizing and there's no protection against the driver
> > deallocating vmas while we're trying to capture them? If this all
> > ties into the aforementioned RCU-based solution then maybe we should
> > go with that then.
> 
> Correct. The driver is retiring requests whilst the hang check worker is
> running. And you chased a stale pointer, you could have equally chased a
> stale vma->obj, vma->ctx etc pointers.
> 
> What I have done in the past is to serialise the retirement and the
> hangcheck using a spinlock (as adding to the end of the list is safe),
> but there are still weak spots when walking the list of bound vma.
> What I would actually feel more comfortable with is to only record the
> request->batch_vma, at the cost of losing information about the
> currently bound buffers.
> 
> Or we could just stop_machine() whilst running the capture and have the
> machine unresponsive for a few 100ms. I don't think simply RCU the lists
> is enough (VM active_list, request list, bound list) as eventually we
> chase a pointer from obj itself (which means we need to RCU pretty much
> everything).

I discussed stop_machine with Chris a bit - it does avoid some of the
races, but not all of them (since the lists might be in wonky state right
when we stop_machine). And yes rcu means we need to rcu-free any object
that the error state looks at.

What we could do is protect list consistency with rcu (where possible, see
my other mail about possible problems). And use stop_machine to make sure
the underlying objects don't go poof (instead of kfree_rcu).

Not sure where the gap in that idea is yet ;-)
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 7/8] drm/i915: Grab execlist spinlock to avoid post-reset concurrency issues.
  2015-10-13 11:46       ` Daniel Vetter
@ 2015-10-13 11:45         ` Chris Wilson
  2015-10-13 13:46           ` Daniel Vetter
  0 siblings, 1 reply; 68+ messages in thread
From: Chris Wilson @ 2015-10-13 11:45 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Intel-GFX

On Tue, Oct 13, 2015 at 01:46:38PM +0200, Daniel Vetter wrote:
> On Fri, Oct 09, 2015 at 09:45:16AM +0100, Chris Wilson wrote:
> > On Fri, Oct 09, 2015 at 10:38:18AM +0200, Daniel Vetter wrote:
> > > On Thu, Oct 08, 2015 at 07:31:39PM +0100, Tomas Elf wrote:
> > > > Grab execlist lock when cleaning up execlist queues after GPU reset to avoid
> > > > concurrency problems between the context event interrupt handler and the reset
> > > > path immediately following a GPU reset.
> > > > 
> > > > Signed-off-by: Tomas Elf <tomas.elf@intel.com>
> > > 
> > > Should we instead just stop any irqs from the GT completely while we do
> > > the reset (plus a synchronize_irq)? It's a bit heavy-weight, but probably
> > > safer. Well not the entire GT, but all the rings under reset (as prep for
> > > per-ring reset).
> > 
> > Bah, stopping IRQs is not enough for error state capture though since
> > requests complete asynchronously just by polling a memory address. (If
> > that is the goal here, this patch just makes execlist_queue access
> > consistent and should only be run once the GPU has been reset and so is
> > categorically idle.)
> 
> This is the execlist ELSP tracking, which is execlusively driven by the
> CTX_SWITCH interrupt signal from each engine.
> 
> At least that's been my assumption, and under that assumption I think
> stalling interrupts should be good enough.

No, because the requests and vma are not coupled to the interrupt in
terms of when they can disappear.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 7/8] drm/i915: Grab execlist spinlock to avoid post-reset concurrency issues.
  2015-10-09  8:45     ` Chris Wilson
@ 2015-10-13 11:46       ` Daniel Vetter
  2015-10-13 11:45         ` Chris Wilson
  0 siblings, 1 reply; 68+ messages in thread
From: Daniel Vetter @ 2015-10-13 11:46 UTC (permalink / raw)
  To: Chris Wilson, Daniel Vetter, Tomas Elf, Intel-GFX

On Fri, Oct 09, 2015 at 09:45:16AM +0100, Chris Wilson wrote:
> On Fri, Oct 09, 2015 at 10:38:18AM +0200, Daniel Vetter wrote:
> > On Thu, Oct 08, 2015 at 07:31:39PM +0100, Tomas Elf wrote:
> > > Grab execlist lock when cleaning up execlist queues after GPU reset to avoid
> > > concurrency problems between the context event interrupt handler and the reset
> > > path immediately following a GPU reset.
> > > 
> > > Signed-off-by: Tomas Elf <tomas.elf@intel.com>
> > 
> > Should we instead just stop any irqs from the GT completely while we do
> > the reset (plus a synchronize_irq)? It's a bit heavy-weight, but probably
> > safer. Well not the entire GT, but all the rings under reset (as prep for
> > per-ring reset).
> 
> Bah, stopping IRQs is not enough for error state capture though since
> requests complete asynchronously just by polling a memory address. (If
> that is the goal here, this patch just makes execlist_queue access
> consistent and should only be run once the GPU has been reset and so is
> categorically idle.)

This is the execlist ELSP tracking, which is execlusively driven by the
CTX_SWITCH interrupt signal from each engine.

At least that's been my assumption, and under that assumption I think
stalling interrupts should be good enough.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 2/8] drm/i915: Migrate to safe iterators in error state capture
  2015-10-13 11:37       ` Daniel Vetter
@ 2015-10-13 11:47         ` Chris Wilson
  0 siblings, 0 replies; 68+ messages in thread
From: Chris Wilson @ 2015-10-13 11:47 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Intel-GFX

On Tue, Oct 13, 2015 at 01:37:32PM +0200, Daniel Vetter wrote:
> On Fri, Oct 09, 2015 at 12:40:51PM +0100, Tomas Elf wrote:
> > On 09/10/2015 09:27, Daniel Vetter wrote:
> > >On Thu, Oct 08, 2015 at 07:31:34PM +0100, Tomas Elf wrote:
> > >>Using safe list iterators alleviates the problem of unsynchronized driver list
> > >>manipulations while error state capture is in the process of traversing lists.
> > >>
> > >>Signed-off-by: Tomas Elf <tomas.elf@intel.com>
> > >>---
> > >>  drivers/gpu/drm/i915/i915_gpu_error.c | 38 +++++++++++++++++------------------
> > >>  1 file changed, 19 insertions(+), 19 deletions(-)
> > >>
> > >>diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
> > >>index 2f04e4f..32c1799 100644
> > >>--- a/drivers/gpu/drm/i915/i915_gpu_error.c
> > >>+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
> > >>@@ -718,10 +718,10 @@ static void capture_bo(struct drm_i915_error_buffer *err,
> > >>  static u32 capture_active_bo(struct drm_i915_error_buffer *err,
> > >>  			     int count, struct list_head *head)
> > >>  {
> > >>-	struct i915_vma *vma;
> > >>+	struct i915_vma *vma, *tmpvma;
> > >>  	int i = 0;
> > >>
> > >>-	list_for_each_entry(vma, head, mm_list) {
> > >>+	list_for_each_entry_safe(vma, tmpvma, head, mm_list) {
> > >
> > >_safe isn't safe against anything, it's only safe against removal of the
> > >current list item done with a list_del/list_move from this thread. I.e.
> > >the only thing it does is have a temporary pointer to the next list
> > >element so if you delete the current one it won't fall over.
> > >
> > >It doesn't protect against anything else, and especially not against other
> > >threads totally stomping the list.
> > 
> > Absolutely! But it's good enough to make the test not fall over within an
> > hour of testing and instead allow it to run for 12+ hours during continuous
> > long-duration testing, which is what I need for the TDR validation.
> 
> Well it looks really suspicious, since the only thing this protects
> against is against deleting the element we look at right now. The only
> sensible theory I can come up with is that this slightly helps when
> starting the list walk, in case someone comes around and deletes the first
> element in the list from the retire worker. Since the retire worker tends
> to do more work per list item than we do that's enough to make the race
> really unlikely. But it's still just duct-tape.
> 
> > >So we need proper protection for these lists, independent of
> > >dev->struct_mutex. The least invasive way to do that is probably rcu,
> > >which only requires us that we release the memory for any object sitting
> > >on these lists with kfree_rcu instead of normal kfree. Another option
> > >might be a spinlock, but that would be a lot more invasive, and Chris
> > >won't like the execbuf overhead this causes ;-)
> > >-Daniel
> > 
> > Awesome! Let's go with that then.
> 
> See our massive irc discussion ... it's more tricky :(
> 
> In short rcu works perfectly if you only have the following lifetime
> sequence:
> - kmalloc object
> - add to list
> - remove from list (eventually)
> - free object
> 
> If you readd an object to a list, or even worse, move it then the rcu list
> helpers won't work. What could happen on the read side is:
> 
> - you walk the rcu list, keeping track of the head element to know when to
>   stop
> - while looking at that list one of the yet untraversed items gets moved
>   to a new list
> 
> -> you'll traverse the new list forever since that one will never hit the
> head element.
> 
> Not a problem for requests/vmas, but a problem for some of the object lists.
> 
> Note that the problem isn't that we re-add the element (if that happens we
> might end up walking parts of the list twice or parts of it not at all),
> but moving to a different list where there's a different head element.
> 
> I haven't checked, but maybe we're lucky and all the object lists we're
> looking at will always have the same head element. Then I think it'll work
> out (albeit racy, but who cares about that for the error capture) and with
> the guarantee (when using rcu delayed freeing) that'll we'll never chase
> freed memory.

Ah, so you agree that stop_machine() is good enough.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 8/8] drm/i915: NULL check of unpin_work
  2015-10-09 12:06         ` Tomas Elf
@ 2015-10-13 11:51           ` Daniel Vetter
  0 siblings, 0 replies; 68+ messages in thread
From: Daniel Vetter @ 2015-10-13 11:51 UTC (permalink / raw)
  To: Tomas Elf; +Cc: Intel-GFX

On Fri, Oct 09, 2015 at 01:06:24PM +0100, Tomas Elf wrote:
> On 09/10/2015 11:44, Chris Wilson wrote:
> >On Fri, Oct 09, 2015 at 11:30:34AM +0100, Tomas Elf wrote:
> >>On 09/10/2015 08:46, Chris Wilson wrote:
> >>>On Thu, Oct 08, 2015 at 07:31:40PM +0100, Tomas Elf wrote:
> >>>>Avoid NULL pointer exceptions in the display driver for certain critical cases
> >>>>when unpin_work has turned out to be NULL.
> >>>
> >>>Nope, the machine reached a point where it cannot possibly reach, we
> >>>want the OOPS.
> >>>-Chris
> >>>
> >>
> >>Really? Because if I have this in there I can actually run the
> >>long-duration stability tests for 12+ hours rather than just a few
> >>hours (it doesn't happen all the time but I've seen it at least
> >>once). But, hey, if you want to investigate the reason why we have a
> >>NULL there then go ahead. I guess I'll just have to carry this patch
> >>for as long as my machine crashes during my TDR testing.
> >
> >You've sat on a critical bug for how long? There's been one recent
> >report that appears to be fallout from Tvrtko's context cleanup, but
> >nothing older, in public at least.
> >-Chris
> >
> 
> Do people typically try to actively hang their machines several times a
> minute for a 12+ hours at a time? If not then maybe that's why they haven't
> seen this.

The problem is that the world loves to conspire against races like these,
and somewhere out there is a machine which hits this ridiculously
reliably. And if this is a machine sitting on an OEM's product engineer
bench we've just lost a sale ;-)

Just because you can't repro if fast doesn't mean that your reproduction
rate is the lower limit.

> I haven't sat on anything, this has been part of my error state capture
> improvement patches which I've intended to upstream for several months now.
> I consider this part of the overall stability issue and grouped it together
> with all the other patches necessary to make the machine not crash for the
> duration of my test. I would've upstreamed this series a long time ago if I
> had actually been given time to work on it but that's another discussion.

I guess Chris' reaction is because we've had tons of cases where bugfixes
like this where stuck in android trees because they where tied up with
some feature work. But I also know it's really hard to spot them, so in
casee of doubt please upstream small fixes aggressively.

Since this one here is outside of the error capture itself, and so code
where our existing assumption that error capture only runs when the driver
is dead since seconds is invalid, it's a prime candidate.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 7/8] drm/i915: Grab execlist spinlock to avoid post-reset concurrency issues.
  2015-10-13 11:45         ` Chris Wilson
@ 2015-10-13 13:46           ` Daniel Vetter
  2015-10-13 14:00             ` Chris Wilson
  0 siblings, 1 reply; 68+ messages in thread
From: Daniel Vetter @ 2015-10-13 13:46 UTC (permalink / raw)
  To: Chris Wilson, Daniel Vetter, Tomas Elf, Intel-GFX

On Tue, Oct 13, 2015 at 12:45:58PM +0100, Chris Wilson wrote:
> On Tue, Oct 13, 2015 at 01:46:38PM +0200, Daniel Vetter wrote:
> > On Fri, Oct 09, 2015 at 09:45:16AM +0100, Chris Wilson wrote:
> > > On Fri, Oct 09, 2015 at 10:38:18AM +0200, Daniel Vetter wrote:
> > > > On Thu, Oct 08, 2015 at 07:31:39PM +0100, Tomas Elf wrote:
> > > > > Grab execlist lock when cleaning up execlist queues after GPU reset to avoid
> > > > > concurrency problems between the context event interrupt handler and the reset
> > > > > path immediately following a GPU reset.
> > > > > 
> > > > > Signed-off-by: Tomas Elf <tomas.elf@intel.com>
> > > > 
> > > > Should we instead just stop any irqs from the GT completely while we do
> > > > the reset (plus a synchronize_irq)? It's a bit heavy-weight, but probably
> > > > safer. Well not the entire GT, but all the rings under reset (as prep for
> > > > per-ring reset).
> > > 
> > > Bah, stopping IRQs is not enough for error state capture though since
> > > requests complete asynchronously just by polling a memory address. (If
> > > that is the goal here, this patch just makes execlist_queue access
> > > consistent and should only be run once the GPU has been reset and so is
> > > categorically idle.)
> > 
> > This is the execlist ELSP tracking, which is execlusively driven by the
> > CTX_SWITCH interrupt signal from each engine.
> > 
> > At least that's been my assumption, and under that assumption I think
> > stalling interrupts should be good enough.
> 
> No, because the requests and vma are not coupled to the interrupt in
> terms of when they can disappear.

At least today execlist keeps its own reference on requests until the
CTX_SWITCH stuff is done to exactly make sure this is the case. And even
when we have that fixed up I think we do need to exclude this processing
somehow, and the irqsave spinlock seems ok for that. disabling the
interrupt itself plus synchronize_irq was really just an idea.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 7/8] drm/i915: Grab execlist spinlock to avoid post-reset concurrency issues.
  2015-10-13 13:46           ` Daniel Vetter
@ 2015-10-13 14:00             ` Chris Wilson
  0 siblings, 0 replies; 68+ messages in thread
From: Chris Wilson @ 2015-10-13 14:00 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Intel-GFX

On Tue, Oct 13, 2015 at 03:46:00PM +0200, Daniel Vetter wrote:
> On Tue, Oct 13, 2015 at 12:45:58PM +0100, Chris Wilson wrote:
> > On Tue, Oct 13, 2015 at 01:46:38PM +0200, Daniel Vetter wrote:
> > > On Fri, Oct 09, 2015 at 09:45:16AM +0100, Chris Wilson wrote:
> > > > On Fri, Oct 09, 2015 at 10:38:18AM +0200, Daniel Vetter wrote:
> > > > > On Thu, Oct 08, 2015 at 07:31:39PM +0100, Tomas Elf wrote:
> > > > > > Grab execlist lock when cleaning up execlist queues after GPU reset to avoid
> > > > > > concurrency problems between the context event interrupt handler and the reset
> > > > > > path immediately following a GPU reset.
> > > > > > 
> > > > > > Signed-off-by: Tomas Elf <tomas.elf@intel.com>
> > > > > 
> > > > > Should we instead just stop any irqs from the GT completely while we do
> > > > > the reset (plus a synchronize_irq)? It's a bit heavy-weight, but probably
> > > > > safer. Well not the entire GT, but all the rings under reset (as prep for
> > > > > per-ring reset).
> > > > 
> > > > Bah, stopping IRQs is not enough for error state capture though since
> > > > requests complete asynchronously just by polling a memory address. (If
> > > > that is the goal here, this patch just makes execlist_queue access
> > > > consistent and should only be run once the GPU has been reset and so is
> > > > categorically idle.)
> > > 
> > > This is the execlist ELSP tracking, which is execlusively driven by the
> > > CTX_SWITCH interrupt signal from each engine.
> > > 
> > > At least that's been my assumption, and under that assumption I think
> > > stalling interrupts should be good enough.
> > 
> > No, because the requests and vma are not coupled to the interrupt in
> > terms of when they can disappear.
> 
> At least today execlist keeps its own reference on requests until the
> CTX_SWITCH stuff is done to exactly make sure this is the case. And even
> when we have that fixed up I think we do need to exclude this processing
> somehow, and the irqsave spinlock seems ok for that. disabling the
> interrupt itself plus synchronize_irq was really just an idea.

What I meant was that we have this problem with vma/obj disappearing not
simply on execlists because request completion is asynchronous (we just
check for the GPU's breadcrumb). Execlists is slightly special in that
its requests can't disappear, but they can elsewhere.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 3/8] drm/i915: Cope with request list state change during error state capture
  2015-10-13 11:39       ` Daniel Vetter
@ 2015-10-14 11:46         ` Tomas Elf
  2015-10-14 12:45           ` Daniel Vetter
  0 siblings, 1 reply; 68+ messages in thread
From: Tomas Elf @ 2015-10-14 11:46 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Intel-GFX

On 13/10/2015 12:39, Daniel Vetter wrote:
> On Fri, Oct 09, 2015 at 12:25:26PM +0100, Tomas Elf wrote:
>> On 09/10/2015 08:48, Chris Wilson wrote:
>>> On Thu, Oct 08, 2015 at 07:31:35PM +0100, Tomas Elf wrote:
>>>> Since we're not synchronizing the ring request list during error state capture
>>>> the request list state might change between the time the corresponding error
>>>> request list was allocated and dimensioned to the time when the ring request
>>>> list is actually captured into the error state. If this happens, throw a
>>>> WARNING and do early exit and be aware that the captured error state might not
>>>> be fully reliable.
>>>>
>>>> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
>>>> ---
>>>>   drivers/gpu/drm/i915/i915_gpu_error.c | 13 +++++++++++++
>>>>   1 file changed, 13 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
>>>> index 32c1799..cc75ca4 100644
>>>> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
>>>> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
>>>> @@ -1071,6 +1071,19 @@ static void i915_gem_record_rings(struct drm_device *dev,
>>>>   		list_for_each_entry_safe(request, tmpreq, &ring->request_list, list) {
>>>>   			struct drm_i915_error_request *erq;
>>>>
>>>> +			if (WARN_ON(!request || count >= error->ring[i].num_requests)) {
>>>
>>> Request cannot be null, count can legitmately be more, the WARN on is
>>> inappropriate. Again, I sent several patches over the past couple of
>>> years to fix this.
>>> -Chris
>>>
>>
>> Ok, so having the driver request list change grow in between the point where
>> we allocate and set up the error state request list to the same size as the
>> driver request list (since that's what count being larger than the list size
>> implies) is legitimate? Traversing into unallocated memory seems pretty
>> dodgy to me but if you say so.
>
> We still need to handle it ofc, but just not WARN on this condition since
> it can happen.
> -Daniel
>

With the RCU discussion ongoing I guess we should we just drop this 
patch? I agree that what I've been seeing looks like a side-effect of 
concurrent memory deallocation. Whatever solution you reach should make 
this patch pointless.

Thanks,
Tomas
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 3/8] drm/i915: Cope with request list state change during error state capture
  2015-10-14 11:46         ` Tomas Elf
@ 2015-10-14 12:45           ` Daniel Vetter
  0 siblings, 0 replies; 68+ messages in thread
From: Daniel Vetter @ 2015-10-14 12:45 UTC (permalink / raw)
  To: Tomas Elf; +Cc: Intel-GFX

On Wed, Oct 14, 2015 at 12:46:51PM +0100, Tomas Elf wrote:
> On 13/10/2015 12:39, Daniel Vetter wrote:
> >On Fri, Oct 09, 2015 at 12:25:26PM +0100, Tomas Elf wrote:
> >>On 09/10/2015 08:48, Chris Wilson wrote:
> >>>On Thu, Oct 08, 2015 at 07:31:35PM +0100, Tomas Elf wrote:
> >>>>Since we're not synchronizing the ring request list during error state capture
> >>>>the request list state might change between the time the corresponding error
> >>>>request list was allocated and dimensioned to the time when the ring request
> >>>>list is actually captured into the error state. If this happens, throw a
> >>>>WARNING and do early exit and be aware that the captured error state might not
> >>>>be fully reliable.
> >>>>
> >>>>Signed-off-by: Tomas Elf <tomas.elf@intel.com>
> >>>>---
> >>>>  drivers/gpu/drm/i915/i915_gpu_error.c | 13 +++++++++++++
> >>>>  1 file changed, 13 insertions(+)
> >>>>
> >>>>diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
> >>>>index 32c1799..cc75ca4 100644
> >>>>--- a/drivers/gpu/drm/i915/i915_gpu_error.c
> >>>>+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
> >>>>@@ -1071,6 +1071,19 @@ static void i915_gem_record_rings(struct drm_device *dev,
> >>>>  		list_for_each_entry_safe(request, tmpreq, &ring->request_list, list) {
> >>>>  			struct drm_i915_error_request *erq;
> >>>>
> >>>>+			if (WARN_ON(!request || count >= error->ring[i].num_requests)) {
> >>>
> >>>Request cannot be null, count can legitmately be more, the WARN on is
> >>>inappropriate. Again, I sent several patches over the past couple of
> >>>years to fix this.
> >>>-Chris
> >>>
> >>
> >>Ok, so having the driver request list change grow in between the point where
> >>we allocate and set up the error state request list to the same size as the
> >>driver request list (since that's what count being larger than the list size
> >>implies) is legitimate? Traversing into unallocated memory seems pretty
> >>dodgy to me but if you say so.
> >
> >We still need to handle it ofc, but just not WARN on this condition since
> >it can happen.
> >-Daniel
> >
> 
> With the RCU discussion ongoing I guess we should we just drop this patch? I
> agree that what I've been seeing looks like a side-effect of concurrent
> memory deallocation. Whatever solution you reach should make this patch
> pointless.

Depending upon the solution we pick we might still need it, since if we
only make sure we don't chase freed memory but still allow the list itself
to get modified (which would be the case with kfree_rcu), then you need
it.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 68+ messages in thread

* [PATCH v2 7/8] drm/i915: Grab execlist spinlock to avoid post-reset concurrency issues.
  2015-10-08 18:31 ` [PATCH 7/8] drm/i915: Grab execlist spinlock to avoid post-reset concurrency issues Tomas Elf
  2015-10-09  7:45   ` Chris Wilson
  2015-10-09  8:38   ` Daniel Vetter
@ 2015-10-19 15:32   ` Tomas Elf
  2015-10-22 16:49     ` Dave Gordon
                       ` (3 more replies)
  2 siblings, 4 replies; 68+ messages in thread
From: Tomas Elf @ 2015-10-19 15:32 UTC (permalink / raw)
  To: Intel-GFX

Grab execlist lock when cleaning up execlist queues after GPU reset to avoid
concurrency problems between the context event interrupt handler and the reset
path immediately following a GPU reset.

* v2 (Chris Wilson):
Do execlist check and use simpler form of spinlock functions.

Signed-off-by: Tomas Elf <tomas.elf@intel.com>
---
 drivers/gpu/drm/i915/i915_gem.c | 23 ++++++++++++++---------
 1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index e57061a..2c7a0b7 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2753,18 +2753,23 @@ static void i915_gem_reset_ring_cleanup(struct drm_i915_private *dev_priv,
 	 * are the ones that keep the context and ringbuffer backing objects
 	 * pinned in place.
 	 */
-	while (!list_empty(&ring->execlist_queue)) {
-		struct drm_i915_gem_request *submit_req;
 
-		submit_req = list_first_entry(&ring->execlist_queue,
-				struct drm_i915_gem_request,
-				execlist_link);
-		list_del(&submit_req->execlist_link);
+	if (i915.enable_execlists) {
+		spin_lock_irq(&ring->execlist_lock);
+		while (!list_empty(&ring->execlist_queue)) {
+			struct drm_i915_gem_request *submit_req;
 
-		if (submit_req->ctx != ring->default_context)
-			intel_lr_context_unpin(submit_req);
+			submit_req = list_first_entry(&ring->execlist_queue,
+					struct drm_i915_gem_request,
+					execlist_link);
+			list_del(&submit_req->execlist_link);
 
-		i915_gem_request_unreference(submit_req);
+			if (submit_req->ctx != ring->default_context)
+				intel_lr_context_unpin(submit_req);
+
+			i915_gem_request_unreference(submit_req);
+		}
+		spin_unlock_irq(&ring->execlist_lock);
 	}
 
 	/*
-- 
1.9.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 7/8] drm/i915: Grab execlist spinlock to avoid post-reset concurrency issues.
  2015-10-19 15:32   ` [PATCH v2 " Tomas Elf
@ 2015-10-22 16:49     ` Dave Gordon
  2015-10-22 17:35       ` Daniel Vetter
  2015-10-23  8:42     ` Tvrtko Ursulin
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 68+ messages in thread
From: Dave Gordon @ 2015-10-22 16:49 UTC (permalink / raw)
  To: Tomas Elf, Intel-GFX

On 19/10/15 16:32, Tomas Elf wrote:
> Grab execlist lock when cleaning up execlist queues after GPU reset to avoid
> concurrency problems between the context event interrupt handler and the reset
> path immediately following a GPU reset.
>
> * v2 (Chris Wilson):
> Do execlist check and use simpler form of spinlock functions.
>
> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
> ---
>   drivers/gpu/drm/i915/i915_gem.c | 23 ++++++++++++++---------
>   1 file changed, 14 insertions(+), 9 deletions(-)

LGTM.

Reviewed-by: Dave Gordon <david.s.gordon@intel.com>

> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index e57061a..2c7a0b7 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -2753,18 +2753,23 @@ static void i915_gem_reset_ring_cleanup(struct drm_i915_private *dev_priv,
>   	 * are the ones that keep the context and ringbuffer backing objects
>   	 * pinned in place.
>   	 */
> -	while (!list_empty(&ring->execlist_queue)) {
> -		struct drm_i915_gem_request *submit_req;
>
> -		submit_req = list_first_entry(&ring->execlist_queue,
> -				struct drm_i915_gem_request,
> -				execlist_link);
> -		list_del(&submit_req->execlist_link);
> +	if (i915.enable_execlists) {
> +		spin_lock_irq(&ring->execlist_lock);
> +		while (!list_empty(&ring->execlist_queue)) {
> +			struct drm_i915_gem_request *submit_req;
>
> -		if (submit_req->ctx != ring->default_context)
> -			intel_lr_context_unpin(submit_req);
> +			submit_req = list_first_entry(&ring->execlist_queue,
> +					struct drm_i915_gem_request,
> +					execlist_link);
> +			list_del(&submit_req->execlist_link);
>
> -		i915_gem_request_unreference(submit_req);
> +			if (submit_req->ctx != ring->default_context)
> +				intel_lr_context_unpin(submit_req);
> +
> +			i915_gem_request_unreference(submit_req);
> +		}
> +		spin_unlock_irq(&ring->execlist_lock);
>   	}
>
>   	/*
>

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 7/8] drm/i915: Grab execlist spinlock to avoid post-reset concurrency issues.
  2015-10-22 16:49     ` Dave Gordon
@ 2015-10-22 17:35       ` Daniel Vetter
  0 siblings, 0 replies; 68+ messages in thread
From: Daniel Vetter @ 2015-10-22 17:35 UTC (permalink / raw)
  To: Dave Gordon; +Cc: Intel-GFX

On Thu, Oct 22, 2015 at 05:49:56PM +0100, Dave Gordon wrote:
> On 19/10/15 16:32, Tomas Elf wrote:
> >Grab execlist lock when cleaning up execlist queues after GPU reset to avoid
> >concurrency problems between the context event interrupt handler and the reset
> >path immediately following a GPU reset.
> >
> >* v2 (Chris Wilson):
> >Do execlist check and use simpler form of spinlock functions.
> >
> >Signed-off-by: Tomas Elf <tomas.elf@intel.com>
> >---
> >  drivers/gpu/drm/i915/i915_gem.c | 23 ++++++++++++++---------
> >  1 file changed, 14 insertions(+), 9 deletions(-)
> 
> LGTM.
> 
> Reviewed-by: Dave Gordon <david.s.gordon@intel.com>

Queued for -next, thanks for the patch.
-Daniel

> 
> >diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> >index e57061a..2c7a0b7 100644
> >--- a/drivers/gpu/drm/i915/i915_gem.c
> >+++ b/drivers/gpu/drm/i915/i915_gem.c
> >@@ -2753,18 +2753,23 @@ static void i915_gem_reset_ring_cleanup(struct drm_i915_private *dev_priv,
> >  	 * are the ones that keep the context and ringbuffer backing objects
> >  	 * pinned in place.
> >  	 */
> >-	while (!list_empty(&ring->execlist_queue)) {
> >-		struct drm_i915_gem_request *submit_req;
> >
> >-		submit_req = list_first_entry(&ring->execlist_queue,
> >-				struct drm_i915_gem_request,
> >-				execlist_link);
> >-		list_del(&submit_req->execlist_link);
> >+	if (i915.enable_execlists) {
> >+		spin_lock_irq(&ring->execlist_lock);
> >+		while (!list_empty(&ring->execlist_queue)) {
> >+			struct drm_i915_gem_request *submit_req;
> >
> >-		if (submit_req->ctx != ring->default_context)
> >-			intel_lr_context_unpin(submit_req);
> >+			submit_req = list_first_entry(&ring->execlist_queue,
> >+					struct drm_i915_gem_request,
> >+					execlist_link);
> >+			list_del(&submit_req->execlist_link);
> >
> >-		i915_gem_request_unreference(submit_req);
> >+			if (submit_req->ctx != ring->default_context)
> >+				intel_lr_context_unpin(submit_req);
> >+
> >+			i915_gem_request_unreference(submit_req);
> >+		}
> >+		spin_unlock_irq(&ring->execlist_lock);
> >  	}
> >
> >  	/*
> >
> 
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/intel-gfx

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 7/8] drm/i915: Grab execlist spinlock to avoid post-reset concurrency issues.
  2015-10-19 15:32   ` [PATCH v2 " Tomas Elf
  2015-10-22 16:49     ` Dave Gordon
@ 2015-10-23  8:42     ` Tvrtko Ursulin
  2015-10-23  8:59       ` Daniel Vetter
  2015-10-23 13:08     ` [PATCH v3 " Tomas Elf
  2015-10-23 17:02     ` [PATCH] drm/i915: Update to post-reset execlist queue clean-up Tomas Elf
  3 siblings, 1 reply; 68+ messages in thread
From: Tvrtko Ursulin @ 2015-10-23  8:42 UTC (permalink / raw)
  To: Tomas Elf, Intel-GFX


On 19/10/15 16:32, Tomas Elf wrote:
> Grab execlist lock when cleaning up execlist queues after GPU reset to avoid
> concurrency problems between the context event interrupt handler and the reset
> path immediately following a GPU reset.
>
> * v2 (Chris Wilson):
> Do execlist check and use simpler form of spinlock functions.
>
> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
> ---
>   drivers/gpu/drm/i915/i915_gem.c | 23 ++++++++++++++---------
>   1 file changed, 14 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index e57061a..2c7a0b7 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -2753,18 +2753,23 @@ static void i915_gem_reset_ring_cleanup(struct drm_i915_private *dev_priv,
>   	 * are the ones that keep the context and ringbuffer backing objects
>   	 * pinned in place.
>   	 */
> -	while (!list_empty(&ring->execlist_queue)) {
> -		struct drm_i915_gem_request *submit_req;
>
> -		submit_req = list_first_entry(&ring->execlist_queue,
> -				struct drm_i915_gem_request,
> -				execlist_link);
> -		list_del(&submit_req->execlist_link);
> +	if (i915.enable_execlists) {
> +		spin_lock_irq(&ring->execlist_lock);
> +		while (!list_empty(&ring->execlist_queue)) {
> +			struct drm_i915_gem_request *submit_req;
>
> -		if (submit_req->ctx != ring->default_context)
> -			intel_lr_context_unpin(submit_req);
> +			submit_req = list_first_entry(&ring->execlist_queue,
> +					struct drm_i915_gem_request,
> +					execlist_link);
> +			list_del(&submit_req->execlist_link);
>
> -		i915_gem_request_unreference(submit_req);
> +			if (submit_req->ctx != ring->default_context)
> +				intel_lr_context_unpin(submit_req);
> +
> +			i915_gem_request_unreference(submit_req);
> +		}
> +		spin_unlock_irq(&ring->execlist_lock);

Can't this, in theory at least, end up calling 
drm_gem_object_unreference in a couple of code paths, which can sleep, 
while holding a spinlock?

If this cannot happen in practice for some reason it would probably be 
good to put a comment explaining it.

Or I am missing something?

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 7/8] drm/i915: Grab execlist spinlock to avoid post-reset concurrency issues.
  2015-10-23  8:42     ` Tvrtko Ursulin
@ 2015-10-23  8:59       ` Daniel Vetter
  2015-10-23 11:02         ` Tomas Elf
  0 siblings, 1 reply; 68+ messages in thread
From: Daniel Vetter @ 2015-10-23  8:59 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: Intel-GFX

On Fri, Oct 23, 2015 at 09:42:27AM +0100, Tvrtko Ursulin wrote:
> 
> On 19/10/15 16:32, Tomas Elf wrote:
> >Grab execlist lock when cleaning up execlist queues after GPU reset to avoid
> >concurrency problems between the context event interrupt handler and the reset
> >path immediately following a GPU reset.
> >
> >* v2 (Chris Wilson):
> >Do execlist check and use simpler form of spinlock functions.
> >
> >Signed-off-by: Tomas Elf <tomas.elf@intel.com>
> >---
> >  drivers/gpu/drm/i915/i915_gem.c | 23 ++++++++++++++---------
> >  1 file changed, 14 insertions(+), 9 deletions(-)
> >
> >diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> >index e57061a..2c7a0b7 100644
> >--- a/drivers/gpu/drm/i915/i915_gem.c
> >+++ b/drivers/gpu/drm/i915/i915_gem.c
> >@@ -2753,18 +2753,23 @@ static void i915_gem_reset_ring_cleanup(struct drm_i915_private *dev_priv,
> >  	 * are the ones that keep the context and ringbuffer backing objects
> >  	 * pinned in place.
> >  	 */
> >-	while (!list_empty(&ring->execlist_queue)) {
> >-		struct drm_i915_gem_request *submit_req;
> >
> >-		submit_req = list_first_entry(&ring->execlist_queue,
> >-				struct drm_i915_gem_request,
> >-				execlist_link);
> >-		list_del(&submit_req->execlist_link);
> >+	if (i915.enable_execlists) {
> >+		spin_lock_irq(&ring->execlist_lock);
> >+		while (!list_empty(&ring->execlist_queue)) {
> >+			struct drm_i915_gem_request *submit_req;
> >
> >-		if (submit_req->ctx != ring->default_context)
> >-			intel_lr_context_unpin(submit_req);
> >+			submit_req = list_first_entry(&ring->execlist_queue,
> >+					struct drm_i915_gem_request,
> >+					execlist_link);
> >+			list_del(&submit_req->execlist_link);
> >
> >-		i915_gem_request_unreference(submit_req);
> >+			if (submit_req->ctx != ring->default_context)
> >+				intel_lr_context_unpin(submit_req);
> >+
> >+			i915_gem_request_unreference(submit_req);
> >+		}
> >+		spin_unlock_irq(&ring->execlist_lock);
> 
> Can't this, in theory at least, end up calling drm_gem_object_unreference in
> a couple of code paths, which can sleep, while holding a spinlock?
> 
> If this cannot happen in practice for some reason it would probably be good
> to put a comment explaining it.
> 
> Or I am missing something?

It can indeed I think. Tvrtko, since I'm off to KS next week and you have
push access, can you pls push the revert if this checks out?

Thanks, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 7/8] drm/i915: Grab execlist spinlock to avoid post-reset concurrency issues.
  2015-10-23  8:59       ` Daniel Vetter
@ 2015-10-23 11:02         ` Tomas Elf
  2015-10-23 12:49           ` Dave Gordon
  0 siblings, 1 reply; 68+ messages in thread
From: Tomas Elf @ 2015-10-23 11:02 UTC (permalink / raw)
  To: Daniel Vetter, Tvrtko Ursulin; +Cc: Intel-GFX

On 23/10/2015 09:59, Daniel Vetter wrote:
> On Fri, Oct 23, 2015 at 09:42:27AM +0100, Tvrtko Ursulin wrote:
>>
>> On 19/10/15 16:32, Tomas Elf wrote:
>>> Grab execlist lock when cleaning up execlist queues after GPU reset to avoid
>>> concurrency problems between the context event interrupt handler and the reset
>>> path immediately following a GPU reset.
>>>
>>> * v2 (Chris Wilson):
>>> Do execlist check and use simpler form of spinlock functions.
>>>
>>> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
>>> ---
>>>   drivers/gpu/drm/i915/i915_gem.c | 23 ++++++++++++++---------
>>>   1 file changed, 14 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
>>> index e57061a..2c7a0b7 100644
>>> --- a/drivers/gpu/drm/i915/i915_gem.c
>>> +++ b/drivers/gpu/drm/i915/i915_gem.c
>>> @@ -2753,18 +2753,23 @@ static void i915_gem_reset_ring_cleanup(struct drm_i915_private *dev_priv,
>>>   	 * are the ones that keep the context and ringbuffer backing objects
>>>   	 * pinned in place.
>>>   	 */
>>> -	while (!list_empty(&ring->execlist_queue)) {
>>> -		struct drm_i915_gem_request *submit_req;
>>>
>>> -		submit_req = list_first_entry(&ring->execlist_queue,
>>> -				struct drm_i915_gem_request,
>>> -				execlist_link);
>>> -		list_del(&submit_req->execlist_link);
>>> +	if (i915.enable_execlists) {
>>> +		spin_lock_irq(&ring->execlist_lock);
>>> +		while (!list_empty(&ring->execlist_queue)) {
>>> +			struct drm_i915_gem_request *submit_req;
>>>
>>> -		if (submit_req->ctx != ring->default_context)
>>> -			intel_lr_context_unpin(submit_req);
>>> +			submit_req = list_first_entry(&ring->execlist_queue,
>>> +					struct drm_i915_gem_request,
>>> +					execlist_link);
>>> +			list_del(&submit_req->execlist_link);
>>>
>>> -		i915_gem_request_unreference(submit_req);
>>> +			if (submit_req->ctx != ring->default_context)
>>> +				intel_lr_context_unpin(submit_req);
>>> +
>>> +			i915_gem_request_unreference(submit_req);
>>> +		}
>>> +		spin_unlock_irq(&ring->execlist_lock);
>>
>> Can't this, in theory at least, end up calling drm_gem_object_unreference in
>> a couple of code paths, which can sleep, while holding a spinlock?
>>
>> If this cannot happen in practice for some reason it would probably be good
>> to put a comment explaining it.
>>
>> Or I am missing something?
>
> It can indeed I think. Tvrtko, since I'm off to KS next week and you have
> push access, can you pls push the revert if this checks out?

I've discussed it with Tvrtko and one solution that seems reasonable is 
to hold the execlist lock while traversing the execlist queue and moving 
the elements over to a temporary list. After that we can drop the 
execlist lock, traverse the temporary list and unreference the elements 
while not holding lock. That way we can get rid of the element in a 
synchronized fashion and also unreference without holding the lock. I 
believe this pattern is used elsewhere (such as during request retirement)

Does that sound like a plan or does anyone have a problem with that 
solution?

Thanks,
Tomas


>
> Thanks, Daniel
>


_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 7/8] drm/i915: Grab execlist spinlock to avoid post-reset concurrency issues.
  2015-10-23 11:02         ` Tomas Elf
@ 2015-10-23 12:49           ` Dave Gordon
  0 siblings, 0 replies; 68+ messages in thread
From: Dave Gordon @ 2015-10-23 12:49 UTC (permalink / raw)
  To: Tomas Elf, Daniel Vetter, Tvrtko Ursulin; +Cc: Intel-GFX

On 23/10/15 12:02, Tomas Elf wrote:
> On 23/10/2015 09:59, Daniel Vetter wrote:
>> On Fri, Oct 23, 2015 at 09:42:27AM +0100, Tvrtko Ursulin wrote:
>>>
>>> On 19/10/15 16:32, Tomas Elf wrote:
>>>> Grab execlist lock when cleaning up execlist queues after GPU reset
>>>> to avoid
>>>> concurrency problems between the context event interrupt handler and
>>>> the reset
>>>> path immediately following a GPU reset.
>>>>
>>>> * v2 (Chris Wilson):
>>>> Do execlist check and use simpler form of spinlock functions.
>>>>
>>>> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
>>>> ---
>>>>   drivers/gpu/drm/i915/i915_gem.c | 23 ++++++++++++++---------
>>>>   1 file changed, 14 insertions(+), 9 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/i915/i915_gem.c
>>>> b/drivers/gpu/drm/i915/i915_gem.c
>>>> index e57061a..2c7a0b7 100644
>>>> --- a/drivers/gpu/drm/i915/i915_gem.c
>>>> +++ b/drivers/gpu/drm/i915/i915_gem.c
>>>> @@ -2753,18 +2753,23 @@ static void
>>>> i915_gem_reset_ring_cleanup(struct drm_i915_private *dev_priv,
>>>>        * are the ones that keep the context and ringbuffer backing
>>>> objects
>>>>        * pinned in place.
>>>>        */
>>>> -    while (!list_empty(&ring->execlist_queue)) {
>>>> -        struct drm_i915_gem_request *submit_req;
>>>>
>>>> -        submit_req = list_first_entry(&ring->execlist_queue,
>>>> -                struct drm_i915_gem_request,
>>>> -                execlist_link);
>>>> -        list_del(&submit_req->execlist_link);
>>>> +    if (i915.enable_execlists) {
>>>> +        spin_lock_irq(&ring->execlist_lock);
>>>> +        while (!list_empty(&ring->execlist_queue)) {
>>>> +            struct drm_i915_gem_request *submit_req;
>>>>
>>>> -        if (submit_req->ctx != ring->default_context)
>>>> -            intel_lr_context_unpin(submit_req);
>>>> +            submit_req = list_first_entry(&ring->execlist_queue,
>>>> +                    struct drm_i915_gem_request,
>>>> +                    execlist_link);
>>>> +            list_del(&submit_req->execlist_link);
>>>>
>>>> -        i915_gem_request_unreference(submit_req);
>>>> +            if (submit_req->ctx != ring->default_context)
>>>> +                intel_lr_context_unpin(submit_req);
>>>> +
>>>> +            i915_gem_request_unreference(submit_req);
>>>> +        }
>>>> +        spin_unlock_irq(&ring->execlist_lock);
>>>
>>> Can't this, in theory at least, end up calling
>>> drm_gem_object_unreference in
>>> a couple of code paths, which can sleep, while holding a spinlock?
>>>
>>> If this cannot happen in practice for some reason it would probably
>>> be good
>>> to put a comment explaining it.
>>>
>>> Or I am missing something?
>>
>> It can indeed I think. Tvrtko, since I'm off to KS next week and you have
>> push access, can you pls push the revert if this checks out?
>
> I've discussed it with Tvrtko and one solution that seems reasonable is
> to hold the execlist lock while traversing the execlist queue and moving
> the elements over to a temporary list. After that we can drop the
> execlist lock, traverse the temporary list and unreference the elements
> while not holding lock. That way we can get rid of the element in a
> synchronized fashion and also unreference without holding the lock. I
> believe this pattern is used elsewhere (such as during request retirement)
>
> Does that sound like a plan or does anyone have a problem with that
> solution?
>
> Thanks,
> Tomas

I think the original is safe, because the execlist queue /shouldn't/ be 
the last reference. But that could change, and in any case in a possibly 
broken state such as may have caused the hang we probably shouldn't rely 
on it. So I think the staged-dequeue-and-release is a better solution :)

.Dave.

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 68+ messages in thread

* [PATCH v3 7/8] drm/i915: Grab execlist spinlock to avoid post-reset concurrency issues.
  2015-10-19 15:32   ` [PATCH v2 " Tomas Elf
  2015-10-22 16:49     ` Dave Gordon
  2015-10-23  8:42     ` Tvrtko Ursulin
@ 2015-10-23 13:08     ` Tomas Elf
  2015-10-23 14:53       ` Daniel, Thomas
  2015-10-23 17:02     ` [PATCH] drm/i915: Update to post-reset execlist queue clean-up Tomas Elf
  3 siblings, 1 reply; 68+ messages in thread
From: Tomas Elf @ 2015-10-23 13:08 UTC (permalink / raw)
  To: Intel-GFX

Grab execlist lock when cleaning up execlist queues after GPU reset to avoid
concurrency problems between the context event interrupt handler and the reset
path immediately following a GPU reset.

* v2 (Chris Wilson):
Do execlist check and use simpler form of spinlock functions.

* v3 (Tvrtko Ursulin):
Don't hold the execlist_lock while dereferencing the requests since that might
lead to sleep while IRQs are disabled.

Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>

---
 drivers/gpu/drm/i915/i915_gem.c | 16 +++++++---------
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index e57061a..b492603 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2753,18 +2753,16 @@ static void i915_gem_reset_ring_cleanup(struct drm_i915_private *dev_priv,
 	 * are the ones that keep the context and ringbuffer backing objects
 	 * pinned in place.
 	 */
-	while (!list_empty(&ring->execlist_queue)) {
-		struct drm_i915_gem_request *submit_req;
 
-		submit_req = list_first_entry(&ring->execlist_queue,
-				struct drm_i915_gem_request,
-				execlist_link);
-		list_del(&submit_req->execlist_link);
+	if (i915.enable_execlists) {
+		spin_lock_irq(&ring->execlist_lock);
 
-		if (submit_req->ctx != ring->default_context)
-			intel_lr_context_unpin(submit_req);
+		/* list_splice_tail_init checks for empty lists */
+		list_splice_tail_init(&ring->execlist_queue,
+				      &ring->execlist_retired_req_list);
 
-		i915_gem_request_unreference(submit_req);
+		spin_unlock_irq(&ring->execlist_lock);
+		intel_execlists_retire_requests(ring);
 	}
 
 	/*
-- 
1.9.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* Re: [PATCH v3 7/8] drm/i915: Grab execlist spinlock to avoid post-reset concurrency issues.
  2015-10-23 13:08     ` [PATCH v3 " Tomas Elf
@ 2015-10-23 14:53       ` Daniel, Thomas
  0 siblings, 0 replies; 68+ messages in thread
From: Daniel, Thomas @ 2015-10-23 14:53 UTC (permalink / raw)
  To: Elf, Tomas, Intel-GFX

> -----Original Message-----
> From: Intel-gfx [mailto:intel-gfx-bounces@lists.freedesktop.org] On Behalf Of
> Tomas Elf
> Sent: Friday, October 23, 2015 2:09 PM
> To: Intel-GFX@Lists.FreeDesktop.Org
> Subject: [Intel-gfx] [PATCH v3 7/8] drm/i915: Grab execlist spinlock to avoid
> post-reset concurrency issues.
> 
> Grab execlist lock when cleaning up execlist queues after GPU reset to avoid
> concurrency problems between the context event interrupt handler and the
> reset
> path immediately following a GPU reset.
> 
> * v2 (Chris Wilson):
> Do execlist check and use simpler form of spinlock functions.
> 
> * v3 (Tvrtko Ursulin):
> Don't hold the execlist_lock while dereferencing the requests since that might
> lead to sleep while IRQs are disabled.
> 
> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> 
> ---
>  drivers/gpu/drm/i915/i915_gem.c | 16 +++++++---------
>  1 file changed, 7 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem.c
> b/drivers/gpu/drm/i915/i915_gem.c
> index e57061a..b492603 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -2753,18 +2753,16 @@ static void i915_gem_reset_ring_cleanup(struct
> drm_i915_private *dev_priv,
>  	 * are the ones that keep the context and ringbuffer backing objects
>  	 * pinned in place.
>  	 */
> -	while (!list_empty(&ring->execlist_queue)) {
> -		struct drm_i915_gem_request *submit_req;
> 
> -		submit_req = list_first_entry(&ring->execlist_queue,
> -				struct drm_i915_gem_request,
> -				execlist_link);
> -		list_del(&submit_req->execlist_link);
> +	if (i915.enable_execlists) {
> +		spin_lock_irq(&ring->execlist_lock);
> 
> -		if (submit_req->ctx != ring->default_context)
> -			intel_lr_context_unpin(submit_req);
> +		/* list_splice_tail_init checks for empty lists */
> +		list_splice_tail_init(&ring->execlist_queue,
> +				      &ring->execlist_retired_req_list);
> 
> -		i915_gem_request_unreference(submit_req);
> +		spin_unlock_irq(&ring->execlist_lock);
> +		intel_execlists_retire_requests(ring);
>  	}
> 
>  	/*
> --
> 1.9.1

LGTM
Reviewed-by: Thomas Daniel <thomas.daniel@intel.com>


_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 68+ messages in thread

* [PATCH] drm/i915: Update to post-reset execlist queue clean-up
  2015-10-19 15:32   ` [PATCH v2 " Tomas Elf
                       ` (2 preceding siblings ...)
  2015-10-23 13:08     ` [PATCH v3 " Tomas Elf
@ 2015-10-23 17:02     ` Tomas Elf
  2015-12-01 11:46       ` Tvrtko Ursulin
  3 siblings, 1 reply; 68+ messages in thread
From: Tomas Elf @ 2015-10-23 17:02 UTC (permalink / raw)
  To: Intel-GFX

When clearing an execlist queue, instead of traversing it and unreferencing all
requests while holding the spinlock (which might lead to thread sleeping with
IRQs are turned off - bad news!), just move all requests to the retire request
list while holding spinlock and then drop spinlock and invoke the execlists
request retirement path, which already deals with the intricacies of
purging/dereferencing execlist queue requests.

This patch can be considered v3 of:

	commit b96db8b81c54ef30485ddb5992d63305d86ea8d3
	Author: Tomas Elf <tomas.elf@intel.com>
	drm/i915: Grab execlist spinlock to avoid post-reset concurrency issues

This patch assumes v2 of the above patch is part of the baseline, reverts v2
and adds changes on top to turn it into v3.

Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/i915_gem.c | 15 ++++-----------
 1 file changed, 4 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 2c7a0b7..b492603 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2756,20 +2756,13 @@ static void i915_gem_reset_ring_cleanup(struct drm_i915_private *dev_priv,
 
 	if (i915.enable_execlists) {
 		spin_lock_irq(&ring->execlist_lock);
-		while (!list_empty(&ring->execlist_queue)) {
-			struct drm_i915_gem_request *submit_req;
 
-			submit_req = list_first_entry(&ring->execlist_queue,
-					struct drm_i915_gem_request,
-					execlist_link);
-			list_del(&submit_req->execlist_link);
+		/* list_splice_tail_init checks for empty lists */
+		list_splice_tail_init(&ring->execlist_queue,
+				      &ring->execlist_retired_req_list);
 
-			if (submit_req->ctx != ring->default_context)
-				intel_lr_context_unpin(submit_req);
-
-			i915_gem_request_unreference(submit_req);
-		}
 		spin_unlock_irq(&ring->execlist_lock);
+		intel_execlists_retire_requests(ring);
 	}
 
 	/*
-- 
1.9.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* Re: [PATCH] drm/i915: Update to post-reset execlist queue clean-up
  2015-10-23 17:02     ` [PATCH] drm/i915: Update to post-reset execlist queue clean-up Tomas Elf
@ 2015-12-01 11:46       ` Tvrtko Ursulin
  2015-12-11 14:14         ` Dave Gordon
  0 siblings, 1 reply; 68+ messages in thread
From: Tvrtko Ursulin @ 2015-12-01 11:46 UTC (permalink / raw)
  To: Intel-GFX


On 23/10/15 18:02, Tomas Elf wrote:
> When clearing an execlist queue, instead of traversing it and unreferencing all
> requests while holding the spinlock (which might lead to thread sleeping with
> IRQs are turned off - bad news!), just move all requests to the retire request
> list while holding spinlock and then drop spinlock and invoke the execlists
> request retirement path, which already deals with the intricacies of
> purging/dereferencing execlist queue requests.
>
> This patch can be considered v3 of:
>
> 	commit b96db8b81c54ef30485ddb5992d63305d86ea8d3
> 	Author: Tomas Elf <tomas.elf@intel.com>
> 	drm/i915: Grab execlist spinlock to avoid post-reset concurrency issues
>
> This patch assumes v2 of the above patch is part of the baseline, reverts v2
> and adds changes on top to turn it into v3.
>
> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> ---
>   drivers/gpu/drm/i915/i915_gem.c | 15 ++++-----------
>   1 file changed, 4 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 2c7a0b7..b492603 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -2756,20 +2756,13 @@ static void i915_gem_reset_ring_cleanup(struct drm_i915_private *dev_priv,
>
>   	if (i915.enable_execlists) {
>   		spin_lock_irq(&ring->execlist_lock);
> -		while (!list_empty(&ring->execlist_queue)) {
> -			struct drm_i915_gem_request *submit_req;
>
> -			submit_req = list_first_entry(&ring->execlist_queue,
> -					struct drm_i915_gem_request,
> -					execlist_link);
> -			list_del(&submit_req->execlist_link);
> +		/* list_splice_tail_init checks for empty lists */
> +		list_splice_tail_init(&ring->execlist_queue,
> +				      &ring->execlist_retired_req_list);
>
> -			if (submit_req->ctx != ring->default_context)
> -				intel_lr_context_unpin(submit_req);
> -
> -			i915_gem_request_unreference(submit_req);
> -		}
>   		spin_unlock_irq(&ring->execlist_lock);
> +		intel_execlists_retire_requests(ring);
>   	}
>
>   	/*
>

Fallen through the cracks..

This looks to be even more serious, since lockdep notices possible 
deadlock involving vmap_area_lock:

  Possible interrupt unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   lock(vmap_area_lock);
                                local_irq_disable();
                                lock(&(&ring->execlist_lock)->rlock);
                                lock(vmap_area_lock);
   <Interrupt>
     lock(&(&ring->execlist_lock)->rlock);

  *** DEADLOCK ***

Because it unpins LRC context and ringbuffer which ends up in the VM 
code under the execlist_lock.

intel_execlists_retire_requests is slightly different from the code in 
the reset handler because it concerns itself with ctx_obj existence 
which the other one doesn't.

Could people more knowledgeable of this code check if it is OK and R-B?

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH] drm/i915: Update to post-reset execlist queue clean-up
  2015-12-01 11:46       ` Tvrtko Ursulin
@ 2015-12-11 14:14         ` Dave Gordon
  2015-12-11 16:40           ` Daniel Vetter
  2015-12-14 10:21           ` Mika Kuoppala
  0 siblings, 2 replies; 68+ messages in thread
From: Dave Gordon @ 2015-12-11 14:14 UTC (permalink / raw)
  To: Tvrtko Ursulin, Intel-GFX, Daniel, Thomas, Daniel Vetter

On 01/12/15 11:46, Tvrtko Ursulin wrote:
>
> On 23/10/15 18:02, Tomas Elf wrote:
>> When clearing an execlist queue, instead of traversing it and
>> unreferencing all
>> requests while holding the spinlock (which might lead to thread
>> sleeping with
>> IRQs are turned off - bad news!), just move all requests to the retire
>> request
>> list while holding spinlock and then drop spinlock and invoke the
>> execlists
>> request retirement path, which already deals with the intricacies of
>> purging/dereferencing execlist queue requests.
>>
>> This patch can be considered v3 of:
>>
>>     commit b96db8b81c54ef30485ddb5992d63305d86ea8d3
>>     Author: Tomas Elf <tomas.elf@intel.com>
>>     drm/i915: Grab execlist spinlock to avoid post-reset concurrency
>> issues
>>
>> This patch assumes v2 of the above patch is part of the baseline,
>> reverts v2
>> and adds changes on top to turn it into v3.
>>
>> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> ---
>>   drivers/gpu/drm/i915/i915_gem.c | 15 ++++-----------
>>   1 file changed, 4 insertions(+), 11 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_gem.c
>> b/drivers/gpu/drm/i915/i915_gem.c
>> index 2c7a0b7..b492603 100644
>> --- a/drivers/gpu/drm/i915/i915_gem.c
>> +++ b/drivers/gpu/drm/i915/i915_gem.c
>> @@ -2756,20 +2756,13 @@ static void i915_gem_reset_ring_cleanup(struct
>> drm_i915_private *dev_priv,
>>
>>       if (i915.enable_execlists) {
>>           spin_lock_irq(&ring->execlist_lock);
>> -        while (!list_empty(&ring->execlist_queue)) {
>> -            struct drm_i915_gem_request *submit_req;
>>
>> -            submit_req = list_first_entry(&ring->execlist_queue,
>> -                    struct drm_i915_gem_request,
>> -                    execlist_link);
>> -            list_del(&submit_req->execlist_link);
>> +        /* list_splice_tail_init checks for empty lists */
>> +        list_splice_tail_init(&ring->execlist_queue,
>> +                      &ring->execlist_retired_req_list);
>>
>> -            if (submit_req->ctx != ring->default_context)
>> -                intel_lr_context_unpin(submit_req);
>> -
>> -            i915_gem_request_unreference(submit_req);
>> -        }
>>           spin_unlock_irq(&ring->execlist_lock);
>> +        intel_execlists_retire_requests(ring);
>>       }
>>
>>       /*
>
> Fallen through the cracks..
>
> This looks to be even more serious, since lockdep notices possible
> deadlock involving vmap_area_lock:
>
>   Possible interrupt unsafe locking scenario:
>
>         CPU0                    CPU1
>         ----                    ----
>    lock(vmap_area_lock);
>                                 local_irq_disable();
>                                 lock(&(&ring->execlist_lock)->rlock);
>                                 lock(vmap_area_lock);
>    <Interrupt>
>      lock(&(&ring->execlist_lock)->rlock);
>
>   *** DEADLOCK ***
>
> Because it unpins LRC context and ringbuffer which ends up in the VM
> code under the execlist_lock.
>
> intel_execlists_retire_requests is slightly different from the code in
> the reset handler because it concerns itself with ctx_obj existence
> which the other one doesn't.
>
> Could people more knowledgeable of this code check if it is OK and R-B?
>
> Regards,
>
> Tvrtko

Hi Tvrtko,

I didn't understand this message at first, I thought you'd found a 
problem with this ("v3") patch, but now I see what you actually meant is 
that there is indeed a problem with the (v2) that got merged, not the 
original question about unreferencing an object while holding a spinlock 
(because it can't be the last reference), but rather because of the 
unpin, which can indeed cause a problem with a non-i915-defined kernel lock.

So we should certainly update the current (v2) upstream with this.
Thomas Daniel already R-B'd this code on 23rd October, when it was:

[PATCH v3 7/8] drm/i915: Grab execlist spinlock to avoid post-reset 
concurrency issues.

and it hasn't changed in substance since then, so you can carry his R-B 
over, plus I said on that same day that this was a better solution. So:

Reviewed-by: Thomas Daniel <thomas.daniel@intel.com>
Reviewed-by: Dave Gordon <dave.gordon@intel.com>

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH] drm/i915: Update to post-reset execlist queue clean-up
  2015-12-11 14:14         ` Dave Gordon
@ 2015-12-11 16:40           ` Daniel Vetter
  2015-12-14 10:21           ` Mika Kuoppala
  1 sibling, 0 replies; 68+ messages in thread
From: Daniel Vetter @ 2015-12-11 16:40 UTC (permalink / raw)
  To: Dave Gordon; +Cc: Intel-GFX

On Fri, Dec 11, 2015 at 02:14:00PM +0000, Dave Gordon wrote:
> On 01/12/15 11:46, Tvrtko Ursulin wrote:
> >
> >On 23/10/15 18:02, Tomas Elf wrote:
> >>When clearing an execlist queue, instead of traversing it and
> >>unreferencing all
> >>requests while holding the spinlock (which might lead to thread
> >>sleeping with
> >>IRQs are turned off - bad news!), just move all requests to the retire
> >>request
> >>list while holding spinlock and then drop spinlock and invoke the
> >>execlists
> >>request retirement path, which already deals with the intricacies of
> >>purging/dereferencing execlist queue requests.
> >>
> >>This patch can be considered v3 of:
> >>
> >>    commit b96db8b81c54ef30485ddb5992d63305d86ea8d3
> >>    Author: Tomas Elf <tomas.elf@intel.com>
> >>    drm/i915: Grab execlist spinlock to avoid post-reset concurrency
> >>issues
> >>
> >>This patch assumes v2 of the above patch is part of the baseline,
> >>reverts v2
> >>and adds changes on top to turn it into v3.
> >>
> >>Signed-off-by: Tomas Elf <tomas.elf@intel.com>
> >>Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >>Cc: Chris Wilson <chris@chris-wilson.co.uk>
> >>---
> >>  drivers/gpu/drm/i915/i915_gem.c | 15 ++++-----------
> >>  1 file changed, 4 insertions(+), 11 deletions(-)
> >>
> >>diff --git a/drivers/gpu/drm/i915/i915_gem.c
> >>b/drivers/gpu/drm/i915/i915_gem.c
> >>index 2c7a0b7..b492603 100644
> >>--- a/drivers/gpu/drm/i915/i915_gem.c
> >>+++ b/drivers/gpu/drm/i915/i915_gem.c
> >>@@ -2756,20 +2756,13 @@ static void i915_gem_reset_ring_cleanup(struct
> >>drm_i915_private *dev_priv,
> >>
> >>      if (i915.enable_execlists) {
> >>          spin_lock_irq(&ring->execlist_lock);
> >>-        while (!list_empty(&ring->execlist_queue)) {
> >>-            struct drm_i915_gem_request *submit_req;
> >>
> >>-            submit_req = list_first_entry(&ring->execlist_queue,
> >>-                    struct drm_i915_gem_request,
> >>-                    execlist_link);
> >>-            list_del(&submit_req->execlist_link);
> >>+        /* list_splice_tail_init checks for empty lists */
> >>+        list_splice_tail_init(&ring->execlist_queue,
> >>+                      &ring->execlist_retired_req_list);
> >>
> >>-            if (submit_req->ctx != ring->default_context)
> >>-                intel_lr_context_unpin(submit_req);
> >>-
> >>-            i915_gem_request_unreference(submit_req);
> >>-        }
> >>          spin_unlock_irq(&ring->execlist_lock);
> >>+        intel_execlists_retire_requests(ring);
> >>      }
> >>
> >>      /*
> >
> >Fallen through the cracks..
> >
> >This looks to be even more serious, since lockdep notices possible
> >deadlock involving vmap_area_lock:
> >
> >  Possible interrupt unsafe locking scenario:
> >
> >        CPU0                    CPU1
> >        ----                    ----
> >   lock(vmap_area_lock);
> >                                local_irq_disable();
> >                                lock(&(&ring->execlist_lock)->rlock);
> >                                lock(vmap_area_lock);
> >   <Interrupt>
> >     lock(&(&ring->execlist_lock)->rlock);
> >
> >  *** DEADLOCK ***
> >
> >Because it unpins LRC context and ringbuffer which ends up in the VM
> >code under the execlist_lock.
> >
> >intel_execlists_retire_requests is slightly different from the code in
> >the reset handler because it concerns itself with ctx_obj existence
> >which the other one doesn't.
> >
> >Could people more knowledgeable of this code check if it is OK and R-B?
> >
> >Regards,
> >
> >Tvrtko
> 
> Hi Tvrtko,
> 
> I didn't understand this message at first, I thought you'd found a problem
> with this ("v3") patch, but now I see what you actually meant is that there
> is indeed a problem with the (v2) that got merged, not the original question
> about unreferencing an object while holding a spinlock (because it can't be
> the last reference), but rather because of the unpin, which can indeed cause
> a problem with a non-i915-defined kernel lock.
> 
> So we should certainly update the current (v2) upstream with this.
> Thomas Daniel already R-B'd this code on 23rd October, when it was:
> 
> [PATCH v3 7/8] drm/i915: Grab execlist spinlock to avoid post-reset
> concurrency issues.
> 
> and it hasn't changed in substance since then, so you can carry his R-B
> over, plus I said on that same day that this was a better solution. So:
> 
> Reviewed-by: Thomas Daniel <thomas.daniel@intel.com>
> Reviewed-by: Dave Gordon <dave.gordon@intel.com>

Indeed, fell through the cracks more than once :(

Sorry about that, picked up now.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH] drm/i915: Update to post-reset execlist queue clean-up
  2015-12-11 14:14         ` Dave Gordon
  2015-12-11 16:40           ` Daniel Vetter
@ 2015-12-14 10:21           ` Mika Kuoppala
  1 sibling, 0 replies; 68+ messages in thread
From: Mika Kuoppala @ 2015-12-14 10:21 UTC (permalink / raw)
  To: Dave Gordon, Tvrtko Ursulin, Intel-GFX, Daniel, Thomas, Daniel Vetter

Dave Gordon <david.s.gordon@intel.com> writes:

> On 01/12/15 11:46, Tvrtko Ursulin wrote:
>>
>> On 23/10/15 18:02, Tomas Elf wrote:
>>> When clearing an execlist queue, instead of traversing it and
>>> unreferencing all
>>> requests while holding the spinlock (which might lead to thread
>>> sleeping with
>>> IRQs are turned off - bad news!), just move all requests to the retire
>>> request
>>> list while holding spinlock and then drop spinlock and invoke the
>>> execlists
>>> request retirement path, which already deals with the intricacies of
>>> purging/dereferencing execlist queue requests.
>>>
>>> This patch can be considered v3 of:
>>>
>>>     commit b96db8b81c54ef30485ddb5992d63305d86ea8d3
>>>     Author: Tomas Elf <tomas.elf@intel.com>
>>>     drm/i915: Grab execlist spinlock to avoid post-reset concurrency
>>> issues
>>>
>>> This patch assumes v2 of the above patch is part of the baseline,
>>> reverts v2
>>> and adds changes on top to turn it into v3.
>>>
>>> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>>> ---
>>>   drivers/gpu/drm/i915/i915_gem.c | 15 ++++-----------
>>>   1 file changed, 4 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/i915_gem.c
>>> b/drivers/gpu/drm/i915/i915_gem.c
>>> index 2c7a0b7..b492603 100644
>>> --- a/drivers/gpu/drm/i915/i915_gem.c
>>> +++ b/drivers/gpu/drm/i915/i915_gem.c
>>> @@ -2756,20 +2756,13 @@ static void i915_gem_reset_ring_cleanup(struct
>>> drm_i915_private *dev_priv,
>>>
>>>       if (i915.enable_execlists) {
>>>           spin_lock_irq(&ring->execlist_lock);
>>> -        while (!list_empty(&ring->execlist_queue)) {
>>> -            struct drm_i915_gem_request *submit_req;
>>>
>>> -            submit_req = list_first_entry(&ring->execlist_queue,
>>> -                    struct drm_i915_gem_request,
>>> -                    execlist_link);
>>> -            list_del(&submit_req->execlist_link);
>>> +        /* list_splice_tail_init checks for empty lists */
>>> +        list_splice_tail_init(&ring->execlist_queue,
>>> +                      &ring->execlist_retired_req_list);
>>>
>>> -            if (submit_req->ctx != ring->default_context)
>>> -                intel_lr_context_unpin(submit_req);
>>> -
>>> -            i915_gem_request_unreference(submit_req);
>>> -        }
>>>           spin_unlock_irq(&ring->execlist_lock);
>>> +        intel_execlists_retire_requests(ring);
>>>       }
>>>
>>>       /*
>>
>> Fallen through the cracks..
>>
>> This looks to be even more serious, since lockdep notices possible
>> deadlock involving vmap_area_lock:
>>
>>   Possible interrupt unsafe locking scenario:
>>
>>         CPU0                    CPU1
>>         ----                    ----
>>    lock(vmap_area_lock);
>>                                 local_irq_disable();
>>                                 lock(&(&ring->execlist_lock)->rlock);
>>                                 lock(vmap_area_lock);
>>    <Interrupt>
>>      lock(&(&ring->execlist_lock)->rlock);
>>
>>   *** DEADLOCK ***
>>
>> Because it unpins LRC context and ringbuffer which ends up in the VM
>> code under the execlist_lock.
>>
>> intel_execlists_retire_requests is slightly different from the code in
>> the reset handler because it concerns itself with ctx_obj existence
>> which the other one doesn't.
>>
>> Could people more knowledgeable of this code check if it is OK and R-B?
>>
>> Regards,
>>
>> Tvrtko
>
> Hi Tvrtko,
>
> I didn't understand this message at first, I thought you'd found a 
> problem with this ("v3") patch, but now I see what you actually meant is 
> that there is indeed a problem with the (v2) that got merged, not the 
> original question about unreferencing an object while holding a spinlock 
> (because it can't be the last reference), but rather because of the 
> unpin, which can indeed cause a problem with a non-i915-defined kernel lock.
>
> So we should certainly update the current (v2) upstream with this.
> Thomas Daniel already R-B'd this code on 23rd October, when it was:
>
> [PATCH v3 7/8] drm/i915: Grab execlist spinlock to avoid post-reset 
> concurrency issues.
>
> and it hasn't changed in substance since then, so you can carry his R-B 
> over, plus I said on that same day that this was a better solution. So:
>
> Reviewed-by: Thomas Daniel <thomas.daniel@intel.com>
> Reviewed-by: Dave Gordon <dave.gordon@intel.com>
>

Bat farm did encounter with this few weeks back,
so it was vaguely registered. But I just failed
with timely review.

Thanks for pushing it forward,
-Mika

> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/intel-gfx
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 68+ messages in thread

end of thread, other threads:[~2015-12-14 10:23 UTC | newest]

Thread overview: 68+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-10-08 18:31 [PATCH 0/8] Stability improvements to error state capture Tomas Elf
2015-10-08 18:31 ` [PATCH 1/8] drm/i915: Early exit from semaphore_waits_for for execlist mode Tomas Elf
2015-10-08 18:31 ` [PATCH 2/8] drm/i915: Migrate to safe iterators in error state capture Tomas Elf
2015-10-09  7:49   ` Chris Wilson
2015-10-09 11:38     ` Tomas Elf
2015-10-09  8:27   ` Daniel Vetter
2015-10-09 11:40     ` Tomas Elf
2015-10-13 11:37       ` Daniel Vetter
2015-10-13 11:47         ` Chris Wilson
2015-10-08 18:31 ` [PATCH 3/8] drm/i915: Cope with request list state change during " Tomas Elf
2015-10-09  7:48   ` Chris Wilson
2015-10-09 11:25     ` Tomas Elf
2015-10-13 11:39       ` Daniel Vetter
2015-10-14 11:46         ` Tomas Elf
2015-10-14 12:45           ` Daniel Vetter
2015-10-09  8:28   ` Daniel Vetter
2015-10-09 11:45     ` Tomas Elf
2015-10-13 11:40       ` Daniel Vetter
2015-10-08 18:31 ` [PATCH 4/8] drm/i915: NULL checking when capturing buffer objects " Tomas Elf
2015-10-09  7:49   ` Chris Wilson
2015-10-09 11:34     ` Tomas Elf
2015-10-09  8:32   ` Daniel Vetter
2015-10-09  8:47     ` Chris Wilson
2015-10-09 11:52       ` Tomas Elf
2015-10-09 11:45     ` Tomas Elf
2015-10-08 18:31 ` [PATCH 5/8] drm/i915: vma NULL pointer check Tomas Elf
2015-10-09  7:48   ` Chris Wilson
2015-10-09 11:30     ` Tomas Elf
2015-10-09 11:59       ` Chris Wilson
2015-10-13 11:43         ` Daniel Vetter
2015-10-09  8:33   ` Daniel Vetter
2015-10-09 11:46     ` Tomas Elf
2015-10-08 18:31 ` [PATCH 6/8] drm/i915: Use safe list iterators Tomas Elf
2015-10-09  7:41   ` Chris Wilson
2015-10-09 10:27     ` Tomas Elf
2015-10-09 10:38       ` Chris Wilson
2015-10-09 12:00         ` Tomas Elf
2015-10-08 18:31 ` [PATCH 7/8] drm/i915: Grab execlist spinlock to avoid post-reset concurrency issues Tomas Elf
2015-10-09  7:45   ` Chris Wilson
2015-10-09 10:28     ` Tomas Elf
2015-10-09  8:38   ` Daniel Vetter
2015-10-09  8:45     ` Chris Wilson
2015-10-13 11:46       ` Daniel Vetter
2015-10-13 11:45         ` Chris Wilson
2015-10-13 13:46           ` Daniel Vetter
2015-10-13 14:00             ` Chris Wilson
2015-10-19 15:32   ` [PATCH v2 " Tomas Elf
2015-10-22 16:49     ` Dave Gordon
2015-10-22 17:35       ` Daniel Vetter
2015-10-23  8:42     ` Tvrtko Ursulin
2015-10-23  8:59       ` Daniel Vetter
2015-10-23 11:02         ` Tomas Elf
2015-10-23 12:49           ` Dave Gordon
2015-10-23 13:08     ` [PATCH v3 " Tomas Elf
2015-10-23 14:53       ` Daniel, Thomas
2015-10-23 17:02     ` [PATCH] drm/i915: Update to post-reset execlist queue clean-up Tomas Elf
2015-12-01 11:46       ` Tvrtko Ursulin
2015-12-11 14:14         ` Dave Gordon
2015-12-11 16:40           ` Daniel Vetter
2015-12-14 10:21           ` Mika Kuoppala
2015-10-08 18:31 ` [PATCH 8/8] drm/i915: NULL check of unpin_work Tomas Elf
2015-10-09  7:46   ` Chris Wilson
2015-10-09  8:39     ` Daniel Vetter
2015-10-09 11:50       ` Tomas Elf
2015-10-09 10:30     ` Tomas Elf
2015-10-09 10:44       ` Chris Wilson
2015-10-09 12:06         ` Tomas Elf
2015-10-13 11:51           ` Daniel Vetter

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.