* [PATCH 0/8] Stability improvements to error state capture @ 2015-10-08 18:31 Tomas Elf
  2015-10-08 18:31 ` [PATCH 1/8] drm/i915: Early exit from semaphore_waits_for for execlist mode Tomas Elf
  ` (7 more replies)
  0 siblings, 8 replies; 68+ messages in thread
From: Tomas Elf @ 2015-10-08 18:31 UTC (permalink / raw)
To: Intel-GFX

In preparation for the upcoming TDR per-engine hang recovery enablement, the stability of the error state capture code needs to be addressed. The main reason is that testing TDR requires a long-duration test run of several hours, during which a large number of hangs is handled together with the associated error state captures. In its current state the i915 driver experiences various forms of kernel panics and other fatal errors within the first hour(s) of such hang testing. The patches in this series have been validated with long-duration hang testing clocking in at 12+ hours and should suffice as an initial improvement.

The underlying issue of capturing the driver state without synchronization remains to be fixed. One way of further alleviating this problem, suggested by John Harrison, is to do a mutex_trylock() on the struct_mutex for a while (give it a second or so) before going into the error state capture from i915_handle_error(). If nobody is holding the struct_mutex, the error state capture is then considerably safer from sudden state changes. If some thread has hung while holding the struct_mutex, one could at least hope that the hung state implies no sudden state changes during error state capture (unless some thread is caught in a livelock, or is not stuck at all but simply running for a very long time - some improvement might still be expected here).

One fix that has been omitted from this patch series concerns the broken ring space calculation following a full GPU reset.
Two independent patches to solve this are: "[PATCH] drm/i915: Update ring space correctly on lrc context reset" by Mika Kuoppala and "[51/70] drm/i915: Record the position of the start of the request" by Chris Wilson. Since a solution is currently in review I'll simply mention it here as a prerequisite for long-duration stability testing. Without a fix for this problem the ring space is terminally depleted within the first iterations of the hang test: the ring space is miscalculated following every GPU hang recovery and traversal of the GEM init hw path, gradually leading to a terminally hung state.

Tomas Elf (8):
  drm/i915: Early exit from semaphore_waits_for for execlist mode.
  drm/i915: Migrate to safe iterators in error state capture
  drm/i915: Cope with request list state change during error state capture
  drm/i915: NULL checking when capturing buffer objects during error state capture
  drm/i915: vma NULL pointer check
  drm/i915: Use safe list iterators
  drm/i915: Grab execlist spinlock to avoid post-reset concurrency issues.
  drm/i915: NULL check of unpin_work

 drivers/gpu/drm/i915/i915_gem.c       | 18 ++++++++---
 drivers/gpu/drm/i915/i915_gpu_error.c | 61 +++++++++++++++++++++++------------
 drivers/gpu/drm/i915/i915_irq.c       | 20 ++++++++++++
 drivers/gpu/drm/i915/intel_display.c  |  5 +++
 4 files changed, 80 insertions(+), 24 deletions(-)

-- 
1.9.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx
^ permalink raw reply [flat|nested] 68+ messages in thread
* [PATCH 1/8] drm/i915: Early exit from semaphore_waits_for for execlist mode. 2015-10-08 18:31 [PATCH 0/8] Stability improvements to error state capture Tomas Elf
@ 2015-10-08 18:31 ` Tomas Elf
  2015-10-08 18:31 ` [PATCH 2/8] drm/i915: Migrate to safe iterators in error state capture Tomas Elf
  ` (6 subsequent siblings)
  7 siblings, 0 replies; 68+ messages in thread
From: Tomas Elf @ 2015-10-08 18:31 UTC (permalink / raw)
To: Intel-GFX

When submitting semaphores in execlist mode the hang checker crashes in this function, because it is only runnable in ring submission mode. The reason this is of particular interest to the TDR patch series is that we use semaphores as a means of inducing hangs during testing (which is the recommended way to induce hangs on gen8+). It's not clear how this is supposed to work in execlist mode since:

1. This function requires a ring buffer.
2. Retrieving a ring buffer in execlist mode requires us to retrieve the corresponding context, which we get from a request.
3. Retrieving a request from the hang checker is not straightforward, since that requires us to grab the struct_mutex in order to synchronize against the request retirement thread.
4. Grabbing the struct_mutex from the hang checker is something we will not do, since that puts us at risk of deadlock: a hung thread might be holding the struct_mutex already.

Therefore it's not obvious how we're supposed to deal with this. For now we do an early exit from this function, which avoids any kernel panic situation when running our own internal TDR ULT.

* v2: (Chris Wilson) Turned the execlist mode check into a ringbuffer NULL check to make it more submission mode agnostic and less of a layering violation.
Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_irq.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 8ca772d..132e8f3 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -2782,6 +2782,26 @@ semaphore_waits_for(struct intel_engine_cs *ring, u32 *seqno)
 	u64 offset = 0;
 	int i, backwards;
 
+	/*
+	 * This function does not support execlist mode - any attempt to
+	 * proceed further into this function will result in a kernel panic
+	 * when dereferencing ring->buffer, which is not set up in execlist
+	 * mode.
+	 *
+	 * The correct way of doing it would be to derive the currently
+	 * executing ring buffer from the current context, which is derived
+	 * from the currently running request. Unfortunately, to get the
+	 * current request we would have to grab the struct_mutex before doing
+	 * anything else, which would be ill-advised since some other thread
+	 * might have grabbed it already and managed to hang itself, causing
+	 * the hang checker to deadlock.
+	 *
+	 * Therefore, this function does not support execlist mode in its
+	 * current form. Just return NULL and move on.
+	 */
+	if (ring->buffer == NULL)
+		return NULL;
+
 	ipehr = I915_READ(RING_IPEHR(ring->mmio_base));
 	if (!ipehr_is_semaphore_wait(ring->dev, ipehr))
 		return NULL;
-- 
1.9.1
^ permalink raw reply related [flat|nested] 68+ messages in thread
* [PATCH 2/8] drm/i915: Migrate to safe iterators in error state capture 2015-10-08 18:31 [PATCH 0/8] Stability improvements to error state capture Tomas Elf 2015-10-08 18:31 ` [PATCH 1/8] drm/i915: Early exit from semaphore_waits_for for execlist mode Tomas Elf @ 2015-10-08 18:31 ` Tomas Elf 2015-10-09 7:49 ` Chris Wilson 2015-10-09 8:27 ` Daniel Vetter 2015-10-08 18:31 ` [PATCH 3/8] drm/i915: Cope with request list state change during " Tomas Elf ` (5 subsequent siblings) 7 siblings, 2 replies; 68+ messages in thread From: Tomas Elf @ 2015-10-08 18:31 UTC (permalink / raw) To: Intel-GFX Using safe list iterators alleviates the problem of unsynchronized driver list manipulations while error state capture is in the process of traversing lists. Signed-off-by: Tomas Elf <tomas.elf@intel.com> --- drivers/gpu/drm/i915/i915_gpu_error.c | 38 +++++++++++++++++------------------ 1 file changed, 19 insertions(+), 19 deletions(-) diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c index 2f04e4f..32c1799 100644 --- a/drivers/gpu/drm/i915/i915_gpu_error.c +++ b/drivers/gpu/drm/i915/i915_gpu_error.c @@ -718,10 +718,10 @@ static void capture_bo(struct drm_i915_error_buffer *err, static u32 capture_active_bo(struct drm_i915_error_buffer *err, int count, struct list_head *head) { - struct i915_vma *vma; + struct i915_vma *vma, *tmpvma; int i = 0; - list_for_each_entry(vma, head, mm_list) { + list_for_each_entry_safe(vma, tmpvma, head, mm_list) { capture_bo(err++, vma); if (++i == count) break; @@ -734,17 +734,17 @@ static u32 capture_pinned_bo(struct drm_i915_error_buffer *err, int count, struct list_head *head, struct i915_address_space *vm) { - struct drm_i915_gem_object *obj; + struct drm_i915_gem_object *obj, *tmpobj; struct drm_i915_error_buffer * const first = err; struct drm_i915_error_buffer * const last = err + count; - list_for_each_entry(obj, head, global_list) { - struct i915_vma *vma; + list_for_each_entry_safe(obj, tmpobj, head, 
global_list) { + struct i915_vma *vma, *tmpvma; if (err == last) break; - list_for_each_entry(vma, &obj->vma_list, vma_link) + list_for_each_entry_safe(vma, tmpvma, &obj->vma_list, vma_link) if (vma->vm == vm && vma->pin_count > 0) capture_bo(err++, vma); } @@ -958,13 +958,13 @@ static void i915_gem_record_active_context(struct intel_engine_cs *ring, struct drm_i915_error_ring *ering) { struct drm_i915_private *dev_priv = ring->dev->dev_private; - struct drm_i915_gem_object *obj; + struct drm_i915_gem_object *obj, *tmpobj; /* Currently render ring is the only HW context user */ if (ring->id != RCS || !error->ccid) return; - list_for_each_entry(obj, &dev_priv->mm.bound_list, global_list) { + list_for_each_entry_safe(obj, tmpobj, &dev_priv->mm.bound_list, global_list) { if (!i915_gem_obj_ggtt_bound(obj)) continue; @@ -979,7 +979,7 @@ static void i915_gem_record_rings(struct drm_device *dev, struct drm_i915_error_state *error) { struct drm_i915_private *dev_priv = dev->dev_private; - struct drm_i915_gem_request *request; + struct drm_i915_gem_request *request, *tmpreq; int i, count; for (i = 0; i < I915_NUM_RINGS; i++) { @@ -1055,7 +1055,7 @@ static void i915_gem_record_rings(struct drm_device *dev, i915_gem_record_active_context(ring, error, &error->ring[i]); count = 0; - list_for_each_entry(request, &ring->request_list, list) + list_for_each_entry_safe(request, tmpreq, &ring->request_list, list) count++; error->ring[i].num_requests = count; @@ -1068,7 +1068,7 @@ static void i915_gem_record_rings(struct drm_device *dev, } count = 0; - list_for_each_entry(request, &ring->request_list, list) { + list_for_each_entry_safe(request, tmpreq, &ring->request_list, list) { struct drm_i915_error_request *erq; erq = &error->ring[i].requests[count++]; @@ -1088,17 +1088,17 @@ static void i915_gem_capture_vm(struct drm_i915_private *dev_priv, const int ndx) { struct drm_i915_error_buffer *active_bo = NULL, *pinned_bo = NULL; - struct drm_i915_gem_object *obj; - struct i915_vma 
*vma; + struct drm_i915_gem_object *obj, *tmpobj; + struct i915_vma *vma, *tmpvma; int i; i = 0; - list_for_each_entry(vma, &vm->active_list, mm_list) + list_for_each_entry_safe(vma, tmpvma, &vm->active_list, mm_list) i++; error->active_bo_count[ndx] = i; - list_for_each_entry(obj, &dev_priv->mm.bound_list, global_list) { - list_for_each_entry(vma, &obj->vma_list, vma_link) + list_for_each_entry_safe(obj, tmpobj, &dev_priv->mm.bound_list, global_list) { + list_for_each_entry_safe(vma, tmpvma, &obj->vma_list, vma_link) if (vma->vm == vm && vma->pin_count > 0) i++; } @@ -1128,10 +1128,10 @@ static void i915_gem_capture_vm(struct drm_i915_private *dev_priv, static void i915_gem_capture_buffers(struct drm_i915_private *dev_priv, struct drm_i915_error_state *error) { - struct i915_address_space *vm; + struct i915_address_space *vm, *tmpvm; int cnt = 0, i = 0; - list_for_each_entry(vm, &dev_priv->vm_list, global_link) + list_for_each_entry_safe(vm, tmpvm, &dev_priv->vm_list, global_link) cnt++; error->active_bo = kcalloc(cnt, sizeof(*error->active_bo), GFP_ATOMIC); @@ -1155,7 +1155,7 @@ static void i915_gem_capture_buffers(struct drm_i915_private *dev_priv, error->pinned_bo = NULL; error->pinned_bo_count = NULL; } else { - list_for_each_entry(vm, &dev_priv->vm_list, global_link) + list_for_each_entry_safe(vm, tmpvm, &dev_priv->vm_list, global_link) i915_gem_capture_vm(dev_priv, error, vm, i++); error->vm_count = cnt; -- 1.9.1 _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply related [flat|nested] 68+ messages in thread
* Re: [PATCH 2/8] drm/i915: Migrate to safe iterators in error state capture 2015-10-08 18:31 ` [PATCH 2/8] drm/i915: Migrate to safe iterators in error state capture Tomas Elf
@ 2015-10-09 7:49 ` Chris Wilson
  2015-10-09 11:38 ` Tomas Elf
  0 siblings, 1 reply; 68+ messages in thread
From: Chris Wilson @ 2015-10-09 7:49 UTC (permalink / raw)
To: Tomas Elf; +Cc: Intel-GFX

On Thu, Oct 08, 2015 at 07:31:34PM +0100, Tomas Elf wrote:
> Using safe list iterators alleviates the problem of unsynchronized driver list
> manipulations while error state capture is in the process of traversing lists.

Wrong, this does not make anything safe.
-Chris
-- 
Chris Wilson, Intel Open Source Technology Centre
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 2/8] drm/i915: Migrate to safe iterators in error state capture 2015-10-09 7:49 ` Chris Wilson
@ 2015-10-09 11:38 ` Tomas Elf
  0 siblings, 0 replies; 68+ messages in thread
From: Tomas Elf @ 2015-10-09 11:38 UTC (permalink / raw)
To: Chris Wilson, Intel-GFX

On 09/10/2015 08:49, Chris Wilson wrote:
> On Thu, Oct 08, 2015 at 07:31:34PM +0100, Tomas Elf wrote:
>> Using safe list iterators alleviates the problem of unsynchronized driver list
>> manipulations while error state capture is in the process of traversing lists.
>
> Wrong, this does not make anything safe.
> -Chris

I know - I didn't say it would make it safe. As I said in an earlier mail, I use the word "alleviate" to indicate that we're reducing the problem scope and making fatal crashes less likely to occur, but the underlying problem is still there. Since my focus is on TDR I didn't set out to solve the underlying problem; I only needed the error state capture code to not break my tests, which is why I set out to solve the crashes - not the underlying problem. If you have a solution for the underlying problem then that's another story.

Thanks,
Tomas
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 2/8] drm/i915: Migrate to safe iterators in error state capture 2015-10-08 18:31 ` [PATCH 2/8] drm/i915: Migrate to safe iterators in error state capture Tomas Elf 2015-10-09 7:49 ` Chris Wilson @ 2015-10-09 8:27 ` Daniel Vetter 2015-10-09 11:40 ` Tomas Elf 1 sibling, 1 reply; 68+ messages in thread From: Daniel Vetter @ 2015-10-09 8:27 UTC (permalink / raw) To: Tomas Elf; +Cc: Intel-GFX On Thu, Oct 08, 2015 at 07:31:34PM +0100, Tomas Elf wrote: > Using safe list iterators alleviates the problem of unsynchronized driver list > manipulations while error state capture is in the process of traversing lists. > > Signed-off-by: Tomas Elf <tomas.elf@intel.com> > --- > drivers/gpu/drm/i915/i915_gpu_error.c | 38 +++++++++++++++++------------------ > 1 file changed, 19 insertions(+), 19 deletions(-) > > diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c > index 2f04e4f..32c1799 100644 > --- a/drivers/gpu/drm/i915/i915_gpu_error.c > +++ b/drivers/gpu/drm/i915/i915_gpu_error.c > @@ -718,10 +718,10 @@ static void capture_bo(struct drm_i915_error_buffer *err, > static u32 capture_active_bo(struct drm_i915_error_buffer *err, > int count, struct list_head *head) > { > - struct i915_vma *vma; > + struct i915_vma *vma, *tmpvma; > int i = 0; > > - list_for_each_entry(vma, head, mm_list) { > + list_for_each_entry_safe(vma, tmpvma, head, mm_list) { _safe isn't safe against anything, it's only safe against removal of the current list item done with a list_del/list_move from this thread. I.e. the only thing it does is have a temporary pointer to the next list element so if you delete the current one it won't fall over. It doesn't protect against anything else, and especially not against other threads totally stomping the list. So we need proper protection for these lists, independent of dev->struct_mutex. 
The least invasive way to do that is probably rcu, which only requires us that we release the memory for any object sitting on these lists with kfree_rcu instead of normal kfree. Another option might be a spinlock, but that would be a lot more invasive, and Chris won't like the execbuf overhead this causes ;-) -Daniel > capture_bo(err++, vma); > if (++i == count) > break; > @@ -734,17 +734,17 @@ static u32 capture_pinned_bo(struct drm_i915_error_buffer *err, > int count, struct list_head *head, > struct i915_address_space *vm) > { > - struct drm_i915_gem_object *obj; > + struct drm_i915_gem_object *obj, *tmpobj; > struct drm_i915_error_buffer * const first = err; > struct drm_i915_error_buffer * const last = err + count; > > - list_for_each_entry(obj, head, global_list) { > - struct i915_vma *vma; > + list_for_each_entry_safe(obj, tmpobj, head, global_list) { > + struct i915_vma *vma, *tmpvma; > > if (err == last) > break; > > - list_for_each_entry(vma, &obj->vma_list, vma_link) > + list_for_each_entry_safe(vma, tmpvma, &obj->vma_list, vma_link) > if (vma->vm == vm && vma->pin_count > 0) > capture_bo(err++, vma); > } > @@ -958,13 +958,13 @@ static void i915_gem_record_active_context(struct intel_engine_cs *ring, > struct drm_i915_error_ring *ering) > { > struct drm_i915_private *dev_priv = ring->dev->dev_private; > - struct drm_i915_gem_object *obj; > + struct drm_i915_gem_object *obj, *tmpobj; > > /* Currently render ring is the only HW context user */ > if (ring->id != RCS || !error->ccid) > return; > > - list_for_each_entry(obj, &dev_priv->mm.bound_list, global_list) { > + list_for_each_entry_safe(obj, tmpobj, &dev_priv->mm.bound_list, global_list) { > if (!i915_gem_obj_ggtt_bound(obj)) > continue; > > @@ -979,7 +979,7 @@ static void i915_gem_record_rings(struct drm_device *dev, > struct drm_i915_error_state *error) > { > struct drm_i915_private *dev_priv = dev->dev_private; > - struct drm_i915_gem_request *request; > + struct drm_i915_gem_request *request, 
*tmpreq; > int i, count; > > for (i = 0; i < I915_NUM_RINGS; i++) { > @@ -1055,7 +1055,7 @@ static void i915_gem_record_rings(struct drm_device *dev, > i915_gem_record_active_context(ring, error, &error->ring[i]); > > count = 0; > - list_for_each_entry(request, &ring->request_list, list) > + list_for_each_entry_safe(request, tmpreq, &ring->request_list, list) > count++; > > error->ring[i].num_requests = count; > @@ -1068,7 +1068,7 @@ static void i915_gem_record_rings(struct drm_device *dev, > } > > count = 0; > - list_for_each_entry(request, &ring->request_list, list) { > + list_for_each_entry_safe(request, tmpreq, &ring->request_list, list) { > struct drm_i915_error_request *erq; > > erq = &error->ring[i].requests[count++]; > @@ -1088,17 +1088,17 @@ static void i915_gem_capture_vm(struct drm_i915_private *dev_priv, > const int ndx) > { > struct drm_i915_error_buffer *active_bo = NULL, *pinned_bo = NULL; > - struct drm_i915_gem_object *obj; > - struct i915_vma *vma; > + struct drm_i915_gem_object *obj, *tmpobj; > + struct i915_vma *vma, *tmpvma; > int i; > > i = 0; > - list_for_each_entry(vma, &vm->active_list, mm_list) > + list_for_each_entry_safe(vma, tmpvma, &vm->active_list, mm_list) > i++; > error->active_bo_count[ndx] = i; > > - list_for_each_entry(obj, &dev_priv->mm.bound_list, global_list) { > - list_for_each_entry(vma, &obj->vma_list, vma_link) > + list_for_each_entry_safe(obj, tmpobj, &dev_priv->mm.bound_list, global_list) { > + list_for_each_entry_safe(vma, tmpvma, &obj->vma_list, vma_link) > if (vma->vm == vm && vma->pin_count > 0) > i++; > } > @@ -1128,10 +1128,10 @@ static void i915_gem_capture_vm(struct drm_i915_private *dev_priv, > static void i915_gem_capture_buffers(struct drm_i915_private *dev_priv, > struct drm_i915_error_state *error) > { > - struct i915_address_space *vm; > + struct i915_address_space *vm, *tmpvm; > int cnt = 0, i = 0; > > - list_for_each_entry(vm, &dev_priv->vm_list, global_link) > + list_for_each_entry_safe(vm, tmpvm, 
&dev_priv->vm_list, global_link) > cnt++; > > error->active_bo = kcalloc(cnt, sizeof(*error->active_bo), GFP_ATOMIC); > @@ -1155,7 +1155,7 @@ static void i915_gem_capture_buffers(struct drm_i915_private *dev_priv, > error->pinned_bo = NULL; > error->pinned_bo_count = NULL; > } else { > - list_for_each_entry(vm, &dev_priv->vm_list, global_link) > + list_for_each_entry_safe(vm, tmpvm, &dev_priv->vm_list, global_link) > i915_gem_capture_vm(dev_priv, error, vm, i++); > > error->vm_count = cnt; > -- > 1.9.1 > > _______________________________________________ > Intel-gfx mailing list > Intel-gfx@lists.freedesktop.org > http://lists.freedesktop.org/mailman/listinfo/intel-gfx -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 2/8] drm/i915: Migrate to safe iterators in error state capture 2015-10-09 8:27 ` Daniel Vetter @ 2015-10-09 11:40 ` Tomas Elf 2015-10-13 11:37 ` Daniel Vetter 0 siblings, 1 reply; 68+ messages in thread From: Tomas Elf @ 2015-10-09 11:40 UTC (permalink / raw) To: Daniel Vetter; +Cc: Intel-GFX On 09/10/2015 09:27, Daniel Vetter wrote: > On Thu, Oct 08, 2015 at 07:31:34PM +0100, Tomas Elf wrote: >> Using safe list iterators alleviates the problem of unsynchronized driver list >> manipulations while error state capture is in the process of traversing lists. >> >> Signed-off-by: Tomas Elf <tomas.elf@intel.com> >> --- >> drivers/gpu/drm/i915/i915_gpu_error.c | 38 +++++++++++++++++------------------ >> 1 file changed, 19 insertions(+), 19 deletions(-) >> >> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c >> index 2f04e4f..32c1799 100644 >> --- a/drivers/gpu/drm/i915/i915_gpu_error.c >> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c >> @@ -718,10 +718,10 @@ static void capture_bo(struct drm_i915_error_buffer *err, >> static u32 capture_active_bo(struct drm_i915_error_buffer *err, >> int count, struct list_head *head) >> { >> - struct i915_vma *vma; >> + struct i915_vma *vma, *tmpvma; >> int i = 0; >> >> - list_for_each_entry(vma, head, mm_list) { >> + list_for_each_entry_safe(vma, tmpvma, head, mm_list) { > > _safe isn't safe against anything, it's only safe against removal of the > current list item done with a list_del/list_move from this thread. I.e. > the only thing it does is have a temporary pointer to the next list > element so if you delete the current one it won't fall over. > > It doesn't protect against anything else, and especially not against other > threads totally stomping the list. Absolutely! But it's good enough to make the test not fall over within an hour of testing and instead allow it to run for 12+ hours during continuous long-duration testing, which is what I need for the TDR validation. 
> > So we need proper protection for these lists, independent of > dev->struct_mutex. The least invasive way to do that is probably rcu, > which only requires us that we release the memory for any object sitting > on these lists with kfree_rcu instead of normal kfree. Another option > might be a spinlock, but that would be a lot more invasive, and Chris > won't like the execbuf overhead this causes ;-) > -Daniel Awesome! Let's go with that then. Thanks, Tomas > >> capture_bo(err++, vma); >> if (++i == count) >> break; >> @@ -734,17 +734,17 @@ static u32 capture_pinned_bo(struct drm_i915_error_buffer *err, >> int count, struct list_head *head, >> struct i915_address_space *vm) >> { >> - struct drm_i915_gem_object *obj; >> + struct drm_i915_gem_object *obj, *tmpobj; >> struct drm_i915_error_buffer * const first = err; >> struct drm_i915_error_buffer * const last = err + count; >> >> - list_for_each_entry(obj, head, global_list) { >> - struct i915_vma *vma; >> + list_for_each_entry_safe(obj, tmpobj, head, global_list) { >> + struct i915_vma *vma, *tmpvma; >> >> if (err == last) >> break; >> >> - list_for_each_entry(vma, &obj->vma_list, vma_link) >> + list_for_each_entry_safe(vma, tmpvma, &obj->vma_list, vma_link) >> if (vma->vm == vm && vma->pin_count > 0) >> capture_bo(err++, vma); >> } >> @@ -958,13 +958,13 @@ static void i915_gem_record_active_context(struct intel_engine_cs *ring, >> struct drm_i915_error_ring *ering) >> { >> struct drm_i915_private *dev_priv = ring->dev->dev_private; >> - struct drm_i915_gem_object *obj; >> + struct drm_i915_gem_object *obj, *tmpobj; >> >> /* Currently render ring is the only HW context user */ >> if (ring->id != RCS || !error->ccid) >> return; >> >> - list_for_each_entry(obj, &dev_priv->mm.bound_list, global_list) { >> + list_for_each_entry_safe(obj, tmpobj, &dev_priv->mm.bound_list, global_list) { >> if (!i915_gem_obj_ggtt_bound(obj)) >> continue; >> >> @@ -979,7 +979,7 @@ static void i915_gem_record_rings(struct drm_device 
*dev, >> struct drm_i915_error_state *error) >> { >> struct drm_i915_private *dev_priv = dev->dev_private; >> - struct drm_i915_gem_request *request; >> + struct drm_i915_gem_request *request, *tmpreq; >> int i, count; >> >> for (i = 0; i < I915_NUM_RINGS; i++) { >> @@ -1055,7 +1055,7 @@ static void i915_gem_record_rings(struct drm_device *dev, >> i915_gem_record_active_context(ring, error, &error->ring[i]); >> >> count = 0; >> - list_for_each_entry(request, &ring->request_list, list) >> + list_for_each_entry_safe(request, tmpreq, &ring->request_list, list) >> count++; >> >> error->ring[i].num_requests = count; >> @@ -1068,7 +1068,7 @@ static void i915_gem_record_rings(struct drm_device *dev, >> } >> >> count = 0; >> - list_for_each_entry(request, &ring->request_list, list) { >> + list_for_each_entry_safe(request, tmpreq, &ring->request_list, list) { >> struct drm_i915_error_request *erq; >> >> erq = &error->ring[i].requests[count++]; >> @@ -1088,17 +1088,17 @@ static void i915_gem_capture_vm(struct drm_i915_private *dev_priv, >> const int ndx) >> { >> struct drm_i915_error_buffer *active_bo = NULL, *pinned_bo = NULL; >> - struct drm_i915_gem_object *obj; >> - struct i915_vma *vma; >> + struct drm_i915_gem_object *obj, *tmpobj; >> + struct i915_vma *vma, *tmpvma; >> int i; >> >> i = 0; >> - list_for_each_entry(vma, &vm->active_list, mm_list) >> + list_for_each_entry_safe(vma, tmpvma, &vm->active_list, mm_list) >> i++; >> error->active_bo_count[ndx] = i; >> >> - list_for_each_entry(obj, &dev_priv->mm.bound_list, global_list) { >> - list_for_each_entry(vma, &obj->vma_list, vma_link) >> + list_for_each_entry_safe(obj, tmpobj, &dev_priv->mm.bound_list, global_list) { >> + list_for_each_entry_safe(vma, tmpvma, &obj->vma_list, vma_link) >> if (vma->vm == vm && vma->pin_count > 0) >> i++; >> } >> @@ -1128,10 +1128,10 @@ static void i915_gem_capture_vm(struct drm_i915_private *dev_priv, >> static void i915_gem_capture_buffers(struct drm_i915_private *dev_priv, >> struct 
drm_i915_error_state *error) >> { >> - struct i915_address_space *vm; >> + struct i915_address_space *vm, *tmpvm; >> int cnt = 0, i = 0; >> >> - list_for_each_entry(vm, &dev_priv->vm_list, global_link) >> + list_for_each_entry_safe(vm, tmpvm, &dev_priv->vm_list, global_link) >> cnt++; >> >> error->active_bo = kcalloc(cnt, sizeof(*error->active_bo), GFP_ATOMIC); >> @@ -1155,7 +1155,7 @@ static void i915_gem_capture_buffers(struct drm_i915_private *dev_priv, >> error->pinned_bo = NULL; >> error->pinned_bo_count = NULL; >> } else { >> - list_for_each_entry(vm, &dev_priv->vm_list, global_link) >> + list_for_each_entry_safe(vm, tmpvm, &dev_priv->vm_list, global_link) >> i915_gem_capture_vm(dev_priv, error, vm, i++); >> >> error->vm_count = cnt; >> -- >> 1.9.1 >> >> _______________________________________________ >> Intel-gfx mailing list >> Intel-gfx@lists.freedesktop.org >> http://lists.freedesktop.org/mailman/listinfo/intel-gfx > _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 2/8] drm/i915: Migrate to safe iterators in error state capture 2015-10-09 11:40 ` Tomas Elf @ 2015-10-13 11:37 ` Daniel Vetter 2015-10-13 11:47 ` Chris Wilson 0 siblings, 1 reply; 68+ messages in thread From: Daniel Vetter @ 2015-10-13 11:37 UTC (permalink / raw) To: Tomas Elf; +Cc: Intel-GFX On Fri, Oct 09, 2015 at 12:40:51PM +0100, Tomas Elf wrote: > On 09/10/2015 09:27, Daniel Vetter wrote: > >On Thu, Oct 08, 2015 at 07:31:34PM +0100, Tomas Elf wrote: > >>Using safe list iterators alleviates the problem of unsynchronized driver list > >>manipulations while error state capture is in the process of traversing lists. > >> > >>Signed-off-by: Tomas Elf <tomas.elf@intel.com> > >>--- > >> drivers/gpu/drm/i915/i915_gpu_error.c | 38 +++++++++++++++++------------------ > >> 1 file changed, 19 insertions(+), 19 deletions(-) > >> > >>diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c > >>index 2f04e4f..32c1799 100644 > >>--- a/drivers/gpu/drm/i915/i915_gpu_error.c > >>+++ b/drivers/gpu/drm/i915/i915_gpu_error.c > >>@@ -718,10 +718,10 @@ static void capture_bo(struct drm_i915_error_buffer *err, > >> static u32 capture_active_bo(struct drm_i915_error_buffer *err, > >> int count, struct list_head *head) > >> { > >>- struct i915_vma *vma; > >>+ struct i915_vma *vma, *tmpvma; > >> int i = 0; > >> > >>- list_for_each_entry(vma, head, mm_list) { > >>+ list_for_each_entry_safe(vma, tmpvma, head, mm_list) { > > > >_safe isn't safe against anything, it's only safe against removal of the > >current list item done with a list_del/list_move from this thread. I.e. > >the only thing it does is have a temporary pointer to the next list > >element so if you delete the current one it won't fall over. > > > >It doesn't protect against anything else, and especially not against other > >threads totally stomping the list. > > Absolutely! 
But it's good enough to make the test not fall over within an
> hour of testing and instead allow it to run for 12+ hours during continuous
> long-duration testing, which is what I need for the TDR validation.

Well, it looks really suspicious, since the only thing this protects against is deleting the element we are looking at right now. The only sensible theory I can come up with is that this slightly helps when starting the list walk, in case someone comes around and deletes the first element in the list from the retire worker. Since the retire worker tends to do more work per list item than we do, that's enough to make the race really unlikely. But it's still just duct-tape.

> >So we need proper protection for these lists, independent of
> >dev->struct_mutex. The least invasive way to do that is probably rcu,
> >which only requires us that we release the memory for any object sitting
> >on these lists with kfree_rcu instead of normal kfree. Another option
> >might be a spinlock, but that would be a lot more invasive, and Chris
> >won't like the execbuf overhead this causes ;-)
> >-Daniel
>
> Awesome! Let's go with that then.

See our massive IRC discussion ... it's more tricky :( In short, RCU works perfectly if you only have the following lifetime sequence:

- kmalloc object
- add to list
- remove from list (eventually)
- free object

If you re-add an object to a list, or even worse, move it, then the RCU list helpers won't work. What could happen on the read side is:

- you walk the RCU list, keeping track of the head element to know when to stop
- while looking at that list, one of the yet-untraversed items gets moved to a new list -> you'll traverse the new list forever, since that one will never hit the head element.

Not a problem for requests/vmas, but a problem for some of the object lists.
Note that the problem isn't that we re-add the element (if that happens we might end up walking parts of the list twice or parts of it not at all), but moving to a different list where there's a different head element. I haven't checked, but maybe we're lucky and all the object lists we're looking at will always have the same head element. Then I think it'll work out (albeit racy, but who cares about that for the error capture) and with the guarantee (when using rcu delayed freeing) that we'll never chase freed memory. -Daniel -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 2/8] drm/i915: Migrate to safe iterators in error state capture 2015-10-13 11:37 ` Daniel Vetter @ 2015-10-13 11:47 ` Chris Wilson 0 siblings, 0 replies; 68+ messages in thread From: Chris Wilson @ 2015-10-13 11:47 UTC (permalink / raw) To: Daniel Vetter; +Cc: Intel-GFX On Tue, Oct 13, 2015 at 01:37:32PM +0200, Daniel Vetter wrote: > On Fri, Oct 09, 2015 at 12:40:51PM +0100, Tomas Elf wrote: > > On 09/10/2015 09:27, Daniel Vetter wrote: > > >On Thu, Oct 08, 2015 at 07:31:34PM +0100, Tomas Elf wrote: > > >>Using safe list iterators alleviates the problem of unsynchronized driver list > > >>manipulations while error state capture is in the process of traversing lists. > > >> > > >>Signed-off-by: Tomas Elf <tomas.elf@intel.com> > > >>--- > > >> drivers/gpu/drm/i915/i915_gpu_error.c | 38 +++++++++++++++++------------------ > > >> 1 file changed, 19 insertions(+), 19 deletions(-) > > >> > > >>diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c > > >>index 2f04e4f..32c1799 100644 > > >>--- a/drivers/gpu/drm/i915/i915_gpu_error.c > > >>+++ b/drivers/gpu/drm/i915/i915_gpu_error.c > > >>@@ -718,10 +718,10 @@ static void capture_bo(struct drm_i915_error_buffer *err, > > >> static u32 capture_active_bo(struct drm_i915_error_buffer *err, > > >> int count, struct list_head *head) > > >> { > > >>- struct i915_vma *vma; > > >>+ struct i915_vma *vma, *tmpvma; > > >> int i = 0; > > >> > > >>- list_for_each_entry(vma, head, mm_list) { > > >>+ list_for_each_entry_safe(vma, tmpvma, head, mm_list) { > > > > > >_safe isn't safe against anything, it's only safe against removal of the > > >current list item done with a list_del/list_move from this thread. I.e. > > >the only thing it does is have a temporary pointer to the next list > > >element so if you delete the current one it won't fall over. > > > > > >It doesn't protect against anything else, and especially not against other > > >threads totally stomping the list. 
> > > > Absolutely! But it's good enough to make the test not fall over within an > > hour of testing and instead allow it to run for 12+ hours during continuous > > long-duration testing, which is what I need for the TDR validation. > > Well it looks really suspicious, since the only thing this protects > against is against deleting the element we look at right now. The only > sensible theory I can come up with is that this slightly helps when > starting the list walk, in case someone comes around and deletes the first > element in the list from the retire worker. Since the retire worker tends > to do more work per list item than we do that's enough to make the race > really unlikely. But it's still just duct-tape. > > > >So we need proper protection for these lists, independent of > > >dev->struct_mutex. The least invasive way to do that is probably rcu, > > >which only requires us that we release the memory for any object sitting > > >on these lists with kfree_rcu instead of normal kfree. Another option > > >might be a spinlock, but that would be a lot more invasive, and Chris > > >won't like the execbuf overhead this causes ;-) > > >-Daniel > > > > Awesome! Let's go with that then. > > See our massive irc discussion ... it's more tricky :( > > In short rcu works perfectly if you only have the following lifetime > sequence: > - kmalloc object > - add to list > - remove from list (eventually) > - free object > > If you readd an object to a list, or even worse, move it then the rcu list > helpers won't work. What could happen on the read side is: > > - you walk the rcu list, keeping track of the head element to know when to > stop > - while looking at that list one of the yet untraversed items gets moved > to a new list > > -> you'll traverse the new list forever since that one will never hit the > head element. > > Not a problem for requests/vmas, but a problem for some of the object lists. 
> > Note that the problem isn't that we re-add the element (if that happens we > might end up walking parts of the list twice or parts of it not at all), > but moving to a different list where there's a different head element. > > I haven't checked, but maybe we're lucky and all the object lists we're > looking at will always have the same head element. Then I think it'll work > out (albeit racy, but who cares about that for the error capture) and with > the guarantee (when using rcu delayed freeing) that'll we'll never chase > freed memory. Ah, so you agree that stop_machine() is good enough. -Chris -- Chris Wilson, Intel Open Source Technology Centre _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 68+ messages in thread
* [PATCH 3/8] drm/i915: Cope with request list state change during error state capture 2015-10-08 18:31 [PATCH 0/8] Stability improvements to error state capture Tomas Elf 2015-10-08 18:31 ` [PATCH 1/8] drm/i915: Early exit from semaphore_waits_for for execlist mode Tomas Elf 2015-10-08 18:31 ` [PATCH 2/8] drm/i915: Migrate to safe iterators in error state capture Tomas Elf @ 2015-10-08 18:31 ` Tomas Elf 2015-10-09 7:48 ` Chris Wilson 2015-10-09 8:28 ` Daniel Vetter 2015-10-08 18:31 ` [PATCH 4/8] drm/i915: NULL checking when capturing buffer objects " Tomas Elf ` (4 subsequent siblings) 7 siblings, 2 replies; 68+ messages in thread From: Tomas Elf @ 2015-10-08 18:31 UTC (permalink / raw) To: Intel-GFX Since we're not synchronizing the ring request list during error state capture the request list state might change between the time the corresponding error request list was allocated and dimensioned to the time when the ring request list is actually captured into the error state. If this happens, throw a WARNING and do early exit and be aware that the captured error state might not be fully reliable. Signed-off-by: Tomas Elf <tomas.elf@intel.com> --- drivers/gpu/drm/i915/i915_gpu_error.c | 13 +++++++++++++ 1 file changed, 13 insertions(+) diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c index 32c1799..cc75ca4 100644 --- a/drivers/gpu/drm/i915/i915_gpu_error.c +++ b/drivers/gpu/drm/i915/i915_gpu_error.c @@ -1071,6 +1071,19 @@ static void i915_gem_record_rings(struct drm_device *dev, list_for_each_entry_safe(request, tmpreq, &ring->request_list, list) { struct drm_i915_error_request *erq; + if (WARN_ON(!request || count >= error->ring[i].num_requests)) { + /* + * If the ring request list was changed in + * between the point where the error request + * list was created and dimensioned and this + * point then just update the num_requests + * field to reflect this. 
+ */ + error->ring[i].num_requests = + min(count, error->ring[i].num_requests); + break; + } + erq = &error->ring[i].requests[count++]; erq->seqno = request->seqno; erq->jiffies = request->emitted_jiffies; -- 1.9.1 _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply related [flat|nested] 68+ messages in thread
* Re: [PATCH 3/8] drm/i915: Cope with request list state change during error state capture 2015-10-08 18:31 ` [PATCH 3/8] drm/i915: Cope with request list state change during " Tomas Elf @ 2015-10-09 7:48 ` Chris Wilson 2015-10-09 11:25 ` Tomas Elf 2015-10-09 8:28 ` Daniel Vetter 1 sibling, 1 reply; 68+ messages in thread From: Chris Wilson @ 2015-10-09 7:48 UTC (permalink / raw) To: Tomas Elf; +Cc: Intel-GFX On Thu, Oct 08, 2015 at 07:31:35PM +0100, Tomas Elf wrote: > Since we're not synchronizing the ring request list during error state capture > the request list state might change between the time the corresponding error > request list was allocated and dimensioned to the time when the ring request > list is actually captured into the error state. If this happens, throw a > WARNING and do early exit and be aware that the captured error state might not > be fully reliable. > > Signed-off-by: Tomas Elf <tomas.elf@intel.com> > --- > drivers/gpu/drm/i915/i915_gpu_error.c | 13 +++++++++++++ > 1 file changed, 13 insertions(+) > > diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c > index 32c1799..cc75ca4 100644 > --- a/drivers/gpu/drm/i915/i915_gpu_error.c > +++ b/drivers/gpu/drm/i915/i915_gpu_error.c > @@ -1071,6 +1071,19 @@ static void i915_gem_record_rings(struct drm_device *dev, > list_for_each_entry_safe(request, tmpreq, &ring->request_list, list) { > struct drm_i915_error_request *erq; > > + if (WARN_ON(!request || count >= error->ring[i].num_requests)) { Request cannot be null, count can legitimately be more, the WARN_ON is inappropriate. Again, I sent several patches over the past couple of years to fix this. -Chris -- Chris Wilson, Intel Open Source Technology Centre _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 3/8] drm/i915: Cope with request list state change during error state capture 2015-10-09 7:48 ` Chris Wilson @ 2015-10-09 11:25 ` Tomas Elf 2015-10-13 11:39 ` Daniel Vetter 0 siblings, 1 reply; 68+ messages in thread From: Tomas Elf @ 2015-10-09 11:25 UTC (permalink / raw) To: Chris Wilson, Intel-GFX On 09/10/2015 08:48, Chris Wilson wrote: > On Thu, Oct 08, 2015 at 07:31:35PM +0100, Tomas Elf wrote: >> Since we're not synchronizing the ring request list during error state capture >> the request list state might change between the time the corresponding error >> request list was allocated and dimensioned to the time when the ring request >> list is actually captured into the error state. If this happens, throw a >> WARNING and do early exit and be aware that the captured error state might not >> be fully reliable. >> >> Signed-off-by: Tomas Elf <tomas.elf@intel.com> >> --- >> drivers/gpu/drm/i915/i915_gpu_error.c | 13 +++++++++++++ >> 1 file changed, 13 insertions(+) >> >> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c >> index 32c1799..cc75ca4 100644 >> --- a/drivers/gpu/drm/i915/i915_gpu_error.c >> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c >> @@ -1071,6 +1071,19 @@ static void i915_gem_record_rings(struct drm_device *dev, >> list_for_each_entry_safe(request, tmpreq, &ring->request_list, list) { >> struct drm_i915_error_request *erq; >> >> + if (WARN_ON(!request || count >= error->ring[i].num_requests)) { > > Request cannot be null, count can legitmately be more, the WARN on is > inappropriate. Again, I sent several patches over the past couple of > years to fix this. > -Chris > Ok, so having the driver request list change grow in between the point where we allocate and set up the error state request list to the same size as the driver request list (since that's what count being larger than the list size implies) is legitimate? Traversing into unallocated memory seems pretty dodgy to me but if you say so. 
Do you mind pointing me to your patch that solves this your way? Thanks, Tomas _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 3/8] drm/i915: Cope with request list state change during error state capture 2015-10-09 11:25 ` Tomas Elf @ 2015-10-13 11:39 ` Daniel Vetter 2015-10-14 11:46 ` Tomas Elf 0 siblings, 1 reply; 68+ messages in thread From: Daniel Vetter @ 2015-10-13 11:39 UTC (permalink / raw) To: Tomas Elf; +Cc: Intel-GFX On Fri, Oct 09, 2015 at 12:25:26PM +0100, Tomas Elf wrote: > On 09/10/2015 08:48, Chris Wilson wrote: > >On Thu, Oct 08, 2015 at 07:31:35PM +0100, Tomas Elf wrote: > >>Since we're not synchronizing the ring request list during error state capture > >>the request list state might change between the time the corresponding error > >>request list was allocated and dimensioned to the time when the ring request > >>list is actually captured into the error state. If this happens, throw a > >>WARNING and do early exit and be aware that the captured error state might not > >>be fully reliable. > >> > >>Signed-off-by: Tomas Elf <tomas.elf@intel.com> > >>--- > >> drivers/gpu/drm/i915/i915_gpu_error.c | 13 +++++++++++++ > >> 1 file changed, 13 insertions(+) > >> > >>diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c > >>index 32c1799..cc75ca4 100644 > >>--- a/drivers/gpu/drm/i915/i915_gpu_error.c > >>+++ b/drivers/gpu/drm/i915/i915_gpu_error.c > >>@@ -1071,6 +1071,19 @@ static void i915_gem_record_rings(struct drm_device *dev, > >> list_for_each_entry_safe(request, tmpreq, &ring->request_list, list) { > >> struct drm_i915_error_request *erq; > >> > >>+ if (WARN_ON(!request || count >= error->ring[i].num_requests)) { > > > >Request cannot be null, count can legitmately be more, the WARN on is > >inappropriate. Again, I sent several patches over the past couple of > >years to fix this. 
> >-Chris > > > > Ok, so having the driver request list change grow in between the point where > we allocate and set up the error state request list to the same size as the > driver request list (since that's what count being larger than the list size > implies) is legitimate? Traversing into unallocated memory seems pretty > dodgy to me but if you say so. We still need to handle it ofc, but just not WARN on this condition since it can happen. -Daniel -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 3/8] drm/i915: Cope with request list state change during error state capture 2015-10-13 11:39 ` Daniel Vetter @ 2015-10-14 11:46 ` Tomas Elf 2015-10-14 12:45 ` Daniel Vetter 0 siblings, 1 reply; 68+ messages in thread From: Tomas Elf @ 2015-10-14 11:46 UTC (permalink / raw) To: Daniel Vetter; +Cc: Intel-GFX On 13/10/2015 12:39, Daniel Vetter wrote: > On Fri, Oct 09, 2015 at 12:25:26PM +0100, Tomas Elf wrote: >> On 09/10/2015 08:48, Chris Wilson wrote: >>> On Thu, Oct 08, 2015 at 07:31:35PM +0100, Tomas Elf wrote: >>>> Since we're not synchronizing the ring request list during error state capture >>>> the request list state might change between the time the corresponding error >>>> request list was allocated and dimensioned to the time when the ring request >>>> list is actually captured into the error state. If this happens, throw a >>>> WARNING and do early exit and be aware that the captured error state might not >>>> be fully reliable. >>>> >>>> Signed-off-by: Tomas Elf <tomas.elf@intel.com> >>>> --- >>>> drivers/gpu/drm/i915/i915_gpu_error.c | 13 +++++++++++++ >>>> 1 file changed, 13 insertions(+) >>>> >>>> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c >>>> index 32c1799..cc75ca4 100644 >>>> --- a/drivers/gpu/drm/i915/i915_gpu_error.c >>>> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c >>>> @@ -1071,6 +1071,19 @@ static void i915_gem_record_rings(struct drm_device *dev, >>>> list_for_each_entry_safe(request, tmpreq, &ring->request_list, list) { >>>> struct drm_i915_error_request *erq; >>>> >>>> + if (WARN_ON(!request || count >= error->ring[i].num_requests)) { >>> >>> Request cannot be null, count can legitmately be more, the WARN on is >>> inappropriate. Again, I sent several patches over the past couple of >>> years to fix this. 
>>> -Chris >>> >> >> Ok, so having the driver request list change grow in between the point where >> we allocate and set up the error state request list to the same size as the >> driver request list (since that's what count being larger than the list size >> implies) is legitimate? Traversing into unallocated memory seems pretty >> dodgy to me but if you say so. > > We still need to handle it ofc, but just not WARN on this condition since > it can happen. > -Daniel > With the RCU discussion ongoing I guess we should just drop this patch? I agree that what I've been seeing looks like a side-effect of concurrent memory deallocation. Whatever solution you reach should make this patch pointless. Thanks, Tomas _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 3/8] drm/i915: Cope with request list state change during error state capture 2015-10-14 11:46 ` Tomas Elf @ 2015-10-14 12:45 ` Daniel Vetter 0 siblings, 0 replies; 68+ messages in thread From: Daniel Vetter @ 2015-10-14 12:45 UTC (permalink / raw) To: Tomas Elf; +Cc: Intel-GFX On Wed, Oct 14, 2015 at 12:46:51PM +0100, Tomas Elf wrote: > On 13/10/2015 12:39, Daniel Vetter wrote: > >On Fri, Oct 09, 2015 at 12:25:26PM +0100, Tomas Elf wrote: > >>On 09/10/2015 08:48, Chris Wilson wrote: > >>>On Thu, Oct 08, 2015 at 07:31:35PM +0100, Tomas Elf wrote: > >>>>Since we're not synchronizing the ring request list during error state capture > >>>>the request list state might change between the time the corresponding error > >>>>request list was allocated and dimensioned to the time when the ring request > >>>>list is actually captured into the error state. If this happens, throw a > >>>>WARNING and do early exit and be aware that the captured error state might not > >>>>be fully reliable. > >>>> > >>>>Signed-off-by: Tomas Elf <tomas.elf@intel.com> > >>>>--- > >>>> drivers/gpu/drm/i915/i915_gpu_error.c | 13 +++++++++++++ > >>>> 1 file changed, 13 insertions(+) > >>>> > >>>>diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c > >>>>index 32c1799..cc75ca4 100644 > >>>>--- a/drivers/gpu/drm/i915/i915_gpu_error.c > >>>>+++ b/drivers/gpu/drm/i915/i915_gpu_error.c > >>>>@@ -1071,6 +1071,19 @@ static void i915_gem_record_rings(struct drm_device *dev, > >>>> list_for_each_entry_safe(request, tmpreq, &ring->request_list, list) { > >>>> struct drm_i915_error_request *erq; > >>>> > >>>>+ if (WARN_ON(!request || count >= error->ring[i].num_requests)) { > >>> > >>>Request cannot be null, count can legitmately be more, the WARN on is > >>>inappropriate. Again, I sent several patches over the past couple of > >>>years to fix this. 
> >>>-Chris > >>> > >> > >>Ok, so having the driver request list change grow in between the point where > >>we allocate and set up the error state request list to the same size as the > >>driver request list (since that's what count being larger than the list size > >>implies) is legitimate? Traversing into unallocated memory seems pretty > >>dodgy to me but if you say so. > > > >We still need to handle it ofc, but just not WARN on this condition since > >it can happen. > >-Daniel > > > > With the RCU discussion ongoing I guess we should we just drop this patch? I > agree that what I've been seeing looks like a side-effect of concurrent > memory deallocation. Whatever solution you reach should make this patch > pointless. Depending upon the solution we pick we might still need it, since if we only make sure we don't chase freed memory but still allow the list itself to get modified (which would be the case with kfree_rcu), then you need it. -Daniel -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 3/8] drm/i915: Cope with request list state change during error state capture 2015-10-08 18:31 ` [PATCH 3/8] drm/i915: Cope with request list state change during " Tomas Elf 2015-10-09 7:48 ` Chris Wilson @ 2015-10-09 8:28 ` Daniel Vetter 2015-10-09 11:45 ` Tomas Elf 1 sibling, 1 reply; 68+ messages in thread From: Daniel Vetter @ 2015-10-09 8:28 UTC (permalink / raw) To: Tomas Elf; +Cc: Intel-GFX On Thu, Oct 08, 2015 at 07:31:35PM +0100, Tomas Elf wrote: > Since we're not synchronizing the ring request list during error state capture > the request list state might change between the time the corresponding error > request list was allocated and dimensioned to the time when the ring request > list is actually captured into the error state. If this happens, throw a > WARNING and do early exit and be aware that the captured error state might not > be fully reliable. Please don't throw a WARNING since this is expected to occasionally happen. DRM_DEBUG_DRIVER is enough imo. -Daniel > > Signed-off-by: Tomas Elf <tomas.elf@intel.com> > --- > drivers/gpu/drm/i915/i915_gpu_error.c | 13 +++++++++++++ > 1 file changed, 13 insertions(+) > > diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c > index 32c1799..cc75ca4 100644 > --- a/drivers/gpu/drm/i915/i915_gpu_error.c > +++ b/drivers/gpu/drm/i915/i915_gpu_error.c > @@ -1071,6 +1071,19 @@ static void i915_gem_record_rings(struct drm_device *dev, > list_for_each_entry_safe(request, tmpreq, &ring->request_list, list) { > struct drm_i915_error_request *erq; > > + if (WARN_ON(!request || count >= error->ring[i].num_requests)) { > + /* > + * If the ring request list was changed in > + * between the point where the error request > + * list was created and dimensioned and this > + * point then just update the num_requests > + * field to reflect this. 
> + */ > + error->ring[i].num_requests = > + min(count, error->ring[i].num_requests); > + break; > + } > + > erq = &error->ring[i].requests[count++]; > erq->seqno = request->seqno; > erq->jiffies = request->emitted_jiffies; > -- > 1.9.1 > > _______________________________________________ > Intel-gfx mailing list > Intel-gfx@lists.freedesktop.org > http://lists.freedesktop.org/mailman/listinfo/intel-gfx -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 3/8] drm/i915: Cope with request list state change during error state capture 2015-10-09 8:28 ` Daniel Vetter @ 2015-10-09 11:45 ` Tomas Elf 2015-10-13 11:40 ` Daniel Vetter 0 siblings, 1 reply; 68+ messages in thread From: Tomas Elf @ 2015-10-09 11:45 UTC (permalink / raw) To: Daniel Vetter; +Cc: Intel-GFX On 09/10/2015 09:28, Daniel Vetter wrote: > On Thu, Oct 08, 2015 at 07:31:35PM +0100, Tomas Elf wrote: >> Since we're not synchronizing the ring request list during error state capture >> the request list state might change between the time the corresponding error >> request list was allocated and dimensioned to the time when the ring request >> list is actually captured into the error state. If this happens, throw a >> WARNING and do early exit and be aware that the captured error state might not >> be fully reliable. > > Please don't throw a WARNING since this is expected to occasionally > happen. DRM_DEBUG_DRIVER is enough imo. > -Daniel I still don't see how it could happen without leading to reads of unallocated memory. The error state request list has been allocated to a certain size equal to num_requests and this loop seems to assume that the error state request list maintains the same size as the driver request list, which is not the case - leading to crashes, which is how I happened to notice it. I can obviously remove the warning but are you saying we shouldn't even take action if it happens? Such as early exit? 
Thanks, Tomas > >> >> Signed-off-by: Tomas Elf <tomas.elf@intel.com> >> --- >> drivers/gpu/drm/i915/i915_gpu_error.c | 13 +++++++++++++ >> 1 file changed, 13 insertions(+) >> >> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c >> index 32c1799..cc75ca4 100644 >> --- a/drivers/gpu/drm/i915/i915_gpu_error.c >> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c >> @@ -1071,6 +1071,19 @@ static void i915_gem_record_rings(struct drm_device *dev, >> list_for_each_entry_safe(request, tmpreq, &ring->request_list, list) { >> struct drm_i915_error_request *erq; >> >> + if (WARN_ON(!request || count >= error->ring[i].num_requests)) { >> + /* >> + * If the ring request list was changed in >> + * between the point where the error request >> + * list was created and dimensioned and this >> + * point then just update the num_requests >> + * field to reflect this. >> + */ >> + error->ring[i].num_requests = >> + min(count, error->ring[i].num_requests); >> + break; >> + } >> + >> erq = &error->ring[i].requests[count++]; >> erq->seqno = request->seqno; >> erq->jiffies = request->emitted_jiffies; >> -- >> 1.9.1 >> >> _______________________________________________ >> Intel-gfx mailing list >> Intel-gfx@lists.freedesktop.org >> http://lists.freedesktop.org/mailman/listinfo/intel-gfx > _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 3/8] drm/i915: Cope with request list state change during error state capture 2015-10-09 11:45 ` Tomas Elf @ 2015-10-13 11:40 ` Daniel Vetter 0 siblings, 0 replies; 68+ messages in thread From: Daniel Vetter @ 2015-10-13 11:40 UTC (permalink / raw) To: Tomas Elf; +Cc: Intel-GFX On Fri, Oct 09, 2015 at 12:45:07PM +0100, Tomas Elf wrote: > On 09/10/2015 09:28, Daniel Vetter wrote: > >On Thu, Oct 08, 2015 at 07:31:35PM +0100, Tomas Elf wrote: > >>Since we're not synchronizing the ring request list during error state capture > >>the request list state might change between the time the corresponding error > >>request list was allocated and dimensioned to the time when the ring request > >>list is actually captured into the error state. If this happens, throw a > >>WARNING and do early exit and be aware that the captured error state might not > >>be fully reliable. > > > >Please don't throw a WARNING since this is expected to occasionally > >happen. DRM_DEBUG_DRIVER is enough imo. > >-Daniel > > I still don't see how it could happen without leading to reads of > unallocated memory. The error state request list has been allocated to a > certain size equal to num_requests and this loop seems to assume that the > error state request list maintains the same size as the driver request list, > which is not the case - leading to crashes, which is how I happened to > notice it. > > I can obviously remove the warning but are you saying we shouldn't even take > action if it happens? Such as early exit? Just remove the WARNING since this is a perfectly expected outcome when error state capture races against request adding. If you want put a DRM_DEBUG or similar in there, but WARN_ON for stuff that we expect might happen isn't good - i915.ko is too noisy already as-is. 
-Daniel -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 68+ messages in thread
* [PATCH 4/8] drm/i915: NULL checking when capturing buffer objects during error state capture 2015-10-08 18:31 [PATCH 0/8] Stability improvements to error state capture Tomas Elf ` (2 preceding siblings ...) 2015-10-08 18:31 ` [PATCH 3/8] drm/i915: Cope with request list state change during " Tomas Elf @ 2015-10-08 18:31 ` Tomas Elf 2015-10-09 7:49 ` Chris Wilson 2015-10-09 8:32 ` Daniel Vetter 2015-10-08 18:31 ` [PATCH 5/8] drm/i915: vma NULL pointer check Tomas Elf ` (3 subsequent siblings) 7 siblings, 2 replies; 68+ messages in thread From: Tomas Elf @ 2015-10-08 18:31 UTC (permalink / raw) To: Intel-GFX In the past vmas have sometimes turned out to be NULL when capturing buffer objects during error state capture. To avoid NULL pointer exceptions throw a WARNING and exit early and be prepared for the error state not being fully accurate following this point. Signed-off-by: Tomas Elf <tomas.elf@intel.com> --- drivers/gpu/drm/i915/i915_gpu_error.c | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c index cc75ca4..ae24971 100644 --- a/drivers/gpu/drm/i915/i915_gpu_error.c +++ b/drivers/gpu/drm/i915/i915_gpu_error.c @@ -691,9 +691,17 @@ unwind: static void capture_bo(struct drm_i915_error_buffer *err, struct i915_vma *vma) { - struct drm_i915_gem_object *obj = vma->obj; + struct drm_i915_gem_object *obj; int i; + if (WARN_ON(!vma)) + return; + + if (WARN_ON(!vma->obj)) + return; + + obj = vma->obj; + err->size = obj->base.size; err->name = obj->base.name; for (i = 0; i < I915_NUM_RINGS; i++) -- 1.9.1 _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply related [flat|nested] 68+ messages in thread
* Re: [PATCH 4/8] drm/i915: NULL checking when capturing buffer objects during error state capture 2015-10-08 18:31 ` [PATCH 4/8] drm/i915: NULL checking when capturing buffer objects " Tomas Elf @ 2015-10-09 7:49 ` Chris Wilson 2015-10-09 11:34 ` Tomas Elf 2015-10-09 8:32 ` Daniel Vetter 1 sibling, 1 reply; 68+ messages in thread From: Chris Wilson @ 2015-10-09 7:49 UTC (permalink / raw) To: Tomas Elf; +Cc: Intel-GFX On Thu, Oct 08, 2015 at 07:31:36PM +0100, Tomas Elf wrote: > In the past vmas have sometimes turned out to be NULL when capturing buffer > objects during error state capture. To avoid NULL pointer exceptions throw a > WARNING and exit early and be prepared for the error state not being fully > accurate following this point. No, fix the root cause. -Chris -- Chris Wilson, Intel Open Source Technology Centre _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 4/8] drm/i915: NULL checking when capturing buffer objects during error state capture 2015-10-09 7:49 ` Chris Wilson @ 2015-10-09 11:34 ` Tomas Elf 0 siblings, 0 replies; 68+ messages in thread From: Tomas Elf @ 2015-10-09 11:34 UTC (permalink / raw) To: Chris Wilson, Intel-GFX On 09/10/2015 08:49, Chris Wilson wrote: > On Thu, Oct 08, 2015 at 07:31:36PM +0100, Tomas Elf wrote: >> In the past vmas have sometimes turned out to be NULL when capturing buffer >> objects during error state capture. To avoid NULL pointer exceptions throw a >> WARNING and exit early and be prepared for the error state not being fully >> accurate following this point. > > No, fix the root cause. > -Chris > I'd be glad to look at that once TDR is upstreamed. Until then I need a system that can run long enough to validate TDR during long-duration operations testing. I need this patch in order to make that happen. But I guess I'll just carry this patch locally for now then. Thanks, Tomas _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 4/8] drm/i915: NULL checking when capturing buffer objects during error state capture 2015-10-08 18:31 ` [PATCH 4/8] drm/i915: NULL checking when capturing buffer objects " Tomas Elf 2015-10-09 7:49 ` Chris Wilson @ 2015-10-09 8:32 ` Daniel Vetter 2015-10-09 8:47 ` Chris Wilson 2015-10-09 11:45 ` Tomas Elf 1 sibling, 2 replies; 68+ messages in thread From: Daniel Vetter @ 2015-10-09 8:32 UTC (permalink / raw) To: Tomas Elf; +Cc: Intel-GFX On Thu, Oct 08, 2015 at 07:31:36PM +0100, Tomas Elf wrote: > In the past vmas have sometimes turned out to be NULL when capturing buffer > objects during error state capture. To avoid NULL pointer exceptions throw a > WARNING and exit early and be prepared for the error state not being fully > accurate following this point. > > Signed-off-by: Tomas Elf <tomas.elf@intel.com> > --- > drivers/gpu/drm/i915/i915_gpu_error.c | 10 +++++++++- > 1 file changed, 9 insertions(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c > index cc75ca4..ae24971 100644 > --- a/drivers/gpu/drm/i915/i915_gpu_error.c > +++ b/drivers/gpu/drm/i915/i915_gpu_error.c > @@ -691,9 +691,17 @@ unwind: > static void capture_bo(struct drm_i915_error_buffer *err, > struct i915_vma *vma) > { > - struct drm_i915_gem_object *obj = vma->obj; > + struct drm_i915_gem_object *obj; > int i; > > + if (WARN_ON(!vma)) > + return; If we rcu-protect the vma lists and vmas themselves then this can't happen. > + > + if (WARN_ON(!vma->obj)) > + return; Again we probably need to rcu-protect this, and more importantly we need to make sure the compiler doesn't do crap. So obj = rcu_dereference(vma->obj) if (!obj) return; Ofc obj then can also only be released with kfree_rcu. 
-Daniel > + > + obj = vma->obj; > + > err->size = obj->base.size; > err->name = obj->base.name; > for (i = 0; i < I915_NUM_RINGS; i++) > -- > 1.9.1 > > _______________________________________________ > Intel-gfx mailing list > Intel-gfx@lists.freedesktop.org > http://lists.freedesktop.org/mailman/listinfo/intel-gfx -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 68+ messages in thread
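Daniel's suggestion boils down to the classic publish/consume pattern: read the possibly-vanishing pointer exactly once into a local and NULL-check that local, so the compiler cannot reload vma->obj between the check and the use. A userspace sketch of the idea using C11 atomics — the struct names here are illustrative stand-ins, and an actual fix would use the kernel's rcu_dereference()/kfree_rcu(), not atomics alone:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Toy stand-ins for the driver structures under discussion. */
struct obj { size_t size; };
struct vma { _Atomic(struct obj *) obj; };

/* Read vma->obj exactly once (the userspace analogue of
 * obj = rcu_dereference(vma->obj)) and check the local copy.
 * A plain "if (vma->obj) use(vma->obj)" could legally be compiled
 * as two separate loads, reintroducing the NULL-deref window. */
static size_t capture_size(struct vma *vma)
{
    struct obj *obj;

    if (!vma)
        return 0;

    obj = atomic_load_explicit(&vma->obj, memory_order_acquire);
    if (!obj)
        return 0;

    return obj->size;
}
```

With RCU proper, the object additionally has to be freed via kfree_rcu() so the capture side can keep using obj until the grace period ends, which is the point of Daniel's last sentence.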
* Re: [PATCH 4/8] drm/i915: NULL checking when capturing buffer objects during error state capture 2015-10-09 8:32 ` Daniel Vetter @ 2015-10-09 8:47 ` Chris Wilson 2015-10-09 11:52 ` Tomas Elf 2015-10-09 11:45 ` Tomas Elf 1 sibling, 1 reply; 68+ messages in thread From: Chris Wilson @ 2015-10-09 8:47 UTC (permalink / raw) To: Daniel Vetter; +Cc: Intel-GFX On Fri, Oct 09, 2015 at 10:32:11AM +0200, Daniel Vetter wrote: > On Thu, Oct 08, 2015 at 07:31:36PM +0100, Tomas Elf wrote: > > In the past vmas have sometimes turned out to be NULL when capturing buffer > > objects during error state capture. To avoid NULL pointer exceptions throw a > > WARNING and exit early and be prepared for the error state not being fully > > accurate following this point. > > > > Signed-off-by: Tomas Elf <tomas.elf@intel.com> > > --- > > drivers/gpu/drm/i915/i915_gpu_error.c | 10 +++++++++- > > 1 file changed, 9 insertions(+), 1 deletion(-) > > > > diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c > > index cc75ca4..ae24971 100644 > > --- a/drivers/gpu/drm/i915/i915_gpu_error.c > > +++ b/drivers/gpu/drm/i915/i915_gpu_error.c > > @@ -691,9 +691,17 @@ unwind: > > static void capture_bo(struct drm_i915_error_buffer *err, > > struct i915_vma *vma) > > { > > - struct drm_i915_gem_object *obj = vma->obj; > > + struct drm_i915_gem_object *obj; > > int i; > > > > + if (WARN_ON(!vma)) > > + return; > > If we rcu-protect the vma lists and vmas themselves then this can't > happen. > > > + > > + if (WARN_ON(!vma->obj)) > > + return; > > Again we probably need to rcu-protect this, and more important we need to > make sure the compiler doesn't do crap. So > > obj = rcu_derefence(vma->obj) > if (!obj) > return; The other issue is that we don't actually want capture_bo() but capture_vma(). 
-Chris -- Chris Wilson, Intel Open Source Technology Centre _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 4/8] drm/i915: NULL checking when capturing buffer objects during error state capture 2015-10-09 8:47 ` Chris Wilson @ 2015-10-09 11:52 ` Tomas Elf 0 siblings, 0 replies; 68+ messages in thread From: Tomas Elf @ 2015-10-09 11:52 UTC (permalink / raw) To: Chris Wilson, Daniel Vetter, Intel-GFX On 09/10/2015 09:47, Chris Wilson wrote: > On Fri, Oct 09, 2015 at 10:32:11AM +0200, Daniel Vetter wrote: >> On Thu, Oct 08, 2015 at 07:31:36PM +0100, Tomas Elf wrote: >>> In the past vmas have sometimes turned out to be NULL when capturing buffer >>> objects during error state capture. To avoid NULL pointer exceptions throw a >>> WARNING and exit early and be prepared for the error state not being fully >>> accurate following this point. >>> >>> Signed-off-by: Tomas Elf <tomas.elf@intel.com> >>> --- >>> drivers/gpu/drm/i915/i915_gpu_error.c | 10 +++++++++- >>> 1 file changed, 9 insertions(+), 1 deletion(-) >>> >>> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c >>> index cc75ca4..ae24971 100644 >>> --- a/drivers/gpu/drm/i915/i915_gpu_error.c >>> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c >>> @@ -691,9 +691,17 @@ unwind: >>> static void capture_bo(struct drm_i915_error_buffer *err, >>> struct i915_vma *vma) >>> { >>> - struct drm_i915_gem_object *obj = vma->obj; >>> + struct drm_i915_gem_object *obj; >>> int i; >>> >>> + if (WARN_ON(!vma)) >>> + return; >> >> If we rcu-protect the vma lists and vmas themselves then this can't >> happen. >> >>> + >>> + if (WARN_ON(!vma->obj)) >>> + return; >> >> Again we probably need to rcu-protect this, and more important we need to >> make sure the compiler doesn't do crap. So >> >> obj = rcu_derefence(vma->obj) >> if (!obj) >> return; > Sweet! Let's go with that solution. > The other issue is that we don't actually want capture_bo() but capture_vma(). 
Sure, but capturing BOs instead of VMAs is not going to crash anything and is therefore slightly outside of the scope of my current work. It would certainly be an improvement to make that change, though. Thanks, Tomas > -Chris > _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 4/8] drm/i915: NULL checking when capturing buffer objects during error state capture 2015-10-09 8:32 ` Daniel Vetter 2015-10-09 8:47 ` Chris Wilson @ 2015-10-09 11:45 ` Tomas Elf 1 sibling, 0 replies; 68+ messages in thread From: Tomas Elf @ 2015-10-09 11:45 UTC (permalink / raw) To: Daniel Vetter; +Cc: Intel-GFX On 09/10/2015 09:32, Daniel Vetter wrote: > On Thu, Oct 08, 2015 at 07:31:36PM +0100, Tomas Elf wrote: >> In the past vmas have sometimes turned out to be NULL when capturing buffer >> objects during error state capture. To avoid NULL pointer exceptions throw a >> WARNING and exit early and be prepared for the error state not being fully >> accurate following this point. >> >> Signed-off-by: Tomas Elf <tomas.elf@intel.com> >> --- >> drivers/gpu/drm/i915/i915_gpu_error.c | 10 +++++++++- >> 1 file changed, 9 insertions(+), 1 deletion(-) >> >> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c >> index cc75ca4..ae24971 100644 >> --- a/drivers/gpu/drm/i915/i915_gpu_error.c >> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c >> @@ -691,9 +691,17 @@ unwind: >> static void capture_bo(struct drm_i915_error_buffer *err, >> struct i915_vma *vma) >> { >> - struct drm_i915_gem_object *obj = vma->obj; >> + struct drm_i915_gem_object *obj; >> int i; >> >> + if (WARN_ON(!vma)) >> + return; > > If we rcu-protect the vma lists and vmas themselves then this can't > happen. > >> + >> + if (WARN_ON(!vma->obj)) >> + return; > > Again we probably need to rcu-protect this, and more important we need to > make sure the compiler doesn't do crap. So > > obj = rcu_derefence(vma->obj) > if (!obj) > return; > > Ofc obj then can also only be released with kfree_rcu. > -Daniel Awesome! Let's go with that solution then. 
Thanks, Tomas > >> + >> + obj = vma->obj; >> + >> err->size = obj->base.size; >> err->name = obj->base.name; >> for (i = 0; i < I915_NUM_RINGS; i++) >> -- >> 1.9.1 >> >> _______________________________________________ >> Intel-gfx mailing list >> Intel-gfx@lists.freedesktop.org >> http://lists.freedesktop.org/mailman/listinfo/intel-gfx > _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 68+ messages in thread
* [PATCH 5/8] drm/i915: vma NULL pointer check 2015-10-08 18:31 [PATCH 0/8] Stability improvements to error state capture Tomas Elf ` (3 preceding siblings ...) 2015-10-08 18:31 ` [PATCH 4/8] drm/i915: NULL checking when capturing buffer objects " Tomas Elf @ 2015-10-08 18:31 ` Tomas Elf 2015-10-09 7:48 ` Chris Wilson 2015-10-09 8:33 ` Daniel Vetter 2015-10-08 18:31 ` [PATCH 6/8] drm/i915: Use safe list iterators Tomas Elf ` (2 subsequent siblings) 7 siblings, 2 replies; 68+ messages in thread From: Tomas Elf @ 2015-10-08 18:31 UTC (permalink / raw) To: Intel-GFX Sometimes the iterated vma objects are NULL apparently. Be aware of this while iterating and do early exit instead of causing a NULL pointer exception. Signed-off-by: Tomas Elf <tomas.elf@intel.com> --- drivers/gpu/drm/i915/i915_gem.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c index 52642af..8210ae7 100644 --- a/drivers/gpu/drm/i915/i915_gem.c +++ b/drivers/gpu/drm/i915/i915_gem.c @@ -5122,9 +5122,13 @@ unsigned long i915_gem_obj_size(struct drm_i915_gem_object *o, bool i915_gem_obj_is_pinned(struct drm_i915_gem_object *obj) { struct i915_vma *vma; - list_for_each_entry(vma, &obj->vma_list, vma_link) + list_for_each_entry(vma, &obj->vma_list, vma_link) { + if (!vma) + continue; + if (vma->pin_count > 0) return true; + } return false; } -- 1.9.1 _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply related [flat|nested] 68+ messages in thread
* Re: [PATCH 5/8] drm/i915: vma NULL pointer check 2015-10-08 18:31 ` [PATCH 5/8] drm/i915: vma NULL pointer check Tomas Elf @ 2015-10-09 7:48 ` Chris Wilson 2015-10-09 11:30 ` Tomas Elf 2015-10-09 8:33 ` Daniel Vetter 1 sibling, 1 reply; 68+ messages in thread From: Chris Wilson @ 2015-10-09 7:48 UTC (permalink / raw) To: Tomas Elf; +Cc: Intel-GFX On Thu, Oct 08, 2015 at 07:31:37PM +0100, Tomas Elf wrote: > Sometimes the iterated vma objects are NULL apparently. Be aware of this while > iterating and do early exit instead of causing a NULL pointer exception. Wrong. -Chris -- Chris Wilson, Intel Open Source Technology Centre _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 5/8] drm/i915: vma NULL pointer check 2015-10-09 7:48 ` Chris Wilson @ 2015-10-09 11:30 ` Tomas Elf 2015-10-09 11:59 ` Chris Wilson 0 siblings, 1 reply; 68+ messages in thread From: Tomas Elf @ 2015-10-09 11:30 UTC (permalink / raw) To: Chris Wilson, Intel-GFX On 09/10/2015 08:48, Chris Wilson wrote: > On Thu, Oct 08, 2015 at 07:31:37PM +0100, Tomas Elf wrote: >> Sometimes the iterated vma objects are NULL apparently. Be aware of this while >> iterating and do early exit instead of causing a NULL pointer exception. > > Wrong. > -Chris > So the NULL pointer exception that I ran into multiple times during several different test runs on the line that says "vma->pin_count" was not because the vma pointer was NULL. Would you mind sharing your explanation of how this might have happened? Is it because we're not synchronizing and there's no protection against the driver deallocating vmas while we're trying to capture them? If this all ties into the aforementioned RCU-based solution then maybe we should go with that then. Thanks, Tomas _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 5/8] drm/i915: vma NULL pointer check 2015-10-09 11:30 ` Tomas Elf @ 2015-10-09 11:59 ` Chris Wilson 2015-10-13 11:43 ` Daniel Vetter 0 siblings, 1 reply; 68+ messages in thread From: Chris Wilson @ 2015-10-09 11:59 UTC (permalink / raw) To: Tomas Elf; +Cc: Intel-GFX On Fri, Oct 09, 2015 at 12:30:29PM +0100, Tomas Elf wrote: > On 09/10/2015 08:48, Chris Wilson wrote: > >On Thu, Oct 08, 2015 at 07:31:37PM +0100, Tomas Elf wrote: > >>Sometimes the iterated vma objects are NULL apparently. Be aware of this while > >>iterating and do early exit instead of causing a NULL pointer exception. > > > >Wrong. > >-Chris > > > > So the NULL pointer exception that I ran into multiple times during > several different test runs on the line that says "vma->pin_count" > was not because the vma pointer was NULL. Would you mind sharing > your explanation to how this might have happened? Is it because > we're not synchronizing and there's no protection against the driver > deallocating vmas while we're trying to capture them? If this all > ties into the aforementioned RCU-based solution then maybe we should > go with that then. Correct. The driver is retiring requests whilst the hang check worker is running. And you chased a stale pointer, you could have equally chased a stale vma->obj, vma->ctx etc pointers. What I have done in the past is to serialise the retirement and the hangcheck using a spinlock (as adding to the end of the list is safe), but there are still weak spots when walking the list of bound vma. What I would actually feel more comfortable with is to only record the request->batch_vma, at the cost of losing information about the currently bound buffers. Or we could just stop_machine() whilst running the capture and have the machine unresponsive for a few 100ms. I don't think simply RCU the lists is enough (VM active_list, request list, bound list) as eventually we chase a pointer from obj itself (which means we need to RCU pretty much everything). 
-Chris -- Chris Wilson, Intel Open Source Technology Centre _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 5/8] drm/i915: vma NULL pointer check 2015-10-09 11:59 ` Chris Wilson @ 2015-10-13 11:43 ` Daniel Vetter 0 siblings, 0 replies; 68+ messages in thread From: Daniel Vetter @ 2015-10-13 11:43 UTC (permalink / raw) To: Chris Wilson, Tomas Elf, Intel-GFX On Fri, Oct 09, 2015 at 12:59:44PM +0100, Chris Wilson wrote: > On Fri, Oct 09, 2015 at 12:30:29PM +0100, Tomas Elf wrote: > > On 09/10/2015 08:48, Chris Wilson wrote: > > >On Thu, Oct 08, 2015 at 07:31:37PM +0100, Tomas Elf wrote: > > >>Sometimes the iterated vma objects are NULL apparently. Be aware of this while > > >>iterating and do early exit instead of causing a NULL pointer exception. > > > > > >Wrong. > > >-Chris > > > > > > > So the NULL pointer exception that I ran into multiple times during > > several different test runs on the line that says "vma->pin_count" > > was not because the vma pointer was NULL. Would you mind sharing > > your explanation to how this might have happened? Is it because > > we're not synchronizing and there's no protection against the driver > > deallocating vmas while we're trying to capture them? If this all > > ties into the aforementioned RCU-based solution then maybe we should > > go with that then. > > Correct. The driver is retiring requests whilst the hang check worker is > running. And you chased a stale pointer, you could have equally chased a > stale vma->obj, vma->ctx etc pointers. > > What I have done in the past is to serialise the retirement and the > hangcheck using a spinlock (as adding to the end of the list is safe), > but there are still weak spots when walking the list of bound vma. > What I would actually feel more comfortable with is to only record the > request->batch_vma, at the cost of losing information about the > currently bound buffers. > > Or we could just stop_machine() whilst running the capture and have the > machine unresponsive for a few 100ms. 
I don't think simply RCU the lists > is enough (VM active_list, request list, bound list) as eventually we > chase a pointer from obj itself (which means we need to RCU pretty much > everything). I discussed stop_machine with Chris a bit - it does avoid some of the races, but not all of them (since the lists might be in wonky state right when we stop_machine). And yes rcu means we need to rcu-free any object that the error state looks at. What we could do is protect list consistency with rcu (where possible, see my other mail about possible problems). And use stop_machine to make sure the underlying objects don't go poof (instead of kfree_rcu). Not sure where the gap in that idea is yet ;-) -Daniel -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 5/8] drm/i915: vma NULL pointer check 2015-10-08 18:31 ` [PATCH 5/8] drm/i915: vma NULL pointer check Tomas Elf 2015-10-09 7:48 ` Chris Wilson @ 2015-10-09 8:33 ` Daniel Vetter 2015-10-09 11:46 ` Tomas Elf 1 sibling, 1 reply; 68+ messages in thread From: Daniel Vetter @ 2015-10-09 8:33 UTC (permalink / raw) To: Tomas Elf; +Cc: Intel-GFX On Thu, Oct 08, 2015 at 07:31:37PM +0100, Tomas Elf wrote: > Sometimes the iterated vma objects are NULL apparently. Be aware of this while > iterating and do early exit instead of causing a NULL pointer exception. > > Signed-off-by: Tomas Elf <tomas.elf@intel.com> > --- > drivers/gpu/drm/i915/i915_gem.c | 6 +++++- > 1 file changed, 5 insertions(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c > index 52642af..8210ae7 100644 > --- a/drivers/gpu/drm/i915/i915_gem.c > +++ b/drivers/gpu/drm/i915/i915_gem.c > @@ -5122,9 +5122,13 @@ unsigned long i915_gem_obj_size(struct drm_i915_gem_object *o, > bool i915_gem_obj_is_pinned(struct drm_i915_gem_object *obj) > { > struct i915_vma *vma; > - list_for_each_entry(vma, &obj->vma_list, vma_link) > + list_for_each_entry(vma, &obj->vma_list, vma_link) { > + if (!vma) > + continue; Again no need to check for that if we rcu-protect this list and vmas. -Daniel > + > if (vma->pin_count > 0) > return true; > + } > > return false; > } > -- > 1.9.1 > > _______________________________________________ > Intel-gfx mailing list > Intel-gfx@lists.freedesktop.org > http://lists.freedesktop.org/mailman/listinfo/intel-gfx -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 5/8] drm/i915: vma NULL pointer check 2015-10-09 8:33 ` Daniel Vetter @ 2015-10-09 11:46 ` Tomas Elf 0 siblings, 0 replies; 68+ messages in thread From: Tomas Elf @ 2015-10-09 11:46 UTC (permalink / raw) To: Daniel Vetter; +Cc: Intel-GFX On 09/10/2015 09:33, Daniel Vetter wrote: > On Thu, Oct 08, 2015 at 07:31:37PM +0100, Tomas Elf wrote: >> Sometimes the iterated vma objects are NULL apparently. Be aware of this while >> iterating and do early exit instead of causing a NULL pointer exception. >> >> Signed-off-by: Tomas Elf <tomas.elf@intel.com> >> --- >> drivers/gpu/drm/i915/i915_gem.c | 6 +++++- >> 1 file changed, 5 insertions(+), 1 deletion(-) >> >> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c >> index 52642af..8210ae7 100644 >> --- a/drivers/gpu/drm/i915/i915_gem.c >> +++ b/drivers/gpu/drm/i915/i915_gem.c >> @@ -5122,9 +5122,13 @@ unsigned long i915_gem_obj_size(struct drm_i915_gem_object *o, >> bool i915_gem_obj_is_pinned(struct drm_i915_gem_object *obj) >> { >> struct i915_vma *vma; >> - list_for_each_entry(vma, &obj->vma_list, vma_link) >> + list_for_each_entry(vma, &obj->vma_list, vma_link) { >> + if (!vma) >> + continue; > > Again no need to check for that if we rcu-protect this list and vmas. > -Daniel Sweet! Let's go with that solution! Thanks, Tomas > >> + >> if (vma->pin_count > 0) >> return true; >> + } >> >> return false; >> } >> -- >> 1.9.1 >> >> _______________________________________________ >> Intel-gfx mailing list >> Intel-gfx@lists.freedesktop.org >> http://lists.freedesktop.org/mailman/listinfo/intel-gfx > _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 68+ messages in thread
* [PATCH 6/8] drm/i915: Use safe list iterators 2015-10-08 18:31 [PATCH 0/8] Stability improvements to error state capture Tomas Elf ` (4 preceding siblings ...) 2015-10-08 18:31 ` [PATCH 5/8] drm/i915: vma NULL pointer check Tomas Elf @ 2015-10-08 18:31 ` Tomas Elf 2015-10-09 7:41 ` Chris Wilson 2015-10-08 18:31 ` [PATCH 7/8] drm/i915: Grab execlist spinlock to avoid post-reset concurrency issues Tomas Elf 2015-10-08 18:31 ` [PATCH 8/8] drm/i915: NULL check of unpin_work Tomas Elf 7 siblings, 1 reply; 68+ messages in thread From: Tomas Elf @ 2015-10-08 18:31 UTC (permalink / raw) To: Intel-GFX Error state capture is dependent on i915_gem_active_request() and i915_gem_obj_is_pinned(). Since there is no synchronization between error state capture and the driver state we at least need to use safe list iterators in these functions to alleviate the problem of request/vma lists changing during error state capture. Signed-off-by: Tomas Elf <tomas.elf@intel.com> --- drivers/gpu/drm/i915/i915_gem.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c index 8210ae7..1666499 100644 --- a/drivers/gpu/drm/i915/i915_gem.c +++ b/drivers/gpu/drm/i915/i915_gem.c @@ -2704,9 +2704,9 @@ void i915_gem_request_cancel(struct drm_i915_gem_request *req) struct drm_i915_gem_request * i915_gem_find_active_request(struct intel_engine_cs *ring) { - struct drm_i915_gem_request *request; + struct drm_i915_gem_request *request, *tmpreq; - list_for_each_entry(request, &ring->request_list, list) { + list_for_each_entry_safe(request, tmpreq, &ring->request_list, list) { if (i915_gem_request_completed(request, false)) continue; @@ -5121,8 +5121,9 @@ unsigned long i915_gem_obj_size(struct drm_i915_gem_object *o, bool i915_gem_obj_is_pinned(struct drm_i915_gem_object *obj) { - struct i915_vma *vma; - list_for_each_entry(vma, &obj->vma_list, vma_link) { + struct i915_vma *vma, *tmpvma; + + 
list_for_each_entry_safe(vma, tmpvma, &obj->vma_list, vma_link) { if (!vma) continue; -- 1.9.1 _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply related [flat|nested] 68+ messages in thread
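What the _safe variant buys can be shown with a small userspace model of the kernel's intrusive list — the names echo <linux/list.h>, but this is a toy reimplementation, not the driver code. The safe iterator caches the next node before the loop body runs, so unlinking the current entry cannot derail the walk; note this only protects against removal by the iterating thread itself, not against concurrent writers, which is the heart of Chris's objection below:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal doubly-linked intrusive list, modelled on <linux/list.h>. */
struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h->prev = h; }

static void list_add_tail(struct list_head *n, struct list_head *h)
{
    n->prev = h->prev;
    n->next = h;
    h->prev->next = n;
    h->prev = n;
}

static void list_del(struct list_head *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    n->next = n->prev = NULL;   /* "poison" the unlinked node */
}

struct request { int completed; struct list_head link; };

#define to_request(p) \
    ((struct request *)((char *)(p) - offsetof(struct request, link)))

/* Drop completed requests while walking the list. The open-coded loop
 * is list_for_each_entry_safe(): tmp is fetched before the body runs,
 * so list_del() on the current entry (which NULLs its next pointer)
 * does not break the iteration, whereas the plain list_for_each_entry()
 * idiom would chase the poisoned pointer. */
static int prune_completed(struct list_head *head)
{
    struct list_head *pos, *tmp;
    int dropped = 0;

    for (pos = head->next, tmp = pos->next; pos != head;
         pos = tmp, tmp = pos->next) {
        struct request *req = to_request(pos);

        if (req->completed) {
            list_del(&req->link);
            dropped++;
        }
    }
    return dropped;
}
```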
* Re: [PATCH 6/8] drm/i915: Use safe list iterators 2015-10-08 18:31 ` [PATCH 6/8] drm/i915: Use safe list iterators Tomas Elf @ 2015-10-09 7:41 ` Chris Wilson 2015-10-09 10:27 ` Tomas Elf 0 siblings, 1 reply; 68+ messages in thread From: Chris Wilson @ 2015-10-09 7:41 UTC (permalink / raw) To: Tomas Elf; +Cc: Intel-GFX On Thu, Oct 08, 2015 at 07:31:38PM +0100, Tomas Elf wrote: > Error state capture is dependent on i915_gem_active_request() and > i915_gem_obj_is_pinned(). Since there is no synchronization between error state > capture and the driver state we at least need to use safe list iterators in > these functions to alleviate the problem of request/vma lists changing during > error state capture. Does not make it safe. -Chris -- Chris Wilson, Intel Open Source Technology Centre _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 6/8] drm/i915: Use safe list iterators 2015-10-09 7:41 ` Chris Wilson @ 2015-10-09 10:27 ` Tomas Elf 2015-10-09 10:38 ` Chris Wilson 0 siblings, 1 reply; 68+ messages in thread From: Tomas Elf @ 2015-10-09 10:27 UTC (permalink / raw) To: Chris Wilson, Intel-GFX On 09/10/2015 08:41, Chris Wilson wrote: > On Thu, Oct 08, 2015 at 07:31:38PM +0100, Tomas Elf wrote: >> Error state capture is dependent on i915_gem_active_request() and >> i915_gem_obj_is_pinned(). Since there is no synchronization between error state >> capture and the driver state we at least need to use safe list iterators in >> these functions to alleviate the problem of request/vma lists changing during >> error state capture. > > Does not make it safe. > -Chris > I know it doesn't make it safe, I didn't say this patch makes it safe. Maybe I was unclear but I chose the word "alleviate" rather than "solve" to indicate that there are still problems but this change reduces the problem scope and makes crashes less likely to occur. I also used the formulation "at least" to indicate that we're not solving everything but we can do certain things to improve things somewhat. The problems I've been seeing have been that the list state changes during iteration and that the error capture tries to read elements that are no longer part of the list - not that elements that the error capture code is dereferencing are deallocated by the driver or whatever. Using a safe iterator helps with that very particular problem. Or maybe I guess I've just been incredibly lucky for the last 2 months when I've been running this code as I've been able to get 12+ hours of stability during my tests instead of less than one hour in between crashes that was the case before I introduced these changes. 
Thanks, Tomas _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 6/8] drm/i915: Use safe list iterators 2015-10-09 10:27 ` Tomas Elf @ 2015-10-09 10:38 ` Chris Wilson 2015-10-09 12:00 ` Tomas Elf 0 siblings, 1 reply; 68+ messages in thread From: Chris Wilson @ 2015-10-09 10:38 UTC (permalink / raw) To: Tomas Elf; +Cc: Intel-GFX On Fri, Oct 09, 2015 at 11:27:48AM +0100, Tomas Elf wrote: > On 09/10/2015 08:41, Chris Wilson wrote: > >On Thu, Oct 08, 2015 at 07:31:38PM +0100, Tomas Elf wrote: > >>Error state capture is dependent on i915_gem_active_request() and > >>i915_gem_obj_is_pinned(). Since there is no synchronization between error state > >>capture and the driver state we at least need to use safe list iterators in > >>these functions to alleviate the problem of request/vma lists changing during > >>error state capture. > > > >Does not make it safe. > >-Chris > > > > I know it doesn't make it safe, I didn't say this patch makes it safe. > > Maybe I was unclear but I chose the word "alleviate" rather than > "solve" to indicate that there are still problems but this change > reduces the problem scope and makes crashes less likely to occur. I > also used the formulation "at least" to indicate that we're not > solving everything but we can do certain things to improve things > somewhat. > > The problems I've been seeing has been that the list state changes > during iteration and that the error capture tries to read elements > that are no longer part of the list - not that elements that the > error capture code is dereferencing are deallocated by the driver or > whatever. Using a safe iterator helps with that very particular > problem. Or maybe I guess I've just been incredibly lucky for the > last 2 months when I've been running this code as I've been able to > get 12+ hours of stability during my tests instead of less than one > hour in between crashes that was the case before I introduced these > changes. 
You have been incredibly lucky (probably due to how the requests are being cached now), but that the requests can be modified whilst error capture runs and oops is well known. Just pretending the list iterator is safe does nothing. -Chris -- Chris Wilson, Intel Open Source Technology Centre _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 6/8] drm/i915: Use safe list iterators 2015-10-09 10:38 ` Chris Wilson @ 2015-10-09 12:00 ` Tomas Elf 0 siblings, 0 replies; 68+ messages in thread From: Tomas Elf @ 2015-10-09 12:00 UTC (permalink / raw) To: Chris Wilson, Intel-GFX On 09/10/2015 11:38, Chris Wilson wrote: > On Fri, Oct 09, 2015 at 11:27:48AM +0100, Tomas Elf wrote: >> On 09/10/2015 08:41, Chris Wilson wrote: >>> On Thu, Oct 08, 2015 at 07:31:38PM +0100, Tomas Elf wrote: >>>> Error state capture is dependent on i915_gem_active_request() and >>>> i915_gem_obj_is_pinned(). Since there is no synchronization between error state >>>> capture and the driver state we at least need to use safe list iterators in >>>> these functions to alleviate the problem of request/vma lists changing during >>>> error state capture. >>> >>> Does not make it safe. >>> -Chris >>> >> >> I know it doesn't make it safe, I didn't say this patch makes it safe. >> >> Maybe I was unclear but I chose the word "alleviate" rather than >> "solve" to indicate that there are still problems but this change >> reduces the problem scope and makes crashes less likely to occur. I >> also used the formulation "at least" to indicate that we're not >> solving everything but we can do certain things to improve things >> somewhat. >> >> The problems I've been seeing has been that the list state changes >> during iteration and that the error capture tries to read elements >> that are no longer part of the list - not that elements that the >> error capture code is dereferencing are deallocated by the driver or >> whatever. Using a safe iterator helps with that very particular >> problem. Or maybe I guess I've just been incredibly lucky for the >> last 2 months when I've been running this code as I've been able to >> get 12+ hours of stability during my tests instead of less than one >> hour in between crashes that was the case before I introduced these >> changes. 
> You have been incredibly lucky (probably due to how the requests are > being cached now), but that the requests can be modified whilst error > capture runs and oops is well known. Just pretending the list iterator > is safe does nothing. > -Chris > Does nothing except consistently extend mean time between failures from less than one hour to more than 12 hours during my TDR tests. That's a lot of luck right there. On a consistent and predictable basis. Also, I'm not trying to pretend, I thought I communicated clearly that this is not a solution but rather an improvement that makes certain types of tests actually possible to run, tests that are needed for the TDR validation. But if you have another solution then let's go with that. Thanks, Tomas _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 68+ messages in thread
* [PATCH 7/8] drm/i915: Grab execlist spinlock to avoid post-reset concurrency issues. 2015-10-08 18:31 [PATCH 0/8] Stability improvements to error state capture Tomas Elf ` (5 preceding siblings ...) 2015-10-08 18:31 ` [PATCH 6/8] drm/i915: Use safe list iterators Tomas Elf @ 2015-10-08 18:31 ` Tomas Elf 2015-10-09 7:45 ` Chris Wilson ` (2 more replies) 2015-10-08 18:31 ` [PATCH 8/8] drm/i915: NULL check of unpin_work Tomas Elf 7 siblings, 3 replies; 68+ messages in thread From: Tomas Elf @ 2015-10-08 18:31 UTC (permalink / raw) To: Intel-GFX Grab execlist lock when cleaning up execlist queues after GPU reset to avoid concurrency problems between the context event interrupt handler and the reset path immediately following a GPU reset. Signed-off-by: Tomas Elf <tomas.elf@intel.com> --- drivers/gpu/drm/i915/i915_gem.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c index 1666499..50c2dcd 100644 --- a/drivers/gpu/drm/i915/i915_gem.c +++ b/drivers/gpu/drm/i915/i915_gem.c @@ -2738,6 +2738,8 @@ static void i915_gem_reset_ring_status(struct drm_i915_private *dev_priv, static void i915_gem_reset_ring_cleanup(struct drm_i915_private *dev_priv, struct intel_engine_cs *ring) { + unsigned long flags; + while (!list_empty(&ring->active_list)) { struct drm_i915_gem_object *obj; @@ -2753,6 +2755,8 @@ static void i915_gem_reset_ring_cleanup(struct drm_i915_private *dev_priv, * are the ones that keep the context and ringbuffer backing objects * pinned in place. 
*/ + + spin_lock_irqsave(&ring->execlist_lock, flags); while (!list_empty(&ring->execlist_queue)) { struct drm_i915_gem_request *submit_req; @@ -2766,6 +2770,7 @@ static void i915_gem_reset_ring_cleanup(struct drm_i915_private *dev_priv, i915_gem_request_unreference(submit_req); } + spin_unlock_irqrestore(&ring->execlist_lock, flags); /* * We must free the requests after all the corresponding objects have -- 1.9.1 _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply related [flat|nested] 68+ messages in thread
* Re: [PATCH 7/8] drm/i915: Grab execlist spinlock to avoid post-reset concurrency issues. 2015-10-08 18:31 ` [PATCH 7/8] drm/i915: Grab execlist spinlock to avoid post-reset concurrency issues Tomas Elf @ 2015-10-09 7:45 ` Chris Wilson 2015-10-09 10:28 ` Tomas Elf 2015-10-09 8:38 ` Daniel Vetter 2015-10-19 15:32 ` [PATCH v2 " Tomas Elf 2 siblings, 1 reply; 68+ messages in thread From: Chris Wilson @ 2015-10-09 7:45 UTC (permalink / raw) To: Tomas Elf; +Cc: Intel-GFX On Thu, Oct 08, 2015 at 07:31:39PM +0100, Tomas Elf wrote: > Grab execlist lock when cleaning up execlist queues after GPU reset to avoid > concurrency problems between the context event interrupt handler and the reset > path immediately following a GPU reset. Hmm, my tree just does: static void i915_gem_reset_ring_cleanup(struct drm_i915_private *dev_priv, struct intel_engine_cs *ring) { /* * Clear the execlists queue up before freeing the requests, as those * are the ones that keep the context and ringbuffer backing objects * pinned in place. */ if (i915.enable_execlists) { spin_lock_irq(&ring->execlist_lock); list_splice_tail_init(&ring->execlist_queue, &ring->execlist_completed); memset(&ring->execlist_port, 0, sizeof(ring->execlist_port)); spin_unlock_irq(&ring->execlist_lock); intel_execlists_retire_requests(ring); } Oh, that's right another unreviewed patch from several months ago. For your patch, you should not take the spinlock unconditionally as it is only initialised for execlists, and you can use the simpler form of irq locking as above. -Chris -- Chris Wilson, Intel Open Source Technology Centre _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 7/8] drm/i915: Grab execlist spinlock to avoid post-reset concurrency issues. 2015-10-09 7:45 ` Chris Wilson @ 2015-10-09 10:28 ` Tomas Elf 0 siblings, 0 replies; 68+ messages in thread From: Tomas Elf @ 2015-10-09 10:28 UTC (permalink / raw) To: Chris Wilson, Intel-GFX On 09/10/2015 08:45, Chris Wilson wrote: > On Thu, Oct 08, 2015 at 07:31:39PM +0100, Tomas Elf wrote: >> Grab execlist lock when cleaning up execlist queues after GPU reset to avoid >> concurrency problems between the context event interrupt handler and the reset >> path immediately following a GPU reset. > > Hmm, my tree just does: > > static void i915_gem_reset_ring_cleanup(struct drm_i915_private *dev_priv, > struct intel_engine_cs *ring) > { > /* > * Clear the execlists queue up before freeing the requests, as those > * are the ones that keep the context and ringbuffer backing objects > * pinned in place. > */ > if (i915.enable_execlists) { > spin_lock_irq(&ring->execlist_lock); > list_splice_tail_init(&ring->execlist_queue, > &ring->execlist_completed); > memset(&ring->execlist_port, 0, sizeof(ring->execlist_port)); > spin_unlock_irq(&ring->execlist_lock); > > intel_execlists_retire_requests(ring); > } > > > Oh, that's right another unreviewed patch from several months ago. > > For your patch, you should not take the spinlock unconditionally as it is > only initialised for execlists, and you can use the simpler form of irq > locking as above. Those are fair points. Mind pointing me to your patch so that I can have a look at it? Thanks, Tomas > -Chris > _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 7/8] drm/i915: Grab execlist spinlock to avoid post-reset concurrency issues. 2015-10-08 18:31 ` [PATCH 7/8] drm/i915: Grab execlist spinlock to avoid post-reset concurrency issues Tomas Elf 2015-10-09 7:45 ` Chris Wilson @ 2015-10-09 8:38 ` Daniel Vetter 2015-10-09 8:45 ` Chris Wilson 2015-10-19 15:32 ` [PATCH v2 " Tomas Elf 2 siblings, 1 reply; 68+ messages in thread From: Daniel Vetter @ 2015-10-09 8:38 UTC (permalink / raw) To: Tomas Elf; +Cc: Intel-GFX On Thu, Oct 08, 2015 at 07:31:39PM +0100, Tomas Elf wrote: > Grab execlist lock when cleaning up execlist queues after GPU reset to avoid > concurrency problems between the context event interrupt handler and the reset > path immediately following a GPU reset. > > Signed-off-by: Tomas Elf <tomas.elf@intel.com> Should we instead just stop any irqs from the GT completely while we do the reset (plus a synchronize_irq)? It's a bit heavy-weight, but probably safer. Well, not the entire GT, but all the rings under reset (as prep for per-ring reset). Just an idea which feels like a somewhat safer design overall, but we can ofc also lock down all the state properly with the irq-safe spinlocks. -Daniel > --- > drivers/gpu/drm/i915/i915_gem.c | 5 +++++ > 1 file changed, 5 insertions(+) > > diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c > index 1666499..50c2dcd 100644 > --- a/drivers/gpu/drm/i915/i915_gem.c > +++ b/drivers/gpu/drm/i915/i915_gem.c > @@ -2738,6 +2738,8 @@ static void i915_gem_reset_ring_status(struct drm_i915_private *dev_priv, > static void i915_gem_reset_ring_cleanup(struct drm_i915_private *dev_priv, > struct intel_engine_cs *ring) > { > + unsigned long flags; > + > while (!list_empty(&ring->active_list)) { > struct drm_i915_gem_object *obj; > > @@ -2753,6 +2755,8 @@ static void i915_gem_reset_ring_cleanup(struct drm_i915_private *dev_priv, > * are the ones that keep the context and ringbuffer backing objects > * pinned in place. 
> */ > + > + spin_lock_irqsave(&ring->execlist_lock, flags); > while (!list_empty(&ring->execlist_queue)) { > struct drm_i915_gem_request *submit_req; > > @@ -2766,6 +2770,7 @@ static void i915_gem_reset_ring_cleanup(struct drm_i915_private *dev_priv, > > i915_gem_request_unreference(submit_req); > } > + spin_unlock_irqrestore(&ring->execlist_lock, flags); > > /* > * We must free the requests after all the corresponding objects have > -- > 1.9.1 > > _______________________________________________ > Intel-gfx mailing list > Intel-gfx@lists.freedesktop.org > http://lists.freedesktop.org/mailman/listinfo/intel-gfx -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 7/8] drm/i915: Grab execlist spinlock to avoid post-reset concurrency issues. 2015-10-09 8:38 ` Daniel Vetter @ 2015-10-09 8:45 ` Chris Wilson 2015-10-13 11:46 ` Daniel Vetter 0 siblings, 1 reply; 68+ messages in thread From: Chris Wilson @ 2015-10-09 8:45 UTC (permalink / raw) To: Daniel Vetter; +Cc: Intel-GFX On Fri, Oct 09, 2015 at 10:38:18AM +0200, Daniel Vetter wrote: > On Thu, Oct 08, 2015 at 07:31:39PM +0100, Tomas Elf wrote: > > Grab execlist lock when cleaning up execlist queues after GPU reset to avoid > > concurrency problems between the context event interrupt handler and the reset > > path immediately following a GPU reset. > > > > Signed-off-by: Tomas Elf <tomas.elf@intel.com> > > Should we instead just stop any irqs from the GT completely while we do > the reset (plus a synchronize_irq)? It's a bit heavy-weight, but probably > safer. Well not the entire GT, but all the rings under reset (as prep for > per-ring reset). Bah, stopping IRQs is not enough for error state capture though since requests complete asynchronously just by polling a memory address. (If that is the goal here, this patch just makes execlist_queue access consistent and should only be run once the GPU has been reset and so is categorically idle.) -Chris -- Chris Wilson, Intel Open Source Technology Centre _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 7/8] drm/i915: Grab execlist spinlock to avoid post-reset concurrency issues. 2015-10-09 8:45 ` Chris Wilson @ 2015-10-13 11:46 ` Daniel Vetter 2015-10-13 11:45 ` Chris Wilson 0 siblings, 1 reply; 68+ messages in thread From: Daniel Vetter @ 2015-10-13 11:46 UTC (permalink / raw) To: Chris Wilson, Daniel Vetter, Tomas Elf, Intel-GFX On Fri, Oct 09, 2015 at 09:45:16AM +0100, Chris Wilson wrote: > On Fri, Oct 09, 2015 at 10:38:18AM +0200, Daniel Vetter wrote: > > On Thu, Oct 08, 2015 at 07:31:39PM +0100, Tomas Elf wrote: > > > Grab execlist lock when cleaning up execlist queues after GPU reset to avoid > > > concurrency problems between the context event interrupt handler and the reset > > > path immediately following a GPU reset. > > > > > > Signed-off-by: Tomas Elf <tomas.elf@intel.com> > > > > Should we instead just stop any irqs from the GT completely while we do > > the reset (plus a synchronize_irq)? It's a bit heavy-weight, but probably > > safer. Well not the entire GT, but all the rings under reset (as prep for > > per-ring reset). > > Bah, stopping IRQs is not enough for error state capture though since > requests complete asynchronously just by polling a memory address. (If > that is the goal here, this patch just makes execlist_queue access > consistent and should only be run once the GPU has been reset and so is > categorically idle.) This is the execlist ELSP tracking, which is exclusively driven by the CTX_SWITCH interrupt signal from each engine. At least that's been my assumption, and under that assumption I think stalling interrupts should be good enough. -Daniel -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 7/8] drm/i915: Grab execlist spinlock to avoid post-reset concurrency issues. 2015-10-13 11:46 ` Daniel Vetter @ 2015-10-13 11:45 ` Chris Wilson 2015-10-13 13:46 ` Daniel Vetter 0 siblings, 1 reply; 68+ messages in thread From: Chris Wilson @ 2015-10-13 11:45 UTC (permalink / raw) To: Daniel Vetter; +Cc: Intel-GFX On Tue, Oct 13, 2015 at 01:46:38PM +0200, Daniel Vetter wrote: > On Fri, Oct 09, 2015 at 09:45:16AM +0100, Chris Wilson wrote: > > On Fri, Oct 09, 2015 at 10:38:18AM +0200, Daniel Vetter wrote: > > > On Thu, Oct 08, 2015 at 07:31:39PM +0100, Tomas Elf wrote: > > > > Grab execlist lock when cleaning up execlist queues after GPU reset to avoid > > > > concurrency problems between the context event interrupt handler and the reset > > > > path immediately following a GPU reset. > > > > > > > > Signed-off-by: Tomas Elf <tomas.elf@intel.com> > > > > > > Should we instead just stop any irqs from the GT completely while we do > > > the reset (plus a synchronize_irq)? It's a bit heavy-weight, but probably > > > safer. Well not the entire GT, but all the rings under reset (as prep for > > > per-ring reset). > > > > Bah, stopping IRQs is not enough for error state capture though since > > requests complete asynchronously just by polling a memory address. (If > > that is the goal here, this patch just makes execlist_queue access > > consistent and should only be run once the GPU has been reset and so is > > categorically idle.) > > This is the execlist ELSP tracking, which is exclusively driven by the > CTX_SWITCH interrupt signal from each engine. > > At least that's been my assumption, and under that assumption I think > stalling interrupts should be good enough. No, because the requests and vma are not coupled to the interrupt in terms of when they can disappear. 
-Chris -- Chris Wilson, Intel Open Source Technology Centre _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 7/8] drm/i915: Grab execlist spinlock to avoid post-reset concurrency issues. 2015-10-13 11:45 ` Chris Wilson @ 2015-10-13 13:46 ` Daniel Vetter 2015-10-13 14:00 ` Chris Wilson 0 siblings, 1 reply; 68+ messages in thread From: Daniel Vetter @ 2015-10-13 13:46 UTC (permalink / raw) To: Chris Wilson, Daniel Vetter, Tomas Elf, Intel-GFX On Tue, Oct 13, 2015 at 12:45:58PM +0100, Chris Wilson wrote: > On Tue, Oct 13, 2015 at 01:46:38PM +0200, Daniel Vetter wrote: > > On Fri, Oct 09, 2015 at 09:45:16AM +0100, Chris Wilson wrote: > > > On Fri, Oct 09, 2015 at 10:38:18AM +0200, Daniel Vetter wrote: > > > > On Thu, Oct 08, 2015 at 07:31:39PM +0100, Tomas Elf wrote: > > > > > Grab execlist lock when cleaning up execlist queues after GPU reset to avoid > > > > > concurrency problems between the context event interrupt handler and the reset > > > > > path immediately following a GPU reset. > > > > > > > > > > Signed-off-by: Tomas Elf <tomas.elf@intel.com> > > > > > > > > Should we instead just stop any irqs from the GT completely while we do > > > > the reset (plus a synchronize_irq)? It's a bit heavy-weight, but probably > > > > safer. Well not the entire GT, but all the rings under reset (as prep for > > > > per-ring reset). > > > > > > Bah, stopping IRQs is not enough for error state capture though since > > > requests complete asynchronously just by polling a memory address. (If > > > that is the goal here, this patch just makes execlist_queue access > > > consistent and should only be run once the GPU has been reset and so is > > > categorically idle.) > > > > This is the execlist ELSP tracking, which is exclusively driven by the > > CTX_SWITCH interrupt signal from each engine. > > > > At least that's been my assumption, and under that assumption I think > > stalling interrupts should be good enough. > > No, because the requests and vma are not coupled to the interrupt in > terms of when they can disappear. 
At least today execlist keeps its own reference on requests until the CTX_SWITCH stuff is done to exactly make sure this is the case. And even when we have that fixed up I think we do need to exclude this processing somehow, and the irqsave spinlock seems ok for that. Disabling the interrupt itself plus synchronize_irq was really just an idea. -Daniel -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 7/8] drm/i915: Grab execlist spinlock to avoid post-reset concurrency issues. 2015-10-13 13:46 ` Daniel Vetter @ 2015-10-13 14:00 ` Chris Wilson 0 siblings, 0 replies; 68+ messages in thread From: Chris Wilson @ 2015-10-13 14:00 UTC (permalink / raw) To: Daniel Vetter; +Cc: Intel-GFX On Tue, Oct 13, 2015 at 03:46:00PM +0200, Daniel Vetter wrote: > On Tue, Oct 13, 2015 at 12:45:58PM +0100, Chris Wilson wrote: > > On Tue, Oct 13, 2015 at 01:46:38PM +0200, Daniel Vetter wrote: > > > On Fri, Oct 09, 2015 at 09:45:16AM +0100, Chris Wilson wrote: > > > > On Fri, Oct 09, 2015 at 10:38:18AM +0200, Daniel Vetter wrote: > > > > > On Thu, Oct 08, 2015 at 07:31:39PM +0100, Tomas Elf wrote: > > > > > > Grab execlist lock when cleaning up execlist queues after GPU reset to avoid > > > > > > concurrency problems between the context event interrupt handler and the reset > > > > > > path immediately following a GPU reset. > > > > > > > > > > > > Signed-off-by: Tomas Elf <tomas.elf@intel.com> > > > > > > > > > > Should we instead just stop any irqs from the GT completely while we do > > > > > the reset (plus a synchronize_irq)? It's a bit heavy-weight, but probably > > > > > safer. Well not the entire GT, but all the rings under reset (as prep for > > > > > per-ring reset). > > > > > > > > Bah, stopping IRQs is not enough for error state capture though since > > > > requests complete asynchronously just by polling a memory address. (If > > > > that is the goal here, this patch just makes execlist_queue access > > > > consistent and should only be run once the GPU has been reset and so is > > > > categorically idle.) > > > > > > This is the execlist ELSP tracking, which is exclusively driven by the > > > CTX_SWITCH interrupt signal from each engine. > > > > > > At least that's been my assumption, and under that assumption I think > > > stalling interrupts should be good enough. 
> > > > No, because the requests and vma are not coupled to the interrupt in > > terms of when they can disappear. > > At least today execlist keeps its own reference on requests until the > CTX_SWITCH stuff is done to exactly make sure this is the case. And even > when we have that fixed up I think we do need to exclude this processing > somehow, and the irqsave spinlock seems ok for that. disabling the > interrupt itself plus synchronize_irq was really just an idea. What I meant was that we have this problem with vma/obj disappearing not simply on execlists because request completion is asynchronous (we just check for the GPU's breadcrumb). Execlists is slightly special in that its requests can't disappear, but they can elsewhere. -Chris -- Chris Wilson, Intel Open Source Technology Centre _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 68+ messages in thread
* [PATCH v2 7/8] drm/i915: Grab execlist spinlock to avoid post-reset concurrency issues. 2015-10-08 18:31 ` [PATCH 7/8] drm/i915: Grab execlist spinlock to avoid post-reset concurrency issues Tomas Elf 2015-10-09 7:45 ` Chris Wilson 2015-10-09 8:38 ` Daniel Vetter @ 2015-10-19 15:32 ` Tomas Elf 2015-10-22 16:49 ` Dave Gordon ` (3 more replies) 2 siblings, 4 replies; 68+ messages in thread From: Tomas Elf @ 2015-10-19 15:32 UTC (permalink / raw) To: Intel-GFX Grab execlist lock when cleaning up execlist queues after GPU reset to avoid concurrency problems between the context event interrupt handler and the reset path immediately following a GPU reset. * v2 (Chris Wilson): Do execlist check and use simpler form of spinlock functions. Signed-off-by: Tomas Elf <tomas.elf@intel.com> --- drivers/gpu/drm/i915/i915_gem.c | 23 ++++++++++++++--------- 1 file changed, 14 insertions(+), 9 deletions(-) diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c index e57061a..2c7a0b7 100644 --- a/drivers/gpu/drm/i915/i915_gem.c +++ b/drivers/gpu/drm/i915/i915_gem.c @@ -2753,18 +2753,23 @@ static void i915_gem_reset_ring_cleanup(struct drm_i915_private *dev_priv, * are the ones that keep the context and ringbuffer backing objects * pinned in place. 
*/ - while (!list_empty(&ring->execlist_queue)) { - struct drm_i915_gem_request *submit_req; - submit_req = list_first_entry(&ring->execlist_queue, - struct drm_i915_gem_request, - execlist_link); - list_del(&submit_req->execlist_link); + if (i915.enable_execlists) { + spin_lock_irq(&ring->execlist_lock); + while (!list_empty(&ring->execlist_queue)) { + struct drm_i915_gem_request *submit_req; - if (submit_req->ctx != ring->default_context) - intel_lr_context_unpin(submit_req); + submit_req = list_first_entry(&ring->execlist_queue, + struct drm_i915_gem_request, + execlist_link); + list_del(&submit_req->execlist_link); - i915_gem_request_unreference(submit_req); + if (submit_req->ctx != ring->default_context) + intel_lr_context_unpin(submit_req); + + i915_gem_request_unreference(submit_req); + } + spin_unlock_irq(&ring->execlist_lock); } /* -- 1.9.1 _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply related [flat|nested] 68+ messages in thread
* Re: [PATCH v2 7/8] drm/i915: Grab execlist spinlock to avoid post-reset concurrency issues. 2015-10-19 15:32 ` [PATCH v2 " Tomas Elf @ 2015-10-22 16:49 ` Dave Gordon 2015-10-22 17:35 ` Daniel Vetter 2015-10-23 8:42 ` Tvrtko Ursulin ` (2 subsequent siblings) 3 siblings, 1 reply; 68+ messages in thread From: Dave Gordon @ 2015-10-22 16:49 UTC (permalink / raw) To: Tomas Elf, Intel-GFX On 19/10/15 16:32, Tomas Elf wrote: > Grab execlist lock when cleaning up execlist queues after GPU reset to avoid > concurrency problems between the context event interrupt handler and the reset > path immediately following a GPU reset. > > * v2 (Chris Wilson): > Do execlist check and use simpler form of spinlock functions. > > Signed-off-by: Tomas Elf <tomas.elf@intel.com> > --- > drivers/gpu/drm/i915/i915_gem.c | 23 ++++++++++++++--------- > 1 file changed, 14 insertions(+), 9 deletions(-) LGTM. Reviewed-by: Dave Gordon <david.s.gordon@intel.com> > diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c > index e57061a..2c7a0b7 100644 > --- a/drivers/gpu/drm/i915/i915_gem.c > +++ b/drivers/gpu/drm/i915/i915_gem.c > @@ -2753,18 +2753,23 @@ static void i915_gem_reset_ring_cleanup(struct drm_i915_private *dev_priv, > * are the ones that keep the context and ringbuffer backing objects > * pinned in place. 
> */ > - while (!list_empty(&ring->execlist_queue)) { > - struct drm_i915_gem_request *submit_req; > > - submit_req = list_first_entry(&ring->execlist_queue, > - struct drm_i915_gem_request, > - execlist_link); > - list_del(&submit_req->execlist_link); > + if (i915.enable_execlists) { > + spin_lock_irq(&ring->execlist_lock); > + while (!list_empty(&ring->execlist_queue)) { > + struct drm_i915_gem_request *submit_req; > > - if (submit_req->ctx != ring->default_context) > - intel_lr_context_unpin(submit_req); > + submit_req = list_first_entry(&ring->execlist_queue, > + struct drm_i915_gem_request, > + execlist_link); > + list_del(&submit_req->execlist_link); > > - i915_gem_request_unreference(submit_req); > + if (submit_req->ctx != ring->default_context) > + intel_lr_context_unpin(submit_req); > + > + i915_gem_request_unreference(submit_req); > + } > + spin_unlock_irq(&ring->execlist_lock); > } > > /* > _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH v2 7/8] drm/i915: Grab execlist spinlock to avoid post-reset concurrency issues. 2015-10-22 16:49 ` Dave Gordon @ 2015-10-22 17:35 ` Daniel Vetter 0 siblings, 0 replies; 68+ messages in thread From: Daniel Vetter @ 2015-10-22 17:35 UTC (permalink / raw) To: Dave Gordon; +Cc: Intel-GFX On Thu, Oct 22, 2015 at 05:49:56PM +0100, Dave Gordon wrote: > On 19/10/15 16:32, Tomas Elf wrote: > >Grab execlist lock when cleaning up execlist queues after GPU reset to avoid > >concurrency problems between the context event interrupt handler and the reset > >path immediately following a GPU reset. > > > >* v2 (Chris Wilson): > >Do execlist check and use simpler form of spinlock functions. > > > >Signed-off-by: Tomas Elf <tomas.elf@intel.com> > >--- > > drivers/gpu/drm/i915/i915_gem.c | 23 ++++++++++++++--------- > > 1 file changed, 14 insertions(+), 9 deletions(-) > > LGTM. > > Reviewed-by: Dave Gordon <david.s.gordon@intel.com> Queued for -next, thanks for the patch. -Daniel > > >diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c > >index e57061a..2c7a0b7 100644 > >--- a/drivers/gpu/drm/i915/i915_gem.c > >+++ b/drivers/gpu/drm/i915/i915_gem.c > >@@ -2753,18 +2753,23 @@ static void i915_gem_reset_ring_cleanup(struct drm_i915_private *dev_priv, > > * are the ones that keep the context and ringbuffer backing objects > > * pinned in place. 
> > */ > >- while (!list_empty(&ring->execlist_queue)) { > >- struct drm_i915_gem_request *submit_req; > > > >- submit_req = list_first_entry(&ring->execlist_queue, > >- struct drm_i915_gem_request, > >- execlist_link); > >- list_del(&submit_req->execlist_link); > >+ if (i915.enable_execlists) { > >+ spin_lock_irq(&ring->execlist_lock); > >+ while (!list_empty(&ring->execlist_queue)) { > >+ struct drm_i915_gem_request *submit_req; > > > >- if (submit_req->ctx != ring->default_context) > >- intel_lr_context_unpin(submit_req); > >+ submit_req = list_first_entry(&ring->execlist_queue, > >+ struct drm_i915_gem_request, > >+ execlist_link); > >+ list_del(&submit_req->execlist_link); > > > >- i915_gem_request_unreference(submit_req); > >+ if (submit_req->ctx != ring->default_context) > >+ intel_lr_context_unpin(submit_req); > >+ > >+ i915_gem_request_unreference(submit_req); > >+ } > >+ spin_unlock_irq(&ring->execlist_lock); > > } > > > > /* > > > > _______________________________________________ > Intel-gfx mailing list > Intel-gfx@lists.freedesktop.org > http://lists.freedesktop.org/mailman/listinfo/intel-gfx -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH v2 7/8] drm/i915: Grab execlist spinlock to avoid post-reset concurrency issues. 2015-10-19 15:32 ` [PATCH v2 " Tomas Elf 2015-10-22 16:49 ` Dave Gordon @ 2015-10-23 8:42 ` Tvrtko Ursulin 2015-10-23 8:59 ` Daniel Vetter 2015-10-23 13:08 ` [PATCH v3 " Tomas Elf 2015-10-23 17:02 ` [PATCH] drm/i915: Update to post-reset execlist queue clean-up Tomas Elf 3 siblings, 1 reply; 68+ messages in thread From: Tvrtko Ursulin @ 2015-10-23 8:42 UTC (permalink / raw) To: Tomas Elf, Intel-GFX On 19/10/15 16:32, Tomas Elf wrote: > Grab execlist lock when cleaning up execlist queues after GPU reset to avoid > concurrency problems between the context event interrupt handler and the reset > path immediately following a GPU reset. > > * v2 (Chris Wilson): > Do execlist check and use simpler form of spinlock functions. > > Signed-off-by: Tomas Elf <tomas.elf@intel.com> > --- > drivers/gpu/drm/i915/i915_gem.c | 23 ++++++++++++++--------- > 1 file changed, 14 insertions(+), 9 deletions(-) > > diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c > index e57061a..2c7a0b7 100644 > --- a/drivers/gpu/drm/i915/i915_gem.c > +++ b/drivers/gpu/drm/i915/i915_gem.c > @@ -2753,18 +2753,23 @@ static void i915_gem_reset_ring_cleanup(struct drm_i915_private *dev_priv, > * are the ones that keep the context and ringbuffer backing objects > * pinned in place. 
> */ > - while (!list_empty(&ring->execlist_queue)) { > - struct drm_i915_gem_request *submit_req; > > - submit_req = list_first_entry(&ring->execlist_queue, > - struct drm_i915_gem_request, > - execlist_link); > - list_del(&submit_req->execlist_link); > + if (i915.enable_execlists) { > + spin_lock_irq(&ring->execlist_lock); > + while (!list_empty(&ring->execlist_queue)) { > + struct drm_i915_gem_request *submit_req; > > - if (submit_req->ctx != ring->default_context) > - intel_lr_context_unpin(submit_req); > + submit_req = list_first_entry(&ring->execlist_queue, > + struct drm_i915_gem_request, > + execlist_link); > + list_del(&submit_req->execlist_link); > > - i915_gem_request_unreference(submit_req); > + if (submit_req->ctx != ring->default_context) > + intel_lr_context_unpin(submit_req); > + > + i915_gem_request_unreference(submit_req); > + } > + spin_unlock_irq(&ring->execlist_lock); Can't this, in theory at least, end up calling drm_gem_object_unreference in a couple of code paths, which can sleep, while holding a spinlock? If this cannot happen in practice for some reason it would probably be good to put a comment explaining it. Or I am missing something? Regards, Tvrtko _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH v2 7/8] drm/i915: Grab execlist spinlock to avoid post-reset concurrency issues. 2015-10-23 8:42 ` Tvrtko Ursulin @ 2015-10-23 8:59 ` Daniel Vetter 2015-10-23 11:02 ` Tomas Elf 0 siblings, 1 reply; 68+ messages in thread From: Daniel Vetter @ 2015-10-23 8:59 UTC (permalink / raw) To: Tvrtko Ursulin; +Cc: Intel-GFX On Fri, Oct 23, 2015 at 09:42:27AM +0100, Tvrtko Ursulin wrote: > > On 19/10/15 16:32, Tomas Elf wrote: > >Grab execlist lock when cleaning up execlist queues after GPU reset to avoid > >concurrency problems between the context event interrupt handler and the reset > >path immediately following a GPU reset. > > > >* v2 (Chris Wilson): > >Do execlist check and use simpler form of spinlock functions. > > > >Signed-off-by: Tomas Elf <tomas.elf@intel.com> > >--- > > drivers/gpu/drm/i915/i915_gem.c | 23 ++++++++++++++--------- > > 1 file changed, 14 insertions(+), 9 deletions(-) > > > >diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c > >index e57061a..2c7a0b7 100644 > >--- a/drivers/gpu/drm/i915/i915_gem.c > >+++ b/drivers/gpu/drm/i915/i915_gem.c > >@@ -2753,18 +2753,23 @@ static void i915_gem_reset_ring_cleanup(struct drm_i915_private *dev_priv, > > * are the ones that keep the context and ringbuffer backing objects > > * pinned in place. 
> > */ > >- while (!list_empty(&ring->execlist_queue)) { > >- struct drm_i915_gem_request *submit_req; > > > >- submit_req = list_first_entry(&ring->execlist_queue, > >- struct drm_i915_gem_request, > >- execlist_link); > >- list_del(&submit_req->execlist_link); > >+ if (i915.enable_execlists) { > >+ spin_lock_irq(&ring->execlist_lock); > >+ while (!list_empty(&ring->execlist_queue)) { > >+ struct drm_i915_gem_request *submit_req; > > > >- if (submit_req->ctx != ring->default_context) > >- intel_lr_context_unpin(submit_req); > >+ submit_req = list_first_entry(&ring->execlist_queue, > >+ struct drm_i915_gem_request, > >+ execlist_link); > >+ list_del(&submit_req->execlist_link); > > > >- i915_gem_request_unreference(submit_req); > >+ if (submit_req->ctx != ring->default_context) > >+ intel_lr_context_unpin(submit_req); > >+ > >+ i915_gem_request_unreference(submit_req); > >+ } > >+ spin_unlock_irq(&ring->execlist_lock); > > Can't this, in theory at least, end up calling drm_gem_object_unreference in > a couple of code paths, which can sleep, while holding a spinlock? > > If this cannot happen in practice for some reason it would probably be good > to put a comment explaining it. > > Or I am missing something? It can indeed I think. Tvrtko, since I'm off to KS next week and you have push access, can you pls push the revert if this checks out? Thanks, Daniel -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH v2 7/8] drm/i915: Grab execlist spinlock to avoid post-reset concurrency issues. 2015-10-23 8:59 ` Daniel Vetter @ 2015-10-23 11:02 ` Tomas Elf 2015-10-23 12:49 ` Dave Gordon 0 siblings, 1 reply; 68+ messages in thread From: Tomas Elf @ 2015-10-23 11:02 UTC (permalink / raw) To: Daniel Vetter, Tvrtko Ursulin; +Cc: Intel-GFX On 23/10/2015 09:59, Daniel Vetter wrote: > On Fri, Oct 23, 2015 at 09:42:27AM +0100, Tvrtko Ursulin wrote: >> >> On 19/10/15 16:32, Tomas Elf wrote: >>> Grab execlist lock when cleaning up execlist queues after GPU reset to avoid >>> concurrency problems between the context event interrupt handler and the reset >>> path immediately following a GPU reset. >>> >>> * v2 (Chris Wilson): >>> Do execlist check and use simpler form of spinlock functions. >>> >>> Signed-off-by: Tomas Elf <tomas.elf@intel.com> >>> --- >>> drivers/gpu/drm/i915/i915_gem.c | 23 ++++++++++++++--------- >>> 1 file changed, 14 insertions(+), 9 deletions(-) >>> >>> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c >>> index e57061a..2c7a0b7 100644 >>> --- a/drivers/gpu/drm/i915/i915_gem.c >>> +++ b/drivers/gpu/drm/i915/i915_gem.c >>> @@ -2753,18 +2753,23 @@ static void i915_gem_reset_ring_cleanup(struct drm_i915_private *dev_priv, >>> * are the ones that keep the context and ringbuffer backing objects >>> * pinned in place. 
>>> */ >>> - while (!list_empty(&ring->execlist_queue)) { >>> - struct drm_i915_gem_request *submit_req; >>> >>> - submit_req = list_first_entry(&ring->execlist_queue, >>> - struct drm_i915_gem_request, >>> - execlist_link); >>> - list_del(&submit_req->execlist_link); >>> + if (i915.enable_execlists) { >>> + spin_lock_irq(&ring->execlist_lock); >>> + while (!list_empty(&ring->execlist_queue)) { >>> + struct drm_i915_gem_request *submit_req; >>> >>> - if (submit_req->ctx != ring->default_context) >>> - intel_lr_context_unpin(submit_req); >>> + submit_req = list_first_entry(&ring->execlist_queue, >>> + struct drm_i915_gem_request, >>> + execlist_link); >>> + list_del(&submit_req->execlist_link); >>> >>> - i915_gem_request_unreference(submit_req); >>> + if (submit_req->ctx != ring->default_context) >>> + intel_lr_context_unpin(submit_req); >>> + >>> + i915_gem_request_unreference(submit_req); >>> + } >>> + spin_unlock_irq(&ring->execlist_lock); >> >> Can't this, in theory at least, end up calling drm_gem_object_unreference in >> a couple of code paths, which can sleep, while holding a spinlock? >> >> If this cannot happen in practice for some reason it would probably be good >> to put a comment explaining it. >> >> Or I am missing something? > > It can indeed I think. Tvrtko, since I'm off to KS next week and you have > push access, can you pls push the revert if this checks out? I've discussed it with Tvrtko and one solution that seems reasonable is to hold the execlist lock while traversing the execlist queue and moving the elements over to a temporary list. After that we can drop the execlist lock, traverse the temporary list and unreference the elements while not holding the lock. That way we can get rid of the elements in a synchronized fashion and also unreference them without holding the lock. I believe this pattern is used elsewhere (such as during request retirement). Does that sound like a plan, or does anyone have a problem with that solution?
Thanks, Tomas > > Thanks, Daniel > _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH v2 7/8] drm/i915: Grab execlist spinlock to avoid post-reset concurrency issues. 2015-10-23 11:02 ` Tomas Elf @ 2015-10-23 12:49 ` Dave Gordon 0 siblings, 0 replies; 68+ messages in thread From: Dave Gordon @ 2015-10-23 12:49 UTC (permalink / raw) To: Tomas Elf, Daniel Vetter, Tvrtko Ursulin; +Cc: Intel-GFX On 23/10/15 12:02, Tomas Elf wrote: > On 23/10/2015 09:59, Daniel Vetter wrote: >> On Fri, Oct 23, 2015 at 09:42:27AM +0100, Tvrtko Ursulin wrote: >>> >>> On 19/10/15 16:32, Tomas Elf wrote: >>>> Grab execlist lock when cleaning up execlist queues after GPU reset >>>> to avoid >>>> concurrency problems between the context event interrupt handler and >>>> the reset >>>> path immediately following a GPU reset. >>>> >>>> * v2 (Chris Wilson): >>>> Do execlist check and use simpler form of spinlock functions. >>>> >>>> Signed-off-by: Tomas Elf <tomas.elf@intel.com> >>>> --- >>>> drivers/gpu/drm/i915/i915_gem.c | 23 ++++++++++++++--------- >>>> 1 file changed, 14 insertions(+), 9 deletions(-) >>>> >>>> diff --git a/drivers/gpu/drm/i915/i915_gem.c >>>> b/drivers/gpu/drm/i915/i915_gem.c >>>> index e57061a..2c7a0b7 100644 >>>> --- a/drivers/gpu/drm/i915/i915_gem.c >>>> +++ b/drivers/gpu/drm/i915/i915_gem.c >>>> @@ -2753,18 +2753,23 @@ static void >>>> i915_gem_reset_ring_cleanup(struct drm_i915_private *dev_priv, >>>> * are the ones that keep the context and ringbuffer backing >>>> objects >>>> * pinned in place. 
>>>> */ >>>> - while (!list_empty(&ring->execlist_queue)) { >>>> - struct drm_i915_gem_request *submit_req; >>>> >>>> - submit_req = list_first_entry(&ring->execlist_queue, >>>> - struct drm_i915_gem_request, >>>> - execlist_link); >>>> - list_del(&submit_req->execlist_link); >>>> + if (i915.enable_execlists) { >>>> + spin_lock_irq(&ring->execlist_lock); >>>> + while (!list_empty(&ring->execlist_queue)) { >>>> + struct drm_i915_gem_request *submit_req; >>>> >>>> - if (submit_req->ctx != ring->default_context) >>>> - intel_lr_context_unpin(submit_req); >>>> + submit_req = list_first_entry(&ring->execlist_queue, >>>> + struct drm_i915_gem_request, >>>> + execlist_link); >>>> + list_del(&submit_req->execlist_link); >>>> >>>> - i915_gem_request_unreference(submit_req); >>>> + if (submit_req->ctx != ring->default_context) >>>> + intel_lr_context_unpin(submit_req); >>>> + >>>> + i915_gem_request_unreference(submit_req); >>>> + } >>>> + spin_unlock_irq(&ring->execlist_lock); >>> >>> Can't this, in theory at least, end up calling >>> drm_gem_object_unreference in >>> a couple of code paths, which can sleep, while holding a spinlock? >>> >>> If this cannot happen in practice for some reason it would probably >>> be good >>> to put a comment explaining it. >>> >>> Or I am missing something? >> >> It can indeed I think. Tvrtko, since I'm off to KS next week and you have >> push access, can you pls push the revert if this checks out? > > I've discussed it with Tvrtko and one solution that seems reasonable is > to hold the execlist lock while traversing the execlist queue and moving > the elements over to a temporary list. After that we can drop the > execlist lock, traverse the temporary list and unreference the elements > while not holding lock. That way we can get rid of the element in a > synchronized fashion and also unreference without holding the lock. 
I > believe this pattern is used elsewhere (such as during request retirement) > > Does that sound like a plan or does anyone have a problem with that > solution? > > Thanks, > Tomas I think the original is safe, because the execlist queue /shouldn't/ be the last reference. But that could change, and in any case in a possibly broken state such as may have caused the hang we probably shouldn't rely on it. So I think the staged-dequeue-and-release is a better solution :) .Dave. _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 68+ messages in thread
* [PATCH v3 7/8] drm/i915: Grab execlist spinlock to avoid post-reset concurrency issues. 2015-10-19 15:32 ` [PATCH v2 " Tomas Elf 2015-10-22 16:49 ` Dave Gordon 2015-10-23 8:42 ` Tvrtko Ursulin @ 2015-10-23 13:08 ` Tomas Elf 2015-10-23 14:53 ` Daniel, Thomas 2015-10-23 17:02 ` [PATCH] drm/i915: Update to post-reset execlist queue clean-up Tomas Elf 3 siblings, 1 reply; 68+ messages in thread From: Tomas Elf @ 2015-10-23 13:08 UTC (permalink / raw) To: Intel-GFX Grab execlist lock when cleaning up execlist queues after GPU reset to avoid concurrency problems between the context event interrupt handler and the reset path immediately following a GPU reset. * v2 (Chris Wilson): Do execlist check and use simpler form of spinlock functions. * v3 (Tvrtko Ursulin): Don't hold the execlist_lock while dereferencing the requests since that might lead to sleep while IRQs are disabled. Signed-off-by: Tomas Elf <tomas.elf@intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> --- drivers/gpu/drm/i915/i915_gem.c | 16 +++++++--------- 1 file changed, 7 insertions(+), 9 deletions(-) diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c index e57061a..b492603 100644 --- a/drivers/gpu/drm/i915/i915_gem.c +++ b/drivers/gpu/drm/i915/i915_gem.c @@ -2753,18 +2753,16 @@ static void i915_gem_reset_ring_cleanup(struct drm_i915_private *dev_priv, * are the ones that keep the context and ringbuffer backing objects * pinned in place. 
*/ - while (!list_empty(&ring->execlist_queue)) { - struct drm_i915_gem_request *submit_req; - submit_req = list_first_entry(&ring->execlist_queue, - struct drm_i915_gem_request, - execlist_link); - list_del(&submit_req->execlist_link); + if (i915.enable_execlists) { + spin_lock_irq(&ring->execlist_lock); - if (submit_req->ctx != ring->default_context) - intel_lr_context_unpin(submit_req); + /* list_splice_tail_init checks for empty lists */ + list_splice_tail_init(&ring->execlist_queue, + &ring->execlist_retired_req_list); - i915_gem_request_unreference(submit_req); + spin_unlock_irq(&ring->execlist_lock); + intel_execlists_retire_requests(ring); } /* -- 1.9.1 _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply related [flat|nested] 68+ messages in thread
* Re: [PATCH v3 7/8] drm/i915: Grab execlist spinlock to avoid post-reset concurrency issues. 2015-10-23 13:08 ` [PATCH v3 " Tomas Elf @ 2015-10-23 14:53 ` Daniel, Thomas 0 siblings, 0 replies; 68+ messages in thread From: Daniel, Thomas @ 2015-10-23 14:53 UTC (permalink / raw) To: Elf, Tomas, Intel-GFX > -----Original Message----- > From: Intel-gfx [mailto:intel-gfx-bounces@lists.freedesktop.org] On Behalf Of > Tomas Elf > Sent: Friday, October 23, 2015 2:09 PM > To: Intel-GFX@Lists.FreeDesktop.Org > Subject: [Intel-gfx] [PATCH v3 7/8] drm/i915: Grab execlist spinlock to avoid > post-reset concurrency issues. > > Grab execlist lock when cleaning up execlist queues after GPU reset to avoid > concurrency problems between the context event interrupt handler and the > reset > path immediately following a GPU reset. > > * v2 (Chris Wilson): > Do execlist check and use simpler form of spinlock functions. > > * v3 (Tvrtko Ursulin): > Don't hold the execlist_lock while dereferencing the requests since that might > lead to sleep while IRQs are disabled. > > Signed-off-by: Tomas Elf <tomas.elf@intel.com> > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> > Cc: Chris Wilson <chris@chris-wilson.co.uk> > > --- > drivers/gpu/drm/i915/i915_gem.c | 16 +++++++--------- > 1 file changed, 7 insertions(+), 9 deletions(-) > > diff --git a/drivers/gpu/drm/i915/i915_gem.c > b/drivers/gpu/drm/i915/i915_gem.c > index e57061a..b492603 100644 > --- a/drivers/gpu/drm/i915/i915_gem.c > +++ b/drivers/gpu/drm/i915/i915_gem.c > @@ -2753,18 +2753,16 @@ static void i915_gem_reset_ring_cleanup(struct > drm_i915_private *dev_priv, > * are the ones that keep the context and ringbuffer backing objects > * pinned in place. 
> */ > - while (!list_empty(&ring->execlist_queue)) { > - struct drm_i915_gem_request *submit_req; > > - submit_req = list_first_entry(&ring->execlist_queue, > - struct drm_i915_gem_request, > - execlist_link); > - list_del(&submit_req->execlist_link); > + if (i915.enable_execlists) { > + spin_lock_irq(&ring->execlist_lock); > > - if (submit_req->ctx != ring->default_context) > - intel_lr_context_unpin(submit_req); > + /* list_splice_tail_init checks for empty lists */ > + list_splice_tail_init(&ring->execlist_queue, > + &ring->execlist_retired_req_list); > > - i915_gem_request_unreference(submit_req); > + spin_unlock_irq(&ring->execlist_lock); > + intel_execlists_retire_requests(ring); > } > > /* > -- > 1.9.1 LGTM Reviewed-by: Thomas Daniel <thomas.daniel@intel.com> _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 68+ messages in thread
* [PATCH] drm/i915: Update to post-reset execlist queue clean-up 2015-10-19 15:32 ` [PATCH v2 " Tomas Elf ` (2 preceding siblings ...) 2015-10-23 13:08 ` [PATCH v3 " Tomas Elf @ 2015-10-23 17:02 ` Tomas Elf 2015-12-01 11:46 ` Tvrtko Ursulin 3 siblings, 1 reply; 68+ messages in thread From: Tomas Elf @ 2015-10-23 17:02 UTC (permalink / raw) To: Intel-GFX When clearing an execlist queue, instead of traversing it and unreferencing all requests while holding the spinlock (which might lead to a thread sleeping while IRQs are turned off - bad news!), just move all requests to the retire request list while holding the spinlock, then drop the spinlock and invoke the execlists request retirement path, which already deals with the intricacies of purging/dereferencing execlist queue requests. This patch can be considered v3 of: commit b96db8b81c54ef30485ddb5992d63305d86ea8d3 Author: Tomas Elf <tomas.elf@intel.com> drm/i915: Grab execlist spinlock to avoid post-reset concurrency issues This patch assumes v2 of the above patch is part of the baseline, reverts v2 and adds changes on top to turn it into v3. 
Signed-off-by: Tomas Elf <tomas.elf@intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> --- drivers/gpu/drm/i915/i915_gem.c | 15 ++++----------- 1 file changed, 4 insertions(+), 11 deletions(-) diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c index 2c7a0b7..b492603 100644 --- a/drivers/gpu/drm/i915/i915_gem.c +++ b/drivers/gpu/drm/i915/i915_gem.c @@ -2756,20 +2756,13 @@ static void i915_gem_reset_ring_cleanup(struct drm_i915_private *dev_priv, if (i915.enable_execlists) { spin_lock_irq(&ring->execlist_lock); - while (!list_empty(&ring->execlist_queue)) { - struct drm_i915_gem_request *submit_req; - submit_req = list_first_entry(&ring->execlist_queue, - struct drm_i915_gem_request, - execlist_link); - list_del(&submit_req->execlist_link); + /* list_splice_tail_init checks for empty lists */ + list_splice_tail_init(&ring->execlist_queue, + &ring->execlist_retired_req_list); - if (submit_req->ctx != ring->default_context) - intel_lr_context_unpin(submit_req); - - i915_gem_request_unreference(submit_req); - } spin_unlock_irq(&ring->execlist_lock); + intel_execlists_retire_requests(ring); } /* -- 1.9.1 _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply related [flat|nested] 68+ messages in thread
* Re: [PATCH] drm/i915: Update to post-reset execlist queue clean-up 2015-10-23 17:02 ` [PATCH] drm/i915: Update to post-reset execlist queue clean-up Tomas Elf @ 2015-12-01 11:46 ` Tvrtko Ursulin 2015-12-11 14:14 ` Dave Gordon 0 siblings, 1 reply; 68+ messages in thread From: Tvrtko Ursulin @ 2015-12-01 11:46 UTC (permalink / raw) To: Intel-GFX On 23/10/15 18:02, Tomas Elf wrote: > When clearing an execlist queue, instead of traversing it and unreferencing all > requests while holding the spinlock (which might lead to thread sleeping with > IRQs are turned off - bad news!), just move all requests to the retire request > list while holding spinlock and then drop spinlock and invoke the execlists > request retirement path, which already deals with the intricacies of > purging/dereferencing execlist queue requests. > > This patch can be considered v3 of: > > commit b96db8b81c54ef30485ddb5992d63305d86ea8d3 > Author: Tomas Elf <tomas.elf@intel.com> > drm/i915: Grab execlist spinlock to avoid post-reset concurrency issues > > This patch assumes v2 of the above patch is part of the baseline, reverts v2 > and adds changes on top to turn it into v3. 
> > Signed-off-by: Tomas Elf <tomas.elf@intel.com> > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> > Cc: Chris Wilson <chris@chris-wilson.co.uk> > --- > drivers/gpu/drm/i915/i915_gem.c | 15 ++++----------- > 1 file changed, 4 insertions(+), 11 deletions(-) > > diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c > index 2c7a0b7..b492603 100644 > --- a/drivers/gpu/drm/i915/i915_gem.c > +++ b/drivers/gpu/drm/i915/i915_gem.c > @@ -2756,20 +2756,13 @@ static void i915_gem_reset_ring_cleanup(struct drm_i915_private *dev_priv, > > if (i915.enable_execlists) { > spin_lock_irq(&ring->execlist_lock); > - while (!list_empty(&ring->execlist_queue)) { > - struct drm_i915_gem_request *submit_req; > > - submit_req = list_first_entry(&ring->execlist_queue, > - struct drm_i915_gem_request, > - execlist_link); > - list_del(&submit_req->execlist_link); > + /* list_splice_tail_init checks for empty lists */ > + list_splice_tail_init(&ring->execlist_queue, > + &ring->execlist_retired_req_list); > > - if (submit_req->ctx != ring->default_context) > - intel_lr_context_unpin(submit_req); > - > - i915_gem_request_unreference(submit_req); > - } > spin_unlock_irq(&ring->execlist_lock); > + intel_execlists_retire_requests(ring); > } > > /* > Fallen through the cracks.. This looks to be even more serious, since lockdep notices possible deadlock involving vmap_area_lock:

 Possible interrupt unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(vmap_area_lock);
                               local_irq_disable();
                               lock(&(&ring->execlist_lock)->rlock);
                               lock(vmap_area_lock);
  <Interrupt>
    lock(&(&ring->execlist_lock)->rlock);

 *** DEADLOCK ***

Because it unpins LRC context and ringbuffer which ends up in the VM code under the execlist_lock. intel_execlists_retire_requests is slightly different from the code in the reset handler because it concerns itself with ctx_obj existence which the other one doesn't. Could people more knowledgeable of this code check if it is OK and R-B? 
Regards, Tvrtko _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH] drm/i915: Update to post-reset execlist queue clean-up 2015-12-01 11:46 ` Tvrtko Ursulin @ 2015-12-11 14:14 ` Dave Gordon 2015-12-11 16:40 ` Daniel Vetter 2015-12-14 10:21 ` Mika Kuoppala 0 siblings, 2 replies; 68+ messages in thread From: Dave Gordon @ 2015-12-11 14:14 UTC (permalink / raw) To: Tvrtko Ursulin, Intel-GFX, Daniel, Thomas, Daniel Vetter On 01/12/15 11:46, Tvrtko Ursulin wrote: > > On 23/10/15 18:02, Tomas Elf wrote: >> When clearing an execlist queue, instead of traversing it and >> unreferencing all >> requests while holding the spinlock (which might lead to thread >> sleeping with >> IRQs are turned off - bad news!), just move all requests to the retire >> request >> list while holding spinlock and then drop spinlock and invoke the >> execlists >> request retirement path, which already deals with the intricacies of >> purging/dereferencing execlist queue requests. >> >> This patch can be considered v3 of: >> >> commit b96db8b81c54ef30485ddb5992d63305d86ea8d3 >> Author: Tomas Elf <tomas.elf@intel.com> >> drm/i915: Grab execlist spinlock to avoid post-reset concurrency >> issues >> >> This patch assumes v2 of the above patch is part of the baseline, >> reverts v2 >> and adds changes on top to turn it into v3. 
>> >> Signed-off-by: Tomas Elf <tomas.elf@intel.com> >> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> >> Cc: Chris Wilson <chris@chris-wilson.co.uk> >> --- >> drivers/gpu/drm/i915/i915_gem.c | 15 ++++----------- >> 1 file changed, 4 insertions(+), 11 deletions(-) >> >> diff --git a/drivers/gpu/drm/i915/i915_gem.c >> b/drivers/gpu/drm/i915/i915_gem.c >> index 2c7a0b7..b492603 100644 >> --- a/drivers/gpu/drm/i915/i915_gem.c >> +++ b/drivers/gpu/drm/i915/i915_gem.c >> @@ -2756,20 +2756,13 @@ static void i915_gem_reset_ring_cleanup(struct >> drm_i915_private *dev_priv, >> >> if (i915.enable_execlists) { >> spin_lock_irq(&ring->execlist_lock); >> - while (!list_empty(&ring->execlist_queue)) { >> - struct drm_i915_gem_request *submit_req; >> >> - submit_req = list_first_entry(&ring->execlist_queue, >> - struct drm_i915_gem_request, >> - execlist_link); >> - list_del(&submit_req->execlist_link); >> + /* list_splice_tail_init checks for empty lists */ >> + list_splice_tail_init(&ring->execlist_queue, >> + &ring->execlist_retired_req_list); >> >> - if (submit_req->ctx != ring->default_context) >> - intel_lr_context_unpin(submit_req); >> - >> - i915_gem_request_unreference(submit_req); >> - } >> spin_unlock_irq(&ring->execlist_lock); >> + intel_execlists_retire_requests(ring); >> } >> >> /* > > Fallen through the cracks.. > > This looks to be even more serious, since lockdep notices possible > deadlock involving vmap_area_lock: > > Possible interrupt unsafe locking scenario: > > CPU0 CPU1 > ---- ---- > lock(vmap_area_lock); > local_irq_disable(); > lock(&(&ring->execlist_lock)->rlock); > lock(vmap_area_lock); > <Interrupt> > lock(&(&ring->execlist_lock)->rlock); > > *** DEADLOCK *** > > Because it unpins LRC context and ringbuffer which ends up in the VM > code under the execlist_lock. > > intel_execlists_retire_requests is slightly different from the code in > the reset handler because it concerns itself with ctx_obj existence > which the other one doesn't. 
> > Could people more knowledgeable of this code check if it is OK and R-B? > > Regards, > > Tvrtko Hi Tvrtko, I didn't understand this message at first: I thought you'd found a problem with this ("v3") patch, but now I see that what you actually meant is that there is indeed a problem with the (v2) that got merged - not the issue originally raised (unreferencing an object while holding a spinlock, since it can't be the last reference), but rather the unpin, which can indeed cause a problem with a non-i915-defined kernel lock. So we should certainly update the current (v2) upstream with this. Thomas Daniel already R-B'd this code on 23rd October, when it was: [PATCH v3 7/8] drm/i915: Grab execlist spinlock to avoid post-reset concurrency issues. and it hasn't changed in substance since then, so you can carry his R-B over, plus I said on that same day that this was a better solution. So: Reviewed-by: Thomas Daniel <thomas.daniel@intel.com> Reviewed-by: Dave Gordon <dave.gordon@intel.com> _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH] drm/i915: Update to post-reset execlist queue clean-up 2015-12-11 14:14 ` Dave Gordon @ 2015-12-11 16:40 ` Daniel Vetter 2015-12-14 10:21 ` Mika Kuoppala 1 sibling, 0 replies; 68+ messages in thread From: Daniel Vetter @ 2015-12-11 16:40 UTC (permalink / raw) To: Dave Gordon; +Cc: Intel-GFX On Fri, Dec 11, 2015 at 02:14:00PM +0000, Dave Gordon wrote: > On 01/12/15 11:46, Tvrtko Ursulin wrote: > > > >On 23/10/15 18:02, Tomas Elf wrote: > >>When clearing an execlist queue, instead of traversing it and > >>unreferencing all > >>requests while holding the spinlock (which might lead to thread > >>sleeping with > >>IRQs are turned off - bad news!), just move all requests to the retire > >>request > >>list while holding spinlock and then drop spinlock and invoke the > >>execlists > >>request retirement path, which already deals with the intricacies of > >>purging/dereferencing execlist queue requests. > >> > >>This patch can be considered v3 of: > >> > >> commit b96db8b81c54ef30485ddb5992d63305d86ea8d3 > >> Author: Tomas Elf <tomas.elf@intel.com> > >> drm/i915: Grab execlist spinlock to avoid post-reset concurrency > >>issues > >> > >>This patch assumes v2 of the above patch is part of the baseline, > >>reverts v2 > >>and adds changes on top to turn it into v3. 
> >> > >>Signed-off-by: Tomas Elf <tomas.elf@intel.com> > >>Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> > >>Cc: Chris Wilson <chris@chris-wilson.co.uk> > >>--- > >> drivers/gpu/drm/i915/i915_gem.c | 15 ++++----------- > >> 1 file changed, 4 insertions(+), 11 deletions(-) > >> > >>diff --git a/drivers/gpu/drm/i915/i915_gem.c > >>b/drivers/gpu/drm/i915/i915_gem.c > >>index 2c7a0b7..b492603 100644 > >>--- a/drivers/gpu/drm/i915/i915_gem.c > >>+++ b/drivers/gpu/drm/i915/i915_gem.c > >>@@ -2756,20 +2756,13 @@ static void i915_gem_reset_ring_cleanup(struct > >>drm_i915_private *dev_priv, > >> > >> if (i915.enable_execlists) { > >> spin_lock_irq(&ring->execlist_lock); > >>- while (!list_empty(&ring->execlist_queue)) { > >>- struct drm_i915_gem_request *submit_req; > >> > >>- submit_req = list_first_entry(&ring->execlist_queue, > >>- struct drm_i915_gem_request, > >>- execlist_link); > >>- list_del(&submit_req->execlist_link); > >>+ /* list_splice_tail_init checks for empty lists */ > >>+ list_splice_tail_init(&ring->execlist_queue, > >>+ &ring->execlist_retired_req_list); > >> > >>- if (submit_req->ctx != ring->default_context) > >>- intel_lr_context_unpin(submit_req); > >>- > >>- i915_gem_request_unreference(submit_req); > >>- } > >> spin_unlock_irq(&ring->execlist_lock); > >>+ intel_execlists_retire_requests(ring); > >> } > >> > >> /* > > > >Fallen through the cracks.. > > > >This looks to be even more serious, since lockdep notices possible > >deadlock involving vmap_area_lock: > > > > Possible interrupt unsafe locking scenario: > > > > CPU0 CPU1 > > ---- ---- > > lock(vmap_area_lock); > > local_irq_disable(); > > lock(&(&ring->execlist_lock)->rlock); > > lock(vmap_area_lock); > > <Interrupt> > > lock(&(&ring->execlist_lock)->rlock); > > > > *** DEADLOCK *** > > > >Because it unpins LRC context and ringbuffer which ends up in the VM > >code under the execlist_lock. 
> > > >intel_execlists_retire_requests is slightly different from the code in > >the reset handler because it concerns itself with ctx_obj existence > >which the other one doesn't. > > > >Could people more knowledgeable of this code check if it is OK and R-B? > > > >Regards, > > > >Tvrtko > > Hi Tvrtko, > > I didn't understand this message at first, I thought you'd found a problem > with this ("v3") patch, but now I see what you actually meant is that there > is indeed a problem with the (v2) that got merged, not the original question > about unreferencing an object while holding a spinlock (because it can't be > the last reference), but rather because of the unpin, which can indeed cause > a problem with a non-i915-defined kernel lock. > > So we should certainly update the current (v2) upstream with this. > Thomas Daniel already R-B'd this code on 23rd October, when it was: > > [PATCH v3 7/8] drm/i915: Grab execlist spinlock to avoid post-reset > concurrency issues. > > and it hasn't changed in substance since then, so you can carry his R-B > over, plus I said on that same day that this was a better solution. So: > > Reviewed-by: Thomas Daniel <thomas.daniel@intel.com> > Reviewed-by: Dave Gordon <dave.gordon@intel.com> Indeed, fell through the cracks more than once :( Sorry about that, picked up now. -Daniel -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH] drm/i915: Update to post-reset execlist queue clean-up 2015-12-11 14:14 ` Dave Gordon 2015-12-11 16:40 ` Daniel Vetter @ 2015-12-14 10:21 ` Mika Kuoppala 1 sibling, 0 replies; 68+ messages in thread From: Mika Kuoppala @ 2015-12-14 10:21 UTC (permalink / raw) To: Dave Gordon, Tvrtko Ursulin, Intel-GFX, Daniel, Thomas, Daniel Vetter Dave Gordon <david.s.gordon@intel.com> writes: > On 01/12/15 11:46, Tvrtko Ursulin wrote: >> >> On 23/10/15 18:02, Tomas Elf wrote: >>> When clearing an execlist queue, instead of traversing it and >>> unreferencing all >>> requests while holding the spinlock (which might lead to thread >>> sleeping with >>> IRQs are turned off - bad news!), just move all requests to the retire >>> request >>> list while holding spinlock and then drop spinlock and invoke the >>> execlists >>> request retirement path, which already deals with the intricacies of >>> purging/dereferencing execlist queue requests. >>> >>> This patch can be considered v3 of: >>> >>> commit b96db8b81c54ef30485ddb5992d63305d86ea8d3 >>> Author: Tomas Elf <tomas.elf@intel.com> >>> drm/i915: Grab execlist spinlock to avoid post-reset concurrency >>> issues >>> >>> This patch assumes v2 of the above patch is part of the baseline, >>> reverts v2 >>> and adds changes on top to turn it into v3. 
>>> >>> Signed-off-by: Tomas Elf <tomas.elf@intel.com> >>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> >>> Cc: Chris Wilson <chris@chris-wilson.co.uk> >>> --- >>> drivers/gpu/drm/i915/i915_gem.c | 15 ++++----------- >>> 1 file changed, 4 insertions(+), 11 deletions(-) >>> >>> diff --git a/drivers/gpu/drm/i915/i915_gem.c >>> b/drivers/gpu/drm/i915/i915_gem.c >>> index 2c7a0b7..b492603 100644 >>> --- a/drivers/gpu/drm/i915/i915_gem.c >>> +++ b/drivers/gpu/drm/i915/i915_gem.c >>> @@ -2756,20 +2756,13 @@ static void i915_gem_reset_ring_cleanup(struct >>> drm_i915_private *dev_priv, >>> >>> if (i915.enable_execlists) { >>> spin_lock_irq(&ring->execlist_lock); >>> - while (!list_empty(&ring->execlist_queue)) { >>> - struct drm_i915_gem_request *submit_req; >>> >>> - submit_req = list_first_entry(&ring->execlist_queue, >>> - struct drm_i915_gem_request, >>> - execlist_link); >>> - list_del(&submit_req->execlist_link); >>> + /* list_splice_tail_init checks for empty lists */ >>> + list_splice_tail_init(&ring->execlist_queue, >>> + &ring->execlist_retired_req_list); >>> >>> - if (submit_req->ctx != ring->default_context) >>> - intel_lr_context_unpin(submit_req); >>> - >>> - i915_gem_request_unreference(submit_req); >>> - } >>> spin_unlock_irq(&ring->execlist_lock); >>> + intel_execlists_retire_requests(ring); >>> } >>> >>> /* >> >> Fallen through the cracks.. >> >> This looks to be even more serious, since lockdep notices possible >> deadlock involving vmap_area_lock: >> >> Possible interrupt unsafe locking scenario: >> >> CPU0 CPU1 >> ---- ---- >> lock(vmap_area_lock); >> local_irq_disable(); >> lock(&(&ring->execlist_lock)->rlock); >> lock(vmap_area_lock); >> <Interrupt> >> lock(&(&ring->execlist_lock)->rlock); >> >> *** DEADLOCK *** >> >> Because it unpins LRC context and ringbuffer which ends up in the VM >> code under the execlist_lock. 
>> >> intel_execlists_retire_requests is slightly different from the code in >> the reset handler because it concerns itself with ctx_obj existence >> which the other one doesn't. >> >> Could people more knowledgeable of this code check if it is OK and R-B? >> >> Regards, >> >> Tvrtko > > Hi Tvrtko, > > I didn't understand this message at first, I thought you'd found a > problem with this ("v3") patch, but now I see what you actually meant is > that there is indeed a problem with the (v2) that got merged, not the > original question about unreferencing an object while holding a spinlock > (because it can't be the last reference), but rather because of the > unpin, which can indeed cause a problem with a non-i915-defined kernel lock. > > So we should certainly update the current (v2) upstream with this. > Thomas Daniel already R-B'd this code on 23rd October, when it was: > > [PATCH v3 7/8] drm/i915: Grab execlist spinlock to avoid post-reset > concurrency issues. > > and it hasn't changed in substance since then, so you can carry his R-B > over, plus I said on that same day that this was a better solution. So: > > Reviewed-by: Thomas Daniel <thomas.daniel@intel.com> > Reviewed-by: Dave Gordon <dave.gordon@intel.com> > The BAT farm did encounter this a few weeks back, so it was vaguely registered. But I just failed to review it in time. Thanks for pushing it forward, -Mika > _______________________________________________ > Intel-gfx mailing list > Intel-gfx@lists.freedesktop.org > http://lists.freedesktop.org/mailman/listinfo/intel-gfx _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 68+ messages in thread
* [PATCH 8/8] drm/i915: NULL check of unpin_work
From: Tomas Elf @ 2015-10-08 18:31 UTC (permalink / raw)
To: Intel-GFX

Avoid NULL pointer exceptions in the display driver for certain critical cases
when unpin_work has turned out to be NULL.

Signed-off-by: Tomas Elf <tomas.elf@intel.com>
---
 drivers/gpu/drm/i915/intel_display.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
index 539c373..421bc80 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -10853,6 +10853,8 @@ static inline void intel_mark_page_flip_active(struct intel_crtc *intel_crtc)
 {
 	/* Ensure that the work item is consistent when activating it ... */
 	smp_wmb();
+	if (!intel_crtc->unpin_work)
+		return;
 	atomic_set(&intel_crtc->unpin_work->pending, INTEL_FLIP_PENDING);
 	/* and that it is marked active as soon as the irq could fire. */
 	smp_wmb();
@@ -11187,6 +11189,9 @@ static void ilk_do_mmio_flip(struct intel_crtc *intel_crtc)
 
 	I915_WRITE(reg, dspcntr);
 
+	if (!intel_crtc->unpin_work)
+		return;
+
 	I915_WRITE(DSPSURF(intel_crtc->plane),
 		   intel_crtc->unpin_work->gtt_offset);
 	POSTING_READ(DSPSURF(intel_crtc->plane));
--
1.9.1
* Re: [PATCH 8/8] drm/i915: NULL check of unpin_work
From: Chris Wilson @ 2015-10-09 7:46 UTC (permalink / raw)
To: Tomas Elf; +Cc: Intel-GFX

On Thu, Oct 08, 2015 at 07:31:40PM +0100, Tomas Elf wrote:
> Avoid NULL pointer exceptions in the display driver for certain critical cases
> when unpin_work has turned out to be NULL.

Nope, the machine reached a point where it cannot possibly reach, we
want the OOPS.
-Chris
--
Chris Wilson, Intel Open Source Technology Centre
* Re: [PATCH 8/8] drm/i915: NULL check of unpin_work
From: Daniel Vetter @ 2015-10-09 8:39 UTC (permalink / raw)
To: Chris Wilson, Tomas Elf, Intel-GFX

On Fri, Oct 09, 2015 at 08:46:31AM +0100, Chris Wilson wrote:
> On Thu, Oct 08, 2015 at 07:31:40PM +0100, Tomas Elf wrote:
> > Avoid NULL pointer exceptions in the display driver for certain critical cases
> > when unpin_work has turned out to be NULL.
>
> Nope, the machine reached a point where it cannot possibly reach, we
> want the OOPS.

Yeah, this looks funky ... how did you get to that state? If it indeed
blows up in testing then we'd need at least a WARN_ON (and hopefully
some ideas what's going wrong).
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
* Re: [PATCH 8/8] drm/i915: NULL check of unpin_work
From: Tomas Elf @ 2015-10-09 11:50 UTC (permalink / raw)
To: Daniel Vetter, Chris Wilson, Intel-GFX

On 09/10/2015 09:39, Daniel Vetter wrote:
> On Fri, Oct 09, 2015 at 08:46:31AM +0100, Chris Wilson wrote:
>> On Thu, Oct 08, 2015 at 07:31:40PM +0100, Tomas Elf wrote:
>>> Avoid NULL pointer exceptions in the display driver for certain critical cases
>>> when unpin_work has turned out to be NULL.
>>
>> Nope, the machine reached a point where it cannot possibly reach, we
>> want the OOPS.
>
> Yeah this looks funky ... how did you get to that state? If it indeed
> blows up in testing then we'd need at least a WARN_ON (and hopefully some
> ideas what's going wrong).
> -Daniel
>

I have no idea, but I know that it blew up my machine and ended my test
ahead of time. Adding this check allows me to test for 12+ hours, which
is what I wanted to do.

Sure, I'll add a WARN_ON. I don't know how much time I have to
investigate the root cause considering the whole TDR upstreaming
situation.

Thanks,
Tomas
* Re: [PATCH 8/8] drm/i915: NULL check of unpin_work
From: Tomas Elf @ 2015-10-09 10:30 UTC (permalink / raw)
To: Chris Wilson, Intel-GFX

On 09/10/2015 08:46, Chris Wilson wrote:
> On Thu, Oct 08, 2015 at 07:31:40PM +0100, Tomas Elf wrote:
>> Avoid NULL pointer exceptions in the display driver for certain critical cases
>> when unpin_work has turned out to be NULL.
>
> Nope, the machine reached a point where it cannot possibly reach, we
> want the OOPS.
> -Chris
>

Really? Because if I have this in there I can actually run the
long-duration stability tests for 12+ hours rather than just a few
hours (it doesn't happen all the time, but I've seen it at least once).
But, hey, if you want to investigate the reason why we have a NULL there
then go ahead. I guess I'll just have to carry this patch for as long as
my machine crashes during my TDR testing.

Thanks,
Tomas
* Re: [PATCH 8/8] drm/i915: NULL check of unpin_work
From: Chris Wilson @ 2015-10-09 10:44 UTC (permalink / raw)
To: Tomas Elf; +Cc: Intel-GFX

On Fri, Oct 09, 2015 at 11:30:34AM +0100, Tomas Elf wrote:
> On 09/10/2015 08:46, Chris Wilson wrote:
> >On Thu, Oct 08, 2015 at 07:31:40PM +0100, Tomas Elf wrote:
> >>Avoid NULL pointer exceptions in the display driver for certain critical cases
> >>when unpin_work has turned out to be NULL.
> >
> >Nope, the machine reached a point where it cannot possibly reach, we
> >want the OOPS.
> >-Chris
>
> Really? Because if I have this in there I can actually run the
> long-duration stability tests for 12+ hours rather than just a few
> hours (it doesn't happen all the time but I've seen it at least
> once). But, hey, if you want to investigate the reason why we have a
> NULL there then go ahead. I guess I'll just have to carry this patch
> for as long as my machine crashes during my TDR testing.

You've sat on a critical bug for how long? There's been one recent
report that appears to be fallout from Tvrtko's context cleanup, but
nothing older, in public at least.
-Chris
--
Chris Wilson, Intel Open Source Technology Centre
* Re: [PATCH 8/8] drm/i915: NULL check of unpin_work
From: Tomas Elf @ 2015-10-09 12:06 UTC (permalink / raw)
To: Chris Wilson, Intel-GFX

On 09/10/2015 11:44, Chris Wilson wrote:
> On Fri, Oct 09, 2015 at 11:30:34AM +0100, Tomas Elf wrote:
>> On 09/10/2015 08:46, Chris Wilson wrote:
>>> On Thu, Oct 08, 2015 at 07:31:40PM +0100, Tomas Elf wrote:
>>>> Avoid NULL pointer exceptions in the display driver for certain critical cases
>>>> when unpin_work has turned out to be NULL.
>>>
>>> Nope, the machine reached a point where it cannot possibly reach, we
>>> want the OOPS.
>>> -Chris
>>
>> Really? Because if I have this in there I can actually run the
>> long-duration stability tests for 12+ hours rather than just a few
>> hours (it doesn't happen all the time but I've seen it at least
>> once). But, hey, if you want to investigate the reason why we have a
>> NULL there then go ahead. I guess I'll just have to carry this patch
>> for as long as my machine crashes during my TDR testing.
>
> You've sat on a critical bug for how long? There's been one recent
> report that appears to be fallout from Tvrtko's context cleanup, but
> nothing older, in public at least.
> -Chris
>

Do people typically try to actively hang their machines several times a
minute for 12+ hours at a time? If not, then maybe that's why they
haven't seen this.

I haven't sat on anything; this has been part of my error state capture
improvement patches, which I've intended to upstream for several months
now. I consider this part of the overall stability issue and grouped it
together with all the other patches necessary to make the machine not
crash for the duration of my test. I would've upstreamed this series a
long time ago if I had actually been given time to work on it, but
that's another discussion.

Thanks,
Tomas
* Re: [PATCH 8/8] drm/i915: NULL check of unpin_work
From: Daniel Vetter @ 2015-10-13 11:51 UTC (permalink / raw)
To: Tomas Elf; +Cc: Intel-GFX

On Fri, Oct 09, 2015 at 01:06:24PM +0100, Tomas Elf wrote:
> On 09/10/2015 11:44, Chris Wilson wrote:
> >On Fri, Oct 09, 2015 at 11:30:34AM +0100, Tomas Elf wrote:
> >>On 09/10/2015 08:46, Chris Wilson wrote:
> >>>On Thu, Oct 08, 2015 at 07:31:40PM +0100, Tomas Elf wrote:
> >>>>Avoid NULL pointer exceptions in the display driver for certain critical cases
> >>>>when unpin_work has turned out to be NULL.
> >>>
> >>>Nope, the machine reached a point where it cannot possibly reach, we
> >>>want the OOPS.
> >>>-Chris
> >>
> >>Really? Because if I have this in there I can actually run the
> >>long-duration stability tests for 12+ hours rather than just a few
> >>hours (it doesn't happen all the time but I've seen it at least
> >>once). But, hey, if you want to investigate the reason why we have a
> >>NULL there then go ahead. I guess I'll just have to carry this patch
> >>for as long as my machine crashes during my TDR testing.
> >
> >You've sat on a critical bug for how long? There's been one recent
> >report that appears to be fallout from Tvrtko's context cleanup, but
> >nothing older, in public at least.
> >-Chris
>
> Do people typically try to actively hang their machines several times a
> minute for 12+ hours at a time? If not then maybe that's why they haven't
> seen this.

The problem is that the world loves to conspire against races like
these, and somewhere out there is a machine which hits this ridiculously
reliably. And if this is a machine sitting on an OEM's product
engineer's bench, we've just lost a sale ;-) Just because you can't
repro it fast doesn't mean that your reproduction rate is the lower
limit.

> I haven't sat on anything, this has been part of my error state capture
> improvement patches which I've intended to upstream for several months now.
> I consider this part of the overall stability issue and grouped it together
> with all the other patches necessary to make the machine not crash for the
> duration of my test. I would've upstreamed this series a long time ago if I
> had actually been given time to work on it but that's another discussion.

I guess Chris' reaction is because we've had tons of cases where
bugfixes like this were stuck in Android trees because they were tied up
with some feature work. But I also know it's really hard to spot them,
so in case of doubt please upstream small fixes aggressively. Since this
one here is outside of the error capture itself, and so in code where
our existing assumption (that error capture only runs when the driver
has been dead for seconds) is invalid, it's a prime candidate.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
end of thread, other threads: [~2015-12-14 10:23 UTC | newest]

Thread overview: 68+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-10-08 18:31 [PATCH 0/8] Stability improvements to error state capture Tomas Elf
2015-10-08 18:31 ` [PATCH 1/8] drm/i915: Early exit from semaphore_waits_for for execlist mode Tomas Elf
2015-10-08 18:31 ` [PATCH 2/8] drm/i915: Migrate to safe iterators in error state capture Tomas Elf
2015-10-09  7:49   ` Chris Wilson
2015-10-09 11:38     ` Tomas Elf
2015-10-09  8:27   ` Daniel Vetter
2015-10-09 11:40     ` Tomas Elf
2015-10-13 11:37       ` Daniel Vetter
2015-10-13 11:47         ` Chris Wilson
2015-10-08 18:31 ` [PATCH 3/8] drm/i915: Cope with request list state change during " Tomas Elf
2015-10-09  7:48   ` Chris Wilson
2015-10-09 11:25     ` Tomas Elf
2015-10-13 11:39       ` Daniel Vetter
2015-10-14 11:46         ` Tomas Elf
2015-10-14 12:45           ` Daniel Vetter
2015-10-09  8:28   ` Daniel Vetter
2015-10-09 11:45     ` Tomas Elf
2015-10-13 11:40       ` Daniel Vetter
2015-10-08 18:31 ` [PATCH 4/8] drm/i915: NULL checking when capturing buffer objects " Tomas Elf
2015-10-09  7:49   ` Chris Wilson
2015-10-09 11:34     ` Tomas Elf
2015-10-09  8:32   ` Daniel Vetter
2015-10-09  8:47     ` Chris Wilson
2015-10-09 11:52       ` Tomas Elf
2015-10-09 11:45     ` Tomas Elf
2015-10-08 18:31 ` [PATCH 5/8] drm/i915: vma NULL pointer check Tomas Elf
2015-10-09  7:48   ` Chris Wilson
2015-10-09 11:30     ` Tomas Elf
2015-10-09 11:59       ` Chris Wilson
2015-10-13 11:43         ` Daniel Vetter
2015-10-09  8:33   ` Daniel Vetter
2015-10-09 11:46     ` Tomas Elf
2015-10-08 18:31 ` [PATCH 6/8] drm/i915: Use safe list iterators Tomas Elf
2015-10-09  7:41   ` Chris Wilson
2015-10-09 10:27     ` Tomas Elf
2015-10-09 10:38       ` Chris Wilson
2015-10-09 12:00         ` Tomas Elf
2015-10-08 18:31 ` [PATCH 7/8] drm/i915: Grab execlist spinlock to avoid post-reset concurrency issues Tomas Elf
2015-10-09  7:45   ` Chris Wilson
2015-10-09 10:28     ` Tomas Elf
2015-10-09  8:38   ` Daniel Vetter
2015-10-09  8:45     ` Chris Wilson
2015-10-13 11:46       ` Daniel Vetter
2015-10-13 11:45   ` Chris Wilson
2015-10-13 13:46     ` Daniel Vetter
2015-10-13 14:00       ` Chris Wilson
2015-10-19 15:32   ` [PATCH v2 " Tomas Elf
2015-10-22 16:49     ` Dave Gordon
2015-10-22 17:35       ` Daniel Vetter
2015-10-23  8:42         ` Tvrtko Ursulin
2015-10-23  8:59           ` Daniel Vetter
2015-10-23 11:02             ` Tomas Elf
2015-10-23 12:49               ` Dave Gordon
2015-10-23 13:08   ` [PATCH v3 " Tomas Elf
2015-10-23 14:53     ` Daniel, Thomas
2015-10-23 17:02   ` [PATCH] drm/i915: Update to post-reset execlist queue clean-up Tomas Elf
2015-12-01 11:46     ` Tvrtko Ursulin
2015-12-11 14:14       ` Dave Gordon
2015-12-11 16:40         ` Daniel Vetter
2015-12-14 10:21           ` Mika Kuoppala
2015-10-08 18:31 ` [PATCH 8/8] drm/i915: NULL check of unpin_work Tomas Elf
2015-10-09  7:46   ` Chris Wilson
2015-10-09  8:39     ` Daniel Vetter
2015-10-09 11:50       ` Tomas Elf
2015-10-09 10:30     ` Tomas Elf
2015-10-09 10:44       ` Chris Wilson
2015-10-09 12:06         ` Tomas Elf
2015-10-13 11:51           ` Daniel Vetter