* [PATCH 0/8] Stability improvements to error state capture
@ 2015-10-08 18:31 Tomas Elf
From: Tomas Elf @ 2015-10-08 18:31 UTC
  To: Intel-GFX

In preparation for the upcoming TDR per-engine hang recovery enablement the
stability of the error state capture code needs to be addressed. The biggest
reason for this is that in order to test TDR a long-duration test needs to be
run for several hours during which a large number of hangs is handled together
with the associated error state captures. In its current state the i915 driver
experiences various forms of kernel panics and other kinds of fatal errors
within the first hour(s) of hang testing. The patches in this series have been
tested with long-duration hang testing running for 12+ hours and should suffice
as an initial improvement.

The underlying issue of trying to capture the driver state without
synchronization remains to be fixed. One way of at least further alleviating
this problem, suggested by John Harrison, is to retry mutex_trylock() on the
struct_mutex for a while (give it a second or so) before going into the error
state capture from i915_handle_error(). Then, if nobody is holding the
struct_mutex, the error state capture is considerably safer from sudden state
changes. If some thread has hung while holding the struct_mutex one could at
least hope that there would be no sudden state changes during error state
capture due to the hung state (unless some thread has been caught in a livelock
or is perhaps not stuck at all but is simply running for a very long time -
still, some improvement might be expected here).
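
As an illustration only, the bounded trylock idea above can be sketched in
userspace C with POSIX threads. This is not i915 code: pthread_mutex_trylock()
stands in for the kernel's mutex_trylock(), and trylock_for(),
capture_best_effort(), fake_capture() and capture_calls are hypothetical names
invented for this sketch. The capture runs whether or not the lock was
obtained, which mirrors the best-effort nature of the suggestion.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for the error state capture; it only counts calls. */
static int capture_calls;

static void fake_capture(void)
{
	capture_calls++;
}

/*
 * Poll the mutex with trylock, sleeping 1 ms between attempts, for roughly
 * timeout_ms milliseconds. Returns true if the lock was acquired.
 */
static bool trylock_for(pthread_mutex_t *m, long timeout_ms)
{
	const struct timespec nap = { .tv_sec = 0, .tv_nsec = 1000000L };
	long waited;

	assert(m != NULL);
	for (waited = 0; ; waited++) {
		if (pthread_mutex_trylock(m) == 0)
			return true;
		if (waited >= timeout_ms)
			return false;
		nanosleep(&nap, NULL);
	}
}

/*
 * Try to take the lock for a bounded time, then capture regardless.
 * Returns true if the capture ran with the lock held.
 */
static bool capture_best_effort(pthread_mutex_t *m, long timeout_ms,
				void (*capture)(void))
{
	bool locked = trylock_for(m, timeout_ms);

	capture();	/* proceed even if the lock could not be taken */
	if (locked)
		pthread_mutex_unlock(m);
	return locked;
}
```

A caller would pass the mutex, a timeout of around 1000 ms to match the "second
or so" mentioned above, and the capture routine. The return value only tells
whether the capture was protected by the lock.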

One fix that has been omitted from this patch series concerns the broken ring
space calculation following a full GPU reset. Two independent patches to solve
this are: "[PATCH] drm/i915: Update ring space correctly on lrc context reset"
by Mika Kuoppala and "[51/70] drm/i915: Record the position of the start of the
request" by Chris Wilson. Since the solution is currently in review I'll simply
mention it here as a prerequisite for long-duration operations stability
testing. Without a fix for this problem the ring space is terminally depleted
within the first iterations of the hang test: the ring space is miscalculated
after every GPU hang recovery and traversal of the GEM init hw path, which
gradually leads to a terminally hung state.
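
To make the depletion effect concrete, here is a minimal standalone sketch of
free-space accounting in a circular buffer. It is not the driver's code and
does not claim to reproduce the actual bug: ring_space(), writes_until_stuck()
and RING_RESERVE are hypothetical names, and the failure modelled (a cached
head offset that is never refreshed) is just one assumed way a miscalculation
could make the ring appear to fill up even though the consumer has kept pace.

```c
#include <assert.h>

/* Hypothetical number of bytes always kept free in the ring. */
#define RING_RESERVE 64

/*
 * Free bytes between the producer's tail and the consumer's head in a
 * circular buffer of `size` bytes. head == tail is treated as empty.
 */
static int ring_space(int head, int tail, int size)
{
	int space = head - tail;

	assert(size > RING_RESERVE);
	if (space <= 0)
		space += size;
	return space - RING_RESERVE;
}

/*
 * Count how many writes of `chunk` bytes succeed when the cached head is
 * never resynchronised with the consumer, so consumed space is never
 * credited back to the producer.
 */
static int writes_until_stuck(int size, int chunk)
{
	int cached_head = 0;	/* stale: deliberately never updated */
	int tail = 0;
	int writes = 0;

	while (ring_space(cached_head, tail, size) >= chunk) {
		tail = (tail + chunk) % size;
		writes++;
	}
	return writes;
}
```

With a 4096-byte ring and 1024-byte writes, writes_until_stuck(4096, 1024)
reports only 3 successful writes before the ring looks full, although in this
model nothing is actually blocking the consumer.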

Tomas Elf (8):
  drm/i915: Early exit from semaphore_waits_for for execlist mode.
  drm/i915: Migrate to safe iterators in error state capture
  drm/i915: Cope with request list state change during error state
    capture
  drm/i915: NULL checking when capturing buffer objects during error
    state capture
  drm/i915: vma NULL pointer check
  drm/i915: Use safe list iterators
  drm/i915: Grab execlist spinlock to avoid post-reset concurrency
    issues.
  drm/i915: NULL check of unpin_work

 drivers/gpu/drm/i915/i915_gem.c       | 18 ++++++++---
 drivers/gpu/drm/i915/i915_gpu_error.c | 61 +++++++++++++++++++++++------------
 drivers/gpu/drm/i915/i915_irq.c       | 20 ++++++++++++
 drivers/gpu/drm/i915/intel_display.c  |  5 +++
 4 files changed, 80 insertions(+), 24 deletions(-)

-- 
1.9.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

Thread overview: 68+ messages
2015-10-08 18:31 [PATCH 0/8] Stability improvements to error state capture Tomas Elf
2015-10-08 18:31 ` [PATCH 1/8] drm/i915: Early exit from semaphore_waits_for for execlist mode Tomas Elf
2015-10-08 18:31 ` [PATCH 2/8] drm/i915: Migrate to safe iterators in error state capture Tomas Elf
2015-10-09  7:49   ` Chris Wilson
2015-10-09 11:38     ` Tomas Elf
2015-10-09  8:27   ` Daniel Vetter
2015-10-09 11:40     ` Tomas Elf
2015-10-13 11:37       ` Daniel Vetter
2015-10-13 11:47         ` Chris Wilson
2015-10-08 18:31 ` [PATCH 3/8] drm/i915: Cope with request list state change during " Tomas Elf
2015-10-09  7:48   ` Chris Wilson
2015-10-09 11:25     ` Tomas Elf
2015-10-13 11:39       ` Daniel Vetter
2015-10-14 11:46         ` Tomas Elf
2015-10-14 12:45           ` Daniel Vetter
2015-10-09  8:28   ` Daniel Vetter
2015-10-09 11:45     ` Tomas Elf
2015-10-13 11:40       ` Daniel Vetter
2015-10-08 18:31 ` [PATCH 4/8] drm/i915: NULL checking when capturing buffer objects " Tomas Elf
2015-10-09  7:49   ` Chris Wilson
2015-10-09 11:34     ` Tomas Elf
2015-10-09  8:32   ` Daniel Vetter
2015-10-09  8:47     ` Chris Wilson
2015-10-09 11:52       ` Tomas Elf
2015-10-09 11:45     ` Tomas Elf
2015-10-08 18:31 ` [PATCH 5/8] drm/i915: vma NULL pointer check Tomas Elf
2015-10-09  7:48   ` Chris Wilson
2015-10-09 11:30     ` Tomas Elf
2015-10-09 11:59       ` Chris Wilson
2015-10-13 11:43         ` Daniel Vetter
2015-10-09  8:33   ` Daniel Vetter
2015-10-09 11:46     ` Tomas Elf
2015-10-08 18:31 ` [PATCH 6/8] drm/i915: Use safe list iterators Tomas Elf
2015-10-09  7:41   ` Chris Wilson
2015-10-09 10:27     ` Tomas Elf
2015-10-09 10:38       ` Chris Wilson
2015-10-09 12:00         ` Tomas Elf
2015-10-08 18:31 ` [PATCH 7/8] drm/i915: Grab execlist spinlock to avoid post-reset concurrency issues Tomas Elf
2015-10-09  7:45   ` Chris Wilson
2015-10-09 10:28     ` Tomas Elf
2015-10-09  8:38   ` Daniel Vetter
2015-10-09  8:45     ` Chris Wilson
2015-10-13 11:46       ` Daniel Vetter
2015-10-13 11:45         ` Chris Wilson
2015-10-13 13:46           ` Daniel Vetter
2015-10-13 14:00             ` Chris Wilson
2015-10-19 15:32   ` [PATCH v2 " Tomas Elf
2015-10-22 16:49     ` Dave Gordon
2015-10-22 17:35       ` Daniel Vetter
2015-10-23  8:42     ` Tvrtko Ursulin
2015-10-23  8:59       ` Daniel Vetter
2015-10-23 11:02         ` Tomas Elf
2015-10-23 12:49           ` Dave Gordon
2015-10-23 13:08     ` [PATCH v3 " Tomas Elf
2015-10-23 14:53       ` Daniel, Thomas
2015-10-23 17:02     ` [PATCH] drm/i915: Update to post-reset execlist queue clean-up Tomas Elf
2015-12-01 11:46       ` Tvrtko Ursulin
2015-12-11 14:14         ` Dave Gordon
2015-12-11 16:40           ` Daniel Vetter
2015-12-14 10:21           ` Mika Kuoppala
2015-10-08 18:31 ` [PATCH 8/8] drm/i915: NULL check of unpin_work Tomas Elf
2015-10-09  7:46   ` Chris Wilson
2015-10-09  8:39     ` Daniel Vetter
2015-10-09 11:50       ` Tomas Elf
2015-10-09 10:30     ` Tomas Elf
2015-10-09 10:44       ` Chris Wilson
2015-10-09 12:06         ` Tomas Elf
2015-10-13 11:51           ` Daniel Vetter
