All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/3] Per Engine hang detection and recovery
@ 2013-11-11 14:58 Siluvery, Arun
  2013-11-11 15:31 ` Daniel Vetter
  0 siblings, 1 reply; 4+ messages in thread
From: Siluvery, Arun @ 2013-11-11 14:58 UTC (permalink / raw)
  To: intel-gfx

From: "Siluvery, Arun" <arun.siluvery@intel.com>

This patchset contains changes for Timeout detection and recovery (TDR) which
provides per-engine hang detection and recovery.
The current driver performs full gpu reset in case of a hang, TDR attempts to
only reset the engine that is hung and it falls back to full reset if it fails.

Full GPU reset can leave the system in a state where the display updates
intermittently and possibly lock-up depending on the work load at the time of
hang. TDR can help recover the system in those case thus increasing the stability.

The changes are split in multiple patches.
1. Ring utility functions to save/restore context, reset ring etc
2. TDR hang detection logic and error recovery function
3. Debugfs changes to export TDR statistics.

I have tested these changes on drm-intel-nightly with simple test which
inserts a bad batch buffer on the specific to trigger a hang. TDR logic
then detects this and recovers from it by skipping the bad batch.

Please review and give your comments.

regards
Arun

Siluvery, Arun (3):
  drm/1915: Add ring functions to save/restore context for per-ring
    reset
  drm/i915: Per-engine Timeout detection and recovery on HSW
  drm/i915: Export TDR hang count to debugfs

 drivers/gpu/drm/i915/i915_debugfs.c     |  68 +++-
 drivers/gpu/drm/i915/i915_dma.c         |  16 +-
 drivers/gpu/drm/i915/i915_drv.c         | 195 +++++++++-
 drivers/gpu/drm/i915/i915_drv.h         |  92 ++++-
 drivers/gpu/drm/i915/i915_gem.c         |  77 +++-
 drivers/gpu/drm/i915/i915_gpu_error.c   |  25 +-
 drivers/gpu/drm/i915/i915_irq.c         | 556 ++++++++++++++++-------------
 drivers/gpu/drm/i915/i915_reg.h         |   7 +
 drivers/gpu/drm/i915/intel_display.c    |  25 +-
 drivers/gpu/drm/i915/intel_ringbuffer.c | 607 +++++++++++++++++++++++++++++++-
 drivers/gpu/drm/i915/intel_ringbuffer.h |  51 +++
 drivers/gpu/drm/i915/intel_uncore.c     |  31 +-
 include/drm/drmP.h                      |   7 +
 13 files changed, 1467 insertions(+), 290 deletions(-)

-- 
1.8.4

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH 0/3] Per Engine hang detection and recovery
  2013-11-11 14:58 [PATCH 0/3] Per Engine hang detection and recovery Siluvery, Arun
@ 2013-11-11 15:31 ` Daniel Vetter
  2013-11-11 15:49   ` Siluvery, Arun
  0 siblings, 1 reply; 4+ messages in thread
From: Daniel Vetter @ 2013-11-11 15:31 UTC (permalink / raw)
  To: Siluvery, Arun; +Cc: intel-gfx

On Mon, Nov 11, 2013 at 02:58:31PM +0000, Siluvery, Arun wrote:
> From: "Siluvery, Arun" <arun.siluvery@intel.com>
> 
> This patchset contains changes for Timeout detection and recovery (TDR) which
> provides per-engine hang detection and recovery.
> The current driver performs full gpu reset in case of a hang, TDR attempts to
> only reset the engine that is hung and it falls back to full reset if it fails.
> 
> Full GPU reset can leave the system in a state where the display updates
> intermittently and possibly lock-up depending on the work load at the time of
> hang. TDR can help recover the system in those case thus increasing the stability.

Are these hw lockups you've seen with full gpu reset or just kernel
deadlocks? If it's the latter we've recently (re-)fixed a bunch of those,
and if there are new ones we definitely want to fix them and add testcases
to igt. So if you could share some of these hangs and their
analysis/testcases that's be very interesting.

That's of course on top of any other reset improvements.

> The changes are split in multiple patches.
> 1. Ring utility functions to save/restore context, reset ring etc
> 2. TDR hang detection logic and error recovery function
> 3. Debugfs changes to export TDR statistics.
> 
> I have tested these changes on drm-intel-nightly with simple test which
> inserts a bad batch buffer on the specific to trigger a hang. TDR logic
> then detects this and recovers from it by skipping the bad batch.

I want this testcase (as a patch to igt).

> Please review and give your comments.

I'll try to have a look later this week, atm still busy with bdw
upstreaming. One more meta-comment though: Something with your git setup
seems to be broken, the patches don't have in-reply-to headers pointing at
this cover letter and hence the threading is a bit broken.

Cheers, Daniel
> 
> regards
> Arun
> 
> Siluvery, Arun (3):
>   drm/1915: Add ring functions to save/restore context for per-ring
>     reset
>   drm/i915: Per-engine Timeout detection and recovery on HSW
>   drm/i915: Export TDR hang count to debugfs
> 
>  drivers/gpu/drm/i915/i915_debugfs.c     |  68 +++-
>  drivers/gpu/drm/i915/i915_dma.c         |  16 +-
>  drivers/gpu/drm/i915/i915_drv.c         | 195 +++++++++-
>  drivers/gpu/drm/i915/i915_drv.h         |  92 ++++-
>  drivers/gpu/drm/i915/i915_gem.c         |  77 +++-
>  drivers/gpu/drm/i915/i915_gpu_error.c   |  25 +-
>  drivers/gpu/drm/i915/i915_irq.c         | 556 ++++++++++++++++-------------
>  drivers/gpu/drm/i915/i915_reg.h         |   7 +
>  drivers/gpu/drm/i915/intel_display.c    |  25 +-
>  drivers/gpu/drm/i915/intel_ringbuffer.c | 607 +++++++++++++++++++++++++++++++-
>  drivers/gpu/drm/i915/intel_ringbuffer.h |  51 +++
>  drivers/gpu/drm/i915/intel_uncore.c     |  31 +-
>  include/drm/drmP.h                      |   7 +
>  13 files changed, 1467 insertions(+), 290 deletions(-)
> 
> -- 
> 1.8.4
> 
> 
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/intel-gfx

-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH 0/3] Per Engine hang detection and recovery
  2013-11-11 15:31 ` Daniel Vetter
@ 2013-11-11 15:49   ` Siluvery, Arun
  2013-11-11 17:55     ` Daniel Vetter
  0 siblings, 1 reply; 4+ messages in thread
From: Siluvery, Arun @ 2013-11-11 15:49 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx

On Mon, 2013-11-11 at 16:31 +0100, Daniel Vetter wrote:
> On Mon, Nov 11, 2013 at 02:58:31PM +0000, Siluvery, Arun wrote:
> > From: "Siluvery, Arun" <arun.siluvery@intel.com>
> > 
> > This patchset contains changes for Timeout detection and recovery (TDR) which
> > provides per-engine hang detection and recovery.
> > The current driver performs full gpu reset in case of a hang, TDR attempts to
> > only reset the engine that is hung and it falls back to full reset if it fails.
> > 
> > Full GPU reset can leave the system in a state where the display updates
> > intermittently and possibly lock-up depending on the work load at the time of
> > hang. TDR can help recover the system in those case thus increasing the stability.
> 
> Are these hw lockups you've seen with full gpu reset or just kernel
> deadlocks? If it's the latter we've recently (re-)fixed a bunch of those,
> and if there are new ones we definitely want to fix them and add testcases
> to igt. So if you could share some of these hangs and their
> analysis/testcases that's be very interesting.
> 
> That's of course on top of any other reset improvements.

I think these are kernel lockups, unfortunately when this happens there
is no response from the kernel, sending break is also not helping. I
will try to get more details on this.

> 
> > The changes are split in multiple patches.
> > 1. Ring utility functions to save/restore context, reset ring etc
> > 2. TDR hang detection logic and error recovery function
> > 3. Debugfs changes to export TDR statistics.
> > 
> > I have tested these changes on drm-intel-nightly with simple test which
> > inserts a bad batch buffer on the specific to trigger a hang. TDR logic
> > then detects this and recovers from it by skipping the bad batch.
> 
> I want this testcase (as a patch to igt).

ok, I will send it to the mailing list.

> 
> > Please review and give your comments.
> 
> I'll try to have a look later this week, atm still busy with bdw
> upstreaming. One more meta-comment though: Something with your git setup
> seems to be broken, the patches don't have in-reply-to headers pointing at
> this cover letter and hence the threading is a bit broken.

ok thanks.
yes my mistake I missed an option while generating the patches.
Do you suggest resending all patches again?

> 
> Cheers, Daniel
> > 
> > regards
> > Arun
> > 
> > Siluvery, Arun (3):
> >   drm/1915: Add ring functions to save/restore context for per-ring
> >     reset
> >   drm/i915: Per-engine Timeout detection and recovery on HSW
> >   drm/i915: Export TDR hang count to debugfs
> > 
> >  drivers/gpu/drm/i915/i915_debugfs.c     |  68 +++-
> >  drivers/gpu/drm/i915/i915_dma.c         |  16 +-
> >  drivers/gpu/drm/i915/i915_drv.c         | 195 +++++++++-
> >  drivers/gpu/drm/i915/i915_drv.h         |  92 ++++-
> >  drivers/gpu/drm/i915/i915_gem.c         |  77 +++-
> >  drivers/gpu/drm/i915/i915_gpu_error.c   |  25 +-
> >  drivers/gpu/drm/i915/i915_irq.c         | 556 ++++++++++++++++-------------
> >  drivers/gpu/drm/i915/i915_reg.h         |   7 +
> >  drivers/gpu/drm/i915/intel_display.c    |  25 +-
> >  drivers/gpu/drm/i915/intel_ringbuffer.c | 607 +++++++++++++++++++++++++++++++-
> >  drivers/gpu/drm/i915/intel_ringbuffer.h |  51 +++
> >  drivers/gpu/drm/i915/intel_uncore.c     |  31 +-
> >  include/drm/drmP.h                      |   7 +
> >  13 files changed, 1467 insertions(+), 290 deletions(-)
> > 
> > -- 
> > 1.8.4
> > 
> > 
> > _______________________________________________
> > Intel-gfx mailing list
> > Intel-gfx@lists.freedesktop.org
> > http://lists.freedesktop.org/mailman/listinfo/intel-gfx
> 

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH 0/3] Per Engine hang detection and recovery
  2013-11-11 15:49   ` Siluvery, Arun
@ 2013-11-11 17:55     ` Daniel Vetter
  0 siblings, 0 replies; 4+ messages in thread
From: Daniel Vetter @ 2013-11-11 17:55 UTC (permalink / raw)
  To: Siluvery, Arun; +Cc: intel-gfx

On Mon, Nov 11, 2013 at 03:49:01PM +0000, Siluvery, Arun wrote:
> On Mon, 2013-11-11 at 16:31 +0100, Daniel Vetter wrote:
> > On Mon, Nov 11, 2013 at 02:58:31PM +0000, Siluvery, Arun wrote:
> > > From: "Siluvery, Arun" <arun.siluvery@intel.com>
> > > 
> > > This patchset contains changes for Timeout detection and recovery (TDR) which
> > > provides per-engine hang detection and recovery.
> > > The current driver performs full gpu reset in case of a hang, TDR attempts to
> > > only reset the engine that is hung and it falls back to full reset if it fails.
> > > 
> > > Full GPU reset can leave the system in a state where the display updates
> > > intermittently and possibly lock-up depending on the work load at the time of
> > > hang. TDR can help recover the system in those case thus increasing the stability.
> > 
> > Are these hw lockups you've seen with full gpu reset or just kernel
> > deadlocks? If it's the latter we've recently (re-)fixed a bunch of those,
> > and if there are new ones we definitely want to fix them and add testcases
> > to igt. So if you could share some of these hangs and their
> > analysis/testcases that's be very interesting.
> > 
> > That's of course on top of any other reset improvements.
> 
> I think these are kernel lockups, unfortunately when this happens there
> is no response from the kernel, sending break is also not helping. I
> will try to get more details on this.

Enabling lockdep and making sure you have the stuck task warning enabled
occasionally helps to get something out of the system. Is the entire box
dead or just everything related to gfx?

> > > The changes are split in multiple patches.
> > > 1. Ring utility functions to save/restore context, reset ring etc
> > > 2. TDR hang detection logic and error recovery function
> > > 3. Debugfs changes to export TDR statistics.
> > > 
> > > I have tested these changes on drm-intel-nightly with simple test which
> > > inserts a bad batch buffer on the specific to trigger a hang. TDR logic
> > > then detects this and recovers from it by skipping the bad batch.
> > 
> > I want this testcase (as a patch to igt).
> 
> ok, I will send it to the mailing list.

Thanks.

> > > Please review and give your comments.
> > 
> > I'll try to have a look later this week, atm still busy with bdw
> > upstreaming. One more meta-comment though: Something with your git setup
> > seems to be broken, the patches don't have in-reply-to headers pointing at
> > this cover letter and hence the threading is a bit broken.
> 
> ok thanks.
> yes my mistake I missed an option while generating the patches.
> Do you suggest resending all patches again?

No need, was just a quick reminder for next time around. Without threading
patch groups are harder to find, especially when a bigger discussion
ensued.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2013-11-11 17:54 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-11-11 14:58 [PATCH 0/3] Per Engine hang detection and recovery Siluvery, Arun
2013-11-11 15:31 ` Daniel Vetter
2013-11-11 15:49   ` Siluvery, Arun
2013-11-11 17:55     ` Daniel Vetter

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.