* Regression in i915 for 4.11-rc1 - bisected to commit 69df05e11ab8
From: Larry Finger @ 2017-03-23 18:19 UTC (permalink / raw)
To: LKML, Chris Wilson, Tvrtko Ursulin, intel-gfx, Jani Nikula,
Daniel Vetter
Cc: Thorsten Leemhuis
Since kernel 4.11-rc1, my desktop (Plasma5/KDE) has encountered intermittent
hangs with the following information in the logs:
linux-4v1g.suse kernel: [drm] GPU HANG: ecode 7:0:0xf3cffffe, in plasmashell [1283], reason: Hang on render ring, action: reset
linux-4v1g.suse kernel: [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
linux-4v1g.suse kernel: [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
linux-4v1g.suse kernel: [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
linux-4v1g.suse kernel: [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
linux-4v1g.suse kernel: [drm] GPU crash dump saved to /sys/class/drm/card0/error
linux-4v1g.suse kernel: drm/i915: Resetting chip after gpu hang
This problem was added to https://bugs.freedesktop.org/show_bug.cgi?id=99380,
but it probably is a different bug, as the OP in that report has problems with
kernel 4.10.x, whereas my problem did not appear until 4.11.
The problem was bisected to commit 69df05e11ab8 ("drm/i915: Simplify releasing
context reference"). The accuracy of the bisection was tested by reverting that
patch in kernel 4.11-rc3. With that change, my kernel has now run for over 17
hours with no problem. Before the reversion, the longest any affected kernel
would run was ~3 hours until a gpu hang was detected.
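As a rough aid for judging such test runs, the sketch below estimates how likely a kernel that still has the bug is to survive a clean run of a given length. It assumes hangs arrive at a constant average rate (an exponential waiting time), which an intermittent, workload-dependent bug may well violate, and it takes the ~3 hours mentioned above as the mean time to a hang, which is also only an assumption. The function names are illustrative.

```python
import math


def survival_probability(run_hours, mean_hours_to_hang):
    """Chance that a kernel which still has the bug runs for run_hours
    without a hang, assuming a constant hang rate (exponential model)."""
    return math.exp(-run_hours / mean_hours_to_hang)


def hours_needed(mean_hours_to_hang, max_false_good):
    """Shortest clean run after which a buggy kernel would pass with
    probability at most max_false_good, under the same assumption."""
    return -mean_hours_to_hang * math.log(max_false_good)


if __name__ == "__main__":
    mean = 3.0  # assumption: mean time to a hang, taken from the ~3 hour figure
    p = survival_probability(17.0, mean)
    print(f"P(buggy kernel survives 17 h) = {p:.4f}")
    print(f"Clean hours needed for a 1% false-good risk = {hours_needed(mean, 0.01):.1f}")
```

Under these assumptions a 17 hour clean run looks convincing (well under 1 percent), but the estimate is only as good as the constant-rate assumption. If the hang depends on something that happens rarely, such as a particular workload, a clean run of this length says much less.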
I admit that I do not understand this driver, but my guess is that this commit
introduced a race condition in the context put.
Thanks,
Larry
* Re: Regression in i915 for 4.11-rc1 - bisected to commit 69df05e11ab8
From: Chris Wilson @ 2017-03-23 20:44 UTC (permalink / raw)
To: Larry Finger
Cc: LKML, Tvrtko Ursulin, intel-gfx, Jani Nikula, Daniel Vetter,
Thorsten Leemhuis
On Thu, Mar 23, 2017 at 01:19:43PM -0500, Larry Finger wrote:
> Since kernel 4.11-rc1, my desktop (Plasma5/KDE) has encountered
> intermittent hangs with the following information in the logs:
>
> linux-4v1g.suse kernel: [drm] GPU HANG: ecode 7:0:0xf3cffffe, in
> plasmashell [1283], reason: Hang on render ring, action: reset
> linux-4v1g.suse kernel: [drm] GPU hangs can indicate a bug anywhere
> in the entire gfx stack, including userspace.
> linux-4v1g.suse kernel: [drm] Please file a _new_ bug report on
> bugs.freedesktop.org against DRI -> DRM/Intel
> linux-4v1g.suse kernel: [drm] drm/i915 developers can then reassign
> to the right component if it's not a kernel issue.
> linux-4v1g.suse kernel: [drm] The gpu crash dump is required to
> analyze gpu hangs, so please always attach it.
> linux-4v1g.suse kernel: [drm] GPU crash dump saved to /sys/class/drm/card0/error
> linux-4v1g.suse kernel: drm/i915: Resetting chip after gpu hang
>
> This problem was added to
> https://bugs.freedesktop.org/show_bug.cgi?id=99380, but it probably
> is a different bug, as the OP in that report has problems with
> kernel 4.10.x, whereas my problem did not appear until 4.11.
Close. That patch actually touches code you are not using (oa-perf and
gvt); the real culprit was e8a9c58fcd9a ("drm/i915: Unify active context
tracking between legacy/execlists/guc").
The fix
commit 5d4bac5503fcc67dd7999571e243cee49371aef7
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Wed Mar 22 20:59:30 2017 +0000
drm/i915: Restore marking context objects as dirty on pinning
Commit e8a9c58fcd9a ("drm/i915: Unify active context tracking between
legacy/execlists/guc") converted the legacy intel_ringbuffer submission
to the same context pinning mechanism as execlists - that is to pin the
context until the subsequent request is retired. Previously it used the
vma retirement of the context object to keep itself pinned until the
next request (after i915_vma_move_to_active()). In the conversion, I
missed that the vma retirement was also responsible for marking the
object as dirty. Mark the context object as dirty when pinning
(equivalent to execlists) which ensures that if the context is swapped
out due to mempressure or suspend/hibernation, when it is loaded back in
it does so with the previous state (and not all zero).
Fixes: e8a9c58fcd9a ("drm/i915: Unify active context tracking between legacy/execlists/guc")
Reported-by: Dennis Gilmore <dennis@ausil.us>
Reported-by: Mathieu Marquer <mathieu.marquer@gmail.com>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=99993
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=100181
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: <drm-intel-fixes@lists.freedesktop.org> # v4.11-rc1
Link: http://patchwork.freedesktop.org/patch/msgid/20170322205930.12762-1-chris@chris-wilson.co.uk
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
went in this morning and so will be upstreamed ~next week.
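The failure mode described in the commit message above can be illustrated with a toy model. This is not the i915 code: the class, its fields, and the helper function are hypothetical, and the only thing it tries to show is why an object written by hardware must be marked dirty before it can be swapped out. A swap-out that sees a clean object discards the resident pages instead of writing them back, so the next swap-in restores whatever the backing store held, here all zeros.

```python
class ToyContextObject:
    """Toy stand-in for a context image kept in swappable memory.
    All names here are hypothetical and do not mirror the driver."""

    def __init__(self, size):
        self.pages = bytearray(size)    # resident copy that the "GPU" writes to
        self.backing = bytearray(size)  # backing store, zero-filled initially
        self.dirty = False
        self.resident = True

    def pin(self, mark_dirty):
        """Make the object resident; optionally mark it dirty up front."""
        if not self.resident:
            self.swap_in()
        if mark_dirty:
            self.dirty = True

    def gpu_write(self, data):
        """Hardware writes do not set the dirty flag by themselves."""
        self.pages[:len(data)] = data

    def swap_out(self):
        """Write back only if dirty; a clean object is simply dropped."""
        if self.dirty:
            self.backing[:] = self.pages
            self.dirty = False
        self.pages = None
        self.resident = False

    def swap_in(self):
        """Reload the resident copy from the backing store."""
        self.pages = bytearray(self.backing)
        self.resident = True


def state_after_swap(mark_dirty):
    """Pin, let the 'GPU' save state, then swap out and back in."""
    ctx = ToyContextObject(4)
    ctx.pin(mark_dirty)
    ctx.gpu_write(b"\x01\x02\x03\x04")
    ctx.swap_out()
    ctx.swap_in()
    return bytes(ctx.pages)


if __name__ == "__main__":
    print("marked dirty on pin:", state_after_swap(True))
    print("not marked dirty:   ", state_after_swap(False))
```

In this model, marking the object dirty at pin time is enough to preserve the saved state across a swap cycle, while omitting it returns an all-zero image, which matches the symptom the commit message describes.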
-Chris
--
Chris Wilson, Intel Open Source Technology Centre
* Re: Regression in i915 for 4.11-rc1 - bisected to commit 69df05e11ab8
From: Larry Finger @ 2017-03-23 21:23 UTC (permalink / raw)
To: Chris Wilson, LKML, Tvrtko Ursulin, intel-gfx, Jani Nikula,
Daniel Vetter, Thorsten Leemhuis
Thanks. With a bug that is difficult to trigger, bisection is difficult. I am
surprised that the only step I got wrong was the last one. BTW, the kernel with
the revert applied failed after 20 hours. I was ready to write again when I got
your fix. Good timing.
If your patch does not fix my problem, I will let you know.
Larry