linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] drm/i915: Remove instructions to file a bug report.
@ 2016-12-03  1:03 Matt Turner
  2016-12-03  1:26 ` Matt Turner
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Matt Turner @ 2016-12-03  1:03 UTC (permalink / raw)
  To: intel-gfx
  Cc: Daniel Vetter, Jani Nikula, Mika Kuoppala, Kenneth Graunke,
	Mark Janes, linux-kernel, Matt Turner

>From these instructions, users assume that /sys/class/drm/card0/error
contains all the information a developer needs to diagnose and fix a GPU
hang.

In fact it doesn't, and we have no tools for solving them (other than
stabbing in the dark). Most of the time the error state itself isn't
even useful because it just shows a hang on a PIPE_CONTROL or similar.

Until a time when the error state contains enough information to
actually solve a hang, stop telling users to file unsolvable bugs, and
instead rely on users who know where and how to file a good bug report
to find their own way there.

Signed-off-by: Matt Turner <mattst88@gmail.com>
---
Maybe now's a good time to discuss what *would* be useful to put in the
error state for debugging hangs. The currently executing shader program
would be a great place to start.

 drivers/gpu/drm/i915/i915_gpu_error.c | 11 -----------
 1 file changed, 11 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index 334f15d..8ddca7b 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -1431,7 +1431,6 @@ void i915_capture_error_state(struct drm_i915_private *dev_priv,
 			      u32 engine_mask,
 			      const char *error_msg)
 {
-	static bool warned;
 	struct drm_i915_error_state *error;
 	unsigned long flags;
 
@@ -1475,16 +1474,6 @@ void i915_capture_error_state(struct drm_i915_private *dev_priv,
 		i915_error_state_free(&error->ref);
 		return;
 	}
-
-	if (!warned) {
-		DRM_INFO("GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.\n");
-		DRM_INFO("Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel\n");
-		DRM_INFO("drm/i915 developers can then reassign to the right component if it's not a kernel issue.\n");
-		DRM_INFO("The gpu crash dump is required to analyze gpu hangs, so please always attach it.\n");
-		DRM_INFO("GPU crash dump saved to /sys/class/drm/card%d/error\n",
-			 dev_priv->drm.primary->index);
-		warned = true;
-	}
 }
 
 void i915_error_state_get(struct drm_device *dev,
-- 
2.7.3

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] drm/i915: Remove instructions to file a bug report.
  2016-12-03  1:03 [PATCH] drm/i915: Remove instructions to file a bug report Matt Turner
@ 2016-12-03  1:26 ` Matt Turner
  2016-12-03  8:57 ` [Intel-gfx] " Chris Wilson
  2016-12-03  9:52 ` Jani Nikula
  2 siblings, 0 replies; 6+ messages in thread
From: Matt Turner @ 2016-12-03  1:26 UTC (permalink / raw)
  To: intel-gfx
  Cc: Daniel Vetter, Jani Nikula, Mika Kuoppala, Kenneth Graunke,
	Mark Janes, LKML, Matt Turner, Ben Widawsky

On Fri, Dec 2, 2016 at 5:03 PM, Matt Turner <mattst88@gmail.com> wrote:
> From these instructions, users assume that /sys/class/drm/card0/error
> contains all the information a developer needs to diagnose and fix a GPU
> hang.
>
> In fact it doesn't, and we have no tools for solving them (other than
> stabbing in the dark). Most of the time the error state itself isn't
> even useful because it just shows a hang on a PIPE_CONTROL or similar.
>
> Until a time when the error state contains enough information to
> actually solve a hang, stop telling users to file unsolvable bugs, and
> instead rely on users who know where and how to file a good bug report
> to find their own way there.
>
> Signed-off-by: Matt Turner <mattst88@gmail.com>
> ---
> Maybe now's a good time to discuss what *would* be useful to put in the
> error state for debugging hangs. The currently executing shader program
> would be a great place to start.

Looks like Ben had a patch:

https://cgit.freedesktop.org/~bwidawsk/drm-intel/commit/?h=extra_error_objs&id=711da2570cd3302d0cb9f2489a101e4a7c966a6c

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [Intel-gfx] [PATCH] drm/i915: Remove instructions to file a bug report.
  2016-12-03  1:03 [PATCH] drm/i915: Remove instructions to file a bug report Matt Turner
  2016-12-03  1:26 ` Matt Turner
@ 2016-12-03  8:57 ` Chris Wilson
  2016-12-03  9:52 ` Jani Nikula
  2 siblings, 0 replies; 6+ messages in thread
From: Chris Wilson @ 2016-12-03  8:57 UTC (permalink / raw)
  To: Matt Turner
  Cc: intel-gfx, linux-kernel, Kenneth Graunke, Daniel Vetter, Mika Kuoppala

On Fri, Dec 02, 2016 at 05:03:05PM -0800, Matt Turner wrote:
> From these instructions, users assume that /sys/class/drm/card0/error
> contains all the information a developer needs to diagnose and fix a GPU
> hang.
> 
> In fact it doesn't, and we have no tools for solving them (other than
> stabbing in the dark). Most of the time the error state itself isn't
> even useful because it just shows a hang on a PIPE_CONTROL or similar.
> 
> Until a time when the error state contains enough information to
> actually solve a hang, stop telling users to file unsolvable bugs, and
> instead rely on users who know where and how to file a good bug report
> to find their own way there.
> 
> Signed-off-by: Matt Turner <mattst88@gmail.com>

Nak. Though having stale bug reports is a pain, we've recently adopted
the policy of stopping the request after a certain period, those bug
reports are still vital. They don't just represent bugs in mesa.

> ---
> Maybe now's a good time to discuss what *would* be useful to put in the
> error state for debugging hangs. The currently executing shader program
> would be a great place to start.

Now? That is the conversation we've being trying to have for several
years. The contents of the error state are currently about sufficient to
spot kernel bugs, triage the culprit and the general class of bug.

Capturing all state for a request is unfeasible (because we can't copy
the gigabytes of memory required). Copying a selected set of aux bo is
one option. And since those bo are under user control and do not have to
be executed, you can even store aub data in them or whatnot.

Even if you make attaching the debug information conditional, I would
still keep the error message unconditional.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] drm/i915: Remove instructions to file a bug report.
  2016-12-03  1:03 [PATCH] drm/i915: Remove instructions to file a bug report Matt Turner
  2016-12-03  1:26 ` Matt Turner
  2016-12-03  8:57 ` [Intel-gfx] " Chris Wilson
@ 2016-12-03  9:52 ` Jani Nikula
  2016-12-06  0:55   ` Matt Turner
  2 siblings, 1 reply; 6+ messages in thread
From: Jani Nikula @ 2016-12-03  9:52 UTC (permalink / raw)
  To: Matt Turner, intel-gfx
  Cc: Daniel Vetter, Mika Kuoppala, Kenneth Graunke, Mark Janes,
	linux-kernel, Matt Turner, Argotti, Yann, Chris Wilson

On Sat, 03 Dec 2016, Matt Turner <mattst88@gmail.com> wrote:
> From these instructions, users assume that /sys/class/drm/card0/error
> contains all the information a developer needs to diagnose and fix a GPU
> hang.
>
> In fact it doesn't, and we have no tools for solving them (other than
> stabbing in the dark). Most of the time the error state itself isn't
> even useful because it just shows a hang on a PIPE_CONTROL or similar.
>
> Until a time when the error state contains enough information to
> actually solve a hang, stop telling users to file unsolvable bugs, and
> instead rely on users who know where and how to file a good bug report
> to find their own way there.
>
> Signed-off-by: Matt Turner <mattst88@gmail.com>
> ---
> Maybe now's a good time to discuss what *would* be useful to put in the
> error state for debugging hangs. The currently executing shader program
> would be a great place to start.

I'm wondering why we're getting this patch now, and my guess is that
it's because we have been reassigning the related bugs to Mesa more
actively lately. Is that the case?

IIUC the bug reports are useful for us when it's a kernel bug, but less
useful for you when it's a Mesa bug. And you'd rather have fewer
incoming bugs that you think are unsolvable with the information at
hand.

Sounds like a bug workflow issue between drm/i915 and Mesa to be ironed
out.


BR,
Jani.


>
>  drivers/gpu/drm/i915/i915_gpu_error.c | 11 -----------
>  1 file changed, 11 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
> index 334f15d..8ddca7b 100644
> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
> @@ -1431,7 +1431,6 @@ void i915_capture_error_state(struct drm_i915_private *dev_priv,
>  			      u32 engine_mask,
>  			      const char *error_msg)
>  {
> -	static bool warned;
>  	struct drm_i915_error_state *error;
>  	unsigned long flags;
>  
> @@ -1475,16 +1474,6 @@ void i915_capture_error_state(struct drm_i915_private *dev_priv,
>  		i915_error_state_free(&error->ref);
>  		return;
>  	}
> -
> -	if (!warned) {
> -		DRM_INFO("GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.\n");
> -		DRM_INFO("Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel\n");
> -		DRM_INFO("drm/i915 developers can then reassign to the right component if it's not a kernel issue.\n");
> -		DRM_INFO("The gpu crash dump is required to analyze gpu hangs, so please always attach it.\n");
> -		DRM_INFO("GPU crash dump saved to /sys/class/drm/card%d/error\n",
> -			 dev_priv->drm.primary->index);
> -		warned = true;
> -	}
>  }
>  
>  void i915_error_state_get(struct drm_device *dev,

-- 
Jani Nikula, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] drm/i915: Remove instructions to file a bug report.
  2016-12-03  9:52 ` Jani Nikula
@ 2016-12-06  0:55   ` Matt Turner
  2016-12-07 16:09     ` [Intel-gfx] " Daniel Vetter
  0 siblings, 1 reply; 6+ messages in thread
From: Matt Turner @ 2016-12-06  0:55 UTC (permalink / raw)
  To: Jani Nikula
  Cc: intel-gfx, Daniel Vetter, Mika Kuoppala, Kenneth Graunke,
	Mark Janes, LKML, Argotti, Yann, Chris Wilson

On Sat, Dec 3, 2016 at 1:52 AM, Jani Nikula <jani.nikula@linux.intel.com> wrote:
> On Sat, 03 Dec 2016, Matt Turner <mattst88@gmail.com> wrote:
>> From these instructions, users assume that /sys/class/drm/card0/error
>> contains all the information a developer needs to diagnose and fix a GPU
>> hang.
>>
>> In fact it doesn't, and we have no tools for solving them (other than
>> stabbing in the dark). Most of the time the error state itself isn't
>> even useful because it just shows a hang on a PIPE_CONTROL or similar.
>>
>> Until a time when the error state contains enough information to
>> actually solve a hang, stop telling users to file unsolvable bugs, and
>> instead rely on users who know where and how to file a good bug report
>> to find their own way there.
>>
>> Signed-off-by: Matt Turner <mattst88@gmail.com>
>> ---
>> Maybe now's a good time to discuss what *would* be useful to put in the
>> error state for debugging hangs. The currently executing shader program
>> would be a great place to start.
>
> I'm wondering why we're getting this patch now, and my guess is that
> it's because we have been reassigning the related bugs to Mesa more
> actively lately. Is that the case?

No, it's simply because I spent a week going through Bugzilla and
realized how incomplete an unactionable the majority of GPU hang
reports are.

Asking users to report bugs, but not telling them what actually
constitutes a bug report, is a recipe for a lot of wasted developer
time.

I suspect we could improve the usefulness of the reports by directing
users to a webpage that gave a few suggestions (tell us what you were
doing when the hang occurred would be an obvious one) about filing a
bug and then provided a link to Bugzilla. Or even configured Bugzilla
to have a default template that requested various bits of information.

> IIUC the bug reports are useful for us when it's a kernel bug, but less
> useful for you when it's a Mesa bug. And you'd rather have fewer
> incoming bugs that you think are unsolvable with the information at
> hand.
>
> Sounds like a bug workflow issue between drm/i915 and Mesa to be ironed
> out.

Indeed. I'd rather have the information provided in a bug report to
actually solve it. I hope having access to the shader program will
make many more reports useful.

I am also happy to see that there's now a sunset to the GPU hang message.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [Intel-gfx] [PATCH] drm/i915: Remove instructions to file a bug report.
  2016-12-06  0:55   ` Matt Turner
@ 2016-12-07 16:09     ` Daniel Vetter
  0 siblings, 0 replies; 6+ messages in thread
From: Daniel Vetter @ 2016-12-07 16:09 UTC (permalink / raw)
  To: Matt Turner
  Cc: Jani Nikula, intel-gfx, LKML, Kenneth Graunke, Daniel Vetter,
	Mika Kuoppala

On Mon, Dec 05, 2016 at 04:55:47PM -0800, Matt Turner wrote:
> On Sat, Dec 3, 2016 at 1:52 AM, Jani Nikula <jani.nikula@linux.intel.com> wrote:
> > On Sat, 03 Dec 2016, Matt Turner <mattst88@gmail.com> wrote:
> >> From these instructions, users assume that /sys/class/drm/card0/error
> >> contains all the information a developer needs to diagnose and fix a GPU
> >> hang.
> >>
> >> In fact it doesn't, and we have no tools for solving them (other than
> >> stabbing in the dark). Most of the time the error state itself isn't
> >> even useful because it just shows a hang on a PIPE_CONTROL or similar.
> >>
> >> Until a time when the error state contains enough information to
> >> actually solve a hang, stop telling users to file unsolvable bugs, and
> >> instead rely on users who know where and how to file a good bug report
> >> to find their own way there.
> >>
> >> Signed-off-by: Matt Turner <mattst88@gmail.com>
> >> ---
> >> Maybe now's a good time to discuss what *would* be useful to put in the
> >> error state for debugging hangs. The currently executing shader program
> >> would be a great place to start.
> >
> > I'm wondering why we're getting this patch now, and my guess is that
> > it's because we have been reassigning the related bugs to Mesa more
> > actively lately. Is that the case?
> 
> No, it's simply because I spent a week going through Bugzilla and
> realized how incomplete an unactionable the majority of GPU hang
> reports are.
> 
> Asking users to report bugs, but not telling them what actually
> constitutes a bug report, is a recipe for a lot of wasted developer
> time.
> 
> I suspect we could improve the usefulness of the reports by directing
> users to a webpage that gave a few suggestions (tell us what you were
> doing when the hang occurred would be an obvious one) about filing a
> bug and then provided a link to Bugzilla. Or even configured Bugzilla
> to have a default template that requested various bits of information.

I think dumping at least some of the aux buffers should make this tons
more useful for mesa, since it would indicate stuff like "we always die on
resolves on skl gt4" or stuff like that. Thus far error states have been
mostly used by kernel folks to debug kernel issues, which is why none of
that additional stuff gets dumped.

But a bare-bones parser to hunt for indirect state base addresses and fish
out the aux stuff shouldn't be that hard, and might make this fully
useful.

Like Chris said the goal is to at least be able to triage and classify
bugs, and I'm perfectly fine with merging additional code to the dumper to
get there for mesa folks. We z-compress the state, so size isn't really an
issue. And Ben has commit rights, so shouldn't be a problem to get this
all merged.

> > IIUC the bug reports are useful for us when it's a kernel bug, but less
> > useful for you when it's a Mesa bug. And you'd rather have fewer
> > incoming bugs that you think are unsolvable with the information at
> > hand.
> >
> > Sounds like a bug workflow issue between drm/i915 and Mesa to be ironed
> > out.
> 
> Indeed. I'd rather have the information provided in a bug report to
> actually solve it. I hope having access to the shader program will
> make many more reports useful.
> 
> I am also happy to see that there's now a sunset to the GPU hang message.

The other option is that mesa folks don't want error states that we triage
to mesa. We could definitely update the process in that area.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2016-12-07 16:09 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-12-03  1:03 [PATCH] drm/i915: Remove instructions to file a bug report Matt Turner
2016-12-03  1:26 ` Matt Turner
2016-12-03  8:57 ` [Intel-gfx] " Chris Wilson
2016-12-03  9:52 ` Jani Nikula
2016-12-06  0:55   ` Matt Turner
2016-12-07 16:09     ` [Intel-gfx] " Daniel Vetter

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).