* [PATCH] [RFC] drm/i915: Generate a hang error code
@ 2014-02-04 12:18 Ben Widawsky
  2014-02-04 12:43 ` Daniel Vetter
  2014-02-05 14:59 ` Jesse Barnes
  0 siblings, 2 replies; 7+ messages in thread
From: Ben Widawsky @ 2014-02-04 12:18 UTC
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

We get a large number of bug reports that attract a "hey, I have that
too" comment because someone else sees a GPU hang in dmesg. While two
machines of the same model having a GPU hang is indeed a coincidence, it
is far from enough evidence to suggest the hangs are the same.

The error message alone has clearly been insufficient to reduce this
effect and get people to file new bug reports (see the reference at the
bottom for a bug report with this characteristic), so also print a
semi-unique hang code with it.

The algorithm is purposely pretty naive. I don't think we need much in
order to avoid the problem I am trying to solve, and keeping it naive
gives us some ability to make a decent test case.

Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
References: https://bugs.freedesktop.org/show_bug.cgi?id=73276
Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_gpu_error.c | 44 +++++++++++++++++++++++++++++------
 1 file changed, 37 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index 94542d4..dc47bb9 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -653,6 +653,33 @@ static u32 capture_pinned_bo(struct drm_i915_error_buffer *err,
 	return i;
 }
 
+/* Generate a semi-unique error code. The code is not meant to have meaning; its
+ * only purpose is to try to prevent false duplicate bug reports by grossly
+ * estimating a GPU error state.
+ *
+ * TODO Ideally, hashing the batchbuffer would be a very nice way to determine
+ * the hang if we could strip the GTT offset information from it.
+ *
+ * It's only a small step better than a random number in its current form.
+ */
+static uint32_t i915_error_generate_code(struct drm_i915_private *dev_priv,
+					 struct drm_i915_error_state *error)
+{
+	uint32_t error_code = 0;
+	int i;
+
+	/* IPEHR would be an ideal way to detect errors, as it's the gross
+	 * measure of "the command that hung." However, it has some very common
+	 * synchronization commands which almost always appear when the hang is
+	 * strictly a client bug. Use instdone to differentiate those somewhat.
+	 */
+	for (i = 0; i < I915_NUM_RINGS; i++)
+		if (error->ring[i].hangcheck_action == HANGCHECK_HUNG)
+			return error->ring[i].ipehr ^ error->ring[i].instdone;
+
+	return error_code;
+}
+
 static void i915_gem_record_fences(struct drm_device *dev,
 				   struct drm_i915_error_state *error)
 {
@@ -1098,6 +1125,7 @@ void i915_capture_error_state(struct drm_device *dev)
 	struct drm_i915_private *dev_priv = dev->dev_private;
 	struct drm_i915_error_state *error;
 	unsigned long flags;
+	uint32_t ecode;
 
 	spin_lock_irqsave(&dev_priv->gpu_error.lock, flags);
 	error = dev_priv->gpu_error.first_error;
@@ -1114,7 +1142,16 @@ void i915_capture_error_state(struct drm_device *dev)
 
 	DRM_INFO("GPU crash dump saved to /sys/class/drm/card%d/error\n",
 		 dev->primary->index);
+	kref_init(&error->ref);
+
+	i915_capture_reg_state(dev_priv, error);
+	i915_gem_capture_buffers(dev_priv, error);
+	i915_gem_record_fences(dev, error);
+	i915_gem_record_rings(dev, error);
+	ecode = i915_error_generate_code(dev_priv, error);
+
 	if (!warned) {
+		DRM_INFO("GPU HANG [%x]\n", ecode);
 		DRM_INFO("GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.\n");
 		DRM_INFO("Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel\n");
 		DRM_INFO("drm/i915 developers can then reassign to the right component if it's not a kernel issue.\n");
@@ -1122,13 +1159,6 @@ void i915_capture_error_state(struct drm_device *dev)
 		warned = true;
 	}
 
-	kref_init(&error->ref);
-
-	i915_capture_reg_state(dev_priv, error);
-	i915_gem_capture_buffers(dev_priv, error);
-	i915_gem_record_fences(dev, error);
-	i915_gem_record_rings(dev, error);
-
 	do_gettimeofday(&error->time);
 
 	error->overlay = intel_overlay_capture_error_state(dev);
-- 
1.8.5.3
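
The computation in i915_error_generate_code() can be tried outside the
kernel. What follows is only a minimal sketch: the enum, the struct and
the name hang_code() are hypothetical stand-ins, not the driver's
definitions. The only part taken from the patch is the rule itself, which
is to XOR IPEHR with INSTDONE for the first ring that hangcheck marked as
hung, and to return 0 otherwise.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the driver's types; not the i915 definitions. */
enum hangcheck_action { HANGCHECK_IDLE, HANGCHECK_ACTIVE, HANGCHECK_HUNG };

struct ring_error {
	enum hangcheck_action action; /* hangcheck verdict for this ring */
	uint32_t ipehr;               /* IPEHR register snapshot */
	uint32_t instdone;            /* INSTDONE register snapshot */
};

/* Return IPEHR ^ INSTDONE of the first hung ring, or 0 if no ring hung. */
uint32_t hang_code(const struct ring_error *rings, size_t nrings)
{
	size_t i;

	assert(rings != NULL || nrings == 0);

	for (i = 0; i < nrings; i++)
		if (rings[i].action == HANGCHECK_HUNG)
			return rings[i].ipehr ^ rings[i].instdone;

	return 0;
}
```

Two hangs with the same IPEHR but different INSTDONE values get different
codes, which is the differentiation the comment in the patch asks for.
One property worth keeping in mind: a hung ring whose IPEHR happens to
equal its INSTDONE yields 0, the same value returned when no ring is
hung.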

* Re: [PATCH] [RFC] drm/i915: Generate a hang error code
  2014-02-04 12:18 [PATCH] [RFC] drm/i915: Generate a hang error code Ben Widawsky
@ 2014-02-04 12:43 ` Daniel Vetter
  2014-02-05 14:59 ` Jesse Barnes
  1 sibling, 0 replies; 7+ messages in thread
From: Daniel Vetter @ 2014-02-04 12:43 UTC
  To: Ben Widawsky; +Cc: Intel GFX, Ben Widawsky

On Tue, Feb 4, 2014 at 1:18 PM, Ben Widawsky
<benjamin.widawsky@intel.com> wrote:
> We get a large number of bug reports that attract a "hey, I have that
> too" comment because someone else sees a GPU hang in dmesg. While two
> machines of the same model having a GPU hang is indeed a coincidence, it
> is far from enough evidence to suggest the hangs are the same.
>
> The error message alone has clearly been insufficient to reduce this
> effect and get people to file new bug reports (see the reference at the
> bottom for a bug report with this characteristic), so also print a
> semi-unique hang code with it.
>
> The algorithm is purposely pretty naive. I don't think we need much in
> order to avoid the problem I am trying to solve, and keeping it naive
> gives us some ability to make a decent test case.
>
> Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
> References: https://bugs.freedesktop.org/show_bug.cgi?id=73276

I think most of this can be avoided by actually renaming bugs to have
sane summaries - of course people will go "me, too" if the summary is
"ubuntu gpu hangs". For everything else I think if users aren't
capable of following the rather verbose "pls file new bug report, don't
me-too" message we dump into dmesg, nothing else will help. And for
developers it's imo better to smash such things into our error state
decoder, similar to some of the other analysis steps we already do (like
decoding the HEAD pointer).

So not convinced really.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

* Re: [PATCH] [RFC] drm/i915: Generate a hang error code
  2014-02-04 12:18 [PATCH] [RFC] drm/i915: Generate a hang error code Ben Widawsky
  2014-02-04 12:43 ` Daniel Vetter
@ 2014-02-05 14:59 ` Jesse Barnes
  2014-02-05 15:15   ` Daniel Vetter
  1 sibling, 1 reply; 7+ messages in thread
From: Jesse Barnes @ 2014-02-05 14:59 UTC
  To: Ben Widawsky; +Cc: Intel GFX, Ben Widawsky

On Tue,  4 Feb 2014 12:18:55 +0000
Ben Widawsky <benjamin.widawsky@intel.com> wrote:

> We get a large number of bug reports that attract a "hey, I have that
> too" comment because someone else sees a GPU hang in dmesg. While two
> machines of the same model having a GPU hang is indeed a coincidence, it
> is far from enough evidence to suggest the hangs are the same.
> 
> The error message alone has clearly been insufficient to reduce this
> effect and get people to file new bug reports (see the reference at the
> bottom for a bug report with this characteristic), so also print a
> semi-unique hang code with it.
> 
> The algorithm is purposely pretty naive. I don't think we need much in
> order to avoid the problem I am trying to solve, and keeping it naive
> gives us some ability to make a decent test case.

I like the direction of this.  If we can get some basic info into the
dmesg part of things (the only part regular users will actually look
at) we can probably avoid some of the "me too" action we see on general
GPU hangs.  Having PID, comm, and some sort of hang signature are all
good steps in that direction imo.

Acked-by: Jesse Barnes <jbarnes@virtuousgeek.org>

Jesse

* Re: [PATCH] [RFC] drm/i915: Generate a hang error code
  2014-02-05 14:59 ` Jesse Barnes
@ 2014-02-05 15:15   ` Daniel Vetter
  2014-02-05 16:03     ` Jesse Barnes
  0 siblings, 1 reply; 7+ messages in thread
From: Daniel Vetter @ 2014-02-05 15:15 UTC
  To: Jesse Barnes; +Cc: Intel GFX, Ben Widawsky, Ben Widawsky

On Wed, Feb 05, 2014 at 02:59:08PM +0000, Jesse Barnes wrote:
> On Tue,  4 Feb 2014 12:18:55 +0000
> Ben Widawsky <benjamin.widawsky@intel.com> wrote:
> 
> > We get a large number of bug reports that attract a "hey, I have that
> > too" comment because someone else sees a GPU hang in dmesg. While two
> > machines of the same model having a GPU hang is indeed a coincidence, it
> > is far from enough evidence to suggest the hangs are the same.
> > 
> > The error message alone has clearly been insufficient to reduce this
> > effect and get people to file new bug reports (see the reference at the
> > bottom for a bug report with this characteristic), so also print a
> > semi-unique hang code with it.
> > 
> > The algorithm is purposely pretty naive. I don't think we need much in
> > order to avoid the problem I am trying to solve, and keeping it naive
> > gives us some ability to make a decent test case.
> 
> I like the direction of this.  If we can get some basic info into the
> dmesg part of things (the only part regular users will actually look
> at) we can probably avoid some of the "me too" action we see on general
> GPU hangs.  Having PID, comm, and some sort of hang signature are all
> good steps in that direction imo.

tbh I don't see much value in regular users trying to triage gpu hangs. If
they're not damn sure that they have a dupe (which means same platform,
versions of the software stack and crashing games) I much prefer if they
just send in a duplicate bug for us to triage.

With the mis-design of bugzilla it's much harder to untangle a wrong
me-too than mark something as duplicate. And especially long-running bugs
are a royal pain if there's too much wrong me-too noise in there.

Not a comment on the patch itself, just a general comment wrt avoiding
me-too gpu hang reports.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

* Re: [PATCH] [RFC] drm/i915: Generate a hang error code
  2014-02-05 15:15   ` Daniel Vetter
@ 2014-02-05 16:03     ` Jesse Barnes
  2014-02-05 16:18       ` Daniel Vetter
  0 siblings, 1 reply; 7+ messages in thread
From: Jesse Barnes @ 2014-02-05 16:03 UTC
  To: Daniel Vetter; +Cc: Intel GFX, Ben Widawsky, Ben Widawsky

On Wed, 5 Feb 2014 16:15:02 +0100
Daniel Vetter <daniel@ffwll.ch> wrote:

> On Wed, Feb 05, 2014 at 02:59:08PM +0000, Jesse Barnes wrote:
> > On Tue,  4 Feb 2014 12:18:55 +0000
> > Ben Widawsky <benjamin.widawsky@intel.com> wrote:
> > 
> > > We get a large number of bug reports that attract a "hey, I have that
> > > too" comment because someone else sees a GPU hang in dmesg. While two
> > > machines of the same model having a GPU hang is indeed a coincidence, it
> > > is far from enough evidence to suggest the hangs are the same.
> > > 
> > > The error message alone has clearly been insufficient to reduce this
> > > effect and get people to file new bug reports (see the reference at the
> > > bottom for a bug report with this characteristic), so also print a
> > > semi-unique hang code with it.
> > > 
> > > The algorithm is purposely pretty naive. I don't think we need much in
> > > order to avoid the problem I am trying to solve, and keeping it naive
> > > gives us some ability to make a decent test case.
> > 
> > I like the direction of this.  If we can get some basic info into the
> > dmesg part of things (the only part regular users will actually look
> > at) we can probably avoid some of the "me too" action we see on general
> > GPU hangs.  Having PID, comm, and some sort of hang signature are all
> > good steps in that direction imo.
> 
> tbh I don't see much value in regular users trying to triage gpu hangs. If
> they're not damn sure that they have a dupe (which means same platform,
> versions of the software stack and crashing games) I much prefer if they
> just send in a duplicate bug for us to triage.
> 
> With the mis-design of bugzilla it's much harder to untangle a wrong
> me-too than mark something as duplicate. And especially long-running bugs
> are a royal pain if there's too much wrong me-too noise in there.
> 
> Not a comment on the patch itself, just a general comment wrt avoiding
> me-too gpu hang reports.

So you're saying the GPU error decode tool should create a bug template
for people so we don't get the "me too" reports?

What I see above is that it's really important to avoid the "me too"
stuff, and to do it in such a way that false positives are minimized
(e.g. the IPEHR bit Ubuntu used to use).  So I guess I don't see what's
unconvincing here.  Today we have no way of differentiating w/o digging
into the error record, which users definitely won't do, and this patch
seems like it could only help with that... so count me confused.

Jesse

* Re: [PATCH] [RFC] drm/i915: Generate a hang error code
  2014-02-05 16:03     ` Jesse Barnes
@ 2014-02-05 16:18       ` Daniel Vetter
  2014-02-05 16:30         ` Daniel Vetter
  0 siblings, 1 reply; 7+ messages in thread
From: Daniel Vetter @ 2014-02-05 16:18 UTC
  To: Jesse Barnes; +Cc: Intel GFX, Ben Widawsky, Ben Widawsky

On Wed, Feb 05, 2014 at 04:03:45PM +0000, Jesse Barnes wrote:
> On Wed, 5 Feb 2014 16:15:02 +0100
> Daniel Vetter <daniel@ffwll.ch> wrote:
> 
> > On Wed, Feb 05, 2014 at 02:59:08PM +0000, Jesse Barnes wrote:
> > > On Tue,  4 Feb 2014 12:18:55 +0000
> > > Ben Widawsky <benjamin.widawsky@intel.com> wrote:
> > > 
> > > > We get a large number of bug reports that attract a "hey, I have that
> > > > too" comment because someone else sees a GPU hang in dmesg. While two
> > > > machines of the same model having a GPU hang is indeed a coincidence, it
> > > > is far from enough evidence to suggest the hangs are the same.
> > > > 
> > > > The error message alone has clearly been insufficient to reduce this
> > > > effect and get people to file new bug reports (see the reference at the
> > > > bottom for a bug report with this characteristic), so also print a
> > > > semi-unique hang code with it.
> > > > 
> > > > The algorithm is purposely pretty naive. I don't think we need much in
> > > > order to avoid the problem I am trying to solve, and keeping it naive
> > > > gives us some ability to make a decent test case.
> > > 
> > > I like the direction of this.  If we can get some basic info into the
> > > dmesg part of things (the only part regular users will actually look
> > > at) we can probably avoid some of the "me too" action we see on general
> > > GPU hangs.  Having PID, comm, and some sort of hang signature are all
> > > good steps in that direction imo.
> > 
> > tbh I don't see much value in regular users trying to triage gpu hangs. If
> > they're not damn sure that they have a dupe (which means same platform,
> > versions of the software stack and crashing games) I much prefer if they
> > just send in a duplicate bug for us to triage.
> > 
> > With the mis-design of bugzilla it's much harder to untangle a wrong
> > me-too than mark something as duplicate. And especially long-running bugs
> > are a royal pain if there's too much wrong me-too noise in there.
> > 
> > Not a comment on the patch itself, just a general comment wrt avoiding
> > me-too gpu hang reports.
> 
> So you're saying the GPU error decode tool should create a bug template
> for people so we don't get the "me too" reports?
> 
> What I see above is that it's really important to avoid the "me too"
> stuff, and to do it in such a way that false positives are minimized
> (e.g. the IPEHR bit Ubuntu used to use).  So I guess I don't see what's
> unconvincing here.  Today we have no way of differentiating w/o digging
> into the error record, which users definitely won't do, and this patch
> seems like it could only help with that... so count me confused.

We have a full paragraph explaining to users exactly what they need to do.
They still me-too and fail to attach the error state. I don't see how adding
even more helps, since it never really did.

Anyway, patch merged since meh. I'd still like to see the same information
dumped into the error state though.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

* Re: [PATCH] [RFC] drm/i915: Generate a hang error code
  2014-02-05 16:18       ` Daniel Vetter
@ 2014-02-05 16:30         ` Daniel Vetter
  0 siblings, 0 replies; 7+ messages in thread
From: Daniel Vetter @ 2014-02-05 16:30 UTC
  To: Jesse Barnes; +Cc: Intel GFX, Ben Widawsky, Ben Widawsky

On Wed, Feb 5, 2014 at 5:18 PM, Daniel Vetter <daniel@ffwll.ch> wrote:
> I'd still like to see the same information
> dumped into the error state though.

This was re: Jesse's idea on irc to dump pid/comm, too. But adding the
same gpu hang cookie computation to the error state decoder might
still make sense.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
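
The suggestion of adding the same hang cookie computation to the error
state decoder could look roughly like the userspace sketch below. It
rests on assumptions: the dump is taken to contain lines of the form
"IPEHR: 0x..." and "INSTDONE: 0x...", and the helper names find_reg() and
hang_cookie_from_dump() are made up for illustration. Unlike the kernel
function, this sketch does not know which ring hung. It simply uses the
first occurrence of each register, so on a dump with several rings it may
not reproduce the kernel's value.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Find "<key> 0x<hex>" in text and store the value in *out. The layout is
 * an assumption about the dump format, not a documented interface.
 * Returns 1 on success, 0 if the key is missing or malformed.
 */
int find_reg(const char *text, const char *key, uint32_t *out)
{
	const char *p;
	unsigned int val;

	assert(text != NULL && key != NULL && out != NULL);

	p = strstr(text, key);
	if (p == NULL)
		return 0;
	p += strlen(key);
	if (sscanf(p, " 0x%x", &val) != 1)
		return 0;
	*out = (uint32_t)val;
	return 1;
}

/* Recompute the cookie as IPEHR ^ INSTDONE from the first match of each. */
int hang_cookie_from_dump(const char *dump, uint32_t *cookie)
{
	uint32_t ipehr, instdone;

	if (!find_reg(dump, "IPEHR:", &ipehr) ||
	    !find_reg(dump, "INSTDONE:", &instdone))
		return 0;
	*cookie = ipehr ^ instdone;
	return 1;
}
```

A decoder built this way would let developers compare the value printed
in dmesg with one derived from an attached error state, provided both
are computed from the same ring.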

Thread overview: 7+ messages
2014-02-04 12:18 [PATCH] [RFC] drm/i915: Generate a hang error code Ben Widawsky
2014-02-04 12:43 ` Daniel Vetter
2014-02-05 14:59 ` Jesse Barnes
2014-02-05 15:15   ` Daniel Vetter
2014-02-05 16:03     ` Jesse Barnes
2014-02-05 16:18       ` Daniel Vetter
2014-02-05 16:30         ` Daniel Vetter