All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 0/3] hangcheck fixes
@ 2011-10-04 21:11 Ben Widawsky
  2011-10-04 21:11 ` [PATCH 1/3] drm/i915: collect more per ring error info Ben Widawsky
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Ben Widawsky @ 2011-10-04 21:11 UTC (permalink / raw)
  To: intel-gfx; +Cc: Ben Widawsky

I split out a lot more of the per ring registers in this version, as
well as added the patch to remove the kick_ring, which AFAICT doesn't
really help anything.

Patch 3 can be dropped if not desired.

Ben Widawsky (3):
  drm/i915: collect more per ring error info
  drm/i915: check acthd for all rings
  drm/i915: remove ring kick

 drivers/gpu/drm/i915/i915_debugfs.c |   44 +++++---
 drivers/gpu/drm/i915/i915_drv.h     |   33 ++----
 drivers/gpu/drm/i915/i915_irq.c     |  200 ++++++++++++++++++++---------------
 drivers/gpu/drm/i915/i915_reg.h     |   14 +++-
 4 files changed, 164 insertions(+), 127 deletions(-)

-- 
1.7.6.4

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 1/3] drm/i915: collect more per ring error info
  2011-10-04 21:11 [PATCH v2 0/3] hangcheck fixes Ben Widawsky
@ 2011-10-04 21:11 ` Ben Widawsky
  2011-10-07 13:52   ` Daniel Vetter
  2011-10-04 21:11 ` [PATCH 2/3] drm/i915: check acthd for all rings Ben Widawsky
  2011-10-04 21:11 ` [PATCH 3/3] drm/i915: remove ring kick Ben Widawsky
  2 siblings, 1 reply; 8+ messages in thread
From: Ben Widawsky @ 2011-10-04 21:11 UTC (permalink / raw)
  To: intel-gfx; +Cc: Ben Widawsky

V1: Just added per ring page faults
Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>

V2: Added much more per ring stuff
Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_debugfs.c |   44 +++++++++++++++----------
 drivers/gpu/drm/i915/i915_drv.h     |   28 ++++++----------
 drivers/gpu/drm/i915/i915_irq.c     |   60 ++++++++++++++++++++--------------
 drivers/gpu/drm/i915/i915_reg.h     |   14 +++++++-
 4 files changed, 84 insertions(+), 62 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index 8e95d66..332a418 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -747,34 +747,42 @@ static int i915_error_state(struct seq_file *m, void *unused)
 	seq_printf(m, "Time: %ld s %ld us\n", error->time.tv_sec,
 		   error->time.tv_usec);
 	seq_printf(m, "PCI ID: 0x%04x\n", dev->pci_device);
-	seq_printf(m, "EIR: 0x%08x\n", error->eir);
+	seq_printf(m, "EIR: 0x%08x\n", error->eir[RCS]);
 	seq_printf(m, "PGTBL_ER: 0x%08x\n", error->pgtbl_er);
 	if (INTEL_INFO(dev)->gen >= 6) {
+		seq_printf(m, "GFX Page Fault: 0x%08x\n", error->page_fault[RCS]);
+		seq_printf(m, "Media Page Fault: 0x%08x\n", error->page_fault[VCS]);
+		seq_printf(m, "Blitter Page Fault: 0x%08x\n", error->page_fault[BCS]);
 		seq_printf(m, "ERROR: 0x%08x\n", error->error);
 		seq_printf(m, "Blitter command stream:\n");
-		seq_printf(m, "  ACTHD:    0x%08x\n", error->bcs_acthd);
-		seq_printf(m, "  IPEIR:    0x%08x\n", error->bcs_ipeir);
-		seq_printf(m, "  IPEHR:    0x%08x\n", error->bcs_ipehr);
-		seq_printf(m, "  INSTDONE: 0x%08x\n", error->bcs_instdone);
-		seq_printf(m, "  seqno:    0x%08x\n", error->bcs_seqno);
+		seq_printf(m, "  ACTHD:    0x%08x\n", error->acthd[BCS]);
+		seq_printf(m, "  IPEIR:    0x%08x\n", error->ipeir[BCS]);
+		seq_printf(m, "  IPEHR:    0x%08x\n", error->ipehr[BCS]);
+		seq_printf(m, "  INSTDONE: 0x%08x\n", error->instdone[BCS]);
+		seq_printf(m, "  FADDR:    0x%08x\n", error->fadd[BCS]);
+		seq_printf(m, "  INSTPS:   0x%08x\n", error->instps[BCS]);
+		seq_printf(m, "  seqno:    0x%08x\n", error->seqno[BCS]);
 		seq_printf(m, "Video (BSD) command stream:\n");
-		seq_printf(m, "  ACTHD:    0x%08x\n", error->vcs_acthd);
-		seq_printf(m, "  IPEIR:    0x%08x\n", error->vcs_ipeir);
-		seq_printf(m, "  IPEHR:    0x%08x\n", error->vcs_ipehr);
-		seq_printf(m, "  INSTDONE: 0x%08x\n", error->vcs_instdone);
-		seq_printf(m, "  seqno:    0x%08x\n", error->vcs_seqno);
+		seq_printf(m, "  ACTHD:    0x%08x\n", error->acthd[VCS]);
+		seq_printf(m, "  IPEIR:    0x%08x\n", error->ipeir[VCS]);
+		seq_printf(m, "  IPEHR:    0x%08x\n", error->ipehr[VCS]);
+		seq_printf(m, "  INSTDONE: 0x%08x\n", error->instdone[VCS]);
+		seq_printf(m, "  FADDR:    0x%08x\n", error->fadd[VCS]);
+		seq_printf(m, "  INSTPS:   0x%08x\n", error->instps[VCS]);
+		seq_printf(m, "  seqno:    0x%08x\n", error->seqno[VCS]);
 	}
 	seq_printf(m, "Render command stream:\n");
-	seq_printf(m, "  ACTHD: 0x%08x\n", error->acthd);
-	seq_printf(m, "  IPEIR: 0x%08x\n", error->ipeir);
-	seq_printf(m, "  IPEHR: 0x%08x\n", error->ipehr);
-	seq_printf(m, "  INSTDONE: 0x%08x\n", error->instdone);
+	seq_printf(m, "  ACTHD: 0x%08x\n", error->acthd[RCS]);
+	seq_printf(m, "  IPEIR: 0x%08x\n", error->ipeir[RCS]);
+	seq_printf(m, "  IPEHR: 0x%08x\n", error->ipehr[RCS]);
+	seq_printf(m, "  INSTDONE: 0x%08x\n", error->instdone[RCS]);
 	if (INTEL_INFO(dev)->gen >= 4) {
 		seq_printf(m, "  INSTDONE1: 0x%08x\n", error->instdone1);
-		seq_printf(m, "  INSTPS: 0x%08x\n", error->instps);
+		seq_printf(m, "  INSTPS: 0x%08x\n", error->instps[RCS]);
 	}
-	seq_printf(m, "  INSTPM: 0x%08x\n", error->instpm);
-	seq_printf(m, "  seqno: 0x%08x\n", error->seqno);
+	seq_printf(m, "  INSTPM: 0x%08x\n", error->instpm[RCS]);
+	seq_printf(m, "  FADDR:  0x%08x\n", error->fadd[RCS]);
+	seq_printf(m, "  seqno:  0x%08x\n", error->seqno[RCS]);
 
 	for (i = 0; i < dev_priv->num_fence_regs; i++)
 		seq_printf(m, "  fence[%d] = %08llx\n", i, error->fence[i]);
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 15c0ca5..567275c 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -146,28 +146,20 @@ struct sdvo_device_mapping {
 struct intel_display_error_state;
 
 struct drm_i915_error_state {
-	u32 eir;
+	u32 eir[I915_NUM_RINGS];
 	u32 pgtbl_er;
 	u32 pipestat[I915_MAX_PIPES];
-	u32 ipeir;
-	u32 ipehr;
-	u32 instdone;
-	u32 acthd;
+	u32 ipeir[I915_NUM_RINGS];
+	u32 ipehr[I915_NUM_RINGS];
+	u32 instdone[I915_NUM_RINGS];
+	u32 acthd[I915_NUM_RINGS];
+	u32 page_fault[I915_NUM_RINGS];
+	u32 fadd[I915_NUM_RINGS];
 	u32 error; /* gen6+ */
-	u32 bcs_acthd; /* gen6+ blt engine */
-	u32 bcs_ipehr;
-	u32 bcs_ipeir;
-	u32 bcs_instdone;
-	u32 bcs_seqno;
-	u32 vcs_acthd; /* gen6+ bsd engine */
-	u32 vcs_ipehr;
-	u32 vcs_ipeir;
-	u32 vcs_instdone;
-	u32 vcs_seqno;
-	u32 instpm;
-	u32 instps;
+	u32 instpm[I915_NUM_RINGS];
+	u32 instps[I915_NUM_RINGS];
 	u32 instdone1;
-	u32 seqno;
+	u32 seqno[I915_NUM_RINGS];
 	u64 bbaddr;
 	u64 fence[16];
 	struct timeval time;
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 012732b..97e338b 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -906,45 +906,55 @@ static void i915_capture_error_state(struct drm_device *dev)
 	DRM_INFO("capturing error event; look for more information in /debug/dri/%d/i915_error_state\n",
 		 dev->primary->index);
 
-	error->seqno = dev_priv->ring[RCS].get_seqno(&dev_priv->ring[RCS]);
-	error->eir = I915_READ(EIR);
+	error->seqno[RCS] = dev_priv->ring[RCS].get_seqno(&dev_priv->ring[RCS]);
+	error->eir[RCS] = I915_READ(EIR);
 	error->pgtbl_er = I915_READ(PGTBL_ER);
 	for_each_pipe(pipe)
 		error->pipestat[pipe] = I915_READ(PIPESTAT(pipe));
-	error->instpm = I915_READ(INSTPM);
+	error->instpm[RCS] = I915_READ(INSTPM);
 	error->error = 0;
 	if (INTEL_INFO(dev)->gen >= 6) {
 		error->error = I915_READ(ERROR_GEN6);
-
-		error->bcs_acthd = I915_READ(BCS_ACTHD);
-		error->bcs_ipehr = I915_READ(BCS_IPEHR);
-		error->bcs_ipeir = I915_READ(BCS_IPEIR);
-		error->bcs_instdone = I915_READ(BCS_INSTDONE);
-		error->bcs_seqno = 0;
+		error->page_fault[RCS] = I915_READ(GEN6_FAULT);
+		error->fadd[RCS] = I915_READ(GEN6_DMA_FADD);
+
+		error->eir[BCS] = I915_READ(GEN6_BLITTER_EIR);
+		error->acthd[BCS] = I915_READ(BCS_ACTHD);
+		error->ipehr[BCS] = I915_READ(BCS_IPEHR);
+		error->ipeir[BCS] = I915_READ(BCS_IPEIR);
+		error->fadd[BCS] = I915_READ(GEN6_BCS_FADD);
+		error->instdone[BCS] = I915_READ(BCS_INSTDONE);
+		error->instps[BCS] = I915_READ(GEN6_BCS_INSTPS);
+		error->seqno[BCS] = 0;
 		if (dev_priv->ring[BCS].get_seqno)
-			error->bcs_seqno = dev_priv->ring[BCS].get_seqno(&dev_priv->ring[BCS]);
-
-		error->vcs_acthd = I915_READ(VCS_ACTHD);
-		error->vcs_ipehr = I915_READ(VCS_IPEHR);
-		error->vcs_ipeir = I915_READ(VCS_IPEIR);
-		error->vcs_instdone = I915_READ(VCS_INSTDONE);
-		error->vcs_seqno = 0;
+			error->seqno[BCS] = dev_priv->ring[BCS].get_seqno(&dev_priv->ring[BCS]);
+		error->page_fault[BCS] = I915_READ(GEN6_BLT_FAULT);
+
+		error->eir[VCS] = I915_READ(GEN6_BSD_EIR);
+		error->acthd[VCS] = I915_READ(VCS_ACTHD);
+		error->ipehr[VCS] = I915_READ(VCS_IPEHR);
+		error->ipeir[VCS] = I915_READ(VCS_IPEIR);
+		error->fadd[VCS] = I915_READ(GEN6_VCS_FADD);
+		error->instdone[VCS] = I915_READ(VCS_INSTDONE);
+		error->instps[VCS] = I915_READ(GEN6_VCS_INSTPS);
+		error->seqno[VCS] = 0;
 		if (dev_priv->ring[VCS].get_seqno)
-			error->vcs_seqno = dev_priv->ring[VCS].get_seqno(&dev_priv->ring[VCS]);
+			error->seqno[VCS] = dev_priv->ring[VCS].get_seqno(&dev_priv->ring[VCS]);
+		error->page_fault[VCS] = I915_READ(GEN6_BSD_FAULT);
 	}
 	if (INTEL_INFO(dev)->gen >= 4) {
-		error->ipeir = I915_READ(IPEIR_I965);
-		error->ipehr = I915_READ(IPEHR_I965);
-		error->instdone = I915_READ(INSTDONE_I965);
-		error->instps = I915_READ(INSTPS);
+		error->ipeir[RCS] = I915_READ(IPEIR_I965);
+		error->ipehr[RCS] = I915_READ(IPEHR_I965);
+		error->instdone[RCS] = I915_READ(INSTDONE_I965);
+		error->instps[RCS] = I915_READ(INSTPS);
 		error->instdone1 = I915_READ(INSTDONE1);
 		error->acthd = I915_READ(ACTHD_I965);
 		error->bbaddr = I915_READ64(BB_ADDR);
 	} else {
-		error->ipeir = I915_READ(IPEIR);
-		error->ipehr = I915_READ(IPEHR);
-		error->instdone = I915_READ(INSTDONE);
-		error->acthd = I915_READ(ACTHD);
+		error->ipeir[RCS] = I915_READ(IPEIR);
+		error->ipehr[RCS] = I915_READ(IPEHR);
+		error->instdone[RCS] = I915_READ(INSTDONE);
+		error->acthd[RCS] = I915_READ(ACTHD);
 		error->bbaddr = 0;
 	}
 	i915_gem_record_fences(dev, error);
diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
index 138eae1..ca28efc 100644
--- a/drivers/gpu/drm/i915/i915_reg.h
+++ b/drivers/gpu/drm/i915/i915_reg.h
@@ -347,6 +347,7 @@
 #define IPEHR_I965	0x02068
 #define INSTDONE_I965	0x0206c
 #define INSTPS		0x02070 /* 965+ only */
+#define GEN6_DMA_FADD	0x02078
 #define INSTDONE1	0x0207c /* 965+ only */
 #define ACTHD_I965	0x02074
 #define HWS_PGA		0x02080
@@ -362,12 +363,21 @@
 #define VCS_INSTDONE	0x1206C
 #define VCS_IPEIR	0x12064
 #define VCS_IPEHR	0x12068
+#define GEN6_VCS_INSTPS	0x12070 /* Not clear if this is valid */
 #define VCS_ACTHD	0x12074
+#define GEN6_VCS_FADD	0x12078
+#define GEN6_VCS_INSTDONE	0x12090
 #define BCS_INSTDONE	0x2206C
 #define BCS_IPEIR	0x22064
 #define BCS_IPEHR	0x22068
+#define GEN6_BCS_INSTPS	0x22070
 #define BCS_ACTHD	0x22074
+#define GEN6_BCS_FADD	0x22078
+#define GEN6_BCS_INSTDONE	0x22090
 
+#define GEN6_FAULT	0x04094
+#define GEN6_BSD_FAULT	0x04194
+#define GEN6_BLT_FAULT	0x04294
 #define ERROR_GEN6	0x040a0
 
 /* GM45+ chicken bits -- debug workaround bits that may be required
@@ -545,6 +555,8 @@
 #define   GEN6_BLITTER_COMMAND_PARSER_MASTER_ERROR	(1 << 25)
 #define   GEN6_BLITTER_SYNC_STATUS			(1 << 24)
 #define   GEN6_BLITTER_USER_INTERRUPT			(1 << 22)
+#define GEN6_BLITTER_EIR	0x220b0
+#define GEN6_BLITTER_EMR	0x220b4
 
 #define GEN6_BLITTER_ECOSKPD	0x221d0
 #define   GEN6_BLITTER_LOCK_SHIFT			16
@@ -559,7 +571,7 @@
 #define GEN6_BSD_HWSTAM			0x12098
 #define GEN6_BSD_IMR			0x120a8
 #define   GEN6_BSD_USER_INTERRUPT	(1 << 12)
-
+#define GEN6_BSD_EIR			0x120b0
 #define GEN6_BSD_RNCID			0x12198
 
 /*
-- 
1.7.6.4

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 2/3] drm/i915: check acthd for all rings
  2011-10-04 21:11 [PATCH v2 0/3] hangcheck fixes Ben Widawsky
  2011-10-04 21:11 ` [PATCH 1/3] drm/i915: collect more per ring error info Ben Widawsky
@ 2011-10-04 21:11 ` Ben Widawsky
  2011-10-07 13:59   ` Daniel Vetter
  2011-10-04 21:11 ` [PATCH 3/3] drm/i915: remove ring kick Ben Widawsky
  2 siblings, 1 reply; 8+ messages in thread
From: Ben Widawsky @ 2011-10-04 21:11 UTC (permalink / raw)
  To: intel-gfx; +Cc: Ben Widawsky

On Gen6+ we have other rings which may be in use. We haven't hung if the
blit or media ring is still going

Before rebase:
Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_drv.h |    5 +-
 drivers/gpu/drm/i915/i915_irq.c |  143 +++++++++++++++++++++++++++------------
 2 files changed, 102 insertions(+), 46 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 567275c..edfa8be 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -323,9 +323,8 @@ typedef struct drm_i915_private {
 #define DRM_I915_HANGCHECK_PERIOD 1500 /* in ms */
 	struct timer_list hangcheck_timer;
 	int hangcheck_count;
-	uint32_t last_acthd;
-	uint32_t last_instdone;
-	uint32_t last_instdone1;
+	uint32_t last_acthd[I915_NUM_RINGS];
+	uint64_t last_instdone[I915_NUM_RINGS];
 
 	unsigned long cfb_size;
 	unsigned int cfb_fb;
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 97e338b..6b6abe1 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -948,7 +948,7 @@ static void i915_capture_error_state(struct drm_device *dev)
 		error->instdone[RCS] = I915_READ(INSTDONE_I965);
 		error->instps[RCS] = I915_READ(INSTPS);
 		error->instdone1 = I915_READ(INSTDONE1);
-		error->acthd = I915_READ(ACTHD_I965);
+		error->acthd[RCS] = I915_READ(ACTHD_I965);
 		error->bbaddr = I915_READ64(BB_ADDR);
 	} else {
 		error->ipeir[RCS] = I915_READ(IPEIR);
@@ -1666,6 +1666,85 @@ static bool kick_ring(struct intel_ring_buffer *ring)
 	return false;
 }
 
+static bool
+instdone_stuck(struct drm_device *dev)
+{
+	struct drm_i915_private *dev_priv = dev->dev_private;
+	uint64_t instdone = 0, instdone1 = 0, vcs_instdone = 0, bcs_instdone = 0;
+	bool stuck;
+
+	switch (INTEL_INFO(dev)->gen) {
+	case 7:
+	case 6:
+		bcs_instdone = I915_READ(BCS_INSTDONE);
+	case 5:
+		vcs_instdone = I915_READ(VCS_INSTDONE);
+	case 4:
+		instdone = I915_READ(INSTDONE_I965);
+		instdone1 = I915_READ(INSTDONE1);
+		break;
+	case 3:
+	case 2:
+		instdone = I915_READ(INSTDONE);
+		break;
+	}
+
+	stuck =
+	    (dev_priv->last_instdone[RCS] == ((instdone << 32) | instdone1)) &&
+	    (dev_priv->last_instdone[VCS] == vcs_instdone) &&
+	    (dev_priv->last_instdone[BCS] == bcs_instdone);
+
+	dev_priv->last_instdone[RCS] = (instdone << 32) | instdone1;
+	dev_priv->last_instdone[VCS] = vcs_instdone;
+	dev_priv->last_instdone[BCS] = bcs_instdone;
+
+	return stuck;
+}
+
+static bool
+acthd_stuck(struct drm_device *dev)
+{
+	struct drm_i915_private *dev_priv = dev->dev_private;
+	uint32_t acthd = 0, vcs_acthd = 0, bcs_acthd = 0;
+	bool stuck = false;
+
+	switch (INTEL_INFO(dev)->gen) {
+	case 7:
+	case 6:
+		bcs_acthd = intel_ring_get_active_head(&dev_priv->ring[BCS]);
+	case 5:
+		vcs_acthd = intel_ring_get_active_head(&dev_priv->ring[VCS]);
+	case 4:
+	case 3:
+	case 2:
+		acthd = intel_ring_get_active_head(&dev_priv->ring[RCS]);
+		break;
+	}
+
+	stuck =
+	    (dev_priv->last_acthd[RCS] == acthd) &&
+	    (dev_priv->last_acthd[VCS] == vcs_acthd) &&
+	    (dev_priv->last_acthd[BCS] == bcs_acthd);
+
+	dev_priv->last_acthd[RCS] = acthd;
+	dev_priv->last_acthd[VCS] = vcs_acthd;
+	dev_priv->last_acthd[BCS] = bcs_acthd;
+
+	return stuck;
+}
+
+static bool gpu_stuck(struct drm_device *dev)
+{
+	#define NUM_HANGCHECKS_TO_RESET 1
+
+	struct drm_i915_private *dev_priv = dev->dev_private;
+
+	if (dev_priv->hangcheck_count++ < NUM_HANGCHECKS_TO_RESET)
+		return false;
+
+	return acthd_stuck(dev) && instdone_stuck(dev);
+}
+
 /**
  * This is called when the chip hasn't reported back with completed
  * batchbuffers in a long time. The first time this is called we simply record
@@ -1676,13 +1755,11 @@ void i915_hangcheck_elapsed(unsigned long data)
 {
 	struct drm_device *dev = (struct drm_device *)data;
 	drm_i915_private_t *dev_priv = dev->dev_private;
-	uint32_t acthd, instdone, instdone1;
 	bool err = false;
 
 	if (!i915_enable_hangcheck)
 		return;
 
-	/* If all work is done then ACTHD clearly hasn't advanced. */
 	if (i915_hangcheck_ring_idle(&dev_priv->ring[RCS], &err) &&
 	    i915_hangcheck_ring_idle(&dev_priv->ring[VCS], &err) &&
 	    i915_hangcheck_ring_idle(&dev_priv->ring[BCS], &err)) {
@@ -1692,50 +1769,30 @@ void i915_hangcheck_elapsed(unsigned long data)
 		return;
 	}
 
-	if (INTEL_INFO(dev)->gen < 4) {
-		acthd = I915_READ(ACTHD);
-		instdone = I915_READ(INSTDONE);
-		instdone1 = 0;
-	} else {
-		acthd = I915_READ(ACTHD_I965);
-		instdone = I915_READ(INSTDONE_I965);
-		instdone1 = I915_READ(INSTDONE1);
-	}
-
-	if (dev_priv->last_acthd == acthd &&
-	    dev_priv->last_instdone == instdone &&
-	    dev_priv->last_instdone1 == instdone1) {
-		if (dev_priv->hangcheck_count++ > 1) {
-			DRM_ERROR("Hangcheck timer elapsed... GPU hung\n");
+	if (gpu_stuck(dev)) {
+		DRM_ERROR("Hangcheck timer elapsed... GPU hung\n");
 
-			if (!IS_GEN2(dev)) {
-				/* Is the chip hanging on a WAIT_FOR_EVENT?
-				 * If so we can simply poke the RB_WAIT bit
-				 * and break the hang. This should work on
-				 * all but the second generation chipsets.
-				 */
-
-				if (kick_ring(&dev_priv->ring[RCS]))
-					goto repeat;
+		if (!IS_GEN2(dev)) {
+			/* Is the chip hanging on a WAIT_FOR_EVENT?
+			 * If so we can simply poke the RB_WAIT bit
+			 * and break the hang. This should work on
+			 * all but the second generation chipsets.
+			 */
 
-				if (HAS_BSD(dev) &&
-				    kick_ring(&dev_priv->ring[VCS]))
-					goto repeat;
+			if (kick_ring(&dev_priv->ring[RCS]))
+				goto repeat;
 
-				if (HAS_BLT(dev) &&
-				    kick_ring(&dev_priv->ring[BCS]))
-					goto repeat;
-			}
+			if (HAS_BSD(dev) &&
+			    kick_ring(&dev_priv->ring[VCS]))
+				goto repeat;
 
-			i915_handle_error(dev, true);
-			return;
+			if (HAS_BLT(dev) &&
+			    kick_ring(&dev_priv->ring[BCS]))
+				goto repeat;
 		}
-	} else {
-		dev_priv->hangcheck_count = 0;
 
-		dev_priv->last_acthd = acthd;
-		dev_priv->last_instdone = instdone;
-		dev_priv->last_instdone1 = instdone1;
+		i915_handle_error(dev, true);
+		return;
 	}
 
 repeat:
-- 
1.7.6.4

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 3/3] drm/i915: remove ring kick
  2011-10-04 21:11 [PATCH v2 0/3] hangcheck fixes Ben Widawsky
  2011-10-04 21:11 ` [PATCH 1/3] drm/i915: collect more per ring error info Ben Widawsky
  2011-10-04 21:11 ` [PATCH 2/3] drm/i915: check acthd for all rings Ben Widawsky
@ 2011-10-04 21:11 ` Ben Widawsky
  2011-10-07 14:00   ` Daniel Vetter
  2 siblings, 1 reply; 8+ messages in thread
From: Ben Widawsky @ 2011-10-04 21:11 UTC (permalink / raw)
  To: intel-gfx; +Cc: Ben Widawsky

Ring kick was racy (forcewake), and doesn't really do anything for us
that shouldn't be done by reset.

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_irq.c |   41 ---------------------------------------
 1 files changed, 0 insertions(+), 41 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 6b6abe1..6e168b6 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -1645,27 +1645,6 @@ static bool i915_hangcheck_ring_idle(struct intel_ring_buffer *ring, bool *err)
 	return false;
 }
 
-static bool kick_ring(struct intel_ring_buffer *ring)
-{
-	struct drm_device *dev = ring->dev;
-	struct drm_i915_private *dev_priv = dev->dev_private;
-	u32 tmp = I915_READ_CTL(ring);
-	if (tmp & RING_WAIT) {
-		DRM_ERROR("Kicking stuck wait on %s\n",
-			  ring->name);
-		I915_WRITE_CTL(ring, tmp);
-		return true;
-	}
-	if (IS_GEN6(dev) &&
-	    (tmp & RING_WAIT_SEMAPHORE)) {
-		DRM_ERROR("Kicking stuck semaphore on %s\n",
-			  ring->name);
-		I915_WRITE_CTL(ring, tmp);
-		return true;
-	}
-	return false;
-}
-
 static bool
 instdone_stuck(struct drm_device *dev)
 {
@@ -1771,26 +1750,6 @@ void i915_hangcheck_elapsed(unsigned long data)
 
 	if (gpu_stuck(dev)) {
 		DRM_ERROR("Hangcheck timer elapsed... GPU hung\n");
-
-		if (!IS_GEN2(dev)) {
-			/* Is the chip hanging on a WAIT_FOR_EVENT?
-			 * If so we can simply poke the RB_WAIT bit
-			 * and break the hang. This should work on
-			 * all but the second generation chipsets.
-			 */
-
-			if (kick_ring(&dev_priv->ring[RCS]))
-				goto repeat;
-
-			if (HAS_BSD(dev) &&
-			    kick_ring(&dev_priv->ring[VCS]))
-				goto repeat;
-
-			if (HAS_BLT(dev) &&
-			    kick_ring(&dev_priv->ring[BCS]))
-				goto repeat;
-		}
-
 		i915_handle_error(dev, true);
 		return;
 	}
-- 
1.7.6.4

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/3] drm/i915: collect more per ring error info
  2011-10-04 21:11 ` [PATCH 1/3] drm/i915: collect more per ring error info Ben Widawsky
@ 2011-10-07 13:52   ` Daniel Vetter
  2011-10-07 16:19     ` Chris Wilson
  0 siblings, 1 reply; 8+ messages in thread
From: Daniel Vetter @ 2011-10-07 13:52 UTC (permalink / raw)
  To: Ben Widawsky; +Cc: intel-gfx

On Tue, Oct 04, 2011 at 02:11:51PM -0700, Ben Widawsky wrote:
> V1: Just added per ring page faults
> Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> 
> V2: Added much more per ring stuff
> Signed-off-by: Ben Widawsky <ben@bwidawsk.net>

While you wrestle this, can we cut down on the line-noise a bit by using
ring->mmio_base an per-ring functions? By storing VCS, BCS, RCS in
ring->id instead of the flags that are currently there this should reduce
quite a bit.
-Daniel
-- 
Daniel Vetter
Mail: daniel@ffwll.ch
Mobile: +41 (0)79 365 57 48

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 2/3] drm/i915: check acthd for all rings
  2011-10-04 21:11 ` [PATCH 2/3] drm/i915: check acthd for all rings Ben Widawsky
@ 2011-10-07 13:59   ` Daniel Vetter
  0 siblings, 0 replies; 8+ messages in thread
From: Daniel Vetter @ 2011-10-07 13:59 UTC (permalink / raw)
  To: Ben Widawsky; +Cc: intel-gfx

On Tue, Oct 04, 2011 at 02:11:52PM -0700, Ben Widawsky wrote:
> On Gen6+ we have other rings which may be in use. We haven't hung if the
> blit or media ring is still going
> 
> Before rebase:
> Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> Signed-off-by: Ben Widawsky <ben@bwidawsk.net>

I've spotted another goof-up besides the instdone_stuck && acthd_stuck
I've overlooked last time around.

> ---
>  drivers/gpu/drm/i915/i915_drv.h |    5 +-
>  drivers/gpu/drm/i915/i915_irq.c |  143 +++++++++++++++++++++++++++------------
>  2 files changed, 102 insertions(+), 46 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 567275c..edfa8be 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -323,9 +323,8 @@ typedef struct drm_i915_private {
>  #define DRM_I915_HANGCHECK_PERIOD 1500 /* in ms */
>  	struct timer_list hangcheck_timer;
>  	int hangcheck_count;
> -	uint32_t last_acthd;
> -	uint32_t last_instdone;
> -	uint32_t last_instdone1;
> +	uint32_t last_acthd[I915_NUM_RINGS];
> +	uint64_t last_instdone[I915_NUM_RINGS];
>  
>  	unsigned long cfb_size;
>  	unsigned int cfb_fb;
> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> index 97e338b..6b6abe1 100644
> --- a/drivers/gpu/drm/i915/i915_irq.c
> +++ b/drivers/gpu/drm/i915/i915_irq.c
> @@ -948,7 +948,7 @@ static void i915_capture_error_state(struct drm_device *dev)
>  		error->instdone[RCS] = I915_READ(INSTDONE_I965);
>  		error->instps[RCS] = I915_READ(INSTPS);
>  		error->instdone1 = I915_READ(INSTDONE1);
> -		error->acthd = I915_READ(ACTHD_I965);
> +		error->acthd[RCS] = I915_READ(ACTHD_I965);
>  		error->bbaddr = I915_READ64(BB_ADDR);
>  	} else {
>  		error->ipeir[RCS] = I915_READ(IPEIR);
> @@ -1666,6 +1666,85 @@ static bool kick_ring(struct intel_ring_buffer *ring)
>  	return false;
>  }
>  
> +static bool
> +instdone_stuck(struct drm_device *dev)
> +{
> +	struct drm_i915_private *dev_priv = dev->dev_private;
> +	uint64_t instdone = 0, instdone1 = 0, vcs_instdone = 0, bcs_instdone = 0;
> +	bool stuck;
> +
> +	switch (INTEL_INFO(dev)->gen) {
> +	case 7:
> +	case 6:
> +		bcs_instdone = I915_READ(BCS_INSTDONE);
> +	case 5:
> +		vcs_instdone = I915_READ(VCS_INSTDONE);
> +	case 4:
> +		instdone = I915_READ(INSTDONE_I965);
> +		instdone1 = I915_READ(INSTDONE1);
> +		break;
> +	case 3:
> +	case 2:
> +		instdone = I915_READ(INSTDONE);
> +		break;
> +	}
> +
> +	stuck =
> +	    (dev_priv->last_instdone[RCS] == ((instdone << 32) | instdone1)) &&
> +	    (dev_priv->last_instdone[VCS] == vcs_instdone) &&
> +	    (dev_priv->last_instdone[BCS] == bcs_instdone);
> +
> +	dev_priv->last_instdone[RCS] = (instdone << 32) | instdone1;
> +	dev_priv->last_instdone[VCS] = vcs_instdone;
> +	dev_priv->last_instdone[BCS] = bcs_instdone;
> +
> +	return stuck;
> +}
> +
> +static bool
> +acthd_stuck(struct drm_device *dev)
> +{
> +	struct drm_i915_private *dev_priv = dev->dev_private;
> +	uint32_t acthd = 0, vcs_acthd = 0, bcs_acthd = 0;
> +	bool stuck = false;
> +
> +	switch (INTEL_INFO(dev)->gen) {
> +	case 7:
> +	case 6:
> +		bcs_acthd = intel_ring_get_active_head(&dev_priv->ring[BCS]);
> +	case 5:
> +		vcs_acthd = intel_ring_get_active_head(&dev_priv->ring[VCS]);
> +	case 4:
> +	case 3:
> +	case 2:
> +		acthd = intel_ring_get_active_head(&dev_priv->ring[RCS]);
> +		break;
> +	}
> +
> +	stuck =
> +	    (dev_priv->last_acthd[RCS] == acthd) &&
> +	    (dev_priv->last_acthd[VCS] == vcs_acthd) &&
> +	    (dev_priv->last_acthd[BCS] == bcs_acthd);
> +
> +	dev_priv->last_acthd[RCS] = acthd;
> +	dev_priv->last_acthd[VCS] = vcs_acthd;
> +	dev_priv->last_acthd[BCS] = bcs_acthd;
> +
> +	return stuck;
> +}
> +
> +static bool gpu_stuck(struct drm_device *dev)
> +{
> +	#define NUM_HANGCHECKS_TO_RESET 1
> +
> +	struct drm_i915_private *dev_priv = dev->dev_private;
> +
> +	if (dev_priv->hangcheck_count++ < NUM_HANGCHECKS_TO_RESET)
> +		return false;
> +
> +	return acthd_stuck(dev) && instdone_stuck(dev);

First check whether the gpu ist stuck, then increment the hangcheck_count
If the gpu is not stuck, we also need to clear the hangcheck_count again.

> +}
> +
>  /**
>   * This is called when the chip hasn't reported back with completed
>   * batchbuffers in a long time. The first time this is called we simply record
> @@ -1676,13 +1755,11 @@ void i915_hangcheck_elapsed(unsigned long data)
>  {
>  	struct drm_device *dev = (struct drm_device *)data;
>  	drm_i915_private_t *dev_priv = dev->dev_private;
> -	uint32_t acthd, instdone, instdone1;
>  	bool err = false;
>  
>  	if (!i915_enable_hangcheck)
>  		return;
>  
> -	/* If all work is done then ACTHD clearly hasn't advanced. */
>  	if (i915_hangcheck_ring_idle(&dev_priv->ring[RCS], &err) &&
>  	    i915_hangcheck_ring_idle(&dev_priv->ring[VCS], &err) &&
>  	    i915_hangcheck_ring_idle(&dev_priv->ring[BCS], &err)) {
> @@ -1692,50 +1769,30 @@ void i915_hangcheck_elapsed(unsigned long data)
>  		return;
>  	}
>  
> -	if (INTEL_INFO(dev)->gen < 4) {
> -		acthd = I915_READ(ACTHD);
> -		instdone = I915_READ(INSTDONE);
> -		instdone1 = 0;
> -	} else {
> -		acthd = I915_READ(ACTHD_I965);
> -		instdone = I915_READ(INSTDONE_I965);
> -		instdone1 = I915_READ(INSTDONE1);
> -	}
> -
> -	if (dev_priv->last_acthd == acthd &&
> -	    dev_priv->last_instdone == instdone &&
> -	    dev_priv->last_instdone1 == instdone1) {
> -		if (dev_priv->hangcheck_count++ > 1) {
> -			DRM_ERROR("Hangcheck timer elapsed... GPU hung\n");
> +	if (gpu_stuck(dev)) {
> +		DRM_ERROR("Hangcheck timer elapsed... GPU hung\n");
>  
> -			if (!IS_GEN2(dev)) {
> -				/* Is the chip hanging on a WAIT_FOR_EVENT?
> -				 * If so we can simply poke the RB_WAIT bit
> -				 * and break the hang. This should work on
> -				 * all but the second generation chipsets.
> -				 */
> -
> -				if (kick_ring(&dev_priv->ring[RCS]))
> -					goto repeat;
> +		if (!IS_GEN2(dev)) {
> +			/* Is the chip hanging on a WAIT_FOR_EVENT?
> +			 * If so we can simply poke the RB_WAIT bit
> +			 * and break the hang. This should work on
> +			 * all but the second generation chipsets.
> +			 */
>  
> -				if (HAS_BSD(dev) &&
> -				    kick_ring(&dev_priv->ring[VCS]))
> -					goto repeat;
> +			if (kick_ring(&dev_priv->ring[RCS]))
> +				goto repeat;
>  
> -				if (HAS_BLT(dev) &&
> -				    kick_ring(&dev_priv->ring[BCS]))
> -					goto repeat;
> -			}
> +			if (HAS_BSD(dev) &&
> +			    kick_ring(&dev_priv->ring[VCS]))
> +				goto repeat;
>  
> -			i915_handle_error(dev, true);
> -			return;
> +			if (HAS_BLT(dev) &&
> +			    kick_ring(&dev_priv->ring[BCS]))
> +				goto repeat;
>  		}
> -	} else {
> -		dev_priv->hangcheck_count = 0;
>  
> -		dev_priv->last_acthd = acthd;
> -		dev_priv->last_instdone = instdone;
> -		dev_priv->last_instdone1 = instdone1;
> +		i915_handle_error(dev, true);
> +		return;
>  	}
>  
>  repeat:
> -- 
> 1.7.6.4
> 
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/intel-gfx

-- 
Daniel Vetter
Mail: daniel@ffwll.ch
Mobile: +41 (0)79 365 57 48

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 3/3] drm/i915: remove ring kick
  2011-10-04 21:11 ` [PATCH 3/3] drm/i915: remove ring kick Ben Widawsky
@ 2011-10-07 14:00   ` Daniel Vetter
  0 siblings, 0 replies; 8+ messages in thread
From: Daniel Vetter @ 2011-10-07 14:00 UTC (permalink / raw)
  To: Ben Widawsky; +Cc: intel-gfx

On Tue, Oct 04, 2011 at 02:11:53PM -0700, Ben Widawsky wrote:
> Ring kick was racy (forcewake), and doesn't really do anything for us
> that shouldn't be done by reset.
> 
> Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
> ---
>  drivers/gpu/drm/i915/i915_irq.c |   41 ---------------------------------------
>  1 files changed, 0 insertions(+), 41 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> index 6b6abe1..6e168b6 100644
> --- a/drivers/gpu/drm/i915/i915_irq.c
> +++ b/drivers/gpu/drm/i915/i915_irq.c
> @@ -1645,27 +1645,6 @@ static bool i915_hangcheck_ring_idle(struct intel_ring_buffer *ring, bool *err)
>  	return false;
>  }
>  
> -static bool kick_ring(struct intel_ring_buffer *ring)
> -{
> -	struct drm_device *dev = ring->dev;
> -	struct drm_i915_private *dev_priv = dev->dev_private;
> -	u32 tmp = I915_READ_CTL(ring);
> -	if (tmp & RING_WAIT) {
> -		DRM_ERROR("Kicking stuck wait on %s\n",
> -			  ring->name);
> -		I915_WRITE_CTL(ring, tmp);
> -		return true;
> -	}
> -	if (IS_GEN6(dev) &&
> -	    (tmp & RING_WAIT_SEMAPHORE)) {
> -		DRM_ERROR("Kicking stuck semaphore on %s\n",
> -			  ring->name);
> -		I915_WRITE_CTL(ring, tmp);
> -		return true;
> -	}

As discussed on irc, I think we need to retain the kick ring stuck on
MI_WAIT for older, broken userspace. Also, kicking rings stuck on MI_WAITS
should be pretty harmless, at most we see a bit of gargabe on the screen.
-Daniel

-- 
Daniel Vetter
Mail: daniel@ffwll.ch
Mobile: +41 (0)79 365 57 48

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/3] drm/i915: collect more per ring error info
  2011-10-07 13:52   ` Daniel Vetter
@ 2011-10-07 16:19     ` Chris Wilson
  0 siblings, 0 replies; 8+ messages in thread
From: Chris Wilson @ 2011-10-07 16:19 UTC (permalink / raw)
  To: Daniel Vetter, Ben Widawsky; +Cc: intel-gfx

On Fri, 7 Oct 2011 15:52:01 +0200, Daniel Vetter <daniel@ffwll.ch> wrote:
> On Tue, Oct 04, 2011 at 02:11:51PM -0700, Ben Widawsky wrote:
> > V1: Just added per ring page faults
> > Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> > 
> > V2: Added much more per ring stuff
> > Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
> 
> While you wrestle this, can we cut down on the line-noise a bit by using
> ring->mmio_base an per-ring functions? By storing VCS, BCS, RCS in
> ring->id instead of the flags that are currently there this should reduce
> quite a bit.

You beat me to it, I was also going to suggest this. Seconded ;-)
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2011-10-07 16:19 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-10-04 21:11 [PATCH v2 0/3] hangcheck fixes Ben Widawsky
2011-10-04 21:11 ` [PATCH 1/3] drm/i915: collect more per ring error info Ben Widawsky
2011-10-07 13:52   ` Daniel Vetter
2011-10-07 16:19     ` Chris Wilson
2011-10-04 21:11 ` [PATCH 2/3] drm/i915: check acthd for all rings Ben Widawsky
2011-10-07 13:59   ` Daniel Vetter
2011-10-04 21:11 ` [PATCH 3/3] drm/i915: remove ring kick Ben Widawsky
2011-10-07 14:00   ` Daniel Vetter

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.