All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH 00/12] AubCrash
@ 2017-10-27 18:01 Oscar Mateo
  2017-10-27 18:01 ` [RFC PATCH 01/12] drm/i915: New define for the number of PDPEs per PDP in a 4-level walk Oscar Mateo
                   ` (14 more replies)
  0 siblings, 15 replies; 26+ messages in thread
From: Oscar Mateo @ 2017-10-27 18:01 UTC (permalink / raw)
  To: intel-gfx

AubCrash is a companion to i915_gpu_error. It gives us the possibility to
dump an AUB file that describes the state of the system at the point of
the crash (GTTs, contexts, BBs, BOs, etc...). Being an AUB file, it can be
used by a number of already existing tools (graphical AUB file browsers,
simulators, emulators, etc...) that facilitate debugging (an improvement
over the current text-based crash dump).

Oscar Mateo (12):
  drm/i915: New define for the number of PDPEs per PDP in a 4-level walk
  drm/i915: Move the context switch status reason bits to the header
  drm/i915: Store the GGTT Stolen Memory base physical address
  drm/i915: Capture some extra small details in the GPU error state
  drm/i915: Capture the renderstate batchbuffer
  drm/i915: Provide a way to write binary data to the error dump file
  drm/i915: Skeleton for AubCrash
  drm/i915: Capture the PPGTT pagetables on a GPU crash
  drm/i915: Store PTE information together with the error object pages
  drm/i915: Always add BOs to capture list if AubCrash is enabled
  drm/i915: Add an AUB file format writer
  drm/i915: Actually write the AUB file

 drivers/gpu/drm/i915/Kconfig                   |   8 +
 drivers/gpu/drm/i915/Makefile                  |   1 +
 drivers/gpu/drm/i915/i915_aubcrash.c           | 436 ++++++++++++++++
 drivers/gpu/drm/i915/i915_aubcrash.h           |  79 +++
 drivers/gpu/drm/i915/i915_aubmemtrace.c        | 665 +++++++++++++++++++++++++
 drivers/gpu/drm/i915/i915_aubmemtrace.h        |  76 +++
 drivers/gpu/drm/i915/i915_aubmemtrace_format.h | 359 +++++++++++++
 drivers/gpu/drm/i915/i915_debugfs.c            |  49 +-
 drivers/gpu/drm/i915/i915_drv.h                |  20 +-
 drivers/gpu/drm/i915/i915_gem_execbuffer.c     |   4 +-
 drivers/gpu/drm/i915/i915_gem_gtt.c            |   1 +
 drivers/gpu/drm/i915/i915_gem_gtt.h            |   4 +-
 drivers/gpu/drm/i915/i915_gem_render_state.c   |  11 +
 drivers/gpu/drm/i915/i915_gem_render_state.h   |   1 +
 drivers/gpu/drm/i915/i915_gpu_error.c          |  89 +++-
 drivers/gpu/drm/i915/i915_sysfs.c              |  54 +-
 drivers/gpu/drm/i915/intel_lrc.c               |  12 -
 drivers/gpu/drm/i915/intel_lrc.h               |  14 +
 18 files changed, 1833 insertions(+), 50 deletions(-)
 create mode 100644 drivers/gpu/drm/i915/i915_aubcrash.c
 create mode 100644 drivers/gpu/drm/i915/i915_aubcrash.h
 create mode 100644 drivers/gpu/drm/i915/i915_aubmemtrace.c
 create mode 100644 drivers/gpu/drm/i915/i915_aubmemtrace.h
 create mode 100644 drivers/gpu/drm/i915/i915_aubmemtrace_format.h

-- 
1.9.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [RFC PATCH 01/12] drm/i915: New define for the number of PDPEs per PDP in a 4-level walk
  2017-10-27 18:01 [RFC PATCH 00/12] AubCrash Oscar Mateo
@ 2017-10-27 18:01 ` Oscar Mateo
  2017-10-27 18:01 ` [RFC PATCH 02/12] drm/i915: Move the context switch status reason bits to the header Oscar Mateo
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 26+ messages in thread
From: Oscar Mateo @ 2017-10-27 18:01 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson, Mika Kuoppala

The number of PML4Es per PML4 and the number of PDPEs per PDP
should not be confused (even if the amount, 512, is indeed the
same).

Signed-off-by: Oscar Mateo <oscar.mateo@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Cc: Chris Wilson <chris@chris-wsilon.co.uk>
---
 drivers/gpu/drm/i915/i915_gem_gtt.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.h b/drivers/gpu/drm/i915/i915_gem_gtt.h
index 93211a9..717bbd6 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.h
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.h
@@ -124,6 +124,7 @@
  * 47:39 | 38:30 | 29:21 | 20:12 |  11:0
  * PML4E | PDPE  |  PDE  |  PTE  | offset
  */
+#define GEN8_4LVL_PDPES			512
 #define GEN8_PML4ES_PER_PML4		512
 #define GEN8_PML4E_SHIFT		39
 #define GEN8_PML4E_MASK			(GEN8_PML4ES_PER_PML4 - 1)
@@ -488,7 +489,7 @@ static inline u32 gen6_pde_index(u32 addr)
 i915_pdpes_per_pdp(const struct i915_address_space *vm)
 {
 	if (i915_vm_is_48bit(vm))
-		return GEN8_PML4ES_PER_PML4;
+		return GEN8_4LVL_PDPES;
 
 	return GEN8_3LVL_PDPES;
 }
-- 
1.9.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC PATCH 02/12] drm/i915: Move the context switch status reason bits to the header
  2017-10-27 18:01 [RFC PATCH 00/12] AubCrash Oscar Mateo
  2017-10-27 18:01 ` [RFC PATCH 01/12] drm/i915: New define for the number of PDPEs per PDP in a 4-level walk Oscar Mateo
@ 2017-10-27 18:01 ` Oscar Mateo
  2017-10-27 18:01 ` [RFC PATCH 03/12] drm/i915: Store the GGTT Stolen Memory base physical address Oscar Mateo
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 26+ messages in thread
From: Oscar Mateo @ 2017-10-27 18:01 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson, Mika Kuoppala

I am planning on using them in AubCrash. While at it, define the mask
and shift for the Last Context Switch Reason field inside the Execlist
Status register.

Signed-off-by: Oscar Mateo <oscar.mateo@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Cc: Chris Wilson <chris@chris-wsilon.co.uk>
---
 drivers/gpu/drm/i915/intel_lrc.c | 12 ------------
 drivers/gpu/drm/i915/intel_lrc.h | 14 ++++++++++++++
 2 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index e821c1e..f4a3516 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -145,18 +145,6 @@
 #define RING_EXECLIST1_ACTIVE		(1 << 0x11)
 #define RING_EXECLIST0_ACTIVE		(1 << 0x12)
 
-#define GEN8_CTX_STATUS_IDLE_ACTIVE	(1 << 0)
-#define GEN8_CTX_STATUS_PREEMPTED	(1 << 1)
-#define GEN8_CTX_STATUS_ELEMENT_SWITCH	(1 << 2)
-#define GEN8_CTX_STATUS_ACTIVE_IDLE	(1 << 3)
-#define GEN8_CTX_STATUS_COMPLETE	(1 << 4)
-#define GEN8_CTX_STATUS_LITE_RESTORE	(1 << 15)
-
-#define GEN8_CTX_STATUS_COMPLETED_MASK \
-	 (GEN8_CTX_STATUS_ACTIVE_IDLE | \
-	  GEN8_CTX_STATUS_PREEMPTED | \
-	  GEN8_CTX_STATUS_ELEMENT_SWITCH)
-
 #define CTX_LRI_HEADER_0		0x01
 #define CTX_CONTEXT_CONTROL		0x02
 #define CTX_RING_HEAD			0x04
diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h
index 689fde1..7da2809 100644
--- a/drivers/gpu/drm/i915/intel_lrc.h
+++ b/drivers/gpu/drm/i915/intel_lrc.h
@@ -32,6 +32,8 @@
 /* Execlists regs */
 #define RING_ELSP(engine)			_MMIO((engine)->mmio_base + 0x230)
 #define RING_EXECLIST_STATUS_LO(engine)		_MMIO((engine)->mmio_base + 0x234)
+#define   EL_STATUS_LAST_CTX_SWITCH_SHIFT	5
+#define   EL_STATUS_LAST_CTX_SWITCH_MASK	(0x1ff << EL_STAT_LAST_CTX_SWITCH_REASON_SHIFT)
 #define RING_EXECLIST_STATUS_HI(engine)		_MMIO((engine)->mmio_base + 0x234 + 4)
 #define RING_CONTEXT_CONTROL(engine)		_MMIO((engine)->mmio_base + 0x244)
 #define	  CTX_CTRL_INHIBIT_SYN_CTX_SWITCH	(1 << 3)
@@ -42,6 +44,18 @@
 #define RING_CONTEXT_STATUS_BUF_HI(engine, i)	_MMIO((engine)->mmio_base + 0x370 + (i) * 8 + 4)
 #define RING_CONTEXT_STATUS_PTR(engine)		_MMIO((engine)->mmio_base + 0x3a0)
 
+/* Context switch status */
+#define GEN8_CTX_STATUS_IDLE_ACTIVE	(1 << 0)
+#define GEN8_CTX_STATUS_PREEMPTED	(1 << 1)
+#define GEN8_CTX_STATUS_ELEMENT_SWITCH	(1 << 2)
+#define GEN8_CTX_STATUS_ACTIVE_IDLE	(1 << 3)
+#define GEN8_CTX_STATUS_COMPLETE	(1 << 4)
+#define GEN8_CTX_STATUS_LITE_RESTORE	(1 << 15)
+#define GEN8_CTX_STATUS_COMPLETED_MASK \
+	 (GEN8_CTX_STATUS_ACTIVE_IDLE | \
+	  GEN8_CTX_STATUS_PREEMPTED | \
+	  GEN8_CTX_STATUS_ELEMENT_SWITCH)
+
 /* The docs specify that the write pointer wraps around after 5h, "After status
  * is written out to the last available status QW at offset 5h, this pointer
  * wraps to 0."
-- 
1.9.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC PATCH 03/12] drm/i915: Store the GGTT Stolen Memory base physical address
  2017-10-27 18:01 [RFC PATCH 00/12] AubCrash Oscar Mateo
  2017-10-27 18:01 ` [RFC PATCH 01/12] drm/i915: New define for the number of PDPEs per PDP in a 4-level walk Oscar Mateo
  2017-10-27 18:01 ` [RFC PATCH 02/12] drm/i915: Move the context switch status reason bits to the header Oscar Mateo
@ 2017-10-27 18:01 ` Oscar Mateo
  2017-10-27 18:01 ` [RFC PATCH 04/12] drm/i915: Capture some extra small details in the GPU error state Oscar Mateo
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 26+ messages in thread
From: Oscar Mateo @ 2017-10-27 18:01 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson, Mika Kuoppala

I'm planning on using it later for AubCrash.

Signed-off-by: Oscar Mateo <oscar.mateo@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Cc: Chris Wilson <chris@chris-wsilon.co.uk>
---
 drivers/gpu/drm/i915/i915_gem_gtt.c | 1 +
 drivers/gpu/drm/i915/i915_gem_gtt.h | 1 +
 2 files changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
index 5eaa689..c5b5892 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -2974,6 +2974,7 @@ static int ggtt_probe_common(struct i915_ggtt *ggtt, u64 size)
 
 	/* For Modern GENs the PTEs and register space are split in the BAR */
 	phys_addr = pci_resource_start(pdev, 0) + pci_resource_len(pdev, 0) / 2;
+	ggtt->gsm_paddr = phys_addr;
 
 	/*
 	 * On BXT+/CNL+ writes larger than 64 bit to the GTT pagetable range
diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.h b/drivers/gpu/drm/i915/i915_gem_gtt.h
index 717bbd6..c64fce1 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.h
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.h
@@ -388,6 +388,7 @@ struct i915_ggtt {
 	u32 stolen_reserved_size;
 
 	/** "Graphics Stolen Memory" holds the global PTEs */
+	phys_addr_t gsm_paddr;
 	void __iomem *gsm;
 	void (*invalidate)(struct drm_i915_private *dev_priv);
 
-- 
1.9.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC PATCH 04/12] drm/i915: Capture some extra small details in the GPU error state
  2017-10-27 18:01 [RFC PATCH 00/12] AubCrash Oscar Mateo
                   ` (2 preceding siblings ...)
  2017-10-27 18:01 ` [RFC PATCH 03/12] drm/i915: Store the GGTT Stolen Memory base physical address Oscar Mateo
@ 2017-10-27 18:01 ` Oscar Mateo
  2017-10-27 18:15   ` Chris Wilson
  2017-10-27 18:01 ` [RFC PATCH 05/12] drm/i915: Capture the renderstate batchbuffer Oscar Mateo
                   ` (10 subsequent siblings)
  14 siblings, 1 reply; 26+ messages in thread
From: Oscar Mateo @ 2017-10-27 18:01 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

Namely:
- Capture tiling per drm_i915_error_object
- Capture the LRC descriptor per active request
- Capture the wa_batchbuffer unconditionally
- Capture the GAM_ECOCHK register for all GENs

They don't increase the size greatly, and they can be useful even in
the existing GPU error dump (but I will need them for sure in AubCrash).

Signed-off-by: Oscar Mateo <oscar.mateo@intel.com>
Cc: Chris Wilson <chris@chris-wsilon.co.uk>
---
 drivers/gpu/drm/i915/i915_drv.h       |  2 ++
 drivers/gpu/drm/i915/i915_gpu_error.c | 25 +++++++++++++++----------
 2 files changed, 17 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 366ba74..f64871b 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -992,6 +992,7 @@ struct i915_gpu_state {
 		struct drm_i915_error_object {
 			u64 gtt_offset;
 			u64 gtt_size;
+			u32 tiling:2;
 			int page_count;
 			int unused;
 			u32 *pages[0];
@@ -1006,6 +1007,7 @@ struct i915_gpu_state {
 			long jiffies;
 			pid_t pid;
 			u32 context;
+			u64 lrc_desc;
 			int priority;
 			int ban_score;
 			u32 seqno;
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index 653fb69..befd17c 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -897,6 +897,7 @@ void __i915_gpu_state_free(struct kref *error_ref)
 
 	dst->gtt_offset = vma->node.start;
 	dst->gtt_size = vma->node.size;
+	dst->tiling = i915_gem_object_get_tiling(vma->obj);
 	dst->page_count = 0;
 	dst->unused = 0;
 
@@ -1270,16 +1271,21 @@ static void error_record_engine_registers(struct i915_gpu_state *error,
 static void record_request(struct drm_i915_gem_request *request,
 			   struct drm_i915_error_request *erq)
 {
-	erq->context = request->ctx->hw_id;
+	struct i915_gem_context *ctx = request->ctx;
+	struct intel_engine_cs *engine = request->engine;
+	struct intel_context *ce = &ctx->engine[engine->id];
+
+	erq->context = ctx->hw_id;
+	erq->lrc_desc = ce->lrc_desc;
 	erq->priority = request->priotree.priority;
-	erq->ban_score = atomic_read(&request->ctx->ban_score);
+	erq->ban_score = atomic_read(&ctx->ban_score);
 	erq->seqno = request->global_seqno;
 	erq->jiffies = request->emitted_jiffies;
 	erq->head = request->head;
 	erq->tail = request->tail;
 
 	rcu_read_lock();
-	erq->pid = request->ctx->pid ? pid_nr(request->ctx->pid) : 0;
+	erq->pid = ctx->pid ? pid_nr(ctx->pid) : 0;
 	rcu_read_unlock();
 }
 
@@ -1442,10 +1448,10 @@ static void i915_gem_record_rings(struct drm_i915_private *dev_priv,
 				i915_error_object_create(dev_priv,
 							 request->batch);
 
-			if (HAS_BROKEN_CS_TLB(dev_priv))
-				ee->wa_batchbuffer =
-					i915_error_object_create(dev_priv,
-								 engine->scratch);
+			ee->wa_batchbuffer =
+				i915_error_object_create(dev_priv,
+							 engine->scratch);
+
 			request_record_user_bo(request, ee);
 
 			ee->ctx =
@@ -1619,10 +1625,8 @@ static void i915_capture_reg_state(struct drm_i915_private *dev_priv,
 		error->ccid = I915_READ(CCID);
 
 	/* 3: Feature specific registers */
-	if (IS_GEN6(dev_priv) || IS_GEN7(dev_priv)) {
-		error->gam_ecochk = I915_READ(GAM_ECOCHK);
+	if (IS_GEN6(dev_priv) || IS_GEN7(dev_priv))
 		error->gac_eco = I915_READ(GAC_ECO_BITS);
-	}
 
 	/* 4: Everything else */
 	if (INTEL_GEN(dev_priv) >= 8) {
@@ -1641,6 +1645,7 @@ static void i915_capture_reg_state(struct drm_i915_private *dev_priv,
 	}
 	error->eir = I915_READ(EIR);
 	error->pgtbl_er = I915_READ(PGTBL_ER);
+	error->gam_ecochk = I915_READ(GAM_ECOCHK);
 }
 
 static void i915_error_capture_msg(struct drm_i915_private *dev_priv,
-- 
1.9.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC PATCH 05/12] drm/i915: Capture the renderstate batchbuffer
  2017-10-27 18:01 [RFC PATCH 00/12] AubCrash Oscar Mateo
                   ` (3 preceding siblings ...)
  2017-10-27 18:01 ` [RFC PATCH 04/12] drm/i915: Capture some extra small details in the GPU error state Oscar Mateo
@ 2017-10-27 18:01 ` Oscar Mateo
  2017-10-27 18:17   ` Chris Wilson
  2017-10-27 18:01 ` [RFC PATCH 06/12] drm/i915: Provide a way to write binary data to the error dump file Oscar Mateo
                   ` (9 subsequent siblings)
  14 siblings, 1 reply; 26+ messages in thread
From: Oscar Mateo @ 2017-10-27 18:01 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson, Mika Kuoppala

It can be useful if it's in play at the time of the crash,
and I will be needing it in AubCrash.

Signed-off-by: Oscar Mateo <oscar.mateo@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Cc: Chris Wilson <chris@chris-wsilon.co.uk>
---
 drivers/gpu/drm/i915/i915_drv.h              |  3 ++-
 drivers/gpu/drm/i915/i915_gem_render_state.c | 11 +++++++++++
 drivers/gpu/drm/i915/i915_gem_render_state.h |  1 +
 drivers/gpu/drm/i915/i915_gpu_error.c        | 13 +++++++++++++
 4 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index f64871b..23d746b 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -996,7 +996,8 @@ struct i915_gpu_state {
 			int page_count;
 			int unused;
 			u32 *pages[0];
-		} *ringbuffer, *batchbuffer, *wa_batchbuffer, *ctx, *hws_page;
+		} *ringbuffer, *batchbuffer, *wa_batchbuffer,
+		  *renderstate, *ctx, *hws_page;
 
 		struct drm_i915_error_object **user_bo;
 		long user_bo_count;
diff --git a/drivers/gpu/drm/i915/i915_gem_render_state.c b/drivers/gpu/drm/i915/i915_gem_render_state.c
index 3703dc9..47454fb 100644
--- a/drivers/gpu/drm/i915/i915_gem_render_state.c
+++ b/drivers/gpu/drm/i915/i915_gem_render_state.c
@@ -266,6 +266,17 @@ int i915_gem_render_state_emit(struct drm_i915_gem_request *req)
 	return ret;
 }
 
+struct i915_vma *i915_gem_render_state_get(struct intel_engine_cs *engine)
+{
+	struct intel_render_state *so;
+
+	so = engine->render_state;
+	if (!so)
+		return NULL;
+
+	return so->vma;
+}
+
 void i915_gem_render_state_fini(struct intel_engine_cs *engine)
 {
 	struct intel_render_state *so;
diff --git a/drivers/gpu/drm/i915/i915_gem_render_state.h b/drivers/gpu/drm/i915/i915_gem_render_state.h
index 8748184..c760cf9 100644
--- a/drivers/gpu/drm/i915/i915_gem_render_state.h
+++ b/drivers/gpu/drm/i915/i915_gem_render_state.h
@@ -28,6 +28,7 @@
 
 int i915_gem_render_state_init(struct intel_engine_cs *engine);
 int i915_gem_render_state_emit(struct drm_i915_gem_request *req);
+struct i915_vma *i915_gem_render_state_get(struct intel_engine_cs *engine);
 void i915_gem_render_state_fini(struct intel_engine_cs *engine);
 
 #endif /* _I915_GEM_RENDER_STATE_H_ */
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index befd17c..a14aa29 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -755,6 +755,9 @@ int i915_error_state_to_str(struct drm_i915_error_state_buf *m,
 				"HW context", ee->ctx);
 
 		print_error_obj(m, dev_priv->engine[i],
+				"Renderstate", ee->renderstate);
+
+		print_error_obj(m, dev_priv->engine[i],
 				"WA context", ee->wa_ctx);
 
 		print_error_obj(m, dev_priv->engine[i],
@@ -850,6 +853,7 @@ void __i915_gpu_state_free(struct kref *error_ref)
 		i915_error_object_free(ee->hws_page);
 		i915_error_object_free(ee->ctx);
 		i915_error_object_free(ee->wa_ctx);
+		i915_error_object_free(ee->renderstate);
 
 		kfree(ee->requests);
 		if (!IS_ERR_OR_NULL(ee->waiters))
@@ -1434,6 +1438,7 @@ static void i915_gem_record_rings(struct drm_i915_private *dev_priv,
 		request = i915_gem_find_active_request(engine);
 		if (request) {
 			struct intel_ring *ring;
+			struct i915_vma *renderstate_vma;
 
 			ee->vm = request->ctx->ppgtt ?
 				&request->ctx->ppgtt->base : &ggtt->base;
@@ -1454,6 +1459,14 @@ static void i915_gem_record_rings(struct drm_i915_private *dev_priv,
 
 			request_record_user_bo(request, ee);
 
+			renderstate_vma = i915_gem_render_state_get(engine);
+			if (renderstate_vma &&
+			    i915_vma_is_active(renderstate_vma)) {
+				ee->renderstate =
+					i915_error_object_create(dev_priv,
+								 renderstate_vma);
+			}
+
 			ee->ctx =
 				i915_error_object_create(dev_priv,
 							 request->ctx->engine[i].state);
-- 
1.9.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC PATCH 06/12] drm/i915: Provide a way to write binary data to the error dump file
  2017-10-27 18:01 [RFC PATCH 00/12] AubCrash Oscar Mateo
                   ` (4 preceding siblings ...)
  2017-10-27 18:01 ` [RFC PATCH 05/12] drm/i915: Capture the renderstate batchbuffer Oscar Mateo
@ 2017-10-27 18:01 ` Oscar Mateo
  2017-10-27 18:01 ` [RFC PATCH 07/12] drm/i915: Skeleton for AubCrash Oscar Mateo
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 26+ messages in thread
From: Oscar Mateo @ 2017-10-27 18:01 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

i915_error_puts becomes an specialization of the more general case.
Also, make it publicly accesible (since I will be using it inside
AubCrash).

Signed-off-by: Oscar Mateo <oscar.mateo@intel.com>
Cc: Chris Wilson <chris@chris-wsilon.co.uk>
---
 drivers/gpu/drm/i915/i915_drv.h       |  2 ++
 drivers/gpu/drm/i915/i915_gpu_error.c | 19 +++++++++++--------
 2 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 23d746b..ae3c8b1 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -3937,6 +3937,8 @@ static inline void intel_display_crc_init(struct drm_i915_private *dev_priv) {}
 
 __printf(2, 3)
 void i915_error_printf(struct drm_i915_error_state_buf *e, const char *f, ...);
+void i915_error_binary_write(struct drm_i915_error_state_buf *e,
+			     const void *data, size_t len);
 int i915_error_state_to_str(struct drm_i915_error_state_buf *estr,
 			    const struct i915_gpu_state *gpu);
 int i915_error_state_buf_init(struct drm_i915_error_state_buf *eb,
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index a14aa29..e71d2c8 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -149,17 +149,13 @@ static void i915_error_vprintf(struct drm_i915_error_state_buf *e,
 	__i915_error_advance(e, len);
 }
 
-static void i915_error_puts(struct drm_i915_error_state_buf *e,
-			    const char *str)
+void i915_error_binary_write(struct drm_i915_error_state_buf *e,
+			     const void *data, size_t len)
 {
-	unsigned len;
-
 	if (!__i915_error_ok(e))
 		return;
 
-	len = strlen(str);
-
-	/* Seek the first printf which is hits start position */
+	/* Seek the first printf which hits start position */
 	if (e->pos < e->start) {
 		if (!__i915_error_seek(e, len))
 			return;
@@ -167,11 +163,18 @@ static void i915_error_puts(struct drm_i915_error_state_buf *e,
 
 	if (len >= e->size - e->bytes)
 		len = e->size - e->bytes - 1;
-	memcpy(e->buf + e->bytes, str, len);
+	memcpy(e->buf + e->bytes, data, len);
 
 	__i915_error_advance(e, len);
 }
 
+static void i915_error_puts(struct drm_i915_error_state_buf *e, const char *str)
+{
+	size_t len = strlen(str);
+
+	i915_error_binary_write(e, str, len);
+}
+
 #define err_printf(e, ...) i915_error_printf(e, __VA_ARGS__)
 #define err_puts(e, s) i915_error_puts(e, s)
 
-- 
1.9.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC PATCH 07/12] drm/i915: Skeleton for AubCrash
  2017-10-27 18:01 [RFC PATCH 00/12] AubCrash Oscar Mateo
                   ` (5 preceding siblings ...)
  2017-10-27 18:01 ` [RFC PATCH 06/12] drm/i915: Provide a way to write binary data to the error dump file Oscar Mateo
@ 2017-10-27 18:01 ` Oscar Mateo
  2017-10-27 18:20   ` Chris Wilson
  2017-10-27 18:01 ` [RFC PATCH 08/12] drm/i915: Capture the PPGTT pagetables on a GPU crash Oscar Mateo
                   ` (7 subsequent siblings)
  14 siblings, 1 reply; 26+ messages in thread
From: Oscar Mateo @ 2017-10-27 18:01 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

Includes some documentation on what AubCrash is supposed to achieve.

Signed-off-by: Oscar Mateo <oscar.mateo@intel.com>
Cc: Chris Wilson <chris@chris-wsilon.co.uk>
---
 drivers/gpu/drm/i915/Kconfig         |  8 ++++++
 drivers/gpu/drm/i915/Makefile        |  1 +
 drivers/gpu/drm/i915/i915_aubcrash.c | 47 +++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/i915_aubcrash.h | 42 ++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/i915_debugfs.c  | 49 +++++++++++++++++++++++++++++---
 drivers/gpu/drm/i915/i915_sysfs.c    | 54 +++++++++++++++++++++++++++++++++---
 6 files changed, 193 insertions(+), 8 deletions(-)
 create mode 100644 drivers/gpu/drm/i915/i915_aubcrash.c
 create mode 100644 drivers/gpu/drm/i915/i915_aubcrash.h

diff --git a/drivers/gpu/drm/i915/Kconfig b/drivers/gpu/drm/i915/Kconfig
index dfd9588..176e53e 100644
--- a/drivers/gpu/drm/i915/Kconfig
+++ b/drivers/gpu/drm/i915/Kconfig
@@ -70,6 +70,14 @@ config DRM_I915_CAPTURE_ERROR
 
 	  If in doubt, say "Y".
 
+config DRM_I915_AUB_CRASH_DUMP
+        bool "Capture GPU error state in the form of an AUB file"
+        depends on DRM_I915_CAPTURE_ERROR
+        default n
+        help
+          Choose this option to allow the driver to dump a memtrace file (AUB)
+          with the GPU state when a hang is detected.
+
 config DRM_I915_COMPRESS_ERROR
 	bool "Compress GPU error state"
 	depends on DRM_I915_CAPTURE_ERROR
diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
index 6c3b048..04956c7 100644
--- a/drivers/gpu/drm/i915/Makefile
+++ b/drivers/gpu/drm/i915/Makefile
@@ -124,6 +124,7 @@ i915-y += dvo_ch7017.o \
 
 # Post-mortem debug and GPU hang state capture
 i915-$(CONFIG_DRM_I915_CAPTURE_ERROR) += i915_gpu_error.o
+i915-$(CONFIG_DRM_I915_AUB_CRASH_DUMP) += i915_aubcrash.o
 i915-$(CONFIG_DRM_I915_SELFTEST) += \
 	selftests/i915_random.o \
 	selftests/i915_selftest.o
diff --git a/drivers/gpu/drm/i915/i915_aubcrash.c b/drivers/gpu/drm/i915/i915_aubcrash.c
new file mode 100644
index 0000000..95b75ab
--- /dev/null
+++ b/drivers/gpu/drm/i915/i915_aubcrash.c
@@ -0,0 +1,47 @@
+/*
+ * Copyright © 2017 Intel Corporation
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the next
+ * paragraph) shall be included in all copies or substantial portions of the
+ * Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ *
+ * Author:
+ *    Oscar Mateo <oscar.mateo@intel.com>
+ *
+ */
+
+#include "intel_drv.h"
+#include "i915_aubcrash.h"
+
+/**
+ * DOC: AubCrash
+ *
+ * This code is a companion to i915_gpu_error. The idea is that, on a GPU crash,
+ * we can dump an AUB file that describes the state of the system at the point
+ * of the crash (GTTs, contexts, BBs, BOs, etc...). While i915_gpu_error kind of
+ * already does that, it uses a text format that is not specially human-friendly.
+ * An AUB file, on the other hand, can be used by a number of tools (graphical
+ * AUB file browsers, simulators, emulators, etc...) that facilitate debugging.
+ *
+ */
+
+int i915_error_state_to_aub(struct drm_i915_error_state_buf *m,
+			    const struct i915_gpu_state *error)
+{
+	return 0;
+}
diff --git a/drivers/gpu/drm/i915/i915_aubcrash.h b/drivers/gpu/drm/i915/i915_aubcrash.h
new file mode 100644
index 0000000..bab1953
--- /dev/null
+++ b/drivers/gpu/drm/i915/i915_aubcrash.h
@@ -0,0 +1,42 @@
+/*
+ * Copyright © 2017 Intel Corporation
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the next
+ * paragraph) shall be included in all copies or substantial portions of the
+ * Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+ * DEALINGS IN THE SOFTWARE.
+ */
+
+#ifndef _INTEL_AUBCRASH_H_
+#define _INTEL_AUBCRASH_H_
+
+#if IS_ENABLED(CONFIG_DRM_I915_AUB_CRASH_DUMP)
+
+int i915_error_state_to_aub(struct drm_i915_error_state_buf *m,
+                            const struct i915_gpu_state *error);
+
+#else
+
+static inline int i915_error_state_to_aub(struct drm_i915_error_state_buf *m,
+					  const struct i915_gpu_state *error)
+{
+	return 0;
+}
+
+#endif
+
+#endif
diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index c65e381..f0f23ef 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -31,6 +31,7 @@
 #include <linux/sched/mm.h>
 #include "intel_drv.h"
 #include "i915_guc_submission.h"
+#include "i915_aubcrash.h"
 
 static inline struct drm_i915_private *node_to_i915(struct drm_info_node *node)
 {
@@ -938,7 +939,7 @@ static int i915_gem_fence_regs_info(struct seq_file *m, void *data)
 
 #if IS_ENABLED(CONFIG_DRM_I915_CAPTURE_ERROR)
 static ssize_t gpu_state_read(struct file *file, char __user *ubuf,
-			      size_t count, loff_t *pos)
+			      size_t count, loff_t *pos, bool type_aub)
 {
 	struct i915_gpu_state *error = file->private_data;
 	struct drm_i915_error_state_buf str;
@@ -952,7 +953,10 @@ static ssize_t gpu_state_read(struct file *file, char __user *ubuf,
 	if (ret)
 		return ret;
 
-	ret = i915_error_state_to_str(&str, error);
+	if (type_aub)
+		ret = i915_error_state_to_aub(&str, error);
+	else
+		ret = i915_error_state_to_str(&str, error);
 	if (ret)
 		goto out;
 
@@ -967,6 +971,12 @@ static ssize_t gpu_state_read(struct file *file, char __user *ubuf,
 	return ret;
 }
 
+static ssize_t gpu_state_read_str(struct file *file, char __user *ubuf,
+				  size_t count, loff_t *pos)
+{
+	return gpu_state_read(file, ubuf, count, pos, false);
+}
+
 static int gpu_state_release(struct inode *inode, struct file *file)
 {
 	i915_gpu_state_put(file->private_data);
@@ -991,7 +1001,7 @@ static int i915_gpu_info_open(struct inode *inode, struct file *file)
 static const struct file_operations i915_gpu_info_fops = {
 	.owner = THIS_MODULE,
 	.open = i915_gpu_info_open,
-	.read = gpu_state_read,
+	.read = gpu_state_read_str,
 	.llseek = default_llseek,
 	.release = gpu_state_release,
 };
@@ -1022,11 +1032,38 @@ static int i915_error_state_open(struct inode *inode, struct file *file)
 static const struct file_operations i915_error_state_fops = {
 	.owner = THIS_MODULE,
 	.open = i915_error_state_open,
-	.read = gpu_state_read,
+	.read = gpu_state_read_str,
+	.write = i915_error_state_write,
+	.llseek = default_llseek,
+	.release = gpu_state_release,
+};
+#endif
+
+#if IS_ENABLED(CONFIG_DRM_I915_AUB_CRASH_DUMP)
+
+static ssize_t gpu_state_read_aub(struct file *file, char __user *ubuf,
+				  size_t count, loff_t *pos)
+{
+	return gpu_state_read(file, ubuf, count, pos, true);
+}
+
+static const struct file_operations i915_gpu_info_aub_fops = {
+	.owner = THIS_MODULE,
+	.open = i915_gpu_info_open,
+	.read = gpu_state_read_aub,
+	.llseek = default_llseek,
+	.release = gpu_state_release,
+};
+
+static const struct file_operations i915_error_state_aub_fops = {
+	.owner = THIS_MODULE,
+	.open = i915_error_state_open,
+	.read = gpu_state_read_aub,
 	.write = i915_error_state_write,
 	.llseek = default_llseek,
 	.release = gpu_state_release,
 };
+
 #endif
 
 static int
@@ -4776,6 +4813,10 @@ static int i915_hpd_storm_ctl_open(struct inode *inode, struct file *file)
 	{"i915_error_state", &i915_error_state_fops},
 	{"i915_gpu_info", &i915_gpu_info_fops},
 #endif
+#if IS_ENABLED(CONFIG_DRM_I915_AUB_CRASH_DUMP)
+	{"i915_error_state_aub", &i915_error_state_aub_fops},
+	{"i915_gpu_info_aub", &i915_gpu_info_aub_fops},
+#endif
 	{"i915_next_seqno", &i915_next_seqno_fops},
 	{"i915_display_crc_ctl", &i915_display_crc_ctl_fops},
 	{"i915_pri_wm_latency", &i915_pri_wm_latency_fops},
diff --git a/drivers/gpu/drm/i915/i915_sysfs.c b/drivers/gpu/drm/i915/i915_sysfs.c
index 791759f..646ba5f 100644
--- a/drivers/gpu/drm/i915/i915_sysfs.c
+++ b/drivers/gpu/drm/i915/i915_sysfs.c
@@ -31,6 +31,7 @@
 #include <linux/sysfs.h>
 #include "intel_drv.h"
 #include "i915_drv.h"
+#include "i915_aubcrash.h"
 
 static inline struct drm_i915_private *kdev_minor_to_i915(struct device *kdev)
 {
@@ -495,9 +496,8 @@ static ssize_t gt_rp_mhz_show(struct device *kdev, struct device_attribute *attr
 
 static ssize_t error_state_read(struct file *filp, struct kobject *kobj,
 				struct bin_attribute *attr, char *buf,
-				loff_t off, size_t count)
+				loff_t off, size_t count, bool type_aub)
 {
-
 	struct device *kdev = kobj_to_dev(kobj);
 	struct drm_i915_private *dev_priv = kdev_minor_to_i915(kdev);
 	struct drm_i915_error_state_buf error_str;
@@ -509,7 +509,11 @@ static ssize_t error_state_read(struct file *filp, struct kobject *kobj,
 		return ret;
 
 	gpu = i915_first_error_state(dev_priv);
-	ret = i915_error_state_to_str(&error_str, gpu);
+
+	if (type_aub)
+		ret = i915_error_state_to_aub(&error_str, gpu);
+	else
+		ret = i915_error_state_to_str(&error_str, gpu);
 	if (ret)
 		goto out;
 
@@ -536,11 +540,18 @@ static ssize_t error_state_write(struct file *file, struct kobject *kobj,
 	return count;
 }
 
+static ssize_t error_state_read_str(struct file *filp, struct kobject *kobj,
+				    struct bin_attribute *attr, char *buf,
+				    loff_t off, size_t count)
+{
+	return error_state_read(filp, kobj, attr, buf, off, count, false);
+}
+
 static const struct bin_attribute error_state_attr = {
 	.attr.name = "error",
 	.attr.mode = S_IRUSR | S_IWUSR,
 	.size = 0,
-	.read = error_state_read,
+	.read = error_state_read_str,
 	.write = error_state_write,
 };
 
@@ -559,6 +570,39 @@ static void i915_setup_error_capture(struct device *kdev) {}
 static void i915_teardown_error_capture(struct device *kdev) {}
 #endif
 
+#if IS_ENABLED(CONFIG_DRM_I915_AUB_CRASH_DUMP)
+
+static ssize_t error_state_read_aub(struct file *filp, struct kobject *kobj,
+				    struct bin_attribute *attr, char *buf,
+				    loff_t off, size_t count)
+{
+	return error_state_read(filp, kobj, attr, buf, off, count, true);
+}
+
+static const struct bin_attribute aub_state_attr = {
+	.attr.name = "aub",
+	.attr.mode = S_IRUSR | S_IWUSR,
+	.size = 0,
+	.read = error_state_read_aub,
+	.write = error_state_write,
+};
+
+static void i915_setup_error_capture_aub(struct device *kdev)
+{
+	if (sysfs_create_bin_file(&kdev->kobj, &aub_state_attr))
+		DRM_ERROR("aub_state sysfs setup failed\n");
+}
+
+static void i915_teardown_error_capture_aub(struct device *kdev)
+{
+	sysfs_remove_bin_file(&kdev->kobj, &aub_state_attr);
+}
+
+#else
+static void i915_setup_error_capture_aub(struct device *kdev) {}
+static void i915_teardown_error_capture_aub(struct device *kdev) {}
+#endif
+
 void i915_setup_sysfs(struct drm_i915_private *dev_priv)
 {
 	struct device *kdev = dev_priv->drm.primary->kdev;
@@ -606,6 +650,7 @@ void i915_setup_sysfs(struct drm_i915_private *dev_priv)
 		DRM_ERROR("RPS sysfs setup failed\n");
 
 	i915_setup_error_capture(kdev);
+	i915_setup_error_capture_aub(kdev);
 }
 
 void i915_teardown_sysfs(struct drm_i915_private *dev_priv)
@@ -613,6 +658,7 @@ void i915_teardown_sysfs(struct drm_i915_private *dev_priv)
 	struct device *kdev = dev_priv->drm.primary->kdev;
 
 	i915_teardown_error_capture(kdev);
+	i915_teardown_error_capture_aub(kdev);
 
 	if (IS_VALLEYVIEW(dev_priv) || IS_CHERRYVIEW(dev_priv))
 		sysfs_remove_files(&kdev->kobj, vlv_attrs);
-- 
1.9.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC PATCH 08/12] drm/i915: Capture the PPGTT pagetables on a GPU crash
  2017-10-27 18:01 [RFC PATCH 00/12] AubCrash Oscar Mateo
                   ` (6 preceding siblings ...)
  2017-10-27 18:01 ` [RFC PATCH 07/12] drm/i915: Skeleton for AubCrash Oscar Mateo
@ 2017-10-27 18:01 ` Oscar Mateo
  2017-10-27 18:01 ` [RFC PATCH 09/12] drm/i915: Store PTE information together with the error object pages Oscar Mateo
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 26+ messages in thread
From: Oscar Mateo @ 2017-10-27 18:01 UTC (permalink / raw)
  To: intel-gfx

Or, at least, the first thee levels (PML4, PDPs, PDs). We'll deal
with the PTs later, at the same time we record actual physical pages
with data.

We only do this when AubCrash is enabled, to save space, but
we could also do it unconditionally and maybe dump it in text
mode when in the legacy crash dump.

Signed-off-by: Oscar Mateo <oscar.mateo@intel.com>
c: Chris Wilson <chris@chris-wsilon.co.uk>
---
 drivers/gpu/drm/i915/i915_aubcrash.c  | 169 ++++++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/i915_aubcrash.h  |  14 +++
 drivers/gpu/drm/i915/i915_drv.h       |   6 ++
 drivers/gpu/drm/i915/i915_gpu_error.c |   8 ++
 4 files changed, 197 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_aubcrash.c b/drivers/gpu/drm/i915/i915_aubcrash.c
index 95b75ab..1b613e4 100644
--- a/drivers/gpu/drm/i915/i915_aubcrash.c
+++ b/drivers/gpu/drm/i915/i915_aubcrash.c
@@ -40,6 +40,175 @@
  *
  */
 
+#define COPY_PX_ENTRIES(px, storage) do { \
+	u64 *vaddr; \
+	if (!storage) \
+		return -ENOMEM; \
+	vaddr = kmap_atomic(px_base(px)->page); \
+	memcpy(storage, vaddr, PAGE_SIZE); \
+	kunmap_atomic(vaddr); \
+} while (0)
+
+int record_pml4(struct drm_i915_error_pagemap_lvl *e_pml4,
+		struct i915_pml4 *pml4,
+		struct i915_page_directory_pointer *scratch_pdp,
+		bool is_48bit)
+{
+	int l3;
+
+	if (is_48bit) {
+		e_pml4->paddr = px_dma(pml4);
+		e_pml4->storage = (u64 *)__get_free_page(GFP_ATOMIC | __GFP_NOWARN);
+		COPY_PX_ENTRIES(pml4, e_pml4->storage);
+		for (l3 = 0; l3 < GEN8_PML4ES_PER_PML4; l3++)
+			if (pml4->pdps[l3] != scratch_pdp)
+				e_pml4->nxt_lvl_count++;
+	} else
+		e_pml4->nxt_lvl_count = 1;
+
+	e_pml4->nxt_lvl = kcalloc(e_pml4->nxt_lvl_count,
+				  sizeof(*e_pml4->nxt_lvl), GFP_ATOMIC);
+	if (!e_pml4->nxt_lvl) {
+		e_pml4->nxt_lvl_count = 0;
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+
+int record_pdp(struct drm_i915_error_pagemap_lvl *e_pdp,
+	       struct i915_page_directory_pointer *pdp,
+	       bool is_48bit)
+{
+	if (is_48bit) {
+		e_pdp->paddr = px_dma(pdp);
+		e_pdp->storage = (u64 *)__get_free_page(GFP_ATOMIC | __GFP_NOWARN);
+		COPY_PX_ENTRIES(pdp, e_pdp->storage);
+	}
+
+	e_pdp->nxt_lvl_count = pdp->used_pdpes;
+	e_pdp->nxt_lvl = kcalloc(e_pdp->nxt_lvl_count,
+				  sizeof(*e_pdp->nxt_lvl), GFP_ATOMIC);
+	if (!e_pdp->nxt_lvl) {
+		e_pdp->nxt_lvl_count = 0;
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+
+int record_pd(struct drm_i915_error_pagemap_lvl *e_pd,
+	      struct i915_page_directory *pd)
+{
+	e_pd->paddr = px_dma(pd);
+	e_pd->storage = (u64 *)__get_free_page(GFP_ATOMIC | __GFP_NOWARN);
+	COPY_PX_ENTRIES(pd, e_pd->storage);
+
+	e_pd->nxt_lvl_count = pd->used_pdes;
+	e_pd->nxt_lvl = kcalloc(e_pd->nxt_lvl_count,
+				sizeof(*e_pd->nxt_lvl), GFP_ATOMIC);
+	if (!e_pd->nxt_lvl) {
+		e_pd->nxt_lvl_count = 0;
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+
+void i915_error_record_ppgtt(struct i915_gpu_state *error,
+			     struct i915_address_space *vm,
+			     int idx)
+{
+	struct i915_hw_ppgtt *ppgtt;
+	struct drm_i915_error_pagemap_lvl *e_pml4;
+	struct i915_pml4 *pml4;
+	int l3, l2, max_lvl3, max_lvl2, i, j;
+	bool is_48bit;
+	int ret;
+
+	if (i915_is_ggtt(vm))
+		return;
+
+	ppgtt = i915_vm_to_ppgtt(vm);
+	is_48bit = i915_vm_is_48bit(&ppgtt->base);
+	if (is_48bit) {
+		max_lvl3 = GEN8_PML4ES_PER_PML4;
+		max_lvl2 = GEN8_4LVL_PDPES;
+	} else {
+		max_lvl3 = 1;
+		max_lvl2 = GEN8_3LVL_PDPES;
+	}
+
+	/* PML4 */
+	pml4 = is_48bit? &ppgtt->pml4 : NULL;
+	e_pml4 = &error->ppgtt_pml4[idx];
+	ret = record_pml4(e_pml4, pml4, vm->scratch_pdp, is_48bit);
+	if (ret < 0)
+		return;
+
+	/* PDP */
+	for (l3 = 0, i = 0; l3 < max_lvl3; l3++) {
+		struct drm_i915_error_pagemap_lvl *e_pdp;
+		struct i915_page_directory_pointer *pdp;
+
+		pdp = is_48bit? pml4->pdps[l3] : &ppgtt->pdp;
+		if (pdp == vm->scratch_pdp)
+			continue;
+
+		e_pdp = &e_pml4->nxt_lvl[i];
+		ret = record_pdp(e_pdp, pdp, is_48bit);
+		if (ret < 0)
+			return;
+
+		/* PD */
+		for (l2 = 0, j = 0; l2 < max_lvl2; l2++) {
+			struct drm_i915_error_pagemap_lvl *e_pd;
+			struct i915_page_directory *pd;
+
+			pd = pdp->page_directory[l2];
+			if (pd == vm->scratch_pd)
+				continue;
+
+			e_pd = &e_pdp->nxt_lvl[j];
+			ret = record_pd(e_pd, pd);
+			if (ret < 0)
+				return;
+
+			if (++j == e_pdp->nxt_lvl_count)
+				break;
+		}
+
+		if (++i == e_pml4->nxt_lvl_count)
+			break;
+
+	}
+
+	/* XXX: Do I want to dump the scratch pdp/pd/pt/page? */
+	/* TODO: Support huge pages */
+}
+
+void i915_error_free_ppgtt(struct i915_gpu_state *error, int idx)
+{
+	struct drm_i915_error_pagemap_lvl *e_pml4 = &error->ppgtt_pml4[idx];
+	int i, j;
+
+	for (i = e_pml4->nxt_lvl_count - 1; i >= 0; i--) {
+		struct drm_i915_error_pagemap_lvl *e_pdp =
+			&e_pml4->nxt_lvl[i];
+
+		for (j = e_pdp->nxt_lvl_count - 1; j >= 0; j--) {
+			struct drm_i915_error_pagemap_lvl *e_pd =
+				&e_pdp->nxt_lvl[j];
+
+			free_page((unsigned long)e_pd->storage);
+			kfree(e_pd);
+		}
+		free_page((unsigned long)e_pdp->storage);
+		kfree(e_pdp);
+	}
+	free_page((unsigned long)e_pml4->storage);
+}
+
 int i915_error_state_to_aub(struct drm_i915_error_state_buf *m,
 			    const struct i915_gpu_state *error)
 {
diff --git a/drivers/gpu/drm/i915/i915_aubcrash.h b/drivers/gpu/drm/i915/i915_aubcrash.h
index bab1953..af7d42e 100644
--- a/drivers/gpu/drm/i915/i915_aubcrash.h
+++ b/drivers/gpu/drm/i915/i915_aubcrash.h
@@ -26,11 +26,25 @@
 
 #if IS_ENABLED(CONFIG_DRM_I915_AUB_CRASH_DUMP)
 
+void i915_error_record_ppgtt(struct i915_gpu_state *error,
+			     struct i915_address_space *vm,
+			     int idx);
+void i915_error_free_ppgtt(struct i915_gpu_state *error, int idx);
 int i915_error_state_to_aub(struct drm_i915_error_state_buf *m,
                             const struct i915_gpu_state *error);
 
 #else
 
+static inline void i915_error_record_ppgtt(struct i915_gpu_state *error,
+					   struct i915_address_space *vm,
+					   int idx)
+{
+}
+
+static inline void i915_error_free_ppgtt(struct i915_gpu_state *error, int idx)
+{
+}
+
 static inline int i915_error_state_to_aub(struct drm_i915_error_state_buf *m,
 					  const struct i915_gpu_state *error)
 {
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index ae3c8b1..9b2539a 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1048,6 +1048,12 @@ struct i915_gpu_state {
 		u32 cache_level:3;
 	} *active_bo[I915_NUM_ENGINES], *pinned_bo;
 	u32 active_bo_count[I915_NUM_ENGINES], pinned_bo_count;
+	struct drm_i915_error_pagemap_lvl {
+		phys_addr_t paddr;
+		u64 *storage;
+		struct drm_i915_error_pagemap_lvl *nxt_lvl;
+		uint nxt_lvl_count;
+	} ppgtt_pml4[I915_NUM_ENGINES];
 	struct i915_address_space *active_vm[I915_NUM_ENGINES];
 };
 
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index e71d2c8..39b69d8 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -31,6 +31,7 @@
 #include <linux/stop_machine.h>
 #include <linux/zlib.h>
 #include "i915_drv.h"
+#include "i915_aubcrash.h"
 
 static const char *engine_str(int engine)
 {
@@ -866,6 +867,9 @@ void __i915_gpu_state_free(struct kref *error_ref)
 	i915_error_object_free(error->semaphore);
 	i915_error_object_free(error->guc_log);
 
+	for (i = 0; i < ARRAY_SIZE(error->ppgtt_pml4); i++)
+		i915_error_free_ppgtt(error, i);
+
 	for (i = 0; i < ARRAY_SIZE(error->active_bo); i++)
 		kfree(error->active_bo[i]);
 	kfree(error->pinned_bo);
@@ -1520,6 +1524,9 @@ static void i915_gem_capture_vm(struct drm_i915_private *dev_priv,
 	else
 		count = 0;
 
+	if (INTEL_GEN(dev_priv) >= 8)
+		i915_error_record_ppgtt(error, vm, idx);
+
 	error->active_vm[idx] = vm;
 	error->active_bo[idx] = active_bo;
 	error->active_bo_count[idx] = count;
@@ -1533,6 +1540,7 @@ static void i915_capture_active_buffers(struct drm_i915_private *dev_priv,
 	BUILD_BUG_ON(ARRAY_SIZE(error->engine) > ARRAY_SIZE(error->active_bo));
 	BUILD_BUG_ON(ARRAY_SIZE(error->active_bo) != ARRAY_SIZE(error->active_vm));
 	BUILD_BUG_ON(ARRAY_SIZE(error->active_bo) != ARRAY_SIZE(error->active_bo_count));
+	BUILD_BUG_ON(ARRAY_SIZE(error->active_bo) != ARRAY_SIZE(error->ppgtt_pml4));
 
 	/* Scan each engine looking for unique active contexts/vm */
 	for (i = 0; i < ARRAY_SIZE(error->engine); i++) {
-- 
1.9.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC PATCH 09/12] drm/i915: Store PTE information together with the error object pages
  2017-10-27 18:01 [RFC PATCH 00/12] AubCrash Oscar Mateo
                   ` (7 preceding siblings ...)
  2017-10-27 18:01 ` [RFC PATCH 08/12] drm/i915: Capture the PPGTT pagetables on a GPU crash Oscar Mateo
@ 2017-10-27 18:01 ` Oscar Mateo
  2017-10-27 18:25   ` Chris Wilson
  2017-10-27 18:01 ` [RFC PATCH 10/12] drm/i915: Always add BOs to capture list if AubCrash is enabled Oscar Mateo
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 26+ messages in thread
From: Oscar Mateo @ 2017-10-27 18:01 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

With every page of physical data we store: its physical address, the PTE
that points to it and the physical address of the PTE itself. We will be
using all this information later.

Signed-off-by: Oscar Mateo <oscar.mateo@intel.com>
Cc: Chris Wilson <chris@chris-wsilon.co.uk>
---
 drivers/gpu/drm/i915/i915_aubcrash.c  | 46 +++++++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/i915_aubcrash.h  | 11 +++++++++
 drivers/gpu/drm/i915/i915_drv.h       |  7 +++++-
 drivers/gpu/drm/i915/i915_gpu_error.c | 24 ++++++++++++------
 4 files changed, 79 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_aubcrash.c b/drivers/gpu/drm/i915/i915_aubcrash.c
index 1b613e4..ebdbf09 100644
--- a/drivers/gpu/drm/i915/i915_aubcrash.c
+++ b/drivers/gpu/drm/i915/i915_aubcrash.c
@@ -209,6 +209,52 @@ void i915_error_free_ppgtt(struct i915_gpu_state *error, int idx)
 	free_page((unsigned long)e_pml4->storage);
 }
 
+void i915_error_page_walk(struct i915_address_space *vm,
+			  u64 offset,
+			  gen8_pte_t *entry,
+			  phys_addr_t *paddr)
+{
+	if (i915_is_ggtt(vm)) {
+		struct i915_ggtt *ggtt = i915_vm_to_ggtt(vm);
+		uint index = offset >> PAGE_SHIFT;
+
+		gen8_pte_t __iomem *pte =
+			(gen8_pte_t __iomem *)ggtt->gsm + index;
+
+		*entry = readq(pte);
+		*paddr = ggtt->gsm_paddr + index * sizeof(u64);
+	} else {
+		struct i915_hw_ppgtt *ppgtt = i915_vm_to_ppgtt(vm);
+		struct i915_pml4 *pml4;
+		struct i915_page_directory_pointer *pdp;
+		struct i915_page_directory *pd;
+		struct i915_page_table *pt;
+		u32 pml4e, pdpe, pde, pte;
+		u64 *vaddr;
+
+		pml4e = gen8_pml4e_index(offset);
+		if (i915_vm_is_48bit(&ppgtt->base)) {
+			pml4 = &ppgtt->pml4;
+			pdp = pml4->pdps[pml4e];
+		} else {
+			GEM_BUG_ON(pml4e != 0);
+			pdp = &ppgtt->pdp;
+		}
+
+		pdpe = gen8_pdpe_index(offset);
+		pd = pdp->page_directory[pdpe];
+
+		pde = gen8_pde_index(offset);
+		pt = pd->page_table[pde];
+
+		pte = gen8_pte_index(offset);
+		vaddr = kmap_atomic(px_base(pt)->page);
+		*entry = vaddr[pte];
+		kunmap_atomic(vaddr);
+		*paddr = px_dma(pt) + pte * sizeof(u64);
+	}
+}
+
 int i915_error_state_to_aub(struct drm_i915_error_state_buf *m,
 			    const struct i915_gpu_state *error)
 {
diff --git a/drivers/gpu/drm/i915/i915_aubcrash.h b/drivers/gpu/drm/i915/i915_aubcrash.h
index af7d42e..2eed388 100644
--- a/drivers/gpu/drm/i915/i915_aubcrash.h
+++ b/drivers/gpu/drm/i915/i915_aubcrash.h
@@ -30,6 +30,10 @@ void i915_error_record_ppgtt(struct i915_gpu_state *error,
 			     struct i915_address_space *vm,
 			     int idx);
 void i915_error_free_ppgtt(struct i915_gpu_state *error, int idx);
+void i915_error_page_walk(struct i915_address_space *vm,
+			  u64 offset,
+			  gen8_pte_t *entry,
+			  phys_addr_t *paddr);
 int i915_error_state_to_aub(struct drm_i915_error_state_buf *m,
                             const struct i915_gpu_state *error);
 
@@ -45,6 +49,13 @@ static inline void i915_error_free_ppgtt(struct i915_gpu_state *error, int idx)
 {
 }
 
+static inline void i915_error_page_walk(struct i915_address_space *vm,
+					u64 offset,
+					gen8_pte_t *entry,
+					phys_addr_t *paddr)
+{
+}
+
 static inline int i915_error_state_to_aub(struct drm_i915_error_state_buf *m,
 					  const struct i915_gpu_state *error)
 {
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 9b2539a..6db4f96 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -995,7 +995,12 @@ struct i915_gpu_state {
 			u32 tiling:2;
 			int page_count;
 			int unused;
-			u32 *pages[0];
+			struct drm_i915_error_page {
+				phys_addr_t pte_paddr;
+				gen8_pte_t pte;
+				phys_addr_t paddr;
+				u32 *storage;
+			} pages[0];
 		} *ringbuffer, *batchbuffer, *wa_batchbuffer,
 		  *renderstate, *ctx, *hws_page;
 
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index 39b69d8..924691d2 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -179,7 +179,7 @@ static void i915_error_puts(struct drm_i915_error_state_buf *e, const char *str)
 #define err_printf(e, ...) i915_error_printf(e, __VA_ARGS__)
 #define err_puts(e, s) i915_error_puts(e, s)
 
-#ifdef CONFIG_DRM_I915_COMPRESS_ERROR
+#if defined(CONFIG_DRM_I915_COMPRESS_ERROR) && !defined(CONFIG_DRM_I915_AUB_CRASH_DUMP)
 
 struct compress {
 	struct z_stream_s zstream;
@@ -227,7 +227,7 @@ static int compress_page(struct compress *c,
 			if (!page)
 				return -ENOMEM;
 
-			dst->pages[dst->page_count++] = (void *)page;
+			dst->pages[dst->page_count++].storage = (void *)page;
 
 			zstream->next_out = (void *)page;
 			zstream->avail_out = PAGE_SIZE;
@@ -290,7 +290,7 @@ static int compress_page(struct compress *c,
 	ptr = (void *)page;
 	if (!i915_memcpy_from_wc(ptr, src, PAGE_SIZE))
 		memcpy(ptr, src, PAGE_SIZE);
-	dst->pages[dst->page_count++] = ptr;
+	dst->pages[dst->page_count++].storage = ptr;
 
 	return 0;
 }
@@ -539,7 +539,7 @@ static void print_error_obj(struct drm_i915_error_state_buf *m,
 		len = ascii85_encode_len(len);
 
 		for (i = 0; i < len; i++) {
-			if (ascii85_encode(obj->pages[page][i], out))
+			if (ascii85_encode(obj->pages[page].storage[i], out))
 				err_puts(m, out);
 			else
 				err_puts(m, "z");
@@ -827,7 +827,7 @@ static void i915_error_object_free(struct drm_i915_error_object *obj)
 		return;
 
 	for (page = 0; page < obj->page_count; page++)
-		free_page((unsigned long)obj->pages[page]);
+		free_page((unsigned long)obj->pages[page].storage);
 
 	kfree(obj);
 }
@@ -901,7 +901,7 @@ void __i915_gpu_state_free(struct kref *error_ref)
 
 	num_pages = min_t(u64, vma->size, vma->obj->base.size) >> PAGE_SHIFT;
 	num_pages = DIV_ROUND_UP(10 * num_pages, 8); /* worstcase zlib growth */
-	dst = kmalloc(sizeof(*dst) + num_pages * sizeof(u32 *),
+	dst = kmalloc(sizeof(*dst) + num_pages * sizeof(*dst->pages),
 		      GFP_ATOMIC | __GFP_NOWARN);
 	if (!dst)
 		return NULL;
@@ -924,6 +924,14 @@ void __i915_gpu_state_free(struct kref *error_ref)
 		ggtt->base.insert_page(&ggtt->base, dma, slot,
 				       I915_CACHE_NONE, 0);
 
+		if (INTEL_GEN(i915) >= 8) {
+			dst->pages[dst->page_count].paddr = dma;
+			i915_error_page_walk(vma->vm, dst->gtt_offset +
+					     dst->page_count * PAGE_SIZE,
+					     &dst->pages[dst->page_count].pte,
+					     &dst->pages[dst->page_count].pte_paddr);
+		}
+
 		s = io_mapping_map_atomic_wc(&ggtt->mappable, slot);
 		ret = compress_page(&compress, (void  __force *)s, dst);
 		io_mapping_unmap_atomic(s);
@@ -935,7 +943,7 @@ void __i915_gpu_state_free(struct kref *error_ref)
 
 unwind:
 	while (dst->page_count--)
-		free_page((unsigned long)dst->pages[dst->page_count]);
+		free_page((unsigned long)dst->pages[dst->page_count].storage);
 	kfree(dst);
 	dst = NULL;
 
@@ -1104,7 +1112,7 @@ static void gen8_record_semaphore_state(struct i915_gpu_state *error,
 
 		signal_offset =
 			(GEN8_SIGNAL_OFFSET(engine, id) & (PAGE_SIZE - 1)) / 4;
-		tmp = error->semaphore->pages[0];
+		tmp = error->semaphore->pages[0].storage;
 		idx = gen8_engine_sync_index(engine, to);
 
 		ee->semaphore_mboxes[idx] = tmp[signal_offset];
-- 
1.9.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC PATCH 10/12] drm/i915: Always add BOs to capture list if AubCrash is enabled
  2017-10-27 18:01 [RFC PATCH 00/12] AubCrash Oscar Mateo
                   ` (8 preceding siblings ...)
  2017-10-27 18:01 ` [RFC PATCH 09/12] drm/i915: Store PTE information together with the error object pages Oscar Mateo
@ 2017-10-27 18:01 ` Oscar Mateo
  2017-10-27 18:26   ` Chris Wilson
  2017-10-27 18:01 ` [RFC PATCH 11/12] drm/i915: Add an AUB file format writer Oscar Mateo
                   ` (4 subsequent siblings)
  14 siblings, 1 reply; 26+ messages in thread
From: Oscar Mateo @ 2017-10-27 18:01 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

If we want the AUB file to be complete (and, therefore, more useful)
we need to capture all BOs in use, we cannot leave that to the UMD
as before.

Signed-off-by: Oscar Mateo <oscar.mateo@intel.com>
Cc: Chris Wilson <chris@chris-wsilon.co.uk>
---
 drivers/gpu/drm/i915/i915_aubcrash.h       | 12 ++++++++++++
 drivers/gpu/drm/i915/i915_gem_execbuffer.c |  4 +++-
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/i915_aubcrash.h b/drivers/gpu/drm/i915/i915_aubcrash.h
index 2eed388..b42307e 100644
--- a/drivers/gpu/drm/i915/i915_aubcrash.h
+++ b/drivers/gpu/drm/i915/i915_aubcrash.h
@@ -37,6 +37,12 @@ void i915_error_page_walk(struct i915_address_space *vm,
 int i915_error_state_to_aub(struct drm_i915_error_state_buf *m,
                             const struct i915_gpu_state *error);
 
+static inline bool i915_error_state_should_capture(struct i915_vma *vma,
+						   struct i915_vma *batch)
+{
+	return ((INTEL_GEN(vma->vm->i915) >= 8) && (vma != batch));
+}
+
 #else
 
 static inline void i915_error_record_ppgtt(struct i915_gpu_state *error,
@@ -62,6 +68,12 @@ static inline int i915_error_state_to_aub(struct drm_i915_error_state_buf *m,
 	return 0;
 }
 
+static inline bool i915_error_state_should_capture(struct i915_vma *vma,
+						   struct i915_vma *batch)
+{
+	return false;
+}
+
 #endif
 
 #endif
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index 3d71907..47559a4 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -38,6 +38,7 @@
 #include "i915_drv.h"
 #include "i915_gem_clflush.h"
 #include "i915_trace.h"
+#include "i915_aubcrash.h"
 #include "intel_drv.h"
 #include "intel_frontbuffer.h"
 
@@ -1758,7 +1759,8 @@ static int eb_move_to_gpu(struct i915_execbuffer *eb)
 		struct i915_vma *vma = eb->vma[i];
 		struct drm_i915_gem_object *obj = vma->obj;
 
-		if (flags & EXEC_OBJECT_CAPTURE) {
+		if ((flags & EXEC_OBJECT_CAPTURE) ||
+		    i915_error_state_should_capture(vma, eb->batch)) {
 			struct i915_gem_capture_list *capture;
 
 			capture = kmalloc(sizeof(*capture), GFP_KERNEL);
-- 
1.9.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC PATCH 11/12] drm/i915: Add an AUB file format writer
  2017-10-27 18:01 [RFC PATCH 00/12] AubCrash Oscar Mateo
                   ` (9 preceding siblings ...)
  2017-10-27 18:01 ` [RFC PATCH 10/12] drm/i915: Always add BOs to capture list if AubCrash is enabled Oscar Mateo
@ 2017-10-27 18:01 ` Oscar Mateo
  2017-10-27 18:01 ` [RFC PATCH 12/12] drm/i915: Actually write the AUB file Oscar Mateo
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 26+ messages in thread
From: Oscar Mateo @ 2017-10-27 18:01 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

This is where the magic happens. For the moment, this will be
used by AubCrash to output a correctly-formatted AUB file of
GPU error information.

In the future, it could be used by other modules to generate
all kinds of AUB files. D.g. a running AUB capture of everything
a given userspace application ends up sending to the hardware
(like i-g-t's intel_aubdump tool, but at the kernel level).

Signed-off-by: Oscar Mateo <oscar.mateo@intel.com>
Cc: Chris Wilson <chris@chris-wsilon.co.uk>
---
 drivers/gpu/drm/i915/i915_aubmemtrace.c        | 665 +++++++++++++++++++++++++
 drivers/gpu/drm/i915/i915_aubmemtrace.h        |  76 +++
 drivers/gpu/drm/i915/i915_aubmemtrace_format.h | 359 +++++++++++++
 3 files changed, 1100 insertions(+)
 create mode 100644 drivers/gpu/drm/i915/i915_aubmemtrace.c
 create mode 100644 drivers/gpu/drm/i915/i915_aubmemtrace.h
 create mode 100644 drivers/gpu/drm/i915/i915_aubmemtrace_format.h

diff --git a/drivers/gpu/drm/i915/i915_aubmemtrace.c b/drivers/gpu/drm/i915/i915_aubmemtrace.c
new file mode 100644
index 0000000..61b1392
--- /dev/null
+++ b/drivers/gpu/drm/i915/i915_aubmemtrace.c
@@ -0,0 +1,665 @@
+/*
+ * Copyright © 2017 Intel Corporation
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the next
+ * paragraph) shall be included in all copies or substantial portions of the
+ * Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ *
+ * Author:
+ *    Oscar Mateo <oscar.mateo@intel.com>
+ *
+ */
+
+#include "intel_drv.h"
+#include "i915_aubmemtrace.h"
+#include "i915_aubmemtrace_format.h"
+
+/**
+ * DOC: AUB Memtrace
+ *
+ * The "AUB" memtrace file format provides a way to log GPU workloads in the
+ * same (or a very similar) form as they would be sent to the Intel Graphics
+ * Hardware. These logs are then provided to the user, who can use them for
+ * multiple purposes. For example: to easily browse the workload in order to
+ * find HW programming errors or to replay the workload using a GPU simulator or
+ * emulator.
+ *
+ * Technically, the format is the same used by intel_aubdump (a userspace tool
+ * that you can find in intel-gpu-tools) but by writing AUB files from the KMD
+ * we can log information that a userspace tool by itself cannot. E.g.: real GPU
+ * virtual addresses, pagetables, GPU contexts, workaround batchbuffers, etc...
+ *
+ * Trivia:
+ * In case the reader was wondering, AUB is a shorthand for "Auburn", the
+ * code name of the Intel740™ Graphics Accelerator (also known as the i740).
+ * We maintain the name of the file format for historical reasons.
+ *
+ */
+
+#define AUB_TOOL_VERSION_MAJOR 0
+#define AUB_TOOL_VERSION_MINOR 1
+
+#define AUB_WRITE(data, len) do { \
+	aub->write(aub->priv, data, len); \
+} while (0)
+
+#define PADDING(x) ((4 - ((x) & 3)) & 3)
+
+static inline void aub_write_padding(struct intel_aub *aub, u32 bytes)
+{
+	u32 zero = 0;
+
+	if (GEM_WARN_ON(bytes > 3))
+		return;
+
+	AUB_WRITE(&zero, bytes);
+}
+
+static inline void aub_header_fill(struct aub_cmd_hdr *header, u32 type,
+				   u32 opcode, u32 sub_opcode,
+				   u32 dword_count)
+{
+	header->type = type;
+	header->opcode = opcode;
+	header->sub_opcode = sub_opcode;
+	header->dword_count = dword_count;
+}
+
+struct aub_chip_revision {
+	uint rev_id;
+	uint stepping;
+	uint metal;
+};
+
+static const struct aub_chip_revision bdw_revs[] = {
+	{ 0, STEP_A, 0 },
+};
+
+static const struct aub_chip_revision chv_revs[] = {
+	{ 0, STEP_A, 0 },
+};
+
+static const struct aub_chip_revision skl_revs[] = {
+	{ SKL_REVID_A0, STEP_A, 0 },
+	{ SKL_REVID_B0, STEP_B, 0 },
+	{ SKL_REVID_C0, STEP_C, 0 },
+	{ SKL_REVID_D0, STEP_D, 0 },
+	{ SKL_REVID_E0, STEP_E, 0 },
+	{ SKL_REVID_F0, STEP_E, 0 },
+	{ SKL_REVID_G0, STEP_G, 0 },
+	{ SKL_REVID_H0, STEP_H, 0 },
+};
+
+static const struct aub_chip_revision bxt_revs[] = {
+	{ BXT_REVID_A0,     STEP_A, 0 },
+	{ BXT_REVID_A1,     STEP_A, 1 },
+	{ BXT_REVID_B0,     STEP_B, 0 },
+	{ BXT_REVID_B_LAST, STEP_B, 1 },
+	{ BXT_REVID_C0,     STEP_C, 0 },
+};
+
+static const struct aub_chip_revision kbl_revs[] = {
+	{ KBL_REVID_A0, STEP_A, 0 },
+	{ KBL_REVID_B0, STEP_B, 0 },
+	{ KBL_REVID_C0, STEP_C, 0 },
+	{ KBL_REVID_D0, STEP_D, 0 },
+	{ KBL_REVID_E0, STEP_E, 0 },
+};
+
+static const struct aub_chip_revision glk_revs[] = {
+	{ GLK_REVID_A0, STEP_A, 0 },
+	{ GLK_REVID_A1, STEP_A, 1 },
+};
+
+static const struct aub_chip_revision cnl_revs[] = {
+	{ CNL_REVID_A0, STEP_A, 0 },
+	{ CNL_REVID_B0, STEP_B, 0 },
+	{ CNL_REVID_C0, STEP_C, 0 },
+};
+
+struct aub_platforms_table {
+	uint platform_id;
+	uint device;
+	const struct aub_chip_revision *table;
+	uint count;
+};
+
+static const struct aub_platforms_table platforms[] = {
+	{ INTEL_BROADWELL,  DEV_BDW, bdw_revs, ARRAY_SIZE(bdw_revs) },
+	{ INTEL_CHERRYVIEW, DEV_CHV, chv_revs, ARRAY_SIZE(chv_revs) },
+	{ INTEL_SKYLAKE,    DEV_SKL, skl_revs, ARRAY_SIZE(skl_revs) },
+	{ INTEL_BROXTON,    DEV_BXT, bxt_revs, ARRAY_SIZE(bxt_revs) },
+	{ INTEL_KABYLAKE,   DEV_KBL, kbl_revs, ARRAY_SIZE(kbl_revs) },
+	{ INTEL_GEMINILAKE, DEV_GLK, glk_revs, ARRAY_SIZE(glk_revs) },
+	{ INTEL_CANNONLAKE, DEV_CNL, cnl_revs, ARRAY_SIZE(cnl_revs) },
+};
+
+static int aub_write_version_packet(struct intel_aub *aub,
+				    enum intel_platform platform,
+				    u8 revision, const char *message)
+{
+	u32 length, padding;
+	struct cmd_memtrace_version cmd;
+	char *buf = (char *)aub->scratch;
+	bool rev_warning = false;
+	int i, j;
+
+	memset(&cmd, 0, sizeof(cmd));
+	aub_header_fill(&cmd.header, CMD_TYPE_AUB, CMD_OPC_MEMTRACE,
+			CMD_SUBOPC_MEMTRACE_VERSION, sizeof(cmd) / 4 - 2);
+
+	cmd.memtrace_file_version = AUB_FILE_FORMAT_VERSION;
+	cmd.swizzling = SWIZZLING_DISABLED;
+	cmd.recording_method = METHOD_PHY;
+	cmd.pch = PCH_DEFAULT;
+	cmd.capture_tool = CAPTURE_TOOL_KMD;
+
+	for (i = 0; i < ARRAY_SIZE(platforms); i++) {
+		if (platform == platforms[i].platform_id) {
+			const struct aub_chip_revision *table =
+				platforms[i].table;
+			uint count = platforms[i].count;
+			cmd.device = platforms[i].device;
+
+			for (j = 0; j < count; j++) {
+				if (revision == table[j].rev_id) {
+					cmd.stepping = table[j].stepping;
+					cmd.metal = table[j].metal;
+				}
+			}
+
+			if (j == count) {
+				rev_warning = true;
+				cmd.stepping = table[count - 1].stepping;
+				cmd.metal = table[count - 1].metal;
+			}
+
+			break;
+		}
+
+		if (i == ARRAY_SIZE(platforms)) {
+			DRM_ERROR("Unsupported platform 0x%x\n", platform);
+			return -ENODEV;
+		}
+	}
+
+	cmd.tool_primary_version = AUB_TOOL_VERSION_MAJOR;
+	cmd.tool_secondary_version = AUB_TOOL_VERSION_MINOR;
+
+	snprintf(buf, AUB_COMMENT_MAX_LENGTH, message);
+	length = strlen(buf);
+	padding = PADDING(length);
+	cmd.header.dword_count += (length + padding) / 4;
+
+	AUB_WRITE(&cmd, sizeof(cmd) - 4);
+	AUB_WRITE(buf, length);
+	aub_write_padding(aub, padding);
+
+	if (rev_warning)
+		i915_aub_comment(aub,
+				 "Unknown revid 0x%x. Using last known step/metal",
+				 revision);
+
+	return 0;
+}
+
+static void aub_write_comment_packet(struct intel_aub *aub, const char *comment)
+{
+	struct cmd_memtrace_comment cmd;
+	const char preface[] = "AUB: ";
+	uint preface_len = strlen(preface);
+	uint comment_len = strlen(comment) + 1;
+	uint padding = PADDING(comment_len + preface_len);
+
+	memset(&cmd, 0, sizeof(cmd));
+	aub_header_fill(&cmd.header, CMD_TYPE_AUB, CMD_OPC_MEMTRACE,
+			CMD_SUBOPC_MEMTRACE_COMMENT, sizeof(cmd) / 4 - 2);
+	cmd.header.dword_count += (preface_len + comment_len + padding) / 4;
+
+	AUB_WRITE(&cmd, sizeof(cmd) - 4);
+	AUB_WRITE(preface, preface_len);
+	AUB_WRITE(comment, comment_len);
+	aub_write_padding(aub, padding);
+}
+
+static int aub_write_mem_packet(struct intel_aub *aub,
+				enum tiling_values tiling,
+				enum data_type_values type,
+				enum address_space_values space,
+				u64 address,
+				const void *data,
+				u32 bytes)
+{
+	struct cmd_memtrace_memwrite cmd;
+	uint max_bytes = 4 * (0xffff - (sizeof(cmd) / 4 - 2));
+	uint padding = PADDING(bytes);
+	uint num_dwords = (bytes + padding) / 4;
+
+	if (bytes > max_bytes)
+		return -E2BIG;
+
+	memset(&cmd, 0, sizeof(cmd));
+	aub_header_fill(&cmd.header, CMD_TYPE_AUB, CMD_OPC_MEMTRACE,
+			CMD_SUBOPC_MEMTRACE_MEMORY_WRITE, sizeof(cmd) / 4 - 2);
+	cmd.header.dword_count += num_dwords;
+
+	cmd.address = address;
+	cmd.tiling = tiling;
+	cmd.data_type_hint = type;
+	cmd.address_space = space;
+	cmd.data_size = bytes;
+
+	AUB_WRITE(&cmd, sizeof(cmd) - 4);
+	AUB_WRITE(data, bytes);
+	aub_write_padding(aub, padding);
+
+	return bytes;
+}
+
+static int aub_write_mem_discon_packet(struct intel_aub *aub,
+				       enum tiling_values tiling,
+				       enum data_type_values type,
+				       enum address_space_values space,
+				       const struct memwrite_element *elements,
+				       const void **data,
+				       uint count)
+{
+	struct aub_cmd_hdr header;
+	struct aub_cmd_memwrite_discon_opts cmd_opts;
+	uint cmd_size = sizeof(struct cmd_memtrace_memwrite_discon);
+	uint max_bytes = 4 * (0xffff - (cmd_size / 4 - 2));
+	uint total_bytes = 0;
+	uint padding;
+	uint num_dwords;
+	int i;
+
+	if (count > DISCONTIGUOUS_WRITE_MAX_ELEMENTS)
+		return -E2BIG;
+
+	for (i = 0; i < count; i++)
+		total_bytes += elements[i].data_size;
+
+	padding = PADDING(total_bytes);
+	num_dwords = (total_bytes + padding) / 4;
+
+	if (total_bytes > max_bytes)
+		return -E2BIG;
+
+	memset(&header, 0, sizeof(header));
+	aub_header_fill(&header, CMD_TYPE_AUB, CMD_OPC_MEMTRACE,
+			CMD_SUBOPC_MEMTRACE_MEMORY_WRITE_DISCONTIGUOUS,
+			cmd_size / 4 - 2);
+	header.dword_count += num_dwords;
+
+	memset(&cmd_opts, 0, sizeof(cmd_opts));
+	cmd_opts.tiling = tiling;
+	cmd_opts.data_type_hint = type;
+	cmd_opts.address_space = space;
+	cmd_opts.number_of_elements = count;
+
+	AUB_WRITE(&header, sizeof(header));
+	AUB_WRITE(&cmd_opts, sizeof(cmd_opts));
+
+	AUB_WRITE(elements, count * sizeof(*elements));
+	for (i = count; i < DISCONTIGUOUS_WRITE_MAX_ELEMENTS; i++) {
+		struct memwrite_element zero = {0, 0};
+		AUB_WRITE(&zero, sizeof(zero));
+	}
+
+	for (i = 0; i < count; i++)
+		AUB_WRITE(data[i], elements[i].data_size);
+
+	aub_write_padding(aub, padding);
+
+	return total_bytes;
+}
+
+static void aub_write_register_packet(struct intel_aub *aub, i915_reg_t reg,
+				      u32 value)
+{
+	struct cmd_memtrace_register_write cmd;
+
+	memset(&cmd, 0, sizeof(cmd));
+	aub_header_fill(&cmd.header, CMD_TYPE_AUB, CMD_OPC_MEMTRACE,
+			CMD_SUBOPC_MEMTRACE_REGISTER_WRITE, sizeof(cmd) / 4 - 1);
+	cmd.message_source = SOURCE_IA;
+	cmd.register_size = SIZE_DWORD;
+	cmd.register_space = SPACE_MMIO;
+	cmd.write_mask_low = 0xffffffff;
+	cmd.write_mask_high = 0x0;
+
+	cmd.register_offset = i915_mmio_reg_offset(reg);
+	cmd.data[0] = value;
+
+	AUB_WRITE(&cmd, sizeof(cmd));
+}
+
+static void aub_write_pci_register_packet(struct intel_aub *aub, u16 bus,
+					  u8 device, u8 function,
+					  u32 offset, u32 value)
+{
+	struct cmd_memtrace_register_write cmd;
+
+	memset(&cmd, 0, sizeof(cmd));
+	aub_header_fill(&cmd.header, CMD_TYPE_AUB, CMD_OPC_MEMTRACE,
+			CMD_SUBOPC_MEMTRACE_REGISTER_WRITE, sizeof(cmd) / 4 - 1);
+	cmd.message_source = SOURCE_IA;
+	cmd.register_size = SIZE_DWORD;
+	cmd.register_space = SPACE_PCI;
+	cmd.write_mask_low = 0xffffffff;
+	cmd.write_mask_high = 0x0;
+
+	cmd.bus = bus;
+	cmd.device = device;
+	cmd.function = function;
+	cmd.offset = offset;
+	cmd.data[0] = value;
+
+	AUB_WRITE(&cmd, sizeof(cmd));
+}
+
+static void aub_write_regpoll_packet(struct intel_aub *aub, i915_reg_t reg,
+				     u32 mask, u32 value)
+{
+	struct cmd_memtrace_register_poll cmd;
+
+	memset(&cmd, 0, sizeof(cmd));
+	aub_header_fill(&cmd.header, CMD_TYPE_AUB, CMD_OPC_MEMTRACE,
+			CMD_SUBOPC_MEMTRACE_REGISTER_POLL, sizeof(cmd) / 4 - 1);
+	cmd.abort_on_timeout = 1;
+	cmd.poll_not_equal = 0;
+	cmd.operation_type = OPERATION_TYPE_NORMAL;
+	cmd.register_size = SIZE_DWORD;
+	cmd.register_space = SPACE_MMIO;
+
+	cmd.poll_mask_low = mask;
+	cmd.register_offset = i915_mmio_reg_offset(reg);
+	cmd.data[0] = value;
+
+	AUB_WRITE(&cmd, sizeof(cmd));
+}
+
+static inline phys_addr_t adjust_gsm_paddr(struct intel_aub *aub,
+					   bool global_gtt,
+					   phys_addr_t pte_paddr)
+{
+	if (global_gtt) {
+		/*
+		 * We already told the other end about the base
+		 * of the GGTT stolen memory, so treat it here
+		 * as if it was 0x0
+		 */
+		return (pte_paddr - aub->gsm_paddr);
+	} else
+		return pte_paddr;
+}
+
+static int aub_write_discon_pages(struct intel_aub *aub,
+				  bool global_gtt,
+				  enum tiling_values tiling,
+				  enum data_type_values type,
+				  enum address_space_values space,
+				  const struct drm_i915_error_page *pages,
+				  uint count)
+{
+	enum address_space_values pte_space;
+	uint count_left = count;
+	const struct drm_i915_error_page *pages_left = pages;
+	struct memwrite_element *elements =
+		(struct memwrite_element *)aub->scratch;
+	const void **data =
+		(const void **)(elements + DISCONTIGUOUS_WRITE_MAX_ELEMENTS);
+	int ret;
+	int i;
+
+	BUILD_BUG_ON(sizeof(aub->scratch) <
+		     DISCONTIGUOUS_WRITE_MAX_ELEMENTS * sizeof(*elements) +
+		     DISCONTIGUOUS_WRITE_MAX_ELEMENTS * sizeof(*data));
+
+	pte_space = global_gtt ? ADDRESS_SPACE_GTT_ENTRY :
+				 ADDRESS_SPACE_PPGTT_ENTRY;
+
+	while (count_left) {
+		uint c = min(count_left, (uint)DISCONTIGUOUS_WRITE_MAX_ELEMENTS);
+
+		if (c == 1) {
+			const gen8_pte_t *pte = &pages_left[0].pte;
+			phys_addr_t pte_paddr =
+				adjust_gsm_paddr(aub, global_gtt,
+						 pages_left[0].pte_paddr);
+
+			ret = aub_write_mem_packet(aub, TILING_NONE, TYPE_NOTYPE,
+						   pte_space, pte_paddr, pte,
+						   sizeof(u64));
+		} else {
+			for (i = 0; i < c; i++) {
+				elements[i].address =
+					adjust_gsm_paddr(aub, global_gtt,
+							 pages_left[i].pte_paddr);
+				elements[i].data_size = sizeof(u64);
+				data[i] = &pages_left[i].pte;
+			}
+
+			ret = aub_write_mem_discon_packet(aub, TILING_NONE,
+							  TYPE_NOTYPE, pte_space,
+							  elements,
+							  data, c);
+		}
+		if (ret < 0)
+			return ret;
+
+		if (c == 1) {
+			ret = aub_write_mem_packet(aub, tiling, type, space,
+						   pages_left[0].paddr,
+						   pages_left[0].storage,
+						   PAGE_SIZE);
+		} else {
+			for (i = 0; i < c; i++) {
+				elements[i].address = pages_left[i].paddr;
+				elements[i].data_size = PAGE_SIZE;
+				data[i] = pages_left[i].storage;
+			}
+
+			ret = aub_write_mem_discon_packet(aub, tiling, type,
+							  space, elements,
+							  data, c);
+		}
+		if (ret < 0)
+			return ret;
+
+		count_left -= c;
+		pages_left += c;
+	}
+
+	return 0;
+}
+
+struct intel_aub *i915_aub_start(struct drm_i915_private *i915,
+				 write_aub_fn write_function,
+				 void *private_data,
+				 const char *message,
+				 bool verbose)
+{
+	struct i915_ggtt *ggtt = &i915->ggtt;
+	struct intel_aub *aub;
+	int ret;
+
+	aub = kmalloc(sizeof(*aub), GFP_KERNEL);
+	if (!aub)
+		return ERR_PTR(-ENOMEM);
+
+	aub->write = write_function;
+	aub->priv = private_data;
+	aub->platform = i915->info.platform;
+	aub->revision = INTEL_REVID(i915);
+	aub->gsm_paddr = ggtt->gsm_paddr;
+	aub->verbose = verbose;
+
+	ret = aub_write_version_packet(aub, aub->platform,
+				       aub->revision, message);
+	if (ret < 0) {
+		kfree(aub);
+		return ERR_PTR(ret);
+	}
+
+	/* Tell the other end about the physical GGTT location */
+	GEM_BUG_ON(upper_32_bits(aub->gsm_paddr));
+	aub_write_pci_register_packet(aub, 0, 0, 0, 0xb4,
+				      lower_32_bits(aub->gsm_paddr));
+
+	return aub;
+}
+
+void i915_aub_comment(struct intel_aub *aub, const char *format, ...)
+{
+	va_list args;
+	char *buf = (char *)aub->scratch;
+	BUILD_BUG_ON(sizeof(aub->scratch) < AUB_COMMENT_MAX_LENGTH);
+
+	if (!aub->verbose)
+		return;
+
+	va_start(args, format);
+	vsnprintf(buf, AUB_COMMENT_MAX_LENGTH, format, args);
+	va_end(args);
+
+	aub_write_comment_packet(aub, buf);
+}
+
+void i915_aub_register(struct intel_aub *aub, i915_reg_t reg, u32 value)
+{
+	aub_write_register_packet(aub, reg, value);
+}
+
+void i915_aub_gtt(struct intel_aub *aub, enum pagemap_level lvl,
+		  phys_addr_t paddr, const u64 *entries, uint count)
+{
+	enum address_space_values space;
+	uint max_count = PAGE_SIZE / sizeof(*entries);
+	uint c = min(count, max_count);
+
+	switch (lvl) {
+	default:
+		MISSING_CASE(lvl);
+	case PPGTT_LEVEL4:
+		space = ADDRESS_SPACE_PPGTT_PML4_ENTRY;
+		break;
+	case PPGTT_LEVEL3:
+		space = ADDRESS_SPACE_PPGTT_PDP_ENTRY;
+		break;
+	case PPGTT_LEVEL2:
+		space = ADDRESS_SPACE_PPGTT_PD_ENTRY;
+		break;
+	case PPGTT_LEVEL1:
+		space = ADDRESS_SPACE_PPGTT_ENTRY;
+		break;
+	case GGTT_LEVEL1:
+		space = ADDRESS_SPACE_GTT_ENTRY;
+		paddr = adjust_gsm_paddr(aub, true, paddr);
+		break;
+	}
+
+	aub_write_mem_packet(aub, TILING_NONE, TYPE_NOTYPE, space, paddr,
+			     entries, c * sizeof(*entries));
+}
+
+void i915_aub_context(struct intel_aub *aub, u8 class,
+		      const struct drm_i915_error_page *pages, uint count)
+{
+	enum data_type_values type;
+
+	switch (class) {
+	default:
+		MISSING_CASE(class);
+	case OTHER_CLASS:
+	case RENDER_CLASS:
+		type = TYPE_LOGICAL_RING_CONTEXT_RCS;
+		break;
+	case VIDEO_DECODE_CLASS:
+		type = TYPE_LOGICAL_RING_CONTEXT_VCS;
+		break;
+	case VIDEO_ENHANCEMENT_CLASS:
+		type = TYPE_LOGICAL_RING_CONTEXT_VECS;
+		break;
+	case COPY_ENGINE_CLASS:
+		type = TYPE_LOGICAL_RING_CONTEXT_BCS;
+		break;
+	}
+
+	aub_write_discon_pages(aub, true, TILING_NONE, type,
+			       ADDRESS_SPACE_PHYSICAL, pages, count);
+}
+
+void i915_aub_batchbuffer(struct intel_aub *aub, bool global_gtt,
+			  const struct drm_i915_error_page *pages, uint count)
+{
+	aub_write_discon_pages(aub, global_gtt, TILING_NONE, TYPE_BATCH_BUFFER,
+			       ADDRESS_SPACE_PHYSICAL, pages, count);
+}
+
+void i915_aub_buffer(struct intel_aub *aub, bool global_gtt, int tiling_mode,
+		     const struct drm_i915_error_page *pages, uint count)
+{
+	enum tiling_values tiling;
+
+	switch (tiling_mode) {
+	default:
+		MISSING_CASE(tiling_mode);
+	case I915_TILING_NONE:
+		tiling = TILING_NONE;
+		break;
+	case I915_TILING_X:
+		tiling = TILING_X;
+		break;
+	case I915_TILING_Y:
+		tiling = TILING_Y;
+		break;
+	}
+
+	aub_write_discon_pages(aub, global_gtt, tiling, TYPE_NOTYPE,
+			       ADDRESS_SPACE_PHYSICAL, pages, count);
+}
+
+void i915_aub_elsp_submit(struct intel_aub *aub, struct intel_engine_cs *engine,
+			  u64 desc)
+{
+	i915_reg_t elsp = RING_ELSP(engine);
+	i915_reg_t elsp_status = RING_EXECLIST_STATUS_LO(engine);
+	u32 value;
+
+	aub_write_register_packet(aub, elsp, 0x0);
+	aub_write_register_packet(aub, elsp, 0x0);
+	aub_write_register_packet(aub, elsp, upper_32_bits(desc));
+	aub_write_register_packet(aub, elsp, lower_32_bits(desc));
+
+	/*
+	 * Due to the nature of the AUB file (no timing information), we cannot
+	 * use it to model asynchronous things like Lite Restores or Preemption.
+	 * This is the reason we use this "fake" ELSP submission with just one
+	 * element at a time instead of just capturing the real submission. And
+	 * also the reason why here we force the other end to wait until the HW
+	 * becomes idle again.
+	 */
+	value = GEN8_CTX_STATUS_ACTIVE_IDLE << EL_STATUS_LAST_CTX_SWITCH_SHIFT;
+	aub_write_regpoll_packet(aub, elsp_status, value, value);
+}
+
+void i915_aub_stop(struct intel_aub *aub)
+{
+	kfree(aub);
+}
diff --git a/drivers/gpu/drm/i915/i915_aubmemtrace.h b/drivers/gpu/drm/i915/i915_aubmemtrace.h
new file mode 100644
index 0000000..baab082
--- /dev/null
+++ b/drivers/gpu/drm/i915/i915_aubmemtrace.h
@@ -0,0 +1,76 @@
+/*
+ * Copyright © 2017 Intel Corporation
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the next
+ * paragraph) shall be included in all copies or substantial portions of the
+ * Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+ * DEALINGS IN THE SOFTWARE.
+ */
+
+#ifndef _INTEL_AUBCAPTURE_H_
+#define _INTEL_AUBCAPTURE_H_
+
+#define AUB_COMMENT_MAX_LENGTH	512
+#define AUB_SCRATCH_SIZE	1280
+
+typedef void (*write_aub_fn)(void *priv, const void *data, size_t length);
+
+struct intel_aub {
+	struct drm_i915_private *i915;
+
+	write_aub_fn write;
+	void *priv;
+
+	enum intel_platform platform;
+	u8 revision;
+
+	phys_addr_t gsm_paddr;
+
+	bool verbose;
+
+	/* Avoid using the stack */
+	u8 scratch[AUB_SCRATCH_SIZE];
+};
+
+enum pagemap_level {
+	PPGTT_LEVEL4,
+	PPGTT_LEVEL3,
+	PPGTT_LEVEL2,
+	PPGTT_LEVEL1,
+	GGTT_LEVEL1,
+};
+
+struct intel_aub *i915_aub_start(struct drm_i915_private *i915,
+				 write_aub_fn write_function,
+				 void *private_data,
+				 const char *message,
+				 bool verbose);
+void i915_aub_comment(struct intel_aub *aub, const char *format, ...);
+void i915_aub_register(struct intel_aub *aub, i915_reg_t reg, u32 value);
+void i915_aub_gtt(struct intel_aub *aub, enum pagemap_level lvl,
+		  phys_addr_t paddr, const u64 *entries, uint count);
+void i915_aub_context(struct intel_aub *aub, u8 class,
+		      const struct drm_i915_error_page *pages, uint count);
+void i915_aub_batchbuffer(struct intel_aub *aub, bool global_gtt,
+			  const struct drm_i915_error_page *pages, uint count);
+void i915_aub_buffer(struct intel_aub *aub, bool global_gtt, int tiling_mode,
+		     const struct drm_i915_error_page *pages, uint count);
+void i915_aub_elsp_submit(struct intel_aub *aub, struct intel_engine_cs *engine,
+			  u64 desc);
+void i915_aub_stop(struct intel_aub *aub);
+
+#endif
diff --git a/drivers/gpu/drm/i915/i915_aubmemtrace_format.h b/drivers/gpu/drm/i915/i915_aubmemtrace_format.h
new file mode 100644
index 0000000..0821cfc
--- /dev/null
+++ b/drivers/gpu/drm/i915/i915_aubmemtrace_format.h
@@ -0,0 +1,359 @@
+/*
+ * Copyright © 2017 Intel Corporation
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the next
+ * paragraph) shall be included in all copies or substantial portions of the
+ * Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+ * DEALINGS IN THE SOFTWARE.
+ */
+
+#ifndef _INTEL_AUBCAPTURE_FORMAT_H_
+#define _INTEL_AUBCAPTURE_FORMAT_H_
+
+#pragma pack(push, 4)
+
+#define AUB_FILE_FORMAT_VERSION 0
+
+#define CMD_TYPE_AUB	0x7
+
+#define CMD_OPC_MEMTRACE	0x2e
+
+#define CMD_SUBOPC_MEMTRACE_VERSION			0xe
+#define CMD_SUBOPC_MEMTRACE_COMMENT			0x8
+#define CMD_SUBOPC_MEMTRACE_REGISTER_POLL		0x2
+#define CMD_SUBOPC_MEMTRACE_REGISTER_WRITE		0x3
+#define CMD_SUBOPC_MEMTRACE_MEMORY_WRITE		0x6
+#define CMD_SUBOPC_MEMTRACE_MEMORY_WRITE_DISCONTIGUOUS	0xb
+
+/**
+ * struct aub_cmd_hdr - AUB command header
+ */
+struct aub_cmd_hdr {
+	/** @dword_count: The number of dwords in the command not including the
+	 * first dword */
+	uint32_t dword_count : 16;
+	uint32_t sub_opcode  : 7;
+	uint32_t opcode      : 6;
+	uint32_t type        : 3;
+};
+
+enum stepping_values {
+	STEP_A = 0, STEP_B, STEP_C, STEP_D, STEP_E, STEP_F, STEP_G, STEP_H,
+	STEP_I, STEP_J, STEP_K, STEP_L, STEP_M, STEP_N, STEP_O, STEP_P, STEP_Q,
+	STEP_R, STEP_S, STEP_T, STEP_U, STEP_V, STEP_W, STEP_X, STEP_Y, STEP_Z
+};
+
+enum device_values {
+	DEV_BDW = 11,
+	DEV_CHV = 13,
+	DEV_SKL = 12,
+	DEV_BXT = 14,
+	DEV_KBL = 16,
+	DEV_GLK = 17,
+	DEV_CNL = 15,
+};
+
+enum swizzling_values {
+	SWIZZLING_ENABLED = 1,
+	SWIZZLING_DISABLED = 0
+};
+
+enum recording_method_values {
+	METHOD_PHY = 1,
+	METHOD_GFX = 0
+};
+
+enum pch_values {
+	PCH_DEFAULT = 0
+};
+
+enum capture_tool_values {
+	CAPTURE_TOOL_KMD = 1
+};
+
+/**
+ * struct cmd_memtrace_version - first packet to appear on the AUB file (kind of
+ * a file header).
+ *
+ * Includes version information about the memtrace file that contains it.
+ */
+struct cmd_memtrace_version {
+	struct aub_cmd_hdr header;
+
+	/** @memtrace_file_version: memtrace file format version. */
+	uint32_t memtrace_file_version;
+
+	struct {
+		/** @metal: Which HW metal the memtrace file was generated on */
+		uint32_t metal            : 3;
+		/** @stepping: Which HW stepping the memtrace file was generated
+		 * on. One of  enum stepping_values */
+		uint32_t stepping         : 5;
+		/** @device: Which device the memtrace file was generated on.
+		 * One of enum device_values */
+		uint32_t device           : 8;
+		/** @swizzling: Which swizzling the data is in. One of enum
+		 * swizzling_values */
+		uint32_t swizzling        : 2;
+		/** @recording_method: Which recording method was used.
+		 * One of enum recording_method_values */
+		uint32_t recording_method : 2;
+		/** @pch: Which PCH was used. One of enum pch_values */
+		uint32_t pch              : 8;
+		/** @capture_tool: Which tool generated the memtrace file. One
+		 * of enum capture_tool_values */
+		uint32_t capture_tool     : 4;
+	};
+
+	/** @tool_primary_version: The primary version number for the capture
+	 * tool used. */
+	uint32_t tool_primary_version;
+
+	/** @tool_secondary_version: The secondary version number for the
+	 * capture tool used. */
+	uint32_t tool_secondary_version;
+
+	/**
+	 * @command_line: Command line used to generate the memtrace file (N
+	 * dwords). If this string is not 4 byte aligned it has to be padded
+	 * with 0s at the end.
+	 */
+	char command_line[4];
+};
+
+/**
+ * struct cmd_memtrace_comment - A comment in the AUB file.
+ *
+ * Free-style text, can be used for a number of reasons.
+ */
+struct cmd_memtrace_comment {
+	struct aub_cmd_hdr header;
+
+	uint32_t reserved;
+
+	/**
+	 * @comment: A comment that should be printed to console (N dwords).
+	 * If this string is not 4 byte aligned it has to be padded with 0s
+	 * at the end.
+	 */
+	char comment[4];
+};
+
+enum message_source_values {
+	SOURCE_IA = 0
+};
+
+enum register_size_values {
+	SIZE_BYTE =  0,
+	SIZE_WORD =  1,
+	SIZE_DWORD = 2,
+	SIZE_QWORD = 3,
+};
+
+enum register_space_values {
+	SPACE_MMIO = 0,
+	SPACE_PCI = 2,
+};
+
+struct cmd_memtrace_register_write {
+	struct aub_cmd_hdr header;
+
+	/** @register_offset: The offset in the selected register space. For
+	 * PCI configuration registers this offset field is split into four
+	 * sub-fields: [31:16] is the bus number, [15:11] is the device number,
+	 * [10:8] is the function number, and [7:0] is the register offset. */
+	union {
+		uint32_t register_offset;
+		struct {
+			uint32_t bus      : 16;
+			uint32_t device   : 5;
+			uint32_t function : 3;
+			uint32_t offset   : 8;
+		};
+	};
+
+	struct {
+		uint32_t                  : 4;
+		/** @message_source: Origin of the register write. One of enum
+		 * message_source_values */
+		uint32_t message_source   : 4;
+		uint32_t                  : 8;
+		/** @register_size: Size of the data. One of enum
+		 * register_size_values */
+		uint32_t register_size    : 4;
+		uint32_t                  : 8;
+		/** @register_space: Which register space to use. One of enum
+		 * register_space_values */
+		uint32_t register_space   : 4;
+	};
+
+	uint32_t write_mask_low;
+
+	/** @write_mask_high: ignored if register_size is not QWORD. */
+	uint32_t write_mask_high;
+
+	/** @data: The data that is expected from the register write. */
+	uint32_t data[1];
+};
+
+enum operation_type_values {
+	OPERATION_TYPE_NORMAL =         0,
+	OPERATION_TYPE_INTERLACED_CRC = 1,
+};
+
+struct cmd_memtrace_register_poll {
+	struct aub_cmd_hdr header;
+
+	/** @register_offset: The offset in the selected register space. For
+	 * PCI configuration registers this offset field is split into four
+	 * sub-fields: [31:16] is the bus number, [15:11] is the device number,
+	 * [10:8] is the function number, and [7:0] is the register offset. */
+	union {
+		uint32_t register_offset;
+		struct {
+			uint32_t bus      : 16;
+			uint32_t device   : 5;
+			uint32_t function : 3;
+			uint32_t offset   : 8;
+		};
+	};
+
+	struct {
+		uint32_t                  : 1;
+		/** @timeout_action: Abort if the timeout expires? */
+		uint32_t abort_on_timeout : 1;
+		/** @poll_not_equal: Poll until value != target */
+		uint32_t poll_not_equal   : 1;
+		uint32_t                  : 1;
+		/** @operation_type: One of operation_type_values */
+		uint32_t operation_type   : 4;
+		uint32_t                  : 8;
+		/** @register_size: Size of the data. One of enum
+		 * register_size_values */
+		uint32_t register_size    : 4;
+		uint32_t                  : 8;
+		/** @register_space: Which register space to use. One of enum
+		 * register_space_values */
+		uint32_t register_space   : 4;
+	};
+
+	uint32_t poll_mask_low;
+
+	/** @write_mask_high: ignored if register_size is not QWORD. */
+	uint32_t poll_mask_high;
+
+	/** @data: The data that is expected from the register read. */
+	uint32_t data[1];
+};
+
+enum tiling_values {
+	TILING_NONE = 0,
+	TILING_X = 1,
+	TILING_Y = 2,
+};
+
+enum data_type_values {
+	TYPE_NOTYPE = 0,
+	TYPE_BATCH_BUFFER = 1,
+	TYPE_LOGICAL_RING_CONTEXT_RCS = 48,
+	TYPE_LOGICAL_RING_CONTEXT_BCS = 49,
+	TYPE_LOGICAL_RING_CONTEXT_VCS = 50,
+	TYPE_LOGICAL_RING_CONTEXT_VECS = 51,
+};
+
+enum address_space_values {
+	ADDRESS_SPACE_PHYSICAL = 2,
+	ADDRESS_SPACE_GTT_GFX  = 0,
+	ADDRESS_SPACE_GTT_ENTRY = 4,
+	ADDRESS_SPACE_PPGTT_GFX = 5,
+	ADDRESS_SPACE_PPGTT_PML4_ENTRY = 10,
+	ADDRESS_SPACE_PPGTT_PDP_ENTRY = 8,
+	ADDRESS_SPACE_PPGTT_PD_ENTRY = 9,
+	ADDRESS_SPACE_PPGTT_ENTRY = 6,
+};
+
+struct cmd_memtrace_memwrite {
+	struct aub_cmd_hdr header;
+
+	/** @address: The address of the memory to read. The address space is
+	 * determined by the address_space field. */
+	uint64_t address;
+
+	struct {
+		uint32_t                    : 2;
+		/** @tiling: Tiling format. One of enum tiling_values */
+		uint32_t tiling             : 2;
+		uint32_t                    : 16;
+		/** @data_type_hint: This parameter specifies the type of data
+		 * block that follows. One of enum data_type_values. If it isn't
+		 * known mark it as TYPE_NOTYPE */
+		uint32_t data_type_hint     : 8;
+		/** @address_space: This parameter specifies the type of memory
+		 * corresponding to the data block (GTT-relative, physical
+		 * local, physical system, etc...). One of enum
+		 * address_space_values */
+		uint32_t address_space      : 4;
+	};
+
+	/** @data_size: The number of bytes that will be written. The data
+	 * elements are packed into dwords in the data parameter, padded with
+	 * zeroes */
+	uint32_t data_size;
+
+	/** @data: The data that will be written. */
+	uint32_t data[1];
+};
+
+#define DISCONTIGUOUS_WRITE_MAX_ELEMENTS 63
+
+struct memwrite_element {
+	/** @address: The address of the memory to read. */
+	uint64_t address;
+	/** @data_size: The number of bytes that will be written. */
+	uint32_t data_size;
+};
+
+struct cmd_memtrace_memwrite_discon {
+	struct aub_cmd_hdr header;
+
+	struct aub_cmd_memwrite_discon_opts {
+		uint32_t                    : 2;
+		/** @tiling: Tiling format. One of enum tiling_values */
+		uint32_t tiling             : 2;
+
+		/** @tiling: Number of address and data_size pairs */
+		uint32_t number_of_elements : 16;
+		/** @data_type_hint: This parameter specifies the type of data
+		 * block that follows. One of enum data_type_values. If it isn't
+		 * known mark it as TYPE_NOTYPE */
+		uint32_t data_type_hint     : 8;
+		/** @address_space: This parameter specifies the type of memory
+		 * corresponding to the data block (GTT-relative, physical
+		 * local, physical system, etc...). One of enum
+		 * address_space_values */
+		uint32_t address_space      : 4;
+	} opts;
+
+	struct memwrite_element elements[DISCONTIGUOUS_WRITE_MAX_ELEMENTS];
+
+	/** @data: The data that will be written. */
+	uint32_t data[1];
+};
+
+#pragma pack(pop)
+
+#endif
-- 
1.9.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC PATCH 12/12] drm/i915: Actually write the AUB file
  2017-10-27 18:01 [RFC PATCH 00/12] AubCrash Oscar Mateo
                   ` (10 preceding siblings ...)
  2017-10-27 18:01 ` [RFC PATCH 11/12] drm/i915: Add an AUB file format writer Oscar Mateo
@ 2017-10-27 18:01 ` Oscar Mateo
  2017-10-27 18:20 ` [RFC PATCH 00/12] AubCrash Chris Wilson
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 26+ messages in thread
From: Oscar Mateo @ 2017-10-27 18:01 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

Use all the information previously recorded in the GPU error
state to write an AUB file.

TODO: output an already compressed file?

Signed-off-by: Oscar Mateo <oscar.mateo@intel.com>
Cc: Chris Wilson <chris@chris-wsilon.co.uk>
---
 drivers/gpu/drm/i915/Makefile        |   2 +-
 drivers/gpu/drm/i915/i915_aubcrash.c | 174 +++++++++++++++++++++++++++++++++++
 2 files changed, 175 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
index 04956c7..f958291 100644
--- a/drivers/gpu/drm/i915/Makefile
+++ b/drivers/gpu/drm/i915/Makefile
@@ -124,7 +124,7 @@ i915-y += dvo_ch7017.o \
 
 # Post-mortem debug and GPU hang state capture
 i915-$(CONFIG_DRM_I915_CAPTURE_ERROR) += i915_gpu_error.o
-i915-$(CONFIG_DRM_I915_AUB_CRASH_DUMP) += i915_aubcrash.o
+i915-$(CONFIG_DRM_I915_AUB_CRASH_DUMP) += i915_aubmemtrace.o i915_aubcrash.o
 i915-$(CONFIG_DRM_I915_SELFTEST) += \
 	selftests/i915_random.o \
 	selftests/i915_selftest.o
diff --git a/drivers/gpu/drm/i915/i915_aubcrash.c b/drivers/gpu/drm/i915/i915_aubcrash.c
index ebdbf09..a336777 100644
--- a/drivers/gpu/drm/i915/i915_aubcrash.c
+++ b/drivers/gpu/drm/i915/i915_aubcrash.c
@@ -27,6 +27,7 @@
 
 #include "intel_drv.h"
 #include "i915_aubcrash.h"
+#include "i915_aubmemtrace.h"
 
 /**
  * DOC: AubCrash
@@ -255,8 +256,181 @@ void i915_error_page_walk(struct i915_address_space *vm,
 	}
 }
 
+#ifdef CONFIG_DRM_I915_COMPRESS_ERROR
+
+void write_aub(void *priv, const void *data, size_t len)
+{
+	struct drm_i915_error_state_buf *e = priv;
+
+	/* TODO: Compress the AUB file on the go */
+	i915_error_binary_write(e, data, len);
+}
+
+#else
+
+void write_aub(void *priv, const void *data, size_t len)
+{
+	struct drm_i915_error_state_buf *e = priv;
+
+	i915_error_binary_write(e, data, len);
+}
+
+#endif
+
+#define AUB_COMMENT_ERROR_OBJ(name, obj) do { \
+	i915_aub_comment(aub, name " (%08x_%08x %8u)", \
+			 upper_32_bits((obj)->gtt_offset), \
+			 lower_32_bits((obj)->gtt_offset), \
+			 (obj)->gtt_size); \
+} while (0)
+
 int i915_error_state_to_aub(struct drm_i915_error_state_buf *m,
 			    const struct i915_gpu_state *error)
 {
+	struct drm_i915_private *dev_priv = m->i915;
+	struct intel_aub *aub;
+	int i;
+
+	aub = i915_aub_start(dev_priv, write_aub, (void *)m, "AubCrash", true);
+	if (IS_ERR(aub))
+		return PTR_ERR(aub);
+
+	if (!error) {
+		i915_aub_comment(aub, "No error state collected\n");
+		return 0;
+	}
+
+	i915_aub_comment(aub, "Registers");
+	i915_aub_register(aub, GAM_ECOCHK, error->gam_ecochk);
+	for (i = 0; i < ARRAY_SIZE(error->engine); i++) {
+		const struct drm_i915_error_engine *ee = &error->engine[i];
+		struct intel_engine_cs *engine = dev_priv->engine[i];
+
+		if (!ee->batchbuffer)
+			continue;
+
+		i915_aub_register(aub, RING_MODE_GEN7(engine),
+				  _MASKED_BIT_ENABLE(ee->vm_info.gfx_mode));
+		i915_aub_register(aub, RING_HWS_PGA(engine->mmio_base),
+				  ee->hws);
+	}
+
+	i915_aub_comment(aub, "PPGTT PML4/PDP/PD");
+	for (i = 0; i < ARRAY_SIZE(error->active_vm); i++) {
+		const struct drm_i915_error_pagemap_lvl *pml4 =
+			&error->ppgtt_pml4[i];
+		int l3, l2;
+
+		if (!error->active_vm[i])
+			break;
+
+		if (pml4->storage)
+			i915_aub_gtt(aub, PPGTT_LEVEL4, pml4->paddr,
+				     pml4->storage, GEN8_PML4ES_PER_PML4);
+
+		for (l3 = 0; l3 < pml4->nxt_lvl_count; l3++) {
+			const struct drm_i915_error_pagemap_lvl *pdp =
+				&pml4->nxt_lvl[l3];
+
+			if (pdp->storage)
+				i915_aub_gtt(aub, PPGTT_LEVEL3, pdp->paddr,
+					     pdp->storage, GEN8_4LVL_PDPES);
+
+			for (l2 = 0; l2 < pdp->nxt_lvl_count; l2++) {
+				const struct drm_i915_error_pagemap_lvl *pd =
+					&pdp->nxt_lvl[l2];
+
+				i915_aub_gtt(aub, PPGTT_LEVEL2, pd->paddr,
+					     pd->storage, I915_PDES);
+			}
+		}
+	}
+
+	/* Active request */
+	for (i = 0; i < ARRAY_SIZE(error->engine); i++) {
+		const struct drm_i915_error_engine *ee = &error->engine[i];
+		struct intel_engine_cs *engine = dev_priv->engine[i];
+		int j;
+
+		if (!ee->batchbuffer)
+			continue;
+
+		i915_aub_comment(aub, "Engine %s", engine->name);
+
+		if (ee->hws_page) {
+			AUB_COMMENT_ERROR_OBJ("Hardware Status Page",
+					      ee->hws_page);
+			i915_aub_buffer(aub, true, ee->hws_page->tiling,
+					ee->hws_page->pages,
+					ee->hws_page->page_count);
+		}
+
+		if (ee->ctx) {
+			u64 gtt_offset =
+				ee->ctx->gtt_offset + LRC_GUCSHR_SZ * PAGE_SIZE;
+			u64 gtt_size =
+				ee->ctx->gtt_size - LRC_GUCSHR_SZ * PAGE_SIZE;
+
+			i915_aub_comment(aub,
+					 "Logical Ring Context (%08x_%08x %8u)",
+					 upper_32_bits(gtt_offset),
+					 lower_32_bits(gtt_offset),
+					 gtt_size);
+			i915_aub_context(aub, engine->class,
+					 ee->ctx->pages + LRC_GUCSHR_SZ,
+					 ee->ctx->page_count - LRC_GUCSHR_SZ);
+		}
+
+		if (ee->renderstate) {
+			AUB_COMMENT_ERROR_OBJ("Renderstate", ee->renderstate);
+			i915_aub_batchbuffer(aub, true, ee->renderstate->pages,
+					     ee->renderstate->page_count);
+		}
+
+		if (ee->wa_batchbuffer) {
+			AUB_COMMENT_ERROR_OBJ("Scratch", ee->wa_batchbuffer);
+			i915_aub_buffer(aub, true, I915_TILING_NONE,
+					ee->wa_batchbuffer->pages,
+					ee->wa_batchbuffer->page_count);
+		}
+
+		if (ee->wa_ctx) {
+			AUB_COMMENT_ERROR_OBJ("WA context", ee->wa_ctx);
+			i915_aub_batchbuffer(aub, true, ee->wa_ctx->pages,
+					     ee->wa_ctx->page_count);
+		}
+
+		if (ee->ringbuffer) {
+			AUB_COMMENT_ERROR_OBJ("Ringbuffer", ee->ringbuffer);
+			i915_aub_batchbuffer(aub, true, ee->ringbuffer->pages,
+					     ee->ringbuffer->page_count);
+		}
+
+		if (ee->batchbuffer) {
+			AUB_COMMENT_ERROR_OBJ("Batchbuffer", ee->batchbuffer);
+			i915_aub_batchbuffer(aub, false, ee->batchbuffer->pages,
+					     ee->batchbuffer->page_count);
+		}
+
+		for (j = 0; j < ee->user_bo_count; j++) {
+			struct drm_i915_error_object *obj = ee->user_bo[j];
+
+			AUB_COMMENT_ERROR_OBJ("BO", obj);
+			i915_aub_buffer(aub, false, obj->tiling,
+					obj->pages, obj->page_count);
+		}
+
+		/* XXX: Do I want to overwrite the head/tail inside the lrc? */
+		i915_aub_comment(aub, "ELSP submissions");
+		for (j = 0; j < ee->num_requests; j++)
+			i915_aub_elsp_submit(aub, engine,
+					     ee->requests[j].lrc_desc);
+	}
+
+	i915_aub_stop(aub);
+
+	if (m->bytes == 0 && m->err)
+		return m->err;
+
 	return 0;
 }
-- 
1.9.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 04/12] drm/i915: Capture some extra small details in the GPU error state
  2017-10-27 18:01 ` [RFC PATCH 04/12] drm/i915: Capture some extra small details in the GPU error state Oscar Mateo
@ 2017-10-27 18:15   ` Chris Wilson
  0 siblings, 0 replies; 26+ messages in thread
From: Chris Wilson @ 2017-10-27 18:15 UTC (permalink / raw)
  To: Oscar Mateo, intel-gfx; +Cc: Chris Wilson

Quoting Oscar Mateo (2017-10-27 19:01:07)
> Namely:
> - Capture tiling per drm_i915_error_object

tiling isn't that useful as it only relates to fences and not userspace.

> - Capture the LRC descriptor per active request

Context details.

> - Capture the wa_batchbuffer unconditionally
> - Capture the GAM_ECOCHK register for all GENs

4 different items in the changelog...

Simple rule: every and is a new patch.
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 05/12] drm/i915: Capture the renderstate batchbuffer
  2017-10-27 18:01 ` [RFC PATCH 05/12] drm/i915: Capture the renderstate batchbuffer Oscar Mateo
@ 2017-10-27 18:17   ` Chris Wilson
  0 siblings, 0 replies; 26+ messages in thread
From: Chris Wilson @ 2017-10-27 18:17 UTC (permalink / raw)
  To: Oscar Mateo, intel-gfx; +Cc: Chris Wilson, Mika Kuoppala

Quoting Oscar Mateo (2017-10-27 19:01:08)
> It can be useful if it's in play at the time of the crash,
> and I will be needing it in AubCrash.

Ambiguous pronoun.

Just add it to the error-state unconditionally. Active is used very
loosely here.
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 00/12] AubCrash
  2017-10-27 18:01 [RFC PATCH 00/12] AubCrash Oscar Mateo
                   ` (11 preceding siblings ...)
  2017-10-27 18:01 ` [RFC PATCH 12/12] drm/i915: Actually write the AUB file Oscar Mateo
@ 2017-10-27 18:20 ` Chris Wilson
  2017-10-27 18:26   ` Lionel Landwerlin
  2017-10-27 18:31   ` Oscar Mateo
  2017-10-27 18:30 ` Chris Wilson
  2017-10-27 18:33 ` ✗ Fi.CI.BAT: failure for AubCrash Patchwork
  14 siblings, 2 replies; 26+ messages in thread
From: Chris Wilson @ 2017-10-27 18:20 UTC (permalink / raw)
  To: Oscar Mateo, intel-gfx

Quoting Oscar Mateo (2017-10-27 19:01:03)
> AubCrash is a companion to i915_gpu_error. It gives us the possibility to
> dump an AUB file that describes the state of the system at the point of
> the crash (GTTs, contexts, BBs, BOs, etc...). Being an AUB file, it can be
> used by a number of already existing tools (graphical AUB file browsers,
> simulators, emulators, etc...) that facilitate debugging (an improvement
> over the current text-based crash dump).

Serious question: why both? If the aub is useful for pm debugging, it
should be either a replacement for sysfs/error or an addendum.

I expect tools to use the new format as well.
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 07/12] drm/i915: Skeleton for AubCrash
  2017-10-27 18:01 ` [RFC PATCH 07/12] drm/i915: Skeleton for AubCrash Oscar Mateo
@ 2017-10-27 18:20   ` Chris Wilson
  0 siblings, 0 replies; 26+ messages in thread
From: Chris Wilson @ 2017-10-27 18:20 UTC (permalink / raw)
  To: Oscar Mateo, intel-gfx; +Cc: Chris Wilson

Quoting Oscar Mateo (2017-10-27 19:01:10)
> Includes some documentation on what AubCrash is supposed to achieve.

Which is missing *here*.
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 09/12] drm/i915: Store PTE information together with the error object pages
  2017-10-27 18:01 ` [RFC PATCH 09/12] drm/i915: Store PTE information together with the error object pages Oscar Mateo
@ 2017-10-27 18:25   ` Chris Wilson
  0 siblings, 0 replies; 26+ messages in thread
From: Chris Wilson @ 2017-10-27 18:25 UTC (permalink / raw)
  To: Oscar Mateo, intel-gfx; +Cc: Chris Wilson

Quoting Oscar Mateo (2017-10-27 19:01:12)
> With every page of physical data we store: its physical address, the PTE
> that points to it and the physical address of the PTE itself. We will be
> using all this information later.

But then you've stuck it in the error_object.pages which doesn't relate at
all to the vma. Separate out the capture of the object contents, with
capturing of the phys/dma addresses.
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 00/12] AubCrash
  2017-10-27 18:20 ` [RFC PATCH 00/12] AubCrash Chris Wilson
@ 2017-10-27 18:26   ` Lionel Landwerlin
  2017-10-27 18:31   ` Oscar Mateo
  1 sibling, 0 replies; 26+ messages in thread
From: Lionel Landwerlin @ 2017-10-27 18:26 UTC (permalink / raw)
  To: Chris Wilson, Oscar Mateo, intel-gfx

On 27/10/17 19:20, Chris Wilson wrote:
> Quoting Oscar Mateo (2017-10-27 19:01:03)
>> AubCrash is a companion to i915_gpu_error. It gives us the possibility to
>> dump an AUB file that describes the state of the system at the point of
>> the crash (GTTs, contexts, BBs, BOs, etc...). Being an AUB file, it can be
>> used by a number of already existing tools (graphical AUB file browsers,
>> simulators, emulators, etc...) that facilitate debugging (an improvement
>> over the current text-based crash dump).
> Serious question: why both? If the aub is useful for pm debugging, it
> should be either a replacement for sysfs/error or an addendum.
>
> I expect tools to use the new format as well.
> -Chris

Looks like the solution we were looking for :)
Thanks!
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 10/12] drm/i915: Always add BOs to capture list if AubCrash is enabled
  2017-10-27 18:01 ` [RFC PATCH 10/12] drm/i915: Always add BOs to capture list if AubCrash is enabled Oscar Mateo
@ 2017-10-27 18:26   ` Chris Wilson
  0 siblings, 0 replies; 26+ messages in thread
From: Chris Wilson @ 2017-10-27 18:26 UTC (permalink / raw)
  To: Oscar Mateo, intel-gfx; +Cc: Chris Wilson

Quoting Oscar Mateo (2017-10-27 19:01:13)
> If we want the AUB file to be complete (and, therefore, more useful)
> we need to capture all BOs in use, we cannot leave that to the UMD
> as before.

There's a very good reason why we don't try to capture the entirety of
mem from within the error capture. So no. There's no way we are selling
that as absolutely necessary use of GFP_ATOMIC and stop_machine.
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 00/12] AubCrash
  2017-10-27 18:01 [RFC PATCH 00/12] AubCrash Oscar Mateo
                   ` (12 preceding siblings ...)
  2017-10-27 18:20 ` [RFC PATCH 00/12] AubCrash Chris Wilson
@ 2017-10-27 18:30 ` Chris Wilson
  2017-10-27 18:45   ` Oscar Mateo
  2017-10-27 18:33 ` ✗ Fi.CI.BAT: failure for AubCrash Patchwork
  14 siblings, 1 reply; 26+ messages in thread
From: Chris Wilson @ 2017-10-27 18:30 UTC (permalink / raw)
  To: Oscar Mateo, intel-gfx

Quoting Oscar Mateo (2017-10-27 19:01:03)
> AubCrash is a companion to i915_gpu_error. It gives us the possibility to
> dump an AUB file that describes the state of the system at the point of
> the crash (GTTs, contexts, BBs, BOs, etc...). Being an AUB file, it can be
> used by a number of already existing tools (graphical AUB file browsers,
> simulators, emulators, etc...) that facilitate debugging (an improvement
> over the current text-based crash dump).

Since it is capture everything in progress, but only the kernel side of
it, why put it in the kernel? Is this absolutely required for
post-mortem debugging, or should we focus on capturing the death throes
of userspace much better (an aubcapture flight-data-recorder, plus
client annotations more akin to apitrace)?

Sell me with the bugzilla references.
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 00/12] AubCrash
  2017-10-27 18:20 ` [RFC PATCH 00/12] AubCrash Chris Wilson
  2017-10-27 18:26   ` Lionel Landwerlin
@ 2017-10-27 18:31   ` Oscar Mateo
  2017-10-27 20:17     ` Lionel Landwerlin
  1 sibling, 1 reply; 26+ messages in thread
From: Oscar Mateo @ 2017-10-27 18:31 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx



On 10/27/2017 11:20 AM, Chris Wilson wrote:
> Quoting Oscar Mateo (2017-10-27 19:01:03)
>> AubCrash is a companion to i915_gpu_error. It gives us the possibility to
>> dump an AUB file that describes the state of the system at the point of
>> the crash (GTTs, contexts, BBs, BOs, etc...). Being an AUB file, it can be
>> used by a number of already existing tools (graphical AUB file browsers,
>> simulators, emulators, etc...) that facilitate debugging (an improvement
>> over the current text-based crash dump).
> Serious question: why both? If the aub is useful for pm debugging, it
> should be either a replacement for sysfs/error or an addendum.
>
> I expect tools to use the new format as well.
> -Chris

That's why I sent this as an RFC, to get suggestions :)

There are things an AUB file cannot replace (fences, seqnos, error 
registers, etc...). It could still be used as an addendum, where the 
things that go into the AUB do not go into sysfs/error anymore (maybe 
with a message to look for the rest in the AUB file?). Or even better: I 
could try to provide the remaining text inside the AUB as comments, so 
that only one file is required...

As for tools, I still need to test how compatible Mesas's Aubinator is 
at the moment. There a number of non-opensource tools as well, but I 
have to discover which ones are publicly available outside of Intel 
before I provide a list here.
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 26+ messages in thread

* ✗ Fi.CI.BAT: failure for AubCrash
  2017-10-27 18:01 [RFC PATCH 00/12] AubCrash Oscar Mateo
                   ` (13 preceding siblings ...)
  2017-10-27 18:30 ` Chris Wilson
@ 2017-10-27 18:33 ` Patchwork
  14 siblings, 0 replies; 26+ messages in thread
From: Patchwork @ 2017-10-27 18:33 UTC (permalink / raw)
  To: Oscar Mateo; +Cc: intel-gfx

== Series Details ==

Series: AubCrash
URL   : https://patchwork.freedesktop.org/series/32774/
State : failure

== Summary ==

Series 32774v1 AubCrash
https://patchwork.freedesktop.org/api/1.0/series/32774/revisions/1/mbox/

Test gem_exec_reloc:
        Subgroup basic-cpu-active:
                pass       -> FAIL       (fi-gdg-551) fdo#102582 +5
Test kms_frontbuffer_tracking:
        Subgroup basic:
                fail       -> PASS       (fi-glk-dsi) fdo#103167
Test kms_psr_sink_crc:
        Subgroup psr_basic:
                pass       -> INCOMPLETE (fi-cnl-y)

fdo#102582 https://bugs.freedesktop.org/show_bug.cgi?id=102582
fdo#103167 https://bugs.freedesktop.org/show_bug.cgi?id=103167

fi-bdw-5557u     total:289  pass:268  dwarn:0   dfail:0   fail:0   skip:21  time:441s
fi-bdw-gvtdvm    total:289  pass:265  dwarn:0   dfail:0   fail:0   skip:24  time:447s
fi-blb-e6850     total:289  pass:223  dwarn:1   dfail:0   fail:0   skip:65  time:377s
fi-bsw-n3050     total:289  pass:243  dwarn:0   dfail:0   fail:0   skip:46  time:522s
fi-bwr-2160      total:289  pass:183  dwarn:0   dfail:0   fail:0   skip:106 time:265s
fi-bxt-dsi       total:289  pass:259  dwarn:0   dfail:0   fail:0   skip:30  time:500s
fi-bxt-j4205     total:289  pass:260  dwarn:0   dfail:0   fail:0   skip:29  time:489s
fi-byt-j1900     total:289  pass:253  dwarn:1   dfail:0   fail:0   skip:35  time:490s
fi-byt-n2820     total:289  pass:249  dwarn:1   dfail:0   fail:0   skip:39  time:479s
fi-cnl-y         total:248  pass:223  dwarn:0   dfail:0   fail:0   skip:24 
fi-elk-e7500     total:289  pass:229  dwarn:0   dfail:0   fail:0   skip:60  time:429s
fi-gdg-551       total:289  pass:172  dwarn:1   dfail:0   fail:7   skip:109 time:250s
fi-glk-1         total:289  pass:261  dwarn:0   dfail:0   fail:0   skip:28  time:578s
fi-glk-dsi       total:289  pass:259  dwarn:0   dfail:0   fail:0   skip:30  time:486s
fi-hsw-4770      total:289  pass:262  dwarn:0   dfail:0   fail:0   skip:27  time:429s
fi-hsw-4770r     total:289  pass:262  dwarn:0   dfail:0   fail:0   skip:27  time:426s
fi-ilk-650       total:289  pass:228  dwarn:0   dfail:0   fail:0   skip:61  time:420s
fi-ivb-3520m     total:289  pass:260  dwarn:0   dfail:0   fail:0   skip:29  time:494s
fi-ivb-3770      total:289  pass:260  dwarn:0   dfail:0   fail:0   skip:29  time:456s
fi-kbl-7500u     total:289  pass:264  dwarn:1   dfail:0   fail:0   skip:24  time:490s
fi-kbl-7560u     total:289  pass:270  dwarn:0   dfail:0   fail:0   skip:19  time:574s
fi-kbl-7567u     total:289  pass:269  dwarn:0   dfail:0   fail:0   skip:20  time:479s
fi-kbl-r         total:289  pass:262  dwarn:0   dfail:0   fail:0   skip:27  time:585s
fi-pnv-d510      total:289  pass:222  dwarn:1   dfail:0   fail:0   skip:66  time:547s
fi-skl-6260u     total:289  pass:269  dwarn:0   dfail:0   fail:0   skip:20  time:453s
fi-skl-6600u     total:289  pass:262  dwarn:0   dfail:0   fail:0   skip:27  time:593s
fi-skl-6700hq    total:289  pass:263  dwarn:0   dfail:0   fail:0   skip:26  time:649s
fi-skl-6700k     total:289  pass:265  dwarn:0   dfail:0   fail:0   skip:24  time:521s
fi-skl-6770hq    total:289  pass:269  dwarn:0   dfail:0   fail:0   skip:20  time:500s
fi-skl-gvtdvm    total:289  pass:266  dwarn:0   dfail:0   fail:0   skip:23  time:453s
fi-snb-2520m     total:289  pass:250  dwarn:0   dfail:0   fail:0   skip:39  time:559s
fi-snb-2600      total:289  pass:249  dwarn:0   dfail:0   fail:0   skip:40  time:416s

4efa630855362c267af94785e50948bb60615bfe drm-tip: 2017y-10m-27d-15h-25m-04s UTC integration manifest
416c8c57c9bb drm/i915: Actually write the AUB file
1ea0352e77e3 drm/i915: Add an AUB file format writer
d8a58aff43a5 drm/i915: Always add BOs to capture list if AubCrash is enabled
d4152742ee77 drm/i915: Store PTE information together with the error object pages
4459f7f0556e drm/i915: Capture the PPGTT pagetables on a GPU crash
124d5a4ca4b3 drm/i915: Skeleton for AubCrash
23e8583c734b drm/i915: Provide a way to write binary data to the error dump file
61e441a891a2 drm/i915: Capture the renderstate batchbuffer
3c7f9f42415a drm/i915: Capture some extra small details in the GPU error state
ec56735df5fb drm/i915: Store the GGTT Stolen Memory base physical address
da7f505480c4 drm/i915: Move the context switch status reason bits to the header
34a4870dbf5a drm/i915: New define for the number of PDPEs per PDP in a 4-level walk

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_6245/
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 00/12] AubCrash
  2017-10-27 18:30 ` Chris Wilson
@ 2017-10-27 18:45   ` Oscar Mateo
  2017-10-27 19:10     ` Chris Wilson
  0 siblings, 1 reply; 26+ messages in thread
From: Oscar Mateo @ 2017-10-27 18:45 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx



On 10/27/2017 11:30 AM, Chris Wilson wrote:
> Quoting Oscar Mateo (2017-10-27 19:01:03)
>> AubCrash is a companion to i915_gpu_error. It gives us the possibility to
>> dump an AUB file that describes the state of the system at the point of
>> the crash (GTTs, contexts, BBs, BOs, etc...). Being an AUB file, it can be
>> used by a number of already existing tools (graphical AUB file browsers,
>> simulators, emulators, etc...) that facilitate debugging (an improvement
>> over the current text-based crash dump).
> Since it is capture everything in progress, but only the kernel side of
> it, why put it in the kernel? Is this absolutely required for
> post-mortem debugging, or should we focus on capturing the death throes
> of userspace much better (an aubcapture flight-data-recorder, plus
> client annotations more akin to apitrace)?
>
> Sell me with the bugzilla references.
> -Chris

An aubcapture flight-data-recorder is the next logical step. Like 
i-g-t's intel_aubdump tool, but at the kernel level, so that it includes 
everything: contexts, WA BBs, virtual GPU addresses, pagetables, etc... 
The trojan horse for that is "drm/i915: Add an AUB file format writer". 
Now you only have to add a couple of debugfs entries (one for start/stop 
the capture, one to retrieve the AUB file as it gets created via 'relay 
channel') and a number of hooks around i915 to capture everything that 
can be interesting.
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 00/12] AubCrash
  2017-10-27 18:45   ` Oscar Mateo
@ 2017-10-27 19:10     ` Chris Wilson
  0 siblings, 0 replies; 26+ messages in thread
From: Chris Wilson @ 2017-10-27 19:10 UTC (permalink / raw)
  To: Oscar Mateo, intel-gfx

Quoting Oscar Mateo (2017-10-27 19:45:39)
> 
> 
> On 10/27/2017 11:30 AM, Chris Wilson wrote:
> > Quoting Oscar Mateo (2017-10-27 19:01:03)
> >> AubCrash is a companion to i915_gpu_error. It gives us the possibility to
> >> dump an AUB file that describes the state of the system at the point of
> >> the crash (GTTs, contexts, BBs, BOs, etc...). Being an AUB file, it can be
> >> used by a number of already existing tools (graphical AUB file browsers,
> >> simulators, emulators, etc...) that facilitate debugging (an improvement
> >> over the current text-based crash dump).
> > Since it is capture everything in progress, but only the kernel side of
> > it, why put it in the kernel? Is this absolutely required for
> > post-mortem debugging, or should we focus on capturing the death throes
> > of userspace much better (an aubcapture flight-data-recorder, plus
> > client annotations more akin to apitrace)?
> >
> > Sell me with the bugzilla references.
> > -Chris
> 
> An aubcapture flight-data-recorder is the next logical step. Like 
> i-g-t's intel_aubdump tool, but at the kernel level, so that it includes 
> everything: contexts, WA BBs, virtual GPU addresses, pagetables, etc... 
> The trojan horse for that is "drm/i915: Add an AUB file format writer". 
> Now you only have to add a couple of debugfs entries (one for start/stop 
> the capture, one to retrieve the AUB file as it gets created via 'relay 
> channel') and a number of hooks around i915 to capture everything that 
> can be interesting.

But we don't need to do that at the kernel level, as the ioctl interface
is the defining uABI. The only thing we can't snoop are the real phys
addresses but afaik for the replay aspect you don't need real, just
consistent.

Don't do anything in the kernel that can be done in userspace, because
we can never get it out again. We really do need compelling arguments as
to why it is impossible to do what needs to be done from userspace. And
however we put it, we can't just leak physical addressess or other
lowlevel information that opens ourselves to abuse, or snooping of one
client on another. At least not without a very good defense to hide behind
when it is spotted. The argument has to be really compelling if you want
us to maintain this for all platforms for the next decade+.

One suggestion is that we put all the dodgy stuff in an auxiliary
module, not even just hiding behind a module option. Of course that
makes the post-mortem aspect impossible.

(I'm not saying I have the answers, just that that its a high bar we have
to pass.)
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 00/12] AubCrash
  2017-10-27 18:31   ` Oscar Mateo
@ 2017-10-27 20:17     ` Lionel Landwerlin
  0 siblings, 0 replies; 26+ messages in thread
From: Lionel Landwerlin @ 2017-10-27 20:17 UTC (permalink / raw)
  To: Oscar Mateo, Chris Wilson, intel-gfx

On 27/10/17 19:31, Oscar Mateo wrote:
>
>
> On 10/27/2017 11:20 AM, Chris Wilson wrote:
>> Quoting Oscar Mateo (2017-10-27 19:01:03)
>>> AubCrash is a companion to i915_gpu_error. It gives us the 
>>> possibility to
>>> dump an AUB file that describes the state of the system at the point of
>>> the crash (GTTs, contexts, BBs, BOs, etc...). Being an AUB file, it 
>>> can be
>>> used by a number of already existing tools (graphical AUB file 
>>> browsers,
>>> simulators, emulators, etc...) that facilitate debugging (an 
>>> improvement
>>> over the current text-based crash dump).
>> Serious question: why both? If the aub is useful for pm debugging, it
>> should be either a replacement for sysfs/error or an addendum.
>>
>> I expect tools to use the new format as well.
>> -Chris
>
> That's why I sent this as an RFC, to get suggestions :)
>
> There are things an AUB file cannot replace (fences, seqnos, error 
> registers, etc...). It could still be used as an addendum, where the 
> things that go into the AUB do not go into sysfs/error anymore (maybe 
> with a message to look for the rest in the AUB file?). Or even better: 
> I could try to provide the remaining text inside the AUB as comments, 
> so that only one file is required...
>
> As for tools, I still need to test how compatible Mesas's Aubinator is 
> at the moment. There a number of non-opensource tools as well, but I 
> have to discover which ones are publicly available outside of Intel 
> before I provide a list here. 

I don't think there is any problem having yet another tool for a new 
format as long as it brings some interesting bits.
When aubinator_error_decode was introduced Chris actually brought up the 
idea of a new format to make parsing of the error state a bit easier.
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2017-10-27 20:17 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-10-27 18:01 [RFC PATCH 00/12] AubCrash Oscar Mateo
2017-10-27 18:01 ` [RFC PATCH 01/12] drm/i915: New define for the number of PDPEs per PDP in a 4-level walk Oscar Mateo
2017-10-27 18:01 ` [RFC PATCH 02/12] drm/i915: Move the context switch status reason bits to the header Oscar Mateo
2017-10-27 18:01 ` [RFC PATCH 03/12] drm/i915: Store the GGTT Stolen Memory base physical address Oscar Mateo
2017-10-27 18:01 ` [RFC PATCH 04/12] drm/i915: Capture some extra small details in the GPU error state Oscar Mateo
2017-10-27 18:15   ` Chris Wilson
2017-10-27 18:01 ` [RFC PATCH 05/12] drm/i915: Capture the renderstate batchbuffer Oscar Mateo
2017-10-27 18:17   ` Chris Wilson
2017-10-27 18:01 ` [RFC PATCH 06/12] drm/i915: Provide a way to write binary data to the error dump file Oscar Mateo
2017-10-27 18:01 ` [RFC PATCH 07/12] drm/i915: Skeleton for AubCrash Oscar Mateo
2017-10-27 18:20   ` Chris Wilson
2017-10-27 18:01 ` [RFC PATCH 08/12] drm/i915: Capture the PPGTT pagetables on a GPU crash Oscar Mateo
2017-10-27 18:01 ` [RFC PATCH 09/12] drm/i915: Store PTE information together with the error object pages Oscar Mateo
2017-10-27 18:25   ` Chris Wilson
2017-10-27 18:01 ` [RFC PATCH 10/12] drm/i915: Always add BOs to capture list if AubCrash is enabled Oscar Mateo
2017-10-27 18:26   ` Chris Wilson
2017-10-27 18:01 ` [RFC PATCH 11/12] drm/i915: Add an AUB file format writer Oscar Mateo
2017-10-27 18:01 ` [RFC PATCH 12/12] drm/i915: Actually write the AUB file Oscar Mateo
2017-10-27 18:20 ` [RFC PATCH 00/12] AubCrash Chris Wilson
2017-10-27 18:26   ` Lionel Landwerlin
2017-10-27 18:31   ` Oscar Mateo
2017-10-27 20:17     ` Lionel Landwerlin
2017-10-27 18:30 ` Chris Wilson
2017-10-27 18:45   ` Oscar Mateo
2017-10-27 19:10     ` Chris Wilson
2017-10-27 18:33 ` ✗ Fi.CI.BAT: failure for AubCrash Patchwork

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.