All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC 0/7] Add GuC Error Capture Support
@ 2021-11-22 23:03 ` Alan Previn
  0 siblings, 0 replies; 52+ messages in thread
From: Alan Previn @ 2021-11-22 23:03 UTC (permalink / raw)
  To: intel-gfx; +Cc: Matthew Brost, John Harrison, dri-devel, Alan Previn

This series:
  1. Supports the roll out of an upcoming GuC feature to
     enable error-state-capture that allows the driver to
     register lists of MMIO registers that GuC will report
     during a GuC triggered engine-reset event.
  2. Updates the ADS blob creation to register lists
     of global and engine registers with GuC.
  3. Defines tables of register lists that are global or
     engine class or engine instance in scope.
  4. Separates GuC log-buffer access locks for relay logging
     vs the new region for the error state capture data.
  5. Allocates an additional interim circular buffer store
     to copy snapshots of new GuC reported error-state-capture
     dumps in response to the G2H notification.
  6. Connects the i915_gpu_coredump reporting function
     to the GuC error capture module to print all GuC
     error state capture dumps that is reported.

Alan Previn (6):
  drm/i915/guc: Update GuC ADS size for error capture lists
  drm/i915/guc: Populate XE_LP register lists for GuC error state
    capture.
  drm/i915/guc: Add GuC's error state capture output structures.
  drm/i915/guc: Update GuC's log-buffer-state access for error capture.
  drm/i915/guc: Copy new GuC error capture logs upon G2H notification.
  drm/i915/guc: Print the GuC error capture output register list.

John Harrison (1):
  drm/i915/guc: Add basic support for error capture lists

 drivers/gpu/drm/i915/Makefile                 |   1 +
 drivers/gpu/drm/i915/gt/intel_engine_cs.c     |   4 +-
 .../gpu/drm/i915/gt/uc/abi/guc_actions_abi.h  |   8 +
 drivers/gpu/drm/i915/gt/uc/intel_guc.c        |  52 +-
 drivers/gpu/drm/i915/gt/uc/intel_guc.h        |   9 +
 drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c    | 197 +++-
 .../gpu/drm/i915/gt/uc/intel_guc_capture.c    | 999 ++++++++++++++++++
 .../gpu/drm/i915/gt/uc/intel_guc_capture.h    | 107 ++
 drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c     |   3 +
 drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |  40 +-
 drivers/gpu/drm/i915/gt/uc/intel_guc_log.c    | 141 ++-
 drivers/gpu/drm/i915/gt/uc/intel_guc_log.h    |  21 +-
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c |  22 +
 drivers/gpu/drm/i915/i915_gpu_error.c         |  53 +-
 drivers/gpu/drm/i915/i915_gpu_error.h         |   5 +
 15 files changed, 1581 insertions(+), 81 deletions(-)
 create mode 100644 drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
 create mode 100644 drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h

-- 
2.25.1


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [Intel-gfx] [RFC 0/7] Add GuC Error Capture Support
@ 2021-11-22 23:03 ` Alan Previn
  0 siblings, 0 replies; 52+ messages in thread
From: Alan Previn @ 2021-11-22 23:03 UTC (permalink / raw)
  To: intel-gfx; +Cc: dri-devel, Alan Previn

This series:
  1. Supports the roll out of an upcoming GuC feature to
     enable error-state-capture that allows the driver to
     register lists of MMIO registers that GuC will report
     during a GuC triggered engine-reset event.
  2. Updates the ADS blob creation to register lists
     of global and engine registers with GuC.
  3. Defines tables of register lists that are global or
     engine class or engine instance in scope.
  4. Separates GuC log-buffer access locks for relay logging
     vs the new region for the error state capture data.
  5. Allocates an additional interim circular buffer store
     to copy snapshots of new GuC reported error-state-capture
     dumps in response to the G2H notification.
  6. Connects the i915_gpu_coredump reporting function
     to the GuC error capture module to print all GuC
     error state capture dumps that is reported.

Alan Previn (6):
  drm/i915/guc: Update GuC ADS size for error capture lists
  drm/i915/guc: Populate XE_LP register lists for GuC error state
    capture.
  drm/i915/guc: Add GuC's error state capture output structures.
  drm/i915/guc: Update GuC's log-buffer-state access for error capture.
  drm/i915/guc: Copy new GuC error capture logs upon G2H notification.
  drm/i915/guc: Print the GuC error capture output register list.

John Harrison (1):
  drm/i915/guc: Add basic support for error capture lists

 drivers/gpu/drm/i915/Makefile                 |   1 +
 drivers/gpu/drm/i915/gt/intel_engine_cs.c     |   4 +-
 .../gpu/drm/i915/gt/uc/abi/guc_actions_abi.h  |   8 +
 drivers/gpu/drm/i915/gt/uc/intel_guc.c        |  52 +-
 drivers/gpu/drm/i915/gt/uc/intel_guc.h        |   9 +
 drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c    | 197 +++-
 .../gpu/drm/i915/gt/uc/intel_guc_capture.c    | 999 ++++++++++++++++++
 .../gpu/drm/i915/gt/uc/intel_guc_capture.h    | 107 ++
 drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c     |   3 +
 drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |  40 +-
 drivers/gpu/drm/i915/gt/uc/intel_guc_log.c    | 141 ++-
 drivers/gpu/drm/i915/gt/uc/intel_guc_log.h    |  21 +-
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c |  22 +
 drivers/gpu/drm/i915/i915_gpu_error.c         |  53 +-
 drivers/gpu/drm/i915/i915_gpu_error.h         |   5 +
 15 files changed, 1581 insertions(+), 81 deletions(-)
 create mode 100644 drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
 create mode 100644 drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h

-- 
2.25.1


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [Intel-gfx] [RFC 1/7] drm/i915/guc: Add basic support for error capture lists
  2021-11-22 23:03 ` [Intel-gfx] " Alan Previn
  (?)
@ 2021-11-22 23:03 ` Alan Previn
  2021-11-23 21:12   ` Michal Wajdeczko
  -1 siblings, 1 reply; 52+ messages in thread
From: Alan Previn @ 2021-11-22 23:03 UTC (permalink / raw)
  To: intel-gfx; +Cc: Alan Previn

From: John Harrison <John.C.Harrison@Intel.com>

Add not-quite-support for GuC based error capture. GuC will add error
capture capability amongst other things. In order to load the
firmware, a minimum amount of support is required on the driver side.
This adds that bare minimum.

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Reviewed-by: Alan Previn <alan.previn.teres.alexis@intel.com>
---
 .../gpu/drm/i915/gt/uc/abi/guc_actions_abi.h  |  1 +
 drivers/gpu/drm/i915/gt/uc/intel_guc.c        | 42 +++++++++++------
 drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  2 +
 drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c    | 45 ++++++++++++++++++-
 drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c     |  3 ++
 drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   | 21 ++++++++-
 drivers/gpu/drm/i915/gt/uc/intel_guc_log.c    |  9 +++-
 drivers/gpu/drm/i915/gt/uc/intel_guc_log.h    |  2 +
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 18 ++++++++
 9 files changed, 126 insertions(+), 17 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
index fe5d7d261797..5af03a486a13 100644
--- a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
+++ b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
@@ -145,6 +145,7 @@ enum intel_guc_action {
 	INTEL_GUC_ACTION_REGISTER_CONTEXT_MULTI_LRC = 0x4601,
 	INTEL_GUC_ACTION_RESET_CLIENT = 0x5507,
 	INTEL_GUC_ACTION_SET_ENG_UTIL_BUFF = 0x550A,
+	INTEL_GUC_ACTION_STATE_CAPTURE_NOTIFICATION = 0x8002,
 	INTEL_GUC_ACTION_LIMIT
 };
 
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.c b/drivers/gpu/drm/i915/gt/uc/intel_guc.c
index 6e228343e8cb..5cf9ebd2ee55 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.c
@@ -222,32 +222,48 @@ static u32 guc_ctl_log_params_flags(struct intel_guc *guc)
 	u32 flags;
 
 	#if (((CRASH_BUFFER_SIZE) % SZ_1M) == 0)
-	#define UNIT SZ_1M
-	#define FLAG GUC_LOG_ALLOC_IN_MEGABYTE
+	#define LOG_UNIT SZ_1M
+	#define LOG_FLAG GUC_LOG_LOG_ALLOC_UNITS
 	#else
-	#define UNIT SZ_4K
-	#define FLAG 0
+	#define LOG_UNIT SZ_4K
+	#define LOG_FLAG 0
+	#endif
+
+	#if (((CAPTURE_BUFFER_SIZE) % SZ_1M) == 0)
+	#define CAPTURE_UNIT SZ_1M
+	#define CAPTURE_FLAG GUC_LOG_CAPTURE_ALLOC_UNITS
+	#else
+	#define CAPTURE_UNIT SZ_4K
+	#define CAPTURE_FLAG 0
 	#endif
 
 	BUILD_BUG_ON(!CRASH_BUFFER_SIZE);
-	BUILD_BUG_ON(!IS_ALIGNED(CRASH_BUFFER_SIZE, UNIT));
+	BUILD_BUG_ON(!IS_ALIGNED(CRASH_BUFFER_SIZE, LOG_UNIT));
 	BUILD_BUG_ON(!DEBUG_BUFFER_SIZE);
-	BUILD_BUG_ON(!IS_ALIGNED(DEBUG_BUFFER_SIZE, UNIT));
+	BUILD_BUG_ON(!IS_ALIGNED(DEBUG_BUFFER_SIZE, LOG_UNIT));
+	BUILD_BUG_ON(!CAPTURE_BUFFER_SIZE);
+	BUILD_BUG_ON(!IS_ALIGNED(CAPTURE_BUFFER_SIZE, CAPTURE_UNIT));
 
-	BUILD_BUG_ON((CRASH_BUFFER_SIZE / UNIT - 1) >
+	BUILD_BUG_ON((CRASH_BUFFER_SIZE / LOG_UNIT - 1) >
 			(GUC_LOG_CRASH_MASK >> GUC_LOG_CRASH_SHIFT));
-	BUILD_BUG_ON((DEBUG_BUFFER_SIZE / UNIT - 1) >
+	BUILD_BUG_ON((DEBUG_BUFFER_SIZE / LOG_UNIT - 1) >
 			(GUC_LOG_DEBUG_MASK >> GUC_LOG_DEBUG_SHIFT));
+	BUILD_BUG_ON((CAPTURE_BUFFER_SIZE / CAPTURE_UNIT - 1) >
+			(GUC_LOG_CAPTURE_MASK >> GUC_LOG_CAPTURE_SHIFT));
 
 	flags = GUC_LOG_VALID |
 		GUC_LOG_NOTIFY_ON_HALF_FULL |
-		FLAG |
-		((CRASH_BUFFER_SIZE / UNIT - 1) << GUC_LOG_CRASH_SHIFT) |
-		((DEBUG_BUFFER_SIZE / UNIT - 1) << GUC_LOG_DEBUG_SHIFT) |
+		CAPTURE_FLAG |
+		LOG_FLAG |
+		((CRASH_BUFFER_SIZE / LOG_UNIT - 1) << GUC_LOG_CRASH_SHIFT) |
+		((DEBUG_BUFFER_SIZE / LOG_UNIT - 1) << GUC_LOG_DEBUG_SHIFT) |
+		((CAPTURE_BUFFER_SIZE / CAPTURE_UNIT - 1) << GUC_LOG_CAPTURE_SHIFT) |
 		(offset << GUC_LOG_BUF_ADDR_SHIFT);
 
-	#undef UNIT
-	#undef FLAG
+	#undef LOG_UNIT
+	#undef LOG_FLAG
+	#undef CAPTURE_UNIT
+	#undef CAPTURE_FLAG
 
 	return flags;
 }
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
index 1cb46098030d..9de99772f916 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
@@ -392,6 +392,8 @@ int intel_guc_context_reset_process_msg(struct intel_guc *guc,
 					const u32 *msg, u32 len);
 int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
 					 const u32 *msg, u32 len);
+int intel_guc_error_capture_process_msg(struct intel_guc *guc,
+					 const u32 *msg, u32 len);
 
 void intel_guc_find_hung_context(struct intel_engine_cs *engine);
 
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
index 1a1edae67e4e..6c81ddd303d3 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
@@ -40,6 +40,10 @@
  *      +---------------------------------------+
  *      | padding                               |
  *      +---------------------------------------+ <== 4K aligned
+ *      | capture lists                         |
+ *      +---------------------------------------+
+ *      | padding                               |
+ *      +---------------------------------------+ <== 4K aligned
  *      | private data                          |
  *      +---------------------------------------+
  *      | padding                               |
@@ -65,6 +69,12 @@ static u32 guc_ads_golden_ctxt_size(struct intel_guc *guc)
 	return PAGE_ALIGN(guc->ads_golden_ctxt_size);
 }
 
+static u32 guc_ads_capture_size(struct intel_guc *guc)
+{
+	/* Basic support to init ADS without a proper GuC error capture list */
+	return PAGE_ALIGN(PAGE_SIZE);
+}
+
 static u32 guc_ads_private_data_size(struct intel_guc *guc)
 {
 	return PAGE_ALIGN(guc->fw.private_data_size);
@@ -85,7 +95,7 @@ static u32 guc_ads_golden_ctxt_offset(struct intel_guc *guc)
 	return PAGE_ALIGN(offset);
 }
 
-static u32 guc_ads_private_data_offset(struct intel_guc *guc)
+static u32 guc_ads_capture_offset(struct intel_guc *guc)
 {
 	u32 offset;
 
@@ -95,6 +105,16 @@ static u32 guc_ads_private_data_offset(struct intel_guc *guc)
 	return PAGE_ALIGN(offset);
 }
 
+static u32 guc_ads_private_data_offset(struct intel_guc *guc)
+{
+	u32 offset;
+
+	offset = guc_ads_capture_offset(guc) +
+		 guc_ads_capture_size(guc);
+
+	return PAGE_ALIGN(offset);
+}
+
 static u32 guc_ads_blob_size(struct intel_guc *guc)
 {
 	return guc_ads_private_data_offset(guc) +
@@ -499,6 +519,26 @@ static void guc_init_golden_context(struct intel_guc *guc)
 	GEM_BUG_ON(guc->ads_golden_ctxt_size != total_size);
 }
 
+static void guc_capture_prep_lists(struct intel_guc *guc, struct __guc_ads_blob *blob)
+{
+	int i, j;
+	u32 addr_ggtt, offset;
+
+	offset = guc_ads_capture_offset(guc);
+	addr_ggtt = intel_guc_ggtt_offset(guc, guc->ads_vma) + offset;
+
+	/* FIXME: Populate a proper capture list */
+
+	for (i = 0; i < GUC_CAPTURE_LIST_INDEX_MAX; i++) {
+		for (j = 0; j < GUC_MAX_ENGINE_CLASSES; j++) {
+			blob->ads.capture_instance[i][j] = addr_ggtt;
+			blob->ads.capture_class[i][j] = addr_ggtt;
+		}
+
+		blob->ads.capture_global[i] = addr_ggtt;
+	}
+}
+
 static void __guc_ads_init(struct intel_guc *guc)
 {
 	struct intel_gt *gt = guc_to_gt(guc);
@@ -532,6 +572,9 @@ static void __guc_ads_init(struct intel_guc *guc)
 
 	base = intel_guc_ggtt_offset(guc, guc->ads_vma);
 
+	/* Lists for error capture debug */
+	guc_capture_prep_lists(guc, blob);
+
 	/* ADS */
 	blob->ads.scheduler_policies = base + ptr_offset(blob, policies);
 	blob->ads.gt_system_info = base + ptr_offset(blob, system_info);
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
index a0cc34be7b56..c20c0bcb83f9 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
@@ -974,6 +974,9 @@ static int ct_process_request(struct intel_guc_ct *ct, struct ct_incoming_msg *r
 	case INTEL_GUC_ACTION_CONTEXT_RESET_NOTIFICATION:
 		ret = intel_guc_context_reset_process_msg(guc, payload, len);
 		break;
+	case INTEL_GUC_ACTION_STATE_CAPTURE_NOTIFICATION:
+		ret = intel_guc_error_capture_process_msg(guc, payload, len);
+		break;
 	case INTEL_GUC_ACTION_ENGINE_FAILURE_NOTIFICATION:
 		ret = intel_guc_engine_failure_process_msg(guc, payload, len);
 		break;
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
index 7072e30e99f4..767684b6af67 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
@@ -86,11 +86,14 @@
 #define GUC_CTL_LOG_PARAMS		0
 #define   GUC_LOG_VALID			(1 << 0)
 #define   GUC_LOG_NOTIFY_ON_HALF_FULL	(1 << 1)
-#define   GUC_LOG_ALLOC_IN_MEGABYTE	(1 << 3)
+#define   GUC_LOG_CAPTURE_ALLOC_UNITS	(1 << 2)
+#define   GUC_LOG_LOG_ALLOC_UNITS	(1 << 3)
 #define   GUC_LOG_CRASH_SHIFT		4
 #define   GUC_LOG_CRASH_MASK		(0x3 << GUC_LOG_CRASH_SHIFT)
 #define   GUC_LOG_DEBUG_SHIFT		6
 #define   GUC_LOG_DEBUG_MASK	        (0xF << GUC_LOG_DEBUG_SHIFT)
+#define   GUC_LOG_CAPTURE_SHIFT		10
+#define   GUC_LOG_CAPTURE_MASK	        (0x3 << GUC_LOG_CAPTURE_SHIFT)
 #define   GUC_LOG_BUF_ADDR_SHIFT	12
 
 #define GUC_CTL_WA			1
@@ -264,6 +267,7 @@ struct guc_mmio_reg {
 	u32 value;
 	u32 flags;
 #define GUC_REGSET_MASKED		(1 << 0)
+	u32 mask;
 } __packed;
 
 /* GuC register sets */
@@ -280,6 +284,14 @@ struct guc_gt_system_info {
 	u32 generic_gt_sysinfo[GUC_GENERIC_GT_SYSINFO_MAX];
 } __packed;
 
+/* Capture-types of GuC capture register lists */
+enum
+{
+	GUC_CAPTURE_LIST_INDEX_PF = 0,
+	GUC_CAPTURE_LIST_INDEX_VF = 1,
+	GUC_CAPTURE_LIST_INDEX_MAX = 2,
+};
+
 /* GuC Additional Data Struct */
 struct guc_ads {
 	struct guc_mmio_reg_set reg_state_list[GUC_MAX_ENGINE_CLASSES][GUC_MAX_INSTANCES_PER_CLASS];
@@ -291,7 +303,11 @@ struct guc_ads {
 	u32 golden_context_lrca[GUC_MAX_ENGINE_CLASSES];
 	u32 eng_state_size[GUC_MAX_ENGINE_CLASSES];
 	u32 private_data;
-	u32 reserved[15];
+	u32 reserved2;
+	u32 capture_instance[GUC_CAPTURE_LIST_INDEX_MAX][GUC_MAX_ENGINE_CLASSES];
+	u32 capture_class[GUC_CAPTURE_LIST_INDEX_MAX][GUC_MAX_ENGINE_CLASSES];
+	u32 capture_global[GUC_CAPTURE_LIST_INDEX_MAX];
+	u32 reserved[4];
 } __packed;
 
 /* Engine usage stats */
@@ -312,6 +328,7 @@ struct guc_engine_usage {
 enum guc_log_buffer_type {
 	GUC_DEBUG_LOG_BUFFER,
 	GUC_CRASH_DUMP_LOG_BUFFER,
+	GUC_CAPTURE_LOG_BUFFER,
 	GUC_MAX_LOG_BUFFER
 };
 
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_log.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_log.c
index ac0931f0374b..1962a43302a8 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_log.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_log.c
@@ -201,6 +201,8 @@ static unsigned int guc_get_log_buffer_size(enum guc_log_buffer_type type)
 		return DEBUG_BUFFER_SIZE;
 	case GUC_CRASH_DUMP_LOG_BUFFER:
 		return CRASH_BUFFER_SIZE;
+	case GUC_CAPTURE_LOG_BUFFER:
+		return CAPTURE_BUFFER_SIZE;
 	default:
 		MISSING_CASE(type);
 	}
@@ -463,6 +465,8 @@ int intel_guc_log_create(struct intel_guc_log *log)
 	 *  +-------------------------------+ 32B
 	 *  |      Debug state header       |
 	 *  +-------------------------------+ 64B
+	 *  |     Capture state header      |
+	 *  +-------------------------------+ 96B
 	 *  |                               |
 	 *  +===============================+ PAGE_SIZE (4KB)
 	 *  |        Crash Dump logs        |
@@ -470,7 +474,8 @@ int intel_guc_log_create(struct intel_guc_log *log)
 	 *  |          Debug logs           |
 	 *  +===============================+ + DEBUG_SIZE
 	 */
-	guc_log_size = PAGE_SIZE + CRASH_BUFFER_SIZE + DEBUG_BUFFER_SIZE;
+	guc_log_size = PAGE_SIZE + CRASH_BUFFER_SIZE + DEBUG_BUFFER_SIZE +
+		       CAPTURE_BUFFER_SIZE;
 
 	vma = intel_guc_allocate_vma(guc, guc_log_size);
 	if (IS_ERR(vma)) {
@@ -672,6 +677,8 @@ stringify_guc_log_type(enum guc_log_buffer_type type)
 		return "DEBUG";
 	case GUC_CRASH_DUMP_LOG_BUFFER:
 		return "CRASH";
+	case GUC_CAPTURE_LOG_BUFFER:
+		return "CAPTURE";
 	default:
 		MISSING_CASE(type);
 	}
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_log.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_log.h
index ac1ee1d5ce10..9d9004dc58f1 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_log.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_log.h
@@ -18,9 +18,11 @@ struct intel_guc;
 #ifdef CONFIG_DRM_I915_DEBUG_GUC
 #define CRASH_BUFFER_SIZE	SZ_2M
 #define DEBUG_BUFFER_SIZE	SZ_16M
+#define CAPTURE_BUFFER_SIZE	SZ_4M
 #else
 #define CRASH_BUFFER_SIZE	SZ_8K
 #define DEBUG_BUFFER_SIZE	SZ_64K
+#define CAPTURE_BUFFER_SIZE	SZ_16K
 #endif
 
 /*
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 77fbcd8730ee..0bfc92b1b982 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -4003,6 +4003,24 @@ int intel_guc_context_reset_process_msg(struct intel_guc *guc,
 	return 0;
 }
 
+int intel_guc_error_capture_process_msg(struct intel_guc *guc,
+					 const u32 *msg, u32 len)
+{
+	int status;
+
+	if (unlikely(len != 1)) {
+		drm_dbg(&guc_to_gt(guc)->i915->drm, "Invalid length %u", len);
+		return -EPROTO;
+	}
+
+	status = msg[0];
+	drm_info(&guc_to_gt(guc)->i915->drm, "Got error capture: status = %d", status);
+
+	/* Add extraction of error capture dump */
+
+	return 0;
+}
+
 static struct intel_engine_cs *
 guc_lookup_engine(struct intel_guc *guc, u8 guc_class, u8 instance)
 {
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [Intel-gfx] [RFC 2/7] drm/i915/guc: Update GuC ADS size for error capture lists
  2021-11-22 23:03 ` [Intel-gfx] " Alan Previn
  (?)
  (?)
@ 2021-11-22 23:03 ` Alan Previn
  2021-11-23 21:46   ` Michal Wajdeczko
  2021-11-24 10:06   ` Jani Nikula
  -1 siblings, 2 replies; 52+ messages in thread
From: Alan Previn @ 2021-11-22 23:03 UTC (permalink / raw)
  To: intel-gfx; +Cc: Alan Previn

Update GuC ADS size allocation to include space for
the lists of error state capture register descriptors.

Also, populate the lists of registers we want GuC to report back to
Host on engine reset events. This list should include global,
engine-class and engine-instance registers for every engine-class
type on the current hardware.

NOTE: Start with a fake table of register lists to layout the
framework before adding real registers in subsequent patch.

Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com>
---
 drivers/gpu/drm/i915/Makefile                 |   1 +
 drivers/gpu/drm/i915/gt/uc/intel_guc.c        |  10 +-
 drivers/gpu/drm/i915/gt/uc/intel_guc.h        |   5 +
 drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c    | 176 ++++++++++++-
 .../gpu/drm/i915/gt/uc/intel_guc_capture.c    | 232 ++++++++++++++++++
 .../gpu/drm/i915/gt/uc/intel_guc_capture.h    |  47 ++++
 drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |  19 +-
 7 files changed, 476 insertions(+), 14 deletions(-)
 create mode 100644 drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
 create mode 100644 drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h

diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
index 074d6b8edd23..e3c4d5cea4c3 100644
--- a/drivers/gpu/drm/i915/Makefile
+++ b/drivers/gpu/drm/i915/Makefile
@@ -190,6 +190,7 @@ i915-y += gt/uc/intel_uc.o \
 	  gt/uc/intel_guc_rc.o \
 	  gt/uc/intel_guc_slpc.o \
 	  gt/uc/intel_guc_submission.o \
+	  gt/uc/intel_guc_capture.o \
 	  gt/uc/intel_huc.o \
 	  gt/uc/intel_huc_debugfs.o \
 	  gt/uc/intel_huc_fw.o
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.c b/drivers/gpu/drm/i915/gt/uc/intel_guc.c
index 5cf9ebd2ee55..458f0d248a5a 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.c
@@ -335,9 +335,14 @@ int intel_guc_init(struct intel_guc *guc)
 	if (ret)
 		goto err_fw;
 
-	ret = intel_guc_ads_create(guc);
+	ret = intel_guc_capture_init(guc);
 	if (ret)
 		goto err_log;
+
+	ret = intel_guc_ads_create(guc);
+	if (ret)
+		goto err_capture;
+
 	GEM_BUG_ON(!guc->ads_vma);
 
 	ret = intel_guc_ct_init(&guc->ct);
@@ -376,6 +381,8 @@ int intel_guc_init(struct intel_guc *guc)
 	intel_guc_ct_fini(&guc->ct);
 err_ads:
 	intel_guc_ads_destroy(guc);
+err_capture:
+	intel_guc_capture_destroy(guc);
 err_log:
 	intel_guc_log_destroy(&guc->log);
 err_fw:
@@ -403,6 +410,7 @@ void intel_guc_fini(struct intel_guc *guc)
 	intel_guc_ct_fini(&guc->ct);
 
 	intel_guc_ads_destroy(guc);
+	intel_guc_capture_destroy(guc);
 	intel_guc_log_destroy(&guc->log);
 	intel_uc_fw_fini(&guc->fw);
 }
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
index 9de99772f916..d136c69abe12 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
@@ -16,6 +16,7 @@
 #include "intel_guc_log.h"
 #include "intel_guc_reg.h"
 #include "intel_guc_slpc_types.h"
+#include "intel_guc_capture.h"
 #include "intel_uc_fw.h"
 #include "i915_utils.h"
 #include "i915_vma.h"
@@ -37,6 +38,8 @@ struct intel_guc {
 	struct intel_guc_ct ct;
 	/** @slpc: sub-structure containing SLPC related data and objects */
 	struct intel_guc_slpc slpc;
+	/** @capture: the error-state-capture module's data and objects */
+	struct intel_guc_state_capture capture;
 
 	/** @sched_engine: Global engine used to submit requests to GuC */
 	struct i915_sched_engine *sched_engine;
@@ -138,6 +141,8 @@ struct intel_guc {
 	u32 ads_regset_size;
 	/** @ads_golden_ctxt_size: size of the golden contexts in the ADS */
 	u32 ads_golden_ctxt_size;
+	/** @ads_capture_size: size of register lists in the ADS used for error capture */
+	u32 ads_capture_size;
 	/** @ads_engine_usage_size: size of engine usage in the ADS */
 	u32 ads_engine_usage_size;
 
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
index 6c81ddd303d3..2780c0fadd01 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
@@ -10,6 +10,7 @@
 #include "gt/shmem_utils.h"
 #include "intel_guc_ads.h"
 #include "intel_guc_fwif.h"
+#include "intel_guc_capture.h"
 #include "intel_uc.h"
 #include "i915_drv.h"
 
@@ -71,8 +72,7 @@ static u32 guc_ads_golden_ctxt_size(struct intel_guc *guc)
 
 static u32 guc_ads_capture_size(struct intel_guc *guc)
 {
-	/* Basic support to init ADS without a proper GuC error capture list */
-	return PAGE_ALIGN(PAGE_SIZE);
+	return PAGE_ALIGN(guc->ads_capture_size);
 }
 
 static u32 guc_ads_private_data_size(struct intel_guc *guc)
@@ -519,24 +519,170 @@ static void guc_init_golden_context(struct intel_guc *guc)
 	GEM_BUG_ON(guc->ads_golden_ctxt_size != total_size);
 }
 
-static void guc_capture_prep_lists(struct intel_guc *guc, struct __guc_ads_blob *blob)
+static int
+guc_fill_reglist(struct intel_guc *guc, struct __guc_ads_blob *blob, int vf, bool enabled,
+		 int classid, int type, char *typename, u16 *p_numregs, int newnum, u8 **p_virt_ptr,
+		 u32 *p_blobptr_to_ggtt, u32 *p_ggtt, u32 null_ggtt)
 {
-	int i, j;
-	u32 addr_ggtt, offset;
+	struct drm_i915_private *i915 = guc_to_gt(guc)->i915;
+	struct guc_debug_capture_list *listnode;
+	int size = 0;
 
-	offset = guc_ads_capture_offset(guc);
-	addr_ggtt = intel_guc_ggtt_offset(guc, guc->ads_vma) + offset;
+	if (blob && *p_numregs != newnum) {
+		if (type == GUC_CAPTURE_LIST_TYPE_GLOBAL)
+			drm_warn(&i915->drm, "Guc-Cap VF%d-%s num-reg mismatch was=%d now=%d!\n",
+				 vf, typename, *p_numregs, newnum);
+		else
+			drm_warn(&i915->drm, "Guc-Cap VF%d-Class-%d-%s num-reg mismatch was=%d now=%d!\n",
+				 vf, classid, typename, *p_numregs, newnum);
+	}
+	/*
+	 * For enabled capture lists, we not only need to call capture module to help
+	 * populate the list-descriptor into the correct ads capture structures, but
+	 * we also need to increment the virtual pointers and ggtt offsets so that
+	 * caller has the subsequent gfx memory location.
+	 */
+	*p_numregs = newnum;
+	size = PAGE_ALIGN((sizeof(struct guc_debug_capture_list)) +
+			  (newnum * sizeof(struct guc_mmio_reg)));
+	/* if caller hasn't allocated ADS blob, return size and counts, we're done */
+	if (!blob)
+		return size;
+	if (blob) {
+		/* if caller allocated ADS blob, populate the capture register descriptors */
+		if (!newnum) {
+			*p_blobptr_to_ggtt = null_ggtt;
+		} else {
+			/* get ptr and populate header info: */
+			*p_blobptr_to_ggtt = *p_ggtt;
+			listnode = (struct guc_debug_capture_list *)*p_virt_ptr;
+			*p_ggtt += sizeof(struct guc_debug_capture_list);
+			*p_virt_ptr += sizeof(struct guc_debug_capture_list);
+			listnode->header.info = FIELD_PREP(GUC_CAPTURELISTHDR_NUMDESCR, *p_numregs);
+
+			/* get ptr and populate register descriptor list: */
+			intel_guc_capture_list_init(guc, vf, type, classid,
+						    (struct guc_mmio_reg *)*p_virt_ptr,
+						    *p_numregs);
+
+			/* increment ptrs for that header: */
+			*p_ggtt += size - sizeof(struct guc_debug_capture_list);
+			*p_virt_ptr += size - sizeof(struct guc_debug_capture_list);
+		}
+	}
+
+	return size;
+}
+
+static int guc_capture_prep_lists(struct intel_guc *guc, struct __guc_ads_blob *blob)
+{
+	struct intel_gt *gt = guc_to_gt(guc);
+	int i, j, size;
+	u32 ggtt, null_ggtt, offset, alloc_size = 0;
+	struct guc_gt_system_info *info, local_info;
+	struct guc_debug_capture_list *listnode;
+	struct drm_i915_private *i915 = guc_to_gt(guc)->i915;
+	struct intel_guc_state_capture *gc = &guc->capture;
+	u16 tmp = 0;
+	u8 *ptr = NULL;
+
+	if (blob) {
+		offset = guc_ads_capture_offset(guc);
+		ggtt = intel_guc_ggtt_offset(guc, guc->ads_vma) + offset;
+		ptr = ((u8 *)blob) + offset;
+		info = &blob->system_info;
+	} else {
+		memset(&local_info, 0, sizeof(local_info));
+		info = &local_info;
+		fill_engine_enable_masks(gt, info);
+	}
+
+	/* first, set aside the first page for a capture_list with zero descriptors */
+	alloc_size = PAGE_SIZE;
+	if (blob) {
+		listnode = (struct guc_debug_capture_list *)ptr;
+		listnode->header.info = FIELD_PREP(GUC_CAPTURELISTHDR_NUMDESCR, 0);
+		null_ggtt = ggtt;
+		ggtt += PAGE_SIZE;
+		ptr +=  PAGE_SIZE;
+	}
 
-	/* FIXME: Populate a proper capture list */
+#define COUNT_REGS intel_guc_capture_list_count
+#define FILL_REGS guc_fill_reglist
+#define TYPE_CLASS GUC_CAPTURE_LIST_TYPE_ENGINE_CLASS
+#define TYPE_INSTANCE GUC_CAPTURE_LIST_TYPE_ENGINE_INSTANCE
 
 	for (i = 0; i < GUC_CAPTURE_LIST_INDEX_MAX; i++) {
 		for (j = 0; j < GUC_MAX_ENGINE_CLASSES; j++) {
-			blob->ads.capture_instance[i][j] = addr_ggtt;
-			blob->ads.capture_class[i][j] = addr_ggtt;
+			if (!info->engine_enabled_masks[j]) {
+				if (gc->num_class_regs[i][j])
+					drm_warn(&i915->drm, "GuC-Cap VF%d-class-%d "
+						 "class regs valid mismatch was=%d now=%d!\n",
+						 i, j, gc->num_class_regs[i][j], tmp);
+				if (gc->num_instance_regs[i][j])
+					drm_warn(&i915->drm, "GuC-Cap VF%d-class-%d "
+						 "inst regs valid mismatch was=%d now=%d!\n",
+						 i, j, gc->num_instance_regs[i][j], tmp);
+				gc->num_class_regs[i][j] = 0;
+				gc->num_instance_regs[i][j] = 0;
+				if (blob) {
+					blob->ads.capture_class[i][j] = null_ggtt;
+					blob->ads.capture_instance[i][j] = null_ggtt;
+				}
+			} else {
+				if (!COUNT_REGS(guc, i, TYPE_CLASS,
+						guc_class_to_engine_class(j), &tmp)) {
+					size = FILL_REGS(guc, blob, i, true, j, TYPE_CLASS,
+							 "class", &gc->num_class_regs[i][j],
+							 tmp, &ptr,
+							 &blob->ads.capture_class[i][j],
+							 &ggtt, null_ggtt);
+					gc->class_list_size += size;
+					alloc_size += size;
+				} else {
+					gc->num_class_regs[i][j] = 0;
+					if (blob)
+						blob->ads.capture_class[i][j] = null_ggtt;
+				}
+				if (!COUNT_REGS(guc, i, TYPE_INSTANCE,
+						guc_class_to_engine_class(j), &tmp)) {
+					size = FILL_REGS(guc, blob, i, true, j, TYPE_INSTANCE,
+							 "instance", &gc->num_instance_regs[i][j],
+							 tmp, &ptr,
+							 &blob->ads.capture_instance[i][j],
+							 &ggtt, null_ggtt);
+					gc->instance_list_size += size;
+					alloc_size += size;
+				} else {
+					gc->num_instance_regs[i][j] = 0;
+					if (blob)
+						blob->ads.capture_instance[i][j] = null_ggtt;
+				}
+			}
+		}
+		if (!COUNT_REGS(guc, i, GUC_CAPTURE_LIST_TYPE_GLOBAL, 0, &tmp)) {
+			size = FILL_REGS(guc, blob, i, true, 0, GUC_CAPTURE_LIST_TYPE_GLOBAL,
+					 "global", &gc->num_global_regs[i], tmp, &ptr,
+					 &blob->ads.capture_global[i], &ggtt, null_ggtt);
+			gc->global_list_size += size;
+			alloc_size += size;
+		} else {
+			gc->num_global_regs[i] = 0;
+			if (blob)
+				blob->ads.capture_global[i] = null_ggtt;
 		}
-
-		blob->ads.capture_global[i] = addr_ggtt;
 	}
+
+#undef COUNT_REGS
+#undef FILL_REGS
+#undef TYPE_CLASS
+#undef TYPE_INSTANCE
+
+	if (guc->ads_capture_size && guc->ads_capture_size != PAGE_ALIGN(alloc_size))
+		drm_warn(&i915->drm, "GuC->ADS->Capture alloc size changed from %d to %d\n",
+			 guc->ads_capture_size, PAGE_ALIGN(alloc_size));
+
+	return PAGE_ALIGN(alloc_size);
 }
 
 static void __guc_ads_init(struct intel_guc *guc)
@@ -614,6 +760,12 @@ int intel_guc_ads_create(struct intel_guc *guc)
 		return ret;
 	guc->ads_golden_ctxt_size = ret;
 
+	/* Likewise the capture lists: */
+	ret = guc_capture_prep_lists(guc, NULL);
+	if (ret < 0)
+		return ret;
+	guc->ads_capture_size = ret;
+
 	/* Now the total size can be determined: */
 	size = guc_ads_blob_size(guc);
 
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
new file mode 100644
index 000000000000..c741c77b7fc8
--- /dev/null
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
@@ -0,0 +1,232 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2021-2021 Intel Corporation
+ */
+
+#include <drm/drm_print.h>
+
+#include "i915_drv.h"
+#include "i915_drv.h"
+#include "i915_memcpy.h"
+#include "gt/intel_gt.h"
+
+#include "intel_guc_fwif.h"
+#include "intel_guc_capture.h"
+
+/* Define all device tables of GuC error capture register lists */
+
+/********************************* Gen12 LP  *********************************/
+/************** GLOBAL *************/
+struct __guc_mmio_reg_descr gen12lp_global_regs[] = {
+	{SWF_ILK(0),               0,      0, "SWF_ILK0"},
+	/* Add additional register list */
+};
+
+/********** RENDER/COMPUTE *********/
+/* Per-Class */
+struct __guc_mmio_reg_descr gen12lp_rc_class_regs[] = {
+	{SWF_ILK(0),               0,      0, "SWF_ILK0"},
+	/* Add additional register list */
+};
+
+/* Per-Engine-Instance */
+struct __guc_mmio_reg_descr gen12lp_rc_inst_regs[] = {
+	{SWF_ILK(0),               0,      0, "SWF_ILK0"},
+	/* Add additional register list */
+};
+
+/************* MEDIA-VD ************/
+/* Per-Class */
+struct __guc_mmio_reg_descr gen12lp_vd_class_regs[] = {
+	{SWF_ILK(0),               0,      0, "SWF_ILK0"},
+	/* Add additional register list */
+};
+
+/* Per-Engine-Instance */
+struct __guc_mmio_reg_descr gen12lp_vd_inst_regs[] = {
+	{SWF_ILK(0),               0,      0, "SWF_ILK0"},
+	/* Add additional register list */
+};
+
+/************* MEDIA-VEC ***********/
+/* Per-Class */
+struct __guc_mmio_reg_descr gen12lp_vec_class_regs[] = {
+	{SWF_ILK(0),               0,      0, "SWF_ILK0"},
+	/* Add additional register list */
+};
+
+/* Per-Engine-Instance */
+struct __guc_mmio_reg_descr gen12lp_vec_inst_regs[] = {
+	{SWF_ILK(0),               0,      0, "SWF_ILK0"},
+	/* Add additional register list */
+};
+
+/********** List of lists **********/
+struct __guc_mmio_reg_descr_group gen12lp_lists[] = {
+	{
+		.list = gen12lp_global_regs,
+		.num_regs = (sizeof(gen12lp_global_regs) / sizeof(struct __guc_mmio_reg_descr)),
+		.owner = GUC_CAPTURE_LIST_INDEX_PF,
+		.type = GUC_CAPTURE_LIST_TYPE_GLOBAL,
+		.engine = 0
+	},
+	{
+		.list = gen12lp_rc_class_regs,
+		.num_regs = (sizeof(gen12lp_rc_class_regs) / sizeof(struct __guc_mmio_reg_descr)),
+		.owner = GUC_CAPTURE_LIST_INDEX_PF,
+		.type = GUC_CAPTURE_LIST_TYPE_ENGINE_CLASS,
+		.engine = RENDER_CLASS
+	},
+	{
+		.list = gen12lp_rc_inst_regs,
+		.num_regs = (sizeof(gen12lp_rc_inst_regs) / sizeof(struct __guc_mmio_reg_descr)),
+		.owner = GUC_CAPTURE_LIST_INDEX_PF,
+		.type = GUC_CAPTURE_LIST_TYPE_ENGINE_INSTANCE,
+		.engine = RENDER_CLASS
+	},
+	{
+		.list = gen12lp_vd_class_regs,
+		.num_regs = (sizeof(gen12lp_vd_class_regs) / sizeof(struct __guc_mmio_reg_descr)),
+		.owner = GUC_CAPTURE_LIST_INDEX_PF,
+		.type = GUC_CAPTURE_LIST_TYPE_ENGINE_CLASS,
+		.engine = VIDEO_DECODE_CLASS
+	},
+	{
+		.list = gen12lp_vd_inst_regs,
+		.num_regs = (sizeof(gen12lp_vd_inst_regs) / sizeof(struct __guc_mmio_reg_descr)),
+		.owner = GUC_CAPTURE_LIST_INDEX_PF,
+		.type = GUC_CAPTURE_LIST_TYPE_ENGINE_INSTANCE,
+		.engine = VIDEO_DECODE_CLASS
+	},
+	{
+		.list = gen12lp_vec_class_regs,
+		.num_regs = (sizeof(gen12lp_vec_class_regs) / sizeof(struct __guc_mmio_reg_descr)),
+		.owner = GUC_CAPTURE_LIST_INDEX_PF,
+		.type = GUC_CAPTURE_LIST_TYPE_ENGINE_CLASS,
+		.engine = VIDEO_ENHANCEMENT_CLASS
+	},
+	{
+		.list = gen12lp_vec_inst_regs,
+		.num_regs = (sizeof(gen12lp_vec_inst_regs) / sizeof(struct __guc_mmio_reg_descr)),
+		.owner = GUC_CAPTURE_LIST_INDEX_PF,
+		.type = GUC_CAPTURE_LIST_TYPE_ENGINE_INSTANCE,
+		.engine = VIDEO_ENHANCEMENT_CLASS
+	},
+	{NULL, 0, 0, 0, 0}
+};
+
+/************ FIXME: Populate tables for other devices in subsequent patch ************/
+
+static struct __guc_mmio_reg_descr_group *
+guc_capture_get_device_reglist(struct drm_i915_private *dev_priv)
+{
+	if (IS_TIGERLAKE(dev_priv) || IS_ROCKETLAKE(dev_priv) ||
+	    IS_ALDERLAKE_S(dev_priv) || IS_ALDERLAKE_P(dev_priv)) {
+		return gen12lp_lists;
+	}
+
+	return NULL;
+}
+
+static inline struct __guc_mmio_reg_descr_group *
+guc_capture_get_one_list(struct __guc_mmio_reg_descr_group *reglists, u32 owner, u32 type, u32 id)
+{
+	int i = 0;
+
+	if (!reglists)
+		return NULL;
+	while (reglists[i].list) {
+		if (reglists[i].owner == owner &&
+		    reglists[i].type == type) {
+			if (reglists[i].type == GUC_CAPTURE_LIST_TYPE_GLOBAL ||
+			    reglists[i].engine == id) {
+				return &reglists[i];
+			}
+		}
+		++i;
+	}
+	return NULL;
+}
+
+static inline void
+warn_with_capture_list_identifier(struct drm_i915_private *i915, char *msg,
+				  u32 owner, u32 type, u32 classid)
+{
+	const char *ownerstr[GUC_CAPTURE_LIST_INDEX_MAX] = {"PF", "VF"};
+	const char *typestr[GUC_CAPTURE_LIST_TYPE_MAX - 1] = {"Class", "Instance"};
+	const char *classstr[GUC_LAST_ENGINE_CLASS + 1] = {"Render", "Video", "VideoEnhance",
+							   "Blitter", "Reserved"};
+	static const char unknownstr[] = "unknown";
+
+	if (type == GUC_CAPTURE_LIST_TYPE_GLOBAL)
+		drm_warn(&i915->drm, "GuC-capture: %s for %s Global-Registers.\n", msg,
+			 (owner < GUC_CAPTURE_LIST_INDEX_MAX) ? ownerstr[owner] : unknownstr);
+	else
+		drm_warn(&i915->drm, "GuC-capture: %s for %s %s-Registers on %s-Engine\n", msg,
+			 (owner < GUC_CAPTURE_LIST_INDEX_MAX) ? ownerstr[owner] : unknownstr,
+			 (type < GUC_CAPTURE_LIST_TYPE_MAX) ? typestr[type - 1] :  unknownstr,
+			 (classid < GUC_LAST_ENGINE_CLASS + 1) ? classstr[classid] : unknownstr);
+}
+
+int intel_guc_capture_list_count(struct intel_guc *guc, u32 owner, u32 type, u32 classid,
+				 u16 *num_entries)
+{
+	struct drm_i915_private *dev_priv = (guc_to_gt(guc))->i915;
+	struct __guc_mmio_reg_descr_group *reglists = guc->capture.reglists;
+	struct __guc_mmio_reg_descr_group *match;
+
+	if (!reglists)
+		return -ENODEV;
+
+	match = guc_capture_get_one_list(reglists, owner, type, classid);
+	if (match) {
+		*num_entries = match->num_regs;
+		return 0;
+	}
+
+	warn_with_capture_list_identifier(dev_priv, "Missing register list size", owner, type,
+					  classid);
+
+	return -ENODATA;
+}
+
+int intel_guc_capture_list_init(struct intel_guc *guc, u32 owner, u32 type, u32 classid,
+				struct guc_mmio_reg *ptr, u16 num_entries)
+{
+	u32 j = 0;
+	struct drm_i915_private *dev_priv = (guc_to_gt(guc))->i915;
+	struct __guc_mmio_reg_descr_group *reglists = guc->capture.reglists;
+	struct __guc_mmio_reg_descr_group *match;
+
+	if (!reglists)
+		return -ENODEV;
+
+	match = guc_capture_get_one_list(reglists, owner, type, classid);
+	if (match) {
+		while (j < num_entries && j < match->num_regs) {
+			ptr[j].offset = match->list[j].reg.reg;
+			ptr[j].value = 0xDEADF00D;
+			ptr[j].flags = match->list[j].flags;
+			ptr[j].mask = match->list[j].mask;
+			++j;
+		}
+		return 0;
+	}
+
+	warn_with_capture_list_identifier(dev_priv, "Missing register list init", owner, type,
+					  classid);
+
+	return -ENODATA;
+}
+
+void intel_guc_capture_destroy(struct intel_guc *guc)
+{
+}
+
+int intel_guc_capture_init(struct intel_guc *guc)
+{
+	struct drm_i915_private *dev_priv = (guc_to_gt(guc))->i915;
+
+	guc->capture.reglists = guc_capture_get_device_reglist(dev_priv);
+	return 0;
+}
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h
new file mode 100644
index 000000000000..352940b8bc87
--- /dev/null
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h
@@ -0,0 +1,47 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2021-2021 Intel Corporation
+ */
+
+#ifndef _INTEL_GUC_CAPTURE_H
+#define _INTEL_GUC_CAPTURE_H
+
+#include <linux/mutex.h>
+#include <linux/workqueue.h>
+#include "intel_guc_fwif.h"
+
+struct intel_guc;
+
+struct __guc_mmio_reg_descr {
+	i915_reg_t reg;
+	u32 flags;
+	u32 mask;
+	char *regname;
+};
+
+struct __guc_mmio_reg_descr_group {
+	struct __guc_mmio_reg_descr *list;
+	u32 num_regs;
+	u32 owner; /* see enum guc_capture_owner */
+	u32 type; /* see enum guc_capture_type */
+	u32 engine; /* as per MAX_ENGINE_CLASS */
+};
+
+struct intel_guc_state_capture {
+	struct __guc_mmio_reg_descr_group *reglists;
+	u16 num_instance_regs[GUC_CAPTURE_LIST_INDEX_MAX][GUC_MAX_ENGINE_CLASSES];
+	u16 num_class_regs[GUC_CAPTURE_LIST_INDEX_MAX][GUC_MAX_ENGINE_CLASSES];
+	u16 num_global_regs[GUC_CAPTURE_LIST_INDEX_MAX];
+	int instance_list_size;
+	int class_list_size;
+	int global_list_size;
+};
+
+int intel_guc_capture_list_count(struct intel_guc *guc, u32 owner, u32 type, u32 class,
+				 u16 *num_entries);
+int intel_guc_capture_list_init(struct intel_guc *guc, u32 owner, u32 type, u32 class,
+				struct guc_mmio_reg *ptr, u16 num_entries);
+void intel_guc_capture_destroy(struct intel_guc *guc);
+int intel_guc_capture_init(struct intel_guc *guc);
+
+#endif /* _INTEL_GUC_CAPTURE_H */
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
index 767684b6af67..1a1d2271c7e9 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
@@ -285,13 +285,30 @@ struct guc_gt_system_info {
 } __packed;
 
 /* Capture-types of GuC capture register lists */
-enum
+enum guc_capture_owner
 {
 	GUC_CAPTURE_LIST_INDEX_PF = 0,
 	GUC_CAPTURE_LIST_INDEX_VF = 1,
 	GUC_CAPTURE_LIST_INDEX_MAX = 2,
 };
 
+/*Register-types of GuC capture register lists */
+enum guc_capture_type {
+	GUC_CAPTURE_LIST_TYPE_GLOBAL = 0,
+	GUC_CAPTURE_LIST_TYPE_ENGINE_CLASS,
+	GUC_CAPTURE_LIST_TYPE_ENGINE_INSTANCE,
+	GUC_CAPTURE_LIST_TYPE_MAX,
+};
+
+struct guc_debug_capture_list_header {
+	u32 info;
+		#define GUC_CAPTURELISTHDR_NUMDESCR GENMASK(15, 0)
+};
+
+struct guc_debug_capture_list {
+	struct guc_debug_capture_list_header header;
+};
+
 /* GuC Additional Data Struct */
 struct guc_ads {
 	struct guc_mmio_reg_set reg_state_list[GUC_MAX_ENGINE_CLASSES][GUC_MAX_INSTANCES_PER_CLASS];
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [Intel-gfx] [RFC 3/7] drm/i915/guc: Populate XE_LP register lists for GuC error state capture.
  2021-11-22 23:03 ` [Intel-gfx] " Alan Previn
                   ` (2 preceding siblings ...)
  (?)
@ 2021-11-22 23:03 ` Alan Previn
  2021-11-23  1:59   ` kernel test robot
  2021-11-23 21:55   ` Michal Wajdeczko
  -1 siblings, 2 replies; 52+ messages in thread
From: Alan Previn @ 2021-11-22 23:03 UTC (permalink / raw)
  To: intel-gfx; +Cc: Alan Previn

Add device specific tables and register lists to cover different engines
class types for GuC error state capture.

Also, add runtime allocation and freeing of extended register lists
for registers that need steering identifiers that depend on
the detected HW config.

Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com>
---
 .../gpu/drm/i915/gt/uc/intel_guc_capture.c    | 260 +++++++++++++-----
 .../gpu/drm/i915/gt/uc/intel_guc_capture.h    |   2 +
 drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |   2 +
 3 files changed, 197 insertions(+), 67 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
index c741c77b7fc8..eec1d193ac26 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
@@ -9,120 +9,245 @@
 #include "i915_drv.h"
 #include "i915_memcpy.h"
 #include "gt/intel_gt.h"
+#include "gt/intel_lrc_reg.h"
 
 #include "intel_guc_fwif.h"
 #include "intel_guc_capture.h"
 
-/* Define all device tables of GuC error capture register lists */
+/*
+ * Define all device tables of GuC error capture register lists
+ * NOTE: For engine-registers, GuC only needs the register offsets
+ *       from the engine-mmio-base
+ */
+#define COMMON_GEN12BASE_GLOBAL() \
+	{GEN12_FAULT_TLB_DATA0,    0,      0, "GEN12_FAULT_TLB_DATA0"}, \
+	{GEN12_FAULT_TLB_DATA1,    0,      0, "GEN12_FAULT_TLB_DATA1"}, \
+	{FORCEWAKE_MT,             0,      0, "FORCEWAKE_MT"}, \
+	{DERRMR,                   0,      0, "DERRMR"}, \
+	{GEN12_AUX_ERR_DBG,        0,      0, "GEN12_AUX_ERR_DBG"}, \
+	{GEN12_GAM_DONE,           0,      0, "GEN12_GAM_DONE"}, \
+	{GEN11_GUC_SG_INTR_ENABLE, 0,      0, "GEN11_GUC_SG_INTR_ENABLE"}, \
+	{GEN11_CRYPTO_RSVD_INTR_ENABLE, 0, 0, "GEN11_CRYPTO_RSVD_INTR_ENABLE"}, \
+	{GEN11_GUNIT_CSME_INTR_ENABLE, 0,  0, "GEN11_GUNIT_CSME_INTR_ENABLE"}, \
+	{GEN12_RING_FAULT_REG,     0,      0, "GEN12_RING_FAULT_REG"}
+
+#define COMMON_GEN12BASE_ENGINE_INSTANCE() \
+	{RING_PSMI_CTL(0),         0,      0, "RING_PSMI_CTL"}, \
+	{RING_ESR(0),              0,      0, "RING_ESR"}, \
+	{RING_ESR(0),              0,      0, "RING_ESR"}, \
+	{RING_DMA_FADD(0),         0,      0, "RING_DMA_FADD_LOW32"}, \
+	{RING_DMA_FADD_UDW(0),     0,      0, "RING_DMA_FADD_UP32"}, \
+	{RING_IPEIR(0),            0,      0, "RING_IPEIR"}, \
+	{RING_IPEHR(0),            0,      0, "RING_IPEHR"}, \
+	{RING_INSTPS(0),           0,      0, "RING_INSTPS"}, \
+	{RING_BBADDR(0),           0,      0, "RING_BBADDR_LOW32"}, \
+	{RING_BBADDR_UDW(0),       0,      0, "RING_BBADDR_UP32"}, \
+	{RING_BBSTATE(0),          0,      0, "RING_BBSTATE"}, \
+	{CCID(0),                  0,      0, "CCID"}, \
+	{RING_ACTHD(0),            0,      0, "RING_ACTHD_LOW32"}, \
+	{RING_ACTHD_UDW(0),        0,      0, "RING_ACTHD_UP32"}, \
+	{RING_INSTPM(0),           0,      0, "RING_INSTPM"}, \
+	{RING_NOPID(0),            0,      0, "RING_NOPID"}, \
+	{RING_START(0),            0,      0, "RING_START"}, \
+	{RING_HEAD(0),             0,      0, "RING_HEAD"}, \
+	{RING_TAIL(0),             0,      0, "RING_TAIL"}, \
+	{RING_CTL(0),              0,      0, "RING_CTL"}, \
+	{RING_MI_MODE(0),          0,      0, "RING_MI_MODE"}, \
+	{RING_CONTEXT_CONTROL(0),  0,      0, "RING_CONTEXT_CONTROL"}, \
+	{RING_INSTDONE(0),         0,      0, "RING_INSTDONE"}, \
+	{RING_HWS_PGA(0),          0,      0, "RING_HWS_PGA"}, \
+	{RING_MODE_GEN7(0),        0,      0, "RING_MODE_GEN7"}, \
+	{GEN8_RING_PDP_LDW(0, 0),  0,      0, "GEN8_RING_PDP0_LDW"}, \
+	{GEN8_RING_PDP_UDW(0, 0),  0,      0, "GEN8_RING_PDP0_UDW"}, \
+	{GEN8_RING_PDP_LDW(0, 1),  0,      0, "GEN8_RING_PDP1_LDW"}, \
+	{GEN8_RING_PDP_UDW(0, 1),  0,      0, "GEN8_RING_PDP1_UDW"}, \
+	{GEN8_RING_PDP_LDW(0, 2),  0,      0, "GEN8_RING_PDP2_LDW"}, \
+	{GEN8_RING_PDP_UDW(0, 2),  0,      0, "GEN8_RING_PDP2_UDW"}, \
+	{GEN8_RING_PDP_LDW(0, 3),  0,      0, "GEN8_RING_PDP3_LDW"}, \
+	{GEN8_RING_PDP_UDW(0, 3),  0,      0, "GEN8_RING_PDP3_UDW"}
+
+#define COMMON_GEN12BASE_HAS_EU() \
+	{EIR,                      0,      0, "EIR"}
+
+#define COMMON_GEN12BASE_RENDER() \
+	{GEN7_SC_INSTDONE,         0,      0, "GEN7_SC_INSTDONE"}, \
+	{GEN12_SC_INSTDONE_EXTRA,  0,      0, "GEN12_SC_INSTDONE_EXTRA"}, \
+	{GEN12_SC_INSTDONE_EXTRA2, 0,      0, "GEN12_SC_INSTDONE_EXTRA2"}
+
+#define COMMON_GEN12BASE_VEC() \
+	{GEN11_VCS_VECS_INTR_ENABLE, 0,    0, "GEN11_VCS_VECS_INTR_ENABLE"}, \
+	{GEN12_SFC_DONE(0),        0,      0, "GEN12_SFC_DONE0"}, \
+	{GEN12_SFC_DONE(1),        0,      0, "GEN12_SFC_DONE1"}, \
+	{GEN12_SFC_DONE(2),        0,      0, "GEN12_SFC_DONE2"}, \
+	{GEN12_SFC_DONE(3),        0,      0, "GEN12_SFC_DONE3"}
 
 /********************************* Gen12 LP  *********************************/
 /************** GLOBAL *************/
 struct __guc_mmio_reg_descr gen12lp_global_regs[] = {
-	{SWF_ILK(0),               0,      0, "SWF_ILK0"},
-	/* Add additional register list */
+	COMMON_GEN12BASE_GLOBAL(),
+	{GEN7_ROW_INSTDONE,        0,      0, "GEN7_ROW_INSTDONE"},
 };
 
 /********** RENDER/COMPUTE *********/
 /* Per-Class */
 struct __guc_mmio_reg_descr gen12lp_rc_class_regs[] = {
-	{SWF_ILK(0),               0,      0, "SWF_ILK0"},
-	/* Add additional register list */
+	COMMON_GEN12BASE_HAS_EU(),
+	COMMON_GEN12BASE_RENDER(),
+	{GEN11_RENDER_COPY_INTR_ENABLE, 0, 0, "GEN11_RENDER_COPY_INTR_ENABLE"},
 };
 
 /* Per-Engine-Instance */
 struct __guc_mmio_reg_descr gen12lp_rc_inst_regs[] = {
-	{SWF_ILK(0),               0,      0, "SWF_ILK0"},
-	/* Add additional register list */
+	COMMON_GEN12BASE_ENGINE_INSTANCE(),
 };
 
 /************* MEDIA-VD ************/
 /* Per-Class */
 struct __guc_mmio_reg_descr gen12lp_vd_class_regs[] = {
-	{SWF_ILK(0),               0,      0, "SWF_ILK0"},
-	/* Add additional register list */
 };
 
 /* Per-Engine-Instance */
 struct __guc_mmio_reg_descr gen12lp_vd_inst_regs[] = {
-	{SWF_ILK(0),               0,      0, "SWF_ILK0"},
-	/* Add additional register list */
+	COMMON_GEN12BASE_ENGINE_INSTANCE(),
 };
 
 /************* MEDIA-VEC ***********/
 /* Per-Class */
 struct __guc_mmio_reg_descr gen12lp_vec_class_regs[] = {
-	{SWF_ILK(0),               0,      0, "SWF_ILK0"},
-	/* Add additional register list */
+	COMMON_GEN12BASE_VEC(),
 };
 
 /* Per-Engine-Instance */
 struct __guc_mmio_reg_descr gen12lp_vec_inst_regs[] = {
-	{SWF_ILK(0),               0,      0, "SWF_ILK0"},
-	/* Add additional register list */
+	COMMON_GEN12BASE_ENGINE_INSTANCE(),
+};
+
+/************* BLITTER ***********/
+/* Per-Class */
+struct __guc_mmio_reg_descr gen12lp_blt_class_regs[] = {
+};
+
+/* Per-Engine-Instance */
+struct __guc_mmio_reg_descr gen12lp_blt_inst_regs[] = {
+	COMMON_GEN12BASE_ENGINE_INSTANCE(),
 };
 
+#define TO_GCAP_DEF(x) (GUC_CAPTURE_LIST_##x)
+#define MAKE_GCAP_REGLIST_DESCR(regslist, regsowner, regstype, class) \
+	{ \
+		.list = (regslist), \
+		.num_regs = (sizeof(regslist) / sizeof(struct __guc_mmio_reg_descr)), \
+		.owner = TO_GCAP_DEF(regsowner), \
+		.type = TO_GCAP_DEF(regstype), \
+		.engine = class, \
+		.num_ext = 0, \
+		.ext = NULL, \
+	}
+
+
 /********** List of lists **********/
-struct __guc_mmio_reg_descr_group gen12lp_lists[] = {
-	{
-		.list = gen12lp_global_regs,
-		.num_regs = (sizeof(gen12lp_global_regs) / sizeof(struct __guc_mmio_reg_descr)),
-		.owner = GUC_CAPTURE_LIST_INDEX_PF,
-		.type = GUC_CAPTURE_LIST_TYPE_GLOBAL,
-		.engine = 0
-	},
-	{
-		.list = gen12lp_rc_class_regs,
-		.num_regs = (sizeof(gen12lp_rc_class_regs) / sizeof(struct __guc_mmio_reg_descr)),
-		.owner = GUC_CAPTURE_LIST_INDEX_PF,
-		.type = GUC_CAPTURE_LIST_TYPE_ENGINE_CLASS,
-		.engine = RENDER_CLASS
-	},
-	{
-		.list = gen12lp_rc_inst_regs,
-		.num_regs = (sizeof(gen12lp_rc_inst_regs) / sizeof(struct __guc_mmio_reg_descr)),
-		.owner = GUC_CAPTURE_LIST_INDEX_PF,
-		.type = GUC_CAPTURE_LIST_TYPE_ENGINE_INSTANCE,
-		.engine = RENDER_CLASS
-	},
-	{
-		.list = gen12lp_vd_class_regs,
-		.num_regs = (sizeof(gen12lp_vd_class_regs) / sizeof(struct __guc_mmio_reg_descr)),
-		.owner = GUC_CAPTURE_LIST_INDEX_PF,
-		.type = GUC_CAPTURE_LIST_TYPE_ENGINE_CLASS,
-		.engine = VIDEO_DECODE_CLASS
-	},
-	{
-		.list = gen12lp_vd_inst_regs,
-		.num_regs = (sizeof(gen12lp_vd_inst_regs) / sizeof(struct __guc_mmio_reg_descr)),
-		.owner = GUC_CAPTURE_LIST_INDEX_PF,
-		.type = GUC_CAPTURE_LIST_TYPE_ENGINE_INSTANCE,
-		.engine = VIDEO_DECODE_CLASS
-	},
-	{
-		.list = gen12lp_vec_class_regs,
-		.num_regs = (sizeof(gen12lp_vec_class_regs) / sizeof(struct __guc_mmio_reg_descr)),
-		.owner = GUC_CAPTURE_LIST_INDEX_PF,
-		.type = GUC_CAPTURE_LIST_TYPE_ENGINE_CLASS,
-		.engine = VIDEO_ENHANCEMENT_CLASS
-	},
-	{
-		.list = gen12lp_vec_inst_regs,
-		.num_regs = (sizeof(gen12lp_vec_inst_regs) / sizeof(struct __guc_mmio_reg_descr)),
-		.owner = GUC_CAPTURE_LIST_INDEX_PF,
-		.type = GUC_CAPTURE_LIST_TYPE_ENGINE_INSTANCE,
-		.engine = VIDEO_ENHANCEMENT_CLASS
-	},
+struct __guc_mmio_reg_descr_group xe_lpd_lists[] = {
+	MAKE_GCAP_REGLIST_DESCR(gen12lp_global_regs, INDEX_PF, TYPE_GLOBAL, 0),
+	MAKE_GCAP_REGLIST_DESCR(gen12lp_rc_class_regs, INDEX_PF, TYPE_ENGINE_CLASS, GUC_RENDER_CLASS),
+	MAKE_GCAP_REGLIST_DESCR(gen12lp_rc_inst_regs, INDEX_PF, TYPE_ENGINE_INSTANCE, GUC_RENDER_CLASS),
+	MAKE_GCAP_REGLIST_DESCR(gen12lp_vd_class_regs, INDEX_PF, TYPE_ENGINE_CLASS, GUC_VIDEO_CLASS),
+	MAKE_GCAP_REGLIST_DESCR(gen12lp_vd_inst_regs, INDEX_PF, TYPE_ENGINE_INSTANCE, GUC_VIDEO_CLASS),
+	MAKE_GCAP_REGLIST_DESCR(gen12lp_vec_class_regs, INDEX_PF, TYPE_ENGINE_CLASS, GUC_VIDEOENHANCE_CLASS),
+	MAKE_GCAP_REGLIST_DESCR(gen12lp_vec_inst_regs, INDEX_PF, TYPE_ENGINE_INSTANCE, GUC_VIDEOENHANCE_CLASS),
+	MAKE_GCAP_REGLIST_DESCR(gen12lp_blt_class_regs, INDEX_PF, TYPE_ENGINE_CLASS, GUC_BLITTER_CLASS),
+	MAKE_GCAP_REGLIST_DESCR(gen12lp_blt_inst_regs, INDEX_PF, TYPE_ENGINE_INSTANCE, GUC_BLITTER_CLASS),
 	{NULL, 0, 0, 0, 0}
 };
 
-/************ FIXME: Populate tables for other devices in subsequent patch ************/
+/************* Populate additional registers / device tables *************/
+
+static inline struct __guc_mmio_reg_descr **
+guc_capture_get_ext_list_ptr(struct __guc_mmio_reg_descr_group * lists, u32 owner, u32 type, u32 class)
+{
+	while(lists->list){
+		if (lists->owner == owner && lists->type == type && lists->engine == class)
+			break;
+		++lists;
+	}
+	if (!lists->list)
+		return NULL;
+
+	return &(lists->ext);
+}
+
+void guc_capture_clear_ext_regs(struct __guc_mmio_reg_descr_group * lists)
+{
+	while(lists->list){
+		if (lists->ext) {
+			kfree(lists->ext);
+			lists->ext = NULL;
+		}
+		++lists;
+	}
+	return;
+}
+
+static void
+xelpd_alloc_steered_ext_list(struct drm_i915_private *i915,
+			     struct __guc_mmio_reg_descr_group * lists)
+{
+	struct intel_gt *gt = &i915->gt;
+	struct sseu_dev_info *sseu;
+	int slice, subslice, i, num_tot_regs = 0;
+	struct __guc_mmio_reg_descr **ext;
+	static char * const strings[] = {
+		[0] = "GEN7_SAMPLER_INSTDONE",
+		[1] = "GEN7_ROW_INSTDONE",
+	};
+
+	/* In XE_LP we only care about render-class steering registers during error-capture */
+	ext = guc_capture_get_ext_list_ptr(lists, GUC_CAPTURE_LIST_INDEX_PF,
+					   GUC_CAPTURE_LIST_TYPE_ENGINE_CLASS, GUC_RENDER_CLASS);
+	if (!ext)
+		return;
+	if (*ext)
+		return; /* already populated */
+
+	sseu = &gt->info.sseu;
+	for_each_instdone_slice_subslice(i915, sseu, slice, subslice) {
+		num_tot_regs += 2; /* two registers of interest for now */
+	}
+	if (!num_tot_regs)
+		return;
+
+	*ext = kzalloc(2 * num_tot_regs * sizeof(struct __guc_mmio_reg_descr), GFP_KERNEL);
+	if (!*ext) {
+		drm_warn(&i915->drm, "GuC-capture: Fail to allocate for extended registers\n");
+		return;
+	}
+
+	for_each_instdone_slice_subslice(i915, sseu, slice, subslice) {
+		for (i = 0; i < 2; i++) {
+			if (i == 0)
+				(*ext)->reg = GEN7_SAMPLER_INSTDONE;
+			else
+				(*ext)->reg = GEN7_ROW_INSTDONE;
+			(*ext)->flags = FIELD_PREP(GUC_REGSET_STEERING_GROUP, slice);
+			(*ext)->flags |= FIELD_PREP(GUC_REGSET_STEERING_INSTANCE, subslice);
+			(*ext)->regname = strings[i];
+			(*ext)++;
+		}
+	}
+}
 
 static struct __guc_mmio_reg_descr_group *
 guc_capture_get_device_reglist(struct drm_i915_private *dev_priv)
 {
 	if (IS_TIGERLAKE(dev_priv) || IS_ROCKETLAKE(dev_priv) ||
 	    IS_ALDERLAKE_S(dev_priv) || IS_ALDERLAKE_P(dev_priv)) {
-		return gen12lp_lists;
+		/*
+		* For certain engine classes, there are slice and subslice
+		* level registers requiring steering. We allocate and populate
+		* these at init time based on hw config add it as an extension
+		* list at the end of the pre-populated render list.
+		*/
+		xelpd_alloc_steered_ext_list(dev_priv, xe_lpd_lists);
+		return xe_lpd_lists;
 	}
 
 	return NULL;
@@ -221,6 +346,7 @@ int intel_guc_capture_list_init(struct intel_guc *guc, u32 owner, u32 type, u32
 
 void intel_guc_capture_destroy(struct intel_guc *guc)
 {
+	guc_capture_clear_ext_regs(guc->capture.reglists);
 }
 
 int intel_guc_capture_init(struct intel_guc *guc)
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h
index 352940b8bc87..df420f0f49b3 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h
@@ -25,6 +25,8 @@ struct __guc_mmio_reg_descr_group {
 	u32 owner; /* see enum guc_capture_owner */
 	u32 type; /* see enum guc_capture_type */
 	u32 engine; /* as per MAX_ENGINE_CLASS */
+	int num_ext;
+	struct __guc_mmio_reg_descr * ext;
 };
 
 struct intel_guc_state_capture {
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
index 1a1d2271c7e9..c26cfefd916c 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
@@ -267,6 +267,8 @@ struct guc_mmio_reg {
 	u32 value;
 	u32 flags;
 #define GUC_REGSET_MASKED		(1 << 0)
+#define GUC_REGSET_STEERING_GROUP       GENMASK(15, 12)
+#define GUC_REGSET_STEERING_INSTANCE    GENMASK(23, 20)
 	u32 mask;
 } __packed;
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [Intel-gfx] [RFC 4/7] drm/i915/guc: Add GuC's error state capture output structures.
  2021-11-22 23:03 ` [Intel-gfx] " Alan Previn
                   ` (3 preceding siblings ...)
  (?)
@ 2021-11-22 23:03 ` Alan Previn
  2021-11-24 10:08   ` Jani Nikula
  2021-12-07 21:01   ` Matthew Brost
  -1 siblings, 2 replies; 52+ messages in thread
From: Alan Previn @ 2021-11-22 23:03 UTC (permalink / raw)
  To: intel-gfx; +Cc: Alan Previn

Add GuC's error capture output structures and definitions as how
they would appear in GuC log buffer's error capture subregion after
an error state capture G2H event notification.

Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com>
---
 .../gpu/drm/i915/gt/uc/intel_guc_capture.h    | 35 +++++++++++++++++++
 1 file changed, 35 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h
index df420f0f49b3..b2454b6cd778 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h
@@ -29,6 +29,41 @@ struct __guc_mmio_reg_descr_group {
 	struct __guc_mmio_reg_descr * ext;
 };
 
+struct intel_guc_capture_out_data_header {
+	u32 reserved1;
+	u32 info;
+		#define GUC_CAPTURE_DATAHDR_SRC_TYPE GENMASK(3, 0) /* as per enum guc_capture_type */
+		#define GUC_CAPTURE_DATAHDR_SRC_CLASS GENMASK(7, 4) /* as per GUC_MAX_ENGINE_CLASSES */
+		#define GUC_CAPTURE_DATAHDR_SRC_INSTANCE GENMASK(11, 8)
+	u32 lrca; /* if type-instance, LRCA (address) that hung, else set to ~0 */
+	u32 guc_ctx_id; /* if type-instance, context index of hung context, else set to ~0 */
+	u32 num_mmios;
+		#define GUC_CAPTURE_DATAHDR_NUM_MMIOS GENMASK(9, 0)
+};
+
+struct intel_guc_capture_out_data {
+	struct intel_guc_capture_out_data_header capture_header;
+	struct guc_mmio_reg capture_list[0];
+};
+
+enum guc_capture_group_types {
+	GUC_STATE_CAPTURE_GROUP_TYPE_FULL,
+	GUC_STATE_CAPTURE_GROUP_TYPE_PARTIAL,
+	GUC_STATE_CAPTURE_GROUP_TYPE_MAX,
+};
+
+struct intel_guc_capture_out_group_header {
+	u32 reserved1;
+	u32 info;
+		#define GUC_CAPTURE_GRPHDR_SRC_NUMCAPTURES GENMASK(7, 0)
+		#define GUC_CAPTURE_GRPHDR_SRC_CAPTURE_TYPE GENMASK(15, 8)
+};
+
+struct intel_guc_capture_out_group {
+	struct intel_guc_capture_out_group_header group_header;
+	struct intel_guc_capture_out_data group_lists[0];
+};
+
 struct intel_guc_state_capture {
 	struct __guc_mmio_reg_descr_group *reglists;
 	u16 num_instance_regs[GUC_CAPTURE_LIST_INDEX_MAX][GUC_MAX_ENGINE_CLASSES];
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [Intel-gfx] [RFC 5/7] drm/i915/guc: Update GuC's log-buffer-state access for error capture.
  2021-11-22 23:03 ` [Intel-gfx] " Alan Previn
                   ` (4 preceding siblings ...)
  (?)
@ 2021-11-22 23:04 ` Alan Previn
  2021-12-07 22:31   ` Matthew Brost
  -1 siblings, 1 reply; 52+ messages in thread
From: Alan Previn @ 2021-11-22 23:04 UTC (permalink / raw)
  To: intel-gfx; +Cc: Alan Previn

GuC log buffer regions for debug-log-events, crash-dumps and
error-state-capture are all a single bo allocation that includes
the guc_log_buffer_state structures.

Since the error-capture region is accessed with high priority at non-
deterministic times (as part of gpu coredump) while the debug-log-event
region is populated and accessed with different priorities, timings and
consumers, let's split out separate locks for buffer-state accesses
of each region.

Also, ensure a global mapping is made up front for the entire bo
throughout GuC operation so that dynamic mapping and unmapping isn't
required for error capture log access if relay-logging isn't running.

Additionally, while here, make some readibility improvements:
1. change previous function names with "capture_logs" to
   "copy_debug_logs" to help make the distinction clearer.
2. Update the guc log region mapping comments to order them
   according to the enum definition as per the GuC interface.

Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc.h        |   2 +
 .../gpu/drm/i915/gt/uc/intel_guc_capture.c    |  46 +++++++
 .../gpu/drm/i915/gt/uc/intel_guc_capture.h    |   1 +
 drivers/gpu/drm/i915/gt/uc/intel_guc_log.c    | 120 ++++++++++++------
 drivers/gpu/drm/i915/gt/uc/intel_guc_log.h    |  14 +-
 5 files changed, 137 insertions(+), 46 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
index d136c69abe12..e0db21bbffdd 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
@@ -34,6 +34,8 @@ struct intel_guc {
 	struct intel_uc_fw fw;
 	/** @log: sub-structure containing GuC log related data and objects */
 	struct intel_guc_log log;
+	/** @log_state: states and locks for each subregion of GuC's log buffer */
+	struct intel_guc_log_stats log_state[GUC_MAX_LOG_BUFFER];
 	/** @ct: the command transport communication channel */
 	struct intel_guc_ct ct;
 	/** @slpc: sub-structure containing SLPC related data and objects */
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
index eec1d193ac26..0cb358a98605 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
@@ -344,6 +344,52 @@ int intel_guc_capture_list_init(struct intel_guc *guc, u32 owner, u32 type, u32
 	return -ENODATA;
 }
 
+int intel_guc_capture_output_min_size_est(struct intel_guc *guc)
+{
+	struct intel_gt *gt = guc_to_gt(guc);
+	struct intel_engine_cs *engine;
+	enum intel_engine_id id;
+	int worst_min_size = 0, num_regs = 0;
+	u16 tmp = 0;
+
+	/*
+	 * If every single engine-instance suffered a failure in quick succession but
+	 * were all unrelated, then a burst of multiple error-capture events would dump
+	 * registers for every one engine instance, one at a time. In this case, GuC
+	 * would even dump the global-registers repeatedly.
+	 *
+	 * For each engine instance, there would be 1 x intel_guc_capture_out_group output
+	 * followed by 3 x intel_guc_capture_out_data lists. The latter is how the register
+	 * dumps are split across different register types (where the '3' are global vs class
+	 * vs instance). Finally, let's multiply the whole thing by 3x (just so we are
+	 * not limited to just 1 rounds of data in a  worst case full register dump log)
+	 *
+	 * NOTE: intel_guc_log that allocates the log buffer would round this size up to
+	 * a power of two.
+	 */
+
+	for_each_engine(engine, gt, id) {
+		worst_min_size += sizeof(struct intel_guc_capture_out_group_header) +
+				  (3 * sizeof(struct intel_guc_capture_out_data_header));
+
+		if (!intel_guc_capture_list_count(guc, 0, GUC_CAPTURE_LIST_TYPE_GLOBAL, 0, &tmp))
+			num_regs += tmp;
+
+		if (!intel_guc_capture_list_count(guc, 0, GUC_CAPTURE_LIST_TYPE_ENGINE_CLASS,
+						  engine->class, &tmp)) {
+			num_regs += tmp;
+		}
+		if (!intel_guc_capture_list_count(guc, 0, GUC_CAPTURE_LIST_TYPE_ENGINE_INSTANCE,
+						  engine->class, &tmp)) {
+			num_regs += tmp;
+		}
+	}
+
+	worst_min_size += (num_regs * sizeof(struct guc_mmio_reg));
+
+	return (worst_min_size * 3);
+}
+
 void intel_guc_capture_destroy(struct intel_guc *guc)
 {
 	guc_capture_clear_ext_regs(guc->capture.reglists);
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h
index b2454b6cd778..839b53425e1e 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h
@@ -78,6 +78,7 @@ int intel_guc_capture_list_count(struct intel_guc *guc, u32 owner, u32 type, u32
 				 u16 *num_entries);
 int intel_guc_capture_list_init(struct intel_guc *guc, u32 owner, u32 type, u32 class,
 				struct guc_mmio_reg *ptr, u16 num_entries);
+int intel_guc_capture_output_min_size_est(struct intel_guc *guc);
 void intel_guc_capture_destroy(struct intel_guc *guc);
 int intel_guc_capture_init(struct intel_guc *guc);
 
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_log.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_log.c
index 1962a43302a8..dd86530f77a1 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_log.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_log.c
@@ -10,7 +10,7 @@
 #include "i915_memcpy.h"
 #include "intel_guc_log.h"
 
-static void guc_log_capture_logs(struct intel_guc_log *log);
+static void guc_log_copy_debuglogs_for_relay(struct intel_guc_log *log);
 
 /**
  * DOC: GuC firmware log
@@ -149,7 +149,7 @@ static void guc_move_to_next_buf(struct intel_guc_log *log)
 	smp_wmb();
 
 	/* All data has been written, so now move the offset of sub buffer. */
-	relay_reserve(log->relay.channel, log->vma->obj->base.size);
+	relay_reserve(log->relay.channel, log->vma->obj->base.size - CAPTURE_BUFFER_SIZE);
 
 	/* Switch to the next sub buffer */
 	relay_flush(log->relay.channel);
@@ -169,25 +169,25 @@ static void *guc_get_write_buffer(struct intel_guc_log *log)
 	return relay_reserve(log->relay.channel, 0);
 }
 
-static bool guc_check_log_buf_overflow(struct intel_guc_log *log,
-				       enum guc_log_buffer_type type,
-				       unsigned int full_cnt)
+bool guc_check_log_buf_overflow(struct intel_guc *guc,
+				struct intel_guc_log_stats *log_state,
+				unsigned int full_cnt)
 {
-	unsigned int prev_full_cnt = log->stats[type].sampled_overflow;
+	unsigned int prev_full_cnt = log_state->sampled_overflow;
 	bool overflow = false;
 
 	if (full_cnt != prev_full_cnt) {
 		overflow = true;
 
-		log->stats[type].overflow = full_cnt;
-		log->stats[type].sampled_overflow += full_cnt - prev_full_cnt;
+		log_state->overflow = full_cnt;
+		log_state->sampled_overflow += full_cnt - prev_full_cnt;
 
 		if (full_cnt < prev_full_cnt) {
 			/* buffer_full_cnt is a 4 bit counter */
-			log->stats[type].sampled_overflow += 16;
+			log_state->sampled_overflow += 16;
 		}
 
-		dev_notice_ratelimited(guc_to_gt(log_to_guc(log))->i915->drm.dev,
+		dev_notice_ratelimited(guc_to_gt(guc)->i915->drm.dev,
 				       "GuC log buffer overflow\n");
 	}
 
@@ -210,8 +210,10 @@ static unsigned int guc_get_log_buffer_size(enum guc_log_buffer_type type)
 	return 0;
 }
 
-static void guc_read_update_log_buffer(struct intel_guc_log *log)
+static void _guc_log_copy_debuglogs_for_relay(struct intel_guc_log *log)
 {
+	struct intel_guc *guc = log_to_guc(log);
+	struct intel_guc_log_stats *logstate;
 	unsigned int buffer_size, read_offset, write_offset, bytes_to_copy, full_cnt;
 	struct guc_log_buffer_state *log_buf_state, *log_buf_snapshot_state;
 	struct guc_log_buffer_state log_buf_state_local;
@@ -235,7 +237,7 @@ static void guc_read_update_log_buffer(struct intel_guc_log *log)
 		 * Used rate limited to avoid deluge of messages, logs might be
 		 * getting consumed by User at a slow rate.
 		 */
-		DRM_ERROR_RATELIMITED("no sub-buffer to capture logs\n");
+		DRM_ERROR_RATELIMITED("no sub-buffer to copy general logs\n");
 		log->relay.full_count++;
 
 		goto out_unlock;
@@ -245,12 +247,16 @@ static void guc_read_update_log_buffer(struct intel_guc_log *log)
 	src_data += PAGE_SIZE;
 	dst_data += PAGE_SIZE;
 
-	for (type = GUC_DEBUG_LOG_BUFFER; type < GUC_MAX_LOG_BUFFER; type++) {
+	/* For relay logging, we exclude error state capture */
+	for (type = GUC_DEBUG_LOG_BUFFER; type <= GUC_CRASH_DUMP_LOG_BUFFER; type++) {
 		/*
+		 * Get a lock to the buffer_state we want to read and update.
 		 * Make a copy of the state structure, inside GuC log buffer
 		 * (which is uncached mapped), on the stack to avoid reading
 		 * from it multiple times.
 		 */
+		logstate = &guc->log_state[type];
+		mutex_lock(&logstate->lock);
 		memcpy(&log_buf_state_local, log_buf_state,
 		       sizeof(struct guc_log_buffer_state));
 		buffer_size = guc_get_log_buffer_size(type);
@@ -259,13 +265,14 @@ static void guc_read_update_log_buffer(struct intel_guc_log *log)
 		full_cnt = log_buf_state_local.buffer_full_cnt;
 
 		/* Bookkeeping stuff */
-		log->stats[type].flush += log_buf_state_local.flush_to_file;
-		new_overflow = guc_check_log_buf_overflow(log, type, full_cnt);
+		logstate->flush += log_buf_state_local.flush_to_file;
+		new_overflow = guc_check_log_buf_overflow(guc, logstate, full_cnt);
 
 		/* Update the state of shared log buffer */
 		log_buf_state->read_ptr = write_offset;
 		log_buf_state->flush_to_file = 0;
 		log_buf_state++;
+		mutex_unlock(&logstate->lock);
 
 		/* First copy the state structure in snapshot buffer */
 		memcpy(log_buf_snapshot_state, &log_buf_state_local,
@@ -313,15 +320,15 @@ static void guc_read_update_log_buffer(struct intel_guc_log *log)
 	mutex_unlock(&log->relay.lock);
 }
 
-static void capture_logs_work(struct work_struct *work)
+static void copy_debug_logs_work(struct work_struct *work)
 {
 	struct intel_guc_log *log =
 		container_of(work, struct intel_guc_log, relay.flush_work);
 
-	guc_log_capture_logs(log);
+	guc_log_copy_debuglogs_for_relay(log);
 }
 
-static int guc_log_map(struct intel_guc_log *log)
+static int guc_log_relay_map(struct intel_guc_log *log)
 {
 	void *vaddr;
 
@@ -333,7 +340,9 @@ static int guc_log_map(struct intel_guc_log *log)
 	/*
 	 * Create a WC (Uncached for read) vmalloc mapping of log
 	 * buffer pages, so that we can directly get the data
-	 * (up-to-date) from memory.
+	 * (up-to-date) from memory. This has already been
+	 * mapped at GuC Init time (for error-state-capture), but
+	 * call it again anyway for book-keeping
 	 */
 	vaddr = i915_gem_object_pin_map_unlocked(log->vma->obj, I915_MAP_WC);
 	if (IS_ERR(vaddr))
@@ -344,7 +353,7 @@ static int guc_log_map(struct intel_guc_log *log)
 	return 0;
 }
 
-static void guc_log_unmap(struct intel_guc_log *log)
+static void guc_log_relay_unmap(struct intel_guc_log *log)
 {
 	lockdep_assert_held(&log->relay.lock);
 
@@ -354,8 +363,14 @@ static void guc_log_unmap(struct intel_guc_log *log)
 
 void intel_guc_log_init_early(struct intel_guc_log *log)
 {
+	struct intel_guc *guc = log_to_guc(log);
+	int n;
+
+	for (n = GUC_DEBUG_LOG_BUFFER; n < GUC_MAX_LOG_BUFFER; n++)
+		mutex_init(&guc->log_state[n].lock);
+
 	mutex_init(&log->relay.lock);
-	INIT_WORK(&log->relay.flush_work, capture_logs_work);
+	INIT_WORK(&log->relay.flush_work, copy_debug_logs_work);
 	log->relay.started = false;
 }
 
@@ -370,8 +385,11 @@ static int guc_log_relay_create(struct intel_guc_log *log)
 	lockdep_assert_held(&log->relay.lock);
 	GEM_BUG_ON(!log->vma);
 
-	 /* Keep the size of sub buffers same as shared log buffer */
-	subbuf_size = log->vma->size;
+	 /*
+	  * Keep the size of sub buffers same as shared log buffer
+	  * but GuC log-events excludes the error-state-capture logs
+	  */
+	subbuf_size = log->vma->size - CAPTURE_BUFFER_SIZE;
 
 	/*
 	 * Store up to 8 snapshots, which is large enough to buffer sufficient
@@ -406,13 +424,13 @@ static void guc_log_relay_destroy(struct intel_guc_log *log)
 	log->relay.channel = NULL;
 }
 
-static void guc_log_capture_logs(struct intel_guc_log *log)
+static void guc_log_copy_debuglogs_for_relay(struct intel_guc_log *log)
 {
 	struct intel_guc *guc = log_to_guc(log);
 	struct drm_i915_private *dev_priv = guc_to_gt(guc)->i915;
 	intel_wakeref_t wakeref;
 
-	guc_read_update_log_buffer(log);
+	_guc_log_copy_debuglogs_for_relay(log);
 
 	/*
 	 * Generally device is expected to be active only at this
@@ -452,6 +470,7 @@ int intel_guc_log_create(struct intel_guc_log *log)
 {
 	struct intel_guc *guc = log_to_guc(log);
 	struct i915_vma *vma;
+	void *vaddr;
 	u32 guc_log_size;
 	int ret;
 
@@ -459,23 +478,31 @@ int intel_guc_log_create(struct intel_guc_log *log)
 
 	/*
 	 *  GuC Log buffer Layout
+	 * (this ordering must follow "enum guc_log_buffer_type" definition)
 	 *
 	 *  +===============================+ 00B
-	 *  |    Crash dump state header    |
-	 *  +-------------------------------+ 32B
 	 *  |      Debug state header       |
+	 *  +-------------------------------+ 32B
+	 *  |    Crash dump state header    |
+	 *  +-------------------------------+ 64B
+	 *  |     Capture state header      |
 	 *  +-------------------------------+ 64B
 	 *  |     Capture state header      |
 	 *  +-------------------------------+ 96B
 	 *  |                               |
 	 *  +===============================+ PAGE_SIZE (4KB)
-	 *  |        Crash Dump logs        |
-	 *  +===============================+ + CRASH_SIZE
 	 *  |          Debug logs           |
 	 *  +===============================+ + DEBUG_SIZE
+	 *  |        Crash Dump logs        |
+	 *  +===============================+ + CRASH_SIZE
+	 *  |         Capture logs          |
+	 *  +===============================+ + CAPTURE_SIZE
 	 */
-	guc_log_size = PAGE_SIZE + CRASH_BUFFER_SIZE + DEBUG_BUFFER_SIZE +
-		       CAPTURE_BUFFER_SIZE;
+	if (intel_guc_capture_output_min_size_est(guc) > CAPTURE_BUFFER_SIZE)
+		DRM_WARN("GuC log buffer for state_capture maybe too small. %d < %d\n",
+			 CAPTURE_BUFFER_SIZE, intel_guc_capture_output_min_size_est(guc));
+
+	guc_log_size = PAGE_SIZE + DEBUG_BUFFER_SIZE + CRASH_BUFFER_SIZE + CAPTURE_BUFFER_SIZE;
 
 	vma = intel_guc_allocate_vma(guc, guc_log_size);
 	if (IS_ERR(vma)) {
@@ -484,6 +511,17 @@ int intel_guc_log_create(struct intel_guc_log *log)
 	}
 
 	log->vma = vma;
+	/*
+	 * Create a WC (Uncached for read) vmalloc mapping up front immediate access to
+	 * data from memory during  critical events such as error capture
+	 */
+	vaddr = i915_gem_object_pin_map_unlocked(log->vma->obj, I915_MAP_WC);
+	if (IS_ERR(vaddr)) {
+		ret = PTR_ERR(vaddr);
+		i915_vma_unpin_and_release(&log->vma, 0);
+		goto err;
+	}
+	log->buf_addr = vaddr;
 
 	log->level = __get_default_log_level(log);
 	DRM_DEBUG_DRIVER("guc_log_level=%d (%s, verbose:%s, verbosity:%d)\n",
@@ -494,13 +532,14 @@ int intel_guc_log_create(struct intel_guc_log *log)
 	return 0;
 
 err:
-	DRM_ERROR("Failed to allocate GuC log buffer. %d\n", ret);
+	DRM_ERROR("Failed to allocate or map GuC log buffer. %d\n", ret);
 	return ret;
 }
 
 void intel_guc_log_destroy(struct intel_guc_log *log)
 {
-	i915_vma_unpin_and_release(&log->vma, 0);
+	log->buf_addr = NULL;
+	i915_vma_unpin_and_release(&log->vma, I915_VMA_RELEASE_MAP);
 }
 
 int intel_guc_log_set_level(struct intel_guc_log *log, u32 level)
@@ -545,7 +584,7 @@ int intel_guc_log_set_level(struct intel_guc_log *log, u32 level)
 
 bool intel_guc_log_relay_created(const struct intel_guc_log *log)
 {
-	return log->relay.buf_addr;
+	return log->buf_addr;
 }
 
 int intel_guc_log_relay_open(struct intel_guc_log *log)
@@ -576,7 +615,7 @@ int intel_guc_log_relay_open(struct intel_guc_log *log)
 	if (ret)
 		goto out_unlock;
 
-	ret = guc_log_map(log);
+	ret = guc_log_relay_map(log);
 	if (ret)
 		goto out_relay;
 
@@ -628,8 +667,8 @@ void intel_guc_log_relay_flush(struct intel_guc_log *log)
 	with_intel_runtime_pm(guc_to_gt(guc)->uncore->rpm, wakeref)
 		guc_action_flush_log(guc);
 
-	/* GuC would have updated log buffer by now, so capture it */
-	guc_log_capture_logs(log);
+	/* GuC would have updated log buffer by now, so copy it */
+	guc_log_copy_debuglogs_for_relay(log);
 }
 
 /*
@@ -659,7 +698,7 @@ void intel_guc_log_relay_close(struct intel_guc_log *log)
 
 	mutex_lock(&log->relay.lock);
 	GEM_BUG_ON(!intel_guc_log_relay_created(log));
-	guc_log_unmap(log);
+	guc_log_relay_unmap(log);
 	guc_log_relay_destroy(log);
 	mutex_unlock(&log->relay.lock);
 }
@@ -695,6 +734,7 @@ stringify_guc_log_type(enum guc_log_buffer_type type)
  */
 void intel_guc_log_info(struct intel_guc_log *log, struct drm_printer *p)
 {
+	struct intel_guc *guc = log_to_guc(log);
 	enum guc_log_buffer_type type;
 
 	if (!intel_guc_log_relay_created(log)) {
@@ -709,8 +749,8 @@ void intel_guc_log_info(struct intel_guc_log *log, struct drm_printer *p)
 	for (type = GUC_DEBUG_LOG_BUFFER; type < GUC_MAX_LOG_BUFFER; type++) {
 		drm_printf(p, "\t%s:\tflush count %10u, overflow count %10u\n",
 			   stringify_guc_log_type(type),
-			   log->stats[type].flush,
-			   log->stats[type].sampled_overflow);
+			   guc->log_state[type].flush,
+			   guc->log_state[type].sampled_overflow);
 	}
 }
 
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_log.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_log.h
index 9d9004dc58f1..2968023f7447 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_log.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_log.h
@@ -42,9 +42,17 @@ struct intel_guc;
 #define GUC_VERBOSITY_TO_LOG_LEVEL(x)	((x) + 2)
 #define GUC_LOG_LEVEL_MAX GUC_VERBOSITY_TO_LOG_LEVEL(GUC_LOG_VERBOSITY_MAX)
 
+struct intel_guc_log_stats {
+	struct mutex lock; /* protects below and guc_log_buffer_state's read-ptr */
+	u32 sampled_overflow;
+	u32 overflow;
+	u32 flush;
+};
+
 struct intel_guc_log {
 	u32 level;
 	struct i915_vma *vma;
+	void *buf_addr;
 	struct {
 		void *buf_addr;
 		bool started;
@@ -53,12 +61,6 @@ struct intel_guc_log {
 		struct mutex lock;
 		u32 full_count;
 	} relay;
-	/* logging related stats */
-	struct {
-		u32 sampled_overflow;
-		u32 overflow;
-		u32 flush;
-	} stats[GUC_MAX_LOG_BUFFER];
 };
 
 void intel_guc_log_init_early(struct intel_guc_log *log);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [Intel-gfx] [RFC 6/7] drm/i915/guc: Copy new GuC error capture logs upon G2H notification.
  2021-11-22 23:03 ` [Intel-gfx] " Alan Previn
                   ` (5 preceding siblings ...)
  (?)
@ 2021-11-22 23:04 ` Alan Previn
  2021-12-07 22:58   ` Matthew Brost
  -1 siblings, 1 reply; 52+ messages in thread
From: Alan Previn @ 2021-11-22 23:04 UTC (permalink / raw)
  To: intel-gfx; +Cc: Alan Previn

Upon the G2H Notify-Err-Capture event, queue a worker to make a
snapshot of the error state capture logs from the GuC-log buffer
(error capture region) into an bigger interim circular buffer store
that can be parsed later during gpu coredump printing.

Also, call that worker function directly for the cases where we
are resetting GuC submission and need to flush outstanding logs.

Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com>
---
 .../gpu/drm/i915/gt/uc/abi/guc_actions_abi.h  |   7 +
 .../gpu/drm/i915/gt/uc/intel_guc_capture.c    | 206 ++++++++++++++++++
 .../gpu/drm/i915/gt/uc/intel_guc_capture.h    |  16 ++
 drivers/gpu/drm/i915/gt/uc/intel_guc_log.c    |  16 +-
 drivers/gpu/drm/i915/gt/uc/intel_guc_log.h    |   5 +
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c |  10 +-
 6 files changed, 256 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
index 5af03a486a13..c130f465c19a 100644
--- a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
+++ b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
@@ -178,4 +178,11 @@ enum intel_guc_sleep_state_status {
 #define GUC_LOG_CONTROL_VERBOSITY_MASK	(0xF << GUC_LOG_CONTROL_VERBOSITY_SHIFT)
 #define GUC_LOG_CONTROL_DEFAULT_LOGGING	(1 << 8)
 
+enum intel_guc_state_capture_event_status {
+	INTEL_GUC_STATE_CAPTURE_EVENT_STATUS_SUCCESS = 0x0,
+	INTEL_GUC_STATE_CAPTURE_EVENT_STATUS_NOSPACE = 0x1,
+};
+
+#define INTEL_GUC_STATE_CAPTURE_EVENT_STATUS_MASK      0x1
+
 #endif /* _ABI_GUC_ACTIONS_ABI_H */
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
index 0cb358a98605..459fe81c77ae 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
@@ -11,8 +11,11 @@
 #include "gt/intel_gt.h"
 #include "gt/intel_lrc_reg.h"
 
+#include <linux/circ_buf.h>
+
 #include "intel_guc_fwif.h"
 #include "intel_guc_capture.h"
+#include "i915_gpu_error.h"
 
 /*
  * Define all device tables of GuC error capture register lists
@@ -390,15 +393,218 @@ int intel_guc_capture_output_min_size_est(struct intel_guc *guc)
 	return (worst_min_size * 3);
 }
 
+/*
+ * KMD Init time flows:
+ * --------------------
+ *     --> alloc A: GuC input capture regs lists (registered via ADS)
+ *                  List acquired via intel_guc_capture_list_count + intel_guc_capture_list_init
+ *                  Size = global-reg-list + (class-reg-list) + (num-instances x instance-reg-list)
+ *                  Device tables carry: 1x global, 1x per-class, 1x per-instance)
+ *                  Caller needs to call per-class and per-instance multiplie times
+ *
+ *     --> alloc B: GuC output capture buf (registered via guc_init_params(log_param))
+ *                  Size = #define CAPTURE_BUFFER_SIZE (warns if on too-small)
+ *                  Note2: 'x 3' to hold multiple capture groups
+ *
+ *     --> alloc C: GuC capture interim circular buffer storage in system mem
+ *                  Size = 'power_of_two(sizeof(B))' as per kernel circular buffer helper
+ *
+ * GUC Runtime notify capture:
+ * --------------------------
+ *     --> G2H STATE_CAPTURE_NOTIFICATION
+ *                   L--> intel_guc_capture_store_snapshot
+ *                        L--> queue(__guc_capture_store_snapshot_work)
+ *                             Copies from B (head->tail) into C
+ */
+
+static void guc_capture_store_insert(struct intel_guc *guc, struct guc_capture_out_store *store,
+				     unsigned char *new_data, size_t bytes)
+{
+	struct drm_i915_private *dev_priv = (guc_to_gt(guc))->i915;
+	unsigned char *dst_data = store->addr;
+	unsigned long h, t;
+	size_t tmp;
+
+	h = store->head;
+	t = store->tail;
+	if (CIRC_SPACE(h, t, store->size) >= bytes) {
+		while (bytes) {
+			tmp = CIRC_SPACE_TO_END(h, t, store->size);
+			if (tmp) {
+				tmp = tmp < bytes ? tmp : bytes;
+				i915_unaligned_memcpy_from_wc(&dst_data[h], new_data, tmp);
+				bytes -= tmp;
+				new_data += tmp;
+				h = (h + tmp) & (store->size - 1);
+			} else {
+				drm_err(&dev_priv->drm, "circbuf copy-to ptr-corruption!\n");
+				break;
+			}
+		}
+		store->head = h;
+	} else {
+		drm_err(&dev_priv->drm, "GuC capture interim-store insufficient space!\n");
+	}
+}
+
+static void __guc_capture_store_snapshot_work(struct intel_guc *guc)
+{
+	struct drm_i915_private *dev_priv = (guc_to_gt(guc))->i915;
+	unsigned int buffer_size, read_offset, write_offset, bytes_to_copy, full_count;
+	struct guc_log_buffer_state *log_buf_state;
+	struct guc_log_buffer_state log_buf_state_local;
+	void *src_data, *dst_data = NULL;
+	bool new_overflow;
+
+	/* Lock to get the pointer to GuC capture-log-buffer-state */
+	mutex_lock(&guc->log_state[GUC_CAPTURE_LOG_BUFFER].lock);
+	log_buf_state = guc->log.buf_addr +
+			(sizeof(struct guc_log_buffer_state) * GUC_CAPTURE_LOG_BUFFER);
+	src_data = guc->log.buf_addr + guc_get_log_buffer_offset(GUC_CAPTURE_LOG_BUFFER);
+
+	/*
+	 * Make a copy of the state structure, inside GuC log buffer
+	 * (which is uncached mapped), on the stack to avoid reading
+	 * from it multiple times.
+	 */
+	memcpy(&log_buf_state_local, log_buf_state, sizeof(struct guc_log_buffer_state));
+	buffer_size = guc_get_log_buffer_size(GUC_CAPTURE_LOG_BUFFER);
+	read_offset = log_buf_state_local.read_ptr;
+	write_offset = log_buf_state_local.sampled_write_ptr;
+	full_count = log_buf_state_local.buffer_full_cnt;
+
+	/* Bookkeeping stuff */
+	guc->log_state[GUC_CAPTURE_LOG_BUFFER].flush += log_buf_state_local.flush_to_file;
+	new_overflow = guc_check_log_buf_overflow(guc, &guc->log_state[GUC_CAPTURE_LOG_BUFFER],
+						  full_count);
+
+	/* Update the state of shared log buffer */
+	log_buf_state->read_ptr = write_offset;
+	log_buf_state->flush_to_file = 0;
+
+	mutex_unlock(&guc->log_state[GUC_CAPTURE_LOG_BUFFER].lock);
+
+	dst_data = guc->capture.out_store.addr;
+	if (dst_data) {
+		mutex_lock(&guc->capture.out_store.lock);
+
+		/* Now copy the actual logs. */
+		if (unlikely(new_overflow)) {
+			/* copy the whole buffer in case of overflow */
+			read_offset = 0;
+			write_offset = buffer_size;
+		} else if (unlikely((read_offset > buffer_size) ||
+					(write_offset > buffer_size))) {
+			drm_err(&dev_priv->drm, "invalid GuC log capture buffer state!\n");
+			/* copy whole buffer as offsets are unreliable */
+			read_offset = 0;
+			write_offset = buffer_size;
+		}
+
+		/* first copy from the tail end of the GuC log capture buffer */
+		if (read_offset > write_offset) {
+			guc_capture_store_insert(guc, &guc->capture.out_store, src_data,
+						 write_offset);
+			bytes_to_copy = buffer_size - read_offset;
+		} else {
+			bytes_to_copy = write_offset - read_offset;
+		}
+		guc_capture_store_insert(guc, &guc->capture.out_store, src_data + read_offset,
+					 bytes_to_copy);
+
+		mutex_unlock(&guc->capture.out_store.lock);
+	}
+}
+
+static void guc_capture_store_snapshot_work(struct work_struct *work)
+{
+	struct intel_guc_state_capture *capture =
+		container_of(work, struct intel_guc_state_capture, store_work);
+	struct intel_guc *guc =
+		container_of(capture, struct intel_guc, capture);
+
+	__guc_capture_store_snapshot_work(guc);
+}
+
+void  intel_guc_capture_store_snapshot(struct intel_guc *guc)
+{
+	if (guc->capture.enabled)
+		queue_work(system_highpri_wq, &guc->capture.store_work);
+}
+
+void intel_guc_capture_store_snapshot_immediate(struct intel_guc *guc)
+{
+	if (guc->capture.enabled)
+		__guc_capture_store_snapshot_work(guc);
+}
+
+static void guc_capture_store_destroy(struct intel_guc *guc)
+{
+	mutex_destroy(&guc->capture.out_store.lock);
+	mutex_destroy(&guc->capture.out_store.lock);
+	guc->capture.out_store.size = 0;
+	kfree(guc->capture.out_store.addr);
+	guc->capture.out_store.addr = NULL;
+}
+
+static int guc_capture_store_create(struct intel_guc *guc)
+{
+	/*
+	 * Make this interim buffer 3x the GuC capture output buffer so that we can absorb
+	 * a little delay when processing the raw capture dumps into text friendly logs
+	 * for the i915_gpu_coredump output
+	 */
+	size_t max_dump_size;
+	struct drm_i915_private *dev_priv = (guc_to_gt(guc))->i915;
+
+	GEM_BUG_ON(guc->capture.out_store.addr);
+
+	max_dump_size = PAGE_ALIGN(intel_guc_capture_output_min_size_est(guc));
+	max_dump_size = roundup_pow_of_two(max_dump_size);
+
+	guc->capture.out_store.addr = kzalloc(max_dump_size, GFP_KERNEL);
+	if (!guc->capture.out_store.addr) {
+		drm_warn(&dev_priv->drm, "GuC-capture interim-store populated at init!\n");
+		return -ENOMEM;
+	}
+	guc->capture.out_store.size = max_dump_size;
+	mutex_init(&guc->capture.out_store.lock);
+	mutex_init(&guc->capture.out_store.lock);
+
+	return 0;
+}
+
 void intel_guc_capture_destroy(struct intel_guc *guc)
 {
+	if (!guc->capture.enabled)
+		return;
+
+	guc->capture.enabled = false;
+
+	intel_synchronize_irq(guc_to_gt(guc)->i915);
+	flush_work(&guc->capture.store_work);
+	guc_capture_store_destroy(guc);
 	guc_capture_clear_ext_regs(guc->capture.reglists);
 }
 
 int intel_guc_capture_init(struct intel_guc *guc)
 {
 	struct drm_i915_private *dev_priv = (guc_to_gt(guc))->i915;
+	int ret;
 
 	guc->capture.reglists = guc_capture_get_device_reglist(dev_priv);
+	/*
+	 * allocate interim store at init time so we dont require memory
+	 * allocation whilst in the midst of the reset + capture
+	 */
+	ret = guc_capture_store_create(guc);
+	if (ret) {
+		guc_capture_clear_ext_regs(guc->capture.reglists);
+		return ret;
+	}
+
+	INIT_WORK(&guc->capture.store_work, guc_capture_store_snapshot_work);
+	guc->capture.enabled = true;
+
 	return 0;
 }
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h
index 839b53425e1e..7031de12f3a1 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h
@@ -64,7 +64,19 @@ struct intel_guc_capture_out_group {
 	struct intel_guc_capture_out_data group_lists[0];
 };
 
+struct guc_capture_out_store {
+	/* An interim storage to copy the GuC error-capture-output before
+	 * parsing and reporting via proper reporting flows with formatting.
+	 */
+	unsigned char *addr;
+	size_t size;
+	unsigned long head; /* inject new output capture data */
+	unsigned long tail; /* remove output capture data when reporting */
+	struct mutex lock; /*lock head or tail when copying capture in or extracting out*/
+};
+
 struct intel_guc_state_capture {
+	bool enabled;
 	struct __guc_mmio_reg_descr_group *reglists;
 	u16 num_instance_regs[GUC_CAPTURE_LIST_INDEX_MAX][GUC_MAX_ENGINE_CLASSES];
 	u16 num_class_regs[GUC_CAPTURE_LIST_INDEX_MAX][GUC_MAX_ENGINE_CLASSES];
@@ -72,14 +84,18 @@ struct intel_guc_state_capture {
 	int instance_list_size;
 	int class_list_size;
 	int global_list_size;
+	struct guc_capture_out_store out_store;
+	struct work_struct store_work;
 };
 
+void intel_guc_capture_store_snapshot(struct intel_guc *guc);
 int intel_guc_capture_list_count(struct intel_guc *guc, u32 owner, u32 type, u32 class,
 				 u16 *num_entries);
 int intel_guc_capture_list_init(struct intel_guc *guc, u32 owner, u32 type, u32 class,
 				struct guc_mmio_reg *ptr, u16 num_entries);
 int intel_guc_capture_output_min_size_est(struct intel_guc *guc);
 void intel_guc_capture_destroy(struct intel_guc *guc);
+void intel_guc_capture_store_snapshot_immediate(struct intel_guc *guc);
 int intel_guc_capture_init(struct intel_guc *guc);
 
 #endif /* _INTEL_GUC_CAPTURE_H */
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_log.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_log.c
index dd86530f77a1..1354dbde9994 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_log.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_log.c
@@ -194,7 +194,7 @@ bool guc_check_log_buf_overflow(struct intel_guc *guc,
 	return overflow;
 }
 
-static unsigned int guc_get_log_buffer_size(enum guc_log_buffer_type type)
+unsigned int guc_get_log_buffer_size(enum guc_log_buffer_type type)
 {
 	switch (type) {
 	case GUC_DEBUG_LOG_BUFFER:
@@ -210,6 +210,20 @@ static unsigned int guc_get_log_buffer_size(enum guc_log_buffer_type type)
 	return 0;
 }
 
+size_t guc_get_log_buffer_offset(enum guc_log_buffer_type type)
+{
+	enum guc_log_buffer_type i;
+	size_t offset = PAGE_SIZE;/* for the log_buffer_states */
+
+	for (i = GUC_DEBUG_LOG_BUFFER; i < GUC_MAX_LOG_BUFFER; i++) {
+		if (i == type)
+			break;
+		offset += guc_get_log_buffer_size(i);
+	}
+
+	return offset;
+}
+
 static void _guc_log_copy_debuglogs_for_relay(struct intel_guc_log *log)
 {
 	struct intel_guc *guc = log_to_guc(log);
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_log.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_log.h
index 2968023f7447..9bf29343df0e 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_log.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_log.h
@@ -64,8 +64,13 @@ struct intel_guc_log {
 };
 
 void intel_guc_log_init_early(struct intel_guc_log *log);
+unsigned int guc_get_log_buffer_size(enum guc_log_buffer_type type);
+size_t guc_get_log_buffer_offset(enum guc_log_buffer_type type);
 int intel_guc_log_create(struct intel_guc_log *log);
 void intel_guc_log_destroy(struct intel_guc_log *log);
+ 
+bool guc_check_log_buf_overflow(struct intel_guc *guc, struct intel_guc_log_stats *state,
+				unsigned int full_cnt);
 
 int intel_guc_log_set_level(struct intel_guc_log *log, u32 level);
 bool intel_guc_log_relay_created(const struct intel_guc_log *log);
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 0bfc92b1b982..0afd9ddd71fc 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -24,6 +24,7 @@
 
 #include "intel_guc_ads.h"
 #include "intel_guc_submission.h"
+#include "gt/uc/intel_guc_capture.h"
 
 #include "i915_drv.h"
 #include "i915_trace.h"
@@ -1431,6 +1432,8 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
 	}
 
 	scrub_guc_desc_for_outstanding_g2h(guc);
+
+	intel_guc_capture_store_snapshot_immediate(guc);
 }
 
 static struct intel_engine_cs *
@@ -4013,10 +4016,11 @@ int intel_guc_error_capture_process_msg(struct intel_guc *guc,
 		return -EPROTO;
 	}
 
-	status = msg[0];
-	drm_info(&guc_to_gt(guc)->i915->drm, "Got error capture: status = %d", status);
+	status = msg[0] & INTEL_GUC_STATE_CAPTURE_EVENT_STATUS_MASK;
+	if (status == INTEL_GUC_STATE_CAPTURE_EVENT_STATUS_NOSPACE)
+		drm_warn(&guc_to_gt(guc)->i915->drm, "G2H-Error capture no space\n");
 
-	/* Add extraction of error capture dump */
+	intel_guc_capture_store_snapshot(guc);
 
 	return 0;
 }
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [Intel-gfx] [RFC 7/7] drm/i915/guc: Print the GuC error capture output register list.
  2021-11-22 23:03 ` [Intel-gfx] " Alan Previn
                   ` (6 preceding siblings ...)
  (?)
@ 2021-11-22 23:04 ` Alan Previn
  2021-11-23  0:25   ` Teres Alexis, Alan Previn
  2021-12-08  0:22   ` Matthew Brost
  -1 siblings, 2 replies; 52+ messages in thread
From: Alan Previn @ 2021-11-22 23:04 UTC (permalink / raw)
  To: intel-gfx; +Cc: Alan Previn

Print the GuC captured error state register list (offsets
and values) when gpu_coredump_state printout is invoked.

Also, since the GuC can report multiple engine class registers in a
single notification event, parse the captured data (appearing as a
stream of structures) to identify multiple captures of different
'engine-capture-group-outputs'.

Finally, for each 'engine-capture-group-output', identify the last
running context and print already-identified vma's so that user's
output report follows the same layout as execlist submission. I.e.
engine1-registers, engine1-context-vmas, engine2-registers,
engine2-context-vmas, etc.

Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_engine_cs.c     |   4 +-
 .../gpu/drm/i915/gt/uc/intel_guc_capture.c    | 389 ++++++++++++++++++
 .../gpu/drm/i915/gt/uc/intel_guc_capture.h    |   6 +
 drivers/gpu/drm/i915/i915_gpu_error.c         |  53 ++-
 drivers/gpu/drm/i915/i915_gpu_error.h         |   5 +
 5 files changed, 439 insertions(+), 18 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index 332756036007..5806e2c05212 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -1595,9 +1595,7 @@ static void intel_engine_print_registers(struct intel_engine_cs *engine,
 		drm_printf(m, "\tIPEHR: 0x%08x\n", ENGINE_READ(engine, IPEHR));
 	}
 
-	if (intel_engine_uses_guc(engine)) {
-		/* nothing to print yet */
-	} else if (HAS_EXECLISTS(dev_priv)) {
+	if (HAS_EXECLISTS(dev_priv) && !intel_engine_uses_guc(engine)) {
 		struct i915_request * const *port, *rq;
 		const u32 *hws =
 			&engine->status_page.addr[I915_HWS_CSB_BUF0_INDEX];
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
index 459fe81c77ae..998ce1b474ed 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
@@ -415,8 +415,389 @@ int intel_guc_capture_output_min_size_est(struct intel_guc *guc)
  *                   L--> intel_guc_capture_store_snapshot
  *                        L--> queue(__guc_capture_store_snapshot_work)
  *                             Copies from B (head->tail) into C
+ *
+ * GUC --> notify context reset:
+ * -----------------------------
+ *     --> G2H CONTEXT RESET
+ *                   L--> guc_handle_context_reset --> i915_capture_error_state
+ *                    --> i915_gpu_coredump --> intel_guc_capture_store_ptr
+ *                        L--> keep a ptr to capture_store in
+ *                             i915_gpu_coredump struct.
+ *
+ * User Sysfs / Debugfs
+ * --------------------
+ *      --> i915_gpu_coredump_copy_to_buffer->
+ *                   L--> err_print_to_sgl --> err_print_gt
+ *                        L--> error_print_guc_captures
+ *                             L--> loop: intel_guc_capture_out_print_next_group
+ *
  */
 
+#if IS_ENABLED(CONFIG_DRM_I915_CAPTURE_ERROR)
+
+static char *
+guc_capture_register_string(const struct intel_guc *guc, u32 owner, u32 type,
+			    u32 class, u32 id, u32 offset)
+{
+	struct __guc_mmio_reg_descr_group *reglists = guc->capture.reglists;
+	struct __guc_mmio_reg_descr_group *match;
+	int num_regs, j = 0;
+
+	if (!reglists)
+		return NULL;
+
+	match = guc_capture_get_one_list(reglists, owner, type, id);
+	if (match) {
+		num_regs = match->num_regs;
+		while (num_regs--) {
+			if (offset == match->list[j].reg.reg)
+				return match->list[j].regname;
+			++j;
+		}
+	}
+
+	return NULL;
+}
+
+static inline int
+guc_capture_store_remove_dw(struct guc_capture_out_store *store, u32 *bytesleft,
+			    u32 *dw)
+{
+	int tries = 2;
+	int avail = 0;
+	u32 *src_data;
+
+	if (!*bytesleft)
+		return 0;
+
+	while (tries--) {
+		avail = CIRC_CNT_TO_END(store->head, store->tail, store->size);
+		if (avail >= sizeof(u32)) {
+			src_data = (u32 *)(store->addr + store->tail);
+			*dw = *src_data;
+			store->tail = (store->tail + 4) & (store->size - 1);
+			*bytesleft -= 4;
+			return 4;
+		}
+		if (store->tail == (store->size - 1) && store->head > 0)
+			store->tail = 0;
+	}
+
+	return 0;
+}
+
+static int
+capture_store_get_group_hdr(const struct intel_guc *guc,
+			    struct guc_capture_out_store *store, u32 *bytesleft,
+			    struct intel_guc_capture_out_group_header *group)
+{
+	int read = 0;
+	int fullsize = sizeof(struct intel_guc_capture_out_group_header);
+
+	if (fullsize > *bytesleft)
+		return -1;
+
+	if (CIRC_CNT_TO_END(store->head, store->tail, store->size) >= fullsize) {
+		    memcpy(group, (store->addr + store->tail), fullsize);
+			store->tail = (store->tail + fullsize) & (store->size - 1);
+			*bytesleft -= fullsize;
+		return 0;
+	}
+
+	read += guc_capture_store_remove_dw(store, bytesleft, &group->reserved1);
+	read += guc_capture_store_remove_dw(store, bytesleft, &group->info);
+	if (read != sizeof(*group))
+		return -1;
+
+	return 0;
+}
+
+static int
+capture_store_get_data_hdr(const struct intel_guc *guc,
+			   struct guc_capture_out_store *store, u32 *bytesleft,
+			   struct intel_guc_capture_out_data_header *data)
+{
+	int read = 0;
+	int fullsize = sizeof(struct intel_guc_capture_out_data_header);
+
+	if (fullsize > *bytesleft)
+		return -1;
+
+	if (CIRC_CNT_TO_END(store->head, store->tail, store->size) >= fullsize) {
+		    memcpy(data, (store->addr + store->tail), fullsize);
+			store->tail = (store->tail + fullsize) & (store->size - 1);
+			*bytesleft -= fullsize;
+		return 0;
+	}
+
+	read += guc_capture_store_remove_dw(store, bytesleft, &data->reserved1);
+	read += guc_capture_store_remove_dw(store, bytesleft, &data->info);
+	read += guc_capture_store_remove_dw(store, bytesleft, &data->lrca);
+	read += guc_capture_store_remove_dw(store, bytesleft, &data->guc_ctx_id);
+	read += guc_capture_store_remove_dw(store, bytesleft, &data->num_mmios);
+	if (read != sizeof(*data))
+		return -1;
+
+	return 0;
+}
+
+static int
+capture_store_get_register(const struct intel_guc *guc,
+			   struct guc_capture_out_store *store, u32 *bytesleft,
+			   struct guc_mmio_reg *reg)
+{
+	int read = 0;
+	int fullsize = sizeof(struct guc_mmio_reg);
+
+	if (fullsize > *bytesleft)
+		return -1;
+
+	if (CIRC_CNT_TO_END(store->head, store->tail, store->size) >= fullsize) {
+		    memcpy(reg, (store->addr + store->tail), fullsize);
+			store->tail = (store->tail + fullsize) & (store->size - 1);
+			*bytesleft -= fullsize;
+		return 0;
+	}
+
+	read += guc_capture_store_remove_dw(store, bytesleft, &reg->offset);
+	read += guc_capture_store_remove_dw(store, bytesleft, &reg->value);
+	read += guc_capture_store_remove_dw(store, bytesleft, &reg->flags);
+	read += guc_capture_store_remove_dw(store, bytesleft, &reg->mask);
+	if (read != sizeof(*reg))
+		return -1;
+
+	return 0;
+}
+
+static void guc_capture_store_drop_data(struct guc_capture_out_store *store,
+					unsigned long sampled_head)
+{
+	if (sampled_head == 0)
+		store->tail = store->size - 1;
+	else
+		store->tail = sampled_head - 1;
+}
+
+#ifdef CONFIG_DRM_I915_DEBUG_GUC
+#define guc_capt_err_print(a, b, ...) \
+	do { \
+		drm_warn(a, __VA_ARGS__); \
+		if (b) \
+			i915_error_printf(b, __VA_ARGS__); \
+	} while (0)
+#else
+#define guc_capt_err_print(a, b, ...) \
+	do { \
+		if (b) \
+			i915_error_printf(b, __VA_ARGS__); \
+	} while (0)
+#endif
+
+static struct intel_engine_cs *
+guc_lookup_engine(struct intel_guc *guc, u8 guc_class, u8 instance)
+{
+	struct intel_gt *gt = guc_to_gt(guc);
+	u8 engine_class = guc_class_to_engine_class(guc_class);
+
+	/* Class index is checked in class converter */
+	GEM_BUG_ON(instance > MAX_ENGINE_INSTANCE);
+
+	return gt->engine_class[engine_class][instance];
+}
+
+static inline struct intel_context *
+guc_context_lookup(struct intel_guc *guc, u32 guc_ctx_id)
+{
+	struct intel_context *ce;
+
+	if (unlikely(guc_ctx_id >= GUC_MAX_LRC_DESCRIPTORS)) {
+		drm_dbg(&guc_to_gt(guc)->i915->drm, "Invalid guc_ctx_id 0x%X, max 0x%X",
+			guc_ctx_id, GUC_MAX_LRC_DESCRIPTORS);
+		return NULL;
+	}
+
+	ce = xa_load(&guc->context_lookup, guc_ctx_id);
+	if (unlikely(!ce)) {
+		drm_dbg(&guc_to_gt(guc)->i915->drm, "Context is NULL, guc_ctx_id 0x%X",
+			guc_ctx_id);
+		return NULL;
+	}
+
+	return ce;
+}
+
+
+#define PRINT guc_capt_err_print
+#define REGSTR guc_capture_register_string
+
+#define GCAP_PRINT_INTEL_ENG_INFO(i915, ebuf, eng) \
+	PRINT(&(i915->drm), (ebuf), "    i915-Eng-Name: %s\n", (eng)->name); \
+	PRINT(&(i915->drm), (ebuf), "    i915-Eng-Class: 0x%02x\n", (eng)->class); \
+	PRINT(&(i915->drm), (ebuf), "    i915-Eng-Inst: 0x%02x\n", (eng)->instance); \
+	PRINT(&(i915->drm), (ebuf), "    i915-Eng-LogicalMask: 0x%08x\n", (eng)->logical_mask)
+
+#define GCAP_PRINT_GUC_INST_INFO(i915, ebuf, data) \
+	PRINT(&(i915->drm), (ebuf), "    LRCA: 0x%08x\n", (data).lrca); \
+	PRINT(&(i915->drm), (ebuf), "    GuC-ContextID: 0x%08x\n", (data).guc_ctx_id); \
+	PRINT(&(i915->drm), (ebuf), "    GuC-Engine-Instance: 0x%08x\n", \
+	      (uint32_t) FIELD_GET(GUC_CAPTURE_DATAHDR_SRC_INSTANCE, (data).info));
+
+#define GCAP_PRINT_INTEL_CTX_INFO(i915, ebuf, ce) \
+	PRINT(&(i915->drm), (ebuf), "    i915-Ctx-Flags: 0x%016lx\n", (ce)->flags); \
+	PRINT(&(i915->drm), (ebuf), "    i915-Ctx-GuC-ID: 0x%016x\n", (ce)->guc_id.id);
+
+int intel_guc_capture_out_print_next_group(struct drm_i915_error_state_buf *ebuf,
+					   struct intel_gt_coredump *gt)
+{
+	/* constant qualifier for data-pointers we shouldn't change mid of error dump printing */
+	struct intel_guc_state_capture *cap = gt->uc->capture;
+	struct intel_guc *guc = container_of(cap, struct intel_guc, capture);
+	struct drm_i915_private *i915 = (container_of(guc, struct intel_gt,
+						   uc.guc))->i915;
+	struct guc_capture_out_store *store = &cap->out_store;
+	struct guc_capture_out_store tmpstore;
+	struct intel_guc_capture_out_group_header group;
+	struct intel_guc_capture_out_data_header data;
+	struct guc_mmio_reg reg;
+	const char *grptypestr[GUC_STATE_CAPTURE_GROUP_TYPE_MAX] = {"full-capture",
+								    "partial-capture"};
+	const char *datatypestr[GUC_CAPTURE_LIST_TYPE_MAX] = {"Global", "Engine-Class",
+							      "Engine-Instance"};
+	enum guc_capture_group_types grptype;
+	enum guc_capture_type datatype;
+	int numgrps, numregs;
+	char *str, noname[16];
+	u32 numbytes, engineclass, eng_inst, ret = 0;
+	struct intel_engine_cs *eng;
+	struct intel_context *ce;
+
+	if (!cap->enabled)
+		return -ENODEV;
+
+	mutex_lock(&store->lock);
+	smp_mb(); /* sync to get the latest head for the moment */
+	/* NOTE1: make a copy of store so we dont have to deal with a changing lower bound of
+	 *        occupied-space in this circular buffer.
+	 * NOTE2: Higher up the stack from here, we keep calling this function in a loop to
+	 *        reading more capture groups as they appear (as the lower bound of occupied-space
+	 *        changes) until this circ-buf is empty.
+	 */
+	memcpy(&tmpstore, store, sizeof(tmpstore));
+
+	PRINT(&i915->drm, ebuf, "global --- GuC Error Capture\n");
+
+	numbytes = CIRC_CNT(tmpstore.head, tmpstore.tail, tmpstore.size);
+	if (!numbytes) {
+		PRINT(&i915->drm, ebuf, "GuC capture stream empty!\n");
+		ret = -ENODATA;
+		goto unlock;
+	}
+	/* everything in GuC output structures are dword aligned */
+	if (numbytes & 0x3) {
+		PRINT(&i915->drm, ebuf, "GuC capture stream unaligned!\n");
+		ret = -EIO;
+		goto unlock;
+	}
+
+	if (capture_store_get_group_hdr(guc, &tmpstore, &numbytes, &group)) {
+		PRINT(&i915->drm, ebuf, "GuC capture error getting next group-header!\n");
+		ret = -EIO;
+		goto unlock;
+	}
+
+	PRINT(&i915->drm, ebuf, "NumCaptures:  0x%08x\n", (uint32_t)
+	      FIELD_GET(GUC_CAPTURE_GRPHDR_SRC_NUMCAPTURES, group.info));
+	grptype = FIELD_GET(GUC_CAPTURE_GRPHDR_SRC_CAPTURE_TYPE, group.info);
+	PRINT(&i915->drm, ebuf, "Coverage:  0x%08x = %s\n", grptype,
+	      grptypestr[grptype % GUC_STATE_CAPTURE_GROUP_TYPE_MAX]);
+
+	numgrps = FIELD_GET(GUC_CAPTURE_GRPHDR_SRC_NUMCAPTURES, group.info);
+	while (numgrps--) {
+		eng = NULL;
+		ce = NULL;
+
+		if (capture_store_get_data_hdr(guc, &tmpstore, &numbytes, &data)) {
+			PRINT(&i915->drm, ebuf, "GuC capture error on next data-header!\n");
+			ret = -EIO;
+			goto unlock;
+		}
+		datatype = FIELD_GET(GUC_CAPTURE_DATAHDR_SRC_TYPE, data.info);
+		PRINT(&i915->drm, ebuf, "  RegListType: %s\n",
+		      datatypestr[datatype % GUC_CAPTURE_LIST_TYPE_MAX]);
+
+		engineclass = FIELD_GET(GUC_CAPTURE_DATAHDR_SRC_CLASS, data.info);
+		if (datatype != GUC_CAPTURE_LIST_TYPE_GLOBAL) {
+			PRINT(&i915->drm, ebuf, "    GuC-Engine-Class: %d\n",
+			      engineclass);
+			if (engineclass <= GUC_LAST_ENGINE_CLASS)
+				PRINT(&i915->drm, ebuf, "    i915-Eng-Class: %d\n",
+				      guc_class_to_engine_class(engineclass));
+
+			if (datatype == GUC_CAPTURE_LIST_TYPE_ENGINE_INSTANCE) {
+				GCAP_PRINT_GUC_INST_INFO(i915, ebuf, data);
+				eng_inst = FIELD_GET(GUC_CAPTURE_DATAHDR_SRC_INSTANCE, data.info);
+				eng = guc_lookup_engine(guc, engineclass, eng_inst);
+				if (eng) {
+					GCAP_PRINT_INTEL_ENG_INFO(i915, ebuf, eng);
+				} else {
+					PRINT(&i915->drm, ebuf, "    i915-Eng-Lookup Fail!\n");
+				}
+				ce = guc_context_lookup(guc, data.guc_ctx_id);
+				if (ce) {
+					GCAP_PRINT_INTEL_CTX_INFO(i915, ebuf, ce);
+				} else {
+					PRINT(&i915->drm, ebuf, "    i915-Ctx-Lookup Fail!\n");
+				}
+			}
+		}
+		numregs = FIELD_GET(GUC_CAPTURE_DATAHDR_NUM_MMIOS, data.num_mmios);
+		PRINT(&i915->drm, ebuf, "     NumRegs: 0x%08x\n", numregs);
+
+		while (numregs--) {
+			if (capture_store_get_register(guc, &tmpstore, &numbytes, &reg)) {
+				PRINT(&i915->drm, ebuf, "Error getting next register!\n");
+				ret = -EIO;
+				goto unlock;
+			}
+			str = REGSTR(guc, GUC_CAPTURE_LIST_INDEX_PF, datatype,
+				     engineclass, 0, reg.offset);
+			if (!str) {
+				snprintf(noname, sizeof(noname), "REG-0x%08x", reg.offset);
+				str = noname;
+			}
+			PRINT(&i915->drm, ebuf, "      %s:  0x%08x\n", str, reg.value);
+
+		}
+		if (eng) {
+			const struct intel_engine_coredump *ee;
+			for (ee = gt->engine; ee; ee = ee->next) {
+				const struct i915_vma_coredump *vma;
+				if (ee->engine == eng) {
+					for (vma = ee->vma; vma; vma = vma->next)
+						i915_print_error_vma(ebuf, ee->engine, vma);
+				}
+			}
+		}
+	}
+
+	store->tail = tmpstore.tail;
+unlock:
+	/* if we have a stream error, just drop everything */
+	if (ret == -EIO) {
+		drm_warn(&i915->drm, "Skip GuC capture data print due to stream error\n");
+		guc_capture_store_drop_data(store, tmpstore.head);
+	}
+
+	mutex_unlock(&store->lock);
+
+	return ret;
+}
+
+#undef REGSTR
+#undef PRINT
+
+#endif //CONFIG_DRM_I915_DEBUG_GUC
+
 static void guc_capture_store_insert(struct intel_guc *guc, struct guc_capture_out_store *store,
 				     unsigned char *new_data, size_t bytes)
 {
@@ -587,6 +968,14 @@ void intel_guc_capture_destroy(struct intel_guc *guc)
 	guc_capture_clear_ext_regs(guc->capture.reglists);
 }
 
+struct intel_guc_state_capture *
+intel_guc_capture_store_ptr(struct intel_guc *guc)
+{
+	if (!guc->capture.enabled)
+		return NULL;
+	return &guc->capture;
+}
+
 int intel_guc_capture_init(struct intel_guc *guc)
 {
 	struct drm_i915_private *dev_priv = (guc_to_gt(guc))->i915;
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h
index 7031de12f3a1..7d048a8f6efe 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h
@@ -88,6 +88,11 @@ struct intel_guc_state_capture {
 	struct work_struct store_work;
 };
 
+struct drm_i915_error_state_buf;
+struct intel_gt_coredump;
+
+int intel_guc_capture_out_print_next_group(struct drm_i915_error_state_buf *m,
+					   struct intel_gt_coredump *gt);
 void intel_guc_capture_store_snapshot(struct intel_guc *guc);
 int intel_guc_capture_list_count(struct intel_guc *guc, u32 owner, u32 type, u32 class,
 				 u16 *num_entries);
@@ -96,6 +101,7 @@ int intel_guc_capture_list_init(struct intel_guc *guc, u32 owner, u32 type, u32
 int intel_guc_capture_output_min_size_est(struct intel_guc *guc);
 void intel_guc_capture_destroy(struct intel_guc *guc);
 void intel_guc_capture_store_snapshot_immediate(struct intel_guc *guc);
+struct intel_guc_state_capture *intel_guc_capture_store_ptr(struct intel_guc *guc);
 int intel_guc_capture_init(struct intel_guc *guc);
 
 #endif /* _INTEL_GUC_CAPTURE_H */
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index 2a2d7643b551..47016059c65d 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -600,6 +600,16 @@ static void error_print_engine(struct drm_i915_error_state_buf *m,
 	error_print_context(m, "  Active context: ", &ee->context);
 }
 
+static void error_print_guc_captures(struct drm_i915_error_state_buf *m,
+				     struct intel_gt_coredump *gt)
+{
+	int ret;
+
+	do {
+		ret = intel_guc_capture_out_print_next_group(m, gt);
+	} while (!ret);
+}
+
 void i915_error_printf(struct drm_i915_error_state_buf *e, const char *f, ...)
 {
 	va_list args;
@@ -609,9 +619,9 @@ void i915_error_printf(struct drm_i915_error_state_buf *e, const char *f, ...)
 	va_end(args);
 }
 
-static void print_error_vma(struct drm_i915_error_state_buf *m,
-			    const struct intel_engine_cs *engine,
-			    const struct i915_vma_coredump *vma)
+void i915_print_error_vma(struct drm_i915_error_state_buf *m,
+			  const struct intel_engine_cs *engine,
+			  const struct i915_vma_coredump *vma)
 {
 	char out[ASCII85_BUFSZ];
 	int page;
@@ -679,7 +689,7 @@ static void err_print_uc(struct drm_i915_error_state_buf *m,
 
 	intel_uc_fw_dump(&error_uc->guc_fw, &p);
 	intel_uc_fw_dump(&error_uc->huc_fw, &p);
-	print_error_vma(m, NULL, error_uc->guc_log);
+	i915_print_error_vma(m, NULL, error_uc->guc_log);
 }
 
 static void err_free_sgl(struct scatterlist *sgl)
@@ -764,12 +774,16 @@ static void err_print_gt(struct drm_i915_error_state_buf *m,
 		err_printf(m, "  GAM_DONE: 0x%08x\n", gt->gam_done);
 	}
 
-	for (ee = gt->engine; ee; ee = ee->next) {
-		const struct i915_vma_coredump *vma;
+	if (gt->uc->capture) /* error capture was via GuC */
+		error_print_guc_captures(m, gt);
+	else {
+		for (ee = gt->engine; ee; ee = ee->next) {
+			const struct i915_vma_coredump *vma;
 
-		error_print_engine(m, ee);
-		for (vma = ee->vma; vma; vma = vma->next)
-			print_error_vma(m, ee->engine, vma);
+			error_print_engine(m, ee);
+			for (vma = ee->vma; vma; vma = vma->next)
+				i915_print_error_vma(m, ee->engine, vma);
+		}
 	}
 
 	if (gt->uc)
@@ -1140,7 +1154,7 @@ static void gt_record_fences(struct intel_gt_coredump *gt)
 	gt->nfence = i;
 }
 
-static void engine_record_registers(struct intel_engine_coredump *ee)
+static void engine_record_registers_execlist(struct intel_engine_coredump *ee)
 {
 	const struct intel_engine_cs *engine = ee->engine;
 	struct drm_i915_private *i915 = engine->i915;
@@ -1384,8 +1398,10 @@ intel_engine_coredump_alloc(struct intel_engine_cs *engine, gfp_t gfp)
 
 	ee->engine = engine;
 
-	engine_record_registers(ee);
-	engine_record_execlists(ee);
+	if (!intel_uc_uses_guc_submission(&engine->gt->uc)) {
+		engine_record_registers_execlist(ee);
+		engine_record_execlists(ee);
+	}
 
 	return ee;
 }
@@ -1558,8 +1574,8 @@ gt_record_uc(struct intel_gt_coredump *gt,
 	return error_uc;
 }
 
-/* Capture all registers which don't fit into another category. */
-static void gt_record_regs(struct intel_gt_coredump *gt)
+/* Capture all global registers which don't fit into another category. */
+static void gt_record_registers_execlist(struct intel_gt_coredump *gt)
 {
 	struct intel_uncore *uncore = gt->_gt->uncore;
 	struct drm_i915_private *i915 = uncore->i915;
@@ -1806,7 +1822,9 @@ intel_gt_coredump_alloc(struct intel_gt *gt, gfp_t gfp)
 	gc->_gt = gt;
 	gc->awake = intel_gt_pm_is_awake(gt);
 
-	gt_record_regs(gc);
+	if (!intel_uc_uses_guc_submission(&gt->uc))
+		gt_record_registers_execlist(gc);
+
 	gt_record_fences(gc);
 
 	return gc;
@@ -1871,6 +1889,11 @@ i915_gpu_coredump(struct intel_gt *gt, intel_engine_mask_t engine_mask)
 		if (INTEL_INFO(i915)->has_gt_uc)
 			error->gt->uc = gt_record_uc(error->gt, compress);
 
+		if (intel_uc_uses_guc_submission(&gt->uc))
+			error->gt->uc->capture = intel_guc_capture_store_ptr(&gt->uc.guc);
+		else
+			error->gt->uc->capture = NULL;
+
 		i915_vma_capture_finish(error->gt, compress);
 
 		error->simulated |= error->gt->simulated;
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.h b/drivers/gpu/drm/i915/i915_gpu_error.h
index b98d8cdbe4f2..b55369b245ee 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.h
+++ b/drivers/gpu/drm/i915/i915_gpu_error.h
@@ -17,6 +17,7 @@
 #include "gt/intel_engine.h"
 #include "gt/intel_gt_types.h"
 #include "gt/uc/intel_uc_fw.h"
+#include "gt/uc/intel_guc_capture.h"
 
 #include "intel_device_info.h"
 
@@ -151,6 +152,7 @@ struct intel_gt_coredump {
 		struct intel_uc_fw guc_fw;
 		struct intel_uc_fw huc_fw;
 		struct i915_vma_coredump *guc_log;
+		struct intel_guc_state_capture *capture;
 	} *uc;
 
 	struct intel_gt_coredump *next;
@@ -216,6 +218,9 @@ struct drm_i915_error_state_buf {
 
 __printf(2, 3)
 void i915_error_printf(struct drm_i915_error_state_buf *e, const char *f, ...);
+void i915_print_error_vma(struct drm_i915_error_state_buf *m,
+			  const struct intel_engine_cs *engine,
+			  const struct i915_vma_coredump *vma);
 
 struct i915_gpu_coredump *i915_gpu_coredump(struct intel_gt *gt,
 					    intel_engine_mask_t engine_mask);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for Add GuC Error Capture Support
  2021-11-22 23:03 ` [Intel-gfx] " Alan Previn
                   ` (7 preceding siblings ...)
  (?)
@ 2021-11-22 23:44 ` Patchwork
  -1 siblings, 0 replies; 52+ messages in thread
From: Patchwork @ 2021-11-22 23:44 UTC (permalink / raw)
  To: Alan Previn; +Cc: intel-gfx

== Series Details ==

Series: Add GuC Error Capture Support
URL   : https://patchwork.freedesktop.org/series/97187/
State : warning

== Summary ==

$ dim checkpatch origin/drm-tip
3b28aa6f3791 drm/i915/guc: Add basic support for error capture lists
-:101: CHECK:PARENTHESIS_ALIGNMENT: Alignment should match open parenthesis
#101: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc.h:396:
+int intel_guc_error_capture_process_msg(struct intel_guc *guc,
+					 const u32 *msg, u32 len);

-:244: ERROR:OPEN_BRACE: open brace '{' following enum go on the same line
#244: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h:289:
+enum
+{

-:340: CHECK:PARENTHESIS_ALIGNMENT: Alignment should match open parenthesis
#340: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c:4001:
+int intel_guc_error_capture_process_msg(struct intel_guc *guc,
+					 const u32 *msg, u32 len)

total: 1 errors, 0 warnings, 2 checks, 289 lines checked
5d7b43376e65 drm/i915/guc: Update GuC ADS size for error capture lists
-:315: WARNING:FILE_PATH_CHANGES: added, moved or deleted file(s), does MAINTAINERS need updating?
#315: 
new file mode 100644

total: 0 errors, 1 warnings, 0 checks, 578 lines checked
55daaa2dfbb8 drm/i915/guc: Populate XE_LP register lists for GuC error state capture.
-:35: ERROR:COMPLEX_MACRO: Macros with complex values should be enclosed in parentheses
#35: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c:22:
+#define COMMON_GEN12BASE_GLOBAL() \
+	{GEN12_FAULT_TLB_DATA0,    0,      0, "GEN12_FAULT_TLB_DATA0"}, \
+	{GEN12_FAULT_TLB_DATA1,    0,      0, "GEN12_FAULT_TLB_DATA1"}, \
+	{FORCEWAKE_MT,             0,      0, "FORCEWAKE_MT"}, \
+	{DERRMR,                   0,      0, "DERRMR"}, \
+	{GEN12_AUX_ERR_DBG,        0,      0, "GEN12_AUX_ERR_DBG"}, \
+	{GEN12_GAM_DONE,           0,      0, "GEN12_GAM_DONE"}, \
+	{GEN11_GUC_SG_INTR_ENABLE, 0,      0, "GEN11_GUC_SG_INTR_ENABLE"}, \
+	{GEN11_CRYPTO_RSVD_INTR_ENABLE, 0, 0, "GEN11_CRYPTO_RSVD_INTR_ENABLE"}, \
+	{GEN11_GUNIT_CSME_INTR_ENABLE, 0,  0, "GEN11_GUNIT_CSME_INTR_ENABLE"}, \
+	{GEN12_RING_FAULT_REG,     0,      0, "GEN12_RING_FAULT_REG"}

-:47: ERROR:COMPLEX_MACRO: Macros with complex values should be enclosed in parentheses
#47: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c:34:
+#define COMMON_GEN12BASE_ENGINE_INSTANCE() \
+	{RING_PSMI_CTL(0),         0,      0, "RING_PSMI_CTL"}, \
+	{RING_ESR(0),              0,      0, "RING_ESR"}, \
+	{RING_ESR(0),              0,      0, "RING_ESR"}, \
+	{RING_DMA_FADD(0),         0,      0, "RING_DMA_FADD_LOW32"}, \
+	{RING_DMA_FADD_UDW(0),     0,      0, "RING_DMA_FADD_UP32"}, \
+	{RING_IPEIR(0),            0,      0, "RING_IPEIR"}, \
+	{RING_IPEHR(0),            0,      0, "RING_IPEHR"}, \
+	{RING_INSTPS(0),           0,      0, "RING_INSTPS"}, \
+	{RING_BBADDR(0),           0,      0, "RING_BBADDR_LOW32"}, \
+	{RING_BBADDR_UDW(0),       0,      0, "RING_BBADDR_UP32"}, \
+	{RING_BBSTATE(0),          0,      0, "RING_BBSTATE"}, \
+	{CCID(0),                  0,      0, "CCID"}, \
+	{RING_ACTHD(0),            0,      0, "RING_ACTHD_LOW32"}, \
+	{RING_ACTHD_UDW(0),        0,      0, "RING_ACTHD_UP32"}, \
+	{RING_INSTPM(0),           0,      0, "RING_INSTPM"}, \
+	{RING_NOPID(0),            0,      0, "RING_NOPID"}, \
+	{RING_START(0),            0,      0, "RING_START"}, \
+	{RING_HEAD(0),             0,      0, "RING_HEAD"}, \
+	{RING_TAIL(0),             0,      0, "RING_TAIL"}, \
+	{RING_CTL(0),              0,      0, "RING_CTL"}, \
+	{RING_MI_MODE(0),          0,      0, "RING_MI_MODE"}, \
+	{RING_CONTEXT_CONTROL(0),  0,      0, "RING_CONTEXT_CONTROL"}, \
+	{RING_INSTDONE(0),         0,      0, "RING_INSTDONE"}, \
+	{RING_HWS_PGA(0),          0,      0, "RING_HWS_PGA"}, \
+	{RING_MODE_GEN7(0),        0,      0, "RING_MODE_GEN7"}, \
+	{GEN8_RING_PDP_LDW(0, 0),  0,      0, "GEN8_RING_PDP0_LDW"}, \
+	{GEN8_RING_PDP_UDW(0, 0),  0,      0, "GEN8_RING_PDP0_UDW"}, \
+	{GEN8_RING_PDP_LDW(0, 1),  0,      0, "GEN8_RING_PDP1_LDW"}, \
+	{GEN8_RING_PDP_UDW(0, 1),  0,      0, "GEN8_RING_PDP1_UDW"}, \
+	{GEN8_RING_PDP_LDW(0, 2),  0,      0, "GEN8_RING_PDP2_LDW"}, \
+	{GEN8_RING_PDP_UDW(0, 2),  0,      0, "GEN8_RING_PDP2_UDW"}, \
+	{GEN8_RING_PDP_LDW(0, 3),  0,      0, "GEN8_RING_PDP3_LDW"}, \
+	{GEN8_RING_PDP_UDW(0, 3),  0,      0, "GEN8_RING_PDP3_UDW"}

-:85: ERROR:COMPLEX_MACRO: Macros with complex values should be enclosed in parentheses
#85: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c:72:
+#define COMMON_GEN12BASE_RENDER() \
+	{GEN7_SC_INSTDONE,         0,      0, "GEN7_SC_INSTDONE"}, \
+	{GEN12_SC_INSTDONE_EXTRA,  0,      0, "GEN12_SC_INSTDONE_EXTRA"}, \
+	{GEN12_SC_INSTDONE_EXTRA2, 0,      0, "GEN12_SC_INSTDONE_EXTRA2"}

-:90: ERROR:COMPLEX_MACRO: Macros with complex values should be enclosed in parentheses
#90: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c:77:
+#define COMMON_GEN12BASE_VEC() \
+	{GEN11_VCS_VECS_INTR_ENABLE, 0,    0, "GEN11_VCS_VECS_INTR_ENABLE"}, \
+	{GEN12_SFC_DONE(0),        0,      0, "GEN12_SFC_DONE0"}, \
+	{GEN12_SFC_DONE(1),        0,      0, "GEN12_SFC_DONE1"}, \
+	{GEN12_SFC_DONE(2),        0,      0, "GEN12_SFC_DONE2"}, \
+	{GEN12_SFC_DONE(3),        0,      0, "GEN12_SFC_DONE3"}

-:174: CHECK:LINE_SPACING: Please don't use multiple blank lines
#174: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c:147:
+
+

-:228: WARNING:LONG_LINE: line length of 102 exceeds 100 columns
#228: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c:151:
+	MAKE_GCAP_REGLIST_DESCR(gen12lp_rc_class_regs, INDEX_PF, TYPE_ENGINE_CLASS, GUC_RENDER_CLASS),

-:229: WARNING:LONG_LINE: line length of 104 exceeds 100 columns
#229: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c:152:
+	MAKE_GCAP_REGLIST_DESCR(gen12lp_rc_inst_regs, INDEX_PF, TYPE_ENGINE_INSTANCE, GUC_RENDER_CLASS),

-:230: WARNING:LONG_LINE: line length of 101 exceeds 100 columns
#230: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c:153:
+	MAKE_GCAP_REGLIST_DESCR(gen12lp_vd_class_regs, INDEX_PF, TYPE_ENGINE_CLASS, GUC_VIDEO_CLASS),

-:231: WARNING:LONG_LINE: line length of 103 exceeds 100 columns
#231: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c:154:
+	MAKE_GCAP_REGLIST_DESCR(gen12lp_vd_inst_regs, INDEX_PF, TYPE_ENGINE_INSTANCE, GUC_VIDEO_CLASS),

-:232: WARNING:LONG_LINE: line length of 109 exceeds 100 columns
#232: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c:155:
+	MAKE_GCAP_REGLIST_DESCR(gen12lp_vec_class_regs, INDEX_PF, TYPE_ENGINE_CLASS, GUC_VIDEOENHANCE_CLASS),

-:233: WARNING:LONG_LINE: line length of 111 exceeds 100 columns
#233: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c:156:
+	MAKE_GCAP_REGLIST_DESCR(gen12lp_vec_inst_regs, INDEX_PF, TYPE_ENGINE_INSTANCE, GUC_VIDEOENHANCE_CLASS),

-:234: WARNING:LONG_LINE: line length of 104 exceeds 100 columns
#234: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c:157:
+	MAKE_GCAP_REGLIST_DESCR(gen12lp_blt_class_regs, INDEX_PF, TYPE_ENGINE_CLASS, GUC_BLITTER_CLASS),

-:235: WARNING:LONG_LINE: line length of 106 exceeds 100 columns
#235: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c:158:
+	MAKE_GCAP_REGLIST_DESCR(gen12lp_blt_inst_regs, INDEX_PF, TYPE_ENGINE_INSTANCE, GUC_BLITTER_CLASS),

-:243: WARNING:LONG_LINE: line length of 103 exceeds 100 columns
#243: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c:165:
+guc_capture_get_ext_list_ptr(struct __guc_mmio_reg_descr_group * lists, u32 owner, u32 type, u32 class)

-:243: ERROR:POINTER_LOCATION: "foo * bar" should be "foo *bar"
#243: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c:165:
+guc_capture_get_ext_list_ptr(struct __guc_mmio_reg_descr_group * lists, u32 owner, u32 type, u32 class)

-:245: ERROR:SPACING: space required before the open brace '{'
#245: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c:167:
+	while(lists->list){

-:245: ERROR:SPACING: space required before the open parenthesis '('
#245: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c:167:
+	while(lists->list){

-:253: CHECK:UNNECESSARY_PARENTHESES: Unnecessary parentheses around lists->ext
#253: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c:175:
+	return &(lists->ext);

-:256: ERROR:POINTER_LOCATION: "foo * bar" should be "foo *bar"
#256: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c:178:
+void guc_capture_clear_ext_regs(struct __guc_mmio_reg_descr_group * lists)

-:258: ERROR:SPACING: space required before the open brace '{'
#258: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c:180:
+	while(lists->list){

-:258: ERROR:SPACING: space required before the open parenthesis '('
#258: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c:180:
+	while(lists->list){

-:260: WARNING:NEEDLESS_IF: kfree(NULL) is safe and this check is probably not required
#260: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c:182:
+		if (lists->ext) {
+			kfree(lists->ext);

-:266: WARNING:RETURN_VOID: void function return statements are not generally useful
#266: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c:188:
+	return;
+}

-:270: ERROR:POINTER_LOCATION: "foo * bar" should be "foo *bar"
#270: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c:192:
+			     struct __guc_mmio_reg_descr_group * lists)

-:323: WARNING:BLOCK_COMMENT_STYLE: Block comments should align the * on each line
#323: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c:244:
+		/*
+		* For certain engine classes, there are slice and subslice

-:350: ERROR:POINTER_LOCATION: "foo * bar" should be "foo *bar"
#350: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h:29:
+	struct __guc_mmio_reg_descr * ext;

total: 12 errors, 12 warnings, 2 checks, 335 lines checked
f20287422c25 drm/i915/guc: Add GuC's error state capture output structures.
-:24: WARNING:LONG_LINE_COMMENT: line length of 101 exceeds 100 columns
#24: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h:35:
+		#define GUC_CAPTURE_DATAHDR_SRC_TYPE GENMASK(3, 0) /* as per enum guc_capture_type */

-:25: WARNING:LONG_LINE_COMMENT: line length of 103 exceeds 100 columns
#25: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h:36:
+		#define GUC_CAPTURE_DATAHDR_SRC_CLASS GENMASK(7, 4) /* as per GUC_MAX_ENGINE_CLASSES */

total: 0 errors, 2 warnings, 0 checks, 41 lines checked
c9432d4f7471 drm/i915/guc: Update GuC's log-buffer-state access for error capture.
f56313a6f708 drm/i915/guc: Copy new GuC error capture logs upon G2H notification.
-:224: WARNING:OOM_MESSAGE: Possible unnecessary 'out of memory' message
#224: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c:567:
+	if (!guc->capture.out_store.addr) {
+		drm_warn(&dev_priv->drm, "GuC-capture interim-store populated at init!\n");

-:357: ERROR:TRAILING_WHITESPACE: trailing whitespace
#357: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_log.h:71:
+ $

-:357: WARNING:LEADING_SPACE: please, no spaces at the start of a line
#357: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_log.h:71:
+ $

total: 1 errors, 2 warnings, 0 checks, 347 lines checked
cd819348e021 drm/i915/guc: Print the GuC error capture output register list.
-:128: WARNING:SUSPECT_CODE_INDENT: suspect code indent for conditional statements (8, 20)
#128: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c:500:
+	if (CIRC_CNT_TO_END(store->head, store->tail, store->size) >= fullsize) {
+		    memcpy(group, (store->addr + store->tail), fullsize);

-:154: WARNING:SUSPECT_CODE_INDENT: suspect code indent for conditional statements (8, 20)
#154: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c:526:
+	if (CIRC_CNT_TO_END(store->head, store->tail, store->size) >= fullsize) {
+		    memcpy(data, (store->addr + store->tail), fullsize);

-:183: WARNING:SUSPECT_CODE_INDENT: suspect code indent for conditional statements (8, 20)
#183: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c:555:
+	if (CIRC_CNT_TO_END(store->head, store->tail, store->size) >= fullsize) {
+		    memcpy(reg, (store->addr + store->tail), fullsize);

-:210: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'b' - possible side-effects?
#210: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c:582:
+#define guc_capt_err_print(a, b, ...) \
+	do { \
+		drm_warn(a, __VA_ARGS__); \
+		if (b) \
+			i915_error_printf(b, __VA_ARGS__); \
+	} while (0)

-:217: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'b' - possible side-effects?
#217: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c:589:
+#define guc_capt_err_print(a, b, ...) \
+	do { \
+		if (b) \
+			i915_error_printf(b, __VA_ARGS__); \
+	} while (0)

-:257: CHECK:LINE_SPACING: Please don't use multiple blank lines
#257: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c:629:
+
+

-:261: ERROR:MULTISTATEMENT_MACRO_USE_DO_WHILE: Macros with multiple statements should be enclosed in a do - while loop
#261: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c:633:
+#define GCAP_PRINT_INTEL_ENG_INFO(i915, ebuf, eng) \
+	PRINT(&(i915->drm), (ebuf), "    i915-Eng-Name: %s\n", (eng)->name); \
+	PRINT(&(i915->drm), (ebuf), "    i915-Eng-Class: 0x%02x\n", (eng)->class); \
+	PRINT(&(i915->drm), (ebuf), "    i915-Eng-Inst: 0x%02x\n", (eng)->instance); \
+	PRINT(&(i915->drm), (ebuf), "    i915-Eng-LogicalMask: 0x%08x\n", (eng)->logical_mask)

-:261: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'i915' - possible side-effects?
#261: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c:633:
+#define GCAP_PRINT_INTEL_ENG_INFO(i915, ebuf, eng) \
+	PRINT(&(i915->drm), (ebuf), "    i915-Eng-Name: %s\n", (eng)->name); \
+	PRINT(&(i915->drm), (ebuf), "    i915-Eng-Class: 0x%02x\n", (eng)->class); \
+	PRINT(&(i915->drm), (ebuf), "    i915-Eng-Inst: 0x%02x\n", (eng)->instance); \
+	PRINT(&(i915->drm), (ebuf), "    i915-Eng-LogicalMask: 0x%08x\n", (eng)->logical_mask)

-:261: CHECK:MACRO_ARG_PRECEDENCE: Macro argument 'i915' may be better as '(i915)' to avoid precedence issues
#261: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c:633:
+#define GCAP_PRINT_INTEL_ENG_INFO(i915, ebuf, eng) \
+	PRINT(&(i915->drm), (ebuf), "    i915-Eng-Name: %s\n", (eng)->name); \
+	PRINT(&(i915->drm), (ebuf), "    i915-Eng-Class: 0x%02x\n", (eng)->class); \
+	PRINT(&(i915->drm), (ebuf), "    i915-Eng-Inst: 0x%02x\n", (eng)->instance); \
+	PRINT(&(i915->drm), (ebuf), "    i915-Eng-LogicalMask: 0x%08x\n", (eng)->logical_mask)

-:261: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'ebuf' - possible side-effects?
#261: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c:633:
+#define GCAP_PRINT_INTEL_ENG_INFO(i915, ebuf, eng) \
+	PRINT(&(i915->drm), (ebuf), "    i915-Eng-Name: %s\n", (eng)->name); \
+	PRINT(&(i915->drm), (ebuf), "    i915-Eng-Class: 0x%02x\n", (eng)->class); \
+	PRINT(&(i915->drm), (ebuf), "    i915-Eng-Inst: 0x%02x\n", (eng)->instance); \
+	PRINT(&(i915->drm), (ebuf), "    i915-Eng-LogicalMask: 0x%08x\n", (eng)->logical_mask)

-:261: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'eng' - possible side-effects?
#261: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c:633:
+#define GCAP_PRINT_INTEL_ENG_INFO(i915, ebuf, eng) \
+	PRINT(&(i915->drm), (ebuf), "    i915-Eng-Name: %s\n", (eng)->name); \
+	PRINT(&(i915->drm), (ebuf), "    i915-Eng-Class: 0x%02x\n", (eng)->class); \
+	PRINT(&(i915->drm), (ebuf), "    i915-Eng-Inst: 0x%02x\n", (eng)->instance); \
+	PRINT(&(i915->drm), (ebuf), "    i915-Eng-LogicalMask: 0x%08x\n", (eng)->logical_mask)

-:262: CHECK:UNNECESSARY_PARENTHESES: Unnecessary parentheses around i915->drm
#262: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c:634:
+	PRINT(&(i915->drm), (ebuf), "    i915-Eng-Name: %s\n", (eng)->name); \

-:263: CHECK:UNNECESSARY_PARENTHESES: Unnecessary parentheses around i915->drm
#263: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c:635:
+	PRINT(&(i915->drm), (ebuf), "    i915-Eng-Class: 0x%02x\n", (eng)->class); \

-:264: CHECK:UNNECESSARY_PARENTHESES: Unnecessary parentheses around i915->drm
#264: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c:636:
+	PRINT(&(i915->drm), (ebuf), "    i915-Eng-Inst: 0x%02x\n", (eng)->instance); \

-:265: CHECK:UNNECESSARY_PARENTHESES: Unnecessary parentheses around i915->drm
#265: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c:637:
+	PRINT(&(i915->drm), (ebuf), "    i915-Eng-LogicalMask: 0x%08x\n", (eng)->logical_mask)

-:267: ERROR:MULTISTATEMENT_MACRO_USE_DO_WHILE: Macros with multiple statements should be enclosed in a do - while loop
#267: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c:639:
+#define GCAP_PRINT_GUC_INST_INFO(i915, ebuf, data) \
+	PRINT(&(i915->drm), (ebuf), "    LRCA: 0x%08x\n", (data).lrca); \
+	PRINT(&(i915->drm), (ebuf), "    GuC-ContextID: 0x%08x\n", (data).guc_ctx_id); \
+	PRINT(&(i915->drm), (ebuf), "    GuC-Engine-Instance: 0x%08x\n", \
+	      (uint32_t) FIELD_GET(GUC_CAPTURE_DATAHDR_SRC_INSTANCE, (data).info));

-:267: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'i915' - possible side-effects?
#267: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c:639:
+#define GCAP_PRINT_GUC_INST_INFO(i915, ebuf, data) \
+	PRINT(&(i915->drm), (ebuf), "    LRCA: 0x%08x\n", (data).lrca); \
+	PRINT(&(i915->drm), (ebuf), "    GuC-ContextID: 0x%08x\n", (data).guc_ctx_id); \
+	PRINT(&(i915->drm), (ebuf), "    GuC-Engine-Instance: 0x%08x\n", \
+	      (uint32_t) FIELD_GET(GUC_CAPTURE_DATAHDR_SRC_INSTANCE, (data).info));

-:267: CHECK:MACRO_ARG_PRECEDENCE: Macro argument 'i915' may be better as '(i915)' to avoid precedence issues
#267: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c:639:
+#define GCAP_PRINT_GUC_INST_INFO(i915, ebuf, data) \
+	PRINT(&(i915->drm), (ebuf), "    LRCA: 0x%08x\n", (data).lrca); \
+	PRINT(&(i915->drm), (ebuf), "    GuC-ContextID: 0x%08x\n", (data).guc_ctx_id); \
+	PRINT(&(i915->drm), (ebuf), "    GuC-Engine-Instance: 0x%08x\n", \
+	      (uint32_t) FIELD_GET(GUC_CAPTURE_DATAHDR_SRC_INSTANCE, (data).info));

-:267: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'ebuf' - possible side-effects?
#267: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c:639:
+#define GCAP_PRINT_GUC_INST_INFO(i915, ebuf, data) \
+	PRINT(&(i915->drm), (ebuf), "    LRCA: 0x%08x\n", (data).lrca); \
+	PRINT(&(i915->drm), (ebuf), "    GuC-ContextID: 0x%08x\n", (data).guc_ctx_id); \
+	PRINT(&(i915->drm), (ebuf), "    GuC-Engine-Instance: 0x%08x\n", \
+	      (uint32_t) FIELD_GET(GUC_CAPTURE_DATAHDR_SRC_INSTANCE, (data).info));

-:267: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'data' - possible side-effects?
#267: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c:639:
+#define GCAP_PRINT_GUC_INST_INFO(i915, ebuf, data) \
+	PRINT(&(i915->drm), (ebuf), "    LRCA: 0x%08x\n", (data).lrca); \
+	PRINT(&(i915->drm), (ebuf), "    GuC-ContextID: 0x%08x\n", (data).guc_ctx_id); \
+	PRINT(&(i915->drm), (ebuf), "    GuC-Engine-Instance: 0x%08x\n", \
+	      (uint32_t) FIELD_GET(GUC_CAPTURE_DATAHDR_SRC_INSTANCE, (data).info));

-:267: WARNING:TRAILING_SEMICOLON: macros should not use a trailing semicolon
#267: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c:639:
+#define GCAP_PRINT_GUC_INST_INFO(i915, ebuf, data) \
+	PRINT(&(i915->drm), (ebuf), "    LRCA: 0x%08x\n", (data).lrca); \
+	PRINT(&(i915->drm), (ebuf), "    GuC-ContextID: 0x%08x\n", (data).guc_ctx_id); \
+	PRINT(&(i915->drm), (ebuf), "    GuC-Engine-Instance: 0x%08x\n", \
+	      (uint32_t) FIELD_GET(GUC_CAPTURE_DATAHDR_SRC_INSTANCE, (data).info));

-:268: CHECK:UNNECESSARY_PARENTHESES: Unnecessary parentheses around i915->drm
#268: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c:640:
+	PRINT(&(i915->drm), (ebuf), "    LRCA: 0x%08x\n", (data).lrca); \

-:269: CHECK:UNNECESSARY_PARENTHESES: Unnecessary parentheses around i915->drm
#269: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c:641:
+	PRINT(&(i915->drm), (ebuf), "    GuC-ContextID: 0x%08x\n", (data).guc_ctx_id); \

-:270: CHECK:UNNECESSARY_PARENTHESES: Unnecessary parentheses around i915->drm
#270: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c:642:
+	PRINT(&(i915->drm), (ebuf), "    GuC-Engine-Instance: 0x%08x\n", \

-:271: CHECK:SPACING: No space is necessary after a cast
#271: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c:643:
+	      (uint32_t) FIELD_GET(GUC_CAPTURE_DATAHDR_SRC_INSTANCE, (data).info));

-:273: ERROR:MULTISTATEMENT_MACRO_USE_DO_WHILE: Macros with multiple statements should be enclosed in a do - while loop
#273: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c:645:
+#define GCAP_PRINT_INTEL_CTX_INFO(i915, ebuf, ce) \
+	PRINT(&(i915->drm), (ebuf), "    i915-Ctx-Flags: 0x%016lx\n", (ce)->flags); \
+	PRINT(&(i915->drm), (ebuf), "    i915-Ctx-GuC-ID: 0x%016x\n", (ce)->guc_id.id);

-:273: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'i915' - possible side-effects?
#273: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c:645:
+#define GCAP_PRINT_INTEL_CTX_INFO(i915, ebuf, ce) \
+	PRINT(&(i915->drm), (ebuf), "    i915-Ctx-Flags: 0x%016lx\n", (ce)->flags); \
+	PRINT(&(i915->drm), (ebuf), "    i915-Ctx-GuC-ID: 0x%016x\n", (ce)->guc_id.id);

-:273: CHECK:MACRO_ARG_PRECEDENCE: Macro argument 'i915' may be better as '(i915)' to avoid precedence issues
#273: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c:645:
+#define GCAP_PRINT_INTEL_CTX_INFO(i915, ebuf, ce) \
+	PRINT(&(i915->drm), (ebuf), "    i915-Ctx-Flags: 0x%016lx\n", (ce)->flags); \
+	PRINT(&(i915->drm), (ebuf), "    i915-Ctx-GuC-ID: 0x%016x\n", (ce)->guc_id.id);

-:273: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'ebuf' - possible side-effects?
#273: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c:645:
+#define GCAP_PRINT_INTEL_CTX_INFO(i915, ebuf, ce) \
+	PRINT(&(i915->drm), (ebuf), "    i915-Ctx-Flags: 0x%016lx\n", (ce)->flags); \
+	PRINT(&(i915->drm), (ebuf), "    i915-Ctx-GuC-ID: 0x%016x\n", (ce)->guc_id.id);

-:273: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'ce' - possible side-effects?
#273: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c:645:
+#define GCAP_PRINT_INTEL_CTX_INFO(i915, ebuf, ce) \
+	PRINT(&(i915->drm), (ebuf), "    i915-Ctx-Flags: 0x%016lx\n", (ce)->flags); \
+	PRINT(&(i915->drm), (ebuf), "    i915-Ctx-GuC-ID: 0x%016x\n", (ce)->guc_id.id);

-:273: WARNING:TRAILING_SEMICOLON: macros should not use a trailing semicolon
#273: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c:645:
+#define GCAP_PRINT_INTEL_CTX_INFO(i915, ebuf, ce) \
+	PRINT(&(i915->drm), (ebuf), "    i915-Ctx-Flags: 0x%016lx\n", (ce)->flags); \
+	PRINT(&(i915->drm), (ebuf), "    i915-Ctx-GuC-ID: 0x%016x\n", (ce)->guc_id.id);

-:274: CHECK:UNNECESSARY_PARENTHESES: Unnecessary parentheses around i915->drm
#274: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c:646:
+	PRINT(&(i915->drm), (ebuf), "    i915-Ctx-Flags: 0x%016lx\n", (ce)->flags); \

-:275: CHECK:UNNECESSARY_PARENTHESES: Unnecessary parentheses around i915->drm
#275: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c:647:
+	PRINT(&(i915->drm), (ebuf), "    i915-Ctx-GuC-ID: 0x%016x\n", (ce)->guc_id.id);

-:368: WARNING:BRACES: braces {} are not necessary for any arm of this statement
#368: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c:740:
+				if (eng) {
[...]
+				} else {
[...]

-:374: WARNING:BRACES: braces {} are not necessary for any arm of this statement
#374: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c:746:
+				if (ce) {
[...]
+				} else {
[...]

-:398: CHECK:BRACES: Blank lines aren't necessary before a close brace '}'
#398: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c:770:
+
+		}

-:401: WARNING:LINE_SPACING: Missing a blank line after declarations
#401: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c:773:
+			const struct intel_engine_coredump *ee;
+			for (ee = gt->engine; ee; ee = ee->next) {

-:403: WARNING:LINE_SPACING: Missing a blank line after declarations
#403: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c:775:
+				const struct i915_vma_coredump *vma;
+				if (ee->engine == eng) {

-:522: CHECK:BRACES: Unbalanced braces around else statement
#522: FILE: drivers/gpu/drm/i915/i915_gpu_error.c:779:
+	else {

total: 3 errors, 9 warnings, 27 checks, 562 lines checked



^ permalink raw reply	[flat|nested] 52+ messages in thread

* [Intel-gfx] ✗ Fi.CI.SPARSE: warning for Add GuC Error Capture Support
  2021-11-22 23:03 ` [Intel-gfx] " Alan Previn
                   ` (8 preceding siblings ...)
  (?)
@ 2021-11-22 23:45 ` Patchwork
  -1 siblings, 0 replies; 52+ messages in thread
From: Patchwork @ 2021-11-22 23:45 UTC (permalink / raw)
  To: Alan Previn; +Cc: intel-gfx

== Series Details ==

Series: Add GuC Error Capture Support
URL   : https://patchwork.freedesktop.org/series/97187/
State : warning

== Summary ==

$ dim sparse --fast origin/drm-tip
Sparse version: v0.6.2
Fast mode used, each commit won't be checked separately.



^ permalink raw reply	[flat|nested] 52+ messages in thread

* [Intel-gfx] ✗ Fi.CI.BAT: failure for Add GuC Error Capture Support
  2021-11-22 23:03 ` [Intel-gfx] " Alan Previn
                   ` (9 preceding siblings ...)
  (?)
@ 2021-11-23  0:16 ` Patchwork
  -1 siblings, 0 replies; 52+ messages in thread
From: Patchwork @ 2021-11-23  0:16 UTC (permalink / raw)
  To: Alan Previn; +Cc: intel-gfx

[-- Attachment #1: Type: text/plain, Size: 10431 bytes --]

== Series Details ==

Series: Add GuC Error Capture Support
URL   : https://patchwork.freedesktop.org/series/97187/
State : failure

== Summary ==

CI Bug Log - changes from CI_DRM_10916 -> Patchwork_21662
====================================================

Summary
-------

  **FAILURE**

  Serious unknown changes coming with Patchwork_21662 absolutely need to be
  verified manually.
  
  If you think the reported changes have nothing to do with the changes
  introduced in Patchwork_21662, please notify your bug team to allow them
  to document this new failure mode, which will reduce false positives in CI.

  External URL: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21662/index.html

Participating hosts (42 -> 34)
------------------------------

  Missing    (8): bat-dg1-6 fi-tgl-u2 bat-dg1-5 fi-bsw-cyan bat-adlp-6 bat-adlp-4 bat-jsl-2 bat-jsl-1 

Possible new issues
-------------------

  Here are the unknown changes that may have been introduced in Patchwork_21662:

### IGT changes ###

#### Possible regressions ####

  * igt@debugfs_test@read_all_entries:
    - fi-elk-e7500:       [PASS][1] -> [INCOMPLETE][2]
   [1]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10916/fi-elk-e7500/igt@debugfs_test@read_all_entries.html
   [2]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21662/fi-elk-e7500/igt@debugfs_test@read_all_entries.html
    - fi-ivb-3770:        [PASS][3] -> [INCOMPLETE][4]
   [3]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10916/fi-ivb-3770/igt@debugfs_test@read_all_entries.html
   [4]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21662/fi-ivb-3770/igt@debugfs_test@read_all_entries.html
    - fi-snb-2600:        [PASS][5] -> [INCOMPLETE][6]
   [5]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10916/fi-snb-2600/igt@debugfs_test@read_all_entries.html
   [6]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21662/fi-snb-2600/igt@debugfs_test@read_all_entries.html
    - fi-bdw-gvtdvm:      [PASS][7] -> [INCOMPLETE][8]
   [7]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10916/fi-bdw-gvtdvm/igt@debugfs_test@read_all_entries.html
   [8]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21662/fi-bdw-gvtdvm/igt@debugfs_test@read_all_entries.html
    - fi-bsw-kefka:       [PASS][9] -> [INCOMPLETE][10]
   [9]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10916/fi-bsw-kefka/igt@debugfs_test@read_all_entries.html
   [10]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21662/fi-bsw-kefka/igt@debugfs_test@read_all_entries.html
    - fi-blb-e6850:       [PASS][11] -> [INCOMPLETE][12]
   [11]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10916/fi-blb-e6850/igt@debugfs_test@read_all_entries.html
   [12]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21662/fi-blb-e6850/igt@debugfs_test@read_all_entries.html
    - fi-bwr-2160:        [PASS][13] -> [INCOMPLETE][14]
   [13]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10916/fi-bwr-2160/igt@debugfs_test@read_all_entries.html
   [14]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21662/fi-bwr-2160/igt@debugfs_test@read_all_entries.html
    - fi-bdw-5557u:       [PASS][15] -> [INCOMPLETE][16]
   [15]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10916/fi-bdw-5557u/igt@debugfs_test@read_all_entries.html
   [16]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21662/fi-bdw-5557u/igt@debugfs_test@read_all_entries.html
    - fi-snb-2520m:       [PASS][17] -> [INCOMPLETE][18]
   [17]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10916/fi-snb-2520m/igt@debugfs_test@read_all_entries.html
   [18]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21662/fi-snb-2520m/igt@debugfs_test@read_all_entries.html
    - fi-bsw-nick:        [PASS][19] -> [INCOMPLETE][20]
   [19]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10916/fi-bsw-nick/igt@debugfs_test@read_all_entries.html
   [20]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21662/fi-bsw-nick/igt@debugfs_test@read_all_entries.html
    - fi-ilk-650:         [PASS][21] -> [INCOMPLETE][22]
   [21]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10916/fi-ilk-650/igt@debugfs_test@read_all_entries.html
   [22]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21662/fi-ilk-650/igt@debugfs_test@read_all_entries.html
    - fi-bsw-n3050:       [PASS][23] -> [INCOMPLETE][24]
   [23]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10916/fi-bsw-n3050/igt@debugfs_test@read_all_entries.html
   [24]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21662/fi-bsw-n3050/igt@debugfs_test@read_all_entries.html
    - fi-hsw-4770:        [PASS][25] -> [INCOMPLETE][26]
   [25]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10916/fi-hsw-4770/igt@debugfs_test@read_all_entries.html
   [26]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21662/fi-hsw-4770/igt@debugfs_test@read_all_entries.html

  
Known issues
------------

  Here are the changes found in Patchwork_21662 that come from known issues:

### IGT changes ###

#### Issues hit ####

  * igt@runner@aborted:
    - fi-snb-2600:        NOTRUN -> [FAIL][27] ([i915#2426] / [i915#4312])
   [27]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21662/fi-snb-2600/igt@runner@aborted.html
    - fi-ilk-650:         NOTRUN -> [FAIL][28] ([i915#2426] / [i915#4312])
   [28]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21662/fi-ilk-650/igt@runner@aborted.html
    - fi-bsw-kefka:       NOTRUN -> [FAIL][29] ([i915#3690] / [i915#4312])
   [29]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21662/fi-bsw-kefka/igt@runner@aborted.html
    - fi-bdw-gvtdvm:      NOTRUN -> [FAIL][30] ([i915#2426] / [i915#4312])
   [30]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21662/fi-bdw-gvtdvm/igt@runner@aborted.html
    - fi-snb-2520m:       NOTRUN -> [FAIL][31] ([i915#2426] / [i915#4312])
   [31]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21662/fi-snb-2520m/igt@runner@aborted.html
    - fi-bdw-5557u:       NOTRUN -> [FAIL][32] ([i915#2426] / [i915#4312])
   [32]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21662/fi-bdw-5557u/igt@runner@aborted.html
    - fi-bwr-2160:        NOTRUN -> [FAIL][33] ([i915#4312])
   [33]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21662/fi-bwr-2160/igt@runner@aborted.html
    - fi-hsw-4770:        NOTRUN -> [FAIL][34] ([i915#4312])
   [34]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21662/fi-hsw-4770/igt@runner@aborted.html
    - fi-kbl-guc:         NOTRUN -> [FAIL][35] ([i915#2426] / [i915#3363])
   [35]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21662/fi-kbl-guc/igt@runner@aborted.html
    - fi-rkl-guc:         NOTRUN -> [FAIL][36] ([i915#2426])
   [36]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21662/fi-rkl-guc/igt@runner@aborted.html
    - fi-ivb-3770:        NOTRUN -> [FAIL][37] ([i915#2426] / [i915#4312])
   [37]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21662/fi-ivb-3770/igt@runner@aborted.html
    - fi-elk-e7500:       NOTRUN -> [FAIL][38] ([i915#2426] / [i915#4312])
   [38]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21662/fi-elk-e7500/igt@runner@aborted.html
    - fi-cfl-guc:         NOTRUN -> [FAIL][39] ([i915#2426] / [i915#3363])
   [39]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21662/fi-cfl-guc/igt@runner@aborted.html
    - fi-skl-guc:         NOTRUN -> [FAIL][40] ([i915#2426] / [i915#3363])
   [40]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21662/fi-skl-guc/igt@runner@aborted.html
    - fi-bsw-n3050:       NOTRUN -> [FAIL][41] ([i915#3690] / [i915#4312])
   [41]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21662/fi-bsw-n3050/igt@runner@aborted.html
    - fi-blb-e6850:       NOTRUN -> [FAIL][42] ([i915#2403] / [i915#2426] / [i915#4312])
   [42]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21662/fi-blb-e6850/igt@runner@aborted.html

  
#### Warnings ####

  * igt@runner@aborted:
    - fi-bsw-nick:        [FAIL][43] ([fdo#109271] / [i915#1436] / [i915#3428] / [i915#4312]) -> [FAIL][44] ([i915#3690] / [i915#4312])
   [43]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10916/fi-bsw-nick/igt@runner@aborted.html
   [44]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21662/fi-bsw-nick/igt@runner@aborted.html
    - fi-apl-guc:         [FAIL][45] ([i915#2426] / [i915#3363] / [i915#4312]) -> [FAIL][46] ([i915#2426] / [i915#3363])
   [45]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10916/fi-apl-guc/igt@runner@aborted.html
   [46]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21662/fi-apl-guc/igt@runner@aborted.html

  
  {name}: This element is suppressed. This means it is ignored when computing
          the status of the difference (SUCCESS, WARNING, or FAILURE).

  [fdo#109271]: https://bugs.freedesktop.org/show_bug.cgi?id=109271
  [fdo#109315]: https://bugs.freedesktop.org/show_bug.cgi?id=109315
  [i915#1436]: https://gitlab.freedesktop.org/drm/intel/issues/1436
  [i915#2403]: https://gitlab.freedesktop.org/drm/intel/issues/2403
  [i915#2426]: https://gitlab.freedesktop.org/drm/intel/issues/2426
  [i915#2575]: https://gitlab.freedesktop.org/drm/intel/issues/2575
  [i915#3363]: https://gitlab.freedesktop.org/drm/intel/issues/3363
  [i915#3428]: https://gitlab.freedesktop.org/drm/intel/issues/3428
  [i915#3690]: https://gitlab.freedesktop.org/drm/intel/issues/3690
  [i915#4312]: https://gitlab.freedesktop.org/drm/intel/issues/4312


Build changes
-------------

  * Linux: CI_DRM_10916 -> Patchwork_21662

  CI-20190529: 20190529
  CI_DRM_10916: 876217519d26774d843128cc66640ae501a5c38d @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_6286: cdcbf81f734fdb1d102e84490e49e9fec23760cd @ https://gitlab.freedesktop.org/drm/igt-gpu-tools.git
  Patchwork_21662: cd819348e021724894dafe2e13547570b425a687 @ git://anongit.freedesktop.org/gfx-ci/linux


== Linux commits ==

cd819348e021 drm/i915/guc: Print the GuC error capture output register list.
f56313a6f708 drm/i915/guc: Copy new GuC error capture logs upon G2H notification.
c9432d4f7471 drm/i915/guc: Update GuC's log-buffer-state access for error capture.
f20287422c25 drm/i915/guc: Add GuC's error state capture output structures.
55daaa2dfbb8 drm/i915/guc: Populate XE_LP register lists for GuC error state capture.
5d7b43376e65 drm/i915/guc: Update GuC ADS size for error capture lists
3b28aa6f3791 drm/i915/guc: Add basic support for error capture lists

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21662/index.html

[-- Attachment #2: Type: text/html, Size: 13567 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Intel-gfx] [RFC 7/7] drm/i915/guc: Print the GuC error capture output register list.
  2021-11-22 23:04 ` [Intel-gfx] [RFC 7/7] drm/i915/guc: Print the GuC error capture output register list Alan Previn
@ 2021-11-23  0:25   ` Teres Alexis, Alan Previn
  2021-12-08  0:22   ` Matthew Brost
  1 sibling, 0 replies; 52+ messages in thread
From: Teres Alexis, Alan Previn @ 2021-11-23  0:25 UTC (permalink / raw)
  To: intel-gfx

I realize I missed checkpatch on patch-7 before send-mail. Will fix that on next rev.
Patch #2 also has checkpatch failures which I was aware of - I'm still wresting with how to instance those register tables in a clean readable way using.

...alan

-----Original Message-----
From: Teres Alexis, Alan Previn <alan.previn.teres.alexis@intel.com> 
Sent: Monday, November 22, 2021 3:04 PM
To: intel-gfx@lists.freedesktop.org
Cc: Teres Alexis, Alan Previn <alan.previn.teres.alexis@intel.com>
Subject: [RFC 7/7] drm/i915/guc: Print the GuC error capture output register list.

Print the GuC captured error state register list (offsets and values) when gpu_coredump_state printout is invoked.

Also, since the GuC can report multiple engine class registers in a single notification event, parse the captured data (appearing as a stream of structures) to identify multiple captures of different 'engine-capture-group-outputs'.

Finally, for each 'engine-capture-group-output', identify the last running context and print already-identified vma's so that user's output report follows the same layout as execlist submission. I.e.
engine1-registers, engine1-context-vmas, engine2-registers, engine2-context-vmas, etc.

Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_engine_cs.c     |   4 +-
 .../gpu/drm/i915/gt/uc/intel_guc_capture.c    | 389 ++++++++++++++++++
 .../gpu/drm/i915/gt/uc/intel_guc_capture.h    |   6 +
 drivers/gpu/drm/i915/i915_gpu_error.c         |  53 ++-
 drivers/gpu/drm/i915/i915_gpu_error.h         |   5 +
 5 files changed, 439 insertions(+), 18 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index 332756036007..5806e2c05212 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -1595,9 +1595,7 @@ static void intel_engine_print_registers(struct intel_engine_cs *engine,
 		drm_printf(m, "\tIPEHR: 0x%08x\n", ENGINE_READ(engine, IPEHR));
 	}
 
-	if (intel_engine_uses_guc(engine)) {
-		/* nothing to print yet */
-	} else if (HAS_EXECLISTS(dev_priv)) {
+	if (HAS_EXECLISTS(dev_priv) && !intel_engine_uses_guc(engine)) {
 		struct i915_request * const *port, *rq;
 		const u32 *hws =
 			&engine->status_page.addr[I915_HWS_CSB_BUF0_INDEX];
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
index 459fe81c77ae..998ce1b474ed 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
@@ -415,8 +415,389 @@ int intel_guc_capture_output_min_size_est(struct intel_guc *guc)
  *                   L--> intel_guc_capture_store_snapshot
  *                        L--> queue(__guc_capture_store_snapshot_work)
  *                             Copies from B (head->tail) into C
+ *
+ * GUC --> notify context reset:
+ * -----------------------------
+ *     --> G2H CONTEXT RESET
+ *                   L--> guc_handle_context_reset --> i915_capture_error_state
+ *                    --> i915_gpu_coredump --> intel_guc_capture_store_ptr
+ *                        L--> keep a ptr to capture_store in
+ *                             i915_gpu_coredump struct.
+ *
+ * User Sysfs / Debugfs
+ * --------------------
+ *      --> i915_gpu_coredump_copy_to_buffer->
+ *                   L--> err_print_to_sgl --> err_print_gt
+ *                        L--> error_print_guc_captures
+ *                             L--> loop: intel_guc_capture_out_print_next_group
+ *
  */
 
+#if IS_ENABLED(CONFIG_DRM_I915_CAPTURE_ERROR)
+
+static char *
+guc_capture_register_string(const struct intel_guc *guc, u32 owner, u32 type,
+			    u32 class, u32 id, u32 offset)
+{
+	struct __guc_mmio_reg_descr_group *reglists = guc->capture.reglists;
+	struct __guc_mmio_reg_descr_group *match;
+	int num_regs, j = 0;
+
+	if (!reglists)
+		return NULL;
+
+	match = guc_capture_get_one_list(reglists, owner, type, id);
+	if (match) {
+		num_regs = match->num_regs;
+		while (num_regs--) {
+			if (offset == match->list[j].reg.reg)
+				return match->list[j].regname;
+			++j;
+		}
+	}
+
+	return NULL;
+}
+
+static inline int
+guc_capture_store_remove_dw(struct guc_capture_out_store *store, u32 *bytesleft,
+			    u32 *dw)
+{
+	int tries = 2;
+	int avail = 0;
+	u32 *src_data;
+
+	if (!*bytesleft)
+		return 0;
+
+	while (tries--) {
+		avail = CIRC_CNT_TO_END(store->head, store->tail, store->size);
+		if (avail >= sizeof(u32)) {
+			src_data = (u32 *)(store->addr + store->tail);
+			*dw = *src_data;
+			store->tail = (store->tail + 4) & (store->size - 1);
+			*bytesleft -= 4;
+			return 4;
+		}
+		if (store->tail == (store->size - 1) && store->head > 0)
+			store->tail = 0;
+	}
+
+	return 0;
+}
+
+static int
+capture_store_get_group_hdr(const struct intel_guc *guc,
+			    struct guc_capture_out_store *store, u32 *bytesleft,
+			    struct intel_guc_capture_out_group_header *group) {
+	int read = 0;
+	int fullsize = sizeof(struct intel_guc_capture_out_group_header);
+
+	if (fullsize > *bytesleft)
+		return -1;
+
+	if (CIRC_CNT_TO_END(store->head, store->tail, store->size) >= fullsize) {
+		    memcpy(group, (store->addr + store->tail), fullsize);
+			store->tail = (store->tail + fullsize) & (store->size - 1);
+			*bytesleft -= fullsize;
+		return 0;
+	}
+
+	read += guc_capture_store_remove_dw(store, bytesleft, &group->reserved1);
+	read += guc_capture_store_remove_dw(store, bytesleft, &group->info);
+	if (read != sizeof(*group))
+		return -1;
+
+	return 0;
+}
+
+static int
+capture_store_get_data_hdr(const struct intel_guc *guc,
+			   struct guc_capture_out_store *store, u32 *bytesleft,
+			   struct intel_guc_capture_out_data_header *data) {
+	int read = 0;
+	int fullsize = sizeof(struct intel_guc_capture_out_data_header);
+
+	if (fullsize > *bytesleft)
+		return -1;
+
+	if (CIRC_CNT_TO_END(store->head, store->tail, store->size) >= fullsize) {
+		    memcpy(data, (store->addr + store->tail), fullsize);
+			store->tail = (store->tail + fullsize) & (store->size - 1);
+			*bytesleft -= fullsize;
+		return 0;
+	}
+
+	read += guc_capture_store_remove_dw(store, bytesleft, &data->reserved1);
+	read += guc_capture_store_remove_dw(store, bytesleft, &data->info);
+	read += guc_capture_store_remove_dw(store, bytesleft, &data->lrca);
+	read += guc_capture_store_remove_dw(store, bytesleft, &data->guc_ctx_id);
+	read += guc_capture_store_remove_dw(store, bytesleft, &data->num_mmios);
+	if (read != sizeof(*data))
+		return -1;
+
+	return 0;
+}
+
+static int
+capture_store_get_register(const struct intel_guc *guc,
+			   struct guc_capture_out_store *store, u32 *bytesleft,
+			   struct guc_mmio_reg *reg)
+{
+	int read = 0;
+	int fullsize = sizeof(struct guc_mmio_reg);
+
+	if (fullsize > *bytesleft)
+		return -1;
+
+	if (CIRC_CNT_TO_END(store->head, store->tail, store->size) >= fullsize) {
+		    memcpy(reg, (store->addr + store->tail), fullsize);
+			store->tail = (store->tail + fullsize) & (store->size - 1);
+			*bytesleft -= fullsize;
+		return 0;
+	}
+
+	read += guc_capture_store_remove_dw(store, bytesleft, &reg->offset);
+	read += guc_capture_store_remove_dw(store, bytesleft, &reg->value);
+	read += guc_capture_store_remove_dw(store, bytesleft, &reg->flags);
+	read += guc_capture_store_remove_dw(store, bytesleft, &reg->mask);
+	if (read != sizeof(*reg))
+		return -1;
+
+	return 0;
+}
+
+static void guc_capture_store_drop_data(struct guc_capture_out_store *store,
+					unsigned long sampled_head)
+{
+	if (sampled_head == 0)
+		store->tail = store->size - 1;
+	else
+		store->tail = sampled_head - 1;
+}
+
+#ifdef CONFIG_DRM_I915_DEBUG_GUC
+#define guc_capt_err_print(a, b, ...) \
+	do { \
+		drm_warn(a, __VA_ARGS__); \
+		if (b) \
+			i915_error_printf(b, __VA_ARGS__); \
+	} while (0)
+#else
+#define guc_capt_err_print(a, b, ...) \
+	do { \
+		if (b) \
+			i915_error_printf(b, __VA_ARGS__); \
+	} while (0)
+#endif
+
+static struct intel_engine_cs *
+guc_lookup_engine(struct intel_guc *guc, u8 guc_class, u8 instance) {
+	struct intel_gt *gt = guc_to_gt(guc);
+	u8 engine_class = guc_class_to_engine_class(guc_class);
+
+	/* Class index is checked in class converter */
+	GEM_BUG_ON(instance > MAX_ENGINE_INSTANCE);
+
+	return gt->engine_class[engine_class][instance];
+}
+
+static inline struct intel_context *
+guc_context_lookup(struct intel_guc *guc, u32 guc_ctx_id) {
+	struct intel_context *ce;
+
+	if (unlikely(guc_ctx_id >= GUC_MAX_LRC_DESCRIPTORS)) {
+		drm_dbg(&guc_to_gt(guc)->i915->drm, "Invalid guc_ctx_id 0x%X, max 0x%X",
+			guc_ctx_id, GUC_MAX_LRC_DESCRIPTORS);
+		return NULL;
+	}
+
+	ce = xa_load(&guc->context_lookup, guc_ctx_id);
+	if (unlikely(!ce)) {
+		drm_dbg(&guc_to_gt(guc)->i915->drm, "Context is NULL, guc_ctx_id 0x%X",
+			guc_ctx_id);
+		return NULL;
+	}
+
+	return ce;
+}
+
+
+#define PRINT guc_capt_err_print
+#define REGSTR guc_capture_register_string
+
+#define GCAP_PRINT_INTEL_ENG_INFO(i915, ebuf, eng) \
+	PRINT(&(i915->drm), (ebuf), "    i915-Eng-Name: %s\n", (eng)->name); \
+	PRINT(&(i915->drm), (ebuf), "    i915-Eng-Class: 0x%02x\n", (eng)->class); \
+	PRINT(&(i915->drm), (ebuf), "    i915-Eng-Inst: 0x%02x\n", (eng)->instance); \
+	PRINT(&(i915->drm), (ebuf), "    i915-Eng-LogicalMask: 0x%08x\n", (eng)->logical_mask)
+
+#define GCAP_PRINT_GUC_INST_INFO(i915, ebuf, data) \
+	PRINT(&(i915->drm), (ebuf), "    LRCA: 0x%08x\n", (data).lrca); \
+	PRINT(&(i915->drm), (ebuf), "    GuC-ContextID: 0x%08x\n", (data).guc_ctx_id); \
+	PRINT(&(i915->drm), (ebuf), "    GuC-Engine-Instance: 0x%08x\n", \
+	      (uint32_t) FIELD_GET(GUC_CAPTURE_DATAHDR_SRC_INSTANCE, 
+(data).info));
+
+#define GCAP_PRINT_INTEL_CTX_INFO(i915, ebuf, ce) \
+	PRINT(&(i915->drm), (ebuf), "    i915-Ctx-Flags: 0x%016lx\n", (ce)->flags); \
+	PRINT(&(i915->drm), (ebuf), "    i915-Ctx-GuC-ID: 0x%016x\n", (ce)->guc_id.id);
+
+int intel_guc_capture_out_print_next_group(struct drm_i915_error_state_buf *ebuf,
+					   struct intel_gt_coredump *gt)
+{
+	/* constant qualifier for data-pointers we shouldn't change mid of error dump printing */
+	struct intel_guc_state_capture *cap = gt->uc->capture;
+	struct intel_guc *guc = container_of(cap, struct intel_guc, capture);
+	struct drm_i915_private *i915 = (container_of(guc, struct intel_gt,
+						   uc.guc))->i915;
+	struct guc_capture_out_store *store = &cap->out_store;
+	struct guc_capture_out_store tmpstore;
+	struct intel_guc_capture_out_group_header group;
+	struct intel_guc_capture_out_data_header data;
+	struct guc_mmio_reg reg;
+	const char *grptypestr[GUC_STATE_CAPTURE_GROUP_TYPE_MAX] = {"full-capture",
+								    "partial-capture"};
+	const char *datatypestr[GUC_CAPTURE_LIST_TYPE_MAX] = {"Global", "Engine-Class",
+							      "Engine-Instance"};
+	enum guc_capture_group_types grptype;
+	enum guc_capture_type datatype;
+	int numgrps, numregs;
+	char *str, noname[16];
+	u32 numbytes, engineclass, eng_inst, ret = 0;
+	struct intel_engine_cs *eng;
+	struct intel_context *ce;
+
+	if (!cap->enabled)
+		return -ENODEV;
+
+	mutex_lock(&store->lock);
+	smp_mb(); /* sync to get the latest head for the moment */
+	/* NOTE1: make a copy of store so we dont have to deal with a changing lower bound of
+	 *        occupied-space in this circular buffer.
+	 * NOTE2: Higher up the stack from here, we keep calling this function in a loop to
+	 *        reading more capture groups as they appear (as the lower bound of occupied-space
+	 *        changes) until this circ-buf is empty.
+	 */
+	memcpy(&tmpstore, store, sizeof(tmpstore));
+
+	PRINT(&i915->drm, ebuf, "global --- GuC Error Capture\n");
+
+	numbytes = CIRC_CNT(tmpstore.head, tmpstore.tail, tmpstore.size);
+	if (!numbytes) {
+		PRINT(&i915->drm, ebuf, "GuC capture stream empty!\n");
+		ret = -ENODATA;
+		goto unlock;
+	}
+	/* everything in GuC output structures are dword aligned */
+	if (numbytes & 0x3) {
+		PRINT(&i915->drm, ebuf, "GuC capture stream unaligned!\n");
+		ret = -EIO;
+		goto unlock;
+	}
+
+	if (capture_store_get_group_hdr(guc, &tmpstore, &numbytes, &group)) {
+		PRINT(&i915->drm, ebuf, "GuC capture error getting next group-header!\n");
+		ret = -EIO;
+		goto unlock;
+	}
+
+	PRINT(&i915->drm, ebuf, "NumCaptures:  0x%08x\n", (uint32_t)
+	      FIELD_GET(GUC_CAPTURE_GRPHDR_SRC_NUMCAPTURES, group.info));
+	grptype = FIELD_GET(GUC_CAPTURE_GRPHDR_SRC_CAPTURE_TYPE, group.info);
+	PRINT(&i915->drm, ebuf, "Coverage:  0x%08x = %s\n", grptype,
+	      grptypestr[grptype % GUC_STATE_CAPTURE_GROUP_TYPE_MAX]);
+
+	numgrps = FIELD_GET(GUC_CAPTURE_GRPHDR_SRC_NUMCAPTURES, group.info);
+	while (numgrps--) {
+		eng = NULL;
+		ce = NULL;
+
+		if (capture_store_get_data_hdr(guc, &tmpstore, &numbytes, &data)) {
+			PRINT(&i915->drm, ebuf, "GuC capture error on next data-header!\n");
+			ret = -EIO;
+			goto unlock;
+		}
+		datatype = FIELD_GET(GUC_CAPTURE_DATAHDR_SRC_TYPE, data.info);
+		PRINT(&i915->drm, ebuf, "  RegListType: %s\n",
+		      datatypestr[datatype % GUC_CAPTURE_LIST_TYPE_MAX]);
+
+		engineclass = FIELD_GET(GUC_CAPTURE_DATAHDR_SRC_CLASS, data.info);
+		if (datatype != GUC_CAPTURE_LIST_TYPE_GLOBAL) {
+			PRINT(&i915->drm, ebuf, "    GuC-Engine-Class: %d\n",
+			      engineclass);
+			if (engineclass <= GUC_LAST_ENGINE_CLASS)
+				PRINT(&i915->drm, ebuf, "    i915-Eng-Class: %d\n",
+				      guc_class_to_engine_class(engineclass));
+
+			if (datatype == GUC_CAPTURE_LIST_TYPE_ENGINE_INSTANCE) {
+				GCAP_PRINT_GUC_INST_INFO(i915, ebuf, data);
+				eng_inst = FIELD_GET(GUC_CAPTURE_DATAHDR_SRC_INSTANCE, data.info);
+				eng = guc_lookup_engine(guc, engineclass, eng_inst);
+				if (eng) {
+					GCAP_PRINT_INTEL_ENG_INFO(i915, ebuf, eng);
+				} else {
+					PRINT(&i915->drm, ebuf, "    i915-Eng-Lookup Fail!\n");
+				}
+				ce = guc_context_lookup(guc, data.guc_ctx_id);
+				if (ce) {
+					GCAP_PRINT_INTEL_CTX_INFO(i915, ebuf, ce);
+				} else {
+					PRINT(&i915->drm, ebuf, "    i915-Ctx-Lookup Fail!\n");
+				}
+			}
+		}
+		numregs = FIELD_GET(GUC_CAPTURE_DATAHDR_NUM_MMIOS, data.num_mmios);
+		PRINT(&i915->drm, ebuf, "     NumRegs: 0x%08x\n", numregs);
+
+		while (numregs--) {
+			if (capture_store_get_register(guc, &tmpstore, &numbytes, &reg)) {
+				PRINT(&i915->drm, ebuf, "Error getting next register!\n");
+				ret = -EIO;
+				goto unlock;
+			}
+			str = REGSTR(guc, GUC_CAPTURE_LIST_INDEX_PF, datatype,
+				     engineclass, 0, reg.offset);
+			if (!str) {
+				snprintf(noname, sizeof(noname), "REG-0x%08x", reg.offset);
+				str = noname;
+			}
+			PRINT(&i915->drm, ebuf, "      %s:  0x%08x\n", str, reg.value);
+
+		}
+		if (eng) {
+			const struct intel_engine_coredump *ee;
+			for (ee = gt->engine; ee; ee = ee->next) {
+				const struct i915_vma_coredump *vma;
+				if (ee->engine == eng) {
+					for (vma = ee->vma; vma; vma = vma->next)
+						i915_print_error_vma(ebuf, ee->engine, vma);
+				}
+			}
+		}
+	}
+
+	store->tail = tmpstore.tail;
+unlock:
+	/* if we have a stream error, just drop everything */
+	if (ret == -EIO) {
+		drm_warn(&i915->drm, "Skip GuC capture data print due to stream error\n");
+		guc_capture_store_drop_data(store, tmpstore.head);
+	}
+
+	mutex_unlock(&store->lock);
+
+	return ret;
+}
+
+#undef REGSTR
+#undef PRINT
+
+#endif //CONFIG_DRM_I915_DEBUG_GUC
+
 static void guc_capture_store_insert(struct intel_guc *guc, struct guc_capture_out_store *store,
 				     unsigned char *new_data, size_t bytes)  { @@ -587,6 +968,14 @@ void intel_guc_capture_destroy(struct intel_guc *guc)
 	guc_capture_clear_ext_regs(guc->capture.reglists);
 }
 
+struct intel_guc_state_capture *
+intel_guc_capture_store_ptr(struct intel_guc *guc) {
+	if (!guc->capture.enabled)
+		return NULL;
+	return &guc->capture;
+}
+
 int intel_guc_capture_init(struct intel_guc *guc)  {
 	struct drm_i915_private *dev_priv = (guc_to_gt(guc))->i915; diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h
index 7031de12f3a1..7d048a8f6efe 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h
@@ -88,6 +88,11 @@ struct intel_guc_state_capture {
 	struct work_struct store_work;
 };
 
+struct drm_i915_error_state_buf;
+struct intel_gt_coredump;
+
+int intel_guc_capture_out_print_next_group(struct drm_i915_error_state_buf *m,
+					   struct intel_gt_coredump *gt);
 void intel_guc_capture_store_snapshot(struct intel_guc *guc);  int intel_guc_capture_list_count(struct intel_guc *guc, u32 owner, u32 type, u32 class,
 				 u16 *num_entries);
@@ -96,6 +101,7 @@ int intel_guc_capture_list_init(struct intel_guc *guc, u32 owner, u32 type, u32  int intel_guc_capture_output_min_size_est(struct intel_guc *guc);  void intel_guc_capture_destroy(struct intel_guc *guc);  void intel_guc_capture_store_snapshot_immediate(struct intel_guc *guc);
+struct intel_guc_state_capture *intel_guc_capture_store_ptr(struct 
+intel_guc *guc);
 int intel_guc_capture_init(struct intel_guc *guc);
 
 #endif /* _INTEL_GUC_CAPTURE_H */
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index 2a2d7643b551..47016059c65d 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -600,6 +600,16 @@ static void error_print_engine(struct drm_i915_error_state_buf *m,
 	error_print_context(m, "  Active context: ", &ee->context);  }
 
+static void error_print_guc_captures(struct drm_i915_error_state_buf *m,
+				     struct intel_gt_coredump *gt)
+{
+	int ret;
+
+	do {
+		ret = intel_guc_capture_out_print_next_group(m, gt);
+	} while (!ret);
+}
+
 void i915_error_printf(struct drm_i915_error_state_buf *e, const char *f, ...)  {
 	va_list args;
@@ -609,9 +619,9 @@ void i915_error_printf(struct drm_i915_error_state_buf *e, const char *f, ...)
 	va_end(args);
 }
 
-static void print_error_vma(struct drm_i915_error_state_buf *m,
-			    const struct intel_engine_cs *engine,
-			    const struct i915_vma_coredump *vma)
+void i915_print_error_vma(struct drm_i915_error_state_buf *m,
+			  const struct intel_engine_cs *engine,
+			  const struct i915_vma_coredump *vma)
 {
 	char out[ASCII85_BUFSZ];
 	int page;
@@ -679,7 +689,7 @@ static void err_print_uc(struct drm_i915_error_state_buf *m,
 
 	intel_uc_fw_dump(&error_uc->guc_fw, &p);
 	intel_uc_fw_dump(&error_uc->huc_fw, &p);
-	print_error_vma(m, NULL, error_uc->guc_log);
+	i915_print_error_vma(m, NULL, error_uc->guc_log);
 }
 
 static void err_free_sgl(struct scatterlist *sgl) @@ -764,12 +774,16 @@ static void err_print_gt(struct drm_i915_error_state_buf *m,
 		err_printf(m, "  GAM_DONE: 0x%08x\n", gt->gam_done);
 	}
 
-	for (ee = gt->engine; ee; ee = ee->next) {
-		const struct i915_vma_coredump *vma;
+	if (gt->uc->capture) /* error capture was via GuC */
+		error_print_guc_captures(m, gt);
+	else {
+		for (ee = gt->engine; ee; ee = ee->next) {
+			const struct i915_vma_coredump *vma;
 
-		error_print_engine(m, ee);
-		for (vma = ee->vma; vma; vma = vma->next)
-			print_error_vma(m, ee->engine, vma);
+			error_print_engine(m, ee);
+			for (vma = ee->vma; vma; vma = vma->next)
+				i915_print_error_vma(m, ee->engine, vma);
+		}
 	}
 
 	if (gt->uc)
@@ -1140,7 +1154,7 @@ static void gt_record_fences(struct intel_gt_coredump *gt)
 	gt->nfence = i;
 }
 
-static void engine_record_registers(struct intel_engine_coredump *ee)
+static void engine_record_registers_execlist(struct 
+intel_engine_coredump *ee)
 {
 	const struct intel_engine_cs *engine = ee->engine;
 	struct drm_i915_private *i915 = engine->i915; @@ -1384,8 +1398,10 @@ intel_engine_coredump_alloc(struct intel_engine_cs *engine, gfp_t gfp)
 
 	ee->engine = engine;
 
-	engine_record_registers(ee);
-	engine_record_execlists(ee);
+	if (!intel_uc_uses_guc_submission(&engine->gt->uc)) {
+		engine_record_registers_execlist(ee);
+		engine_record_execlists(ee);
+	}
 
 	return ee;
 }
@@ -1558,8 +1574,8 @@ gt_record_uc(struct intel_gt_coredump *gt,
 	return error_uc;
 }
 
-/* Capture all registers which don't fit into another category. */ -static void gt_record_regs(struct intel_gt_coredump *gt)
+/* Capture all global registers which don't fit into another category. 
+*/ static void gt_record_registers_execlist(struct intel_gt_coredump 
+*gt)
 {
 	struct intel_uncore *uncore = gt->_gt->uncore;
 	struct drm_i915_private *i915 = uncore->i915; @@ -1806,7 +1822,9 @@ intel_gt_coredump_alloc(struct intel_gt *gt, gfp_t gfp)
 	gc->_gt = gt;
 	gc->awake = intel_gt_pm_is_awake(gt);
 
-	gt_record_regs(gc);
+	if (!intel_uc_uses_guc_submission(&gt->uc))
+		gt_record_registers_execlist(gc);
+
 	gt_record_fences(gc);
 
 	return gc;
@@ -1871,6 +1889,11 @@ i915_gpu_coredump(struct intel_gt *gt, intel_engine_mask_t engine_mask)
 		if (INTEL_INFO(i915)->has_gt_uc)
 			error->gt->uc = gt_record_uc(error->gt, compress);
 
+		if (intel_uc_uses_guc_submission(&gt->uc))
+			error->gt->uc->capture = intel_guc_capture_store_ptr(&gt->uc.guc);
+		else
+			error->gt->uc->capture = NULL;
+
 		i915_vma_capture_finish(error->gt, compress);
 
 		error->simulated |= error->gt->simulated; diff --git a/drivers/gpu/drm/i915/i915_gpu_error.h b/drivers/gpu/drm/i915/i915_gpu_error.h
index b98d8cdbe4f2..b55369b245ee 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.h
+++ b/drivers/gpu/drm/i915/i915_gpu_error.h
@@ -17,6 +17,7 @@
 #include "gt/intel_engine.h"
 #include "gt/intel_gt_types.h"
 #include "gt/uc/intel_uc_fw.h"
+#include "gt/uc/intel_guc_capture.h"
 
 #include "intel_device_info.h"
 
@@ -151,6 +152,7 @@ struct intel_gt_coredump {
 		struct intel_uc_fw guc_fw;
 		struct intel_uc_fw huc_fw;
 		struct i915_vma_coredump *guc_log;
+		struct intel_guc_state_capture *capture;
 	} *uc;
 
 	struct intel_gt_coredump *next;
@@ -216,6 +218,9 @@ struct drm_i915_error_state_buf {
 
 __printf(2, 3)
 void i915_error_printf(struct drm_i915_error_state_buf *e, const char *f, ...);
+void i915_print_error_vma(struct drm_i915_error_state_buf *m,
+			  const struct intel_engine_cs *engine,
+			  const struct i915_vma_coredump *vma);
 
 struct i915_gpu_coredump *i915_gpu_coredump(struct intel_gt *gt,
 					    intel_engine_mask_t engine_mask);
--
2.25.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [Intel-gfx] ✗ Fi.CI.BUILD: failure for Add GuC Error Capture Support (rev2)
  2021-11-22 23:03 ` [Intel-gfx] " Alan Previn
                   ` (10 preceding siblings ...)
  (?)
@ 2021-11-23  0:40 ` Patchwork
  -1 siblings, 0 replies; 52+ messages in thread
From: Patchwork @ 2021-11-23  0:40 UTC (permalink / raw)
  To: Teres Alexis, Alan Previn; +Cc: intel-gfx

== Series Details ==

Series: Add GuC Error Capture Support (rev2)
URL   : https://patchwork.freedesktop.org/series/97187/
State : failure

== Summary ==

Applying: drm/i915/guc: Add basic support for error capture lists
Applying: drm/i915/guc: Update GuC ADS size for error capture lists
Applying: drm/i915/guc: Populate XE_LP register lists for GuC error state capture.
Applying: drm/i915/guc: Add GuC's error state capture output structures.
Applying: drm/i915/guc: Update GuC's log-buffer-state access for error capture.
Applying: drm/i915/guc: Copy new GuC error capture logs upon G2H notification.
Applying: drm/i915/guc: Print the GuC error capture output register list.
error: corrupt patch at line 429
error: could not build fake ancestor
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0007 drm/i915/guc: Print the GuC error capture output register list.
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Intel-gfx] [RFC 3/7] drm/i915/guc: Populate XE_LP register lists for GuC error state capture.
  2021-11-22 23:03 ` [Intel-gfx] [RFC 3/7] drm/i915/guc: Populate XE_LP register lists for GuC error state capture Alan Previn
@ 2021-11-23  1:59   ` kernel test robot
  2021-11-23 21:55   ` Michal Wajdeczko
  1 sibling, 0 replies; 52+ messages in thread
From: kernel test robot @ 2021-11-23  1:59 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 2132 bytes --]

Hi Alan,

[FYI, it's a private test report for your RFC patch.]
[auto build test WARNING on drm-tip/drm-tip]
[also build test WARNING on next-20211118]
[cannot apply to drm-intel/for-linux-next drm-exynos/exynos-drm-next drm/drm-next tegra-drm/drm/tegra/for-next airlied/drm-next v5.16-rc2]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Alan-Previn/Add-GuC-Error-Capture-Support/20211123-070418
base:   git://anongit.freedesktop.org/drm/drm-tip drm-tip
config: i386-debian-10.3 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce (this is a W=1 build):
        # https://github.com/0day-ci/linux/commit/5dee80d5080bd11d7abcba6d1d0cf442547105de
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Alan-Previn/Add-GuC-Error-Capture-Support/20211123-070418
        git checkout 5dee80d5080bd11d7abcba6d1d0cf442547105de
        # save the attached .config to linux build tree
        make W=1 ARCH=i386 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c:178:6: warning: no previous prototype for 'guc_capture_clear_ext_regs' [-Wmissing-prototypes]
     178 | void guc_capture_clear_ext_regs(struct __guc_mmio_reg_descr_group * lists)
         |      ^~~~~~~~~~~~~~~~~~~~~~~~~~


vim +/guc_capture_clear_ext_regs +178 drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c

   177	
 > 178	void guc_capture_clear_ext_regs(struct __guc_mmio_reg_descr_group * lists)
   179	{
   180		while(lists->list){
   181			if (lists->ext) {
   182				kfree(lists->ext);
   183				lists->ext = NULL;
   184			}
   185			++lists;
   186		}
   187		return;
   188	}
   189	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 34286 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Intel-gfx] [RFC 1/7] drm/i915/guc: Add basic support for error capture lists
  2021-11-22 23:03 ` [Intel-gfx] [RFC 1/7] drm/i915/guc: Add basic support for error capture lists Alan Previn
@ 2021-11-23 21:12   ` Michal Wajdeczko
  2021-12-08 18:23     ` Teres Alexis, Alan Previn
  0 siblings, 1 reply; 52+ messages in thread
From: Michal Wajdeczko @ 2021-11-23 21:12 UTC (permalink / raw)
  To: Alan Previn, intel-gfx



On 23.11.2021 00:03, Alan Previn wrote:
> From: John Harrison <John.C.Harrison@Intel.com>
...

> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 77fbcd8730ee..0bfc92b1b982 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -4003,6 +4003,24 @@ int intel_guc_context_reset_process_msg(struct intel_guc *guc,
>  	return 0;
>  }
>  
> +int intel_guc_error_capture_process_msg(struct intel_guc *guc,
> +					 const u32 *msg, u32 len)
> +{
> +	int status;

likely it should be "u32" as few lines below you're using msg[0];

> +
> +	if (unlikely(len != 1)) {
> +		drm_dbg(&guc_to_gt(guc)->i915->drm, "Invalid length %u", len);

any error returned by the CTB message handler will trigger full dump of
unexpected message - do we really need this unlikely dbg message here ?

> +		return -EPROTO;
> +	}
> +
> +	status = msg[0];
> +	drm_info(&guc_to_gt(guc)->i915->drm, "Got error capture: status = %d", status);

IIRC all notification status are defined in GuC spec in hex, so maybe we
should also print it as %#x ?

-Michal

> +
> +	/* Add extraction of error capture dump */
> +
> +	return 0;
> +}
> +
>  static struct intel_engine_cs *
>  guc_lookup_engine(struct intel_guc *guc, u8 guc_class, u8 instance)
>  {
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Intel-gfx] [RFC 2/7] drm/i915/guc: Update GuC ADS size for error capture lists
  2021-11-22 23:03 ` [Intel-gfx] [RFC 2/7] drm/i915/guc: Update GuC ADS size " Alan Previn
@ 2021-11-23 21:46   ` Michal Wajdeczko
  2021-11-24  9:52     ` Jani Nikula
                       ` (2 more replies)
  2021-11-24 10:06   ` Jani Nikula
  1 sibling, 3 replies; 52+ messages in thread
From: Michal Wajdeczko @ 2021-11-23 21:46 UTC (permalink / raw)
  To: Alan Previn, intel-gfx

Hi,

just few random nits below

-Michal


On 23.11.2021 00:03, Alan Previn wrote:
> Update GuC ADS size allocation to include space for
> the lists of error state capture register descriptors.
> 
> Also, populate the lists of registers we want GuC to report back to
> Host on engine reset events. This list should include global,
> engine-class and engine-instance registers for every engine-class
> type on the current hardware.
> 
> NOTE: Start with a fake table of register lists to layout the
> framework before adding real registers in subsequent patch.
> 
> Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com>
> ---
>  drivers/gpu/drm/i915/Makefile                 |   1 +
>  drivers/gpu/drm/i915/gt/uc/intel_guc.c        |  10 +-
>  drivers/gpu/drm/i915/gt/uc/intel_guc.h        |   5 +
>  drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c    | 176 ++++++++++++-
>  .../gpu/drm/i915/gt/uc/intel_guc_capture.c    | 232 ++++++++++++++++++
>  .../gpu/drm/i915/gt/uc/intel_guc_capture.h    |  47 ++++
>  drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |  19 +-
>  7 files changed, 476 insertions(+), 14 deletions(-)
>  create mode 100644 drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
>  create mode 100644 drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h
> 
> diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
> index 074d6b8edd23..e3c4d5cea4c3 100644
> --- a/drivers/gpu/drm/i915/Makefile
> +++ b/drivers/gpu/drm/i915/Makefile
> @@ -190,6 +190,7 @@ i915-y += gt/uc/intel_uc.o \
>  	  gt/uc/intel_guc_rc.o \
>  	  gt/uc/intel_guc_slpc.o \
>  	  gt/uc/intel_guc_submission.o \
> +	  gt/uc/intel_guc_capture.o \

use alphabetical order

>  	  gt/uc/intel_huc.o \
>  	  gt/uc/intel_huc_debugfs.o \
>  	  gt/uc/intel_huc_fw.o
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.c b/drivers/gpu/drm/i915/gt/uc/intel_guc.c
> index 5cf9ebd2ee55..458f0d248a5a 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.c
> @@ -335,9 +335,14 @@ int intel_guc_init(struct intel_guc *guc)
>  	if (ret)
>  		goto err_fw;
>  
> -	ret = intel_guc_ads_create(guc);
> +	ret = intel_guc_capture_init(guc);
>  	if (ret)
>  		goto err_log;
> +
> +	ret = intel_guc_ads_create(guc);
> +	if (ret)
> +		goto err_capture;
> +
>  	GEM_BUG_ON(!guc->ads_vma);
>  
>  	ret = intel_guc_ct_init(&guc->ct);
> @@ -376,6 +381,8 @@ int intel_guc_init(struct intel_guc *guc)
>  	intel_guc_ct_fini(&guc->ct);
>  err_ads:
>  	intel_guc_ads_destroy(guc);
> +err_capture:
> +	intel_guc_capture_destroy(guc);
>  err_log:
>  	intel_guc_log_destroy(&guc->log);
>  err_fw:
> @@ -403,6 +410,7 @@ void intel_guc_fini(struct intel_guc *guc)
>  	intel_guc_ct_fini(&guc->ct);
>  
>  	intel_guc_ads_destroy(guc);
> +	intel_guc_capture_destroy(guc);
>  	intel_guc_log_destroy(&guc->log);
>  	intel_uc_fw_fini(&guc->fw);
>  }
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> index 9de99772f916..d136c69abe12 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> @@ -16,6 +16,7 @@
>  #include "intel_guc_log.h"
>  #include "intel_guc_reg.h"
>  #include "intel_guc_slpc_types.h"
> +#include "intel_guc_capture.h"

use alphabetical order

>  #include "intel_uc_fw.h"
>  #include "i915_utils.h"
>  #include "i915_vma.h"
> @@ -37,6 +38,8 @@ struct intel_guc {
>  	struct intel_guc_ct ct;
>  	/** @slpc: sub-structure containing SLPC related data and objects */
>  	struct intel_guc_slpc slpc;
> +	/** @capture: the error-state-capture module's data and objects */
> +	struct intel_guc_state_capture capture;
>  
>  	/** @sched_engine: Global engine used to submit requests to GuC */
>  	struct i915_sched_engine *sched_engine;
> @@ -138,6 +141,8 @@ struct intel_guc {
>  	u32 ads_regset_size;
>  	/** @ads_golden_ctxt_size: size of the golden contexts in the ADS */
>  	u32 ads_golden_ctxt_size;
> +	/** @ads_capture_size: size of register lists in the ADS used for error capture */
> +	u32 ads_capture_size;
>  	/** @ads_engine_usage_size: size of engine usage in the ADS */
>  	u32 ads_engine_usage_size;
>  
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
> index 6c81ddd303d3..2780c0fadd01 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
> @@ -10,6 +10,7 @@
>  #include "gt/shmem_utils.h"
>  #include "intel_guc_ads.h"
>  #include "intel_guc_fwif.h"
> +#include "intel_guc_capture.h"

wrong order

>  #include "intel_uc.h"
>  #include "i915_drv.h"
>  
> @@ -71,8 +72,7 @@ static u32 guc_ads_golden_ctxt_size(struct intel_guc *guc)
>  
>  static u32 guc_ads_capture_size(struct intel_guc *guc)
>  {
> -	/* Basic support to init ADS without a proper GuC error capture list */
> -	return PAGE_ALIGN(PAGE_SIZE);
> +	return PAGE_ALIGN(guc->ads_capture_size);
>  }
>  
>  static u32 guc_ads_private_data_size(struct intel_guc *guc)
> @@ -519,24 +519,170 @@ static void guc_init_golden_context(struct intel_guc *guc)
>  	GEM_BUG_ON(guc->ads_golden_ctxt_size != total_size);
>  }
>  
> -static void guc_capture_prep_lists(struct intel_guc *guc, struct __guc_ads_blob *blob)
> +static int
> +guc_fill_reglist(struct intel_guc *guc, struct __guc_ads_blob *blob, int vf, bool enabled,
> +		 int classid, int type, char *typename, u16 *p_numregs, int newnum, u8 **p_virt_ptr,
> +		 u32 *p_blobptr_to_ggtt, u32 *p_ggtt, u32 null_ggtt)

hmm, this does not look good - do we really need all these params ?

>  {
> -	int i, j;
> -	u32 addr_ggtt, offset;
> +	struct drm_i915_private *i915 = guc_to_gt(guc)->i915;
> +	struct guc_debug_capture_list *listnode;
> +	int size = 0;
>  
> -	offset = guc_ads_capture_offset(guc);
> -	addr_ggtt = intel_guc_ggtt_offset(guc, guc->ads_vma) + offset;
> +	if (blob && *p_numregs != newnum) {
> +		if (type == GUC_CAPTURE_LIST_TYPE_GLOBAL)
> +			drm_warn(&i915->drm, "Guc-Cap VF%d-%s num-reg mismatch was=%d now=%d!\n",
> +				 vf, typename, *p_numregs, newnum);
> +		else
> +			drm_warn(&i915->drm, "Guc-Cap VF%d-Class-%d-%s num-reg mismatch was=%d now=%d!\n",
> +				 vf, classid, typename, *p_numregs, newnum);
> +	}
> +	/*
> +	 * For enabled capture lists, we not only need to call capture module to help
> +	 * populate the list-descriptor into the correct ads capture structures, but
> +	 * we also need to increment the virtual pointers and ggtt offsets so that
> +	 * caller has the subsequent gfx memory location.
> +	 */
> +	*p_numregs = newnum;
> +	size = PAGE_ALIGN((sizeof(struct guc_debug_capture_list)) +
> +			  (newnum * sizeof(struct guc_mmio_reg)));
> +	/* if caller hasn't allocated ADS blob, return size and counts, we're done */
> +	if (!blob)
> +		return size;
> +	if (blob) {

redundant

> +		/* if caller allocated ADS blob, populate the capture register descriptors */
> +		if (!newnum) {
> +			*p_blobptr_to_ggtt = null_ggtt;
> +		} else {
> +			/* get ptr and populate header info: */
> +			*p_blobptr_to_ggtt = *p_ggtt;
> +			listnode = (struct guc_debug_capture_list *)*p_virt_ptr;
> +			*p_ggtt += sizeof(struct guc_debug_capture_list);
> +			*p_virt_ptr += sizeof(struct guc_debug_capture_list);
> +			listnode->header.info = FIELD_PREP(GUC_CAPTURELISTHDR_NUMDESCR, *p_numregs);
> +
> +			/* get ptr and populate register descriptor list: */
> +			intel_guc_capture_list_init(guc, vf, type, classid,
> +						    (struct guc_mmio_reg *)*p_virt_ptr,
> +						    *p_numregs);
> +
> +			/* increment ptrs for that header: */
> +			*p_ggtt += size - sizeof(struct guc_debug_capture_list);
> +			*p_virt_ptr += size - sizeof(struct guc_debug_capture_list);
> +		}
> +	}
> +
> +	return size;
> +}
> +
> +static int guc_capture_prep_lists(struct intel_guc *guc, struct __guc_ads_blob *blob)
> +{
> +	struct intel_gt *gt = guc_to_gt(guc);
> +	int i, j, size;
> +	u32 ggtt, null_ggtt, offset, alloc_size = 0;
> +	struct guc_gt_system_info *info, local_info;
> +	struct guc_debug_capture_list *listnode;
> +	struct drm_i915_private *i915 = guc_to_gt(guc)->i915;
> +	struct intel_guc_state_capture *gc = &guc->capture;
> +	u16 tmp = 0;
> +	u8 *ptr = NULL;
> +
> +	if (blob) {
> +		offset = guc_ads_capture_offset(guc);
> +		ggtt = intel_guc_ggtt_offset(guc, guc->ads_vma) + offset;
> +		ptr = ((u8 *)blob) + offset;
> +		info = &blob->system_info;
> +	} else {
> +		memset(&local_info, 0, sizeof(local_info));
> +		info = &local_info;
> +		fill_engine_enable_masks(gt, info);
> +	}
> +
> +	/* first, set aside the first page for a capture_list with zero descriptors */
> +	alloc_size = PAGE_SIZE;
> +	if (blob) {
> +		listnode = (struct guc_debug_capture_list *)ptr;
> +		listnode->header.info = FIELD_PREP(GUC_CAPTURELISTHDR_NUMDESCR, 0);
> +		null_ggtt = ggtt;
> +		ggtt += PAGE_SIZE;
> +		ptr +=  PAGE_SIZE;
> +	}
>  
> -	/* FIXME: Populate a proper capture list */
> +#define COUNT_REGS intel_guc_capture_list_count
> +#define FILL_REGS guc_fill_reglist
> +#define TYPE_CLASS GUC_CAPTURE_LIST_TYPE_ENGINE_CLASS
> +#define TYPE_INSTANCE GUC_CAPTURE_LIST_TYPE_ENGINE_INSTANCE
>  
>  	for (i = 0; i < GUC_CAPTURE_LIST_INDEX_MAX; i++) {
>  		for (j = 0; j < GUC_MAX_ENGINE_CLASSES; j++) {
> -			blob->ads.capture_instance[i][j] = addr_ggtt;
> -			blob->ads.capture_class[i][j] = addr_ggtt;
> +			if (!info->engine_enabled_masks[j]) {
> +				if (gc->num_class_regs[i][j])
> +					drm_warn(&i915->drm, "GuC-Cap VF%d-class-%d "
> +						 "class regs valid mismatch was=%d now=%d!\n",
> +						 i, j, gc->num_class_regs[i][j], tmp);
> +				if (gc->num_instance_regs[i][j])
> +					drm_warn(&i915->drm, "GuC-Cap VF%d-class-%d "
> +						 "inst regs valid mismatch was=%d now=%d!\n",
> +						 i, j, gc->num_instance_regs[i][j], tmp);
> +				gc->num_class_regs[i][j] = 0;
> +				gc->num_instance_regs[i][j] = 0;
> +				if (blob) {
> +					blob->ads.capture_class[i][j] = null_ggtt;
> +					blob->ads.capture_instance[i][j] = null_ggtt;
> +				}
> +			} else {
> +				if (!COUNT_REGS(guc, i, TYPE_CLASS,
> +						guc_class_to_engine_class(j), &tmp)) {
> +					size = FILL_REGS(guc, blob, i, true, j, TYPE_CLASS,
> +							 "class", &gc->num_class_regs[i][j],
> +							 tmp, &ptr,
> +							 &blob->ads.capture_class[i][j],
> +							 &ggtt, null_ggtt);
> +					gc->class_list_size += size;
> +					alloc_size += size;
> +				} else {
> +					gc->num_class_regs[i][j] = 0;
> +					if (blob)
> +						blob->ads.capture_class[i][j] = null_ggtt;
> +				}
> +				if (!COUNT_REGS(guc, i, TYPE_INSTANCE,
> +						guc_class_to_engine_class(j), &tmp)) {
> +					size = FILL_REGS(guc, blob, i, true, j, TYPE_INSTANCE,
> +							 "instance", &gc->num_instance_regs[i][j],
> +							 tmp, &ptr,
> +							 &blob->ads.capture_instance[i][j],
> +							 &ggtt, null_ggtt);
> +					gc->instance_list_size += size;
> +					alloc_size += size;
> +				} else {
> +					gc->num_instance_regs[i][j] = 0;
> +					if (blob)
> +						blob->ads.capture_instance[i][j] = null_ggtt;
> +				}
> +			}
> +		}
> +		if (!COUNT_REGS(guc, i, GUC_CAPTURE_LIST_TYPE_GLOBAL, 0, &tmp)) {
> +			size = FILL_REGS(guc, blob, i, true, 0, GUC_CAPTURE_LIST_TYPE_GLOBAL,
> +					 "global", &gc->num_global_regs[i], tmp, &ptr,
> +					 &blob->ads.capture_global[i], &ggtt, null_ggtt);
> +			gc->global_list_size += size;
> +			alloc_size += size;
> +		} else {
> +			gc->num_global_regs[i] = 0;
> +			if (blob)
> +				blob->ads.capture_global[i] = null_ggtt;
>  		}
> -
> -		blob->ads.capture_global[i] = addr_ggtt;
>  	}
> +
> +#undef COUNT_REGS
> +#undef FILL_REGS
> +#undef TYPE_CLASS
> +#undef TYPE_INSTANCE
> +
> +	if (guc->ads_capture_size && guc->ads_capture_size != PAGE_ALIGN(alloc_size))
> +		drm_warn(&i915->drm, "GuC->ADS->Capture alloc size changed from %d to %d\n",
> +			 guc->ads_capture_size, PAGE_ALIGN(alloc_size));
> +
> +	return PAGE_ALIGN(alloc_size);
>  }
>  
>  static void __guc_ads_init(struct intel_guc *guc)
> @@ -614,6 +760,12 @@ int intel_guc_ads_create(struct intel_guc *guc)
>  		return ret;
>  	guc->ads_golden_ctxt_size = ret;
>  
> +	/* Likewise the capture lists: */
> +	ret = guc_capture_prep_lists(guc, NULL);
> +	if (ret < 0)
> +		return ret;
> +	guc->ads_capture_size = ret;
> +
>  	/* Now the total size can be determined: */
>  	size = guc_ads_blob_size(guc);
>  
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
> new file mode 100644
> index 000000000000..c741c77b7fc8
> --- /dev/null
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
> @@ -0,0 +1,232 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2021-2021 Intel Corporation
> + */
> +
> +#include <drm/drm_print.h>
> +
> +#include "i915_drv.h"
> +#include "i915_drv.h"

duplicated include

> +#include "i915_memcpy.h"
> +#include "gt/intel_gt.h"
> +
> +#include "intel_guc_fwif.h"
> +#include "intel_guc_capture.h"
> +
> +/* Define all device tables of GuC error capture register lists */
> +
> +/********************************* Gen12 LP  *********************************/

didn't we move away from "GEN" naming ?

> +/************** GLOBAL *************/

do we really need all these decorations ?

> +struct __guc_mmio_reg_descr gen12lp_global_regs[] = {
> +	{SWF_ILK(0),               0,      0, "SWF_ILK0"},
> +	/* Add additional register list */

do we need this reminder ?

> +};
> +
> +/********** RENDER/COMPUTE *********/
> +/* Per-Class */
> +struct __guc_mmio_reg_descr gen12lp_rc_class_regs[] = {
> +	{SWF_ILK(0),               0,      0, "SWF_ILK0"},
> +	/* Add additional register list */
> +};
> +
> +/* Per-Engine-Instance */
> +struct __guc_mmio_reg_descr gen12lp_rc_inst_regs[] = {
> +	{SWF_ILK(0),               0,      0, "SWF_ILK0"},
> +	/* Add additional register list */
> +};
> +
> +/************* MEDIA-VD ************/
> +/* Per-Class */
> +struct __guc_mmio_reg_descr gen12lp_vd_class_regs[] = {
> +	{SWF_ILK(0),               0,      0, "SWF_ILK0"},
> +	/* Add additional register list */
> +};
> +
> +/* Per-Engine-Instance */
> +struct __guc_mmio_reg_descr gen12lp_vd_inst_regs[] = {
> +	{SWF_ILK(0),               0,      0, "SWF_ILK0"},
> +	/* Add additional register list */
> +};
> +
> +/************* MEDIA-VEC ***********/
> +/* Per-Class */
> +struct __guc_mmio_reg_descr gen12lp_vec_class_regs[] = {
> +	{SWF_ILK(0),               0,      0, "SWF_ILK0"},
> +	/* Add additional register list */
> +};
> +
> +/* Per-Engine-Instance */
> +struct __guc_mmio_reg_descr gen12lp_vec_inst_regs[] = {
> +	{SWF_ILK(0),               0,      0, "SWF_ILK0"},
> +	/* Add additional register list */
> +};
> +
> +/********** List of lists **********/
> +struct __guc_mmio_reg_descr_group gen12lp_lists[] = {
> +	{
> +		.list = gen12lp_global_regs,
> +		.num_regs = (sizeof(gen12lp_global_regs) / sizeof(struct __guc_mmio_reg_descr)),

ARRAY_SIZE ?

> +		.owner = GUC_CAPTURE_LIST_INDEX_PF,
> +		.type = GUC_CAPTURE_LIST_TYPE_GLOBAL,
> +		.engine = 0
> +	},
> +	{
> +		.list = gen12lp_rc_class_regs,
> +		.num_regs = (sizeof(gen12lp_rc_class_regs) / sizeof(struct __guc_mmio_reg_descr)),
> +		.owner = GUC_CAPTURE_LIST_INDEX_PF,
> +		.type = GUC_CAPTURE_LIST_TYPE_ENGINE_CLASS,
> +		.engine = RENDER_CLASS
> +	},
> +	{
> +		.list = gen12lp_rc_inst_regs,
> +		.num_regs = (sizeof(gen12lp_rc_inst_regs) / sizeof(struct __guc_mmio_reg_descr)),
> +		.owner = GUC_CAPTURE_LIST_INDEX_PF,
> +		.type = GUC_CAPTURE_LIST_TYPE_ENGINE_INSTANCE,
> +		.engine = RENDER_CLASS
> +	},
> +	{
> +		.list = gen12lp_vd_class_regs,
> +		.num_regs = (sizeof(gen12lp_vd_class_regs) / sizeof(struct __guc_mmio_reg_descr)),
> +		.owner = GUC_CAPTURE_LIST_INDEX_PF,
> +		.type = GUC_CAPTURE_LIST_TYPE_ENGINE_CLASS,
> +		.engine = VIDEO_DECODE_CLASS
> +	},
> +	{
> +		.list = gen12lp_vd_inst_regs,
> +		.num_regs = (sizeof(gen12lp_vd_inst_regs) / sizeof(struct __guc_mmio_reg_descr)),
> +		.owner = GUC_CAPTURE_LIST_INDEX_PF,
> +		.type = GUC_CAPTURE_LIST_TYPE_ENGINE_INSTANCE,
> +		.engine = VIDEO_DECODE_CLASS
> +	},
> +	{
> +		.list = gen12lp_vec_class_regs,
> +		.num_regs = (sizeof(gen12lp_vec_class_regs) / sizeof(struct __guc_mmio_reg_descr)),
> +		.owner = GUC_CAPTURE_LIST_INDEX_PF,
> +		.type = GUC_CAPTURE_LIST_TYPE_ENGINE_CLASS,
> +		.engine = VIDEO_ENHANCEMENT_CLASS
> +	},
> +	{
> +		.list = gen12lp_vec_inst_regs,
> +		.num_regs = (sizeof(gen12lp_vec_inst_regs) / sizeof(struct __guc_mmio_reg_descr)),
> +		.owner = GUC_CAPTURE_LIST_INDEX_PF,
> +		.type = GUC_CAPTURE_LIST_TYPE_ENGINE_INSTANCE,
> +		.engine = VIDEO_ENHANCEMENT_CLASS
> +	},
> +	{NULL, 0, 0, 0, 0}
> +};
> +
> +/************ FIXME: Populate tables for other devices in subsequent patch ************/
> +
> +static struct __guc_mmio_reg_descr_group *
> +guc_capture_get_device_reglist(struct drm_i915_private *dev_priv)

in new code we are using "i915" instead of "dev_priv" and since this
function has "guc" prefix it shall rather take "guc" as param:

guc_capture_get_device_reglist(struct intel_guc *guc)
{
	struct drm_i915_private *i915 = guc_to_gt(guc)->i915;
	...


> +{
> +	if (IS_TIGERLAKE(dev_priv) || IS_ROCKETLAKE(dev_priv) ||
> +	    IS_ALDERLAKE_S(dev_priv) || IS_ALDERLAKE_P(dev_priv)) {
> +		return gen12lp_lists;
> +	}
> +
> +	return NULL;
> +}
> +
> +static inline struct __guc_mmio_reg_descr_group *
> +guc_capture_get_one_list(struct __guc_mmio_reg_descr_group *reglists, u32 owner, u32 type, u32 id)
> +{
> +	int i = 0;
> +
> +	if (!reglists)
> +		return NULL;
> +	while (reglists[i].list) {
> +		if (reglists[i].owner == owner &&
> +		    reglists[i].type == type) {
> +			if (reglists[i].type == GUC_CAPTURE_LIST_TYPE_GLOBAL ||
> +			    reglists[i].engine == id) {
> +				return &reglists[i];
> +			}
> +		}
> +		++i;
> +	}
> +	return NULL;
> +}
> +
> +static inline void
> +warn_with_capture_list_identifier(struct drm_i915_private *i915, char *msg,
> +				  u32 owner, u32 type, u32 classid)
> +{
> +	const char *ownerstr[GUC_CAPTURE_LIST_INDEX_MAX] = {"PF", "VF"};
> +	const char *typestr[GUC_CAPTURE_LIST_TYPE_MAX - 1] = {"Class", "Instance"};
> +	const char *classstr[GUC_LAST_ENGINE_CLASS + 1] = {"Render", "Video", "VideoEnhance",
> +							   "Blitter", "Reserved"};

better to wrap that into simple small helpers like

	const char *stringify_guc_capture_owner(u32 owner) { .. }
	const char *stringify_guc_capture_type(u32 type) { .. }
	const char *stringify_guc_capture_class(u32 class) { .. }

> +	static const char unknownstr[] = "unknown";
> +
> +	if (type == GUC_CAPTURE_LIST_TYPE_GLOBAL)
> +		drm_warn(&i915->drm, "GuC-capture: %s for %s Global-Registers.\n", msg,
> +			 (owner < GUC_CAPTURE_LIST_INDEX_MAX) ? ownerstr[owner] : unknownstr);
> +	else
> +		drm_warn(&i915->drm, "GuC-capture: %s for %s %s-Registers on %s-Engine\n", msg,
> +			 (owner < GUC_CAPTURE_LIST_INDEX_MAX) ? ownerstr[owner] : unknownstr,
> +			 (type < GUC_CAPTURE_LIST_TYPE_MAX) ? typestr[type - 1] :  unknownstr,
> +			 (classid < GUC_LAST_ENGINE_CLASS + 1) ? classstr[classid] : unknownstr);
> +}
> +
> +int intel_guc_capture_list_count(struct intel_guc *guc, u32 owner, u32 type, u32 classid,
> +				 u16 *num_entries)
> +{
> +	struct drm_i915_private *dev_priv = (guc_to_gt(guc))->i915;

s/dev_priv/i915
redundant ()

> +	struct __guc_mmio_reg_descr_group *reglists = guc->capture.reglists;
> +	struct __guc_mmio_reg_descr_group *match;
> +
> +	if (!reglists)
> +		return -ENODEV;
> +
> +	match = guc_capture_get_one_list(reglists, owner, type, classid);
> +	if (match) {
> +		*num_entries = match->num_regs;
> +		return 0;

IIRC early returns are preferred for error cases, not success

> +	}
> +
> +	warn_with_capture_list_identifier(dev_priv, "Missing register list size", owner, type,
> +					  classid);
> +
> +	return -ENODATA;
> +}
> +
> +int intel_guc_capture_list_init(struct intel_guc *guc, u32 owner, u32 type, u32 classid,
> +				struct guc_mmio_reg *ptr, u16 num_entries)
> +{
> +	u32 j = 0;
> +	struct drm_i915_private *dev_priv = (guc_to_gt(guc))->i915;

s/dev_priv/i915
redundant ()

> +	struct __guc_mmio_reg_descr_group *reglists = guc->capture.reglists;
> +	struct __guc_mmio_reg_descr_group *match;
> +
> +	if (!reglists)
> +		return -ENODEV;
> +
> +	match = guc_capture_get_one_list(reglists, owner, type, classid);
> +	if (match) {
> +		while (j < num_entries && j < match->num_regs) {
> +			ptr[j].offset = match->list[j].reg.reg;
> +			ptr[j].value = 0xDEADF00D;
> +			ptr[j].flags = match->list[j].flags;
> +			ptr[j].mask = match->list[j].mask;
> +			++j;
> +		}
> +		return 0;
> +	}
> +
> +	warn_with_capture_list_identifier(dev_priv, "Missing register list init", owner, type,
> +					  classid);
> +
> +	return -ENODATA;
> +}
> +
> +void intel_guc_capture_destroy(struct intel_guc *guc)
> +{
> +}
> +
> +int intel_guc_capture_init(struct intel_guc *guc)
> +{
> +	struct drm_i915_private *dev_priv = (guc_to_gt(guc))->i915;
> +
> +	guc->capture.reglists = guc_capture_get_device_reglist(dev_priv);
> +	return 0;
> +}
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h
> new file mode 100644
> index 000000000000..352940b8bc87
> --- /dev/null
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h
> @@ -0,0 +1,47 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2021-2021 Intel Corporation
> + */
> +
> +#ifndef _INTEL_GUC_CAPTURE_H
> +#define _INTEL_GUC_CAPTURE_H
> +
> +#include <linux/mutex.h>
> +#include <linux/workqueue.h>
> +#include "intel_guc_fwif.h"
> +
> +struct intel_guc;
> +
> +struct __guc_mmio_reg_descr {
> +	i915_reg_t reg;
> +	u32 flags;
> +	u32 mask;
> +	char *regname;

const char* ?

but maybe instead of adding reg name to the GuC specific struct we
should add generic purpose function that will return pretty name of the
register:

i915_reg.c:

const char *i915_reg_to_string(i915_reg_r reg)
{
	...
}

> +};
> +
> +struct __guc_mmio_reg_descr_group {
> +	struct __guc_mmio_reg_descr *list;
> +	u32 num_regs;
> +	u32 owner; /* see enum guc_capture_owner */
> +	u32 type; /* see enum guc_capture_type */
> +	u32 engine; /* as per MAX_ENGINE_CLASS */
> +};
> +
> +struct intel_guc_state_capture {
> +	struct __guc_mmio_reg_descr_group *reglists;
> +	u16 num_instance_regs[GUC_CAPTURE_LIST_INDEX_MAX][GUC_MAX_ENGINE_CLASSES];
> +	u16 num_class_regs[GUC_CAPTURE_LIST_INDEX_MAX][GUC_MAX_ENGINE_CLASSES];
> +	u16 num_global_regs[GUC_CAPTURE_LIST_INDEX_MAX];
> +	int instance_list_size;
> +	int class_list_size;
> +	int global_list_size;
> +};
> +
> +int intel_guc_capture_list_count(struct intel_guc *guc, u32 owner, u32 type, u32 class,
> +				 u16 *num_entries);
> +int intel_guc_capture_list_init(struct intel_guc *guc, u32 owner, u32 type, u32 class,
> +				struct guc_mmio_reg *ptr, u16 num_entries);
> +void intel_guc_capture_destroy(struct intel_guc *guc);
> +int intel_guc_capture_init(struct intel_guc *guc);
> +
> +#endif /* _INTEL_GUC_CAPTURE_H */
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> index 767684b6af67..1a1d2271c7e9 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> @@ -285,13 +285,30 @@ struct guc_gt_system_info {
>  } __packed;
>  
>  /* Capture-types of GuC capture register lists */
> -enum
> +enum guc_capture_owner
>  {
>  	GUC_CAPTURE_LIST_INDEX_PF = 0,
>  	GUC_CAPTURE_LIST_INDEX_VF = 1,
>  	GUC_CAPTURE_LIST_INDEX_MAX = 2,

s/INDEX/OWNER ?

>  };
>  
> +/*Register-types of GuC capture register lists */
> +enum guc_capture_type {
> +	GUC_CAPTURE_LIST_TYPE_GLOBAL = 0,
> +	GUC_CAPTURE_LIST_TYPE_ENGINE_CLASS,
> +	GUC_CAPTURE_LIST_TYPE_ENGINE_INSTANCE,
> +	GUC_CAPTURE_LIST_TYPE_MAX,
> +};
> +
> +struct guc_debug_capture_list_header {
> +	u32 info;
> +		#define GUC_CAPTURELISTHDR_NUMDESCR GENMASK(15, 0)
> +};
> +
> +struct guc_debug_capture_list {
> +	struct guc_debug_capture_list_header header;
> +};
> +
>  /* GuC Additional Data Struct */
>  struct guc_ads {
>  	struct guc_mmio_reg_set reg_state_list[GUC_MAX_ENGINE_CLASSES][GUC_MAX_INSTANCES_PER_CLASS];
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Intel-gfx] [RFC 3/7] drm/i915/guc: Populate XE_LP register lists for GuC error state capture.
  2021-11-22 23:03 ` [Intel-gfx] [RFC 3/7] drm/i915/guc: Populate XE_LP register lists for GuC error state capture Alan Previn
  2021-11-23  1:59   ` kernel test robot
@ 2021-11-23 21:55   ` Michal Wajdeczko
  2021-11-24 17:16     ` Teres Alexis, Alan Previn
  1 sibling, 1 reply; 52+ messages in thread
From: Michal Wajdeczko @ 2021-11-23 21:55 UTC (permalink / raw)
  To: Alan Previn, intel-gfx


On 23.11.2021 00:03, Alan Previn wrote:
> Add device specific tables and register lists to cover different engines
> class types for GuC error state capture.
> 
> Also, add runtime allocation and freeing of extended register lists
> for registers that need steering identifiers that depend on
> the detected HW config.
> 
> Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com>
> ---
>  .../gpu/drm/i915/gt/uc/intel_guc_capture.c    | 260 +++++++++++++-----
>  .../gpu/drm/i915/gt/uc/intel_guc_capture.h    |   2 +
>  drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |   2 +
>  3 files changed, 197 insertions(+), 67 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
> index c741c77b7fc8..eec1d193ac26 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
> @@ -9,120 +9,245 @@
>  #include "i915_drv.h"
>  #include "i915_memcpy.h"
>  #include "gt/intel_gt.h"
> +#include "gt/intel_lrc_reg.h"
>  
>  #include "intel_guc_fwif.h"
>  #include "intel_guc_capture.h"
>  
> -/* Define all device tables of GuC error capture register lists */
> +/*
> + * Define all device tables of GuC error capture register lists
> + * NOTE: For engine-registers, GuC only needs the register offsets
> + *       from the engine-mmio-base
> + */
> +#define COMMON_GEN12BASE_GLOBAL() \
> +	{GEN12_FAULT_TLB_DATA0,    0,      0, "GEN12_FAULT_TLB_DATA0"}, \
> +	{GEN12_FAULT_TLB_DATA1,    0,      0, "GEN12_FAULT_TLB_DATA1"}, \
> +	{FORCEWAKE_MT,             0,      0, "FORCEWAKE_MT"}, \
> +	{DERRMR,                   0,      0, "DERRMR"}, \
> +	{GEN12_AUX_ERR_DBG,        0,      0, "GEN12_AUX_ERR_DBG"}, \
> +	{GEN12_GAM_DONE,           0,      0, "GEN12_GAM_DONE"}, \
> +	{GEN11_GUC_SG_INTR_ENABLE, 0,      0, "GEN11_GUC_SG_INTR_ENABLE"}, \
> +	{GEN11_CRYPTO_RSVD_INTR_ENABLE, 0, 0, "GEN11_CRYPTO_RSVD_INTR_ENABLE"}, \
> +	{GEN11_GUNIT_CSME_INTR_ENABLE, 0,  0, "GEN11_GUNIT_CSME_INTR_ENABLE"}, \
> +	{GEN12_RING_FAULT_REG,     0,      0, "GEN12_RING_FAULT_REG"}
> +
> +#define COMMON_GEN12BASE_ENGINE_INSTANCE() \
> +	{RING_PSMI_CTL(0),         0,      0, "RING_PSMI_CTL"}, \
> +	{RING_ESR(0),              0,      0, "RING_ESR"}, \
> +	{RING_ESR(0),              0,      0, "RING_ESR"}, \
> +	{RING_DMA_FADD(0),         0,      0, "RING_DMA_FADD_LOW32"}, \
> +	{RING_DMA_FADD_UDW(0),     0,      0, "RING_DMA_FADD_UP32"}, \
> +	{RING_IPEIR(0),            0,      0, "RING_IPEIR"}, \
> +	{RING_IPEHR(0),            0,      0, "RING_IPEHR"}, \
> +	{RING_INSTPS(0),           0,      0, "RING_INSTPS"}, \
> +	{RING_BBADDR(0),           0,      0, "RING_BBADDR_LOW32"}, \
> +	{RING_BBADDR_UDW(0),       0,      0, "RING_BBADDR_UP32"}, \
> +	{RING_BBSTATE(0),          0,      0, "RING_BBSTATE"}, \
> +	{CCID(0),                  0,      0, "CCID"}, \
> +	{RING_ACTHD(0),            0,      0, "RING_ACTHD_LOW32"}, \
> +	{RING_ACTHD_UDW(0),        0,      0, "RING_ACTHD_UP32"}, \
> +	{RING_INSTPM(0),           0,      0, "RING_INSTPM"}, \
> +	{RING_NOPID(0),            0,      0, "RING_NOPID"}, \
> +	{RING_START(0),            0,      0, "RING_START"}, \
> +	{RING_HEAD(0),             0,      0, "RING_HEAD"}, \
> +	{RING_TAIL(0),             0,      0, "RING_TAIL"}, \
> +	{RING_CTL(0),              0,      0, "RING_CTL"}, \
> +	{RING_MI_MODE(0),          0,      0, "RING_MI_MODE"}, \
> +	{RING_CONTEXT_CONTROL(0),  0,      0, "RING_CONTEXT_CONTROL"}, \
> +	{RING_INSTDONE(0),         0,      0, "RING_INSTDONE"}, \
> +	{RING_HWS_PGA(0),          0,      0, "RING_HWS_PGA"}, \
> +	{RING_MODE_GEN7(0),        0,      0, "RING_MODE_GEN7"}, \
> +	{GEN8_RING_PDP_LDW(0, 0),  0,      0, "GEN8_RING_PDP0_LDW"}, \
> +	{GEN8_RING_PDP_UDW(0, 0),  0,      0, "GEN8_RING_PDP0_UDW"}, \
> +	{GEN8_RING_PDP_LDW(0, 1),  0,      0, "GEN8_RING_PDP1_LDW"}, \
> +	{GEN8_RING_PDP_UDW(0, 1),  0,      0, "GEN8_RING_PDP1_UDW"}, \
> +	{GEN8_RING_PDP_LDW(0, 2),  0,      0, "GEN8_RING_PDP2_LDW"}, \
> +	{GEN8_RING_PDP_UDW(0, 2),  0,      0, "GEN8_RING_PDP2_UDW"}, \
> +	{GEN8_RING_PDP_LDW(0, 3),  0,      0, "GEN8_RING_PDP3_LDW"}, \
> +	{GEN8_RING_PDP_UDW(0, 3),  0,      0, "GEN8_RING_PDP3_UDW"}
> +
> +#define COMMON_GEN12BASE_HAS_EU() \
> +	{EIR,                      0,      0, "EIR"}
> +
> +#define COMMON_GEN12BASE_RENDER() \
> +	{GEN7_SC_INSTDONE,         0,      0, "GEN7_SC_INSTDONE"}, \
> +	{GEN12_SC_INSTDONE_EXTRA,  0,      0, "GEN12_SC_INSTDONE_EXTRA"}, \
> +	{GEN12_SC_INSTDONE_EXTRA2, 0,      0, "GEN12_SC_INSTDONE_EXTRA2"}
> +
> +#define COMMON_GEN12BASE_VEC() \
> +	{GEN11_VCS_VECS_INTR_ENABLE, 0,    0, "GEN11_VCS_VECS_INTR_ENABLE"}, \
> +	{GEN12_SFC_DONE(0),        0,      0, "GEN12_SFC_DONE0"}, \
> +	{GEN12_SFC_DONE(1),        0,      0, "GEN12_SFC_DONE1"}, \
> +	{GEN12_SFC_DONE(2),        0,      0, "GEN12_SFC_DONE2"}, \
> +	{GEN12_SFC_DONE(3),        0,      0, "GEN12_SFC_DONE3"}
>  
>  /********************************* Gen12 LP  *********************************/
>  /************** GLOBAL *************/
>  struct __guc_mmio_reg_descr gen12lp_global_regs[] = {
> -	{SWF_ILK(0),               0,      0, "SWF_ILK0"},

we should avoid adding/removing code in same series

> -	/* Add additional register list */
> +	COMMON_GEN12BASE_GLOBAL(),
> +	{GEN7_ROW_INSTDONE,        0,      0, "GEN7_ROW_INSTDONE"},
>  };
>  
>  /********** RENDER/COMPUTE *********/
>  /* Per-Class */
>  struct __guc_mmio_reg_descr gen12lp_rc_class_regs[] = {
> -	{SWF_ILK(0),               0,      0, "SWF_ILK0"},
> -	/* Add additional register list */
> +	COMMON_GEN12BASE_HAS_EU(),
> +	COMMON_GEN12BASE_RENDER(),
> +	{GEN11_RENDER_COPY_INTR_ENABLE, 0, 0, "GEN11_RENDER_COPY_INTR_ENABLE"},
>  };
>  
>  /* Per-Engine-Instance */
>  struct __guc_mmio_reg_descr gen12lp_rc_inst_regs[] = {
> -	{SWF_ILK(0),               0,      0, "SWF_ILK0"},
> -	/* Add additional register list */
> +	COMMON_GEN12BASE_ENGINE_INSTANCE(),
>  };
>  
>  /************* MEDIA-VD ************/
>  /* Per-Class */
>  struct __guc_mmio_reg_descr gen12lp_vd_class_regs[] = {
> -	{SWF_ILK(0),               0,      0, "SWF_ILK0"},
> -	/* Add additional register list */
>  };
>  
>  /* Per-Engine-Instance */
>  struct __guc_mmio_reg_descr gen12lp_vd_inst_regs[] = {
> -	{SWF_ILK(0),               0,      0, "SWF_ILK0"},
> -	/* Add additional register list */
> +	COMMON_GEN12BASE_ENGINE_INSTANCE(),
>  };
>  
>  /************* MEDIA-VEC ***********/
>  /* Per-Class */
>  struct __guc_mmio_reg_descr gen12lp_vec_class_regs[] = {
> -	{SWF_ILK(0),               0,      0, "SWF_ILK0"},
> -	/* Add additional register list */
> +	COMMON_GEN12BASE_VEC(),
>  };
>  
>  /* Per-Engine-Instance */
>  struct __guc_mmio_reg_descr gen12lp_vec_inst_regs[] = {
> -	{SWF_ILK(0),               0,      0, "SWF_ILK0"},
> -	/* Add additional register list */
> +	COMMON_GEN12BASE_ENGINE_INSTANCE(),
> +};
> +
> +/************* BLITTER ***********/
> +/* Per-Class */
> +struct __guc_mmio_reg_descr gen12lp_blt_class_regs[] = {
> +};
> +
> +/* Per-Engine-Instance */
> +struct __guc_mmio_reg_descr gen12lp_blt_inst_regs[] = {
> +	COMMON_GEN12BASE_ENGINE_INSTANCE(),
>  };
>  
> +#define TO_GCAP_DEF(x) (GUC_CAPTURE_LIST_##x)
> +#define MAKE_GCAP_REGLIST_DESCR(regslist, regsowner, regstype, class) \
> +	{ \
> +		.list = (regslist), \
> +		.num_regs = (sizeof(regslist) / sizeof(struct __guc_mmio_reg_descr)), \
> +		.owner = TO_GCAP_DEF(regsowner), \
> +		.type = TO_GCAP_DEF(regstype), \
> +		.engine = class, \
> +		.num_ext = 0, \
> +		.ext = NULL, \
> +	}
> +
> +
>  /********** List of lists **********/
> -struct __guc_mmio_reg_descr_group gen12lp_lists[] = {
> -	{
> -		.list = gen12lp_global_regs,
> -		.num_regs = (sizeof(gen12lp_global_regs) / sizeof(struct __guc_mmio_reg_descr)),
> -		.owner = GUC_CAPTURE_LIST_INDEX_PF,
> -		.type = GUC_CAPTURE_LIST_TYPE_GLOBAL,
> -		.engine = 0
> -	},
> -	{
> -		.list = gen12lp_rc_class_regs,
> -		.num_regs = (sizeof(gen12lp_rc_class_regs) / sizeof(struct __guc_mmio_reg_descr)),
> -		.owner = GUC_CAPTURE_LIST_INDEX_PF,
> -		.type = GUC_CAPTURE_LIST_TYPE_ENGINE_CLASS,
> -		.engine = RENDER_CLASS
> -	},
> -	{
> -		.list = gen12lp_rc_inst_regs,
> -		.num_regs = (sizeof(gen12lp_rc_inst_regs) / sizeof(struct __guc_mmio_reg_descr)),
> -		.owner = GUC_CAPTURE_LIST_INDEX_PF,
> -		.type = GUC_CAPTURE_LIST_TYPE_ENGINE_INSTANCE,
> -		.engine = RENDER_CLASS
> -	},
> -	{
> -		.list = gen12lp_vd_class_regs,
> -		.num_regs = (sizeof(gen12lp_vd_class_regs) / sizeof(struct __guc_mmio_reg_descr)),
> -		.owner = GUC_CAPTURE_LIST_INDEX_PF,
> -		.type = GUC_CAPTURE_LIST_TYPE_ENGINE_CLASS,
> -		.engine = VIDEO_DECODE_CLASS
> -	},
> -	{
> -		.list = gen12lp_vd_inst_regs,
> -		.num_regs = (sizeof(gen12lp_vd_inst_regs) / sizeof(struct __guc_mmio_reg_descr)),
> -		.owner = GUC_CAPTURE_LIST_INDEX_PF,
> -		.type = GUC_CAPTURE_LIST_TYPE_ENGINE_INSTANCE,
> -		.engine = VIDEO_DECODE_CLASS
> -	},
> -	{
> -		.list = gen12lp_vec_class_regs,
> -		.num_regs = (sizeof(gen12lp_vec_class_regs) / sizeof(struct __guc_mmio_reg_descr)),
> -		.owner = GUC_CAPTURE_LIST_INDEX_PF,
> -		.type = GUC_CAPTURE_LIST_TYPE_ENGINE_CLASS,
> -		.engine = VIDEO_ENHANCEMENT_CLASS
> -	},
> -	{
> -		.list = gen12lp_vec_inst_regs,
> -		.num_regs = (sizeof(gen12lp_vec_inst_regs) / sizeof(struct __guc_mmio_reg_descr)),
> -		.owner = GUC_CAPTURE_LIST_INDEX_PF,
> -		.type = GUC_CAPTURE_LIST_TYPE_ENGINE_INSTANCE,
> -		.engine = VIDEO_ENHANCEMENT_CLASS
> -	},
> +struct __guc_mmio_reg_descr_group xe_lpd_lists[] = {
> +	MAKE_GCAP_REGLIST_DESCR(gen12lp_global_regs, INDEX_PF, TYPE_GLOBAL, 0),
> +	MAKE_GCAP_REGLIST_DESCR(gen12lp_rc_class_regs, INDEX_PF, TYPE_ENGINE_CLASS, GUC_RENDER_CLASS),
> +	MAKE_GCAP_REGLIST_DESCR(gen12lp_rc_inst_regs, INDEX_PF, TYPE_ENGINE_INSTANCE, GUC_RENDER_CLASS),
> +	MAKE_GCAP_REGLIST_DESCR(gen12lp_vd_class_regs, INDEX_PF, TYPE_ENGINE_CLASS, GUC_VIDEO_CLASS),
> +	MAKE_GCAP_REGLIST_DESCR(gen12lp_vd_inst_regs, INDEX_PF, TYPE_ENGINE_INSTANCE, GUC_VIDEO_CLASS),
> +	MAKE_GCAP_REGLIST_DESCR(gen12lp_vec_class_regs, INDEX_PF, TYPE_ENGINE_CLASS, GUC_VIDEOENHANCE_CLASS),
> +	MAKE_GCAP_REGLIST_DESCR(gen12lp_vec_inst_regs, INDEX_PF, TYPE_ENGINE_INSTANCE, GUC_VIDEOENHANCE_CLASS),
> +	MAKE_GCAP_REGLIST_DESCR(gen12lp_blt_class_regs, INDEX_PF, TYPE_ENGINE_CLASS, GUC_BLITTER_CLASS),
> +	MAKE_GCAP_REGLIST_DESCR(gen12lp_blt_inst_regs, INDEX_PF, TYPE_ENGINE_INSTANCE, GUC_BLITTER_CLASS),

if you knew that you want to use macros, why not start with them in
previous patch ?

>  	{NULL, 0, 0, 0, 0}
>  };
>  
> -/************ FIXME: Populate tables for other devices in subsequent patch ************/
> +/************* Populate additional registers / device tables *************/
> +
> +static inline struct __guc_mmio_reg_descr **
> +guc_capture_get_ext_list_ptr(struct __guc_mmio_reg_descr_group * lists, u32 owner, u32 type, u32 class)
> +{
> +	while(lists->list){

please run checkpatch.pl

> +		if (lists->owner == owner && lists->type == type && lists->engine == class)
> +			break;
> +		++lists;
> +	}
> +	if (!lists->list)
> +		return NULL;
> +
> +	return &(lists->ext);
> +}
> +
> +void guc_capture_clear_ext_regs(struct __guc_mmio_reg_descr_group * lists)
> +{
> +	while(lists->list){
> +		if (lists->ext) {
> +			kfree(lists->ext);
> +			lists->ext = NULL;
> +		}
> +		++lists;
> +	}
> +	return;
> +}
> +
> +static void
> +xelpd_alloc_steered_ext_list(struct drm_i915_private *i915,
> +			     struct __guc_mmio_reg_descr_group * lists)
> +{
> +	struct intel_gt *gt = &i915->gt;
> +	struct sseu_dev_info *sseu;
> +	int slice, subslice, i, num_tot_regs = 0;
> +	struct __guc_mmio_reg_descr **ext;
> +	static char * const strings[] = {
> +		[0] = "GEN7_SAMPLER_INSTDONE",
> +		[1] = "GEN7_ROW_INSTDONE",
> +	};
> +
> +	/* In XE_LP we only care about render-class steering registers during error-capture */
> +	ext = guc_capture_get_ext_list_ptr(lists, GUC_CAPTURE_LIST_INDEX_PF,
> +					   GUC_CAPTURE_LIST_TYPE_ENGINE_CLASS, GUC_RENDER_CLASS);
> +	if (!ext)
> +		return;
> +	if (*ext)
> +		return; /* already populated */
> +
> +	sseu = &gt->info.sseu;
> +	for_each_instdone_slice_subslice(i915, sseu, slice, subslice) {
> +		num_tot_regs += 2; /* two registers of interest for now */
> +	}
> +	if (!num_tot_regs)
> +		return;
> +
> +	*ext = kzalloc(2 * num_tot_regs * sizeof(struct __guc_mmio_reg_descr), GFP_KERNEL);

kcalloc ?

> +	if (!*ext) {
> +		drm_warn(&i915->drm, "GuC-capture: Fail to allocate for extended registers\n");
> +		return;
> +	}
> +
> +	for_each_instdone_slice_subslice(i915, sseu, slice, subslice) {
> +		for (i = 0; i < 2; i++) {
> +			if (i == 0)
> +				(*ext)->reg = GEN7_SAMPLER_INSTDONE;
> +			else
> +				(*ext)->reg = GEN7_ROW_INSTDONE;
> +			(*ext)->flags = FIELD_PREP(GUC_REGSET_STEERING_GROUP, slice);
> +			(*ext)->flags |= FIELD_PREP(GUC_REGSET_STEERING_INSTANCE, subslice);
> +			(*ext)->regname = strings[i];
> +			(*ext)++;
> +		}
> +	}
> +}
>  
>  static struct __guc_mmio_reg_descr_group *
>  guc_capture_get_device_reglist(struct drm_i915_private *dev_priv)
>  {
>  	if (IS_TIGERLAKE(dev_priv) || IS_ROCKETLAKE(dev_priv) ||
>  	    IS_ALDERLAKE_S(dev_priv) || IS_ALDERLAKE_P(dev_priv)) {
> -		return gen12lp_lists;

patch2: gen12lp_lists
patch3: xe_lpd_lists

please be consistent across series

> +		/*
> +		* For certain engine classes, there are slice and subslice
> +		* level registers requiring steering. We allocate and populate
> +		* these at init time based on hw config add it as an extension
> +		* list at the end of the pre-populated render list.
> +		*/
> +		xelpd_alloc_steered_ext_list(dev_priv, xe_lpd_lists);
> +		return xe_lpd_lists;
>  	}
>  
>  	return NULL;
> @@ -221,6 +346,7 @@ int intel_guc_capture_list_init(struct intel_guc *guc, u32 owner, u32 type, u32
>  
>  void intel_guc_capture_destroy(struct intel_guc *guc)
>  {
> +	guc_capture_clear_ext_regs(guc->capture.reglists);
>  }

maybe whole function shall be introduced in this patch ?

-Michal

>  
>  int intel_guc_capture_init(struct intel_guc *guc)
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h
> index 352940b8bc87..df420f0f49b3 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h
> @@ -25,6 +25,8 @@ struct __guc_mmio_reg_descr_group {
>  	u32 owner; /* see enum guc_capture_owner */
>  	u32 type; /* see enum guc_capture_type */
>  	u32 engine; /* as per MAX_ENGINE_CLASS */
> +	int num_ext;
> +	struct __guc_mmio_reg_descr * ext;
>  };
>  
>  struct intel_guc_state_capture {
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> index 1a1d2271c7e9..c26cfefd916c 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> @@ -267,6 +267,8 @@ struct guc_mmio_reg {
>  	u32 value;
>  	u32 flags;
>  #define GUC_REGSET_MASKED		(1 << 0)
> +#define GUC_REGSET_STEERING_GROUP       GENMASK(15, 12)
> +#define GUC_REGSET_STEERING_INSTANCE    GENMASK(23, 20)
>  	u32 mask;
>  } __packed;
>  
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Intel-gfx] [RFC 2/7] drm/i915/guc: Update GuC ADS size for error capture lists
  2021-11-23 21:46   ` Michal Wajdeczko
@ 2021-11-24  9:52     ` Jani Nikula
  2021-11-24 17:34     ` Teres Alexis, Alan Previn
  2021-12-22 20:13     ` Teres Alexis, Alan Previn
  2 siblings, 0 replies; 52+ messages in thread
From: Jani Nikula @ 2021-11-24  9:52 UTC (permalink / raw)
  To: Michal Wajdeczko, Alan Previn, intel-gfx

On Tue, 23 Nov 2021, Michal Wajdeczko <michal.wajdeczko@intel.com> wrote:
> Hi,
>
> just few random nits below
>
> -Michal
>
>
> On 23.11.2021 00:03, Alan Previn wrote:
>> +/* Define all device tables of GuC error capture register lists */
>> +
>> +/********************************* Gen12 LP  *********************************/
>
> didn't we move away from "GEN" naming ?

Yes.

>
>> +/************** GLOBAL *************/
>
> do we really need all these decorations ?

No, please remove them.

>
>> +struct __guc_mmio_reg_descr gen12lp_global_regs[] = {
>> +	{SWF_ILK(0),               0,      0, "SWF_ILK0"},
>> +	/* Add additional register list */
>
> do we need this reminder ?

No, please remove them.

Also, all of these need to be static.


BR,
Jani.


-- 
Jani Nikula, Intel Open Source Graphics Center

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Intel-gfx] [RFC 2/7] drm/i915/guc: Update GuC ADS size for error capture lists
  2021-11-22 23:03 ` [Intel-gfx] [RFC 2/7] drm/i915/guc: Update GuC ADS size " Alan Previn
  2021-11-23 21:46   ` Michal Wajdeczko
@ 2021-11-24 10:06   ` Jani Nikula
  2021-11-24 17:37     ` Teres Alexis, Alan Previn
  1 sibling, 1 reply; 52+ messages in thread
From: Jani Nikula @ 2021-11-24 10:06 UTC (permalink / raw)
  To: Alan Previn, intel-gfx; +Cc: Alan Previn

On Mon, 22 Nov 2021, Alan Previn <alan.previn.teres.alexis@intel.com> wrote:
> +	{
> +		.list = gen12lp_vec_class_regs,
> +		.num_regs = (sizeof(gen12lp_vec_class_regs) / sizeof(struct __guc_mmio_reg_descr)),
> +		.owner = GUC_CAPTURE_LIST_INDEX_PF,
> +		.type = GUC_CAPTURE_LIST_TYPE_ENGINE_CLASS,
> +		.engine = VIDEO_ENHANCEMENT_CLASS
> +	},
> +	{

Usually }, { on the same line

> +		.list = gen12lp_vec_inst_regs,
> +		.num_regs = (sizeof(gen12lp_vec_inst_regs) / sizeof(struct __guc_mmio_reg_descr)),
> +		.owner = GUC_CAPTURE_LIST_INDEX_PF,
> +		.type = GUC_CAPTURE_LIST_TYPE_ENGINE_INSTANCE,
> +		.engine = VIDEO_ENHANCEMENT_CLASS
> +	},
> +	{NULL, 0, 0, 0, 0}

Just {}  should work as a sentinel.

> +};
> +
> +/************ FIXME: Populate tables for other devices in subsequent patch ************/

Please don't add any of this ******* nonsense.

> +
> +static struct __guc_mmio_reg_descr_group *
> +guc_capture_get_device_reglist(struct drm_i915_private *dev_priv)
> +{
> +	if (IS_TIGERLAKE(dev_priv) || IS_ROCKETLAKE(dev_priv) ||
> +	    IS_ALDERLAKE_S(dev_priv) || IS_ALDERLAKE_P(dev_priv)) {
> +		return gen12lp_lists;
> +	}
> +
> +	return NULL;
> +}
> +
> +static inline struct __guc_mmio_reg_descr_group *
> +guc_capture_get_one_list(struct __guc_mmio_reg_descr_group *reglists, u32 owner, u32 type, u32 id)

Please don't use inlines in .c files. Let the compiler decide.

> +{
> +	int i = 0;
> +
> +	if (!reglists)
> +		return NULL;
> +	while (reglists[i].list) {
> +		if (reglists[i].owner == owner &&
> +		    reglists[i].type == type) {
> +			if (reglists[i].type == GUC_CAPTURE_LIST_TYPE_GLOBAL ||
> +			    reglists[i].engine == id) {
> +				return &reglists[i];
> +			}
> +		}
> +		++i;
> +	}

That's a for loop right there.

> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h
> new file mode 100644
> index 000000000000..352940b8bc87
> --- /dev/null
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h
> @@ -0,0 +1,47 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2021-2021 Intel Corporation
> + */
> +
> +#ifndef _INTEL_GUC_CAPTURE_H
> +#define _INTEL_GUC_CAPTURE_H
> +
> +#include <linux/mutex.h>
> +#include <linux/workqueue.h>

Both of these seem random and completely unnecessary. linux/types.h is
required but it's not here.

> +#include "intel_guc_fwif.h"

I've been trying hard to reduce includes from headers throughout the
driver, to clean up and clarify the interfaces and dependencies. I don't
know how the guc headers have grown the kind of interdependencies that
they all pull in almost everything.

This one line pulls in another 19 headers. Just to get
GUC_CAPTURE_LIST_INDEX_MAX and GUC_MAX_ENGINE_CLASSES. Everything else
could be solved through forward declarations.

BR,
Jani.


> +
> +struct intel_guc;
> +
> +struct __guc_mmio_reg_descr {
> +	i915_reg_t reg;
> +	u32 flags;
> +	u32 mask;
> +	char *regname;
> +};
> +
> +struct __guc_mmio_reg_descr_group {
> +	struct __guc_mmio_reg_descr *list;
> +	u32 num_regs;
> +	u32 owner; /* see enum guc_capture_owner */
> +	u32 type; /* see enum guc_capture_type */
> +	u32 engine; /* as per MAX_ENGINE_CLASS */
> +};
> +
> +struct intel_guc_state_capture {
> +	struct __guc_mmio_reg_descr_group *reglists;
> +	u16 num_instance_regs[GUC_CAPTURE_LIST_INDEX_MAX][GUC_MAX_ENGINE_CLASSES];
> +	u16 num_class_regs[GUC_CAPTURE_LIST_INDEX_MAX][GUC_MAX_ENGINE_CLASSES];
> +	u16 num_global_regs[GUC_CAPTURE_LIST_INDEX_MAX];
> +	int instance_list_size;
> +	int class_list_size;
> +	int global_list_size;
> +};
> +
> +int intel_guc_capture_list_count(struct intel_guc *guc, u32 owner, u32 type, u32 class,
> +				 u16 *num_entries);
> +int intel_guc_capture_list_init(struct intel_guc *guc, u32 owner, u32 type, u32 class,
> +				struct guc_mmio_reg *ptr, u16 num_entries);
> +void intel_guc_capture_destroy(struct intel_guc *guc);
> +int intel_guc_capture_init(struct intel_guc *guc);
> +
> +#endif /* _INTEL_GUC_CAPTURE_H */
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> index 767684b6af67..1a1d2271c7e9 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> @@ -285,13 +285,30 @@ struct guc_gt_system_info {
>  } __packed;
>  
>  /* Capture-types of GuC capture register lists */
> -enum
> +enum guc_capture_owner
>  {
>  	GUC_CAPTURE_LIST_INDEX_PF = 0,
>  	GUC_CAPTURE_LIST_INDEX_VF = 1,
>  	GUC_CAPTURE_LIST_INDEX_MAX = 2,
>  };
>  
> +/*Register-types of GuC capture register lists */
> +enum guc_capture_type {
> +	GUC_CAPTURE_LIST_TYPE_GLOBAL = 0,
> +	GUC_CAPTURE_LIST_TYPE_ENGINE_CLASS,
> +	GUC_CAPTURE_LIST_TYPE_ENGINE_INSTANCE,
> +	GUC_CAPTURE_LIST_TYPE_MAX,
> +};
> +
> +struct guc_debug_capture_list_header {
> +	u32 info;
> +		#define GUC_CAPTURELISTHDR_NUMDESCR GENMASK(15, 0)
> +};
> +
> +struct guc_debug_capture_list {
> +	struct guc_debug_capture_list_header header;
> +};
> +
>  /* GuC Additional Data Struct */
>  struct guc_ads {
>  	struct guc_mmio_reg_set reg_state_list[GUC_MAX_ENGINE_CLASSES][GUC_MAX_INSTANCES_PER_CLASS];

-- 
Jani Nikula, Intel Open Source Graphics Center

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Intel-gfx] [RFC 4/7] drm/i915/guc: Add GuC's error state capture output structures.
  2021-11-22 23:03 ` [Intel-gfx] [RFC 4/7] drm/i915/guc: Add GuC's error state capture output structures Alan Previn
@ 2021-11-24 10:08   ` Jani Nikula
  2021-11-24 17:37     ` Teres Alexis, Alan Previn
  2021-12-07 21:01   ` Matthew Brost
  1 sibling, 1 reply; 52+ messages in thread
From: Jani Nikula @ 2021-11-24 10:08 UTC (permalink / raw)
  To: Alan Previn, intel-gfx; +Cc: Alan Previn

On Mon, 22 Nov 2021, Alan Previn <alan.previn.teres.alexis@intel.com> wrote:
> Add GuC's error capture output structures and definitions as how
> they would appear in GuC log buffer's error capture subregion after
> an error state capture G2H event notification.

If it's for decoding data, should they all have __packed?

>
> Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com>
> ---
>  .../gpu/drm/i915/gt/uc/intel_guc_capture.h    | 35 +++++++++++++++++++
>  1 file changed, 35 insertions(+)
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h
> index df420f0f49b3..b2454b6cd778 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h
> @@ -29,6 +29,41 @@ struct __guc_mmio_reg_descr_group {
>  	struct __guc_mmio_reg_descr * ext;
>  };
>  
> +struct intel_guc_capture_out_data_header {
> +	u32 reserved1;
> +	u32 info;
> +		#define GUC_CAPTURE_DATAHDR_SRC_TYPE GENMASK(3, 0) /* as per enum guc_capture_type */
> +		#define GUC_CAPTURE_DATAHDR_SRC_CLASS GENMASK(7, 4) /* as per GUC_MAX_ENGINE_CLASSES */
> +		#define GUC_CAPTURE_DATAHDR_SRC_INSTANCE GENMASK(11, 8)
> +	u32 lrca; /* if type-instance, LRCA (address) that hung, else set to ~0 */
> +	u32 guc_ctx_id; /* if type-instance, context index of hung context, else set to ~0 */
> +	u32 num_mmios;
> +		#define GUC_CAPTURE_DATAHDR_NUM_MMIOS GENMASK(9, 0)
> +};
> +
> +struct intel_guc_capture_out_data {
> +	struct intel_guc_capture_out_data_header capture_header;
> +	struct guc_mmio_reg capture_list[0];
> +};
> +
> +enum guc_capture_group_types {
> +	GUC_STATE_CAPTURE_GROUP_TYPE_FULL,
> +	GUC_STATE_CAPTURE_GROUP_TYPE_PARTIAL,
> +	GUC_STATE_CAPTURE_GROUP_TYPE_MAX,
> +};
> +
> +struct intel_guc_capture_out_group_header {
> +	u32 reserved1;
> +	u32 info;
> +		#define GUC_CAPTURE_GRPHDR_SRC_NUMCAPTURES GENMASK(7, 0)
> +		#define GUC_CAPTURE_GRPHDR_SRC_CAPTURE_TYPE GENMASK(15, 8)
> +};
> +
> +struct intel_guc_capture_out_group {
> +	struct intel_guc_capture_out_group_header group_header;
> +	struct intel_guc_capture_out_data group_lists[0];
> +};
> +
>  struct intel_guc_state_capture {
>  	struct __guc_mmio_reg_descr_group *reglists;
>  	u16 num_instance_regs[GUC_CAPTURE_LIST_INDEX_MAX][GUC_MAX_ENGINE_CLASSES];

-- 
Jani Nikula, Intel Open Source Graphics Center

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Intel-gfx] [RFC 3/7] drm/i915/guc: Populate XE_LP register lists for GuC error state capture.
  2021-11-23 21:55   ` Michal Wajdeczko
@ 2021-11-24 17:16     ` Teres Alexis, Alan Previn
  0 siblings, 0 replies; 52+ messages in thread
From: Teres Alexis, Alan Previn @ 2021-11-24 17:16 UTC (permalink / raw)
  To: intel-gfx, Wajdeczko, Michal

Thanks Michal for reviewing the code. I will get all of these fixed.

I still would like continue to have a first patch with a skeleton table of registers
as the patch that focuses on the infrastructure and another patch just for the registers.

That sad, to align with your review comments, i shall ensure the first patch starts
with one valid register that doesnt get removed and also move the ext-list
functions and macros into that first patch. But keep full register table population
in the 2nd patch..

...alan



On Tue, 2021-11-23 at 22:55 +0100, Michal Wajdeczko wrote:
>  
> >  /********************************* Gen12 LP  *********************************/
> >  /************** GLOBAL *************/
> >  struct __guc_mmio_reg_descr gen12lp_global_regs[] = {
> > -	{SWF_ILK(0),               0,      0, "SWF_ILK0"},
> 
> we should avoid adding/removing code in same series
> 
> > +struct __guc_mmio_reg_descr_group xe_lpd_lists[] = {
> > +	MAKE_GCAP_REGLIST_DESCR(gen12lp_global_regs, INDEX_PF, TYPE_GLOBAL, 0),
> > +	MAKE_GCAP_REGLIST_DESCR(gen12lp_rc_class_regs, INDEX_PF, TYPE_ENGINE_CLASS, GUC_RENDER_CLASS),
> > +	MAKE_GCAP_REGLIST_DESCR(gen12lp_rc_inst_regs, INDEX_PF, TYPE_ENGINE_INSTANCE, GUC_RENDER_CLASS),
> > +	MAKE_GCAP_REGLIST_DESCR(gen12lp_vd_class_regs, INDEX_PF, TYPE_ENGINE_CLASS, GUC_VIDEO_CLASS),
> > +	MAKE_GCAP_REGLIST_DESCR(gen12lp_vd_inst_regs, INDEX_PF, TYPE_ENGINE_INSTANCE, GUC_VIDEO_CLASS),
> > +	MAKE_GCAP_REGLIST_DESCR(gen12lp_vec_class_regs, INDEX_PF, TYPE_ENGINE_CLASS, GUC_VIDEOENHANCE_CLASS),
> > +	MAKE_GCAP_REGLIST_DESCR(gen12lp_vec_inst_regs, INDEX_PF, TYPE_ENGINE_INSTANCE, GUC_VIDEOENHANCE_CLASS),
> > +	MAKE_GCAP_REGLIST_DESCR(gen12lp_blt_class_regs, INDEX_PF, TYPE_ENGINE_CLASS, GUC_BLITTER_CLASS),
> > +	MAKE_GCAP_REGLIST_DESCR(gen12lp_blt_inst_regs, INDEX_PF, TYPE_ENGINE_INSTANCE, GUC_BLITTER_CLASS),
> 
> if you knew that you want to use macros, why not start with them in
> previous patch ?
> 
> >  	{NULL, 0, 0, 0, 0}
> >  };
> >  
> > -/************ FIXME: Populate tables for other devices in subsequent patch ************/
> > +/************* Populate additional registers / device tables *************/
> > +
> > +static inline struct __guc_mmio_reg_descr **
> > +guc_capture_get_ext_list_ptr(struct __guc_mmio_reg_descr_group * lists, u32 owner, u32 type, u32 class)
> > +{
> > +	while(lists->list){
> 
> please run checkpatch.pl
> 
> > +
> > +	sseu = &gt->info.sseu;
> > +	for_each_instdone_slice_subslice(i915, sseu, slice, subslice) {
> > +		num_tot_regs += 2; /* two registers of interest for now */
> > +	}
> > +	if (!num_tot_regs)
> > +		return;
> > +
> > +	*ext = kzalloc(2 * num_tot_regs * sizeof(struct __guc_mmio_reg_descr), GFP_KERNEL);
> 
> kcalloc ?
> 
> >  
> >  static struct __guc_mmio_reg_descr_group *
> >  guc_capture_get_device_reglist(struct drm_i915_private *dev_priv)
> >  {
> >  	if (IS_TIGERLAKE(dev_priv) || IS_ROCKETLAKE(dev_priv) ||
> >  	    IS_ALDERLAKE_S(dev_priv) || IS_ALDERLAKE_P(dev_priv)) {
> > -		return gen12lp_lists;
> 
> patch2: gen12lp_lists
> patch3: xe_lpd_lists
> 
> please be consistent across series
> 
> > +		/*
> > +		* For certain engine classes, there are slice and subslice
> > +		* level registers requiring steering. We allocate and populate
> > +		* these at init time based on hw config add it as an extension
> > +		* list at the end of the pre-populated render list.
> > +		*/
> > +		xelpd_alloc_steered_ext_list(dev_priv, xe_lpd_lists);
> > +		return xe_lpd_lists;
> >  	}
> >  
> >  	return NULL;
> > @@ -221,6 +346,7 @@ int intel_guc_capture_list_init(struct intel_guc *guc, u32 owner, u32 type, u32
> >  
> >  void intel_guc_capture_destroy(struct intel_guc *guc)
> >  {
> > +	guc_capture_clear_ext_regs(guc->capture.reglists);
> >  }
> 
> maybe whole function shall be introduced in this patch ?
> 
> -Michal
> 
> >  

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Intel-gfx] [RFC 2/7] drm/i915/guc: Update GuC ADS size for error capture lists
  2021-11-23 21:46   ` Michal Wajdeczko
  2021-11-24  9:52     ` Jani Nikula
@ 2021-11-24 17:34     ` Teres Alexis, Alan Previn
  2021-12-21 23:15       ` Teres Alexis, Alan Previn
  2021-12-22  1:49       ` Teres Alexis, Alan Previn
  2021-12-22 20:13     ` Teres Alexis, Alan Previn
  2 siblings, 2 replies; 52+ messages in thread
From: Teres Alexis, Alan Previn @ 2021-11-24 17:34 UTC (permalink / raw)
  To: intel-gfx, Wajdeczko, Michal

Thanks Michal for the thorough review of the code (and the other patches). I will fix them all.

On the register-to-string helper function,
i'll have to think it through because i do want to keep future development
maintenance work when adding new registers simple (in the sense that
adding a single line into the table will be all thats needed).

Unless you are suggesting keeping a global i915-wide list somewhere?
which might be a bit of an overhead when searching through an offset list
to find the mmio being requested for string return - unless i keep a sorted tree
initialized with registers ordered by address, but would not work well for
different registers that share addresses on diff gen's).


...alan


On Tue, 2021-11-23 at 22:46 +0100, Michal Wajdeczko wrote:
> Hi,
> 
> just few random nits below
> 
> -Michal
> 
> 
> On 23.11.2021 00:03, Alan Previn wrote:
> > Update GuC ADS size allocation to include space for
> > the lists of error state capture register descriptors.
> > 
> > Also, populate the lists of registers we want GuC to report back to
> > Host on engine reset events. This list should include global,
> > engine-class and engine-instance registers for every engine-class
> > type on the current hardware.
> > 
> > NOTE: Start with a fake table of register lists to layout the
> > framework before adding real registers in subsequent patch.
> > 
> > Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com>
> > ---
> >  drivers/gpu/drm/i915/Makefile                 |   1 +
> >  drivers/gpu/drm/i915/gt/uc/intel_guc.c        |  10 +-
> >  drivers/gpu/drm/i915/gt/uc/intel_guc.h        |   5 +
> >  drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c    | 176 ++++++++++++-
> >  .../gpu/drm/i915/gt/uc/intel_guc_capture.c    | 232 ++++++++++++++++++
> >  .../gpu/drm/i915/gt/uc/intel_guc_capture.h    |  47 ++++
> >  drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |  19 +-
> >  7 files changed, 476 insertions(+), 14 deletions(-)
> >  create mode 100644 drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
> >  create mode 100644 drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h
> > 
> > diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
> > index 074d6b8edd23..e3c4d5cea4c3 100644
> > --- a/drivers/gpu/drm/i915/Makefile
> > +++ b/drivers/gpu/drm/i915/Makefile
> > @@ -190,6 +190,7 @@ i915-y += gt/uc/intel_uc.o \
> >  	  gt/uc/intel_guc_rc.o \
> >  	  gt/uc/intel_guc_slpc.o \
> >  	  gt/uc/intel_guc_submission.o \
> > +	  gt/uc/intel_guc_capture.o \
> 
> use alphabetical order
> 
> >  	  gt/uc/intel_huc.o \
> >  	  gt/uc/intel_huc_debugfs.o \
> >  	  gt/uc/intel_huc_fw.o
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.c b/drivers/gpu/drm/i915/gt/uc/intel_guc.c
> > index 5cf9ebd2ee55..458f0d248a5a 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.c
> > @@ -335,9 +335,14 @@ int intel_guc_init(struct intel_guc *guc)
> >  	if (ret)
> >  		goto err_fw;
> >  
> > -	ret = intel_guc_ads_create(guc);
> > +	ret = intel_guc_capture_init(guc);
> >  	if (ret)
> >  		goto err_log;
> > +
> > +	ret = intel_guc_ads_create(guc);
> > +	if (ret)
> > +		goto err_capture;
> > +
> >  	GEM_BUG_ON(!guc->ads_vma);
> >  
> >  	ret = intel_guc_ct_init(&guc->ct);
> > @@ -376,6 +381,8 @@ int intel_guc_init(struct intel_guc *guc)
> >  	intel_guc_ct_fini(&guc->ct);
> >  err_ads:
> >  	intel_guc_ads_destroy(guc);
> > +err_capture:
> > +	intel_guc_capture_destroy(guc);
> >  err_log:
> >  	intel_guc_log_destroy(&guc->log);
> >  err_fw:
> > @@ -403,6 +410,7 @@ void intel_guc_fini(struct intel_guc *guc)
> >  	intel_guc_ct_fini(&guc->ct);
> >  
> >  	intel_guc_ads_destroy(guc);
> > +	intel_guc_capture_destroy(guc);
> >  	intel_guc_log_destroy(&guc->log);
> >  	intel_uc_fw_fini(&guc->fw);
> >  }
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > index 9de99772f916..d136c69abe12 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > @@ -16,6 +16,7 @@
> >  #include "intel_guc_log.h"
> >  #include "intel_guc_reg.h"
> >  #include "intel_guc_slpc_types.h"
> > +#include "intel_guc_capture.h"
> 
> use alphabetical order
> 
> >  #include "intel_uc_fw.h"
> >  #include "i915_utils.h"
> >  #include "i915_vma.h"
> > @@ -37,6 +38,8 @@ struct intel_guc {
> >  	struct intel_guc_ct ct;
> >  	/** @slpc: sub-structure containing SLPC related data and objects */
> >  	struct intel_guc_slpc slpc;
> > +	/** @capture: the error-state-capture module's data and objects */
> > +	struct intel_guc_state_capture capture;
> >  
> >  	/** @sched_engine: Global engine used to submit requests to GuC */
> >  	struct i915_sched_engine *sched_engine;
> > @@ -138,6 +141,8 @@ struct intel_guc {
> >  	u32 ads_regset_size;
> >  	/** @ads_golden_ctxt_size: size of the golden contexts in the ADS */
> >  	u32 ads_golden_ctxt_size;
> > +	/** @ads_capture_size: size of register lists in the ADS used for error capture */
> > +	u32 ads_capture_size;
> >  	/** @ads_engine_usage_size: size of engine usage in the ADS */
> >  	u32 ads_engine_usage_size;
> >  
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
> > index 6c81ddd303d3..2780c0fadd01 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
> > @@ -10,6 +10,7 @@
> >  #include "gt/shmem_utils.h"
> >  #include "intel_guc_ads.h"
> >  #include "intel_guc_fwif.h"
> > +#include "intel_guc_capture.h"
> 
> wrong order
> 
> >  #include "intel_uc.h"
> >  #include "i915_drv.h"
> >  
> > @@ -71,8 +72,7 @@ static u32 guc_ads_golden_ctxt_size(struct intel_guc *guc)
> >  
> >  static u32 guc_ads_capture_size(struct intel_guc *guc)
> >  {
> > -	/* Basic support to init ADS without a proper GuC error capture list */
> > -	return PAGE_ALIGN(PAGE_SIZE);
> > +	return PAGE_ALIGN(guc->ads_capture_size);
> >  }
> >  
> >  static u32 guc_ads_private_data_size(struct intel_guc *guc)
> > @@ -519,24 +519,170 @@ static void guc_init_golden_context(struct intel_guc *guc)
> >  	GEM_BUG_ON(guc->ads_golden_ctxt_size != total_size);
> >  }
> >  
> > -static void guc_capture_prep_lists(struct intel_guc *guc, struct __guc_ads_blob *blob)
> > +static int
> > +guc_fill_reglist(struct intel_guc *guc, struct __guc_ads_blob *blob, int vf, bool enabled,
> > +		 int classid, int type, char *typename, u16 *p_numregs, int newnum, u8 **p_virt_ptr,
> > +		 u32 *p_blobptr_to_ggtt, u32 *p_ggtt, u32 null_ggtt)
> 
> hmm, this does not look good - do we really need all these params ?
> 
> >  {
> > -	int i, j;
> > -	u32 addr_ggtt, offset;
> > +	struct drm_i915_private *i915 = guc_to_gt(guc)->i915;
> > +	struct guc_debug_capture_list *listnode;
> > +	int size = 0;
> >  
> > -	offset = guc_ads_capture_offset(guc);
> > -	addr_ggtt = intel_guc_ggtt_offset(guc, guc->ads_vma) + offset;
> > +	if (blob && *p_numregs != newnum) {
> > +		if (type == GUC_CAPTURE_LIST_TYPE_GLOBAL)
> > +			drm_warn(&i915->drm, "Guc-Cap VF%d-%s num-reg mismatch was=%d now=%d!\n",
> > +				 vf, typename, *p_numregs, newnum);
> > +		else
> > +			drm_warn(&i915->drm, "Guc-Cap VF%d-Class-%d-%s num-reg mismatch was=%d now=%d!\n",
> > +				 vf, classid, typename, *p_numregs, newnum);
> > +	}
> > +	/*
> > +	 * For enabled capture lists, we not only need to call capture module to help
> > +	 * populate the list-descriptor into the correct ads capture structures, but
> > +	 * we also need to increment the virtual pointers and ggtt offsets so that
> > +	 * caller has the subsequent gfx memory location.
> > +	 */
> > +	*p_numregs = newnum;
> > +	size = PAGE_ALIGN((sizeof(struct guc_debug_capture_list)) +
> > +			  (newnum * sizeof(struct guc_mmio_reg)));
> > +	/* if caller hasn't allocated ADS blob, return size and counts, we're done */
> > +	if (!blob)
> > +		return size;
> > +	if (blob) {
> 
> redundant
> 
> > +		/* if caller allocated ADS blob, populate the capture register descriptors */
> > +		if (!newnum) {
> > +			*p_blobptr_to_ggtt = null_ggtt;
> > +		} else {
> > +			/* get ptr and populate header info: */
> > +			*p_blobptr_to_ggtt = *p_ggtt;
> > +			listnode = (struct guc_debug_capture_list *)*p_virt_ptr;
> > +			*p_ggtt += sizeof(struct guc_debug_capture_list);
> > +			*p_virt_ptr += sizeof(struct guc_debug_capture_list);
> > +			listnode->header.info = FIELD_PREP(GUC_CAPTURELISTHDR_NUMDESCR, *p_numregs);
> > +
> > +			/* get ptr and populate register descriptor list: */
> > +			intel_guc_capture_list_init(guc, vf, type, classid,
> > +						    (struct guc_mmio_reg *)*p_virt_ptr,
> > +						    *p_numregs);
> > +
> > +			/* increment ptrs for that header: */
> > +			*p_ggtt += size - sizeof(struct guc_debug_capture_list);
> > +			*p_virt_ptr += size - sizeof(struct guc_debug_capture_list);
> > +		}
> > +	}
> > +
> > +	return size;
> > +}
> > +
> > +static int guc_capture_prep_lists(struct intel_guc *guc, struct __guc_ads_blob *blob)
> > +{
> > +	struct intel_gt *gt = guc_to_gt(guc);
> > +	int i, j, size;
> > +	u32 ggtt, null_ggtt, offset, alloc_size = 0;
> > +	struct guc_gt_system_info *info, local_info;
> > +	struct guc_debug_capture_list *listnode;
> > +	struct drm_i915_private *i915 = guc_to_gt(guc)->i915;
> > +	struct intel_guc_state_capture *gc = &guc->capture;
> > +	u16 tmp = 0;
> > +	u8 *ptr = NULL;
> > +
> > +	if (blob) {
> > +		offset = guc_ads_capture_offset(guc);
> > +		ggtt = intel_guc_ggtt_offset(guc, guc->ads_vma) + offset;
> > +		ptr = ((u8 *)blob) + offset;
> > +		info = &blob->system_info;
> > +	} else {
> > +		memset(&local_info, 0, sizeof(local_info));
> > +		info = &local_info;
> > +		fill_engine_enable_masks(gt, info);
> > +	}
> > +
> > +	/* first, set aside the first page for a capture_list with zero descriptors */
> > +	alloc_size = PAGE_SIZE;
> > +	if (blob) {
> > +		listnode = (struct guc_debug_capture_list *)ptr;
> > +		listnode->header.info = FIELD_PREP(GUC_CAPTURELISTHDR_NUMDESCR, 0);
> > +		null_ggtt = ggtt;
> > +		ggtt += PAGE_SIZE;
> > +		ptr +=  PAGE_SIZE;
> > +	}
> >  
> > -	/* FIXME: Populate a proper capture list */
> > +#define COUNT_REGS intel_guc_capture_list_count
> > +#define FILL_REGS guc_fill_reglist
> > +#define TYPE_CLASS GUC_CAPTURE_LIST_TYPE_ENGINE_CLASS
> > +#define TYPE_INSTANCE GUC_CAPTURE_LIST_TYPE_ENGINE_INSTANCE
> >  
> >  	for (i = 0; i < GUC_CAPTURE_LIST_INDEX_MAX; i++) {
> >  		for (j = 0; j < GUC_MAX_ENGINE_CLASSES; j++) {
> > -			blob->ads.capture_instance[i][j] = addr_ggtt;
> > -			blob->ads.capture_class[i][j] = addr_ggtt;
> > +			if (!info->engine_enabled_masks[j]) {
> > +				if (gc->num_class_regs[i][j])
> > +					drm_warn(&i915->drm, "GuC-Cap VF%d-class-%d "
> > +						 "class regs valid mismatch was=%d now=%d!\n",
> > +						 i, j, gc->num_class_regs[i][j], tmp);
> > +				if (gc->num_instance_regs[i][j])
> > +					drm_warn(&i915->drm, "GuC-Cap VF%d-class-%d "
> > +						 "inst regs valid mismatch was=%d now=%d!\n",
> > +						 i, j, gc->num_instance_regs[i][j], tmp);
> > +				gc->num_class_regs[i][j] = 0;
> > +				gc->num_instance_regs[i][j] = 0;
> > +				if (blob) {
> > +					blob->ads.capture_class[i][j] = null_ggtt;
> > +					blob->ads.capture_instance[i][j] = null_ggtt;
> > +				}
> > +			} else {
> > +				if (!COUNT_REGS(guc, i, TYPE_CLASS,
> > +						guc_class_to_engine_class(j), &tmp)) {
> > +					size = FILL_REGS(guc, blob, i, true, j, TYPE_CLASS,
> > +							 "class", &gc->num_class_regs[i][j],
> > +							 tmp, &ptr,
> > +							 &blob->ads.capture_class[i][j],
> > +							 &ggtt, null_ggtt);
> > +					gc->class_list_size += size;
> > +					alloc_size += size;
> > +				} else {
> > +					gc->num_class_regs[i][j] = 0;
> > +					if (blob)
> > +						blob->ads.capture_class[i][j] = null_ggtt;
> > +				}
> > +				if (!COUNT_REGS(guc, i, TYPE_INSTANCE,
> > +						guc_class_to_engine_class(j), &tmp)) {
> > +					size = FILL_REGS(guc, blob, i, true, j, TYPE_INSTANCE,
> > +							 "instance", &gc->num_instance_regs[i][j],
> > +							 tmp, &ptr,
> > +							 &blob->ads.capture_instance[i][j],
> > +							 &ggtt, null_ggtt);
> > +					gc->instance_list_size += size;
> > +					alloc_size += size;
> > +				} else {
> > +					gc->num_instance_regs[i][j] = 0;
> > +					if (blob)
> > +						blob->ads.capture_instance[i][j] = null_ggtt;
> > +				}
> > +			}
> > +		}
> > +		if (!COUNT_REGS(guc, i, GUC_CAPTURE_LIST_TYPE_GLOBAL, 0, &tmp)) {
> > +			size = FILL_REGS(guc, blob, i, true, 0, GUC_CAPTURE_LIST_TYPE_GLOBAL,
> > +					 "global", &gc->num_global_regs[i], tmp, &ptr,
> > +					 &blob->ads.capture_global[i], &ggtt, null_ggtt);
> > +			gc->global_list_size += size;
> > +			alloc_size += size;
> > +		} else {
> > +			gc->num_global_regs[i] = 0;
> > +			if (blob)
> > +				blob->ads.capture_global[i] = null_ggtt;
> >  		}
> > -
> > -		blob->ads.capture_global[i] = addr_ggtt;
> >  	}
> > +
> > +#undef COUNT_REGS
> > +#undef FILL_REGS
> > +#undef TYPE_CLASS
> > +#undef TYPE_INSTANCE
> > +
> > +	if (guc->ads_capture_size && guc->ads_capture_size != PAGE_ALIGN(alloc_size))
> > +		drm_warn(&i915->drm, "GuC->ADS->Capture alloc size changed from %d to %d\n",
> > +			 guc->ads_capture_size, PAGE_ALIGN(alloc_size));
> > +
> > +	return PAGE_ALIGN(alloc_size);
> >  }
> >  
> >  static void __guc_ads_init(struct intel_guc *guc)
> > @@ -614,6 +760,12 @@ int intel_guc_ads_create(struct intel_guc *guc)
> >  		return ret;
> >  	guc->ads_golden_ctxt_size = ret;
> >  
> > +	/* Likewise the capture lists: */
> > +	ret = guc_capture_prep_lists(guc, NULL);
> > +	if (ret < 0)
> > +		return ret;
> > +	guc->ads_capture_size = ret;
> > +
> >  	/* Now the total size can be determined: */
> >  	size = guc_ads_blob_size(guc);
> >  
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
> > new file mode 100644
> > index 000000000000..c741c77b7fc8
> > --- /dev/null
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
> > @@ -0,0 +1,232 @@
> > +// SPDX-License-Identifier: MIT
> > +/*
> > + * Copyright © 2021-2021 Intel Corporation
> > + */
> > +
> > +#include <drm/drm_print.h>
> > +
> > +#include "i915_drv.h"
> > +#include "i915_drv.h"
> 
> duplicated include
> 
> > +#include "i915_memcpy.h"
> > +#include "gt/intel_gt.h"
> > +
> > +#include "intel_guc_fwif.h"
> > +#include "intel_guc_capture.h"
> > +
> > +/* Define all device tables of GuC error capture register lists */
> > +
> > +/********************************* Gen12 LP  *********************************/
> 
> didn't we move away from "GEN" naming ?
> 
> > +/************** GLOBAL *************/
> 
> do we really need all these decorations ?
> 
> > +struct __guc_mmio_reg_descr gen12lp_global_regs[] = {
> > +	{SWF_ILK(0),               0,      0, "SWF_ILK0"},
> > +	/* Add additional register list */
> 
> do we need this reminder ?
> 
> > +/********** List of lists **********/
> > +struct __guc_mmio_reg_descr_group gen12lp_lists[] = {
> > +	{
> > +		.list = gen12lp_global_regs,
> > +		.num_regs = (sizeof(gen12lp_global_regs) / sizeof(struct __guc_mmio_reg_descr)),
> 
> ARRAY_SIZE ?
> 
> > +/************ FIXME: Populate tables for other devices in subsequent patch ************/
> > +
> > +static struct __guc_mmio_reg_descr_group *
> > +guc_capture_get_device_reglist(struct drm_i915_private *dev_priv)
> 
> in new code we are using "i915" instead of "dev_priv" and since this
> function has "guc" prefix it shall rather take "guc" as param:
> 
> guc_capture_get_device_reglist(struct intel_guc *guc)
> {
> 	struct drm_i915_private *i915 = guc_to_gt(guc)->i915;
> 	...
> 
> 
> > +static inline void
> > +warn_with_capture_list_identifier(struct drm_i915_private *i915, char *msg,
> > +				  u32 owner, u32 type, u32 classid)
> > +{
> > +	const char *ownerstr[GUC_CAPTURE_LIST_INDEX_MAX] = {"PF", "VF"};
> > +	const char *typestr[GUC_CAPTURE_LIST_TYPE_MAX - 1] = {"Class", "Instance"};
> > +	const char *classstr[GUC_LAST_ENGINE_CLASS + 1] = {"Render", "Video", "VideoEnhance",
> > +							   "Blitter", "Reserved"};
> 
> better to wrap that into simple small helpers like
> 
> 	const char *stringify_guc_capture_owner(u32 owner) { .. }
> 	const char *stringify_guc_capture_type(u32 type) { .. }
> 	const char *stringify_guc_capture_class(u32 class) { .. }
> 
> > +int intel_guc_capture_list_count(struct intel_guc *guc, u32 owner, u32 type, u32 classid,
> > +				 u16 *num_entries)
> > +{
> > +	struct drm_i915_private *dev_priv = (guc_to_gt(guc))->i915;
> 
> s/dev_priv/i915
> redundant ()
> 
> > +	struct __guc_mmio_reg_descr_group *reglists = guc->capture.reglists;
> > +	struct __guc_mmio_reg_descr_group *match;
> > +
> > +	if (!reglists)
> > +		return -ENODEV;
> > +
> > +	match = guc_capture_get_one_list(reglists, owner, type, classid);
> > +	if (match) {
> > +		*num_entries = match->num_regs;
> > +		return 0;
> 
> IIRC early returns are preferred for error cases, not success
> 
> > +int intel_guc_capture_list_init(struct intel_guc *guc, u32 owner, u32 type, u32 classid,
> > +				struct guc_mmio_reg *ptr, u16 num_entries)
> > +{
> > +	u32 j = 0;
> > +	struct drm_i915_private *dev_priv = (guc_to_gt(guc))->i915;
> 
> s/dev_priv/i915
> redundant ()
> 
> > +struct intel_guc;
> > +
> > +struct __guc_mmio_reg_descr {
> > +	i915_reg_t reg;
> > +	u32 flags;
> > +	u32 mask;
> > +	char *regname;
> 
> const char* ?
> 
> but maybe instead of adding reg name to the GuC specific struct we
> should add generic purpose function that will return pretty name of the
> register:
> 
> i915_reg.c:
> 
> const char *i915_reg_to_string(i915_reg_r reg)
> {
> 	...
> }
> 
> >  
> >  /* Capture-types of GuC capture register lists */
> > -enum
> > +enum guc_capture_owner
> >  {
> >  	GUC_CAPTURE_LIST_INDEX_PF = 0,
> >  	GUC_CAPTURE_LIST_INDEX_VF = 1,
> >  	GUC_CAPTURE_LIST_INDEX_MAX = 2,
> 
> s/INDEX/OWNER ?
> 
> >  };
> >  

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Intel-gfx] [RFC 2/7] drm/i915/guc: Update GuC ADS size for error capture lists
  2021-11-24 10:06   ` Jani Nikula
@ 2021-11-24 17:37     ` Teres Alexis, Alan Previn
  0 siblings, 0 replies; 52+ messages in thread
From: Teres Alexis, Alan Previn @ 2021-11-24 17:37 UTC (permalink / raw)
  To: intel-gfx, jani.nikula

Thanks very much Jani for the detail review of the code... apologies on some of the styling mishaps.
I will fix them all. I agree completely with the header file comments - my bad on that - had already
learnt that lesson on pxp side. Will fix accordingly.

...alan


On Wed, 2021-11-24 at 12:06 +0200, Jani Nikula wrote:
> On Mon, 22 Nov 2021, Alan Previn <alan.previn.teres.alexis@intel.com> wrote:
> > +	{
> > +		.list = gen12lp_vec_class_regs,
> > +		.num_regs = (sizeof(gen12lp_vec_class_regs) / sizeof(struct __guc_mmio_reg_descr)),
> > +		.owner = GUC_CAPTURE_LIST_INDEX_PF,
> > +		.type = GUC_CAPTURE_LIST_TYPE_ENGINE_CLASS,
> > +		.engine = VIDEO_ENHANCEMENT_CLASS
> > +	},
> > +	{
> 
> Usually }, { on the same line
> 
> > +		.list = gen12lp_vec_inst_regs,
> > +		.num_regs = (sizeof(gen12lp_vec_inst_regs) / sizeof(struct __guc_mmio_reg_descr)),
> > +		.owner = GUC_CAPTURE_LIST_INDEX_PF,
> > +		.type = GUC_CAPTURE_LIST_TYPE_ENGINE_INSTANCE,
> > +		.engine = VIDEO_ENHANCEMENT_CLASS
> > +	},
> > +	{NULL, 0, 0, 0, 0}
> 
> Just {}  should work as a sentinel.
> 
> > +};
> > +
> > +/************ FIXME: Populate tables for other devices in subsequent patch ************/
> 
> Please don't add any of this ******* nonsense.
> 
> > +
> > +static struct __guc_mmio_reg_descr_group *
> > +guc_capture_get_device_reglist(struct drm_i915_private *dev_priv)
> > +{
> > +	if (IS_TIGERLAKE(dev_priv) || IS_ROCKETLAKE(dev_priv) ||
> > +	    IS_ALDERLAKE_S(dev_priv) || IS_ALDERLAKE_P(dev_priv)) {
> > +		return gen12lp_lists;
> > +	}
> > +
> > +	return NULL;
> > +}
> > +
> > +static inline struct __guc_mmio_reg_descr_group *
> > +guc_capture_get_one_list(struct __guc_mmio_reg_descr_group *reglists, u32 owner, u32 type, u32 id)
> 
> Please don't use inlines in .c files. Let the compiler decide.
> 
> > +{
> > +	int i = 0;
> > +
> > +	if (!reglists)
> > +		return NULL;
> > +	while (reglists[i].list) {
> > +		if (reglists[i].owner == owner &&
> > +		    reglists[i].type == type) {
> > +			if (reglists[i].type == GUC_CAPTURE_LIST_TYPE_GLOBAL ||
> > +			    reglists[i].engine == id) {
> > +				return &reglists[i];
> > +			}
> > +		}
> > +		++i;
> > +	}
> 
> That's a for loop right there.
> 
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h
> > new file mode 100644
> > index 000000000000..352940b8bc87
> > --- /dev/null
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h
> > @@ -0,0 +1,47 @@
> > +/* SPDX-License-Identifier: MIT */
> > +/*
> > + * Copyright © 2021-2021 Intel Corporation
> > + */
> > +
> > +#ifndef _INTEL_GUC_CAPTURE_H
> > +#define _INTEL_GUC_CAPTURE_H
> > +
> > +#include <linux/mutex.h>
> > +#include <linux/workqueue.h>
> 
> Both of these seem random and completely unnecessary. linux/types.h is
> required but it's not here.
> 
> > +#include "intel_guc_fwif.h"
> 
> I've been trying hard to reduce includes from headers throughout the
> driver, to clean up and clarify the interfaces and dependencies. I don't
> know how the guc headers have grown the kind of interdependencies that
> they all pull in almost everything.
> 
> This one line pulls in another 19 headers. Just to get
> GUC_CAPTURE_LIST_INDEX_MAX and GUC_MAX_ENGINE_CLASSES. Everything else
> could be solved through forward declarations.
> 
> BR,
> Jani.
> 
> 
> >  	struct guc_mmio_reg_set reg_state_list[GUC_MAX_ENGINE_CLASSES][GUC_MAX_INSTANCES_PER_CLASS];
> 
> -- 
> Jani Nikula, Intel Open Source Graphics Center


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Intel-gfx] [RFC 4/7] drm/i915/guc: Add GuC's error state capture output structures.
  2021-11-24 10:08   ` Jani Nikula
@ 2021-11-24 17:37     ` Teres Alexis, Alan Previn
  0 siblings, 0 replies; 52+ messages in thread
From: Teres Alexis, Alan Previn @ 2021-11-24 17:37 UTC (permalink / raw)
  To: intel-gfx, jani.nikula

Good catch - i missed that. Will fix it. Thanks again.

...alan

On Wed, 2021-11-24 at 12:08 +0200, Jani Nikula wrote:
> On Mon, 22 Nov 2021, Alan Previn <alan.previn.teres.alexis@intel.com> wrote:
> > Add GuC's error capture output structures and definitions as how
> > they would appear in GuC log buffer's error capture subregion after
> > an error state capture G2H event notification.
> 
> If it's for decoding data, should they all have __packed?
> 
> > Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com>
> > ---
> >  .../gpu/drm/i915/gt/uc/intel_guc_capture.h    | 35 +++++++++++++++++++
> >  1 file changed, 35 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h
> > index df420f0f49b3..b2454b6cd778 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h
> > @@ -29,6 +29,41 @@ struct __guc_mmio_reg_descr_group {
> >  	struct __guc_mmio_reg_descr * ext;
> >  };
> >  
> > +struct intel_guc_capture_out_data_header {
> > +	u32 reserved1;
> > +	u32 info;
> > +		#define GUC_CAPTURE_DATAHDR_SRC_TYPE GENMASK(3, 0) /* as per enum guc_capture_type */
> > +		#define GUC_CAPTURE_DATAHDR_SRC_CLASS GENMASK(7, 4) /* as per GUC_MAX_ENGINE_CLASSES */
> > +		#define GUC_CAPTURE_DATAHDR_SRC_INSTANCE GENMASK(11, 8)
> > +	u32 lrca; /* if type-instance, LRCA (address) that hung, else set to ~0 */
> > +	u32 guc_ctx_id; /* if type-instance, context index of hung context, else set to ~0 */
> > +	u32 num_mmios;
> > +		#define GUC_CAPTURE_DATAHDR_NUM_MMIOS GENMASK(9, 0)
> > +};
> > +
> > +struct intel_guc_capture_out_data {
> > +	struct intel_guc_capture_out_data_header capture_header;
> > +	struct guc_mmio_reg capture_list[0];
> > +};
> > +
> > +enum guc_capture_group_types {
> > +	GUC_STATE_CAPTURE_GROUP_TYPE_FULL,
> > +	GUC_STATE_CAPTURE_GROUP_TYPE_PARTIAL,
> > +	GUC_STATE_CAPTURE_GROUP_TYPE_MAX,
> > +};
> > +
> > +struct intel_guc_capture_out_group_header {
> > +	u32 reserved1;
> > +	u32 info;
> > +		#define GUC_CAPTURE_GRPHDR_SRC_NUMCAPTURES GENMASK(7, 0)
> > +		#define GUC_CAPTURE_GRPHDR_SRC_CAPTURE_TYPE GENMASK(15, 8)
> > +};
> > +
> > +struct intel_guc_capture_out_group {
> > +	struct intel_guc_capture_out_group_header group_header;
> > +	struct intel_guc_capture_out_data group_lists[0];
> > +};
> > +
> >  struct intel_guc_state_capture {
> >  	struct __guc_mmio_reg_descr_group *reglists;
> >  	u16 num_instance_regs[GUC_CAPTURE_LIST_INDEX_MAX][GUC_MAX_ENGINE_CLASSES];
> 
> -- 
> Jani Nikula, Intel Open Source Graphics Center


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Intel-gfx] [RFC 4/7] drm/i915/guc: Add GuC's error state capture output structures.
  2021-11-22 23:03 ` [Intel-gfx] [RFC 4/7] drm/i915/guc: Add GuC's error state capture output structures Alan Previn
  2021-11-24 10:08   ` Jani Nikula
@ 2021-12-07 21:01   ` Matthew Brost
  2021-12-07 23:35     ` Teres Alexis, Alan Previn
  1 sibling, 1 reply; 52+ messages in thread
From: Matthew Brost @ 2021-12-07 21:01 UTC (permalink / raw)
  To: Alan Previn; +Cc: intel-gfx

On Mon, Nov 22, 2021 at 03:03:59PM -0800, Alan Previn wrote:
> Add GuC's error capture output structures and definitions as how
> they would appear in GuC log buffer's error capture subregion after
> an error state capture G2H event notification.
> 
> Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com>
> ---
>  .../gpu/drm/i915/gt/uc/intel_guc_capture.h    | 35 +++++++++++++++++++
>  1 file changed, 35 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h
> index df420f0f49b3..b2454b6cd778 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h
> @@ -29,6 +29,41 @@ struct __guc_mmio_reg_descr_group {
>  	struct __guc_mmio_reg_descr * ext;
>  };
>  
> +struct intel_guc_capture_out_data_header {
> +	u32 reserved1;
> +	u32 info;
> +		#define GUC_CAPTURE_DATAHDR_SRC_TYPE GENMASK(3, 0) /* as per enum guc_capture_type */
> +		#define GUC_CAPTURE_DATAHDR_SRC_CLASS GENMASK(7, 4) /* as per GUC_MAX_ENGINE_CLASSES */
> +		#define GUC_CAPTURE_DATAHDR_SRC_INSTANCE GENMASK(11, 8)
> +	u32 lrca; /* if type-instance, LRCA (address) that hung, else set to ~0 */
> +	u32 guc_ctx_id; /* if type-instance, context index of hung context, else set to ~0 */

s/guc_ctx_id/guc_id

With __packed (per Jani's feedback) as well:

Reviewed-by: Matthew Brost <matthew.brost@intel.com>

> +	u32 num_mmios;
> +		#define GUC_CAPTURE_DATAHDR_NUM_MMIOS GENMASK(9, 0)
> +};
> +
> +struct intel_guc_capture_out_data {
> +	struct intel_guc_capture_out_data_header capture_header;
> +	struct guc_mmio_reg capture_list[0];
> +};
> +
> +enum guc_capture_group_types {
> +	GUC_STATE_CAPTURE_GROUP_TYPE_FULL,
> +	GUC_STATE_CAPTURE_GROUP_TYPE_PARTIAL,
> +	GUC_STATE_CAPTURE_GROUP_TYPE_MAX,
> +};
> +
> +struct intel_guc_capture_out_group_header {
> +	u32 reserved1;
> +	u32 info;
> +		#define GUC_CAPTURE_GRPHDR_SRC_NUMCAPTURES GENMASK(7, 0)
> +		#define GUC_CAPTURE_GRPHDR_SRC_CAPTURE_TYPE GENMASK(15, 8)
> +};
> +
> +struct intel_guc_capture_out_group {
> +	struct intel_guc_capture_out_group_header group_header;
> +	struct intel_guc_capture_out_data group_lists[0];
> +};
> +
>  struct intel_guc_state_capture {
>  	struct __guc_mmio_reg_descr_group *reglists;
>  	u16 num_instance_regs[GUC_CAPTURE_LIST_INDEX_MAX][GUC_MAX_ENGINE_CLASSES];
> -- 
> 2.25.1
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Intel-gfx] [RFC 5/7] drm/i915/guc: Update GuC's log-buffer-state access for error capture.
  2021-11-22 23:04 ` [Intel-gfx] [RFC 5/7] drm/i915/guc: Update GuC's log-buffer-state access for error capture Alan Previn
@ 2021-12-07 22:31   ` Matthew Brost
  2021-12-07 23:33     ` Teres Alexis, Alan Previn
  0 siblings, 1 reply; 52+ messages in thread
From: Matthew Brost @ 2021-12-07 22:31 UTC (permalink / raw)
  To: Alan Previn; +Cc: intel-gfx

On Mon, Nov 22, 2021 at 03:04:00PM -0800, Alan Previn wrote:
> GuC log buffer regions for debug-log-events, crash-dumps and
> error-state-capture are all a single bo allocation that includes
> the guc_log_buffer_state structures.
> 
> Since the error-capture region is accessed with high priority at non-
> deterministic times (as part of gpu coredump) while the debug-log-event
> region is populated and accessed with different priorities, timings and
> consumers, let's split out separate locks for buffer-state accesses
> of each region.
> 
> Also, ensure a global mapping is made up front for the entire bo
> throughout GuC operation so that dynamic mapping and unmapping isn't
> required for error capture log access if relay-logging isn't running.
> 
> Additionally, while here, make some readibility improvements:
> 1. change previous function names with "capture_logs" to
>    "copy_debug_logs" to help make the distinction clearer.
> 2. Update the guc log region mapping comments to order them
>    according to the enum definition as per the GuC interface.
> 

Nothing major, just a couple nits below.

> Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com>
> ---
>  drivers/gpu/drm/i915/gt/uc/intel_guc.h        |   2 +
>  .../gpu/drm/i915/gt/uc/intel_guc_capture.c    |  46 +++++++
>  .../gpu/drm/i915/gt/uc/intel_guc_capture.h    |   1 +
>  drivers/gpu/drm/i915/gt/uc/intel_guc_log.c    | 120 ++++++++++++------
>  drivers/gpu/drm/i915/gt/uc/intel_guc_log.h    |  14 +-
>  5 files changed, 137 insertions(+), 46 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> index d136c69abe12..e0db21bbffdd 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> @@ -34,6 +34,8 @@ struct intel_guc {
>  	struct intel_uc_fw fw;
>  	/** @log: sub-structure containing GuC log related data and objects */
>  	struct intel_guc_log log;
> +	/** @log_state: states and locks for each subregion of GuC's log buffer */
> +	struct intel_guc_log_stats log_state[GUC_MAX_LOG_BUFFER];
>  	/** @ct: the command transport communication channel */
>  	struct intel_guc_ct ct;
>  	/** @slpc: sub-structure containing SLPC related data and objects */
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
> index eec1d193ac26..0cb358a98605 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
> @@ -344,6 +344,52 @@ int intel_guc_capture_list_init(struct intel_guc *guc, u32 owner, u32 type, u32
>  	return -ENODATA;
>  }
>  
> +int intel_guc_capture_output_min_size_est(struct intel_guc *guc)
> +{
> +	struct intel_gt *gt = guc_to_gt(guc);
> +	struct intel_engine_cs *engine;
> +	enum intel_engine_id id;
> +	int worst_min_size = 0, num_regs = 0;
> +	u16 tmp = 0;
> +
> +	/*
> +	 * If every single engine-instance suffered a failure in quick succession but
> +	 * were all unrelated, then a burst of multiple error-capture events would dump
> +	 * registers for every one engine instance, one at a time. In this case, GuC
> +	 * would even dump the global-registers repeatedly.
> +	 *
> +	 * For each engine instance, there would be 1 x intel_guc_capture_out_group output
> +	 * followed by 3 x intel_guc_capture_out_data lists. The latter is how the register
> +	 * dumps are split across different register types (where the '3' are global vs class
> +	 * vs instance). Finally, let's multiply the whole thing by 3x (just so we are
> +	 * not limited to just 1 rounds of data in a  worst case full register dump log)

s/a  worst/a worst/

> +	 *
> +	 * NOTE: intel_guc_log that allocates the log buffer would round this size up to
> +	 * a power of two.
> +	 */
> +
> +	for_each_engine(engine, gt, id) {
> +		worst_min_size += sizeof(struct intel_guc_capture_out_group_header) +
> +				  (3 * sizeof(struct intel_guc_capture_out_data_header));
> +
> +		if (!intel_guc_capture_list_count(guc, 0, GUC_CAPTURE_LIST_TYPE_GLOBAL, 0, &tmp))
> +			num_regs += tmp;
> +
> +		if (!intel_guc_capture_list_count(guc, 0, GUC_CAPTURE_LIST_TYPE_ENGINE_CLASS,
> +						  engine->class, &tmp)) {
> +			num_regs += tmp;
> +		}
> +		if (!intel_guc_capture_list_count(guc, 0, GUC_CAPTURE_LIST_TYPE_ENGINE_INSTANCE,
> +						  engine->class, &tmp)) {
> +			num_regs += tmp;
> +		}
> +	}
> +
> +	worst_min_size += (num_regs * sizeof(struct guc_mmio_reg));
> +
> +	return (worst_min_size * 3);

Maybe a define for the '3' here describing what the '3' means.

> +}
> +
>  void intel_guc_capture_destroy(struct intel_guc *guc)
>  {
>  	guc_capture_clear_ext_regs(guc->capture.reglists);
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h
> index b2454b6cd778..839b53425e1e 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h
> @@ -78,6 +78,7 @@ int intel_guc_capture_list_count(struct intel_guc *guc, u32 owner, u32 type, u32
>  				 u16 *num_entries);
>  int intel_guc_capture_list_init(struct intel_guc *guc, u32 owner, u32 type, u32 class,
>  				struct guc_mmio_reg *ptr, u16 num_entries);
> +int intel_guc_capture_output_min_size_est(struct intel_guc *guc);
>  void intel_guc_capture_destroy(struct intel_guc *guc);
>  int intel_guc_capture_init(struct intel_guc *guc);
>  
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_log.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_log.c
> index 1962a43302a8..dd86530f77a1 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_log.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_log.c
> @@ -10,7 +10,7 @@
>  #include "i915_memcpy.h"
>  #include "intel_guc_log.h"
>  
> -static void guc_log_capture_logs(struct intel_guc_log *log);
> +static void guc_log_copy_debuglogs_for_relay(struct intel_guc_log *log);
>  
>  /**
>   * DOC: GuC firmware log
> @@ -149,7 +149,7 @@ static void guc_move_to_next_buf(struct intel_guc_log *log)
>  	smp_wmb();
>  
>  	/* All data has been written, so now move the offset of sub buffer. */
> -	relay_reserve(log->relay.channel, log->vma->obj->base.size);
> +	relay_reserve(log->relay.channel, log->vma->obj->base.size - CAPTURE_BUFFER_SIZE);
>  
>  	/* Switch to the next sub buffer */
>  	relay_flush(log->relay.channel);
> @@ -169,25 +169,25 @@ static void *guc_get_write_buffer(struct intel_guc_log *log)
>  	return relay_reserve(log->relay.channel, 0);
>  }
>  
> -static bool guc_check_log_buf_overflow(struct intel_guc_log *log,
> -				       enum guc_log_buffer_type type,
> -				       unsigned int full_cnt)
> +bool guc_check_log_buf_overflow(struct intel_guc *guc,
> +				struct intel_guc_log_stats *log_state,
> +				unsigned int full_cnt)

I don't think you meant to drop the 'static' here.

>  {
> -	unsigned int prev_full_cnt = log->stats[type].sampled_overflow;
> +	unsigned int prev_full_cnt = log_state->sampled_overflow;
>  	bool overflow = false;
>  
>  	if (full_cnt != prev_full_cnt) {
>  		overflow = true;
>  
> -		log->stats[type].overflow = full_cnt;
> -		log->stats[type].sampled_overflow += full_cnt - prev_full_cnt;
> +		log_state->overflow = full_cnt;
> +		log_state->sampled_overflow += full_cnt - prev_full_cnt;
>  
>  		if (full_cnt < prev_full_cnt) {
>  			/* buffer_full_cnt is a 4 bit counter */
> -			log->stats[type].sampled_overflow += 16;
> +			log_state->sampled_overflow += 16;
>  		}
>  
> -		dev_notice_ratelimited(guc_to_gt(log_to_guc(log))->i915->drm.dev,
> +		dev_notice_ratelimited(guc_to_gt(guc)->i915->drm.dev,
>  				       "GuC log buffer overflow\n");
>  	}
>  
> @@ -210,8 +210,10 @@ static unsigned int guc_get_log_buffer_size(enum guc_log_buffer_type type)
>  	return 0;
>  }
>  
> -static void guc_read_update_log_buffer(struct intel_guc_log *log)
> +static void _guc_log_copy_debuglogs_for_relay(struct intel_guc_log *log)
>  {
> +	struct intel_guc *guc = log_to_guc(log);
> +	struct intel_guc_log_stats *logstate;
>  	unsigned int buffer_size, read_offset, write_offset, bytes_to_copy, full_cnt;
>  	struct guc_log_buffer_state *log_buf_state, *log_buf_snapshot_state;
>  	struct guc_log_buffer_state log_buf_state_local;
> @@ -235,7 +237,7 @@ static void guc_read_update_log_buffer(struct intel_guc_log *log)
>  		 * Used rate limited to avoid deluge of messages, logs might be
>  		 * getting consumed by User at a slow rate.
>  		 */
> -		DRM_ERROR_RATELIMITED("no sub-buffer to capture logs\n");
> +		DRM_ERROR_RATELIMITED("no sub-buffer to copy general logs\n");
>  		log->relay.full_count++;
>  
>  		goto out_unlock;
> @@ -245,12 +247,16 @@ static void guc_read_update_log_buffer(struct intel_guc_log *log)
>  	src_data += PAGE_SIZE;
>  	dst_data += PAGE_SIZE;
>  
> -	for (type = GUC_DEBUG_LOG_BUFFER; type < GUC_MAX_LOG_BUFFER; type++) {
> +	/* For relay logging, we exclude error state capture */
> +	for (type = GUC_DEBUG_LOG_BUFFER; type <= GUC_CRASH_DUMP_LOG_BUFFER; type++) {
>  		/*
> +		 * Get a lock to the buffer_state we want to read and update.
>  		 * Make a copy of the state structure, inside GuC log buffer
>  		 * (which is uncached mapped), on the stack to avoid reading
>  		 * from it multiple times.
>  		 */
> +		logstate = &guc->log_state[type];
> +		mutex_lock(&logstate->lock);
>  		memcpy(&log_buf_state_local, log_buf_state,
>  		       sizeof(struct guc_log_buffer_state));
>  		buffer_size = guc_get_log_buffer_size(type);
> @@ -259,13 +265,14 @@ static void guc_read_update_log_buffer(struct intel_guc_log *log)
>  		full_cnt = log_buf_state_local.buffer_full_cnt;
>  
>  		/* Bookkeeping stuff */
> -		log->stats[type].flush += log_buf_state_local.flush_to_file;
> -		new_overflow = guc_check_log_buf_overflow(log, type, full_cnt);
> +		logstate->flush += log_buf_state_local.flush_to_file;
> +		new_overflow = guc_check_log_buf_overflow(guc, logstate, full_cnt);
>  
>  		/* Update the state of shared log buffer */
>  		log_buf_state->read_ptr = write_offset;
>  		log_buf_state->flush_to_file = 0;
>  		log_buf_state++;
> +		mutex_unlock(&logstate->lock);
>  
>  		/* First copy the state structure in snapshot buffer */
>  		memcpy(log_buf_snapshot_state, &log_buf_state_local,
> @@ -313,15 +320,15 @@ static void guc_read_update_log_buffer(struct intel_guc_log *log)
>  	mutex_unlock(&log->relay.lock);
>  }
>  
> -static void capture_logs_work(struct work_struct *work)
> +static void copy_debug_logs_work(struct work_struct *work)
>  {
>  	struct intel_guc_log *log =
>  		container_of(work, struct intel_guc_log, relay.flush_work);
>  
> -	guc_log_capture_logs(log);
> +	guc_log_copy_debuglogs_for_relay(log);
>  }
>  
> -static int guc_log_map(struct intel_guc_log *log)
> +static int guc_log_relay_map(struct intel_guc_log *log)
>  {
>  	void *vaddr;
>  
> @@ -333,7 +340,9 @@ static int guc_log_map(struct intel_guc_log *log)
>  	/*
>  	 * Create a WC (Uncached for read) vmalloc mapping of log
>  	 * buffer pages, so that we can directly get the data
> -	 * (up-to-date) from memory.
> +	 * (up-to-date) from memory. This has already been
> +	 * mapped at GuC Init time (for error-state-capture), but
> +	 * call it again anyway for book-keeping
>  	 */
>  	vaddr = i915_gem_object_pin_map_unlocked(log->vma->obj, I915_MAP_WC);
>  	if (IS_ERR(vaddr))
> @@ -344,7 +353,7 @@ static int guc_log_map(struct intel_guc_log *log)
>  	return 0;
>  }
>  
> -static void guc_log_unmap(struct intel_guc_log *log)
> +static void guc_log_relay_unmap(struct intel_guc_log *log)
>  {
>  	lockdep_assert_held(&log->relay.lock);
>  
> @@ -354,8 +363,14 @@ static void guc_log_unmap(struct intel_guc_log *log)
>  
>  void intel_guc_log_init_early(struct intel_guc_log *log)
>  {
> +	struct intel_guc *guc = log_to_guc(log);
> +	int n;
> +
> +	for (n = GUC_DEBUG_LOG_BUFFER; n < GUC_MAX_LOG_BUFFER; n++)
> +		mutex_init(&guc->log_state[n].lock);
> +
>  	mutex_init(&log->relay.lock);
> -	INIT_WORK(&log->relay.flush_work, capture_logs_work);
> +	INIT_WORK(&log->relay.flush_work, copy_debug_logs_work);
>  	log->relay.started = false;
>  }
>  
> @@ -370,8 +385,11 @@ static int guc_log_relay_create(struct intel_guc_log *log)
>  	lockdep_assert_held(&log->relay.lock);
>  	GEM_BUG_ON(!log->vma);
>  
> -	 /* Keep the size of sub buffers same as shared log buffer */
> -	subbuf_size = log->vma->size;
> +	 /*
> +	  * Keep the size of sub buffers same as shared log buffer
> +	  * but GuC log-events excludes the error-state-capture logs
> +	  */
> +	subbuf_size = log->vma->size - CAPTURE_BUFFER_SIZE;
>  
>  	/*
>  	 * Store up to 8 snapshots, which is large enough to buffer sufficient
> @@ -406,13 +424,13 @@ static void guc_log_relay_destroy(struct intel_guc_log *log)
>  	log->relay.channel = NULL;
>  }
>  
> -static void guc_log_capture_logs(struct intel_guc_log *log)
> +static void guc_log_copy_debuglogs_for_relay(struct intel_guc_log *log)
>  {
>  	struct intel_guc *guc = log_to_guc(log);
>  	struct drm_i915_private *dev_priv = guc_to_gt(guc)->i915;
>  	intel_wakeref_t wakeref;
>  
> -	guc_read_update_log_buffer(log);
> +	_guc_log_copy_debuglogs_for_relay(log);
>  
>  	/*
>  	 * Generally device is expected to be active only at this
> @@ -452,6 +470,7 @@ int intel_guc_log_create(struct intel_guc_log *log)
>  {
>  	struct intel_guc *guc = log_to_guc(log);
>  	struct i915_vma *vma;
> +	void *vaddr;
>  	u32 guc_log_size;
>  	int ret;
>  
> @@ -459,23 +478,31 @@ int intel_guc_log_create(struct intel_guc_log *log)
>  
>  	/*
>  	 *  GuC Log buffer Layout
> +	 * (this ordering must follow "enum guc_log_buffer_type" definition)
>  	 *
>  	 *  +===============================+ 00B
> -	 *  |    Crash dump state header    |
> -	 *  +-------------------------------+ 32B
>  	 *  |      Debug state header       |
> +	 *  +-------------------------------+ 32B
> +	 *  |    Crash dump state header    |
> +	 *  +-------------------------------+ 64B
> +	 *  |     Capture state header      |
>  	 *  +-------------------------------+ 64B
>  	 *  |     Capture state header      |
>  	 *  +-------------------------------+ 96B
>  	 *  |                               |
>  	 *  +===============================+ PAGE_SIZE (4KB)
> -	 *  |        Crash Dump logs        |
> -	 *  +===============================+ + CRASH_SIZE
>  	 *  |          Debug logs           |
>  	 *  +===============================+ + DEBUG_SIZE
> +	 *  |        Crash Dump logs        |
> +	 *  +===============================+ + CRASH_SIZE
> +	 *  |         Capture logs          |
> +	 *  +===============================+ + CAPTURE_SIZE
>  	 */
> -	guc_log_size = PAGE_SIZE + CRASH_BUFFER_SIZE + DEBUG_BUFFER_SIZE +
> -		       CAPTURE_BUFFER_SIZE;
> +	if (intel_guc_capture_output_min_size_est(guc) > CAPTURE_BUFFER_SIZE)
> +		DRM_WARN("GuC log buffer for state_capture maybe too small. %d < %d\n",
> +			 CAPTURE_BUFFER_SIZE, intel_guc_capture_output_min_size_est(guc));
> +
> +	guc_log_size = PAGE_SIZE + DEBUG_BUFFER_SIZE + CRASH_BUFFER_SIZE + CAPTURE_BUFFER_SIZE;

I'd personally keep the original formatting here.

>  
>  	vma = intel_guc_allocate_vma(guc, guc_log_size);
>  	if (IS_ERR(vma)) {
> @@ -484,6 +511,17 @@ int intel_guc_log_create(struct intel_guc_log *log)
>  	}
>  
>  	log->vma = vma;
> +	/*
> +	 * Create a WC (Uncached for read) vmalloc mapping up front immediate access to
> +	 * data from memory during  critical events such as error capture
> +	 */
> +	vaddr = i915_gem_object_pin_map_unlocked(log->vma->obj, I915_MAP_WC);
> +	if (IS_ERR(vaddr)) {
> +		ret = PTR_ERR(vaddr);
> +		i915_vma_unpin_and_release(&log->vma, 0);
> +		goto err;
> +	}
> +	log->buf_addr = vaddr;
>  
>  	log->level = __get_default_log_level(log);
>  	DRM_DEBUG_DRIVER("guc_log_level=%d (%s, verbose:%s, verbosity:%d)\n",
> @@ -494,13 +532,14 @@ int intel_guc_log_create(struct intel_guc_log *log)
>  	return 0;
>  
>  err:
> -	DRM_ERROR("Failed to allocate GuC log buffer. %d\n", ret);
> +	DRM_ERROR("Failed to allocate or map GuC log buffer. %d\n", ret);
>  	return ret;
>  }
>  
>  void intel_guc_log_destroy(struct intel_guc_log *log)
>  {
> -	i915_vma_unpin_and_release(&log->vma, 0);
> +	log->buf_addr = NULL;
> +	i915_vma_unpin_and_release(&log->vma, I915_VMA_RELEASE_MAP);
>  }
>  
>  int intel_guc_log_set_level(struct intel_guc_log *log, u32 level)
> @@ -545,7 +584,7 @@ int intel_guc_log_set_level(struct intel_guc_log *log, u32 level)
>  
>  bool intel_guc_log_relay_created(const struct intel_guc_log *log)
>  {
> -	return log->relay.buf_addr;
> +	return log->buf_addr;
>  }
>  
>  int intel_guc_log_relay_open(struct intel_guc_log *log)
> @@ -576,7 +615,7 @@ int intel_guc_log_relay_open(struct intel_guc_log *log)
>  	if (ret)
>  		goto out_unlock;
>  
> -	ret = guc_log_map(log);
> +	ret = guc_log_relay_map(log);
>  	if (ret)
>  		goto out_relay;
>  
> @@ -628,8 +667,8 @@ void intel_guc_log_relay_flush(struct intel_guc_log *log)
>  	with_intel_runtime_pm(guc_to_gt(guc)->uncore->rpm, wakeref)
>  		guc_action_flush_log(guc);
>  
> -	/* GuC would have updated log buffer by now, so capture it */
> -	guc_log_capture_logs(log);
> +	/* GuC would have updated log buffer by now, so copy it */
> +	guc_log_copy_debuglogs_for_relay(log);
>  }
>  
>  /*
> @@ -659,7 +698,7 @@ void intel_guc_log_relay_close(struct intel_guc_log *log)
>  
>  	mutex_lock(&log->relay.lock);
>  	GEM_BUG_ON(!intel_guc_log_relay_created(log));
> -	guc_log_unmap(log);
> +	guc_log_relay_unmap(log);
>  	guc_log_relay_destroy(log);
>  	mutex_unlock(&log->relay.lock);
>  }
> @@ -695,6 +734,7 @@ stringify_guc_log_type(enum guc_log_buffer_type type)
>   */
>  void intel_guc_log_info(struct intel_guc_log *log, struct drm_printer *p)
>  {
> +	struct intel_guc *guc = log_to_guc(log);
>  	enum guc_log_buffer_type type;
>  
>  	if (!intel_guc_log_relay_created(log)) {
> @@ -709,8 +749,8 @@ void intel_guc_log_info(struct intel_guc_log *log, struct drm_printer *p)
>  	for (type = GUC_DEBUG_LOG_BUFFER; type < GUC_MAX_LOG_BUFFER; type++) {
>  		drm_printf(p, "\t%s:\tflush count %10u, overflow count %10u\n",
>  			   stringify_guc_log_type(type),
> -			   log->stats[type].flush,
> -			   log->stats[type].sampled_overflow);
> +			   guc->log_state[type].flush,
> +			   guc->log_state[type].sampled_overflow);
>  	}
>  }
>  
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_log.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_log.h
> index 9d9004dc58f1..2968023f7447 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_log.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_log.h
> @@ -42,9 +42,17 @@ struct intel_guc;
>  #define GUC_VERBOSITY_TO_LOG_LEVEL(x)	((x) + 2)
>  #define GUC_LOG_LEVEL_MAX GUC_VERBOSITY_TO_LOG_LEVEL(GUC_LOG_VERBOSITY_MAX)
>  
> +struct intel_guc_log_stats {
> +	struct mutex lock; /* protects below and guc_log_buffer_state's read-ptr */
> +	u32 sampled_overflow;
> +	u32 overflow;
> +	u32 flush;
> +};
> +
>  struct intel_guc_log {
>  	u32 level;
>  	struct i915_vma *vma;
> +	void *buf_addr;

I don't think you need both 'buf_addr' and 'relay.buf_addr' as they are
the same value, right?

Matt

>  	struct {
>  		void *buf_addr;
>  		bool started;
> @@ -53,12 +61,6 @@ struct intel_guc_log {
>  		struct mutex lock;
>  		u32 full_count;
>  	} relay;
> -	/* logging related stats */
> -	struct {
> -		u32 sampled_overflow;
> -		u32 overflow;
> -		u32 flush;
> -	} stats[GUC_MAX_LOG_BUFFER];
>  };
>  
>  void intel_guc_log_init_early(struct intel_guc_log *log);
> -- 
> 2.25.1
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Intel-gfx] [RFC 6/7] drm/i915/guc: Copy new GuC error capture logs upon G2H notification.
  2021-11-22 23:04 ` [Intel-gfx] [RFC 6/7] drm/i915/guc: Copy new GuC error capture logs upon G2H notification Alan Previn
@ 2021-12-07 22:58   ` Matthew Brost
  2021-12-08  5:14     ` Teres Alexis, Alan Previn
  0 siblings, 1 reply; 52+ messages in thread
From: Matthew Brost @ 2021-12-07 22:58 UTC (permalink / raw)
  To: Alan Previn; +Cc: intel-gfx

On Mon, Nov 22, 2021 at 03:04:01PM -0800, Alan Previn wrote:
> Upon the G2H Notify-Err-Capture event, queue a worker to make a
> snapshot of the error state capture logs from the GuC-log buffer
> (error capture region) into an bigger interim circular buffer store
> that can be parsed later during gpu coredump printing.
> 
> Also, call that worker function directly for the cases where we
> are resetting GuC submission and need to flush outstanding logs.
> 

A couple nits and perhaps race condition. See below.

> Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com>
> ---
>  .../gpu/drm/i915/gt/uc/abi/guc_actions_abi.h  |   7 +
>  .../gpu/drm/i915/gt/uc/intel_guc_capture.c    | 206 ++++++++++++++++++
>  .../gpu/drm/i915/gt/uc/intel_guc_capture.h    |  16 ++
>  drivers/gpu/drm/i915/gt/uc/intel_guc_log.c    |  16 +-
>  drivers/gpu/drm/i915/gt/uc/intel_guc_log.h    |   5 +
>  .../gpu/drm/i915/gt/uc/intel_guc_submission.c |  10 +-
>  6 files changed, 256 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
> index 5af03a486a13..c130f465c19a 100644
> --- a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
> +++ b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
> @@ -178,4 +178,11 @@ enum intel_guc_sleep_state_status {
>  #define GUC_LOG_CONTROL_VERBOSITY_MASK	(0xF << GUC_LOG_CONTROL_VERBOSITY_SHIFT)
>  #define GUC_LOG_CONTROL_DEFAULT_LOGGING	(1 << 8)
>  
> +enum intel_guc_state_capture_event_status {
> +	INTEL_GUC_STATE_CAPTURE_EVENT_STATUS_SUCCESS = 0x0,
> +	INTEL_GUC_STATE_CAPTURE_EVENT_STATUS_NOSPACE = 0x1,
> +};
> +
> +#define INTEL_GUC_STATE_CAPTURE_EVENT_STATUS_MASK      0x1
> +
>  #endif /* _ABI_GUC_ACTIONS_ABI_H */
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
> index 0cb358a98605..459fe81c77ae 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
> @@ -11,8 +11,11 @@
>  #include "gt/intel_gt.h"
>  #include "gt/intel_lrc_reg.h"
>  
> +#include <linux/circ_buf.h>
> +
>  #include "intel_guc_fwif.h"
>  #include "intel_guc_capture.h"
> +#include "i915_gpu_error.h"
>  
>  /*
>   * Define all device tables of GuC error capture register lists
> @@ -390,15 +393,218 @@ int intel_guc_capture_output_min_size_est(struct intel_guc *guc)
>  	return (worst_min_size * 3);
>  }
>  
> +/*
> + * KMD Init time flows:
> + * --------------------
> + *     --> alloc A: GuC input capture regs lists (registered via ADS)
> + *                  List acquired via intel_guc_capture_list_count + intel_guc_capture_list_init
> + *                  Size = global-reg-list + (class-reg-list) + (num-instances x instance-reg-list)
> + *                  Device tables carry: 1x global, 1x per-class, 1x per-instance)
> + *                  Caller needs to call per-class and per-instance multiplie times
> + *
> + *     --> alloc B: GuC output capture buf (registered via guc_init_params(log_param))
> + *                  Size = #define CAPTURE_BUFFER_SIZE (warns if on too-small)
> + *                  Note2: 'x 3' to hold multiple capture groups
> + *
> + *     --> alloc C: GuC capture interim circular buffer storage in system mem
> + *                  Size = 'power_of_two(sizeof(B))' as per kernel circular buffer helper
> + *
> + * GUC Runtime notify capture:
> + * --------------------------
> + *     --> G2H STATE_CAPTURE_NOTIFICATION
> + *                   L--> intel_guc_capture_store_snapshot
> + *                        L--> queue(__guc_capture_store_snapshot_work)
> + *                             Copies from B (head->tail) into C
> + */
> +
> +static void guc_capture_store_insert(struct intel_guc *guc, struct guc_capture_out_store *store,
> +				     unsigned char *new_data, size_t bytes)
> +{
> +	struct drm_i915_private *dev_priv = (guc_to_gt(guc))->i915;

s/dev_priv/i915/

For the whole file.

> +	unsigned char *dst_data = store->addr;
> +	unsigned long h, t;
> +	size_t tmp;
> +
> +	h = store->head;
> +	t = store->tail;
> +	if (CIRC_SPACE(h, t, store->size) >= bytes) {
> +		while (bytes) {
> +			tmp = CIRC_SPACE_TO_END(h, t, store->size);
> +			if (tmp) {
> +				tmp = tmp < bytes ? tmp : bytes;
> +				i915_unaligned_memcpy_from_wc(&dst_data[h], new_data, tmp);
> +				bytes -= tmp;
> +				new_data += tmp;
> +				h = (h + tmp) & (store->size - 1);
> +			} else {
> +				drm_err(&dev_priv->drm, "circbuf copy-to ptr-corruption!\n");
> +				break;
> +			}
> +		}
> +		store->head = h;
> +	} else {
> +		drm_err(&dev_priv->drm, "GuC capture interim-store insufficient space!\n");
> +	}
> +}
> +
> +static void __guc_capture_store_snapshot_work(struct intel_guc *guc)
> +{
> +	struct drm_i915_private *dev_priv = (guc_to_gt(guc))->i915;
> +	unsigned int buffer_size, read_offset, write_offset, bytes_to_copy, full_count;
> +	struct guc_log_buffer_state *log_buf_state;
> +	struct guc_log_buffer_state log_buf_state_local;
> +	void *src_data, *dst_data = NULL;
> +	bool new_overflow;
> +
> +	/* Lock to get the pointer to GuC capture-log-buffer-state */
> +	mutex_lock(&guc->log_state[GUC_CAPTURE_LOG_BUFFER].lock);
> +	log_buf_state = guc->log.buf_addr +
> +			(sizeof(struct guc_log_buffer_state) * GUC_CAPTURE_LOG_BUFFER);
> +	src_data = guc->log.buf_addr + guc_get_log_buffer_offset(GUC_CAPTURE_LOG_BUFFER);
> +
> +	/*
> +	 * Make a copy of the state structure, inside GuC log buffer
> +	 * (which is uncached mapped), on the stack to avoid reading
> +	 * from it multiple times.
> +	 */
> +	memcpy(&log_buf_state_local, log_buf_state, sizeof(struct guc_log_buffer_state));
> +	buffer_size = guc_get_log_buffer_size(GUC_CAPTURE_LOG_BUFFER);
> +	read_offset = log_buf_state_local.read_ptr;
> +	write_offset = log_buf_state_local.sampled_write_ptr;
> +	full_count = log_buf_state_local.buffer_full_cnt;
> +
> +	/* Bookkeeping stuff */
> +	guc->log_state[GUC_CAPTURE_LOG_BUFFER].flush += log_buf_state_local.flush_to_file;
> +	new_overflow = guc_check_log_buf_overflow(guc, &guc->log_state[GUC_CAPTURE_LOG_BUFFER],
> +						  full_count);
> +
> +	/* Update the state of shared log buffer */
> +	log_buf_state->read_ptr = write_offset;
> +	log_buf_state->flush_to_file = 0;
> +
> +	mutex_unlock(&guc->log_state[GUC_CAPTURE_LOG_BUFFER].lock);
> +
> +	dst_data = guc->capture.out_store.addr;
> +	if (dst_data) {
> +		mutex_lock(&guc->capture.out_store.lock);
> +
> +		/* Now copy the actual logs. */
> +		if (unlikely(new_overflow)) {
> +			/* copy the whole buffer in case of overflow */
> +			read_offset = 0;
> +			write_offset = buffer_size;
> +		} else if (unlikely((read_offset > buffer_size) ||
> +					(write_offset > buffer_size))) {

Odd alignment.

> +			drm_err(&dev_priv->drm, "invalid GuC log capture buffer state!\n");
> +			/* copy whole buffer as offsets are unreliable */
> +			read_offset = 0;
> +			write_offset = buffer_size;
> +		}
> +
> +		/* first copy from the tail end of the GuC log capture buffer */
> +		if (read_offset > write_offset) {
> +			guc_capture_store_insert(guc, &guc->capture.out_store, src_data,
> +						 write_offset);
> +			bytes_to_copy = buffer_size - read_offset;
> +		} else {
> +			bytes_to_copy = write_offset - read_offset;
> +		}
> +		guc_capture_store_insert(guc, &guc->capture.out_store, src_data + read_offset,
> +					 bytes_to_copy);
> +
> +		mutex_unlock(&guc->capture.out_store.lock);
> +	}
> +}
> +
> +static void guc_capture_store_snapshot_work(struct work_struct *work)
> +{
> +	struct intel_guc_state_capture *capture =
> +		container_of(work, struct intel_guc_state_capture, store_work);
> +	struct intel_guc *guc =
> +		container_of(capture, struct intel_guc, capture);
> +
> +	__guc_capture_store_snapshot_work(guc);
> +}
> +
> +void  intel_guc_capture_store_snapshot(struct intel_guc *guc)
> +{
> +	if (guc->capture.enabled)
> +		queue_work(system_highpri_wq, &guc->capture.store_work);
> +}
> +
> +void intel_guc_capture_store_snapshot_immediate(struct intel_guc *guc)
> +{
> +	if (guc->capture.enabled)
> +		__guc_capture_store_snapshot_work(guc);
> +}
> +
> +static void guc_capture_store_destroy(struct intel_guc *guc)
> +{
> +	mutex_destroy(&guc->capture.out_store.lock);
> +	mutex_destroy(&guc->capture.out_store.lock);

Duplicate mutex_destroy.

> +	guc->capture.out_store.size = 0;
> +	kfree(guc->capture.out_store.addr);
> +	guc->capture.out_store.addr = NULL;
> +}
> +
> +static int guc_capture_store_create(struct intel_guc *guc)
> +{
> +	/*
> +	 * Make this interim buffer 3x the GuC capture output buffer so that we can absorb
> +	 * a little delay when processing the raw capture dumps into text friendly logs
> +	 * for the i915_gpu_coredump output
> +	 */
> +	size_t max_dump_size;
> +	struct drm_i915_private *dev_priv = (guc_to_gt(guc))->i915;
> +
> +	GEM_BUG_ON(guc->capture.out_store.addr);
> +
> +	max_dump_size = PAGE_ALIGN(intel_guc_capture_output_min_size_est(guc));
> +	max_dump_size = roundup_pow_of_two(max_dump_size);
> +
> +	guc->capture.out_store.addr = kzalloc(max_dump_size, GFP_KERNEL);
> +	if (!guc->capture.out_store.addr) {
> +		drm_warn(&dev_priv->drm, "GuC-capture interim-store populated at init!\n");
> +		return -ENOMEM;
> +	}
> +	guc->capture.out_store.size = max_dump_size;
> +	mutex_init(&guc->capture.out_store.lock);
> +	mutex_init(&guc->capture.out_store.lock);

Duplicate mutex_init.

> +
> +	return 0;
> +}
> +
>  void intel_guc_capture_destroy(struct intel_guc *guc)
>  {
> +	if (!guc->capture.enabled)
> +		return;
> +
> +	guc->capture.enabled = false;
> +
> +	intel_synchronize_irq(guc_to_gt(guc)->i915);
> +	flush_work(&guc->capture.store_work);
> +	guc_capture_store_destroy(guc);
>  	guc_capture_clear_ext_regs(guc->capture.reglists);
>  }
>  
>  int intel_guc_capture_init(struct intel_guc *guc)
>  {
>  	struct drm_i915_private *dev_priv = (guc_to_gt(guc))->i915;
> +	int ret;
>  
>  	guc->capture.reglists = guc_capture_get_device_reglist(dev_priv);
> +	/*
> +	 * allocate interim store at init time so we dont require memory
> +	 * allocation whilst in the midst of the reset + capture
> +	 */
> +	ret = guc_capture_store_create(guc);
> +	if (ret) {
> +		guc_capture_clear_ext_regs(guc->capture.reglists);
> +		return ret;
> +	}
> +
> +	INIT_WORK(&guc->capture.store_work, guc_capture_store_snapshot_work);
> +	guc->capture.enabled = true;
> +
>  	return 0;
>  }
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h
> index 839b53425e1e..7031de12f3a1 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h
> @@ -64,7 +64,19 @@ struct intel_guc_capture_out_group {
>  	struct intel_guc_capture_out_data group_lists[0];
>  };
>  
> +struct guc_capture_out_store {
> +	/* An interim storage to copy the GuC error-capture-output before
> +	 * parsing and reporting via proper reporting flows with formatting.
> +	 */
> +	unsigned char *addr;
> +	size_t size;
> +	unsigned long head; /* inject new output capture data */
> +	unsigned long tail; /* remove output capture data when reporting */
> +	struct mutex lock; /*lock head or tail when copying capture in or extracting out*/
> +};
> +
>  struct intel_guc_state_capture {
> +	bool enabled;
>  	struct __guc_mmio_reg_descr_group *reglists;
>  	u16 num_instance_regs[GUC_CAPTURE_LIST_INDEX_MAX][GUC_MAX_ENGINE_CLASSES];
>  	u16 num_class_regs[GUC_CAPTURE_LIST_INDEX_MAX][GUC_MAX_ENGINE_CLASSES];
> @@ -72,14 +84,18 @@ struct intel_guc_state_capture {
>  	int instance_list_size;
>  	int class_list_size;
>  	int global_list_size;
> +	struct guc_capture_out_store out_store;
> +	struct work_struct store_work;
>  };
>  
> +void intel_guc_capture_store_snapshot(struct intel_guc *guc);
>  int intel_guc_capture_list_count(struct intel_guc *guc, u32 owner, u32 type, u32 class,
>  				 u16 *num_entries);
>  int intel_guc_capture_list_init(struct intel_guc *guc, u32 owner, u32 type, u32 class,
>  				struct guc_mmio_reg *ptr, u16 num_entries);
>  int intel_guc_capture_output_min_size_est(struct intel_guc *guc);
>  void intel_guc_capture_destroy(struct intel_guc *guc);
> +void intel_guc_capture_store_snapshot_immediate(struct intel_guc *guc);
>  int intel_guc_capture_init(struct intel_guc *guc);
>  
>  #endif /* _INTEL_GUC_CAPTURE_H */
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_log.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_log.c
> index dd86530f77a1..1354dbde9994 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_log.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_log.c
> @@ -194,7 +194,7 @@ bool guc_check_log_buf_overflow(struct intel_guc *guc,
>  	return overflow;
>  }
>  
> -static unsigned int guc_get_log_buffer_size(enum guc_log_buffer_type type)
> +unsigned int guc_get_log_buffer_size(enum guc_log_buffer_type type)
>  {
>  	switch (type) {
>  	case GUC_DEBUG_LOG_BUFFER:
> @@ -210,6 +210,20 @@ static unsigned int guc_get_log_buffer_size(enum guc_log_buffer_type type)
>  	return 0;
>  }
>  
> +size_t guc_get_log_buffer_offset(enum guc_log_buffer_type type)
> +{
> +	enum guc_log_buffer_type i;
> +	size_t offset = PAGE_SIZE;/* for the log_buffer_states */
> +
> +	for (i = GUC_DEBUG_LOG_BUFFER; i < GUC_MAX_LOG_BUFFER; i++) {
> +		if (i == type)
> +			break;
> +		offset += guc_get_log_buffer_size(i);
> +	}
> +
> +	return offset;
> +}
> +
>  static void _guc_log_copy_debuglogs_for_relay(struct intel_guc_log *log)
>  {
>  	struct intel_guc *guc = log_to_guc(log);
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_log.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_log.h
> index 2968023f7447..9bf29343df0e 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_log.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_log.h
> @@ -64,8 +64,13 @@ struct intel_guc_log {
>  };
>  
>  void intel_guc_log_init_early(struct intel_guc_log *log);
> +unsigned int guc_get_log_buffer_size(enum guc_log_buffer_type type);
> +size_t guc_get_log_buffer_offset(enum guc_log_buffer_type type);

intel_ prefix for exported functions.

>  int intel_guc_log_create(struct intel_guc_log *log);
>  void intel_guc_log_destroy(struct intel_guc_log *log);
> + 
> +bool guc_check_log_buf_overflow(struct intel_guc *guc, struct intel_guc_log_stats *state,
> +				unsigned int full_cnt);
>  
>  int intel_guc_log_set_level(struct intel_guc_log *log, u32 level);
>  bool intel_guc_log_relay_created(const struct intel_guc_log *log);
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 0bfc92b1b982..0afd9ddd71fc 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -24,6 +24,7 @@
>  
>  #include "intel_guc_ads.h"
>  #include "intel_guc_submission.h"
> +#include "gt/uc/intel_guc_capture.h"
>  
>  #include "i915_drv.h"
>  #include "i915_trace.h"
> @@ -1431,6 +1432,8 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
>  	}
>  
>  	scrub_guc_desc_for_outstanding_g2h(guc);
> +
> +	intel_guc_capture_store_snapshot_immediate(guc);
>  }
>  
>  static struct intel_engine_cs *
> @@ -4013,10 +4016,11 @@ int intel_guc_error_capture_process_msg(struct intel_guc *guc,
>  		return -EPROTO;
>  	}
>  
> -	status = msg[0];
> -	drm_info(&guc_to_gt(guc)->i915->drm, "Got error capture: status = %d", status);
> +	status = msg[0] & INTEL_GUC_STATE_CAPTURE_EVENT_STATUS_MASK;
> +	if (status == INTEL_GUC_STATE_CAPTURE_EVENT_STATUS_NOSPACE)
> +		drm_warn(&guc_to_gt(guc)->i915->drm, "G2H-Error capture no space\n");
>  
> -	/* Add extraction of error capture dump */
> +	intel_guc_capture_store_snapshot(guc);

This is done in different worker, right? How does this not race with an
engine reset notification that does an error capture (e.g. the error
capture is done before we read out the info from the GuC)?

As far as I can tell 'intel_guc_capture_store_snapshot' doesn't allocate
memory so I don't think we need a worker here.

Matt

>  
>  	return 0;
>  }
> -- 
> 2.25.1
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Intel-gfx] [RFC 5/7] drm/i915/guc: Update GuC's log-buffer-state access for error capture.
  2021-12-07 23:33     ` Teres Alexis, Alan Previn
@ 2021-12-07 23:30       ` Matthew Brost
  0 siblings, 0 replies; 52+ messages in thread
From: Matthew Brost @ 2021-12-07 23:30 UTC (permalink / raw)
  To: Teres Alexis, Alan Previn; +Cc: intel-gfx

On Tue, Dec 07, 2021 at 03:33:00PM -0800, Teres Alexis, Alan Previn wrote:
> Thank you for the detailed review Matt. Responses and follow up questions on some of them below (wanna
> make sure i dont misunderstand).
> 
> Will fix all the rest - glad we dont have any design problems .. so far :)
> 
> ...alan
> 
> On Tue, 2021-12-07 at 14:31 -0800, Matthew Brost wrote:
> > On Mon, Nov 22, 2021 at 03:04:00PM -0800, Alan Previn wrote:
> > > -static bool guc_check_log_buf_overflow(struct intel_guc_log *log,
> > > -                                  enum guc_log_buffer_type type,
> > > -                                  unsigned int full_cnt)
> > > +bool guc_check_log_buf_overflow(struct intel_guc *guc,
> > > +                           struct intel_guc_log_stats *log_state,
> > > +                           unsigned int full_cnt)
> >
> > I don't think you meant to drop the 'static' here.
> actually i do need to call it from guc_capture - but that was on the next patch.
> my action would be to move this change to the next patch and i guess change the name to "intel_guc..."?
> (im assuming we dont wanna duplicate this).
> 

Ok. Yes, if you export a function add the intel_ prefix.

> >
> > > +
> > > +   guc_log_size = PAGE_SIZE + DEBUG_BUFFER_SIZE + CRASH_BUFFER_SIZE + CAPTURE_BUFFER_SIZE;
> >
> > I'd personally keep the original formatting here.
> >
> Question: - You are refering to just that last line of the "guc_log_size = .." above right?
> 

Yes.

> > >   struct intel_guc_log {
> > >     u32 level;
> > >     struct i915_vma *vma;
> > > +   void *buf_addr;
> >
> > I don't think you need both 'buf_addr' and 'relay.buf_addr' as they are
> > the same value, right?
> >
> Matt
> >
> Clarification: In the baseline code, i believe we use the "relay.buf_addr" like an enable flag
> way to verify that the guc relay debugfs was invoked by user space and is currently active.
> (which can be disabled. That said I can definitely remove that relay.buf_addr by introducing
> a more descriptive flag such as "relay.active". Assume this is ok right?
>

Sounds good to me.

Matt
 
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Intel-gfx] [RFC 5/7] drm/i915/guc: Update GuC's log-buffer-state access for error capture.
  2021-12-07 22:31   ` Matthew Brost
@ 2021-12-07 23:33     ` Teres Alexis, Alan Previn
  2021-12-07 23:30       ` Matthew Brost
  0 siblings, 1 reply; 52+ messages in thread
From: Teres Alexis, Alan Previn @ 2021-12-07 23:33 UTC (permalink / raw)
  To: Brost, Matthew; +Cc: intel-gfx

Thank you for the detailed review Matt. Responses and follow up questions on some of them below (wanna
make sure i dont misunderstand).

Will fix all the rest - glad we dont have any design problems .. so far :)

...alan

On Tue, 2021-12-07 at 14:31 -0800, Matthew Brost wrote:
> On Mon, Nov 22, 2021 at 03:04:00PM -0800, Alan Previn wrote:
> > -static bool guc_check_log_buf_overflow(struct intel_guc_log *log,
> > -				       enum guc_log_buffer_type type,
> > -				       unsigned int full_cnt)
> > +bool guc_check_log_buf_overflow(struct intel_guc *guc,
> > +				struct intel_guc_log_stats *log_state,
> > +				unsigned int full_cnt)
> 
> I don't think you meant to drop the 'static' here.
actually i do need to call it from guc_capture - but that was on the next patch.
my action would be to move this change to the next patch and i guess change the name to "intel_guc..."?
(im assuming we dont wanna duplicate this).

> 
> > +
> > +	guc_log_size = PAGE_SIZE + DEBUG_BUFFER_SIZE + CRASH_BUFFER_SIZE + CAPTURE_BUFFER_SIZE;
> 
> I'd personally keep the original formatting here.
> 
Question: - You are refering to just that last line of the "guc_log_size = .." above right?

> >   struct intel_guc_log {
> >  	u32 level;
> >  	struct i915_vma *vma;
> > +	void *buf_addr;
> 
> I don't think you need both 'buf_addr' and 'relay.buf_addr' as they are
> the same value, right?
> 
Matt
> 
Clarification: In the baseline code, i believe we use the "relay.buf_addr" like an enable flag
way to verify that the guc relay debugfs was invoked by user space and is currently active.
(which can be disabled. That said I can definitely remove that relay.buf_addr by introducing
a more descriptive flag such as "relay.active". Assume this is ok right?



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Intel-gfx] [RFC 4/7] drm/i915/guc: Add GuC's error state capture output structures.
  2021-12-07 21:01   ` Matthew Brost
@ 2021-12-07 23:35     ` Teres Alexis, Alan Previn
  0 siblings, 0 replies; 52+ messages in thread
From: Teres Alexis, Alan Previn @ 2021-12-07 23:35 UTC (permalink / raw)
  To: Brost, Matthew; +Cc: intel-gfx

Thanks for the conditional Rvb - will get that fixed on next rev.

On Tue, 2021-12-07 at 13:01 -0800, Matthew Brost wrote:
> On Mon, Nov 22, 2021 at 03:03:59PM -0800, Alan Previn wrote:
> > 
> >  
> > +struct intel_guc_capture_out_data_header {
> > +	u32 reserved1;
> > +	u32 info;
> > +		#define GUC_CAPTURE_DATAHDR_SRC_TYPE GENMASK(3, 0) /* as per enum guc_capture_type */
> > +		#define GUC_CAPTURE_DATAHDR_SRC_CLASS GENMASK(7, 4) /* as per GUC_MAX_ENGINE_CLASSES */
> > +		#define GUC_CAPTURE_DATAHDR_SRC_INSTANCE GENMASK(11, 8)
> > +	u32 lrca; /* if type-instance, LRCA (address) that hung, else set to ~0 */
> > +	u32 guc_ctx_id; /* if type-instance, context index of hung context, else set to ~0 */
> 
> s/guc_ctx_id/guc_id
> 
> With __packed (per Jani's feedback) as well:
> 
> Reviewed-by: Matthew Brost <matthew.brost@intel.com>
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Intel-gfx] [RFC 7/7] drm/i915/guc: Print the GuC error capture output register list.
  2021-11-22 23:04 ` [Intel-gfx] [RFC 7/7] drm/i915/guc: Print the GuC error capture output register list Alan Previn
  2021-11-23  0:25   ` Teres Alexis, Alan Previn
@ 2021-12-08  0:22   ` Matthew Brost
  2021-12-08  6:31     ` Teres Alexis, Alan Previn
  1 sibling, 1 reply; 52+ messages in thread
From: Matthew Brost @ 2021-12-08  0:22 UTC (permalink / raw)
  To: Alan Previn; +Cc: intel-gfx

On Mon, Nov 22, 2021 at 03:04:02PM -0800, Alan Previn wrote:
> Print the GuC captured error state register list (offsets
> and values) when gpu_coredump_state printout is invoked.
> 
> Also, since the GuC can report multiple engine class registers in a
> single notification event, parse the captured data (appearing as a
> stream of structures) to identify multiple captures of different
> 'engine-capture-group-outputs'.
> 
> Finally, for each 'engine-capture-group-output', identify the last
> running context and print already-identified vma's so that user's
> output report follows the same layout as execlist submission. I.e.
> engine1-registers, engine1-context-vmas, engine2-registers,
> engine2-context-vmas, etc.
> 

Can you include a sample error capture in the next rev cover letter
assuming it is posted after 69.0.0 is merged.

A couple of comments below.

> Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com>
> ---
>  drivers/gpu/drm/i915/gt/intel_engine_cs.c     |   4 +-
>  .../gpu/drm/i915/gt/uc/intel_guc_capture.c    | 389 ++++++++++++++++++
>  .../gpu/drm/i915/gt/uc/intel_guc_capture.h    |   6 +
>  drivers/gpu/drm/i915/i915_gpu_error.c         |  53 ++-
>  drivers/gpu/drm/i915/i915_gpu_error.h         |   5 +
>  5 files changed, 439 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> index 332756036007..5806e2c05212 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> @@ -1595,9 +1595,7 @@ static void intel_engine_print_registers(struct intel_engine_cs *engine,
>  		drm_printf(m, "\tIPEHR: 0x%08x\n", ENGINE_READ(engine, IPEHR));
>  	}
>  
> -	if (intel_engine_uses_guc(engine)) {
> -		/* nothing to print yet */
> -	} else if (HAS_EXECLISTS(dev_priv)) {
> +	if (HAS_EXECLISTS(dev_priv) && !intel_engine_uses_guc(engine)) {
>  		struct i915_request * const *port, *rq;
>  		const u32 *hws =
>  			&engine->status_page.addr[I915_HWS_CSB_BUF0_INDEX];
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
> index 459fe81c77ae..998ce1b474ed 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
> @@ -415,8 +415,389 @@ int intel_guc_capture_output_min_size_est(struct intel_guc *guc)
>   *                   L--> intel_guc_capture_store_snapshot
>   *                        L--> queue(__guc_capture_store_snapshot_work)
>   *                             Copies from B (head->tail) into C
> + *
> + * GUC --> notify context reset:
> + * -----------------------------
> + *     --> G2H CONTEXT RESET
> + *                   L--> guc_handle_context_reset --> i915_capture_error_state
> + *                    --> i915_gpu_coredump --> intel_guc_capture_store_ptr
> + *                        L--> keep a ptr to capture_store in
> + *                             i915_gpu_coredump struct.
> + *
> + * User Sysfs / Debugfs
> + * --------------------
> + *      --> i915_gpu_coredump_copy_to_buffer->
> + *                   L--> err_print_to_sgl --> err_print_gt
> + *                        L--> error_print_guc_captures
> + *                             L--> loop: intel_guc_capture_out_print_next_group
> + *
>   */
>  
> +#if IS_ENABLED(CONFIG_DRM_I915_CAPTURE_ERROR)
> +
> +static char *
> +guc_capture_register_string(const struct intel_guc *guc, u32 owner, u32 type,
> +			    u32 class, u32 id, u32 offset)
> +{
> +	struct __guc_mmio_reg_descr_group *reglists = guc->capture.reglists;
> +	struct __guc_mmio_reg_descr_group *match;
> +	int num_regs, j = 0;
> +
> +	if (!reglists)
> +		return NULL;
> +
> +	match = guc_capture_get_one_list(reglists, owner, type, id);
> +	if (match) {
> +		num_regs = match->num_regs;
> +		while (num_regs--) {

for (num_regs = match->num_regs, j = 0; num_regs; ++j, --num_regs)

> +			if (offset == match->list[j].reg.reg)
> +				return match->list[j].regname;
> +			++j;
> +		}
> +	}
> +
> +	return NULL;
> +}
> +
> +static inline int
> +guc_capture_store_remove_dw(struct guc_capture_out_store *store, u32 *bytesleft,
> +			    u32 *dw)
> +{
> +	int tries = 2;
> +	int avail = 0;
> +	u32 *src_data;
> +
> +	if (!*bytesleft)
> +		return 0;
> +
> +	while (tries--) {
> +		avail = CIRC_CNT_TO_END(store->head, store->tail, store->size);
> +		if (avail >= sizeof(u32)) {
> +			src_data = (u32 *)(store->addr + store->tail);
> +			*dw = *src_data;
> +			store->tail = (store->tail + 4) & (store->size - 1);
> +			*bytesleft -= 4;
> +			return 4;
> +		}
> +		if (store->tail == (store->size - 1) && store->head > 0)
> +			store->tail = 0;
> +	}
> +
> +	return 0;
> +}
> +
> +static int
> +capture_store_get_group_hdr(const struct intel_guc *guc,
> +			    struct guc_capture_out_store *store, u32 *bytesleft,
> +			    struct intel_guc_capture_out_group_header *group)
> +{
> +	int read = 0;
> +	int fullsize = sizeof(struct intel_guc_capture_out_group_header);
> +
> +	if (fullsize > *bytesleft)
> +		return -1;
> +
> +	if (CIRC_CNT_TO_END(store->head, store->tail, store->size) >= fullsize) {
> +		    memcpy(group, (store->addr + store->tail), fullsize);
> +			store->tail = (store->tail + fullsize) & (store->size - 1);
> +			*bytesleft -= fullsize;

Weird alignment.

> +		return 0;
> +	}
> +
> +	read += guc_capture_store_remove_dw(store, bytesleft, &group->reserved1);
> +	read += guc_capture_store_remove_dw(store, bytesleft, &group->info);
> +	if (read != sizeof(*group))
> +		return -1;
> +
> +	return 0;
> +}
> +
> +static int
> +capture_store_get_data_hdr(const struct intel_guc *guc,
> +			   struct guc_capture_out_store *store, u32 *bytesleft,
> +			   struct intel_guc_capture_out_data_header *data)
> +{
> +	int read = 0;
> +	int fullsize = sizeof(struct intel_guc_capture_out_data_header);
> +
> +	if (fullsize > *bytesleft)
> +		return -1;
> +
> +	if (CIRC_CNT_TO_END(store->head, store->tail, store->size) >= fullsize) {
> +		    memcpy(data, (store->addr + store->tail), fullsize);
> +			store->tail = (store->tail + fullsize) & (store->size - 1);
> +			*bytesleft -= fullsize;

Weird alignment.

> +		return 0;
> +	}
> +
> +	read += guc_capture_store_remove_dw(store, bytesleft, &data->reserved1);
> +	read += guc_capture_store_remove_dw(store, bytesleft, &data->info);
> +	read += guc_capture_store_remove_dw(store, bytesleft, &data->lrca);
> +	read += guc_capture_store_remove_dw(store, bytesleft, &data->guc_ctx_id);
> +	read += guc_capture_store_remove_dw(store, bytesleft, &data->num_mmios);
> +	if (read != sizeof(*data))
> +		return -1;
> +
> +	return 0;
> +}
> +
> +static int
> +capture_store_get_register(const struct intel_guc *guc,
> +			   struct guc_capture_out_store *store, u32 *bytesleft,
> +			   struct guc_mmio_reg *reg)
> +{
> +	int read = 0;
> +	int fullsize = sizeof(struct guc_mmio_reg);
> +
> +	if (fullsize > *bytesleft)
> +		return -1;
> +
> +	if (CIRC_CNT_TO_END(store->head, store->tail, store->size) >= fullsize) {
> +		    memcpy(reg, (store->addr + store->tail), fullsize);
> +			store->tail = (store->tail + fullsize) & (store->size - 1);
> +			*bytesleft -= fullsize;

Weird alignment.

> +		return 0;
> +	}
> +
> +	read += guc_capture_store_remove_dw(store, bytesleft, &reg->offset);
> +	read += guc_capture_store_remove_dw(store, bytesleft, &reg->value);
> +	read += guc_capture_store_remove_dw(store, bytesleft, &reg->flags);
> +	read += guc_capture_store_remove_dw(store, bytesleft, &reg->mask);
> +	if (read != sizeof(*reg))
> +		return -1;
> +
> +	return 0;
> +}
> +
> +static void guc_capture_store_drop_data(struct guc_capture_out_store *store,
> +					unsigned long sampled_head)
> +{
> +	if (sampled_head == 0)
> +		store->tail = store->size - 1;
> +	else
> +		store->tail = sampled_head - 1;
> +}
> +
> +#ifdef CONFIG_DRM_I915_DEBUG_GUC
> +#define guc_capt_err_print(a, b, ...) \
> +	do { \
> +		drm_warn(a, __VA_ARGS__); \
> +		if (b) \
> +			i915_error_printf(b, __VA_ARGS__); \
> +	} while (0)
> +#else
> +#define guc_capt_err_print(a, b, ...) \
> +	do { \
> +		if (b) \
> +			i915_error_printf(b, __VA_ARGS__); \
> +	} while (0)
> +#endif
> +
> +static struct intel_engine_cs *
> +guc_lookup_engine(struct intel_guc *guc, u8 guc_class, u8 instance)
> +{
> +	struct intel_gt *gt = guc_to_gt(guc);
> +	u8 engine_class = guc_class_to_engine_class(guc_class);
> +
> +	/* Class index is checked in class converter */
> +	GEM_BUG_ON(instance > MAX_ENGINE_INSTANCE);
> +
> +	return gt->engine_class[engine_class][instance];
> +}
> +
> +static inline struct intel_context *
> +guc_context_lookup(struct intel_guc *guc, u32 guc_ctx_id)
> +{
> +	struct intel_context *ce;
> +
> +	if (unlikely(guc_ctx_id >= GUC_MAX_LRC_DESCRIPTORS)) {
> +		drm_dbg(&guc_to_gt(guc)->i915->drm, "Invalid guc_ctx_id 0x%X, max 0x%X",
> +			guc_ctx_id, GUC_MAX_LRC_DESCRIPTORS);
> +		return NULL;
> +	}
> +
> +	ce = xa_load(&guc->context_lookup, guc_ctx_id);
> +	if (unlikely(!ce)) {
> +		drm_dbg(&guc_to_gt(guc)->i915->drm, "Context is NULL, guc_ctx_id 0x%X",
> +			guc_ctx_id);
> +		return NULL;
> +	}
> +
> +	return ce;
> +}
> +
> +
> +#define PRINT guc_capt_err_print
> +#define REGSTR guc_capture_register_string
> +
> +#define GCAP_PRINT_INTEL_ENG_INFO(i915, ebuf, eng) \
> +	PRINT(&(i915->drm), (ebuf), "    i915-Eng-Name: %s\n", (eng)->name); \
> +	PRINT(&(i915->drm), (ebuf), "    i915-Eng-Class: 0x%02x\n", (eng)->class); \
> +	PRINT(&(i915->drm), (ebuf), "    i915-Eng-Inst: 0x%02x\n", (eng)->instance); \
> +	PRINT(&(i915->drm), (ebuf), "    i915-Eng-LogicalMask: 0x%08x\n", (eng)->logical_mask)
> +
> +#define GCAP_PRINT_GUC_INST_INFO(i915, ebuf, data) \
> +	PRINT(&(i915->drm), (ebuf), "    LRCA: 0x%08x\n", (data).lrca); \
> +	PRINT(&(i915->drm), (ebuf), "    GuC-ContextID: 0x%08x\n", (data).guc_ctx_id); \
> +	PRINT(&(i915->drm), (ebuf), "    GuC-Engine-Instance: 0x%08x\n", \
> +	      (uint32_t) FIELD_GET(GUC_CAPTURE_DATAHDR_SRC_INSTANCE, (data).info));
> +
> +#define GCAP_PRINT_INTEL_CTX_INFO(i915, ebuf, ce) \
> +	PRINT(&(i915->drm), (ebuf), "    i915-Ctx-Flags: 0x%016lx\n", (ce)->flags); \
> +	PRINT(&(i915->drm), (ebuf), "    i915-Ctx-GuC-ID: 0x%016x\n", (ce)->guc_id.id);
> +
> +int intel_guc_capture_out_print_next_group(struct drm_i915_error_state_buf *ebuf,
> +					   struct intel_gt_coredump *gt)
> +{
> +	/* constant qualifier for data-pointers we shouldn't change mid of error dump printing */
> +	struct intel_guc_state_capture *cap = gt->uc->capture;
> +	struct intel_guc *guc = container_of(cap, struct intel_guc, capture);
> +	struct drm_i915_private *i915 = (container_of(guc, struct intel_gt,
> +						   uc.guc))->i915;
> +	struct guc_capture_out_store *store = &cap->out_store;
> +	struct guc_capture_out_store tmpstore;
> +	struct intel_guc_capture_out_group_header group;
> +	struct intel_guc_capture_out_data_header data;
> +	struct guc_mmio_reg reg;
> +	const char *grptypestr[GUC_STATE_CAPTURE_GROUP_TYPE_MAX] = {"full-capture",
> +								    "partial-capture"};
> +	const char *datatypestr[GUC_CAPTURE_LIST_TYPE_MAX] = {"Global", "Engine-Class",
> +							      "Engine-Instance"};
> +	enum guc_capture_group_types grptype;
> +	enum guc_capture_type datatype;
> +	int numgrps, numregs;
> +	char *str, noname[16];
> +	u32 numbytes, engineclass, eng_inst, ret = 0;
> +	struct intel_engine_cs *eng;
> +	struct intel_context *ce;
> +
> +	if (!cap->enabled)
> +		return -ENODEV;
> +
> +	mutex_lock(&store->lock);
> +	smp_mb(); /* sync to get the latest head for the moment */
> +	/* NOTE1: make a copy of store so we dont have to deal with a changing lower bound of
> +	 *        occupied-space in this circular buffer.
> +	 * NOTE2: Higher up the stack from here, we keep calling this function in a loop to
> +	 *        reading more capture groups as they appear (as the lower bound of occupied-space
> +	 *        changes) until this circ-buf is empty.
> +	 */
> +	memcpy(&tmpstore, store, sizeof(tmpstore));
> +
> +	PRINT(&i915->drm, ebuf, "global --- GuC Error Capture\n");
> +
> +	numbytes = CIRC_CNT(tmpstore.head, tmpstore.tail, tmpstore.size);
> +	if (!numbytes) {
> +		PRINT(&i915->drm, ebuf, "GuC capture stream empty!\n");
> +		ret = -ENODATA;
> +		goto unlock;
> +	}
> +	/* everything in GuC output structures are dword aligned */
> +	if (numbytes & 0x3) {
> +		PRINT(&i915->drm, ebuf, "GuC capture stream unaligned!\n");
> +		ret = -EIO;
> +		goto unlock;
> +	}
> +
> +	if (capture_store_get_group_hdr(guc, &tmpstore, &numbytes, &group)) {
> +		PRINT(&i915->drm, ebuf, "GuC capture error getting next group-header!\n");
> +		ret = -EIO;
> +		goto unlock;
> +	}
> +
> +	PRINT(&i915->drm, ebuf, "NumCaptures:  0x%08x\n", (uint32_t)
> +	      FIELD_GET(GUC_CAPTURE_GRPHDR_SRC_NUMCAPTURES, group.info));
> +	grptype = FIELD_GET(GUC_CAPTURE_GRPHDR_SRC_CAPTURE_TYPE, group.info);
> +	PRINT(&i915->drm, ebuf, "Coverage:  0x%08x = %s\n", grptype,
> +	      grptypestr[grptype % GUC_STATE_CAPTURE_GROUP_TYPE_MAX]);
> +
> +	numgrps = FIELD_GET(GUC_CAPTURE_GRPHDR_SRC_NUMCAPTURES, group.info);
> +	while (numgrps--) {
> +		eng = NULL;
> +		ce = NULL;
> +
> +		if (capture_store_get_data_hdr(guc, &tmpstore, &numbytes, &data)) {
> +			PRINT(&i915->drm, ebuf, "GuC capture error on next data-header!\n");
> +			ret = -EIO;
> +			goto unlock;
> +		}
> +		datatype = FIELD_GET(GUC_CAPTURE_DATAHDR_SRC_TYPE, data.info);
> +		PRINT(&i915->drm, ebuf, "  RegListType: %s\n",
> +		      datatypestr[datatype % GUC_CAPTURE_LIST_TYPE_MAX]);
> +
> +		engineclass = FIELD_GET(GUC_CAPTURE_DATAHDR_SRC_CLASS, data.info);
> +		if (datatype != GUC_CAPTURE_LIST_TYPE_GLOBAL) {
> +			PRINT(&i915->drm, ebuf, "    GuC-Engine-Class: %d\n",
> +			      engineclass);
> +			if (engineclass <= GUC_LAST_ENGINE_CLASS)
> +				PRINT(&i915->drm, ebuf, "    i915-Eng-Class: %d\n",
> +				      guc_class_to_engine_class(engineclass));
> +
> +			if (datatype == GUC_CAPTURE_LIST_TYPE_ENGINE_INSTANCE) {
> +				GCAP_PRINT_GUC_INST_INFO(i915, ebuf, data);
> +				eng_inst = FIELD_GET(GUC_CAPTURE_DATAHDR_SRC_INSTANCE, data.info);
> +				eng = guc_lookup_engine(guc, engineclass, eng_inst);
> +				if (eng) {
> +					GCAP_PRINT_INTEL_ENG_INFO(i915, ebuf, eng);
> +				} else {
> +					PRINT(&i915->drm, ebuf, "    i915-Eng-Lookup Fail!\n");
> +				}
> +				ce = guc_context_lookup(guc, data.guc_ctx_id);

You are going to need to reference count the 'ce' here. See
intel_guc_context_reset_process_msg for an example. 

> +				if (ce) {
> +					GCAP_PRINT_INTEL_CTX_INFO(i915, ebuf, ce);
> +				} else {
> +					PRINT(&i915->drm, ebuf, "    i915-Ctx-Lookup Fail!\n");
> +				}
> +			}
> +		}
> +		numregs = FIELD_GET(GUC_CAPTURE_DATAHDR_NUM_MMIOS, data.num_mmios);
> +		PRINT(&i915->drm, ebuf, "     NumRegs: 0x%08x\n", numregs);
> +
> +		while (numregs--) {
> +			if (capture_store_get_register(guc, &tmpstore, &numbytes, &reg)) {
> +				PRINT(&i915->drm, ebuf, "Error getting next register!\n");
> +				ret = -EIO;
> +				goto unlock;
> +			}
> +			str = REGSTR(guc, GUC_CAPTURE_LIST_INDEX_PF, datatype,
> +				     engineclass, 0, reg.offset);
> +			if (!str) {
> +				snprintf(noname, sizeof(noname), "REG-0x%08x", reg.offset);
> +				str = noname;
> +			}
> +			PRINT(&i915->drm, ebuf, "      %s:  0x%08x\n", str, reg.value);
> +
> +		}
> +		if (eng) {
> +			const struct intel_engine_coredump *ee;
> +			for (ee = gt->engine; ee; ee = ee->next) {
> +				const struct i915_vma_coredump *vma;
> +				if (ee->engine == eng) {
> +					for (vma = ee->vma; vma; vma = vma->next)
> +						i915_print_error_vma(ebuf, ee->engine, vma);
> +				}
> +			}
> +		}
> +	}
> +
> +	store->tail = tmpstore.tail;
> +unlock:
> +	/* if we have a stream error, just drop everything */
> +	if (ret == -EIO) {
> +		drm_warn(&i915->drm, "Skip GuC capture data print due to stream error\n");
> +		guc_capture_store_drop_data(store, tmpstore.head);
> +	}
> +
> +	mutex_unlock(&store->lock);
> +
> +	return ret;
> +}
> +
> +#undef REGSTR
> +#undef PRINT
> +
> +#endif //CONFIG_DRM_I915_DEBUG_GUC
> +
>  static void guc_capture_store_insert(struct intel_guc *guc, struct guc_capture_out_store *store,
>  				     unsigned char *new_data, size_t bytes)
>  {
> @@ -587,6 +968,14 @@ void intel_guc_capture_destroy(struct intel_guc *guc)
>  	guc_capture_clear_ext_regs(guc->capture.reglists);
>  }
>  
> +struct intel_guc_state_capture *
> +intel_guc_capture_store_ptr(struct intel_guc *guc)
> +{
> +	if (!guc->capture.enabled)
> +		return NULL;
> +	return &guc->capture;
> +}
> +
>  int intel_guc_capture_init(struct intel_guc *guc)
>  {
>  	struct drm_i915_private *dev_priv = (guc_to_gt(guc))->i915;
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h
> index 7031de12f3a1..7d048a8f6efe 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h
> @@ -88,6 +88,11 @@ struct intel_guc_state_capture {
>  	struct work_struct store_work;
>  };
>  
> +struct drm_i915_error_state_buf;
> +struct intel_gt_coredump;
> +
> +int intel_guc_capture_out_print_next_group(struct drm_i915_error_state_buf *m,
> +					   struct intel_gt_coredump *gt);
>  void intel_guc_capture_store_snapshot(struct intel_guc *guc);
>  int intel_guc_capture_list_count(struct intel_guc *guc, u32 owner, u32 type, u32 class,
>  				 u16 *num_entries);
> @@ -96,6 +101,7 @@ int intel_guc_capture_list_init(struct intel_guc *guc, u32 owner, u32 type, u32
>  int intel_guc_capture_output_min_size_est(struct intel_guc *guc);
>  void intel_guc_capture_destroy(struct intel_guc *guc);
>  void intel_guc_capture_store_snapshot_immediate(struct intel_guc *guc);
> +struct intel_guc_state_capture *intel_guc_capture_store_ptr(struct intel_guc *guc);
>  int intel_guc_capture_init(struct intel_guc *guc);
>  
>  #endif /* _INTEL_GUC_CAPTURE_H */
> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
> index 2a2d7643b551..47016059c65d 100644
> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
> @@ -600,6 +600,16 @@ static void error_print_engine(struct drm_i915_error_state_buf *m,
>  	error_print_context(m, "  Active context: ", &ee->context);
>  }
>  
> +static void error_print_guc_captures(struct drm_i915_error_state_buf *m,
> +				     struct intel_gt_coredump *gt)
> +{
> +	int ret;
> +
> +	do {
> +		ret = intel_guc_capture_out_print_next_group(m, gt);
> +	} while (!ret);
> +}
> +
>  void i915_error_printf(struct drm_i915_error_state_buf *e, const char *f, ...)
>  {
>  	va_list args;
> @@ -609,9 +619,9 @@ void i915_error_printf(struct drm_i915_error_state_buf *e, const char *f, ...)
>  	va_end(args);
>  }
>  
> -static void print_error_vma(struct drm_i915_error_state_buf *m,
> -			    const struct intel_engine_cs *engine,
> -			    const struct i915_vma_coredump *vma)
> +void i915_print_error_vma(struct drm_i915_error_state_buf *m,
> +			  const struct intel_engine_cs *engine,
> +			  const struct i915_vma_coredump *vma)
>  {
>  	char out[ASCII85_BUFSZ];
>  	int page;
> @@ -679,7 +689,7 @@ static void err_print_uc(struct drm_i915_error_state_buf *m,
>  
>  	intel_uc_fw_dump(&error_uc->guc_fw, &p);
>  	intel_uc_fw_dump(&error_uc->huc_fw, &p);
> -	print_error_vma(m, NULL, error_uc->guc_log);
> +	i915_print_error_vma(m, NULL, error_uc->guc_log);
>  }
>  
>  static void err_free_sgl(struct scatterlist *sgl)
> @@ -764,12 +774,16 @@ static void err_print_gt(struct drm_i915_error_state_buf *m,
>  		err_printf(m, "  GAM_DONE: 0x%08x\n", gt->gam_done);
>  	}
>  
> -	for (ee = gt->engine; ee; ee = ee->next) {
> -		const struct i915_vma_coredump *vma;
> +	if (gt->uc->capture) /* error capture was via GuC */
> +		error_print_guc_captures(m, gt);
> +	else {
> +		for (ee = gt->engine; ee; ee = ee->next) {
> +			const struct i915_vma_coredump *vma;
>  
> -		error_print_engine(m, ee);
> -		for (vma = ee->vma; vma; vma = vma->next)
> -			print_error_vma(m, ee->engine, vma);
> +			error_print_engine(m, ee);
> +			for (vma = ee->vma; vma; vma = vma->next)
> +				i915_print_error_vma(m, ee->engine, vma);
> +		}
>  	}
>  
>  	if (gt->uc)
> @@ -1140,7 +1154,7 @@ static void gt_record_fences(struct intel_gt_coredump *gt)
>  	gt->nfence = i;
>  }
>  
> -static void engine_record_registers(struct intel_engine_coredump *ee)
> +static void engine_record_registers_execlist(struct intel_engine_coredump *ee)
>  {
>  	const struct intel_engine_cs *engine = ee->engine;
>  	struct drm_i915_private *i915 = engine->i915;
> @@ -1384,8 +1398,10 @@ intel_engine_coredump_alloc(struct intel_engine_cs *engine, gfp_t gfp)
>  
>  	ee->engine = engine;
>  
> -	engine_record_registers(ee);
> -	engine_record_execlists(ee);
> +	if (!intel_uc_uses_guc_submission(&engine->gt->uc)) {
> +		engine_record_registers_execlist(ee);
> +		engine_record_execlists(ee);
> +	}
>  
>  	return ee;
>  }
> @@ -1558,8 +1574,8 @@ gt_record_uc(struct intel_gt_coredump *gt,
>  	return error_uc;
>  }
>  
> -/* Capture all registers which don't fit into another category. */
> -static void gt_record_regs(struct intel_gt_coredump *gt)
> +/* Capture all global registers which don't fit into another category. */
> +static void gt_record_registers_execlist(struct intel_gt_coredump *gt)
>  {
>  	struct intel_uncore *uncore = gt->_gt->uncore;
>  	struct drm_i915_private *i915 = uncore->i915;
> @@ -1806,7 +1822,9 @@ intel_gt_coredump_alloc(struct intel_gt *gt, gfp_t gfp)
>  	gc->_gt = gt;
>  	gc->awake = intel_gt_pm_is_awake(gt);
>  
> -	gt_record_regs(gc);
> +	if (!intel_uc_uses_guc_submission(&gt->uc))
> +		gt_record_registers_execlist(gc);
> +
>  	gt_record_fences(gc);
>  
>  	return gc;
> @@ -1871,6 +1889,11 @@ i915_gpu_coredump(struct intel_gt *gt, intel_engine_mask_t engine_mask)
>  		if (INTEL_INFO(i915)->has_gt_uc)
>  			error->gt->uc = gt_record_uc(error->gt, compress);
>  
> +		if (intel_uc_uses_guc_submission(&gt->uc))

I don't think this is needed as guc->capture.enabled should always be
false if submission isn't enabled, right?

Matt

> +			error->gt->uc->capture = intel_guc_capture_store_ptr(&gt->uc.guc);
> +		else
> +			error->gt->uc->capture = NULL;
> +
>  		i915_vma_capture_finish(error->gt, compress);
>  
>  		error->simulated |= error->gt->simulated;
> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.h b/drivers/gpu/drm/i915/i915_gpu_error.h
> index b98d8cdbe4f2..b55369b245ee 100644
> --- a/drivers/gpu/drm/i915/i915_gpu_error.h
> +++ b/drivers/gpu/drm/i915/i915_gpu_error.h
> @@ -17,6 +17,7 @@
>  #include "gt/intel_engine.h"
>  #include "gt/intel_gt_types.h"
>  #include "gt/uc/intel_uc_fw.h"
> +#include "gt/uc/intel_guc_capture.h"
>  
>  #include "intel_device_info.h"
>  
> @@ -151,6 +152,7 @@ struct intel_gt_coredump {
>  		struct intel_uc_fw guc_fw;
>  		struct intel_uc_fw huc_fw;
>  		struct i915_vma_coredump *guc_log;
> +		struct intel_guc_state_capture *capture;
>  	} *uc;
>  
>  	struct intel_gt_coredump *next;
> @@ -216,6 +218,9 @@ struct drm_i915_error_state_buf {
>  
>  __printf(2, 3)
>  void i915_error_printf(struct drm_i915_error_state_buf *e, const char *f, ...);
> +void i915_print_error_vma(struct drm_i915_error_state_buf *m,
> +			  const struct intel_engine_cs *engine,
> +			  const struct i915_vma_coredump *vma);
>  
>  struct i915_gpu_coredump *i915_gpu_coredump(struct intel_gt *gt,
>  					    intel_engine_mask_t engine_mask);
> -- 
> 2.25.1
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Intel-gfx] [RFC 6/7] drm/i915/guc: Copy new GuC error capture logs upon G2H notification.
  2021-12-07 22:58   ` Matthew Brost
@ 2021-12-08  5:14     ` Teres Alexis, Alan Previn
  2021-12-08 18:22       ` Teres Alexis, Alan Previn
  0 siblings, 1 reply; 52+ messages in thread
From: Teres Alexis, Alan Previn @ 2021-12-08  5:14 UTC (permalink / raw)
  To: Brost, Matthew; +Cc: intel-gfx

Thanks Matt for reviewing. Responses to the questions you had.
will fix the rest on next rev.
 
> > @@ -4013,10 +4016,11 @@ int intel_guc_error_capture_process_msg(struct intel_guc *guc,
> >  		return -EPROTO;
> >  	}
> >  
> > -	status = msg[0];
> > -	drm_info(&guc_to_gt(guc)->i915->drm, "Got error capture: status = %d", status);
> > +	status = msg[0] & INTEL_GUC_STATE_CAPTURE_EVENT_STATUS_MASK;
> > +	if (status == INTEL_GUC_STATE_CAPTURE_EVENT_STATUS_NOSPACE)
> > +		drm_warn(&guc_to_gt(guc)->i915->drm, "G2H-Error capture no space\n");
> >  
> > -	/* Add extraction of error capture dump */
> > +	intel_guc_capture_store_snapshot(guc);
> 
> This is done in different worker, right? How does this not race with an
> engine reset notification that does an error capture (e.g. the error
> capture is done before we read out the info from the GuC)?
> 
I believe the guc error state capture notification event comes right before
context resets, not engine resets. Also, the i915_gpu_coredump subsystem
gathers hw state in response to the a context hanging and is done prior to
the hw reset. Therefore, engine reset notification doesnt play a role here.
However, since the context reset notification is expected to come right after
the error state capture notification and your argument is valid in this case...
you could argue a race condition can exist when the context reset event leads
to calling of i915_gpu_coredump subsystem (which in turn gets a pointer to
the intel_guc_capture module), however even here, no actual parsing is done
yet - i am currently waiting on the actual debugfs call before parsing any
of the data. As a fix, However, I add a flush_work at the time the call to
the parsing happens even later?


> As far as I can tell 'intel_guc_capture_store_snapshot' doesn't allocate
> memory so I don't think we need a worker here.
> 
Yes, i am not doing any allocation but the worker function does lock the
capture_store's mutex (to ensure we dont trample on the circular buffer pointers
of the interim store (the one between the guc-log-buffer and the error-capture
reporting). I am not sure if spin_lock_irqsave would be safe considering in the
event we had back to back error captures, then we wouldnt want to be spinning that
long if coincidentially the reporting side is actively parsing the bytestream 
output of the same interim buffer.

After thinking a bit more, a lockless solution is possible, i could double
buffer the interim-store-circular-buffer and so when the event comes in, i flip
to the next "free" interim-store (that isnt filled with pending logs waiting
to be read or being read). If there is no available interim-store, (i.e.
we've already had 2 error state captures that have yet to be read/flushed), then
we just drop the output to the floor. (this last part would be in line with the
current execlist model.. unless something changed there in the last 2 weeks).

However the simplest solution from with this series today, would be to flush_work
much later at the time the debugfs calls to get the output error capture report
(since that would be the last chance to resolve the possible race condition).
I could call the flush_work earlier at the context_reset notification, but that too
would be an irq handler path. 

> Matt
> 
> >  
> >  	return 0;
> >  }
> > -- 
> > 2.25.1
> > 


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Intel-gfx] [RFC 7/7] drm/i915/guc: Print the GuC error capture output register list.
  2021-12-08  0:22   ` Matthew Brost
@ 2021-12-08  6:31     ` Teres Alexis, Alan Previn
  2021-12-23 18:54       ` Teres Alexis, Alan Previn
  0 siblings, 1 reply; 52+ messages in thread
From: Teres Alexis, Alan Previn @ 2021-12-08  6:31 UTC (permalink / raw)
  To: Brost, Matthew; +Cc: intel-gfx

Thanks again for the detailed review here.
Will fix all the rest on next rev.
One special response for this one:


On Tue, 2021-12-07 at 16:22 -0800, Matthew Brost wrote:
> On Mon, Nov 22, 2021 at 03:04:02PM -0800, Alan Previn wrote:
> > +			if (datatype == GUC_CAPTURE_LIST_TYPE_ENGINE_INSTANCE) {
> > +				GCAP_PRINT_GUC_INST_INFO(i915, ebuf, data);
> > +				eng_inst = FIELD_GET(GUC_CAPTURE_DATAHDR_SRC_INSTANCE, data.info);
> > +				eng = guc_lookup_engine(guc, engineclass, eng_inst);
> > +				if (eng) {
> > +					GCAP_PRINT_INTEL_ENG_INFO(i915, ebuf, eng);
> > +				} else {
> > +					PRINT(&i915->drm, ebuf, "    i915-Eng-Lookup Fail!\n");
> > +				}
> > +				ce = guc_context_lookup(guc, data.guc_ctx_id);
> 
> You are going to need to reference count the 'ce' here. See
> intel_guc_context_reset_process_msg for an example. 
> 

Oh crap - i missed this one - which you had explicitly mentioned offline when i was doing the
development. Sorry about that i just totally missed it from my todo-notes.

...alan

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Intel-gfx] [RFC 6/7] drm/i915/guc: Copy new GuC error capture logs upon G2H notification.
  2021-12-08  5:14     ` Teres Alexis, Alan Previn
@ 2021-12-08 18:22       ` Teres Alexis, Alan Previn
  0 siblings, 0 replies; 52+ messages in thread
From: Teres Alexis, Alan Previn @ 2021-12-08 18:22 UTC (permalink / raw)
  To: Brost, Matthew; +Cc: intel-gfx

After chatting offline with Matt, it became apparent that
i somehow missed the fact that the ctb processing handler
was already in a work queue. That said, Matt is correct, i
dont need to create a work queue to extract that capture
log into the interim-store. That would eliminate the race
condition (when the context-reset notification comes
in later via the same queue).

Thanks again Matt.

....alan


On Tue, 2021-12-07 at 21:15 -0800, Alan Previn Teres Alexis wrote:
> Thanks Matt for reviewing. Responses to the questions you had.
> will fix the rest on next rev.
>  
> > > @@ -4013,10 +4016,11 @@ int intel_guc_error_capture_process_msg(struct intel_guc *guc,
> > >  		return -EPROTO;
> > >  	}
> > >  
> > > -	status = msg[0];
> > > -	drm_info(&guc_to_gt(guc)->i915->drm, "Got error capture: status = %d", status);
> > > +	status = msg[0] & INTEL_GUC_STATE_CAPTURE_EVENT_STATUS_MASK;
> > > +	if (status == INTEL_GUC_STATE_CAPTURE_EVENT_STATUS_NOSPACE)
> > > +		drm_warn(&guc_to_gt(guc)->i915->drm, "G2H-Error capture no space\n");
> > >  
> > > -	/* Add extraction of error capture dump */
> > > +	intel_guc_capture_store_snapshot(guc);
> > 
> > This is done in different worker, right? How does this not race with an
> > engine reset notification that does an error capture (e.g. the error
> > capture is done before we read out the info from the GuC)?
> > 
> I believe the guc error state capture notification event comes right before
> context resets, not engine resets. Also, the i915_gpu_coredump subsystem
> gathers hw state in response to the a context hanging and is done prior to
> the hw reset. Therefore, engine reset notification doesnt play a role here.
> However, since the context reset notification is expected to come right after
> the error state capture notification and your argument is valid in this case...
> you could argue a race condition can exist when the context reset event leads
> to calling of i915_gpu_coredump subsystem (which in turn gets a pointer to
> the intel_guc_capture module), however even here, no actual parsing is done
> yet - i am currently waiting on the actual debugfs call before parsing any
> of the data. As a fix, However, I add a flush_work at the time the call to
> the parsing happens even later?
> 
> 
> > As far as I can tell 'intel_guc_capture_store_snapshot' doesn't allocate
> > memory so I don't think we need a worker here.
> > 
> Yes, i am not doing any allocation but the worker function does lock the
> capture_store's mutex (to ensure we dont trample on the circular buffer pointers
> of the interim store (the one between the guc-log-buffer and the error-capture
> reporting). I am not sure if spin_lock_irqsave would be safe considering in the
> event we had back to back error captures, then we wouldnt want to be spinning that
> long if coincidentially the reporting side is actively parsing the bytestream 
> output of the same interim buffer.
> 
> After thinking a bit more, a lockless solution is possible, i could double
> buffer the interim-store-circular-buffer and so when the event comes in, i flip
> to the next "free" interim-store (that isnt filled with pending logs waiting
> to be read or being read). If there is no available interim-store, (i.e.
> we've already had 2 error state captures that have yet to be read/flushed), then
> we just drop the output to the floor. (this last part would be in line with the
> current execlist model.. unless something changed there in the last 2 weeks).
> 
> However the simplest solution from with this series today, would be to flush_work
> much later at the time the debugfs calls to get the output error capture report
> (since that would be the last chance to resolve the possible race condition).
> I could call the flush_work earlier at the context_reset notification, but that too
> would be an irq handler path. 
> 
> > Matt
> > 
> > >  
> > >  	return 0;
> > >  }
> > > -- 
> > > 2.25.1
> > > 


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Intel-gfx] [RFC 1/7] drm/i915/guc: Add basic support for error capture lists
  2021-11-23 21:12   ` Michal Wajdeczko
@ 2021-12-08 18:23     ` Teres Alexis, Alan Previn
  0 siblings, 0 replies; 52+ messages in thread
From: Teres Alexis, Alan Previn @ 2021-12-08 18:23 UTC (permalink / raw)
  To: intel-gfx, Wajdeczko, Michal

I missed responding to this.
Thanks for the review Michal - will fix them on next rev.
...alan

On Tue, 2021-11-23 at 22:12 +0100, Michal Wajdeczko wrote:
> 
> On 23.11.2021 00:03, Alan Previn wrote:
> > From: John Harrison <John.C.Harrison@Intel.com>
> ...
> 
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index 77fbcd8730ee..0bfc92b1b982 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -4003,6 +4003,24 @@ int intel_guc_context_reset_process_msg(struct intel_guc *guc,
> >  	return 0;
> >  }
> >  
> > +int intel_guc_error_capture_process_msg(struct intel_guc *guc,
> > +					 const u32 *msg, u32 len)
> > +{
> > +	int status;
> 
> likely it should be "u32" as few lines below you're using msg[0];
> 
> > +
> > +	if (unlikely(len != 1)) {
> > +		drm_dbg(&guc_to_gt(guc)->i915->drm, "Invalid length %u", len);
> 
> any error returned by the CTB message handler will trigger full dump of
> unexpected message - do we really need this unlikely dbg message here ?
> 
> > +		return -EPROTO;
> > +	}
> > +
> > +	status = msg[0];
> > +	drm_info(&guc_to_gt(guc)->i915->drm, "Got error capture: status = %d", status);
> 
> IIRC all notification status are defined in GuC spec in hex, so maybe we
> should also print it as %#x ?
> 
> -Michal
> 
> > +
> > +	/* Add extraction of error capture dump */
> > +
> > +	return 0;
> > +}
> > +
> >  static struct intel_engine_cs *
> >  guc_lookup_engine(struct intel_guc *guc, u8 guc_class, u8 instance)
> >  {
> > 


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Intel-gfx] [RFC 2/7] drm/i915/guc: Update GuC ADS size for error capture lists
  2021-11-24 17:34     ` Teres Alexis, Alan Previn
@ 2021-12-21 23:15       ` Teres Alexis, Alan Previn
  2021-12-22  1:49       ` Teres Alexis, Alan Previn
  1 sibling, 0 replies; 52+ messages in thread
From: Teres Alexis, Alan Previn @ 2021-12-21 23:15 UTC (permalink / raw)
  To: intel-gfx, Wajdeczko, Michal

Michal, wrt this one:

+/************ FIXME: Populate tables for other devices in subsequent patch ************/
> > > +
> > > +static struct __guc_mmio_reg_descr_group *
> > > +guc_capture_get_device_reglist(struct drm_i915_private *dev_priv)
> > 
> > in new code we are using "i915" instead of "dev_priv" and since this
> > function has "guc" prefix it shall rather take "guc" as param:
> > 
> > guc_capture_get_device_reglist(struct intel_guc *guc)
> > {
> > 	struct drm_i915_private *i915 = guc_to_gt(guc)->i915;
> > 	...
> > 

if its a static function that is only used internally, does the rule still apply?
I thought rules on primary handle inputs are for cross-i915-component interfaces that should start with an "intel_" in
front no? I will fix the others, but will keep static internal only functions the way they are (to avoid unnecessary de-
refencing).


...alan



On Wed, 2021-11-24 at 09:35 -0800, Alan Previn Teres Alexis wrote:
> Thanks Michal for the thorough review of the code (and the other patches). I will fix them all.
> 
> On the register-to-string helper function,
> i'll have to think it through because i do want to keep future development
> maintenance work when adding new registers simple (in the sense that
> adding a single line into the table will be all thats needed).
> 
> Unless you are suggesting keeping a global i915-wide list somewhere?
> which might be a bit of an overhead when searching through an offset list
> to find the mmio being requested for string return - unless i keep a sorted tree
> initialized with registers ordered by address, but would not work well for
> different registers that share addresses on diff gen's).
> 
> 
> ...alan
> 
> 
> On Tue, 2021-11-23 at 22:46 +0100, Michal Wajdeczko wrote:
> > Hi,
> > 
> > just few random nits below
> > 
> > -Michal
> > 
> > 
> > On 23.11.2021 00:03, Alan Previn wrote:
> > > Update GuC ADS size allocation to include space for
> > > the lists of error state capture register descriptors.
> > > 
> > > Also, populate the lists of registers we want GuC to report back to
> > > Host on engine reset events. This list should include global,
> > > engine-class and engine-instance registers for every engine-class
> > > type on the current hardware.
> > > 
> > > NOTE: Start with a fake table of register lists to layout the
> > > framework before adding real registers in subsequent patch.
> > > 
> > > Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com>
> > > ---
> > >  drivers/gpu/drm/i915/Makefile                 |   1 +
> > >  drivers/gpu/drm/i915/gt/uc/intel_guc.c        |  10 +-
> > >  drivers/gpu/drm/i915/gt/uc/intel_guc.h        |   5 +
> > >  drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c    | 176 ++++++++++++-
> > >  .../gpu/drm/i915/gt/uc/intel_guc_capture.c    | 232 ++++++++++++++++++
> > >  .../gpu/drm/i915/gt/uc/intel_guc_capture.h    |  47 ++++
> > >  drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |  19 +-
> > >  7 files changed, 476 insertions(+), 14 deletions(-)
> > >  create mode 100644 drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
> > >  create mode 100644 drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h
> > > 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Intel-gfx] [RFC 2/7] drm/i915/guc: Update GuC ADS size for error capture lists
  2021-11-24 17:34     ` Teres Alexis, Alan Previn
  2021-12-21 23:15       ` Teres Alexis, Alan Previn
@ 2021-12-22  1:49       ` Teres Alexis, Alan Previn
  1 sibling, 0 replies; 52+ messages in thread
From: Teres Alexis, Alan Previn @ 2021-12-22  1:49 UTC (permalink / raw)
  To: intel-gfx, Wajdeczko, Michal

Hi Michal - wrt to this comment:

+struct intel_guc;
> > > +
> > > +struct __guc_mmio_reg_descr {
> > > +	i915_reg_t reg;
> > > +	u32 flags;
> > > +	u32 mask;
> > > +	char *regname;
> > 
> > const char* ?
> > 
> > but maybe instead of adding reg name to the GuC specific struct we
> > should add generic purpose function that will return pretty name of the
> > register:
> > 
> > i915_reg.c:
> > 
> > const char *i915_reg_to_string(i915_reg_r reg)
> > {
> > 	...
> > }
> > 

I dont think this will scale if we have different generation hardware with different register names but same offset.
I checked that this has happenned in the past. Additionally, it would mean anyone adding a new register in what could
some day be a pretty long list, would have to update 2 locations.

If its okay with you, i think i would stick with current version for this specific hunk. (am fixing all the rest ofc).



...alan


On Wed, 2021-11-24 at 09:35 -0800, Alan Previn Teres Alexis wrote:
> Thanks Michal for the thorough review of the code (and the other patches). I will fix them all.
> 
> On the register-to-string helper function,
> i'll have to think it through because i do want to keep future development
> maintenance work when adding new registers simple (in the sense that
> adding a single line into the table will be all thats needed).
> 
> Unless you are suggesting keeping a global i915-wide list somewhere?
> which might be a bit of an overhead when searching through an offset list
> to find the mmio being requested for string return - unless i keep a sorted tree
> initialized with registers ordered by address, but would not work well for
> different registers that share addresses on diff gen's).
> 
> 
> ...alan
> 
> 
> On Tue, 2021-11-23 at 22:46 +0100, Michal Wajdeczko wrote:
> > Hi,
> > 
> > just few random nits below
> > 
> > -Michal
> > 
> > 
> > On 23.11.2021 00:03, Alan Previn wrote:
> > > Update GuC ADS size allocation to include space for
> > > the lists of error state capture register descriptors.
> > > 
> > > Also, populate the lists of registers we want GuC to report back to
> > > Host on engine reset events. This list should include global,
> > > engine-class and engine-instance registers for every engine-class
> > > type on the current hardware.
> > > 
> > > NOTE: Start with a fake table of register lists to layout the
> > > framework before adding real registers in subsequent patch.
> > > 
> > > Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com>
> > > ---
> > >  drivers/gpu/drm/i915/Makefile                 |   1 +
> > >  drivers/gpu/drm/i915/gt/uc/intel_guc.c        |  10 +-
> > >  drivers/gpu/drm/i915/gt/uc/intel_guc.h        |   5 +
> > >  drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c    | 176 ++++++++++++-
> > >  .../gpu/drm/i915/gt/uc/intel_guc_capture.c    | 232 ++++++++++++++++++
> > >  .../gpu/drm/i915/gt/uc/intel_guc_capture.h    |  47 ++++
> > >  drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |  19 +-
> > >  7 files changed, 476 insertions(+), 14 deletions(-)
> > >  create mode 100644 drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
> > >  create mode 100644 drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h
> > > 
> > > diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
> > > index 074d6b8edd23..e3c4d5cea4c3 100644
> > > --- a/drivers/gpu/drm/i915/Makefile
> > > +++ b/drivers/gpu/drm/i915/Makefile
> > > @@ -190,6 +190,7 @@ i915-y += gt/uc/intel_uc.o \
> > >  	  gt/uc/intel_guc_rc.o \
> > >  	  gt/uc/intel_guc_slpc.o \
> > >  	  gt/uc/intel_guc_submission.o \
> > > +	  gt/uc/intel_guc_capture.o \
> > 
> > use alphabetical order
> > 
> > >  	  gt/uc/intel_huc.o \
> > >  	  gt/uc/intel_huc_debugfs.o \
> > >  	  gt/uc/intel_huc_fw.o
> > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.c b/drivers/gpu/drm/i915/gt/uc/intel_guc.c
> > > index 5cf9ebd2ee55..458f0d248a5a 100644
> > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.c
> > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.c
> > > @@ -335,9 +335,14 @@ int intel_guc_init(struct intel_guc *guc)
> > >  	if (ret)
> > >  		goto err_fw;
> > >  
> > > -	ret = intel_guc_ads_create(guc);
> > > +	ret = intel_guc_capture_init(guc);
> > >  	if (ret)
> > >  		goto err_log;
> > > +
> > > +	ret = intel_guc_ads_create(guc);
> > > +	if (ret)
> > > +		goto err_capture;
> > > +
> > >  	GEM_BUG_ON(!guc->ads_vma);
> > >  
> > >  	ret = intel_guc_ct_init(&guc->ct);
> > > @@ -376,6 +381,8 @@ int intel_guc_init(struct intel_guc *guc)
> > >  	intel_guc_ct_fini(&guc->ct);
> > >  err_ads:
> > >  	intel_guc_ads_destroy(guc);
> > > +err_capture:
> > > +	intel_guc_capture_destroy(guc);
> > >  err_log:
> > >  	intel_guc_log_destroy(&guc->log);
> > >  err_fw:
> > > @@ -403,6 +410,7 @@ void intel_guc_fini(struct intel_guc *guc)
> > >  	intel_guc_ct_fini(&guc->ct);
> > >  
> > >  	intel_guc_ads_destroy(guc);
> > > +	intel_guc_capture_destroy(guc);
> > >  	intel_guc_log_destroy(&guc->log);
> > >  	intel_uc_fw_fini(&guc->fw);
> > >  }
> > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > > index 9de99772f916..d136c69abe12 100644
> > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > > @@ -16,6 +16,7 @@
> > >  #include "intel_guc_log.h"
> > >  #include "intel_guc_reg.h"
> > >  #include "intel_guc_slpc_types.h"
> > > +#include "intel_guc_capture.h"
> > 
> > use alphabetical order
> > 
> > >  #include "intel_uc_fw.h"
> > >  #include "i915_utils.h"
> > >  #include "i915_vma.h"
> > > @@ -37,6 +38,8 @@ struct intel_guc {
> > >  	struct intel_guc_ct ct;
> > >  	/** @slpc: sub-structure containing SLPC related data and objects */
> > >  	struct intel_guc_slpc slpc;
> > > +	/** @capture: the error-state-capture module's data and objects */
> > > +	struct intel_guc_state_capture capture;
> > >  
> > >  	/** @sched_engine: Global engine used to submit requests to GuC */
> > >  	struct i915_sched_engine *sched_engine;
> > > @@ -138,6 +141,8 @@ struct intel_guc {
> > >  	u32 ads_regset_size;
> > >  	/** @ads_golden_ctxt_size: size of the golden contexts in the ADS */
> > >  	u32 ads_golden_ctxt_size;
> > > +	/** @ads_capture_size: size of register lists in the ADS used for error capture */
> > > +	u32 ads_capture_size;
> > >  	/** @ads_engine_usage_size: size of engine usage in the ADS */
> > >  	u32 ads_engine_usage_size;
> > >  
> > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
> > > index 6c81ddd303d3..2780c0fadd01 100644
> > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
> > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
> > > @@ -10,6 +10,7 @@
> > >  #include "gt/shmem_utils.h"
> > >  #include "intel_guc_ads.h"
> > >  #include "intel_guc_fwif.h"
> > > +#include "intel_guc_capture.h"
> > 
> > wrong order
> > 
> > >  #include "intel_uc.h"
> > >  #include "i915_drv.h"
> > >  
> > > @@ -71,8 +72,7 @@ static u32 guc_ads_golden_ctxt_size(struct intel_guc *guc)
> > >  
> > >  static u32 guc_ads_capture_size(struct intel_guc *guc)
> > >  {
> > > -	/* Basic support to init ADS without a proper GuC error capture list */
> > > -	return PAGE_ALIGN(PAGE_SIZE);
> > > +	return PAGE_ALIGN(guc->ads_capture_size);
> > >  }
> > >  
> > >  static u32 guc_ads_private_data_size(struct intel_guc *guc)
> > > @@ -519,24 +519,170 @@ static void guc_init_golden_context(struct intel_guc *guc)
> > >  	GEM_BUG_ON(guc->ads_golden_ctxt_size != total_size);
> > >  }
> > >  
> > > -static void guc_capture_prep_lists(struct intel_guc *guc, struct __guc_ads_blob *blob)
> > > +static int
> > > +guc_fill_reglist(struct intel_guc *guc, struct __guc_ads_blob *blob, int vf, bool enabled,
> > > +		 int classid, int type, char *typename, u16 *p_numregs, int newnum, u8 **p_virt_ptr,
> > > +		 u32 *p_blobptr_to_ggtt, u32 *p_ggtt, u32 null_ggtt)
> > 
> > hmm, this does not look good - do we really need all these params ?
> > 
> > >  {
> > > -	int i, j;
> > > -	u32 addr_ggtt, offset;
> > > +	struct drm_i915_private *i915 = guc_to_gt(guc)->i915;
> > > +	struct guc_debug_capture_list *listnode;
> > > +	int size = 0;
> > >  
> > > -	offset = guc_ads_capture_offset(guc);
> > > -	addr_ggtt = intel_guc_ggtt_offset(guc, guc->ads_vma) + offset;
> > > +	if (blob && *p_numregs != newnum) {
> > > +		if (type == GUC_CAPTURE_LIST_TYPE_GLOBAL)
> > > +			drm_warn(&i915->drm, "Guc-Cap VF%d-%s num-reg mismatch was=%d now=%d!\n",
> > > +				 vf, typename, *p_numregs, newnum);
> > > +		else
> > > +			drm_warn(&i915->drm, "Guc-Cap VF%d-Class-%d-%s num-reg mismatch was=%d now=%d!\n",
> > > +				 vf, classid, typename, *p_numregs, newnum);
> > > +	}
> > > +	/*
> > > +	 * For enabled capture lists, we not only need to call capture module to help
> > > +	 * populate the list-descriptor into the correct ads capture structures, but
> > > +	 * we also need to increment the virtual pointers and ggtt offsets so that
> > > +	 * caller has the subsequent gfx memory location.
> > > +	 */
> > > +	*p_numregs = newnum;
> > > +	size = PAGE_ALIGN((sizeof(struct guc_debug_capture_list)) +
> > > +			  (newnum * sizeof(struct guc_mmio_reg)));
> > > +	/* if caller hasn't allocated ADS blob, return size and counts, we're done */
> > > +	if (!blob)
> > > +		return size;
> > > +	if (blob) {
> > 
> > redundant
> > 
> > > +		/* if caller allocated ADS blob, populate the capture register descriptors */
> > > +		if (!newnum) {
> > > +			*p_blobptr_to_ggtt = null_ggtt;
> > > +		} else {
> > > +			/* get ptr and populate header info: */
> > > +			*p_blobptr_to_ggtt = *p_ggtt;
> > > +			listnode = (struct guc_debug_capture_list *)*p_virt_ptr;
> > > +			*p_ggtt += sizeof(struct guc_debug_capture_list);
> > > +			*p_virt_ptr += sizeof(struct guc_debug_capture_list);
> > > +			listnode->header.info = FIELD_PREP(GUC_CAPTURELISTHDR_NUMDESCR, *p_numregs);
> > > +
> > > +			/* get ptr and populate register descriptor list: */
> > > +			intel_guc_capture_list_init(guc, vf, type, classid,
> > > +						    (struct guc_mmio_reg *)*p_virt_ptr,
> > > +						    *p_numregs);
> > > +
> > > +			/* increment ptrs for that header: */
> > > +			*p_ggtt += size - sizeof(struct guc_debug_capture_list);
> > > +			*p_virt_ptr += size - sizeof(struct guc_debug_capture_list);
> > > +		}
> > > +	}
> > > +
> > > +	return size;
> > > +}
> > > +
> > > +static int guc_capture_prep_lists(struct intel_guc *guc, struct __guc_ads_blob *blob)
> > > +{
> > > +	struct intel_gt *gt = guc_to_gt(guc);
> > > +	int i, j, size;
> > > +	u32 ggtt, null_ggtt, offset, alloc_size = 0;
> > > +	struct guc_gt_system_info *info, local_info;
> > > +	struct guc_debug_capture_list *listnode;
> > > +	struct drm_i915_private *i915 = guc_to_gt(guc)->i915;
> > > +	struct intel_guc_state_capture *gc = &guc->capture;
> > > +	u16 tmp = 0;
> > > +	u8 *ptr = NULL;
> > > +
> > > +	if (blob) {
> > > +		offset = guc_ads_capture_offset(guc);
> > > +		ggtt = intel_guc_ggtt_offset(guc, guc->ads_vma) + offset;
> > > +		ptr = ((u8 *)blob) + offset;
> > > +		info = &blob->system_info;
> > > +	} else {
> > > +		memset(&local_info, 0, sizeof(local_info));
> > > +		info = &local_info;
> > > +		fill_engine_enable_masks(gt, info);
> > > +	}
> > > +
> > > +	/* first, set aside the first page for a capture_list with zero descriptors */
> > > +	alloc_size = PAGE_SIZE;
> > > +	if (blob) {
> > > +		listnode = (struct guc_debug_capture_list *)ptr;
> > > +		listnode->header.info = FIELD_PREP(GUC_CAPTURELISTHDR_NUMDESCR, 0);
> > > +		null_ggtt = ggtt;
> > > +		ggtt += PAGE_SIZE;
> > > +		ptr +=  PAGE_SIZE;
> > > +	}
> > >  
> > > -	/* FIXME: Populate a proper capture list */
> > > +#define COUNT_REGS intel_guc_capture_list_count
> > > +#define FILL_REGS guc_fill_reglist
> > > +#define TYPE_CLASS GUC_CAPTURE_LIST_TYPE_ENGINE_CLASS
> > > +#define TYPE_INSTANCE GUC_CAPTURE_LIST_TYPE_ENGINE_INSTANCE
> > >  
> > >  	for (i = 0; i < GUC_CAPTURE_LIST_INDEX_MAX; i++) {
> > >  		for (j = 0; j < GUC_MAX_ENGINE_CLASSES; j++) {
> > > -			blob->ads.capture_instance[i][j] = addr_ggtt;
> > > -			blob->ads.capture_class[i][j] = addr_ggtt;
> > > +			if (!info->engine_enabled_masks[j]) {
> > > +				if (gc->num_class_regs[i][j])
> > > +					drm_warn(&i915->drm, "GuC-Cap VF%d-class-%d "
> > > +						 "class regs valid mismatch was=%d now=%d!\n",
> > > +						 i, j, gc->num_class_regs[i][j], tmp);
> > > +				if (gc->num_instance_regs[i][j])
> > > +					drm_warn(&i915->drm, "GuC-Cap VF%d-class-%d "
> > > +						 "inst regs valid mismatch was=%d now=%d!\n",
> > > +						 i, j, gc->num_instance_regs[i][j], tmp);
> > > +				gc->num_class_regs[i][j] = 0;
> > > +				gc->num_instance_regs[i][j] = 0;
> > > +				if (blob) {
> > > +					blob->ads.capture_class[i][j] = null_ggtt;
> > > +					blob->ads.capture_instance[i][j] = null_ggtt;
> > > +				}
> > > +			} else {
> > > +				if (!COUNT_REGS(guc, i, TYPE_CLASS,
> > > +						guc_class_to_engine_class(j), &tmp)) {
> > > +					size = FILL_REGS(guc, blob, i, true, j, TYPE_CLASS,
> > > +							 "class", &gc->num_class_regs[i][j],
> > > +							 tmp, &ptr,
> > > +							 &blob->ads.capture_class[i][j],
> > > +							 &ggtt, null_ggtt);
> > > +					gc->class_list_size += size;
> > > +					alloc_size += size;
> > > +				} else {
> > > +					gc->num_class_regs[i][j] = 0;
> > > +					if (blob)
> > > +						blob->ads.capture_class[i][j] = null_ggtt;
> > > +				}
> > > +				if (!COUNT_REGS(guc, i, TYPE_INSTANCE,
> > > +						guc_class_to_engine_class(j), &tmp)) {
> > > +					size = FILL_REGS(guc, blob, i, true, j, TYPE_INSTANCE,
> > > +							 "instance", &gc->num_instance_regs[i][j],
> > > +							 tmp, &ptr,
> > > +							 &blob->ads.capture_instance[i][j],
> > > +							 &ggtt, null_ggtt);
> > > +					gc->instance_list_size += size;
> > > +					alloc_size += size;
> > > +				} else {
> > > +					gc->num_instance_regs[i][j] = 0;
> > > +					if (blob)
> > > +						blob->ads.capture_instance[i][j] = null_ggtt;
> > > +				}
> > > +			}
> > > +		}
> > > +		if (!COUNT_REGS(guc, i, GUC_CAPTURE_LIST_TYPE_GLOBAL, 0, &tmp)) {
> > > +			size = FILL_REGS(guc, blob, i, true, 0, GUC_CAPTURE_LIST_TYPE_GLOBAL,
> > > +					 "global", &gc->num_global_regs[i], tmp, &ptr,
> > > +					 &blob->ads.capture_global[i], &ggtt, null_ggtt);
> > > +			gc->global_list_size += size;
> > > +			alloc_size += size;
> > > +		} else {
> > > +			gc->num_global_regs[i] = 0;
> > > +			if (blob)
> > > +				blob->ads.capture_global[i] = null_ggtt;
> > >  		}
> > > -
> > > -		blob->ads.capture_global[i] = addr_ggtt;
> > >  	}
> > > +
> > > +#undef COUNT_REGS
> > > +#undef FILL_REGS
> > > +#undef TYPE_CLASS
> > > +#undef TYPE_INSTANCE
> > > +
> > > +	if (guc->ads_capture_size && guc->ads_capture_size != PAGE_ALIGN(alloc_size))
> > > +		drm_warn(&i915->drm, "GuC->ADS->Capture alloc size changed from %d to %d\n",
> > > +			 guc->ads_capture_size, PAGE_ALIGN(alloc_size));
> > > +
> > > +	return PAGE_ALIGN(alloc_size);
> > >  }
> > >  
> > >  static void __guc_ads_init(struct intel_guc *guc)
> > > @@ -614,6 +760,12 @@ int intel_guc_ads_create(struct intel_guc *guc)
> > >  		return ret;
> > >  	guc->ads_golden_ctxt_size = ret;
> > >  
> > > +	/* Likewise the capture lists: */
> > > +	ret = guc_capture_prep_lists(guc, NULL);
> > > +	if (ret < 0)
> > > +		return ret;
> > > +	guc->ads_capture_size = ret;
> > > +
> > >  	/* Now the total size can be determined: */
> > >  	size = guc_ads_blob_size(guc);
> > >  
> > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
> > > new file mode 100644
> > > index 000000000000..c741c77b7fc8
> > > --- /dev/null
> > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
> > > @@ -0,0 +1,232 @@
> > > +// SPDX-License-Identifier: MIT
> > > +/*
> > > + * Copyright © 2021-2021 Intel Corporation
> > > + */
> > > +
> > > +#include <drm/drm_print.h>
> > > +
> > > +#include "i915_drv.h"
> > > +#include "i915_drv.h"
> > 
> > duplicated include
> > 
> > > +#include "i915_memcpy.h"
> > > +#include "gt/intel_gt.h"
> > > +
> > > +#include "intel_guc_fwif.h"
> > > +#include "intel_guc_capture.h"
> > > +
> > > +/* Define all device tables of GuC error capture register lists */
> > > +
> > > +/********************************* Gen12 LP  *********************************/
> > 
> > didn't we move away from "GEN" naming ?
> > 
> > > +/************** GLOBAL *************/
> > 
> > do we really need all these decorations ?
> > 
> > > +struct __guc_mmio_reg_descr gen12lp_global_regs[] = {
> > > +	{SWF_ILK(0),               0,      0, "SWF_ILK0"},
> > > +	/* Add additional register list */
> > 
> > do we need this reminder ?
> > 
> > > +/********** List of lists **********/
> > > +struct __guc_mmio_reg_descr_group gen12lp_lists[] = {
> > > +	{
> > > +		.list = gen12lp_global_regs,
> > > +		.num_regs = (sizeof(gen12lp_global_regs) / sizeof(struct __guc_mmio_reg_descr)),
> > 
> > ARRAY_SIZE ?
> > 
> > > +/************ FIXME: Populate tables for other devices in subsequent patch ************/
> > > +
> > > +static struct __guc_mmio_reg_descr_group *
> > > +guc_capture_get_device_reglist(struct drm_i915_private *dev_priv)
> > 
> > in new code we are using "i915" instead of "dev_priv" and since this
> > function has "guc" prefix it shall rather take "guc" as param:
> > 
> > guc_capture_get_device_reglist(struct intel_guc *guc)
> > {
> > 	struct drm_i915_private *i915 = guc_to_gt(guc)->i915;
> > 	...
> > 
> > 
> > > +static inline void
> > > +warn_with_capture_list_identifier(struct drm_i915_private *i915, char *msg,
> > > +				  u32 owner, u32 type, u32 classid)
> > > +{
> > > +	const char *ownerstr[GUC_CAPTURE_LIST_INDEX_MAX] = {"PF", "VF"};
> > > +	const char *typestr[GUC_CAPTURE_LIST_TYPE_MAX - 1] = {"Class", "Instance"};
> > > +	const char *classstr[GUC_LAST_ENGINE_CLASS + 1] = {"Render", "Video", "VideoEnhance",
> > > +							   "Blitter", "Reserved"};
> > 
> > better to wrap that into simple small helpers like
> > 
> > 	const char *stringify_guc_capture_owner(u32 owner) { .. }
> > 	const char *stringify_guc_capture_type(u32 type) { .. }
> > 	const char *stringify_guc_capture_class(u32 class) { .. }
> > 
> > > +int intel_guc_capture_list_count(struct intel_guc *guc, u32 owner, u32 type, u32 classid,
> > > +				 u16 *num_entries)
> > > +{
> > > +	struct drm_i915_private *dev_priv = (guc_to_gt(guc))->i915;
> > 
> > s/dev_priv/i915
> > redundant ()
> > 
> > > +	struct __guc_mmio_reg_descr_group *reglists = guc->capture.reglists;
> > > +	struct __guc_mmio_reg_descr_group *match;
> > > +
> > > +	if (!reglists)
> > > +		return -ENODEV;
> > > +
> > > +	match = guc_capture_get_one_list(reglists, owner, type, classid);
> > > +	if (match) {
> > > +		*num_entries = match->num_regs;
> > > +		return 0;
> > 
> > IIRC early returns are preferred for error cases, not success
> > 
> > > +int intel_guc_capture_list_init(struct intel_guc *guc, u32 owner, u32 type, u32 classid,
> > > +				struct guc_mmio_reg *ptr, u16 num_entries)
> > > +{
> > > +	u32 j = 0;
> > > +	struct drm_i915_private *dev_priv = (guc_to_gt(guc))->i915;
> > 
> > s/dev_priv/i915
> > redundant ()
> > 
> > > +struct intel_guc;
> > > +
> > > +struct __guc_mmio_reg_descr {
> > > +	i915_reg_t reg;
> > > +	u32 flags;
> > > +	u32 mask;
> > > +	char *regname;
> > 
> > const char* ?
> > 
> > but maybe instead of adding reg name to the GuC specific struct we
> > should add generic purpose function that will return pretty name of the
> > register:
> > 
> > i915_reg.c:
> > 
> > const char *i915_reg_to_string(i915_reg_r reg)
> > {
> > 	...
> > }
> > 
> > >  
> > >  /* Capture-types of GuC capture register lists */
> > > -enum
> > > +enum guc_capture_owner
> > >  {
> > >  	GUC_CAPTURE_LIST_INDEX_PF = 0,
> > >  	GUC_CAPTURE_LIST_INDEX_VF = 1,
> > >  	GUC_CAPTURE_LIST_INDEX_MAX = 2,
> > 
> > s/INDEX/OWNER ?
> > 
> > >  };
> > >  


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Intel-gfx] [RFC 2/7] drm/i915/guc: Update GuC ADS size for error capture lists
  2021-11-23 21:46   ` Michal Wajdeczko
  2021-11-24  9:52     ` Jani Nikula
  2021-11-24 17:34     ` Teres Alexis, Alan Previn
@ 2021-12-22 20:13     ` Teres Alexis, Alan Previn
  2 siblings, 0 replies; 52+ messages in thread
From: Teres Alexis, Alan Previn @ 2021-12-22 20:13 UTC (permalink / raw)
  To: Wajdeczko, Michal, intel-gfx

Michal, WRT below, feedback from other developers is match spec names.
(i.e. means other struct names introduced in the series needs minor touch-ups).

>  
>  /* Capture-types of GuC capture register lists */ -enum
> +enum guc_capture_owner
>  {
>  	GUC_CAPTURE_LIST_INDEX_PF = 0,
>  	GUC_CAPTURE_LIST_INDEX_VF = 1,
>  	GUC_CAPTURE_LIST_INDEX_MAX = 2,

s/INDEX/OWNER ?

-----Original Message-----
From: Wajdeczko, Michal <Michal.Wajdeczko@intel.com> 
Sent: Tuesday, November 23, 2021 1:47 PM
To: Teres Alexis, Alan Previn <alan.previn.teres.alexis@intel.com>; intel-gfx@lists.freedesktop.org
Subject: Re: [Intel-gfx] [RFC 2/7] drm/i915/guc: Update GuC ADS size for error capture lists

Hi,

just few random nits below

-Michal


On 23.11.2021 00:03, Alan Previn wrote:
> Update GuC ADS size allocation to include space for the lists of error 
> state capture register descriptors.
> 
> Also, populate the lists of registers we want GuC to report back to 
> Host on engine reset events. This list should include global, 
> engine-class and engine-instance registers for every engine-class type 
> on the current hardware.
> 
> NOTE: Start with a fake table of register lists to layout the 
> framework before adding real registers in subsequent patch.
> 
> Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com>
> ---
>  drivers/gpu/drm/i915/Makefile                 |   1 +
>  drivers/gpu/drm/i915/gt/uc/intel_guc.c        |  10 +-
>  drivers/gpu/drm/i915/gt/uc/intel_guc.h        |   5 +
>  drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c    | 176 ++++++++++++-
>  .../gpu/drm/i915/gt/uc/intel_guc_capture.c    | 232 ++++++++++++++++++
>  .../gpu/drm/i915/gt/uc/intel_guc_capture.h    |  47 ++++
>  drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |  19 +-
>  7 files changed, 476 insertions(+), 14 deletions(-)  create mode 
> 100644 drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
>  create mode 100644 drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h
> 
> diff --git a/drivers/gpu/drm/i915/Makefile 
> b/drivers/gpu/drm/i915/Makefile index 074d6b8edd23..e3c4d5cea4c3 
> 100644
> --- a/drivers/gpu/drm/i915/Makefile
> +++ b/drivers/gpu/drm/i915/Makefile
> @@ -190,6 +190,7 @@ i915-y += gt/uc/intel_uc.o \
>  	  gt/uc/intel_guc_rc.o \
>  	  gt/uc/intel_guc_slpc.o \
>  	  gt/uc/intel_guc_submission.o \
> +	  gt/uc/intel_guc_capture.o \

use alphabetical order

>  	  gt/uc/intel_huc.o \
>  	  gt/uc/intel_huc_debugfs.o \
>  	  gt/uc/intel_huc_fw.o
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.c 
> b/drivers/gpu/drm/i915/gt/uc/intel_guc.c
> index 5cf9ebd2ee55..458f0d248a5a 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.c
> @@ -335,9 +335,14 @@ int intel_guc_init(struct intel_guc *guc)
>  	if (ret)
>  		goto err_fw;
>  
> -	ret = intel_guc_ads_create(guc);
> +	ret = intel_guc_capture_init(guc);
>  	if (ret)
>  		goto err_log;
> +
> +	ret = intel_guc_ads_create(guc);
> +	if (ret)
> +		goto err_capture;
> +
>  	GEM_BUG_ON(!guc->ads_vma);
>  
>  	ret = intel_guc_ct_init(&guc->ct);
> @@ -376,6 +381,8 @@ int intel_guc_init(struct intel_guc *guc)
>  	intel_guc_ct_fini(&guc->ct);
>  err_ads:
>  	intel_guc_ads_destroy(guc);
> +err_capture:
> +	intel_guc_capture_destroy(guc);
>  err_log:
>  	intel_guc_log_destroy(&guc->log);
>  err_fw:
> @@ -403,6 +410,7 @@ void intel_guc_fini(struct intel_guc *guc)
>  	intel_guc_ct_fini(&guc->ct);
>  
>  	intel_guc_ads_destroy(guc);
> +	intel_guc_capture_destroy(guc);
>  	intel_guc_log_destroy(&guc->log);
>  	intel_uc_fw_fini(&guc->fw);
>  }
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h 
> b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> index 9de99772f916..d136c69abe12 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> @@ -16,6 +16,7 @@
>  #include "intel_guc_log.h"
>  #include "intel_guc_reg.h"
>  #include "intel_guc_slpc_types.h"
> +#include "intel_guc_capture.h"

use alphabetical order

>  #include "intel_uc_fw.h"
>  #include "i915_utils.h"
>  #include "i915_vma.h"
> @@ -37,6 +38,8 @@ struct intel_guc {
>  	struct intel_guc_ct ct;
>  	/** @slpc: sub-structure containing SLPC related data and objects */
>  	struct intel_guc_slpc slpc;
> +	/** @capture: the error-state-capture module's data and objects */
> +	struct intel_guc_state_capture capture;
>  
>  	/** @sched_engine: Global engine used to submit requests to GuC */
>  	struct i915_sched_engine *sched_engine; @@ -138,6 +141,8 @@ struct 
> intel_guc {
>  	u32 ads_regset_size;
>  	/** @ads_golden_ctxt_size: size of the golden contexts in the ADS */
>  	u32 ads_golden_ctxt_size;
> +	/** @ads_capture_size: size of register lists in the ADS used for error capture */
> +	u32 ads_capture_size;
>  	/** @ads_engine_usage_size: size of engine usage in the ADS */
>  	u32 ads_engine_usage_size;
>  
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c 
> b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
> index 6c81ddd303d3..2780c0fadd01 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
> @@ -10,6 +10,7 @@
>  #include "gt/shmem_utils.h"
>  #include "intel_guc_ads.h"
>  #include "intel_guc_fwif.h"
> +#include "intel_guc_capture.h"

wrong order

>  #include "intel_uc.h"
>  #include "i915_drv.h"
>  
> @@ -71,8 +72,7 @@ static u32 guc_ads_golden_ctxt_size(struct intel_guc 
> *guc)
>  
>  static u32 guc_ads_capture_size(struct intel_guc *guc)  {
> -	/* Basic support to init ADS without a proper GuC error capture list */
> -	return PAGE_ALIGN(PAGE_SIZE);
> +	return PAGE_ALIGN(guc->ads_capture_size);
>  }
>  
>  static u32 guc_ads_private_data_size(struct intel_guc *guc) @@ 
> -519,24 +519,170 @@ static void guc_init_golden_context(struct intel_guc *guc)
>  	GEM_BUG_ON(guc->ads_golden_ctxt_size != total_size);  }
>  
> -static void guc_capture_prep_lists(struct intel_guc *guc, struct 
> __guc_ads_blob *blob)
> +static int
> +guc_fill_reglist(struct intel_guc *guc, struct __guc_ads_blob *blob, int vf, bool enabled,
> +		 int classid, int type, char *typename, u16 *p_numregs, int newnum, u8 **p_virt_ptr,
> +		 u32 *p_blobptr_to_ggtt, u32 *p_ggtt, u32 null_ggtt)

hmm, this does not look good - do we really need all these params ?

>  {
> -	int i, j;
> -	u32 addr_ggtt, offset;
> +	struct drm_i915_private *i915 = guc_to_gt(guc)->i915;
> +	struct guc_debug_capture_list *listnode;
> +	int size = 0;
>  
> -	offset = guc_ads_capture_offset(guc);
> -	addr_ggtt = intel_guc_ggtt_offset(guc, guc->ads_vma) + offset;
> +	if (blob && *p_numregs != newnum) {
> +		if (type == GUC_CAPTURE_LIST_TYPE_GLOBAL)
> +			drm_warn(&i915->drm, "Guc-Cap VF%d-%s num-reg mismatch was=%d now=%d!\n",
> +				 vf, typename, *p_numregs, newnum);
> +		else
> +			drm_warn(&i915->drm, "Guc-Cap VF%d-Class-%d-%s num-reg mismatch was=%d now=%d!\n",
> +				 vf, classid, typename, *p_numregs, newnum);
> +	}
> +	/*
> +	 * For enabled capture lists, we not only need to call capture module to help
> +	 * populate the list-descriptor into the correct ads capture structures, but
> +	 * we also need to increment the virtual pointers and ggtt offsets so that
> +	 * caller has the subsequent gfx memory location.
> +	 */
> +	*p_numregs = newnum;
> +	size = PAGE_ALIGN((sizeof(struct guc_debug_capture_list)) +
> +			  (newnum * sizeof(struct guc_mmio_reg)));
> +	/* if caller hasn't allocated ADS blob, return size and counts, we're done */
> +	if (!blob)
> +		return size;
> +	if (blob) {

redundant

> +		/* if caller allocated ADS blob, populate the capture register descriptors */
> +		if (!newnum) {
> +			*p_blobptr_to_ggtt = null_ggtt;
> +		} else {
> +			/* get ptr and populate header info: */
> +			*p_blobptr_to_ggtt = *p_ggtt;
> +			listnode = (struct guc_debug_capture_list *)*p_virt_ptr;
> +			*p_ggtt += sizeof(struct guc_debug_capture_list);
> +			*p_virt_ptr += sizeof(struct guc_debug_capture_list);
> +			listnode->header.info = FIELD_PREP(GUC_CAPTURELISTHDR_NUMDESCR, 
> +*p_numregs);
> +
> +			/* get ptr and populate register descriptor list: */
> +			intel_guc_capture_list_init(guc, vf, type, classid,
> +						    (struct guc_mmio_reg *)*p_virt_ptr,
> +						    *p_numregs);
> +
> +			/* increment ptrs for that header: */
> +			*p_ggtt += size - sizeof(struct guc_debug_capture_list);
> +			*p_virt_ptr += size - sizeof(struct guc_debug_capture_list);
> +		}
> +	}
> +
> +	return size;
> +}
> +
> +static int guc_capture_prep_lists(struct intel_guc *guc, struct 
> +__guc_ads_blob *blob) {
> +	struct intel_gt *gt = guc_to_gt(guc);
> +	int i, j, size;
> +	u32 ggtt, null_ggtt, offset, alloc_size = 0;
> +	struct guc_gt_system_info *info, local_info;
> +	struct guc_debug_capture_list *listnode;
> +	struct drm_i915_private *i915 = guc_to_gt(guc)->i915;
> +	struct intel_guc_state_capture *gc = &guc->capture;
> +	u16 tmp = 0;
> +	u8 *ptr = NULL;
> +
> +	if (blob) {
> +		offset = guc_ads_capture_offset(guc);
> +		ggtt = intel_guc_ggtt_offset(guc, guc->ads_vma) + offset;
> +		ptr = ((u8 *)blob) + offset;
> +		info = &blob->system_info;
> +	} else {
> +		memset(&local_info, 0, sizeof(local_info));
> +		info = &local_info;
> +		fill_engine_enable_masks(gt, info);
> +	}
> +
> +	/* first, set aside the first page for a capture_list with zero descriptors */
> +	alloc_size = PAGE_SIZE;
> +	if (blob) {
> +		listnode = (struct guc_debug_capture_list *)ptr;
> +		listnode->header.info = FIELD_PREP(GUC_CAPTURELISTHDR_NUMDESCR, 0);
> +		null_ggtt = ggtt;
> +		ggtt += PAGE_SIZE;
> +		ptr +=  PAGE_SIZE;
> +	}
>  
> -	/* FIXME: Populate a proper capture list */
> +#define COUNT_REGS intel_guc_capture_list_count #define FILL_REGS 
> +guc_fill_reglist #define TYPE_CLASS 
> +GUC_CAPTURE_LIST_TYPE_ENGINE_CLASS
> +#define TYPE_INSTANCE GUC_CAPTURE_LIST_TYPE_ENGINE_INSTANCE
>  
>  	for (i = 0; i < GUC_CAPTURE_LIST_INDEX_MAX; i++) {
>  		for (j = 0; j < GUC_MAX_ENGINE_CLASSES; j++) {
> -			blob->ads.capture_instance[i][j] = addr_ggtt;
> -			blob->ads.capture_class[i][j] = addr_ggtt;
> +			if (!info->engine_enabled_masks[j]) {
> +				if (gc->num_class_regs[i][j])
> +					drm_warn(&i915->drm, "GuC-Cap VF%d-class-%d "
> +						 "class regs valid mismatch was=%d now=%d!\n",
> +						 i, j, gc->num_class_regs[i][j], tmp);
> +				if (gc->num_instance_regs[i][j])
> +					drm_warn(&i915->drm, "GuC-Cap VF%d-class-%d "
> +						 "inst regs valid mismatch was=%d now=%d!\n",
> +						 i, j, gc->num_instance_regs[i][j], tmp);
> +				gc->num_class_regs[i][j] = 0;
> +				gc->num_instance_regs[i][j] = 0;
> +				if (blob) {
> +					blob->ads.capture_class[i][j] = null_ggtt;
> +					blob->ads.capture_instance[i][j] = null_ggtt;
> +				}
> +			} else {
> +				if (!COUNT_REGS(guc, i, TYPE_CLASS,
> +						guc_class_to_engine_class(j), &tmp)) {
> +					size = FILL_REGS(guc, blob, i, true, j, TYPE_CLASS,
> +							 "class", &gc->num_class_regs[i][j],
> +							 tmp, &ptr,
> +							 &blob->ads.capture_class[i][j],
> +							 &ggtt, null_ggtt);
> +					gc->class_list_size += size;
> +					alloc_size += size;
> +				} else {
> +					gc->num_class_regs[i][j] = 0;
> +					if (blob)
> +						blob->ads.capture_class[i][j] = null_ggtt;
> +				}
> +				if (!COUNT_REGS(guc, i, TYPE_INSTANCE,
> +						guc_class_to_engine_class(j), &tmp)) {
> +					size = FILL_REGS(guc, blob, i, true, j, TYPE_INSTANCE,
> +							 "instance", &gc->num_instance_regs[i][j],
> +							 tmp, &ptr,
> +							 &blob->ads.capture_instance[i][j],
> +							 &ggtt, null_ggtt);
> +					gc->instance_list_size += size;
> +					alloc_size += size;
> +				} else {
> +					gc->num_instance_regs[i][j] = 0;
> +					if (blob)
> +						blob->ads.capture_instance[i][j] = null_ggtt;
> +				}
> +			}
> +		}
> +		if (!COUNT_REGS(guc, i, GUC_CAPTURE_LIST_TYPE_GLOBAL, 0, &tmp)) {
> +			size = FILL_REGS(guc, blob, i, true, 0, GUC_CAPTURE_LIST_TYPE_GLOBAL,
> +					 "global", &gc->num_global_regs[i], tmp, &ptr,
> +					 &blob->ads.capture_global[i], &ggtt, null_ggtt);
> +			gc->global_list_size += size;
> +			alloc_size += size;
> +		} else {
> +			gc->num_global_regs[i] = 0;
> +			if (blob)
> +				blob->ads.capture_global[i] = null_ggtt;
>  		}
> -
> -		blob->ads.capture_global[i] = addr_ggtt;
>  	}
> +
> +#undef COUNT_REGS
> +#undef FILL_REGS
> +#undef TYPE_CLASS
> +#undef TYPE_INSTANCE
> +
> +	if (guc->ads_capture_size && guc->ads_capture_size != PAGE_ALIGN(alloc_size))
> +		drm_warn(&i915->drm, "GuC->ADS->Capture alloc size changed from %d to %d\n",
> +			 guc->ads_capture_size, PAGE_ALIGN(alloc_size));
> +
> +	return PAGE_ALIGN(alloc_size);
>  }
>  
>  static void __guc_ads_init(struct intel_guc *guc) @@ -614,6 +760,12 
> @@ int intel_guc_ads_create(struct intel_guc *guc)
>  		return ret;
>  	guc->ads_golden_ctxt_size = ret;
>  
> +	/* Likewise the capture lists: */
> +	ret = guc_capture_prep_lists(guc, NULL);
> +	if (ret < 0)
> +		return ret;
> +	guc->ads_capture_size = ret;
> +
>  	/* Now the total size can be determined: */
>  	size = guc_ads_blob_size(guc);
>  
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c 
> b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
> new file mode 100644
> index 000000000000..c741c77b7fc8
> --- /dev/null
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
> @@ -0,0 +1,232 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2021-2021 Intel Corporation  */
> +
> +#include <drm/drm_print.h>
> +
> +#include "i915_drv.h"
> +#include "i915_drv.h"

duplicated include

> +#include "i915_memcpy.h"
> +#include "gt/intel_gt.h"
> +
> +#include "intel_guc_fwif.h"
> +#include "intel_guc_capture.h"
> +
> +/* Define all device tables of GuC error capture register lists */
> +
> +/********************************* Gen12 LP  
> +*********************************/

didn't we move away from "GEN" naming ?

> +/************** GLOBAL *************/

do we really need all these decorations ?

> +struct __guc_mmio_reg_descr gen12lp_global_regs[] = {
> +	{SWF_ILK(0),               0,      0, "SWF_ILK0"},
> +	/* Add additional register list */

do we need this reminder ?

> +};
> +
> +/********** RENDER/COMPUTE *********/
> +/* Per-Class */
> +struct __guc_mmio_reg_descr gen12lp_rc_class_regs[] = {
> +	{SWF_ILK(0),               0,      0, "SWF_ILK0"},
> +	/* Add additional register list */
> +};
> +
> +/* Per-Engine-Instance */
> +struct __guc_mmio_reg_descr gen12lp_rc_inst_regs[] = {
> +	{SWF_ILK(0),               0,      0, "SWF_ILK0"},
> +	/* Add additional register list */
> +};
> +
> +/************* MEDIA-VD ************/
> +/* Per-Class */
> +struct __guc_mmio_reg_descr gen12lp_vd_class_regs[] = {
> +	{SWF_ILK(0),               0,      0, "SWF_ILK0"},
> +	/* Add additional register list */
> +};
> +
> +/* Per-Engine-Instance */
> +struct __guc_mmio_reg_descr gen12lp_vd_inst_regs[] = {
> +	{SWF_ILK(0),               0,      0, "SWF_ILK0"},
> +	/* Add additional register list */
> +};
> +
> +/************* MEDIA-VEC ***********/
> +/* Per-Class */
> +struct __guc_mmio_reg_descr gen12lp_vec_class_regs[] = {
> +	{SWF_ILK(0),               0,      0, "SWF_ILK0"},
> +	/* Add additional register list */
> +};
> +
> +/* Per-Engine-Instance */
> +struct __guc_mmio_reg_descr gen12lp_vec_inst_regs[] = {
> +	{SWF_ILK(0),               0,      0, "SWF_ILK0"},
> +	/* Add additional register list */
> +};
> +
> +/********** List of lists **********/ struct 
> +__guc_mmio_reg_descr_group gen12lp_lists[] = {
> +	{
> +		.list = gen12lp_global_regs,
> +		.num_regs = (sizeof(gen12lp_global_regs) / sizeof(struct 
> +__guc_mmio_reg_descr)),

ARRAY_SIZE ?

> +		.owner = GUC_CAPTURE_LIST_INDEX_PF,
> +		.type = GUC_CAPTURE_LIST_TYPE_GLOBAL,
> +		.engine = 0
> +	},
> +	{
> +		.list = gen12lp_rc_class_regs,
> +		.num_regs = (sizeof(gen12lp_rc_class_regs) / sizeof(struct __guc_mmio_reg_descr)),
> +		.owner = GUC_CAPTURE_LIST_INDEX_PF,
> +		.type = GUC_CAPTURE_LIST_TYPE_ENGINE_CLASS,
> +		.engine = RENDER_CLASS
> +	},
> +	{
> +		.list = gen12lp_rc_inst_regs,
> +		.num_regs = (sizeof(gen12lp_rc_inst_regs) / sizeof(struct __guc_mmio_reg_descr)),
> +		.owner = GUC_CAPTURE_LIST_INDEX_PF,
> +		.type = GUC_CAPTURE_LIST_TYPE_ENGINE_INSTANCE,
> +		.engine = RENDER_CLASS
> +	},
> +	{
> +		.list = gen12lp_vd_class_regs,
> +		.num_regs = (sizeof(gen12lp_vd_class_regs) / sizeof(struct __guc_mmio_reg_descr)),
> +		.owner = GUC_CAPTURE_LIST_INDEX_PF,
> +		.type = GUC_CAPTURE_LIST_TYPE_ENGINE_CLASS,
> +		.engine = VIDEO_DECODE_CLASS
> +	},
> +	{
> +		.list = gen12lp_vd_inst_regs,
> +		.num_regs = (sizeof(gen12lp_vd_inst_regs) / sizeof(struct __guc_mmio_reg_descr)),
> +		.owner = GUC_CAPTURE_LIST_INDEX_PF,
> +		.type = GUC_CAPTURE_LIST_TYPE_ENGINE_INSTANCE,
> +		.engine = VIDEO_DECODE_CLASS
> +	},
> +	{
> +		.list = gen12lp_vec_class_regs,
> +		.num_regs = (sizeof(gen12lp_vec_class_regs) / sizeof(struct __guc_mmio_reg_descr)),
> +		.owner = GUC_CAPTURE_LIST_INDEX_PF,
> +		.type = GUC_CAPTURE_LIST_TYPE_ENGINE_CLASS,
> +		.engine = VIDEO_ENHANCEMENT_CLASS
> +	},
> +	{
> +		.list = gen12lp_vec_inst_regs,
> +		.num_regs = (sizeof(gen12lp_vec_inst_regs) / sizeof(struct __guc_mmio_reg_descr)),
> +		.owner = GUC_CAPTURE_LIST_INDEX_PF,
> +		.type = GUC_CAPTURE_LIST_TYPE_ENGINE_INSTANCE,
> +		.engine = VIDEO_ENHANCEMENT_CLASS
> +	},
> +	{NULL, 0, 0, 0, 0}
> +};
> +
> +/************ FIXME: Populate tables for other devices in subsequent 
> +patch ************/
> +
> +static struct __guc_mmio_reg_descr_group * 
> +guc_capture_get_device_reglist(struct drm_i915_private *dev_priv)

in new code we are using "i915" instead of "dev_priv" and since this function has "guc" prefix it shall rather take "guc" as param:

guc_capture_get_device_reglist(struct intel_guc *guc) {
	struct drm_i915_private *i915 = guc_to_gt(guc)->i915;
	...


> +{
> +	if (IS_TIGERLAKE(dev_priv) || IS_ROCKETLAKE(dev_priv) ||
> +	    IS_ALDERLAKE_S(dev_priv) || IS_ALDERLAKE_P(dev_priv)) {
> +		return gen12lp_lists;
> +	}
> +
> +	return NULL;
> +}
> +
> +static inline struct __guc_mmio_reg_descr_group * 
> +guc_capture_get_one_list(struct __guc_mmio_reg_descr_group *reglists, 
> +u32 owner, u32 type, u32 id) {
> +	int i = 0;
> +
> +	if (!reglists)
> +		return NULL;
> +	while (reglists[i].list) {
> +		if (reglists[i].owner == owner &&
> +		    reglists[i].type == type) {
> +			if (reglists[i].type == GUC_CAPTURE_LIST_TYPE_GLOBAL ||
> +			    reglists[i].engine == id) {
> +				return &reglists[i];
> +			}
> +		}
> +		++i;
> +	}
> +	return NULL;
> +}
> +
> +static inline void
> +warn_with_capture_list_identifier(struct drm_i915_private *i915, char *msg,
> +				  u32 owner, u32 type, u32 classid) {
> +	const char *ownerstr[GUC_CAPTURE_LIST_INDEX_MAX] = {"PF", "VF"};
> +	const char *typestr[GUC_CAPTURE_LIST_TYPE_MAX - 1] = {"Class", "Instance"};
> +	const char *classstr[GUC_LAST_ENGINE_CLASS + 1] = {"Render", "Video", "VideoEnhance",
> +							   "Blitter", "Reserved"};

better to wrap that into simple small helpers like

	const char *stringify_guc_capture_owner(u32 owner) { .. }
	const char *stringify_guc_capture_type(u32 type) { .. }
	const char *stringify_guc_capture_class(u32 class) { .. }

> +	static const char unknownstr[] = "unknown";
> +
> +	if (type == GUC_CAPTURE_LIST_TYPE_GLOBAL)
> +		drm_warn(&i915->drm, "GuC-capture: %s for %s Global-Registers.\n", msg,
> +			 (owner < GUC_CAPTURE_LIST_INDEX_MAX) ? ownerstr[owner] : unknownstr);
> +	else
> +		drm_warn(&i915->drm, "GuC-capture: %s for %s %s-Registers on %s-Engine\n", msg,
> +			 (owner < GUC_CAPTURE_LIST_INDEX_MAX) ? ownerstr[owner] : unknownstr,
> +			 (type < GUC_CAPTURE_LIST_TYPE_MAX) ? typestr[type - 1] :  unknownstr,
> +			 (classid < GUC_LAST_ENGINE_CLASS + 1) ? classstr[classid] : 
> +unknownstr); }
> +
> +int intel_guc_capture_list_count(struct intel_guc *guc, u32 owner, u32 type, u32 classid,
> +				 u16 *num_entries)
> +{
> +	struct drm_i915_private *dev_priv = (guc_to_gt(guc))->i915;

s/dev_priv/i915
redundant ()

> +	struct __guc_mmio_reg_descr_group *reglists = guc->capture.reglists;
> +	struct __guc_mmio_reg_descr_group *match;
> +
> +	if (!reglists)
> +		return -ENODEV;
> +
> +	match = guc_capture_get_one_list(reglists, owner, type, classid);
> +	if (match) {
> +		*num_entries = match->num_regs;
> +		return 0;

IIRC early returns are preferred for error cases, not success

> +	}
> +
> +	warn_with_capture_list_identifier(dev_priv, "Missing register list size", owner, type,
> +					  classid);
> +
> +	return -ENODATA;
> +}
> +
> +int intel_guc_capture_list_init(struct intel_guc *guc, u32 owner, u32 type, u32 classid,
> +				struct guc_mmio_reg *ptr, u16 num_entries) {
> +	u32 j = 0;
> +	struct drm_i915_private *dev_priv = (guc_to_gt(guc))->i915;

s/dev_priv/i915
redundant ()

> +	struct __guc_mmio_reg_descr_group *reglists = guc->capture.reglists;
> +	struct __guc_mmio_reg_descr_group *match;
> +
> +	if (!reglists)
> +		return -ENODEV;
> +
> +	match = guc_capture_get_one_list(reglists, owner, type, classid);
> +	if (match) {
> +		while (j < num_entries && j < match->num_regs) {
> +			ptr[j].offset = match->list[j].reg.reg;
> +			ptr[j].value = 0xDEADF00D;
> +			ptr[j].flags = match->list[j].flags;
> +			ptr[j].mask = match->list[j].mask;
> +			++j;
> +		}
> +		return 0;
> +	}
> +
> +	warn_with_capture_list_identifier(dev_priv, "Missing register list init", owner, type,
> +					  classid);
> +
> +	return -ENODATA;
> +}
> +
> +void intel_guc_capture_destroy(struct intel_guc *guc) { }
> +
> +int intel_guc_capture_init(struct intel_guc *guc) {
> +	struct drm_i915_private *dev_priv = (guc_to_gt(guc))->i915;
> +
> +	guc->capture.reglists = guc_capture_get_device_reglist(dev_priv);
> +	return 0;
> +}
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h 
> b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h
> new file mode 100644
> index 000000000000..352940b8bc87
> --- /dev/null
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h
> @@ -0,0 +1,47 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2021-2021 Intel Corporation  */
> +
> +#ifndef _INTEL_GUC_CAPTURE_H
> +#define _INTEL_GUC_CAPTURE_H
> +
> +#include <linux/mutex.h>
> +#include <linux/workqueue.h>
> +#include "intel_guc_fwif.h"
> +
> +struct intel_guc;
> +
> +struct __guc_mmio_reg_descr {
> +	i915_reg_t reg;
> +	u32 flags;
> +	u32 mask;
> +	char *regname;

const char* ?

but maybe instead of adding reg name to the GuC specific struct we should add generic purpose function that will return pretty name of the
register:

i915_reg.c:

const char *i915_reg_to_string(i915_reg_r reg) {
	...
}

> +};
> +
> +struct __guc_mmio_reg_descr_group {
> +	struct __guc_mmio_reg_descr *list;
> +	u32 num_regs;
> +	u32 owner; /* see enum guc_capture_owner */
> +	u32 type; /* see enum guc_capture_type */
> +	u32 engine; /* as per MAX_ENGINE_CLASS */ };
> +
> +struct intel_guc_state_capture {
> +	struct __guc_mmio_reg_descr_group *reglists;
> +	u16 num_instance_regs[GUC_CAPTURE_LIST_INDEX_MAX][GUC_MAX_ENGINE_CLASSES];
> +	u16 num_class_regs[GUC_CAPTURE_LIST_INDEX_MAX][GUC_MAX_ENGINE_CLASSES];
> +	u16 num_global_regs[GUC_CAPTURE_LIST_INDEX_MAX];
> +	int instance_list_size;
> +	int class_list_size;
> +	int global_list_size;
> +};
> +
> +int intel_guc_capture_list_count(struct intel_guc *guc, u32 owner, u32 type, u32 class,
> +				 u16 *num_entries);
> +int intel_guc_capture_list_init(struct intel_guc *guc, u32 owner, u32 type, u32 class,
> +				struct guc_mmio_reg *ptr, u16 num_entries); void 
> +intel_guc_capture_destroy(struct intel_guc *guc); int 
> +intel_guc_capture_init(struct intel_guc *guc);
> +
> +#endif /* _INTEL_GUC_CAPTURE_H */
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h 
> b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> index 767684b6af67..1a1d2271c7e9 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> @@ -285,13 +285,30 @@ struct guc_gt_system_info {  } __packed;
>  
>  /* Capture-types of GuC capture register lists */ -enum
> +enum guc_capture_owner
>  {
>  	GUC_CAPTURE_LIST_INDEX_PF = 0,
>  	GUC_CAPTURE_LIST_INDEX_VF = 1,
>  	GUC_CAPTURE_LIST_INDEX_MAX = 2,

s/INDEX/OWNER ?

>  };
>  
> +/*Register-types of GuC capture register lists */ enum 
> +guc_capture_type {
> +	GUC_CAPTURE_LIST_TYPE_GLOBAL = 0,
> +	GUC_CAPTURE_LIST_TYPE_ENGINE_CLASS,
> +	GUC_CAPTURE_LIST_TYPE_ENGINE_INSTANCE,
> +	GUC_CAPTURE_LIST_TYPE_MAX,
> +};
> +
> +struct guc_debug_capture_list_header {
> +	u32 info;
> +		#define GUC_CAPTURELISTHDR_NUMDESCR GENMASK(15, 0) };
> +
> +struct guc_debug_capture_list {
> +	struct guc_debug_capture_list_header header; };
> +
>  /* GuC Additional Data Struct */
>  struct guc_ads {
>  	struct guc_mmio_reg_set 
> reg_state_list[GUC_MAX_ENGINE_CLASSES][GUC_MAX_INSTANCES_PER_CLASS];
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Intel-gfx] [RFC 7/7] drm/i915/guc: Print the GuC error capture output register list.
  2021-12-08  6:31     ` Teres Alexis, Alan Previn
@ 2021-12-23 18:54       ` Teres Alexis, Alan Previn
  2021-12-24 12:09         ` Tvrtko Ursulin
  0 siblings, 1 reply; 52+ messages in thread
From: Teres Alexis, Alan Previn @ 2021-12-23 18:54 UTC (permalink / raw)
  To: Brost, Matthew; +Cc: intel-gfx

Revisiting below hunk of patch-7 comment, as per offline discussion with Matt,
there is little benefit to even making that guc-id lookup because:

1. the delay between the context reset notification (when the vmas are copied
   and when we verify we had received a guc err capture dump) may be subjectively
   large enough and not tethered that the guc-id may have already been re-assigned.

2. I was really looking for some kind of unique context handle to print out that could
   be correlated (by user inspecting the dump) back to a unique app or process or
   context-id but cant find such a param in struct intel_context.

As part of further reviewing the end to end flows and possible error scenarios, there
also may potentially be a mismatch between "which context was reset by guc at time-n"
vs "which context's vma buffers is being printed out at time-n+x" if
we are experiencing back-to-back resets and the user dumped the debugfs x-time later.

(Recap: First, guc notifies capture event, second, guc notifies context reset during
which we trigger i915_gpu_coredump. In this second step, the vma's are dumped and we
verify that the guc capture happened but don't parse the guc-err-capture-logs yet.
Third step is when user triggers the debugfs to dump which is when we parse the error
capture logs.)

As a fix, what we can do in the guc_error_capture report out is to ensure that
we dont re-print the previously dumped vmas if we end up finding multiple
guc-error-capture dumps since the i915_gpu_coredump would have only captured the vma's
for the very first context that was reset. And with guc-submission, that would always
correlate to the "next-yet-to-be-parsed" guc-err-capture dump (since the guc-error-capture
logs are large enough to hold data for multiple dumps).

The changes (removal of below-hunk and adding of only-print-the-first-vma") is trivial
but i felt it warranted a good explanation. Apologies for the inbox noise.

...alan

On Tue, 2021-12-07 at 22:32 -0800, Alan Previn Teres Alexis wrote:
> Thanks again for the detailed review here.
> Will fix all the rest on next rev.
> One special response for this one:
> 
> 
> On Tue, 2021-12-07 at 16:22 -0800, Matthew Brost wrote:
> > On Mon, Nov 22, 2021 at 03:04:02PM -0800, Alan Previn wrote:
> > > +			if (datatype == GUC_CAPTURE_LIST_TYPE_ENGINE_INSTANCE) {
> > > +				GCAP_PRINT_GUC_INST_INFO(i915, ebuf, data);
> > > +				eng_inst = FIELD_GET(GUC_CAPTURE_DATAHDR_SRC_INSTANCE, data.info);
> > > +				eng = guc_lookup_engine(guc, engineclass, eng_inst);
> > > +				if (eng) {
> > > +					GCAP_PRINT_INTEL_ENG_INFO(i915, ebuf, eng);
> > > +				} else {
> > > +					PRINT(&i915->drm, ebuf, "    i915-Eng-Lookup Fail!\n");
> > > +				}
> > > +				ce = guc_context_lookup(guc, data.guc_ctx_id);
> > 
> > You are going to need to reference count the 'ce' here. See
> > intel_guc_context_reset_process_msg for an example. 
> > 
> 
> Oh crap - i missed this one - which you had explicitly mentioned offline when i was doing the
> development. Sorry about that i just totally missed it from my todo-notes.
> 
> ...alan


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Intel-gfx] [RFC 7/7] drm/i915/guc: Print the GuC error capture output register list.
  2021-12-23 18:54       ` Teres Alexis, Alan Previn
@ 2021-12-24 12:09         ` Tvrtko Ursulin
  2021-12-24 13:34           ` Teres Alexis, Alan Previn
  0 siblings, 1 reply; 52+ messages in thread
From: Tvrtko Ursulin @ 2021-12-24 12:09 UTC (permalink / raw)
  To: Teres Alexis, Alan Previn, Brost, Matthew; +Cc: intel-gfx


Hi,

Somehow I stumbled on this while browsing through the mailing list.

On 23/12/2021 18:54, Teres Alexis, Alan Previn wrote:
> Revisiting below hunk of patch-7 comment, as per offline discussion with Matt,
> there is little benefit to even making that guc-id lookup because:
> 
> 1. the delay between the context reset notification (when the vmas are copied
>     and when we verify we had received a guc err capture dump) may be subjectively
>     large enough and not tethered that the guc-id may have already been re-assigned.
> 
> 2. I was really looking for some kind of unique context handle to print out that could
>     be correlated (by user inspecting the dump) back to a unique app or process or
>     context-id but cant find such a param in struct intel_context.
> 
> As part of further reviewing the end to end flows and possible error scenarios, there
> also may potentially be a mismatch between "which context was reset by guc at time-n"
> vs "which context's vma buffers is being printed out at time-n+x" if
> we are experiencing back-to-back resets and the user dumped the debugfs x-time later.

What does this all actually mean, because it sounds rather alarming, 
that it just won't be possible to know which context, belonging to which 
process, was reset? And because of guc_id potentially re-assigned even 
the captured VMAs may not be the correct ones?

Regards,

Tvrtko

> 
> (Recap: First, guc notifies capture event, second, guc notifies context reset during
> which we trigger i915_gpu_coredump. In this second step, the vma's are dumped and we
> verify that the guc capture happened but don't parse the guc-err-capture-logs yet.
> Third step is when user triggers the debugfs to dump which is when we parse the error
> capture logs.)
> 
> As a fix, what we can do in the guc_error_capture report out is to ensure that
> we dont re-print the previously dumped vmas if we end up finding multiple
> guc-error-capture dumps since the i915_gpu_coredump would have only captured the vma's
> for the very first context that was reset. And with guc-submission, that would always
> correlate to the "next-yet-to-be-parsed" guc-err-capture dump (since the guc-error-capture
> logs are large enough to hold data for multiple dumps).
> 
> The changes (removal of below-hunk and adding of only-print-the-first-vma") is trivial
> but i felt it warranted a good explanation. Apologies for the inbox noise.
> 
> ...alan
> 
> On Tue, 2021-12-07 at 22:32 -0800, Alan Previn Teres Alexis wrote:
>> Thanks again for the detailed review here.
>> Will fix all the rest on next rev.
>> One special response for this one:
>>
>>
>> On Tue, 2021-12-07 at 16:22 -0800, Matthew Brost wrote:
>>> On Mon, Nov 22, 2021 at 03:04:02PM -0800, Alan Previn wrote:
>>>> +			if (datatype == GUC_CAPTURE_LIST_TYPE_ENGINE_INSTANCE) {
>>>> +				GCAP_PRINT_GUC_INST_INFO(i915, ebuf, data);
>>>> +				eng_inst = FIELD_GET(GUC_CAPTURE_DATAHDR_SRC_INSTANCE, data.info);
>>>> +				eng = guc_lookup_engine(guc, engineclass, eng_inst);
>>>> +				if (eng) {
>>>> +					GCAP_PRINT_INTEL_ENG_INFO(i915, ebuf, eng);
>>>> +				} else {
>>>> +					PRINT(&i915->drm, ebuf, "    i915-Eng-Lookup Fail!\n");
>>>> +				}
>>>> +				ce = guc_context_lookup(guc, data.guc_ctx_id);
>>>
>>> You are going to need to reference count the 'ce' here. See
>>> intel_guc_context_reset_process_msg for an example.
>>>
>>
>> Oh crap - i missed this one - which you had explicitly mentioned offline when i was doing the
>> development. Sorry about that i just totally missed it from my todo-notes.
>>
>> ...alan
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Intel-gfx] [RFC 7/7] drm/i915/guc: Print the GuC error capture output register list.
  2021-12-24 12:09         ` Tvrtko Ursulin
@ 2021-12-24 13:34           ` Teres Alexis, Alan Previn
  2022-01-04 13:56             ` Tvrtko Ursulin
  0 siblings, 1 reply; 52+ messages in thread
From: Teres Alexis, Alan Previn @ 2021-12-24 13:34 UTC (permalink / raw)
  To: Brost, Matthew, tvrtko.ursulin; +Cc: intel-gfx


On Fri, 2021-12-24 at 12:09 +0000, Tvrtko Ursulin wrote:
> Hi,
> 
> Somehow I stumbled on this while browsing through the mailing list.
> 
> On 23/12/2021 18:54, Teres Alexis, Alan Previn wrote:
> > Revisiting below hunk of patch-7 comment, as per offline discussion with Matt,
> > there is little benefit to even making that guc-id lookup because:
> > 
> > 1. the delay between the context reset notification (when the vmas are copied
> >     and when we verify we had received a guc err capture dump) may be subjectively
> >     large enough and not tethered that the guc-id may have already been re-assigned.
> > 
> > 2. I was really looking for some kind of unique context handle to print out that could
> >     be correlated (by user inspecting the dump) back to a unique app or process or
> >     context-id but cant find such a param in struct intel_context.
> > 
> > As part of further reviewing the end to end flows and possible error scenarios, there
> > also may potentially be a mismatch between "which context was reset by guc at time-n"
> > vs "which context's vma buffers is being printed out at time-n+x" if
> > we are experiencing back-to-back resets and the user dumped the debugfs x-time later.
> 
> What does this all actually mean, because it sounds rather alarming, 
> that it just won't be possible to know which context, belonging to which 
> process, was reset? And because of guc_id potentially re-assigned even 
> the captured VMAs may not be the correct ones?
> 
> 

The flow of events are as below:

1. guc sends notification that an error capture was done and ready to take.
	- at this point we copy the guc error captured dump into an interim store
	  (larger buffer that can hold multiple captures).
2. guc sends notification that a context was reset (after the prior)
	- this triggers a call to i915_gpu_coredump with the corresponding engine-mask
          from the context that was reset
	- i915_gpu_coredump proceeds to gather entire gpu state including driver state,
          global gpu state, engine state, context vmas and also engine registers. For the
          engine registers now call into the guc_capture code which merely needs to verify
	  that GuC had already done a step 1 and we have data ready to be parsed.

(time elapses)

3. end user triggers the sysfs to dump the error state and all prior information is 
   printed out in proper format.


Between 2 and 3:
   - Looking at existing framework (established by execlist-capture codes),
        I believe we only hold on to the first error state capture and drop any
        subsequent context reset captures occurring before #3 (i.e. before the end user
        triggers the debugfs)
   - However, in that same space, guc can send us more and more error-capture logs
         long as we have space for it in the buffer.

So the issue was that in my original patch, for every next capture-snaphot we find in
guc-error-capture output buffer, i would find the matching engine and print out all
the VMA data (that was successfully captured in #2). However, i should only do that
for the first dump only since that would correlate exactly with the existing execlist
code behavior. So this fix is actually pretty straight forward to get the right matching
VMA.

WRT to my statement about "getting the context-to->process" lookup, i was initially hoping that
I could "on my own" (within the guc-err-capture module) get that information, but it would be
a stretch (in terms of inter-component information access). More importantly, its totally
unnecessary since existing execlist code already did that in Step 2. That code remains intact
with guc-error-capture.

One open i plan to test before final rev is with shared engines like CCS and RCS where i want to
trigger cascading hangs + resets in quick succession just to see how the overall flow behavior works.

I will attach an output guc error capture based gpu error dump as per the review comment from Matthew
on last rev.

..alan
> 


> Regards,
> 
> Tvrtko
> 
> > (Recap: First, guc notifies capture event, second, guc notifies context reset during
> > which we trigger i915_gpu_coredump. In this second step, the vma's are dumped and we
> > verify that the guc capture happened but don't parse the guc-err-capture-logs yet.
> > Third step is when user triggers the debugfs to dump which is when we parse the error
> > capture logs.)
> > 
> > As a fix, what we can do in the guc_error_capture report out is to ensure that
> > we dont re-print the previously dumped vmas if we end up finding multiple
> > guc-error-capture dumps since the i915_gpu_coredump would have only captured the vma's
> > for the very first context that was reset. And with guc-submission, that would always
> > correlate to the "next-yet-to-be-parsed" guc-err-capture dump (since the guc-error-capture
> > logs are large enough to hold data for multiple dumps).
> > 
> > The changes (removal of below-hunk and adding of only-print-the-first-vma") is trivial
> > but i felt it warranted a good explanation. Apologies for the inbox noise.
> > 
> > ...alan
> > 
> > On Tue, 2021-12-07 at 22:32 -0800, Alan Previn Teres Alexis wrote:
> > > Thanks again for the detailed review here.
> > > Will fix all the rest on next rev.
> > > One special response for this one:
> > > 
> > > 
> > > On Tue, 2021-12-07 at 16:22 -0800, Matthew Brost wrote:
> > > > On Mon, Nov 22, 2021 at 03:04:02PM -0800, Alan Previn wrote:
> > > > > +			if (datatype == GUC_CAPTURE_LIST_TYPE_ENGINE_INSTANCE) {
> > > > > +				GCAP_PRINT_GUC_INST_INFO(i915, ebuf, data);
> > > > > +				eng_inst = FIELD_GET(GUC_CAPTURE_DATAHDR_SRC_INSTANCE, data.info);
> > > > > +				eng = guc_lookup_engine(guc, engineclass, eng_inst);
> > > > > +				if (eng) {
> > > > > +					GCAP_PRINT_INTEL_ENG_INFO(i915, ebuf, eng);
> > > > > +				} else {
> > > > > +					PRINT(&i915->drm, ebuf, "    i915-Eng-Lookup Fail!\n");
> > > > > +				}
> > > > > +				ce = guc_context_lookup(guc, data.guc_ctx_id);
> > > > 
> > > > You are going to need to reference count the 'ce' here. See
> > > > intel_guc_context_reset_process_msg for an example.
> > > > 
> > > 
> > > Oh crap - i missed this one - which you had explicitly mentioned offline when i was doing the
> > > development. Sorry about that i just totally missed it from my todo-notes.
> > > 
> > > ...alan


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Intel-gfx] [RFC 7/7] drm/i915/guc: Print the GuC error capture output register list.
  2021-12-24 13:34           ` Teres Alexis, Alan Previn
@ 2022-01-04 13:56             ` Tvrtko Ursulin
  2022-01-05 17:30               ` Teres Alexis, Alan Previn
  0 siblings, 1 reply; 52+ messages in thread
From: Tvrtko Ursulin @ 2022-01-04 13:56 UTC (permalink / raw)
  To: Teres Alexis, Alan Previn, Brost, Matthew; +Cc: intel-gfx


On 24/12/2021 13:34, Teres Alexis, Alan Previn wrote:
> 
> On Fri, 2021-12-24 at 12:09 +0000, Tvrtko Ursulin wrote:
>> Hi,
>>
>> Somehow I stumbled on this while browsing through the mailing list.
>>
>> On 23/12/2021 18:54, Teres Alexis, Alan Previn wrote:
>>> Revisiting below hunk of patch-7 comment, as per offline discussion with Matt,
>>> there is little benefit to even making that guc-id lookup because:
>>>
>>> 1. the delay between the context reset notification (when the vmas are copied
>>>      and when we verify we had received a guc err capture dump) may be subjectively
>>>      large enough and not tethered that the guc-id may have already been re-assigned.
>>>
>>> 2. I was really looking for some kind of unique context handle to print out that could
>>>      be correlated (by user inspecting the dump) back to a unique app or process or
>>>      context-id but cant find such a param in struct intel_context.
>>>
>>> As part of further reviewing the end to end flows and possible error scenarios, there
>>> also may potentially be a mismatch between "which context was reset by guc at time-n"
>>> vs "which context's vma buffers is being printed out at time-n+x" if
>>> we are experiencing back-to-back resets and the user dumped the debugfs x-time later.
>>
>> What does this all actually mean, because it sounds rather alarming,
>> that it just won't be possible to know which context, belonging to which
>> process, was reset? And because of guc_id potentially re-assigned even
>> the captured VMAs may not be the correct ones?
>>
>>
> 
> The flow of events are as below:
> 
> 1. guc sends notification that an error capture was done and ready to take.
> 	- at this point we copy the guc error captured dump into an interim store
> 	  (larger buffer that can hold multiple captures).
> 2. guc sends notification that a context was reset (after the prior)
> 	- this triggers a call to i915_gpu_coredump with the corresponding engine-mask
>            from the context that was reset
> 	- i915_gpu_coredump proceeds to gather entire gpu state including driver state,
>            global gpu state, engine state, context vmas and also engine registers. For the
>            engine registers now call into the guc_capture code which merely needs to verify
> 	  that GuC had already done a step 1 and we have data ready to be parsed.

What about the time between the actual reset and receiving the context 
reset notification? Latter will contain intel_context->guc_id - can that 
be re-assigned or "retired" in between the two and so cause problems for 
matching the correct (or any) vmas?

Regards,

Tvrtko

> 
> (time elapses)
> 
> 3. end user triggers the sysfs to dump the error state and all prior information is
>     printed out in proper format.
> 
> 
> Between 2 and 3:
>     - Looking at existing framework (established by execlist-capture codes),
>          I believe we only hold on to the first error state capture and drop any
>          subsequent context reset captures occurring before #3 (i.e. before the end user
>          triggers the debugfs)
>     - However, in that same space, guc can send us more and more error-capture logs
>           long as we have space for it in the buffer.
> 
> So the issue was that in my original patch, for every next capture-snaphot we find in
> guc-error-capture output buffer, i would find the matching engine and print out all
> the VMA data (that was successfully captured in #2). However, i should only do that
> for the first dump only since that would correlate exactly with the existing execlist
> code behavior. So this fix is actually pretty straight forward to get the right matching
> VMA.
> 
> WRT to my statement about "getting the context-to->process" lookup, i was initially hoping that
> I could "on my own" (within the guc-err-capture module) get that information, but it would be
> a stretch (in terms of inter-component information access). More importantly, its totally
> unnecessary since existing execlist code already did that in Step 2. That code remains intact
> with guc-error-capture.
> 
> One open i plan to test before final rev is with shared engines like CCS and RCS where i want to
> trigger cascading hangs + resets in quick succession just to see how the overall flow behavior works.
> 
> I will attach an output guc error capture based gpu error dump as per the review comment from Matthew
> on last rev.
> 
> ..alan
>>
> 
> 
>> Regards,
>>
>> Tvrtko
>>
>>> (Recap: First, guc notifies capture event, second, guc notifies context reset during
>>> which we trigger i915_gpu_coredump. In this second step, the vma's are dumped and we
>>> verify that the guc capture happened but don't parse the guc-err-capture-logs yet.
>>> Third step is when user triggers the debugfs to dump which is when we parse the error
>>> capture logs.)
>>>
>>> As a fix, what we can do in the guc_error_capture report out is to ensure that
>>> we dont re-print the previously dumped vmas if we end up finding multiple
>>> guc-error-capture dumps since the i915_gpu_coredump would have only captured the vma's
>>> for the very first context that was reset. And with guc-submission, that would always
>>> correlate to the "next-yet-to-be-parsed" guc-err-capture dump (since the guc-error-capture
>>> logs are large enough to hold data for multiple dumps).
>>>
>>> The changes (removal of below-hunk and adding of only-print-the-first-vma") is trivial
>>> but i felt it warranted a good explanation. Apologies for the inbox noise.
>>>
>>> ...alan
>>>
>>> On Tue, 2021-12-07 at 22:32 -0800, Alan Previn Teres Alexis wrote:
>>>> Thanks again for the detailed review here.
>>>> Will fix all the rest on next rev.
>>>> One special response for this one:
>>>>
>>>>
>>>> On Tue, 2021-12-07 at 16:22 -0800, Matthew Brost wrote:
>>>>> On Mon, Nov 22, 2021 at 03:04:02PM -0800, Alan Previn wrote:
>>>>>> +			if (datatype == GUC_CAPTURE_LIST_TYPE_ENGINE_INSTANCE) {
>>>>>> +				GCAP_PRINT_GUC_INST_INFO(i915, ebuf, data);
>>>>>> +				eng_inst = FIELD_GET(GUC_CAPTURE_DATAHDR_SRC_INSTANCE, data.info);
>>>>>> +				eng = guc_lookup_engine(guc, engineclass, eng_inst);
>>>>>> +				if (eng) {
>>>>>> +					GCAP_PRINT_INTEL_ENG_INFO(i915, ebuf, eng);
>>>>>> +				} else {
>>>>>> +					PRINT(&i915->drm, ebuf, "    i915-Eng-Lookup Fail!\n");
>>>>>> +				}
>>>>>> +				ce = guc_context_lookup(guc, data.guc_ctx_id);
>>>>>
>>>>> You are going to need to reference count the 'ce' here. See
>>>>> intel_guc_context_reset_process_msg for an example.
>>>>>
>>>>
>>>> Oh crap - i missed this one - which you had explicitly mentioned offline when i was doing the
>>>> development. Sorry about that i just totally missed it from my todo-notes.
>>>>
>>>> ...alan
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Intel-gfx] [RFC 7/7] drm/i915/guc: Print the GuC error capture output register list.
  2022-01-04 13:56             ` Tvrtko Ursulin
@ 2022-01-05 17:30               ` Teres Alexis, Alan Previn
  2022-01-06  9:38                 ` Tvrtko Ursulin
  0 siblings, 1 reply; 52+ messages in thread
From: Teres Alexis, Alan Previn @ 2022-01-05 17:30 UTC (permalink / raw)
  To: Brost, Matthew, tvrtko.ursulin; +Cc: intel-gfx


On Tue, 2022-01-04 at 13:56 +0000, Tvrtko Ursulin wrote:
> 
> > The flow of events are as below:
> > 
> > 1. guc sends notification that an error capture was done and ready to take.
> > 	- at this point we copy the guc error captured dump into an interim store
> > 	  (larger buffer that can hold multiple captures).
> > 2. guc sends notification that a context was reset (after the prior)
> > 	- this triggers a call to i915_gpu_coredump with the corresponding engine-mask
> >            from the context that was reset
> > 	- i915_gpu_coredump proceeds to gather entire gpu state including driver state,
> >            global gpu state, engine state, context vmas and also engine registers. For the
> >            engine registers now call into the guc_capture code which merely needs to verify
> > 	  that GuC had already done a step 1 and we have data ready to be parsed.
> 
> What about the time between the actual reset and receiving the context 
> reset notification? Latter will contain intel_context->guc_id - can that 
> be re-assigned or "retired" in between the two and so cause problems for 
> matching the correct (or any) vmas?
> 
Not it cannot because its only after the context reset notification that i915 starts
taking action against that cotnext - and even that happens after the i915_gpu_codedump (engine-mask-of-context) happens.
That's what i've observed in the code flow.

> Regards,
> 
> Tvrtko

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Intel-gfx] [RFC 7/7] drm/i915/guc: Print the GuC error capture output register list.
  2022-01-05 17:30               ` Teres Alexis, Alan Previn
@ 2022-01-06  9:38                 ` Tvrtko Ursulin
  2022-01-06 18:33                   ` Teres Alexis, Alan Previn
  0 siblings, 1 reply; 52+ messages in thread
From: Tvrtko Ursulin @ 2022-01-06  9:38 UTC (permalink / raw)
  To: Teres Alexis, Alan Previn, Brost, Matthew; +Cc: intel-gfx


On 05/01/2022 17:30, Teres Alexis, Alan Previn wrote:
> 
> On Tue, 2022-01-04 at 13:56 +0000, Tvrtko Ursulin wrote:
>>
>>> The flow of events are as below:
>>>
>>> 1. guc sends notification that an error capture was done and ready to take.
>>> 	- at this point we copy the guc error captured dump into an interim store
>>> 	  (larger buffer that can hold multiple captures).
>>> 2. guc sends notification that a context was reset (after the prior)
>>> 	- this triggers a call to i915_gpu_coredump with the corresponding engine-mask
>>>             from the context that was reset
>>> 	- i915_gpu_coredump proceeds to gather entire gpu state including driver state,
>>>             global gpu state, engine state, context vmas and also engine registers. For the
>>>             engine registers now call into the guc_capture code which merely needs to verify
>>> 	  that GuC had already done a step 1 and we have data ready to be parsed.
>>
>> What about the time between the actual reset and receiving the context
>> reset notification? Latter will contain intel_context->guc_id - can that
>> be re-assigned or "retired" in between the two and so cause problems for
>> matching the correct (or any) vmas?
>>
> Not it cannot because its only after the context reset notification that i915 starts
> taking action against that cotnext - and even that happens after the i915_gpu_codedump (engine-mask-of-context) happens.
> That's what i've observed in the code flow.

The fact it is "only after" is exactly why I asked.

Reset notification is in a CT queue with other stuff, right? So can be 
some unrelated time after the actual reset. Could have context be 
retired in the meantime and guc_id released is the question.

Because i915 has no idea there was a reset until this delayed message 
comes over, but it could see user interrupt signaling end of batch, 
after the reset has happened, unbeknown to i915, right?

Perhaps the answer is guc_id cannot be released via the request retire 
flows. Or GuC signaling release of guc_id is a thing, which is then 
ordered via the same CT buffer.

I don't know, just asking.

Regards,

Tvrtko

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Intel-gfx] [RFC 7/7] drm/i915/guc: Print the GuC error capture output register list.
  2022-01-06  9:38                 ` Tvrtko Ursulin
@ 2022-01-06 18:33                   ` Teres Alexis, Alan Previn
  2022-01-07  9:03                     ` Tvrtko Ursulin
  0 siblings, 1 reply; 52+ messages in thread
From: Teres Alexis, Alan Previn @ 2022-01-06 18:33 UTC (permalink / raw)
  To: Brost, Matthew, tvrtko.ursulin; +Cc: intel-gfx


On Thu, 2022-01-06 at 09:38 +0000, Tvrtko Ursulin wrote:
> On 05/01/2022 17:30, Teres Alexis, Alan Previn wrote:
> > On Tue, 2022-01-04 at 13:56 +0000, Tvrtko Ursulin wrote:
> > > > The flow of events are as below:
> > > > 
> > > > 1. guc sends notification that an error capture was done and ready to take.
> > > > 	- at this point we copy the guc error captured dump into an interim store
> > > > 	  (larger buffer that can hold multiple captures).
> > > > 2. guc sends notification that a context was reset (after the prior)
> > > > 	- this triggers a call to i915_gpu_coredump with the corresponding engine-mask
> > > >             from the context that was reset
> > > > 	- i915_gpu_coredump proceeds to gather entire gpu state including driver state,
> > > >             global gpu state, engine state, context vmas and also engine registers. For the
> > > >             engine registers now call into the guc_capture code which merely needs to verify
> > > > 	  that GuC had already done a step 1 and we have data ready to be parsed.
> > > 
> > > What about the time between the actual reset and receiving the context
> > > reset notification? Latter will contain intel_context->guc_id - can that
> > > be re-assigned or "retired" in between the two and so cause problems for
> > > matching the correct (or any) vmas?
> > > 
> > Not it cannot because its only after the context reset notification that i915 starts
> > taking action against that cotnext - and even that happens after the i915_gpu_codedump (engine-mask-of-context) happens.
> > That's what i've observed in the code flow.
> 
> The fact it is "only after" is exactly why I asked.
> 
> Reset notification is in a CT queue with other stuff, right? So can be 
> some unrelated time after the actual reset. Could have context be 
> retired in the meantime and guc_id released is the question.
> 
> Because i915 has no idea there was a reset until this delayed message 
> comes over, but it could see user interrupt signaling end of batch, 
> after the reset has happened, unbeknown to i915, right?
> 
> Perhaps the answer is guc_id cannot be released via the request retire 
> flows. Or GuC signaling release of guc_id is a thing, which is then 
> ordered via the same CT buffer.
> 
> I don't know, just asking.
> 
As long as the context is pinned, the guc-id wont be re-assigned. After a bit of offline brain-dump
from John Harrison, there are many factors that can keep the context pinned (recounts) including
new or oustanding requests. So a guc-id can't get re-assigned between a capture-notify and a
context-reset even if that outstanding request is the only refcount left since it would still
be considered outstanding by the driver. I also think we may also be talking past each other
in the sense that the guc-id is something the driver assigns to a context being pinned and only
the driver can un-assign it (both assigning and unasigning is via H2G interactions).
I get the sense you are assuming the GuC can un-assign the guc-id's on its own - which isn't
the case. Apologies if i mis-assumed.

> Regards,
> 
> Tvrtko


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Intel-gfx] [RFC 7/7] drm/i915/guc: Print the GuC error capture output register list.
  2022-01-06 18:33                   ` Teres Alexis, Alan Previn
@ 2022-01-07  9:03                     ` Tvrtko Ursulin
  2022-01-07 17:03                       ` Teres Alexis, Alan Previn
  0 siblings, 1 reply; 52+ messages in thread
From: Tvrtko Ursulin @ 2022-01-07  9:03 UTC (permalink / raw)
  To: Teres Alexis, Alan Previn, Brost, Matthew; +Cc: intel-gfx


On 06/01/2022 18:33, Teres Alexis, Alan Previn wrote:
> 
> On Thu, 2022-01-06 at 09:38 +0000, Tvrtko Ursulin wrote:
>> On 05/01/2022 17:30, Teres Alexis, Alan Previn wrote:
>>> On Tue, 2022-01-04 at 13:56 +0000, Tvrtko Ursulin wrote:
>>>>> The flow of events are as below:
>>>>>
>>>>> 1. guc sends notification that an error capture was done and ready to take.
>>>>> 	- at this point we copy the guc error captured dump into an interim store
>>>>> 	  (larger buffer that can hold multiple captures).
>>>>> 2. guc sends notification that a context was reset (after the prior)
>>>>> 	- this triggers a call to i915_gpu_coredump with the corresponding engine-mask
>>>>>              from the context that was reset
>>>>> 	- i915_gpu_coredump proceeds to gather entire gpu state including driver state,
>>>>>              global gpu state, engine state, context vmas and also engine registers. For the
>>>>>              engine registers now call into the guc_capture code which merely needs to verify
>>>>> 	  that GuC had already done a step 1 and we have data ready to be parsed.
>>>>
>>>> What about the time between the actual reset and receiving the context
>>>> reset notification? Latter will contain intel_context->guc_id - can that
>>>> be re-assigned or "retired" in between the two and so cause problems for
>>>> matching the correct (or any) vmas?
>>>>
>>> Not it cannot because its only after the context reset notification that i915 starts
>>> taking action against that cotnext - and even that happens after the i915_gpu_codedump (engine-mask-of-context) happens.
>>> That's what i've observed in the code flow.
>>
>> The fact it is "only after" is exactly why I asked.
>>
>> Reset notification is in a CT queue with other stuff, right? So can be
>> some unrelated time after the actual reset. Could have context be
>> retired in the meantime and guc_id released is the question.
>>
>> Because i915 has no idea there was a reset until this delayed message
>> comes over, but it could see user interrupt signaling end of batch,
>> after the reset has happened, unbeknown to i915, right?
>>
>> Perhaps the answer is guc_id cannot be released via the request retire
>> flows. Or GuC signaling release of guc_id is a thing, which is then
>> ordered via the same CT buffer.
>>
>> I don't know, just asking.
>>
> As long as the context is pinned, the guc-id wont be re-assigned. After a bit of offline brain-dump
> from John Harrison, there are many factors that can keep the context pinned (recounts) including
> new or oustanding requests. So a guc-id can't get re-assigned between a capture-notify and a
> context-reset even if that outstanding request is the only refcount left since it would still
> be considered outstanding by the driver. I also think we may also be talking past each other
> in the sense that the guc-id is something the driver assigns to a context being pinned and only
> the driver can un-assign it (both assigning and unasigning is via H2G interactions).
> I get the sense you are assuming the GuC can un-assign the guc-id's on its own - which isn't
> the case. Apologies if i mis-assumed.

I did not think GuC can re-assign ce->guc_id. I asked about request/context complete/retire happening before reset/capture notification is received.

That would be the time window between the last intel_context_put, so last i915_request_put from retire, at which point AFAICT GuC code releases the guc_id. Execution timeline like:

|------ rq1 ------|------ rq2 ------|
    ^ engine reset		    ^ rq2, rq1 retire, guc id released

                                                           		^ GuC reset notify received - guc_id not known any more?
  
You are saying something is guaranteed to be holding onto the guc_id at the point of receiving the notification? "There are many factors that can keep the context pinned" - what is it in this case? Or the case cannot happen?

Regards,

Tvrtko

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Intel-gfx] [RFC 7/7] drm/i915/guc: Print the GuC error capture output register list.
  2022-01-07  9:03                     ` Tvrtko Ursulin
@ 2022-01-07 17:03                       ` Teres Alexis, Alan Previn
  2022-01-10  8:07                         ` Tvrtko Ursulin
  0 siblings, 1 reply; 52+ messages in thread
From: Teres Alexis, Alan Previn @ 2022-01-07 17:03 UTC (permalink / raw)
  To: Brost, Matthew, tvrtko.ursulin; +Cc: intel-gfx


On Fri, 2022-01-07 at 09:03 +0000, Tvrtko Ursulin wrote:
> On 06/01/2022 18:33, Teres Alexis, Alan Previn wrote:
> > On Thu, 2022-01-06 at 09:38 +0000, Tvrtko Ursulin wrote:
> > > On 05/01/2022 17:30, Teres Alexis, Alan Previn wrote:
> > > > On Tue, 2022-01-04 at 13:56 +0000, Tvrtko Ursulin wrote:
> > > > > > The flow of events are as below:
> > > > > > 
> > > > > > 1. guc sends notification that an error capture was done and ready to take.
> > > > > > 	- at this point we copy the guc error captured dump into an interim store
> > > > > > 	  (larger buffer that can hold multiple captures).
> > > > > > 2. guc sends notification that a context was reset (after the prior)
> > > > > > 	- this triggers a call to i915_gpu_coredump with the corresponding engine-mask
> > > > > >              from the context that was reset
> > > > > > 	- i915_gpu_coredump proceeds to gather entire gpu state including driver state,
> > > > > >              global gpu state, engine state, context vmas and also engine registers. For the
> > > > > >              engine registers now call into the guc_capture code which merely needs to verify
> > > > > > 	  that GuC had already done a step 1 and we have data ready to be parsed.
> > > > > 
> > > > > What about the time between the actual reset and receiving the context
> > > > > reset notification? Latter will contain intel_context->guc_id - can that
> > > > > be re-assigned or "retired" in between the two and so cause problems for
> > > > > matching the correct (or any) vmas?
> > > > > 
> > > > Not it cannot because its only after the context reset notification that i915 starts
> > > > taking action against that cotnext - and even that happens after the i915_gpu_codedump (engine-mask-of-context) happens.
> > > > That's what i've observed in the code flow.
> > > 
> > > The fact it is "only after" is exactly why I asked.
> > > 
> > > Reset notification is in a CT queue with other stuff, right? So can be
> > > some unrelated time after the actual reset. Could have context be
> > > retired in the meantime and guc_id released is the question.
> > > 
> > > Because i915 has no idea there was a reset until this delayed message
> > > comes over, but it could see user interrupt signaling end of batch,
> > > after the reset has happened, unbeknown to i915, right?
> > > 
> > > Perhaps the answer is guc_id cannot be released via the request retire
> > > flows. Or GuC signaling release of guc_id is a thing, which is then
> > > ordered via the same CT buffer.
> > > 
> > > I don't know, just asking.
> > > 
> > As long as the context is pinned, the guc-id wont be re-assigned. After a bit of offline brain-dump
> > from John Harrison, there are many factors that can keep the context pinned (recounts) including
> > new or oustanding requests. So a guc-id can't get re-assigned between a capture-notify and a
> > context-reset even if that outstanding request is the only refcount left since it would still
> > be considered outstanding by the driver. I also think we may also be talking past each other
> > in the sense that the guc-id is something the driver assigns to a context being pinned and only
> > the driver can un-assign it (both assigning and unasigning is via H2G interactions).
> > I get the sense you are assuming the GuC can un-assign the guc-id's on its own - which isn't
> > the case. Apologies if i mis-assumed.
> 
> I did not think GuC can re-assign ce->guc_id. I asked about request/context complete/retire happening before reset/capture notification is received.
> 
> That would be the time window between the last intel_context_put, so last i915_request_put from retire, at which point AFAICT GuC code releases the guc_id. Execution timeline like:
> 
> > ------ rq1 ------|------ rq2 ------|
>     ^ engine reset		    ^ rq2, rq1 retire, guc id released
> 
>                                                            		^ GuC reset notify received - guc_id not known any more?
>   
> You are saying something is guaranteed to be holding onto the guc_id at the point of receiving the notification? "There are many factors that can keep the context pinned" - what is it in this case? Or the case cannot happen?
> 
> Regards,
> 
> Tvrtko

above chart is incorrect: GuC reset notification is sent from GuC to host before it sends the engine reset notification 
...alan






^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Intel-gfx] [RFC 7/7] drm/i915/guc: Print the GuC error capture output register list.
  2022-01-07 17:03                       ` Teres Alexis, Alan Previn
@ 2022-01-10  8:07                         ` Tvrtko Ursulin
  2022-01-10 18:19                           ` Teres Alexis, Alan Previn
  0 siblings, 1 reply; 52+ messages in thread
From: Tvrtko Ursulin @ 2022-01-10  8:07 UTC (permalink / raw)
  To: Teres Alexis, Alan Previn, Brost, Matthew; +Cc: intel-gfx


On 07/01/2022 17:03, Teres Alexis, Alan Previn wrote:
> 
> On Fri, 2022-01-07 at 09:03 +0000, Tvrtko Ursulin wrote:
>> On 06/01/2022 18:33, Teres Alexis, Alan Previn wrote:
>>> On Thu, 2022-01-06 at 09:38 +0000, Tvrtko Ursulin wrote:
>>>> On 05/01/2022 17:30, Teres Alexis, Alan Previn wrote:
>>>>> On Tue, 2022-01-04 at 13:56 +0000, Tvrtko Ursulin wrote:
>>>>>>> The flow of events are as below:
>>>>>>>
>>>>>>> 1. guc sends notification that an error capture was done and ready to take.
>>>>>>> 	- at this point we copy the guc error captured dump into an interim store
>>>>>>> 	  (larger buffer that can hold multiple captures).
>>>>>>> 2. guc sends notification that a context was reset (after the prior)
>>>>>>> 	- this triggers a call to i915_gpu_coredump with the corresponding engine-mask
>>>>>>>               from the context that was reset
>>>>>>> 	- i915_gpu_coredump proceeds to gather entire gpu state including driver state,
>>>>>>>               global gpu state, engine state, context vmas and also engine registers. For the
>>>>>>>               engine registers now call into the guc_capture code which merely needs to verify
>>>>>>> 	  that GuC had already done a step 1 and we have data ready to be parsed.
>>>>>>
>>>>>> What about the time between the actual reset and receiving the context
>>>>>> reset notification? Latter will contain intel_context->guc_id - can that
>>>>>> be re-assigned or "retired" in between the two and so cause problems for
>>>>>> matching the correct (or any) vmas?
>>>>>>
>>>>> Not it cannot because its only after the context reset notification that i915 starts
>>>>> taking action against that cotnext - and even that happens after the i915_gpu_codedump (engine-mask-of-context) happens.
>>>>> That's what i've observed in the code flow.
>>>>
>>>> The fact it is "only after" is exactly why I asked.
>>>>
>>>> Reset notification is in a CT queue with other stuff, right? So can be
>>>> some unrelated time after the actual reset. Could have context be
>>>> retired in the meantime and guc_id released is the question.
>>>>
>>>> Because i915 has no idea there was a reset until this delayed message
>>>> comes over, but it could see user interrupt signaling end of batch,
>>>> after the reset has happened, unbeknown to i915, right?
>>>>
>>>> Perhaps the answer is guc_id cannot be released via the request retire
>>>> flows. Or GuC signaling release of guc_id is a thing, which is then
>>>> ordered via the same CT buffer.
>>>>
>>>> I don't know, just asking.
>>>>
>>> As long as the context is pinned, the guc-id wont be re-assigned. After a bit of offline brain-dump
>>> from John Harrison, there are many factors that can keep the context pinned (recounts) including
>>> new or oustanding requests. So a guc-id can't get re-assigned between a capture-notify and a
>>> context-reset even if that outstanding request is the only refcount left since it would still
>>> be considered outstanding by the driver. I also think we may also be talking past each other
>>> in the sense that the guc-id is something the driver assigns to a context being pinned and only
>>> the driver can un-assign it (both assigning and unasigning is via H2G interactions).
>>> I get the sense you are assuming the GuC can un-assign the guc-id's on its own - which isn't
>>> the case. Apologies if i mis-assumed.
>>
>> I did not think GuC can re-assign ce->guc_id. I asked about request/context complete/retire happening before reset/capture notification is received.
>>
>> That would be the time window between the last intel_context_put, so last i915_request_put from retire, at which point AFAICT GuC code releases the guc_id. Execution timeline like:
>>
>>> ------ rq1 ------|------ rq2 ------|
>>      ^ engine reset		    ^ rq2, rq1 retire, guc id released
>>
>>                                                             		^ GuC reset notify received - guc_id not known any more?
>>    
>> You are saying something is guaranteed to be holding onto the guc_id at the point of receiving the notification? "There are many factors that can keep the context pinned" - what is it in this case? Or the case cannot happen?
>>
>> Regards,
>>
>> Tvrtko
> 
> above chart is incorrect: GuC reset notification is sent from GuC to host before it sends the engine reset notification

Meaning? And how does it relate to actual reset vs retire vs reset 
notification (sent or received)?

Plus, I thought so far we were talking about reset notification and 
capture notification, so what you say here now extra confuses me without 
providing an answer to my question.

Regards,

Tvrtko

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Intel-gfx] [RFC 7/7] drm/i915/guc: Print the GuC error capture output register list.
  2022-01-10  8:07                         ` Tvrtko Ursulin
@ 2022-01-10 18:19                           ` Teres Alexis, Alan Previn
  2022-01-11 10:08                             ` Tvrtko Ursulin
  0 siblings, 1 reply; 52+ messages in thread
From: Teres Alexis, Alan Previn @ 2022-01-10 18:19 UTC (permalink / raw)
  To: Brost, Matthew, tvrtko.ursulin; +Cc: intel-gfx


On Mon, 2022-01-10 at 08:07 +0000, Tvrtko Ursulin wrote:
> On 07/01/2022 17:03, Teres Alexis, Alan Previn wrote:
> > On Fri, 2022-01-07 at 09:03 +0000, Tvrtko Ursulin wrote:
> > > On 06/01/2022 18:33, Teres Alexis, Alan Previn wrote:
> > > > On Thu, 2022-01-06 at 09:38 +0000, Tvrtko Ursulin wrote:
> > > > > On 05/01/2022 17:30, Teres Alexis, Alan Previn wrote:
> > > > > > On Tue, 2022-01-04 at 13:56 +0000, Tvrtko Ursulin wrote:
> > > > > > > > The flow of events are as below:
> > > > > > > > 
> > > > > > > > 1. guc sends notification that an error capture was done and ready to take.
> > > > > > > > 	- at this point we copy the guc error captured dump into an interim store
> > > > > > > > 	  (larger buffer that can hold multiple captures).
> > > > > > > > 2. guc sends notification that a context was reset (after the prior)
> > > > > > > > 	- this triggers a call to i915_gpu_coredump with the corresponding engine-mask
> > > > > > > >               from the context that was reset
> > > > > > > > 	- i915_gpu_coredump proceeds to gather entire gpu state including driver state,
> > > > > > > >               global gpu state, engine state, context vmas and also engine registers. For the
> > > > > > > >               engine registers now call into the guc_capture code which merely needs to verify
> > > > > > > > 	  that GuC had already done a step 1 and we have data ready to be parsed.
> > > > > > > 
> > > > > > > What about the time between the actual reset and receiving the context
> > > > > > > reset notification? Latter will contain intel_context->guc_id - can that
> > > > > > > be re-assigned or "retired" in between the two and so cause problems for
> > > > > > > matching the correct (or any) vmas?
> > > > > > > 
> > > > > > Not it cannot because its only after the context reset notification that i915 starts
> > > > > > taking action against that cotnext - and even that happens after the i915_gpu_codedump (engine-mask-of-context) happens.
> > > > > > That's what i've observed in the code flow.
> > > > > 
> > > > > The fact it is "only after" is exactly why I asked.
> > > > > 
> > > > > Reset notification is in a CT queue with other stuff, right? So can be
> > > > > some unrelated time after the actual reset. Could have context be
> > > > > retired in the meantime and guc_id released is the question.
> > > > > 
> > > > > Because i915 has no idea there was a reset until this delayed message
> > > > > comes over, but it could see user interrupt signaling end of batch,
> > > > > after the reset has happened, unbeknown to i915, right?
> > > > > 
> > > > > Perhaps the answer is guc_id cannot be released via the request retire
> > > > > flows. Or GuC signaling release of guc_id is a thing, which is then
> > > > > ordered via the same CT buffer.
> > > > > 
> > > > > I don't know, just asking.
> > > > > 
> > > > As long as the context is pinned, the guc-id wont be re-assigned. After a bit of offline brain-dump
> > > > from John Harrison, there are many factors that can keep the context pinned (recounts) including
> > > > new or oustanding requests. So a guc-id can't get re-assigned between a capture-notify and a
> > > > context-reset even if that outstanding request is the only refcount left since it would still
> > > > be considered outstanding by the driver. I also think we may also be talking past each other
> > > > in the sense that the guc-id is something the driver assigns to a context being pinned and only
> > > > the driver can un-assign it (both assigning and unasigning is via H2G interactions).
> > > > I get the sense you are assuming the GuC can un-assign the guc-id's on its own - which isn't
> > > > the case. Apologies if i mis-assumed.
> > > 
> > > I did not think GuC can re-assign ce->guc_id. I asked about request/context complete/retire happening before reset/capture notification is received.
> > > 
> > > That would be the time window between the last intel_context_put, so last i915_request_put from retire, at which point AFAICT GuC code releases the guc_id. Execution timeline like:
> > > 
> > > > ------ rq1 ------|------ rq2 ------|
> > >      ^ engine reset		    ^ rq2, rq1 retire, guc id released
> > > 
> > >                                                             		^ GuC reset notify received - guc_id not known any more?
> > >    
> > > You are saying something is guaranteed to be holding onto the guc_id at the point of receiving the notification? "There are many factors that can keep the context pinned" - what is it in this case? Or the case cannot happen?
> > > 
> > > Regards,
> > > 
> > > Tvrtko
> > 
> > above chart is incorrect: GuC reset notification is sent from GuC to host before it sends the engine reset notification
> 
> Meaning? And how does it relate to actual reset vs retire vs reset 
> notification (sent or received)?
> 
> Plus, I thought so far we were talking about reset notification and 
> capture notification, so what you say here now extra confuses me without 
> providing an answer to my question.
> 
> Regards,
> 
> Tvrtko
So i think the confusion at this point of the conversation is because in the prior discussion we have been talking about
the focus was on printout of the error capture status (which happens when user triggers the debugfs to dump). In your
previous reply, you had provided a timeline that references the engine-reset, request/retire and reset-notification
events which are separate from the print-out event.


So recap of timeline of events that highlights when things occur including the printout:
(apologies for a lot of repeated and known info below, i am repeating for my own benefit)

t0
   - ContextA makes a request
         -> pin ContextA and get a guc-id OR
            reuse existing guc-id if context is still pinned.
         -> ref count is always incremented when a new request is sent
            to keep the context pinned with the same guc-id 

t1...t10
   1- ContextA continues through multiple request and retirement events
   2- no hangs, no resets, ContextA is good

t11
   1- ContextA sends a faulty workload
   2- as always, its either already pinned with same guc-id
      or get a new guc-id and pinned again
   3- refcount increases
t12
   1- lets assume all outstanding ContextA request successfully retire
      except the work at t11. So there is one refcount left holding ContextA
      pinned with that same guc-id
t13
   1- GuC decides to reset ContextA (this means KMD had previously setup GuC
      scheduler policies, execution-quanta and preemption-timeout that tells
      GuC it wants GuC to reset a context that doesnt complete in time and cant
      be preempted if a higher priority workload needs to get in).
t14
   1- GuC does a full error-state captures on its side and sends the
      G2H for error-capture-notification to KMD. At this point, refcount
      remains untouched and context is still pinned.
   2- error capture module copies the new error-capture to interim store
      but does not parse it yet.
t15
   1- GuC sends the G2H for engine reset / context reset to KMD. At this
      point, KMD calls the i915_gpu_coredump function and will capture a
      snapshot of all relevant context information and its faulting request.
      This includes the context, LRCA and vmas (such as the batch buffer).
      At this point the guc-error capture is not parsed but we already have a
      snapshot of the guc-error-capture-dump from t14.
   2- i915_gpu_coredump shall chose to keep all of the information collected
      if its the first error or will discard everything.
   3- i915 may attempt to replay the context or it may not and if not it
      could lose the guc-id.
tn
   1- The end user triggers the debugfs to dump the gpu error state. 
   2- multiple information is printed and includes the call to
      intel_guc_capture_out_print_next_group function. The guc-err-capture module
      now finally parses the information and prints everything it finds.
   3- based on the sequence of events in t14 and t15 (i.e. guc sends error-capture
      notification first, guc sends context-reset notification second and the
      i915_cpu_coredump will only keep dump information the first-error of a context-
      or-engine rese) the first engine-state-dump that intel_guc_capture_out_print_next_group
      finds from the buffer that GuC had sent in t14 is expected to match, but we know
      at this point the guc-id could of course be lost.
        - NOTE1: in prior replies, i had mentioned something along the lines of "not
          able to extract information about the context and process". I didnt do
          a good job of explaining that this gap is pertaining to the new
          intel_guc_capture_out_print_next_group function being able to get that
          on its own. I should have also stated that all that info was already
          captured by i915_gpu_coredump in step 15 and the first engine-reset-dump we
          find in intel_guc_capture_out_print_next_group should correlate.
        - NOTE2: Since the i915_gpu_coredump function only keeps the first error-state
          but discards any subsequent ones (if the end user hasn't cleared the 1st one
          via the debugfs), I am not having the guc-err-capture module parse all of the
          error-capture info at the time of t14 and hold off until now at tn.

An important assumption here is that at the time of tn, the very first engine-dump we parse
via the guc-error-capture dump should correlate with the first error-capture that
i915_gpu_coredump is parsing at tn (captured at t15). This was what i had summarized
on this thread on Dec 23rd morning PST.

However, if we want to add additional check-and-balance (to ensure the dumps are
aligned) is to keep a copy of the guc-id and LRCA (not ref-counting and keeping pinned
but just making a copy of the values) when i915_gpu_coredump does NOT discard the capture
coz that will be the one printed out when triggered by end-user and is expected to match
the first entry from the the oustanding guc-err-capture dumps. I can include this in
my upcoming rev but will only make that copy if the i915_gpu_coredump does NOT discard
the dump.

...alan


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Intel-gfx] [RFC 7/7] drm/i915/guc: Print the GuC error capture output register list.
  2022-01-10 18:19                           ` Teres Alexis, Alan Previn
@ 2022-01-11 10:08                             ` Tvrtko Ursulin
  2022-01-14  7:16                               ` Teres Alexis, Alan Previn
  0 siblings, 1 reply; 52+ messages in thread
From: Tvrtko Ursulin @ 2022-01-11 10:08 UTC (permalink / raw)
  To: Teres Alexis, Alan Previn, Brost, Matthew; +Cc: intel-gfx


On 10/01/2022 18:19, Teres Alexis, Alan Previn wrote:
> 
> On Mon, 2022-01-10 at 08:07 +0000, Tvrtko Ursulin wrote:
>> On 07/01/2022 17:03, Teres Alexis, Alan Previn wrote:
>>> On Fri, 2022-01-07 at 09:03 +0000, Tvrtko Ursulin wrote:
>>>> On 06/01/2022 18:33, Teres Alexis, Alan Previn wrote:
>>>>> On Thu, 2022-01-06 at 09:38 +0000, Tvrtko Ursulin wrote:
>>>>>> On 05/01/2022 17:30, Teres Alexis, Alan Previn wrote:
>>>>>>> On Tue, 2022-01-04 at 13:56 +0000, Tvrtko Ursulin wrote:
>>>>>>>>> The flow of events are as below:
>>>>>>>>>
>>>>>>>>> 1. guc sends notification that an error capture was done and ready to take.
>>>>>>>>> 	- at this point we copy the guc error captured dump into an interim store
>>>>>>>>> 	  (larger buffer that can hold multiple captures).
>>>>>>>>> 2. guc sends notification that a context was reset (after the prior)
>>>>>>>>> 	- this triggers a call to i915_gpu_coredump with the corresponding engine-mask
>>>>>>>>>                from the context that was reset
>>>>>>>>> 	- i915_gpu_coredump proceeds to gather entire gpu state including driver state,
>>>>>>>>>                global gpu state, engine state, context vmas and also engine registers. For the
>>>>>>>>>                engine registers now call into the guc_capture code which merely needs to verify
>>>>>>>>> 	  that GuC had already done a step 1 and we have data ready to be parsed.
>>>>>>>>
>>>>>>>> What about the time between the actual reset and receiving the context
>>>>>>>> reset notification? Latter will contain intel_context->guc_id - can that
>>>>>>>> be re-assigned or "retired" in between the two and so cause problems for
>>>>>>>> matching the correct (or any) vmas?
>>>>>>>>
>>>>>>> Not it cannot because its only after the context reset notification that i915 starts
>>>>>>> taking action against that cotnext - and even that happens after the i915_gpu_codedump (engine-mask-of-context) happens.
>>>>>>> That's what i've observed in the code flow.
>>>>>>
>>>>>> The fact it is "only after" is exactly why I asked.
>>>>>>
>>>>>> Reset notification is in a CT queue with other stuff, right? So can be
>>>>>> some unrelated time after the actual reset. Could have context be
>>>>>> retired in the meantime and guc_id released is the question.
>>>>>>
>>>>>> Because i915 has no idea there was a reset until this delayed message
>>>>>> comes over, but it could see user interrupt signaling end of batch,
>>>>>> after the reset has happened, unbeknown to i915, right?
>>>>>>
>>>>>> Perhaps the answer is guc_id cannot be released via the request retire
>>>>>> flows. Or GuC signaling release of guc_id is a thing, which is then
>>>>>> ordered via the same CT buffer.
>>>>>>
>>>>>> I don't know, just asking.
>>>>>>
>>>>> As long as the context is pinned, the guc-id wont be re-assigned. After a bit of offline brain-dump
>>>>> from John Harrison, there are many factors that can keep the context pinned (recounts) including
>>>>> new or oustanding requests. So a guc-id can't get re-assigned between a capture-notify and a
>>>>> context-reset even if that outstanding request is the only refcount left since it would still
>>>>> be considered outstanding by the driver. I also think we may also be talking past each other
>>>>> in the sense that the guc-id is something the driver assigns to a context being pinned and only
>>>>> the driver can un-assign it (both assigning and unasigning is via H2G interactions).
>>>>> I get the sense you are assuming the GuC can un-assign the guc-id's on its own - which isn't
>>>>> the case. Apologies if i mis-assumed.
>>>>
>>>> I did not think GuC can re-assign ce->guc_id. I asked about request/context complete/retire happening before reset/capture notification is received.
>>>>
>>>> That would be the time window between the last intel_context_put, so last i915_request_put from retire, at which point AFAICT GuC code releases the guc_id. Execution timeline like:
>>>>
>>>>> ------ rq1 ------|------ rq2 ------|
>>>>       ^ engine reset		    ^ rq2, rq1 retire, guc id released
>>>>
>>>>                                                              		^ GuC reset notify received - guc_id not known any more?
>>>>     
>>>> You are saying something is guaranteed to be holding onto the guc_id at the point of receiving the notification? "There are many factors that can keep the context pinned" - what is it in this case? Or the case cannot happen?
>>>>
>>>> Regards,
>>>>
>>>> Tvrtko
>>>
>>> above chart is incorrect: GuC reset notification is sent from GuC to host before it sends the engine reset notification
>>
>> Meaning? And how does it relate to actual reset vs retire vs reset
>> notification (sent or received)?
>>
>> Plus, I thought so far we were talking about reset notification and
>> capture notification, so what you say here now extra confuses me without
>> providing an answer to my question.
>>
>> Regards,
>>
>> Tvrtko
> So i think the confusion at this point of the conversation is because in the prior discussion we have been talking about
> the focus was on printout of the error capture status (which happens when user triggers the debugfs to dump). In your
> previous reply, you had provided a timeline that references the engine-reset, request/retire and reset-notification
> events which are separate from the print-out event.
> 
> 
> So recap of timeline of events that highlights when things occur including the printout:
> (apologies for a lot of repeated and known info below, i am repeating for my own benefit)
> 
> t0
>     - ContextA makes a request
>           -> pin ContextA and get a guc-id OR
>              reuse existing guc-id if context is still pinned.
>           -> ref count is always incremented when a new request is sent
>              to keep the context pinned with the same guc-id
> 
> t1...t10
>     1- ContextA continues through multiple request and retirement events
>     2- no hangs, no resets, ContextA is good
> 
> t11
>     1- ContextA sends a faulty workload
>     2- as always, its either already pinned with same guc-id
>        or get a new guc-id and pinned again
>     3- refcount increases
> t12
>     1- lets assume all outstanding ContextA request successfully retire
>        except the work at t11. So there is one refcount left holding ContextA
>        pinned with that same guc-id
> t13
>     1- GuC decides to reset ContextA (this means KMD had previously setup GuC
>        scheduler policies, execution-quanta and preemption-timeout that tells
>        GuC it wants GuC to reset a context that doesnt complete in time and cant
>        be preempted if a higher priority workload needs to get in).
> t14
>     1- GuC does a full error-state captures on its side and sends the
>        G2H for error-capture-notification to KMD. At this point, refcount
>        remains untouched and context is still pinned.

All good until this step I think.

At this point in the timeline my question is this:

Once GuC is done it's error capture and engine reset, having sent out 
the notification (or plural), does it continue to execute ContextA?

If it does not, given what you wrote in t15-3, please skip to that 
location (t15-3).

If it does continue execution, it then hits the request post-amble 
containing the seqno write and user interrupt.

Engine/capture notifications are sitting in the CT buffer waiting for 
the i915 to read them.

In parallel, ahead of the CT work, i915 notices the ContextA request has 
been completed and proceeds to retire it.

Does this release the final reference on the guc_id associated with 
ContextA?

>     2- error capture module copies the new error-capture to interim store
>        but does not parse it yet.
> t15
>     1- GuC sends the G2H for engine reset / context reset to KMD. At this
>        point, KMD calls the i915_gpu_coredump function and will capture a
>        snapshot of all relevant context information and its faulting request.
>        This includes the context, LRCA and vmas (such as the batch buffer).
>        At this point the guc-error capture is not parsed but we already have a
>        snapshot of the guc-error-capture-dump from t14.
>     2- i915_gpu_coredump shall chose to keep all of the information collected
>        if its the first error or will discard everything.
>     3- i915 may attempt to replay the context or it may not and if not it
>        could lose the guc-id.

This made me thing guc engine reset notification is a "handshake" 
operation and not a pure notification? Does it imply GuC will wait for 
i915 to reply what to do next meaning it won't continue to execute 
ContextA before i915 replies to engine reset notification?

If so that would resolve my concern.

Regards,

Tvrtko

> tn
>     1- The end user triggers the debugfs to dump the gpu error state.
>     2- multiple information is printed and includes the call to
>        intel_guc_capture_out_print_next_group function. The guc-err-capture module
>        now finally parses the information and prints everything it finds.
>     3- based on the sequence of events in t14 and t15 (i.e. guc sends error-capture
>        notification first, guc sends context-reset notification second and the
>        i915_cpu_coredump will only keep dump information the first-error of a context-
>        or-engine rese) the first engine-state-dump that intel_guc_capture_out_print_next_group
>        finds from the buffer that GuC had sent in t14 is expected to match, but we know
>        at this point the guc-id could of course be lost.
>          - NOTE1: in prior replies, i had mentioned something along the lines of "not
>            able to extract information about the context and process". I didnt do
>            a good job of explaining that this gap is pertaining to the new
>            intel_guc_capture_out_print_next_group function being able to get that
>            on its own. I should have also stated that all that info was already
>            captured by i915_gpu_coredump in step 15 and the first engine-reset-dump we
>            find in intel_guc_capture_out_print_next_group should correlate.
>          - NOTE2: Since the i915_gpu_coredump function only keeps the first error-state
>            but discards any subsequent ones (if the end user hasn't cleared the 1st one
>            via the debugfs), I am not having the guc-err-capture module parse all of the
>            error-capture info at the time of t14 and hold off until now at tn.
> 
> An important assumption here is that at the time of tn, the very first engine-dump we parse
> via the guc-error-capture dump should correlate with the first error-capture that
> i915_gpu_coredump is parsing at tn (captured at t15). This was what i had summarized
> on this thread on Dec 23rd morning PST.
> 
> However, if we want to add additional check-and-balance (to ensure the dumps are
> aligned) is to keep a copy of the guc-id and LRCA (not ref-counting and keeping pinned
> but just making a copy of the values) when i915_gpu_coredump does NOT discard the capture
> coz that will be the one printed out when triggered by end-user and is expected to match
> the first entry from the the oustanding guc-err-capture dumps. I can include this in
> my upcoming rev but will only make that copy if the i915_gpu_coredump does NOT discard
> the dump.
> 
> ...alan
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Intel-gfx] [RFC 7/7] drm/i915/guc: Print the GuC error capture output register list.
  2022-01-11 10:08                             ` Tvrtko Ursulin
@ 2022-01-14  7:16                               ` Teres Alexis, Alan Previn
  0 siblings, 0 replies; 52+ messages in thread
From: Teres Alexis, Alan Previn @ 2022-01-14  7:16 UTC (permalink / raw)
  To: Tvrtko Ursulin, Brost, Matthew; +Cc: intel-gfx

> This made me thing guc engine reset notification is a "handshake" 
> operation and not a pure notification? Does it imply GuC will wait for
> i915 to reply what to do next meaning it won't continue to execute ContextA before i915 replies to engine reset notification?

> If so that would resolve my concern.

Yes: The GuC to host action is used to report a hung context to the VF host if engine reset was triggered and a hung context was detected during engine reset. This context is automatically put in a non-runnable state.
Apologies for the delay - some task IRQs.

...alan

-----Original Message-----
From: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> 
Sent: Tuesday, January 11, 2022 2:09 AM
To: Teres Alexis, Alan Previn <alan.previn.teres.alexis@intel.com>; Brost, Matthew <matthew.brost@intel.com>
Cc: intel-gfx@lists.freedesktop.org
Subject: Re: [Intel-gfx] [RFC 7/7] drm/i915/guc: Print the GuC error capture output register list.


On 10/01/2022 18:19, Teres Alexis, Alan Previn wrote:
> 
> On Mon, 2022-01-10 at 08:07 +0000, Tvrtko Ursulin wrote:
>> On 07/01/2022 17:03, Teres Alexis, Alan Previn wrote:
>>> On Fri, 2022-01-07 at 09:03 +0000, Tvrtko Ursulin wrote:
>>>> On 06/01/2022 18:33, Teres Alexis, Alan Previn wrote:
>>>>> On Thu, 2022-01-06 at 09:38 +0000, Tvrtko Ursulin wrote:
>>>>>> On 05/01/2022 17:30, Teres Alexis, Alan Previn wrote:
>>>>>>> On Tue, 2022-01-04 at 13:56 +0000, Tvrtko Ursulin wrote:
>>>>>>>>> The flow of events are as below:
>>>>>>>>>
>>>>>>>>> 1. guc sends notification that an error capture was done and ready to take.
>>>>>>>>> 	- at this point we copy the guc error captured dump into an interim store
>>>>>>>>> 	  (larger buffer that can hold multiple captures).
>>>>>>>>> 2. guc sends notification that a context was reset (after the prior)
>>>>>>>>> 	- this triggers a call to i915_gpu_coredump with the corresponding engine-mask
>>>>>>>>>                from the context that was reset
>>>>>>>>> 	- i915_gpu_coredump proceeds to gather entire gpu state including driver state,
>>>>>>>>>                global gpu state, engine state, context vmas and also engine registers. For the
>>>>>>>>>                engine registers now call into the guc_capture code which merely needs to verify
>>>>>>>>> 	  that GuC had already done a step 1 and we have data ready to be parsed.
>>>>>>>>
>>>>>>>> What about the time between the actual reset and receiving the 
>>>>>>>> context reset notification? Latter will contain 
>>>>>>>> intel_context->guc_id - can that be re-assigned or "retired" in 
>>>>>>>> between the two and so cause problems for matching the correct (or any) vmas?
>>>>>>>>
>>>>>>> Not it cannot because its only after the context reset 
>>>>>>> notification that i915 starts taking action against that cotnext - and even that happens after the i915_gpu_codedump (engine-mask-of-context) happens.
>>>>>>> That's what i've observed in the code flow.
>>>>>>
>>>>>> The fact it is "only after" is exactly why I asked.
>>>>>>
>>>>>> Reset notification is in a CT queue with other stuff, right? So 
>>>>>> can be some unrelated time after the actual reset. Could have 
>>>>>> context be retired in the meantime and guc_id released is the question.
>>>>>>
>>>>>> Because i915 has no idea there was a reset until this delayed 
>>>>>> message comes over, but it could see user interrupt signaling end 
>>>>>> of batch, after the reset has happened, unbeknown to i915, right?
>>>>>>
>>>>>> Perhaps the answer is guc_id cannot be released via the request 
>>>>>> retire flows. Or GuC signaling release of guc_id is a thing, 
>>>>>> which is then ordered via the same CT buffer.
>>>>>>
>>>>>> I don't know, just asking.
>>>>>>
>>>>> As long as the context is pinned, the guc-id wont be re-assigned. 
>>>>> After a bit of offline brain-dump from John Harrison, there are 
>>>>> many factors that can keep the context pinned (recounts) including 
>>>>> new or oustanding requests. So a guc-id can't get re-assigned 
>>>>> between a capture-notify and a context-reset even if that 
>>>>> outstanding request is the only refcount left since it would still 
>>>>> be considered outstanding by the driver. I also think we may also be talking past each other in the sense that the guc-id is something the driver assigns to a context being pinned and only the driver can un-assign it (both assigning and unasigning is via H2G interactions).
>>>>> I get the sense you are assuming the GuC can un-assign the 
>>>>> guc-id's on its own - which isn't the case. Apologies if i mis-assumed.
>>>>
>>>> I did not think GuC can re-assign ce->guc_id. I asked about request/context complete/retire happening before reset/capture notification is received.
>>>>
>>>> That would be the time window between the last intel_context_put, so last i915_request_put from retire, at which point AFAICT GuC code releases the guc_id. Execution timeline like:
>>>>
>>>>> ------ rq1 ------|------ rq2 ------|
>>>>       ^ engine reset		    ^ rq2, rq1 retire, guc id released
>>>>
>>>>                                                              		^ GuC reset notify received - guc_id not known any more?
>>>>     
>>>> You are saying something is guaranteed to be holding onto the guc_id at the point of receiving the notification? "There are many factors that can keep the context pinned" - what is it in this case? Or the case cannot happen?
>>>>
>>>> Regards,
>>>>
>>>> Tvrtko
>>>
>>> above chart is incorrect: GuC reset notification is sent from GuC to 
>>> host before it sends the engine reset notification
>>
>> Meaning? And how does it relate to actual reset vs retire vs reset 
>> notification (sent or received)?
>>
>> Plus, I thought so far we were talking about reset notification and 
>> capture notification, so what you say here now extra confuses me 
>> without providing an answer to my question.
>>
>> Regards,
>>
>> Tvrtko
> So i think the confusion at this point of the conversation is because 
> in the prior discussion we have been talking about the focus was on 
> printout of the error capture status (which happens when user triggers 
> the debugfs to dump). In your previous reply, you had provided a timeline that references the engine-reset, request/retire and reset-notification events which are separate from the print-out event.
> 
> 
> So recap of timeline of events that highlights when things occur including the printout:
> (apologies for a lot of repeated and known info below, i am repeating 
> for my own benefit)
> 
> t0
>     - ContextA makes a request
>           -> pin ContextA and get a guc-id OR
>              reuse existing guc-id if context is still pinned.
>           -> ref count is always incremented when a new request is sent
>              to keep the context pinned with the same guc-id
> 
> t1...t10
>     1- ContextA continues through multiple request and retirement events
>     2- no hangs, no resets, ContextA is good
> 
> t11
>     1- ContextA sends a faulty workload
>     2- as always, its either already pinned with same guc-id
>        or get a new guc-id and pinned again
>     3- refcount increases
> t12
>     1- lets assume all outstanding ContextA request successfully retire
>        except the work at t11. So there is one refcount left holding ContextA
>        pinned with that same guc-id
> t13
>     1- GuC decides to reset ContextA (this means KMD had previously setup GuC
>        scheduler policies, execution-quanta and preemption-timeout that tells
>        GuC it wants GuC to reset a context that doesnt complete in time and cant
>        be preempted if a higher priority workload needs to get in).
> t14
>     1- GuC does a full error-state captures on its side and sends the
>        G2H for error-capture-notification to KMD. At this point, refcount
>        remains untouched and context is still pinned.

All good until this step I think.

At this point in the timeline my question is this:

Once GuC is done it's error capture and engine reset, having sent out the notification (or plural), does it continue to execute ContextA?

If it does not, given what you wrote in t15-3, please skip to that location (t15-3).

If it does continue execution, it then hits the request post-amble containing the seqno write and user interrupt.

Engine/capture notifications are sitting in the CT buffer waiting for the i915 to read them.

In parallel, ahead of the CT work, i915 notices the ContextA request has been completed and proceeds to retire it.

Does this release the final reference on the guc_id associated with ContextA?

>     2- error capture module copies the new error-capture to interim store
>        but does not parse it yet.
> t15
>     1- GuC sends the G2H for engine reset / context reset to KMD. At this
>        point, KMD calls the i915_gpu_coredump function and will capture a
>        snapshot of all relevant context information and its faulting request.
>        This includes the context, LRCA and vmas (such as the batch buffer).
>        At this point the guc-error capture is not parsed but we already have a
>        snapshot of the guc-error-capture-dump from t14.
>     2- i915_gpu_coredump shall chose to keep all of the information collected
>        if its the first error or will discard everything.
>     3- i915 may attempt to replay the context or it may not and if not it
>        could lose the guc-id.

This made me thing guc engine reset notification is a "handshake" 
operation and not a pure notification? Does it imply GuC will wait for
i915 to reply what to do next meaning it won't continue to execute ContextA before i915 replies to engine reset notification?

If so that would resolve my concern.

Regards,

Tvrtko

> tn
>     1- The end user triggers the debugfs to dump the gpu error state.
>     2- multiple information is printed and includes the call to
>        intel_guc_capture_out_print_next_group function. The guc-err-capture module
>        now finally parses the information and prints everything it finds.
>     3- based on the sequence of events in t14 and t15 (i.e. guc sends error-capture
>        notification first, guc sends context-reset notification second and the
>        i915_cpu_coredump will only keep dump information the first-error of a context-
>        or-engine rese) the first engine-state-dump that intel_guc_capture_out_print_next_group
>        finds from the buffer that GuC had sent in t14 is expected to match, but we know
>        at this point the guc-id could of course be lost.
>          - NOTE1: in prior replies, i had mentioned something along the lines of "not
>            able to extract information about the context and process". I didnt do
>            a good job of explaining that this gap is pertaining to the new
>            intel_guc_capture_out_print_next_group function being able to get that
>            on its own. I should have also stated that all that info was already
>            captured by i915_gpu_coredump in step 15 and the first engine-reset-dump we
>            find in intel_guc_capture_out_print_next_group should correlate.
>          - NOTE2: Since the i915_gpu_coredump function only keeps the first error-state
>            but discards any subsequent ones (if the end user hasn't cleared the 1st one
>            via the debugfs), I am not having the guc-err-capture module parse all of the
>            error-capture info at the time of t14 and hold off until now at tn.
> 
> An important assumption here is that at the time of tn, the very first 
> engine-dump we parse via the guc-error-capture dump should correlate 
> with the first error-capture that i915_gpu_coredump is parsing at tn 
> (captured at t15). This was what i had summarized on this thread on Dec 23rd morning PST.
> 
> However, if we want to add additional check-and-balance (to ensure the 
> dumps are
> aligned) is to keep a copy of the guc-id and LRCA (not ref-counting 
> and keeping pinned but just making a copy of the values) when 
> i915_gpu_coredump does NOT discard the capture coz that will be the 
> one printed out when triggered by end-user and is expected to match 
> the first entry from the the oustanding guc-err-capture dumps. I can 
> include this in my upcoming rev but will only make that copy if the i915_gpu_coredump does NOT discard the dump.
> 
> ...alan
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

end of thread, other threads:[~2022-01-14  7:17 UTC | newest]

Thread overview: 52+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-22 23:03 [RFC 0/7] Add GuC Error Capture Support Alan Previn
2021-11-22 23:03 ` [Intel-gfx] " Alan Previn
2021-11-22 23:03 ` [Intel-gfx] [RFC 1/7] drm/i915/guc: Add basic support for error capture lists Alan Previn
2021-11-23 21:12   ` Michal Wajdeczko
2021-12-08 18:23     ` Teres Alexis, Alan Previn
2021-11-22 23:03 ` [Intel-gfx] [RFC 2/7] drm/i915/guc: Update GuC ADS size " Alan Previn
2021-11-23 21:46   ` Michal Wajdeczko
2021-11-24  9:52     ` Jani Nikula
2021-11-24 17:34     ` Teres Alexis, Alan Previn
2021-12-21 23:15       ` Teres Alexis, Alan Previn
2021-12-22  1:49       ` Teres Alexis, Alan Previn
2021-12-22 20:13     ` Teres Alexis, Alan Previn
2021-11-24 10:06   ` Jani Nikula
2021-11-24 17:37     ` Teres Alexis, Alan Previn
2021-11-22 23:03 ` [Intel-gfx] [RFC 3/7] drm/i915/guc: Populate XE_LP register lists for GuC error state capture Alan Previn
2021-11-23  1:59   ` kernel test robot
2021-11-23 21:55   ` Michal Wajdeczko
2021-11-24 17:16     ` Teres Alexis, Alan Previn
2021-11-22 23:03 ` [Intel-gfx] [RFC 4/7] drm/i915/guc: Add GuC's error state capture output structures Alan Previn
2021-11-24 10:08   ` Jani Nikula
2021-11-24 17:37     ` Teres Alexis, Alan Previn
2021-12-07 21:01   ` Matthew Brost
2021-12-07 23:35     ` Teres Alexis, Alan Previn
2021-11-22 23:04 ` [Intel-gfx] [RFC 5/7] drm/i915/guc: Update GuC's log-buffer-state access for error capture Alan Previn
2021-12-07 22:31   ` Matthew Brost
2021-12-07 23:33     ` Teres Alexis, Alan Previn
2021-12-07 23:30       ` Matthew Brost
2021-11-22 23:04 ` [Intel-gfx] [RFC 6/7] drm/i915/guc: Copy new GuC error capture logs upon G2H notification Alan Previn
2021-12-07 22:58   ` Matthew Brost
2021-12-08  5:14     ` Teres Alexis, Alan Previn
2021-12-08 18:22       ` Teres Alexis, Alan Previn
2021-11-22 23:04 ` [Intel-gfx] [RFC 7/7] drm/i915/guc: Print the GuC error capture output register list Alan Previn
2021-11-23  0:25   ` Teres Alexis, Alan Previn
2021-12-08  0:22   ` Matthew Brost
2021-12-08  6:31     ` Teres Alexis, Alan Previn
2021-12-23 18:54       ` Teres Alexis, Alan Previn
2021-12-24 12:09         ` Tvrtko Ursulin
2021-12-24 13:34           ` Teres Alexis, Alan Previn
2022-01-04 13:56             ` Tvrtko Ursulin
2022-01-05 17:30               ` Teres Alexis, Alan Previn
2022-01-06  9:38                 ` Tvrtko Ursulin
2022-01-06 18:33                   ` Teres Alexis, Alan Previn
2022-01-07  9:03                     ` Tvrtko Ursulin
2022-01-07 17:03                       ` Teres Alexis, Alan Previn
2022-01-10  8:07                         ` Tvrtko Ursulin
2022-01-10 18:19                           ` Teres Alexis, Alan Previn
2022-01-11 10:08                             ` Tvrtko Ursulin
2022-01-14  7:16                               ` Teres Alexis, Alan Previn
2021-11-22 23:44 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for Add GuC Error Capture Support Patchwork
2021-11-22 23:45 ` [Intel-gfx] ✗ Fi.CI.SPARSE: " Patchwork
2021-11-23  0:16 ` [Intel-gfx] ✗ Fi.CI.BAT: failure " Patchwork
2021-11-23  0:40 ` [Intel-gfx] ✗ Fi.CI.BUILD: failure for Add GuC Error Capture Support (rev2) Patchwork

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.