* [RFC v1] Data port coherency control for UMDs.
@ 2018-03-19 12:37 Tomasz Lis
  2018-03-19 12:37 ` [RFC v1] drm/i915: Add Exec param to control data port coherency Tomasz Lis
                   ` (28 more replies)
  0 siblings, 29 replies; 88+ messages in thread
From: Tomasz Lis @ 2018-03-19 12:37 UTC (permalink / raw)
  To: intel-gfx; +Cc: bartosz.dunajski

The OpenCL driver developers requested functionality to control cache
coherency at the data port level. Coherency at that level is disabled by
default because of its performance cost. The OpenCL driver plans to enable
it for a small subset of submissions, when such functionality is required.
Below are answers to three basic questions explaining the background of the
functionality and the rationale for the proposed implementation:

1. Why do we need a coherency enable/disable switch for memory that is shared
between CPU and GEN (GPU)?

Memory coherency between CPU and GEN, while being a great feature that enables
the CL_MEM_SVM_FINE_GRAIN_BUFFER OCL capability on Intel GEN architecture, adds
overhead related to tracking (snooping) memory inside different cache units
(L1$, L2$, L3$, LLC$, etc.). At the same time, only a minority of modern OCL
applications actually use CL_MEM_SVM_FINE_GRAIN_BUFFER (and hence require
memory coherency between CPU and GPU). The goal of the coherency enable/disable
switch is to remove the overhead of memory coherency when memory coherency is
not needed.


2. Why do we need a global coherency switch?

In order to support I/O commands from within EUs (Execution Units), Intel GEN
ISA (GEN Instruction Set Assembly) contains dedicated "send" instructions.
These send instructions provide several addressing models. One of these
addressing models (named "stateless") provides the most flexible I/O using plain
virtual addresses (as opposed to buffer_handle+offset models). This "stateless"
model is similar to regular memory load/store operations available on typical
CPUs. Since this model provides I/O using arbitrary virtual addresses, it
enables algorithmic designs that are based on pointer-to-pointer (e.g. buffer
of pointers) concepts. For instance, it allows creating tree-like data
structures such as:
                   ________________
                  |      NODE1     |
                  | uint64_t data  |
                  +----------------|
                  | NODE*  |  NODE*|
                  +--------+-------+
                    /              \                 
   ________________/                \________________
  |      NODE2     |                |      NODE3     |
  | uint64_t data  |                | uint64_t data  |
  +----------------|                +----------------|
  | NODE*  |  NODE*|                | NODE*  |  NODE*|
  +--------+-------+                +--------+-------+

Please note that pointers inside such structures can point to memory locations
in different OCL allocations - e.g. NODE1 and NODE2 can reside in one OCL
allocation while NODE3 resides in a completely separate OCL allocation.
Additionally, such pointers can be shared with the CPU (i.e. using SVM, the
Shared Virtual Memory feature). Using pointers from different allocations
doesn't affect the stateless addressing model, which even allows scattered
reads from different allocations at the same time (i.e. by utilizing the SIMD
nature of send instructions).
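
As a rough illustration of the structure in the diagram, here is a minimal
host-side sketch in C. The struct layout, the names tree_sum() and tree_demo(),
and the two static arrays standing in for separate OCL allocations are all
assumptions made for this example; none of them come from the OCL driver.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical node layout mirroring the diagram: payload plus two children. */
struct node {
	uint64_t data;
	struct node *left;
	struct node *right;
};

/*
 * Walk the tree through plain pointers. The walker never learns which
 * allocation a node lives in, which is the point of the stateless model.
 */
static uint64_t tree_sum(const struct node *n)
{
	if (n == NULL)
		return 0;
	return n->data + tree_sum(n->left) + tree_sum(n->right);
}

/* Two static arrays stand in for two separate OCL allocations. */
static struct node alloc_a[2];	/* holds NODE1 and NODE2 */
static struct node alloc_b[1];	/* holds NODE3 */

/* Build the three-node tree from the diagram and sum its payloads. */
static uint64_t tree_demo(void)
{
	alloc_a[0] = (struct node){ .data = 1, .left = &alloc_a[1],
				    .right = &alloc_b[0] };
	alloc_a[1] = (struct node){ .data = 2 };
	alloc_b[0] = (struct node){ .data = 3 };
	return tree_sum(&alloc_a[0]);
}
```

Nothing in tree_sum() depends on where a node was allocated, so a compiler
looking only at the pointer dereferences has no allocation to attribute them to.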

When it comes to coherency programming, send instructions in the stateless
model can be encoded (at ISA level) to either use or disable coherency.
However, for generic OCL applications (such as the example with the tree-like
data structure), the OCL compiler is not able to determine the origin of memory
pointed to by an arbitrary pointer - i.e. it is not able to track a given
pointer back to a specific allocation. As such, it's not able to decide whether
coherency is needed for a specific pointer (or for a specific I/O
instruction). As a result, the compiler encodes all stateless sends as coherent
(doing otherwise would lead to functional issues resulting from data
corruption). Please note that it would be possible to work around this (e.g.
based on an allocation map and pointer bounds checking prior to each I/O
instruction), but the performance cost of such a workaround would be many times
greater than the cost of keeping coherency always enabled. As such,
enabling/disabling memory coherency at GEN ISA level is not feasible and an
alternative method is needed.
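
To make the cost of the rejected workaround more concrete, the following is a
minimal sketch in C of what an allocation-map lookup before each access could
look like. The struct alloc_range type, the needs_coherency() helper and the
example address ranges are hypothetical; a real implementation would run on the
EUs and would likely look quite different.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical allocation record: address range plus its SVM type. */
struct alloc_range {
	uintptr_t base;
	size_t size;
	bool fine_grain;	/* true for CL_MEM_SVM_FINE_GRAIN_BUFFER */
};

/*
 * Decide whether an access to addr must be coherent by searching the
 * allocation map. Pointers that match no known range are treated as
 * coherent, since guessing wrong would risk data corruption.
 */
static bool needs_coherency(const struct alloc_range *map, size_t count,
			    uintptr_t addr)
{
	for (size_t i = 0; i < count; i++) {
		if (addr >= map[i].base && addr - map[i].base < map[i].size)
			return map[i].fine_grain;
	}
	return true;
}

/* Example map: one plain buffer and one fine-grain SVM buffer. */
static const struct alloc_range demo_map[] = {
	{ .base = 0x1000, .size = 0x1000, .fine_grain = false },
	{ .base = 0x8000, .size = 0x2000, .fine_grain = true },
};
```

Even in this simplified form, every load or store gains a loop over the map
plus two comparisons per entry, which is the overhead the text above refers to.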

Such an alternative solution is to have a global coherency switch that allows
disabling coherency for a single (though entire) GPU submission. This is
beneficial because this way we:
* can enable (and pay for) coherency only in submissions that actually need
coherency (submissions that use CL_MEM_SVM_FINE_GRAIN_BUFFER resources)
* don't care about coherency at GEN ISA granularity (no performance impact)

3. Will the coherency switch be used frequently?

There are scenarios that will require frequent toggling of the coherency
switch.
E.g. an application has two OCL compute kernels: kern_master and kern_worker.
kern_master uses, concurrently with the CPU, some fine-grain SVM resources
(CL_MEM_SVM_FINE_GRAIN_BUFFER). These resources contain descriptors of
computational work that needs to be executed. kern_master analyzes incoming
work descriptors and populates a plain OCL buffer (non-fine-grain) with payload
for kern_worker. Once kern_master is done, kern_worker kicks in and processes
the payload that kern_master produced. These two kernels work in a loop, one
after another. Since only kern_master requires coherency, kern_worker should
not be forced to pay for it. This means that we need to have the ability to
toggle the coherency switch on or off for each GPU submission:
(ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY) kern_worker -> (ENABLE
COHERENCY) kern_master -> (DISABLE COHERENCY) kern_worker -> ...
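
The toggling pattern above can be sketched as a small C model. Everything here
(the kernel_id enum, struct submission, build_schedule()) is a hypothetical
illustration of the scheduling pattern, not the proposed uAPI.

```c
#include <stdbool.h>
#include <stddef.h>

enum kernel_id { KERN_MASTER, KERN_WORKER };

/* Hypothetical per-submission record: which kernel runs, and with what setting. */
struct submission {
	enum kernel_id kernel;
	bool coherent;
};

/* Only kern_master touches fine-grain SVM resources. */
static bool kernel_needs_coherency(enum kernel_id k)
{
	return k == KERN_MASTER;
}

/*
 * Fill out[] with `iterations` master/worker pairs (2 * iterations entries)
 * and return how many times the coherency switch changes state.
 */
static size_t build_schedule(struct submission *out, size_t iterations)
{
	size_t toggles = 0;
	bool current = false;	/* coherency is disabled by default */

	for (size_t i = 0; i < 2 * iterations; i++) {
		enum kernel_id k = (i % 2 == 0) ? KERN_MASTER : KERN_WORKER;
		bool want = kernel_needs_coherency(k);

		if (want != current) {
			toggles++;
			current = want;
		}
		out[i].kernel = k;
		out[i].coherent = want;
	}
	return toggles;
}
```

In this model the switch changes state on every submission, which is why the
control needs to be cheap and available at per-submission granularity.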

Tomasz Lis (1):
  drm/i915: Add Exec param to control data port coherency.

 drivers/gpu/drm/i915/i915_drv.c            |  3 ++
 drivers/gpu/drm/i915/i915_gem_context.h    |  1 +
 drivers/gpu/drm/i915/i915_gem_execbuffer.c |  5 +++
 drivers/gpu/drm/i915/intel_lrc.c           | 56 ++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/intel_lrc.h           |  3 ++
 include/uapi/drm/i915_drm.h                | 12 ++++++-
 6 files changed, 79 insertions(+), 1 deletion(-)

--
2.7.4
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

* [PATCH v1] drm/i915/gen11: Preempt-to-idle support in execlists.
@ 2018-03-27 15:17 Tomasz Lis
  2018-07-16 13:07 ` [PATCH v6] drm/i915: Add IOCTL Param to control data port coherency Tomasz Lis
  0 siblings, 1 reply; 88+ messages in thread
From: Tomasz Lis @ 2018-03-27 15:17 UTC (permalink / raw)
  To: intel-gfx; +Cc: mika.kuoppala

This patch adds support for requesting preempt-to-idle by setting the
appropriate bit in the Execlist Control Register, and for receiving the
preemption result from the Context Status Buffer.

Preemption in previous gens required a special batch buffer to be executed,
so the Command Streamer never preempted to idle directly. In Icelake it is
possible, as there is a hardware mechanism to inform the kernel about the
status of the preemption request.

This patch does not cover using the new preemption mechanism when GuC is
active.
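
For readers who want the status handling in isolation, here is a standalone
sketch in C of the two checks the tasklet performs on a Context Status Buffer
entry. The macro and function names are made up for this example. The
ACTIVE_IDLE, COMPLETE and PREEMPT_IDLE bit positions are taken from the diff in
this patch; the IDLE_ACTIVE and PREEMPTED positions are assumed to match the
existing definitions in intel_lrc.c.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative copies of the status bits (see the assumptions above). */
#define CTX_STATUS_IDLE_ACTIVE	(1u << 0)
#define CTX_STATUS_PREEMPTED	(1u << 1)
#define CTX_STATUS_ACTIVE_IDLE	(1u << 3)
#define CTX_STATUS_COMPLETE	(1u << 4)
#define CTX_STATUS_PREEMPT_IDLE	(1u << 29)

/*
 * A switch to active (or a preemption to another active context) acks the
 * last submission, unless the hardware reports it preempted to idle.
 */
static bool csb_acks_submission(uint32_t status)
{
	return (status & (CTX_STATUS_IDLE_ACTIVE | CTX_STATUS_PREEMPTED)) &&
	       !(status & CTX_STATUS_PREEMPT_IDLE);
}

/*
 * Preemption has finished when either the hardware reports preempt-to-idle
 * directly, or the injected preempt context has completed.
 */
static bool csb_preempt_done(uint32_t status, uint32_t ctx_id,
			     uint32_t preempt_ctx_id)
{
	if (status & CTX_STATUS_PREEMPT_IDLE)
		return true;
	return (status & CTX_STATUS_COMPLETE) && ctx_id == preempt_ctx_id;
}
```

The second helper covers both paths at once: the new Gen11 hardware path and
the older path that waits for the preempt context to complete.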

Bspec: 18922
Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h          |  2 ++
 drivers/gpu/drm/i915/i915_pci.c          |  3 ++-
 drivers/gpu/drm/i915/intel_device_info.h |  1 +
 drivers/gpu/drm/i915/intel_lrc.c         | 45 +++++++++++++++++++++++++++-----
 drivers/gpu/drm/i915/intel_lrc.h         |  1 +
 5 files changed, 45 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 800230b..c32580b 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2514,6 +2514,8 @@ intel_info(const struct drm_i915_private *dev_priv)
 		((dev_priv)->info.has_logical_ring_elsq)
 #define HAS_LOGICAL_RING_PREEMPTION(dev_priv) \
 		((dev_priv)->info.has_logical_ring_preemption)
+#define HAS_HW_PREEMPT_TO_IDLE(dev_priv) \
+		((dev_priv)->info.has_hw_preempt_to_idle)
 
 #define HAS_EXECLISTS(dev_priv) HAS_LOGICAL_RING_CONTEXTS(dev_priv)
 
diff --git a/drivers/gpu/drm/i915/i915_pci.c b/drivers/gpu/drm/i915/i915_pci.c
index 4364922..66b6700 100644
--- a/drivers/gpu/drm/i915/i915_pci.c
+++ b/drivers/gpu/drm/i915/i915_pci.c
@@ -595,7 +595,8 @@ static const struct intel_device_info intel_cannonlake_info = {
 	GEN(11), \
 	.ddb_size = 2048, \
 	.has_csr = 0, \
-	.has_logical_ring_elsq = 1
+	.has_logical_ring_elsq = 1, \
+	.has_hw_preempt_to_idle = 1
 
 static const struct intel_device_info intel_icelake_11_info = {
 	GEN11_FEATURES,
diff --git a/drivers/gpu/drm/i915/intel_device_info.h b/drivers/gpu/drm/i915/intel_device_info.h
index 933e316..4eb97b5 100644
--- a/drivers/gpu/drm/i915/intel_device_info.h
+++ b/drivers/gpu/drm/i915/intel_device_info.h
@@ -98,6 +98,7 @@ enum intel_platform {
 	func(has_logical_ring_contexts); \
 	func(has_logical_ring_elsq); \
 	func(has_logical_ring_preemption); \
+	func(has_hw_preempt_to_idle); \
 	func(has_overlay); \
 	func(has_pooled_eu); \
 	func(has_psr); \
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index ba7f783..1a22de4 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -153,6 +153,7 @@
 #define GEN8_CTX_STATUS_ACTIVE_IDLE	(1 << 3)
 #define GEN8_CTX_STATUS_COMPLETE	(1 << 4)
 #define GEN8_CTX_STATUS_LITE_RESTORE	(1 << 15)
+#define GEN11_CTX_STATUS_PREEMPT_IDLE	(1 << 29)
 
 #define GEN8_CTX_STATUS_COMPLETED_MASK \
 	 (GEN8_CTX_STATUS_COMPLETE | GEN8_CTX_STATUS_PREEMPTED)
@@ -183,7 +184,9 @@ static inline bool need_preempt(const struct intel_engine_cs *engine,
 				const struct i915_request *last,
 				int prio)
 {
-	return engine->i915->preempt_context && prio > max(rq_prio(last), 0);
+	return (engine->i915->preempt_context ||
+		HAS_HW_PREEMPT_TO_IDLE(engine->i915)) &&
+		 prio > max(rq_prio(last), 0);
 }
 
 /**
@@ -535,6 +538,25 @@ static void inject_preempt_context(struct intel_engine_cs *engine)
 	execlists_set_active(&engine->execlists, EXECLISTS_ACTIVE_PREEMPT);
 }
 
+static void gen11_preempt_to_idle(struct intel_engine_cs *engine)
+{
+	struct intel_engine_execlists *execlists = &engine->execlists;
+
+	GEM_TRACE("%s\n", engine->name);
+
+	/*
+	 * Hardware which HAS_HW_PREEMPT_TO_IDLE() always also
+	 * HAS_LOGICAL_RING_ELSQ(), so we can assume ctrl_reg is set.
+	 */
+	GEM_BUG_ON(execlists->ctrl_reg == NULL);
+
+	/* trigger preemption to idle */
+	writel(EL_CTRL_PREEMPT_TO_IDLE, execlists->ctrl_reg);
+
+	execlists_clear_active(execlists, EXECLISTS_ACTIVE_HWACK);
+	execlists_set_active(execlists, EXECLISTS_ACTIVE_PREEMPT);
+}
+
 static void execlists_dequeue(struct intel_engine_cs *engine)
 {
 	struct intel_engine_execlists * const execlists = &engine->execlists;
@@ -594,7 +616,10 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
 			goto unlock;
 
 		if (need_preempt(engine, last, execlists->queue_priority)) {
-			inject_preempt_context(engine);
+			if (HAS_HW_PREEMPT_TO_IDLE(engine->i915))
+				gen11_preempt_to_idle(engine);
+			else
+				inject_preempt_context(engine);
 			goto unlock;
 		}
 
@@ -962,10 +987,13 @@ static void execlists_submission_tasklet(unsigned long data)
 				  status, buf[2*head + 1],
 				  execlists->active);
 
-			if (status & (GEN8_CTX_STATUS_IDLE_ACTIVE |
-				      GEN8_CTX_STATUS_PREEMPTED))
+			/* Check if switched to active or preempted to active */
+			if ((status & (GEN8_CTX_STATUS_IDLE_ACTIVE |
+					GEN8_CTX_STATUS_PREEMPTED)) &&
+			    !(status & GEN11_CTX_STATUS_PREEMPT_IDLE))
 				execlists_set_active(execlists,
 						     EXECLISTS_ACTIVE_HWACK);
+
 			if (status & GEN8_CTX_STATUS_ACTIVE_IDLE)
 				execlists_clear_active(execlists,
 						       EXECLISTS_ACTIVE_HWACK);
@@ -976,8 +1004,13 @@ static void execlists_submission_tasklet(unsigned long data)
 			/* We should never get a COMPLETED | IDLE_ACTIVE! */
 			GEM_BUG_ON(status & GEN8_CTX_STATUS_IDLE_ACTIVE);
 
-			if (status & GEN8_CTX_STATUS_COMPLETE &&
-			    buf[2*head + 1] == execlists->preempt_complete_status) {
+			/*
+			 * Check if preempted to real idle, either directly or
+			 * the preemptive context already finished executing
+			 */
+			if ((status & GEN11_CTX_STATUS_PREEMPT_IDLE) ||
+			    (status & GEN8_CTX_STATUS_COMPLETE &&
+			    buf[2*head + 1] == execlists->preempt_complete_status)) {
 				GEM_TRACE("%s preempt-idle\n", engine->name);
 
 				execlists_cancel_port_requests(execlists);
diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h
index 59d7b86..958d1b3 100644
--- a/drivers/gpu/drm/i915/intel_lrc.h
+++ b/drivers/gpu/drm/i915/intel_lrc.h
@@ -45,6 +45,7 @@
 #define RING_EXECLIST_SQ_CONTENTS(engine)	_MMIO((engine)->mmio_base + 0x510)
 #define RING_EXECLIST_CONTROL(engine)		_MMIO((engine)->mmio_base + 0x550)
 #define	  EL_CTRL_LOAD				(1 << 0)
+#define	  EL_CTRL_PREEMPT_TO_IDLE		(1 << 1)
 
 /* The docs specify that the write pointer wraps around after 5h, "After status
  * is written out to the last available status QW at offset 5h, this pointer
-- 
2.7.4


Thread overview: 88+ messages
2018-03-19 12:37 [RFC v1] Data port coherency control for UMDs Tomasz Lis
2018-03-19 12:37 ` [RFC v1] drm/i915: Add Exec param to control data port coherency Tomasz Lis
2018-03-19 12:43   ` Chris Wilson
2018-03-19 14:14     ` Lis, Tomasz
2018-03-19 14:26       ` Chris Wilson
2018-03-20 17:23         ` Lis, Tomasz
2018-05-04  9:24           ` Joonas Lahtinen
2018-03-20 18:43       ` Oscar Mateo
2018-03-21 10:16         ` Chris Wilson
2018-03-21 19:42           ` Oscar Mateo
2018-03-27 17:41             ` Lis, Tomasz
2018-03-30 17:29   ` [PATCH " Tomasz Lis
2018-03-31 19:07     ` kbuild test robot
2018-04-11 15:46   ` [PATCH v2] " Tomasz Lis
2018-06-20 15:03   ` [PATCH v1] Second implementation of Data Port Coherency Tomasz Lis
2018-06-20 15:03     ` [PATCH v1] drm/i915: Add IOCTL Param to control data port coherency Tomasz Lis
2018-06-21  6:39       ` Joonas Lahtinen
2018-06-21 13:47         ` Lis, Tomasz
2018-07-18 13:03           ` Joonas Lahtinen
2018-06-21  7:05       ` Chris Wilson
2018-06-21 13:47         ` Lis, Tomasz
2018-06-21  7:31       ` Dunajski, Bartosz
2018-06-21  8:48         ` Joonas Lahtinen
2018-06-22 16:40           ` Dunajski, Bartosz
2018-07-18 13:12             ` Joonas Lahtinen
2018-07-18 13:27               ` Dunajski, Bartosz
2018-07-09 13:20   ` [PATCH v4] " Tomasz Lis
2018-07-09 13:48     ` Lionel Landwerlin
2018-07-09 14:03       ` Lis, Tomasz
2018-07-09 14:24         ` Lionel Landwerlin
2018-07-09 15:21           ` Lis, Tomasz
2018-07-09 16:28     ` Tvrtko Ursulin
2018-07-09 16:37       ` Chris Wilson
2018-07-10 17:32         ` Lis, Tomasz
2018-07-11  9:28           ` Tvrtko Ursulin
2018-07-10 18:03       ` Lis, Tomasz
2018-07-11 11:20         ` Lis, Tomasz
2018-07-12 15:10   ` [PATCH v5] " Tomasz Lis
2018-07-13 10:40     ` Tvrtko Ursulin
2018-07-13 17:44       ` Lis, Tomasz
2018-10-09 18:06   ` [PATCH v6] " Tomasz Lis
2018-10-10  7:29     ` Tvrtko Ursulin
2018-10-12 15:02   ` [PATCH v8] " Tomasz Lis
2018-10-15 12:52     ` Tvrtko Ursulin
2018-10-16 13:59     ` Joonas Lahtinen
2018-03-19 13:53 ` [RFC v1] Data port coherency control for UMDs Joonas Lahtinen
2018-03-19 16:09   ` Lis, Tomasz
2018-03-20 15:15   ` Dunajski, Bartosz
2018-03-21 10:02     ` Joonas Lahtinen
2018-03-26  9:46       ` Dunajski, Bartosz
2018-03-29  7:42         ` Joonas Lahtinen
2018-03-30  9:00           ` Dunajski, Bartosz
2018-04-04  9:18             ` Joonas Lahtinen
2018-04-11  9:15               ` Dunajski, Bartosz
2018-03-19 14:18 ` ✗ Fi.CI.CHECKPATCH: warning for drm/i915: Add Exec param to control data port coherency Patchwork
2018-03-19 14:34 ` ✓ Fi.CI.BAT: success " Patchwork
2018-03-19 16:48 ` ✗ Fi.CI.IGT: failure " Patchwork
2018-03-30 18:14 ` ✗ Fi.CI.CHECKPATCH: warning for drm/i915: Add Exec param to control data port coherency. (rev2) Patchwork
2018-03-30 18:30 ` ✓ Fi.CI.BAT: success " Patchwork
2018-03-30 19:59 ` ✗ Fi.CI.IGT: failure " Patchwork
2018-04-11 16:12 ` ✗ Fi.CI.CHECKPATCH: warning for drm/i915: Add Exec param to control data port coherency. (rev3) Patchwork
2018-04-11 16:29 ` ✓ Fi.CI.BAT: success " Patchwork
2018-04-11 20:02 ` ✗ Fi.CI.IGT: failure " Patchwork
2018-06-20 15:45 ` ✗ Fi.CI.CHECKPATCH: warning for drm/i915: Add Exec param to control data port coherency. (rev4) Patchwork
2018-06-20 16:00 ` ✓ Fi.CI.BAT: success " Patchwork
2018-06-20 21:01 ` ✗ Fi.CI.IGT: failure " Patchwork
2018-07-09 13:57 ` ✗ Fi.CI.CHECKPATCH: warning for drm/i915: Add Exec param to control data port coherency. (rev5) Patchwork
2018-07-09 13:58 ` ✗ Fi.CI.SPARSE: " Patchwork
2018-07-09 14:14 ` ✓ Fi.CI.BAT: success " Patchwork
2018-07-09 20:04 ` ✗ Fi.CI.IGT: failure " Patchwork
2018-07-12 15:18 ` ✗ Fi.CI.CHECKPATCH: warning for drm/i915: Add Exec param to control data port coherency. (rev6) Patchwork
2018-07-12 15:19 ` ✗ Fi.CI.SPARSE: " Patchwork
2018-07-12 15:34 ` ✓ Fi.CI.BAT: success " Patchwork
2018-10-09 18:27 ` ✗ Fi.CI.CHECKPATCH: warning for drm/i915: Add Exec param to control data port coherency. (rev7) Patchwork
2018-10-09 18:28 ` ✗ Fi.CI.SPARSE: " Patchwork
2018-10-09 18:52 ` ✓ Fi.CI.BAT: success " Patchwork
2018-10-09 21:44 ` ✗ Fi.CI.IGT: failure " Patchwork
2018-10-12 15:14 ` ✗ Fi.CI.CHECKPATCH: warning for drm/i915: Add Exec param to control data port coherency. (rev8) Patchwork
2018-10-12 15:15 ` ✗ Fi.CI.SPARSE: " Patchwork
2018-10-12 15:34 ` ✓ Fi.CI.BAT: success " Patchwork
2018-10-12 18:27 ` ✗ Fi.CI.IGT: failure " Patchwork
2018-03-27 15:17 [PATCH v1] drm/i915/gen11: Preempt-to-idle support in execlists Tomasz Lis
2018-07-16 13:07 ` [PATCH v6] drm/i915: Add IOCTL Param to control data port coherency Tomasz Lis
2018-07-16 13:35   ` Tvrtko Ursulin
2018-07-18 13:24   ` Joonas Lahtinen
2018-07-18 14:42     ` Tvrtko Ursulin
2018-07-18 15:28       ` Lis, Tomasz
2018-07-19  7:12         ` Joonas Lahtinen
2018-07-19 15:10           ` Lis, Tomasz
