* [RFC v1] Data port coherency control for UMDs.
@ 2018-03-19 12:37 Tomasz Lis
  2018-03-19 12:37 ` [RFC v1] drm/i915: Add Exec param to control data port coherency Tomasz Lis
                   ` (28 more replies)
  0 siblings, 29 replies; 81+ messages in thread
From: Tomasz Lis @ 2018-03-19 12:37 UTC (permalink / raw)
  To: intel-gfx; +Cc: bartosz.dunajski

The OpenCL driver developers requested a functionality to control cache
coherency at data port level. Keeping the coherency at that level is disabled
by default due to its performance costs. OpenCL driver is planning to
enable it for a small subset of submissions, when such functionality is
required. Below are answers to three basic questions explaining background
of the functionality and rationale for the proposed implementation:

1. Why do we need a coherency enable/disable switch for memory that is shared
between CPU and GEN (GPU)?

Memory coherency between CPU and GEN, while being a great feature that enables
CL_MEM_SVM_FINE_GRAIN_BUFFER OCL capability on Intel GEN architecture, adds
overhead related to tracking (snooping) memory inside different cache units
(L1$, L2$, L3$, LLC$, etc.). At the same time, only a minority of modern OCL
applications actually use CL_MEM_SVM_FINE_GRAIN_BUFFER (and hence require
memory coherency between CPU and GPU). The goal of the coherency enable/disable
switch is to remove the overhead of memory coherency when memory coherency is not
needed.


2. Why do we need a global coherency switch?

In order to support I/O commands from within EUs (Execution Units), Intel GEN
ISA (GEN Instruction Set Assembly) contains dedicated "send" instructions.
These send instructions provide several addressing models. One of these
addressing models (named "stateless") provides most flexible I/O using plain
virtual addresses (as opposed to buffer_handle+offset models). This "stateless"
model is similar to regular memory load/store operations available on typical
CPUs. Since this model provides I/O using arbitrary virtual addresses, it
enables algorithmic designs that are based on pointer-to-pointer (e.g. buffer
of pointers) concepts. For instance, it allows creating tree-like data
structures such as:
                   ________________
                  |      NODE1     |
                  | uint64_t data  |
                  +----------------|
                  | NODE*  |  NODE*|
                  +--------+-------+
                    /              \                 
   ________________/                \________________
  |      NODE2     |                |      NODE3     |
  | uint64_t data  |                | uint64_t data  |
  +----------------|                +----------------|
  | NODE*  |  NODE*|                | NODE*  |  NODE*|
  +--------+-------+                +--------+-------+

Please note that pointers inside such structures can point to memory locations
in different OCL allocations, e.g. NODE1 and NODE2 can reside in one OCL
allocation while NODE3 resides in a completely separate OCL allocation.
Additionally, such pointers can be shared with CPU (i.e. using SVM - Shared
Virtual Memory feature). Using pointers from different allocations doesn't
affect the stateless addressing model which even allows scattered reading from
different allocations at the same time (i.e. by utilizing SIMD-nature of send
instructions).
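
As a rough illustration of the layout described above, the following C sketch
builds the three-node tree with NODE1 and NODE2 in one allocation and NODE3 in
a separate one. It is only a host-side analogy: plain calloc() stands in for
OCL allocations, and the names struct node, build_example() and tree_sum() are
hypothetical, not taken from any driver.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical node layout matching the diagram: payload plus two children. */
struct node {
	uint64_t data;
	struct node *left;
	struct node *right;
};

/* Sum all payloads by chasing raw pointers, regardless of which
 * allocation each node lives in. */
static uint64_t tree_sum(const struct node *n)
{
	if (!n)
		return 0;
	return n->data + tree_sum(n->left) + tree_sum(n->right);
}

/*
 * Build the tree from the diagram. NODE1 and NODE2 share one allocation,
 * NODE3 lives in a second one (calloc() is a stand-in for an OCL allocation).
 * Returns the first allocation (root) or NULL on failure; *second_alloc
 * receives the second allocation so the caller can free both.
 */
static struct node *build_example(struct node **second_alloc)
{
	struct node *alloc_a = calloc(2, sizeof(*alloc_a));
	struct node *alloc_b = calloc(1, sizeof(*alloc_b));

	assert(second_alloc);
	if (!alloc_a || !alloc_b) {
		free(alloc_a);
		free(alloc_b);
		return NULL;
	}

	alloc_a[0].data = 1;		/* NODE1 */
	alloc_a[1].data = 2;		/* NODE2 */
	alloc_b[0].data = 3;		/* NODE3 */
	alloc_a[0].left = &alloc_a[1];	/* same allocation */
	alloc_a[0].right = &alloc_b[0];	/* different allocation */

	*second_alloc = alloc_b;
	return alloc_a;
}
```

Nothing in tree_sum() can tell which allocation a given pointer came from,
which is the same position the OCL compiler is in for stateless sends.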

When it comes to coherency programming, send instructions in the stateless model
can be encoded (at ISA level) to either use or disable coherency. However, for
generic OCL applications (such as the example with the tree-like data structure),
the OCL compiler is not able to determine the origin of memory pointed to by an
arbitrary pointer, i.e. it is not able to track a given pointer back to a specific
allocation. As such, it's not able to decide whether coherency is needed or not
for a specific pointer (or for a specific I/O instruction). As a result, the compiler
encodes all stateless sends as coherent (doing otherwise would lead to
functional issues resulting from data corruption). Please note that it would be
possible to work around this (e.g. based on an allocations map and pointer bounds
checking prior to each I/O instruction) but the performance cost of such a
workaround would be many times greater than the cost of keeping coherency
always enabled. As such, enabling/disabling memory coherency at GEN ISA level
is not feasible and an alternative method is needed.
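
To make the cost of that workaround more concrete, here is a minimal C sketch
of what a per-access check against an allocations map could look like. It is
an assumption-laden illustration only: struct alloc_range and
needs_coherency() are hypothetical names, and a real implementation would run
inside the compute kernel rather than on the CPU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical allocations-map entry: one address range per OCL allocation. */
struct alloc_range {
	uintptr_t base;
	size_t size;
	bool fine_grain;	/* true for CL_MEM_SVM_FINE_GRAIN_BUFFER-like memory */
};

/*
 * Decide whether an access to addr needs coherency by searching the map.
 * This lookup would have to precede every stateless I/O instruction.
 */
static bool needs_coherency(const struct alloc_range *map, size_t n,
			    uintptr_t addr)
{
	assert(map || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (addr >= map[i].base && addr - map[i].base < map[i].size)
			return map[i].fine_grain;
	}

	/* Unknown pointer: stay on the safe side and treat it as coherent. */
	return true;
}
```

Even this simple linear search adds a loop and several compares in front of
each load or store, which is the overhead the paragraph above argues against.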

One such alternative is a global coherency switch that allows
disabling coherency for a single (though entire) GPU submission. This is
beneficial because this way we:
* can enable (and pay for) coherency only in submissions that actually need
coherency (submissions that use CL_MEM_SVM_FINE_GRAIN_BUFFER resources)
* don't care about coherency at GEN ISA granularity (no performance impact)

3. Will the coherency switch be used frequently?

There are scenarios that will require frequent toggling of the coherency
switch.
E.g. an application has two OCL compute kernels: kern_master and kern_worker.
kern_master uses, concurrently with CPU, some fine grain SVM resources
(CL_MEM_SVM_FINE_GRAIN_BUFFER). These resources contain descriptors of
computational work that needs to be executed. kern_master analyzes incoming
work descriptors and populates a plain OCL buffer (non-fine-grain) with payload
for kern_worker. Once kern_master is done, kern_worker kicks in and processes
the payload that kern_master produced. These two kernels work in a loop, one
after another. Since only kern_master requires coherency, kern_worker should
not be forced to pay for it. This means that we need to have the ability to
toggle the coherency switch on or off for each GPU submission:
(ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY) kern_worker -> (ENABLE
COHERENCY) kern_master -> (DISABLE COHERENCY) kern_worker -> ...
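
The toggling pattern above can be modelled with a small C sketch. It only
simulates the bookkeeping (remember the last programmed state, emit a switch
when the requested state differs); struct fake_context, submit() and
run_loop() are hypothetical names and do not correspond to i915 structures.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for per-context state; not the i915 structure. */
struct fake_context {
	bool coherent;		/* coherency state last programmed */
	unsigned int switches;	/* number of state-switch commands emitted */
};

/* Submit one kernel; emit a switch only if the requested state differs
 * from the state left behind by the previous submission. */
static void submit(struct fake_context *ctx, bool want_coherent)
{
	assert(ctx);

	if (ctx->coherent != want_coherent) {
		ctx->coherent = want_coherent;
		ctx->switches++;
	}
}

/* Run the kern_master (coherent) / kern_worker (non-coherent) loop and
 * return the total number of switches emitted so far. */
static unsigned int run_loop(struct fake_context *ctx, unsigned int iterations)
{
	for (unsigned int i = 0; i < iterations; i++) {
		submit(ctx, true);	/* kern_master */
		submit(ctx, false);	/* kern_worker */
	}
	return ctx->switches;
}
```

With coherency starting out disabled, every submission in this loop flips the
state, so the number of switches equals the number of submissions; back-to-back
submissions with the same requirement would emit none.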

Tomasz Lis (1):
  drm/i915: Add Exec param to control data port coherency.

 drivers/gpu/drm/i915/i915_drv.c            |  3 ++
 drivers/gpu/drm/i915/i915_gem_context.h    |  1 +
 drivers/gpu/drm/i915/i915_gem_execbuffer.c |  5 +++
 drivers/gpu/drm/i915/intel_lrc.c           | 56 ++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/intel_lrc.h           |  3 ++
 include/uapi/drm/i915_drm.h                | 12 ++++++-
 6 files changed, 79 insertions(+), 1 deletion(-)

--
2.7.4
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


* [RFC v1] drm/i915: Add Exec param to control data port coherency.
  2018-03-19 12:37 [RFC v1] Data port coherency control for UMDs Tomasz Lis
@ 2018-03-19 12:37 ` Tomasz Lis
  2018-03-19 12:43   ` Chris Wilson
                     ` (7 more replies)
  2018-03-19 13:53 ` [RFC v1] Data port coherency control for UMDs Joonas Lahtinen
                   ` (27 subsequent siblings)
  28 siblings, 8 replies; 81+ messages in thread
From: Tomasz Lis @ 2018-03-19 12:37 UTC (permalink / raw)
  To: intel-gfx; +Cc: bartosz.dunajski

The patch adds a parameter to control the data port coherency functionality
on a per-exec call basis. When data port coherency flag value is different
than what it was in previous call for the context, a command to switch data
port coherency state is added before the buffer to be executed.

Bspec: 11419
---
 drivers/gpu/drm/i915/i915_drv.c            |  3 ++
 drivers/gpu/drm/i915/i915_gem_context.h    |  1 +
 drivers/gpu/drm/i915/i915_gem_execbuffer.c | 17 ++++++++++
 drivers/gpu/drm/i915/intel_lrc.c           | 53 ++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/intel_lrc.h           |  3 ++
 include/uapi/drm/i915_drm.h                | 12 ++++++-
 6 files changed, 88 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 3df5193..fcb3547 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -436,6 +436,9 @@ static int i915_getparam_ioctl(struct drm_device *dev, void *data,
 	case I915_PARAM_CS_TIMESTAMP_FREQUENCY:
 		value = 1000 * INTEL_INFO(dev_priv)->cs_timestamp_frequency_khz;
 		break;
+	case I915_PARAM_HAS_EXEC_DATA_PORT_COHERENCY:
+		value = (INTEL_GEN(dev_priv) >= 9);
+		break;
 	default:
 		DRM_DEBUG("Unknown parameter %d\n", param->param);
 		return -EINVAL;
diff --git a/drivers/gpu/drm/i915/i915_gem_context.h b/drivers/gpu/drm/i915/i915_gem_context.h
index 7854262..00aa309 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.h
+++ b/drivers/gpu/drm/i915/i915_gem_context.h
@@ -118,6 +118,7 @@ struct i915_gem_context {
 #define CONTEXT_BANNABLE		3
 #define CONTEXT_BANNED			4
 #define CONTEXT_FORCE_SINGLE_SUBMISSION	5
+#define CONTEXT_DATA_PORT_COHERENT	6
 
 	/**
 	 * @hw_id: - unique identifier for the context
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index 8c170db..f848f14 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -2245,6 +2245,18 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 		eb.batch_flags |= I915_DISPATCH_RS;
 	}
 
+	if (args->flags & I915_EXEC_DATA_PORT_COHERENT) {
+		if (INTEL_GEN(eb.i915) < 9) {
+			DRM_DEBUG("Data Port Coherency is only allowed for Gen9 and above\n");
+			return -EINVAL;
+		}
+		if (eb.engine->class != RENDER_CLASS) {
+			DRM_DEBUG("Data Port Coherency is not available on %s\n",
+				 eb.engine->name);
+			return -EINVAL;
+		}
+	}
+
 	if (args->flags & I915_EXEC_FENCE_IN) {
 		in_fence = sync_file_get_fence(lower_32_bits(args->rsvd2));
 		if (!in_fence)
@@ -2371,6 +2383,11 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 		goto err_batch_unpin;
 	}
 
+	/* Emit the switch of data port coherency state if needed */
+	err = intel_lr_context_modify_data_port_coherency(eb.request,
+			(args->flags & I915_EXEC_DATA_PORT_COHERENT) != 0);
+	GEM_WARN_ON(err);
+
 	if (in_fence) {
 		err = i915_request_await_dma_fence(eb.request, in_fence);
 		if (err < 0)
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 53f1c00..b847798 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -254,6 +254,59 @@ intel_lr_context_descriptor_update(struct i915_gem_context *ctx,
 	ce->lrc_desc = desc;
 }
 
+static int emit_set_data_port_coherency(struct i915_request *req, bool enable)
+{
+	u32 *cs;
+	i915_reg_t reg;
+
+	GEM_BUG_ON(req->engine->class != RENDER_CLASS);
+	GEM_BUG_ON(INTEL_GEN(req->i915) < 9);
+
+	cs = intel_ring_begin(req, 4);
+	if (IS_ERR(cs))
+		return PTR_ERR(cs);
+
+	if (INTEL_GEN(req->i915) >= 10)
+		reg = CNL_HDC_CHICKEN0;
+	else
+		reg = HDC_CHICKEN0;
+
+	*cs++ = MI_LOAD_REGISTER_IMM(1);
+	*cs++ = i915_mmio_reg_offset(reg);
+	/* Enabling coherency means disabling the bit which forces it off */
+	if (enable)
+		*cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
+	else
+		*cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
+	*cs++ = MI_NOOP;
+
+	intel_ring_advance(req, cs);
+
+	return 0;
+}
+
+int
+intel_lr_context_modify_data_port_coherency(struct i915_request *req,
+					bool enable)
+{
+	struct i915_gem_context *ctx = req->ctx;
+	int ret;
+
+	if (test_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags) == enable)
+		return 0;
+
+	ret = emit_set_data_port_coherency(req, enable);
+
+	if (!ret) {
+		if (enable)
+			__set_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
+		else
+			__clear_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
+	}
+
+	return ret;
+}
+
 static struct i915_priolist *
 lookup_priolist(struct intel_engine_cs *engine,
 		struct i915_priotree *pt,
diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h
index 59d7b86..c46b239 100644
--- a/drivers/gpu/drm/i915/intel_lrc.h
+++ b/drivers/gpu/drm/i915/intel_lrc.h
@@ -111,4 +111,7 @@ intel_lr_context_descriptor(struct i915_gem_context *ctx,
 	return ctx->engine[engine->id].lrc_desc;
 }
 
+int intel_lr_context_modify_data_port_coherency(struct i915_request *req,
+						bool enable);
+
 #endif /* _INTEL_LRC_H_ */
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index 7f5634c..a5fed1f 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -529,6 +529,11 @@ typedef struct drm_i915_irq_wait {
  */
 #define I915_PARAM_CS_TIMESTAMP_FREQUENCY 51
 
+/* Query whether DRM_I915_GEM_EXECBUFFER2 supports the ability to switch
+ * Data Cache access into Data Port Coherency mode.
+ */
+#define I915_PARAM_HAS_EXEC_DATA_PORT_COHERENCY 52
+
 typedef struct drm_i915_getparam {
 	__s32 param;
 	/*
@@ -1048,7 +1053,12 @@ struct drm_i915_gem_execbuffer2 {
  */
 #define I915_EXEC_FENCE_ARRAY   (1<<19)
 
-#define __I915_EXEC_UNKNOWN_FLAGS (-(I915_EXEC_FENCE_ARRAY<<1))
+/* Data Port Coherency capability will be switched before an exec call
+ * which has this flag different than previous call for the context.
+ */
+#define I915_EXEC_DATA_PORT_COHERENT   (1<<20)
+
+#define __I915_EXEC_UNKNOWN_FLAGS (-(I915_EXEC_DATA_PORT_COHERENT<<1))
 
 #define I915_EXEC_CONTEXT_ID_MASK	(0xffffffff)
 #define i915_execbuffer2_set_context_id(eb2, context) \
-- 
2.7.4


* Re: [RFC v1] drm/i915: Add Exec param to control data port coherency.
  2018-03-19 12:37 ` [RFC v1] drm/i915: Add Exec param to control data port coherency Tomasz Lis
@ 2018-03-19 12:43   ` Chris Wilson
  2018-03-19 14:14     ` Lis, Tomasz
  2018-03-30 17:29   ` [PATCH " Tomasz Lis
                     ` (6 subsequent siblings)
  7 siblings, 1 reply; 81+ messages in thread
From: Chris Wilson @ 2018-03-19 12:43 UTC (permalink / raw)
  To: Tomasz Lis, intel-gfx; +Cc: bartosz.dunajski

Quoting Tomasz Lis (2018-03-19 12:37:35)
> The patch adds a parameter to control the data port coherency functionality
> on a per-exec call basis. When data port coherency flag value is different
> than what it was in previous call for the context, a command to switch data
> port coherency state is added before the buffer to be executed.

So this is part of the context? Why do it at exec level? If exec level
is desired, why not whitelist it?
-Chris

* Re: [RFC v1] Data port coherency control for UMDs.
  2018-03-19 12:37 [RFC v1] Data port coherency control for UMDs Tomasz Lis
  2018-03-19 12:37 ` [RFC v1] drm/i915: Add Exec param to control data port coherency Tomasz Lis
@ 2018-03-19 13:53 ` Joonas Lahtinen
  2018-03-19 16:09   ` Lis, Tomasz
  2018-03-20 15:15   ` Dunajski, Bartosz
  2018-03-19 14:18 ` ✗ Fi.CI.CHECKPATCH: warning for drm/i915: Add Exec param to control data port coherency Patchwork
                   ` (26 subsequent siblings)
  28 siblings, 2 replies; 81+ messages in thread
From: Joonas Lahtinen @ 2018-03-19 13:53 UTC (permalink / raw)
  To: Tomasz Lis, intel-gfx, Dave Airlie; +Cc: bartosz.dunajski

+ Dave, as FYI

Quoting Tomasz Lis (2018-03-19 14:37:34)
> The OpenCL driver developers requested a functionality to control cache
> coherency at data port level. Keeping the coherency at that level is disabled
> by default due to its performance costs. OpenCL driver is planning to
> enable it for a small subset of submissions, when such functionality is
> required.

Can you please link to the corresponding OpenCL driver changes? I'm
assuming this relates to the new-driver-to-be-adopted, instead of
Beignet?

How is the story/schedule looking for adopting the new driver to
distros?

Seeing the userspace counterpart and tests will help in assessing the
suggested changes.

Regards, Joonas

* Re: [RFC v1] drm/i915: Add Exec param to control data port coherency.
  2018-03-19 12:43   ` Chris Wilson
@ 2018-03-19 14:14     ` Lis, Tomasz
  2018-03-19 14:26       ` Chris Wilson
  2018-03-20 18:43       ` Oscar Mateo
  0 siblings, 2 replies; 81+ messages in thread
From: Lis, Tomasz @ 2018-03-19 14:14 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx; +Cc: bartosz.dunajski



On 2018-03-19 13:43, Chris Wilson wrote:
> Quoting Tomasz Lis (2018-03-19 12:37:35)
>> The patch adds a parameter to control the data port coherency functionality
>> on a per-exec call basis. When data port coherency flag value is different
>> than what it was in previous call for the context, a command to switch data
>> port coherency state is added before the buffer to be executed.
> So this is part of the context? Why do it at exec level?

It is part of the context, stored within HDC chicken bit register.
The exec level was requested by the OCL team, due to concerns about 
performance cost of context setparam calls.

>   If exec level
> is desired, why not whitelist it?
> -Chris

If we have no issue in whitelisting the register, I'm sure OCL will 
agree to that.
I assumed the whitelisting will be unacceptable because of security 
concerns with some options.
The register also changes its position and content between gens, which 
makes whitelisting hard to manage.

The main purpose of chicken bit registers, in general, is to allow
workarounds for hardware features which could be buggy or could have
unintended influence on the platform.
The data port coherency functionality landed there for the same reasons; 
then it twisted itself in a way that we now need user space to switch it.
Is it really ok to whitelist chicken bit registers?
-Tomasz


* ✗ Fi.CI.CHECKPATCH: warning for drm/i915: Add Exec param to control data port coherency.
  2018-03-19 12:37 [RFC v1] Data port coherency control for UMDs Tomasz Lis
  2018-03-19 12:37 ` [RFC v1] drm/i915: Add Exec param to control data port coherency Tomasz Lis
  2018-03-19 13:53 ` [RFC v1] Data port coherency control for UMDs Joonas Lahtinen
@ 2018-03-19 14:18 ` Patchwork
  2018-03-19 14:34 ` ✓ Fi.CI.BAT: success " Patchwork
                   ` (25 subsequent siblings)
  28 siblings, 0 replies; 81+ messages in thread
From: Patchwork @ 2018-03-19 14:18 UTC (permalink / raw)
  To: Lis, Tomasz; +Cc: intel-gfx

== Series Details ==

Series: drm/i915: Add Exec param to control data port coherency.
URL   : https://patchwork.freedesktop.org/series/40181/
State : warning

== Summary ==

$ dim checkpatch origin/drm-tip
0c30a5422174 drm/i915: Add Exec param to control data port coherency.
-:54: CHECK:PARENTHESIS_ALIGNMENT: Alignment should match open parenthesis
#54: FILE: drivers/gpu/drm/i915/i915_gem_execbuffer.c:2255:
+			DRM_DEBUG("Data Port Coherency is not available on %s\n",
+				 eb.engine->name);

-:68: CHECK:PARENTHESIS_ALIGNMENT: Alignment should match open parenthesis
#68: FILE: drivers/gpu/drm/i915/i915_gem_execbuffer.c:2388:
+	err = intel_lr_context_modify_data_port_coherency(eb.request,
+			(args->flags & I915_EXEC_DATA_PORT_COHERENT) != 0);

-:115: CHECK:PARENTHESIS_ALIGNMENT: Alignment should match open parenthesis
#115: FILE: drivers/gpu/drm/i915/intel_lrc.c:290:
+intel_lr_context_modify_data_port_coherency(struct i915_request *req,
+					bool enable)

-:174: CHECK:SPACING: spaces preferred around that '<<' (ctx:VxV)
#174: FILE: include/uapi/drm/i915_drm.h:1059:
+#define I915_EXEC_DATA_PORT_COHERENT   (1<<20)
                                          ^

-:176: CHECK:SPACING: spaces preferred around that '<<' (ctx:VxV)
#176: FILE: include/uapi/drm/i915_drm.h:1061:
+#define __I915_EXEC_UNKNOWN_FLAGS (-(I915_EXEC_DATA_PORT_COHERENT<<1))
                                                                  ^

-:179: ERROR:MISSING_SIGN_OFF: Missing Signed-off-by: line(s)

total: 1 errors, 0 warnings, 5 checks, 135 lines checked


* Re: [RFC v1] drm/i915: Add Exec param to control data port coherency.
  2018-03-19 14:14     ` Lis, Tomasz
@ 2018-03-19 14:26       ` Chris Wilson
  2018-03-20 17:23         ` Lis, Tomasz
  2018-03-20 18:43       ` Oscar Mateo
  1 sibling, 1 reply; 81+ messages in thread
From: Chris Wilson @ 2018-03-19 14:26 UTC (permalink / raw)
  To: Lis, Tomasz, intel-gfx; +Cc: bartosz.dunajski

Quoting Lis, Tomasz (2018-03-19 14:14:19)
> 
> 
> On 2018-03-19 13:43, Chris Wilson wrote:
> > Quoting Tomasz Lis (2018-03-19 12:37:35)
> >> The patch adds a parameter to control the data port coherency functionality
> >> on a per-exec call basis. When data port coherency flag value is different
> >> than what it was in previous call for the context, a command to switch data
> >> port coherency state is added before the buffer to be executed.
> > So this is part of the context? Why do it at exec level?
> 
> It is part of the context, stored within HDC chicken bit register.
> The exec level was requested by the OCL team, due to concerns about 
> performance cost of context setparam calls.

What? Oh dear, oh dear, thrice oh dear. The context setparam would look
like:

	if (arg != context->value) {
		rq = request_alloc(context, RCS);
		cs = ring_begin(rq, 4);
		*cs++ = MI_LRI;
		*cs++ = reg;
		*cs++ = magic;
		*cs++ = MI_NOOP;
		request_add(rq);
		context->value = arg;
	}

The argument is whether stuffing it into a crowded, very frequently
executed execbuf is better than an irregular setparam. If they want to
flip it on every batch, use execbuf. If it's going to be very
infrequent, setparam.

That discussion must be part of the rationale in the commit log.

Otoh, execbuf3 would accept it as a command packet. Hmm.

> >   If exec level
> > is desired, why not whitelist it?
> 
> If we have no issue in whitelisting the register, I'm sure OCL will 
> agree to that.
> I assumed the whitelisting will be unacceptable because of security 
> concerns with some options.
> The register also changes its position and content between gens, which 
> makes whitelisting hard to manage.
> 
> The main purpose of chicken bit registers, in general, is to allow
> workarounds for hardware features which could be buggy or could have
> unintended influence on the platform.
> The data port coherency functionality landed there for the same reasons; 
> then it twisted itself in a way that we now need user space to switch it.
> Is it really ok to whitelist chicken bit registers?

It all depends on whether it breaks segregation. If the only users
affected are themselves, fine. Otherwise, no.
-Chris

* ✓ Fi.CI.BAT: success for drm/i915: Add Exec param to control data port coherency.
  2018-03-19 12:37 [RFC v1] Data port coherency control for UMDs Tomasz Lis
                   ` (2 preceding siblings ...)
  2018-03-19 14:18 ` ✗ Fi.CI.CHECKPATCH: warning for drm/i915: Add Exec param to control data port coherency Patchwork
@ 2018-03-19 14:34 ` Patchwork
  2018-03-19 16:48 ` ✗ Fi.CI.IGT: failure " Patchwork
                   ` (24 subsequent siblings)
  28 siblings, 0 replies; 81+ messages in thread
From: Patchwork @ 2018-03-19 14:34 UTC (permalink / raw)
  To: Lis, Tomasz; +Cc: intel-gfx

== Series Details ==

Series: drm/i915: Add Exec param to control data port coherency.
URL   : https://patchwork.freedesktop.org/series/40181/
State : success

== Summary ==

Series 40181v1 drm/i915: Add Exec param to control data port coherency.
https://patchwork.freedesktop.org/api/1.0/series/40181/revisions/1/mbox/

fi-bdw-5557u     total:285  pass:264  dwarn:0   dfail:0   fail:0   skip:21  time:430s
fi-bdw-gvtdvm    total:285  pass:261  dwarn:0   dfail:0   fail:0   skip:24  time:442s
fi-blb-e6850     total:285  pass:220  dwarn:1   dfail:0   fail:0   skip:64  time:382s
fi-bsw-n3050     total:285  pass:239  dwarn:0   dfail:0   fail:0   skip:46  time:540s
fi-bwr-2160      total:285  pass:180  dwarn:0   dfail:0   fail:0   skip:105 time:299s
fi-bxt-j4205     total:285  pass:256  dwarn:0   dfail:0   fail:0   skip:29  time:507s
fi-byt-j1900     total:285  pass:250  dwarn:0   dfail:0   fail:0   skip:35  time:515s
fi-byt-n2820     total:285  pass:246  dwarn:0   dfail:0   fail:0   skip:39  time:502s
fi-cfl-8700k     total:285  pass:257  dwarn:0   dfail:0   fail:0   skip:28  time:409s
fi-cfl-s2        total:285  pass:259  dwarn:0   dfail:0   fail:0   skip:26  time:579s
fi-cfl-u         total:285  pass:259  dwarn:0   dfail:0   fail:0   skip:26  time:510s
fi-cnl-drrs      total:285  pass:254  dwarn:3   dfail:0   fail:0   skip:28  time:542s
fi-elk-e7500     total:285  pass:225  dwarn:1   dfail:0   fail:0   skip:59  time:420s
fi-gdg-551       total:285  pass:176  dwarn:0   dfail:0   fail:1   skip:108 time:315s
fi-glk-1         total:285  pass:257  dwarn:0   dfail:0   fail:0   skip:28  time:533s
fi-hsw-4770      total:285  pass:258  dwarn:0   dfail:0   fail:0   skip:27  time:407s
fi-ilk-650       total:285  pass:225  dwarn:0   dfail:0   fail:0   skip:60  time:421s
fi-ivb-3520m     total:285  pass:256  dwarn:0   dfail:0   fail:0   skip:29  time:473s
fi-ivb-3770      total:285  pass:252  dwarn:0   dfail:0   fail:0   skip:33  time:430s
fi-kbl-7500u     total:285  pass:260  dwarn:1   dfail:0   fail:0   skip:24  time:477s
fi-kbl-7567u     total:285  pass:265  dwarn:0   dfail:0   fail:0   skip:20  time:465s
fi-kbl-r         total:285  pass:258  dwarn:0   dfail:0   fail:0   skip:27  time:515s
fi-pnv-d510      total:285  pass:219  dwarn:1   dfail:0   fail:0   skip:65  time:655s
fi-skl-6260u     total:285  pass:265  dwarn:0   dfail:0   fail:0   skip:20  time:440s
fi-skl-6600u     total:285  pass:258  dwarn:0   dfail:0   fail:0   skip:27  time:531s
fi-skl-6700hq    total:285  pass:259  dwarn:0   dfail:0   fail:0   skip:26  time:542s
fi-skl-6700k2    total:285  pass:261  dwarn:0   dfail:0   fail:0   skip:24  time:499s
fi-skl-6770hq    total:285  pass:265  dwarn:0   dfail:0   fail:0   skip:20  time:501s
fi-skl-guc       total:285  pass:257  dwarn:0   dfail:0   fail:0   skip:28  time:429s
fi-skl-gvtdvm    total:285  pass:262  dwarn:0   dfail:0   fail:0   skip:23  time:450s
fi-snb-2520m     total:285  pass:245  dwarn:0   dfail:0   fail:0   skip:40  time:581s
fi-snb-2600      total:285  pass:245  dwarn:0   dfail:0   fail:0   skip:40  time:400s

0c03d54dbbfe5f4f13cc03ce33b0af902bed64a6 drm-tip: 2018y-03m-19d-13h-45m-32s UTC integration manifest
0c30a5422174 drm/i915: Add Exec param to control data port coherency.

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_8391/issues.html

* Re: [RFC v1] Data port coherency control for UMDs.
  2018-03-19 13:53 ` [RFC v1] Data port coherency control for UMDs Joonas Lahtinen
@ 2018-03-19 16:09   ` Lis, Tomasz
  2018-03-20 15:15   ` Dunajski, Bartosz
  1 sibling, 0 replies; 81+ messages in thread
From: Lis, Tomasz @ 2018-03-19 16:09 UTC (permalink / raw)
  To: Joonas Lahtinen, intel-gfx, Dave Airlie; +Cc: bartosz.dunajski



On 2018-03-19 14:53, Joonas Lahtinen wrote:
> + Dave, as FYI
>
> Quoting Tomasz Lis (2018-03-19 14:37:34)
>> The OpenCL driver developers requested a functionality to control cache
>> coherency at data port level. Keeping the coherency at that level is disabled
>> by default due to its performance costs. OpenCL driver is planning to
>> enable it for a small subset of submissions, when such functionality is
>> required.
> Can you please link to the corresponding OpenCL driver changes? I'm
> assuming this relates to the new-driver-to-be-adopted, instead of
> Beignet?
It is for the new driver; I will ask the OCL developers to provide a link.
>
> How is the story/schedule looking for adopting the new driver to
> distros?
I guess that's another question for OCL guys, I don't know.
> Seeing the userspace counterpart and tests will help in assessing the
> suggested changes.
>
> Regards, Joonas
I prepared an IGT test for that, I will send it to a proper mailing list 
soon.
-Tomasz


* ✗ Fi.CI.IGT: failure for drm/i915: Add Exec param to control data port coherency.
  2018-03-19 12:37 [RFC v1] Data port coherency control for UMDs Tomasz Lis
                   ` (3 preceding siblings ...)
  2018-03-19 14:34 ` ✓ Fi.CI.BAT: success " Patchwork
@ 2018-03-19 16:48 ` Patchwork
  2018-03-30 18:14 ` ✗ Fi.CI.CHECKPATCH: warning for drm/i915: Add Exec param to control data port coherency. (rev2) Patchwork
                   ` (23 subsequent siblings)
  28 siblings, 0 replies; 81+ messages in thread
From: Patchwork @ 2018-03-19 16:48 UTC (permalink / raw)
  To: Lis, Tomasz; +Cc: intel-gfx

== Series Details ==

Series: drm/i915: Add Exec param to control data port coherency.
URL   : https://patchwork.freedesktop.org/series/40181/
State : failure

== Summary ==

---- Possible new issues:

Test gem_exec_params:
        Subgroup invalid-flag:
                pass       -> FAIL       (shard-apl)

---- Known issues:

Test kms_flip:
        Subgroup plain-flip-ts-check-interruptible:
                pass       -> FAIL       (shard-hsw) fdo#100368 +1
Test kms_frontbuffer_tracking:
        Subgroup fbc-1p-primscrn-shrfb-msflip-blt:
                fail       -> PASS       (shard-apl) fdo#104727
Test kms_sysfs_edid_timing:
                warn       -> PASS       (shard-apl) fdo#100047

fdo#100368 https://bugs.freedesktop.org/show_bug.cgi?id=100368
fdo#104727 https://bugs.freedesktop.org/show_bug.cgi?id=104727
fdo#100047 https://bugs.freedesktop.org/show_bug.cgi?id=100047

shard-apl        total:3442 pass:1814 dwarn:1   dfail:0   fail:8   skip:1619 time:13042s
shard-hsw        total:3442 pass:1767 dwarn:1   dfail:0   fail:2   skip:1671 time:11899s
shard-snb        total:3442 pass:1358 dwarn:1   dfail:0   fail:2   skip:2081 time:7170s
Blacklisted hosts:
shard-kbl        total:3442 pass:1937 dwarn:1   dfail:0   fail:10  skip:1494 time:9945s

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_8391/shards.html

* Re: [RFC v1] Data port coherency control for UMDs.
  2018-03-19 13:53 ` [RFC v1] Data port coherency control for UMDs Joonas Lahtinen
  2018-03-19 16:09   ` Lis, Tomasz
@ 2018-03-20 15:15   ` Dunajski, Bartosz
  2018-03-21 10:02     ` Joonas Lahtinen
  1 sibling, 1 reply; 81+ messages in thread
From: Dunajski, Bartosz @ 2018-03-20 15:15 UTC (permalink / raw)
  To: Joonas Lahtinen, Lis, Tomasz, intel-gfx, Dave Airlie

This functionality is used by the new OCL driver (a.k.a. NEO):
https://github.com/intel/compute-runtime 

Starting from commit: 933312e0986d3a7c1ef557e511eb4ced301ea292

-----Original Message-----
From: Joonas Lahtinen [mailto:joonas.lahtinen@linux.intel.com] 
Sent: Monday, March 19, 2018 2:54 PM
To: Lis, Tomasz <tomasz.lis@intel.com>; intel-gfx@lists.freedesktop.org; Dave Airlie <airlied@redhat.com>
Cc: Dunajski, Bartosz <bartosz.dunajski@intel.com>; chris@chris-wilson.co.uk; Winiarski, Michal <michal.winiarski@intel.com>
Subject: Re: [RFC v1] Data port coherency control for UMDs.

+ Dave, as FYI

Quoting Tomasz Lis (2018-03-19 14:37:34)
> The OpenCL driver developers requested a functionality to control cache 
> coherency at data port level. Keeping the coherency at that level is 
> disabled by default due to its performance costs. OpenCL driver is 
> planning to enable it for a small subset of submissions, when such 
> functionality is required.

Can you please link to the corresponding OpenCL driver changes? I'm assuming this relates to the new-driver-to-be-adopted, instead of Beignet?

How is the story/schedule looking for adopting the new driver to distros?

Seeing the userspace counterpart and tests will help in assessing the suggested changes.

Regards, Joonas

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC v1] drm/i915: Add Exec param to control data port coherency.
  2018-03-19 14:26       ` Chris Wilson
@ 2018-03-20 17:23         ` Lis, Tomasz
  2018-05-04  9:24           ` Joonas Lahtinen
  0 siblings, 1 reply; 81+ messages in thread
From: Lis, Tomasz @ 2018-03-20 17:23 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx; +Cc: bartosz.dunajski



On 2018-03-19 15:26, Chris Wilson wrote:
> Quoting Lis, Tomasz (2018-03-19 14:14:19)
>>
>> On 2018-03-19 13:43, Chris Wilson wrote:
>>> Quoting Tomasz Lis (2018-03-19 12:37:35)
>>>> The patch adds a parameter to control the data port coherency functionality
>>>> on a per-exec call basis. When data port coherency flag value is different
>>>> than what it was in previous call for the context, a command to switch data
>>>> port coherency state is added before the buffer to be executed.
>>> So this is part of the context? Why do it at exec level?
>> It is part of the context, stored within HDC chicken bit register.
>> The exec level was requested by the OCL team, due to concerns about
>> performance cost of context setparam calls.
> What? Oh dear, oh dear, thrice oh dear. The context setparam would look
> like:
>
> 	if (arg != context->value) {
> 		rq = request_alloc(context, RCS);
> 		cs = ring_begin(rq, 4);
> 		*cs++ = MI_LRI;
> 		*cs++ = reg;
> 		*cs++ = magic;
> 		*cs++ = MI_NOOP;
> 		request_add(rq);
> 		context->value = arg;
> 	}
>
> The argument is whether stuffing it into a crowded, v.frequently
> executed execbuf is better than an irregular setparam. If they want to
> flip it on every batch, use execbuf. If it's going to be very
> infrequent, setparam.
Implementing the data port coherency switch as context setparam would 
not be a problem, I agree.
But this is not a solution OCL is willing to accept. Any additional 
IOCTL call is a concern for the OCL developers.

For more explanation on switch frequency - please look at the cover 
letter I provided; here's the related part of it:
(note: the data port coherency is called fine grain coherency within UMD)

    3. Will coherency switch be used frequently?

    There are scenarios that will require frequent toggling of the coherency
    switch.
    E.g. an application has two OCL compute kernels: kern_master and kern_worker.
    kern_master uses, concurrently with CPU, some fine grain SVM resources
    (CL_MEM_SVM_FINE_GRAIN_BUFFER). These resources contain descriptors of
    computational work that needs to be executed. kern_master analyzes incoming
    work descriptors and populates a plain OCL buffer (non-fine-grain) with payload
    for kern_worker. Once kern_master is done, kern_worker kicks-in and processes
    the payload that kern_master produced. These two kernels work in a loop, one
    after another. Since only kern_master requires coherency, kern_worker should
    not be forced to pay for it. This means that we need to have the ability to
    toggle coherency switch on or off per each GPU submission:
    (ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> (ENABLE
    COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> ...
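
To make the pattern above concrete, here is a small sketch that walks a
submission sequence and counts how often the coherency state has to change.
It is only an illustration: struct demo_submission and demo_count_switches
are hypothetical names, not UMD or i915 code, and the starting state assumes
coherency is disabled by default as stated in the cover letter.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one GPU submission. */
struct demo_submission {
	const char *kernel;     /* kernel name, for readability only */
	bool needs_coherency;   /* true if it touches fine grain SVM */
};

/*
 * Count how many times the coherency state changes across a sequence of
 * submissions, starting from the default (disabled) state.
 */
static size_t demo_count_switches(const struct demo_submission *subs, size_t n)
{
	bool current = false;   /* coherency is disabled by default */
	size_t switches = 0;

	for (size_t i = 0; i < n; i++) {
		if (subs[i].needs_coherency != current) {
			current = subs[i].needs_coherency;
			switches++;
		}
	}
	return switches;
}
```

For the kern_master/kern_worker loop every submission flips the state, so
the number of switches equals the number of submissions. That is the case
where an extra IOCTL per flip would hurt most.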

> That discussion must be part of the rationale in the commitlog.
Will add.
Should I place the whole text from cover letter within the commit comment?
> Otoh, execbuf3 would accept it as a command packet. Hmm.
I know we have execbuf2, but execbuf3? Are you proposing to add 
something like that?
>>>    If exec level
>>> is desired, why not whitelist it?
>> If we have no issue in whitelisting the register, I'm sure OCL will
>> agree to that.
>> I assumed the whitelisting will be unacceptable because of security
>> concerns with some options.
>> The register also changes its position and content between gens, which
>> makes whitelisting hard to manage.
>>
>> Main purpose of chicken bit registers, in general, is to allow work
>> around for hardware features which could  be buggy or could have
>> unintended influence on the platform.
>> The data port coherency functionality landed there for the same reasons;
>> then it twisted itself in a way that we now need user space to switch it.
>> Is it really ok to whitelist chicken bit registers?
> It all depends on whether it breaks segregation. If the only users
> affected are themselves, fine. Otherwise, no.
> -Chris
Chicken bit registers are definitely not designed to be safe for userspace use.
While the meaning of the bits within HDC_CHICKEN0 changes between gens, I doubt
there is any such register that *can't* be used to cause a GPU hang.
-Tomasz



^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC v1] drm/i915: Add Exec param to control data port coherency.
  2018-03-19 14:14     ` Lis, Tomasz
  2018-03-19 14:26       ` Chris Wilson
@ 2018-03-20 18:43       ` Oscar Mateo
  2018-03-21 10:16         ` Chris Wilson
  1 sibling, 1 reply; 81+ messages in thread
From: Oscar Mateo @ 2018-03-20 18:43 UTC (permalink / raw)
  To: Lis, Tomasz, Chris Wilson, intel-gfx; +Cc: bartosz.dunajski



On 3/19/2018 7:14 AM, Lis, Tomasz wrote:
>
>
> On 2018-03-19 13:43, Chris Wilson wrote:
>> Quoting Tomasz Lis (2018-03-19 12:37:35)
>>> The patch adds a parameter to control the data port coherency 
>>> functionality
>>> on a per-exec call basis. When data port coherency flag value is 
>>> different
>>> than what it was in previous call for the context, a command to 
>>> switch data
>>> port coherency state is added before the buffer to be executed.
>> So this is part of the context? Why do it at exec level?
>
> It is part of the context, stored within HDC chicken bit register.
> The exec level was requested by the OCL team, due to concerns about 
> performance cost of context setparam calls.
>
>>   If exec level
>> is desired, why not whitelist it?
>> -Chris
>
> If we have no issue in whitelisting the register, I'm sure OCL will 
> agree to that.
> I assumed the whitelisting will be unacceptable because of security 
> concerns with some options.
> The register also changes its position and content between gens, which 
> makes whitelisting hard to manage.
>

I think a security analysis of this register was already done, and the 
result was that it contains some other bits that could be dangerous. In 
CNL those bits were moved out of the way and the HDC_CHICKEN0 register 
can be whitelisted (WaAllowUMDToControlCoherency). In ICL the register 
should already be non-privileged.

> Main purpose of chicken bit registers, in general, is to allow work 
> around for hardware features which could  be buggy or could have 
> unintended influence on the platform.
> The data port coherency functionality landed there for the same 
> reasons; then it twisted itself in a way that we now need user space 
> to switch it.
> Is it really ok to whitelist chicken bit registers?
> -Tomasz

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC v1] Data port coherency control for UMDs.
  2018-03-20 15:15   ` Dunajski, Bartosz
@ 2018-03-21 10:02     ` Joonas Lahtinen
  2018-03-26  9:46       ` Dunajski, Bartosz
  0 siblings, 1 reply; 81+ messages in thread
From: Joonas Lahtinen @ 2018-03-21 10:02 UTC (permalink / raw)
  To: Dunajski, Bartosz, Lis, Tomasz, intel-gfx, Dave Airlie, Jon Ewins

+ Jon, as we clearly have a disconnect between what was requested to be
done and what has been done.

Quoting Dunajski, Bartosz (2018-03-20 17:15:06)
> This functionality is used by the new OCL driver (a.k.a. NEO):
> https://github.com/intel/compute-runtime 
> 
> Starting from commit: 933312e0986d3a7c1ef557e511eb4ced301ea292

That's not how the changes were requested to be introduced. It's the
opposite of what was asked.

You should do such changes in a topic branch, not the master. The master
branch must always be using only what is in the latest upstream kernel.

Please read:

https://01.org/linuxgraphics/gfx-docs/drm/gpu/drm-uapi.html#open-source-userspace-requirements

Paying special attention to:

"The kernel patch can only be merged after all the above requirements
are met, but it must be merged before the userspace patches land. uAPI
always flows from the kernel, doing things the other way round risks
divergence of the uAPI definitions and header files."

The end-user should always be able to update to the latest bleeding edge
kernel without userspace breakage. That's not the case here because the
userspace is tied to special kernel version so the ABI is bound to break.

Regards, Joonas


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC v1] drm/i915: Add Exec param to control data port coherency.
  2018-03-20 18:43       ` Oscar Mateo
@ 2018-03-21 10:16         ` Chris Wilson
  2018-03-21 19:42           ` Oscar Mateo
  0 siblings, 1 reply; 81+ messages in thread
From: Chris Wilson @ 2018-03-21 10:16 UTC (permalink / raw)
  To: Oscar Mateo, Lis, Tomasz, intel-gfx; +Cc: bartosz.dunajski

Quoting Oscar Mateo (2018-03-20 18:43:45)
> 
> 
> On 3/19/2018 7:14 AM, Lis, Tomasz wrote:
> >
> >
> > On 2018-03-19 13:43, Chris Wilson wrote:
> >> Quoting Tomasz Lis (2018-03-19 12:37:35)
> >>> The patch adds a parameter to control the data port coherency 
> >>> functionality
> >>> on a per-exec call basis. When data port coherency flag value is 
> >>> different
> >>> than what it was in previous call for the context, a command to 
> >>> switch data
> >>> port coherency state is added before the buffer to be executed.
> >> So this is part of the context? Why do it at exec level?
> >
> > It is part of the context, stored within HDC chicken bit register.
> > The exec level was requested by the OCL team, due to concerns about 
> > performance cost of context setparam calls.
> >
> >>   If exec level
> >> is desired, why not whitelist it?
> >> -Chris
> >
> > If we have no issue in whitelisting the register, I'm sure OCL will 
> > agree to that.
> > I assumed the whitelisting will be unacceptable because of security 
> > concerns with some options.
> > The register also changes its position and content between gens, which 
> > makes whitelisting hard to manage.
> >
> 
> I think a security analysis of this register was already done, and the 
> result was that it contains some other bits that could be dangerous. In 
> CNL those bits were moved out of the way and the HDC_CHICKEN0 register 
> can be whitelisted (WaAllowUMDToControlCoherency). In ICL the register 
> should already be non-privileged.

The previous alternative to whitelisting was running through a command
parser for validation. That's a very general mechanism suitable for a
wide range of sins.
-Chris

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC v1] drm/i915: Add Exec param to control data port coherency.
  2018-03-21 10:16         ` Chris Wilson
@ 2018-03-21 19:42           ` Oscar Mateo
  2018-03-27 17:41             ` Lis, Tomasz
  0 siblings, 1 reply; 81+ messages in thread
From: Oscar Mateo @ 2018-03-21 19:42 UTC (permalink / raw)
  To: Chris Wilson, Lis, Tomasz, intel-gfx; +Cc: bartosz.dunajski



On 3/21/2018 3:16 AM, Chris Wilson wrote:
> Quoting Oscar Mateo (2018-03-20 18:43:45)
>>
>> On 3/19/2018 7:14 AM, Lis, Tomasz wrote:
>>>
>>> On 2018-03-19 13:43, Chris Wilson wrote:
>>>> Quoting Tomasz Lis (2018-03-19 12:37:35)
>>>>> The patch adds a parameter to control the data port coherency
>>>>> functionality
>>>>> on a per-exec call basis. When data port coherency flag value is
>>>>> different
>>>>> than what it was in previous call for the context, a command to
>>>>> switch data
>>>>> port coherency state is added before the buffer to be executed.
>>>> So this is part of the context? Why do it at exec level?
>>> It is part of the context, stored within HDC chicken bit register.
>>> The exec level was requested by the OCL team, due to concerns about
>>> performance cost of context setparam calls.
>>>
>>>>    If exec level
>>>> is desired, why not whitelist it?
>>>> -Chris
>>> If we have no issue in whitelisting the register, I'm sure OCL will
>>> agree to that.
>>> I assumed the whitelisting will be unacceptable because of security
>>> concerns with some options.
>>> The register also changes its position and content between gens, which
>>> makes whitelisting hard to manage.
>>>
>> I think a security analysis of this register was already done, and the
>> result was that it contains some other bits that could be dangerous. In
>> CNL those bits were moved out of the way and the HDC_CHICKEN0 register
>> can be whitelisted (WaAllowUMDToControlCoherency). In ICL the register
>> should already be non-privileged.
> The previous alternative to whitelisting was running through a command
> parser for validation. That's a very general mechanism suitable for a
> wide range of sins.
> -Chris

Are you suggesting that we enable the cmd parser for every Gen < CNL for 
this particular usage only? :P


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC v1] Data port coherency control for UMDs.
  2018-03-21 10:02     ` Joonas Lahtinen
@ 2018-03-26  9:46       ` Dunajski, Bartosz
  2018-03-29  7:42         ` Joonas Lahtinen
  0 siblings, 1 reply; 81+ messages in thread
From: Dunajski, Bartosz @ 2018-03-26  9:46 UTC (permalink / raw)
  To: Joonas Lahtinen, Lis, Tomasz, intel-gfx, Dave Airlie, Ewins, Jon

Here is the pull request showing how the patch is used:
https://github.com/intel/compute-runtime/pull/29

This patch is required to control data port coherency depending on the workload submitted through the OpenCL API (fine-grain SVM requirement).



^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC v1] drm/i915: Add Exec param to control data port coherency.
  2018-03-21 19:42           ` Oscar Mateo
@ 2018-03-27 17:41             ` Lis, Tomasz
  0 siblings, 0 replies; 81+ messages in thread
From: Lis, Tomasz @ 2018-03-27 17:41 UTC (permalink / raw)
  To: Oscar Mateo, Chris Wilson, intel-gfx; +Cc: bartosz.dunajski



On 2018-03-21 20:42, Oscar Mateo wrote:
>
>
> On 3/21/2018 3:16 AM, Chris Wilson wrote:
>> Quoting Oscar Mateo (2018-03-20 18:43:45)
>>>
>>> On 3/19/2018 7:14 AM, Lis, Tomasz wrote:
>>>>
>>>> On 2018-03-19 13:43, Chris Wilson wrote:
>>>>> Quoting Tomasz Lis (2018-03-19 12:37:35)
>>>>>> The patch adds a parameter to control the data port coherency
>>>>>> functionality
>>>>>> on a per-exec call basis. When data port coherency flag value is
>>>>>> different
>>>>>> than what it was in previous call for the context, a command to
>>>>>> switch data
>>>>>> port coherency state is added before the buffer to be executed.
>>>>> So this is part of the context? Why do it at exec level?
>>>> It is part of the context, stored within HDC chicken bit register.
>>>> The exec level was requested by the OCL team, due to concerns about
>>>> performance cost of context setparam calls.
>>>>
>>>>>    If exec level
>>>>> is desired, why not whitelist it?
>>>>> -Chris
>>>> If we have no issue in whitelisting the register, I'm sure OCL will
>>>> agree to that.
>>>> I assumed the whitelisting will be unacceptable because of security
>>>> concerns with some options.
>>>> The register also changes its position and content between gens, which
>>>> makes whitelisting hard to manage.
>>>>
>>> I think a security analysis of this register was already done, and the
>>> result was that it contains some other bits that could be dangerous. In
>>> CNL those bits were moved out of the way and the HDC_CHICKEN0 register
>>> can be whitelisted (WaAllowUMDToControlCoherency). In ICL the register
>>> should already be non-privileged.
>> The previous alternative to whitelisting was running through a command
>> parser for validation. That's a very general mechanism suitable for a
>> wide range of sins.
>> -Chris
>
> Are you suggesting that we enable the cmd parser for every Gen < CNL 
> for this particular usage only? :P
>
It is a solution that would let us do what we want without any
additions to the module interface.
It may be worth considering if we think the coherency setting is
temporary and will be removed in future gens, as we wouldn't want to be
left with obsolete flags.

I think the setting will stay with us, as it is needed to support the
CL_MEM_SVM_FINE_GRAIN_BUFFER flag from the OpenCL spec.
Keeping coherency will always cost performance, so we will likely always
have a hardware setting to switch it.
But the bspec says coherency override control will be removed in future
projects...



^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC v1] Data port coherency control for UMDs.
  2018-03-26  9:46       ` Dunajski, Bartosz
@ 2018-03-29  7:42         ` Joonas Lahtinen
  2018-03-30  9:00           ` Dunajski, Bartosz
  0 siblings, 1 reply; 81+ messages in thread
From: Joonas Lahtinen @ 2018-03-29  7:42 UTC (permalink / raw)
  To: Dunajski, Bartosz, Ewins, Jon, Lis, Tomasz, intel-gfx, Dave Airlie

Quoting Dunajski, Bartosz (2018-03-26 12:46:13)
> Here is pull request with patch usage:
> https://github.com/intel/compute-runtime/pull/29 
> 
> This patch is required to control data port coherency depending on incoming workload into OpenCL API (fine-grain SVM requirement).

Thank you for correcting the process.

The original question however remains unanswered, how is it looking for
the adoption of the new driver in distros?

I was not able to find much by searching online, but by looking at the
sources, I'd assume requiring a custom version of LLVM would be
something to get rid of.

Regards, Joonas


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC v1] Data port coherency control for UMDs.
  2018-03-29  7:42         ` Joonas Lahtinen
@ 2018-03-30  9:00           ` Dunajski, Bartosz
  2018-04-04  9:18             ` Joonas Lahtinen
  0 siblings, 1 reply; 81+ messages in thread
From: Dunajski, Bartosz @ 2018-03-30  9:00 UTC (permalink / raw)
  To: Joonas Lahtinen, Ewins, Jon, Lis, Tomasz, intel-gfx, Dave Airlie

I don't think adoption is a problem here.
If the driver can query whether the patch is active on a specific setup, the new capabilities will always be reported to the user.


^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH v1] drm/i915: Add Exec param to control data port coherency.
  2018-03-19 12:37 ` [RFC v1] drm/i915: Add Exec param to control data port coherency Tomasz Lis
  2018-03-19 12:43   ` Chris Wilson
@ 2018-03-30 17:29   ` Tomasz Lis
  2018-03-31 19:07     ` kbuild test robot
  2018-04-11 15:46   ` [PATCH v2] " Tomasz Lis
                     ` (5 subsequent siblings)
  7 siblings, 1 reply; 81+ messages in thread
From: Tomasz Lis @ 2018-03-30 17:29 UTC (permalink / raw)
  To: intel-gfx; +Cc: bartosz.dunajski

The patch adds a parameter to control the data port coherency functionality
on a per-exec call basis. When data port coherency flag value is different
than what it was in previous call for the context, a command to switch data
port coherency state is added before the buffer to be executed.
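
As a rough illustration of the behaviour described above (this is not the
code from the patch), the per-context tracking could look like the sketch
below. All names and values in it (struct demo_ctx,
demo_emit_coherency_switch, the DEMO_* constants) are hypothetical
placeholders, and the masked-write layout of the value dword is an
assumption; the real register offset and bit differ between gens.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical placeholder values, not real i915 definitions. */
#define DEMO_MI_LRI        0x11000001u  /* stand-in load-register-immediate header */
#define DEMO_HDC_CHICKEN   0x00007300u  /* stand-in register offset */
#define DEMO_COHERENT_BIT  (1u << 15)   /* stand-in bit position */
#define DEMO_MI_NOOP       0x00000000u

struct demo_ctx {
	bool coherent;  /* state left behind by the previous exec call */
};

/*
 * Write a switch command into cs only when the requested state differs
 * from the state recorded in the context. Returns the number of dwords
 * written (0 or 4).
 */
static size_t demo_emit_coherency_switch(struct demo_ctx *ctx, bool want,
					 uint32_t *cs)
{
	if (ctx->coherent == want)
		return 0;

	*cs++ = DEMO_MI_LRI;
	*cs++ = DEMO_HDC_CHICKEN;
	/* Assumed masked write: upper 16 bits select, lower 16 carry the value. */
	*cs++ = (DEMO_COHERENT_BIT << 16) | (want ? DEMO_COHERENT_BIT : 0);
	*cs++ = DEMO_MI_NOOP;

	ctx->coherent = want;
	return 4;
}
```

Repeated exec calls with the same flag value emit nothing; only a change in
the requested state costs the four extra dwords.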

Rationale:

The OpenCL driver developers requested functionality to control cache
coherency at data port level. Keeping the coherency at that level is disabled
by default due to its performance costs. The OpenCL driver is planning to
enable it for a small subset of submissions, when such functionality is
required. Below are answers to basic questions explaining the background
of the functionality and the reasoning for the proposed implementation:

1. Why do we need a coherency enable/disable switch for memory that is shared
between CPU and GEN (GPU)?

Memory coherency between CPU and GEN, while being a great feature that enables
CL_MEM_SVM_FINE_GRAIN_BUFFER OCL capability on Intel GEN architecture, adds
overhead related to tracking (snooping) memory inside different cache units
(L1$, L2$, L3$, LLC$, etc.). At the same time, only a minority of modern OCL
applications actually use CL_MEM_SVM_FINE_GRAIN_BUFFER (and hence require
memory coherency between CPU and GPU). The goal of coherency enable/disable
switch is to remove overhead of memory coherency when memory coherency is not
needed.

2. Why do we need a global coherency switch?

In order to support I/O commands from within EUs (Execution Units), Intel GEN
ISA (GEN Instruction Set Assembly) contains dedicated "send" instructions.
These send instructions provide several addressing models. One of these
addressing models (named "stateless") provides the most flexible I/O using plain
virtual addresses (as opposed to buffer_handle+offset models). This "stateless"
model is similar to regular memory load/store operations available on typical
CPUs. Since this model provides I/O using arbitrary virtual addresses, it
enables algorithmic designs that are based on pointer-to-pointer (e.g. buffer
of pointers) concepts. For instance, it allows creating tree-like data
structures such as:
                   ________________
                  |      NODE1     |
                  | uint64_t data  |
                  +----------------|
                  | NODE*  |  NODE*|
                  +--------+-------+
                    /              \
   ________________/                \________________
  |      NODE2     |                |      NODE3     |
  | uint64_t data  |                | uint64_t data  |
  +----------------|                +----------------|
  | NODE*  |  NODE*|                | NODE*  |  NODE*|
  +--------+-------+                +--------+-------+

Please note that pointers inside such structures can point to memory locations
in different OCL allocations - e.g. NODE1 and NODE2 can reside in one OCL
allocation while NODE3 resides in a completely separate OCL allocation.
Additionally, such pointers can be shared with the CPU (i.e. using the SVM -
Shared Virtual Memory - feature). Using pointers from different allocations
doesn't affect the stateless addressing model, which even allows scattered
reading from different allocations at the same time (i.e. by utilizing the
SIMD nature of send instructions).
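
As a host-side analogy only, the three-node tree from the diagram can be
sketched in plain C. Everything here is illustrative: calloc() stands in for
OCL allocations, and struct node, tree_sum() and build_example_tree() are
hypothetical names. The point is that a traversal follows raw pointers and has
no knowledge of which allocation a node came from.

```c
#include <stdint.h>
#include <stdlib.h>

struct node {
	uint64_t data;
	struct node *left;
	struct node *right;
};

/* Sum the data of every node reachable from n. */
static uint64_t tree_sum(const struct node *n)
{
	if (!n)
		return 0;
	return n->data + tree_sum(n->left) + tree_sum(n->right);
}

/*
 * Build the tree from the diagram using two separate allocations:
 * NODE1 and NODE2 share one, NODE3 lives in another.
 * Returns the root (which is also the first allocation), or NULL on
 * allocation failure. *alloc_b receives the second allocation so that
 * the caller can free both.
 */
static struct node *build_example_tree(struct node **alloc_b)
{
	struct node *a = calloc(2, sizeof(*a));	/* NODE1, NODE2 */
	struct node *b = calloc(1, sizeof(*b));	/* NODE3 */

	if (!a || !b) {
		free(a);
		free(b);
		return NULL;
	}

	a[0].data = 1;
	a[0].left = &a[1];
	a[0].right = &b[0];
	a[1].data = 2;
	b[0].data = 3;

	*alloc_b = b;
	return a;
}
```

tree_sum() dereferences root->right without being able to tell that it crosses
into a different allocation, which mirrors the position the compiler is in when
it has to encode a stateless send.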

When it comes to coherency programming, send instructions in the stateless
model can be encoded (at ISA level) to either use or disable coherency. However,
for generic OCL applications (such as the example with the tree-like data
structure), the OCL compiler is not able to determine the origin of memory
pointed to by an arbitrary pointer - i.e. it is not able to track a given
pointer back to a specific allocation. As such, it's not able to decide whether
coherency is needed or not for a specific pointer (or for a specific I/O
instruction). As a result, the compiler encodes all stateless sends as coherent
(doing otherwise would lead to functional issues resulting from data
corruption). Please note that it would be possible to work around this (e.g.
based on an allocations map and pointer bounds checking prior to each I/O
instruction) but the performance cost of such a workaround would be many times
greater than the cost of keeping coherency always enabled. As such,
enabling/disabling memory coherency at GEN ISA level is not feasible and an
alternative method is needed.

Such an alternative solution is to have a global coherency switch that allows
disabling coherency for a single (though entire) GPU submission. This is
beneficial because this way we:
* can enable (and pay for) coherency only in submissions that actually need
coherency (submissions that use CL_MEM_SVM_FINE_GRAIN_BUFFER resources)
* don't care about coherency at GEN ISA granularity (no performance impact)

3. Will coherency switch be used frequently?

There are scenarios that will require frequent toggling of the coherency
switch.
E.g. an application has two OCL compute kernels: kern_master and kern_worker.
kern_master uses, concurrently with the CPU, some fine grain SVM resources
(CL_MEM_SVM_FINE_GRAIN_BUFFER). These resources contain descriptors of
computational work that needs to be executed. kern_master analyzes incoming
work descriptors and populates a plain OCL buffer (non-fine-grain) with payload
for kern_worker. Once kern_master is done, kern_worker kicks in and processes
the payload that kern_master produced. These two kernels work in a loop, one
after another. Since only kern_master requires coherency, kern_worker should
not be forced to pay for it. This means that we need to have the ability to
toggle the coherency switch on or off for each GPU submission:
(ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY) kern_worker -> (ENABLE
COHERENCY) kern_master -> (DISABLE COHERENCY) kern_worker -> ...
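
To make the frequency concrete, the loop above can be modelled with a small
counting function. This is a sketch under stated assumptions: coherency is
taken to start out disabled (the default mentioned earlier), a switch is
counted only when the requested state differs from the previous submission,
and count_coherency_switches() is a hypothetical helper, not part of any
driver.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Count how many coherency switches a sequence of submissions needs.
 * wants_coherency[i] is true when submission i requires coherency.
 */
static size_t count_coherency_switches(const bool *wants_coherency, size_t n)
{
	bool current = false;	/* assumed initial state: coherency off */
	size_t switches = 0;

	for (size_t i = 0; i < n; i++) {
		if (wants_coherency[i] != current) {
			current = wants_coherency[i];
			switches++;
		}
	}

	return switches;
}
```

For the alternating kern_master/kern_worker pattern every submission changes
the state, so the number of switches equals the number of submissions, whereas
a workload that never needs coherency incurs no switches at all.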

4. Why was the execlist flag approach chosen?

There are two other ways of providing the functionality to UMDs, besides the
execlist flag:

a) Chicken bit register whitelisting.

This approach would allow adding the functionality without any change to
the KMD's interface. Also, it has been determined that whitelisting is safe for
gen10 and gen11. The issue is with gen9, where hardware whitelisting cannot
be used, and the OCL driver needs support for it. A workaround there would be
to use the command parser, which verifies buffers before execution. But such
parsing comes at a considerable performance cost.

b) Providing the flag as context IOCTL setting.

The data port coherency switch could be implemented as a context parameter,
which would schedule submission of a buffer to switch the coherency flag.
That is an elegant solution which binds the flag to the context, which matches
the hardware placement of the feature. This solution was not accepted
because of OCL driver performance concerns. The OCL driver is constructed
with emphasis on creating small, but very frequent submissions. With such
an architecture, adding an IOCTL setparam call before each submission has a
considerable impact on performance.

Bspec: 11419
Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c            |  3 ++
 drivers/gpu/drm/i915/i915_gem_context.h    |  1 +
 drivers/gpu/drm/i915/i915_gem_execbuffer.c | 17 ++++++++++
 drivers/gpu/drm/i915/intel_lrc.c           | 53 ++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/intel_lrc.h           |  3 ++
 include/uapi/drm/i915_drm.h                | 12 ++++++-
 6 files changed, 88 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index d354627..030854e 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -436,6 +436,9 @@ static int i915_getparam_ioctl(struct drm_device *dev, void *data,
 	case I915_PARAM_CS_TIMESTAMP_FREQUENCY:
 		value = 1000 * INTEL_INFO(dev_priv)->cs_timestamp_frequency_khz;
 		break;
+	case I915_PARAM_HAS_EXEC_DATA_PORT_COHERENCY:
+		value = (INTEL_GEN(dev_priv) >= 9);
+		break;
 	default:
 		DRM_DEBUG("Unknown parameter %d\n", param->param);
 		return -EINVAL;
diff --git a/drivers/gpu/drm/i915/i915_gem_context.h b/drivers/gpu/drm/i915/i915_gem_context.h
index 7854262..00aa309 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.h
+++ b/drivers/gpu/drm/i915/i915_gem_context.h
@@ -118,6 +118,7 @@ struct i915_gem_context {
 #define CONTEXT_BANNABLE		3
 #define CONTEXT_BANNED			4
 #define CONTEXT_FORCE_SINGLE_SUBMISSION	5
+#define CONTEXT_DATA_PORT_COHERENT	6
 
 	/**
 	 * @hw_id: - unique identifier for the context
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index 8c170db..e3a2f9e 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -2245,6 +2245,18 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 		eb.batch_flags |= I915_DISPATCH_RS;
 	}
 
+	if (args->flags & I915_EXEC_DATA_PORT_COHERENT) {
+		if (INTEL_GEN(eb.i915) < 9) {
+			DRM_DEBUG("Data Port Coherency is only allowed for Gen9 and above\n");
+			return -EINVAL;
+		}
+		if (eb.engine->class != RENDER_CLASS) {
+			DRM_DEBUG("Data Port Coherency is not available on %s\n",
+				 eb.engine->name);
+			return -EINVAL;
+		}
+	}
+
 	if (args->flags & I915_EXEC_FENCE_IN) {
 		in_fence = sync_file_get_fence(lower_32_bits(args->rsvd2));
 		if (!in_fence)
@@ -2371,6 +2383,11 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 		goto err_batch_unpin;
 	}
 
+	/* Emit the switch of data port coherency state if needed */
+	err = intel_lr_context_modify_data_port_coherency(eb.request,
+			(args->flags & I915_EXEC_DATA_PORT_COHERENT) != 0);
+	GEM_WARN_ON(err);
+
 	if (in_fence) {
 		err = i915_request_await_dma_fence(eb.request, in_fence);
 		if (err < 0)
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index f60b61b..2094494 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -254,6 +254,59 @@ intel_lr_context_descriptor_update(struct i915_gem_context *ctx,
 	ce->lrc_desc = desc;
 }
 
+static int emit_set_data_port_coherency(struct i915_request *req, bool enable)
+{
+	u32 *cs;
+	i915_reg_t reg;
+
+	GEM_BUG_ON(req->engine->class != RENDER_CLASS);
+	GEM_BUG_ON(INTEL_GEN(req->i915) < 9);
+
+	cs = intel_ring_begin(req, 4);
+	if (IS_ERR(cs))
+		return PTR_ERR(cs);
+
+	if (INTEL_GEN(req->i915) >= 10)
+		reg = CNL_HDC_CHICKEN0;
+	else
+		reg = HDC_CHICKEN0;
+
+	*cs++ = MI_LOAD_REGISTER_IMM(1);
+	*cs++ = i915_mmio_reg_offset(reg);
+	/* Enabling coherency means disabling the bit which forces it off */
+	if (enable)
+		*cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
+	else
+		*cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
+	*cs++ = MI_NOOP;
+
+	intel_ring_advance(req, cs);
+
+	return 0;
+}
+
+int
+intel_lr_context_modify_data_port_coherency(struct i915_request *req,
+					bool enable)
+{
+	struct i915_gem_context *ctx = req->ctx;
+	int ret;
+
+	if (test_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags) == enable)
+		return 0;
+
+	ret = emit_set_data_port_coherency(req, enable);
+
+	if (!ret) {
+		if (enable)
+			__set_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
+		else
+			__clear_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
+	}
+
+	return ret;
+}
+
 static struct i915_priolist *
 lookup_priolist(struct intel_engine_cs *engine,
 		struct i915_priotree *pt,
diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h
index 59d7b86..c46b239 100644
--- a/drivers/gpu/drm/i915/intel_lrc.h
+++ b/drivers/gpu/drm/i915/intel_lrc.h
@@ -111,4 +111,7 @@ intel_lr_context_descriptor(struct i915_gem_context *ctx,
 	return ctx->engine[engine->id].lrc_desc;
 }
 
+int intel_lr_context_modify_data_port_coherency(struct i915_request *req,
+						bool enable);
+
 #endif /* _INTEL_LRC_H_ */
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index 7f5634c..0f52793 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -529,6 +529,11 @@ typedef struct drm_i915_irq_wait {
  */
 #define I915_PARAM_CS_TIMESTAMP_FREQUENCY 51
 
+/* Query whether DRM_I915_GEM_EXECBUFFER2 supports the ability to switch
+ * Data Cache access into Data Port Coherency mode.
+ */
+#define I915_PARAM_HAS_EXEC_DATA_PORT_COHERENCY 52
+
 typedef struct drm_i915_getparam {
 	__s32 param;
 	/*
@@ -1048,7 +1053,12 @@ struct drm_i915_gem_execbuffer2 {
  */
 #define I915_EXEC_FENCE_ARRAY   (1<<19)
 
-#define __I915_EXEC_UNKNOWN_FLAGS (-(I915_EXEC_FENCE_ARRAY<<1))
+/* Data Port Coherency capability will be switched before an exec call
+ * which has this flag different than previous call for the context.
+ */
+#define I915_EXEC_DATA_PORT_COHERENT   (1<<20)
+
+#define __I915_EXEC_UNKNOWN_FLAGS (-(I915_EXEC_DATA_PORT_COHERENT<<1))
 
 #define I915_EXEC_CONTEXT_ID_MASK	(0xffffffff)
 #define i915_execbuffer2_set_context_id(eb2, context) \
-- 
2.7.4

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* ✗ Fi.CI.CHECKPATCH: warning for drm/i915: Add Exec param to control data port coherency. (rev2)
  2018-03-19 12:37 [RFC v1] Data port coherency control for UMDs Tomasz Lis
                   ` (4 preceding siblings ...)
  2018-03-19 16:48 ` ✗ Fi.CI.IGT: failure " Patchwork
@ 2018-03-30 18:14 ` Patchwork
  2018-03-30 18:30 ` ✓ Fi.CI.BAT: success " Patchwork
                   ` (22 subsequent siblings)
  28 siblings, 0 replies; 81+ messages in thread
From: Patchwork @ 2018-03-30 18:14 UTC (permalink / raw)
  To: Tomasz Lis; +Cc: intel-gfx

== Series Details ==

Series: drm/i915: Add Exec param to control data port coherency. (rev2)
URL   : https://patchwork.freedesktop.org/series/40181/
State : warning

== Summary ==

$ dim checkpatch origin/drm-tip
415de02a8f37 drm/i915: Add Exec param to control data port coherency.
-:14: WARNING:COMMIT_LOG_LONG_LINE: Possible unwrapped commit description (prefer a maximum 75 chars per line)
#14: 
coherency at data port level. Keeping the coherency at that level is disabled

-:175: CHECK:PARENTHESIS_ALIGNMENT: Alignment should match open parenthesis
#175: FILE: drivers/gpu/drm/i915/i915_gem_execbuffer.c:2255:
+			DRM_DEBUG("Data Port Coherency is not available on %s\n",
+				 eb.engine->name);

-:189: CHECK:PARENTHESIS_ALIGNMENT: Alignment should match open parenthesis
#189: FILE: drivers/gpu/drm/i915/i915_gem_execbuffer.c:2388:
+	err = intel_lr_context_modify_data_port_coherency(eb.request,
+			(args->flags & I915_EXEC_DATA_PORT_COHERENT) != 0);

-:236: CHECK:PARENTHESIS_ALIGNMENT: Alignment should match open parenthesis
#236: FILE: drivers/gpu/drm/i915/intel_lrc.c:290:
+intel_lr_context_modify_data_port_coherency(struct i915_request *req,
+					bool enable)

-:295: CHECK:SPACING: spaces preferred around that '<<' (ctx:VxV)
#295: FILE: include/uapi/drm/i915_drm.h:1059:
+#define I915_EXEC_DATA_PORT_COHERENT   (1<<20)
                                          ^

-:297: CHECK:SPACING: spaces preferred around that '<<' (ctx:VxV)
#297: FILE: include/uapi/drm/i915_drm.h:1061:
+#define __I915_EXEC_UNKNOWN_FLAGS (-(I915_EXEC_DATA_PORT_COHERENT<<1))
                                                                  ^

total: 0 errors, 1 warnings, 5 checks, 135 lines checked


^ permalink raw reply	[flat|nested] 81+ messages in thread

* ✓ Fi.CI.BAT: success for drm/i915: Add Exec param to control data port coherency. (rev2)
  2018-03-19 12:37 [RFC v1] Data port coherency control for UMDs Tomasz Lis
                   ` (5 preceding siblings ...)
  2018-03-30 18:14 ` ✗ Fi.CI.CHECKPATCH: warning for drm/i915: Add Exec param to control data port coherency. (rev2) Patchwork
@ 2018-03-30 18:30 ` Patchwork
  2018-03-30 19:59 ` ✗ Fi.CI.IGT: failure " Patchwork
                   ` (21 subsequent siblings)
  28 siblings, 0 replies; 81+ messages in thread
From: Patchwork @ 2018-03-30 18:30 UTC (permalink / raw)
  To: Tomasz Lis; +Cc: intel-gfx

== Series Details ==

Series: drm/i915: Add Exec param to control data port coherency. (rev2)
URL   : https://patchwork.freedesktop.org/series/40181/
State : success

== Summary ==

Series 40181v2 drm/i915: Add Exec param to control data port coherency.
https://patchwork.freedesktop.org/api/1.0/series/40181/revisions/2/mbox/

---- Known issues:

Test gem_mmap_gtt:
        Subgroup basic-small-bo-tiledx:
                pass       -> FAIL       (fi-gdg-551) fdo#102575
Test kms_flip:
        Subgroup basic-flip-vs-wf_vblank:
                pass       -> FAIL       (fi-cfl-s3) fdo#100368
Test kms_pipe_crc_basic:
        Subgroup suspend-read-crc-pipe-a:
                pass       -> FAIL       (fi-ivb-3520m) k.org#198519 +2

fdo#102575 
fdo#100368 
k.org#198519 https://bugzilla.kernel.org/show_bug.cgi?id=198519

fi-bdw-5557u     total:285  pass:264  dwarn:0   dfail:0   fail:0   skip:21  time:431s
fi-bdw-gvtdvm    total:285  pass:261  dwarn:0   dfail:0   fail:0   skip:24  time:439s
fi-blb-e6850     total:285  pass:220  dwarn:1   dfail:0   fail:0   skip:64  time:381s
fi-bsw-n3050     total:285  pass:239  dwarn:0   dfail:0   fail:0   skip:46  time:539s
fi-bwr-2160      total:285  pass:180  dwarn:0   dfail:0   fail:0   skip:105 time:295s
fi-bxt-dsi       total:285  pass:255  dwarn:0   dfail:0   fail:0   skip:30  time:512s
fi-bxt-j4205     total:285  pass:256  dwarn:0   dfail:0   fail:0   skip:29  time:509s
fi-byt-j1900     total:285  pass:250  dwarn:0   dfail:0   fail:0   skip:35  time:517s
fi-byt-n2820     total:285  pass:246  dwarn:0   dfail:0   fail:0   skip:39  time:514s
fi-cfl-8700k     total:285  pass:257  dwarn:0   dfail:0   fail:0   skip:28  time:412s
fi-cfl-s3        total:285  pass:258  dwarn:0   dfail:0   fail:1   skip:26  time:549s
fi-cfl-u         total:285  pass:259  dwarn:0   dfail:0   fail:0   skip:26  time:513s
fi-cnl-y3        total:285  pass:259  dwarn:0   dfail:0   fail:0   skip:26  time:585s
fi-elk-e7500     total:285  pass:225  dwarn:1   dfail:0   fail:0   skip:59  time:412s
fi-gdg-551       total:285  pass:176  dwarn:0   dfail:0   fail:1   skip:108 time:317s
fi-glk-1         total:285  pass:257  dwarn:0   dfail:0   fail:0   skip:28  time:539s
fi-hsw-4770      total:285  pass:258  dwarn:0   dfail:0   fail:0   skip:27  time:403s
fi-ilk-650       total:285  pass:225  dwarn:0   dfail:0   fail:0   skip:60  time:421s
fi-ivb-3520m     total:285  pass:253  dwarn:0   dfail:0   fail:3   skip:29  time:422s
fi-ivb-3770      total:285  pass:252  dwarn:0   dfail:0   fail:0   skip:33  time:429s
fi-kbl-7500u     total:285  pass:260  dwarn:1   dfail:0   fail:0   skip:24  time:470s
fi-kbl-7567u     total:285  pass:265  dwarn:0   dfail:0   fail:0   skip:20  time:469s
fi-kbl-r         total:285  pass:258  dwarn:0   dfail:0   fail:0   skip:27  time:509s
fi-pnv-d510      total:285  pass:219  dwarn:1   dfail:0   fail:0   skip:65  time:659s
fi-skl-6260u     total:285  pass:265  dwarn:0   dfail:0   fail:0   skip:20  time:440s
fi-skl-6600u     total:285  pass:258  dwarn:0   dfail:0   fail:0   skip:27  time:531s
fi-skl-6700k2    total:285  pass:261  dwarn:0   dfail:0   fail:0   skip:24  time:507s
fi-skl-6770hq    total:285  pass:265  dwarn:0   dfail:0   fail:0   skip:20  time:494s
fi-skl-guc       total:285  pass:257  dwarn:0   dfail:0   fail:0   skip:28  time:428s
fi-skl-gvtdvm    total:285  pass:262  dwarn:0   dfail:0   fail:0   skip:23  time:446s
fi-snb-2520m     total:285  pass:245  dwarn:0   dfail:0   fail:0   skip:40  time:597s
fi-snb-2600      total:285  pass:245  dwarn:0   dfail:0   fail:0   skip:40  time:401s
Blacklisted hosts:
fi-cnl-psr       total:285  pass:256  dwarn:3   dfail:0   fail:0   skip:26  time:533s
fi-glk-j4005     total:285  pass:256  dwarn:0   dfail:0   fail:0   skip:29  time:487s

ae8ce2cdcba75443f03f5f65fa1627a4e14481c9 drm-tip: 2018y-03m-30d-17h-40m-01s UTC integration manifest
415de02a8f37 drm/i915: Add Exec param to control data port coherency.

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_8551/issues.html

^ permalink raw reply	[flat|nested] 81+ messages in thread

* ✗ Fi.CI.IGT: failure for drm/i915: Add Exec param to control data port coherency. (rev2)
  2018-03-19 12:37 [RFC v1] Data port coherency control for UMDs Tomasz Lis
                   ` (6 preceding siblings ...)
  2018-03-30 18:30 ` ✓ Fi.CI.BAT: success " Patchwork
@ 2018-03-30 19:59 ` Patchwork
  2018-04-11 16:12 ` ✗ Fi.CI.CHECKPATCH: warning for drm/i915: Add Exec param to control data port coherency. (rev3) Patchwork
                   ` (20 subsequent siblings)
  28 siblings, 0 replies; 81+ messages in thread
From: Patchwork @ 2018-03-30 19:59 UTC (permalink / raw)
  To: Tomasz Lis; +Cc: intel-gfx

== Series Details ==

Series: drm/i915: Add Exec param to control data port coherency. (rev2)
URL   : https://patchwork.freedesktop.org/series/40181/
State : failure

== Summary ==

---- Possible new issues:

Test gem_exec_params:
        Subgroup invalid-flag:
                pass       -> FAIL       (shard-apl)
Test kms_busy:
        Subgroup extended-modeset-hang-newfb-with-reset-render-a:
                fail       -> PASS       (shard-snb)
Test kms_cursor_crc:
        Subgroup cursor-256x256-onscreen:
                fail       -> PASS       (shard-snb)
Test kms_frontbuffer_tracking:
        Subgroup fbc-2p-scndscrn-indfb-plflip-blt:
                fail       -> SKIP       (shard-snb)
        Subgroup fbc-2p-scndscrn-pri-shrfb-draw-pwrite:
                fail       -> SKIP       (shard-snb)
        Subgroup fbcpsr-rgb101010-draw-mmap-cpu:
                fail       -> SKIP       (shard-snb)

---- Known issues:

Test kms_flip:
        Subgroup 2x-flip-vs-expired-vblank:
                fail       -> PASS       (shard-hsw) fdo#102887
        Subgroup 2x-plain-flip-fb-recreate:
                fail       -> PASS       (shard-hsw) fdo#100368 +1
Test kms_frontbuffer_tracking:
        Subgroup fbc-1p-primscrn-pri-indfb-draw-pwrite:
                pass       -> FAIL       (shard-apl) fdo#103167
        Subgroup psr-1p-pri-indfb-multidraw:
                fail       -> SKIP       (shard-snb) fdo#105798
Test kms_plane_lowres:
        Subgroup pipe-b-tiling-yf:
                fail       -> PASS       (shard-apl) fdo#103166
Test kms_vblank:
        Subgroup pipe-a-accuracy-idle:
                pass       -> FAIL       (shard-hsw) fdo#102583
Test testdisplay:
                dmesg-warn -> PASS       (shard-apl) fdo#104727

fdo#102887 https://bugs.freedesktop.org/show_bug.cgi?id=102887
fdo#100368 https://bugs.freedesktop.org/show_bug.cgi?id=100368
fdo#103167 https://bugs.freedesktop.org/show_bug.cgi?id=103167
fdo#105798 https://bugs.freedesktop.org/show_bug.cgi?id=105798
fdo#103166 https://bugs.freedesktop.org/show_bug.cgi?id=103166
fdo#102583 https://bugs.freedesktop.org/show_bug.cgi?id=102583
fdo#104727 https://bugs.freedesktop.org/show_bug.cgi?id=104727

shard-apl        total:3495 pass:1829 dwarn:1   dfail:0   fail:9   skip:1655 time:12853s
shard-hsw        total:3495 pass:1781 dwarn:1   dfail:0   fail:3   skip:1709 time:11419s
shard-snb        total:3495 pass:1374 dwarn:1   dfail:0   fail:3   skip:2117 time:7030s
Blacklisted hosts:
shard-kbl        total:3477 pass:1921 dwarn:16  dfail:0   fail:9   skip:1530 time:9003s

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_8551/shards.html

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v1] drm/i915: Add Exec param to control data port coherency.
  2018-03-30 17:29   ` [PATCH " Tomasz Lis
@ 2018-03-31 19:07     ` kbuild test robot
  0 siblings, 0 replies; 81+ messages in thread
From: kbuild test robot @ 2018-03-31 19:07 UTC (permalink / raw)
  To: Tomasz Lis; +Cc: intel-gfx, bartosz.dunajski, kbuild-all

[-- Attachment #1: Type: text/plain, Size: 11050 bytes --]

Hi Tomasz,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on drm-intel/for-linux-next]
[also build test WARNING on v4.16-rc7 next-20180329]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Tomasz-Lis/drm-i915-Add-Exec-param-to-control-data-port-coherency/20180401-021313
base:   git://anongit.freedesktop.org/drm-intel for-linux-next
config: i386-randconfig-x010-201813 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All warnings (new ones prefixed by >>):

   In file included from drivers/gpu//drm/i915/i915_request.h:30:0,
                    from drivers/gpu//drm/i915/i915_gem_timeline.h:30,
                    from drivers/gpu//drm/i915/intel_ringbuffer.h:8,
                    from drivers/gpu//drm/i915/intel_lrc.h:27,
                    from drivers/gpu//drm/i915/i915_drv.h:63,
                    from drivers/gpu//drm/i915/i915_gem_execbuffer.c:38:
   drivers/gpu//drm/i915/i915_gem_execbuffer.c: In function 'i915_gem_do_execbuffer':
   drivers/gpu//drm/i915/i915_gem.h:47:54: warning: statement with no effect [-Wunused-value]
    #define GEM_WARN_ON(expr) (BUILD_BUG_ON_INVALID(expr), 0)
                              ~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~
>> drivers/gpu//drm/i915/i915_gem_execbuffer.c:2389:2: note: in expansion of macro 'GEM_WARN_ON'
     GEM_WARN_ON(err);
     ^~~~~~~~~~~

vim +/GEM_WARN_ON +2389 drivers/gpu//drm/i915/i915_gem_execbuffer.c

  2182	
  2183	static int
  2184	i915_gem_do_execbuffer(struct drm_device *dev,
  2185			       struct drm_file *file,
  2186			       struct drm_i915_gem_execbuffer2 *args,
  2187			       struct drm_i915_gem_exec_object2 *exec,
  2188			       struct drm_syncobj **fences)
  2189	{
  2190		struct i915_execbuffer eb;
  2191		struct dma_fence *in_fence = NULL;
  2192		struct sync_file *out_fence = NULL;
  2193		int out_fence_fd = -1;
  2194		int err;
  2195	
  2196		BUILD_BUG_ON(__EXEC_INTERNAL_FLAGS & ~__I915_EXEC_ILLEGAL_FLAGS);
  2197		BUILD_BUG_ON(__EXEC_OBJECT_INTERNAL_FLAGS &
  2198			     ~__EXEC_OBJECT_UNKNOWN_FLAGS);
  2199	
  2200		eb.i915 = to_i915(dev);
  2201		eb.file = file;
  2202		eb.args = args;
  2203		if (DBG_FORCE_RELOC || !(args->flags & I915_EXEC_NO_RELOC))
  2204			args->flags |= __EXEC_HAS_RELOC;
  2205	
  2206		eb.exec = exec;
  2207		eb.vma = (struct i915_vma **)(exec + args->buffer_count + 1);
  2208		eb.vma[0] = NULL;
  2209		eb.flags = (unsigned int *)(eb.vma + args->buffer_count + 1);
  2210	
  2211		eb.invalid_flags = __EXEC_OBJECT_UNKNOWN_FLAGS;
  2212		if (USES_FULL_PPGTT(eb.i915))
  2213			eb.invalid_flags |= EXEC_OBJECT_NEEDS_GTT;
  2214		reloc_cache_init(&eb.reloc_cache, eb.i915);
  2215	
  2216		eb.buffer_count = args->buffer_count;
  2217		eb.batch_start_offset = args->batch_start_offset;
  2218		eb.batch_len = args->batch_len;
  2219	
  2220		eb.batch_flags = 0;
  2221		if (args->flags & I915_EXEC_SECURE) {
  2222			if (!drm_is_current_master(file) || !capable(CAP_SYS_ADMIN))
  2223			    return -EPERM;
  2224	
  2225			eb.batch_flags |= I915_DISPATCH_SECURE;
  2226		}
  2227		if (args->flags & I915_EXEC_IS_PINNED)
  2228			eb.batch_flags |= I915_DISPATCH_PINNED;
  2229	
  2230		eb.engine = eb_select_engine(eb.i915, file, args);
  2231		if (!eb.engine)
  2232			return -EINVAL;
  2233	
  2234		if (args->flags & I915_EXEC_RESOURCE_STREAMER) {
  2235			if (!HAS_RESOURCE_STREAMER(eb.i915)) {
  2236				DRM_DEBUG("RS is only allowed for Haswell, Gen8 and above\n");
  2237				return -EINVAL;
  2238			}
  2239			if (eb.engine->id != RCS) {
  2240				DRM_DEBUG("RS is not available on %s\n",
  2241					 eb.engine->name);
  2242				return -EINVAL;
  2243			}
  2244	
  2245			eb.batch_flags |= I915_DISPATCH_RS;
  2246		}
  2247	
  2248		if (args->flags & I915_EXEC_DATA_PORT_COHERENT) {
  2249			if (INTEL_GEN(eb.i915) < 9) {
  2250				DRM_DEBUG("Data Port Coherency is only allowed for Gen9 and above\n");
  2251				return -EINVAL;
  2252			}
  2253			if (eb.engine->class != RENDER_CLASS) {
  2254				DRM_DEBUG("Data Port Coherency is not available on %s\n",
  2255					 eb.engine->name);
  2256				return -EINVAL;
  2257			}
  2258		}
  2259	
  2260		if (args->flags & I915_EXEC_FENCE_IN) {
  2261			in_fence = sync_file_get_fence(lower_32_bits(args->rsvd2));
  2262			if (!in_fence)
  2263				return -EINVAL;
  2264		}
  2265	
  2266		if (args->flags & I915_EXEC_FENCE_OUT) {
  2267			out_fence_fd = get_unused_fd_flags(O_CLOEXEC);
  2268			if (out_fence_fd < 0) {
  2269				err = out_fence_fd;
  2270				goto err_in_fence;
  2271			}
  2272		}
  2273	
  2274		err = eb_create(&eb);
  2275		if (err)
  2276			goto err_out_fence;
  2277	
  2278		GEM_BUG_ON(!eb.lut_size);
  2279	
  2280		err = eb_select_context(&eb);
  2281		if (unlikely(err))
  2282			goto err_destroy;
  2283	
  2284		/*
  2285		 * Take a local wakeref for preparing to dispatch the execbuf as
  2286		 * we expect to access the hardware fairly frequently in the
  2287		 * process. Upon first dispatch, we acquire another prolonged
  2288		 * wakeref that we hold until the GPU has been idle for at least
  2289		 * 100ms.
  2290		 */
  2291		intel_runtime_pm_get(eb.i915);
  2292	
  2293		err = i915_mutex_lock_interruptible(dev);
  2294		if (err)
  2295			goto err_rpm;
  2296	
  2297		err = eb_relocate(&eb);
  2298		if (err) {
  2299			/*
  2300			 * If the user expects the execobject.offset and
  2301			 * reloc.presumed_offset to be an exact match,
  2302			 * as for using NO_RELOC, then we cannot update
  2303			 * the execobject.offset until we have completed
  2304			 * relocation.
  2305			 */
  2306			args->flags &= ~__EXEC_HAS_RELOC;
  2307			goto err_vma;
  2308		}
  2309	
  2310		if (unlikely(*eb.batch->exec_flags & EXEC_OBJECT_WRITE)) {
  2311			DRM_DEBUG("Attempting to use self-modifying batch buffer\n");
  2312			err = -EINVAL;
  2313			goto err_vma;
  2314		}
  2315		if (eb.batch_start_offset > eb.batch->size ||
  2316		    eb.batch_len > eb.batch->size - eb.batch_start_offset) {
  2317			DRM_DEBUG("Attempting to use out-of-bounds batch\n");
  2318			err = -EINVAL;
  2319			goto err_vma;
  2320		}
  2321	
  2322		if (eb_use_cmdparser(&eb)) {
  2323			struct i915_vma *vma;
  2324	
  2325			vma = eb_parse(&eb, drm_is_current_master(file));
  2326			if (IS_ERR(vma)) {
  2327				err = PTR_ERR(vma);
  2328				goto err_vma;
  2329			}
  2330	
  2331			if (vma) {
  2332				/*
  2333				 * Batch parsed and accepted:
  2334				 *
  2335				 * Set the DISPATCH_SECURE bit to remove the NON_SECURE
  2336				 * bit from MI_BATCH_BUFFER_START commands issued in
  2337				 * the dispatch_execbuffer implementations. We
  2338				 * specifically don't want that set on batches the
  2339				 * command parser has accepted.
  2340				 */
  2341				eb.batch_flags |= I915_DISPATCH_SECURE;
  2342				eb.batch_start_offset = 0;
  2343				eb.batch = vma;
  2344			}
  2345		}
  2346	
  2347		if (eb.batch_len == 0)
  2348			eb.batch_len = eb.batch->size - eb.batch_start_offset;
  2349	
  2350		/*
  2351		 * snb/ivb/vlv conflate the "batch in ppgtt" bit with the "non-secure
  2352		 * batch" bit. Hence we need to pin secure batches into the global gtt.
  2353		 * hsw should have this fixed, but bdw mucks it up again. */
  2354		if (eb.batch_flags & I915_DISPATCH_SECURE) {
  2355			struct i915_vma *vma;
  2356	
  2357			/*
  2358			 * So on first glance it looks freaky that we pin the batch here
  2359			 * outside of the reservation loop. But:
  2360			 * - The batch is already pinned into the relevant ppgtt, so we
  2361			 *   already have the backing storage fully allocated.
  2362			 * - No other BO uses the global gtt (well contexts, but meh),
  2363			 *   so we don't really have issues with multiple objects not
  2364			 *   fitting due to fragmentation.
  2365			 * So this is actually safe.
  2366			 */
  2367			vma = i915_gem_object_ggtt_pin(eb.batch->obj, NULL, 0, 0, 0);
  2368			if (IS_ERR(vma)) {
  2369				err = PTR_ERR(vma);
  2370				goto err_vma;
  2371			}
  2372	
  2373			eb.batch = vma;
  2374		}
  2375	
  2376		/* All GPU relocation batches must be submitted prior to the user rq */
  2377		GEM_BUG_ON(eb.reloc_cache.rq);
  2378	
  2379		/* Allocate a request for this batch buffer nice and early. */
  2380		eb.request = i915_request_alloc(eb.engine, eb.ctx);
  2381		if (IS_ERR(eb.request)) {
  2382			err = PTR_ERR(eb.request);
  2383			goto err_batch_unpin;
  2384		}
  2385	
  2386		/* Emit the switch of data port coherency state if needed */
  2387		err = intel_lr_context_modify_data_port_coherency(eb.request,
  2388				(args->flags & I915_EXEC_DATA_PORT_COHERENT) != 0);
> 2389		GEM_WARN_ON(err);
  2390	
  2391		if (in_fence) {
  2392			err = i915_request_await_dma_fence(eb.request, in_fence);
  2393			if (err < 0)
  2394				goto err_request;
  2395		}
  2396	
  2397		if (fences) {
  2398			err = await_fence_array(&eb, fences);
  2399			if (err)
  2400				goto err_request;
  2401		}
  2402	
  2403		if (out_fence_fd != -1) {
  2404			out_fence = sync_file_create(&eb.request->fence);
  2405			if (!out_fence) {
  2406				err = -ENOMEM;
  2407				goto err_request;
  2408			}
  2409		}
  2410	
  2411		/*
  2412		 * Whilst this request exists, batch_obj will be on the
  2413		 * active_list, and so will hold the active reference. Only when this
  2414		 * request is retired will the the batch_obj be moved onto the
  2415		 * inactive_list and lose its active reference. Hence we do not need
  2416		 * to explicitly hold another reference here.
  2417		 */
  2418		eb.request->batch = eb.batch;
  2419	
  2420		trace_i915_request_queue(eb.request, eb.batch_flags);
  2421		err = eb_submit(&eb);
  2422	err_request:
  2423		__i915_request_add(eb.request, err == 0);
  2424		add_to_client(eb.request, file);
  2425	
  2426		if (fences)
  2427			signal_fence_array(&eb, fences);
  2428	
  2429		if (out_fence) {
  2430			if (err == 0) {
  2431				fd_install(out_fence_fd, out_fence->file);
  2432				args->rsvd2 &= GENMASK_ULL(31, 0); /* keep in-fence */
  2433				args->rsvd2 |= (u64)out_fence_fd << 32;
  2434				out_fence_fd = -1;
  2435			} else {
  2436				fput(out_fence->file);
  2437			}
  2438		}
  2439	
  2440	err_batch_unpin:
  2441		if (eb.batch_flags & I915_DISPATCH_SECURE)
  2442			i915_vma_unpin(eb.batch);
  2443	err_vma:
  2444		if (eb.exec)
  2445			eb_release_vmas(&eb);
  2446		mutex_unlock(&dev->struct_mutex);
  2447	err_rpm:
  2448		intel_runtime_pm_put(eb.i915);
  2449		i915_gem_context_put(eb.ctx);
  2450	err_destroy:
  2451		eb_destroy(&eb);
  2452	err_out_fence:
  2453		if (out_fence_fd != -1)
  2454			put_unused_fd(out_fence_fd);
  2455	err_in_fence:
  2456		dma_fence_put(in_fence);
  2457		return err;
  2458	}
  2459	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 31427 bytes --]

[-- Attachment #3: Type: text/plain, Size: 160 bytes --]

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


* Re: [RFC v1] Data port coherency control for UMDs.
  2018-03-30  9:00           ` Dunajski, Bartosz
@ 2018-04-04  9:18             ` Joonas Lahtinen
  2018-04-11  9:15               ` Dunajski, Bartosz
  0 siblings, 1 reply; 81+ messages in thread
From: Joonas Lahtinen @ 2018-04-04  9:18 UTC (permalink / raw)
  To: Dunajski, Bartosz, Ewins, Jon, Lis, Tomasz, intel-gfx, Dave Airlie

Quoting Dunajski, Bartosz (2018-03-30 12:00:51)
> I think the adoption is not a problem here.

Adoption is important for fulfilling the DRM subsystem requirements for
merging new kernel uAPI that I referred to previously.

So do we have some timeline for any distro picking up the new driver?

Regards, Joonas

> If driver can query that patch is active on the specific setup, new capabilities will be always reported to the user.
> 
> -----Original Message-----
> Quoting Dunajski, Bartosz (2018-03-26 12:46:13)
> > Here is pull request with patch usage:
> > https://github.com/intel/compute-runtime/pull/29
> > 
> > This patch is required to control data port coherency depending on incoming workload into OpenCL API (fine-grain SVM requirement).
> 
> Thank you for correcting the process.
> 
> The original question however remains unanswered, how is it looking for the adoption of the new driver in distros?
> 
> I was not able to find much by searching online, but by looking at the sources, I'd assume requiring a custom version of LLVM would be something to get rid of.
> 
> Regards, Joonas

* Re: [RFC v1] Data port coherency control for UMDs.
  2018-04-04  9:18             ` Joonas Lahtinen
@ 2018-04-11  9:15               ` Dunajski, Bartosz
  0 siblings, 0 replies; 81+ messages in thread
From: Dunajski, Bartosz @ 2018-04-11  9:15 UTC (permalink / raw)
  To: Joonas Lahtinen, Ewins, Jon, Lis, Tomasz, intel-gfx, Dave Airlie

We don’t have any timeline for this at the moment.
The NEO driver is quite new in open source, and adoption will be handled in the near future.

Can we move on with the review process and handle adoption in a separate thread?

Thanks,
Bartosz

-----Original Message-----
From: Joonas Lahtinen [mailto:joonas.lahtinen@linux.intel.com] 
Sent: Wednesday, April 4, 2018 11:19 AM
To: Dunajski, Bartosz <bartosz.dunajski@intel.com>; Ewins, Jon <jon.ewins@intel.com>; Lis, Tomasz <tomasz.lis@intel.com>; intel-gfx@lists.freedesktop.org; Dave Airlie <airlied@redhat.com>
Cc: chris@chris-wilson.co.uk; Winiarski, Michal <michal.winiarski@intel.com>
Subject: RE: [RFC v1] Data port coherency control for UMDs.

Quoting Dunajski, Bartosz (2018-03-30 12:00:51)
> I think the adoption is not a problem here.

Adoption is important for fulfilling the DRM subsystem requirements for merging new kernel uAPI that I referred to previously.

So do we have some timeline for any distro picking up the new driver?

Regards, Joonas


* [PATCH v2] drm/i915: Add Exec param to control data port coherency.
  2018-03-19 12:37 ` [RFC v1] drm/i915: Add Exec param to control data port coherency Tomasz Lis
  2018-03-19 12:43   ` Chris Wilson
  2018-03-30 17:29   ` [PATCH " Tomasz Lis
@ 2018-04-11 15:46   ` Tomasz Lis
  2018-06-20 15:03   ` [PATCH v1] Second implementation of Data Port Coherency Tomasz Lis
                     ` (4 subsequent siblings)
  7 siblings, 0 replies; 81+ messages in thread
From: Tomasz Lis @ 2018-04-11 15:46 UTC (permalink / raw)
  To: intel-gfx; +Cc: bartosz.dunajski

The patch adds a parameter to control the data port coherency functionality
on a per-exec call basis. When the data port coherency flag value differs from
what it was in the previous call for the context, a command to switch the data
port coherency state is added before the buffer to be executed.

Rationale:

The OpenCL driver developers requested functionality to control cache
coherency at data port level. Keeping the coherency at that level is disabled
by default due to its performance costs. The OpenCL driver is planning to
enable it for a small subset of submissions, when such functionality is
required. Below are answers to basic questions explaining the background
of the functionality and the reasoning for the proposed implementation:

1. Why do we need a coherency enable/disable switch for memory that is shared
between CPU and GEN (GPU)?

Memory coherency between CPU and GEN, while being a great feature that enables
CL_MEM_SVM_FINE_GRAIN_BUFFER OCL capability on Intel GEN architecture, adds
overhead related to tracking (snooping) memory inside different cache units
(L1$, L2$, L3$, LLC$, etc.). At the same time, only a minority of modern OCL
applications actually use CL_MEM_SVM_FINE_GRAIN_BUFFER (and hence require
memory coherency between CPU and GPU). The goal of the coherency enable/disable
switch is to remove the overhead of memory coherency when memory coherency is
not needed.

2. Why do we need a global coherency switch?

In order to support I/O commands from within EUs (Execution Units), Intel GEN
ISA (GEN Instruction Set Assembly) contains dedicated "send" instructions.
These send instructions provide several addressing models. One of these
addressing models (named "stateless") provides the most flexible I/O using plain
virtual addresses (as opposed to buffer_handle+offset models). This "stateless"
model is similar to regular memory load/store operations available on typical
CPUs. Since this model provides I/O using arbitrary virtual addresses, it
enables algorithmic designs that are based on pointer-to-pointer (e.g. buffer
of pointers) concepts. For instance, it allows creating tree-like data
structures such as:
                   ________________
                  |      NODE1     |
                  | uint64_t data  |
                  +----------------|
                  | NODE*  |  NODE*|
                  +--------+-------+
                    /              \
   ________________/                \________________
  |      NODE2     |                |      NODE3     |
  | uint64_t data  |                | uint64_t data  |
  +----------------|                +----------------|
  | NODE*  |  NODE*|                | NODE*  |  NODE*|
  +--------+-------+                +--------+-------+

Please note that pointers inside such structures can point to memory locations
in different OCL allocations - e.g. NODE1 and NODE2 can reside in one OCL
allocation while NODE3 resides in a completely separate OCL allocation.
Additionally, such pointers can be shared with the CPU (i.e. using the SVM -
Shared Virtual Memory - feature). Using pointers from different allocations
doesn't affect the stateless addressing model, which even allows scattered
reading from different allocations at the same time (i.e. by utilizing the
SIMD nature of send instructions).

When it comes to coherency programming, send instructions in the stateless
model can be encoded (at ISA level) to either use or disable coherency.
However, for generic OCL applications (such as the example with the tree-like
data structure), the OCL compiler is not able to determine the origin of memory
pointed to by an arbitrary pointer - i.e. it is not able to track a given
pointer back to a specific allocation. As such, it's not able to decide whether
coherency is needed or not for a specific pointer (or for a specific I/O
instruction). As a result, the compiler encodes all stateless sends as coherent
(doing otherwise would lead to functional issues resulting from data
corruption). Please note that it would be possible to work around this (e.g.
based on an allocations map and pointer bounds checking prior to each I/O
instruction) but the performance cost of such a workaround would be many times
greater than the cost of keeping coherency always enabled. As such,
enabling/disabling memory coherency at GEN ISA level is not feasible and an
alternative method is needed.

Such an alternative solution is to have a global coherency switch that allows
disabling coherency for a single (though entire) GPU submission. This is
beneficial because this way we:
* can enable (and pay for) coherency only in submissions that actually need
coherency (submissions that use CL_MEM_SVM_FINE_GRAIN_BUFFER resources)
* don't care about coherency at GEN ISA granularity (no performance impact)

3. Will the coherency switch be used frequently?

There are scenarios that will require frequent toggling of the coherency
switch.
E.g. an application has two OCL compute kernels: kern_master and kern_worker.
kern_master uses, concurrently with CPU, some fine grain SVM resources
(CL_MEM_SVM_FINE_GRAIN_BUFFER). These resources contain descriptors of
computational work that needs to be executed. kern_master analyzes incoming
work descriptors and populates a plain OCL buffer (non-fine-grain) with payload
for kern_worker. Once kern_master is done, kern_worker kicks in and processes
the payload that kern_master produced. These two kernels work in a loop, one
after another. Since only kern_master requires coherency, kern_worker should
not be forced to pay for it. This means that we need to have the ability to
toggle the coherency switch on or off for each GPU submission:
(ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY) kern_worker -> (ENABLE
COHERENCY) kern_master -> (DISABLE COHERENCY) kern_worker -> ...
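
As an illustration only (this is not part of the patch), the toggle pattern
above can be modelled in plain C. The sketch assumes a hypothetical
sim_context structure standing in for the CONTEXT_DATA_PORT_COHERENT bit that
the patch keeps in ctx->flags, and it only counts how many register writes
would be emitted; all sim_* names are invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-context state tracked by the patch. */
struct sim_context {
	bool data_port_coherent;  /* mirrors CONTEXT_DATA_PORT_COHERENT */
	unsigned int toggles;     /* register writes emitted so far */
};

/* Emit a toggle only when the requested state differs from the tracked one. */
static void sim_submit(struct sim_context *ctx, bool want_coherent)
{
	assert(ctx != NULL);
	if (ctx->data_port_coherent != want_coherent) {
		ctx->toggles++;
		ctx->data_port_coherent = want_coherent;
	}
}

/* Alternate kern_master (coherent) and kern_worker (non-coherent). */
static unsigned int sim_master_worker_loop(unsigned int iterations)
{
	struct sim_context ctx = { false, 0 };
	unsigned int i;

	for (i = 0; i < iterations; i++) {
		sim_submit(&ctx, true);   /* kern_master */
		sim_submit(&ctx, false);  /* kern_worker */
	}
	return ctx.toggles;
}

/* Submit the same flag value every time, starting from "not coherent". */
static unsigned int sim_uniform_loop(unsigned int submissions, bool coherent)
{
	struct sim_context ctx = { false, 0 };
	unsigned int i;

	for (i = 0; i < submissions; i++)
		sim_submit(&ctx, coherent);
	return ctx.toggles;
}
```

In this model the alternating loop pays for one register write per submission,
while a workload that never changes the flag pays for at most one write in
total, which is the case the state tracking is meant to keep cheap.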

4. Why was the execbuffer flag approach chosen?

There are two other ways of providing the functionality to UMDs, besides the
execbuffer flag:

a) Chicken bit register whitelisting.

This approach would allow adding the functionality without any change to
the KMD's interface. Also, it has been determined that whitelisting is safe for
gen10 and gen11. The issue is with gen9, where hardware whitelisting cannot
be used, and the OCL driver needs support for it. A workaround there would be
to use the command parser, which verifies buffers before execution. But such
parsing comes at a considerable performance cost.

b) Providing the flag as a context IOCTL setting.

The data port coherency switch could be implemented as a context parameter,
which would schedule submission of a buffer to switch the coherency flag.
That is an elegant solution that binds the flag to the context, matching
the hardware placement of the feature. This solution was not accepted
because of OCL driver performance concerns. The OCL driver is constructed
with emphasis on creating small, but very frequent submissions. With such
an architecture, adding a setparam IOCTL call before submission has a
considerable impact on the performance.

v2: Fixed compilation warning.

Bspec: 11419
Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c            |  3 ++
 drivers/gpu/drm/i915/i915_gem_context.h    |  1 +
 drivers/gpu/drm/i915/i915_gem_execbuffer.c | 18 ++++++++++
 drivers/gpu/drm/i915/intel_lrc.c           | 53 ++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/intel_lrc.h           |  3 ++
 include/uapi/drm/i915_drm.h                | 12 ++++++-
 6 files changed, 89 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index f770be1..19493b0 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -436,6 +436,9 @@ static int i915_getparam_ioctl(struct drm_device *dev, void *data,
 	case I915_PARAM_CS_TIMESTAMP_FREQUENCY:
 		value = 1000 * INTEL_INFO(dev_priv)->cs_timestamp_frequency_khz;
 		break;
+	case I915_PARAM_HAS_EXEC_DATA_PORT_COHERENCY:
+		value = (INTEL_GEN(dev_priv) >= 9);
+		break;
 	default:
 		DRM_DEBUG("Unknown parameter %d\n", param->param);
 		return -EINVAL;
diff --git a/drivers/gpu/drm/i915/i915_gem_context.h b/drivers/gpu/drm/i915/i915_gem_context.h
index 7854262..00aa309 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.h
+++ b/drivers/gpu/drm/i915/i915_gem_context.h
@@ -118,6 +118,7 @@ struct i915_gem_context {
 #define CONTEXT_BANNABLE		3
 #define CONTEXT_BANNED			4
 #define CONTEXT_FORCE_SINGLE_SUBMISSION	5
+#define CONTEXT_DATA_PORT_COHERENT	6
 
 	/**
 	 * @hw_id: - unique identifier for the context
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index c74f5df..ada376c 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -2274,6 +2274,18 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 		eb.batch_flags |= I915_DISPATCH_RS;
 	}
 
+	if (args->flags & I915_EXEC_DATA_PORT_COHERENT) {
+		if (INTEL_GEN(eb.i915) < 9) {
+			DRM_DEBUG("Data Port Coherency is only allowed for Gen9 and above\n");
+			return -EINVAL;
+		}
+		if (eb.engine->class != RENDER_CLASS) {
+			DRM_DEBUG("Data Port Coherency is not available on %s\n",
+				 eb.engine->name);
+			return -EINVAL;
+		}
+	}
+
 	if (args->flags & I915_EXEC_FENCE_IN) {
 		in_fence = sync_file_get_fence(lower_32_bits(args->rsvd2));
 		if (!in_fence)
@@ -2400,6 +2412,12 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 		goto err_batch_unpin;
 	}
 
+	/* Emit the switch of data port coherency state if needed */
+	err = intel_lr_context_modify_data_port_coherency(eb.request,
+			(args->flags & I915_EXEC_DATA_PORT_COHERENT) != 0);
+	if (GEM_WARN_ON(err))
+		DRM_DEBUG("Data Port Coherency toggle failed, keeping old setting.\n");
+
 	if (in_fence) {
 		err = i915_request_await_dma_fence(eb.request, in_fence);
 		if (err < 0)
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 02b25bf..b25df7e 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -255,6 +255,59 @@ intel_lr_context_descriptor_update(struct i915_gem_context *ctx,
 	ce->lrc_desc = desc;
 }
 
+static int emit_set_data_port_coherency(struct i915_request *req, bool enable)
+{
+	u32 *cs;
+	i915_reg_t reg;
+
+	GEM_BUG_ON(req->engine->class != RENDER_CLASS);
+	GEM_BUG_ON(INTEL_GEN(req->i915) < 9);
+
+	cs = intel_ring_begin(req, 4);
+	if (IS_ERR(cs))
+		return PTR_ERR(cs);
+
+	if (INTEL_GEN(req->i915) >= 10)
+		reg = CNL_HDC_CHICKEN0;
+	else
+		reg = HDC_CHICKEN0;
+
+	*cs++ = MI_LOAD_REGISTER_IMM(1);
+	*cs++ = i915_mmio_reg_offset(reg);
+	/* Enabling coherency means disabling the bit which forces it off */
+	if (enable)
+		*cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
+	else
+		*cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
+	*cs++ = MI_NOOP;
+
+	intel_ring_advance(req, cs);
+
+	return 0;
+}
+
+int
+intel_lr_context_modify_data_port_coherency(struct i915_request *req,
+					bool enable)
+{
+	struct i915_gem_context *ctx = req->ctx;
+	int ret;
+
+	if (test_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags) == enable)
+		return 0;
+
+	ret = emit_set_data_port_coherency(req, enable);
+
+	if (!ret) {
+		if (enable)
+			__set_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
+		else
+			__clear_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
+	}
+
+	return ret;
+}
+
 static struct i915_priolist *
 lookup_priolist(struct intel_engine_cs *engine,
 		struct i915_priotree *pt,
diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h
index 59d7b86..c46b239 100644
--- a/drivers/gpu/drm/i915/intel_lrc.h
+++ b/drivers/gpu/drm/i915/intel_lrc.h
@@ -111,4 +111,7 @@ intel_lr_context_descriptor(struct i915_gem_context *ctx,
 	return ctx->engine[engine->id].lrc_desc;
 }
 
+int intel_lr_context_modify_data_port_coherency(struct i915_request *req,
+						bool enable);
+
 #endif /* _INTEL_LRC_H_ */
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index 7f5634c..0f52793 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -529,6 +529,11 @@ typedef struct drm_i915_irq_wait {
  */
 #define I915_PARAM_CS_TIMESTAMP_FREQUENCY 51
 
+/* Query whether DRM_I915_GEM_EXECBUFFER2 supports the ability to switch
+ * Data Cache access into Data Port Coherency mode.
+ */
+#define I915_PARAM_HAS_EXEC_DATA_PORT_COHERENCY 52
+
 typedef struct drm_i915_getparam {
 	__s32 param;
 	/*
@@ -1048,7 +1053,12 @@ struct drm_i915_gem_execbuffer2 {
  */
 #define I915_EXEC_FENCE_ARRAY   (1<<19)
 
-#define __I915_EXEC_UNKNOWN_FLAGS (-(I915_EXEC_FENCE_ARRAY<<1))
+/* Data Port Coherency capability will be switched before an exec call
+ * which has this flag different than previous call for the context.
+ */
+#define I915_EXEC_DATA_PORT_COHERENT   (1<<20)
+
+#define __I915_EXEC_UNKNOWN_FLAGS (-(I915_EXEC_DATA_PORT_COHERENT<<1))
 
 #define I915_EXEC_CONTEXT_ID_MASK	(0xffffffff)
 #define i915_execbuffer2_set_context_id(eb2, context) \
-- 
2.7.4
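
For readers unfamiliar with masked registers, the value written by
emit_set_data_port_coherency() can be sketched in standalone C. This is a
rough model, not the driver code: the sketch_* helpers are invented here, the
bit position of HDC_FORCE_NON_COHERENT (bit 4) is an assumption taken from the
i915 register definitions of that period, and the masked-write layout (upper
16 bits select which lower bits are updated) is the convention behind
_MASKED_BIT_ENABLE() and _MASKED_BIT_DISABLE().

```c
#include <assert.h>
#include <stdint.h>

/* Assumed bit position of HDC_FORCE_NON_COHERENT; check i915_reg.h. */
#define SKETCH_HDC_FORCE_NON_COHERENT (1u << 4)

/* Upper 16 bits: write-enable mask. Lower 16 bits: new bit values. */
static uint32_t sketch_masked_bit_enable(uint32_t bit)
{
	return (bit << 16) | bit;
}

static uint32_t sketch_masked_bit_disable(uint32_t bit)
{
	return bit << 16;
}

/* Enabling coherency means clearing the bit that forces it off. */
static uint32_t sketch_coherency_lri_value(int enable_coherency)
{
	if (enable_coherency)
		return sketch_masked_bit_disable(SKETCH_HDC_FORCE_NON_COHERENT);
	return sketch_masked_bit_enable(SKETCH_HDC_FORCE_NON_COHERENT);
}

/* Model of the hardware side: only bits selected by the mask change. */
static uint16_t sketch_apply_masked_write(uint16_t reg, uint32_t value)
{
	uint16_t mask = (uint16_t)(value >> 16);
	uint16_t bits = (uint16_t)(value & 0xffffu);

	assert((bits & (uint16_t)~mask) == 0); /* no value bits outside mask */
	return (uint16_t)((reg & (uint16_t)~mask) | (bits & mask));
}
```

The point of the masked form is that a single MI_LOAD_REGISTER_IMM can flip
one chicken bit without a read-modify-write and without disturbing the other
bits of the register.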


* ✗ Fi.CI.CHECKPATCH: warning for drm/i915: Add Exec param to control data port coherency. (rev3)
  2018-03-19 12:37 [RFC v1] Data port coherency control for UMDs Tomasz Lis
                   ` (7 preceding siblings ...)
  2018-03-30 19:59 ` ✗ Fi.CI.IGT: failure " Patchwork
@ 2018-04-11 16:12 ` Patchwork
  2018-04-11 16:29 ` ✓ Fi.CI.BAT: success " Patchwork
                   ` (19 subsequent siblings)
  28 siblings, 0 replies; 81+ messages in thread
From: Patchwork @ 2018-04-11 16:12 UTC (permalink / raw)
  To: Tomasz Lis; +Cc: intel-gfx

== Series Details ==

Series: drm/i915: Add Exec param to control data port coherency. (rev3)
URL   : https://patchwork.freedesktop.org/series/40181/
State : warning

== Summary ==

$ dim checkpatch origin/drm-tip
bcf7786e9207 drm/i915: Add Exec param to control data port coherency.
-:14: WARNING:COMMIT_LOG_LONG_LINE: Possible unwrapped commit description (prefer a maximum 75 chars per line)
#14: 
coherency at data port level. Keeping the coherency at that level is disabled

-:177: CHECK:PARENTHESIS_ALIGNMENT: Alignment should match open parenthesis
#177: FILE: drivers/gpu/drm/i915/i915_gem_execbuffer.c:2284:
+			DRM_DEBUG("Data Port Coherency is not available on %s\n",
+				 eb.engine->name);

-:191: CHECK:PARENTHESIS_ALIGNMENT: Alignment should match open parenthesis
#191: FILE: drivers/gpu/drm/i915/i915_gem_execbuffer.c:2417:
+	err = intel_lr_context_modify_data_port_coherency(eb.request,
+			(args->flags & I915_EXEC_DATA_PORT_COHERENT) != 0);

-:239: CHECK:PARENTHESIS_ALIGNMENT: Alignment should match open parenthesis
#239: FILE: drivers/gpu/drm/i915/intel_lrc.c:291:
+intel_lr_context_modify_data_port_coherency(struct i915_request *req,
+					bool enable)

-:298: CHECK:SPACING: spaces preferred around that '<<' (ctx:VxV)
#298: FILE: include/uapi/drm/i915_drm.h:1059:
+#define I915_EXEC_DATA_PORT_COHERENT   (1<<20)
                                          ^

-:300: CHECK:SPACING: spaces preferred around that '<<' (ctx:VxV)
#300: FILE: include/uapi/drm/i915_drm.h:1061:
+#define __I915_EXEC_UNKNOWN_FLAGS (-(I915_EXEC_DATA_PORT_COHERENT<<1))
                                                                  ^

total: 0 errors, 1 warnings, 5 checks, 136 lines checked
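
As a side note on the __I915_EXEC_UNKNOWN_FLAGS line flagged above: negating
the highest known flag shifted left by one yields a mask of every bit above
that flag. The following standalone sketch shows the arithmetic under the
assumption that the flags field is 64 bits wide, as in
struct drm_i915_gem_execbuffer2; the sketch_* names are invented for this
example and the flag values are copied from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_EXEC_FENCE_ARRAY        (1u << 19)
#define SKETCH_EXEC_DATA_PORT_COHERENT (1u << 20)

/* Equivalent of -(highest_known_flag << 1) widened to 64 bits. */
static uint64_t sketch_unknown_flags(uint32_t highest_known_flag)
{
	/* Unsigned wrap-around gives the two's-complement negation. */
	return (uint64_t)0 - ((uint64_t)highest_known_flag << 1);
}

/* A submission is rejected if it sets any bit in the unknown mask. */
static int sketch_flags_rejected(uint64_t flags, uint64_t unknown_mask)
{
	assert(unknown_mask != 0);
	return (flags & unknown_mask) != 0;
}
```

With the old definition bit 20 is part of the unknown mask and would be
rejected; with the new one it is accepted and bit 21 becomes the first
rejected bit.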


* ✓ Fi.CI.BAT: success for drm/i915: Add Exec param to control data port coherency. (rev3)
  2018-03-19 12:37 [RFC v1] Data port coherency control for UMDs Tomasz Lis
                   ` (8 preceding siblings ...)
  2018-04-11 16:12 ` ✗ Fi.CI.CHECKPATCH: warning for drm/i915: Add Exec param to control data port coherency. (rev3) Patchwork
@ 2018-04-11 16:29 ` Patchwork
  2018-04-11 20:02 ` ✗ Fi.CI.IGT: failure " Patchwork
                   ` (18 subsequent siblings)
  28 siblings, 0 replies; 81+ messages in thread
From: Patchwork @ 2018-04-11 16:29 UTC (permalink / raw)
  To: Tomasz Lis; +Cc: intel-gfx

== Series Details ==

Series: drm/i915: Add Exec param to control data port coherency. (rev3)
URL   : https://patchwork.freedesktop.org/series/40181/
State : success

== Summary ==

= CI Bug Log - changes from CI_DRM_4046 -> Patchwork_8669 =

== Summary - WARNING ==

  Minor unknown changes coming with Patchwork_8669 need to be verified
  manually.
  
  If you think the reported changes have nothing to do with the changes
  introduced in Patchwork_8669, please notify your bug team to allow them
  to document this new failure mode, which will reduce false positives in CI.

  External URL: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_8669/

== Possible new issues ==

  Here are the unknown changes that may have been introduced in Patchwork_8669:

  === IGT changes ===

    ==== Warnings ====

    igt@gem_exec_gttfill@basic:
      fi-pnv-d510:        PASS -> SKIP

    
== Known issues ==

  Here are the changes found in Patchwork_8669 that come from known issues:

  === IGT changes ===

    ==== Issues hit ====

    igt@kms_pipe_crc_basic@suspend-read-crc-pipe-b:
      fi-snb-2520m:       PASS -> INCOMPLETE (fdo#103713)

    
    ==== Possible fixes ====

    igt@kms_pipe_crc_basic@suspend-read-crc-pipe-a:
      fi-cfl-s3:          INCOMPLETE (fdo#105641) -> PASS

    
  fdo#103713 https://bugs.freedesktop.org/show_bug.cgi?id=103713
  fdo#105641 https://bugs.freedesktop.org/show_bug.cgi?id=105641


== Participating hosts (34 -> 31) ==

  Missing    (3): fi-ilk-m540 fi-bxt-dsi fi-skl-6700hq 


== Build changes ==

    * Linux: CI_DRM_4046 -> Patchwork_8669

  CI_DRM_4046: d123888920ccd112851ade43a3bf1c25627c2316 @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_4422: a914075d55dd089095121906bf4c3e825a3cecf2 @ git://anongit.freedesktop.org/xorg/app/intel-gpu-tools
  Patchwork_8669: bcf7786e9207cb04b7b49bd601ed131b8bb1332c @ git://anongit.freedesktop.org/gfx-ci/linux
  piglit_4422: 45e115f293fd6acc0c9647cf2d3b76be78819ba5 @ git://anongit.freedesktop.org/piglit


== Linux commits ==

bcf7786e9207 drm/i915: Add Exec param to control data port coherency.

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_8669/issues.html

* ✗ Fi.CI.IGT: failure for drm/i915: Add Exec param to control data port coherency. (rev3)
  2018-03-19 12:37 [RFC v1] Data port coherency control for UMDs Tomasz Lis
                   ` (9 preceding siblings ...)
  2018-04-11 16:29 ` ✓ Fi.CI.BAT: success " Patchwork
@ 2018-04-11 20:02 ` Patchwork
  2018-06-20 15:45 ` ✗ Fi.CI.CHECKPATCH: warning for drm/i915: Add Exec param to control data port coherency. (rev4) Patchwork
                   ` (17 subsequent siblings)
  28 siblings, 0 replies; 81+ messages in thread
From: Patchwork @ 2018-04-11 20:02 UTC (permalink / raw)
  To: Tomasz Lis; +Cc: intel-gfx

== Series Details ==

Series: drm/i915: Add Exec param to control data port coherency. (rev3)
URL   : https://patchwork.freedesktop.org/series/40181/
State : failure

== Summary ==

= CI Bug Log - changes from CI_DRM_4046_full -> Patchwork_8669_full =

== Summary - FAILURE ==

  Serious unknown changes coming with Patchwork_8669_full absolutely need to be
  verified manually.
  
  If you think the reported changes have nothing to do with the changes
  introduced in Patchwork_8669_full, please notify your bug team to allow them
  to document this new failure mode, which will reduce false positives in CI.

  External URL: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_8669/

== Possible new issues ==

  Here are the unknown changes that may have been introduced in Patchwork_8669_full:

  === IGT changes ===

    ==== Possible regressions ====

    igt@gem_exec_params@invalid-flag:
      shard-kbl:          PASS -> FAIL
      shard-apl:          PASS -> FAIL

    
    ==== Warnings ====

    igt@pm_rc6_residency@rc6-accuracy:
      shard-kbl:          SKIP -> PASS +3

    
== Known issues ==

  Here are the changes found in Patchwork_8669_full that come from known issues:

  === IGT changes ===

    ==== Issues hit ====

    igt@kms_flip@2x-busy-flip:
      shard-hsw:          PASS -> DMESG-WARN (fdo#102614)

    igt@kms_frontbuffer_tracking@fbc-1p-offscren-pri-indfb-draw-mmap-gtt:
      shard-apl:          PASS -> FAIL (fdo#103167)

    igt@perf_pmu@busy-accuracy-50-vecs0:
      shard-kbl:          PASS -> FAIL (fdo#105157)

    igt@prime_vgem@basic-fence-flip:
      shard-apl:          PASS -> FAIL (fdo#104008)

    
    ==== Possible fixes ====

    igt@kms_flip@flip-vs-panning:
      shard-apl:          DMESG-WARN (fdo#104727) -> PASS

    igt@kms_flip@flip-vs-panning-vs-hang:
      shard-snb:          DMESG-WARN (fdo#103821) -> PASS

    igt@kms_flip@plain-flip-fb-recreate:
      shard-hsw:          FAIL (fdo#100368) -> PASS

    igt@kms_flip@wf_vblank-ts-check:
      shard-kbl:          FAIL (fdo#105346) -> PASS

    igt@kms_frontbuffer_tracking@fbc-rgb101010-draw-render:
      shard-kbl:          DMESG-WARN (fdo#103558) -> PASS +2

    igt@kms_vblank@pipe-b-accuracy-idle:
      shard-hsw:          FAIL (fdo#102583) -> PASS

    
  fdo#100368 https://bugs.freedesktop.org/show_bug.cgi?id=100368
  fdo#102583 https://bugs.freedesktop.org/show_bug.cgi?id=102583
  fdo#102614 https://bugs.freedesktop.org/show_bug.cgi?id=102614
  fdo#103167 https://bugs.freedesktop.org/show_bug.cgi?id=103167
  fdo#103558 https://bugs.freedesktop.org/show_bug.cgi?id=103558
  fdo#103821 https://bugs.freedesktop.org/show_bug.cgi?id=103821
  fdo#104008 https://bugs.freedesktop.org/show_bug.cgi?id=104008
  fdo#104727 https://bugs.freedesktop.org/show_bug.cgi?id=104727
  fdo#105157 https://bugs.freedesktop.org/show_bug.cgi?id=105157
  fdo#105346 https://bugs.freedesktop.org/show_bug.cgi?id=105346


== Participating hosts (6 -> 4) ==

  Missing    (2): shard-glk shard-glkb 


== Build changes ==

    * Linux: CI_DRM_4046 -> Patchwork_8669

  CI_DRM_4046: d123888920ccd112851ade43a3bf1c25627c2316 @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_4422: a914075d55dd089095121906bf4c3e825a3cecf2 @ git://anongit.freedesktop.org/xorg/app/intel-gpu-tools
  Patchwork_8669: bcf7786e9207cb04b7b49bd601ed131b8bb1332c @ git://anongit.freedesktop.org/gfx-ci/linux
  piglit_4422: 45e115f293fd6acc0c9647cf2d3b76be78819ba5 @ git://anongit.freedesktop.org/piglit

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_8669/shards.html

* Re: [RFC v1] drm/i915: Add Exec param to control data port coherency.
  2018-03-20 17:23         ` Lis, Tomasz
@ 2018-05-04  9:24           ` Joonas Lahtinen
  0 siblings, 0 replies; 81+ messages in thread
From: Joonas Lahtinen @ 2018-05-04  9:24 UTC (permalink / raw)
  To: Lis, Tomasz, Chris Wilson, intel-gfx; +Cc: bartosz.dunajski

Quoting Lis, Tomasz (2018-03-20 19:23:03)
> 
> 
> On 2018-03-19 15:26, Chris Wilson wrote:
> 
>     Quoting Lis, Tomasz (2018-03-19 14:14:19)
> 
> 
>         On 2018-03-19 13:43, Chris Wilson wrote:
> 
>             Quoting Tomasz Lis (2018-03-19 12:37:35)
> 
>                 The patch adds a parameter to control the data port coherency functionality
>                 on a per-exec call basis. When data port coherency flag value is different
>                 than what it was in previous call for the context, a command to switch data
>                 port coherency state is added before the buffer to be executed.
> 
>             So this is part of the context? Why do it at exec level?
> 
>         It is part of the context, stored within HDC chicken bit register.
>         The exec level was requested by the OCL team, due to concerns about
>         performance cost of context setparam calls.
> 
>     What? Oh dear, oh dear, thrice oh dear. The context setparam would look
>     like:
> 
>             if (arg != context->value) {
>                     rq = request_alloc(context, RCS);
>                     cs = ring_begin(rq, 4);
>                     cs++ = MI_LRI;
>                     cs++ = reg;
>                     cs++ = magic;
>                     cs++ = MI_NOOP;
>                     request_add(rq);
>                     context->value = arg
>             }
> 
>     The argument is whether stuffing it into a crowded, v.frequently
>     executed execbuf is better than an irregular setparam. If they want to
>     flip it on every batch, use execbuf. If it's going to be very
>     infrequent, setparam.
> 
> Implementing the data port coherency switch as context setparam would not be a
> problem, I agree.
> But this is not a solution OCL is willing to accept. Any additional IOCTL call
> is a concern for the OCL developers.

Being part of the hardware context is a good indication that the GEM context
is the right place for the bit. Stuffing more into execbuf for a
one-usecase-one-platform scenario doesn't sound very forward-looking in
terms of overall driver development.

I would truly imagine that the IOCTL execution time should not be
meaningful compared to compute kernel execution times. If they already
have a large number of IOCTL calls between each batch, I guess
that is something to be discussed separately.

Regards, Joonas

> 
> For more explanation on switch frequency - please look at the cover letter I
> provided; here's the related part of it:
> (note: the data port coherency is called fine grain coherency within UMD)
> 
>     3. Will coherency switch be used frequently?
> 
>     There are scenarios that will require frequent toggling of the coherency
>     switch.
>     E.g. an application has two OCL compute kernels: kern_master and kern_worker.
>     kern_master uses, concurrently with CPU, some fine grain SVM resources
>     (CL_MEM_SVM_FINE_GRAIN_BUFFER). These resources contain descriptors of
>     computational work that needs to be executed. kern_master analyzes incoming
>     work descriptors and populates a plain OCL buffer (non-fine-grain) with payload
>     for kern_worker. Once kern_master is done, kern_worker kicks-in and processes
>     the payload that kern_master produced. These two kernels work in a loop, one
>     after another. Since only kern_master requires coherency, kern_worker should
>     not be forced to pay for it. This means that we need to have the ability to
>     toggle the coherency switch on or off for each GPU submission:
>     (ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY) kern_worker -> (ENABLE
>     COHERENCY) kern_master -> (DISABLE COHERENCY) kern_worker -> ...
> 
>     That discussion must be part of the rationale in the commitlog.
> 
> Will add.
> Should I place the whole text from cover letter within the commit comment?
> 
>     Otoh, execbuf3 would accept it as a command packet. Hmm.
> 
> I know we have execbuf2, but execbuf3? Are you proposing to add something like
> that?
> 
>               If exec level
>             is desired, why not whitelist it?
> 
>         If we have no issue in whitelisting the register, I'm sure OCL will
>         agree to that.
>         I assumed the whitelisting will be unacceptable because of security
>         concerns with some options.
>         The register also changes its position and content between gens, which
>         makes whitelisting hard to manage.
> 
>         The main purpose of chicken bit registers, in general, is to allow working
>         around hardware features which could be buggy or could have
>         unintended influence on the platform.
>         The data port coherency functionality landed there for the same reasons;
>         then it twisted itself in a way that we now need user space to switch it.
>         Is it really ok to whitelist chicken bit registers?
> 
>     It all depends on whether it breaks segregation. If the only users
>     affected are themselves, fine. Otherwise, no.
>     -Chris
> 
> Chicken Bit registers are definitely not planned as safe for use. While the meaning
> of bits within HDC_CHICKEN0 changes between gens, I doubt any of the registers
> *can't* be used to cause a GPU hang.
> -Tomasz
> 
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


* [PATCH v1] Second implementation of Data Port Coherency.
  2018-03-19 12:37 ` [RFC v1] drm/i915: Add Exec param to control data port coherency Tomasz Lis
                     ` (2 preceding siblings ...)
  2018-04-11 15:46   ` [PATCH v2] " Tomasz Lis
@ 2018-06-20 15:03   ` Tomasz Lis
  2018-06-20 15:03     ` [PATCH v1] drm/i915: Add IOCTL Param to control data port coherency Tomasz Lis
  2018-07-09 13:20   ` [PATCH v4] " Tomasz Lis
                     ` (3 subsequent siblings)
  7 siblings, 1 reply; 81+ messages in thread
From: Tomasz Lis @ 2018-06-20 15:03 UTC (permalink / raw)
  To: intel-gfx; +Cc: bartosz.dunajski

The OCL Team agreed to use an IOCTL instead of an Exec flag to switch
coherency settings.

Also:
1. I will follow this patch with an IGT test for the new functionality.
2. The OCL Team will publish a UMD patch for it.

Tomasz Lis (1):
  drm/i915: Add IOCTL Param to control data port coherency.

 drivers/gpu/drm/i915/i915_gem_context.c | 41 ++++++++++++++++++++++++++
 drivers/gpu/drm/i915/i915_gem_context.h |  6 ++++
 drivers/gpu/drm/i915/intel_lrc.c        | 51 +++++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/intel_lrc.h        |  4 +++
 include/uapi/drm/i915_drm.h             |  1 +
 5 files changed, 103 insertions(+)

-- 
2.7.4


* [PATCH v1] drm/i915: Add IOCTL Param to control data port coherency.
  2018-06-20 15:03   ` [PATCH v1] Second implementation of Data Port Coherency Tomasz Lis
@ 2018-06-20 15:03     ` Tomasz Lis
  2018-06-21  6:39       ` Joonas Lahtinen
                         ` (2 more replies)
  0 siblings, 3 replies; 81+ messages in thread
From: Tomasz Lis @ 2018-06-20 15:03 UTC (permalink / raw)
  To: intel-gfx; +Cc: bartosz.dunajski

The patch adds a parameter to control the data port coherency functionality
on a per-context level. When the IOCTL is called, a command to switch the data
port coherency state is added to the ordered list. All prior requests are
executed with the old coherency settings, and all exec requests after the IOCTL
will use the new settings.
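
For illustration only (not part of the patch): a minimal sketch of how a UMD
might fill the setparam argument for this parameter. The struct below is a
local stand-in that mirrors the layout of struct drm_i915_gem_context_param so
that the example builds without the DRM uapi headers, and
fill_coherency_param() is a hypothetical helper name. The parameter value 0x7
is the one proposed in this patch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Value proposed by this patch for include/uapi/drm/i915_drm.h. */
#define I915_CONTEXT_PARAM_COHERENCY 0x7

/*
 * Local stand-in mirroring struct drm_i915_gem_context_param, so that the
 * sketch compiles without the DRM uapi headers (an assumption made for
 * illustration; real code would include the uapi header instead).
 */
struct ctx_param_sketch {
	uint32_t ctx_id;
	uint32_t size;
	uint64_t param;
	uint64_t value;
};

/* Hypothetical helper: fill the argument block passed to context setparam. */
static void fill_coherency_param(struct ctx_param_sketch *p,
				 uint32_t ctx_id, int enable)
{
	assert(p != NULL);

	memset(p, 0, sizeof(*p));
	p->ctx_id = ctx_id;
	p->size = 0;	/* the patch rejects a non-zero size with -EINVAL */
	p->param = I915_CONTEXT_PARAM_COHERENCY;
	p->value = enable ? 1 : 0;
}
```

In a real UMD the filled structure would be handed to the existing
DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM ioctl; with this patch applied, -ENODEV
would be expected on hardware older than gen9.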

Rationale:

The OpenCL driver developers requested functionality to control cache
coherency at data port level. Keeping the coherency at that level is disabled
by default due to its performance costs. The OpenCL driver is planning to
enable it for a small subset of submissions, when such functionality is
required. Below are answers to basic questions explaining the background
of the functionality and the reasoning for the proposed implementation:

1. Why do we need a coherency enable/disable switch for memory that is shared
between CPU and GEN (GPU)?

Memory coherency between CPU and GEN, while being a great feature that enables
CL_MEM_SVM_FINE_GRAIN_BUFFER OCL capability on Intel GEN architecture, adds
overhead related to tracking (snooping) memory inside different cache units
(L1$, L2$, L3$, LLC$, etc.). At the same time, only a minority of modern OCL
applications actually use CL_MEM_SVM_FINE_GRAIN_BUFFER (and hence require
memory coherency between CPU and GPU). The goal of the coherency enable/disable
switch is to remove the overhead of memory coherency when it is not
needed.

2. Why do we need a global coherency switch?

In order to support I/O commands from within EUs (Execution Units), Intel GEN
ISA (GEN Instruction Set Assembly) contains dedicated "send" instructions.
These send instructions provide several addressing models. One of these
addressing models (named "stateless") provides the most flexible I/O using plain
virtual addresses (as opposed to buffer_handle+offset models). This "stateless"
model is similar to regular memory load/store operations available on typical
CPUs. Since this model provides I/O using arbitrary virtual addresses, it
enables algorithmic designs that are based on pointer-to-pointer (e.g. buffer
of pointers) concepts. For instance, it allows creating tree-like data
structures such as:
                   ________________
                  |      NODE1     |
                  | uint64_t data  |
                  +----------------|
                  | NODE*  |  NODE*|
                  +--------+-------+
                    /              \
   ________________/                \________________
  |      NODE2     |                |      NODE3     |
  | uint64_t data  |                | uint64_t data  |
  +----------------|                +----------------|
  | NODE*  |  NODE*|                | NODE*  |  NODE*|
  +--------+-------+                +--------+-------+

Please note that pointers inside such structures can point to memory locations
in different OCL allocations - e.g. NODE1 and NODE2 can reside in one OCL
allocation while NODE3 resides in a completely separate OCL allocation.
Additionally, such pointers can be shared with CPU (i.e. using SVM - Shared
Virtual Memory feature). Using pointers from different allocations doesn't
affect the stateless addressing model which even allows scattered reading from
different allocations at the same time (i.e. by utilizing SIMD-nature of send
instructions).
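
For illustration only: a minimal C sketch of the node layout drawn above and a
traversal that simply follows raw pointers. The names node_link() and
sum_tree() are hypothetical. The point is that the traversal has no way of
telling which allocation a child pointer leads into, which is the property the
stateless model relies on.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One tree node: a payload plus two plain virtual-address child pointers. */
struct node {
	uint64_t data;
	struct node *left;
	struct node *right;
};

/* Hypothetical helper: attach children that may live in any allocation. */
static void node_link(struct node *parent, struct node *left,
		      struct node *right)
{
	assert(parent != NULL);

	parent->left = left;
	parent->right = right;
}

/* Sum all payloads by chasing pointers, regardless of where nodes reside. */
static uint64_t sum_tree(const struct node *n)
{
	if (n == NULL)
		return 0;

	return n->data + sum_tree(n->left) + sum_tree(n->right);
}
```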

When it comes to coherency programming, send instructions in stateless model
can be encoded (at ISA level) to either use or disable coherency. However, for
generic OCL applications (such as the example with tree-like data structure), OCL
compiler is not able to determine origin of memory pointed to by an arbitrary
pointer - i.e. is not able to track given pointer back to a specific
allocation. As such, it's not able to decide whether coherency is needed or not
for specific pointer (or for specific I/O instruction). As a result, compiler
encodes all stateless sends as coherent (doing otherwise would lead to
functional issues resulting from data corruption). Please note that it would be
possible to work around this (e.g. based on allocations map and pointer bounds
checking prior to each I/O instruction) but the performance cost of such
workaround would be many times greater than the cost of keeping coherency
always enabled. As such, enabling/disabling memory coherency at GEN ISA level
is not feasible and an alternative method is needed.

One such alternative is to have a global coherency switch that allows
disabling coherency for a single (though entire) GPU submission. This is
beneficial because this way we:
* can enable (and pay for) coherency only in submissions that actually need
coherency (submissions that use CL_MEM_SVM_FINE_GRAIN_BUFFER resources)
* don't care about coherency at GEN ISA granularity (no performance impact)

3. Will coherency switch be used frequently?

There are scenarios that will require frequent toggling of the coherency
switch.
E.g. an application has two OCL compute kernels: kern_master and kern_worker.
kern_master uses, concurrently with CPU, some fine grain SVM resources
(CL_MEM_SVM_FINE_GRAIN_BUFFER). These resources contain descriptors of
computational work that needs to be executed. kern_master analyzes incoming
work descriptors and populates a plain OCL buffer (non-fine-grain) with payload
for kern_worker. Once kern_master is done, kern_worker kicks in and processes
the payload that kern_master produced. These two kernels work in a loop, one
after another. Since only kern_master requires coherency, kern_worker should
not be forced to pay for it. This means that we need to have the ability to
toggle the coherency switch on or off for each GPU submission:
(ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY) kern_worker -> (ENABLE
COHERENCY) kern_master -> (DISABLE COHERENCY) kern_worker -> ...
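
For illustration only: a minimal sketch of UMD-side bookkeeping for the loop
above, assuming the UMD remembers the state it last requested and issues a
switch only when the next submission needs a different one. The struct and
function names are hypothetical and are not taken from any existing UMD.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-context bookkeeping kept by the UMD. */
struct coherency_tracker {
	bool coherent;		/* state last requested; off by default */
	unsigned int switches;	/* number of switch calls that would be issued */
};

/* Returns true when a coherency switch must precede this submission. */
static bool prepare_submission(struct coherency_tracker *t,
			       bool needs_coherency)
{
	assert(t != NULL);

	if (t->coherent == needs_coherency)
		return false;

	t->coherent = needs_coherency;
	t->switches++;
	return true;
}
```

With strictly alternating kern_master and kern_worker submissions, every
submission changes the required state, so one switch per submission is
needed. Back-to-back submissions with the same requirement need none.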

Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Michal Winiarski <michal.winiarski@intel.com>

Bspec: 11419
Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
---
 drivers/gpu/drm/i915/i915_gem_context.c | 41 ++++++++++++++++++++++++++
 drivers/gpu/drm/i915/i915_gem_context.h |  6 ++++
 drivers/gpu/drm/i915/intel_lrc.c        | 51 +++++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/intel_lrc.h        |  4 +++
 include/uapi/drm/i915_drm.h             |  1 +
 5 files changed, 103 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index ccf463a..ea65ae6 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -711,6 +711,24 @@ static bool client_is_banned(struct drm_i915_file_private *file_priv)
 	return atomic_read(&file_priv->ban_score) >= I915_CLIENT_SCORE_BANNED;
 }
 
+static int i915_gem_context_set_data_port_coherent(struct i915_gem_context *ctx)
+{
+	int ret;
+	ret = intel_lr_context_modify_data_port_coherency(ctx, true);
+	if (!GEM_WARN_ON(ret))
+		__set_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
+	return ret;
+}
+
+static int i915_gem_context_clear_data_port_coherent(struct i915_gem_context *ctx)
+{
+	int ret;
+	ret = intel_lr_context_modify_data_port_coherency(ctx, false);
+	if (!GEM_WARN_ON(ret))
+		__clear_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
+	return ret;
+}
+
 int i915_gem_context_create_ioctl(struct drm_device *dev, void *data,
 				  struct drm_file *file)
 {
@@ -784,6 +802,7 @@ int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
 int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
 				    struct drm_file *file)
 {
+	struct drm_i915_private *dev_priv = to_i915(dev);
 	struct drm_i915_file_private *file_priv = file->driver_priv;
 	struct drm_i915_gem_context_param *args = data;
 	struct i915_gem_context *ctx;
@@ -818,6 +837,16 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
 	case I915_CONTEXT_PARAM_PRIORITY:
 		args->value = ctx->sched.priority;
 		break;
+	case I915_CONTEXT_PARAM_COHERENCY:
+		/*
+		 * ENODEV if the feature is not supported. This removes the need
+		 * of separate IS_SUPPORTED parameter.
+		 */
+		if (INTEL_GEN(dev_priv) < 9)
+			ret = -ENODEV;
+		else
+			args->value = i915_gem_context_is_data_port_coherent(ctx);
+		break;
 	default:
 		ret = -EINVAL;
 		break;
@@ -830,6 +859,7 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
 int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
 				    struct drm_file *file)
 {
+	struct drm_i915_private *dev_priv = to_i915(dev);
 	struct drm_i915_file_private *file_priv = file->driver_priv;
 	struct drm_i915_gem_context_param *args = data;
 	struct i915_gem_context *ctx;
@@ -893,6 +923,17 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
 		}
 		break;
 
+	case I915_CONTEXT_PARAM_COHERENCY:
+		if (args->size)
+			ret = -EINVAL;
+		else if (INTEL_GEN(dev_priv) < 9)
+			ret = -ENODEV;
+		else if (args->value)
+			ret = i915_gem_context_set_data_port_coherent(ctx);
+		else
+			ret = i915_gem_context_clear_data_port_coherent(ctx);
+		break;
+
 	default:
 		ret = -EINVAL;
 		break;
diff --git a/drivers/gpu/drm/i915/i915_gem_context.h b/drivers/gpu/drm/i915/i915_gem_context.h
index b116e49..e8ccb70 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.h
+++ b/drivers/gpu/drm/i915/i915_gem_context.h
@@ -126,6 +126,7 @@ struct i915_gem_context {
 #define CONTEXT_BANNABLE		3
 #define CONTEXT_BANNED			4
 #define CONTEXT_FORCE_SINGLE_SUBMISSION	5
+#define CONTEXT_DATA_PORT_COHERENT	6
 
 	/**
 	 * @hw_id: - unique identifier for the context
@@ -257,6 +258,11 @@ static inline void i915_gem_context_set_force_single_submission(struct i915_gem_
 	__set_bit(CONTEXT_FORCE_SINGLE_SUBMISSION, &ctx->flags);
 }
 
+static inline bool i915_gem_context_is_data_port_coherent(struct i915_gem_context *ctx)
+{
+	return test_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
+}
+
 static inline bool i915_gem_context_is_default(const struct i915_gem_context *c)
 {
 	return c->user_handle == DEFAULT_CONTEXT_HANDLE;
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 33bc914..c69dc26 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -258,6 +258,57 @@ intel_lr_context_descriptor_update(struct i915_gem_context *ctx,
 	ce->lrc_desc = desc;
 }
 
+static int emit_set_data_port_coherency(struct i915_request *req, bool enable)
+{
+	u32 *cs;
+	i915_reg_t reg;
+
+	GEM_BUG_ON(req->engine->class != RENDER_CLASS);
+	GEM_BUG_ON(INTEL_GEN(req->i915) < 9);
+
+	cs = intel_ring_begin(req, 4);
+	if (IS_ERR(cs))
+		return PTR_ERR(cs);
+
+	if (INTEL_GEN(req->i915) >= 10)
+		reg = CNL_HDC_CHICKEN0;
+	else
+		reg = HDC_CHICKEN0;
+
+	/* FIXME: this feature may be unuseable on CNL; If this checks to be
+	 *  true, we should enodev for CNL. */
+	*cs++ = MI_LOAD_REGISTER_IMM(1);
+	*cs++ = i915_mmio_reg_offset(reg);
+	/* Enabling coherency means disabling the bit which forces it off */
+	if (enable)
+		*cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
+	else
+		*cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
+	*cs++ = MI_NOOP;
+
+	intel_ring_advance(req, cs);
+
+	return 0;
+}
+
+int
+intel_lr_context_modify_data_port_coherency(struct i915_gem_context *ctx,
+		bool enable)
+{
+	struct i915_request *req;
+	int ret;
+
+	req = i915_request_alloc(ctx->i915->engine[RCS], ctx);
+	if (IS_ERR(req))
+		return PTR_ERR(req);
+
+	ret = emit_set_data_port_coherency(req, enable);
+
+	i915_request_add(req);
+
+	return ret;
+}
+
 static struct i915_priolist *
 lookup_priolist(struct intel_engine_cs *engine, int prio)
 {
diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h
index 1593194..214e291 100644
--- a/drivers/gpu/drm/i915/intel_lrc.h
+++ b/drivers/gpu/drm/i915/intel_lrc.h
@@ -104,4 +104,8 @@ struct i915_gem_context;
 
 void intel_lr_context_resume(struct drm_i915_private *dev_priv);
 
+int
+intel_lr_context_modify_data_port_coherency(struct i915_gem_context *ctx,
+					     bool enable);
+
 #endif /* _INTEL_LRC_H_ */
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index 7f5634c..fab072f 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -1453,6 +1453,7 @@ struct drm_i915_gem_context_param {
 #define I915_CONTEXT_PARAM_NO_ERROR_CAPTURE	0x4
 #define I915_CONTEXT_PARAM_BANNABLE	0x5
 #define I915_CONTEXT_PARAM_PRIORITY	0x6
+#define I915_CONTEXT_PARAM_COHERENCY	0x7
 #define   I915_CONTEXT_MAX_USER_PRIORITY	1023 /* inclusive */
 #define   I915_CONTEXT_DEFAULT_PRIORITY		0
 #define   I915_CONTEXT_MIN_USER_PRIORITY	-1023 /* inclusive */
-- 
2.7.4


* ✗ Fi.CI.CHECKPATCH: warning for drm/i915: Add Exec param to control data port coherency. (rev4)
  2018-03-19 12:37 [RFC v1] Data port coherency control for UMDs Tomasz Lis
                   ` (10 preceding siblings ...)
  2018-04-11 20:02 ` ✗ Fi.CI.IGT: failure " Patchwork
@ 2018-06-20 15:45 ` Patchwork
  2018-06-20 16:00 ` ✓ Fi.CI.BAT: success " Patchwork
                   ` (16 subsequent siblings)
  28 siblings, 0 replies; 81+ messages in thread
From: Patchwork @ 2018-06-20 15:45 UTC (permalink / raw)
  To: Tomasz Lis; +Cc: intel-gfx

== Series Details ==

Series: drm/i915: Add Exec param to control data port coherency. (rev4)
URL   : https://patchwork.freedesktop.org/series/40181/
State : warning

== Summary ==

$ dim checkpatch origin/drm-tip
525e6ee946c8 drm/i915: Add IOCTL Param to control data port coherency.
-:15: WARNING:COMMIT_LOG_LONG_LINE: Possible unwrapped commit description (prefer a maximum 75 chars per line)
#15: 
coherency at data port level. Keeping the coherency at that level is disabled

-:125: WARNING:LINE_SPACING: Missing a blank line after declarations
#125: FILE: drivers/gpu/drm/i915/i915_gem_context.c:717:
+	int ret;
+	ret = intel_lr_context_modify_data_port_coherency(ctx, true);

-:134: WARNING:LINE_SPACING: Missing a blank line after declarations
#134: FILE: drivers/gpu/drm/i915/i915_gem_context.c:726:
+	int ret;
+	ret = intel_lr_context_modify_data_port_coherency(ctx, false);

-:244: WARNING:BLOCK_COMMENT_STYLE: Block comments use a trailing */ on a separate line
#244: FILE: drivers/gpu/drm/i915/intel_lrc.c:279:
+	 *  true, we should enodev for CNL. */

-:261: CHECK:PARENTHESIS_ALIGNMENT: Alignment should match open parenthesis
#261: FILE: drivers/gpu/drm/i915/intel_lrc.c:296:
+intel_lr_context_modify_data_port_coherency(struct i915_gem_context *ctx,
+		bool enable)

-:290: CHECK:PARENTHESIS_ALIGNMENT: Alignment should match open parenthesis
#290: FILE: drivers/gpu/drm/i915/intel_lrc.h:109:
+intel_lr_context_modify_data_port_coherency(struct i915_gem_context *ctx,
+					     bool enable);

total: 0 errors, 4 warnings, 2 checks, 161 lines checked


* ✓ Fi.CI.BAT: success for drm/i915: Add Exec param to control data port coherency. (rev4)
  2018-03-19 12:37 [RFC v1] Data port coherency control for UMDs Tomasz Lis
                   ` (11 preceding siblings ...)
  2018-06-20 15:45 ` ✗ Fi.CI.CHECKPATCH: warning for drm/i915: Add Exec param to control data port coherency. (rev4) Patchwork
@ 2018-06-20 16:00 ` Patchwork
  2018-06-20 21:01 ` ✗ Fi.CI.IGT: failure " Patchwork
                   ` (15 subsequent siblings)
  28 siblings, 0 replies; 81+ messages in thread
From: Patchwork @ 2018-06-20 16:00 UTC (permalink / raw)
  To: Tomasz Lis; +Cc: intel-gfx

== Series Details ==

Series: drm/i915: Add Exec param to control data port coherency. (rev4)
URL   : https://patchwork.freedesktop.org/series/40181/
State : success

== Summary ==

= CI Bug Log - changes from CI_DRM_4348 -> Patchwork_9373 =

== Summary - WARNING ==

  Minor unknown changes coming with Patchwork_9373 need to be verified
  manually.
  
  If you think the reported changes have nothing to do with the changes
  introduced in Patchwork_9373, please notify your bug team to allow them
  to document this new failure mode, which will reduce false positives in CI.

  External URL: https://patchwork.freedesktop.org/api/1.0/series/40181/revisions/4/mbox/

== Possible new issues ==

  Here are the unknown changes that may have been introduced in Patchwork_9373:

  === IGT changes ===

    ==== Warnings ====

    igt@gem_exec_gttfill@basic:
      fi-pnv-d510:        PASS -> SKIP

    
== Known issues ==

  Here are the changes found in Patchwork_9373 that come from known issues:

  === IGT changes ===

    ==== Issues hit ====

    igt@gem_ctx_create@basic-files:
      fi-kbl-7567u:       PASS -> DMESG-WARN (fdo#106954)

    
    ==== Possible fixes ====

    igt@drv_module_reload@basic-reload:
      fi-glk-j4005:       DMESG-WARN (fdo#106248, fdo#106725) -> PASS

    
  fdo#106248 https://bugs.freedesktop.org/show_bug.cgi?id=106248
  fdo#106725 https://bugs.freedesktop.org/show_bug.cgi?id=106725
  fdo#106954 https://bugs.freedesktop.org/show_bug.cgi?id=106954


== Participating hosts (44 -> 38) ==

  Additional (1): fi-hsw-peppy 
  Missing    (7): fi-ilk-m540 fi-hsw-4200u fi-byt-squawks fi-glk-dsi fi-bsw-cyan fi-ctg-p8600 fi-kbl-x1275 


== Build changes ==

    * Linux: CI_DRM_4348 -> Patchwork_9373

  CI_DRM_4348: 3a2fbf8fe32d909c5d44e61e7d212ae694e9e473 @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_4526: 4bbfb4fb14b3deab9bc4db9911280b35c22b718c @ git://anongit.freedesktop.org/xorg/app/intel-gpu-tools
  Patchwork_9373: 525e6ee946c85a6a1742df72a3e2d87695234bb1 @ git://anongit.freedesktop.org/gfx-ci/linux


== Linux commits ==

525e6ee946c8 drm/i915: Add IOCTL Param to control data port coherency.

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_9373/issues.html

* ✗ Fi.CI.IGT: failure for drm/i915: Add Exec param to control data port coherency. (rev4)
  2018-03-19 12:37 [RFC v1] Data port coherency control for UMDs Tomasz Lis
                   ` (12 preceding siblings ...)
  2018-06-20 16:00 ` ✓ Fi.CI.BAT: success " Patchwork
@ 2018-06-20 21:01 ` Patchwork
  2018-07-09 13:57 ` ✗ Fi.CI.CHECKPATCH: warning for drm/i915: Add Exec param to control data port coherency. (rev5) Patchwork
                   ` (14 subsequent siblings)
  28 siblings, 0 replies; 81+ messages in thread
From: Patchwork @ 2018-06-20 21:01 UTC (permalink / raw)
  To: Tomasz Lis; +Cc: intel-gfx

== Series Details ==

Series: drm/i915: Add Exec param to control data port coherency. (rev4)
URL   : https://patchwork.freedesktop.org/series/40181/
State : failure

== Summary ==

= CI Bug Log - changes from CI_DRM_4348_full -> Patchwork_9373_full =

== Summary - FAILURE ==

  Serious unknown changes coming with Patchwork_9373_full absolutely need to be
  verified manually.
  
  If you think the reported changes have nothing to do with the changes
  introduced in Patchwork_9373_full, please notify your bug team to allow them
  to document this new failure mode, which will reduce false positives in CI.

  

== Possible new issues ==

  Here are the unknown changes that may have been introduced in Patchwork_9373_full:

  === IGT changes ===

    ==== Possible regressions ====

    igt@gem_ctx_param@invalid-param-get:
      shard-apl:          PASS -> FAIL +1
      shard-glk:          PASS -> FAIL +1

    igt@gem_ctx_param@invalid-param-set:
      shard-kbl:          PASS -> FAIL +1
      shard-hsw:          PASS -> FAIL +1
      shard-snb:          PASS -> FAIL +1

    
    ==== Warnings ====

    igt@gem_mocs_settings@mocs-rc6-vebox:
      shard-kbl:          SKIP -> PASS +7

    
== Known issues ==

  Here are the changes found in Patchwork_9373_full that come from known issues:

  === IGT changes ===

    ==== Issues hit ====

    igt@drv_selftest@live_hugepages:
      shard-kbl:          PASS -> INCOMPLETE (fdo#103665)

    igt@gem_exec_schedule@pi-ringfull-vebox:
      shard-kbl:          NOTRUN -> FAIL (fdo#103158)

    igt@kms_flip@2x-flip-vs-expired-vblank:
      shard-hsw:          PASS -> FAIL (fdo#102887)

    igt@kms_setmode@basic:
      shard-apl:          PASS -> FAIL (fdo#99912)

    igt@perf@polling:
      shard-hsw:          PASS -> FAIL (fdo#102252)

    
    ==== Possible fixes ====

    igt@gem_exec_await@wide-contexts:
      shard-glk:          FAIL (fdo#105900) -> PASS

    igt@kms_flip@flip-vs-panning-vs-hang:
      shard-snb:          DMESG-WARN (fdo#103821) -> PASS

    igt@kms_flip@plain-flip-ts-check:
      shard-glk:          FAIL (fdo#100368) -> PASS

    igt@kms_frontbuffer_tracking@fbc-2p-primscrn-pri-shrfb-draw-mmap-wc:
      shard-glk:          FAIL (fdo#103167, fdo#104724) -> PASS

    igt@kms_frontbuffer_tracking@fbc-rgb565-draw-mmap-wc:
      shard-kbl:          FAIL (fdo#106067) -> PASS

    igt@perf@enable-disable:
      shard-kbl:          DMESG-FAIL (fdo#106064) -> PASS

    igt@pm_rpm@system-suspend-execbuf:
      shard-kbl:          INCOMPLETE (fdo#103665) -> PASS

    igt@pm_rps@waitboost:
      shard-kbl:          FAIL (fdo#102250) -> PASS

    
    ==== Warnings ====

    igt@drv_selftest@live_gtt:
      shard-glk:          FAIL (fdo#105347) -> INCOMPLETE (fdo#103359, k.org#198133)

    
  fdo#100368 https://bugs.freedesktop.org/show_bug.cgi?id=100368
  fdo#102250 https://bugs.freedesktop.org/show_bug.cgi?id=102250
  fdo#102252 https://bugs.freedesktop.org/show_bug.cgi?id=102252
  fdo#102887 https://bugs.freedesktop.org/show_bug.cgi?id=102887
  fdo#103158 https://bugs.freedesktop.org/show_bug.cgi?id=103158
  fdo#103167 https://bugs.freedesktop.org/show_bug.cgi?id=103167
  fdo#103359 https://bugs.freedesktop.org/show_bug.cgi?id=103359
  fdo#103665 https://bugs.freedesktop.org/show_bug.cgi?id=103665
  fdo#103821 https://bugs.freedesktop.org/show_bug.cgi?id=103821
  fdo#104724 https://bugs.freedesktop.org/show_bug.cgi?id=104724
  fdo#105347 https://bugs.freedesktop.org/show_bug.cgi?id=105347
  fdo#105900 https://bugs.freedesktop.org/show_bug.cgi?id=105900
  fdo#106064 https://bugs.freedesktop.org/show_bug.cgi?id=106064
  fdo#106067 https://bugs.freedesktop.org/show_bug.cgi?id=106067
  fdo#99912 https://bugs.freedesktop.org/show_bug.cgi?id=99912
  k.org#198133 https://bugzilla.kernel.org/show_bug.cgi?id=198133


== Participating hosts (5 -> 5) ==

  No changes in participating hosts


== Build changes ==

    * Linux: CI_DRM_4348 -> Patchwork_9373

  CI_DRM_4348: 3a2fbf8fe32d909c5d44e61e7d212ae694e9e473 @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_4526: 4bbfb4fb14b3deab9bc4db9911280b35c22b718c @ git://anongit.freedesktop.org/xorg/app/intel-gpu-tools
  Patchwork_9373: 525e6ee946c85a6a1742df72a3e2d87695234bb1 @ git://anongit.freedesktop.org/gfx-ci/linux
  piglit_4509: fdc5a4ca11124ab8413c7988896eec4c97336694 @ git://anongit.freedesktop.org/piglit

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_9373/shards.html

* Re: [PATCH v1] drm/i915: Add IOCTL Param to control data port coherency.
  2018-06-20 15:03     ` [PATCH v1] drm/i915: Add IOCTL Param to control data port coherency Tomasz Lis
@ 2018-06-21  6:39       ` Joonas Lahtinen
  2018-06-21 13:47         ` Lis, Tomasz
  2018-06-21  7:05       ` Chris Wilson
  2018-06-21  7:31       ` Dunajski, Bartosz
  2 siblings, 1 reply; 81+ messages in thread
From: Joonas Lahtinen @ 2018-06-21  6:39 UTC (permalink / raw)
  To: Tomasz Lis, intel-gfx; +Cc: bartosz.dunajski

A changelog would be much appreciated, and this is not the first version
of the series. It helps to remind the reviewer that the original
implementation was changed to an IOCTL based on feedback. Please see the
git log in i915 for some examples.

Quoting Tomasz Lis (2018-06-20 18:03:07)
> The patch adds a parameter to control the data port coherency functionality
> on a per-context level. When the IOCTL is called, a command to switch data
> port coherency state is added to the ordered list. All prior requests are
> executed on old coherency settings, and all exec requests after the IOCTL
> will use new settings.
> 
> Rationale:
> 
> The OpenCL driver developers requested functionality to control cache
> coherency at data port level. Keeping the coherency at that level is disabled
> by default due to its performance costs. The OpenCL driver is planning to
> enable it for a small subset of submissions, when such functionality is
> required. Below are answers to basic questions explaining the background
> of the functionality and the reasoning for the proposed implementation:
> 
> 1. Why do we need a coherency enable/disable switch for memory that is shared
> between CPU and GEN (GPU)?
> 
> Memory coherency between CPU and GEN, while being a great feature that enables
> CL_MEM_SVM_FINE_GRAIN_BUFFER OCL capability on Intel GEN architecture, adds
> overhead related to tracking (snooping) memory inside different cache units
> (L1$, L2$, L3$, LLC$, etc.). At the same time, minority of modern OCL
> applications actually use CL_MEM_SVM_FINE_GRAIN_BUFFER (and hence require
> memory coherency between CPU and GPU). The goal of coherency enable/disable
> switch is to remove overhead of memory coherency when memory coherency is not
> needed.
> 
> 2. Why do we need a global coherency switch?
> 
> In order to support I/O commands from within EUs (Execution Units), Intel GEN
> ISA (GEN Instruction Set Assembly) contains dedicated "send" instructions.
> These send instructions provide several addressing models. One of these
> addressing models (named "stateless") provides the most flexible I/O using plain
> virtual addresses (as opposed to buffer_handle+offset models). This "stateless"
> model is similar to regular memory load/store operations available on typical
> CPUs. Since this model provides I/O using arbitrary virtual addresses, it
> enables algorithmic designs that are based on pointer-to-pointer (e.g. buffer
> of pointers) concepts. For instance, it allows creating tree-like data
> structures such as:
>                    ________________
>                   |      NODE1     |
>                   | uint64_t data  |
>                   +----------------|
>                   | NODE*  |  NODE*|
>                   +--------+-------+
>                     /              \
>    ________________/                \________________
>   |      NODE2     |                |      NODE3     |
>   | uint64_t data  |                | uint64_t data  |
>   +----------------|                +----------------|
>   | NODE*  |  NODE*|                | NODE*  |  NODE*|
>   +--------+-------+                +--------+-------+
> 
> Please note that pointers inside such structures can point to memory locations
> in different OCL allocations  - e.g. NODE1 and NODE2 can reside in one OCL
> allocation while NODE3 resides in a completely separate OCL allocation.
> Additionally, such pointers can be shared with CPU (i.e. using SVM - Shared
> Virtual Memory feature). Using pointers from different allocations doesn't
> affect the stateless addressing model which even allows scattered reading from
> different allocations at the same time (i.e. by utilizing SIMD-nature of send
> instructions).
> 
> When it comes to coherency programming, send instructions in stateless model
> can be encoded (at ISA level) to either use or disable coherency. However, for
> generic OCL applications (such as example with tree-like data structure), OCL
> compiler is not able to determine origin of memory pointed to by an arbitrary
> pointer - i.e. is not able to track given pointer back to a specific
> allocation. As such, it's not able to decide whether coherency is needed or not
> for specific pointer (or for specific I/O instruction). As a result, compiler
> encodes all stateless sends as coherent (doing otherwise would lead to
> functional issues resulting from data corruption). Please note that it would be
> possible to workaround this (e.g. based on allocations map and pointer bounds
> checking prior to each I/O instruction) but the performance cost of such
> workaround would be many times greater than the cost of keeping coherency
> always enabled. As such, enabling/disabling memory coherency at GEN ISA level
> is not feasible and alternative method is needed.
> 
> Such alternative solution is to have a global coherency switch that allows
> disabling coherency for single (though entire) GPU submission. This is
> beneficial because this way we:
> * can enable (and pay for) coherency only in submissions that actually need
> coherency (submissions that use CL_MEM_SVM_FINE_GRAIN_BUFFER resources)
> * don't care about coherency at GEN ISA granularity (no performance impact)
> 
> 3. Will coherency switch be used frequently?
> 
> There are scenarios that will require frequent toggling of the coherency
> switch.
> E.g. an application has two OCL compute kernels: kern_master and kern_worker.
> kern_master uses, concurrently with CPU, some fine grain SVM resources
> (CL_MEM_SVM_FINE_GRAIN_BUFFER). These resources contain descriptors of
> computational work that needs to be executed. kern_master analyzes incoming
> work descriptors and populates a plain OCL buffer (non-fine-grain) with payload
> for kern_worker. Once kern_master is done, kern_worker kicks-in and processes
> the payload that kern_master produced. These two kernels work in a loop, one
> after another. Since only kern_master requires coherency, kern_worker should
> not be forced to pay for it. This means that we need to have the ability to
> toggle coherency switch on or off per each GPU submission:
> (ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> (ENABLE
> COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> ...
> 
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Michal Winiarski <michal.winiarski@intel.com>
> 
> Bspec: 11419
> Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>

<SNIP>

> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> index ccf463a..ea65ae6 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> @@ -711,6 +711,24 @@ static bool client_is_banned(struct drm_i915_file_private *file_priv)
>         return atomic_read(&file_priv->ban_score) >= I915_CLIENT_SCORE_BANNED;
>  }
>  
> +static int i915_gem_context_set_data_port_coherent(struct i915_gem_context *ctx)
> +{
> +       int ret;
> +       ret = intel_lr_context_modify_data_port_coherency(ctx, true);
> +       if (!GEM_WARN_ON(ret))

I don't think there's need for the WARN as the error will be propagated
back to userspace?

> +               __set_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
> +       return ret;
> +}
> +
> +static int i915_gem_context_clear_data_port_coherent(struct i915_gem_context *ctx)
> +{
> +       int ret;
> +       ret = intel_lr_context_modify_data_port_coherency(ctx, false);
> +       if (!GEM_WARN_ON(ret))

Ditto.

> +               __clear_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
> +       return ret;
> +}
> +
>  int i915_gem_context_create_ioctl(struct drm_device *dev, void *data,
>                                   struct drm_file *file)
>  {
> @@ -784,6 +802,7 @@ int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
>  int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
>                                     struct drm_file *file)
>  {
> +       struct drm_i915_private *dev_priv = to_i915(dev);
>         struct drm_i915_file_private *file_priv = file->driver_priv;
>         struct drm_i915_gem_context_param *args = data;
>         struct i915_gem_context *ctx;
> @@ -818,6 +837,16 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
>         case I915_CONTEXT_PARAM_PRIORITY:
>                 args->value = ctx->sched.priority;
>                 break;
> +       case I915_CONTEXT_PARAM_COHERENCY:
> +               /*
> +                * ENODEV if the feature is not supported. This removes the need
> +                * of separate IS_SUPPORTED parameter.
> +                */

Code speaks for itself, the comment is not needed.

> +               if (INTEL_GEN(dev_priv) < 9)
> +                       ret = -ENODEV;
> +               else
> +                       args->value = i915_gem_context_is_data_port_coherent(ctx);
> +               break;
>         default:
>                 ret = -EINVAL;
>                 break;
> @@ -893,6 +923,17 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
>                 }
>                 break;
>  
> +       case I915_CONTEXT_PARAM_COHERENCY:
> +               if (args->size)
> +                       ret = -EINVAL;
> +               else if (INTEL_GEN(dev_priv) < 9)
> +                       ret = -ENODEV;
> +               else if (args->value)
> +                       ret = i915_gem_context_set_data_port_coherent(ctx);
> +               else
> +                       ret = i915_gem_context_clear_data_port_coherent(ctx);

Be more strict with the uAPI. Only accept values 0 or 1, then you leave
space for extension in the future.
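
A minimal sketch of the stricter check suggested here, assuming a standalone helper (the name validate_coherency_value is hypothetical and not part of the patch). Anything other than 0 or 1 is rejected, so the remaining values stay free for later extensions:

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not taken from the patch: accept only 0 or 1 and
 * reject everything else, so other values can be given a meaning later.
 */
static inline int validate_coherency_value(uint64_t value, bool *enable)
{
	if (value > 1)
		return -EINVAL;	/* reserved for future extension */

	*enable = (value == 1);
	return 0;
}
```

In the setparam path this check would presumably run before either the set or the clear helper is called.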

> +               break;
> +
>         default:
>                 ret = -EINVAL;
>                 break;

<SNIP>

> +++ b/drivers/gpu/drm/i915/intel_lrc.c

I feel this is not the right file. The bit lives in the hardware context
and doesn't have much to do with LRC.

> @@ -258,6 +258,57 @@ intel_lr_context_descriptor_update(struct i915_gem_context *ctx,
>         ce->lrc_desc = desc;
>  }
>  
> +static int emit_set_data_port_coherency(struct i915_request *req, bool enable)
> +{
> +       u32 *cs;
> +       i915_reg_t reg;
> +
> +       GEM_BUG_ON(req->engine->class != RENDER_CLASS);
> +       GEM_BUG_ON(INTEL_GEN(req->i915) < 9);
> +
> +       cs = intel_ring_begin(req, 4);
> +       if (IS_ERR(cs))
> +               return PTR_ERR(cs);
> +
> +       if (INTEL_GEN(req->i915) >= 10)
> +               reg = CNL_HDC_CHICKEN0;
> +       else
> +               reg = HDC_CHICKEN0;
> +
> +       /* FIXME: this feature may be unuseable on CNL; If this checks to be
> +        *  true, we should enodev for CNL. */

This is exactly why we want the IGT tests to check for effects, not for
the register. Then we can get an answer by running the tests on all kinds
of CNL systems at hand.

> +       *cs++ = MI_LOAD_REGISTER_IMM(1);
> +       *cs++ = i915_mmio_reg_offset(reg);
> +       /* Enabling coherency means disabling the bit which forces it off */

Code is again very self explanatory without the comment.

> +       if (enable)
> +               *cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
> +       else
> +               *cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
> +       *cs++ = MI_NOOP;
> +
> +       intel_ring_advance(req, cs);
> +
> +       return 0;
> +}
> +
> +int
> +intel_lr_context_modify_data_port_coherency(struct i915_gem_context *ctx,
> +               bool enable)
> +{
> +       struct i915_request *req;
> +       int ret;
> +
> +       req = i915_request_alloc(ctx->i915->engine[RCS], ctx);
> +       if (IS_ERR(req))
> +               return PTR_ERR(req);
> +
> +       ret = emit_set_data_port_coherency(req, enable);
> +
> +       i915_request_add(req);
> +
> +       return ret;
> +}

I'm thinking we should set this value only when it has changed, at the point
where we insert the requests into the command stream. So if you change it back
and forth while not emitting any requests, nothing really happens. If you
change the value and emit a request, we should emit an LRI before the jump to
the commands. Similarly, if you keep setting the value to what it already is,
nothing will happen.
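
The deferred approach described above could be modelled roughly as follows. This is only a sketch under stated assumptions: struct ctx_model and both functions are hypothetical stand-ins, and a counter takes the place of the real LRI emission into the ring.

```c
#include <stdbool.h>

/* Hypothetical model of the per-context state; all names are illustrative. */
struct ctx_model {
	bool want_coherent;		/* last value requested via setparam */
	bool hw_coherent;		/* value the hardware context holds */
	unsigned int lri_emitted;	/* register writes emitted so far */
};

/* setparam only records the request; nothing reaches the ring here. */
static inline void model_setparam(struct ctx_model *ctx, bool coherent)
{
	ctx->want_coherent = coherent;
}

/* At submission, emit a register write only if the state really changed. */
static inline void model_submit(struct ctx_model *ctx)
{
	if (ctx->want_coherent != ctx->hw_coherent) {
		ctx->lri_emitted++;	/* stands in for the LRI before the batch */
		ctx->hw_coherent = ctx->want_coherent;
	}
	/* the jump to the user batch would follow here */
}
```

With this shape, toggling the parameter several times between submissions costs nothing, and a submission that follows a redundant setparam emits no extra commands.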

> +
>  static struct i915_priolist *
>  lookup_priolist(struct intel_engine_cs *engine, int prio)
>  {
> diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h
> index 1593194..214e291 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.h
> +++ b/drivers/gpu/drm/i915/intel_lrc.h
> @@ -104,4 +104,8 @@ struct i915_gem_context;
>  
>  void intel_lr_context_resume(struct drm_i915_private *dev_priv);
>  
> +int
> +intel_lr_context_modify_data_port_coherency(struct i915_gem_context *ctx,
> +                                            bool enable);
> +
>  #endif /* _INTEL_LRC_H_ */
> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> index 7f5634c..fab072f 100644
> --- a/include/uapi/drm/i915_drm.h
> +++ b/include/uapi/drm/i915_drm.h
> @@ -1453,6 +1453,7 @@ struct drm_i915_gem_context_param {
>  #define I915_CONTEXT_PARAM_NO_ERROR_CAPTURE    0x4
>  #define I915_CONTEXT_PARAM_BANNABLE    0x5
>  #define I915_CONTEXT_PARAM_PRIORITY    0x6
> +#define I915_CONTEXT_PARAM_COHERENCY   0x7

Please add this line after the indented context priorities.

>  #define   I915_CONTEXT_MAX_USER_PRIORITY       1023 /* inclusive */
>  #define   I915_CONTEXT_DEFAULT_PRIORITY                0
>  #define   I915_CONTEXT_MIN_USER_PRIORITY       -1023 /* inclusive */

Here.

Regards, Joonas
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


* Re: [PATCH v1] drm/i915: Add IOCTL Param to control data port coherency.
  2018-06-20 15:03     ` [PATCH v1] drm/i915: Add IOCTL Param to control data port coherency Tomasz Lis
  2018-06-21  6:39       ` Joonas Lahtinen
@ 2018-06-21  7:05       ` Chris Wilson
  2018-06-21 13:47         ` Lis, Tomasz
  2018-06-21  7:31       ` Dunajski, Bartosz
  2 siblings, 1 reply; 81+ messages in thread
From: Chris Wilson @ 2018-06-21  7:05 UTC (permalink / raw)
  To: Tomasz Lis, intel-gfx; +Cc: bartosz.dunajski

Quoting Tomasz Lis (2018-06-20 16:03:07)
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 33bc914..c69dc26 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -258,6 +258,57 @@ intel_lr_context_descriptor_update(struct i915_gem_context *ctx,
>         ce->lrc_desc = desc;
>  }
>  
> +static int emit_set_data_port_coherency(struct i915_request *req, bool enable)
> +{
> +       u32 *cs;
> +       i915_reg_t reg;
> +
> +       GEM_BUG_ON(req->engine->class != RENDER_CLASS);
> +       GEM_BUG_ON(INTEL_GEN(req->i915) < 9);
> +
> +       cs = intel_ring_begin(req, 4);
> +       if (IS_ERR(cs))
> +               return PTR_ERR(cs);
> +
> +       if (INTEL_GEN(req->i915) >= 10)
> +               reg = CNL_HDC_CHICKEN0;
> +       else
> +               reg = HDC_CHICKEN0;
> +
> +       /* FIXME: this feature may be unuseable on CNL; If this checks to be
> +        *  true, we should enodev for CNL. */
> +       *cs++ = MI_LOAD_REGISTER_IMM(1);
> +       *cs++ = i915_mmio_reg_offset(reg);
> +       /* Enabling coherency means disabling the bit which forces it off */
> +       if (enable)
> +               *cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
> +       else
> +               *cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
> +       *cs++ = MI_NOOP;
> +
> +       intel_ring_advance(req, cs);
> +
> +       return 0;
> +}

There's nothing specific to the logical ringbuffer context here afaics.
It could have just been done inside the single
i915_gem_context_set_data_port_coherency(). Also makes it clearer that
i915_gem_context_set_data_port_coherency needs struct_mutex.

cmd = HDC_FORCE_NON_COHERENT << 16;
if (!coherent)
	cmd |= HDC_FORCE_NON_COHERENT;
*cs++ = cmd;

Does that read any clearer?
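
For readers unfamiliar with masked registers: the upper 16 bits of the written value select which of the lower 16 bits are actually updated. The sketch below checks that the open-coded form above yields the same values as the enable/disable macros in the patch. The macros here are local stand-ins, and the bit position chosen for HDC_FORCE_NON_COHERENT is an assumption made for illustration; the real definitions live in the i915 headers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit position, for illustration only. */
#define HDC_FORCE_NON_COHERENT	(1u << 4)

/* Local stand-ins for the i915 masked-register helpers. */
#define MASKED_BIT_ENABLE(b)	(((b) << 16) | (b))
#define MASKED_BIT_DISABLE(b)	((b) << 16)

/*
 * Open-coded form: always set the write-enable mask for the bit, and set
 * the bit itself only when coherency should be forced off.
 */
static inline uint32_t hdc_chicken0_value(bool coherent)
{
	uint32_t cmd = HDC_FORCE_NON_COHERENT << 16;

	if (!coherent)
		cmd |= HDC_FORCE_NON_COHERENT;
	return cmd;
}
```

Both spellings produce identical register values; the difference is purely one of readability.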

> diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h
> index 1593194..214e291 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.h
> +++ b/drivers/gpu/drm/i915/intel_lrc.h
> @@ -104,4 +104,8 @@ struct i915_gem_context;
>  
>  void intel_lr_context_resume(struct drm_i915_private *dev_priv);
>  
> +int
> +intel_lr_context_modify_data_port_coherency(struct i915_gem_context *ctx,
> +                                            bool enable);
> +
>  #endif /* _INTEL_LRC_H_ */
> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> index 7f5634c..fab072f 100644
> --- a/include/uapi/drm/i915_drm.h
> +++ b/include/uapi/drm/i915_drm.h
> @@ -1453,6 +1453,7 @@ struct drm_i915_gem_context_param {
>  #define I915_CONTEXT_PARAM_NO_ERROR_CAPTURE    0x4
>  #define I915_CONTEXT_PARAM_BANNABLE    0x5
>  #define I915_CONTEXT_PARAM_PRIORITY    0x6
> +#define I915_CONTEXT_PARAM_COHERENCY   0x7

DATAPORT_COHERENCY
There are many different caches.

There should be some commentary around here telling userspace what the
contract is.

>  #define   I915_CONTEXT_MAX_USER_PRIORITY       1023 /* inclusive */
>  #define   I915_CONTEXT_DEFAULT_PRIORITY                0
>  #define   I915_CONTEXT_MIN_USER_PRIORITY       -1023 /* inclusive */

COHERENCY has MAX/MIN_USER_PRIORITY, interesting. I thought it was just
a boolean.
-Chris

* Re: [PATCH v1] drm/i915: Add IOCTL Param to control data port coherency.
  2018-06-20 15:03     ` [PATCH v1] drm/i915: Add IOCTL Param to control data port coherency Tomasz Lis
  2018-06-21  6:39       ` Joonas Lahtinen
  2018-06-21  7:05       ` Chris Wilson
@ 2018-06-21  7:31       ` Dunajski, Bartosz
  2018-06-21  8:48         ` Joonas Lahtinen
  2 siblings, 1 reply; 81+ messages in thread
From: Dunajski, Bartosz @ 2018-06-21  7:31 UTC (permalink / raw)
  To: Lis, Tomasz, intel-gfx

I would like to add a few things that were mentioned previously.

Regarding the adoption plan:
Our plan is to drop the dependency on LLVM 4.0.1 (with custom patches) and instead compile with unpatched (either system or vanilla) LLVM 6.0. Work to transition our compiler stack to LLVM 6 is expected to complete in late Q3. Additionally, we are refactoring our packaging, so instead of a single neo package with multiple components we will have multiple versioned packages with clear dependencies.

We are coordinating with the ClearLinux team to get included once that happens, and we plan to reach out to other OSVs to do the same.

Is this plan enough to consider NEO an actual open-source client for the coherency control patch?

PR for patch usage:
https://github.com/intel/compute-runtime/pull/53 

Bartosz

* Re: [PATCH v1] drm/i915: Add IOCTL Param to control data port coherency.
  2018-06-21  7:31       ` Dunajski, Bartosz
@ 2018-06-21  8:48         ` Joonas Lahtinen
  2018-06-22 16:40           ` Dunajski, Bartosz
  0 siblings, 1 reply; 81+ messages in thread
From: Joonas Lahtinen @ 2018-06-21  8:48 UTC (permalink / raw)
  To: Dunajski, Bartosz, Lis, Tomasz, intel-gfx, Dave Airlie

+ Dave Airlie (The DRM subsystem maintainer) for FYI

Quoting Dunajski, Bartosz (2018-06-21 10:31:57)
> I would like to add few things that were mentioned previously.
> 
> According to adoption plan.
> Our plan is to drop dependency on LLVM 4.0.1 (with custom patches) and instead compile with unpatched (either system or vanilla) LLVM 6.0. Work to transition our compiler stack to LLVM 6 is expected to complete in late Q3. Additionally, we are refactoring our packaging, so instead of a single neo package with multiple components we will have multiple versioned packages with clear dependencies. 
> 
> We are coordinating with ClearLinux team to get included once that happens and plan to reach out to other OSVs to do the same.
> 
> Is this plan enough to consider NEO an actual opensource client for the coherency control patch ?

Yes, once you follow through with the plan, there should be no issues
about merging patches to support the driver.

You may want to squeeze your timeline to be complete before 4.19-rc5,
which is the feature cutoff date for 4.20, but that is a rather
ambitious goal. Your original schedule would land the patches before
4.20-rc5, resulting in inclusion in 4.21.

Regards, Joonas

PS. I'm going on a vacation for a couple of weeks.

> 
> PR for patch usage:
> https://github.com/intel/compute-runtime/pull/53 
> 
> Bartosz

* Re: [PATCH v1] drm/i915: Add IOCTL Param to control data port coherency.
  2018-06-21  7:05       ` Chris Wilson
@ 2018-06-21 13:47         ` Lis, Tomasz
  0 siblings, 0 replies; 81+ messages in thread
From: Lis, Tomasz @ 2018-06-21 13:47 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx; +Cc: bartosz.dunajski



On 2018-06-21 09:05, Chris Wilson wrote:
> Quoting Tomasz Lis (2018-06-20 16:03:07)
>> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
>> index 33bc914..c69dc26 100644
>> --- a/drivers/gpu/drm/i915/intel_lrc.c
>> +++ b/drivers/gpu/drm/i915/intel_lrc.c
>> @@ -258,6 +258,57 @@ intel_lr_context_descriptor_update(struct i915_gem_context *ctx,
>>          ce->lrc_desc = desc;
>>   }
>>   
>> +static int emit_set_data_port_coherency(struct i915_request *req, bool enable)
>> +{
>> +       u32 *cs;
>> +       i915_reg_t reg;
>> +
>> +       GEM_BUG_ON(req->engine->class != RENDER_CLASS);
>> +       GEM_BUG_ON(INTEL_GEN(req->i915) < 9);
>> +
>> +       cs = intel_ring_begin(req, 4);
>> +       if (IS_ERR(cs))
>> +               return PTR_ERR(cs);
>> +
>> +       if (INTEL_GEN(req->i915) >= 10)
>> +               reg = CNL_HDC_CHICKEN0;
>> +       else
>> +               reg = HDC_CHICKEN0;
>> +
>> +       /* FIXME: this feature may be unuseable on CNL; If this checks to be
>> +        *  true, we should enodev for CNL. */
>> +       *cs++ = MI_LOAD_REGISTER_IMM(1);
>> +       *cs++ = i915_mmio_reg_offset(reg);
>> +       /* Enabling coherency means disabling the bit which forces it off */
>> +       if (enable)
>> +               *cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
>> +       else
>> +               *cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
>> +       *cs++ = MI_NOOP;
>> +
>> +       intel_ring_advance(req, cs);
>> +
>> +       return 0;
>> +}
> There's nothing specific to the logical ringbuffer context here afaics.
> It could have just been done inside the single
> i915_gem_context_set_data_port_coherency(). Also makes it clearer that
> i915_gem_context_set_data_port_coherency needs struct_mutex.
>
> cmd = HDC_FORCE_NON_COHERENT << 16;
> if (!coherent)
> 	cmd |= HDC_FORCE_NON_COHERENT;
> *cs++ = cmd;
>
> Does that read any clearer?
Sorry, I don't think I follow.
Should I move the code out of logical ringbuffer context (intel_lrc.c)?
Should I merge the emit_set_data_port_coherency() with 
intel_lr_context_modify_data_port_coherency()?
Should I lock a mutex while adding the request?
-Tomasz
>
>> diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h
>> index 1593194..214e291 100644
>> --- a/drivers/gpu/drm/i915/intel_lrc.h
>> +++ b/drivers/gpu/drm/i915/intel_lrc.h
>> @@ -104,4 +104,8 @@ struct i915_gem_context;
>>   
>>   void intel_lr_context_resume(struct drm_i915_private *dev_priv);
>>   
>> +int
>> +intel_lr_context_modify_data_port_coherency(struct i915_gem_context *ctx,
>> +                                            bool enable);
>> +
>>   #endif /* _INTEL_LRC_H_ */
>> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
>> index 7f5634c..fab072f 100644
>> --- a/include/uapi/drm/i915_drm.h
>> +++ b/include/uapi/drm/i915_drm.h
>> @@ -1453,6 +1453,7 @@ struct drm_i915_gem_context_param {
>>   #define I915_CONTEXT_PARAM_NO_ERROR_CAPTURE    0x4
>>   #define I915_CONTEXT_PARAM_BANNABLE    0x5
>>   #define I915_CONTEXT_PARAM_PRIORITY    0x6
>> +#define I915_CONTEXT_PARAM_COHERENCY   0x7
> DATAPORT_COHERENCY
> There are many different caches.
>
> There should be some commentary around here telling userspace what the
> contract is.
Will do.
>
>>   #define   I915_CONTEXT_MAX_USER_PRIORITY       1023 /* inclusive */
>>   #define   I915_CONTEXT_DEFAULT_PRIORITY                0
>>   #define   I915_CONTEXT_MIN_USER_PRIORITY       -1023 /* inclusive */
> COHERENCY has MAX/MIN_USER_PRIORITY, interesting. I thought it was just
> a boolean.
> -Chris
I did not notice the structure of the defines here; will move the new define.
-Tomasz


* Re: [PATCH v1] drm/i915: Add IOCTL Param to control data port coherency.
  2018-06-21  6:39       ` Joonas Lahtinen
@ 2018-06-21 13:47         ` Lis, Tomasz
  2018-07-18 13:03           ` Joonas Lahtinen
  0 siblings, 1 reply; 81+ messages in thread
From: Lis, Tomasz @ 2018-06-21 13:47 UTC (permalink / raw)
  To: Joonas Lahtinen, intel-gfx; +Cc: bartosz.dunajski



On 2018-06-21 08:39, Joonas Lahtinen wrote:
> Changelog would be much appreciated. And this is not the first version
> of the series. It helps to remind the reviewer that original
> implementation was changed into IOCTl based on feedback. Please see the
> git log in i915 for some examples.
Will add. I considered this a separate series, as it is a different 
implementation.
>
> Quoting Tomasz Lis (2018-06-20 18:03:07)
>> The patch adds a parameter to control the data port coherency functionality
>> on a per-context level. When the IOCTL is called, a command to switch data
>> port coherency state is added to the ordered list. All prior requests are
>> executed on old coherency settings, and all exec requests after the IOCTL
>> will use new settings.
>>
>> Rationale:
>>
>> The OpenCL driver develpers requested a functionality to control cache
>> coherency at data port level. Keeping the coherency at that level is disabled
>> by default due to its performance costs. OpenCL driver is planning to
>> enable it for a small subset of submissions, when such functionality is
>> required. Below are answers to basic question explaining background
>> of the functionality and reasoning for the proposed implementation:
>>
>> 1. Why do we need a coherency enable/disable switch for memory that is shared
>> between CPU and GEN (GPU)?
>>
>> Memory coherency between CPU and GEN, while being a great feature that enables
>> CL_MEM_SVM_FINE_GRAIN_BUFFER OCL capability on Intel GEN architecture, adds
>> overhead related to tracking (snooping) memory inside different cache units
>> (L1$, L2$, L3$, LLC$, etc.). At the same time, minority of modern OCL
>> applications actually use CL_MEM_SVM_FINE_GRAIN_BUFFER (and hence require
>> memory coherency between CPU and GPU). The goal of coherency enable/disable
>> switch is to remove overhead of memory coherency when memory coherency is not
>> needed.
>>
>> 2. Why do we need a global coherency switch?
>>
>> In order to support I/O commands from within EUs (Execution Units), Intel GEN
>> ISA (GEN Instruction Set Assembly) contains dedicated "send" instructions.
>> These send instructions provide several addressing models. One of these
>> addressing models (named "stateless") provides most flexible I/O using plain
>> virtual addresses (as opposed to buffer_handle+offset models). This "stateless"
>> model is similar to regular memory load/store operations available on typical
>> CPUs. Since this model provides I/O using arbitrary virtual addresses, it
>> enables algorithmic designs that are based on pointer-to-pointer (e.g. buffer
>> of pointers) concepts. For instance, it allows creating tree-like data
>> structures such as:
>>                     ________________
>>                    |      NODE1     |
>>                    | uint64_t data  |
>>                    +----------------|
>>                    | NODE*  |  NODE*|
>>                    +--------+-------+
>>                      /              \
>>     ________________/                \________________
>>    |      NODE2     |                |      NODE3     |
>>    | uint64_t data  |                | uint64_t data  |
>>    +----------------|                +----------------|
>>    | NODE*  |  NODE*|                | NODE*  |  NODE*|
>>    +--------+-------+                +--------+-------+
>>
>> Please note that pointers inside such structures can point to memory locations
>> in different OCL allocations  - e.g. NODE1 and NODE2 can reside in one OCL
>> allocation while NODE3 resides in a completely separate OCL allocation.
>> Additionally, such pointers can be shared with CPU (i.e. using SVM - Shared
>> Virtual Memory feature). Using pointers from different allocations doesn't
>> affect the stateless addressing model which even allows scattered reading from
>> different allocations at the same time (i.e. by utilizing SIMD-nature of send
>> instructions).
>>
>> When it comes to coherency programming, send instructions in stateless model
>> can be encoded (at ISA level) to either use or disable coherency. However, for
>> generic OCL applications (such as example with tree-like data structure), OCL
>> compiler is not able to determine origin of memory pointed to by an arbitrary
>> pointer - i.e. is not able to track given pointer back to a specific
>> allocation. As such, it's not able to decide whether coherency is needed or not
>> for specific pointer (or for specific I/O instruction). As a result, compiler
>> encodes all stateless sends as coherent (doing otherwise would lead to
>> functional issues resulting from data corruption). Please note that it would be
>> possible to workaround this (e.g. based on allocations map and pointer bounds
>> checking prior to each I/O instruction) but the performance cost of such
>> workaround would be many times greater than the cost of keeping coherency
>> always enabled. As such, enabling/disabling memory coherency at GEN ISA level
>> is not feasible and alternative method is needed.
>>
>> Such alternative solution is to have a global coherency switch that allows
>> disabling coherency for single (though entire) GPU submission. This is
>> beneficial because this way we:
>> * can enable (and pay for) coherency only in submissions that actually need
>> coherency (submissions that use CL_MEM_SVM_FINE_GRAIN_BUFFER resources)
>> * don't care about coherency at GEN ISA granularity (no performance impact)
>>
>> 3. Will coherency switch be used frequently?
>>
>> There are scenarios that will require frequent toggling of the coherency
>> switch.
>> E.g. an application has two OCL compute kernels: kern_master and kern_worker.
>> kern_master uses, concurrently with CPU, some fine grain SVM resources
>> (CL_MEM_SVM_FINE_GRAIN_BUFFER). These resources contain descriptors of
>> computational work that needs to be executed. kern_master analyzes incoming
>> work descriptors and populates a plain OCL buffer (non-fine-grain) with payload
>> for kern_worker. Once kern_master is done, kern_worker kicks-in and processes
>> the payload that kern_master produced. These two kernels work in a loop, one
>> after another. Since only kern_master requires coherency, kern_worker should
>> not be forced to pay for it. This means that we need to have the ability to
>> toggle coherency switch on or off per each GPU submission:
>> (ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> (ENABLE
>> COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> ...
>>
>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> Cc: Michal Winiarski <michal.winiarski@intel.com>
>>
>> Bspec: 11419
>> Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
> <SNIP>
>
>> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
>> index ccf463a..ea65ae6 100644
>> --- a/drivers/gpu/drm/i915/i915_gem_context.c
>> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
>> @@ -711,6 +711,24 @@ static bool client_is_banned(struct drm_i915_file_private *file_priv)
>>          return atomic_read(&file_priv->ban_score) >= I915_CLIENT_SCORE_BANNED;
>>   }
>>   
>> +static int i915_gem_context_set_data_port_coherent(struct i915_gem_context *ctx)
>> +{
>> +       int ret;
>> +       ret = intel_lr_context_modify_data_port_coherency(ctx, true);
>> +       if (!GEM_WARN_ON(ret))
> I don't think there's need for the WARN as the error will be propagated
> back to userspace?
You're right.
>
>> +               __set_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
>> +       return ret;
>> +}
>> +
>> +static int i915_gem_context_clear_data_port_coherent(struct i915_gem_context *ctx)
>> +{
>> +       int ret;
>> +       ret = intel_lr_context_modify_data_port_coherency(ctx, false);
>> +       if (!GEM_WARN_ON(ret))
> Ditto.
ack
>
>> +               __clear_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
>> +       return ret;
>> +}
>> +
>>   int i915_gem_context_create_ioctl(struct drm_device *dev, void *data,
>>                                    struct drm_file *file)
>>   {
>> @@ -784,6 +802,7 @@ int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
>>   int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
>>                                      struct drm_file *file)
>>   {
>> +       struct drm_i915_private *dev_priv = to_i915(dev);
>>          struct drm_i915_file_private *file_priv = file->driver_priv;
>>          struct drm_i915_gem_context_param *args = data;
>>          struct i915_gem_context *ctx;
>> @@ -818,6 +837,16 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
>>          case I915_CONTEXT_PARAM_PRIORITY:
>>                  args->value = ctx->sched.priority;
>>                  break;
>> +       case I915_CONTEXT_PARAM_COHERENCY:
>> +               /*
>> +                * ENODEV if the feature is not supported. This removes the need
>> +                * of separate IS_SUPPORTED parameter.
>> +                */
> Code speaks for itself, the comment is not needed.
I don't think it is a good idea to limit comments. The current state of
the code makes it hard for anyone new to work on it, as the only
documentation is the history on the mailing list.
I don't think that's the correct approach. I believe comments should be
encouraged.

In this specific case, the code lets you know that ENODEV is returned
below gen9. But there is no macro IS_DATA_PORT_COHERENCY_SUPPORTED()
which would clearly indicate the cause of that, so a comment is required.
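
One way to make the intent explicit in code rather than in a comment would be a named predicate. The sketch below is only an illustration: both function names are hypothetical, and the gen >= 9 threshold simply mirrors the INTEL_GEN(dev_priv) < 9 check in the patch.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical predicate standing in for IS_DATA_PORT_COHERENCY_SUPPORTED(). */
static inline bool data_port_coherency_supported(int gen)
{
	return gen >= 9;
}

/* Simplified model of the getparam case: -ENODEV doubles as the support query. */
static inline int getparam_coherency(int gen, bool ctx_coherent, uint64_t *value)
{
	if (!data_port_coherency_supported(gen))
		return -ENODEV;

	*value = ctx_coherent ? 1 : 0;
	return 0;
}
```

With a predicate like this, the reason for the ENODEV return is carried by the name itself.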

>> +               if (INTEL_GEN(dev_priv) < 9)
>> +                       ret = -ENODEV;
>> +               else
>> +                       args->value = i915_gem_context_is_data_port_coherent(ctx);
>> +               break;
>>          default:
>>                  ret = -EINVAL;
>>                  break;
>> @@ -893,6 +923,17 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
>>                  }
>>                  break;
>>   
>> +       case I915_CONTEXT_PARAM_COHERENCY:
>> +               if (args->size)
>> +                       ret = -EINVAL;
>> +               else if (INTEL_GEN(dev_priv) < 9)
>> +                       ret = -ENODEV;
>> +               else if (args->value)
>> +                       ret = i915_gem_context_set_data_port_coherent(ctx);
>> +               else
>> +                       ret = i915_gem_context_clear_data_port_coherent(ctx);
> Be more strict with the uAPI. Only accept values 0 or 1, then you leave
> space for extension in the future.
Right. Will do.
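
For illustration, a strict check could look like the stand-alone sketch
below. The helper name check_coherency_value() is hypothetical and not part
of the patch; the only assumption is that anything other than 0 or 1 is
rejected with -EINVAL so that other values stay free for later extension.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: accept only 0 (disable) or 1 (enable).
 * Every other value is rejected, keeping it available for future use.
 */
static int check_coherency_value(uint64_t value, bool *enable)
{
	assert(enable);

	if (value > 1)
		return -EINVAL;

	*enable = (value == 1);
	return 0;
}
```
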
>
>> +               break;
>> +
>>          default:
>>                  ret = -EINVAL;
>>                  break;
> <SNIP>
>
>> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> I'm feeling this is not the right file. The bit is in hardware context,
> and doesn't have so much to do with LRC.
Should I move it to i915_gem_context.c?
>
>> @@ -258,6 +258,57 @@ intel_lr_context_descriptor_update(struct i915_gem_context *ctx,
>>          ce->lrc_desc = desc;
>>   }
>>   
>> +static int emit_set_data_port_coherency(struct i915_request *req, bool enable)
>> +{
>> +       u32 *cs;
>> +       i915_reg_t reg;
>> +
>> +       GEM_BUG_ON(req->engine->class != RENDER_CLASS);
>> +       GEM_BUG_ON(INTEL_GEN(req->i915) < 9);
>> +
>> +       cs = intel_ring_begin(req, 4);
>> +       if (IS_ERR(cs))
>> +               return PTR_ERR(cs);
>> +
>> +       if (INTEL_GEN(req->i915) >= 10)
>> +               reg = CNL_HDC_CHICKEN0;
>> +       else
>> +               reg = HDC_CHICKEN0;
>> +
>> +       /* FIXME: this feature may be unuseable on CNL; If this checks to be
>> +        *  true, we should enodev for CNL. */
> This is exactly why we want the IGT tests to check for effects, not for
> the register. Then we can get an answer by running the tests on all kind
> of CNL systems at hand.
This comment is actually outdated, I left it by mistake. Will remove.
>
>> +       *cs++ = MI_LOAD_REGISTER_IMM(1);
>> +       *cs++ = i915_mmio_reg_offset(reg);
>> +       /* Enabling coherency means disabling the bit which forces it off */
> Code is again very self explanatory without the comment.
The logic is reversed, so that "enable" does a "disable". I believe the 
comment does a great job of assuring the reader that this is not just a 
coding mistake.
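
To illustrate the inversion, here is a minimal user-space sketch. It assumes
the usual masked-register convention, in which the upper 16 bits of the
written value select which of the lower 16 bits are updated. The macro names
only mirror the driver's; the definitions and the bit position given to
FORCE_NON_COHERENT are assumptions for illustration, not copied from
i915_reg.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed masked-write encoding: high half is the write mask, low half the value. */
#define MASKED_FIELD(mask, value)  (((uint32_t)(mask) << 16) | (uint32_t)(value))
#define MASKED_BIT_ENABLE(bit)     MASKED_FIELD((bit), (bit))
#define MASKED_BIT_DISABLE(bit)    MASKED_FIELD((bit), 0)

/* Placeholder bit position, chosen for illustration only. */
#define FORCE_NON_COHERENT         (1u << 4)

/* Model of how a masked register applies a write: only masked bits change. */
static uint32_t masked_write(uint32_t reg, uint32_t cmd)
{
	uint32_t mask = cmd >> 16;
	uint32_t value = cmd & 0xffffu;

	return (reg & ~mask) | (value & mask);
}

/* "Enable coherency" clears the force-off bit; "disable" sets it. */
static uint32_t coherency_cmd(bool enable)
{
	assert((FORCE_NON_COHERENT & ~0xffffu) == 0); /* must fit in the low half */

	return enable ? MASKED_BIT_DISABLE(FORCE_NON_COHERENT)
		      : MASKED_BIT_ENABLE(FORCE_NON_COHERENT);
}
```

In this model, coherency_cmd(true) leaves every other bit of the register
untouched and only clears the bit that forces non-coherent behaviour.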

Do we have any official guidelines for limiting comments?
>
>> +       if (enable)
>> +               *cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
>> +       else
>> +               *cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
>> +       *cs++ = MI_NOOP;
>> +
>> +       intel_ring_advance(req, cs);
>> +
>> +       return 0;
>> +}
>> +
>> +int
>> +intel_lr_context_modify_data_port_coherency(struct i915_gem_context *ctx,
>> +               bool enable)
>> +{
>> +       struct i915_request *req;
>> +       int ret;
>> +
>> +       req = i915_request_alloc(ctx->i915->engine[RCS], ctx);
>> +       if (IS_ERR(req))
>> +               return PTR_ERR(req);
>> +
>> +       ret = emit_set_data_port_coherency(req, enable);
>> +
>> +       i915_request_add(req);
>> +
>> +       return ret;
>> +}
> I'm thinking we should set this value when it has changed, when we insert the
> requests into the command stream. So if you change back and forth, while
> not emitting any requests, nothing really happens. If you change the value and
> emit a request, we should emit a LRI before the jump to the commands.
> Similary if you keep setting the value to the value it already was in,
> nothing will happen, again.
When I considered that, my reasoning was:
If we execute the flag-changing buffer right away, it may reach the
hardware sooner when there is no job in progress.
If we use the lazy way and trigger the change just before submission,
the submission code gets additional conditions, and the change
is made while another job is pending (though switching a flag is not a
considerable payload).
If user space switches the flag back and forth for no good reason, then
something is wrong with the user space driver, and it shouldn't be
up to the kernel to fix that.

This is why I chose the current approach. But I can change it if you wish.
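
For comparison, the lazy alternative suggested in the review could look
roughly like the following. This is a stand-alone model, not driver code:
struct ctx_model, set_param() and submit() are hypothetical names, and
emitting the LRI is modelled as incrementing a counter.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-context state: what user space asked for vs. what hardware has. */
struct ctx_model {
	bool requested;           /* last value set through the param */
	bool applied;             /* value last written to the register */
	unsigned int lri_emitted; /* number of register writes emitted */
};

/* Setting the param only records the request; nothing is sent to hardware. */
static void set_param(struct ctx_model *ctx, bool coherent)
{
	assert(ctx);
	ctx->requested = coherent;
}

/* On submission, emit a register write only if the state actually changed. */
static void submit(struct ctx_model *ctx)
{
	assert(ctx);

	if (ctx->requested != ctx->applied) {
		ctx->lri_emitted++; /* stands in for the LRI before the batch */
		ctx->applied = ctx->requested;
	}
	/* ... the user's batch would be emitted here ... */
}
```

With this scheme, toggling the value back and forth without submitting work
emits nothing, at the price of an extra comparison on every submission.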

>> +
>>   static struct i915_priolist *
>>   lookup_priolist(struct intel_engine_cs *engine, int prio)
>>   {
>> diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h
>> index 1593194..214e291 100644
>> --- a/drivers/gpu/drm/i915/intel_lrc.h
>> +++ b/drivers/gpu/drm/i915/intel_lrc.h
>> @@ -104,4 +104,8 @@ struct i915_gem_context;
>>   
>>   void intel_lr_context_resume(struct drm_i915_private *dev_priv);
>>   
>> +int
>> +intel_lr_context_modify_data_port_coherency(struct i915_gem_context *ctx,
>> +                                            bool enable);
>> +
>>   #endif /* _INTEL_LRC_H_ */
>> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
>> index 7f5634c..fab072f 100644
>> --- a/include/uapi/drm/i915_drm.h
>> +++ b/include/uapi/drm/i915_drm.h
>> @@ -1453,6 +1453,7 @@ struct drm_i915_gem_context_param {
>>   #define I915_CONTEXT_PARAM_NO_ERROR_CAPTURE    0x4
>>   #define I915_CONTEXT_PARAM_BANNABLE    0x5
>>   #define I915_CONTEXT_PARAM_PRIORITY    0x6
>> +#define I915_CONTEXT_PARAM_COHERENCY   0x7
> Please add this line after the indented context priorities.
ack
>
>>   #define   I915_CONTEXT_MAX_USER_PRIORITY       1023 /* inclusive */
>>   #define   I915_CONTEXT_DEFAULT_PRIORITY                0
>>   #define   I915_CONTEXT_MIN_USER_PRIORITY       -1023 /* inclusive */
> Here.
>
> Regards, Joonas

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


* Re: [PATCH v1] drm/i915: Add IOCTL Param to control data port coherency.
  2018-06-21  8:48         ` Joonas Lahtinen
@ 2018-06-22 16:40           ` Dunajski, Bartosz
  2018-07-18 13:12             ` Joonas Lahtinen
  0 siblings, 1 reply; 81+ messages in thread
From: Dunajski, Bartosz @ 2018-06-22 16:40 UTC (permalink / raw)
  To: Joonas Lahtinen, Lis, Tomasz, intel-gfx, Dave Airlie

Additionally, we are already on Arch:
https://aur.archlinux.org/packages/compute-runtime 

Can I assume that the adoption plan is not a blocker anymore?

Bartosz

> Yes, once you follow through with the plan, there should be no issues about merging patches to support the driver.
>
> You may want to squeeze your timeline to be complete before 4.19-rc5, which is the feature cutoff date for 4.20, but that is rather an ambitious goal. Your original schedule would land the patches before
> 4.20-rc5 resulting in inclusion to 4.21.
>
> Regards, Joonas
>
> PS. I'm going on a vacation for a couple of weeks.


* [PATCH v4] drm/i915: Add IOCTL Param to control data port coherency.
  2018-03-19 12:37 ` [RFC v1] drm/i915: Add Exec param to control data port coherency Tomasz Lis
                     ` (3 preceding siblings ...)
  2018-06-20 15:03   ` [PATCH v1] Second implementation of Data Port Coherency Tomasz Lis
@ 2018-07-09 13:20   ` Tomasz Lis
  2018-07-09 13:48     ` Lionel Landwerlin
  2018-07-09 16:28     ` Tvrtko Ursulin
  2018-07-12 15:10   ` [PATCH v5] " Tomasz Lis
                     ` (2 subsequent siblings)
  7 siblings, 2 replies; 81+ messages in thread
From: Tomasz Lis @ 2018-07-09 13:20 UTC (permalink / raw)
  To: intel-gfx; +Cc: bartosz.dunajski

The patch adds a parameter to control the data port coherency functionality
on a per-context level. When the IOCTL is called, a command to switch the data
port coherency state is added to the ordered list. All prior requests are
executed with the old coherency settings, and all exec requests after the IOCTL
will use the new settings.

Rationale:

The OpenCL driver developers requested functionality to control cache
coherency at the data port level. Coherency at that level is disabled
by default due to its performance cost. The OpenCL driver plans to
enable it for a small subset of submissions, when such functionality is
required. Below are answers to basic questions explaining the background
of the functionality and the reasoning behind the proposed implementation:

1. Why do we need a coherency enable/disable switch for memory that is shared
between CPU and GEN (GPU)?

Memory coherency between CPU and GEN, while being a great feature that enables
CL_MEM_SVM_FINE_GRAIN_BUFFER OCL capability on Intel GEN architecture, adds
overhead related to tracking (snooping) memory inside different cache units
(L1$, L2$, L3$, LLC$, etc.). At the same time, only a minority of modern OCL
applications actually use CL_MEM_SVM_FINE_GRAIN_BUFFER (and hence require
memory coherency between CPU and GPU). The goal of the coherency enable/disable
switch is to remove the overhead of memory coherency when memory coherency is
not needed.

2. Why do we need a global coherency switch?

In order to support I/O commands from within EUs (Execution Units), Intel GEN
ISA (GEN Instruction Set Assembly) contains dedicated "send" instructions.
These send instructions provide several addressing models. One of these
addressing models (named "stateless") provides most flexible I/O using plain
virtual addresses (as opposed to buffer_handle+offset models). This "stateless"
model is similar to regular memory load/store operations available on typical
CPUs. Since this model provides I/O using arbitrary virtual addresses, it
enables algorithmic designs that are based on pointer-to-pointer (e.g. buffer
of pointers) concepts. For instance, it allows creating tree-like data
structures such as:
                   ________________
                  |      NODE1     |
                  | uint64_t data  |
                  +----------------|
                  | NODE*  |  NODE*|
                  +--------+-------+
                    /              \
   ________________/                \________________
  |      NODE2     |                |      NODE3     |
  | uint64_t data  |                | uint64_t data  |
  +----------------|                +----------------|
  | NODE*  |  NODE*|                | NODE*  |  NODE*|
  +--------+-------+                +--------+-------+

Please note that pointers inside such structures can point to memory locations
in different OCL allocations  - e.g. NODE1 and NODE2 can reside in one OCL
allocation while NODE3 resides in a completely separate OCL allocation.
Additionally, such pointers can be shared with CPU (i.e. using SVM - Shared
Virtual Memory feature). Using pointers from different allocations doesn't
affect the stateless addressing model which even allows scattered reading from
different allocations at the same time (i.e. by utilizing SIMD-nature of send
instructions).
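
As a CPU-side illustration of the pointer-based structure described above
(a sketch only: struct node, tree_sum(), build_tree() and the two arrays
standing in for separate OCL allocations are hypothetical), note that the
traversal sees nothing but raw pointers and cannot tell which allocation a
node came from:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct node {
	uint64_t data;
	struct node *left;
	struct node *right;
};

/* Sums all nodes reachable from root, following pointers wherever they lead. */
static uint64_t tree_sum(const struct node *root)
{
	if (root == NULL)
		return 0;

	return root->data + tree_sum(root->left) + tree_sum(root->right);
}

/* Builds NODE1 and NODE2 in alloc_a and NODE3 in alloc_b, then links them. */
static struct node *build_tree(struct node *alloc_a, struct node *alloc_b)
{
	assert(alloc_a && alloc_b);

	alloc_a[0] = (struct node){ .data = 1, .left = &alloc_a[1], .right = &alloc_b[0] };
	alloc_a[1] = (struct node){ .data = 2, .left = NULL, .right = NULL };
	alloc_b[0] = (struct node){ .data = 3, .left = NULL, .right = NULL };

	return &alloc_a[0];
}
```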

When it comes to coherency programming, send instructions in the stateless
model can be encoded (at ISA level) to either use or disable coherency. However,
for generic OCL applications (such as the example with the tree-like data
structure), the OCL compiler is not able to determine the origin of memory
pointed to by an arbitrary pointer - i.e. it is not able to track a given
pointer back to a specific allocation. As such, it's not able to decide whether
coherency is needed or not for a specific pointer (or for a specific I/O
instruction). As a result, the compiler encodes all stateless sends as coherent
(doing otherwise would lead to functional issues resulting from data
corruption). Please note that it would be possible to work around this (e.g.
based on an allocations map and pointer bounds checking prior to each I/O
instruction) but the performance cost of such a workaround would be many times
greater than the cost of keeping coherency always enabled. As such,
enabling/disabling memory coherency at GEN ISA level is not feasible and an
alternative method is needed.

Such an alternative is a global coherency switch that allows
disabling coherency for a single (though entire) GPU submission. This is
beneficial because this way we:
* can enable (and pay for) coherency only in submissions that actually need
coherency (submissions that use CL_MEM_SVM_FINE_GRAIN_BUFFER resources)
* don't care about coherency at GEN ISA granularity (no performance impact)
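
From the user-mode driver's side, the per-submission decision could be
sketched as below. The names BUF_FINE_GRAIN_SVM and
submission_needs_coherency() are hypothetical and not taken from any real
driver; the only rule encoded is the one stated above, that coherency is
requested when a submission uses at least one fine-grain SVM resource.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-buffer flag marking a CL_MEM_SVM_FINE_GRAIN_BUFFER resource. */
#define BUF_FINE_GRAIN_SVM  (1u << 0)

/* Returns true if any buffer used by the submission is fine-grain SVM. */
static bool submission_needs_coherency(const unsigned int *buf_flags, size_t count)
{
	assert(buf_flags || count == 0);

	for (size_t i = 0; i < count; i++) {
		if (buf_flags[i] & BUF_FINE_GRAIN_SVM)
			return true;
	}

	return false;
}
```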

3. Will coherency switch be used frequently?

There are scenarios that will require frequent toggling of the coherency
switch.
E.g. an application has two OCL compute kernels: kern_master and kern_worker.
kern_master uses, concurrently with CPU, some fine grain SVM resources
(CL_MEM_SVM_FINE_GRAIN_BUFFER). These resources contain descriptors of
computational work that needs to be executed. kern_master analyzes incoming
work descriptors and populates a plain OCL buffer (non-fine-grain) with payload
for kern_worker. Once kern_master is done, kern_worker kicks in and processes
the payload that kern_master produced. These two kernels work in a loop, one
after another. Since only kern_master requires coherency, kern_worker should
not be forced to pay for it. This means that we need to have the ability to
toggle the coherency switch on or off for each GPU submission:
(ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY) kern_worker -> (ENABLE
COHERENCY) kern_master -> (DISABLE COHERENCY) kern_worker -> ...

v2: Fixed compilation warning.
v3: Refactored the patch to add IOCTL instead of exec flag.
v4: Renamed and documented the API flag. Used strict values.
    Removed redundant GEM_WARN_ON()s. Improved to coding standard.
    Introduced a macro for checking whether hardware supports the feature.

Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Michal Winiarski <michal.winiarski@intel.com>

Bspec: 11419
Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h         |  1 +
 drivers/gpu/drm/i915/i915_gem_context.c | 41 +++++++++++++++++++++++++++
 drivers/gpu/drm/i915/i915_gem_context.h |  6 ++++
 drivers/gpu/drm/i915/intel_lrc.c        | 49 +++++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/intel_lrc.h        |  4 +++
 include/uapi/drm/i915_drm.h             |  6 ++++
 6 files changed, 107 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 09ab124..7d4bbd5 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2524,6 +2524,7 @@ intel_info(const struct drm_i915_private *dev_priv)
 #define HAS_EDRAM(dev_priv)	(!!((dev_priv)->edram_cap & EDRAM_ENABLED))
 #define HAS_WT(dev_priv)	((IS_HASWELL(dev_priv) || \
 				 IS_BROADWELL(dev_priv)) && HAS_EDRAM(dev_priv))
+#define HAS_DATA_PORT_COHERENCY(dev_priv)	(INTEL_GEN(dev_priv) >= 9)
 
 #define HWS_NEEDS_PHYSICAL(dev_priv)	((dev_priv)->info.hws_needs_physical)
 
diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index b10770c..6db352e 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -711,6 +711,26 @@ static bool client_is_banned(struct drm_i915_file_private *file_priv)
 	return atomic_read(&file_priv->ban_score) >= I915_CLIENT_SCORE_BANNED;
 }
 
+static int i915_gem_context_set_data_port_coherent(struct i915_gem_context *ctx)
+{
+	int ret;
+
+	ret = intel_lr_context_modify_data_port_coherency(ctx, true);
+	if (!ret)
+		__set_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
+	return ret;
+}
+
+static int i915_gem_context_clear_data_port_coherent(struct i915_gem_context *ctx)
+{
+	int ret;
+
+	ret = intel_lr_context_modify_data_port_coherency(ctx, false);
+	if (!ret)
+		__clear_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
+	return ret;
+}
+
 int i915_gem_context_create_ioctl(struct drm_device *dev, void *data,
 				  struct drm_file *file)
 {
@@ -784,6 +804,7 @@ int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
 int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
 				    struct drm_file *file)
 {
+	struct drm_i915_private *dev_priv = to_i915(dev);
 	struct drm_i915_file_private *file_priv = file->driver_priv;
 	struct drm_i915_gem_context_param *args = data;
 	struct i915_gem_context *ctx;
@@ -818,6 +839,12 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
 	case I915_CONTEXT_PARAM_PRIORITY:
 		args->value = ctx->sched.priority;
 		break;
+	case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
+		if (!HAS_DATA_PORT_COHERENCY(dev_priv))
+			ret = -ENODEV;
+		else
+			args->value = i915_gem_context_is_data_port_coherent(ctx);
+		break;
 	default:
 		ret = -EINVAL;
 		break;
@@ -830,6 +857,7 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
 int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
 				    struct drm_file *file)
 {
+	struct drm_i915_private *dev_priv = to_i915(dev);
 	struct drm_i915_file_private *file_priv = file->driver_priv;
 	struct drm_i915_gem_context_param *args = data;
 	struct i915_gem_context *ctx;
@@ -893,6 +921,19 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
 		}
 		break;
 
+	case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
+		if (args->size)
+			ret = -EINVAL;
+		else if (!HAS_DATA_PORT_COHERENCY(dev_priv))
+			ret = -ENODEV;
+		else if (args->value == 1)
+			ret = i915_gem_context_set_data_port_coherent(ctx);
+		else if (args->value == 0)
+			ret = i915_gem_context_clear_data_port_coherent(ctx);
+		else
+			ret = -EINVAL;
+		break;
+
 	default:
 		ret = -EINVAL;
 		break;
diff --git a/drivers/gpu/drm/i915/i915_gem_context.h b/drivers/gpu/drm/i915/i915_gem_context.h
index b116e49..e8ccb70 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.h
+++ b/drivers/gpu/drm/i915/i915_gem_context.h
@@ -126,6 +126,7 @@ struct i915_gem_context {
 #define CONTEXT_BANNABLE		3
 #define CONTEXT_BANNED			4
 #define CONTEXT_FORCE_SINGLE_SUBMISSION	5
+#define CONTEXT_DATA_PORT_COHERENT	6
 
 	/**
 	 * @hw_id: - unique identifier for the context
@@ -257,6 +258,11 @@ static inline void i915_gem_context_set_force_single_submission(struct i915_gem_
 	__set_bit(CONTEXT_FORCE_SINGLE_SUBMISSION, &ctx->flags);
 }
 
+static inline bool i915_gem_context_is_data_port_coherent(struct i915_gem_context *ctx)
+{
+	return test_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
+}
+
 static inline bool i915_gem_context_is_default(const struct i915_gem_context *c)
 {
 	return c->user_handle == DEFAULT_CONTEXT_HANDLE;
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index ab89dab..1f037e3 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -259,6 +259,55 @@ intel_lr_context_descriptor_update(struct i915_gem_context *ctx,
 	ce->lrc_desc = desc;
 }
 
+static int emit_set_data_port_coherency(struct i915_request *req, bool enable)
+{
+	u32 *cs;
+	i915_reg_t reg;
+
+	GEM_BUG_ON(req->engine->class != RENDER_CLASS);
+	GEM_BUG_ON(INTEL_GEN(req->i915) < 9);
+
+	cs = intel_ring_begin(req, 4);
+	if (IS_ERR(cs))
+		return PTR_ERR(cs);
+
+	if (INTEL_GEN(req->i915) >= 10)
+		reg = CNL_HDC_CHICKEN0;
+	else
+		reg = HDC_CHICKEN0;
+
+	*cs++ = MI_LOAD_REGISTER_IMM(1);
+	*cs++ = i915_mmio_reg_offset(reg);
+	/* Enabling coherency means disabling the bit which forces it off */
+	if (enable)
+		*cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
+	else
+		*cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
+	*cs++ = MI_NOOP;
+
+	intel_ring_advance(req, cs);
+
+	return 0;
+}
+
+int
+intel_lr_context_modify_data_port_coherency(struct i915_gem_context *ctx,
+					    bool enable)
+{
+	struct i915_request *req;
+	int ret;
+
+	req = i915_request_alloc(ctx->i915->engine[RCS], ctx);
+	if (IS_ERR(req))
+		return PTR_ERR(req);
+
+	ret = emit_set_data_port_coherency(req, enable);
+
+	i915_request_add(req);
+
+	return ret;
+}
+
 static struct i915_priolist *
 lookup_priolist(struct intel_engine_cs *engine, int prio)
 {
diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h
index 1593194..f6965ae 100644
--- a/drivers/gpu/drm/i915/intel_lrc.h
+++ b/drivers/gpu/drm/i915/intel_lrc.h
@@ -104,4 +104,8 @@ struct i915_gem_context;
 
 void intel_lr_context_resume(struct drm_i915_private *dev_priv);
 
+int
+intel_lr_context_modify_data_port_coherency(struct i915_gem_context *ctx,
+					    bool enable);
+
 #endif /* _INTEL_LRC_H_ */
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index 7f5634c..e677bea 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -1456,6 +1456,12 @@ struct drm_i915_gem_context_param {
 #define   I915_CONTEXT_MAX_USER_PRIORITY	1023 /* inclusive */
 #define   I915_CONTEXT_DEFAULT_PRIORITY		0
 #define   I915_CONTEXT_MIN_USER_PRIORITY	-1023 /* inclusive */
+/*
+ * When data port level coherency is enabled, the GPU will update memory
+ * buffers shared with CPU, by forcing internal cache units to send memory
+ * writes to real RAM faster. Keeping such coherency has performance cost.
+ */
+#define I915_CONTEXT_PARAM_DATA_PORT_COHERENCY	0x7
 	__u64 value;
 };
 
-- 
2.7.4


* Re: [PATCH v4] drm/i915: Add IOCTL Param to control data port coherency.
  2018-07-09 13:20   ` [PATCH v4] " Tomasz Lis
@ 2018-07-09 13:48     ` Lionel Landwerlin
  2018-07-09 14:03       ` Lis, Tomasz
  2018-07-09 16:28     ` Tvrtko Ursulin
  1 sibling, 1 reply; 81+ messages in thread
From: Lionel Landwerlin @ 2018-07-09 13:48 UTC (permalink / raw)
  To: Tomasz Lis, intel-gfx; +Cc: bartosz.dunajski

On 09/07/18 14:20, Tomasz Lis wrote:
> The patch adds a parameter to control the data port coherency functionality
> on a per-context level. When the IOCTL is called, a command to switch data
> port coherency state is added to the ordered list. All prior requests are
> executed on old coherency settings, and all exec requests after the IOCTL
> will use new settings.
>
> Rationale:
>
> The OpenCL driver developers requested a functionality to control cache
> coherency at data port level. Keeping the coherency at that level is disabled
> by default due to its performance costs. OpenCL driver is planning to
> enable it for a small subset of submissions, when such functionality is
> required. Below are answers to basic question explaining background
> of the functionality and reasoning for the proposed implementation:
>
> 1. Why do we need a coherency enable/disable switch for memory that is shared
> between CPU and GEN (GPU)?
>
> Memory coherency between CPU and GEN, while being a great feature that enables
> CL_MEM_SVM_FINE_GRAIN_BUFFER OCL capability on Intel GEN architecture, adds
> overhead related to tracking (snooping) memory inside different cache units
> (L1$, L2$, L3$, LLC$, etc.). At the same time, minority of modern OCL
> applications actually use CL_MEM_SVM_FINE_GRAIN_BUFFER (and hence require
> memory coherency between CPU and GPU). The goal of coherency enable/disable
> switch is to remove overhead of memory coherency when memory coherency is not
> needed.
>
> 2. Why do we need a global coherency switch?
>
> In order to support I/O commands from within EUs (Execution Units), Intel GEN
> ISA (GEN Instruction Set Assembly) contains dedicated "send" instructions.
> These send instructions provide several addressing models. One of these
> addressing models (named "stateless") provides most flexible I/O using plain
> virtual addresses (as opposed to buffer_handle+offset models). This "stateless"
> model is similar to regular memory load/store operations available on typical
> CPUs. Since this model provides I/O using arbitrary virtual addresses, it
> enables algorithmic designs that are based on pointer-to-pointer (e.g. buffer
> of pointers) concepts. For instance, it allows creating tree-like data
> structures such as:
>                     ________________
>                    |      NODE1     |
>                    | uint64_t data  |
>                    +----------------|
>                    | NODE*  |  NODE*|
>                    +--------+-------+
>                      /              \
>     ________________/                \________________
>    |      NODE2     |                |      NODE3     |
>    | uint64_t data  |                | uint64_t data  |
>    +----------------|                +----------------|
>    | NODE*  |  NODE*|                | NODE*  |  NODE*|
>    +--------+-------+                +--------+-------+
>
> Please note that pointers inside such structures can point to memory locations
> in different OCL allocations  - e.g. NODE1 and NODE2 can reside in one OCL
> allocation while NODE3 resides in a completely separate OCL allocation.
> Additionally, such pointers can be shared with CPU (i.e. using SVM - Shared
> Virtual Memory feature). Using pointers from different allocations doesn't
> affect the stateless addressing model which even allows scattered reading from
> different allocations at the same time (i.e. by utilizing SIMD-nature of send
> instructions).
>
> When it comes to coherency programming, send instructions in stateless model
> can be encoded (at ISA level) to either use or disable coherency. However, for
> generic OCL applications (such as example with tree-like data structure), OCL
> compiler is not able to determine origin of memory pointed to by an arbitrary
> pointer - i.e. is not able to track given pointer back to a specific
> allocation. As such, it's not able to decide whether coherency is needed or not
> for specific pointer (or for specific I/O instruction). As a result, compiler
> encodes all stateless sends as coherent (doing otherwise would lead to
> functional issues resulting from data corruption). Please note that it would be
> possible to workaround this (e.g. based on allocations map and pointer bounds
> checking prior to each I/O instruction) but the performance cost of such
> workaround would be many times greater than the cost of keeping coherency
> always enabled. As such, enabling/disabling memory coherency at GEN ISA level
> is not feasible and alternative method is needed.
>
> Such alternative solution is to have a global coherency switch that allows
> disabling coherency for single (though entire) GPU submission. This is
> beneficial because this way we:
> * can enable (and pay for) coherency only in submissions that actually need
> coherency (submissions that use CL_MEM_SVM_FINE_GRAIN_BUFFER resources)
> * don't care about coherency at GEN ISA granularity (no performance impact)
>
> 3. Will coherency switch be used frequently?
>
> There are scenarios that will require frequent toggling of the coherency
> switch.
> E.g. an application has two OCL compute kernels: kern_master and kern_worker.
> kern_master uses, concurrently with CPU, some fine grain SVM resources
> (CL_MEM_SVM_FINE_GRAIN_BUFFER). These resources contain descriptors of
> computational work that needs to be executed. kern_master analyzes incoming
> work descriptors and populates a plain OCL buffer (non-fine-grain) with payload
> for kern_worker. Once kern_master is done, kern_worker kicks-in and processes
> the payload that kern_master produced. These two kernels work in a loop, one
> after another. Since only kern_master requires coherency, kern_worker should
> not be forced to pay for it. This means that we need to have the ability to
> toggle coherency switch on or off per each GPU submission:
> (ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> (ENABLE
> COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> ...
>
> v2: Fixed compilation warning.
> v3: Refactored the patch to add IOCTL instead of exec flag.
> v4: Renamed and documented the API flag. Used strict values.
>      Removed redundant GEM_WARN_ON()s. Improved to coding standard.
>      Introduced a macro for checking whether hardware supports the feature.
>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Michal Winiarski <michal.winiarski@intel.com>
>
> Bspec: 11419
> Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
> ---
>   drivers/gpu/drm/i915/i915_drv.h         |  1 +
>   drivers/gpu/drm/i915/i915_gem_context.c | 41 +++++++++++++++++++++++++++
>   drivers/gpu/drm/i915/i915_gem_context.h |  6 ++++
>   drivers/gpu/drm/i915/intel_lrc.c        | 49 +++++++++++++++++++++++++++++++++
>   drivers/gpu/drm/i915/intel_lrc.h        |  4 +++
>   include/uapi/drm/i915_drm.h             |  6 ++++
>   6 files changed, 107 insertions(+)
>
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 09ab124..7d4bbd5 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -2524,6 +2524,7 @@ intel_info(const struct drm_i915_private *dev_priv)
>   #define HAS_EDRAM(dev_priv)	(!!((dev_priv)->edram_cap & EDRAM_ENABLED))
>   #define HAS_WT(dev_priv)	((IS_HASWELL(dev_priv) || \
>   				 IS_BROADWELL(dev_priv)) && HAS_EDRAM(dev_priv))
> +#define HAS_DATA_PORT_COHERENCY(dev_priv)	(INTEL_GEN(dev_priv) >= 9)

Reading the documentation it seems that the bit you want to set is gone 
in ICL/Gen11.
Maybe limit this to >= 9 && < 11?
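
A sketch of what such a bounded check could look like, written as a plain
function so it can be tried outside the driver. The helper name and the use
of a bare integer for the generation are assumptions; in the patch this
would be a change to the HAS_DATA_PORT_COHERENCY() macro.

```c
#include <assert.h>
#include <stdbool.h>

/* Supported on Gen9 and Gen10 only, per the suggestion above. */
static bool has_data_port_coherency(int gen)
{
	assert(gen >= 0);

	return gen >= 9 && gen < 11;
}
```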

Cheers,

-
Lionel

>   
>   #define HWS_NEEDS_PHYSICAL(dev_priv)	((dev_priv)->info.hws_needs_physical)
>   
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> index b10770c..6db352e 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> @@ -711,6 +711,26 @@ static bool client_is_banned(struct drm_i915_file_private *file_priv)
>   	return atomic_read(&file_priv->ban_score) >= I915_CLIENT_SCORE_BANNED;
>   }
>   
> +static int i915_gem_context_set_data_port_coherent(struct i915_gem_context *ctx)
> +{
> +	int ret;
> +
> +	ret = intel_lr_context_modify_data_port_coherency(ctx, true);
> +	if (!ret)
> +		__set_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
> +	return ret;
> +}
> +
> +static int i915_gem_context_clear_data_port_coherent(struct i915_gem_context *ctx)
> +{
> +	int ret;
> +
> +	ret = intel_lr_context_modify_data_port_coherency(ctx, false);
> +	if (!ret)
> +		__clear_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
> +	return ret;
> +}
> +
>   int i915_gem_context_create_ioctl(struct drm_device *dev, void *data,
>   				  struct drm_file *file)
>   {
> @@ -784,6 +804,7 @@ int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
>   int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
>   				    struct drm_file *file)
>   {
> +	struct drm_i915_private *dev_priv = to_i915(dev);
>   	struct drm_i915_file_private *file_priv = file->driver_priv;
>   	struct drm_i915_gem_context_param *args = data;
>   	struct i915_gem_context *ctx;
> @@ -818,6 +839,12 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
>   	case I915_CONTEXT_PARAM_PRIORITY:
>   		args->value = ctx->sched.priority;
>   		break;
> +	case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
> +		if (!HAS_DATA_PORT_COHERENCY(dev_priv))
> +			ret = -ENODEV;
> +		else
> +			args->value = i915_gem_context_is_data_port_coherent(ctx);
> +		break;
>   	default:
>   		ret = -EINVAL;
>   		break;
> @@ -830,6 +857,7 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
>   int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
>   				    struct drm_file *file)
>   {
> +	struct drm_i915_private *dev_priv = to_i915(dev);
>   	struct drm_i915_file_private *file_priv = file->driver_priv;
>   	struct drm_i915_gem_context_param *args = data;
>   	struct i915_gem_context *ctx;
> @@ -893,6 +921,19 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
>   		}
>   		break;
>   
> +	case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
> +		if (args->size)
> +			ret = -EINVAL;
> +		else if (!HAS_DATA_PORT_COHERENCY(dev_priv))
> +			ret = -ENODEV;
> +		else if (args->value == 1)
> +			ret = i915_gem_context_set_data_port_coherent(ctx);
> +		else if (args->value == 0)
> +			ret = i915_gem_context_clear_data_port_coherent(ctx);
> +		else
> +			ret = -EINVAL;
> +		break;
> +
>   	default:
>   		ret = -EINVAL;
>   		break;
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.h b/drivers/gpu/drm/i915/i915_gem_context.h
> index b116e49..e8ccb70 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.h
> +++ b/drivers/gpu/drm/i915/i915_gem_context.h
> @@ -126,6 +126,7 @@ struct i915_gem_context {
>   #define CONTEXT_BANNABLE		3
>   #define CONTEXT_BANNED			4
>   #define CONTEXT_FORCE_SINGLE_SUBMISSION	5
> +#define CONTEXT_DATA_PORT_COHERENT	6
>   
>   	/**
>   	 * @hw_id: - unique identifier for the context
> @@ -257,6 +258,11 @@ static inline void i915_gem_context_set_force_single_submission(struct i915_gem_
>   	__set_bit(CONTEXT_FORCE_SINGLE_SUBMISSION, &ctx->flags);
>   }
>   
> +static inline bool i915_gem_context_is_data_port_coherent(struct i915_gem_context *ctx)
> +{
> +	return test_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
> +}
> +
>   static inline bool i915_gem_context_is_default(const struct i915_gem_context *c)
>   {
>   	return c->user_handle == DEFAULT_CONTEXT_HANDLE;
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index ab89dab..1f037e3 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -259,6 +259,55 @@ intel_lr_context_descriptor_update(struct i915_gem_context *ctx,
>   	ce->lrc_desc = desc;
>   }
>   
> +static int emit_set_data_port_coherency(struct i915_request *req, bool enable)
> +{
> +	u32 *cs;
> +	i915_reg_t reg;
> +
> +	GEM_BUG_ON(req->engine->class != RENDER_CLASS);
> +	GEM_BUG_ON(INTEL_GEN(req->i915) < 9);
> +
> +	cs = intel_ring_begin(req, 4);
> +	if (IS_ERR(cs))
> +		return PTR_ERR(cs);
> +
> +	if (INTEL_GEN(req->i915) >= 10)
> +		reg = CNL_HDC_CHICKEN0;
> +	else
> +		reg = HDC_CHICKEN0;
> +
> +	*cs++ = MI_LOAD_REGISTER_IMM(1);
> +	*cs++ = i915_mmio_reg_offset(reg);
> +	/* Enabling coherency means disabling the bit which forces it off */
> +	if (enable)
> +		*cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
> +	else
> +		*cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
> +	*cs++ = MI_NOOP;
> +
> +	intel_ring_advance(req, cs);
> +
> +	return 0;
> +}
> +
> +int
> +intel_lr_context_modify_data_port_coherency(struct i915_gem_context *ctx,
> +					    bool enable)
> +{
> +	struct i915_request *req;
> +	int ret;
> +
> +	req = i915_request_alloc(ctx->i915->engine[RCS], ctx);
> +	if (IS_ERR(req))
> +		return PTR_ERR(req);
> +
> +	ret = emit_set_data_port_coherency(req, enable);
> +
> +	i915_request_add(req);
> +
> +	return ret;
> +}
> +
>   static struct i915_priolist *
>   lookup_priolist(struct intel_engine_cs *engine, int prio)
>   {
> diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h
> index 1593194..f6965ae 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.h
> +++ b/drivers/gpu/drm/i915/intel_lrc.h
> @@ -104,4 +104,8 @@ struct i915_gem_context;
>   
>   void intel_lr_context_resume(struct drm_i915_private *dev_priv);
>   
> +int
> +intel_lr_context_modify_data_port_coherency(struct i915_gem_context *ctx,
> +					    bool enable);
> +
>   #endif /* _INTEL_LRC_H_ */
> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> index 7f5634c..e677bea 100644
> --- a/include/uapi/drm/i915_drm.h
> +++ b/include/uapi/drm/i915_drm.h
> @@ -1456,6 +1456,12 @@ struct drm_i915_gem_context_param {
>   #define   I915_CONTEXT_MAX_USER_PRIORITY	1023 /* inclusive */
>   #define   I915_CONTEXT_DEFAULT_PRIORITY		0
>   #define   I915_CONTEXT_MIN_USER_PRIORITY	-1023 /* inclusive */
> +/*
> + * When data port level coherency is enabled, the GPU will update memory
> + * buffers shared with CPU, by forcing internal cache units to send memory
> + * writes to real RAM faster. Keeping such coherency has performance cost.
> + */
> +#define I915_CONTEXT_PARAM_DATA_PORT_COHERENCY	0x7
>   	__u64 value;
>   };
>   
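
The masked-register write that emit_set_data_port_coherency() emits in the patch above can be modelled in plain C. This is only a sketch, not driver code: the helper names are hypothetical, and the value used for HDC_FORCE_NON_COHERENT (bit 4) is an assumption for illustration; the authoritative definitions are in i915_reg.h.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed value, for illustration only; see i915_reg.h for the real one. */
#define HDC_FORCE_NON_COHERENT (1u << 4)

/*
 * Masked-register payload: the high 16 bits select which bits the write
 * may touch, the low 16 bits carry the new values for those bits.
 */
static uint32_t masked_field(uint32_t mask, uint32_t value)
{
	assert(mask <= 0xffffu && (value & ~mask) == 0);
	return (mask << 16) | value;
}

static uint32_t masked_bit_enable(uint32_t bit)
{
	return masked_field(bit, bit);
}

static uint32_t masked_bit_disable(uint32_t bit)
{
	return masked_field(bit, 0);
}

/* Model of how a masked write updates the low 16 bits of a register. */
static uint32_t apply_masked_write(uint32_t reg, uint32_t payload)
{
	uint32_t mask = payload >> 16;
	uint32_t value = payload & 0xffffu;

	return (reg & ~mask) | (value & mask);
}

/* Payload for a coherency request, following the logic in the patch. */
static uint32_t coherency_payload(int enable)
{
	/* Enabling coherency means clearing the bit that forces it off. */
	return enable ? masked_bit_disable(HDC_FORCE_NON_COHERENT)
		      : masked_bit_enable(HDC_FORCE_NON_COHERENT);
}
```

Because only the selected bit is unmasked, the write leaves every other bit of the register untouched, so no read-modify-write is needed in the command stream.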


_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


* ✗ Fi.CI.CHECKPATCH: warning for drm/i915: Add Exec param to control data port coherency. (rev5)
  2018-03-19 12:37 [RFC v1] Data port coherency control for UMDs Tomasz Lis
                   ` (13 preceding siblings ...)
  2018-06-20 21:01 ` ✗ Fi.CI.IGT: failure " Patchwork
@ 2018-07-09 13:57 ` Patchwork
  2018-07-09 13:58 ` ✗ Fi.CI.SPARSE: " Patchwork
                   ` (13 subsequent siblings)
  28 siblings, 0 replies; 81+ messages in thread
From: Patchwork @ 2018-07-09 13:57 UTC (permalink / raw)
  To: Tomasz Lis; +Cc: intel-gfx

== Series Details ==

Series: drm/i915: Add Exec param to control data port coherency. (rev5)
URL   : https://patchwork.freedesktop.org/series/40181/
State : warning

== Summary ==

$ dim checkpatch origin/drm-tip
c705f01c81f1 drm/i915: Add IOCTL Param to control data port coherency.
-:15: WARNING:COMMIT_LOG_LONG_LINE: Possible unwrapped commit description (prefer a maximum 75 chars per line)
#15: 
coherency at data port level. Keeping the coherency at that level is disabled

total: 0 errors, 1 warnings, 0 checks, 171 lines checked


* ✗ Fi.CI.SPARSE: warning for drm/i915: Add Exec param to control data port coherency. (rev5)
  2018-03-19 12:37 [RFC v1] Data port coherency control for UMDs Tomasz Lis
                   ` (14 preceding siblings ...)
  2018-07-09 13:57 ` ✗ Fi.CI.CHECKPATCH: warning for drm/i915: Add Exec param to control data port coherency. (rev5) Patchwork
@ 2018-07-09 13:58 ` Patchwork
  2018-07-09 14:14 ` ✓ Fi.CI.BAT: success " Patchwork
                   ` (12 subsequent siblings)
  28 siblings, 0 replies; 81+ messages in thread
From: Patchwork @ 2018-07-09 13:58 UTC (permalink / raw)
  To: Tomasz Lis; +Cc: intel-gfx

== Series Details ==

Series: drm/i915: Add Exec param to control data port coherency. (rev5)
URL   : https://patchwork.freedesktop.org/series/40181/
State : warning

== Summary ==

$ dim sparse origin/drm-tip
Commit: drm/i915: Add IOCTL Param to control data port coherency.
-drivers/gpu/drm/i915/selftests/../i915_drv.h:3652:16: warning: expression using sizeof(void)
+drivers/gpu/drm/i915/selftests/../i915_drv.h:3653:16: warning: expression using sizeof(void)


* Re: [PATCH v4] drm/i915: Add IOCTL Param to control data port coherency.
  2018-07-09 13:48     ` Lionel Landwerlin
@ 2018-07-09 14:03       ` Lis, Tomasz
  2018-07-09 14:24         ` Lionel Landwerlin
  0 siblings, 1 reply; 81+ messages in thread
From: Lis, Tomasz @ 2018-07-09 14:03 UTC (permalink / raw)
  To: Lionel Landwerlin, intel-gfx; +Cc: bartosz.dunajski



On 2018-07-09 15:48, Lionel Landwerlin wrote:
> On 09/07/18 14:20, Tomasz Lis wrote:
>> The patch adds a parameter to control the data port coherency 
>> functionality
>> on a per-context level. When the IOCTL is called, a command to switch 
>> data
>> port coherency state is added to the ordered list. All prior requests 
>> are
>> executed on old coherency settings, and all exec requests after the 
>> IOCTL
>> will use new settings.
>>
>> Rationale:
>>
>> The OpenCL driver developers requested a functionality to control cache
>> coherency at data port level. Keeping the coherency at that level is 
>> disabled
>> by default due to its performance costs. OpenCL driver is planning to
>> enable it for a small subset of submissions, when such functionality is
>> required. Below are answers to basic questions explaining background
>> of the functionality and reasoning for the proposed implementation:
>>
>> 1. Why do we need a coherency enable/disable switch for memory that 
>> is shared
>> between CPU and GEN (GPU)?
>>
>> Memory coherency between CPU and GEN, while being a great feature 
>> that enables
>> CL_MEM_SVM_FINE_GRAIN_BUFFER OCL capability on Intel GEN 
>> architecture, adds
>> overhead related to tracking (snooping) memory inside different cache 
>> units
>> (L1$, L2$, L3$, LLC$, etc.). At the same time, minority of modern OCL
>> applications actually use CL_MEM_SVM_FINE_GRAIN_BUFFER (and hence 
>> require
>> memory coherency between CPU and GPU). The goal of coherency 
>> enable/disable
>> switch is to remove overhead of memory coherency when memory 
>> coherency is not
>> needed.
>>
>> 2. Why do we need a global coherency switch?
>>
>> In order to support I/O commands from within EUs (Execution Units), 
>> Intel GEN
>> ISA (GEN Instruction Set Assembly) contains dedicated "send" 
>> instructions.
>> These send instructions provide several addressing models. One of these
>> addressing models (named "stateless") provides most flexible I/O 
>> using plain
>> virtual addresses (as opposed to buffer_handle+offset models). This 
>> "stateless"
>> model is similar to regular memory load/store operations available on 
>> typical
>> CPUs. Since this model provides I/O using arbitrary virtual 
>> addresses, it
>> enables algorithmic designs that are based on pointer-to-pointer 
>> (e.g. buffer
>> of pointers) concepts. For instance, it allows creating tree-like data
>> structures such as:
>>                     ________________
>>                    |      NODE1     |
>>                    | uint64_t data  |
>>                    +----------------|
>>                    | NODE*  |  NODE*|
>>                    +--------+-------+
>>                      /              \
>>     ________________/                \________________
>>    |      NODE2     |                |      NODE3     |
>>    | uint64_t data  |                | uint64_t data  |
>>    +----------------|                +----------------|
>>    | NODE*  |  NODE*|                | NODE*  |  NODE*|
>>    +--------+-------+                +--------+-------+
>>
>> Please note that pointers inside such structures can point to memory 
>> locations
>> in different OCL allocations  - e.g. NODE1 and NODE2 can reside in 
>> one OCL
>> allocation while NODE3 resides in a completely separate OCL allocation.
>> Additionally, such pointers can be shared with CPU (i.e. using SVM - 
>> Shared
>> Virtual Memory feature). Using pointers from different allocations 
>> doesn't
>> affect the stateless addressing model which even allows scattered 
>> reading from
>> different allocations at the same time (i.e. by utilizing SIMD-nature 
>> of send
>> instructions).
>>
>> When it comes to coherency programming, send instructions in 
>> stateless model
>> can be encoded (at ISA level) to either use or disable coherency. 
>> However, for
>> generic OCL applications (such as example with tree-like data 
>> structure), OCL
>> compiler is not able to determine origin of memory pointed to by an 
>> arbitrary
>> pointer - i.e. is not able to track given pointer back to a specific
>> allocation. As such, it's not able to decide whether coherency is 
>> needed or not
>> for specific pointer (or for specific I/O instruction). As a result, 
>> compiler
>> encodes all stateless sends as coherent (doing otherwise would lead to
>> functional issues resulting from data corruption). Please note that 
>> it would be
>> possible to workaround this (e.g. based on allocations map and 
>> pointer bounds
>> checking prior to each I/O instruction) but the performance cost of such
>> workaround would be many times greater than the cost of keeping 
>> coherency
>> always enabled. As such, enabling/disabling memory coherency at GEN 
>> ISA level
>> is not feasible and alternative method is needed.
>>
>> Such alternative solution is to have a global coherency switch that 
>> allows
>> disabling coherency for single (though entire) GPU submission. This is
>> beneficial because this way we:
>> * can enable (and pay for) coherency only in submissions that 
>> actually need
>> coherency (submissions that use CL_MEM_SVM_FINE_GRAIN_BUFFER resources)
>> * don't care about coherency at GEN ISA granularity (no performance 
>> impact)
>>
>> 3. Will coherency switch be used frequently?
>>
>> There are scenarios that will require frequent toggling of the coherency
>> switch.
>> E.g. an application has two OCL compute kernels: kern_master and 
>> kern_worker.
>> kern_master uses, concurrently with CPU, some fine grain SVM resources
>> (CL_MEM_SVM_FINE_GRAIN_BUFFER). These resources contain descriptors of
>> computational work that needs to be executed. kern_master analyzes 
>> incoming
>> work descriptors and populates a plain OCL buffer (non-fine-grain) 
>> with payload
>> for kern_worker. Once kern_master is done, kern_worker kicks-in and 
>> processes
>> the payload that kern_master produced. These two kernels work in a 
>> loop, one
>> after another. Since only kern_master requires coherency, kern_worker 
>> should
>> not be forced to pay for it. This means that we need to have the 
>> ability to
>> toggle coherency switch on or off per each GPU submission:
>> (ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> 
>> (ENABLE
>> COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> ...
>>
>> v2: Fixed compilation warning.
>> v3: Refactored the patch to add IOCTL instead of exec flag.
>> v4: Renamed and documented the API flag. Used strict values.
>>      Removed redundant GEM_WARN_ON()s. Improved to coding standard.
>>      Introduced a macro for checking whether hardware supports the 
>> feature.
>>
>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> Cc: Michal Winiarski <michal.winiarski@intel.com>
>>
>> Bspec: 11419
>> Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
>> ---
>>   drivers/gpu/drm/i915/i915_drv.h         |  1 +
>>   drivers/gpu/drm/i915/i915_gem_context.c | 41 
>> +++++++++++++++++++++++++++
>>   drivers/gpu/drm/i915/i915_gem_context.h |  6 ++++
>>   drivers/gpu/drm/i915/intel_lrc.c        | 49 
>> +++++++++++++++++++++++++++++++++
>>   drivers/gpu/drm/i915/intel_lrc.h        |  4 +++
>>   include/uapi/drm/i915_drm.h             |  6 ++++
>>   6 files changed, 107 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_drv.h 
>> b/drivers/gpu/drm/i915/i915_drv.h
>> index 09ab124..7d4bbd5 100644
>> --- a/drivers/gpu/drm/i915/i915_drv.h
>> +++ b/drivers/gpu/drm/i915/i915_drv.h
>> @@ -2524,6 +2524,7 @@ intel_info(const struct drm_i915_private 
>> *dev_priv)
>>   #define HAS_EDRAM(dev_priv)    (!!((dev_priv)->edram_cap & 
>> EDRAM_ENABLED))
>>   #define HAS_WT(dev_priv)    ((IS_HASWELL(dev_priv) || \
>>                    IS_BROADWELL(dev_priv)) && HAS_EDRAM(dev_priv))
>> +#define HAS_DATA_PORT_COHERENCY(dev_priv) (INTEL_GEN(dev_priv) >= 9)
>
> Reading the documentation it seems that the bit you want to set is 
> gone in ICL/Gen11.
> Maybe limit this to >= 9 && < 11?
Icelake actually has the bit as well; only the register address is different.
I will add support for it in a separate patch as soon as the change that
defines ICL_HDC_CHICKEN0 is accepted.
But you are right that, in its current form, the patch does not support ICL.
I will update the condition.
-Tomasz
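
For illustration, the narrowed check suggested in the review (>= 9 && < 11) could be expressed as below. This is a standalone sketch under assumptions: has_data_port_coherency() is a hypothetical stand-in that takes the generation number directly instead of going through INTEL_GEN(dev_priv), and it does not claim to be the condition used in any later revision of the patch.

```c
#include <assert.h>

/* Hypothetical stand-in for HAS_DATA_PORT_COHERENCY(dev_priv). */
static int has_data_port_coherency(int gen)
{
	assert(gen > 0);
	/* Gen9 and Gen10 only, until an ICL register definition exists. */
	return gen >= 9 && gen < 11;
}
```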
>
> Cheers,
>
> -
> Lionel
>
>>     #define HWS_NEEDS_PHYSICAL(dev_priv) 
>> ((dev_priv)->info.hws_needs_physical)
>>   diff --git a/drivers/gpu/drm/i915/i915_gem_context.c 
>> b/drivers/gpu/drm/i915/i915_gem_context.c
>> index b10770c..6db352e 100644
>> --- a/drivers/gpu/drm/i915/i915_gem_context.c
>> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
>> @@ -711,6 +711,26 @@ static bool client_is_banned(struct 
>> drm_i915_file_private *file_priv)
>>       return atomic_read(&file_priv->ban_score) >= 
>> I915_CLIENT_SCORE_BANNED;
>>   }
>>   +static int i915_gem_context_set_data_port_coherent(struct 
>> i915_gem_context *ctx)
>> +{
>> +    int ret;
>> +
>> +    ret = intel_lr_context_modify_data_port_coherency(ctx, true);
>> +    if (!ret)
>> +        __set_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
>> +    return ret;
>> +}
>> +
>> +static int i915_gem_context_clear_data_port_coherent(struct 
>> i915_gem_context *ctx)
>> +{
>> +    int ret;
>> +
>> +    ret = intel_lr_context_modify_data_port_coherency(ctx, false);
>> +    if (!ret)
>> +        __clear_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
>> +    return ret;
>> +}
>> +
>>   int i915_gem_context_create_ioctl(struct drm_device *dev, void *data,
>>                     struct drm_file *file)
>>   {
>> @@ -784,6 +804,7 @@ int i915_gem_context_destroy_ioctl(struct 
>> drm_device *dev, void *data,
>>   int i915_gem_context_getparam_ioctl(struct drm_device *dev, void 
>> *data,
>>                       struct drm_file *file)
>>   {
>> +    struct drm_i915_private *dev_priv = to_i915(dev);
>>       struct drm_i915_file_private *file_priv = file->driver_priv;
>>       struct drm_i915_gem_context_param *args = data;
>>       struct i915_gem_context *ctx;
>> @@ -818,6 +839,12 @@ int i915_gem_context_getparam_ioctl(struct 
>> drm_device *dev, void *data,
>>       case I915_CONTEXT_PARAM_PRIORITY:
>>           args->value = ctx->sched.priority;
>>           break;
>> +    case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
>> +        if (!HAS_DATA_PORT_COHERENCY(dev_priv))
>> +            ret = -ENODEV;
>> +        else
>> +            args->value = i915_gem_context_is_data_port_coherent(ctx);
>> +        break;
>>       default:
>>           ret = -EINVAL;
>>           break;
>> @@ -830,6 +857,7 @@ int i915_gem_context_getparam_ioctl(struct 
>> drm_device *dev, void *data,
>>   int i915_gem_context_setparam_ioctl(struct drm_device *dev, void 
>> *data,
>>                       struct drm_file *file)
>>   {
>> +    struct drm_i915_private *dev_priv = to_i915(dev);
>>       struct drm_i915_file_private *file_priv = file->driver_priv;
>>       struct drm_i915_gem_context_param *args = data;
>>       struct i915_gem_context *ctx;
>> @@ -893,6 +921,19 @@ int i915_gem_context_setparam_ioctl(struct 
>> drm_device *dev, void *data,
>>           }
>>           break;
>>   +    case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
>> +        if (args->size)
>> +            ret = -EINVAL;
>> +        else if (!HAS_DATA_PORT_COHERENCY(dev_priv))
>> +            ret = -ENODEV;
>> +        else if (args->value == 1)
>> +            ret = i915_gem_context_set_data_port_coherent(ctx);
>> +        else if (args->value == 0)
>> +            ret = i915_gem_context_clear_data_port_coherent(ctx);
>> +        else
>> +            ret = -EINVAL;
>> +        break;
>> +
>>       default:
>>           ret = -EINVAL;
>>           break;
>> diff --git a/drivers/gpu/drm/i915/i915_gem_context.h 
>> b/drivers/gpu/drm/i915/i915_gem_context.h
>> index b116e49..e8ccb70 100644
>> --- a/drivers/gpu/drm/i915/i915_gem_context.h
>> +++ b/drivers/gpu/drm/i915/i915_gem_context.h
>> @@ -126,6 +126,7 @@ struct i915_gem_context {
>>   #define CONTEXT_BANNABLE        3
>>   #define CONTEXT_BANNED            4
>>   #define CONTEXT_FORCE_SINGLE_SUBMISSION    5
>> +#define CONTEXT_DATA_PORT_COHERENT    6
>>         /**
>>        * @hw_id: - unique identifier for the context
>> @@ -257,6 +258,11 @@ static inline void 
>> i915_gem_context_set_force_single_submission(struct i915_gem_
>>       __set_bit(CONTEXT_FORCE_SINGLE_SUBMISSION, &ctx->flags);
>>   }
>>   +static inline bool i915_gem_context_is_data_port_coherent(struct 
>> i915_gem_context *ctx)
>> +{
>> +    return test_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
>> +}
>> +
>>   static inline bool i915_gem_context_is_default(const struct 
>> i915_gem_context *c)
>>   {
>>       return c->user_handle == DEFAULT_CONTEXT_HANDLE;
>> diff --git a/drivers/gpu/drm/i915/intel_lrc.c 
>> b/drivers/gpu/drm/i915/intel_lrc.c
>> index ab89dab..1f037e3 100644
>> --- a/drivers/gpu/drm/i915/intel_lrc.c
>> +++ b/drivers/gpu/drm/i915/intel_lrc.c
>> @@ -259,6 +259,55 @@ intel_lr_context_descriptor_update(struct 
>> i915_gem_context *ctx,
>>       ce->lrc_desc = desc;
>>   }
>>   +static int emit_set_data_port_coherency(struct i915_request *req, 
>> bool enable)
>> +{
>> +    u32 *cs;
>> +    i915_reg_t reg;
>> +
>> +    GEM_BUG_ON(req->engine->class != RENDER_CLASS);
>> +    GEM_BUG_ON(INTEL_GEN(req->i915) < 9);
>> +
>> +    cs = intel_ring_begin(req, 4);
>> +    if (IS_ERR(cs))
>> +        return PTR_ERR(cs);
>> +
>> +    if (INTEL_GEN(req->i915) >= 10)
>> +        reg = CNL_HDC_CHICKEN0;
>> +    else
>> +        reg = HDC_CHICKEN0;
>> +
>> +    *cs++ = MI_LOAD_REGISTER_IMM(1);
>> +    *cs++ = i915_mmio_reg_offset(reg);
>> +    /* Enabling coherency means disabling the bit which forces it 
>> off */
>> +    if (enable)
>> +        *cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
>> +    else
>> +        *cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
>> +    *cs++ = MI_NOOP;
>> +
>> +    intel_ring_advance(req, cs);
>> +
>> +    return 0;
>> +}
>> +
>> +int
>> +intel_lr_context_modify_data_port_coherency(struct i915_gem_context 
>> *ctx,
>> +                        bool enable)
>> +{
>> +    struct i915_request *req;
>> +    int ret;
>> +
>> +    req = i915_request_alloc(ctx->i915->engine[RCS], ctx);
>> +    if (IS_ERR(req))
>> +        return PTR_ERR(req);
>> +
>> +    ret = emit_set_data_port_coherency(req, enable);
>> +
>> +    i915_request_add(req);
>> +
>> +    return ret;
>> +}
>> +
>>   static struct i915_priolist *
>>   lookup_priolist(struct intel_engine_cs *engine, int prio)
>>   {
>> diff --git a/drivers/gpu/drm/i915/intel_lrc.h 
>> b/drivers/gpu/drm/i915/intel_lrc.h
>> index 1593194..f6965ae 100644
>> --- a/drivers/gpu/drm/i915/intel_lrc.h
>> +++ b/drivers/gpu/drm/i915/intel_lrc.h
>> @@ -104,4 +104,8 @@ struct i915_gem_context;
>>     void intel_lr_context_resume(struct drm_i915_private *dev_priv);
>>   +int
>> +intel_lr_context_modify_data_port_coherency(struct i915_gem_context 
>> *ctx,
>> +                        bool enable);
>> +
>>   #endif /* _INTEL_LRC_H_ */
>> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
>> index 7f5634c..e677bea 100644
>> --- a/include/uapi/drm/i915_drm.h
>> +++ b/include/uapi/drm/i915_drm.h
>> @@ -1456,6 +1456,12 @@ struct drm_i915_gem_context_param {
>>   #define   I915_CONTEXT_MAX_USER_PRIORITY    1023 /* inclusive */
>>   #define   I915_CONTEXT_DEFAULT_PRIORITY        0
>>   #define   I915_CONTEXT_MIN_USER_PRIORITY    -1023 /* inclusive */
>> +/*
>> + * When data port level coherency is enabled, the GPU will update 
>> memory
>> + * buffers shared with CPU, by forcing internal cache units to send 
>> memory
>> + * writes to real RAM faster. Keeping such coherency has performance 
>> cost.
>> + */
>> +#define I915_CONTEXT_PARAM_DATA_PORT_COHERENCY    0x7
>>       __u64 value;
>>   };
>
>
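
The order of checks in the setparam hunk quoted above (size, hardware support, then strict 0/1 values) can be sketched as a small userspace model. Everything here is hypothetical scaffolding: struct ctx_model and set_data_port_coherency() do not exist in the driver, the generation check mirrors the v4 condition as posted (gen >= 9), and no request is actually submitted.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the per-context state touched by the ioctl. */
struct ctx_model {
	int gen;      /* hardware generation */
	int coherent; /* models the CONTEXT_DATA_PORT_COHERENT flag */
};

static int set_data_port_coherency(struct ctx_model *ctx, uint32_t size,
				   uint64_t value)
{
	assert(ctx != NULL);

	if (size)
		return -EINVAL;  /* the parameter carries no extra payload */
	if (ctx->gen < 9)
		return -ENODEV;  /* v4 condition as posted */
	if (value > 1)
		return -EINVAL;  /* strict values: only 0 and 1 are valid */

	ctx->coherent = (int)value;
	return 0;
}
```

A rejected call leaves the flag unchanged, which matches the patch setting or clearing the bit only when the request was emitted successfully.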


* ✓ Fi.CI.BAT: success for drm/i915: Add Exec param to control data port coherency. (rev5)
  2018-03-19 12:37 [RFC v1] Data port coherency control for UMDs Tomasz Lis
                   ` (15 preceding siblings ...)
  2018-07-09 13:58 ` ✗ Fi.CI.SPARSE: " Patchwork
@ 2018-07-09 14:14 ` Patchwork
  2018-07-09 20:04 ` ✗ Fi.CI.IGT: failure " Patchwork
                   ` (11 subsequent siblings)
  28 siblings, 0 replies; 81+ messages in thread
From: Patchwork @ 2018-07-09 14:14 UTC (permalink / raw)
  To: Lis, Tomasz; +Cc: intel-gfx

== Series Details ==

Series: drm/i915: Add Exec param to control data port coherency. (rev5)
URL   : https://patchwork.freedesktop.org/series/40181/
State : success

== Summary ==

= CI Bug Log - changes from CI_DRM_4454 -> Patchwork_9592 =

== Summary - SUCCESS ==

  No regressions found.

  External URL: https://patchwork.freedesktop.org/api/1.0/series/40181/revisions/5/mbox/

== Known issues ==

  Here are the changes found in Patchwork_9592 that come from known issues:

  === IGT changes ===

    ==== Issues hit ====

    igt@kms_frontbuffer_tracking@basic:
      fi-hsw-peppy:       PASS -> DMESG-FAIL (fdo#102614, fdo#106103)

    igt@kms_pipe_crc_basic@suspend-read-crc-pipe-b:
      fi-snb-2520m:       PASS -> INCOMPLETE (fdo#103713)

    igt@prime_vgem@basic-fence-flip:
      fi-ilk-650:         PASS -> FAIL (fdo#104008)

    
    ==== Possible fixes ====

    igt@gem_exec_suspend@basic-s3:
      {fi-kbl-8809g}:     FAIL (fdo#103375) -> PASS

    
    ==== Warnings ====

    igt@gem_exec_suspend@basic-s4-devices:
      {fi-kbl-8809g}:     FAIL -> INCOMPLETE (fdo#107139)

    
  {name}: This element is suppressed. This means it is ignored when computing
          the status of the difference (SUCCESS, WARNING, or FAILURE).

  fdo#102614 https://bugs.freedesktop.org/show_bug.cgi?id=102614
  fdo#103375 https://bugs.freedesktop.org/show_bug.cgi?id=103375
  fdo#103713 https://bugs.freedesktop.org/show_bug.cgi?id=103713
  fdo#104008 https://bugs.freedesktop.org/show_bug.cgi?id=104008
  fdo#106103 https://bugs.freedesktop.org/show_bug.cgi?id=106103
  fdo#107139 https://bugs.freedesktop.org/show_bug.cgi?id=107139


== Participating hosts (46 -> 41) ==

  Additional (1): fi-cfl-8109u 
  Missing    (6): fi-ilk-m540 fi-hsw-4200u fi-byt-j1900 fi-byt-squawks fi-bsw-cyan fi-ctg-p8600 


== Build changes ==

    * Linux: CI_DRM_4454 -> Patchwork_9592

  CI_DRM_4454: 5f4ec795dbe0b8a1c565afcd2af79e41346e7268 @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_4544: 764160f214cd916ddb79408b9f28ac0ad2df40e0 @ git://anongit.freedesktop.org/xorg/app/intel-gpu-tools
  Patchwork_9592: c705f01c81f17ed77188ef82dd953be11be0e1a9 @ git://anongit.freedesktop.org/gfx-ci/linux


== Linux commits ==

c705f01c81f1 drm/i915: Add IOCTL Param to control data port coherency.

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_9592/issues.html

* Re: [PATCH v4] drm/i915: Add IOCTL Param to control data port coherency.
  2018-07-09 14:03       ` Lis, Tomasz
@ 2018-07-09 14:24         ` Lionel Landwerlin
  2018-07-09 15:21           ` Lis, Tomasz
  0 siblings, 1 reply; 81+ messages in thread
From: Lionel Landwerlin @ 2018-07-09 14:24 UTC (permalink / raw)
  To: Lis, Tomasz, intel-gfx; +Cc: bartosz.dunajski

On 09/07/18 15:03, Lis, Tomasz wrote:
>
>
> On 2018-07-09 15:48, Lionel Landwerlin wrote:
>> On 09/07/18 14:20, Tomasz Lis wrote:
>>> The patch adds a parameter to control the data port coherency 
>>> functionality
>>> on a per-context level. When the IOCTL is called, a command to 
>>> switch data
>>> port coherency state is added to the ordered list. All prior 
>>> requests are
>>> executed on old coherency settings, and all exec requests after the 
>>> IOCTL
>>> will use new settings.
>>>
>>> Rationale:
>>>
>>> The OpenCL driver developers requested a functionality to control cache
>>> coherency at data port level. Keeping the coherency at that level is 
>>> disabled
>>> by default due to its performance costs. OpenCL driver is planning to
>>> enable it for a small subset of submissions, when such functionality is
>>> required. Below are answers to basic questions explaining background
>>> of the functionality and reasoning for the proposed implementation:
>>>
>>> 1. Why do we need a coherency enable/disable switch for memory that 
>>> is shared
>>> between CPU and GEN (GPU)?
>>>
>>> Memory coherency between CPU and GEN, while being a great feature 
>>> that enables
>>> CL_MEM_SVM_FINE_GRAIN_BUFFER OCL capability on Intel GEN 
>>> architecture, adds
>>> overhead related to tracking (snooping) memory inside different 
>>> cache units
>>> (L1$, L2$, L3$, LLC$, etc.). At the same time, minority of modern OCL
>>> applications actually use CL_MEM_SVM_FINE_GRAIN_BUFFER (and hence 
>>> require
>>> memory coherency between CPU and GPU). The goal of coherency 
>>> enable/disable
>>> switch is to remove overhead of memory coherency when memory 
>>> coherency is not
>>> needed.
>>>
>>> 2. Why do we need a global coherency switch?
>>>
>>> In order to support I/O commands from within EUs (Execution Units), 
>>> Intel GEN
>>> ISA (GEN Instruction Set Assembly) contains dedicated "send" 
>>> instructions.
>>> These send instructions provide several addressing models. One of these
>>> addressing models (named "stateless") provides most flexible I/O 
>>> using plain
>>> virtual addresses (as opposed to buffer_handle+offset models). This 
>>> "stateless"
>>> model is similar to regular memory load/store operations available 
>>> on typical
>>> CPUs. Since this model provides I/O using arbitrary virtual 
>>> addresses, it
>>> enables algorithmic designs that are based on pointer-to-pointer 
>>> (e.g. buffer
>>> of pointers) concepts. For instance, it allows creating tree-like data
>>> structures such as:
>>>                     ________________
>>>                    |      NODE1     |
>>>                    | uint64_t data  |
>>>                    +----------------|
>>>                    | NODE*  |  NODE*|
>>>                    +--------+-------+
>>>                      /              \
>>>     ________________/                \________________
>>>    |      NODE2     |                |      NODE3     |
>>>    | uint64_t data  |                | uint64_t data  |
>>>    +----------------|                +----------------|
>>>    | NODE*  |  NODE*|                | NODE*  |  NODE*|
>>>    +--------+-------+                +--------+-------+
>>>
>>> Please note that pointers inside such structures can point to memory 
>>> locations
>>> in different OCL allocations  - e.g. NODE1 and NODE2 can reside in 
>>> one OCL
>>> allocation while NODE3 resides in a completely separate OCL allocation.
>>> Additionally, such pointers can be shared with CPU (i.e. using SVM - 
>>> Shared
>>> Virtual Memory feature). Using pointers from different allocations 
>>> doesn't
>>> affect the stateless addressing model which even allows scattered 
>>> reading from
>>> different allocations at the same time (i.e. by utilizing 
>>> SIMD-nature of send
>>> instructions).
>>>
>>> When it comes to coherency programming, send instructions in 
>>> stateless model
>>> can be encoded (at ISA level) to either use or disable coherency. 
>>> However, for
>>> generic OCL applications (such as example with tree-like data 
>>> structure), OCL
>>> compiler is not able to determine origin of memory pointed to by an 
>>> arbitrary
>>> pointer - i.e. is not able to track given pointer back to a specific
>>> allocation. As such, it's not able to decide whether coherency is 
>>> needed or not
>>> for specific pointer (or for specific I/O instruction). As a result, 
>>> compiler
>>> encodes all stateless sends as coherent (doing otherwise would lead to
>>> functional issues resulting from data corruption). Please note that 
>>> it would be
>>> possible to workaround this (e.g. based on allocations map and 
>>> pointer bounds
>>> checking prior to each I/O instruction) but the performance cost of 
>>> such
>>> workaround would be many times greater than the cost of keeping 
>>> coherency
>>> always enabled. As such, enabling/disabling memory coherency at GEN 
>>> ISA level
>>> is not feasible and alternative method is needed.
>>>
>>> Such alternative solution is to have a global coherency switch that allows
>>> disabling coherency for single (though entire) GPU submission. This is
>>> beneficial because this way we:
>>> * can enable (and pay for) coherency only in submissions that actually need
>>> coherency (submissions that use CL_MEM_SVM_FINE_GRAIN_BUFFER resources)
>>> * don't care about coherency at GEN ISA granularity (no performance impact)
>>>
>>> 3. Will coherency switch be used frequently?
>>>
>>> There are scenarios that will require frequent toggling of the coherency
>>> switch.
>>> E.g. an application has two OCL compute kernels: kern_master and kern_worker.
>>> kern_master uses, concurrently with CPU, some fine grain SVM resources
>>> (CL_MEM_SVM_FINE_GRAIN_BUFFER). These resources contain descriptors of
>>> computational work that needs to be executed. kern_master analyzes incoming
>>> work descriptors and populates a plain OCL buffer (non-fine-grain) with payload
>>> for kern_worker. Once kern_master is done, kern_worker kicks-in and processes
>>> the payload that kern_master produced. These two kernels work in a loop, one
>>> after another. Since only kern_master requires coherency, kern_worker should
>>> not be forced to pay for it. This means that we need to have the ability to
>>> toggle coherency switch on or off per each GPU submission:
>>> (ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> (ENABLE
>>> COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> ...
>>>
>>> v2: Fixed compilation warning.
>>> v3: Refactored the patch to add IOCTL instead of exec flag.
>>> v4: Renamed and documented the API flag. Used strict values.
>>>      Removed redundant GEM_WARN_ON()s. Improved to coding standard.
>>>      Introduced a macro for checking whether hardware supports the feature.
>>>
>>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>>> Cc: Michal Winiarski <michal.winiarski@intel.com>
>>>
>>> Bspec: 11419
>>> Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
>>> ---
>>>   drivers/gpu/drm/i915/i915_drv.h         |  1 +
>>>   drivers/gpu/drm/i915/i915_gem_context.c | 41 
>>> +++++++++++++++++++++++++++
>>>   drivers/gpu/drm/i915/i915_gem_context.h |  6 ++++
>>>   drivers/gpu/drm/i915/intel_lrc.c        | 49 
>>> +++++++++++++++++++++++++++++++++
>>>   drivers/gpu/drm/i915/intel_lrc.h        |  4 +++
>>>   include/uapi/drm/i915_drm.h             |  6 ++++
>>>   6 files changed, 107 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/i915/i915_drv.h 
>>> b/drivers/gpu/drm/i915/i915_drv.h
>>> index 09ab124..7d4bbd5 100644
>>> --- a/drivers/gpu/drm/i915/i915_drv.h
>>> +++ b/drivers/gpu/drm/i915/i915_drv.h
>>> @@ -2524,6 +2524,7 @@ intel_info(const struct drm_i915_private 
>>> *dev_priv)
>>>   #define HAS_EDRAM(dev_priv)    (!!((dev_priv)->edram_cap & 
>>> EDRAM_ENABLED))
>>>   #define HAS_WT(dev_priv)    ((IS_HASWELL(dev_priv) || \
>>>                    IS_BROADWELL(dev_priv)) && HAS_EDRAM(dev_priv))
>>> +#define HAS_DATA_PORT_COHERENCY(dev_priv) (INTEL_GEN(dev_priv) >= 9)
>>
>> Reading the documentation it seems that the bit you want to set is 
>> gone in ICL/Gen11.
>> Maybe limit this to >= 9 && < 11?
> Icelake actually has the bit as well, just the address is different.
> I will add its support as a separate patch as soon as the change which 
> defines ICL_HDC_CHICKEN0 is accepted.
> But in the current form - you are right, ICL is not supported.
> I will update the condition.
> -Tomasz

Just out of curiosity, what address is ICL_HDC_CHICKEN0 at?

Thanks,

-
Lionel

>>
>> Cheers,
>>
>> -
>> Lionel
>>
>>>     #define HWS_NEEDS_PHYSICAL(dev_priv) 
>>> ((dev_priv)->info.hws_needs_physical)
>>>   diff --git a/drivers/gpu/drm/i915/i915_gem_context.c 
>>> b/drivers/gpu/drm/i915/i915_gem_context.c
>>> index b10770c..6db352e 100644
>>> --- a/drivers/gpu/drm/i915/i915_gem_context.c
>>> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
>>> @@ -711,6 +711,26 @@ static bool client_is_banned(struct 
>>> drm_i915_file_private *file_priv)
>>>       return atomic_read(&file_priv->ban_score) >= 
>>> I915_CLIENT_SCORE_BANNED;
>>>   }
>>>   +static int i915_gem_context_set_data_port_coherent(struct 
>>> i915_gem_context *ctx)
>>> +{
>>> +    int ret;
>>> +
>>> +    ret = intel_lr_context_modify_data_port_coherency(ctx, true);
>>> +    if (!ret)
>>> +        __set_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
>>> +    return ret;
>>> +}
>>> +
>>> +static int i915_gem_context_clear_data_port_coherent(struct 
>>> i915_gem_context *ctx)
>>> +{
>>> +    int ret;
>>> +
>>> +    ret = intel_lr_context_modify_data_port_coherency(ctx, false);
>>> +    if (!ret)
>>> +        __clear_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
>>> +    return ret;
>>> +}
>>> +
>>>   int i915_gem_context_create_ioctl(struct drm_device *dev, void *data,
>>>                     struct drm_file *file)
>>>   {
>>> @@ -784,6 +804,7 @@ int i915_gem_context_destroy_ioctl(struct 
>>> drm_device *dev, void *data,
>>>   int i915_gem_context_getparam_ioctl(struct drm_device *dev, void 
>>> *data,
>>>                       struct drm_file *file)
>>>   {
>>> +    struct drm_i915_private *dev_priv = to_i915(dev);
>>>       struct drm_i915_file_private *file_priv = file->driver_priv;
>>>       struct drm_i915_gem_context_param *args = data;
>>>       struct i915_gem_context *ctx;
>>> @@ -818,6 +839,12 @@ int i915_gem_context_getparam_ioctl(struct 
>>> drm_device *dev, void *data,
>>>       case I915_CONTEXT_PARAM_PRIORITY:
>>>           args->value = ctx->sched.priority;
>>>           break;
>>> +    case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
>>> +        if (!HAS_DATA_PORT_COHERENCY(dev_priv))
>>> +            ret = -ENODEV;
>>> +        else
>>> +            args->value = i915_gem_context_is_data_port_coherent(ctx);
>>> +        break;
>>>       default:
>>>           ret = -EINVAL;
>>>           break;
>>> @@ -830,6 +857,7 @@ int i915_gem_context_getparam_ioctl(struct 
>>> drm_device *dev, void *data,
>>>   int i915_gem_context_setparam_ioctl(struct drm_device *dev, void 
>>> *data,
>>>                       struct drm_file *file)
>>>   {
>>> +    struct drm_i915_private *dev_priv = to_i915(dev);
>>>       struct drm_i915_file_private *file_priv = file->driver_priv;
>>>       struct drm_i915_gem_context_param *args = data;
>>>       struct i915_gem_context *ctx;
>>> @@ -893,6 +921,19 @@ int i915_gem_context_setparam_ioctl(struct 
>>> drm_device *dev, void *data,
>>>           }
>>>           break;
>>>   +    case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
>>> +        if (args->size)
>>> +            ret = -EINVAL;
>>> +        else if (!HAS_DATA_PORT_COHERENCY(dev_priv))
>>> +            ret = -ENODEV;
>>> +        else if (args->value == 1)
>>> +            ret = i915_gem_context_set_data_port_coherent(ctx);
>>> +        else if (args->value == 0)
>>> +            ret = i915_gem_context_clear_data_port_coherent(ctx);
>>> +        else
>>> +            ret = -EINVAL;
>>> +        break;
>>> +
>>>       default:
>>>           ret = -EINVAL;
>>>           break;
>>> diff --git a/drivers/gpu/drm/i915/i915_gem_context.h 
>>> b/drivers/gpu/drm/i915/i915_gem_context.h
>>> index b116e49..e8ccb70 100644
>>> --- a/drivers/gpu/drm/i915/i915_gem_context.h
>>> +++ b/drivers/gpu/drm/i915/i915_gem_context.h
>>> @@ -126,6 +126,7 @@ struct i915_gem_context {
>>>   #define CONTEXT_BANNABLE        3
>>>   #define CONTEXT_BANNED            4
>>>   #define CONTEXT_FORCE_SINGLE_SUBMISSION    5
>>> +#define CONTEXT_DATA_PORT_COHERENT    6
>>>         /**
>>>        * @hw_id: - unique identifier for the context
>>> @@ -257,6 +258,11 @@ static inline void 
>>> i915_gem_context_set_force_single_submission(struct i915_gem_
>>>       __set_bit(CONTEXT_FORCE_SINGLE_SUBMISSION, &ctx->flags);
>>>   }
>>>   +static inline bool i915_gem_context_is_data_port_coherent(struct 
>>> i915_gem_context *ctx)
>>> +{
>>> +    return test_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
>>> +}
>>> +
>>>   static inline bool i915_gem_context_is_default(const struct 
>>> i915_gem_context *c)
>>>   {
>>>       return c->user_handle == DEFAULT_CONTEXT_HANDLE;
>>> diff --git a/drivers/gpu/drm/i915/intel_lrc.c 
>>> b/drivers/gpu/drm/i915/intel_lrc.c
>>> index ab89dab..1f037e3 100644
>>> --- a/drivers/gpu/drm/i915/intel_lrc.c
>>> +++ b/drivers/gpu/drm/i915/intel_lrc.c
>>> @@ -259,6 +259,55 @@ intel_lr_context_descriptor_update(struct 
>>> i915_gem_context *ctx,
>>>       ce->lrc_desc = desc;
>>>   }
>>>   +static int emit_set_data_port_coherency(struct i915_request *req, 
>>> bool enable)
>>> +{
>>> +    u32 *cs;
>>> +    i915_reg_t reg;
>>> +
>>> +    GEM_BUG_ON(req->engine->class != RENDER_CLASS);
>>> +    GEM_BUG_ON(INTEL_GEN(req->i915) < 9);
>>> +
>>> +    cs = intel_ring_begin(req, 4);
>>> +    if (IS_ERR(cs))
>>> +        return PTR_ERR(cs);
>>> +
>>> +    if (INTEL_GEN(req->i915) >= 10)
>>> +        reg = CNL_HDC_CHICKEN0;
>>> +    else
>>> +        reg = HDC_CHICKEN0;
>>> +
>>> +    *cs++ = MI_LOAD_REGISTER_IMM(1);
>>> +    *cs++ = i915_mmio_reg_offset(reg);
>>> +    /* Enabling coherency means disabling the bit which forces it 
>>> off */
>>> +    if (enable)
>>> +        *cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
>>> +    else
>>> +        *cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
>>> +    *cs++ = MI_NOOP;
>>> +
>>> +    intel_ring_advance(req, cs);
>>> +
>>> +    return 0;
>>> +}
>>> +
>>> +int
>>> +intel_lr_context_modify_data_port_coherency(struct i915_gem_context 
>>> *ctx,
>>> +                        bool enable)
>>> +{
>>> +    struct i915_request *req;
>>> +    int ret;
>>> +
>>> +    req = i915_request_alloc(ctx->i915->engine[RCS], ctx);
>>> +    if (IS_ERR(req))
>>> +        return PTR_ERR(req);
>>> +
>>> +    ret = emit_set_data_port_coherency(req, enable);
>>> +
>>> +    i915_request_add(req);
>>> +
>>> +    return ret;
>>> +}
>>> +
>>>   static struct i915_priolist *
>>>   lookup_priolist(struct intel_engine_cs *engine, int prio)
>>>   {
>>> diff --git a/drivers/gpu/drm/i915/intel_lrc.h 
>>> b/drivers/gpu/drm/i915/intel_lrc.h
>>> index 1593194..f6965ae 100644
>>> --- a/drivers/gpu/drm/i915/intel_lrc.h
>>> +++ b/drivers/gpu/drm/i915/intel_lrc.h
>>> @@ -104,4 +104,8 @@ struct i915_gem_context;
>>>     void intel_lr_context_resume(struct drm_i915_private *dev_priv);
>>>   +int
>>> +intel_lr_context_modify_data_port_coherency(struct i915_gem_context 
>>> *ctx,
>>> +                        bool enable);
>>> +
>>>   #endif /* _INTEL_LRC_H_ */
>>> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
>>> index 7f5634c..e677bea 100644
>>> --- a/include/uapi/drm/i915_drm.h
>>> +++ b/include/uapi/drm/i915_drm.h
>>> @@ -1456,6 +1456,12 @@ struct drm_i915_gem_context_param {
>>>   #define   I915_CONTEXT_MAX_USER_PRIORITY    1023 /* inclusive */
>>>   #define   I915_CONTEXT_DEFAULT_PRIORITY        0
>>>   #define   I915_CONTEXT_MIN_USER_PRIORITY    -1023 /* inclusive */
>>> +/*
>>> + * When data port level coherency is enabled, the GPU will update 
>>> memory
>>> + * buffers shared with CPU, by forcing internal cache units to send 
>>> memory
>>> + * writes to real RAM faster. Keeping such coherency has 
>>> performance cost.
>>> + */
>>> +#define I915_CONTEXT_PARAM_DATA_PORT_COHERENCY    0x7
>>>       __u64 value;
>>>   };
>>
>>
>
>
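For readers following the quoted emit_set_data_port_coherency() hunk: the
register is written through the masked-bit convention, where the upper 16 bits
of the value select which of the lower 16 bits the write may change. Below is a
minimal standalone sketch of that encoding. It is not driver code; the helper
names are made up for illustration, and the bit position used for
HDC_FORCE_NON_COHERENT is an assumption that should be checked against
i915_reg.h.

```c
#include <stdint.h>

/* Assumed bit position, for illustration only; see i915_reg.h for the real one. */
#define HDC_FORCE_NON_COHERENT (1u << 4)

/* Masked-bit encoding: the upper half is the write mask, the lower half the value. */
static uint32_t masked_bit_enable(uint32_t bit)
{
	return (bit << 16) | bit;
}

static uint32_t masked_bit_disable(uint32_t bit)
{
	return bit << 16;
}

/*
 * Hypothetical helper: the immediate that would be loaded into the register.
 * Enabling coherency clears the bit that forces it off, and vice versa.
 */
static uint32_t coherency_lri_value(int enable)
{
	return enable ? masked_bit_disable(HDC_FORCE_NON_COHERENT)
		      : masked_bit_enable(HDC_FORCE_NON_COHERENT);
}

/* Software model of a masked write landing in a 16-bit register. */
static uint16_t apply_masked_write(uint16_t reg, uint32_t value)
{
	uint16_t mask = (uint16_t)(value >> 16);
	uint16_t bits = (uint16_t)(value & 0xffffu);

	/* Only bits selected by the mask change; all others keep their old value. */
	return (uint16_t)((reg & ~mask) | (bits & mask));
}
```

The point of the convention is that a single write can flip one bit without a
read-modify-write cycle, which is presumably why the patch can emit a bare
MI_LOAD_REGISTER_IMM and leave the other bits of the register untouched.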


* Re: [PATCH v4] drm/i915: Add IOCTL Param to control data port coherency.
  2018-07-09 14:24         ` Lionel Landwerlin
@ 2018-07-09 15:21           ` Lis, Tomasz
  0 siblings, 0 replies; 81+ messages in thread
From: Lis, Tomasz @ 2018-07-09 15:21 UTC (permalink / raw)
  To: Lionel Landwerlin, intel-gfx; +Cc: bartosz.dunajski



On 2018-07-09 16:24, Lionel Landwerlin wrote:
> On 09/07/18 15:03, Lis, Tomasz wrote:
>>
>>
>> On 2018-07-09 15:48, Lionel Landwerlin wrote:
>>> On 09/07/18 14:20, Tomasz Lis wrote:
>>>> The patch adds a parameter to control the data port coherency functionality
>>>> on a per-context level. When the IOCTL is called, a command to switch data
>>>> port coherency state is added to the ordered list. All prior requests are
>>>> executed on old coherency settings, and all exec requests after the IOCTL
>>>> will use new settings.
>>>>
>>>> Rationale:
>>>>
>>>> The OpenCL driver developers requested a functionality to control cache
>>>> coherency at data port level. Keeping the coherency at that level is disabled
>>>> by default due to its performance costs. OpenCL driver is planning to
>>>> enable it for a small subset of submissions, when such functionality is
>>>> required. Below are answers to basic questions explaining background
>>>> of the functionality and reasoning for the proposed implementation:
>>>>
>>>> 1. Why do we need a coherency enable/disable switch for memory that is shared
>>>> between CPU and GEN (GPU)?
>>>>
>>>> Memory coherency between CPU and GEN, while being a great feature that enables
>>>> CL_MEM_SVM_FINE_GRAIN_BUFFER OCL capability on Intel GEN architecture, adds
>>>> overhead related to tracking (snooping) memory inside different cache units
>>>> (L1$, L2$, L3$, LLC$, etc.). At the same time, minority of modern OCL
>>>> applications actually use CL_MEM_SVM_FINE_GRAIN_BUFFER (and hence require
>>>> memory coherency between CPU and GPU). The goal of coherency enable/disable
>>>> switch is to remove overhead of memory coherency when memory coherency is not
>>>> needed.
>>>>
>>>> 2. Why do we need a global coherency switch?
>>>>
>>>> In order to support I/O commands from within EUs (Execution Units), Intel GEN
>>>> ISA (GEN Instruction Set Assembly) contains dedicated "send" instructions.
>>>> These send instructions provide several addressing models. One of these
>>>> addressing models (named "stateless") provides most flexible I/O using plain
>>>> virtual addresses (as opposed to buffer_handle+offset models). This "stateless"
>>>> model is similar to regular memory load/store operations available on typical
>>>> CPUs. Since this model provides I/O using arbitrary virtual addresses, it
>>>> enables algorithmic designs that are based on pointer-to-pointer (e.g. buffer
>>>> of pointers) concepts. For instance, it allows creating tree-like data
>>>> structures such as:
>>>>                     ________________
>>>>                    |      NODE1     |
>>>>                    | uint64_t data  |
>>>>                    +----------------|
>>>>                    | NODE*  |  NODE*|
>>>>                    +--------+-------+
>>>>                      /              \
>>>>     ________________/                \________________
>>>>    |      NODE2     |                |      NODE3     |
>>>>    | uint64_t data  |                | uint64_t data  |
>>>>    +----------------|                +----------------|
>>>>    | NODE*  |  NODE*|                | NODE*  |  NODE*|
>>>>    +--------+-------+                +--------+-------+
>>>>
>>>> Please note that pointers inside such structures can point to memory locations
>>>> in different OCL allocations - e.g. NODE1 and NODE2 can reside in one OCL
>>>> allocation while NODE3 resides in a completely separate OCL allocation.
>>>> Additionally, such pointers can be shared with CPU (i.e. using SVM - Shared
>>>> Virtual Memory feature). Using pointers from different allocations doesn't
>>>> affect the stateless addressing model which even allows scattered reading from
>>>> different allocations at the same time (i.e. by utilizing SIMD-nature of send
>>>> instructions).
>>>>
>>>> When it comes to coherency programming, send instructions in stateless model
>>>> can be encoded (at ISA level) to either use or disable coherency. However, for
>>>> generic OCL applications (such as example with tree-like data structure), OCL
>>>> compiler is not able to determine origin of memory pointed to by an arbitrary
>>>> pointer - i.e. is not able to track given pointer back to a specific
>>>> allocation. As such, it's not able to decide whether coherency is needed or not
>>>> for specific pointer (or for specific I/O instruction). As a result, compiler
>>>> encodes all stateless sends as coherent (doing otherwise would lead to
>>>> functional issues resulting from data corruption). Please note that it would be
>>>> possible to workaround this (e.g. based on allocations map and pointer bounds
>>>> checking prior to each I/O instruction) but the performance cost of such
>>>> workaround would be many times greater than the cost of keeping coherency
>>>> always enabled. As such, enabling/disabling memory coherency at GEN ISA level
>>>> is not feasible and alternative method is needed.
>>>>
>>>> Such alternative solution is to have a global coherency switch that allows
>>>> disabling coherency for single (though entire) GPU submission. This is
>>>> beneficial because this way we:
>>>> * can enable (and pay for) coherency only in submissions that actually need
>>>> coherency (submissions that use CL_MEM_SVM_FINE_GRAIN_BUFFER resources)
>>>> * don't care about coherency at GEN ISA granularity (no performance impact)
>>>>
>>>> 3. Will coherency switch be used frequently?
>>>>
>>>> There are scenarios that will require frequent toggling of the coherency
>>>> switch.
>>>> E.g. an application has two OCL compute kernels: kern_master and kern_worker.
>>>> kern_master uses, concurrently with CPU, some fine grain SVM resources
>>>> (CL_MEM_SVM_FINE_GRAIN_BUFFER). These resources contain descriptors of
>>>> computational work that needs to be executed. kern_master analyzes incoming
>>>> work descriptors and populates a plain OCL buffer (non-fine-grain) with payload
>>>> for kern_worker. Once kern_master is done, kern_worker kicks-in and processes
>>>> the payload that kern_master produced. These two kernels work in a loop, one
>>>> after another. Since only kern_master requires coherency, kern_worker should
>>>> not be forced to pay for it. This means that we need to have the ability to
>>>> toggle coherency switch on or off per each GPU submission:
>>>> (ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> (ENABLE
>>>> COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> ...
>>>>
>>>> v2: Fixed compilation warning.
>>>> v3: Refactored the patch to add IOCTL instead of exec flag.
>>>> v4: Renamed and documented the API flag. Used strict values.
>>>>      Removed redundant GEM_WARN_ON()s. Improved to coding standard.
>>>>      Introduced a macro for checking whether hardware supports the feature.
>>>>
>>>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>>>> Cc: Michal Winiarski <michal.winiarski@intel.com>
>>>>
>>>> Bspec: 11419
>>>> Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
>>>> ---
>>>>   drivers/gpu/drm/i915/i915_drv.h         |  1 +
>>>>   drivers/gpu/drm/i915/i915_gem_context.c | 41 
>>>> +++++++++++++++++++++++++++
>>>>   drivers/gpu/drm/i915/i915_gem_context.h |  6 ++++
>>>>   drivers/gpu/drm/i915/intel_lrc.c        | 49 
>>>> +++++++++++++++++++++++++++++++++
>>>>   drivers/gpu/drm/i915/intel_lrc.h        |  4 +++
>>>>   include/uapi/drm/i915_drm.h             |  6 ++++
>>>>   6 files changed, 107 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/i915/i915_drv.h 
>>>> b/drivers/gpu/drm/i915/i915_drv.h
>>>> index 09ab124..7d4bbd5 100644
>>>> --- a/drivers/gpu/drm/i915/i915_drv.h
>>>> +++ b/drivers/gpu/drm/i915/i915_drv.h
>>>> @@ -2524,6 +2524,7 @@ intel_info(const struct drm_i915_private 
>>>> *dev_priv)
>>>>   #define HAS_EDRAM(dev_priv) (!!((dev_priv)->edram_cap & 
>>>> EDRAM_ENABLED))
>>>>   #define HAS_WT(dev_priv)    ((IS_HASWELL(dev_priv) || \
>>>>                    IS_BROADWELL(dev_priv)) && HAS_EDRAM(dev_priv))
>>>> +#define HAS_DATA_PORT_COHERENCY(dev_priv) (INTEL_GEN(dev_priv) >= 9)
>>>
>>> Reading the documentation it seems that the bit you want to set is 
>>> gone in ICL/Gen11.
>>> Maybe limit this to >= 9 && < 11?
>> Icelake actually has the bit as well, just the address is different.
>> I will add its support as a separate patch as soon as the change 
>> which defines ICL_HDC_CHICKEN0 is accepted.
>> But in the current form - you are right, ICL is not supported.
>> I will update the condition.
>> -Tomasz
>
> Just out of curiosity, what address is ICL_HDC_CHICKEN0 at?
It was defined as _MMIO(0xE5F4). But now I see it is renamed to 
ICL_HDC_MODE, and already on the tip.
Bspec: 19175
Wow, looks like I can include the gen11 support already. Will add in 
next version.
Thank you!
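
A rough sketch of how the register choice might look once gen11 is included,
written as standalone C rather than driver code. Only the 0xE5F4 offset comes
from this thread; the other two offsets are recalled from i915_reg.h and are
assumptions here, and both helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * 0xE5F4 is the ICL_HDC_MODE offset mentioned above. The other two offsets
 * are assumptions for illustration and should be checked against i915_reg.h.
 */
#define HDC_CHICKEN0_OFFSET     0x7300u
#define CNL_HDC_CHICKEN0_OFFSET 0xE5F0u
#define ICL_HDC_MODE_OFFSET     0xE5F4u

/* Hypothetical stand-in for HAS_DATA_PORT_COHERENCY(): gen9 through gen11. */
static bool has_data_port_coherency(int gen)
{
	return gen >= 9 && gen <= 11;
}

/*
 * Hypothetical helper: offset of the register holding the force-non-coherent
 * bit, or 0 when the generation is not handled.
 */
static uint32_t data_port_coherency_reg(int gen)
{
	if (!has_data_port_coherency(gen))
		return 0;
	if (gen == 11)
		return ICL_HDC_MODE_OFFSET;
	if (gen == 10)
		return CNL_HDC_CHICKEN0_OFFSET;
	return HDC_CHICKEN0_OFFSET;
}
```

Keeping the support check and the register lookup in step means a generation
that passes the check always gets a valid offset, which is the mismatch Lionel
pointed out for ICL in the v4 condition.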
>
> Thanks,
>
> -
> Lionel
>
>>>
>>> Cheers,
>>>
>>> -
>>> Lionel
>>>
>>>>     #define HWS_NEEDS_PHYSICAL(dev_priv) 
>>>> ((dev_priv)->info.hws_needs_physical)
>>>>   diff --git a/drivers/gpu/drm/i915/i915_gem_context.c 
>>>> b/drivers/gpu/drm/i915/i915_gem_context.c
>>>> index b10770c..6db352e 100644
>>>> --- a/drivers/gpu/drm/i915/i915_gem_context.c
>>>> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
>>>> @@ -711,6 +711,26 @@ static bool client_is_banned(struct 
>>>> drm_i915_file_private *file_priv)
>>>>       return atomic_read(&file_priv->ban_score) >= 
>>>> I915_CLIENT_SCORE_BANNED;
>>>>   }
>>>>   +static int i915_gem_context_set_data_port_coherent(struct 
>>>> i915_gem_context *ctx)
>>>> +{
>>>> +    int ret;
>>>> +
>>>> +    ret = intel_lr_context_modify_data_port_coherency(ctx, true);
>>>> +    if (!ret)
>>>> +        __set_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
>>>> +    return ret;
>>>> +}
>>>> +
>>>> +static int i915_gem_context_clear_data_port_coherent(struct 
>>>> i915_gem_context *ctx)
>>>> +{
>>>> +    int ret;
>>>> +
>>>> +    ret = intel_lr_context_modify_data_port_coherency(ctx, false);
>>>> +    if (!ret)
>>>> +        __clear_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
>>>> +    return ret;
>>>> +}
>>>> +
>>>>   int i915_gem_context_create_ioctl(struct drm_device *dev, void 
>>>> *data,
>>>>                     struct drm_file *file)
>>>>   {
>>>> @@ -784,6 +804,7 @@ int i915_gem_context_destroy_ioctl(struct 
>>>> drm_device *dev, void *data,
>>>>   int i915_gem_context_getparam_ioctl(struct drm_device *dev, void 
>>>> *data,
>>>>                       struct drm_file *file)
>>>>   {
>>>> +    struct drm_i915_private *dev_priv = to_i915(dev);
>>>>       struct drm_i915_file_private *file_priv = file->driver_priv;
>>>>       struct drm_i915_gem_context_param *args = data;
>>>>       struct i915_gem_context *ctx;
>>>> @@ -818,6 +839,12 @@ int i915_gem_context_getparam_ioctl(struct 
>>>> drm_device *dev, void *data,
>>>>       case I915_CONTEXT_PARAM_PRIORITY:
>>>>           args->value = ctx->sched.priority;
>>>>           break;
>>>> +    case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
>>>> +        if (!HAS_DATA_PORT_COHERENCY(dev_priv))
>>>> +            ret = -ENODEV;
>>>> +        else
>>>> +            args->value = 
>>>> i915_gem_context_is_data_port_coherent(ctx);
>>>> +        break;
>>>>       default:
>>>>           ret = -EINVAL;
>>>>           break;
>>>> @@ -830,6 +857,7 @@ int i915_gem_context_getparam_ioctl(struct 
>>>> drm_device *dev, void *data,
>>>>   int i915_gem_context_setparam_ioctl(struct drm_device *dev, void 
>>>> *data,
>>>>                       struct drm_file *file)
>>>>   {
>>>> +    struct drm_i915_private *dev_priv = to_i915(dev);
>>>>       struct drm_i915_file_private *file_priv = file->driver_priv;
>>>>       struct drm_i915_gem_context_param *args = data;
>>>>       struct i915_gem_context *ctx;
>>>> @@ -893,6 +921,19 @@ int i915_gem_context_setparam_ioctl(struct 
>>>> drm_device *dev, void *data,
>>>>           }
>>>>           break;
>>>>   +    case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
>>>> +        if (args->size)
>>>> +            ret = -EINVAL;
>>>> +        else if (!HAS_DATA_PORT_COHERENCY(dev_priv))
>>>> +            ret = -ENODEV;
>>>> +        else if (args->value == 1)
>>>> +            ret = i915_gem_context_set_data_port_coherent(ctx);
>>>> +        else if (args->value == 0)
>>>> +            ret = i915_gem_context_clear_data_port_coherent(ctx);
>>>> +        else
>>>> +            ret = -EINVAL;
>>>> +        break;
>>>> +
>>>>       default:
>>>>           ret = -EINVAL;
>>>>           break;
>>>> diff --git a/drivers/gpu/drm/i915/i915_gem_context.h 
>>>> b/drivers/gpu/drm/i915/i915_gem_context.h
>>>> index b116e49..e8ccb70 100644
>>>> --- a/drivers/gpu/drm/i915/i915_gem_context.h
>>>> +++ b/drivers/gpu/drm/i915/i915_gem_context.h
>>>> @@ -126,6 +126,7 @@ struct i915_gem_context {
>>>>   #define CONTEXT_BANNABLE        3
>>>>   #define CONTEXT_BANNED            4
>>>>   #define CONTEXT_FORCE_SINGLE_SUBMISSION    5
>>>> +#define CONTEXT_DATA_PORT_COHERENT    6
>>>>         /**
>>>>        * @hw_id: - unique identifier for the context
>>>> @@ -257,6 +258,11 @@ static inline void 
>>>> i915_gem_context_set_force_single_submission(struct i915_gem_
>>>>       __set_bit(CONTEXT_FORCE_SINGLE_SUBMISSION, &ctx->flags);
>>>>   }
>>>>   +static inline bool i915_gem_context_is_data_port_coherent(struct 
>>>> i915_gem_context *ctx)
>>>> +{
>>>> +    return test_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
>>>> +}
>>>> +
>>>>   static inline bool i915_gem_context_is_default(const struct 
>>>> i915_gem_context *c)
>>>>   {
>>>>       return c->user_handle == DEFAULT_CONTEXT_HANDLE;
>>>> diff --git a/drivers/gpu/drm/i915/intel_lrc.c 
>>>> b/drivers/gpu/drm/i915/intel_lrc.c
>>>> index ab89dab..1f037e3 100644
>>>> --- a/drivers/gpu/drm/i915/intel_lrc.c
>>>> +++ b/drivers/gpu/drm/i915/intel_lrc.c
>>>> @@ -259,6 +259,55 @@ intel_lr_context_descriptor_update(struct 
>>>> i915_gem_context *ctx,
>>>>       ce->lrc_desc = desc;
>>>>   }
>>>>   +static int emit_set_data_port_coherency(struct i915_request 
>>>> *req, bool enable)
>>>> +{
>>>> +    u32 *cs;
>>>> +    i915_reg_t reg;
>>>> +
>>>> +    GEM_BUG_ON(req->engine->class != RENDER_CLASS);
>>>> +    GEM_BUG_ON(INTEL_GEN(req->i915) < 9);
>>>> +
>>>> +    cs = intel_ring_begin(req, 4);
>>>> +    if (IS_ERR(cs))
>>>> +        return PTR_ERR(cs);
>>>> +
>>>> +    if (INTEL_GEN(req->i915) >= 10)
>>>> +        reg = CNL_HDC_CHICKEN0;
>>>> +    else
>>>> +        reg = HDC_CHICKEN0;
>>>> +
>>>> +    *cs++ = MI_LOAD_REGISTER_IMM(1);
>>>> +    *cs++ = i915_mmio_reg_offset(reg);
>>>> +    /* Enabling coherency means disabling the bit which forces it 
>>>> off */
>>>> +    if (enable)
>>>> +        *cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
>>>> +    else
>>>> +        *cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
>>>> +    *cs++ = MI_NOOP;
>>>> +
>>>> +    intel_ring_advance(req, cs);
>>>> +
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +int
>>>> +intel_lr_context_modify_data_port_coherency(struct 
>>>> i915_gem_context *ctx,
>>>> +                        bool enable)
>>>> +{
>>>> +    struct i915_request *req;
>>>> +    int ret;
>>>> +
>>>> +    req = i915_request_alloc(ctx->i915->engine[RCS], ctx);
>>>> +    if (IS_ERR(req))
>>>> +        return PTR_ERR(req);
>>>> +
>>>> +    ret = emit_set_data_port_coherency(req, enable);
>>>> +
>>>> +    i915_request_add(req);
>>>> +
>>>> +    return ret;
>>>> +}
>>>> +
>>>>   static struct i915_priolist *
>>>>   lookup_priolist(struct intel_engine_cs *engine, int prio)
>>>>   {
>>>> diff --git a/drivers/gpu/drm/i915/intel_lrc.h 
>>>> b/drivers/gpu/drm/i915/intel_lrc.h
>>>> index 1593194..f6965ae 100644
>>>> --- a/drivers/gpu/drm/i915/intel_lrc.h
>>>> +++ b/drivers/gpu/drm/i915/intel_lrc.h
>>>> @@ -104,4 +104,8 @@ struct i915_gem_context;
>>>>     void intel_lr_context_resume(struct drm_i915_private *dev_priv);
>>>>   +int
>>>> +intel_lr_context_modify_data_port_coherency(struct 
>>>> i915_gem_context *ctx,
>>>> +                        bool enable);
>>>> +
>>>>   #endif /* _INTEL_LRC_H_ */
>>>> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
>>>> index 7f5634c..e677bea 100644
>>>> --- a/include/uapi/drm/i915_drm.h
>>>> +++ b/include/uapi/drm/i915_drm.h
>>>> @@ -1456,6 +1456,12 @@ struct drm_i915_gem_context_param {
>>>>   #define   I915_CONTEXT_MAX_USER_PRIORITY    1023 /* inclusive */
>>>>   #define   I915_CONTEXT_DEFAULT_PRIORITY        0
>>>>   #define   I915_CONTEXT_MIN_USER_PRIORITY    -1023 /* inclusive */
>>>> +/*
>>>> + * When data port level coherency is enabled, the GPU will update 
>>>> memory
>>>> + * buffers shared with CPU, by forcing internal cache units to 
>>>> send memory
>>>> + * writes to real RAM faster. Keeping such coherency has 
>>>> performance cost.
>>>> + */
>>>> +#define I915_CONTEXT_PARAM_DATA_PORT_COHERENCY    0x7
>>>>       __u64 value;
>>>>   };
>>>
>>>
>>
>>
>
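
The strict-value handling in the quoted setparam hunk can be modelled outside
the driver. The sketch below is plain C with made-up types (struct
ctx_param_model, struct ctx_model) and a made-up function name; it mirrors only
the order of checks in the patch and does not emit any request to the GPU.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the uapi struct and the context. */
struct ctx_param_model {
	uint32_t size;
	uint64_t value;
};

struct ctx_model {
	bool data_port_coherent;
};

/* Same order of checks as the quoted hunk: size, hardware support, value. */
static int set_data_port_coherency(struct ctx_model *ctx, bool hw_supported,
				   const struct ctx_param_model *args)
{
	if (args->size)
		return -EINVAL;
	if (!hw_supported)
		return -ENODEV;
	if (args->value > 1)
		return -EINVAL;	/* only 0 and 1 are accepted */

	/* The flag changes only when the request is valid. */
	ctx->data_port_coherent = (args->value == 1);
	return 0;
}
```

In the kern_master / kern_worker scenario from the commit message, userspace
would issue the equivalent of value=1 before submitting kern_master and value=0
before submitting kern_worker, repeating on every loop iteration.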


* Re: [PATCH v4] drm/i915: Add IOCTL Param to control data port coherency.
  2018-07-09 13:20   ` [PATCH v4] " Tomasz Lis
  2018-07-09 13:48     ` Lionel Landwerlin
@ 2018-07-09 16:28     ` Tvrtko Ursulin
  2018-07-09 16:37       ` Chris Wilson
  2018-07-10 18:03       ` Lis, Tomasz
  1 sibling, 2 replies; 81+ messages in thread
From: Tvrtko Ursulin @ 2018-07-09 16:28 UTC (permalink / raw)
  To: Tomasz Lis, intel-gfx; +Cc: bartosz.dunajski


On 09/07/2018 14:20, Tomasz Lis wrote:
> The patch adds a parameter to control the data port coherency functionality
> on a per-context level. When the IOCTL is called, a command to switch data
> port coherency state is added to the ordered list. All prior requests are
> executed on old coherency settings, and all exec requests after the IOCTL
> will use new settings.
> 
> Rationale:
> 
> The OpenCL driver developers requested a functionality to control cache
> coherency at data port level. Keeping the coherency at that level is disabled
> by default due to its performance costs. OpenCL driver is planning to
> enable it for a small subset of submissions, when such functionality is
> required. Below are answers to basic questions explaining background
> of the functionality and reasoning for the proposed implementation:
> 
> 1. Why do we need a coherency enable/disable switch for memory that is shared
> between CPU and GEN (GPU)?
> 
> Memory coherency between CPU and GEN, while being a great feature that enables
> CL_MEM_SVM_FINE_GRAIN_BUFFER OCL capability on Intel GEN architecture, adds
> overhead related to tracking (snooping) memory inside different cache units
> (L1$, L2$, L3$, LLC$, etc.). At the same time, minority of modern OCL
> applications actually use CL_MEM_SVM_FINE_GRAIN_BUFFER (and hence require
> memory coherency between CPU and GPU). The goal of coherency enable/disable
> switch is to remove overhead of memory coherency when memory coherency is not
> needed.
> 
> 2. Why do we need a global coherency switch?
> 
> In order to support I/O commands from within EUs (Execution Units), Intel GEN
> ISA (GEN Instruction Set Assembly) contains dedicated "send" instructions.
> These send instructions provide several addressing models. One of these
> addressing models (named "stateless") provides most flexible I/O using plain
> virtual addresses (as opposed to buffer_handle+offset models). This "stateless"
> model is similar to regular memory load/store operations available on typical
> CPUs. Since this model provides I/O using arbitrary virtual addresses, it
> enables algorithmic designs that are based on pointer-to-pointer (e.g. buffer
> of pointers) concepts. For instance, it allows creating tree-like data
> structures such as:
>                     ________________
>                    |      NODE1     |
>                    | uint64_t data  |
>                    +----------------|
>                    | NODE*  |  NODE*|
>                    +--------+-------+
>                      /              \
>     ________________/                \________________
>    |      NODE2     |                |      NODE3     |
>    | uint64_t data  |                | uint64_t data  |
>    +----------------|                +----------------|
>    | NODE*  |  NODE*|                | NODE*  |  NODE*|
>    +--------+-------+                +--------+-------+
> 
> Please note that pointers inside such structures can point to memory locations
> in different OCL allocations  - e.g. NODE1 and NODE2 can reside in one OCL
> allocation while NODE3 resides in a completely separate OCL allocation.
> Additionally, such pointers can be shared with CPU (i.e. using SVM - Shared
> Virtual Memory feature). Using pointers from different allocations doesn't
> affect the stateless addressing model which even allows scattered reading from
> different allocations at the same time (i.e. by utilizing SIMD-nature of send
> instructions).
> 
> When it comes to coherency programming, send instructions in stateless model
> can be encoded (at ISA level) to either use or disable coherency. However, for
> generic OCL applications (such as example with tree-like data structure), OCL
> compiler is not able to determine origin of memory pointed to by an arbitrary
> pointer - i.e. is not able to track given pointer back to a specific
> allocation. As such, it's not able to decide whether coherency is needed or not
> for specific pointer (or for specific I/O instruction). As a result, compiler
> encodes all stateless sends as coherent (doing otherwise would lead to
> functional issues resulting from data corruption). Please note that it would be
> possible to workaround this (e.g. based on allocations map and pointer bounds
> checking prior to each I/O instruction) but the performance cost of such
> workaround would be many times greater than the cost of keeping coherency
> always enabled. As such, enabling/disabling memory coherency at GEN ISA level
> is not feasible and alternative method is needed.
> 
> Such alternative solution is to have a global coherency switch that allows
> disabling coherency for single (though entire) GPU submission. This is
> beneficial because this way we:
> * can enable (and pay for) coherency only in submissions that actually need
> coherency (submissions that use CL_MEM_SVM_FINE_GRAIN_BUFFER resources)
> * don't care about coherency at GEN ISA granularity (no performance impact)
> 
> 3. Will coherency switch be used frequently?
> 
> There are scenarios that will require frequent toggling of the coherency
> switch.
> E.g. an application has two OCL compute kernels: kern_master and kern_worker.
> kern_master uses, concurrently with CPU, some fine grain SVM resources
> (CL_MEM_SVM_FINE_GRAIN_BUFFER). These resources contain descriptors of
> computational work that needs to be executed. kern_master analyzes incoming
> work descriptors and populates a plain OCL buffer (non-fine-grain) with payload
> for kern_worker. Once kern_master is done, kern_worker kicks-in and processes
> the payload that kern_master produced. These two kernels work in a loop, one
> after another. Since only kern_master requires coherency, kern_worker should
> not be forced to pay for it. This means that we need to have the ability to
> toggle coherency switch on or off per each GPU submission:
> (ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> (ENABLE
> COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> ...
> 
> v2: Fixed compilation warning.
> v3: Refactored the patch to add IOCTL instead of exec flag.
> v4: Renamed and documented the API flag. Used strict values.
>      Removed redundant GEM_WARN_ON()s. Improved to coding standard.
>      Introduced a macro for checking whether hardware supports the feature.
> 
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Michal Winiarski <michal.winiarski@intel.com>
> 
> Bspec: 11419
> Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
> ---
>   drivers/gpu/drm/i915/i915_drv.h         |  1 +
>   drivers/gpu/drm/i915/i915_gem_context.c | 41 +++++++++++++++++++++++++++
>   drivers/gpu/drm/i915/i915_gem_context.h |  6 ++++
>   drivers/gpu/drm/i915/intel_lrc.c        | 49 +++++++++++++++++++++++++++++++++
>   drivers/gpu/drm/i915/intel_lrc.h        |  4 +++
>   include/uapi/drm/i915_drm.h             |  6 ++++
>   6 files changed, 107 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 09ab124..7d4bbd5 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -2524,6 +2524,7 @@ intel_info(const struct drm_i915_private *dev_priv)
>   #define HAS_EDRAM(dev_priv)	(!!((dev_priv)->edram_cap & EDRAM_ENABLED))
>   #define HAS_WT(dev_priv)	((IS_HASWELL(dev_priv) || \
>   				 IS_BROADWELL(dev_priv)) && HAS_EDRAM(dev_priv))
> +#define HAS_DATA_PORT_COHERENCY(dev_priv)	(INTEL_GEN(dev_priv) >= 9)
>   
>   #define HWS_NEEDS_PHYSICAL(dev_priv)	((dev_priv)->info.hws_needs_physical)
>   
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> index b10770c..6db352e 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> @@ -711,6 +711,26 @@ static bool client_is_banned(struct drm_i915_file_private *file_priv)
>   	return atomic_read(&file_priv->ban_score) >= I915_CLIENT_SCORE_BANNED;
>   }
>   
> +static int i915_gem_context_set_data_port_coherent(struct i915_gem_context *ctx)
> +{
> +	int ret;
> +
> +	ret = intel_lr_context_modify_data_port_coherency(ctx, true);
> +	if (!ret)
> +		__set_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
> +	return ret;
> +}
> +
> +static int i915_gem_context_clear_data_port_coherent(struct i915_gem_context *ctx)
> +{
> +	int ret;
> +
> +	ret = intel_lr_context_modify_data_port_coherency(ctx, false);
> +	if (!ret)
> +		__clear_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
> +	return ret;

Is there a good reason you allow userspace to keep emitting an unlimited 
number of commands which do not actually change the status? If not, 
please consider gating the command emission with 
test_and_set_bit/test_and_clear_bit. Hm.. even with that, though, they 
could keep toggling ad infinitum with no real work in between. Has it 
been considered to only save the desired state in set param and then 
emit it, if needed, before the next execbuf? Minor thing in any case, just 
curious since I wasn't following the threads.
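
To make the gating idea concrete, here is a minimal userspace sketch, not 
kernel code. struct ctx_model, emit_lri() and set_coherent() are 
hypothetical stand-ins, and the *_model bit helpers only imitate the 
"return the old value" behaviour of test_and_set_bit/test_and_clear_bit 
without their atomicity. The error path of the real emission is ignored.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag bit number, mirroring the value added by the patch. */
#define CONTEXT_DATA_PORT_COHERENT 6

/* Toy model of a context: a flags word plus a count of emitted commands. */
struct ctx_model {
	unsigned long flags;
	unsigned int emitted;
};

/* Non-atomic imitation: set the bit and return its previous value. */
static bool test_and_set_bit_model(int nr, unsigned long *word)
{
	unsigned long mask = 1UL << nr;
	bool old = (*word & mask) != 0;

	*word |= mask;
	return old;
}

/* Non-atomic imitation: clear the bit and return its previous value. */
static bool test_and_clear_bit_model(int nr, unsigned long *word)
{
	unsigned long mask = 1UL << nr;
	bool old = (*word & mask) != 0;

	*word &= ~mask;
	return old;
}

/* Stand-in for emitting the register write; it only counts calls. */
static void emit_lri(struct ctx_model *ctx)
{
	ctx->emitted++;
}

/* Emit a command only when the requested state differs from the old one. */
static void set_coherent(struct ctx_model *ctx, bool enable)
{
	bool old;

	if (enable)
		old = test_and_set_bit_model(CONTEXT_DATA_PORT_COHERENT,
					     &ctx->flags);
	else
		old = test_and_clear_bit_model(CONTEXT_DATA_PORT_COHERENT,
					       &ctx->flags);

	if (old != enable)
		emit_lri(ctx);
}
```

With this shape, repeating the same value costs nothing, but alternating 
values still emits one command per call, which is the remaining concern 
mentioned above.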

> +}
> +
>   int i915_gem_context_create_ioctl(struct drm_device *dev, void *data,
>   				  struct drm_file *file)
>   {
> @@ -784,6 +804,7 @@ int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
>   int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
>   				    struct drm_file *file)
>   {
> +	struct drm_i915_private *dev_priv = to_i915(dev);

Feel free to use the local for the other existing to_i915(dev) call 
sites in here.

Also use i915 for the local name. Unless I915_READ/WRITE is used i915 is 
preferred nowadays.

>   	struct drm_i915_file_private *file_priv = file->driver_priv;
>   	struct drm_i915_gem_context_param *args = data;
>   	struct i915_gem_context *ctx;
> @@ -818,6 +839,12 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
>   	case I915_CONTEXT_PARAM_PRIORITY:
>   		args->value = ctx->sched.priority;
>   		break;
> +	case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
> +		if (!HAS_DATA_PORT_COHERENCY(dev_priv))
> +			ret = -ENODEV;
> +		else
> +			args->value = i915_gem_context_is_data_port_coherent(ctx);
> +		break;
>   	default:
>   		ret = -EINVAL;
>   		break;
> @@ -830,6 +857,7 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
>   int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
>   				    struct drm_file *file)
>   {
> +	struct drm_i915_private *dev_priv = to_i915(dev);

As with get_param.

>   	struct drm_i915_file_private *file_priv = file->driver_priv;
>   	struct drm_i915_gem_context_param *args = data;
>   	struct i915_gem_context *ctx;
> @@ -893,6 +921,19 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
>   		}
>   		break;
>   
> +	case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
> +		if (args->size)
> +			ret = -EINVAL;
> +		else if (!HAS_DATA_PORT_COHERENCY(dev_priv))
> +			ret = -ENODEV;
> +		else if (args->value == 1)
> +			ret = i915_gem_context_set_data_port_coherent(ctx);
> +		else if (args->value == 0)
> +			ret = i915_gem_context_clear_data_port_coherent(ctx);
> +		else
> +			ret = -EINVAL;
> +		break;
> +
>   	default:
>   		ret = -EINVAL;
>   		break;
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.h b/drivers/gpu/drm/i915/i915_gem_context.h
> index b116e49..e8ccb70 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.h
> +++ b/drivers/gpu/drm/i915/i915_gem_context.h
> @@ -126,6 +126,7 @@ struct i915_gem_context {
>   #define CONTEXT_BANNABLE		3
>   #define CONTEXT_BANNED			4
>   #define CONTEXT_FORCE_SINGLE_SUBMISSION	5
> +#define CONTEXT_DATA_PORT_COHERENT	6
>   
>   	/**
>   	 * @hw_id: - unique identifier for the context
> @@ -257,6 +258,11 @@ static inline void i915_gem_context_set_force_single_submission(struct i915_gem_
>   	__set_bit(CONTEXT_FORCE_SINGLE_SUBMISSION, &ctx->flags);
>   }
>   
> +static inline bool i915_gem_context_is_data_port_coherent(struct i915_gem_context *ctx)
> +{
> +	return test_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
> +}
> +
>   static inline bool i915_gem_context_is_default(const struct i915_gem_context *c)
>   {
>   	return c->user_handle == DEFAULT_CONTEXT_HANDLE;
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index ab89dab..1f037e3 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -259,6 +259,55 @@ intel_lr_context_descriptor_update(struct i915_gem_context *ctx,
>   	ce->lrc_desc = desc;
>   }
>   
> +static int emit_set_data_port_coherency(struct i915_request *req, bool enable)

After much disagreement we ended up with rq as the consistent naming for 
requests.

> +{
> +	u32 *cs;
> +	i915_reg_t reg;
> +
> +	GEM_BUG_ON(req->engine->class != RENDER_CLASS);
> +	GEM_BUG_ON(INTEL_GEN(req->i915) < 9);
> +
> +	cs = intel_ring_begin(req, 4);
> +	if (IS_ERR(cs))
> +		return PTR_ERR(cs);
> +
> +	if (INTEL_GEN(req->i915) >= 10)
> +		reg = CNL_HDC_CHICKEN0;
> +	else
> +		reg = HDC_CHICKEN0;
> +
> +	*cs++ = MI_LOAD_REGISTER_IMM(1);
> +	*cs++ = i915_mmio_reg_offset(reg);
> +	/* Enabling coherency means disabling the bit which forces it off */
> +	if (enable)
> +		*cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
> +	else
> +		*cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
> +	*cs++ = MI_NOOP;
> +
> +	intel_ring_advance(req, cs);
> +
> +	return 0;
> +}
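
For readers less familiar with masked registers, a minimal userspace 
sketch of what the two values written above are meant to achieve. It 
assumes the usual masked-register layout, where the upper 16 bits select 
which of the lower 16 bits get written. The bit position chosen for the 
force-non-coherent bit is an assumption made for illustration, and all 
names are hypothetical stand-ins rather than the driver's macros.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed bit position, for illustration only. */
#define FORCE_NON_COHERENT_MODEL (1u << 4)

/* Upper half is the write mask, lower half carries the new bit value. */
#define MASKED_ENABLE_MODEL(bit)  ((uint32_t)(((bit) << 16) | (bit)))
#define MASKED_DISABLE_MODEL(bit) ((uint32_t)((bit) << 16))

/* Apply a masked write to a modelled 16-bit register value. */
static uint16_t masked_write(uint16_t reg, uint32_t cmd)
{
	uint16_t mask = (uint16_t)(cmd >> 16);
	uint16_t val = (uint16_t)(cmd & 0xffff);

	/* Only bits selected by the mask change; the rest are preserved. */
	return (uint16_t)((reg & ~mask) | (val & mask));
}

/*
 * Value to load for the requested coherency state. The sense is
 * inverted: enabling coherency clears the "force non-coherent" bit.
 */
static uint32_t coherency_cmd(int enable)
{
	return enable ? MASKED_DISABLE_MODEL(FORCE_NON_COHERENT_MODEL)
		      : MASKED_ENABLE_MODEL(FORCE_NON_COHERENT_MODEL);
}
```

The point of the mask is that the single load touches only this one bit 
and leaves the other bits in the register as they were.
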
> +
> +int
> +intel_lr_context_modify_data_port_coherency(struct i915_gem_context *ctx,
> +					    bool enable)
> +{
> +	struct i915_request *req;

rq as above.

> +	int ret;
> +
> +	req = i915_request_alloc(ctx->i915->engine[RCS], ctx);
> +	if (IS_ERR(req))
> +		return PTR_ERR(req);
> +
> +	ret = emit_set_data_port_coherency(req, enable);
> +
> +	i915_request_add(req);
> +
> +	return ret;
> +}
> +
>   static struct i915_priolist *
>   lookup_priolist(struct intel_engine_cs *engine, int prio)
>   {
> diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h
> index 1593194..f6965ae 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.h
> +++ b/drivers/gpu/drm/i915/intel_lrc.h
> @@ -104,4 +104,8 @@ struct i915_gem_context;
>   
>   void intel_lr_context_resume(struct drm_i915_private *dev_priv);
>   
> +int
> +intel_lr_context_modify_data_port_coherency(struct i915_gem_context *ctx,
> +					    bool enable);
> +
>   #endif /* _INTEL_LRC_H_ */
> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> index 7f5634c..e677bea 100644
> --- a/include/uapi/drm/i915_drm.h
> +++ b/include/uapi/drm/i915_drm.h
> @@ -1456,6 +1456,12 @@ struct drm_i915_gem_context_param {
>   #define   I915_CONTEXT_MAX_USER_PRIORITY	1023 /* inclusive */
>   #define   I915_CONTEXT_DEFAULT_PRIORITY		0
>   #define   I915_CONTEXT_MIN_USER_PRIORITY	-1023 /* inclusive */
> +/*
> + * When data port level coherency is enabled, the GPU will update memory
> + * buffers shared with CPU, by forcing internal cache units to send memory
> + * writes to real RAM faster. Keeping such coherency has performance cost.

Is this comment correct? Is it actually sending memory writes to _RAM_, 
or is it just that the coherency mode is enabled, even if only targeting 
the CPU or shared cache, which adds a cost?

s/Keeping such coherency has performance cost./Enabling data port 
coherency has a performance cost./ ? Or "can have a performance cost"?

> + */
> +#define I915_CONTEXT_PARAM_DATA_PORT_COHERENCY	0x7
>   	__u64 value;
>   };
>   
> 

Since I understand this design has been approved already on the high 
level, and as you can see I only had some minor comments to add, I can 
say that the patch in principle looks okay to me.

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4] drm/i915: Add IOCTL Param to control data port coherency.
  2018-07-09 16:28     ` Tvrtko Ursulin
@ 2018-07-09 16:37       ` Chris Wilson
  2018-07-10 17:32         ` Lis, Tomasz
  2018-07-10 18:03       ` Lis, Tomasz
  1 sibling, 1 reply; 81+ messages in thread
From: Chris Wilson @ 2018-07-09 16:37 UTC (permalink / raw)
  To: Tomasz Lis, Tvrtko Ursulin, intel-gfx; +Cc: bartosz.dunajski

Quoting Tvrtko Ursulin (2018-07-09 17:28:02)
> 
> On 09/07/2018 14:20, Tomasz Lis wrote:
> > +static int i915_gem_context_clear_data_port_coherent(struct i915_gem_context *ctx)
> > +{
> > +     int ret;
> > +
> > +     ret = intel_lr_context_modify_data_port_coherency(ctx, false);
> > +     if (!ret)
> > +             __clear_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
> > +     return ret;
> 
> Is there a good reason you allow userspace to keep emitting unlimited 
> number of commands which actually do not change the status? If not 
> please consider gating the command emission with 
> test_and_set_bit/test_and_clear_bit. Hm.. apart even with that they 
> could keep toggling ad infinitum with no real work in between. Has it 
> been considered to only save the desired state in set param and then 
> emit it, if needed, before next execbuf? Minor thing in any case, just 
> curious since I wasn't following the threads.

The first patch tried to add a bit to execbuf, and having been
mistakenly down that road before, we asked if there was any alternative.
(Now if you've also been following execbuf3 conversations, having a
packet for privileged LRI is definitely something we want.)

Setting the value in the context register is precisely what we want to
do, and trivially serialised with execbuf since we have to serialise
reservation of ring space, i.e. the normal rules of request generation.
(execbuf is just a client and nothing special). From that point of view,
we only care about frequency: if it is very frequent, it should be
controlled by userspace inside the batch (but it can't, due to there
being dangerous bits inside the reg aiui). At the other end of the
scale is context_setparam for set-once. And there should be no
in-between, as that requires costly batch flushes.
-Chris

^ permalink raw reply	[flat|nested] 81+ messages in thread

* ✗ Fi.CI.IGT: failure for drm/i915: Add Exec param to control data port coherency. (rev5)
  2018-03-19 12:37 [RFC v1] Data port coherency control for UMDs Tomasz Lis
                   ` (16 preceding siblings ...)
  2018-07-09 14:14 ` ✓ Fi.CI.BAT: success " Patchwork
@ 2018-07-09 20:04 ` Patchwork
  2018-07-12 15:18 ` ✗ Fi.CI.CHECKPATCH: warning for drm/i915: Add Exec param to control data port coherency. (rev6) Patchwork
                   ` (10 subsequent siblings)
  28 siblings, 0 replies; 81+ messages in thread
From: Patchwork @ 2018-07-09 20:04 UTC (permalink / raw)
  To: Lis, Tomasz; +Cc: intel-gfx

== Series Details ==

Series: drm/i915: Add Exec param to control data port coherency. (rev5)
URL   : https://patchwork.freedesktop.org/series/40181/
State : failure

== Summary ==

= CI Bug Log - changes from CI_DRM_4454_full -> Patchwork_9592_full =

== Summary - FAILURE ==

  Serious unknown changes coming with Patchwork_9592_full absolutely need to be
  verified manually.
  
  If you think the reported changes have nothing to do with the changes
  introduced in Patchwork_9592_full, please notify your bug team to allow them
  to document this new failure mode, which will reduce false positives in CI.

  

== Possible new issues ==

  Here are the unknown changes that may have been introduced in Patchwork_9592_full:

  === IGT changes ===

    ==== Possible regressions ====

    igt@gem_ctx_param@invalid-param-get:
      shard-apl:          PASS -> FAIL +1
      shard-glk:          PASS -> FAIL +1

    igt@gem_ctx_param@invalid-param-set:
      shard-kbl:          PASS -> FAIL +1
      shard-hsw:          PASS -> FAIL +1
      shard-snb:          PASS -> FAIL +1

    
    ==== Warnings ====

    igt@gem_exec_schedule@deep-bsd2:
      shard-kbl:          SKIP -> PASS +1

    
== Known issues ==

  Here are the changes found in Patchwork_9592_full that come from known issues:

  === IGT changes ===

    ==== Issues hit ====

    igt@kms_flip_tiling@flip-to-x-tiled:
      shard-glk:          PASS -> FAIL (fdo#107161, fdo#103822)

    
    ==== Possible fixes ====

    igt@kms_cursor_legacy@cursorb-vs-flipb-toggle:
      shard-glk:          DMESG-WARN (fdo#105763, fdo#106538) -> PASS

    igt@kms_flip@2x-flip-vs-expired-vblank:
      shard-glk:          FAIL (fdo#102887) -> PASS

    igt@kms_flip_tiling@flip-to-y-tiled:
      shard-glk:          FAIL (fdo#107161) -> PASS

    igt@kms_frontbuffer_tracking@fbc-2p-primscrn-pri-shrfb-draw-mmap-cpu:
      shard-hsw:          FAIL (fdo#105682, fdo#103167) -> PASS

    
  fdo#102887 https://bugs.freedesktop.org/show_bug.cgi?id=102887
  fdo#103167 https://bugs.freedesktop.org/show_bug.cgi?id=103167
  fdo#103822 https://bugs.freedesktop.org/show_bug.cgi?id=103822
  fdo#105682 https://bugs.freedesktop.org/show_bug.cgi?id=105682
  fdo#105763 https://bugs.freedesktop.org/show_bug.cgi?id=105763
  fdo#106538 https://bugs.freedesktop.org/show_bug.cgi?id=106538
  fdo#107161 https://bugs.freedesktop.org/show_bug.cgi?id=107161


== Participating hosts (5 -> 5) ==

  No changes in participating hosts


== Build changes ==

    * Linux: CI_DRM_4454 -> Patchwork_9592

  CI_DRM_4454: 5f4ec795dbe0b8a1c565afcd2af79e41346e7268 @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_4544: 764160f214cd916ddb79408b9f28ac0ad2df40e0 @ git://anongit.freedesktop.org/xorg/app/intel-gpu-tools
  Patchwork_9592: c705f01c81f17ed77188ef82dd953be11be0e1a9 @ git://anongit.freedesktop.org/gfx-ci/linux
  piglit_4509: fdc5a4ca11124ab8413c7988896eec4c97336694 @ git://anongit.freedesktop.org/piglit

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_9592/shards.html

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4] drm/i915: Add IOCTL Param to control data port coherency.
  2018-07-09 16:37       ` Chris Wilson
@ 2018-07-10 17:32         ` Lis, Tomasz
  2018-07-11  9:28           ` Tvrtko Ursulin
  0 siblings, 1 reply; 81+ messages in thread
From: Lis, Tomasz @ 2018-07-10 17:32 UTC (permalink / raw)
  To: Chris Wilson, Tvrtko Ursulin, intel-gfx; +Cc: bartosz.dunajski

On 2018-07-09 18:37, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2018-07-09 17:28:02)
>> On 09/07/2018 14:20, Tomasz Lis wrote:
>>> +static int i915_gem_context_clear_data_port_coherent(struct i915_gem_context *ctx)
>>> +{
>>> +     int ret;
>>> +
>>> +     ret = intel_lr_context_modify_data_port_coherency(ctx, false);
>>> +     if (!ret)
>>> +             __clear_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
>>> +     return ret;
>> Is there a good reason you allow userspace to keep emitting unlimited
>> number of commands which actually do not change the status? If not
>> please consider gating the command emission with
>> test_and_set_bit/test_and_clear_bit. Hm.. apart even with that they
>> could keep toggling ad infinitum with no real work in between. Has it
>> been considered to only save the desired state in set param and then
>> emit it, if needed, before next execbuf? Minor thing in any case, just
>> curious since I wasn't following the threads.
> The first patch tried to add a bit to execbuf, and having been
> mistakenly down that road before, we asked if there was any alternative.
> (Now if you've also been following execbuf3 conversations, having a
> packet for privileged LRI is definitely something we want.)
>
> Setting the value in the context register is precisely what we want to
> do, and trivially serialised with execbuf since we have to serialise
> reservation of ring space, i.e. the normal rules of request generation.
> (execbuf is just a client and nothing special). From that point of view,
> we only care about frequency, if it is very frequent it should be
> controlled by userspace inside the batch (but it can't due to there
> being dangerous bits inside the reg aiui). At the other end of the
> scale, is context_setparam for set-once. And there should be no
> inbetween as that requires costly batch flushes.
> -Chris
Joonas did bring up that concern in his review; here it is, with my response:

On 2018-06-21 15:47, Lis, Tomasz wrote:
> On 2018-06-21 08:39, Joonas Lahtinen wrote:
>> I'm thinking we should set this value when it has changed, when we 
>> insert the
>> requests into the command stream. So if you change back and forth, while
>> not emitting any requests, nothing really happens. If you change the 
>> value and
>> emit a request, we should emit a LRI before the jump to the commands.
>> Similarly if you keep setting the value to the value it already was in,
>> nothing will happen, again.
> When I considered that, my way of reasoning was:
> If we execute the flag changing buffer right away, it may be sent to 
> hardware faster if there is no job in progress.
> If we use the lazy way, and trigger the change just before submission 
> -  there will be additional conditions in submission code, plus the 
> change will be made when there is another job pending (though it's not 
> a considerable payload to just switch a flag).
> If user space switches the flag back and forth without much sense, 
> then there is something wrong with the user space driver, and it 
> shouldn't be up to kernel to fix that.
>
> This is why I chose the current approach. But I can change it if you 
> wish.

So while I think the current solution is better from a performance 
standpoint, I will change it if you request that.
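
For reference, the lazy alternative discussed above could be modelled 
roughly as below. This is only a userspace sketch under my own 
assumptions: struct lazy_ctx, lazy_set_param() and lazy_submit() are 
hypothetical names, and the counter stands in for emitting the register 
write ahead of a request.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a context using deferred (lazy) coherency updates. */
struct lazy_ctx {
	bool wanted;              /* state last requested via set-param */
	bool applied;             /* state last written to modelled hardware */
	unsigned int lri_emitted; /* number of register writes emitted */
};

/* Set-param only records the request; nothing is emitted here. */
static void lazy_set_param(struct lazy_ctx *ctx, bool enable)
{
	ctx->wanted = enable;
}

/*
 * Called when a request is about to be submitted. Returns true if a
 * register write had to be emitted ahead of this submission.
 */
static bool lazy_submit(struct lazy_ctx *ctx)
{
	if (ctx->wanted == ctx->applied)
		return false;

	ctx->applied = ctx->wanted;
	ctx->lri_emitted++;
	return true;
}
```

In this model, toggling back and forth without submitting anything emits 
nothing, at the price of the extra comparison on every submission.
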
-Tomasz


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4] drm/i915: Add IOCTL Param to control data port coherency.
  2018-07-09 16:28     ` Tvrtko Ursulin
  2018-07-09 16:37       ` Chris Wilson
@ 2018-07-10 18:03       ` Lis, Tomasz
  2018-07-11 11:20         ` Lis, Tomasz
  1 sibling, 1 reply; 81+ messages in thread
From: Lis, Tomasz @ 2018-07-10 18:03 UTC (permalink / raw)
  To: Tvrtko Ursulin, intel-gfx; +Cc: bartosz.dunajski



On 2018-07-09 18:28, Tvrtko Ursulin wrote:
>
> On 09/07/2018 14:20, Tomasz Lis wrote:
>> The patch adds a parameter to control the data port coherency 
>> functionality
>> on a per-context level. When the IOCTL is called, a command to switch 
>> data
>> port coherency state is added to the ordered list. All prior requests 
>> are
>> executed on old coherency settings, and all exec requests after the 
>> IOCTL
>> will use new settings.
>>
>> Rationale:
>>
>> The OpenCL driver developers requested a functionality to control cache
>> coherency at data port level. Keeping the coherency at that level is 
>> disabled
>> by default due to its performance costs. OpenCL driver is planning to
>> enable it for a small subset of submissions, when such functionality is
>> required. Below are answers to basic questions explaining background
>> of the functionality and reasoning for the proposed implementation:
>>
>> 1. Why do we need a coherency enable/disable switch for memory that 
>> is shared
>> between CPU and GEN (GPU)?
>>
>> Memory coherency between CPU and GEN, while being a great feature 
>> that enables
>> CL_MEM_SVM_FINE_GRAIN_BUFFER OCL capability on Intel GEN 
>> architecture, adds
>> overhead related to tracking (snooping) memory inside different cache 
>> units
>> (L1$, L2$, L3$, LLC$, etc.). At the same time, minority of modern OCL
>> applications actually use CL_MEM_SVM_FINE_GRAIN_BUFFER (and hence 
>> require
>> memory coherency between CPU and GPU). The goal of coherency 
>> enable/disable
>> switch is to remove overhead of memory coherency when memory 
>> coherency is not
>> needed.
>>
>> 2. Why do we need a global coherency switch?
>>
>> In order to support I/O commands from within EUs (Execution Units), 
>> Intel GEN
>> ISA (GEN Instruction Set Assembly) contains dedicated "send" 
>> instructions.
>> These send instructions provide several addressing models. One of these
>> addressing models (named "stateless") provides most flexible I/O 
>> using plain
>> virtual addresses (as opposed to buffer_handle+offset models). This 
>> "stateless"
>> model is similar to regular memory load/store operations available on 
>> typical
>> CPUs. Since this model provides I/O using arbitrary virtual 
>> addresses, it
>> enables algorithmic designs that are based on pointer-to-pointer 
>> (e.g. buffer
>> of pointers) concepts. For instance, it allows creating tree-like data
>> structures such as:
>>                     ________________
>>                    |      NODE1     |
>>                    | uint64_t data  |
>>                    +----------------|
>>                    | NODE*  |  NODE*|
>>                    +--------+-------+
>>                      /              \
>>     ________________/                \________________
>>    |      NODE2     |                |      NODE3     |
>>    | uint64_t data  |                | uint64_t data  |
>>    +----------------|                +----------------|
>>    | NODE*  |  NODE*|                | NODE*  |  NODE*|
>>    +--------+-------+                +--------+-------+
>>
>> Please note that pointers inside such structures can point to memory 
>> locations
>> in different OCL allocations  - e.g. NODE1 and NODE2 can reside in 
>> one OCL
>> allocation while NODE3 resides in a completely separate OCL allocation.
>> Additionally, such pointers can be shared with CPU (i.e. using SVM - 
>> Shared
>> Virtual Memory feature). Using pointers from different allocations 
>> doesn't
>> affect the stateless addressing model which even allows scattered 
>> reading from
>> different allocations at the same time (i.e. by utilizing SIMD-nature 
>> of send
>> instructions).
>>
>> When it comes to coherency programming, send instructions in 
>> stateless model
>> can be encoded (at ISA level) to either use or disable coherency. 
>> However, for
>> generic OCL applications (such as example with tree-like data 
>> structure), OCL
>> compiler is not able to determine origin of memory pointed to by an 
>> arbitrary
>> pointer - i.e. is not able to track given pointer back to a specific
>> allocation. As such, it's not able to decide whether coherency is 
>> needed or not
>> for specific pointer (or for specific I/O instruction). As a result, 
>> compiler
>> encodes all stateless sends as coherent (doing otherwise would lead to
>> functional issues resulting from data corruption). Please note that 
>> it would be
>> possible to workaround this (e.g. based on allocations map and 
>> pointer bounds
>> checking prior to each I/O instruction) but the performance cost of such
>> workaround would be many times greater than the cost of keeping 
>> coherency
>> always enabled. As such, enabling/disabling memory coherency at GEN 
>> ISA level
>> is not feasible and alternative method is needed.
>>
>> Such alternative solution is to have a global coherency switch that 
>> allows
>> disabling coherency for single (though entire) GPU submission. This is
>> beneficial because this way we:
>> * can enable (and pay for) coherency only in submissions that 
>> actually need
>> coherency (submissions that use CL_MEM_SVM_FINE_GRAIN_BUFFER resources)
>> * don't care about coherency at GEN ISA granularity (no performance 
>> impact)
>>
>> 3. Will coherency switch be used frequently?
>>
>> There are scenarios that will require frequent toggling of the coherency
>> switch.
>> E.g. an application has two OCL compute kernels: kern_master and 
>> kern_worker.
>> kern_master uses, concurrently with CPU, some fine grain SVM resources
>> (CL_MEM_SVM_FINE_GRAIN_BUFFER). These resources contain descriptors of
>> computational work that needs to be executed. kern_master analyzes 
>> incoming
>> work descriptors and populates a plain OCL buffer (non-fine-grain) 
>> with payload
>> for kern_worker. Once kern_master is done, kern_worker kicks-in and 
>> processes
>> the payload that kern_master produced. These two kernels work in a 
>> loop, one
>> after another. Since only kern_master requires coherency, kern_worker 
>> should
>> not be forced to pay for it. This means that we need to have the 
>> ability to
>> toggle coherency switch on or off per each GPU submission:
>> (ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> 
>> (ENABLE
>> COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> ...
>>
>> v2: Fixed compilation warning.
>> v3: Refactored the patch to add IOCTL instead of exec flag.
>> v4: Renamed and documented the API flag. Used strict values.
>>      Removed redundant GEM_WARN_ON()s. Improved to coding standard.
>>      Introduced a macro for checking whether hardware supports the 
>> feature.
>>
>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> Cc: Michal Winiarski <michal.winiarski@intel.com>
>>
>> Bspec: 11419
>> Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
>> ---
>>   drivers/gpu/drm/i915/i915_drv.h         |  1 +
>>   drivers/gpu/drm/i915/i915_gem_context.c | 41 
>> +++++++++++++++++++++++++++
>>   drivers/gpu/drm/i915/i915_gem_context.h |  6 ++++
>>   drivers/gpu/drm/i915/intel_lrc.c        | 49 
>> +++++++++++++++++++++++++++++++++
>>   drivers/gpu/drm/i915/intel_lrc.h        |  4 +++
>>   include/uapi/drm/i915_drm.h             |  6 ++++
>>   6 files changed, 107 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_drv.h 
>> b/drivers/gpu/drm/i915/i915_drv.h
>> index 09ab124..7d4bbd5 100644
>> --- a/drivers/gpu/drm/i915/i915_drv.h
>> +++ b/drivers/gpu/drm/i915/i915_drv.h
>> @@ -2524,6 +2524,7 @@ intel_info(const struct drm_i915_private 
>> *dev_priv)
>>   #define HAS_EDRAM(dev_priv)    (!!((dev_priv)->edram_cap & 
>> EDRAM_ENABLED))
>>   #define HAS_WT(dev_priv)    ((IS_HASWELL(dev_priv) || \
>>                    IS_BROADWELL(dev_priv)) && HAS_EDRAM(dev_priv))
>> +#define HAS_DATA_PORT_COHERENCY(dev_priv) (INTEL_GEN(dev_priv) >= 9)
>>     #define HWS_NEEDS_PHYSICAL(dev_priv) 
>> ((dev_priv)->info.hws_needs_physical)
>>   diff --git a/drivers/gpu/drm/i915/i915_gem_context.c 
>> b/drivers/gpu/drm/i915/i915_gem_context.c
>> index b10770c..6db352e 100644
>> --- a/drivers/gpu/drm/i915/i915_gem_context.c
>> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
>> @@ -711,6 +711,26 @@ static bool client_is_banned(struct 
>> drm_i915_file_private *file_priv)
>>       return atomic_read(&file_priv->ban_score) >= 
>> I915_CLIENT_SCORE_BANNED;
>>   }
>>   +static int i915_gem_context_set_data_port_coherent(struct 
>> i915_gem_context *ctx)
>> +{
>> +    int ret;
>> +
>> +    ret = intel_lr_context_modify_data_port_coherency(ctx, true);
>> +    if (!ret)
>> +        __set_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
>> +    return ret;
>> +}
>> +
>> +static int i915_gem_context_clear_data_port_coherent(struct 
>> i915_gem_context *ctx)
>> +{
>> +    int ret;
>> +
>> +    ret = intel_lr_context_modify_data_port_coherency(ctx, false);
>> +    if (!ret)
>> +        __clear_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
>> +    return ret;
>
> Is there a good reason you allow userspace to keep emitting an unlimited
> number of commands which do not actually change the status? If not,
> please consider gating the command emission with
> test_and_set_bit/test_and_clear_bit. Hm.. although even with that they
> could keep toggling ad infinitum with no real work in between. Has it
> been considered to only save the desired state in set param and then
> emit it, if needed, before the next execbuf? Minor thing in any case, just
> curious since I wasn't following the threads.
>
(discussed further in separate thread)
>> +}
>> +
>>   int i915_gem_context_create_ioctl(struct drm_device *dev, void *data,
>>                     struct drm_file *file)
>>   {
>> @@ -784,6 +804,7 @@ int i915_gem_context_destroy_ioctl(struct 
>> drm_device *dev, void *data,
>>   int i915_gem_context_getparam_ioctl(struct drm_device *dev, void 
>> *data,
>>                       struct drm_file *file)
>>   {
>> +    struct drm_i915_private *dev_priv = to_i915(dev);
>
> Feel free to use the local for the other existing to_i915(dev) call 
> sites in here.
>
> Also use i915 for the local name. Unless I915_READ/WRITE is used i915 
> is preferred nowadays.
Will do.
>
>>       struct drm_i915_file_private *file_priv = file->driver_priv;
>>       struct drm_i915_gem_context_param *args = data;
>>       struct i915_gem_context *ctx;
>> @@ -818,6 +839,12 @@ int i915_gem_context_getparam_ioctl(struct 
>> drm_device *dev, void *data,
>>       case I915_CONTEXT_PARAM_PRIORITY:
>>           args->value = ctx->sched.priority;
>>           break;
>> +    case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
>> +        if (!HAS_DATA_PORT_COHERENCY(dev_priv))
>> +            ret = -ENODEV;
>> +        else
>> +            args->value = i915_gem_context_is_data_port_coherent(ctx);
>> +        break;
>>       default:
>>           ret = -EINVAL;
>>           break;
>> @@ -830,6 +857,7 @@ int i915_gem_context_getparam_ioctl(struct 
>> drm_device *dev, void *data,
>>   int i915_gem_context_setparam_ioctl(struct drm_device *dev, void 
>> *data,
>>                       struct drm_file *file)
>>   {
>> +    struct drm_i915_private *dev_priv = to_i915(dev);
>
> As with get_param.
Ack.
>
>>       struct drm_i915_file_private *file_priv = file->driver_priv;
>>       struct drm_i915_gem_context_param *args = data;
>>       struct i915_gem_context *ctx;
>> @@ -893,6 +921,19 @@ int i915_gem_context_setparam_ioctl(struct 
>> drm_device *dev, void *data,
>>           }
>>           break;
>>   +    case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
>> +        if (args->size)
>> +            ret = -EINVAL;
>> +        else if (!HAS_DATA_PORT_COHERENCY(dev_priv))
>> +            ret = -ENODEV;
>> +        else if (args->value == 1)
>> +            ret = i915_gem_context_set_data_port_coherent(ctx);
>> +        else if (args->value == 0)
>> +            ret = i915_gem_context_clear_data_port_coherent(ctx);
>> +        else
>> +            ret = -EINVAL;
>> +        break;
>> +
>>       default:
>>           ret = -EINVAL;
>>           break;
>> diff --git a/drivers/gpu/drm/i915/i915_gem_context.h 
>> b/drivers/gpu/drm/i915/i915_gem_context.h
>> index b116e49..e8ccb70 100644
>> --- a/drivers/gpu/drm/i915/i915_gem_context.h
>> +++ b/drivers/gpu/drm/i915/i915_gem_context.h
>> @@ -126,6 +126,7 @@ struct i915_gem_context {
>>   #define CONTEXT_BANNABLE        3
>>   #define CONTEXT_BANNED            4
>>   #define CONTEXT_FORCE_SINGLE_SUBMISSION    5
>> +#define CONTEXT_DATA_PORT_COHERENT    6
>>         /**
>>        * @hw_id: - unique identifier for the context
>> @@ -257,6 +258,11 @@ static inline void 
>> i915_gem_context_set_force_single_submission(struct i915_gem_
>>       __set_bit(CONTEXT_FORCE_SINGLE_SUBMISSION, &ctx->flags);
>>   }
>>   +static inline bool i915_gem_context_is_data_port_coherent(struct 
>> i915_gem_context *ctx)
>> +{
>> +    return test_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
>> +}
>> +
>>   static inline bool i915_gem_context_is_default(const struct 
>> i915_gem_context *c)
>>   {
>>       return c->user_handle == DEFAULT_CONTEXT_HANDLE;
>> diff --git a/drivers/gpu/drm/i915/intel_lrc.c 
>> b/drivers/gpu/drm/i915/intel_lrc.c
>> index ab89dab..1f037e3 100644
>> --- a/drivers/gpu/drm/i915/intel_lrc.c
>> +++ b/drivers/gpu/drm/i915/intel_lrc.c
>> @@ -259,6 +259,55 @@ intel_lr_context_descriptor_update(struct 
>> i915_gem_context *ctx,
>>       ce->lrc_desc = desc;
>>   }
>>   +static int emit_set_data_port_coherency(struct i915_request *req, 
>> bool enable)
>
> After much disagreement we ended up with rq as the consistent naming 
> for requests.
:)
ok.
>
>> +{
>> +    u32 *cs;
>> +    i915_reg_t reg;
>> +
>> +    GEM_BUG_ON(req->engine->class != RENDER_CLASS);
>> +    GEM_BUG_ON(INTEL_GEN(req->i915) < 9);
>> +
>> +    cs = intel_ring_begin(req, 4);
>> +    if (IS_ERR(cs))
>> +        return PTR_ERR(cs);
>> +
>> +    if (INTEL_GEN(req->i915) >= 10)
>> +        reg = CNL_HDC_CHICKEN0;
>> +    else
>> +        reg = HDC_CHICKEN0;
>> +
>> +    *cs++ = MI_LOAD_REGISTER_IMM(1);
>> +    *cs++ = i915_mmio_reg_offset(reg);
>> +    /* Enabling coherency means disabling the bit which forces it 
>> off */
>> +    if (enable)
>> +        *cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
>> +    else
>> +        *cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
>> +    *cs++ = MI_NOOP;
>> +
>> +    intel_ring_advance(req, cs);
>> +
>> +    return 0;
>> +}
>> +
>> +int
>> +intel_lr_context_modify_data_port_coherency(struct i915_gem_context 
>> *ctx,
>> +                        bool enable)
>> +{
>> +    struct i915_request *req;
>
> rq as above.
ack
>
>> +    int ret;
>> +
>> +    req = i915_request_alloc(ctx->i915->engine[RCS], ctx);
>> +    if (IS_ERR(req))
>> +        return PTR_ERR(req);
>> +
>> +    ret = emit_set_data_port_coherency(req, enable);
>> +
>> +    i915_request_add(req);
>> +
>> +    return ret;
>> +}
>> +
>>   static struct i915_priolist *
>>   lookup_priolist(struct intel_engine_cs *engine, int prio)
>>   {
>> diff --git a/drivers/gpu/drm/i915/intel_lrc.h 
>> b/drivers/gpu/drm/i915/intel_lrc.h
>> index 1593194..f6965ae 100644
>> --- a/drivers/gpu/drm/i915/intel_lrc.h
>> +++ b/drivers/gpu/drm/i915/intel_lrc.h
>> @@ -104,4 +104,8 @@ struct i915_gem_context;
>>     void intel_lr_context_resume(struct drm_i915_private *dev_priv);
>>   +int
>> +intel_lr_context_modify_data_port_coherency(struct i915_gem_context 
>> *ctx,
>> +                        bool enable);
>> +
>>   #endif /* _INTEL_LRC_H_ */
>> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
>> index 7f5634c..e677bea 100644
>> --- a/include/uapi/drm/i915_drm.h
>> +++ b/include/uapi/drm/i915_drm.h
>> @@ -1456,6 +1456,12 @@ struct drm_i915_gem_context_param {
>>   #define   I915_CONTEXT_MAX_USER_PRIORITY    1023 /* inclusive */
>>   #define   I915_CONTEXT_DEFAULT_PRIORITY        0
>>   #define   I915_CONTEXT_MIN_USER_PRIORITY    -1023 /* inclusive */
>> +/*
>> + * When data port level coherency is enabled, the GPU will update 
>> memory
>> + * buffers shared with CPU, by forcing internal cache units to send 
>> memory
>> + * writes to real RAM faster. Keeping such coherency has performance 
>> cost.
>
> Is this comment correct? Is it actually sending memory writes to 
> _RAM_, or just the coherency mode enabled, even if only targeting CPU 
> or shared cache, which adds a cost?
I'm not sure whether there are further coherency modes to choose how 
"deep" coherency goes. The use case of OCL Team is to see gradual 
changes in the buffers on CPU side while the execution progresses. Write 
to RAM is needed to achieve that. And that limits performance by using 
RAM bandwidth.
>
> s/Keeping such coherency has performance cost./Enabling data port 
> coherency has a performance cost./ ? Or "can have a performance cost"?
I would prefer "Enabling data port coherency has a performance cost.".
There likely are workloads with unmeasurable performance impact, but in
the real world using more memory writes will always slow down something.
>
>> + */
>> +#define I915_CONTEXT_PARAM_DATA_PORT_COHERENCY    0x7
>>       __u64 value;
>>   };
>>
>
> Since I understand this design has been approved already on the high 
> level, and as you can see I only had some minor comments to add, I can 
> say that the patch in principle looks okay to me.
Great; will produce a v5 soon.
-Tomasz
>
> Regards,
>
> Tvrtko

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4] drm/i915: Add IOCTL Param to control data port coherency.
  2018-07-10 17:32         ` Lis, Tomasz
@ 2018-07-11  9:28           ` Tvrtko Ursulin
  0 siblings, 0 replies; 81+ messages in thread
From: Tvrtko Ursulin @ 2018-07-11  9:28 UTC (permalink / raw)
  To: Lis, Tomasz, Chris Wilson, intel-gfx; +Cc: bartosz.dunajski


On 10/07/2018 18:32, Lis, Tomasz wrote:
> On 2018-07-09 18:37, Chris Wilson wrote:
>> Quoting Tvrtko Ursulin (2018-07-09 17:28:02)
>>> On 09/07/2018 14:20, Tomasz Lis wrote:
>>>> +static int i915_gem_context_clear_data_port_coherent(struct i915_gem_context *ctx)
>>>> +{
>>>> +     int ret;
>>>> +
>>>> +     ret = intel_lr_context_modify_data_port_coherency(ctx, false);
>>>> +     if (!ret)
>>>> +             __clear_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
>>>> +     return ret;
>>> Is there a good reason you allow userspace to keep emitting an unlimited
>>> number of commands which do not actually change the status? If not,
>>> please consider gating the command emission with
>>> test_and_set_bit/test_and_clear_bit. Hm.. although even with that they
>>> could keep toggling ad infinitum with no real work in between. Has it
>>> been considered to only save the desired state in set param and then
>>> emit it, if needed, before the next execbuf? Minor thing in any case, just
>>> curious since I wasn't following the threads.
>> The first patch tried to add a bit to execbuf, and having been
>> mistakenly down that road before, we asked if there was any alternative.
>> (Now if you've also been following execbuf3 conversations, having a
>> packet for privileged LRI is definitely something we want.)
>>
>> Setting the value in the context register is precisely what we want to
>> do, and trivially serialised with execbuf since we have to serialise
>> reservation of ring space, i.e. the normal rules of request generation.
>> (execbuf is just a client and nothing special). From that point of view,
>> we only care about frequency, if it is very frequent it should be
>> controlled by userspace inside the batch (but it can't due to there
>> being dangerous bits inside the reg aiui). At the other end of the
>> scale, is context_setparam for set-once. And there should be no
>> inbetween as that requires costly batch flushes.
>> -Chris
> Joonas did bring up that concern in his review; here it is, with my response:
> 
> On 2018-06-21 15:47, Lis, Tomasz wrote:
>> On 2018-06-21 08:39, Joonas Lahtinen wrote:
>>> I'm thinking we should set this value when it has changed, when we 
>>> insert the
>>> requests into the command stream. So if you change back and forth, while
>>> not emitting any requests, nothing really happens. If you change the 
>>> value and
>>> emit a request, we should emit a LRI before the jump to the commands.
>>> Similarly, if you keep setting the value to the value it already was in,
>>> nothing will happen, again.
>> When I considered that, my way of reasoning was:
>> If we execute the flag changing buffer right away, it may be sent to 
>> hardware faster if there is no job in progress.
>> If we use the lazy way, and trigger the change just before submission 
>> -  there will be additional conditions in submission code, plus the 
>> change will be made when there is another job pending (though it's not 
>> a considerable payload to just switch a flag).
>> If user space switches the flag back and forth without much sense, 
>> then there is something wrong with the user space driver, and it 
>> shouldn't be up to kernel to fix that.
>>
>> This is why I chose the current approach. But I can change it if you 
>> wish.
> 
> So while I think the current solution is better from a performance
> standpoint, I will change it if you request that.

Sounds like an interesting dilemma and I can see both arguments.

But I still prefer the option where coherency programming is
emitted lazily, on state change only. We do emit a bunch of pipe controls
to invalidate caches and such as a preamble to every request, so that fits
nicely. The advantage I see is that the set param ioctl remains very light
and doesn't do any command submission, in keeping with the spirit and
expectations of all current parameters. It makes the ioctl much quicker and,
as a secondary benefit, it protects userspace from its own silliness.
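
To make the lazy variant concrete, here is a minimal standalone sketch in plain C. It is not driver code: struct ctx, ctx_set_coherency(), ctx_submit() and lazy_demo() are hypothetical names made up for illustration, and the "emitted" counter merely stands in for writing an LRI into the ring.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a GPU context; not the i915 structure. */
struct ctx {
	bool coherent_requested; /* last value stored by set param */
	bool coherent_active;    /* value the hardware was last programmed with */
	unsigned int emitted;    /* number of register writes emitted so far */
};

/* set param path: only record the desired state, submit nothing. */
void ctx_set_coherency(struct ctx *c, bool enable)
{
	c->coherent_requested = enable;
}

/* Submission path: emit the register write only if the state changed. */
void ctx_submit(struct ctx *c)
{
	if (c->coherent_active != c->coherent_requested) {
		c->emitted++; /* stands in for emitting the LRI */
		c->coherent_active = c->coherent_requested;
	}
	/* ... the actual batch would be queued here ... */
}

/* Walks through the cases discussed above; returns the total emitted. */
unsigned int lazy_demo(void)
{
	struct ctx c = { false, false, 0 };

	/* Toggling back and forth with no work in between emits nothing. */
	ctx_set_coherency(&c, true);
	ctx_set_coherency(&c, false);
	ctx_set_coherency(&c, true);
	assert(c.emitted == 0);

	/* First submission programs the new state, the second one is free. */
	ctx_submit(&c);
	ctx_submit(&c);
	assert(c.emitted == 1);

	/* A real state change followed by work costs one more write. */
	ctx_set_coherency(&c, false);
	ctx_submit(&c);
	return c.emitted;
}
```

In this model lazy_demo() ends with two emitted writes for three submissions, and the set param calls themselves never touch the ring.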

Regards,

Tvrtko

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4] drm/i915: Add IOCTL Param to control data port coherency.
  2018-07-10 18:03       ` Lis, Tomasz
@ 2018-07-11 11:20         ` Lis, Tomasz
  0 siblings, 0 replies; 81+ messages in thread
From: Lis, Tomasz @ 2018-07-11 11:20 UTC (permalink / raw)
  To: Tvrtko Ursulin, intel-gfx; +Cc: bartosz.dunajski



On 2018-07-10 20:03, Lis, Tomasz wrote:
>
>
> On 2018-07-09 18:28, Tvrtko Ursulin wrote:
>>
>> On 09/07/2018 14:20, Tomasz Lis wrote:
>>>
>>> diff --git a/drivers/gpu/drm/i915/intel_lrc.h 
>>> b/drivers/gpu/drm/i915/intel_lrc.h
>>> index 1593194..f6965ae 100644
>>> --- a/drivers/gpu/drm/i915/intel_lrc.h
>>> +++ b/drivers/gpu/drm/i915/intel_lrc.h
>>> [...]
>>> +/*
>>> + * When data port level coherency is enabled, the GPU will update 
>>> memory
>>> + * buffers shared with CPU, by forcing internal cache units to send 
>>> memory
>>> + * writes to real RAM faster. Keeping such coherency has 
>>> performance cost.
>>
>> Is this comment correct? Is it actually sending memory writes to 
>> _RAM_, or just the coherency mode enabled, even if only targeting 
>> CPU or shared cache, which adds a cost?
> I'm not sure whether there are further coherency modes to choose how 
> "deep" coherency goes. The use case of OCL Team is to see gradual 
> changes in the buffers on CPU side while the execution progresses. 
> Write to RAM is needed to achieve that. And that limits performance by 
> using RAM bandwidth.

It was pointed out to me that the last level cache is shared between CPU and
GPU on non-Atoms. This means my argument was invalid, and most likely
the coherency option does not enforce a RAM write. I will update the comment.
-Tomasz

^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH v5] drm/i915: Add IOCTL Param to control data port coherency.
  2018-03-19 12:37 ` [RFC v1] drm/i915: Add Exec param to control data port coherency Tomasz Lis
                     ` (4 preceding siblings ...)
  2018-07-09 13:20   ` [PATCH v4] " Tomasz Lis
@ 2018-07-12 15:10   ` Tomasz Lis
  2018-07-13 10:40     ` Tvrtko Ursulin
  2018-10-09 18:06   ` [PATCH v6] " Tomasz Lis
  2018-10-12 15:02   ` [PATCH v8] " Tomasz Lis
  7 siblings, 1 reply; 81+ messages in thread
From: Tomasz Lis @ 2018-07-12 15:10 UTC (permalink / raw)
  To: intel-gfx; +Cc: bartosz.dunajski

The patch adds a parameter to control the data port coherency functionality
on a per-context level. When the IOCTL is called, the requested coherency
state is stored in the context. A command to switch the data port coherency
state is then emitted lazily, ahead of the next exec request, and only if the
requested state differs from the active one. All prior requests are executed
with the old coherency settings, and all exec requests after the IOCTL use
the new settings.

Rationale:

The OpenCL driver developers requested a functionality to control cache
coherency at data port level. Keeping the coherency at that level is disabled
by default due to its performance costs. OpenCL driver is planning to
enable it for a small subset of submissions, when such functionality is
required. Below are answers to basic questions explaining the background
of the functionality and the reasoning for the proposed implementation:

1. Why do we need a coherency enable/disable switch for memory that is shared
between CPU and GEN (GPU)?

Memory coherency between CPU and GEN, while being a great feature that enables
CL_MEM_SVM_FINE_GRAIN_BUFFER OCL capability on Intel GEN architecture, adds
overhead related to tracking (snooping) memory inside different cache units
(L1$, L2$, L3$, LLC$, etc.). At the same time, only a minority of modern OCL
applications actually use CL_MEM_SVM_FINE_GRAIN_BUFFER (and hence require
memory coherency between CPU and GPU). The goal of the coherency enable/disable
switch is to remove the overhead of memory coherency when memory coherency is
not needed.

2. Why do we need a global coherency switch?

In order to support I/O commands from within EUs (Execution Units), Intel GEN
ISA (GEN Instruction Set Assembly) contains dedicated "send" instructions.
These send instructions provide several addressing models. One of these
addressing models (named "stateless") provides most flexible I/O using plain
virtual addresses (as opposed to buffer_handle+offset models). This "stateless"
model is similar to regular memory load/store operations available on typical
CPUs. Since this model provides I/O using arbitrary virtual addresses, it
enables algorithmic designs that are based on pointer-to-pointer (e.g. buffer
of pointers) concepts. For instance, it allows creating tree-like data
structures such as:
                   ________________
                  |      NODE1     |
                  | uint64_t data  |
                  +----------------|
                  | NODE*  |  NODE*|
                  +--------+-------+
                    /              \
   ________________/                \________________
  |      NODE2     |                |      NODE3     |
  | uint64_t data  |                | uint64_t data  |
  +----------------|                +----------------|
  | NODE*  |  NODE*|                | NODE*  |  NODE*|
  +--------+-------+                +--------+-------+

Please note that pointers inside such structures can point to memory locations
in different OCL allocations  - e.g. NODE1 and NODE2 can reside in one OCL
allocation while NODE3 resides in a completely separate OCL allocation.
Additionally, such pointers can be shared with CPU (i.e. using SVM - Shared
Virtual Memory feature). Using pointers from different allocations doesn't
affect the stateless addressing model which even allows scattered reading from
different allocations at the same time (i.e. by utilizing SIMD-nature of send
instructions).
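
As a CPU-side analogue of the diagram above, the following sketch builds the same three-node tree in plain C. It assumes that two host calloc() calls can stand in for two separate OCL allocations; struct node, tree_sum() and tree_demo() are illustrative names only. The point is that the traversal code sees nothing but raw pointers and has no way of telling which allocation a node came from.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Node layout following the diagram: one data word, two child pointers. */
struct node {
	uint64_t data;
	struct node *left;
	struct node *right;
};

/* Plain pointer chasing; no knowledge of which allocation a node lives in. */
uint64_t tree_sum(const struct node *n)
{
	if (!n)
		return 0;
	return n->data + tree_sum(n->left) + tree_sum(n->right);
}

/* Builds NODE1 and NODE2 in one allocation, NODE3 in another, then sums. */
uint64_t tree_demo(void)
{
	struct node *alloc_a = calloc(2, sizeof(*alloc_a)); /* NODE1, NODE2 */
	struct node *alloc_b = calloc(1, sizeof(*alloc_b)); /* NODE3 */
	uint64_t sum;

	assert(alloc_a && alloc_b);

	alloc_a[0].data = 1;            /* NODE1 */
	alloc_a[1].data = 2;            /* NODE2 */
	alloc_b[0].data = 3;            /* NODE3 */
	alloc_a[0].left = &alloc_a[1];  /* stays within the same allocation */
	alloc_a[0].right = &alloc_b[0]; /* crosses into the other allocation */

	sum = tree_sum(&alloc_a[0]);

	free(alloc_b);
	free(alloc_a);
	return sum;
}
```

tree_sum() is identical whether a child lives in alloc_a or alloc_b, which is the property the stateless addressing model provides on the GPU side.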

When it comes to coherency programming, send instructions in stateless model
can be encoded (at ISA level) to either use or disable coherency. However, for
generic OCL applications (such as example with tree-like data structure), OCL
compiler is not able to determine origin of memory pointed to by an arbitrary
pointer - i.e. is not able to track given pointer back to a specific
allocation. As such, it's not able to decide whether coherency is needed or not
for specific pointer (or for specific I/O instruction). As a result, compiler
encodes all stateless sends as coherent (doing otherwise would lead to
functional issues resulting from data corruption). Please note that it would be
possible to workaround this (e.g. based on allocations map and pointer bounds
checking prior to each I/O instruction) but the performance cost of such
workaround would be many times greater than the cost of keeping coherency
always enabled. As such, enabling/disabling memory coherency at GEN ISA level
is not feasible and an alternative method is needed.

Such an alternative solution is to have a global coherency switch that allows
disabling coherency for single (though entire) GPU submission. This is
beneficial because this way we:
* can enable (and pay for) coherency only in submissions that actually need
coherency (submissions that use CL_MEM_SVM_FINE_GRAIN_BUFFER resources)
* don't care about coherency at GEN ISA granularity (no performance impact)
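
For a feel of the per-access workaround dismissed in point 2 (allocations map plus pointer bounds checking), here is a rough host-side sketch. Everything in it is hypothetical: struct alloc_range, needs_coherency() and lookup_demo() are invented for illustration, and a real compiler-generated check would look different. It only shows the kind of lookup that would have to precede every stateless I/O instruction.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical allocation record; fine_grain marks coherent SVM memory. */
struct alloc_range {
	uintptr_t base;
	size_t size;
	bool fine_grain;
};

/* Linear search standing in for the bounds check done before each access. */
bool needs_coherency(const struct alloc_range *map, size_t count,
		     uintptr_t addr)
{
	for (size_t i = 0; i < count; i++) {
		if (addr >= map[i].base && addr - map[i].base < map[i].size)
			return map[i].fine_grain;
	}
	return true; /* unknown pointer: stay on the safe (coherent) side */
}

/* Queries one plain buffer, one fine grain buffer and one unknown address. */
bool lookup_demo(void)
{
	const struct alloc_range map[] = {
		{ 0x1000, 0x1000, false }, /* plain OCL buffer */
		{ 0x8000, 0x2000, true },  /* fine grain SVM buffer */
	};

	assert(!needs_coherency(map, 2, 0x1800));
	assert(needs_coherency(map, 2, 0x9000));
	return needs_coherency(map, 2, 0x4000); /* outside both ranges */
}
```

Even in this toy form the check costs a loop and several compares per access, whereas the global switch costs nothing at instruction granularity.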

3. Will coherency switch be used frequently?

There are scenarios that will require frequent toggling of the coherency
switch.
E.g. an application has two OCL compute kernels: kern_master and kern_worker.
kern_master uses, concurrently with CPU, some fine grain SVM resources
(CL_MEM_SVM_FINE_GRAIN_BUFFER). These resources contain descriptors of
computational work that needs to be executed. kern_master analyzes incoming
work descriptors and populates a plain OCL buffer (non-fine-grain) with payload
for kern_worker. Once kern_master is done, kern_worker kicks-in and processes
the payload that kern_master produced. These two kernels work in a loop, one
after another. Since only kern_master requires coherency, kern_worker should
not be forced to pay for it. This means that we need to have the ability to
toggle coherency switch on or off per each GPU submission:
(ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> (ENABLE
COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> ...

v2: Fixed compilation warning.
v3: Refactored the patch to add IOCTL instead of exec flag.
v4: Renamed and documented the API flag. Used strict values.
    Removed redundant GEM_WARN_ON()s. Improved to coding standard.
    Introduced a macro for checking whether hardware supports the feature.
v5: Renamed some locals. Made the flag write to be lazy.
    Updated comments to remove misconceptions. Added gen11 support.

Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Michal Winiarski <michal.winiarski@intel.com>

Bspec: 11419
Bspec: 19175
Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h            |  1 +
 drivers/gpu/drm/i915/i915_gem_context.c    | 29 +++++++++++++---
 drivers/gpu/drm/i915/i915_gem_context.h    | 17 +++++++++
 drivers/gpu/drm/i915/i915_gem_execbuffer.c |  6 ++++
 drivers/gpu/drm/i915/intel_lrc.c           | 55 ++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/intel_lrc.h           |  4 +++
 include/uapi/drm/i915_drm.h                |  7 ++++
 7 files changed, 115 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 01dd298..73192e1 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2524,6 +2524,7 @@ intel_info(const struct drm_i915_private *dev_priv)
 #define HAS_EDRAM(dev_priv)	(!!((dev_priv)->edram_cap & EDRAM_ENABLED))
 #define HAS_WT(dev_priv)	((IS_HASWELL(dev_priv) || \
 				 IS_BROADWELL(dev_priv)) && HAS_EDRAM(dev_priv))
+#define HAS_DATA_PORT_COHERENCY(dev_priv)	(INTEL_GEN(dev_priv) >= 9)
 
 #define HWS_NEEDS_PHYSICAL(dev_priv)	((dev_priv)->info.hws_needs_physical)
 
diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index b10770c..b5b63ac 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -784,6 +784,7 @@ int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
 int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
 				    struct drm_file *file)
 {
+	struct drm_i915_private *i915 = to_i915(dev);
 	struct drm_i915_file_private *file_priv = file->driver_priv;
 	struct drm_i915_gem_context_param *args = data;
 	struct i915_gem_context *ctx;
@@ -804,10 +805,10 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
 	case I915_CONTEXT_PARAM_GTT_SIZE:
 		if (ctx->ppgtt)
 			args->value = ctx->ppgtt->vm.total;
-		else if (to_i915(dev)->mm.aliasing_ppgtt)
-			args->value = to_i915(dev)->mm.aliasing_ppgtt->vm.total;
+		else if (i915->mm.aliasing_ppgtt)
+			args->value = i915->mm.aliasing_ppgtt->vm.total;
 		else
-			args->value = to_i915(dev)->ggtt.vm.total;
+			args->value = i915->ggtt.vm.total;
 		break;
 	case I915_CONTEXT_PARAM_NO_ERROR_CAPTURE:
 		args->value = i915_gem_context_no_error_capture(ctx);
@@ -818,6 +819,12 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
 	case I915_CONTEXT_PARAM_PRIORITY:
 		args->value = ctx->sched.priority;
 		break;
+	case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
+		if (!HAS_DATA_PORT_COHERENCY(i915))
+			ret = -ENODEV;
+		else
+			args->value = i915_gem_context_is_data_port_coherent_requested(ctx);
+		break;
 	default:
 		ret = -EINVAL;
 		break;
@@ -830,6 +837,7 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
 int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
 				    struct drm_file *file)
 {
+	struct drm_i915_private *i915 = to_i915(dev);
 	struct drm_i915_file_private *file_priv = file->driver_priv;
 	struct drm_i915_gem_context_param *args = data;
 	struct i915_gem_context *ctx;
@@ -880,7 +888,7 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
 
 			if (args->size)
 				ret = -EINVAL;
-			else if (!(to_i915(dev)->caps.scheduler & I915_SCHEDULER_CAP_PRIORITY))
+			else if (!(i915->caps.scheduler & I915_SCHEDULER_CAP_PRIORITY))
 				ret = -ENODEV;
 			else if (priority > I915_CONTEXT_MAX_USER_PRIORITY ||
 				 priority < I915_CONTEXT_MIN_USER_PRIORITY)
@@ -893,6 +901,19 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
 		}
 		break;
 
+	case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
+		if (args->size)
+			ret = -EINVAL;
+		else if (!HAS_DATA_PORT_COHERENCY(i915))
+			ret = -ENODEV;
+		else if (args->value == 1)
+			i915_gem_context_set_data_port_coherent_requested(ctx);
+		else if (args->value == 0)
+			i915_gem_context_clear_data_port_coherent_requested(ctx);
+		else
+			ret = -EINVAL;
+		break;
+
 	default:
 		ret = -EINVAL;
 		break;
diff --git a/drivers/gpu/drm/i915/i915_gem_context.h b/drivers/gpu/drm/i915/i915_gem_context.h
index b116e49..826af84 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.h
+++ b/drivers/gpu/drm/i915/i915_gem_context.h
@@ -126,6 +126,8 @@ struct i915_gem_context {
 #define CONTEXT_BANNABLE		3
 #define CONTEXT_BANNED			4
 #define CONTEXT_FORCE_SINGLE_SUBMISSION	5
+#define CONTEXT_DATA_PORT_COHERENT_REQUESTED	6
+#define CONTEXT_DATA_PORT_COHERENT_ACTIVE	7
 
 	/**
 	 * @hw_id: - unique identifier for the context
@@ -257,6 +259,21 @@ static inline void i915_gem_context_set_force_single_submission(struct i915_gem_
 	__set_bit(CONTEXT_FORCE_SINGLE_SUBMISSION, &ctx->flags);
 }
 
+static inline bool i915_gem_context_is_data_port_coherent_requested(struct i915_gem_context *ctx)
+{
+	return test_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
+}
+
+static inline void i915_gem_context_set_data_port_coherent_requested(struct i915_gem_context *ctx)
+{
+	__set_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
+}
+
+static inline void i915_gem_context_clear_data_port_coherent_requested(struct i915_gem_context *ctx)
+{
+	__clear_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
+}
+
 static inline bool i915_gem_context_is_default(const struct i915_gem_context *c)
 {
 	return c->user_handle == DEFAULT_CONTEXT_HANDLE;
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index 3f0c612..64a7cd4 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -2361,6 +2361,12 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 		goto err_batch_unpin;
 	}
 
+	/* Emit the switch of data port coherency state if needed */
+	err = intel_lr_context_modify_data_port_coherency(eb.request,
+		     i915_gem_context_is_data_port_coherent_requested(eb.ctx));
+	if (GEM_WARN_ON(err))
+		DRM_DEBUG("Data Port Coherency toggle failed, keeping old setting.\n");
+
 	if (in_fence) {
 		err = i915_request_await_dma_fence(eb.request, in_fence);
 		if (err < 0)
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 35d37af..fcee03d 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -259,6 +259,61 @@ intel_lr_context_descriptor_update(struct i915_gem_context *ctx,
 	ce->lrc_desc = desc;
 }
 
+static int emit_set_data_port_coherency(struct i915_request *rq, bool enable)
+{
+	u32 *cs;
+	i915_reg_t reg;
+
+	GEM_BUG_ON(rq->engine->class != RENDER_CLASS);
+	GEM_BUG_ON(INTEL_GEN(rq->i915) < 9);
+
+	cs = intel_ring_begin(rq, 4);
+	if (IS_ERR(cs))
+		return PTR_ERR(cs);
+
+	if (INTEL_GEN(rq->i915) >= 11)
+		reg = ICL_HDC_MODE;
+	else if (INTEL_GEN(rq->i915) >= 10)
+		reg = CNL_HDC_CHICKEN0;
+	else
+		reg = HDC_CHICKEN0;
+
+	*cs++ = MI_LOAD_REGISTER_IMM(1);
+	*cs++ = i915_mmio_reg_offset(reg);
+	/* Enabling coherency means disabling the bit which forces it off */
+	if (enable)
+		*cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
+	else
+		*cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
+	*cs++ = MI_NOOP;
+
+	intel_ring_advance(rq, cs);
+
+	return 0;
+}
+
+int
+intel_lr_context_modify_data_port_coherency(struct i915_request *rq,
+					    bool enable)
+{
+	struct i915_gem_context *ctx = rq->gem_context;
+	int ret;
+
+	if (test_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags) == enable)
+		return 0;
+
+	ret = emit_set_data_port_coherency(rq, enable);
+
+	if (!ret) {
+		if (enable)
+			__set_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags);
+		else
+			__clear_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags);
+	}
+
+       return ret;
+}
+
 static struct i915_priolist *
 lookup_priolist(struct intel_engine_cs *engine, int prio)
 {
diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h
index 1593194..20e8664 100644
--- a/drivers/gpu/drm/i915/intel_lrc.h
+++ b/drivers/gpu/drm/i915/intel_lrc.h
@@ -104,4 +104,8 @@ struct i915_gem_context;
 
 void intel_lr_context_resume(struct drm_i915_private *dev_priv);
 
+int
+intel_lr_context_modify_data_port_coherency(struct i915_request *rq,
+					    bool enable);
+
 #endif /* _INTEL_LRC_H_ */
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index 7f5634c..0a4e31f 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -1456,6 +1456,13 @@ struct drm_i915_gem_context_param {
 #define   I915_CONTEXT_MAX_USER_PRIORITY	1023 /* inclusive */
 #define   I915_CONTEXT_DEFAULT_PRIORITY		0
 #define   I915_CONTEXT_MIN_USER_PRIORITY	-1023 /* inclusive */
+/*
+ * When data port level coherency is enabled, the GPU will update memory
+ * buffers shared with CPU, by forcing internal cache units to send memory
+ * writes to higher level caches faster. Enabling data port coherency has
+ * performance cost.
+ */
+#define I915_CONTEXT_PARAM_DATA_PORT_COHERENCY	0x7
 	__u64 value;
 };
 
-- 
2.7.4

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


* ✗ Fi.CI.CHECKPATCH: warning for drm/i915: Add Exec param to control data port coherency. (rev6)
  2018-03-19 12:37 [RFC v1] Data port coherency control for UMDs Tomasz Lis
                   ` (17 preceding siblings ...)
  2018-07-09 20:04 ` ✗ Fi.CI.IGT: failure " Patchwork
@ 2018-07-12 15:18 ` Patchwork
  2018-07-12 15:19 ` ✗ Fi.CI.SPARSE: " Patchwork
                   ` (9 subsequent siblings)
  28 siblings, 0 replies; 81+ messages in thread
From: Patchwork @ 2018-07-12 15:18 UTC (permalink / raw)
  To: Tomasz Lis; +Cc: intel-gfx

== Series Details ==

Series: drm/i915: Add Exec param to control data port coherency. (rev6)
URL   : https://patchwork.freedesktop.org/series/40181/
State : warning

== Summary ==

$ dim checkpatch origin/drm-tip
ad63bf48d85e drm/i915: Add IOCTL Param to control data port coherency.
-:15: WARNING:COMMIT_LOG_LONG_LINE: Possible unwrapped commit description (prefer a maximum 75 chars per line)
#15: 
coherency at data port level. Keeping the coherency at that level is disabled

-:257: CHECK:PARENTHESIS_ALIGNMENT: Alignment should match open parenthesis
#257: FILE: drivers/gpu/drm/i915/i915_gem_execbuffer.c:2366:
+	err = intel_lr_context_modify_data_port_coherency(eb.request,
+		     i915_gem_context_is_data_port_coherent_requested(eb.ctx));

-:324: WARNING:LEADING_SPACE: please, no spaces at the start of a line
#324: FILE: drivers/gpu/drm/i915/intel_lrc.c:314:
+       return ret;$

total: 0 errors, 2 warnings, 1 checks, 196 lines checked


* ✗ Fi.CI.SPARSE: warning for drm/i915: Add Exec param to control data port coherency. (rev6)
  2018-03-19 12:37 [RFC v1] Data port coherency control for UMDs Tomasz Lis
                   ` (18 preceding siblings ...)
  2018-07-12 15:18 ` ✗ Fi.CI.CHECKPATCH: warning for drm/i915: Add Exec param to control data port coherency. (rev6) Patchwork
@ 2018-07-12 15:19 ` Patchwork
  2018-07-12 15:34 ` ✓ Fi.CI.BAT: success " Patchwork
                   ` (8 subsequent siblings)
  28 siblings, 0 replies; 81+ messages in thread
From: Patchwork @ 2018-07-12 15:19 UTC (permalink / raw)
  To: Tomasz Lis; +Cc: intel-gfx

== Series Details ==

Series: drm/i915: Add Exec param to control data port coherency. (rev6)
URL   : https://patchwork.freedesktop.org/series/40181/
State : warning

== Summary ==

$ dim sparse origin/drm-tip
Commit: drm/i915: Add IOCTL Param to control data port coherency.
-drivers/gpu/drm/i915/selftests/../i915_drv.h:3652:16: warning: expression using sizeof(void)
+drivers/gpu/drm/i915/selftests/../i915_drv.h:3653:16: warning: expression using sizeof(void)


* ✓ Fi.CI.BAT: success for drm/i915: Add Exec param to control data port coherency. (rev6)
  2018-03-19 12:37 [RFC v1] Data port coherency control for UMDs Tomasz Lis
                   ` (19 preceding siblings ...)
  2018-07-12 15:19 ` ✗ Fi.CI.SPARSE: " Patchwork
@ 2018-07-12 15:34 ` Patchwork
  2018-10-09 18:27 ` ✗ Fi.CI.CHECKPATCH: warning for drm/i915: Add Exec param to control data port coherency. (rev7) Patchwork
                   ` (7 subsequent siblings)
  28 siblings, 0 replies; 81+ messages in thread
From: Patchwork @ 2018-07-12 15:34 UTC (permalink / raw)
  To: Tomasz Lis; +Cc: intel-gfx

== Series Details ==

Series: drm/i915: Add Exec param to control data port coherency. (rev6)
URL   : https://patchwork.freedesktop.org/series/40181/
State : success

== Summary ==

= CI Bug Log - changes from CI_DRM_4476 -> Patchwork_9634 =

== Summary - SUCCESS ==

  No regressions found.

  External URL: https://patchwork.freedesktop.org/api/1.0/series/40181/revisions/6/mbox/

== Known issues ==

  Here are the changes found in Patchwork_9634 that come from known issues:

  === IGT changes ===

    ==== Issues hit ====

    igt@drv_module_reload@basic-no-display:
      {fi-skl-iommu}:     NOTRUN -> FAIL (fdo#106066) +2

    
  {name}: This element is suppressed. This means it is ignored when computing
          the status of the difference (SUCCESS, WARNING, or FAILURE).

  fdo#106066 https://bugs.freedesktop.org/show_bug.cgi?id=106066


== Participating hosts (45 -> 42) ==

  Additional (2): fi-byt-j1900 fi-skl-iommu 
  Missing    (5): fi-ctg-p8600 fi-ilk-m540 fi-byt-squawks fi-bsw-cyan fi-hsw-4200u 


== Build changes ==

    * Linux: CI_DRM_4476 -> Patchwork_9634

  CI_DRM_4476: b818fac0878147c6df45338cb515b9b7bd878b7f @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_4552: 5175aff31e00e17786ebb97aaaf25ddd38b5e72e @ git://anongit.freedesktop.org/xorg/app/intel-gpu-tools
  Patchwork_9634: ad63bf48d85e2a85e9542e8a9daab2db3f19488d @ git://anongit.freedesktop.org/gfx-ci/linux


== Linux commits ==

ad63bf48d85e drm/i915: Add IOCTL Param to control data port coherency.

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_9634/issues.html

* Re: [PATCH v5] drm/i915: Add IOCTL Param to control data port coherency.
  2018-07-12 15:10   ` [PATCH v5] " Tomasz Lis
@ 2018-07-13 10:40     ` Tvrtko Ursulin
  2018-07-13 17:44       ` Lis, Tomasz
  0 siblings, 1 reply; 81+ messages in thread
From: Tvrtko Ursulin @ 2018-07-13 10:40 UTC (permalink / raw)
  To: Tomasz Lis, intel-gfx; +Cc: bartosz.dunajski


On 12/07/2018 16:10, Tomasz Lis wrote:
> The patch adds a parameter to control the data port coherency functionality
> on a per-context level. When the IOCTL is called, a command to switch data
> port coherency state is added to the ordered list. All prior requests are
> executed on old coherency settings, and all exec requests after the IOCTL
> will use new settings.
> 
> Rationale:
> 
> The OpenCL driver developers requested a functionality to control cache
> coherency at data port level. Keeping the coherency at that level is disabled
> by default due to its performance costs. OpenCL driver is planning to
> enable it for a small subset of submissions, when such functionality is
> required. Below are answers to basic questions explaining background
> of the functionality and reasoning for the proposed implementation:
> 
> 1. Why do we need a coherency enable/disable switch for memory that is shared
> between CPU and GEN (GPU)?
> 
> Memory coherency between CPU and GEN, while being a great feature that enables
> CL_MEM_SVM_FINE_GRAIN_BUFFER OCL capability on Intel GEN architecture, adds
> overhead related to tracking (snooping) memory inside different cache units
> (L1$, L2$, L3$, LLC$, etc.). At the same time, minority of modern OCL
> applications actually use CL_MEM_SVM_FINE_GRAIN_BUFFER (and hence require
> memory coherency between CPU and GPU). The goal of coherency enable/disable
> switch is to remove overhead of memory coherency when memory coherency is not
> needed.
> 
> 2. Why do we need a global coherency switch?
> 
> In order to support I/O commands from within EUs (Execution Units), Intel GEN
> ISA (GEN Instruction Set Assembly) contains dedicated "send" instructions.
> These send instructions provide several addressing models. One of these
> addressing models (named "stateless") provides most flexible I/O using plain
> virtual addresses (as opposed to buffer_handle+offset models). This "stateless"
> model is similar to regular memory load/store operations available on typical
> CPUs. Since this model provides I/O using arbitrary virtual addresses, it
> enables algorithmic designs that are based on pointer-to-pointer (e.g. buffer
> of pointers) concepts. For instance, it allows creating tree-like data
> structures such as:
>                     ________________
>                    |      NODE1     |
>                    | uint64_t data  |
>                    +----------------|
>                    | NODE*  |  NODE*|
>                    +--------+-------+
>                      /              \
>     ________________/                \________________
>    |      NODE2     |                |      NODE3     |
>    | uint64_t data  |                | uint64_t data  |
>    +----------------|                +----------------|
>    | NODE*  |  NODE*|                | NODE*  |  NODE*|
>    +--------+-------+                +--------+-------+
> 
> Please note that pointers inside such structures can point to memory locations
> in different OCL allocations  - e.g. NODE1 and NODE2 can reside in one OCL
> allocation while NODE3 resides in a completely separate OCL allocation.
> Additionally, such pointers can be shared with CPU (i.e. using SVM - Shared
> Virtual Memory feature). Using pointers from different allocations doesn't
> affect the stateless addressing model which even allows scattered reading from
> different allocations at the same time (i.e. by utilizing SIMD-nature of send
> instructions).
> 
> When it comes to coherency programming, send instructions in stateless model
> can be encoded (at ISA level) to either use or disable coherency. However, for
> generic OCL applications (such as example with tree-like data structure), OCL
> compiler is not able to determine origin of memory pointed to by an arbitrary
> pointer - i.e. is not able to track given pointer back to a specific
> allocation. As such, it's not able to decide whether coherency is needed or not
> for specific pointer (or for specific I/O instruction). As a result, compiler
> encodes all stateless sends as coherent (doing otherwise would lead to
> functional issues resulting from data corruption). Please note that it would be
> possible to workaround this (e.g. based on allocations map and pointer bounds
> checking prior to each I/O instruction) but the performance cost of such
> workaround would be many times greater than the cost of keeping coherency
> always enabled. As such, enabling/disabling memory coherency at GEN ISA level
> is not feasible and alternative method is needed.
> 
> Such alternative solution is to have a global coherency switch that allows
> disabling coherency for single (though entire) GPU submission. This is
> beneficial because this way we:
> * can enable (and pay for) coherency only in submissions that actually need
> coherency (submissions that use CL_MEM_SVM_FINE_GRAIN_BUFFER resources)
> * don't care about coherency at GEN ISA granularity (no performance impact)
> 
> 3. Will coherency switch be used frequently?
> 
> There are scenarios that will require frequent toggling of the coherency
> switch.
> E.g. an application has two OCL compute kernels: kern_master and kern_worker.
> kern_master uses, concurrently with CPU, some fine grain SVM resources
> (CL_MEM_SVM_FINE_GRAIN_BUFFER). These resources contain descriptors of
> computational work that needs to be executed. kern_master analyzes incoming
> work descriptors and populates a plain OCL buffer (non-fine-grain) with payload
> for kern_worker. Once kern_master is done, kern_worker kicks-in and processes
> the payload that kern_master produced. These two kernels work in a loop, one
> after another. Since only kern_master requires coherency, kern_worker should
> not be forced to pay for it. This means that we need to have the ability to
> toggle coherency switch on or off per each GPU submission:
> (ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> (ENABLE
> COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> ...
> 
> v2: Fixed compilation warning.
> v3: Refactored the patch to add IOCTL instead of exec flag.
> v4: Renamed and documented the API flag. Used strict values.
>      Removed redundant GEM_WARN_ON()s. Improved to coding standard.
>      Introduced a macro for checking whether hardware supports the feature.
> v5: Renamed some locals. Made the flag write to be lazy.
>      Updated comments to remove misconceptions. Added gen11 support.
> 
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Michal Winiarski <michal.winiarski@intel.com>
> 
> Bspec: 11419
> Bspec: 19175
> Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
> ---
>   drivers/gpu/drm/i915/i915_drv.h            |  1 +
>   drivers/gpu/drm/i915/i915_gem_context.c    | 29 +++++++++++++---
>   drivers/gpu/drm/i915/i915_gem_context.h    | 17 +++++++++
>   drivers/gpu/drm/i915/i915_gem_execbuffer.c |  6 ++++
>   drivers/gpu/drm/i915/intel_lrc.c           | 55 ++++++++++++++++++++++++++++++
>   drivers/gpu/drm/i915/intel_lrc.h           |  4 +++
>   include/uapi/drm/i915_drm.h                |  7 ++++
>   7 files changed, 115 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 01dd298..73192e1 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -2524,6 +2524,7 @@ intel_info(const struct drm_i915_private *dev_priv)
>   #define HAS_EDRAM(dev_priv)	(!!((dev_priv)->edram_cap & EDRAM_ENABLED))
>   #define HAS_WT(dev_priv)	((IS_HASWELL(dev_priv) || \
>   				 IS_BROADWELL(dev_priv)) && HAS_EDRAM(dev_priv))
> +#define HAS_DATA_PORT_COHERENCY(dev_priv)	(INTEL_GEN(dev_priv) >= 9)
>   
>   #define HWS_NEEDS_PHYSICAL(dev_priv)	((dev_priv)->info.hws_needs_physical)
>   
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> index b10770c..b5b63ac 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> @@ -784,6 +784,7 @@ int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
>   int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
>   				    struct drm_file *file)
>   {
> +	struct drm_i915_private *i915 = to_i915(dev);
>   	struct drm_i915_file_private *file_priv = file->driver_priv;
>   	struct drm_i915_gem_context_param *args = data;
>   	struct i915_gem_context *ctx;
> @@ -804,10 +805,10 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
>   	case I915_CONTEXT_PARAM_GTT_SIZE:
>   		if (ctx->ppgtt)
>   			args->value = ctx->ppgtt->vm.total;
> -		else if (to_i915(dev)->mm.aliasing_ppgtt)
> -			args->value = to_i915(dev)->mm.aliasing_ppgtt->vm.total;
> +		else if (i915->mm.aliasing_ppgtt)
> +			args->value = i915->mm.aliasing_ppgtt->vm.total;
>   		else
> -			args->value = to_i915(dev)->ggtt.vm.total;
> +			args->value = i915->ggtt.vm.total;
>   		break;
>   	case I915_CONTEXT_PARAM_NO_ERROR_CAPTURE:
>   		args->value = i915_gem_context_no_error_capture(ctx);
> @@ -818,6 +819,12 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
>   	case I915_CONTEXT_PARAM_PRIORITY:
>   		args->value = ctx->sched.priority;
>   		break;
> +	case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
> +		if (!HAS_DATA_PORT_COHERENCY(i915))
> +			ret = -ENODEV;
> +		else
> +			args->value = i915_gem_context_is_data_port_coherent_requested(ctx);

The name feels overly long, so maybe drop the _requested suffix, 
but that is only a suggestion.

> +		break;
>   	default:
>   		ret = -EINVAL;
>   		break;
> @@ -830,6 +837,7 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
>   int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
>   				    struct drm_file *file)
>   {
> +	struct drm_i915_private *i915 = to_i915(dev);
>   	struct drm_i915_file_private *file_priv = file->driver_priv;
>   	struct drm_i915_gem_context_param *args = data;
>   	struct i915_gem_context *ctx;
> @@ -880,7 +888,7 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
>   
>   			if (args->size)
>   				ret = -EINVAL;
> -			else if (!(to_i915(dev)->caps.scheduler & I915_SCHEDULER_CAP_PRIORITY))
> +			else if (!(i915->caps.scheduler & I915_SCHEDULER_CAP_PRIORITY))
>   				ret = -ENODEV;
>   			else if (priority > I915_CONTEXT_MAX_USER_PRIORITY ||
>   				 priority < I915_CONTEXT_MIN_USER_PRIORITY)
> @@ -893,6 +901,19 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
>   		}
>   		break;
>   
> +	case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
> +		if (args->size)
> +			ret = -EINVAL;
> +		else if (!HAS_DATA_PORT_COHERENCY(i915))
> +			ret = -ENODEV;
> +		else if (args->value == 1)
> +			i915_gem_context_set_data_port_coherent_requested(ctx);
> +		else if (args->value == 0)
> +			i915_gem_context_clear_data_port_coherent_requested(ctx);
> +		else
> +			ret = -EINVAL;
> +		break;
> +
>   	default:
>   		ret = -EINVAL;
>   		break;
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.h b/drivers/gpu/drm/i915/i915_gem_context.h
> index b116e49..826af84 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.h
> +++ b/drivers/gpu/drm/i915/i915_gem_context.h
> @@ -126,6 +126,8 @@ struct i915_gem_context {
>   #define CONTEXT_BANNABLE		3
>   #define CONTEXT_BANNED			4
>   #define CONTEXT_FORCE_SINGLE_SUBMISSION	5
> +#define CONTEXT_DATA_PORT_COHERENT_REQUESTED	6
> +#define CONTEXT_DATA_PORT_COHERENT_ACTIVE	7
>   
>   	/**
>   	 * @hw_id: - unique identifier for the context
> @@ -257,6 +259,21 @@ static inline void i915_gem_context_set_force_single_submission(struct i915_gem_
>   	__set_bit(CONTEXT_FORCE_SINGLE_SUBMISSION, &ctx->flags);
>   }
>   
> +static inline bool i915_gem_context_is_data_port_coherent_requested(struct i915_gem_context *ctx)
> +{
> +	return test_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
> +}
> +
> +static inline void i915_gem_context_set_data_port_coherent_requested(struct i915_gem_context *ctx)
> +{
> +	__set_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
> +}
> +
> +static inline void i915_gem_context_clear_data_port_coherent_requested(struct i915_gem_context *ctx)
> +{
> +	__clear_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
> +}
> +
>   static inline bool i915_gem_context_is_default(const struct i915_gem_context *c)
>   {
>   	return c->user_handle == DEFAULT_CONTEXT_HANDLE;
> diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
> index 3f0c612..64a7cd4 100644
> --- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
> +++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
> @@ -2361,6 +2361,12 @@ i915_gem_do_execbuffer(struct drm_device *dev,
>   		goto err_batch_unpin;
>   	}
>   
> +	/* Emit the switch of data port coherency state if needed */
> +	err = intel_lr_context_modify_data_port_coherency(eb.request,
> +		     i915_gem_context_is_data_port_coherent_requested(eb.ctx));
> +	if (GEM_WARN_ON(err))
> +		DRM_DEBUG("Data Port Coherency toggle failed, keeping old setting.\n");

I think we should propagate the error to userspace here. By the virtue 
of MIN_SPACE_FOR_ADD_REQUEST* we guarantee there must be space for 
request emission.

GEM_WARN_ON is therefore okay to let us know we got the value of 
MIN_SPACE_FOR_ADD_REQUEST wrong. Just remove the "keeping old setting" 
from the debug message.

* Having looked at the commit which last increased 
MIN_SPACE_FOR_ADD_REQUEST I suspect the current value is large enough 
for this addition and that we could probably look at decreasing it. It 
is a manual process though so not straightforward.

But also since this is >= GEN9 code I think it needs to be done deeper. 
Like in the backend layer sounds right to me.

Maybe intel_lrc.c/gen8_emit_flush_render in the EMIT_INVALIDATE mode? 
That is the request preamble dealing with invalidating caches so 
modifying cache coherency mode there as well sounds like a fit to me.
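
For reference when weighing where this write should live, the four dwords that the quoted emit_set_data_port_coherency() places in the ring can be modelled in plain C. This is only a sketch: the macro values below (the MI_INSTR encoding, the 0x7300 offset for HDC_CHICKEN0, bit 4 for HDC_FORCE_NON_COHERENT) are assumptions recalled from i915_reg.h and are not shown anywhere in this thread, and emit_coherency_dwords() is a made-up name. It covers the gen9 register only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values, not taken from this thread; check against i915_reg.h. */
#define MI_INSTR(opcode, flags)  (((uint32_t)(opcode) << 23) | (flags))
#define MI_NOOP                  MI_INSTR(0, 0)
#define MI_LOAD_REGISTER_IMM(x)  MI_INSTR(0x22, 2 * (x) - 1)
#define HDC_CHICKEN0_OFFSET      0x7300u
#define HDC_FORCE_NON_COHERENT   (1u << 4)

/*
 * Masked register: the upper 16 bits select which of the lower 16 bits
 * the write is allowed to change, so no read-modify-write is needed.
 */
#define MASKED_BIT_ENABLE(a)     (((uint32_t)(a) << 16) | (a))
#define MASKED_BIT_DISABLE(a)    ((uint32_t)(a) << 16)

/* Hypothetical helper: writes the four dwords and returns the new cursor. */
static uint32_t *emit_coherency_dwords(uint32_t *cs, bool enable)
{
	assert(cs);

	*cs++ = MI_LOAD_REGISTER_IMM(1);
	*cs++ = HDC_CHICKEN0_OFFSET;
	/* Enabling coherency means clearing the bit which forces it off. */
	*cs++ = enable ? MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT)
		       : MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
	*cs++ = MI_NOOP;	/* pads the packet to an even dword count */

	return cs;
}
```

Because the value is self-masking, the same packet could be placed in any preamble that already emits to the render ring without touching neighbouring bits of the register.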

> +
>   	if (in_fence) {
>   		err = i915_request_await_dma_fence(eb.request, in_fence);
>   		if (err < 0)
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 35d37af..fcee03d 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -259,6 +259,61 @@ intel_lr_context_descriptor_update(struct i915_gem_context *ctx,
>   	ce->lrc_desc = desc;
>   }
>   
> +static int emit_set_data_port_coherency(struct i915_request *rq, bool enable)
> +{
> +	u32 *cs;
> +	i915_reg_t reg;
> +
> +	GEM_BUG_ON(rq->engine->class != RENDER_CLASS);
> +	GEM_BUG_ON(INTEL_GEN(rq->i915) < 9);
> +
> +	cs = intel_ring_begin(rq, 4);
> +	if (IS_ERR(cs))
> +		return PTR_ERR(cs);
> +
> +	if (INTEL_GEN(rq->i915) >= 11)
> +		reg = ICL_HDC_MODE;
> +	else if (INTEL_GEN(rq->i915) >= 10)
> +		reg = CNL_HDC_CHICKEN0;
> +	else
> +		reg = HDC_CHICKEN0;
> +
> +	*cs++ = MI_LOAD_REGISTER_IMM(1);
> +	*cs++ = i915_mmio_reg_offset(reg);
> +	/* Enabling coherency means disabling the bit which forces it off */
> +	if (enable)
> +		*cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
> +	else
> +		*cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
> +	*cs++ = MI_NOOP;
> +
> +	intel_ring_advance(rq, cs);
> +
> +	return 0;
> +}
> +
> +int
> +intel_lr_context_modify_data_port_coherency(struct i915_request *rq,
> +					    bool enable)
> +{
> +	struct i915_gem_context *ctx = rq->gem_context;
> +	int ret;
> +

I'd put a lockdep_assert_held on struct_mutex here to mark it up for the 
future.

> +	if (test_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags) == enable)

You don't need to pass in enable to this function since it can figure 
out what to do from the flags on its own:

	if ((ctx->flags & REQUESTED) == (ctx->flags & ACTIVE))
		return 0;

After which the function should probably be renamed to 
intel_lr_context_update_data_port_coherency?

> +		return 0;
> +
> +	ret = emit_set_data_port_coherency(rq, enable);

And then:

	..(rq, ctx->flags & REQUESTED)
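
To make that concrete, here is a minimal standalone model of the flag-driven variant, a sketch only and not the driver code. struct fake_ctx, ctx_test(), fake_emit() and update_data_port_coherency() are made-up stand-ins for the context, test_bit(), emit_set_data_port_coherency() and the renamed function; only the two bit numbers come from the patch. Note that REQUESTED and ACTIVE sit at different bit positions, so the shorthand comparison above has to be done on normalised booleans rather than on the raw masked values.

```c
#include <assert.h>
#include <stdbool.h>

/* Bit numbers as in the quoted patch; everything else is a stand-in. */
#define CONTEXT_DATA_PORT_COHERENT_REQUESTED 6
#define CONTEXT_DATA_PORT_COHERENT_ACTIVE    7

struct fake_ctx {
	unsigned long flags;
	unsigned int emits;	/* number of simulated register writes */
};

static bool ctx_test(const struct fake_ctx *ctx, int bit)
{
	return (ctx->flags >> bit) & 1UL;
}

/* Stand-in for emit_set_data_port_coherency(); always succeeds here. */
static int fake_emit(struct fake_ctx *ctx, bool enable)
{
	(void)enable;
	ctx->emits++;
	return 0;
}

/* No 'enable' argument: the decision comes from the context flags. */
static int update_data_port_coherency(struct fake_ctx *ctx)
{
	bool requested, active;
	int ret;

	assert(ctx);
	requested = ctx_test(ctx, CONTEXT_DATA_PORT_COHERENT_REQUESTED);
	active = ctx_test(ctx, CONTEXT_DATA_PORT_COHERENT_ACTIVE);

	/* Lazy: nothing is emitted while hardware state matches the request. */
	if (requested == active)
		return 0;

	ret = fake_emit(ctx, requested);
	if (ret)
		return ret;	/* ACTIVE is left untouched on failure */

	if (requested)
		ctx->flags |= 1UL << CONTEXT_DATA_PORT_COHERENT_ACTIVE;
	else
		ctx->flags &= ~(1UL << CONTEXT_DATA_PORT_COHERENT_ACTIVE);

	return 0;
}
```

With this shape, back-to-back submissions with an unchanged setting cost nothing extra, and the kern_master/kern_worker loop from the commit message pays for one register write per toggle.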

> +
> +	if (!ret) {
> +		if (enable)
> +			__set_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags);
> +		else
> +			__clear_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags);
> +	}
> +
> +       return ret;
> +}
> +
>   static struct i915_priolist *
>   lookup_priolist(struct intel_engine_cs *engine, int prio)
>   {
> diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h
> index 1593194..20e8664 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.h
> +++ b/drivers/gpu/drm/i915/intel_lrc.h
> @@ -104,4 +104,8 @@ struct i915_gem_context;
>   
>   void intel_lr_context_resume(struct drm_i915_private *dev_priv);
>   
> +int
> +intel_lr_context_modify_data_port_coherency(struct i915_request *rq,
> +					    bool enable);
> +
>   #endif /* _INTEL_LRC_H_ */
> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> index 7f5634c..0a4e31f 100644
> --- a/include/uapi/drm/i915_drm.h
> +++ b/include/uapi/drm/i915_drm.h
> @@ -1456,6 +1456,13 @@ struct drm_i915_gem_context_param {
>   #define   I915_CONTEXT_MAX_USER_PRIORITY	1023 /* inclusive */
>   #define   I915_CONTEXT_DEFAULT_PRIORITY		0
>   #define   I915_CONTEXT_MIN_USER_PRIORITY	-1023 /* inclusive */
> +/*
> + * When data port level coherency is enabled, the GPU will update memory
> + * buffers shared with CPU, by forcing internal cache units to send memory
> + * writes to higher level caches faster. Enabling data port coherency has
> + * performance cost.

"has _a_ performance cost" I think but not a native speaker so might be 
wrong.

> + */
> +#define I915_CONTEXT_PARAM_DATA_PORT_COHERENCY	0x7
>   	__u64 value;
>   };
>   
> 

Regards,

Tvrtko

* Re: [PATCH v5] drm/i915: Add IOCTL Param to control data port coherency.
  2018-07-13 10:40     ` Tvrtko Ursulin
@ 2018-07-13 17:44       ` Lis, Tomasz
  0 siblings, 0 replies; 81+ messages in thread
From: Lis, Tomasz @ 2018-07-13 17:44 UTC (permalink / raw)
  To: Tvrtko Ursulin, intel-gfx; +Cc: bartosz.dunajski



On 2018-07-13 12:40, Tvrtko Ursulin wrote:
>
> On 12/07/2018 16:10, Tomasz Lis wrote:
>> The patch adds a parameter to control the data port coherency 
>> functionality
>> on a per-context level. When the IOCTL is called, a command to switch 
>> data
>> port coherency state is added to the ordered list. All prior requests 
>> are
>> executed on old coherency settings, and all exec requests after the 
>> IOCTL
>> will use new settings.
>>
>> Rationale:
>>
>> The OpenCL driver developers requested a functionality to control cache
>> coherency at data port level. Keeping the coherency at that level is disabled
>> by default due to its performance costs. OpenCL driver is planning to
>> enable it for a small subset of submissions, when such functionality is
>> required. Below are answers to basic questions explaining background
>> of the functionality and reasoning for the proposed implementation:
>>
>> 1. Why do we need a coherency enable/disable switch for memory that 
>> is shared
>> between CPU and GEN (GPU)?
>>
>> Memory coherency between CPU and GEN, while being a great feature 
>> that enables
>> CL_MEM_SVM_FINE_GRAIN_BUFFER OCL capability on Intel GEN 
>> architecture, adds
>> overhead related to tracking (snooping) memory inside different cache 
>> units
>> (L1$, L2$, L3$, LLC$, etc.). At the same time, minority of modern OCL
>> applications actually use CL_MEM_SVM_FINE_GRAIN_BUFFER (and hence 
>> require
>> memory coherency between CPU and GPU). The goal of coherency 
>> enable/disable
>> switch is to remove overhead of memory coherency when memory 
>> coherency is not
>> needed.
>>
>> 2. Why do we need a global coherency switch?
>>
>> In order to support I/O commands from within EUs (Execution Units), 
>> Intel GEN
>> ISA (GEN Instruction Set Assembly) contains dedicated "send" 
>> instructions.
>> These send instructions provide several addressing models. One of these
>> addressing models (named "stateless") provides most flexible I/O 
>> using plain
>> virtual addresses (as opposed to buffer_handle+offset models). This 
>> "stateless"
>> model is similar to regular memory load/store operations available on 
>> typical
>> CPUs. Since this model provides I/O using arbitrary virtual 
>> addresses, it
>> enables algorithmic designs that are based on pointer-to-pointer 
>> (e.g. buffer
>> of pointers) concepts. For instance, it allows creating tree-like data
>> structures such as:
>>                     ________________
>>                    |      NODE1     |
>>                    | uint64_t data  |
>>                    +----------------|
>>                    | NODE*  |  NODE*|
>>                    +--------+-------+
>>                      /              \
>>     ________________/                \________________
>>    |      NODE2     |                |      NODE3     |
>>    | uint64_t data  |                | uint64_t data  |
>>    +----------------|                +----------------|
>>    | NODE*  |  NODE*|                | NODE*  |  NODE*|
>>    +--------+-------+                +--------+-------+
>>
>> Please note that pointers inside such structures can point to memory 
>> locations
>> in different OCL allocations  - e.g. NODE1 and NODE2 can reside in 
>> one OCL
>> allocation while NODE3 resides in a completely separate OCL allocation.
>> Additionally, such pointers can be shared with CPU (i.e. using SVM - 
>> Shared
>> Virtual Memory feature). Using pointers from different allocations 
>> doesn't
>> affect the stateless addressing model which even allows scattered 
>> reading from
>> different allocations at the same time (i.e. by utilizing SIMD-nature 
>> of send
>> instructions).
>>
>> When it comes to coherency programming, send instructions in 
>> stateless model
>> can be encoded (at ISA level) to either use or disable coherency. 
>> However, for
>> generic OCL applications (such as example with tree-like data 
>> structure), OCL
>> compiler is not able to determine origin of memory pointed to by an 
>> arbitrary
>> pointer - i.e. is not able to track given pointer back to a specific
>> allocation. As such, it's not able to decide whether coherency is 
>> needed or not
>> for specific pointer (or for specific I/O instruction). As a result, 
>> compiler
>> encodes all stateless sends as coherent (doing otherwise would lead to
>> functional issues resulting from data corruption). Please note that 
>> it would be
>> possible to workaround this (e.g. based on allocations map and 
>> pointer bounds
>> checking prior to each I/O instruction) but the performance cost of such
>> workaround would be many times greater than the cost of keeping 
>> coherency
>> always enabled. As such, enabling/disabling memory coherency at GEN 
>> ISA level
>> is not feasible and alternative method is needed.
>>
>> Such alternative solution is to have a global coherency switch that 
>> allows
>> disabling coherency for single (though entire) GPU submission. This is
>> beneficial because this way we:
>> * can enable (and pay for) coherency only in submissions that 
>> actually need
>> coherency (submissions that use CL_MEM_SVM_FINE_GRAIN_BUFFER resources)
>> * don't care about coherency at GEN ISA granularity (no performance 
>> impact)
>>
>> 3. Will coherency switch be used frequently?
>>
>> There are scenarios that will require frequent toggling of the coherency
>> switch.
>> E.g. an application has two OCL compute kernels: kern_master and 
>> kern_worker.
>> kern_master uses, concurrently with CPU, some fine grain SVM resources
>> (CL_MEM_SVM_FINE_GRAIN_BUFFER). These resources contain descriptors of
>> computational work that needs to be executed. kern_master analyzes 
>> incoming
>> work descriptors and populates a plain OCL buffer (non-fine-grain) 
>> with payload
>> for kern_worker. Once kern_master is done, kern_worker kicks-in and 
>> processes
>> the payload that kern_master produced. These two kernels work in a 
>> loop, one
>> after another. Since only kern_master requires coherency, kern_worker 
>> should
>> not be forced to pay for it. This means that we need to have the 
>> ability to
>> toggle coherency switch on or off per each GPU submission:
>> (ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> 
>> (ENABLE
>> COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> ...
>>
>> v2: Fixed compilation warning.
>> v3: Refactored the patch to add IOCTL instead of exec flag.
>> v4: Renamed and documented the API flag. Used strict values.
>>      Removed redundant GEM_WARN_ON()s. Improved to coding standard.
>>      Introduced a macro for checking whether hardware supports the 
>> feature.
>> v5: Renamed some locals. Made the flag write to be lazy.
>>      Updated comments to remove misconceptions. Added gen11 support.
>>
>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> Cc: Michal Winiarski <michal.winiarski@intel.com>
>>
>> Bspec: 11419
>> Bspec: 19175
>> Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
>> ---
>>   drivers/gpu/drm/i915/i915_drv.h            |  1 +
>>   drivers/gpu/drm/i915/i915_gem_context.c    | 29 +++++++++++++---
>>   drivers/gpu/drm/i915/i915_gem_context.h    | 17 +++++++++
>>   drivers/gpu/drm/i915/i915_gem_execbuffer.c |  6 ++++
>>   drivers/gpu/drm/i915/intel_lrc.c           | 55 ++++++++++++++++++++++++++++++
>>   drivers/gpu/drm/i915/intel_lrc.h           |  4 +++
>>   include/uapi/drm/i915_drm.h                |  7 ++++
>>   7 files changed, 115 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
>> index 01dd298..73192e1 100644
>> --- a/drivers/gpu/drm/i915/i915_drv.h
>> +++ b/drivers/gpu/drm/i915/i915_drv.h
>> @@ -2524,6 +2524,7 @@ intel_info(const struct drm_i915_private *dev_priv)
>>   #define HAS_EDRAM(dev_priv)    (!!((dev_priv)->edram_cap & EDRAM_ENABLED))
>>   #define HAS_WT(dev_priv)    ((IS_HASWELL(dev_priv) || \
>>                    IS_BROADWELL(dev_priv)) && HAS_EDRAM(dev_priv))
>> +#define HAS_DATA_PORT_COHERENCY(dev_priv) (INTEL_GEN(dev_priv) >= 9)
>>
>>   #define HWS_NEEDS_PHYSICAL(dev_priv) ((dev_priv)->info.hws_needs_physical)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
>> index b10770c..b5b63ac 100644
>> --- a/drivers/gpu/drm/i915/i915_gem_context.c
>> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
>> @@ -784,6 +784,7 @@ int i915_gem_context_destroy_ioctl(struct 
>> drm_device *dev, void *data,
>>   int i915_gem_context_getparam_ioctl(struct drm_device *dev, void 
>> *data,
>>                       struct drm_file *file)
>>   {
>> +    struct drm_i915_private *i915 = to_i915(dev);
>>       struct drm_i915_file_private *file_priv = file->driver_priv;
>>       struct drm_i915_gem_context_param *args = data;
>>       struct i915_gem_context *ctx;
>> @@ -804,10 +805,10 @@ int i915_gem_context_getparam_ioctl(struct 
>> drm_device *dev, void *data,
>>       case I915_CONTEXT_PARAM_GTT_SIZE:
>>           if (ctx->ppgtt)
>>               args->value = ctx->ppgtt->vm.total;
>> -        else if (to_i915(dev)->mm.aliasing_ppgtt)
>> -            args->value = to_i915(dev)->mm.aliasing_ppgtt->vm.total;
>> +        else if (i915->mm.aliasing_ppgtt)
>> +            args->value = i915->mm.aliasing_ppgtt->vm.total;
>>           else
>> -            args->value = to_i915(dev)->ggtt.vm.total;
>> +            args->value = i915->ggtt.vm.total;
>>           break;
>>       case I915_CONTEXT_PARAM_NO_ERROR_CAPTURE:
>>           args->value = i915_gem_context_no_error_capture(ctx);
>> @@ -818,6 +819,12 @@ int i915_gem_context_getparam_ioctl(struct 
>> drm_device *dev, void *data,
>>       case I915_CONTEXT_PARAM_PRIORITY:
>>           args->value = ctx->sched.priority;
>>           break;
>> +    case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
>> +        if (!HAS_DATA_PORT_COHERENCY(i915))
>> +            ret = -ENODEV;
>> +        else
>> +            args->value = 
>> i915_gem_context_is_data_port_coherent_requested(ctx);
>
> Feels a bit like overly long name so maybe drop the _requested suffix 
> but a suggestion only.
I was considering this as well; will do.
>
>> +        break;
>>       default:
>>           ret = -EINVAL;
>>           break;
>> @@ -830,6 +837,7 @@ int i915_gem_context_getparam_ioctl(struct 
>> drm_device *dev, void *data,
>>   int i915_gem_context_setparam_ioctl(struct drm_device *dev, void 
>> *data,
>>                       struct drm_file *file)
>>   {
>> +    struct drm_i915_private *i915 = to_i915(dev);
>>       struct drm_i915_file_private *file_priv = file->driver_priv;
>>       struct drm_i915_gem_context_param *args = data;
>>       struct i915_gem_context *ctx;
>> @@ -880,7 +888,7 @@ int i915_gem_context_setparam_ioctl(struct 
>> drm_device *dev, void *data,
>>                 if (args->size)
>>                   ret = -EINVAL;
>> -            else if (!(to_i915(dev)->caps.scheduler & 
>> I915_SCHEDULER_CAP_PRIORITY))
>> +            else if (!(i915->caps.scheduler & 
>> I915_SCHEDULER_CAP_PRIORITY))
>>                   ret = -ENODEV;
>>               else if (priority > I915_CONTEXT_MAX_USER_PRIORITY ||
>>                    priority < I915_CONTEXT_MIN_USER_PRIORITY)
>> @@ -893,6 +901,19 @@ int i915_gem_context_setparam_ioctl(struct 
>> drm_device *dev, void *data,
>>           }
>>           break;
>>   +    case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
>> +        if (args->size)
>> +            ret = -EINVAL;
>> +        else if (!HAS_DATA_PORT_COHERENCY(i915))
>> +            ret = -ENODEV;
>> +        else if (args->value == 1)
>> + i915_gem_context_set_data_port_coherent_requested(ctx);
>> +        else if (args->value == 0)
>> + i915_gem_context_clear_data_port_coherent_requested(ctx);
>> +        else
>> +            ret = -EINVAL;
>> +        break;
>> +
>>       default:
>>           ret = -EINVAL;
>>           break;
>> diff --git a/drivers/gpu/drm/i915/i915_gem_context.h 
>> b/drivers/gpu/drm/i915/i915_gem_context.h
>> index b116e49..826af84 100644
>> --- a/drivers/gpu/drm/i915/i915_gem_context.h
>> +++ b/drivers/gpu/drm/i915/i915_gem_context.h
>> @@ -126,6 +126,8 @@ struct i915_gem_context {
>>   #define CONTEXT_BANNABLE        3
>>   #define CONTEXT_BANNED            4
>>   #define CONTEXT_FORCE_SINGLE_SUBMISSION    5
>> +#define CONTEXT_DATA_PORT_COHERENT_REQUESTED    6
>> +#define CONTEXT_DATA_PORT_COHERENT_ACTIVE    7
>>         /**
>>        * @hw_id: - unique identifier for the context
>> @@ -257,6 +259,21 @@ static inline void 
>> i915_gem_context_set_force_single_submission(struct i915_gem_
>>       __set_bit(CONTEXT_FORCE_SINGLE_SUBMISSION, &ctx->flags);
>>   }
>>   +static inline bool 
>> i915_gem_context_is_data_port_coherent_requested(struct 
>> i915_gem_context *ctx)
>> +{
>> +    return test_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
>> +}
>> +
>> +static inline void 
>> i915_gem_context_set_data_port_coherent_requested(struct 
>> i915_gem_context *ctx)
>> +{
>> +    __set_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
>> +}
>> +
>> +static inline void 
>> i915_gem_context_clear_data_port_coherent_requested(struct 
>> i915_gem_context *ctx)
>> +{
>> +    __clear_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
>> +}
>> +
>>   static inline bool i915_gem_context_is_default(const struct 
>> i915_gem_context *c)
>>   {
>>       return c->user_handle == DEFAULT_CONTEXT_HANDLE;
>> diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c 
>> b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
>> index 3f0c612..64a7cd4 100644
>> --- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
>> +++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
>> @@ -2361,6 +2361,12 @@ i915_gem_do_execbuffer(struct drm_device *dev,
>>           goto err_batch_unpin;
>>       }
>>   +    /* Emit the switch of data port coherency state if needed */
>> +    err = intel_lr_context_modify_data_port_coherency(eb.request,
>> + i915_gem_context_is_data_port_coherent_requested(eb.ctx));
>> +    if (GEM_WARN_ON(err))
>> +        DRM_DEBUG("Data Port Coherency toggle failed, keeping old 
>> setting.\n");
>
> I think we should propagate the error to userspace here. By the virtue 
> of MIN_SPACE_FOR_ADD_REQUEST* we guarantee there must be space for 
> request emission.
>
> GEM_WARN_ON is therefore okay to let us know we got the value of 
> MIN_SPACE_FOR_ADD_REQUEST wrong. Just remove the "keeping old setting" 
> from the debug message.
ack
>
> * Having looked at the commit which last increased 
> MIN_SPACE_FOR_ADD_REQUEST I suspect the current value is large enough 
> for this addition and that we could probably look at decreasing it. It 
> is a manual process though so not straightforward.
>
> But also since this is >= GEN9 code I think it needs to be done 
> deeper. Like in the backend layer sounds right to me.
>
> Maybe intel_lrc.c/gen8_emit_flush_render in the EMIT_INVALIDATE mode? 
> That is the request preamble dealing with invalidating caches so 
> modifying cache coherency mode there as well sounds like a fit to me.
>
Agreed. Will move.
>> +
>>       if (in_fence) {
>>           err = i915_request_await_dma_fence(eb.request, in_fence);
>>           if (err < 0)
>> diff --git a/drivers/gpu/drm/i915/intel_lrc.c 
>> b/drivers/gpu/drm/i915/intel_lrc.c
>> index 35d37af..fcee03d 100644
>> --- a/drivers/gpu/drm/i915/intel_lrc.c
>> +++ b/drivers/gpu/drm/i915/intel_lrc.c
>> @@ -259,6 +259,61 @@ intel_lr_context_descriptor_update(struct 
>> i915_gem_context *ctx,
>>       ce->lrc_desc = desc;
>>   }
>>   +static int emit_set_data_port_coherency(struct i915_request *rq, 
>> bool enable)
>> +{
>> +    u32 *cs;
>> +    i915_reg_t reg;
>> +
>> +    GEM_BUG_ON(rq->engine->class != RENDER_CLASS);
>> +    GEM_BUG_ON(INTEL_GEN(rq->i915) < 9);
>> +
>> +    cs = intel_ring_begin(rq, 4);
>> +    if (IS_ERR(cs))
>> +        return PTR_ERR(cs);
>> +
>> +    if (INTEL_GEN(rq->i915) >= 11)
>> +        reg = ICL_HDC_MODE;
>> +    else if (INTEL_GEN(rq->i915) >= 10)
>> +        reg = CNL_HDC_CHICKEN0;
>> +    else
>> +        reg = HDC_CHICKEN0;
>> +
>> +    *cs++ = MI_LOAD_REGISTER_IMM(1);
>> +    *cs++ = i915_mmio_reg_offset(reg);
>> +    /* Enabling coherency means disabling the bit which forces it 
>> off */
>> +    if (enable)
>> +        *cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
>> +    else
>> +        *cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
>> +    *cs++ = MI_NOOP;
>> +
>> +    intel_ring_advance(rq, cs);
>> +
>> +    return 0;
>> +}
>> +
>> +int
>> +intel_lr_context_modify_data_port_coherency(struct i915_request *rq,
>> +                        bool enable)
>> +{
>> +    struct i915_gem_context *ctx = rq->gem_context;
>> +    int ret;
>> +
>
> I'd put a lockdep_assert_held on struct_mutex here to mark it up for 
> the future.
ok, will do.
>
>> +    if (test_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags) == 
>> enable)
>
> You don't need to pass in enable to this function since it can figure 
> out what to do from the flags on its own:
>
>     if ((ctx->flags & REQUESTED) == (ctx->flags & ACTIVE))
>         return 0;
>
> After which functions should probably be renamed to 
> intel_lr_context_update_data_port_coherency?
>
ack
>> +        return 0;
>> +
>> +    ret = emit_set_data_port_coherency(rq, enable);
>
> And then:
>
>     ..(rq, ctx->flags & REQUESTED)
ok, I will use a local though.
>
>> +
>> +    if (!ret) {
>> +        if (enable)
>> +            __set_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags);
>> +        else
>> +            __clear_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, 
>> &ctx->flags);
>> +    }
>> +
>> +       return ret;
>> +}
>> +
>>   static struct i915_priolist *
>>   lookup_priolist(struct intel_engine_cs *engine, int prio)
>>   {
>> diff --git a/drivers/gpu/drm/i915/intel_lrc.h 
>> b/drivers/gpu/drm/i915/intel_lrc.h
>> index 1593194..20e8664 100644
>> --- a/drivers/gpu/drm/i915/intel_lrc.h
>> +++ b/drivers/gpu/drm/i915/intel_lrc.h
>> @@ -104,4 +104,8 @@ struct i915_gem_context;
>>     void intel_lr_context_resume(struct drm_i915_private *dev_priv);
>>   +int
>> +intel_lr_context_modify_data_port_coherency(struct i915_request *rq,
>> +                        bool enable);
>> +
>>   #endif /* _INTEL_LRC_H_ */
>> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
>> index 7f5634c..0a4e31f 100644
>> --- a/include/uapi/drm/i915_drm.h
>> +++ b/include/uapi/drm/i915_drm.h
>> @@ -1456,6 +1456,13 @@ struct drm_i915_gem_context_param {
>>   #define   I915_CONTEXT_MAX_USER_PRIORITY    1023 /* inclusive */
>>   #define   I915_CONTEXT_DEFAULT_PRIORITY        0
>>   #define   I915_CONTEXT_MIN_USER_PRIORITY    -1023 /* inclusive */
>> +/*
>> + * When data port level coherency is enabled, the GPU will update 
>> memory
>> + * buffers shared with CPU, by forcing internal cache units to send 
>> memory
>> + * writes to higher level caches faster. Enabling data port 
>> coherency has
>> + * performance cost.
>
> "has _a_ performance cost" I think but not a native speaker so might 
> be wrong.
Agreed.
Will send the update as soon as it's tested.
>
>> + */
>> +#define I915_CONTEXT_PARAM_DATA_PORT_COHERENCY    0x7
>>       __u64 value;
>>   };
>>
>
> Regards,
>
> Tvrtko

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v1] drm/i915: Add IOCTL Param to control data port coherency.
  2018-06-21 13:47         ` Lis, Tomasz
@ 2018-07-18 13:03           ` Joonas Lahtinen
  0 siblings, 0 replies; 81+ messages in thread
From: Joonas Lahtinen @ 2018-07-18 13:03 UTC (permalink / raw)
  To: Lis, Tomasz, intel-gfx; +Cc: bartosz.dunajski

Quoting Lis, Tomasz (2018-06-21 16:47:45)
> On 2018-06-21 08:39, Joonas Lahtinen wrote:
> > Quoting Tomasz Lis (2018-06-20 18:03:07)
> >>   int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
> >>                                      struct drm_file *file)
> >>   {
> >> +       struct drm_i915_private *dev_priv = to_i915(dev);
> >>          struct drm_i915_file_private *file_priv = file->driver_priv;
> >>          struct drm_i915_gem_context_param *args = data;
> >>          struct i915_gem_context *ctx;
> >> @@ -818,6 +837,16 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
> >>          case I915_CONTEXT_PARAM_PRIORITY:
> >>                  args->value = ctx->sched.priority;
> >>                  break;
> >> +       case I915_CONTEXT_PARAM_COHERENCY:
> >> +               /*
> >> +                * ENODEV if the feature is not supported. This removes the need
> >> +                * of separate IS_SUPPORTED parameter.
> >> +                */
> > Code speaks for itself, the comment is not needed.
> I don't think it is a good idea to limit comments. The current look of 
> the code makes it hard for anyone new to work on it, as the only 
> documentation is the history in mailing list.
> I don't think it's the correct approach. I believe comments should be 
> encouraged.

It's not a matter of opinion. Code should be written clear enough that
comments are not needed.

<SNIP>

> >> +       *cs++ = MI_LOAD_REGISTER_IMM(1);
> >> +       *cs++ = i915_mmio_reg_offset(reg);
> >> +       /* Enabling coherency means disabling the bit which forces it off */
> > Code is again very self explanatory without the comment.
> The logic is reversed, so that "enable" does a "disable". I believe the 
> comment does a great job of assuring the reader that this is not just a 
> coding mistake.
> 
> Do we have any official guidelines for limiting comments?

Yes, avoid where humanly possible. And when you can't avoid, it should
explain "why" not what. I don't see such cases in this patch.
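
For readers following the inverted-logic point quoted above, here is a minimal
standalone sketch of how a masked register write behaves. It is an
illustration only, not the driver code: FORCE_NON_COHERENT and its bit
position are assumed values, and the helper names are hypothetical stand-ins
for the masked-bit macros used in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed bit position, chosen for illustration only. */
#define FORCE_NON_COHERENT (1u << 4)

/* Masked encodings: the high 16 bits select which low bits get written. */
static uint32_t masked_bit_enable(uint32_t bit)
{
	return (bit << 16) | bit;
}

static uint32_t masked_bit_disable(uint32_t bit)
{
	return bit << 16;
}

/* Model of how a masked write updates a 16-bit register value. */
static uint16_t apply_masked_write(uint16_t reg, uint32_t value)
{
	uint16_t mask = (uint16_t)(value >> 16);
	uint16_t bits = (uint16_t)(value & 0xffff);

	return (uint16_t)((reg & ~mask) | (bits & mask));
}

/*
 * Value to load for a requested coherency state. The register bit forces
 * coherency off, so enabling coherency means clearing the bit.
 */
static uint32_t coherency_lri_value(int enable)
{
	assert(enable == 0 || enable == 1);

	return enable ? masked_bit_disable(FORCE_NON_COHERENT)
		      : masked_bit_enable(FORCE_NON_COHERENT);
}
```

Only the selected bit changes; every other bit of the register keeps its
previous value, which is why no read-modify-write is needed in the command
stream.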

<SNIP>

> > I'm thinking we should set this value when it has changed, when we insert the
> > requests into the command stream. So if you change back and forth, while
> > not emitting any requests, nothing really happens. If you change the value and
> > emit a request, we should emit a LRI before the jump to the commands.
> > Similarly, if you keep setting the value to the value it already was in,
> > nothing will happen, again.
> When I considered that, my way of reasoning was:
> If we execute the flag changing buffer right away, it may be sent to 
> hardware faster if there is no job in progress.
> If we use the lazy way, and trigger the change just before submission -  
> there will be additional conditions in submission code, plus the change 
> will be made when there is another job pending (though it's not a 
> considerable payload to just switch a flag).
> If user space switches the flag back and forth without much sense, then 
> there is something wrong with the user space driver, and it shouldn't be 
> up to kernel to fix that.

A few register writes appended before jumping to the BB should not be a
performance concern. There will definitely be more overhead in sending a
whole separate request, so I'm not sure I follow the whole picture.

So I still think it's the right thing to do to only emit the commands as a
prelude when needed.
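
The lazy scheme described above can be sketched as a small standalone model.
This is only an illustration under assumptions: struct ctx_model and both
helpers are hypothetical names, and the counter stands in for the register
write that would be emitted ahead of the batch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-context state, modelled on a requested/active flag pair. */
struct ctx_model {
	bool requested;   /* last value set through the context parameter */
	bool active;      /* value the hardware was last programmed with */
	unsigned emitted; /* number of register writes emitted so far */
};

/* Setting the parameter only records the request; nothing is emitted. */
static void ctx_set_coherency(struct ctx_model *ctx, bool enable)
{
	assert(ctx != NULL);
	ctx->requested = enable;
}

/*
 * Called once per request. Emits a register write as a prelude only when
 * the requested state differs from what is currently active.
 * Returns true if a write was emitted.
 */
static bool ctx_emit_request(struct ctx_model *ctx)
{
	assert(ctx != NULL);

	if (ctx->requested == ctx->active)
		return false;

	ctx->active = ctx->requested;
	ctx->emitted++;
	return true;
}
```

In this model, flipping the value back and forth without submitting work
costs nothing, and repeated submissions with an unchanged setting emit no
extra commands.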

Regards, Joonas

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v1] drm/i915: Add IOCTL Param to control data port coherency.
  2018-06-22 16:40           ` Dunajski, Bartosz
@ 2018-07-18 13:12             ` Joonas Lahtinen
  2018-07-18 13:27               ` Dunajski, Bartosz
  0 siblings, 1 reply; 81+ messages in thread
From: Joonas Lahtinen @ 2018-07-18 13:12 UTC (permalink / raw)
  To: Dunajski, Bartosz, Lis, Tomasz, intel-gfx, Dave Airlie

Quoting Dunajski, Bartosz (2018-06-22 19:40:58)
> Additionally, we are already on Arch:
> https://aur.archlinux.org/packages/compute-runtime 

I'm not an Arch user myself, but my impression is that AUR [1] is the
equivalent of Ubuntu's PPA, where anybody can upload pretty much anything
outside of the support model of the distro.

> Can I assume that adoption plan is not a blocker anymore?

Due to the above, I don't think that matter has changed in one direction or
another.

Regards, Joonas

[1] https://wiki.archlinux.org/index.php/Arch_User_Repository#What_is_the_AUR.3F
> 
> Bartosz
> 
> > Yes, once you follow through with the plan, there should be no issues about merging patches to support the driver.
> >
> > You may want to squeeze your timeline to be complete before 4.19-rc5, which is the feature cutoff date for 4.20, but that is rather an ambitious goal. Your original schedule would land the patches before
> > 4.20-rc5 resulting in inclusion to 4.21.
> >
> > Regards, Joonas
> >
> > PS. I'm going on a vacation for a couple of weeks.
> 

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v1] drm/i915: Add IOCTL Param to control data port coherency.
  2018-07-18 13:12             ` Joonas Lahtinen
@ 2018-07-18 13:27               ` Dunajski, Bartosz
  0 siblings, 0 replies; 81+ messages in thread
From: Dunajski, Bartosz @ 2018-07-18 13:27 UTC (permalink / raw)
  To: Joonas Lahtinen, Lis, Tomasz, intel-gfx, Dave Airlie

You are right about the AUR. This is just a step in the direction of the open source community.
As in my previous answer, ClearLinux (and others) is more important here. We are still coordinating this, but I think we are on the right path. And NEO can be considered an open source client for the coherency patch.


> I'm not an Arch user myself, but my impression is that AUR [1] is equivalent of Ubuntu's PPA where anybody can very much upload anything outside of the support model of the distro.

 

^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH v6] drm/i915: Add IOCTL Param to control data port coherency.
  2018-03-19 12:37 ` [RFC v1] drm/i915: Add Exec param to control data port coherency Tomasz Lis
                     ` (5 preceding siblings ...)
  2018-07-12 15:10   ` [PATCH v5] " Tomasz Lis
@ 2018-10-09 18:06   ` Tomasz Lis
  2018-10-10  7:29     ` Tvrtko Ursulin
  2018-10-12 15:02   ` [PATCH v8] " Tomasz Lis
  7 siblings, 1 reply; 81+ messages in thread
From: Tomasz Lis @ 2018-10-09 18:06 UTC (permalink / raw)
  To: intel-gfx; +Cc: bartosz.dunajski

The patch adds a parameter to control the data port coherency functionality
on a per-context level. When the IOCTL is called, a command to switch data
port coherency state is added to the ordered list. All prior requests are
executed on old coherency settings, and all exec requests after the IOCTL
will use new settings.
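
As a rough illustration of the parameter contract implemented by the diff
below, the following standalone model mirrors the checks applied when the
parameter is set. It is a sketch, not the kernel code: struct param_args and
set_data_port_coherency() are hypothetical names, and the gen argument stands
in for the hardware generation check.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the size/value pair of a context parameter. */
struct param_args {
	uint32_t size;
	uint64_t value;
};

/*
 * Model of the set-parameter checks: size must be zero, the hardware must
 * be gen9 or newer, and the value must be strictly 0 or 1.
 * On success the requested state is stored in *requested.
 */
static int set_data_port_coherency(int gen, const struct param_args *args,
				   bool *requested)
{
	assert(args != NULL && requested != NULL);

	if (args->size)
		return -EINVAL;
	if (gen < 9)
		return -ENODEV;
	if (args->value > 1)
		return -EINVAL;

	*requested = (args->value == 1);
	return 0;
}
```

Rejecting every value other than 0 and 1 keeps the remaining values free for
possible future use, and a failed call leaves the previous request untouched.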

Rationale:

The OpenCL driver developers requested functionality to control cache
coherency at the data port level. Coherency at that level is disabled
by default due to its performance cost. The OpenCL driver plans to
enable it for a small subset of submissions, when such functionality is
required. Below are answers to basic questions explaining the background
of the functionality and the reasoning for the proposed implementation:

1. Why do we need a coherency enable/disable switch for memory that is shared
between CPU and GEN (GPU)?

Memory coherency between CPU and GEN, while being a great feature that enables
CL_MEM_SVM_FINE_GRAIN_BUFFER OCL capability on Intel GEN architecture, adds
overhead related to tracking (snooping) memory inside different cache units
(L1$, L2$, L3$, LLC$, etc.). At the same time, only a minority of modern OCL
applications actually use CL_MEM_SVM_FINE_GRAIN_BUFFER (and hence require
memory coherency between CPU and GPU). The goal of coherency enable/disable
switch is to remove overhead of memory coherency when memory coherency is not
needed.

2. Why do we need a global coherency switch?

In order to support I/O commands from within EUs (Execution Units), Intel GEN
ISA (GEN Instruction Set Assembly) contains dedicated "send" instructions.
These send instructions provide several addressing models. One of these
addressing models (named "stateless") provides most flexible I/O using plain
virtual addresses (as opposed to buffer_handle+offset models). This "stateless"
model is similar to regular memory load/store operations available on typical
CPUs. Since this model provides I/O using arbitrary virtual addresses, it
enables algorithmic designs that are based on pointer-to-pointer (e.g. buffer
of pointers) concepts. For instance, it allows creating tree-like data
structures such as:
                   ________________
                  |      NODE1     |
                  | uint64_t data  |
                  +----------------|
                  | NODE*  |  NODE*|
                  +--------+-------+
                    /              \
   ________________/                \________________
  |      NODE2     |                |      NODE3     |
  | uint64_t data  |                | uint64_t data  |
  +----------------|                +----------------|
  | NODE*  |  NODE*|                | NODE*  |  NODE*|
  +--------+-------+                +--------+-------+

Please note that pointers inside such structures can point to memory locations
in different OCL allocations  - e.g. NODE1 and NODE2 can reside in one OCL
allocation while NODE3 resides in a completely separate OCL allocation.
Additionally, such pointers can be shared with CPU (i.e. using SVM - Shared
Virtual Memory feature). Using pointers from different allocations doesn't
affect the stateless addressing model which even allows scattered reading from
different allocations at the same time (i.e. by utilizing SIMD-nature of send
instructions).

When it comes to coherency programming, send instructions in stateless model
can be encoded (at ISA level) to either use or disable coherency. However, for
generic OCL applications (such as example with tree-like data structure), OCL
compiler is not able to determine origin of memory pointed to by an arbitrary
pointer - i.e. is not able to track given pointer back to a specific
allocation. As such, it's not able to decide whether coherency is needed or not
for specific pointer (or for specific I/O instruction). As a result, compiler
encodes all stateless sends as coherent (doing otherwise would lead to
functional issues resulting from data corruption). Please note that it would be
possible to workaround this (e.g. based on allocations map and pointer bounds
checking prior to each I/O instruction) but the performance cost of such
workaround would be many times greater than the cost of keeping coherency
always enabled. As such, enabling/disabling memory coherency at GEN ISA level
is not feasible and an alternative method is needed.

One such alternative is a global coherency switch that allows disabling
coherency for a single (though entire) GPU submission. This is beneficial
because this way we:
* can enable (and pay for) coherency only in submissions that actually need
coherency (submissions that use CL_MEM_SVM_FINE_GRAIN_BUFFER resources)
* don't care about coherency at GEN ISA granularity (no performance impact)

3. Will coherency switch be used frequently?

There are scenarios that will require frequent toggling of the coherency
switch.
E.g. an application has two OCL compute kernels: kern_master and kern_worker.
kern_master uses, concurrently with CPU, some fine grain SVM resources
(CL_MEM_SVM_FINE_GRAIN_BUFFER). These resources contain descriptors of
computational work that needs to be executed. kern_master analyzes incoming
work descriptors and populates a plain OCL buffer (non-fine-grain) with payload
for kern_worker. Once kern_master is done, kern_worker kicks-in and processes
the payload that kern_master produced. These two kernels work in a loop, one
after another. Since only kern_master requires coherency, kern_worker should
not be forced to pay for it. This means that we need to have the ability to
toggle coherency switch on or off per each GPU submission:
(ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> (ENABLE
COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> ...

v2: Fixed compilation warning.
v3: Refactored the patch to add IOCTL instead of exec flag.
v4: Renamed and documented the API flag. Used strict values.
    Removed redundant GEM_WARN_ON()s. Improved to coding standard.
    Introduced a macro for checking whether hardware supports the feature.
v5: Renamed some locals. Made the flag write to be lazy.
    Updated comments to remove misconceptions. Added gen11 support.
v6: Moved the flag write to gen8_emit_flush_render(). Renamed some functions.
    Moved all flags checking to one place. Added mutex check.
v7: Removed 2 comments, improved API comment. (Joonas)

Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Michal Winiarski <michal.winiarski@intel.com>

Bspec: 11419
Bspec: 19175
Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h         |  1 +
 drivers/gpu/drm/i915/i915_gem_context.c | 29 ++++++++++++---
 drivers/gpu/drm/i915/i915_gem_context.h | 17 +++++++++
 drivers/gpu/drm/i915/intel_lrc.c        | 64 ++++++++++++++++++++++++++++++++-
 include/uapi/drm/i915_drm.h             | 10 ++++++
 5 files changed, 116 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 794a8a0..e1ea5cb 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2588,6 +2588,7 @@ intel_info(const struct drm_i915_private *dev_priv)
 #define HAS_EDRAM(dev_priv)	(!!((dev_priv)->edram_cap & EDRAM_ENABLED))
 #define HAS_WT(dev_priv)	((IS_HASWELL(dev_priv) || \
 				 IS_BROADWELL(dev_priv)) && HAS_EDRAM(dev_priv))
+#define HAS_DATA_PORT_COHERENCY(dev_priv)	(INTEL_GEN(dev_priv) >= 9)
 
 #define HWS_NEEDS_PHYSICAL(dev_priv)	((dev_priv)->info.hws_needs_physical)
 
diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 8cbe580..718ede9 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -847,6 +847,7 @@ int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
 int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
 				    struct drm_file *file)
 {
+	struct drm_i915_private *i915 = to_i915(dev);
 	struct drm_i915_file_private *file_priv = file->driver_priv;
 	struct drm_i915_gem_context_param *args = data;
 	struct i915_gem_context *ctx;
@@ -867,10 +868,10 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
 	case I915_CONTEXT_PARAM_GTT_SIZE:
 		if (ctx->ppgtt)
 			args->value = ctx->ppgtt->vm.total;
-		else if (to_i915(dev)->mm.aliasing_ppgtt)
-			args->value = to_i915(dev)->mm.aliasing_ppgtt->vm.total;
+		else if (i915->mm.aliasing_ppgtt)
+			args->value = i915->mm.aliasing_ppgtt->vm.total;
 		else
-			args->value = to_i915(dev)->ggtt.vm.total;
+			args->value = i915->ggtt.vm.total;
 		break;
 	case I915_CONTEXT_PARAM_NO_ERROR_CAPTURE:
 		args->value = i915_gem_context_no_error_capture(ctx);
@@ -881,6 +882,12 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
 	case I915_CONTEXT_PARAM_PRIORITY:
 		args->value = ctx->sched.priority >> I915_USER_PRIORITY_SHIFT;
 		break;
+	case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
+		if (!HAS_DATA_PORT_COHERENCY(i915))
+			ret = -ENODEV;
+		else
+			args->value = i915_gem_context_is_data_port_coherent(ctx);
+		break;
 	default:
 		ret = -EINVAL;
 		break;
@@ -893,6 +900,7 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
 int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
 				    struct drm_file *file)
 {
+	struct drm_i915_private *i915 = to_i915(dev);
 	struct drm_i915_file_private *file_priv = file->driver_priv;
 	struct drm_i915_gem_context_param *args = data;
 	struct i915_gem_context *ctx;
@@ -939,7 +947,7 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
 
 			if (args->size)
 				ret = -EINVAL;
-			else if (!(to_i915(dev)->caps.scheduler & I915_SCHEDULER_CAP_PRIORITY))
+			else if (!(i915->caps.scheduler & I915_SCHEDULER_CAP_PRIORITY))
 				ret = -ENODEV;
 			else if (priority > I915_CONTEXT_MAX_USER_PRIORITY ||
 				 priority < I915_CONTEXT_MIN_USER_PRIORITY)
@@ -953,6 +961,19 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
 		}
 		break;
 
+	case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
+		if (args->size)
+			ret = -EINVAL;
+		else if (!HAS_DATA_PORT_COHERENCY(i915))
+			ret = -ENODEV;
+		else if (args->value == 1)
+			i915_gem_context_set_data_port_coherent(ctx);
+		else if (args->value == 0)
+			i915_gem_context_clear_data_port_coherent(ctx);
+		else
+			ret = -EINVAL;
+		break;
+
 	default:
 		ret = -EINVAL;
 		break;
diff --git a/drivers/gpu/drm/i915/i915_gem_context.h b/drivers/gpu/drm/i915/i915_gem_context.h
index f6d870b..55969bc 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.h
+++ b/drivers/gpu/drm/i915/i915_gem_context.h
@@ -131,6 +131,8 @@ struct i915_gem_context {
 #define CONTEXT_BANNED			0
 #define CONTEXT_CLOSED			1
 #define CONTEXT_FORCE_SINGLE_SUBMISSION	2
+#define CONTEXT_DATA_PORT_COHERENT_REQUESTED	6
+#define CONTEXT_DATA_PORT_COHERENT_ACTIVE	7
 
 	/**
 	 * @hw_id: - unique identifier for the context
@@ -283,6 +285,21 @@ static inline void i915_gem_context_unpin_hw_id(struct i915_gem_context *ctx)
 	atomic_dec(&ctx->hw_id_pin_count);
 }
 
+static inline bool i915_gem_context_is_data_port_coherent(struct i915_gem_context *ctx)
+{
+	return test_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
+}
+
+static inline void i915_gem_context_set_data_port_coherent(struct i915_gem_context *ctx)
+{
+	__set_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
+}
+
+static inline void i915_gem_context_clear_data_port_coherent(struct i915_gem_context *ctx)
+{
+	__clear_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
+}
+
 static inline bool i915_gem_context_is_default(const struct i915_gem_context *c)
 {
 	return c->user_handle == DEFAULT_CONTEXT_HANDLE;
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index ff0e2b3..313fb72 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -259,6 +259,62 @@ intel_lr_context_descriptor_update(struct i915_gem_context *ctx,
 	ce->lrc_desc = desc;
 }
 
+static int emit_set_data_port_coherency(struct i915_request *rq, bool enable)
+{
+	u32 *cs;
+	i915_reg_t reg;
+
+	GEM_BUG_ON(rq->engine->class != RENDER_CLASS);
+	GEM_BUG_ON(INTEL_GEN(rq->i915) < 9);
+
+	cs = intel_ring_begin(rq, 4);
+	if (IS_ERR(cs))
+		return PTR_ERR(cs);
+
+	if (INTEL_GEN(rq->i915) >= 11)
+		reg = ICL_HDC_MODE;
+	else if (INTEL_GEN(rq->i915) >= 10)
+		reg = CNL_HDC_CHICKEN0;
+	else
+		reg = HDC_CHICKEN0;
+
+	*cs++ = MI_LOAD_REGISTER_IMM(1);
+	*cs++ = i915_mmio_reg_offset(reg);
+	if (enable)
+		*cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
+	else
+		*cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
+	*cs++ = MI_NOOP;
+
+	intel_ring_advance(rq, cs);
+
+	return 0;
+}
+
+static int
+intel_lr_context_update_data_port_coherency(struct i915_request *rq)
+{
+	struct i915_gem_context *ctx = rq->gem_context;
+	bool enable = test_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
+	int ret;
+
+	lockdep_assert_held(&rq->i915->drm.struct_mutex);
+
+	if (test_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags) == enable)
+		return 0;
+
+	ret = emit_set_data_port_coherency(rq, enable);
+
+	if (!ret) {
+		if (enable)
+			__set_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags);
+		else
+			__clear_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags);
+	}
+
+	return ret;
+}
+
 static void unwind_wa_tail(struct i915_request *rq)
 {
 	rq->tail = intel_ring_wrap(rq->ring, rq->wa_tail - WA_TAIL_BYTES);
@@ -1965,7 +2021,7 @@ static int gen8_emit_flush_render(struct i915_request *request,
 		i915_ggtt_offset(engine->scratch) + 2 * CACHELINE_BYTES;
 	bool vf_flush_wa = false, dc_flush_wa = false;
 	u32 *cs, flags = 0;
-	int len;
+	int err, len;
 
 	flags |= PIPE_CONTROL_CS_STALL;
 
@@ -1996,6 +2052,12 @@ static int gen8_emit_flush_render(struct i915_request *request,
 		/* WaForGAMHang:kbl */
 		if (IS_KBL_REVID(request->i915, 0, KBL_REVID_B0))
 			dc_flush_wa = true;
+
+		err = intel_lr_context_update_data_port_coherency(request);
+		if (GEM_WARN_ON(err)) {
+			DRM_DEBUG("Data Port Coherency toggle failed.\n");
+			return err;
+		}
 	}
 
 	len = 6;
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index 298b2e1..8f8211b 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -1486,6 +1486,16 @@ struct drm_i915_gem_context_param {
 #define   I915_CONTEXT_MAX_USER_PRIORITY	1023 /* inclusive */
 #define   I915_CONTEXT_DEFAULT_PRIORITY		0
 #define   I915_CONTEXT_MIN_USER_PRIORITY	-1023 /* inclusive */
+/*
+ * When data port level coherency is enabled, the GPU and CPU will both keep
+ * changes to memory content visible to each other as fast as possible, by
+ * forcing internal cache units to send memory writes to higher level caches
+ * immediatelly after writes. Only buffers with coherency requested within
+ * surface state, or specific stateless accesses will be affected by this
+ * option. Keeping data port coherency has a performance cost, and therefore
+ * it is by default disabled (see WaForceEnableNonCoherent).
+ */
+#define I915_CONTEXT_PARAM_DATA_PORT_COHERENCY	0x7
 	__u64 value;
 };
 
-- 
2.7.4
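
For readers unfamiliar with masked registers: the payload written by
emit_set_data_port_coherency() in the patch above uses the upper 16 bits as a
write-enable mask for the lower 16 bits. Below is a minimal standalone sketch
of that convention, not driver code. The DEMO_* names are hypothetical
stand-ins for the i915 helpers, and the bit position of the force-non-coherent
flag is assumed for illustration only.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the i915 masked-register helpers. */
#define DEMO_MASKED_FIELD(mask, value) \
	(((uint32_t)(mask) << 16) | (uint32_t)(value))
#define DEMO_MASKED_BIT_ENABLE(a)	DEMO_MASKED_FIELD((a), (a))
#define DEMO_MASKED_BIT_DISABLE(a)	DEMO_MASKED_FIELD((a), 0)

/* Assumed bit position, for illustration only. */
#define DEMO_FORCE_NON_COHERENT		(1u << 4)

/*
 * Model of a masked write: only the low bits whose mask bit (upper 16 bits
 * of the payload) is set are changed; all other bits keep their value.
 */
static uint32_t demo_apply_masked_write(uint32_t reg, uint32_t payload)
{
	uint32_t mask = payload >> 16;
	uint32_t value = payload & 0xffffu;

	return (reg & ~mask) | (value & mask);
}

/*
 * The sense is inverted: enabling coherency means clearing the
 * "force non-coherent" bit, disabling coherency means setting it.
 */
static uint32_t demo_coherency_payload(int enable)
{
	return enable ? DEMO_MASKED_BIT_DISABLE(DEMO_FORCE_NON_COHERENT)
		      : DEMO_MASKED_BIT_ENABLE(DEMO_FORCE_NON_COHERENT);
}
```

With these definitions, a single immediate write can flip one bit without a
read-modify-write cycle, which is presumably why the patch needs only one
MI_LOAD_REGISTER_IMM per toggle.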

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* ✗ Fi.CI.CHECKPATCH: warning for drm/i915: Add Exec param to control data port coherency. (rev7)
  2018-03-19 12:37 [RFC v1] Data port coherency control for UMDs Tomasz Lis
                   ` (20 preceding siblings ...)
  2018-07-12 15:34 ` ✓ Fi.CI.BAT: success " Patchwork
@ 2018-10-09 18:27 ` Patchwork
  2018-10-09 18:28 ` ✗ Fi.CI.SPARSE: " Patchwork
                   ` (6 subsequent siblings)
  28 siblings, 0 replies; 81+ messages in thread
From: Patchwork @ 2018-10-09 18:27 UTC (permalink / raw)
  To: Tomasz Lis; +Cc: intel-gfx

== Series Details ==

Series: drm/i915: Add Exec param to control data port coherency. (rev7)
URL   : https://patchwork.freedesktop.org/series/40181/
State : warning

== Summary ==

$ dim checkpatch origin/drm-tip
730c3b0457ef drm/i915: Add IOCTL Param to control data port coherency.
-:15: WARNING:COMMIT_LOG_LONG_LINE: Possible unwrapped commit description (prefer a maximum 75 chars per line)
#15: 
coherency at data port level. Keeping the coherency at that level is disabled

-:351: WARNING:TYPO_SPELLING: 'immediatelly' may be misspelled - perhaps 'immediately'?
#351: FILE: include/uapi/drm/i915_drm.h:1493:
+ * immediatelly after writes. Only buffers with coherency requested within

total: 0 errors, 2 warnings, 0 checks, 200 lines checked


* ✗ Fi.CI.SPARSE: warning for drm/i915: Add Exec param to control data port coherency. (rev7)
  2018-03-19 12:37 [RFC v1] Data port coherency control for UMDs Tomasz Lis
                   ` (21 preceding siblings ...)
  2018-10-09 18:27 ` ✗ Fi.CI.CHECKPATCH: warning for drm/i915: Add Exec param to control data port coherency. (rev7) Patchwork
@ 2018-10-09 18:28 ` Patchwork
  2018-10-09 18:52 ` ✓ Fi.CI.BAT: success " Patchwork
                   ` (5 subsequent siblings)
  28 siblings, 0 replies; 81+ messages in thread
From: Patchwork @ 2018-10-09 18:28 UTC (permalink / raw)
  To: Tomasz Lis; +Cc: intel-gfx

== Series Details ==

Series: drm/i915: Add Exec param to control data port coherency. (rev7)
URL   : https://patchwork.freedesktop.org/series/40181/
State : warning

== Summary ==

$ dim sparse origin/drm-tip
Sparse version: v0.5.2
Commit: drm/i915: Add IOCTL Param to control data port coherency.
-drivers/gpu/drm/i915/selftests/../i915_drv.h:3727:16: warning: expression using sizeof(void)
+drivers/gpu/drm/i915/selftests/../i915_drv.h:3728:16: warning: expression using sizeof(void)


* ✓ Fi.CI.BAT: success for drm/i915: Add Exec param to control data port coherency. (rev7)
  2018-03-19 12:37 [RFC v1] Data port coherency control for UMDs Tomasz Lis
                   ` (22 preceding siblings ...)
  2018-10-09 18:28 ` ✗ Fi.CI.SPARSE: " Patchwork
@ 2018-10-09 18:52 ` Patchwork
  2018-10-09 21:44 ` ✗ Fi.CI.IGT: failure " Patchwork
                   ` (4 subsequent siblings)
  28 siblings, 0 replies; 81+ messages in thread
From: Patchwork @ 2018-10-09 18:52 UTC (permalink / raw)
  To: Tomasz Lis; +Cc: intel-gfx

== Series Details ==

Series: drm/i915: Add Exec param to control data port coherency. (rev7)
URL   : https://patchwork.freedesktop.org/series/40181/
State : success

== Summary ==

= CI Bug Log - changes from CI_DRM_4955 -> Patchwork_10401 =

== Summary - SUCCESS ==

  No regressions found.

  External URL: https://patchwork.freedesktop.org/api/1.0/series/40181/revisions/7/mbox/

== Known issues ==

  Here are the changes found in Patchwork_10401 that come from known issues:

  === IGT changes ===

    ==== Issues hit ====

    igt@kms_frontbuffer_tracking@basic:
      fi-byt-clapper:     PASS -> FAIL (fdo#103167)

    igt@kms_pipe_crc_basic@suspend-read-crc-pipe-b:
      fi-blb-e6850:       PASS -> INCOMPLETE (fdo#107718)

    
    ==== Possible fixes ====

    igt@kms_pipe_crc_basic@read-crc-pipe-b:
      fi-byt-clapper:     FAIL (fdo#107362) -> PASS

    igt@kms_pipe_crc_basic@suspend-read-crc-pipe-a:
      fi-gdg-551:         INCOMPLETE -> PASS

    
  fdo#103167 https://bugs.freedesktop.org/show_bug.cgi?id=103167
  fdo#107362 https://bugs.freedesktop.org/show_bug.cgi?id=107362
  fdo#107718 https://bugs.freedesktop.org/show_bug.cgi?id=107718


== Participating hosts (47 -> 41) ==

  Additional (1): fi-skl-guc 
  Missing    (7): fi-ilk-m540 fi-hsw-4200u fi-byt-squawks fi-icl-u2 fi-bsw-cyan fi-snb-2520m fi-ctg-p8600 


== Build changes ==

    * Linux: CI_DRM_4955 -> Patchwork_10401

  CI_DRM_4955: 86b783f703497b30c7417d977c0a557d0cc40e40 @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_4672: 4497591d2572831a9f07fd9e48a2571bfcffe354 @ git://anongit.freedesktop.org/xorg/app/intel-gpu-tools
  Patchwork_10401: 730c3b0457ef0f21136d0866fc96e9a83cd41a10 @ git://anongit.freedesktop.org/gfx-ci/linux


== Linux commits ==

730c3b0457ef drm/i915: Add IOCTL Param to control data port coherency.

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_10401/issues.html

* ✗ Fi.CI.IGT: failure for drm/i915: Add Exec param to control data port coherency. (rev7)
  2018-03-19 12:37 [RFC v1] Data port coherency control for UMDs Tomasz Lis
                   ` (23 preceding siblings ...)
  2018-10-09 18:52 ` ✓ Fi.CI.BAT: success " Patchwork
@ 2018-10-09 21:44 ` Patchwork
  2018-10-12 15:14 ` ✗ Fi.CI.CHECKPATCH: warning for drm/i915: Add Exec param to control data port coherency. (rev8) Patchwork
                   ` (3 subsequent siblings)
  28 siblings, 0 replies; 81+ messages in thread
From: Patchwork @ 2018-10-09 21:44 UTC (permalink / raw)
  To: Tomasz Lis; +Cc: intel-gfx

== Series Details ==

Series: drm/i915: Add Exec param to control data port coherency. (rev7)
URL   : https://patchwork.freedesktop.org/series/40181/
State : failure

== Summary ==

= CI Bug Log - changes from CI_DRM_4955_full -> Patchwork_10401_full =

== Summary - FAILURE ==

  Serious unknown changes coming with Patchwork_10401_full absolutely need to be
  verified manually.
  
  If you think the reported changes have nothing to do with the changes
  introduced in Patchwork_10401_full, please notify your bug team to allow them
  to document this new failure mode, which will reduce false positives in CI.

  

== Possible new issues ==

  Here are the unknown changes that may have been introduced in Patchwork_10401_full:

  === IGT changes ===

    ==== Possible regressions ====

    igt@gem_ctx_param@invalid-param-get:
      shard-skl:          PASS -> FAIL
      shard-apl:          PASS -> FAIL +1
      shard-glk:          PASS -> FAIL +1

    igt@gem_ctx_param@invalid-param-set:
      shard-skl:          NOTRUN -> FAIL
      shard-kbl:          PASS -> FAIL +1
      shard-hsw:          PASS -> FAIL +1
      shard-snb:          PASS -> FAIL +1

    
    ==== Warnings ====

    igt@pm_rc6_residency@rc6-accuracy:
      shard-kbl:          PASS -> SKIP

    
== Known issues ==

  Here are the changes found in Patchwork_10401_full that come from known issues:

  === IGT changes ===

    ==== Issues hit ====

    igt@gem_exec_await@wide-contexts:
      shard-apl:          PASS -> FAIL (fdo#106680)

    igt@gem_exec_schedule@pi-ringfull-bsd:
      shard-skl:          NOTRUN -> FAIL (fdo#103158)

    igt@kms_busy@extended-modeset-hang-newfb-render-b:
      shard-hsw:          PASS -> DMESG-WARN (fdo#107956)

    igt@kms_cursor_crc@cursor-128x128-suspend:
      shard-apl:          PASS -> INCOMPLETE (fdo#103927)

    igt@kms_cursor_crc@cursor-64x21-onscreen:
      shard-glk:          PASS -> FAIL (fdo#103232)

    igt@kms_frontbuffer_tracking@fbc-1p-primscrn-spr-indfb-onoff:
      shard-glk:          PASS -> FAIL (fdo#103167) +3

    igt@kms_frontbuffer_tracking@fbcpsr-stridechange:
      shard-skl:          NOTRUN -> FAIL (fdo#105683)

    {igt@kms_plane_alpha_blend@pipe-b-constant-alpha-max}:
      shard-skl:          NOTRUN -> FAIL (fdo#108145)

    igt@kms_plane_multiple@atomic-pipe-c-tiling-y:
      shard-apl:          PASS -> FAIL (fdo#103166)

    igt@kms_setmode@basic:
      shard-apl:          PASS -> FAIL (fdo#99912)
      shard-kbl:          PASS -> FAIL (fdo#99912)

    
    ==== Possible fixes ====

    igt@drv_suspend@shrink:
      shard-glk:          INCOMPLETE (fdo#103359, fdo#106886, k.org#198133) -> PASS

    igt@gem_userptr_blits@readonly-unsync:
      shard-skl:          INCOMPLETE (fdo#108074) -> PASS

    igt@kms_cursor_legacy@cursora-vs-flipa-toggle:
      shard-glk:          DMESG-WARN (fdo#105763, fdo#106538) -> PASS

    igt@kms_flip@dpms-vs-vblank-race-interruptible:
      shard-kbl:          FAIL (fdo#103060) -> PASS

    igt@kms_flip@flip-vs-expired-vblank-interruptible:
      shard-kbl:          FAIL (fdo#102887, fdo#105363) -> PASS

    igt@kms_frontbuffer_tracking@fbc-1p-primscrn-spr-indfb-draw-mmap-cpu:
      shard-apl:          FAIL (fdo#103167) -> PASS +2

    igt@kms_frontbuffer_tracking@fbc-1p-primscrn-spr-indfb-draw-pwrite:
      shard-glk:          FAIL (fdo#103167) -> PASS +1

    igt@kms_frontbuffer_tracking@fbcpsr-1p-primscrn-pri-indfb-draw-render:
      shard-skl:          FAIL (fdo#103167) -> PASS +1

    igt@kms_plane@plane-position-covered-pipe-a-planes:
      shard-apl:          FAIL (fdo#103166) -> PASS

    igt@perf_pmu@rc6-runtime-pm-long:
      shard-kbl:          FAIL (fdo#105010) -> PASS

    
  {name}: This element is suppressed. This means it is ignored when computing
          the status of the difference (SUCCESS, WARNING, or FAILURE).

  fdo#102887 https://bugs.freedesktop.org/show_bug.cgi?id=102887
  fdo#103060 https://bugs.freedesktop.org/show_bug.cgi?id=103060
  fdo#103158 https://bugs.freedesktop.org/show_bug.cgi?id=103158
  fdo#103166 https://bugs.freedesktop.org/show_bug.cgi?id=103166
  fdo#103167 https://bugs.freedesktop.org/show_bug.cgi?id=103167
  fdo#103232 https://bugs.freedesktop.org/show_bug.cgi?id=103232
  fdo#103359 https://bugs.freedesktop.org/show_bug.cgi?id=103359
  fdo#103927 https://bugs.freedesktop.org/show_bug.cgi?id=103927
  fdo#105010 https://bugs.freedesktop.org/show_bug.cgi?id=105010
  fdo#105363 https://bugs.freedesktop.org/show_bug.cgi?id=105363
  fdo#105683 https://bugs.freedesktop.org/show_bug.cgi?id=105683
  fdo#105763 https://bugs.freedesktop.org/show_bug.cgi?id=105763
  fdo#106538 https://bugs.freedesktop.org/show_bug.cgi?id=106538
  fdo#106680 https://bugs.freedesktop.org/show_bug.cgi?id=106680
  fdo#106886 https://bugs.freedesktop.org/show_bug.cgi?id=106886
  fdo#107956 https://bugs.freedesktop.org/show_bug.cgi?id=107956
  fdo#108074 https://bugs.freedesktop.org/show_bug.cgi?id=108074
  fdo#108145 https://bugs.freedesktop.org/show_bug.cgi?id=108145
  fdo#99912 https://bugs.freedesktop.org/show_bug.cgi?id=99912
  k.org#198133 https://bugzilla.kernel.org/show_bug.cgi?id=198133


== Participating hosts (6 -> 6) ==

  No changes in participating hosts


== Build changes ==

    * Linux: CI_DRM_4955 -> Patchwork_10401

  CI_DRM_4955: 86b783f703497b30c7417d977c0a557d0cc40e40 @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_4672: 4497591d2572831a9f07fd9e48a2571bfcffe354 @ git://anongit.freedesktop.org/xorg/app/intel-gpu-tools
  Patchwork_10401: 730c3b0457ef0f21136d0866fc96e9a83cd41a10 @ git://anongit.freedesktop.org/gfx-ci/linux
  piglit_4509: fdc5a4ca11124ab8413c7988896eec4c97336694 @ git://anongit.freedesktop.org/piglit

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_10401/shards.html

* Re: [PATCH v6] drm/i915: Add IOCTL Param to control data port coherency.
  2018-10-09 18:06   ` [PATCH v6] " Tomasz Lis
@ 2018-10-10  7:29     ` Tvrtko Ursulin
  0 siblings, 0 replies; 81+ messages in thread
From: Tvrtko Ursulin @ 2018-10-10  7:29 UTC (permalink / raw)
  To: Tomasz Lis, intel-gfx; +Cc: bartosz.dunajski


On 09/10/2018 19:06, Tomasz Lis wrote:
> The patch adds a parameter to control the data port coherency functionality
> on a per-context level. When the IOCTL is called, a command to switch data
> port coherency state is added to the ordered list. All prior requests are
> executed on old coherency settings, and all exec requests after the IOCTL
> will use new settings.
> 
> Rationale:
> 
> The OpenCL driver develpers requested a functionality to control cache
> coherency at data port level. Keeping the coherency at that level is disabled
> by default due to its performance costs. OpenCL driver is planning to
> enable it for a small subset of submissions, when such functionality is
> required. Below are answers to basic question explaining background
> of the functionality and reasoning for the proposed implementation:
> 
> 1. Why do we need a coherency enable/disable switch for memory that is shared
> between CPU and GEN (GPU)?
> 
> Memory coherency between CPU and GEN, while being a great feature that enables
> CL_MEM_SVM_FINE_GRAIN_BUFFER OCL capability on Intel GEN architecture, adds
> overhead related to tracking (snooping) memory inside different cache units
> (L1$, L2$, L3$, LLC$, etc.). At the same time, minority of modern OCL
> applications actually use CL_MEM_SVM_FINE_GRAIN_BUFFER (and hence require
> memory coherency between CPU and GPU). The goal of coherency enable/disable
> switch is to remove overhead of memory coherency when memory coherency is not
> needed.
> 
> 2. Why do we need a global coherency switch?
> 
> In order to support I/O commands from within EUs (Execution Units), Intel GEN
> ISA (GEN Instruction Set Assembly) contains dedicated "send" instructions.
> These send instructions provide several addressing models. One of these
> addressing models (named "stateless") provides most flexible I/O using plain
> virtual addresses (as opposed to buffer_handle+offset models). This "stateless"
> model is similar to regular memory load/store operations available on typical
> CPUs. Since this model provides I/O using arbitrary virtual addresses, it
> enables algorithmic designs that are based on pointer-to-pointer (e.g. buffer
> of pointers) concepts. For instance, it allows creating tree-like data
> structures such as:
>                     ________________
>                    |      NODE1     |
>                    | uint64_t data  |
>                    +----------------|
>                    | NODE*  |  NODE*|
>                    +--------+-------+
>                      /              \
>     ________________/                \________________
>    |      NODE2     |                |      NODE3     |
>    | uint64_t data  |                | uint64_t data  |
>    +----------------|                +----------------|
>    | NODE*  |  NODE*|                | NODE*  |  NODE*|
>    +--------+-------+                +--------+-------+
> 
> Please note that pointers inside such structures can point to memory locations
> in different OCL allocations  - e.g. NODE1 and NODE2 can reside in one OCL
> allocation while NODE3 resides in a completely separate OCL allocation.
> Additionally, such pointers can be shared with CPU (i.e. using SVM - Shared
> Virtual Memory feature). Using pointers from different allocations doesn't
> affect the stateless addressing model which even allows scattered reading from
> different allocations at the same time (i.e. by utilizing SIMD-nature of send
> instructions).
> 
> When it comes to coherency programming, send instructions in stateless model
> can be encoded (at ISA level) to either use or disable coherency. However, for
> generic OCL applications (such as example with tree-like data structure), OCL
> compiler is not able to determine origin of memory pointed to by an arbitrary
> pointer - i.e. is not able to track given pointer back to a specific
> allocation. As such, it's not able to decide whether coherency is needed or not
> for specific pointer (or for specific I/O instruction). As a result, compiler
> encodes all stateless sends as coherent (doing otherwise would lead to
> functional issues resulting from data corruption). Please note that it would be
> possible to workaround this (e.g. based on allocations map and pointer bounds
> checking prior to each I/O instruction) but the performance cost of such
> workaround would be many times greater than the cost of keeping coherency
> always enabled. As such, enabling/disabling memory coherency at GEN ISA level
> is not feasible and alternative method is needed.
> 
> Such alternative solution is to have a global coherency switch that allows
> disabling coherency for single (though entire) GPU submission. This is
> beneficial because this way we:
> * can enable (and pay for) coherency only in submissions that actually need
> coherency (submissions that use CL_MEM_SVM_FINE_GRAIN_BUFFER resources)
> * don't care about coherency at GEN ISA granularity (no performance impact)
> 
> 3. Will coherency switch be used frequently?
> 
> There are scenarios that will require frequent toggling of the coherency
> switch.
> E.g. an application has two OCL compute kernels: kern_master and kern_worker.
> kern_master uses, concurrently with CPU, some fine grain SVM resources
> (CL_MEM_SVM_FINE_GRAIN_BUFFER). These resources contain descriptors of
> computational work that needs to be executed. kern_master analyzes incoming
> work descriptors and populates a plain OCL buffer (non-fine-grain) with payload
> for kern_worker. Once kern_master is done, kern_worker kicks-in and processes
> the payload that kern_master produced. These two kernels work in a loop, one
> after another. Since only kern_master requires coherency, kern_worker should
> not be forced to pay for it. This means that we need to have the ability to
> toggle coherency switch on or off per each GPU submission:
> (ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> (ENABLE
> COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> ...
> 
> v2: Fixed compilation warning.
> v3: Refactored the patch to add IOCTL instead of exec flag.
> v4: Renamed and documented the API flag. Used strict values.
>      Removed redundant GEM_WARN_ON()s. Improved to coding standard.
>      Introduced a macro for checking whether hardware supports the feature.
> v5: Renamed some locals. Made the flag write to be lazy.
>      Updated comments to remove misconceptions. Added gen11 support.
> v6: Moved the flag write to gen8_enit_flush_render(). Renamed some functions.
>      Moved all flags checking to one place. Added mutex check.
> v7: Removed 2 comments, improved API comment. (Joonas)
> 
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Michal Winiarski <michal.winiarski@intel.com>
> 
> Bspec: 11419
> Bspec: 19175
> Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
> ---
>   drivers/gpu/drm/i915/i915_drv.h         |  1 +
>   drivers/gpu/drm/i915/i915_gem_context.c | 29 ++++++++++++---
>   drivers/gpu/drm/i915/i915_gem_context.h | 17 +++++++++
>   drivers/gpu/drm/i915/intel_lrc.c        | 64 ++++++++++++++++++++++++++++++++-
>   include/uapi/drm/i915_drm.h             | 10 ++++++
>   5 files changed, 116 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 794a8a0..e1ea5cb 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -2588,6 +2588,7 @@ intel_info(const struct drm_i915_private *dev_priv)
>   #define HAS_EDRAM(dev_priv)	(!!((dev_priv)->edram_cap & EDRAM_ENABLED))
>   #define HAS_WT(dev_priv)	((IS_HASWELL(dev_priv) || \
>   				 IS_BROADWELL(dev_priv)) && HAS_EDRAM(dev_priv))
> +#define HAS_DATA_PORT_COHERENCY(dev_priv)	(INTEL_GEN(dev_priv) >= 9)
>   
>   #define HWS_NEEDS_PHYSICAL(dev_priv)	((dev_priv)->info.hws_needs_physical)
>   
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> index 8cbe580..718ede9 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> @@ -847,6 +847,7 @@ int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
>   int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
>   				    struct drm_file *file)
>   {
> +	struct drm_i915_private *i915 = to_i915(dev);
>   	struct drm_i915_file_private *file_priv = file->driver_priv;
>   	struct drm_i915_gem_context_param *args = data;
>   	struct i915_gem_context *ctx;
> @@ -867,10 +868,10 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
>   	case I915_CONTEXT_PARAM_GTT_SIZE:
>   		if (ctx->ppgtt)
>   			args->value = ctx->ppgtt->vm.total;
> -		else if (to_i915(dev)->mm.aliasing_ppgtt)
> -			args->value = to_i915(dev)->mm.aliasing_ppgtt->vm.total;
> +		else if (i915->mm.aliasing_ppgtt)
> +			args->value = i915->mm.aliasing_ppgtt->vm.total;
>   		else
> -			args->value = to_i915(dev)->ggtt.vm.total;
> +			args->value = i915->ggtt.vm.total;
>   		break;
>   	case I915_CONTEXT_PARAM_NO_ERROR_CAPTURE:
>   		args->value = i915_gem_context_no_error_capture(ctx);
> @@ -881,6 +882,12 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
>   	case I915_CONTEXT_PARAM_PRIORITY:
>   		args->value = ctx->sched.priority >> I915_USER_PRIORITY_SHIFT;
>   		break;
> +	case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
> +		if (!HAS_DATA_PORT_COHERENCY(i915))
> +			ret = -ENODEV;
> +		else
> +			args->value = i915_gem_context_is_data_port_coherent(ctx);
> +		break;
>   	default:
>   		ret = -EINVAL;
>   		break;
> @@ -893,6 +900,7 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
>   int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
>   				    struct drm_file *file)
>   {
> +	struct drm_i915_private *i915 = to_i915(dev);
>   	struct drm_i915_file_private *file_priv = file->driver_priv;
>   	struct drm_i915_gem_context_param *args = data;
>   	struct i915_gem_context *ctx;
> @@ -939,7 +947,7 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
>   
>   			if (args->size)
>   				ret = -EINVAL;
> -			else if (!(to_i915(dev)->caps.scheduler & I915_SCHEDULER_CAP_PRIORITY))
> +			else if (!(i915->caps.scheduler & I915_SCHEDULER_CAP_PRIORITY))
>   				ret = -ENODEV;
>   			else if (priority > I915_CONTEXT_MAX_USER_PRIORITY ||
>   				 priority < I915_CONTEXT_MIN_USER_PRIORITY)
> @@ -953,6 +961,19 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
>   		}
>   		break;
>   
> +	case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
> +		if (args->size)
> +			ret = -EINVAL;
> +		else if (!HAS_DATA_PORT_COHERENCY(i915))
> +			ret = -ENODEV;
> +		else if (args->value == 1)
> +			i915_gem_context_set_data_port_coherent(ctx);
> +		else if (args->value == 0)
> +			i915_gem_context_clear_data_port_coherent(ctx);
> +		else
> +			ret = -EINVAL;
> +		break;
> +
>   	default:
>   		ret = -EINVAL;
>   		break;
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.h b/drivers/gpu/drm/i915/i915_gem_context.h
> index f6d870b..55969bc 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.h
> +++ b/drivers/gpu/drm/i915/i915_gem_context.h
> @@ -131,6 +131,8 @@ struct i915_gem_context {
>   #define CONTEXT_BANNED			0
>   #define CONTEXT_CLOSED			1
>   #define CONTEXT_FORCE_SINGLE_SUBMISSION	2
> +#define CONTEXT_DATA_PORT_COHERENT_REQUESTED	6
> +#define CONTEXT_DATA_PORT_COHERENT_ACTIVE	7
>   
>   	/**
>   	 * @hw_id: - unique identifier for the context
> @@ -283,6 +285,21 @@ static inline void i915_gem_context_unpin_hw_id(struct i915_gem_context *ctx)
>   	atomic_dec(&ctx->hw_id_pin_count);
>   }
>   
> +static inline bool i915_gem_context_is_data_port_coherent(struct i915_gem_context *ctx)
> +{
> +	return test_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
> +}
> +
> +static inline void i915_gem_context_set_data_port_coherent(struct i915_gem_context *ctx)
> +{
> +	__set_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
> +}
> +
> +static inline void i915_gem_context_clear_data_port_coherent(struct i915_gem_context *ctx)
> +{
> +	__clear_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
> +}
> +
>   static inline bool i915_gem_context_is_default(const struct i915_gem_context *c)
>   {
>   	return c->user_handle == DEFAULT_CONTEXT_HANDLE;
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index ff0e2b3..313fb72 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -259,6 +259,62 @@ intel_lr_context_descriptor_update(struct i915_gem_context *ctx,
>   	ce->lrc_desc = desc;
>   }
>   
> +static int emit_set_data_port_coherency(struct i915_request *rq, bool enable)
> +{
> +	u32 *cs;
> +	i915_reg_t reg;
> +
> +	GEM_BUG_ON(rq->engine->class != RENDER_CLASS);
> +	GEM_BUG_ON(INTEL_GEN(rq->i915) < 9);
> +
> +	cs = intel_ring_begin(rq, 4);
> +	if (IS_ERR(cs))
> +		return PTR_ERR(cs);
> +
> +	if (INTEL_GEN(rq->i915) >= 11)
> +		reg = ICL_HDC_MODE;
> +	else if (INTEL_GEN(rq->i915) >= 10)
> +		reg = CNL_HDC_CHICKEN0;
> +	else
> +		reg = HDC_CHICKEN0;
> +
> +	*cs++ = MI_LOAD_REGISTER_IMM(1);
> +	*cs++ = i915_mmio_reg_offset(reg);
> +	if (enable)
> +		*cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
> +	else
> +		*cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
> +	*cs++ = MI_NOOP;
> +
> +	intel_ring_advance(rq, cs);
> +
> +	return 0;
> +}
> +
> +static int
> +intel_lr_context_update_data_port_coherency(struct i915_request *rq)
> +{
> +	struct i915_gem_context *ctx = rq->gem_context;
> +	bool enable = test_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
> +	int ret;
> +
> +	lockdep_assert_held(&rq->i915->drm.struct_mutex);
> +
> +	if (test_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags) == enable)
> +		return 0;
> +
> +	ret = emit_set_data_port_coherency(rq, enable);
> +
> +	if (!ret) {
> +		if (enable)
> +			__set_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags);
> +		else
> +			__clear_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags);
> +	}
> +
> +	return ret;
> +}
> +
>   static void unwind_wa_tail(struct i915_request *rq)
>   {
>   	rq->tail = intel_ring_wrap(rq->ring, rq->wa_tail - WA_TAIL_BYTES);
> @@ -1965,7 +2021,7 @@ static int gen8_emit_flush_render(struct i915_request *request,
>   		i915_ggtt_offset(engine->scratch) + 2 * CACHELINE_BYTES;
>   	bool vf_flush_wa = false, dc_flush_wa = false;
>   	u32 *cs, flags = 0;
> -	int len;
> +	int err, len;
>   
>   	flags |= PIPE_CONTROL_CS_STALL;
>   
> @@ -1996,6 +2052,12 @@ static int gen8_emit_flush_render(struct i915_request *request,
>   		/* WaForGAMHang:kbl */
>   		if (IS_KBL_REVID(request->i915, 0, KBL_REVID_B0))
>   			dc_flush_wa = true;
> +
> +		err = intel_lr_context_update_data_port_coherency(request);
> +		if (GEM_WARN_ON(err)) {

Awooga awooga! ((tm) by Chris) :))

Please someone review and ack my patch which makes GEM_WARN_ON safe.
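
A minimal standalone sketch of the hazard, assuming (this is not stated in
this message) that in non-debug builds GEM_WARN_ON reduces to a compile-time
check that always yields 0. The DEMO_* macros and functions are hypothetical
stand-ins, not the real i915 definitions.

```c
#include <errno.h>

/*
 * Hypothetical non-debug variant: the expression is only type-checked
 * through sizeof, never evaluated, and the macro always yields 0.
 */
#define DEMO_WARN_ON_NODEBUG(expr)	((void)sizeof((long)(expr)), 0)

/* Hypothetical debug variant: yields the truth value of the expression. */
#define DEMO_WARN_ON_DEBUG(expr)	(!!(expr))

/* Error handling wrapped in the non-debug macro is silently skipped. */
static int demo_flush_nodebug(int err)
{
	if (DEMO_WARN_ON_NODEBUG(err))
		return err;	/* never reached */
	return 0;
}

/* The same code behaves as intended with the debug variant. */
static int demo_flush_debug(int err)
{
	if (DEMO_WARN_ON_DEBUG(err))
		return err;
	return 0;
}
```

Under that assumption, demo_flush_nodebug(-EIO) returns 0 while
demo_flush_debug(-EIO) returns -EIO, so an error path guarded this way would
differ between build configurations.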

Regards,

Tvrtko

> +			DRM_DEBUG("Data Port Coherency toggle failed.\n");
> +			return err;
> +		}
>   	}
>   
>   	len = 6;
> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> index 298b2e1..8f8211b 100644
> --- a/include/uapi/drm/i915_drm.h
> +++ b/include/uapi/drm/i915_drm.h
> @@ -1486,6 +1486,16 @@ struct drm_i915_gem_context_param {
>   #define   I915_CONTEXT_MAX_USER_PRIORITY	1023 /* inclusive */
>   #define   I915_CONTEXT_DEFAULT_PRIORITY		0
>   #define   I915_CONTEXT_MIN_USER_PRIORITY	-1023 /* inclusive */
> +/*
> + * When data port level coherency is enabled, the GPU and CPU will both keep
> + * changes to memory content visible to each other as fast as possible, by
> + * forcing internal cache units to send memory writes to higher level caches
> + * immediatelly after writes. Only buffers with coherency requested within
> + * surface state, or specific stateless accesses will be affected by this
> + * option. Keeping data port coherency has a performance cost, and therefore
> + * it is by default disabled (see WaForceEnableNonCoherent).
> + */
> +#define I915_CONTEXT_PARAM_DATA_PORT_COHERENCY	0x7
>   	__u64 value;
>   };
>   
> 

* [PATCH v8] drm/i915: Add IOCTL Param to control data port coherency.
  2018-03-19 12:37 ` [RFC v1] drm/i915: Add Exec param to control data port coherency Tomasz Lis
                     ` (6 preceding siblings ...)
  2018-10-09 18:06   ` [PATCH v6] " Tomasz Lis
@ 2018-10-12 15:02   ` Tomasz Lis
  2018-10-15 12:52     ` Tvrtko Ursulin
  2018-10-16 13:59     ` Joonas Lahtinen
  7 siblings, 2 replies; 81+ messages in thread
From: Tomasz Lis @ 2018-10-12 15:02 UTC (permalink / raw)
  To: intel-gfx; +Cc: bartosz.dunajski

The patch adds a parameter to control the data port coherency functionality
on a per-context level. When the IOCTL is called, a command to switch data
port coherency state is added to the ordered list. All prior requests are
executed on old coherency settings, and all exec requests after the IOCTL
will use new settings.
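
For illustration only, a minimal sketch of how a userspace driver might
prepare the argument for the context setparam IOCTL described above. The
struct is a local stand-in that mirrors the layout of
struct drm_i915_gem_context_param so the snippet builds without kernel
headers. fill_coherency_param() and all *_sketch names are hypothetical, and
0x7 is the value proposed in this patch, which is an assumption about any
kernel that does not carry it.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of struct drm_i915_gem_context_param (assumed layout). */
struct ctx_param_sketch {
	uint32_t ctx_id;
	uint32_t size;
	uint64_t param;
	uint64_t value;
};

/* Parameter number proposed by this patch; hypothetical elsewhere. */
#define PARAM_DATA_PORT_COHERENCY_SKETCH 0x7

/*
 * Fill the setparam argument for one context.
 * The patch accepts strictly 0 or 1 as the value and requires size == 0,
 * so anything else is rejected here before it would reach the kernel.
 */
int fill_coherency_param(struct ctx_param_sketch *p, uint32_t ctx_id,
			 uint64_t enable)
{
	assert(p != NULL);

	if (enable > 1)
		return -EINVAL;

	memset(p, 0, sizeof(*p));
	p->ctx_id = ctx_id;
	p->size = 0;
	p->param = PARAM_DATA_PORT_COHERENCY_SKETCH;
	p->value = enable;
	return 0;
}
```

In a real driver the filled structure would be handed to
DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM on the DRM file descriptor. With this
patch applied, -ENODEV would be expected on hardware older than gen9.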

Rationale:

The OpenCL driver developers requested a functionality to control cache
coherency at data port level. Keeping the coherency at that level is disabled
by default due to its performance costs. OpenCL driver is planning to
enable it for a small subset of submissions, when such functionality is
required. Below are answers to basic questions explaining background
of the functionality and reasoning for the proposed implementation:

1. Why do we need a coherency enable/disable switch for memory that is shared
between CPU and GEN (GPU)?

Memory coherency between CPU and GEN, while being a great feature that enables
CL_MEM_SVM_FINE_GRAIN_BUFFER OCL capability on Intel GEN architecture, adds
overhead related to tracking (snooping) memory inside different cache units
(L1$, L2$, L3$, LLC$, etc.). At the same time, a minority of modern OCL
applications actually use CL_MEM_SVM_FINE_GRAIN_BUFFER (and hence require
memory coherency between CPU and GPU). The goal of coherency enable/disable
switch is to remove overhead of memory coherency when memory coherency is not
needed.

2. Why do we need a global coherency switch?

In order to support I/O commands from within EUs (Execution Units), Intel GEN
ISA (GEN Instruction Set Assembly) contains dedicated "send" instructions.
These send instructions provide several addressing models. One of these
addressing models (named "stateless") provides most flexible I/O using plain
virtual addresses (as opposed to buffer_handle+offset models). This "stateless"
model is similar to regular memory load/store operations available on typical
CPUs. Since this model provides I/O using arbitrary virtual addresses, it
enables algorithmic designs that are based on pointer-to-pointer (e.g. buffer
of pointers) concepts. For instance, it allows creating tree-like data
structures such as:
                   ________________
                  |      NODE1     |
                  | uint64_t data  |
                  +----------------|
                  | NODE*  |  NODE*|
                  +--------+-------+
                    /              \
   ________________/                \________________
  |      NODE2     |                |      NODE3     |
  | uint64_t data  |                | uint64_t data  |
  +----------------|                +----------------|
  | NODE*  |  NODE*|                | NODE*  |  NODE*|
  +--------+-------+                +--------+-------+

Please note that pointers inside such structures can point to memory locations
in different OCL allocations - e.g. NODE1 and NODE2 can reside in one OCL
allocation while NODE3 resides in a completely separate OCL allocation.
Additionally, such pointers can be shared with CPU (i.e. using SVM - Shared
Virtual Memory feature). Using pointers from different allocations doesn't
affect the stateless addressing model which even allows scattered reading from
different allocations at the same time (i.e. by utilizing SIMD-nature of send
instructions).
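
As a rough CPU-side illustration of the tree above (a sketch under stated
assumptions, not OCL kernel code): the nodes are reached purely by chasing
raw pointers, and the traversal has no way to tell which allocation a node
belongs to. The names struct node, tree_sum() and build_example() are
hypothetical, and calloc() stands in for separate OCL allocations.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct node {
	uint64_t data;
	struct node *left;
	struct node *right;
};

/* Sum all data fields by following plain pointers, allocation-agnostic. */
uint64_t tree_sum(const struct node *n)
{
	if (!n)
		return 0;
	return n->data + tree_sum(n->left) + tree_sum(n->right);
}

/*
 * Build the three-node example: NODE1 and NODE2 share one allocation,
 * NODE3 lives in a second one. Returns the root (start of the first
 * allocation) and stores the second allocation in *second_alloc.
 * The caller frees both. Returns NULL on allocation failure.
 */
struct node *build_example(struct node **second_alloc)
{
	struct node *a, *b;

	assert(second_alloc != NULL);
	*second_alloc = NULL;

	a = calloc(2, sizeof(*a));	/* allocation A: NODE1, NODE2 */
	b = calloc(1, sizeof(*b));	/* allocation B: NODE3 */
	if (!a || !b) {
		free(a);
		free(b);
		return NULL;
	}

	a[0].data = 1;
	a[1].data = 2;
	b[0].data = 3;
	a[0].left = &a[1];	/* pointer within allocation A */
	a[0].right = &b[0];	/* pointer crossing into allocation B */

	*second_alloc = b;
	return a;
}
```

tree_sum() never learns that root->right crosses an allocation boundary,
which is the same position the OCL compiler is in when it has to choose an
encoding for a stateless send.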

When it comes to coherency programming, send instructions in stateless model
can be encoded (at ISA level) to either use or disable coherency. However, for
generic OCL applications (such as example with tree-like data structure), OCL
compiler is not able to determine origin of memory pointed to by an arbitrary
pointer - i.e. is not able to track given pointer back to a specific
allocation. As such, it's not able to decide whether coherency is needed or not
for specific pointer (or for specific I/O instruction). As a result, compiler
encodes all stateless sends as coherent (doing otherwise would lead to
functional issues resulting from data corruption). Please note that it would be
possible to work around this (e.g. based on an allocation map and pointer
bounds checking prior to each I/O instruction) but the performance cost of such
a workaround would be many times greater than the cost of keeping coherency
always enabled. As such, enabling/disabling memory coherency at GEN ISA level
is not feasible and an alternative method is needed.

One such alternative is a global coherency switch that allows disabling
coherency for a single (though entire) GPU submission. This is beneficial
because this way we:
* can enable (and pay for) coherency only in submissions that actually need
coherency (submissions that use CL_MEM_SVM_FINE_GRAIN_BUFFER resources)
* don't care about coherency at GEN ISA granularity (no performance impact)

3. Will coherency switch be used frequently?

There are scenarios that will require frequent toggling of the coherency
switch.
E.g. an application has two OCL compute kernels: kern_master and kern_worker.
kern_master uses, concurrently with CPU, some fine grain SVM resources
(CL_MEM_SVM_FINE_GRAIN_BUFFER). These resources contain descriptors of
computational work that needs to be executed. kern_master analyzes incoming
work descriptors and populates a plain OCL buffer (non-fine-grain) with payload
for kern_worker. Once kern_master is done, kern_worker kicks in and processes
the payload that kern_master produced. These two kernels work in a loop, one
after another. Since only kern_master requires coherency, kern_worker should
not be forced to pay for it. This means that we need to have the ability to
toggle the coherency switch on or off for each GPU submission:
(ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY) kern_worker -> (ENABLE
COHERENCY) kern_master -> (DISABLE COHERENCY) kern_worker -> ...

v2: Fixed compilation warning.
v3: Refactored the patch to add IOCTL instead of exec flag.
v4: Renamed and documented the API flag. Used strict values.
    Removed redundant GEM_WARN_ON()s. Improved to coding standard.
    Introduced a macro for checking whether hardware supports the feature.
v5: Renamed some locals. Made the flag write to be lazy.
    Updated comments to remove misconceptions. Added gen11 support.
v6: Moved the flag write to gen8_emit_flush_render(). Renamed some functions.
    Moved all flags checking to one place. Added mutex check.
v7: Removed 2 comments, improved API comment. (Joonas)
v8: Use non-GEM WARN_ON when in statements. (Tvrtko)

Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Michal Winiarski <michal.winiarski@intel.com>

Bspec: 11419
Bspec: 19175
Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h         |  1 +
 drivers/gpu/drm/i915/i915_gem_context.c | 29 ++++++++++++---
 drivers/gpu/drm/i915/i915_gem_context.h | 17 +++++++++
 drivers/gpu/drm/i915/intel_lrc.c        | 64 ++++++++++++++++++++++++++++++++-
 include/uapi/drm/i915_drm.h             | 10 ++++++
 5 files changed, 116 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 3017ef0..90b3a0ff 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2588,6 +2588,7 @@ intel_info(const struct drm_i915_private *dev_priv)
 #define HAS_EDRAM(dev_priv)	(!!((dev_priv)->edram_cap & EDRAM_ENABLED))
 #define HAS_WT(dev_priv)	((IS_HASWELL(dev_priv) || \
 				 IS_BROADWELL(dev_priv)) && HAS_EDRAM(dev_priv))
+#define HAS_DATA_PORT_COHERENCY(dev_priv)	(INTEL_GEN(dev_priv) >= 9)
 
 #define HWS_NEEDS_PHYSICAL(dev_priv)	((dev_priv)->info.hws_needs_physical)
 
diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 8cbe580..718ede9 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -847,6 +847,7 @@ int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
 int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
 				    struct drm_file *file)
 {
+	struct drm_i915_private *i915 = to_i915(dev);
 	struct drm_i915_file_private *file_priv = file->driver_priv;
 	struct drm_i915_gem_context_param *args = data;
 	struct i915_gem_context *ctx;
@@ -867,10 +868,10 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
 	case I915_CONTEXT_PARAM_GTT_SIZE:
 		if (ctx->ppgtt)
 			args->value = ctx->ppgtt->vm.total;
-		else if (to_i915(dev)->mm.aliasing_ppgtt)
-			args->value = to_i915(dev)->mm.aliasing_ppgtt->vm.total;
+		else if (i915->mm.aliasing_ppgtt)
+			args->value = i915->mm.aliasing_ppgtt->vm.total;
 		else
-			args->value = to_i915(dev)->ggtt.vm.total;
+			args->value = i915->ggtt.vm.total;
 		break;
 	case I915_CONTEXT_PARAM_NO_ERROR_CAPTURE:
 		args->value = i915_gem_context_no_error_capture(ctx);
@@ -881,6 +882,12 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
 	case I915_CONTEXT_PARAM_PRIORITY:
 		args->value = ctx->sched.priority >> I915_USER_PRIORITY_SHIFT;
 		break;
+	case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
+		if (!HAS_DATA_PORT_COHERENCY(i915))
+			ret = -ENODEV;
+		else
+			args->value = i915_gem_context_is_data_port_coherent(ctx);
+		break;
 	default:
 		ret = -EINVAL;
 		break;
@@ -893,6 +900,7 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
 int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
 				    struct drm_file *file)
 {
+	struct drm_i915_private *i915 = to_i915(dev);
 	struct drm_i915_file_private *file_priv = file->driver_priv;
 	struct drm_i915_gem_context_param *args = data;
 	struct i915_gem_context *ctx;
@@ -939,7 +947,7 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
 
 			if (args->size)
 				ret = -EINVAL;
-			else if (!(to_i915(dev)->caps.scheduler & I915_SCHEDULER_CAP_PRIORITY))
+			else if (!(i915->caps.scheduler & I915_SCHEDULER_CAP_PRIORITY))
 				ret = -ENODEV;
 			else if (priority > I915_CONTEXT_MAX_USER_PRIORITY ||
 				 priority < I915_CONTEXT_MIN_USER_PRIORITY)
@@ -953,6 +961,19 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
 		}
 		break;
 
+	case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
+		if (args->size)
+			ret = -EINVAL;
+		else if (!HAS_DATA_PORT_COHERENCY(i915))
+			ret = -ENODEV;
+		else if (args->value == 1)
+			i915_gem_context_set_data_port_coherent(ctx);
+		else if (args->value == 0)
+			i915_gem_context_clear_data_port_coherent(ctx);
+		else
+			ret = -EINVAL;
+		break;
+
 	default:
 		ret = -EINVAL;
 		break;
diff --git a/drivers/gpu/drm/i915/i915_gem_context.h b/drivers/gpu/drm/i915/i915_gem_context.h
index f6d870b..69f9247 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.h
+++ b/drivers/gpu/drm/i915/i915_gem_context.h
@@ -131,6 +131,8 @@ struct i915_gem_context {
 #define CONTEXT_BANNED			0
 #define CONTEXT_CLOSED			1
 #define CONTEXT_FORCE_SINGLE_SUBMISSION	2
+#define CONTEXT_DATA_PORT_COHERENT_REQUESTED	3
+#define CONTEXT_DATA_PORT_COHERENT_ACTIVE	4
 
 	/**
 	 * @hw_id: - unique identifier for the context
@@ -283,6 +285,21 @@ static inline void i915_gem_context_unpin_hw_id(struct i915_gem_context *ctx)
 	atomic_dec(&ctx->hw_id_pin_count);
 }
 
+static inline bool i915_gem_context_is_data_port_coherent(struct i915_gem_context *ctx)
+{
+	return test_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
+}
+
+static inline void i915_gem_context_set_data_port_coherent(struct i915_gem_context *ctx)
+{
+	__set_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
+}
+
+static inline void i915_gem_context_clear_data_port_coherent(struct i915_gem_context *ctx)
+{
+	__clear_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
+}
+
 static inline bool i915_gem_context_is_default(const struct i915_gem_context *c)
 {
 	return c->user_handle == DEFAULT_CONTEXT_HANDLE;
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index ff0e2b3..8680bc2 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -259,6 +259,62 @@ intel_lr_context_descriptor_update(struct i915_gem_context *ctx,
 	ce->lrc_desc = desc;
 }
 
+static int emit_set_data_port_coherency(struct i915_request *rq, bool enable)
+{
+	u32 *cs;
+	i915_reg_t reg;
+
+	GEM_BUG_ON(rq->engine->class != RENDER_CLASS);
+	GEM_BUG_ON(INTEL_GEN(rq->i915) < 9);
+
+	cs = intel_ring_begin(rq, 4);
+	if (IS_ERR(cs))
+		return PTR_ERR(cs);
+
+	if (INTEL_GEN(rq->i915) >= 11)
+		reg = ICL_HDC_MODE;
+	else if (INTEL_GEN(rq->i915) >= 10)
+		reg = CNL_HDC_CHICKEN0;
+	else
+		reg = HDC_CHICKEN0;
+
+	*cs++ = MI_LOAD_REGISTER_IMM(1);
+	*cs++ = i915_mmio_reg_offset(reg);
+	if (enable)
+		*cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
+	else
+		*cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
+	*cs++ = MI_NOOP;
+
+	intel_ring_advance(rq, cs);
+
+	return 0;
+}
+
+static int
+intel_lr_context_update_data_port_coherency(struct i915_request *rq)
+{
+	struct i915_gem_context *ctx = rq->gem_context;
+	bool enable = test_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
+	int ret;
+
+	lockdep_assert_held(&rq->i915->drm.struct_mutex);
+
+	if (test_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags) == enable)
+		return 0;
+
+	ret = emit_set_data_port_coherency(rq, enable);
+
+	if (!ret) {
+		if (enable)
+			__set_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags);
+		else
+			__clear_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags);
+	}
+
+	return ret;
+}
+
 static void unwind_wa_tail(struct i915_request *rq)
 {
 	rq->tail = intel_ring_wrap(rq->ring, rq->wa_tail - WA_TAIL_BYTES);
@@ -1965,7 +2021,7 @@ static int gen8_emit_flush_render(struct i915_request *request,
 		i915_ggtt_offset(engine->scratch) + 2 * CACHELINE_BYTES;
 	bool vf_flush_wa = false, dc_flush_wa = false;
 	u32 *cs, flags = 0;
-	int len;
+	int err, len;
 
 	flags |= PIPE_CONTROL_CS_STALL;
 
@@ -1996,6 +2052,12 @@ static int gen8_emit_flush_render(struct i915_request *request,
 		/* WaForGAMHang:kbl */
 		if (IS_KBL_REVID(request->i915, 0, KBL_REVID_B0))
 			dc_flush_wa = true;
+
+		err = intel_lr_context_update_data_port_coherency(request);
+		if (WARN_ON(err)) {
+			DRM_DEBUG("Data Port Coherency toggle failed.\n");
+			return err;
+		}
 	}
 
 	len = 6;
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index 298b2e1..7c9e153 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -1486,6 +1486,16 @@ struct drm_i915_gem_context_param {
 #define   I915_CONTEXT_MAX_USER_PRIORITY	1023 /* inclusive */
 #define   I915_CONTEXT_DEFAULT_PRIORITY		0
 #define   I915_CONTEXT_MIN_USER_PRIORITY	-1023 /* inclusive */
+/*
+ * When data port level coherency is enabled, the GPU and CPU will both keep
+ * changes to memory content visible to each other as fast as possible, by
+ * forcing internal cache units to send memory writes to higher level caches
+ * immediately after writes. Only buffers with coherency requested within
+ * surface state, or specific stateless accesses will be affected by this
+ * option. Keeping data port coherency has a performance cost, and therefore
+ * it is by default disabled (see WaForceEnableNonCoherent).
+ */
+#define I915_CONTEXT_PARAM_DATA_PORT_COHERENCY	0x7
 	__u64 value;
 };
 
-- 
2.7.4


* ✗ Fi.CI.CHECKPATCH: warning for drm/i915: Add Exec param to control data port coherency. (rev8)
  2018-03-19 12:37 [RFC v1] Data port coherency control for UMDs Tomasz Lis
                   ` (24 preceding siblings ...)
  2018-10-09 21:44 ` ✗ Fi.CI.IGT: failure " Patchwork
@ 2018-10-12 15:14 ` Patchwork
  2018-10-12 15:15 ` ✗ Fi.CI.SPARSE: " Patchwork
                   ` (2 subsequent siblings)
  28 siblings, 0 replies; 81+ messages in thread
From: Patchwork @ 2018-10-12 15:14 UTC (permalink / raw)
  To: Tomasz Lis; +Cc: intel-gfx

== Series Details ==

Series: drm/i915: Add Exec param to control data port coherency. (rev8)
URL   : https://patchwork.freedesktop.org/series/40181/
State : warning

== Summary ==

$ dim checkpatch origin/drm-tip
2c32b051c0f7 drm/i915: Add IOCTL Param to control data port coherency.
-:15: WARNING:COMMIT_LOG_LONG_LINE: Possible unwrapped commit description (prefer a maximum 75 chars per line)
#15: 
coherency at data port level. Keeping the coherency at that level is disabled

total: 0 errors, 1 warnings, 0 checks, 200 lines checked


* ✗ Fi.CI.SPARSE: warning for drm/i915: Add Exec param to control data port coherency. (rev8)
  2018-03-19 12:37 [RFC v1] Data port coherency control for UMDs Tomasz Lis
                   ` (25 preceding siblings ...)
  2018-10-12 15:14 ` ✗ Fi.CI.CHECKPATCH: warning for drm/i915: Add Exec param to control data port coherency. (rev8) Patchwork
@ 2018-10-12 15:15 ` Patchwork
  2018-10-12 15:34 ` ✓ Fi.CI.BAT: success " Patchwork
  2018-10-12 18:27 ` ✗ Fi.CI.IGT: failure " Patchwork
  28 siblings, 0 replies; 81+ messages in thread
From: Patchwork @ 2018-10-12 15:15 UTC (permalink / raw)
  To: Tomasz Lis; +Cc: intel-gfx

== Series Details ==

Series: drm/i915: Add Exec param to control data port coherency. (rev8)
URL   : https://patchwork.freedesktop.org/series/40181/
State : warning

== Summary ==

$ dim sparse origin/drm-tip
Sparse version: v0.5.2
Commit: drm/i915: Add IOCTL Param to control data port coherency.
-drivers/gpu/drm/i915/selftests/../i915_drv.h:3725:16: warning: expression using sizeof(void)
+drivers/gpu/drm/i915/selftests/../i915_drv.h:3726:16: warning: expression using sizeof(void)


* ✓ Fi.CI.BAT: success for drm/i915: Add Exec param to control data port coherency. (rev8)
  2018-03-19 12:37 [RFC v1] Data port coherency control for UMDs Tomasz Lis
                   ` (26 preceding siblings ...)
  2018-10-12 15:15 ` ✗ Fi.CI.SPARSE: " Patchwork
@ 2018-10-12 15:34 ` Patchwork
  2018-10-12 18:27 ` ✗ Fi.CI.IGT: failure " Patchwork
  28 siblings, 0 replies; 81+ messages in thread
From: Patchwork @ 2018-10-12 15:34 UTC (permalink / raw)
  To: Tomasz Lis; +Cc: intel-gfx

== Series Details ==

Series: drm/i915: Add Exec param to control data port coherency. (rev8)
URL   : https://patchwork.freedesktop.org/series/40181/
State : success

== Summary ==

= CI Bug Log - changes from CI_DRM_4978 -> Patchwork_10440 =

== Summary - SUCCESS ==

  No regressions found.

  External URL: https://patchwork.freedesktop.org/api/1.0/series/40181/revisions/8/mbox/

== Known issues ==

  Here are the changes found in Patchwork_10440 that come from known issues:

  === IGT changes ===

    ==== Issues hit ====

    igt@gem_exec_suspend@basic-s3:
      fi-icl-u:           NOTRUN -> INCOMPLETE (fdo#107713)

    igt@kms_pipe_crc_basic@suspend-read-crc-pipe-b:
      fi-skl-6700k2:      PASS -> INCOMPLETE (fdo#104108, fdo#105524, k.org#199541)

    
    ==== Possible fixes ====

    igt@drv_module_reload@basic-reload:
      fi-blb-e6850:       INCOMPLETE (fdo#107718) -> PASS

    igt@kms_pipe_crc_basic@nonblocking-crc-pipe-b-frame-sequence:
      fi-byt-clapper:     FAIL (fdo#103191, fdo#107362) -> PASS +1

    
  fdo#103191 https://bugs.freedesktop.org/show_bug.cgi?id=103191
  fdo#104108 https://bugs.freedesktop.org/show_bug.cgi?id=104108
  fdo#105524 https://bugs.freedesktop.org/show_bug.cgi?id=105524
  fdo#107362 https://bugs.freedesktop.org/show_bug.cgi?id=107362
  fdo#107713 https://bugs.freedesktop.org/show_bug.cgi?id=107713
  fdo#107718 https://bugs.freedesktop.org/show_bug.cgi?id=107718
  k.org#199541 https://bugzilla.kernel.org/show_bug.cgi?id=199541


== Participating hosts (44 -> 42) ==

  Additional (1): fi-icl-u 
  Missing    (3): fi-ilk-m540 fi-byt-squawks fi-bsw-cyan 


== Build changes ==

    * Linux: CI_DRM_4978 -> Patchwork_10440

  CI_DRM_4978: ca98b2681a49a1417f8157af2d94a4f2d0bd0e47 @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_4674: 93871c6fb3c25e5d350c9faf36ded917174214de @ git://anongit.freedesktop.org/xorg/app/intel-gpu-tools
  Patchwork_10440: 2c32b051c0f7a1519f6d8802c66b0f7e0c7a68a9 @ git://anongit.freedesktop.org/gfx-ci/linux


== Linux commits ==

2c32b051c0f7 drm/i915: Add IOCTL Param to control data port coherency.

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_10440/issues.html

* ✗ Fi.CI.IGT: failure for drm/i915: Add Exec param to control data port coherency. (rev8)
  2018-03-19 12:37 [RFC v1] Data port coherency control for UMDs Tomasz Lis
                   ` (27 preceding siblings ...)
  2018-10-12 15:34 ` ✓ Fi.CI.BAT: success " Patchwork
@ 2018-10-12 18:27 ` Patchwork
  28 siblings, 0 replies; 81+ messages in thread
From: Patchwork @ 2018-10-12 18:27 UTC (permalink / raw)
  To: Tomasz Lis; +Cc: intel-gfx

== Series Details ==

Series: drm/i915: Add Exec param to control data port coherency. (rev8)
URL   : https://patchwork.freedesktop.org/series/40181/
State : failure

== Summary ==

= CI Bug Log - changes from CI_DRM_4978_full -> Patchwork_10440_full =

== Summary - FAILURE ==

  Serious unknown changes coming with Patchwork_10440_full absolutely need to be
  verified manually.
  
  If you think the reported changes have nothing to do with the changes
  introduced in Patchwork_10440_full, please notify your bug team to allow them
  to document this new failure mode, which will reduce false positives in CI.

  

== Possible new issues ==

  Here are the unknown changes that may have been introduced in Patchwork_10440_full:

  === IGT changes ===

    ==== Possible regressions ====

    igt@gem_ctx_param@invalid-param-get:
      shard-skl:          PASS -> FAIL +1
      shard-apl:          PASS -> FAIL +1
      shard-glk:          PASS -> FAIL +1

    igt@gem_ctx_param@invalid-param-set:
      shard-kbl:          PASS -> FAIL +1
      shard-hsw:          PASS -> FAIL +1
      shard-snb:          PASS -> FAIL +1

    
    ==== Warnings ====

    igt@kms_chv_cursor_fail@pipe-a-256x256-top-edge:
      shard-snb:          PASS -> SKIP +1

    
== Known issues ==

  Here are the changes found in Patchwork_10440_full that come from known issues:

  === IGT changes ===

    ==== Issues hit ====

    igt@gem_ppgtt@blt-vs-render-ctx0:
      shard-skl:          NOTRUN -> TIMEOUT (fdo#108039) +1

    igt@kms_busy@extended-modeset-hang-newfb-with-reset-render-a:
      shard-skl:          NOTRUN -> DMESG-WARN (fdo#107956)

    igt@kms_chv_cursor_fail@pipe-a-64x64-right-edge:
      shard-skl:          PASS -> FAIL (fdo#104671)

    igt@kms_cursor_crc@cursor-256x256-random:
      shard-apl:          PASS -> FAIL (fdo#103232) +1

    igt@kms_cursor_crc@cursor-64x21-onscreen:
      shard-glk:          PASS -> FAIL (fdo#103232) +1

    igt@kms_fbcon_fbt@fbc:
      shard-skl:          NOTRUN -> FAIL (fdo#103833, fdo#105682)

    igt@kms_fbcon_fbt@psr:
      shard-skl:          NOTRUN -> FAIL (fdo#107882)

    igt@kms_frontbuffer_tracking@fbc-2p-primscrn-spr-indfb-draw-pwrite:
      shard-glk:          PASS -> FAIL (fdo#103167) +1

    igt@kms_frontbuffer_tracking@fbc-rgb565-draw-render:
      shard-skl:          NOTRUN -> FAIL (fdo#103167)

    igt@kms_frontbuffer_tracking@fbcpsr-1p-indfb-fliptrack:
      shard-skl:          NOTRUN -> FAIL (fdo#105682) +1

    igt@kms_frontbuffer_tracking@psr-1p-offscren-pri-indfb-draw-mmap-wc:
      shard-skl:          PASS -> FAIL (fdo#103167)

    igt@kms_plane@pixel-format-pipe-b-planes:
      shard-skl:          NOTRUN -> DMESG-FAIL (fdo#103166, fdo#106885)

    igt@kms_plane_alpha_blend@pipe-a-constant-alpha-max:
      shard-skl:          NOTRUN -> FAIL (fdo#108145) +2

    igt@kms_plane_multiple@atomic-pipe-c-tiling-yf:
      shard-apl:          PASS -> FAIL (fdo#103166) +2

    igt@kms_setmode@basic:
      shard-apl:          PASS -> FAIL (fdo#99912)

    igt@kms_vblank@pipe-a-ts-continuation-suspend:
      shard-skl:          PASS -> INCOMPLETE (fdo#104108, fdo#107773)

    igt@kms_vblank@pipe-b-ts-continuation-idle-hang:
      shard-apl:          PASS -> INCOMPLETE (fdo#103927)

    igt@pm_rpm@gem-execbuf:
      shard-skl:          PASS -> INCOMPLETE (fdo#107807, fdo#107803)

    igt@pm_rpm@system-suspend-execbuf:
      shard-skl:          PASS -> INCOMPLETE (fdo#107807, fdo#104108)

    
    ==== Possible fixes ====

    igt@debugfs_test@read_all_entries_display_off:
      shard-skl:          INCOMPLETE (fdo#104108) -> PASS

    igt@gem_softpin@noreloc-s3:
      shard-skl:          INCOMPLETE (fdo#104108, fdo#107773) -> PASS

    igt@kms_cursor_crc@cursor-128x128-onscreen:
      shard-apl:          FAIL (fdo#103232) -> PASS +4

    igt@kms_cursor_crc@cursor-64x64-suspend:
      shard-apl:          FAIL (fdo#103191, fdo#103232) -> PASS

    igt@kms_cursor_legacy@cursora-vs-flipa-toggle:
      shard-glk:          DMESG-WARN (fdo#105763, fdo#106538) -> PASS

    igt@kms_frontbuffer_tracking@fbc-1p-primscrn-spr-indfb-draw-render:
      shard-apl:          FAIL (fdo#103167) -> PASS

    igt@kms_frontbuffer_tracking@fbcpsr-1p-offscren-pri-indfb-draw-render:
      shard-skl:          FAIL (fdo#105682) -> PASS +1

    igt@kms_frontbuffer_tracking@psr-1p-offscren-pri-shrfb-draw-mmap-gtt:
      shard-skl:          FAIL (fdo#103167) -> PASS

    igt@kms_frontbuffer_tracking@psr-rgb565-draw-render:
      shard-skl:          FAIL (fdo#103167) -> SKIP

    igt@kms_plane_alpha_blend@pipe-a-coverage-7efc:
      shard-skl:          FAIL (fdo#108145) -> PASS

    igt@pm_rpm@modeset-non-lpsp:
      shard-skl:          INCOMPLETE (fdo#107807) -> SKIP

    
  fdo#103166 https://bugs.freedesktop.org/show_bug.cgi?id=103166
  fdo#103167 https://bugs.freedesktop.org/show_bug.cgi?id=103167
  fdo#103191 https://bugs.freedesktop.org/show_bug.cgi?id=103191
  fdo#103232 https://bugs.freedesktop.org/show_bug.cgi?id=103232
  fdo#103833 https://bugs.freedesktop.org/show_bug.cgi?id=103833
  fdo#103927 https://bugs.freedesktop.org/show_bug.cgi?id=103927
  fdo#104108 https://bugs.freedesktop.org/show_bug.cgi?id=104108
  fdo#104671 https://bugs.freedesktop.org/show_bug.cgi?id=104671
  fdo#105682 https://bugs.freedesktop.org/show_bug.cgi?id=105682
  fdo#105763 https://bugs.freedesktop.org/show_bug.cgi?id=105763
  fdo#106538 https://bugs.freedesktop.org/show_bug.cgi?id=106538
  fdo#106885 https://bugs.freedesktop.org/show_bug.cgi?id=106885
  fdo#107773 https://bugs.freedesktop.org/show_bug.cgi?id=107773
  fdo#107803 https://bugs.freedesktop.org/show_bug.cgi?id=107803
  fdo#107807 https://bugs.freedesktop.org/show_bug.cgi?id=107807
  fdo#107882 https://bugs.freedesktop.org/show_bug.cgi?id=107882
  fdo#107956 https://bugs.freedesktop.org/show_bug.cgi?id=107956
  fdo#108039 https://bugs.freedesktop.org/show_bug.cgi?id=108039
  fdo#108145 https://bugs.freedesktop.org/show_bug.cgi?id=108145
  fdo#99912 https://bugs.freedesktop.org/show_bug.cgi?id=99912


== Participating hosts (6 -> 6) ==

  No changes in participating hosts


== Build changes ==

    * Linux: CI_DRM_4978 -> Patchwork_10440

  CI_DRM_4978: ca98b2681a49a1417f8157af2d94a4f2d0bd0e47 @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_4674: 93871c6fb3c25e5d350c9faf36ded917174214de @ git://anongit.freedesktop.org/xorg/app/intel-gpu-tools
  Patchwork_10440: 2c32b051c0f7a1519f6d8802c66b0f7e0c7a68a9 @ git://anongit.freedesktop.org/gfx-ci/linux
  piglit_4509: fdc5a4ca11124ab8413c7988896eec4c97336694 @ git://anongit.freedesktop.org/piglit

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_10440/shards.html

* Re: [PATCH v8] drm/i915: Add IOCTL Param to control data port coherency.
  2018-10-12 15:02   ` [PATCH v8] " Tomasz Lis
@ 2018-10-15 12:52     ` Tvrtko Ursulin
  2018-10-16 13:59     ` Joonas Lahtinen
  1 sibling, 0 replies; 81+ messages in thread
From: Tvrtko Ursulin @ 2018-10-15 12:52 UTC (permalink / raw)
  To: Tomasz Lis, intel-gfx; +Cc: bartosz.dunajski


On 12/10/2018 16:02, Tomasz Lis wrote:
> The patch adds a parameter to control the data port coherency functionality
> on a per-context level. When the IOCTL is called, a command to switch data
> port coherency state is added to the ordered list. All prior requests are
> executed on old coherency settings, and all exec requests after the IOCTL
> will use new settings.
> 
> Rationale:
> 
> The OpenCL driver develpers requested a functionality to control cache

typo in developers

> coherency at data port level. Keeping the coherency at that level is disabled
> by default due to its performance costs. OpenCL driver is planning to
> enable it for a small subset of submissions, when such functionality is
> required. Below are answers to basic question explaining background
> of the functionality and reasoning for the proposed implementation:
> 
> 1. Why do we need a coherency enable/disable switch for memory that is shared
> between CPU and GEN (GPU)?
> 
> Memory coherency between CPU and GEN, while being a great feature that enables
> CL_MEM_SVM_FINE_GRAIN_BUFFER OCL capability on Intel GEN architecture, adds
> overhead related to tracking (snooping) memory inside different cache units
> (L1$, L2$, L3$, LLC$, etc.). At the same time, minority of modern OCL
> applications actually use CL_MEM_SVM_FINE_GRAIN_BUFFER (and hence require
> memory coherency between CPU and GPU). The goal of coherency enable/disable
> switch is to remove overhead of memory coherency when memory coherency is not
> needed.
> 
> 2. Why do we need a global coherency switch?
> 
> In order to support I/O commands from within EUs (Execution Units), Intel GEN
> ISA (GEN Instruction Set Assembly) contains dedicated "send" instructions.
> These send instructions provide several addressing models. One of these
> addressing models (named "stateless") provides most flexible I/O using plain
> virtual addresses (as opposed to buffer_handle+offset models). This "stateless"
> model is similar to regular memory load/store operations available on typical
> CPUs. Since this model provides I/O using arbitrary virtual addresses, it
> enables algorithmic designs that are based on pointer-to-pointer (e.g. buffer
> of pointers) concepts. For instance, it allows creating tree-like data
> structures such as:
>                     ________________
>                    |      NODE1     |
>                    | uint64_t data  |
>                    +----------------|
>                    | NODE*  |  NODE*|
>                    +--------+-------+
>                      /              \
>     ________________/                \________________
>    |      NODE2     |                |      NODE3     |
>    | uint64_t data  |                | uint64_t data  |
>    +----------------|                +----------------|
>    | NODE*  |  NODE*|                | NODE*  |  NODE*|
>    +--------+-------+                +--------+-------+
> 
> Please note that pointers inside such structures can point to memory locations
> in different OCL allocations  - e.g. NODE1 and NODE2 can reside in one OCL
> allocation while NODE3 resides in a completely separate OCL allocation.
> Additionally, such pointers can be shared with CPU (i.e. using SVM - Shared
> Virtual Memory feature). Using pointers from different allocations doesn't
> affect the stateless addressing model which even allows scattered reading from
> different allocations at the same time (i.e. by utilizing SIMD-nature of send
> instructions).
> 
> When it comes to coherency programming, send instructions in stateless model
> can be encoded (at ISA level) to either use or disable coherency. However, for
> generic OCL applications (such as example with tree-like data structure), OCL
> compiler is not able to determine origin of memory pointed to by an arbitrary
> pointer - i.e. is not able to track given pointer back to a specific
> allocation. As such, it's not able to decide whether coherency is needed or not
> for specific pointer (or for specific I/O instruction). As a result, compiler
> encodes all stateless sends as coherent (doing otherwise would lead to
> functional issues resulting from data corruption). Please note that it would be
> possible to work around this (e.g. based on an allocation map and pointer bounds
> checking prior to each I/O instruction) but the performance cost of such a
> workaround would be many times greater than the cost of keeping coherency
> always enabled. As such, enabling/disabling memory coherency at GEN ISA level
> is not feasible and an alternative method is needed.
> 
> Such an alternative solution is to have a global coherency switch that allows
> disabling coherency for a single (though entire) GPU submission. This is
> beneficial because this way we:
> * can enable (and pay for) coherency only in submissions that actually need
> coherency (submissions that use CL_MEM_SVM_FINE_GRAIN_BUFFER resources)
> * don't care about coherency at GEN ISA granularity (no performance impact)
> 
> 3. Will coherency switch be used frequently?
> 
> There are scenarios that will require frequent toggling of the coherency
> switch.
> E.g. an application has two OCL compute kernels: kern_master and kern_worker.
> kern_master uses, concurrently with CPU, some fine grain SVM resources
> (CL_MEM_SVM_FINE_GRAIN_BUFFER). These resources contain descriptors of
> computational work that needs to be executed. kern_master analyzes incoming
> work descriptors and populates a plain OCL buffer (non-fine-grain) with payload
> for kern_worker. Once kern_master is done, kern_worker kicks in and processes
> the payload that kern_master produced. These two kernels work in a loop, one
> after another. Since only kern_master requires coherency, kern_worker should
> not be forced to pay for it. This means that we need to have the ability to
> toggle coherency switch on or off per each GPU submission:
> (ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> (ENABLE
> COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> ...
> 
> v2: Fixed compilation warning.
> v3: Refactored the patch to add IOCTL instead of exec flag.
> v4: Renamed and documented the API flag. Used strict values.
>      Removed redundant GEM_WARN_ON()s. Improved to coding standard.
>      Introduced a macro for checking whether hardware supports the feature.
> v5: Renamed some locals. Made the flag write to be lazy.
>      Updated comments to remove misconceptions. Added gen11 support.
> v6: Moved the flag write to gen8_emit_flush_render(). Renamed some functions.
>      Moved all flags checking to one place. Added mutex check.
> v7: Removed 2 comments, improved API comment. (Joonas)
> v8: Use non-GEM WARN_ON when in statements. (Tvrtko)
> 
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Michal Winiarski <michal.winiarski@intel.com>
> 
> Bspec: 11419
> Bspec: 19175
> Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
> ---
>   drivers/gpu/drm/i915/i915_drv.h         |  1 +
>   drivers/gpu/drm/i915/i915_gem_context.c | 29 ++++++++++++---
>   drivers/gpu/drm/i915/i915_gem_context.h | 17 +++++++++
>   drivers/gpu/drm/i915/intel_lrc.c        | 64 ++++++++++++++++++++++++++++++++-
>   include/uapi/drm/i915_drm.h             | 10 ++++++
>   5 files changed, 116 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 3017ef0..90b3a0ff 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -2588,6 +2588,7 @@ intel_info(const struct drm_i915_private *dev_priv)
>   #define HAS_EDRAM(dev_priv)	(!!((dev_priv)->edram_cap & EDRAM_ENABLED))
>   #define HAS_WT(dev_priv)	((IS_HASWELL(dev_priv) || \
>   				 IS_BROADWELL(dev_priv)) && HAS_EDRAM(dev_priv))
> +#define HAS_DATA_PORT_COHERENCY(dev_priv)	(INTEL_GEN(dev_priv) >= 9)
>   
>   #define HWS_NEEDS_PHYSICAL(dev_priv)	((dev_priv)->info.hws_needs_physical)
>   
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> index 8cbe580..718ede9 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> @@ -847,6 +847,7 @@ int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
>   int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
>   				    struct drm_file *file)
>   {
> +	struct drm_i915_private *i915 = to_i915(dev);
>   	struct drm_i915_file_private *file_priv = file->driver_priv;
>   	struct drm_i915_gem_context_param *args = data;
>   	struct i915_gem_context *ctx;
> @@ -867,10 +868,10 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
>   	case I915_CONTEXT_PARAM_GTT_SIZE:
>   		if (ctx->ppgtt)
>   			args->value = ctx->ppgtt->vm.total;
> -		else if (to_i915(dev)->mm.aliasing_ppgtt)
> -			args->value = to_i915(dev)->mm.aliasing_ppgtt->vm.total;
> +		else if (i915->mm.aliasing_ppgtt)
> +			args->value = i915->mm.aliasing_ppgtt->vm.total;
>   		else
> -			args->value = to_i915(dev)->ggtt.vm.total;
> +			args->value = i915->ggtt.vm.total;
>   		break;
>   	case I915_CONTEXT_PARAM_NO_ERROR_CAPTURE:
>   		args->value = i915_gem_context_no_error_capture(ctx);
> @@ -881,6 +882,12 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
>   	case I915_CONTEXT_PARAM_PRIORITY:
>   		args->value = ctx->sched.priority >> I915_USER_PRIORITY_SHIFT;
>   		break;
> +	case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
> +		if (!HAS_DATA_PORT_COHERENCY(i915))
> +			ret = -ENODEV;
> +		else
> +			args->value = i915_gem_context_is_data_port_coherent(ctx);
> +		break;
>   	default:
>   		ret = -EINVAL;
>   		break;
> @@ -893,6 +900,7 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
>   int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
>   				    struct drm_file *file)
>   {
> +	struct drm_i915_private *i915 = to_i915(dev);
>   	struct drm_i915_file_private *file_priv = file->driver_priv;
>   	struct drm_i915_gem_context_param *args = data;
>   	struct i915_gem_context *ctx;
> @@ -939,7 +947,7 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
>   
>   			if (args->size)
>   				ret = -EINVAL;
> -			else if (!(to_i915(dev)->caps.scheduler & I915_SCHEDULER_CAP_PRIORITY))
> +			else if (!(i915->caps.scheduler & I915_SCHEDULER_CAP_PRIORITY))
>   				ret = -ENODEV;
>   			else if (priority > I915_CONTEXT_MAX_USER_PRIORITY ||
>   				 priority < I915_CONTEXT_MIN_USER_PRIORITY)
> @@ -953,6 +961,19 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
>   		}
>   		break;
>   
> +	case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
> +		if (args->size)
> +			ret = -EINVAL;
> +		else if (!HAS_DATA_PORT_COHERENCY(i915))
> +			ret = -ENODEV;
> +		else if (args->value == 1)
> +			i915_gem_context_set_data_port_coherent(ctx);
> +		else if (args->value == 0)
> +			i915_gem_context_clear_data_port_coherent(ctx);
> +		else
> +			ret = -EINVAL;
> +		break;
> +
>   	default:
>   		ret = -EINVAL;
>   		break;
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.h b/drivers/gpu/drm/i915/i915_gem_context.h
> index f6d870b..69f9247 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.h
> +++ b/drivers/gpu/drm/i915/i915_gem_context.h
> @@ -131,6 +131,8 @@ struct i915_gem_context {
>   #define CONTEXT_BANNED			0
>   #define CONTEXT_CLOSED			1
>   #define CONTEXT_FORCE_SINGLE_SUBMISSION	2
> +#define CONTEXT_DATA_PORT_COHERENT_REQUESTED	3
> +#define CONTEXT_DATA_PORT_COHERENT_ACTIVE	4
>   
>   	/**
>   	 * @hw_id: - unique identifier for the context
> @@ -283,6 +285,21 @@ static inline void i915_gem_context_unpin_hw_id(struct i915_gem_context *ctx)
>   	atomic_dec(&ctx->hw_id_pin_count);
>   }
>   
> +static inline bool i915_gem_context_is_data_port_coherent(struct i915_gem_context *ctx)
> +{
> +	return test_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
> +}
> +
> +static inline void i915_gem_context_set_data_port_coherent(struct i915_gem_context *ctx)
> +{
> +	__set_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
> +}
> +
> +static inline void i915_gem_context_clear_data_port_coherent(struct i915_gem_context *ctx)
> +{
> +	__clear_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
> +}
> +
>   static inline bool i915_gem_context_is_default(const struct i915_gem_context *c)
>   {
>   	return c->user_handle == DEFAULT_CONTEXT_HANDLE;
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index ff0e2b3..8680bc2 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -259,6 +259,62 @@ intel_lr_context_descriptor_update(struct i915_gem_context *ctx,
>   	ce->lrc_desc = desc;
>   }
>   
> +static int emit_set_data_port_coherency(struct i915_request *rq, bool enable)
> +{
> +	u32 *cs;
> +	i915_reg_t reg;
> +
> +	GEM_BUG_ON(rq->engine->class != RENDER_CLASS);
> +	GEM_BUG_ON(INTEL_GEN(rq->i915) < 9);
> +
> +	cs = intel_ring_begin(rq, 4);
> +	if (IS_ERR(cs))
> +		return PTR_ERR(cs);
> +
> +	if (INTEL_GEN(rq->i915) >= 11)
> +		reg = ICL_HDC_MODE;
> +	else if (INTEL_GEN(rq->i915) >= 10)
> +		reg = CNL_HDC_CHICKEN0;
> +	else
> +		reg = HDC_CHICKEN0;
> +
> +	*cs++ = MI_LOAD_REGISTER_IMM(1);
> +	*cs++ = i915_mmio_reg_offset(reg);
> +	if (enable)
> +		*cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
> +	else
> +		*cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
> +	*cs++ = MI_NOOP;
> +
> +	intel_ring_advance(rq, cs);
> +
> +	return 0;
> +}
> +
> +static int
> +intel_lr_context_update_data_port_coherency(struct i915_request *rq)
> +{
> +	struct i915_gem_context *ctx = rq->gem_context;
> +	bool enable = test_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
> +	int ret;
> +
> +	lockdep_assert_held(&rq->i915->drm.struct_mutex);
> +
> +	if (test_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags) == enable)
> +		return 0;
> +
> +	ret = emit_set_data_port_coherency(rq, enable);
> +
> +	if (!ret) {
> +		if (enable)
> +			__set_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags);
> +		else
> +			__clear_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags);
> +	}
> +
> +	return ret;
> +}
> +
>   static void unwind_wa_tail(struct i915_request *rq)
>   {
>   	rq->tail = intel_ring_wrap(rq->ring, rq->wa_tail - WA_TAIL_BYTES);
> @@ -1965,7 +2021,7 @@ static int gen8_emit_flush_render(struct i915_request *request,
>   		i915_ggtt_offset(engine->scratch) + 2 * CACHELINE_BYTES;
>   	bool vf_flush_wa = false, dc_flush_wa = false;
>   	u32 *cs, flags = 0;
> -	int len;
> +	int err, len;
>   
>   	flags |= PIPE_CONTROL_CS_STALL;
>   
> @@ -1996,6 +2052,12 @@ static int gen8_emit_flush_render(struct i915_request *request,
>   		/* WaForGAMHang:kbl */
>   		if (IS_KBL_REVID(request->i915, 0, KBL_REVID_B0))
>   			dc_flush_wa = true;
> +
> +		err = intel_lr_context_update_data_port_coherency(request);
> +		if (WARN_ON(err)) {
> +			DRM_DEBUG("Data Port Coherency toggle failed.\n");
> +			return err;
> +		}
>   	}
>   
>   	len = 6;
> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> index 298b2e1..7c9e153 100644
> --- a/include/uapi/drm/i915_drm.h
> +++ b/include/uapi/drm/i915_drm.h
> @@ -1486,6 +1486,16 @@ struct drm_i915_gem_context_param {
>   #define   I915_CONTEXT_MAX_USER_PRIORITY	1023 /* inclusive */
>   #define   I915_CONTEXT_DEFAULT_PRIORITY		0
>   #define   I915_CONTEXT_MIN_USER_PRIORITY	-1023 /* inclusive */
> +/*
> + * When data port level coherency is enabled, the GPU and CPU will both keep
> + * changes to memory content visible to each other as fast as possible, by
> + * forcing internal cache units to send memory writes to higher level caches
> + * immediately after writes. Only buffers with coherency requested within
> + * surface state, or specific stateless accesses will be affected by this
> + * option. Keeping data port coherency has a performance cost, and therefore
> + * it is by default disabled (see WaForceEnableNonCoherent).
> + */
> +#define I915_CONTEXT_PARAM_DATA_PORT_COHERENCY	0x7
>   	__u64 value;
>   };
>   
> 

Looks okay to me.

Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Regards,

Tvrtko

* Re: [PATCH v8] drm/i915: Add IOCTL Param to control data port coherency.
  2018-10-12 15:02   ` [PATCH v8] " Tomasz Lis
  2018-10-15 12:52     ` Tvrtko Ursulin
@ 2018-10-16 13:59     ` Joonas Lahtinen
  1 sibling, 0 replies; 81+ messages in thread
From: Joonas Lahtinen @ 2018-10-16 13:59 UTC (permalink / raw)
  To: Tomasz Lis, intel-gfx; +Cc: bartosz.dunajski

Quoting Tomasz Lis (2018-10-12 18:02:56)
> The patch adds a parameter to control the data port coherency functionality
> on a per-context level. When the IOCTL is called, a command to switch data
> port coherency state is added to the ordered list. All prior requests are
> executed on old coherency settings, and all exec requests after the IOCTL
> will use new settings.
> 
> Rationale:
> 
> The OpenCL driver developers requested functionality to control cache
> coherency at data port level. Keeping the coherency at that level is disabled
> by default due to its performance costs. OpenCL driver is planning to
> enable it for a small subset of submissions, when such functionality is
> required. Below are answers to basic questions explaining the background
> of the functionality and the reasoning for the proposed implementation:
> 
> 1. Why do we need a coherency enable/disable switch for memory that is shared
> between CPU and GEN (GPU)?
> 
> Memory coherency between CPU and GEN, while being a great feature that enables
> CL_MEM_SVM_FINE_GRAIN_BUFFER OCL capability on Intel GEN architecture, adds
> overhead related to tracking (snooping) memory inside different cache units
> (L1$, L2$, L3$, LLC$, etc.). At the same time, minority of modern OCL
> applications actually use CL_MEM_SVM_FINE_GRAIN_BUFFER (and hence require
> memory coherency between CPU and GPU). The goal of coherency enable/disable
> switch is to remove overhead of memory coherency when memory coherency is not
> needed.
> 
> 2. Why do we need a global coherency switch?
> 
> In order to support I/O commands from within EUs (Execution Units), Intel GEN
> ISA (GEN Instruction Set Assembly) contains dedicated "send" instructions.
> These send instructions provide several addressing models. One of these
> addressing models (named "stateless") provides most flexible I/O using plain
> virtual addresses (as opposed to buffer_handle+offset models). This "stateless"
> model is similar to regular memory load/store operations available on typical
> CPUs. Since this model provides I/O using arbitrary virtual addresses, it
> enables algorithmic designs that are based on pointer-to-pointer (e.g. buffer
> of pointers) concepts. For instance, it allows creating tree-like data
> structures such as:
>                    ________________
>                   |      NODE1     |
>                   | uint64_t data  |
>                   +----------------|
>                   | NODE*  |  NODE*|
>                   +--------+-------+
>                     /              \
>    ________________/                \________________
>   |      NODE2     |                |      NODE3     |
>   | uint64_t data  |                | uint64_t data  |
>   +----------------|                +----------------|
>   | NODE*  |  NODE*|                | NODE*  |  NODE*|
>   +--------+-------+                +--------+-------+
> 
> Please note that pointers inside such structures can point to memory locations
> in different OCL allocations  - e.g. NODE1 and NODE2 can reside in one OCL
> allocation while NODE3 resides in a completely separate OCL allocation.
> Additionally, such pointers can be shared with CPU (i.e. using SVM - Shared
> Virtual Memory feature). Using pointers from different allocations doesn't
> affect the stateless addressing model which even allows scattered reading from
> different allocations at the same time (i.e. by utilizing SIMD-nature of send
> instructions).
> 
> When it comes to coherency programming, send instructions in stateless model
> can be encoded (at ISA level) to either use or disable coherency. However, for
> generic OCL applications (such as example with tree-like data structure), OCL
> compiler is not able to determine origin of memory pointed to by an arbitrary
> pointer - i.e. is not able to track given pointer back to a specific
> allocation. As such, it's not able to decide whether coherency is needed or not
> for specific pointer (or for specific I/O instruction). As a result, compiler
> encodes all stateless sends as coherent (doing otherwise would lead to
> functional issues resulting from data corruption). Please note that it would be
> possible to work around this (e.g. based on an allocation map and pointer bounds
> checking prior to each I/O instruction) but the performance cost of such a
> workaround would be many times greater than the cost of keeping coherency
> always enabled. As such, enabling/disabling memory coherency at GEN ISA level
> is not feasible and an alternative method is needed.
> 
> Such an alternative solution is to have a global coherency switch that allows
> disabling coherency for a single (though entire) GPU submission. This is
> beneficial because this way we:
> * can enable (and pay for) coherency only in submissions that actually need
> coherency (submissions that use CL_MEM_SVM_FINE_GRAIN_BUFFER resources)
> * don't care about coherency at GEN ISA granularity (no performance impact)

Might be worth mentioning that this address space compatibility can be
achieved with userptr + soft-pinning allocations to their process space
addresses.

Anyway, this is:

Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>

But as mentioned previously, getting this merged needs the test to be
finished and clarity from the userspace project side.

Regards, Joonas

> 3. Will coherency switch be used frequently?
> 
> There are scenarios that will require frequent toggling of the coherency
> switch.
> E.g. an application has two OCL compute kernels: kern_master and kern_worker.
> kern_master uses, concurrently with CPU, some fine grain SVM resources
> (CL_MEM_SVM_FINE_GRAIN_BUFFER). These resources contain descriptors of
> computational work that needs to be executed. kern_master analyzes incoming
> work descriptors and populates a plain OCL buffer (non-fine-grain) with payload
> for kern_worker. Once kern_master is done, kern_worker kicks in and processes
> the payload that kern_master produced. These two kernels work in a loop, one
> after another. Since only kern_master requires coherency, kern_worker should
> not be forced to pay for it. This means that we need to have the ability to
> toggle coherency switch on or off per each GPU submission:
> (ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> (ENABLE
> COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> ...
> 
> v2: Fixed compilation warning.
> v3: Refactored the patch to add IOCTL instead of exec flag.
> v4: Renamed and documented the API flag. Used strict values.
>     Removed redundant GEM_WARN_ON()s. Improved to coding standard.
>     Introduced a macro for checking whether hardware supports the feature.
> v5: Renamed some locals. Made the flag write to be lazy.
>     Updated comments to remove misconceptions. Added gen11 support.
> v6: Moved the flag write to gen8_emit_flush_render(). Renamed some functions.
>     Moved all flags checking to one place. Added mutex check.
> v7: Removed 2 comments, improved API comment. (Joonas)
> v8: Use non-GEM WARN_ON when in statements. (Tvrtko)
> 
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Michal Winiarski <michal.winiarski@intel.com>
> 
> Bspec: 11419
> Bspec: 19175
> Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_drv.h         |  1 +
>  drivers/gpu/drm/i915/i915_gem_context.c | 29 ++++++++++++---
>  drivers/gpu/drm/i915/i915_gem_context.h | 17 +++++++++
>  drivers/gpu/drm/i915/intel_lrc.c        | 64 ++++++++++++++++++++++++++++++++-
>  include/uapi/drm/i915_drm.h             | 10 ++++++
>  5 files changed, 116 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 3017ef0..90b3a0ff 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -2588,6 +2588,7 @@ intel_info(const struct drm_i915_private *dev_priv)
>  #define HAS_EDRAM(dev_priv)    (!!((dev_priv)->edram_cap & EDRAM_ENABLED))
>  #define HAS_WT(dev_priv)       ((IS_HASWELL(dev_priv) || \
>                                  IS_BROADWELL(dev_priv)) && HAS_EDRAM(dev_priv))
> +#define HAS_DATA_PORT_COHERENCY(dev_priv)      (INTEL_GEN(dev_priv) >= 9)
>  
>  #define HWS_NEEDS_PHYSICAL(dev_priv)   ((dev_priv)->info.hws_needs_physical)
>  
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> index 8cbe580..718ede9 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> @@ -847,6 +847,7 @@ int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
>  int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
>                                     struct drm_file *file)
>  {
> +       struct drm_i915_private *i915 = to_i915(dev);
>         struct drm_i915_file_private *file_priv = file->driver_priv;
>         struct drm_i915_gem_context_param *args = data;
>         struct i915_gem_context *ctx;
> @@ -867,10 +868,10 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
>         case I915_CONTEXT_PARAM_GTT_SIZE:
>                 if (ctx->ppgtt)
>                         args->value = ctx->ppgtt->vm.total;
> -               else if (to_i915(dev)->mm.aliasing_ppgtt)
> -                       args->value = to_i915(dev)->mm.aliasing_ppgtt->vm.total;
> +               else if (i915->mm.aliasing_ppgtt)
> +                       args->value = i915->mm.aliasing_ppgtt->vm.total;
>                 else
> -                       args->value = to_i915(dev)->ggtt.vm.total;
> +                       args->value = i915->ggtt.vm.total;
>                 break;
>         case I915_CONTEXT_PARAM_NO_ERROR_CAPTURE:
>                 args->value = i915_gem_context_no_error_capture(ctx);
> @@ -881,6 +882,12 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
>         case I915_CONTEXT_PARAM_PRIORITY:
>                 args->value = ctx->sched.priority >> I915_USER_PRIORITY_SHIFT;
>                 break;
> +       case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
> +               if (!HAS_DATA_PORT_COHERENCY(i915))
> +                       ret = -ENODEV;
> +               else
> +                       args->value = i915_gem_context_is_data_port_coherent(ctx);
> +               break;
>         default:
>                 ret = -EINVAL;
>                 break;
> @@ -893,6 +900,7 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
>  int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
>                                     struct drm_file *file)
>  {
> +       struct drm_i915_private *i915 = to_i915(dev);
>         struct drm_i915_file_private *file_priv = file->driver_priv;
>         struct drm_i915_gem_context_param *args = data;
>         struct i915_gem_context *ctx;
> @@ -939,7 +947,7 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
>  
>                         if (args->size)
>                                 ret = -EINVAL;
> -                       else if (!(to_i915(dev)->caps.scheduler & I915_SCHEDULER_CAP_PRIORITY))
> +                       else if (!(i915->caps.scheduler & I915_SCHEDULER_CAP_PRIORITY))
>                                 ret = -ENODEV;
>                         else if (priority > I915_CONTEXT_MAX_USER_PRIORITY ||
>                                  priority < I915_CONTEXT_MIN_USER_PRIORITY)
> @@ -953,6 +961,19 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
>                 }
>                 break;
>  
> +       case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
> +               if (args->size)
> +                       ret = -EINVAL;
> +               else if (!HAS_DATA_PORT_COHERENCY(i915))
> +                       ret = -ENODEV;
> +               else if (args->value == 1)
> +                       i915_gem_context_set_data_port_coherent(ctx);
> +               else if (args->value == 0)
> +                       i915_gem_context_clear_data_port_coherent(ctx);
> +               else
> +                       ret = -EINVAL;
> +               break;
> +
>         default:
>                 ret = -EINVAL;
>                 break;
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.h b/drivers/gpu/drm/i915/i915_gem_context.h
> index f6d870b..69f9247 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.h
> +++ b/drivers/gpu/drm/i915/i915_gem_context.h
> @@ -131,6 +131,8 @@ struct i915_gem_context {
>  #define CONTEXT_BANNED                 0
>  #define CONTEXT_CLOSED                 1
>  #define CONTEXT_FORCE_SINGLE_SUBMISSION        2
> +#define CONTEXT_DATA_PORT_COHERENT_REQUESTED   3
> +#define CONTEXT_DATA_PORT_COHERENT_ACTIVE      4
>  
>         /**
>          * @hw_id: - unique identifier for the context
> @@ -283,6 +285,21 @@ static inline void i915_gem_context_unpin_hw_id(struct i915_gem_context *ctx)
>         atomic_dec(&ctx->hw_id_pin_count);
>  }
>  
> +static inline bool i915_gem_context_is_data_port_coherent(struct i915_gem_context *ctx)
> +{
> +       return test_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
> +}
> +
> +static inline void i915_gem_context_set_data_port_coherent(struct i915_gem_context *ctx)
> +{
> +       __set_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
> +}
> +
> +static inline void i915_gem_context_clear_data_port_coherent(struct i915_gem_context *ctx)
> +{
> +       __clear_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
> +}
> +
>  static inline bool i915_gem_context_is_default(const struct i915_gem_context *c)
>  {
>         return c->user_handle == DEFAULT_CONTEXT_HANDLE;
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index ff0e2b3..8680bc2 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -259,6 +259,62 @@ intel_lr_context_descriptor_update(struct i915_gem_context *ctx,
>         ce->lrc_desc = desc;
>  }
>  
> +static int emit_set_data_port_coherency(struct i915_request *rq, bool enable)
> +{
> +       u32 *cs;
> +       i915_reg_t reg;
> +
> +       GEM_BUG_ON(rq->engine->class != RENDER_CLASS);
> +       GEM_BUG_ON(INTEL_GEN(rq->i915) < 9);
> +
> +       cs = intel_ring_begin(rq, 4);
> +       if (IS_ERR(cs))
> +               return PTR_ERR(cs);
> +
> +       if (INTEL_GEN(rq->i915) >= 11)
> +               reg = ICL_HDC_MODE;
> +       else if (INTEL_GEN(rq->i915) >= 10)
> +               reg = CNL_HDC_CHICKEN0;
> +       else
> +               reg = HDC_CHICKEN0;
> +
> +       *cs++ = MI_LOAD_REGISTER_IMM(1);
> +       *cs++ = i915_mmio_reg_offset(reg);
> +       if (enable)
> +               *cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
> +       else
> +               *cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
> +       *cs++ = MI_NOOP;
> +
> +       intel_ring_advance(rq, cs);
> +
> +       return 0;
> +}
> +
> +static int
> +intel_lr_context_update_data_port_coherency(struct i915_request *rq)
> +{
> +       struct i915_gem_context *ctx = rq->gem_context;
> +       bool enable = test_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
> +       int ret;
> +
> +       lockdep_assert_held(&rq->i915->drm.struct_mutex);
> +
> +       if (test_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags) == enable)
> +               return 0;
> +
> +       ret = emit_set_data_port_coherency(rq, enable);
> +
> +       if (!ret) {
> +               if (enable)
> +                       __set_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags);
> +               else
> +                       __clear_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags);
> +       }
> +
> +       return ret;
> +}
> +
>  static void unwind_wa_tail(struct i915_request *rq)
>  {
>         rq->tail = intel_ring_wrap(rq->ring, rq->wa_tail - WA_TAIL_BYTES);
> @@ -1965,7 +2021,7 @@ static int gen8_emit_flush_render(struct i915_request *request,
>                 i915_ggtt_offset(engine->scratch) + 2 * CACHELINE_BYTES;
>         bool vf_flush_wa = false, dc_flush_wa = false;
>         u32 *cs, flags = 0;
> -       int len;
> +       int err, len;
>  
>         flags |= PIPE_CONTROL_CS_STALL;
>  
> @@ -1996,6 +2052,12 @@ static int gen8_emit_flush_render(struct i915_request *request,
>                 /* WaForGAMHang:kbl */
>                 if (IS_KBL_REVID(request->i915, 0, KBL_REVID_B0))
>                         dc_flush_wa = true;
> +
> +               err = intel_lr_context_update_data_port_coherency(request);
> +               if (WARN_ON(err)) {
> +                       DRM_DEBUG("Data Port Coherency toggle failed.\n");
> +                       return err;
> +               }
>         }
>  
>         len = 6;
> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> index 298b2e1..7c9e153 100644
> --- a/include/uapi/drm/i915_drm.h
> +++ b/include/uapi/drm/i915_drm.h
> @@ -1486,6 +1486,16 @@ struct drm_i915_gem_context_param {
>  #define   I915_CONTEXT_MAX_USER_PRIORITY       1023 /* inclusive */
>  #define   I915_CONTEXT_DEFAULT_PRIORITY                0
>  #define   I915_CONTEXT_MIN_USER_PRIORITY       -1023 /* inclusive */
> +/*
> + * When data port level coherency is enabled, the GPU and CPU will both keep
> + * changes to memory content visible to each other as fast as possible, by
> + * forcing internal cache units to send memory writes to higher level caches
> + * immediately after writes. Only buffers with coherency requested within
> + * surface state, or specific stateless accesses will be affected by this
> + * option. Keeping data port coherency has a performance cost, and therefore
> + * it is by default disabled (see WaForceEnableNonCoherent).
> + */
> +#define I915_CONTEXT_PARAM_DATA_PORT_COHERENCY 0x7
>         __u64 value;
>  };
>  
> -- 
> 2.7.4
> 

end of thread, other threads:[~2018-10-16 13:59 UTC | newest]

Thread overview: 81+ messages
2018-03-19 12:37 [RFC v1] Data port coherency control for UMDs Tomasz Lis
2018-03-19 12:37 ` [RFC v1] drm/i915: Add Exec param to control data port coherency Tomasz Lis
2018-03-19 12:43   ` Chris Wilson
2018-03-19 14:14     ` Lis, Tomasz
2018-03-19 14:26       ` Chris Wilson
2018-03-20 17:23         ` Lis, Tomasz
2018-05-04  9:24           ` Joonas Lahtinen
2018-03-20 18:43       ` Oscar Mateo
2018-03-21 10:16         ` Chris Wilson
2018-03-21 19:42           ` Oscar Mateo
2018-03-27 17:41             ` Lis, Tomasz
2018-03-30 17:29   ` [PATCH " Tomasz Lis
2018-03-31 19:07     ` kbuild test robot
2018-04-11 15:46   ` [PATCH v2] " Tomasz Lis
2018-06-20 15:03   ` [PATCH v1] Second implementation of Data Port Coherency Tomasz Lis
2018-06-20 15:03     ` [PATCH v1] drm/i915: Add IOCTL Param to control data port coherency Tomasz Lis
2018-06-21  6:39       ` Joonas Lahtinen
2018-06-21 13:47         ` Lis, Tomasz
2018-07-18 13:03           ` Joonas Lahtinen
2018-06-21  7:05       ` Chris Wilson
2018-06-21 13:47         ` Lis, Tomasz
2018-06-21  7:31       ` Dunajski, Bartosz
2018-06-21  8:48         ` Joonas Lahtinen
2018-06-22 16:40           ` Dunajski, Bartosz
2018-07-18 13:12             ` Joonas Lahtinen
2018-07-18 13:27               ` Dunajski, Bartosz
2018-07-09 13:20   ` [PATCH v4] " Tomasz Lis
2018-07-09 13:48     ` Lionel Landwerlin
2018-07-09 14:03       ` Lis, Tomasz
2018-07-09 14:24         ` Lionel Landwerlin
2018-07-09 15:21           ` Lis, Tomasz
2018-07-09 16:28     ` Tvrtko Ursulin
2018-07-09 16:37       ` Chris Wilson
2018-07-10 17:32         ` Lis, Tomasz
2018-07-11  9:28           ` Tvrtko Ursulin
2018-07-10 18:03       ` Lis, Tomasz
2018-07-11 11:20         ` Lis, Tomasz
2018-07-12 15:10   ` [PATCH v5] " Tomasz Lis
2018-07-13 10:40     ` Tvrtko Ursulin
2018-07-13 17:44       ` Lis, Tomasz
2018-10-09 18:06   ` [PATCH v6] " Tomasz Lis
2018-10-10  7:29     ` Tvrtko Ursulin
2018-10-12 15:02   ` [PATCH v8] " Tomasz Lis
2018-10-15 12:52     ` Tvrtko Ursulin
2018-10-16 13:59     ` Joonas Lahtinen
2018-03-19 13:53 ` [RFC v1] Data port coherency control for UMDs Joonas Lahtinen
2018-03-19 16:09   ` Lis, Tomasz
2018-03-20 15:15   ` Dunajski, Bartosz
2018-03-21 10:02     ` Joonas Lahtinen
2018-03-26  9:46       ` Dunajski, Bartosz
2018-03-29  7:42         ` Joonas Lahtinen
2018-03-30  9:00           ` Dunajski, Bartosz
2018-04-04  9:18             ` Joonas Lahtinen
2018-04-11  9:15               ` Dunajski, Bartosz
2018-03-19 14:18 ` ✗ Fi.CI.CHECKPATCH: warning for drm/i915: Add Exec param to control data port coherency Patchwork
2018-03-19 14:34 ` ✓ Fi.CI.BAT: success " Patchwork
2018-03-19 16:48 ` ✗ Fi.CI.IGT: failure " Patchwork
2018-03-30 18:14 ` ✗ Fi.CI.CHECKPATCH: warning for drm/i915: Add Exec param to control data port coherency. (rev2) Patchwork
2018-03-30 18:30 ` ✓ Fi.CI.BAT: success " Patchwork
2018-03-30 19:59 ` ✗ Fi.CI.IGT: failure " Patchwork
2018-04-11 16:12 ` ✗ Fi.CI.CHECKPATCH: warning for drm/i915: Add Exec param to control data port coherency. (rev3) Patchwork
2018-04-11 16:29 ` ✓ Fi.CI.BAT: success " Patchwork
2018-04-11 20:02 ` ✗ Fi.CI.IGT: failure " Patchwork
2018-06-20 15:45 ` ✗ Fi.CI.CHECKPATCH: warning for drm/i915: Add Exec param to control data port coherency. (rev4) Patchwork
2018-06-20 16:00 ` ✓ Fi.CI.BAT: success " Patchwork
2018-06-20 21:01 ` ✗ Fi.CI.IGT: failure " Patchwork
2018-07-09 13:57 ` ✗ Fi.CI.CHECKPATCH: warning for drm/i915: Add Exec param to control data port coherency. (rev5) Patchwork
2018-07-09 13:58 ` ✗ Fi.CI.SPARSE: " Patchwork
2018-07-09 14:14 ` ✓ Fi.CI.BAT: success " Patchwork
2018-07-09 20:04 ` ✗ Fi.CI.IGT: failure " Patchwork
2018-07-12 15:18 ` ✗ Fi.CI.CHECKPATCH: warning for drm/i915: Add Exec param to control data port coherency. (rev6) Patchwork
2018-07-12 15:19 ` ✗ Fi.CI.SPARSE: " Patchwork
2018-07-12 15:34 ` ✓ Fi.CI.BAT: success " Patchwork
2018-10-09 18:27 ` ✗ Fi.CI.CHECKPATCH: warning for drm/i915: Add Exec param to control data port coherency. (rev7) Patchwork
2018-10-09 18:28 ` ✗ Fi.CI.SPARSE: " Patchwork
2018-10-09 18:52 ` ✓ Fi.CI.BAT: success " Patchwork
2018-10-09 21:44 ` ✗ Fi.CI.IGT: failure " Patchwork
2018-10-12 15:14 ` ✗ Fi.CI.CHECKPATCH: warning for drm/i915: Add Exec param to control data port coherency. (rev8) Patchwork
2018-10-12 15:15 ` ✗ Fi.CI.SPARSE: " Patchwork
2018-10-12 15:34 ` ✓ Fi.CI.BAT: success " Patchwork
2018-10-12 18:27 ` ✗ Fi.CI.IGT: failure " Patchwork
