* [PATCH 1/5] drm/i915: Expose the busyspin durations for i915_wait_request
From: Chris Wilson @ 2018-07-28 16:46 UTC
  To: intel-gfx; +Cc: Ben Widawsky, Eero Tamminen

An interesting discussion regarding "hybrid interrupt polling" for NVMe
came to the conclusion that the ideal busyspin before sleeping was half
of the expected request latency (and better still if the spin started
when the request was already halfway through). This suggested that we
too should look again at our tradeoff between spinning and waiting.
Currently, our spin simply tries to hide the cost of enabling the
interrupt, which is good to avoid penalising nop requests (i.e. test
throughput) and not much else. Studying real-world workloads suggests
that a spin of up to 500us can dramatically boost performance, but the
suggestion is that this is not from avoiding interrupt latency per se,
but from the secondary effects of sleeping, such as allowing the CPU to
reduce its cstate and to context switch away.

In a truly hybrid interrupt polling scheme, we would aim to sleep until
just before the request completed and then wake up in advance of the
interrupt to do a quick poll to handle completion. This is tricky for us
at the moment as we are not recording request times, and since we allow
preemption, our requests do not sit on as nicely ordered a timeline as
IO. However, the idea is interesting, for it will certainly help us
decide when busyspinning is worthwhile.
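
For illustration, a sketch of what such a hybrid wait might look like if
we had a duration predictor; predict_request_duration() and
request_runtime() below are hypothetical placeholders, not existing i915
code:

/* Hypothetical sketch of hybrid interrupt polling: sleep through most
 * of the predicted remaining request duration, then busy-poll for the
 * final stretch instead of taking the interrupt.
 */
static void hybrid_wait_sketch(struct i915_request *rq)
{
	ktime_t total = predict_request_duration(rq);	/* hypothetical */
	ktime_t done = request_runtime(rq);		/* hypothetical */
	s64 sleep_us = ktime_us_delta(total, done) / 2;

	if (sleep_us > 0)
		usleep_range(sleep_us, sleep_us + 10);

	while (!i915_request_completed(rq))	/* short final poll */
		cpu_relax();
}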

v2: Expose the spin setting via Kconfig options for easier adjustment
and testing.
v3: Don't get caught sneaking in a change to the busyspin parameters.
v4: Explain more about the "hybrid interrupt polling" scheme that we
want to migrate towards.

Suggested-by: Sagar Kamble <sagar.a.kamble@intel.com>
References: http://events.linuxfoundation.org/sites/events/files/slides/lemoal-nvme-polling-vault-2017-final_0.pdf
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Sagar Kamble <sagar.a.kamble@intel.com>
Cc: Eero Tamminen <eero.t.tamminen@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Ben Widawsky <ben@bwidawsk.net>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Michał Winiarski <michal.winiarski@intel.com>
Reviewed-by: Sagar Kamble <sagar.a.kamble@intel.com>
---
 drivers/gpu/drm/i915/Kconfig         |  6 +++++
 drivers/gpu/drm/i915/Kconfig.profile | 26 +++++++++++++++++++
 drivers/gpu/drm/i915/i915_request.c  | 39 +++++++++++++++++++++++++---
 3 files changed, 67 insertions(+), 4 deletions(-)
 create mode 100644 drivers/gpu/drm/i915/Kconfig.profile

diff --git a/drivers/gpu/drm/i915/Kconfig b/drivers/gpu/drm/i915/Kconfig
index 5c607f2c707b..387613f29cb0 100644
--- a/drivers/gpu/drm/i915/Kconfig
+++ b/drivers/gpu/drm/i915/Kconfig
@@ -132,3 +132,9 @@ depends on DRM_I915
 depends on EXPERT
 source drivers/gpu/drm/i915/Kconfig.debug
 endmenu
+
+menu "drm/i915 Profile Guided Optimisation"
+	visible if EXPERT
+	depends on DRM_I915
+	source drivers/gpu/drm/i915/Kconfig.profile
+endmenu
diff --git a/drivers/gpu/drm/i915/Kconfig.profile b/drivers/gpu/drm/i915/Kconfig.profile
new file mode 100644
index 000000000000..8a230eeb98df
--- /dev/null
+++ b/drivers/gpu/drm/i915/Kconfig.profile
@@ -0,0 +1,26 @@
+config DRM_I915_SPIN_REQUEST_IRQ
+	int
+	default 5 # microseconds
+	help
+	  Before sleeping waiting for a request (GPU operation) to complete,
+	  we may spend some time polling for its completion. As the IRQ may
+	  take a non-negligible time to set up, we do a short spin first to
+	  check if the request will complete in the time it would have taken
+	  us to enable the interrupt.
+
+	  May be 0 to disable the initial spin. In practice, we estimate
+	  the cost of enabling the interrupt (if currently disabled) to be
+	  a few microseconds.
+
+config DRM_I915_SPIN_REQUEST_CS
+	int
+	default 2 # microseconds
+	help
+	  After sleeping for a request (GPU operation) to complete, we will
+	  be woken up on the completion of every request prior to the one
+	  being waited on. For very short requests, going back to sleep and
+	  being woken up again may add considerably to the wakeup latency. To
+	  avoid incurring extra latency from the scheduler, we may choose to
+	  spin prior to sleeping again.
+
+	  May be 0 to disable spinning after being woken.
diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index 5c2c93cbab12..ca25c53f8daa 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -1336,8 +1336,32 @@ long i915_request_wait(struct i915_request *rq,
 	GEM_BUG_ON(!intel_wait_has_seqno(&wait));
 	GEM_BUG_ON(!i915_sw_fence_signaled(&rq->submit));
 
-	/* Optimistic short spin before touching IRQs */
-	if (__i915_spin_request(rq, wait.seqno, state, 5))
+	/*
+	 * Optimistic spin before touching IRQs.
+	 *
+	 * We may use a rather large value here to offset the penalty of
+	 * switching away from the active task. Frequently, the client will
+	 * wait upon an old swapbuffer to throttle itself to remain within a
+	 * frame of the gpu. If the client is running in lockstep with the gpu,
+	 * then it should not be waiting long at all, and a sleep now will incur
+	 * extra scheduler latency in producing the next frame. To try to
+	 * avoid adding the cost of enabling/disabling the interrupt to the
+	 * short wait, we first spin to see if the request would have completed
+	 * in the time taken to setup the interrupt.
+	 *
+	 * We need up to 5us to enable the irq, and up to 20us to hide the
+	 * scheduler latency of a context switch, ignoring the secondary
+	 * impacts from a context switch such as cache eviction.
+	 *
+	 * The scheme used for low-latency IO is called "hybrid interrupt
+	 * polling". The suggestion there is to sleep until just before you
+	 * expect to be woken by the device interrupt and then poll for its
+	 * completion. That requires having a good predictor for the request
+	 * duration, which we currently lack.
+	 */
+	if (CONFIG_DRM_I915_SPIN_REQUEST_IRQ &&
+	    __i915_spin_request(rq, wait.seqno, state,
+				CONFIG_DRM_I915_SPIN_REQUEST_IRQ))
 		goto complete;
 
 	set_current_state(state);
@@ -1396,8 +1420,15 @@ long i915_request_wait(struct i915_request *rq,
 		    __i915_wait_request_check_and_reset(rq))
 			continue;
 
-		/* Only spin if we know the GPU is processing this request */
-		if (__i915_spin_request(rq, wait.seqno, state, 2))
+		/*
+		 * A quick spin now we are on the CPU to offset the cost of
+		 * context switching away (and so spin for roughly the same as
+		 * the scheduler latency). We only spin if we know the GPU is
+		 * processing this request, and so likely to finish shortly.
+		 */
+		if (CONFIG_DRM_I915_SPIN_REQUEST_CS &&
+		    __i915_spin_request(rq, wait.seqno, state,
+					CONFIG_DRM_I915_SPIN_REQUEST_CS))
 			break;
 
 		if (!intel_wait_check_request(&wait, rq)) {
-- 
2.18.0


* [PATCH 2/5] drm/i915: Expose idle delays to Kconfig
From: Chris Wilson @ 2018-07-28 16:46 UTC
  To: intel-gfx

We want to expose the parameters controlling how long it takes for us to
notice and park the GPU after a stream of requests, in order to tune the
trade-off between power efficiency and latency on a mostly idle system.

v2: retire_work has two scheduling points, update them both

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/i915/Kconfig.profile | 23 +++++++++++++++++++++++
 drivers/gpu/drm/i915/i915_gem.c      |  8 +++++---
 drivers/gpu/drm/i915/i915_gem.h      | 13 +++++++++++++
 3 files changed, 41 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/Kconfig.profile b/drivers/gpu/drm/i915/Kconfig.profile
index 8a230eeb98df..63cb744d920d 100644
--- a/drivers/gpu/drm/i915/Kconfig.profile
+++ b/drivers/gpu/drm/i915/Kconfig.profile
@@ -24,3 +24,26 @@ config DRM_I915_SPIN_REQUEST_CS
 	  spin prior to sleeping again.
 
 	  May be 0 to disable spinning after being woken.
+
+config DRM_I915_GEM_RETIRE_DELAY
+	int
+	default 1000 # milliseconds
+	help
+	  We maintain a background job whose purpose is to keep cleaning up
+	  after userspace, and may be the first to spot an idle system. This
+	  parameter determines the interval between execution of the worker.
+
+	  To reduce the number of timer wakeups, large delays (of greater than
+	  a second) are rounded to the next walltime second to allow coalescing
+	  of multiple system timers into a single wakeup.
+
+config DRM_I915_GEM_PARK_DELAY
+	int
+	default 100 # milliseconds
+	help
+	  Before parking the engines and the GPU after the final request is
+	  retired, we may wait for a small delay to reduce the frequecy of
+	  having to park/unpark and so the latency in executing a new request.
+
+	  May be 0 to immediately start parking the engines after the last
+	  request.
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 460f256114f7..5b538a92631d 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -192,7 +192,9 @@ void i915_gem_park(struct drm_i915_private *i915)
 		return;
 
 	/* Defer the actual call to __i915_gem_park() to prevent ping-pongs */
-	mod_delayed_work(i915->wq, &i915->gt.idle_work, msecs_to_jiffies(100));
+	mod_delayed_work(i915->wq,
+			 &i915->gt.idle_work,
+			 msecs_to_jiffies(CONFIG_DRM_I915_GEM_PARK_DELAY));
 }
 
 void i915_gem_unpark(struct drm_i915_private *i915)
@@ -236,7 +238,7 @@ void i915_gem_unpark(struct drm_i915_private *i915)
 
 	queue_delayed_work(i915->wq,
 			   &i915->gt.retire_work,
-			   round_jiffies_up_relative(HZ));
+			   i915_gem_retire_delay());
 }
 
 int
@@ -3466,7 +3468,7 @@ i915_gem_retire_work_handler(struct work_struct *work)
 	if (READ_ONCE(dev_priv->gt.awake))
 		queue_delayed_work(dev_priv->wq,
 				   &dev_priv->gt.retire_work,
-				   round_jiffies_up_relative(HZ));
+				   i915_gem_retire_delay());
 }
 
 static void shrink_caches(struct drm_i915_private *i915)
diff --git a/drivers/gpu/drm/i915/i915_gem.h b/drivers/gpu/drm/i915/i915_gem.h
index e46592956872..c6b4971435dc 100644
--- a/drivers/gpu/drm/i915/i915_gem.h
+++ b/drivers/gpu/drm/i915/i915_gem.h
@@ -27,6 +27,8 @@
 
 #include <linux/bug.h>
 #include <linux/interrupt.h>
+#include <linux/jiffies.h>
+#include <linux/timer.h>
 
 struct drm_i915_private;
 
@@ -76,6 +78,17 @@ struct drm_i915_private;
 void i915_gem_park(struct drm_i915_private *i915);
 void i915_gem_unpark(struct drm_i915_private *i915);
 
+static inline unsigned long i915_gem_retire_delay(void)
+{
+	const unsigned long delay =
+		msecs_to_jiffies(CONFIG_DRM_I915_GEM_RETIRE_DELAY);
+
+	if (CONFIG_DRM_I915_GEM_RETIRE_DELAY >= 1000)
+		return round_jiffies_up_relative(delay);
+	else
+		return delay;
+}
+
 static inline void __tasklet_disable_sync_once(struct tasklet_struct *t)
 {
 	if (atomic_inc_return(&t->count) == 1)
-- 
2.18.0


* [PATCH 3/5] drm/i915: Increase busyspin limit before a context-switch
From: Chris Wilson @ 2018-07-28 16:46 UTC
  To: intel-gfx; +Cc: Ben Widawsky, Eero Tamminen

Looking at the distribution of i915_wait_request for a set of GL
benchmarks, we see:

broadwell# python bcc/tools/funclatency.py -u i915_wait_request
   usecs               : count     distribution
       0 -> 1          : 29184    |****************************************|
       2 -> 3          : 5767     |*******                                 |
       4 -> 7          : 3000     |****                                    |
       8 -> 15         : 491      |                                        |
      16 -> 31         : 140      |                                        |
      32 -> 63         : 203      |                                        |
      64 -> 127        : 543      |                                        |
     128 -> 255        : 881      |*                                       |
     256 -> 511        : 1209     |*                                       |
     512 -> 1023       : 1739     |**                                      |
    1024 -> 2047       : 22855    |*******************************         |
    2048 -> 4095       : 1725     |**                                      |
    4096 -> 8191       : 5813     |*******                                 |
    8192 -> 16383      : 5348     |*******                                 |
   16384 -> 32767      : 1000     |*                                       |
   32768 -> 65535      : 4400     |******                                  |
   65536 -> 131071     : 296      |                                        |
  131072 -> 262143     : 225      |                                        |
  262144 -> 524287     : 4        |                                        |
  524288 -> 1048575    : 1        |                                        |
 1048576 -> 2097151    : 1        |                                        |
 2097152 -> 4194303    : 1        |                                        |

broxton# python bcc/tools/funclatency.py -u i915_wait_request
   usecs               : count     distribution
       0 -> 1          : 5523     |*************************************   |
       2 -> 3          : 1340     |*********                               |
       4 -> 7          : 2100     |**************                          |
       8 -> 15         : 755      |*****                                   |
      16 -> 31         : 211      |*                                       |
      32 -> 63         : 53       |                                        |
      64 -> 127        : 71       |                                        |
     128 -> 255        : 113      |                                        |
     256 -> 511        : 262      |*                                       |
     512 -> 1023       : 358      |**                                      |
    1024 -> 2047       : 1105     |*******                                 |
    2048 -> 4095       : 848      |*****                                   |
    4096 -> 8191       : 1295     |********                                |
    8192 -> 16383      : 5894     |****************************************|
   16384 -> 32767      : 4270     |****************************            |
   32768 -> 65535      : 5622     |**************************************  |
   65536 -> 131071     : 306      |**                                      |
  131072 -> 262143     : 50       |                                        |
  262144 -> 524287     : 76       |                                        |
  524288 -> 1048575    : 34       |                                        |
 1048576 -> 2097151    : 0        |                                        |
 2097152 -> 4194303    : 1        |                                        |

Picking 20us for the context-switch busyspin has the dual advantage of
catching the most frequent short waits while avoiding the cost of a
context switch. 20us is the typical latency of two context switches,
i.e. the cost of taking the sleep, without the secondary effects such as
cache flushing.
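
To spell out the arithmetic, here is the above reasoning as a sketch;
the constants are the assumptions from this commit message, not measured
values:

/* Illustrative only: assumes a one-way scheduler/context-switch
 * latency of roughly 10us, per the reasoning above.
 */
#define SCHED_SWITCH_US	10			/* assumed, not measured */
#define SLEEP_COST_US	(2 * SCHED_SWITCH_US)	/* switch away + wake up */

static inline bool spin_is_cheaper(unsigned int expected_wait_us)
{
	/* If the request should complete before a sleep/wake round-trip
	 * could return to us, spinning burns no more CPU than sleeping
	 * would, while avoiding the wakeup latency entirely.
	 */
	return expected_wait_us <= SLEEP_COST_US;
}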

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Sagar Kamble <sagar.a.kamble@intel.com>
Cc: Eero Tamminen <eero.t.tamminen@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Ben Widawsky <ben@bwidawsk.net>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Michał Winiarski <michal.winiarski@intel.com>
---
 drivers/gpu/drm/i915/Kconfig.profile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/Kconfig.profile b/drivers/gpu/drm/i915/Kconfig.profile
index 63cb744d920d..de394dea4a14 100644
--- a/drivers/gpu/drm/i915/Kconfig.profile
+++ b/drivers/gpu/drm/i915/Kconfig.profile
@@ -14,7 +14,7 @@ config DRM_I915_SPIN_REQUEST_IRQ
 
 config DRM_I915_SPIN_REQUEST_CS
 	int
-	default 2 # microseconds
+	default 20 # microseconds
 	help
 	  After sleeping for a request (GPU operation) to complete, we will
 	  be woken up on the completion of every request prior to the one
-- 
2.18.0


* [PATCH 4/5] drm/i915: Increase initial busyspin limit
From: Chris Wilson @ 2018-07-28 16:46 UTC
  To: intel-gfx; +Cc: Ben Widawsky, Eero Tamminen

Here we bump the busyspin for the current request to avoid sleeping, and
with it the cost of both idling and downclocking the CPU.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Sagar Kamble <sagar.a.kamble@intel.com>
Cc: Eero Tamminen <eero.t.tamminen@intel.com>
Cc: Francisco Jerez <currojerez@riseup.net>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Ben Widawsky <ben@bwidawsk.net>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Michał Winiarski <michal.winiarski@intel.com>
---
 drivers/gpu/drm/i915/Kconfig.profile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/Kconfig.profile b/drivers/gpu/drm/i915/Kconfig.profile
index de394dea4a14..54e0ba4051e7 100644
--- a/drivers/gpu/drm/i915/Kconfig.profile
+++ b/drivers/gpu/drm/i915/Kconfig.profile
@@ -1,6 +1,6 @@
 config DRM_I915_SPIN_REQUEST_IRQ
 	int
-	default 5 # microseconds
+	default 250 # microseconds
 	help
 	  Before sleeping waiting for a request (GPU operation) to complete,
 	  we may spend some time polling for its completion. As the IRQ may
-- 
2.18.0


* [PATCH 5/5] drm/i915: Do not use iowait while waiting for the GPU
From: Chris Wilson @ 2018-07-28 16:46 UTC
  To: intel-gfx; +Cc: Eero Tamminen

A recent trend for cpufreq is to boost the CPU frequency for iowaiters,
in particular to benefit high-frequency I/O. We do the same and boost
the GPU clocks to try to minimise the time spent waiting for the GPU.
However, as the igfx and CPU share the same TDP, boosting the CPU
frequency will result in the GPU being throttled and its frequency being
reduced. Thus declaring iowait negatively impacts GPU throughput.
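
For reference, the difference between the two calls is roughly the
following; a simplified sketch of the core scheduler helpers of this
era, not a verbatim copy:

/* Simplified: io_schedule_timeout() is schedule_timeout() plus iowait
 * accounting. It is that accounting which lets cpufreq governors (e.g.
 * intel_pstate's HWP boost) raise the CPU frequency on the IO wakeup
 * path - exactly the behaviour this patch wants to opt out of.
 */
long io_schedule_timeout_sketch(long timeout)
{
	int token = io_schedule_prepare();	/* marks task as iowaiter */

	timeout = schedule_timeout(timeout);
	io_schedule_finish(token);

	return timeout;
}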

v2: Both sleeps!

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=107410
References: 52ccc4314293 ("cpufreq: intel_pstate: HWP boost performance on IO wakeup")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Eero Tamminen <eero.t.tamminen@intel.com>
Cc: Francisco Jerez <currojerez@riseup.net>
---
 drivers/gpu/drm/i915/i915_request.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index ca25c53f8daa..694269e5b761 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -1330,7 +1330,7 @@ long i915_request_wait(struct i915_request *rq,
 			goto complete;
 		}
 
-		timeout = io_schedule_timeout(timeout);
+		timeout = schedule_timeout(timeout);
 	} while (1);
 
 	GEM_BUG_ON(!intel_wait_has_seqno(&wait));
@@ -1387,7 +1387,7 @@ long i915_request_wait(struct i915_request *rq,
 			break;
 		}
 
-		timeout = io_schedule_timeout(timeout);
+		timeout = schedule_timeout(timeout);
 
 		if (intel_wait_complete(&wait) &&
 		    intel_wait_check_request(&wait, rq))
-- 
2.18.0


* ✗ Fi.CI.CHECKPATCH: warning for series starting with [1/5] drm/i915: Expose the busyspin durations for i915_wait_request
From: Patchwork @ 2018-07-28 16:54 UTC
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: series starting with [1/5] drm/i915: Expose the busyspin durations for i915_wait_request
URL   : https://patchwork.freedesktop.org/series/47390/
State : warning

== Summary ==

$ dim checkpatch origin/drm-tip
22828401a503 drm/i915: Expose the busyspin durations for i915_wait_request
-:36: WARNING:COMMIT_LOG_LONG_LINE: Possible unwrapped commit description (prefer a maximum 75 chars per line)
#36: 
References: http://events.linuxfoundation.org/sites/events/files/slides/lemoal-nvme-polling-vault-2017-final_0.pdf

-:61: WARNING:FILE_PATH_CHANGES: added, moved or deleted file(s), does MAINTAINERS need updating?
#61: 
new file mode 100644

total: 0 errors, 2 warnings, 0 checks, 86 lines checked
d6ae4c2794a5 drm/i915: Expose idle delays to Kconfig
98ef93ba5efa drm/i915: Increase busyspin limit before a context-switch
-:14: WARNING:COMMIT_LOG_LONG_LINE: Possible unwrapped commit description (prefer a maximum 75 chars per line)
#14: 
       0 -> 1          : 29184    |****************************************|

total: 0 errors, 1 warnings, 0 checks, 8 lines checked
a45aa1f25f62 drm/i915: Increase initial busyspin limit
d0c9ef9adcd0 drm/i915: Do not use iowait while waiting for the GPU
-:16: WARNING:COMMIT_LOG_LONG_LINE: Possible unwrapped commit description (prefer a maximum 75 chars per line)
#16: 
References: 52ccc4314293 ("cpufreq: intel_pstate: HWP boost performance on IO wakeup")

-:16: ERROR:GIT_COMMIT_ID: Please use git commit description style 'commit <12+ chars of sha1> ("<title line>")' - ie: 'commit 52ccc4314293 ("cpufreq: intel_pstate: HWP boost performance on IO wakeup")'
#16: 
References: 52ccc4314293 ("cpufreq: intel_pstate: HWP boost performance on IO wakeup")

total: 1 errors, 1 warnings, 0 checks, 16 lines checked


* ✓ Fi.CI.BAT: success for series starting with [1/5] drm/i915: Expose the busyspin durations for i915_wait_request
From: Patchwork @ 2018-07-28 17:16 UTC
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: series starting with [1/5] drm/i915: Expose the busyspin durations for i915_wait_request
URL   : https://patchwork.freedesktop.org/series/47390/
State : success

== Summary ==

= CI Bug Log - changes from CI_DRM_4581 -> Patchwork_9803 =

== Summary - SUCCESS ==

  No regressions found.

  External URL: https://patchwork.freedesktop.org/api/1.0/series/47390/revisions/1/mbox/

== Known issues ==

  Here are the changes found in Patchwork_9803 that come from known issues:

  === IGT changes ===

    ==== Issues hit ====

    igt@kms_flip@basic-flip-vs-wf_vblank:
      {fi-bsw-kefka}:     PASS -> FAIL (fdo#100368)

    
    ==== Possible fixes ====

    igt@debugfs_test@read_all_entries:
      fi-snb-2520m:       INCOMPLETE (fdo#103713) -> PASS

    igt@drv_selftest@live_hangcheck:
      {fi-cfl-8109u}:     DMESG-FAIL (fdo#106560) -> PASS

    igt@drv_selftest@live_workarounds:
      fi-skl-6700k2:      DMESG-FAIL (fdo#107292) -> PASS

    igt@kms_chamelium@hdmi-hpd-fast:
      fi-kbl-7500u:       FAIL (fdo#103841, fdo#102672) -> SKIP

    igt@kms_pipe_crc_basic@suspend-read-crc-pipe-c:
      fi-bxt-dsi:         INCOMPLETE (fdo#103927) -> PASS

    
    ==== Warnings ====

    igt@drv_selftest@live_workarounds:
      fi-cnl-psr:         DMESG-FAIL (fdo#107292) -> DMESG-WARN (fdo#105395)

    {igt@kms_psr@primary_page_flip}:
      fi-cnl-psr:         DMESG-FAIL (fdo#107372) -> DMESG-WARN (fdo#107372)

    
  {name}: This element is suppressed. This means it is ignored when computing
          the status of the difference (SUCCESS, WARNING, or FAILURE).

  fdo#100368 https://bugs.freedesktop.org/show_bug.cgi?id=100368
  fdo#102672 https://bugs.freedesktop.org/show_bug.cgi?id=102672
  fdo#103713 https://bugs.freedesktop.org/show_bug.cgi?id=103713
  fdo#103841 https://bugs.freedesktop.org/show_bug.cgi?id=103841
  fdo#103927 https://bugs.freedesktop.org/show_bug.cgi?id=103927
  fdo#105395 https://bugs.freedesktop.org/show_bug.cgi?id=105395
  fdo#106560 https://bugs.freedesktop.org/show_bug.cgi?id=106560
  fdo#107292 https://bugs.freedesktop.org/show_bug.cgi?id=107292
  fdo#107372 https://bugs.freedesktop.org/show_bug.cgi?id=107372


== Participating hosts (48 -> 44) ==

  Additional (1): fi-hsw-peppy 
  Missing    (5): fi-byt-clapper fi-ilk-m540 fi-byt-squawks fi-bsw-cyan fi-hsw-4200u 


== Build changes ==

    * Linux: CI_DRM_4581 -> Patchwork_9803

  CI_DRM_4581: a85b6922ba20d1b4b527bfd5d67224225a846dac @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_4580: f1c868dae24056ebc27e4f3c197724ce9b956a8a @ git://anongit.freedesktop.org/xorg/app/intel-gpu-tools
  Patchwork_9803: d0c9ef9adcd08c0c250482bba811cbc790d4ac44 @ git://anongit.freedesktop.org/gfx-ci/linux


== Linux commits ==

d0c9ef9adcd0 drm/i915: Do not use iowait while waiting for the GPU
a45aa1f25f62 drm/i915: Increase initial busyspin limit
98ef93ba5efa drm/i915: Increase busyspin limit before a context-switch
d6ae4c2794a5 drm/i915: Expose idle delays to Kconfig
22828401a503 drm/i915: Expose the busyspin durations for i915_wait_request

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_9803/issues.html

* ✓ Fi.CI.IGT: success for series starting with [1/5] drm/i915: Expose the busyspin durations for i915_wait_request
From: Patchwork @ 2018-07-28 19:25 UTC
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: series starting with [1/5] drm/i915: Expose the busyspin durations for i915_wait_request
URL   : https://patchwork.freedesktop.org/series/47390/
State : success

== Summary ==

= CI Bug Log - changes from CI_DRM_4581_full -> Patchwork_9803_full =

== Summary - WARNING ==

  Minor unknown changes coming with Patchwork_9803_full need to be verified
  manually.
  
  If you think the reported changes have nothing to do with the changes
  introduced in Patchwork_9803_full, please notify your bug team to allow them
  to document this new failure mode, which will reduce false positives in CI.

  

== Possible new issues ==

  Here are the unknown changes that may have been introduced in Patchwork_9803_full:

  === IGT changes ===

    ==== Warnings ====

    igt@drv_selftest@live_guc:
      shard-kbl:          SKIP -> PASS +1

    
== Known issues ==

  Here are the changes found in Patchwork_9803_full that come from known issues:

  === IGT changes ===

    ==== Issues hit ====

    igt@kms_flip@2x-flip-vs-expired-vblank:
      shard-glk:          PASS -> FAIL (fdo#105363)

    igt@kms_flip@2x-plain-flip-fb-recreate:
      shard-glk:          PASS -> FAIL (fdo#100368)

    
    ==== Possible fixes ====

    igt@drv_selftest@live_hangcheck:
      shard-kbl:          DMESG-FAIL (fdo#106560, fdo#106947) -> PASS

    igt@kms_flip@plain-flip-fb-recreate-interruptible:
      shard-glk:          FAIL (fdo#100368) -> PASS

    
  fdo#100368 https://bugs.freedesktop.org/show_bug.cgi?id=100368
  fdo#105363 https://bugs.freedesktop.org/show_bug.cgi?id=105363
  fdo#106560 https://bugs.freedesktop.org/show_bug.cgi?id=106560
  fdo#106947 https://bugs.freedesktop.org/show_bug.cgi?id=106947


== Participating hosts (5 -> 5) ==

  No changes in participating hosts


== Build changes ==

    * Linux: CI_DRM_4581 -> Patchwork_9803

  CI_DRM_4581: a85b6922ba20d1b4b527bfd5d67224225a846dac @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_4580: f1c868dae24056ebc27e4f3c197724ce9b956a8a @ git://anongit.freedesktop.org/xorg/app/intel-gpu-tools
  Patchwork_9803: d0c9ef9adcd08c0c250482bba811cbc790d4ac44 @ git://anongit.freedesktop.org/gfx-ci/linux
  piglit_4509: fdc5a4ca11124ab8413c7988896eec4c97336694 @ git://anongit.freedesktop.org/piglit

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_9803/shards.html

* Re: [PATCH 1/5] drm/i915: Expose the busyspin durations for i915_wait_request
From: Tvrtko Ursulin @ 2018-08-02 14:15 UTC
  To: Chris Wilson, intel-gfx; +Cc: Ben Widawsky, Eero Tamminen


On 28/07/2018 17:46, Chris Wilson wrote:
> An interesting discussion regarding "hybrid interrupt polling" for NVMe
> came to the conclusion that the ideal busyspin before sleeping was half
> of the expected request latency (and better if it was already halfway
> through that request). This suggested that we too should look again at
> our tradeoff between spinning and waiting. Currently, our spin simply
> tries to hide the cost of enabling the interrupt, which is good to avoid
> penalising nop requests (i.e. test throughput) and not much else.
> Studying real world workloads suggests that a spin of upto 500us can
> dramatically boost performance, but the suggestion is that this is not
> from avoiding interrupt latency per-se, but from secondary effects of
> sleeping such as allowing the CPU reduce cstate and context switch away.
> 
> In a truly hybrid interrupt polling scheme, we would aim to sleep until
> just before the request completed and then wake up in advance of the
> interrupt and do a quick poll to handle completion. This is tricky for
> ourselves at the moment as we are not recording request times, and since
> we allow preemption, our requests are not on as a nicely ordered
> timeline as IO. However, the idea is interesting, for it will certainly
> help us decide when busyspinning is worthwhile.
> 
> v2: Expose the spin setting via Kconfig options for easier adjustment
> and testing.
> v3: Don't get caught sneaking in a change to the busyspin parameters.
> v4: Explain more about the "hybrid interrupt polling" scheme that we
> want to migrate towards.
> 
> Suggested-by: Sagar Kamble <sagar.a.kamble@intel.com>
> References: http://events.linuxfoundation.org/sites/events/files/slides/lemoal-nvme-polling-vault-2017-final_0.pdf
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Sagar Kamble <sagar.a.kamble@intel.com>
> Cc: Eero Tamminen <eero.t.tamminen@intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Ben Widawsky <ben@bwidawsk.net>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Michał Winiarski <michal.winiarski@intel.com>
> Reviewed-by: Sagar Kamble <sagar.a.kamble@intel.com>
> ---
>   drivers/gpu/drm/i915/Kconfig         |  6 +++++
>   drivers/gpu/drm/i915/Kconfig.profile | 26 +++++++++++++++++++
>   drivers/gpu/drm/i915/i915_request.c  | 39 +++++++++++++++++++++++++---
>   3 files changed, 67 insertions(+), 4 deletions(-)
>   create mode 100644 drivers/gpu/drm/i915/Kconfig.profile
> 
> diff --git a/drivers/gpu/drm/i915/Kconfig b/drivers/gpu/drm/i915/Kconfig
> index 5c607f2c707b..387613f29cb0 100644
> --- a/drivers/gpu/drm/i915/Kconfig
> +++ b/drivers/gpu/drm/i915/Kconfig
> @@ -132,3 +132,9 @@ depends on DRM_I915
>   depends on EXPERT
>   source drivers/gpu/drm/i915/Kconfig.debug
>   endmenu
> +
> +menu "drm/i915 Profile Guided Optimisation"

Or something more generic like "drm/i915 Manual Tuning Options", so it
sounds less like an automated thing where you can feed the results of
something straight in?

> +	visible if EXPERT
> +	depends on DRM_I915
> +	source drivers/gpu/drm/i915/Kconfig.profile
> +endmenu
> diff --git a/drivers/gpu/drm/i915/Kconfig.profile b/drivers/gpu/drm/i915/Kconfig.profile
> new file mode 100644
> index 000000000000..8a230eeb98df
> --- /dev/null
> +++ b/drivers/gpu/drm/i915/Kconfig.profile
> @@ -0,0 +1,26 @@
> +config DRM_I915_SPIN_REQUEST_IRQ
> +	int
> +	default 5 # microseconds
> +	help
> +	  Before sleeping waiting for a request (GPU operation) to complete,
> +	  we may spend some time polling for its completion. As the IRQ may
> +	  take a non-negligible time to setup, we do a short spin first to
> +	  check if the request will complete in the time it would have taken
> +	  us to enable the interrupt.
> +
> +	  May be 0 to disable the initial spin. In practice, we estimate
> +	  the cost of enabling the interrupt (if currently disabled) to be
> +	  a few microseconds.
> +
> +config DRM_I915_SPIN_REQUEST_CS
> +	int
> +	default 2 # microseconds
> +	help
> +	  After sleeping for a request (GPU operation) to complete, we will
> +	  be woken up on the completion of every request prior to the one
> +	  being waited on. For very short requests, going back to sleep and
> +	  be woken up again may add considerably to the wakeup latency. To
> +	  avoid incurring extra latency from the scheduler, we may choose to
> +	  spin prior to sleeping again.
> +
> +	  May be 0 to disable spinning after being woken.
> diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
> index 5c2c93cbab12..ca25c53f8daa 100644
> --- a/drivers/gpu/drm/i915/i915_request.c
> +++ b/drivers/gpu/drm/i915/i915_request.c
> @@ -1336,8 +1336,32 @@ long i915_request_wait(struct i915_request *rq,
>   	GEM_BUG_ON(!intel_wait_has_seqno(&wait));
>   	GEM_BUG_ON(!i915_sw_fence_signaled(&rq->submit));
>   
> -	/* Optimistic short spin before touching IRQs */
> -	if (__i915_spin_request(rq, wait.seqno, state, 5))
> +	/*
> +	 * Optimistic spin before touching IRQs.
> +	 *
> +	 * We may use a rather large value here to offset the penalty of
> +	 * switching away from the active task. Frequently, the client will
> +	 * wait upon an old swapbuffer to throttle itself to remain within a
> +	 * frame of the gpu. If the client is running in lockstep with the gpu,
> +	 * then it should not be waiting long at all, and a sleep now will incur
> +	 * extra scheduler latency in producing the next frame. To try to
> +	 * avoid adding the cost of enabling/disabling the interrupt to the
> +	 * short wait, we first spin to see if the request would have completed
> +	 * in the time taken to setup the interrupt.
> +	 *
> +	 * We need upto 5us to enable the irq, and upto 20us to hide the
> +	 * scheduler latency of a context switch, ignoring the secondary
> +	 * impacts from a context switch such as cache eviction.
> +	 *
> +	 * The scheme used for low-latency IO is called "hybrid interrupt
> +	 * polling". The suggestion there is to sleep until just before you
> +	 * expect to be woken by the device interrupt and then poll for its
> +	 * completion. That requires having a good predictor for the request
> +	 * duration, which we currently lack.
> +	 */
> +	if (CONFIG_DRM_I915_SPIN_REQUEST_IRQ &&
> +	    __i915_spin_request(rq, wait.seqno, state,
> +				CONFIG_DRM_I915_SPIN_REQUEST_IRQ))
>   		goto complete;
>   
>   	set_current_state(state);
> @@ -1396,8 +1420,15 @@ long i915_request_wait(struct i915_request *rq,
>   		    __i915_wait_request_check_and_reset(rq))
>   			continue;
>   
> -		/* Only spin if we know the GPU is processing this request */
> -		if (__i915_spin_request(rq, wait.seqno, state, 2))
> +		/*
> +		 * A quick spin now we are on the CPU to offset the cost of
> +		 * context switching away (and so spin for roughly the same as
> +		 * the scheduler latency). We only spin if we know the GPU is
> +		 * processing this request, and so likely to finish shortly.
> +		 */
> +		if (CONFIG_DRM_I915_SPIN_REQUEST_CS &&
> +		    __i915_spin_request(rq, wait.seqno, state,
> +					CONFIG_DRM_I915_SPIN_REQUEST_CS))
>   			break;
>   
>   		if (!intel_wait_check_request(&wait, rq)) {
> 

Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

We could add PMU metrics to count hit/missed busy spins so that the
profile-guided bit becomes more direct?
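
As a very rough sketch of the sort of counting meant here, with
illustrative names only and assuming __i915_spin_request() keeps its
current signature:

/* Sketch: count how often the optimistic spin pays off, so that the
 * Kconfig values could be tuned from data rather than guesswork.
 */
static atomic_t i915_spin_hits, i915_spin_misses;

static bool i915_spin_request_counted(struct i915_request *rq, u32 seqno,
				      int state, unsigned long timeout_us)
{
	bool hit = __i915_spin_request(rq, seqno, state, timeout_us);

	atomic_inc(hit ? &i915_spin_hits : &i915_spin_misses);
	return hit;
}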

Regards,

Tvrtko

* Re: [PATCH 1/5] drm/i915: Expose the busyspin durations for i915_wait_request
From: Chris Wilson @ 2018-08-02 14:29 UTC
  To: Tvrtko Ursulin, intel-gfx; +Cc: Ben Widawsky, Eero Tamminen

Quoting Tvrtko Ursulin (2018-08-02 15:15:35)
> 
> On 28/07/2018 17:46, Chris Wilson wrote:
> > An interesting discussion regarding "hybrid interrupt polling" for NVMe
> > came to the conclusion that the ideal busyspin before sleeping was half
> > of the expected request latency (and better if it was already halfway
> > through that request). This suggested that we too should look again at
> > our tradeoff between spinning and waiting. Currently, our spin simply
> > tries to hide the cost of enabling the interrupt, which is good to avoid
> > penalising nop requests (i.e. test throughput) and not much else.
> > Studying real world workloads suggests that a spin of upto 500us can
> > dramatically boost performance, but the suggestion is that this is not
> > from avoiding interrupt latency per-se, but from secondary effects of
> > sleeping such as allowing the CPU reduce cstate and context switch away.
> > 
> > In a truly hybrid interrupt polling scheme, we would aim to sleep until
> > just before the request completed and then wake up in advance of the
> > interrupt and do a quick poll to handle completion. This is tricky for
> > ourselves at the moment as we are not recording request times, and since
> > we allow preemption, our requests are not on as a nicely ordered
> > timeline as IO. However, the idea is interesting, for it will certainly
> > help us decide when busyspinning is worthwhile.
> > 
> > v2: Expose the spin setting via Kconfig options for easier adjustment
> > and testing.
> > v3: Don't get caught sneaking in a change to the busyspin parameters.
> > v4: Explain more about the "hybrid interrupt polling" scheme that we
> > want to migrate towards.
> > 
> > Suggested-by: Sagar Kamble <sagar.a.kamble@intel.com>
> > References: http://events.linuxfoundation.org/sites/events/files/slides/lemoal-nvme-polling-vault-2017-final_0.pdf
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Sagar Kamble <sagar.a.kamble@intel.com>
> > Cc: Eero Tamminen <eero.t.tamminen@intel.com>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > Cc: Ben Widawsky <ben@bwidawsk.net>
> > Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> > Cc: Michał Winiarski <michal.winiarski@intel.com>
> > Reviewed-by: Sagar Kamble <sagar.a.kamble@intel.com>
> > ---
> >   drivers/gpu/drm/i915/Kconfig         |  6 +++++
> >   drivers/gpu/drm/i915/Kconfig.profile | 26 +++++++++++++++++++
> >   drivers/gpu/drm/i915/i915_request.c  | 39 +++++++++++++++++++++++++---
> >   3 files changed, 67 insertions(+), 4 deletions(-)
> >   create mode 100644 drivers/gpu/drm/i915/Kconfig.profile
> > 
> > diff --git a/drivers/gpu/drm/i915/Kconfig b/drivers/gpu/drm/i915/Kconfig
> > index 5c607f2c707b..387613f29cb0 100644
> > --- a/drivers/gpu/drm/i915/Kconfig
> > +++ b/drivers/gpu/drm/i915/Kconfig
> > @@ -132,3 +132,9 @@ depends on DRM_I915
> >   depends on EXPERT
> >   source drivers/gpu/drm/i915/Kconfig.debug
> >   endmenu
> > +
> > +menu "drm/i915 Profile Guided Optimisation"
> 
> Or something like a more generic drm/i915 Manual Tuning Options so it 
> sounds less like an automated thing where you can feed the results of 
> something straight in?

Hah, good call. PGO was taken right out of the gcc manual, so yeah
it's a bit misleading.
 
> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> 
> We could add PMU metrics to count hit/missed busy spins so profile 
> guided bit becomes more direct?

Sensible. I wonder though how much we can do already with kprobes?

Another one I think I'd like here is the DMA_LATENCY qos parameter. If
you haven't seen https://patchwork.freedesktop.org/patch/241571/ that's
worth throwing against media-bench.
-Chris
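
For context, a DMA_LATENCY qos request of the kind mentioned looks
roughly like the following; this is a sketch of the pm_qos interface of
that era, not the contents of the patch linked above:

#include <linux/pm_qos.h>

/* Sketch: keep CPUs out of deep C-states while a wait is outstanding,
 * so that the GPU interrupt is not paid for in C-state exit latency.
 */
static struct pm_qos_request i915_wait_qos;

static void i915_wait_qos_begin(void)
{
	/* 0 -> request the minimum possible CPU DMA latency */
	pm_qos_add_request(&i915_wait_qos, PM_QOS_CPU_DMA_LATENCY, 0);
}

static void i915_wait_qos_end(void)
{
	pm_qos_remove_request(&i915_wait_qos);
}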

* Re: [PATCH 2/5] drm/i915: Expose idle delays to Kconfig
From: Tvrtko Ursulin @ 2018-08-02 14:33 UTC
  To: Chris Wilson, intel-gfx


On 28/07/2018 17:46, Chris Wilson wrote:
> We want to expose the parameters for controlling how long it takes for
> us to notice and park the GPU after a stream of requests in order to try
> and tune the optimal power-efficiency vs latency of a mostly idle system.

Who do we expect to tweak these and what kind of gains have they reported?

> 
> v2: retire_work has two scheduling points, update them both
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> ---
>   drivers/gpu/drm/i915/Kconfig.profile | 23 +++++++++++++++++++++++
>   drivers/gpu/drm/i915/i915_gem.c      |  8 +++++---
>   drivers/gpu/drm/i915/i915_gem.h      | 13 +++++++++++++
>   3 files changed, 41 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/Kconfig.profile b/drivers/gpu/drm/i915/Kconfig.profile
> index 8a230eeb98df..63cb744d920d 100644
> --- a/drivers/gpu/drm/i915/Kconfig.profile
> +++ b/drivers/gpu/drm/i915/Kconfig.profile
> @@ -24,3 +24,26 @@ config DRM_I915_SPIN_REQUEST_CS
>   	  spin prior to sleeping again.
>   
>   	  May be 0 to disable spinning after being woken.
> +
> +config DRM_I915_GEM_RETIRE_DELAY
> +	int
> +	default 1000 # milliseconds
> +	help
> +	  We maintain a background job whose purpose is to keep cleaning up
> +	  after userspace, and may be the first to spot an idle system. This

"user space" since this is help text?

> +	  parameter determines the interval between execution of the worker.
> +
> +	  To reduce the number of timer wakeups, large delays (of greater than
> +	  a second) are rounded to the next walltime second to allow coalescing

Wake ups, wall time or with dashes?

> +	  of multiple system timers into a single wakeup.

Do you perhaps want to omit the bit about rounding to the next wall time 
second and just say it may not be exact, so as to reserve some internal 
freedom?

> +
> +config DRM_I915_GEM_PARK_DELAY
> +	int
> +	default 100 # milliseconds
> +	help
> +	  Before parking the engines and the GPU after the final request is
> +	  retired, we may wait for a small delay to reduce the frequecy of

typo in frequency

> +	  having to park/unpark and so the latency in executing a new request.
> +
> +	  May be 0 to immediately start parking the engines after the last
> +	  request.

I'd extend it to say "the last request is retired." so there is an even 
stronger hint that it is not after the request has actually completed, 
but when we get around to retiring it.

> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 460f256114f7..5b538a92631d 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -192,7 +192,9 @@ void i915_gem_park(struct drm_i915_private *i915)
>   		return;
>   
>   	/* Defer the actual call to __i915_gem_park() to prevent ping-pongs */
> -	mod_delayed_work(i915->wq, &i915->gt.idle_work, msecs_to_jiffies(100));
> +	mod_delayed_work(i915->wq,
> +			 &i915->gt.idle_work,
> +			 msecs_to_jiffies(CONFIG_DRM_I915_GEM_PARK_DELAY));
>   }
>   
>   void i915_gem_unpark(struct drm_i915_private *i915)
> @@ -236,7 +238,7 @@ void i915_gem_unpark(struct drm_i915_private *i915)
>   
>   	queue_delayed_work(i915->wq,
>   			   &i915->gt.retire_work,
> -			   round_jiffies_up_relative(HZ));
> +			   i915_gem_retire_delay());
>   }
>   
>   int
> @@ -3466,7 +3468,7 @@ i915_gem_retire_work_handler(struct work_struct *work)
>   	if (READ_ONCE(dev_priv->gt.awake))
>   		queue_delayed_work(dev_priv->wq,
>   				   &dev_priv->gt.retire_work,
> -				   round_jiffies_up_relative(HZ));
> +				   i915_gem_retire_delay());
>   }
>   
>   static void shrink_caches(struct drm_i915_private *i915)
> diff --git a/drivers/gpu/drm/i915/i915_gem.h b/drivers/gpu/drm/i915/i915_gem.h
> index e46592956872..c6b4971435dc 100644
> --- a/drivers/gpu/drm/i915/i915_gem.h
> +++ b/drivers/gpu/drm/i915/i915_gem.h
> @@ -27,6 +27,8 @@
>   
>   #include <linux/bug.h>
>   #include <linux/interrupt.h>
> +#include <linux/jiffies.h>
> +#include <linux/timer.h>
>   
>   struct drm_i915_private;
>   
> @@ -76,6 +78,17 @@ struct drm_i915_private;
>   void i915_gem_park(struct drm_i915_private *i915);
>   void i915_gem_unpark(struct drm_i915_private *i915);
>   
> +static inline unsigned long i915_gem_retire_delay(void)
> +{
> +	const unsigned long delay =
> +		msecs_to_jiffies(CONFIG_DRM_I915_GEM_RETIRE_DELAY);
> +
> +	if (CONFIG_DRM_I915_GEM_RETIRE_DELAY >= 1000)
> +		return round_jiffies_up_relative(delay);
> +	else
> +		return delay;
> +}
> +
>   static inline void __tasklet_disable_sync_once(struct tasklet_struct *t)
>   {
>   	if (atomic_inc_return(&t->count) == 1)
> 

Leave it to you to pick or not any of the tidies I suggested. Either way:

Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Regards,

Tvrtko

* Re: [PATCH 2/5] drm/i915: Expose idle delays to Kconfig
From: Chris Wilson @ 2018-08-02 14:40 UTC
  To: Tvrtko Ursulin, intel-gfx

Quoting Tvrtko Ursulin (2018-08-02 15:33:33)
> 
> On 28/07/2018 17:46, Chris Wilson wrote:
> > We want to expose the parameters for controlling how long it takes for
> > us to notice and park the GPU after a stream of requests in order to try
> > and tune the optimal power-efficiency vs latency of a mostly idle system.
> 
> Who do we expect to tweak these and what kind of gains have they reported?

These are more targeted at S0ix, which is a can of worms that I am
surprised hasn't been opened yet. Could it be that it all works better
than expected? Or, more likely, they just haven't been talking to us.

So these are definitely more optimistically opportunistic than
practical.

> > v2: retire_work has two scheduling points, update them both
> > 
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > ---
> >   drivers/gpu/drm/i915/Kconfig.profile | 23 +++++++++++++++++++++++
> >   drivers/gpu/drm/i915/i915_gem.c      |  8 +++++---
> >   drivers/gpu/drm/i915/i915_gem.h      | 13 +++++++++++++
> >   3 files changed, 41 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/Kconfig.profile b/drivers/gpu/drm/i915/Kconfig.profile
> > index 8a230eeb98df..63cb744d920d 100644
> > --- a/drivers/gpu/drm/i915/Kconfig.profile
> > +++ b/drivers/gpu/drm/i915/Kconfig.profile
> > @@ -24,3 +24,26 @@ config DRM_I915_SPIN_REQUEST_CS
> >         spin prior to sleeping again.
> >   
> >         May be 0 to disable spinning after being woken.
> > +
> > +config DRM_I915_GEM_RETIRE_DELAY
> > +     int
> > +     default 1000 # milliseconds
> > +     help
> > +       We maintain a background job whose purpose is to keep cleaning up
> > +       after userspace, and may be the first to spot an idle system. This
> 
> "user space" since this is help text?
> 
> > +       parameter determines the interval between execution of the worker.
> > +
> > +       To reduce the number of timer wakeups, large delays (of greater than
> > +       a second) are rounded to the next walltime second to allow coalescing
> 
> Wake ups, wall time or with dashes?
> 
> > +       of multiple system timers into a single wakeup.
> 
> Do you perhaps want to omit the bit about rounding to next wall time 
> second and just say it may not be exact, so to reserve some internal 
> freedom?

Sure, but it's help text, even less accurate than comments ;)
I think it's prudent to say that it is inexact.
> 
> > +
> > +config DRM_I915_GEM_PARK_DELAY
> > +     int
> > +     default 100 # milliseconds
> > +     help
> > +       Before parking the engines and the GPU after the final request is
> > +       retired, we may wait for a small delay to reduce the frequecy of
> 
> typo in frequency
> 
> > +       having to park/unpark and so the latency in executing a new request.
> > +
> > +       May be 0 to immediately start parking the engines after the last
> > +       request.
> 
> I'd add so it says "the last request is retired." so there is even more 
> hint that it is not after the request has actually completed but when we 
> went to retire it.

Ok.
-Chris

* Re: [PATCH 3/5] drm/i915: Increase busyspin limit before a context-switch
From: Tvrtko Ursulin @ 2018-08-02 14:40 UTC
  To: Chris Wilson, intel-gfx; +Cc: Ben Widawsky, Eero Tamminen


On 28/07/2018 17:46, Chris Wilson wrote:
> Looking at the distribution of i915_wait_request for a set of GL

What was the set?

> benchmarks, we see:
> 
> broadwell# python bcc/tools/funclatency.py -u i915_wait_request
>     usecs               : count     distribution
>         0 -> 1          : 29184    |****************************************|
>         2 -> 3          : 5767     |*******                                 |
>         4 -> 7          : 3000     |****                                    |
>         8 -> 15         : 491      |                                        |
>        16 -> 31         : 140      |                                        |
>        32 -> 63         : 203      |                                        |
>        64 -> 127        : 543      |                                        |
>       128 -> 255        : 881      |*                                       |
>       256 -> 511        : 1209     |*                                       |
>       512 -> 1023       : 1739     |**                                      |
>      1024 -> 2047       : 22855    |*******************************         |
>      2048 -> 4095       : 1725     |**                                      |
>      4096 -> 8191       : 5813     |*******                                 |
>      8192 -> 16383      : 5348     |*******                                 |
>     16384 -> 32767      : 1000     |*                                       |
>     32768 -> 65535      : 4400     |******                                  |
>     65536 -> 131071     : 296      |                                        |
>    131072 -> 262143     : 225      |                                        |
>    262144 -> 524287     : 4        |                                        |
>    524288 -> 1048575    : 1        |                                        |
>   1048576 -> 2097151    : 1        |                                        |
>   2097152 -> 4194303    : 1        |                                        |
> 
> broxton# python bcc/tools/funclatency.py -u i915_wait_request
>     usecs               : count     distribution
>         0 -> 1          : 5523     |*************************************   |
>         2 -> 3          : 1340     |*********                               |
>         4 -> 7          : 2100     |**************                          |
>         8 -> 15         : 755      |*****                                   |
>        16 -> 31         : 211      |*                                       |
>        32 -> 63         : 53       |                                        |
>        64 -> 127        : 71       |                                        |
>       128 -> 255        : 113      |                                        |
>       256 -> 511        : 262      |*                                       |
>       512 -> 1023       : 358      |**                                      |
>      1024 -> 2047       : 1105     |*******                                 |
>      2048 -> 4095       : 848      |*****                                   |
>      4096 -> 8191       : 1295     |********                                |
>      8192 -> 16383      : 5894     |****************************************|
>     16384 -> 32767      : 4270     |****************************            |
>     32768 -> 65535      : 5622     |**************************************  |
>     65536 -> 131071     : 306      |**                                      |
>    131072 -> 262143     : 50       |                                        |
>    262144 -> 524287     : 76       |                                        |
>    524288 -> 1048575    : 34       |                                        |
>   1048576 -> 2097151    : 0        |                                        |
>   2097152 -> 4194303    : 1        |                                        |
> 
> Picking 20us for the context-switch busyspin has the dual advantage of
> catching most frequent short waits while avoiding the cost of a context
> switch. 20us is a typical latency of 2 context-switches, i.e. the cost
> of taking the sleep, without the secondary effects of cache flushing.
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Sagar Kamble <sagar.a.kamble@intel.com>
> Cc: Eero Tamminen <eero.t.tamminen@intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Ben Widawsky <ben@bwidawsk.net>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Michał Winiarski <michal.winiarski@intel.com>
> ---
>   drivers/gpu/drm/i915/Kconfig.profile | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/Kconfig.profile b/drivers/gpu/drm/i915/Kconfig.profile
> index 63cb744d920d..de394dea4a14 100644
> --- a/drivers/gpu/drm/i915/Kconfig.profile
> +++ b/drivers/gpu/drm/i915/Kconfig.profile
> @@ -14,7 +14,7 @@ config DRM_I915_SPIN_REQUEST_IRQ
>   
>   config DRM_I915_SPIN_REQUEST_CS
>   	int
> -	default 2 # microseconds
> +	default 20 # microseconds
>   	help
>   	  After sleeping for a request (GPU operation) to complete, we will
>   	  be woken up on the completion of every request prior to the one
> 

I'd be more tempted to pick 10us given the histograms. It would avoid 
wasting cycles on Broadwell and keep the majority of the benefit on Broxton.

However, it also raises the question of whether we perhaps want to have 
this initialized per-platform at runtime. That would open the way to 
auto-tuning it, if the goal is to eliminate the low part of the histogram.
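
Rough sketch of what I mean (untested; the helper name and the
per-platform values are purely illustrative, not proposed numbers):

	/* hypothetical runtime selection of the second-spin budget */
	static unsigned long cs_spin_us(const struct drm_i915_private *i915)
	{
		if (IS_BROADWELL(i915))
			return 10; /* short waits dominate the histogram */
		if (IS_BROXTON(i915))
			return 20; /* longer tail before completion */

		return CONFIG_DRM_I915_SPIN_REQUEST_CS; /* Kconfig fallback */
	}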

Also, please add to the commit message what kind of perf/watt (or 
similar) effect we get on the benchmarks with it.

Regards,

Tvrtko

* Re: [PATCH 3/5] drm/i915: Increase busyspin limit before a context-switch
  2018-08-02 14:40   ` Tvrtko Ursulin
@ 2018-08-02 14:46     ` Tvrtko Ursulin
  2018-08-02 14:47     ` Chris Wilson
  1 sibling, 0 replies; 20+ messages in thread
From: Tvrtko Ursulin @ 2018-08-02 14:46 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx; +Cc: Ben Widawsky, Eero Tamminen


On 02/08/2018 15:40, Tvrtko Ursulin wrote:
> 
> On 28/07/2018 17:46, Chris Wilson wrote:
>> Looking at the distribution of i915_wait_request for a set of GL
> 
> What was the set?
> 
>> benchmarks, we see:
>>
>> broadwell# python bcc/tools/funclatency.py -u i915_wait_request
>>     usecs               : count     distribution
>>         0 -> 1          : 29184    |****************************************|
>>         2 -> 3          : 5767     |*******                                 |
>>         4 -> 7          : 3000     |****                                    |
>>         8 -> 15         : 491      |                                        |
>>        16 -> 31         : 140      |                                        |
>>        32 -> 63         : 203      |                                        |
>>        64 -> 127        : 543      |                                        |
>>       128 -> 255        : 881      |*                                       |
>>       256 -> 511        : 1209     |*                                       |
>>       512 -> 1023       : 1739     |**                                      |
>>      1024 -> 2047       : 22855    |*******************************         |
>>      2048 -> 4095       : 1725     |**                                      |
>>      4096 -> 8191       : 5813     |*******                                 |
>>      8192 -> 16383      : 5348     |*******                                 |
>>     16384 -> 32767      : 1000     |*                                       |
>>     32768 -> 65535      : 4400     |******                                  |
>>     65536 -> 131071     : 296      |                                        |
>>    131072 -> 262143     : 225      |                                        |
>>    262144 -> 524287     : 4        |                                        |
>>    524288 -> 1048575    : 1        |                                        |
>>   1048576 -> 2097151    : 1        |                                        |
>>   2097152 -> 4194303    : 1        |                                        |
>>
>> broxton# python bcc/tools/funclatency.py -u i915_wait_request
>>     usecs               : count     distribution
>>         0 -> 1          : 5523     |*************************************   |
>>         2 -> 3          : 1340     |*********                               |
>>         4 -> 7          : 2100     |**************                          |
>>         8 -> 15         : 755      |*****                                   |
>>        16 -> 31         : 211      |*                                       |
>>        32 -> 63         : 53       |                                        |
>>        64 -> 127        : 71       |                                        |
>>       128 -> 255        : 113      |                                        |
>>       256 -> 511        : 262      |*                                       |
>>       512 -> 1023       : 358      |**                                      |
>>      1024 -> 2047       : 1105     |*******                                 |
>>      2048 -> 4095       : 848      |*****                                   |
>>      4096 -> 8191       : 1295     |********                                |
>>      8192 -> 16383      : 5894     |****************************************|
>>     16384 -> 32767      : 4270     |****************************            |
>>     32768 -> 65535      : 5622     |**************************************  |
>>     65536 -> 131071     : 306      |**                                      |
>>    131072 -> 262143     : 50       |                                        |
>>    262144 -> 524287     : 76       |                                        |
>>    524288 -> 1048575    : 34       |                                        |
>>   1048576 -> 2097151    : 0        |                                        |
>>   2097152 -> 4194303    : 1        |                                        |
>>
>> Picking 20us for the context-switch busyspin has the dual advantage of
>> catching most frequent short waits while avoiding the cost of a context
>> switch. 20us is a typical latency of 2 context-switches, i.e. the cost
>> of taking the sleep, without the secondary effects of cache flushing.
>>
>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>> Cc: Sagar Kamble <sagar.a.kamble@intel.com>
>> Cc: Eero Tamminen <eero.t.tamminen@intel.com>
>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>> Cc: Ben Widawsky <ben@bwidawsk.net>
>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>> Cc: Michał Winiarski <michal.winiarski@intel.com>
>> ---
>>   drivers/gpu/drm/i915/Kconfig.profile | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/i915/Kconfig.profile b/drivers/gpu/drm/i915/Kconfig.profile
>> index 63cb744d920d..de394dea4a14 100644
>> --- a/drivers/gpu/drm/i915/Kconfig.profile
>> +++ b/drivers/gpu/drm/i915/Kconfig.profile
>> @@ -14,7 +14,7 @@ config DRM_I915_SPIN_REQUEST_IRQ
>>   config DRM_I915_SPIN_REQUEST_CS
>>       int
>> -    default 2 # microseconds
>> +    default 20 # microseconds
>>       help
>>         After sleeping for a request (GPU operation) to complete, we will
>>         be woken up on the completion of every request prior to the one
>>
> 
> I'd be more tempted to pick 10us given the histograms. It would avoid 
> wasting cycles on Broadwell and keep the majority of the benefit on 
> Broxton.

Actually the first spin is 5us, so are you sure bumping the second 
spin should be the first step? In other words, wouldn't bumping the 
first one to 10us eliminate most of the low bars from the histogram?
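
For reference, the shape of the wait we are discussing, roughly (from
memory, heavily abridged; the two knobs are the ones this series adds):

	/* first spin: try to hide the cost of arming the interrupt */
	if (__i915_spin_request(rq, wait.seqno, state,
				CONFIG_DRM_I915_SPIN_REQUEST_IRQ))
		goto complete;

	for (;;) {
		/* sleep; woken on the completion of every earlier request */
		timeout = io_schedule_timeout(timeout);

		/* second spin: ride out the request ahead of ours */
		if (__i915_spin_request(rq, wait.seqno, state,
					CONFIG_DRM_I915_SPIN_REQUEST_CS))
			break;
	}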

Regards,

Tvrtko

> However, it also raises the question of whether we perhaps want to have 
> this initialized per-platform at runtime. That would open the way to 
> auto-tuning it, if the goal is to eliminate the low part of the histogram.
> 
> Also, please add to the commit message what kind of perf/watt (or 
> similar) effect we get on the benchmarks with it.
> 
> Regards,
> 
> Tvrtko

* Re: [PATCH 3/5] drm/i915: Increase busyspin limit before a context-switch
  2018-08-02 14:40   ` Tvrtko Ursulin
  2018-08-02 14:46     ` Tvrtko Ursulin
@ 2018-08-02 14:47     ` Chris Wilson
  1 sibling, 0 replies; 20+ messages in thread
From: Chris Wilson @ 2018-08-02 14:47 UTC (permalink / raw)
  To: Tvrtko Ursulin, intel-gfx; +Cc: Ben Widawsky, Eero Tamminen

Quoting Tvrtko Ursulin (2018-08-02 15:40:27)
> 
> On 28/07/2018 17:46, Chris Wilson wrote:
> > Looking at the distribution of i915_wait_request for a set of GL
> 
> What was the set?
> 
> > benchmarks, we see:
> > 
> > broadwell# python bcc/tools/funclatency.py -u i915_wait_request
> >     usecs               : count     distribution
> >         0 -> 1          : 29184    |****************************************|
> >         2 -> 3          : 5767     |*******                                 |
> >         4 -> 7          : 3000     |****                                    |
> >         8 -> 15         : 491      |                                        |
> >        16 -> 31         : 140      |                                        |
> >        32 -> 63         : 203      |                                        |
> >        64 -> 127        : 543      |                                        |
> >       128 -> 255        : 881      |*                                       |
> >       256 -> 511        : 1209     |*                                       |
> >       512 -> 1023       : 1739     |**                                      |
> >      1024 -> 2047       : 22855    |*******************************         |
> >      2048 -> 4095       : 1725     |**                                      |
> >      4096 -> 8191       : 5813     |*******                                 |
> >      8192 -> 16383      : 5348     |*******                                 |
> >     16384 -> 32767      : 1000     |*                                       |
> >     32768 -> 65535      : 4400     |******                                  |
> >     65536 -> 131071     : 296      |                                        |
> >    131072 -> 262143     : 225      |                                        |
> >    262144 -> 524287     : 4        |                                        |
> >    524288 -> 1048575    : 1        |                                        |
> >   1048576 -> 2097151    : 1        |                                        |
> >   2097152 -> 4194303    : 1        |                                        |
> > 
> > broxton# python bcc/tools/funclatency.py -u i915_wait_request
> >     usecs               : count     distribution
> >         0 -> 1          : 5523     |*************************************   |
> >         2 -> 3          : 1340     |*********                               |
> >         4 -> 7          : 2100     |**************                          |
> >         8 -> 15         : 755      |*****                                   |
> >        16 -> 31         : 211      |*                                       |
> >        32 -> 63         : 53       |                                        |
> >        64 -> 127        : 71       |                                        |
> >       128 -> 255        : 113      |                                        |
> >       256 -> 511        : 262      |*                                       |
> >       512 -> 1023       : 358      |**                                      |
> >      1024 -> 2047       : 1105     |*******                                 |
> >      2048 -> 4095       : 848      |*****                                   |
> >      4096 -> 8191       : 1295     |********                                |
> >      8192 -> 16383      : 5894     |****************************************|
> >     16384 -> 32767      : 4270     |****************************            |
> >     32768 -> 65535      : 5622     |**************************************  |
> >     65536 -> 131071     : 306      |**                                      |
> >    131072 -> 262143     : 50       |                                        |
> >    262144 -> 524287     : 76       |                                        |
> >    524288 -> 1048575    : 34       |                                        |
> >   1048576 -> 2097151    : 0        |                                        |
> >   2097152 -> 4194303    : 1        |                                        |
> > 
> > Picking 20us for the context-switch busyspin has the dual advantage of
> > catching most frequent short waits while avoiding the cost of a context
> > switch. 20us is a typical latency of 2 context-switches, i.e. the cost
> > of taking the sleep, without the secondary effects of cache flushing.
> > 
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Sagar Kamble <sagar.a.kamble@intel.com>
> > Cc: Eero Tamminen <eero.t.tamminen@intel.com>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > Cc: Ben Widawsky <ben@bwidawsk.net>
> > Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> > Cc: Michał Winiarski <michal.winiarski@intel.com>
> > ---
> >   drivers/gpu/drm/i915/Kconfig.profile | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/Kconfig.profile b/drivers/gpu/drm/i915/Kconfig.profile
> > index 63cb744d920d..de394dea4a14 100644
> > --- a/drivers/gpu/drm/i915/Kconfig.profile
> > +++ b/drivers/gpu/drm/i915/Kconfig.profile
> > @@ -14,7 +14,7 @@ config DRM_I915_SPIN_REQUEST_IRQ
> >   
> >   config DRM_I915_SPIN_REQUEST_CS
> >       int
> > -     default 2 # microseconds
> > +     default 20 # microseconds
> >       help
> >         After sleeping for a request (GPU operation) to complete, we will
> >         be woken up on the completion of every request prior to the one
> > 
> 
> I'd be more tempted to pick 10us given the histograms. It would avoid 
> wasting cycles on Broadwell and keep the majority of the benefit on Broxton.
> 
> However, it also raises the question of whether we perhaps want to have 
> this initialized per-platform at runtime. That would open the way to 
> auto-tuning it, if the goal is to eliminate the low part of the histogram.
> 
> Also, please add to the commit message what kind of perf/watt (or 
> similar) effect we get on the benchmarks with it.

There's a reason why I keep sending it to the people supposed to be
interested in such things ;)

But honestly I don't value this patch much in the grand scheme of
things, since it caters to being woken up at the end of the previous
request with the expectation that the request of interest is super
short. That does not seem likely. I'd even float the opposite patch to
set it to 0 by default; iirc you suggested I was crazy for putting a
spin here in the first place.
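
i.e. the opposite patch would simply be:

	config DRM_I915_SPIN_REQUEST_CS
		int
		default 0 # microseconds (no second spin, go back to sleep)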

The only place where it might help is when some other process held onto
the first_waiter slot for too long.

The initial spin is much more interesting wrt stall latencies.
-Chris

* Re: [PATCH 4/5] drm/i915: Increase initial busyspin limit
  2018-07-28 16:46 ` [PATCH 4/5] drm/i915: Increase initial busyspin limit Chris Wilson
@ 2018-08-02 14:47   ` Tvrtko Ursulin
  2018-08-02 14:54     ` Chris Wilson
  0 siblings, 1 reply; 20+ messages in thread
From: Tvrtko Ursulin @ 2018-08-02 14:47 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx; +Cc: Ben Widawsky, Eero Tamminen


On 28/07/2018 17:46, Chris Wilson wrote:
> Here we bump the busyspin for the current request to avoid sleeping and
> the cost of both idling and downclocking the CPU.
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Sagar Kamble <sagar.a.kamble@intel.com>
> Cc: Eero Tamminen <eero.t.tamminen@intel.com>
> Cc: Francisco Jerez <currojerez@riseup.net>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Ben Widawsky <ben@bwidawsk.net>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Michał Winiarski <michal.winiarski@intel.com>
> ---
>   drivers/gpu/drm/i915/Kconfig.profile | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/Kconfig.profile b/drivers/gpu/drm/i915/Kconfig.profile
> index de394dea4a14..54e0ba4051e7 100644
> --- a/drivers/gpu/drm/i915/Kconfig.profile
> +++ b/drivers/gpu/drm/i915/Kconfig.profile
> @@ -1,6 +1,6 @@
>   config DRM_I915_SPIN_REQUEST_IRQ
>   	int
> -	default 5 # microseconds
> +	default 250 # microseconds
>   	help
>   	  Before sleeping waiting for a request (GPU operation) to complete,
>   	  we may spend some time polling for its completion. As the IRQ may
> 

But the histograms from the previous patch do not suggest a 250us busy 
spin gets us into interesting territory. And we have to know the 
distribution between IRQ and CS spins.

Regards,

Tvrtko

* Re: [PATCH 4/5] drm/i915: Increase initial busyspin limit
  2018-08-02 14:47   ` Tvrtko Ursulin
@ 2018-08-02 14:54     ` Chris Wilson
  2018-08-02 19:37       ` Rogozhkin, Dmitry V
  0 siblings, 1 reply; 20+ messages in thread
From: Chris Wilson @ 2018-08-02 14:54 UTC (permalink / raw)
  To: Tvrtko Ursulin, intel-gfx; +Cc: Ben Widawsky, Eero Tamminen

Quoting Tvrtko Ursulin (2018-08-02 15:47:37)
> 
> On 28/07/2018 17:46, Chris Wilson wrote:
> > Here we bump the busyspin for the current request to avoid sleeping and
> > the cost of both idling and downclocking the CPU.
> > 
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Sagar Kamble <sagar.a.kamble@intel.com>
> > Cc: Eero Tamminen <eero.t.tamminen@intel.com>
> > Cc: Francisco Jerez <currojerez@riseup.net>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > Cc: Ben Widawsky <ben@bwidawsk.net>
> > Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> > Cc: Michał Winiarski <michal.winiarski@intel.com>
> > ---
> >   drivers/gpu/drm/i915/Kconfig.profile | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/Kconfig.profile b/drivers/gpu/drm/i915/Kconfig.profile
> > index de394dea4a14..54e0ba4051e7 100644
> > --- a/drivers/gpu/drm/i915/Kconfig.profile
> > +++ b/drivers/gpu/drm/i915/Kconfig.profile
> > @@ -1,6 +1,6 @@
> >   config DRM_I915_SPIN_REQUEST_IRQ
> >       int
> > -     default 5 # microseconds
> > +     default 250 # microseconds
> >       help
> >         Before sleeping waiting for a request (GPU operation) to complete,
> >         we may spend some time polling for its completion. As the IRQ may
> > 
> 
> But the histograms from the previous patch do not suggest a 250us busy 
> spin gets us into interesting territory. And we have to know the 
> distribution between IRQ and CS spins.

This is for a different problem than the one presented in those
histograms. This is to hide the reclocking that occurs if we sleep.

The previous justification for the initial spin was to try to hide the
irq setup costs; this is to try to hide the sleep costs.

The purpose of this patch was just to dip a toe into the water of what
is possible and solicit feedback, since we are still waiting for the
panacea patches to improve cpufreq. (Which is why it had such a
lackluster justification.)
-Chris

* Re: [PATCH 5/5] drm/i915: Do not use iowait while waiting for the GPU
  2018-07-28 16:46 ` [PATCH 5/5] drm/i915: Do not use iowait while waiting for the GPU Chris Wilson
@ 2018-08-02 15:09   ` Tvrtko Ursulin
  0 siblings, 0 replies; 20+ messages in thread
From: Tvrtko Ursulin @ 2018-08-02 15:09 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx; +Cc: Eero Tamminen


On 28/07/2018 17:46, Chris Wilson wrote:
> A recent trend for cpufreq is to boost the CPU frequencies for
> iowaiters, in particular to benefit high frequency I/O. We do the same
> and boost the GPU clocks to try and minimise time spent waiting for the
> GPU. However, as the igfx and CPU share the same TDP, boosting the CPU
> frequency will result in the GPU being throttled and its frequency being
> reduced. Thus declaring iowait negatively impacts GPU throughput.

I see it has been discussed in another thread. My view here is that I do 
not know. But fiddling with this on our side does feel questionable.

If IO wait is waiting for a device then it is correct. If it interacts 
badly with cpufreq in the specifics of a shared TDP then I feel it should 
be solved elsewhere (cpufreq). I don't have a lot of ideas how, though. 
GPU usage has been tried by someone in the past but I don't know how 
that went.

Regards,

Tvrtko

> v2: Both sleeps!
> 
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=107410
> References: 52ccc4314293 ("cpufreq: intel_pstate: HWP boost performance on IO wakeup")
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Eero Tamminen <eero.t.tamminen@intel.com>
> Cc: Francisco Jerez <currojerez@riseup.net>
> ---
>   drivers/gpu/drm/i915/i915_request.c | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
> index ca25c53f8daa..694269e5b761 100644
> --- a/drivers/gpu/drm/i915/i915_request.c
> +++ b/drivers/gpu/drm/i915/i915_request.c
> @@ -1330,7 +1330,7 @@ long i915_request_wait(struct i915_request *rq,
>   			goto complete;
>   		}
>   
> -		timeout = io_schedule_timeout(timeout);
> +		timeout = schedule_timeout(timeout);
>   	} while (1);
>   
>   	GEM_BUG_ON(!intel_wait_has_seqno(&wait));
> @@ -1387,7 +1387,7 @@ long i915_request_wait(struct i915_request *rq,
>   			break;
>   		}
>   
> -		timeout = io_schedule_timeout(timeout);
> +		timeout = schedule_timeout(timeout);
>   
>   		if (intel_wait_complete(&wait) &&
>   		    intel_wait_check_request(&wait, rq))
> 

* Re: [PATCH 4/5] drm/i915: Increase initial busyspin limit
  2018-08-02 14:54     ` Chris Wilson
@ 2018-08-02 19:37       ` Rogozhkin, Dmitry V
  2018-08-02 20:21         ` Chris Wilson
  0 siblings, 1 reply; 20+ messages in thread
From: Rogozhkin, Dmitry V @ 2018-08-02 19:37 UTC (permalink / raw)
  To: Mrozek, Michal, intel-gfx, Danecki, Jacek, chris, tvrtko.ursulin
  Cc: ben, Tamminen, Eero T

On Thu, 2018-08-02 at 15:54 +0100, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2018-08-02 15:47:37)
> > 
> > On 28/07/2018 17:46, Chris Wilson wrote:
> > > Here we bump the busyspin for the current request to avoid sleeping
> > > and the cost of both idling and downclocking the CPU.
> > > 
> > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > > Cc: Sagar Kamble <sagar.a.kamble@intel.com>
> > > Cc: Eero Tamminen <eero.t.tamminen@intel.com>
> > > Cc: Francisco Jerez <currojerez@riseup.net>
> > > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > > Cc: Ben Widawsky <ben@bwidawsk.net>
> > > Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> > > Cc: Michał Winiarski <michal.winiarski@intel.com>
> > > ---
> > >   drivers/gpu/drm/i915/Kconfig.profile | 2 +-
> > >   1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/gpu/drm/i915/Kconfig.profile b/drivers/gpu/drm/i915/Kconfig.profile
> > > index de394dea4a14..54e0ba4051e7 100644
> > > --- a/drivers/gpu/drm/i915/Kconfig.profile
> > > +++ b/drivers/gpu/drm/i915/Kconfig.profile
> > > @@ -1,6 +1,6 @@
> > >   config DRM_I915_SPIN_REQUEST_IRQ
> > >       int
> > > -     default 5 # microseconds
> > > +     default 250 # microseconds
> > >       help
> > >         Before sleeping waiting for a request (GPU operation) to complete,
> > >         we may spend some time polling for its completion. As the IRQ may
> > > 
> > 
> > But the histograms from the previous patch do not suggest a 250us
> > busy spin gets us into interesting territory. And we have to know
> > the distribution between IRQ and CS spins.
> 
> This is for a different problem than the one presented in those
> histograms. This is to hide the reclocking that occurs if we sleep.
> 
> The previous justification for the initial spin was to try to hide the
> irq setup costs; this is to try to hide the sleep costs.
> 
> The purpose of this patch was just to dip a toe into the water of what
> is possible and solicit feedback, since we are still waiting for the
> panacea patches to improve cpufreq. (Which is why it had such a
> lackluster justification.)

This patch somewhat resembles an issue we just met in userspace for
opencl and media. Long story short, there was a CPU% regression between
two opencl releases, root-caused to the following. OpenCL rewrote the
code for how they wait for task completion. Essentially they have an
optimistic spin implemented on the userspace side, and the logic there
is quite complex: they commit to different durations of the spin
depending on the scenario. And they had some pretty long spins, like
10ms. This is a killer if opencl is coupled with media workloads,
especially when media is dominant. That's because the usual execution
time of media batches is pretty long; a few or a dozen milliseconds is
quite typical. As a result the opencl workload is really being
submitted and executed with significant delay, thus the optimistic spin
gives up and it comes with the CPU% cost. This story was discussed
here:
https://github.com/intel/compute-runtime/commit/d53e1c3979f6d822f55645bfcd753be1658f53ca#comments

I was actually surprised why they don't rely on the i915 wait function.
They responded: "I would gladly always use bo_wait ioctl, but it has
big overhead for short workloads." Thus the question: do we have an
opportunity to improve? Can we have some dynamic way to configure
spinning for particular requests/contexts? /if I understand the current
patch correctly, it introduces a kernel config-level setting rather
than some dynamic setting/
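
For illustration, the shape of what they do is roughly the following
(my own untested sketch, not the actual compute-runtime code; the real
thing polls a completion tag written by the GPU rather than calling an
ioctl in the loop, and the spin budget is scenario-dependent):

	#include <stdint.h>
	#include <sys/ioctl.h>
	#include <time.h>
	#include <drm/i915_drm.h>

	static uint64_t now_ns(void)
	{
		struct timespec ts;

		clock_gettime(CLOCK_MONOTONIC, &ts);
		return ts.tv_sec * UINT64_C(1000000000) + ts.tv_nsec;
	}

	/* spin optimistically for spin_us, then fall back to the wait ioctl */
	static int wait_bo(int fd, uint32_t handle, unsigned int spin_us)
	{
		const uint64_t deadline = now_ns() + spin_us * UINT64_C(1000);
		struct drm_i915_gem_busy busy = { .handle = handle };
		struct drm_i915_gem_wait wait = {
			.bo_handle = handle,
			.timeout_ns = -1, /* sleep for as long as it takes */
		};

		do {
			if (ioctl(fd, DRM_IOCTL_I915_GEM_BUSY, &busy) == 0 &&
			    !busy.busy)
				return 0; /* caught completion while spinning */
		} while (now_ns() < deadline);

		return ioctl(fd, DRM_IOCTL_I915_GEM_WAIT, &wait);
	}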

> -Chris

* Re: [PATCH 4/5] drm/i915: Increase initial busyspin limit
  2018-08-02 19:37       ` Rogozhkin, Dmitry V
@ 2018-08-02 20:21         ` Chris Wilson
  0 siblings, 0 replies; 20+ messages in thread
From: Chris Wilson @ 2018-08-02 20:21 UTC (permalink / raw)
  To: Danecki, Jacek, Mrozek, Michal, Rogozhkin, Dmitry V, intel-gfx,
	tvrtko.ursulin
  Cc: ben, Tamminen, Eero T

Quoting Rogozhkin, Dmitry V (2018-08-02 20:37:36)
> I was actually surprised why they don't rely on i915 wait function.
> They responded in was: "I would gladly always use bo_wait ioctl, but it
> has big overhead for short workloads." Thus the question: do we have an
> opportunity to improve? Can we have some dynamic way to configure
> spinning for the particular requests/contexts? /if I understand
> correctly current patch it introduces kernel config-level setting
> rather than some dynamic setting/

Better yet, tell us about the problem with the overhead!

The latency from batch to client (measuring RCS timestamp in the batch
to RCS timestamp in userspace) is 10us (bdw and the like), under
controlled conditions ;) Hmm, that does drop a bit for an RT client,
which also suggests that cpufreq is a factor there.

A roundtrip through the wait ioctl doing nothing is 100ns; the minimum
wait due to irq setup is around 7us (spin then mmio before detecting
completion immediately afterwards). If your measurements are orders of
magnitude worse, that too is enlightening.
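
One way to reproduce the 100ns figure (untested sketch; assumes fd and
handle for an already-idle bo, with N in the millions):

	/* time N no-op trips through the wait ioctl */
	struct drm_i915_gem_wait wait = {
		.bo_handle = handle,
		.timeout_ns = 0, /* bo is idle, so this returns at once */
	};
	struct timespec t0, t1;
	int i;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < N; i++)
		ioctl(fd, DRM_IOCTL_I915_GEM_WAIT, &wait);
	clock_gettime(CLOCK_MONOTONIC, &t1);

then divide the elapsed time by N.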

The justification for a 10ms spin is that sleeping (hitting the
scheduler) is unacceptable, which seems to be a wider-scale problem. For
that I recommend we look at https://patchwork.freedesktop.org/patch/241571/
to see if something like that helps us in the short term.
-Chris

end of thread, other threads:[~2018-08-02 20:21 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-07-28 16:46 [PATCH 1/5] drm/i915: Expose the busyspin durations for i915_wait_request Chris Wilson
2018-07-28 16:46 ` [PATCH 2/5] drm/i915: Expose idle delays to Kconfig Chris Wilson
2018-08-02 14:33   ` Tvrtko Ursulin
2018-08-02 14:40     ` Chris Wilson
2018-07-28 16:46 ` [PATCH 3/5] drm/i915: Increase busyspin limit before a context-switch Chris Wilson
2018-08-02 14:40   ` Tvrtko Ursulin
2018-08-02 14:46     ` Tvrtko Ursulin
2018-08-02 14:47     ` Chris Wilson
2018-07-28 16:46 ` [PATCH 4/5] drm/i915: Increase initial busyspin limit Chris Wilson
2018-08-02 14:47   ` Tvrtko Ursulin
2018-08-02 14:54     ` Chris Wilson
2018-08-02 19:37       ` Rogozhkin, Dmitry V
2018-08-02 20:21         ` Chris Wilson
2018-07-28 16:46 ` [PATCH 5/5] drm/i915: Do not use iowait while waiting for the GPU Chris Wilson
2018-08-02 15:09   ` Tvrtko Ursulin
2018-07-28 16:54 ` ✗ Fi.CI.CHECKPATCH: warning for series starting with [1/5] drm/i915: Expose the busyspin durations for i915_wait_request Patchwork
2018-07-28 17:16 ` ✓ Fi.CI.BAT: success " Patchwork
2018-07-28 19:25 ` ✓ Fi.CI.IGT: " Patchwork
2018-08-02 14:15 ` [PATCH 1/5] " Tvrtko Ursulin
2018-08-02 14:29   ` Chris Wilson
