* [PATCH 01/24] drm/i915/selftests: Pretend to be a gfx pci device
From: Chris Wilson @ 2017-05-18  9:46 UTC
  To: intel-gfx

Set the class on our mock PCI device to GFX. This should be useful for
utilities like intel-iommu that special-case gfx devices.
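
PCI_BASE_CLASS_DISPLAY occupies the top byte of the 24-bit class code,
hence the << 16 shift into the u32 pdev->class field in the hunk below.
A minimal sketch of the kind of test this enables, modelled on
intel-iommu's gfx check (treat the macro as illustrative rather than a
quote of that driver):

  #include <linux/pci.h>
  #include <linux/pci_ids.h>

  /* True for any device whose PCI base class is "display controller". */
  #define IS_GFX_DEVICE(pdev) \
          (((pdev)->class >> 16) == PCI_BASE_CLASS_DISPLAY)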

References: https://bugs.freedesktop.org/show_bug.cgi?id=101080
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/selftests/mock_gem_device.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/i915/selftests/mock_gem_device.c b/drivers/gpu/drm/i915/selftests/mock_gem_device.c
index f4edd4c6cb07..627e2aa09766 100644
--- a/drivers/gpu/drm/i915/selftests/mock_gem_device.c
+++ b/drivers/gpu/drm/i915/selftests/mock_gem_device.c
@@ -121,6 +121,7 @@ struct drm_i915_private *mock_gem_device(void)
 		goto err;
 
 	device_initialize(&pdev->dev);
+	pdev->class = PCI_BASE_CLASS_DISPLAY << 16;
 	pdev->dev.release = release_dev;
 	dev_set_name(&pdev->dev, "mock");
 	dma_coerce_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32));
-- 
2.11.0


* [PATCH 02/24] drm/i915: Mark CPU cache as dirty on every transition for CPU writes
From: Chris Wilson @ 2017-05-18  9:46 UTC
  To: intel-gfx; +Cc: Dongwon Kim

Currently, we only mark the CPU cache as dirty if we skip a clflush.
This leads to some confusion where we have to ask if the object is in
the write domain or missed a clflush. If we always mark the cache as
dirty, this becomes a much simpler question to answer.

The goal remains to do as few clflushes as required and to do them as
late as possible, in the hope of deferring the work to a kthread and not
blocking the caller (e.g. execbuf, flips).

v2: Always call clflush before GPU execution when the cache_dirty flag
is set. This may cause some extra work on LLC systems that migrate dirty
buffers back and forth, but we do try to limit that by only setting
cache_dirty at the end of the GPU sequence.

v3: Always mark the cache as dirty upon a level change, as we need to
invalidate any stale cachelines due to external writes.
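
A minimal sketch of the invariant the patch establishes; the helper
names here are illustrative, with __start_cpu_write in the hunk below
being the real equivalent:

  /* Every transition that allows CPU writes marks the cache dirty... */
  static void begin_cpu_write(struct drm_i915_gem_object *obj)
  {
          obj->base.read_domains = I915_GEM_DOMAIN_CPU;
          obj->base.write_domain = I915_GEM_DOMAIN_CPU;
          if (cpu_write_needs_clflush(obj))
                  obj->cache_dirty = true;
  }

  /* ...so consumers ask a single question before handing to the GPU. */
  static void flush_before_gpu(struct drm_i915_gem_object *obj)
  {
          if (obj->cache_dirty)
                  i915_gem_clflush_object(obj, 0);
  }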

Reported-by: Dongwon Kim <dongwon.kim@intel.com>
Fixes: a6a7cc4b7db6 ("drm/i915: Always flush the dirty CPU cache when pinning the scanout")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Dongwon Kim <dongwon.kim@intel.com>
Cc: Matt Roper <matthew.d.roper@intel.com>
Tested-by: Dongwon Kim <dongwon.kim@intel.com>
---
 drivers/gpu/drm/i915/i915_gem.c                  | 76 ++++++++++++++----------
 drivers/gpu/drm/i915/i915_gem_clflush.c          | 15 +++--
 drivers/gpu/drm/i915/i915_gem_execbuffer.c       | 21 +++----
 drivers/gpu/drm/i915/i915_gem_internal.c         |  3 +-
 drivers/gpu/drm/i915/i915_gem_userptr.c          |  5 +-
 drivers/gpu/drm/i915/selftests/huge_gem_object.c |  3 +-
 6 files changed, 67 insertions(+), 56 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 02adf8241394..155dd52f2d18 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -49,7 +49,7 @@ static void i915_gem_flush_free_objects(struct drm_i915_private *i915);
 
 static bool cpu_write_needs_clflush(struct drm_i915_gem_object *obj)
 {
-	if (obj->base.write_domain == I915_GEM_DOMAIN_CPU)
+	if (obj->cache_dirty)
 		return false;
 
 	if (!i915_gem_object_is_coherent(obj))
@@ -233,6 +233,14 @@ i915_gem_object_get_pages_phys(struct drm_i915_gem_object *obj)
 	return st;
 }
 
+static void __start_cpu_write(struct drm_i915_gem_object *obj)
+{
+	obj->base.read_domains = I915_GEM_DOMAIN_CPU;
+	obj->base.write_domain = I915_GEM_DOMAIN_CPU;
+	if (cpu_write_needs_clflush(obj))
+		obj->cache_dirty = true;
+}
+
 static void
 __i915_gem_object_release_shmem(struct drm_i915_gem_object *obj,
 				struct sg_table *pages,
@@ -248,8 +256,7 @@ __i915_gem_object_release_shmem(struct drm_i915_gem_object *obj,
 	    !i915_gem_object_is_coherent(obj))
 		drm_clflush_sg(pages);
 
-	obj->base.read_domains = I915_GEM_DOMAIN_CPU;
-	obj->base.write_domain = I915_GEM_DOMAIN_CPU;
+	__start_cpu_write(obj);
 }
 
 static void
@@ -684,6 +691,12 @@ i915_gem_dumb_create(struct drm_file *file,
 			       args->size, &args->handle);
 }
 
+static bool gpu_write_needs_clflush(struct drm_i915_gem_object *obj)
+{
+	return !(obj->cache_level == I915_CACHE_NONE ||
+		 obj->cache_level == I915_CACHE_WT);
+}
+
 /**
  * Creates a new mm object and returns a handle to it.
  * @dev: drm device pointer
@@ -753,6 +766,11 @@ flush_write_domain(struct drm_i915_gem_object *obj, unsigned int flush_domains)
 	case I915_GEM_DOMAIN_CPU:
 		i915_gem_clflush_object(obj, I915_CLFLUSH_SYNC);
 		break;
+
+	case I915_GEM_DOMAIN_RENDER:
+		if (gpu_write_needs_clflush(obj))
+			obj->cache_dirty = true;
+		break;
 	}
 
 	obj->base.write_domain = 0;
@@ -854,7 +872,8 @@ int i915_gem_obj_prepare_shmem_read(struct drm_i915_gem_object *obj,
 	 * optimizes for the case when the gpu will dirty the data
 	 * anyway again before the next pread happens.
 	 */
-	if (!(obj->base.read_domains & I915_GEM_DOMAIN_CPU))
+	if (!obj->cache_dirty &&
+	    !(obj->base.read_domains & I915_GEM_DOMAIN_CPU))
 		*needs_clflush = CLFLUSH_BEFORE;
 
 out:
@@ -906,14 +925,16 @@ int i915_gem_obj_prepare_shmem_write(struct drm_i915_gem_object *obj,
 	 * This optimizes for the case when the gpu will use the data
 	 * right away and we therefore have to clflush anyway.
 	 */
-	if (obj->base.write_domain != I915_GEM_DOMAIN_CPU)
+	if (!obj->cache_dirty) {
 		*needs_clflush |= CLFLUSH_AFTER;
 
-	/* Same trick applies to invalidate partially written cachelines read
-	 * before writing.
-	 */
-	if (!(obj->base.read_domains & I915_GEM_DOMAIN_CPU))
-		*needs_clflush |= CLFLUSH_BEFORE;
+		/*
+		 * Same trick applies to invalidate partially written
+		 * cachelines read before writing.
+		 */
+		if (!(obj->base.read_domains & I915_GEM_DOMAIN_CPU))
+			*needs_clflush |= CLFLUSH_BEFORE;
+	}
 
 out:
 	intel_fb_obj_invalidate(obj, ORIGIN_CPU);
@@ -3372,10 +3393,13 @@ int i915_gem_wait_for_idle(struct drm_i915_private *i915, unsigned int flags)
 
 static void __i915_gem_object_flush_for_display(struct drm_i915_gem_object *obj)
 {
-	if (obj->base.write_domain != I915_GEM_DOMAIN_CPU && !obj->cache_dirty)
-		return;
-
-	i915_gem_clflush_object(obj, I915_CLFLUSH_FORCE);
+	/*
+	 * We manually flush the CPU domain so that we can override and
+	 * force the flush for the display, and perform it asynchronously.
+	 */
+	flush_write_domain(obj, ~I915_GEM_DOMAIN_CPU);
+	if (obj->cache_dirty)
+		i915_gem_clflush_object(obj, I915_CLFLUSH_FORCE);
 	obj->base.write_domain = 0;
 }
 
@@ -3634,13 +3658,10 @@ int i915_gem_object_set_cache_level(struct drm_i915_gem_object *obj,
 		}
 	}
 
-	if (obj->base.write_domain == I915_GEM_DOMAIN_CPU &&
-	    i915_gem_object_is_coherent(obj))
-		obj->cache_dirty = true;
-
 	list_for_each_entry(vma, &obj->vma_list, obj_link)
 		vma->node.color = cache_level;
 	obj->cache_level = cache_level;
+	obj->cache_dirty = true; /* Always invalidate stale cachelines */
 
 	return 0;
 }
@@ -3862,9 +3883,6 @@ i915_gem_object_set_to_cpu_domain(struct drm_i915_gem_object *obj, bool write)
 	if (ret)
 		return ret;
 
-	if (obj->base.write_domain == I915_GEM_DOMAIN_CPU)
-		return 0;
-
 	flush_write_domain(obj, ~I915_GEM_DOMAIN_CPU);
 
 	/* Flush the CPU cache if it's still invalid. */
@@ -3876,15 +3894,13 @@ i915_gem_object_set_to_cpu_domain(struct drm_i915_gem_object *obj, bool write)
 	/* It should now be out of any other write domains, and we can update
 	 * the domain values for our changes.
 	 */
-	GEM_BUG_ON((obj->base.write_domain & ~I915_GEM_DOMAIN_CPU) != 0);
+	GEM_BUG_ON(obj->base.write_domain & ~I915_GEM_DOMAIN_CPU);
 
 	/* If we're writing through the CPU, then the GPU read domains will
 	 * need to be invalidated at next use.
 	 */
-	if (write) {
-		obj->base.read_domains = I915_GEM_DOMAIN_CPU;
-		obj->base.write_domain = I915_GEM_DOMAIN_CPU;
-	}
+	if (write)
+		__start_cpu_write(obj);
 
 	return 0;
 }
@@ -4304,6 +4320,8 @@ i915_gem_object_create(struct drm_i915_private *dev_priv, u64 size)
 	} else
 		obj->cache_level = I915_CACHE_NONE;
 
+	obj->cache_dirty = !i915_gem_object_is_coherent(obj);
+
 	trace_i915_gem_object_create(obj);
 
 	return obj;
@@ -4972,10 +4990,8 @@ int i915_gem_freeze_late(struct drm_i915_private *dev_priv)
 
 	mutex_lock(&dev_priv->drm.struct_mutex);
 	for (p = phases; *p; p++) {
-		list_for_each_entry(obj, *p, global_link) {
-			obj->base.read_domains = I915_GEM_DOMAIN_CPU;
-			obj->base.write_domain = I915_GEM_DOMAIN_CPU;
-		}
+		list_for_each_entry(obj, *p, global_link)
+			__start_cpu_write(obj);
 	}
 	mutex_unlock(&dev_priv->drm.struct_mutex);
 
diff --git a/drivers/gpu/drm/i915/i915_gem_clflush.c b/drivers/gpu/drm/i915/i915_gem_clflush.c
index ffac7a1f0caf..17b207e963c2 100644
--- a/drivers/gpu/drm/i915/i915_gem_clflush.c
+++ b/drivers/gpu/drm/i915/i915_gem_clflush.c
@@ -71,8 +71,6 @@ static const struct dma_fence_ops i915_clflush_ops = {
 static void __i915_do_clflush(struct drm_i915_gem_object *obj)
 {
 	drm_clflush_sg(obj->mm.pages);
-	obj->cache_dirty = false;
-
 	intel_fb_obj_flush(obj, ORIGIN_CPU);
 }
 
@@ -81,9 +79,6 @@ static void i915_clflush_work(struct work_struct *work)
 	struct clflush *clflush = container_of(work, typeof(*clflush), work);
 	struct drm_i915_gem_object *obj = clflush->obj;
 
-	if (!obj->cache_dirty)
-		goto out;
-
 	if (i915_gem_object_pin_pages(obj)) {
 		DRM_ERROR("Failed to acquire obj->pages for clflushing\n");
 		goto out;
@@ -131,10 +126,10 @@ void i915_gem_clflush_object(struct drm_i915_gem_object *obj,
 	 * anything not backed by physical memory we consider to be always
 	 * coherent and not need clflushing.
 	 */
-	if (!i915_gem_object_has_struct_page(obj))
+	if (!i915_gem_object_has_struct_page(obj)) {
+		obj->cache_dirty = false;
 		return;
-
-	obj->cache_dirty = true;
+	}
 
 	/* If the GPU is snooping the contents of the CPU cache,
 	 * we do not need to manually clear the CPU cache lines.  However,
@@ -153,6 +148,8 @@ void i915_gem_clflush_object(struct drm_i915_gem_object *obj,
 	if (!(flags & I915_CLFLUSH_SYNC))
 		clflush = kmalloc(sizeof(*clflush), GFP_KERNEL);
 	if (clflush) {
+		GEM_BUG_ON(!obj->cache_dirty);
+
 		dma_fence_init(&clflush->dma,
 			       &i915_clflush_ops,
 			       &clflush_lock,
@@ -180,4 +177,6 @@ void i915_gem_clflush_object(struct drm_i915_gem_object *obj,
 	} else {
 		GEM_BUG_ON(obj->base.write_domain != I915_GEM_DOMAIN_CPU);
 	}
+
+	obj->cache_dirty = false;
 }
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index af1965774e7b..0b8ae0f56675 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -291,7 +291,7 @@ static inline int use_cpu_reloc(struct drm_i915_gem_object *obj)
 		return DBG_USE_CPU_RELOC > 0;
 
 	return (HAS_LLC(to_i915(obj->base.dev)) ||
-		obj->base.write_domain == I915_GEM_DOMAIN_CPU ||
+		obj->cache_dirty ||
 		obj->cache_level != I915_CACHE_NONE);
 }
 
@@ -1129,10 +1129,8 @@ i915_gem_execbuffer_move_to_gpu(struct drm_i915_gem_request *req,
 		if (vma->exec_entry->flags & EXEC_OBJECT_ASYNC)
 			continue;
 
-		if (obj->base.write_domain & I915_GEM_DOMAIN_CPU) {
+		if (obj->cache_dirty)
 			i915_gem_clflush_object(obj, 0);
-			obj->base.write_domain = 0;
-		}
 
 		ret = i915_gem_request_await_object
 			(req, obj, obj->base.pending_write_domain);
@@ -1265,12 +1263,6 @@ i915_gem_validate_context(struct drm_device *dev, struct drm_file *file,
 	return ctx;
 }
 
-static bool gpu_write_needs_clflush(struct drm_i915_gem_object *obj)
-{
-	return !(obj->cache_level == I915_CACHE_NONE ||
-		 obj->cache_level == I915_CACHE_WT);
-}
-
 void i915_vma_move_to_active(struct i915_vma *vma,
 			     struct drm_i915_gem_request *req,
 			     unsigned int flags)
@@ -1294,15 +1286,16 @@ void i915_vma_move_to_active(struct i915_vma *vma,
 	i915_gem_active_set(&vma->last_read[idx], req);
 	list_move_tail(&vma->vm_link, &vma->vm->active_list);
 
+	obj->base.write_domain = 0;
 	if (flags & EXEC_OBJECT_WRITE) {
+		obj->base.write_domain = I915_GEM_DOMAIN_RENDER;
+
 		if (intel_fb_obj_invalidate(obj, ORIGIN_CS))
 			i915_gem_active_set(&obj->frontbuffer_write, req);
 
-		/* update for the implicit flush after a batch */
-		obj->base.write_domain &= ~I915_GEM_GPU_DOMAINS;
-		if (!obj->cache_dirty && gpu_write_needs_clflush(obj))
-			obj->cache_dirty = true;
+		obj->base.read_domains = 0;
 	}
+	obj->base.read_domains |= I915_GEM_GPU_DOMAINS;
 
 	if (flags & EXEC_OBJECT_NEEDS_FENCE)
 		i915_gem_active_set(&vma->last_fence, req);
diff --git a/drivers/gpu/drm/i915/i915_gem_internal.c b/drivers/gpu/drm/i915/i915_gem_internal.c
index fc950abbe400..58e93e87d573 100644
--- a/drivers/gpu/drm/i915/i915_gem_internal.c
+++ b/drivers/gpu/drm/i915/i915_gem_internal.c
@@ -188,9 +188,10 @@ i915_gem_object_create_internal(struct drm_i915_private *i915,
 	drm_gem_private_object_init(&i915->drm, &obj->base, size);
 	i915_gem_object_init(obj, &i915_gem_object_internal_ops);
 
-	obj->base.write_domain = I915_GEM_DOMAIN_CPU;
 	obj->base.read_domains = I915_GEM_DOMAIN_CPU;
+	obj->base.write_domain = I915_GEM_DOMAIN_CPU;
 	obj->cache_level = HAS_LLC(i915) ? I915_CACHE_LLC : I915_CACHE_NONE;
+	obj->cache_dirty = !i915_gem_object_is_coherent(obj);
 
 	return obj;
 }
diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
index 58ccf8b8ca1c..9f84be171ad2 100644
--- a/drivers/gpu/drm/i915/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
@@ -802,9 +802,10 @@ i915_gem_userptr_ioctl(struct drm_device *dev, void *data, struct drm_file *file
 
 	drm_gem_private_object_init(dev, &obj->base, args->user_size);
 	i915_gem_object_init(obj, &i915_gem_userptr_ops);
-	obj->cache_level = I915_CACHE_LLC;
-	obj->base.write_domain = I915_GEM_DOMAIN_CPU;
 	obj->base.read_domains = I915_GEM_DOMAIN_CPU;
+	obj->base.write_domain = I915_GEM_DOMAIN_CPU;
+	obj->cache_level = I915_CACHE_LLC;
+	obj->cache_dirty = !i915_gem_object_is_coherent(obj);
 
 	obj->userptr.ptr = args->user_ptr;
 	obj->userptr.read_only = !!(args->flags & I915_USERPTR_READ_ONLY);
diff --git a/drivers/gpu/drm/i915/selftests/huge_gem_object.c b/drivers/gpu/drm/i915/selftests/huge_gem_object.c
index 4e681fc13be4..0ca867a877b6 100644
--- a/drivers/gpu/drm/i915/selftests/huge_gem_object.c
+++ b/drivers/gpu/drm/i915/selftests/huge_gem_object.c
@@ -126,9 +126,10 @@ huge_gem_object(struct drm_i915_private *i915,
 	drm_gem_private_object_init(&i915->drm, &obj->base, dma_size);
 	i915_gem_object_init(obj, &huge_ops);
 
-	obj->base.write_domain = I915_GEM_DOMAIN_CPU;
 	obj->base.read_domains = I915_GEM_DOMAIN_CPU;
+	obj->base.write_domain = I915_GEM_DOMAIN_CPU;
 	obj->cache_level = HAS_LLC(i915) ? I915_CACHE_LLC : I915_CACHE_NONE;
+	obj->cache_dirty = !i915_gem_object_is_coherent(obj);
 	obj->scratch = phys_size;
 
 	return obj;
-- 
2.11.0


* [PATCH 03/24] drm/i915: Store i915_gem_object_is_coherent() as a bit next to cache-dirty
From: Chris Wilson @ 2017-05-18  9:46 UTC
  To: intel-gfx; +Cc: Dongwon Kim

For ease of use (i.e. avoiding a few checks and function calls), store
the object's cache coherency next to the cache-dirty bit.
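
A short sketch of the resulting pattern, assembled from the hunks
below (both fields are one-bit bitfields, which is what lets the
execbuffer test stay a pair of bit operations):

  /* Computed once at creation, refreshed on cache-level changes. */
  obj->cache_coherent = i915_gem_object_is_coherent(obj);
  obj->cache_dirty = !obj->cache_coherent;

  /* Reads as "dirty and not coherent", no function call per object. */
  if (obj->cache_dirty & ~obj->cache_coherent)
          i915_gem_clflush_object(obj, 0);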

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Dongwon Kim <dongwon.kim@intel.com>
Cc: Matt Roper <matthew.d.roper@intel.com>
---
 drivers/gpu/drm/i915/i915_gem.c                  | 14 +++++++-------
 drivers/gpu/drm/i915/i915_gem_clflush.c          |  2 +-
 drivers/gpu/drm/i915/i915_gem_execbuffer.c       |  2 +-
 drivers/gpu/drm/i915/i915_gem_internal.c         |  3 ++-
 drivers/gpu/drm/i915/i915_gem_object.h           |  1 +
 drivers/gpu/drm/i915/i915_gem_stolen.c           |  1 +
 drivers/gpu/drm/i915/i915_gem_userptr.c          |  3 ++-
 drivers/gpu/drm/i915/selftests/huge_gem_object.c |  3 ++-
 8 files changed, 17 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 155dd52f2d18..870659c13de3 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -52,7 +52,7 @@ static bool cpu_write_needs_clflush(struct drm_i915_gem_object *obj)
 	if (obj->cache_dirty)
 		return false;
 
-	if (!i915_gem_object_is_coherent(obj))
+	if (!obj->cache_coherent)
 		return true;
 
 	return obj->pin_display;
@@ -253,7 +253,7 @@ __i915_gem_object_release_shmem(struct drm_i915_gem_object *obj,
 
 	if (needs_clflush &&
 	    (obj->base.read_domains & I915_GEM_DOMAIN_CPU) == 0 &&
-	    !i915_gem_object_is_coherent(obj))
+	    !obj->cache_coherent)
 		drm_clflush_sg(pages);
 
 	__start_cpu_write(obj);
@@ -856,8 +856,7 @@ int i915_gem_obj_prepare_shmem_read(struct drm_i915_gem_object *obj,
 	if (ret)
 		return ret;
 
-	if (i915_gem_object_is_coherent(obj) ||
-	    !static_cpu_has(X86_FEATURE_CLFLUSH)) {
+	if (obj->cache_coherent || !static_cpu_has(X86_FEATURE_CLFLUSH)) {
 		ret = i915_gem_object_set_to_cpu_domain(obj, false);
 		if (ret)
 			goto err_unpin;
@@ -909,8 +908,7 @@ int i915_gem_obj_prepare_shmem_write(struct drm_i915_gem_object *obj,
 	if (ret)
 		return ret;
 
-	if (i915_gem_object_is_coherent(obj) ||
-	    !static_cpu_has(X86_FEATURE_CLFLUSH)) {
+	if (obj->cache_coherent || !static_cpu_has(X86_FEATURE_CLFLUSH)) {
 		ret = i915_gem_object_set_to_cpu_domain(obj, true);
 		if (ret)
 			goto err_unpin;
@@ -3661,6 +3659,7 @@ int i915_gem_object_set_cache_level(struct drm_i915_gem_object *obj,
 	list_for_each_entry(vma, &obj->vma_list, obj_link)
 		vma->node.color = cache_level;
 	obj->cache_level = cache_level;
+	obj->cache_coherent = i915_gem_object_is_coherent(obj);
 	obj->cache_dirty = true; /* Always invalidate stale cachelines */
 
 	return 0;
@@ -4320,7 +4319,8 @@ i915_gem_object_create(struct drm_i915_private *dev_priv, u64 size)
 	} else
 		obj->cache_level = I915_CACHE_NONE;
 
-	obj->cache_dirty = !i915_gem_object_is_coherent(obj);
+	obj->cache_coherent = i915_gem_object_is_coherent(obj);
+	obj->cache_dirty = !obj->cache_coherent;
 
 	trace_i915_gem_object_create(obj);
 
diff --git a/drivers/gpu/drm/i915/i915_gem_clflush.c b/drivers/gpu/drm/i915/i915_gem_clflush.c
index 17b207e963c2..152f16c11878 100644
--- a/drivers/gpu/drm/i915/i915_gem_clflush.c
+++ b/drivers/gpu/drm/i915/i915_gem_clflush.c
@@ -139,7 +139,7 @@ void i915_gem_clflush_object(struct drm_i915_gem_object *obj,
 	 * snooping behaviour occurs naturally as the result of our domain
 	 * tracking.
 	 */
-	if (!(flags & I915_CLFLUSH_FORCE) && i915_gem_object_is_coherent(obj))
+	if (!(flags & I915_CLFLUSH_FORCE) && obj->cache_coherent)
 		return;
 
 	trace_i915_gem_object_clflush(obj);
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index 0b8ae0f56675..2e5f513087a8 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -1129,7 +1129,7 @@ i915_gem_execbuffer_move_to_gpu(struct drm_i915_gem_request *req,
 		if (vma->exec_entry->flags & EXEC_OBJECT_ASYNC)
 			continue;
 
-		if (obj->cache_dirty)
+		if (obj->cache_dirty & ~obj->cache_coherent)
 			i915_gem_clflush_object(obj, 0);
 
 		ret = i915_gem_request_await_object
diff --git a/drivers/gpu/drm/i915/i915_gem_internal.c b/drivers/gpu/drm/i915/i915_gem_internal.c
index 58e93e87d573..568bf83af1f5 100644
--- a/drivers/gpu/drm/i915/i915_gem_internal.c
+++ b/drivers/gpu/drm/i915/i915_gem_internal.c
@@ -191,7 +191,8 @@ i915_gem_object_create_internal(struct drm_i915_private *i915,
 	obj->base.read_domains = I915_GEM_DOMAIN_CPU;
 	obj->base.write_domain = I915_GEM_DOMAIN_CPU;
 	obj->cache_level = HAS_LLC(i915) ? I915_CACHE_LLC : I915_CACHE_NONE;
-	obj->cache_dirty = !i915_gem_object_is_coherent(obj);
+	obj->cache_coherent = i915_gem_object_is_coherent(obj);
+	obj->cache_dirty = !obj->cache_coherent;
 
 	return obj;
 }
diff --git a/drivers/gpu/drm/i915/i915_gem_object.h b/drivers/gpu/drm/i915/i915_gem_object.h
index 174cf923c236..dca15adc91de 100644
--- a/drivers/gpu/drm/i915/i915_gem_object.h
+++ b/drivers/gpu/drm/i915/i915_gem_object.h
@@ -106,6 +106,7 @@ struct drm_i915_gem_object {
 	unsigned long gt_ro:1;
 	unsigned int cache_level:3;
 	unsigned int cache_dirty:1;
+	unsigned int cache_coherent:1;
 
 	atomic_t frontbuffer_bits;
 	unsigned int frontbuffer_ggtt_origin; /* write once */
diff --git a/drivers/gpu/drm/i915/i915_gem_stolen.c b/drivers/gpu/drm/i915/i915_gem_stolen.c
index f3abdc27c5dd..68af4a39973d 100644
--- a/drivers/gpu/drm/i915/i915_gem_stolen.c
+++ b/drivers/gpu/drm/i915/i915_gem_stolen.c
@@ -592,6 +592,7 @@ _i915_gem_object_create_stolen(struct drm_i915_private *dev_priv,
 	obj->stolen = stolen;
 	obj->base.read_domains = I915_GEM_DOMAIN_CPU | I915_GEM_DOMAIN_GTT;
 	obj->cache_level = HAS_LLC(dev_priv) ? I915_CACHE_LLC : I915_CACHE_NONE;
+	obj->cache_coherent = true; /* assumptions! more like cache_oblivious */
 
 	if (i915_gem_object_pin_pages(obj))
 		goto cleanup;
diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
index 9f84be171ad2..4ec9a04aa165 100644
--- a/drivers/gpu/drm/i915/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
@@ -805,7 +805,8 @@ i915_gem_userptr_ioctl(struct drm_device *dev, void *data, struct drm_file *file
 	obj->base.read_domains = I915_GEM_DOMAIN_CPU;
 	obj->base.write_domain = I915_GEM_DOMAIN_CPU;
 	obj->cache_level = I915_CACHE_LLC;
-	obj->cache_dirty = !i915_gem_object_is_coherent(obj);
+	obj->cache_coherent = i915_gem_object_is_coherent(obj);
+	obj->cache_dirty = !obj->cache_coherent;
 
 	obj->userptr.ptr = args->user_ptr;
 	obj->userptr.read_only = !!(args->flags & I915_USERPTR_READ_ONLY);
diff --git a/drivers/gpu/drm/i915/selftests/huge_gem_object.c b/drivers/gpu/drm/i915/selftests/huge_gem_object.c
index 0ca867a877b6..caf76af36aba 100644
--- a/drivers/gpu/drm/i915/selftests/huge_gem_object.c
+++ b/drivers/gpu/drm/i915/selftests/huge_gem_object.c
@@ -129,7 +129,8 @@ huge_gem_object(struct drm_i915_private *i915,
 	obj->base.read_domains = I915_GEM_DOMAIN_CPU;
 	obj->base.write_domain = I915_GEM_DOMAIN_CPU;
 	obj->cache_level = HAS_LLC(i915) ? I915_CACHE_LLC : I915_CACHE_NONE;
-	obj->cache_dirty = !i915_gem_object_is_coherent(obj);
+	obj->cache_coherent = i915_gem_object_is_coherent(obj);
+	obj->cache_dirty = !obj->cache_coherent;
 	obj->scratch = phys_size;
 
 	return obj;
-- 
2.11.0


* [PATCH 04/24] drm/i915: Reinstate reservation_object zapping for batch_pool objects
From: Chris Wilson @ 2017-05-18  9:46 UTC
  To: intel-gfx; +Cc: Mika Kuoppala, Matthew Auld

I removed the zapping of the reservation_object->fence array of shared
fences prematurely. We don't yet have the code to zap that array when
retiring the object, so it currently remains possible to continually
grow the shared array, trapping requests, when reusing the batch_pool
object across many timelines.
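
The zap itself is a three-line critical section. The sketch below
restates the hunk, under the assumption (true of reservation_object at
the time of this series) that installing an exclusive fence resets the
shared count, so a NULL exclusive fence empties the array without
pinning anything new:

  struct reservation_object *resv = obj->resv;

  if (rcu_access_pointer(resv->fence)) {  /* any shared fences held? */
          reservation_object_lock(resv, NULL);
          reservation_object_add_excl_fence(resv, NULL);
          reservation_object_unlock(resv);
  }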

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/i915/i915_gem_batch_pool.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_batch_pool.c b/drivers/gpu/drm/i915/i915_gem_batch_pool.c
index 41aa598c4f3b..c93005c2e0fb 100644
--- a/drivers/gpu/drm/i915/i915_gem_batch_pool.c
+++ b/drivers/gpu/drm/i915/i915_gem_batch_pool.c
@@ -114,12 +114,27 @@ i915_gem_batch_pool_get(struct i915_gem_batch_pool *pool,
 	list_for_each_entry(obj, list, batch_pool_link) {
 		/* The batches are strictly LRU ordered */
 		if (i915_gem_object_is_active(obj)) {
-			if (!reservation_object_test_signaled_rcu(obj->resv,
-								  true))
+			struct reservation_object *resv = obj->resv;
+
+			if (!reservation_object_test_signaled_rcu(resv, true))
 				break;
 
 			i915_gem_retire_requests(pool->engine->i915);
 			GEM_BUG_ON(i915_gem_object_is_active(obj));
+
+			/*
+			 * The object is now idle, clear the array of shared
+			 * fences before we add a new request. Although, we
+			 * remain on the same engine, we may be on a different
+			 * timeline and so may continually grow the array,
+			 * trapping a reference to all the old fences, rather
+			 * than replace the existing fence.
+			 */
+			if (rcu_access_pointer(resv->fence)) {
+				reservation_object_lock(resv, NULL);
+				reservation_object_add_excl_fence(resv, NULL);
+				reservation_object_unlock(resv);
+			}
 		}
 
 		GEM_BUG_ON(!reservation_object_test_signaled_rcu(obj->resv,
-- 
2.11.0


* [PATCH 05/24] drm/i915: Amalgamate execbuffer parameter structures
From: Chris Wilson @ 2017-05-18  9:46 UTC
  To: intel-gfx

Combine the two slightly overlapping parameter structures we pass around
the execbuffer routines into one.
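
The effect is easiest to see at a call site (both lines taken from the
hunks below): helpers stop threading the same five pieces of state and
instead receive the one context structure.

  /* Before: overlapping state passed piecemeal. */
  ret = eb_lookup_vmas(eb, exec, args, vm, file);

  /* After: one struct i915_execbuffer owns args, exec, vm and file. */
  ret = eb_lookup_vmas(&eb);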

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_gem_execbuffer.c | 550 ++++++++++++-----------------
 1 file changed, 233 insertions(+), 317 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index 2e5f513087a8..52021bc64754 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -50,70 +50,78 @@
 
 #define BATCH_OFFSET_BIAS (256*1024)
 
-struct i915_execbuffer_params {
-	struct drm_device               *dev;
-	struct drm_file                 *file;
-	struct i915_vma			*batch;
-	u32				dispatch_flags;
-	u32				args_batch_start_offset;
-	struct intel_engine_cs          *engine;
-	struct i915_gem_context         *ctx;
-	struct drm_i915_gem_request     *request;
-};
+#define __I915_EXEC_ILLEGAL_FLAGS \
+	(__I915_EXEC_UNKNOWN_FLAGS | I915_EXEC_CONSTANTS_MASK)
 
-struct eb_vmas {
+struct i915_execbuffer {
 	struct drm_i915_private *i915;
+	struct drm_file *file;
+	struct drm_i915_gem_execbuffer2 *args;
+	struct drm_i915_gem_exec_object2 *exec;
+	struct intel_engine_cs *engine;
+	struct i915_gem_context *ctx;
+	struct i915_address_space *vm;
+	struct i915_vma *batch;
+	struct drm_i915_gem_request *request;
+	u32 batch_start_offset;
+	u32 batch_len;
+	unsigned int dispatch_flags;
+	struct drm_i915_gem_exec_object2 shadow_exec_entry;
+	bool need_relocs;
 	struct list_head vmas;
+	struct reloc_cache {
+		struct drm_mm_node node;
+		unsigned long vaddr;
+		unsigned int page;
+		bool use_64bit_reloc : 1;
+	} reloc_cache;
 	int and;
 	union {
-		struct i915_vma *lut[0];
-		struct hlist_head buckets[0];
+		struct i915_vma **lut;
+		struct hlist_head *buckets;
 	};
 };
 
-static struct eb_vmas *
-eb_create(struct drm_i915_private *i915,
-	  struct drm_i915_gem_execbuffer2 *args)
+static int
+eb_create(struct i915_execbuffer *eb)
 {
-	struct eb_vmas *eb = NULL;
-
-	if (args->flags & I915_EXEC_HANDLE_LUT) {
-		unsigned size = args->buffer_count;
+	eb->lut = NULL;
+	if (eb->args->flags & I915_EXEC_HANDLE_LUT) {
+		unsigned int size = eb->args->buffer_count;
 		size *= sizeof(struct i915_vma *);
-		size += sizeof(struct eb_vmas);
-		eb = kmalloc(size, GFP_TEMPORARY | __GFP_NOWARN | __GFP_NORETRY);
+		eb->lut = kmalloc(size,
+				  GFP_TEMPORARY | __GFP_NOWARN | __GFP_NORETRY);
 	}
 
-	if (eb == NULL) {
-		unsigned size = args->buffer_count;
-		unsigned count = PAGE_SIZE / sizeof(struct hlist_head) / 2;
+	if (!eb->lut) {
+		unsigned int size = eb->args->buffer_count;
+		unsigned int count = PAGE_SIZE / sizeof(struct hlist_head) / 2;
 		BUILD_BUG_ON_NOT_POWER_OF_2(PAGE_SIZE / sizeof(struct hlist_head));
 		while (count > 2*size)
 			count >>= 1;
-		eb = kzalloc(count*sizeof(struct hlist_head) +
-			     sizeof(struct eb_vmas),
-			     GFP_TEMPORARY);
-		if (eb == NULL)
-			return eb;
+		eb->lut = kzalloc(count * sizeof(struct hlist_head),
+				  GFP_TEMPORARY);
+		if (!eb->lut)
+			return -ENOMEM;
 
 		eb->and = count - 1;
-	} else
-		eb->and = -args->buffer_count;
+	} else {
+		eb->and = -eb->args->buffer_count;
+	}
 
-	eb->i915 = i915;
 	INIT_LIST_HEAD(&eb->vmas);
-	return eb;
+	return 0;
 }
 
 static void
-eb_reset(struct eb_vmas *eb)
+eb_reset(struct i915_execbuffer *eb)
 {
 	if (eb->and >= 0)
 		memset(eb->buckets, 0, (eb->and+1)*sizeof(struct hlist_head));
 }
 
 static struct i915_vma *
-eb_get_batch(struct eb_vmas *eb)
+eb_get_batch(struct i915_execbuffer *eb)
 {
 	struct i915_vma *vma = list_entry(eb->vmas.prev, typeof(*vma), exec_list);
 
@@ -133,34 +141,30 @@ eb_get_batch(struct eb_vmas *eb)
 }
 
 static int
-eb_lookup_vmas(struct eb_vmas *eb,
-	       struct drm_i915_gem_exec_object2 *exec,
-	       const struct drm_i915_gem_execbuffer2 *args,
-	       struct i915_address_space *vm,
-	       struct drm_file *file)
+eb_lookup_vmas(struct i915_execbuffer *eb)
 {
 	struct drm_i915_gem_object *obj;
 	struct list_head objects;
 	int i, ret;
 
 	INIT_LIST_HEAD(&objects);
-	spin_lock(&file->table_lock);
+	spin_lock(&eb->file->table_lock);
 	/* Grab a reference to the object and release the lock so we can lookup
 	 * or create the VMA without using GFP_ATOMIC */
-	for (i = 0; i < args->buffer_count; i++) {
-		obj = to_intel_bo(idr_find(&file->object_idr, exec[i].handle));
+	for (i = 0; i < eb->args->buffer_count; i++) {
+		obj = to_intel_bo(idr_find(&eb->file->object_idr, eb->exec[i].handle));
 		if (obj == NULL) {
-			spin_unlock(&file->table_lock);
+			spin_unlock(&eb->file->table_lock);
 			DRM_DEBUG("Invalid object handle %d at index %d\n",
-				   exec[i].handle, i);
+				   eb->exec[i].handle, i);
 			ret = -ENOENT;
 			goto err;
 		}
 
 		if (!list_empty(&obj->obj_exec_link)) {
-			spin_unlock(&file->table_lock);
+			spin_unlock(&eb->file->table_lock);
 			DRM_DEBUG("Object %p [handle %d, index %d] appears more than once in object list\n",
-				   obj, exec[i].handle, i);
+				   obj, eb->exec[i].handle, i);
 			ret = -EINVAL;
 			goto err;
 		}
@@ -168,7 +172,7 @@ eb_lookup_vmas(struct eb_vmas *eb,
 		i915_gem_object_get(obj);
 		list_add_tail(&obj->obj_exec_link, &objects);
 	}
-	spin_unlock(&file->table_lock);
+	spin_unlock(&eb->file->table_lock);
 
 	i = 0;
 	while (!list_empty(&objects)) {
@@ -186,7 +190,7 @@ eb_lookup_vmas(struct eb_vmas *eb,
 		 * from the (obj, vm) we don't run the risk of creating
 		 * duplicated vmas for the same vm.
 		 */
-		vma = i915_vma_instance(obj, vm, NULL);
+		vma = i915_vma_instance(obj, eb->vm, NULL);
 		if (unlikely(IS_ERR(vma))) {
 			DRM_DEBUG("Failed to lookup VMA\n");
 			ret = PTR_ERR(vma);
@@ -197,11 +201,13 @@ eb_lookup_vmas(struct eb_vmas *eb,
 		list_add_tail(&vma->exec_list, &eb->vmas);
 		list_del_init(&obj->obj_exec_link);
 
-		vma->exec_entry = &exec[i];
+		vma->exec_entry = &eb->exec[i];
 		if (eb->and < 0) {
 			eb->lut[i] = vma;
 		} else {
-			uint32_t handle = args->flags & I915_EXEC_HANDLE_LUT ? i : exec[i].handle;
+			u32 handle =
+				eb->args->flags & I915_EXEC_HANDLE_LUT ?
+				i : eb->exec[i].handle;
 			vma->exec_handle = handle;
 			hlist_add_head(&vma->exec_node,
 				       &eb->buckets[handle & eb->and]);
@@ -228,7 +234,7 @@ eb_lookup_vmas(struct eb_vmas *eb,
 	return ret;
 }
 
-static struct i915_vma *eb_get_vma(struct eb_vmas *eb, unsigned long handle)
+static struct i915_vma *eb_get_vma(struct i915_execbuffer *eb, unsigned long handle)
 {
 	if (eb->and < 0) {
 		if (handle >= -eb->and)
@@ -248,7 +254,7 @@ static struct i915_vma *eb_get_vma(struct eb_vmas *eb, unsigned long handle)
 }
 
 static void
-i915_gem_execbuffer_unreserve_vma(struct i915_vma *vma)
+eb_unreserve_vma(struct i915_vma *vma)
 {
 	struct drm_i915_gem_exec_object2 *entry;
 
@@ -266,8 +272,10 @@ i915_gem_execbuffer_unreserve_vma(struct i915_vma *vma)
 	entry->flags &= ~(__EXEC_OBJECT_HAS_FENCE | __EXEC_OBJECT_HAS_PIN);
 }
 
-static void eb_destroy(struct eb_vmas *eb)
+static void eb_destroy(struct i915_execbuffer *eb)
 {
+	i915_gem_context_put(eb->ctx);
+
 	while (!list_empty(&eb->vmas)) {
 		struct i915_vma *vma;
 
@@ -275,11 +283,10 @@ static void eb_destroy(struct eb_vmas *eb)
 				       struct i915_vma,
 				       exec_list);
 		list_del_init(&vma->exec_list);
-		i915_gem_execbuffer_unreserve_vma(vma);
+		eb_unreserve_vma(vma);
 		vma->exec_entry = NULL;
 		i915_vma_put(vma);
 	}
-	kfree(eb);
 }
 
 static inline int use_cpu_reloc(struct drm_i915_gem_object *obj)
@@ -320,20 +327,11 @@ relocation_target(const struct drm_i915_gem_relocation_entry *reloc,
 	return gen8_canonical_addr((int)reloc->delta + target_offset);
 }
 
-struct reloc_cache {
-	struct drm_i915_private *i915;
-	struct drm_mm_node node;
-	unsigned long vaddr;
-	unsigned int page;
-	bool use_64bit_reloc;
-};
-
 static void reloc_cache_init(struct reloc_cache *cache,
 			     struct drm_i915_private *i915)
 {
 	cache->page = -1;
 	cache->vaddr = 0;
-	cache->i915 = i915;
 	/* Must be a variable in the struct to allow GCC to unroll. */
 	cache->use_64bit_reloc = HAS_64BIT_RELOC(i915);
 	cache->node.allocated = false;
@@ -351,7 +349,14 @@ static inline unsigned int unmask_flags(unsigned long p)
 
 #define KMAP 0x4 /* after CLFLUSH_FLAGS */
 
-static void reloc_cache_fini(struct reloc_cache *cache)
+static inline struct i915_ggtt *cache_to_ggtt(struct reloc_cache *cache)
+{
+	struct drm_i915_private *i915 =
+		container_of(cache, struct i915_execbuffer, reloc_cache)->i915;
+	return &i915->ggtt;
+}
+
+static void reloc_cache_reset(struct reloc_cache *cache)
 {
 	void *vaddr;
 
@@ -369,7 +374,7 @@ static void reloc_cache_fini(struct reloc_cache *cache)
 		wmb();
 		io_mapping_unmap_atomic((void __iomem *)vaddr);
 		if (cache->node.allocated) {
-			struct i915_ggtt *ggtt = &cache->i915->ggtt;
+			struct i915_ggtt *ggtt = cache_to_ggtt(cache);
 
 			ggtt->base.clear_range(&ggtt->base,
 					       cache->node.start,
@@ -379,6 +384,9 @@ static void reloc_cache_fini(struct reloc_cache *cache)
 			i915_vma_unpin((struct i915_vma *)cache->node.mm);
 		}
 	}
+
+	cache->vaddr = 0;
+	cache->page = -1;
 }
 
 static void *reloc_kmap(struct drm_i915_gem_object *obj,
@@ -417,7 +425,7 @@ static void *reloc_iomap(struct drm_i915_gem_object *obj,
 			 struct reloc_cache *cache,
 			 int page)
 {
-	struct i915_ggtt *ggtt = &cache->i915->ggtt;
+	struct i915_ggtt *ggtt = cache_to_ggtt(cache);
 	unsigned long offset;
 	void *vaddr;
 
@@ -467,7 +475,8 @@ static void *reloc_iomap(struct drm_i915_gem_object *obj,
 		offset += page << PAGE_SHIFT;
 	}
 
-	vaddr = (void __force *) io_mapping_map_atomic_wc(&cache->i915->ggtt.mappable, offset);
+	vaddr = (void __force *)io_mapping_map_atomic_wc(&ggtt->mappable,
+							 offset);
 	cache->page = page;
 	cache->vaddr = (unsigned long)vaddr;
 
@@ -546,12 +555,10 @@ relocate_entry(struct drm_i915_gem_object *obj,
 }
 
 static int
-i915_gem_execbuffer_relocate_entry(struct drm_i915_gem_object *obj,
-				   struct eb_vmas *eb,
-				   struct drm_i915_gem_relocation_entry *reloc,
-				   struct reloc_cache *cache)
+eb_relocate_entry(struct drm_i915_gem_object *obj,
+		  struct i915_execbuffer *eb,
+		  struct drm_i915_gem_relocation_entry *reloc)
 {
-	struct drm_i915_private *dev_priv = to_i915(obj->base.dev);
 	struct drm_gem_object *target_obj;
 	struct drm_i915_gem_object *target_i915_obj;
 	struct i915_vma *target_vma;
@@ -570,8 +577,8 @@ i915_gem_execbuffer_relocate_entry(struct drm_i915_gem_object *obj,
 	/* Sandybridge PPGTT errata: We need a global gtt mapping for MI and
 	 * pipe_control writes because the gpu doesn't properly redirect them
 	 * through the ppgtt for non_secure batchbuffers. */
-	if (unlikely(IS_GEN6(dev_priv) &&
-	    reloc->write_domain == I915_GEM_DOMAIN_INSTRUCTION)) {
+	if (unlikely(IS_GEN6(eb->i915) &&
+		     reloc->write_domain == I915_GEM_DOMAIN_INSTRUCTION)) {
 		ret = i915_vma_bind(target_vma, target_i915_obj->cache_level,
 				    PIN_GLOBAL);
 		if (WARN_ONCE(ret, "Unexpected failure to bind target VMA!"))
@@ -612,7 +619,7 @@ i915_gem_execbuffer_relocate_entry(struct drm_i915_gem_object *obj,
 
 	/* Check that the relocation address is valid... */
 	if (unlikely(reloc->offset >
-		     obj->base.size - (cache->use_64bit_reloc ? 8 : 4))) {
+		     obj->base.size - (eb->reloc_cache.use_64bit_reloc ? 8 : 4))) {
 		DRM_DEBUG("Relocation beyond object bounds: "
 			  "obj %p target %d offset %d size %d.\n",
 			  obj, reloc->target_handle,
@@ -628,7 +635,7 @@ i915_gem_execbuffer_relocate_entry(struct drm_i915_gem_object *obj,
 		return -EINVAL;
 	}
 
-	ret = relocate_entry(obj, reloc, cache, target_offset);
+	ret = relocate_entry(obj, reloc, &eb->reloc_cache, target_offset);
 	if (ret)
 		return ret;
 
@@ -637,19 +644,15 @@ i915_gem_execbuffer_relocate_entry(struct drm_i915_gem_object *obj,
 	return 0;
 }
 
-static int
-i915_gem_execbuffer_relocate_vma(struct i915_vma *vma,
-				 struct eb_vmas *eb)
+static int eb_relocate_vma(struct i915_vma *vma, struct i915_execbuffer *eb)
 {
 #define N_RELOC(x) ((x) / sizeof(struct drm_i915_gem_relocation_entry))
 	struct drm_i915_gem_relocation_entry stack_reloc[N_RELOC(512)];
 	struct drm_i915_gem_relocation_entry __user *user_relocs;
 	struct drm_i915_gem_exec_object2 *entry = vma->exec_entry;
-	struct reloc_cache cache;
 	int remain, ret = 0;
 
 	user_relocs = u64_to_user_ptr(entry->relocs_ptr);
-	reloc_cache_init(&cache, eb->i915);
 
 	remain = entry->relocation_count;
 	while (remain) {
@@ -678,7 +681,7 @@ i915_gem_execbuffer_relocate_vma(struct i915_vma *vma,
 		do {
 			u64 offset = r->presumed_offset;
 
-			ret = i915_gem_execbuffer_relocate_entry(vma->obj, eb, r, &cache);
+			ret = eb_relocate_entry(vma->obj, eb, r);
 			if (ret)
 				goto out;
 
@@ -710,39 +713,35 @@ i915_gem_execbuffer_relocate_vma(struct i915_vma *vma,
 	}
 
 out:
-	reloc_cache_fini(&cache);
+	reloc_cache_reset(&eb->reloc_cache);
 	return ret;
 #undef N_RELOC
 }
 
 static int
-i915_gem_execbuffer_relocate_vma_slow(struct i915_vma *vma,
-				      struct eb_vmas *eb,
-				      struct drm_i915_gem_relocation_entry *relocs)
+eb_relocate_vma_slow(struct i915_vma *vma,
+		     struct i915_execbuffer *eb,
+		     struct drm_i915_gem_relocation_entry *relocs)
 {
 	const struct drm_i915_gem_exec_object2 *entry = vma->exec_entry;
-	struct reloc_cache cache;
 	int i, ret = 0;
 
-	reloc_cache_init(&cache, eb->i915);
 	for (i = 0; i < entry->relocation_count; i++) {
-		ret = i915_gem_execbuffer_relocate_entry(vma->obj, eb, &relocs[i], &cache);
+		ret = eb_relocate_entry(vma->obj, eb, &relocs[i]);
 		if (ret)
 			break;
 	}
-	reloc_cache_fini(&cache);
-
+	reloc_cache_reset(&eb->reloc_cache);
 	return ret;
 }
 
-static int
-i915_gem_execbuffer_relocate(struct eb_vmas *eb)
+static int eb_relocate(struct i915_execbuffer *eb)
 {
 	struct i915_vma *vma;
 	int ret = 0;
 
 	list_for_each_entry(vma, &eb->vmas, exec_list) {
-		ret = i915_gem_execbuffer_relocate_vma(vma, eb);
+		ret = eb_relocate_vma(vma, eb);
 		if (ret)
 			break;
 	}
@@ -757,9 +756,9 @@ static bool only_mappable_for_reloc(unsigned int flags)
 }
 
 static int
-i915_gem_execbuffer_reserve_vma(struct i915_vma *vma,
-				struct intel_engine_cs *engine,
-				bool *need_reloc)
+eb_reserve_vma(struct i915_vma *vma,
+	       struct intel_engine_cs *engine,
+	       bool *need_reloc)
 {
 	struct drm_i915_gem_object *obj = vma->obj;
 	struct drm_i915_gem_exec_object2 *entry = vma->exec_entry;
@@ -878,34 +877,27 @@ eb_vma_misplaced(struct i915_vma *vma)
 	return false;
 }
 
-static int
-i915_gem_execbuffer_reserve(struct intel_engine_cs *engine,
-			    struct list_head *vmas,
-			    struct i915_gem_context *ctx,
-			    bool *need_relocs)
+static int eb_reserve(struct i915_execbuffer *eb)
 {
+	const bool has_fenced_gpu_access = INTEL_GEN(eb->i915) < 4;
+	const bool needs_unfenced_map = INTEL_INFO(eb->i915)->unfenced_needs_alignment;
 	struct drm_i915_gem_object *obj;
 	struct i915_vma *vma;
-	struct i915_address_space *vm;
 	struct list_head ordered_vmas;
 	struct list_head pinned_vmas;
-	bool has_fenced_gpu_access = INTEL_GEN(engine->i915) < 4;
-	bool needs_unfenced_map = INTEL_INFO(engine->i915)->unfenced_needs_alignment;
 	int retry;
 
-	vm = list_first_entry(vmas, struct i915_vma, exec_list)->vm;
-
 	INIT_LIST_HEAD(&ordered_vmas);
 	INIT_LIST_HEAD(&pinned_vmas);
-	while (!list_empty(vmas)) {
+	while (!list_empty(&eb->vmas)) {
 		struct drm_i915_gem_exec_object2 *entry;
 		bool need_fence, need_mappable;
 
-		vma = list_first_entry(vmas, struct i915_vma, exec_list);
+		vma = list_first_entry(&eb->vmas, struct i915_vma, exec_list);
 		obj = vma->obj;
 		entry = vma->exec_entry;
 
-		if (ctx->flags & CONTEXT_NO_ZEROMAP)
+		if (eb->ctx->flags & CONTEXT_NO_ZEROMAP)
 			entry->flags |= __EXEC_OBJECT_NEEDS_BIAS;
 
 		if (!has_fenced_gpu_access)
@@ -927,8 +919,8 @@ i915_gem_execbuffer_reserve(struct intel_engine_cs *engine,
 		obj->base.pending_read_domains = I915_GEM_GPU_DOMAINS & ~I915_GEM_DOMAIN_COMMAND;
 		obj->base.pending_write_domain = 0;
 	}
-	list_splice(&ordered_vmas, vmas);
-	list_splice(&pinned_vmas, vmas);
+	list_splice(&ordered_vmas, &eb->vmas);
+	list_splice(&pinned_vmas, &eb->vmas);
 
 	/* Attempt to pin all of the buffers into the GTT.
 	 * This is done in 3 phases:
@@ -947,27 +939,24 @@ i915_gem_execbuffer_reserve(struct intel_engine_cs *engine,
 		int ret = 0;
 
 		/* Unbind any ill-fitting objects or pin. */
-		list_for_each_entry(vma, vmas, exec_list) {
+		list_for_each_entry(vma, &eb->vmas, exec_list) {
 			if (!drm_mm_node_allocated(&vma->node))
 				continue;
 
 			if (eb_vma_misplaced(vma))
 				ret = i915_vma_unbind(vma);
 			else
-				ret = i915_gem_execbuffer_reserve_vma(vma,
-								      engine,
-								      need_relocs);
+				ret = eb_reserve_vma(vma, eb->engine, &eb->need_relocs);
 			if (ret)
 				goto err;
 		}
 
 		/* Bind fresh objects */
-		list_for_each_entry(vma, vmas, exec_list) {
+		list_for_each_entry(vma, &eb->vmas, exec_list) {
 			if (drm_mm_node_allocated(&vma->node))
 				continue;
 
-			ret = i915_gem_execbuffer_reserve_vma(vma, engine,
-							      need_relocs);
+			ret = eb_reserve_vma(vma, eb->engine, &eb->need_relocs);
 			if (ret)
 				goto err;
 		}
@@ -977,39 +966,30 @@ i915_gem_execbuffer_reserve(struct intel_engine_cs *engine,
 			return ret;
 
 		/* Decrement pin count for bound objects */
-		list_for_each_entry(vma, vmas, exec_list)
-			i915_gem_execbuffer_unreserve_vma(vma);
+		list_for_each_entry(vma, &eb->vmas, exec_list)
+			eb_unreserve_vma(vma);
 
-		ret = i915_gem_evict_vm(vm, true);
+		ret = i915_gem_evict_vm(eb->vm, true);
 		if (ret)
 			return ret;
 	} while (1);
 }
 
 static int
-i915_gem_execbuffer_relocate_slow(struct drm_device *dev,
-				  struct drm_i915_gem_execbuffer2 *args,
-				  struct drm_file *file,
-				  struct intel_engine_cs *engine,
-				  struct eb_vmas *eb,
-				  struct drm_i915_gem_exec_object2 *exec,
-				  struct i915_gem_context *ctx)
+eb_relocate_slow(struct i915_execbuffer *eb)
 {
+	const unsigned int count = eb->args->buffer_count;
+	struct drm_device *dev = &eb->i915->drm;
 	struct drm_i915_gem_relocation_entry *reloc;
-	struct i915_address_space *vm;
 	struct i915_vma *vma;
-	bool need_relocs;
 	int *reloc_offset;
 	int i, total, ret;
-	unsigned count = args->buffer_count;
-
-	vm = list_first_entry(&eb->vmas, struct i915_vma, exec_list)->vm;
 
 	/* We may process another execbuffer during the unlock... */
 	while (!list_empty(&eb->vmas)) {
 		vma = list_first_entry(&eb->vmas, struct i915_vma, exec_list);
 		list_del_init(&vma->exec_list);
-		i915_gem_execbuffer_unreserve_vma(vma);
+		eb_unreserve_vma(vma);
 		i915_vma_put(vma);
 	}
 
@@ -1017,7 +997,7 @@ i915_gem_execbuffer_relocate_slow(struct drm_device *dev,
 
 	total = 0;
 	for (i = 0; i < count; i++)
-		total += exec[i].relocation_count;
+		total += eb->exec[i].relocation_count;
 
 	reloc_offset = drm_malloc_ab(count, sizeof(*reloc_offset));
 	reloc = drm_malloc_ab(total, sizeof(*reloc));
@@ -1034,10 +1014,10 @@ i915_gem_execbuffer_relocate_slow(struct drm_device *dev,
 		u64 invalid_offset = (u64)-1;
 		int j;
 
-		user_relocs = u64_to_user_ptr(exec[i].relocs_ptr);
+		user_relocs = u64_to_user_ptr(eb->exec[i].relocs_ptr);
 
 		if (copy_from_user(reloc+total, user_relocs,
-				   exec[i].relocation_count * sizeof(*reloc))) {
+				   eb->exec[i].relocation_count * sizeof(*reloc))) {
 			ret = -EFAULT;
 			mutex_lock(&dev->struct_mutex);
 			goto err;
@@ -1052,7 +1032,7 @@ i915_gem_execbuffer_relocate_slow(struct drm_device *dev,
 		 * happened we would make the mistake of assuming that the
 		 * relocations were valid.
 		 */
-		for (j = 0; j < exec[i].relocation_count; j++) {
+		for (j = 0; j < eb->exec[i].relocation_count; j++) {
 			if (__copy_to_user(&user_relocs[j].presumed_offset,
 					   &invalid_offset,
 					   sizeof(invalid_offset))) {
@@ -1063,7 +1043,7 @@ i915_gem_execbuffer_relocate_slow(struct drm_device *dev,
 		}
 
 		reloc_offset[i] = total;
-		total += exec[i].relocation_count;
+		total += eb->exec[i].relocation_count;
 	}
 
 	ret = i915_mutex_lock_interruptible(dev);
@@ -1074,20 +1054,18 @@ i915_gem_execbuffer_relocate_slow(struct drm_device *dev,
 
 	/* reacquire the objects */
 	eb_reset(eb);
-	ret = eb_lookup_vmas(eb, exec, args, vm, file);
+	ret = eb_lookup_vmas(eb);
 	if (ret)
 		goto err;
 
-	need_relocs = (args->flags & I915_EXEC_NO_RELOC) == 0;
-	ret = i915_gem_execbuffer_reserve(engine, &eb->vmas, ctx,
-					  &need_relocs);
+	ret = eb_reserve(eb);
 	if (ret)
 		goto err;
 
 	list_for_each_entry(vma, &eb->vmas, exec_list) {
-		int offset = vma->exec_entry - exec;
-		ret = i915_gem_execbuffer_relocate_vma_slow(vma, eb,
-							    reloc + reloc_offset[offset]);
+		int idx = vma->exec_entry - eb->exec;
+
+		ret = eb_relocate_vma_slow(vma, eb, reloc + reloc_offset[idx]);
 		if (ret)
 			goto err;
 	}
@@ -1105,13 +1083,12 @@ i915_gem_execbuffer_relocate_slow(struct drm_device *dev,
 }
 
 static int
-i915_gem_execbuffer_move_to_gpu(struct drm_i915_gem_request *req,
-				struct list_head *vmas)
+eb_move_to_gpu(struct i915_execbuffer *eb)
 {
 	struct i915_vma *vma;
 	int ret;
 
-	list_for_each_entry(vma, vmas, exec_list) {
+	list_for_each_entry(vma, &eb->vmas, exec_list) {
 		struct drm_i915_gem_object *obj = vma->obj;
 
 		if (vma->exec_entry->flags & EXEC_OBJECT_CAPTURE) {
@@ -1121,9 +1098,9 @@ i915_gem_execbuffer_move_to_gpu(struct drm_i915_gem_request *req,
 			if (unlikely(!capture))
 				return -ENOMEM;
 
-			capture->next = req->capture_list;
+			capture->next = eb->request->capture_list;
 			capture->vma = vma;
-			req->capture_list = capture;
+			eb->request->capture_list = capture;
 		}
 
 		if (vma->exec_entry->flags & EXEC_OBJECT_ASYNC)
@@ -1133,22 +1110,22 @@ i915_gem_execbuffer_move_to_gpu(struct drm_i915_gem_request *req,
 			i915_gem_clflush_object(obj, 0);
 
 		ret = i915_gem_request_await_object
-			(req, obj, obj->base.pending_write_domain);
+			(eb->request, obj, obj->base.pending_write_domain);
 		if (ret)
 			return ret;
 	}
 
 	/* Unconditionally flush any chipset caches (for streaming writes). */
-	i915_gem_chipset_flush(req->engine->i915);
+	i915_gem_chipset_flush(eb->i915);
 
 	/* Unconditionally invalidate GPU caches and TLBs. */
-	return req->engine->emit_flush(req, EMIT_INVALIDATE);
+	return eb->engine->emit_flush(eb->request, EMIT_INVALIDATE);
 }
 
 static bool
 i915_gem_check_execbuffer(struct drm_i915_gem_execbuffer2 *exec)
 {
-	if (exec->flags & __I915_EXEC_UNKNOWN_FLAGS)
+	if (exec->flags & __I915_EXEC_ILLEGAL_FLAGS)
 		return false;
 
 	/* Kernel clipping was a DRI1 misfeature */
@@ -1245,22 +1222,24 @@ validate_exec_list(struct drm_device *dev,
 	return 0;
 }
 
-static struct i915_gem_context *
-i915_gem_validate_context(struct drm_device *dev, struct drm_file *file,
-			  struct intel_engine_cs *engine, const u32 ctx_id)
+static int eb_select_context(struct i915_execbuffer *eb)
 {
+	unsigned int ctx_id = i915_execbuffer2_get_context_id(*eb->args);
 	struct i915_gem_context *ctx;
 
-	ctx = i915_gem_context_lookup(file->driver_priv, ctx_id);
-	if (IS_ERR(ctx))
-		return ctx;
+	ctx = i915_gem_context_lookup(eb->file->driver_priv, ctx_id);
+	if (unlikely(IS_ERR(ctx)))
+		return PTR_ERR(ctx);
 
-	if (i915_gem_context_is_banned(ctx)) {
+	if (unlikely(i915_gem_context_is_banned(ctx))) {
 		DRM_DEBUG("Context %u tried to submit while banned\n", ctx_id);
-		return ERR_PTR(-EIO);
+		return -EIO;
 	}
 
-	return ctx;
+	eb->ctx = i915_gem_context_get(ctx);
+	eb->vm = ctx->ppgtt ? &ctx->ppgtt->base : &eb->i915->ggtt.base;
+
+	return 0;
 }
 
 void i915_vma_move_to_active(struct i915_vma *vma,
@@ -1320,12 +1299,11 @@ static void eb_export_fence(struct drm_i915_gem_object *obj,
 }
 
 static void
-i915_gem_execbuffer_move_to_active(struct list_head *vmas,
-				   struct drm_i915_gem_request *req)
+eb_move_to_active(struct i915_execbuffer *eb)
 {
 	struct i915_vma *vma;
 
-	list_for_each_entry(vma, vmas, exec_list) {
+	list_for_each_entry(vma, &eb->vmas, exec_list) {
 		struct drm_i915_gem_object *obj = vma->obj;
 
 		obj->base.write_domain = obj->base.pending_write_domain;
@@ -1335,8 +1313,8 @@ i915_gem_execbuffer_move_to_active(struct list_head *vmas,
 			obj->base.pending_read_domains |= obj->base.read_domains;
 		obj->base.read_domains = obj->base.pending_read_domains;
 
-		i915_vma_move_to_active(vma, req, vma->exec_entry->flags);
-		eb_export_fence(obj, req, vma->exec_entry->flags);
+		i915_vma_move_to_active(vma, eb->request, vma->exec_entry->flags);
+		eb_export_fence(obj, eb->request, vma->exec_entry->flags);
 	}
 }
 
@@ -1366,29 +1344,22 @@ i915_reset_gen7_sol_offsets(struct drm_i915_gem_request *req)
 	return 0;
 }
 
-static struct i915_vma *
-i915_gem_execbuffer_parse(struct intel_engine_cs *engine,
-			  struct drm_i915_gem_exec_object2 *shadow_exec_entry,
-			  struct drm_i915_gem_object *batch_obj,
-			  struct eb_vmas *eb,
-			  u32 batch_start_offset,
-			  u32 batch_len,
-			  bool is_master)
+static struct i915_vma *eb_parse(struct i915_execbuffer *eb, bool is_master)
 {
 	struct drm_i915_gem_object *shadow_batch_obj;
 	struct i915_vma *vma;
 	int ret;
 
-	shadow_batch_obj = i915_gem_batch_pool_get(&engine->batch_pool,
-						   PAGE_ALIGN(batch_len));
+	shadow_batch_obj = i915_gem_batch_pool_get(&eb->engine->batch_pool,
+						   PAGE_ALIGN(eb->batch_len));
 	if (IS_ERR(shadow_batch_obj))
 		return ERR_CAST(shadow_batch_obj);
 
-	ret = intel_engine_cmd_parser(engine,
-				      batch_obj,
+	ret = intel_engine_cmd_parser(eb->engine,
+				      eb->batch->obj,
 				      shadow_batch_obj,
-				      batch_start_offset,
-				      batch_len,
+				      eb->batch_start_offset,
+				      eb->batch_len,
 				      is_master);
 	if (ret) {
 		if (ret == -EACCES) /* unhandled chained batch */
@@ -1402,9 +1373,8 @@ i915_gem_execbuffer_parse(struct intel_engine_cs *engine,
 	if (IS_ERR(vma))
 		goto out;
 
-	memset(shadow_exec_entry, 0, sizeof(*shadow_exec_entry));
-
-	vma->exec_entry = shadow_exec_entry;
+	vma->exec_entry =
+		memset(&eb->shadow_exec_entry, 0, sizeof(*vma->exec_entry));
 	vma->exec_entry->flags = __EXEC_OBJECT_HAS_PIN;
 	i915_gem_object_get(shadow_batch_obj);
 	list_add_tail(&vma->exec_list, &eb->vmas);
@@ -1423,46 +1393,33 @@ add_to_client(struct drm_i915_gem_request *req,
 }
 
 static int
-execbuf_submit(struct i915_execbuffer_params *params,
-	       struct drm_i915_gem_execbuffer2 *args,
-	       struct list_head *vmas)
+execbuf_submit(struct i915_execbuffer *eb)
 {
-	u64 exec_start, exec_len;
 	int ret;
 
-	ret = i915_gem_execbuffer_move_to_gpu(params->request, vmas);
+	ret = eb_move_to_gpu(eb);
 	if (ret)
 		return ret;
 
-	ret = i915_switch_context(params->request);
+	ret = i915_switch_context(eb->request);
 	if (ret)
 		return ret;
 
-	if (args->flags & I915_EXEC_CONSTANTS_MASK) {
-		DRM_DEBUG("I915_EXEC_CONSTANTS_* unsupported\n");
-		return -EINVAL;
-	}
-
-	if (args->flags & I915_EXEC_GEN7_SOL_RESET) {
-		ret = i915_reset_gen7_sol_offsets(params->request);
+	if (eb->args->flags & I915_EXEC_GEN7_SOL_RESET) {
+		ret = i915_reset_gen7_sol_offsets(eb->request);
 		if (ret)
 			return ret;
 	}
 
-	exec_len   = args->batch_len;
-	exec_start = params->batch->node.start +
-		     params->args_batch_start_offset;
-
-	if (exec_len == 0)
-		exec_len = params->batch->size - params->args_batch_start_offset;
-
-	ret = params->engine->emit_bb_start(params->request,
-					    exec_start, exec_len,
-					    params->dispatch_flags);
+	ret = eb->engine->emit_bb_start(eb->request,
+					eb->batch->node.start +
+					eb->batch_start_offset,
+					eb->batch_len,
+					eb->dispatch_flags);
 	if (ret)
 		return ret;
 
-	i915_gem_execbuffer_move_to_active(vmas, params->request);
+	eb_move_to_active(eb);
 
 	return 0;
 }
@@ -1544,27 +1501,16 @@ eb_select_engine(struct drm_i915_private *dev_priv,
 }
 
 static int
-i915_gem_do_execbuffer(struct drm_device *dev, void *data,
+i915_gem_do_execbuffer(struct drm_device *dev,
 		       struct drm_file *file,
 		       struct drm_i915_gem_execbuffer2 *args,
 		       struct drm_i915_gem_exec_object2 *exec)
 {
-	struct drm_i915_private *dev_priv = to_i915(dev);
-	struct i915_ggtt *ggtt = &dev_priv->ggtt;
-	struct eb_vmas *eb;
-	struct drm_i915_gem_exec_object2 shadow_exec_entry;
-	struct intel_engine_cs *engine;
-	struct i915_gem_context *ctx;
-	struct i915_address_space *vm;
-	struct i915_execbuffer_params params_master; /* XXX: will be removed later */
-	struct i915_execbuffer_params *params = &params_master;
-	const u32 ctx_id = i915_execbuffer2_get_context_id(*args);
-	u32 dispatch_flags;
+	struct i915_execbuffer eb;
 	struct dma_fence *in_fence = NULL;
 	struct sync_file *out_fence = NULL;
 	int out_fence_fd = -1;
 	int ret;
-	bool need_relocs;
 
 	if (!i915_gem_check_execbuffer(args))
 		return -EINVAL;
@@ -1573,37 +1519,42 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data,
 	if (ret)
 		return ret;
 
-	dispatch_flags = 0;
+	eb.i915 = to_i915(dev);
+	eb.file = file;
+	eb.args = args;
+	eb.exec = exec;
+	eb.need_relocs = (args->flags & I915_EXEC_NO_RELOC) == 0;
+	reloc_cache_init(&eb.reloc_cache, eb.i915);
+
+	eb.batch_start_offset = args->batch_start_offset;
+	eb.batch_len = args->batch_len;
+
+	eb.dispatch_flags = 0;
 	if (args->flags & I915_EXEC_SECURE) {
 		if (!drm_is_current_master(file) || !capable(CAP_SYS_ADMIN))
 		    return -EPERM;
 
-		dispatch_flags |= I915_DISPATCH_SECURE;
+		eb.dispatch_flags |= I915_DISPATCH_SECURE;
 	}
 	if (args->flags & I915_EXEC_IS_PINNED)
-		dispatch_flags |= I915_DISPATCH_PINNED;
-
-	engine = eb_select_engine(dev_priv, file, args);
-	if (!engine)
-		return -EINVAL;
+		eb.dispatch_flags |= I915_DISPATCH_PINNED;
 
-	if (args->buffer_count < 1) {
-		DRM_DEBUG("execbuf with %d buffers\n", args->buffer_count);
+	eb.engine = eb_select_engine(eb.i915, file, args);
+	if (!eb.engine)
 		return -EINVAL;
-	}
 
 	if (args->flags & I915_EXEC_RESOURCE_STREAMER) {
-		if (!HAS_RESOURCE_STREAMER(dev_priv)) {
+		if (!HAS_RESOURCE_STREAMER(eb.i915)) {
 			DRM_DEBUG("RS is only allowed for Haswell, Gen8 and above\n");
 			return -EINVAL;
 		}
-		if (engine->id != RCS) {
+		if (eb.engine->id != RCS) {
 			DRM_DEBUG("RS is not available on %s\n",
-				 engine->name);
+				 eb.engine->name);
 			return -EINVAL;
 		}
 
-		dispatch_flags |= I915_DISPATCH_RS;
+		eb.dispatch_flags |= I915_DISPATCH_RS;
 	}
 
 	if (args->flags & I915_EXEC_FENCE_IN) {
@@ -1626,59 +1577,44 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data,
 	 * wakeref that we hold until the GPU has been idle for at least
 	 * 100ms.
 	 */
-	intel_runtime_pm_get(dev_priv);
+	intel_runtime_pm_get(eb.i915);
 
 	ret = i915_mutex_lock_interruptible(dev);
 	if (ret)
 		goto pre_mutex_err;
 
-	ctx = i915_gem_validate_context(dev, file, engine, ctx_id);
-	if (IS_ERR(ctx)) {
+	ret = eb_select_context(&eb);
+	if (ret) {
 		mutex_unlock(&dev->struct_mutex);
-		ret = PTR_ERR(ctx);
 		goto pre_mutex_err;
 	}
 
-	i915_gem_context_get(ctx);
-
-	if (ctx->ppgtt)
-		vm = &ctx->ppgtt->base;
-	else
-		vm = &ggtt->base;
-
-	memset(&params_master, 0x00, sizeof(params_master));
-
-	eb = eb_create(dev_priv, args);
-	if (eb == NULL) {
-		i915_gem_context_put(ctx);
+	if (eb_create(&eb)) {
+		i915_gem_context_put(eb.ctx);
 		mutex_unlock(&dev->struct_mutex);
 		ret = -ENOMEM;
 		goto pre_mutex_err;
 	}
 
 	/* Look up object handles */
-	ret = eb_lookup_vmas(eb, exec, args, vm, file);
+	ret = eb_lookup_vmas(&eb);
 	if (ret)
 		goto err;
 
 	/* take note of the batch buffer before we might reorder the lists */
-	params->batch = eb_get_batch(eb);
+	eb.batch = eb_get_batch(&eb);
 
 	/* Move the objects en-masse into the GTT, evicting if necessary. */
-	need_relocs = (args->flags & I915_EXEC_NO_RELOC) == 0;
-	ret = i915_gem_execbuffer_reserve(engine, &eb->vmas, ctx,
-					  &need_relocs);
+	ret = eb_reserve(&eb);
 	if (ret)
 		goto err;
 
 	/* The objects are in their final locations, apply the relocations. */
-	if (need_relocs)
-		ret = i915_gem_execbuffer_relocate(eb);
+	if (eb.need_relocs)
+		ret = eb_relocate(&eb);
 	if (ret) {
 		if (ret == -EFAULT) {
-			ret = i915_gem_execbuffer_relocate_slow(dev, args, file,
-								engine,
-								eb, exec, ctx);
+			ret = eb_relocate_slow(&eb);
 			BUG_ON(!mutex_is_locked(&dev->struct_mutex));
 		}
 		if (ret)
@@ -1686,28 +1622,22 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data,
 	}
 
 	/* Set the pending read domains for the batch buffer to COMMAND */
-	if (params->batch->obj->base.pending_write_domain) {
+	if (eb.batch->obj->base.pending_write_domain) {
 		DRM_DEBUG("Attempting to use self-modifying batch buffer\n");
 		ret = -EINVAL;
 		goto err;
 	}
-	if (args->batch_start_offset > params->batch->size ||
-	    args->batch_len > params->batch->size - args->batch_start_offset) {
+	if (eb.batch_start_offset > eb.batch->size ||
+	    eb.batch_len > eb.batch->size - eb.batch_start_offset) {
 		DRM_DEBUG("Attempting to use out-of-bounds batch\n");
 		ret = -EINVAL;
 		goto err;
 	}
 
-	params->args_batch_start_offset = args->batch_start_offset;
-	if (engine->needs_cmd_parser && args->batch_len) {
+	if (eb.engine->needs_cmd_parser && eb.batch_len) {
 		struct i915_vma *vma;
 
-		vma = i915_gem_execbuffer_parse(engine, &shadow_exec_entry,
-						params->batch->obj,
-						eb,
-						args->batch_start_offset,
-						args->batch_len,
-						drm_is_current_master(file));
+		vma = eb_parse(&eb, drm_is_current_master(file));
 		if (IS_ERR(vma)) {
 			ret = PTR_ERR(vma);
 			goto err;
@@ -1723,19 +1653,21 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data,
 			 * specifically don't want that set on batches the
 			 * command parser has accepted.
 			 */
-			dispatch_flags |= I915_DISPATCH_SECURE;
-			params->args_batch_start_offset = 0;
-			params->batch = vma;
+			eb.dispatch_flags |= I915_DISPATCH_SECURE;
+			eb.batch_start_offset = 0;
+			eb.batch = vma;
 		}
 	}
 
-	params->batch->obj->base.pending_read_domains |= I915_GEM_DOMAIN_COMMAND;
+	eb.batch->obj->base.pending_read_domains |= I915_GEM_DOMAIN_COMMAND;
+	if (eb.batch_len == 0)
+		eb.batch_len = eb.batch->size - eb.batch_start_offset;
 
 	/* snb/ivb/vlv conflate the "batch in ppgtt" bit with the "non-secure
 	 * batch" bit. Hence we need to pin secure batches into the global gtt.
 	 * hsw should have this fixed, but bdw mucks it up again. */
-	if (dispatch_flags & I915_DISPATCH_SECURE) {
-		struct drm_i915_gem_object *obj = params->batch->obj;
+	if (eb.dispatch_flags & I915_DISPATCH_SECURE) {
+		struct drm_i915_gem_object *obj = eb.batch->obj;
 		struct i915_vma *vma;
 
 		/*
@@ -1754,25 +1686,24 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data,
 			goto err;
 		}
 
-		params->batch = vma;
+		eb.batch = vma;
 	}
 
 	/* Allocate a request for this batch buffer nice and early. */
-	params->request = i915_gem_request_alloc(engine, ctx);
-	if (IS_ERR(params->request)) {
-		ret = PTR_ERR(params->request);
+	eb.request = i915_gem_request_alloc(eb.engine, eb.ctx);
+	if (IS_ERR(eb.request)) {
+		ret = PTR_ERR(eb.request);
 		goto err_batch_unpin;
 	}
 
 	if (in_fence) {
-		ret = i915_gem_request_await_dma_fence(params->request,
-						       in_fence);
+		ret = i915_gem_request_await_dma_fence(eb.request, in_fence);
 		if (ret < 0)
 			goto err_request;
 	}
 
 	if (out_fence_fd != -1) {
-		out_fence = sync_file_create(&params->request->fence);
+		out_fence = sync_file_create(&eb.request->fence);
 		if (!out_fence) {
 			ret = -ENOMEM;
 			goto err_request;
@@ -1785,26 +1716,13 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data,
 	 * inactive_list and lose its active reference. Hence we do not need
 	 * to explicitly hold another reference here.
 	 */
-	params->request->batch = params->batch;
-
-	/*
-	 * Save assorted stuff away to pass through to *_submission().
-	 * NB: This data should be 'persistent' and not local as it will
-	 * kept around beyond the duration of the IOCTL once the GPU
-	 * scheduler arrives.
-	 */
-	params->dev                     = dev;
-	params->file                    = file;
-	params->engine                    = engine;
-	params->dispatch_flags          = dispatch_flags;
-	params->ctx                     = ctx;
+	eb.request->batch = eb.batch;
 
-	trace_i915_gem_request_queue(params->request, dispatch_flags);
-
-	ret = execbuf_submit(params, args, &eb->vmas);
+	trace_i915_gem_request_queue(eb.request, eb.dispatch_flags);
+	ret = execbuf_submit(&eb);
 err_request:
-	__i915_add_request(params->request, ret == 0);
-	add_to_client(params->request, file);
+	__i915_add_request(eb.request, ret == 0);
+	add_to_client(eb.request, file);
 
 	if (out_fence) {
 		if (ret == 0) {
@@ -1824,19 +1742,17 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data,
 	 * needs to be adjusted to also track the ggtt batch vma properly as
 	 * active.
 	 */
-	if (dispatch_flags & I915_DISPATCH_SECURE)
-		i915_vma_unpin(params->batch);
+	if (eb.dispatch_flags & I915_DISPATCH_SECURE)
+		i915_vma_unpin(eb.batch);
 err:
 	/* the request owns the ref now */
-	i915_gem_context_put(ctx);
-	eb_destroy(eb);
-
+	eb_destroy(&eb);
 	mutex_unlock(&dev->struct_mutex);
 
 pre_mutex_err:
 	/* intel_gpu_busy should also get a ref, so it will free when the device
 	 * is really idle. */
-	intel_runtime_pm_put(dev_priv);
+	intel_runtime_pm_put(eb.i915);
 	if (out_fence_fd != -1)
 		put_unused_fd(out_fence_fd);
 err_in_fence:
@@ -1907,7 +1823,7 @@ i915_gem_execbuffer(struct drm_device *dev, void *data,
 	exec2.flags = I915_EXEC_RENDER;
 	i915_execbuffer2_set_context_id(exec2, 0);
 
-	ret = i915_gem_do_execbuffer(dev, data, file, &exec2, exec2_list);
+	ret = i915_gem_do_execbuffer(dev, file, &exec2, exec2_list);
 	if (!ret) {
 		struct drm_i915_gem_exec_object __user *user_exec_list =
 			u64_to_user_ptr(args->buffers_ptr);
@@ -1966,7 +1882,7 @@ i915_gem_execbuffer2(struct drm_device *dev, void *data,
 		return -EFAULT;
 	}
 
-	ret = i915_gem_do_execbuffer(dev, data, file, args, exec2_list);
+	ret = i915_gem_do_execbuffer(dev, file, args, exec2_list);
 	if (!ret) {
 		/* Copy the new buffer offsets back to the user's exec list. */
 		struct drm_i915_gem_exec_object2 __user *user_exec_list =
-- 
2.11.0

* [PATCH 06/24] drm/i915: Use vma->exec_entry as our double-entry placeholder
  2017-05-18  9:46 [PATCH 01/24] drm/i915/selftests: Pretend to be a gfx pci device Chris Wilson
                   ` (3 preceding siblings ...)
  2017-05-18  9:46 ` [PATCH 05/24] drm/i915: Amalgamate execbuffer parameter structures Chris Wilson
@ 2017-05-18  9:46 ` Chris Wilson
  2017-05-18  9:46 ` [PATCH 07/24] drm/i915: Split vma exec_link/evict_link Chris Wilson
                   ` (20 subsequent siblings)
  25 siblings, 0 replies; 47+ messages in thread
From: Chris Wilson @ 2017-05-18  9:46 UTC (permalink / raw)
  To: intel-gfx

This has the benefit of not requiring us to manipulate the
vma->exec_link list when tearing down the execbuffer, and gives a
marginally cheaper test for detecting the user error of listing the
same object more than once.
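
As a sketch of that cheaper test, here is a hypothetical helper in the
spirit of this change (not the literal kernel code; the list member is
still called exec_list at this point in the series):

    /* Sketch only: with vma->exec_entry as the placeholder, a handle
     * that appears twice in the object list is caught by a plain
     * pointer test when we try to add it a second time.
     */
    static bool eb_add_vma_sketch(struct i915_execbuffer *eb,
                                  struct i915_vma *vma, int i)
    {
            if (vma->exec_entry)    /* already added: duplicate handle */
                    return false;   /* caller reports -EINVAL */

            list_add_tail(&vma->exec_list, &eb->vmas);
            vma->exec_entry = &eb->exec[i];
            return true;
    }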

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_gem_evict.c      | 17 ++-----
 drivers/gpu/drm/i915/i915_gem_execbuffer.c | 77 ++++++++++++++++--------------
 drivers/gpu/drm/i915/i915_vma.c            |  1 -
 3 files changed, 44 insertions(+), 51 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_evict.c b/drivers/gpu/drm/i915/i915_gem_evict.c
index 51e365f70464..891247d79299 100644
--- a/drivers/gpu/drm/i915/i915_gem_evict.c
+++ b/drivers/gpu/drm/i915/i915_gem_evict.c
@@ -59,9 +59,6 @@ mark_free(struct drm_mm_scan *scan,
 	if (i915_vma_is_pinned(vma))
 		return false;
 
-	if (WARN_ON(!list_empty(&vma->exec_list)))
-		return false;
-
 	if (flags & PIN_NONFAULT && !list_empty(&vma->obj->userfault_link))
 		return false;
 
@@ -160,8 +157,6 @@ i915_gem_evict_something(struct i915_address_space *vm,
 	list_for_each_entry_safe(vma, next, &eviction_list, exec_list) {
 		ret = drm_mm_scan_remove_block(&scan, &vma->node);
 		BUG_ON(ret);
-
-		INIT_LIST_HEAD(&vma->exec_list);
 	}
 
 	/* Can we unpin some objects such as idle hw contents,
@@ -209,17 +204,12 @@ i915_gem_evict_something(struct i915_address_space *vm,
 		if (drm_mm_scan_remove_block(&scan, &vma->node))
 			__i915_vma_pin(vma);
 		else
-			list_del_init(&vma->exec_list);
+			list_del(&vma->exec_list);
 	}
 
 	/* Unbinding will emit any required flushes */
 	ret = 0;
-	while (!list_empty(&eviction_list)) {
-		vma = list_first_entry(&eviction_list,
-				       struct i915_vma,
-				       exec_list);
-
-		list_del_init(&vma->exec_list);
+	list_for_each_entry_safe(vma, next, &eviction_list, exec_list) {
 		__i915_vma_unpin(vma);
 		if (ret == 0)
 			ret = i915_vma_unbind(vma);
@@ -315,7 +305,7 @@ int i915_gem_evict_for_node(struct i915_address_space *vm,
 		}
 
 		/* Overlap of objects in the same batch? */
-		if (i915_vma_is_pinned(vma) || !list_empty(&vma->exec_list)) {
+		if (i915_vma_is_pinned(vma)) {
 			ret = -ENOSPC;
 			if (vma->exec_entry &&
 			    vma->exec_entry->flags & EXEC_OBJECT_PINNED)
@@ -336,7 +326,6 @@ int i915_gem_evict_for_node(struct i915_address_space *vm,
 	}
 
 	list_for_each_entry_safe(vma, next, &eviction_list, exec_list) {
-		list_del_init(&vma->exec_list);
 		__i915_vma_unpin(vma);
 		if (ret == 0)
 			ret = i915_vma_unbind(vma);
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index 52021bc64754..fee86e62c934 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -109,13 +109,40 @@ eb_create(struct i915_execbuffer *eb)
 		eb->and = -eb->args->buffer_count;
 	}
 
-	INIT_LIST_HEAD(&eb->vmas);
 	return 0;
 }
 
+static inline void
+__eb_unreserve_vma(struct i915_vma *vma,
+		   const struct drm_i915_gem_exec_object2 *entry)
+{
+	if (unlikely(entry->flags & __EXEC_OBJECT_HAS_FENCE))
+		i915_vma_unpin_fence(vma);
+
+	if (entry->flags & __EXEC_OBJECT_HAS_PIN)
+		__i915_vma_unpin(vma);
+}
+
+static void
+eb_unreserve_vma(struct i915_vma *vma)
+{
+	struct drm_i915_gem_exec_object2 *entry = vma->exec_entry;
+
+	__eb_unreserve_vma(vma, entry);
+	entry->flags &= ~(__EXEC_OBJECT_HAS_FENCE | __EXEC_OBJECT_HAS_PIN);
+}
+
 static void
 eb_reset(struct i915_execbuffer *eb)
 {
+	struct i915_vma *vma;
+
+	list_for_each_entry(vma, &eb->vmas, exec_list) {
+		eb_unreserve_vma(vma);
+		i915_vma_put(vma);
+		vma->exec_entry = NULL;
+	}
+
 	if (eb->and >= 0)
 		memset(eb->buckets, 0, (eb->and+1)*sizeof(struct hlist_head));
 }
@@ -147,6 +174,8 @@ eb_lookup_vmas(struct i915_execbuffer *eb)
 	struct list_head objects;
 	int i, ret;
 
+	INIT_LIST_HEAD(&eb->vmas);
+
 	INIT_LIST_HEAD(&objects);
 	spin_lock(&eb->file->table_lock);
 	/* Grab a reference to the object and release the lock so we can lookup
@@ -253,40 +282,23 @@ static struct i915_vma *eb_get_vma(struct i915_execbuffer *eb, unsigned long han
 	}
 }
 
-static void
-eb_unreserve_vma(struct i915_vma *vma)
-{
-	struct drm_i915_gem_exec_object2 *entry;
-
-	if (!drm_mm_node_allocated(&vma->node))
-		return;
-
-	entry = vma->exec_entry;
-
-	if (entry->flags & __EXEC_OBJECT_HAS_FENCE)
-		i915_vma_unpin_fence(vma);
-
-	if (entry->flags & __EXEC_OBJECT_HAS_PIN)
-		__i915_vma_unpin(vma);
-
-	entry->flags &= ~(__EXEC_OBJECT_HAS_FENCE | __EXEC_OBJECT_HAS_PIN);
-}
-
 static void eb_destroy(struct i915_execbuffer *eb)
 {
-	i915_gem_context_put(eb->ctx);
+	struct i915_vma *vma;
 
-	while (!list_empty(&eb->vmas)) {
-		struct i915_vma *vma;
+	list_for_each_entry(vma, &eb->vmas, exec_list) {
+		if (!vma->exec_entry)
+			continue;
 
-		vma = list_first_entry(&eb->vmas,
-				       struct i915_vma,
-				       exec_list);
-		list_del_init(&vma->exec_list);
-		eb_unreserve_vma(vma);
+		__eb_unreserve_vma(vma, vma->exec_entry);
 		vma->exec_entry = NULL;
 		i915_vma_put(vma);
 	}
+
+	i915_gem_context_put(eb->ctx);
+
+	if (eb->buckets)
+		kfree(eb->buckets);
 }
 
 static inline int use_cpu_reloc(struct drm_i915_gem_object *obj)
@@ -986,13 +998,7 @@ eb_relocate_slow(struct i915_execbuffer *eb)
 	int i, total, ret;
 
 	/* We may process another execbuffer during the unlock... */
-	while (!list_empty(&eb->vmas)) {
-		vma = list_first_entry(&eb->vmas, struct i915_vma, exec_list);
-		list_del_init(&vma->exec_list);
-		eb_unreserve_vma(vma);
-		i915_vma_put(vma);
-	}
-
+	eb_reset(eb);
 	mutex_unlock(&dev->struct_mutex);
 
 	total = 0;
@@ -1053,7 +1059,6 @@ eb_relocate_slow(struct i915_execbuffer *eb)
 	}
 
 	/* reacquire the objects */
-	eb_reset(eb);
 	ret = eb_lookup_vmas(eb);
 	if (ret)
 		goto err;
diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c
index 1aba47024656..6cf32da682ec 100644
--- a/drivers/gpu/drm/i915/i915_vma.c
+++ b/drivers/gpu/drm/i915/i915_vma.c
@@ -85,7 +85,6 @@ vma_create(struct drm_i915_gem_object *obj,
 	if (vma == NULL)
 		return ERR_PTR(-ENOMEM);
 
-	INIT_LIST_HEAD(&vma->exec_list);
 	for (i = 0; i < ARRAY_SIZE(vma->last_read); i++)
 		init_request_active(&vma->last_read[i], i915_vma_retire);
 	init_request_active(&vma->last_fence, NULL);
-- 
2.11.0

* [PATCH 07/24] drm/i915: Split vma exec_link/evict_link
  2017-05-18  9:46 [PATCH 01/24] drm/i915/selftests: Pretend to be a gfx pci device Chris Wilson
                   ` (4 preceding siblings ...)
  2017-05-18  9:46 ` [PATCH 06/24] drm/i915: Use vma->exec_entry as our double-entry placeholder Chris Wilson
@ 2017-05-18  9:46 ` Chris Wilson
  2017-05-18  9:46 ` [PATCH 08/24] drm/i915: Store a direct lookup from object handle to vma Chris Wilson
                   ` (19 subsequent siblings)
  25 siblings, 0 replies; 47+ messages in thread
From: Chris Wilson @ 2017-05-18  9:46 UTC (permalink / raw)
  To: intel-gfx

Currently the vma has a single link member that serves both to hold
its place in the execbuf reservation list and to track it on any
eviction list. This dual use is tricky and error-prone.
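
To make the hazard concrete, a sketch (not code from the tree) of how
the shared link could alias before the split:

    /* Pre-patch: one list_head serves both roles, so putting the vma
     * on an eviction list re-links the very node that holds its place
     * on the execbuf list:
     */
    list_add_tail(&vma->exec_list, &eb->vmas);      /* reservation */
    list_add(&vma->exec_list, &eviction_list);      /* eviction scan */

    /* Post-patch: each role has its own link and cannot alias. */
    list_add_tail(&vma->exec_link, &eb->vmas);
    list_add(&vma->evict_link, &eviction_list);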

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_gem_evict.c      | 14 ++++++-------
 drivers/gpu/drm/i915/i915_gem_execbuffer.c | 32 +++++++++++++++---------------
 drivers/gpu/drm/i915/i915_vma.h            |  7 +++++--
 3 files changed, 28 insertions(+), 25 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_evict.c b/drivers/gpu/drm/i915/i915_gem_evict.c
index 891247d79299..204a2d9288ae 100644
--- a/drivers/gpu/drm/i915/i915_gem_evict.c
+++ b/drivers/gpu/drm/i915/i915_gem_evict.c
@@ -62,7 +62,7 @@ mark_free(struct drm_mm_scan *scan,
 	if (flags & PIN_NONFAULT && !list_empty(&vma->obj->userfault_link))
 		return false;
 
-	list_add(&vma->exec_list, unwind);
+	list_add(&vma->evict_link, unwind);
 	return drm_mm_scan_add_block(scan, &vma->node);
 }
 
@@ -154,7 +154,7 @@ i915_gem_evict_something(struct i915_address_space *vm,
 	} while (*++phase);
 
 	/* Nothing found, clean up and bail out! */
-	list_for_each_entry_safe(vma, next, &eviction_list, exec_list) {
+	list_for_each_entry_safe(vma, next, &eviction_list, evict_link) {
 		ret = drm_mm_scan_remove_block(&scan, &vma->node);
 		BUG_ON(ret);
 	}
@@ -200,16 +200,16 @@ i915_gem_evict_something(struct i915_address_space *vm,
 	 * calling unbind (which may remove the active reference
 	 * of any of our objects, thus corrupting the list).
 	 */
-	list_for_each_entry_safe(vma, next, &eviction_list, exec_list) {
+	list_for_each_entry_safe(vma, next, &eviction_list, evict_link) {
 		if (drm_mm_scan_remove_block(&scan, &vma->node))
 			__i915_vma_pin(vma);
 		else
-			list_del(&vma->exec_list);
+			list_del(&vma->evict_link);
 	}
 
 	/* Unbinding will emit any required flushes */
 	ret = 0;
-	list_for_each_entry_safe(vma, next, &eviction_list, exec_list) {
+	list_for_each_entry_safe(vma, next, &eviction_list, evict_link) {
 		__i915_vma_unpin(vma);
 		if (ret == 0)
 			ret = i915_vma_unbind(vma);
@@ -322,10 +322,10 @@ int i915_gem_evict_for_node(struct i915_address_space *vm,
 		 * reference) another in our eviction list.
 		 */
 		__i915_vma_pin(vma);
-		list_add(&vma->exec_list, &eviction_list);
+		list_add(&vma->evict_link, &eviction_list);
 	}
 
-	list_for_each_entry_safe(vma, next, &eviction_list, exec_list) {
+	list_for_each_entry_safe(vma, next, &eviction_list, evict_link) {
 		__i915_vma_unpin(vma);
 		if (ret == 0)
 			ret = i915_vma_unbind(vma);
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index fee86e62c934..ba1bf8a0123d 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -137,7 +137,7 @@ eb_reset(struct i915_execbuffer *eb)
 {
 	struct i915_vma *vma;
 
-	list_for_each_entry(vma, &eb->vmas, exec_list) {
+	list_for_each_entry(vma, &eb->vmas, exec_link) {
 		eb_unreserve_vma(vma);
 		i915_vma_put(vma);
 		vma->exec_entry = NULL;
@@ -150,7 +150,7 @@ eb_reset(struct i915_execbuffer *eb)
 static struct i915_vma *
 eb_get_batch(struct i915_execbuffer *eb)
 {
-	struct i915_vma *vma = list_entry(eb->vmas.prev, typeof(*vma), exec_list);
+	struct i915_vma *vma = list_entry(eb->vmas.prev, typeof(*vma), exec_link);
 
 	/*
 	 * SNA is doing fancy tricks with compressing batch buffers, which leads
@@ -227,7 +227,7 @@ eb_lookup_vmas(struct i915_execbuffer *eb)
 		}
 
 		/* Transfer ownership from the objects list to the vmas list. */
-		list_add_tail(&vma->exec_list, &eb->vmas);
+		list_add_tail(&vma->exec_link, &eb->vmas);
 		list_del_init(&obj->obj_exec_link);
 
 		vma->exec_entry = &eb->exec[i];
@@ -286,7 +286,7 @@ static void eb_destroy(struct i915_execbuffer *eb)
 {
 	struct i915_vma *vma;
 
-	list_for_each_entry(vma, &eb->vmas, exec_list) {
+	list_for_each_entry(vma, &eb->vmas, exec_link) {
 		if (!vma->exec_entry)
 			continue;
 
@@ -752,7 +752,7 @@ static int eb_relocate(struct i915_execbuffer *eb)
 	struct i915_vma *vma;
 	int ret = 0;
 
-	list_for_each_entry(vma, &eb->vmas, exec_list) {
+	list_for_each_entry(vma, &eb->vmas, exec_link) {
 		ret = eb_relocate_vma(vma, eb);
 		if (ret)
 			break;
@@ -905,7 +905,7 @@ static int eb_reserve(struct i915_execbuffer *eb)
 		struct drm_i915_gem_exec_object2 *entry;
 		bool need_fence, need_mappable;
 
-		vma = list_first_entry(&eb->vmas, struct i915_vma, exec_list);
+		vma = list_first_entry(&eb->vmas, struct i915_vma, exec_link);
 		obj = vma->obj;
 		entry = vma->exec_entry;
 
@@ -921,12 +921,12 @@ static int eb_reserve(struct i915_execbuffer *eb)
 		need_mappable = need_fence || need_reloc_mappable(vma);
 
 		if (entry->flags & EXEC_OBJECT_PINNED)
-			list_move_tail(&vma->exec_list, &pinned_vmas);
+			list_move_tail(&vma->exec_link, &pinned_vmas);
 		else if (need_mappable) {
 			entry->flags |= __EXEC_OBJECT_NEEDS_MAP;
-			list_move(&vma->exec_list, &ordered_vmas);
+			list_move(&vma->exec_link, &ordered_vmas);
 		} else
-			list_move_tail(&vma->exec_list, &ordered_vmas);
+			list_move_tail(&vma->exec_link, &ordered_vmas);
 
 		obj->base.pending_read_domains = I915_GEM_GPU_DOMAINS & ~I915_GEM_DOMAIN_COMMAND;
 		obj->base.pending_write_domain = 0;
@@ -951,7 +951,7 @@ static int eb_reserve(struct i915_execbuffer *eb)
 		int ret = 0;
 
 		/* Unbind any ill-fitting objects or pin. */
-		list_for_each_entry(vma, &eb->vmas, exec_list) {
+		list_for_each_entry(vma, &eb->vmas, exec_link) {
 			if (!drm_mm_node_allocated(&vma->node))
 				continue;
 
@@ -964,7 +964,7 @@ static int eb_reserve(struct i915_execbuffer *eb)
 		}
 
 		/* Bind fresh objects */
-		list_for_each_entry(vma, &eb->vmas, exec_list) {
+		list_for_each_entry(vma, &eb->vmas, exec_link) {
 			if (drm_mm_node_allocated(&vma->node))
 				continue;
 
@@ -978,7 +978,7 @@ static int eb_reserve(struct i915_execbuffer *eb)
 			return ret;
 
 		/* Decrement pin count for bound objects */
-		list_for_each_entry(vma, &eb->vmas, exec_list)
+		list_for_each_entry(vma, &eb->vmas, exec_link)
 			eb_unreserve_vma(vma);
 
 		ret = i915_gem_evict_vm(eb->vm, true);
@@ -1067,7 +1067,7 @@ eb_relocate_slow(struct i915_execbuffer *eb)
 	if (ret)
 		goto err;
 
-	list_for_each_entry(vma, &eb->vmas, exec_list) {
+	list_for_each_entry(vma, &eb->vmas, exec_link) {
 		int idx = vma->exec_entry - eb->exec;
 
 		ret = eb_relocate_vma_slow(vma, eb, reloc + reloc_offset[idx]);
@@ -1093,7 +1093,7 @@ eb_move_to_gpu(struct i915_execbuffer *eb)
 	struct i915_vma *vma;
 	int ret;
 
-	list_for_each_entry(vma, &eb->vmas, exec_list) {
+	list_for_each_entry(vma, &eb->vmas, exec_link) {
 		struct drm_i915_gem_object *obj = vma->obj;
 
 		if (vma->exec_entry->flags & EXEC_OBJECT_CAPTURE) {
@@ -1308,7 +1308,7 @@ eb_move_to_active(struct i915_execbuffer *eb)
 {
 	struct i915_vma *vma;
 
-	list_for_each_entry(vma, &eb->vmas, exec_list) {
+	list_for_each_entry(vma, &eb->vmas, exec_link) {
 		struct drm_i915_gem_object *obj = vma->obj;
 
 		obj->base.write_domain = obj->base.pending_write_domain;
@@ -1382,7 +1382,7 @@ static struct i915_vma *eb_parse(struct i915_execbuffer *eb, bool is_master)
 		memset(&eb->shadow_exec_entry, 0, sizeof(*vma->exec_entry));
 	vma->exec_entry->flags = __EXEC_OBJECT_HAS_PIN;
 	i915_gem_object_get(shadow_batch_obj);
-	list_add_tail(&vma->exec_list, &eb->vmas);
+	list_add_tail(&vma->exec_link, &eb->vmas);
 
 out:
 	i915_gem_object_unpin_pages(shadow_batch_obj);
diff --git a/drivers/gpu/drm/i915/i915_vma.h b/drivers/gpu/drm/i915/i915_vma.h
index 2e03f81dddbe..4d827300d1a8 100644
--- a/drivers/gpu/drm/i915/i915_vma.h
+++ b/drivers/gpu/drm/i915/i915_vma.h
@@ -100,8 +100,11 @@ struct i915_vma {
 	struct list_head obj_link; /* Link in the object's VMA list */
 	struct rb_node obj_node;
 
-	/** This vma's place in the batchbuffer or on the eviction list */
-	struct list_head exec_list;
+	/** This vma's place in the execbuf reservation list */
+	struct list_head exec_link;
+
+	/** This vma's place in the eviction list */
+	struct list_head evict_link;
 
 	/**
 	 * Used for performing relocations during execbuffer insertion.
-- 
2.11.0

* [PATCH 08/24] drm/i915: Store a direct lookup from object handle to vma
  2017-05-18  9:46 [PATCH 01/24] drm/i915/selftests: Pretend to be a gfx pci device Chris Wilson
                   ` (5 preceding siblings ...)
  2017-05-18  9:46 ` [PATCH 07/24] drm/i915: Split vma exec_link/evict_link Chris Wilson
@ 2017-05-18  9:46 ` Chris Wilson
  2017-05-18  9:46 ` [PATCH 09/24] drm/i915: Pass vma to relocate entry Chris Wilson
                   ` (18 subsequent siblings)
  25 siblings, 0 replies; 47+ messages in thread
From: Chris Wilson @ 2017-05-18  9:46 UTC (permalink / raw)
  To: intel-gfx

The advent of full-ppgtt led to an extra indirection between the object
and its binding. That extra indirection has a noticeable impact on how
fast we can convert from the user handles to our internal vma for
execbuffer. In order to bypass the extra indirection, we use a
resizable hashtable to jump from the object to the per-ctx vma.
rhashtable was considered, but we don't need its online resizing
feature and its extra complexity proved to undermine its usefulness.
Instead, we simply reallocate the hashtable on demand in a background
task and serialize against the resize before iterating.

In non-full-ppgtt modes, multiple files and multiple contexts can share
the same vma. This leads to having multiple possible handle->vma links,
so we only use the first to establish the fast path. The majority of
buffers are not shared and so we should still be able to realise
speedups with multiple clients.
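
The fast path then costs one hash plus a short bucket walk per handle;
a condensed sketch of the lookup (the helper name is ours, the fields
are from the diff below):

    static struct i915_vma *
    lookup_sketch(struct i915_gem_context *ctx, u32 handle)
    {
            struct i915_vma *vma;
            struct hlist_head *head =
                    &ctx->vma_lut.ht[hash_32(handle, ctx->vma_lut.ht_bits)];

            hlist_for_each_entry(vma, head, ctx_node)
                    if (vma->ctx_handle == handle)
                            return vma;     /* hit: no idr lookup needed */

            return NULL;    /* miss: fall back to the idr slow pass */
    }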

v2: Prettier names, more magic.
v3: Many style tweaks, most notably hiding the misuse of execobj[].rsvd2

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_debugfs.c           |   6 +
 drivers/gpu/drm/i915/i915_drv.h               |   2 +-
 drivers/gpu/drm/i915/i915_gem.c               |   5 +-
 drivers/gpu/drm/i915/i915_gem_context.c       |  82 +++++++-
 drivers/gpu/drm/i915/i915_gem_context.h       |  26 +++
 drivers/gpu/drm/i915/i915_gem_execbuffer.c    | 264 ++++++++++++++++----------
 drivers/gpu/drm/i915/i915_gem_object.h        |   4 +-
 drivers/gpu/drm/i915/i915_utils.h             |   5 +
 drivers/gpu/drm/i915/i915_vma.c               |  20 ++
 drivers/gpu/drm/i915/i915_vma.h               |   8 +-
 drivers/gpu/drm/i915/selftests/mock_context.c |  12 +-
 11 files changed, 322 insertions(+), 112 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index 8abb93994c48..d72442ce4806 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -1988,6 +1988,12 @@ static int i915_context_status(struct seq_file *m, void *unused)
 			seq_putc(m, '\n');
 		}
 
+		seq_printf(m,
+			   "\tvma hashtable size=%u (actual %lu), count=%u\n",
+			   ctx->vma_lut.ht_size,
+			   BIT(ctx->vma_lut.ht_bits),
+			   ctx->vma_lut.ht_count);
+
 		seq_putc(m, '\n');
 	}
 
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 17883a84b8c1..4f244f9c3d69 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -37,7 +37,7 @@
 #include <linux/i2c.h>
 #include <linux/i2c-algo-bit.h>
 #include <linux/backlight.h>
-#include <linux/hashtable.h>
+#include <linux/hash.h>
 #include <linux/intel-iommu.h>
 #include <linux/kref.h>
 #include <linux/pm_qos.h>
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 870659c13de3..4d01089a38f9 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -3242,6 +3242,10 @@ void i915_gem_close_object(struct drm_gem_object *gem, struct drm_file *file)
 		if (vma->vm->file == fpriv)
 			i915_vma_close(vma);
 
+	vma = obj->vma_hashed;
+	if (vma && vma->ctx->file_priv == fpriv)
+		i915_vma_unlink_ctx(vma);
+
 	if (i915_gem_object_is_active(obj) &&
 	    !i915_gem_object_has_active_reference(obj)) {
 		i915_gem_object_set_active_reference(obj);
@@ -4231,7 +4235,6 @@ void i915_gem_object_init(struct drm_i915_gem_object *obj,
 
 	INIT_LIST_HEAD(&obj->global_link);
 	INIT_LIST_HEAD(&obj->userfault_link);
-	INIT_LIST_HEAD(&obj->obj_exec_link);
 	INIT_LIST_HEAD(&obj->vma_list);
 	INIT_LIST_HEAD(&obj->batch_pool_link);
 
diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index c5d1666d7071..9af443e69c57 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -85,6 +85,7 @@
  *
  */
 
+#include <linux/log2.h>
 #include <drm/drmP.h>
 #include <drm/i915_drm.h>
 #include "i915_drv.h"
@@ -92,6 +93,70 @@
 
 #define ALL_L3_SLICES(dev) (1 << NUM_L3_SLICES(dev)) - 1
 
+/* Initial size (as log2) to preallocate the handle->object hashtable */
+#define VMA_HT_BITS 2u /* 4 x 2 pointers, 64 bytes minimum */
+
+static void resize_vma_ht(struct work_struct *work)
+{
+	struct i915_gem_context_vma_lut *lut =
+		container_of(work, typeof(*lut), resize);
+	unsigned int bits, new_bits, size, i;
+	struct hlist_head *new_ht;
+
+	GEM_BUG_ON(!(lut->ht_size & I915_CTX_RESIZE_IN_PROGRESS));
+
+	bits = 1 + ilog2(4*lut->ht_count/3 + 1);
+	new_bits = min_t(unsigned int,
+			 max(bits, VMA_HT_BITS),
+			 sizeof(unsigned int) * BITS_PER_BYTE - 1);
+	if (new_bits == lut->ht_bits)
+		goto out;
+
+	new_ht = kzalloc(sizeof(*new_ht)<<new_bits, GFP_KERNEL | __GFP_NOWARN);
+	if (!new_ht)
+		new_ht = vzalloc(sizeof(*new_ht)<<new_bits);
+	if (!new_ht)
+		/* Pretend resize succeeded and stop calling us for a bit! */
+		goto out;
+
+	size = BIT(lut->ht_bits);
+	for (i = 0; i < size; i++) {
+		struct i915_vma *vma;
+		struct hlist_node *tmp;
+
+		hlist_for_each_entry_safe(vma, tmp, &lut->ht[i], ctx_node)
+			hlist_add_head(&vma->ctx_node,
+				       &new_ht[hash_32(vma->ctx_handle,
+						       new_bits)]);
+	}
+	kvfree(lut->ht);
+	lut->ht = new_ht;
+	lut->ht_bits = new_bits;
+out:
+	smp_store_release(&lut->ht_size, BIT(bits));
+	GEM_BUG_ON(lut->ht_size & I915_CTX_RESIZE_IN_PROGRESS);
+}
+
+static void vma_lut_free(struct i915_gem_context *ctx)
+{
+	struct i915_gem_context_vma_lut *lut = &ctx->vma_lut;
+	unsigned int i, size;
+
+	if (lut->ht_size & I915_CTX_RESIZE_IN_PROGRESS)
+		cancel_work_sync(&lut->resize);
+
+	size = BIT(lut->ht_bits);
+	for (i = 0; i < size; i++) {
+		struct i915_vma *vma;
+
+		hlist_for_each_entry(vma, &lut->ht[i], ctx_node) {
+			vma->obj->vma_hashed = NULL;
+			vma->ctx = NULL;
+		}
+	}
+	kvfree(lut->ht);
+}
+
 void i915_gem_context_free(struct kref *ctx_ref)
 {
 	struct i915_gem_context *ctx = container_of(ctx_ref, typeof(*ctx), ref);
@@ -101,6 +166,7 @@ void i915_gem_context_free(struct kref *ctx_ref)
 	trace_i915_context_free(ctx);
 	GEM_BUG_ON(!i915_gem_context_is_closed(ctx));
 
+	vma_lut_free(ctx);
 	i915_ppgtt_put(ctx->ppgtt);
 
 	for (i = 0; i < I915_NUM_ENGINES; i++) {
@@ -118,6 +184,7 @@ void i915_gem_context_free(struct kref *ctx_ref)
 
 	kfree(ctx->name);
 	put_pid(ctx->pid);
+
 	list_del(&ctx->link);
 
 	ida_simple_remove(&ctx->i915->context_hw_ida, ctx->hw_id);
@@ -201,13 +268,24 @@ __create_hw_context(struct drm_i915_private *dev_priv,
 	ctx->i915 = dev_priv;
 	ctx->priority = I915_PRIORITY_NORMAL;
 
+	ctx->vma_lut.ht_bits = VMA_HT_BITS;
+	ctx->vma_lut.ht_size = BIT(VMA_HT_BITS);
+	BUILD_BUG_ON(BIT(VMA_HT_BITS) == I915_CTX_RESIZE_IN_PROGRESS);
+	ctx->vma_lut.ht = kcalloc(ctx->vma_lut.ht_size,
+				  sizeof(*ctx->vma_lut.ht),
+				  GFP_KERNEL);
+	if (!ctx->vma_lut.ht)
+		goto err_out;
+
+	INIT_WORK(&ctx->vma_lut.resize, resize_vma_ht);
+
 	/* Default context will never have a file_priv */
 	ret = DEFAULT_CONTEXT_HANDLE;
 	if (file_priv) {
 		ret = idr_alloc(&file_priv->context_idr, ctx,
 				DEFAULT_CONTEXT_HANDLE, 0, GFP_KERNEL);
 		if (ret < 0)
-			goto err_out;
+			goto err_lut;
 	}
 	ctx->user_handle = ret;
 
@@ -248,6 +326,8 @@ __create_hw_context(struct drm_i915_private *dev_priv,
 err_pid:
 	put_pid(ctx->pid);
 	idr_remove(&file_priv->context_idr, ctx->user_handle);
+err_lut:
+	kvfree(ctx->vma_lut.ht);
 err_out:
 	context_close(ctx);
 	return ERR_PTR(ret);
diff --git a/drivers/gpu/drm/i915/i915_gem_context.h b/drivers/gpu/drm/i915/i915_gem_context.h
index 4af2ab94558b..82c99ba92ad3 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.h
+++ b/drivers/gpu/drm/i915/i915_gem_context.h
@@ -143,6 +143,32 @@ struct i915_gem_context {
 	/** ggtt_offset_bias: placement restriction for context objects */
 	u32 ggtt_offset_bias;
 
+	struct i915_gem_context_vma_lut {
+		/** ht_size: last request size to allocate the hashtable for. */
+		unsigned int ht_size;
+#define I915_CTX_RESIZE_IN_PROGRESS BIT(0)
+		/** ht_bits: real log2(size) of hashtable. */
+		unsigned int ht_bits;
+		/** ht_count: current number of entries inside the hashtable */
+		unsigned int ht_count;
+
+		/** ht: the array of buckets comprising the simple hashtable */
+		struct hlist_head *ht;
+
+		/**
+		 * resize: After an execbuf completes, we check the load factor
+		 * of the hashtable. If the hashtable is too full, or too empty,
+		 * we schedule a task to resize the hashtable. During the
+		 * resize, the entries are moved between different buckets and
+		 * so we cannot simultaneously read the hashtable as it is
+		 * being resized (unlike rhashtable). Therefore we treat the
+		 * active work as a strong barrier, pausing a subsequent
+		 * execbuf to wait for the resize worker to complete, if
+		 * required.
+		 */
+		struct work_struct resize;
+	} vma_lut;
+
 	/** engine: per-engine logical HW state */
 	struct intel_context {
 		struct i915_vma *state;
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index ba1bf8a0123d..b21cec236b98 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -75,38 +75,43 @@ struct i915_execbuffer {
 		unsigned int page;
 		bool use_64bit_reloc : 1;
 	} reloc_cache;
-	int and;
-	union {
-		struct i915_vma **lut;
-		struct hlist_head *buckets;
-	};
+	int lut_mask;
+	struct hlist_head *buckets;
 };
 
+/*
+ * As an alternative to creating a hashtable of handle-to-vma for a batch,
+ * we used the last available reserved field in the execobject[] and stash
+ * a link from the execobj to its vma.
+ */
+#define __exec_to_vma(ee) (ee)->rsvd2
+#define exec_to_vma(ee) u64_to_ptr(struct i915_vma, __exec_to_vma(ee))
+
 static int
 eb_create(struct i915_execbuffer *eb)
 {
-	eb->lut = NULL;
-	if (eb->args->flags & I915_EXEC_HANDLE_LUT) {
-		unsigned int size = eb->args->buffer_count;
-		size *= sizeof(struct i915_vma *);
-		eb->lut = kmalloc(size,
-				  GFP_TEMPORARY | __GFP_NOWARN | __GFP_NORETRY);
-	}
-
-	if (!eb->lut) {
-		unsigned int size = eb->args->buffer_count;
-		unsigned int count = PAGE_SIZE / sizeof(struct hlist_head) / 2;
-		BUILD_BUG_ON_NOT_POWER_OF_2(PAGE_SIZE / sizeof(struct hlist_head));
-		while (count > 2*size)
-			count >>= 1;
-		eb->lut = kzalloc(count * sizeof(struct hlist_head),
-				  GFP_TEMPORARY);
-		if (!eb->lut)
-			return -ENOMEM;
-
-		eb->and = count - 1;
+	if ((eb->args->flags & I915_EXEC_HANDLE_LUT) == 0) {
+		unsigned int size = 1 + ilog2(eb->args->buffer_count);
+
+		do {
+			eb->buckets = kzalloc(sizeof(struct hlist_head) << size,
+					      GFP_TEMPORARY |
+					      __GFP_NORETRY |
+					      __GFP_NOWARN);
+			if (eb->buckets)
+				break;
+		} while (--size);
+
+		if (unlikely(!eb->buckets)) {
+			eb->buckets = kzalloc(sizeof(struct hlist_head),
+					      GFP_TEMPORARY);
+			if (unlikely(!eb->buckets))
+				return -ENOMEM;
+		}
+
+		eb->lut_mask = size;
 	} else {
-		eb->and = -eb->args->buffer_count;
+		eb->lut_mask = -eb->args->buffer_count;
 	}
 
 	return 0;
@@ -143,73 +148,112 @@ eb_reset(struct i915_execbuffer *eb)
 		vma->exec_entry = NULL;
 	}
 
-	if (eb->and >= 0)
-		memset(eb->buckets, 0, (eb->and+1)*sizeof(struct hlist_head));
+	if (eb->lut_mask >= 0)
+		memset(eb->buckets, 0,
+		       sizeof(struct hlist_head) << eb->lut_mask);
 }
 
-static struct i915_vma *
-eb_get_batch(struct i915_execbuffer *eb)
+static bool
+eb_add_vma(struct i915_execbuffer *eb, struct i915_vma *vma, int i)
 {
-	struct i915_vma *vma = list_entry(eb->vmas.prev, typeof(*vma), exec_link);
+	if (unlikely(vma->exec_entry)) {
+		DRM_DEBUG("Object [handle %d, index %d] appears more than once in object list\n",
+			  eb->exec[i].handle, i);
+		return false;
+	}
+	list_add_tail(&vma->exec_link, &eb->vmas);
 
-	/*
-	 * SNA is doing fancy tricks with compressing batch buffers, which leads
-	 * to negative relocation deltas. Usually that works out ok since the
-	 * relocate address is still positive, except when the batch is placed
-	 * very low in the GTT. Ensure this doesn't happen.
-	 *
-	 * Note that actual hangs have only been observed on gen7, but for
-	 * paranoia do it everywhere.
-	 */
-	if ((vma->exec_entry->flags & EXEC_OBJECT_PINNED) == 0)
-		vma->exec_entry->flags |= __EXEC_OBJECT_NEEDS_BIAS;
+	vma->exec_entry = &eb->exec[i];
+	if (eb->lut_mask >= 0) {
+		vma->exec_handle = eb->exec[i].handle;
+		hlist_add_head(&vma->exec_node,
+			       &eb->buckets[hash_32(vma->exec_handle,
+						    eb->lut_mask)]);
+	}
 
-	return vma;
+	i915_vma_get(vma);
+	__exec_to_vma(&eb->exec[i]) = (uintptr_t)vma;
+	return true;
+}
+
+static inline struct hlist_head *
+ht_head(const struct i915_gem_context *ctx, u32 handle)
+{
+	return &ctx->vma_lut.ht[hash_32(handle, ctx->vma_lut.ht_bits)];
+}
+
+static inline bool
+ht_needs_resize(const struct i915_gem_context *ctx)
+{
+	return (4*ctx->vma_lut.ht_count > 3*ctx->vma_lut.ht_size ||
+		4*ctx->vma_lut.ht_count + 1 < ctx->vma_lut.ht_size);
 }
 
 static int
 eb_lookup_vmas(struct i915_execbuffer *eb)
 {
-	struct drm_i915_gem_object *obj;
-	struct list_head objects;
-	int i, ret;
+#define INTERMEDIATE BIT(0)
+	const int count = eb->args->buffer_count;
+	struct i915_vma *vma;
+	int slow_pass = -1;
+	int i;
 
 	INIT_LIST_HEAD(&eb->vmas);
 
-	INIT_LIST_HEAD(&objects);
+	if (unlikely(eb->ctx->vma_lut.ht_size & I915_CTX_RESIZE_IN_PROGRESS))
+		flush_work(&eb->ctx->vma_lut.resize);
+	GEM_BUG_ON(eb->ctx->vma_lut.ht_size & I915_CTX_RESIZE_IN_PROGRESS);
+
+	for (i = 0; i < count; i++) {
+		__exec_to_vma(&eb->exec[i]) = 0;
+
+		hlist_for_each_entry(vma,
+				     ht_head(eb->ctx, eb->exec[i].handle),
+				     ctx_node) {
+			if (vma->ctx_handle != eb->exec[i].handle)
+				continue;
+
+			if (!eb_add_vma(eb, vma, i))
+				return -EINVAL;
+
+			goto next_vma;
+		}
+
+		if (slow_pass < 0)
+			slow_pass = i;
+next_vma: ;
+	}
+
+	if (slow_pass < 0)
+		return 0;
+
 	spin_lock(&eb->file->table_lock);
 	/* Grab a reference to the object and release the lock so we can lookup
 	 * or create the VMA without using GFP_ATOMIC */
-	for (i = 0; i < eb->args->buffer_count; i++) {
-		obj = to_intel_bo(idr_find(&eb->file->object_idr, eb->exec[i].handle));
-		if (obj == NULL) {
-			spin_unlock(&eb->file->table_lock);
-			DRM_DEBUG("Invalid object handle %d at index %d\n",
-				   eb->exec[i].handle, i);
-			ret = -ENOENT;
-			goto err;
-		}
+	for (i = slow_pass; i < count; i++) {
+		struct drm_i915_gem_object *obj;
 
-		if (!list_empty(&obj->obj_exec_link)) {
+		if (__exec_to_vma(&eb->exec[i]))
+			continue;
+
+		obj = to_intel_bo(idr_find(&eb->file->object_idr,
+					   eb->exec[i].handle));
+		if (unlikely(!obj)) {
 			spin_unlock(&eb->file->table_lock);
-			DRM_DEBUG("Object %p [handle %d, index %d] appears more than once in object list\n",
-				   obj, eb->exec[i].handle, i);
-			ret = -EINVAL;
-			goto err;
+			DRM_DEBUG("Invalid object handle %d at index %d\n",
+				  eb->exec[i].handle, i);
+			return -ENOENT;
 		}
 
-		i915_gem_object_get(obj);
-		list_add_tail(&obj->obj_exec_link, &objects);
+		__exec_to_vma(&eb->exec[i]) = INTERMEDIATE | (uintptr_t)obj;
 	}
 	spin_unlock(&eb->file->table_lock);
 
-	i = 0;
-	while (!list_empty(&objects)) {
-		struct i915_vma *vma;
+	for (i = slow_pass; i < count; i++) {
+		struct drm_i915_gem_object *obj;
 
-		obj = list_first_entry(&objects,
-				       struct drm_i915_gem_object,
-				       obj_exec_link);
+		if ((__exec_to_vma(&eb->exec[i]) & INTERMEDIATE) == 0)
+			continue;
 
 		/*
 		 * NOTE: We can leak any vmas created here when something fails
@@ -219,61 +263,73 @@ eb_lookup_vmas(struct i915_execbuffer *eb)
 		 * from the (obj, vm) we don't run the risk of creating
 		 * duplicated vmas for the same vm.
 		 */
+		obj = u64_to_ptr(struct drm_i915_gem_object,
+				 __exec_to_vma(&eb->exec[i]) & ~INTERMEDIATE);
 		vma = i915_vma_instance(obj, eb->vm, NULL);
 		if (unlikely(IS_ERR(vma))) {
 			DRM_DEBUG("Failed to lookup VMA\n");
-			ret = PTR_ERR(vma);
-			goto err;
+			return PTR_ERR(vma);
 		}
 
-		/* Transfer ownership from the objects list to the vmas list. */
-		list_add_tail(&vma->exec_link, &eb->vmas);
-		list_del_init(&obj->obj_exec_link);
-
-		vma->exec_entry = &eb->exec[i];
-		if (eb->and < 0) {
-			eb->lut[i] = vma;
-		} else {
-			u32 handle =
-				eb->args->flags & I915_EXEC_HANDLE_LUT ?
-				i : eb->exec[i].handle;
-			vma->exec_handle = handle;
-			hlist_add_head(&vma->exec_node,
-				       &eb->buckets[handle & eb->and]);
+		/* First come, first served */
+		if (!vma->ctx) {
+			vma->ctx = eb->ctx;
+			vma->ctx_handle = eb->exec[i].handle;
+			hlist_add_head(&vma->ctx_node,
+				       ht_head(eb->ctx, eb->exec[i].handle));
+			eb->ctx->vma_lut.ht_count++;
+			if (i915_vma_is_ggtt(vma)) {
+				GEM_BUG_ON(obj->vma_hashed);
+				obj->vma_hashed = vma;
+			}
 		}
-		++i;
+
+		if (!eb_add_vma(eb, vma, i))
+			return -EINVAL;
+	}
+
+	if (ht_needs_resize(eb->ctx)) {
+		eb->ctx->vma_lut.ht_size |= I915_CTX_RESIZE_IN_PROGRESS;
+		queue_work(system_highpri_wq, &eb->ctx->vma_lut.resize);
 	}
 
 	return 0;
+#undef INTERMEDIATE
+}
 
+static struct i915_vma *
+eb_get_batch(struct i915_execbuffer *eb)
+{
+	struct i915_vma *vma =
+		exec_to_vma(&eb->exec[eb->args->buffer_count - 1]);
 
-err:
-	while (!list_empty(&objects)) {
-		obj = list_first_entry(&objects,
-				       struct drm_i915_gem_object,
-				       obj_exec_link);
-		list_del_init(&obj->obj_exec_link);
-		i915_gem_object_put(obj);
-	}
 	/*
-	 * Objects already transfered to the vmas list will be unreferenced by
-	 * eb_destroy.
+	 * SNA is doing fancy tricks with compressing batch buffers, which leads
+	 * to negative relocation deltas. Usually that works out ok since the
+	 * relocate address is still positive, except when the batch is placed
+	 * very low in the GTT. Ensure this doesn't happen.
+	 *
+	 * Note that actual hangs have only been observed on gen7, but for
+	 * paranoia do it everywhere.
 	 */
+	if ((vma->exec_entry->flags & EXEC_OBJECT_PINNED) == 0)
+		vma->exec_entry->flags |= __EXEC_OBJECT_NEEDS_BIAS;
 
-	return ret;
+	return vma;
 }
 
-static struct i915_vma *eb_get_vma(struct i915_execbuffer *eb, unsigned long handle)
+static struct i915_vma *
+eb_get_vma(struct i915_execbuffer *eb, unsigned long handle)
 {
-	if (eb->and < 0) {
-		if (handle >= -eb->and)
+	if (eb->lut_mask < 0) {
+		if (handle >= -eb->lut_mask)
 			return NULL;
-		return eb->lut[handle];
+		return exec_to_vma(&eb->exec[handle]);
 	} else {
 		struct hlist_head *head;
 		struct i915_vma *vma;
 
-		head = &eb->buckets[handle & eb->and];
+		head = &eb->buckets[hash_32(handle, eb->lut_mask)];
 		hlist_for_each_entry(vma, head, exec_node) {
 			if (vma->exec_handle == handle)
 				return vma;
@@ -297,7 +353,7 @@ static void eb_destroy(struct i915_execbuffer *eb)
 
 	i915_gem_context_put(eb->ctx);
 
-	if (eb->buckets)
+	if (eb->lut_mask >= 0)
 		kfree(eb->buckets);
 }
 
@@ -917,7 +973,7 @@ static int eb_reserve(struct i915_execbuffer *eb)
 		need_fence =
 			(entry->flags & EXEC_OBJECT_NEEDS_FENCE ||
 			 needs_unfenced_map) &&
-			i915_gem_object_is_tiled(obj);
+			i915_gem_object_is_tiled(vma->obj);
 		need_mappable = need_fence || need_reloc_mappable(vma);
 
 		if (entry->flags & EXEC_OBJECT_PINNED)
diff --git a/drivers/gpu/drm/i915/i915_gem_object.h b/drivers/gpu/drm/i915/i915_gem_object.h
index dca15adc91de..1a007cbf1943 100644
--- a/drivers/gpu/drm/i915/i915_gem_object.h
+++ b/drivers/gpu/drm/i915/i915_gem_object.h
@@ -71,6 +71,7 @@ struct drm_i915_gem_object {
 	/** List of VMAs backed by this object */
 	struct list_head vma_list;
 	struct rb_root vma_tree;
+	struct i915_vma *vma_hashed;
 
 	/** Stolen memory for this object, instead of being backed by shmem. */
 	struct drm_mm_node *stolen;
@@ -85,9 +86,6 @@ struct drm_i915_gem_object {
 	 */
 	struct list_head userfault_link;
 
-	/** Used in execbuf to temporarily hold a ref */
-	struct list_head obj_exec_link;
-
 	struct list_head batch_pool_link;
 	I915_SELFTEST_DECLARE(struct list_head st_link);
 
diff --git a/drivers/gpu/drm/i915/i915_utils.h b/drivers/gpu/drm/i915/i915_utils.h
index 16ecd1ab108d..12fc250b47b9 100644
--- a/drivers/gpu/drm/i915/i915_utils.h
+++ b/drivers/gpu/drm/i915/i915_utils.h
@@ -99,6 +99,11 @@
 	__T;								\
 })
 
+#define u64_to_ptr(T, x) ({						\
+	typecheck(u64, x);						\
+	(T *)(uintptr_t)(x);						\
+})
+
 #define __mask_next_bit(mask) ({					\
 	int __idx = ffs(mask) - 1;					\
 	mask &= ~BIT(__idx);						\
diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c
index 6cf32da682ec..ad696239383d 100644
--- a/drivers/gpu/drm/i915/i915_vma.c
+++ b/drivers/gpu/drm/i915/i915_vma.c
@@ -590,11 +590,31 @@ void i915_vma_destroy(struct i915_vma *vma)
 	kmem_cache_free(to_i915(vma->obj->base.dev)->vmas, vma);
 }
 
+void i915_vma_unlink_ctx(struct i915_vma *vma)
+{
+	struct i915_gem_context *ctx = vma->ctx;
+
+	if (ctx->vma_lut.ht_size & I915_CTX_RESIZE_IN_PROGRESS) {
+		cancel_work_sync(&ctx->vma_lut.resize);
+		ctx->vma_lut.ht_size &= ~I915_CTX_RESIZE_IN_PROGRESS;
+	}
+
+	__hlist_del(&vma->ctx_node);
+	ctx->vma_lut.ht_count--;
+
+	if (i915_vma_is_ggtt(vma))
+		vma->obj->vma_hashed = NULL;
+	vma->ctx = NULL;
+}
+
 void i915_vma_close(struct i915_vma *vma)
 {
 	GEM_BUG_ON(i915_vma_is_closed(vma));
 	vma->flags |= I915_VMA_CLOSED;
 
+	if (vma->ctx)
+		i915_vma_unlink_ctx(vma);
+
 	list_del(&vma->obj_link);
 	rb_erase(&vma->obj_node, &vma->obj->vma_tree);
 
diff --git a/drivers/gpu/drm/i915/i915_vma.h b/drivers/gpu/drm/i915/i915_vma.h
index 4d827300d1a8..88543fafcffc 100644
--- a/drivers/gpu/drm/i915/i915_vma.h
+++ b/drivers/gpu/drm/i915/i915_vma.h
@@ -99,6 +99,7 @@ struct i915_vma {
 
 	struct list_head obj_link; /* Link in the object's VMA list */
 	struct rb_node obj_node;
+	struct hlist_node obj_hash;
 
 	/** This vma's place in the execbuf reservation list */
 	struct list_head exec_link;
@@ -110,8 +111,12 @@ struct i915_vma {
 	 * Used for performing relocations during execbuffer insertion.
 	 */
 	struct hlist_node exec_node;
-	unsigned long exec_handle;
 	struct drm_i915_gem_exec_object2 *exec_entry;
+	u32 exec_handle;
+
+	struct i915_gem_context *ctx;
+	struct hlist_node ctx_node;
+	u32 ctx_handle;
 };
 
 struct i915_vma *
@@ -235,6 +240,7 @@ bool i915_vma_misplaced(const struct i915_vma *vma,
 			u64 size, u64 alignment, u64 flags);
 void __i915_vma_set_map_and_fenceable(struct i915_vma *vma);
 int __must_check i915_vma_unbind(struct i915_vma *vma);
+void i915_vma_unlink_ctx(struct i915_vma *vma);
 void i915_vma_close(struct i915_vma *vma);
 void i915_vma_destroy(struct i915_vma *vma);
 
diff --git a/drivers/gpu/drm/i915/selftests/mock_context.c b/drivers/gpu/drm/i915/selftests/mock_context.c
index 8d3a90c3f8ac..f8b9cc212b02 100644
--- a/drivers/gpu/drm/i915/selftests/mock_context.c
+++ b/drivers/gpu/drm/i915/selftests/mock_context.c
@@ -40,10 +40,18 @@ mock_context(struct drm_i915_private *i915,
 	INIT_LIST_HEAD(&ctx->link);
 	ctx->i915 = i915;
 
+	ctx->vma_lut.ht_bits = VMA_HT_BITS;
+	ctx->vma_lut.ht_size = BIT(VMA_HT_BITS);
+	ctx->vma_lut.ht = kcalloc(ctx->vma_lut.ht_size,
+				  sizeof(*ctx->vma_lut.ht),
+				  GFP_KERNEL);
+	if (!ctx->vma_lut.ht)
+		goto err_free;
+
 	ret = ida_simple_get(&i915->context_hw_ida,
 			     0, MAX_CONTEXT_HW_ID, GFP_KERNEL);
 	if (ret < 0)
-		goto err_free;
+		goto err_vma_ht;
 	ctx->hw_id = ret;
 
 	if (name) {
@@ -58,6 +66,8 @@ mock_context(struct drm_i915_private *i915,
 
 	return ctx;
 
+err_vma_ht:
+	kvfree(ctx->vma_lut.ht);
 err_free:
 	kfree(ctx);
 	return NULL;
-- 
2.11.0

* [PATCH 09/24] drm/i915: Pass vma to relocate entry
  2017-05-18  9:46 [PATCH 01/24] drm/i915/selftests: Pretend to be a gfx pci device Chris Wilson
                   ` (6 preceding siblings ...)
  2017-05-18  9:46 ` [PATCH 08/24] drm/i915: Store a direct lookup from object handle to vma Chris Wilson
@ 2017-05-18  9:46 ` Chris Wilson
  2017-05-18  9:46 ` [PATCH 10/24] drm/i915: Disable EXEC_OBJECT_ASYNC when doing relocations Chris Wilson
                   ` (17 subsequent siblings)
  25 siblings, 0 replies; 47+ messages in thread
From: Chris Wilson @ 2017-05-18  9:46 UTC (permalink / raw)
  To: intel-gfx

We can simplify our tracking of pending writes in an execbuf to a
single bit in vma->exec_entry->flags, but that requires the relocation
function to know the object's vma. Pass it along.
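
With the vma (and hence its exec_entry) to hand, a write relocation
reduces to setting one bit on the target, as in the diff below:

    if (reloc->write_domain)
            target->exec_entry->flags |= EXEC_OBJECT_WRITE;

The await-object and move-to-active paths then test that same bit
instead of obj->base.pending_write_domain.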

Note we have only been using a single bit to track flushing since

commit cc889e0f6ce6a63c62db17d702ecfed86d58083f
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Wed Jun 13 20:45:19 2012 +0200

    drm/i915: disable flushing_list/gpu_write_list

unconditionally flushed all render caches before the breadcrumb and

commit 6ac42f4148bc27e5ffd18a9ab0eac57f58822af4
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Sat Jul 21 12:25:01 2012 +0200

    drm/i915: Replace the complex flushing logic with simple invalidate/flush all

did away with the explicit GPU domain tracking. This was then codified
into the ABI with NO_RELOC in

commit ed5982e6ce5f106abcbf071f80730db344a6da42
Author: Daniel Vetter <daniel.vetter@ffwll.ch> # Oi! Patch stealer!
Date:   Thu Jan 17 22:23:36 2013 +0100

    drm/i915: Allow userspace to hint that the relocations were known

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_gem_execbuffer.c | 102 ++++++++++++-----------------
 1 file changed, 42 insertions(+), 60 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index b21cec236b98..039ec1fb3e03 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -623,42 +623,25 @@ relocate_entry(struct drm_i915_gem_object *obj,
 }
 
 static int
-eb_relocate_entry(struct drm_i915_gem_object *obj,
+eb_relocate_entry(struct i915_vma *vma,
 		  struct i915_execbuffer *eb,
 		  struct drm_i915_gem_relocation_entry *reloc)
 {
-	struct drm_gem_object *target_obj;
-	struct drm_i915_gem_object *target_i915_obj;
-	struct i915_vma *target_vma;
-	uint64_t target_offset;
+	struct i915_vma *target;
+	u64 target_offset;
 	int ret;
 
 	/* we've already hold a reference to all valid objects */
-	target_vma = eb_get_vma(eb, reloc->target_handle);
-	if (unlikely(target_vma == NULL))
+	target = eb_get_vma(eb, reloc->target_handle);
+	if (unlikely(!target))
 		return -ENOENT;
-	target_i915_obj = target_vma->obj;
-	target_obj = &target_vma->obj->base;
-
-	target_offset = gen8_canonical_addr(target_vma->node.start);
-
-	/* Sandybridge PPGTT errata: We need a global gtt mapping for MI and
-	 * pipe_control writes because the gpu doesn't properly redirect them
-	 * through the ppgtt for non_secure batchbuffers. */
-	if (unlikely(IS_GEN6(eb->i915) &&
-		     reloc->write_domain == I915_GEM_DOMAIN_INSTRUCTION)) {
-		ret = i915_vma_bind(target_vma, target_i915_obj->cache_level,
-				    PIN_GLOBAL);
-		if (WARN_ONCE(ret, "Unexpected failure to bind target VMA!"))
-			return ret;
-	}
 
 	/* Validate that the target is in a valid r/w GPU domain */
 	if (unlikely(reloc->write_domain & (reloc->write_domain - 1))) {
 		DRM_DEBUG("reloc with multiple write domains: "
-			  "obj %p target %d offset %d "
+			  "target %d offset %d "
 			  "read %08x write %08x",
-			  obj, reloc->target_handle,
+			  reloc->target_handle,
 			  (int) reloc->offset,
 			  reloc->read_domains,
 			  reloc->write_domain);
@@ -667,43 +650,57 @@ eb_relocate_entry(struct drm_i915_gem_object *obj,
 	if (unlikely((reloc->write_domain | reloc->read_domains)
 		     & ~I915_GEM_GPU_DOMAINS)) {
 		DRM_DEBUG("reloc with read/write non-GPU domains: "
-			  "obj %p target %d offset %d "
+			  "target %d offset %d "
 			  "read %08x write %08x",
-			  obj, reloc->target_handle,
+			  reloc->target_handle,
 			  (int) reloc->offset,
 			  reloc->read_domains,
 			  reloc->write_domain);
 		return -EINVAL;
 	}
 
-	target_obj->pending_read_domains |= reloc->read_domains;
-	target_obj->pending_write_domain |= reloc->write_domain;
+	if (reloc->write_domain)
+		target->exec_entry->flags |= EXEC_OBJECT_WRITE;
+
+	/*
+	 * Sandybridge PPGTT errata: We need a global gtt mapping for MI and
+	 * pipe_control writes because the gpu doesn't properly redirect them
+	 * through the ppgtt for non_secure batchbuffers.
+	 */
+	if (unlikely(IS_GEN6(eb->i915) &&
+		     reloc->write_domain == I915_GEM_DOMAIN_INSTRUCTION)) {
+		ret = i915_vma_bind(target, target->obj->cache_level,
+				    PIN_GLOBAL);
+		if (WARN_ONCE(ret, "Unexpected failure to bind target VMA!"))
+			return ret;
+	}
 
 	/* If the relocation already has the right value in it, no
 	 * more work needs to be done.
 	 */
+	target_offset = gen8_canonical_addr(target->node.start);
 	if (target_offset == reloc->presumed_offset)
 		return 0;
 
 	/* Check that the relocation address is valid... */
 	if (unlikely(reloc->offset >
-		     obj->base.size - (eb->reloc_cache.use_64bit_reloc ? 8 : 4))) {
+		     vma->size - (eb->reloc_cache.use_64bit_reloc ? 8 : 4))) {
 		DRM_DEBUG("Relocation beyond object bounds: "
-			  "obj %p target %d offset %d size %d.\n",
-			  obj, reloc->target_handle,
-			  (int) reloc->offset,
-			  (int) obj->base.size);
+			  "target %d offset %d size %d.\n",
+			  reloc->target_handle,
+			  (int)reloc->offset,
+			  (int)vma->size);
 		return -EINVAL;
 	}
 	if (unlikely(reloc->offset & 3)) {
 		DRM_DEBUG("Relocation not 4-byte aligned: "
-			  "obj %p target %d offset %d.\n",
-			  obj, reloc->target_handle,
-			  (int) reloc->offset);
+			  "target %d offset %d.\n",
+			  reloc->target_handle,
+			  (int)reloc->offset);
 		return -EINVAL;
 	}
 
-	ret = relocate_entry(obj, reloc, &eb->reloc_cache, target_offset);
+	ret = relocate_entry(vma->obj, reloc, &eb->reloc_cache, target_offset);
 	if (ret)
 		return ret;
 
@@ -749,7 +746,7 @@ static int eb_relocate_vma(struct i915_vma *vma, struct i915_execbuffer *eb)
 		do {
 			u64 offset = r->presumed_offset;
 
-			ret = eb_relocate_entry(vma->obj, eb, r);
+			ret = eb_relocate_entry(vma, eb, r);
 			if (ret)
 				goto out;
 
@@ -795,7 +792,7 @@ eb_relocate_vma_slow(struct i915_vma *vma,
 	int i, ret = 0;
 
 	for (i = 0; i < entry->relocation_count; i++) {
-		ret = eb_relocate_entry(vma->obj, eb, &relocs[i]);
+		ret = eb_relocate_entry(vma, eb, &relocs[i]);
 		if (ret)
 			break;
 	}
@@ -828,7 +825,6 @@ eb_reserve_vma(struct i915_vma *vma,
 	       struct intel_engine_cs *engine,
 	       bool *need_reloc)
 {
-	struct drm_i915_gem_object *obj = vma->obj;
 	struct drm_i915_gem_exec_object2 *entry = vma->exec_entry;
 	uint64_t flags;
 	int ret;
@@ -882,11 +878,6 @@ eb_reserve_vma(struct i915_vma *vma,
 		*need_reloc = true;
 	}
 
-	if (entry->flags & EXEC_OBJECT_WRITE) {
-		obj->base.pending_read_domains = I915_GEM_DOMAIN_RENDER;
-		obj->base.pending_write_domain = I915_GEM_DOMAIN_RENDER;
-	}
-
 	return 0;
 }
 
@@ -949,7 +940,6 @@ static int eb_reserve(struct i915_execbuffer *eb)
 {
 	const bool has_fenced_gpu_access = INTEL_GEN(eb->i915) < 4;
 	const bool needs_unfenced_map = INTEL_INFO(eb->i915)->unfenced_needs_alignment;
-	struct drm_i915_gem_object *obj;
 	struct i915_vma *vma;
 	struct list_head ordered_vmas;
 	struct list_head pinned_vmas;
@@ -962,7 +952,6 @@ static int eb_reserve(struct i915_execbuffer *eb)
 		bool need_fence, need_mappable;
 
 		vma = list_first_entry(&eb->vmas, struct i915_vma, exec_link);
-		obj = vma->obj;
 		entry = vma->exec_entry;
 
 		if (eb->ctx->flags & CONTEXT_NO_ZEROMAP)
@@ -983,9 +972,6 @@ static int eb_reserve(struct i915_execbuffer *eb)
 			list_move(&vma->exec_link, &ordered_vmas);
 		} else
 			list_move_tail(&vma->exec_link, &ordered_vmas);
-
-		obj->base.pending_read_domains = I915_GEM_GPU_DOMAINS & ~I915_GEM_DOMAIN_COMMAND;
-		obj->base.pending_write_domain = 0;
 	}
 	list_splice(&ordered_vmas, &eb->vmas);
 	list_splice(&pinned_vmas, &eb->vmas);
@@ -1171,7 +1157,7 @@ eb_move_to_gpu(struct i915_execbuffer *eb)
 			i915_gem_clflush_object(obj, 0);
 
 		ret = i915_gem_request_await_object
-			(eb->request, obj, obj->base.pending_write_domain);
+			(eb->request, obj, vma->exec_entry->flags & EXEC_OBJECT_WRITE);
 		if (ret)
 			return ret;
 	}
@@ -1367,12 +1353,10 @@ eb_move_to_active(struct i915_execbuffer *eb)
 	list_for_each_entry(vma, &eb->vmas, exec_link) {
 		struct drm_i915_gem_object *obj = vma->obj;
 
-		obj->base.write_domain = obj->base.pending_write_domain;
-		if (obj->base.write_domain)
-			vma->exec_entry->flags |= EXEC_OBJECT_WRITE;
-		else
-			obj->base.pending_read_domains |= obj->base.read_domains;
-		obj->base.read_domains = obj->base.pending_read_domains;
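+		/* After a GPU write, the prior read domains are no longer valid */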
+		obj->base.write_domain = 0;
+		if (vma->exec_entry->flags & EXEC_OBJECT_WRITE)
+			obj->base.read_domains = 0;
+		obj->base.read_domains |= I915_GEM_GPU_DOMAINS;
 
 		i915_vma_move_to_active(vma, eb->request, vma->exec_entry->flags);
 		eb_export_fence(obj, eb->request, vma->exec_entry->flags);
@@ -1682,8 +1666,7 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 			goto err;
 	}
 
-	/* Set the pending read domains for the batch buffer to COMMAND */
-	if (eb.batch->obj->base.pending_write_domain) {
+	if (eb.batch->exec_entry->flags & EXEC_OBJECT_WRITE) {
 		DRM_DEBUG("Attempting to use self-modifying batch buffer\n");
 		ret = -EINVAL;
 		goto err;
@@ -1720,7 +1703,6 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 		}
 	}
 
-	eb.batch->obj->base.pending_read_domains |= I915_GEM_DOMAIN_COMMAND;
 	if (eb.batch_len == 0)
 		eb.batch_len = eb.batch->size - eb.batch_start_offset;
 
-- 
2.11.0

* [PATCH 10/24] drm/i915: Disable EXEC_OBJECT_ASYNC when doing relocations
  2017-05-18  9:46 [PATCH 01/24] drm/i915/selftests: Pretend to be a gfx pci device Chris Wilson
                   ` (7 preceding siblings ...)
  2017-05-18  9:46 ` [PATCH 09/24] drm/i915: Pass vma to relocate entry Chris Wilson
@ 2017-05-18  9:46 ` Chris Wilson
  2017-05-19  9:05   ` Joonas Lahtinen
  2017-05-18  9:46 ` [PATCH 11/24] drm/i915: Eliminate lots of iterations over the execobjects array Chris Wilson
                   ` (16 subsequent siblings)
  25 siblings, 1 reply; 47+ messages in thread
From: Chris Wilson @ 2017-05-18  9:46 UTC (permalink / raw)
  To: intel-gfx

If we write a relocation into the buffer, we require our own implicit
synchronisation added after the start of the execbuf, outside of the
user's control. As we may end up clflushing, or doing the patching itself
on the GPU asynchronously, we need to look at the implicit serialisation
on obj->resv and hence need to disable EXEC_OBJECT_ASYNC for this
object.

If the user does trigger a stall for relocations, we make sure the stall
is complete enough so that the batch is not submitted before we complete
those relocations.
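
Roughly, the serialisation this opts back into is the await performed
in eb_move_to_gpu (a trimmed sketch, not the patch itself):

	list_for_each_entry(vma, &eb->vmas, exec_link) {
		if (vma->exec_entry->flags & EXEC_OBJECT_ASYNC)
			continue; /* user has promised their own fencing */

		ret = i915_gem_request_await_object(eb->request, vma->obj,
						    vma->exec_entry->flags &
						    EXEC_OBJECT_WRITE);
		if (ret)
			return ret;
	}

By clearing EXEC_OBJECT_ASYNC on any object we relocate, we restore the
implicit await for exactly that object.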

Fixes: 77ae9957897d ("drm/i915: Enable userspace to opt-out of implicit fencing")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Jason Ekstrand <jason@jlekstrand.net>
---
 drivers/gpu/drm/i915/i915_gem_execbuffer.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index 039ec1fb3e03..32542205060d 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -700,6 +700,16 @@ eb_relocate_entry(struct i915_vma *vma,
 		return -EINVAL;
 	}
 
+	/*
+	 * If we write into the object, we need to force the synchronisation
+	 * barrier, either with an asynchronous clflush or if we executed the
+	 * patching using the GPU (though that should be serialised by the
+	 * timeline). To be completely sure, and since we are required to
+ * do relocations we are already stalling, disable the user's opt-out
+ * of our synchronisation.
+	 */
+	vma->exec_entry->flags &= ~EXEC_OBJECT_ASYNC;
+
 	ret = relocate_entry(vma->obj, reloc, &eb->reloc_cache, target_offset);
 	if (ret)
 		return ret;
-- 
2.11.0

* [PATCH 11/24] drm/i915: Eliminate lots of iterations over the execobjects array
  2017-05-18  9:46 [PATCH 01/24] drm/i915/selftests: Pretend to be a gfx pci device Chris Wilson
                   ` (8 preceding siblings ...)
  2017-05-18  9:46 ` [PATCH 10/24] drm/i915: Disable EXEC_OBJECT_ASYNC when doing relocations Chris Wilson
@ 2017-05-18  9:46 ` Chris Wilson
  2017-05-18  9:46 ` [PATCH 12/24] drm/i915: Store a persistent reference for an object in the execbuffer cache Chris Wilson
                   ` (15 subsequent siblings)
  25 siblings, 0 replies; 47+ messages in thread
From: Chris Wilson @ 2017-05-18  9:46 UTC (permalink / raw)
  To: intel-gfx

The major scaling bottleneck in execbuffer is the processing of the
execobjects. Creating an auxiliary list is inefficient when compared to
using the execobject array we already have allocated.

Reservation is then split into phases. As we look up the VMA, we try to
bind it back into its active location. Only if that fails do we add it
to the unbound list for phase 2. In phase 2, we try to add all those
objects that could not fit into their previous location, with fallback
to retrying all objects and evicting the VM in case of severe
fragmentation. (This is the same as before, except that phase 1 is now
done inline with looking up the VMA to avoid an iteration over the
execobject array. In the ideal case, we eliminate the separate reservation
phase.) During the reservation phase, we only evict from the VM between
passes (rather than, as we currently do, whenever we try to fit each new
VMA). In testing with Unreal Engine's Atlantis demo, which stresses the
eviction logic on gen7-class hardware, this speeds up the framerate by a
factor of 2.
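
The pass structure is roughly (a condensed sketch of eb_reserve from
this patch, error handling trimmed):

	pass = 0;
	err = 0;
	do {
		list_for_each_entry(vma, &eb->unbound, exec_link) {
			err = eb_reserve_vma(eb, vma);
			if (err)
				break;
		}
		if (err != -ENOSPC)
			return err;

		/* resort *all* the objects into priority order */

		switch (pass++) {
		case 0:
			break;
		case 1: /* too fragmented, evict the whole VM and retry */
			err = i915_gem_evict_vm(eb->vm);
			if (err)
				return err;
			break;
		default:
			return -ENOSPC;
		}
	} while (1);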

The second loop amalgamation is between move_to_gpu and move_to_active.
As we always submit the request, even if incomplete, we can use the
current request to track the active VMAs as we perform the flushes and
synchronisation required.

The next big advancement is to avoid copying back to the user any
execobjects and relocations that are not changed.
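
A sketch of that copy-back skip (illustrative only; the loop shape and
names here are hypothetical, but UPDATE is the tag defined by this
patch):

	/* only report back the offsets we actually changed */
	for (i = 0; i < count; i++) {
		u64 offset = eb.exec[i].offset;

		if (!(offset & UPDATE))
			continue; /* unchanged, skip the user copy */

		offset = gen8_canonical_addr(offset & ~UPDATE);
		__put_user(offset, &user_exec_list[i].offset);
	}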

v2: Add a Theory of Operation spiel.
v3: Fall back to slow relocations in preparation for flushing userptrs.
v4: Document struct members, factor out eb_validate_vma(), add a few
more comments to explain some magic and hide other magic behind macros.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h                 |    2 +-
 drivers/gpu/drm/i915/i915_gem_evict.c           |   92 +-
 drivers/gpu/drm/i915/i915_gem_execbuffer.c      | 2020 +++++++++++++----------
 drivers/gpu/drm/i915/i915_vma.c                 |    2 +-
 drivers/gpu/drm/i915/i915_vma.h                 |    1 +
 drivers/gpu/drm/i915/selftests/i915_gem_evict.c |    4 +-
 drivers/gpu/drm/i915/selftests/i915_vma.c       |   16 +-
 7 files changed, 1240 insertions(+), 897 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 4f244f9c3d69..8ba120acf080 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -3530,7 +3530,7 @@ int __must_check i915_gem_evict_something(struct i915_address_space *vm,
 int __must_check i915_gem_evict_for_node(struct i915_address_space *vm,
 					 struct drm_mm_node *node,
 					 unsigned int flags);
-int i915_gem_evict_vm(struct i915_address_space *vm, bool do_idle);
+int i915_gem_evict_vm(struct i915_address_space *vm);
 
 /* belongs in i915_gem_gtt.h */
 static inline void i915_gem_chipset_flush(struct drm_i915_private *dev_priv)
diff --git a/drivers/gpu/drm/i915/i915_gem_evict.c b/drivers/gpu/drm/i915/i915_gem_evict.c
index 204a2d9288ae..a193f1b36c67 100644
--- a/drivers/gpu/drm/i915/i915_gem_evict.c
+++ b/drivers/gpu/drm/i915/i915_gem_evict.c
@@ -50,6 +50,29 @@ static bool ggtt_is_idle(struct drm_i915_private *dev_priv)
 	return true;
 }
 
+static int ggtt_flush(struct drm_i915_private *i915)
+{
+	int err;
+
+	/* Not everything in the GGTT is tracked via vma (otherwise we
+	 * could evict as required with minimal stalling) so we are forced
+	 * to idle the GPU and explicitly retire outstanding requests in
+	 * the hopes that we can then remove contexts and the like only
+	 * bound by their active reference.
+	 */
+	err = i915_gem_switch_to_kernel_context(i915);
+	if (err)
+		return err;
+
+	err = i915_gem_wait_for_idle(i915,
+				     I915_WAIT_INTERRUPTIBLE |
+				     I915_WAIT_LOCKED);
+	if (err)
+		return err;
+
+	return 0;
+}
+
 static bool
 mark_free(struct drm_mm_scan *scan,
 	  struct i915_vma *vma,
@@ -175,19 +198,7 @@ i915_gem_evict_something(struct i915_address_space *vm,
 		return intel_has_pending_fb_unpin(dev_priv) ? -EAGAIN : -ENOSPC;
 	}
 
-	/* Not everything in the GGTT is tracked via vma (otherwise we
-	 * could evict as required with minimal stalling) so we are forced
-	 * to idle the GPU and explicitly retire outstanding requests in
-	 * the hopes that we can then remove contexts and the like only
-	 * bound by their active reference.
-	 */
-	ret = i915_gem_switch_to_kernel_context(dev_priv);
-	if (ret)
-		return ret;
-
-	ret = i915_gem_wait_for_idle(dev_priv,
-				     I915_WAIT_INTERRUPTIBLE |
-				     I915_WAIT_LOCKED);
+	ret = ggtt_flush(dev_priv);
 	if (ret)
 		return ret;
 
@@ -337,10 +348,8 @@ int i915_gem_evict_for_node(struct i915_address_space *vm,
 /**
  * i915_gem_evict_vm - Evict all idle vmas from a vm
  * @vm: Address space to cleanse
- * @do_idle: Boolean directing whether to idle first.
  *
- * This function evicts all idles vmas from a vm. If all unpinned vmas should be
- * evicted the @do_idle needs to be set to true.
+ * This function evicts all vmas from a vm.
  *
  * This is used by the execbuf code as a last-ditch effort to defragment the
  * address space.
@@ -348,37 +357,50 @@ int i915_gem_evict_for_node(struct i915_address_space *vm,
  * To clarify: This is for freeing up virtual address space, not for freeing
  * memory in e.g. the shrinker.
  */
-int i915_gem_evict_vm(struct i915_address_space *vm, bool do_idle)
+int i915_gem_evict_vm(struct i915_address_space *vm)
 {
+	struct list_head *phases[] = {
+		&vm->inactive_list,
+		&vm->active_list,
+		NULL
+	}, **phase;
+	struct list_head eviction_list;
 	struct i915_vma *vma, *next;
 	int ret;
 
 	lockdep_assert_held(&vm->i915->drm.struct_mutex);
 	trace_i915_gem_evict_vm(vm);
 
-	if (do_idle) {
-		struct drm_i915_private *dev_priv = vm->i915;
-
-		if (i915_is_ggtt(vm)) {
-			ret = i915_gem_switch_to_kernel_context(dev_priv);
-			if (ret)
-				return ret;
-		}
-
-		ret = i915_gem_wait_for_idle(dev_priv,
-					     I915_WAIT_INTERRUPTIBLE |
-					     I915_WAIT_LOCKED);
+	/* Switch back to the default context in order to unpin
+	 * the existing context objects. However, such objects only
+	 * pin themselves inside the global GTT and performing the
+	 * switch otherwise is ineffective.
+	 */
+	if (i915_is_ggtt(vm)) {
+		ret = ggtt_flush(vm->i915);
 		if (ret)
 			return ret;
-
-		WARN_ON(!list_empty(&vm->active_list));
 	}
 
-	list_for_each_entry_safe(vma, next, &vm->inactive_list, vm_link)
-		if (!i915_vma_is_pinned(vma))
-			WARN_ON(i915_vma_unbind(vma));
+	INIT_LIST_HEAD(&eviction_list);
+	phase = phases;
+	do {
+		list_for_each_entry(vma, *phase, vm_link) {
+			if (i915_vma_is_pinned(vma))
+				continue;
 
-	return 0;
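+			/* pin first so that unbinding one vma cannot free another on the list */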
+			__i915_vma_pin(vma);
+			list_add(&vma->evict_link, &eviction_list);
+		}
+	} while (*++phase);
+
+	ret = 0;
+	list_for_each_entry_safe(vma, next, &eviction_list, evict_link) {
+		__i915_vma_unpin(vma);
+		if (ret == 0)
+			ret = i915_vma_unbind(vma);
+	}
+	return ret;
 }
 
 #if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index 32542205060d..77c7363754d1 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -42,41 +42,195 @@
 
 #define DBG_USE_CPU_RELOC 0 /* -1 force GTT relocs; 1 force CPU relocs */
 
-#define  __EXEC_OBJECT_HAS_PIN		(1<<31)
-#define  __EXEC_OBJECT_HAS_FENCE	(1<<30)
-#define  __EXEC_OBJECT_NEEDS_MAP	(1<<29)
-#define  __EXEC_OBJECT_NEEDS_BIAS	(1<<28)
-#define  __EXEC_OBJECT_INTERNAL_FLAGS (0xf<<28) /* all of the above */
+#define  __EXEC_OBJECT_HAS_PIN		BIT(31)
+#define  __EXEC_OBJECT_HAS_FENCE	BIT(30)
+#define  __EXEC_OBJECT_NEEDS_MAP	BIT(29)
+#define  __EXEC_OBJECT_NEEDS_BIAS	BIT(28)
+#define  __EXEC_OBJECT_INTERNAL_FLAGS (~0u << 28) /* all of the above */
+#define __EB_RESERVED (__EXEC_OBJECT_HAS_PIN | __EXEC_OBJECT_HAS_FENCE)
+
+#define __EXEC_HAS_RELOC	BIT(31)
+#define __EXEC_VALIDATED	BIT(30)
+#define UPDATE			PIN_OFFSET_FIXED
 
 #define BATCH_OFFSET_BIAS (256*1024)
 
 #define __I915_EXEC_ILLEGAL_FLAGS \
 	(__I915_EXEC_UNKNOWN_FLAGS | I915_EXEC_CONSTANTS_MASK)
 
+/**
+ * DOC: User command execution
+ *
+ * Userspace submits commands to be executed on the GPU as an instruction
+ * stream within a GEM object we call a batchbuffer. These instructions may
+ * refer to other GEM objects containing auxiliary state such as kernels,
+ * samplers, render targets and even secondary batchbuffers. Userspace does
+ * not know where in the GPU memory these objects reside and so before the
+ * batchbuffer is passed to the GPU for execution, those addresses in the
+ * batchbuffer and auxiliary objects are updated. This is known as relocation,
+ * or patching. To try and avoid having to relocate each object on the next
+ * execution, userspace is told the location of those objects in this pass,
+ * but this remains just a hint as the kernel may choose a new location for
+ * any object in the future.
+ *
+ * Processing an execbuf ioctl is conceptually split up into a few phases.
+ *
+ * 1. Validation - Ensure all the pointers, handles and flags are valid.
+ * 2. Reservation - Assign GPU address space for every object
+ * 3. Relocation - Update any addresses to point to the final locations
+ * 4. Serialisation - Order the request with respect to its dependencies
+ * 5. Construction - Construct a request to execute the batchbuffer
+ * 6. Submission (at some point in the future execution)
+ *
+ * Reserving resources for the execbuf is the most complicated phase. We
+ * neither want to have to migrate the object in the address space, nor do
+ * we want to have to update any relocations pointing to this object. Ideally,
+ * we want to leave the object where it is and for all the existing relocations
+ * to match. If the object is given a new address, or if userspace thinks the
+ * object is elsewhere, we have to parse all the relocation entries and update
+ * the addresses. Userspace can set the I915_EXEC_NO_RELOC flag to hint that
+ * all the target addresses in all of its objects match the value in the
+ * relocation entries and that they all match the presumed offsets given by the
+ * list of execbuffer objects. Using this knowledge, we know that if we haven't
+ * moved any buffers, all the relocation entries are valid and we can skip
+ * the update. (If userspace is wrong, the likely outcome is an impromptu GPU
+ * hang.) The requirements for using I915_EXEC_NO_RELOC are:
+ *
+ *      The addresses written in the objects must match the corresponding
+ *      reloc.presumed_offset which in turn must match the corresponding
+ *      execobject.offset.
+ *
+ *      Any render targets written to in the batch must be flagged with
+ *      EXEC_OBJECT_WRITE.
+ *
+ *      To avoid stalling, execobject.offset should match the current
+ *      address of that object within the active context.
+ *
+ * The reservation is done in multiple phases. First we try to keep any
+ * object already bound in its current location - so long as it meets the
+ * constraints imposed by the new execbuffer. Any object left unbound after the
+ * first pass is then fitted into any available idle space. If an object does
+ * not fit, all objects are removed from the reservation and the process rerun
+ * after sorting the objects into a priority order (more difficult to fit
+ * objects are tried first). Failing that, the entire VM is cleared and we try
+ * to fit the execbuf one last time before concluding that it simply will not
+ * fit.
+ *
+ * A small complication to all of this is that we allow userspace not only to
+ * specify an alignment and a size for the object in the address space, but
+ * we also allow userspace to specify the exact offset. These objects are
+ * simpler to place (the location is known a priori); all we have to do is make
+ * sure the space is available.
+ *
+ * Once all the objects are in place, patching up the buried pointers to point
+ * to the final locations is a fairly simple job of walking over the relocation
+ * entry arrays, looking up the right address and rewriting the value into
+ * the object. Simple! ... The relocation entries are stored in user memory
+ * and so to access them we have to copy them into a local buffer. That copy
+ * has to avoid taking any pagefaults as they may lead back to a GEM object
+ * requiring the struct_mutex (i.e. recursive deadlock). So once again we split
+ * the relocation into multiple passes. First we try to do everything within an
+ * atomic context (avoid the pagefaults) which requires that we never wait. If
+ * we detect that we may wait, or if we need to fault, then we have to fallback
+ * to a slower path. The slowpath has to drop the mutex. (Can you hear alarm
+ * bells yet?) Dropping the mutex means that we lose all the state we have
+ * built up so far for the execbuf and we must reset any global data. However,
+ * we do leave the objects pinned in their final locations - which is a
+ * potential issue for concurrent execbufs. Once we have left the mutex, we can
+ * allocate and copy all the relocation entries into a large array at our
+ * leisure, reacquire the mutex, reclaim all the objects and other state and
+ * then proceed to update any incorrect addresses with the objects.
+ *
+ * As we process the relocation entries, we maintain a record of whether the
+ * object is being written to. Using NO_RELOC, we expect userspace to provide
+ * this information instead. We also check whether we can skip the relocation
+ * by comparing the expected value inside the relocation entry with the target's
+ * final address. If they differ, we have to map the current object and rewrite
+ * the 4 or 8 byte pointer within.
+ *
+ * Serialising an execbuf is quite simple according to the rules of the GEM
+ * ABI. Execution within each context is ordered by the order of submission.
+ * Writes to any GEM object are in order of submission and are exclusive. Reads
+ * from a GEM object are unordered with respect to other reads, but ordered by
+ * writes. A write submitted after a read cannot occur before the read, and
+ * similarly any read submitted after a write cannot occur before the write.
+ * Writes are ordered between engines such that only one write occurs at any
+ * time (completing any reads beforehand) - using semaphores where available
+ * and CPU serialisation otherwise. Other GEM accesses obey the same rules: any
+ * write (either via mmaps using set-domain, or via pwrite) must flush all GPU
+ * reads before starting, and any read (either using set-domain or pread) must
+ * flush all GPU writes before starting. (Note we only employ a barrier before;
+ * we currently rely on userspace not concurrently starting a new execution
+ * whilst reading or writing to an object. This may be an advantage or not
+ * depending on how much you trust userspace not to shoot themselves in the
+ * foot.) Serialisation may just result in the request being inserted into
+ * a DAG awaiting its turn, but most simple is to wait on the CPU until
+ * all dependencies are resolved.
+ *
+ * After all of that, it is just a matter of closing the request and handing it
+ * the hardware (well, leaving it in a queue to be executed). However, we also
+ * offer the ability for batchbuffers to be run with elevated privileges so
+ * that they access otherwise hidden registers. (Used to adjust L3 cache etc.)
+ * Before any batch is given extra privileges we first must check that it
+ * contains no nefarious instructions: we check that each instruction is from
+ * our whitelist and that all registers are also from an allowed list. We first
+ * copy the user's batchbuffer to a shadow (so that the user doesn't have
+ * access to it, either by the CPU or GPU as we scan it) and then parse each
+ * instruction. If everything is ok, we set a flag telling the hardware to run
+ * the batchbuffer in trusted mode, otherwise the ioctl is rejected.
+ */
+
 struct i915_execbuffer {
-	struct drm_i915_private *i915;
-	struct drm_file *file;
-	struct drm_i915_gem_execbuffer2 *args;
-	struct drm_i915_gem_exec_object2 *exec;
-	struct intel_engine_cs *engine;
-	struct i915_gem_context *ctx;
-	struct i915_address_space *vm;
-	struct i915_vma *batch;
-	struct drm_i915_gem_request *request;
-	u32 batch_start_offset;
-	u32 batch_len;
-	unsigned int dispatch_flags;
-	struct drm_i915_gem_exec_object2 shadow_exec_entry;
-	bool need_relocs;
-	struct list_head vmas;
+	struct drm_i915_private *i915; /** i915 backpointer */
+	struct drm_file *file; /** per-file lookup tables and limits */
+	struct drm_i915_gem_execbuffer2 *args; /** ioctl parameters */
+	struct drm_i915_gem_exec_object2 *exec; /** ioctl execobj[] */
+
+	struct intel_engine_cs *engine; /** engine to queue the request to */
+	struct i915_gem_context *ctx; /** context for building the request */
+	struct i915_address_space *vm; /** GTT and vma for the request */
+
+	struct drm_i915_gem_request *request; /** our request to build */
+	struct i915_vma *batch; /** identity of the batch obj/vma */
+
+	/** actual size of execobj[] as we may extend it for the cmdparser */
+	unsigned int buffer_count;
+
+	/** list of vma not yet bound during reservation phase */
+	struct list_head unbound;
+
+	/** list of vma that have execobj.relocation_count */
+	struct list_head relocs;
+
+	/**
+	 * Track the most recently used object for relocations, as we
+	 * frequently have to perform multiple relocations within the same
+	 * obj/page
+	 */
 	struct reloc_cache {
-		struct drm_mm_node node;
-		unsigned long vaddr;
-		unsigned int page;
+		struct drm_mm_node node; /** temporary GTT binding */
+		unsigned long vaddr; /** Current kmap address */
+		unsigned long page; /** Currently mapped page index */
 		bool use_64bit_reloc : 1;
+		bool has_llc : 1;
+		bool has_fence : 1;
+		bool needs_unfenced : 1;
 	} reloc_cache;
-	int lut_mask;
-	struct hlist_head *buckets;
+
+	u64 invalid_flags; /** Set of execobj.flags that are invalid */
+	u32 context_flags; /** Set of execobj.flags to insert from the ctx */
+
+	u32 batch_start_offset; /** Location within object of batch */
+	u32 batch_len; /** Length of batch within object */
+	u32 batch_flags; /** Flags composed for emit_bb_start() */
+
+	/**
+	 * Indicate either the size of the hashtable used to resolve
+	 * relocation handles, or if negative that we are using a direct
+	 * index into the execobj[].
+	 */
+	int lut_size;
+	struct hlist_head *buckets; /** ht for relocation handles */
 };
 
 /*
@@ -87,12 +241,42 @@ struct i915_execbuffer {
 #define __exec_to_vma(ee) (ee)->rsvd2
 #define exec_to_vma(ee) u64_to_ptr(struct i915_vma, __exec_to_vma(ee))
 
+/*
+ * Used to convert any address to canonical form.
+ * Starting from gen8, some commands (e.g. STATE_BASE_ADDRESS,
+ * MI_LOAD_REGISTER_MEM and others, see Broadwell PRM Vol2a) require the
+ * addresses to be in a canonical form:
+ * "GraphicsAddress[63:48] are ignored by the HW and assumed to be in correct
+ * canonical form [63:48] == [47]."
+ */
+#define GEN8_HIGH_ADDRESS_BIT 47
+static inline u64 gen8_canonical_addr(u64 address)
+{
+	return sign_extend64(address, GEN8_HIGH_ADDRESS_BIT);
+}
+
+static inline u64 gen8_noncanonical_addr(u64 address)
+{
+	return address & GENMASK_ULL(GEN8_HIGH_ADDRESS_BIT, 0);
+}
+
 static int
 eb_create(struct i915_execbuffer *eb)
 {
-	if ((eb->args->flags & I915_EXEC_HANDLE_LUT) == 0) {
-		unsigned int size = 1 + ilog2(eb->args->buffer_count);
+	if (!(eb->args->flags & I915_EXEC_HANDLE_LUT)) {
+		unsigned int size = 1 + ilog2(eb->buffer_count);
 
+		/*
+		 * Without a 1:1 association between relocation handles and
+		 * the execobject[] index, we instead create a hashtable.
+		 * We size it dynamically based on available memory, starting
+		 * first with a 1:1 associative hash and scaling back until
+		 * the allocation succeeds.
+		 *
+		 * Later on we use a positive lut_size to indicate we are
+		 * using this hashtable, and a negative value to indicate a
+		 * direct lookup.
+		 */
 		do {
 			eb->buckets = kzalloc(sizeof(struct hlist_head) << size,
 					      GFP_TEMPORARY |
@@ -109,112 +293,415 @@ eb_create(struct i915_execbuffer *eb)
 				return -ENOMEM;
 		}
 
-		eb->lut_mask = size;
+		eb->lut_size = size;
 	} else {
-		eb->lut_mask = -eb->args->buffer_count;
+		eb->lut_size = -eb->buffer_count;
 	}
 
 	return 0;
 }
 
+static bool
+eb_vma_misplaced(const struct drm_i915_gem_exec_object2 *entry,
+		 const struct i915_vma *vma)
+{
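+	/* If we have not pinned the vma, its placement is not yet validated */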
+	if (!(entry->flags & __EXEC_OBJECT_HAS_PIN))
+		return true;
+
+	if (vma->node.size < entry->pad_to_size)
+		return true;
+
+	if (entry->alignment && !IS_ALIGNED(vma->node.start, entry->alignment))
+		return true;
+
+	if (entry->flags & EXEC_OBJECT_PINNED &&
+	    vma->node.start != entry->offset)
+		return true;
+
+	if (entry->flags & __EXEC_OBJECT_NEEDS_BIAS &&
+	    vma->node.start < BATCH_OFFSET_BIAS)
+		return true;
+
+	if (!(entry->flags & EXEC_OBJECT_SUPPORTS_48B_ADDRESS) &&
+	    (vma->node.start + vma->node.size - 1) >> 32)
+		return true;
+
+	return false;
+}
+
+static inline void
+eb_pin_vma(struct i915_execbuffer *eb,
+	   struct drm_i915_gem_exec_object2 *entry,
+	   struct i915_vma *vma)
+{
+	u64 flags;
+
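+	/* Attempt to reuse the current binding: same address, no blocking */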
+	flags = vma->node.start;
+	flags |= PIN_USER | PIN_NONBLOCK | PIN_OFFSET_FIXED;
+	if (unlikely(entry->flags & EXEC_OBJECT_NEEDS_GTT))
+		flags |= PIN_GLOBAL;
+	if (unlikely(i915_vma_pin(vma, 0, 0, flags)))
+		return;
+
+	if (unlikely(entry->flags & EXEC_OBJECT_NEEDS_FENCE)) {
+		if (unlikely(i915_vma_get_fence(vma))) {
+			i915_vma_unpin(vma);
+			return;
+		}
+
+		if (i915_vma_pin_fence(vma))
+			entry->flags |= __EXEC_OBJECT_HAS_FENCE;
+	}
+
+	entry->flags |= __EXEC_OBJECT_HAS_PIN;
+}
+
 static inline void
 __eb_unreserve_vma(struct i915_vma *vma,
 		   const struct drm_i915_gem_exec_object2 *entry)
 {
+	GEM_BUG_ON(!(entry->flags & __EXEC_OBJECT_HAS_PIN));
+
 	if (unlikely(entry->flags & __EXEC_OBJECT_HAS_FENCE))
 		i915_vma_unpin_fence(vma);
 
-	if (entry->flags & __EXEC_OBJECT_HAS_PIN)
-		__i915_vma_unpin(vma);
+	__i915_vma_unpin(vma);
 }
 
-static void
-eb_unreserve_vma(struct i915_vma *vma)
+static inline void
+eb_unreserve_vma(struct i915_vma *vma,
+		 struct drm_i915_gem_exec_object2 *entry)
 {
-	struct drm_i915_gem_exec_object2 *entry = vma->exec_entry;
+	if (entry->flags & __EXEC_OBJECT_HAS_PIN) {
+		__eb_unreserve_vma(vma, entry);
+		entry->flags &= ~__EB_RESERVED;
+	}
+}
+
+static int
+eb_validate_vma(struct i915_execbuffer *eb,
+		struct drm_i915_gem_exec_object2 *entry,
+		struct i915_vma *vma)
+{
+	if (unlikely(entry->flags & eb->invalid_flags))
+		return -EINVAL;
+
+	if (unlikely(entry->alignment && !is_power_of_2(entry->alignment)))
+		return -EINVAL;
 
-	__eb_unreserve_vma(vma, entry);
-	entry->flags &= ~(__EXEC_OBJECT_HAS_FENCE | __EXEC_OBJECT_HAS_PIN);
+	/*
+	 * Offset can be used as input (EXEC_OBJECT_PINNED), reject
+	 * any non-page-aligned or non-canonical addresses.
+	 */
+	if (entry->flags & EXEC_OBJECT_PINNED) {
+		if (unlikely(entry->offset !=
+			     gen8_canonical_addr(entry->offset & PAGE_MASK)))
+			return -EINVAL;
+	}
+
+	/*
+	 * From the drm_mm perspective the address space is continuous,
+	 * so from this point on we always use the non-canonical
+	 * form internally.
+	 */
+	entry->offset = gen8_noncanonical_addr(entry->offset);
+
+	/* pad_to_size was once a reserved field, so sanitize it */
+	if (entry->flags & EXEC_OBJECT_PAD_TO_SIZE) {
+		if (unlikely(offset_in_page(entry->pad_to_size)))
+			return -EINVAL;
+	} else {
+		entry->pad_to_size = 0;
+	}
+
+	if (unlikely(vma->exec_entry)) {
+		DRM_DEBUG("Object [handle %d, index %d] appears more than once in object list\n",
+			  entry->handle, (int)(entry - eb->exec));
+		return -EINVAL;
+	}
+
+	return 0;
 }
 
-static void
-eb_reset(struct i915_execbuffer *eb)
+static int
+eb_add_vma(struct i915_execbuffer *eb,
+	   struct drm_i915_gem_exec_object2 *entry,
+	   struct i915_vma *vma)
 {
-	struct i915_vma *vma;
+	int err;
 
-	list_for_each_entry(vma, &eb->vmas, exec_link) {
-		eb_unreserve_vma(vma);
-		i915_vma_put(vma);
-		vma->exec_entry = NULL;
+	GEM_BUG_ON(i915_vma_is_closed(vma));
+
+	if (!(eb->args->flags & __EXEC_VALIDATED)) {
+		err = eb_validate_vma(eb, entry, vma);
+		if (unlikely(err))
+			return err;
 	}
 
-	if (eb->lut_mask >= 0)
-		memset(eb->buckets, 0,
-		       sizeof(struct hlist_head) << eb->lut_mask);
+	/*
+	 * Stash a pointer from the vma to execobj, so we can query its flags,
+	 * size, alignment etc as provided by the user. Also we stash a pointer
+	 * to the vma inside the execobj so that we can use a direct lookup
+	 * to find the right target VMA when doing relocations.
+	 */
+	vma->exec_entry = entry;
+	__exec_to_vma(entry) = (uintptr_t)i915_vma_get(vma);
+
+	if (eb->lut_size >= 0) {
+		vma->exec_handle = entry->handle;
+		hlist_add_head(&vma->exec_node,
+			       &eb->buckets[hash_32(entry->handle,
+						    eb->lut_size)]);
+	}
+
+	if (entry->relocation_count)
+		list_add_tail(&vma->reloc_link, &eb->relocs);
+
+	if (!eb->reloc_cache.has_fence) {
+		entry->flags &= ~EXEC_OBJECT_NEEDS_FENCE;
+	} else {
+		if ((entry->flags & EXEC_OBJECT_NEEDS_FENCE ||
+		     eb->reloc_cache.needs_unfenced) &&
+		    i915_gem_object_is_tiled(vma->obj))
+			entry->flags |= EXEC_OBJECT_NEEDS_GTT | __EXEC_OBJECT_NEEDS_MAP;
+	}
+
+	if (!(entry->flags & EXEC_OBJECT_PINNED))
+		entry->flags |= eb->context_flags;
+
+	err = 0;
+	if (vma->node.size)
+		eb_pin_vma(eb, entry, vma);
+	if (eb_vma_misplaced(entry, vma)) {
+		eb_unreserve_vma(vma, entry);
+
+		list_add_tail(&vma->exec_link, &eb->unbound);
+		if (drm_mm_node_allocated(&vma->node))
+			err = i915_vma_unbind(vma);
+	} else {
+		if (entry->offset != vma->node.start) {
+			entry->offset = vma->node.start | UPDATE;
+			eb->args->flags |= __EXEC_HAS_RELOC;
+		}
+	}
+	return err;
 }
 
-static bool
-eb_add_vma(struct i915_execbuffer *eb, struct i915_vma *vma, int i)
+static inline int use_cpu_reloc(const struct reloc_cache *cache,
+				const struct drm_i915_gem_object *obj)
 {
-	if (unlikely(vma->exec_entry)) {
-		DRM_DEBUG("Object [handle %d, index %d] appears more than once in object list\n",
-			  eb->exec[i].handle, i);
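+	/* kmap-based CPU relocations require struct-page backing */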
+	if (!i915_gem_object_has_struct_page(obj))
 		return false;
+
+	if (DBG_USE_CPU_RELOC)
+		return DBG_USE_CPU_RELOC > 0;
+
+	return (cache->has_llc ||
+		obj->cache_dirty ||
+		obj->cache_level != I915_CACHE_NONE);
+}
+
+static int
+eb_reserve_vma(struct i915_execbuffer *eb, struct i915_vma *vma)
+{
+	struct drm_i915_gem_exec_object2 *entry = vma->exec_entry;
+	u64 flags;
+	int err;
+
+	flags = PIN_USER | PIN_NONBLOCK;
+	if (entry->flags & EXEC_OBJECT_NEEDS_GTT)
+		flags |= PIN_GLOBAL;
+
+	if (!drm_mm_node_allocated(&vma->node)) {
+		/*
+		 * Wa32bitGeneralStateOffset & Wa32bitInstructionBaseOffset,
+		 * limit address to the first 4GBs for unflagged objects.
+		 */
+		if (!(entry->flags & EXEC_OBJECT_SUPPORTS_48B_ADDRESS))
+			flags |= PIN_ZONE_4G;
+
+		if (entry->flags & __EXEC_OBJECT_NEEDS_MAP)
+			flags |= PIN_MAPPABLE;
+
+		if (entry->flags & EXEC_OBJECT_PINNED) {
+			flags |= entry->offset | PIN_OFFSET_FIXED;
+			/* force overlapping PINNED checks */
+			flags &= ~PIN_NONBLOCK;
+		} else if (entry->flags & __EXEC_OBJECT_NEEDS_BIAS) {
+			flags |= BATCH_OFFSET_BIAS | PIN_OFFSET_BIAS;
+		}
 	}
-	list_add_tail(&vma->exec_link, &eb->vmas);
 
-	vma->exec_entry = &eb->exec[i];
-	if (eb->lut_mask >= 0) {
-		vma->exec_handle = eb->exec[i].handle;
-		hlist_add_head(&vma->exec_node,
-			       &eb->buckets[hash_32(vma->exec_handle,
-						    eb->lut_mask)]);
+	err = i915_vma_pin(vma, entry->pad_to_size, entry->alignment, flags);
+	if (err)
+		return err;
+
+	if (entry->offset != vma->node.start) {
+		entry->offset = vma->node.start | UPDATE;
+		eb->args->flags |= __EXEC_HAS_RELOC;
 	}
 
-	i915_vma_get(vma);
-	__exec_to_vma(&eb->exec[i]) = (uintptr_t)vma;
-	return true;
+	entry->flags |= __EXEC_OBJECT_HAS_PIN;
+	GEM_BUG_ON(eb_vma_misplaced(entry, vma));
+
+	if (unlikely(entry->flags & EXEC_OBJECT_NEEDS_FENCE)) {
+		err = i915_vma_get_fence(vma);
+		if (unlikely(err)) {
+			i915_vma_unpin(vma);
+			return err;
+		}
+
+		if (i915_vma_pin_fence(vma))
+			entry->flags |= __EXEC_OBJECT_HAS_FENCE;
+	}
+
+	return 0;
+}
+
+static int eb_reserve(struct i915_execbuffer *eb)
+{
+	const unsigned int count = eb->buffer_count;
+	struct list_head last;
+	struct i915_vma *vma;
+	unsigned int i, pass;
+	int err;
+
+	/*
+	 * Attempt to pin all of the buffers into the GTT.
+	 * This is done in 3 phases:
+	 *
+	 * 1a. Unbind all objects that do not match the GTT constraints for
+	 *     the execbuffer (fenceable, mappable, alignment etc).
+	 * 1b. Increment pin count for already bound objects.
+	 * 2.  Bind new objects.
+	 * 3.  Decrement pin count.
+	 *
+	 * This avoids unnecessary unbinding of later objects in order to make
+	 * room for the earlier objects *unless* we need to defragment.
+	 */
+
+	pass = 0;
+	err = 0;
+	do {
+		list_for_each_entry(vma, &eb->unbound, exec_link) {
+			err = eb_reserve_vma(eb, vma);
+			if (err)
+				break;
+		}
+		if (err != -ENOSPC)
+			return err;
+
+		/* Resort *all* the objects into priority order */
+		INIT_LIST_HEAD(&eb->unbound);
+		INIT_LIST_HEAD(&last);
+		for (i = 0; i < count; i++) {
+			struct drm_i915_gem_exec_object2 *entry = &eb->exec[i];
+
+			if (entry->flags & EXEC_OBJECT_PINNED &&
+			    entry->flags & __EXEC_OBJECT_HAS_PIN)
+				continue;
+
+			vma = exec_to_vma(entry);
+			eb_unreserve_vma(vma, entry);
+
+			if (entry->flags & EXEC_OBJECT_PINNED)
+				list_add(&vma->exec_link, &eb->unbound);
+			else if (entry->flags & __EXEC_OBJECT_NEEDS_MAP)
+				list_add_tail(&vma->exec_link, &eb->unbound);
+			else
+				list_add_tail(&vma->exec_link, &last);
+		}
+		list_splice_tail(&last, &eb->unbound);
+
+		switch (pass++) {
+		case 0:
+			break;
+
+		case 1:
+			/* Too fragmented, unbind everything and retry */
+			err = i915_gem_evict_vm(eb->vm);
+			if (err)
+				return err;
+			break;
+
+		default:
+			return -ENOSPC;
+		}
+	} while (1);
 }
 
 static inline struct hlist_head *
-ht_head(const struct i915_gem_context *ctx, u32 handle)
+ht_head(const struct i915_gem_context_vma_lut *lut, u32 handle)
 {
-	return &ctx->vma_lut.ht[hash_32(handle, ctx->vma_lut.ht_bits)];
+	return &lut->ht[hash_32(handle, lut->ht_bits)];
 }
 
 static inline bool
-ht_needs_resize(const struct i915_gem_context *ctx)
+ht_needs_resize(const struct i915_gem_context_vma_lut *lut)
 {
-	return (4*ctx->vma_lut.ht_count > 3*ctx->vma_lut.ht_size ||
-		4*ctx->vma_lut.ht_count + 1 < ctx->vma_lut.ht_size);
+	return (4*lut->ht_count > 3*lut->ht_size ||
+		4*lut->ht_count + 1 < lut->ht_size);
 }
 
-static int
-eb_lookup_vmas(struct i915_execbuffer *eb)
+static unsigned int eb_batch_index(const struct i915_execbuffer *eb)
+{
+	return eb->buffer_count - 1;
+}
+
+static int eb_select_context(struct i915_execbuffer *eb)
+{
+	struct i915_gem_context *ctx;
+
+	ctx = i915_gem_context_lookup(eb->file->driver_priv, eb->args->rsvd1);
+	if (unlikely(IS_ERR(ctx)))
+		return PTR_ERR(ctx);
+
+	if (unlikely(i915_gem_context_is_banned(ctx))) {
+		DRM_DEBUG("Context %u tried to submit while banned\n",
+			  ctx->user_handle);
+		return -EIO;
+	}
+
+	eb->ctx = i915_gem_context_get(ctx);
+	eb->vm = ctx->ppgtt ? &ctx->ppgtt->base : &eb->i915->ggtt.base;
+
+	eb->context_flags = 0;
+	if (ctx->flags & CONTEXT_NO_ZEROMAP)
+		eb->context_flags |= __EXEC_OBJECT_NEEDS_BIAS;
+
+	return 0;
+}
+
+static int eb_lookup_vmas(struct i915_execbuffer *eb)
 {
 #define INTERMEDIATE BIT(0)
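+/* pointers are aligned, so the low bit is free to tag object-vs-vma slots */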
-	const int count = eb->args->buffer_count;
+	const unsigned int count = eb->buffer_count;
+	struct i915_gem_context_vma_lut *lut = &eb->ctx->vma_lut;
 	struct i915_vma *vma;
+	struct idr *idr;
+	unsigned int i;
 	int slow_pass = -1;
-	int i;
+	int err;
 
-	INIT_LIST_HEAD(&eb->vmas);
+	INIT_LIST_HEAD(&eb->relocs);
+	INIT_LIST_HEAD(&eb->unbound);
 
-	if (unlikely(eb->ctx->vma_lut.ht_size & I915_CTX_RESIZE_IN_PROGRESS))
-		flush_work(&eb->ctx->vma_lut.resize);
-	GEM_BUG_ON(eb->ctx->vma_lut.ht_size & I915_CTX_RESIZE_IN_PROGRESS);
+	if (unlikely(lut->ht_size & I915_CTX_RESIZE_IN_PROGRESS))
+		flush_work(&lut->resize);
+	GEM_BUG_ON(lut->ht_size & I915_CTX_RESIZE_IN_PROGRESS);
 
 	for (i = 0; i < count; i++) {
 		__exec_to_vma(&eb->exec[i]) = 0;
 
 		hlist_for_each_entry(vma,
-				     ht_head(eb->ctx, eb->exec[i].handle),
+				     ht_head(lut, eb->exec[i].handle),
 				     ctx_node) {
 			if (vma->ctx_handle != eb->exec[i].handle)
 				continue;
 
-			if (!eb_add_vma(eb, vma, i))
-				return -EINVAL;
+			err = eb_add_vma(eb, &eb->exec[i], vma);
+			if (unlikely(err))
+				return err;
 
 			goto next_vma;
 		}
@@ -225,24 +712,27 @@ next_vma: ;
 	}
 
 	if (slow_pass < 0)
-		return 0;
+		goto out;
 
 	spin_lock(&eb->file->table_lock);
-	/* Grab a reference to the object and release the lock so we can lookup
-	 * or create the VMA without using GFP_ATOMIC */
+	/*
+	 * Grab a reference to the object and release the lock so we can look up
+	 * or create the VMA without using GFP_ATOMIC
+	 */
+	idr = &eb->file->object_idr;
 	for (i = slow_pass; i < count; i++) {
 		struct drm_i915_gem_object *obj;
 
 		if (__exec_to_vma(&eb->exec[i]))
 			continue;
 
-		obj = to_intel_bo(idr_find(&eb->file->object_idr,
-					   eb->exec[i].handle));
+		obj = to_intel_bo(idr_find(idr, eb->exec[i].handle));
 		if (unlikely(!obj)) {
 			spin_unlock(&eb->file->table_lock);
 			DRM_DEBUG("Invalid object handle %d at index %d\n",
 				  eb->exec[i].handle, i);
-			return -ENOENT;
+			err = -ENOENT;
+			goto err;
 		}
 
 		__exec_to_vma(&eb->exec[i]) = INTERMEDIATE | (uintptr_t)obj;
@@ -252,7 +742,7 @@ next_vma: ;
 	for (i = slow_pass; i < count; i++) {
 		struct drm_i915_gem_object *obj;
 
-		if ((__exec_to_vma(&eb->exec[i]) & INTERMEDIATE) == 0)
+		if (!(__exec_to_vma(&eb->exec[i]) & INTERMEDIATE))
 			continue;
 
 		/*
@@ -263,12 +753,13 @@ next_vma: ;
 		 * from the (obj, vm) we don't run the risk of creating
 		 * duplicated vmas for the same vm.
 		 */
-		obj = u64_to_ptr(struct drm_i915_gem_object,
+		obj = u64_to_ptr(typeof(*obj),
 				 __exec_to_vma(&eb->exec[i]) & ~INTERMEDIATE);
 		vma = i915_vma_instance(obj, eb->vm, NULL);
 		if (unlikely(IS_ERR(vma))) {
 			DRM_DEBUG("Failed to lookup VMA\n");
-			return PTR_ERR(vma);
+			err = PTR_ERR(vma);
+			goto err;
 		}
 
 		/* First come, first served */
@@ -276,32 +767,31 @@ next_vma: ;
 			vma->ctx = eb->ctx;
 			vma->ctx_handle = eb->exec[i].handle;
 			hlist_add_head(&vma->ctx_node,
-				       ht_head(eb->ctx, eb->exec[i].handle));
-			eb->ctx->vma_lut.ht_count++;
+				       ht_head(lut, eb->exec[i].handle));
+			lut->ht_count++;
+			lut->ht_size |= I915_CTX_RESIZE_IN_PROGRESS;
 			if (i915_vma_is_ggtt(vma)) {
 				GEM_BUG_ON(obj->vma_hashed);
 				obj->vma_hashed = vma;
 			}
 		}
 
-		if (!eb_add_vma(eb, vma, i))
-			return -EINVAL;
+		err = eb_add_vma(eb, &eb->exec[i], vma);
+		if (unlikely(err))
+			goto err;
 	}
 
-	if (ht_needs_resize(eb->ctx)) {
-		eb->ctx->vma_lut.ht_size |= I915_CTX_RESIZE_IN_PROGRESS;
-		queue_work(system_highpri_wq, &eb->ctx->vma_lut.resize);
+	if (lut->ht_size & I915_CTX_RESIZE_IN_PROGRESS) {
+		if (ht_needs_resize(lut))
+			queue_work(system_highpri_wq, &lut->resize);
+		else
+			lut->ht_size &= ~I915_CTX_RESIZE_IN_PROGRESS;
 	}
 
-	return 0;
-#undef INTERMEDIATE
-}
-
-static struct i915_vma *
-eb_get_batch(struct i915_execbuffer *eb)
-{
-	struct i915_vma *vma =
-		exec_to_vma(&eb->exec[eb->args->buffer_count - 1]);
+out:
+	/* take note of the batch buffer before we might reorder the lists */
+	i = eb_batch_index(eb);
+	eb->batch = exec_to_vma(&eb->exec[i]);
 
 	/*
 	 * SNA is doing fancy tricks with compressing batch buffers, which leads
@@ -312,24 +802,36 @@ eb_get_batch(struct i915_execbuffer *eb)
 	 * Note that actual hangs have only been observed on gen7, but for
 	 * paranoia do it everywhere.
 	 */
-	if ((vma->exec_entry->flags & EXEC_OBJECT_PINNED) == 0)
-		vma->exec_entry->flags |= __EXEC_OBJECT_NEEDS_BIAS;
+	if (!(eb->exec[i].flags & EXEC_OBJECT_PINNED))
+		eb->exec[i].flags |= __EXEC_OBJECT_NEEDS_BIAS;
+	if (eb->reloc_cache.has_fence)
+		eb->exec[i].flags |= EXEC_OBJECT_NEEDS_FENCE;
 
-	return vma;
+	eb->args->flags |= __EXEC_VALIDATED;
+	return eb_reserve(eb);
+
+err:
+	for (i = slow_pass; i < count; i++) {
+		if (__exec_to_vma(&eb->exec[i]) & INTERMEDIATE)
+			__exec_to_vma(&eb->exec[i]) = 0;
+	}
+	lut->ht_size &= ~I915_CTX_RESIZE_IN_PROGRESS;
+	return err;
+#undef INTERMEDIATE
 }
 
 static struct i915_vma *
-eb_get_vma(struct i915_execbuffer *eb, unsigned long handle)
+eb_get_vma(const struct i915_execbuffer *eb, unsigned long handle)
 {
-	if (eb->lut_mask < 0) {
-		if (handle >= -eb->lut_mask)
+	if (eb->lut_size < 0) {
+		if (handle >= -eb->lut_size)
 			return NULL;
 		return exec_to_vma(&eb->exec[handle]);
 	} else {
 		struct hlist_head *head;
 		struct i915_vma *vma;
 
-		head = &eb->buckets[hash_32(handle, eb->lut_mask)];
+		head = &eb->buckets[hash_32(handle, eb->lut_size)];
 		hlist_for_each_entry(vma, head, exec_node) {
 			if (vma->exec_handle == handle)
 				return vma;
@@ -338,61 +840,59 @@ eb_get_vma(struct i915_execbuffer *eb, unsigned long handle)
 	}
 }
 
-static void eb_destroy(struct i915_execbuffer *eb)
+static void eb_reset_vmas(const struct i915_execbuffer *eb)
 {
-	struct i915_vma *vma;
+	const unsigned int count = eb->buffer_count;
+	unsigned int i;
 
-	list_for_each_entry(vma, &eb->vmas, exec_link) {
-		if (!vma->exec_entry)
-			continue;
+	for (i = 0; i < count; i++) {
+		struct drm_i915_gem_exec_object2 *entry = &eb->exec[i];
+		struct i915_vma *vma = exec_to_vma(entry);
 
-		__eb_unreserve_vma(vma, vma->exec_entry);
+		eb_unreserve_vma(vma, entry);
 		vma->exec_entry = NULL;
 		i915_vma_put(vma);
 	}
 
-	i915_gem_context_put(eb->ctx);
-
-	if (eb->lut_mask >= 0)
-		kfree(eb->buckets);
+	if (eb->lut_size >= 0)
+		memset(eb->buckets, 0,
+		       sizeof(struct hlist_head) << eb->lut_size);
 }
 
-static inline int use_cpu_reloc(struct drm_i915_gem_object *obj)
+static void eb_release_vmas(const struct i915_execbuffer *eb)
 {
-	if (!i915_gem_object_has_struct_page(obj))
-		return false;
+	const unsigned int count = eb->buffer_count;
+	unsigned int i;
 
-	if (DBG_USE_CPU_RELOC)
-		return DBG_USE_CPU_RELOC > 0;
+	if (!eb->exec)
+		return;
 
-	return (HAS_LLC(to_i915(obj->base.dev)) ||
-		obj->cache_dirty ||
-		obj->cache_level != I915_CACHE_NONE);
-}
+	for (i = 0; i < count; i++) {
+		struct drm_i915_gem_exec_object2 *entry = &eb->exec[i];
+		struct i915_vma *vma = exec_to_vma(entry);
 
-/* Used to convert any address to canonical form.
- * Starting from gen8, some commands (e.g. STATE_BASE_ADDRESS,
- * MI_LOAD_REGISTER_MEM and others, see Broadwell PRM Vol2a) require the
- * addresses to be in a canonical form:
- * "GraphicsAddress[63:48] are ignored by the HW and assumed to be in correct
- * canonical form [63:48] == [47]."
- */
-#define GEN8_HIGH_ADDRESS_BIT 47
-static inline uint64_t gen8_canonical_addr(uint64_t address)
-{
-	return sign_extend64(address, GEN8_HIGH_ADDRESS_BIT);
+		if (!vma || !vma->exec_entry)
+			continue;
+
+		GEM_BUG_ON(vma->exec_entry != entry);
+		if (entry->flags & __EXEC_OBJECT_HAS_PIN)
+			__eb_unreserve_vma(vma, entry);
+		vma->exec_entry = NULL;
+		i915_vma_put(vma);
+	}
 }
 
-static inline uint64_t gen8_noncanonical_addr(uint64_t address)
+static void eb_destroy(const struct i915_execbuffer *eb)
 {
-	return address & ((1ULL << (GEN8_HIGH_ADDRESS_BIT + 1)) - 1);
+	if (eb->lut_size >= 0)
+		kfree(eb->buckets);
 }
 
-static inline uint64_t
+static inline u64
 relocation_target(const struct drm_i915_gem_relocation_entry *reloc,
-		  uint64_t target_offset)
+		  const struct i915_vma *target)
 {
-	return gen8_canonical_addr((int)reloc->delta + target_offset);
+	return gen8_canonical_addr((int)reloc->delta + target->node.start);
 }
 
 static void reloc_cache_init(struct reloc_cache *cache,
@@ -401,6 +901,9 @@ static void reloc_cache_init(struct reloc_cache *cache,
 	cache->page = -1;
 	cache->vaddr = 0;
 	/* Must be a variable in the struct to allow GCC to unroll. */
+	cache->has_llc = HAS_LLC(i915);
+	cache->has_fence = INTEL_GEN(i915) < 4;
+	cache->needs_unfenced = INTEL_INFO(i915)->unfenced_needs_alignment;
 	cache->use_64bit_reloc = HAS_64BIT_RELOC(i915);
 	cache->node.allocated = false;
 }
@@ -459,7 +962,7 @@ static void reloc_cache_reset(struct reloc_cache *cache)
 
 static void *reloc_kmap(struct drm_i915_gem_object *obj,
 			struct reloc_cache *cache,
-			int page)
+			unsigned long page)
 {
 	void *vaddr;
 
@@ -467,11 +970,11 @@ static void *reloc_kmap(struct drm_i915_gem_object *obj,
 		kunmap_atomic(unmask_page(cache->vaddr));
 	} else {
 		unsigned int flushes;
-		int ret;
+		int err;
 
-		ret = i915_gem_obj_prepare_shmem_write(obj, &flushes);
-		if (ret)
-			return ERR_PTR(ret);
+		err = i915_gem_obj_prepare_shmem_write(obj, &flushes);
+		if (err)
+			return ERR_PTR(err);
 
 		BUILD_BUG_ON(KMAP & CLFLUSH_FLAGS);
 		BUILD_BUG_ON((KMAP | CLFLUSH_FLAGS) & PAGE_MASK);
@@ -491,7 +994,7 @@ static void *reloc_kmap(struct drm_i915_gem_object *obj,
 
 static void *reloc_iomap(struct drm_i915_gem_object *obj,
 			 struct reloc_cache *cache,
-			 int page)
+			 unsigned long page)
 {
 	struct i915_ggtt *ggtt = cache_to_ggtt(cache);
 	unsigned long offset;
@@ -501,31 +1004,31 @@ static void *reloc_iomap(struct drm_i915_gem_object *obj,
 		io_mapping_unmap_atomic((void __force __iomem *) unmask_page(cache->vaddr));
 	} else {
 		struct i915_vma *vma;
-		int ret;
+		int err;
 
-		if (use_cpu_reloc(obj))
+		if (use_cpu_reloc(cache, obj))
 			return NULL;
 
-		ret = i915_gem_object_set_to_gtt_domain(obj, true);
-		if (ret)
-			return ERR_PTR(ret);
+		err = i915_gem_object_set_to_gtt_domain(obj, true);
+		if (err)
+			return ERR_PTR(err);
 
 		vma = i915_gem_object_ggtt_pin(obj, NULL, 0, 0,
 					       PIN_MAPPABLE | PIN_NONBLOCK);
 		if (IS_ERR(vma)) {
 			memset(&cache->node, 0, sizeof(cache->node));
-			ret = drm_mm_insert_node_in_range
+			err = drm_mm_insert_node_in_range
 				(&ggtt->base.mm, &cache->node,
 				 PAGE_SIZE, 0, I915_COLOR_UNEVICTABLE,
 				 0, ggtt->mappable_end,
 				 DRM_MM_INSERT_LOW);
-			if (ret) /* no inactive aperture space, use cpu reloc */
+			if (err) /* no inactive aperture space, use cpu reloc */
 				return NULL;
 		} else {
-			ret = i915_vma_put_fence(vma);
-			if (ret) {
+			err = i915_vma_put_fence(vma);
+			if (err) {
 				i915_vma_unpin(vma);
-				return ERR_PTR(ret);
+				return ERR_PTR(err);
 			}
 
 			cache->node.start = vma->node.start;
@@ -553,7 +1056,7 @@ static void *reloc_iomap(struct drm_i915_gem_object *obj,
 
 static void *reloc_vaddr(struct drm_i915_gem_object *obj,
 			 struct reloc_cache *cache,
-			 int page)
+			 unsigned long page)
 {
 	void *vaddr;
 
@@ -580,7 +1083,8 @@ static void clflush_write32(u32 *addr, u32 value, unsigned int flushes)
 
 		*addr = value;
 
-		/* Writes to the same cacheline are serialised by the CPU
+		/*
+		 * Writes to the same cacheline are serialised by the CPU
 		 * (including clflush). On the write path, we only require
 		 * that it hits memory in an orderly fashion and place
 		 * mb barriers at the start and end of the relocation phase
@@ -592,25 +1096,26 @@ static void clflush_write32(u32 *addr, u32 value, unsigned int flushes)
 		*addr = value;
 }
 
-static int
-relocate_entry(struct drm_i915_gem_object *obj,
+static u64
+relocate_entry(struct i915_vma *vma,
 	       const struct drm_i915_gem_relocation_entry *reloc,
-	       struct reloc_cache *cache,
-	       u64 target_offset)
+	       struct i915_execbuffer *eb,
+	       const struct i915_vma *target)
 {
+	struct drm_i915_gem_object *obj = vma->obj;
 	u64 offset = reloc->offset;
-	bool wide = cache->use_64bit_reloc;
+	u64 target_offset = relocation_target(reloc, target);
+	bool wide = eb->reloc_cache.use_64bit_reloc;
 	void *vaddr;
 
-	target_offset = relocation_target(reloc, target_offset);
 repeat:
-	vaddr = reloc_vaddr(obj, cache, offset >> PAGE_SHIFT);
+	vaddr = reloc_vaddr(obj, &eb->reloc_cache, offset >> PAGE_SHIFT);
 	if (IS_ERR(vaddr))
 		return PTR_ERR(vaddr);
 
 	clflush_write32(vaddr + offset_in_page(offset),
 			lower_32_bits(target_offset),
-			cache->vaddr);
+			eb->reloc_cache.vaddr);
 
 	if (wide) {
 		offset += sizeof(u32);
@@ -619,17 +1124,16 @@ relocate_entry(struct drm_i915_gem_object *obj,
 		goto repeat;
 	}
 
-	return 0;
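+	/* Success: report the new location, tagged with UPDATE so that the
+	 * caller writes the fresh presumed_offset back to userspace.
+	 */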
+	return target->node.start | UPDATE;
 }
 
-static int
-eb_relocate_entry(struct i915_vma *vma,
-		  struct i915_execbuffer *eb,
-		  struct drm_i915_gem_relocation_entry *reloc)
+static u64
+eb_relocate_entry(struct i915_execbuffer *eb,
+		  struct i915_vma *vma,
+		  const struct drm_i915_gem_relocation_entry *reloc)
 {
 	struct i915_vma *target;
-	u64 target_offset;
-	int ret;
+	int err;
 
 	/* we already hold a reference to all valid objects */
 	target = eb_get_vma(eb, reloc->target_handle);
@@ -659,27 +1163,30 @@ eb_relocate_entry(struct i915_vma *vma,
 		return -EINVAL;
 	}
 
-	if (reloc->write_domain)
+	if (reloc->write_domain) {
 		target->exec_entry->flags |= EXEC_OBJECT_WRITE;
 
-	/*
-	 * Sandybridge PPGTT errata: We need a global gtt mapping for MI and
-	 * pipe_control writes because the gpu doesn't properly redirect them
-	 * through the ppgtt for non_secure batchbuffers.
-	 */
-	if (unlikely(IS_GEN6(eb->i915) &&
-		     reloc->write_domain == I915_GEM_DOMAIN_INSTRUCTION)) {
-		ret = i915_vma_bind(target, target->obj->cache_level,
-				    PIN_GLOBAL);
-		if (WARN_ONCE(ret, "Unexpected failure to bind target VMA!"))
-			return ret;
+		/*
+		 * Sandybridge PPGTT errata: We need a global gtt mapping
+		 * for MI and pipe_control writes because the gpu doesn't
+		 * properly redirect them through the ppgtt for non_secure
+		 * batchbuffers.
+		 */
+		if (reloc->write_domain == I915_GEM_DOMAIN_INSTRUCTION &&
+		    IS_GEN6(eb->i915)) {
+			err = i915_vma_bind(target, target->obj->cache_level,
+					    PIN_GLOBAL);
+			if (WARN_ONCE(err,
+				      "Unexpected failure to bind target VMA!"))
+				return err;
+		}
 	}
 
-	/* If the relocation already has the right value in it, no
+	/*
+	 * If the relocation already has the right value in it, no
 	 * more work needs to be done.
 	 */
-	target_offset = gen8_canonical_addr(target->node.start);
-	if (target_offset == reloc->presumed_offset)
+	if (gen8_canonical_addr(target->node.start) == reloc->presumed_offset)
 		return 0;
 
 	/* Check that the relocation address is valid... */
@@ -710,35 +1217,37 @@ eb_relocate_entry(struct i915_vma *vma,
 	 */
 	vma->exec_entry->flags &= ~EXEC_OBJECT_ASYNC;
 
-	ret = relocate_entry(vma->obj, reloc, &eb->reloc_cache, target_offset);
-	if (ret)
-		return ret;
-
 	/* and update the user's relocation entry */
-	reloc->presumed_offset = target_offset;
-	return 0;
+	return relocate_entry(vma, reloc, eb, target);
 }
 
-static int eb_relocate_vma(struct i915_vma *vma, struct i915_execbuffer *eb)
+static int eb_relocate_vma(struct i915_execbuffer *eb, struct i915_vma *vma)
 {
 #define N_RELOC(x) ((x) / sizeof(struct drm_i915_gem_relocation_entry))
-	struct drm_i915_gem_relocation_entry stack_reloc[N_RELOC(512)];
-	struct drm_i915_gem_relocation_entry __user *user_relocs;
-	struct drm_i915_gem_exec_object2 *entry = vma->exec_entry;
-	int remain, ret = 0;
-
-	user_relocs = u64_to_user_ptr(entry->relocs_ptr);
+	struct drm_i915_gem_relocation_entry stack[N_RELOC(512)];
+	struct drm_i915_gem_relocation_entry __user *urelocs;
+	const struct drm_i915_gem_exec_object2 *entry = vma->exec_entry;
+	unsigned int remain;
 
+	urelocs = u64_to_user_ptr(entry->relocs_ptr);
 	remain = entry->relocation_count;
-	while (remain) {
-		struct drm_i915_gem_relocation_entry *r = stack_reloc;
-		unsigned long unwritten;
-		unsigned int count;
+	if (unlikely(remain > N_RELOC(ULONG_MAX)))
+		return -EINVAL;
 
-		count = min_t(unsigned int, remain, ARRAY_SIZE(stack_reloc));
-		remain -= count;
+	/*
+	 * We must check that the entire relocation array is safe
+	 * to read. However, if the array is not writable the user loses
+	 * the updated relocation values.
+	 */
 
-		/* This is the fast path and we cannot handle a pagefault
+	do {
+		struct drm_i915_gem_relocation_entry *r = stack;
+		unsigned int count =
+			min_t(unsigned int, remain, ARRAY_SIZE(stack));
+		unsigned int copied;
+
+		/*
+		 * This is the fast path and we cannot handle a pagefault
 		 * whilst holding the struct mutex lest the user pass in the
 		 * relocations contained within a mmaped bo. For in such a case,
 		 * the page fault handler would call i915_gem_fault() and
@@ -746,409 +1255,354 @@ static int eb_relocate_vma(struct i915_vma *vma, struct i915_execbuffer *eb)
 		 * this is bad and so lockdep complains vehemently.
 		 */
 		pagefault_disable();
-		unwritten = __copy_from_user_inatomic(r, user_relocs, count*sizeof(r[0]));
+		copied = __copy_from_user_inatomic(r, urelocs, count * sizeof(r[0]));
 		pagefault_enable();
-		if (unlikely(unwritten)) {
-			ret = -EFAULT;
+		if (unlikely(copied)) {
+			remain = -EFAULT;
 			goto out;
 		}
 
+		remain -= count;
 		do {
-			u64 offset = r->presumed_offset;
+			u64 offset = eb_relocate_entry(eb, vma, r);
 
-			ret = eb_relocate_entry(vma, eb, r);
-			if (ret)
+			if (likely(offset == 0)) {
+			} else if ((s64)offset < 0) {
+				remain = (int)offset;
 				goto out;
-
-			if (r->presumed_offset != offset) {
+			} else {
+				/*
+				 * Note that reporting an error now
+				 * leaves everything in an inconsistent
+				 * state as we have *already* changed
+				 * the relocation value inside the
+				 * object. As we have not changed the
+				 * reloc.presumed_offset nor will we
+				 * change the execobject.offset, on the
+				 * next call we may not rewrite the value
+				 * inside the object, leaving it
+				 * dangling and causing a GPU hang. Unless
+				 * userspace dynamically rebuilds the
+				 * relocations on each execbuf rather than
+				 * presume a static tree.
+				 *
+				 * We did previously check if the relocations
+				 * were writable (access_ok); an error now
+				 * would be a strange race with mprotect,
+				 * having already demonstrated that we
+				 * can read from this userspace address.
+				 */
+				offset = gen8_canonical_addr(offset & ~UPDATE);
 				pagefault_disable();
-				unwritten = __put_user(r->presumed_offset,
-						       &user_relocs->presumed_offset);
+				__put_user(offset,
+					   &urelocs[r-stack].presumed_offset);
 				pagefault_enable();
-				if (unlikely(unwritten)) {
-					/* Note that reporting an error now
-					 * leaves everything in an inconsistent
-					 * state as we have *already* changed
-					 * the relocation value inside the
-					 * object. As we have not changed the
-					 * reloc.presumed_offset or will not
-					 * change the execobject.offset, on the
-					 * call we may not rewrite the value
-					 * inside the object, leaving it
-					 * dangling and causing a GPU hang.
-					 */
-					ret = -EFAULT;
-					goto out;
-				}
 			}
-
-			user_relocs++;
-			r++;
-		} while (--count);
-	}
-
+		} while (r++, --count);
+		urelocs += ARRAY_SIZE(stack);
+	} while (remain);
 out:
 	reloc_cache_reset(&eb->reloc_cache);
-	return ret;
-#undef N_RELOC
+	return remain;
 }
 
 static int
-eb_relocate_vma_slow(struct i915_vma *vma,
-		     struct i915_execbuffer *eb,
-		     struct drm_i915_gem_relocation_entry *relocs)
+eb_relocate_vma_slow(struct i915_execbuffer *eb, struct i915_vma *vma)
 {
 	const struct drm_i915_gem_exec_object2 *entry = vma->exec_entry;
-	int i, ret = 0;
+	struct drm_i915_gem_relocation_entry *relocs =
+		u64_to_ptr(typeof(*relocs), entry->relocs_ptr);
+	unsigned int i;
+	int err;
 
 	for (i = 0; i < entry->relocation_count; i++) {
-		ret = eb_relocate_entry(vma, eb, &relocs[i]);
-		if (ret)
-			break;
+		u64 offset = eb_relocate_entry(eb, vma, &relocs[i]);
+
+		if ((s64)offset < 0) {
+			err = (int)offset;
+			goto err;
+		}
 	}
+	err = 0;
+err:
 	reloc_cache_reset(&eb->reloc_cache);
-	return ret;
+	return err;
 }
 
 static int eb_relocate(struct i915_execbuffer *eb)
 {
 	struct i915_vma *vma;
-	int ret = 0;
-
-	list_for_each_entry(vma, &eb->vmas, exec_link) {
-		ret = eb_relocate_vma(vma, eb);
-		if (ret)
-			break;
-	}
-
-	return ret;
-}
-
-static bool only_mappable_for_reloc(unsigned int flags)
-{
-	return (flags & (EXEC_OBJECT_NEEDS_FENCE | __EXEC_OBJECT_NEEDS_MAP)) ==
-		__EXEC_OBJECT_NEEDS_MAP;
-}
 
-static int
-eb_reserve_vma(struct i915_vma *vma,
-	       struct intel_engine_cs *engine,
-	       bool *need_reloc)
-{
-	struct drm_i915_gem_exec_object2 *entry = vma->exec_entry;
-	uint64_t flags;
-	int ret;
-
-	flags = PIN_USER;
-	if (entry->flags & EXEC_OBJECT_NEEDS_GTT)
-		flags |= PIN_GLOBAL;
-
-	if (!drm_mm_node_allocated(&vma->node)) {
-		/* Wa32bitGeneralStateOffset & Wa32bitInstructionBaseOffset,
-		 * limit address to the first 4GBs for unflagged objects.
-		 */
-		if ((entry->flags & EXEC_OBJECT_SUPPORTS_48B_ADDRESS) == 0)
-			flags |= PIN_ZONE_4G;
-		if (entry->flags & __EXEC_OBJECT_NEEDS_MAP)
-			flags |= PIN_GLOBAL | PIN_MAPPABLE;
-		if (entry->flags & __EXEC_OBJECT_NEEDS_BIAS)
-			flags |= BATCH_OFFSET_BIAS | PIN_OFFSET_BIAS;
-		if (entry->flags & EXEC_OBJECT_PINNED)
-			flags |= entry->offset | PIN_OFFSET_FIXED;
-		if ((flags & PIN_MAPPABLE) == 0)
-			flags |= PIN_HIGH;
-	}
-
-	ret = i915_vma_pin(vma,
-			   entry->pad_to_size,
-			   entry->alignment,
-			   flags);
-	if ((ret == -ENOSPC || ret == -E2BIG) &&
-	    only_mappable_for_reloc(entry->flags))
-		ret = i915_vma_pin(vma,
-				   entry->pad_to_size,
-				   entry->alignment,
-				   flags & ~PIN_MAPPABLE);
-	if (ret)
-		return ret;
-
-	entry->flags |= __EXEC_OBJECT_HAS_PIN;
-
-	if (entry->flags & EXEC_OBJECT_NEEDS_FENCE) {
-		ret = i915_vma_get_fence(vma);
-		if (ret)
-			return ret;
-
-		if (i915_vma_pin_fence(vma))
-			entry->flags |= __EXEC_OBJECT_HAS_FENCE;
-	}
-
-	if (entry->offset != vma->node.start) {
-		entry->offset = vma->node.start;
-		*need_reloc = true;
+	/* The objects are in their final locations, apply the relocations. */
+	list_for_each_entry(vma, &eb->relocs, reloc_link) {
+		int err = eb_relocate_vma(eb, vma);
+		if (err)
+			return err;
 	}
 
 	return 0;
 }
 
-static bool
-need_reloc_mappable(struct i915_vma *vma)
+static int check_relocations(const struct drm_i915_gem_exec_object2 *entry)
 {
-	struct drm_i915_gem_exec_object2 *entry = vma->exec_entry;
+	const char __user *addr, *end;
+	unsigned long size;
+	char __maybe_unused c;
 
-	if (entry->relocation_count == 0)
-		return false;
-
-	if (!i915_vma_is_ggtt(vma))
-		return false;
+	size = entry->relocation_count;
+	if (size == 0)
+		return 0;
 
-	/* See also use_cpu_reloc() */
-	if (HAS_LLC(to_i915(vma->obj->base.dev)))
-		return false;
+	if (size > N_RELOC(ULONG_MAX))
+		return -EINVAL;
 
-	if (vma->obj->base.write_domain == I915_GEM_DOMAIN_CPU)
-		return false;
+	addr = u64_to_user_ptr(entry->relocs_ptr);
+	size *= sizeof(struct drm_i915_gem_relocation_entry);
+	if (!access_ok(VERIFY_WRITE, addr, size))
+		return -EFAULT;
 
-	return true;
+	end = addr + size;
+	for (; addr < end; addr += PAGE_SIZE) {
+		int err = __get_user(c, addr);
+		if (err)
+			return err;
+	}
+	return __get_user(c, end - 1);
 }
 
-static bool
-eb_vma_misplaced(struct i915_vma *vma)
+static int
+eb_copy_relocations(const struct i915_execbuffer *eb)
 {
-	struct drm_i915_gem_exec_object2 *entry = vma->exec_entry;
+	const unsigned int count = eb->buffer_count;
+	unsigned int i;
+	int err;
 
-	WARN_ON(entry->flags & __EXEC_OBJECT_NEEDS_MAP &&
-		!i915_vma_is_ggtt(vma));
+	for (i = 0; i < count; i++) {
+		const unsigned int nreloc = eb->exec[i].relocation_count;
+		struct drm_i915_gem_relocation_entry __user *urelocs;
+		struct drm_i915_gem_relocation_entry *relocs;
+		unsigned long size;
+		unsigned long copied;
 
-	if (entry->alignment && !IS_ALIGNED(vma->node.start, entry->alignment))
-		return true;
+		if (nreloc == 0)
+			continue;
 
-	if (vma->node.size < entry->pad_to_size)
-		return true;
+		err = check_relocations(&eb->exec[i]);
+		if (err)
+			goto err;
 
-	if (entry->flags & EXEC_OBJECT_PINNED &&
-	    vma->node.start != entry->offset)
-		return true;
+		urelocs = u64_to_user_ptr(eb->exec[i].relocs_ptr);
+		size = nreloc * sizeof(*relocs);
 
-	if (entry->flags & __EXEC_OBJECT_NEEDS_BIAS &&
-	    vma->node.start < BATCH_OFFSET_BIAS)
-		return true;
+		relocs = drm_malloc_gfp(size, 1, GFP_TEMPORARY);
+		if (!relocs) {
+			drm_free_large(relocs);
+			err = -ENOMEM;
+			goto err;
+		}
 
-	/* avoid costly ping-pong once a batch bo ended up non-mappable */
-	if (entry->flags & __EXEC_OBJECT_NEEDS_MAP &&
-	    !i915_vma_is_map_and_fenceable(vma))
-		return !only_mappable_for_reloc(entry->flags);
+		/* copy_from_user is limited to < 4GiB */
+		copied = 0;
+		do {
+			unsigned int len =
+				min_t(u64, BIT_ULL(31), size - copied);
+
+			if (__copy_from_user((char *)relocs + copied,
+					     (char *)urelocs + copied,
+					     len)) {
+				drm_free_large(relocs);
+				err = -EFAULT;
+				goto err;
+			}
 
-	if ((entry->flags & EXEC_OBJECT_SUPPORTS_48B_ADDRESS) == 0 &&
-	    (vma->node.start + vma->node.size - 1) >> 32)
-		return true;
+			copied += len;
+		} while (copied < size);
 
-	return false;
-}
+		/*
+		 * As we do not update the known relocation offsets after
+		 * relocating (due to the complexities in lock handling),
+		 * we need to mark them as invalid now so that we force the
+		 * relocation processing next time. Just in case the target
+		 * object is evicted and then rebound into its old
+		 * presumed_offset before the next execbuffer - if that
+		 * happened we would make the mistake of assuming that the
+		 * relocations were valid.
+		 */
+		user_access_begin();
+		for (copied = 0; copied < nreloc; copied++)
+			unsafe_put_user(-1,
+					&urelocs[copied].presumed_offset,
+					end_user);
+end_user:
+		user_access_end();
 
-static int eb_reserve(struct i915_execbuffer *eb)
-{
-	const bool has_fenced_gpu_access = INTEL_GEN(eb->i915) < 4;
-	const bool needs_unfenced_map = INTEL_INFO(eb->i915)->unfenced_needs_alignment;
-	struct i915_vma *vma;
-	struct list_head ordered_vmas;
-	struct list_head pinned_vmas;
-	int retry;
-
-	INIT_LIST_HEAD(&ordered_vmas);
-	INIT_LIST_HEAD(&pinned_vmas);
-	while (!list_empty(&eb->vmas)) {
-		struct drm_i915_gem_exec_object2 *entry;
-		bool need_fence, need_mappable;
-
-		vma = list_first_entry(&eb->vmas, struct i915_vma, exec_link);
-		entry = vma->exec_entry;
-
-		if (eb->ctx->flags & CONTEXT_NO_ZEROMAP)
-			entry->flags |= __EXEC_OBJECT_NEEDS_BIAS;
-
-		if (!has_fenced_gpu_access)
-			entry->flags &= ~EXEC_OBJECT_NEEDS_FENCE;
-		need_fence =
-			(entry->flags & EXEC_OBJECT_NEEDS_FENCE ||
-			 needs_unfenced_map) &&
-			i915_gem_object_is_tiled(vma->obj);
-		need_mappable = need_fence || need_reloc_mappable(vma);
-
-		if (entry->flags & EXEC_OBJECT_PINNED)
-			list_move_tail(&vma->exec_link, &pinned_vmas);
-		else if (need_mappable) {
-			entry->flags |= __EXEC_OBJECT_NEEDS_MAP;
-			list_move(&vma->exec_link, &ordered_vmas);
-		} else
-			list_move_tail(&vma->exec_link, &ordered_vmas);
-	}
-	list_splice(&ordered_vmas, &eb->vmas);
-	list_splice(&pinned_vmas, &eb->vmas);
-
-	/* Attempt to pin all of the buffers into the GTT.
-	 * This is done in 3 phases:
-	 *
-	 * 1a. Unbind all objects that do not match the GTT constraints for
-	 *     the execbuffer (fenceable, mappable, alignment etc).
-	 * 1b. Increment pin count for already bound objects.
-	 * 2.  Bind new objects.
-	 * 3.  Decrement pin count.
-	 *
-	 * This avoid unnecessary unbinding of later objects in order to make
-	 * room for the earlier objects *unless* we need to defragment.
-	 */
-	retry = 0;
-	do {
-		int ret = 0;
+		eb->exec[i].relocs_ptr = (uintptr_t)relocs;
+	}
 
-		/* Unbind any ill-fitting objects or pin. */
-		list_for_each_entry(vma, &eb->vmas, exec_link) {
-			if (!drm_mm_node_allocated(&vma->node))
-				continue;
+	return 0;
 
-			if (eb_vma_misplaced(vma))
-				ret = i915_vma_unbind(vma);
-			else
-				ret = eb_reserve_vma(vma, eb->engine, &eb->need_relocs);
-			if (ret)
-				goto err;
-		}
+err:
+	while (i--) {
+		struct drm_i915_gem_relocation_entry *relocs =
+			u64_to_ptr(typeof(*relocs), eb->exec[i].relocs_ptr);
+		if (eb->exec[i].relocation_count)
+			drm_free_large(relocs);
+	}
+	return err;
+}
 
-		/* Bind fresh objects */
-		list_for_each_entry(vma, &eb->vmas, exec_link) {
-			if (drm_mm_node_allocated(&vma->node))
-				continue;
+static int eb_prefault_relocations(const struct i915_execbuffer *eb)
+{
+	const unsigned int count = eb->buffer_count;
+	unsigned int i;
 
-			ret = eb_reserve_vma(vma, eb->engine, &eb->need_relocs);
-			if (ret)
-				goto err;
-		}
+	if (unlikely(i915.prefault_disable))
+		return 0;
 
-err:
-		if (ret != -ENOSPC || retry++)
-			return ret;
+	for (i = 0; i < count; i++) {
+		int err;
 
-		/* Decrement pin count for bound objects */
-		list_for_each_entry(vma, &eb->vmas, exec_link)
-			eb_unreserve_vma(vma);
+		err = check_relocations(&eb->exec[i]);
+		if (err)
+			return err;
+	}
 
-		ret = i915_gem_evict_vm(eb->vm, true);
-		if (ret)
-			return ret;
-	} while (1);
+	return 0;
 }
 
-static int
-eb_relocate_slow(struct i915_execbuffer *eb)
+static int eb_relocate_slow(struct i915_execbuffer *eb)
 {
-	const unsigned int count = eb->args->buffer_count;
 	struct drm_device *dev = &eb->i915->drm;
-	struct drm_i915_gem_relocation_entry *reloc;
+	bool have_copy = false;
 	struct i915_vma *vma;
-	int *reloc_offset;
-	int i, total, ret;
+	int err = 0;
+
+repeat:
+	if (signal_pending(current)) {
+		err = -ERESTARTSYS;
+		goto out;
+	}
 
 	/* We may process another execbuffer during the unlock... */
-	eb_reset(eb);
+	eb_reset_vmas(eb);
 	mutex_unlock(&dev->struct_mutex);
 
-	total = 0;
-	for (i = 0; i < count; i++)
-		total += eb->exec[i].relocation_count;
-
-	reloc_offset = drm_malloc_ab(count, sizeof(*reloc_offset));
-	reloc = drm_malloc_ab(total, sizeof(*reloc));
-	if (reloc == NULL || reloc_offset == NULL) {
-		drm_free_large(reloc);
-		drm_free_large(reloc_offset);
-		mutex_lock(&dev->struct_mutex);
-		return -ENOMEM;
+	/*
+	 * We take 3 passes through the slowpath.
+	 *
+	 * 1 - we try to just prefault all the user relocation entries and
+	 * then attempt to reuse the atomic pagefault disabled fast path again.
+	 *
+	 * 2 - we copy the user entries to a local buffer here outside of the
+	 * lock and allow ourselves to wait upon any rendering before
+	 * applying the relocations.
+	 *
+	 * 3 - we already have a local copy of the relocation entries, but
+	 * were interrupted (EAGAIN) whilst waiting for the objects, try again.
+	 */
+	if (!err) {
+		err = eb_prefault_relocations(eb);
+	} else if (!have_copy) {
+		err = eb_copy_relocations(eb);
+		have_copy = err == 0;
+	} else {
+		cond_resched();
+		err = 0;
 	}
-
-	total = 0;
-	for (i = 0; i < count; i++) {
-		struct drm_i915_gem_relocation_entry __user *user_relocs;
-		u64 invalid_offset = (u64)-1;
-		int j;
-
-		user_relocs = u64_to_user_ptr(eb->exec[i].relocs_ptr);
-
-		if (copy_from_user(reloc+total, user_relocs,
-				   eb->exec[i].relocation_count * sizeof(*reloc))) {
-			ret = -EFAULT;
-			mutex_lock(&dev->struct_mutex);
-			goto err;
-		}
-
-		/* As we do not update the known relocation offsets after
-		 * relocating (due to the complexities in lock handling),
-		 * we need to mark them as invalid now so that we force the
-		 * relocation processing next time. Just in case the target
-		 * object is evicted and then rebound into its old
-		 * presumed_offset before the next execbuffer - if that
-		 * happened we would make the mistake of assuming that the
-		 * relocations were valid.
-		 */
-		for (j = 0; j < eb->exec[i].relocation_count; j++) {
-			if (__copy_to_user(&user_relocs[j].presumed_offset,
-					   &invalid_offset,
-					   sizeof(invalid_offset))) {
-				ret = -EFAULT;
-				mutex_lock(&dev->struct_mutex);
-				goto err;
-			}
-		}
-
-		reloc_offset[i] = total;
-		total += eb->exec[i].relocation_count;
+	if (err) {
+		mutex_lock(&dev->struct_mutex);
+		goto out;
 	}
 
-	ret = i915_mutex_lock_interruptible(dev);
-	if (ret) {
+	err = i915_mutex_lock_interruptible(dev);
+	if (err) {
 		mutex_lock(&dev->struct_mutex);
-		goto err;
+		goto out;
 	}
 
 	/* reacquire the objects */
-	ret = eb_lookup_vmas(eb);
-	if (ret)
-		goto err;
-
-	ret = eb_reserve(eb);
-	if (ret)
+	err = eb_lookup_vmas(eb);
+	if (err)
 		goto err;
 
-	list_for_each_entry(vma, &eb->vmas, exec_link) {
-		int idx = vma->exec_entry - eb->exec;
-
-		ret = eb_relocate_vma_slow(vma, eb, reloc + reloc_offset[idx]);
-		if (ret)
-			goto err;
+	list_for_each_entry(vma, &eb->relocs, reloc_link) {
+		if (!have_copy) {
+			pagefault_disable();
+			err = eb_relocate_vma(eb, vma);
+			pagefault_enable();
+			if (err)
+				goto repeat;
+		} else {
+			err = eb_relocate_vma_slow(eb, vma);
+			if (err)
+				goto err;
+		}
 	}
 
-	/* Leave the user relocations as are, this is the painfully slow path,
+	/*
+	 * Leave the user relocations as-is; this is the painfully slow path,
 	 * and we want to avoid the complication of dropping the lock whilst
 	 * having buffers reserved in the aperture and so causing spurious
 	 * ENOSPC for random operations.
 	 */
 
 err:
-	drm_free_large(reloc);
-	drm_free_large(reloc_offset);
-	return ret;
+	if (err == -EAGAIN)
+		goto repeat;
+
+out:
+	if (have_copy) {
+		const unsigned int count = eb->buffer_count;
+		unsigned int i;
+
+		for (i = 0; i < count; i++) {
+			const struct drm_i915_gem_exec_object2 *entry =
+				&eb->exec[i];
+			struct drm_i915_gem_relocation_entry *relocs;
+
+			if (!entry->relocation_count)
+				continue;
+
+			relocs = u64_to_ptr(typeof(*relocs), entry->relocs_ptr);
+			drm_free_large(relocs);
+		}
+	}
+
+	return err ?: have_copy;
+}
+
+static void eb_export_fence(struct drm_i915_gem_object *obj,
+			    struct drm_i915_gem_request *req,
+			    unsigned int flags)
+{
+	struct reservation_object *resv = obj->resv;
+
+	/*
+	 * Ignore errors from failing to allocate the new fence; we can't
+	 * handle an error right now. Worst case should be missed
+	 * synchronisation leading to rendering corruption.
+	 */
+	reservation_object_lock(resv, NULL);
+	if (flags & EXEC_OBJECT_WRITE)
+		reservation_object_add_excl_fence(resv, &req->fence);
+	else if (reservation_object_reserve_shared(resv) == 0)
+		reservation_object_add_shared_fence(resv, &req->fence);
+	reservation_object_unlock(resv);
 }
 
 static int
 eb_move_to_gpu(struct i915_execbuffer *eb)
 {
-	struct i915_vma *vma;
-	int ret;
+	const unsigned int count = eb->buffer_count;
+	unsigned int i;
+	int err;
 
-	list_for_each_entry(vma, &eb->vmas, exec_link) {
+	for (i = 0; i < count; i++) {
+		const struct drm_i915_gem_exec_object2 *entry = &eb->exec[i];
+		struct i915_vma *vma = exec_to_vma(entry);
 		struct drm_i915_gem_object *obj = vma->obj;
 
-		if (vma->exec_entry->flags & EXEC_OBJECT_CAPTURE) {
+		if (entry->flags & EXEC_OBJECT_CAPTURE) {
 			struct i915_gem_capture_list *capture;
 
 			capture = kmalloc(sizeof(*capture), GFP_KERNEL);
@@ -1160,17 +1614,31 @@ eb_move_to_gpu(struct i915_execbuffer *eb)
 			eb->request->capture_list = capture;
 		}
 
-		if (vma->exec_entry->flags & EXEC_OBJECT_ASYNC)
-			continue;
+		if (entry->flags & EXEC_OBJECT_ASYNC)
+			goto skip_flushes;
 
 		if (obj->cache_dirty & ~obj->cache_coherent)
 			i915_gem_clflush_object(obj, 0);
 
-		ret = i915_gem_request_await_object
-			(eb->request, obj, vma->exec_entry->flags & EXEC_OBJECT_WRITE);
-		if (ret)
-			return ret;
+		err = i915_gem_request_await_object
+			(eb->request, obj, entry->flags & EXEC_OBJECT_WRITE);
+		if (err)
+			return err;
+
+skip_flushes:
+		i915_vma_move_to_active(vma, eb->request, entry->flags);
+		__eb_unreserve_vma(vma, entry);
+		vma->exec_entry = NULL;
+	}
+
+	for (i = 0; i < count; i++) {
+		const struct drm_i915_gem_exec_object2 *entry = &eb->exec[i];
+		struct i915_vma *vma = exec_to_vma(entry);
+
+		eb_export_fence(vma->obj, eb->request, entry->flags);
+		i915_vma_put(vma);
 	}
+	eb->exec = NULL;
 
 	/* Unconditionally flush any chipset caches (for streaming writes). */
 	i915_gem_chipset_flush(eb->i915);
@@ -1202,103 +1670,6 @@ i915_gem_check_execbuffer(struct drm_i915_gem_execbuffer2 *exec)
 	return true;
 }
 
-static int
-validate_exec_list(struct drm_device *dev,
-		   struct drm_i915_gem_exec_object2 *exec,
-		   int count)
-{
-	unsigned relocs_total = 0;
-	unsigned relocs_max = UINT_MAX / sizeof(struct drm_i915_gem_relocation_entry);
-	unsigned invalid_flags;
-	int i;
-
-	/* INTERNAL flags must not overlap with external ones */
-	BUILD_BUG_ON(__EXEC_OBJECT_INTERNAL_FLAGS & ~__EXEC_OBJECT_UNKNOWN_FLAGS);
-
-	invalid_flags = __EXEC_OBJECT_UNKNOWN_FLAGS;
-	if (USES_FULL_PPGTT(dev))
-		invalid_flags |= EXEC_OBJECT_NEEDS_GTT;
-
-	for (i = 0; i < count; i++) {
-		char __user *ptr = u64_to_user_ptr(exec[i].relocs_ptr);
-		int length; /* limited by fault_in_pages_readable() */
-
-		if (exec[i].flags & invalid_flags)
-			return -EINVAL;
-
-		/* Offset can be used as input (EXEC_OBJECT_PINNED), reject
-		 * any non-page-aligned or non-canonical addresses.
-		 */
-		if (exec[i].flags & EXEC_OBJECT_PINNED) {
-			if (exec[i].offset !=
-			    gen8_canonical_addr(exec[i].offset & PAGE_MASK))
-				return -EINVAL;
-		}
-
-		/* From drm_mm perspective address space is continuous,
-		 * so from this point we're always using non-canonical
-		 * form internally.
-		 */
-		exec[i].offset = gen8_noncanonical_addr(exec[i].offset);
-
-		if (exec[i].alignment && !is_power_of_2(exec[i].alignment))
-			return -EINVAL;
-
-		/* pad_to_size was once a reserved field, so sanitize it */
-		if (exec[i].flags & EXEC_OBJECT_PAD_TO_SIZE) {
-			if (offset_in_page(exec[i].pad_to_size))
-				return -EINVAL;
-		} else {
-			exec[i].pad_to_size = 0;
-		}
-
-		/* First check for malicious input causing overflow in
-		 * the worst case where we need to allocate the entire
-		 * relocation tree as a single array.
-		 */
-		if (exec[i].relocation_count > relocs_max - relocs_total)
-			return -EINVAL;
-		relocs_total += exec[i].relocation_count;
-
-		length = exec[i].relocation_count *
-			sizeof(struct drm_i915_gem_relocation_entry);
-		/*
-		 * We must check that the entire relocation array is safe
-		 * to read, but since we may need to update the presumed
-		 * offsets during execution, check for full write access.
-		 */
-		if (!access_ok(VERIFY_WRITE, ptr, length))
-			return -EFAULT;
-
-		if (likely(!i915.prefault_disable)) {
-			if (fault_in_pages_readable(ptr, length))
-				return -EFAULT;
-		}
-	}
-
-	return 0;
-}
-
-static int eb_select_context(struct i915_execbuffer *eb)
-{
-	unsigned int ctx_id = i915_execbuffer2_get_context_id(*eb->args);
-	struct i915_gem_context *ctx;
-
-	ctx = i915_gem_context_lookup(eb->file->driver_priv, ctx_id);
-	if (unlikely(IS_ERR(ctx)))
-		return PTR_ERR(ctx);
-
-	if (unlikely(i915_gem_context_is_banned(ctx))) {
-		DRM_DEBUG("Context %u tried to submit while banned\n", ctx_id);
-		return -EIO;
-	}
-
-	eb->ctx = i915_gem_context_get(ctx);
-	eb->vm = ctx->ppgtt ? &ctx->ppgtt->base : &eb->i915->ggtt.base;
-
-	return 0;
-}
-
 void i915_vma_move_to_active(struct i915_vma *vma,
 			     struct drm_i915_gem_request *req,
 			     unsigned int flags)
@@ -1309,7 +1680,8 @@ void i915_vma_move_to_active(struct i915_vma *vma,
 	lockdep_assert_held(&req->i915->drm.struct_mutex);
 	GEM_BUG_ON(!drm_mm_node_allocated(&vma->node));
 
-	/* Add a reference if we're newly entering the active list.
+	/*
+	 * Add a reference if we're newly entering the active list.
 	 * The order in which we add operations to the retirement queue is
 	 * vital here: mark_active adds to the start of the callback list,
 	 * such that subsequent callbacks are called first. Therefore we
@@ -1337,42 +1709,6 @@ void i915_vma_move_to_active(struct i915_vma *vma,
 		i915_gem_active_set(&vma->last_fence, req);
 }
 
-static void eb_export_fence(struct drm_i915_gem_object *obj,
-			    struct drm_i915_gem_request *req,
-			    unsigned int flags)
-{
-	struct reservation_object *resv = obj->resv;
-
-	/* Ignore errors from failing to allocate the new fence, we can't
-	 * handle an error right now. Worst case should be missed
-	 * synchronisation leading to rendering corruption.
-	 */
-	reservation_object_lock(resv, NULL);
-	if (flags & EXEC_OBJECT_WRITE)
-		reservation_object_add_excl_fence(resv, &req->fence);
-	else if (reservation_object_reserve_shared(resv) == 0)
-		reservation_object_add_shared_fence(resv, &req->fence);
-	reservation_object_unlock(resv);
-}
-
-static void
-eb_move_to_active(struct i915_execbuffer *eb)
-{
-	struct i915_vma *vma;
-
-	list_for_each_entry(vma, &eb->vmas, exec_link) {
-		struct drm_i915_gem_object *obj = vma->obj;
-
-		obj->base.write_domain = 0;
-		if (vma->exec_entry->flags & EXEC_OBJECT_WRITE)
-			obj->base.read_domains = 0;
-		obj->base.read_domains |= I915_GEM_GPU_DOMAINS;
-
-		i915_vma_move_to_active(vma, eb->request, vma->exec_entry->flags);
-		eb_export_fence(obj, eb->request, vma->exec_entry->flags);
-	}
-}
-
 static int
 i915_reset_gen7_sol_offsets(struct drm_i915_gem_request *req)
 {
@@ -1384,16 +1720,16 @@ i915_reset_gen7_sol_offsets(struct drm_i915_gem_request *req)
 		return -EINVAL;
 	}
 
-	cs = intel_ring_begin(req, 4 * 3);
+	cs = intel_ring_begin(req, 4 * 2 + 2);
 	if (IS_ERR(cs))
 		return PTR_ERR(cs);
 
+	*cs++ = MI_LOAD_REGISTER_IMM(4);
 	for (i = 0; i < 4; i++) {
-		*cs++ = MI_LOAD_REGISTER_IMM(1);
 		*cs++ = i915_mmio_reg_offset(GEN7_SO_WRITE_OFFSET(i));
 		*cs++ = 0;
 	}
-
+	*cs++ = MI_NOOP;
 	intel_ring_advance(req, cs);
 
 	return 0;
@@ -1403,24 +1739,24 @@ static struct i915_vma *eb_parse(struct i915_execbuffer *eb, bool is_master)
 {
 	struct drm_i915_gem_object *shadow_batch_obj;
 	struct i915_vma *vma;
-	int ret;
+	int err;
 
 	shadow_batch_obj = i915_gem_batch_pool_get(&eb->engine->batch_pool,
 						   PAGE_ALIGN(eb->batch_len));
 	if (IS_ERR(shadow_batch_obj))
 		return ERR_CAST(shadow_batch_obj);
 
-	ret = intel_engine_cmd_parser(eb->engine,
+	err = intel_engine_cmd_parser(eb->engine,
 				      eb->batch->obj,
 				      shadow_batch_obj,
 				      eb->batch_start_offset,
 				      eb->batch_len,
 				      is_master);
-	if (ret) {
-		if (ret == -EACCES) /* unhandled chained batch */
+	if (err) {
+		if (err == -EACCES) /* unhandled chained batch */
 			vma = NULL;
 		else
-			vma = ERR_PTR(ret);
+			vma = ERR_PTR(err);
 		goto out;
 	}
 
@@ -1429,10 +1765,10 @@ static struct i915_vma *eb_parse(struct i915_execbuffer *eb, bool is_master)
 		goto out;
 
 	vma->exec_entry =
-		memset(&eb->shadow_exec_entry, 0, sizeof(*vma->exec_entry));
+		memset(&eb->exec[eb->buffer_count++],
+		       0, sizeof(*vma->exec_entry));
 	vma->exec_entry->flags = __EXEC_OBJECT_HAS_PIN;
-	i915_gem_object_get(shadow_batch_obj);
-	list_add_tail(&vma->exec_link, &eb->vmas);
+	__exec_to_vma(vma->exec_entry) = (uintptr_t)i915_vma_get(vma);
 
 out:
 	i915_gem_object_unpin_pages(shadow_batch_obj);
@@ -1448,33 +1784,31 @@ add_to_client(struct drm_i915_gem_request *req,
 }
 
 static int
-execbuf_submit(struct i915_execbuffer *eb)
+eb_submit(struct i915_execbuffer *eb)
 {
-	int ret;
+	int err;
 
-	ret = eb_move_to_gpu(eb);
-	if (ret)
-		return ret;
+	err = eb_move_to_gpu(eb);
+	if (err)
+		return err;
 
-	ret = i915_switch_context(eb->request);
-	if (ret)
-		return ret;
+	err = i915_switch_context(eb->request);
+	if (err)
+		return err;
 
 	if (eb->args->flags & I915_EXEC_GEN7_SOL_RESET) {
-		ret = i915_reset_gen7_sol_offsets(eb->request);
-		if (ret)
-			return ret;
+		err = i915_reset_gen7_sol_offsets(eb->request);
+		if (err)
+			return err;
 	}
 
-	ret = eb->engine->emit_bb_start(eb->request,
+	err = eb->engine->emit_bb_start(eb->request,
 					eb->batch->node.start +
 					eb->batch_start_offset,
 					eb->batch_len,
-					eb->dispatch_flags);
-	if (ret)
-		return ret;
-
-	eb_move_to_active(eb);
+					eb->batch_flags);
+	if (err)
+		return err;
 
 	return 0;
 }
@@ -1565,34 +1899,35 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 	struct dma_fence *in_fence = NULL;
 	struct sync_file *out_fence = NULL;
 	int out_fence_fd = -1;
-	int ret;
-
-	if (!i915_gem_check_execbuffer(args))
-		return -EINVAL;
+	int err;
 
-	ret = validate_exec_list(dev, exec, args->buffer_count);
-	if (ret)
-		return ret;
+	BUILD_BUG_ON(__EXEC_OBJECT_INTERNAL_FLAGS & ~__EXEC_OBJECT_UNKNOWN_FLAGS);
 
 	eb.i915 = to_i915(dev);
 	eb.file = file;
 	eb.args = args;
+	if (!(args->flags & I915_EXEC_NO_RELOC))
+		args->flags |= __EXEC_HAS_RELOC;
 	eb.exec = exec;
-	eb.need_relocs = (args->flags & I915_EXEC_NO_RELOC) == 0;
+	eb.ctx = NULL;
+	eb.invalid_flags = __EXEC_OBJECT_UNKNOWN_FLAGS;
+	if (USES_FULL_PPGTT(eb.i915))
+		eb.invalid_flags |= EXEC_OBJECT_NEEDS_GTT;
 	reloc_cache_init(&eb.reloc_cache, eb.i915);
 
+	eb.buffer_count = args->buffer_count;
 	eb.batch_start_offset = args->batch_start_offset;
 	eb.batch_len = args->batch_len;
 
-	eb.dispatch_flags = 0;
+	eb.batch_flags = 0;
 	if (args->flags & I915_EXEC_SECURE) {
 		if (!drm_is_current_master(file) || !capable(CAP_SYS_ADMIN))
 		    return -EPERM;
 
-		eb.dispatch_flags |= I915_DISPATCH_SECURE;
+		eb.batch_flags |= I915_DISPATCH_SECURE;
 	}
 	if (args->flags & I915_EXEC_IS_PINNED)
-		eb.dispatch_flags |= I915_DISPATCH_PINNED;
+		eb.batch_flags |= I915_DISPATCH_PINNED;
 
 	eb.engine = eb_select_engine(eb.i915, file, args);
 	if (!eb.engine)
@@ -1609,7 +1944,7 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 			return -EINVAL;
 		}
 
-		eb.dispatch_flags |= I915_DISPATCH_RS;
+		eb.batch_flags |= I915_DISPATCH_RS;
 	}
 
 	if (args->flags & I915_EXEC_FENCE_IN) {
@@ -1621,71 +1956,57 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 	if (args->flags & I915_EXEC_FENCE_OUT) {
 		out_fence_fd = get_unused_fd_flags(O_CLOEXEC);
 		if (out_fence_fd < 0) {
-			ret = out_fence_fd;
+			err = out_fence_fd;
 			goto err_in_fence;
 		}
 	}
 
-	/* Take a local wakeref for preparing to dispatch the execbuf as
+	if (eb_create(&eb))
+		return -ENOMEM;
+
+	/*
+	 * Take a local wakeref for preparing to dispatch the execbuf as
 	 * we expect to access the hardware fairly frequently in the
 	 * process. Upon first dispatch, we acquire another prolonged
 	 * wakeref that we hold until the GPU has been idle for at least
 	 * 100ms.
 	 */
 	intel_runtime_pm_get(eb.i915);
+	err = i915_mutex_lock_interruptible(dev);
+	if (err)
+		goto err_rpm;
+
+	err = eb_select_context(&eb);
+	if (unlikely(err))
+		goto err_unlock;
+
+	err = eb_lookup_vmas(&eb);
+	if (likely(!err && args->flags & __EXEC_HAS_RELOC))
+		err = eb_relocate(&eb);
+	if (err == -EAGAIN || err == -EFAULT)
+		err = eb_relocate_slow(&eb);
+	if (err && args->flags & I915_EXEC_NO_RELOC)
+		/*
+		 * If the user expects the execobject.offset and
+		 * reloc.presumed_offset to be an exact match,
+		 * as required when using NO_RELOC, then we cannot update
+		 * the execobject.offset until we have completed
+		 * relocation.
+		 */
+		args->flags &= ~__EXEC_HAS_RELOC;
+	if (err < 0)
+		goto err_vma;
 
-	ret = i915_mutex_lock_interruptible(dev);
-	if (ret)
-		goto pre_mutex_err;
-
-	ret = eb_select_context(&eb);
-	if (ret) {
-		mutex_unlock(&dev->struct_mutex);
-		goto pre_mutex_err;
-	}
-
-	if (eb_create(&eb)) {
-		i915_gem_context_put(eb.ctx);
-		mutex_unlock(&dev->struct_mutex);
-		ret = -ENOMEM;
-		goto pre_mutex_err;
-	}
-
-	/* Look up object handles */
-	ret = eb_lookup_vmas(&eb);
-	if (ret)
-		goto err;
-
-	/* take note of the batch buffer before we might reorder the lists */
-	eb.batch = eb_get_batch(&eb);
-
-	/* Move the objects en-masse into the GTT, evicting if necessary. */
-	ret = eb_reserve(&eb);
-	if (ret)
-		goto err;
-
-	/* The objects are in their final locations, apply the relocations. */
-	if (eb.need_relocs)
-		ret = eb_relocate(&eb);
-	if (ret) {
-		if (ret == -EFAULT) {
-			ret = eb_relocate_slow(&eb);
-			BUG_ON(!mutex_is_locked(&dev->struct_mutex));
-		}
-		if (ret)
-			goto err;
-	}
-
-	if (eb.batch->exec_entry->flags & EXEC_OBJECT_WRITE) {
+	if (unlikely(eb.batch->exec_entry->flags & EXEC_OBJECT_WRITE)) {
 		DRM_DEBUG("Attempting to use self-modifying batch buffer\n");
-		ret = -EINVAL;
-		goto err;
+		err = -EINVAL;
+		goto err_vma;
 	}
 	if (eb.batch_start_offset > eb.batch->size ||
 	    eb.batch_len > eb.batch->size - eb.batch_start_offset) {
 		DRM_DEBUG("Attempting to use out-of-bounds batch\n");
-		ret = -EINVAL;
-		goto err;
+		err = -EINVAL;
+		goto err_vma;
 	}
 
 	if (eb.engine->needs_cmd_parser && eb.batch_len) {
@@ -1693,8 +2014,8 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 
 		vma = eb_parse(&eb, drm_is_current_master(file));
 		if (IS_ERR(vma)) {
-			ret = PTR_ERR(vma);
-			goto err;
+			err = PTR_ERR(vma);
+			goto err_vma;
 		}
 
 		if (vma) {
@@ -1707,7 +2028,7 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 			 * specifically don't want that set on batches the
 			 * command parser has accepted.
 			 */
-			eb.dispatch_flags |= I915_DISPATCH_SECURE;
+			eb.batch_flags |= I915_DISPATCH_SECURE;
 			eb.batch_start_offset = 0;
 			eb.batch = vma;
 		}
@@ -1716,11 +2037,11 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 	if (eb.batch_len == 0)
 		eb.batch_len = eb.batch->size - eb.batch_start_offset;
 
-	/* snb/ivb/vlv conflate the "batch in ppgtt" bit with the "non-secure
+	/*
+	 * snb/ivb/vlv conflate the "batch in ppgtt" bit with the "non-secure
 	 * batch" bit. Hence we need to pin secure batches into the global gtt.
 	 * hsw should have this fixed, but bdw mucks it up again. */
-	if (eb.dispatch_flags & I915_DISPATCH_SECURE) {
-		struct drm_i915_gem_object *obj = eb.batch->obj;
+	if (eb.batch_flags & I915_DISPATCH_SECURE) {
 		struct i915_vma *vma;
 
 		/*
@@ -1733,10 +2054,10 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 		 *   fitting due to fragmentation.
 		 * So this is actually safe.
 		 */
-		vma = i915_gem_object_ggtt_pin(obj, NULL, 0, 0, 0);
+		vma = i915_gem_object_ggtt_pin(eb.batch->obj, NULL, 0, 0, 0);
 		if (IS_ERR(vma)) {
-			ret = PTR_ERR(vma);
-			goto err;
+			err = PTR_ERR(vma);
+			goto err_vma;
 		}
 
 		eb.batch = vma;
@@ -1745,25 +2066,26 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 	/* Allocate a request for this batch buffer nice and early. */
 	eb.request = i915_gem_request_alloc(eb.engine, eb.ctx);
 	if (IS_ERR(eb.request)) {
-		ret = PTR_ERR(eb.request);
+		err = PTR_ERR(eb.request);
 		goto err_batch_unpin;
 	}
 
 	if (in_fence) {
-		ret = i915_gem_request_await_dma_fence(eb.request, in_fence);
-		if (ret < 0)
+		err = i915_gem_request_await_dma_fence(eb.request, in_fence);
+		if (err < 0)
 			goto err_request;
 	}
 
 	if (out_fence_fd != -1) {
 		out_fence = sync_file_create(&eb.request->fence);
 		if (!out_fence) {
-			ret = -ENOMEM;
+			err = -ENOMEM;
 			goto err_request;
 		}
 	}
 
-	/* Whilst this request exists, batch_obj will be on the
+	/*
+	 * Whilst this request exists, batch_obj will be on the
 	 * active_list, and so will hold the active reference. Only when this
 	 * request is retired will the batch_obj be moved onto the
 	 * inactive_list and lose its active reference. Hence we do not need
@@ -1771,14 +2093,14 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 	 */
 	eb.request->batch = eb.batch;
 
-	trace_i915_gem_request_queue(eb.request, eb.dispatch_flags);
-	ret = execbuf_submit(&eb);
+	trace_i915_gem_request_queue(eb.request, eb.batch_flags);
+	err = eb_submit(&eb);
 err_request:
-	__i915_add_request(eb.request, ret == 0);
+	__i915_add_request(eb.request, err == 0);
 	add_to_client(eb.request, file);
 
 	if (out_fence) {
-		if (ret == 0) {
+		if (err == 0) {
 			fd_install(out_fence_fd, out_fence->file);
 			args->rsvd2 &= GENMASK_ULL(0, 31); /* keep in-fence */
 			args->rsvd2 |= (u64)out_fence_fd << 32;
@@ -1789,28 +2111,21 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 	}
 
 err_batch_unpin:
-	/*
-	 * FIXME: We crucially rely upon the active tracking for the (ppgtt)
-	 * batch vma for correctness. For less ugly and less fragility this
-	 * needs to be adjusted to also track the ggtt batch vma properly as
-	 * active.
-	 */
-	if (eb.dispatch_flags & I915_DISPATCH_SECURE)
+	if (eb.batch_flags & I915_DISPATCH_SECURE)
 		i915_vma_unpin(eb.batch);
-err:
-	/* the request owns the ref now */
-	eb_destroy(&eb);
+err_vma:
+	eb_release_vmas(&eb);
+	i915_gem_context_put(eb.ctx);
+err_unlock:
 	mutex_unlock(&dev->struct_mutex);
-
-pre_mutex_err:
-	/* intel_gpu_busy should also get a ref, so it will free when the device
-	 * is really idle. */
+err_rpm:
 	intel_runtime_pm_put(eb.i915);
+	eb_destroy(&eb);
 	if (out_fence_fd != -1)
 		put_unused_fd(out_fence_fd);
 err_in_fence:
 	dma_fence_put(in_fence);
-	return ret;
+	return err;
 }
 
 /*
@@ -1825,16 +2140,35 @@ i915_gem_execbuffer(struct drm_device *dev, void *data,
 	struct drm_i915_gem_execbuffer2 exec2;
 	struct drm_i915_gem_exec_object *exec_list = NULL;
 	struct drm_i915_gem_exec_object2 *exec2_list = NULL;
-	int ret, i;
+	unsigned int i;
+	int err;
 
 	if (args->buffer_count < 1) {
 		DRM_DEBUG("execbuf with %d buffers\n", args->buffer_count);
 		return -EINVAL;
 	}
 
+	exec2.buffers_ptr = args->buffers_ptr;
+	exec2.buffer_count = args->buffer_count;
+	exec2.batch_start_offset = args->batch_start_offset;
+	exec2.batch_len = args->batch_len;
+	exec2.DR1 = args->DR1;
+	exec2.DR4 = args->DR4;
+	exec2.num_cliprects = args->num_cliprects;
+	exec2.cliprects_ptr = args->cliprects_ptr;
+	exec2.flags = I915_EXEC_RENDER;
+	i915_execbuffer2_set_context_id(exec2, 0);
+
+	if (!i915_gem_check_execbuffer(&exec2))
+		return -EINVAL;
+
 	/* Copy in the exec list from userland */
-	exec_list = drm_malloc_ab(sizeof(*exec_list), args->buffer_count);
-	exec2_list = drm_malloc_ab(sizeof(*exec2_list), args->buffer_count);
+	exec_list = drm_malloc_gfp(args->buffer_count,
+				   sizeof(*exec_list),
+				   __GFP_NOWARN | GFP_TEMPORARY);
+	exec2_list = drm_malloc_gfp(args->buffer_count + 1,
+				    sizeof(*exec2_list),
+				    __GFP_NOWARN | GFP_TEMPORARY);
 	if (exec_list == NULL || exec2_list == NULL) {
 		DRM_DEBUG("Failed to allocate exec list for %d buffers\n",
 			  args->buffer_count);
@@ -1842,12 +2176,12 @@ i915_gem_execbuffer(struct drm_device *dev, void *data,
 		drm_free_large(exec2_list);
 		return -ENOMEM;
 	}
-	ret = copy_from_user(exec_list,
+	err = copy_from_user(exec_list,
 			     u64_to_user_ptr(args->buffers_ptr),
 			     sizeof(*exec_list) * args->buffer_count);
-	if (ret != 0) {
+	if (err) {
 		DRM_DEBUG("copy %d exec entries failed %d\n",
-			  args->buffer_count, ret);
+			  args->buffer_count, err);
 		drm_free_large(exec_list);
 		drm_free_large(exec2_list);
 		return -EFAULT;
@@ -1865,42 +2199,29 @@ i915_gem_execbuffer(struct drm_device *dev, void *data,
 			exec2_list[i].flags = 0;
 	}
 
-	exec2.buffers_ptr = args->buffers_ptr;
-	exec2.buffer_count = args->buffer_count;
-	exec2.batch_start_offset = args->batch_start_offset;
-	exec2.batch_len = args->batch_len;
-	exec2.DR1 = args->DR1;
-	exec2.DR4 = args->DR4;
-	exec2.num_cliprects = args->num_cliprects;
-	exec2.cliprects_ptr = args->cliprects_ptr;
-	exec2.flags = I915_EXEC_RENDER;
-	i915_execbuffer2_set_context_id(exec2, 0);
-
-	ret = i915_gem_do_execbuffer(dev, file, &exec2, exec2_list);
-	if (!ret) {
+	err = i915_gem_do_execbuffer(dev, file, &exec2, exec2_list);
+	if (exec2.flags & __EXEC_HAS_RELOC) {
 		struct drm_i915_gem_exec_object __user *user_exec_list =
 			u64_to_user_ptr(args->buffers_ptr);
 
 		/* Copy the new buffer offsets back to the user's exec list. */
 		for (i = 0; i < args->buffer_count; i++) {
+			if (!(exec2_list[i].offset & UPDATE))
+				continue;
+
 			exec2_list[i].offset =
-				gen8_canonical_addr(exec2_list[i].offset);
-			ret = __copy_to_user(&user_exec_list[i].offset,
-					     &exec2_list[i].offset,
-					     sizeof(user_exec_list[i].offset));
-			if (ret) {
-				ret = -EFAULT;
-				DRM_DEBUG("failed to copy %d exec entries "
-					  "back to user (%d)\n",
-					  args->buffer_count, ret);
+				gen8_canonical_addr(exec2_list[i].offset & PIN_OFFSET_MASK);
+			exec2_list[i].offset &= PIN_OFFSET_MASK;
+			if (__copy_to_user(&user_exec_list[i].offset,
+					   &exec2_list[i].offset,
+					   sizeof(user_exec_list[i].offset)))
 				break;
-			}
 		}
 	}
 
 	drm_free_large(exec_list);
 	drm_free_large(exec2_list);
-	return ret;
+	return err;
 }
 
 int
@@ -1908,56 +2229,65 @@ i915_gem_execbuffer2(struct drm_device *dev, void *data,
 		     struct drm_file *file)
 {
 	struct drm_i915_gem_execbuffer2 *args = data;
-	struct drm_i915_gem_exec_object2 *exec2_list = NULL;
-	int ret;
+	struct drm_i915_gem_exec_object2 *exec2_list;
+	int err;
 
 	if (args->buffer_count < 1 ||
-	    args->buffer_count > UINT_MAX / sizeof(*exec2_list)) {
+	    args->buffer_count > SIZE_MAX / sizeof(*exec2_list) - 1) {
 		DRM_DEBUG("execbuf2 with %d buffers\n", args->buffer_count);
 		return -EINVAL;
 	}
 
-	exec2_list = drm_malloc_gfp(args->buffer_count,
+	if (!i915_gem_check_execbuffer(args))
+		return -EINVAL;
+
+	/* Allocate an extra slot for use by the command parser */
+	exec2_list = drm_malloc_gfp(args->buffer_count + 1,
 				    sizeof(*exec2_list),
-				    GFP_TEMPORARY);
+				    __GFP_NOWARN | GFP_TEMPORARY);
 	if (exec2_list == NULL) {
 		DRM_DEBUG("Failed to allocate exec list for %d buffers\n",
 			  args->buffer_count);
 		return -ENOMEM;
 	}
-	ret = copy_from_user(exec2_list,
-			     u64_to_user_ptr(args->buffers_ptr),
-			     sizeof(*exec2_list) * args->buffer_count);
-	if (ret != 0) {
-		DRM_DEBUG("copy %d exec entries failed %d\n",
-			  args->buffer_count, ret);
+	if (copy_from_user(exec2_list,
+			   u64_to_user_ptr(args->buffers_ptr),
+			   sizeof(*exec2_list) * args->buffer_count)) {
+		DRM_DEBUG("copy %d exec entries failed\n", args->buffer_count);
 		drm_free_large(exec2_list);
 		return -EFAULT;
 	}
 
-	ret = i915_gem_do_execbuffer(dev, file, args, exec2_list);
-	if (!ret) {
-		/* Copy the new buffer offsets back to the user's exec list. */
+	err = i915_gem_do_execbuffer(dev, file, args, exec2_list);
+
+	/*
+	 * Now that we have begun execution of the batchbuffer, we ignore
+	 * any new error after this point. Also given that we have already
+	 * updated the associated relocations, we try to write out the current
+	 * object locations irrespective of any error.
+	 */
+	if (args->flags & __EXEC_HAS_RELOC) {
 		struct drm_i915_gem_exec_object2 __user *user_exec_list =
-				   u64_to_user_ptr(args->buffers_ptr);
-		int i;
+			u64_to_user_ptr(args->buffers_ptr);
+		unsigned int i;
 
+		/* Copy the new buffer offsets back to the user's exec list. */
+		user_access_begin();
 		for (i = 0; i < args->buffer_count; i++) {
+			if (!(exec2_list[i].offset & UPDATE))
+				continue;
+
 			exec2_list[i].offset =
-				gen8_canonical_addr(exec2_list[i].offset);
-			ret = __copy_to_user(&user_exec_list[i].offset,
-					     &exec2_list[i].offset,
-					     sizeof(user_exec_list[i].offset));
-			if (ret) {
-				ret = -EFAULT;
-				DRM_DEBUG("failed to copy %d exec entries "
-					  "back to user\n",
-					  args->buffer_count);
-				break;
-			}
+				gen8_canonical_addr(exec2_list[i].offset & PIN_OFFSET_MASK);
+			unsafe_put_user(exec2_list[i].offset,
+					&user_exec_list[i].offset,
+					end_user);
 		}
+end_user:
+		user_access_end();
 	}
 
+	args->flags &= ~__I915_EXEC_UNKNOWN_FLAGS;
 	drm_free_large(exec2_list);
-	return ret;
+	return err;
 }
diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c
index ad696239383d..6b1253fdfc39 100644
--- a/drivers/gpu/drm/i915/i915_vma.c
+++ b/drivers/gpu/drm/i915/i915_vma.c
@@ -463,7 +463,7 @@ i915_vma_insert(struct i915_vma *vma, u64 size, u64 alignment, u64 flags)
 			  size, obj->base.size,
 			  flags & PIN_MAPPABLE ? "mappable" : "total",
 			  end);
-		return -E2BIG;
+		return -ENOSPC;
 	}
 
 	ret = i915_gem_object_pin_pages(obj);
diff --git a/drivers/gpu/drm/i915/i915_vma.h b/drivers/gpu/drm/i915/i915_vma.h
index 88543fafcffc..062addfee6ef 100644
--- a/drivers/gpu/drm/i915/i915_vma.h
+++ b/drivers/gpu/drm/i915/i915_vma.h
@@ -103,6 +103,7 @@ struct i915_vma {
 
 	/** This vma's place in the execbuf reservation list */
 	struct list_head exec_link;
+	struct list_head reloc_link;
 
 	/** This vma's place in the eviction list */
 	struct list_head evict_link;
diff --git a/drivers/gpu/drm/i915/selftests/i915_gem_evict.c b/drivers/gpu/drm/i915/selftests/i915_gem_evict.c
index 14e9c2fbc4e6..5ea373221f49 100644
--- a/drivers/gpu/drm/i915/selftests/i915_gem_evict.c
+++ b/drivers/gpu/drm/i915/selftests/i915_gem_evict.c
@@ -304,7 +304,7 @@ static int igt_evict_vm(void *arg)
 		goto cleanup;
 
 	/* Everything is pinned, nothing should happen */
-	err = i915_gem_evict_vm(&ggtt->base, false);
+	err = i915_gem_evict_vm(&ggtt->base);
 	if (err) {
 		pr_err("i915_gem_evict_vm on a full GGTT returned err=%d]\n",
 		       err);
@@ -313,7 +313,7 @@ static int igt_evict_vm(void *arg)
 
 	unpin_ggtt(i915);
 
-	err = i915_gem_evict_vm(&ggtt->base, false);
+	err = i915_gem_evict_vm(&ggtt->base);
 	if (err) {
 		pr_err("i915_gem_evict_vm on a full GGTT returned err=%d]\n",
 		       err);
diff --git a/drivers/gpu/drm/i915/selftests/i915_vma.c b/drivers/gpu/drm/i915/selftests/i915_vma.c
index ad56566e24db..fb9072d5877f 100644
--- a/drivers/gpu/drm/i915/selftests/i915_vma.c
+++ b/drivers/gpu/drm/i915/selftests/i915_vma.c
@@ -225,14 +225,6 @@ static bool assert_pin_valid(const struct i915_vma *vma,
 }
 
 __maybe_unused
-static bool assert_pin_e2big(const struct i915_vma *vma,
-			     const struct pin_mode *mode,
-			     int result)
-{
-	return result == -E2BIG;
-}
-
-__maybe_unused
 static bool assert_pin_enospc(const struct i915_vma *vma,
 			      const struct pin_mode *mode,
 			      int result)
@@ -255,7 +247,6 @@ static int igt_vma_pin1(void *arg)
 #define VALID(sz, fl) { .size = (sz), .flags = (fl), .assert = assert_pin_valid, .string = #sz ", " #fl ", (valid) " }
 #define __INVALID(sz, fl, check, eval) { .size = (sz), .flags = (fl), .assert = (check), .string = #sz ", " #fl ", (invalid " #eval ")" }
 #define INVALID(sz, fl) __INVALID(sz, fl, assert_pin_einval, EINVAL)
-#define TOOBIG(sz, fl) __INVALID(sz, fl, assert_pin_e2big, E2BIG)
 #define NOSPACE(sz, fl) __INVALID(sz, fl, assert_pin_enospc, ENOSPC)
 		VALID(0, PIN_GLOBAL),
 		VALID(0, PIN_GLOBAL | PIN_MAPPABLE),
@@ -276,11 +267,11 @@ static int igt_vma_pin1(void *arg)
 		VALID(8192, PIN_GLOBAL),
 		VALID(i915->ggtt.mappable_end - 4096, PIN_GLOBAL | PIN_MAPPABLE),
 		VALID(i915->ggtt.mappable_end, PIN_GLOBAL | PIN_MAPPABLE),
-		TOOBIG(i915->ggtt.mappable_end + 4096, PIN_GLOBAL | PIN_MAPPABLE),
+		NOSPACE(i915->ggtt.mappable_end + 4096, PIN_GLOBAL | PIN_MAPPABLE),
 		VALID(i915->ggtt.base.total - 4096, PIN_GLOBAL),
 		VALID(i915->ggtt.base.total, PIN_GLOBAL),
-		TOOBIG(i915->ggtt.base.total + 4096, PIN_GLOBAL),
-		TOOBIG(round_down(U64_MAX, PAGE_SIZE), PIN_GLOBAL),
+		NOSPACE(i915->ggtt.base.total + 4096, PIN_GLOBAL),
+		NOSPACE(round_down(U64_MAX, PAGE_SIZE), PIN_GLOBAL),
 		INVALID(8192, PIN_GLOBAL | PIN_MAPPABLE | PIN_OFFSET_FIXED | (i915->ggtt.mappable_end - 4096)),
 		INVALID(8192, PIN_GLOBAL | PIN_OFFSET_FIXED | (i915->ggtt.base.total - 4096)),
 		INVALID(8192, PIN_GLOBAL | PIN_OFFSET_FIXED | (round_down(U64_MAX, PAGE_SIZE) - 4096)),
@@ -300,7 +291,6 @@ static int igt_vma_pin1(void *arg)
 #endif
 		{ },
 #undef NOSPACE
-#undef TOOBIG
 #undef INVALID
 #undef __INVALID
 #undef VALID
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 47+ messages in thread
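
A quick sketch of the userspace side of the offset write-back that the
rewritten execbuffer paths above serve (assuming the usual libdrm and
i915_drm.h includes; the fd, the object handles and the cached_offset[]
array are stand-ins set up elsewhere, so this illustrates the uapi
contract rather than real client code):

	struct drm_i915_gem_exec_object2 objs[2];   /* handles filled in */
	struct drm_i915_gem_execbuffer2 execbuf = {
		.buffers_ptr = (uintptr_t)objs,
		.buffer_count = 2,
		/* batch index, flags, etc. filled in elsewhere */
	};

	if (drmIoctl(fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, &execbuf) == 0) {
		/* The kernel wrote any relocated addresses back into objs[]. */
		for (unsigned int i = 0; i < execbuf.buffer_count; i++)
			cached_offset[i] = objs[i].offset;
		/* Once the offsets settle, pass them back in and set
		 * I915_EXEC_NO_RELOC to skip relocation processing. */
	}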

* [PATCH 12/24] drm/i915: Store a persistent reference for an object in the execbuffer cache
  2017-05-18  9:46 [PATCH 01/24] drm/i915/selftests: Pretend to be a gfx pci device Chris Wilson
                   ` (9 preceding siblings ...)
  2017-05-18  9:46 ` [PATCH 11/24] drm/i915: Eliminate lots of iterations over the execobjects array Chris Wilson
@ 2017-05-18  9:46 ` Chris Wilson
  2017-05-19  9:37   ` Joonas Lahtinen
  2017-05-18  9:46 ` [PATCH 13/24] drm/i915: First try the previous execbuffer location Chris Wilson
                   ` (14 subsequent siblings)
  25 siblings, 1 reply; 47+ messages in thread
From: Chris Wilson @ 2017-05-18  9:46 UTC (permalink / raw)
  To: intel-gfx

If we take a reference to the object/vma when it is first used in an
execbuf, we can keep that reference until the object's file-local handle
is closed, thereby saving a frequent ref/unref pair.
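
As a sketch, the resulting ownership rule on release looks like this
(names are taken from the patch below, shown simplified and out of
context):

	/* vmas cached in the context's handle LUT keep one long-lived
	 * reference until the handle is closed; only vmas that bypassed
	 * the cache carry a per-execbuf reference that must be dropped. */
	if (unlikely(entry->flags & __EXEC_OBJECT_HAS_REF))
		i915_vma_put(vma);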

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/i915_gem_context.c    |  1 +
 drivers/gpu/drm/i915/i915_gem_execbuffer.c | 32 ++++++++++++++++++++----------
 drivers/gpu/drm/i915/i915_vma.c            |  2 ++
 3 files changed, 25 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 9af443e69c57..c62e17a68f5b 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -152,6 +152,7 @@ static void vma_lut_free(struct i915_gem_context *ctx)
 		hlist_for_each_entry(vma, &lut->ht[i], ctx_node) {
 			vma->obj->vma_hashed = NULL;
 			vma->ctx = NULL;
+			i915_vma_put(vma);
 		}
 	}
 	kvfree(lut->ht);
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index 77c7363754d1..92616c96a8c6 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -42,11 +42,12 @@
 
 #define DBG_USE_CPU_RELOC 0 /* -1 force GTT relocs; 1 force CPU relocs */
 
-#define  __EXEC_OBJECT_HAS_PIN		BIT(31)
-#define  __EXEC_OBJECT_HAS_FENCE	BIT(30)
-#define  __EXEC_OBJECT_NEEDS_MAP	BIT(29)
-#define  __EXEC_OBJECT_NEEDS_BIAS	BIT(28)
-#define  __EXEC_OBJECT_INTERNAL_FLAGS (~0u << 28) /* all of the above */
+#define  __EXEC_OBJECT_HAS_REF		BIT(31)
+#define  __EXEC_OBJECT_HAS_PIN		BIT(30)
+#define  __EXEC_OBJECT_HAS_FENCE	BIT(29)
+#define  __EXEC_OBJECT_NEEDS_MAP	BIT(28)
+#define  __EXEC_OBJECT_NEEDS_BIAS	BIT(27)
+#define  __EXEC_OBJECT_INTERNAL_FLAGS (~0u << 27) /* all of the above */
 #define __EB_RESERVED (__EXEC_OBJECT_HAS_PIN | __EXEC_OBJECT_HAS_FENCE)
 
 #define __EXEC_HAS_RELOC	BIT(31)
@@ -445,7 +446,7 @@ eb_add_vma(struct i915_execbuffer *eb,
 	 * to find the right target VMA when doing relocations.
 	 */
 	vma->exec_entry = entry;
-	__exec_to_vma(entry) = (uintptr_t)i915_vma_get(vma);
+	__exec_to_vma(entry) = (uintptr_t)vma;
 
 	if (eb->lut_size >= 0) {
 		vma->exec_handle = entry->handle;
@@ -774,11 +775,19 @@ next_vma: ;
 				GEM_BUG_ON(obj->vma_hashed);
 				obj->vma_hashed = vma;
 			}
+
+			i915_vma_get(vma);
 		}
 
 		err = eb_add_vma(eb, &eb->exec[i], vma);
 		if (unlikely(err))
 			goto err;
+
+		/* Only after we validated the user didn't use our bits */
+		if (vma->ctx != eb->ctx) {
+			i915_vma_get(vma);
+			eb->exec[i].flags |= __EXEC_OBJECT_HAS_REF;
+		}
 	}
 
 	if (lut->ht_size & I915_CTX_RESIZE_IN_PROGRESS) {
@@ -851,7 +860,8 @@ static void eb_reset_vmas(const struct i915_execbuffer *eb)
 
 		eb_unreserve_vma(vma, entry);
 		vma->exec_entry = NULL;
-		i915_vma_put(vma);
+		if (unlikely(entry->flags & __EXEC_OBJECT_HAS_REF))
+			i915_vma_put(vma);
 	}
 
 	if (eb->lut_size >= 0)
@@ -878,7 +888,8 @@ static void eb_release_vmas(const struct i915_execbuffer *eb)
 		if (entry->flags & __EXEC_OBJECT_HAS_PIN)
 			__eb_unreserve_vma(vma, entry);
 		vma->exec_entry = NULL;
-		i915_vma_put(vma);
+		if (unlikely(entry->flags & __EXEC_OBJECT_HAS_REF))
+			i915_vma_put(vma);
 	}
 }
 
@@ -1636,7 +1647,8 @@ eb_move_to_gpu(struct i915_execbuffer *eb)
 		struct i915_vma *vma = exec_to_vma(entry);
 
 		eb_export_fence(vma->obj, eb->request, entry->flags);
-		i915_vma_put(vma);
+		if (unlikely(entry->flags & __EXEC_OBJECT_HAS_REF))
+			i915_vma_put(vma);
 	}
 	eb->exec = NULL;
 
@@ -1767,7 +1779,7 @@ static struct i915_vma *eb_parse(struct i915_execbuffer *eb, bool is_master)
 	vma->exec_entry =
 		memset(&eb->exec[eb->buffer_count++],
 		       0, sizeof(*vma->exec_entry));
-	vma->exec_entry->flags = __EXEC_OBJECT_HAS_PIN;
+	vma->exec_entry->flags = __EXEC_OBJECT_HAS_PIN | __EXEC_OBJECT_HAS_REF;
 	__exec_to_vma(vma->exec_entry) = (uintptr_t)i915_vma_get(vma);
 
 out:
diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c
index 6b1253fdfc39..09b00e9de2f0 100644
--- a/drivers/gpu/drm/i915/i915_vma.c
+++ b/drivers/gpu/drm/i915/i915_vma.c
@@ -605,6 +605,8 @@ void i915_vma_unlink_ctx(struct i915_vma *vma)
 	if (i915_vma_is_ggtt(vma))
 		vma->obj->vma_hashed = NULL;
 	vma->ctx = NULL;
+
+	i915_vma_put(vma);
 }
 
 void i915_vma_close(struct i915_vma *vma)
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 13/24] drm/i915: First try the previous execbuffer location
  2017-05-18  9:46 [PATCH 01/24] drm/i915/selftests: Pretend to be a gfx pci device Chris Wilson
                   ` (10 preceding siblings ...)
  2017-05-18  9:46 ` [PATCH 12/24] drm/i915: Store a persistent reference for an object in the execbuffer cache Chris Wilson
@ 2017-05-18  9:46 ` Chris Wilson
  2017-05-18  9:46 ` [PATCH 14/24] drm/i915: Wait upon userptr get-user-pages within execbuffer Chris Wilson
                   ` (13 subsequent siblings)
  25 siblings, 0 replies; 47+ messages in thread
From: Chris Wilson @ 2017-05-18  9:46 UTC (permalink / raw)
  To: intel-gfx

When choosing a slot for an execbuffer, we ideally want to use the same
address as last time (so that we don't have to rebind it) and the same
address as expected by the user (so that we don't have to fixup any
relocations pointing to it). We therefore first try to bind the object
at the offset the user passed in via execbuffer->offset, or at its
currently bound offset, which should achieve the goal of avoiding both
the rebind cost and the relocation penalty. However, if the object is
not already bound there, we don't want to arbitrarily unbind whatever
object occupies our chosen position, and so we rebind/relocate the
incoming object instead. After we
report the new position back to the user, on the next pass the
relocations should have settled down.
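
In pseudocode, the placement policy reduces to the following (a
simplified restatement of the eb_pin_vma() hunk below, not additional
behaviour):

	if (vma->node.size)		/* already bound: reuse the binding */
		flags = vma->node.start;
	else				/* unbound: try the user's hint */
		flags = entry->offset & PIN_OFFSET_MASK;
	flags |= PIN_USER | PIN_NOEVICT | PIN_OFFSET_FIXED;
	/* PIN_NOEVICT means a miss never evicts a neighbour; we simply
	 * fall back to the ordinary reserve-and-relocate path instead. */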

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_gem_execbuffer.c | 12 ++++++++----
 drivers/gpu/drm/i915/i915_gem_gtt.c        |  6 ++++++
 drivers/gpu/drm/i915/i915_gem_gtt.h        |  1 +
 3 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index 92616c96a8c6..171a945046c3 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -337,10 +337,15 @@ eb_pin_vma(struct i915_execbuffer *eb,
 {
 	u64 flags;
 
-	flags = vma->node.start;
-	flags |= PIN_USER | PIN_NONBLOCK | PIN_OFFSET_FIXED;
+	if (vma->node.size)
+		flags = vma->node.start;
+	else
+		flags = entry->offset & PIN_OFFSET_MASK;
+
+	flags |= PIN_USER | PIN_NOEVICT | PIN_OFFSET_FIXED;
 	if (unlikely(entry->flags & EXEC_OBJECT_NEEDS_GTT))
 		flags |= PIN_GLOBAL;
+
 	if (unlikely(i915_vma_pin(vma, 0, 0, flags)))
 		return;
 
@@ -471,8 +476,7 @@ eb_add_vma(struct i915_execbuffer *eb,
 		entry->flags |= eb->context_flags;
 
 	err = 0;
-	if (vma->node.size)
-		eb_pin_vma(eb, entry, vma);
+	eb_pin_vma(eb, entry, vma);
 	if (eb_vma_misplaced(entry, vma)) {
 		eb_unreserve_vma(vma, entry);
 
diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
index a9d78ebcabfe..b6a798e4d6f2 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -3302,6 +3302,9 @@ int i915_gem_gtt_reserve(struct i915_address_space *vm,
 	if (err != -ENOSPC)
 		return err;
 
+	if (flags & PIN_NOEVICT)
+		return -ENOSPC;
+
 	err = i915_gem_evict_for_node(vm, node, flags);
 	if (err == 0)
 		err = drm_mm_reserve_node(&vm->mm, node);
@@ -3416,6 +3419,9 @@ int i915_gem_gtt_insert(struct i915_address_space *vm,
 	if (err != -ENOSPC)
 		return err;
 
+	if (flags & PIN_NOEVICT)
+		return -ENOSPC;
+
 	/* No free space, pick a slot at random.
 	 *
 	 * There is a pathological case here using a GTT shared between
diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.h b/drivers/gpu/drm/i915/i915_gem_gtt.h
index fb15684c1d83..a528ce1380fd 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.h
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.h
@@ -588,6 +588,7 @@ int i915_gem_gtt_insert(struct i915_address_space *vm,
 #define PIN_MAPPABLE		BIT(1)
 #define PIN_ZONE_4G		BIT(2)
 #define PIN_NONFAULT		BIT(3)
+#define PIN_NOEVICT		BIT(4)
 
 #define PIN_MBZ			BIT(5) /* I915_VMA_PIN_OVERFLOW */
 #define PIN_GLOBAL		BIT(6) /* I915_VMA_GLOBAL_BIND */
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 14/24] drm/i915: Wait upon userptr get-user-pages within execbuffer
  2017-05-18  9:46 [PATCH 01/24] drm/i915/selftests: Pretend to be a gfx pci device Chris Wilson
                   ` (11 preceding siblings ...)
  2017-05-18  9:46 ` [PATCH 13/24] drm/i915: First try the previous execbuffer location Chris Wilson
@ 2017-05-18  9:46 ` Chris Wilson
  2017-05-18  9:46 ` [PATCH 15/24] drm/i915: Allow execbuffer to use the first object as the batch Chris Wilson
                   ` (12 subsequent siblings)
  25 siblings, 0 replies; 47+ messages in thread
From: Chris Wilson @ 2017-05-18  9:46 UTC (permalink / raw)
  To: intel-gfx

On its own this merely hides the EAGAIN raised by userptr when userspace
causes resource contention. However, it is quite beneficial for highly
contended userptr users, as we avoid repeating the setup costs and the
kernel-user context switches.
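
For reference, the retry loop this absorbs on behalf of userspace looks
roughly like the sketch below (fd and execbuf are assumed to be set up
elsewhere, with the usual errno.h/ioctl.h/i915_drm.h includes):

	/* A userptr object whose pages are still being acquired can bounce
	 * the ioctl back with EAGAIN, forcing callers to spin like this;
	 * flushing the workqueue lets the kernel wait once instead. */
	do {
		ret = ioctl(fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, &execbuf);
	} while (ret == -1 && (errno == EAGAIN || errno == EINTR));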

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Michał Winiarski <michal.winiarski@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c            |  1 +
 drivers/gpu/drm/i915/i915_drv.h            | 10 +++++++++-
 drivers/gpu/drm/i915/i915_gem.c            |  4 +++-
 drivers/gpu/drm/i915/i915_gem_execbuffer.c |  3 +++
 drivers/gpu/drm/i915/i915_gem_userptr.c    | 18 +++++++++++++++---
 5 files changed, 31 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 72fb47a439d2..dc8f3144ca99 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -553,6 +553,7 @@ static void i915_gem_fini(struct drm_i915_private *dev_priv)
 	intel_uc_fini_hw(dev_priv);
 	i915_gem_cleanup_engines(dev_priv);
 	i915_gem_context_fini(dev_priv);
+	i915_gem_cleanup_userptr(dev_priv);
 	mutex_unlock(&dev_priv->drm.struct_mutex);
 
 	i915_gem_drain_freed_objects(dev_priv);
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 8ba120acf080..261de5f38956 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1452,6 +1452,13 @@ struct i915_gem_mm {
 	/** LRU list of objects with fence regs on them. */
 	struct list_head fence_list;
 
+	/**
+	 * Workqueue to fault in userptr pages, flushed by the execbuf
+	 * when required but otherwise left to userspace to try again
+	 * on EAGAIN.
+	 */
+	struct workqueue_struct *userptr_wq;
+
 	u64 unordered_timeline;
 
 	/* the indicator for dispatch video commands on two BSD rings */
@@ -3180,7 +3187,8 @@ int i915_gem_set_tiling_ioctl(struct drm_device *dev, void *data,
 			      struct drm_file *file_priv);
 int i915_gem_get_tiling_ioctl(struct drm_device *dev, void *data,
 			      struct drm_file *file_priv);
-void i915_gem_init_userptr(struct drm_i915_private *dev_priv);
+int i915_gem_init_userptr(struct drm_i915_private *dev_priv);
+void i915_gem_cleanup_userptr(struct drm_i915_private *dev_priv);
 int i915_gem_userptr_ioctl(struct drm_device *dev, void *data,
 			   struct drm_file *file);
 int i915_gem_get_aperture_ioctl(struct drm_device *dev, void *data,
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 4d01089a38f9..827dfe7b6937 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -4782,7 +4782,9 @@ int i915_gem_init(struct drm_i915_private *dev_priv)
 	 */
 	intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
 
-	i915_gem_init_userptr(dev_priv);
+	ret = i915_gem_init_userptr(dev_priv);
+	if (ret)
+		goto out_unlock;
 
 	ret = i915_gem_init_ggtt(dev_priv);
 	if (ret)
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index 171a945046c3..07ea00ea0b55 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -1529,6 +1529,9 @@ static int eb_relocate_slow(struct i915_execbuffer *eb)
 		goto out;
 	}
 
+	/* A frequent cause of EAGAIN is currently unavailable client pages */
+	flush_workqueue(eb->i915->mm.userptr_wq);
+
 	err = i915_mutex_lock_interruptible(dev);
 	if (err) {
 		mutex_lock(&dev->struct_mutex);
diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
index 4ec9a04aa165..d89163b5d021 100644
--- a/drivers/gpu/drm/i915/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
@@ -378,7 +378,7 @@ __i915_mm_struct_free(struct kref *kref)
 	mutex_unlock(&mm->i915->mm_lock);
 
 	INIT_WORK(&mm->work, __i915_mm_struct_free__worker);
-	schedule_work(&mm->work);
+	queue_work(mm->i915->mm.userptr_wq, &mm->work);
 }
 
 static void
@@ -598,7 +598,7 @@ __i915_gem_userptr_get_pages_schedule(struct drm_i915_gem_object *obj)
 	get_task_struct(work->task);
 
 	INIT_WORK(&work->work, __i915_gem_userptr_get_pages_worker);
-	schedule_work(&work->work);
+	queue_work(to_i915(obj->base.dev)->mm.userptr_wq, &work->work);
 
 	return ERR_PTR(-EAGAIN);
 }
@@ -830,8 +830,20 @@ i915_gem_userptr_ioctl(struct drm_device *dev, void *data, struct drm_file *file
 	return 0;
 }
 
-void i915_gem_init_userptr(struct drm_i915_private *dev_priv)
+int i915_gem_init_userptr(struct drm_i915_private *dev_priv)
 {
 	mutex_init(&dev_priv->mm_lock);
 	hash_init(dev_priv->mm_structs);
+
+	dev_priv->mm.userptr_wq =
+		alloc_workqueue("i915-userptr-acquire", WQ_HIGHPRI, 0);
+	if (!dev_priv->mm.userptr_wq)
+		return -ENOMEM;
+
+	return 0;
+}
+
+void i915_gem_cleanup_userptr(struct drm_i915_private *dev_priv)
+{
+	destroy_workqueue(dev_priv->mm.userptr_wq);
 }
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 15/24] drm/i915: Allow execbuffer to use the first object as the batch
  2017-05-18  9:46 [PATCH 01/24] drm/i915/selftests: Pretend to be a gfx pci device Chris Wilson
                   ` (12 preceding siblings ...)
  2017-05-18  9:46 ` [PATCH 14/24] drm/i915: Wait upon userptr get-user-pages within execbuffer Chris Wilson
@ 2017-05-18  9:46 ` Chris Wilson
  2017-05-18  9:46 ` [PATCH 16/24] drm/i915: Async GPU relocation processing Chris Wilson
                   ` (11 subsequent siblings)
  25 siblings, 0 replies; 47+ messages in thread
From: Chris Wilson @ 2017-05-18  9:46 UTC (permalink / raw)
  To: intel-gfx

Currently, the last object in the execlist is always the batch.
However, when building the batch buffer we often know the batch object
first, and if we can use the first slot in the execlist we can emit
relocation instructions relative to it immediately, avoiding a separate
pass to adjust the relocations to point to the last execlist slot.
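
A hedged usage sketch (has_batch_first() and submit() are illustrative
helpers; the uapi additions they use are the ones introduced below):

    #include <stdbool.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <drm/i915_drm.h>

    static bool has_batch_first(int fd)
    {
        struct drm_i915_getparam gp;
        int value = 0;

        memset(&gp, 0, sizeof(gp));
        gp.param = I915_PARAM_HAS_EXEC_BATCH_FIRST;
        gp.value = &value;

        return ioctl(fd, DRM_IOCTL_I915_GETPARAM, &gp) == 0 && value;
    }

    static void submit(int fd, struct drm_i915_gem_execbuffer2 *execbuf)
    {
        /* With the flag set, exec_objects[0] is the batch, so relocations
         * can be emitted against a fixed index while the batch is built,
         * rather than patched to point at the last slot afterwards.
         */
        if (has_batch_first(fd))
            execbuf->flags |= I915_EXEC_BATCH_FIRST;

        ioctl(fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, execbuf);
    }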

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c            |  1 +
 drivers/gpu/drm/i915/i915_gem_execbuffer.c |  5 ++++-
 include/uapi/drm/i915_drm.h                | 18 +++++++++++++++++-
 3 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index dc8f3144ca99..178550a93884 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -351,6 +351,7 @@ static int i915_getparam(struct drm_device *dev, void *data,
 	case I915_PARAM_HAS_EXEC_ASYNC:
 	case I915_PARAM_HAS_EXEC_FENCE:
 	case I915_PARAM_HAS_EXEC_CAPTURE:
+	case I915_PARAM_HAS_EXEC_BATCH_FIRST:
 		/* For the time being all of these are always true;
 		 * if some supported hardware does not have one of these
 		 * features this value needs to be provided from
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index 07ea00ea0b55..bc2bb8b19bbf 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -650,7 +650,10 @@ ht_needs_resize(const struct i915_gem_context_vma_lut *lut)
 
 static unsigned int eb_batch_index(const struct i915_execbuffer *eb)
 {
-	return eb->buffer_count - 1;
+	if (eb->args->flags & I915_EXEC_BATCH_FIRST)
+		return 0;
+	else
+		return eb->buffer_count - 1;
 }
 
 static int eb_select_context(struct i915_execbuffer *eb)
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index f24a80d2d42e..4adf3e5f5c71 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -418,6 +418,12 @@ typedef struct drm_i915_irq_wait {
  */
 #define I915_PARAM_HAS_EXEC_CAPTURE	 45
 
+/*
+ * Query whether DRM_I915_GEM_EXECBUFFER2 supports supplying the batch buffer
+ * as the first execobject as opposed to the last. See I915_EXEC_BATCH_FIRST.
+ */
+#define I915_PARAM_HAS_EXEC_BATCH_FIRST	 46
+
 typedef struct drm_i915_getparam {
 	__s32 param;
 	/*
@@ -904,7 +910,17 @@ struct drm_i915_gem_execbuffer2 {
  */
 #define I915_EXEC_FENCE_OUT		(1<<17)
 
-#define __I915_EXEC_UNKNOWN_FLAGS (-(I915_EXEC_FENCE_OUT<<1))
+/*
+ * Traditionally the execbuf ioctl has only considered the final element in
+ * the execobject[] to be the executable batch. Often though, the client
+ * will know the batch object prior to construction and being able to place
+ * it into the execobject[] array first can simplify the relocation tracking.
+ * Setting I915_EXEC_BATCH_FIRST tells execbuf to use element 0 of the
+ * execobject[] as the batch instead (the default is to use the last
+ * element).
+ */
+#define I915_EXEC_BATCH_FIRST		(1<<18)
+#define __I915_EXEC_UNKNOWN_FLAGS (-(I915_EXEC_BATCH_FIRST<<1))
 
 #define I915_EXEC_CONTEXT_ID_MASK	(0xffffffff)
 #define i915_execbuffer2_set_context_id(eb2, context) \
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 16/24] drm/i915: Async GPU relocation processing
  2017-05-18  9:46 [PATCH 01/24] drm/i915/selftests: Pretend to be a gfx pci device Chris Wilson
                   ` (13 preceding siblings ...)
  2017-05-18  9:46 ` [PATCH 15/24] drm/i915: Allow execbuffer to use the first object as the batch Chris Wilson
@ 2017-05-18  9:46 ` Chris Wilson
  2017-05-18  9:46 ` [PATCH 17/24] drm/i915: Stash a pointer to the obj's resv in the vma Chris Wilson
                   ` (10 subsequent siblings)
  25 siblings, 0 replies; 47+ messages in thread
From: Chris Wilson @ 2017-05-18  9:46 UTC (permalink / raw)
  To: intel-gfx

If the user requires patching of their batch or auxiliary buffers, we
currently make the alterations on the CPU. If they are active on the GPU
at the time, we wait under the struct_mutex for them to finish executing
before we rewrite the contents. This happens if shared relocation trees
are used between different contexts with separate address spaces (the
buffers then have different addresses in each), as the 3D state must be
adjusted between execution on each context. However, we don't need to
use the CPU to do the relocation patching: we can queue commands to the
GPU to perform it and use fences to serialise the operation with the
current and future activity - so the operation on the GPU appears just
as atomic as performing it immediately. Performing the relocation
rewrites on the GPU is not free; in terms of pure throughput, the number
of relocations/s is about halved - but more importantly, so is the time
under the struct_mutex.
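
The heart of the new path is the decision in relocate_entry(); condensed
into a single predicate, it reads roughly as below (a simplification of
the diff, not a literal extract):

    /* Use the GPU to rewrite a relocation when we have no CPU/GTT mapping
     * of the page yet and the target object is still busy: queueing an
     * MI_STORE_DWORD_IMM behind the existing fences avoids stalling under
     * struct_mutex for the object to become idle.
     */
    static bool want_gpu_reloc(const struct i915_execbuffer *eb,
                               const struct i915_vma *vma)
    {
        if (eb->reloc_cache.vaddr) /* already mapped for CPU/GTT relocs */
            return false;

        if (DBG_FORCE_RELOC == FORCE_GPU_RELOC)
            return true;

        return !reservation_object_test_signaled_rcu(vma->obj->resv, true);
    }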

v2: Break out the request/batch allocation for clearer error flow.
v3: A few asserts to ensure rq ordering is maintained

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_gem.c            |   1 -
 drivers/gpu/drm/i915/i915_gem_execbuffer.c | 227 ++++++++++++++++++++++++++++-
 2 files changed, 220 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 827dfe7b6937..5dcd51655a81 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -4373,7 +4373,6 @@ static void __i915_gem_free_objects(struct drm_i915_private *i915,
 		GEM_BUG_ON(i915_gem_object_is_active(obj));
 		list_for_each_entry_safe(vma, vn,
 					 &obj->vma_list, obj_link) {
-			GEM_BUG_ON(!i915_vma_is_ggtt(vma));
 			GEM_BUG_ON(i915_vma_is_active(vma));
 			vma->flags &= ~I915_VMA_PIN_MASK;
 			i915_vma_close(vma);
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index bc2bb8b19bbf..9715099664ac 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -40,7 +40,12 @@
 #include "intel_drv.h"
 #include "intel_frontbuffer.h"
 
-#define DBG_USE_CPU_RELOC 0 /* -1 force GTT relocs; 1 force CPU relocs */
+enum {
+	FORCE_CPU_RELOC = 1,
+	FORCE_GTT_RELOC,
+	FORCE_GPU_RELOC,
+#define DBG_FORCE_RELOC 0 /* choose one of the above! */
+};
 
 #define  __EXEC_OBJECT_HAS_REF		BIT(31)
 #define  __EXEC_OBJECT_HAS_PIN		BIT(30)
@@ -212,10 +217,15 @@ struct i915_execbuffer {
 		struct drm_mm_node node; /** temporary GTT binding */
 		unsigned long vaddr; /** Current kmap address */
 		unsigned long page; /** Currently mapped page index */
+		unsigned int gen; /** Cached value of INTEL_GEN */
 		bool use_64bit_reloc : 1;
 		bool has_llc : 1;
 		bool has_fence : 1;
 		bool needs_unfenced : 1;
+
+		struct drm_i915_gem_request *rq;
+		u32 *rq_cmd;
+		unsigned int rq_size;
 	} reloc_cache;
 
 	u64 invalid_flags; /** Set of execobj.flags that are invalid */
@@ -498,8 +508,11 @@ static inline int use_cpu_reloc(const struct reloc_cache *cache,
 	if (!i915_gem_object_has_struct_page(obj))
 		return false;
 
-	if (DBG_USE_CPU_RELOC)
-		return DBG_USE_CPU_RELOC > 0;
+	if (DBG_FORCE_RELOC == FORCE_CPU_RELOC)
+		return true;
+
+	if (DBG_FORCE_RELOC == FORCE_GTT_RELOC)
+		return false;
 
 	return (cache->has_llc ||
 		obj->cache_dirty ||
@@ -902,6 +915,8 @@ static void eb_release_vmas(const struct i915_execbuffer *eb)
 
 static void eb_destroy(const struct i915_execbuffer *eb)
 {
+	GEM_BUG_ON(eb->reloc_cache.rq);
+
 	if (eb->lut_size >= 0)
 		kfree(eb->buckets);
 }
@@ -919,11 +934,14 @@ static void reloc_cache_init(struct reloc_cache *cache,
 	cache->page = -1;
 	cache->vaddr = 0;
 	/* Must be a variable in the struct to allow GCC to unroll. */
+	cache->gen = INTEL_GEN(i915);
 	cache->has_llc = HAS_LLC(i915);
-	cache->has_fence = INTEL_GEN(i915) < 4;
-	cache->needs_unfenced = INTEL_INFO(i915)->unfenced_needs_alignment;
 	cache->use_64bit_reloc = HAS_64BIT_RELOC(i915);
+	cache->has_fence = cache->gen < 4;
+	cache->needs_unfenced = INTEL_INFO(i915)->unfenced_needs_alignment;
 	cache->node.allocated = false;
+	cache->rq = NULL;
+	cache->rq_size = 0;
 }
 
 static inline void *unmask_page(unsigned long p)
@@ -945,10 +963,24 @@ static inline struct i915_ggtt *cache_to_ggtt(struct reloc_cache *cache)
 	return &i915->ggtt;
 }
 
+static void reloc_gpu_flush(struct reloc_cache *cache)
+{
+	GEM_BUG_ON(cache->rq_size >= cache->rq->batch->obj->base.size / sizeof(u32));
+	cache->rq_cmd[cache->rq_size] = MI_BATCH_BUFFER_END;
+	i915_gem_object_unpin_map(cache->rq->batch->obj);
+	i915_gem_chipset_flush(cache->rq->i915);
+
+	__i915_add_request(cache->rq, true);
+	cache->rq = NULL;
+}
+
 static void reloc_cache_reset(struct reloc_cache *cache)
 {
 	void *vaddr;
 
+	if (cache->rq)
+		reloc_gpu_flush(cache);
+
 	if (!cache->vaddr)
 		return;
 
@@ -1114,6 +1146,121 @@ static void clflush_write32(u32 *addr, u32 value, unsigned int flushes)
 		*addr = value;
 }
 
+static int __reloc_gpu_alloc(struct i915_execbuffer *eb,
+			     struct i915_vma *vma,
+			     unsigned int len)
+{
+	struct reloc_cache *cache = &eb->reloc_cache;
+	struct drm_i915_gem_object *obj;
+	struct drm_i915_gem_request *rq;
+	struct i915_vma *batch;
+	u32 *cmd;
+	int err;
+
+	GEM_BUG_ON(vma->obj->base.write_domain & I915_GEM_DOMAIN_CPU);
+
+	obj = i915_gem_batch_pool_get(&eb->engine->batch_pool, PAGE_SIZE);
+	if (IS_ERR(obj))
+		return PTR_ERR(obj);
+
+	cmd = i915_gem_object_pin_map(obj,
+				      cache->has_llc ? I915_MAP_WB : I915_MAP_WC);
+	i915_gem_object_unpin_pages(obj);
+	if (IS_ERR(cmd))
+		return PTR_ERR(cmd);
+
+	err = i915_gem_object_set_to_wc_domain(obj, false);
+	if (err)
+		goto err_unmap;
+
+	batch = i915_vma_instance(obj, vma->vm, NULL);
+	if (IS_ERR(batch)) {
+		err = PTR_ERR(batch);
+		goto err_unmap;
+	}
+
+	err = i915_vma_pin(batch, 0, 0, PIN_USER | PIN_NONBLOCK);
+	if (err)
+		goto err_unmap;
+
+	rq = i915_gem_request_alloc(eb->engine, eb->ctx);
+	if (IS_ERR(rq)) {
+		err = PTR_ERR(rq);
+		goto err_unpin;
+	}
+
+	err = i915_gem_request_await_object(rq, vma->obj, true);
+	if (err)
+		goto err_request;
+
+	err = eb->engine->emit_flush(rq, EMIT_INVALIDATE);
+	if (err)
+		goto err_request;
+
+	err = i915_switch_context(rq);
+	if (err)
+		goto err_request;
+
+	err = eb->engine->emit_bb_start(rq,
+					batch->node.start, PAGE_SIZE,
+					cache->gen > 5 ? 0 : I915_DISPATCH_SECURE);
+	if (err)
+		goto err_request;
+
+	GEM_BUG_ON(!reservation_object_test_signaled_rcu(obj->resv, true));
+	i915_vma_move_to_active(batch, rq, 0);
+	reservation_object_lock(obj->resv, NULL);
+	reservation_object_add_excl_fence(obj->resv, &rq->fence);
+	reservation_object_unlock(obj->resv);
+	i915_vma_unpin(batch);
+
+	i915_vma_move_to_active(vma, rq, true);
+	reservation_object_lock(vma->obj->resv, NULL);
+	reservation_object_add_excl_fence(vma->obj->resv, &rq->fence);
+	reservation_object_unlock(vma->obj->resv);
+
+	rq->batch = batch;
+
+	cache->rq = rq;
+	cache->rq_cmd = cmd;
+	cache->rq_size = 0;
+
+	/* Return with batch mapping (cmd) still pinned */
+	return 0;
+
+err_request:
+	i915_add_request(rq);
+err_unpin:
+	i915_vma_unpin(batch);
+err_unmap:
+	i915_gem_object_unpin_map(obj);
+	return err;
+}
+
+static u32 *reloc_gpu(struct i915_execbuffer *eb,
+		      struct i915_vma *vma,
+		      unsigned int len)
+{
+	struct reloc_cache *cache = &eb->reloc_cache;
+	u32 *cmd;
+
+	if (cache->rq_size > PAGE_SIZE/sizeof(u32) - (len + 1))
+		reloc_gpu_flush(cache);
+
+	if (unlikely(!cache->rq)) {
+		int err;
+
+		err = __reloc_gpu_alloc(eb, vma, len);
+		if (unlikely(err))
+			return ERR_PTR(err);
+	}
+
+	cmd = cache->rq_cmd + cache->rq_size;
+	cache->rq_size += len;
+
+	return cmd;
+}
+
 static u64
 relocate_entry(struct i915_vma *vma,
 	       const struct drm_i915_gem_relocation_entry *reloc,
@@ -1126,6 +1273,67 @@ relocate_entry(struct i915_vma *vma,
 	bool wide = eb->reloc_cache.use_64bit_reloc;
 	void *vaddr;
 
+	if (!eb->reloc_cache.vaddr &&
+	    (DBG_FORCE_RELOC == FORCE_GPU_RELOC ||
+	     !reservation_object_test_signaled_rcu(obj->resv, true))) {
+		const unsigned int gen = eb->reloc_cache.gen;
+		unsigned int len;
+		u32 *batch;
+		u64 addr;
+
+		if (wide)
+			len = offset & 7 ? 8 : 5;
+		else if (gen >= 4)
+			len = 4;
+		else if (gen >= 3)
+			len = 3;
+		else /* On gen2 MI_STORE_DWORD_IMM uses a physical address */
+			goto repeat;
+
+		batch = reloc_gpu(eb, vma, len);
+		if (IS_ERR(batch))
+			goto repeat;
+
+		addr = gen8_canonical_addr(vma->node.start + offset);
+		if (wide) {
+			if (offset & 7) {
+				*batch++ = MI_STORE_DWORD_IMM_GEN4;
+				*batch++ = lower_32_bits(addr);
+				*batch++ = upper_32_bits(addr);
+				*batch++ = lower_32_bits(target_offset);
+
+				addr = gen8_canonical_addr(addr + 4);
+
+				*batch++ = MI_STORE_DWORD_IMM_GEN4;
+				*batch++ = lower_32_bits(addr);
+				*batch++ = upper_32_bits(addr);
+				*batch++ = upper_32_bits(target_offset);
+			} else {
+				*batch++ = (MI_STORE_DWORD_IMM_GEN4 | (1 << 21)) + 1;
+				*batch++ = lower_32_bits(addr);
+				*batch++ = upper_32_bits(addr);
+				*batch++ = lower_32_bits(target_offset);
+				*batch++ = upper_32_bits(target_offset);
+			}
+		} else if (gen >= 6) {
+			*batch++ = MI_STORE_DWORD_IMM_GEN4;
+			*batch++ = 0;
+			*batch++ = addr;
+			*batch++ = target_offset;
+		} else if (gen >= 4) {
+			*batch++ = MI_STORE_DWORD_IMM_GEN4 | MI_USE_GGTT;
+			*batch++ = 0;
+			*batch++ = addr;
+			*batch++ = target_offset;
+		} else {
+			*batch++ = MI_STORE_DWORD_IMM | MI_MEM_VIRTUAL;
+			*batch++ = addr;
+			*batch++ = target_offset;
+		}
+
+		goto out;
+	}
+
 repeat:
 	vaddr = reloc_vaddr(obj, &eb->reloc_cache, offset >> PAGE_SHIFT);
 	if (IS_ERR(vaddr))
@@ -1142,6 +1350,7 @@ relocate_entry(struct i915_vma *vma,
 		goto repeat;
 	}
 
+out:
 	return target->node.start | UPDATE;
 }
 
@@ -1204,7 +1413,8 @@ eb_relocate_entry(struct i915_execbuffer *eb,
 	 * If the relocation already has the right value in it, no
 	 * more work needs to be done.
 	 */
-	if (gen8_canonical_addr(target->node.start) == reloc->presumed_offset)
+	if (!DBG_FORCE_RELOC &&
+	    gen8_canonical_addr(target->node.start) == reloc->presumed_offset)
 		return 0;
 
 	/* Check that the relocation address is valid... */
@@ -1928,7 +2138,7 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 	eb.i915 = to_i915(dev);
 	eb.file = file;
 	eb.args = args;
-	if (!(args->flags & I915_EXEC_NO_RELOC))
+	if (DBG_FORCE_RELOC || !(args->flags & I915_EXEC_NO_RELOC))
 		args->flags |= __EXEC_HAS_RELOC;
 	eb.exec = exec;
 	eb.ctx = NULL;
@@ -2085,6 +2295,9 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 		eb.batch = vma;
 	}
 
+	/* All GPU relocation batches must be submitted prior to the user rq */
+	GEM_BUG_ON(eb.reloc_cache.rq);
+
 	/* Allocate a request for this batch buffer nice and early. */
 	eb.request = i915_gem_request_alloc(eb.engine, eb.ctx);
 	if (IS_ERR(eb.request)) {
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 17/24] drm/i915: Stash a pointer to the obj's resv in the vma
  2017-05-18  9:46 [PATCH 01/24] drm/i915/selftests: Pretend to be a gfx pci device Chris Wilson
                   ` (14 preceding siblings ...)
  2017-05-18  9:46 ` [PATCH 16/24] drm/i915: Async GPU relocation processing Chris Wilson
@ 2017-05-18  9:46 ` Chris Wilson
  2017-05-19  9:19   ` Joonas Lahtinen
  2017-05-18  9:46 ` [PATCH 18/24] drm/i915: Convert execbuf to use struct-of-array packing for critical fields Chris Wilson
                   ` (9 subsequent siblings)
  25 siblings, 1 reply; 47+ messages in thread
From: Chris Wilson @ 2017-05-18  9:46 UTC (permalink / raw)
  To: intel-gfx

During execbuf, a mandatory step is that we add this request (this
fence) to each object's reservation_object. Inside execbuf, we track the
vma, so adding the fence to the reservation_object means first chasing
the obj pointer, incurring another cache miss. We can reduce the number
of cache misses by stashing a pointer to the reservation_object in the
vma itself.
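
A minimal before/after sketch of the pointer chase (fence_target_*()
are illustrative names; the fields are those in i915_vma.h):

    /* Before: two dependent loads, vma -> obj -> resv, each a potential
     * cache miss on the hot execbuf path.
     */
    static struct reservation_object *fence_target_old(struct i915_vma *vma)
    {
        return vma->obj->resv;
    }

    /* After: a single load; vma->resv is stashed at vma_create() and the
     * object's reservation_object never changes over the vma's lifetime.
     */
    static struct reservation_object *fence_target_new(struct i915_vma *vma)
    {
        return vma->resv;
    }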

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/i915_gem_execbuffer.c | 25 ++++++++++++-------------
 drivers/gpu/drm/i915/i915_vma.c            |  1 +
 drivers/gpu/drm/i915/i915_vma.h            |  3 ++-
 3 files changed, 15 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index 9715099664ac..f5a0e419bfec 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -1207,17 +1207,17 @@ static int __reloc_gpu_alloc(struct i915_execbuffer *eb,
 	if (err)
 		goto err_request;
 
-	GEM_BUG_ON(!reservation_object_test_signaled_rcu(obj->resv, true));
+	GEM_BUG_ON(!reservation_object_test_signaled_rcu(batch->resv, true));
 	i915_vma_move_to_active(batch, rq, 0);
-	reservation_object_lock(obj->resv, NULL);
-	reservation_object_add_excl_fence(obj->resv, &rq->fence);
-	reservation_object_unlock(obj->resv);
+	reservation_object_lock(batch->resv, NULL);
+	reservation_object_add_excl_fence(batch->resv, &rq->fence);
+	reservation_object_unlock(batch->resv);
 	i915_vma_unpin(batch);
 
 	i915_vma_move_to_active(vma, rq, true);
-	reservation_object_lock(vma->obj->resv, NULL);
-	reservation_object_add_excl_fence(vma->obj->resv, &rq->fence);
-	reservation_object_unlock(vma->obj->resv);
+	reservation_object_lock(vma->resv, NULL);
+	reservation_object_add_excl_fence(vma->resv, &rq->fence);
+	reservation_object_unlock(vma->resv);
 
 	rq->batch = batch;
 
@@ -1267,7 +1267,6 @@ relocate_entry(struct i915_vma *vma,
 	       struct i915_execbuffer *eb,
 	       const struct i915_vma *target)
 {
-	struct drm_i915_gem_object *obj = vma->obj;
 	u64 offset = reloc->offset;
 	u64 target_offset = relocation_target(reloc, target);
 	bool wide = eb->reloc_cache.use_64bit_reloc;
@@ -1275,7 +1274,7 @@ relocate_entry(struct i915_vma *vma,
 
 	if (!eb->reloc_cache.vaddr &&
 	    (DBG_FORCE_RELOC == FORCE_GPU_RELOC ||
-	     !reservation_object_test_signaled_rcu(obj->resv, true))) {
+	     !reservation_object_test_signaled_rcu(vma->resv, true))) {
 		const unsigned int gen = eb->reloc_cache.gen;
 		unsigned int len;
 		u32 *batch;
@@ -1335,7 +1334,7 @@ relocate_entry(struct i915_vma *vma,
 	}
 
 repeat:
-	vaddr = reloc_vaddr(obj, &eb->reloc_cache, offset >> PAGE_SHIFT);
+	vaddr = reloc_vaddr(vma->obj, &eb->reloc_cache, offset >> PAGE_SHIFT);
 	if (IS_ERR(vaddr))
 		return PTR_ERR(vaddr);
 
@@ -1802,11 +1801,11 @@ static int eb_relocate_slow(struct i915_execbuffer *eb)
 	return err ?: have_copy;
 }
 
-static void eb_export_fence(struct drm_i915_gem_object *obj,
+static void eb_export_fence(struct i915_vma *vma,
 			    struct drm_i915_gem_request *req,
 			    unsigned int flags)
 {
-	struct reservation_object *resv = obj->resv;
+	struct reservation_object *resv = vma->resv;
 
 	/*
 	 * Ignore errors from failing to allocate the new fence, we can't
@@ -1866,7 +1865,7 @@ eb_move_to_gpu(struct i915_execbuffer *eb)
 		const struct drm_i915_gem_exec_object2 *entry = &eb->exec[i];
 		struct i915_vma *vma = exec_to_vma(entry);
 
-		eb_export_fence(vma->obj, eb->request, entry->flags);
+		eb_export_fence(vma, eb->request, entry->flags);
 		if (unlikely(entry->flags & __EXEC_OBJECT_HAS_REF))
 			i915_vma_put(vma);
 	}
diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c
index 09b00e9de2f0..fc1a8412b0a0 100644
--- a/drivers/gpu/drm/i915/i915_vma.c
+++ b/drivers/gpu/drm/i915/i915_vma.c
@@ -90,6 +90,7 @@ vma_create(struct drm_i915_gem_object *obj,
 	init_request_active(&vma->last_fence, NULL);
 	vma->vm = vm;
 	vma->obj = obj;
+	vma->resv = obj->resv;
 	vma->size = obj->base.size;
 	vma->display_alignment = I915_GTT_MIN_ALIGNMENT;
 
diff --git a/drivers/gpu/drm/i915/i915_vma.h b/drivers/gpu/drm/i915/i915_vma.h
index 062addfee6ef..3840b8cdc6b1 100644
--- a/drivers/gpu/drm/i915/i915_vma.h
+++ b/drivers/gpu/drm/i915/i915_vma.h
@@ -50,6 +50,7 @@ struct i915_vma {
 	struct drm_i915_gem_object *obj;
 	struct i915_address_space *vm;
 	struct drm_i915_fence_reg *fence;
+	struct reservation_object *resv;
 	struct sg_table *pages;
 	void __iomem *iomap;
 	u64 size;
@@ -111,8 +112,8 @@ struct i915_vma {
 	/**
 	 * Used for performing relocations during execbuffer insertion.
 	 */
-	struct hlist_node exec_node;
 	struct drm_i915_gem_exec_object2 *exec_entry;
+	struct hlist_node exec_node;
 	u32 exec_handle;
 
 	struct i915_gem_context *ctx;
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 18/24] drm/i915: Convert execbuf to use struct-of-array packing for critical fields
  2017-05-18  9:46 [PATCH 01/24] drm/i915/selftests: Pretend to be a gfx pci device Chris Wilson
                   ` (15 preceding siblings ...)
  2017-05-18  9:46 ` [PATCH 17/24] drm/i915: Stash a pointer to the obj's resv in the vma Chris Wilson
@ 2017-05-18  9:46 ` Chris Wilson
  2017-05-19  9:25   ` Chris Wilson
  2017-05-18  9:46 ` [PATCH 19/24] drm/i915/scheduler: Support user-defined priorities Chris Wilson
                   ` (8 subsequent siblings)
  25 siblings, 1 reply; 47+ messages in thread
From: Chris Wilson @ 2017-05-18  9:46 UTC (permalink / raw)
  To: intel-gfx

When userspace is doing most of the work, avoiding relocs (using
NO_RELOC) and opting out of implicit synchronisation (using ASYNC), we
still spend a lot of time processing the arrays in execbuf, even though
we should now have nothing to do most of the time. One issue that
becomes readily apparent when profiling anv is that iterating over the
large execobj[] is unfriendly to the CPU's loop prefetchers, which much
prefer iterating over a pair of small arrays rather than one big array.
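
To make the layout change concrete, a hedged sketch of the two hot-loop
shapes (count_async_*() are illustrative helpers; 56 bytes is
sizeof(struct drm_i915_gem_exec_object2) per the uapi):

    /* Array-of-structures: each test drags in a whole 56-byte execobject
     * just to read a few flag bits, so nearly every entry costs a fresh
     * cache line.
     */
    static unsigned int count_async_aos(const struct drm_i915_gem_exec_object2 *exec,
                                        unsigned int count)
    {
        unsigned int i, n = 0;

        for (i = 0; i < count; i++)
            if (exec[i].flags & EXEC_OBJECT_ASYNC)
                n++;

        return n;
    }

    /* Structure-of-arrays: the hot flags live in their own dense array,
     * ~4 bytes per entry, which the stride prefetcher streams through
     * easily.
     */
    static unsigned int count_async_soa(const unsigned int *flags,
                                        unsigned int count)
    {
        unsigned int i, n = 0;

        for (i = 0; i < count; i++)
            if (flags[i] & EXEC_OBJECT_ASYNC)
                n++;

        return n;
    }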

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/i915_gem_evict.c      |   4 +-
 drivers/gpu/drm/i915/i915_gem_execbuffer.c | 224 +++++++++++++++--------------
 drivers/gpu/drm/i915/i915_vma.h            |   2 +-
 3 files changed, 118 insertions(+), 112 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_evict.c b/drivers/gpu/drm/i915/i915_gem_evict.c
index a193f1b36c67..4df039ef2ce3 100644
--- a/drivers/gpu/drm/i915/i915_gem_evict.c
+++ b/drivers/gpu/drm/i915/i915_gem_evict.c
@@ -318,8 +318,8 @@ int i915_gem_evict_for_node(struct i915_address_space *vm,
 		/* Overlap of objects in the same batch? */
 		if (i915_vma_is_pinned(vma)) {
 			ret = -ENOSPC;
-			if (vma->exec_entry &&
-			    vma->exec_entry->flags & EXEC_OBJECT_PINNED)
+			if (vma->exec_flags &&
+			    *vma->exec_flags & EXEC_OBJECT_PINNED)
 				ret = -EINVAL;
 			break;
 		}
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index f5a0e419bfec..d6eb38b12976 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -47,13 +47,13 @@ enum {
 #define DBG_FORCE_RELOC 0 /* choose one of the above! */
 };
 
-#define  __EXEC_OBJECT_HAS_REF		BIT(31)
-#define  __EXEC_OBJECT_HAS_PIN		BIT(30)
-#define  __EXEC_OBJECT_HAS_FENCE	BIT(29)
+#define  __LUT_HAS_REF		BIT(31)
+#define  __LUT_HAS_PIN		BIT(30)
+#define  __LUT_HAS_FENCE	BIT(29)
 #define  __EXEC_OBJECT_NEEDS_MAP	BIT(28)
 #define  __EXEC_OBJECT_NEEDS_BIAS	BIT(27)
-#define  __EXEC_OBJECT_INTERNAL_FLAGS (~0u << 27) /* all of the above */
-#define __EB_RESERVED (__EXEC_OBJECT_HAS_PIN | __EXEC_OBJECT_HAS_FENCE)
+#define  __LUT_INTERNAL_FLAGS (~0u << 27) /* all of the above */
+#define __LUT_RESERVED (__LUT_HAS_PIN | __LUT_HAS_FENCE)
 
 #define __EXEC_HAS_RELOC	BIT(31)
 #define __EXEC_VALIDATED	BIT(30)
@@ -191,6 +191,8 @@ struct i915_execbuffer {
 	struct drm_file *file; /** per-file lookup tables and limits */
 	struct drm_i915_gem_execbuffer2 *args; /** ioctl parameters */
 	struct drm_i915_gem_exec_object2 *exec; /** ioctl execobj[] */
+	struct i915_vma **vma;
+	unsigned int *flags;
 
 	struct intel_engine_cs *engine; /** engine to queue the request to */
 	struct i915_gem_context *ctx; /** context for building the request */
@@ -245,14 +247,6 @@ struct i915_execbuffer {
 };
 
 /*
- * As an alternative to creating a hashtable of handle-to-vma for a batch,
- * we used the last available reserved field in the execobject[] and stash
- * a link from the execobj to its vma.
- */
-#define __exec_to_vma(ee) (ee)->rsvd2
-#define exec_to_vma(ee) u64_to_ptr(struct i915_vma, __exec_to_vma(ee))
-
-/*
  * Used to convert any address to canonical form.
  * Starting from gen8, some commands (e.g. STATE_BASE_ADDRESS,
  * MI_LOAD_REGISTER_MEM and others, see Broadwell PRM Vol2a) require the
@@ -316,7 +310,7 @@ static bool
 eb_vma_misplaced(const struct drm_i915_gem_exec_object2 *entry,
 		 const struct i915_vma *vma)
 {
-	if (!(entry->flags & __EXEC_OBJECT_HAS_PIN))
+	if (!(*vma->exec_flags & __LUT_HAS_PIN))
 		return true;
 
 	if (vma->node.size < entry->pad_to_size)
@@ -342,7 +336,7 @@ eb_vma_misplaced(const struct drm_i915_gem_exec_object2 *entry,
 
 static inline void
 eb_pin_vma(struct i915_execbuffer *eb,
-	   struct drm_i915_gem_exec_object2 *entry,
+	   const struct drm_i915_gem_exec_object2 *entry,
 	   struct i915_vma *vma)
 {
 	u64 flags;
@@ -366,31 +360,28 @@ eb_pin_vma(struct i915_execbuffer *eb,
 		}
 
 		if (i915_vma_pin_fence(vma))
-			entry->flags |= __EXEC_OBJECT_HAS_FENCE;
+			*vma->exec_flags |= __LUT_HAS_FENCE;
 	}
 
-	entry->flags |= __EXEC_OBJECT_HAS_PIN;
+	*vma->exec_flags |= __LUT_HAS_PIN;
 }
 
-static inline void
-__eb_unreserve_vma(struct i915_vma *vma,
-		   const struct drm_i915_gem_exec_object2 *entry)
+static inline void __eb_unreserve_vma(struct i915_vma *vma, unsigned int flags)
 {
-	GEM_BUG_ON(!(entry->flags & __EXEC_OBJECT_HAS_PIN));
+	GEM_BUG_ON(!(flags & __LUT_HAS_PIN));
 
-	if (unlikely(entry->flags & __EXEC_OBJECT_HAS_FENCE))
+	if (unlikely(flags & __LUT_HAS_FENCE))
 		i915_vma_unpin_fence(vma);
 
 	__i915_vma_unpin(vma);
 }
 
 static inline void
-eb_unreserve_vma(struct i915_vma *vma,
-		 struct drm_i915_gem_exec_object2 *entry)
+eb_unreserve_vma(struct i915_vma *vma, unsigned int *flags)
 {
-	if (entry->flags & __EXEC_OBJECT_HAS_PIN) {
-		__eb_unreserve_vma(vma, entry);
-		entry->flags &= ~__EB_RESERVED;
+	if (*flags & __LUT_HAS_PIN) {
+		__eb_unreserve_vma(vma, *flags);
+		*flags &= ~__LUT_RESERVED;
 	}
 }
 
@@ -430,7 +421,7 @@ eb_validate_vma(struct i915_execbuffer *eb,
 		entry->pad_to_size = 0;
 	}
 
-	if (unlikely(vma->exec_entry)) {
+	if (unlikely(vma->exec_flags)) {
 		DRM_DEBUG("Object [handle %d, index %d] appears more than once in object list\n",
 			  entry->handle, (int)(entry - eb->exec));
 		return -EINVAL;
@@ -440,10 +431,9 @@ eb_validate_vma(struct i915_execbuffer *eb,
 }
 
 static int
-eb_add_vma(struct i915_execbuffer *eb,
-	   struct drm_i915_gem_exec_object2 *entry,
-	   struct i915_vma *vma)
+eb_add_vma(struct i915_execbuffer *eb, unsigned int i, struct i915_vma *vma)
 {
+	struct drm_i915_gem_exec_object2 *entry = &eb->exec[i];
 	int err;
 
 	GEM_BUG_ON(i915_vma_is_closed(vma));
@@ -454,15 +444,6 @@ eb_add_vma(struct i915_execbuffer *eb,
 			return err;
 	}
 
-	/*
-	 * Stash a pointer from the vma to execobj, so we can query its flags,
-	 * size, alignment etc as provided by the user. Also we stash a pointer
-	 * to the vma inside the execobj so that we can use a direct lookup
-	 * to find the right target VMA when doing relocations.
-	 */
-	vma->exec_entry = entry;
-	__exec_to_vma(entry) = (uintptr_t)vma;
-
 	if (eb->lut_size >= 0) {
 		vma->exec_handle = entry->handle;
 		hlist_add_head(&vma->exec_node,
@@ -485,10 +466,20 @@ eb_add_vma(struct i915_execbuffer *eb,
 	if (!(entry->flags & EXEC_OBJECT_PINNED))
 		entry->flags |= eb->context_flags;
 
+	/*
+	 * Stash a pointer from the vma to execobj, so we can query its flags,
+	 * size, alignment etc as provided by the user. Also we stash a pointer
+	 * to the vma inside the execobj so that we can use a direct lookup
+	 * to find the right target VMA when doing relocations.
+	 */
+	eb->vma[i] = vma;
+	eb->flags[i] = entry->flags;
+	vma->exec_flags = &eb->flags[i];
+
 	err = 0;
 	eb_pin_vma(eb, entry, vma);
 	if (eb_vma_misplaced(entry, vma)) {
-		eb_unreserve_vma(vma, entry);
+		eb_unreserve_vma(vma, vma->exec_flags);
 
 		list_add_tail(&vma->exec_link, &eb->unbound);
 		if (drm_mm_node_allocated(&vma->node))
@@ -522,7 +513,8 @@ static inline int use_cpu_reloc(const struct reloc_cache *cache,
 static int
 eb_reserve_vma(struct i915_execbuffer *eb, struct i915_vma *vma)
 {
-	struct drm_i915_gem_exec_object2 *entry = vma->exec_entry;
+	struct drm_i915_gem_exec_object2 *entry =
+		&eb->exec[vma->exec_flags - eb->flags];
 	u64 flags;
 	int err;
 
@@ -559,7 +551,7 @@ eb_reserve_vma(struct i915_execbuffer *eb, struct i915_vma *vma)
 		eb->args->flags |= __EXEC_HAS_RELOC;
 	}
 
-	entry->flags |= __EXEC_OBJECT_HAS_PIN;
+	*vma->exec_flags |= __LUT_HAS_PIN;
 	GEM_BUG_ON(eb_vma_misplaced(entry, vma));
 
 	if (unlikely(entry->flags & EXEC_OBJECT_NEEDS_FENCE)) {
@@ -570,7 +562,7 @@ eb_reserve_vma(struct i915_execbuffer *eb, struct i915_vma *vma)
 		}
 
 		if (i915_vma_pin_fence(vma))
-			entry->flags |= __EXEC_OBJECT_HAS_FENCE;
+			*vma->exec_flags |= __LUT_HAS_FENCE;
 	}
 
 	return 0;
@@ -613,18 +605,17 @@ static int eb_reserve(struct i915_execbuffer *eb)
 		INIT_LIST_HEAD(&eb->unbound);
 		INIT_LIST_HEAD(&last);
 		for (i = 0; i < count; i++) {
-			struct drm_i915_gem_exec_object2 *entry = &eb->exec[i];
+			unsigned int flags = eb->flags[i];
+			struct i915_vma *vma = eb->vma[i];
 
-			if (entry->flags & EXEC_OBJECT_PINNED &&
-			    entry->flags & __EXEC_OBJECT_HAS_PIN)
+			if (flags & EXEC_OBJECT_PINNED && flags & __LUT_HAS_PIN)
 				continue;
 
-			vma = exec_to_vma(entry);
-			eb_unreserve_vma(vma, entry);
+			eb_unreserve_vma(vma, &eb->flags[i]);
 
-			if (entry->flags & EXEC_OBJECT_PINNED)
+			if (flags & EXEC_OBJECT_PINNED)
 				list_add(&vma->exec_link, &eb->unbound);
-			else if (entry->flags & __EXEC_OBJECT_NEEDS_MAP)
+			else if (flags & __EXEC_OBJECT_NEEDS_MAP)
 				list_add_tail(&vma->exec_link, &eb->unbound);
 			else
 				list_add_tail(&vma->exec_link, &last);
@@ -712,23 +703,23 @@ static int eb_lookup_vmas(struct i915_execbuffer *eb)
 	GEM_BUG_ON(lut->ht_size & I915_CTX_RESIZE_IN_PROGRESS);
 
 	for (i = 0; i < count; i++) {
-		__exec_to_vma(&eb->exec[i]) = 0;
-
 		hlist_for_each_entry(vma,
 				     ht_head(lut, eb->exec[i].handle),
 				     ctx_node) {
 			if (vma->ctx_handle != eb->exec[i].handle)
 				continue;
 
-			err = eb_add_vma(eb, &eb->exec[i], vma);
+			err = eb_add_vma(eb, i, vma);
 			if (unlikely(err))
 				return err;
-
 			goto next_vma;
 		}
 
 		if (slow_pass < 0)
 			slow_pass = i;
+
+		eb->vma[i] = NULL;
+
 next_vma: ;
 	}
 
@@ -744,7 +735,7 @@ next_vma: ;
 	for (i = slow_pass; i < count; i++) {
 		struct drm_i915_gem_object *obj;
 
-		if (__exec_to_vma(&eb->exec[i]))
+		if (eb->vma[i])
 			continue;
 
 		obj = to_intel_bo(idr_find(idr, eb->exec[i].handle));
@@ -756,14 +747,17 @@ next_vma: ;
 			goto err;
 		}
 
-		__exec_to_vma(&eb->exec[i]) = INTERMEDIATE | (uintptr_t)obj;
+		eb->vma[i] =
+			(struct i915_vma *)ptr_pack_bits(obj, INTERMEDIATE, 1);
 	}
 	spin_unlock(&eb->file->table_lock);
 
 	for (i = slow_pass; i < count; i++) {
 		struct drm_i915_gem_object *obj;
+		unsigned int is_obj;
 
-		if (!(__exec_to_vma(&eb->exec[i]) & INTERMEDIATE))
+		obj = (typeof(obj))ptr_unpack_bits(eb->vma[i], &is_obj, 1);
+		if (!is_obj)
 			continue;
 
 		/*
@@ -774,8 +768,6 @@ next_vma: ;
 		 * from the (obj, vm) we don't run the risk of creating
 		 * duplicated vmas for the same vm.
 		 */
-		obj = u64_to_ptr(typeof(*obj),
-				 __exec_to_vma(&eb->exec[i]) & ~INTERMEDIATE);
 		vma = i915_vma_instance(obj, eb->vm, NULL);
 		if (unlikely(IS_ERR(vma))) {
 			DRM_DEBUG("Failed to lookup VMA\n");
@@ -799,14 +791,14 @@ next_vma: ;
 			i915_vma_get(vma);
 		}
 
-		err = eb_add_vma(eb, &eb->exec[i], vma);
+		err = eb_add_vma(eb, i, vma);
 		if (unlikely(err))
 			goto err;
 
 		/* Only after we validated the user didn't use our bits */
 		if (vma->ctx != eb->ctx) {
 			i915_vma_get(vma);
-			eb->exec[i].flags |= __EXEC_OBJECT_HAS_REF;
+			eb->flags[i] |= __LUT_HAS_REF;
 		}
 	}
 
@@ -820,7 +812,8 @@ next_vma: ;
 out:
 	/* take note of the batch buffer before we might reorder the lists */
 	i = eb_batch_index(eb);
-	eb->batch = exec_to_vma(&eb->exec[i]);
+	eb->batch = eb->vma[i];
+	GEM_BUG_ON(eb->batch->exec_flags != &eb->flags[i]);
 
 	/*
 	 * SNA is doing fancy tricks with compressing batch buffers, which leads
@@ -841,8 +834,8 @@ next_vma: ;
 
 err:
 	for (i = slow_pass; i < count; i++) {
-		if (__exec_to_vma(&eb->exec[i]) & INTERMEDIATE)
-			__exec_to_vma(&eb->exec[i]) = 0;
+		if (ptr_unmask_bits(eb->vma[i], 1))
+			eb->vma[i] = NULL;
 	}
 	lut->ht_size &= ~I915_CTX_RESIZE_IN_PROGRESS;
 	return err;
@@ -855,7 +848,7 @@ eb_get_vma(const struct i915_execbuffer *eb, unsigned long handle)
 	if (eb->lut_size < 0) {
 		if (handle >= -eb->lut_size)
 			return NULL;
-		return exec_to_vma(&eb->exec[handle]);
+		return eb->vma[handle];
 	} else {
 		struct hlist_head *head;
 		struct i915_vma *vma;
@@ -875,13 +868,16 @@ static void eb_reset_vmas(const struct i915_execbuffer *eb)
 	unsigned int i;
 
 	for (i = 0; i < count; i++) {
-		struct drm_i915_gem_exec_object2 *entry = &eb->exec[i];
-		struct i915_vma *vma = exec_to_vma(entry);
+		struct i915_vma *vma = eb->vma[i];
+		unsigned int *flags = &eb->flags[i];
 
-		eb_unreserve_vma(vma, entry);
-		vma->exec_entry = NULL;
-		if (unlikely(entry->flags & __EXEC_OBJECT_HAS_REF))
+		eb_unreserve_vma(vma, flags);
+		vma->exec_flags = NULL;
+
+		if (unlikely(*flags & __LUT_HAS_REF)) {
 			i915_vma_put(vma);
+			*flags &= ~__LUT_HAS_REF;
+		}
 	}
 
 	if (eb->lut_size >= 0)
@@ -898,17 +894,18 @@ static void eb_release_vmas(const struct i915_execbuffer *eb)
 		return;
 
 	for (i = 0; i < count; i++) {
-		struct drm_i915_gem_exec_object2 *entry = &eb->exec[i];
-		struct i915_vma *vma = exec_to_vma(entry);
+		struct i915_vma *vma = eb->vma[i];
+		unsigned int *flags = &eb->flags[i];
 
-		if (!vma || !vma->exec_entry)
+		if (!vma)
 			continue;
 
-		GEM_BUG_ON(vma->exec_entry != entry);
-		if (entry->flags & __EXEC_OBJECT_HAS_PIN)
-			__eb_unreserve_vma(vma, entry);
-		vma->exec_entry = NULL;
-		if (unlikely(entry->flags & __EXEC_OBJECT_HAS_REF))
+		GEM_BUG_ON(vma->exec_flags != &eb->flags[i]);
+
+		eb_unreserve_vma(vma, flags);
+		vma->exec_flags = NULL;
+
+		if (unlikely(*flags & __LUT_HAS_REF))
 			i915_vma_put(vma);
 	}
 }
@@ -1390,7 +1387,7 @@ eb_relocate_entry(struct i915_execbuffer *eb,
 	}
 
 	if (reloc->write_domain) {
-		target->exec_entry->flags |= EXEC_OBJECT_WRITE;
+		*target->exec_flags |= EXEC_OBJECT_WRITE;
 
 		/*
 		 * Sandybridge PPGTT errata: We need a global gtt mapping
@@ -1442,7 +1439,7 @@ eb_relocate_entry(struct i915_execbuffer *eb,
 	 * do relocations we are already stalling, disable the user's opt
 	 * of our synchronisation.
 	 */
-	vma->exec_entry->flags &= ~EXEC_OBJECT_ASYNC;
+	*vma->exec_flags &= ~EXEC_OBJECT_ASYNC;
 
 	/* and update the user's relocation entry */
 	return relocate_entry(vma, reloc, eb, target);
@@ -1453,7 +1450,8 @@ static int eb_relocate_vma(struct i915_execbuffer *eb, struct i915_vma *vma)
 #define N_RELOC(x) ((x) / sizeof(struct drm_i915_gem_relocation_entry))
 	struct drm_i915_gem_relocation_entry stack[N_RELOC(512)];
 	struct drm_i915_gem_relocation_entry __user *urelocs;
-	const struct drm_i915_gem_exec_object2 *entry = vma->exec_entry;
+	const struct drm_i915_gem_exec_object2 *entry =
+		&eb->exec[vma->exec_flags - eb->flags];
 	unsigned int remain;
 
 	urelocs = u64_to_user_ptr(entry->relocs_ptr);
@@ -1536,7 +1534,8 @@ static int eb_relocate_vma(struct i915_execbuffer *eb, struct i915_vma *vma)
 static int
 eb_relocate_vma_slow(struct i915_execbuffer *eb, struct i915_vma *vma)
 {
-	const struct drm_i915_gem_exec_object2 *entry = vma->exec_entry;
+	const struct drm_i915_gem_exec_object2 *entry =
+		&eb->exec[vma->exec_flags - eb->flags];
 	struct drm_i915_gem_relocation_entry *relocs =
 		u64_to_ptr(typeof(*relocs), entry->relocs_ptr);
 	unsigned int i;
@@ -1755,6 +1754,8 @@ static int eb_relocate_slow(struct i915_execbuffer *eb)
 	if (err)
 		goto err;
 
+	GEM_BUG_ON(!eb->batch);
+
 	list_for_each_entry(vma, &eb->relocs, reloc_link) {
 		if (!have_copy) {
 			pagefault_disable();
@@ -1828,11 +1829,10 @@ eb_move_to_gpu(struct i915_execbuffer *eb)
 	int err;
 
 	for (i = 0; i < count; i++) {
-		const struct drm_i915_gem_exec_object2 *entry = &eb->exec[i];
-		struct i915_vma *vma = exec_to_vma(entry);
-		struct drm_i915_gem_object *obj = vma->obj;
+		unsigned int flags = eb->flags[i];
+		struct drm_i915_gem_object *obj;
 
-		if (entry->flags & EXEC_OBJECT_CAPTURE) {
+		if (flags & EXEC_OBJECT_CAPTURE) {
 			struct i915_gem_capture_list *capture;
 
 			capture = kmalloc(sizeof(*capture), GFP_KERNEL);
@@ -1840,33 +1840,34 @@ eb_move_to_gpu(struct i915_execbuffer *eb)
 				return -ENOMEM;
 
 			capture->next = eb->request->capture_list;
-			capture->vma = vma;
+			capture->vma = eb->vma[i];
 			eb->request->capture_list = capture;
 		}
 
-		if (entry->flags & EXEC_OBJECT_ASYNC)
-			goto skip_flushes;
+		if (flags & EXEC_OBJECT_ASYNC)
+			continue;
 
+		obj = eb->vma[i]->obj;
 		if (obj->cache_dirty & ~obj->cache_coherent)
 			i915_gem_clflush_object(obj, 0);
 
 		err = i915_gem_request_await_object
-			(eb->request, obj, entry->flags & EXEC_OBJECT_WRITE);
+			(eb->request, obj, flags & EXEC_OBJECT_WRITE);
 		if (err)
 			return err;
-
-skip_flushes:
-		i915_vma_move_to_active(vma, eb->request, entry->flags);
-		__eb_unreserve_vma(vma, entry);
-		vma->exec_entry = NULL;
 	}
 
 	for (i = 0; i < count; i++) {
-		const struct drm_i915_gem_exec_object2 *entry = &eb->exec[i];
-		struct i915_vma *vma = exec_to_vma(entry);
+		struct i915_vma *vma = eb->vma[i];
+		unsigned int flags = eb->flags[i];
+
+		i915_vma_move_to_active(vma, eb->request, flags);
+		eb_export_fence(vma, eb->request, flags);
+
+		__eb_unreserve_vma(vma, flags);
+		vma->exec_flags = NULL;
 
-		eb_export_fence(vma, eb->request, entry->flags);
-		if (unlikely(entry->flags & __EXEC_OBJECT_HAS_REF))
+		if (unlikely(flags & __LUT_HAS_REF))
 			i915_vma_put(vma);
 	}
 	eb->exec = NULL;
@@ -1995,11 +1996,10 @@ static struct i915_vma *eb_parse(struct i915_execbuffer *eb, bool is_master)
 	if (IS_ERR(vma))
 		goto out;
 
-	vma->exec_entry =
-		memset(&eb->exec[eb->buffer_count++],
-		       0, sizeof(*vma->exec_entry));
-	vma->exec_entry->flags = __EXEC_OBJECT_HAS_PIN | __EXEC_OBJECT_HAS_REF;
-	__exec_to_vma(vma->exec_entry) = (uintptr_t)i915_vma_get(vma);
+	eb->vma[eb->buffer_count] = i915_vma_get(vma);
+	eb->flags[eb->buffer_count] = __LUT_HAS_PIN | __LUT_HAS_REF;
+	vma->exec_flags = &eb->flags[eb->buffer_count];
+	eb->buffer_count++;
 
 out:
 	i915_gem_object_unpin_pages(shadow_batch_obj);
@@ -2132,7 +2132,7 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 	int out_fence_fd = -1;
 	int err;
 
-	BUILD_BUG_ON(__EXEC_OBJECT_INTERNAL_FLAGS & ~__EXEC_OBJECT_UNKNOWN_FLAGS);
+	BUILD_BUG_ON(__LUT_INTERNAL_FLAGS & ~__EXEC_OBJECT_UNKNOWN_FLAGS);
 
 	eb.i915 = to_i915(dev);
 	eb.file = file;
@@ -2141,6 +2141,8 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 		args->flags |= __EXEC_HAS_RELOC;
 	eb.exec = exec;
 	eb.ctx = NULL;
+	eb.vma = (struct i915_vma **)(exec + args->buffer_count + 1);
+	eb.flags = (unsigned int *)(eb.vma + args->buffer_count + 1);
 	eb.invalid_flags = __EXEC_OBJECT_UNKNOWN_FLAGS;
 	if (USES_FULL_PPGTT(eb.i915))
 		eb.invalid_flags |= EXEC_OBJECT_NEEDS_GTT;
@@ -2228,7 +2230,7 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 	if (err < 0)
 		goto err_vma;
 
-	if (unlikely(eb.batch->exec_entry->flags & EXEC_OBJECT_WRITE)) {
+	if (unlikely(*eb.batch->exec_flags & EXEC_OBJECT_WRITE)) {
 		DRM_DEBUG("Attempting to use self-modifying batch buffer\n");
 		err = -EINVAL;
 		goto err_vma;
@@ -2401,7 +2403,9 @@ i915_gem_execbuffer(struct drm_device *dev, void *data,
 				   sizeof(*exec_list),
 				   __GFP_NOWARN | GFP_TEMPORARY);
 	exec2_list = drm_malloc_gfp(args->buffer_count + 1,
-				    sizeof(*exec2_list),
+				    sizeof(*exec2_list) +
+				    sizeof(struct i915_vma *) +
+				    sizeof(unsigned int),
 				    __GFP_NOWARN | GFP_TEMPORARY);
 	if (exec_list == NULL || exec2_list == NULL) {
 		DRM_DEBUG("Failed to allocate exec list for %d buffers\n",
@@ -2477,7 +2481,9 @@ i915_gem_execbuffer2(struct drm_device *dev, void *data,
 
 	/* Allocate an extra slot for use by the command parser */
 	exec2_list = drm_malloc_gfp(args->buffer_count + 1,
-				    sizeof(*exec2_list),
+				    sizeof(*exec2_list) +
+				    sizeof(struct i915_vma *) +
+				    sizeof(unsigned int),
 				    __GFP_NOWARN | GFP_TEMPORARY);
 	if (exec2_list == NULL) {
 		DRM_DEBUG("Failed to allocate exec list for %d buffers\n",
diff --git a/drivers/gpu/drm/i915/i915_vma.h b/drivers/gpu/drm/i915/i915_vma.h
index 3840b8cdc6b1..cbf0795465bf 100644
--- a/drivers/gpu/drm/i915/i915_vma.h
+++ b/drivers/gpu/drm/i915/i915_vma.h
@@ -112,7 +112,7 @@ struct i915_vma {
 	/**
 	 * Used for performing relocations during execbuffer insertion.
 	 */
-	struct drm_i915_gem_exec_object2 *exec_entry;
+	unsigned int *exec_flags;
 	struct hlist_node exec_node;
 	u32 exec_handle;
 
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 19/24] drm/i915/scheduler: Support user-defined priorities
  2017-05-18  9:46 [PATCH 01/24] drm/i915/selftests: Pretend to be a gfx pci device Chris Wilson
                   ` (16 preceding siblings ...)
  2017-05-18  9:46 ` [PATCH 18/24] drm/i915: Convert execbuf to use struct-of-array packing for critical fields Chris Wilson
@ 2017-05-18  9:46 ` Chris Wilson
  2017-08-02 21:55   ` Jason Ekstrand
  2017-05-18  9:46 ` [PATCH 20/24] drm/i915: Group all the global context information together Chris Wilson
                   ` (7 subsequent siblings)
  25 siblings, 1 reply; 47+ messages in thread
From: Chris Wilson @ 2017-05-18  9:46 UTC (permalink / raw)
  To: intel-gfx

Use a priority stored in the context as the initial value when
submitting a request. This allows us to change the default priority on a
per-context basis, allowing different contexts to be favoured with GPU
time at the expense of lower importance work. The user can adjust the
context's priority via I915_CONTEXT_PARAM_PRIORITY, with more positive
values being higher priority (they will be serviced earlier, after their
dependencies have been resolved). Any prerequisite work for an execbuf
will have its priority raised to match the new request as required.

Normal users can specify any value in the range of -1023 to 0 [default],
i.e. they can reduce the priority of their workloads (and temporarily
boost it back to normal if so desired).

Privileged users can specify any value in the range of -1023 to 1023
[default is 0], i.e. they can raise their priority above all others and
so potentially starve the system.

Note that the existing schedulers are neither fair nor load balancing;
execution is strictly by priority on a first-come, first-served basis,
and the driver may choose to boost some requests above the range
available to users.

This priority was originally based around nice(2), but evolved to allow
clients to adjust their priority within a small range, and allow for a
privileged high priority range.

For example, this can be used to implement EGL_IMG_context_priority
https://www.khronos.org/registry/egl/extensions/IMG/EGL_IMG_context_priority.txt

	"EGL_CONTEXT_PRIORITY_LEVEL_IMG determines the priority level of
        the context to be created. This attribute is a hint, as an
        implementation may not support multiple contexts at some
        priority levels and system policy may limit access to high
        priority contexts to appropriate system privilege level. The
        default value for EGL_CONTEXT_PRIORITY_LEVEL_IMG is
        EGL_CONTEXT_PRIORITY_MEDIUM_IMG."

so we can map

	PRIORITY_HIGH -> 1023 [privileged, will failback to 0]
	PRIORITY_MED -> 0 [default]
	PRIORITY_LOW -> -1023

They also map onto the priorities used by VkQueue (and a VkQueue is
essentially a timeline, our i915_gem_context under full-ppgtt).
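
Setting the priority from userspace is then a single setparam call; a
hedged sketch (set_context_priority() is an illustrative helper; note
that size must be left zero or the kernel rejects the call):

    #include <sys/ioctl.h>
    #include <drm/i915_drm.h>

    /* Demote a context, e.g. for a background transcode job. Raising the
     * priority above the default instead requires CAP_SYS_NICE.
     */
    static int set_context_priority(int fd, __u32 ctx_id, int prio)
    {
        struct drm_i915_gem_context_param p = {
            .ctx_id = ctx_id,
            .param = I915_CONTEXT_PARAM_PRIORITY,
            .value = prio, /* e.g. I915_CONTEXT_MIN_USER_PRIORITY */
        };

        return ioctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM, &p);
    }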

v2: s/CAP_SYS_ADMIN/CAP_SYS_NICE/
v3: Report min/max user priorities as defines in the uapi, and rebase
internal priorities on the exposed values.

Testcase: igt/gem_exec_schedule
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/i915/i915_gem_context.c | 23 +++++++++++++++++++++++
 drivers/gpu/drm/i915/i915_gem_request.h | 11 ++++++++---
 include/uapi/drm/i915_drm.h             |  9 ++++++++-
 3 files changed, 39 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index c62e17a68f5b..0735d624ef07 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -1038,6 +1038,9 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
 	case I915_CONTEXT_PARAM_BANNABLE:
 		args->value = i915_gem_context_is_bannable(ctx);
 		break;
+	case I915_CONTEXT_PARAM_PRIORITY:
+		args->value = ctx->priority;
+		break;
 	default:
 		ret = -EINVAL;
 		break;
@@ -1095,6 +1098,26 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
 		else
 			i915_gem_context_clear_bannable(ctx);
 		break;
+
+	case I915_CONTEXT_PARAM_PRIORITY:
+		{
+			int priority = args->value;
+
+			if (args->size)
+				ret = -EINVAL;
+			else if (!to_i915(dev)->engine[RCS]->schedule)
+				ret = -ENODEV;
+			else if (priority > I915_CONTEXT_MAX_USER_PRIORITY ||
+				 priority < I915_CONTEXT_MIN_USER_PRIORITY)
+				ret = -EINVAL;
+			else if (priority > I915_CONTEXT_DEFAULT_PRIORITY &&
+				 !capable(CAP_SYS_NICE))
+				ret = -EPERM;
+			else
+				ctx->priority = priority;
+		}
+		break;
+
 	default:
 		ret = -EINVAL;
 		break;
diff --git a/drivers/gpu/drm/i915/i915_gem_request.h b/drivers/gpu/drm/i915/i915_gem_request.h
index 7b7c84369d78..d34d8adba9e9 100644
--- a/drivers/gpu/drm/i915/i915_gem_request.h
+++ b/drivers/gpu/drm/i915/i915_gem_request.h
@@ -30,6 +30,8 @@
 #include "i915_gem.h"
 #include "i915_sw_fence.h"
 
+#include <uapi/drm/i915_drm.h>
+
 struct drm_file;
 struct drm_i915_gem_object;
 struct drm_i915_gem_request;
@@ -69,9 +71,12 @@ struct i915_priotree {
 	struct list_head waiters_list; /* those after us, they depend upon us */
 	struct list_head link;
 	int priority;
-#define I915_PRIORITY_MAX 1024
-#define I915_PRIORITY_NORMAL 0
-#define I915_PRIORITY_MIN (-I915_PRIORITY_MAX)
+};
+
+enum {
+	I915_PRIORITY_MIN = I915_CONTEXT_MIN_USER_PRIORITY - 1,
+	I915_PRIORITY_NORMAL = I915_CONTEXT_DEFAULT_PRIORITY,
+	I915_PRIORITY_MAX = I915_CONTEXT_MAX_USER_PRIORITY + 1,
 };
 
 struct i915_gem_capture_list {
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index 4adf3e5f5c71..34ee011f08ac 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -393,8 +393,11 @@ typedef struct drm_i915_irq_wait {
 #define I915_PARAM_MIN_EU_IN_POOL	 39
 #define I915_PARAM_MMAP_GTT_VERSION	 40
 
-/* Query whether DRM_I915_GEM_EXECBUFFER2 supports user defined execution
+/*
+ * Query whether DRM_I915_GEM_EXECBUFFER2 supports user defined execution
  * priorities and the driver will attempt to execute batches in priority order.
+ * The initial priority for each batch is supplied by the context and is
+ * controlled via I915_CONTEXT_PARAM_PRIORITY.
  */
 #define I915_PARAM_HAS_SCHEDULER	 41
 #define I915_PARAM_HUC_STATUS		 42
@@ -1320,6 +1323,10 @@ struct drm_i915_gem_context_param {
 #define I915_CONTEXT_PARAM_GTT_SIZE	0x3
 #define I915_CONTEXT_PARAM_NO_ERROR_CAPTURE	0x4
 #define I915_CONTEXT_PARAM_BANNABLE	0x5
+#define I915_CONTEXT_PARAM_PRIORITY	0x6
+#define   I915_CONTEXT_MAX_USER_PRIORITY	1023 /* inclusive */
+#define   I915_CONTEXT_DEFAULT_PRIORITY		0
+#define   I915_CONTEXT_MIN_USER_PRIORITY	-1023 /* inclusive */
 	__u64 value;
 };
 
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 20/24] drm/i915: Group all the global context information together
  2017-05-18  9:46 [PATCH 01/24] drm/i915/selftests: Pretend to be a gfx pci device Chris Wilson
                   ` (17 preceding siblings ...)
  2017-05-18  9:46 ` [PATCH 19/24] drm/i915/scheduler: Support user-defined priorities Chris Wilson
@ 2017-05-18  9:46 ` Chris Wilson
  2017-05-19  9:35   ` Joonas Lahtinen
  2017-05-18  9:46 ` [PATCH 21/24] drm/i915: Allow contexts to be unreferenced locklessly Chris Wilson
                   ` (6 subsequent siblings)
  25 siblings, 1 reply; 47+ messages in thread
From: Chris Wilson @ 2017-05-18  9:46 UTC (permalink / raw)
  To: intel-gfx

Create a substruct to hold all the global context state under
drm_i915_private.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/i915_debugfs.c              |  4 +--
 drivers/gpu/drm/i915/i915_drv.c                  |  9 +++---
 drivers/gpu/drm/i915/i915_drv.h                  | 20 ++++++++------
 drivers/gpu/drm/i915/i915_gem.c                  | 13 ++++-----
 drivers/gpu/drm/i915/i915_gem_context.c          | 35 +++++++++++++-----------
 drivers/gpu/drm/i915/i915_gem_context.h          | 14 ++++++----
 drivers/gpu/drm/i915/i915_sysfs.c                |  2 +-
 drivers/gpu/drm/i915/intel_lrc.c                 |  2 +-
 drivers/gpu/drm/i915/selftests/mock_context.c    |  2 +-
 drivers/gpu/drm/i915/selftests/mock_gem_device.c |  4 +--
 10 files changed, 57 insertions(+), 48 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index d72442ce4806..b6d0a6569aae 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -1956,7 +1956,7 @@ static int i915_context_status(struct seq_file *m, void *unused)
 	if (ret)
 		return ret;
 
-	list_for_each_entry(ctx, &dev_priv->context_list, link) {
+	list_for_each_entry(ctx, &dev_priv->contexts.list, link) {
 		seq_printf(m, "HW context %u ", ctx->hw_id);
 		if (ctx->pid) {
 			struct task_struct *task;
@@ -2062,7 +2062,7 @@ static int i915_dump_lrc(struct seq_file *m, void *unused)
 	if (ret)
 		return ret;
 
-	list_for_each_entry(ctx, &dev_priv->context_list, link)
+	list_for_each_entry(ctx, &dev_priv->contexts.list, link)
 		for_each_engine(engine, dev_priv, id)
 			i915_dump_lrc_obj(m, ctx, engine);
 
diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 178550a93884..6c91bc9f7e2b 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -553,13 +553,13 @@ static void i915_gem_fini(struct drm_i915_private *dev_priv)
 	mutex_lock(&dev_priv->drm.struct_mutex);
 	intel_uc_fini_hw(dev_priv);
 	i915_gem_cleanup_engines(dev_priv);
-	i915_gem_context_fini(dev_priv);
+	i915_gem_contexts_fini(dev_priv);
 	i915_gem_cleanup_userptr(dev_priv);
 	mutex_unlock(&dev_priv->drm.struct_mutex);
 
 	i915_gem_drain_freed_objects(dev_priv);
 
-	WARN_ON(!list_empty(&dev_priv->context_list));
+	WARN_ON(!list_empty(&dev_priv->contexts.list));
 }
 
 static int i915_load_modeset_init(struct drm_device *dev)
@@ -1392,9 +1392,10 @@ static void i915_driver_release(struct drm_device *dev)
 
 static int i915_driver_open(struct drm_device *dev, struct drm_file *file)
 {
+	struct drm_i915_private *i915 = to_i915(dev);
 	int ret;
 
-	ret = i915_gem_open(dev, file);
+	ret = i915_gem_open(i915, file);
 	if (ret)
 		return ret;
 
@@ -1424,7 +1425,7 @@ static void i915_driver_postclose(struct drm_device *dev, struct drm_file *file)
 	struct drm_i915_file_private *file_priv = file->driver_priv;
 
 	mutex_lock(&dev->struct_mutex);
-	i915_gem_context_close(dev, file);
+	i915_gem_context_close(file);
 	i915_gem_release(dev, file);
 	mutex_unlock(&dev->struct_mutex);
 
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 261de5f38956..61a0c81164cc 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2233,13 +2233,6 @@ struct drm_i915_private {
 	DECLARE_HASHTABLE(mm_structs, 7);
 	struct mutex mm_lock;
 
-	/* The hw wants to have a stable context identifier for the lifetime
-	 * of the context (for OA, PASID, faults, etc). This is limited
-	 * in execlists to 21 bits.
-	 */
-	struct ida context_hw_ida;
-#define MAX_CONTEXT_HW_ID (1<<21) /* exclusive */
-
 	/* Kernel Modesetting */
 
 	struct intel_crtc *plane_to_crtc_mapping[I915_MAX_PIPES];
@@ -2318,7 +2311,16 @@ struct drm_i915_private {
 	 */
 	struct mutex av_mutex;
 
-	struct list_head context_list;
+	struct {
+		struct list_head list;
+
+		/* The hw wants to have a stable context identifier for the
+		 * lifetime of the context (for OA, PASID, faults, etc).
+		 * This is limited in execlists to 21 bits.
+		 */
+		struct ida hw_ida;
+#define MAX_CONTEXT_HW_ID (1<<21) /* exclusive */
+	} contexts;
 
 	u32 fdi_rx_config;
 
@@ -3450,7 +3452,7 @@ i915_gem_object_pin_to_display_plane(struct drm_i915_gem_object *obj,
 void i915_gem_object_unpin_from_display_plane(struct i915_vma *vma);
 int i915_gem_object_attach_phys(struct drm_i915_gem_object *obj,
 				int align);
-int i915_gem_open(struct drm_device *dev, struct drm_file *file);
+int i915_gem_open(struct drm_i915_private *i915, struct drm_file *file);
 void i915_gem_release(struct drm_device *dev, struct drm_file *file);
 
 int i915_gem_object_set_cache_level(struct drm_i915_gem_object *obj,
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 5dcd51655a81..c044b5dbdb66 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -3079,7 +3079,7 @@ void i915_gem_set_wedged(struct drm_i915_private *dev_priv)
 
 	stop_machine(__i915_gem_set_wedged_BKL, dev_priv, NULL);
 
-	i915_gem_context_lost(dev_priv);
+	i915_gem_contexts_lost(dev_priv);
 
 	mod_delayed_work(dev_priv->wq, &dev_priv->gt.idle_work, 0);
 }
@@ -4540,7 +4540,7 @@ int i915_gem_suspend(struct drm_i915_private *dev_priv)
 		goto err_unlock;
 
 	assert_kernel_context_is_current(dev_priv);
-	i915_gem_context_lost(dev_priv);
+	i915_gem_contexts_lost(dev_priv);
 	mutex_unlock(&dev->struct_mutex);
 
 	intel_guc_suspend(dev_priv);
@@ -4789,7 +4789,7 @@ int i915_gem_init(struct drm_i915_private *dev_priv)
 	if (ret)
 		goto out_unlock;
 
-	ret = i915_gem_context_init(dev_priv);
+	ret = i915_gem_contexts_init(dev_priv);
 	if (ret)
 		goto out_unlock;
 
@@ -4899,7 +4899,6 @@ i915_gem_load_init(struct drm_i915_private *dev_priv)
 	if (err)
 		goto err_priorities;
 
-	INIT_LIST_HEAD(&dev_priv->context_list);
 	INIT_WORK(&dev_priv->mm.free_work, __i915_gem_free_work);
 	init_llist_head(&dev_priv->mm.free_list);
 	INIT_LIST_HEAD(&dev_priv->mm.unbound_list);
@@ -5023,7 +5022,7 @@ void i915_gem_release(struct drm_device *dev, struct drm_file *file)
 	}
 }
 
-int i915_gem_open(struct drm_device *dev, struct drm_file *file)
+int i915_gem_open(struct drm_i915_private *i915, struct drm_file *file)
 {
 	struct drm_i915_file_private *file_priv;
 	int ret;
@@ -5035,7 +5034,7 @@ int i915_gem_open(struct drm_device *dev, struct drm_file *file)
 		return -ENOMEM;
 
 	file->driver_priv = file_priv;
-	file_priv->dev_priv = to_i915(dev);
+	file_priv->dev_priv = i915;
 	file_priv->file = file;
 	INIT_LIST_HEAD(&file_priv->rps.link);
 
@@ -5044,7 +5043,7 @@ int i915_gem_open(struct drm_device *dev, struct drm_file *file)
 
 	file_priv->bsd_engine = -1;
 
-	ret = i915_gem_context_open(dev, file);
+	ret = i915_gem_context_open(i915, file);
 	if (ret)
 		kfree(file_priv);
 
diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 0735d624ef07..f4395a1e214d 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -188,7 +188,7 @@ void i915_gem_context_free(struct kref *ctx_ref)
 
 	list_del(&ctx->link);
 
-	ida_simple_remove(&ctx->i915->context_hw_ida, ctx->hw_id);
+	ida_simple_remove(&ctx->i915->contexts.hw_ida, ctx->hw_id);
 	kfree(ctx);
 }
 
@@ -205,7 +205,7 @@ static int assign_hw_id(struct drm_i915_private *dev_priv, unsigned *out)
 {
 	int ret;
 
-	ret = ida_simple_get(&dev_priv->context_hw_ida,
+	ret = ida_simple_get(&dev_priv->contexts.hw_ida,
 			     0, MAX_CONTEXT_HW_ID, GFP_KERNEL);
 	if (ret < 0) {
 		/* Contexts are only released when no longer active.
@@ -213,7 +213,7 @@ static int assign_hw_id(struct drm_i915_private *dev_priv, unsigned *out)
 		 * stale contexts and try again.
 		 */
 		i915_gem_retire_requests(dev_priv);
-		ret = ida_simple_get(&dev_priv->context_hw_ida,
+		ret = ida_simple_get(&dev_priv->contexts.hw_ida,
 				     0, MAX_CONTEXT_HW_ID, GFP_KERNEL);
 		if (ret < 0)
 			return ret;
@@ -265,7 +265,7 @@ __create_hw_context(struct drm_i915_private *dev_priv,
 	}
 
 	kref_init(&ctx->ref);
-	list_add_tail(&ctx->link, &dev_priv->context_list);
+	list_add_tail(&ctx->link, &dev_priv->contexts.list);
 	ctx->i915 = dev_priv;
 	ctx->priority = I915_PRIORITY_NORMAL;
 
@@ -418,7 +418,7 @@ i915_gem_context_create_gvt(struct drm_device *dev)
 	return ctx;
 }
 
-int i915_gem_context_init(struct drm_i915_private *dev_priv)
+int i915_gem_contexts_init(struct drm_i915_private *dev_priv)
 {
 	struct i915_gem_context *ctx;
 
@@ -427,6 +427,8 @@ int i915_gem_context_init(struct drm_i915_private *dev_priv)
 	if (WARN_ON(dev_priv->kernel_context))
 		return 0;
 
+	INIT_LIST_HEAD(&dev_priv->contexts.list);
+
 	if (intel_vgpu_active(dev_priv) &&
 	    HAS_LOGICAL_RING_CONTEXTS(dev_priv)) {
 		if (!i915.enable_execlists) {
@@ -437,7 +439,7 @@ int i915_gem_context_init(struct drm_i915_private *dev_priv)
 
 	/* Using the simple ida interface, the max is limited by sizeof(int) */
 	BUILD_BUG_ON(MAX_CONTEXT_HW_ID > INT_MAX);
-	ida_init(&dev_priv->context_hw_ida);
+	ida_init(&dev_priv->contexts.hw_ida);
 
 	ctx = i915_gem_create_context(dev_priv, NULL);
 	if (IS_ERR(ctx)) {
@@ -463,7 +465,7 @@ int i915_gem_context_init(struct drm_i915_private *dev_priv)
 	return 0;
 }
 
-void i915_gem_context_lost(struct drm_i915_private *dev_priv)
+void i915_gem_contexts_lost(struct drm_i915_private *dev_priv)
 {
 	struct intel_engine_cs *engine;
 	enum intel_engine_id id;
@@ -484,7 +486,7 @@ void i915_gem_context_lost(struct drm_i915_private *dev_priv)
 	if (!i915.enable_execlists) {
 		struct i915_gem_context *ctx;
 
-		list_for_each_entry(ctx, &dev_priv->context_list, link) {
+		list_for_each_entry(ctx, &dev_priv->contexts.list, link) {
 			if (!i915_gem_context_is_default(ctx))
 				continue;
 
@@ -503,7 +505,7 @@ void i915_gem_context_lost(struct drm_i915_private *dev_priv)
 	}
 }
 
-void i915_gem_context_fini(struct drm_i915_private *dev_priv)
+void i915_gem_contexts_fini(struct drm_i915_private *dev_priv)
 {
 	struct i915_gem_context *dctx = dev_priv->kernel_context;
 
@@ -514,7 +516,7 @@ void i915_gem_context_fini(struct drm_i915_private *dev_priv)
 	context_close(dctx);
 	dev_priv->kernel_context = NULL;
 
-	ida_destroy(&dev_priv->context_hw_ida);
+	ida_destroy(&dev_priv->contexts.hw_ida);
 }
 
 static int context_idr_cleanup(int id, void *p, void *data)
@@ -525,16 +527,17 @@ static int context_idr_cleanup(int id, void *p, void *data)
 	return 0;
 }
 
-int i915_gem_context_open(struct drm_device *dev, struct drm_file *file)
+int i915_gem_context_open(struct drm_i915_private *i915,
+			  struct drm_file *file)
 {
 	struct drm_i915_file_private *file_priv = file->driver_priv;
 	struct i915_gem_context *ctx;
 
 	idr_init(&file_priv->context_idr);
 
-	mutex_lock(&dev->struct_mutex);
-	ctx = i915_gem_create_context(to_i915(dev), file_priv);
-	mutex_unlock(&dev->struct_mutex);
+	mutex_lock(&i915->drm.struct_mutex);
+	ctx = i915_gem_create_context(i915, file_priv);
+	mutex_unlock(&i915->drm.struct_mutex);
 
 	GEM_BUG_ON(i915_gem_context_is_kernel(ctx));
 
@@ -546,11 +549,11 @@ int i915_gem_context_open(struct drm_device *dev, struct drm_file *file)
 	return 0;
 }
 
-void i915_gem_context_close(struct drm_device *dev, struct drm_file *file)
+void i915_gem_context_close(struct drm_file *file)
 {
 	struct drm_i915_file_private *file_priv = file->driver_priv;
 
-	lockdep_assert_held(&dev->struct_mutex);
+	lockdep_assert_held(&file_priv->dev_priv->drm.struct_mutex);
 
 	idr_for_each(&file_priv->context_idr, context_idr_cleanup, NULL);
 	idr_destroy(&file_priv->context_idr);
diff --git a/drivers/gpu/drm/i915/i915_gem_context.h b/drivers/gpu/drm/i915/i915_gem_context.h
index 82c99ba92ad3..808f878db812 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.h
+++ b/drivers/gpu/drm/i915/i915_gem_context.h
@@ -273,13 +273,17 @@ static inline bool i915_gem_context_is_kernel(struct i915_gem_context *ctx)
 }
 
 /* i915_gem_context.c */
-int __must_check i915_gem_context_init(struct drm_i915_private *dev_priv);
-void i915_gem_context_lost(struct drm_i915_private *dev_priv);
-void i915_gem_context_fini(struct drm_i915_private *dev_priv);
-int i915_gem_context_open(struct drm_device *dev, struct drm_file *file);
-void i915_gem_context_close(struct drm_device *dev, struct drm_file *file);
+int __must_check i915_gem_contexts_init(struct drm_i915_private *dev_priv);
+void i915_gem_contexts_lost(struct drm_i915_private *dev_priv);
+void i915_gem_contexts_fini(struct drm_i915_private *dev_priv);
+
+int i915_gem_context_open(struct drm_i915_private *i915,
+			  struct drm_file *file);
+void i915_gem_context_close(struct drm_file *file);
+
 int i915_switch_context(struct drm_i915_gem_request *req);
 int i915_gem_switch_to_kernel_context(struct drm_i915_private *dev_priv);
+
 void i915_gem_context_free(struct kref *ctx_ref);
 struct i915_gem_context *
 i915_gem_context_create_gvt(struct drm_device *dev);
diff --git a/drivers/gpu/drm/i915/i915_sysfs.c b/drivers/gpu/drm/i915/i915_sysfs.c
index 1eef3fae4db3..3a481062f219 100644
--- a/drivers/gpu/drm/i915/i915_sysfs.c
+++ b/drivers/gpu/drm/i915/i915_sysfs.c
@@ -209,7 +209,7 @@ i915_l3_write(struct file *filp, struct kobject *kobj,
 	memcpy(*remap_info + (offset/4), buf, count);
 
 	/* NB: We defer the remapping until we switch to the context */
-	list_for_each_entry(ctx, &dev_priv->context_list, link)
+	list_for_each_entry(ctx, &dev_priv->contexts.list, link)
 		ctx->remap_slice |= (1<<slice);
 
 	ret = count;
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 014b30ace8a0..42ab91acccc5 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -2064,7 +2064,7 @@ void intel_lr_context_resume(struct drm_i915_private *dev_priv)
 	 * So to avoid that we reset the context images upon resume. For
 	 * simplicity, we just zero everything out.
 	 */
-	list_for_each_entry(ctx, &dev_priv->context_list, link) {
+	list_for_each_entry(ctx, &dev_priv->contexts.list, link) {
 		for_each_engine(engine, dev_priv, id) {
 			struct intel_context *ce = &ctx->engine[engine->id];
 			u32 *reg;
diff --git a/drivers/gpu/drm/i915/selftests/mock_context.c b/drivers/gpu/drm/i915/selftests/mock_context.c
index f8b9cc212b02..243325b97d4c 100644
--- a/drivers/gpu/drm/i915/selftests/mock_context.c
+++ b/drivers/gpu/drm/i915/selftests/mock_context.c
@@ -48,7 +48,7 @@ mock_context(struct drm_i915_private *i915,
 	if (!ctx->vma_lut.ht)
 		goto err_free;
 
-	ret = ida_simple_get(&i915->context_hw_ida,
+	ret = ida_simple_get(&i915->contexts.hw_ida,
 			     0, MAX_CONTEXT_HW_ID, GFP_KERNEL);
 	if (ret < 0)
 		goto err_vma_ht;
diff --git a/drivers/gpu/drm/i915/selftests/mock_gem_device.c b/drivers/gpu/drm/i915/selftests/mock_gem_device.c
index 627e2aa09766..0ddb70a16550 100644
--- a/drivers/gpu/drm/i915/selftests/mock_gem_device.c
+++ b/drivers/gpu/drm/i915/selftests/mock_gem_device.c
@@ -61,7 +61,7 @@ static void mock_device_release(struct drm_device *dev)
 	mutex_lock(&i915->drm.struct_mutex);
 	for_each_engine(engine, i915, id)
 		mock_engine_free(engine);
-	i915_gem_context_fini(i915);
+	i915_gem_contexts_fini(i915);
 	mutex_unlock(&i915->drm.struct_mutex);
 
 	drain_workqueue(i915->wq);
@@ -160,7 +160,7 @@ struct drm_i915_private *mock_gem_device(void)
 	INIT_LIST_HEAD(&i915->mm.unbound_list);
 	INIT_LIST_HEAD(&i915->mm.bound_list);
 
-	ida_init(&i915->context_hw_ida);
+	ida_init(&i915->contexts.hw_ida);
 
 	INIT_DELAYED_WORK(&i915->gt.retire_work, mock_retire_work_handler);
 	INIT_DELAYED_WORK(&i915->gt.idle_work, mock_idle_work_handler);
-- 
2.11.0

* [PATCH 21/24] drm/i915: Allow contexts to be unreferenced locklessly
  2017-05-18  9:46 [PATCH 01/24] drm/i915/selftests: Pretend to be a gfx pci device Chris Wilson
                   ` (18 preceding siblings ...)
  2017-05-18  9:46 ` [PATCH 20/24] drm/i915: Group all the global context information together Chris Wilson
@ 2017-05-18  9:46 ` Chris Wilson
  2017-05-18  9:46 ` [PATCH 22/24] drm/i915: Enable rcu-only context lookups Chris Wilson
                   ` (5 subsequent siblings)
  25 siblings, 0 replies; 47+ messages in thread
From: Chris Wilson @ 2017-05-18  9:46 UTC (permalink / raw)
  To: intel-gfx

If we move the actual cleanup of the context to a worker, we can allow
the final free to be called from any context and avoid undue latency in
the caller.

v2: Handle the deferred context frees by flushing the workqueue before
calling i915_gem_context_fini() and performing the final free of the
kernel context directly
v3: Flush deferred frees before new context allocations
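
For reference, the unref path reduces to the usual kref -> llist ->
worker handoff; a condensed sketch of the two halves from the diff
below (trace call omitted):

void i915_gem_context_release(struct kref *ref)
{
	struct i915_gem_context *ctx = container_of(ref, typeof(*ctx), ref);
	struct drm_i915_private *i915 = ctx->i915;

	/* Runs in any context; no struct_mutex required. llist_add()
	 * returns true only if the list was previously empty, so the
	 * worker is queued exactly once per batch of frees.
	 */
	if (llist_add(&ctx->free_link, &i915->contexts.free_list))
		queue_work(i915->wq, &i915->contexts.free_work);
}

static void contexts_free_worker(struct work_struct *work)
{
	struct drm_i915_private *i915 =
		container_of(work, typeof(*i915), contexts.free_work);

	/* Take struct_mutex once on behalf of all pending frees. */
	mutex_lock(&i915->drm.struct_mutex);
	contexts_free(i915);
	mutex_unlock(&i915->drm.struct_mutex);
}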

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
---
 drivers/gpu/drm/i915/gvt/scheduler.c             |  2 +-
 drivers/gpu/drm/i915/i915_drv.c                  |  2 +
 drivers/gpu/drm/i915/i915_drv.h                  | 23 +--------
 drivers/gpu/drm/i915/i915_gem_context.c          | 59 +++++++++++++++++++-----
 drivers/gpu/drm/i915/i915_gem_context.h          | 15 +++++-
 drivers/gpu/drm/i915/i915_perf.c                 |  4 +-
 drivers/gpu/drm/i915/selftests/i915_vma.c        |  8 +++-
 drivers/gpu/drm/i915/selftests/mock_context.c    |  9 ++++
 drivers/gpu/drm/i915/selftests/mock_context.h    |  2 +
 drivers/gpu/drm/i915/selftests/mock_gem_device.c |  3 +-
 10 files changed, 88 insertions(+), 39 deletions(-)

diff --git a/drivers/gpu/drm/i915/gvt/scheduler.c b/drivers/gpu/drm/i915/gvt/scheduler.c
index 6ae286cb5804..84c2495df372 100644
--- a/drivers/gpu/drm/i915/gvt/scheduler.c
+++ b/drivers/gpu/drm/i915/gvt/scheduler.c
@@ -595,7 +595,7 @@ int intel_gvt_init_workload_scheduler(struct intel_gvt *gvt)
 
 void intel_vgpu_clean_gvt_context(struct intel_vgpu *vgpu)
 {
-	i915_gem_context_put_unlocked(vgpu->shadow_ctx);
+	i915_gem_context_put(vgpu->shadow_ctx);
 }
 
 int intel_vgpu_init_gvt_context(struct intel_vgpu *vgpu)
diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 6c91bc9f7e2b..2d2fb4327f97 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -550,6 +550,8 @@ static const struct vga_switcheroo_client_ops i915_switcheroo_ops = {
 
 static void i915_gem_fini(struct drm_i915_private *dev_priv)
 {
+	flush_workqueue(dev_priv->wq);
+
 	mutex_lock(&dev_priv->drm.struct_mutex);
 	intel_uc_fini_hw(dev_priv);
 	i915_gem_cleanup_engines(dev_priv);
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 61a0c81164cc..f11c06bd977e 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2313,6 +2313,8 @@ struct drm_i915_private {
 
 	struct {
 		struct list_head list;
+		struct llist_head free_list;
+		struct work_struct free_work;
 
 		/* The hw wants to have a stable context identifier for the
 		 * lifetime of the context (for OA, PASID, faults, etc).
@@ -3497,27 +3499,6 @@ i915_gem_context_lookup(struct drm_i915_file_private *file_priv, u32 id)
 	return ctx;
 }
 
-static inline struct i915_gem_context *
-i915_gem_context_get(struct i915_gem_context *ctx)
-{
-	kref_get(&ctx->ref);
-	return ctx;
-}
-
-static inline void i915_gem_context_put(struct i915_gem_context *ctx)
-{
-	lockdep_assert_held(&ctx->i915->drm.struct_mutex);
-	kref_put(&ctx->ref, i915_gem_context_free);
-}
-
-static inline void i915_gem_context_put_unlocked(struct i915_gem_context *ctx)
-{
-	struct mutex *lock = &ctx->i915->drm.struct_mutex;
-
-	if (kref_put_mutex(&ctx->ref, i915_gem_context_free, lock))
-		mutex_unlock(lock);
-}
-
 static inline struct intel_timeline *
 i915_gem_context_lookup_timeline(struct i915_gem_context *ctx,
 				 struct intel_engine_cs *engine)
diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index f4395a1e214d..2ad78bb50137 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -158,13 +158,11 @@ static void vma_lut_free(struct i915_gem_context *ctx)
 	kvfree(lut->ht);
 }
 
-void i915_gem_context_free(struct kref *ctx_ref)
+static void i915_gem_context_free(struct i915_gem_context *ctx)
 {
-	struct i915_gem_context *ctx = container_of(ctx_ref, typeof(*ctx), ref);
 	int i;
 
 	lockdep_assert_held(&ctx->i915->drm.struct_mutex);
-	trace_i915_context_free(ctx);
 	GEM_BUG_ON(!i915_gem_context_is_closed(ctx));
 
 	vma_lut_free(ctx);
@@ -192,6 +190,37 @@ void i915_gem_context_free(struct kref *ctx_ref)
 	kfree(ctx);
 }
 
+static void contexts_free(struct drm_i915_private *i915)
+{
+	struct llist_node *freed = llist_del_all(&i915->contexts.free_list);
+	struct i915_gem_context *ctx;
+
+	lockdep_assert_held(&i915->drm.struct_mutex);
+
+	llist_for_each_entry(ctx, freed, free_link)
+		i915_gem_context_free(ctx);
+}
+
+static void contexts_free_worker(struct work_struct *work)
+{
+	struct drm_i915_private *i915 =
+		container_of(work, typeof(*i915), contexts.free_work);
+
+	mutex_lock(&i915->drm.struct_mutex);
+	contexts_free(i915);
+	mutex_unlock(&i915->drm.struct_mutex);
+}
+
+void i915_gem_context_release(struct kref *ref)
+{
+	struct i915_gem_context *ctx = container_of(ref, typeof(*ctx), ref);
+	struct drm_i915_private *i915 = ctx->i915;
+
+	trace_i915_context_free(ctx);
+	if (llist_add(&ctx->free_link, &i915->contexts.free_list))
+		queue_work(i915->wq, &i915->contexts.free_work);
+}
+
 static void context_close(struct i915_gem_context *ctx)
 {
 	i915_gem_context_set_closed(ctx);
@@ -428,6 +457,8 @@ int i915_gem_contexts_init(struct drm_i915_private *dev_priv)
 		return 0;
 
 	INIT_LIST_HEAD(&dev_priv->contexts.list);
+	INIT_WORK(&dev_priv->contexts.free_work, contexts_free_worker);
+	init_llist_head(&dev_priv->contexts.free_list);
 
 	if (intel_vgpu_active(dev_priv) &&
 	    HAS_LOGICAL_RING_CONTEXTS(dev_priv)) {
@@ -505,18 +536,20 @@ void i915_gem_contexts_lost(struct drm_i915_private *dev_priv)
 	}
 }
 
-void i915_gem_contexts_fini(struct drm_i915_private *dev_priv)
+void i915_gem_contexts_fini(struct drm_i915_private *i915)
 {
-	struct i915_gem_context *dctx = dev_priv->kernel_context;
-
-	lockdep_assert_held(&dev_priv->drm.struct_mutex);
+	struct i915_gem_context *ctx;
 
-	GEM_BUG_ON(!i915_gem_context_is_kernel(dctx));
+	lockdep_assert_held(&i915->drm.struct_mutex);
 
-	context_close(dctx);
-	dev_priv->kernel_context = NULL;
+	/* Keep the context so that we can free it immediately ourselves */
+	ctx = i915_gem_context_get(fetch_and_zero(&i915->kernel_context));
+	GEM_BUG_ON(!i915_gem_context_is_kernel(ctx));
+	context_close(ctx);
+	i915_gem_context_free(ctx);
 
-	ida_destroy(&dev_priv->contexts.hw_ida);
+	/* Must free all deferred contexts (via flush_workqueue) first */
+	ida_destroy(&i915->contexts.hw_ida);
 }
 
 static int context_idr_cleanup(int id, void *p, void *data)
@@ -957,6 +990,10 @@ int i915_gem_context_create_ioctl(struct drm_device *dev, void *data,
 	if (ret)
 		return ret;
 
+	/* Reap stale contexts */
+	i915_gem_retire_requests(dev_priv);
+	contexts_free(dev_priv);
+
 	ctx = i915_gem_create_context(dev_priv, file_priv);
 	mutex_unlock(&dev->struct_mutex);
 	if (IS_ERR(ctx))
diff --git a/drivers/gpu/drm/i915/i915_gem_context.h b/drivers/gpu/drm/i915/i915_gem_context.h
index 808f878db812..61146f4aa168 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.h
+++ b/drivers/gpu/drm/i915/i915_gem_context.h
@@ -86,6 +86,7 @@ struct i915_gem_context {
 
 	/** link: place with &drm_i915_private.context_list */
 	struct list_head link;
+	struct llist_node free_link;
 
 	/**
 	 * @ref: reference count
@@ -284,7 +285,7 @@ void i915_gem_context_close(struct drm_file *file);
 int i915_switch_context(struct drm_i915_gem_request *req);
 int i915_gem_switch_to_kernel_context(struct drm_i915_private *dev_priv);
 
-void i915_gem_context_free(struct kref *ctx_ref);
+void i915_gem_context_release(struct kref *ctx_ref);
 struct i915_gem_context *
 i915_gem_context_create_gvt(struct drm_device *dev);
 
@@ -299,4 +300,16 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
 int i915_gem_context_reset_stats_ioctl(struct drm_device *dev, void *data,
 				       struct drm_file *file);
 
+static inline struct i915_gem_context *
+i915_gem_context_get(struct i915_gem_context *ctx)
+{
+	kref_get(&ctx->ref);
+	return ctx;
+}
+
+static inline void i915_gem_context_put(struct i915_gem_context *ctx)
+{
+	kref_put(&ctx->ref, i915_gem_context_release);
+}
+
 #endif /* !__I915_GEM_CONTEXT_H__ */
diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index 85269bcc8372..8b5010c5c736 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -1680,7 +1680,7 @@ static void i915_perf_destroy_locked(struct i915_perf_stream *stream)
 	list_del(&stream->link);
 
 	if (stream->ctx)
-		i915_gem_context_put_unlocked(stream->ctx);
+		i915_gem_context_put(stream->ctx);
 
 	kfree(stream);
 }
@@ -1851,7 +1851,7 @@ i915_perf_open_ioctl_locked(struct drm_i915_private *dev_priv,
 	kfree(stream);
 err_ctx:
 	if (specific_ctx)
-		i915_gem_context_put_unlocked(specific_ctx);
+		i915_gem_context_put(specific_ctx);
 err:
 	return ret;
 }
diff --git a/drivers/gpu/drm/i915/selftests/i915_vma.c b/drivers/gpu/drm/i915/selftests/i915_vma.c
index fb9072d5877f..2e86ec136b35 100644
--- a/drivers/gpu/drm/i915/selftests/i915_vma.c
+++ b/drivers/gpu/drm/i915/selftests/i915_vma.c
@@ -186,16 +186,20 @@ static int igt_vma_create(void *arg)
 				goto end;
 		}
 
-		list_for_each_entry_safe(ctx, cn, &contexts, link)
+		list_for_each_entry_safe(ctx, cn, &contexts, link) {
+			list_del_init(&ctx->link);
 			mock_context_close(ctx);
+		}
 	}
 
 end:
 	/* Final pass to lookup all created contexts */
 	err = create_vmas(i915, &objects, &contexts);
 out:
-	list_for_each_entry_safe(ctx, cn, &contexts, link)
+	list_for_each_entry_safe(ctx, cn, &contexts, link) {
+		list_del_init(&ctx->link);
 		mock_context_close(ctx);
+	}
 
 	list_for_each_entry_safe(obj, on, &objects, st_link)
 		i915_gem_object_put(obj);
diff --git a/drivers/gpu/drm/i915/selftests/mock_context.c b/drivers/gpu/drm/i915/selftests/mock_context.c
index 243325b97d4c..9c7c68181f82 100644
--- a/drivers/gpu/drm/i915/selftests/mock_context.c
+++ b/drivers/gpu/drm/i915/selftests/mock_context.c
@@ -86,3 +86,12 @@ void mock_context_close(struct i915_gem_context *ctx)
 
 	i915_gem_context_put(ctx);
 }
+
+void mock_init_contexts(struct drm_i915_private *i915)
+{
+	INIT_LIST_HEAD(&i915->contexts.list);
+	ida_init(&i915->contexts.hw_ida);
+
+	INIT_WORK(&i915->contexts.free_work, contexts_free_worker);
+	init_llist_head(&i915->contexts.free_list);
+}
diff --git a/drivers/gpu/drm/i915/selftests/mock_context.h b/drivers/gpu/drm/i915/selftests/mock_context.h
index 2427e5c0916a..383941a61124 100644
--- a/drivers/gpu/drm/i915/selftests/mock_context.h
+++ b/drivers/gpu/drm/i915/selftests/mock_context.h
@@ -25,6 +25,8 @@
 #ifndef __MOCK_CONTEXT_H
 #define __MOCK_CONTEXT_H
 
+void mock_init_contexts(struct drm_i915_private *i915);
+
 struct i915_gem_context *
 mock_context(struct drm_i915_private *i915,
 	     const char *name);
diff --git a/drivers/gpu/drm/i915/selftests/mock_gem_device.c b/drivers/gpu/drm/i915/selftests/mock_gem_device.c
index 0ddb70a16550..47613d20bba8 100644
--- a/drivers/gpu/drm/i915/selftests/mock_gem_device.c
+++ b/drivers/gpu/drm/i915/selftests/mock_gem_device.c
@@ -57,6 +57,7 @@ static void mock_device_release(struct drm_device *dev)
 
 	cancel_delayed_work_sync(&i915->gt.retire_work);
 	cancel_delayed_work_sync(&i915->gt.idle_work);
+	flush_workqueue(i915->wq);
 
 	mutex_lock(&i915->drm.struct_mutex);
 	for_each_engine(engine, i915, id)
@@ -160,7 +161,7 @@ struct drm_i915_private *mock_gem_device(void)
 	INIT_LIST_HEAD(&i915->mm.unbound_list);
 	INIT_LIST_HEAD(&i915->mm.bound_list);
 
-	ida_init(&i915->contexts.hw_ida);
+	mock_init_contexts(i915);
 
 	INIT_DELAYED_WORK(&i915->gt.retire_work, mock_retire_work_handler);
 	INIT_DELAYED_WORK(&i915->gt.idle_work, mock_idle_work_handler);
-- 
2.11.0

* [PATCH 22/24] drm/i915: Enable rcu-only context lookups
  2017-05-18  9:46 [PATCH 01/24] drm/i915/selftests: Pretend to be a gfx pci device Chris Wilson
                   ` (19 preceding siblings ...)
  2017-05-18  9:46 ` [PATCH 21/24] drm/i915: Allow contexts to be unreferenced locklessly Chris Wilson
@ 2017-05-18  9:46 ` Chris Wilson
  2017-05-19 10:36   ` Joonas Lahtinen
  2017-05-18  9:46 ` [PATCH 23/24] drm/i915: Keep a recent cache of freed context objects for reuse Chris Wilson
                   ` (4 subsequent siblings)
  25 siblings, 1 reply; 47+ messages in thread
From: Chris Wilson @ 2017-05-18  9:46 UTC (permalink / raw)
  To: intel-gfx

Whilst the contents of the context are still protected by the big
struct_mutex, this is not much of an improvement. It is just one tiny
step towards reducing our BKL.
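
The lookup itself is the standard RCU + kref_get_unless_zero() idiom,
made safe by the switch to kfree_rcu() in the free path; condensed from
the diff below:

struct i915_gem_context *
i915_gem_context_lookup(struct drm_i915_file_private *file_priv, u32 id)
{
	struct i915_gem_context *ctx;

	rcu_read_lock();
	ctx = idr_find(&file_priv->context_idr, id);
	/* The context may be freed concurrently, but kfree_rcu() keeps
	 * the memory valid for this read section; only a successful
	 * kref_get_unless_zero() confers ownership.
	 */
	if (ctx && !kref_get_unless_zero(&ctx->ref))
		ctx = NULL;
	rcu_read_unlock();

	return ctx;
}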

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/i915_drv.h            | 16 +++++--
 drivers/gpu/drm/i915/i915_gem_context.c    | 77 ++++++++++++++----------------
 drivers/gpu/drm/i915/i915_gem_context.h    |  5 ++
 drivers/gpu/drm/i915/i915_gem_execbuffer.c | 21 ++++----
 4 files changed, 64 insertions(+), 55 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index f11c06bd977e..1d55bbde68df 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -3486,15 +3486,21 @@ void i915_gem_object_save_bit_17_swizzle(struct drm_i915_gem_object *obj,
 					 struct sg_table *pages);
 
 static inline struct i915_gem_context *
+__i915_gem_context_lookup_rcu(struct drm_i915_file_private *file_priv, u32 id)
+{
+	return idr_find(&file_priv->context_idr, id);
+}
+
+static inline struct i915_gem_context *
 i915_gem_context_lookup(struct drm_i915_file_private *file_priv, u32 id)
 {
 	struct i915_gem_context *ctx;
 
-	lockdep_assert_held(&file_priv->dev_priv->drm.struct_mutex);
-
-	ctx = idr_find(&file_priv->context_idr, id);
-	if (!ctx)
-		return ERR_PTR(-ENOENT);
+	rcu_read_lock();
+	ctx = __i915_gem_context_lookup_rcu(file_priv, id);
+	if (ctx && !kref_get_unless_zero(&ctx->ref))
+		ctx = NULL;
+	rcu_read_unlock();
 
 	return ctx;
 }
diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 2ad78bb50137..198b0064a3d3 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -187,7 +187,7 @@ static void i915_gem_context_free(struct i915_gem_context *ctx)
 	list_del(&ctx->link);
 
 	ida_simple_remove(&ctx->i915->contexts.hw_ida, ctx->hw_id);
-	kfree(ctx);
+	kfree_rcu(ctx, rcu);
 }
 
 static void contexts_free(struct drm_i915_private *i915)
@@ -1021,20 +1021,19 @@ int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
 	if (args->ctx_id == DEFAULT_CONTEXT_HANDLE)
 		return -ENOENT;
 
-	ret = i915_mutex_lock_interruptible(dev);
-	if (ret)
-		return ret;
-
 	ctx = i915_gem_context_lookup(file_priv, args->ctx_id);
-	if (IS_ERR(ctx)) {
-		mutex_unlock(&dev->struct_mutex);
-		return PTR_ERR(ctx);
-	}
+	if (!ctx)
+		return -ENOENT;
+
+	ret = mutex_lock_interruptible(&dev->struct_mutex);
+	if (ret)
+		goto out;
 
 	__destroy_hw_context(ctx, file_priv);
 	mutex_unlock(&dev->struct_mutex);
 
-	DRM_DEBUG("HW context %d destroyed\n", args->ctx_id);
+out:
+	i915_gem_context_put(ctx);
-	return 0;
+	return ret;
 }
 
@@ -1044,17 +1043,11 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
 	struct drm_i915_file_private *file_priv = file->driver_priv;
 	struct drm_i915_gem_context_param *args = data;
 	struct i915_gem_context *ctx;
-	int ret;
-
-	ret = i915_mutex_lock_interruptible(dev);
-	if (ret)
-		return ret;
+	int ret = 0;
 
 	ctx = i915_gem_context_lookup(file_priv, args->ctx_id);
-	if (IS_ERR(ctx)) {
-		mutex_unlock(&dev->struct_mutex);
-		return PTR_ERR(ctx);
-	}
+	if (!ctx)
+		return -ENOENT;
 
 	args->size = 0;
 	switch (args->param) {
@@ -1085,8 +1078,8 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
 		ret = -EINVAL;
 		break;
 	}
-	mutex_unlock(&dev->struct_mutex);
 
+	i915_gem_context_put(ctx);
 	return ret;
 }
 
@@ -1098,15 +1091,13 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
 	struct i915_gem_context *ctx;
 	int ret;
 
+	ctx = i915_gem_context_lookup(file_priv, args->ctx_id);
+	if (!ctx)
+		return -ENOENT;
+
 	ret = i915_mutex_lock_interruptible(dev);
 	if (ret)
-		return ret;
-
-	ctx = i915_gem_context_lookup(file_priv, args->ctx_id);
-	if (IS_ERR(ctx)) {
-		mutex_unlock(&dev->struct_mutex);
-		return PTR_ERR(ctx);
-	}
+		goto out;
 
 	switch (args->param) {
 	case I915_CONTEXT_PARAM_BAN_PERIOD:
@@ -1164,6 +1155,8 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
 	}
 	mutex_unlock(&dev->struct_mutex);
 
+out:
+	i915_gem_context_put(ctx);
 	return ret;
 }
 
@@ -1181,27 +1174,31 @@ int i915_gem_context_reset_stats_ioctl(struct drm_device *dev,
 	if (args->ctx_id == DEFAULT_CONTEXT_HANDLE && !capable(CAP_SYS_ADMIN))
 		return -EPERM;
 
-	ret = i915_mutex_lock_interruptible(dev);
-	if (ret)
-		return ret;
+	ret = -ENOENT;
+	rcu_read_lock();
+	ctx = __i915_gem_context_lookup_rcu(file->driver_priv, args->ctx_id);
+	if (!ctx)
+		goto out;
 
-	ctx = i915_gem_context_lookup(file->driver_priv, args->ctx_id);
-	if (IS_ERR(ctx)) {
-		mutex_unlock(&dev->struct_mutex);
-		return PTR_ERR(ctx);
-	}
+	/*
+	 * We opt for unserialised reads here. This may result in tearing
+	 * in the extremely unlikely event of a GPU hang on this context
+	 * as we are querying them. If we need that extra layer of protection,
+	 * we should wrap the hangstats with a seqlock.
+	 */
 
 	if (capable(CAP_SYS_ADMIN))
 		args->reset_count = i915_reset_count(&dev_priv->gpu_error);
 	else
 		args->reset_count = 0;
 
-	args->batch_active = ctx->guilty_count;
-	args->batch_pending = ctx->active_count;
-
-	mutex_unlock(&dev->struct_mutex);
+	args->batch_active = READ_ONCE(ctx->guilty_count);
+	args->batch_pending = READ_ONCE(ctx->active_count);
 
-	return 0;
+	ret = 0;
+out:
+	rcu_read_unlock();
+	return ret;
 }
 
 #if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
diff --git a/drivers/gpu/drm/i915/i915_gem_context.h b/drivers/gpu/drm/i915/i915_gem_context.h
index 61146f4aa168..04320f80f9f4 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.h
+++ b/drivers/gpu/drm/i915/i915_gem_context.h
@@ -100,6 +100,11 @@ struct i915_gem_context {
 	struct kref ref;
 
 	/**
+	 * @rcu: rcu_head for deferred freeing.
+	 */
+	struct rcu_head rcu;
+
+	/**
 	 * @flags: small set of booleans
 	 */
 	unsigned long flags;
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index d6eb38b12976..6cb2f8128e99 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -665,16 +665,17 @@ static int eb_select_context(struct i915_execbuffer *eb)
 	struct i915_gem_context *ctx;
 
 	ctx = i915_gem_context_lookup(eb->file->driver_priv, eb->args->rsvd1);
-	if (unlikely(IS_ERR(ctx)))
-		return PTR_ERR(ctx);
+	if (unlikely(!ctx))
+		return -ENOENT;
 
 	if (unlikely(i915_gem_context_is_banned(ctx))) {
 		DRM_DEBUG("Context %u tried to submit while banned\n",
 			  ctx->user_handle);
+		i915_gem_context_put(ctx);
 		return -EIO;
 	}
 
-	eb->ctx = i915_gem_context_get(ctx);
+	eb->ctx = ctx;
 	eb->vm = ctx->ppgtt ? &ctx->ppgtt->base : &eb->i915->ggtt.base;
 
 	eb->context_flags = 0;
@@ -2140,7 +2141,6 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 	if (DBG_FORCE_RELOC || !(args->flags & I915_EXEC_NO_RELOC))
 		args->flags |= __EXEC_HAS_RELOC;
 	eb.exec = exec;
-	eb.ctx = NULL;
 	eb.vma = (struct i915_vma **)(exec + args->buffer_count + 1);
 	eb.flags = (unsigned int *)(eb.vma + args->buffer_count + 1);
 	eb.invalid_flags = __EXEC_OBJECT_UNKNOWN_FLAGS;
@@ -2197,6 +2197,10 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 	if (eb_create(&eb))
 		return -ENOMEM;
 
+	err = eb_select_context(&eb);
+	if (unlikely(err))
+		goto err_destroy;
+
 	/*
 	 * Take a local wakeref for preparing to dispatch the execbuf as
 	 * we expect to access the hardware fairly frequently in the
@@ -2205,14 +2209,11 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 	 * 100ms.
 	 */
 	intel_runtime_pm_get(eb.i915);
+
 	err = i915_mutex_lock_interruptible(dev);
 	if (err)
 		goto err_rpm;
 
-	err = eb_select_context(&eb);
-	if (unlikely(err))
-		goto err_unlock;
-
 	err = eb_lookup_vmas(&eb);
 	if (likely(!err && args->flags & __EXEC_HAS_RELOC))
 		err = eb_relocate(&eb);
@@ -2351,11 +2352,11 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 		i915_vma_unpin(eb.batch);
 err_vma:
 	eb_release_vmas(&eb);
-	i915_gem_context_put(eb.ctx);
-err_unlock:
 	mutex_unlock(&dev->struct_mutex);
 err_rpm:
 	intel_runtime_pm_put(eb.i915);
+	i915_gem_context_put(eb.ctx);
+err_destroy:
 	eb_destroy(&eb);
 	if (out_fence_fd != -1)
 		put_unused_fd(out_fence_fd);
-- 
2.11.0

* [PATCH 23/24] drm/i915: Keep a recent cache of freed context objects for reuse
  2017-05-18  9:46 [PATCH 01/24] drm/i915/selftests: Pretend to be a gfx pci device Chris Wilson
                   ` (20 preceding siblings ...)
  2017-05-18  9:46 ` [PATCH 22/24] drm/i915: Enable rcu-only context lookups Chris Wilson
@ 2017-05-18  9:46 ` Chris Wilson
  2017-05-18 13:52   ` Tvrtko Ursulin
  2017-05-22 10:51   ` Tvrtko Ursulin
  2017-05-18  9:46 ` [PATCH 24/24] RFC drm/i915: Expose a PMU interface for perf queries Chris Wilson
                   ` (3 subsequent siblings)
  25 siblings, 2 replies; 47+ messages in thread
From: Chris Wilson @ 2017-05-18  9:46 UTC (permalink / raw)
  To: intel-gfx

Keep the recently freed context objects for reuse. This allows us to
reuse the existing GGTT bindings and dma-bound pages, avoiding any
clflushes that would otherwise be required. We mark the objects as
purgeable under memory pressure, and
reap the list of freed objects as soon as the device is idle.
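
In outline, the allocation path scans the parked objects first; a
condensed sketch of i915_gem_context_create_object() from the diff
below, with comments on the madv handshake:

struct drm_i915_gem_object *
i915_gem_context_create_object(struct drm_i915_private *i915,
			       unsigned long size)
{
	struct drm_i915_gem_object *obj, *on;

	list_for_each_entry_safe(obj, on, &i915->contexts.freed_objects,
				 batch_pool_link) {
		if (obj->mm.madv != I915_MADV_DONTNEED) {
			/* The shrinker purged its pages; discard it. */
			list_del(&obj->batch_pool_link);
			i915_gem_object_put(obj);
			continue;
		}

		if (obj->base.size == size) {
			/* Cache hit: GGTT bindings and dma pages intact. */
			list_del(&obj->batch_pool_link);
			obj->mm.madv = I915_MADV_WILLNEED;
			return obj;
		}
	}

	return i915_gem_object_create(i915, size); /* cache miss */
}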

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/i915_drv.h                  |  2 +
 drivers/gpu/drm/i915/i915_gem.c                  |  1 +
 drivers/gpu/drm/i915/i915_gem_context.c          | 59 ++++++++++++++++++++++--
 drivers/gpu/drm/i915/i915_gem_context.h          |  5 ++
 drivers/gpu/drm/i915/intel_lrc.c                 |  2 +-
 drivers/gpu/drm/i915/intel_ringbuffer.c          |  2 +-
 drivers/gpu/drm/i915/selftests/mock_context.c    |  1 +
 drivers/gpu/drm/i915/selftests/mock_gem_device.c |  1 +
 8 files changed, 68 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 1d55bbde68df..1fa1e7d48f02 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2316,6 +2316,8 @@ struct drm_i915_private {
 		struct llist_head free_list;
 		struct work_struct free_work;
 
+		struct list_head freed_objects;
+
 		/* The hw wants to have a stable context identifier for the
 		 * lifetime of the context (for OA, PASID, faults, etc).
 		 * This is limited in execlists to 21 bits.
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index c044b5dbdb66..f482c320d810 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -3212,6 +3212,7 @@ i915_gem_idle_work_handler(struct work_struct *work)
 		DRM_ERROR("Timeout waiting for engines to idle\n");
 
 	intel_engines_mark_idle(dev_priv);
+	i915_gem_contexts_mark_idle(dev_priv);
 	i915_gem_timelines_mark_idle(dev_priv);
 
 	GEM_BUG_ON(!dev_priv->gt.awake);
diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 198b0064a3d3..1e51fd36f355 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -96,6 +96,48 @@
 /* Initial size (as log2) to preallocate the handle->object hashtable */
 #define VMA_HT_BITS 2u /* 4 x 2 pointers, 64 bytes minimum */
 
+struct drm_i915_gem_object *
+i915_gem_context_create_object(struct drm_i915_private *i915,
+			       unsigned long size)
+{
+	struct drm_i915_gem_object *obj, *on;
+
+	lockdep_assert_held(&i915->drm.struct_mutex);
+
+	list_for_each_entry_safe(obj, on,
+				 &i915->contexts.freed_objects,
+				 batch_pool_link) {
+		if (obj->mm.madv != I915_MADV_DONTNEED) {
+			/* Purge the heretic! */
+			list_del(&obj->batch_pool_link);
+			i915_gem_object_put(obj);
+			continue;
+		}
+
+		if (obj->base.size == size) {
+			list_del(&obj->batch_pool_link);
+			obj->mm.madv = I915_MADV_WILLNEED;
+			return obj;
+		}
+	}
+
+	return i915_gem_object_create(i915, size);
+}
+
+void i915_gem_contexts_mark_idle(struct drm_i915_private *i915)
+{
+	struct drm_i915_gem_object *obj, *on;
+
+	lockdep_assert_held(&i915->drm.struct_mutex);
+
+	list_for_each_entry_safe(obj, on,
+				 &i915->contexts.freed_objects,
+				 batch_pool_link) {
+		list_del(&obj->batch_pool_link);
+		i915_gem_object_put(obj);
+	}
+}
+
 static void resize_vma_ht(struct work_struct *work)
 {
 	struct i915_gem_context_vma_lut *lut =
@@ -160,9 +202,10 @@ static void vma_lut_free(struct i915_gem_context *ctx)
 
 static void i915_gem_context_free(struct i915_gem_context *ctx)
 {
+	struct drm_i915_private *i915 = ctx->i915;
 	int i;
 
-	lockdep_assert_held(&ctx->i915->drm.struct_mutex);
+	lockdep_assert_held(&i915->drm.struct_mutex);
 	GEM_BUG_ON(!i915_gem_context_is_closed(ctx));
 
 	vma_lut_free(ctx);
@@ -178,7 +221,11 @@ static void i915_gem_context_free(struct i915_gem_context *ctx)
 		if (ce->ring)
 			intel_ring_free(ce->ring);
 
-		__i915_gem_object_release_unless_active(ce->state->obj);
+		/* Keep the objects around for quick reuse */
+		GEM_BUG_ON(ce->state->obj->mm.madv != I915_MADV_WILLNEED);
+		ce->state->obj->mm.madv = I915_MADV_DONTNEED;
+		list_add(&ce->state->obj->batch_pool_link,
+			 &i915->contexts.freed_objects);
 	}
 
 	kfree(ctx->name);
@@ -186,7 +233,7 @@ static void i915_gem_context_free(struct i915_gem_context *ctx)
 
 	list_del(&ctx->link);
 
-	ida_simple_remove(&ctx->i915->contexts.hw_ida, ctx->hw_id);
+	ida_simple_remove(&i915->contexts.hw_ida, ctx->hw_id);
 	kfree_rcu(ctx, rcu);
 }
 
@@ -457,6 +504,7 @@ int i915_gem_contexts_init(struct drm_i915_private *dev_priv)
 		return 0;
 
 	INIT_LIST_HEAD(&dev_priv->contexts.list);
+	INIT_LIST_HEAD(&dev_priv->contexts.freed_objects);
 	INIT_WORK(&dev_priv->contexts.free_work, contexts_free_worker);
 	init_llist_head(&dev_priv->contexts.free_list);
 
@@ -542,12 +590,17 @@ void i915_gem_contexts_fini(struct drm_i915_private *i915)
 
 	lockdep_assert_held(&i915->drm.struct_mutex);
 
+	GEM_BUG_ON(work_pending(&i915->contexts.free_work));
+
 	/* Keep the context so that we can free it immediately ourselves */
 	ctx = i915_gem_context_get(fetch_and_zero(&i915->kernel_context));
 	GEM_BUG_ON(!i915_gem_context_is_kernel(ctx));
 	context_close(ctx);
 	i915_gem_context_free(ctx);
 
+	i915_gem_contexts_mark_idle(i915);
+	GEM_BUG_ON(!list_empty(&i915->contexts.freed_objects));
+
 	/* Must free all deferred contexts (via flush_workqueue) first */
 	ida_destroy(&i915->contexts.hw_ida);
 }
diff --git a/drivers/gpu/drm/i915/i915_gem_context.h b/drivers/gpu/drm/i915/i915_gem_context.h
index 04320f80f9f4..4654caab9b6c 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.h
+++ b/drivers/gpu/drm/i915/i915_gem_context.h
@@ -294,6 +294,11 @@ void i915_gem_context_release(struct kref *ctx_ref);
 struct i915_gem_context *
 i915_gem_context_create_gvt(struct drm_device *dev);
 
+struct drm_i915_gem_object *
+i915_gem_context_create_object(struct drm_i915_private *i915,
+			       unsigned long size);
+void i915_gem_contexts_mark_idle(struct drm_i915_private *i915);
+
 int i915_gem_context_create_ioctl(struct drm_device *dev, void *data,
 				  struct drm_file *file);
 int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 42ab91acccc5..21a1519db936 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -2011,7 +2011,7 @@ static int execlists_context_deferred_alloc(struct i915_gem_context *ctx,
 	/* One extra page as the sharing data between driver and GuC */
 	context_size += PAGE_SIZE * LRC_PPHWSP_PN;
 
-	ctx_obj = i915_gem_object_create(ctx->i915, context_size);
+	ctx_obj = i915_gem_context_create_object(ctx->i915, context_size);
 	if (IS_ERR(ctx_obj)) {
 		DRM_DEBUG_DRIVER("Alloc LRC backing obj failed.\n");
 		return PTR_ERR(ctx_obj);
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index acd1da9b62a3..253d32fef133 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -1454,7 +1454,7 @@ alloc_context_vma(struct intel_engine_cs *engine)
 	struct drm_i915_gem_object *obj;
 	struct i915_vma *vma;
 
-	obj = i915_gem_object_create(i915, engine->context_size);
+	obj = i915_gem_context_create_object(i915, engine->context_size);
 	if (IS_ERR(obj))
 		return ERR_CAST(obj);
 
diff --git a/drivers/gpu/drm/i915/selftests/mock_context.c b/drivers/gpu/drm/i915/selftests/mock_context.c
index 9c7c68181f82..f4972674c96e 100644
--- a/drivers/gpu/drm/i915/selftests/mock_context.c
+++ b/drivers/gpu/drm/i915/selftests/mock_context.c
@@ -92,6 +92,7 @@ void mock_init_contexts(struct drm_i915_private *i915)
 	INIT_LIST_HEAD(&i915->contexts.list);
 	ida_init(&i915->contexts.hw_ida);
 
+	INIT_LIST_HEAD(&i915->contexts.freed_objects);
 	INIT_WORK(&i915->contexts.free_work, contexts_free_worker);
 	init_llist_head(&i915->contexts.free_list);
 }
diff --git a/drivers/gpu/drm/i915/selftests/mock_gem_device.c b/drivers/gpu/drm/i915/selftests/mock_gem_device.c
index 47613d20bba8..5eb931ecf0a4 100644
--- a/drivers/gpu/drm/i915/selftests/mock_gem_device.c
+++ b/drivers/gpu/drm/i915/selftests/mock_gem_device.c
@@ -53,6 +53,7 @@ static void mock_device_release(struct drm_device *dev)
 
 	mutex_lock(&i915->drm.struct_mutex);
 	mock_device_flush(i915);
+	i915_gem_contexts_lost(i915);
 	mutex_unlock(&i915->drm.struct_mutex);
 
 	cancel_delayed_work_sync(&i915->gt.retire_work);
-- 
2.11.0

* [PATCH 24/24] RFC drm/i915: Expose a PMU interface for perf queries
  2017-05-18  9:46 [PATCH 01/24] drm/i915/selftests: Pretend to be a gfx pci device Chris Wilson
                   ` (21 preceding siblings ...)
  2017-05-18  9:46 ` [PATCH 23/24] drm/i915: Keep a recent cache of freed context objects for reuse Chris Wilson
@ 2017-05-18  9:46 ` Chris Wilson
  2017-05-18 23:48   ` Dmitry Rogozhkin
  2017-06-15 11:17   ` Tvrtko Ursulin
  2017-05-18 12:31 ` ✓ Fi.CI.BAT: success for series starting with [01/24] drm/i915/selftests: Pretend to be a gfx pci device Patchwork
                   ` (2 subsequent siblings)
  25 siblings, 2 replies; 47+ messages in thread
From: Chris Wilson @ 2017-05-18  9:46 UTC (permalink / raw)
  To: intel-gfx

The first goal is to be able to measure GPU (and individual ring) busyness
without having to poll registers from userspace. (Polling not only entails
holding the forcewake lock indefinitely, perturbing the system, but also
runs the risk of hanging the machine.) As an alternative we can use the
perf event counter interface to sample the ring registers periodically
and send those results to userspace.

To be able to do so, we need to export the two symbols from
kernel/events/core.c to register and unregister a PMU device.

v2: Use a common timer for the ring sampling.
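
As a usage illustration only (not part of the patch): a hypothetical
userspace consumer would read these counters through the perf syscall,
picking up the dynamic PMU type from the sysfs node that
perf_pmu_register() creates. This sketch assumes a permissive
perf_event_paranoid setting; the config encoding (engine << 2 | sample)
follows the uapi additions below, so config 0 is the render ring busy
counter:

#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(void)
{
	struct perf_event_attr attr;
	uint64_t busy_ns;
	int fd, type;
	FILE *f;

	/* perf_pmu_register() exposes the dynamic type id here. */
	f = fopen("/sys/bus/event_source/devices/i915/type", "r");
	if (!f || fscanf(f, "%d", &type) != 1)
		return 1;
	fclose(f);

	memset(&attr, 0, sizeof(attr));
	attr.type = type;
	attr.size = sizeof(attr);
	attr.config = 0; /* render ring, I915_SAMPLE_BUSY */

	/* System-wide counter: pid == -1 requires a valid cpu. */
	fd = syscall(__NR_perf_event_open, &attr, -1, 0, -1, 0);
	if (fd < 0)
		return 1;

	sleep(1);
	if (read(fd, &busy_ns, sizeof(busy_ns)) != sizeof(busy_ns))
		return 1;

	printf("render busy: %llu ns\n", (unsigned long long)busy_ns);
	return 0;
}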

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/Makefile           |   1 +
 drivers/gpu/drm/i915/i915_drv.c         |   2 +
 drivers/gpu/drm/i915/i915_drv.h         |  23 ++
 drivers/gpu/drm/i915/i915_pmu.c         | 449 ++++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/intel_ringbuffer.h |   2 +
 include/uapi/drm/i915_drm.h             |  40 +++
 kernel/events/core.c                    |   1 +
 7 files changed, 518 insertions(+)
 create mode 100644 drivers/gpu/drm/i915/i915_pmu.c

diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
index 7b05fb802f4c..ca88e6e67910 100644
--- a/drivers/gpu/drm/i915/Makefile
+++ b/drivers/gpu/drm/i915/Makefile
@@ -26,6 +26,7 @@ i915-y := i915_drv.o \
 
 i915-$(CONFIG_COMPAT)   += i915_ioc32.o
 i915-$(CONFIG_DEBUG_FS) += i915_debugfs.o intel_pipe_crc.o
+i915-$(CONFIG_PERF_EVENTS) += i915_pmu.o
 
 # GEM code
 i915-y += i915_cmd_parser.o \
diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 2d2fb4327f97..e3c6d052d1c9 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1144,6 +1144,7 @@ static void i915_driver_register(struct drm_i915_private *dev_priv)
 	struct drm_device *dev = &dev_priv->drm;
 
 	i915_gem_shrinker_init(dev_priv);
+	i915_pmu_register(dev_priv);
 
 	/*
 	 * Notify a valid surface after modesetting,
@@ -1197,6 +1198,7 @@ static void i915_driver_unregister(struct drm_i915_private *dev_priv)
 	intel_opregion_unregister(dev_priv);
 
 	i915_perf_unregister(dev_priv);
+	i915_pmu_unregister(dev_priv);
 
 	i915_teardown_sysfs(dev_priv);
 	i915_guc_log_unregister(dev_priv);
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 1fa1e7d48f02..10beae1a13c8 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -40,6 +40,7 @@
 #include <linux/hash.h>
 #include <linux/intel-iommu.h>
 #include <linux/kref.h>
+#include <linux/perf_event.h>
 #include <linux/pm_qos.h>
 #include <linux/reservation.h>
 #include <linux/shmem_fs.h>
@@ -2075,6 +2076,12 @@ struct intel_cdclk_state {
 	unsigned int cdclk, vco, ref;
 };
 
+enum {
+	__I915_SAMPLE_FREQ_ACT = 0,
+	__I915_SAMPLE_FREQ_REQ,
+	__I915_NUM_PMU_SAMPLERS
+};
+
 struct drm_i915_private {
 	struct drm_device drm;
 
@@ -2564,6 +2571,13 @@ struct drm_i915_private {
 		int	irq;
 	} lpe_audio;
 
+	struct {
+		struct pmu base;
+		struct hrtimer timer;
+		u64 enable;
+		u64 sample[__I915_NUM_PMU_SAMPLERS];
+	} pmu;
+
 	/*
 	 * NOTE: This is the dri1/ums dungeon, don't add stuff here. Your patch
 	 * will be rejected. Instead look for a better place.
@@ -3681,6 +3695,15 @@ extern void i915_perf_fini(struct drm_i915_private *dev_priv);
 extern void i915_perf_register(struct drm_i915_private *dev_priv);
 extern void i915_perf_unregister(struct drm_i915_private *dev_priv);
 
+/* i915_pmu.c */
+#ifdef CONFIG_PERF_EVENTS
+extern void i915_pmu_register(struct drm_i915_private *i915);
+extern void i915_pmu_unregister(struct drm_i915_private *i915);
+#else
+static inline void i915_pmu_register(struct drm_i915_private *i915) {}
+static inline void i915_pmu_unregister(struct drm_i915_private *i915) {}
+#endif
+
 /* i915_suspend.c */
 extern int i915_save_state(struct drm_i915_private *dev_priv);
 extern int i915_restore_state(struct drm_i915_private *dev_priv);
diff --git a/drivers/gpu/drm/i915/i915_pmu.c b/drivers/gpu/drm/i915/i915_pmu.c
new file mode 100644
index 000000000000..80e1c07841ac
--- /dev/null
+++ b/drivers/gpu/drm/i915/i915_pmu.c
@@ -0,0 +1,449 @@
+#include <linux/perf_event.h>
+#include <linux/pm_runtime.h>
+
+#include "i915_drv.h"
+#include "intel_ringbuffer.h"
+
+#define FREQUENCY 200
+#define PERIOD max_t(u64, 10000, NSEC_PER_SEC / FREQUENCY)
+
+#define RING_MASK 0xffffffff
+#define RING_MAX 32
+
+static void engines_sample(struct drm_i915_private *dev_priv)
+{
+	struct intel_engine_cs *engine;
+	enum intel_engine_id id;
+	bool fw = false;
+
+	if ((dev_priv->pmu.enable & RING_MASK) == 0)
+		return;
+
+	if (!dev_priv->gt.awake)
+		return;
+
+	if (!intel_runtime_pm_get_if_in_use(dev_priv))
+		return;
+
+	for_each_engine(engine, dev_priv, id) {
+		u32 val;
+
+		if ((dev_priv->pmu.enable & (0x7 << (4*id))) == 0)
+			continue;
+
+		if (i915_seqno_passed(intel_engine_get_seqno(engine),
+				      intel_engine_last_submit(engine)))
+			continue;
+
+		if (!fw) {
+			intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
+			fw = true;
+		}
+
+		val = I915_READ_FW(RING_MI_MODE(engine->mmio_base));
+		if (!(val & MODE_IDLE))
+			engine->pmu_sample[I915_SAMPLE_BUSY] += PERIOD;
+
+		val = I915_READ_FW(RING_CTL(engine->mmio_base));
+		if (val & RING_WAIT)
+			engine->pmu_sample[I915_SAMPLE_WAIT] += PERIOD;
+		if (val & RING_WAIT_SEMAPHORE)
+			engine->pmu_sample[I915_SAMPLE_SEMA] += PERIOD;
+	}
+
+	if (fw)
+		intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
+	intel_runtime_pm_put(dev_priv);
+}
+
+static void frequency_sample(struct drm_i915_private *dev_priv)
+{
+	if (dev_priv->pmu.enable & BIT_ULL(I915_PMU_ACTUAL_FREQUENCY)) {
+		u64 val;
+
+		val = dev_priv->rps.cur_freq;
+		if (dev_priv->gt.awake &&
+		    intel_runtime_pm_get_if_in_use(dev_priv)) {
+			val = I915_READ_NOTRACE(GEN6_RPSTAT1);
+			if (INTEL_GEN(dev_priv) >= 9)
+				val = (val & GEN9_CAGF_MASK) >> GEN9_CAGF_SHIFT;
+			else if (IS_HASWELL(dev_priv) || INTEL_GEN(dev_priv) >= 8)
+				val = (val & HSW_CAGF_MASK) >> HSW_CAGF_SHIFT;
+			else
+				val = (val & GEN6_CAGF_MASK) >> GEN6_CAGF_SHIFT;
+			intel_runtime_pm_put(dev_priv);
+		}
+		val = intel_gpu_freq(dev_priv, val);
+		dev_priv->pmu.sample[__I915_SAMPLE_FREQ_ACT] += val * PERIOD;
+	}
+
+	if (dev_priv->pmu.enable & BIT_ULL(I915_PMU_REQUESTED_FREQUENCY)) {
+		u64 val = intel_gpu_freq(dev_priv, dev_priv->rps.cur_freq);
+		dev_priv->pmu.sample[__I915_SAMPLE_FREQ_REQ] += val * PERIOD;
+	}
+}
+
+static enum hrtimer_restart i915_sample(struct hrtimer *hrtimer)
+{
+	struct drm_i915_private *i915 =
+		container_of(hrtimer, struct drm_i915_private, pmu.timer);
+
+	if (i915->pmu.enable == 0)
+		return HRTIMER_NORESTART;
+
+	engines_sample(i915);
+	frequency_sample(i915);
+
+	hrtimer_forward_now(hrtimer, ns_to_ktime(PERIOD));
+	return HRTIMER_RESTART;
+}
+
+static void i915_pmu_event_destroy(struct perf_event *event)
+{
+	WARN_ON(event->parent);
+}
+
+static int engine_event_init(struct perf_event *event)
+{
+	struct drm_i915_private *i915 =
+		container_of(event->pmu, typeof(*i915), pmu.base);
+	int engine = event->attr.config >> 2;
+	int sample = event->attr.config & 3;
+
+	switch (sample) {
+	case I915_SAMPLE_BUSY:
+	case I915_SAMPLE_WAIT:
+		break;
+	case I915_SAMPLE_SEMA:
+		if (INTEL_GEN(i915) < 6)
+			return -ENODEV;
+		break;
+	default:
+		return -ENOENT;
+	}
+
+	if (engine >= I915_NUM_ENGINES)
+		return -ENOENT;
+
+	if (!i915->engine[engine])
+		return -ENODEV;
+
+	return 0;
+}
+
+static enum hrtimer_restart hrtimer_sample(struct hrtimer *hrtimer)
+{
+	struct perf_sample_data data;
+	struct perf_event *event;
+	u64 period;
+
+	event = container_of(hrtimer, struct perf_event, hw.hrtimer);
+	if (event->state != PERF_EVENT_STATE_ACTIVE)
+		return HRTIMER_NORESTART;
+
+	event->pmu->read(event);
+
+	perf_sample_data_init(&data, 0, event->hw.last_period);
+	perf_event_overflow(event, &data, NULL);
+
+	period = max_t(u64, 10000, event->hw.sample_period);
+	hrtimer_forward_now(hrtimer, ns_to_ktime(period));
+	return HRTIMER_RESTART;
+}
+
+static void init_hrtimer(struct perf_event *event)
+{
+	struct hw_perf_event *hwc = &event->hw;
+
+	if (!is_sampling_event(event))
+		return;
+
+	hrtimer_init(&hwc->hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	hwc->hrtimer.function = hrtimer_sample;
+
+	if (event->attr.freq) {
+		long freq = event->attr.sample_freq;
+
+		event->attr.sample_period = NSEC_PER_SEC / freq;
+		hwc->sample_period = event->attr.sample_period;
+		local64_set(&hwc->period_left, hwc->sample_period);
+		hwc->last_period = hwc->sample_period;
+		event->attr.freq = 0;
+	}
+}
+
+static int i915_pmu_event_init(struct perf_event *event)
+{
+	struct drm_i915_private *i915 =
+		container_of(event->pmu, typeof(*i915), pmu.base);
+	int ret;
+
+	/* XXX ideally only want pid == -1 && cpu == -1 */
+
+	if (event->attr.type != event->pmu->type)
+		return -ENOENT;
+
+	if (has_branch_stack(event))
+		return -EOPNOTSUPP;
+
+	ret = 0;
+	if (event->attr.config < RING_MAX) {
+		ret = engine_event_init(event);
+	} else switch (event->attr.config) {
+	case I915_PMU_ACTUAL_FREQUENCY:
+		if (IS_VALLEYVIEW(i915) || IS_CHERRYVIEW(i915))
+			ret = -ENODEV; /* requires a mutex for sampling! */
+	case I915_PMU_REQUESTED_FREQUENCY:
+	case I915_PMU_ENERGY:
+	case I915_PMU_RC6_RESIDENCY:
+	case I915_PMU_RC6p_RESIDENCY:
+	case I915_PMU_RC6pp_RESIDENCY:
+		if (INTEL_GEN(i915) < 6)
+			ret = -ENODEV;
+		break;
+	}
+	if (ret)
+		return ret;
+
+	if (!event->parent)
+		event->destroy = i915_pmu_event_destroy;
+
+	init_hrtimer(event);
+
+	return 0;
+}
+
+static void i915_pmu_timer_start(struct perf_event *event)
+{
+	struct hw_perf_event *hwc = &event->hw;
+	s64 period;
+
+	if (!is_sampling_event(event))
+		return;
+
+	period = local64_read(&hwc->period_left);
+	if (period) {
+		if (period < 0)
+			period = 10000;
+
+		local64_set(&hwc->period_left, 0);
+	} else {
+		period = max_t(u64, 10000, hwc->sample_period);
+	}
+
+	hrtimer_start_range_ns(&hwc->hrtimer,
+			       ns_to_ktime(period), 0,
+			       HRTIMER_MODE_REL_PINNED);
+}
+
+static void i915_pmu_timer_cancel(struct perf_event *event)
+{
+	struct hw_perf_event *hwc = &event->hw;
+
+	if (!is_sampling_event(event))
+		return;
+
+	local64_set(&hwc->period_left,
+		    ktime_to_ns(hrtimer_get_remaining(&hwc->hrtimer)));
+	hrtimer_cancel(&hwc->hrtimer);
+}
+
+static void i915_pmu_enable(struct perf_event *event)
+{
+	struct drm_i915_private *i915 =
+		container_of(event->pmu, typeof(*i915), pmu.base);
+
+	if (i915->pmu.enable == 0)
+		hrtimer_start_range_ns(&i915->pmu.timer,
+				       ns_to_ktime(PERIOD), 0,
+				       HRTIMER_MODE_REL_PINNED);
+
+	i915->pmu.enable |= BIT_ULL(event->attr.config);
+
+	i915_pmu_timer_start(event);
+}
+
+static void i915_pmu_disable(struct perf_event *event)
+{
+	struct drm_i915_private *i915 =
+		container_of(event->pmu, typeof(*i915), pmu.base);
+
+	i915->pmu.enable &= ~BIT_ULL(event->attr.config);
+	i915_pmu_timer_cancel(event);
+}
+
+static int i915_pmu_event_add(struct perf_event *event, int flags)
+{
+	struct hw_perf_event *hwc = &event->hw;
+
+	if (flags & PERF_EF_START)
+		i915_pmu_enable(event);
+
+	hwc->state = !(flags & PERF_EF_START);
+
+	return 0;
+}
+
+static void i915_pmu_event_del(struct perf_event *event, int flags)
+{
+	i915_pmu_disable(event);
+}
+
+static void i915_pmu_event_start(struct perf_event *event, int flags)
+{
+	i915_pmu_enable(event);
+}
+
+static void i915_pmu_event_stop(struct perf_event *event, int flags)
+{
+	i915_pmu_disable(event);
+}
+
+static u64 read_energy_uJ(struct drm_i915_private *dev_priv)
+{
+	u64 power;
+
+	GEM_BUG_ON(INTEL_GEN(dev_priv) < 6);
+
+	intel_runtime_pm_get(dev_priv);
+
+	rdmsrl(MSR_RAPL_POWER_UNIT, power);
+	power = (power & 0x1f00) >> 8;
+	power = 1000000 >> power; /* convert to uJ */
+	power *= I915_READ_NOTRACE(MCH_SECP_NRG_STTS);
+
+	intel_runtime_pm_put(dev_priv);
+
+	return power;
+}
+
+static inline u64 calc_residency(struct drm_i915_private *dev_priv,
+				 const i915_reg_t reg)
+{
+	u64 val, units = 128, div = 100000;
+
+	GEM_BUG_ON(INTEL_GEN(dev_priv) < 6);
+
+	intel_runtime_pm_get(dev_priv);
+	if (IS_VALLEYVIEW(dev_priv) || IS_CHERRYVIEW(dev_priv)) {
+		div = dev_priv->czclk_freq;
+		units = 1;
+		if (I915_READ_NOTRACE(VLV_COUNTER_CONTROL) & VLV_COUNT_RANGE_HIGH)
+			units <<= 8;
+	} else if (IS_GEN9_LP(dev_priv)) {
+		div = 1200;
+		units = 1;
+	}
+	val = I915_READ_NOTRACE(reg);
+	intel_runtime_pm_put(dev_priv);
+
+	val *= units;
+	return DIV_ROUND_UP_ULL(val, div);
+}
+
+static u64 count_interrupts(struct drm_i915_private *i915)
+{
+	/* open-coded kstat_irqs() */
+	struct irq_desc *desc = irq_to_desc(i915->drm.pdev->irq);
+	u64 sum = 0;
+	int cpu;
+
+	if (!desc || !desc->kstat_irqs)
+		return 0;
+
+	for_each_possible_cpu(cpu)
+		sum += *per_cpu_ptr(desc->kstat_irqs, cpu);
+
+	return sum;
+}
+
+static void i915_pmu_event_read(struct perf_event *event)
+{
+	struct drm_i915_private *i915 =
+		container_of(event->pmu, typeof(*i915), pmu.base);
+	u64 val = 0;
+
+	if (event->attr.config < 32) {
+		int engine = event->attr.config >> 2;
+		int sample = event->attr.config & 3;
+		val = i915->engine[engine]->pmu_sample[sample];
+	} else switch (event->attr.config) {
+	case I915_PMU_ACTUAL_FREQUENCY:
+		val = i915->pmu.sample[__I915_SAMPLE_FREQ_ACT];
+		break;
+	case I915_PMU_REQUESTED_FREQUENCY:
+		val = i915->pmu.sample[__I915_SAMPLE_FREQ_REQ];
+		break;
+	case I915_PMU_ENERGY:
+		val = read_energy_uJ(i915);
+		break;
+	case I915_PMU_INTERRUPTS:
+		val = count_interrupts(i915);
+		break;
+
+	case I915_PMU_RC6_RESIDENCY:
+		if (!i915->gt.awake)
+			return;
+
+		val = calc_residency(i915, IS_VALLEYVIEW(i915) ? VLV_GT_RENDER_RC6 : GEN6_GT_GFX_RC6);
+		break;
+
+	case I915_PMU_RC6p_RESIDENCY:
+		if (!i915->gt.awake)
+			return;
+
+		if (!IS_VALLEYVIEW(i915))
+			val = calc_residency(i915, GEN6_GT_GFX_RC6p);
+		break;
+
+	case I915_PMU_RC6pp_RESIDENCY:
+		if (!i915->gt.awake)
+			return;
+
+		if (!IS_VALLEYVIEW(i915))
+			val = calc_residency(i915, GEN6_GT_GFX_RC6pp);
+		break;
+	}
+
+	local64_set(&event->count, val);
+}
+
+static int i915_pmu_event_event_idx(struct perf_event *event)
+{
+	return 0;
+}
+
+void i915_pmu_register(struct drm_i915_private *i915)
+{
+	if (INTEL_GEN(i915) <= 2)
+		return;
+
+	i915->pmu.base.task_ctx_nr	= perf_sw_context;
+	i915->pmu.base.event_init	= i915_pmu_event_init;
+	i915->pmu.base.add		= i915_pmu_event_add;
+	i915->pmu.base.del		= i915_pmu_event_del;
+	i915->pmu.base.start		= i915_pmu_event_start;
+	i915->pmu.base.stop		= i915_pmu_event_stop;
+	i915->pmu.base.read		= i915_pmu_event_read;
+	i915->pmu.base.event_idx	= i915_pmu_event_event_idx;
+
+	hrtimer_init(&i915->pmu.timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	i915->pmu.timer.function = i915_sample;
+	i915->pmu.enable = 0;
+
+	if (perf_pmu_register(&i915->pmu.base, "i915", -1))
+		i915->pmu.base.event_init = NULL;
+}
+
+void i915_pmu_unregister(struct drm_i915_private *i915)
+{
+	if (!i915->pmu.base.event_init)
+		return;
+
+	i915->pmu.enable = 0;
+
+	perf_pmu_unregister(&i915->pmu.base);
+	i915->pmu.base.event_init = NULL;
+
+	hrtimer_cancel(&i915->pmu.timer);
+}
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index 6aa20ac8cde3..084fa7816256 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -244,6 +244,8 @@ struct intel_engine_cs {
 		I915_SELFTEST_DECLARE(bool mock : 1);
 	} breadcrumbs;
 
+	u64 pmu_sample[3];
+
 	/*
 	 * A pool of objects to use as shadow copies of client batch buffers
 	 * when the command parser is enabled. Prevents the client from
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index 34ee011f08ac..e9375ff29371 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -86,6 +86,46 @@ enum i915_mocs_table_index {
 	I915_MOCS_CACHED,
 };
 
+/**
+ * DOC: perf_events exposed by i915 through /sys/bus/event_sources/drivers/i915
+ * Counters for engine busyness, frequency, energy, interrupts and RC6.
+ */
+#define I915_SAMPLE_BUSY	0
+#define I915_SAMPLE_WAIT	1
+#define I915_SAMPLE_SEMA	2
+
+#define I915_SAMPLE_RCS		0
+#define I915_SAMPLE_VCS		1
+#define I915_SAMPLE_BCS		2
+#define I915_SAMPLE_VECS	3
+
+#define __I915_PMU_COUNT(ring, id) ((ring) << 2 | (id))
+
+#define I915_PMU_COUNT_RCS_BUSY __I915_PMU_COUNT(I915_SAMPLE_RCS, I915_SAMPLE_BUSY)
+#define I915_PMU_COUNT_RCS_WAIT __I915_PMU_COUNT(I915_SAMPLE_RCS, I915_SAMPLE_WAIT)
+#define I915_PMU_COUNT_RCS_SEMA __I915_PMU_COUNT(I915_SAMPLE_RCS, I915_SAMPLE_SEMA)
+
+#define I915_PMU_COUNT_VCS_BUSY __I915_PMU_COUNT(I915_SAMPLE_VCS, I915_SAMPLE_BUSY)
+#define I915_PMU_COUNT_VCS_WAIT __I915_PMU_COUNT(I915_SAMPLE_VCS, I915_SAMPLE_WAIT)
+#define I915_PMU_COUNT_VCS_SEMA __I915_PMU_COUNT(I915_SAMPLE_VCS, I915_SAMPLE_SEMA)
+
+#define I915_PMU_COUNT_BCS_BUSY __I915_PMU_COUNT(I915_SAMPLE_BCS, I915_SAMPLE_BUSY)
+#define I915_PMU_COUNT_BCS_WAIT __I915_PMU_COUNT(I915_SAMPLE_BCS, I915_SAMPLE_WAIT)
+#define I915_PMU_COUNT_BCS_SEMA __I915_PMU_COUNT(I915_SAMPLE_BCS, I915_SAMPLE_SEMA)
+
+#define I915_PMU_COUNT_VECS_BUSY __I915_PMU_COUNT(I915_SAMPLE_VECS, I915_SAMPLE_BUSY)
+#define I915_PMU_COUNT_VECS_WAIT __I915_PMU_COUNT(I915_SAMPLE_VECS, I915_SAMPLE_WAIT)
+#define I915_PMU_COUNT_VECS_SEMA __I915_PMU_COUNT(I915_SAMPLE_VECS, I915_SAMPLE_SEMA)
+
+#define I915_PMU_ACTUAL_FREQUENCY 32
+#define I915_PMU_REQUESTED_FREQUENCY 33
+#define I915_PMU_ENERGY 34
+#define I915_PMU_INTERRUPTS 35
+
+#define I915_PMU_RC6_RESIDENCY		40
+#define I915_PMU_RC6p_RESIDENCY	41
+#define I915_PMU_RC6pp_RESIDENCY	42
+
 /* Each region is a minimum of 16k, and there are at most 255 of them.
  */
 #define I915_NR_TEX_REGIONS 255	/* table size 2k - maximum due to use
diff --git a/kernel/events/core.c b/kernel/events/core.c
index f811dd20bbc1..6351ed8a2e56 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7365,6 +7365,7 @@ int perf_event_overflow(struct perf_event *event,
 {
 	return __perf_event_overflow(event, 1, data, regs);
 }
+EXPORT_SYMBOL_GPL(perf_event_overflow);
 
 /*
  * Generic software event infrastructure
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* ✓ Fi.CI.BAT: success for series starting with [01/24] drm/i915/selftests: Pretend to be a gfx pci device
  2017-05-18  9:46 [PATCH 01/24] drm/i915/selftests: Pretend to be a gfx pci device Chris Wilson
                   ` (22 preceding siblings ...)
  2017-05-18  9:46 ` [PATCH 24/24] RFC drm/i915: Expose a PMU interface for perf queries Chris Wilson
@ 2017-05-18 12:31 ` Patchwork
  2017-05-18 13:35 ` [PATCH 01/24] " Tvrtko Ursulin
  2017-05-19  9:02 ` Joonas Lahtinen
  25 siblings, 0 replies; 47+ messages in thread
From: Patchwork @ 2017-05-18 12:31 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: series starting with [01/24] drm/i915/selftests: Pretend to be a gfx pci device
URL   : https://patchwork.freedesktop.org/series/24612/
State : success

== Summary ==

Series 24612v1 Series without cover letter
https://patchwork.freedesktop.org/api/1.0/series/24612/revisions/1/mbox/

Test gem_exec_flush:
        Subgroup basic-batch-kernel-default-uc:
                fail       -> PASS       (fi-snb-2600) fdo#100007

fdo#100007 https://bugs.freedesktop.org/show_bug.cgi?id=100007

fi-bdw-5557u     total:278  pass:267  dwarn:0   dfail:0   fail:0   skip:11  time:443s
fi-bdw-gvtdvm    total:278  pass:256  dwarn:8   dfail:0   fail:0   skip:14  time:429s
fi-bsw-n3050     total:278  pass:242  dwarn:0   dfail:0   fail:0   skip:36  time:572s
fi-bxt-j4205     total:278  pass:259  dwarn:0   dfail:0   fail:0   skip:19  time:511s
fi-byt-j1900     total:278  pass:254  dwarn:0   dfail:0   fail:0   skip:24  time:490s
fi-byt-n2820     total:278  pass:250  dwarn:0   dfail:0   fail:0   skip:28  time:477s
fi-hsw-4770      total:278  pass:262  dwarn:0   dfail:0   fail:0   skip:16  time:416s
fi-hsw-4770r     total:278  pass:262  dwarn:0   dfail:0   fail:0   skip:16  time:407s
fi-ilk-650       total:278  pass:228  dwarn:0   dfail:0   fail:0   skip:50  time:419s
fi-ivb-3520m     total:278  pass:260  dwarn:0   dfail:0   fail:0   skip:18  time:494s
fi-ivb-3770      total:278  pass:260  dwarn:0   dfail:0   fail:0   skip:18  time:470s
fi-kbl-7500u     total:278  pass:255  dwarn:5   dfail:0   fail:0   skip:18  time:465s
fi-skl-6260u     total:278  pass:268  dwarn:0   dfail:0   fail:0   skip:10  time:449s
fi-skl-6700hq    total:278  pass:261  dwarn:0   dfail:0   fail:0   skip:17  time:576s
fi-skl-6700k     total:278  pass:256  dwarn:4   dfail:0   fail:0   skip:18  time:462s
fi-skl-6770hq    total:278  pass:268  dwarn:0   dfail:0   fail:0   skip:10  time:498s
fi-skl-gvtdvm    total:278  pass:265  dwarn:0   dfail:0   fail:0   skip:13  time:438s
fi-snb-2520m     total:278  pass:250  dwarn:0   dfail:0   fail:0   skip:28  time:532s
fi-snb-2600      total:278  pass:249  dwarn:0   dfail:0   fail:0   skip:29  time:404s

28aeb29016ff4ba6576438278ea1c593993b0c97 drm-tip: 2017y-05m-18d-06h-01m-15s UTC integration manifest
973142a RFC drm/i915: Expose a PMU interface for perf queries
5b23c59 drm/i915: Keep a recent cache of freed contexts objects for reuse
30ebd90 drm/i915: Enable rcu-only context lookups
94887f4 drm/i915: Allow contexts to be unreferenced locklessly
5cf184d drm/i915: Group all the global context information together
e2b3c17 drm/i915/scheduler: Support user-defined priorities
ad001c5 drm/i915: Convert execbuf to use struct-of-array packing for critical fields
93bc332 drm/i915: Stash a pointer to the obj's resv in the vma
d0aa474 drm/i915: Async GPU relocation processing
c65943e drm/i915: Allow execbuffer to use the first object as the batch
82ea9e9 drm/i915: Wait upon userptr get-user-pages within execbuffer
09f44ba drm/i915: First try the previous execbuffer location
efb796a drm/i915: Store a persistent reference for an object in the execbuffer cache
2f750e4 drm/i915: Eliminate lots of iterations over the execobjects array
373ab26 drm/i915: Disable EXEC_OBJECT_ASYNC when doing relocations
4011ec4 drm/i915: Pass vma to relocate entry
dcffdcb drm/i915: Store a direct lookup from object handle to vma
e5a2a52 drm/i915: Split vma exec_link/evict_link
533c013 drm/i915: Use vma->exec_entry as our double-entry placeholder
94827fa drm/i915: Amalgamate execbuffer parameter structures
4801a06 drm/i915: Reinstate reservation_object zapping for batch_pool objects
235e60b drm/i915: Store i915_gem_object_is_coherent() as a bit next to cache-dirty
8057f22 drm/i915: Mark CPU cache as dirty on every transition for CPU writes
3204282 drm/i915/selftests: Pretend to be a gfx pci device

== Logs ==

For more details see: https://intel-gfx-ci.01.org/CI/Patchwork_4735/
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 01/24] drm/i915/selftests: Pretend to be a gfx pci device
  2017-05-18  9:46 [PATCH 01/24] drm/i915/selftests: Pretend to be a gfx pci device Chris Wilson
                   ` (23 preceding siblings ...)
  2017-05-18 12:31 ` ✓ Fi.CI.BAT: success for series starting with [01/24] drm/i915/selftests: Pretend to be a gfx pci device Patchwork
@ 2017-05-18 13:35 ` Tvrtko Ursulin
  2017-05-18 14:46   ` Chris Wilson
  2017-05-19  9:02 ` Joonas Lahtinen
  25 siblings, 1 reply; 47+ messages in thread
From: Tvrtko Ursulin @ 2017-05-18 13:35 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 18/05/2017 10:46, Chris Wilson wrote:
> Set the class on our mock pci device to GFX. This should be useful for
> utilities like intel-iommu that special case gfx devices.
>
> References: https://bugs.freedesktop.org/show_bug.cgi?id=101080
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> ---
>  drivers/gpu/drm/i915/selftests/mock_gem_device.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/gpu/drm/i915/selftests/mock_gem_device.c b/drivers/gpu/drm/i915/selftests/mock_gem_device.c
> index f4edd4c6cb07..627e2aa09766 100644
> --- a/drivers/gpu/drm/i915/selftests/mock_gem_device.c
> +++ b/drivers/gpu/drm/i915/selftests/mock_gem_device.c
> @@ -121,6 +121,7 @@ struct drm_i915_private *mock_gem_device(void)
>  		goto err;
>
>  	device_initialize(&pdev->dev);
> +	pdev->class = PCI_BASE_CLASS_DISPLAY << 16;
>  	pdev->dev.release = release_dev;
>  	dev_set_name(&pdev->dev, "mock");
>  	dma_coerce_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32));
>

Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 23/24] drm/i915: Keep a recent cache of freed contexts objects for reuse
  2017-05-18  9:46 ` [PATCH 23/24] drm/i915: Keep a recent cache of freed contexts objects for reuse Chris Wilson
@ 2017-05-18 13:52   ` Tvrtko Ursulin
  2017-05-18 14:19     ` Chris Wilson
  2017-05-22 10:51   ` Tvrtko Ursulin
  1 sibling, 1 reply; 47+ messages in thread
From: Tvrtko Ursulin @ 2017-05-18 13:52 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 18/05/2017 10:46, Chris Wilson wrote:
> Keep the recently freed context objects for reuse. This allows us to use
> the current GGTT bindings and dma bound pages, avoiding any clflushes as
> required. We mark the objects as purgeable under memory pressure, and
> reap the list of freed objects as soon as the device is idle.

Some description of when this is interesting and by how much?

Since you drop everything on idle it must be for some workload which 
likes to go through contexts while keeping busy?

> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> ---
>  drivers/gpu/drm/i915/i915_drv.h                  |  2 +
>  drivers/gpu/drm/i915/i915_gem.c                  |  1 +
>  drivers/gpu/drm/i915/i915_gem_context.c          | 59 ++++++++++++++++++++++--
>  drivers/gpu/drm/i915/i915_gem_context.h          |  5 ++
>  drivers/gpu/drm/i915/intel_lrc.c                 |  2 +-
>  drivers/gpu/drm/i915/intel_ringbuffer.c          |  2 +-
>  drivers/gpu/drm/i915/selftests/mock_context.c    |  1 +
>  drivers/gpu/drm/i915/selftests/mock_gem_device.c |  1 +
>  8 files changed, 68 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 1d55bbde68df..1fa1e7d48f02 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -2316,6 +2316,8 @@ struct drm_i915_private {
>  		struct llist_head free_list;
>  		struct work_struct free_work;
>
> +		struct list_head freed_objects;

free_ctx_objects maybe?

Knowing you I am surprised this is not bucketed! :)

> +
>  		/* The hw wants to have a stable context identifier for the
>  		 * lifetime of the context (for OA, PASID, faults, etc).
>  		 * This is limited in execlists to 21 bits.
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index c044b5dbdb66..f482c320d810 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -3212,6 +3212,7 @@ i915_gem_idle_work_handler(struct work_struct *work)
>  		DRM_ERROR("Timeout waiting for engines to idle\n");
>
>  	intel_engines_mark_idle(dev_priv);
> +	i915_gem_contexts_mark_idle(dev_priv);
>  	i915_gem_timelines_mark_idle(dev_priv);
>
>  	GEM_BUG_ON(!dev_priv->gt.awake);
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> index 198b0064a3d3..1e51fd36f355 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> @@ -96,6 +96,48 @@
>  /* Initial size (as log2) to preallocate the handle->object hashtable */
>  #define VMA_HT_BITS 2u /* 4 x 2 pointers, 64 bytes minimum */
>
> +struct drm_i915_gem_object *
> +i915_gem_context_create_object(struct drm_i915_private *i915,
> +			       unsigned long size)
> +{
> +	struct drm_i915_gem_object *obj, *on;
> +
> +	lockdep_assert_held(&i915->drm.struct_mutex);
> +
> +	list_for_each_entry_safe(obj, on,
> +				 &i915->contexts.freed_objects,
> +				 batch_pool_link) {
> +		if (obj->mm.madv != I915_MADV_DONTNEED) {
> +			/* Purge the heretic! */

I don't see this can happen in the current version?

> +			list_del(&obj->batch_pool_link);
> +			i915_gem_object_put(obj);
> +			continue;
> +		}
> +
> +		if (obj->base.size == size) {
> +			list_del(&obj->batch_pool_link);
> +			obj->mm.madv = I915_MADV_WILLNEED;
> +			return obj;
> +		}
> +	}
> +
> +	return i915_gem_object_create(i915, size);
> +}
> +
> +void i915_gem_contexts_mark_idle(struct drm_i915_private *i915)
> +{
> +	struct drm_i915_gem_object *obj, *on;
> +
> +	lockdep_assert_held(&i915->drm.struct_mutex);
> +
> +	list_for_each_entry_safe(obj, on,
> +				 &i915->contexts.freed_objects,
> +				 batch_pool_link) {
> +		list_del(&obj->batch_pool_link);
> +		i915_gem_object_put(obj);
> +	}
> +}
> +
>  static void resize_vma_ht(struct work_struct *work)
>  {
>  	struct i915_gem_context_vma_lut *lut =
> @@ -160,9 +202,10 @@ static void vma_lut_free(struct i915_gem_context *ctx)
>
>  static void i915_gem_context_free(struct i915_gem_context *ctx)
>  {
> +	struct drm_i915_private *i915 = ctx->i915;
>  	int i;
>
> -	lockdep_assert_held(&ctx->i915->drm.struct_mutex);
> +	lockdep_assert_held(&i915->drm.struct_mutex);
>  	GEM_BUG_ON(!i915_gem_context_is_closed(ctx));
>
>  	vma_lut_free(ctx);
> @@ -178,7 +221,11 @@ static void i915_gem_context_free(struct i915_gem_context *ctx)
>  		if (ce->ring)
>  			intel_ring_free(ce->ring);
>
> -		__i915_gem_object_release_unless_active(ce->state->obj);
> +		/* Keep the objects around for quick reuse */
> +		GEM_BUG_ON(ce->state->obj->mm.madv != I915_MADV_WILLNEED);
> +		ce->state->obj->mm.madv = I915_MADV_DONTNEED;
> +		list_add(&ce->state->obj->batch_pool_link,
> +			 &i915->contexts.freed_objects);
>  	}
>
>  	kfree(ctx->name);
> @@ -186,7 +233,7 @@ static void i915_gem_context_free(struct i915_gem_context *ctx)
>
>  	list_del(&ctx->link);
>
> -	ida_simple_remove(&ctx->i915->contexts.hw_ida, ctx->hw_id);
> +	ida_simple_remove(&i915->contexts.hw_ida, ctx->hw_id);
>  	kfree_rcu(ctx, rcu);
>  }
>
> @@ -457,6 +504,7 @@ int i915_gem_contexts_init(struct drm_i915_private *dev_priv)
>  		return 0;
>
>  	INIT_LIST_HEAD(&dev_priv->contexts.list);
> +	INIT_LIST_HEAD(&dev_priv->contexts.freed_objects);
>  	INIT_WORK(&dev_priv->contexts.free_work, contexts_free_worker);
>  	init_llist_head(&dev_priv->contexts.free_list);
>
> @@ -542,12 +590,17 @@ void i915_gem_contexts_fini(struct drm_i915_private *i915)
>
>  	lockdep_assert_held(&i915->drm.struct_mutex);
>
> +	GEM_BUG_ON(work_pending(&i915->contexts.free_work));
> +
>  	/* Keep the context so that we can free it immediately ourselves */
>  	ctx = i915_gem_context_get(fetch_and_zero(&i915->kernel_context));
>  	GEM_BUG_ON(!i915_gem_context_is_kernel(ctx));
>  	context_close(ctx);
>  	i915_gem_context_free(ctx);
>
> +	i915_gem_contexts_mark_idle(i915);
> +	GEM_BUG_ON(!list_empty(&i915->contexts.freed_objects));
> +
>  	/* Must free all deferred contexts (via flush_workqueue) first */
>  	ida_destroy(&i915->contexts.hw_ida);

Must not read series in random order.. bbl. :)

Regards,

Tvrtko

>  }
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.h b/drivers/gpu/drm/i915/i915_gem_context.h
> index 04320f80f9f4..4654caab9b6c 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.h
> +++ b/drivers/gpu/drm/i915/i915_gem_context.h
> @@ -294,6 +294,11 @@ void i915_gem_context_release(struct kref *ctx_ref);
>  struct i915_gem_context *
>  i915_gem_context_create_gvt(struct drm_device *dev);
>
> +struct drm_i915_gem_object *
> +i915_gem_context_create_object(struct drm_i915_private *i915,
> +			       unsigned long size);
> +void i915_gem_contexts_mark_idle(struct drm_i915_private *i915);
> +
>  int i915_gem_context_create_ioctl(struct drm_device *dev, void *data,
>  				  struct drm_file *file);
>  int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 42ab91acccc5..21a1519db936 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -2011,7 +2011,7 @@ static int execlists_context_deferred_alloc(struct i915_gem_context *ctx,
>  	/* One extra page as the sharing data between driver and GuC */
>  	context_size += PAGE_SIZE * LRC_PPHWSP_PN;
>
> -	ctx_obj = i915_gem_object_create(ctx->i915, context_size);
> +	ctx_obj = i915_gem_context_create_object(ctx->i915, context_size);
>  	if (IS_ERR(ctx_obj)) {
>  		DRM_DEBUG_DRIVER("Alloc LRC backing obj failed.\n");
>  		return PTR_ERR(ctx_obj);
> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
> index acd1da9b62a3..253d32fef133 100644
> --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
> @@ -1454,7 +1454,7 @@ alloc_context_vma(struct intel_engine_cs *engine)
>  	struct drm_i915_gem_object *obj;
>  	struct i915_vma *vma;
>
> -	obj = i915_gem_object_create(i915, engine->context_size);
> +	obj = i915_gem_context_create_object(i915, engine->context_size);
>  	if (IS_ERR(obj))
>  		return ERR_CAST(obj);
>
> diff --git a/drivers/gpu/drm/i915/selftests/mock_context.c b/drivers/gpu/drm/i915/selftests/mock_context.c
> index 9c7c68181f82..f4972674c96e 100644
> --- a/drivers/gpu/drm/i915/selftests/mock_context.c
> +++ b/drivers/gpu/drm/i915/selftests/mock_context.c
> @@ -92,6 +92,7 @@ void mock_init_contexts(struct drm_i915_private *i915)
>  	INIT_LIST_HEAD(&i915->contexts.list);
>  	ida_init(&i915->contexts.hw_ida);
>
> +	INIT_LIST_HEAD(&i915->contexts.freed_objects);
>  	INIT_WORK(&i915->contexts.free_work, contexts_free_worker);
>  	init_llist_head(&i915->contexts.free_list);
>  }
> diff --git a/drivers/gpu/drm/i915/selftests/mock_gem_device.c b/drivers/gpu/drm/i915/selftests/mock_gem_device.c
> index 47613d20bba8..5eb931ecf0a4 100644
> --- a/drivers/gpu/drm/i915/selftests/mock_gem_device.c
> +++ b/drivers/gpu/drm/i915/selftests/mock_gem_device.c
> @@ -53,6 +53,7 @@ static void mock_device_release(struct drm_device *dev)
>
>  	mutex_lock(&i915->drm.struct_mutex);
>  	mock_device_flush(i915);
> +	i915_gem_contexts_lost(i915);
>  	mutex_unlock(&i915->drm.struct_mutex);
>
>  	cancel_delayed_work_sync(&i915->gt.retire_work);
>
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 23/24] drm/i915: Keep a recent cache of freed contexts objects for reuse
  2017-05-18 13:52   ` Tvrtko Ursulin
@ 2017-05-18 14:19     ` Chris Wilson
  2017-05-19  9:09       ` Tvrtko Ursulin
  0 siblings, 1 reply; 47+ messages in thread
From: Chris Wilson @ 2017-05-18 14:19 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: intel-gfx

On Thu, May 18, 2017 at 02:52:41PM +0100, Tvrtko Ursulin wrote:
> 
> On 18/05/2017 10:46, Chris Wilson wrote:
> >Keep the recently freed context objects for reuse. This allows us to use
> >the current GGTT bindings and dma bound pages, avoiding any clflushes as
> >required. We mark the objects as purgeable under memory pressure, and
> >reap the list of freed objects as soon as the device is idle.
> 
> Some description of when this is interesting and by how much?

No interesting workload (a few synthetic benchmarks including the GL).
Best argument would be a minor improvement in application startup.

> Since you drop everything on idle it must be for some workload which
> likes to go through contexts while keeping busy?

Yup, it's just the rule of thumb I have (from experience with the
cmdparser batch caching).
 
> >Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> >---
> > drivers/gpu/drm/i915/i915_drv.h                  |  2 +
> > drivers/gpu/drm/i915/i915_gem.c                  |  1 +
> > drivers/gpu/drm/i915/i915_gem_context.c          | 59 ++++++++++++++++++++++--
> > drivers/gpu/drm/i915/i915_gem_context.h          |  5 ++
> > drivers/gpu/drm/i915/intel_lrc.c                 |  2 +-
> > drivers/gpu/drm/i915/intel_ringbuffer.c          |  2 +-
> > drivers/gpu/drm/i915/selftests/mock_context.c    |  1 +
> > drivers/gpu/drm/i915/selftests/mock_gem_device.c |  1 +
> > 8 files changed, 68 insertions(+), 5 deletions(-)
> >
> >diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> >index 1d55bbde68df..1fa1e7d48f02 100644
> >--- a/drivers/gpu/drm/i915/i915_drv.h
> >+++ b/drivers/gpu/drm/i915/i915_drv.h
> >@@ -2316,6 +2316,8 @@ struct drm_i915_private {
> > 		struct llist_head free_list;
> > 		struct work_struct free_work;
> >
> >+		struct list_head freed_objects;
> 
> free_ctx_objects maybe?

This is i915->contexts.freed_objects (new contexts substruct). There is
already a precedent for using freed_ (mainly because I think it's a 
better description for that phase, free is too similar to avail,
although you may persuade me that free is better).

> Knowing you I am surprised this is not bucketed! :)

I didn't bother since we only really have two sizes - also the reason
why I didn't make it per-engine, so that we could share the xcs
context objects. My expectation is that we won't be caught up in slow
list searches, so I went for simplicity for a change.

> >+struct drm_i915_gem_object *
> >+i915_gem_context_create_object(struct drm_i915_private *i915,
> >+			       unsigned long size)
> >+{
> >+	struct drm_i915_gem_object *obj, *on;
> >+
> >+	lockdep_assert_held(&i915->drm.struct_mutex);
> >+
> >+	list_for_each_entry_safe(obj, on,
> >+				 &i915->contexts.freed_objects,
> >+				 batch_pool_link) {
> >+		if (obj->mm.madv != I915_MADV_DONTNEED) {
> >+			/* Purge the heretic! */
> 
> I don't see how this can happen in the current version?

The power of the shrinker. Remember it can and will stab you in the back
at any time.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 01/24] drm/i915/selftests: Pretend to be a gfx pci device
  2017-05-18 13:35 ` [PATCH 01/24] " Tvrtko Ursulin
@ 2017-05-18 14:46   ` Chris Wilson
  0 siblings, 0 replies; 47+ messages in thread
From: Chris Wilson @ 2017-05-18 14:46 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: intel-gfx

On Thu, May 18, 2017 at 02:35:06PM +0100, Tvrtko Ursulin wrote:
> 
> On 18/05/2017 10:46, Chris Wilson wrote:
> >Set the class on our mock pci device to GFX. This should be useful for
> >utilities like intel-iommu that special case gfx devices.
> >
> >References: https://bugs.freedesktop.org/show_bug.cgi?id=101080
> >Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> >---
> > drivers/gpu/drm/i915/selftests/mock_gem_device.c | 1 +
> > 1 file changed, 1 insertion(+)
> >
> >diff --git a/drivers/gpu/drm/i915/selftests/mock_gem_device.c b/drivers/gpu/drm/i915/selftests/mock_gem_device.c
> >index f4edd4c6cb07..627e2aa09766 100644
> >--- a/drivers/gpu/drm/i915/selftests/mock_gem_device.c
> >+++ b/drivers/gpu/drm/i915/selftests/mock_gem_device.c
> >@@ -121,6 +121,7 @@ struct drm_i915_private *mock_gem_device(void)
> > 		goto err;
> >
> > 	device_initialize(&pdev->dev);
> >+	pdev->class = PCI_BASE_CLASS_DISPLAY << 16;
> > 	pdev->dev.release = release_dev;
> > 	dev_set_name(&pdev->dev, "mock");
> > 	dma_coerce_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32));
> >
> 
> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Pushed this little patch, hopefully we'll see some change in bug 101080
next month...

Thanks for the review.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 24/24] RFC drm/i915: Expose a PMU interface for perf queries
  2017-05-18  9:46 ` [PATCH 24/24] RFC drm/i915: Expose a PMU interface for perf queries Chris Wilson
@ 2017-05-18 23:48   ` Dmitry Rogozhkin
  2017-05-19  8:01     ` Chris Wilson
  2017-06-15 11:17   ` Tvrtko Ursulin
  1 sibling, 1 reply; 47+ messages in thread
From: Dmitry Rogozhkin @ 2017-05-18 23:48 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx



On 5/18/2017 2:46 AM, Chris Wilson wrote:
> The first goal is to be able to measure GPU (and invidual ring) busyness
> without having to poll registers from userspace. (Which not only incurs
> holding the forcewake lock indefinitely, perturbing the system, but also
> runs the risk of hanging the machine.) As an alternative we can use the
> perf event counter interface to sample the ring registers periodically
> and send those results to userspace.
>
> To be able to do so, we need to export the two symbols from
> kernel/events/core.c to register and unregister a PMU device.
>
> v2: Use a common timer for the ring sampling.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> ---
>   drivers/gpu/drm/i915/Makefile           |   1 +
>   drivers/gpu/drm/i915/i915_drv.c         |   2 +
>   drivers/gpu/drm/i915/i915_drv.h         |  23 ++
>   drivers/gpu/drm/i915/i915_pmu.c         | 449 ++++++++++++++++++++++++++++++++
>   drivers/gpu/drm/i915/intel_ringbuffer.h |   2 +
>   include/uapi/drm/i915_drm.h             |  40 +++
>   kernel/events/core.c                    |   1 +
>   7 files changed, 518 insertions(+)
>   create mode 100644 drivers/gpu/drm/i915/i915_pmu.c
>
> diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
> index 7b05fb802f4c..ca88e6e67910 100644
> --- a/drivers/gpu/drm/i915/Makefile
> +++ b/drivers/gpu/drm/i915/Makefile
> @@ -26,6 +26,7 @@ i915-y := i915_drv.o \
>   
>   i915-$(CONFIG_COMPAT)   += i915_ioc32.o
>   i915-$(CONFIG_DEBUG_FS) += i915_debugfs.o intel_pipe_crc.o
> +i915-$(CONFIG_PERF_EVENTS) += i915_pmu.o
>   
>   # GEM code
>   i915-y += i915_cmd_parser.o \
> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> index 2d2fb4327f97..e3c6d052d1c9 100644
> --- a/drivers/gpu/drm/i915/i915_drv.c
> +++ b/drivers/gpu/drm/i915/i915_drv.c
> @@ -1144,6 +1144,7 @@ static void i915_driver_register(struct drm_i915_private *dev_priv)
>   	struct drm_device *dev = &dev_priv->drm;
>   
>   	i915_gem_shrinker_init(dev_priv);
> +	i915_pmu_register(dev_priv);
>   
>   	/*
>   	 * Notify a valid surface after modesetting,
> @@ -1197,6 +1198,7 @@ static void i915_driver_unregister(struct drm_i915_private *dev_priv)
>   	intel_opregion_unregister(dev_priv);
>   
>   	i915_perf_unregister(dev_priv);
> +	i915_pmu_unregister(dev_priv);
>   
>   	i915_teardown_sysfs(dev_priv);
>   	i915_guc_log_unregister(dev_priv);
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 1fa1e7d48f02..10beae1a13c8 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -40,6 +40,7 @@
>   #include <linux/hash.h>
>   #include <linux/intel-iommu.h>
>   #include <linux/kref.h>
> +#include <linux/perf_event.h>
>   #include <linux/pm_qos.h>
>   #include <linux/reservation.h>
>   #include <linux/shmem_fs.h>
> @@ -2075,6 +2076,12 @@ struct intel_cdclk_state {
>   	unsigned int cdclk, vco, ref;
>   };
>   
> +enum {
> +	__I915_SAMPLE_FREQ_ACT = 0,
> +	__I915_SAMPLE_FREQ_REQ,
> +	__I915_NUM_PMU_SAMPLERS
> +};
> +
>   struct drm_i915_private {
>   	struct drm_device drm;
>   
> @@ -2564,6 +2571,13 @@ struct drm_i915_private {
>   		int	irq;
>   	} lpe_audio;
>   
> +	struct {
> +		struct pmu base;
> +		struct hrtimer timer;
> +		u64 enable;
> +		u64 sample[__I915_NUM_PMU_SAMPLERS];
> +	} pmu;
> +
>   	/*
>   	 * NOTE: This is the dri1/ums dungeon, don't add stuff here. Your patch
>   	 * will be rejected. Instead look for a better place.
> @@ -3681,6 +3695,15 @@ extern void i915_perf_fini(struct drm_i915_private *dev_priv);
>   extern void i915_perf_register(struct drm_i915_private *dev_priv);
>   extern void i915_perf_unregister(struct drm_i915_private *dev_priv);
>   
> +/* i915_pmu.c */
> +#ifdef CONFIG_PERF_EVENTS
> +extern void i915_pmu_register(struct drm_i915_private *i915);
> +extern void i915_pmu_unregister(struct drm_i915_private *i915);
> +#else
> +static inline void i915_pmu_register(struct drm_i915_private *i915) {}
> +static inline void i915_pmu_unregister(struct drm_i915_private *i915) {}
> +#endif
> +
>   /* i915_suspend.c */
>   extern int i915_save_state(struct drm_i915_private *dev_priv);
>   extern int i915_restore_state(struct drm_i915_private *dev_priv);
> diff --git a/drivers/gpu/drm/i915/i915_pmu.c b/drivers/gpu/drm/i915/i915_pmu.c
> new file mode 100644
> index 000000000000..80e1c07841ac
> --- /dev/null
> +++ b/drivers/gpu/drm/i915/i915_pmu.c
> @@ -0,0 +1,449 @@
> +#include <linux/perf_event.h>
> +#include <linux/pm_runtime.h>
> +
> +#include "i915_drv.h"
> +#include "intel_ringbuffer.h"
> +
> +#define FREQUENCY 200
> +#define PERIOD max_t(u64, 10000, NSEC_PER_SEC / FREQUENCY)
> +
> +#define RING_MASK 0xffffffff
> +#define RING_MAX 32
> +
> +static void engines_sample(struct drm_i915_private *dev_priv)
> +{
> +	struct intel_engine_cs *engine;
> +	enum intel_engine_id id;
> +	bool fw = false;
> +
> +	if ((dev_priv->pmu.enable & RING_MASK) == 0)
> +		return;
> +
> +	if (!dev_priv->gt.awake)
> +		return;
> +
> +	if (!intel_runtime_pm_get_if_in_use(dev_priv))
> +		return;
> +
> +	for_each_engine(engine, dev_priv, id) {
> +		u32 val;
> +
> +		if ((dev_priv->pmu.enable & (0x7 << (4*id))) == 0)
> +			continue;
> +
> +		if (i915_seqno_passed(intel_engine_get_seqno(engine),
> +				      intel_engine_last_submit(engine)))
> +			continue;
> +
> +		if (!fw) {
> +			intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
> +			fw = true;
> +		}
> +
> +		val = I915_READ_FW(RING_MI_MODE(engine->mmio_base));
> +		if (!(val & MODE_IDLE))
> +			engine->pmu_sample[I915_SAMPLE_BUSY] += PERIOD;
Could you please check that I understand correctly what you are 
attempting to do? Every PERIOD=10ms you check the status of the 
engines. If an engine is in a busy state you count the full PERIOD as 
if the engine was busy during it. If the engine was in some other 
state you count the PERIOD toward that state instead (wait or sema 
below, as I understand).

If that is what you are doing, your solution is really only usable for 
cases where tasks execute on the engines for longer than 2x the 
PERIOD, i.e. longer than 20ms. That is not the case for a lot of tasks 
submitted by media, whose usual execution time is just a few 
milliseconds. Considering that i915 currently tracks when batches were 
scheduled and when they are ready, you can easily calculate a precise 
metric of busy clocks for each engine from the perspective of i915. It 
will not be the precise time of how long the engine computed things, 
because it will include scheduling latency, but that is exactly what 
the end user requires: the end user does not care whether there is 
latency or not, for him the engine is busy all that time. This is done 
in the alternate solution given by Tvrtko in the "drm/i915: Export 
engine busy stats in debugfs" patch. Why do you not take that as a 
basis? Why is that patch, containing just a few arithmetic operations 
to calculate engine busy clocks, being ignored?
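
To make this concrete, a rough sketch (hypothetical numbers, not 
driver code; assuming the 10ms PERIOD above) - the long-run average 
comes out right, but any single batch is credited all or nothing:

#include <stdio.h>

int main(void)
{
	const double period_ms = 10.0; /* sampling period in this RFC */
	const double batch_ms = 3.0;   /* a typical short media batch */
	const double p_hit = batch_ms / period_ms;

	printf("true busy time:     %.1f ms\n", batch_ms);
	printf("credited per batch: 0 or %.1f ms\n", period_ms);
	printf("expected credit:    %.1f ms\n", p_hit * period_ms);
	return 0;
}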

> +
> +		val = I915_READ_FW(RING_CTL(engine->mmio_base));
> +		if (val & RING_WAIT)
> +			engine->pmu_sample[I915_SAMPLE_WAIT] += PERIOD;
> +		if (val & RING_WAIT_SEMAPHORE)
> +			engine->pmu_sample[I915_SAMPLE_SEMA] += PERIOD;
> +	}
> +
> +	if (fw)
> +		intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
> +	intel_runtime_pm_put(dev_priv);
> +}
> +
> +static void frequency_sample(struct drm_i915_private *dev_priv)
> +{
> +	if (dev_priv->pmu.enable & BIT_ULL(I915_PMU_ACTUAL_FREQUENCY)) {
> +		u64 val;
> +
> +		val = dev_priv->rps.cur_freq;
> +		if (dev_priv->gt.awake &&
> +		    intel_runtime_pm_get_if_in_use(dev_priv)) {
> +			val = I915_READ_NOTRACE(GEN6_RPSTAT1);
> +			if (INTEL_GEN(dev_priv) >= 9)
> +				val = (val & GEN9_CAGF_MASK) >> GEN9_CAGF_SHIFT;
> +			else if (IS_HASWELL(dev_priv) || INTEL_GEN(dev_priv) >= 8)
> +				val = (val & HSW_CAGF_MASK) >> HSW_CAGF_SHIFT;
> +			else
> +				val = (val & GEN6_CAGF_MASK) >> GEN6_CAGF_SHIFT;
> +			intel_runtime_pm_put(dev_priv);
> +		}
> +		val = intel_gpu_freq(dev_priv, val);
> +		dev_priv->pmu.sample[__I915_SAMPLE_FREQ_ACT] += val * PERIOD;
> +	}
> +
> +	if (dev_priv->pmu.enable & BIT_ULL(I915_PMU_REQUESTED_FREQUENCY)) {
> +		u64 val = intel_gpu_freq(dev_priv, dev_priv->rps.cur_freq);
> +		dev_priv->pmu.sample[__I915_SAMPLE_FREQ_REQ] += val * PERIOD;
> +	}
> +}
> +
> +static enum hrtimer_restart i915_sample(struct hrtimer *hrtimer)
> +{
> +	struct drm_i915_private *i915 =
> +		container_of(hrtimer, struct drm_i915_private, pmu.timer);
> +
> +	if (i915->pmu.enable == 0)
> +		return HRTIMER_NORESTART;
> +
> +	engines_sample(i915);
> +	frequency_sample(i915);
> +
> +	hrtimer_forward_now(hrtimer, ns_to_ktime(PERIOD));
> +	return HRTIMER_RESTART;
> +}
> +
> +static void i915_pmu_event_destroy(struct perf_event *event)
> +{
> +	WARN_ON(event->parent);
> +}
> +
> +static int engine_event_init(struct perf_event *event)
> +{
> +	struct drm_i915_private *i915 =
> +		container_of(event->pmu, typeof(*i915), pmu.base);
> +	int engine = event->attr.config >> 2;
> +	int sample = event->attr.config & 3;
> +
> +	switch (sample) {
> +	case I915_SAMPLE_BUSY:
> +	case I915_SAMPLE_WAIT:
> +		break;
> +	case I915_SAMPLE_SEMA:
> +		if (INTEL_GEN(i915) < 6)
> +			return -ENODEV;
> +		break;
> +	default:
> +		return -ENOENT;
> +	}
> +
> +	if (engine >= I915_NUM_ENGINES)
> +		return -ENOENT;
> +
> +	if (!i915->engine[engine])
> +		return -ENODEV;
> +
> +	return 0;
> +}
> +
> +static enum hrtimer_restart hrtimer_sample(struct hrtimer *hrtimer)
> +{
> +	struct perf_sample_data data;
> +	struct perf_event *event;
> +	u64 period;
> +
> +	event = container_of(hrtimer, struct perf_event, hw.hrtimer);
> +	if (event->state != PERF_EVENT_STATE_ACTIVE)
> +		return HRTIMER_NORESTART;
> +
> +	event->pmu->read(event);
> +
> +	perf_sample_data_init(&data, 0, event->hw.last_period);
> +	perf_event_overflow(event, &data, NULL);
> +
> +	period = max_t(u64, 10000, event->hw.sample_period);
> +	hrtimer_forward_now(hrtimer, ns_to_ktime(period));
> +	return HRTIMER_RESTART;
> +}
> +
> +static void init_hrtimer(struct perf_event *event)
> +{
> +	struct hw_perf_event *hwc = &event->hw;
> +
> +	if (!is_sampling_event(event))
> +		return;
> +
> +	hrtimer_init(&hwc->hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> +	hwc->hrtimer.function = hrtimer_sample;
> +
> +	if (event->attr.freq) {
> +		long freq = event->attr.sample_freq;
> +
> +		event->attr.sample_period = NSEC_PER_SEC / freq;
> +		hwc->sample_period = event->attr.sample_period;
> +		local64_set(&hwc->period_left, hwc->sample_period);
> +		hwc->last_period = hwc->sample_period;
> +		event->attr.freq = 0;
> +	}
> +}
> +
> +static int i915_pmu_event_init(struct perf_event *event)
> +{
> +	struct drm_i915_private *i915 =
> +		container_of(event->pmu, typeof(*i915), pmu.base);
> +	int ret;
> +
> +	/* XXX ideally only want pid == -1 && cpu == -1 */
> +
> +	if (event->attr.type != event->pmu->type)
> +		return -ENOENT;
> +
> +	if (has_branch_stack(event))
> +		return -EOPNOTSUPP;
> +
> +	ret = 0;
> +	if (event->attr.config < RING_MAX) {
> +		ret = engine_event_init(event);
> +	} else switch (event->attr.config) {
> +	case I915_PMU_ACTUAL_FREQUENCY:
> +		if (IS_VALLEYVIEW(i915) || IS_CHERRYVIEW(i915))
> +			ret = -ENODEV; /* requires a mutex for sampling! */
> +	case I915_PMU_REQUESTED_FREQUENCY:
> +	case I915_PMU_ENERGY:
> +	case I915_PMU_RC6_RESIDENCY:
> +	case I915_PMU_RC6p_RESIDENCY:
> +	case I915_PMU_RC6pp_RESIDENCY:
> +		if (INTEL_GEN(i915) < 6)
> +			ret = -ENODEV;
> +		break;
> +	}
> +	if (ret)
> +		return ret;
> +
> +	if (!event->parent)
> +		event->destroy = i915_pmu_event_destroy;
> +
> +	init_hrtimer(event);
> +
> +	return 0;
> +}
> +
> +static void i915_pmu_timer_start(struct perf_event *event)
> +{
> +	struct hw_perf_event *hwc = &event->hw;
> +	s64 period;
> +
> +	if (!is_sampling_event(event))
> +		return;
> +
> +	period = local64_read(&hwc->period_left);
> +	if (period) {
> +		if (period < 0)
> +			period = 10000;
> +
> +		local64_set(&hwc->period_left, 0);
> +	} else {
> +		period = max_t(u64, 10000, hwc->sample_period);
> +	}
> +
> +	hrtimer_start_range_ns(&hwc->hrtimer,
> +			       ns_to_ktime(period), 0,
> +			       HRTIMER_MODE_REL_PINNED);
> +}
> +
> +static void i915_pmu_timer_cancel(struct perf_event *event)
> +{
> +	struct hw_perf_event *hwc = &event->hw;
> +
> +	if (!is_sampling_event(event))
> +		return;
> +
> +	local64_set(&hwc->period_left,
> +		    ktime_to_ns(hrtimer_get_remaining(&hwc->hrtimer)));
> +	hrtimer_cancel(&hwc->hrtimer);
> +}
> +
> +static void i915_pmu_enable(struct perf_event *event)
> +{
> +	struct drm_i915_private *i915 =
> +		container_of(event->pmu, typeof(*i915), pmu.base);
> +
> +	if (i915->pmu.enable == 0)
> +		hrtimer_start_range_ns(&i915->pmu.timer,
> +				       ns_to_ktime(PERIOD), 0,
> +				       HRTIMER_MODE_REL_PINNED);
> +
> +	i915->pmu.enable |= BIT_ULL(event->attr.config);
> +
> +	i915_pmu_timer_start(event);
> +}
> +
> +static void i915_pmu_disable(struct perf_event *event)
> +{
> +	struct drm_i915_private *i915 =
> +		container_of(event->pmu, typeof(*i915), pmu.base);
> +
> +	i915->pmu.enable &= ~BIT_ULL(event->attr.config);
> +	i915_pmu_timer_cancel(event);
> +}
> +
> +static int i915_pmu_event_add(struct perf_event *event, int flags)
> +{
> +	struct hw_perf_event *hwc = &event->hw;
> +
> +	if (flags & PERF_EF_START)
> +		i915_pmu_enable(event);
> +
> +	hwc->state = !(flags & PERF_EF_START);
> +
> +	return 0;
> +}
> +
> +static void i915_pmu_event_del(struct perf_event *event, int flags)
> +{
> +	i915_pmu_disable(event);
> +}
> +
> +static void i915_pmu_event_start(struct perf_event *event, int flags)
> +{
> +	i915_pmu_enable(event);
> +}
> +
> +static void i915_pmu_event_stop(struct perf_event *event, int flags)
> +{
> +	i915_pmu_disable(event);
> +}
> +
> +static u64 read_energy_uJ(struct drm_i915_private *dev_priv)
> +{
> +	u64 power;
> +
> +	GEM_BUG_ON(INTEL_GEN(dev_priv) < 6);
> +
> +	intel_runtime_pm_get(dev_priv);
> +
> +	rdmsrl(MSR_RAPL_POWER_UNIT, power);
> +	power = (power & 0x1f00) >> 8;
> +	power = 1000000 >> power; /* convert to uJ */
> +	power *= I915_READ_NOTRACE(MCH_SECP_NRG_STTS);
> +
> +	intel_runtime_pm_put(dev_priv);
> +
> +	return power;
> +}
> +
> +static inline u64 calc_residency(struct drm_i915_private *dev_priv,
> +				 const i915_reg_t reg)
> +{
> +	u64 val, units = 128, div = 100000;
> +
> +	GEM_BUG_ON(INTEL_GEN(dev_priv) < 6);
> +
> +	intel_runtime_pm_get(dev_priv);
> +	if (IS_VALLEYVIEW(dev_priv) || IS_CHERRYVIEW(dev_priv)) {
> +		div = dev_priv->czclk_freq;
> +		units = 1;
> +		if (I915_READ_NOTRACE(VLV_COUNTER_CONTROL) & VLV_COUNT_RANGE_HIGH)
> +			units <<= 8;
> +	} else if (IS_GEN9_LP(dev_priv)) {
> +		div = 1200;
> +		units = 1;
> +	}
> +	val = I915_READ_NOTRACE(reg);
> +	intel_runtime_pm_put(dev_priv);
> +
> +	val *= units;
> +	return DIV_ROUND_UP_ULL(val, div);
> +}
> +
> +static u64 count_interrupts(struct drm_i915_private *i915)
> +{
> +	/* open-coded kstat_irqs() */
> +	struct irq_desc *desc = irq_to_desc(i915->drm.pdev->irq);
> +	u64 sum = 0;
> +	int cpu;
> +
> +	if (!desc || !desc->kstat_irqs)
> +		return 0;
> +
> +	for_each_possible_cpu(cpu)
> +		sum += *per_cpu_ptr(desc->kstat_irqs, cpu);
> +
> +	return sum;
> +}
> +
> +static void i915_pmu_event_read(struct perf_event *event)
> +{
> +	struct drm_i915_private *i915 =
> +		container_of(event->pmu, typeof(*i915), pmu.base);
> +	u64 val = 0;
> +
> +	if (event->attr.config < 32) {
> +		int engine = event->attr.config >> 2;
> +		int sample = event->attr.config & 3;
> +		val = i915->engine[engine]->pmu_sample[sample];
> +	} else switch (event->attr.config) {
> +	case I915_PMU_ACTUAL_FREQUENCY:
> +		val = i915->pmu.sample[__I915_SAMPLE_FREQ_ACT];
> +		break;
> +	case I915_PMU_REQUESTED_FREQUENCY:
> +		val = i915->pmu.sample[__I915_SAMPLE_FREQ_REQ];
> +		break;
> +	case I915_PMU_ENERGY:
> +		val = read_energy_uJ(i915);
> +		break;
> +	case I915_PMU_INTERRUPTS:
> +		val = count_interrupts(i915);
> +		break;
> +
> +	case I915_PMU_RC6_RESIDENCY:
> +		if (!i915->gt.awake)
> +			return;
> +
> +		val = calc_residency(i915, IS_VALLEYVIEW(i915) ? VLV_GT_RENDER_RC6 : GEN6_GT_GFX_RC6);
> +		break;
> +
> +	case I915_PMU_RC6p_RESIDENCY:
> +		if (!i915->gt.awake)
> +			return;
> +
> +		if (!IS_VALLEYVIEW(i915))
> +			val = calc_residency(i915, GEN6_GT_GFX_RC6p);
> +		break;
> +
> +	case I915_PMU_RC6pp_RESIDENCY:
> +		if (!i915->gt.awake)
> +			return;
> +
> +		if (!IS_VALLEYVIEW(i915))
> +			val = calc_residency(i915, GEN6_GT_GFX_RC6pp);
> +		break;
> +	}
> +
> +	local64_set(&event->count, val);
> +}
> +
> +static int i915_pmu_event_event_idx(struct perf_event *event)
> +{
> +	return 0;
> +}
> +
> +void i915_pmu_register(struct drm_i915_private *i915)
> +{
> +	if (INTEL_GEN(i915) <= 2)
> +		return;
> +
> +	i915->pmu.base.task_ctx_nr	= perf_sw_context;
> +	i915->pmu.base.event_init	= i915_pmu_event_init;
> +	i915->pmu.base.add		= i915_pmu_event_add;
> +	i915->pmu.base.del		= i915_pmu_event_del;
> +	i915->pmu.base.start		= i915_pmu_event_start;
> +	i915->pmu.base.stop		= i915_pmu_event_stop;
> +	i915->pmu.base.read		= i915_pmu_event_read;
> +	i915->pmu.base.event_idx	= i915_pmu_event_event_idx;
> +
> +	hrtimer_init(&i915->pmu.timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> +	i915->pmu.timer.function = i915_sample;
> +	i915->pmu.enable = 0;
> +
> +	if (perf_pmu_register(&i915->pmu.base, "i915", -1))
> +		i915->pmu.base.event_init = NULL;
> +}
> +
> +void i915_pmu_unregister(struct drm_i915_private *i915)
> +{
> +	if (!i915->pmu.base.event_init)
> +		return;
> +
> +	i915->pmu.enable = 0;
> +
> +	perf_pmu_unregister(&i915->pmu.base);
> +	i915->pmu.base.event_init = NULL;
> +
> +	hrtimer_cancel(&i915->pmu.timer);
> +}
> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
> index 6aa20ac8cde3..084fa7816256 100644
> --- a/drivers/gpu/drm/i915/intel_ringbuffer.h
> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
> @@ -244,6 +244,8 @@ struct intel_engine_cs {
>   		I915_SELFTEST_DECLARE(bool mock : 1);
>   	} breadcrumbs;
>   
> +	u64 pmu_sample[3];
> +
>   	/*
>   	 * A pool of objects to use as shadow copies of client batch buffers
>   	 * when the command parser is enabled. Prevents the client from
> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> index 34ee011f08ac..e9375ff29371 100644
> --- a/include/uapi/drm/i915_drm.h
> +++ b/include/uapi/drm/i915_drm.h
> @@ -86,6 +86,46 @@ enum i915_mocs_table_index {
>   	I915_MOCS_CACHED,
>   };
>   
> +/**
> + * DOC: perf_events exposed by i915 through /sys/bus/event_sources/drivers/i915
> + *
> + */
> +#define I915_SAMPLE_BUSY	0
> +#define I915_SAMPLE_WAIT	1
> +#define I915_SAMPLE_SEMA	2
> +
> +#define I915_SAMPLE_RCS		0
> +#define I915_SAMPLE_VCS		1
> +#define I915_SAMPLE_BCS		2
> +#define I915_SAMPLE_VECS	3
> +
> +#define __I915_PMU_COUNT(ring, id) ((ring) << 2 | (id))
> +
> +#define I915_PMU_COUNT_RCS_BUSY __I915_PMU_COUNT(I915_SAMPLE_RCS, I915_SAMPLE_BUSY)
> +#define I915_PMU_COUNT_RCS_WAIT __I915_PMU_COUNT(I915_SAMPLE_RCS, I915_SAMPLE_WAIT)
> +#define I915_PMU_COUNT_RCS_SEMA __I915_PMU_COUNT(I915_SAMPLE_RCS, I915_SAMPLE_SEMA)
> +
> +#define I915_PMU_COUNT_VCS_BUSY __I915_PMU_COUNT(I915_SAMPLE_VCS, I915_SAMPLE_BUSY)
> +#define I915_PMU_COUNT_VCS_WAIT __I915_PMU_COUNT(I915_SAMPLE_VCS, I915_SAMPLE_WAIT)
> +#define I915_PMU_COUNT_VCS_SEMA __I915_PMU_COUNT(I915_SAMPLE_VCS, I915_SAMPLE_SEMA)
> +
> +#define I915_PMU_COUNT_BCS_BUSY __I915_PMU_COUNT(I915_SAMPLE_BCS, I915_SAMPLE_BUSY)
> +#define I915_PMU_COUNT_BCS_WAIT __I915_PMU_COUNT(I915_SAMPLE_BCS, I915_SAMPLE_WAIT)
> +#define I915_PMU_COUNT_BCS_SEMA __I915_PMU_COUNT(I915_SAMPLE_BCS, I915_SAMPLE_SEMA)
> +
> +#define I915_PMU_COUNT_VECS_BUSY __I915_PMU_COUNT(I915_SAMPLE_VECS, I915_SAMPLE_BUSY)
> +#define I915_PMU_COUNT_VECS_WAIT __I915_PMU_COUNT(I915_SAMPLE_VECS, I915_SAMPLE_WAIT)
> +#define I915_PMU_COUNT_VECS_SEMA __I915_PMU_COUNT(I915_SAMPLE_VECS, I915_SAMPLE_SEMA)
> +
> +#define I915_PMU_ACTUAL_FREQUENCY 32
> +#define I915_PMU_REQUESTED_FREQUENCY 33
> +#define I915_PMU_ENERGY 34
> +#define I915_PMU_INTERRUPTS 35
> +
> +#define I915_PMU_RC6_RESIDENCY		40
> +#define I915_PMU_RC6p_RESIDENCY	41
> +#define I915_PMU_RC6pp_RESIDENCY	42
> +
>   /* Each region is a minimum of 16k, and there are at most 255 of them.
>    */
>   #define I915_NR_TEX_REGIONS 255	/* table size 2k - maximum due to use
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index f811dd20bbc1..6351ed8a2e56 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -7365,6 +7365,7 @@ int perf_event_overflow(struct perf_event *event,
>   {
>   	return __perf_event_overflow(event, 1, data, regs);
>   }
> +EXPORT_SYMBOL_GPL(perf_event_overflow);
>   
>   /*
>    * Generic software event infrastructure

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 24/24] RFC drm/i915: Expose a PMU interface for perf queries
  2017-05-18 23:48   ` Dmitry Rogozhkin
@ 2017-05-19  8:01     ` Chris Wilson
  2017-05-19 14:54       ` Dmitry Rogozhkin
  0 siblings, 1 reply; 47+ messages in thread
From: Chris Wilson @ 2017-05-19  8:01 UTC (permalink / raw)
  To: Dmitry Rogozhkin; +Cc: intel-gfx

On Thu, May 18, 2017 at 04:48:47PM -0700, Dmitry Rogozhkin wrote:
> 
> 
> On 5/18/2017 2:46 AM, Chris Wilson wrote:
> >The first goal is to be able to measure GPU (and invidual ring) busyness
> >without having to poll registers from userspace. (Which not only incurs
> >holding the forcewake lock indefinitely, perturbing the system, but also
> >runs the risk of hanging the machine.) As an alternative we can use the
> >perf event counter interface to sample the ring registers periodically
> >and send those results to userspace.
> >
> >To be able to do so, we need to export the two symbols from
> >kernel/events/core.c to register and unregister a PMU device.
> >
> >v2: Use a common timer for the ring sampling.
> >
> >Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> >---
> >  drivers/gpu/drm/i915/Makefile           |   1 +
> >  drivers/gpu/drm/i915/i915_drv.c         |   2 +
> >  drivers/gpu/drm/i915/i915_drv.h         |  23 ++
> >  drivers/gpu/drm/i915/i915_pmu.c         | 449 ++++++++++++++++++++++++++++++++
> >  drivers/gpu/drm/i915/intel_ringbuffer.h |   2 +
> >  include/uapi/drm/i915_drm.h             |  40 +++
> >  kernel/events/core.c                    |   1 +
> >  7 files changed, 518 insertions(+)
> >  create mode 100644 drivers/gpu/drm/i915/i915_pmu.c
> >
> >diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
> >index 7b05fb802f4c..ca88e6e67910 100644
> >--- a/drivers/gpu/drm/i915/Makefile
> >+++ b/drivers/gpu/drm/i915/Makefile
> >@@ -26,6 +26,7 @@ i915-y := i915_drv.o \
> >  i915-$(CONFIG_COMPAT)   += i915_ioc32.o
> >  i915-$(CONFIG_DEBUG_FS) += i915_debugfs.o intel_pipe_crc.o
> >+i915-$(CONFIG_PERF_EVENTS) += i915_pmu.o
> >  # GEM code
> >  i915-y += i915_cmd_parser.o \
> >diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> >index 2d2fb4327f97..e3c6d052d1c9 100644
> >--- a/drivers/gpu/drm/i915/i915_drv.c
> >+++ b/drivers/gpu/drm/i915/i915_drv.c
> >@@ -1144,6 +1144,7 @@ static void i915_driver_register(struct drm_i915_private *dev_priv)
> >  	struct drm_device *dev = &dev_priv->drm;
> >  	i915_gem_shrinker_init(dev_priv);
> >+	i915_pmu_register(dev_priv);
> >  	/*
> >  	 * Notify a valid surface after modesetting,
> >@@ -1197,6 +1198,7 @@ static void i915_driver_unregister(struct drm_i915_private *dev_priv)
> >  	intel_opregion_unregister(dev_priv);
> >  	i915_perf_unregister(dev_priv);
> >+	i915_pmu_unregister(dev_priv);
> >  	i915_teardown_sysfs(dev_priv);
> >  	i915_guc_log_unregister(dev_priv);
> >diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> >index 1fa1e7d48f02..10beae1a13c8 100644
> >--- a/drivers/gpu/drm/i915/i915_drv.h
> >+++ b/drivers/gpu/drm/i915/i915_drv.h
> >@@ -40,6 +40,7 @@
> >  #include <linux/hash.h>
> >  #include <linux/intel-iommu.h>
> >  #include <linux/kref.h>
> >+#include <linux/perf_event.h>
> >  #include <linux/pm_qos.h>
> >  #include <linux/reservation.h>
> >  #include <linux/shmem_fs.h>
> >@@ -2075,6 +2076,12 @@ struct intel_cdclk_state {
> >  	unsigned int cdclk, vco, ref;
> >  };
> >+enum {
> >+	__I915_SAMPLE_FREQ_ACT = 0,
> >+	__I915_SAMPLE_FREQ_REQ,
> >+	__I915_NUM_PMU_SAMPLERS
> >+};
> >+
> >  struct drm_i915_private {
> >  	struct drm_device drm;
> >@@ -2564,6 +2571,13 @@ struct drm_i915_private {
> >  		int	irq;
> >  	} lpe_audio;
> >+	struct {
> >+		struct pmu base;
> >+		struct hrtimer timer;
> >+		u64 enable;
> >+		u64 sample[__I915_NUM_PMU_SAMPLERS];
> >+	} pmu;
> >+
> >  	/*
> >  	 * NOTE: This is the dri1/ums dungeon, don't add stuff here. Your patch
> >  	 * will be rejected. Instead look for a better place.
> >@@ -3681,6 +3695,15 @@ extern void i915_perf_fini(struct drm_i915_private *dev_priv);
> >  extern void i915_perf_register(struct drm_i915_private *dev_priv);
> >  extern void i915_perf_unregister(struct drm_i915_private *dev_priv);
> >+/* i915_pmu.c */
> >+#ifdef CONFIG_PERF_EVENTS
> >+extern void i915_pmu_register(struct drm_i915_private *i915);
> >+extern void i915_pmu_unregister(struct drm_i915_private *i915);
> >+#else
> >+static inline void i915_pmu_register(struct drm_i915_private *i915) {}
> >+static inline void i915_pmu_unregister(struct drm_i915_private *i915) {}
> >+#endif
> >+
> >  /* i915_suspend.c */
> >  extern int i915_save_state(struct drm_i915_private *dev_priv);
> >  extern int i915_restore_state(struct drm_i915_private *dev_priv);
> >diff --git a/drivers/gpu/drm/i915/i915_pmu.c b/drivers/gpu/drm/i915/i915_pmu.c
> >new file mode 100644
> >index 000000000000..80e1c07841ac
> >--- /dev/null
> >+++ b/drivers/gpu/drm/i915/i915_pmu.c
> >@@ -0,0 +1,449 @@
> >+#include <linux/perf_event.h>
> >+#include <linux/pm_runtime.h>
> >+
> >+#include "i915_drv.h"
> >+#include "intel_ringbuffer.h"
> >+
> >+#define FREQUENCY 200
> >+#define PERIOD max_t(u64, 10000, NSEC_PER_SEC / FREQUENCY)
> >+
> >+#define RING_MASK 0xffffffff
> >+#define RING_MAX 32
> >+
> >+static void engines_sample(struct drm_i915_private *dev_priv)
> >+{
> >+	struct intel_engine_cs *engine;
> >+	enum intel_engine_id id;
> >+	bool fw = false;
> >+
> >+	if ((dev_priv->pmu.enable & RING_MASK) == 0)
> >+		return;
> >+
> >+	if (!dev_priv->gt.awake)
> >+		return;
> >+
> >+	if (!intel_runtime_pm_get_if_in_use(dev_priv))
> >+		return;
> >+
> >+	for_each_engine(engine, dev_priv, id) {
> >+		u32 val;
> >+
> >+		if ((dev_priv->pmu.enable & (0x7 << (4*id))) == 0)
> >+			continue;
> >+
> >+		if (i915_seqno_passed(intel_engine_get_seqno(engine),
> >+				      intel_engine_last_submit(engine)))
> >+			continue;
> >+
> >+		if (!fw) {
> >+			intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
> >+			fw = true;
> >+		}
> >+
> >+		val = I915_READ_FW(RING_MI_MODE(engine->mmio_base));
> >+		if (!(val & MODE_IDLE))
> >+			engine->pmu_sample[I915_SAMPLE_BUSY] += PERIOD;
> Could you please check that I understand correctly what you are
> attempting to do? Every PERIOD=10ms you check the status of the
> engines. If an engine is in a busy state you count the full PERIOD
> as if the engine was busy during it. If the engine was in some
> other state you count the PERIOD toward that state instead (wait or
> sema below, as I understand).
> 
> If that is what you are doing, your solution is really only usable
> for cases where tasks execute on the engines for longer than 2x the
> PERIOD, i.e. longer than 20ms.

It's statistical sampling, same as any other. You have control over
the period, or at least perf does give you control (I may not have
hooked it up correctly). I set a totally arbitrary cap of 20us.
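
For reference, a rough userspace sketch of choosing your own period
(assuming this RFC's uapi defines are installed, and its behaviour of
treating attr.sample_period as nanoseconds with a minimum floor;
error handling omitted):

#include <drm/i915_drm.h>
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* i915_type is read from /sys/bus/event_source/devices/i915/type */
static int open_rcs_busy(int i915_type, unsigned long long period_ns)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.type = i915_type;
	attr.size = sizeof(attr);
	attr.config = I915_PMU_COUNT_RCS_BUSY;
	attr.sample_period = period_ns;

	/* all processes (pid == -1), counted on cpu0 */
	return syscall(__NR_perf_event_open, &attr, -1, 0, -1, 0);
}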

> That is not the case
> for a lot of tasks submitted by media, whose usual execution time
> is just a few milliseconds.

If you are only issuing one batch lasting 1ms every second, the load is
more or less immaterial. 
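
(The arithmetic for that case: a 1ms batch once a second is caught by
a 10ms sample tick with probability ~0.1, and each catch credits
10ms, so the busy counter still advances by ~1ms/s in expectation -
only the per-second variance is large.)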

> Considering that i915 currently tracks
> when batches were scheduled and when they are ready, you can easily
> calculate a precise metric of busy clocks for each engine from the
> perspective of i915. It will not be the precise time of how long
> the engine computed things, because it will include scheduling
> latency, but that is exactly what the end user requires: the end
> user does not care whether there is latency or not, for him the
> engine is busy all that time.

Which you can do yourself. I am not that interested in repeating
metrics already available to userspace. Furthermore, coupled with the
statistical sampling of perf, perf also provides access to the
timestamps of when each request is constructed, queued, executed,
completed and retired; and waited upon, along with a host of other
tracepoints.

> This is done in the alternate solution given
> by Tvrtko in "drm/i915: Export engine busy stats in debugfs" patch.
> Why do you not take this as a basis? Why is this patch, containing a
> few arithmetic operations to calculate engine busy clocks, ignored?

Because this approach is independent of platform and plugs into the
existing perf infrastructure. And you need a far more convincing
argument to have execlist-only code running inside interrupt
handlers. Tvrtko said himself that maintainability is paramount.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 01/24] drm/i915/selftests: Pretend to be a gfx pci device
  2017-05-18  9:46 [PATCH 01/24] drm/i915/selftests: Pretend to be a gfx pci device Chris Wilson
                   ` (24 preceding siblings ...)
  2017-05-18 13:35 ` [PATCH 01/24] " Tvrtko Ursulin
@ 2017-05-19  9:02 ` Joonas Lahtinen
  2017-05-19  9:10   ` Chris Wilson
  25 siblings, 1 reply; 47+ messages in thread
From: Joonas Lahtinen @ 2017-05-19  9:02 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

On to, 2017-05-18 at 10:46 +0100, Chris Wilson wrote:
> Set the class on our mock pci device to GFX. This should be useful for
> utilities like intel-iommu that special case gfx devices.
> 
> References: https://bugs.freedesktop.org/show_bug.cgi?id=101080
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

I'm not a hundred percent sure how this connects to the referenced bug,
but it's correct:

Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>

Regards, Joonas
-- 
Joonas Lahtinen
Open Source Technology Center
Intel Corporation
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 10/24] drm/i915: Disable EXEC_OBJECT_ASYNC when doing relocations
  2017-05-18  9:46 ` [PATCH 10/24] drm/i915: Disable EXEC_OBJECT_ASYNC when doing relocations Chris Wilson
@ 2017-05-19  9:05   ` Joonas Lahtinen
  0 siblings, 0 replies; 47+ messages in thread
From: Joonas Lahtinen @ 2017-05-19  9:05 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

On to, 2017-05-18 at 10:46 +0100, Chris Wilson wrote:
> If we write a relocation into the buffer, we require our own implicit
> synchronisation added after the start of the execbuf, outside of the
> user's control. As we may end up clflushing, or doing the patch itself
> on the GPU, asynchronously we need to look at the implicit serialisation
> on obj->resv and hence need to disable EXEC_OBJECT_ASYNC for this
> object.
> 
> If the user does trigger a stall for relocations, we make sure the stall
> is complete enough so that the batch is not submitted before we complete
> those relocations.
> 
> Fixes: 77ae9957897d ("drm/i915: Enable userspace to opt-out of implicit fencing")
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Jason Ekstrand <jason@jlekstrand.net>

Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>

Regards, Joonas
-- 
Joonas Lahtinen
Open Source Technology Center
Intel Corporation
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 23/24] drm/i915: Keep a recent cache of freed contexts objects for reuse
  2017-05-18 14:19     ` Chris Wilson
@ 2017-05-19  9:09       ` Tvrtko Ursulin
  0 siblings, 0 replies; 47+ messages in thread
From: Tvrtko Ursulin @ 2017-05-19  9:09 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 18/05/2017 15:19, Chris Wilson wrote:
> On Thu, May 18, 2017 at 02:52:41PM +0100, Tvrtko Ursulin wrote:
>>
>> On 18/05/2017 10:46, Chris Wilson wrote:
>>> Keep the recently freed context objects for reuse. This allows us to use
>>> the current GGTT bindings and dma bound pages, avoiding any clflushes as
>>> required. We mark the objects as purgeable under memory pressure, and
>>> reap the list of freed objects as soon as the device is idle.
>>
>> Some description on when is this interesting and by how much?
>
> No interesting workload (a few synthetic benchmarks including the GL).
> Best argument would be a minor improvement in application startup.

In this case I would be reluctant to complicate the code for no benefit.

>> Since you drop everything on idle it must be for some workload which
>> likes to go through context while keeping busy?
>
> Yup, it's just the rule of thumb I have (from experience with the
> cmdparser batch caching).
>
>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>> ---
>>> drivers/gpu/drm/i915/i915_drv.h                  |  2 +
>>> drivers/gpu/drm/i915/i915_gem.c                  |  1 +
>>> drivers/gpu/drm/i915/i915_gem_context.c          | 59 ++++++++++++++++++++++--
>>> drivers/gpu/drm/i915/i915_gem_context.h          |  5 ++
>>> drivers/gpu/drm/i915/intel_lrc.c                 |  2 +-
>>> drivers/gpu/drm/i915/intel_ringbuffer.c          |  2 +-
>>> drivers/gpu/drm/i915/selftests/mock_context.c    |  1 +
>>> drivers/gpu/drm/i915/selftests/mock_gem_device.c |  1 +
>>> 8 files changed, 68 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
>>> index 1d55bbde68df..1fa1e7d48f02 100644
>>> --- a/drivers/gpu/drm/i915/i915_drv.h
>>> +++ b/drivers/gpu/drm/i915/i915_drv.h
>>> @@ -2316,6 +2316,8 @@ struct drm_i915_private {
>>> 		struct llist_head free_list;
>>> 		struct work_struct free_work;
>>>
>>> +		struct list_head freed_objects;
>>
>> free_ctx_objects maybe?
>
> This is i915->contexts.freed_objects (new contexts substruct). There is
> already a precedent for using freed_ (mainly because I think its a
> better description for that phase, free is too similar to avail,
> although you may persuade me that free is better).

My bad, I thought it is drm_i915_private.

>> Knowing you I am surprised this is not bucketed! :)
>
> I didn't bother since we only really have two sizes - also reason why I
> didn't make it per-engine so that we could share the xcs context
> objects. My expectation is that we won't be caught up in slow list
> searches, so went for simplicity for a change.
>
>>> +struct drm_i915_gem_object *
>>> +i915_gem_context_create_object(struct drm_i915_private *i915,
>>> +			       unsigned long size)
>>> +{
>>> +	struct drm_i915_gem_object *obj, *on;
>>> +
>>> +	lockdep_assert_held(&i915->drm.struct_mutex);
>>> +
>>> +	list_for_each_entry_safe(obj, on,
>>> +				 &i915->contexts.freed_objects,
>>> +				 batch_pool_link) {
>>> +		if (obj->mm.madv != I915_MADV_DONTNEED) {
>>> +			/* Purge the heretic! */
>>
>> I don't see this can happen in the current version?
>
> The power of the shrinker. Remember it can and will stab you in the back
> at any time.

Ok, I forgot about the purged state; I was just surprised that anyone
would be able to move the state away from don't-need.

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 01/24] drm/i915/selftests: Pretend to be a gfx pci device
  2017-05-19  9:02 ` Joonas Lahtinen
@ 2017-05-19  9:10   ` Chris Wilson
  0 siblings, 0 replies; 47+ messages in thread
From: Chris Wilson @ 2017-05-19  9:10 UTC (permalink / raw)
  To: Joonas Lahtinen; +Cc: intel-gfx

On Fri, May 19, 2017 at 12:02:48PM +0300, Joonas Lahtinen wrote:
> On to, 2017-05-18 at 10:46 +0100, Chris Wilson wrote:
> > Set the class on our mock pci device to GFX. This should be useful for
> > utilities like intel-iommu that special case gfx devices.
> > 
> > References: https://bugs.freedesktop.org/show_bug.cgi?id=101080
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> 
> I'm not a hundred percent sure how this connects to the referenced bug,
> but it's correct:

The bug is an ENOMEM from intel-iommu.c as it tried to allocate an iommu
domain. My goal is to circumvent that allocation by using its gfx special
case.
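
(For reference, and from memory so the exact form may differ, the
special case in intel-iommu keys off the PCI class code, along the
lines of

	#define IS_GFX_DEVICE(pdev) \
		((pdev->class >> 16) == PCI_BASE_CLASS_DISPLAY)

which is why setting pdev->class on the mock device should be enough.)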
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 17/24] drm/i915: Stash a pointer to the obj's resv in the vma
  2017-05-18  9:46 ` [PATCH 17/24] drm/i915: Stash a pointer to the obj's resv in the vma Chris Wilson
@ 2017-05-19  9:19   ` Joonas Lahtinen
  0 siblings, 0 replies; 47+ messages in thread
From: Joonas Lahtinen @ 2017-05-19  9:19 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

On to, 2017-05-18 at 10:46 +0100, Chris Wilson wrote:
> During execbuf, a mandatory step is that we add this request (this
> fence) to each object's reservation_object. Inside execbuf, we track the
> vma, and to add the fence to the reservation_object then means having to
> first chase the obj, incurring another cache miss. We can reduce the
>  number of cache misses by stashing a pointer to the reservation_object
> in the vma itself.
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

<SNIP>

> +++ b/drivers/gpu/drm/i915/i915_vma.h
> @@ -50,6 +50,7 @@ struct i915_vma {
>  	struct drm_i915_gem_object *obj;
>  	struct i915_address_space *vm;
>  	struct drm_i915_fence_reg *fence;
> +	struct reservation_object *resv;

Add a kerneldoc that this is an alias.

Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>

Regards, Joonas
-- 
Joonas Lahtinen
Open Source Technology Center
Intel Corporation
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 18/24] drm/i915: Convert execbuf to use struct-of-array packing for critical fields
  2017-05-18  9:46 ` [PATCH 18/24] drm/i915: Convert execbuf to use struct-of-array packing for critical fields Chris Wilson
@ 2017-05-19  9:25   ` Chris Wilson
  0 siblings, 0 replies; 47+ messages in thread
From: Chris Wilson @ 2017-05-19  9:25 UTC (permalink / raw)
  To: intel-gfx

On Thu, May 18, 2017 at 10:46:32AM +0100, Chris Wilson wrote:
> When userspace is doing most of the work, avoiding relocs (using
> NO_RELOC) and opting out of implicit synchronisation (using ASYNC), we
> still spend a lot of time processing the arrays in execbuf, even though
> we now should have nothing to do most of the time. One issue that
> becomes readily apparent in profiling anv is that iterating over the
> large execobj[] is unfriendly to the loop prefetchers of the CPU and it
> much prefers iterating over a pair of arrays rather than one big array.

Joonas pointed out that, given the alignment of the vma in our slab, I
have a few bits to spare in the eb->vma[] pointers, as I am only using a
few bits for state tracking of the vma through execbuf; and he does like
bit packing into pointers very much...
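
Roughly this shape (a sketch only, the flag names are made up):

	/* Slab allocations are aligned, so the low bits of a
	 * struct i915_vma pointer are zero and can carry per-execbuf
	 * state without growing the array.
	 */
	#define __EXEC_VMA_FLAGS 0x3ul

	static inline unsigned long pack_vma(struct i915_vma *vma,
					     unsigned long flags)
	{
		GEM_BUG_ON(flags & ~__EXEC_VMA_FLAGS);
		return (unsigned long)vma | flags;
	}

	static inline struct i915_vma *unpack_vma(unsigned long entry)
	{
		return (struct i915_vma *)(entry & ~__EXEC_VMA_FLAGS);
	}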
-Chris
> 

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 20/24] drm/i915: Group all the global context information together
  2017-05-18  9:46 ` [PATCH 20/24] drm/i915: Group all the global context information together Chris Wilson
@ 2017-05-19  9:35   ` Joonas Lahtinen
  0 siblings, 0 replies; 47+ messages in thread
From: Joonas Lahtinen @ 2017-05-19  9:35 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

On to, 2017-05-18 at 10:46 +0100, Chris Wilson wrote:
> Create a substruct to hold all the global context state under
> drm_i915_private.
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>

Regards, Joonas
-- 
Joonas Lahtinen
Open Source Technology Center
Intel Corporation
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 12/24] drm/i915: Store a persistent reference for an object in the execbuffer cache
  2017-05-18  9:46 ` [PATCH 12/24] drm/i915: Store a persistent reference for an object in the execbuffer cache Chris Wilson
@ 2017-05-19  9:37   ` Joonas Lahtinen
  0 siblings, 0 replies; 47+ messages in thread
From: Joonas Lahtinen @ 2017-05-19  9:37 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

On to, 2017-05-18 at 10:46 +0100, Chris Wilson wrote:
> If we take a reference to the object/vma when it is first used in an
> execbuf, we can keep that reference until the object's file-local handle
> is closed. Thereby saving a frequent ref/unref pair.
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

The benchmark this gives improvements on might make sense to add to the
commit message.

Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>

Regards, Joonas
-- 
Joonas Lahtinen
Open Source Technology Center
Intel Corporation
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 03/24] drm/i915: Store i915_gem_object_is_coherent() as a bit next to cache-dirty
  2017-05-18  9:46 ` [PATCH 03/24] drm/i915: Store i915_gem_object_is_coherent() as a bit next to cache-dirty Chris Wilson
@ 2017-05-19 10:22   ` Chris Wilson
  2017-05-22 20:39   ` Dongwon Kim
  1 sibling, 0 replies; 47+ messages in thread
From: Chris Wilson @ 2017-05-19 10:22 UTC (permalink / raw)
  To: intel-gfx; +Cc: Dongwon Kim

On Thu, May 18, 2017 at 10:46:17AM +0100, Chris Wilson wrote:
> For ease of use (i.e. avoiding a few checks and function calls), store
> the object's cache coherency next to the cache is dirty bit.
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Dongwon Kim <dongwon.kim@intel.com>
> Cc: Matt Roper <matthew.d.roper@intel.com>
> ---
> diff --git a/drivers/gpu/drm/i915/i915_gem_clflush.c b/drivers/gpu/drm/i915/i915_gem_clflush.c
> index 17b207e963c2..152f16c11878 100644
> --- a/drivers/gpu/drm/i915/i915_gem_clflush.c
> +++ b/drivers/gpu/drm/i915/i915_gem_clflush.c
> @@ -139,7 +139,7 @@ void i915_gem_clflush_object(struct drm_i915_gem_object *obj,
>  	 * snooping behaviour occurs naturally as the result of our domain
>  	 * tracking.
>  	 */
> -	if (!(flags & I915_CLFLUSH_FORCE) && i915_gem_object_is_coherent(obj))
> +	if (!(flags & I915_CLFLUSH_FORCE) && obj->cache_coherent)
>  		return;
>  
>  	trace_i915_gem_object_clflush(obj);
> diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
> index 0b8ae0f56675..2e5f513087a8 100644
> --- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
> +++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
> @@ -1129,7 +1129,7 @@ i915_gem_execbuffer_move_to_gpu(struct drm_i915_gem_request *req,
>  		if (vma->exec_entry->flags & EXEC_OBJECT_ASYNC)
>  			continue;
>  
> -		if (obj->cache_dirty)
> +		if (obj->cache_dirty & ~obj->cache_coherent)
>  			i915_gem_clflush_object(obj, 0);

I was worrying whether this is in conflict with v3 of the previous
patch, i.e. we must invalidate the stale cachelines. However, we are
still dropping clflushes if cache-coherent is tagged. So that narrows
down the cache-level transitions that are relevant and in particular a
clflush on transition to snooped before the GPU execbuf is not the fix
being applied in the previous patch. Odd.
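
(For reference, since cache_dirty and cache_coherent are both
single-bit fields, the bitwise test above is plain boolean logic and is
equivalent to

	if (obj->cache_dirty && !obj->cache_coherent)
		i915_gem_clflush_object(obj, 0);

i.e. flush only when dirty and not coherent.)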
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 22/24] drm/i915: Enable rcu-only context lookups
  2017-05-18  9:46 ` [PATCH 22/24] drm/i915: Enable rcu-only context lookups Chris Wilson
@ 2017-05-19 10:36   ` Joonas Lahtinen
  0 siblings, 0 replies; 47+ messages in thread
From: Joonas Lahtinen @ 2017-05-19 10:36 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

On to, 2017-05-18 at 10:46 +0100, Chris Wilson wrote:
> Whilst the contents of the context is still protected by the big
> struct_mutex, this is not much of an improvement. It is just one tiny
> step towards reducing our BKL.
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

<SNIP>

> +++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
> @@ -665,16 +665,17 @@ static int eb_select_context(struct i915_execbuffer *eb)
>  	struct i915_gem_context *ctx;
>  
>  	ctx = i915_gem_context_lookup(eb->file->driver_priv, eb->args->rsvd1);
> -	if (unlikely(IS_ERR(ctx)))
> -		return PTR_ERR(ctx);
> +	if (unlikely(!ctx))
> +		return -ENOENT;
>  
>  	if (unlikely(i915_gem_context_is_banned(ctx))) {
>  		DRM_DEBUG("Context %u tried to submit while banned\n",
>  			  ctx->user_handle);
> +		i915_gem_context_put(ctx);
>  		return -EIO;

Dem sweet onions.

>  	}

Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>

Regards, Joonas
-- 
Joonas Lahtinen
Open Source Technology Center
Intel Corporation
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 24/24] RFC drm/i915: Expose a PMU interface for perf queries
  2017-05-19  8:01     ` Chris Wilson
@ 2017-05-19 14:54       ` Dmitry Rogozhkin
  0 siblings, 0 replies; 47+ messages in thread
From: Dmitry Rogozhkin @ 2017-05-19 14:54 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx



On 5/19/2017 1:01 AM, Chris Wilson wrote:
> On Thu, May 18, 2017 at 04:48:47PM -0700, Dmitry Rogozhkin wrote:
>>
>> On 5/18/2017 2:46 AM, Chris Wilson wrote:
>>> The first goal is to be able to measure GPU (and individual ring) busyness
>>> without having to poll registers from userspace. (Which not only incurs
>>> holding the forcewake lock indefinitely, perturbing the system, but also
>>> runs the risk of hanging the machine.) As an alternative we can use the
>>> perf event counter interface to sample the ring registers periodically
>>> and send those results to userspace.
>>>
>>> To be able to do so, we need to export the two symbols from
>>> kernel/events/core.c to register and unregister a PMU device.
>>>
>>> v2: Use a common timer for the ring sampling.
>>>
>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>> ---
>>>   drivers/gpu/drm/i915/Makefile           |   1 +
>>>   drivers/gpu/drm/i915/i915_drv.c         |   2 +
>>>   drivers/gpu/drm/i915/i915_drv.h         |  23 ++
>>>   drivers/gpu/drm/i915/i915_pmu.c         | 449 ++++++++++++++++++++++++++++++++
>>>   drivers/gpu/drm/i915/intel_ringbuffer.h |   2 +
>>>   include/uapi/drm/i915_drm.h             |  40 +++
>>>   kernel/events/core.c                    |   1 +
>>>   7 files changed, 518 insertions(+)
>>>   create mode 100644 drivers/gpu/drm/i915/i915_pmu.c
>>>
>>> diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
>>> index 7b05fb802f4c..ca88e6e67910 100644
>>> --- a/drivers/gpu/drm/i915/Makefile
>>> +++ b/drivers/gpu/drm/i915/Makefile
>>> @@ -26,6 +26,7 @@ i915-y := i915_drv.o \
>>>   i915-$(CONFIG_COMPAT)   += i915_ioc32.o
>>>   i915-$(CONFIG_DEBUG_FS) += i915_debugfs.o intel_pipe_crc.o
>>> +i915-$(CONFIG_PERF_EVENTS) += i915_pmu.o
>>>   # GEM code
>>>   i915-y += i915_cmd_parser.o \
>>> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
>>> index 2d2fb4327f97..e3c6d052d1c9 100644
>>> --- a/drivers/gpu/drm/i915/i915_drv.c
>>> +++ b/drivers/gpu/drm/i915/i915_drv.c
>>> @@ -1144,6 +1144,7 @@ static void i915_driver_register(struct drm_i915_private *dev_priv)
>>>   	struct drm_device *dev = &dev_priv->drm;
>>>   	i915_gem_shrinker_init(dev_priv);
>>> +	i915_pmu_register(dev_priv);
>>>   	/*
>>>   	 * Notify a valid surface after modesetting,
>>> @@ -1197,6 +1198,7 @@ static void i915_driver_unregister(struct drm_i915_private *dev_priv)
>>>   	intel_opregion_unregister(dev_priv);
>>>   	i915_perf_unregister(dev_priv);
>>> +	i915_pmu_unregister(dev_priv);
>>>   	i915_teardown_sysfs(dev_priv);
>>>   	i915_guc_log_unregister(dev_priv);
>>> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
>>> index 1fa1e7d48f02..10beae1a13c8 100644
>>> --- a/drivers/gpu/drm/i915/i915_drv.h
>>> +++ b/drivers/gpu/drm/i915/i915_drv.h
>>> @@ -40,6 +40,7 @@
>>>   #include <linux/hash.h>
>>>   #include <linux/intel-iommu.h>
>>>   #include <linux/kref.h>
>>> +#include <linux/perf_event.h>
>>>   #include <linux/pm_qos.h>
>>>   #include <linux/reservation.h>
>>>   #include <linux/shmem_fs.h>
>>> @@ -2075,6 +2076,12 @@ struct intel_cdclk_state {
>>>   	unsigned int cdclk, vco, ref;
>>>   };
>>> +enum {
>>> +	__I915_SAMPLE_FREQ_ACT = 0,
>>> +	__I915_SAMPLE_FREQ_REQ,
>>> +	__I915_NUM_PMU_SAMPLERS
>>> +};
>>> +
>>>   struct drm_i915_private {
>>>   	struct drm_device drm;
>>> @@ -2564,6 +2571,13 @@ struct drm_i915_private {
>>>   		int	irq;
>>>   	} lpe_audio;
>>> +	struct {
>>> +		struct pmu base;
>>> +		struct hrtimer timer;
>>> +		u64 enable;
>>> +		u64 sample[__I915_NUM_PMU_SAMPLERS];
>>> +	} pmu;
>>> +
>>>   	/*
>>>   	 * NOTE: This is the dri1/ums dungeon, don't add stuff here. Your patch
>>>   	 * will be rejected. Instead look for a better place.
>>> @@ -3681,6 +3695,15 @@ extern void i915_perf_fini(struct drm_i915_private *dev_priv);
>>>   extern void i915_perf_register(struct drm_i915_private *dev_priv);
>>>   extern void i915_perf_unregister(struct drm_i915_private *dev_priv);
>>> +/* i915_pmu.c */
>>> +#ifdef CONFIG_PERF_EVENTS
>>> +extern void i915_pmu_register(struct drm_i915_private *i915);
>>> +extern void i915_pmu_unregister(struct drm_i915_private *i915);
>>> +#else
>>> +static inline void i915_pmu_register(struct drm_i915_private *i915) {}
>>> +static inline void i915_pmu_unregister(struct drm_i915_private *i915) {}
>>> +#endif
>>> +
>>>   /* i915_suspend.c */
>>>   extern int i915_save_state(struct drm_i915_private *dev_priv);
>>>   extern int i915_restore_state(struct drm_i915_private *dev_priv);
>>> diff --git a/drivers/gpu/drm/i915/i915_pmu.c b/drivers/gpu/drm/i915/i915_pmu.c
>>> new file mode 100644
>>> index 000000000000..80e1c07841ac
>>> --- /dev/null
>>> +++ b/drivers/gpu/drm/i915/i915_pmu.c
>>> @@ -0,0 +1,449 @@
>>> +#include <linux/perf_event.h>
>>> +#include <linux/pm_runtime.h>
>>> +
>>> +#include "i915_drv.h"
>>> +#include "intel_ringbuffer.h"
>>> +
>>> +#define FREQUENCY 200
>>> +#define PERIOD max_t(u64, 10000, NSEC_PER_SEC / FREQUENCY)
>>> +
>>> +#define RING_MASK 0xffffffff
>>> +#define RING_MAX 32
>>> +
>>> +static void engines_sample(struct drm_i915_private *dev_priv)
>>> +{
>>> +	struct intel_engine_cs *engine;
>>> +	enum intel_engine_id id;
>>> +	bool fw = false;
>>> +
>>> +	if ((dev_priv->pmu.enable & RING_MASK) == 0)
>>> +		return;
>>> +
>>> +	if (!dev_priv->gt.awake)
>>> +		return;
>>> +
>>> +	if (!intel_runtime_pm_get_if_in_use(dev_priv))
>>> +		return;
>>> +
>>> +	for_each_engine(engine, dev_priv, id) {
>>> +		u32 val;
>>> +
>>> +		if ((dev_priv->pmu.enable & (0x7 << (4*id))) == 0)
>>> +			continue;
>>> +
>>> +		if (i915_seqno_passed(intel_engine_get_seqno(engine),
>>> +				      intel_engine_last_submit(engine)))
>>> +			continue;
>>> +
>>> +		if (!fw) {
>>> +			intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
>>> +			fw = true;
>>> +		}
>>> +
>>> +		val = I915_READ_FW(RING_MI_MODE(engine->mmio_base));
>>> +		if (!(val & MODE_IDLE))
>>> +			engine->pmu_sample[I915_SAMPLE_BUSY] += PERIOD;
>> Could you, please, check that I understand correctly what you
>> attempt to do? Every PERIOD=10ms you check the status of the engines.
>> If an engine is in a busy state, you count the PERIOD time as if the
>> engine was busy during this period. If the engine was in some other
>> state, you count this toward that state (wait or sema below, as I
>> understand).
>>
>> If that's what you are trying to do, your solution can be somewhat
>> used only for the cases where tasks are executed on the engines more
>> than 2x PERIOD time, i.e. longer than 20ms.
> It's a statistical sampling, same as any other. You have control over
> the period, or at least perf does give you control (I may have not
> hooked it up correctly). I set a totally arbitrary cap of 20us.
So, I understood correctly. Statistics, even though you have everything 
needed to provide non-statistical metrics, or at least to provide both. 
It is great that perf permits control over the sampling period, but a 
smaller period comes with higher CPU usage, which will get this usage 
gated in production environments. Besides, in a production environment 
you need busy-clocks metrics over a fairly long period of time, close to 1 second.
>> Which is not the case
>> for a lot of tasks submitted by media with the usual execution time
>> of just a few milliseconds.
> If you are only issuing one batch lasting 1ms every second, the load is
> more or less immaterial.
I did not say that, and I don't understand where you got this from. Sorry.
>> Considering that i915 currently tracks
>> when batches were scheduled and when they are ready, you can easily
>> calculate a very precise metric of busy clocks for each engine
>> from the perspective of i915. That will not be the precise time of
>> how long the engine computed things, because it will include scheduling
>> latency, but that is exactly what the end-user requires: the end-user
>> does not care whether there is latency or not; for him the engine
>> is busy all that time.
> Which you can do yourself. I am not that interested in repeating
> metrics already available to userspace.
Repeating metrics?! Here is a metric which has been available for ages: 
cat /proc/stat. There I can get busy-clocks information for the CPUs and 
much more. Please point me to a similar path for the GPUs, if one exists 
and I am simply not aware of it.
> Furthermore, coupled with the
> statistical sampling, perf also provides access to the timestamps
> of when each request is constructed, queued, executed, completed,
> retired and waited upon, along with a host of other tracepoints.
So, you suggest tracking each request from userspace and calculating 
this on our own. You care so much about performance and about additional 
heavy code in the driver, especially in the interrupt handlers; why 
should I not care about userspace performance? What you are inclined to 
suggest is a no-go. As I said, monitoring metrics are required, not 
debugging metrics. Monitoring metrics should work correctly with long 
sampling periods, 1s and more, at the smallest possible overhead, i.e. no polling.
>> This is done in the alternate solution given
>> by Tvrtko in "drm/i915: Export engine busy stats in debugfs" patch.
>> Why do you not take this as a basis? Why is this patch, containing a
>> few arithmetic operations to calculate engine busy clocks, ignored?
> Because this approach is independent of platform and plugs into the
> existing perf infrastructure. And you need a far more convincing
> argument to have execlist-only code running inside interrupt
> handlers. Tvrtko said himself that maintainability is paramount.
Why execlists only? Introduce the same for ringbuffers, GuC... Step 
higher if needed and introduce counters in the scheduler routines, which 
should also track requests. After all, couple this with the request 
tracking you expose to the perf system, as you noted above. Is that not 
possible?
The perf infrastructure is a good thing. But it is really necessary to 
be able to easily get certain engine metrics (busy clocks in the first 
place) with a simple 'cat gpu/stat'. Please understand, these metrics 
are required in production to give customers the opportunity to 
dynamically load balance N media servers from the host.
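
Purely as an illustration of the kind of interface I mean (a
hypothetical sketch, assuming the driver accumulated a busy_ns counter
per engine, which this patch does not):

	static int i915_engine_stats_show(struct seq_file *m, void *unused)
	{
		struct drm_i915_private *i915 = m->private;
		struct intel_engine_cs *engine;
		enum intel_engine_id id;

		for_each_engine(engine, i915, id)
			seq_printf(m, "%s %llu\n", engine->name,
				   (unsigned long long)engine->busy_ns);

		return 0;
	}

One read per sampling window, no polling in between.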
> -Chris
>

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 23/24] drm/i915: Keep a recent cache of freed contexts objects for reuse
  2017-05-18  9:46 ` [PATCH 23/24] drm/i915: Keep a recent cache of freed contexts objects for reuse Chris Wilson
  2017-05-18 13:52   ` Tvrtko Ursulin
@ 2017-05-22 10:51   ` Tvrtko Ursulin
  2017-05-22 11:06     ` Chris Wilson
  1 sibling, 1 reply; 47+ messages in thread
From: Tvrtko Ursulin @ 2017-05-22 10:51 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 18/05/2017 10:46, Chris Wilson wrote:
> Keep the recently freed context objects for reuse. This allows us to use
> the current GGTT bindings and dma bound pages, avoiding any clflushes as
> required. We mark the objects as purgeable under memory pressure, and
> reap the list of freed objects as soon as the device is idle.

One additional thought - how about making the batch pool more generic, 
or more precisely extracting a layer from it to be called object pool, 
and then having batch pool and context pool on top of that?

There would be some details to flesh out, like whether we want to 
strictly split internal from shmemfs-backed objects or can allow the 
batch pool to get either, the API, and similar.

But if it could be done without a lot of code, it may be preferable to 
have one implementation of the same basic idea.
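
Something along these lines perhaps (a very rough sketch, all names
made up, backing store classes ignored), mirroring what this patch
already does for contexts:

	struct i915_object_pool {
		struct list_head free_objects;
		struct drm_i915_gem_object *
			(*create)(struct drm_i915_private *i915,
				  unsigned long size);
	};

	struct drm_i915_gem_object *
	i915_object_pool_get(struct i915_object_pool *pool,
			     struct drm_i915_private *i915,
			     unsigned long size)
	{
		struct drm_i915_gem_object *obj, *on;

		list_for_each_entry_safe(obj, on, &pool->free_objects,
					 batch_pool_link) {
			if (obj->mm.madv != I915_MADV_DONTNEED) {
				/* Purged by the shrinker; drop it. */
				list_del(&obj->batch_pool_link);
				i915_gem_object_put(obj);
				continue;
			}

			if (obj->base.size == size) {
				list_del(&obj->batch_pool_link);
				obj->mm.madv = I915_MADV_WILLNEED;
				return obj;
			}
		}

		return pool->create(i915, size);
	}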

Regards,

Tvrtko

> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> ---
>  drivers/gpu/drm/i915/i915_drv.h                  |  2 +
>  drivers/gpu/drm/i915/i915_gem.c                  |  1 +
>  drivers/gpu/drm/i915/i915_gem_context.c          | 59 ++++++++++++++++++++++--
>  drivers/gpu/drm/i915/i915_gem_context.h          |  5 ++
>  drivers/gpu/drm/i915/intel_lrc.c                 |  2 +-
>  drivers/gpu/drm/i915/intel_ringbuffer.c          |  2 +-
>  drivers/gpu/drm/i915/selftests/mock_context.c    |  1 +
>  drivers/gpu/drm/i915/selftests/mock_gem_device.c |  1 +
>  8 files changed, 68 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 1d55bbde68df..1fa1e7d48f02 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -2316,6 +2316,8 @@ struct drm_i915_private {
>  		struct llist_head free_list;
>  		struct work_struct free_work;
>
> +		struct list_head freed_objects;
> +
>  		/* The hw wants to have a stable context identifier for the
>  		 * lifetime of the context (for OA, PASID, faults, etc).
>  		 * This is limited in execlists to 21 bits.
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index c044b5dbdb66..f482c320d810 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -3212,6 +3212,7 @@ i915_gem_idle_work_handler(struct work_struct *work)
>  		DRM_ERROR("Timeout waiting for engines to idle\n");
>
>  	intel_engines_mark_idle(dev_priv);
> +	i915_gem_contexts_mark_idle(dev_priv);
>  	i915_gem_timelines_mark_idle(dev_priv);
>
>  	GEM_BUG_ON(!dev_priv->gt.awake);
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> index 198b0064a3d3..1e51fd36f355 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> @@ -96,6 +96,48 @@
>  /* Initial size (as log2) to preallocate the handle->object hashtable */
>  #define VMA_HT_BITS 2u /* 4 x 2 pointers, 64 bytes minimum */
>
> +struct drm_i915_gem_object *
> +i915_gem_context_create_object(struct drm_i915_private *i915,
> +			       unsigned long size)
> +{
> +	struct drm_i915_gem_object *obj, *on;
> +
> +	lockdep_assert_held(&i915->drm.struct_mutex);
> +
> +	list_for_each_entry_safe(obj, on,
> +				 &i915->contexts.freed_objects,
> +				 batch_pool_link) {
> +		if (obj->mm.madv != I915_MADV_DONTNEED) {
> +			/* Purge the heretic! */
> +			list_del(&obj->batch_pool_link);
> +			i915_gem_object_put(obj);
> +			continue;
> +		}
> +
> +		if (obj->base.size == size) {
> +			list_del(&obj->batch_pool_link);
> +			obj->mm.madv = I915_MADV_WILLNEED;
> +			return obj;
> +		}
> +	}
> +
> +	return i915_gem_object_create(i915, size);
> +}
> +
> +void i915_gem_contexts_mark_idle(struct drm_i915_private *i915)
> +{
> +	struct drm_i915_gem_object *obj, *on;
> +
> +	lockdep_assert_held(&i915->drm.struct_mutex);
> +
> +	list_for_each_entry_safe(obj, on,
> +				 &i915->contexts.freed_objects,
> +				 batch_pool_link) {
> +		list_del(&obj->batch_pool_link);
> +		i915_gem_object_put(obj);
> +	}
> +}
> +
>  static void resize_vma_ht(struct work_struct *work)
>  {
>  	struct i915_gem_context_vma_lut *lut =
> @@ -160,9 +202,10 @@ static void vma_lut_free(struct i915_gem_context *ctx)
>
>  static void i915_gem_context_free(struct i915_gem_context *ctx)
>  {
> +	struct drm_i915_private *i915 = ctx->i915;
>  	int i;
>
> -	lockdep_assert_held(&ctx->i915->drm.struct_mutex);
> +	lockdep_assert_held(&i915->drm.struct_mutex);
>  	GEM_BUG_ON(!i915_gem_context_is_closed(ctx));
>
>  	vma_lut_free(ctx);
> @@ -178,7 +221,11 @@ static void i915_gem_context_free(struct i915_gem_context *ctx)
>  		if (ce->ring)
>  			intel_ring_free(ce->ring);
>
> -		__i915_gem_object_release_unless_active(ce->state->obj);
> +		/* Keep the objects around for quick reuse */
> +		GEM_BUG_ON(ce->state->obj->mm.madv != I915_MADV_WILLNEED);
> +		ce->state->obj->mm.madv = I915_MADV_DONTNEED;
> +		list_add(&ce->state->obj->batch_pool_link,
> +			 &i915->contexts.freed_objects);
>  	}
>
>  	kfree(ctx->name);
> @@ -186,7 +233,7 @@ static void i915_gem_context_free(struct i915_gem_context *ctx)
>
>  	list_del(&ctx->link);
>
> -	ida_simple_remove(&ctx->i915->contexts.hw_ida, ctx->hw_id);
> +	ida_simple_remove(&i915->contexts.hw_ida, ctx->hw_id);
>  	kfree_rcu(ctx, rcu);
>  }
>
> @@ -457,6 +504,7 @@ int i915_gem_contexts_init(struct drm_i915_private *dev_priv)
>  		return 0;
>
>  	INIT_LIST_HEAD(&dev_priv->contexts.list);
> +	INIT_LIST_HEAD(&dev_priv->contexts.freed_objects);
>  	INIT_WORK(&dev_priv->contexts.free_work, contexts_free_worker);
>  	init_llist_head(&dev_priv->contexts.free_list);
>
> @@ -542,12 +590,17 @@ void i915_gem_contexts_fini(struct drm_i915_private *i915)
>
>  	lockdep_assert_held(&i915->drm.struct_mutex);
>
> +	GEM_BUG_ON(work_pending(&i915->contexts.free_work));
> +
>  	/* Keep the context so that we can free it immediately ourselves */
>  	ctx = i915_gem_context_get(fetch_and_zero(&i915->kernel_context));
>  	GEM_BUG_ON(!i915_gem_context_is_kernel(ctx));
>  	context_close(ctx);
>  	i915_gem_context_free(ctx);
>
> +	i915_gem_contexts_mark_idle(i915);
> +	GEM_BUG_ON(!list_empty(&i915->contexts.freed_objects));
> +
>  	/* Must free all deferred contexts (via flush_workqueue) first */
>  	ida_destroy(&i915->contexts.hw_ida);
>  }
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.h b/drivers/gpu/drm/i915/i915_gem_context.h
> index 04320f80f9f4..4654caab9b6c 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.h
> +++ b/drivers/gpu/drm/i915/i915_gem_context.h
> @@ -294,6 +294,11 @@ void i915_gem_context_release(struct kref *ctx_ref);
>  struct i915_gem_context *
>  i915_gem_context_create_gvt(struct drm_device *dev);
>
> +struct drm_i915_gem_object *
> +i915_gem_context_create_object(struct drm_i915_private *i915,
> +			       unsigned long size);
> +void i915_gem_contexts_mark_idle(struct drm_i915_private *i915);
> +
>  int i915_gem_context_create_ioctl(struct drm_device *dev, void *data,
>  				  struct drm_file *file);
>  int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 42ab91acccc5..21a1519db936 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -2011,7 +2011,7 @@ static int execlists_context_deferred_alloc(struct i915_gem_context *ctx,
>  	/* One extra page as the sharing data between driver and GuC */
>  	context_size += PAGE_SIZE * LRC_PPHWSP_PN;
>
> -	ctx_obj = i915_gem_object_create(ctx->i915, context_size);
> +	ctx_obj = i915_gem_context_create_object(ctx->i915, context_size);
>  	if (IS_ERR(ctx_obj)) {
>  		DRM_DEBUG_DRIVER("Alloc LRC backing obj failed.\n");
>  		return PTR_ERR(ctx_obj);
> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
> index acd1da9b62a3..253d32fef133 100644
> --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
> @@ -1454,7 +1454,7 @@ alloc_context_vma(struct intel_engine_cs *engine)
>  	struct drm_i915_gem_object *obj;
>  	struct i915_vma *vma;
>
> -	obj = i915_gem_object_create(i915, engine->context_size);
> +	obj = i915_gem_context_create_object(i915, engine->context_size);
>  	if (IS_ERR(obj))
>  		return ERR_CAST(obj);
>
> diff --git a/drivers/gpu/drm/i915/selftests/mock_context.c b/drivers/gpu/drm/i915/selftests/mock_context.c
> index 9c7c68181f82..f4972674c96e 100644
> --- a/drivers/gpu/drm/i915/selftests/mock_context.c
> +++ b/drivers/gpu/drm/i915/selftests/mock_context.c
> @@ -92,6 +92,7 @@ void mock_init_contexts(struct drm_i915_private *i915)
>  	INIT_LIST_HEAD(&i915->contexts.list);
>  	ida_init(&i915->contexts.hw_ida);
>
> +	INIT_LIST_HEAD(&i915->contexts.freed_objects);
>  	INIT_WORK(&i915->contexts.free_work, contexts_free_worker);
>  	init_llist_head(&i915->contexts.free_list);
>  }
> diff --git a/drivers/gpu/drm/i915/selftests/mock_gem_device.c b/drivers/gpu/drm/i915/selftests/mock_gem_device.c
> index 47613d20bba8..5eb931ecf0a4 100644
> --- a/drivers/gpu/drm/i915/selftests/mock_gem_device.c
> +++ b/drivers/gpu/drm/i915/selftests/mock_gem_device.c
> @@ -53,6 +53,7 @@ static void mock_device_release(struct drm_device *dev)
>
>  	mutex_lock(&i915->drm.struct_mutex);
>  	mock_device_flush(i915);
> +	i915_gem_contexts_lost(i915);
>  	mutex_unlock(&i915->drm.struct_mutex);
>
>  	cancel_delayed_work_sync(&i915->gt.retire_work);
>
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 23/24] drm/i915: Keep a recent cache of freed contexts objects for reuse
  2017-05-22 10:51   ` Tvrtko Ursulin
@ 2017-05-22 11:06     ` Chris Wilson
  0 siblings, 0 replies; 47+ messages in thread
From: Chris Wilson @ 2017-05-22 11:06 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: intel-gfx

On Mon, May 22, 2017 at 11:51:27AM +0100, Tvrtko Ursulin wrote:
> 
> On 18/05/2017 10:46, Chris Wilson wrote:
> >Keep the recently freed context objects for reuse. This allows us to use
> >the current GGTT bindings and dma bound pages, avoiding any clflushes as
> >required. We mark the objects as purgeable under memory pressure, and
> >reap the list of freed objects as soon as the device is idle.
> 
> One additional thought - how about making the batch pool more
> generic, or more precisely extracting a layer from it to be called
> object pool, and then having batch pool and context pool on top of
> that?
> 
> There would be some details to flesh out, like do we want to
> strictly split internal from shmemfs backed objects, or can allow
> batch pool to get either, the API and similar.

That was the big difference; different object types that shouldn't be
mixed. So you are going to have different lists for the different
classes and behaviours.

On top of this, there's also the question of whether to keep old objects
around for userspace. Userspace caching is still more efficient for
userspace, but there are classes of objects that are cycled between
processes and discarded instead of being recycled in userspace. The
tricky part is making sure that the object state, after alloc from
cache, doesn't surprise the user.

> But if it could be done with not a lot of code it may be
> preferential to have one implementation of the similar basic idea.

No denying that, just not enough in common yet and still mostly
searching for the problem to be solved. (Outside of unrealistic
benchmarks and test cases.)
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 03/24] drm/i915: Store i915_gem_object_is_coherent() as a bit next to cache-dirty
  2017-05-18  9:46 ` [PATCH 03/24] drm/i915: Store i915_gem_object_is_coherent() as a bit next to cache-dirty Chris Wilson
  2017-05-19 10:22   ` Chris Wilson
@ 2017-05-22 20:39   ` Dongwon Kim
  1 sibling, 0 replies; 47+ messages in thread
From: Dongwon Kim @ 2017-05-22 20:39 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

Chris, 

I tested this together with your v3 (Mark cache dirty...)
patch and verified tests are all passing.

Tested-by: Dongwon Kim <dongwon.kim@intel.com>

On Thu, May 18, 2017 at 10:46:17AM +0100, Chris Wilson wrote:
> For ease of use (i.e. avoiding a few checks and function calls), store
> the object's cache coherency next to the cache is dirty bit.
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Dongwon Kim <dongwon.kim@intel.com>
> Cc: Matt Roper <matthew.d.roper@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_gem.c                  | 14 +++++++-------
>  drivers/gpu/drm/i915/i915_gem_clflush.c          |  2 +-
>  drivers/gpu/drm/i915/i915_gem_execbuffer.c       |  2 +-
>  drivers/gpu/drm/i915/i915_gem_internal.c         |  3 ++-
>  drivers/gpu/drm/i915/i915_gem_object.h           |  1 +
>  drivers/gpu/drm/i915/i915_gem_stolen.c           |  1 +
>  drivers/gpu/drm/i915/i915_gem_userptr.c          |  3 ++-
>  drivers/gpu/drm/i915/selftests/huge_gem_object.c |  3 ++-
>  8 files changed, 17 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 155dd52f2d18..870659c13de3 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -52,7 +52,7 @@ static bool cpu_write_needs_clflush(struct drm_i915_gem_object *obj)
>  	if (obj->cache_dirty)
>  		return false;
>  
> -	if (!i915_gem_object_is_coherent(obj))
> +	if (!obj->cache_coherent)
>  		return true;
>  
>  	return obj->pin_display;
> @@ -253,7 +253,7 @@ __i915_gem_object_release_shmem(struct drm_i915_gem_object *obj,
>  
>  	if (needs_clflush &&
>  	    (obj->base.read_domains & I915_GEM_DOMAIN_CPU) == 0 &&
> -	    !i915_gem_object_is_coherent(obj))
> +	    !obj->cache_coherent)
>  		drm_clflush_sg(pages);
>  
>  	__start_cpu_write(obj);
> @@ -856,8 +856,7 @@ int i915_gem_obj_prepare_shmem_read(struct drm_i915_gem_object *obj,
>  	if (ret)
>  		return ret;
>  
> -	if (i915_gem_object_is_coherent(obj) ||
> -	    !static_cpu_has(X86_FEATURE_CLFLUSH)) {
> +	if (obj->cache_coherent || !static_cpu_has(X86_FEATURE_CLFLUSH)) {
>  		ret = i915_gem_object_set_to_cpu_domain(obj, false);
>  		if (ret)
>  			goto err_unpin;
> @@ -909,8 +908,7 @@ int i915_gem_obj_prepare_shmem_write(struct drm_i915_gem_object *obj,
>  	if (ret)
>  		return ret;
>  
> -	if (i915_gem_object_is_coherent(obj) ||
> -	    !static_cpu_has(X86_FEATURE_CLFLUSH)) {
> +	if (obj->cache_coherent || !static_cpu_has(X86_FEATURE_CLFLUSH)) {
>  		ret = i915_gem_object_set_to_cpu_domain(obj, true);
>  		if (ret)
>  			goto err_unpin;
> @@ -3661,6 +3659,7 @@ int i915_gem_object_set_cache_level(struct drm_i915_gem_object *obj,
>  	list_for_each_entry(vma, &obj->vma_list, obj_link)
>  		vma->node.color = cache_level;
>  	obj->cache_level = cache_level;
> +	obj->cache_coherent = i915_gem_object_is_coherent(obj);
>  	obj->cache_dirty = true; /* Always invalidate stale cachelines */
>  
>  	return 0;
> @@ -4320,7 +4319,8 @@ i915_gem_object_create(struct drm_i915_private *dev_priv, u64 size)
>  	} else
>  		obj->cache_level = I915_CACHE_NONE;
>  
> -	obj->cache_dirty = !i915_gem_object_is_coherent(obj);
> +	obj->cache_coherent = i915_gem_object_is_coherent(obj);
> +	obj->cache_dirty = !obj->cache_coherent;
>  
>  	trace_i915_gem_object_create(obj);
>  
> diff --git a/drivers/gpu/drm/i915/i915_gem_clflush.c b/drivers/gpu/drm/i915/i915_gem_clflush.c
> index 17b207e963c2..152f16c11878 100644
> --- a/drivers/gpu/drm/i915/i915_gem_clflush.c
> +++ b/drivers/gpu/drm/i915/i915_gem_clflush.c
> @@ -139,7 +139,7 @@ void i915_gem_clflush_object(struct drm_i915_gem_object *obj,
>  	 * snooping behaviour occurs naturally as the result of our domain
>  	 * tracking.
>  	 */
> -	if (!(flags & I915_CLFLUSH_FORCE) && i915_gem_object_is_coherent(obj))
> +	if (!(flags & I915_CLFLUSH_FORCE) && obj->cache_coherent)
>  		return;
>  
>  	trace_i915_gem_object_clflush(obj);
> diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
> index 0b8ae0f56675..2e5f513087a8 100644
> --- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
> +++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
> @@ -1129,7 +1129,7 @@ i915_gem_execbuffer_move_to_gpu(struct drm_i915_gem_request *req,
>  		if (vma->exec_entry->flags & EXEC_OBJECT_ASYNC)
>  			continue;
>  
> -		if (obj->cache_dirty)
> +		if (obj->cache_dirty & ~obj->cache_coherent)
>  			i915_gem_clflush_object(obj, 0);
>  
>  		ret = i915_gem_request_await_object
> diff --git a/drivers/gpu/drm/i915/i915_gem_internal.c b/drivers/gpu/drm/i915/i915_gem_internal.c
> index 58e93e87d573..568bf83af1f5 100644
> --- a/drivers/gpu/drm/i915/i915_gem_internal.c
> +++ b/drivers/gpu/drm/i915/i915_gem_internal.c
> @@ -191,7 +191,8 @@ i915_gem_object_create_internal(struct drm_i915_private *i915,
>  	obj->base.read_domains = I915_GEM_DOMAIN_CPU;
>  	obj->base.write_domain = I915_GEM_DOMAIN_CPU;
>  	obj->cache_level = HAS_LLC(i915) ? I915_CACHE_LLC : I915_CACHE_NONE;
> -	obj->cache_dirty = !i915_gem_object_is_coherent(obj);
> +	obj->cache_coherent = i915_gem_object_is_coherent(obj);
> +	obj->cache_dirty = !obj->cache_coherent;
>  
>  	return obj;
>  }
> diff --git a/drivers/gpu/drm/i915/i915_gem_object.h b/drivers/gpu/drm/i915/i915_gem_object.h
> index 174cf923c236..dca15adc91de 100644
> --- a/drivers/gpu/drm/i915/i915_gem_object.h
> +++ b/drivers/gpu/drm/i915/i915_gem_object.h
> @@ -106,6 +106,7 @@ struct drm_i915_gem_object {
>  	unsigned long gt_ro:1;
>  	unsigned int cache_level:3;
>  	unsigned int cache_dirty:1;
> +	unsigned int cache_coherent:1;
>  
>  	atomic_t frontbuffer_bits;
>  	unsigned int frontbuffer_ggtt_origin; /* write once */
> diff --git a/drivers/gpu/drm/i915/i915_gem_stolen.c b/drivers/gpu/drm/i915/i915_gem_stolen.c
> index f3abdc27c5dd..68af4a39973d 100644
> --- a/drivers/gpu/drm/i915/i915_gem_stolen.c
> +++ b/drivers/gpu/drm/i915/i915_gem_stolen.c
> @@ -592,6 +592,7 @@ _i915_gem_object_create_stolen(struct drm_i915_private *dev_priv,
>  	obj->stolen = stolen;
>  	obj->base.read_domains = I915_GEM_DOMAIN_CPU | I915_GEM_DOMAIN_GTT;
>  	obj->cache_level = HAS_LLC(dev_priv) ? I915_CACHE_LLC : I915_CACHE_NONE;
> +	obj->cache_coherent = true; /* assumptions! more like cache_oblivious */
>  
>  	if (i915_gem_object_pin_pages(obj))
>  		goto cleanup;
> diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
> index 9f84be171ad2..4ec9a04aa165 100644
> --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> @@ -805,7 +805,8 @@ i915_gem_userptr_ioctl(struct drm_device *dev, void *data, struct drm_file *file
>  	obj->base.read_domains = I915_GEM_DOMAIN_CPU;
>  	obj->base.write_domain = I915_GEM_DOMAIN_CPU;
>  	obj->cache_level = I915_CACHE_LLC;
> -	obj->cache_dirty = !i915_gem_object_is_coherent(obj);
> +	obj->cache_coherent = i915_gem_object_is_coherent(obj);
> +	obj->cache_dirty = !obj->cache_coherent;
>  
>  	obj->userptr.ptr = args->user_ptr;
>  	obj->userptr.read_only = !!(args->flags & I915_USERPTR_READ_ONLY);
> diff --git a/drivers/gpu/drm/i915/selftests/huge_gem_object.c b/drivers/gpu/drm/i915/selftests/huge_gem_object.c
> index 0ca867a877b6..caf76af36aba 100644
> --- a/drivers/gpu/drm/i915/selftests/huge_gem_object.c
> +++ b/drivers/gpu/drm/i915/selftests/huge_gem_object.c
> @@ -129,7 +129,8 @@ huge_gem_object(struct drm_i915_private *i915,
>  	obj->base.read_domains = I915_GEM_DOMAIN_CPU;
>  	obj->base.write_domain = I915_GEM_DOMAIN_CPU;
>  	obj->cache_level = HAS_LLC(i915) ? I915_CACHE_LLC : I915_CACHE_NONE;
> -	obj->cache_dirty = !i915_gem_object_is_coherent(obj);
> +	obj->cache_coherent = i915_gem_object_is_coherent(obj);
> +	obj->cache_dirty = !obj->cache_coherent;
>  	obj->scratch = phys_size;
>  
>  	return obj;
> -- 
> 2.11.0
> 
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 24/24] RFC drm/i915: Expose a PMU interface for perf queries
  2017-05-18  9:46 ` [PATCH 24/24] RFC drm/i915: Expose a PMU interface for perf queries Chris Wilson
  2017-05-18 23:48   ` Dmitry Rogozhkin
@ 2017-06-15 11:17   ` Tvrtko Ursulin
  1 sibling, 0 replies; 47+ messages in thread
From: Tvrtko Ursulin @ 2017-06-15 11:17 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 18/05/2017 10:46, Chris Wilson wrote:
> The first goal is to be able to measure GPU (and individual ring) busyness
> without having to poll registers from userspace. (Which not only incurs
> holding the forcewake lock indefinitely, perturbing the system, but also
> runs the risk of hanging the machine.) As an alternative we can use the
> perf event counter interface to sample the ring registers periodically
> and send those results to userspace.

I finally got round to trying this and it does not look too bad overhead-wise.

Maybe it is polish and review time? :)

I am also wondering if we could make it more efficient on gen8+ by 
employing the core of the engine busyness RFC I've recently sent, in 
other words avoiding the timer and MMIO sampling when possible. It can 
be a separate patch on top, added by me, if we get agreement on this.

> To be able to do so, we need to export the two symbols from
> kernel/events/core.c to register and unregister a PMU device.
> 
> v2: Use a common timer for the ring sampling.
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> ---
>   drivers/gpu/drm/i915/Makefile           |   1 +
>   drivers/gpu/drm/i915/i915_drv.c         |   2 +
>   drivers/gpu/drm/i915/i915_drv.h         |  23 ++
>   drivers/gpu/drm/i915/i915_pmu.c         | 449 ++++++++++++++++++++++++++++++++
>   drivers/gpu/drm/i915/intel_ringbuffer.h |   2 +
>   include/uapi/drm/i915_drm.h             |  40 +++
>   kernel/events/core.c                    |   1 +
>   7 files changed, 518 insertions(+)
>   create mode 100644 drivers/gpu/drm/i915/i915_pmu.c
> 
> diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
> index 7b05fb802f4c..ca88e6e67910 100644
> --- a/drivers/gpu/drm/i915/Makefile
> +++ b/drivers/gpu/drm/i915/Makefile
> @@ -26,6 +26,7 @@ i915-y := i915_drv.o \
>   
>   i915-$(CONFIG_COMPAT)   += i915_ioc32.o
>   i915-$(CONFIG_DEBUG_FS) += i915_debugfs.o intel_pipe_crc.o
> +i915-$(CONFIG_PERF_EVENTS) += i915_pmu.o
>   
>   # GEM code
>   i915-y += i915_cmd_parser.o \
> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> index 2d2fb4327f97..e3c6d052d1c9 100644
> --- a/drivers/gpu/drm/i915/i915_drv.c
> +++ b/drivers/gpu/drm/i915/i915_drv.c
> @@ -1144,6 +1144,7 @@ static void i915_driver_register(struct drm_i915_private *dev_priv)
>   	struct drm_device *dev = &dev_priv->drm;
>   
>   	i915_gem_shrinker_init(dev_priv);
> +	i915_pmu_register(dev_priv);
>   
>   	/*
>   	 * Notify a valid surface after modesetting,
> @@ -1197,6 +1198,7 @@ static void i915_driver_unregister(struct drm_i915_private *dev_priv)
>   	intel_opregion_unregister(dev_priv);
>   
>   	i915_perf_unregister(dev_priv);
> +	i915_pmu_unregister(dev_priv);
>   
>   	i915_teardown_sysfs(dev_priv);
>   	i915_guc_log_unregister(dev_priv);
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 1fa1e7d48f02..10beae1a13c8 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -40,6 +40,7 @@
>   #include <linux/hash.h>
>   #include <linux/intel-iommu.h>
>   #include <linux/kref.h>
> +#include <linux/perf_event.h>
>   #include <linux/pm_qos.h>
>   #include <linux/reservation.h>
>   #include <linux/shmem_fs.h>
> @@ -2075,6 +2076,12 @@ struct intel_cdclk_state {
>   	unsigned int cdclk, vco, ref;
>   };
>   
> +enum {
> +	__I915_SAMPLE_FREQ_ACT = 0,
> +	__I915_SAMPLE_FREQ_REQ,
> +	__I915_NUM_PMU_SAMPLERS
> +};
> +
>   struct drm_i915_private {
>   	struct drm_device drm;
>   
> @@ -2564,6 +2571,13 @@ struct drm_i915_private {
>   		int	irq;
>   	} lpe_audio;
>   
> +	struct {
> +		struct pmu base;
> +		struct hrtimer timer;
> +		u64 enable;
> +		u64 sample[__I915_NUM_PMU_SAMPLERS];
> +	} pmu;
> +
>   	/*
>   	 * NOTE: This is the dri1/ums dungeon, don't add stuff here. Your patch
>   	 * will be rejected. Instead look for a better place.
> @@ -3681,6 +3695,15 @@ extern void i915_perf_fini(struct drm_i915_private *dev_priv);
>   extern void i915_perf_register(struct drm_i915_private *dev_priv);
>   extern void i915_perf_unregister(struct drm_i915_private *dev_priv);
>   
> +/* i915_pmu.c */
> +#ifdef CONFIG_PERF_EVENTS
> +extern void i915_pmu_register(struct drm_i915_private *i915);
> +extern void i915_pmu_unregister(struct drm_i915_private *i915);
> +#else
> +static inline void i915_pmu_register(struct drm_i915_private *i915) {}
> +static inline void i915_pmu_unregister(struct drm_i915_private *i915) {}
> +#endif
> +
>   /* i915_suspend.c */
>   extern int i915_save_state(struct drm_i915_private *dev_priv);
>   extern int i915_restore_state(struct drm_i915_private *dev_priv);
> diff --git a/drivers/gpu/drm/i915/i915_pmu.c b/drivers/gpu/drm/i915/i915_pmu.c
> new file mode 100644
> index 000000000000..80e1c07841ac
> --- /dev/null
> +++ b/drivers/gpu/drm/i915/i915_pmu.c
> @@ -0,0 +1,449 @@
> +#include <linux/perf_event.h>
> +#include <linux/pm_runtime.h>
> +
> +#include "i915_drv.h"
> +#include "intel_ringbuffer.h"
> +
> +#define FREQUENCY 200
> +#define PERIOD max_t(u64, 10000, NSEC_PER_SEC / FREQUENCY)
> +
> +#define RING_MASK 0xffffffff
> +#define RING_MAX 32
> +
> +static void engines_sample(struct drm_i915_private *dev_priv)
> +{
> +	struct intel_engine_cs *engine;
> +	enum intel_engine_id id;
> +	bool fw = false;
> +
> +	if ((dev_priv->pmu.enable & RING_MASK) == 0)
> +		return;
> +
> +	if (!dev_priv->gt.awake)
> +		return;
> +
> +	if (!intel_runtime_pm_get_if_in_use(dev_priv))
> +		return;
> +
> +	for_each_engine(engine, dev_priv, id) {
> +		u32 val;
> +
> +		if ((dev_priv->pmu.enable & (0x7 << (4*id))) == 0)

This would need some nicely named macros. PMU_ENGINE_MASK(id) ?
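
Something like (just a suggestion):

	#define PMU_ENGINE_MASK(id) (0x7 << (4 * (id)))

so the check reads (dev_priv->pmu.enable & PMU_ENGINE_MASK(id)) == 0.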

> +			continue;
> +

Would help me at least:

		/* Skip idle engines. */
> +		if (i915_seqno_passed(intel_engine_get_seqno(engine),
> +				      intel_engine_last_submit(engine)))
> +			continue;
> +
> +		if (!fw) {
> +			intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
> +			fw = true;
> +		}
> +
> +		val = I915_READ_FW(RING_MI_MODE(engine->mmio_base));
> +		if (!(val & MODE_IDLE))
> +			engine->pmu_sample[I915_SAMPLE_BUSY] += PERIOD;
> +
> +		val = I915_READ_FW(RING_CTL(engine->mmio_base));
> +		if (val & RING_WAIT)
> +			engine->pmu_sample[I915_SAMPLE_WAIT] += PERIOD;
> +		if (val & RING_WAIT_SEMAPHORE)
> +			engine->pmu_sample[I915_SAMPLE_SEMA] += PERIOD;
> +	}
> +
> +	if (fw)
> +		intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
> +	intel_runtime_pm_put(dev_priv);
> +}
> +
> +static void frequency_sample(struct drm_i915_private *dev_priv)
> +{
> +	if (dev_priv->pmu.enable & BIT_ULL(I915_PMU_ACTUAL_FREQUENCY)) {
> +		u64 val;
> +
> +		val = dev_priv->rps.cur_freq;
> +		if (dev_priv->gt.awake &&
> +		    intel_runtime_pm_get_if_in_use(dev_priv)) {
> +			val = I915_READ_NOTRACE(GEN6_RPSTAT1);
> +			if (INTEL_GEN(dev_priv) >= 9)
> +				val = (val & GEN9_CAGF_MASK) >> GEN9_CAGF_SHIFT;
> +			else if (IS_HASWELL(dev_priv) || INTEL_GEN(dev_priv) >= 8)
> +				val = (val & HSW_CAGF_MASK) >> HSW_CAGF_SHIFT;
> +			else
> +				val = (val & GEN6_CAGF_MASK) >> GEN6_CAGF_SHIFT;
> +			intel_runtime_pm_put(dev_priv);
> +		}
> +		val = intel_gpu_freq(dev_priv, val);
> +		dev_priv->pmu.sample[__I915_SAMPLE_FREQ_ACT] += val * PERIOD;
> +	}
> +
> +	if (dev_priv->pmu.enable & BIT_ULL(I915_PMU_REQUESTED_FREQUENCY)) {
> +		u64 val = intel_gpu_freq(dev_priv, dev_priv->rps.cur_freq);
> +		dev_priv->pmu.sample[__I915_SAMPLE_FREQ_REQ] += val * PERIOD;
> +	}
> +}

I think a very nice thing would be to manage the timer from the callers,
keyed on gt.awake (and I915_PMU_REQUESTED_FREQUENCY). Not so much gate
it, but start and stop it via those triggers.

Maybe have something like event_needs_timer(event) at timer enable time,
and export i915_pmu_gt_idle/active to be called from the relevant parts
elsewhere. Those functions would then manage the timer enablement
depending on the enabled events mask.

This way we would have no timer running during GT idle periods, which
would be very nice. (Unless I915_PMU_REQUESTED_FREQUENCY is enabled.)

Then on top, as mentioned before, we could have event_needs_timer() give
different results for engine stats if we followed up with a gen8+ mode
driven by the request in/out hooks, instead of having to run a timer and
do MMIO.
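
Rough sketch of what I mean, with made-up names, to be consulted
whenever the enable mask or GT state changes:

	static bool pmu_needs_timer(struct drm_i915_private *i915, bool gt_awake)
	{
		u64 enable = i915->pmu.enable;

		/*
		 * Requested frequency must keep sampling even while the GT
		 * sleeps (and arguably actual frequency too, since it falls
		 * back to the requested value when asleep).
		 */
		if (enable & BIT_ULL(I915_PMU_REQUESTED_FREQUENCY))
			return true;

		/* Everything else only needs the timer while the GT is awake. */
		return gt_awake && (enable & (RING_MASK |
					      BIT_ULL(I915_PMU_ACTUAL_FREQUENCY)));
	}

i915_pmu_gt_active/idle would then just start or cancel pmu.timer
depending on what this returns.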

> +
> +static enum hrtimer_restart i915_sample(struct hrtimer *hrtimer)
> +{
> +	struct drm_i915_private *i915 =
> +		container_of(hrtimer, struct drm_i915_private, pmu.timer);
> +
> +	if (i915->pmu.enable == 0)
> +		return HRTIMER_NORESTART;
> +
> +	engines_sample(i915);
> +	frequency_sample(i915);
> +
> +	hrtimer_forward_now(hrtimer, ns_to_ktime(PERIOD));
> +	return HRTIMER_RESTART;
> +}
> +
> +static void i915_pmu_event_destroy(struct perf_event *event)
> +{
> +	WARN_ON(event->parent);
> +}
> +
> +static int engine_event_init(struct perf_event *event)
> +{
> +	struct drm_i915_private *i915 =
> +		container_of(event->pmu, typeof(*i915), pmu.base);
> +	int engine = event->attr.config >> 2;
> +	int sample = event->attr.config & 3;

Some macros for these two, please. (Also, unless I am misreading it,
this ">> 2" / "& 3" decode does not match __I915_PMU_COUNT in the uapi
header, which packs the engine number in bits four and up.)
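
As a naming suggestion only, something along the lines of:

	#define I915_PMU_ENGINE_FROM_CONFIG(config) ((config) >> 2)
	#define I915_PMU_SAMPLE_FROM_CONFIG(config) ((config) & 3)

With whatever shift and mask end up matching the uapi encoding.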

> +
> +	switch (sample) {
> +	case I915_SAMPLE_BUSY:
> +	case I915_SAMPLE_WAIT:
> +		break;
> +	case I915_SAMPLE_SEMA:
> +		if (INTEL_GEN(i915) < 6)
> +			return -ENODEV;
> +		break;
> +	default:
> +		return -ENOENT;
> +	}
> +
> +	if (engine >= I915_NUM_ENGINES)
> +		return -ENOENT;

Why ENOENT and not ENODEV?

> +
> +	if (!i915->engine[engine])
> +		return -ENODEV;
> +
> +	return 0;
> +}
> +
> +static enum hrtimer_restart hrtimer_sample(struct hrtimer *hrtimer)
> +{
> +	struct perf_sample_data data;
> +	struct perf_event *event;
> +	u64 period;
> +
> +	event = container_of(hrtimer, struct perf_event, hw.hrtimer);
> +	if (event->state != PERF_EVENT_STATE_ACTIVE)
> +		return HRTIMER_NORESTART;
> +
> +	event->pmu->read(event);
> +
> +	perf_sample_data_init(&data, 0, event->hw.last_period);
> +	perf_event_overflow(event, &data, NULL);
> +
> +	period = max_t(u64, 10000, event->hw.sample_period);
> +	hrtimer_forward_now(hrtimer, ns_to_ktime(period));
> +	return HRTIMER_RESTART;
> +}
> +
> +static void init_hrtimer(struct perf_event *event)
> +{
> +	struct hw_perf_event *hwc = &event->hw;
> +
> +	if (!is_sampling_event(event))
> +		return;
> +
> +	hrtimer_init(&hwc->hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> +	hwc->hrtimer.function = hrtimer_sample;
> +
> +	if (event->attr.freq) {
> +		long freq = event->attr.sample_freq;
> +
> +		event->attr.sample_period = NSEC_PER_SEC / freq;
> +		hwc->sample_period = event->attr.sample_period;
> +		local64_set(&hwc->period_left, hwc->sample_period);
> +		hwc->last_period = hwc->sample_period;
> +		event->attr.freq = 0;
> +	}
> +}
> +
> +static int i915_pmu_event_init(struct perf_event *event)
> +{
> +	struct drm_i915_private *i915 =
> +		container_of(event->pmu, typeof(*i915), pmu.base);
> +	int ret;
> +
> +	/* XXX ideally only want pid == -1 && cpu == -1 */

What is this about?

> +
> +	if (event->attr.type != event->pmu->type)
> +		return -ENOENT;
> +
> +	if (has_branch_stack(event))
> +		return -EOPNOTSUPP;
> +
> +	ret = 0;
> +	if (event->attr.config < RING_MAX) {
> +		ret = engine_event_init(event);
> +	} else switch (event->attr.config) {
> +	case I915_PMU_ACTUAL_FREQUENCY:
> +		if (IS_VALLEYVIEW(i915) || IS_CHERRYVIEW(i915))
> +			ret = -ENODEV; /* requires a mutex for sampling! */
> +	case I915_PMU_REQUESTED_FREQUENCY:
> +	case I915_PMU_ENERGY:
> +	case I915_PMU_RC6_RESIDENCY:
> +	case I915_PMU_RC6p_RESIDENCY:
> +	case I915_PMU_RC6pp_RESIDENCY:
> +		if (INTEL_GEN(i915) < 6)
> +			ret = -ENODEV;
> +		break;
> +	}
> +	if (ret)
> +		return ret;
> +
> +	if (!event->parent)
> +		event->destroy = i915_pmu_event_destroy;
> +
> +	init_hrtimer(event);
> +
> +	return 0;
> +}
> +
> +static void i915_pmu_timer_start(struct perf_event *event)
> +{
> +	struct hw_perf_event *hwc = &event->hw;
> +	s64 period;
> +
> +	if (!is_sampling_event(event))
> +		return;
> +
> +	period = local64_read(&hwc->period_left);
> +	if (period) {
> +		if (period < 0)
> +			period = 10000;
> +
> +		local64_set(&hwc->period_left, 0);
> +	} else {
> +		period = max_t(u64, 10000, hwc->sample_period);
> +	}
> +
> +	hrtimer_start_range_ns(&hwc->hrtimer,
> +			       ns_to_ktime(period), 0,
> +			       HRTIMER_MODE_REL_PINNED);
> +}
> +
> +static void i915_pmu_timer_cancel(struct perf_event *event)
> +{
> +	struct hw_perf_event *hwc = &event->hw;
> +
> +	if (!is_sampling_event(event))
> +		return;
> +
> +	local64_set(&hwc->period_left,
> +		    ktime_to_ns(hrtimer_get_remaining(&hwc->hrtimer)));
> +	hrtimer_cancel(&hwc->hrtimer);
> +}
> +
> +static void i915_pmu_enable(struct perf_event *event)
> +{
> +	struct drm_i915_private *i915 =
> +		container_of(event->pmu, typeof(*i915), pmu.base);
> +
> +	if (i915->pmu.enable == 0)
> +		hrtimer_start_range_ns(&i915->pmu.timer,
> +				       ns_to_ktime(PERIOD), 0,
> +				       HRTIMER_MODE_REL_PINNED);
> +
> +	i915->pmu.enable |= BIT_ULL(event->attr.config);
> +
> +	i915_pmu_timer_start(event);
> +}
> +
> +static void i915_pmu_disable(struct perf_event *event)
> +{
> +	struct drm_i915_private *i915 =
> +		container_of(event->pmu, typeof(*i915), pmu.base);
> +
> +	i915->pmu.enable &= ~BIT_ULL(event->attr.config);
> +	i915_pmu_timer_cancel(event);
> +}
> +
> +static int i915_pmu_event_add(struct perf_event *event, int flags)
> +{
> +	struct hw_perf_event *hwc = &event->hw;
> +
> +	if (flags & PERF_EF_START)
> +		i915_pmu_enable(event);
> +
> +	hwc->state = !(flags & PERF_EF_START);
> +
> +	return 0;
> +}
> +
> +static void i915_pmu_event_del(struct perf_event *event, int flags)
> +{
> +	i915_pmu_disable(event);
> +}
> +
> +static void i915_pmu_event_start(struct perf_event *event, int flags)
> +{
> +	i915_pmu_enable(event);
> +}
> +
> +static void i915_pmu_event_stop(struct perf_event *event, int flags)
> +{
> +	i915_pmu_disable(event);
> +}
> +
> +static u64 read_energy_uJ(struct drm_i915_private *dev_priv)
> +{
> +	u64 power;
> +
> +	GEM_BUG_ON(INTEL_GEN(dev_priv) < 6);
> +
> +	intel_runtime_pm_get(dev_priv);
> +
> +	rdmsrl(MSR_RAPL_POWER_UNIT, power);
> +	power = (power & 0x1f00) >> 8;
> +	power = 1000000 >> power; /* convert to uJ */
> +	power *= I915_READ_NOTRACE(MCH_SECP_NRG_STTS);
> +
> +	intel_runtime_pm_put(dev_priv);
> +
> +	return power;
> +}
> +
> +static inline u64 calc_residency(struct drm_i915_private *dev_priv,
> +				 const i915_reg_t reg)
> +{
> +	u64 val, units = 128, div = 100000;
> +
> +	GEM_BUG_ON(INTEL_GEN(dev_priv) < 6);
> +
> +	intel_runtime_pm_get(dev_priv);
> +	if (IS_VALLEYVIEW(dev_priv) || IS_CHERRYVIEW(dev_priv)) {
> +		div = dev_priv->czclk_freq;
> +		units = 1;
> +		if (I915_READ_NOTRACE(VLV_COUNTER_CONTROL) & VLV_COUNT_RANGE_HIGH)
> +			units <<= 8;
> +	} else if (IS_GEN9_LP(dev_priv)) {
> +		div = 1200;
> +		units = 1;
> +	}
> +	val = I915_READ_NOTRACE(reg);
> +	intel_runtime_pm_put(dev_priv);
> +
> +	val *= units;
> +	return DIV_ROUND_UP_ULL(val, div);
> +}
> +
> +static u64 count_interrupts(struct drm_i915_private *i915)
> +{
> +	/* open-coded kstat_irqs() */
> +	struct irq_desc *desc = irq_to_desc(i915->drm.pdev->irq);
> +	u64 sum = 0;
> +	int cpu;
> +
> +	if (!desc || !desc->kstat_irqs)
> +		return 0;
> +
> +	for_each_possible_cpu(cpu)
> +		sum += *per_cpu_ptr(desc->kstat_irqs, cpu);
> +
> +	return sum;
> +}
> +
> +static void i915_pmu_event_read(struct perf_event *event)
> +{
> +	struct drm_i915_private *i915 =
> +		container_of(event->pmu, typeof(*i915), pmu.base);
> +	u64 val = 0;
> +
> +	if (event->attr.config < 32) {

The magic 32 could use a name as well (RING_MAX from the top of the
file, presumably?).

> +		int engine = event->attr.config >> 2;
> +		int sample = event->attr.config & 3;
> +		val = i915->engine[engine]->pmu_sample[sample];
> +	} else switch (event->attr.config) {
> +	case I915_PMU_ACTUAL_FREQUENCY:
> +		val = i915->pmu.sample[__I915_SAMPLE_FREQ_ACT];
> +		break;
> +	case I915_PMU_REQUESTED_FREQUENCY:
> +		val = i915->pmu.sample[__I915_SAMPLE_FREQ_REQ];
> +		break;
> +	case I915_PMU_ENERGY:
> +		val = read_energy_uJ(i915);
> +		break;
> +	case I915_PMU_INTERRUPTS:
> +		val = count_interrupts(i915);
> +		break;
> +
> +	case I915_PMU_RC6_RESIDENCY:
> +		if (!i915->gt.awake)
> +			return;

Don't we need to set val to something in this case? Like 100%, or
whatever the equivalent is. Maybe not, I need to study how the PMU API
works.

> +
> +		val = calc_residency(i915, IS_VALLEYVIEW(i915) ? VLV_GT_RENDER_RC6 : GEN6_GT_GFX_RC6);
> +		break;
> +
> +	case I915_PMU_RC6p_RESIDENCY:
> +		if (!i915->gt.awake)
> +			return;
> +
> +		if (!IS_VALLEYVIEW(i915))
> +			val = calc_residency(i915, GEN6_GT_GFX_RC6p);
> +		break;
> +
> +	case I915_PMU_RC6pp_RESIDENCY:
> +		if (!i915->gt.awake)
> +			return;
> +
> +		if (!IS_VALLEYVIEW(i915))
> +			val = calc_residency(i915, GEN6_GT_GFX_RC6pp);
> +		break;
> +	}
> +
> +	local64_set(&event->count, val);
> +}
> +
> +static int i915_pmu_event_event_idx(struct perf_event *event)
> +{
> +	return 0;
> +}
> +
> +void i915_pmu_register(struct drm_i915_private *i915)
> +{
> +	if (INTEL_GEN(i915) <= 2)
> +		return;
> +
> +	i915->pmu.base.task_ctx_nr	= perf_sw_context;
> +	i915->pmu.base.event_init	= i915_pmu_event_init;
> +	i915->pmu.base.add		= i915_pmu_event_add;
> +	i915->pmu.base.del		= i915_pmu_event_del;
> +	i915->pmu.base.start		= i915_pmu_event_start;
> +	i915->pmu.base.stop		= i915_pmu_event_stop;
> +	i915->pmu.base.read		= i915_pmu_event_read;
> +	i915->pmu.base.event_idx	= i915_pmu_event_event_idx;
> +
> +	hrtimer_init(&i915->pmu.timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> +	i915->pmu.timer.function = i915_sample;
> +	i915->pmu.enable = 0;
> +
> +	if (perf_pmu_register(&i915->pmu.base, "i915", -1))
> +		i915->pmu.base.event_init = NULL;
> +}
> +
> +void i915_pmu_unregister(struct drm_i915_private *i915)
> +{
> +	if (!i915->pmu.base.event_init)
> +		return;
> +
> +	i915->pmu.enable = 0;
> +
> +	perf_pmu_unregister(&i915->pmu.base);
> +	i915->pmu.base.event_init = NULL;
> +
> +	hrtimer_cancel(&i915->pmu.timer);
> +}
> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
> index 6aa20ac8cde3..084fa7816256 100644
> --- a/drivers/gpu/drm/i915/intel_ringbuffer.h
> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
> @@ -244,6 +244,8 @@ struct intel_engine_cs {
>   		I915_SELFTEST_DECLARE(bool mock : 1);
>   	} breadcrumbs;
>   
> +	u64 pmu_sample[3];

A define is needed for the 3 here.
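
Perhaps (name made up):

	#define I915_ENGINE_SAMPLE_MAX (I915_SAMPLE_SEMA + 1)

	u64 pmu_sample[I915_ENGINE_SAMPLE_MAX];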

> +
>   	/*
>   	 * A pool of objects to use as shadow copies of client batch buffers
>   	 * when the command parser is enabled. Prevents the client from
> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> index 34ee011f08ac..e9375ff29371 100644
> --- a/include/uapi/drm/i915_drm.h
> +++ b/include/uapi/drm/i915_drm.h
> @@ -86,6 +86,46 @@ enum i915_mocs_table_index {
>   	I915_MOCS_CACHED,
>   };
>   
> +/**
> + * DOC: perf_events exposed by i915 through /sys/bus/event_source/devices/i915
> + *
> + */
> +#define I915_SAMPLE_BUSY	0
> +#define I915_SAMPLE_WAIT	1
> +#define I915_SAMPLE_SEMA	2
> +
> +#define I915_SAMPLE_RCS		0
> +#define I915_SAMPLE_VCS		1
> +#define I915_SAMPLE_BCS		2
> +#define I915_SAMPLE_VECS	3
> +
> +#define __I915_PMU_COUNT(ring, id) ((ring) << 4 | (id))
> +
> +#define I915_PMU_COUNT_RCS_BUSY __I915_PMU_COUNT(I915_SAMPLE_RCS, I915_SAMPLE_BUSY)
> +#define I915_PMU_COUNT_RCS_WAIT __I915_PMU_COUNT(I915_SAMPLE_RCS, I915_SAMPLE_WAIT)
> +#define I915_PMU_COUNT_RCS_SEMA __I915_PMU_COUNT(I915_SAMPLE_RCS, I915_SAMPLE_SEMA)
> +
> +#define I915_PMU_COUNT_VCS_BUSY __I915_PMU_COUNT(I915_SAMPLE_VCS, I915_SAMPLE_BUSY)
> +#define I915_PMU_COUNT_VCS_WAIT __I915_PMU_COUNT(I915_SAMPLE_VCS, I915_SAMPLE_WAIT)
> +#define I915_PMU_COUNT_VCS_SEMA __I915_PMU_COUNT(I915_SAMPLE_VCS, I915_SAMPLE_SEMA)

VCS2 is missing.
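
Presumably it just follows the existing pattern, i.e.:

	#define I915_SAMPLE_VCS2	4

	#define I915_PMU_COUNT_VCS2_BUSY __I915_PMU_COUNT(I915_SAMPLE_VCS2, I915_SAMPLE_BUSY)
	#define I915_PMU_COUNT_VCS2_WAIT __I915_PMU_COUNT(I915_SAMPLE_VCS2, I915_SAMPLE_WAIT)
	#define I915_PMU_COUNT_VCS2_SEMA __I915_PMU_COUNT(I915_SAMPLE_VCS2, I915_SAMPLE_SEMA)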

> +
> +#define I915_PMU_COUNT_BCS_BUSY __I915_PMU_COUNT(I915_SAMPLE_BCS, I915_SAMPLE_BUSY)
> +#define I915_PMU_COUNT_BCS_WAIT __I915_PMU_COUNT(I915_SAMPLE_BCS, I915_SAMPLE_WAIT)
> +#define I915_PMU_COUNT_BCS_SEMA __I915_PMU_COUNT(I915_SAMPLE_BCS, I915_SAMPLE_SEMA)
> +
> +#define I915_PMU_COUNT_VECS_BUSY __I915_PMU_COUNT(I915_SAMPLE_VECS, I915_SAMPLE_BUSY)
> +#define I915_PMU_COUNT_VECS_WAIT __I915_PMU_COUNT(I915_SAMPLE_VECS, I915_SAMPLE_WAIT)
> +#define I915_PMU_COUNT_VECS_SEMA __I915_PMU_COUNT(I915_SAMPLE_VECS, I915_SAMPLE_SEMA)
> +
> +#define I915_PMU_ACTUAL_FREQUENCY 32
> +#define I915_PMU_REQUESTED_FREQUENCY 33
> +#define I915_PMU_ENERGY 34
> +#define I915_PMU_INTERRUPTS 35

Do these numbers become ABI btw?

> +
> +#define I915_PMU_RC6_RESIDENCY		40
> +#define I915_PMU_RC6p_RESIDENCY	41
> +#define I915_PMU_RC6pp_RESIDENCY	42
> +
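
Btw, for anyone wanting to poke at these counters from userspace, the
open would look roughly like this (untested sketch; the dynamic PMU type
is read from /sys/bus/event_source/devices/i915/type, error handling
omitted):

	#include <linux/perf_event.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	static int open_i915_event(int pmu_type, __u64 config)
	{
		struct perf_event_attr attr = {
			.type = pmu_type,
			.size = sizeof(attr),
			.config = config, /* e.g. I915_PMU_ACTUAL_FREQUENCY */
		};

		/* Uncore-style counter: pid == -1 plus a valid cpu. */
		return syscall(__NR_perf_event_open, &attr, -1, 0, -1, 0);
	}

The returned fd is then simply read(2) periodically for the u64 counter
value.
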
>   /* Each region is a minimum of 16k, and there are at most 255 of them.
>    */
>   #define I915_NR_TEX_REGIONS 255	/* table size 2k - maximum due to use
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index f811dd20bbc1..6351ed8a2e56 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -7365,6 +7365,7 @@ int perf_event_overflow(struct perf_event *event,
>   {
>   	return __perf_event_overflow(event, 1, data, regs);
>   }
> +EXPORT_SYMBOL_GPL(perf_event_overflow);
>   
>   /*
>    * Generic software event infrastructure
> 

First pass only for now. No insight into the perf/PMU interaction yet,
and I haven't checked the residency & co calculations.

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 19/24] drm/i915/scheduler: Support user-defined priorities
  2017-05-18  9:46 ` [PATCH 19/24] drm/i915/scheduler: Support user-defined priorities Chris Wilson
@ 2017-08-02 21:55   ` Jason Ekstrand
  0 siblings, 0 replies; 47+ messages in thread
From: Jason Ekstrand @ 2017-08-02 21:55 UTC (permalink / raw)
  To: Chris Wilson; +Cc: Intel GFX


On Thu, May 18, 2017 at 2:46 AM, Chris Wilson <chris@chris-wilson.co.uk>
wrote:

> Use a priority stored in the context as the initial value when
> submitting a request. This allows us to change the default priority on a
> per-context basis, allowing different contexts to be favoured with GPU
> time at the expense of lower importance work. The user can adjust the
> context's priority via I915_CONTEXT_PARAM_PRIORITY, with more positive
> values being higher priority (they will be serviced earlier, after their
> dependencies have been resolved). Any prerequisite work for an execbuf
> will have its priority raised to match the new request as required.
>
> Normal users can specify any value in the range of -1023 to 0 [default],
> i.e. they can reduce the priority of their workloads (and temporarily
> boost it back to normal if so desired).
>
> Privileged users can specify any value in the range of -1023 to 1023,
> [default is 0], i.e. they can raise their priority above all others and
> so potentially starve the system.
>

How is "privileged" defined?  These days, X11 and/or your Wayland
compositor run as the logged-in user.  One thought (thanks to Rafael)
would be DRM master on the current VT.  Also, is there some mechanism
whereby a "privileged" app can make another app "privileged"?  I.e., X11
making your compositor privileged.


> Note that the existing schedulers are not fair, nor load balancing; the
> execution is strictly by priority on a first-come, first-served basis,
> and the driver may choose to boost some requests above the range
> available to users.
>
> This priority was originally based around nice(2), but evolved to allow
> clients to adjust their priority within a small range, and allow for a
> privileged high priority range.
>
> For example, this can be used to implement EGL_IMG_context_priority
> https://www.khronos.org/registry/egl/extensions/IMG/EGL_IMG_context_priority.txt
>
>         EGL_CONTEXT_PRIORITY_LEVEL_IMG determines the priority level of
>         the context to be created. This attribute is a hint, as an
>         implementation may not support multiple contexts at some
>         priority levels and system policy may limit access to high
>         priority contexts to appropriate system privilege level. The
>         default value for EGL_CONTEXT_PRIORITY_LEVEL_IMG is
>         EGL_CONTEXT_PRIORITY_MEDIUM_IMG."
>
> so we can map
>
>         PRIORITY_HIGH -> 1023 [privileged, will failback to 0]
>         PRIORITY_MED -> 0 [default]
>         PRIORITY_LOW -> -1023
>
> They also map onto the priorities used by VkQueue (and a VkQueue is
> essentially a timeline, our i915_gem_context under full-ppgtt).
>

Not really. VkQueue priorities are a long and complicated story and were
*not* well thought-out in the API.  Many hours of meetings have been wasted
trying to figure out what they mean.  What they *don't* mean is a global
context priority.


> v2: s/CAP_SYS_ADMIN/CAP_SYS_NICE/
> v3: Report min/max user priorities as defines in the uapi, and rebase
> internal priorities on the exposed values.
>
> Testcase: igt/gem_exec_schedule
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_gem_context.c | 23 +++++++++++++++++++++++
>  drivers/gpu/drm/i915/i915_gem_request.h | 11 ++++++++---
>  include/uapi/drm/i915_drm.h             |  9 ++++++++-
>  3 files changed, 39 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> index c62e17a68f5b..0735d624ef07 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> @@ -1038,6 +1038,9 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
>         case I915_CONTEXT_PARAM_BANNABLE:
>                 args->value = i915_gem_context_is_bannable(ctx);
>                 break;
> +       case I915_CONTEXT_PARAM_PRIORITY:
> +               args->value = ctx->priority;
> +               break;
>         default:
>                 ret = -EINVAL;
>                 break;
> @@ -1095,6 +1098,26 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
>                 else
>                         i915_gem_context_clear_bannable(ctx);
>                 break;
> +
> +       case I915_CONTEXT_PARAM_PRIORITY:
> +               {
> +                       int priority = args->value;
> +
> +                       if (args->size)
> +                               ret = -EINVAL;
> +                       else if (!to_i915(dev)->engine[RCS]->schedule)
> +                               ret = -ENODEV;
> +                       else if (priority > I915_CONTEXT_MAX_USER_PRIORITY ||
> +                                priority < I915_CONTEXT_MIN_USER_PRIORITY)
> +                               ret = -EINVAL;
> +                       else if (priority > I915_CONTEXT_DEFAULT_PRIORITY &&
> +                                !capable(CAP_SYS_NICE))
> +                               ret = -EPERM;
> +                       else
> +                               ctx->priority = priority;
> +               }
> +               break;
> +
>         default:
>                 ret = -EINVAL;
>                 break;
> diff --git a/drivers/gpu/drm/i915/i915_gem_request.h b/drivers/gpu/drm/i915/i915_gem_request.h
> index 7b7c84369d78..d34d8adba9e9 100644
> --- a/drivers/gpu/drm/i915/i915_gem_request.h
> +++ b/drivers/gpu/drm/i915/i915_gem_request.h
> @@ -30,6 +30,8 @@
>  #include "i915_gem.h"
>  #include "i915_sw_fence.h"
>
> +#include <uapi/drm/i915_drm.h>
> +
>  struct drm_file;
>  struct drm_i915_gem_object;
>  struct drm_i915_gem_request;
> @@ -69,9 +71,12 @@ struct i915_priotree {
>         struct list_head waiters_list; /* those after us, they depend upon us */
>         struct list_head link;
>         int priority;
> -#define I915_PRIORITY_MAX 1024
> -#define I915_PRIORITY_NORMAL 0
> -#define I915_PRIORITY_MIN (-I915_PRIORITY_MAX)
> +};
> +
> +enum {
> +       I915_PRIORITY_MIN = I915_CONTEXT_MIN_USER_PRIORITY - 1,
> +       I915_PRIORITY_NORMAL = I915_CONTEXT_DEFAULT_PRIORITY,
> +       I915_PRIORITY_MAX = I915_CONTEXT_MAX_USER_PRIORITY + 1,
>  };
>
>  struct i915_gem_capture_list {
> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> index 4adf3e5f5c71..34ee011f08ac 100644
> --- a/include/uapi/drm/i915_drm.h
> +++ b/include/uapi/drm/i915_drm.h
> @@ -393,8 +393,11 @@ typedef struct drm_i915_irq_wait {
>  #define I915_PARAM_MIN_EU_IN_POOL       39
>  #define I915_PARAM_MMAP_GTT_VERSION     40
>
> -/* Query whether DRM_I915_GEM_EXECBUFFER2 supports user defined execution
> +/*
> + * Query whether DRM_I915_GEM_EXECBUFFER2 supports user defined execution
>   * priorities and the driver will attempt to execute batches in priority order.
> + * The initial priority for each batch is supplied by the context and is
> + * controlled via I915_CONTEXT_PARAM_PRIORITY.
>   */
>  #define I915_PARAM_HAS_SCHEDULER        41
>  #define I915_PARAM_HUC_STATUS           42
> @@ -1320,6 +1323,10 @@ struct drm_i915_gem_context_param {
>  #define I915_CONTEXT_PARAM_GTT_SIZE    0x3
>  #define I915_CONTEXT_PARAM_NO_ERROR_CAPTURE    0x4
>  #define I915_CONTEXT_PARAM_BANNABLE    0x5
> +#define I915_CONTEXT_PARAM_PRIORITY    0x6
> +#define   I915_CONTEXT_MAX_USER_PRIORITY       1023 /* inclusive */
> +#define   I915_CONTEXT_DEFAULT_PRIORITY                0
> +#define   I915_CONTEXT_MIN_USER_PRIORITY       -1023 /* inclusive */
>         __u64 value;
>  };
>
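
For reference, my mental model of the userspace side is roughly the
below (untested sketch; handle_error is a placeholder, and value is
interpreted as a signed int by the kernel):

	struct drm_i915_gem_context_param p = {
		.ctx_id = ctx_id,
		.size = 0, /* must be zero, per the check above */
		.param = I915_CONTEXT_PARAM_PRIORITY,
		.value = -512, /* [-1023, 0] allowed without CAP_SYS_NICE */
	};

	if (drmIoctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM, &p))
		handle_error(); /* -EPERM when raising above default */
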
> --
> 2.11.0
>
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
>

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2017-08-02 21:55 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-05-18  9:46 [PATCH 01/24] drm/i915/selftests: Pretend to be a gfx pci device Chris Wilson
2017-05-18  9:46 ` [PATCH 02/24] drm/i915: Mark CPU cache as dirty on every transition for CPU writes Chris Wilson
2017-05-18  9:46 ` [PATCH 03/24] drm/i915: Store i915_gem_object_is_coherent() as a bit next to cache-dirty Chris Wilson
2017-05-19 10:22   ` Chris Wilson
2017-05-22 20:39   ` Dongwon Kim
2017-05-18  9:46 ` [PATCH 04/24] drm/i915: Reinstate reservation_object zapping for batch_pool objects Chris Wilson
2017-05-18  9:46 ` [PATCH 05/24] drm/i915: Amalgamate execbuffer parameter structures Chris Wilson
2017-05-18  9:46 ` [PATCH 06/24] drm/i915: Use vma->exec_entry as our double-entry placeholder Chris Wilson
2017-05-18  9:46 ` [PATCH 07/24] drm/i915: Split vma exec_link/evict_link Chris Wilson
2017-05-18  9:46 ` [PATCH 08/24] drm/i915: Store a direct lookup from object handle to vma Chris Wilson
2017-05-18  9:46 ` [PATCH 09/24] drm/i915: Pass vma to relocate entry Chris Wilson
2017-05-18  9:46 ` [PATCH 10/24] drm/i915: Disable EXEC_OBJECT_ASYNC when doing relocations Chris Wilson
2017-05-19  9:05   ` Joonas Lahtinen
2017-05-18  9:46 ` [PATCH 11/24] drm/i915: Eliminate lots of iterations over the execobjects array Chris Wilson
2017-05-18  9:46 ` [PATCH 12/24] drm/i915: Store a persistent reference for an object in the execbuffer cache Chris Wilson
2017-05-19  9:37   ` Joonas Lahtinen
2017-05-18  9:46 ` [PATCH 13/24] drm/i915: First try the previous execbuffer location Chris Wilson
2017-05-18  9:46 ` [PATCH 14/24] drm/i915: Wait upon userptr get-user-pages within execbuffer Chris Wilson
2017-05-18  9:46 ` [PATCH 15/24] drm/i915: Allow execbuffer to use the first object as the batch Chris Wilson
2017-05-18  9:46 ` [PATCH 16/24] drm/i915: Async GPU relocation processing Chris Wilson
2017-05-18  9:46 ` [PATCH 17/24] drm/i915: Stash a pointer to the obj's resv in the vma Chris Wilson
2017-05-19  9:19   ` Joonas Lahtinen
2017-05-18  9:46 ` [PATCH 18/24] drm/i915: Convert execbuf to use struct-of-array packing for critical fields Chris Wilson
2017-05-19  9:25   ` Chris Wilson
2017-05-18  9:46 ` [PATCH 19/24] drm/i915/scheduler: Support user-defined priorities Chris Wilson
2017-08-02 21:55   ` Jason Ekstrand
2017-05-18  9:46 ` [PATCH 20/24] drm/i915: Group all the global context information together Chris Wilson
2017-05-19  9:35   ` Joonas Lahtinen
2017-05-18  9:46 ` [PATCH 21/24] drm/i915: Allow contexts to be unreferenced locklessly Chris Wilson
2017-05-18  9:46 ` [PATCH 22/24] drm/i915: Enable rcu-only context lookups Chris Wilson
2017-05-19 10:36   ` Joonas Lahtinen
2017-05-18  9:46 ` [PATCH 23/24] drm/i915: Keep a recent cache of freed contexts objects for reuse Chris Wilson
2017-05-18 13:52   ` Tvrtko Ursulin
2017-05-18 14:19     ` Chris Wilson
2017-05-19  9:09       ` Tvrtko Ursulin
2017-05-22 10:51   ` Tvrtko Ursulin
2017-05-22 11:06     ` Chris Wilson
2017-05-18  9:46 ` [PATCH 24/24] RFC drm/i915: Expose a PMU interface for perf queries Chris Wilson
2017-05-18 23:48   ` Dmitry Rogozhkin
2017-05-19  8:01     ` Chris Wilson
2017-05-19 14:54       ` Dmitry Rogozhkin
2017-06-15 11:17   ` Tvrtko Ursulin
2017-05-18 12:31 ` ✓ Fi.CI.BAT: success for series starting with [01/24] drm/i915/selftests: Pretend to be a gfx pci device Patchwork
2017-05-18 13:35 ` [PATCH 01/24] " Tvrtko Ursulin
2017-05-18 14:46   ` Chris Wilson
2017-05-19  9:02 ` Joonas Lahtinen
2017-05-19  9:10   ` Chris Wilson
