* Confluence of eb + timeline improvements
@ 2017-04-19  9:41 Chris Wilson
  2017-04-19  9:41 ` [PATCH 01/27] drm/i915/selftests: Allocate inode/file dynamically Chris Wilson
                   ` (30 more replies)
  0 siblings, 31 replies; 95+ messages in thread
From: Chris Wilson @ 2017-04-19  9:41 UTC (permalink / raw)
  To: intel-gfx

Lots of patches we have all seen before by now; the majority of them are
r-b'ed, or very close to r-b (I hope). There are lots of nice little
performance improvements, and the series should make the relocation tests
green.
-Chris

* [PATCH 01/27] drm/i915/selftests: Allocate inode/file dynamically
  2017-04-19  9:41 Confluence of eb + timeline improvements Chris Wilson
@ 2017-04-19  9:41 ` Chris Wilson
  2017-04-20  7:42   ` Joonas Lahtinen
  2017-04-19  9:41 ` [PATCH 02/27] drm/i915: Mark CPU cache as dirty on every transition for CPU writes Chris Wilson
                   ` (29 subsequent siblings)
  30 siblings, 1 reply; 95+ messages in thread
From: Chris Wilson @ 2017-04-19  9:41 UTC (permalink / raw)
  To: intel-gfx; +Cc: Matthew Auld, Arnd Bergmann

Avoid having too large a stack by creating the fake struct inode/file on
the heap instead.

drivers/gpu/drm/i915/selftests/mock_drm.c: In function 'mock_file':
drivers/gpu/drm/i915/selftests/mock_drm.c:46:1: error: the frame size of 1328 bytes is larger than 1280 bytes [-Werror=frame-larger-than=]
drivers/gpu/drm/i915/selftests/mock_drm.c: In function 'mock_file_free':
drivers/gpu/drm/i915/selftests/mock_drm.c:54:1: error: the frame size of 1312 bytes is larger than 1280 bytes [-Werror=frame-larger-than=]

Reported-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 66d9cb5d805a ("drm/i915: Mock the GEM device for self-testing")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Acked-by: Arnd Bergmann <arnd@arndb.de>
---
 drivers/gpu/drm/i915/selftests/mock_drm.c | 45 ++++++++++++++++++++++---------
 1 file changed, 32 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/i915/selftests/mock_drm.c b/drivers/gpu/drm/i915/selftests/mock_drm.c
index 113dec05c7dc..09c704153456 100644
--- a/drivers/gpu/drm/i915/selftests/mock_drm.c
+++ b/drivers/gpu/drm/i915/selftests/mock_drm.c
@@ -24,31 +24,50 @@
 
 #include "mock_drm.h"
 
-static inline struct inode fake_inode(struct drm_i915_private *i915)
-{
-	return (struct inode){ .i_rdev = i915->drm.primary->index };
-}
-
 struct drm_file *mock_file(struct drm_i915_private *i915)
 {
-	struct inode inode = fake_inode(i915);
-	struct file filp = {};
+	struct file *filp;
+	struct inode *inode;
 	struct drm_file *file;
 	int err;
 
-	err = drm_open(&inode, &filp);
-	if (unlikely(err))
-		return ERR_PTR(err);
+	inode = kzalloc(sizeof(*inode), GFP_KERNEL);
+	if (!inode) {
+		err = -ENOMEM;
+		goto err;
+	}
+
+	inode->i_rdev = i915->drm.primary->index;
 
-	file = filp.private_data;
+	filp = kzalloc(sizeof(*filp), GFP_KERNEL);
+	if (!filp) {
+		err = -ENOMEM;
+		goto err_inode;
+	}
+
+	err = drm_open(inode, filp);
+	if (err)
+		goto err_filp;
+
+	file = filp->private_data;
+	memset(&file->filp, POISON_INUSE, sizeof(file->filp));
 	file->authenticated = true;
+
+	kfree(filp);
+	kfree(inode);
 	return file;
+
+err_filp:
+	kfree(filp);
+err_inode:
+	kfree(inode);
+err:
+	return ERR_PTR(err);
 }
 
 void mock_file_free(struct drm_i915_private *i915, struct drm_file *file)
 {
-	struct inode inode = fake_inode(i915);
 	struct file filp = { .private_data = file };
 
-	drm_release(&inode, &filp);
+	drm_release(NULL, &filp);
 }
-- 
2.11.0

* [PATCH 02/27] drm/i915: Mark CPU cache as dirty on every transition for CPU writes
  2017-04-19  9:41 Confluence of eb + timeline improvements Chris Wilson
  2017-04-19  9:41 ` [PATCH 01/27] drm/i915/selftests: Allocate inode/file dynamically Chris Wilson
@ 2017-04-19  9:41 ` Chris Wilson
  2017-04-19 16:52   ` Dongwon Kim
  2017-04-19  9:41 ` [PATCH 03/27] drm/i915: Mark up clflushes as belonging to an unordered timeline Chris Wilson
                   ` (28 subsequent siblings)
  30 siblings, 1 reply; 95+ messages in thread
From: Chris Wilson @ 2017-04-19  9:41 UTC (permalink / raw)
  To: intel-gfx; +Cc: Dongwon Kim

Currently, we only mark the CPU cache as dirty if we skip a clflush.
This leads to some confusion where we have to ask whether the object is
in the CPU write domain or has missed a clflush. If we always mark the
cache as dirty, this becomes a much simpler question to answer.

The goal remains to do as few clflushes as required and to do them as
late as possible, in the hope of deferring the work to a kthread and not
blocking the caller (e.g. execbuf, flips).

Reported-by: Dongwon Kim <dongwon.kim@intel.com>
Fixes: a6a7cc4b7db6 ("drm/i915: Always flush the dirty CPU cache when pinning the scanout")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Dongwon Kim <dongwon.kim@intel.com>
Cc: Matt Roper <matthew.d.roper@intel.com>
---
 drivers/gpu/drm/i915/i915_gem.c                  | 78 +++++++++++++++---------
 drivers/gpu/drm/i915/i915_gem_clflush.c          | 15 +++--
 drivers/gpu/drm/i915/i915_gem_execbuffer.c       | 21 +++----
 drivers/gpu/drm/i915/i915_gem_internal.c         |  3 +-
 drivers/gpu/drm/i915/i915_gem_userptr.c          |  5 +-
 drivers/gpu/drm/i915/selftests/huge_gem_object.c |  3 +-
 6 files changed, 70 insertions(+), 55 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 33fb11cc5acc..488ca7733c1e 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -49,7 +49,7 @@ static void i915_gem_flush_free_objects(struct drm_i915_private *i915);
 
 static bool cpu_write_needs_clflush(struct drm_i915_gem_object *obj)
 {
-	if (obj->base.write_domain == I915_GEM_DOMAIN_CPU)
+	if (obj->cache_dirty)
 		return false;
 
 	if (!i915_gem_object_is_coherent(obj))
@@ -233,6 +233,14 @@ i915_gem_object_get_pages_phys(struct drm_i915_gem_object *obj)
 	return st;
 }
 
+static void __start_cpu_write(struct drm_i915_gem_object *obj)
+{
+	obj->base.read_domains = I915_GEM_DOMAIN_CPU;
+	obj->base.write_domain = I915_GEM_DOMAIN_CPU;
+	if (cpu_write_needs_clflush(obj))
+		obj->cache_dirty = true;
+}
+
 static void
 __i915_gem_object_release_shmem(struct drm_i915_gem_object *obj,
 				struct sg_table *pages,
@@ -248,8 +256,7 @@ __i915_gem_object_release_shmem(struct drm_i915_gem_object *obj,
 	    !i915_gem_object_is_coherent(obj))
 		drm_clflush_sg(pages);
 
-	obj->base.read_domains = I915_GEM_DOMAIN_CPU;
-	obj->base.write_domain = I915_GEM_DOMAIN_CPU;
+	__start_cpu_write(obj);
 }
 
 static void
@@ -684,6 +691,12 @@ i915_gem_dumb_create(struct drm_file *file,
 			       args->size, &args->handle);
 }
 
+static bool gpu_write_needs_clflush(struct drm_i915_gem_object *obj)
+{
+	return !(obj->cache_level == I915_CACHE_NONE ||
+		 obj->cache_level == I915_CACHE_WT);
+}
+
 /**
  * Creates a new mm object and returns a handle to it.
  * @dev: drm device pointer
@@ -753,6 +766,11 @@ flush_write_domain(struct drm_i915_gem_object *obj, unsigned int flush_domains)
 	case I915_GEM_DOMAIN_CPU:
 		i915_gem_clflush_object(obj, I915_CLFLUSH_SYNC);
 		break;
+
+	case I915_GEM_DOMAIN_RENDER:
+		if (gpu_write_needs_clflush(obj))
+			obj->cache_dirty = true;
+		break;
 	}
 
 	obj->base.write_domain = 0;
@@ -854,7 +872,8 @@ int i915_gem_obj_prepare_shmem_read(struct drm_i915_gem_object *obj,
 	 * optimizes for the case when the gpu will dirty the data
 	 * anyway again before the next pread happens.
 	 */
-	if (!(obj->base.read_domains & I915_GEM_DOMAIN_CPU))
+	if (!obj->cache_dirty &&
+	    !(obj->base.read_domains & I915_GEM_DOMAIN_CPU))
 		*needs_clflush = CLFLUSH_BEFORE;
 
 out:
@@ -906,14 +925,15 @@ int i915_gem_obj_prepare_shmem_write(struct drm_i915_gem_object *obj,
 	 * This optimizes for the case when the gpu will use the data
 	 * right away and we therefore have to clflush anyway.
 	 */
-	if (obj->base.write_domain != I915_GEM_DOMAIN_CPU)
+	if (!obj->cache_dirty) {
 		*needs_clflush |= CLFLUSH_AFTER;
 
-	/* Same trick applies to invalidate partially written cachelines read
-	 * before writing.
-	 */
-	if (!(obj->base.read_domains & I915_GEM_DOMAIN_CPU))
-		*needs_clflush |= CLFLUSH_BEFORE;
+		/* Same trick applies to invalidate partially written
+		 * cachelines read before writing.
+		 */
+		if (!(obj->base.read_domains & I915_GEM_DOMAIN_CPU))
+			*needs_clflush |= CLFLUSH_BEFORE;
+	}
 
 out:
 	intel_fb_obj_invalidate(obj, ORIGIN_CPU);
@@ -3374,10 +3394,12 @@ int i915_gem_wait_for_idle(struct drm_i915_private *i915, unsigned int flags)
 
 static void __i915_gem_object_flush_for_display(struct drm_i915_gem_object *obj)
 {
-	if (obj->base.write_domain != I915_GEM_DOMAIN_CPU && !obj->cache_dirty)
-		return;
-
-	i915_gem_clflush_object(obj, I915_CLFLUSH_FORCE);
+	/* We manually flush the CPU domain so that we can override and
+	 * force the flush for the display, and perform it asynchronously.
+	 */
+	flush_write_domain(obj, ~I915_GEM_DOMAIN_CPU);
+	if (obj->cache_dirty)
+		i915_gem_clflush_object(obj, I915_CLFLUSH_FORCE);
 	obj->base.write_domain = 0;
 }
 
@@ -3636,14 +3658,17 @@ int i915_gem_object_set_cache_level(struct drm_i915_gem_object *obj,
 		}
 	}
 
-	if (obj->base.write_domain == I915_GEM_DOMAIN_CPU &&
-	    i915_gem_object_is_coherent(obj))
-		obj->cache_dirty = true;
+	/* Catch any deferred obj->cache_dirty markups */
+	flush_write_domain(obj, ~I915_GEM_DOMAIN_CPU);
 
 	list_for_each_entry(vma, &obj->vma_list, obj_link)
 		vma->node.color = cache_level;
 	obj->cache_level = cache_level;
 
+	if (obj->base.write_domain & I915_GEM_DOMAIN_CPU &&
+	    cpu_write_needs_clflush(obj))
+		obj->cache_dirty = true;
+
 	return 0;
 }
 
@@ -3864,9 +3889,6 @@ i915_gem_object_set_to_cpu_domain(struct drm_i915_gem_object *obj, bool write)
 	if (ret)
 		return ret;
 
-	if (obj->base.write_domain == I915_GEM_DOMAIN_CPU)
-		return 0;
-
 	flush_write_domain(obj, ~I915_GEM_DOMAIN_CPU);
 
 	/* Flush the CPU cache if it's still invalid. */
@@ -3878,15 +3900,13 @@ i915_gem_object_set_to_cpu_domain(struct drm_i915_gem_object *obj, bool write)
 	/* It should now be out of any other write domains, and we can update
 	 * the domain values for our changes.
 	 */
-	GEM_BUG_ON((obj->base.write_domain & ~I915_GEM_DOMAIN_CPU) != 0);
+	GEM_BUG_ON(obj->base.write_domain & ~I915_GEM_DOMAIN_CPU);
 
 	/* If we're writing through the CPU, then the GPU read domains will
 	 * need to be invalidated at next use.
 	 */
-	if (write) {
-		obj->base.read_domains = I915_GEM_DOMAIN_CPU;
-		obj->base.write_domain = I915_GEM_DOMAIN_CPU;
-	}
+	if (write)
+		__start_cpu_write(obj);
 
 	return 0;
 }
@@ -4306,6 +4326,8 @@ i915_gem_object_create(struct drm_i915_private *dev_priv, u64 size)
 	} else
 		obj->cache_level = I915_CACHE_NONE;
 
+	obj->cache_dirty = !i915_gem_object_is_coherent(obj);
+
 	trace_i915_gem_object_create(obj);
 
 	return obj;
@@ -4968,10 +4990,8 @@ int i915_gem_freeze_late(struct drm_i915_private *dev_priv)
 
 	mutex_lock(&dev_priv->drm.struct_mutex);
 	for (p = phases; *p; p++) {
-		list_for_each_entry(obj, *p, global_link) {
-			obj->base.read_domains = I915_GEM_DOMAIN_CPU;
-			obj->base.write_domain = I915_GEM_DOMAIN_CPU;
-		}
+		list_for_each_entry(obj, *p, global_link)
+			__start_cpu_write(obj);
 	}
 	mutex_unlock(&dev_priv->drm.struct_mutex);
 
diff --git a/drivers/gpu/drm/i915/i915_gem_clflush.c b/drivers/gpu/drm/i915/i915_gem_clflush.c
index ffd01e02fe94..a895643c4dc4 100644
--- a/drivers/gpu/drm/i915/i915_gem_clflush.c
+++ b/drivers/gpu/drm/i915/i915_gem_clflush.c
@@ -72,8 +72,6 @@ static const struct dma_fence_ops i915_clflush_ops = {
 static void __i915_do_clflush(struct drm_i915_gem_object *obj)
 {
 	drm_clflush_sg(obj->mm.pages);
-	obj->cache_dirty = false;
-
 	intel_fb_obj_flush(obj, ORIGIN_CPU);
 }
 
@@ -82,9 +80,6 @@ static void i915_clflush_work(struct work_struct *work)
 	struct clflush *clflush = container_of(work, typeof(*clflush), work);
 	struct drm_i915_gem_object *obj = clflush->obj;
 
-	if (!obj->cache_dirty)
-		goto out;
-
 	if (i915_gem_object_pin_pages(obj)) {
 		DRM_ERROR("Failed to acquire obj->pages for clflushing\n");
 		goto out;
@@ -132,10 +127,10 @@ void i915_gem_clflush_object(struct drm_i915_gem_object *obj,
 	 * anything not backed by physical memory we consider to be always
 	 * coherent and not need clflushing.
 	 */
-	if (!i915_gem_object_has_struct_page(obj))
+	if (!i915_gem_object_has_struct_page(obj)) {
+		obj->cache_dirty = false;
 		return;
-
-	obj->cache_dirty = true;
+	}
 
 	/* If the GPU is snooping the contents of the CPU cache,
 	 * we do not need to manually clear the CPU cache lines.  However,
@@ -154,6 +149,8 @@ void i915_gem_clflush_object(struct drm_i915_gem_object *obj,
 	if (!(flags & I915_CLFLUSH_SYNC))
 		clflush = kmalloc(sizeof(*clflush), GFP_KERNEL);
 	if (clflush) {
+		GEM_BUG_ON(!obj->cache_dirty);
+
 		dma_fence_init(&clflush->dma,
 			       &i915_clflush_ops,
 			       &clflush_lock,
@@ -181,6 +178,8 @@ void i915_gem_clflush_object(struct drm_i915_gem_object *obj,
 	} else {
 		GEM_BUG_ON(obj->base.write_domain != I915_GEM_DOMAIN_CPU);
 	}
+
+	obj->cache_dirty = false;
 }
 
 void i915_gem_clflush_init(struct drm_i915_private *i915)
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index af1965774e7b..ddc011ef5480 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -291,7 +291,7 @@ static inline int use_cpu_reloc(struct drm_i915_gem_object *obj)
 		return DBG_USE_CPU_RELOC > 0;
 
 	return (HAS_LLC(to_i915(obj->base.dev)) ||
-		obj->base.write_domain == I915_GEM_DOMAIN_CPU ||
+		obj->cache_dirty ||
 		obj->cache_level != I915_CACHE_NONE);
 }
 
@@ -1129,10 +1129,8 @@ i915_gem_execbuffer_move_to_gpu(struct drm_i915_gem_request *req,
 		if (vma->exec_entry->flags & EXEC_OBJECT_ASYNC)
 			continue;
 
-		if (obj->base.write_domain & I915_GEM_DOMAIN_CPU) {
+		if (obj->base.write_domain & obj->cache_dirty)
 			i915_gem_clflush_object(obj, 0);
-			obj->base.write_domain = 0;
-		}
 
 		ret = i915_gem_request_await_object
 			(req, obj, obj->base.pending_write_domain);
@@ -1265,12 +1263,6 @@ i915_gem_validate_context(struct drm_device *dev, struct drm_file *file,
 	return ctx;
 }
 
-static bool gpu_write_needs_clflush(struct drm_i915_gem_object *obj)
-{
-	return !(obj->cache_level == I915_CACHE_NONE ||
-		 obj->cache_level == I915_CACHE_WT);
-}
-
 void i915_vma_move_to_active(struct i915_vma *vma,
 			     struct drm_i915_gem_request *req,
 			     unsigned int flags)
@@ -1294,15 +1286,16 @@ void i915_vma_move_to_active(struct i915_vma *vma,
 	i915_gem_active_set(&vma->last_read[idx], req);
 	list_move_tail(&vma->vm_link, &vma->vm->active_list);
 
+	obj->base.write_domain = 0;
 	if (flags & EXEC_OBJECT_WRITE) {
+		obj->base.write_domain = I915_GEM_DOMAIN_RENDER;
+
 		if (intel_fb_obj_invalidate(obj, ORIGIN_CS))
 			i915_gem_active_set(&obj->frontbuffer_write, req);
 
-		/* update for the implicit flush after a batch */
-		obj->base.write_domain &= ~I915_GEM_GPU_DOMAINS;
-		if (!obj->cache_dirty && gpu_write_needs_clflush(obj))
-			obj->cache_dirty = true;
+		obj->base.read_domains = 0;
 	}
+	obj->base.read_domains |= I915_GEM_GPU_DOMAINS;
 
 	if (flags & EXEC_OBJECT_NEEDS_FENCE)
 		i915_gem_active_set(&vma->last_fence, req);
diff --git a/drivers/gpu/drm/i915/i915_gem_internal.c b/drivers/gpu/drm/i915/i915_gem_internal.c
index fc950abbe400..58e93e87d573 100644
--- a/drivers/gpu/drm/i915/i915_gem_internal.c
+++ b/drivers/gpu/drm/i915/i915_gem_internal.c
@@ -188,9 +188,10 @@ i915_gem_object_create_internal(struct drm_i915_private *i915,
 	drm_gem_private_object_init(&i915->drm, &obj->base, size);
 	i915_gem_object_init(obj, &i915_gem_object_internal_ops);
 
-	obj->base.write_domain = I915_GEM_DOMAIN_CPU;
 	obj->base.read_domains = I915_GEM_DOMAIN_CPU;
+	obj->base.write_domain = I915_GEM_DOMAIN_CPU;
 	obj->cache_level = HAS_LLC(i915) ? I915_CACHE_LLC : I915_CACHE_NONE;
+	obj->cache_dirty = !i915_gem_object_is_coherent(obj);
 
 	return obj;
 }
diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
index 58ccf8b8ca1c..9f84be171ad2 100644
--- a/drivers/gpu/drm/i915/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
@@ -802,9 +802,10 @@ i915_gem_userptr_ioctl(struct drm_device *dev, void *data, struct drm_file *file
 
 	drm_gem_private_object_init(dev, &obj->base, args->user_size);
 	i915_gem_object_init(obj, &i915_gem_userptr_ops);
-	obj->cache_level = I915_CACHE_LLC;
-	obj->base.write_domain = I915_GEM_DOMAIN_CPU;
 	obj->base.read_domains = I915_GEM_DOMAIN_CPU;
+	obj->base.write_domain = I915_GEM_DOMAIN_CPU;
+	obj->cache_level = I915_CACHE_LLC;
+	obj->cache_dirty = !i915_gem_object_is_coherent(obj);
 
 	obj->userptr.ptr = args->user_ptr;
 	obj->userptr.read_only = !!(args->flags & I915_USERPTR_READ_ONLY);
diff --git a/drivers/gpu/drm/i915/selftests/huge_gem_object.c b/drivers/gpu/drm/i915/selftests/huge_gem_object.c
index 4e681fc13be4..0ca867a877b6 100644
--- a/drivers/gpu/drm/i915/selftests/huge_gem_object.c
+++ b/drivers/gpu/drm/i915/selftests/huge_gem_object.c
@@ -126,9 +126,10 @@ huge_gem_object(struct drm_i915_private *i915,
 	drm_gem_private_object_init(&i915->drm, &obj->base, dma_size);
 	i915_gem_object_init(obj, &huge_ops);
 
-	obj->base.write_domain = I915_GEM_DOMAIN_CPU;
 	obj->base.read_domains = I915_GEM_DOMAIN_CPU;
+	obj->base.write_domain = I915_GEM_DOMAIN_CPU;
 	obj->cache_level = HAS_LLC(i915) ? I915_CACHE_LLC : I915_CACHE_NONE;
+	obj->cache_dirty = !i915_gem_object_is_coherent(obj);
 	obj->scratch = phys_size;
 
 	return obj;
-- 
2.11.0

* [PATCH 03/27] drm/i915: Mark up clflushes as belonging to an unordered timeline
  2017-04-19  9:41 Confluence of eb + timeline improvements Chris Wilson
  2017-04-19  9:41 ` [PATCH 01/27] drm/i915/selftests: Allocate inode/file dynamically Chris Wilson
  2017-04-19  9:41 ` [PATCH 02/27] drm/i915: Mark CPU cache as dirty on every transition for CPU writes Chris Wilson
@ 2017-04-19  9:41 ` Chris Wilson
  2017-04-19  9:41 ` [PATCH 04/27] drm/i915: Lift timeline ordering to await_dma_fence Chris Wilson
                   ` (27 subsequent siblings)
  30 siblings, 0 replies; 95+ messages in thread
From: Chris Wilson @ 2017-04-19  9:41 UTC (permalink / raw)
  To: intel-gfx; +Cc: Daniel Vetter

Two clflushes on two different objects are not ordered, and so do not
belong to the same timeline (context). Either we use a unique context
for each, or we reserve a special global context to mean unordered.
Ideally, we would reserve 0 to mean unordered (DMA_FENCE_NO_CONTEXT) to
have the same semantics everywhere.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h         | 2 ++
 drivers/gpu/drm/i915/i915_gem.c         | 2 +-
 drivers/gpu/drm/i915/i915_gem_clflush.c | 8 +-------
 drivers/gpu/drm/i915/i915_gem_clflush.h | 1 -
 4 files changed, 4 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 357b6c6c2f04..a11d7d8f5f2e 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1514,6 +1514,8 @@ struct i915_gem_mm {
 	/** LRU list of objects with fence regs on them. */
 	struct list_head fence_list;
 
+	u64 unordered_timeline;
+
 	/* the indicator for dispatch video commands on two BSD rings */
 	atomic_t bsd_engine_dispatch_index;
 
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 488ca7733c1e..1b100fa2a11c 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -4768,7 +4768,7 @@ int i915_gem_init(struct drm_i915_private *dev_priv)
 
 	mutex_lock(&dev_priv->drm.struct_mutex);
 
-	i915_gem_clflush_init(dev_priv);
+	dev_priv->mm.unordered_timeline = dma_fence_context_alloc(1);
 
 	if (!i915.enable_execlists) {
 		dev_priv->gt.resume = intel_legacy_submission_resume;
diff --git a/drivers/gpu/drm/i915/i915_gem_clflush.c b/drivers/gpu/drm/i915/i915_gem_clflush.c
index a895643c4dc4..17b207e963c2 100644
--- a/drivers/gpu/drm/i915/i915_gem_clflush.c
+++ b/drivers/gpu/drm/i915/i915_gem_clflush.c
@@ -27,7 +27,6 @@
 #include "i915_gem_clflush.h"
 
 static DEFINE_SPINLOCK(clflush_lock);
-static u64 clflush_context;
 
 struct clflush {
 	struct dma_fence dma; /* Must be first for dma_fence_free() */
@@ -154,7 +153,7 @@ void i915_gem_clflush_object(struct drm_i915_gem_object *obj,
 		dma_fence_init(&clflush->dma,
 			       &i915_clflush_ops,
 			       &clflush_lock,
-			       clflush_context,
+			       to_i915(obj->base.dev)->mm.unordered_timeline,
 			       0);
 		i915_sw_fence_init(&clflush->wait, i915_clflush_notify);
 
@@ -181,8 +180,3 @@ void i915_gem_clflush_object(struct drm_i915_gem_object *obj,
 
 	obj->cache_dirty = false;
 }
-
-void i915_gem_clflush_init(struct drm_i915_private *i915)
-{
-	clflush_context = dma_fence_context_alloc(1);
-}
diff --git a/drivers/gpu/drm/i915/i915_gem_clflush.h b/drivers/gpu/drm/i915/i915_gem_clflush.h
index b62d61a2d15f..2455a7820937 100644
--- a/drivers/gpu/drm/i915/i915_gem_clflush.h
+++ b/drivers/gpu/drm/i915/i915_gem_clflush.h
@@ -28,7 +28,6 @@
 struct drm_i915_private;
 struct drm_i915_gem_object;
 
-void i915_gem_clflush_init(struct drm_i915_private *i915);
 void i915_gem_clflush_object(struct drm_i915_gem_object *obj,
 			     unsigned int flags);
 #define I915_CLFLUSH_FORCE BIT(0)
-- 
2.11.0

* [PATCH 04/27] drm/i915: Lift timeline ordering to await_dma_fence
  2017-04-19  9:41 Confluence of eb + timeline improvements Chris Wilson
                   ` (2 preceding siblings ...)
  2017-04-19  9:41 ` [PATCH 03/27] drm/i915: Mark up clflushes as belonging to an unordered timeline Chris Wilson
@ 2017-04-19  9:41 ` Chris Wilson
  2017-04-19  9:41 ` [PATCH 05/27] drm/i915: Make ptr_unpack_bits() more function-like Chris Wilson
                   ` (26 subsequent siblings)
  30 siblings, 0 replies; 95+ messages in thread
From: Chris Wilson @ 2017-04-19  9:41 UTC (permalink / raw)
  To: intel-gfx

Currently we filter out repeated use of the same timeline in the low
level i915_gem_request_await_request(), after having added the
dependency on the old request. However, we can lift this to
i915_gem_request_await_dma_fence() (before the dependency is added)
using the observation that requests along the same timeline are
explicitly ordered via i915_add_request (along with the dependencies).

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_gem_request.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
index 095cccc2e8b2..97c07986b7c1 100644
--- a/drivers/gpu/drm/i915/i915_gem_request.c
+++ b/drivers/gpu/drm/i915/i915_gem_request.c
@@ -679,6 +679,7 @@ i915_gem_request_await_request(struct drm_i915_gem_request *to,
 	int ret;
 
 	GEM_BUG_ON(to == from);
+	GEM_BUG_ON(to->timeline == from->timeline);
 
 	if (to->engine->schedule) {
 		ret = i915_priotree_add_dependency(to->i915,
@@ -688,9 +689,6 @@ i915_gem_request_await_request(struct drm_i915_gem_request *to,
 			return ret;
 	}
 
-	if (to->timeline == from->timeline)
-		return 0;
-
 	if (to->engine == from->engine) {
 		ret = i915_sw_fence_await_sw_fence_gfp(&to->submit,
 						       &from->submit,
@@ -739,6 +737,13 @@ i915_gem_request_await_dma_fence(struct drm_i915_gem_request *req,
 	if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags))
 		return 0;
 
+	/* Requests on the same timeline are explicitly ordered, along with
+	 * their dependencies, by i915_add_request() which ensures that requests
+	 * are submitted in-order through each ring.
+	 */
+	if (fence->context == req->fence.context)
+		return 0;
+
 	if (dma_fence_is_i915(fence))
 		return i915_gem_request_await_request(req, to_request(fence));
 
-- 
2.11.0

* [PATCH 05/27] drm/i915: Make ptr_unpack_bits() more function-like
  2017-04-19  9:41 Confluence of eb + timeline improvements Chris Wilson
                   ` (3 preceding siblings ...)
  2017-04-19  9:41 ` [PATCH 04/27] drm/i915: Lift timeline ordering to await_dma_fence Chris Wilson
@ 2017-04-19  9:41 ` Chris Wilson
  2017-04-19  9:41 ` [PATCH 06/27] drm/i915: Redefine ptr_pack_bits() and friends Chris Wilson
                   ` (25 subsequent siblings)
  30 siblings, 0 replies; 95+ messages in thread
From: Chris Wilson @ 2017-04-19  9:41 UTC (permalink / raw)
  To: intel-gfx

ptr_unpack_bits() is a function-like macro; as such, it is meant to be
replaceable by a function. In this case, we should be passing in the
out-param as a pointer.

Bizarrely, this does affect code generation:

function                                     old     new   delta
i915_gem_object_pin_map                      409     389     -20

An improvement(?) in this case, but one can't help wondering what
strict-aliasing optimisations we are preventing.

The generated code looks identical in its use of ptr_unpack_bits (no
extra motions to stack; the pointer and bits appear to be kept in
registers). The difference appears to be code ordering, and with a
reorder the compiler is able to use smaller forward jumps.
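
For reference, a minimal userspace sketch (not the i915 header; the demo_*
names and constants are invented here) of the same pattern, showing that
callers now pass the out-param by address just as they would for a real
function:

#include <assert.h>

#define DEMO_PAGE_SHIFT 12
#define DEMO_PAGE_MASK (~((1ul << DEMO_PAGE_SHIFT) - 1))

/* Mirrors the patched ptr_unpack_bits(): *(bits) is written through the
 * pointer the caller supplies. */
#define demo_unpack_bits(ptr, bits) ({				\
	unsigned long __v = (unsigned long)(ptr);		\
	*(bits) = __v & ~DEMO_PAGE_MASK;			\
	(typeof(ptr))(__v & DEMO_PAGE_MASK);			\
})

int main(void)
{
	void *packed = (void *)(0x1000ul | 0x3);	/* pointer | low bits */
	unsigned int type;
	void *ptr = demo_unpack_bits(packed, &type);	/* note the '&' */

	assert(ptr == (void *)0x1000ul);
	assert(type == 0x3);
	return 0;
}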

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_gem.c   | 2 +-
 drivers/gpu/drm/i915/i915_utils.h | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 1b100fa2a11c..bf035065785c 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2632,7 +2632,7 @@ void *i915_gem_object_pin_map(struct drm_i915_gem_object *obj,
 	}
 	GEM_BUG_ON(!obj->mm.pages);
 
-	ptr = ptr_unpack_bits(obj->mm.mapping, has_type);
+	ptr = ptr_unpack_bits(obj->mm.mapping, &has_type);
 	if (ptr && has_type != type) {
 		if (pinned) {
 			ret = -EBUSY;
diff --git a/drivers/gpu/drm/i915/i915_utils.h b/drivers/gpu/drm/i915/i915_utils.h
index c5455d36b617..aca11aad5da7 100644
--- a/drivers/gpu/drm/i915/i915_utils.h
+++ b/drivers/gpu/drm/i915/i915_utils.h
@@ -77,7 +77,7 @@
 
 #define ptr_unpack_bits(ptr, bits) ({					\
 	unsigned long __v = (unsigned long)(ptr);			\
-	(bits) = __v & ~PAGE_MASK;					\
+	*(bits) = __v & ~PAGE_MASK;					\
 	(typeof(ptr))(__v & PAGE_MASK);					\
 })
 
-- 
2.11.0

* [PATCH 06/27] drm/i915: Redefine ptr_pack_bits() and friends
  2017-04-19  9:41 Confluence of eb + timeline improvements Chris Wilson
                   ` (4 preceding siblings ...)
  2017-04-19  9:41 ` [PATCH 05/27] drm/i915: Make ptr_unpack_bits() more function-like Chris Wilson
@ 2017-04-19  9:41 ` Chris Wilson
  2017-04-19  9:41 ` [PATCH 07/27] drm/i915: Squash repeated awaits on the same fence Chris Wilson
                   ` (24 subsequent siblings)
  30 siblings, 0 replies; 95+ messages in thread
From: Chris Wilson @ 2017-04-19  9:41 UTC (permalink / raw)
  To: intel-gfx

Rebrand the current (pointer | bits) pack/unpack utility macros as
explicit PAGE_SIZE bit twiddling (page_*), so that the more flexible
underlying ptr_* macros can be reused with different bit counts.
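
To illustrate the generalisation (userspace sketch only; the demo_* names
are invented here, and -BIT(n) is the two's-complement trick the reworked
macros rely on to clear the n low bits):

#include <assert.h>

#define DEMO_BIT(n) (1ul << (n))

#define demo_mask_bits(ptr, n) \
	((typeof(ptr))((unsigned long)(ptr) & -DEMO_BIT(n)))
#define demo_unmask_bits(ptr, n) ((unsigned long)(ptr) & (DEMO_BIT(n) - 1))
/* As in the patch, the bit count is implicit when packing; the caller just
 * promises the pointer has at least n spare low bits. */
#define demo_pack_bits(ptr, bits, n) \
	((typeof(ptr))((unsigned long)(ptr) | (bits)))

int main(void)
{
	/* A 64-byte-aligned pointer has 6 spare low bits; pack a 2-bit tag.
	 * The page_* wrappers are simply this with n = PAGE_SHIFT. */
	void *aligned = (void *)0x2000c0ul;
	void *packed = demo_pack_bits(aligned, 0x2, 2);

	assert(demo_mask_bits(packed, 2) == aligned);
	assert(demo_unmask_bits(packed, 2) == 0x2);
	return 0;
}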

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_cmd_parser.c |  2 +-
 drivers/gpu/drm/i915/i915_gem.c        |  6 +++---
 drivers/gpu/drm/i915/i915_utils.h      | 19 +++++++++++++------
 3 files changed, 17 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_cmd_parser.c b/drivers/gpu/drm/i915/i915_cmd_parser.c
index 2a1a3347495a..f0cb22cc0dd6 100644
--- a/drivers/gpu/drm/i915/i915_cmd_parser.c
+++ b/drivers/gpu/drm/i915/i915_cmd_parser.c
@@ -1284,7 +1284,7 @@ int intel_engine_cmd_parser(struct intel_engine_cs *engine,
 
 		if (*cmd == MI_BATCH_BUFFER_END) {
 			if (needs_clflush_after) {
-				void *ptr = ptr_mask_bits(shadow_batch_obj->mm.mapping);
+				void *ptr = page_mask_bits(shadow_batch_obj->mm.mapping);
 				drm_clflush_virt_range(ptr,
 						       (void *)(cmd + 1) - ptr);
 			}
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index bf035065785c..2bc72314cdd1 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2300,7 +2300,7 @@ void __i915_gem_object_put_pages(struct drm_i915_gem_object *obj,
 	if (obj->mm.mapping) {
 		void *ptr;
 
-		ptr = ptr_mask_bits(obj->mm.mapping);
+		ptr = page_mask_bits(obj->mm.mapping);
 		if (is_vmalloc_addr(ptr))
 			vunmap(ptr);
 		else
@@ -2632,7 +2632,7 @@ void *i915_gem_object_pin_map(struct drm_i915_gem_object *obj,
 	}
 	GEM_BUG_ON(!obj->mm.pages);
 
-	ptr = ptr_unpack_bits(obj->mm.mapping, &has_type);
+	ptr = page_unpack_bits(obj->mm.mapping, &has_type);
 	if (ptr && has_type != type) {
 		if (pinned) {
 			ret = -EBUSY;
@@ -2654,7 +2654,7 @@ void *i915_gem_object_pin_map(struct drm_i915_gem_object *obj,
 			goto err_unpin;
 		}
 
-		obj->mm.mapping = ptr_pack_bits(ptr, type);
+		obj->mm.mapping = page_pack_bits(ptr, type);
 	}
 
 out_unlock:
diff --git a/drivers/gpu/drm/i915/i915_utils.h b/drivers/gpu/drm/i915/i915_utils.h
index aca11aad5da7..f0500c65726d 100644
--- a/drivers/gpu/drm/i915/i915_utils.h
+++ b/drivers/gpu/drm/i915/i915_utils.h
@@ -70,20 +70,27 @@
 #define overflows_type(x, T) \
 	(sizeof(x) > sizeof(T) && (x) >> (sizeof(T) * BITS_PER_BYTE))
 
-#define ptr_mask_bits(ptr) ({						\
+#define ptr_mask_bits(ptr, n) ({					\
 	unsigned long __v = (unsigned long)(ptr);			\
-	(typeof(ptr))(__v & PAGE_MASK);					\
+	(typeof(ptr))(__v & -BIT(n));					\
 })
 
-#define ptr_unpack_bits(ptr, bits) ({					\
+#define ptr_unmask_bits(ptr, n) ((unsigned long)(ptr) & (BIT(n) - 1))
+
+#define ptr_unpack_bits(ptr, bits, n) ({				\
 	unsigned long __v = (unsigned long)(ptr);			\
-	*(bits) = __v & ~PAGE_MASK;					\
-	(typeof(ptr))(__v & PAGE_MASK);					\
+	*(bits) = __v & (BIT(n) - 1);					\
+	(typeof(ptr))(__v & -BIT(n));					\
 })
 
-#define ptr_pack_bits(ptr, bits)					\
+#define ptr_pack_bits(ptr, bits, n)					\
 	((typeof(ptr))((unsigned long)(ptr) | (bits)))
 
+#define page_mask_bits(ptr) ptr_mask_bits(ptr, PAGE_SHIFT)
+#define page_unmask_bits(ptr) ptr_unmask_bits(ptr, PAGE_SHIFT)
+#define page_pack_bits(ptr, bits) ptr_pack_bits(ptr, bits, PAGE_SHIFT)
+#define page_unpack_bits(ptr, bits) ptr_unpack_bits(ptr, bits, PAGE_SHIFT)
+
 #define ptr_offset(ptr, member) offsetof(typeof(*(ptr)), member)
 
 #define fetch_and_zero(ptr) ({						\
-- 
2.11.0

* [PATCH 07/27] drm/i915: Squash repeated awaits on the same fence
  2017-04-19  9:41 Confluence of eb + timeline improvements Chris Wilson
                   ` (5 preceding siblings ...)
  2017-04-19  9:41 ` [PATCH 06/27] drm/i915: Redefine ptr_pack_bits() and friends Chris Wilson
@ 2017-04-19  9:41 ` Chris Wilson
  2017-04-24 13:03   ` Tvrtko Ursulin
                     ` (2 more replies)
  2017-04-19  9:41 ` [PATCH 08/27] drm/i915: Rename intel_timeline.sync_seqno[] to .global_sync[] Chris Wilson
                   ` (23 subsequent siblings)
  30 siblings, 3 replies; 95+ messages in thread
From: Chris Wilson @ 2017-04-19  9:41 UTC (permalink / raw)
  To: intel-gfx

Track the latest fence waited upon from each context, and only add a new
asynchronous wait if the new fence is more recent than the recorded
fence for that context. This requires us to filter out unordered
timelines, which are noted by DMA_FENCE_NO_CONTEXT. However, in the
absence of a universal identifier, we have to use our own
i915->mm.unordered_timeline token.

v2: Throw around the debug crutches
v3: Inline the likely case of the pre-allocation cache being full.
v4: Drop the pre-allocation support; we can lose the most recent fence
in case of allocation failure -- it just means we may emit more awaits
than strictly necessary, but nothing will break.
v5: Trim the allocation size for leaf nodes; they only need an array of
u32, not of pointers.
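
As a worked example of the layout below (standalone model, not the driver
code; the DEMO_* constants mirror NSYNC/SHIFT/MASK in the patch): each
layer consumes 4 bits of the fence context id, so the leaf prefix is
id >> 4 and the slot within the leaf is id & 15. Adjacent context ids --
e.g. the engines of one GEM context -- therefore land in the same leaf
and hit the cached tl->sync pointer on lookup.

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define DEMO_NSYNC 16
#define DEMO_SHIFT 4			/* ilog2(DEMO_NSYNC) */
#define DEMO_MASK  (DEMO_NSYNC - 1)

int main(void)
{
	/* Two fence contexts allocated back to back. */
	uint64_t ctx_a = 0x1230, ctx_b = 0x1233;

	/* Same leaf prefix, different slots within that leaf... */
	assert((ctx_a >> DEMO_SHIFT) == (ctx_b >> DEMO_SHIFT));
	assert((ctx_a & DEMO_MASK) != (ctx_b & DEMO_MASK));

	/* ...so a lookup of ctx_b straight after ctx_a reuses the cached
	 * leaf rather than walking back up and down the tree. */
	printf("prefix %llx, slots %llu and %llu\n",
	       (unsigned long long)(ctx_a >> DEMO_SHIFT),
	       (unsigned long long)(ctx_a & DEMO_MASK),
	       (unsigned long long)(ctx_b & DEMO_MASK));
	return 0;
}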

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_gem_request.c            |  67 +++---
 drivers/gpu/drm/i915/i915_gem_timeline.c           | 260 +++++++++++++++++++++
 drivers/gpu/drm/i915/i915_gem_timeline.h           |  14 ++
 drivers/gpu/drm/i915/selftests/i915_gem_timeline.c | 123 ++++++++++
 .../gpu/drm/i915/selftests/i915_mock_selftests.h   |   1 +
 5 files changed, 438 insertions(+), 27 deletions(-)
 create mode 100644 drivers/gpu/drm/i915/selftests/i915_gem_timeline.c

diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
index 97c07986b7c1..fb6c31ba3ef9 100644
--- a/drivers/gpu/drm/i915/i915_gem_request.c
+++ b/drivers/gpu/drm/i915/i915_gem_request.c
@@ -730,9 +730,7 @@ int
 i915_gem_request_await_dma_fence(struct drm_i915_gem_request *req,
 				 struct dma_fence *fence)
 {
-	struct dma_fence_array *array;
 	int ret;
-	int i;
 
 	if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags))
 		return 0;
@@ -744,39 +742,54 @@ i915_gem_request_await_dma_fence(struct drm_i915_gem_request *req,
 	if (fence->context == req->fence.context)
 		return 0;
 
-	if (dma_fence_is_i915(fence))
-		return i915_gem_request_await_request(req, to_request(fence));
+	/* Squash repeated waits to the same timelines, picking the latest */
+	if (fence->context != req->i915->mm.unordered_timeline &&
+	    intel_timeline_sync_get(req->timeline,
+				    fence->context, fence->seqno))
+		return 0;
 
-	if (!dma_fence_is_array(fence)) {
+	if (dma_fence_is_i915(fence)) {
+		ret = i915_gem_request_await_request(req, to_request(fence));
+		if (ret < 0)
+			return ret;
+	} else if (!dma_fence_is_array(fence)) {
 		ret = i915_sw_fence_await_dma_fence(&req->submit,
 						    fence, I915_FENCE_TIMEOUT,
 						    GFP_KERNEL);
-		return ret < 0 ? ret : 0;
-	}
-
-	/* Note that if the fence-array was created in signal-on-any mode,
-	 * we should *not* decompose it into its individual fences. However,
-	 * we don't currently store which mode the fence-array is operating
-	 * in. Fortunately, the only user of signal-on-any is private to
-	 * amdgpu and we should not see any incoming fence-array from
-	 * sync-file being in signal-on-any mode.
-	 */
-
-	array = to_dma_fence_array(fence);
-	for (i = 0; i < array->num_fences; i++) {
-		struct dma_fence *child = array->fences[i];
-
-		if (dma_fence_is_i915(child))
-			ret = i915_gem_request_await_request(req,
-							     to_request(child));
-		else
-			ret = i915_sw_fence_await_dma_fence(&req->submit,
-							    child, I915_FENCE_TIMEOUT,
-							    GFP_KERNEL);
 		if (ret < 0)
 			return ret;
+	} else {
+		struct dma_fence_array *array = to_dma_fence_array(fence);
+		int i;
+
+		/* Note that if the fence-array was created in signal-on-any
+		 * mode, we should *not* decompose it into its individual
+		 * fences. However, we don't currently store which mode the
+		 * fence-array is operating in. Fortunately, the only user of
+		 * signal-on-any is private to amdgpu and we should not see any
+		 * incoming fence-array from sync-file being in signal-on-any
+		 * mode.
+		 */
+
+		for (i = 0; i < array->num_fences; i++) {
+			struct dma_fence *child = array->fences[i];
+
+			if (dma_fence_is_i915(child))
+				ret = i915_gem_request_await_request(req,
+								     to_request(child));
+			else
+				ret = i915_sw_fence_await_dma_fence(&req->submit,
+								    child, I915_FENCE_TIMEOUT,
+								    GFP_KERNEL);
+			if (ret < 0)
+				return ret;
+		}
 	}
 
+	if (fence->context != req->i915->mm.unordered_timeline)
+		intel_timeline_sync_set(req->timeline,
+					fence->context, fence->seqno);
+
 	return 0;
 }
 
diff --git a/drivers/gpu/drm/i915/i915_gem_timeline.c b/drivers/gpu/drm/i915/i915_gem_timeline.c
index b596ca7ee058..f2b734dda895 100644
--- a/drivers/gpu/drm/i915/i915_gem_timeline.c
+++ b/drivers/gpu/drm/i915/i915_gem_timeline.c
@@ -24,6 +24,254 @@
 
 #include "i915_drv.h"
 
+#define NSYNC 16
+#define SHIFT ilog2(NSYNC)
+#define MASK (NSYNC - 1)
+
+/* struct intel_timeline_sync is a layer of a radixtree that maps a u64 fence
+ * context id to the last u32 fence seqno waited upon from that context.
+ * Unlike lib/radixtree it uses a parent pointer that allows traversal back to
+ * the root. This allows us to access the whole tree via a single pointer
+ * to the most recently used layer. We expect fence contexts to be dense
+ * and most reuse to be on the same i915_gem_context but on neighbouring
+ * engines (i.e. on adjacent contexts) and reuse the same leaf, a very
+ * effective lookup cache. If the new lookup is not on the same leaf, we
+ * expect it to be on the neighbouring branch.
+ *
+ * A leaf holds an array of u32 seqno, and has height 0. The bitmap field
+ * allows us to store whether a particular seqno is valid (i.e. allows us
+ * to distinguish unset from 0).
+ *
+ * A branch holds an array of layer pointers, and has height > 0, and always
+ * has at least 2 layers (either branches or leaves) below it.
+ *
+ */
+struct intel_timeline_sync {
+	u64 prefix;
+	unsigned int height;
+	unsigned int bitmap;
+	struct intel_timeline_sync *parent;
+	/* union {
+	 *	u32 seqno;
+	 *	struct intel_timeline_sync *child;
+	 * } slot[NSYNC];
+	 */
+};
+
+static inline u32 *__sync_seqno(struct intel_timeline_sync *p)
+{
+	GEM_BUG_ON(p->height);
+	return (u32 *)(p + 1);
+}
+
+static inline struct intel_timeline_sync **
+__sync_child(struct intel_timeline_sync *p)
+{
+	GEM_BUG_ON(!p->height);
+	return (struct intel_timeline_sync **)(p + 1);
+}
+
+static inline unsigned int
+__sync_idx(const struct intel_timeline_sync *p, u64 id)
+{
+	return (id >> p->height) & MASK;
+}
+
+static void __sync_free(struct intel_timeline_sync *p)
+{
+	if (p->height) {
+		unsigned int i;
+
+		while ((i = ffs(p->bitmap))) {
+			p->bitmap &= ~0u << i;
+			__sync_free(__sync_child(p)[i - 1]);
+		}
+	}
+
+	kfree(p);
+}
+
+static void sync_free(struct intel_timeline_sync *sync)
+{
+	if (!sync)
+		return;
+
+	while (sync->parent)
+		sync = sync->parent;
+
+	__sync_free(sync);
+}
+
+bool intel_timeline_sync_get(struct intel_timeline *tl, u64 id, u32 seqno)
+{
+	struct intel_timeline_sync *p;
+	unsigned int idx;
+
+	p = tl->sync;
+	if (!p)
+		return false;
+
+	if (likely((id >> SHIFT) == p->prefix))
+		goto found;
+
+	/* First climb the tree back to a parent branch */
+	do {
+		p = p->parent;
+		if (!p)
+			return false;
+
+		if ((id >> p->height >> SHIFT) == p->prefix)
+			break;
+	} while (1);
+
+	/* And then descend again until we find our leaf */
+	do {
+		if (!p->height)
+			break;
+
+		p = __sync_child(p)[__sync_idx(p, id)];
+		if (!p)
+			return false;
+
+		if ((id >> p->height >> SHIFT) != p->prefix)
+			return false;
+	} while (1);
+
+	tl->sync = p;
+found:
+	idx = id & MASK;
+	if (!(p->bitmap & BIT(idx)))
+		return false;
+
+	return i915_seqno_passed(__sync_seqno(p)[idx], seqno);
+}
+
+static noinline int
+__intel_timeline_sync_set(struct intel_timeline *tl, u64 id, u32 seqno)
+{
+	struct intel_timeline_sync *p = tl->sync;
+	unsigned int idx;
+
+	if (!p) {
+		p = kzalloc(sizeof(*p) + NSYNC * sizeof(seqno), GFP_KERNEL);
+		if (unlikely(!p))
+			return -ENOMEM;
+
+		p->prefix = id >> SHIFT;
+		goto found;
+	}
+
+	/* Climb back up the tree until we find a common prefix */
+	do {
+		if (!p->parent)
+			break;
+
+		p = p->parent;
+
+		if ((id >> p->height >> SHIFT) == p->prefix)
+			break;
+	} while (1);
+
+	/* No shortcut, we have to descend the tree to find the right layer
+	 * containing this fence.
+	 *
+	 * Each layer in the tree holds 16 (NSYNC) pointers, either fences
+	 * or lower layers. Leaf nodes (height = 0) contain the fences, all
+	 * other nodes (height > 0) are internal layers that point to a lower
+	 * node. Each internal layer has at least 2 descendents.
+	 *
+	 * Starting at the top, we check whether the current prefix matches. If
+	 * it doesn't, we have gone past our layer and need to insert a join
+	 * into the tree, and a new leaf node as a descendent as well as the
+	 * original layer.
+	 *
+	 * The matching prefix means we are still following the right branch
+	 * of the tree. If it has height 0, we have found our leaf and just
+	 * need to replace the fence slot with ourselves. If the height is
+	 * not zero, our slot contains the next layer in the tree (unless
+	 * it is empty, in which case we can add ourselves as a new leaf).
+	 * As we descend the tree, the prefix grows (and the height decreases).
+	 */
+	do {
+		struct intel_timeline_sync *next;
+
+		if ((id >> p->height >> SHIFT) != p->prefix) {
+			/* insert a join above the current layer */
+			next = kzalloc(sizeof(*next) + NSYNC * sizeof(next),
+				       GFP_KERNEL);
+			if (unlikely(!next))
+				return -ENOMEM;
+
+			next->height = ALIGN(fls64((id >> p->height >> SHIFT) ^ p->prefix),
+					    SHIFT) + p->height;
+			next->prefix = id >> next->height >> SHIFT;
+
+			if (p->parent)
+				__sync_child(p->parent)[__sync_idx(p->parent, id)] = next;
+			next->parent = p->parent;
+
+			idx = p->prefix >> (next->height - p->height - SHIFT) & MASK;
+			__sync_child(next)[idx] = p;
+			next->bitmap |= BIT(idx);
+			p->parent = next;
+
+			/* ascend to the join */
+			p = next;
+		} else {
+			if (!p->height)
+				break;
+		}
+
+		/* descend into the next layer */
+		GEM_BUG_ON(!p->height);
+		idx = __sync_idx(p, id);
+		next = __sync_child(p)[idx];
+		if (unlikely(!next)) {
+			next = kzalloc(sizeof(*next) + NSYNC * sizeof(seqno),
+				       GFP_KERNEL);
+			if (unlikely(!next))
+				return -ENOMEM;
+
+			__sync_child(p)[idx] = next;
+			p->bitmap |= BIT(idx);
+			next->parent = p;
+			next->prefix = id >> SHIFT;
+
+			p = next;
+			break;
+		}
+
+		p = next;
+	} while (1);
+
+found:
+	GEM_BUG_ON(p->height);
+	GEM_BUG_ON(p->prefix != id >> SHIFT);
+	tl->sync = p;
+	idx = id & MASK;
+	__sync_seqno(p)[idx] = seqno;
+	p->bitmap |= BIT(idx);
+	return 0;
+}
+
+int intel_timeline_sync_set(struct intel_timeline *tl, u64 id, u32 seqno)
+{
+	struct intel_timeline_sync *p = tl->sync;
+
+	/* We expect to be called in sequence following a  _get(id), which
+	 * should have preloaded the tl->sync hint for us.
+	 */
+	if (likely(p && (id >> SHIFT) == p->prefix)) {
+		unsigned int idx = id & MASK;
+
+		__sync_seqno(p)[idx] = seqno;
+		p->bitmap |= BIT(idx);
+		return 0;
+	}
+
+	return __intel_timeline_sync_set(tl, id, seqno);
+}
+
 static int __i915_gem_timeline_init(struct drm_i915_private *i915,
 				    struct i915_gem_timeline *timeline,
 				    const char *name,
@@ -35,6 +283,12 @@ static int __i915_gem_timeline_init(struct drm_i915_private *i915,
 
 	lockdep_assert_held(&i915->drm.struct_mutex);
 
+	/* Ideally we want a set of engines on a single leaf as we expect
+	 * to mostly be tracking synchronisation between engines.
+	 */
+	BUILD_BUG_ON(NSYNC < I915_NUM_ENGINES);
+	BUILD_BUG_ON(NSYNC > BITS_PER_BYTE * sizeof(timeline->engine[0].sync->bitmap));
+
 	timeline->i915 = i915;
 	timeline->name = kstrdup(name ?: "[kernel]", GFP_KERNEL);
 	if (!timeline->name)
@@ -91,8 +345,14 @@ void i915_gem_timeline_fini(struct i915_gem_timeline *timeline)
 		struct intel_timeline *tl = &timeline->engine[i];
 
 		GEM_BUG_ON(!list_empty(&tl->requests));
+
+		sync_free(tl->sync);
 	}
 
 	list_del(&timeline->link);
 	kfree(timeline->name);
 }
+
+#if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
+#include "selftests/i915_gem_timeline.c"
+#endif
diff --git a/drivers/gpu/drm/i915/i915_gem_timeline.h b/drivers/gpu/drm/i915/i915_gem_timeline.h
index 6c53e14cab2a..c33dee0025ee 100644
--- a/drivers/gpu/drm/i915/i915_gem_timeline.h
+++ b/drivers/gpu/drm/i915/i915_gem_timeline.h
@@ -26,10 +26,13 @@
 #define I915_GEM_TIMELINE_H
 
 #include <linux/list.h>
+#include <linux/radix-tree.h>
 
+#include "i915_utils.h"
 #include "i915_gem_request.h"
 
 struct i915_gem_timeline;
+struct intel_timeline_sync;
 
 struct intel_timeline {
 	u64 fence_context;
@@ -55,6 +58,14 @@ struct intel_timeline {
 	 * struct_mutex.
 	 */
 	struct i915_gem_active last_request;
+
+	/* We track the most recent seqno that we wait on in every context so
+	 * that we only have to emit a new await and dependency on a more
+	 * recent sync point. As the contexts may executed out-of-order, we
+	 * recent sync point. As the contexts may be executed out-of-order, we
+	 * have to track each individually and cannot rely on an absolute
+	 */
+	struct intel_timeline_sync *sync;
 	u32 sync_seqno[I915_NUM_ENGINES];
 
 	struct i915_gem_timeline *common;
@@ -75,4 +86,7 @@ int i915_gem_timeline_init(struct drm_i915_private *i915,
 int i915_gem_timeline_init__global(struct drm_i915_private *i915);
 void i915_gem_timeline_fini(struct i915_gem_timeline *tl);
 
+bool intel_timeline_sync_get(struct intel_timeline *tl, u64 id, u32 seqno);
+int intel_timeline_sync_set(struct intel_timeline *tl, u64 id, u32 seqno);
+
 #endif
diff --git a/drivers/gpu/drm/i915/selftests/i915_gem_timeline.c b/drivers/gpu/drm/i915/selftests/i915_gem_timeline.c
new file mode 100644
index 000000000000..c0bb8ecac93b
--- /dev/null
+++ b/drivers/gpu/drm/i915/selftests/i915_gem_timeline.c
@@ -0,0 +1,123 @@
+/*
+ * Copyright © 2017 Intel Corporation
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the next
+ * paragraph) shall be included in all copies or substantial portions of the
+ * Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ *
+ */
+
+#include "../i915_selftest.h"
+#include "mock_gem_device.h"
+
+static int igt_seqmap(void *arg)
+{
+	struct drm_i915_private *i915 = arg;
+	const struct {
+		const char *name;
+		u32 seqno;
+		bool expected;
+		bool set;
+	} pass[] = {
+		{ "unset", 0, false, false },
+		{ "new", 0, false, true },
+		{ "0a", 0, true, true },
+		{ "1a", 1, false, true },
+		{ "1b", 1, true, true },
+		{ "0b", 0, true, false },
+		{ "2a", 2, false, true },
+		{ "4", 4, false, true },
+		{ "INT_MAX", INT_MAX, false, true },
+		{ "INT_MAX-1", INT_MAX-1, true, false },
+		{ "INT_MAX+1", (u32)INT_MAX+1, false, true },
+		{ "INT_MAX", INT_MAX, true, false },
+		{ "UINT_MAX", UINT_MAX, false, true },
+		{ "wrap", 0, false, true },
+		{ "unwrap", UINT_MAX, true, false },
+		{},
+	}, *p;
+	struct intel_timeline *tl;
+	int order, offset;
+	int ret;
+
+	tl = &i915->gt.global_timeline.engine[RCS];
+	for (p = pass; p->name; p++) {
+		for (order = 1; order < 64; order++) {
+			for (offset = -1; offset <= (order > 1); offset++) {
+				u64 ctx = BIT_ULL(order) + offset;
+
+				if (intel_timeline_sync_get(tl,
+							    ctx,
+							    p->seqno) != p->expected) {
+					pr_err("1: %s(ctx=%llu, seqno=%u) expected passed %s but failed\n",
+					       p->name, ctx, p->seqno, yesno(p->expected));
+					return -EINVAL;
+				}
+
+				if (p->set) {
+					ret = intel_timeline_sync_set(tl, ctx, p->seqno);
+					if (ret)
+						return ret;
+				}
+			}
+		}
+	}
+
+	tl = &i915->gt.global_timeline.engine[BCS];
+	for (order = 1; order < 64; order++) {
+		for (offset = -1; offset <= (order > 1); offset++) {
+			u64 ctx = BIT_ULL(order) + offset;
+
+			for (p = pass; p->name; p++) {
+				if (intel_timeline_sync_get(tl,
+							    ctx,
+							    p->seqno) != p->expected) {
+					pr_err("2: %s(ctx=%llu, seqno=%u) expected passed %s but failed\n",
+					       p->name, ctx, p->seqno, yesno(p->expected));
+					return -EINVAL;
+				}
+
+				if (p->set) {
+					ret = intel_timeline_sync_set(tl, ctx, p->seqno);
+					if (ret)
+						return ret;
+				}
+			}
+		}
+	}
+
+	return 0;
+}
+
+int i915_gem_timeline_mock_selftests(void)
+{
+	static const struct i915_subtest tests[] = {
+		SUBTEST(igt_seqmap),
+	};
+	struct drm_i915_private *i915;
+	int err;
+
+	i915 = mock_gem_device();
+	if (!i915)
+		return -ENOMEM;
+
+	err = i915_subtests(tests, i915);
+	drm_dev_unref(&i915->drm);
+
+	return err;
+}
diff --git a/drivers/gpu/drm/i915/selftests/i915_mock_selftests.h b/drivers/gpu/drm/i915/selftests/i915_mock_selftests.h
index be9a9ebf5692..8d0f50c25df8 100644
--- a/drivers/gpu/drm/i915/selftests/i915_mock_selftests.h
+++ b/drivers/gpu/drm/i915/selftests/i915_mock_selftests.h
@@ -12,6 +12,7 @@ selftest(sanitycheck, i915_mock_sanitycheck) /* keep first (igt selfcheck) */
 selftest(scatterlist, scatterlist_mock_selftests)
 selftest(uncore, intel_uncore_mock_selftests)
 selftest(breadcrumbs, intel_breadcrumbs_mock_selftests)
+selftest(timelines, i915_gem_timeline_mock_selftests)
 selftest(requests, i915_gem_request_mock_selftests)
 selftest(objects, i915_gem_object_mock_selftests)
 selftest(dmabuf, i915_gem_dmabuf_mock_selftests)
-- 
2.11.0

* [PATCH 08/27] drm/i915: Rename intel_timeline.sync_seqno[] to .global_sync[]
  2017-04-19  9:41 Confluence of eb + timeline improvements Chris Wilson
                   ` (6 preceding siblings ...)
  2017-04-19  9:41 ` [PATCH 07/27] drm/i915: Squash repeated awaits on the same fence Chris Wilson
@ 2017-04-19  9:41 ` Chris Wilson
  2017-04-19  9:41 ` [PATCH 09/27] drm/i915: Confirm the request is still active before adding it to the await Chris Wilson
                   ` (22 subsequent siblings)
  30 siblings, 0 replies; 95+ messages in thread
From: Chris Wilson @ 2017-04-19  9:41 UTC (permalink / raw)
  To: intel-gfx

With the addition of the inter-context intel_timeline.sync map, having a
very similar sync_seqno[] is confusing. Aid the reader by denoting that
this is a pre-allocated array for storing semaphore sync points with
respect to the global seqno.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_gem_request.c  | 8 ++++----
 drivers/gpu/drm/i915/i915_gem_timeline.h | 8 +++++++-
 2 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
index fb6c31ba3ef9..1f3620ab4736 100644
--- a/drivers/gpu/drm/i915/i915_gem_request.c
+++ b/drivers/gpu/drm/i915/i915_gem_request.c
@@ -218,8 +218,8 @@ static int reset_all_global_seqno(struct drm_i915_private *i915, u32 seqno)
 		tl->seqno = seqno;
 
 		list_for_each_entry(timeline, &i915->gt.timelines, link)
-			memset(timeline->engine[id].sync_seqno, 0,
-			       sizeof(timeline->engine[id].sync_seqno));
+			memset(timeline->engine[id].global_sync, 0,
+			       sizeof(timeline->engine[id].global_sync));
 	}
 
 	return 0;
@@ -704,7 +704,7 @@ i915_gem_request_await_request(struct drm_i915_gem_request *to,
 		return ret < 0 ? ret : 0;
 	}
 
-	if (seqno <= to->timeline->sync_seqno[from->engine->id])
+	if (seqno <= to->timeline->global_sync[from->engine->id])
 		return 0;
 
 	trace_i915_gem_ring_sync_to(to, from);
@@ -722,7 +722,7 @@ i915_gem_request_await_request(struct drm_i915_gem_request *to,
 			return ret;
 	}
 
-	to->timeline->sync_seqno[from->engine->id] = seqno;
+	to->timeline->global_sync[from->engine->id] = seqno;
 	return 0;
 }
 
diff --git a/drivers/gpu/drm/i915/i915_gem_timeline.h b/drivers/gpu/drm/i915/i915_gem_timeline.h
index c33dee0025ee..29bce47cbf67 100644
--- a/drivers/gpu/drm/i915/i915_gem_timeline.h
+++ b/drivers/gpu/drm/i915/i915_gem_timeline.h
@@ -66,7 +66,13 @@ struct intel_timeline {
 	 * global_seqno.
 	 */
 	struct intel_timeline_sync *sync;
-	u32 sync_seqno[I915_NUM_ENGINES];
+	/* Separately to the inter-context seqno map above, we track the last
+	 * barrier (e.g. semaphore wait) to the global engine timelines. Note
+	 * that this tracks global_seqno rather than the context.seqno, and
+	 * so it is subject to the limitations of hw wraparound and to the
+	 * fact that we may need to revoke global_seqno (on pre-emption).
+	 */
+	u32 global_sync[I915_NUM_ENGINES];
 
 	struct i915_gem_timeline *common;
 };
-- 
2.11.0

* [PATCH 09/27] drm/i915: Confirm the request is still active before adding it to the await
  2017-04-19  9:41 Confluence of eb + timeline improvements Chris Wilson
                   ` (7 preceding siblings ...)
  2017-04-19  9:41 ` [PATCH 08/27] drm/i915: Rename intel_timeline.sync_seqno[] to .global_sync[] Chris Wilson
@ 2017-04-19  9:41 ` Chris Wilson
  2017-04-19  9:41 ` [PATCH 10/27] drm/i915: Do not record a successful syncpoint for a dma-await Chris Wilson
                   ` (21 subsequent siblings)
  30 siblings, 0 replies; 95+ messages in thread
From: Chris Wilson @ 2017-04-19  9:41 UTC (permalink / raw)
  To: intel-gfx

Although we do check the completion-status of the request before
actually adding a wait on it (either to its submit fence or its
completion dma-fence), we currently do not check before adding it to the
dependency lists.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Michał Winiarski <michal.winiarski@intel.com>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_gem_request.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
index 1f3620ab4736..d5fadf53f153 100644
--- a/drivers/gpu/drm/i915/i915_gem_request.c
+++ b/drivers/gpu/drm/i915/i915_gem_request.c
@@ -681,6 +681,9 @@ i915_gem_request_await_request(struct drm_i915_gem_request *to,
 	GEM_BUG_ON(to == from);
 	GEM_BUG_ON(to->timeline == from->timeline);
 
+	if (i915_gem_request_completed(from))
+		return 0;
+
 	if (to->engine->schedule) {
 		ret = i915_priotree_add_dependency(to->i915,
 						   &to->priotree,
-- 
2.11.0

* [PATCH 10/27] drm/i915: Do not record a successful syncpoint for a dma-await
  2017-04-19  9:41 Confluence of eb + timeline improvements Chris Wilson
                   ` (8 preceding siblings ...)
  2017-04-19  9:41 ` [PATCH 09/27] drm/i915: Confirm the request is still active before adding it to the await Chris Wilson
@ 2017-04-19  9:41 ` Chris Wilson
  2017-04-19  9:41 ` [PATCH 11/27] drm/i915: Switch the global i915.semaphores check to a local predicate Chris Wilson
                   ` (20 subsequent siblings)
  30 siblings, 0 replies; 95+ messages in thread
From: Chris Wilson @ 2017-04-19  9:41 UTC (permalink / raw)
  To: intel-gfx

As we may unwind the requests, even though the request we are awaiting
has a global_seqno, that seqno may be revoked during the await and so we
cannot reliably use it as a barrier for all future awaits on the same
timeline.
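
For illustration, a userspace toy model (not driver code; global_sync[]
and need_wait() are made-up names here) of why a revocable global_seqno
makes a poor barrier: if we cache the seqno after a dma-fence await and
the request is later unwound, a reused seqno on the same engine would
incorrectly satisfy the cached barrier.

#include <stdio.h>

/* Toy model only: global_sync[] stands in for the per-engine barrier
 * cache, and "revoking" a seqno stands in for request unwinding.
 */
static unsigned int global_sync[1];

static int need_wait(unsigned int seqno)
{
	/* Anything at or below the cached seqno is assumed already waited upon. */
	return seqno > global_sync[0];
}

int main(void)
{
	/* Await via the dma-fence path, but (incorrectly) record the barrier. */
	global_sync[0] = 5;

	/* The awaited request is unwound and seqno 5 is later reused by a
	 * different, not-yet-completed request on the same engine: the cached
	 * barrier now wrongly claims no wait is needed.
	 */
	printf("wait needed for reused seqno 5? %d\n", need_wait(5)); /* prints 0 */
	return 0;
}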

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Michał Winiarski <michal.winiarski@intel.com>
Reviewed-by: Michał Winiarski <michal.winiarski@intel.com>
---
 drivers/gpu/drm/i915/i915_gem_request.c | 34 ++++++++++++++++-----------------
 1 file changed, 16 insertions(+), 18 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
index d5fadf53f153..75b33993468f 100644
--- a/drivers/gpu/drm/i915/i915_gem_request.c
+++ b/drivers/gpu/drm/i915/i915_gem_request.c
@@ -700,33 +700,31 @@ i915_gem_request_await_request(struct drm_i915_gem_request *to,
 	}
 
 	seqno = i915_gem_request_global_seqno(from);
-	if (!seqno) {
-		ret = i915_sw_fence_await_dma_fence(&to->submit,
-						    &from->fence, 0,
-						    GFP_KERNEL);
-		return ret < 0 ? ret : 0;
-	}
+	if (!seqno)
+		goto await_dma_fence;
 
-	if (seqno <= to->timeline->global_sync[from->engine->id])
-		return 0;
-
-	trace_i915_gem_ring_sync_to(to, from);
 	if (!i915.semaphores) {
-		if (!i915_spin_request(from, TASK_INTERRUPTIBLE, 2)) {
-			ret = i915_sw_fence_await_dma_fence(&to->submit,
-							    &from->fence, 0,
-							    GFP_KERNEL);
-			if (ret < 0)
-				return ret;
-		}
+		if (!__i915_spin_request(from, seqno, TASK_INTERRUPTIBLE, 2))
+			goto await_dma_fence;
 	} else {
+		if (seqno <= to->timeline->global_sync[from->engine->id])
+			return 0;
+
+		trace_i915_gem_ring_sync_to(to, from);
 		ret = to->engine->semaphore.sync_to(to, from);
 		if (ret)
 			return ret;
+
+		to->timeline->global_sync[from->engine->id] = seqno;
 	}
 
-	to->timeline->global_sync[from->engine->id] = seqno;
 	return 0;
+
+await_dma_fence:
+	ret = i915_sw_fence_await_dma_fence(&to->submit,
+					    &from->fence, 0,
+					    GFP_KERNEL);
+	return ret < 0 ? ret : 0;
 }
 
 int
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 95+ messages in thread

* [PATCH 11/27] drm/i915: Switch the global i915.semaphores check to a local predicate
  2017-04-19  9:41 Confluence of eb + timeline improvements Chris Wilson
                   ` (9 preceding siblings ...)
  2017-04-19  9:41 ` [PATCH 10/27] drm/i915: Do not record a successful syncpoint for a dma-await Chris Wilson
@ 2017-04-19  9:41 ` Chris Wilson
  2017-04-19  9:41 ` [PATCH 12/27] drm/i915: Only report a wakeup if the waiter was truly asleep Chris Wilson
                   ` (19 subsequent siblings)
  30 siblings, 0 replies; 95+ messages in thread
From: Chris Wilson @ 2017-04-19  9:41 UTC (permalink / raw)
  To: intel-gfx

Rather than use a global modparam, we can just check to see if the
engine has semaphores configured upon it.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_gem_request.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
index 75b33993468f..83b1584b3deb 100644
--- a/drivers/gpu/drm/i915/i915_gem_request.c
+++ b/drivers/gpu/drm/i915/i915_gem_request.c
@@ -703,10 +703,12 @@ i915_gem_request_await_request(struct drm_i915_gem_request *to,
 	if (!seqno)
 		goto await_dma_fence;
 
-	if (!i915.semaphores) {
+	if (!to->engine->semaphore.sync_to) {
 		if (!__i915_spin_request(from, seqno, TASK_INTERRUPTIBLE, 2))
 			goto await_dma_fence;
 	} else {
+		GEM_BUG_ON(!from->engine->semaphore.signal);
+
 		if (seqno <= to->timeline->global_sync[from->engine->id])
 			return 0;
 
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 95+ messages in thread

* [PATCH 12/27] drm/i915: Only report a wakeup if the waiter was truly asleep
  2017-04-19  9:41 Confluence of eb + timeline improvements Chris Wilson
                   ` (10 preceding siblings ...)
  2017-04-19  9:41 ` [PATCH 11/27] drm/i915: Switch the global i915.semaphores check to a local predicate Chris Wilson
@ 2017-04-19  9:41 ` Chris Wilson
  2017-04-20 13:30   ` Tvrtko Ursulin
  2017-04-19  9:41 ` [PATCH 13/27] drm/i915/execlists: Pack the count into the low bits of the port.request Chris Wilson
                   ` (18 subsequent siblings)
  30 siblings, 1 reply; 95+ messages in thread
From: Chris Wilson @ 2017-04-19  9:41 UTC (permalink / raw)
  To: intel-gfx

If we attempt to wake up a waiter who is currently checking the seqno,
it will be in the TASK_INTERRUPTIBLE state and ttwu will report success.
However, it is actually awake and functioning -- so delay reporting the
actual wake up until it sleeps.

v2: Defend against !CONFIG_SMP
v3: Don't filter out calls to wake_up_process
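
For illustration only, a small userspace sketch (not driver code) of how
the three wakeup flags compose, so that hangcheck only counts a missed
breadcrumb when a waiter existed, was genuinely asleep, and ttwu actually
had to wake it:

#include <stdbool.h>
#include <stdio.h>

#define ENGINE_WAKEUP_WAITER  (1u << 0)
#define ENGINE_WAKEUP_ASLEEP  (1u << 1)
#define ENGINE_WAKEUP_SUCCESS (1u << 2)
#define ENGINE_WAKEUP (ENGINE_WAKEUP_WAITER | \
		       ENGINE_WAKEUP_ASLEEP | \
		       ENGINE_WAKEUP_SUCCESS)

/* Combine the per-wakeup observations the same way the patch does. */
static unsigned int wakeup(bool have_waiter, bool asleep, bool ttwu_ok)
{
	unsigned int result = 0;

	if (have_waiter) {
		result = ENGINE_WAKEUP_WAITER;
		if (asleep)
			result |= ENGINE_WAKEUP_ASLEEP;
		if (ttwu_ok)
			result |= ENGINE_WAKEUP_SUCCESS;
	}

	return result;
}

int main(void)
{
	/* A waiter busy checking the seqno: ttwu reports success, but it was
	 * never asleep, so hangcheck must not count a missed breadcrumb.
	 */
	printf("missed? %d\n", wakeup(true, false, true) == ENGINE_WAKEUP); /* 0 */

	/* A waiter truly asleep that we had to wake: count it. */
	printf("missed? %d\n", wakeup(true, true, true) == ENGINE_WAKEUP); /* 1 */

	return 0;
}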

References: https://bugs.freedesktop.org/show_bug.cgi?id=100007
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
---
 drivers/gpu/drm/i915/intel_breadcrumbs.c | 18 ++++++++++++++++--
 drivers/gpu/drm/i915/intel_ringbuffer.h  |  4 ++++
 2 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/intel_breadcrumbs.c b/drivers/gpu/drm/i915/intel_breadcrumbs.c
index 9ccbf26124c6..808d3a3cda0a 100644
--- a/drivers/gpu/drm/i915/intel_breadcrumbs.c
+++ b/drivers/gpu/drm/i915/intel_breadcrumbs.c
@@ -27,6 +27,12 @@
 
 #include "i915_drv.h"
 
+#ifdef CONFIG_SMP
+#define task_asleep(tsk) (!(tsk)->on_cpu)
+#else
+#define task_asleep(tsk) ((tsk) != current)
+#endif
+
 static unsigned int __intel_breadcrumbs_wakeup(struct intel_breadcrumbs *b)
 {
 	struct intel_wait *wait;
@@ -37,8 +43,16 @@ static unsigned int __intel_breadcrumbs_wakeup(struct intel_breadcrumbs *b)
 	wait = b->irq_wait;
 	if (wait) {
 		result = ENGINE_WAKEUP_WAITER;
-		if (wake_up_process(wait->tsk))
+
+		/* Be careful not to report a successful wakeup if the waiter
+		 * is currently processing the seqno, where it will have
+		 * already called set_task_state(TASK_INTERRUPTIBLE).
+		 */
+		if (task_asleep(wait->tsk))
 			result |= ENGINE_WAKEUP_ASLEEP;
+
+		if (wake_up_process(wait->tsk))
+			result |= ENGINE_WAKEUP_SUCCESS;
 	}
 
 	return result;
@@ -98,7 +112,7 @@ static void intel_breadcrumbs_hangcheck(unsigned long data)
 	 * but we still have a waiter. Assuming all batches complete within
 	 * DRM_I915_HANGCHECK_JIFFIES [1.5s]!
 	 */
-	if (intel_engine_wakeup(engine) & ENGINE_WAKEUP_ASLEEP) {
+	if (intel_engine_wakeup(engine) == ENGINE_WAKEUP) {
 		missed_breadcrumb(engine);
 		mod_timer(&engine->breadcrumbs.fake_irq, jiffies + 1);
 	} else {
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index 00d36aa4e26d..d25b88467e5e 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -668,6 +668,10 @@ static inline bool intel_engine_has_waiter(const struct intel_engine_cs *engine)
 unsigned int intel_engine_wakeup(struct intel_engine_cs *engine);
 #define ENGINE_WAKEUP_WAITER BIT(0)
 #define ENGINE_WAKEUP_ASLEEP BIT(1)
+#define ENGINE_WAKEUP_SUCCESS BIT(2)
+#define ENGINE_WAKEUP (ENGINE_WAKEUP_WAITER | \
+		       ENGINE_WAKEUP_ASLEEP | \
+		       ENGINE_WAKEUP_SUCCESS)
 
 void __intel_engine_disarm_breadcrumbs(struct intel_engine_cs *engine);
 void intel_engine_disarm_breadcrumbs(struct intel_engine_cs *engine);
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 95+ messages in thread

* [PATCH 13/27] drm/i915/execlists: Pack the count into the low bits of the port.request
  2017-04-19  9:41 Confluence of eb + timeline improvements Chris Wilson
                   ` (11 preceding siblings ...)
  2017-04-19  9:41 ` [PATCH 12/27] drm/i915: Only report a wakeup if the waiter was truly asleep Chris Wilson
@ 2017-04-19  9:41 ` Chris Wilson
  2017-04-20 14:58   ` Tvrtko Ursulin
  2017-04-19  9:41 ` [PATCH 14/27] drm/i915: Don't mark an execlists context-switch when idle Chris Wilson
                   ` (17 subsequent siblings)
  30 siblings, 1 reply; 95+ messages in thread
From: Chris Wilson @ 2017-04-19  9:41 UTC (permalink / raw)
  To: intel-gfx; +Cc: Mika Kuoppala

add/remove: 1/1 grow/shrink: 5/4 up/down: 391/-578 (-187)
function                                     old     new   delta
execlists_submit_ports                       262     471    +209
port_assign.isra                               -     136    +136
capture                                     6344    6359     +15
reset_common_ring                            438     452     +14
execlists_submit_request                     228     238     +10
gen8_init_common_ring                        334     341      +7
intel_engine_is_idle                         106     105      -1
i915_engine_info                            2314    2290     -24
__i915_gem_set_wedged_BKL                    485     411     -74
intel_lrc_irq_handler                       1789    1604    -185
execlists_update_context                     294       -    -294

The most important change there is the improvement to
intel_lrc_irq_handler and execlists_submit_ports (a net improvement, since
execlists_update_context is now inlined).
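
The packing relies on the request pointer being at least 4-byte aligned,
which frees the bottom EXECLIST_COUNT_BITS for the submission count. A
minimal standalone sketch of that idea (illustrative pack()/unpack()
helpers, not the driver's ptr_pack_bits()/ptr_unpack_bits() macros):

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define COUNT_BITS 2
#define COUNT_MASK ((1ul << COUNT_BITS) - 1)

/* Stash a small count in the low bits of a sufficiently aligned pointer. */
static void *pack(void *ptr, unsigned int count)
{
	assert(((uintptr_t)ptr & COUNT_MASK) == 0); /* alignment frees the bits */
	assert(count <= COUNT_MASK);
	return (void *)((uintptr_t)ptr | count);
}

static void *unpack(void *packed, unsigned int *count)
{
	*count = (uintptr_t)packed & COUNT_MASK;
	return (void *)((uintptr_t)packed & ~COUNT_MASK);
}

int main(void)
{
	static int rq __attribute__((aligned(4))); /* stand-in for a request */
	unsigned int count;
	void *port, *r;

	port = pack(&rq, 1);	/* one submission in flight on this port */
	r = unpack(port, &count);

	printf("request=%p count=%u\n", r, count);
	assert(r == &rq && count == 1);
	return 0;
}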

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/i915/i915_debugfs.c        |  32 ++++---
 drivers/gpu/drm/i915/i915_gem.c            |   6 +-
 drivers/gpu/drm/i915/i915_gpu_error.c      |  13 ++-
 drivers/gpu/drm/i915/i915_guc_submission.c |  18 ++--
 drivers/gpu/drm/i915/intel_engine_cs.c     |   2 +-
 drivers/gpu/drm/i915/intel_lrc.c           | 133 ++++++++++++++++-------------
 drivers/gpu/drm/i915/intel_ringbuffer.h    |   8 +-
 7 files changed, 120 insertions(+), 92 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index 870c470177b5..0b5d7142d8d9 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -3315,6 +3315,7 @@ static int i915_engine_info(struct seq_file *m, void *unused)
 		if (i915.enable_execlists) {
 			u32 ptr, read, write;
 			struct rb_node *rb;
+			unsigned int idx;
 
 			seq_printf(m, "\tExeclist status: 0x%08x %08x\n",
 				   I915_READ(RING_EXECLIST_STATUS_LO(engine)),
@@ -3332,8 +3333,7 @@ static int i915_engine_info(struct seq_file *m, void *unused)
 			if (read > write)
 				write += GEN8_CSB_ENTRIES;
 			while (read < write) {
-				unsigned int idx = ++read % GEN8_CSB_ENTRIES;
-
+				idx = ++read % GEN8_CSB_ENTRIES;
 				seq_printf(m, "\tExeclist CSB[%d]: 0x%08x, context: %d\n",
 					   idx,
 					   I915_READ(RING_CONTEXT_STATUS_BUF_LO(engine, idx)),
@@ -3341,21 +3341,19 @@ static int i915_engine_info(struct seq_file *m, void *unused)
 			}
 
 			rcu_read_lock();
-			rq = READ_ONCE(engine->execlist_port[0].request);
-			if (rq) {
-				seq_printf(m, "\t\tELSP[0] count=%d, ",
-					   engine->execlist_port[0].count);
-				print_request(m, rq, "rq: ");
-			} else {
-				seq_printf(m, "\t\tELSP[0] idle\n");
-			}
-			rq = READ_ONCE(engine->execlist_port[1].request);
-			if (rq) {
-				seq_printf(m, "\t\tELSP[1] count=%d, ",
-					   engine->execlist_port[1].count);
-				print_request(m, rq, "rq: ");
-			} else {
-				seq_printf(m, "\t\tELSP[1] idle\n");
+			for (idx = 0; idx < ARRAY_SIZE(engine->execlist_port); idx++) {
+				unsigned int count;
+
+				rq = port_unpack(&engine->execlist_port[idx],
+						 &count);
+				if (rq) {
+					seq_printf(m, "\t\tELSP[%d] count=%d, ",
+						   idx, count);
+					print_request(m, rq, "rq: ");
+				} else {
+					seq_printf(m, "\t\tELSP[%d] idle\n",
+						   idx);
+				}
 			}
 			rcu_read_unlock();
 
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 2bc72314cdd1..f6df402a5247 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -3039,12 +3039,14 @@ static void engine_set_wedged(struct intel_engine_cs *engine)
 	 */
 
 	if (i915.enable_execlists) {
+		struct execlist_port *port = engine->execlist_port;
 		unsigned long flags;
+		unsigned int n;
 
 		spin_lock_irqsave(&engine->timeline->lock, flags);
 
-		i915_gem_request_put(engine->execlist_port[0].request);
-		i915_gem_request_put(engine->execlist_port[1].request);
+		for (n = 0; n < ARRAY_SIZE(engine->execlist_port); n++)
+			i915_gem_request_put(port_request(&port[n]));
 		memset(engine->execlist_port, 0, sizeof(engine->execlist_port));
 		engine->execlist_queue = RB_ROOT;
 		engine->execlist_first = NULL;
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index 4b247b050dcd..c5cdc6611d7f 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -1324,12 +1324,17 @@ static void engine_record_requests(struct intel_engine_cs *engine,
 static void error_record_engine_execlists(struct intel_engine_cs *engine,
 					  struct drm_i915_error_engine *ee)
 {
+	const struct execlist_port *port = engine->execlist_port;
 	unsigned int n;
 
-	for (n = 0; n < ARRAY_SIZE(engine->execlist_port); n++)
-		if (engine->execlist_port[n].request)
-			record_request(engine->execlist_port[n].request,
-				       &ee->execlist[n]);
+	for (n = 0; n < ARRAY_SIZE(engine->execlist_port); n++) {
+		struct drm_i915_gem_request *rq = port_request(&port[n]);
+
+		if (!rq)
+			break;
+
+		record_request(rq, &ee->execlist[n]);
+	}
 }
 
 static void record_context(struct drm_i915_error_context *e,
diff --git a/drivers/gpu/drm/i915/i915_guc_submission.c b/drivers/gpu/drm/i915/i915_guc_submission.c
index 1642fff9cf13..370373c97b81 100644
--- a/drivers/gpu/drm/i915/i915_guc_submission.c
+++ b/drivers/gpu/drm/i915/i915_guc_submission.c
@@ -658,7 +658,7 @@ static void nested_enable_signaling(struct drm_i915_gem_request *rq)
 static bool i915_guc_dequeue(struct intel_engine_cs *engine)
 {
 	struct execlist_port *port = engine->execlist_port;
-	struct drm_i915_gem_request *last = port[0].request;
+	struct drm_i915_gem_request *last = port[0].request_count;
 	struct rb_node *rb;
 	bool submit = false;
 
@@ -672,7 +672,7 @@ static bool i915_guc_dequeue(struct intel_engine_cs *engine)
 			if (port != engine->execlist_port)
 				break;
 
-			i915_gem_request_assign(&port->request, last);
+			i915_gem_request_assign(&port->request_count, last);
 			nested_enable_signaling(last);
 			port++;
 		}
@@ -688,7 +688,7 @@ static bool i915_guc_dequeue(struct intel_engine_cs *engine)
 		submit = true;
 	}
 	if (submit) {
-		i915_gem_request_assign(&port->request, last);
+		i915_gem_request_assign(&port->request_count, last);
 		nested_enable_signaling(last);
 		engine->execlist_first = rb;
 	}
@@ -705,17 +705,19 @@ static void i915_guc_irq_handler(unsigned long data)
 	bool submit;
 
 	do {
-		rq = port[0].request;
+		rq = port[0].request_count;
 		while (rq && i915_gem_request_completed(rq)) {
 			trace_i915_gem_request_out(rq);
 			i915_gem_request_put(rq);
-			port[0].request = port[1].request;
-			port[1].request = NULL;
-			rq = port[0].request;
+
+			port[0].request_count = port[1].request_count;
+			port[1].request_count = NULL;
+
+			rq = port[0].request_count;
 		}
 
 		submit = false;
-		if (!port[1].request)
+		if (!port[1].request_count)
 			submit = i915_guc_dequeue(engine);
 	} while (submit);
 }
diff --git a/drivers/gpu/drm/i915/intel_engine_cs.c b/drivers/gpu/drm/i915/intel_engine_cs.c
index 402769d9d840..10027d0a09b5 100644
--- a/drivers/gpu/drm/i915/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/intel_engine_cs.c
@@ -1148,7 +1148,7 @@ bool intel_engine_is_idle(struct intel_engine_cs *engine)
 		return false;
 
 	/* Both ports drained, no more ELSP submission? */
-	if (engine->execlist_port[0].request)
+	if (port_request(&engine->execlist_port[0]))
 		return false;
 
 	/* Ring stopped? */
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 7df278fe492e..69299fbab4f9 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -342,39 +342,32 @@ static u64 execlists_update_context(struct drm_i915_gem_request *rq)
 
 static void execlists_submit_ports(struct intel_engine_cs *engine)
 {
-	struct drm_i915_private *dev_priv = engine->i915;
 	struct execlist_port *port = engine->execlist_port;
 	u32 __iomem *elsp =
-		dev_priv->regs + i915_mmio_reg_offset(RING_ELSP(engine));
-	u64 desc[2];
-
-	GEM_BUG_ON(port[0].count > 1);
-	if (!port[0].count)
-		execlists_context_status_change(port[0].request,
-						INTEL_CONTEXT_SCHEDULE_IN);
-	desc[0] = execlists_update_context(port[0].request);
-	GEM_DEBUG_EXEC(port[0].context_id = upper_32_bits(desc[0]));
-	port[0].count++;
-
-	if (port[1].request) {
-		GEM_BUG_ON(port[1].count);
-		execlists_context_status_change(port[1].request,
-						INTEL_CONTEXT_SCHEDULE_IN);
-		desc[1] = execlists_update_context(port[1].request);
-		GEM_DEBUG_EXEC(port[1].context_id = upper_32_bits(desc[1]));
-		port[1].count = 1;
-	} else {
-		desc[1] = 0;
-	}
-	GEM_BUG_ON(desc[0] == desc[1]);
-
-	/* You must always write both descriptors in the order below. */
-	writel(upper_32_bits(desc[1]), elsp);
-	writel(lower_32_bits(desc[1]), elsp);
+		engine->i915->regs + i915_mmio_reg_offset(RING_ELSP(engine));
+	unsigned int n;
+
+	for (n = ARRAY_SIZE(engine->execlist_port); n--; ) {
+		struct drm_i915_gem_request *rq;
+		unsigned int count;
+		u64 desc;
+
+		rq = port_unpack(&port[n], &count);
+		if (rq) {
+			GEM_BUG_ON(count > !n);
+			if (!count++)
+				execlists_context_status_change(rq, INTEL_CONTEXT_SCHEDULE_IN);
+			port[n].request_count = port_pack(rq, count);
+			desc = execlists_update_context(rq);
+			GEM_DEBUG_EXEC(port[n].context_id = upper_32_bits(desc));
+		} else {
+			GEM_BUG_ON(!n);
+			desc = 0;
+		}
 
-	writel(upper_32_bits(desc[0]), elsp);
-	/* The context is automatically loaded after the following */
-	writel(lower_32_bits(desc[0]), elsp);
+		writel(upper_32_bits(desc), elsp);
+		writel(lower_32_bits(desc), elsp);
+	}
 }
 
 static bool ctx_single_port_submission(const struct i915_gem_context *ctx)
@@ -395,6 +388,18 @@ static bool can_merge_ctx(const struct i915_gem_context *prev,
 	return true;
 }
 
+static void port_assign(struct execlist_port *port,
+			struct drm_i915_gem_request *rq)
+{
+	GEM_BUG_ON(rq == port_request(port));
+
+	if (port->request_count)
+		i915_gem_request_put(port_request(port));
+
+	port->request_count =
+		port_pack(i915_gem_request_get(rq), port_count(port));
+}
+
 static void execlists_dequeue(struct intel_engine_cs *engine)
 {
 	struct drm_i915_gem_request *last;
@@ -402,7 +407,7 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
 	struct rb_node *rb;
 	bool submit = false;
 
-	last = port->request;
+	last = port_request(port);
 	if (last)
 		/* WaIdleLiteRestore:bdw,skl
 		 * Apply the wa NOOPs to prevent ring:HEAD == req:TAIL
@@ -412,7 +417,7 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
 		 */
 		last->tail = last->wa_tail;
 
-	GEM_BUG_ON(port[1].request);
+	GEM_BUG_ON(port[1].request_count);
 
 	/* Hardware submission is through 2 ports. Conceptually each port
 	 * has a (RING_START, RING_HEAD, RING_TAIL) tuple. RING_START is
@@ -469,7 +474,8 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
 
 			GEM_BUG_ON(last->ctx == cursor->ctx);
 
-			i915_gem_request_assign(&port->request, last);
+			if (submit)
+				port_assign(port, last);
 			port++;
 		}
 
@@ -484,7 +490,7 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
 		submit = true;
 	}
 	if (submit) {
-		i915_gem_request_assign(&port->request, last);
+		port_assign(port, last);
 		engine->execlist_first = rb;
 	}
 	spin_unlock_irq(&engine->timeline->lock);
@@ -495,14 +501,14 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
 
 static bool execlists_elsp_idle(struct intel_engine_cs *engine)
 {
-	return !engine->execlist_port[0].request;
+	return !port_count(&engine->execlist_port[0]);
 }
 
 static bool execlists_elsp_ready(const struct intel_engine_cs *engine)
 {
 	const struct execlist_port *port = engine->execlist_port;
 
-	return port[0].count + port[1].count < 2;
+	return port_count(&port[0]) + port_count(&port[1]) < 2;
 }
 
 /*
@@ -552,7 +558,9 @@ static void intel_lrc_irq_handler(unsigned long data)
 		tail = GEN8_CSB_WRITE_PTR(head);
 		head = GEN8_CSB_READ_PTR(head);
 		while (head != tail) {
+			struct drm_i915_gem_request *rq;
 			unsigned int status;
+			unsigned int count;
 
 			if (++head == GEN8_CSB_ENTRIES)
 				head = 0;
@@ -582,20 +590,24 @@ static void intel_lrc_irq_handler(unsigned long data)
 			GEM_DEBUG_BUG_ON(readl(buf + 2 * head + 1) !=
 					 port[0].context_id);
 
-			GEM_BUG_ON(port[0].count == 0);
-			if (--port[0].count == 0) {
+			rq = port_unpack(&port[0], &count);
+			GEM_BUG_ON(count == 0);
+			if (--count == 0) {
 				GEM_BUG_ON(status & GEN8_CTX_STATUS_PREEMPTED);
-				GEM_BUG_ON(!i915_gem_request_completed(port[0].request));
-				execlists_context_status_change(port[0].request,
-								INTEL_CONTEXT_SCHEDULE_OUT);
+				GEM_BUG_ON(!i915_gem_request_completed(rq));
+				execlists_context_status_change(rq, INTEL_CONTEXT_SCHEDULE_OUT);
+
+				trace_i915_gem_request_out(rq);
+				i915_gem_request_put(rq);
 
-				trace_i915_gem_request_out(port[0].request);
-				i915_gem_request_put(port[0].request);
 				port[0] = port[1];
 				memset(&port[1], 0, sizeof(port[1]));
+			} else {
+				port[0].request_count = port_pack(rq, count);
 			}
 
-			GEM_BUG_ON(port[0].count == 0 &&
+			/* After the final element, the hw should be idle */
+			GEM_BUG_ON(port_count(&port[0]) == 0 &&
 				   !(status & GEN8_CTX_STATUS_ACTIVE_IDLE));
 		}
 
@@ -1148,11 +1160,6 @@ static int intel_init_workaround_bb(struct intel_engine_cs *engine)
 	return ret;
 }
 
-static u32 port_seqno(struct execlist_port *port)
-{
-	return port->request ? port->request->global_seqno : 0;
-}
-
 static int gen8_init_common_ring(struct intel_engine_cs *engine)
 {
 	struct drm_i915_private *dev_priv = engine->i915;
@@ -1177,12 +1184,22 @@ static int gen8_init_common_ring(struct intel_engine_cs *engine)
 	/* After a GPU reset, we may have requests to replay */
 	clear_bit(ENGINE_IRQ_EXECLIST, &engine->irq_posted);
 	if (!i915.enable_guc_submission && !execlists_elsp_idle(engine)) {
-		DRM_DEBUG_DRIVER("Restarting %s from requests [0x%x, 0x%x]\n",
-				 engine->name,
-				 port_seqno(&engine->execlist_port[0]),
-				 port_seqno(&engine->execlist_port[1]));
-		engine->execlist_port[0].count = 0;
-		engine->execlist_port[1].count = 0;
+		struct execlist_port *port = engine->execlist_port;
+		unsigned int n;
+
+		for (n = 0; n < ARRAY_SIZE(engine->execlist_port); n++) {
+			if (!port[n].request_count)
+				break;
+
+			DRM_DEBUG_DRIVER("Restarting %s from 0x%x [%d]\n",
+					 engine->name,
+					 port_request(&port[n])->global_seqno,
+					 n);
+
+			/* Discard the current inflight count */
+			port[n].request_count = port_request(&port[n]);
+		}
+
 		execlists_submit_ports(engine);
 	}
 
@@ -1261,13 +1278,13 @@ static void reset_common_ring(struct intel_engine_cs *engine,
 	intel_ring_update_space(request->ring);
 
 	/* Catch up with any missed context-switch interrupts */
-	if (request->ctx != port[0].request->ctx) {
-		i915_gem_request_put(port[0].request);
+	if (request->ctx != port_request(&port[0])->ctx) {
+		i915_gem_request_put(port_request(&port[0]));
 		port[0] = port[1];
 		memset(&port[1], 0, sizeof(port[1]));
 	}
 
-	GEM_BUG_ON(request->ctx != port[0].request->ctx);
+	GEM_BUG_ON(request->ctx != port_request(&port[0])->ctx);
 
 	/* Reset WaIdleLiteRestore:bdw,skl as well */
 	request->tail =
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index d25b88467e5e..39b733e5cfd3 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -377,8 +377,12 @@ struct intel_engine_cs {
 	/* Execlists */
 	struct tasklet_struct irq_tasklet;
 	struct execlist_port {
-		struct drm_i915_gem_request *request;
-		unsigned int count;
+		struct drm_i915_gem_request *request_count;
+#define EXECLIST_COUNT_BITS 2
+#define port_request(p) ptr_mask_bits((p)->request_count, EXECLIST_COUNT_BITS)
+#define port_count(p) ptr_unmask_bits((p)->request_count, EXECLIST_COUNT_BITS)
+#define port_pack(rq, count) ptr_pack_bits(rq, count, EXECLIST_COUNT_BITS)
+#define port_unpack(p, count) ptr_unpack_bits((p)->request_count, count, EXECLIST_COUNT_BITS)
 		GEM_DEBUG_DECL(u32 context_id);
 	} execlist_port[2];
 	struct rb_root execlist_queue;
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 95+ messages in thread

* [PATCH 14/27] drm/i915: Don't mark an execlists context-switch when idle
  2017-04-19  9:41 Confluence of eb + timeline improvements Chris Wilson
                   ` (12 preceding siblings ...)
  2017-04-19  9:41 ` [PATCH 13/27] drm/i915/execlists: Pack the count into the low bits of the port.request Chris Wilson
@ 2017-04-19  9:41 ` Chris Wilson
  2017-04-20  8:53   ` Joonas Lahtinen
  2017-04-19  9:41 ` [PATCH 15/27] drm/i915: Split execlist priority queue into rbtree + linked list Chris Wilson
                   ` (16 subsequent siblings)
  30 siblings, 1 reply; 95+ messages in thread
From: Chris Wilson @ 2017-04-19  9:41 UTC (permalink / raw)
  To: intel-gfx

If we *know* that the engine is idle, i.e. we have no more contexts in
flight, we can skip any spurious CSB idle interrupts. These spurious
interrupts seem to arrive long after we assert that the engines are
completely idle, triggering later assertions:

[  178.896646] intel_engine_is_idle(bcs): interrupt not handled, irq_posted=2
[  178.896655] ------------[ cut here ]------------
[  178.896658] kernel BUG at drivers/gpu/drm/i915/intel_engine_cs.c:226!
[  178.896661] invalid opcode: 0000 [#1] SMP
[  178.896663] Modules linked in: i915(E) x86_pkg_temp_thermal(E) crct10dif_pclmul(E) crc32_pclmul(E) crc32c_intel(E) ghash_clmulni_intel(E) nls_ascii(E) nls_cp437(E) vfat(E) fat(E) intel_gtt(E) i2c_algo_bit(E) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) fb_sys_fops(E) aesni_intel(E) prime_numbers(E) evdev(E) aes_x86_64(E) drm(E) crypto_simd(E) cryptd(E) glue_helper(E) mei_me(E) mei(E) lpc_ich(E) efivars(E) mfd_core(E) battery(E) video(E) acpi_pad(E) button(E) tpm_tis(E) tpm_tis_core(E) tpm(E) autofs4(E) i2c_i801(E) fan(E) thermal(E) i2c_designware_platform(E) i2c_designware_core(E)
[  178.896694] CPU: 1 PID: 522 Comm: gem_exec_whispe Tainted: G            E   4.11.0-rc5+ #14
[  178.896702] task: ffff88040aba8d40 task.stack: ffffc900003f0000
[  178.896722] RIP: 0010:intel_engine_init_global_seqno+0x1db/0x1f0 [i915]
[  178.896725] RSP: 0018:ffffc900003f3ab0 EFLAGS: 00010246
[  178.896728] RAX: 0000000000000000 RBX: ffff88040af54000 RCX: 0000000000000000
[  178.896731] RDX: ffff88041ec933e0 RSI: ffff88041ec8cc48 RDI: ffff88041ec8cc48
[  178.896734] RBP: ffffc900003f3ac8 R08: 0000000000000000 R09: 000000000000047d
[  178.896736] R10: 0000000000000040 R11: ffff88040b344f80 R12: 0000000000000000
[  178.896739] R13: ffff88040bce0000 R14: ffff88040bce52d8 R15: ffff88040bce0000
[  178.896742] FS:  00007f2cccc2d8c0(0000) GS:ffff88041ec80000(0000) knlGS:0000000000000000
[  178.896746] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  178.896749] CR2: 00007f41ddd8f000 CR3: 000000040bb03000 CR4: 00000000001406e0
[  178.896752] Call Trace:
[  178.896768]  reset_all_global_seqno.part.33+0x4e/0xd0 [i915]
[  178.896782]  i915_gem_request_alloc+0x304/0x330 [i915]
[  178.896795]  i915_gem_do_execbuffer+0x8a1/0x17d0 [i915]
[  178.896799]  ? remove_wait_queue+0x48/0x50
[  178.896812]  ? i915_wait_request+0x300/0x590 [i915]
[  178.896816]  ? wake_up_q+0x70/0x70
[  178.896819]  ? refcount_dec_and_test+0x11/0x20
[  178.896823]  ? reservation_object_add_excl_fence+0xa5/0x100
[  178.896835]  i915_gem_execbuffer2+0xab/0x1f0 [i915]
[  178.896844]  drm_ioctl+0x1e6/0x460 [drm]
[  178.896858]  ? i915_gem_execbuffer+0x260/0x260 [i915]
[  178.896862]  ? dput+0xcf/0x250
[  178.896866]  ? full_proxy_release+0x66/0x80
[  178.896869]  ? mntput+0x1f/0x30
[  178.896872]  do_vfs_ioctl+0x8f/0x5b0
[  178.896875]  ? ____fput+0x9/0x10
[  178.896878]  ? task_work_run+0x80/0xa0
[  178.896881]  SyS_ioctl+0x3c/0x70
[  178.896885]  entry_SYSCALL_64_fastpath+0x17/0x98
[  178.896888] RIP: 0033:0x7f2ccb455ca7
[  178.896890] RSP: 002b:00007ffcabec72d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  178.896894] RAX: ffffffffffffffda RBX: 000055f897a44b90 RCX: 00007f2ccb455ca7
[  178.896897] RDX: 00007ffcabec74a0 RSI: 0000000040406469 RDI: 0000000000000003
[  178.896900] RBP: 00007f2ccb70a440 R08: 00007f2ccb70d0a4 R09: 0000000000000000
[  178.896903] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  178.896905] R13: 000055f89782d71a R14: 00007ffcabecf838 R15: 0000000000000003
[  178.896908] Code: 00 31 d2 4c 89 ef 8d 70 48 41 ff 95 f8 06 00 00 e9 68 fe ff ff be 0f 00 00 00 48 c7 c7 48 dc 37 a0 e8 fa 33 d6 e0 e9 0b ff ff ff <0f> 0b 0f 0b 0f 0b 0f 0b 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00

On the other hand, by ignoring the interrupt do we risk running out of
space in the CSB ring? Testing for a few hours suggests not, i.e. that we
only seem to get the odd delayed CSB idle notification.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/i915/i915_irq.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index fd97fe00cd0d..fb2ac202dec5 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -1359,8 +1359,10 @@ gen8_cs_irq_handler(struct intel_engine_cs *engine, u32 iir, int test_shift)
 	bool tasklet = false;
 
 	if (iir & (GT_CONTEXT_SWITCH_INTERRUPT << test_shift)) {
-		set_bit(ENGINE_IRQ_EXECLIST, &engine->irq_posted);
-		tasklet = true;
+		if (port_count(&engine->execlist_port[0])) {
+			set_bit(ENGINE_IRQ_EXECLIST, &engine->irq_posted);
+			tasklet = true;
+		}
 	}
 
 	if (iir & (GT_RENDER_USER_INTERRUPT << test_shift)) {
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 95+ messages in thread

* [PATCH 15/27] drm/i915: Split execlist priority queue into rbtree + linked list
  2017-04-19  9:41 Confluence of eb + timeline improvements Chris Wilson
                   ` (13 preceding siblings ...)
  2017-04-19  9:41 ` [PATCH 14/27] drm/i915: Don't mark an execlists context-switch when idle Chris Wilson
@ 2017-04-19  9:41 ` Chris Wilson
  2017-04-24 10:28   ` Tvrtko Ursulin
  2017-04-19  9:41 ` [PATCH 16/27] drm/i915: Reinstate reservation_object zapping for batch_pool objects Chris Wilson
                   ` (15 subsequent siblings)
  30 siblings, 1 reply; 95+ messages in thread
From: Chris Wilson @ 2017-04-19  9:41 UTC (permalink / raw)
  To: intel-gfx

All the requests at the same priority are executed in FIFO order. They
do not need to be stored in the rbtree themselves, as they are a simple
list within a level. If we move the requests at one priority into a list,
we can then reduce the rbtree to the set of priorities. This should keep
the height of the rbtree small, as the number of active priorities cannot
exceed the number of active requests and should typically be only a few.

Currently, we have ~2k possible different priority levels, and that may
increase to allow even more fine-grained selection. Allocating those in
advance seems a waste (and may be impossible), so we opt to allocate a
level upon first use, and free it once its requests are depleted. To avoid
the possibility of an allocation failure causing us to lose a request,
we preallocate the default priority (0) and bump any request to that
priority if we fail to allocate the appropriate plist for it. Having a
request (that is ready to run, so not leading to corruption) execute
out-of-order is better than leaking the request (and its dependency
tree) entirely.

There should be a benefit to reducing execlists_dequeue() to principally
using a simple list (and reducing the frequency of both rbtree iteration
and balancing on erase) but for typical workloads, request coalescing
should be small enough that we don't notice any change. The main gain is
from improving PI calls to schedule, and the explicit list within a
level should make request unwinding simpler (we just need to insert at
the head of the list rather than the tail and not have to make the
rbtree search more complicated).

v2: Avoid use-after-free when deleting a depleted priolist
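
A minimal userspace sketch of the resulting shape (a sorted singly-linked
list of levels stands in for the rbtree, and all names here are
illustrative rather than the driver's):

#include <stdio.h>
#include <stdlib.h>

/* Requests at one priority stay in submission (FIFO) order within a level;
 * the levels themselves are kept sorted with the most positive priority
 * first.
 */
struct request {
	int id;
	struct request *next;
};

struct priolist {
	int priority;
	struct request *head, *tail;	/* FIFO of requests at this level */
	struct priolist *next;
};

static struct priolist default_priolist;	/* preallocated level for prio 0 */

static struct priolist *find_or_create(struct priolist **levels, int prio)
{
	struct priolist **p = levels, *pl;

	while (*p && (*p)->priority > prio)
		p = &(*p)->next;
	if (*p && (*p)->priority == prio)
		return *p;			/* existing level */

	pl = prio ? calloc(1, sizeof(*pl)) : &default_priolist;
	if (!pl)				/* allocation failure: bump to prio 0 */
		return find_or_create(levels, 0);	/* recurses just once */

	pl->priority = prio;
	pl->next = *p;
	*p = pl;
	return pl;
}

static void submit(struct priolist **levels, struct request *rq, int prio)
{
	struct priolist *pl = find_or_create(levels, prio);

	rq->next = NULL;
	if (pl->tail)
		pl->tail->next = rq;
	else
		pl->head = rq;
	pl->tail = rq;
}

int main(void)
{
	struct priolist *levels = NULL;
	struct request a = { 1 }, b = { 2 }, c = { 3 };

	submit(&levels, &a, 0);
	submit(&levels, &b, 10);
	submit(&levels, &c, 0);

	/* Dequeue order: request 2 (prio 10), then 1 and 3 in FIFO order. */
	for (struct priolist *pl = levels; pl; pl = pl->next)
		for (struct request *rq = pl->head; rq; rq = rq->next)
			printf("prio %d: request %d\n", pl->priority, rq->id);

	return 0;
}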

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Michał Winiarski <michal.winiarski@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_debugfs.c        | 12 +++--
 drivers/gpu/drm/i915/i915_gem_request.c    |  4 +-
 drivers/gpu/drm/i915/i915_gem_request.h    |  2 +-
 drivers/gpu/drm/i915/i915_guc_submission.c | 20 ++++++--
 drivers/gpu/drm/i915/intel_lrc.c           | 75 ++++++++++++++++++++++--------
 drivers/gpu/drm/i915/intel_ringbuffer.h    |  7 +++
 6 files changed, 90 insertions(+), 30 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index 0b5d7142d8d9..a8c7788d986e 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -3314,7 +3314,6 @@ static int i915_engine_info(struct seq_file *m, void *unused)
 
 		if (i915.enable_execlists) {
 			u32 ptr, read, write;
-			struct rb_node *rb;
 			unsigned int idx;
 
 			seq_printf(m, "\tExeclist status: 0x%08x %08x\n",
@@ -3358,9 +3357,14 @@ static int i915_engine_info(struct seq_file *m, void *unused)
 			rcu_read_unlock();
 
 			spin_lock_irq(&engine->timeline->lock);
-			for (rb = engine->execlist_first; rb; rb = rb_next(rb)) {
-				rq = rb_entry(rb, typeof(*rq), priotree.node);
-				print_request(m, rq, "\t\tQ ");
+			for (rb = engine->execlist_first; rb; rb = rb_next(rb)) {
+				struct execlist_priolist *plist =
+					rb_entry(rb, typeof(*plist), node);
+
+				list_for_each_entry(rq,
+						    &plist->requests,
+						    priotree.link)
+					print_request(m, rq, "\t\tQ ");
 			}
 			spin_unlock_irq(&engine->timeline->lock);
 		} else if (INTEL_GEN(dev_priv) > 6) {
diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
index 83b1584b3deb..59c0e0b00028 100644
--- a/drivers/gpu/drm/i915/i915_gem_request.c
+++ b/drivers/gpu/drm/i915/i915_gem_request.c
@@ -159,7 +159,7 @@ i915_priotree_fini(struct drm_i915_private *i915, struct i915_priotree *pt)
 {
 	struct i915_dependency *dep, *next;
 
-	GEM_BUG_ON(!RB_EMPTY_NODE(&pt->node));
+	GEM_BUG_ON(!list_empty(&pt->link));
 
 	/* Everyone we depended upon (the fences we wait to be signaled)
 	 * should retire before us and remove themselves from our list.
@@ -185,7 +185,7 @@ i915_priotree_init(struct i915_priotree *pt)
 {
 	INIT_LIST_HEAD(&pt->signalers_list);
 	INIT_LIST_HEAD(&pt->waiters_list);
-	RB_CLEAR_NODE(&pt->node);
+	INIT_LIST_HEAD(&pt->link);
 	pt->priority = INT_MIN;
 }
 
diff --git a/drivers/gpu/drm/i915/i915_gem_request.h b/drivers/gpu/drm/i915/i915_gem_request.h
index 4ccab5affd3c..0a1d717b9fa7 100644
--- a/drivers/gpu/drm/i915/i915_gem_request.h
+++ b/drivers/gpu/drm/i915/i915_gem_request.h
@@ -67,7 +67,7 @@ struct i915_dependency {
 struct i915_priotree {
 	struct list_head signalers_list; /* those before us, we depend upon */
 	struct list_head waiters_list; /* those after us, they depend upon us */
-	struct rb_node node;
+	struct list_head link;
 	int priority;
 #define I915_PRIORITY_MAX 1024
 #define I915_PRIORITY_MIN (-I915_PRIORITY_MAX)
diff --git a/drivers/gpu/drm/i915/i915_guc_submission.c b/drivers/gpu/drm/i915/i915_guc_submission.c
index 370373c97b81..69b39729003b 100644
--- a/drivers/gpu/drm/i915/i915_guc_submission.c
+++ b/drivers/gpu/drm/i915/i915_guc_submission.c
@@ -664,9 +664,15 @@ static bool i915_guc_dequeue(struct intel_engine_cs *engine)
 
 	spin_lock_irq(&engine->timeline->lock);
 	rb = engine->execlist_first;
+	GEM_BUG_ON(rb_first(&engine->execlist_queue) != rb);
 	while (rb) {
+		struct execlist_priolist *plist =
+			rb_entry(rb, typeof(*plist), node);
 		struct drm_i915_gem_request *rq =
-			rb_entry(rb, typeof(*rq), priotree.node);
+			list_first_entry(&plist->requests,
+					 typeof(*rq),
+					 priotree.link);
+		GEM_BUG_ON(list_empty(&plist->requests));
 
 		if (last && rq->ctx != last->ctx) {
 			if (port != engine->execlist_port)
@@ -677,9 +683,15 @@ static bool i915_guc_dequeue(struct intel_engine_cs *engine)
 			port++;
 		}
 
-		rb = rb_next(rb);
-		rb_erase(&rq->priotree.node, &engine->execlist_queue);
-		RB_CLEAR_NODE(&rq->priotree.node);
+		if (rq->priotree.link.next == rq->priotree.link.prev) {
+			rb = rb_next(rb);
+			rb_erase(&plist->node, &engine->execlist_queue);
+			if (plist->priority)
+				kfree(plist);
+		} else {
+			__list_del_entry(&rq->priotree.link);
+		}
+		INIT_LIST_HEAD(&rq->priotree.link);
 		rq->priotree.priority = INT_MAX;
 
 		i915_guc_submit(rq);
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 69299fbab4f9..f96d7980ac16 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -442,9 +442,15 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
 
 	spin_lock_irq(&engine->timeline->lock);
 	rb = engine->execlist_first;
+	GEM_BUG_ON(rb_first(&engine->execlist_queue) != rb);
 	while (rb) {
+		struct execlist_priolist *plist =
+			rb_entry(rb, typeof(*plist), node);
 		struct drm_i915_gem_request *cursor =
-			rb_entry(rb, typeof(*cursor), priotree.node);
+			list_first_entry(&plist->requests,
+					 typeof(*cursor),
+					 priotree.link);
+		GEM_BUG_ON(list_empty(&plist->requests));
 
 		/* Can we combine this request with the current port? It has to
 		 * be the same context/ringbuffer and not have any exceptions
@@ -479,9 +485,15 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
 			port++;
 		}
 
-		rb = rb_next(rb);
-		rb_erase(&cursor->priotree.node, &engine->execlist_queue);
-		RB_CLEAR_NODE(&cursor->priotree.node);
+		if (cursor->priotree.link.next == cursor->priotree.link.prev) {
+			rb = rb_next(rb);
+			rb_erase(&plist->node, &engine->execlist_queue);
+			if (plist->priority)
+				kfree(plist);
+		} else {
+			__list_del_entry(&cursor->priotree.link);
+		}
+		INIT_LIST_HEAD(&cursor->priotree.link);
 		cursor->priotree.priority = INT_MAX;
 
 		__i915_gem_request_submit(cursor);
@@ -621,28 +633,53 @@ static void intel_lrc_irq_handler(unsigned long data)
 	intel_uncore_forcewake_put(dev_priv, engine->fw_domains);
 }
 
-static bool insert_request(struct i915_priotree *pt, struct rb_root *root)
+static bool
+insert_request(struct intel_engine_cs *engine,
+	       struct i915_priotree *pt,
+	       int prio)
 {
+	struct execlist_priolist *plist;
 	struct rb_node **p, *rb;
 	bool first = true;
 
+find_plist:
 	/* most positive priority is scheduled first, equal priorities fifo */
 	rb = NULL;
-	p = &root->rb_node;
+	p = &engine->execlist_queue.rb_node;
 	while (*p) {
-		struct i915_priotree *pos;
-
 		rb = *p;
-		pos = rb_entry(rb, typeof(*pos), node);
-		if (pt->priority > pos->priority) {
+		plist = rb_entry(rb, typeof(*plist), node);
+		if (prio > plist->priority) {
 			p = &rb->rb_left;
-		} else {
+		} else if (prio < plist->priority) {
 			p = &rb->rb_right;
 			first = false;
+		} else {
+			list_add_tail(&pt->link, &plist->requests);
+			return false;
 		}
 	}
-	rb_link_node(&pt->node, rb, p);
-	rb_insert_color(&pt->node, root);
+
+	if (!prio) {
+		plist = &engine->default_priolist;
+	} else {
+		plist = kmalloc(sizeof(*plist), GFP_ATOMIC);
+		/* Convert an allocation failure to a priority bump */
+		if (unlikely(!plist)) {
+			prio = 0; /* recurses just once */
+			goto find_plist;
+		}
+	}
+
+	plist->priority = prio;
+	rb_link_node(&plist->node, rb, p);
+	rb_insert_color(&plist->node, &engine->execlist_queue);
+
+	INIT_LIST_HEAD(&plist->requests);
+	list_add_tail(&pt->link, &plist->requests);
+
+	if (first)
+		engine->execlist_first = &plist->node;
 
 	return first;
 }
@@ -655,8 +692,9 @@ static void execlists_submit_request(struct drm_i915_gem_request *request)
 	/* Will be called from irq-context when using foreign fences. */
 	spin_lock_irqsave(&engine->timeline->lock, flags);
 
-	if (insert_request(&request->priotree, &engine->execlist_queue)) {
-		engine->execlist_first = &request->priotree.node;
+	if (insert_request(engine,
+			   &request->priotree,
+			   request->priotree.priority)) {
 		if (execlists_elsp_ready(engine))
 			tasklet_hi_schedule(&engine->irq_tasklet);
 	}
@@ -745,10 +783,9 @@ static void execlists_schedule(struct drm_i915_gem_request *request, int prio)
 			continue;
 
 		pt->priority = prio;
-		if (!RB_EMPTY_NODE(&pt->node)) {
-			rb_erase(&pt->node, &engine->execlist_queue);
-			if (insert_request(pt, &engine->execlist_queue))
-				engine->execlist_first = &pt->node;
+		if (!list_empty(&pt->link)) {
+			__list_del_entry(&pt->link);
+			insert_request(engine, pt, prio);
 		}
 	}
 
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index 39b733e5cfd3..1ff41bd9e89a 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -187,6 +187,12 @@ enum intel_engine_id {
 	VECS
 };
 
+struct execlist_priolist {
+	struct rb_node node;
+	struct list_head requests;
+	int priority;
+};
+
 #define INTEL_ENGINE_CS_MAX_NAME 8
 
 struct intel_engine_cs {
@@ -376,6 +382,7 @@ struct intel_engine_cs {
 
 	/* Execlists */
 	struct tasklet_struct irq_tasklet;
+	struct execlist_priolist default_priolist;
 	struct execlist_port {
 		struct drm_i915_gem_request *request_count;
 #define EXECLIST_COUNT_BITS 2
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 95+ messages in thread

* [PATCH 16/27] drm/i915: Reinstate reservation_object zapping for batch_pool objects
  2017-04-19  9:41 Confluence of eb + timeline improvements Chris Wilson
                   ` (14 preceding siblings ...)
  2017-04-19  9:41 ` [PATCH 15/27] drm/i915: Split execlist priority queue into rbtree + linked list Chris Wilson
@ 2017-04-19  9:41 ` Chris Wilson
  2017-04-28 12:20   ` Tvrtko Ursulin
  2017-04-19  9:41 ` [PATCH 17/27] drm/i915: Amalgamate execbuffer parameter structures Chris Wilson
                   ` (14 subsequent siblings)
  30 siblings, 1 reply; 95+ messages in thread
From: Chris Wilson @ 2017-04-19  9:41 UTC (permalink / raw)
  To: intel-gfx; +Cc: Mika Kuoppala, Matthew Auld

I removed the zapping of the reservation_object->fence array of shared
fences prematurely. We don't yet have the code to zap that array when
retiring the object, and so currently it remains possible to continually
grow the shared array, trapping requests, when reusing the batch_pool
object across many timelines.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
---
 drivers/gpu/drm/i915/i915_gem_batch_pool.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_batch_pool.c b/drivers/gpu/drm/i915/i915_gem_batch_pool.c
index 41aa598c4f3b..414e46e2f072 100644
--- a/drivers/gpu/drm/i915/i915_gem_batch_pool.c
+++ b/drivers/gpu/drm/i915/i915_gem_batch_pool.c
@@ -114,12 +114,26 @@ i915_gem_batch_pool_get(struct i915_gem_batch_pool *pool,
 	list_for_each_entry(obj, list, batch_pool_link) {
 		/* The batches are strictly LRU ordered */
 		if (i915_gem_object_is_active(obj)) {
-			if (!reservation_object_test_signaled_rcu(obj->resv,
-								  true))
+			struct reservation_object *resv = obj->resv;
+
+			if (!reservation_object_test_signaled_rcu(resv, true))
 				break;
 
 			i915_gem_retire_requests(pool->engine->i915);
 			GEM_BUG_ON(i915_gem_object_is_active(obj));
+
+			/* The object is now idle, clear the array of shared
+			 * fences before we add a new request. Although, we
+			 * remain on the same engine, we may be on a different
+			 * timeline and so may continually grow the array,
+			 * trapping a reference to all the old fences, rather
+			 * than replace the existing fence.
+			 */
+			if (rcu_access_pointer(resv->fence)) {
+				reservation_object_lock(resv, NULL);
+				reservation_object_add_excl_fence(resv, NULL);
+				reservation_object_unlock(resv);
+			}
 		}
 
 		GEM_BUG_ON(!reservation_object_test_signaled_rcu(obj->resv,
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 95+ messages in thread

* [PATCH 17/27] drm/i915: Amalgamate execbuffer parameter structures
  2017-04-19  9:41 Confluence of eb + timeline improvements Chris Wilson
                   ` (15 preceding siblings ...)
  2017-04-19  9:41 ` [PATCH 16/27] drm/i915: Reinstate reservation_object zapping for batch_pool objects Chris Wilson
@ 2017-04-19  9:41 ` Chris Wilson
  2017-04-19  9:41 ` [PATCH 18/27] drm/i915: Use vma->exec_entry as our double-entry placeholder Chris Wilson
                   ` (13 subsequent siblings)
  30 siblings, 0 replies; 95+ messages in thread
From: Chris Wilson @ 2017-04-19  9:41 UTC (permalink / raw)
  To: intel-gfx

Combine the two slightly overlapping parameter structures we pass around
the execbuffer routines into one.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_gem_execbuffer.c | 550 ++++++++++++-----------------
 1 file changed, 233 insertions(+), 317 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index ddc011ef5480..32c9750f7249 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -50,70 +50,78 @@
 
 #define BATCH_OFFSET_BIAS (256*1024)
 
-struct i915_execbuffer_params {
-	struct drm_device               *dev;
-	struct drm_file                 *file;
-	struct i915_vma			*batch;
-	u32				dispatch_flags;
-	u32				args_batch_start_offset;
-	struct intel_engine_cs          *engine;
-	struct i915_gem_context         *ctx;
-	struct drm_i915_gem_request     *request;
-};
+#define __I915_EXEC_ILLEGAL_FLAGS \
+	(__I915_EXEC_UNKNOWN_FLAGS | I915_EXEC_CONSTANTS_MASK)
 
-struct eb_vmas {
+struct i915_execbuffer {
 	struct drm_i915_private *i915;
+	struct drm_file *file;
+	struct drm_i915_gem_execbuffer2 *args;
+	struct drm_i915_gem_exec_object2 *exec;
+	struct intel_engine_cs *engine;
+	struct i915_gem_context *ctx;
+	struct i915_address_space *vm;
+	struct i915_vma *batch;
+	struct drm_i915_gem_request *request;
+	u32 batch_start_offset;
+	u32 batch_len;
+	unsigned int dispatch_flags;
+	struct drm_i915_gem_exec_object2 shadow_exec_entry;
+	bool need_relocs;
 	struct list_head vmas;
+	struct reloc_cache {
+		struct drm_mm_node node;
+		unsigned long vaddr;
+		unsigned int page;
+		bool use_64bit_reloc : 1;
+	} reloc_cache;
 	int and;
 	union {
-		struct i915_vma *lut[0];
-		struct hlist_head buckets[0];
+		struct i915_vma **lut;
+		struct hlist_head *buckets;
 	};
 };
 
-static struct eb_vmas *
-eb_create(struct drm_i915_private *i915,
-	  struct drm_i915_gem_execbuffer2 *args)
+static int
+eb_create(struct i915_execbuffer *eb)
 {
-	struct eb_vmas *eb = NULL;
-
-	if (args->flags & I915_EXEC_HANDLE_LUT) {
-		unsigned size = args->buffer_count;
+	eb->lut = NULL;
+	if (eb->args->flags & I915_EXEC_HANDLE_LUT) {
+		unsigned int size = eb->args->buffer_count;
 		size *= sizeof(struct i915_vma *);
-		size += sizeof(struct eb_vmas);
-		eb = kmalloc(size, GFP_TEMPORARY | __GFP_NOWARN | __GFP_NORETRY);
+		eb->lut = kmalloc(size,
+				  GFP_TEMPORARY | __GFP_NOWARN | __GFP_NORETRY);
 	}
 
-	if (eb == NULL) {
-		unsigned size = args->buffer_count;
-		unsigned count = PAGE_SIZE / sizeof(struct hlist_head) / 2;
+	if (!eb->lut) {
+		unsigned int size = eb->args->buffer_count;
+		unsigned int count = PAGE_SIZE / sizeof(struct hlist_head) / 2;
 		BUILD_BUG_ON_NOT_POWER_OF_2(PAGE_SIZE / sizeof(struct hlist_head));
 		while (count > 2*size)
 			count >>= 1;
-		eb = kzalloc(count*sizeof(struct hlist_head) +
-			     sizeof(struct eb_vmas),
-			     GFP_TEMPORARY);
-		if (eb == NULL)
-			return eb;
+		eb->lut = kzalloc(count*sizeof(struct hlist_head),
+				  GFP_TEMPORARY);
+		if (!eb->lut)
+			return -ENOMEM;
 
 		eb->and = count - 1;
-	} else
-		eb->and = -args->buffer_count;
+	} else {
+		eb->and = -eb->args->buffer_count;
+	}
 
-	eb->i915 = i915;
 	INIT_LIST_HEAD(&eb->vmas);
-	return eb;
+	return 0;
 }
 
 static void
-eb_reset(struct eb_vmas *eb)
+eb_reset(struct i915_execbuffer *eb)
 {
 	if (eb->and >= 0)
 		memset(eb->buckets, 0, (eb->and+1)*sizeof(struct hlist_head));
 }
 
 static struct i915_vma *
-eb_get_batch(struct eb_vmas *eb)
+eb_get_batch(struct i915_execbuffer *eb)
 {
 	struct i915_vma *vma = list_entry(eb->vmas.prev, typeof(*vma), exec_list);
 
@@ -133,34 +141,30 @@ eb_get_batch(struct eb_vmas *eb)
 }
 
 static int
-eb_lookup_vmas(struct eb_vmas *eb,
-	       struct drm_i915_gem_exec_object2 *exec,
-	       const struct drm_i915_gem_execbuffer2 *args,
-	       struct i915_address_space *vm,
-	       struct drm_file *file)
+eb_lookup_vmas(struct i915_execbuffer *eb)
 {
 	struct drm_i915_gem_object *obj;
 	struct list_head objects;
 	int i, ret;
 
 	INIT_LIST_HEAD(&objects);
-	spin_lock(&file->table_lock);
+	spin_lock(&eb->file->table_lock);
 	/* Grab a reference to the object and release the lock so we can lookup
 	 * or create the VMA without using GFP_ATOMIC */
-	for (i = 0; i < args->buffer_count; i++) {
-		obj = to_intel_bo(idr_find(&file->object_idr, exec[i].handle));
+	for (i = 0; i < eb->args->buffer_count; i++) {
+		obj = to_intel_bo(idr_find(&eb->file->object_idr, eb->exec[i].handle));
 		if (obj == NULL) {
-			spin_unlock(&file->table_lock);
+			spin_unlock(&eb->file->table_lock);
 			DRM_DEBUG("Invalid object handle %d at index %d\n",
-				   exec[i].handle, i);
+				   eb->exec[i].handle, i);
 			ret = -ENOENT;
 			goto err;
 		}
 
 		if (!list_empty(&obj->obj_exec_link)) {
-			spin_unlock(&file->table_lock);
+			spin_unlock(&eb->file->table_lock);
 			DRM_DEBUG("Object %p [handle %d, index %d] appears more than once in object list\n",
-				   obj, exec[i].handle, i);
+				   obj, eb->exec[i].handle, i);
 			ret = -EINVAL;
 			goto err;
 		}
@@ -168,7 +172,7 @@ eb_lookup_vmas(struct eb_vmas *eb,
 		i915_gem_object_get(obj);
 		list_add_tail(&obj->obj_exec_link, &objects);
 	}
-	spin_unlock(&file->table_lock);
+	spin_unlock(&eb->file->table_lock);
 
 	i = 0;
 	while (!list_empty(&objects)) {
@@ -186,7 +190,7 @@ eb_lookup_vmas(struct eb_vmas *eb,
 		 * from the (obj, vm) we don't run the risk of creating
 		 * duplicated vmas for the same vm.
 		 */
-		vma = i915_vma_instance(obj, vm, NULL);
+		vma = i915_vma_instance(obj, eb->vm, NULL);
 		if (unlikely(IS_ERR(vma))) {
 			DRM_DEBUG("Failed to lookup VMA\n");
 			ret = PTR_ERR(vma);
@@ -197,11 +201,13 @@ eb_lookup_vmas(struct eb_vmas *eb,
 		list_add_tail(&vma->exec_list, &eb->vmas);
 		list_del_init(&obj->obj_exec_link);
 
-		vma->exec_entry = &exec[i];
+		vma->exec_entry = &eb->exec[i];
 		if (eb->and < 0) {
 			eb->lut[i] = vma;
 		} else {
-			uint32_t handle = args->flags & I915_EXEC_HANDLE_LUT ? i : exec[i].handle;
+			u32 handle =
+				eb->args->flags & I915_EXEC_HANDLE_LUT ?
+				i : eb->exec[i].handle;
 			vma->exec_handle = handle;
 			hlist_add_head(&vma->exec_node,
 				       &eb->buckets[handle & eb->and]);
@@ -228,7 +234,7 @@ eb_lookup_vmas(struct eb_vmas *eb,
 	return ret;
 }
 
-static struct i915_vma *eb_get_vma(struct eb_vmas *eb, unsigned long handle)
+static struct i915_vma *eb_get_vma(struct i915_execbuffer *eb, unsigned long handle)
 {
 	if (eb->and < 0) {
 		if (handle >= -eb->and)
@@ -248,7 +254,7 @@ static struct i915_vma *eb_get_vma(struct eb_vmas *eb, unsigned long handle)
 }
 
 static void
-i915_gem_execbuffer_unreserve_vma(struct i915_vma *vma)
+eb_unreserve_vma(struct i915_vma *vma)
 {
 	struct drm_i915_gem_exec_object2 *entry;
 
@@ -266,8 +272,10 @@ i915_gem_execbuffer_unreserve_vma(struct i915_vma *vma)
 	entry->flags &= ~(__EXEC_OBJECT_HAS_FENCE | __EXEC_OBJECT_HAS_PIN);
 }
 
-static void eb_destroy(struct eb_vmas *eb)
+static void eb_destroy(struct i915_execbuffer *eb)
 {
+	i915_gem_context_put(eb->ctx);
+
 	while (!list_empty(&eb->vmas)) {
 		struct i915_vma *vma;
 
@@ -275,11 +283,10 @@ static void eb_destroy(struct eb_vmas *eb)
 				       struct i915_vma,
 				       exec_list);
 		list_del_init(&vma->exec_list);
-		i915_gem_execbuffer_unreserve_vma(vma);
+		eb_unreserve_vma(vma);
 		vma->exec_entry = NULL;
 		i915_vma_put(vma);
 	}
-	kfree(eb);
 }
 
 static inline int use_cpu_reloc(struct drm_i915_gem_object *obj)
@@ -320,20 +327,11 @@ relocation_target(const struct drm_i915_gem_relocation_entry *reloc,
 	return gen8_canonical_addr((int)reloc->delta + target_offset);
 }
 
-struct reloc_cache {
-	struct drm_i915_private *i915;
-	struct drm_mm_node node;
-	unsigned long vaddr;
-	unsigned int page;
-	bool use_64bit_reloc;
-};
-
 static void reloc_cache_init(struct reloc_cache *cache,
 			     struct drm_i915_private *i915)
 {
 	cache->page = -1;
 	cache->vaddr = 0;
-	cache->i915 = i915;
 	/* Must be a variable in the struct to allow GCC to unroll. */
 	cache->use_64bit_reloc = HAS_64BIT_RELOC(i915);
 	cache->node.allocated = false;
@@ -351,7 +349,14 @@ static inline unsigned int unmask_flags(unsigned long p)
 
 #define KMAP 0x4 /* after CLFLUSH_FLAGS */
 
-static void reloc_cache_fini(struct reloc_cache *cache)
+static inline struct i915_ggtt *cache_to_ggtt(struct reloc_cache *cache)
+{
+	struct drm_i915_private *i915 =
+		container_of(cache, struct i915_execbuffer, reloc_cache)->i915;
+	return &i915->ggtt;
+}
+
+static void reloc_cache_reset(struct reloc_cache *cache)
 {
 	void *vaddr;
 
@@ -369,7 +374,7 @@ static void reloc_cache_fini(struct reloc_cache *cache)
 		wmb();
 		io_mapping_unmap_atomic((void __iomem *)vaddr);
 		if (cache->node.allocated) {
-			struct i915_ggtt *ggtt = &cache->i915->ggtt;
+			struct i915_ggtt *ggtt = cache_to_ggtt(cache);
 
 			ggtt->base.clear_range(&ggtt->base,
 					       cache->node.start,
@@ -379,6 +384,9 @@ static void reloc_cache_fini(struct reloc_cache *cache)
 			i915_vma_unpin((struct i915_vma *)cache->node.mm);
 		}
 	}
+
+	cache->vaddr = 0;
+	cache->page = -1;
 }
 
 static void *reloc_kmap(struct drm_i915_gem_object *obj,
@@ -417,7 +425,7 @@ static void *reloc_iomap(struct drm_i915_gem_object *obj,
 			 struct reloc_cache *cache,
 			 int page)
 {
-	struct i915_ggtt *ggtt = &cache->i915->ggtt;
+	struct i915_ggtt *ggtt = cache_to_ggtt(cache);
 	unsigned long offset;
 	void *vaddr;
 
@@ -467,7 +475,8 @@ static void *reloc_iomap(struct drm_i915_gem_object *obj,
 		offset += page << PAGE_SHIFT;
 	}
 
-	vaddr = (void __force *) io_mapping_map_atomic_wc(&cache->i915->ggtt.mappable, offset);
+	vaddr = (void __force *)io_mapping_map_atomic_wc(&ggtt->mappable,
+							 offset);
 	cache->page = page;
 	cache->vaddr = (unsigned long)vaddr;
 
@@ -546,12 +555,10 @@ relocate_entry(struct drm_i915_gem_object *obj,
 }
 
 static int
-i915_gem_execbuffer_relocate_entry(struct drm_i915_gem_object *obj,
-				   struct eb_vmas *eb,
-				   struct drm_i915_gem_relocation_entry *reloc,
-				   struct reloc_cache *cache)
+eb_relocate_entry(struct drm_i915_gem_object *obj,
+		  struct i915_execbuffer *eb,
+		  struct drm_i915_gem_relocation_entry *reloc)
 {
-	struct drm_i915_private *dev_priv = to_i915(obj->base.dev);
 	struct drm_gem_object *target_obj;
 	struct drm_i915_gem_object *target_i915_obj;
 	struct i915_vma *target_vma;
@@ -570,8 +577,8 @@ i915_gem_execbuffer_relocate_entry(struct drm_i915_gem_object *obj,
 	/* Sandybridge PPGTT errata: We need a global gtt mapping for MI and
 	 * pipe_control writes because the gpu doesn't properly redirect them
 	 * through the ppgtt for non_secure batchbuffers. */
-	if (unlikely(IS_GEN6(dev_priv) &&
-	    reloc->write_domain == I915_GEM_DOMAIN_INSTRUCTION)) {
+	if (unlikely(IS_GEN6(eb->i915) &&
+		     reloc->write_domain == I915_GEM_DOMAIN_INSTRUCTION)) {
 		ret = i915_vma_bind(target_vma, target_i915_obj->cache_level,
 				    PIN_GLOBAL);
 		if (WARN_ONCE(ret, "Unexpected failure to bind target VMA!"))
@@ -612,7 +619,7 @@ i915_gem_execbuffer_relocate_entry(struct drm_i915_gem_object *obj,
 
 	/* Check that the relocation address is valid... */
 	if (unlikely(reloc->offset >
-		     obj->base.size - (cache->use_64bit_reloc ? 8 : 4))) {
+		     obj->base.size - (eb->reloc_cache.use_64bit_reloc ? 8 : 4))) {
 		DRM_DEBUG("Relocation beyond object bounds: "
 			  "obj %p target %d offset %d size %d.\n",
 			  obj, reloc->target_handle,
@@ -628,7 +635,7 @@ i915_gem_execbuffer_relocate_entry(struct drm_i915_gem_object *obj,
 		return -EINVAL;
 	}
 
-	ret = relocate_entry(obj, reloc, cache, target_offset);
+	ret = relocate_entry(obj, reloc, &eb->reloc_cache, target_offset);
 	if (ret)
 		return ret;
 
@@ -637,19 +644,15 @@ i915_gem_execbuffer_relocate_entry(struct drm_i915_gem_object *obj,
 	return 0;
 }
 
-static int
-i915_gem_execbuffer_relocate_vma(struct i915_vma *vma,
-				 struct eb_vmas *eb)
+static int eb_relocate_vma(struct i915_vma *vma, struct i915_execbuffer *eb)
 {
 #define N_RELOC(x) ((x) / sizeof(struct drm_i915_gem_relocation_entry))
 	struct drm_i915_gem_relocation_entry stack_reloc[N_RELOC(512)];
 	struct drm_i915_gem_relocation_entry __user *user_relocs;
 	struct drm_i915_gem_exec_object2 *entry = vma->exec_entry;
-	struct reloc_cache cache;
 	int remain, ret = 0;
 
 	user_relocs = u64_to_user_ptr(entry->relocs_ptr);
-	reloc_cache_init(&cache, eb->i915);
 
 	remain = entry->relocation_count;
 	while (remain) {
@@ -678,7 +681,7 @@ i915_gem_execbuffer_relocate_vma(struct i915_vma *vma,
 		do {
 			u64 offset = r->presumed_offset;
 
-			ret = i915_gem_execbuffer_relocate_entry(vma->obj, eb, r, &cache);
+			ret = eb_relocate_entry(vma->obj, eb, r);
 			if (ret)
 				goto out;
 
@@ -710,39 +713,35 @@ i915_gem_execbuffer_relocate_vma(struct i915_vma *vma,
 	}
 
 out:
-	reloc_cache_fini(&cache);
+	reloc_cache_reset(&eb->reloc_cache);
 	return ret;
 #undef N_RELOC
 }
 
 static int
-i915_gem_execbuffer_relocate_vma_slow(struct i915_vma *vma,
-				      struct eb_vmas *eb,
-				      struct drm_i915_gem_relocation_entry *relocs)
+eb_relocate_vma_slow(struct i915_vma *vma,
+		     struct i915_execbuffer *eb,
+		     struct drm_i915_gem_relocation_entry *relocs)
 {
 	const struct drm_i915_gem_exec_object2 *entry = vma->exec_entry;
-	struct reloc_cache cache;
 	int i, ret = 0;
 
-	reloc_cache_init(&cache, eb->i915);
 	for (i = 0; i < entry->relocation_count; i++) {
-		ret = i915_gem_execbuffer_relocate_entry(vma->obj, eb, &relocs[i], &cache);
+		ret = eb_relocate_entry(vma->obj, eb, &relocs[i]);
 		if (ret)
 			break;
 	}
-	reloc_cache_fini(&cache);
-
+	reloc_cache_reset(&eb->reloc_cache);
 	return ret;
 }
 
-static int
-i915_gem_execbuffer_relocate(struct eb_vmas *eb)
+static int eb_relocate(struct i915_execbuffer *eb)
 {
 	struct i915_vma *vma;
 	int ret = 0;
 
 	list_for_each_entry(vma, &eb->vmas, exec_list) {
-		ret = i915_gem_execbuffer_relocate_vma(vma, eb);
+		ret = eb_relocate_vma(vma, eb);
 		if (ret)
 			break;
 	}
@@ -757,9 +756,9 @@ static bool only_mappable_for_reloc(unsigned int flags)
 }
 
 static int
-i915_gem_execbuffer_reserve_vma(struct i915_vma *vma,
-				struct intel_engine_cs *engine,
-				bool *need_reloc)
+eb_reserve_vma(struct i915_vma *vma,
+	       struct intel_engine_cs *engine,
+	       bool *need_reloc)
 {
 	struct drm_i915_gem_object *obj = vma->obj;
 	struct drm_i915_gem_exec_object2 *entry = vma->exec_entry;
@@ -878,34 +877,27 @@ eb_vma_misplaced(struct i915_vma *vma)
 	return false;
 }
 
-static int
-i915_gem_execbuffer_reserve(struct intel_engine_cs *engine,
-			    struct list_head *vmas,
-			    struct i915_gem_context *ctx,
-			    bool *need_relocs)
+static int eb_reserve(struct i915_execbuffer *eb)
 {
+	const bool has_fenced_gpu_access = INTEL_GEN(eb->i915) < 4;
+	const bool needs_unfenced_map = INTEL_INFO(eb->i915)->unfenced_needs_alignment;
 	struct drm_i915_gem_object *obj;
 	struct i915_vma *vma;
-	struct i915_address_space *vm;
 	struct list_head ordered_vmas;
 	struct list_head pinned_vmas;
-	bool has_fenced_gpu_access = INTEL_GEN(engine->i915) < 4;
-	bool needs_unfenced_map = INTEL_INFO(engine->i915)->unfenced_needs_alignment;
 	int retry;
 
-	vm = list_first_entry(vmas, struct i915_vma, exec_list)->vm;
-
 	INIT_LIST_HEAD(&ordered_vmas);
 	INIT_LIST_HEAD(&pinned_vmas);
-	while (!list_empty(vmas)) {
+	while (!list_empty(&eb->vmas)) {
 		struct drm_i915_gem_exec_object2 *entry;
 		bool need_fence, need_mappable;
 
-		vma = list_first_entry(vmas, struct i915_vma, exec_list);
+		vma = list_first_entry(&eb->vmas, struct i915_vma, exec_list);
 		obj = vma->obj;
 		entry = vma->exec_entry;
 
-		if (ctx->flags & CONTEXT_NO_ZEROMAP)
+		if (eb->ctx->flags & CONTEXT_NO_ZEROMAP)
 			entry->flags |= __EXEC_OBJECT_NEEDS_BIAS;
 
 		if (!has_fenced_gpu_access)
@@ -927,8 +919,8 @@ i915_gem_execbuffer_reserve(struct intel_engine_cs *engine,
 		obj->base.pending_read_domains = I915_GEM_GPU_DOMAINS & ~I915_GEM_DOMAIN_COMMAND;
 		obj->base.pending_write_domain = 0;
 	}
-	list_splice(&ordered_vmas, vmas);
-	list_splice(&pinned_vmas, vmas);
+	list_splice(&ordered_vmas, &eb->vmas);
+	list_splice(&pinned_vmas, &eb->vmas);
 
 	/* Attempt to pin all of the buffers into the GTT.
 	 * This is done in 3 phases:
@@ -947,27 +939,24 @@ i915_gem_execbuffer_reserve(struct intel_engine_cs *engine,
 		int ret = 0;
 
 		/* Unbind any ill-fitting objects or pin. */
-		list_for_each_entry(vma, vmas, exec_list) {
+		list_for_each_entry(vma, &eb->vmas, exec_list) {
 			if (!drm_mm_node_allocated(&vma->node))
 				continue;
 
 			if (eb_vma_misplaced(vma))
 				ret = i915_vma_unbind(vma);
 			else
-				ret = i915_gem_execbuffer_reserve_vma(vma,
-								      engine,
-								      need_relocs);
+				ret = eb_reserve_vma(vma, eb->engine, &eb->need_relocs);
 			if (ret)
 				goto err;
 		}
 
 		/* Bind fresh objects */
-		list_for_each_entry(vma, vmas, exec_list) {
+		list_for_each_entry(vma, &eb->vmas, exec_list) {
 			if (drm_mm_node_allocated(&vma->node))
 				continue;
 
-			ret = i915_gem_execbuffer_reserve_vma(vma, engine,
-							      need_relocs);
+			ret = eb_reserve_vma(vma, eb->engine, &eb->need_relocs);
 			if (ret)
 				goto err;
 		}
@@ -977,39 +966,30 @@ i915_gem_execbuffer_reserve(struct intel_engine_cs *engine,
 			return ret;
 
 		/* Decrement pin count for bound objects */
-		list_for_each_entry(vma, vmas, exec_list)
-			i915_gem_execbuffer_unreserve_vma(vma);
+		list_for_each_entry(vma, &eb->vmas, exec_list)
+			eb_unreserve_vma(vma);
 
-		ret = i915_gem_evict_vm(vm, true);
+		ret = i915_gem_evict_vm(eb->vm, true);
 		if (ret)
 			return ret;
 	} while (1);
 }
 
 static int
-i915_gem_execbuffer_relocate_slow(struct drm_device *dev,
-				  struct drm_i915_gem_execbuffer2 *args,
-				  struct drm_file *file,
-				  struct intel_engine_cs *engine,
-				  struct eb_vmas *eb,
-				  struct drm_i915_gem_exec_object2 *exec,
-				  struct i915_gem_context *ctx)
+eb_relocate_slow(struct i915_execbuffer *eb)
 {
+	const unsigned int count = eb->args->buffer_count;
+	struct drm_device *dev = &eb->i915->drm;
 	struct drm_i915_gem_relocation_entry *reloc;
-	struct i915_address_space *vm;
 	struct i915_vma *vma;
-	bool need_relocs;
 	int *reloc_offset;
 	int i, total, ret;
-	unsigned count = args->buffer_count;
-
-	vm = list_first_entry(&eb->vmas, struct i915_vma, exec_list)->vm;
 
 	/* We may process another execbuffer during the unlock... */
 	while (!list_empty(&eb->vmas)) {
 		vma = list_first_entry(&eb->vmas, struct i915_vma, exec_list);
 		list_del_init(&vma->exec_list);
-		i915_gem_execbuffer_unreserve_vma(vma);
+		eb_unreserve_vma(vma);
 		i915_vma_put(vma);
 	}
 
@@ -1017,7 +997,7 @@ i915_gem_execbuffer_relocate_slow(struct drm_device *dev,
 
 	total = 0;
 	for (i = 0; i < count; i++)
-		total += exec[i].relocation_count;
+		total += eb->exec[i].relocation_count;
 
 	reloc_offset = drm_malloc_ab(count, sizeof(*reloc_offset));
 	reloc = drm_malloc_ab(total, sizeof(*reloc));
@@ -1034,10 +1014,10 @@ i915_gem_execbuffer_relocate_slow(struct drm_device *dev,
 		u64 invalid_offset = (u64)-1;
 		int j;
 
-		user_relocs = u64_to_user_ptr(exec[i].relocs_ptr);
+		user_relocs = u64_to_user_ptr(eb->exec[i].relocs_ptr);
 
 		if (copy_from_user(reloc+total, user_relocs,
-				   exec[i].relocation_count * sizeof(*reloc))) {
+				   eb->exec[i].relocation_count * sizeof(*reloc))) {
 			ret = -EFAULT;
 			mutex_lock(&dev->struct_mutex);
 			goto err;
@@ -1052,7 +1032,7 @@ i915_gem_execbuffer_relocate_slow(struct drm_device *dev,
 		 * happened we would make the mistake of assuming that the
 		 * relocations were valid.
 		 */
-		for (j = 0; j < exec[i].relocation_count; j++) {
+		for (j = 0; j < eb->exec[i].relocation_count; j++) {
 			if (__copy_to_user(&user_relocs[j].presumed_offset,
 					   &invalid_offset,
 					   sizeof(invalid_offset))) {
@@ -1063,7 +1043,7 @@ i915_gem_execbuffer_relocate_slow(struct drm_device *dev,
 		}
 
 		reloc_offset[i] = total;
-		total += exec[i].relocation_count;
+		total += eb->exec[i].relocation_count;
 	}
 
 	ret = i915_mutex_lock_interruptible(dev);
@@ -1074,20 +1054,18 @@ i915_gem_execbuffer_relocate_slow(struct drm_device *dev,
 
 	/* reacquire the objects */
 	eb_reset(eb);
-	ret = eb_lookup_vmas(eb, exec, args, vm, file);
+	ret = eb_lookup_vmas(eb);
 	if (ret)
 		goto err;
 
-	need_relocs = (args->flags & I915_EXEC_NO_RELOC) == 0;
-	ret = i915_gem_execbuffer_reserve(engine, &eb->vmas, ctx,
-					  &need_relocs);
+	ret = eb_reserve(eb);
 	if (ret)
 		goto err;
 
 	list_for_each_entry(vma, &eb->vmas, exec_list) {
-		int offset = vma->exec_entry - exec;
-		ret = i915_gem_execbuffer_relocate_vma_slow(vma, eb,
-							    reloc + reloc_offset[offset]);
+		int idx = vma->exec_entry - eb->exec;
+
+		ret = eb_relocate_vma_slow(vma, eb, reloc + reloc_offset[idx]);
 		if (ret)
 			goto err;
 	}
@@ -1105,13 +1083,12 @@ i915_gem_execbuffer_relocate_slow(struct drm_device *dev,
 }
 
 static int
-i915_gem_execbuffer_move_to_gpu(struct drm_i915_gem_request *req,
-				struct list_head *vmas)
+eb_move_to_gpu(struct i915_execbuffer *eb)
 {
 	struct i915_vma *vma;
 	int ret;
 
-	list_for_each_entry(vma, vmas, exec_list) {
+	list_for_each_entry(vma, &eb->vmas, exec_list) {
 		struct drm_i915_gem_object *obj = vma->obj;
 
 		if (vma->exec_entry->flags & EXEC_OBJECT_CAPTURE) {
@@ -1121,9 +1098,9 @@ i915_gem_execbuffer_move_to_gpu(struct drm_i915_gem_request *req,
 			if (unlikely(!capture))
 				return -ENOMEM;
 
-			capture->next = req->capture_list;
+			capture->next = eb->request->capture_list;
 			capture->vma = vma;
-			req->capture_list = capture;
+			eb->request->capture_list = capture;
 		}
 
 		if (vma->exec_entry->flags & EXEC_OBJECT_ASYNC)
@@ -1133,22 +1110,22 @@ i915_gem_execbuffer_move_to_gpu(struct drm_i915_gem_request *req,
 			i915_gem_clflush_object(obj, 0);
 
 		ret = i915_gem_request_await_object
-			(req, obj, obj->base.pending_write_domain);
+			(eb->request, obj, obj->base.pending_write_domain);
 		if (ret)
 			return ret;
 	}
 
 	/* Unconditionally flush any chipset caches (for streaming writes). */
-	i915_gem_chipset_flush(req->engine->i915);
+	i915_gem_chipset_flush(eb->i915);
 
 	/* Unconditionally invalidate GPU caches and TLBs. */
-	return req->engine->emit_flush(req, EMIT_INVALIDATE);
+	return eb->engine->emit_flush(eb->request, EMIT_INVALIDATE);
 }
 
 static bool
 i915_gem_check_execbuffer(struct drm_i915_gem_execbuffer2 *exec)
 {
-	if (exec->flags & __I915_EXEC_UNKNOWN_FLAGS)
+	if (exec->flags & __I915_EXEC_ILLEGAL_FLAGS)
 		return false;
 
 	/* Kernel clipping was a DRI1 misfeature */
@@ -1245,22 +1222,24 @@ validate_exec_list(struct drm_device *dev,
 	return 0;
 }
 
-static struct i915_gem_context *
-i915_gem_validate_context(struct drm_device *dev, struct drm_file *file,
-			  struct intel_engine_cs *engine, const u32 ctx_id)
+static int eb_select_context(struct i915_execbuffer *eb)
 {
+	unsigned int ctx_id = i915_execbuffer2_get_context_id(*eb->args);
 	struct i915_gem_context *ctx;
 
-	ctx = i915_gem_context_lookup(file->driver_priv, ctx_id);
-	if (IS_ERR(ctx))
-		return ctx;
+	ctx = i915_gem_context_lookup(eb->file->driver_priv, ctx_id);
+	if (unlikely(IS_ERR(ctx)))
+		return PTR_ERR(ctx);
 
-	if (i915_gem_context_is_banned(ctx)) {
+	if (unlikely(i915_gem_context_is_banned(ctx))) {
 		DRM_DEBUG("Context %u tried to submit while banned\n", ctx_id);
-		return ERR_PTR(-EIO);
+		return -EIO;
 	}
 
-	return ctx;
+	eb->ctx = i915_gem_context_get(ctx);
+	eb->vm = ctx->ppgtt ? &ctx->ppgtt->base : &eb->i915->ggtt.base;
+
+	return 0;
 }
 
 void i915_vma_move_to_active(struct i915_vma *vma,
@@ -1320,12 +1299,11 @@ static void eb_export_fence(struct drm_i915_gem_object *obj,
 }
 
 static void
-i915_gem_execbuffer_move_to_active(struct list_head *vmas,
-				   struct drm_i915_gem_request *req)
+eb_move_to_active(struct i915_execbuffer *eb)
 {
 	struct i915_vma *vma;
 
-	list_for_each_entry(vma, vmas, exec_list) {
+	list_for_each_entry(vma, &eb->vmas, exec_list) {
 		struct drm_i915_gem_object *obj = vma->obj;
 
 		obj->base.write_domain = obj->base.pending_write_domain;
@@ -1335,8 +1313,8 @@ i915_gem_execbuffer_move_to_active(struct list_head *vmas,
 			obj->base.pending_read_domains |= obj->base.read_domains;
 		obj->base.read_domains = obj->base.pending_read_domains;
 
-		i915_vma_move_to_active(vma, req, vma->exec_entry->flags);
-		eb_export_fence(obj, req, vma->exec_entry->flags);
+		i915_vma_move_to_active(vma, eb->request, vma->exec_entry->flags);
+		eb_export_fence(obj, eb->request, vma->exec_entry->flags);
 	}
 }
 
@@ -1366,29 +1344,22 @@ i915_reset_gen7_sol_offsets(struct drm_i915_gem_request *req)
 	return 0;
 }
 
-static struct i915_vma *
-i915_gem_execbuffer_parse(struct intel_engine_cs *engine,
-			  struct drm_i915_gem_exec_object2 *shadow_exec_entry,
-			  struct drm_i915_gem_object *batch_obj,
-			  struct eb_vmas *eb,
-			  u32 batch_start_offset,
-			  u32 batch_len,
-			  bool is_master)
+static struct i915_vma *eb_parse(struct i915_execbuffer *eb, bool is_master)
 {
 	struct drm_i915_gem_object *shadow_batch_obj;
 	struct i915_vma *vma;
 	int ret;
 
-	shadow_batch_obj = i915_gem_batch_pool_get(&engine->batch_pool,
-						   PAGE_ALIGN(batch_len));
+	shadow_batch_obj = i915_gem_batch_pool_get(&eb->engine->batch_pool,
+						   PAGE_ALIGN(eb->batch_len));
 	if (IS_ERR(shadow_batch_obj))
 		return ERR_CAST(shadow_batch_obj);
 
-	ret = intel_engine_cmd_parser(engine,
-				      batch_obj,
+	ret = intel_engine_cmd_parser(eb->engine,
+				      eb->batch->obj,
 				      shadow_batch_obj,
-				      batch_start_offset,
-				      batch_len,
+				      eb->batch_start_offset,
+				      eb->batch_len,
 				      is_master);
 	if (ret) {
 		if (ret == -EACCES) /* unhandled chained batch */
@@ -1402,9 +1373,8 @@ i915_gem_execbuffer_parse(struct intel_engine_cs *engine,
 	if (IS_ERR(vma))
 		goto out;
 
-	memset(shadow_exec_entry, 0, sizeof(*shadow_exec_entry));
-
-	vma->exec_entry = shadow_exec_entry;
+	vma->exec_entry =
+		memset(&eb->shadow_exec_entry, 0, sizeof(*vma->exec_entry));
 	vma->exec_entry->flags = __EXEC_OBJECT_HAS_PIN;
 	i915_gem_object_get(shadow_batch_obj);
 	list_add_tail(&vma->exec_list, &eb->vmas);
@@ -1423,46 +1393,33 @@ add_to_client(struct drm_i915_gem_request *req,
 }
 
 static int
-execbuf_submit(struct i915_execbuffer_params *params,
-	       struct drm_i915_gem_execbuffer2 *args,
-	       struct list_head *vmas)
+execbuf_submit(struct i915_execbuffer *eb)
 {
-	u64 exec_start, exec_len;
 	int ret;
 
-	ret = i915_gem_execbuffer_move_to_gpu(params->request, vmas);
+	ret = eb_move_to_gpu(eb);
 	if (ret)
 		return ret;
 
-	ret = i915_switch_context(params->request);
+	ret = i915_switch_context(eb->request);
 	if (ret)
 		return ret;
 
-	if (args->flags & I915_EXEC_CONSTANTS_MASK) {
-		DRM_DEBUG("I915_EXEC_CONSTANTS_* unsupported\n");
-		return -EINVAL;
-	}
-
-	if (args->flags & I915_EXEC_GEN7_SOL_RESET) {
-		ret = i915_reset_gen7_sol_offsets(params->request);
+	if (eb->args->flags & I915_EXEC_GEN7_SOL_RESET) {
+		ret = i915_reset_gen7_sol_offsets(eb->request);
 		if (ret)
 			return ret;
 	}
 
-	exec_len   = args->batch_len;
-	exec_start = params->batch->node.start +
-		     params->args_batch_start_offset;
-
-	if (exec_len == 0)
-		exec_len = params->batch->size - params->args_batch_start_offset;
-
-	ret = params->engine->emit_bb_start(params->request,
-					    exec_start, exec_len,
-					    params->dispatch_flags);
+	ret = eb->engine->emit_bb_start(eb->request,
+					eb->batch->node.start +
+					eb->batch_start_offset,
+					eb->batch_len,
+					eb->dispatch_flags);
 	if (ret)
 		return ret;
 
-	i915_gem_execbuffer_move_to_active(vmas, params->request);
+	eb_move_to_active(eb);
 
 	return 0;
 }
@@ -1544,27 +1501,16 @@ eb_select_engine(struct drm_i915_private *dev_priv,
 }
 
 static int
-i915_gem_do_execbuffer(struct drm_device *dev, void *data,
+i915_gem_do_execbuffer(struct drm_device *dev,
 		       struct drm_file *file,
 		       struct drm_i915_gem_execbuffer2 *args,
 		       struct drm_i915_gem_exec_object2 *exec)
 {
-	struct drm_i915_private *dev_priv = to_i915(dev);
-	struct i915_ggtt *ggtt = &dev_priv->ggtt;
-	struct eb_vmas *eb;
-	struct drm_i915_gem_exec_object2 shadow_exec_entry;
-	struct intel_engine_cs *engine;
-	struct i915_gem_context *ctx;
-	struct i915_address_space *vm;
-	struct i915_execbuffer_params params_master; /* XXX: will be removed later */
-	struct i915_execbuffer_params *params = &params_master;
-	const u32 ctx_id = i915_execbuffer2_get_context_id(*args);
-	u32 dispatch_flags;
+	struct i915_execbuffer eb;
 	struct dma_fence *in_fence = NULL;
 	struct sync_file *out_fence = NULL;
 	int out_fence_fd = -1;
 	int ret;
-	bool need_relocs;
 
 	if (!i915_gem_check_execbuffer(args))
 		return -EINVAL;
@@ -1573,37 +1519,42 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data,
 	if (ret)
 		return ret;
 
-	dispatch_flags = 0;
+	eb.i915 = to_i915(dev);
+	eb.file = file;
+	eb.args = args;
+	eb.exec = exec;
+	eb.need_relocs = (args->flags & I915_EXEC_NO_RELOC) == 0;
+	reloc_cache_init(&eb.reloc_cache, eb.i915);
+
+	eb.batch_start_offset = args->batch_start_offset;
+	eb.batch_len = args->batch_len;
+
+	eb.dispatch_flags = 0;
 	if (args->flags & I915_EXEC_SECURE) {
 		if (!drm_is_current_master(file) || !capable(CAP_SYS_ADMIN))
 		    return -EPERM;
 
-		dispatch_flags |= I915_DISPATCH_SECURE;
+		eb.dispatch_flags |= I915_DISPATCH_SECURE;
 	}
 	if (args->flags & I915_EXEC_IS_PINNED)
-		dispatch_flags |= I915_DISPATCH_PINNED;
-
-	engine = eb_select_engine(dev_priv, file, args);
-	if (!engine)
-		return -EINVAL;
+		eb.dispatch_flags |= I915_DISPATCH_PINNED;
 
-	if (args->buffer_count < 1) {
-		DRM_DEBUG("execbuf with %d buffers\n", args->buffer_count);
+	eb.engine = eb_select_engine(eb.i915, file, args);
+	if (!eb.engine)
 		return -EINVAL;
-	}
 
 	if (args->flags & I915_EXEC_RESOURCE_STREAMER) {
-		if (!HAS_RESOURCE_STREAMER(dev_priv)) {
+		if (!HAS_RESOURCE_STREAMER(eb.i915)) {
 			DRM_DEBUG("RS is only allowed for Haswell, Gen8 and above\n");
 			return -EINVAL;
 		}
-		if (engine->id != RCS) {
+		if (eb.engine->id != RCS) {
 			DRM_DEBUG("RS is not available on %s\n",
-				 engine->name);
+				 eb.engine->name);
 			return -EINVAL;
 		}
 
-		dispatch_flags |= I915_DISPATCH_RS;
+		eb.dispatch_flags |= I915_DISPATCH_RS;
 	}
 
 	if (args->flags & I915_EXEC_FENCE_IN) {
@@ -1626,59 +1577,44 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data,
 	 * wakeref that we hold until the GPU has been idle for at least
 	 * 100ms.
 	 */
-	intel_runtime_pm_get(dev_priv);
+	intel_runtime_pm_get(eb.i915);
 
 	ret = i915_mutex_lock_interruptible(dev);
 	if (ret)
 		goto pre_mutex_err;
 
-	ctx = i915_gem_validate_context(dev, file, engine, ctx_id);
-	if (IS_ERR(ctx)) {
+	ret = eb_select_context(&eb);
+	if (ret) {
 		mutex_unlock(&dev->struct_mutex);
-		ret = PTR_ERR(ctx);
 		goto pre_mutex_err;
 	}
 
-	i915_gem_context_get(ctx);
-
-	if (ctx->ppgtt)
-		vm = &ctx->ppgtt->base;
-	else
-		vm = &ggtt->base;
-
-	memset(&params_master, 0x00, sizeof(params_master));
-
-	eb = eb_create(dev_priv, args);
-	if (eb == NULL) {
-		i915_gem_context_put(ctx);
+	if (eb_create(&eb)) {
+		i915_gem_context_put(eb.ctx);
 		mutex_unlock(&dev->struct_mutex);
 		ret = -ENOMEM;
 		goto pre_mutex_err;
 	}
 
 	/* Look up object handles */
-	ret = eb_lookup_vmas(eb, exec, args, vm, file);
+	ret = eb_lookup_vmas(&eb);
 	if (ret)
 		goto err;
 
 	/* take note of the batch buffer before we might reorder the lists */
-	params->batch = eb_get_batch(eb);
+	eb.batch = eb_get_batch(&eb);
 
 	/* Move the objects en-masse into the GTT, evicting if necessary. */
-	need_relocs = (args->flags & I915_EXEC_NO_RELOC) == 0;
-	ret = i915_gem_execbuffer_reserve(engine, &eb->vmas, ctx,
-					  &need_relocs);
+	ret = eb_reserve(&eb);
 	if (ret)
 		goto err;
 
 	/* The objects are in their final locations, apply the relocations. */
-	if (need_relocs)
-		ret = i915_gem_execbuffer_relocate(eb);
+	if (eb.need_relocs)
+		ret = eb_relocate(&eb);
 	if (ret) {
 		if (ret == -EFAULT) {
-			ret = i915_gem_execbuffer_relocate_slow(dev, args, file,
-								engine,
-								eb, exec, ctx);
+			ret = eb_relocate_slow(&eb);
 			BUG_ON(!mutex_is_locked(&dev->struct_mutex));
 		}
 		if (ret)
@@ -1686,28 +1622,22 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data,
 	}
 
 	/* Set the pending read domains for the batch buffer to COMMAND */
-	if (params->batch->obj->base.pending_write_domain) {
+	if (eb.batch->obj->base.pending_write_domain) {
 		DRM_DEBUG("Attempting to use self-modifying batch buffer\n");
 		ret = -EINVAL;
 		goto err;
 	}
-	if (args->batch_start_offset > params->batch->size ||
-	    args->batch_len > params->batch->size - args->batch_start_offset) {
+	if (eb.batch_start_offset > eb.batch->size ||
+	    eb.batch_len > eb.batch->size - eb.batch_start_offset) {
 		DRM_DEBUG("Attempting to use out-of-bounds batch\n");
 		ret = -EINVAL;
 		goto err;
 	}
 
-	params->args_batch_start_offset = args->batch_start_offset;
-	if (engine->needs_cmd_parser && args->batch_len) {
+	if (eb.engine->needs_cmd_parser && eb.batch_len) {
 		struct i915_vma *vma;
 
-		vma = i915_gem_execbuffer_parse(engine, &shadow_exec_entry,
-						params->batch->obj,
-						eb,
-						args->batch_start_offset,
-						args->batch_len,
-						drm_is_current_master(file));
+		vma = eb_parse(&eb, drm_is_current_master(file));
 		if (IS_ERR(vma)) {
 			ret = PTR_ERR(vma);
 			goto err;
@@ -1723,19 +1653,21 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data,
 			 * specifically don't want that set on batches the
 			 * command parser has accepted.
 			 */
-			dispatch_flags |= I915_DISPATCH_SECURE;
-			params->args_batch_start_offset = 0;
-			params->batch = vma;
+			eb.dispatch_flags |= I915_DISPATCH_SECURE;
+			eb.batch_start_offset = 0;
+			eb.batch = vma;
 		}
 	}
 
-	params->batch->obj->base.pending_read_domains |= I915_GEM_DOMAIN_COMMAND;
+	eb.batch->obj->base.pending_read_domains |= I915_GEM_DOMAIN_COMMAND;
+	if (eb.batch_len == 0)
+		eb.batch_len = eb.batch->size - eb.batch_start_offset;
 
 	/* snb/ivb/vlv conflate the "batch in ppgtt" bit with the "non-secure
 	 * batch" bit. Hence we need to pin secure batches into the global gtt.
 	 * hsw should have this fixed, but bdw mucks it up again. */
-	if (dispatch_flags & I915_DISPATCH_SECURE) {
-		struct drm_i915_gem_object *obj = params->batch->obj;
+	if (eb.dispatch_flags & I915_DISPATCH_SECURE) {
+		struct drm_i915_gem_object *obj = eb.batch->obj;
 		struct i915_vma *vma;
 
 		/*
@@ -1754,25 +1686,24 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data,
 			goto err;
 		}
 
-		params->batch = vma;
+		eb.batch = vma;
 	}
 
 	/* Allocate a request for this batch buffer nice and early. */
-	params->request = i915_gem_request_alloc(engine, ctx);
-	if (IS_ERR(params->request)) {
-		ret = PTR_ERR(params->request);
+	eb.request = i915_gem_request_alloc(eb.engine, eb.ctx);
+	if (IS_ERR(eb.request)) {
+		ret = PTR_ERR(eb.request);
 		goto err_batch_unpin;
 	}
 
 	if (in_fence) {
-		ret = i915_gem_request_await_dma_fence(params->request,
-						       in_fence);
+		ret = i915_gem_request_await_dma_fence(eb.request, in_fence);
 		if (ret < 0)
 			goto err_request;
 	}
 
 	if (out_fence_fd != -1) {
-		out_fence = sync_file_create(&params->request->fence);
+		out_fence = sync_file_create(&eb.request->fence);
 		if (!out_fence) {
 			ret = -ENOMEM;
 			goto err_request;
@@ -1785,26 +1716,13 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data,
 	 * inactive_list and lose its active reference. Hence we do not need
 	 * to explicitly hold another reference here.
 	 */
-	params->request->batch = params->batch;
-
-	/*
-	 * Save assorted stuff away to pass through to *_submission().
-	 * NB: This data should be 'persistent' and not local as it will
-	 * kept around beyond the duration of the IOCTL once the GPU
-	 * scheduler arrives.
-	 */
-	params->dev                     = dev;
-	params->file                    = file;
-	params->engine                    = engine;
-	params->dispatch_flags          = dispatch_flags;
-	params->ctx                     = ctx;
+	eb.request->batch = eb.batch;
 
-	trace_i915_gem_request_queue(params->request, dispatch_flags);
-
-	ret = execbuf_submit(params, args, &eb->vmas);
+	trace_i915_gem_request_queue(eb.request, eb.dispatch_flags);
+	ret = execbuf_submit(&eb);
 err_request:
-	__i915_add_request(params->request, ret == 0);
-	add_to_client(params->request, file);
+	__i915_add_request(eb.request, ret == 0);
+	add_to_client(eb.request, file);
 
 	if (out_fence) {
 		if (ret == 0) {
@@ -1824,19 +1742,17 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data,
 	 * needs to be adjusted to also track the ggtt batch vma properly as
 	 * active.
 	 */
-	if (dispatch_flags & I915_DISPATCH_SECURE)
-		i915_vma_unpin(params->batch);
+	if (eb.dispatch_flags & I915_DISPATCH_SECURE)
+		i915_vma_unpin(eb.batch);
 err:
 	/* the request owns the ref now */
-	i915_gem_context_put(ctx);
-	eb_destroy(eb);
-
+	eb_destroy(&eb);
 	mutex_unlock(&dev->struct_mutex);
 
 pre_mutex_err:
 	/* intel_gpu_busy should also get a ref, so it will free when the device
 	 * is really idle. */
-	intel_runtime_pm_put(dev_priv);
+	intel_runtime_pm_put(eb.i915);
 	if (out_fence_fd != -1)
 		put_unused_fd(out_fence_fd);
 err_in_fence:
@@ -1907,7 +1823,7 @@ i915_gem_execbuffer(struct drm_device *dev, void *data,
 	exec2.flags = I915_EXEC_RENDER;
 	i915_execbuffer2_set_context_id(exec2, 0);
 
-	ret = i915_gem_do_execbuffer(dev, data, file, &exec2, exec2_list);
+	ret = i915_gem_do_execbuffer(dev, file, &exec2, exec2_list);
 	if (!ret) {
 		struct drm_i915_gem_exec_object __user *user_exec_list =
 			u64_to_user_ptr(args->buffers_ptr);
@@ -1966,7 +1882,7 @@ i915_gem_execbuffer2(struct drm_device *dev, void *data,
 		return -EFAULT;
 	}
 
-	ret = i915_gem_do_execbuffer(dev, data, file, args, exec2_list);
+	ret = i915_gem_do_execbuffer(dev, file, args, exec2_list);
 	if (!ret) {
 		/* Copy the new buffer offsets back to the user's exec list. */
 		struct drm_i915_gem_exec_object2 __user *user_exec_list =
-- 
2.11.0

* [PATCH 18/27] drm/i915: Use vma->exec_entry as our double-entry placeholder
  2017-04-19  9:41 Confluence of eb + timeline improvements Chris Wilson
                   ` (16 preceding siblings ...)
  2017-04-19  9:41 ` [PATCH 17/27] drm/i915: Amalgamate execbuffer parameter structures Chris Wilson
@ 2017-04-19  9:41 ` Chris Wilson
  2017-04-19  9:41 ` [PATCH 19/27] drm/i915: Split vma exec_link/evict_link Chris Wilson
                   ` (12 subsequent siblings)
  30 siblings, 0 replies; 95+ messages in thread
From: Chris Wilson @ 2017-04-19  9:41 UTC (permalink / raw)
  To: intel-gfx

This has the benefit of not requiring us to manipulate the
vma->exec_link list when tearing down the execbuffer, and gives us a
marginally cheaper test for detecting the user error of listing the
same object more than once.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
---
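(Aside, not part of the patch: the idea is that the exec_entry
back-pointer does double duty as the "already in this execbuf" marker,
so the duplicate-handle check becomes a single NULL test and teardown
only has to clear the pointer. A stand-alone, simplified C sketch of
that test follows; the types and names are made up for illustration
and are not the driver code.)

/* Minimal sketch: reject the same object being listed twice. */
#include <stddef.h>
#include <stdio.h>

struct exec_entry {
        unsigned int handle;
};

struct vma {
        struct exec_entry *exec_entry;  /* NULL while not part of an execbuf */
};

static int eb_add_vma(struct vma *vma, struct exec_entry *entry)
{
        if (vma->exec_entry)            /* already claimed: listed twice */
                return -1;

        vma->exec_entry = entry;        /* the pointer itself is the marker */
        return 0;
}

int main(void)
{
        struct exec_entry entries[2] = { { 1 }, { 1 } };
        struct vma vma = { NULL };

        printf("first add:  %d\n", eb_add_vma(&vma, &entries[0])); /* 0 */
        printf("second add: %d\n", eb_add_vma(&vma, &entries[1])); /* -1 */
        return 0;
}
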
 drivers/gpu/drm/i915/i915_gem_evict.c      | 17 ++-----
 drivers/gpu/drm/i915/i915_gem_execbuffer.c | 77 ++++++++++++++++--------------
 drivers/gpu/drm/i915/i915_vma.c            |  1 -
 3 files changed, 44 insertions(+), 51 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_evict.c b/drivers/gpu/drm/i915/i915_gem_evict.c
index 51e365f70464..891247d79299 100644
--- a/drivers/gpu/drm/i915/i915_gem_evict.c
+++ b/drivers/gpu/drm/i915/i915_gem_evict.c
@@ -59,9 +59,6 @@ mark_free(struct drm_mm_scan *scan,
 	if (i915_vma_is_pinned(vma))
 		return false;
 
-	if (WARN_ON(!list_empty(&vma->exec_list)))
-		return false;
-
 	if (flags & PIN_NONFAULT && !list_empty(&vma->obj->userfault_link))
 		return false;
 
@@ -160,8 +157,6 @@ i915_gem_evict_something(struct i915_address_space *vm,
 	list_for_each_entry_safe(vma, next, &eviction_list, exec_list) {
 		ret = drm_mm_scan_remove_block(&scan, &vma->node);
 		BUG_ON(ret);
-
-		INIT_LIST_HEAD(&vma->exec_list);
 	}
 
 	/* Can we unpin some objects such as idle hw contents,
@@ -209,17 +204,12 @@ i915_gem_evict_something(struct i915_address_space *vm,
 		if (drm_mm_scan_remove_block(&scan, &vma->node))
 			__i915_vma_pin(vma);
 		else
-			list_del_init(&vma->exec_list);
+			list_del(&vma->exec_list);
 	}
 
 	/* Unbinding will emit any required flushes */
 	ret = 0;
-	while (!list_empty(&eviction_list)) {
-		vma = list_first_entry(&eviction_list,
-				       struct i915_vma,
-				       exec_list);
-
-		list_del_init(&vma->exec_list);
+	list_for_each_entry_safe(vma, next, &eviction_list, exec_list) {
 		__i915_vma_unpin(vma);
 		if (ret == 0)
 			ret = i915_vma_unbind(vma);
@@ -315,7 +305,7 @@ int i915_gem_evict_for_node(struct i915_address_space *vm,
 		}
 
 		/* Overlap of objects in the same batch? */
-		if (i915_vma_is_pinned(vma) || !list_empty(&vma->exec_list)) {
+		if (i915_vma_is_pinned(vma)) {
 			ret = -ENOSPC;
 			if (vma->exec_entry &&
 			    vma->exec_entry->flags & EXEC_OBJECT_PINNED)
@@ -336,7 +326,6 @@ int i915_gem_evict_for_node(struct i915_address_space *vm,
 	}
 
 	list_for_each_entry_safe(vma, next, &eviction_list, exec_list) {
-		list_del_init(&vma->exec_list);
 		__i915_vma_unpin(vma);
 		if (ret == 0)
 			ret = i915_vma_unbind(vma);
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index 32c9750f7249..6d616662ef67 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -109,13 +109,40 @@ eb_create(struct i915_execbuffer *eb)
 		eb->and = -eb->args->buffer_count;
 	}
 
-	INIT_LIST_HEAD(&eb->vmas);
 	return 0;
 }
 
+static inline void
+__eb_unreserve_vma(struct i915_vma *vma,
+		   const struct drm_i915_gem_exec_object2 *entry)
+{
+	if (unlikely(entry->flags & __EXEC_OBJECT_HAS_FENCE))
+		i915_vma_unpin_fence(vma);
+
+	if (entry->flags & __EXEC_OBJECT_HAS_PIN)
+		__i915_vma_unpin(vma);
+}
+
+static void
+eb_unreserve_vma(struct i915_vma *vma)
+{
+	struct drm_i915_gem_exec_object2 *entry = vma->exec_entry;
+
+	__eb_unreserve_vma(vma, entry);
+	entry->flags &= ~(__EXEC_OBJECT_HAS_FENCE | __EXEC_OBJECT_HAS_PIN);
+}
+
 static void
 eb_reset(struct i915_execbuffer *eb)
 {
+	struct i915_vma *vma;
+
+	list_for_each_entry(vma, &eb->vmas, exec_list) {
+		eb_unreserve_vma(vma);
+		i915_vma_put(vma);
+		vma->exec_entry = NULL;
+	}
+
 	if (eb->and >= 0)
 		memset(eb->buckets, 0, (eb->and+1)*sizeof(struct hlist_head));
 }
@@ -147,6 +174,8 @@ eb_lookup_vmas(struct i915_execbuffer *eb)
 	struct list_head objects;
 	int i, ret;
 
+	INIT_LIST_HEAD(&eb->vmas);
+
 	INIT_LIST_HEAD(&objects);
 	spin_lock(&eb->file->table_lock);
 	/* Grab a reference to the object and release the lock so we can lookup
@@ -253,40 +282,23 @@ static struct i915_vma *eb_get_vma(struct i915_execbuffer *eb, unsigned long han
 	}
 }
 
-static void
-eb_unreserve_vma(struct i915_vma *vma)
-{
-	struct drm_i915_gem_exec_object2 *entry;
-
-	if (!drm_mm_node_allocated(&vma->node))
-		return;
-
-	entry = vma->exec_entry;
-
-	if (entry->flags & __EXEC_OBJECT_HAS_FENCE)
-		i915_vma_unpin_fence(vma);
-
-	if (entry->flags & __EXEC_OBJECT_HAS_PIN)
-		__i915_vma_unpin(vma);
-
-	entry->flags &= ~(__EXEC_OBJECT_HAS_FENCE | __EXEC_OBJECT_HAS_PIN);
-}
-
 static void eb_destroy(struct i915_execbuffer *eb)
 {
-	i915_gem_context_put(eb->ctx);
+	struct i915_vma *vma;
 
-	while (!list_empty(&eb->vmas)) {
-		struct i915_vma *vma;
+	list_for_each_entry(vma, &eb->vmas, exec_list) {
+		if (!vma->exec_entry)
+			continue;
 
-		vma = list_first_entry(&eb->vmas,
-				       struct i915_vma,
-				       exec_list);
-		list_del_init(&vma->exec_list);
-		eb_unreserve_vma(vma);
+		__eb_unreserve_vma(vma, vma->exec_entry);
 		vma->exec_entry = NULL;
 		i915_vma_put(vma);
 	}
+
+	i915_gem_context_put(eb->ctx);
+
+	if (eb->buckets)
+		kfree(eb->buckets);
 }
 
 static inline int use_cpu_reloc(struct drm_i915_gem_object *obj)
@@ -986,13 +998,7 @@ eb_relocate_slow(struct i915_execbuffer *eb)
 	int i, total, ret;
 
 	/* We may process another execbuffer during the unlock... */
-	while (!list_empty(&eb->vmas)) {
-		vma = list_first_entry(&eb->vmas, struct i915_vma, exec_list);
-		list_del_init(&vma->exec_list);
-		eb_unreserve_vma(vma);
-		i915_vma_put(vma);
-	}
-
+	eb_reset(eb);
 	mutex_unlock(&dev->struct_mutex);
 
 	total = 0;
@@ -1053,7 +1059,6 @@ eb_relocate_slow(struct i915_execbuffer *eb)
 	}
 
 	/* reacquire the objects */
-	eb_reset(eb);
 	ret = eb_lookup_vmas(eb);
 	if (ret)
 		goto err;
diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c
index 1aba47024656..6cf32da682ec 100644
--- a/drivers/gpu/drm/i915/i915_vma.c
+++ b/drivers/gpu/drm/i915/i915_vma.c
@@ -85,7 +85,6 @@ vma_create(struct drm_i915_gem_object *obj,
 	if (vma == NULL)
 		return ERR_PTR(-ENOMEM);
 
-	INIT_LIST_HEAD(&vma->exec_list);
 	for (i = 0; i < ARRAY_SIZE(vma->last_read); i++)
 		init_request_active(&vma->last_read[i], i915_vma_retire);
 	init_request_active(&vma->last_fence, NULL);
-- 
2.11.0

* [PATCH 19/27] drm/i915: Split vma exec_link/evict_link
  2017-04-19  9:41 Confluence of eb + timeline improvements Chris Wilson
                   ` (17 preceding siblings ...)
  2017-04-19  9:41 ` [PATCH 18/27] drm/i915: Use vma->exec_entry as our double-entry placeholder Chris Wilson
@ 2017-04-19  9:41 ` Chris Wilson
  2017-04-19  9:41 ` [PATCH 20/27] drm/i915: Store a direct lookup from object handle to vma Chris Wilson
                   ` (11 subsequent siblings)
  30 siblings, 0 replies; 95+ messages in thread
From: Chris Wilson @ 2017-04-19  9:41 UTC (permalink / raw)
  To: intel-gfx

Currently the vma has a single link member that is used both for holding
its place in the execbuf reservation list and for its place on any
eviction list. This dual use is tricky and error-prone.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
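(Aside, not part of the patch: a tiny stand-alone C model of why the
shared link is fragile. A doubly-linked list node can only sit on one
list at a time, so re-using the execbuf link for an eviction list
silently corrupts whichever list the vma was already on; a dedicated
evict_link removes that trap. The helper below is a simplified
stand-in for list_add() with made-up names, not the kernel API.)

#include <stdio.h>

struct node {
        struct node *prev, *next;
};

static void list_insert(struct node *head, struct node *n)
{
        n->next = head->next;
        n->prev = head;
        head->next->prev = n;
        head->next = n;
}

int main(void)
{
        struct node exec_list, evict_list, vma_link;

        exec_list.prev = exec_list.next = &exec_list;
        evict_list.prev = evict_list.next = &evict_list;

        list_insert(&exec_list, &vma_link);
        list_insert(&evict_list, &vma_link);    /* same link, second list */

        /* Walking the exec list from its head now wanders off into the
         * eviction list instead of coming back around. */
        printf("after the vma, the exec walk reaches: %s\n",
               vma_link.next == &exec_list ?
               "the exec head (ok)" : "the eviction list (corrupted)");
        return 0;
}
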
 drivers/gpu/drm/i915/i915_gem_evict.c      | 14 ++++++-------
 drivers/gpu/drm/i915/i915_gem_execbuffer.c | 32 +++++++++++++++---------------
 drivers/gpu/drm/i915/i915_vma.h            |  7 +++++--
 3 files changed, 28 insertions(+), 25 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_evict.c b/drivers/gpu/drm/i915/i915_gem_evict.c
index 891247d79299..204a2d9288ae 100644
--- a/drivers/gpu/drm/i915/i915_gem_evict.c
+++ b/drivers/gpu/drm/i915/i915_gem_evict.c
@@ -62,7 +62,7 @@ mark_free(struct drm_mm_scan *scan,
 	if (flags & PIN_NONFAULT && !list_empty(&vma->obj->userfault_link))
 		return false;
 
-	list_add(&vma->exec_list, unwind);
+	list_add(&vma->evict_link, unwind);
 	return drm_mm_scan_add_block(scan, &vma->node);
 }
 
@@ -154,7 +154,7 @@ i915_gem_evict_something(struct i915_address_space *vm,
 	} while (*++phase);
 
 	/* Nothing found, clean up and bail out! */
-	list_for_each_entry_safe(vma, next, &eviction_list, exec_list) {
+	list_for_each_entry_safe(vma, next, &eviction_list, evict_link) {
 		ret = drm_mm_scan_remove_block(&scan, &vma->node);
 		BUG_ON(ret);
 	}
@@ -200,16 +200,16 @@ i915_gem_evict_something(struct i915_address_space *vm,
 	 * calling unbind (which may remove the active reference
 	 * of any of our objects, thus corrupting the list).
 	 */
-	list_for_each_entry_safe(vma, next, &eviction_list, exec_list) {
+	list_for_each_entry_safe(vma, next, &eviction_list, evict_link) {
 		if (drm_mm_scan_remove_block(&scan, &vma->node))
 			__i915_vma_pin(vma);
 		else
-			list_del(&vma->exec_list);
+			list_del(&vma->evict_link);
 	}
 
 	/* Unbinding will emit any required flushes */
 	ret = 0;
-	list_for_each_entry_safe(vma, next, &eviction_list, exec_list) {
+	list_for_each_entry_safe(vma, next, &eviction_list, evict_link) {
 		__i915_vma_unpin(vma);
 		if (ret == 0)
 			ret = i915_vma_unbind(vma);
@@ -322,10 +322,10 @@ int i915_gem_evict_for_node(struct i915_address_space *vm,
 		 * reference) another in our eviction list.
 		 */
 		__i915_vma_pin(vma);
-		list_add(&vma->exec_list, &eviction_list);
+		list_add(&vma->evict_link, &eviction_list);
 	}
 
-	list_for_each_entry_safe(vma, next, &eviction_list, exec_list) {
+	list_for_each_entry_safe(vma, next, &eviction_list, evict_link) {
 		__i915_vma_unpin(vma);
 		if (ret == 0)
 			ret = i915_vma_unbind(vma);
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index 6d616662ef67..42468cbf7678 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -137,7 +137,7 @@ eb_reset(struct i915_execbuffer *eb)
 {
 	struct i915_vma *vma;
 
-	list_for_each_entry(vma, &eb->vmas, exec_list) {
+	list_for_each_entry(vma, &eb->vmas, exec_link) {
 		eb_unreserve_vma(vma);
 		i915_vma_put(vma);
 		vma->exec_entry = NULL;
@@ -150,7 +150,7 @@ eb_reset(struct i915_execbuffer *eb)
 static struct i915_vma *
 eb_get_batch(struct i915_execbuffer *eb)
 {
-	struct i915_vma *vma = list_entry(eb->vmas.prev, typeof(*vma), exec_list);
+	struct i915_vma *vma = list_entry(eb->vmas.prev, typeof(*vma), exec_link);
 
 	/*
 	 * SNA is doing fancy tricks with compressing batch buffers, which leads
@@ -227,7 +227,7 @@ eb_lookup_vmas(struct i915_execbuffer *eb)
 		}
 
 		/* Transfer ownership from the objects list to the vmas list. */
-		list_add_tail(&vma->exec_list, &eb->vmas);
+		list_add_tail(&vma->exec_link, &eb->vmas);
 		list_del_init(&obj->obj_exec_link);
 
 		vma->exec_entry = &eb->exec[i];
@@ -286,7 +286,7 @@ static void eb_destroy(struct i915_execbuffer *eb)
 {
 	struct i915_vma *vma;
 
-	list_for_each_entry(vma, &eb->vmas, exec_list) {
+	list_for_each_entry(vma, &eb->vmas, exec_link) {
 		if (!vma->exec_entry)
 			continue;
 
@@ -752,7 +752,7 @@ static int eb_relocate(struct i915_execbuffer *eb)
 	struct i915_vma *vma;
 	int ret = 0;
 
-	list_for_each_entry(vma, &eb->vmas, exec_list) {
+	list_for_each_entry(vma, &eb->vmas, exec_link) {
 		ret = eb_relocate_vma(vma, eb);
 		if (ret)
 			break;
@@ -905,7 +905,7 @@ static int eb_reserve(struct i915_execbuffer *eb)
 		struct drm_i915_gem_exec_object2 *entry;
 		bool need_fence, need_mappable;
 
-		vma = list_first_entry(&eb->vmas, struct i915_vma, exec_list);
+		vma = list_first_entry(&eb->vmas, struct i915_vma, exec_link);
 		obj = vma->obj;
 		entry = vma->exec_entry;
 
@@ -921,12 +921,12 @@ static int eb_reserve(struct i915_execbuffer *eb)
 		need_mappable = need_fence || need_reloc_mappable(vma);
 
 		if (entry->flags & EXEC_OBJECT_PINNED)
-			list_move_tail(&vma->exec_list, &pinned_vmas);
+			list_move_tail(&vma->exec_link, &pinned_vmas);
 		else if (need_mappable) {
 			entry->flags |= __EXEC_OBJECT_NEEDS_MAP;
-			list_move(&vma->exec_list, &ordered_vmas);
+			list_move(&vma->exec_link, &ordered_vmas);
 		} else
-			list_move_tail(&vma->exec_list, &ordered_vmas);
+			list_move_tail(&vma->exec_link, &ordered_vmas);
 
 		obj->base.pending_read_domains = I915_GEM_GPU_DOMAINS & ~I915_GEM_DOMAIN_COMMAND;
 		obj->base.pending_write_domain = 0;
@@ -951,7 +951,7 @@ static int eb_reserve(struct i915_execbuffer *eb)
 		int ret = 0;
 
 		/* Unbind any ill-fitting objects or pin. */
-		list_for_each_entry(vma, &eb->vmas, exec_list) {
+		list_for_each_entry(vma, &eb->vmas, exec_link) {
 			if (!drm_mm_node_allocated(&vma->node))
 				continue;
 
@@ -964,7 +964,7 @@ static int eb_reserve(struct i915_execbuffer *eb)
 		}
 
 		/* Bind fresh objects */
-		list_for_each_entry(vma, &eb->vmas, exec_list) {
+		list_for_each_entry(vma, &eb->vmas, exec_link) {
 			if (drm_mm_node_allocated(&vma->node))
 				continue;
 
@@ -978,7 +978,7 @@ static int eb_reserve(struct i915_execbuffer *eb)
 			return ret;
 
 		/* Decrement pin count for bound objects */
-		list_for_each_entry(vma, &eb->vmas, exec_list)
+		list_for_each_entry(vma, &eb->vmas, exec_link)
 			eb_unreserve_vma(vma);
 
 		ret = i915_gem_evict_vm(eb->vm, true);
@@ -1067,7 +1067,7 @@ eb_relocate_slow(struct i915_execbuffer *eb)
 	if (ret)
 		goto err;
 
-	list_for_each_entry(vma, &eb->vmas, exec_list) {
+	list_for_each_entry(vma, &eb->vmas, exec_link) {
 		int idx = vma->exec_entry - eb->exec;
 
 		ret = eb_relocate_vma_slow(vma, eb, reloc + reloc_offset[idx]);
@@ -1093,7 +1093,7 @@ eb_move_to_gpu(struct i915_execbuffer *eb)
 	struct i915_vma *vma;
 	int ret;
 
-	list_for_each_entry(vma, &eb->vmas, exec_list) {
+	list_for_each_entry(vma, &eb->vmas, exec_link) {
 		struct drm_i915_gem_object *obj = vma->obj;
 
 		if (vma->exec_entry->flags & EXEC_OBJECT_CAPTURE) {
@@ -1308,7 +1308,7 @@ eb_move_to_active(struct i915_execbuffer *eb)
 {
 	struct i915_vma *vma;
 
-	list_for_each_entry(vma, &eb->vmas, exec_list) {
+	list_for_each_entry(vma, &eb->vmas, exec_link) {
 		struct drm_i915_gem_object *obj = vma->obj;
 
 		obj->base.write_domain = obj->base.pending_write_domain;
@@ -1382,7 +1382,7 @@ static struct i915_vma *eb_parse(struct i915_execbuffer *eb, bool is_master)
 		memset(&eb->shadow_exec_entry, 0, sizeof(*vma->exec_entry));
 	vma->exec_entry->flags = __EXEC_OBJECT_HAS_PIN;
 	i915_gem_object_get(shadow_batch_obj);
-	list_add_tail(&vma->exec_list, &eb->vmas);
+	list_add_tail(&vma->exec_link, &eb->vmas);
 
 out:
 	i915_gem_object_unpin_pages(shadow_batch_obj);
diff --git a/drivers/gpu/drm/i915/i915_vma.h b/drivers/gpu/drm/i915/i915_vma.h
index 2e03f81dddbe..4d827300d1a8 100644
--- a/drivers/gpu/drm/i915/i915_vma.h
+++ b/drivers/gpu/drm/i915/i915_vma.h
@@ -100,8 +100,11 @@ struct i915_vma {
 	struct list_head obj_link; /* Link in the object's VMA list */
 	struct rb_node obj_node;
 
-	/** This vma's place in the batchbuffer or on the eviction list */
-	struct list_head exec_list;
+	/** This vma's place in the execbuf reservation list */
+	struct list_head exec_link;
+
+	/** This vma's place in the eviction list */
+	struct list_head evict_link;
 
 	/**
 	 * Used for performing relocations during execbuffer insertion.
-- 
2.11.0

* [PATCH 20/27] drm/i915: Store a direct lookup from object handle to vma
  2017-04-19  9:41 Confluence of eb + timeline improvements Chris Wilson
                   ` (18 preceding siblings ...)
  2017-04-19  9:41 ` [PATCH 19/27] drm/i915: Split vma exec_link/evict_link Chris Wilson
@ 2017-04-19  9:41 ` Chris Wilson
  2017-04-19  9:41 ` [PATCH 21/27] drm/i915: Pass vma to relocate entry Chris Wilson
                   ` (10 subsequent siblings)
  30 siblings, 0 replies; 95+ messages in thread
From: Chris Wilson @ 2017-04-19  9:41 UTC (permalink / raw)
  To: intel-gfx

The advent of full-ppgtt led to an extra indirection between the object
and its binding. That extra indirection has a noticeable impact on how
fast we can convert from the user handles to our internal vma for
execbuffer. In order to bypass the extra indirection, we use a
resizable hashtable to jump from the object to the per-ctx vma.
rhashtable was considered but we don't need the online resizing feature
and the extra complexity proved to undermine its usefulness. Instead, we
simply reallocate the hashtable on demand in a background task and
serialize it before iterating.

In non-full-ppgtt modes, multiple files and multiple contexts can share
the same vma. This leads to having multiple possible handle->vma links,
so we only use the first to establish the fast path. The majority of
buffers are not shared and so we should still be able to realise
speedups with multiple clients.

v2: Prettier names, more magic.
v3: Many style tweaks, most notably hiding the misuse of execobj[].rsvd2

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
---
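(Aside, not in the patch itself: the fast path amounts to hashing the
userspace handle into a small per-context bucket array and walking a
short chain, instead of an idr lookup followed by a search of the
object's vma list; if a resize worker is still in flight, execbuf
flush_work()s it first so the table is stable while it iterates. Below
is a stand-alone, simplified C sketch of that lookup; the fixed bucket
count, the hash constant and all the names are illustrative rather
than the driver's own.)

#include <stdint.h>
#include <stdio.h>

#define HT_BITS 2u                      /* 4 buckets, like VMA_HT_BITS */

struct fake_vma {
        uint32_t ctx_handle;            /* handle this vma is filed under */
        struct fake_vma *ctx_next;      /* hlist-style chain in a bucket */
};

static struct fake_vma *buckets[1u << HT_BITS];

/* Multiplicative hash in the spirit of hash_32(). */
static unsigned int hash_handle(uint32_t val, unsigned int bits)
{
        return (val * 0x61C88647u) >> (32 - bits);
}

static void ht_insert(struct fake_vma *vma)
{
        struct fake_vma **head = &buckets[hash_handle(vma->ctx_handle, HT_BITS)];

        vma->ctx_next = *head;
        *head = vma;
}

static struct fake_vma *ht_lookup(uint32_t handle)
{
        struct fake_vma *vma = buckets[hash_handle(handle, HT_BITS)];

        while (vma && vma->ctx_handle != handle)
                vma = vma->ctx_next;
        return vma;                     /* NULL: fall back to the slow idr path */
}

int main(void)
{
        struct fake_vma a = { .ctx_handle = 42 };
        struct fake_vma b = { .ctx_handle = 7 };

        ht_insert(&a);
        ht_insert(&b);
        printf("handle 42 -> %p (expect %p)\n", (void *)ht_lookup(42), (void *)&a);
        printf("handle  9 -> %p (expect nil)\n", (void *)ht_lookup(9));
        return 0;
}
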
 drivers/gpu/drm/i915/i915_debugfs.c           |   6 +
 drivers/gpu/drm/i915/i915_drv.h               |   2 +-
 drivers/gpu/drm/i915/i915_gem.c               |   5 +-
 drivers/gpu/drm/i915/i915_gem_context.c       |  86 ++++++++-
 drivers/gpu/drm/i915/i915_gem_context.h       |  25 +++
 drivers/gpu/drm/i915/i915_gem_execbuffer.c    | 261 ++++++++++++++++----------
 drivers/gpu/drm/i915/i915_gem_object.h        |   4 +-
 drivers/gpu/drm/i915/i915_utils.h             |   5 +
 drivers/gpu/drm/i915/i915_vma.c               |  20 ++
 drivers/gpu/drm/i915/i915_vma.h               |   8 +-
 drivers/gpu/drm/i915/selftests/mock_context.c |  12 +-
 11 files changed, 320 insertions(+), 114 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index a8c7788d986e..a2472048b84d 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -1988,6 +1988,12 @@ static int i915_context_status(struct seq_file *m, void *unused)
 			seq_putc(m, '\n');
 		}
 
+		seq_printf(m,
+			   "\tvma hashtable size=%u (actual %lu), count=%u\n",
+			   ctx->vma_lut.ht_size,
+			   BIT(ctx->vma_lut.ht_bits),
+			   ctx->vma_lut.ht_count);
+
 		seq_putc(m, '\n');
 	}
 
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index a11d7d8f5f2e..7b6926861e04 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -37,7 +37,7 @@
 #include <linux/i2c.h>
 #include <linux/i2c-algo-bit.h>
 #include <linux/backlight.h>
-#include <linux/hashtable.h>
+#include <linux/hash.h>
 #include <linux/intel-iommu.h>
 #include <linux/kref.h>
 #include <linux/pm_qos.h>
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index f6df402a5247..ed761a122966 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -3247,6 +3247,10 @@ void i915_gem_close_object(struct drm_gem_object *gem, struct drm_file *file)
 		if (vma->vm->file == fpriv)
 			i915_vma_close(vma);
 
+	vma = obj->vma_hashed;
+	if (vma && vma->ctx->file_priv == fpriv)
+		i915_vma_unlink_ctx(vma);
+
 	if (i915_gem_object_is_active(obj) &&
 	    !i915_gem_object_has_active_reference(obj)) {
 		i915_gem_object_set_active_reference(obj);
@@ -4240,7 +4244,6 @@ void i915_gem_object_init(struct drm_i915_gem_object *obj,
 
 	INIT_LIST_HEAD(&obj->global_link);
 	INIT_LIST_HEAD(&obj->userfault_link);
-	INIT_LIST_HEAD(&obj->obj_exec_link);
 	INIT_LIST_HEAD(&obj->vma_list);
 	INIT_LIST_HEAD(&obj->batch_pool_link);
 
diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 8bd0c4966913..23fd1470a7f4 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -85,6 +85,7 @@
  *
  */
 
+#include <linux/log2.h>
 #include <drm/drmP.h>
 #include <drm/i915_drm.h>
 #include "i915_drv.h"
@@ -92,6 +93,9 @@
 
 #define ALL_L3_SLICES(dev) (1 << NUM_L3_SLICES(dev)) - 1
 
+/* Initial size (as log2) to preallocate the handle->object hashtable */
+#define VMA_HT_BITS 2u /* 4 x 2 pointers, 64 bytes minimum */
+
 static int get_context_size(struct drm_i915_private *dev_priv)
 {
 	int ret;
@@ -119,6 +123,67 @@ static int get_context_size(struct drm_i915_private *dev_priv)
 	return ret;
 }
 
+static void resize_vma_ht(struct work_struct *work)
+{
+	struct i915_gem_context_vma_lut *lut =
+		container_of(work, typeof(*lut), resize);
+	unsigned int bits, new_bits, size, i;
+	struct hlist_head *new_ht;
+
+	GEM_BUG_ON(!(lut->ht_size & I915_CTX_RESIZE_IN_PROGRESS));
+
+	bits = 1 + ilog2(4*lut->ht_count/3 + 1);
+	new_bits = min_t(unsigned int,
+			 max(bits, VMA_HT_BITS),
+			 sizeof(unsigned int) * BITS_PER_BYTE - 1);
+	if (new_bits == lut->ht_bits)
+		goto out;
+
+	new_ht = kzalloc(sizeof(*new_ht)<<new_bits, GFP_KERNEL | __GFP_NOWARN);
+	if (!new_ht)
+		new_ht = vzalloc(sizeof(*new_ht)<<new_bits);
+	if (!new_ht)
+		/* Pretend resize succeeded and stop calling us for a bit! */
+		goto out;
+
+	size = BIT(lut->ht_bits);
+	for (i = 0; i < size; i++) {
+		struct i915_vma *vma;
+		struct hlist_node *tmp;
+
+		hlist_for_each_entry_safe(vma, tmp, &lut->ht[i], ctx_node)
+			hlist_add_head(&vma->ctx_node,
+				       &new_ht[hash_32(vma->ctx_handle,
+						       new_bits)]);
+	}
+	kvfree(lut->ht);
+	lut->ht = new_ht;
+	lut->ht_bits = new_bits;
+out:
+	smp_store_release(&lut->ht_size, BIT(bits));
+	GEM_BUG_ON(lut->ht_size & I915_CTX_RESIZE_IN_PROGRESS);
+}
+
+static void vma_lut_free(struct i915_gem_context *ctx)
+{
+	struct i915_gem_context_vma_lut *lut = &ctx->vma_lut;
+	unsigned int i, size;
+
+	if (lut->ht_size & I915_CTX_RESIZE_IN_PROGRESS)
+		cancel_work_sync(&lut->resize);
+
+	size = BIT(lut->ht_bits);
+	for (i = 0; i < size; i++) {
+		struct i915_vma *vma;
+
+		hlist_for_each_entry(vma, &lut->ht[i], ctx_node) {
+			vma->obj->vma_hashed = NULL;
+			vma->ctx = NULL;
+		}
+	}
+	kvfree(lut->ht);
+}
+
 void i915_gem_context_free(struct kref *ctx_ref)
 {
 	struct i915_gem_context *ctx = container_of(ctx_ref, typeof(*ctx), ref);
@@ -128,6 +193,7 @@ void i915_gem_context_free(struct kref *ctx_ref)
 	trace_i915_context_free(ctx);
 	GEM_BUG_ON(!i915_gem_context_is_closed(ctx));
 
+	vma_lut_free(ctx);
 	i915_ppgtt_put(ctx->ppgtt);
 
 	for (i = 0; i < I915_NUM_ENGINES; i++) {
@@ -145,6 +211,7 @@ void i915_gem_context_free(struct kref *ctx_ref)
 
 	kfree(ctx->name);
 	put_pid(ctx->pid);
+
 	list_del(&ctx->link);
 
 	ida_simple_remove(&ctx->i915->context_hw_ida, ctx->hw_id);
@@ -266,6 +333,17 @@ __create_hw_context(struct drm_i915_private *dev_priv,
 	list_add_tail(&ctx->link, &dev_priv->context_list);
 	ctx->i915 = dev_priv;
 
+	ctx->vma_lut.ht_bits = VMA_HT_BITS;
+	ctx->vma_lut.ht_size = BIT(VMA_HT_BITS);
+	BUILD_BUG_ON(BIT(VMA_HT_BITS) == I915_CTX_RESIZE_IN_PROGRESS);
+	ctx->vma_lut.ht = kcalloc(ctx->vma_lut.ht_size,
+				  sizeof(*ctx->vma_lut.ht),
+				  GFP_KERNEL);
+	if (!ctx->vma_lut.ht)
+		goto err_out;
+
+	INIT_WORK(&ctx->vma_lut.resize, resize_vma_ht);
+
 	if (dev_priv->hw_context_size) {
 		struct drm_i915_gem_object *obj;
 		struct i915_vma *vma;
@@ -273,14 +351,14 @@ __create_hw_context(struct drm_i915_private *dev_priv,
 		obj = alloc_context_obj(dev_priv, dev_priv->hw_context_size);
 		if (IS_ERR(obj)) {
 			ret = PTR_ERR(obj);
-			goto err_out;
+			goto err_lut;
 		}
 
 		vma = i915_vma_instance(obj, &dev_priv->ggtt.base, NULL);
 		if (IS_ERR(vma)) {
 			i915_gem_object_put(obj);
 			ret = PTR_ERR(vma);
-			goto err_out;
+			goto err_lut;
 		}
 
 		ctx->engine[RCS].state = vma;
@@ -292,7 +370,7 @@ __create_hw_context(struct drm_i915_private *dev_priv,
 		ret = idr_alloc(&file_priv->context_idr, ctx,
 				DEFAULT_CONTEXT_HANDLE, 0, GFP_KERNEL);
 		if (ret < 0)
-			goto err_out;
+			goto err_lut;
 	}
 	ctx->user_handle = ret;
 
@@ -333,6 +411,8 @@ __create_hw_context(struct drm_i915_private *dev_priv,
 err_pid:
 	put_pid(ctx->pid);
 	idr_remove(&file_priv->context_idr, ctx->user_handle);
+err_lut:
+	kvfree(ctx->vma_lut.ht);
 err_out:
 	context_close(ctx);
 	return ERR_PTR(ret);
diff --git a/drivers/gpu/drm/i915/i915_gem_context.h b/drivers/gpu/drm/i915/i915_gem_context.h
index 4af2ab94558b..db5b28a28d75 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.h
+++ b/drivers/gpu/drm/i915/i915_gem_context.h
@@ -143,6 +143,31 @@ struct i915_gem_context {
 	/** ggtt_offset_bias: placement restriction for context objects */
 	u32 ggtt_offset_bias;
 
+	struct i915_gem_context_vma_lut {
+		/** ht_size: last request size to allocate the hashtable for. */
+		unsigned int ht_size;
+#define I915_CTX_RESIZE_IN_PROGRESS BIT(0)
+		/** ht_bits: real log2(size) of hashtable. */
+		unsigned int ht_bits;
+		/** ht_count: current number of entries inside the hashtable */
+		unsigned int ht_count;
+
+		/** ht: the array of buckets comprising the simple hashtable */
+		struct hlist_head *ht;
+
+		/** resize: After an execbuf completes, we check the load factor
+		 * of the hashtable. If the hashtable is too full, or too empty,
+		 * we schedule a task to resize the hashtable. During the
+		 * resize, the entries are moved between different buckets and
+		 * so we cannot simultaneously read the hashtable as it is
+		 * being resized (unlike rhashtable). Therefore we treat the
+		 * active work as a strong barrier, pausing a subsequent
+		 * execbuf to wait for the resize worker to complete, if
+		 * required.
+		 */
+		struct work_struct resize;
+	} vma_lut;
+
 	/** engine: per-engine logical HW state */
 	struct intel_context {
 		struct i915_vma *state;
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index 42468cbf7678..3684446df6b6 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -75,38 +75,40 @@ struct i915_execbuffer {
 		unsigned int page;
 		bool use_64bit_reloc : 1;
 	} reloc_cache;
-	int and;
-	union {
-		struct i915_vma **lut;
-		struct hlist_head *buckets;
-	};
+	int lut_mask;
+	struct hlist_head *buckets;
 };
 
+/* As an alternative to creating a hashtable of handle-to-vma for a batch,
+ * we used the last available reserved field in the execobject[] and stash
+ * a link from the execobj to its vma.
+ */
+#define __exec_to_vma(ee) (ee)->rsvd2
+#define exec_to_vma(ee) u64_to_ptr(struct i915_vma, __exec_to_vma(ee))
+
 static int
 eb_create(struct i915_execbuffer *eb)
 {
-	eb->lut = NULL;
-	if (eb->args->flags & I915_EXEC_HANDLE_LUT) {
-		unsigned int size = eb->args->buffer_count;
-		size *= sizeof(struct i915_vma *);
-		eb->lut = kmalloc(size,
-				  GFP_TEMPORARY | __GFP_NOWARN | __GFP_NORETRY);
-	}
-
-	if (!eb->lut) {
-		unsigned int size = eb->args->buffer_count;
-		unsigned int count = PAGE_SIZE / sizeof(struct hlist_head) / 2;
-		BUILD_BUG_ON_NOT_POWER_OF_2(PAGE_SIZE / sizeof(struct hlist_head));
-		while (count > 2*size)
-			count >>= 1;
-		eb->lut = kzalloc(count*sizeof(struct hlist_head),
-				  GFP_TEMPORARY);
-		if (!eb->lut)
-			return -ENOMEM;
-
-		eb->and = count - 1;
+	if ((eb->args->flags & I915_EXEC_HANDLE_LUT) == 0) {
+		unsigned int size = 1 + ilog2(eb->args->buffer_count);
+
+		do {
+			eb->buckets = kzalloc(sizeof(struct hlist_head) << size,
+					     GFP_TEMPORARY | __GFP_NOWARN | __GFP_NORETRY);
+			if (eb->buckets)
+				break;
+		} while (--size);
+
+		if (unlikely(!eb->buckets)) {
+			eb->buckets = kzalloc(sizeof(struct hlist_head),
+					      GFP_TEMPORARY);
+			if (unlikely(!eb->buckets))
+				return -ENOMEM;
+		}
+
+		eb->lut_mask = size;
 	} else {
-		eb->and = -eb->args->buffer_count;
+		eb->lut_mask = -eb->args->buffer_count;
 	}
 
 	return 0;
@@ -143,73 +145,112 @@ eb_reset(struct i915_execbuffer *eb)
 		vma->exec_entry = NULL;
 	}
 
-	if (eb->and >= 0)
-		memset(eb->buckets, 0, (eb->and+1)*sizeof(struct hlist_head));
+	if (eb->lut_mask >= 0)
+		memset(eb->buckets, 0,
+		       sizeof(struct hlist_head) << eb->lut_mask);
 }
 
-static struct i915_vma *
-eb_get_batch(struct i915_execbuffer *eb)
+static bool
+eb_add_vma(struct i915_execbuffer *eb, struct i915_vma *vma, int i)
 {
-	struct i915_vma *vma = list_entry(eb->vmas.prev, typeof(*vma), exec_link);
+	if (unlikely(vma->exec_entry)) {
+		DRM_DEBUG("Object [handle %d, index %d] appears more than once in object list\n",
+			  eb->exec[i].handle, i);
+		return false;
+	}
+	list_add_tail(&vma->exec_link, &eb->vmas);
 
-	/*
-	 * SNA is doing fancy tricks with compressing batch buffers, which leads
-	 * to negative relocation deltas. Usually that works out ok since the
-	 * relocate address is still positive, except when the batch is placed
-	 * very low in the GTT. Ensure this doesn't happen.
-	 *
-	 * Note that actual hangs have only been observed on gen7, but for
-	 * paranoia do it everywhere.
-	 */
-	if ((vma->exec_entry->flags & EXEC_OBJECT_PINNED) == 0)
-		vma->exec_entry->flags |= __EXEC_OBJECT_NEEDS_BIAS;
+	vma->exec_entry = &eb->exec[i];
+	if (eb->lut_mask >= 0) {
+		vma->exec_handle = eb->exec[i].handle;
+		hlist_add_head(&vma->exec_node,
+			       &eb->buckets[hash_32(vma->exec_handle,
+						    eb->lut_mask)]);
+	}
 
-	return vma;
+	i915_vma_get(vma);
+	__exec_to_vma(&eb->exec[i]) = (uintptr_t)vma;
+	return true;
+}
+
+static inline struct hlist_head *
+ht_head(const struct i915_gem_context *ctx, u32 handle)
+{
+	return &ctx->vma_lut.ht[hash_32(handle, ctx->vma_lut.ht_bits)];
+}
+
+static inline bool
+ht_needs_resize(const struct i915_gem_context *ctx)
+{
+	return (4*ctx->vma_lut.ht_count > 3*ctx->vma_lut.ht_size ||
+		4*ctx->vma_lut.ht_count + 1 < ctx->vma_lut.ht_size);
 }
 
 static int
 eb_lookup_vmas(struct i915_execbuffer *eb)
 {
-	struct drm_i915_gem_object *obj;
-	struct list_head objects;
-	int i, ret;
+#define INTERMEDIATE BIT(0)
+	const int count = eb->args->buffer_count;
+	struct i915_vma *vma;
+	int slow_pass = -1;
+	int i;
 
 	INIT_LIST_HEAD(&eb->vmas);
 
-	INIT_LIST_HEAD(&objects);
+	if (unlikely(eb->ctx->vma_lut.ht_size & I915_CTX_RESIZE_IN_PROGRESS))
+		flush_work(&eb->ctx->vma_lut.resize);
+	GEM_BUG_ON(eb->ctx->vma_lut.ht_size & I915_CTX_RESIZE_IN_PROGRESS);
+
+	for (i = 0; i < count; i++) {
+		__exec_to_vma(&eb->exec[i]) = 0;
+
+		hlist_for_each_entry(vma,
+				     ht_head(eb->ctx, eb->exec[i].handle),
+				     ctx_node) {
+			if (vma->ctx_handle != eb->exec[i].handle)
+				continue;
+
+			if (!eb_add_vma(eb, vma, i))
+				return -EINVAL;
+
+			goto next_vma;
+		}
+
+		if (slow_pass < 0)
+			slow_pass = i;
+next_vma: ;
+	}
+
+	if (slow_pass < 0)
+		return 0;
+
 	spin_lock(&eb->file->table_lock);
 	/* Grab a reference to the object and release the lock so we can lookup
 	 * or create the VMA without using GFP_ATOMIC */
-	for (i = 0; i < eb->args->buffer_count; i++) {
-		obj = to_intel_bo(idr_find(&eb->file->object_idr, eb->exec[i].handle));
-		if (obj == NULL) {
-			spin_unlock(&eb->file->table_lock);
-			DRM_DEBUG("Invalid object handle %d at index %d\n",
-				   eb->exec[i].handle, i);
-			ret = -ENOENT;
-			goto err;
-		}
+	for (i = slow_pass; i < count; i++) {
+		struct drm_i915_gem_object *obj;
 
-		if (!list_empty(&obj->obj_exec_link)) {
+		if (__exec_to_vma(&eb->exec[i]))
+			continue;
+
+		obj = to_intel_bo(idr_find(&eb->file->object_idr,
+					   eb->exec[i].handle));
+		if (unlikely(!obj)) {
 			spin_unlock(&eb->file->table_lock);
-			DRM_DEBUG("Object %p [handle %d, index %d] appears more than once in object list\n",
-				   obj, eb->exec[i].handle, i);
-			ret = -EINVAL;
-			goto err;
+			DRM_DEBUG("Invalid object handle %d at index %d\n",
+				  eb->exec[i].handle, i);
+			return -ENOENT;
 		}
 
-		i915_gem_object_get(obj);
-		list_add_tail(&obj->obj_exec_link, &objects);
+		__exec_to_vma(&eb->exec[i]) = INTERMEDIATE | (uintptr_t)obj;
 	}
 	spin_unlock(&eb->file->table_lock);
 
-	i = 0;
-	while (!list_empty(&objects)) {
-		struct i915_vma *vma;
+	for (i = slow_pass; i < count; i++) {
+		struct drm_i915_gem_object *obj;
 
-		obj = list_first_entry(&objects,
-				       struct drm_i915_gem_object,
-				       obj_exec_link);
+		if ((__exec_to_vma(&eb->exec[i]) & INTERMEDIATE) == 0)
+			continue;
 
 		/*
 		 * NOTE: We can leak any vmas created here when something fails
@@ -219,61 +260,73 @@ eb_lookup_vmas(struct i915_execbuffer *eb)
 		 * from the (obj, vm) we don't run the risk of creating
 		 * duplicated vmas for the same vm.
 		 */
+		obj = u64_to_ptr(struct drm_i915_gem_object,
+				 __exec_to_vma(&eb->exec[i]) & ~INTERMEDIATE);
 		vma = i915_vma_instance(obj, eb->vm, NULL);
 		if (unlikely(IS_ERR(vma))) {
 			DRM_DEBUG("Failed to lookup VMA\n");
-			ret = PTR_ERR(vma);
-			goto err;
+			return PTR_ERR(vma);
 		}
 
-		/* Transfer ownership from the objects list to the vmas list. */
-		list_add_tail(&vma->exec_link, &eb->vmas);
-		list_del_init(&obj->obj_exec_link);
-
-		vma->exec_entry = &eb->exec[i];
-		if (eb->and < 0) {
-			eb->lut[i] = vma;
-		} else {
-			u32 handle =
-				eb->args->flags & I915_EXEC_HANDLE_LUT ?
-				i : eb->exec[i].handle;
-			vma->exec_handle = handle;
-			hlist_add_head(&vma->exec_node,
-				       &eb->buckets[handle & eb->and]);
+		/* First come, first served */
+		if (!vma->ctx) {
+			vma->ctx = eb->ctx;
+			vma->ctx_handle = eb->exec[i].handle;
+			hlist_add_head(&vma->ctx_node,
+				       ht_head(eb->ctx, eb->exec[i].handle));
+			eb->ctx->vma_lut.ht_count++;
+			if (i915_vma_is_ggtt(vma)) {
+				GEM_BUG_ON(obj->vma_hashed);
+				obj->vma_hashed = vma;
+			}
 		}
-		++i;
+
+		if (!eb_add_vma(eb, vma, i))
+			return -EINVAL;
+	}
+
+	if (ht_needs_resize(eb->ctx)) {
+		eb->ctx->vma_lut.ht_size |= I915_CTX_RESIZE_IN_PROGRESS;
+		queue_work(system_highpri_wq, &eb->ctx->vma_lut.resize);
 	}
 
 	return 0;
+#undef INTERMEDIATE
+}
 
+static struct i915_vma *
+eb_get_batch(struct i915_execbuffer *eb)
+{
+	struct i915_vma *vma =
+		exec_to_vma(&eb->exec[eb->args->buffer_count - 1]);
 
-err:
-	while (!list_empty(&objects)) {
-		obj = list_first_entry(&objects,
-				       struct drm_i915_gem_object,
-				       obj_exec_link);
-		list_del_init(&obj->obj_exec_link);
-		i915_gem_object_put(obj);
-	}
 	/*
-	 * Objects already transfered to the vmas list will be unreferenced by
-	 * eb_destroy.
+	 * SNA is doing fancy tricks with compressing batch buffers, which leads
+	 * to negative relocation deltas. Usually that works out ok since the
+	 * relocate address is still positive, except when the batch is placed
+	 * very low in the GTT. Ensure this doesn't happen.
+	 *
+	 * Note that actual hangs have only been observed on gen7, but for
+	 * paranoia do it everywhere.
 	 */
+	if ((vma->exec_entry->flags & EXEC_OBJECT_PINNED) == 0)
+		vma->exec_entry->flags |= __EXEC_OBJECT_NEEDS_BIAS;
 
-	return ret;
+	return vma;
 }
 
-static struct i915_vma *eb_get_vma(struct i915_execbuffer *eb, unsigned long handle)
+static struct i915_vma *
+eb_get_vma(struct i915_execbuffer *eb, unsigned long handle)
 {
-	if (eb->and < 0) {
-		if (handle >= -eb->and)
+	if (eb->lut_mask < 0) {
+		if (handle >= -eb->lut_mask)
 			return NULL;
-		return eb->lut[handle];
+		return exec_to_vma(&eb->exec[handle]);
 	} else {
 		struct hlist_head *head;
 		struct i915_vma *vma;
 
-		head = &eb->buckets[handle & eb->and];
+		head = &eb->buckets[hash_32(handle, eb->lut_mask)];
 		hlist_for_each_entry(vma, head, exec_node) {
 			if (vma->exec_handle == handle)
 				return vma;
@@ -297,7 +350,7 @@ static void eb_destroy(struct i915_execbuffer *eb)
 
 	i915_gem_context_put(eb->ctx);
 
-	if (eb->buckets)
+	if (eb->lut_mask >= 0)
 		kfree(eb->buckets);
 }
 
@@ -917,7 +970,7 @@ static int eb_reserve(struct i915_execbuffer *eb)
 		need_fence =
 			(entry->flags & EXEC_OBJECT_NEEDS_FENCE ||
 			 needs_unfenced_map) &&
-			i915_gem_object_is_tiled(obj);
+			i915_gem_object_is_tiled(vma->obj);
 		need_mappable = need_fence || need_reloc_mappable(vma);
 
 		if (entry->flags & EXEC_OBJECT_PINNED)
diff --git a/drivers/gpu/drm/i915/i915_gem_object.h b/drivers/gpu/drm/i915/i915_gem_object.h
index 174cf923c236..5093e065b9a6 100644
--- a/drivers/gpu/drm/i915/i915_gem_object.h
+++ b/drivers/gpu/drm/i915/i915_gem_object.h
@@ -71,6 +71,7 @@ struct drm_i915_gem_object {
 	/** List of VMAs backed by this object */
 	struct list_head vma_list;
 	struct rb_root vma_tree;
+	struct i915_vma *vma_hashed;
 
 	/** Stolen memory for this object, instead of being backed by shmem. */
 	struct drm_mm_node *stolen;
@@ -85,9 +86,6 @@ struct drm_i915_gem_object {
 	 */
 	struct list_head userfault_link;
 
-	/** Used in execbuf to temporarily hold a ref */
-	struct list_head obj_exec_link;
-
 	struct list_head batch_pool_link;
 	I915_SELFTEST_DECLARE(struct list_head st_link);
 
diff --git a/drivers/gpu/drm/i915/i915_utils.h b/drivers/gpu/drm/i915/i915_utils.h
index f0500c65726d..11c4134c241a 100644
--- a/drivers/gpu/drm/i915/i915_utils.h
+++ b/drivers/gpu/drm/i915/i915_utils.h
@@ -99,4 +99,9 @@
 	__T;								\
 })
 
+#define u64_to_ptr(T, x) ({						\
+	typecheck(u64, x);						\
+	(T *)(uintptr_t)(x);						\
+})
+
 #endif /* !__I915_UTILS_H */
diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c
index 6cf32da682ec..ad696239383d 100644
--- a/drivers/gpu/drm/i915/i915_vma.c
+++ b/drivers/gpu/drm/i915/i915_vma.c
@@ -590,11 +590,31 @@ void i915_vma_destroy(struct i915_vma *vma)
 	kmem_cache_free(to_i915(vma->obj->base.dev)->vmas, vma);
 }
 
+void i915_vma_unlink_ctx(struct i915_vma *vma)
+{
+	struct i915_gem_context *ctx = vma->ctx;
+
+	if (ctx->vma_lut.ht_size & I915_CTX_RESIZE_IN_PROGRESS) {
+		cancel_work_sync(&ctx->vma_lut.resize);
+		ctx->vma_lut.ht_size &= ~I915_CTX_RESIZE_IN_PROGRESS;
+	}
+
+	__hlist_del(&vma->ctx_node);
+	ctx->vma_lut.ht_count--;
+
+	if (i915_vma_is_ggtt(vma))
+		vma->obj->vma_hashed = NULL;
+	vma->ctx = NULL;
+}
+
 void i915_vma_close(struct i915_vma *vma)
 {
 	GEM_BUG_ON(i915_vma_is_closed(vma));
 	vma->flags |= I915_VMA_CLOSED;
 
+	if (vma->ctx)
+		i915_vma_unlink_ctx(vma);
+
 	list_del(&vma->obj_link);
 	rb_erase(&vma->obj_node, &vma->obj->vma_tree);
 
diff --git a/drivers/gpu/drm/i915/i915_vma.h b/drivers/gpu/drm/i915/i915_vma.h
index 4d827300d1a8..88543fafcffc 100644
--- a/drivers/gpu/drm/i915/i915_vma.h
+++ b/drivers/gpu/drm/i915/i915_vma.h
@@ -99,6 +99,7 @@ struct i915_vma {
 
 	struct list_head obj_link; /* Link in the object's VMA list */
 	struct rb_node obj_node;
+	struct hlist_node obj_hash;
 
 	/** This vma's place in the execbuf reservation list */
 	struct list_head exec_link;
@@ -110,8 +111,12 @@ struct i915_vma {
 	 * Used for performing relocations during execbuffer insertion.
 	 */
 	struct hlist_node exec_node;
-	unsigned long exec_handle;
 	struct drm_i915_gem_exec_object2 *exec_entry;
+	u32 exec_handle;
+
+	struct i915_gem_context *ctx;
+	struct hlist_node ctx_node;
+	u32 ctx_handle;
 };
 
 struct i915_vma *
@@ -235,6 +240,7 @@ bool i915_vma_misplaced(const struct i915_vma *vma,
 			u64 size, u64 alignment, u64 flags);
 void __i915_vma_set_map_and_fenceable(struct i915_vma *vma);
 int __must_check i915_vma_unbind(struct i915_vma *vma);
+void i915_vma_unlink_ctx(struct i915_vma *vma);
 void i915_vma_close(struct i915_vma *vma);
 void i915_vma_destroy(struct i915_vma *vma);
 
diff --git a/drivers/gpu/drm/i915/selftests/mock_context.c b/drivers/gpu/drm/i915/selftests/mock_context.c
index 8d3a90c3f8ac..f8b9cc212b02 100644
--- a/drivers/gpu/drm/i915/selftests/mock_context.c
+++ b/drivers/gpu/drm/i915/selftests/mock_context.c
@@ -40,10 +40,18 @@ mock_context(struct drm_i915_private *i915,
 	INIT_LIST_HEAD(&ctx->link);
 	ctx->i915 = i915;
 
+	ctx->vma_lut.ht_bits = VMA_HT_BITS;
+	ctx->vma_lut.ht_size = BIT(VMA_HT_BITS);
+	ctx->vma_lut.ht = kcalloc(ctx->vma_lut.ht_size,
+				  sizeof(*ctx->vma_lut.ht),
+				  GFP_KERNEL);
+	if (!ctx->vma_lut.ht)
+		goto err_free;
+
 	ret = ida_simple_get(&i915->context_hw_ida,
 			     0, MAX_CONTEXT_HW_ID, GFP_KERNEL);
 	if (ret < 0)
-		goto err_free;
+		goto err_vma_ht;
 	ctx->hw_id = ret;
 
 	if (name) {
@@ -58,6 +66,8 @@ mock_context(struct drm_i915_private *i915,
 
 	return ctx;
 
+err_vma_ht:
+	kvfree(ctx->vma_lut.ht);
 err_free:
 	kfree(ctx);
 	return NULL;
-- 
2.11.0

* [PATCH 21/27] drm/i915: Pass vma to relocate entry
  2017-04-19  9:41 Confluence of eb + timeline improvements Chris Wilson
                   ` (19 preceding siblings ...)
  2017-04-19  9:41 ` [PATCH 20/27] drm/i915: Store a direct lookup from object handle to vma Chris Wilson
@ 2017-04-19  9:41 ` Chris Wilson
  2017-04-19  9:41 ` [PATCH 22/27] drm/i915: Eliminate lots of iterations over the execobjects array Chris Wilson
                   ` (9 subsequent siblings)
  30 siblings, 0 replies; 95+ messages in thread
From: Chris Wilson @ 2017-04-19  9:41 UTC (permalink / raw)
  To: intel-gfx

We can simplify our tracking of pending writes in an execbuf to a
single bit in vma->exec_entry->flags, but that requires the
relocation function to know the object's vma. Pass it along.

Note we have only been using a single bit to track flushing since

commit cc889e0f6ce6a63c62db17d702ecfed86d58083f
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Wed Jun 13 20:45:19 2012 +0200

    drm/i915: disable flushing_list/gpu_write_list

unconditionally flushed all render caches before the breadcrumb and

commit 6ac42f4148bc27e5ffd18a9ab0eac57f58822af4
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Sat Jul 21 12:25:01 2012 +0200

    drm/i915: Replace the complex flushing logic with simple invalidate/flush all

did away with the explicit GPU domain tracking. This was then codified
into the ABI with NO_RELOC in

commit ed5982e6ce5f106abcbf071f80730db344a6da42
Author: Daniel Vetter <daniel.vetter@ffwll.ch> # Oi! Patch stealer!
Date:   Thu Jan 17 22:23:36 2013 +0100

    drm/i915: Allow userspace to hint that the relocations were known

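As a toy, userspace-only model of the bookkeeping change (the structures
and the flag value below are illustrative stand-ins, not the driver's
code): a relocation that writes to its target sets one bit on the
target's exec entry, rather than accumulating pending read/write domains
on the object.

#include <stdint.h>
#include <stdio.h>

/* Illustrative stand-in for the uapi EXEC_OBJECT_WRITE bit. */
#define TOY_EXEC_OBJECT_WRITE	(1u << 2)

struct toy_exec_entry {
	uint32_t handle;
	uint64_t flags;
};

/* A relocation that writes to its target sets a single bit on the
 * target's entry; no per-object pending domain tracking is needed.
 */
static void toy_relocate(struct toy_exec_entry *target, uint32_t write_domain)
{
	if (write_domain)
		target->flags |= TOY_EXEC_OBJECT_WRITE;
}

int main(void)
{
	struct toy_exec_entry rt = { .handle = 1, .flags = 0 };

	toy_relocate(&rt, 0x2 /* e.g. a render-target write */);
	printf("handle %u written: %s\n", (unsigned int)rt.handle,
	       rt.flags & TOY_EXEC_OBJECT_WRITE ? "yes" : "no");
	return 0;
}
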
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_gem_execbuffer.c | 101 ++++++++++++-----------------
 1 file changed, 41 insertions(+), 60 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index 3684446df6b6..f166dae7ef28 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -620,42 +620,25 @@ relocate_entry(struct drm_i915_gem_object *obj,
 }
 
 static int
-eb_relocate_entry(struct drm_i915_gem_object *obj,
+eb_relocate_entry(struct i915_vma *vma,
 		  struct i915_execbuffer *eb,
 		  struct drm_i915_gem_relocation_entry *reloc)
 {
-	struct drm_gem_object *target_obj;
-	struct drm_i915_gem_object *target_i915_obj;
-	struct i915_vma *target_vma;
-	uint64_t target_offset;
+	struct i915_vma *target;
+	u64 target_offset;
 	int ret;
 
 	/* we've already hold a reference to all valid objects */
-	target_vma = eb_get_vma(eb, reloc->target_handle);
-	if (unlikely(target_vma == NULL))
+	target = eb_get_vma(eb, reloc->target_handle);
+	if (unlikely(!target))
 		return -ENOENT;
-	target_i915_obj = target_vma->obj;
-	target_obj = &target_vma->obj->base;
-
-	target_offset = gen8_canonical_addr(target_vma->node.start);
-
-	/* Sandybridge PPGTT errata: We need a global gtt mapping for MI and
-	 * pipe_control writes because the gpu doesn't properly redirect them
-	 * through the ppgtt for non_secure batchbuffers. */
-	if (unlikely(IS_GEN6(eb->i915) &&
-		     reloc->write_domain == I915_GEM_DOMAIN_INSTRUCTION)) {
-		ret = i915_vma_bind(target_vma, target_i915_obj->cache_level,
-				    PIN_GLOBAL);
-		if (WARN_ONCE(ret, "Unexpected failure to bind target VMA!"))
-			return ret;
-	}
 
 	/* Validate that the target is in a valid r/w GPU domain */
 	if (unlikely(reloc->write_domain & (reloc->write_domain - 1))) {
 		DRM_DEBUG("reloc with multiple write domains: "
-			  "obj %p target %d offset %d "
+			  "target %d offset %d "
 			  "read %08x write %08x",
-			  obj, reloc->target_handle,
+			  reloc->target_handle,
 			  (int) reloc->offset,
 			  reloc->read_domains,
 			  reloc->write_domain);
@@ -664,43 +647,56 @@ eb_relocate_entry(struct drm_i915_gem_object *obj,
 	if (unlikely((reloc->write_domain | reloc->read_domains)
 		     & ~I915_GEM_GPU_DOMAINS)) {
 		DRM_DEBUG("reloc with read/write non-GPU domains: "
-			  "obj %p target %d offset %d "
+			  "target %d offset %d "
 			  "read %08x write %08x",
-			  obj, reloc->target_handle,
+			  reloc->target_handle,
 			  (int) reloc->offset,
 			  reloc->read_domains,
 			  reloc->write_domain);
 		return -EINVAL;
 	}
 
-	target_obj->pending_read_domains |= reloc->read_domains;
-	target_obj->pending_write_domain |= reloc->write_domain;
+	if (reloc->write_domain)
+		target->exec_entry->flags |= EXEC_OBJECT_WRITE;
+
+	/* Sandybridge PPGTT errata: We need a global gtt mapping for MI and
+	 * pipe_control writes because the gpu doesn't properly redirect them
+	 * through the ppgtt for non_secure batchbuffers.
+	 */
+	if (unlikely(IS_GEN6(eb->i915) &&
+		     reloc->write_domain == I915_GEM_DOMAIN_INSTRUCTION)) {
+		ret = i915_vma_bind(target, target->obj->cache_level,
+				    PIN_GLOBAL);
+		if (WARN_ONCE(ret, "Unexpected failure to bind target VMA!"))
+			return ret;
+	}
 
 	/* If the relocation already has the right value in it, no
 	 * more work needs to be done.
 	 */
+	target_offset = gen8_canonical_addr(target->node.start);
 	if (target_offset == reloc->presumed_offset)
 		return 0;
 
 	/* Check that the relocation address is valid... */
 	if (unlikely(reloc->offset >
-		     obj->base.size - (eb->reloc_cache.use_64bit_reloc ? 8 : 4))) {
+		     vma->size - (eb->reloc_cache.use_64bit_reloc ? 8 : 4))) {
 		DRM_DEBUG("Relocation beyond object bounds: "
-			  "obj %p target %d offset %d size %d.\n",
-			  obj, reloc->target_handle,
-			  (int) reloc->offset,
-			  (int) obj->base.size);
+			  "target %d offset %d size %d.\n",
+			  reloc->target_handle,
+			  (int)reloc->offset,
+			  (int)vma->size);
 		return -EINVAL;
 	}
 	if (unlikely(reloc->offset & 3)) {
 		DRM_DEBUG("Relocation not 4-byte aligned: "
-			  "obj %p target %d offset %d.\n",
-			  obj, reloc->target_handle,
-			  (int) reloc->offset);
+			  "target %d offset %d.\n",
+			  reloc->target_handle,
+			  (int)reloc->offset);
 		return -EINVAL;
 	}
 
-	ret = relocate_entry(obj, reloc, &eb->reloc_cache, target_offset);
+	ret = relocate_entry(vma->obj, reloc, &eb->reloc_cache, target_offset);
 	if (ret)
 		return ret;
 
@@ -746,7 +742,7 @@ static int eb_relocate_vma(struct i915_vma *vma, struct i915_execbuffer *eb)
 		do {
 			u64 offset = r->presumed_offset;
 
-			ret = eb_relocate_entry(vma->obj, eb, r);
+			ret = eb_relocate_entry(vma, eb, r);
 			if (ret)
 				goto out;
 
@@ -792,7 +788,7 @@ eb_relocate_vma_slow(struct i915_vma *vma,
 	int i, ret = 0;
 
 	for (i = 0; i < entry->relocation_count; i++) {
-		ret = eb_relocate_entry(vma->obj, eb, &relocs[i]);
+		ret = eb_relocate_entry(vma, eb, &relocs[i]);
 		if (ret)
 			break;
 	}
@@ -825,7 +821,6 @@ eb_reserve_vma(struct i915_vma *vma,
 	       struct intel_engine_cs *engine,
 	       bool *need_reloc)
 {
-	struct drm_i915_gem_object *obj = vma->obj;
 	struct drm_i915_gem_exec_object2 *entry = vma->exec_entry;
 	uint64_t flags;
 	int ret;
@@ -879,11 +874,6 @@ eb_reserve_vma(struct i915_vma *vma,
 		*need_reloc = true;
 	}
 
-	if (entry->flags & EXEC_OBJECT_WRITE) {
-		obj->base.pending_read_domains = I915_GEM_DOMAIN_RENDER;
-		obj->base.pending_write_domain = I915_GEM_DOMAIN_RENDER;
-	}
-
 	return 0;
 }
 
@@ -946,7 +936,6 @@ static int eb_reserve(struct i915_execbuffer *eb)
 {
 	const bool has_fenced_gpu_access = INTEL_GEN(eb->i915) < 4;
 	const bool needs_unfenced_map = INTEL_INFO(eb->i915)->unfenced_needs_alignment;
-	struct drm_i915_gem_object *obj;
 	struct i915_vma *vma;
 	struct list_head ordered_vmas;
 	struct list_head pinned_vmas;
@@ -959,7 +948,6 @@ static int eb_reserve(struct i915_execbuffer *eb)
 		bool need_fence, need_mappable;
 
 		vma = list_first_entry(&eb->vmas, struct i915_vma, exec_link);
-		obj = vma->obj;
 		entry = vma->exec_entry;
 
 		if (eb->ctx->flags & CONTEXT_NO_ZEROMAP)
@@ -980,9 +968,6 @@ static int eb_reserve(struct i915_execbuffer *eb)
 			list_move(&vma->exec_link, &ordered_vmas);
 		} else
 			list_move_tail(&vma->exec_link, &ordered_vmas);
-
-		obj->base.pending_read_domains = I915_GEM_GPU_DOMAINS & ~I915_GEM_DOMAIN_COMMAND;
-		obj->base.pending_write_domain = 0;
 	}
 	list_splice(&ordered_vmas, &eb->vmas);
 	list_splice(&pinned_vmas, &eb->vmas);
@@ -1168,7 +1153,7 @@ eb_move_to_gpu(struct i915_execbuffer *eb)
 			i915_gem_clflush_object(obj, 0);
 
 		ret = i915_gem_request_await_object
-			(eb->request, obj, obj->base.pending_write_domain);
+			(eb->request, obj, vma->exec_entry->flags & EXEC_OBJECT_WRITE);
 		if (ret)
 			return ret;
 	}
@@ -1364,12 +1349,10 @@ eb_move_to_active(struct i915_execbuffer *eb)
 	list_for_each_entry(vma, &eb->vmas, exec_link) {
 		struct drm_i915_gem_object *obj = vma->obj;
 
-		obj->base.write_domain = obj->base.pending_write_domain;
-		if (obj->base.write_domain)
-			vma->exec_entry->flags |= EXEC_OBJECT_WRITE;
-		else
-			obj->base.pending_read_domains |= obj->base.read_domains;
-		obj->base.read_domains = obj->base.pending_read_domains;
+		obj->base.write_domain = 0;
+		if (vma->exec_entry->flags & EXEC_OBJECT_WRITE)
+			obj->base.read_domains = 0;
+		obj->base.read_domains |= I915_GEM_GPU_DOMAINS;
 
 		i915_vma_move_to_active(vma, eb->request, vma->exec_entry->flags);
 		eb_export_fence(obj, eb->request, vma->exec_entry->flags);
@@ -1679,8 +1662,7 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 			goto err;
 	}
 
-	/* Set the pending read domains for the batch buffer to COMMAND */
-	if (eb.batch->obj->base.pending_write_domain) {
+	if (eb.batch->exec_entry->flags & EXEC_OBJECT_WRITE) {
 		DRM_DEBUG("Attempting to use self-modifying batch buffer\n");
 		ret = -EINVAL;
 		goto err;
@@ -1717,7 +1699,6 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 		}
 	}
 
-	eb.batch->obj->base.pending_read_domains |= I915_GEM_DOMAIN_COMMAND;
 	if (eb.batch_len == 0)
 		eb.batch_len = eb.batch->size - eb.batch_start_offset;
 
-- 
2.11.0

* [PATCH 22/27] drm/i915: Eliminate lots of iterations over the execobjects array
  2017-04-19  9:41 Confluence of eb + timeline improvements Chris Wilson
                   ` (20 preceding siblings ...)
  2017-04-19  9:41 ` [PATCH 21/27] drm/i915: Pass vma to relocate entry Chris Wilson
@ 2017-04-19  9:41 ` Chris Wilson
  2017-04-20  8:49   ` Joonas Lahtinen
  2017-04-19  9:41 ` [PATCH 23/27] drm/i915: First try the previous execbuffer location Chris Wilson
                   ` (8 subsequent siblings)
  30 siblings, 1 reply; 95+ messages in thread
From: Chris Wilson @ 2017-04-19  9:41 UTC (permalink / raw)
  To: intel-gfx

The major scaling bottleneck in execbuffer is the processing of the
execobjects. Creating an auxiliary list is inefficient when compared to
using the execobject array we already have allocated.

Reservation is then split into phases. As we look up the VMA, we
try to bind it back into its active location. Only if that fails do we add
it to the unbound list for phase 2. In phase 2, we try and add all those
objects that could not fit into their previous location, with fallback
to retrying all objects and evicting the VM in case of severe
fragmentation. (This is the same as before, except that phase 1 is now
done inline with looking up the VMA to avoid an iteration over the
execobject array. In the ideal case, we eliminate the separate reservation
phase). During the reservation phase, we only evict from the VM between
passes (rather than, as we do currently, while trying to fit every new
VMA). In testing with Unreal Engine's Atlantis demo, which stresses the
eviction logic on gen7 class hardware, this speeds up the framerate by a
factor of 2.

The second loop amalgamation is between move_to_gpu and move_to_active.
As we always submit the request, even if incomplete, we can use the
current request to track the active VMAs as we perform the flushes and
synchronisation required.

The next big advancement is to avoid copying back to the user any
execobjects and relocations that are not changed.

v2: Add a Theory of Operation spiel.
v3: Fall back to slow relocations in preparation for flushing userptrs.
v4: Document struct members, factor out eb_validate_vma(), add a few
more comments to explain some magic and hide other magic behind macros.

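To make the retry flow concrete, here is a toy model in plain C (the
helpers and the slot counting are invented for illustration; only the
pass-counting shape matches the eb_reserve() added below): keep what
still fits, try to place the rest, and on -ENOSPC drop every
reservation, purge the address space and make one more attempt.

#include <errno.h>
#include <stdbool.h>

struct toy_vma {
	bool bound;	/* still holding a reservation */
};

static int toy_free_slots = 2;	/* stand-in for available address space */

static int toy_pin(struct toy_vma *v)
{
	if (!toy_free_slots)
		return -ENOSPC;
	toy_free_slots--;
	v->bound = true;
	return 0;
}

static void toy_unreserve(struct toy_vma *v)
{
	if (v->bound)
		toy_free_slots++;
	v->bound = false;
}

static int toy_evict_all(void)
{
	return 0;	/* the real code idles the GPU and unbinds the whole VM */
}

static int toy_reserve(struct toy_vma *vmas, int count)
{
	int pass = 0;

	do {
		int i, err = 0;

		/* Try to bind everything still unbound (in the real code,
		 * phase 1 has already kept anything that still fit).
		 */
		for (i = 0; i < count; i++) {
			if (vmas[i].bound)
				continue;
			err = toy_pin(&vmas[i]);
			if (err)
				break;
		}
		if (err != -ENOSPC || pass++)
			return err;

		/* Too fragmented: drop all reservations and retry once.
		 * (The real code also re-sorts the objects so the hardest
		 * to place are attempted first.)
		 */
		for (i = 0; i < count; i++)
			toy_unreserve(&vmas[i]);
		err = toy_evict_all();
		if (err)
			return err;
	} while (1);
}

int main(void)
{
	struct toy_vma vmas[3] = { { false }, { false }, { false } };

	return toy_reserve(vmas, 3) == -ENOSPC ? 0 : 1;
}

With three vmas and two free slots, toy_reserve() performs exactly one
evict-and-retry before giving up with -ENOSPC, which is the same pass
accounting as the new reservation loop.
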
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/i915_drv.h                 |    2 +-
 drivers/gpu/drm/i915/i915_gem_evict.c           |   92 +-
 drivers/gpu/drm/i915/i915_gem_execbuffer.c      | 1953 +++++++++++++----------
 drivers/gpu/drm/i915/i915_vma.c                 |    2 +-
 drivers/gpu/drm/i915/i915_vma.h                 |    1 +
 drivers/gpu/drm/i915/selftests/i915_gem_evict.c |    4 +-
 drivers/gpu/drm/i915/selftests/i915_vma.c       |   16 +-
 7 files changed, 1187 insertions(+), 883 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 7b6926861e04..b0c4b9cb75c2 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -3558,7 +3558,7 @@ int __must_check i915_gem_evict_something(struct i915_address_space *vm,
 int __must_check i915_gem_evict_for_node(struct i915_address_space *vm,
 					 struct drm_mm_node *node,
 					 unsigned int flags);
-int i915_gem_evict_vm(struct i915_address_space *vm, bool do_idle);
+int i915_gem_evict_vm(struct i915_address_space *vm);
 
 /* belongs in i915_gem_gtt.h */
 static inline void i915_gem_chipset_flush(struct drm_i915_private *dev_priv)
diff --git a/drivers/gpu/drm/i915/i915_gem_evict.c b/drivers/gpu/drm/i915/i915_gem_evict.c
index 204a2d9288ae..a193f1b36c67 100644
--- a/drivers/gpu/drm/i915/i915_gem_evict.c
+++ b/drivers/gpu/drm/i915/i915_gem_evict.c
@@ -50,6 +50,29 @@ static bool ggtt_is_idle(struct drm_i915_private *dev_priv)
 	return true;
 }
 
+static int ggtt_flush(struct drm_i915_private *i915)
+{
+	int err;
+
+	/* Not everything in the GGTT is tracked via vma (otherwise we
+	 * could evict as required with minimal stalling) so we are forced
+	 * to idle the GPU and explicitly retire outstanding requests in
+	 * the hopes that we can then remove contexts and the like only
+	 * bound by their active reference.
+	 */
+	err = i915_gem_switch_to_kernel_context(i915);
+	if (err)
+		return err;
+
+	err = i915_gem_wait_for_idle(i915,
+				     I915_WAIT_INTERRUPTIBLE |
+				     I915_WAIT_LOCKED);
+	if (err)
+		return err;
+
+	return 0;
+}
+
 static bool
 mark_free(struct drm_mm_scan *scan,
 	  struct i915_vma *vma,
@@ -175,19 +198,7 @@ i915_gem_evict_something(struct i915_address_space *vm,
 		return intel_has_pending_fb_unpin(dev_priv) ? -EAGAIN : -ENOSPC;
 	}
 
-	/* Not everything in the GGTT is tracked via vma (otherwise we
-	 * could evict as required with minimal stalling) so we are forced
-	 * to idle the GPU and explicitly retire outstanding requests in
-	 * the hopes that we can then remove contexts and the like only
-	 * bound by their active reference.
-	 */
-	ret = i915_gem_switch_to_kernel_context(dev_priv);
-	if (ret)
-		return ret;
-
-	ret = i915_gem_wait_for_idle(dev_priv,
-				     I915_WAIT_INTERRUPTIBLE |
-				     I915_WAIT_LOCKED);
+	ret = ggtt_flush(dev_priv);
 	if (ret)
 		return ret;
 
@@ -337,10 +348,8 @@ int i915_gem_evict_for_node(struct i915_address_space *vm,
 /**
  * i915_gem_evict_vm - Evict all idle vmas from a vm
  * @vm: Address space to cleanse
- * @do_idle: Boolean directing whether to idle first.
  *
- * This function evicts all idles vmas from a vm. If all unpinned vmas should be
- * evicted the @do_idle needs to be set to true.
+ * This function evicts all vmas from a vm.
  *
  * This is used by the execbuf code as a last-ditch effort to defragment the
  * address space.
@@ -348,37 +357,50 @@ int i915_gem_evict_for_node(struct i915_address_space *vm,
  * To clarify: This is for freeing up virtual address space, not for freeing
  * memory in e.g. the shrinker.
  */
-int i915_gem_evict_vm(struct i915_address_space *vm, bool do_idle)
+int i915_gem_evict_vm(struct i915_address_space *vm)
 {
+	struct list_head *phases[] = {
+		&vm->inactive_list,
+		&vm->active_list,
+		NULL
+	}, **phase;
+	struct list_head eviction_list;
 	struct i915_vma *vma, *next;
 	int ret;
 
 	lockdep_assert_held(&vm->i915->drm.struct_mutex);
 	trace_i915_gem_evict_vm(vm);
 
-	if (do_idle) {
-		struct drm_i915_private *dev_priv = vm->i915;
-
-		if (i915_is_ggtt(vm)) {
-			ret = i915_gem_switch_to_kernel_context(dev_priv);
-			if (ret)
-				return ret;
-		}
-
-		ret = i915_gem_wait_for_idle(dev_priv,
-					     I915_WAIT_INTERRUPTIBLE |
-					     I915_WAIT_LOCKED);
+	/* Switch back to the default context in order to unpin
+	 * the existing context objects. However, such objects only
+	 * pin themselves inside the global GTT and performing the
+	 * switch otherwise is ineffective.
+	 */
+	if (i915_is_ggtt(vm)) {
+		ret = ggtt_flush(vm->i915);
 		if (ret)
 			return ret;
-
-		WARN_ON(!list_empty(&vm->active_list));
 	}
 
-	list_for_each_entry_safe(vma, next, &vm->inactive_list, vm_link)
-		if (!i915_vma_is_pinned(vma))
-			WARN_ON(i915_vma_unbind(vma));
+	INIT_LIST_HEAD(&eviction_list);
+	phase = phases;
+	do {
+		list_for_each_entry(vma, *phase, vm_link) {
+			if (i915_vma_is_pinned(vma))
+				continue;
 
-	return 0;
+			__i915_vma_pin(vma);
+			list_add(&vma->evict_link, &eviction_list);
+		}
+	} while (*++phase);
+
+	ret = 0;
+	list_for_each_entry_safe(vma, next, &eviction_list, evict_link) {
+		__i915_vma_unpin(vma);
+		if (ret == 0)
+			ret = i915_vma_unbind(vma);
+	}
+	return ret;
 }
 
 #if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index f166dae7ef28..de41e423d3f7 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -42,41 +42,193 @@
 
 #define DBG_USE_CPU_RELOC 0 /* -1 force GTT relocs; 1 force CPU relocs */
 
-#define  __EXEC_OBJECT_HAS_PIN		(1<<31)
-#define  __EXEC_OBJECT_HAS_FENCE	(1<<30)
-#define  __EXEC_OBJECT_NEEDS_MAP	(1<<29)
-#define  __EXEC_OBJECT_NEEDS_BIAS	(1<<28)
+#define  __EXEC_OBJECT_HAS_PIN		BIT(31)
+#define  __EXEC_OBJECT_HAS_FENCE	BIT(30)
+#define  __EXEC_OBJECT_NEEDS_MAP	BIT(29)
+#define  __EXEC_OBJECT_NEEDS_BIAS	BIT(28)
 #define  __EXEC_OBJECT_INTERNAL_FLAGS (0xf<<28) /* all of the above */
+#define __EB_RESERVED (__EXEC_OBJECT_HAS_PIN | __EXEC_OBJECT_HAS_FENCE)
+
+#define __EXEC_HAS_RELOC	BIT(31)
+#define __EXEC_VALIDATED	BIT(30)
+#define UPDATE			PIN_OFFSET_FIXED
 
 #define BATCH_OFFSET_BIAS (256*1024)
 
 #define __I915_EXEC_ILLEGAL_FLAGS \
 	(__I915_EXEC_UNKNOWN_FLAGS | I915_EXEC_CONSTANTS_MASK)
 
+/**
+ * DOC: User command execution
+ *
+ * Userspace submits commands to be executed on the GPU as an instruction
+ * stream within a GEM object we call a batchbuffer. These instructions may
+ * refer to other GEM objects containing auxiliary state such as kernels,
+ * samplers, render targets and even secondary batchbuffers. Userspace does
+ * not know where in the GPU memory these objects reside and so before the
+ * batchbuffer is passed to the GPU for execution, those addresses in the
+ * batchbuffer and auxiliary objects are updated. This is known as relocation,
+ * or patching. To try and avoid having to relocate each object on the next
+ * execution, userspace is told the location of those objects in this pass,
+ * but this remains just a hint as the kernel may choose a new location for
+ * any object in the future.
+ *
+ * Processing an execbuf ioctl is conceptually split up into a few phases.
+ *
+ * 1. Validation - Ensure all the pointers, handles and flags are valid.
+ * 2. Reservation - Assign GPU address space for every object
+ * 3. Relocation - Update any addresses to point to the final locations
+ * 4. Serialisation - Order the request with respect to its dependencies
+ * 5. Construction - Construct a request to execute the batchbuffer
+ * 6. Submission (at some point in the future execution)
+ *
+ * Reserving resources for the execbuf is the most complicated phase. We
+ * neither want to have to migrate the object in the address space, nor do
+ * we want to have to update any relocations pointing to this object. Ideally,
+ * we want to leave the object where it is and for all the existing relocations
+ * to match. If the object is given a new address, or if userspace thinks the
+ * object is elsewhere, we have to parse all the relocation entries and update
+ * the addresses. Userspace can set the I915_EXEC_NO_RELOC flag to hint that
+ * all the target addresses in all of its objects match the value in the
+ * relocation entries and that they all match the presumed offsets given by the
+ * list of execbuffer objects. Using this knowledge, we know that if we haven't
+ * moved any buffers, all the relocation entries are valid and we can skip
+ * the update. (If userspace is wrong, the likely outcome is an impromptu GPU
+ * hang.) The requirements for using I915_EXEC_NO_RELOC are:
+ *
+ *      The addresses written in the objects must match the corresponding
+ *      reloc.presumed_offset which in turn must match the corresponding
+ *      execobject.offset.
+ *
+ *      Any render targets written to in the batch must be flagged with
+ *      EXEC_OBJECT_WRITE.
+ *
+ *      To avoid stalling, execobject.offset should match the current
+ *      address of that object within the active context.
+ *
+ * The reservation is done in multiple phases. First we try and keep any
+ * object already bound in its current location - so long as it meets the
+ * constraints imposed by the new execbuffer. Any object left unbound after the
+ * first pass is then fitted into any available idle space. If an object does
+ * not fit, all objects are removed from the reservation and the process rerun
+ * after sorting the objects into a priority order (more difficult to fit
+ * objects are tried first). Failing that, the entire VM is cleared and we try
+ * to fit the execbuf one last time before concluding that it simply will not
+ * fit.
+ *
+ * A small complication to all of this is that we allow userspace not only to
+ * specify an alignment and a size for the object in the address space, but
+ * we also allow userspace to specify the exact offset. These objects are
+ * simpler to place (the location is known a priori); all we have to do is make
+ * sure the space is available.
+ *
+ * Once all the objects are in place, patching up the buried pointers to point
+ * to the final locations is a fairly simple job of walking over the relocation
+ * entry arrays, looking up the right address and rewriting the value into
+ * the object. Simple! ... The relocation entries are stored in user memory
+ * and so to access them we have to copy them into a local buffer. That copy
+ * has to avoid taking any pagefaults as they may lead back to a GEM object
+ * requiring the struct_mutex (i.e. recursive deadlock). So once again we split
+ * the relocation into multiple passes. First we try to do everything within an
+ * atomic context (avoid the pagefaults) which requires that we never wait. If
+ * we detect that we may wait, or if we need to fault, then we have to fallback
+ * to a slower path. The slowpath has to drop the mutex. (Can you hear alarm
+ * bells yet?) Dropping the mutex means that we lose all the state we have
+ * built up so far for the execbuf and we must reset any global data. However,
+ * we do leave the objects pinned in their final locations - which is a
+ * potential issue for concurrent execbufs. Once we have left the mutex, we can
+ * allocate and copy all the relocation entries into a large array at our
+ * leisure, reacquire the mutex, reclaim all the objects and other state and
+ * then proceed to update any incorrect addresses with the objects.
+ *
+ * As we process the relocation entries, we maintain a record of whether the
+ * object is being written to. Using NO_RELOC, we expect userspace to provide
+ * this information instead. We also check whether we can skip the relocation
+ * by comparing the expected value inside the relocation entry with the target's
+ * final address. If they differ, we have to map the current object and rewrite
+ * the 4 or 8 byte pointer within.
+ *
+ * Serialising an execbuf is quite simple according to the rules of the GEM
+ * ABI. Execution within each context is ordered by the order of submission.
+ * Writes to any GEM object are in order of submission and are exclusive. Reads
+ * from a GEM object are unordered with respect to other reads, but ordered by
+ * writes. A write submitted after a read cannot occur before the read, and
+ * similarly any read submitted after a write cannot occur before the write.
+ * Writes are ordered between engines such that only one write occurs at any
+ * time (completing any reads beforehand) - using semaphores where available
+ * and CPU serialisation otherwise. Other GEM accesses obey the same rules: any
+ * write (either via mmaps using set-domain, or via pwrite) must flush all GPU
+ * reads before starting, and any read (either using set-domain or pread) must
+ * flush all GPU writes before starting. (Note we only employ a barrier before,
+ * we currently rely on userspace not concurrently starting a new execution
+ * whilst reading or writing to an object. This may be an advantage or not
+ * depending on how much you trust userspace not to shoot themselves in the
+ * foot.) Serialisation may just result in the request being inserted into
+ * a DAG awaiting its turn, but most simple is to wait on the CPU until
+ * all dependencies are resolved.
+ *
+ * After all of that, it is just a matter of closing the request and handing it to
+ * the hardware (well, leaving it in a queue to be executed). However, we also
+ * offer the ability for batchbuffers to be run with elevated privileges so
+ * that they access otherwise hidden registers. (Used to adjust L3 cache etc.)
+ * Before any batch is given extra privileges we first must check that it
+ * contains no nefarious instructions, we check that each instruction is from
+ * our whitelist and all registers are also from an allowed list. We first
+ * copy the user's batchbuffer to a shadow (so that the user doesn't have
+ * access to it, either by the CPU or GPU as we scan it) and then parse each
+ * instruction. If everything is ok, we set a flag telling the hardware to run
+ * the batchbuffer in trusted mode, otherwise the ioctl is rejected.
+ */
+
 struct i915_execbuffer {
-	struct drm_i915_private *i915;
-	struct drm_file *file;
-	struct drm_i915_gem_execbuffer2 *args;
-	struct drm_i915_gem_exec_object2 *exec;
-	struct intel_engine_cs *engine;
-	struct i915_gem_context *ctx;
-	struct i915_address_space *vm;
-	struct i915_vma *batch;
-	struct drm_i915_gem_request *request;
-	u32 batch_start_offset;
-	u32 batch_len;
-	unsigned int dispatch_flags;
-	struct drm_i915_gem_exec_object2 shadow_exec_entry;
-	bool need_relocs;
-	struct list_head vmas;
+	struct drm_i915_private *i915; /** i915 backpointer */
+	struct drm_file *file; /** per-file lookup tables and limits */
+	struct drm_i915_gem_execbuffer2 *args; /** ioctl parameters */
+	struct drm_i915_gem_exec_object2 *exec; /** ioctl execobj[] */
+
+	struct intel_engine_cs *engine; /** engine to queue the request to */
+	struct i915_gem_context *ctx; /** context for building the request */
+	struct i915_address_space *vm; /** GTT and vma for the request */
+
+	struct drm_i915_gem_request *request; /** our request to build */
+	struct i915_vma *batch; /** identity of the batch obj/vma */
+
+	/** actual size of execobj[] as we may extend it for the cmdparser */
+	unsigned int buffer_count;
+
+	/** list of vma not yet bound during reservation phase */
+	struct list_head unbound;
+
+	/** list of vma that have execobj.relocation_count */
+	struct list_head relocs;
+
+	/** Track the most recently used object for relocations, as we
+	 * frequently have to perform multiple relocations within the same
+	 * obj/page
+	 */
 	struct reloc_cache {
-		struct drm_mm_node node;
-		unsigned long vaddr;
-		unsigned int page;
+		struct drm_mm_node node; /** temporary GTT binding */
+		unsigned long vaddr; /** Current kmap address */
+		unsigned long page; /** Currently mapped page index */
 		bool use_64bit_reloc : 1;
+		bool has_llc : 1;
+		bool has_fence : 1;
+		bool needs_unfenced : 1;
 	} reloc_cache;
-	int lut_mask;
-	struct hlist_head *buckets;
+
+	u64 invalid_flags; /** Set of execobj.flags that are invalid */
+	u32 context_flags; /** Set of execobj.flags to insert from the ctx */
+
+	u32 batch_start_offset; /** Location within object of batch */
+	u32 batch_len; /** Length of batch within object */
+	u32 batch_flags; /** Flags composed for emit_bb_start() */
+
+	/** Indicates either the size of the hashtable used to resolve
+	 * relocation handles, or if negative that we are using a direct
+	 * index into the execobj[].
+	 */
+	int lut_size;
+	struct hlist_head *buckets; /** ht for relocation handles */
 };
 
 /* As an alternative to creating a hashtable of handle-to-vma for a batch,
@@ -86,12 +238,40 @@ struct i915_execbuffer {
 #define __exec_to_vma(ee) (ee)->rsvd2
 #define exec_to_vma(ee) u64_to_ptr(struct i915_vma, __exec_to_vma(ee))
 
+/* Used to convert any address to canonical form.
+ * Starting from gen8, some commands (e.g. STATE_BASE_ADDRESS,
+ * MI_LOAD_REGISTER_MEM and others, see Broadwell PRM Vol2a) require the
+ * addresses to be in a canonical form:
+ * "GraphicsAddress[63:48] are ignored by the HW and assumed to be in correct
+ * canonical form [63:48] == [47]."
+ */
+#define GEN8_HIGH_ADDRESS_BIT 47
+static inline u64 gen8_canonical_addr(u64 address)
+{
+	return sign_extend64(address, GEN8_HIGH_ADDRESS_BIT);
+}
+
+static inline u64 gen8_noncanonical_addr(u64 address)
+{
+	return address & GENMASK_ULL(GEN8_HIGH_ADDRESS_BIT, 0);
+}
+
 static int
 eb_create(struct i915_execbuffer *eb)
 {
-	if ((eb->args->flags & I915_EXEC_HANDLE_LUT) == 0) {
-		unsigned int size = 1 + ilog2(eb->args->buffer_count);
-
+	if (!(eb->args->flags & I915_EXEC_HANDLE_LUT)) {
+		unsigned int size = 1 + ilog2(eb->buffer_count);
+
+		/* Without a 1:1 association between relocation handles and
+		 * the execobject[] index, we instead create a hashtable.
+		 * We size it dynamically based on available memory, starting
+		 * first with a 1:1 associative hash and scaling back until
+		 * the allocation succeeds.
+		 *
+		 * Later on we use a positive lut_size to indicate we are
+		 * using this hashtable, and a negative value to indicate a
+		 * direct lookup.
+		 */
 		do {
 			eb->buckets = kzalloc(sizeof(struct hlist_head) << size,
 					     GFP_TEMPORARY | __GFP_NOWARN | __GFP_NORETRY);
@@ -106,112 +286,396 @@ eb_create(struct i915_execbuffer *eb)
 				return -ENOMEM;
 		}
 
-		eb->lut_mask = size;
+		eb->lut_size = size;
 	} else {
-		eb->lut_mask = -eb->args->buffer_count;
+		eb->lut_size = -eb->buffer_count;
 	}
 
 	return 0;
 }
 
+static bool
+eb_vma_misplaced(const struct drm_i915_gem_exec_object2 *entry,
+		 const struct i915_vma *vma)
+{
+	if (!(entry->flags & __EXEC_OBJECT_HAS_PIN))
+		return true;
+
+	if (vma->node.size < entry->pad_to_size)
+		return true;
+
+	if (entry->alignment && !IS_ALIGNED(vma->node.start, entry->alignment))
+		return true;
+
+	if (entry->flags & EXEC_OBJECT_PINNED &&
+	    vma->node.start != entry->offset)
+		return true;
+
+	if (entry->flags & __EXEC_OBJECT_NEEDS_BIAS &&
+	    vma->node.start < BATCH_OFFSET_BIAS)
+		return true;
+
+	if (!(entry->flags & EXEC_OBJECT_SUPPORTS_48B_ADDRESS) &&
+	    (vma->node.start + vma->node.size - 1) >> 32)
+		return true;
+
+	return false;
+}
+
+static inline void
+eb_pin_vma(struct i915_execbuffer *eb,
+	   struct drm_i915_gem_exec_object2 *entry,
+	   struct i915_vma *vma)
+{
+	u64 flags;
+
+	flags = vma->node.start;
+	flags |= PIN_USER | PIN_NONBLOCK | PIN_OFFSET_FIXED;
+	if (unlikely(entry->flags & EXEC_OBJECT_NEEDS_GTT))
+		flags |= PIN_GLOBAL;
+	if (unlikely(i915_vma_pin(vma, 0, 0, flags)))
+		return;
+
+	if (unlikely(entry->flags & EXEC_OBJECT_NEEDS_FENCE)) {
+		if (unlikely(i915_vma_get_fence(vma))) {
+			i915_vma_unpin(vma);
+			return;
+		}
+
+		if (i915_vma_pin_fence(vma))
+			entry->flags |= __EXEC_OBJECT_HAS_FENCE;
+	}
+
+	entry->flags |= __EXEC_OBJECT_HAS_PIN;
+}
+
 static inline void
 __eb_unreserve_vma(struct i915_vma *vma,
 		   const struct drm_i915_gem_exec_object2 *entry)
 {
+	GEM_BUG_ON(!(entry->flags & __EXEC_OBJECT_HAS_PIN));
+
 	if (unlikely(entry->flags & __EXEC_OBJECT_HAS_FENCE))
 		i915_vma_unpin_fence(vma);
 
-	if (entry->flags & __EXEC_OBJECT_HAS_PIN)
-		__i915_vma_unpin(vma);
+	__i915_vma_unpin(vma);
 }
 
-static void
-eb_unreserve_vma(struct i915_vma *vma)
+static inline void
+eb_unreserve_vma(struct i915_vma *vma,
+		 struct drm_i915_gem_exec_object2 *entry)
 {
-	struct drm_i915_gem_exec_object2 *entry = vma->exec_entry;
+	if (entry->flags & __EXEC_OBJECT_HAS_PIN) {
+		__eb_unreserve_vma(vma, entry);
+		entry->flags &= ~__EB_RESERVED;
+	}
+}
+
+static int
+eb_validate_vma(struct i915_execbuffer *eb,
+		struct drm_i915_gem_exec_object2 *entry,
+		struct i915_vma *vma)
+{
+	if (unlikely(entry->flags & eb->invalid_flags))
+		return -EINVAL;
+
+	if (unlikely(entry->alignment && !is_power_of_2(entry->alignment)))
+		return -EINVAL;
 
-	__eb_unreserve_vma(vma, entry);
-	entry->flags &= ~(__EXEC_OBJECT_HAS_FENCE | __EXEC_OBJECT_HAS_PIN);
+	/* Offset can be used as input (EXEC_OBJECT_PINNED), reject
+	 * any non-page-aligned or non-canonical addresses.
+	 */
+	if (entry->flags & EXEC_OBJECT_PINNED) {
+		if (unlikely(entry->offset !=
+			     gen8_canonical_addr(entry->offset & PAGE_MASK)))
+			return -EINVAL;
+	}
+
+	/* From drm_mm perspective address space is continuous,
+	 * so from this point we're always using non-canonical
+	 * form internally.
+	 */
+	entry->offset = gen8_noncanonical_addr(entry->offset);
+
+	/* pad_to_size was once a reserved field, so sanitize it */
+	if (entry->flags & EXEC_OBJECT_PAD_TO_SIZE) {
+		if (unlikely(offset_in_page(entry->pad_to_size)))
+			return -EINVAL;
+	} else {
+		entry->pad_to_size = 0;
+	}
+
+	if (unlikely(vma->exec_entry)) {
+		DRM_DEBUG("Object [handle %d, index %d] appears more than once in object list\n",
+			  entry->handle, (int)(entry - eb->exec));
+		return -EINVAL;
+	}
+
+	return 0;
 }
 
-static void
-eb_reset(struct i915_execbuffer *eb)
+static int
+eb_add_vma(struct i915_execbuffer *eb,
+	   struct drm_i915_gem_exec_object2 *entry,
+	   struct i915_vma *vma)
 {
-	struct i915_vma *vma;
+	int err;
 
-	list_for_each_entry(vma, &eb->vmas, exec_link) {
-		eb_unreserve_vma(vma);
-		i915_vma_put(vma);
-		vma->exec_entry = NULL;
+	GEM_BUG_ON(i915_vma_is_closed(vma));
+
+	if (!(eb->args->flags & __EXEC_VALIDATED)) {
+		err = eb_validate_vma(eb, entry, vma);
+		if (unlikely(err))
+			return err;
 	}
 
-	if (eb->lut_mask >= 0)
-		memset(eb->buckets, 0,
-		       sizeof(struct hlist_head) << eb->lut_mask);
+	/* Stash a pointer from the vma to execobj, so we can query its flags,
+	 * size, alignment etc as provided by the user. Also we stash a pointer
+	 * to the vma inside the execobj so that we can use a direct lookup
+	 * to find the right target VMA when doing relocations.
+	 */
+	vma->exec_entry = entry;
+	__exec_to_vma(entry) = (uintptr_t)i915_vma_get(vma);
+
+	if (eb->lut_size >= 0) {
+		vma->exec_handle = entry->handle;
+		hlist_add_head(&vma->exec_node,
+			       &eb->buckets[hash_32(entry->handle,
+						    eb->lut_size)]);
+	}
+
+	if (entry->relocation_count)
+		list_add_tail(&vma->reloc_link, &eb->relocs);
+
+	if (!eb->reloc_cache.has_fence) {
+		entry->flags &= ~EXEC_OBJECT_NEEDS_FENCE;
+	} else {
+		if ((entry->flags & EXEC_OBJECT_NEEDS_FENCE ||
+		     eb->reloc_cache.needs_unfenced) &&
+		    i915_gem_object_is_tiled(vma->obj))
+			entry->flags |= EXEC_OBJECT_NEEDS_GTT | __EXEC_OBJECT_NEEDS_MAP;
+	}
+
+	if (!(entry->flags & EXEC_OBJECT_PINNED))
+		entry->flags |= eb->context_flags;
+
+	err = 0;
+	if (vma->node.size)
+		eb_pin_vma(eb, entry, vma);
+	if (eb_vma_misplaced(entry, vma)) {
+		eb_unreserve_vma(vma, entry);
+
+		list_add_tail(&vma->exec_link, &eb->unbound);
+		if (drm_mm_node_allocated(&vma->node))
+			err = i915_vma_unbind(vma);
+	} else {
+		if (entry->offset != vma->node.start) {
+			entry->offset = vma->node.start | UPDATE;
+			eb->args->flags |= __EXEC_HAS_RELOC;
+		}
+	}
+	return err;
 }
 
-static bool
-eb_add_vma(struct i915_execbuffer *eb, struct i915_vma *vma, int i)
+static inline int use_cpu_reloc(const struct reloc_cache *cache,
+				const struct drm_i915_gem_object *obj)
 {
-	if (unlikely(vma->exec_entry)) {
-		DRM_DEBUG("Object [handle %d, index %d] appears more than once in object list\n",
-			  eb->exec[i].handle, i);
+	if (!i915_gem_object_has_struct_page(obj))
 		return false;
+
+	if (DBG_USE_CPU_RELOC)
+		return DBG_USE_CPU_RELOC > 0;
+
+	return (cache->has_llc ||
+		obj->cache_dirty ||
+		obj->cache_level != I915_CACHE_NONE);
+}
+
+static int
+eb_reserve_vma(struct i915_execbuffer *eb, struct i915_vma *vma)
+{
+	struct drm_i915_gem_exec_object2 *entry = vma->exec_entry;
+	u64 flags;
+	int err;
+
+	flags = PIN_USER | PIN_NONBLOCK;
+	if (entry->flags & EXEC_OBJECT_NEEDS_GTT)
+		flags |= PIN_GLOBAL;
+
+	if (!drm_mm_node_allocated(&vma->node)) {
+		/* Wa32bitGeneralStateOffset & Wa32bitInstructionBaseOffset,
+		 * limit address to the first 4GBs for unflagged objects.
+		 */
+		if (!(entry->flags & EXEC_OBJECT_SUPPORTS_48B_ADDRESS))
+			flags |= PIN_ZONE_4G;
+
+		if (entry->flags & __EXEC_OBJECT_NEEDS_MAP)
+			flags |= PIN_MAPPABLE;
+
+		if (entry->flags & EXEC_OBJECT_PINNED) {
+			flags |= entry->offset | PIN_OFFSET_FIXED;
+			/* force overlapping PINNED checks */
+			flags &= ~PIN_NONBLOCK;
+		} else if (entry->flags & __EXEC_OBJECT_NEEDS_BIAS)
+			flags |= BATCH_OFFSET_BIAS | PIN_OFFSET_BIAS;
 	}
-	list_add_tail(&vma->exec_link, &eb->vmas);
 
-	vma->exec_entry = &eb->exec[i];
-	if (eb->lut_mask >= 0) {
-		vma->exec_handle = eb->exec[i].handle;
-		hlist_add_head(&vma->exec_node,
-			       &eb->buckets[hash_32(vma->exec_handle,
-						    eb->lut_mask)]);
+	err = i915_vma_pin(vma, entry->pad_to_size, entry->alignment, flags);
+	if (err)
+		return err;
+
+	if (entry->offset != vma->node.start) {
+		entry->offset = vma->node.start | UPDATE;
+		eb->args->flags |= __EXEC_HAS_RELOC;
 	}
 
-	i915_vma_get(vma);
-	__exec_to_vma(&eb->exec[i]) = (uintptr_t)vma;
-	return true;
+	entry->flags |= __EXEC_OBJECT_HAS_PIN;
+	GEM_BUG_ON(eb_vma_misplaced(entry, vma));
+
+	if (unlikely(entry->flags & EXEC_OBJECT_NEEDS_FENCE)) {
+		err = i915_vma_get_fence(vma);
+		if (unlikely(err)) {
+			i915_vma_unpin(vma);
+			return err;
+		}
+
+		if (i915_vma_pin_fence(vma))
+			entry->flags |= __EXEC_OBJECT_HAS_FENCE;
+	}
+
+	return 0;
+}
+
+static int eb_reserve(struct i915_execbuffer *eb)
+{
+	const unsigned int count = eb->buffer_count;
+	struct list_head last;
+	struct i915_vma *vma;
+	unsigned int i, pass;
+	int err;
+
+	/* Attempt to pin all of the buffers into the GTT.
+	 * This is done in 3 phases:
+	 *
+	 * 1a. Unbind all objects that do not match the GTT constraints for
+	 *     the execbuffer (fenceable, mappable, alignment etc).
+	 * 1b. Increment pin count for already bound objects.
+	 * 2.  Bind new objects.
+	 * 3.  Decrement pin count.
+	 *
+	 * This avoids unnecessary unbinding of later objects in order to make
+	 * room for the earlier objects *unless* we need to defragment.
+	 */
+
+	pass = 0;
+	err = 0;
+	do {
+		list_for_each_entry(vma, &eb->unbound, exec_link) {
+			err = eb_reserve_vma(eb, vma);
+			if (err)
+				break;
+		}
+		if (err != -ENOSPC || pass++)
+			return err;
+
+		/* Resort *all* the objects into priority order */
+		INIT_LIST_HEAD(&eb->unbound);
+		INIT_LIST_HEAD(&last);
+		for (i = 0; i < count; i++) {
+			struct drm_i915_gem_exec_object2 *entry = &eb->exec[i];
+
+			vma = exec_to_vma(entry);
+			eb_unreserve_vma(vma, entry);
+
+			if (entry->flags & EXEC_OBJECT_PINNED)
+				list_add(&vma->exec_link, &eb->unbound);
+			else if (entry->flags & __EXEC_OBJECT_NEEDS_MAP)
+				list_add_tail(&vma->exec_link, &eb->unbound);
+			else
+				list_add_tail(&vma->exec_link, &last);
+		}
+		list_splice_tail(&last, &eb->unbound);
+
+		/* Too fragmented, unbind everything and retry */
+		err = i915_gem_evict_vm(eb->vm);
+		if (err)
+			return err;
+	} while (1);
 }
 
 static inline struct hlist_head *
-ht_head(const struct i915_gem_context *ctx, u32 handle)
+ht_head(const struct i915_gem_context_vma_lut *lut, u32 handle)
 {
-	return &ctx->vma_lut.ht[hash_32(handle, ctx->vma_lut.ht_bits)];
+	return &lut->ht[hash_32(handle, lut->ht_bits)];
 }
 
 static inline bool
-ht_needs_resize(const struct i915_gem_context *ctx)
+ht_needs_resize(const struct i915_gem_context_vma_lut *lut)
 {
-	return (4*ctx->vma_lut.ht_count > 3*ctx->vma_lut.ht_size ||
-		4*ctx->vma_lut.ht_count + 1 < ctx->vma_lut.ht_size);
+	return (4*lut->ht_count > 3*lut->ht_size ||
+		4*lut->ht_count + 1 < lut->ht_size);
+}
+
+static unsigned int eb_batch_index(const struct i915_execbuffer *eb)
+{
+	return eb->buffer_count - 1;
+}
+
+static int eb_select_context(struct i915_execbuffer *eb)
+{
+	struct i915_gem_context *ctx;
+
+	ctx = i915_gem_context_lookup(eb->file->driver_priv, eb->args->rsvd1);
+	if (unlikely(IS_ERR(ctx)))
+		return PTR_ERR(ctx);
+
+	if (unlikely(i915_gem_context_is_banned(ctx))) {
+		DRM_DEBUG("Context %u tried to submit while banned\n",
+			  ctx->user_handle);
+		return -EIO;
+	}
+
+	eb->ctx = i915_gem_context_get(ctx);
+	eb->vm = ctx->ppgtt ? &ctx->ppgtt->base : &eb->i915->ggtt.base;
+
+	eb->context_flags = 0;
+	if (ctx->flags & CONTEXT_NO_ZEROMAP)
+		eb->context_flags |= __EXEC_OBJECT_NEEDS_BIAS;
+
+	return 0;
 }
 
 static int
 eb_lookup_vmas(struct i915_execbuffer *eb)
 {
 #define INTERMEDIATE BIT(0)
-	const int count = eb->args->buffer_count;
+	const unsigned int count = eb->buffer_count;
+	struct i915_gem_context_vma_lut *lut = &eb->ctx->vma_lut;
 	struct i915_vma *vma;
+	struct idr *idr;
+	unsigned int i;
 	int slow_pass = -1;
-	int i;
+	int err;
 
-	INIT_LIST_HEAD(&eb->vmas);
+	INIT_LIST_HEAD(&eb->relocs);
+	INIT_LIST_HEAD(&eb->unbound);
 
-	if (unlikely(eb->ctx->vma_lut.ht_size & I915_CTX_RESIZE_IN_PROGRESS))
-		flush_work(&eb->ctx->vma_lut.resize);
-	GEM_BUG_ON(eb->ctx->vma_lut.ht_size & I915_CTX_RESIZE_IN_PROGRESS);
+	if (unlikely(lut->ht_size & I915_CTX_RESIZE_IN_PROGRESS))
+		flush_work(&lut->resize);
+	GEM_BUG_ON(lut->ht_size & I915_CTX_RESIZE_IN_PROGRESS);
 
 	for (i = 0; i < count; i++) {
 		__exec_to_vma(&eb->exec[i]) = 0;
 
 		hlist_for_each_entry(vma,
-				     ht_head(eb->ctx, eb->exec[i].handle),
+				     ht_head(lut, eb->exec[i].handle),
 				     ctx_node) {
 			if (vma->ctx_handle != eb->exec[i].handle)
 				continue;
 
-			if (!eb_add_vma(eb, vma, i))
-				return -EINVAL;
+			err = eb_add_vma(eb, &eb->exec[i], vma);
+			if (unlikely(err))
+				return err;
 
 			goto next_vma;
 		}
@@ -222,24 +686,25 @@ next_vma: ;
 	}
 
 	if (slow_pass < 0)
-		return 0;
+		goto out;
 
 	spin_lock(&eb->file->table_lock);
 	/* Grab a reference to the object and release the lock so we can lookup
 	 * or create the VMA without using GFP_ATOMIC */
+	idr = &eb->file->object_idr;
 	for (i = slow_pass; i < count; i++) {
 		struct drm_i915_gem_object *obj;
 
 		if (__exec_to_vma(&eb->exec[i]))
 			continue;
 
-		obj = to_intel_bo(idr_find(&eb->file->object_idr,
-					   eb->exec[i].handle));
+		obj = to_intel_bo(idr_find(idr, eb->exec[i].handle));
 		if (unlikely(!obj)) {
 			spin_unlock(&eb->file->table_lock);
 			DRM_DEBUG("Invalid object handle %d at index %d\n",
 				  eb->exec[i].handle, i);
-			return -ENOENT;
+			err = -ENOENT;
+			goto err;
 		}
 
 		__exec_to_vma(&eb->exec[i]) = INTERMEDIATE | (uintptr_t)obj;
@@ -249,7 +714,7 @@ next_vma: ;
 	for (i = slow_pass; i < count; i++) {
 		struct drm_i915_gem_object *obj;
 
-		if ((__exec_to_vma(&eb->exec[i]) & INTERMEDIATE) == 0)
+		if (!(__exec_to_vma(&eb->exec[i]) & INTERMEDIATE))
 			continue;
 
 		/*
@@ -260,12 +725,13 @@ next_vma: ;
 		 * from the (obj, vm) we don't run the risk of creating
 		 * duplicated vmas for the same vm.
 		 */
-		obj = u64_to_ptr(struct drm_i915_gem_object,
+		obj = u64_to_ptr(typeof(*obj),
 				 __exec_to_vma(&eb->exec[i]) & ~INTERMEDIATE);
 		vma = i915_vma_instance(obj, eb->vm, NULL);
 		if (unlikely(IS_ERR(vma))) {
 			DRM_DEBUG("Failed to lookup VMA\n");
-			return PTR_ERR(vma);
+			err = PTR_ERR(vma);
+			goto err;
 		}
 
 		/* First come, first served */
@@ -273,32 +739,31 @@ next_vma: ;
 			vma->ctx = eb->ctx;
 			vma->ctx_handle = eb->exec[i].handle;
 			hlist_add_head(&vma->ctx_node,
-				       ht_head(eb->ctx, eb->exec[i].handle));
-			eb->ctx->vma_lut.ht_count++;
+				       ht_head(lut, eb->exec[i].handle));
+			lut->ht_count++;
+			lut->ht_size |= I915_CTX_RESIZE_IN_PROGRESS;
 			if (i915_vma_is_ggtt(vma)) {
 				GEM_BUG_ON(obj->vma_hashed);
 				obj->vma_hashed = vma;
 			}
 		}
 
-		if (!eb_add_vma(eb, vma, i))
-			return -EINVAL;
+		err = eb_add_vma(eb, &eb->exec[i], vma);
+		if (unlikely(err))
+			goto err;
 	}
 
-	if (ht_needs_resize(eb->ctx)) {
-		eb->ctx->vma_lut.ht_size |= I915_CTX_RESIZE_IN_PROGRESS;
-		queue_work(system_highpri_wq, &eb->ctx->vma_lut.resize);
+	if (lut->ht_size & I915_CTX_RESIZE_IN_PROGRESS) {
+		if (ht_needs_resize(lut))
+			queue_work(system_highpri_wq, &lut->resize);
+		else
+			lut->ht_size &= ~I915_CTX_RESIZE_IN_PROGRESS;
 	}
 
-	return 0;
-#undef INTERMEDIATE
-}
-
-static struct i915_vma *
-eb_get_batch(struct i915_execbuffer *eb)
-{
-	struct i915_vma *vma =
-		exec_to_vma(&eb->exec[eb->args->buffer_count - 1]);
+out:
+	/* take note of the batch buffer before we might reorder the lists */
+	i = eb_batch_index(eb);
+	eb->batch = exec_to_vma(&eb->exec[i]);
 
 	/*
 	 * SNA is doing fancy tricks with compressing batch buffers, which leads
@@ -309,24 +774,36 @@ eb_get_batch(struct i915_execbuffer *eb)
 	 * Note that actual hangs have only been observed on gen7, but for
 	 * paranoia do it everywhere.
 	 */
-	if ((vma->exec_entry->flags & EXEC_OBJECT_PINNED) == 0)
-		vma->exec_entry->flags |= __EXEC_OBJECT_NEEDS_BIAS;
+	if (!(eb->exec[i].flags & EXEC_OBJECT_PINNED))
+		eb->exec[i].flags |= __EXEC_OBJECT_NEEDS_BIAS;
+	if (eb->reloc_cache.has_fence)
+		eb->exec[i].flags |= EXEC_OBJECT_NEEDS_FENCE;
 
-	return vma;
+	eb->args->flags |= __EXEC_VALIDATED;
+	return eb_reserve(eb);
+
+err:
+	for (i = slow_pass; i < count; i++) {
+		if (__exec_to_vma(&eb->exec[i]) & INTERMEDIATE)
+			__exec_to_vma(&eb->exec[i]) = 0;
+	}
+	lut->ht_size &= ~I915_CTX_RESIZE_IN_PROGRESS;
+	return err;
+#undef INTERMEDIATE
 }
 
 static struct i915_vma *
-eb_get_vma(struct i915_execbuffer *eb, unsigned long handle)
+eb_get_vma(const struct i915_execbuffer *eb, unsigned long handle)
 {
-	if (eb->lut_mask < 0) {
-		if (handle >= -eb->lut_mask)
+	if (eb->lut_size < 0) {
+		if (handle >= -eb->lut_size)
 			return NULL;
 		return exec_to_vma(&eb->exec[handle]);
 	} else {
 		struct hlist_head *head;
 		struct i915_vma *vma;
 
-		head = &eb->buckets[hash_32(handle, eb->lut_mask)];
+		head = &eb->buckets[hash_32(handle, eb->lut_size)];
 		hlist_for_each_entry(vma, head, exec_node) {
 			if (vma->exec_handle == handle)
 				return vma;
@@ -335,61 +812,60 @@ eb_get_vma(struct i915_execbuffer *eb, unsigned long handle)
 	}
 }
 
-static void eb_destroy(struct i915_execbuffer *eb)
+static void
+eb_reset(const struct i915_execbuffer *eb)
 {
-	struct i915_vma *vma;
+	const unsigned int count = eb->buffer_count;
+	unsigned int i;
 
-	list_for_each_entry(vma, &eb->vmas, exec_link) {
-		if (!vma->exec_entry)
-			continue;
+	for (i = 0; i < count; i++) {
+		struct drm_i915_gem_exec_object2 *entry = &eb->exec[i];
+		struct i915_vma *vma = exec_to_vma(entry);
 
-		__eb_unreserve_vma(vma, vma->exec_entry);
+		eb_unreserve_vma(vma, entry);
 		vma->exec_entry = NULL;
 		i915_vma_put(vma);
 	}
 
-	i915_gem_context_put(eb->ctx);
-
-	if (eb->lut_mask >= 0)
-		kfree(eb->buckets);
+	if (eb->lut_size >= 0)
+		memset(eb->buckets, 0,
+		       sizeof(struct hlist_head) << eb->lut_size);
 }
 
-static inline int use_cpu_reloc(struct drm_i915_gem_object *obj)
+static void eb_release_vma(const struct i915_execbuffer *eb)
 {
-	if (!i915_gem_object_has_struct_page(obj))
-		return false;
+	const unsigned int count = eb->buffer_count;
+	unsigned int i;
 
-	if (DBG_USE_CPU_RELOC)
-		return DBG_USE_CPU_RELOC > 0;
+	if (!eb->exec)
+		return;
 
-	return (HAS_LLC(to_i915(obj->base.dev)) ||
-		obj->cache_dirty ||
-		obj->cache_level != I915_CACHE_NONE);
-}
+	for (i = 0; i < count; i++) {
+		struct drm_i915_gem_exec_object2 *entry = &eb->exec[i];
+		struct i915_vma *vma = exec_to_vma(entry);
 
-/* Used to convert any address to canonical form.
- * Starting from gen8, some commands (e.g. STATE_BASE_ADDRESS,
- * MI_LOAD_REGISTER_MEM and others, see Broadwell PRM Vol2a) require the
- * addresses to be in a canonical form:
- * "GraphicsAddress[63:48] are ignored by the HW and assumed to be in correct
- * canonical form [63:48] == [47]."
- */
-#define GEN8_HIGH_ADDRESS_BIT 47
-static inline uint64_t gen8_canonical_addr(uint64_t address)
-{
-	return sign_extend64(address, GEN8_HIGH_ADDRESS_BIT);
+		if (!vma || !vma->exec_entry)
+			continue;
+
+		GEM_BUG_ON(vma->exec_entry != entry);
+		if (entry->flags & __EXEC_OBJECT_HAS_PIN)
+			__eb_unreserve_vma(vma, entry);
+		vma->exec_entry = NULL;
+		i915_vma_put(vma);
+	}
 }
 
-static inline uint64_t gen8_noncanonical_addr(uint64_t address)
+static void eb_destroy(const struct i915_execbuffer *eb)
 {
-	return address & ((1ULL << (GEN8_HIGH_ADDRESS_BIT + 1)) - 1);
+	if (eb->lut_size >= 0)
+		kfree(eb->buckets);
 }
 
-static inline uint64_t
+static inline u64
 relocation_target(const struct drm_i915_gem_relocation_entry *reloc,
-		  uint64_t target_offset)
+		  const struct i915_vma *target)
 {
-	return gen8_canonical_addr((int)reloc->delta + target_offset);
+	return gen8_canonical_addr((int)reloc->delta + target->node.start);
 }
 
 static void reloc_cache_init(struct reloc_cache *cache,
@@ -398,6 +874,9 @@ static void reloc_cache_init(struct reloc_cache *cache,
 	cache->page = -1;
 	cache->vaddr = 0;
 	/* Must be a variable in the struct to allow GCC to unroll. */
+	cache->has_llc = HAS_LLC(i915);
+	cache->has_fence = INTEL_GEN(i915) < 4;
+	cache->needs_unfenced = INTEL_INFO(i915)->unfenced_needs_alignment;
 	cache->use_64bit_reloc = HAS_64BIT_RELOC(i915);
 	cache->node.allocated = false;
 }
@@ -456,7 +935,7 @@ static void reloc_cache_reset(struct reloc_cache *cache)
 
 static void *reloc_kmap(struct drm_i915_gem_object *obj,
 			struct reloc_cache *cache,
-			int page)
+			unsigned long page)
 {
 	void *vaddr;
 
@@ -464,11 +943,11 @@ static void *reloc_kmap(struct drm_i915_gem_object *obj,
 		kunmap_atomic(unmask_page(cache->vaddr));
 	} else {
 		unsigned int flushes;
-		int ret;
+		int err;
 
-		ret = i915_gem_obj_prepare_shmem_write(obj, &flushes);
-		if (ret)
-			return ERR_PTR(ret);
+		err = i915_gem_obj_prepare_shmem_write(obj, &flushes);
+		if (err)
+			return ERR_PTR(err);
 
 		BUILD_BUG_ON(KMAP & CLFLUSH_FLAGS);
 		BUILD_BUG_ON((KMAP | CLFLUSH_FLAGS) & PAGE_MASK);
@@ -488,7 +967,7 @@ static void *reloc_kmap(struct drm_i915_gem_object *obj,
 
 static void *reloc_iomap(struct drm_i915_gem_object *obj,
 			 struct reloc_cache *cache,
-			 int page)
+			 unsigned long page)
 {
 	struct i915_ggtt *ggtt = cache_to_ggtt(cache);
 	unsigned long offset;
@@ -498,31 +977,31 @@ static void *reloc_iomap(struct drm_i915_gem_object *obj,
 		io_mapping_unmap_atomic((void __force __iomem *) unmask_page(cache->vaddr));
 	} else {
 		struct i915_vma *vma;
-		int ret;
+		int err;
 
-		if (use_cpu_reloc(obj))
+		if (use_cpu_reloc(cache, obj))
 			return NULL;
 
-		ret = i915_gem_object_set_to_gtt_domain(obj, true);
-		if (ret)
-			return ERR_PTR(ret);
+		err = i915_gem_object_set_to_gtt_domain(obj, true);
+		if (err)
+			return ERR_PTR(err);
 
 		vma = i915_gem_object_ggtt_pin(obj, NULL, 0, 0,
 					       PIN_MAPPABLE | PIN_NONBLOCK);
 		if (IS_ERR(vma)) {
 			memset(&cache->node, 0, sizeof(cache->node));
-			ret = drm_mm_insert_node_in_range
+			err = drm_mm_insert_node_in_range
 				(&ggtt->base.mm, &cache->node,
 				 PAGE_SIZE, 0, I915_COLOR_UNEVICTABLE,
 				 0, ggtt->mappable_end,
 				 DRM_MM_INSERT_LOW);
-			if (ret) /* no inactive aperture space, use cpu reloc */
+			if (err) /* no inactive aperture space, use cpu reloc */
 				return NULL;
 		} else {
-			ret = i915_vma_put_fence(vma);
-			if (ret) {
+			err = i915_vma_put_fence(vma);
+			if (err) {
 				i915_vma_unpin(vma);
-				return ERR_PTR(ret);
+				return ERR_PTR(err);
 			}
 
 			cache->node.start = vma->node.start;
@@ -550,7 +1029,7 @@ static void *reloc_iomap(struct drm_i915_gem_object *obj,
 
 static void *reloc_vaddr(struct drm_i915_gem_object *obj,
 			 struct reloc_cache *cache,
-			 int page)
+			 unsigned long page)
 {
 	void *vaddr;
 
@@ -589,25 +1068,26 @@ static void clflush_write32(u32 *addr, u32 value, unsigned int flushes)
 		*addr = value;
 }
 
-static int
-relocate_entry(struct drm_i915_gem_object *obj,
+static u64
+relocate_entry(struct i915_vma *vma,
 	       const struct drm_i915_gem_relocation_entry *reloc,
-	       struct reloc_cache *cache,
-	       u64 target_offset)
+	       struct i915_execbuffer *eb,
+	       const struct i915_vma *target)
 {
+	struct drm_i915_gem_object *obj = vma->obj;
 	u64 offset = reloc->offset;
-	bool wide = cache->use_64bit_reloc;
+	u64 target_offset = relocation_target(reloc, target);
+	bool wide = eb->reloc_cache.use_64bit_reloc;
 	void *vaddr;
 
-	target_offset = relocation_target(reloc, target_offset);
 repeat:
-	vaddr = reloc_vaddr(obj, cache, offset >> PAGE_SHIFT);
+	vaddr = reloc_vaddr(obj, &eb->reloc_cache, offset >> PAGE_SHIFT);
 	if (IS_ERR(vaddr))
 		return PTR_ERR(vaddr);
 
 	clflush_write32(vaddr + offset_in_page(offset),
 			lower_32_bits(target_offset),
-			cache->vaddr);
+			eb->reloc_cache.vaddr);
 
 	if (wide) {
 		offset += sizeof(u32);
@@ -616,17 +1096,16 @@ relocate_entry(struct drm_i915_gem_object *obj,
 		goto repeat;
 	}
 
-	return 0;
+	return gen8_canonical_addr(target->node.start) | UPDATE;
 }
 
-static int
-eb_relocate_entry(struct i915_vma *vma,
-		  struct i915_execbuffer *eb,
-		  struct drm_i915_gem_relocation_entry *reloc)
+static u64
+eb_relocate_entry(struct i915_execbuffer *eb,
+		  struct i915_vma *vma,
+		  const struct drm_i915_gem_relocation_entry *reloc)
 {
 	struct i915_vma *target;
-	u64 target_offset;
-	int ret;
+	int err;
 
 	/* we've already hold a reference to all valid objects */
 	target = eb_get_vma(eb, reloc->target_handle);
@@ -656,26 +1135,28 @@ eb_relocate_entry(struct i915_vma *vma,
 		return -EINVAL;
 	}
 
-	if (reloc->write_domain)
+	if (reloc->write_domain) {
 		target->exec_entry->flags |= EXEC_OBJECT_WRITE;
 
-	/* Sandybridge PPGTT errata: We need a global gtt mapping for MI and
-	 * pipe_control writes because the gpu doesn't properly redirect them
-	 * through the ppgtt for non_secure batchbuffers.
-	 */
-	if (unlikely(IS_GEN6(eb->i915) &&
-		     reloc->write_domain == I915_GEM_DOMAIN_INSTRUCTION)) {
-		ret = i915_vma_bind(target, target->obj->cache_level,
-				    PIN_GLOBAL);
-		if (WARN_ONCE(ret, "Unexpected failure to bind target VMA!"))
-			return ret;
+		/* Sandybridge PPGTT errata: We need a global gtt mapping
+		 * for MI and pipe_control writes because the gpu doesn't
+		 * properly redirect them through the ppgtt for non_secure
+		 * batchbuffers.
+		 */
+		if (reloc->write_domain == I915_GEM_DOMAIN_INSTRUCTION &&
+		    IS_GEN6(eb->i915)) {
+			err = i915_vma_bind(target, target->obj->cache_level,
+					    PIN_GLOBAL);
+			if (WARN_ONCE(err,
+				      "Unexpected failure to bind target VMA!"))
+				return err;
+		}
 	}
 
 	/* If the relocation already has the right value in it, no
 	 * more work needs to be done.
 	 */
-	target_offset = gen8_canonical_addr(target->node.start);
-	if (target_offset == reloc->presumed_offset)
+	if (gen8_canonical_addr(target->node.start) == reloc->presumed_offset)
 		return 0;
 
 	/* Check that the relocation address is valid... */
@@ -696,33 +1177,34 @@ eb_relocate_entry(struct i915_vma *vma,
 		return -EINVAL;
 	}
 
-	ret = relocate_entry(vma->obj, reloc, &eb->reloc_cache, target_offset);
-	if (ret)
-		return ret;
-
 	/* and update the user's relocation entry */
-	reloc->presumed_offset = target_offset;
-	return 0;
+	return relocate_entry(vma, reloc, eb, target);
 }
 
-static int eb_relocate_vma(struct i915_vma *vma, struct i915_execbuffer *eb)
+static int eb_relocate_vma(struct i915_execbuffer *eb, struct i915_vma *vma)
 {
 #define N_RELOC(x) ((x) / sizeof(struct drm_i915_gem_relocation_entry))
-	struct drm_i915_gem_relocation_entry stack_reloc[N_RELOC(512)];
-	struct drm_i915_gem_relocation_entry __user *user_relocs;
-	struct drm_i915_gem_exec_object2 *entry = vma->exec_entry;
-	int remain, ret = 0;
-
-	user_relocs = u64_to_user_ptr(entry->relocs_ptr);
+	struct drm_i915_gem_relocation_entry stack[N_RELOC(512)];
+	struct drm_i915_gem_relocation_entry __user *urelocs;
+	const struct drm_i915_gem_exec_object2 *entry = vma->exec_entry;
+	unsigned int remain;
 
+	urelocs = u64_to_user_ptr(entry->relocs_ptr);
 	remain = entry->relocation_count;
-	while (remain) {
-		struct drm_i915_gem_relocation_entry *r = stack_reloc;
-		unsigned long unwritten;
-		unsigned int count;
+	if (unlikely(remain > N_RELOC(ULONG_MAX)))
+		return -EINVAL;
 
-		count = min_t(unsigned int, remain, ARRAY_SIZE(stack_reloc));
-		remain -= count;
+	/*
+	 * We must check that the entire relocation array is safe
+	 * to read. However, if the array is not writable the user loses
+	 * the updated relocation values.
+	 */
+
+	do {
+		struct drm_i915_gem_relocation_entry *r = stack;
+		unsigned int count =
+			min_t(unsigned int, remain, ARRAY_SIZE(stack));
+		unsigned int copied;
 
 		/* This is the fast path and we cannot handle a pagefault
 		 * whilst holding the struct mutex lest the user pass in the
@@ -732,385 +1214,284 @@ static int eb_relocate_vma(struct i915_vma *vma, struct i915_execbuffer *eb)
 		 * this is bad and so lockdep complains vehemently.
 		 */
 		pagefault_disable();
-		unwritten = __copy_from_user_inatomic(r, user_relocs, count*sizeof(r[0]));
+		copied = __copy_from_user_inatomic(r, urelocs, count * sizeof(r[0]));
 		pagefault_enable();
-		if (unlikely(unwritten)) {
-			ret = -EFAULT;
+		if (unlikely(copied)) {
+			remain = -EFAULT;
 			goto out;
 		}
 
+		remain -= count;
 		do {
-			u64 offset = r->presumed_offset;
+			u64 offset = eb_relocate_entry(eb, vma, r);
 
-			ret = eb_relocate_entry(vma, eb, r);
-			if (ret)
+			if (likely(offset == 0)) {
+			} else if ((s64)offset < 0) {
+				remain = (s64)offset;
 				goto out;
-
-			if (r->presumed_offset != offset) {
+			} else {
+				/* Note that reporting an error now
+				 * leaves everything in an inconsistent
+				 * state as we have *already* changed
+				 * the relocation value inside the
+				 * object. As we have not changed the
+				 * reloc.presumed_offset or will not
+				 * change the execobject.offset, on the
+				 * call we may not rewrite the value
+				 * inside the object, leaving it
+				 * dangling and causing a GPU hang. Unless
+				 * userspace dynamically rebuilds the
+				 * relocations on each execbuf rather than
+				 * presume a static tree.
+				 *
+				 * We did previously check if the relocations
+				 * were writable (access_ok), an error now
+				 * would be a strange race with mprotect,
+				 * having already demonstrated that we
+				 * can read from this userspace address.
+				 */
 				pagefault_disable();
-				unwritten = __put_user(r->presumed_offset,
-						       &user_relocs->presumed_offset);
+				__put_user(offset & ~UPDATE,
+					   &urelocs[r-stack].presumed_offset);
 				pagefault_enable();
-				if (unlikely(unwritten)) {
-					/* Note that reporting an error now
-					 * leaves everything in an inconsistent
-					 * state as we have *already* changed
-					 * the relocation value inside the
-					 * object. As we have not changed the
-					 * reloc.presumed_offset or will not
-					 * change the execobject.offset, on the
-					 * call we may not rewrite the value
-					 * inside the object, leaving it
-					 * dangling and causing a GPU hang.
-					 */
-					ret = -EFAULT;
-					goto out;
-				}
 			}
-
-			user_relocs++;
-			r++;
-		} while (--count);
-	}
-
+		} while (r++, --count);
+		urelocs += ARRAY_SIZE(stack);
+	} while (remain);
 out:
 	reloc_cache_reset(&eb->reloc_cache);
-	return ret;
-#undef N_RELOC
+	return remain;
 }
 
 static int
-eb_relocate_vma_slow(struct i915_vma *vma,
-		     struct i915_execbuffer *eb,
-		     struct drm_i915_gem_relocation_entry *relocs)
+eb_relocate_vma_slow(struct i915_execbuffer *eb, struct i915_vma *vma)
 {
 	const struct drm_i915_gem_exec_object2 *entry = vma->exec_entry;
-	int i, ret = 0;
+	struct drm_i915_gem_relocation_entry *relocs =
+		u64_to_ptr(typeof(*relocs), entry->relocs_ptr);
+	unsigned int i;
+	int err;
 
 	for (i = 0; i < entry->relocation_count; i++) {
-		ret = eb_relocate_entry(vma, eb, &relocs[i]);
-		if (ret)
-			break;
+		u64 offset = eb_relocate_entry(eb, vma, &relocs[i]);
+
+		if ((s64)offset < 0) {
+			err = (s64)offset;
+			goto err;
+		}
 	}
+	err = 0;
+err:
 	reloc_cache_reset(&eb->reloc_cache);
-	return ret;
+	return err;
 }
 
 static int eb_relocate(struct i915_execbuffer *eb)
 {
 	struct i915_vma *vma;
-	int ret = 0;
-
-	list_for_each_entry(vma, &eb->vmas, exec_link) {
-		ret = eb_relocate_vma(vma, eb);
-		if (ret)
-			break;
-	}
-
-	return ret;
-}
-
-static bool only_mappable_for_reloc(unsigned int flags)
-{
-	return (flags & (EXEC_OBJECT_NEEDS_FENCE | __EXEC_OBJECT_NEEDS_MAP)) ==
-		__EXEC_OBJECT_NEEDS_MAP;
-}
-
-static int
-eb_reserve_vma(struct i915_vma *vma,
-	       struct intel_engine_cs *engine,
-	       bool *need_reloc)
-{
-	struct drm_i915_gem_exec_object2 *entry = vma->exec_entry;
-	uint64_t flags;
-	int ret;
-
-	flags = PIN_USER;
-	if (entry->flags & EXEC_OBJECT_NEEDS_GTT)
-		flags |= PIN_GLOBAL;
-
-	if (!drm_mm_node_allocated(&vma->node)) {
-		/* Wa32bitGeneralStateOffset & Wa32bitInstructionBaseOffset,
-		 * limit address to the first 4GBs for unflagged objects.
-		 */
-		if ((entry->flags & EXEC_OBJECT_SUPPORTS_48B_ADDRESS) == 0)
-			flags |= PIN_ZONE_4G;
-		if (entry->flags & __EXEC_OBJECT_NEEDS_MAP)
-			flags |= PIN_GLOBAL | PIN_MAPPABLE;
-		if (entry->flags & __EXEC_OBJECT_NEEDS_BIAS)
-			flags |= BATCH_OFFSET_BIAS | PIN_OFFSET_BIAS;
-		if (entry->flags & EXEC_OBJECT_PINNED)
-			flags |= entry->offset | PIN_OFFSET_FIXED;
-		if ((flags & PIN_MAPPABLE) == 0)
-			flags |= PIN_HIGH;
-	}
-
-	ret = i915_vma_pin(vma,
-			   entry->pad_to_size,
-			   entry->alignment,
-			   flags);
-	if ((ret == -ENOSPC || ret == -E2BIG) &&
-	    only_mappable_for_reloc(entry->flags))
-		ret = i915_vma_pin(vma,
-				   entry->pad_to_size,
-				   entry->alignment,
-				   flags & ~PIN_MAPPABLE);
-	if (ret)
-		return ret;
-
-	entry->flags |= __EXEC_OBJECT_HAS_PIN;
 
-	if (entry->flags & EXEC_OBJECT_NEEDS_FENCE) {
-		ret = i915_vma_get_fence(vma);
-		if (ret)
-			return ret;
-
-		if (i915_vma_pin_fence(vma))
-			entry->flags |= __EXEC_OBJECT_HAS_FENCE;
-	}
-
-	if (entry->offset != vma->node.start) {
-		entry->offset = vma->node.start;
-		*need_reloc = true;
+	/* The objects are in their final locations, apply the relocations. */
+	list_for_each_entry(vma, &eb->relocs, reloc_link) {
+		int err = eb_relocate_vma(eb, vma);
+		if (err)
+			return err;
 	}
 
 	return 0;
 }
 
-static bool
-need_reloc_mappable(struct i915_vma *vma)
+static int check_relocations(const struct drm_i915_gem_exec_object2 *entry)
 {
-	struct drm_i915_gem_exec_object2 *entry = vma->exec_entry;
+	const char __user *addr, *end;
+	unsigned long size;
+	char __maybe_unused c;
 
-	if (entry->relocation_count == 0)
-		return false;
-
-	if (!i915_vma_is_ggtt(vma))
-		return false;
+	size = entry->relocation_count;
+	if (size == 0)
+		return 0;
 
-	/* See also use_cpu_reloc() */
-	if (HAS_LLC(to_i915(vma->obj->base.dev)))
-		return false;
+	if (size > N_RELOC(ULONG_MAX))
+		return -EINVAL;
 
-	if (vma->obj->base.write_domain == I915_GEM_DOMAIN_CPU)
-		return false;
+	addr = u64_to_user_ptr(entry->relocs_ptr);
+	size *= sizeof(struct drm_i915_gem_relocation_entry);
+	if (!access_ok(VERIFY_WRITE, addr, size))
+		return -EFAULT;
 
-	return true;
+	end = addr + size;
+	for (; addr < end; addr += PAGE_SIZE) {
+		int err = __get_user(c, addr);
+		if (err)
+			return err;
+	}
+	return __get_user(c, end - 1);
 }
 
-static bool
-eb_vma_misplaced(struct i915_vma *vma)
+static int
+eb_copy_relocations(const struct i915_execbuffer *eb)
 {
-	struct drm_i915_gem_exec_object2 *entry = vma->exec_entry;
+	const unsigned int count = eb->buffer_count;
+	unsigned int i;
+	int err;
 
-	WARN_ON(entry->flags & __EXEC_OBJECT_NEEDS_MAP &&
-		!i915_vma_is_ggtt(vma));
+	for (i = 0; i < count; i++) {
+		const unsigned int nreloc = eb->exec[i].relocation_count;
+		struct drm_i915_gem_relocation_entry __user *urelocs;
+		struct drm_i915_gem_relocation_entry *relocs;
+		unsigned long size;
+		unsigned long copied;
 
-	if (entry->alignment && !IS_ALIGNED(vma->node.start, entry->alignment))
-		return true;
+		if (nreloc == 0)
+			continue;
 
-	if (vma->node.size < entry->pad_to_size)
-		return true;
+		err = check_relocations(&eb->exec[i]);
+		if (err)
+			goto err;
 
-	if (entry->flags & EXEC_OBJECT_PINNED &&
-	    vma->node.start != entry->offset)
-		return true;
+		urelocs = u64_to_user_ptr(eb->exec[i].relocs_ptr);
+		size = nreloc * sizeof(*relocs);
 
-	if (entry->flags & __EXEC_OBJECT_NEEDS_BIAS &&
-	    vma->node.start < BATCH_OFFSET_BIAS)
-		return true;
-
-	/* avoid costly ping-pong once a batch bo ended up non-mappable */
-	if (entry->flags & __EXEC_OBJECT_NEEDS_MAP &&
-	    !i915_vma_is_map_and_fenceable(vma))
-		return !only_mappable_for_reloc(entry->flags);
+		relocs = drm_malloc_gfp(size, 1, GFP_TEMPORARY);
+		if (!relocs) {
+			drm_free_large(relocs);
+			err = -ENOMEM;
+			goto err;
+		}
 
-	if ((entry->flags & EXEC_OBJECT_SUPPORTS_48B_ADDRESS) == 0 &&
-	    (vma->node.start + vma->node.size - 1) >> 32)
-		return true;
+		/* copy_from_user is limited to < 4GiB */
+		copied = 0;
+		do {
+			unsigned int len =
+				min_t(u64, BIT_ULL(31), size - copied);
+
+			if (__copy_from_user((char *)relocs + copied,
+					     (char *)urelocs + copied,
+					     len)) {
+				drm_free_large(relocs);
+				err = -EFAULT;
+				goto err;
+			}
 
-	return false;
-}
+			copied += len;
+		} while (copied < size);
 
-static int eb_reserve(struct i915_execbuffer *eb)
-{
-	const bool has_fenced_gpu_access = INTEL_GEN(eb->i915) < 4;
-	const bool needs_unfenced_map = INTEL_INFO(eb->i915)->unfenced_needs_alignment;
-	struct i915_vma *vma;
-	struct list_head ordered_vmas;
-	struct list_head pinned_vmas;
-	int retry;
-
-	INIT_LIST_HEAD(&ordered_vmas);
-	INIT_LIST_HEAD(&pinned_vmas);
-	while (!list_empty(&eb->vmas)) {
-		struct drm_i915_gem_exec_object2 *entry;
-		bool need_fence, need_mappable;
-
-		vma = list_first_entry(&eb->vmas, struct i915_vma, exec_link);
-		entry = vma->exec_entry;
-
-		if (eb->ctx->flags & CONTEXT_NO_ZEROMAP)
-			entry->flags |= __EXEC_OBJECT_NEEDS_BIAS;
-
-		if (!has_fenced_gpu_access)
-			entry->flags &= ~EXEC_OBJECT_NEEDS_FENCE;
-		need_fence =
-			(entry->flags & EXEC_OBJECT_NEEDS_FENCE ||
-			 needs_unfenced_map) &&
-			i915_gem_object_is_tiled(vma->obj);
-		need_mappable = need_fence || need_reloc_mappable(vma);
-
-		if (entry->flags & EXEC_OBJECT_PINNED)
-			list_move_tail(&vma->exec_link, &pinned_vmas);
-		else if (need_mappable) {
-			entry->flags |= __EXEC_OBJECT_NEEDS_MAP;
-			list_move(&vma->exec_link, &ordered_vmas);
-		} else
-			list_move_tail(&vma->exec_link, &ordered_vmas);
-	}
-	list_splice(&ordered_vmas, &eb->vmas);
-	list_splice(&pinned_vmas, &eb->vmas);
+		/* As we do not update the known relocation offsets after
+		 * relocating (due to the complexities in lock handling),
+		 * we need to mark them as invalid now so that we force the
+		 * relocation processing next time. Just in case the target
+		 * object is evicted and then rebound into its old
+		 * presumed_offset before the next execbuffer - if that
+		 * happened we would make the mistake of assuming that the
+		 * relocations were valid.
+		 */
+		user_access_begin();
+		for (copied = 0; copied < nreloc; copied++)
+			unsafe_put_user(-1,
+					&urelocs[copied].presumed_offset,
+					end_user);
+end_user:
+		user_access_end();
 
-	/* Attempt to pin all of the buffers into the GTT.
-	 * This is done in 3 phases:
-	 *
-	 * 1a. Unbind all objects that do not match the GTT constraints for
-	 *     the execbuffer (fenceable, mappable, alignment etc).
-	 * 1b. Increment pin count for already bound objects.
-	 * 2.  Bind new objects.
-	 * 3.  Decrement pin count.
-	 *
-	 * This avoid unnecessary unbinding of later objects in order to make
-	 * room for the earlier objects *unless* we need to defragment.
-	 */
-	retry = 0;
-	do {
-		int ret = 0;
+		eb->exec[i].relocs_ptr = (uintptr_t)relocs;
+	}
 
-		/* Unbind any ill-fitting objects or pin. */
-		list_for_each_entry(vma, &eb->vmas, exec_link) {
-			if (!drm_mm_node_allocated(&vma->node))
-				continue;
+	return 0;
 
-			if (eb_vma_misplaced(vma))
-				ret = i915_vma_unbind(vma);
-			else
-				ret = eb_reserve_vma(vma, eb->engine, &eb->need_relocs);
-			if (ret)
-				goto err;
-		}
+err:
+	while (i--) {
+		struct drm_i915_gem_relocation_entry *relocs =
+			u64_to_ptr(typeof(*relocs), eb->exec[i].relocs_ptr);
+		if (eb->exec[i].relocation_count)
+			drm_free_large(relocs);
+	}
+	return err;
+}
 
-		/* Bind fresh objects */
-		list_for_each_entry(vma, &eb->vmas, exec_link) {
-			if (drm_mm_node_allocated(&vma->node))
-				continue;
+static int eb_prefault_relocations(const struct i915_execbuffer *eb)
+{
+	const unsigned int count = eb->buffer_count;
+	unsigned int i;
 
-			ret = eb_reserve_vma(vma, eb->engine, &eb->need_relocs);
-			if (ret)
-				goto err;
-		}
+	if (unlikely(i915.prefault_disable))
+		return 0;
 
-err:
-		if (ret != -ENOSPC || retry++)
-			return ret;
+	for (i = 0; i < count; i++) {
+		int err;
 
-		/* Decrement pin count for bound objects */
-		list_for_each_entry(vma, &eb->vmas, exec_link)
-			eb_unreserve_vma(vma);
+		err = check_relocations(&eb->exec[i]);
+		if (err)
+			return err;
+	}
 
-		ret = i915_gem_evict_vm(eb->vm, true);
-		if (ret)
-			return ret;
-	} while (1);
+	return 0;
 }
 
-static int
-eb_relocate_slow(struct i915_execbuffer *eb)
+static int eb_relocate_slow(struct i915_execbuffer *eb)
 {
-	const unsigned int count = eb->args->buffer_count;
 	struct drm_device *dev = &eb->i915->drm;
-	struct drm_i915_gem_relocation_entry *reloc;
+	bool have_copy = false;
 	struct i915_vma *vma;
-	int *reloc_offset;
-	int i, total, ret;
+	int err = 0;
+
+repeat:
+	if (signal_pending(current)) {
+		err = -ERESTARTSYS;
+		goto out;
+	}
 
 	/* We may process another execbuffer during the unlock... */
 	eb_reset(eb);
 	mutex_unlock(&dev->struct_mutex);
 
-	total = 0;
-	for (i = 0; i < count; i++)
-		total += eb->exec[i].relocation_count;
-
-	reloc_offset = drm_malloc_ab(count, sizeof(*reloc_offset));
-	reloc = drm_malloc_ab(total, sizeof(*reloc));
-	if (reloc == NULL || reloc_offset == NULL) {
-		drm_free_large(reloc);
-		drm_free_large(reloc_offset);
-		mutex_lock(&dev->struct_mutex);
-		return -ENOMEM;
+	/* We take 3 passes through the slowpath.
+	 *
+	 * 1 - we try to just prefault all the user relocation entries and
+	 * then attempt to reuse the atomic pagefault disabled fast path again.
+	 *
+	 * 2 - we copy the user entries to a local buffer here outside of the
+	 * lock and allow ourselves to wait upon any rendering before
+	 * relocations
+	 *
+	 * 3 - we already have a local copy of the relocation entries, but
+	 * were interrupted (EAGAIN) whilst waiting for the objects, try again.
+	 */
+	if (!err) {
+		err = eb_prefault_relocations(eb);
+	} else if (!have_copy) {
+		err = eb_copy_relocations(eb);
+		have_copy = err == 0;
+	} else {
+		cond_resched();
+		err = 0;
 	}
-
-	total = 0;
-	for (i = 0; i < count; i++) {
-		struct drm_i915_gem_relocation_entry __user *user_relocs;
-		u64 invalid_offset = (u64)-1;
-		int j;
-
-		user_relocs = u64_to_user_ptr(eb->exec[i].relocs_ptr);
-
-		if (copy_from_user(reloc+total, user_relocs,
-				   eb->exec[i].relocation_count * sizeof(*reloc))) {
-			ret = -EFAULT;
-			mutex_lock(&dev->struct_mutex);
-			goto err;
-		}
-
-		/* As we do not update the known relocation offsets after
-		 * relocating (due to the complexities in lock handling),
-		 * we need to mark them as invalid now so that we force the
-		 * relocation processing next time. Just in case the target
-		 * object is evicted and then rebound into its old
-		 * presumed_offset before the next execbuffer - if that
-		 * happened we would make the mistake of assuming that the
-		 * relocations were valid.
-		 */
-		for (j = 0; j < eb->exec[i].relocation_count; j++) {
-			if (__copy_to_user(&user_relocs[j].presumed_offset,
-					   &invalid_offset,
-					   sizeof(invalid_offset))) {
-				ret = -EFAULT;
-				mutex_lock(&dev->struct_mutex);
-				goto err;
-			}
-		}
-
-		reloc_offset[i] = total;
-		total += eb->exec[i].relocation_count;
+	if (err) {
+		mutex_lock(&dev->struct_mutex);
+		goto out;
 	}
 
-	ret = i915_mutex_lock_interruptible(dev);
-	if (ret) {
+	err = i915_mutex_lock_interruptible(dev);
+	if (err) {
 		mutex_lock(&dev->struct_mutex);
-		goto err;
+		goto out;
 	}
 
 	/* reacquire the objects */
-	ret = eb_lookup_vmas(eb);
-	if (ret)
-		goto err;
-
-	ret = eb_reserve(eb);
-	if (ret)
+	err = eb_lookup_vmas(eb);
+	if (err)
 		goto err;
 
-	list_for_each_entry(vma, &eb->vmas, exec_link) {
-		int idx = vma->exec_entry - eb->exec;
-
-		ret = eb_relocate_vma_slow(vma, eb, reloc + reloc_offset[idx]);
-		if (ret)
-			goto err;
+	list_for_each_entry(vma, &eb->relocs, reloc_link) {
+		if (!have_copy) {
+			pagefault_disable();
+			err = eb_relocate_vma(eb, vma);
+			pagefault_enable();
+			if (err)
+				goto repeat;
+		} else {
+			err = eb_relocate_vma_slow(eb, vma);
+			if (err)
+				goto err;
+		}
 	}
 
 	/* Leave the user relocations as are, this is the painfully slow path,
@@ -1120,21 +1501,61 @@ eb_relocate_slow(struct i915_execbuffer *eb)
 	 */
 
 err:
-	drm_free_large(reloc);
-	drm_free_large(reloc_offset);
-	return ret;
+	if (err == -EAGAIN)
+		goto repeat;
+
+out:
+	if (have_copy) {
+		const unsigned int count = eb->buffer_count;
+		unsigned int i;
+
+		for (i = 0; i < count; i++) {
+			const struct drm_i915_gem_exec_object2 *entry =
+				&eb->exec[i];
+			struct drm_i915_gem_relocation_entry *relocs;
+
+			if (!entry->relocation_count)
+				continue;
+
+			relocs = u64_to_ptr(typeof(*relocs), entry->relocs_ptr);
+			drm_free_large(relocs);
+		}
+	}
+
+	return err ?: have_copy;
+}
+
+static void eb_export_fence(struct drm_i915_gem_object *obj,
+			    struct drm_i915_gem_request *req,
+			    unsigned int flags)
+{
+	struct reservation_object *resv = obj->resv;
+
+	/* Ignore errors from failing to allocate the new fence, we can't
+	 * handle an error right now. Worst case should be missed
+	 * synchronisation leading to rendering corruption.
+	 */
+	reservation_object_lock(resv, NULL);
+	if (flags & EXEC_OBJECT_WRITE)
+		reservation_object_add_excl_fence(resv, &req->fence);
+	else if (reservation_object_reserve_shared(resv) == 0)
+		reservation_object_add_shared_fence(resv, &req->fence);
+	reservation_object_unlock(resv);
 }
 
 static int
 eb_move_to_gpu(struct i915_execbuffer *eb)
 {
-	struct i915_vma *vma;
-	int ret;
+	const unsigned int count = eb->buffer_count;
+	unsigned int i;
+	int err;
 
-	list_for_each_entry(vma, &eb->vmas, exec_link) {
+	for (i = 0; i < count; i++) {
+		const struct drm_i915_gem_exec_object2 *entry = &eb->exec[i];
+		struct i915_vma *vma = exec_to_vma(entry);
 		struct drm_i915_gem_object *obj = vma->obj;
 
-		if (vma->exec_entry->flags & EXEC_OBJECT_CAPTURE) {
+		if (entry->flags & EXEC_OBJECT_CAPTURE) {
 			struct i915_gem_capture_list *capture;
 
 			capture = kmalloc(sizeof(*capture), GFP_KERNEL);
@@ -1146,17 +1567,31 @@ eb_move_to_gpu(struct i915_execbuffer *eb)
 			eb->request->capture_list = capture;
 		}
 
-		if (vma->exec_entry->flags & EXEC_OBJECT_ASYNC)
-			continue;
+		if (entry->flags & EXEC_OBJECT_ASYNC)
+			goto skip_flushes;
 
 		if (obj->base.write_domain & obj->cache_dirty)
 			i915_gem_clflush_object(obj, 0);
 
-		ret = i915_gem_request_await_object
-			(eb->request, obj, vma->exec_entry->flags & EXEC_OBJECT_WRITE);
-		if (ret)
-			return ret;
+		err = i915_gem_request_await_object
+			(eb->request, obj, entry->flags & EXEC_OBJECT_WRITE);
+		if (err)
+			return err;
+
+skip_flushes:
+		i915_vma_move_to_active(vma, eb->request, entry->flags);
+		__eb_unreserve_vma(vma, entry);
+		vma->exec_entry = NULL;
+	}
+
+	for (i = 0; i < count; i++) {
+		const struct drm_i915_gem_exec_object2 *entry = &eb->exec[i];
+		struct i915_vma *vma = exec_to_vma(entry);
+
+		eb_export_fence(vma->obj, eb->request, entry->flags);
+		i915_vma_put(vma);
 	}
+	eb->exec = NULL;
 
 	/* Unconditionally flush any chipset caches (for streaming writes). */
 	i915_gem_chipset_flush(eb->i915);
@@ -1188,103 +1623,6 @@ i915_gem_check_execbuffer(struct drm_i915_gem_execbuffer2 *exec)
 	return true;
 }
 
-static int
-validate_exec_list(struct drm_device *dev,
-		   struct drm_i915_gem_exec_object2 *exec,
-		   int count)
-{
-	unsigned relocs_total = 0;
-	unsigned relocs_max = UINT_MAX / sizeof(struct drm_i915_gem_relocation_entry);
-	unsigned invalid_flags;
-	int i;
-
-	/* INTERNAL flags must not overlap with external ones */
-	BUILD_BUG_ON(__EXEC_OBJECT_INTERNAL_FLAGS & ~__EXEC_OBJECT_UNKNOWN_FLAGS);
-
-	invalid_flags = __EXEC_OBJECT_UNKNOWN_FLAGS;
-	if (USES_FULL_PPGTT(dev))
-		invalid_flags |= EXEC_OBJECT_NEEDS_GTT;
-
-	for (i = 0; i < count; i++) {
-		char __user *ptr = u64_to_user_ptr(exec[i].relocs_ptr);
-		int length; /* limited by fault_in_pages_readable() */
-
-		if (exec[i].flags & invalid_flags)
-			return -EINVAL;
-
-		/* Offset can be used as input (EXEC_OBJECT_PINNED), reject
-		 * any non-page-aligned or non-canonical addresses.
-		 */
-		if (exec[i].flags & EXEC_OBJECT_PINNED) {
-			if (exec[i].offset !=
-			    gen8_canonical_addr(exec[i].offset & PAGE_MASK))
-				return -EINVAL;
-		}
-
-		/* From drm_mm perspective address space is continuous,
-		 * so from this point we're always using non-canonical
-		 * form internally.
-		 */
-		exec[i].offset = gen8_noncanonical_addr(exec[i].offset);
-
-		if (exec[i].alignment && !is_power_of_2(exec[i].alignment))
-			return -EINVAL;
-
-		/* pad_to_size was once a reserved field, so sanitize it */
-		if (exec[i].flags & EXEC_OBJECT_PAD_TO_SIZE) {
-			if (offset_in_page(exec[i].pad_to_size))
-				return -EINVAL;
-		} else {
-			exec[i].pad_to_size = 0;
-		}
-
-		/* First check for malicious input causing overflow in
-		 * the worst case where we need to allocate the entire
-		 * relocation tree as a single array.
-		 */
-		if (exec[i].relocation_count > relocs_max - relocs_total)
-			return -EINVAL;
-		relocs_total += exec[i].relocation_count;
-
-		length = exec[i].relocation_count *
-			sizeof(struct drm_i915_gem_relocation_entry);
-		/*
-		 * We must check that the entire relocation array is safe
-		 * to read, but since we may need to update the presumed
-		 * offsets during execution, check for full write access.
-		 */
-		if (!access_ok(VERIFY_WRITE, ptr, length))
-			return -EFAULT;
-
-		if (likely(!i915.prefault_disable)) {
-			if (fault_in_pages_readable(ptr, length))
-				return -EFAULT;
-		}
-	}
-
-	return 0;
-}
-
-static int eb_select_context(struct i915_execbuffer *eb)
-{
-	unsigned int ctx_id = i915_execbuffer2_get_context_id(*eb->args);
-	struct i915_gem_context *ctx;
-
-	ctx = i915_gem_context_lookup(eb->file->driver_priv, ctx_id);
-	if (unlikely(IS_ERR(ctx)))
-		return PTR_ERR(ctx);
-
-	if (unlikely(i915_gem_context_is_banned(ctx))) {
-		DRM_DEBUG("Context %u tried to submit while banned\n", ctx_id);
-		return -EIO;
-	}
-
-	eb->ctx = i915_gem_context_get(ctx);
-	eb->vm = ctx->ppgtt ? &ctx->ppgtt->base : &eb->i915->ggtt.base;
-
-	return 0;
-}
-
 void i915_vma_move_to_active(struct i915_vma *vma,
 			     struct drm_i915_gem_request *req,
 			     unsigned int flags)
@@ -1323,42 +1661,6 @@ void i915_vma_move_to_active(struct i915_vma *vma,
 		i915_gem_active_set(&vma->last_fence, req);
 }
 
-static void eb_export_fence(struct drm_i915_gem_object *obj,
-			    struct drm_i915_gem_request *req,
-			    unsigned int flags)
-{
-	struct reservation_object *resv = obj->resv;
-
-	/* Ignore errors from failing to allocate the new fence, we can't
-	 * handle an error right now. Worst case should be missed
-	 * synchronisation leading to rendering corruption.
-	 */
-	reservation_object_lock(resv, NULL);
-	if (flags & EXEC_OBJECT_WRITE)
-		reservation_object_add_excl_fence(resv, &req->fence);
-	else if (reservation_object_reserve_shared(resv) == 0)
-		reservation_object_add_shared_fence(resv, &req->fence);
-	reservation_object_unlock(resv);
-}
-
-static void
-eb_move_to_active(struct i915_execbuffer *eb)
-{
-	struct i915_vma *vma;
-
-	list_for_each_entry(vma, &eb->vmas, exec_link) {
-		struct drm_i915_gem_object *obj = vma->obj;
-
-		obj->base.write_domain = 0;
-		if (vma->exec_entry->flags & EXEC_OBJECT_WRITE)
-			obj->base.read_domains = 0;
-		obj->base.read_domains |= I915_GEM_GPU_DOMAINS;
-
-		i915_vma_move_to_active(vma, eb->request, vma->exec_entry->flags);
-		eb_export_fence(obj, eb->request, vma->exec_entry->flags);
-	}
-}
-
 static int
 i915_reset_gen7_sol_offsets(struct drm_i915_gem_request *req)
 {
@@ -1370,16 +1672,16 @@ i915_reset_gen7_sol_offsets(struct drm_i915_gem_request *req)
 		return -EINVAL;
 	}
 
-	cs = intel_ring_begin(req, 4 * 3);
+	cs = intel_ring_begin(req, 4 * 2 + 2);
 	if (IS_ERR(cs))
 		return PTR_ERR(cs);
 
+	*cs++ = MI_LOAD_REGISTER_IMM(4);
 	for (i = 0; i < 4; i++) {
-		*cs++ = MI_LOAD_REGISTER_IMM(1);
 		*cs++ = i915_mmio_reg_offset(GEN7_SO_WRITE_OFFSET(i));
 		*cs++ = 0;
 	}
-
+	*cs++ = MI_NOOP;
 	intel_ring_advance(req, cs);
 
 	return 0;
@@ -1389,24 +1691,24 @@ static struct i915_vma *eb_parse(struct i915_execbuffer *eb, bool is_master)
 {
 	struct drm_i915_gem_object *shadow_batch_obj;
 	struct i915_vma *vma;
-	int ret;
+	int err;
 
 	shadow_batch_obj = i915_gem_batch_pool_get(&eb->engine->batch_pool,
 						   PAGE_ALIGN(eb->batch_len));
 	if (IS_ERR(shadow_batch_obj))
 		return ERR_CAST(shadow_batch_obj);
 
-	ret = intel_engine_cmd_parser(eb->engine,
+	err = intel_engine_cmd_parser(eb->engine,
 				      eb->batch->obj,
 				      shadow_batch_obj,
 				      eb->batch_start_offset,
 				      eb->batch_len,
 				      is_master);
-	if (ret) {
-		if (ret == -EACCES) /* unhandled chained batch */
+	if (err) {
+		if (err == -EACCES) /* unhandled chained batch */
 			vma = NULL;
 		else
-			vma = ERR_PTR(ret);
+			vma = ERR_PTR(err);
 		goto out;
 	}
 
@@ -1415,10 +1717,10 @@ static struct i915_vma *eb_parse(struct i915_execbuffer *eb, bool is_master)
 		goto out;
 
 	vma->exec_entry =
-		memset(&eb->shadow_exec_entry, 0, sizeof(*vma->exec_entry));
+		memset(&eb->exec[eb->buffer_count++],
+		       0, sizeof(*vma->exec_entry));
 	vma->exec_entry->flags = __EXEC_OBJECT_HAS_PIN;
-	i915_gem_object_get(shadow_batch_obj);
-	list_add_tail(&vma->exec_link, &eb->vmas);
+	__exec_to_vma(vma->exec_entry) = (uintptr_t)i915_vma_get(vma);
 
 out:
 	i915_gem_object_unpin_pages(shadow_batch_obj);
@@ -1434,33 +1736,31 @@ add_to_client(struct drm_i915_gem_request *req,
 }
 
 static int
-execbuf_submit(struct i915_execbuffer *eb)
+eb_submit(struct i915_execbuffer *eb)
 {
-	int ret;
+	int err;
 
-	ret = eb_move_to_gpu(eb);
-	if (ret)
-		return ret;
+	err = eb_move_to_gpu(eb);
+	if (err)
+		return err;
 
-	ret = i915_switch_context(eb->request);
-	if (ret)
-		return ret;
+	err = i915_switch_context(eb->request);
+	if (err)
+		return err;
 
 	if (eb->args->flags & I915_EXEC_GEN7_SOL_RESET) {
-		ret = i915_reset_gen7_sol_offsets(eb->request);
-		if (ret)
-			return ret;
+		err = i915_reset_gen7_sol_offsets(eb->request);
+		if (err)
+			return err;
 	}
 
-	ret = eb->engine->emit_bb_start(eb->request,
+	err = eb->engine->emit_bb_start(eb->request,
 					eb->batch->node.start +
 					eb->batch_start_offset,
 					eb->batch_len,
-					eb->dispatch_flags);
-	if (ret)
-		return ret;
-
-	eb_move_to_active(eb);
+					eb->batch_flags);
+	if (err)
+		return err;
 
 	return 0;
 }
@@ -1551,34 +1851,35 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 	struct dma_fence *in_fence = NULL;
 	struct sync_file *out_fence = NULL;
 	int out_fence_fd = -1;
-	int ret;
+	int err;
 
-	if (!i915_gem_check_execbuffer(args))
-		return -EINVAL;
-
-	ret = validate_exec_list(dev, exec, args->buffer_count);
-	if (ret)
-		return ret;
+	BUILD_BUG_ON(__EXEC_OBJECT_INTERNAL_FLAGS & ~__EXEC_OBJECT_UNKNOWN_FLAGS);
 
 	eb.i915 = to_i915(dev);
 	eb.file = file;
 	eb.args = args;
+	if (!(args->flags & I915_EXEC_NO_RELOC))
+		args->flags |= __EXEC_HAS_RELOC;
 	eb.exec = exec;
-	eb.need_relocs = (args->flags & I915_EXEC_NO_RELOC) == 0;
+	eb.ctx = NULL;
+	eb.invalid_flags = __EXEC_OBJECT_UNKNOWN_FLAGS;
+	if (USES_FULL_PPGTT(eb.i915))
+		eb.invalid_flags |= EXEC_OBJECT_NEEDS_GTT;
 	reloc_cache_init(&eb.reloc_cache, eb.i915);
 
+	eb.buffer_count = args->buffer_count;
 	eb.batch_start_offset = args->batch_start_offset;
 	eb.batch_len = args->batch_len;
 
-	eb.dispatch_flags = 0;
+	eb.batch_flags = 0;
 	if (args->flags & I915_EXEC_SECURE) {
 		if (!drm_is_current_master(file) || !capable(CAP_SYS_ADMIN))
 		    return -EPERM;
 
-		eb.dispatch_flags |= I915_DISPATCH_SECURE;
+		eb.batch_flags |= I915_DISPATCH_SECURE;
 	}
 	if (args->flags & I915_EXEC_IS_PINNED)
-		eb.dispatch_flags |= I915_DISPATCH_PINNED;
+		eb.batch_flags |= I915_DISPATCH_PINNED;
 
 	eb.engine = eb_select_engine(eb.i915, file, args);
 	if (!eb.engine)
@@ -1595,7 +1896,7 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 			return -EINVAL;
 		}
 
-		eb.dispatch_flags |= I915_DISPATCH_RS;
+		eb.batch_flags |= I915_DISPATCH_RS;
 	}
 
 	if (args->flags & I915_EXEC_FENCE_IN) {
@@ -1607,11 +1908,14 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 	if (args->flags & I915_EXEC_FENCE_OUT) {
 		out_fence_fd = get_unused_fd_flags(O_CLOEXEC);
 		if (out_fence_fd < 0) {
-			ret = out_fence_fd;
+			err = out_fence_fd;
 			goto err_in_fence;
 		}
 	}
 
+	if (eb_create(&eb))
+		return -ENOMEM;
+
 	/* Take a local wakeref for preparing to dispatch the execbuf as
 	 * we expect to access the hardware fairly frequently in the
 	 * process. Upon first dispatch, we acquire another prolonged
@@ -1619,59 +1923,40 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 	 * 100ms.
 	 */
 	intel_runtime_pm_get(eb.i915);
+	err = i915_mutex_lock_interruptible(dev);
+	if (err)
+		goto err_rpm;
+
+	err = eb_select_context(&eb);
+	if (unlikely(err))
+		goto err_unlock;
+
+	err = eb_lookup_vmas(&eb);
+	if (likely(!err && args->flags & __EXEC_HAS_RELOC))
+		err = eb_relocate(&eb);
+	if (err == -EAGAIN || err == -EFAULT)
+		err = eb_relocate_slow(&eb);
+	if (err && args->flags & I915_EXEC_NO_RELOC)
+		/* If the user expects the execobject.offset and
+		 * reloc.presumed_offset to be an exact match,
+		 * as for using NO_RELOC, then we cannot update
+		 * the execobject.offset until we have completed
+		 * relocation.
+		 */
+		args->flags &= ~__EXEC_HAS_RELOC;
+	if (err < 0)
+		goto err_vma;
 
-	ret = i915_mutex_lock_interruptible(dev);
-	if (ret)
-		goto pre_mutex_err;
-
-	ret = eb_select_context(&eb);
-	if (ret) {
-		mutex_unlock(&dev->struct_mutex);
-		goto pre_mutex_err;
-	}
-
-	if (eb_create(&eb)) {
-		i915_gem_context_put(eb.ctx);
-		mutex_unlock(&dev->struct_mutex);
-		ret = -ENOMEM;
-		goto pre_mutex_err;
-	}
-
-	/* Look up object handles */
-	ret = eb_lookup_vmas(&eb);
-	if (ret)
-		goto err;
-
-	/* take note of the batch buffer before we might reorder the lists */
-	eb.batch = eb_get_batch(&eb);
-
-	/* Move the objects en-masse into the GTT, evicting if necessary. */
-	ret = eb_reserve(&eb);
-	if (ret)
-		goto err;
-
-	/* The objects are in their final locations, apply the relocations. */
-	if (eb.need_relocs)
-		ret = eb_relocate(&eb);
-	if (ret) {
-		if (ret == -EFAULT) {
-			ret = eb_relocate_slow(&eb);
-			BUG_ON(!mutex_is_locked(&dev->struct_mutex));
-		}
-		if (ret)
-			goto err;
-	}
-
-	if (eb.batch->exec_entry->flags & EXEC_OBJECT_WRITE) {
+	if (unlikely(eb.batch->exec_entry->flags & EXEC_OBJECT_WRITE)) {
 		DRM_DEBUG("Attempting to use self-modifying batch buffer\n");
-		ret = -EINVAL;
-		goto err;
+		err = -EINVAL;
+		goto err_vma;
 	}
 	if (eb.batch_start_offset > eb.batch->size ||
 	    eb.batch_len > eb.batch->size - eb.batch_start_offset) {
 		DRM_DEBUG("Attempting to use out-of-bounds batch\n");
-		ret = -EINVAL;
-		goto err;
+		err = -EINVAL;
+		goto err_vma;
 	}
 
 	if (eb.engine->needs_cmd_parser && eb.batch_len) {
@@ -1679,8 +1964,8 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 
 		vma = eb_parse(&eb, drm_is_current_master(file));
 		if (IS_ERR(vma)) {
-			ret = PTR_ERR(vma);
-			goto err;
+			err = PTR_ERR(vma);
+			goto err_vma;
 		}
 
 		if (vma) {
@@ -1693,7 +1978,7 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 			 * specifically don't want that set on batches the
 			 * command parser has accepted.
 			 */
-			eb.dispatch_flags |= I915_DISPATCH_SECURE;
+			eb.batch_flags |= I915_DISPATCH_SECURE;
 			eb.batch_start_offset = 0;
 			eb.batch = vma;
 		}
@@ -1705,8 +1990,7 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 	/* snb/ivb/vlv conflate the "batch in ppgtt" bit with the "non-secure
 	 * batch" bit. Hence we need to pin secure batches into the global gtt.
 	 * hsw should have this fixed, but bdw mucks it up again. */
-	if (eb.dispatch_flags & I915_DISPATCH_SECURE) {
-		struct drm_i915_gem_object *obj = eb.batch->obj;
+	if (eb.batch_flags & I915_DISPATCH_SECURE) {
 		struct i915_vma *vma;
 
 		/*
@@ -1719,10 +2003,10 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 		 *   fitting due to fragmentation.
 		 * So this is actually safe.
 		 */
-		vma = i915_gem_object_ggtt_pin(obj, NULL, 0, 0, 0);
+		vma = i915_gem_object_ggtt_pin(eb.batch->obj, NULL, 0, 0, 0);
 		if (IS_ERR(vma)) {
-			ret = PTR_ERR(vma);
-			goto err;
+			err = PTR_ERR(vma);
+			goto err_vma;
 		}
 
 		eb.batch = vma;
@@ -1731,20 +2015,20 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 	/* Allocate a request for this batch buffer nice and early. */
 	eb.request = i915_gem_request_alloc(eb.engine, eb.ctx);
 	if (IS_ERR(eb.request)) {
-		ret = PTR_ERR(eb.request);
+		err = PTR_ERR(eb.request);
 		goto err_batch_unpin;
 	}
 
 	if (in_fence) {
-		ret = i915_gem_request_await_dma_fence(eb.request, in_fence);
-		if (ret < 0)
+		err = i915_gem_request_await_dma_fence(eb.request, in_fence);
+		if (err < 0)
 			goto err_request;
 	}
 
 	if (out_fence_fd != -1) {
 		out_fence = sync_file_create(&eb.request->fence);
 		if (!out_fence) {
-			ret = -ENOMEM;
+			err = -ENOMEM;
 			goto err_request;
 		}
 	}
@@ -1757,14 +2041,14 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 	 */
 	eb.request->batch = eb.batch;
 
-	trace_i915_gem_request_queue(eb.request, eb.dispatch_flags);
-	ret = execbuf_submit(&eb);
+	trace_i915_gem_request_queue(eb.request, eb.batch_flags);
+	err = eb_submit(&eb);
 err_request:
-	__i915_add_request(eb.request, ret == 0);
+	__i915_add_request(eb.request, err == 0);
 	add_to_client(eb.request, file);
 
 	if (out_fence) {
-		if (ret == 0) {
+		if (err == 0) {
 			fd_install(out_fence_fd, out_fence->file);
 			args->rsvd2 &= GENMASK_ULL(0, 31); /* keep in-fence */
 			args->rsvd2 |= (u64)out_fence_fd << 32;
@@ -1775,28 +2059,21 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 	}
 
 err_batch_unpin:
-	/*
-	 * FIXME: We crucially rely upon the active tracking for the (ppgtt)
-	 * batch vma for correctness. For less ugly and less fragility this
-	 * needs to be adjusted to also track the ggtt batch vma properly as
-	 * active.
-	 */
-	if (eb.dispatch_flags & I915_DISPATCH_SECURE)
+	if (eb.batch_flags & I915_DISPATCH_SECURE)
 		i915_vma_unpin(eb.batch);
-err:
-	/* the request owns the ref now */
-	eb_destroy(&eb);
+err_vma:
+	eb_release_vma(&eb);
+	i915_gem_context_put(eb.ctx);
+err_unlock:
 	mutex_unlock(&dev->struct_mutex);
-
-pre_mutex_err:
-	/* intel_gpu_busy should also get a ref, so it will free when the device
-	 * is really idle. */
+err_rpm:
 	intel_runtime_pm_put(eb.i915);
+	eb_destroy(&eb);
 	if (out_fence_fd != -1)
 		put_unused_fd(out_fence_fd);
 err_in_fence:
 	dma_fence_put(in_fence);
-	return ret;
+	return err;
 }
 
 /*
@@ -1811,16 +2088,35 @@ i915_gem_execbuffer(struct drm_device *dev, void *data,
 	struct drm_i915_gem_execbuffer2 exec2;
 	struct drm_i915_gem_exec_object *exec_list = NULL;
 	struct drm_i915_gem_exec_object2 *exec2_list = NULL;
-	int ret, i;
+	unsigned int i;
+	int err;
 
 	if (args->buffer_count < 1) {
 		DRM_DEBUG("execbuf with %d buffers\n", args->buffer_count);
 		return -EINVAL;
 	}
 
+	exec2.buffers_ptr = args->buffers_ptr;
+	exec2.buffer_count = args->buffer_count;
+	exec2.batch_start_offset = args->batch_start_offset;
+	exec2.batch_len = args->batch_len;
+	exec2.DR1 = args->DR1;
+	exec2.DR4 = args->DR4;
+	exec2.num_cliprects = args->num_cliprects;
+	exec2.cliprects_ptr = args->cliprects_ptr;
+	exec2.flags = I915_EXEC_RENDER;
+	i915_execbuffer2_set_context_id(exec2, 0);
+
+	if (!i915_gem_check_execbuffer(&exec2))
+		return -EINVAL;
+
 	/* Copy in the exec list from userland */
-	exec_list = drm_malloc_ab(sizeof(*exec_list), args->buffer_count);
-	exec2_list = drm_malloc_ab(sizeof(*exec2_list), args->buffer_count);
+	exec_list = drm_malloc_gfp(args->buffer_count,
+				   sizeof(*exec_list),
+				   __GFP_NOWARN | GFP_TEMPORARY);
+	exec2_list = drm_malloc_gfp(args->buffer_count + 1,
+				    sizeof(*exec2_list),
+				    __GFP_NOWARN | GFP_TEMPORARY);
 	if (exec_list == NULL || exec2_list == NULL) {
 		DRM_DEBUG("Failed to allocate exec list for %d buffers\n",
 			  args->buffer_count);
@@ -1828,12 +2124,12 @@ i915_gem_execbuffer(struct drm_device *dev, void *data,
 		drm_free_large(exec2_list);
 		return -ENOMEM;
 	}
-	ret = copy_from_user(exec_list,
+	err = copy_from_user(exec_list,
 			     u64_to_user_ptr(args->buffers_ptr),
 			     sizeof(*exec_list) * args->buffer_count);
-	if (ret != 0) {
+	if (err) {
 		DRM_DEBUG("copy %d exec entries failed %d\n",
-			  args->buffer_count, ret);
+			  args->buffer_count, err);
 		drm_free_large(exec_list);
 		drm_free_large(exec2_list);
 		return -EFAULT;
@@ -1851,42 +2147,29 @@ i915_gem_execbuffer(struct drm_device *dev, void *data,
 			exec2_list[i].flags = 0;
 	}
 
-	exec2.buffers_ptr = args->buffers_ptr;
-	exec2.buffer_count = args->buffer_count;
-	exec2.batch_start_offset = args->batch_start_offset;
-	exec2.batch_len = args->batch_len;
-	exec2.DR1 = args->DR1;
-	exec2.DR4 = args->DR4;
-	exec2.num_cliprects = args->num_cliprects;
-	exec2.cliprects_ptr = args->cliprects_ptr;
-	exec2.flags = I915_EXEC_RENDER;
-	i915_execbuffer2_set_context_id(exec2, 0);
-
-	ret = i915_gem_do_execbuffer(dev, file, &exec2, exec2_list);
-	if (!ret) {
+	err = i915_gem_do_execbuffer(dev, file, &exec2, exec2_list);
+	if (exec2.flags & __EXEC_HAS_RELOC) {
 		struct drm_i915_gem_exec_object __user *user_exec_list =
 			u64_to_user_ptr(args->buffers_ptr);
 
 		/* Copy the new buffer offsets back to the user's exec list. */
 		for (i = 0; i < args->buffer_count; i++) {
+			if (!(exec2_list[i].offset & UPDATE))
+				continue;
+
 			exec2_list[i].offset =
-				gen8_canonical_addr(exec2_list[i].offset);
-			ret = __copy_to_user(&user_exec_list[i].offset,
-					     &exec2_list[i].offset,
-					     sizeof(user_exec_list[i].offset));
-			if (ret) {
-				ret = -EFAULT;
-				DRM_DEBUG("failed to copy %d exec entries "
-					  "back to user (%d)\n",
-					  args->buffer_count, ret);
+				gen8_canonical_addr(exec2_list[i].offset & PIN_OFFSET_MASK);
+			exec2_list[i].offset &= PIN_OFFSET_MASK;
+			if (__copy_to_user(&user_exec_list[i].offset,
+					   &exec2_list[i].offset,
+					   sizeof(user_exec_list[i].offset)))
 				break;
-			}
 		}
 	}
 
 	drm_free_large(exec_list);
 	drm_free_large(exec2_list);
-	return ret;
+	return err;
 }
 
 int
@@ -1894,56 +2177,64 @@ i915_gem_execbuffer2(struct drm_device *dev, void *data,
 		     struct drm_file *file)
 {
 	struct drm_i915_gem_execbuffer2 *args = data;
-	struct drm_i915_gem_exec_object2 *exec2_list = NULL;
-	int ret;
+	struct drm_i915_gem_exec_object2 *exec2_list;
+	int err;
 
 	if (args->buffer_count < 1 ||
-	    args->buffer_count > UINT_MAX / sizeof(*exec2_list)) {
+	    args->buffer_count > SIZE_MAX / sizeof(*exec2_list) - 1) {
 		DRM_DEBUG("execbuf2 with %d buffers\n", args->buffer_count);
 		return -EINVAL;
 	}
 
-	exec2_list = drm_malloc_gfp(args->buffer_count,
+	if (!i915_gem_check_execbuffer(args))
+		return -EINVAL;
+
+	/* Allocate an extra slot for use by the command parser */
+	exec2_list = drm_malloc_gfp(args->buffer_count + 1,
 				    sizeof(*exec2_list),
-				    GFP_TEMPORARY);
+				    __GFP_NOWARN | GFP_TEMPORARY);
 	if (exec2_list == NULL) {
 		DRM_DEBUG("Failed to allocate exec list for %d buffers\n",
 			  args->buffer_count);
 		return -ENOMEM;
 	}
-	ret = copy_from_user(exec2_list,
-			     u64_to_user_ptr(args->buffers_ptr),
-			     sizeof(*exec2_list) * args->buffer_count);
-	if (ret != 0) {
-		DRM_DEBUG("copy %d exec entries failed %d\n",
-			  args->buffer_count, ret);
+	if (copy_from_user(exec2_list,
+			   u64_to_user_ptr(args->buffers_ptr),
+			   sizeof(*exec2_list) * args->buffer_count)) {
+		DRM_DEBUG("copy %d exec entries failed\n", args->buffer_count);
 		drm_free_large(exec2_list);
 		return -EFAULT;
 	}
 
-	ret = i915_gem_do_execbuffer(dev, file, args, exec2_list);
-	if (!ret) {
-		/* Copy the new buffer offsets back to the user's exec list. */
+	err = i915_gem_do_execbuffer(dev, file, args, exec2_list);
+
+	/* Now that we have begun execution of the batchbuffer, we ignore
+	 * any new error after this point. Also given that we have already
+	 * updated the associated relocations, we try to write out the current
+	 * object locations irrespective of any error.
+	 */
+	if (args->flags & __EXEC_HAS_RELOC) {
 		struct drm_i915_gem_exec_object2 __user *user_exec_list =
-				   u64_to_user_ptr(args->buffers_ptr);
-		int i;
+			u64_to_user_ptr(args->buffers_ptr);
+		unsigned int i;
 
+		/* Copy the new buffer offsets back to the user's exec list. */
+		user_access_begin();
 		for (i = 0; i < args->buffer_count; i++) {
+			if (!(exec2_list[i].offset & UPDATE))
+				continue;
+
 			exec2_list[i].offset =
-				gen8_canonical_addr(exec2_list[i].offset);
-			ret = __copy_to_user(&user_exec_list[i].offset,
-					     &exec2_list[i].offset,
-					     sizeof(user_exec_list[i].offset));
-			if (ret) {
-				ret = -EFAULT;
-				DRM_DEBUG("failed to copy %d exec entries "
-					  "back to user\n",
-					  args->buffer_count);
-				break;
-			}
+				gen8_canonical_addr(exec2_list[i].offset & PIN_OFFSET_MASK);
+			unsafe_put_user(exec2_list[i].offset,
+					&user_exec_list[i].offset,
+					end_user);
 		}
+end_user:
+		user_access_end();
 	}
 
+	args->flags &= ~__I915_EXEC_UNKNOWN_FLAGS;
 	drm_free_large(exec2_list);
-	return ret;
+	return err;
 }
diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c
index ad696239383d..6b1253fdfc39 100644
--- a/drivers/gpu/drm/i915/i915_vma.c
+++ b/drivers/gpu/drm/i915/i915_vma.c
@@ -463,7 +463,7 @@ i915_vma_insert(struct i915_vma *vma, u64 size, u64 alignment, u64 flags)
 			  size, obj->base.size,
 			  flags & PIN_MAPPABLE ? "mappable" : "total",
 			  end);
-		return -E2BIG;
+		return -ENOSPC;
 	}
 
 	ret = i915_gem_object_pin_pages(obj);
diff --git a/drivers/gpu/drm/i915/i915_vma.h b/drivers/gpu/drm/i915/i915_vma.h
index 88543fafcffc..062addfee6ef 100644
--- a/drivers/gpu/drm/i915/i915_vma.h
+++ b/drivers/gpu/drm/i915/i915_vma.h
@@ -103,6 +103,7 @@ struct i915_vma {
 
 	/** This vma's place in the execbuf reservation list */
 	struct list_head exec_link;
+	struct list_head reloc_link;
 
 	/** This vma's place in the eviction list */
 	struct list_head evict_link;
diff --git a/drivers/gpu/drm/i915/selftests/i915_gem_evict.c b/drivers/gpu/drm/i915/selftests/i915_gem_evict.c
index 14e9c2fbc4e6..5ea373221f49 100644
--- a/drivers/gpu/drm/i915/selftests/i915_gem_evict.c
+++ b/drivers/gpu/drm/i915/selftests/i915_gem_evict.c
@@ -304,7 +304,7 @@ static int igt_evict_vm(void *arg)
 		goto cleanup;
 
 	/* Everything is pinned, nothing should happen */
-	err = i915_gem_evict_vm(&ggtt->base, false);
+	err = i915_gem_evict_vm(&ggtt->base);
 	if (err) {
 		pr_err("i915_gem_evict_vm on a full GGTT returned err=%d]\n",
 		       err);
@@ -313,7 +313,7 @@ static int igt_evict_vm(void *arg)
 
 	unpin_ggtt(i915);
 
-	err = i915_gem_evict_vm(&ggtt->base, false);
+	err = i915_gem_evict_vm(&ggtt->base);
 	if (err) {
 		pr_err("i915_gem_evict_vm on a full GGTT returned err=%d]\n",
 		       err);
diff --git a/drivers/gpu/drm/i915/selftests/i915_vma.c b/drivers/gpu/drm/i915/selftests/i915_vma.c
index ad56566e24db..fb9072d5877f 100644
--- a/drivers/gpu/drm/i915/selftests/i915_vma.c
+++ b/drivers/gpu/drm/i915/selftests/i915_vma.c
@@ -225,14 +225,6 @@ static bool assert_pin_valid(const struct i915_vma *vma,
 }
 
 __maybe_unused
-static bool assert_pin_e2big(const struct i915_vma *vma,
-			     const struct pin_mode *mode,
-			     int result)
-{
-	return result == -E2BIG;
-}
-
-__maybe_unused
 static bool assert_pin_enospc(const struct i915_vma *vma,
 			      const struct pin_mode *mode,
 			      int result)
@@ -255,7 +247,6 @@ static int igt_vma_pin1(void *arg)
 #define VALID(sz, fl) { .size = (sz), .flags = (fl), .assert = assert_pin_valid, .string = #sz ", " #fl ", (valid) " }
 #define __INVALID(sz, fl, check, eval) { .size = (sz), .flags = (fl), .assert = (check), .string = #sz ", " #fl ", (invalid " #eval ")" }
 #define INVALID(sz, fl) __INVALID(sz, fl, assert_pin_einval, EINVAL)
-#define TOOBIG(sz, fl) __INVALID(sz, fl, assert_pin_e2big, E2BIG)
 #define NOSPACE(sz, fl) __INVALID(sz, fl, assert_pin_enospc, ENOSPC)
 		VALID(0, PIN_GLOBAL),
 		VALID(0, PIN_GLOBAL | PIN_MAPPABLE),
@@ -276,11 +267,11 @@ static int igt_vma_pin1(void *arg)
 		VALID(8192, PIN_GLOBAL),
 		VALID(i915->ggtt.mappable_end - 4096, PIN_GLOBAL | PIN_MAPPABLE),
 		VALID(i915->ggtt.mappable_end, PIN_GLOBAL | PIN_MAPPABLE),
-		TOOBIG(i915->ggtt.mappable_end + 4096, PIN_GLOBAL | PIN_MAPPABLE),
+		NOSPACE(i915->ggtt.mappable_end + 4096, PIN_GLOBAL | PIN_MAPPABLE),
 		VALID(i915->ggtt.base.total - 4096, PIN_GLOBAL),
 		VALID(i915->ggtt.base.total, PIN_GLOBAL),
-		TOOBIG(i915->ggtt.base.total + 4096, PIN_GLOBAL),
-		TOOBIG(round_down(U64_MAX, PAGE_SIZE), PIN_GLOBAL),
+		NOSPACE(i915->ggtt.base.total + 4096, PIN_GLOBAL),
+		NOSPACE(round_down(U64_MAX, PAGE_SIZE), PIN_GLOBAL),
 		INVALID(8192, PIN_GLOBAL | PIN_MAPPABLE | PIN_OFFSET_FIXED | (i915->ggtt.mappable_end - 4096)),
 		INVALID(8192, PIN_GLOBAL | PIN_OFFSET_FIXED | (i915->ggtt.base.total - 4096)),
 		INVALID(8192, PIN_GLOBAL | PIN_OFFSET_FIXED | (round_down(U64_MAX, PAGE_SIZE) - 4096)),
@@ -300,7 +291,6 @@ static int igt_vma_pin1(void *arg)
 #endif
 		{ },
 #undef NOSPACE
-#undef TOOBIG
 #undef INVALID
 #undef __INVALID
 #undef VALID
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 95+ messages in thread

* [PATCH 23/27] drm/i915: First try the previous execbuffer location
  2017-04-19  9:41 Confluence of eb + timeline improvements Chris Wilson
                   ` (21 preceding siblings ...)
  2017-04-19  9:41 ` [PATCH 22/27] drm/i915: Eliminate lots of iterations over the execobjects array Chris Wilson
@ 2017-04-19  9:41 ` Chris Wilson
  2017-04-19  9:41 ` [PATCH 24/27] drm/i915: Wait upon userptr get-user-pages within execbuffer Chris Wilson
                   ` (7 subsequent siblings)
  30 siblings, 0 replies; 95+ messages in thread
From: Chris Wilson @ 2017-04-19  9:41 UTC (permalink / raw)
  To: intel-gfx

When choosing a slot for an execbuffer, we ideally want to use the same
address as last time (so that we don't have to rebind it) and the same
address as expected by the user (so that we don't have to fixup any
relocations pointing to it). If we first try to bind at the incoming
execbuffer->offset from the user, or at the currently bound offset, that
should hopefully achieve the goal of avoiding both the rebind cost and
the relocation penalty. However, if the object is not currently bound
there, we don't want to arbitrarily evict whatever object occupies our
chosen position, and so we choose to rebind/relocate the incoming object
instead. After we report the new position back to the user, the
relocations should have settled down by the next pass.
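
In sketch form, the placement preference boils down to the following (a
standalone simplification with made-up types, not the driver code;
eb_pin_vma() in the diff below is the real implementation):

#include <stdint.h>

/* Simplified stand-ins for the driver structures (illustrative only). */
struct sketch_vma   { uint64_t node_start, node_size; };
struct sketch_entry { uint64_t offset; };

#define SKETCH_OFFSET_MASK (~4095ull) /* assumption: strip low flag bits from the user offset */

/* First preference: the address the vma already occupies (no rebind);
 * otherwise the address the user presumed (no relocation fixup). The
 * caller then pins with a no-evict, fixed-offset request and falls back
 * to the ordinary reserve/relocate path if that location is taken.
 */
static uint64_t preferred_offset(const struct sketch_vma *vma,
				 const struct sketch_entry *entry)
{
	if (vma->node_size)
		return vma->node_start;

	return entry->offset & SKETCH_OFFSET_MASK;
}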

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_gem_execbuffer.c | 12 ++++++++----
 drivers/gpu/drm/i915/i915_gem_gtt.c        |  6 ++++++
 drivers/gpu/drm/i915/i915_gem_gtt.h        |  1 +
 3 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index de41e423d3f7..f4b5e221708d 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -329,10 +329,15 @@ eb_pin_vma(struct i915_execbuffer *eb,
 {
 	u64 flags;
 
-	flags = vma->node.start;
-	flags |= PIN_USER | PIN_NONBLOCK | PIN_OFFSET_FIXED;
+	if (vma->node.size)
+		flags = vma->node.start;
+	else
+		flags = entry->offset & PIN_OFFSET_MASK;
+
+	flags |= PIN_USER | PIN_NOEVICT | PIN_OFFSET_FIXED;
 	if (unlikely(entry->flags & EXEC_OBJECT_NEEDS_GTT))
 		flags |= PIN_GLOBAL;
+
 	if (unlikely(i915_vma_pin(vma, 0, 0, flags)))
 		return;
 
@@ -460,8 +465,7 @@ eb_add_vma(struct i915_execbuffer *eb,
 		entry->flags |= eb->context_flags;
 
 	err = 0;
-	if (vma->node.size)
-		eb_pin_vma(eb, entry, vma);
+	eb_pin_vma(eb, entry, vma);
 	if (eb_vma_misplaced(entry, vma)) {
 		eb_unreserve_vma(vma, entry);
 
diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
index 8bab4aea63e6..62871cd50605 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -3288,6 +3288,9 @@ int i915_gem_gtt_reserve(struct i915_address_space *vm,
 	if (err != -ENOSPC)
 		return err;
 
+	if (flags & PIN_NOEVICT)
+		return -ENOSPC;
+
 	err = i915_gem_evict_for_node(vm, node, flags);
 	if (err == 0)
 		err = drm_mm_reserve_node(&vm->mm, node);
@@ -3402,6 +3405,9 @@ int i915_gem_gtt_insert(struct i915_address_space *vm,
 	if (err != -ENOSPC)
 		return err;
 
+	if (flags & PIN_NOEVICT)
+		return -ENOSPC;
+
 	/* No free space, pick a slot at random.
 	 *
 	 * There is a pathological case here using a GTT shared between
diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.h b/drivers/gpu/drm/i915/i915_gem_gtt.h
index fb15684c1d83..a528ce1380fd 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.h
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.h
@@ -588,6 +588,7 @@ int i915_gem_gtt_insert(struct i915_address_space *vm,
 #define PIN_MAPPABLE		BIT(1)
 #define PIN_ZONE_4G		BIT(2)
 #define PIN_NONFAULT		BIT(3)
+#define PIN_NOEVICT		BIT(4)
 
 #define PIN_MBZ			BIT(5) /* I915_VMA_PIN_OVERFLOW */
 #define PIN_GLOBAL		BIT(6) /* I915_VMA_GLOBAL_BIND */
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 95+ messages in thread

* [PATCH 24/27] drm/i915: Wait upon userptr get-user-pages within execbuffer
  2017-04-19  9:41 Confluence of eb + timeline improvements Chris Wilson
                   ` (22 preceding siblings ...)
  2017-04-19  9:41 ` [PATCH 23/27] drm/i915: First try the previous execbuffer location Chris Wilson
@ 2017-04-19  9:41 ` Chris Wilson
  2017-04-19  9:41 ` [PATCH 25/27] drm/i915: Allow execbuffer to use the first object as the batch Chris Wilson
                   ` (6 subsequent siblings)
  30 siblings, 0 replies; 95+ messages in thread
From: Chris Wilson @ 2017-04-19  9:41 UTC (permalink / raw)
  To: intel-gfx

This simply hides the EAGAIN caused by userptr when userspace causes
resource contention. However, it is quite beneficial with highly
contended userptr users as we avoid repeating the setup costs and
kernel-user context switches.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Michał Winiarski <michal.winiarski@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c            |  1 +
 drivers/gpu/drm/i915/i915_drv.h            | 10 +++++++++-
 drivers/gpu/drm/i915/i915_gem.c            |  4 +++-
 drivers/gpu/drm/i915/i915_gem_execbuffer.c |  3 +++
 drivers/gpu/drm/i915/i915_gem_userptr.c    | 18 +++++++++++++++---
 5 files changed, 31 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index cc7393e65e99..6ce736514396 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -553,6 +553,7 @@ static void i915_gem_fini(struct drm_i915_private *dev_priv)
 	intel_uc_fini_hw(dev_priv);
 	i915_gem_cleanup_engines(dev_priv);
 	i915_gem_context_fini(dev_priv);
+	i915_gem_cleanup_userptr(dev_priv);
 	mutex_unlock(&dev_priv->drm.struct_mutex);
 
 	i915_gem_drain_freed_objects(dev_priv);
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index b0c4b9cb75c2..915f6d700cfe 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1514,6 +1514,13 @@ struct i915_gem_mm {
 	/** LRU list of objects with fence regs on them. */
 	struct list_head fence_list;
 
+	/**
+	 * Workqueue to fault in userptr pages, flushed by the execbuf
+	 * when required but otherwise left to userspace to try again
+	 * on EAGAIN.
+	 */
+	struct workqueue_struct *userptr_wq;
+
 	u64 unordered_timeline;
 
 	/* the indicator for dispatch video commands on two BSD rings */
@@ -3208,7 +3215,8 @@ int i915_gem_set_tiling_ioctl(struct drm_device *dev, void *data,
 			      struct drm_file *file_priv);
 int i915_gem_get_tiling_ioctl(struct drm_device *dev, void *data,
 			      struct drm_file *file_priv);
-void i915_gem_init_userptr(struct drm_i915_private *dev_priv);
+int i915_gem_init_userptr(struct drm_i915_private *dev_priv);
+void i915_gem_cleanup_userptr(struct drm_i915_private *dev_priv);
 int i915_gem_userptr_ioctl(struct drm_device *dev, void *data,
 			   struct drm_file *file);
 int i915_gem_get_aperture_ioctl(struct drm_device *dev, void *data,
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index ed761a122966..55cb8a2cb99b 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -4791,7 +4791,9 @@ int i915_gem_init(struct drm_i915_private *dev_priv)
 	 */
 	intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
 
-	i915_gem_init_userptr(dev_priv);
+	ret = i915_gem_init_userptr(dev_priv);
+	if (ret)
+		goto out_unlock;
 
 	ret = i915_gem_init_ggtt(dev_priv);
 	if (ret)
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index f4b5e221708d..44413594ba47 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -1473,6 +1473,9 @@ static int eb_relocate_slow(struct i915_execbuffer *eb)
 		goto out;
 	}
 
+	/* A frequent cause for EAGAIN are currently unavailable client pages */
+	flush_workqueue(eb->i915->mm.userptr_wq);
+
 	err = i915_mutex_lock_interruptible(dev);
 	if (err) {
 		mutex_lock(&dev->struct_mutex);
diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
index 9f84be171ad2..8b5232688de0 100644
--- a/drivers/gpu/drm/i915/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
@@ -378,7 +378,7 @@ __i915_mm_struct_free(struct kref *kref)
 	mutex_unlock(&mm->i915->mm_lock);
 
 	INIT_WORK(&mm->work, __i915_mm_struct_free__worker);
-	schedule_work(&mm->work);
+	queue_work(mm->i915->mm.userptr_wq, &mm->work);
 }
 
 static void
@@ -598,7 +598,7 @@ __i915_gem_userptr_get_pages_schedule(struct drm_i915_gem_object *obj)
 	get_task_struct(work->task);
 
 	INIT_WORK(&work->work, __i915_gem_userptr_get_pages_worker);
-	schedule_work(&work->work);
+	queue_work(to_i915(obj->base.dev)->mm.userptr_wq, &work->work);
 
 	return ERR_PTR(-EAGAIN);
 }
@@ -829,8 +829,20 @@ i915_gem_userptr_ioctl(struct drm_device *dev, void *data, struct drm_file *file
 	return 0;
 }
 
-void i915_gem_init_userptr(struct drm_i915_private *dev_priv)
+int i915_gem_init_userptr(struct drm_i915_private *dev_priv)
 {
 	mutex_init(&dev_priv->mm_lock);
 	hash_init(dev_priv->mm_structs);
+
+	dev_priv->mm.userptr_wq =
+		alloc_workqueue("i915-userptr-acquire", WQ_HIGHPRI, 0);
+	if (!dev_priv->mm.userptr_wq)
+		return -ENOMEM;
+
+	return 0;
+}
+
+void i915_gem_cleanup_userptr(struct drm_i915_private *dev_priv)
+{
+	destroy_workqueue(dev_priv->mm.userptr_wq);
 }
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 95+ messages in thread

* [PATCH 25/27] drm/i915: Allow execbuffer to use the first object as the batch
  2017-04-19  9:41 Confluence of eb + timeline improvements Chris Wilson
                   ` (23 preceding siblings ...)
  2017-04-19  9:41 ` [PATCH 24/27] drm/i915: Wait upon userptr get-user-pages within execbuffer Chris Wilson
@ 2017-04-19  9:41 ` Chris Wilson
  2017-04-19  9:41 ` [PATCH 26/27] drm/i915: Async GPU relocation processing Chris Wilson
                   ` (5 subsequent siblings)
  30 siblings, 0 replies; 95+ messages in thread
From: Chris Wilson @ 2017-04-19  9:41 UTC (permalink / raw)
  To: intel-gfx

Currently, the last object in the execlist is always the batch.
However, when building the batch buffer we often know the batch object
first, and if we can use the first slot in the execlist we can emit
relocation instructions relative to it immediately and avoid a separate
pass to adjust the relocations to point to the last execlist slot.
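
As a usage illustration only (assuming a uapi header that already carries
the new flag, and after confirming I915_PARAM_HAS_EXEC_BATCH_FIRST via
getparam), a client could submit with the batch in slot 0 along these
lines; the helper name is made up and error handling is omitted:

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>

/* objects[0] is reserved for the batch; objects[1..count-1] hold the
 * remaining buffers referenced by the batch.
 */
static int submit_batch_first(int fd, struct drm_i915_gem_exec_object2 *objects,
			      unsigned int count, uint32_t batch_handle)
{
	struct drm_i915_gem_execbuffer2 execbuf;

	memset(&objects[0], 0, sizeof(objects[0]));
	objects[0].handle = batch_handle;

	memset(&execbuf, 0, sizeof(execbuf));
	execbuf.buffers_ptr = (uintptr_t)objects;
	execbuf.buffer_count = count;
	execbuf.flags = I915_EXEC_RENDER | I915_EXEC_BATCH_FIRST;

	return ioctl(fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, &execbuf);
}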

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c            |  1 +
 drivers/gpu/drm/i915/i915_gem_execbuffer.c |  5 ++++-
 include/uapi/drm/i915_drm.h                | 16 +++++++++++++++-
 3 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 6ce736514396..5f2aeb27aeb7 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -351,6 +351,7 @@ static int i915_getparam(struct drm_device *dev, void *data,
 	case I915_PARAM_HAS_EXEC_ASYNC:
 	case I915_PARAM_HAS_EXEC_FENCE:
 	case I915_PARAM_HAS_EXEC_CAPTURE:
+	case I915_PARAM_HAS_EXEC_BATCH_FIRST:
 		/* For the time being all of these are always true;
 		 * if some supported hardware does not have one of these
 		 * features this value needs to be provided from
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index 44413594ba47..1da7c3a46436 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -622,7 +622,10 @@ ht_needs_resize(const struct i915_gem_context_vma_lut *lut)
 
 static unsigned int eb_batch_index(const struct i915_execbuffer *eb)
 {
-	return eb->buffer_count - 1;
+	if (eb->args->flags & I915_EXEC_BATCH_FIRST)
+		return 0;
+	else
+		return eb->buffer_count - 1;
 }
 
 static int eb_select_context(struct i915_execbuffer *eb)
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index f24a80d2d42e..f43a22ae955b 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -418,6 +418,11 @@ typedef struct drm_i915_irq_wait {
  */
 #define I915_PARAM_HAS_EXEC_CAPTURE	 45
 
+/* Query whether DRM_I915_GEM_EXECBUFFER2 supports supplying the batch buffer
+ * as the first execobject as opposed to the last. See I915_EXEC_BATCH_FIRST.
+ */
+#define I915_PARAM_HAS_EXEC_BATCH_FIRST	 46
+
 typedef struct drm_i915_getparam {
 	__s32 param;
 	/*
@@ -904,7 +909,16 @@ struct drm_i915_gem_execbuffer2 {
  */
 #define I915_EXEC_FENCE_OUT		(1<<17)
 
-#define __I915_EXEC_UNKNOWN_FLAGS (-(I915_EXEC_FENCE_OUT<<1))
+/* Traditionally the execbuf ioctl has only considered the final element in
+ * the execobject[] to be the executable batch. Often though, the client
+ * will know the batch object prior to construction, and being able to place
+ * it into the execobject[] array first can simplify the relocation tracking.
+ * Setting I915_EXEC_BATCH_FIRST tells execbuf to use element 0 of the
+ * execobject[] as the batch instead (the default is to use the last
+ * element).
+ */
+#define I915_EXEC_BATCH_FIRST		(1<<18)
+#define __I915_EXEC_UNKNOWN_FLAGS (-(I915_EXEC_BATCH_FIRST<<1))
 
 #define I915_EXEC_CONTEXT_ID_MASK	(0xffffffff)
 #define i915_execbuffer2_set_context_id(eb2, context) \
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 95+ messages in thread

* [PATCH 26/27] drm/i915: Async GPU relocation processing
  2017-04-19  9:41 Confluence of eb + timeline improvements Chris Wilson
                   ` (24 preceding siblings ...)
  2017-04-19  9:41 ` [PATCH 25/27] drm/i915: Allow execbuffer to use the first object as the batch Chris Wilson
@ 2017-04-19  9:41 ` Chris Wilson
  2017-04-19  9:41 ` [PATCH 27/27] drm/i915/scheduler: Support user-defined priorities Chris Wilson
                   ` (4 subsequent siblings)
  30 siblings, 0 replies; 95+ messages in thread
From: Chris Wilson @ 2017-04-19  9:41 UTC (permalink / raw)
  To: intel-gfx

If the user requires patching of their batch or auxiliary buffers, we
currently make the alterations on the CPU. If they are active on the GPU
at the time, we wait under the struct_mutex for them to finish executing
before we rewrite the contents. This happens if shared relocation trees
are used between different contexts with separate address spaces (and
the buffers then have different addresses in each); the 3D state will
need to be adjusted between execution on each context. However, we don't
need to use the CPU to do the relocation patching, as we could queue
commands to the GPU to perform it and use fences to serialise the
operation with current and future activity - so the operation on the GPU
appears just as atomic as performing it immediately. Performing the
relocation rewrites on the GPU is not free; in terms of pure throughput,
the number of relocations/s is about halved - but, more importantly, so
is the time under the struct_mutex.
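
For reference, the command-stream budget per relocation (the len handed
to reloc_gpu() in the patch) can be sketched as a standalone helper; this
mirrors the calculation in relocate_entry() and is purely illustrative:

#include <stdbool.h>
#include <stdint.h>

/* Dwords of command stream consumed per relocation. 'wide' means a
 * 64-bit relocation value; a qword-misaligned offset has to be split
 * into two dword writes. A return of 0 means fall back to the CPU/GTT
 * path (gen2 MI_STORE_DWORD_IMM takes a physical address).
 */
static unsigned int reloc_cmd_len(bool wide, unsigned int gen, uint64_t offset)
{
	if (wide)
		return (offset & 7) ? 8 : 5;
	if (gen >= 4)
		return 4;
	if (gen >= 3)
		return 3;
	return 0;
}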

v2: Break out the request/batch allocation for clearer error flow.
v3: A few asserts to ensure rq ordering is maintained

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_gem.c            |   1 -
 drivers/gpu/drm/i915/i915_gem_execbuffer.c | 225 ++++++++++++++++++++++++++++-
 2 files changed, 219 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 55cb8a2cb99b..7d9cabdab89a 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -4381,7 +4381,6 @@ static void __i915_gem_free_objects(struct drm_i915_private *i915,
 		GEM_BUG_ON(i915_gem_object_is_active(obj));
 		list_for_each_entry_safe(vma, vn,
 					 &obj->vma_list, obj_link) {
-			GEM_BUG_ON(!i915_vma_is_ggtt(vma));
 			GEM_BUG_ON(i915_vma_is_active(vma));
 			vma->flags &= ~I915_VMA_PIN_MASK;
 			i915_vma_close(vma);
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index 1da7c3a46436..e35476b0ca1b 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -40,7 +40,12 @@
 #include "intel_drv.h"
 #include "intel_frontbuffer.h"
 
-#define DBG_USE_CPU_RELOC 0 /* -1 force GTT relocs; 1 force CPU relocs */
+enum {
+	FORCE_CPU_RELOC = 1,
+	FORCE_GTT_RELOC,
+	FORCE_GPU_RELOC,
+#define DBG_FORCE_RELOC 0 /* choose one of the above! */
+};
 
 #define  __EXEC_OBJECT_HAS_PIN		BIT(31)
 #define  __EXEC_OBJECT_HAS_FENCE	BIT(30)
@@ -210,10 +215,15 @@ struct i915_execbuffer {
 		struct drm_mm_node node; /** temporary GTT binding */
 		unsigned long vaddr; /** Current kmap address */
 		unsigned long page; /** Currently mapped page index */
+		unsigned int gen; /** Cached value of INTEL_GEN */
 		bool use_64bit_reloc : 1;
 		bool has_llc : 1;
 		bool has_fence : 1;
 		bool needs_unfenced : 1;
+
+		struct drm_i915_gem_request *rq;
+		u32 *rq_cmd;
+		unsigned int rq_size;
 	} reloc_cache;
 
 	u64 invalid_flags; /** Set of execobj.flags that are invalid */
@@ -487,8 +497,11 @@ static inline int use_cpu_reloc(const struct reloc_cache *cache,
 	if (!i915_gem_object_has_struct_page(obj))
 		return false;
 
-	if (DBG_USE_CPU_RELOC)
-		return DBG_USE_CPU_RELOC > 0;
+	if (DBG_FORCE_RELOC == FORCE_CPU_RELOC)
+		return true;
+
+	if (DBG_FORCE_RELOC == FORCE_GTT_RELOC)
+		return false;
 
 	return (cache->has_llc ||
 		obj->cache_dirty ||
@@ -864,6 +877,8 @@ static void eb_release_vma(const struct i915_execbuffer *eb)
 
 static void eb_destroy(const struct i915_execbuffer *eb)
 {
+	GEM_BUG_ON(eb->reloc_cache.rq);
+
 	if (eb->lut_size >= 0)
 		kfree(eb->buckets);
 }
@@ -881,11 +896,14 @@ static void reloc_cache_init(struct reloc_cache *cache,
 	cache->page = -1;
 	cache->vaddr = 0;
 	/* Must be a variable in the struct to allow GCC to unroll. */
+	cache->gen = INTEL_GEN(i915);
 	cache->has_llc = HAS_LLC(i915);
-	cache->has_fence = INTEL_GEN(i915) < 4;
-	cache->needs_unfenced = INTEL_INFO(i915)->unfenced_needs_alignment;
 	cache->use_64bit_reloc = HAS_64BIT_RELOC(i915);
+	cache->has_fence = cache->gen < 4;
+	cache->needs_unfenced = INTEL_INFO(i915)->unfenced_needs_alignment;
 	cache->node.allocated = false;
+	cache->rq = NULL;
+	cache->rq_size = 0;
 }
 
 static inline void *unmask_page(unsigned long p)
@@ -907,10 +925,24 @@ static inline struct i915_ggtt *cache_to_ggtt(struct reloc_cache *cache)
 	return &i915->ggtt;
 }
 
+static void reloc_gpu_flush(struct reloc_cache *cache)
+{
+	GEM_BUG_ON(cache->rq_size >= cache->rq->batch->obj->base.size / sizeof(u32));
+	cache->rq_cmd[cache->rq_size] = MI_BATCH_BUFFER_END;
+	i915_gem_object_unpin_map(cache->rq->batch->obj);
+	i915_gem_chipset_flush(cache->rq->i915);
+
+	__i915_add_request(cache->rq, true);
+	cache->rq = NULL;
+}
+
 static void reloc_cache_reset(struct reloc_cache *cache)
 {
 	void *vaddr;
 
+	if (cache->rq)
+		reloc_gpu_flush(cache);
+
 	if (!cache->vaddr)
 		return;
 
@@ -1075,6 +1107,121 @@ static void clflush_write32(u32 *addr, u32 value, unsigned int flushes)
 		*addr = value;
 }
 
+static int __reloc_gpu_alloc(struct i915_execbuffer *eb,
+			     struct i915_vma *vma,
+			     unsigned int len)
+{
+	struct reloc_cache *cache = &eb->reloc_cache;
+	struct drm_i915_gem_object *obj;
+	struct drm_i915_gem_request *rq;
+	struct i915_vma *batch;
+	u32 *cmd;
+	int err;
+
+	GEM_BUG_ON(vma->obj->base.write_domain & I915_GEM_DOMAIN_CPU);
+
+	obj = i915_gem_batch_pool_get(&eb->engine->batch_pool, PAGE_SIZE);
+	if (IS_ERR(obj))
+		return PTR_ERR(obj);
+
+	cmd = i915_gem_object_pin_map(obj,
+				      cache->has_llc ? I915_MAP_WB : I915_MAP_WC);
+	i915_gem_object_unpin_pages(obj);
+	if (IS_ERR(cmd))
+		return PTR_ERR(cmd);
+
+	err = i915_gem_object_set_to_wc_domain(obj, false);
+	if (err)
+		goto err_unmap;
+
+	batch = i915_vma_instance(obj, vma->vm, NULL);
+	if (IS_ERR(batch)) {
+		err = PTR_ERR(batch);
+		goto err_unmap;
+	}
+
+	err = i915_vma_pin(batch, 0, 0, PIN_USER | PIN_NONBLOCK);
+	if (err)
+		goto err_unmap;
+
+	rq = i915_gem_request_alloc(eb->engine, eb->ctx);
+	if (IS_ERR(rq)) {
+		err = PTR_ERR(rq);
+		goto err_unpin;
+	}
+
+	err = i915_gem_request_await_object(rq, vma->obj, true);
+	if (err)
+		goto err_request;
+
+	err = eb->engine->emit_flush(rq, EMIT_INVALIDATE);
+	if (err)
+		goto err_request;
+
+	err = i915_switch_context(rq);
+	if (err)
+		goto err_request;
+
+	err = eb->engine->emit_bb_start(rq,
+					batch->node.start, PAGE_SIZE,
+					cache->gen > 5 ? 0 : I915_DISPATCH_SECURE);
+	if (err)
+		goto err_request;
+
+	GEM_BUG_ON(!reservation_object_test_signaled_rcu(obj->resv, true));
+	i915_vma_move_to_active(batch, rq, 0);
+	reservation_object_lock(obj->resv, NULL);
+	reservation_object_add_excl_fence(obj->resv, &rq->fence);
+	reservation_object_unlock(obj->resv);
+	i915_vma_unpin(batch);
+
+	i915_vma_move_to_active(vma, rq, true);
+	reservation_object_lock(vma->obj->resv, NULL);
+	reservation_object_add_excl_fence(vma->obj->resv, &rq->fence);
+	reservation_object_unlock(vma->obj->resv);
+
+	rq->batch = batch;
+
+	cache->rq = rq;
+	cache->rq_cmd = cmd;
+	cache->rq_size = 0;
+
+	/* Return with batch mapping (cmd) still pinned */
+	return 0;
+
+err_request:
+	i915_add_request(rq);
+err_unpin:
+	i915_vma_unpin(batch);
+err_unmap:
+	i915_gem_object_unpin_map(obj);
+	return err;
+}
+
+static u32 *reloc_gpu(struct i915_execbuffer *eb,
+		      struct i915_vma *vma,
+		      unsigned int len)
+{
+	struct reloc_cache *cache = &eb->reloc_cache;
+	u32 *cmd;
+
+	if (cache->rq_size > PAGE_SIZE/sizeof(u32) - (len + 1))
+		reloc_gpu_flush(cache);
+
+	if (unlikely(!cache->rq)) {
+		int err;
+
+		err = __reloc_gpu_alloc(eb, vma, len);
+		if (unlikely(err))
+			return ERR_PTR(err);
+	}
+
+	cmd = cache->rq_cmd + cache->rq_size;
+	cache->rq_size += len;
+
+	return cmd;
+}
+
 static u64
 relocate_entry(struct i915_vma *vma,
 	       const struct drm_i915_gem_relocation_entry *reloc,
@@ -1087,6 +1234,67 @@ relocate_entry(struct i915_vma *vma,
 	bool wide = eb->reloc_cache.use_64bit_reloc;
 	void *vaddr;
 
+	if (!eb->reloc_cache.vaddr &&
+	    (DBG_FORCE_RELOC == FORCE_GPU_RELOC ||
+	     !reservation_object_test_signaled_rcu(obj->resv, true))) {
+		const unsigned int gen = eb->reloc_cache.gen;
+		unsigned int len;
+		u32 *batch;
+		u64 addr;
+
+		if (wide)
+			len = offset & 7 ? 8 : 5;
+		else if (gen >= 4)
+			len = 4;
+		else if (gen >= 3)
+			len = 3;
+		else /* On gen2 MI_STORE_DWORD_IMM uses a physical address */
+			goto repeat;
+
+		batch = reloc_gpu(eb, vma, len);
+		if (IS_ERR(batch))
+			goto repeat;
+
+		addr = gen8_canonical_addr(vma->node.start + offset);
+		if (wide) {
+			if (offset & 7) {
+				*batch++ = MI_STORE_DWORD_IMM_GEN4;
+				*batch++ = lower_32_bits(addr);
+				*batch++ = upper_32_bits(addr);
+				*batch++ = lower_32_bits(target_offset);
+
+				addr = gen8_canonical_addr(addr + 4);
+
+				*batch++ = MI_STORE_DWORD_IMM_GEN4;
+				*batch++ = lower_32_bits(addr);
+				*batch++ = upper_32_bits(addr);
+				*batch++ = upper_32_bits(target_offset);
+			} else {
+				*batch++ = (MI_STORE_DWORD_IMM_GEN4 | (1 << 21)) + 1;
+				*batch++ = lower_32_bits(addr);
+				*batch++ = upper_32_bits(addr);
+				*batch++ = lower_32_bits(target_offset);
+				*batch++ = upper_32_bits(target_offset);
+			}
+		} else if (gen >= 6) {
+			*batch++ = MI_STORE_DWORD_IMM_GEN4;
+			*batch++ = 0;
+			*batch++ = addr;
+			*batch++ = target_offset;
+		} else if (gen >= 4) {
+			*batch++ = MI_STORE_DWORD_IMM_GEN4 | MI_USE_GGTT;
+			*batch++ = 0;
+			*batch++ = addr;
+			*batch++ = target_offset;
+		} else {
+			*batch++ = MI_STORE_DWORD_IMM | MI_MEM_VIRTUAL;
+			*batch++ = addr;
+			*batch++ = target_offset;
+		}
+
+		goto out;
+	}
+
 repeat:
 	vaddr = reloc_vaddr(obj, &eb->reloc_cache, offset >> PAGE_SHIFT);
 	if (IS_ERR(vaddr))
@@ -1103,6 +1311,7 @@ relocate_entry(struct i915_vma *vma,
 		goto repeat;
 	}
 
+out:
 	return gen8_canonical_addr(target->node.start) | UPDATE;
 }
 
@@ -1163,7 +1372,8 @@ eb_relocate_entry(struct i915_execbuffer *eb,
 	/* If the relocation already has the right value in it, no
 	 * more work needs to be done.
 	 */
-	if (gen8_canonical_addr(target->node.start) == reloc->presumed_offset)
+	if (!DBG_FORCE_RELOC &&
+	    gen8_canonical_addr(target->node.start) == reloc->presumed_offset)
 		return 0;
 
 	/* Check that the relocation address is valid... */
@@ -2022,6 +2232,9 @@ i915_gem_do_execbuffer(struct drm_device *dev,
 		eb.batch = vma;
 	}
 
+	/* All GPU relocation batches must be submitted prior to the user rq */
+	GEM_BUG_ON(eb.reloc_cache.rq);
+
 	/* Allocate a request for this batch buffer nice and early. */
 	eb.request = i915_gem_request_alloc(eb.engine, eb.ctx);
 	if (IS_ERR(eb.request)) {
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 95+ messages in thread

* [PATCH 27/27] drm/i915/scheduler: Support user-defined priorities
  2017-04-19  9:41 Confluence of eb + timeline improvements Chris Wilson
                   ` (25 preceding siblings ...)
  2017-04-19  9:41 ` [PATCH 26/27] drm/i915: Async GPU relocation processing Chris Wilson
@ 2017-04-19  9:41 ` Chris Wilson
  2017-04-19 10:09   ` Chris Wilson
  2017-04-19 10:01 ` ✗ Fi.CI.BAT: failure for series starting with [01/27] drm/i915/selftests: Allocate inode/file dynamically Patchwork
                   ` (3 subsequent siblings)
  30 siblings, 1 reply; 95+ messages in thread
From: Chris Wilson @ 2017-04-19  9:41 UTC (permalink / raw)
  To: intel-gfx

Use a priority stored in the context as the initial value when
submitting a request. This allows us to change the default priority on a
per-context basis, allowing different contexts to be favoured with GPU
time at the expense of lower importance work. The user can adjust the
context's priority via I915_CONTEXT_PARAM_PRIORITY, with more positive
values being higher priority (they will be serviced earlier, after their
dependencies have been resolved). Any prerequisite work for an execbuf
will have its priority raised to match the new request as required.

Normal users can specify any value in the range of -1023 to 0 [default],
i.e. they can reduce the priority of their workloads (and temporarily
boost it back to normal if so desired).

Privileged users can specify any value in the range of -1023 to 1023
[default is 0], i.e. they can raise their priority above all others and
so potentially starve the system.

Note that the existing schedulers are neither fair nor load balancing;
execution is strictly by priority on a first-come, first-served basis,
and the driver may choose to boost some requests above the range
available to users.

This priority was originally based around nice(2), but evolved to allow
clients to adjust their priority within a small range, and allow for a
privileged high priority range.

For example, this can be used to implement EGL_IMG_context_priority
https://www.khronos.org/registry/egl/extensions/IMG/EGL_IMG_context_priority.txt

	"EGL_CONTEXT_PRIORITY_LEVEL_IMG determines the priority level of
        the context to be created. This attribute is a hint, as an
        implementation may not support multiple contexts at some
        priority levels and system policy may limit access to high
        priority contexts to appropriate system privilege level. The
        default value for EGL_CONTEXT_PRIORITY_LEVEL_IMG is
        EGL_CONTEXT_PRIORITY_MEDIUM_IMG."

so we can map

	PRIORITY_HIGH -> 1023 [privileged, will fall back to 0]
	PRIORITY_MED -> 0 [default]
	PRIORITY_LOW -> -1023

They also map onto the priorities used by VkQueue (and a VkQueue is
essentially a timeline, our i915_gem_context under full-ppgtt).
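
As an illustration only (assuming a uapi header that carries the new
define), userspace could then adjust a context along these lines; the
helper is hypothetical:

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>

/* Request a non-default priority for a context, e.g. priority = -512 to
 * deprioritise background work. Positive values additionally require
 * CAP_SYS_NICE; .size must be left at zero for this param.
 */
static int set_context_priority(int fd, uint32_t ctx_id, int priority)
{
	struct drm_i915_gem_context_param arg;

	memset(&arg, 0, sizeof(arg));
	arg.ctx_id = ctx_id;
	arg.param = I915_CONTEXT_PARAM_PRIORITY;
	arg.value = (uint64_t)(int64_t)priority;

	return ioctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM, &arg);
}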

v2: s/CAP_SYS_ADMIN/CAP_SYS_NICE/

Testcase: igt/gem_exec_schedule
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/i915/i915_gem_context.c | 22 ++++++++++++++++++++++
 include/uapi/drm/i915_drm.h             |  3 +++
 2 files changed, 25 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 23fd1470a7f4..694eddba51a6 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -1141,6 +1141,9 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
 	case I915_CONTEXT_PARAM_BANNABLE:
 		args->value = i915_gem_context_is_bannable(ctx);
 		break;
+	case I915_CONTEXT_PARAM_PRIORITY:
+		args->value = ctx->priority;
+		break;
 	default:
 		ret = -EINVAL;
 		break;
@@ -1198,6 +1201,25 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
 		else
 			i915_gem_context_clear_bannable(ctx);
 		break;
+
+	case I915_CONTEXT_PARAM_PRIORITY:
+		{
+			int priority = args->value;
+
+			if (args->size)
+				ret = -EINVAL;
+			else if (!to_i915(dev)->engine[RCS]->schedule)
+				ret = -ENODEV;
+			else if (priority >= I915_PRIORITY_MAX ||
+				 priority <= I915_PRIORITY_MIN)
+				ret = -EINVAL;
+			else if (priority > 0 && !capable(CAP_SYS_NICE))
+				ret = -EPERM;
+			else
+				ctx->priority = priority;
+		}
+		break;
+
 	default:
 		ret = -EINVAL;
 		break;
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index f43a22ae955b..200f2cf393b2 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -395,6 +395,8 @@ typedef struct drm_i915_irq_wait {
 
 /* Query whether DRM_I915_GEM_EXECBUFFER2 supports user defined execution
  * priorities and the driver will attempt to execute batches in priority order.
+ * The initial priority for each batch is supplied by the context and is
+ * controlled via I915_CONTEXT_PARAM_PRIORITY.
  */
 #define I915_PARAM_HAS_SCHEDULER	 41
 #define I915_PARAM_HUC_STATUS		 42
@@ -1318,6 +1320,7 @@ struct drm_i915_gem_context_param {
 #define I915_CONTEXT_PARAM_GTT_SIZE	0x3
 #define I915_CONTEXT_PARAM_NO_ERROR_CAPTURE	0x4
 #define I915_CONTEXT_PARAM_BANNABLE	0x5
+#define I915_CONTEXT_PARAM_PRIORITY	0x6
 	__u64 value;
 };
 
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 95+ messages in thread

* ✗ Fi.CI.BAT: failure for series starting with [01/27] drm/i915/selftests: Allocate inode/file dynamically
  2017-04-19  9:41 Confluence of eb + timeline improvements Chris Wilson
                   ` (26 preceding siblings ...)
  2017-04-19  9:41 ` [PATCH 27/27] drm/i915/scheduler: Support user-defined priorities Chris Wilson
@ 2017-04-19 10:01 ` Patchwork
  2017-04-27  7:27 ` ✓ Fi.CI.BAT: success for series starting with [01/27] drm/i915/selftests: Allocate inode/file dynamically (rev2) Patchwork
                   ` (2 subsequent siblings)
  30 siblings, 0 replies; 95+ messages in thread
From: Patchwork @ 2017-04-19 10:01 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: series starting with [01/27] drm/i915/selftests: Allocate inode/file dynamically
URL   : https://patchwork.freedesktop.org/series/23227/
State : failure

== Summary ==

Series 23227v1 Series without cover letter
https://patchwork.freedesktop.org/api/1.0/series/23227/revisions/1/mbox/

Test gem_exec_fence:
        Subgroup await-hang-default:
                pass       -> INCOMPLETE (fi-ilk-650)
Test gem_exec_flush:
        Subgroup basic-batch-kernel-default-uc:
                fail       -> PASS       (fi-snb-2600) fdo#100007

fdo#100007 https://bugs.freedesktop.org/show_bug.cgi?id=100007

fi-bdw-5557u     total:278  pass:267  dwarn:0   dfail:0   fail:0   skip:11  time:426s
fi-bdw-gvtdvm    total:278  pass:256  dwarn:8   dfail:0   fail:0   skip:14  time:420s
fi-bsw-n3050     total:278  pass:242  dwarn:0   dfail:0   fail:0   skip:36  time:569s
fi-bxt-j4205     total:278  pass:259  dwarn:0   dfail:0   fail:0   skip:19  time:504s
fi-bxt-t5700     total:278  pass:258  dwarn:0   dfail:0   fail:0   skip:20  time:549s
fi-byt-j1900     total:278  pass:254  dwarn:0   dfail:0   fail:0   skip:24  time:491s
fi-byt-n2820     total:278  pass:250  dwarn:0   dfail:0   fail:0   skip:28  time:483s
fi-hsw-4770      total:278  pass:262  dwarn:0   dfail:0   fail:0   skip:16  time:410s
fi-hsw-4770r     total:278  pass:262  dwarn:0   dfail:0   fail:0   skip:16  time:410s
fi-ilk-650       total:48   pass:27   dwarn:0   dfail:0   fail:0   skip:20 
fi-ivb-3520m     total:278  pass:260  dwarn:0   dfail:0   fail:0   skip:18  time:491s
fi-ivb-3770      total:278  pass:260  dwarn:0   dfail:0   fail:0   skip:18  time:471s
fi-kbl-7500u     total:278  pass:260  dwarn:0   dfail:0   fail:0   skip:18  time:450s
fi-kbl-7560u     total:278  pass:267  dwarn:1   dfail:0   fail:0   skip:10  time:562s
fi-skl-6260u     total:278  pass:268  dwarn:0   dfail:0   fail:0   skip:10  time:448s
fi-skl-6700hq    total:278  pass:261  dwarn:0   dfail:0   fail:0   skip:17  time:562s
fi-skl-6700k     total:278  pass:256  dwarn:4   dfail:0   fail:0   skip:18  time:450s
fi-skl-6770hq    total:278  pass:268  dwarn:0   dfail:0   fail:0   skip:10  time:483s
fi-skl-gvtdvm    total:278  pass:265  dwarn:0   dfail:0   fail:0   skip:13  time:425s
fi-snb-2520m     total:278  pass:250  dwarn:0   dfail:0   fail:0   skip:28  time:531s
fi-snb-2600      total:278  pass:249  dwarn:0   dfail:0   fail:0   skip:29  time:408s

05040c47d415b1621c0d64e40c0062890b854c9f drm-tip: 2017y-04m-19d-06h-48m-08s UTC integration manifest
f615350 drm/i915/scheduler: Support user-defined priorities
40273ef drm/i915: Async GPU relocation processing
910dcf5 drm/i915: Allow execbuffer to use the first object as the batch
4949570 drm/i915: Wait upon userptr get-user-pages within execbuffer
5e3982e drm/i915: First try the previous execbuffer location
2f93e2f drm/i915: Eliminate lots of iterations over the execobjects array
155a5c7 drm/i915: Pass vma to relocate entry
acc9363 drm/i915: Store a direct lookup from object handle to vma
3e9677a drm/i915: Split vma exec_link/evict_link
5a5cfe5 drm/i915: Use vma->exec_entry as our double-entry placeholder
7ea3245 drm/i915: Amalgamate execbuffer parameter structures
d8a64d9 drm/i915: Reinstate reservation_object zapping for batch_pool objects
1dca75c drm/i915: Split execlist priority queue into rbtree + linked list
8f5c72c drm/i915: Don't mark an execlists context-switch when idle
66953d3 drm/i915/execlists: Pack the count into the low bits of the port.request
913f088 drm/i915: Only report a wakeup if the waiter was truly asleep
d153946e drm/i915: Switch the global i915.semaphores check to a local predicate
95ac420 drm/i915: Do not record a successful syncpoint for a dma-await
6c682a7 drm/i915: Confirm the request is still active before adding it to the await
74936ad drm/i915: Rename intel_timeline.sync_seqno[] to .global_sync[]
c629206 drm/i915: Squash repeated awaits on the same fence
c5d2fa8 drm/i915: Redefine ptr_pack_bits() and friends
27a74c5 drm/i915: Make ptr_unpack_bits() more function-like
b422df4 drm/i915: Lift timeline ordering to await_dma_fence
942f00e drm/i915: Mark up clflushes as belonging to an unordered timeline
2c0d12b drm/i915: Mark CPU cache as dirty on every transition for CPU writes
e008f0e drm/i915/selftests: Allocate inode/file dynamically

== Logs ==

For more details see: https://intel-gfx-ci.01.org/CI/Patchwork_4519/
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 27/27] drm/i915/scheduler: Support user-defined priorities
  2017-04-19  9:41 ` [PATCH 27/27] drm/i915/scheduler: Support user-defined priorities Chris Wilson
@ 2017-04-19 10:09   ` Chris Wilson
  2017-04-19 11:07     ` Tvrtko Ursulin
  0 siblings, 1 reply; 95+ messages in thread
From: Chris Wilson @ 2017-04-19 10:09 UTC (permalink / raw)
  To: intel-gfx

On Wed, Apr 19, 2017 at 10:41:43AM +0100, Chris Wilson wrote:
> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> index f43a22ae955b..200f2cf393b2 100644
> --- a/include/uapi/drm/i915_drm.h
> +++ b/include/uapi/drm/i915_drm.h
> @@ -395,6 +395,8 @@ typedef struct drm_i915_irq_wait {
>  
>  /* Query whether DRM_I915_GEM_EXECBUFFER2 supports user defined execution
>   * priorities and the driver will attempt to execute batches in priority order.
> + * The initial priority for each batch is supplied by the context and is
> + * controlled via I915_CONTEXT_PARAM_PRIORITY.
>   */
>  #define I915_PARAM_HAS_SCHEDULER	 41
>  #define I915_PARAM_HUC_STATUS		 42
> @@ -1318,6 +1320,7 @@ struct drm_i915_gem_context_param {
>  #define I915_CONTEXT_PARAM_GTT_SIZE	0x3
>  #define I915_CONTEXT_PARAM_NO_ERROR_CAPTURE	0x4
>  #define I915_CONTEXT_PARAM_BANNABLE	0x5
> +#define I915_CONTEXT_PARAM_PRIORITY	0x6

Grr. Forgot to add min/max defines.

#define I915_CONTEXT_MAX_USER_PRIORITY		1023 /* inclusive */
#define I915_CONTEXT_DEFAULT_PRIORITY		0
#define I915_CONTEXT_MIN_USER_PRIORITY		-1023 /* inclusive */

Or should it be I915_CONTEXT_PRIORITY_MAX_USER etc?
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 27/27] drm/i915/scheduler: Support user-defined priorities
  2017-04-19 10:09   ` Chris Wilson
@ 2017-04-19 11:07     ` Tvrtko Ursulin
  0 siblings, 0 replies; 95+ messages in thread
From: Tvrtko Ursulin @ 2017-04-19 11:07 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 19/04/2017 11:09, Chris Wilson wrote:
> On Wed, Apr 19, 2017 at 10:41:43AM +0100, Chris Wilson wrote:
>> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
>> index f43a22ae955b..200f2cf393b2 100644
>> --- a/include/uapi/drm/i915_drm.h
>> +++ b/include/uapi/drm/i915_drm.h
>> @@ -395,6 +395,8 @@ typedef struct drm_i915_irq_wait {
>>
>>  /* Query whether DRM_I915_GEM_EXECBUFFER2 supports user defined execution
>>   * priorities and the driver will attempt to execute batches in priority order.
>> + * The initial priority for each batch is supplied by the context and is
>> + * controlled via I915_CONTEXT_PARAM_PRIORITY.
>>   */
>>  #define I915_PARAM_HAS_SCHEDULER	 41
>>  #define I915_PARAM_HUC_STATUS		 42
>> @@ -1318,6 +1320,7 @@ struct drm_i915_gem_context_param {
>>  #define I915_CONTEXT_PARAM_GTT_SIZE	0x3
>>  #define I915_CONTEXT_PARAM_NO_ERROR_CAPTURE	0x4
>>  #define I915_CONTEXT_PARAM_BANNABLE	0x5
>> +#define I915_CONTEXT_PARAM_PRIORITY	0x6
>
> Grr. Forgot to add min/max defines.
>
> #define I915_CONTEXT_MAX_USER_PRIORITY		1023 /* inclusive */
> #define I915_CONTEXT_DEFAULT_PRIORITY		0
> #define I915_CONTEXT_MIN_USER_PRIORITY		-1023 /* inclusive */

Yes, and use these in context get param, including the default instead
of the literal zero, I think.

> Or should it be I915_CONTEXT_PRIORITY_MAX_USER etc?

Priority last somehow looks better to me, since that way it is clearly a
separate category from the param names. But I don't mind either way.

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 02/27] drm/i915: Mark CPU cache as dirty on every transition for CPU writes
  2017-04-19  9:41 ` [PATCH 02/27] drm/i915: Mark CPU cache as dirty on every transition for CPU writes Chris Wilson
@ 2017-04-19 16:52   ` Dongwon Kim
  2017-04-19 17:15     ` Chris Wilson
                       ` (2 more replies)
  0 siblings, 3 replies; 95+ messages in thread
From: Dongwon Kim @ 2017-04-19 16:52 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

I tried your patch but it didn't fix the original 
problem. I think it is somehow related to the flushing condition
here:

@@ -1129,10 +1129,8 @@ i915_gem_execbuffer_move_to_gpu(struct drm_i915_gem_request *req,
 	if (vma->exec_entry->flags & EXEC_OBJECT_ASYNC)
 		continue;

-	if (obj->base.write_domain & I915_GEM_DOMAIN_CPU) {
+	if (obj->base.write_domain & obj->cache_dirty)
 		i915_gem_clflush_object(obj, 0);
-		obj->base.write_domain = 0;
-	}

here, after your patch is applied, we still do the clflush only if
write_domain is non-zero, even if the cache_dirty flag is set.

And now please look at this:

@@ -753,6 +766,11 @@ flush_write_domain(struct drm_i915_gem_object *obj, unsigned int flush_domains)
        case I915_GEM_DOMAIN_CPU:
                i915_gem_clflush_object(obj, I915_CLFLUSH_SYNC);
                break;
+
+       case I915_GEM_DOMAIN_RENDER:
+               if (gpu_write_needs_clflush(obj))
+                       obj->cache_dirty = true;
+               break;
        }

        obj->base.write_domain=0;

So here, if the write_domain is I915_GEM_DOMAIN_RENDER, we set cache_dirty to true
then reset write_domain.

So right after this flush_write_domain call, write_domain will be 0 but cache is
still dirty. I am wondering if this is where that condition (write_domain==0 and 
cache_dirty==1) originally came from.

On Wed, Apr 19, 2017 at 10:41:18AM +0100, Chris Wilson wrote:
> Currently, we only mark the CPU cache as dirty if we skip a clflush.
> This leads to some confusion where we have to ask if the object is in
> the write domain or missed a clflush. If we always mark the cache as
> dirty, this becomes a much simpler question to answer.
> 
> The goal remains to do as few clflushes as required and to do them as
> late as possible, in the hope of deferring the work to a kthread and not
> block the caller (e.g. execbuf, flips).
> 
> Reported-by: Dongwon Kim <dongwon.kim@intel.com>
> Fixes: a6a7cc4b7db6 ("drm/i915: Always flush the dirty CPU cache when pinning the scanout")
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Dongwon Kim <dongwon.kim@intel.com>
> Cc: Matt Roper <matthew.d.roper@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_gem.c                  | 78 +++++++++++++++---------
>  drivers/gpu/drm/i915/i915_gem_clflush.c          | 15 +++--
>  drivers/gpu/drm/i915/i915_gem_execbuffer.c       | 21 +++----
>  drivers/gpu/drm/i915/i915_gem_internal.c         |  3 +-
>  drivers/gpu/drm/i915/i915_gem_userptr.c          |  5 +-
>  drivers/gpu/drm/i915/selftests/huge_gem_object.c |  3 +-
>  6 files changed, 70 insertions(+), 55 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 33fb11cc5acc..488ca7733c1e 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -49,7 +49,7 @@ static void i915_gem_flush_free_objects(struct drm_i915_private *i915);
>  
>  static bool cpu_write_needs_clflush(struct drm_i915_gem_object *obj)
>  {
> -	if (obj->base.write_domain == I915_GEM_DOMAIN_CPU)
> +	if (obj->cache_dirty)
>  		return false;
>  
>  	if (!i915_gem_object_is_coherent(obj))
> @@ -233,6 +233,14 @@ i915_gem_object_get_pages_phys(struct drm_i915_gem_object *obj)
>  	return st;
>  }
>  
> +static void __start_cpu_write(struct drm_i915_gem_object *obj)
> +{
> +	obj->base.read_domains = I915_GEM_DOMAIN_CPU;
> +	obj->base.write_domain = I915_GEM_DOMAIN_CPU;
> +	if (cpu_write_needs_clflush(obj))
> +		obj->cache_dirty = true;
> +}
> +
>  static void
>  __i915_gem_object_release_shmem(struct drm_i915_gem_object *obj,
>  				struct sg_table *pages,
> @@ -248,8 +256,7 @@ __i915_gem_object_release_shmem(struct drm_i915_gem_object *obj,
>  	    !i915_gem_object_is_coherent(obj))
>  		drm_clflush_sg(pages);
>  
> -	obj->base.read_domains = I915_GEM_DOMAIN_CPU;
> -	obj->base.write_domain = I915_GEM_DOMAIN_CPU;
> +	__start_cpu_write(obj);
>  }
>  
>  static void
> @@ -684,6 +691,12 @@ i915_gem_dumb_create(struct drm_file *file,
>  			       args->size, &args->handle);
>  }
>  
> +static bool gpu_write_needs_clflush(struct drm_i915_gem_object *obj)
> +{
> +	return !(obj->cache_level == I915_CACHE_NONE ||
> +		 obj->cache_level == I915_CACHE_WT);
> +}
> +
>  /**
>   * Creates a new mm object and returns a handle to it.
>   * @dev: drm device pointer
> @@ -753,6 +766,11 @@ flush_write_domain(struct drm_i915_gem_object *obj, unsigned int flush_domains)
>  	case I915_GEM_DOMAIN_CPU:
>  		i915_gem_clflush_object(obj, I915_CLFLUSH_SYNC);
>  		break;
> +
> +	case I915_GEM_DOMAIN_RENDER:
> +		if (gpu_write_needs_clflush(obj))
> +			obj->cache_dirty = true;
> +		break;
>  	}
>  
>  	obj->base.write_domain = 0;
> @@ -854,7 +872,8 @@ int i915_gem_obj_prepare_shmem_read(struct drm_i915_gem_object *obj,
>  	 * optimizes for the case when the gpu will dirty the data
>  	 * anyway again before the next pread happens.
>  	 */
> -	if (!(obj->base.read_domains & I915_GEM_DOMAIN_CPU))
> +	if (!obj->cache_dirty &&
> +	    !(obj->base.read_domains & I915_GEM_DOMAIN_CPU))
>  		*needs_clflush = CLFLUSH_BEFORE;
>  
>  out:
> @@ -906,14 +925,15 @@ int i915_gem_obj_prepare_shmem_write(struct drm_i915_gem_object *obj,
>  	 * This optimizes for the case when the gpu will use the data
>  	 * right away and we therefore have to clflush anyway.
>  	 */
> -	if (obj->base.write_domain != I915_GEM_DOMAIN_CPU)
> +	if (!obj->cache_dirty) {
>  		*needs_clflush |= CLFLUSH_AFTER;
>  
> -	/* Same trick applies to invalidate partially written cachelines read
> -	 * before writing.
> -	 */
> -	if (!(obj->base.read_domains & I915_GEM_DOMAIN_CPU))
> -		*needs_clflush |= CLFLUSH_BEFORE;
> +		/* Same trick applies to invalidate partially written
> +		 * cachelines read before writing.
> +		 */
> +		if (!(obj->base.read_domains & I915_GEM_DOMAIN_CPU))
> +			*needs_clflush |= CLFLUSH_BEFORE;
> +	}
>  
>  out:
>  	intel_fb_obj_invalidate(obj, ORIGIN_CPU);
> @@ -3374,10 +3394,12 @@ int i915_gem_wait_for_idle(struct drm_i915_private *i915, unsigned int flags)
>  
>  static void __i915_gem_object_flush_for_display(struct drm_i915_gem_object *obj)
>  {
> -	if (obj->base.write_domain != I915_GEM_DOMAIN_CPU && !obj->cache_dirty)
> -		return;
> -
> -	i915_gem_clflush_object(obj, I915_CLFLUSH_FORCE);
> +	/* We manually flush the CPU domain so that we can override and
> +	 * force the flush for the display, and perform it asynchronously.
> +	 */
> +	flush_write_domain(obj, ~I915_GEM_DOMAIN_CPU);
> +	if (obj->cache_dirty)
> +		i915_gem_clflush_object(obj, I915_CLFLUSH_FORCE);
>  	obj->base.write_domain = 0;
>  }
>  
> @@ -3636,14 +3658,17 @@ int i915_gem_object_set_cache_level(struct drm_i915_gem_object *obj,
>  		}
>  	}
>  
> -	if (obj->base.write_domain == I915_GEM_DOMAIN_CPU &&
> -	    i915_gem_object_is_coherent(obj))
> -		obj->cache_dirty = true;
> +	/* Catch any deferred obj->cache_dirty markups */
> +	flush_write_domain(obj, ~I915_GEM_DOMAIN_CPU);
>  
>  	list_for_each_entry(vma, &obj->vma_list, obj_link)
>  		vma->node.color = cache_level;
>  	obj->cache_level = cache_level;
>  
> +	if (obj->base.write_domain & I915_GEM_DOMAIN_CPU &&
> +	    cpu_write_needs_clflush(obj))
> +		obj->cache_dirty = true;
> +
>  	return 0;
>  }
>  
> @@ -3864,9 +3889,6 @@ i915_gem_object_set_to_cpu_domain(struct drm_i915_gem_object *obj, bool write)
>  	if (ret)
>  		return ret;
>  
> -	if (obj->base.write_domain == I915_GEM_DOMAIN_CPU)
> -		return 0;
> -
>  	flush_write_domain(obj, ~I915_GEM_DOMAIN_CPU);
>  
>  	/* Flush the CPU cache if it's still invalid. */
> @@ -3878,15 +3900,13 @@ i915_gem_object_set_to_cpu_domain(struct drm_i915_gem_object *obj, bool write)
>  	/* It should now be out of any other write domains, and we can update
>  	 * the domain values for our changes.
>  	 */
> -	GEM_BUG_ON((obj->base.write_domain & ~I915_GEM_DOMAIN_CPU) != 0);
> +	GEM_BUG_ON(obj->base.write_domain & ~I915_GEM_DOMAIN_CPU);
>  
>  	/* If we're writing through the CPU, then the GPU read domains will
>  	 * need to be invalidated at next use.
>  	 */
> -	if (write) {
> -		obj->base.read_domains = I915_GEM_DOMAIN_CPU;
> -		obj->base.write_domain = I915_GEM_DOMAIN_CPU;
> -	}
> +	if (write)
> +		__start_cpu_write(obj);
>  
>  	return 0;
>  }
> @@ -4306,6 +4326,8 @@ i915_gem_object_create(struct drm_i915_private *dev_priv, u64 size)
>  	} else
>  		obj->cache_level = I915_CACHE_NONE;
>  
> +	obj->cache_dirty = !i915_gem_object_is_coherent(obj);
> +
>  	trace_i915_gem_object_create(obj);
>  
>  	return obj;
> @@ -4968,10 +4990,8 @@ int i915_gem_freeze_late(struct drm_i915_private *dev_priv)
>  
>  	mutex_lock(&dev_priv->drm.struct_mutex);
>  	for (p = phases; *p; p++) {
> -		list_for_each_entry(obj, *p, global_link) {
> -			obj->base.read_domains = I915_GEM_DOMAIN_CPU;
> -			obj->base.write_domain = I915_GEM_DOMAIN_CPU;
> -		}
> +		list_for_each_entry(obj, *p, global_link)
> +			__start_cpu_write(obj);
>  	}
>  	mutex_unlock(&dev_priv->drm.struct_mutex);
>  
> diff --git a/drivers/gpu/drm/i915/i915_gem_clflush.c b/drivers/gpu/drm/i915/i915_gem_clflush.c
> index ffd01e02fe94..a895643c4dc4 100644
> --- a/drivers/gpu/drm/i915/i915_gem_clflush.c
> +++ b/drivers/gpu/drm/i915/i915_gem_clflush.c
> @@ -72,8 +72,6 @@ static const struct dma_fence_ops i915_clflush_ops = {
>  static void __i915_do_clflush(struct drm_i915_gem_object *obj)
>  {
>  	drm_clflush_sg(obj->mm.pages);
> -	obj->cache_dirty = false;
> -
>  	intel_fb_obj_flush(obj, ORIGIN_CPU);
>  }
>  
> @@ -82,9 +80,6 @@ static void i915_clflush_work(struct work_struct *work)
>  	struct clflush *clflush = container_of(work, typeof(*clflush), work);
>  	struct drm_i915_gem_object *obj = clflush->obj;
>  
> -	if (!obj->cache_dirty)
> -		goto out;
> -
>  	if (i915_gem_object_pin_pages(obj)) {
>  		DRM_ERROR("Failed to acquire obj->pages for clflushing\n");
>  		goto out;
> @@ -132,10 +127,10 @@ void i915_gem_clflush_object(struct drm_i915_gem_object *obj,
>  	 * anything not backed by physical memory we consider to be always
>  	 * coherent and not need clflushing.
>  	 */
> -	if (!i915_gem_object_has_struct_page(obj))
> +	if (!i915_gem_object_has_struct_page(obj)) {
> +		obj->cache_dirty = false;
>  		return;
> -
> -	obj->cache_dirty = true;
> +	}
>  
>  	/* If the GPU is snooping the contents of the CPU cache,
>  	 * we do not need to manually clear the CPU cache lines.  However,
> @@ -154,6 +149,8 @@ void i915_gem_clflush_object(struct drm_i915_gem_object *obj,
>  	if (!(flags & I915_CLFLUSH_SYNC))
>  		clflush = kmalloc(sizeof(*clflush), GFP_KERNEL);
>  	if (clflush) {
> +		GEM_BUG_ON(!obj->cache_dirty);
> +
>  		dma_fence_init(&clflush->dma,
>  			       &i915_clflush_ops,
>  			       &clflush_lock,
> @@ -181,6 +178,8 @@ void i915_gem_clflush_object(struct drm_i915_gem_object *obj,
>  	} else {
>  		GEM_BUG_ON(obj->base.write_domain != I915_GEM_DOMAIN_CPU);
>  	}
> +
> +	obj->cache_dirty = false;
>  }
>  
>  void i915_gem_clflush_init(struct drm_i915_private *i915)
> diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
> index af1965774e7b..ddc011ef5480 100644
> --- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
> +++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
> @@ -291,7 +291,7 @@ static inline int use_cpu_reloc(struct drm_i915_gem_object *obj)
>  		return DBG_USE_CPU_RELOC > 0;
>  
>  	return (HAS_LLC(to_i915(obj->base.dev)) ||
> -		obj->base.write_domain == I915_GEM_DOMAIN_CPU ||
> +		obj->cache_dirty ||
>  		obj->cache_level != I915_CACHE_NONE);
>  }
>  
> @@ -1129,10 +1129,8 @@ i915_gem_execbuffer_move_to_gpu(struct drm_i915_gem_request *req,
>  		if (vma->exec_entry->flags & EXEC_OBJECT_ASYNC)
>  			continue;
>  
> -		if (obj->base.write_domain & I915_GEM_DOMAIN_CPU) {
> +		if (obj->base.write_domain & obj->cache_dirty)
>  			i915_gem_clflush_object(obj, 0);
> -			obj->base.write_domain = 0;
> -		}
>  
>  		ret = i915_gem_request_await_object
>  			(req, obj, obj->base.pending_write_domain);
> @@ -1265,12 +1263,6 @@ i915_gem_validate_context(struct drm_device *dev, struct drm_file *file,
>  	return ctx;
>  }
>  
> -static bool gpu_write_needs_clflush(struct drm_i915_gem_object *obj)
> -{
> -	return !(obj->cache_level == I915_CACHE_NONE ||
> -		 obj->cache_level == I915_CACHE_WT);
> -}
> -
>  void i915_vma_move_to_active(struct i915_vma *vma,
>  			     struct drm_i915_gem_request *req,
>  			     unsigned int flags)
> @@ -1294,15 +1286,16 @@ void i915_vma_move_to_active(struct i915_vma *vma,
>  	i915_gem_active_set(&vma->last_read[idx], req);
>  	list_move_tail(&vma->vm_link, &vma->vm->active_list);
>  
> +	obj->base.write_domain = 0;
>  	if (flags & EXEC_OBJECT_WRITE) {
> +		obj->base.write_domain = I915_GEM_DOMAIN_RENDER;
> +
>  		if (intel_fb_obj_invalidate(obj, ORIGIN_CS))
>  			i915_gem_active_set(&obj->frontbuffer_write, req);
>  
> -		/* update for the implicit flush after a batch */
> -		obj->base.write_domain &= ~I915_GEM_GPU_DOMAINS;
> -		if (!obj->cache_dirty && gpu_write_needs_clflush(obj))
> -			obj->cache_dirty = true;
> +		obj->base.read_domains = 0;
>  	}
> +	obj->base.read_domains |= I915_GEM_GPU_DOMAINS;
>  
>  	if (flags & EXEC_OBJECT_NEEDS_FENCE)
>  		i915_gem_active_set(&vma->last_fence, req);
> diff --git a/drivers/gpu/drm/i915/i915_gem_internal.c b/drivers/gpu/drm/i915/i915_gem_internal.c
> index fc950abbe400..58e93e87d573 100644
> --- a/drivers/gpu/drm/i915/i915_gem_internal.c
> +++ b/drivers/gpu/drm/i915/i915_gem_internal.c
> @@ -188,9 +188,10 @@ i915_gem_object_create_internal(struct drm_i915_private *i915,
>  	drm_gem_private_object_init(&i915->drm, &obj->base, size);
>  	i915_gem_object_init(obj, &i915_gem_object_internal_ops);
>  
> -	obj->base.write_domain = I915_GEM_DOMAIN_CPU;
>  	obj->base.read_domains = I915_GEM_DOMAIN_CPU;
> +	obj->base.write_domain = I915_GEM_DOMAIN_CPU;
>  	obj->cache_level = HAS_LLC(i915) ? I915_CACHE_LLC : I915_CACHE_NONE;
> +	obj->cache_dirty = !i915_gem_object_is_coherent(obj);
>  
>  	return obj;
>  }
> diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
> index 58ccf8b8ca1c..9f84be171ad2 100644
> --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> @@ -802,9 +802,10 @@ i915_gem_userptr_ioctl(struct drm_device *dev, void *data, struct drm_file *file
>  
>  	drm_gem_private_object_init(dev, &obj->base, args->user_size);
>  	i915_gem_object_init(obj, &i915_gem_userptr_ops);
> -	obj->cache_level = I915_CACHE_LLC;
> -	obj->base.write_domain = I915_GEM_DOMAIN_CPU;
>  	obj->base.read_domains = I915_GEM_DOMAIN_CPU;
> +	obj->base.write_domain = I915_GEM_DOMAIN_CPU;
> +	obj->cache_level = I915_CACHE_LLC;
> +	obj->cache_dirty = !i915_gem_object_is_coherent(obj);
>  
>  	obj->userptr.ptr = args->user_ptr;
>  	obj->userptr.read_only = !!(args->flags & I915_USERPTR_READ_ONLY);
> diff --git a/drivers/gpu/drm/i915/selftests/huge_gem_object.c b/drivers/gpu/drm/i915/selftests/huge_gem_object.c
> index 4e681fc13be4..0ca867a877b6 100644
> --- a/drivers/gpu/drm/i915/selftests/huge_gem_object.c
> +++ b/drivers/gpu/drm/i915/selftests/huge_gem_object.c
> @@ -126,9 +126,10 @@ huge_gem_object(struct drm_i915_private *i915,
>  	drm_gem_private_object_init(&i915->drm, &obj->base, dma_size);
>  	i915_gem_object_init(obj, &huge_ops);
>  
> -	obj->base.write_domain = I915_GEM_DOMAIN_CPU;
>  	obj->base.read_domains = I915_GEM_DOMAIN_CPU;
> +	obj->base.write_domain = I915_GEM_DOMAIN_CPU;
>  	obj->cache_level = HAS_LLC(i915) ? I915_CACHE_LLC : I915_CACHE_NONE;
> +	obj->cache_dirty = !i915_gem_object_is_coherent(obj);
>  	obj->scratch = phys_size;
>  
>  	return obj;
> -- 
> 2.11.0
> 
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 02/27] drm/i915: Mark CPU cache as dirty on every transition for CPU writes
  2017-04-19 16:52   ` Dongwon Kim
@ 2017-04-19 17:15     ` Chris Wilson
  2017-04-19 17:46     ` Chris Wilson
  2017-04-19 18:08     ` Chris Wilson
  2 siblings, 0 replies; 95+ messages in thread
From: Chris Wilson @ 2017-04-19 17:15 UTC (permalink / raw)
  To: Dongwon Kim; +Cc: intel-gfx

On Wed, Apr 19, 2017 at 09:52:28AM -0700, Dongwon Kim wrote:
> I tried your patch but it didn't fix the original 
> problem. I think it is somehow related to the flushing condition
> here:
> 
> @@ -1129,10 +1129,8 @@ i915_gem_execbuffer_move_to_gpu(struct drm_i915_gem_request *req,
>  	if (vma->exec_entry->flags & EXEC_OBJECT_ASYNC)
>  		continue;
> 
> 	if (obj->base.write_domain & I915_GEM_DOMAIN_CPU) {
> +	if (obj->base.write_domain & obj->cache_dirty)
>  		i915_gem_clflush_object(obj, 0);
> -		obj->base.write_domain = 0;
> -	}
> 
> here, we do clflush only if write_domain is not 0 even if cache_dirty
> flag is set after your patch is applied.
> 
> And now please look at this:
> 
> @@ -753,6 +766,11 @@ flush_write_domain(struct drm_i915_gem_object *obj, unsigned int flush_domains)
>         case I915_GEM_DOMAIN_CPU:
>                 i915_gem_clflush_object(obj, I915_CLFLUSH_SYNC);
>                 break;
> +
> +       case I915_GEM_DOMAIN_RENDER:
> +               if (gpu_write_needs_clflush(obj))
> +                       obj->cache_dirty = true;
> +               break;
>         }
> 
>         obj->base.write_domain=0;
> 
> So here, if the write_domain is I915_GEM_DOMAIN_RENDER, we set cache_dirty to true
> then reset write_domain.
> 
> So right after this flush_write_domain call, write_domain will be 0 but cache is
> still dirty. I am wondering if this is where that condition (write_domain==0 and 
> cache_dirty==1) originally came from.

And we definitely do not want to be flushing the cache for the GPU after
a GPU write. We only want that cache to be flushed after a GPU write if
the object is moved to a non-coherent domain. That's the challenge.

I was also expecting that (incoherent) reads from the GPU would go through
the cache - we make the same assumption for CPU reads.

Thanks for testing, definitely back to the drawing board. Hmm, it might be
worth taking the alternative approach and always scheduling an async
clflush in set-cache-domain. Just have to check that we have appropriate
waits and don't have any inappropriate ones.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 02/27] drm/i915: Mark CPU cache as dirty on every transition for CPU writes
  2017-04-19 16:52   ` Dongwon Kim
  2017-04-19 17:15     ` Chris Wilson
@ 2017-04-19 17:46     ` Chris Wilson
  2017-04-19 18:08     ` Chris Wilson
  2 siblings, 0 replies; 95+ messages in thread
From: Chris Wilson @ 2017-04-19 17:46 UTC (permalink / raw)
  To: Dongwon Kim; +Cc: intel-gfx

On Wed, Apr 19, 2017 at 09:52:28AM -0700, Dongwon Kim wrote:
> I tried your patch but it didn't fix the original 
> problem. I think it is somehow related to the flushing condition
> here:

I don't think I actually checked what GPU you observed it on - I was
assuming llc, since that was the last bug we had.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 02/27] drm/i915: Mark CPU cache as dirty on every transition for CPU writes
  2017-04-19 16:52   ` Dongwon Kim
  2017-04-19 17:15     ` Chris Wilson
  2017-04-19 17:46     ` Chris Wilson
@ 2017-04-19 18:08     ` Chris Wilson
  2017-04-19 18:13       ` Dongwon Kim
  2 siblings, 1 reply; 95+ messages in thread
From: Chris Wilson @ 2017-04-19 18:08 UTC (permalink / raw)
  To: Dongwon Kim; +Cc: intel-gfx

On Wed, Apr 19, 2017 at 09:52:28AM -0700, Dongwon Kim wrote:
> I tried your patch but it didn't fix the original 
> problem. I think it is somehow related to the flushing condition
> here:
> 
> @@ -1129,10 +1129,8 @@ i915_gem_execbuffer_move_to_gpu(struct drm_i915_gem_request *req,
>  	if (vma->exec_entry->flags & EXEC_OBJECT_ASYNC)
>  		continue;
> 
> 	if (obj->base.write_domain & I915_GEM_DOMAIN_CPU) {
> +	if (obj->base.write_domain & obj->cache_dirty)
>  		i915_gem_clflush_object(obj, 0);
> -		obj->base.write_domain = 0;
> -	}
> 
> here, we do clflush only if write_domain is not 0 even if cache_dirty
> flag is set after your patch is applied.

This can be just reduced to if (obj->cache_dirty) clflush().

We're slightly better in that we don't set obj->cache_dirty so often for
normal gpu rendering (just on transitions away from the gpu now), but it
still means we will be redundantly checking for clflushes prior to
rendering.

Can you double check that this patch + if (obj->cache_dirty) works for you?

What I guess I really want here is
	if (obj->cache_dirty & !obj->cache_coherent)
essentially inlining the check from clflush.
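
A very rough sketch of where I would like that hunk in
execbuffer_move_to_gpu to end up (obj->cache_coherent does not exist
yet, so treat the name as a placeholder):

	if (vma->exec_entry->flags & EXEC_OBJECT_ASYNC)
		continue;

	/* Only flush stale CPU cachelines for objects that are not
	 * coherent with the GPU.
	 */
	if (obj->cache_dirty && !obj->cache_coherent)
		i915_gem_clflush_object(obj, 0);
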
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 02/27] drm/i915: Mark CPU cache as dirty on every transition for CPU writes
  2017-04-19 18:08     ` Chris Wilson
@ 2017-04-19 18:13       ` Dongwon Kim
  2017-04-19 18:26         ` Chris Wilson
  0 siblings, 1 reply; 95+ messages in thread
From: Dongwon Kim @ 2017-04-19 18:13 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx, Matt Roper

Chris,

Just to make sure, you want to just remove write_domain check from 
if statement before clflush in execbuffer_move_to_gpu. Am I right?
I will try both (cache_dirty only vs cache_dirty & !cache_coherent)
and get back to you shortly. 

On Wed, Apr 19, 2017 at 07:08:33PM +0100, Chris Wilson wrote:
> On Wed, Apr 19, 2017 at 09:52:28AM -0700, Dongwon Kim wrote:
> > I tried your patch but it didn't fix the original 
> > problem. I think it is somehow related to the flushing condition
> > here:
> > 
> > @@ -1129,10 +1129,8 @@ i915_gem_execbuffer_move_to_gpu(struct drm_i915_gem_request *req,
> >  	if (vma->exec_entry->flags & EXEC_OBJECT_ASYNC)
> >  		continue;
> > 
> > 	if (obj->base.write_domain & I915_GEM_DOMAIN_CPU) {
> > +	if (obj->base.write_domain & obj->cache_dirty)
> >  		i915_gem_clflush_object(obj, 0);
> > -		obj->base.write_domain = 0;
> > -	}
> > 
> > here, we do clflush only if write_domain is not 0 even if cache_dirty
> > flag is set after your patch is applied.
> 
> This can be just reduced to if (obj->cache_dirty) clflush().
> 
> We're slightly better in that we don't set obj->cache_dirty so often for
> normal gpu rendering (just on transitions away from the gpu now), but it
> still means we will be redundantly checking for clflushes prior to
> rendering.
> 
> Can you double check that this patch + if (obj->cache_dirty) works for you?
> 
> What I guess I really want here is
> 	if (obj->cache_dirty & !obj->cache_coherent)
> essentially inlining the check from clflush.
> -Chris
> 
> -- 
> Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 02/27] drm/i915: Mark CPU cache as dirty on every transition for CPU writes
  2017-04-19 18:13       ` Dongwon Kim
@ 2017-04-19 18:26         ` Chris Wilson
  2017-04-19 20:30           ` Dongwon Kim
  2017-04-19 20:49           ` Dongwon Kim
  0 siblings, 2 replies; 95+ messages in thread
From: Chris Wilson @ 2017-04-19 18:26 UTC (permalink / raw)
  To: Dongwon Kim; +Cc: intel-gfx

On Wed, Apr 19, 2017 at 11:13:17AM -0700, Dongwon Kim wrote:
> Chris,
> 
> Just to make sure, you want to just remove write_domain check from 
> if statement before clflush in execbuffer_move_to_gpu. Am I right?
> I will try both (cache_dirty only vs cache_dirty & !cache_coherent)
> and get back to you shortly. 

Yes, I just don't have a single bit for cache_coherent yet, so you
might as well let that call i915_gem_object_clflush().
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 02/27] drm/i915: Mark CPU cache as dirty on every transition for CPU writes
  2017-04-19 18:26         ` Chris Wilson
@ 2017-04-19 20:30           ` Dongwon Kim
  2017-04-19 20:49           ` Dongwon Kim
  1 sibling, 0 replies; 95+ messages in thread
From: Dongwon Kim @ 2017-04-19 20:30 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx, Matt Roper

Chris,

I think my assumption was not correct. I took out the write_domain
check but it is still failing. However, here are a couple of
observations from the experiments I ran. One of them was to take
out even the cache_dirty check from eb_move_to_gpu. With this, all
the sample tests passed but, as you might expect, they ran very
slowly, which shows just how much the clflushes cost.

Then I put the cache_dirty check back into eb_move_to_gpu and
removed 'if (gpu_write_needs_clflush(obj))' from flush_write_domain
for the case where write_domain is I915_GEM_DOMAIN_RENDER:

@@ -753,6 +766,11 @@ flush_write_domain(struct drm_i915_gem_object *obj, unsigned int flush_domains)
+
+       case I915_GEM_DOMAIN_RENDER:
+               if (gpu_write_needs_clflush(obj)) <-- took out this line
+                       obj->cache_dirty = true;
+               break;

so that cache_dirty is set all the time when write_domain is
I915_GEM_DOMAIN_RENDER. With that change, some of the failing tests
now pass, but it doesn't fix all of them.
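
For reference, with that line gone the case I tested locally reduces
to just:

	case I915_GEM_DOMAIN_RENDER:
		/* unconditionally mark the CPU cache dirty after a GPU write */
		obj->cache_dirty = true;
		break;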

I will try reverting the other changes in your patch as well.
Please let me know if there is anything else you want me to
try to help narrow this down.

On Wed, Apr 19, 2017 at 07:26:29PM +0100, Chris Wilson wrote:
> On Wed, Apr 19, 2017 at 11:13:17AM -0700, Dongwon Kim wrote:
> > Chris,
> > 
> > Just to make sure, you want to just remove write_domain check from 
> > if statement before clflush in execbuffer_move_to_gpu. Am I right?
> > I will try both (cache_dirty only vs cache_dirty & !cache_coherent)
> > and get back to you shortly. 
> 
> Yes, I just don't have a single bit for cache_coherent yet, so you
> might as well let that call i915_gem_object_clflush().
> -Chris
> 
> -- 
> Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 02/27] drm/i915: Mark CPU cache as dirty on every transition for CPU writes
  2017-04-19 18:26         ` Chris Wilson
  2017-04-19 20:30           ` Dongwon Kim
@ 2017-04-19 20:49           ` Dongwon Kim
  1 sibling, 0 replies; 95+ messages in thread
From: Dongwon Kim @ 2017-04-19 20:49 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx, Matt Roper

Chris,

I am sorry that I didn't tell you which GPU I am working on.
It is GEN9LP; our target is APL-I, so no LLC is available.

On Wed, Apr 19, 2017 at 07:26:29PM +0100, Chris Wilson wrote:
> On Wed, Apr 19, 2017 at 11:13:17AM -0700, Dongwon Kim wrote:
> > Chris,
> > 
> > Just to make sure, you want to just remove write_domain check from 
> > if statement before clflush in execbuffer_move_to_gpu. Am I right?
> > I will try both (cache_dirty only vs cache_dirty & !cache_coherent)
> > and get back to you shortly. 
> 
> Yes, I just don't have a single bit for cache_coherent yet, so you
> might as well let that call i915_gem_object_clflush().
> -Chris
> 
> -- 
> Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 01/27] drm/i915/selftests: Allocate inode/file dynamically
  2017-04-19  9:41 ` [PATCH 01/27] drm/i915/selftests: Allocate inode/file dynamically Chris Wilson
@ 2017-04-20  7:42   ` Joonas Lahtinen
  0 siblings, 0 replies; 95+ messages in thread
From: Joonas Lahtinen @ 2017-04-20  7:42 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx; +Cc: Matthew Auld, Arnd Bergmann

On ke, 2017-04-19 at 10:41 +0100, Chris Wilson wrote:
> Avoid having too large a stack by creating the fake struct inode/file on
> the heap instead.
> 
> drivers/gpu/drm/i915/selftests/mock_drm.c: In function 'mock_file':
> drivers/gpu/drm/i915/selftests/mock_drm.c:46:1: error: the frame size of 1328 bytes is larger than 1280 bytes [-Werror=frame-larger-than=]
> drivers/gpu/drm/i915/selftests/mock_drm.c: In function 'mock_file_free':
> drivers/gpu/drm/i915/selftests/mock_drm.c:54:1: error: the frame size of 1312 bytes is larger than 1280 bytes [-Werror=frame-larger-than=]
> 
> Reported-by: Arnd Bergmann <arnd@arndb.de>
> Fixes: 66d9cb5d805a ("drm/i915: Mock the GEM device for self-testing")
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Matthew Auld <matthew.auld@intel.com>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Acked-by: Arnd Bergmann <arnd@arndb.de>

Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>

Regards, Joonas
-- 
Joonas Lahtinen
Open Source Technology Center
Intel Corporation
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 22/27] drm/i915: Eliminate lots of iterations over the execobjects array
  2017-04-19  9:41 ` [PATCH 22/27] drm/i915: Eliminate lots of iterations over the execobjects array Chris Wilson
@ 2017-04-20  8:49   ` Joonas Lahtinen
  0 siblings, 0 replies; 95+ messages in thread
From: Joonas Lahtinen @ 2017-04-20  8:49 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

On ke, 2017-04-19 at 10:41 +0100, Chris Wilson wrote:
> The major scaling bottleneck in execbuffer is the processing of the
> execobjects. Creating an auxiliary list is inefficient when compared to
> using the execobject array we already have allocated.
> 
> Reservation is then split into phases. As we lookup up the VMA, we
> try and bind it back into active location. Only if that fails, do we add
> it to the unbound list for phase 2. In phase 2, we try and add all those
> objects that could not fit into their previous location, with fallback
> to retrying all objects and evicting the VM in case of severe
> fragmentation. (This is the same as before, except that phase 1 is now
> done inline with looking up the VMA to avoid an iteration over the
> execobject array. In the ideal case, we eliminate the separate reservation
> phase). During the reservation phase, we only evict from the VM between
> passes (rather than currently as we try to fit every new VMA). In
> testing with Unreal Engine's Atlantis demo which stresses the eviction
> logic on gen7 class hardware, this speed up the framerate by a factor of
> 2.
> 
> The second loop amalgamation is between move_to_gpu and move_to_active.
> As we always submit the request, even if incomplete, we can use the
> current request to track active VMA as we perform the flushes and
> synchronisation required.
> 
> The next big advancement is to avoid copying back to the user any
> execobjects and relocations that are not changed.
> 
> v2: Add a Theory of Operation spiel.
> v3: Fall back to slow relocations in preparation for flushing userptrs.
> v4: Document struct members, factor out eb_validate_vma(), add a few
> more comments to explain some magic and hide other magic behind macros.
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Changelog checks out. Assuming you peeked at the generated html docs:

Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>

Regards, Joonas
-- 
Joonas Lahtinen
Open Source Technology Center
Intel Corporation
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 14/27] drm/i915: Don't mark an execlists context-switch when idle
  2017-04-19  9:41 ` [PATCH 14/27] drm/i915: Don't mark an execlists context-switch when idle Chris Wilson
@ 2017-04-20  8:53   ` Joonas Lahtinen
  0 siblings, 0 replies; 95+ messages in thread
From: Joonas Lahtinen @ 2017-04-20  8:53 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

On ke, 2017-04-19 at 10:41 +0100, Chris Wilson wrote:
> If we *know* that the engine is idle, i.e. we have not more contexts in
> lift, we can skip any spurious CSB idle interrupts. These spurious

in flight?

> interrupts seem to arrive long after we assert that the engines are
> completely idle, triggering later assertions:
> 
> [  178.896646] intel_engine_is_idle(bcs): interrupt not handled, irq_posted=2
> [  178.896655] ------------[ cut here ]------------
> [  178.896658] kernel BUG at drivers/gpu/drm/i915/intel_engine_cs.c:226!
> [  178.896661] invalid opcode: 0000 [#1] SMP
> [  178.896663] Modules linked in: i915(E) x86_pkg_temp_thermal(E) crct10dif_pclmul(E) crc32_pclmul(E) crc32c_intel(E) ghash_clmulni_intel(E) nls_ascii(E) nls_cp437(E) vfat(E) fat(E) intel_gtt(E) i2c_algo_bit(E) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) fb_sys_fops(E) aesni_intel(E) prime_numbers(E) evdev(E) aes_x86_64(E) drm(E) crypto_simd(E) cryptd(E) glue_helper(E) mei_me(E) mei(E) lpc_ich(E) efivars(E) mfd_core(E) battery(E) video(E) acpi_pad(E) button(E) tpm_tis(E) tpm_tis_core(E) tpm(E) autofs4(E) i2c_i801(E) fan(E) thermal(E) i2c_designware_platform(E) i2c_designware_core(E)
> [  178.896694] CPU: 1 PID: 522 Comm: gem_exec_whispe Tainted: G            E   4.11.0-rc5+ #14
> [  178.896702] task: ffff88040aba8d40 task.stack: ffffc900003f0000
> [  178.896722] RIP: 0010:intel_engine_init_global_seqno+0x1db/0x1f0 [i915]
> [  178.896725] RSP: 0018:ffffc900003f3ab0 EFLAGS: 00010246
> [  178.896728] RAX: 0000000000000000 RBX: ffff88040af54000 RCX: 0000000000000000
> [  178.896731] RDX: ffff88041ec933e0 RSI: ffff88041ec8cc48 RDI: ffff88041ec8cc48
> [  178.896734] RBP: ffffc900003f3ac8 R08: 0000000000000000 R09: 000000000000047d
> [  178.896736] R10: 0000000000000040 R11: ffff88040b344f80 R12: 0000000000000000
> [  178.896739] R13: ffff88040bce0000 R14: ffff88040bce52d8 R15: ffff88040bce0000
> [  178.896742] FS:  00007f2cccc2d8c0(0000) GS:ffff88041ec80000(0000) knlGS:0000000000000000
> [  178.896746] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  178.896749] CR2: 00007f41ddd8f000 CR3: 000000040bb03000 CR4: 00000000001406e0
> [  178.896752] Call Trace:
> [  178.896768]  reset_all_global_seqno.part.33+0x4e/0xd0 [i915]
> [  178.896782]  i915_gem_request_alloc+0x304/0x330 [i915]
> [  178.896795]  i915_gem_do_execbuffer+0x8a1/0x17d0 [i915]
> [  178.896799]  ? remove_wait_queue+0x48/0x50
> [  178.896812]  ? i915_wait_request+0x300/0x590 [i915]
> [  178.896816]  ? wake_up_q+0x70/0x70
> [  178.896819]  ? refcount_dec_and_test+0x11/0x20
> [  178.896823]  ? reservation_object_add_excl_fence+0xa5/0x100
> [  178.896835]  i915_gem_execbuffer2+0xab/0x1f0 [i915]
> [  178.896844]  drm_ioctl+0x1e6/0x460 [drm]
> [  178.896858]  ? i915_gem_execbuffer+0x260/0x260 [i915]
> [  178.896862]  ? dput+0xcf/0x250
> [  178.896866]  ? full_proxy_release+0x66/0x80
> [  178.896869]  ? mntput+0x1f/0x30
> [  178.896872]  do_vfs_ioctl+0x8f/0x5b0
> [  178.896875]  ? ____fput+0x9/0x10
> [  178.896878]  ? task_work_run+0x80/0xa0
> [  178.896881]  SyS_ioctl+0x3c/0x70
> [  178.896885]  entry_SYSCALL_64_fastpath+0x17/0x98
> [  178.896888] RIP: 0033:0x7f2ccb455ca7
> [  178.896890] RSP: 002b:00007ffcabec72d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> [  178.896894] RAX: ffffffffffffffda RBX: 000055f897a44b90 RCX: 00007f2ccb455ca7
> [  178.896897] RDX: 00007ffcabec74a0 RSI: 0000000040406469 RDI: 0000000000000003
> [  178.896900] RBP: 00007f2ccb70a440 R08: 00007f2ccb70d0a4 R09: 0000000000000000
> [  178.896903] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
> [  178.896905] R13: 000055f89782d71a R14: 00007ffcabecf838 R15: 0000000000000003
> [  178.896908] Code: 00 31 d2 4c 89 ef 8d 70 48 41 ff 95 f8 06 00 00 e9 68 fe ff ff be 0f 00 00 00 48 c7 c7 48 dc 37 a0 e8 fa 33 d6 e0 e9 0b ff ff ff <0f> 0b 0f 0b 0f 0b 0f 0b 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00
> 
> On the other hand, by ignoring the interrupt do we risk running out of
> space in CSB ring? Testing for a few hours suggests not, i.e. that we
> only seem to get the odd delayed CSB idle notification.
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Slap your Tested-by too.

Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>

Even with that, I dislike the port_count macro.

Regards, Joonas
-- 
Joonas Lahtinen
Open Source Technology Center
Intel Corporation
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 12/27] drm/i915: Only report a wakeup if the waiter was truly asleep
  2017-04-19  9:41 ` [PATCH 12/27] drm/i915: Only report a wakeup if the waiter was truly asleep Chris Wilson
@ 2017-04-20 13:30   ` Tvrtko Ursulin
  2017-04-20 13:57     ` Chris Wilson
  0 siblings, 1 reply; 95+ messages in thread
From: Tvrtko Ursulin @ 2017-04-20 13:30 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 19/04/2017 10:41, Chris Wilson wrote:
> If we attempt to wake up a waiter, who is currently checking the seqno
> it will be in the TASK_INTERRUPTIBLE state and ttwu will report success.
> However, it is actually awake and functioning -- so delay reporting the
> actual wake up until it sleeps.
>
> v2: Defend against !CONFIG_SMP
> v3: Don't filter out calls to wake_up_process
>
> References: https://bugs.freedesktop.org/show_bug.cgi?id=100007
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> ---
>  drivers/gpu/drm/i915/intel_breadcrumbs.c | 18 ++++++++++++++++--
>  drivers/gpu/drm/i915/intel_ringbuffer.h  |  4 ++++
>  2 files changed, 20 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/intel_breadcrumbs.c b/drivers/gpu/drm/i915/intel_breadcrumbs.c
> index 9ccbf26124c6..808d3a3cda0a 100644
> --- a/drivers/gpu/drm/i915/intel_breadcrumbs.c
> +++ b/drivers/gpu/drm/i915/intel_breadcrumbs.c
> @@ -27,6 +27,12 @@
>
>  #include "i915_drv.h"
>
> +#ifdef CONFIG_SMP
> +#define task_asleep(tsk) (!(tsk)->on_cpu)
> +#else
> +#define task_asleep(tsk) ((tsk) != current)
> +#endif
> +
>  static unsigned int __intel_breadcrumbs_wakeup(struct intel_breadcrumbs *b)
>  {
>  	struct intel_wait *wait;
> @@ -37,8 +43,16 @@ static unsigned int __intel_breadcrumbs_wakeup(struct intel_breadcrumbs *b)
>  	wait = b->irq_wait;
>  	if (wait) {
>  		result = ENGINE_WAKEUP_WAITER;
> -		if (wake_up_process(wait->tsk))
> +
> +		/* Be careful not to report a successful wakeup if the waiter
> +		 * is currently processing the seqno, where it will have
> +		 * already called set_task_state(TASK_INTERRUPTIBLE).
> +		 */
> +		if (task_asleep(wait->tsk))
>  			result |= ENGINE_WAKEUP_ASLEEP;
> +
> +		if (wake_up_process(wait->tsk))
> +			result |= ENGINE_WAKEUP_SUCCESS;

Would the rough idea I had of atomic_inc(&b->wakeup_cnt) here with an
unconditional wake_up_process, coupled with atomic_dec_and_test in the
signaler, not work? I was thinking that would avoid the signaler losing
the wakeup and avoid us having to touch the low-level scheduler data.
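
Something along these lines, purely to illustrate the idea (the
wakeup_cnt field and the label are made up):

	/* waker (__intel_breadcrumbs_wakeup): count every wakeup attempt */
	atomic_inc(&b->wakeup_cnt);
	wake_up_process(wait->tsk);

	/* signaler, after finishing a pass: if the count has not dropped
	 * back to zero, another wakeup arrived meanwhile, so do another
	 * pass instead of sleeping and no wakeup can be lost.
	 */
	if (!atomic_dec_and_test(&b->wakeup_cnt))
		goto do_another_pass;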

Or was what you meant last time by "not sure it was worth it" referring
to the above?

Regards,

Tvrtko

>  	}
>
>  	return result;
> @@ -98,7 +112,7 @@ static void intel_breadcrumbs_hangcheck(unsigned long data)
>  	 * but we still have a waiter. Assuming all batches complete within
>  	 * DRM_I915_HANGCHECK_JIFFIES [1.5s]!
>  	 */
> -	if (intel_engine_wakeup(engine) & ENGINE_WAKEUP_ASLEEP) {
> +	if (intel_engine_wakeup(engine) == ENGINE_WAKEUP) {
>  		missed_breadcrumb(engine);
>  		mod_timer(&engine->breadcrumbs.fake_irq, jiffies + 1);
>  	} else {
> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
> index 00d36aa4e26d..d25b88467e5e 100644
> --- a/drivers/gpu/drm/i915/intel_ringbuffer.h
> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
> @@ -668,6 +668,10 @@ static inline bool intel_engine_has_waiter(const struct intel_engine_cs *engine)
>  unsigned int intel_engine_wakeup(struct intel_engine_cs *engine);
>  #define ENGINE_WAKEUP_WAITER BIT(0)
>  #define ENGINE_WAKEUP_ASLEEP BIT(1)
> +#define ENGINE_WAKEUP_SUCCESS BIT(2)
> +#define ENGINE_WAKEUP (ENGINE_WAKEUP_WAITER | \
> +		       ENGINE_WAKEUP_ASLEEP | \
> +		       ENGINE_WAKEUP_SUCCESS)
>
>  void __intel_engine_disarm_breadcrumbs(struct intel_engine_cs *engine);
>  void intel_engine_disarm_breadcrumbs(struct intel_engine_cs *engine);
>
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 12/27] drm/i915: Only report a wakeup if the waiter was truly asleep
  2017-04-20 13:30   ` Tvrtko Ursulin
@ 2017-04-20 13:57     ` Chris Wilson
  0 siblings, 0 replies; 95+ messages in thread
From: Chris Wilson @ 2017-04-20 13:57 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: intel-gfx

On Thu, Apr 20, 2017 at 02:30:21PM +0100, Tvrtko Ursulin wrote:
> 
> On 19/04/2017 10:41, Chris Wilson wrote:
> >If we attempt to wake up a waiter, who is currently checking the seqno
> >it will be in the TASK_INTERRUPTIBLE state and ttwu will report success.
> >However, it is actually awake and functioning -- so delay reporting the
> >actual wake up until it sleeps.
> >
> >v2: Defend against !CONFIG_SMP
> >v3: Don't filter out calls to wake_up_process

I forgot this patch was in the middle of the series, i.e. I am not
pursuing this one at the moment.

> >References: https://bugs.freedesktop.org/show_bug.cgi?id=100007
> >Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> >Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> >---
> > drivers/gpu/drm/i915/intel_breadcrumbs.c | 18 ++++++++++++++++--
> > drivers/gpu/drm/i915/intel_ringbuffer.h  |  4 ++++
> > 2 files changed, 20 insertions(+), 2 deletions(-)
> >
> >diff --git a/drivers/gpu/drm/i915/intel_breadcrumbs.c b/drivers/gpu/drm/i915/intel_breadcrumbs.c
> >index 9ccbf26124c6..808d3a3cda0a 100644
> >--- a/drivers/gpu/drm/i915/intel_breadcrumbs.c
> >+++ b/drivers/gpu/drm/i915/intel_breadcrumbs.c
> >@@ -27,6 +27,12 @@
> >
> > #include "i915_drv.h"
> >
> >+#ifdef CONFIG_SMP
> >+#define task_asleep(tsk) (!(tsk)->on_cpu)
> >+#else
> >+#define task_asleep(tsk) ((tsk) != current)
> >+#endif
> >+
> > static unsigned int __intel_breadcrumbs_wakeup(struct intel_breadcrumbs *b)
> > {
> > 	struct intel_wait *wait;
> >@@ -37,8 +43,16 @@ static unsigned int __intel_breadcrumbs_wakeup(struct intel_breadcrumbs *b)
> > 	wait = b->irq_wait;
> > 	if (wait) {
> > 		result = ENGINE_WAKEUP_WAITER;
> >-		if (wake_up_process(wait->tsk))
> >+
> >+		/* Be careful not to report a successful wakeup if the waiter
> >+		 * is currently processing the seqno, where it will have
> >+		 * already called set_task_state(TASK_INTERRUPTIBLE).
> >+		 */
> >+		if (task_asleep(wait->tsk))
> > 			result |= ENGINE_WAKEUP_ASLEEP;
> >+
> >+		if (wake_up_process(wait->tsk))
> >+			result |= ENGINE_WAKEUP_SUCCESS;
> 
> Would the rough idea I had of atomic_inc(&b->wakeup_cnt) here with an
> unconditional wake_up_process, coupled with atomic_dec_and_test in
> the signaler, not work? I was thinking that would avoid the signaler
> losing the wakeup and avoid us having to touch the low-level
> scheduler data.

The best one I had was to store an atomic counter in each struct intel_wait
to determine if it was inside the wait-for-breadcrumb. But that is
duplicating on-cpu (with the same advantage of not being confused by any
sleep elsewhere in the check-breadcrumb path) and fundamentally less
precise.

> Or was what you meant last time by "not sure it was worth it"
> referring to the above?

I think the chance that this is affecting a missed breadcrumb result is
small, certainly not with the regularity of snb-2600. I had pushed it to
the end, but obviously not far enough down the list. When I looked at
the list of patches, I actually thought this was a different patch,
"drm/i915: Only wake the waiter from the interrupt if passed".

My apologies for the noise.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 13/27] drm/i915/execlists: Pack the count into the low bits of the port.request
  2017-04-19  9:41 ` [PATCH 13/27] drm/i915/execlists: Pack the count into the low bits of the port.request Chris Wilson
@ 2017-04-20 14:58   ` Tvrtko Ursulin
  2017-04-27 14:37     ` Chris Wilson
  0 siblings, 1 reply; 95+ messages in thread
From: Tvrtko Ursulin @ 2017-04-20 14:58 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx; +Cc: Mika Kuoppala


On 19/04/2017 10:41, Chris Wilson wrote:
> add/remove: 1/1 grow/shrink: 5/4 up/down: 391/-578 (-187)
> function                                     old     new   delta
> execlists_submit_ports                       262     471    +209
> port_assign.isra                               -     136    +136
> capture                                     6344    6359     +15
> reset_common_ring                            438     452     +14
> execlists_submit_request                     228     238     +10
> gen8_init_common_ring                        334     341      +7
> intel_engine_is_idle                         106     105      -1
> i915_engine_info                            2314    2290     -24
> __i915_gem_set_wedged_BKL                    485     411     -74
> intel_lrc_irq_handler                       1789    1604    -185
> execlists_update_context                     294       -    -294
>
> The most important change there is the improve to the
> intel_lrc_irq_handler and excclist_submit_ports (net improvement since
> execlists_update_context is now inlined).
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_debugfs.c        |  32 ++++---
>  drivers/gpu/drm/i915/i915_gem.c            |   6 +-
>  drivers/gpu/drm/i915/i915_gpu_error.c      |  13 ++-
>  drivers/gpu/drm/i915/i915_guc_submission.c |  18 ++--
>  drivers/gpu/drm/i915/intel_engine_cs.c     |   2 +-
>  drivers/gpu/drm/i915/intel_lrc.c           | 133 ++++++++++++++++-------------
>  drivers/gpu/drm/i915/intel_ringbuffer.h    |   8 +-
>  7 files changed, 120 insertions(+), 92 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
> index 870c470177b5..0b5d7142d8d9 100644
> --- a/drivers/gpu/drm/i915/i915_debugfs.c
> +++ b/drivers/gpu/drm/i915/i915_debugfs.c
> @@ -3315,6 +3315,7 @@ static int i915_engine_info(struct seq_file *m, void *unused)
>  		if (i915.enable_execlists) {
>  			u32 ptr, read, write;
>  			struct rb_node *rb;
> +			unsigned int idx;
>
>  			seq_printf(m, "\tExeclist status: 0x%08x %08x\n",
>  				   I915_READ(RING_EXECLIST_STATUS_LO(engine)),
> @@ -3332,8 +3333,7 @@ static int i915_engine_info(struct seq_file *m, void *unused)
>  			if (read > write)
>  				write += GEN8_CSB_ENTRIES;
>  			while (read < write) {
> -				unsigned int idx = ++read % GEN8_CSB_ENTRIES;
> -
> +				idx = ++read % GEN8_CSB_ENTRIES;
>  				seq_printf(m, "\tExeclist CSB[%d]: 0x%08x, context: %d\n",
>  					   idx,
>  					   I915_READ(RING_CONTEXT_STATUS_BUF_LO(engine, idx)),
> @@ -3341,21 +3341,19 @@ static int i915_engine_info(struct seq_file *m, void *unused)
>  			}
>
>  			rcu_read_lock();
> -			rq = READ_ONCE(engine->execlist_port[0].request);
> -			if (rq) {
> -				seq_printf(m, "\t\tELSP[0] count=%d, ",
> -					   engine->execlist_port[0].count);
> -				print_request(m, rq, "rq: ");
> -			} else {
> -				seq_printf(m, "\t\tELSP[0] idle\n");
> -			}
> -			rq = READ_ONCE(engine->execlist_port[1].request);
> -			if (rq) {
> -				seq_printf(m, "\t\tELSP[1] count=%d, ",
> -					   engine->execlist_port[1].count);
> -				print_request(m, rq, "rq: ");
> -			} else {
> -				seq_printf(m, "\t\tELSP[1] idle\n");
> +			for (idx = 0; idx < ARRAY_SIZE(engine->execlist_port); idx++) {
> +				unsigned int count;
> +
> +				rq = port_unpack(&engine->execlist_port[idx],
> +						 &count);
> +				if (rq) {
> +					seq_printf(m, "\t\tELSP[%d] count=%d, ",
> +						   idx, count);
> +					print_request(m, rq, "rq: ");
> +				} else {
> +					seq_printf(m, "\t\tELSP[%d] idle\n",
> +						   idx);
> +				}
>  			}
>  			rcu_read_unlock();
>
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 2bc72314cdd1..f6df402a5247 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -3039,12 +3039,14 @@ static void engine_set_wedged(struct intel_engine_cs *engine)
>  	 */
>
>  	if (i915.enable_execlists) {
> +		struct execlist_port *port = engine->execlist_port;
>  		unsigned long flags;
> +		unsigned int n;
>
>  		spin_lock_irqsave(&engine->timeline->lock, flags);
>
> -		i915_gem_request_put(engine->execlist_port[0].request);
> -		i915_gem_request_put(engine->execlist_port[1].request);
> +		for (n = 0; n < ARRAY_SIZE(engine->execlist_port); n++)
> +			i915_gem_request_put(port_request(&port[n]));
>  		memset(engine->execlist_port, 0, sizeof(engine->execlist_port));
>  		engine->execlist_queue = RB_ROOT;
>  		engine->execlist_first = NULL;
> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
> index 4b247b050dcd..c5cdc6611d7f 100644
> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
> @@ -1324,12 +1324,17 @@ static void engine_record_requests(struct intel_engine_cs *engine,
>  static void error_record_engine_execlists(struct intel_engine_cs *engine,
>  					  struct drm_i915_error_engine *ee)
>  {
> +	const struct execlist_port *port = engine->execlist_port;
>  	unsigned int n;
>
> -	for (n = 0; n < ARRAY_SIZE(engine->execlist_port); n++)
> -		if (engine->execlist_port[n].request)
> -			record_request(engine->execlist_port[n].request,
> -				       &ee->execlist[n]);
> +	for (n = 0; n < ARRAY_SIZE(engine->execlist_port); n++) {
> +		struct drm_i915_gem_request *rq = port_request(&port[n]);
> +
> +		if (!rq)
> +			break;
> +
> +		record_request(rq, &ee->execlist[n]);
> +	}
>  }
>
>  static void record_context(struct drm_i915_error_context *e,
> diff --git a/drivers/gpu/drm/i915/i915_guc_submission.c b/drivers/gpu/drm/i915/i915_guc_submission.c
> index 1642fff9cf13..370373c97b81 100644
> --- a/drivers/gpu/drm/i915/i915_guc_submission.c
> +++ b/drivers/gpu/drm/i915/i915_guc_submission.c
> @@ -658,7 +658,7 @@ static void nested_enable_signaling(struct drm_i915_gem_request *rq)
>  static bool i915_guc_dequeue(struct intel_engine_cs *engine)
>  {
>  	struct execlist_port *port = engine->execlist_port;
> -	struct drm_i915_gem_request *last = port[0].request;
> +	struct drm_i915_gem_request *last = port[0].request_count;

It's confusing that in this new scheme sometimes we have direct access 
to the request and sometimes we have to go through the port_request macro.

So maybe we should always use the port_request macro. Hm, could we 
invent a new type to help enforce that? Like:

struct drm_i915_gem_port_request_slot {
	struct drm_i915_gem_request *req_count;
};

And then the execlist port array would contain these, and the helpers
would need to become functions?
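
For instance (rough sketch, reusing the packing from your macros):

static inline struct drm_i915_gem_request *
port_request(struct drm_i915_gem_port_request_slot *slot)
{
	return ptr_mask_bits(slot->req_count, EXECLIST_COUNT_BITS);
}

static inline unsigned int
port_count(struct drm_i915_gem_port_request_slot *slot)
{
	return ptr_unmask_bits(slot->req_count, EXECLIST_COUNT_BITS);
}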

I've also noticed some GVT/GuC patches which sounded like they are
adding the same single-submission constraints, so maybe now is the time
to unify the dequeue? (I haven't looked at those patches any deeper than
the subject line, so I might be wrong.)

I am not 100% sure about all of the above, I would need to sketch it out.
What are your thoughts?

>  	struct rb_node *rb;
>  	bool submit = false;
>
> @@ -672,7 +672,7 @@ static bool i915_guc_dequeue(struct intel_engine_cs *engine)
>  			if (port != engine->execlist_port)
>  				break;
>
> -			i915_gem_request_assign(&port->request, last);
> +			i915_gem_request_assign(&port->request_count, last);
>  			nested_enable_signaling(last);
>  			port++;
>  		}
> @@ -688,7 +688,7 @@ static bool i915_guc_dequeue(struct intel_engine_cs *engine)
>  		submit = true;
>  	}
>  	if (submit) {
> -		i915_gem_request_assign(&port->request, last);
> +		i915_gem_request_assign(&port->request_count, last);
>  		nested_enable_signaling(last);
>  		engine->execlist_first = rb;
>  	}
> @@ -705,17 +705,19 @@ static void i915_guc_irq_handler(unsigned long data)
>  	bool submit;
>
>  	do {
> -		rq = port[0].request;
> +		rq = port[0].request_count;
>  		while (rq && i915_gem_request_completed(rq)) {
>  			trace_i915_gem_request_out(rq);
>  			i915_gem_request_put(rq);
> -			port[0].request = port[1].request;
> -			port[1].request = NULL;
> -			rq = port[0].request;
> +
> +			port[0].request_count = port[1].request_count;
> +			port[1].request_count = NULL;
> +
> +			rq = port[0].request_count;
>  		}
>
>  		submit = false;
> -		if (!port[1].request)
> +		if (!port[1].request_count)
>  			submit = i915_guc_dequeue(engine);
>  	} while (submit);
>  }
> diff --git a/drivers/gpu/drm/i915/intel_engine_cs.c b/drivers/gpu/drm/i915/intel_engine_cs.c
> index 402769d9d840..10027d0a09b5 100644
> --- a/drivers/gpu/drm/i915/intel_engine_cs.c
> +++ b/drivers/gpu/drm/i915/intel_engine_cs.c
> @@ -1148,7 +1148,7 @@ bool intel_engine_is_idle(struct intel_engine_cs *engine)
>  		return false;
>
>  	/* Both ports drained, no more ELSP submission? */
> -	if (engine->execlist_port[0].request)
> +	if (port_request(&engine->execlist_port[0]))
>  		return false;
>
>  	/* Ring stopped? */
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 7df278fe492e..69299fbab4f9 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -342,39 +342,32 @@ static u64 execlists_update_context(struct drm_i915_gem_request *rq)
>
>  static void execlists_submit_ports(struct intel_engine_cs *engine)
>  {
> -	struct drm_i915_private *dev_priv = engine->i915;
>  	struct execlist_port *port = engine->execlist_port;
>  	u32 __iomem *elsp =
> -		dev_priv->regs + i915_mmio_reg_offset(RING_ELSP(engine));
> -	u64 desc[2];
> -
> -	GEM_BUG_ON(port[0].count > 1);
> -	if (!port[0].count)
> -		execlists_context_status_change(port[0].request,
> -						INTEL_CONTEXT_SCHEDULE_IN);
> -	desc[0] = execlists_update_context(port[0].request);
> -	GEM_DEBUG_EXEC(port[0].context_id = upper_32_bits(desc[0]));
> -	port[0].count++;
> -
> -	if (port[1].request) {
> -		GEM_BUG_ON(port[1].count);
> -		execlists_context_status_change(port[1].request,
> -						INTEL_CONTEXT_SCHEDULE_IN);
> -		desc[1] = execlists_update_context(port[1].request);
> -		GEM_DEBUG_EXEC(port[1].context_id = upper_32_bits(desc[1]));
> -		port[1].count = 1;
> -	} else {
> -		desc[1] = 0;
> -	}
> -	GEM_BUG_ON(desc[0] == desc[1]);
> -
> -	/* You must always write both descriptors in the order below. */
> -	writel(upper_32_bits(desc[1]), elsp);
> -	writel(lower_32_bits(desc[1]), elsp);
> +		engine->i915->regs + i915_mmio_reg_offset(RING_ELSP(engine));
> +	unsigned int n;
> +
> +	for (n = ARRAY_SIZE(engine->execlist_port); n--; ) {

We could also add for_each_req_port or something, to iterate the ports
and unpack either just the req or the count as well?
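
Something like (name and shape made up, just to show what I mean):

#define for_each_execlist_port(engine__, port__, n__) \
	for ((n__) = 0, (port__) = (engine__)->execlist_port; \
	     (n__) < ARRAY_SIZE((engine__)->execlist_port); \
	     (n__)++, (port__)++)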

Preliminary pass only before the mtg. :)

Regards,

Tvrtko

> +		struct drm_i915_gem_request *rq;
> +		unsigned int count;
> +		u64 desc;
> +
> +		rq = port_unpack(&port[n], &count);
> +		if (rq) {
> +			GEM_BUG_ON(count > !n);
> +			if (!count++)
> +				execlists_context_status_change(rq, INTEL_CONTEXT_SCHEDULE_IN);
> +			port[n].request_count = port_pack(rq, count);
> +			desc = execlists_update_context(rq);
> +			GEM_DEBUG_EXEC(port[n].context_id = upper_32_bits(desc));
> +		} else {
> +			GEM_BUG_ON(!n);
> +			desc = 0;
> +		}
>
> -	writel(upper_32_bits(desc[0]), elsp);
> -	/* The context is automatically loaded after the following */
> -	writel(lower_32_bits(desc[0]), elsp);
> +		writel(upper_32_bits(desc), elsp);
> +		writel(lower_32_bits(desc), elsp);
> +	}
>  }
>
>  static bool ctx_single_port_submission(const struct i915_gem_context *ctx)
> @@ -395,6 +388,18 @@ static bool can_merge_ctx(const struct i915_gem_context *prev,
>  	return true;
>  }
>
> +static void port_assign(struct execlist_port *port,
> +			struct drm_i915_gem_request *rq)
> +{
> +	GEM_BUG_ON(rq == port_request(port));
> +
> +	if (port->request_count)
> +		i915_gem_request_put(port_request(port));
> +
> +	port->request_count =
> +		port_pack(i915_gem_request_get(rq), port_count(port));
> +}
> +
>  static void execlists_dequeue(struct intel_engine_cs *engine)
>  {
>  	struct drm_i915_gem_request *last;
> @@ -402,7 +407,7 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
>  	struct rb_node *rb;
>  	bool submit = false;
>
> -	last = port->request;
> +	last = port_request(port);
>  	if (last)
>  		/* WaIdleLiteRestore:bdw,skl
>  		 * Apply the wa NOOPs to prevent ring:HEAD == req:TAIL
> @@ -412,7 +417,7 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
>  		 */
>  		last->tail = last->wa_tail;
>
> -	GEM_BUG_ON(port[1].request);
> +	GEM_BUG_ON(port[1].request_count);
>
>  	/* Hardware submission is through 2 ports. Conceptually each port
>  	 * has a (RING_START, RING_HEAD, RING_TAIL) tuple. RING_START is
> @@ -469,7 +474,8 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
>
>  			GEM_BUG_ON(last->ctx == cursor->ctx);
>
> -			i915_gem_request_assign(&port->request, last);
> +			if (submit)
> +				port_assign(port, last);
>  			port++;
>  		}
>
> @@ -484,7 +490,7 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
>  		submit = true;
>  	}
>  	if (submit) {
> -		i915_gem_request_assign(&port->request, last);
> +		port_assign(port, last);
>  		engine->execlist_first = rb;
>  	}
>  	spin_unlock_irq(&engine->timeline->lock);
> @@ -495,14 +501,14 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
>
>  static bool execlists_elsp_idle(struct intel_engine_cs *engine)
>  {
> -	return !engine->execlist_port[0].request;
> +	return !port_count(&engine->execlist_port[0]);
>  }
>
>  static bool execlists_elsp_ready(const struct intel_engine_cs *engine)
>  {
>  	const struct execlist_port *port = engine->execlist_port;
>
> -	return port[0].count + port[1].count < 2;
> +	return port_count(&port[0]) + port_count(&port[1]) < 2;
>  }
>
>  /*
> @@ -552,7 +558,9 @@ static void intel_lrc_irq_handler(unsigned long data)
>  		tail = GEN8_CSB_WRITE_PTR(head);
>  		head = GEN8_CSB_READ_PTR(head);
>  		while (head != tail) {
> +			struct drm_i915_gem_request *rq;
>  			unsigned int status;
> +			unsigned int count;
>
>  			if (++head == GEN8_CSB_ENTRIES)
>  				head = 0;
> @@ -582,20 +590,24 @@ static void intel_lrc_irq_handler(unsigned long data)
>  			GEM_DEBUG_BUG_ON(readl(buf + 2 * head + 1) !=
>  					 port[0].context_id);
>
> -			GEM_BUG_ON(port[0].count == 0);
> -			if (--port[0].count == 0) {
> +			rq = port_unpack(&port[0], &count);
> +			GEM_BUG_ON(count == 0);
> +			if (--count == 0) {
>  				GEM_BUG_ON(status & GEN8_CTX_STATUS_PREEMPTED);
> -				GEM_BUG_ON(!i915_gem_request_completed(port[0].request));
> -				execlists_context_status_change(port[0].request,
> -								INTEL_CONTEXT_SCHEDULE_OUT);
> +				GEM_BUG_ON(!i915_gem_request_completed(rq));
> +				execlists_context_status_change(rq, INTEL_CONTEXT_SCHEDULE_OUT);
> +
> +				trace_i915_gem_request_out(rq);
> +				i915_gem_request_put(rq);
>
> -				trace_i915_gem_request_out(port[0].request);
> -				i915_gem_request_put(port[0].request);
>  				port[0] = port[1];
>  				memset(&port[1], 0, sizeof(port[1]));
> +			} else {
> +				port[0].request_count = port_pack(rq, count);
>  			}
>
> -			GEM_BUG_ON(port[0].count == 0 &&
> +			/* After the final element, the hw should be idle */
> +			GEM_BUG_ON(port_count(&port[0]) == 0 &&
>  				   !(status & GEN8_CTX_STATUS_ACTIVE_IDLE));
>  		}
>
> @@ -1148,11 +1160,6 @@ static int intel_init_workaround_bb(struct intel_engine_cs *engine)
>  	return ret;
>  }
>
> -static u32 port_seqno(struct execlist_port *port)
> -{
> -	return port->request ? port->request->global_seqno : 0;
> -}
> -
>  static int gen8_init_common_ring(struct intel_engine_cs *engine)
>  {
>  	struct drm_i915_private *dev_priv = engine->i915;
> @@ -1177,12 +1184,22 @@ static int gen8_init_common_ring(struct intel_engine_cs *engine)
>  	/* After a GPU reset, we may have requests to replay */
>  	clear_bit(ENGINE_IRQ_EXECLIST, &engine->irq_posted);
>  	if (!i915.enable_guc_submission && !execlists_elsp_idle(engine)) {
> -		DRM_DEBUG_DRIVER("Restarting %s from requests [0x%x, 0x%x]\n",
> -				 engine->name,
> -				 port_seqno(&engine->execlist_port[0]),
> -				 port_seqno(&engine->execlist_port[1]));
> -		engine->execlist_port[0].count = 0;
> -		engine->execlist_port[1].count = 0;
> +		struct execlist_port *port = engine->execlist_port;
> +		unsigned int n;
> +
> +		for (n = 0; n < ARRAY_SIZE(engine->execlist_port); n++) {
> +			if (!port[n].request_count)
> +				break;
> +
> +			DRM_DEBUG_DRIVER("Restarting %s from 0x%x [%d]\n",
> +					 engine->name,
> +					 port_request(&port[n])->global_seqno,
> +					 n);
> +
> +			/* Discard the current inflight count */
> +			port[n].request_count = port_request(&port[n]);
> +		}
> +
>  		execlists_submit_ports(engine);
>  	}
>
> @@ -1261,13 +1278,13 @@ static void reset_common_ring(struct intel_engine_cs *engine,
>  	intel_ring_update_space(request->ring);
>
>  	/* Catch up with any missed context-switch interrupts */
> -	if (request->ctx != port[0].request->ctx) {
> -		i915_gem_request_put(port[0].request);
> +	if (request->ctx != port_request(&port[0])->ctx) {
> +		i915_gem_request_put(port_request(&port[0]));
>  		port[0] = port[1];
>  		memset(&port[1], 0, sizeof(port[1]));
>  	}
>
> -	GEM_BUG_ON(request->ctx != port[0].request->ctx);
> +	GEM_BUG_ON(request->ctx != port_request(&port[0])->ctx);
>
>  	/* Reset WaIdleLiteRestore:bdw,skl as well */
>  	request->tail =
> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
> index d25b88467e5e..39b733e5cfd3 100644
> --- a/drivers/gpu/drm/i915/intel_ringbuffer.h
> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
> @@ -377,8 +377,12 @@ struct intel_engine_cs {
>  	/* Execlists */
>  	struct tasklet_struct irq_tasklet;
>  	struct execlist_port {
> -		struct drm_i915_gem_request *request;
> -		unsigned int count;
> +		struct drm_i915_gem_request *request_count;

Would req(uest)_slot maybe be better?

> +#define EXECLIST_COUNT_BITS 2
> +#define port_request(p) ptr_mask_bits((p)->request_count, EXECLIST_COUNT_BITS)
> +#define port_count(p) ptr_unmask_bits((p)->request_count, EXECLIST_COUNT_BITS)
> +#define port_pack(rq, count) ptr_pack_bits(rq, count, EXECLIST_COUNT_BITS)
> +#define port_unpack(p, count) ptr_unpack_bits((p)->request_count, count, EXECLIST_COUNT_BITS)
>  		GEM_DEBUG_DECL(u32 context_id);
>  	} execlist_port[2];
>  	struct rb_root execlist_queue;
>
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 15/27] drm/i915: Split execlist priority queue into rbtree + linked list
  2017-04-19  9:41 ` [PATCH 15/27] drm/i915: Split execlist priority queue into rbtree + linked list Chris Wilson
@ 2017-04-24 10:28   ` Tvrtko Ursulin
  2017-04-24 11:07     ` Chris Wilson
  0 siblings, 1 reply; 95+ messages in thread
From: Tvrtko Ursulin @ 2017-04-24 10:28 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 19/04/2017 10:41, Chris Wilson wrote:
> All the requests at the same priority are executed in FIFO order. They
> do not need to be stored in the rbtree themselves, as they are a simple
> list within a level. If we move the requests at one priority into a list,
> we can then reduce the rbtree to the set of priorities. This should keep
> the height of the rbtree small, as the number of active priorities can not
> exceed the number of active requests and should be typically only a few.
>
> Currently, we have ~2k possible different priority levels, that may
> increase to allow even more fine grained selection. Allocating those in
> advance seems a waste (and may be impossible), so we opt for allocating
> upon first use, and freeing after its requests are depleted. To avoid
> the possibility of an allocation failure causing us to lose a request,
> we preallocate the default priority (0) and bump any request to that
> priority if we fail to allocate it the appropriate plist. Having a
> request (that is ready to run, so not leading to corruption) execute
> out-of-order is better than leaking the request (and its dependency
> tree) entirely.
>
> There should be a benefit to reducing execlists_dequeue() to principally
> using a simple list (and reducing the frequency of both rbtree iteration
> and balancing on erase) but for typical workloads, request coalescing
> should be small enough that we don't notice any change. The main gain is
> from improving PI calls to schedule, and the explicit list within a
> level should make request unwinding simpler (we just need to insert at
> the head of the list rather than the tail and not have to make the
> rbtree search more complicated).

Sounds attractive! What workloads show the benefit and how much?

> v2: Avoid use-after-free when deleting a depleted priolist
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Michał Winiarski <michal.winiarski@intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> ---
>  drivers/gpu/drm/i915/i915_debugfs.c        | 12 +++--
>  drivers/gpu/drm/i915/i915_gem_request.c    |  4 +-
>  drivers/gpu/drm/i915/i915_gem_request.h    |  2 +-
>  drivers/gpu/drm/i915/i915_guc_submission.c | 20 ++++++--
>  drivers/gpu/drm/i915/intel_lrc.c           | 75 ++++++++++++++++++++++--------
>  drivers/gpu/drm/i915/intel_ringbuffer.h    |  7 +++
>  6 files changed, 90 insertions(+), 30 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
> index 0b5d7142d8d9..a8c7788d986e 100644
> --- a/drivers/gpu/drm/i915/i915_debugfs.c
> +++ b/drivers/gpu/drm/i915/i915_debugfs.c
> @@ -3314,7 +3314,6 @@ static int i915_engine_info(struct seq_file *m, void *unused)
>
>  		if (i915.enable_execlists) {
>  			u32 ptr, read, write;
> -			struct rb_node *rb;
>  			unsigned int idx;
>
>  			seq_printf(m, "\tExeclist status: 0x%08x %08x\n",
> @@ -3358,9 +3357,14 @@ static int i915_engine_info(struct seq_file *m, void *unused)
>  			rcu_read_unlock();
>
>  			spin_lock_irq(&engine->timeline->lock);
> -			for (rb = engine->execlist_first; rb; rb = rb_next(rb)) {
> -				rq = rb_entry(rb, typeof(*rq), priotree.node);
> -				print_request(m, rq, "\t\tQ ");
> +			for (rb = engine->execlist_first; rb; rb = rb_next(rb)){
> +				struct execlist_priolist *plist =
> +					rb_entry(rb, typeof(*plist), node);
> +
> +				list_for_each_entry(rq,
> +						    &plist->requests,
> +						    priotree.link)
> +					print_request(m, rq, "\t\tQ ");
>  			}
>  			spin_unlock_irq(&engine->timeline->lock);
>  		} else if (INTEL_GEN(dev_priv) > 6) {
> diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
> index 83b1584b3deb..59c0e0b00028 100644
> --- a/drivers/gpu/drm/i915/i915_gem_request.c
> +++ b/drivers/gpu/drm/i915/i915_gem_request.c
> @@ -159,7 +159,7 @@ i915_priotree_fini(struct drm_i915_private *i915, struct i915_priotree *pt)
>  {
>  	struct i915_dependency *dep, *next;
>
> -	GEM_BUG_ON(!RB_EMPTY_NODE(&pt->node));
> +	GEM_BUG_ON(!list_empty(&pt->link));
>
>  	/* Everyone we depended upon (the fences we wait to be signaled)
>  	 * should retire before us and remove themselves from our list.
> @@ -185,7 +185,7 @@ i915_priotree_init(struct i915_priotree *pt)
>  {
>  	INIT_LIST_HEAD(&pt->signalers_list);
>  	INIT_LIST_HEAD(&pt->waiters_list);
> -	RB_CLEAR_NODE(&pt->node);
> +	INIT_LIST_HEAD(&pt->link);
>  	pt->priority = INT_MIN;
>  }
>
> diff --git a/drivers/gpu/drm/i915/i915_gem_request.h b/drivers/gpu/drm/i915/i915_gem_request.h
> index 4ccab5affd3c..0a1d717b9fa7 100644
> --- a/drivers/gpu/drm/i915/i915_gem_request.h
> +++ b/drivers/gpu/drm/i915/i915_gem_request.h
> @@ -67,7 +67,7 @@ struct i915_dependency {
>  struct i915_priotree {
>  	struct list_head signalers_list; /* those before us, we depend upon */
>  	struct list_head waiters_list; /* those after us, they depend upon us */
> -	struct rb_node node;
> +	struct list_head link;
>  	int priority;
>  #define I915_PRIORITY_MAX 1024
>  #define I915_PRIORITY_MIN (-I915_PRIORITY_MAX)
> diff --git a/drivers/gpu/drm/i915/i915_guc_submission.c b/drivers/gpu/drm/i915/i915_guc_submission.c
> index 370373c97b81..69b39729003b 100644
> --- a/drivers/gpu/drm/i915/i915_guc_submission.c
> +++ b/drivers/gpu/drm/i915/i915_guc_submission.c
> @@ -664,9 +664,15 @@ static bool i915_guc_dequeue(struct intel_engine_cs *engine)
>
>  	spin_lock_irq(&engine->timeline->lock);
>  	rb = engine->execlist_first;
> +	GEM_BUG_ON(rb_first(&engine->execlist_queue) != rb);
>  	while (rb) {
> +		struct execlist_priolist *plist =
> +			rb_entry(rb, typeof(*plist), node);
>  		struct drm_i915_gem_request *rq =
> -			rb_entry(rb, typeof(*rq), priotree.node);
> +			list_first_entry(&plist->requests,
> +					 typeof(*rq),
> +					 priotree.link);
> +		GEM_BUG_ON(list_empty(&plist->requests));
>
>  		if (last && rq->ctx != last->ctx) {
>  			if (port != engine->execlist_port)
> @@ -677,9 +683,15 @@ static bool i915_guc_dequeue(struct intel_engine_cs *engine)
>  			port++;
>  		}
>
> -		rb = rb_next(rb);
> -		rb_erase(&rq->priotree.node, &engine->execlist_queue);
> -		RB_CLEAR_NODE(&rq->priotree.node);
> +		if (rq->priotree.link.next == rq->priotree.link.prev) {
> +			rb = rb_next(rb);
> +			rb_erase(&plist->node, &engine->execlist_queue);
> +			if (plist->priority)
> +				kfree(plist);
> +		} else {
> +			__list_del_entry(&rq->priotree.link);
> +		}
> +		INIT_LIST_HEAD(&rq->priotree.link);
>  		rq->priotree.priority = INT_MAX;
>
>  		i915_guc_submit(rq);
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 69299fbab4f9..f96d7980ac16 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -442,9 +442,15 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
>
>  	spin_lock_irq(&engine->timeline->lock);
>  	rb = engine->execlist_first;
> +	GEM_BUG_ON(rb_first(&engine->execlist_queue) != rb);
>  	while (rb) {
> +		struct execlist_priolist *plist =
> +			rb_entry(rb, typeof(*plist), node);
>  		struct drm_i915_gem_request *cursor =
> -			rb_entry(rb, typeof(*cursor), priotree.node);
> +			list_first_entry(&plist->requests,
> +					 typeof(*cursor),
> +					 priotree.link);
> +		GEM_BUG_ON(list_empty(&plist->requests));
>
>  		/* Can we combine this request with the current port? It has to
>  		 * be the same context/ringbuffer and not have any exceptions
> @@ -479,9 +485,15 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
>  			port++;
>  		}
>
> -		rb = rb_next(rb);
> -		rb_erase(&cursor->priotree.node, &engine->execlist_queue);
> -		RB_CLEAR_NODE(&cursor->priotree.node);
> +		if (cursor->priotree.link.next == cursor->priotree.link.prev) {
> +			rb = rb_next(rb);
> +			rb_erase(&plist->node, &engine->execlist_queue);
> +			if (plist->priority)
> +				kfree(plist);
> +		} else {
> +			__list_del_entry(&cursor->priotree.link);
> +		}
> +		INIT_LIST_HEAD(&cursor->priotree.link);
>  		cursor->priotree.priority = INT_MAX;
>
>  		__i915_gem_request_submit(cursor);
> @@ -621,28 +633,53 @@ static void intel_lrc_irq_handler(unsigned long data)
>  	intel_uncore_forcewake_put(dev_priv, engine->fw_domains);
>  }
>
> -static bool insert_request(struct i915_priotree *pt, struct rb_root *root)
> +static bool
> +insert_request(struct intel_engine_cs *engine,
> +	       struct i915_priotree *pt,
> +	       int prio)
>  {
> +	struct execlist_priolist *plist;
>  	struct rb_node **p, *rb;
>  	bool first = true;
>
> +find_plist:
>  	/* most positive priority is scheduled first, equal priorities fifo */
>  	rb = NULL;
> -	p = &root->rb_node;
> +	p = &engine->execlist_queue.rb_node;
>  	while (*p) {
> -		struct i915_priotree *pos;
> -
>  		rb = *p;
> -		pos = rb_entry(rb, typeof(*pos), node);
> -		if (pt->priority > pos->priority) {
> +		plist = rb_entry(rb, typeof(*plist), node);
> +		if (prio > plist->priority) {
>  			p = &rb->rb_left;
> -		} else {
> +		} else if (prio < plist->priority) {
>  			p = &rb->rb_right;
>  			first = false;
> +		} else {
> +			list_add_tail(&pt->link, &plist->requests);
> +			return false;
>  		}
>  	}
> -	rb_link_node(&pt->node, rb, p);
> -	rb_insert_color(&pt->node, root);
> +
> +	if (!prio) {
> +		plist = &engine->default_priolist;

Should be "prio == I915_PRIO_DEFAULT" (give or take).

But I am not completely happy with special casing the default priority 
for two reasons.

Firstly, userspace can opt to lower its priority and completely defeat 
this path.

Secondly, we already have flip priority which perhaps should have its 
own fast path / avoid allocation as well.

Those two combined make me unsure whether the optimisation is worth it. 
What would be the pros and cons of three steps:

1. No optimisation.
2. prio == default optimisation like above.
3. Better system with caching of frequently used levels.

Last is definitely complicated, second is not, but is the second much 
better than the first?

Perhaps a simplification of 3) where we would defer the freeing of 
unused priority levels until the busy to idle transition? That would 
also drop both the existence of, and the special handling for, 
engine->default_priolist.
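
Hand-wavy sketch of the idle-time sweep I have in mind, assuming dequeue 
simply leaves emptied plists in the tree (untested):

	/* engine busy -> idle transition, execlist queue fully drained */
	struct execlist_priolist *plist, *next;

	rbtree_postorder_for_each_entry_safe(plist, next,
					     &engine->execlist_queue, node) {
		GEM_BUG_ON(!list_empty(&plist->requests));
		kfree(plist);
	}
	engine->execlist_queue = RB_ROOT;
	engine->execlist_first = NULL;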

> +	} else {
> +		plist = kmalloc(sizeof(*plist), GFP_ATOMIC);
> +		/* Convert an allocation failure to a priority bump */

Where is the priority bump? It looks like it can be the opposite for 
high prio requests below.

I don't think it matters hugely what happens with priorities when small 
allocations start to go bad, but I would like to understand the comment.

And perhaps this would be worthy of a dedicated slab cache?

> +		if (unlikely(!plist)) {
> +			prio = 0; /* recurses just once */
> +			goto find_plist;
> +		}
> +	}
> +
> +	plist->priority = prio;
> +	rb_link_node(&plist->node, rb, p);
> +	rb_insert_color(&plist->node, &engine->execlist_queue);
> +
> +	INIT_LIST_HEAD(&plist->requests);
> +	list_add_tail(&pt->link, &plist->requests);
> +
> +	if (first)
> +		engine->execlist_first = &plist->node;
>
>  	return first;
>  }
> @@ -655,8 +692,9 @@ static void execlists_submit_request(struct drm_i915_gem_request *request)
>  	/* Will be called from irq-context when using foreign fences. */
>  	spin_lock_irqsave(&engine->timeline->lock, flags);
>
> -	if (insert_request(&request->priotree, &engine->execlist_queue)) {
> -		engine->execlist_first = &request->priotree.node;
> +	if (insert_request(engine,
> +			   &request->priotree,
> +			   request->priotree.priority)) {
>  		if (execlists_elsp_ready(engine))
>  			tasklet_hi_schedule(&engine->irq_tasklet);
>  	}
> @@ -745,10 +783,9 @@ static void execlists_schedule(struct drm_i915_gem_request *request, int prio)
>  			continue;
>
>  		pt->priority = prio;
> -		if (!RB_EMPTY_NODE(&pt->node)) {
> -			rb_erase(&pt->node, &engine->execlist_queue);
> -			if (insert_request(pt, &engine->execlist_queue))
> -				engine->execlist_first = &pt->node;
> +		if (!list_empty(&pt->link)) {
> +			__list_del_entry(&pt->link);
> +			insert_request(engine, pt, prio);
>  		}
>  	}
>
> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
> index 39b733e5cfd3..1ff41bd9e89a 100644
> --- a/drivers/gpu/drm/i915/intel_ringbuffer.h
> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
> @@ -187,6 +187,12 @@ enum intel_engine_id {
>  	VECS
>  };
>
> +struct execlist_priolist {
> +	struct rb_node node;
> +	struct list_head requests;
> +	int priority;
> +};
> +
>  #define INTEL_ENGINE_CS_MAX_NAME 8
>
>  struct intel_engine_cs {
> @@ -376,6 +382,7 @@ struct intel_engine_cs {
>
>  	/* Execlists */
>  	struct tasklet_struct irq_tasklet;
> +	struct execlist_priolist default_priolist;
>  	struct execlist_port {
>  		struct drm_i915_gem_request *request_count;
>  #define EXECLIST_COUNT_BITS 2
>

Regards,

Tvrtko

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 15/27] drm/i915: Split execlist priority queue into rbtree + linked list
  2017-04-24 10:28   ` Tvrtko Ursulin
@ 2017-04-24 11:07     ` Chris Wilson
  2017-04-24 12:18       ` Chris Wilson
  2017-04-24 12:44       ` Tvrtko Ursulin
  0 siblings, 2 replies; 95+ messages in thread
From: Chris Wilson @ 2017-04-24 11:07 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: intel-gfx

On Mon, Apr 24, 2017 at 11:28:32AM +0100, Tvrtko Ursulin wrote:
> 
> On 19/04/2017 10:41, Chris Wilson wrote:
> >All the requests at the same priority are executed in FIFO order. They
> >do not need to be stored in the rbtree themselves, as they are a simple
> >list within a level. If we move the requests at one priority into a list,
> >we can then reduce the rbtree to the set of priorities. This should keep
> >the height of the rbtree small, as the number of active priorities can not
> >exceed the number of active requests and should be typically only a few.
> >
> >Currently, we have ~2k possible different priority levels, that may
> >increase to allow even more fine grained selection. Allocating those in
> >advance seems a waste (and may be impossible), so we opt for allocating
> >upon first use, and freeing after its requests are depleted. To avoid
> >the possibility of an allocation failure causing us to lose a request,
> >we preallocate the default priority (0) and bump any request to that
> >priority if we fail to allocate it the appropriate plist. Having a
> >request (that is ready to run, so not leading to corruption) execute
> >out-of-order is better than leaking the request (and its dependency
> >tree) entirely.
> >
> >There should be a benefit to reducing execlists_dequeue() to principally
> >using a simple list (and reducing the frequency of both rbtree iteration
> >and balancing on erase) but for typical workloads, request coalescing
> >should be small enough that we don't notice any change. The main gain is
> >from improving PI calls to schedule, and the explicit list within a
> >level should make request unwinding simpler (we just need to insert at
> >the head of the list rather than the tail and not have to make the
> >rbtree search more complicated).
> 
> Sounds attractive! What workloads show the benefit and how much?

The default will show the best, since everything is priority 0 more or
less and so we reduce the rbtree search to a single lookup and list_add.
It's hard to measure the impact of the rbtree though. On the dequeue
side, the mmio access dominates. On the schedule side, if we have lots
of requests, the dfs dominates.

I have an idea on how we might stress the rbtree in submit_request - but
still it requires long queues untypical of most workloads. Still tbd.

> >-static bool insert_request(struct i915_priotree *pt, struct rb_root *root)
> >+static bool
> >+insert_request(struct intel_engine_cs *engine,
> >+	       struct i915_priotree *pt,
> >+	       int prio)
> > {
> >+	struct execlist_priolist *plist;
> > 	struct rb_node **p, *rb;
> > 	bool first = true;
> >
> >+find_plist:
> > 	/* most positive priority is scheduled first, equal priorities fifo */
> > 	rb = NULL;
> >-	p = &root->rb_node;
> >+	p = &engine->execlist_queue.rb_node;
> > 	while (*p) {
> >-		struct i915_priotree *pos;
> >-
> > 		rb = *p;
> >-		pos = rb_entry(rb, typeof(*pos), node);
> >-		if (pt->priority > pos->priority) {
> >+		plist = rb_entry(rb, typeof(*plist), node);
> >+		if (prio > plist->priority) {
> > 			p = &rb->rb_left;
> >-		} else {
> >+		} else if (prio < plist->priority) {
> > 			p = &rb->rb_right;
> > 			first = false;
> >+		} else {
> >+			list_add_tail(&pt->link, &plist->requests);
> >+			return false;
> > 		}
> > 	}
> >-	rb_link_node(&pt->node, rb, p);
> >-	rb_insert_color(&pt->node, root);
> >+
> >+	if (!prio) {
> >+		plist = &engine->default_priolist;
> 
> Should be "prio == I915_PRIO_DEFAULT" (give or take).
> 
> But I am not completely happy with special casing the default
> priority for two reasons.
> 
> Firstly, userspace can opt to lower its priority and completely
> defeat this path.
> 
> Secondly, we already have flip priority which perhaps should have
> its own fast path / avoid allocation as well.
> 
> Those two combined make me unsure whether the optimisation is worth
> it. What would be the pros and cons of three steps:
> 
> 1. No optimisation.
> 2. prio == default optimisation like above.
> 3. Better system with caching of frequently used levels.
> 
> Last is definitely complicated, second is not, but is the second
> much better than the first?

It was not intended as an optimisation. It is for handling the
ENOMEM here. We cannot abort the request at such a late stage, so we
need somewhere to hold it. That dictated having a preallocated slot. I
also didn't like having to preallocate all possible levels as that seems
a waste, especially as I like to invent new levels and suspect that we
may end up using a full u32 range.

Using it for the default priority was then to take advantage of the
preallocation.

> Perhaps a simplification of 3) where we would defer the freeing of
> unused priority levels until the busy to idle transition? That would
> also drop the existence and need for special handling of
> engine->default_prio.
> 
> >+	} else {
> >+		plist = kmalloc(sizeof(*plist), GFP_ATOMIC);
> >+		/* Convert an allocation failure to a priority bump */
> 
> Where is the priority bump? It looks like it can be the opposite for
> high prio requests below.

Correct. Bump was the best verb I thought of.
 
> I don't think it matters what happens with priorities hugely when
> small allocations start to go bad but would like to understand the
> comment.
> 
> And perhaps this would be worthy of a dedicated slab cache?

Even with a slab cache, we cannot prevent allocation failure. I don't
think priority levels will be frequent enough to really justify one.
Should be a good match for the common kmalloc-64 slab.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 15/27] drm/i915: Split execlist priority queue into rbtree + linked list
  2017-04-24 11:07     ` Chris Wilson
@ 2017-04-24 12:18       ` Chris Wilson
  2017-04-24 12:44       ` Tvrtko Ursulin
  1 sibling, 0 replies; 95+ messages in thread
From: Chris Wilson @ 2017-04-24 12:18 UTC (permalink / raw)
  To: Tvrtko Ursulin, intel-gfx

On Mon, Apr 24, 2017 at 12:07:47PM +0100, Chris Wilson wrote:
> On Mon, Apr 24, 2017 at 11:28:32AM +0100, Tvrtko Ursulin wrote:
> > 
> > On 19/04/2017 10:41, Chris Wilson wrote:
> > Sounds attractive! What workloads show the benefit and how much?
> 
> The default will show the best, since everything is priority 0 more or
> less and so we reduce the rbtree search to a single lookup and list_add.
> It's hard to measure the impact of the rbtree though. On the dequeue
> side, the mmio access dominates. On the schedule side, if we have lots
> of requests, the dfs dominates.
> 
> I have an idea on how we might stress the rbtree in submit_request - but
> still it requires long queues untypical of most workloads. Still tbd.

I have something that does show a difference in that path (which is
potentially in hardirq). Overall time is completely dominated by the
reservation_object (ofc, we'll get back around to its scalability
patches at some point). For a few thousand prio=0 requests inflight, the
difference in execlists_submit_request() is about 6x, and for
intel_lrc_irq_handler() is about 2x (just a factor that I sent a lot of
coalesceable requests and so the reduction of rb_next to list_next).

Completely synthetic testing; I would be worried if the rbtree was that
tall in practice (request generation >> execution). The neat part of the
split, I think, is that it makes the resubmission of a gazumped request
easier - instead of writing a parallel rbtree sort, we just put the old
request at the head of the plist.
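
Roughly, unwinding rq back onto the queue then becomes finding the plist
for rq->priotree.priority (the same walk as insert_request) followed by
just

	list_add(&rq->priotree.link, &plist->requests);

rather than the list_add_tail we use for fresh submission.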
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 15/27] drm/i915: Split execlist priority queue into rbtree + linked list
  2017-04-24 11:07     ` Chris Wilson
  2017-04-24 12:18       ` Chris Wilson
@ 2017-04-24 12:44       ` Tvrtko Ursulin
  2017-04-24 13:06         ` Chris Wilson
  1 sibling, 1 reply; 95+ messages in thread
From: Tvrtko Ursulin @ 2017-04-24 12:44 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 24/04/2017 12:07, Chris Wilson wrote:
> On Mon, Apr 24, 2017 at 11:28:32AM +0100, Tvrtko Ursulin wrote:
>>
>> On 19/04/2017 10:41, Chris Wilson wrote:
>>> All the requests at the same priority are executed in FIFO order. They
>>> do not need to be stored in the rbtree themselves, as they are a simple
>>> list within a level. If we move the requests at one priority into a list,
>>> we can then reduce the rbtree to the set of priorities. This should keep
>>> the height of the rbtree small, as the number of active priorities can not
>>> exceed the number of active requests and should be typically only a few.
>>>
>>> Currently, we have ~2k possible different priority levels, that may
>>> increase to allow even more fine grained selection. Allocating those in
>>> advance seems a waste (and may be impossible), so we opt for allocating
>>> upon first use, and freeing after its requests are depleted. To avoid
>>> the possibility of an allocation failure causing us to lose a request,
>>> we preallocate the default priority (0) and bump any request to that
>>> priority if we fail to allocate it the appropriate plist. Having a
>>> request (that is ready to run, so not leading to corruption) execute
>>> out-of-order is better than leaking the request (and its dependency
>>> tree) entirely.
>>>
>>> There should be a benefit to reducing execlists_dequeue() to principally
>>> using a simple list (and reducing the frequency of both rbtree iteration
>>> and balancing on erase) but for typical workloads, request coalescing
>>> should be small enough that we don't notice any change. The main gain is
>>> from improving PI calls to schedule, and the explicit list within a
>>> level should make request unwinding simpler (we just need to insert at
>>> the head of the list rather than the tail and not have to make the
>>> rbtree search more complicated).
>>
>> Sounds attractive! What workloads show the benefit and how much?
>
> The default will show the best, since everything is priority 0 more or
> less and so we reduce the rbtree search to a single lookup and list_add.
> It's hard to measure the impact of the rbtree though. On the dequeue
> side, the mmio access dominates. On the schedule side, if we have lots
> of requests, the dfs dominates.
>
> I have an idea on how we might stress the rbtree in submit_request - but
> still it requires long queues untypical of most workloads. Still tbd.
>
>>> -static bool insert_request(struct i915_priotree *pt, struct rb_root *root)
>>> +static bool
>>> +insert_request(struct intel_engine_cs *engine,
>>> +	       struct i915_priotree *pt,
>>> +	       int prio)
>>> {
>>> +	struct execlist_priolist *plist;
>>> 	struct rb_node **p, *rb;
>>> 	bool first = true;
>>>
>>> +find_plist:
>>> 	/* most positive priority is scheduled first, equal priorities fifo */
>>> 	rb = NULL;
>>> -	p = &root->rb_node;
>>> +	p = &engine->execlist_queue.rb_node;
>>> 	while (*p) {
>>> -		struct i915_priotree *pos;
>>> -
>>> 		rb = *p;
>>> -		pos = rb_entry(rb, typeof(*pos), node);
>>> -		if (pt->priority > pos->priority) {
>>> +		plist = rb_entry(rb, typeof(*plist), node);
>>> +		if (prio > plist->priority) {
>>> 			p = &rb->rb_left;
>>> -		} else {
>>> +		} else if (prio < plist->priority) {
>>> 			p = &rb->rb_right;
>>> 			first = false;
>>> +		} else {
>>> +			list_add_tail(&pt->link, &plist->requests);
>>> +			return false;
>>> 		}
>>> 	}
>>> -	rb_link_node(&pt->node, rb, p);
>>> -	rb_insert_color(&pt->node, root);
>>> +
>>> +	if (!prio) {
>>> +		plist = &engine->default_priolist;
>>
>> Should be "prio == I915_PRIO_DEFAULT" (give or take).
>>
>> But I am not completely happy with special casing the default
>> priority for two reasons.
>>
>> Firstly, userspace can opt to lower its priority and completely
>> defeat this path.
>>
>> Secondly, we already have flip priority which perhaps should have
>> it's own fast path / avoid allocation as well.
>>
>> Those two combined make me unsure whether the optimisation is worth
>> it. What would be the pros and cons of three steps:
>>
>> 1. No optimisation.
>> 2. prio == default optimisation like above.
>> 3. Better system with caching of frequently used levels.
>>
>> Last is definitely complicated, second is not, but is the second
>> much better than the first?
>
> It was not intended as an optimisation. It is for handling the
> ENOMEM here. We cannot abort the request at such a late stage, so we
> need somewhere to hold it. That dictated having a preallocated slot. I
> also didn't like having to preallocate all possible levels as that seems
> a waste, especially as I like to invent new levels and suspect that we
> may end up using a full u32 range.
>
> Using it for the default priority was then to take advantage of the
> preallocation.
>
>> Perhaps a simplification of 3) where we would defer the freeing of
>> unused priority levels until the busy to idle transition? That would
>> also drop the existence and need for special handling of
>> engine->default_prio.
>>
>>> +	} else {
>>> +		plist = kmalloc(sizeof(*plist), GFP_ATOMIC);
>>> +		/* Convert an allocation failure to a priority bump */
>>
>> Where is the priority bump? It looks like it can be the opposite for
>> high prio requests below.
>
> Correct. Bump was the best verb I thought of.
>
>> I don't think it matters what happens with priorities hugely when
>> small allocations start to go bad but would like to understand the
>> comment.
>>
>> And perhaps this would be worthy of a dedicated slab cache?
>
> Even with a slab cache, we cannot prevent allocation failure. I don't
> think priority levels will be frequent enough to really justify one.
> Should be a good match for the common kmalloc-64 slab.

We could keep a pre-allocated entry with each engine which would 
transfer ownership with insert_request. It would have to be allocated at 
a point where we can still fail, like request_alloc, but the downside 
would be starting to take the engine timeline lock in the request alloc 
path, only to check and preallocate if needed. And it would mean more 
traffic on the slab API in that path as well. Oh well, not very nice. I 
was just thinking about whether we could avoid GFP_ATOMIC and the 
default priority fallback. It seems like your solution is a better 
compromise.
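
For reference, what I had in mind was roughly this (hand-wavy, with a 
hypothetical engine->spare_priolist, so purely illustrative):

	/* in request_alloc, where we are still allowed to fail */
	if (!engine->spare_priolist) {
		engine->spare_priolist =
			kmalloc(sizeof(*engine->spare_priolist), GFP_KERNEL);
		if (!engine->spare_priolist)
			return -ENOMEM;
	}

	/* later, in insert_request, under the timeline lock */
	plist = engine->spare_priolist;
	engine->spare_priolist = NULL;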

A couple more questions on the patch details then.

Could you implement the list handling in a more obvious way? Instead of 
link.next == link.prev, use the more obvious list_empty on 
plist->requests; why __list_del_entry and not just list_del; and you have 
a list_add_tail as well which could be just list_add, since the list is 
empty at that point and _tail falsely suggests the tail is important.
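
Something like this in the dequeue loop, perhaps (rough, just to 
illustrate what I mean):

	list_del(&rq->priotree.link);
	if (list_empty(&plist->requests)) {
		rb = rb_next(rb);
		rb_erase(&plist->node, &engine->execlist_queue);
		if (plist->priority)
			kfree(plist);
	}
	INIT_LIST_HEAD(&rq->priotree.link);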

Regards,

Tvrtko

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 07/27] drm/i915: Squash repeated awaits on the same fence
  2017-04-19  9:41 ` [PATCH 07/27] drm/i915: Squash repeated awaits on the same fence Chris Wilson
@ 2017-04-24 13:03   ` Tvrtko Ursulin
  2017-04-24 13:19     ` Chris Wilson
  2017-04-26 10:20   ` Tvrtko Ursulin
  2017-04-27  7:06   ` [PATCH v8] " Chris Wilson
  2 siblings, 1 reply; 95+ messages in thread
From: Tvrtko Ursulin @ 2017-04-24 13:03 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 19/04/2017 10:41, Chris Wilson wrote:
> Track the latest fence waited upon on each context, and only add a new
> asynchronous wait if the new fence is more recent than the recorded
> fence for that context. This requires us to filter out unordered
> timelines, which are noted by DMA_FENCE_NO_CONTEXT. However, in the
> absence of a universal identifier, we have to use our own
> i915->mm.unordered_timeline token.

(._.), a bit later... @_@!

What does this fix and is the complexity worth it?

Regards,

Tvrtko


>
> v2: Throw around the debug crutches
> v3: Inline the likely case of the pre-allocation cache being full.
> v4: Drop the pre-allocation support, we can lose the most recent fence
> in case of allocation failure -- it just means we may emit more awaits
> than strictly necessary but will not break.
> v5: Trim allocation size for leaf nodes, they only need an array of u32
> not pointers.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> ---
>  drivers/gpu/drm/i915/i915_gem_request.c            |  67 +++---
>  drivers/gpu/drm/i915/i915_gem_timeline.c           | 260 +++++++++++++++++++++
>  drivers/gpu/drm/i915/i915_gem_timeline.h           |  14 ++
>  drivers/gpu/drm/i915/selftests/i915_gem_timeline.c | 123 ++++++++++
>  .../gpu/drm/i915/selftests/i915_mock_selftests.h   |   1 +
>  5 files changed, 438 insertions(+), 27 deletions(-)
>  create mode 100644 drivers/gpu/drm/i915/selftests/i915_gem_timeline.c
>
> diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
> index 97c07986b7c1..fb6c31ba3ef9 100644
> --- a/drivers/gpu/drm/i915/i915_gem_request.c
> +++ b/drivers/gpu/drm/i915/i915_gem_request.c
> @@ -730,9 +730,7 @@ int
>  i915_gem_request_await_dma_fence(struct drm_i915_gem_request *req,
>  				 struct dma_fence *fence)
>  {
> -	struct dma_fence_array *array;
>  	int ret;
> -	int i;
>
>  	if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags))
>  		return 0;
> @@ -744,39 +742,54 @@ i915_gem_request_await_dma_fence(struct drm_i915_gem_request *req,
>  	if (fence->context == req->fence.context)
>  		return 0;
>
> -	if (dma_fence_is_i915(fence))
> -		return i915_gem_request_await_request(req, to_request(fence));
> +	/* Squash repeated waits to the same timelines, picking the latest */
> +	if (fence->context != req->i915->mm.unordered_timeline &&
> +	    intel_timeline_sync_get(req->timeline,
> +				    fence->context, fence->seqno))
> +		return 0;
>
> -	if (!dma_fence_is_array(fence)) {
> +	if (dma_fence_is_i915(fence)) {
> +		ret = i915_gem_request_await_request(req, to_request(fence));
> +		if (ret < 0)
> +			return ret;
> +	} else if (!dma_fence_is_array(fence)) {
>  		ret = i915_sw_fence_await_dma_fence(&req->submit,
>  						    fence, I915_FENCE_TIMEOUT,
>  						    GFP_KERNEL);
> -		return ret < 0 ? ret : 0;
> -	}
> -
> -	/* Note that if the fence-array was created in signal-on-any mode,
> -	 * we should *not* decompose it into its individual fences. However,
> -	 * we don't currently store which mode the fence-array is operating
> -	 * in. Fortunately, the only user of signal-on-any is private to
> -	 * amdgpu and we should not see any incoming fence-array from
> -	 * sync-file being in signal-on-any mode.
> -	 */
> -
> -	array = to_dma_fence_array(fence);
> -	for (i = 0; i < array->num_fences; i++) {
> -		struct dma_fence *child = array->fences[i];
> -
> -		if (dma_fence_is_i915(child))
> -			ret = i915_gem_request_await_request(req,
> -							     to_request(child));
> -		else
> -			ret = i915_sw_fence_await_dma_fence(&req->submit,
> -							    child, I915_FENCE_TIMEOUT,
> -							    GFP_KERNEL);
>  		if (ret < 0)
>  			return ret;
> +	} else {
> +		struct dma_fence_array *array = to_dma_fence_array(fence);
> +		int i;
> +
> +		/* Note that if the fence-array was created in signal-on-any
> +		 * mode, we should *not* decompose it into its individual
> +		 * fences. However, we don't currently store which mode the
> +		 * fence-array is operating in. Fortunately, the only user of
> +		 * signal-on-any is private to amdgpu and we should not see any
> +		 * incoming fence-array from sync-file being in signal-on-any
> +		 * mode.
> +		 */
> +
> +		for (i = 0; i < array->num_fences; i++) {
> +			struct dma_fence *child = array->fences[i];
> +
> +			if (dma_fence_is_i915(child))
> +				ret = i915_gem_request_await_request(req,
> +								     to_request(child));
> +			else
> +				ret = i915_sw_fence_await_dma_fence(&req->submit,
> +								    child, I915_FENCE_TIMEOUT,
> +								    GFP_KERNEL);
> +			if (ret < 0)
> +				return ret;
> +		}
>  	}
>
> +	if (fence->context != req->i915->mm.unordered_timeline)
> +		intel_timeline_sync_set(req->timeline,
> +					fence->context, fence->seqno);
> +
>  	return 0;
>  }
>
> diff --git a/drivers/gpu/drm/i915/i915_gem_timeline.c b/drivers/gpu/drm/i915/i915_gem_timeline.c
> index b596ca7ee058..f2b734dda895 100644
> --- a/drivers/gpu/drm/i915/i915_gem_timeline.c
> +++ b/drivers/gpu/drm/i915/i915_gem_timeline.c
> @@ -24,6 +24,254 @@
>
>  #include "i915_drv.h"
>
> +#define NSYNC 16
> +#define SHIFT ilog2(NSYNC)
> +#define MASK (NSYNC - 1)
> +
> +/* struct intel_timeline_sync is a layer of a radixtree that maps a u64 fence
> + * context id to the last u32 fence seqno waited upon from that context.
> + * Unlike lib/radixtree it uses a parent pointer that allows traversal back to
> + * the root. This allows us to access the whole tree via a single pointer
> + * to the most recently used layer. We expect fence contexts to be dense
> + * and most reuse to be on the same i915_gem_context but on neighbouring
> + * engines (i.e. on adjacent contexts) and reuse the same leaf, a very
> + * effective lookup cache. If the new lookup is not on the same leaf, we
> + * expect it to be on the neighbouring branch.
> + *
> + * A leaf holds an array of u32 seqno, and has height 0. The bitmap field
> + * allows us to store whether a particular seqno is valid (i.e. allows us
> + * to distinguish unset from 0).
> + *
> + * A branch holds an array of layer pointers, and has height > 0, and always
> + * has at least 2 layers (either branches or leaves) below it.
> + *
> + */
> +struct intel_timeline_sync {
> +	u64 prefix;
> +	unsigned int height;
> +	unsigned int bitmap;
> +	struct intel_timeline_sync *parent;
> +	/* union {
> +	 *	u32 seqno;
> +	 *	struct intel_timeline_sync *child;
> +	 * } slot[NSYNC];
> +	 */
> +};
> +
> +static inline u32 *__sync_seqno(struct intel_timeline_sync *p)
> +{
> +	GEM_BUG_ON(p->height);
> +	return (u32 *)(p + 1);
> +}
> +
> +static inline struct intel_timeline_sync **
> +__sync_child(struct intel_timeline_sync *p)
> +{
> +	GEM_BUG_ON(!p->height);
> +	return (struct intel_timeline_sync **)(p + 1);
> +}
> +
> +static inline unsigned int
> +__sync_idx(const struct intel_timeline_sync *p, u64 id)
> +{
> +	return (id >> p->height) & MASK;
> +}
> +
> +static void __sync_free(struct intel_timeline_sync *p)
> +{
> +	if (p->height) {
> +		unsigned int i;
> +
> +		while ((i = ffs(p->bitmap))) {
> +			p->bitmap &= ~0u << i;
> +			__sync_free(__sync_child(p)[i - 1]);
> +		}
> +	}
> +
> +	kfree(p);
> +}
> +
> +static void sync_free(struct intel_timeline_sync *sync)
> +{
> +	if (!sync)
> +		return;
> +
> +	while (sync->parent)
> +		sync = sync->parent;
> +
> +	__sync_free(sync);
> +}
> +
> +bool intel_timeline_sync_get(struct intel_timeline *tl, u64 id, u32 seqno)
> +{
> +	struct intel_timeline_sync *p;
> +	unsigned int idx;
> +
> +	p = tl->sync;
> +	if (!p)
> +		return false;
> +
> +	if (likely((id >> SHIFT) == p->prefix))
> +		goto found;
> +
> +	/* First climb the tree back to a parent branch */
> +	do {
> +		p = p->parent;
> +		if (!p)
> +			return false;
> +
> +		if ((id >> p->height >> SHIFT) == p->prefix)
> +			break;
> +	} while (1);
> +
> +	/* And then descend again until we find our leaf */
> +	do {
> +		if (!p->height)
> +			break;
> +
> +		p = __sync_child(p)[__sync_idx(p, id)];
> +		if (!p)
> +			return false;
> +
> +		if ((id >> p->height >> SHIFT) != p->prefix)
> +			return false;
> +	} while (1);
> +
> +	tl->sync = p;
> +found:
> +	idx = id & MASK;
> +	if (!(p->bitmap & BIT(idx)))
> +		return false;
> +
> +	return i915_seqno_passed(__sync_seqno(p)[idx], seqno);
> +}
> +
> +static noinline int
> +__intel_timeline_sync_set(struct intel_timeline *tl, u64 id, u32 seqno)
> +{
> +	struct intel_timeline_sync *p = tl->sync;
> +	unsigned int idx;
> +
> +	if (!p) {
> +		p = kzalloc(sizeof(*p) + NSYNC * sizeof(seqno), GFP_KERNEL);
> +		if (unlikely(!p))
> +			return -ENOMEM;
> +
> +		p->prefix = id >> SHIFT;
> +		goto found;
> +	}
> +
> +	/* Climb back up the tree until we find a common prefix */
> +	do {
> +		if (!p->parent)
> +			break;
> +
> +		p = p->parent;
> +
> +		if ((id >> p->height >> SHIFT) == p->prefix)
> +			break;
> +	} while (1);
> +
> +	/* No shortcut, we have to descend the tree to find the right layer
> +	 * containing this fence.
> +	 *
> +	 * Each layer in the tree holds 16 (NSYNC) pointers, either fences
> +	 * or lower layers. Leaf nodes (height = 0) contain the fences, all
> +	 * other nodes (height > 0) are internal layers that point to a lower
> +	 * node. Each internal layer has at least 2 descendents.
> +	 *
> +	 * Starting at the top, we check whether the current prefix matches. If
> +	 * it doesn't, we have gone past our layer and need to insert a join
> +	 * into the tree, and a new leaf node as a descendent as well as the
> +	 * original layer.
> +	 *
> +	 * The matching prefix means we are still following the right branch
> +	 * of the tree. If it has height 0, we have found our leaf and just
> +	 * need to replace the fence slot with ourselves. If the height is
> +	 * not zero, our slot contains the next layer in the tree (unless
> +	 * it is empty, in which case we can add ourselves as a new leaf).
> +	 * As we descend the tree, the prefix grows (and height decreases).
> +	 */
> +	do {
> +		struct intel_timeline_sync *next;
> +
> +		if ((id >> p->height >> SHIFT) != p->prefix) {
> +			/* insert a join above the current layer */
> +			next = kzalloc(sizeof(*next) + NSYNC * sizeof(next),
> +				       GFP_KERNEL);
> +			if (unlikely(!next))
> +				return -ENOMEM;
> +
> +			next->height = ALIGN(fls64((id >> p->height >> SHIFT) ^ p->prefix),
> +					    SHIFT) + p->height;
> +			next->prefix = id >> next->height >> SHIFT;
> +
> +			if (p->parent)
> +				__sync_child(p->parent)[__sync_idx(p->parent, id)] = next;
> +			next->parent = p->parent;
> +
> +			idx = p->prefix >> (next->height - p->height - SHIFT) & MASK;
> +			__sync_child(next)[idx] = p;
> +			next->bitmap |= BIT(idx);
> +			p->parent = next;
> +
> +			/* ascend to the join */
> +			p = next;
> +		} else {
> +			if (!p->height)
> +				break;
> +		}
> +
> +		/* descend into the next layer */
> +		GEM_BUG_ON(!p->height);
> +		idx = __sync_idx(p, id);
> +		next = __sync_child(p)[idx];
> +		if (unlikely(!next)) {
> +			next = kzalloc(sizeof(*next) + NSYNC * sizeof(seqno),
> +				       GFP_KERNEL);
> +			if (unlikely(!next))
> +				return -ENOMEM;
> +
> +			__sync_child(p)[idx] = next;
> +			p->bitmap |= BIT(idx);
> +			next->parent = p;
> +			next->prefix = id >> SHIFT;
> +
> +			p = next;
> +			break;
> +		}
> +
> +		p = next;
> +	} while (1);
> +
> +found:
> +	GEM_BUG_ON(p->height);
> +	GEM_BUG_ON(p->prefix != id >> SHIFT);
> +	tl->sync = p;
> +	idx = id & MASK;
> +	__sync_seqno(p)[idx] = seqno;
> +	p->bitmap |= BIT(idx);
> +	return 0;
> +}
> +
> +int intel_timeline_sync_set(struct intel_timeline *tl, u64 id, u32 seqno)
> +{
> +	struct intel_timeline_sync *p = tl->sync;
> +
> +	/* We expect to be called in sequence following a  _get(id), which
> +	 * should have preloaded the tl->sync hint for us.
> +	 */
> +	if (likely(p && (id >> SHIFT) == p->prefix)) {
> +		unsigned int idx = id & MASK;
> +
> +		__sync_seqno(p)[idx] = seqno;
> +		p->bitmap |= BIT(idx);
> +		return 0;
> +	}
> +
> +	return __intel_timeline_sync_set(tl, id, seqno);
> +}
> +
>  static int __i915_gem_timeline_init(struct drm_i915_private *i915,
>  				    struct i915_gem_timeline *timeline,
>  				    const char *name,
> @@ -35,6 +283,12 @@ static int __i915_gem_timeline_init(struct drm_i915_private *i915,
>
>  	lockdep_assert_held(&i915->drm.struct_mutex);
>
> +	/* Ideally we want a set of engines on a single leaf as we expect
> +	 * to mostly be tracking synchronisation between engines.
> +	 */
> +	BUILD_BUG_ON(NSYNC < I915_NUM_ENGINES);
> +	BUILD_BUG_ON(NSYNC > BITS_PER_BYTE * sizeof(timeline->engine[0].sync->bitmap));
> +
>  	timeline->i915 = i915;
>  	timeline->name = kstrdup(name ?: "[kernel]", GFP_KERNEL);
>  	if (!timeline->name)
> @@ -91,8 +345,14 @@ void i915_gem_timeline_fini(struct i915_gem_timeline *timeline)
>  		struct intel_timeline *tl = &timeline->engine[i];
>
>  		GEM_BUG_ON(!list_empty(&tl->requests));
> +
> +		sync_free(tl->sync);
>  	}
>
>  	list_del(&timeline->link);
>  	kfree(timeline->name);
>  }
> +
> +#if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
> +#include "selftests/i915_gem_timeline.c"
> +#endif
> diff --git a/drivers/gpu/drm/i915/i915_gem_timeline.h b/drivers/gpu/drm/i915/i915_gem_timeline.h
> index 6c53e14cab2a..c33dee0025ee 100644
> --- a/drivers/gpu/drm/i915/i915_gem_timeline.h
> +++ b/drivers/gpu/drm/i915/i915_gem_timeline.h
> @@ -26,10 +26,13 @@
>  #define I915_GEM_TIMELINE_H
>
>  #include <linux/list.h>
> +#include <linux/radix-tree.h>
>
> +#include "i915_utils.h"
>  #include "i915_gem_request.h"
>
>  struct i915_gem_timeline;
> +struct intel_timeline_sync;
>
>  struct intel_timeline {
>  	u64 fence_context;
> @@ -55,6 +58,14 @@ struct intel_timeline {
>  	 * struct_mutex.
>  	 */
>  	struct i915_gem_active last_request;
> +
> +	/* We track the most recent seqno that we wait on in every context so
> +	 * that we only have to emit a new await and dependency on a more
> +	 * recent sync point. As the contexts may be executed out-of-order, we
> +	 * have to track each individually and cannot rely on an absolute
> +	 * global_seqno.
> +	 */
> +	struct intel_timeline_sync *sync;
>  	u32 sync_seqno[I915_NUM_ENGINES];
>
>  	struct i915_gem_timeline *common;
> @@ -75,4 +86,7 @@ int i915_gem_timeline_init(struct drm_i915_private *i915,
>  int i915_gem_timeline_init__global(struct drm_i915_private *i915);
>  void i915_gem_timeline_fini(struct i915_gem_timeline *tl);
>
> +bool intel_timeline_sync_get(struct intel_timeline *tl, u64 id, u32 seqno);
> +int intel_timeline_sync_set(struct intel_timeline *tl, u64 id, u32 seqno);
> +
>  #endif
> diff --git a/drivers/gpu/drm/i915/selftests/i915_gem_timeline.c b/drivers/gpu/drm/i915/selftests/i915_gem_timeline.c
> new file mode 100644
> index 000000000000..c0bb8ecac93b
> --- /dev/null
> +++ b/drivers/gpu/drm/i915/selftests/i915_gem_timeline.c
> @@ -0,0 +1,123 @@
> +/*
> + * Copyright © 2017 Intel Corporation
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a
> + * copy of this software and associated documentation files (the "Software"),
> + * to deal in the Software without restriction, including without limitation
> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> + * and/or sell copies of the Software, and to permit persons to whom the
> + * Software is furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice (including the next
> + * paragraph) shall be included in all copies or substantial portions of the
> + * Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
> + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
> + * IN THE SOFTWARE.
> + *
> + */
> +
> +#include "../i915_selftest.h"
> +#include "mock_gem_device.h"
> +
> +static int igt_seqmap(void *arg)
> +{
> +	struct drm_i915_private *i915 = arg;
> +	const struct {
> +		const char *name;
> +		u32 seqno;
> +		bool expected;
> +		bool set;
> +	} pass[] = {
> +		{ "unset", 0, false, false },
> +		{ "new", 0, false, true },
> +		{ "0a", 0, true, true },
> +		{ "1a", 1, false, true },
> +		{ "1b", 1, true, true },
> +		{ "0b", 0, true, false },
> +		{ "2a", 2, false, true },
> +		{ "4", 4, false, true },
> +		{ "INT_MAX", INT_MAX, false, true },
> +		{ "INT_MAX-1", INT_MAX-1, true, false },
> +		{ "INT_MAX+1", (u32)INT_MAX+1, false, true },
> +		{ "INT_MAX", INT_MAX, true, false },
> +		{ "UINT_MAX", UINT_MAX, false, true },
> +		{ "wrap", 0, false, true },
> +		{ "unwrap", UINT_MAX, true, false },
> +		{},
> +	}, *p;
> +	struct intel_timeline *tl;
> +	int order, offset;
> +	int ret;
> +
> +	tl = &i915->gt.global_timeline.engine[RCS];
> +	for (p = pass; p->name; p++) {
> +		for (order = 1; order < 64; order++) {
> +			for (offset = -1; offset <= (order > 1); offset++) {
> +				u64 ctx = BIT_ULL(order) + offset;
> +
> +				if (intel_timeline_sync_get(tl,
> +							    ctx,
> +							    p->seqno) != p->expected) {
> +					pr_err("1: %s(ctx=%llu, seqno=%u) expected passed %s but failed\n",
> +					       p->name, ctx, p->seqno, yesno(p->expected));
> +					return -EINVAL;
> +				}
> +
> +				if (p->set) {
> +					ret = intel_timeline_sync_set(tl, ctx, p->seqno);
> +					if (ret)
> +						return ret;
> +				}
> +			}
> +		}
> +	}
> +
> +	tl = &i915->gt.global_timeline.engine[BCS];
> +	for (order = 1; order < 64; order++) {
> +		for (offset = -1; offset <= (order > 1); offset++) {
> +			u64 ctx = BIT_ULL(order) + offset;
> +
> +			for (p = pass; p->name; p++) {
> +				if (intel_timeline_sync_get(tl,
> +							    ctx,
> +							    p->seqno) != p->expected) {
> +					pr_err("2: %s(ctx=%llu, seqno=%u) expected passed %s but failed\n",
> +					       p->name, ctx, p->seqno, yesno(p->expected));
> +					return -EINVAL;
> +				}
> +
> +				if (p->set) {
> +					ret = intel_timeline_sync_set(tl, ctx, p->seqno);
> +					if (ret)
> +						return ret;
> +				}
> +			}
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +int i915_gem_timeline_mock_selftests(void)
> +{
> +	static const struct i915_subtest tests[] = {
> +		SUBTEST(igt_seqmap),
> +	};
> +	struct drm_i915_private *i915;
> +	int err;
> +
> +	i915 = mock_gem_device();
> +	if (!i915)
> +		return -ENOMEM;
> +
> +	err = i915_subtests(tests, i915);
> +	drm_dev_unref(&i915->drm);
> +
> +	return err;
> +}
> diff --git a/drivers/gpu/drm/i915/selftests/i915_mock_selftests.h b/drivers/gpu/drm/i915/selftests/i915_mock_selftests.h
> index be9a9ebf5692..8d0f50c25df8 100644
> --- a/drivers/gpu/drm/i915/selftests/i915_mock_selftests.h
> +++ b/drivers/gpu/drm/i915/selftests/i915_mock_selftests.h
> @@ -12,6 +12,7 @@ selftest(sanitycheck, i915_mock_sanitycheck) /* keep first (igt selfcheck) */
>  selftest(scatterlist, scatterlist_mock_selftests)
>  selftest(uncore, intel_uncore_mock_selftests)
>  selftest(breadcrumbs, intel_breadcrumbs_mock_selftests)
> +selftest(timelines, i915_gem_timeline_mock_selftests)
>  selftest(requests, i915_gem_request_mock_selftests)
>  selftest(objects, i915_gem_object_mock_selftests)
>  selftest(dmabuf, i915_gem_dmabuf_mock_selftests)
>

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 15/27] drm/i915: Split execlist priority queue into rbtree + linked list
  2017-04-24 12:44       ` Tvrtko Ursulin
@ 2017-04-24 13:06         ` Chris Wilson
  0 siblings, 0 replies; 95+ messages in thread
From: Chris Wilson @ 2017-04-24 13:06 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: intel-gfx

On Mon, Apr 24, 2017 at 01:44:53PM +0100, Tvrtko Ursulin wrote:
> 
> On 24/04/2017 12:07, Chris Wilson wrote:
> >On Mon, Apr 24, 2017 at 11:28:32AM +0100, Tvrtko Ursulin wrote:
> >>
> >>On 19/04/2017 10:41, Chris Wilson wrote:
> >>>All the requests at the same priority are executed in FIFO order. They
> >>>do not need to be stored in the rbtree themselves, as they are a simple
> >>>list within a level. If we move the requests at one priority into a list,
> >>>we can then reduce the rbtree to the set of priorities. This should keep
> >>>the height of the rbtree small, as the number of active priorities can not
> >>>exceed the number of active requests and should be typically only a few.
> >>>
> >>>Currently, we have ~2k possible different priority levels, that may
> >>>increase to allow even more fine grained selection. Allocating those in
> >>>advance seems a waste (and may be impossible), so we opt for allocating
> >>>upon first use, and freeing after its requests are depleted. To avoid
> >>>the possibility of an allocation failure causing us to lose a request,
> >>>we preallocate the default priority (0) and bump any request to that
> >>>priority if we fail to allocate it the appropriate plist. Having a
> >>>request (that is ready to run, so not leading to corruption) execute
> >>>out-of-order is better than leaking the request (and its dependency
> >>>tree) entirely.
> >>>
> >>>There should be a benefit to reducing execlists_dequeue() to principally
> >>>using a simple list (and reducing the frequency of both rbtree iteration
> >>>and balancing on erase) but for typical workloads, request coalescing
> >>>should be small enough that we don't notice any change. The main gain is
> >>>from improving PI calls to schedule, and the explicit list within a
> >>>level should make request unwinding simpler (we just need to insert at
> >>>the head of the list rather than the tail and not have to make the
> >>>rbtree search more complicated).
> >>
> >>Sounds attractive! What workloads show the benefit and how much?
> >
> >The default will show the best, since everything is priority 0 more or
> >less and so we reduce the rbtree search to a single lookup and list_add.
> >It's hard to measure the impact of the rbtree though. On the dequeue
> >side, the mmio access dominates. On the schedule side, if we have lots
> >of requests, the dfs dominates.
> >
> >I have an idea on how we might stress the rbtree in submit_request - but
> >still it requires long queues untypical of most workloads. Still tbd.
> >
> >>>-static bool insert_request(struct i915_priotree *pt, struct rb_root *root)
> >>>+static bool
> >>>+insert_request(struct intel_engine_cs *engine,
> >>>+	       struct i915_priotree *pt,
> >>>+	       int prio)
> >>>{
> >>>+	struct execlist_priolist *plist;
> >>>	struct rb_node **p, *rb;
> >>>	bool first = true;
> >>>
> >>>+find_plist:
> >>>	/* most positive priority is scheduled first, equal priorities fifo */
> >>>	rb = NULL;
> >>>-	p = &root->rb_node;
> >>>+	p = &engine->execlist_queue.rb_node;
> >>>	while (*p) {
> >>>-		struct i915_priotree *pos;
> >>>-
> >>>		rb = *p;
> >>>-		pos = rb_entry(rb, typeof(*pos), node);
> >>>-		if (pt->priority > pos->priority) {
> >>>+		plist = rb_entry(rb, typeof(*plist), node);
> >>>+		if (prio > plist->priority) {
> >>>			p = &rb->rb_left;
> >>>-		} else {
> >>>+		} else if (prio < plist->priority) {
> >>>			p = &rb->rb_right;
> >>>			first = false;
> >>>+		} else {
> >>>+			list_add_tail(&pt->link, &plist->requests);
> >>>+			return false;
> >>>		}
> >>>	}
> >>>-	rb_link_node(&pt->node, rb, p);
> >>>-	rb_insert_color(&pt->node, root);
> >>>+
> >>>+	if (!prio) {
> >>>+		plist = &engine->default_priolist;
> >>
> >>Should be "prio == I915_PRIO_DEFAULT" (give or take).
> >>
> >>But I am not completely happy with special casing the default
> >>priority for two reasons.
> >>
> >>Firstly, userspace can opt to lower its priority and completely
> >>defeat this path.
> >>
> >>Secondly, we already have flip priority which perhaps should have
> >>its own fast path / avoid allocation as well.
> >>
> >>Those two combined make me unsure whether the optimisation is worth
> >>it. What would be the pros and cons of three steps:
> >>
> >>1. No optimisation.
> >>2. prio == default optimisation like above.
> >>3. Better system with caching of frequently used levels.
> >>
> >>Last is definitely complicated, second is not, but is the second
> >>much better than the first?
> >
> >It was not intended as an optimisation. It is for handling the
> >ENOMEM here. We cannot abort the request at such a late stage, so we
> >need somewhere to hold it. That dictated having a preallocated slot. I
> >also didn't like having to preallocate all possible levels as that seems
> >a waste, especially as I like to invent new levels and suspect that we
> >may end up using a full u32 range.
> >
> >Using it for the default priority was then to take advantage of the
> >preallocation.
> >
> >>Perhaps a simplification of 3) where we would defer the freeing of
> >>unused priority levels until the busy to idle transition? That would
> >>also drop the existence and need for special handling of
> >>engine->default_prio.
> >>
> >>>+	} else {
> >>>+		plist = kmalloc(sizeof(*plist), GFP_ATOMIC);
> >>>+		/* Convert an allocation failure to a priority bump */
> >>
> >>Where is the priority bump? It looks like it can be the opposite for
> >>high prio requests below.
> >
> >Correct. Bump was the best verb I thought of.
> >
> >>I don't think it matters what happens with priorities hugely when
> >>small allocations start to go bad but would like to understand the
> >>comment.
> >>
> >>And perhaps this would be worthy of a dedicated slab cache?
> >
> >Even with a slab cache, we cannot prevent allocation failure. I don't
> >think priority levels will be frequent enough to really justify one.
> >Should be a good match for the common kmalloc-64 slab.
> 
> We could keep a pre-allocated entry with each engine which would
> transfer ownership with insert_request. It would have to be
> allocated at a point where we can fail like request_alloc, but
> downside would be starting to take engine timeline lock in request
> alloc path. Only to check and preallocate if needed, but still. And
> it would mean more traffic on the slab API in that path as well. Oh
> well, not very nice. Was just thinking if we can avoid GFP_ATOMIC
> and the default priority fallback. It seems like your solution is a
> better compromise.

No worries. I didn't particularly like the idea of reserving a slot with
each request either or having a GFP_ATOMIC nestled so deep in request
submission. It is certainly possible for us to do the allocation at
request_alloc and carry it through to schedule - but still
that only covers the first call to schedule. It seems like we would have
to resort to always passing in a slot, and freeing that slot if unused.
 
> A couple more question on the patch details then.
> 
> Could you implement the list handling in a more obvious way, instead
> of link.next == link.prev use a more obvious list_empty on the
> plist->requests, why __list_del_entry and not just list_del and you
> have a list_add_tail as well which could be just list_add since the
> list is empty at that point and _tail falsely suggests the _tail is
> important.

After a few more passes, it is saner now (with just one debatable list
handling tweak from ages ago)

static inline void __list_del_many(struct list_head *head,
                                   struct list_head *first)
{
        head->next = first;
        first->prev = head;
}
...
	rb = engine->execlist_first;
	GEM_BUG_ON(rb_first(&engine->execlist_queue) != rb);
	while (rb) {
		struct execlist_priolist *plist =
			rb_entry(rb, typeof(*plist), node);
		struct drm_i915_gem_request *rq, *rn;

		list_for_each_entry_safe(rq, rn,
					 &plist->requests, priotree.link) {
			if (!merge(rq)) { /* blah */
				__list_del_many(&plist->requests,
						&rq->priotree.link);
				goto done;
			}

			INIT_LIST_HEAD(&rq->priotree.link);
			rq->priotree.priority = INT_MAX;

			... /*__i915_gem_request_submit(rq)); */
		}

		rb = rb_next(rb);
		rb_erase(&plist->node, &engine->execlist_queue);
		INIT_LIST_HEAD(&plist->requests);
		if (plist->priority)
			kfree(plist);
	}

You may notice I have a dislike of the cache misses from lists. :|
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 07/27] drm/i915: Squash repeated awaits on the same fence
  2017-04-24 13:03   ` Tvrtko Ursulin
@ 2017-04-24 13:19     ` Chris Wilson
  2017-04-24 13:31       ` Chris Wilson
  0 siblings, 1 reply; 95+ messages in thread
From: Chris Wilson @ 2017-04-24 13:19 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: intel-gfx

On Mon, Apr 24, 2017 at 02:03:25PM +0100, Tvrtko Ursulin wrote:
> 
> On 19/04/2017 10:41, Chris Wilson wrote:
> >Track the latest fence waited upon on each context, and only add a new
> >asynchronous wait if the new fence is more recent than the recorded
> >fence for that context. This requires us to filter out unordered
> >timelines, which are noted by DMA_FENCE_NO_CONTEXT. However, in the
> >absence of a universal identifier, we have to use our own
> >i915->mm.unordered_timeline token.
> 
> (._.), a bit later... @_@!
> 
> What does this fix and is the complexity worth it?

It's a recovery of the optimisation that we used to have from the
initial multiple engine semaphore synchronisation - that of avoiding
repeating the same synchronisation barriers.

In the current setup, the cost of repeat fence synchronisation is
obfuscated; it just causes a tight loop between

 /<---------------------------------------------\
 |                                               ^
i915_sw_fence_complete -> i915_sw_fence_commit ->|

and extra depth in the dependency trees, which is generally not
observed in normal usage.

When you know what you are looking for, the reduction of all those
atomic ops from underneath hardirq is definitely worth it, even for
fairly simple operations, and there tends to be repetition from all the
buffers being tracked between requests (and clients).

Using a seqno map avoids the cost of tracking fences (i.e. keeping old
fences forever) and allows it to be kept on the timeline, rather than
the request itself (a ht under the request can squash simple repeats,
but using the timeline is more complete).
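
i.e. the await path in the patch boils down to

	if (fence->context != req->i915->mm.unordered_timeline &&
	    intel_timeline_sync_get(req->timeline,
				    fence->context, fence->seqno))
		return 0; /* an await on this or a later seqno already emitted */

	/* ... add the asynchronous wait as before ... */

	if (fence->context != req->i915->mm.unordered_timeline)
		intel_timeline_sync_set(req->timeline,
					fence->context, fence->seqno);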

2 small routines to implement a compressed radixtree -- it's
comparatively simple compared to having to accommodate RCU walkers!
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 07/27] drm/i915: Squash repeated awaits on the same fence
  2017-04-24 13:19     ` Chris Wilson
@ 2017-04-24 13:31       ` Chris Wilson
  0 siblings, 0 replies; 95+ messages in thread
From: Chris Wilson @ 2017-04-24 13:31 UTC (permalink / raw)
  To: Tvrtko Ursulin, intel-gfx

On Mon, Apr 24, 2017 at 02:19:54PM +0100, Chris Wilson wrote:
> On Mon, Apr 24, 2017 at 02:03:25PM +0100, Tvrtko Ursulin wrote:
> > 
> > On 19/04/2017 10:41, Chris Wilson wrote:
> > >Track the latest fence waited upon on each context, and only add a new
> > >asynchronous wait if the new fence is more recent than the recorded
> > >fence for that context. This requires us to filter out unordered
> > >timelines, which are noted by DMA_FENCE_NO_CONTEXT. However, in the
> > >absence of a universal identifier, we have to use our own
> > >i915->mm.unordered_timeline token.
> > 
> > (._.), a bit later... @_@!
> > 
> > What does this fix and is the complexity worth it?
> 
> It's a recovery of the optimisation that we used to have from the
> initial multiple engine semaphore synchronisation - that of avoiding
> repeating the same synchronisation barriers.
> 
> In the current setup, the cost of repeat fence synchronisation is
> obfuscated, it just causes a tight loop between
> 
>  /<---------------------------------------------\
>  |                                               ^
> i915_sw_fence_complete -> i915_sw_fence_commit ->|
> 
> and extra depth in the dependency trees, which is generally not
> observed in normal usage.
> 
> When you know what you are looking for, the reduction of all those
> atomic ops from underneath hardirq is definitely worth it, even for
> fairly simple operations, and there tends to be repetition from all the
> buffers being tracked between requests (and clients).

And it also says, to me at least, that the cost of the lookup must be
less than the cost of a couple of atomics.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 07/27] drm/i915: Squash repeated awaits on the same fence
  2017-04-19  9:41 ` [PATCH 07/27] drm/i915: Squash repeated awaits on the same fence Chris Wilson
  2017-04-24 13:03   ` Tvrtko Ursulin
@ 2017-04-26 10:20   ` Tvrtko Ursulin
  2017-04-26 10:38     ` Chris Wilson
  2017-04-27  7:06   ` [PATCH v8] " Chris Wilson
  2 siblings, 1 reply; 95+ messages in thread
From: Tvrtko Ursulin @ 2017-04-26 10:20 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 19/04/2017 10:41, Chris Wilson wrote:
> Track the latest fence waited upon on each context, and only add a new
> asynchronous wait if the new fence is more recent than the recorded
> fence for that context. This requires us to filter out unordered
> timelines, which are noted by DMA_FENCE_NO_CONTEXT. However, in the
> absence of a universal identifier, we have to use our own
> i915->mm.unordered_timeline token.
>
> v2: Throw around the debug crutches
> v3: Inline the likely case of the pre-allocation cache being full.
> v4: Drop the pre-allocation support, we can lose the most recent fence
> in case of allocation failure -- it just means we may emit more awaits
> than strictly necessary but will not break.
> v5: Trim allocation size for leaf nodes, they only need an array of u32
> not pointers.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> ---
>  drivers/gpu/drm/i915/i915_gem_request.c            |  67 +++---
>  drivers/gpu/drm/i915/i915_gem_timeline.c           | 260 +++++++++++++++++++++
>  drivers/gpu/drm/i915/i915_gem_timeline.h           |  14 ++
>  drivers/gpu/drm/i915/selftests/i915_gem_timeline.c | 123 ++++++++++
>  .../gpu/drm/i915/selftests/i915_mock_selftests.h   |   1 +
>  5 files changed, 438 insertions(+), 27 deletions(-)
>  create mode 100644 drivers/gpu/drm/i915/selftests/i915_gem_timeline.c
>
> diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
> index 97c07986b7c1..fb6c31ba3ef9 100644
> --- a/drivers/gpu/drm/i915/i915_gem_request.c
> +++ b/drivers/gpu/drm/i915/i915_gem_request.c
> @@ -730,9 +730,7 @@ int
>  i915_gem_request_await_dma_fence(struct drm_i915_gem_request *req,
>  				 struct dma_fence *fence)
>  {
> -	struct dma_fence_array *array;
>  	int ret;
> -	int i;
>
>  	if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags))
>  		return 0;
> @@ -744,39 +742,54 @@ i915_gem_request_await_dma_fence(struct drm_i915_gem_request *req,
>  	if (fence->context == req->fence.context)
>  		return 0;
>
> -	if (dma_fence_is_i915(fence))
> -		return i915_gem_request_await_request(req, to_request(fence));
> +	/* Squash repeated waits to the same timelines, picking the latest */
> +	if (fence->context != req->i915->mm.unordered_timeline &&
> +	    intel_timeline_sync_get(req->timeline,
> +				    fence->context, fence->seqno))

Function name is non-intuitive to me. It doesn't seem to get anything, 
but is more like query? Since it ends up with i915_seqno_passed, maybe 
intel_timeline_sync_is_newer/older ? (give or take)

And kerneldoc for intel_timeline_sync_get and set are needed as well.

> +		return 0;
>
> -	if (!dma_fence_is_array(fence)) {
> +	if (dma_fence_is_i915(fence)) {
> +		ret = i915_gem_request_await_request(req, to_request(fence));
> +		if (ret < 0)
> +			return ret;
> +	} else if (!dma_fence_is_array(fence)) {
>  		ret = i915_sw_fence_await_dma_fence(&req->submit,
>  						    fence, I915_FENCE_TIMEOUT,
>  						    GFP_KERNEL);
> -		return ret < 0 ? ret : 0;
> -	}
> -
> -	/* Note that if the fence-array was created in signal-on-any mode,
> -	 * we should *not* decompose it into its individual fences. However,
> -	 * we don't currently store which mode the fence-array is operating
> -	 * in. Fortunately, the only user of signal-on-any is private to
> -	 * amdgpu and we should not see any incoming fence-array from
> -	 * sync-file being in signal-on-any mode.
> -	 */
> -
> -	array = to_dma_fence_array(fence);
> -	for (i = 0; i < array->num_fences; i++) {
> -		struct dma_fence *child = array->fences[i];
> -
> -		if (dma_fence_is_i915(child))
> -			ret = i915_gem_request_await_request(req,
> -							     to_request(child));
> -		else
> -			ret = i915_sw_fence_await_dma_fence(&req->submit,
> -							    child, I915_FENCE_TIMEOUT,
> -							    GFP_KERNEL);
>  		if (ret < 0)
>  			return ret;
> +	} else {
> +		struct dma_fence_array *array = to_dma_fence_array(fence);
> +		int i;
> +
> +		/* Note that if the fence-array was created in signal-on-any
> +		 * mode, we should *not* decompose it into its individual
> +		 * fences. However, we don't currently store which mode the
> +		 * fence-array is operating in. Fortunately, the only user of
> +		 * signal-on-any is private to amdgpu and we should not see any
> +		 * incoming fence-array from sync-file being in signal-on-any
> +		 * mode.
> +		 */
> +
> +		for (i = 0; i < array->num_fences; i++) {
> +			struct dma_fence *child = array->fences[i];
> +
> +			if (dma_fence_is_i915(child))
> +				ret = i915_gem_request_await_request(req,
> +								     to_request(child));
> +			else
> +				ret = i915_sw_fence_await_dma_fence(&req->submit,
> +								    child, I915_FENCE_TIMEOUT,
> +								    GFP_KERNEL);
> +			if (ret < 0)
> +				return ret;
> +		}
>  	}
>
> +	if (fence->context != req->i915->mm.unordered_timeline)
> +		intel_timeline_sync_set(req->timeline,
> +					fence->context, fence->seqno);
> +
>  	return 0;
>  }
>
> diff --git a/drivers/gpu/drm/i915/i915_gem_timeline.c b/drivers/gpu/drm/i915/i915_gem_timeline.c
> index b596ca7ee058..f2b734dda895 100644
> --- a/drivers/gpu/drm/i915/i915_gem_timeline.c
> +++ b/drivers/gpu/drm/i915/i915_gem_timeline.c
> @@ -24,6 +24,254 @@
>
>  #include "i915_drv.h"
>
> +#define NSYNC 16
> +#define SHIFT ilog2(NSYNC)
> +#define MASK (NSYNC - 1)
> +
> +/* struct intel_timeline_sync is a layer of a radixtree that maps a u64 fence
> + * context id to the last u32 fence seqno waited upon from that context.
> + * Unlike lib/radixtree it uses a parent pointer that allows traversal back to
> + * the root. This allows us to access the whole tree via a single pointer
> + * to the most recently used layer. We expect fence contexts to be dense
> + * and most reuse to be on the same i915_gem_context but on neighbouring
> + * engines (i.e. on adjacent contexts) and reuse the same leaf, a very
> + * effective lookup cache. If the new lookup is not on the same leaf, we
> + * expect it to be on the neighbouring branch.
> + *
> + * A leaf holds an array of u32 seqno, and has height 0. The bitmap field
> + * allows us to store whether a particular seqno is valid (i.e. allows us
> + * to distinguish unset from 0).
> + *
> + * A branch holds an array of layer pointers, and has height > 0, and always
> + * has at least 2 layers (either branches or leaves) below it.
> + *
> + */

@_@ :)

Ok, so a map of u64 to u32. We can't use IDR or radixtree directly 
because of u64 keys. :( How about a hash table? It would be much simpler 
to review. :) Seriously, if it would perform close enough it would be a 
much much simpler implementation.

> +struct intel_timeline_sync {
> +	u64 prefix;
> +	unsigned int height;
> +	unsigned int bitmap;
> +	struct intel_timeline_sync *parent;
> +	/* union {
> +	 *	u32 seqno;
> +	 *	struct intel_timeline_sync *child;
> +	 * } slot[NSYNC];
> +	 */
> +};
> +
> +static inline u32 *__sync_seqno(struct intel_timeline_sync *p)
> +{
> +	GEM_BUG_ON(p->height);
> +	return (u32 *)(p + 1);
> +}
> +
> +static inline struct intel_timeline_sync **
> +__sync_child(struct intel_timeline_sync *p)
> +{
> +	GEM_BUG_ON(!p->height);
> +	return (struct intel_timeline_sync **)(p + 1);
> +}
> +
> +static inline unsigned int
> +__sync_idx(const struct intel_timeline_sync *p, u64 id)
> +{
> +	return (id >> p->height) & MASK;
> +}
> +
> +static void __sync_free(struct intel_timeline_sync *p)
> +{
> +	if (p->height) {
> +		unsigned int i;
> +
> +		while ((i = ffs(p->bitmap))) {
> +			p->bitmap &= ~0u << i;
> +			__sync_free(__sync_child(p)[i - 1]);
> +		}
> +	}
> +
> +	kfree(p);
> +}
> +
> +static void sync_free(struct intel_timeline_sync *sync)
> +{
> +	if (!sync)
> +		return;
> +
> +	while (sync->parent)
> +		sync = sync->parent;
> +
> +	__sync_free(sync);
> +}
> +
> +bool intel_timeline_sync_get(struct intel_timeline *tl, u64 id, u32 seqno)
> +{
> +	struct intel_timeline_sync *p;
> +	unsigned int idx;
> +
> +	p = tl->sync;
> +	if (!p)
> +		return false;
> +
> +	if (likely((id >> SHIFT) == p->prefix))
> +		goto found;
> +
> +	/* First climb the tree back to a parent branch */
> +	do {
> +		p = p->parent;
> +		if (!p)
> +			return false;
> +
> +		if ((id >> p->height >> SHIFT) == p->prefix)
> +			break;
> +	} while (1);
> +
> +	/* And then descend again until we find our leaf */
> +	do {
> +		if (!p->height)
> +			break;
> +
> +		p = __sync_child(p)[__sync_idx(p, id)];
> +		if (!p)
> +			return false;
> +
> +		if ((id >> p->height >> SHIFT) != p->prefix)
> +			return false;
> +	} while (1);
> +
> +	tl->sync = p;
> +found:
> +	idx = id & MASK;
> +	if (!(p->bitmap & BIT(idx)))
> +		return false;
> +
> +	return i915_seqno_passed(__sync_seqno(p)[idx], seqno);
> +}
> +
> +static noinline int
> +__intel_timeline_sync_set(struct intel_timeline *tl, u64 id, u32 seqno)
> +{
> +	struct intel_timeline_sync *p = tl->sync;
> +	unsigned int idx;
> +
> +	if (!p) {
> +		p = kzalloc(sizeof(*p) + NSYNC * sizeof(seqno), GFP_KERNEL);
> +		if (unlikely(!p))
> +			return -ENOMEM;
> +
> +		p->prefix = id >> SHIFT;
> +		goto found;
> +	}
> +
> +	/* Climb back up the tree until we find a common prefix */
> +	do {
> +		if (!p->parent)
> +			break;
> +
> +		p = p->parent;
> +
> +		if ((id >> p->height >> SHIFT) == p->prefix)
> +			break;
> +	} while (1);
> +
> +	/* No shortcut, we have to descend the tree to find the right layer
> +	 * containing this fence.
> +	 *
> +	 * Each layer in the tree holds 16 (NSYNC) pointers, either fences
> +	 * or lower layers. Leaf nodes (height = 0) contain the fences, all
> +	 * other nodes (height > 0) are internal layers that point to a lower
> +	 * node. Each internal layer has at least 2 descendants.
> +	 *
> +	 * Starting at the top, we check whether the current prefix matches. If
> +	 * it doesn't, we have gone past our layer and need to insert a join
> +	 * into the tree, and a new leaf node as a descendant as well as the
> +	 * original layer.
> +	 *
> +	 * The matching prefix means we are still following the right branch
> +	 * of the tree. If it has height 0, we have found our leaf and just
> +	 * need to replace the fence slot with ourselves. If the height is
> +	 * not zero, our slot contains the next layer in the tree (unless
> +	 * it is empty, in which case we can add ourselves as a new leaf).
> +	 * As we descend the tree, the prefix grows (and the height decreases).
> +	 */
> +	do {
> +		struct intel_timeline_sync *next;
> +
> +		if ((id >> p->height >> SHIFT) != p->prefix) {
> +			/* insert a join above the current layer */
> +			next = kzalloc(sizeof(*next) + NSYNC * sizeof(next),
> +				       GFP_KERNEL);
> +			if (unlikely(!next))
> +				return -ENOMEM;
> +
> +			next->height = ALIGN(fls64((id >> p->height >> SHIFT) ^ p->prefix),
> +					    SHIFT) + p->height;
> +			next->prefix = id >> next->height >> SHIFT;
> +
> +			if (p->parent)
> +				__sync_child(p->parent)[__sync_idx(p->parent, id)] = next;
> +			next->parent = p->parent;
> +
> +			idx = p->prefix >> (next->height - p->height - SHIFT) & MASK;
> +			__sync_child(next)[idx] = p;
> +			next->bitmap |= BIT(idx);
> +			p->parent = next;
> +
> +			/* ascend to the join */
> +			p = next;
> +		} else {
> +			if (!p->height)
> +				break;
> +		}
> +
> +		/* descend into the next layer */
> +		GEM_BUG_ON(!p->height);
> +		idx = __sync_idx(p, id);
> +		next = __sync_child(p)[idx];
> +		if (unlikely(!next)) {
> +			next = kzalloc(sizeof(*next) + NSYNC * sizeof(seqno),
> +				       GFP_KERNEL);
> +			if (unlikely(!next))
> +				return -ENOMEM;
> +
> +			__sync_child(p)[idx] = next;
> +			p->bitmap |= BIT(idx);
> +			next->parent = p;
> +			next->prefix = id >> SHIFT;
> +
> +			p = next;
> +			break;
> +		}
> +
> +		p = next;
> +	} while (1);
> +
> +found:
> +	GEM_BUG_ON(p->height);
> +	GEM_BUG_ON(p->prefix != id >> SHIFT);
> +	tl->sync = p;
> +	idx = id & MASK;
> +	__sync_seqno(p)[idx] = seqno;
> +	p->bitmap |= BIT(idx);
> +	return 0;
> +}
> +
> +int intel_timeline_sync_set(struct intel_timeline *tl, u64 id, u32 seqno)
> +{
> +	struct intel_timeline_sync *p = tl->sync;
> +
> +	/* We expect to be called in sequence following a  _get(id), which
> +	 * should have preloaded the tl->sync hint for us.
> +	 */
> +	if (likely(p && (id >> SHIFT) == p->prefix)) {
> +		unsigned int idx = id & MASK;
> +
> +		__sync_seqno(p)[idx] = seqno;
> +		p->bitmap |= BIT(idx);
> +		return 0;
> +	}
> +
> +	return __intel_timeline_sync_set(tl, id, seqno);
> +}
> +
>  static int __i915_gem_timeline_init(struct drm_i915_private *i915,
>  				    struct i915_gem_timeline *timeline,
>  				    const char *name,
> @@ -35,6 +283,12 @@ static int __i915_gem_timeline_init(struct drm_i915_private *i915,
>
>  	lockdep_assert_held(&i915->drm.struct_mutex);
>
> +	/* Ideally we want a set of engines on a single leaf as we expect
> +	 * to mostly be tracking synchronisation between engines.
> +	 */
> +	BUILD_BUG_ON(NSYNC < I915_NUM_ENGINES);
> +	BUILD_BUG_ON(NSYNC > BITS_PER_BYTE * sizeof(timeline->engine[0].sync->bitmap));
> +
>  	timeline->i915 = i915;
>  	timeline->name = kstrdup(name ?: "[kernel]", GFP_KERNEL);
>  	if (!timeline->name)
> @@ -91,8 +345,14 @@ void i915_gem_timeline_fini(struct i915_gem_timeline *timeline)
>  		struct intel_timeline *tl = &timeline->engine[i];
>
>  		GEM_BUG_ON(!list_empty(&tl->requests));
> +
> +		sync_free(tl->sync);
>  	}
>
>  	list_del(&timeline->link);
>  	kfree(timeline->name);
>  }
> +
> +#if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
> +#include "selftests/i915_gem_timeline.c"
> +#endif
> diff --git a/drivers/gpu/drm/i915/i915_gem_timeline.h b/drivers/gpu/drm/i915/i915_gem_timeline.h
> index 6c53e14cab2a..c33dee0025ee 100644
> --- a/drivers/gpu/drm/i915/i915_gem_timeline.h
> +++ b/drivers/gpu/drm/i915/i915_gem_timeline.h
> @@ -26,10 +26,13 @@
>  #define I915_GEM_TIMELINE_H
>
>  #include <linux/list.h>
> +#include <linux/radix-tree.h>
>
> +#include "i915_utils.h"
>  #include "i915_gem_request.h"
>
>  struct i915_gem_timeline;
> +struct intel_timeline_sync;
>
>  struct intel_timeline {
>  	u64 fence_context;
> @@ -55,6 +58,14 @@ struct intel_timeline {
>  	 * struct_mutex.
>  	 */
>  	struct i915_gem_active last_request;
> +
> +	/* We track the most recent seqno that we wait on in every context so
> +	 * that we only have to emit a new await and dependency on a more
> +	 * recent sync point. As the contexts may execute out-of-order, we
> +	 * have to track each individually and cannot rely on an absolute
> +	 * global_seqno.
> +	 */
> +	struct intel_timeline_sync *sync;
>  	u32 sync_seqno[I915_NUM_ENGINES];
>
>  	struct i915_gem_timeline *common;
> @@ -75,4 +86,7 @@ int i915_gem_timeline_init(struct drm_i915_private *i915,
>  int i915_gem_timeline_init__global(struct drm_i915_private *i915);
>  void i915_gem_timeline_fini(struct i915_gem_timeline *tl);
>
> +bool intel_timeline_sync_get(struct intel_timeline *tl, u64 id, u32 seqno);
> +int intel_timeline_sync_set(struct intel_timeline *tl, u64 id, u32 seqno);
> +
>  #endif
> diff --git a/drivers/gpu/drm/i915/selftests/i915_gem_timeline.c b/drivers/gpu/drm/i915/selftests/i915_gem_timeline.c
> new file mode 100644
> index 000000000000..c0bb8ecac93b
> --- /dev/null
> +++ b/drivers/gpu/drm/i915/selftests/i915_gem_timeline.c
> @@ -0,0 +1,123 @@
> +/*
> + * Copyright © 2017 Intel Corporation
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a
> + * copy of this software and associated documentation files (the "Software"),
> + * to deal in the Software without restriction, including without limitation
> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> + * and/or sell copies of the Software, and to permit persons to whom the
> + * Software is furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice (including the next
> + * paragraph) shall be included in all copies or substantial portions of the
> + * Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
> + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
> + * IN THE SOFTWARE.
> + *
> + */
> +
> +#include "../i915_selftest.h"
> +#include "mock_gem_device.h"
> +
> +static int igt_seqmap(void *arg)
> +{
> +	struct drm_i915_private *i915 = arg;
> +	const struct {
> +		const char *name;
> +		u32 seqno;
> +		bool expected;
> +		bool set;
> +	} pass[] = {
> +		{ "unset", 0, false, false },
> +		{ "new", 0, false, true },
> +		{ "0a", 0, true, true },
> +		{ "1a", 1, false, true },
> +		{ "1b", 1, true, true },
> +		{ "0b", 0, true, false },
> +		{ "2a", 2, false, true },
> +		{ "4", 4, false, true },
> +		{ "INT_MAX", INT_MAX, false, true },
> +		{ "INT_MAX-1", INT_MAX-1, true, false },
> +		{ "INT_MAX+1", (u32)INT_MAX+1, false, true },
> +		{ "INT_MAX", INT_MAX, true, false },
> +		{ "UINT_MAX", UINT_MAX, false, true },
> +		{ "wrap", 0, false, true },
> +		{ "unwrap", UINT_MAX, true, false },
> +		{},
> +	}, *p;
> +	struct intel_timeline *tl;
> +	int order, offset;
> +	int ret;
> +
> +	tl = &i915->gt.global_timeline.engine[RCS];

Unless I am missing something, it looks like you could get away with a 
lighter solution of implementing a mock_timeline instead of the whole 
mock_gem_device. I think it would be preferable.

> +	for (p = pass; p->name; p++) {
> +		for (order = 1; order < 64; order++) {
> +			for (offset = -1; offset <= (order > 1); offset++) {
> +				u64 ctx = BIT_ULL(order) + offset;
> +
> +				if (intel_timeline_sync_get(tl,
> +							    ctx,
> +							    p->seqno) != p->expected) {
> +					pr_err("1: %s(ctx=%llu, seqno=%u) expected passed %s but failed\n",
> +					       p->name, ctx, p->seqno, yesno(p->expected));
> +					return -EINVAL;
> +				}
> +
> +				if (p->set) {
> +					ret = intel_timeline_sync_set(tl, ctx, p->seqno);
> +					if (ret)
> +						return ret;
> +				}
> +			}
> +		}
> +	}
> +
> +	tl = &i915->gt.global_timeline.engine[BCS];
> +	for (order = 1; order < 64; order++) {
> +		for (offset = -1; offset <= (order > 1); offset++) {
> +			u64 ctx = BIT_ULL(order) + offset;
> +
> +			for (p = pass; p->name; p++) {
> +				if (intel_timeline_sync_get(tl,
> +							    ctx,
> +							    p->seqno) != p->expected) {
> +					pr_err("2: %s(ctx=%llu, seqno=%u) expected passed %s but failed\n",
> +					       p->name, ctx, p->seqno, yesno(p->expected));
> +					return -EINVAL;
> +				}
> +
> +				if (p->set) {
> +					ret = intel_timeline_sync_set(tl, ctx, p->seqno);
> +					if (ret)
> +						return ret;
> +				}
> +			}
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +int i915_gem_timeline_mock_selftests(void)
> +{
> +	static const struct i915_subtest tests[] = {
> +		SUBTEST(igt_seqmap),
> +	};
> +	struct drm_i915_private *i915;
> +	int err;
> +
> +	i915 = mock_gem_device();
> +	if (!i915)
> +		return -ENOMEM;
> +
> +	err = i915_subtests(tests, i915);
> +	drm_dev_unref(&i915->drm);
> +
> +	return err;
> +}
> diff --git a/drivers/gpu/drm/i915/selftests/i915_mock_selftests.h b/drivers/gpu/drm/i915/selftests/i915_mock_selftests.h
> index be9a9ebf5692..8d0f50c25df8 100644
> --- a/drivers/gpu/drm/i915/selftests/i915_mock_selftests.h
> +++ b/drivers/gpu/drm/i915/selftests/i915_mock_selftests.h
> @@ -12,6 +12,7 @@ selftest(sanitycheck, i915_mock_sanitycheck) /* keep first (igt selfcheck) */
>  selftest(scatterlist, scatterlist_mock_selftests)
>  selftest(uncore, intel_uncore_mock_selftests)
>  selftest(breadcrumbs, intel_breadcrumbs_mock_selftests)
> +selftest(timelines, i915_gem_timeline_mock_selftests)
>  selftest(requests, i915_gem_request_mock_selftests)
>  selftest(objects, i915_gem_object_mock_selftests)
>  selftest(dmabuf, i915_gem_dmabuf_mock_selftests)
>

Regards,

Tvrtko

* Re: [PATCH 07/27] drm/i915: Squash repeated awaits on the same fence
  2017-04-26 10:20   ` Tvrtko Ursulin
@ 2017-04-26 10:38     ` Chris Wilson
  2017-04-26 10:54       ` Tvrtko Ursulin
  0 siblings, 1 reply; 95+ messages in thread
From: Chris Wilson @ 2017-04-26 10:38 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: intel-gfx

On Wed, Apr 26, 2017 at 11:20:16AM +0100, Tvrtko Ursulin wrote:
> 
> On 19/04/2017 10:41, Chris Wilson wrote:
> >Track the latest fence waited upon on each context, and only add a new
> >asynchronous wait if the new fence is more recent than the recorded
> >fence for that context. This requires us to filter out unordered
> >timelines, which are noted by DMA_FENCE_NO_CONTEXT. However, in the
> >absence of a universal identifier, we have to use our own
> >i915->mm.unordered_timeline token.
> >
> >v2: Throw around the debug crutches
> >v3: Inline the likely case of the pre-allocation cache being full.
> >v4: Drop the pre-allocation support, we can lose the most recent fence
> >in case of allocation failure -- it just means we may emit more awaits
> >than strictly necessary but will not break.
> >v5: Trim allocation size for leaf nodes, they only need an array of u32
> >not pointers.
> >
> >Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> >Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> >---
> > drivers/gpu/drm/i915/i915_gem_request.c            |  67 +++---
> > drivers/gpu/drm/i915/i915_gem_timeline.c           | 260 +++++++++++++++++++++
> > drivers/gpu/drm/i915/i915_gem_timeline.h           |  14 ++
> > drivers/gpu/drm/i915/selftests/i915_gem_timeline.c | 123 ++++++++++
> > .../gpu/drm/i915/selftests/i915_mock_selftests.h   |   1 +
> > 5 files changed, 438 insertions(+), 27 deletions(-)
> > create mode 100644 drivers/gpu/drm/i915/selftests/i915_gem_timeline.c
> >
> >diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
> >index 97c07986b7c1..fb6c31ba3ef9 100644
> >--- a/drivers/gpu/drm/i915/i915_gem_request.c
> >+++ b/drivers/gpu/drm/i915/i915_gem_request.c
> >@@ -730,9 +730,7 @@ int
> > i915_gem_request_await_dma_fence(struct drm_i915_gem_request *req,
> > 				 struct dma_fence *fence)
> > {
> >-	struct dma_fence_array *array;
> > 	int ret;
> >-	int i;
> >
> > 	if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags))
> > 		return 0;
> >@@ -744,39 +742,54 @@ i915_gem_request_await_dma_fence(struct drm_i915_gem_request *req,
> > 	if (fence->context == req->fence.context)
> > 		return 0;
> >
> >-	if (dma_fence_is_i915(fence))
> >-		return i915_gem_request_await_request(req, to_request(fence));
> >+	/* Squash repeated waits to the same timelines, picking the latest */
> >+	if (fence->context != req->i915->mm.unordered_timeline &&
> >+	    intel_timeline_sync_get(req->timeline,
> >+				    fence->context, fence->seqno))
> 
> Function name is non-intuitive to me. It doesn't seem to get
> anything, but is more like query? Since it ends up with
> i915_seqno_passed, maybe intel_timeline_sync_is_newer/older ? (give
> or take)

_get was chosen as the partner for _set, which seemed to make sense.
Keep intel_timeline_sync_set() and replace _get with
intel_timeline_sync_passed() ?
intel_timeline_sync_is_later() ?

> >diff --git a/drivers/gpu/drm/i915/i915_gem_timeline.c b/drivers/gpu/drm/i915/i915_gem_timeline.c
> >index b596ca7ee058..f2b734dda895 100644
> >--- a/drivers/gpu/drm/i915/i915_gem_timeline.c
> >+++ b/drivers/gpu/drm/i915/i915_gem_timeline.c
> >@@ -24,6 +24,254 @@
> >
> > #include "i915_drv.h"
> >
> >+#define NSYNC 16
> >+#define SHIFT ilog2(NSYNC)
> >+#define MASK (NSYNC - 1)
> >+
> >+/* struct intel_timeline_sync is a layer of a radixtree that maps a u64 fence
> >+ * context id to the last u32 fence seqno waited upon from that context.
> >+ * Unlike lib/radixtree it uses a parent pointer that allows traversal back to
> >+ * the root. This allows us to access the whole tree via a single pointer
> >+ * to the most recently used layer. We expect fence contexts to be dense
> >+ * and most reuse to be on the same i915_gem_context but on neighbouring
> >+ * engines (i.e. on adjacent contexts) and reuse the same leaf, a very
> >+ * effective lookup cache. If the new lookup is not on the same leaf, we
> >+ * expect it to be on the neighbouring branch.
> >+ *
> >+ * A leaf holds an array of u32 seqno, and has height 0. The bitmap field
> >+ * allows us to store whether a particular seqno is valid (i.e. allows us
> >+ * to distinguish unset from 0).
> >+ *
> >+ * A branch holds an array of layer pointers, and has height > 0, and always
> >+ * has at least 2 layers (either branches or leaves) below it.
> >+ *
> >+ */
> 
> @_@ :)
> 
> Ok, so a map of u64 to u32. We can't use IDR or radixtree directly
> because of u64 keys. :( How about a hash table? It would be much
> simpler to review. :) Seriously, if it would perform close enough it
> would be a much much simpler implementation.

You want a resizable hashtable. rht is appallingly slow, so you want a
custom resizeable ht. They are not as simple as this codewise ;)
(Plus a compressed radixtree is part of my plan for scalability
improvements for struct reservation_object.)

This is designed around the idea that most lookups are to neighbouring
contexts (i.e. same i915_gem_context, different engines) and so are on
the same leaf and so cached. (A goal here is to be cheaper than the cost
of repetitions along the fence signaling. They are indirect costs that
show up in a couple of places, but are reasonably cheap. Offsetting the
cost is the benefit of moving it off the signal->exec path.)
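
To put numbers on the cached-leaf case, a worked example of the
addressing in the patch (NSYNC = 16, so SHIFT = 4 and MASK = 0xf):

  id = 0x41  ->  prefix = id >> 4 = 0x4, idx = id & 0xf = 0x1
  id = 0x47  ->  prefix 0x4, idx 0x7  (same leaf, tl->sync hit, no walk)
  id = 0x52  ->  prefix 0x5           (neighbouring leaf, climb towards
                                       the root and descend again)

So for the expected case of the same i915_gem_context on adjacent
engines, the lookup is a shift, a compare and an array index.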

Plus radixtree scrapped the idr lookup cache, which is a negative for
most of our code :( Fortunately for execbuf, we do have a bypass planned.

> >+static int igt_seqmap(void *arg)
> >+{
> >+	struct drm_i915_private *i915 = arg;
> >+	const struct {
> >+		const char *name;
> >+		u32 seqno;
> >+		bool expected;
> >+		bool set;
> >+	} pass[] = {
> >+		{ "unset", 0, false, false },
> >+		{ "new", 0, false, true },
> >+		{ "0a", 0, true, true },
> >+		{ "1a", 1, false, true },
> >+		{ "1b", 1, true, true },
> >+		{ "0b", 0, true, false },
> >+		{ "2a", 2, false, true },
> >+		{ "4", 4, false, true },
> >+		{ "INT_MAX", INT_MAX, false, true },
> >+		{ "INT_MAX-1", INT_MAX-1, true, false },
> >+		{ "INT_MAX+1", (u32)INT_MAX+1, false, true },
> >+		{ "INT_MAX", INT_MAX, true, false },
> >+		{ "UINT_MAX", UINT_MAX, false, true },
> >+		{ "wrap", 0, false, true },
> >+		{ "unwrap", UINT_MAX, true, false },
> >+		{},
> >+	}, *p;
> >+	struct intel_timeline *tl;
> >+	int order, offset;
> >+	int ret;
> >+
> >+	tl = &i915->gt.global_timeline.engine[RCS];
> 
> Unless I am missing something, it looks like you could get away with
> a lighter solution of implementing a mock_timeline instead of the
> whole mock_gem_device. I think it would be preferable.

Fine, I was just using a familiar pattern.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH 07/27] drm/i915: Squash repeated awaits on the same fence
  2017-04-26 10:38     ` Chris Wilson
@ 2017-04-26 10:54       ` Tvrtko Ursulin
  2017-04-26 11:18         ` Chris Wilson
  0 siblings, 1 reply; 95+ messages in thread
From: Tvrtko Ursulin @ 2017-04-26 10:54 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 26/04/2017 11:38, Chris Wilson wrote:
> On Wed, Apr 26, 2017 at 11:20:16AM +0100, Tvrtko Ursulin wrote:
>>
>> On 19/04/2017 10:41, Chris Wilson wrote:
>>> Track the latest fence waited upon on each context, and only add a new
>>> asynchronous wait if the new fence is more recent than the recorded
>>> fence for that context. This requires us to filter out unordered
>>> timelines, which are noted by DMA_FENCE_NO_CONTEXT. However, in the
>>> absence of a universal identifier, we have to use our own
>>> i915->mm.unordered_timeline token.
>>>
>>> v2: Throw around the debug crutches
>>> v3: Inline the likely case of the pre-allocation cache being full.
>>> v4: Drop the pre-allocation support, we can lose the most recent fence
>>> in case of allocation failure -- it just means we may emit more awaits
>>> than strictly necessary but will not break.
>>> v5: Trim allocation size for leaf nodes, they only need an array of u32
>>> not pointers.
>>>
>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>>> ---
>>> drivers/gpu/drm/i915/i915_gem_request.c            |  67 +++---
>>> drivers/gpu/drm/i915/i915_gem_timeline.c           | 260 +++++++++++++++++++++
>>> drivers/gpu/drm/i915/i915_gem_timeline.h           |  14 ++
>>> drivers/gpu/drm/i915/selftests/i915_gem_timeline.c | 123 ++++++++++
>>> .../gpu/drm/i915/selftests/i915_mock_selftests.h   |   1 +
>>> 5 files changed, 438 insertions(+), 27 deletions(-)
>>> create mode 100644 drivers/gpu/drm/i915/selftests/i915_gem_timeline.c
>>>
>>> diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
>>> index 97c07986b7c1..fb6c31ba3ef9 100644
>>> --- a/drivers/gpu/drm/i915/i915_gem_request.c
>>> +++ b/drivers/gpu/drm/i915/i915_gem_request.c
>>> @@ -730,9 +730,7 @@ int
>>> i915_gem_request_await_dma_fence(struct drm_i915_gem_request *req,
>>> 				 struct dma_fence *fence)
>>> {
>>> -	struct dma_fence_array *array;
>>> 	int ret;
>>> -	int i;
>>>
>>> 	if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags))
>>> 		return 0;
>>> @@ -744,39 +742,54 @@ i915_gem_request_await_dma_fence(struct drm_i915_gem_request *req,
>>> 	if (fence->context == req->fence.context)
>>> 		return 0;
>>>
>>> -	if (dma_fence_is_i915(fence))
>>> -		return i915_gem_request_await_request(req, to_request(fence));
>>> +	/* Squash repeated waits to the same timelines, picking the latest */
>>> +	if (fence->context != req->i915->mm.unordered_timeline &&
>>> +	    intel_timeline_sync_get(req->timeline,
>>> +				    fence->context, fence->seqno))
>>
>> Function name is non-intuitive to me. It doesn't seem to get
>> anything, but is more like query? Since it ends up with
>> i915_seqno_passed, maybe intel_timeline_sync_is_newer/older ? (give
>> or take)
>
> _get was chosen as the partner for _set, which seemed to make sense.
> Keep intel_timeline_sync_set() and replace _get with
> intel_timeline_sync_passed() ?
> intel_timeline_sync_is_later() ?

Both are better in my opinion. _get just makes it sound like it is 
returning something from the object, which it is not. So whichever you 
prefer.

>>> diff --git a/drivers/gpu/drm/i915/i915_gem_timeline.c b/drivers/gpu/drm/i915/i915_gem_timeline.c
>>> index b596ca7ee058..f2b734dda895 100644
>>> --- a/drivers/gpu/drm/i915/i915_gem_timeline.c
>>> +++ b/drivers/gpu/drm/i915/i915_gem_timeline.c
>>> @@ -24,6 +24,254 @@
>>>
>>> #include "i915_drv.h"
>>>
>>> +#define NSYNC 16
>>> +#define SHIFT ilog2(NSYNC)
>>> +#define MASK (NSYNC - 1)
>>> +
>>> +/* struct intel_timeline_sync is a layer of a radixtree that maps a u64 fence
>>> + * context id to the last u32 fence seqno waited upon from that context.
>>> + * Unlike lib/radixtree it uses a parent pointer that allows traversal back to
>>> + * the root. This allows us to access the whole tree via a single pointer
>>> + * to the most recently used layer. We expect fence contexts to be dense
>>> + * and most reuse to be on the same i915_gem_context but on neighbouring
>>> + * engines (i.e. on adjacent contexts) and reuse the same leaf, a very
>>> + * effective lookup cache. If the new lookup is not on the same leaf, we
>>> + * expect it to be on the neighbouring branch.
>>> + *
>>> + * A leaf holds an array of u32 seqno, and has height 0. The bitmap field
>>> + * allows us to store whether a particular seqno is valid (i.e. allows us
>>> + * to distinguish unset from 0).
>>> + *
>>> + * A branch holds an array of layer pointers, and has height > 0, and always
>>> + * has at least 2 layers (either branches or leaves) below it.
>>> + *
>>> + */
>>
>> @_@ :)
>>
>> Ok, so a map of u64 to u32. We can't use IDR or radixtree directly
>> because of u64 keys. :( How about a hash table? It would be much
>> simpler to review. :) Seriously, if it would perform close enough it
>> would be a much much simpler implementation.
>
> You want a resizable hashtable. rht is appallingly slow, so you want a
> custom resizeable ht. They are not as simple as this codewise ;)
> (Plus a compressed radixtree is part of my plan for scalability
> improvements for struct reservation_object.)

Why resizable? I was thinking a normal one. If at any given time we have 
an active set of contexts, or at least lookups are as you say below, to 
neighbouring contexts, that would mean we are talking about lookups to 
different hash buckets.  And for the typical working set we would expect 
many collisions so longer lists in each bucket? So maybe NUM_ENGINES * 
some typical load constant number buckets would not be that bad?

> This is designed around the idea that most lookups are to neighbouring
> contexts (i.e. same i915_gem_context, different engines) and so are on
> the same leaf and so cached. (A goal here is to be cheaper than the cost
> of repetitions along the fence signaling. They are indirect costs that
> show up in a couple of places, but are reasonably cheap. Offsetting the
> cost is the benefit of moving it off the signal->exec path.)
>
> Plus radixtree scrapped the idr lookup cache, which is a negative for
> most of our code :( Fortunately for execbuf, we do have a bypass planned.

I trust the data structure is great, but would like to understand if 
something simpler could perhaps get us 99% of the performance (or some 
number).

>>> +static int igt_seqmap(void *arg)
>>> +{
>>> +	struct drm_i915_private *i915 = arg;
>>> +	const struct {
>>> +		const char *name;
>>> +		u32 seqno;
>>> +		bool expected;
>>> +		bool set;
>>> +	} pass[] = {
>>> +		{ "unset", 0, false, false },
>>> +		{ "new", 0, false, true },
>>> +		{ "0a", 0, true, true },
>>> +		{ "1a", 1, false, true },
>>> +		{ "1b", 1, true, true },
>>> +		{ "0b", 0, true, false },
>>> +		{ "2a", 2, false, true },
>>> +		{ "4", 4, false, true },
>>> +		{ "INT_MAX", INT_MAX, false, true },
>>> +		{ "INT_MAX-1", INT_MAX-1, true, false },
>>> +		{ "INT_MAX+1", (u32)INT_MAX+1, false, true },
>>> +		{ "INT_MAX", INT_MAX, true, false },
>>> +		{ "UINT_MAX", UINT_MAX, false, true },
>>> +		{ "wrap", 0, false, true },
>>> +		{ "unwrap", UINT_MAX, true, false },
>>> +		{},
>>> +	}, *p;
>>> +	struct intel_timeline *tl;
>>> +	int order, offset;
>>> +	int ret;
>>> +
>>> +	tl = &i915->gt.global_timeline.engine[RCS];
>>
>> Unless I am missing something, it looks like you could get away with
>> a lighter solution of implementing a mock_timeline instead of the
>> whole mock_gem_device. I think it would be preferable.
>
> Fine, I was just using a familiar pattern.

If it is more than a few lines (I thought it wouldn't be) to add a mock 
timeline then don't bother. I just thought unit tests should preferably 
stay as lean as possible.

Regards,

Tvrtko

* Re: [PATCH 07/27] drm/i915: Squash repeated awaits on the same fence
  2017-04-26 10:54       ` Tvrtko Ursulin
@ 2017-04-26 11:18         ` Chris Wilson
  2017-04-26 12:13           ` Tvrtko Ursulin
  0 siblings, 1 reply; 95+ messages in thread
From: Chris Wilson @ 2017-04-26 11:18 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: intel-gfx

On Wed, Apr 26, 2017 at 11:54:08AM +0100, Tvrtko Ursulin wrote:
> 
> On 26/04/2017 11:38, Chris Wilson wrote:
> >On Wed, Apr 26, 2017 at 11:20:16AM +0100, Tvrtko Ursulin wrote:
> >>
> >>On 19/04/2017 10:41, Chris Wilson wrote:
> >>>Track the latest fence waited upon on each context, and only add a new
> >>>asynchronous wait if the new fence is more recent than the recorded
> >>>fence for that context. This requires us to filter out unordered
> >>>timelines, which are noted by DMA_FENCE_NO_CONTEXT. However, in the
> >>>absence of a universal identifier, we have to use our own
> >>>i915->mm.unordered_timeline token.
> >>>
> >>>v2: Throw around the debug crutches
> >>>v3: Inline the likely case of the pre-allocation cache being full.
> >>>v4: Drop the pre-allocation support, we can lose the most recent fence
> >>>in case of allocation failure -- it just means we may emit more awaits
> >>>than strictly necessary but will not break.
> >>>v5: Trim allocation size for leaf nodes, they only need an array of u32
> >>>not pointers.
> >>>
> >>>Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> >>>Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >>>Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> >>>---
> >>>drivers/gpu/drm/i915/i915_gem_request.c            |  67 +++---
> >>>drivers/gpu/drm/i915/i915_gem_timeline.c           | 260 +++++++++++++++++++++
> >>>drivers/gpu/drm/i915/i915_gem_timeline.h           |  14 ++
> >>>drivers/gpu/drm/i915/selftests/i915_gem_timeline.c | 123 ++++++++++
> >>>.../gpu/drm/i915/selftests/i915_mock_selftests.h   |   1 +
> >>>5 files changed, 438 insertions(+), 27 deletions(-)
> >>>create mode 100644 drivers/gpu/drm/i915/selftests/i915_gem_timeline.c
> >>>
> >>>diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
> >>>index 97c07986b7c1..fb6c31ba3ef9 100644
> >>>--- a/drivers/gpu/drm/i915/i915_gem_request.c
> >>>+++ b/drivers/gpu/drm/i915/i915_gem_request.c
> >>>@@ -730,9 +730,7 @@ int
> >>>i915_gem_request_await_dma_fence(struct drm_i915_gem_request *req,
> >>>				 struct dma_fence *fence)
> >>>{
> >>>-	struct dma_fence_array *array;
> >>>	int ret;
> >>>-	int i;
> >>>
> >>>	if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags))
> >>>		return 0;
> >>>@@ -744,39 +742,54 @@ i915_gem_request_await_dma_fence(struct drm_i915_gem_request *req,
> >>>	if (fence->context == req->fence.context)
> >>>		return 0;
> >>>
> >>>-	if (dma_fence_is_i915(fence))
> >>>-		return i915_gem_request_await_request(req, to_request(fence));
> >>>+	/* Squash repeated waits to the same timelines, picking the latest */
> >>>+	if (fence->context != req->i915->mm.unordered_timeline &&
> >>>+	    intel_timeline_sync_get(req->timeline,
> >>>+				    fence->context, fence->seqno))
> >>
> >>Function name is non-intuitive to me. It doesn't seem to get
> >>anything, but is more like query? Since it ends up with
> >>i915_seqno_passed, maybe intel_timeline_sync_is_newer/older ? (give
> >>or take)
> >
> >_get was chosen as the partner for _set, which seemed to make sense.
> >Keep intel_timeline_sync_set() and replace _get with
> >intel_timeline_sync_passed() ?
> >intel_timeline_sync_is_later() ?
> 
> Both are better in my opinion. _get just makes it sound like it is
> returning something from the object, which it is not. So whichever
> you prefer.
> 
> >>>diff --git a/drivers/gpu/drm/i915/i915_gem_timeline.c b/drivers/gpu/drm/i915/i915_gem_timeline.c
> >>>index b596ca7ee058..f2b734dda895 100644
> >>>--- a/drivers/gpu/drm/i915/i915_gem_timeline.c
> >>>+++ b/drivers/gpu/drm/i915/i915_gem_timeline.c
> >>>@@ -24,6 +24,254 @@
> >>>
> >>>#include "i915_drv.h"
> >>>
> >>>+#define NSYNC 16
> >>>+#define SHIFT ilog2(NSYNC)
> >>>+#define MASK (NSYNC - 1)
> >>>+
> >>>+/* struct intel_timeline_sync is a layer of a radixtree that maps a u64 fence
> >>>+ * context id to the last u32 fence seqno waited upon from that context.
> >>>+ * Unlike lib/radixtree it uses a parent pointer that allows traversal back to
> >>>+ * the root. This allows us to access the whole tree via a single pointer
> >>>+ * to the most recently used layer. We expect fence contexts to be dense
> >>>+ * and most reuse to be on the same i915_gem_context but on neighbouring
> >>>+ * engines (i.e. on adjacent contexts) and reuse the same leaf, a very
> >>>+ * effective lookup cache. If the new lookup is not on the same leaf, we
> >>>+ * expect it to be on the neighbouring branch.
> >>>+ *
> >>>+ * A leaf holds an array of u32 seqno, and has height 0. The bitmap field
> >>>+ * allows us to store whether a particular seqno is valid (i.e. allows us
> >>>+ * to distinguish unset from 0).
> >>>+ *
> >>>+ * A branch holds an array of layer pointers, and has height > 0, and always
> >>>+ * has at least 2 layers (either branches or leaves) below it.
> >>>+ *
> >>>+ */
> >>
> >>@_@ :)
> >>
> >>Ok, so a map of u64 to u32. We can't use IDR or radixtree directly
> >>because of u64 keys. :( How about a hash table? It would be much
> >>simpler to review. :) Seriously, if it would perform close enough it
> >>would be a much much simpler implementation.
> >
> >You want a resizable hashtable. rht is appallingly slow, so you want a
> >custom resizeable ht. They are not as simple as this codewise ;)
> >(Plus a compressed radixtree is part of my plan for scalability
> >improvements for struct reservation_object.)
> 
> Why resizable? I was thinking a normal one. If at any given time we
> have an active set of contexts, or at least lookups are as you say
> below, to neighbouring contexts, that would mean we are talking
> about lookups to different hash buckets.  And for the typical
> working set we would expect many collisions so longer lists in each
> bucket? So maybe NUM_ENGINES * some typical load constant number
> buckets would not be that bad?

Consider a long running display server that will accumulate tens of
thousands of clients in its lifetime, each with their own contexts that
get shared by passing around fences/framebuffers. (Or on a shorter scale
any of the context stress tests in igt.) Due to the non-recycling of the
context ids, we can grow to very large tables - but we have no knowledge
of what contexts are no longer required.

To compensate we need to occasionally prune the sync points. For a ht we
could just scrap it. For an idr, we could store last use and delete
stale leaves.

But first we have a question of how many buckets do we give the static
ht? Most processes will be sharing between 2 contexts (render context,
presentation context) except for a display server which may have 10-100
clients - and possibly where eliminating repeated syncs is going to be
most valuable. That suggests 256 buckets for every timeline (assuming
just a pair of engines across shared contexts). Overkill for the
majority, and going to be miserable in stress tests.

Finally what do you store in the ht? Fences are the obvious candidate,
but need reaping. Or you just create a ht for the request and only
squash repeated fences inside a single request - that doesn't benefit
from timeline tracking and squashing between requests (but does avoid
keeping fences around forever). Hence why I went with tracking seqno. To
avoid allocations for the ht nodes, we could create an open-addressed ht
with the {context, seqno} embedded in it. It would be efficient, but
needs online resizing and is a fair chunk of new code.
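
Purely for illustration, the open-addressed layout could look roughly
like the below -- the online resizing, which is the hard part, is
omitted, and reserving context 0 as the empty marker is an assumption:

struct sync_slot {
	u64 context;	/* 0 == empty slot (assumed never a valid context) */
	u32 seqno;
};

struct sync_map {
	unsigned int size;	/* power of two, kept well below full */
	struct sync_slot slot[];
};

static struct sync_slot *sync_map_probe(struct sync_map *map, u64 context)
{
	unsigned int i = hash_64(context, ilog2(map->size));

	/* linear probe: stop at a match or the first empty slot */
	while (map->slot[i].context && map->slot[i].context != context)
		i = (i + 1) & (map->size - 1);

	return &map->slot[i];
}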
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH 07/27] drm/i915: Squash repeated awaits on the same fence
  2017-04-26 11:18         ` Chris Wilson
@ 2017-04-26 12:13           ` Tvrtko Ursulin
  2017-04-26 12:23             ` Chris Wilson
  2017-04-26 18:56             ` Chris Wilson
  0 siblings, 2 replies; 95+ messages in thread
From: Tvrtko Ursulin @ 2017-04-26 12:13 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 26/04/2017 12:18, Chris Wilson wrote:
> On Wed, Apr 26, 2017 at 11:54:08AM +0100, Tvrtko Ursulin wrote:
>>>>> +/* struct intel_timeline_sync is a layer of a radixtree that maps a u64 fence
>>>>> + * context id to the last u32 fence seqno waited upon from that context.
>>>>> + * Unlike lib/radixtree it uses a parent pointer that allows traversal back to
>>>>> + * the root. This allows us to access the whole tree via a single pointer
>>>>> + * to the most recently used layer. We expect fence contexts to be dense
>>>>> + * and most reuse to be on the same i915_gem_context but on neighbouring
>>>>> + * engines (i.e. on adjacent contexts) and reuse the same leaf, a very
>>>>> + * effective lookup cache. If the new lookup is not on the same leaf, we
>>>>> + * expect it to be on the neighbouring branch.
>>>>> + *
>>>>> + * A leaf holds an array of u32 seqno, and has height 0. The bitmap field
>>>>> + * allows us to store whether a particular seqno is valid (i.e. allows us
>>>>> + * to distinguish unset from 0).
>>>>> + *
>>>>> + * A branch holds an array of layer pointers, and has height > 0, and always
>>>>> + * has at least 2 layers (either branches or leaves) below it.
>>>>> + *
>>>>> + */
>>>>
>>>> @_@ :)
>>>>
>>>> Ok, so a map of u64 to u32. We can't use IDR or radixtree directly
>>>> because of u64 keys. :( How about a hash table? It would be much
>>>> simpler to review. :) Seriously, if it would perform close enough it
>>>> would be a much much simpler implementation.
>>>
>>> You want a resizable hashtable. rht is appallingly slow, so you want a
>>> custom resizeable ht. They are not as simple as this codewise ;)
>>> (Plus a compressed radixtree is part of my plan for scalability
>>> improvements for struct reservation_object.)
>>
>> Why resizable? I was thinking a normal one. If at any given time we
>> have an active set of contexts, or at least lookups are as you say
>> below, to neighbouring contexts, that would mean we are talking
>> about lookups to different hash buckets.  And for the typical
>> working set we would expect many collisions so longer lists in each
>> bucket? So maybe NUM_ENGINES * some typical load constant number
>> buckets would not be that bad?
>
> Consider a long running display server that will accumulate tens of
> thousands of clients in its lifetime, each with their own contexts that
> get shared by passing around fences/framebuffers. (Or on a shorter scale
> any of the context stress tests in igt.) Due to the non-recycling of the
> context ids, we can grow to very large tables - but we have no knowledge
> of what contexts are no longer required.
>
> To compensate we need to occasionally prune the sync points. For a ht we
> could just scrap it. For an idr, we could store last use and delete
> stale leaves.

Hm, pruning yes.. but you don't have pruning in this patch either. So 
that's something which needs to be addressed either way.

> But first we have a question of how many buckets do we give the static
> ht? Most processes will be sharing between 2 contexts (render context,
> presentation context) except for a display server who may have 10-100 of
> clients - and possibly where eliminating repeated syncs is going to be
> most valuable. That suggests 256 buckets for every timeline (assuming
> just a pair of engines across shared contexts). Overkill for the
> majority, and going to be miserable in stress tests.
>
> Finally what do you store in the ht? Fences are the obvious candidate,
> but need reaping. Or you just create a ht for the request and only
> squash repeated fences inside a single request - that doesn't benefit
> from timeline tracking and squashing between requests (but does avoid
> keeping fences around forever). Hence why I went with tracking seqno. To
> avoid allocations for the ht nodes, we could create an open-addressed ht
> with the {context, seqno} embedded in it. It would be efficient, but
> needs online resizing and is a fair chunk of new code.

I was thinking of exactly the same thing as this patch does, u64 context 
id as key, u32 seqnos (wrapped in a container with hlist_node).

Ok, so the key is pruning to keep the display server scenario in check.

Free the key from i915_fence_release?

Regards,

Tvrtko

* Re: [PATCH 07/27] drm/i915: Squash repeated awaits on the same fence
  2017-04-26 12:13           ` Tvrtko Ursulin
@ 2017-04-26 12:23             ` Chris Wilson
  2017-04-26 14:36               ` Tvrtko Ursulin
  2017-04-26 18:56             ` Chris Wilson
  1 sibling, 1 reply; 95+ messages in thread
From: Chris Wilson @ 2017-04-26 12:23 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: intel-gfx

On Wed, Apr 26, 2017 at 01:13:41PM +0100, Tvrtko Ursulin wrote:
> 
> On 26/04/2017 12:18, Chris Wilson wrote:
> >On Wed, Apr 26, 2017 at 11:54:08AM +0100, Tvrtko Ursulin wrote:
> >>>>>+/* struct intel_timeline_sync is a layer of a radixtree that maps a u64 fence
> >>>>>+ * context id to the last u32 fence seqno waited upon from that context.
> >>>>>+ * Unlike lib/radixtree it uses a parent pointer that allows traversal back to
> >>>>>+ * the root. This allows us to access the whole tree via a single pointer
> >>>>>+ * to the most recently used layer. We expect fence contexts to be dense
> >>>>>+ * and most reuse to be on the same i915_gem_context but on neighbouring
> >>>>>+ * engines (i.e. on adjacent contexts) and reuse the same leaf, a very
> >>>>>+ * effective lookup cache. If the new lookup is not on the same leaf, we
> >>>>>+ * expect it to be on the neighbouring branch.
> >>>>>+ *
> >>>>>+ * A leaf holds an array of u32 seqno, and has height 0. The bitmap field
> >>>>>+ * allows us to store whether a particular seqno is valid (i.e. allows us
> >>>>>+ * to distinguish unset from 0).
> >>>>>+ *
> >>>>>+ * A branch holds an array of layer pointers, and has height > 0, and always
> >>>>>+ * has at least 2 layers (either branches or leaves) below it.
> >>>>>+ *
> >>>>>+ */
> >>>>
> >>>>@_@ :)
> >>>>
> >>>>Ok, so a map of u64 to u32. We can't use IDR or radixtree directly
> >>>>because of u64 keys. :( How about a hash table? It would be much
> >>>>simpler to review. :) Seriously, if it would perform close enough it
> >>>>would be a much much simpler implementation.
> >>>
> >>>You want a resizable hashtable. rht is appallingly slow, so you want a
> >>>custom resizeable ht. They are not as simple as this codewise ;)
> >>>(Plus a compressed radixtree is part of my plan for scalability
> >>>improvements for struct reservation_object.)
> >>
> >>Why resizable? I was thinking a normal one. If at any given time we
> >>have an active set of contexts, or at least lookups are as you say
> >>below, to neighbouring contexts, that would mean we are talking
> >>about lookups to different hash buckets.  And for the typical
> >>working set we would expect many collisions so longer lists in each
> >>bucket? So maybe NUM_ENGINES * some typical load constant number
> >>buckets would not be that bad?
> >
> >Consider a long running display server that will accumulate tens of
> >thousands of clients in its lifetime, each with their own contexts that
> >get shared by passing around fences/framebuffers. (Or on a shorter scale
> >any of the context stress tests in igt.) Due to the non-recycling of the
> >context ids, we can grow to very large tables - but we have no knowledge
> >of what contexts are no longer required.
> >
> >To compensate we need to occasionally prune the sync points. For a ht we
> >could just scrap it. For an idr, we could store last use and delete
> >stale leaves.
> 
> Hm, pruning yes.. but you don't have pruning in this patch either.
> So that's something which needs to be addressed either way.

I know, review is great for hindsight. The iterator/remove for the
compressed idr is going to be on the uglier side of the insertion. Ugly
enough to be a separate patch.
 
> >But first we have a question of how many buckets do we give the static
> >ht? Most processes will be sharing between 2 contexts (render context,
> >presentation context) except for a display server who may have 10-100 of
> >clients - and possibly where eliminating repeated syncs is going to be
> >most valuable. That suggests 256 buckets for every timeline (assuming
> >just a pair of engines across shared contexts). Overkill for the
> >majority, and going to be miserable in stress tests.
> >
> >Finally what do you store in the ht? Fences are the obvious candidate,
> >but need reaping. Or you just create a ht for the request and only
> >squash repeated fences inside a single request - that doesn't benefit
> >from timeline tracking and squashing between requests (but does avoid
> >keeping fences around forever). Hence why I went with tracking seqno. To
> >avoid allocations for the ht nodes, we could create an open-addressed ht
> >with the {context, seqno} embedded in it. It would be efficient, but
> >needs online resizing and is a fair chunk of new code.
> 
> I was thinking of exactly the same thing as this patch does, u64
> context id as key, u32 seqnos (wrapped in a container with
> hlist_node).

Hmm, a hashed radixtree. I did read something about a hybrid approach.

> Ok, so the key is pruning to keep the display server scenario in check.
> 
> Free the key from i915_fence_release?

Too early, it's the timeline (and syncs along it) that's interesting.
For our contexts, we can hook into context-close, but we still have some
foreign dma-fence-contexts to worry about. I was thinking of walking all
timelines from the idle_worker. And possibly forcibly prune across
suspend.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH 07/27] drm/i915: Squash repeated awaits on the same fence
  2017-04-26 12:23             ` Chris Wilson
@ 2017-04-26 14:36               ` Tvrtko Ursulin
  2017-04-26 14:55                 ` Chris Wilson
  2017-04-26 15:04                 ` Chris Wilson
  0 siblings, 2 replies; 95+ messages in thread
From: Tvrtko Ursulin @ 2017-04-26 14:36 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 26/04/2017 13:23, Chris Wilson wrote:
> On Wed, Apr 26, 2017 at 01:13:41PM +0100, Tvrtko Ursulin wrote:
>>
>> On 26/04/2017 12:18, Chris Wilson wrote:
>>> On Wed, Apr 26, 2017 at 11:54:08AM +0100, Tvrtko Ursulin wrote:
>>>>>>> +/* struct intel_timeline_sync is a layer of a radixtree that maps a u64 fence
>>>>>>> + * context id to the last u32 fence seqno waited upon from that context.
>>>>>>> + * Unlike lib/radixtree it uses a parent pointer that allows traversal back to
>>>>>>> + * the root. This allows us to access the whole tree via a single pointer
>>>>>>> + * to the most recently used layer. We expect fence contexts to be dense
>>>>>>> + * and most reuse to be on the same i915_gem_context but on neighbouring
>>>>>>> + * engines (i.e. on adjacent contexts) and reuse the same leaf, a very
>>>>>>> + * effective lookup cache. If the new lookup is not on the same leaf, we
>>>>>>> + * expect it to be on the neighbouring branch.
>>>>>>> + *
>>>>>>> + * A leaf holds an array of u32 seqno, and has height 0. The bitmap field
>>>>>>> + * allows us to store whether a particular seqno is valid (i.e. allows us
>>>>>>> + * to distinguish unset from 0).
>>>>>>> + *
>>>>>>> + * A branch holds an array of layer pointers, and has height > 0, and always
>>>>>>> + * has at least 2 layers (either branches or leaves) below it.
>>>>>>> + *
>>>>>>> + */
>>>>>>
>>>>>> @_@ :)
>>>>>>
>>>>>> Ok, so a map of u64 to u32. We can't use IDR or radixtree directly
>>>>>> because of u64 keys. :( How about a hash table? It would be much
>>>>>> simpler to review. :) Seriously, if it would perform close enough it
>>>>>> would be a much much simpler implementation.
>>>>>
>>>>> You want a resizable hashtable. rht is appallingly slow, so you want a
>>>>> custom resizeable ht. They are not as simple as this codewise ;)
>>>>> (Plus a compressed radixtree is part of my plan for scalability
>>>>> improvements for struct reservation_object.)
>>>>
>>>> Why resizable? I was thinking a normal one. If at any given time we
>>>> have an active set of contexts, or at least lookups are as you say
>>>> below, to neighbouring contexts, that would mean we are talking
>>>> about lookups to different hash buckets.  And for the typical
>>>> working set we would expect many collisions so longer lists in each
>>>> bucket? So maybe NUM_ENGINES * some typical load constant number
>>>> buckets would not be that bad?
>>>
>>> Consider a long running display server that will accumulate tens of
>>> thousands of clients in its lifetime, each with their own contexts that
>>> get shared by passing around fences/framebuffers. (Or on a shorter scale
>>> any of the context stress tests in igt.) Due to the non-recycling of the
>>> context ids, we can grow to very large tables - but we have no knowledge
>>> of what contexts are no longer required.
>>>
>>> To compensate we need to occasionally prune the sync points. For a ht we
>>> could just scrap it; for an idr, we could store last use and delete
>>> stale leaves.
>>
>> Hm, pruning yes.. but you don't have pruning in this patch either.
>> So that's something which needs to be addressed either way.
>
> I know, review is great for hindsight. The iterator/remove for the
> compressed idr is going to be on the uglier side of the insertion. Ugly
> enough to be a seperate patch.
>
>>> But first we have a question of how many buckets do we give the static
>>> ht? Most processes will be sharing between 2 contexts (render context,
>>> presentation context) except for a display server which may have 10-100
>>> clients - and possibly where eliminating repeated syncs is going to be
>>> most valuable. That suggests 256 buckets for every timeline (assuming
>>> just a pair of engines across shared contexts). Overkill for the
>>> majority, and going to be miserable in stress tests.
>>>
>>> Finally what do you store in the ht? Fences are the obvious candidate,
>>> but need reaping. Or you just create a ht for the request and only
>>> squash repeated fences inside a single request - that doesn't benefit
>>> from timeline tracking and squashing between requests (but does avoid
>>> keeping fences around forever). Hence why I went with tracking seqno. To
>>> avoid allocations for the ht nodes, we could create an open-addressed ht
>>> with the {context, seqno} embedded in it. It would be efficient, but
>>> needs online resizing and is a fair chunk of new code.
>>
>> I was thinking of exactly the same thing as this patch does, u64
>> context id as key, u32 seqnos (wrapped in a container with
>> hlist_node).
>
> Hmm, a hashed radixtree. I did read something about a hybrid approach.
>
>> Ok, so the key is pruning to keep the display server scenario in check.
>>
>> Free the key from i915_fence_release?
>
> Too early, it's the timeline (and syncs along it) that's interesting.
> For our contexts, we can hook into context-close, but we still have some
> foreign dma-fence-contexts to worry about. I was thinking of walking all
> timelines from the idle_worker. And possibly forcibly prune across
> suspend.

Hm, I don't see why it is too early. If a request is getting freed, there
is no one waiting on it any longer, so how can it be OK to keep that
seqno in the map?

But yes, it sounds easy to do from the idle worker. Just walk everything
and prune when the engine seqno has advanced past it?

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 07/27] drm/i915: Squash repeated awaits on the same fence
  2017-04-26 14:36               ` Tvrtko Ursulin
@ 2017-04-26 14:55                 ` Chris Wilson
  2017-04-26 15:04                 ` Chris Wilson
  1 sibling, 0 replies; 95+ messages in thread
From: Chris Wilson @ 2017-04-26 14:55 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: intel-gfx

On Wed, Apr 26, 2017 at 03:36:19PM +0100, Tvrtko Ursulin wrote:
> 
> On 26/04/2017 13:23, Chris Wilson wrote:
> >Too early, it's the timeline (and syncs along it) that's interesting.
> >For our contexts, we can hook into context-close, but we still have some
> >foreign dma-fence-contexts to worry about. I was thinking of walking all
> >timelines from the idle_worker. And possibly forcibly prune across
> >suspend.
> 
> Hm I don't see why it is too early. If request is getting freed,
> there is no one waiting on it any longer, so how can it be OK to
> keep that seqno in the map?

The fence->seqno represents a known synchronisation point from our timeline to the
fence->timeline. In the future, we know that we can skip all
synchronisations onto that timeline that are older than the fence->seqno
because we already have synchronised. That coupling persists after the
fence itself is destroyed. It is why tracking it between timelines is
more effective than just squashing repeats within a request.

(This is perhaps more significant when we expend ring space to emit GPU
commands for each sync point, i.e. hw semaphores.)
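
In other words, the await path reduces to something like this (a sketch
only; emit_await() is a placeholder for the real await, and the
unordered-timeline filtering is left out):

static int await_once(struct drm_i915_gem_request *req,
		      struct dma_fence *fence)
{
	int ret;

	/* Already synchronised with an equal or later point? Skip it. */
	if (intel_timeline_sync_is_later(req->timeline,
					 fence->context, fence->seqno))
		return 0;

	ret = emit_await(req, fence); /* placeholder for the real await */
	if (ret)
		return ret;

	/* Remember the coupling; it outlives the fence itself. */
	return intel_timeline_sync_set(req->timeline,
				       fence->context, fence->seqno);
}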
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 07/27] drm/i915: Squash repeated awaits on the same fence
  2017-04-26 14:36               ` Tvrtko Ursulin
  2017-04-26 14:55                 ` Chris Wilson
@ 2017-04-26 15:04                 ` Chris Wilson
  1 sibling, 0 replies; 95+ messages in thread
From: Chris Wilson @ 2017-04-26 15:04 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: intel-gfx

On Wed, Apr 26, 2017 at 03:36:19PM +0100, Tvrtko Ursulin wrote:
> But yes, sounds easy to do it from the idle worker. Just walk
> everything and prune when engine seqno has advanced past it?

Hmm. I was thinking that sounded like a great idea and then realised we
are storing context.seqno not global_seqno. :|

But... Yes, from idle we know that the fences are completed and so if we
should have a request for an old sync point, it will be ignored anyway
because the fence is completed. So we can just throw away the ht
entirely at that point with no loss of optimisations.

Ok, I see what you were trying to tell me about i915_fence_release now.
The problem with handling it at fence_release is doing the reverse map to
all contexts using it.

Thanks,
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 07/27] drm/i915: Squash repeated awaits on the same fence
  2017-04-26 12:13           ` Tvrtko Ursulin
  2017-04-26 12:23             ` Chris Wilson
@ 2017-04-26 18:56             ` Chris Wilson
  2017-04-26 22:22               ` Chris Wilson
  1 sibling, 1 reply; 95+ messages in thread
From: Chris Wilson @ 2017-04-26 18:56 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: intel-gfx

On Wed, Apr 26, 2017 at 01:13:41PM +0100, Tvrtko Ursulin wrote:
> I was thinking of exactly the same thing as this patch does, u64
> context id as key, u32 seqnos (wrapped in a container with
> hlist_node).

#define NSYNC 32
struct intel_timeline_sync { /* kmalloc-256 slab */
	struct hlist_node node;
        u64 prefix;
	u32 bitmap;
	u32 seqno[NSYNC];
};
DECLARE_HASHTABLE(sync, 7);

If I squint, the numbers favour the idr. ;)
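
For reference, the lookup against that table would be roughly as below
(assuming the DECLARE_HASHTABLE sits inside struct intel_timeline as
tl->sync; illustrative only, none of this is from the series):

static bool ht_sync_is_later(struct intel_timeline *tl, u64 id, u32 seqno)
{
	struct intel_timeline_sync *p;
	u64 prefix = id >> ilog2(NSYNC);
	unsigned int idx = id & (NSYNC - 1);

	hash_for_each_possible(tl->sync, p, node, prefix)
		if (p->prefix == prefix && p->bitmap & BIT(idx))
			return i915_seqno_passed(p->seqno[idx], seqno);

	return false;
}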

Tbh, the presence of the squash is noticeable and well above the noise;
the difference between a hashtable and the idr is far below the noise
floor (in a testcase intended to stress the efficacy of this patch). The
cost of reference counting in execbuffer and the reservation_object hides
all ills. :(

What I am not happy with is the 1<<7 buckets I'm currently using.
Thinking about the idle pruning, there shouldn't be any reason to go
above 1<<3, I hope?

Do we start on GEM_STATS?
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 07/27] drm/i915: Squash repeated awaits on the same fence
  2017-04-26 18:56             ` Chris Wilson
@ 2017-04-26 22:22               ` Chris Wilson
  2017-04-27  9:20                 ` Tvrtko Ursulin
  0 siblings, 1 reply; 95+ messages in thread
From: Chris Wilson @ 2017-04-26 22:22 UTC (permalink / raw)
  To: Tvrtko Ursulin, intel-gfx

On Wed, Apr 26, 2017 at 07:56:14PM +0100, Chris Wilson wrote:
> On Wed, Apr 26, 2017 at 01:13:41PM +0100, Tvrtko Ursulin wrote:
> > I was thinking of exactly the same thing as this patch does, u64
> > context id as key, u32 seqnos (wrapped in a container with
> > hlist_node).
> 
> #define NSYNC 32
> struct intel_timeline_sync { /* kmalloc-256 slab */
> 	struct hlist_node node;
>         u64 prefix;
> 	u32 bitmap;
> 	u32 seqno[NSYNC];
> };
> DECLARE_HASHTABLE(sync, 7);
> 
> If I squint, the numbers favour the idr. ;)

Hmm, it didn't take much to start running into misery with a static ht.
I know my testing is completely artificial but I am not going to be
happy with a static size, it will always be too big or too small and
never just Goldilocks.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 95+ messages in thread

* [PATCH v8] drm/i915: Squash repeated awaits on the same fence
  2017-04-19  9:41 ` [PATCH 07/27] drm/i915: Squash repeated awaits on the same fence Chris Wilson
  2017-04-24 13:03   ` Tvrtko Ursulin
  2017-04-26 10:20   ` Tvrtko Ursulin
@ 2017-04-27  7:06   ` Chris Wilson
  2017-04-27  7:14     ` Chris Wilson
                       ` (2 more replies)
  2 siblings, 3 replies; 95+ messages in thread
From: Chris Wilson @ 2017-04-27  7:06 UTC (permalink / raw)
  To: intel-gfx

Track the latest fence waited upon on each context, and only add a new
asynchronous wait if the new fence is more recent than the recorded
fence for that context. This requires us to filter out unordered
timelines, which are noted by DMA_FENCE_NO_CONTEXT. However, in the
absence of a universal identifier, we have to use our own
i915->mm.unordered_timeline token.

v2: Throw around the debug crutches
v3: Inline the likely case of the pre-allocation cache being full.
v4: Drop the pre-allocation support, we can lose the most recent fence
in case of allocation failure -- it just means we may emit more awaits
than strictly necessary but will not break.
v5: Trim allocation size for leaf nodes, they only need an array of u32
not pointers.
v6: Create mock_timeline to tidy selftest writing
v7: s/intel_timeline_sync_get/intel_timeline_sync_is_later/ (Tvrtko)
v8: Prune the stale sync points when we idle.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_gem.c                    |   1 +
 drivers/gpu/drm/i915/i915_gem_request.c            |  11 +
 drivers/gpu/drm/i915/i915_gem_timeline.c           | 314 +++++++++++++++++++++
 drivers/gpu/drm/i915/i915_gem_timeline.h           |  15 +
 drivers/gpu/drm/i915/selftests/i915_gem_timeline.c | 125 ++++++++
 .../gpu/drm/i915/selftests/i915_mock_selftests.h   |   1 +
 drivers/gpu/drm/i915/selftests/mock_timeline.c     |  52 ++++
 drivers/gpu/drm/i915/selftests/mock_timeline.h     |  33 +++
 8 files changed, 552 insertions(+)
 create mode 100644 drivers/gpu/drm/i915/selftests/i915_gem_timeline.c
 create mode 100644 drivers/gpu/drm/i915/selftests/mock_timeline.c
 create mode 100644 drivers/gpu/drm/i915/selftests/mock_timeline.h

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index c1fa3c103f38..f886ef492036 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -3214,6 +3214,7 @@ i915_gem_idle_work_handler(struct work_struct *work)
 		intel_engine_disarm_breadcrumbs(engine);
 		i915_gem_batch_pool_fini(&engine->batch_pool);
 	}
+	i915_gem_timelines_mark_idle(dev_priv);
 
 	GEM_BUG_ON(!dev_priv->gt.awake);
 	dev_priv->gt.awake = false;
diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
index 5fa4e52ded06..d9f76665bc6b 100644
--- a/drivers/gpu/drm/i915/i915_gem_request.c
+++ b/drivers/gpu/drm/i915/i915_gem_request.c
@@ -772,6 +772,12 @@ i915_gem_request_await_dma_fence(struct drm_i915_gem_request *req,
 		if (fence->context == req->fence.context)
 			continue;
 
+		/* Squash repeated waits to the same timelines */
+		if (fence->context != req->i915->mm.unordered_timeline &&
+		    intel_timeline_sync_is_later(req->timeline,
+						 fence->context, fence->seqno))
+			continue;
+
 		if (dma_fence_is_i915(fence))
 			ret = i915_gem_request_await_request(req,
 							     to_request(fence));
@@ -781,6 +787,11 @@ i915_gem_request_await_dma_fence(struct drm_i915_gem_request *req,
 							    GFP_KERNEL);
 		if (ret < 0)
 			return ret;
+
+		/* Record the latest fence on each timeline */
+		if (fence->context != req->i915->mm.unordered_timeline)
+			intel_timeline_sync_set(req->timeline,
+						fence->context, fence->seqno);
 	} while (--nchild);
 
 	return 0;
diff --git a/drivers/gpu/drm/i915/i915_gem_timeline.c b/drivers/gpu/drm/i915/i915_gem_timeline.c
index b596ca7ee058..967c53a53a92 100644
--- a/drivers/gpu/drm/i915/i915_gem_timeline.c
+++ b/drivers/gpu/drm/i915/i915_gem_timeline.c
@@ -24,6 +24,276 @@
 
 #include "i915_drv.h"
 
+#define NSYNC 16
+#define SHIFT ilog2(NSYNC)
+#define MASK (NSYNC - 1)
+
+/* struct intel_timeline_sync is a layer of a radixtree that maps a u64 fence
+ * context id to the last u32 fence seqno waited upon from that context.
+ * Unlike lib/radixtree it uses a parent pointer that allows traversal back to
+ * the root. This allows us to access the whole tree via a single pointer
+ * to the most recently used layer. We expect fence contexts to be dense
+ * and most reuse to be on the same i915_gem_context but on neighbouring
+ * engines (i.e. on adjacent contexts) and reuse the same leaf, a very
+ * effective lookup cache. If the new lookup is not on the same leaf, we
+ * expect it to be on the neighbouring branch.
+ *
+ * A leaf holds an array of u32 seqno, and has height 0. The bitmap field
+ * allows us to store whether a particular seqno is valid (i.e. allows us
+ * to distinguish unset from 0).
+ *
+ * A branch holds an array of layer pointers, and has height > 0, and always
+ * has at least 2 layers (either branches or leaves) below it.
+ */
+struct intel_timeline_sync {
+	u64 prefix;
+	unsigned int height;
+	unsigned int bitmap;
+	struct intel_timeline_sync *parent;
+	/* union {
+	 *	u32 seqno;
+	 *	struct intel_timeline_sync *child;
+	 * } slot[NSYNC];
+	 */
+};
+
+static inline u32 *__sync_seqno(struct intel_timeline_sync *p)
+{
+	GEM_BUG_ON(p->height);
+	return (u32 *)(p + 1);
+}
+
+static inline struct intel_timeline_sync **
+__sync_child(struct intel_timeline_sync *p)
+{
+	GEM_BUG_ON(!p->height);
+	return (struct intel_timeline_sync **)(p + 1);
+}
+
+static inline unsigned int
+__sync_idx(const struct intel_timeline_sync *p, u64 id)
+{
+	return (id >> p->height) & MASK;
+}
+
+static void __sync_free(struct intel_timeline_sync *p)
+{
+	if (p->height) {
+		unsigned int i;
+
+		while ((i = ffs(p->bitmap))) {
+			p->bitmap &= ~0u << i;
+			__sync_free(__sync_child(p)[i - 1]);
+		}
+	}
+
+	kfree(p);
+}
+
+static void sync_free(struct intel_timeline_sync *sync)
+{
+	if (!sync)
+		return;
+
+	while (sync->parent)
+		sync = sync->parent;
+
+	__sync_free(sync);
+}
+
+/** intel_timeline_sync_is_later -- compare against the last known sync point
+ * @tl - the @intel_timeline
+ * @id - the context id (other timeline) we are synchronising to
+ * @seqno - the sequence number along the other timeline
+ *
+ * If we have already synchronised this @tl with another (@id) then we can
+ * omit any repeated or earlier synchronisation requests. If the two timelines
+ * are already coupled, we can also omit the dependency between the two as that
+ * is already known via the timeline.
+ *
+ * Returns true if the two timelines are already synchronised wrt to @seqno,
+ * false if not and the synchronisation must be emitted.
+ */
+bool intel_timeline_sync_is_later(struct intel_timeline *tl, u64 id, u32 seqno)
+{
+	struct intel_timeline_sync *p;
+	unsigned int idx;
+
+	p = tl->sync;
+	if (!p)
+		return false;
+
+	if (likely((id >> SHIFT) == p->prefix))
+		goto found;
+
+	/* First climb the tree back to a parent branch */
+	do {
+		p = p->parent;
+		if (!p)
+			return false;
+
+		if ((id >> p->height >> SHIFT) == p->prefix)
+			break;
+	} while (1);
+
+	/* And then descend again until we find our leaf */
+	do {
+		if (!p->height)
+			break;
+
+		p = __sync_child(p)[__sync_idx(p, id)];
+		if (!p)
+			return false;
+
+		if ((id >> p->height >> SHIFT) != p->prefix)
+			return false;
+	} while (1);
+
+	tl->sync = p;
+found:
+	idx = id & MASK;
+	if (!(p->bitmap & BIT(idx)))
+		return false;
+
+	return i915_seqno_passed(__sync_seqno(p)[idx], seqno);
+}
+
+static noinline int
+__intel_timeline_sync_set(struct intel_timeline *tl, u64 id, u32 seqno)
+{
+	struct intel_timeline_sync *p = tl->sync;
+	unsigned int idx;
+
+	if (!p) {
+		p = kzalloc(sizeof(*p) + NSYNC * sizeof(seqno), GFP_KERNEL);
+		if (unlikely(!p))
+			return -ENOMEM;
+
+		p->prefix = id >> SHIFT;
+		goto found;
+	}
+
+	/* Climb back up the tree until we find a common prefix */
+	do {
+		if (!p->parent)
+			break;
+
+		p = p->parent;
+
+		if ((id >> p->height >> SHIFT) == p->prefix)
+			break;
+	} while (1);
+
+	/* No shortcut, we have to descend the tree to find the right layer
+	 * containing this fence.
+	 *
+	 * Each layer in the tree holds 16 (NSYNC) pointers, either fences
+	 * or lower layers. Leaf nodes (height = 0) contain the fences, all
+	 * other nodes (height > 0) are internal layers that point to a lower
+	 * node. Each internal layer has at least 2 descendants.
+	 *
+	 * Starting at the top, we check whether the current prefix matches. If
+	 * it doesn't, we have gone past our layer and need to insert a join
+	 * into the tree, and a new leaf node as a descendant as well as the
+	 * original layer.
+	 *
+	 * The matching prefix means we are still following the right branch
+	 * of the tree. If it has height 0, we have found our leaf and just
+	 * need to replace the fence slot with ourselves. If the height is
+	 * not zero, our slot contains the next layer in the tree (unless
+	 * it is empty, in which case we can add ourselves as a new leaf).
+	 * As we descend the tree, the prefix grows (and height decreases).
+	 */
+	do {
+		struct intel_timeline_sync *next;
+
+		if ((id >> p->height >> SHIFT) != p->prefix) {
+			/* insert a join above the current layer */
+			next = kzalloc(sizeof(*next) + NSYNC * sizeof(next),
+				       GFP_KERNEL);
+			if (unlikely(!next))
+				return -ENOMEM;
+
+			next->height = ALIGN(fls64((id >> p->height >> SHIFT) ^ p->prefix),
+					    SHIFT) + p->height;
+			next->prefix = id >> next->height >> SHIFT;
+
+			if (p->parent)
+				__sync_child(p->parent)[__sync_idx(p->parent, id)] = next;
+			next->parent = p->parent;
+
+			idx = p->prefix >> (next->height - p->height - SHIFT) & MASK;
+			__sync_child(next)[idx] = p;
+			next->bitmap |= BIT(idx);
+			p->parent = next;
+
+			/* ascend to the join */
+			p = next;
+		} else {
+			if (!p->height)
+				break;
+		}
+
+		/* descend into the next layer */
+		GEM_BUG_ON(!p->height);
+		idx = __sync_idx(p, id);
+		next = __sync_child(p)[idx];
+		if (unlikely(!next)) {
+			next = kzalloc(sizeof(*next) + NSYNC * sizeof(seqno),
+				       GFP_KERNEL);
+			if (unlikely(!next))
+				return -ENOMEM;
+
+			__sync_child(p)[idx] = next;
+			p->bitmap |= BIT(idx);
+			next->parent = p;
+			next->prefix = id >> SHIFT;
+
+			p = next;
+			break;
+		}
+
+		p = next;
+	} while (1);
+
+found:
+	GEM_BUG_ON(p->height);
+	GEM_BUG_ON(p->prefix != id >> SHIFT);
+	tl->sync = p;
+	idx = id & MASK;
+	__sync_seqno(p)[idx] = seqno;
+	p->bitmap |= BIT(idx);
+	return 0;
+}
+
+/** intel_timeline_sync_set -- mark the most recent syncpoint between contexts
+ * @tl - the @intel_timeline
+ * @id - the context id (other timeline) we have synchronised to
+ * @seqno - the sequence number along the other timeline
+ *
+ * When we synchronise this @tl with another (@id), we also know that we have
+ * synchronized with all previous seqno along that timeline. If we then have
+ * a request to synchronise with the same seqno or older, we can omit it,
+ * see intel_timeline_sync_is_later()
+ */
+int intel_timeline_sync_set(struct intel_timeline *tl, u64 id, u32 seqno)
+{
+	struct intel_timeline_sync *p = tl->sync;
+
+	/* We expect to be called in sequence following a _is_later(id), which
+	 * should have preloaded the tl->sync hint for us.
+	 */
+	if (likely(p && (id >> SHIFT) == p->prefix)) {
+		unsigned int idx = id & MASK;
+
+		__sync_seqno(p)[idx] = seqno;
+		p->bitmap |= BIT(idx);
+		return 0;
+	}
+
+	return __intel_timeline_sync_set(tl, id, seqno);
+}
+
 static int __i915_gem_timeline_init(struct drm_i915_private *i915,
 				    struct i915_gem_timeline *timeline,
 				    const char *name,
@@ -35,6 +305,12 @@ static int __i915_gem_timeline_init(struct drm_i915_private *i915,
 
 	lockdep_assert_held(&i915->drm.struct_mutex);
 
+	/* Ideally we want a set of engines on a single leaf as we expect
+	 * to mostly be tracking synchronisation between engines.
+	 */
+	BUILD_BUG_ON(NSYNC < I915_NUM_ENGINES);
+	BUILD_BUG_ON(NSYNC > BITS_PER_BYTE * sizeof(timeline->engine[0].sync->bitmap));
+
 	timeline->i915 = i915;
 	timeline->name = kstrdup(name ?: "[kernel]", GFP_KERNEL);
 	if (!timeline->name)
@@ -81,6 +357,37 @@ int i915_gem_timeline_init__global(struct drm_i915_private *i915)
 					&class, "&global_timeline->lock");
 }
 
+/** i915_gem_timelines_mark_idle -- called when the driver idles
+ * @i915 - the drm_i915_private device
+ *
+ * When the driver is completely idle, we know that all of our sync points
+ * have been signaled and our tracking is then entirely redundant. Any request
+ * to wait upon an older sync point will be completed instantly as we know
+ * the fence is signaled and therefore we will not even look them up in the
+ * sync point map.
+ */
+void i915_gem_timelines_mark_idle(struct drm_i915_private *i915)
+{
+	struct i915_gem_timeline *timeline;
+	int i;
+
+	lockdep_assert_held(&i915->drm.struct_mutex);
+
+	list_for_each_entry(timeline, &i915->gt.timelines, link) {
+		for (i = 0; i < ARRAY_SIZE(timeline->engine); i++) {
+			struct intel_timeline *tl = &timeline->engine[i];
+
+			/* All known fences are completed so we can scrap
+			 * the current sync point tracking and start afresh,
+			 * any attempt to wait upon a previous sync point
+			 * will be skipped as the fence was signaled.
+			 */
+			sync_free(tl->sync);
+			tl->sync = NULL;
+		}
+	}
+}
+
 void i915_gem_timeline_fini(struct i915_gem_timeline *timeline)
 {
 	int i;
@@ -91,8 +398,15 @@ void i915_gem_timeline_fini(struct i915_gem_timeline *timeline)
 		struct intel_timeline *tl = &timeline->engine[i];
 
 		GEM_BUG_ON(!list_empty(&tl->requests));
+
+		sync_free(tl->sync);
 	}
 
 	list_del(&timeline->link);
 	kfree(timeline->name);
 }
+
+#if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
+#include "selftests/mock_timeline.c"
+#include "selftests/i915_gem_timeline.c"
+#endif
diff --git a/drivers/gpu/drm/i915/i915_gem_timeline.h b/drivers/gpu/drm/i915/i915_gem_timeline.h
index 6c53e14cab2a..e16a62bc21e6 100644
--- a/drivers/gpu/drm/i915/i915_gem_timeline.h
+++ b/drivers/gpu/drm/i915/i915_gem_timeline.h
@@ -26,10 +26,13 @@
 #define I915_GEM_TIMELINE_H
 
 #include <linux/list.h>
+#include <linux/radix-tree.h>
 
+#include "i915_utils.h"
 #include "i915_gem_request.h"
 
 struct i915_gem_timeline;
+struct intel_timeline_sync;
 
 struct intel_timeline {
 	u64 fence_context;
@@ -55,6 +58,14 @@ struct intel_timeline {
 	 * struct_mutex.
 	 */
 	struct i915_gem_active last_request;
+
+	/* We track the most recent seqno that we wait on in every context so
+	 * that we only have to emit a new await and dependency on a more
+	 * recent sync point. As the contexts may be executed out-of-order, we
+	 * have to track each individually and cannot rely on an absolute
+	 * global_seqno.
+	 */
+	struct intel_timeline_sync *sync;
 	u32 sync_seqno[I915_NUM_ENGINES];
 
 	struct i915_gem_timeline *common;
@@ -73,6 +84,10 @@ int i915_gem_timeline_init(struct drm_i915_private *i915,
 			   struct i915_gem_timeline *tl,
 			   const char *name);
 int i915_gem_timeline_init__global(struct drm_i915_private *i915);
+void i915_gem_timelines_mark_idle(struct drm_i915_private *i915);
 void i915_gem_timeline_fini(struct i915_gem_timeline *tl);
 
+int intel_timeline_sync_set(struct intel_timeline *tl, u64 id, u32 seqno);
+bool intel_timeline_sync_is_later(struct intel_timeline *tl, u64 id, u32 seqno);
+
 #endif
diff --git a/drivers/gpu/drm/i915/selftests/i915_gem_timeline.c b/drivers/gpu/drm/i915/selftests/i915_gem_timeline.c
new file mode 100644
index 000000000000..ce24804e2a8e
--- /dev/null
+++ b/drivers/gpu/drm/i915/selftests/i915_gem_timeline.c
@@ -0,0 +1,125 @@
+/*
+ * Copyright © 2017 Intel Corporation
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the next
+ * paragraph) shall be included in all copies or substantial portions of the
+ * Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ *
+ */
+
+#include "../i915_selftest.h"
+#include "mock_gem_device.h"
+
+static int igt_seqmap(void *arg)
+{
+	struct drm_i915_private *i915 = arg;
+	const struct {
+		const char *name;
+		u32 seqno;
+		bool expected;
+		bool set;
+	} pass[] = {
+		{ "unset", 0, false, false },
+		{ "new", 0, false, true },
+		{ "0a", 0, true, true },
+		{ "1a", 1, false, true },
+		{ "1b", 1, true, true },
+		{ "0b", 0, true, false },
+		{ "2a", 2, false, true },
+		{ "4", 4, false, true },
+		{ "INT_MAX", INT_MAX, false, true },
+		{ "INT_MAX-1", INT_MAX-1, true, false },
+		{ "INT_MAX+1", (u32)INT_MAX+1, false, true },
+		{ "INT_MAX", INT_MAX, true, false },
+		{ "UINT_MAX", UINT_MAX, false, true },
+		{ "wrap", 0, false, true },
+		{ "unwrap", UINT_MAX, true, false },
+		{},
+	}, *p;
+	struct i915_gem_timeline *timeline;
+	struct intel_timeline *tl;
+	int order, offset;
+	int ret;
+
+	timeline = mock_timeline(i915);
+
+	tl = &timeline->engine[RCS];
+	for (p = pass; p->name; p++) {
+		for (order = 1; order < 64; order++) {
+			for (offset = -1; offset <= (order > 1); offset++) {
+				u64 ctx = BIT_ULL(order) + offset;
+
+				if (intel_timeline_sync_is_later
+				    (tl, ctx, p->seqno) != p->expected) {
+					pr_err("1: %s(ctx=%llu, seqno=%u) expected passed %s but failed\n",
+					       p->name, ctx, p->seqno, yesno(p->expected));
+					return -EINVAL;
+				}
+
+				if (p->set) {
+					ret = intel_timeline_sync_set(tl, ctx, p->seqno);
+					if (ret)
+						return ret;
+				}
+			}
+		}
+	}
+
+	tl = &timeline->engine[BCS];
+	for (order = 1; order < 64; order++) {
+		for (offset = -1; offset <= (order > 1); offset++) {
+			u64 ctx = BIT_ULL(order) + offset;
+
+			for (p = pass; p->name; p++) {
+				if (intel_timeline_sync_is_later
+				    (tl, ctx, p->seqno) != p->expected) {
+					pr_err("2: %s(ctx=%llu, seqno=%u) expected passed %s but failed\n",
+					       p->name, ctx, p->seqno, yesno(p->expected));
+					return -EINVAL;
+				}
+
+				if (p->set) {
+					ret = intel_timeline_sync_set(tl, ctx, p->seqno);
+					if (ret)
+						return ret;
+				}
+			}
+		}
+	}
+
+	mock_timeline_destroy(timeline);
+	return 0;
+}
+
+int i915_gem_timeline_mock_selftests(void)
+{
+	static const struct i915_subtest tests[] = {
+		SUBTEST(igt_seqmap),
+	};
+	struct drm_i915_private *i915;
+	int err;
+
+	i915 = mock_gem_device();
+	if (!i915)
+		return -ENOMEM;
+
+	err = i915_subtests(tests, i915);
+	drm_dev_unref(&i915->drm);
+
+	return err;
+}
diff --git a/drivers/gpu/drm/i915/selftests/i915_mock_selftests.h b/drivers/gpu/drm/i915/selftests/i915_mock_selftests.h
index be9a9ebf5692..8d0f50c25df8 100644
--- a/drivers/gpu/drm/i915/selftests/i915_mock_selftests.h
+++ b/drivers/gpu/drm/i915/selftests/i915_mock_selftests.h
@@ -12,6 +12,7 @@ selftest(sanitycheck, i915_mock_sanitycheck) /* keep first (igt selfcheck) */
 selftest(scatterlist, scatterlist_mock_selftests)
 selftest(uncore, intel_uncore_mock_selftests)
 selftest(breadcrumbs, intel_breadcrumbs_mock_selftests)
+selftest(timelines, i915_gem_timeline_mock_selftests)
 selftest(requests, i915_gem_request_mock_selftests)
 selftest(objects, i915_gem_object_mock_selftests)
 selftest(dmabuf, i915_gem_dmabuf_mock_selftests)
diff --git a/drivers/gpu/drm/i915/selftests/mock_timeline.c b/drivers/gpu/drm/i915/selftests/mock_timeline.c
new file mode 100644
index 000000000000..e8d62f5f6ed3
--- /dev/null
+++ b/drivers/gpu/drm/i915/selftests/mock_timeline.c
@@ -0,0 +1,52 @@
+/*
+ * Copyright © 2017 Intel Corporation
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the next
+ * paragraph) shall be included in all copies or substantial portions of the
+ * Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ *
+ */
+
+#include "mock_timeline.h"
+
+struct i915_gem_timeline *
+mock_timeline(struct drm_i915_private *i915)
+{
+	struct i915_gem_timeline *timeline;
+
+	timeline = kzalloc(sizeof(*timeline), GFP_KERNEL);
+	if (!timeline)
+		return NULL;
+
+	mutex_lock(&i915->drm.struct_mutex);
+	i915_gem_timeline_init(i915, timeline, "mock");
+	mutex_unlock(&i915->drm.struct_mutex);
+
+	return timeline;
+}
+
+void mock_timeline_destroy(struct i915_gem_timeline *timeline)
+{
+	struct drm_i915_private *i915 = timeline->i915;
+
+	mutex_lock(&i915->drm.struct_mutex);
+	i915_gem_timeline_fini(timeline);
+	mutex_unlock(&i915->drm.struct_mutex);
+
+	kfree(timeline);
+}
diff --git a/drivers/gpu/drm/i915/selftests/mock_timeline.h b/drivers/gpu/drm/i915/selftests/mock_timeline.h
new file mode 100644
index 000000000000..b33dcd2151ef
--- /dev/null
+++ b/drivers/gpu/drm/i915/selftests/mock_timeline.h
@@ -0,0 +1,33 @@
+/*
+ * Copyright © 2017 Intel Corporation
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the next
+ * paragraph) shall be included in all copies or substantial portions of the
+ * Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ *
+ */
+
+#ifndef __MOCK_TIMELINE__
+#define __MOCK_TIMELINE__
+
+#include "../i915_gem_timeline.h"
+
+struct i915_gem_timeline *mock_timeline(struct drm_i915_private *i915);
+void mock_timeline_destroy(struct i915_gem_timeline *timeline);
+
+#endif /* !__MOCK_TIMELINE__ */
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 95+ messages in thread

* Re: [PATCH v8] drm/i915: Squash repeated awaits on the same fence
  2017-04-27  7:06   ` [PATCH v8] " Chris Wilson
@ 2017-04-27  7:14     ` Chris Wilson
  2017-04-27  9:50     ` Chris Wilson
  2017-04-27 11:48     ` [PATCH v9] " Chris Wilson
  2 siblings, 0 replies; 95+ messages in thread
From: Chris Wilson @ 2017-04-27  7:14 UTC (permalink / raw)
  To: intel-gfx

On Thu, Apr 27, 2017 at 08:06:36AM +0100, Chris Wilson wrote:
> Track the latest fence waited upon on each context, and only add a new
> asynchronous wait if the new fence is more recent than the recorded
> fence for that context. This requires us to filter out unordered
> timelines, which are noted by DMA_FENCE_NO_CONTEXT. However, in the
> absence of a universal identifier, we have to use our own
> i915->mm.unordered_timeline token.

Fwiw, the conversion to a ht of leaves is
http://paste.debian.net/929577/

I don't like the compromise of the fixed-size ht; it is too easy to hit
a badly performing case.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 95+ messages in thread

* ✓ Fi.CI.BAT: success for series starting with [01/27] drm/i915/selftests: Allocate inode/file dynamically (rev2)
  2017-04-19  9:41 Confluence of eb + timeline improvements Chris Wilson
                   ` (27 preceding siblings ...)
  2017-04-19 10:01 ` ✗ Fi.CI.BAT: failure for series starting with [01/27] drm/i915/selftests: Allocate inode/file dynamically Patchwork
@ 2017-04-27  7:27 ` Patchwork
  2017-04-28 14:31 ` ✓ Fi.CI.BAT: success for series starting with [01/27] drm/i915/selftests: Allocate inode/file dynamically (rev5) Patchwork
  2017-04-28 19:22 ` ✓ Fi.CI.BAT: success for series starting with [01/27] drm/i915/selftests: Allocate inode/file dynamically (rev6) Patchwork
  30 siblings, 0 replies; 95+ messages in thread
From: Patchwork @ 2017-04-27  7:27 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: series starting with [01/27] drm/i915/selftests: Allocate inode/file dynamically (rev2)
URL   : https://patchwork.freedesktop.org/series/23227/
State : success

== Summary ==

Series 23227v2 Series without cover letter
https://patchwork.freedesktop.org/api/1.0/series/23227/revisions/2/mbox/

fi-bdw-5557u     total:278  pass:267  dwarn:0   dfail:0   fail:0   skip:11  time:428s
fi-bdw-gvtdvm    total:278  pass:256  dwarn:8   dfail:0   fail:0   skip:14  time:429s
fi-bsw-n3050     total:278  pass:242  dwarn:0   dfail:0   fail:0   skip:36  time:580s
fi-bxt-j4205     total:278  pass:259  dwarn:0   dfail:0   fail:0   skip:19  time:510s
fi-bxt-t5700     total:278  pass:258  dwarn:0   dfail:0   fail:0   skip:20  time:544s
fi-byt-j1900     total:278  pass:254  dwarn:0   dfail:0   fail:0   skip:24  time:481s
fi-byt-n2820     total:278  pass:250  dwarn:0   dfail:0   fail:0   skip:28  time:486s
fi-hsw-4770      total:278  pass:262  dwarn:0   dfail:0   fail:0   skip:16  time:413s
fi-hsw-4770r     total:278  pass:262  dwarn:0   dfail:0   fail:0   skip:16  time:409s
fi-ilk-650       total:278  pass:228  dwarn:0   dfail:0   fail:0   skip:50  time:418s
fi-ivb-3520m     total:278  pass:260  dwarn:0   dfail:0   fail:0   skip:18  time:494s
fi-ivb-3770      total:278  pass:260  dwarn:0   dfail:0   fail:0   skip:18  time:467s
fi-kbl-7500u     total:278  pass:260  dwarn:0   dfail:0   fail:0   skip:18  time:457s
fi-kbl-7560u     total:278  pass:267  dwarn:1   dfail:0   fail:0   skip:10  time:570s
fi-skl-6260u     total:278  pass:268  dwarn:0   dfail:0   fail:0   skip:10  time:458s
fi-skl-6700hq    total:278  pass:261  dwarn:0   dfail:0   fail:0   skip:17  time:570s
fi-skl-6700k     total:278  pass:256  dwarn:4   dfail:0   fail:0   skip:18  time:453s
fi-skl-6770hq    total:278  pass:268  dwarn:0   dfail:0   fail:0   skip:10  time:496s
fi-skl-gvtdvm    total:278  pass:265  dwarn:0   dfail:0   fail:0   skip:13  time:432s
fi-snb-2520m     total:278  pass:250  dwarn:0   dfail:0   fail:0   skip:28  time:534s
fi-snb-2600      total:278  pass:249  dwarn:0   dfail:0   fail:0   skip:29  time:406s

459f7d04deb6549ed4f27957ec414b727dc763f3 drm-tip: 2017y-04m-26d-16h-05m-26s UTC integration manifest
075ed10 drm/i915: Redefine ptr_pack_bits() and friends
cb02940 drm/i915: Make ptr_unpack_bits() more function-like
70114f1 drm/i915: Lift timeline ordering to await_dma_fence
621074e drm/i915: Mark up clflushes as belonging to an unordered timeline
623963bb drm/i915: Mark CPU cache as dirty on every transition for CPU writes

== Logs ==

For more details see: https://intel-gfx-ci.01.org/CI/Patchwork_4561/
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 07/27] drm/i915: Squash repeated awaits on the same fence
  2017-04-26 22:22               ` Chris Wilson
@ 2017-04-27  9:20                 ` Tvrtko Ursulin
  2017-04-27  9:47                   ` Chris Wilson
  0 siblings, 1 reply; 95+ messages in thread
From: Tvrtko Ursulin @ 2017-04-27  9:20 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 26/04/2017 23:22, Chris Wilson wrote:
> On Wed, Apr 26, 2017 at 07:56:14PM +0100, Chris Wilson wrote:
>> On Wed, Apr 26, 2017 at 01:13:41PM +0100, Tvrtko Ursulin wrote:
>>> I was thinking of exactly the same thing as this patch does, u64
>>> context id as key, u32 seqnos (wrapped in a container with
>>> hlist_node).
>>
>> #define NSYNC 32
>> struct intel_timeline_sync { /* kmalloc-256 slab */
>> 	struct hlist_node node;
>>         u64 prefix;
>> 	u32 bitmap;
>> 	u32 seqno[NSYNC];
>> };
>> DECLARE_HASHTABLE(sync, 7);
>>
>> If I squint, the numbers favour the idr. ;)
>
> Hmm, it didn't take much to start running into misery with a static ht.
> I know my testing is completely artificial but I am not going to be
> happy with a static size, it will always be too big or too small and
> never just Goldilocks.

Oh what a pity, the implementation is so much smaller. What kind of misery
was it? I presume no longer below the noise floor? With more than three
buckets?

If there is no other choice I'll tackle the review. Hopefully I won't get
lost in all the shifts, leaves, branches and prefixes. :)

Regards,

Tvrtko

P.S. GEM_STATS you mention in the other reply - what are you referring 
to with that? The idea to expose queue depths and possibly more via some 
interface? If so prototyping that is almost next on my TODO list.
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 07/27] drm/i915: Squash repeated awaits on the same fence
  2017-04-27  9:20                 ` Tvrtko Ursulin
@ 2017-04-27  9:47                   ` Chris Wilson
  0 siblings, 0 replies; 95+ messages in thread
From: Chris Wilson @ 2017-04-27  9:47 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: intel-gfx

On Thu, Apr 27, 2017 at 10:20:36AM +0100, Tvrtko Ursulin wrote:
> 
> On 26/04/2017 23:22, Chris Wilson wrote:
> >On Wed, Apr 26, 2017 at 07:56:14PM +0100, Chris Wilson wrote:
> >>On Wed, Apr 26, 2017 at 01:13:41PM +0100, Tvrtko Ursulin wrote:
> >>>I was thinking of exactly the same thing as this patch does, u64
> >>>context id as key, u32 seqnos (wrapped in a container with
> >>>hlist_node).
> >>
> >>#define NSYNC 32
> >>struct intel_timeline_sync { /* kmalloc-256 slab */
> >>	struct hlist_node node;
> >>        u64 prefix;
> >>	u32 bitmap;
> >>	u32 seqno[NSYNC];
> >>};
> >>DECLARE_HASHTABLE(sync, 7);
> >>
> >>If I squint, the numbers favour the idr. ;)
> >
> >Hmm, it didn't take much to start running into misery with a static ht.
> >I know my testing is completely artificial but I am not going to be
> >happy with a static size, it will always be too big or too small and
> >never just Goldilocks.
> 
> Oh what a pity, implementation is so much smaller. What kind of
> misery was it? I presume not longer below the noise floor? With more
> than three buckets?

Yup, after realising the flaw in my userspace test, I was able to
hit intel_timeline_sync_is_later() more often. The difference between
idr/ht in that test is still less than the difference from not squashing,
but it becomes easier to see a difference (the moment when it was
spending over 90% of its time in that function walking the hash chain was
the last straw).
 
> If no other choice I'll tackle the review. Hopefully won't get lost
> in all the shifts, leafs, branches and prefixes. :)

You may well win the ht argument when it comes to an RCU compatible
variant for reservation_object; the relative simplicity in walking the
rcu chains is much more reassuring than arguing rcu correctness of
parent pointers and manual stacks for iterators.

Still, a fixed-size ht is going to have long chains for igt, and
reservation_objects are very common so we can't go mad in giving each a
large number of buckets. The biggest complexity for reservation_object
is that it offers guaranteed insertion (along with a u64 index that
rules out lib/radixtree, rhashtable).

And I hope one day refcounting becomes reasonably cheap again, since
sadly it's unavoidable in reservation_object (afaict).

> Regards,
> 
> Tvrtko
> 
> P.S. GEM_STATS you mention in the other reply - what are you
> referring to with that? The idea to expose queue depths and possibly
> more via some interface? If so prototyping that is almost next on my
> TODO list.

I was thinking of intrusive debugging stats that we may want to keep
around and conditionally compile in.

Most statistics should not be for public consumption :)
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v8] drm/i915: Squash repeated awaits on the same fence
  2017-04-27  7:06   ` [PATCH v8] " Chris Wilson
  2017-04-27  7:14     ` Chris Wilson
@ 2017-04-27  9:50     ` Chris Wilson
  2017-04-27 11:42       ` Chris Wilson
  2017-04-27 11:48     ` [PATCH v9] " Chris Wilson
  2 siblings, 1 reply; 95+ messages in thread
From: Chris Wilson @ 2017-04-27  9:50 UTC (permalink / raw)
  To: intel-gfx

On Thu, Apr 27, 2017 at 08:06:36AM +0100, Chris Wilson wrote:
> +int i915_gem_timeline_mock_selftests(void)
> +{
> +	static const struct i915_subtest tests[] = {
> +		SUBTEST(igt_seqmap),

I should add a few benchmarks here as well.

random insertion
random lookup (uses same random set as insertion)
repeated lookups of neighbouring engines

So that we can compare in situ given our simple api.
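
Something along these lines for the insertion half (rough, untested
sketch; the subtest name, iteration count and seed are illustrative, not
necessarily what will land in v9):

static int bench_sync_insert(void *arg)
{
	struct i915_gem_timeline *timeline;
	struct intel_timeline *tl;
	struct rnd_state prng;
	unsigned long count = 0;
	ktime_t kt;

	timeline = mock_timeline(arg);
	if (!timeline)
		return -ENOMEM;
	tl = &timeline->engine[RCS];

	prandom_seed_state(&prng, 0x12345678);

	kt = ktime_get();
	do {
		u64 id = (u64)prandom_u32_state(&prng) << 32;
		int err;

		id |= prandom_u32_state(&prng);

		err = intel_timeline_sync_set(tl, id, count);
		if (err) {
			mock_timeline_destroy(timeline);
			return err;
		}
	} while (++count < 100000);
	kt = ktime_sub(ktime_get(), kt);

	pr_info("%s: %lu random insertions, %lluns/insert\n",
		__func__, count,
		(unsigned long long)div64_ul(ktime_to_ns(kt), count));

	mock_timeline_destroy(timeline);
	return 0;
}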
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v8] drm/i915: Squash repeated awaits on the same fence
  2017-04-27  9:50     ` Chris Wilson
@ 2017-04-27 11:42       ` Chris Wilson
  0 siblings, 0 replies; 95+ messages in thread
From: Chris Wilson @ 2017-04-27 11:42 UTC (permalink / raw)
  To: intel-gfx, Tvrtko Ursulin, Joonas Lahtinen

On Thu, Apr 27, 2017 at 10:50:28AM +0100, Chris Wilson wrote:
> On Thu, Apr 27, 2017 at 08:06:36AM +0100, Chris Wilson wrote:
> > +int i915_gem_timeline_mock_selftests(void)
> > +{
> > +	static const struct i915_subtest tests[] = {
> > +		SUBTEST(igt_seqmap),
> 
> I should add a few benchmarks here as well.
> 
> random insertion
> random lookup (uses same random set as insertion)
> repeated lookups of neighbouring engines
> 
> So that we can compare in situ given our simple api.

Hmm, I may be biased, but on Braswell:

idr:
bench_sync: 196699 random insertions, 515ns/insert
bench_sync: 196699 random lookups, 376ns/lookup
bench_sync: 2428021 repeated insert/lookups, 41ns/op

1<<3 ht:
bench_sync: 7857 random insertions, 12766ns/insert
bench_sync: 7857 random lookups, 12855ns/lookup
bench_sync: 2164705 repeated insert/lookups, 47ns/op

1<<7 ht:
bench_sync: 17891 random insertions, 5733ns/insert
bench_sync: 17891 random lookups, 5618ns/lookup
bench_sync: 1983086 repeated insert/lookups, 52ns/op

That is better than my expectations! Once again, take with a pinch of
salt as random insertions are totally unrealistic.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 95+ messages in thread

* [PATCH v9] drm/i915: Squash repeated awaits on the same fence
  2017-04-27  7:06   ` [PATCH v8] " Chris Wilson
  2017-04-27  7:14     ` Chris Wilson
  2017-04-27  9:50     ` Chris Wilson
@ 2017-04-27 11:48     ` Chris Wilson
  2017-04-27 16:47       ` Tvrtko Ursulin
  2017-04-28  7:41       ` [PATCH v10] " Chris Wilson
  2 siblings, 2 replies; 95+ messages in thread
From: Chris Wilson @ 2017-04-27 11:48 UTC (permalink / raw)
  To: intel-gfx

Track the latest fence waited upon on each context, and only add a new
asynchronous wait if the new fence is more recent than the recorded
fence for that context. This requires us to filter out unordered
timelines, which are noted by DMA_FENCE_NO_CONTEXT. However, in the
absence of a universal identifier, we have to use our own
i915->mm.unordered_timeline token.

v2: Throw around the debug crutches
v3: Inline the likely case of the pre-allocation cache being full.
v4: Drop the pre-allocation support, we can lose the most recent fence
in case of allocation failure -- it just means we may emit more awaits
than strictly necessary but will not break.
v5: Trim allocation size for leaf nodes, they only need an array of u32
not pointers.
v6: Create mock_timeline to tidy selftest writing
v7: s/intel_timeline_sync_get/intel_timeline_sync_is_later/ (Tvrtko)
v8: Prune the stale sync points when we idle.
v9: Include a small benchmark in the kselftests

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_gem.c                    |   1 +
 drivers/gpu/drm/i915/i915_gem_request.c            |  11 +
 drivers/gpu/drm/i915/i915_gem_timeline.c           | 314 +++++++++++++++++++++
 drivers/gpu/drm/i915/i915_gem_timeline.h           |  15 +
 drivers/gpu/drm/i915/selftests/i915_gem_timeline.c | 225 +++++++++++++++
 .../gpu/drm/i915/selftests/i915_mock_selftests.h   |   1 +
 drivers/gpu/drm/i915/selftests/mock_timeline.c     |  52 ++++
 drivers/gpu/drm/i915/selftests/mock_timeline.h     |  33 +++
 8 files changed, 652 insertions(+)
 create mode 100644 drivers/gpu/drm/i915/selftests/i915_gem_timeline.c
 create mode 100644 drivers/gpu/drm/i915/selftests/mock_timeline.c
 create mode 100644 drivers/gpu/drm/i915/selftests/mock_timeline.h

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index c1fa3c103f38..f886ef492036 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -3214,6 +3214,7 @@ i915_gem_idle_work_handler(struct work_struct *work)
 		intel_engine_disarm_breadcrumbs(engine);
 		i915_gem_batch_pool_fini(&engine->batch_pool);
 	}
+	i915_gem_timelines_mark_idle(dev_priv);
 
 	GEM_BUG_ON(!dev_priv->gt.awake);
 	dev_priv->gt.awake = false;
diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
index 5fa4e52ded06..d9f76665bc6b 100644
--- a/drivers/gpu/drm/i915/i915_gem_request.c
+++ b/drivers/gpu/drm/i915/i915_gem_request.c
@@ -772,6 +772,12 @@ i915_gem_request_await_dma_fence(struct drm_i915_gem_request *req,
 		if (fence->context == req->fence.context)
 			continue;
 
+		/* Squash repeated waits to the same timelines */
+		if (fence->context != req->i915->mm.unordered_timeline &&
+		    intel_timeline_sync_is_later(req->timeline,
+						 fence->context, fence->seqno))
+			continue;
+
 		if (dma_fence_is_i915(fence))
 			ret = i915_gem_request_await_request(req,
 							     to_request(fence));
@@ -781,6 +787,11 @@ i915_gem_request_await_dma_fence(struct drm_i915_gem_request *req,
 							    GFP_KERNEL);
 		if (ret < 0)
 			return ret;
+
+		/* Record the latest fence on each timeline */
+		if (fence->context != req->i915->mm.unordered_timeline)
+			intel_timeline_sync_set(req->timeline,
+						fence->context, fence->seqno);
 	} while (--nchild);
 
 	return 0;
diff --git a/drivers/gpu/drm/i915/i915_gem_timeline.c b/drivers/gpu/drm/i915/i915_gem_timeline.c
index b596ca7ee058..967c53a53a92 100644
--- a/drivers/gpu/drm/i915/i915_gem_timeline.c
+++ b/drivers/gpu/drm/i915/i915_gem_timeline.c
@@ -24,6 +24,276 @@
 
 #include "i915_drv.h"
 
+#define NSYNC 16
+#define SHIFT ilog2(NSYNC)
+#define MASK (NSYNC - 1)
+
+/* struct intel_timeline_sync is a layer of a radixtree that maps a u64 fence
+ * context id to the last u32 fence seqno waited upon from that context.
+ * Unlike lib/radixtree it uses a parent pointer that allows traversal back to
+ * the root. This allows us to access the whole tree via a single pointer
+ * to the most recently used layer. We expect fence contexts to be dense
+ * and most reuse to be on the same i915_gem_context but on neighbouring
+ * engines (i.e. on adjacent contexts) and reuse the same leaf, a very
+ * effective lookup cache. If the new lookup is not on the same leaf, we
+ * expect it to be on the neighbouring branch.
+ *
+ * A leaf holds an array of u32 seqno, and has height 0. The bitmap field
+ * allows us to store whether a particular seqno is valid (i.e. allows us
+ * to distinguish unset from 0).
+ *
+ * A branch holds an array of layer pointers, and has height > 0, and always
+ * has at least 2 layers (either branches or leaves) below it.
+ */
+struct intel_timeline_sync {
+	u64 prefix;
+	unsigned int height;
+	unsigned int bitmap;
+	struct intel_timeline_sync *parent;
+	/* union {
+	 *	u32 seqno;
+	 *	struct intel_timeline_sync *child;
+	 * } slot[NSYNC];
+	 */
+};
+
+static inline u32 *__sync_seqno(struct intel_timeline_sync *p)
+{
+	GEM_BUG_ON(p->height);
+	return (u32 *)(p + 1);
+}
+
+static inline struct intel_timeline_sync **
+__sync_child(struct intel_timeline_sync *p)
+{
+	GEM_BUG_ON(!p->height);
+	return (struct intel_timeline_sync **)(p + 1);
+}
+
+static inline unsigned int
+__sync_idx(const struct intel_timeline_sync *p, u64 id)
+{
+	return (id >> p->height) & MASK;
+}
+
+static void __sync_free(struct intel_timeline_sync *p)
+{
+	if (p->height) {
+		unsigned int i;
+
+		while ((i = ffs(p->bitmap))) {
+			p->bitmap &= ~0u << i;
+			__sync_free(__sync_child(p)[i - 1]);
+		}
+	}
+
+	kfree(p);
+}
+
+static void sync_free(struct intel_timeline_sync *sync)
+{
+	if (!sync)
+		return;
+
+	while (sync->parent)
+		sync = sync->parent;
+
+	__sync_free(sync);
+}
+
+/** intel_timeline_sync_is_later -- compare against the last known sync point
+ * @tl - the @intel_timeline
+ * @id - the context id (other timeline) we are synchronising to
+ * @seqno - the sequence number along the other timeline
+ *
+ * If we have already synchronised this @tl with another (@id) then we can
+ * omit any repeated or earlier synchronisation requests. If the two timelines
+ * are already coupled, we can also omit the dependency between the two as that
+ * is already known via the timeline.
+ *
+ * Returns true if the two timelines are already synchronised wrt to @seqno,
+ * false if not and the synchronisation must be emitted.
+ */
+bool intel_timeline_sync_is_later(struct intel_timeline *tl, u64 id, u32 seqno)
+{
+	struct intel_timeline_sync *p;
+	unsigned int idx;
+
+	p = tl->sync;
+	if (!p)
+		return false;
+
+	if (likely((id >> SHIFT) == p->prefix))
+		goto found;
+
+	/* First climb the tree back to a parent branch */
+	do {
+		p = p->parent;
+		if (!p)
+			return false;
+
+		if ((id >> p->height >> SHIFT) == p->prefix)
+			break;
+	} while (1);
+
+	/* And then descend again until we find our leaf */
+	do {
+		if (!p->height)
+			break;
+
+		p = __sync_child(p)[__sync_idx(p, id)];
+		if (!p)
+			return false;
+
+		if ((id >> p->height >> SHIFT) != p->prefix)
+			return false;
+	} while (1);
+
+	tl->sync = p;
+found:
+	idx = id & MASK;
+	if (!(p->bitmap & BIT(idx)))
+		return false;
+
+	return i915_seqno_passed(__sync_seqno(p)[idx], seqno);
+}
+
+static noinline int
+__intel_timeline_sync_set(struct intel_timeline *tl, u64 id, u32 seqno)
+{
+	struct intel_timeline_sync *p = tl->sync;
+	unsigned int idx;
+
+	if (!p) {
+		p = kzalloc(sizeof(*p) + NSYNC * sizeof(seqno), GFP_KERNEL);
+		if (unlikely(!p))
+			return -ENOMEM;
+
+		p->prefix = id >> SHIFT;
+		goto found;
+	}
+
+	/* Climb back up the tree until we find a common prefix */
+	do {
+		if (!p->parent)
+			break;
+
+		p = p->parent;
+
+		if ((id >> p->height >> SHIFT) == p->prefix)
+			break;
+	} while (1);
+
+	/* No shortcut, we have to descend the tree to find the right layer
+	 * containing this fence.
+	 *
+	 * Each layer in the tree holds 16 (NSYNC) pointers, either fences
+	 * or lower layers. Leaf nodes (height = 0) contain the fences, all
+	 * other nodes (height > 0) are internal layers that point to a lower
+	 * node. Each internal layer has at least 2 descendants.
+	 *
+	 * Starting at the top, we check whether the current prefix matches. If
+	 * it doesn't, we have gone past our layer and need to insert a join
+	 * into the tree, and a new leaf node as a descendant as well as the
+	 * original layer.
+	 *
+	 * The matching prefix means we are still following the right branch
+	 * of the tree. If it has height 0, we have found our leaf and just
+	 * need to replace the fence slot with ourselves. If the height is
+	 * not zero, our slot contains the next layer in the tree (unless
+	 * it is empty, in which case we can add ourselves as a new leaf).
+	 * As we descend the tree, the prefix grows (and height decreases).
+	 */
+	do {
+		struct intel_timeline_sync *next;
+
+		if ((id >> p->height >> SHIFT) != p->prefix) {
+			/* insert a join above the current layer */
+			next = kzalloc(sizeof(*next) + NSYNC * sizeof(next),
+				       GFP_KERNEL);
+			if (unlikely(!next))
+				return -ENOMEM;
+
+			next->height = ALIGN(fls64((id >> p->height >> SHIFT) ^ p->prefix),
+					    SHIFT) + p->height;
+			next->prefix = id >> next->height >> SHIFT;
+
+			if (p->parent)
+				__sync_child(p->parent)[__sync_idx(p->parent, id)] = next;
+			next->parent = p->parent;
+
+			idx = p->prefix >> (next->height - p->height - SHIFT) & MASK;
+			__sync_child(next)[idx] = p;
+			next->bitmap |= BIT(idx);
+			p->parent = next;
+
+			/* ascend to the join */
+			p = next;
+		} else {
+			if (!p->height)
+				break;
+		}
+
+		/* descend into the next layer */
+		GEM_BUG_ON(!p->height);
+		idx = __sync_idx(p, id);
+		next = __sync_child(p)[idx];
+		if (unlikely(!next)) {
+			next = kzalloc(sizeof(*next) + NSYNC * sizeof(seqno),
+				       GFP_KERNEL);
+			if (unlikely(!next))
+				return -ENOMEM;
+
+			__sync_child(p)[idx] = next;
+			p->bitmap |= BIT(idx);
+			next->parent = p;
+			next->prefix = id >> SHIFT;
+
+			p = next;
+			break;
+		}
+
+		p = next;
+	} while (1);
+
+found:
+	GEM_BUG_ON(p->height);
+	GEM_BUG_ON(p->prefix != id >> SHIFT);
+	tl->sync = p;
+	idx = id & MASK;
+	__sync_seqno(p)[idx] = seqno;
+	p->bitmap |= BIT(idx);
+	return 0;
+}
+
+/** intel_timeline_sync_set -- mark the most recent syncpoint between contexts
+ * @tl - the @intel_timeline
+ * @id - the context id (other timeline) we have synchronised to
+ * @seqno - the sequence number along the other timeline
+ *
+ * When we synchronise this @tl with another (@id), we also know that we have
+ * synchronized with all previous seqno along that timeline. If we then have
+ * a request to synchronise with the same seqno or older, we can omit it,
+ * see intel_timeline_sync_is_later()
+ */
+int intel_timeline_sync_set(struct intel_timeline *tl, u64 id, u32 seqno)
+{
+	struct intel_timeline_sync *p = tl->sync;
+
+	/* We expect to be called in sequence following a  _get(id), which
+	 * should have preloaded the tl->sync hint for us.
+	 */
+	if (likely(p && (id >> SHIFT) == p->prefix)) {
+		unsigned int idx = id & MASK;
+
+		__sync_seqno(p)[idx] = seqno;
+		p->bitmap |= BIT(idx);
+		return 0;
+	}
+
+	return __intel_timeline_sync_set(tl, id, seqno);
+}
+
 static int __i915_gem_timeline_init(struct drm_i915_private *i915,
 				    struct i915_gem_timeline *timeline,
 				    const char *name,
@@ -35,6 +305,12 @@ static int __i915_gem_timeline_init(struct drm_i915_private *i915,
 
 	lockdep_assert_held(&i915->drm.struct_mutex);
 
+	/* Ideally we want a set of engines on a single leaf as we expect
+	 * to mostly be tracking synchronisation between engines.
+	 */
+	BUILD_BUG_ON(NSYNC < I915_NUM_ENGINES);
+	BUILD_BUG_ON(NSYNC > BITS_PER_BYTE * sizeof(timeline->engine[0].sync->bitmap));
+
 	timeline->i915 = i915;
 	timeline->name = kstrdup(name ?: "[kernel]", GFP_KERNEL);
 	if (!timeline->name)
@@ -81,6 +357,37 @@ int i915_gem_timeline_init__global(struct drm_i915_private *i915)
 					&class, "&global_timeline->lock");
 }
 
+/** i915_gem_timelines_mark_idle -- called when the driver idles
+ * @i915 - the drm_i915_private device
+ *
+ * When the driver is completely idle, we know that all of our sync points
+ * have been signaled and our tracking is then entirely redundant. Any request
+ * to wait upon an older sync point will be completed instantly as we know
+ * the fence is signaled and therefore we will not even look them up in the
+ * sync point map.
+ */
+void i915_gem_timelines_mark_idle(struct drm_i915_private *i915)
+{
+	struct i915_gem_timeline *timeline;
+	int i;
+
+	lockdep_assert_held(&i915->drm.struct_mutex);
+
+	list_for_each_entry(timeline, &i915->gt.timelines, link) {
+		for (i = 0; i < ARRAY_SIZE(timeline->engine); i++) {
+			struct intel_timeline *tl = &timeline->engine[i];
+
+			/* All known fences are completed so we can scrap
+			 * the current sync point tracking and start afresh,
+			 * any attempt to wait upon a previous sync point
+			 * will be skipped as the fence was signaled.
+			 */
+			sync_free(tl->sync);
+			tl->sync = NULL;
+		}
+	}
+}
+
 void i915_gem_timeline_fini(struct i915_gem_timeline *timeline)
 {
 	int i;
@@ -91,8 +398,15 @@ void i915_gem_timeline_fini(struct i915_gem_timeline *timeline)
 		struct intel_timeline *tl = &timeline->engine[i];
 
 		GEM_BUG_ON(!list_empty(&tl->requests));
+
+		sync_free(tl->sync);
 	}
 
 	list_del(&timeline->link);
 	kfree(timeline->name);
 }
+
+#if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
+#include "selftests/mock_timeline.c"
+#include "selftests/i915_gem_timeline.c"
+#endif
diff --git a/drivers/gpu/drm/i915/i915_gem_timeline.h b/drivers/gpu/drm/i915/i915_gem_timeline.h
index 6c53e14cab2a..e16a62bc21e6 100644
--- a/drivers/gpu/drm/i915/i915_gem_timeline.h
+++ b/drivers/gpu/drm/i915/i915_gem_timeline.h
@@ -26,10 +26,13 @@
 #define I915_GEM_TIMELINE_H
 
 #include <linux/list.h>
+#include <linux/radix-tree.h>
 
+#include "i915_utils.h"
 #include "i915_gem_request.h"
 
 struct i915_gem_timeline;
+struct intel_timeline_sync;
 
 struct intel_timeline {
 	u64 fence_context;
@@ -55,6 +58,14 @@ struct intel_timeline {
 	 * struct_mutex.
 	 */
 	struct i915_gem_active last_request;
+
+	/* We track the most recent seqno that we wait on in every context so
+	 * that we only have to emit a new await and dependency on a more
+	 * recent sync point. As the contexts may execute out-of-order, we
+	 * have to track each individually and cannot rely on an absolute
+	 * global_seqno.
+	 */
+	struct intel_timeline_sync *sync;
 	u32 sync_seqno[I915_NUM_ENGINES];
 
 	struct i915_gem_timeline *common;
@@ -73,6 +84,10 @@ int i915_gem_timeline_init(struct drm_i915_private *i915,
 			   struct i915_gem_timeline *tl,
 			   const char *name);
 int i915_gem_timeline_init__global(struct drm_i915_private *i915);
+void i915_gem_timelines_mark_idle(struct drm_i915_private *i915);
 void i915_gem_timeline_fini(struct i915_gem_timeline *tl);
 
+int intel_timeline_sync_set(struct intel_timeline *tl, u64 id, u32 seqno);
+bool intel_timeline_sync_is_later(struct intel_timeline *tl, u64 id, u32 seqno);
+
 #endif
diff --git a/drivers/gpu/drm/i915/selftests/i915_gem_timeline.c b/drivers/gpu/drm/i915/selftests/i915_gem_timeline.c
new file mode 100644
index 000000000000..66b4c24b0c26
--- /dev/null
+++ b/drivers/gpu/drm/i915/selftests/i915_gem_timeline.c
@@ -0,0 +1,225 @@
+/*
+ * Copyright © 2017 Intel Corporation
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the next
+ * paragraph) shall be included in all copies or substantial portions of the
+ * Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ *
+ */
+
+#include <linux/random.h>
+
+#include "../i915_selftest.h"
+#include "mock_gem_device.h"
+#include "mock_timeline.h"
+
+static int igt_sync(void *arg)
+{
+	struct drm_i915_private *i915 = arg;
+	const struct {
+		const char *name;
+		u32 seqno;
+		bool expected;
+		bool set;
+	} pass[] = {
+		{ "unset", 0, false, false },
+		{ "new", 0, false, true },
+		{ "0a", 0, true, true },
+		{ "1a", 1, false, true },
+		{ "1b", 1, true, true },
+		{ "0b", 0, true, false },
+		{ "2a", 2, false, true },
+		{ "4", 4, false, true },
+		{ "INT_MAX", INT_MAX, false, true },
+		{ "INT_MAX-1", INT_MAX-1, true, false },
+		{ "INT_MAX+1", (u32)INT_MAX+1, false, true },
+		{ "INT_MAX", INT_MAX, true, false },
+		{ "UINT_MAX", UINT_MAX, false, true },
+		{ "wrap", 0, false, true },
+		{ "unwrap", UINT_MAX, true, false },
+		{},
+	}, *p;
+	struct i915_gem_timeline *timeline;
+	struct intel_timeline *tl;
+	int order, offset;
+	int ret;
+
+	timeline = mock_timeline(i915);
+	if (!timeline)
+		return -ENOMEM;
+
+	tl = &timeline->engine[RCS];
+	for (p = pass; p->name; p++) {
+		for (order = 1; order < 64; order++) {
+			for (offset = -1; offset <= (order > 1); offset++) {
+				u64 ctx = BIT_ULL(order) + offset;
+
+				if (intel_timeline_sync_is_later
+				    (tl, ctx, p->seqno) != p->expected) {
+					pr_err("1: %s(ctx=%llu, seqno=%u) expected passed %s but failed\n",
+					       p->name, ctx, p->seqno, yesno(p->expected));
+					ret = -EINVAL;
+					goto out;
+				}
+
+				if (p->set) {
+					ret = intel_timeline_sync_set(tl, ctx, p->seqno);
+					if (ret)
+						goto out;
+				}
+			}
+		}
+	}
+
+	tl = &timeline->engine[BCS];
+	for (order = 1; order < 64; order++) {
+		for (offset = -1; offset <= (order > 1); offset++) {
+			u64 ctx = BIT_ULL(order) + offset;
+
+			for (p = pass; p->name; p++) {
+				if (intel_timeline_sync_is_later
+				    (tl, ctx, p->seqno) != p->expected) {
+					pr_err("2: %s(ctx=%llu, seqno=%u) expected passed %s but failed\n",
+					       p->name, ctx, p->seqno, yesno(p->expected));
+					ret = -EINVAL;
+					goto out;
+				}
+
+				if (p->set) {
+					ret = intel_timeline_sync_set(tl, ctx, p->seqno);
+					if (ret)
+						goto out;
+				}
+			}
+		}
+	}
+
+out:
+	mock_timeline_destroy(timeline);
+	return ret;
+}
+
+static u64 prandom_u64_state(struct rnd_state *rnd)
+{
+	u64 x;
+
+	x = prandom_u32_state(rnd);
+	x <<= 32;
+	x |= prandom_u32_state(rnd);
+
+	return x;
+}
+
+static unsigned int random_engine(struct rnd_state *rnd)
+{
+	return ((u64)prandom_u32_state(rnd) * I915_NUM_ENGINES) >> 32;
+}
+
+static int bench_sync(void *arg)
+{
+	struct drm_i915_private *i915 = arg;
+	struct rnd_state prng;
+	struct i915_gem_timeline *timeline;
+	struct intel_timeline *tl;
+	unsigned long end_time, count;
+	ktime_t kt;
+	int ret;
+
+	timeline = mock_timeline(i915);
+	if (!timeline)
+		return -ENOMEM;
+
+	prandom_seed_state(&prng, i915_selftest.random_seed);
+	tl = &timeline->engine[RCS];
+
+	count = 0;
+	kt = -ktime_get();
+	end_time = jiffies + HZ/10;
+	do {
+		u64 id = prandom_u64_state(&prng);
+
+		intel_timeline_sync_set(tl, id, 0);
+		count++;
+	} while (!time_after(jiffies, end_time));
+	kt = ktime_add(ktime_get(), kt);
+
+	pr_info("%s: %lu random insertions, %lluns/insert\n",
+		__func__, count, (long long)div64_ul(ktime_to_ns(kt), count));
+
+	prandom_seed_state(&prng, i915_selftest.random_seed);
+
+	end_time = count;
+	kt = -ktime_get();
+	while (end_time--) {
+		u64 id = prandom_u64_state(&prng);
+
+		if (!intel_timeline_sync_is_later(tl, id, 0)) {
+			pr_err("Lookup of %llu failed\n", id);
+			ret = -EINVAL;
+			goto out;
+		}
+	}
+	kt = ktime_add(ktime_get(), kt);
+
+	pr_info("%s: %lu random lookups, %lluns/lookup\n",
+		__func__, count, (long long)div64_ul(ktime_to_ns(kt), count));
+
+	prandom_seed_state(&prng, i915_selftest.random_seed);
+	tl = &timeline->engine[BCS];
+
+	count = 0;
+	kt = -ktime_get();
+	end_time = jiffies + HZ/10;
+	do {
+		u32 id = random_engine(&prng);
+		u32 seqno = prandom_u32_state(&prng);
+
+		if (!intel_timeline_sync_is_later(tl, id, seqno))
+			intel_timeline_sync_set(tl, id, seqno);
+
+		count++;
+	} while (!time_after(jiffies, end_time));
+	kt = ktime_add(ktime_get(), kt);
+
+	pr_info("%s: %lu repeated insert/lookups, %lluns/op\n",
+		__func__, count, (long long)div64_ul(ktime_to_ns(kt), count));
+
+	ret = 0;
+out:
+	mock_timeline_destroy(timeline);
+	return ret;
+}
+
+int i915_gem_timeline_mock_selftests(void)
+{
+	static const struct i915_subtest tests[] = {
+		SUBTEST(igt_sync),
+		SUBTEST(bench_sync),
+	};
+	struct drm_i915_private *i915;
+	int err;
+
+	i915 = mock_gem_device();
+	if (!i915)
+		return -ENOMEM;
+
+	err = i915_subtests(tests, i915);
+	drm_dev_unref(&i915->drm);
+
+	return err;
+}
diff --git a/drivers/gpu/drm/i915/selftests/i915_mock_selftests.h b/drivers/gpu/drm/i915/selftests/i915_mock_selftests.h
index be9a9ebf5692..8d0f50c25df8 100644
--- a/drivers/gpu/drm/i915/selftests/i915_mock_selftests.h
+++ b/drivers/gpu/drm/i915/selftests/i915_mock_selftests.h
@@ -12,6 +12,7 @@ selftest(sanitycheck, i915_mock_sanitycheck) /* keep first (igt selfcheck) */
 selftest(scatterlist, scatterlist_mock_selftests)
 selftest(uncore, intel_uncore_mock_selftests)
 selftest(breadcrumbs, intel_breadcrumbs_mock_selftests)
+selftest(timelines, i915_gem_timeline_mock_selftests)
 selftest(requests, i915_gem_request_mock_selftests)
 selftest(objects, i915_gem_object_mock_selftests)
 selftest(dmabuf, i915_gem_dmabuf_mock_selftests)
diff --git a/drivers/gpu/drm/i915/selftests/mock_timeline.c b/drivers/gpu/drm/i915/selftests/mock_timeline.c
new file mode 100644
index 000000000000..e8d62f5f6ed3
--- /dev/null
+++ b/drivers/gpu/drm/i915/selftests/mock_timeline.c
@@ -0,0 +1,52 @@
+/*
+ * Copyright © 2017 Intel Corporation
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the next
+ * paragraph) shall be included in all copies or substantial portions of the
+ * Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ *
+ */
+
+#include "mock_timeline.h"
+
+struct i915_gem_timeline *
+mock_timeline(struct drm_i915_private *i915)
+{
+	struct i915_gem_timeline *timeline;
+
+	timeline = kzalloc(sizeof(*timeline), GFP_KERNEL);
+	if (!timeline)
+		return NULL;
+
+	mutex_lock(&i915->drm.struct_mutex);
+	i915_gem_timeline_init(i915, timeline, "mock");
+	mutex_unlock(&i915->drm.struct_mutex);
+
+	return timeline;
+}
+
+void mock_timeline_destroy(struct i915_gem_timeline *timeline)
+{
+	struct drm_i915_private *i915 = timeline->i915;
+
+	mutex_lock(&i915->drm.struct_mutex);
+	i915_gem_timeline_fini(timeline);
+	mutex_unlock(&i915->drm.struct_mutex);
+
+	kfree(timeline);
+}
diff --git a/drivers/gpu/drm/i915/selftests/mock_timeline.h b/drivers/gpu/drm/i915/selftests/mock_timeline.h
new file mode 100644
index 000000000000..b33dcd2151ef
--- /dev/null
+++ b/drivers/gpu/drm/i915/selftests/mock_timeline.h
@@ -0,0 +1,33 @@
+/*
+ * Copyright © 2017 Intel Corporation
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the next
+ * paragraph) shall be included in all copies or substantial portions of the
+ * Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ *
+ */
+
+#ifndef __MOCK_TIMELINE__
+#define __MOCK_TIMELINE__
+
+#include "../i915_gem_timeline.h"
+
+struct i915_gem_timeline *mock_timeline(struct drm_i915_private *i915);
+void mock_timeline_destroy(struct i915_gem_timeline *timeline);
+
+#endif /* !__MOCK_TIMELINE__ */
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 95+ messages in thread

* Re: [PATCH 13/27] drm/i915/execlists: Pack the count into the low bits of the port.request
  2017-04-20 14:58   ` Tvrtko Ursulin
@ 2017-04-27 14:37     ` Chris Wilson
  2017-04-28 12:02       ` Tvrtko Ursulin
  0 siblings, 1 reply; 95+ messages in thread
From: Chris Wilson @ 2017-04-27 14:37 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: intel-gfx, Mika Kuoppala

On Thu, Apr 20, 2017 at 03:58:19PM +0100, Tvrtko Ursulin wrote:
> > static void record_context(struct drm_i915_error_context *e,
> >diff --git a/drivers/gpu/drm/i915/i915_guc_submission.c b/drivers/gpu/drm/i915/i915_guc_submission.c
> >index 1642fff9cf13..370373c97b81 100644
> >--- a/drivers/gpu/drm/i915/i915_guc_submission.c
> >+++ b/drivers/gpu/drm/i915/i915_guc_submission.c
> >@@ -658,7 +658,7 @@ static void nested_enable_signaling(struct drm_i915_gem_request *rq)
> > static bool i915_guc_dequeue(struct intel_engine_cs *engine)
> > {
> > 	struct execlist_port *port = engine->execlist_port;
> >-	struct drm_i915_gem_request *last = port[0].request;
> >+	struct drm_i915_gem_request *last = port[0].request_count;
> 
> It's confusing that in this new scheme sometimes we have direct
> access to the request and sometimes we have to go through the
> port_request macro.
> 
> So maybe we should always use the port_request macro. Hm, could we
> invent a new type to help enforce that? Like:
> 
> struct drm_i915_gem_port_request_slot {
> 	struct drm_i915_gem_request *req_count;
> };
> 
> And then execlist port would contain these and helpers would need to
> be functions?
> 
> I've also noticed some GVT/GuC patches which sounded like they are
> adding the same single submission constraints so maybe now is the
> time to unify the dequeue? (Haven't looked at those patches deeper
> than the subject line so might be wrong.)
> 
> Not sure 100% of all the above, would need to sketch it. What are
> your thoughts?

I foresee a use for the count in guc as well, so conversion is ok with
me.
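
Something like this, perhaps -- only a rough sketch, and the mask width
and helper bodies below are guesses rather than anything posted:

	struct drm_i915_gem_port_request_slot {
		/* request pointer with the submission count packed into
		 * the low bits left free by pointer alignment; only ever
		 * touched via the helpers
		 */
		unsigned long req_count;
	};

	static inline struct drm_i915_gem_request *
	port_request(const struct drm_i915_gem_port_request_slot *slot)
	{
		return (struct drm_i915_gem_request *)(slot->req_count & ~3ul);
	}

	static inline unsigned int
	port_count(const struct drm_i915_gem_port_request_slot *slot)
	{
		return slot->req_count & 3;
	}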

> >diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> >index 7df278fe492e..69299fbab4f9 100644
> >--- a/drivers/gpu/drm/i915/intel_lrc.c
> >+++ b/drivers/gpu/drm/i915/intel_lrc.c
> >@@ -342,39 +342,32 @@ static u64 execlists_update_context(struct drm_i915_gem_request *rq)
> >
> > static void execlists_submit_ports(struct intel_engine_cs *engine)
> > {
> >-	struct drm_i915_private *dev_priv = engine->i915;
> > 	struct execlist_port *port = engine->execlist_port;
> > 	u32 __iomem *elsp =
> >-		dev_priv->regs + i915_mmio_reg_offset(RING_ELSP(engine));
> >-	u64 desc[2];
> >-
> >-	GEM_BUG_ON(port[0].count > 1);
> >-	if (!port[0].count)
> >-		execlists_context_status_change(port[0].request,
> >-						INTEL_CONTEXT_SCHEDULE_IN);
> >-	desc[0] = execlists_update_context(port[0].request);
> >-	GEM_DEBUG_EXEC(port[0].context_id = upper_32_bits(desc[0]));
> >-	port[0].count++;
> >-
> >-	if (port[1].request) {
> >-		GEM_BUG_ON(port[1].count);
> >-		execlists_context_status_change(port[1].request,
> >-						INTEL_CONTEXT_SCHEDULE_IN);
> >-		desc[1] = execlists_update_context(port[1].request);
> >-		GEM_DEBUG_EXEC(port[1].context_id = upper_32_bits(desc[1]));
> >-		port[1].count = 1;
> >-	} else {
> >-		desc[1] = 0;
> >-	}
> >-	GEM_BUG_ON(desc[0] == desc[1]);
> >-
> >-	/* You must always write both descriptors in the order below. */
> >-	writel(upper_32_bits(desc[1]), elsp);
> >-	writel(lower_32_bits(desc[1]), elsp);
> >+		engine->i915->regs + i915_mmio_reg_offset(RING_ELSP(engine));
> >+	unsigned int n;
> >+
> >+	for (n = ARRAY_SIZE(engine->execlist_port); n--; ) {
> 
> We could also add for_each_req_port or something, to iterate and
> unpack either req only or the count as well?

for_each_port_reverse? We're looking at very special cases here!

I'm not sure and I'm playing with different structures.
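
Should it ever make sense, the iterator itself would be trivial; a
sketch, assuming the fixed-size port array stays in the engine:

	#define for_each_port_reverse(port__, engine__) \
		for ((port__) = &(engine__)->execlist_port[ \
				ARRAY_SIZE((engine__)->execlist_port) - 1]; \
		     (port__) >= (engine__)->execlist_port; \
		     (port__)--)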

> >diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
> >index d25b88467e5e..39b733e5cfd3 100644
> >--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
> >+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
> >@@ -377,8 +377,12 @@ struct intel_engine_cs {
> > 	/* Execlists */
> > 	struct tasklet_struct irq_tasklet;
> > 	struct execlist_port {
> >-		struct drm_i915_gem_request *request;
> >-		unsigned int count;
> >+		struct drm_i915_gem_request *request_count;
> 
> Would req(uest)_slot maybe be better?

It's definitely a count (of how many times this request has been
submitted), and I like long verbose names when I don't want them to be
used directly. So expect guc to be tidied.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v9] drm/i915: Squash repeated awaits on the same fence
  2017-04-27 11:48     ` [PATCH v9] " Chris Wilson
@ 2017-04-27 16:47       ` Tvrtko Ursulin
  2017-04-27 17:25         ` Chris Wilson
  2017-04-28  7:41       ` [PATCH v10] " Chris Wilson
  1 sibling, 1 reply; 95+ messages in thread
From: Tvrtko Ursulin @ 2017-04-27 16:47 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 27/04/2017 12:48, Chris Wilson wrote:
> Track the latest fence waited upon on each context, and only add a new
> asynchronous wait if the new fence is more recent than the recorded
> fence for that context. This requires us to filter out unordered
> timelines, which are noted by DMA_FENCE_NO_CONTEXT. However, in the
> absence of a universal identifier, we have to use our own
> i915->mm.unordered_timeline token.
>
> v2: Throw around the debug crutches
> v3: Inline the likely case of the pre-allocation cache being full.
> v4: Drop the pre-allocation support, we can lose the most recent fence
> in case of allocation failure -- it just means we may emit more awaits
> than strictly necessary but will not break.
> v5: Trim allocation size for leaf nodes, they only need an array of u32
> not pointers.
> v6: Create mock_timeline to tidy selftest writing
> v7: s/intel_timeline_sync_get/intel_timeline_sync_is_later/ (Tvrtko)
> v8: Prune the stale sync points when we idle.
> v9: Include a small benchmark in the kselftests
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> ---
>  drivers/gpu/drm/i915/i915_gem.c                    |   1 +
>  drivers/gpu/drm/i915/i915_gem_request.c            |  11 +
>  drivers/gpu/drm/i915/i915_gem_timeline.c           | 314 +++++++++++++++++++++
>  drivers/gpu/drm/i915/i915_gem_timeline.h           |  15 +
>  drivers/gpu/drm/i915/selftests/i915_gem_timeline.c | 225 +++++++++++++++
>  .../gpu/drm/i915/selftests/i915_mock_selftests.h   |   1 +
>  drivers/gpu/drm/i915/selftests/mock_timeline.c     |  52 ++++
>  drivers/gpu/drm/i915/selftests/mock_timeline.h     |  33 +++
>  8 files changed, 652 insertions(+)
>  create mode 100644 drivers/gpu/drm/i915/selftests/i915_gem_timeline.c
>  create mode 100644 drivers/gpu/drm/i915/selftests/mock_timeline.c
>  create mode 100644 drivers/gpu/drm/i915/selftests/mock_timeline.h
>
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index c1fa3c103f38..f886ef492036 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -3214,6 +3214,7 @@ i915_gem_idle_work_handler(struct work_struct *work)
>  		intel_engine_disarm_breadcrumbs(engine);
>  		i915_gem_batch_pool_fini(&engine->batch_pool);
>  	}
> +	i915_gem_timelines_mark_idle(dev_priv);
>
>  	GEM_BUG_ON(!dev_priv->gt.awake);
>  	dev_priv->gt.awake = false;
> diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
> index 5fa4e52ded06..d9f76665bc6b 100644
> --- a/drivers/gpu/drm/i915/i915_gem_request.c
> +++ b/drivers/gpu/drm/i915/i915_gem_request.c
> @@ -772,6 +772,12 @@ i915_gem_request_await_dma_fence(struct drm_i915_gem_request *req,
>  		if (fence->context == req->fence.context)
>  			continue;
>
> +		/* Squash repeated waits to the same timelines */
> +		if (fence->context != req->i915->mm.unordered_timeline &&
> +		    intel_timeline_sync_is_later(req->timeline,
> +						 fence->context, fence->seqno))
> +			continue;

Wrong base?

> +
>  		if (dma_fence_is_i915(fence))
>  			ret = i915_gem_request_await_request(req,
>  							     to_request(fence));
> @@ -781,6 +787,11 @@ i915_gem_request_await_dma_fence(struct drm_i915_gem_request *req,
>  							    GFP_KERNEL);
>  		if (ret < 0)
>  			return ret;
> +
> +		/* Record the most latest fence on each timeline */
> +		if (fence->context != req->i915->mm.unordered_timeline)
> +			intel_timeline_sync_set(req->timeline,
> +						fence->context, fence->seqno);
>  	} while (--nchild);
>
>  	return 0;
> diff --git a/drivers/gpu/drm/i915/i915_gem_timeline.c b/drivers/gpu/drm/i915/i915_gem_timeline.c
> index b596ca7ee058..967c53a53a92 100644
> --- a/drivers/gpu/drm/i915/i915_gem_timeline.c
> +++ b/drivers/gpu/drm/i915/i915_gem_timeline.c
> @@ -24,6 +24,276 @@
>
>  #include "i915_drv.h"
>
> +#define NSYNC 16
> +#define SHIFT ilog2(NSYNC)
> +#define MASK (NSYNC - 1)
> +
> +/* struct intel_timeline_sync is a layer of a radixtree that maps a u64 fence
> + * context id to the last u32 fence seqno waited upon from that context.
> + * Unlike lib/radixtree it uses a parent pointer that allows traversal back to
> + * the root. This allows us to access the whole tree via a single pointer
> + * to the most recently used layer. We expect fence contexts to be dense
> + * and most reuse to be on the same i915_gem_context but on neighbouring
> + * engines (i.e. on adjacent contexts) and reuse the same leaf, a very
> + * effective lookup cache. If the new lookup is not on the same leaf, we
> + * expect it to be on the neighbouring branch.
> + *
> + * A leaf holds an array of u32 seqno, and has height 0. The bitmap field
> + * allows us to store whether a particular seqno is valid (i.e. allows us
> + * to distinguish unset from 0).
> + *
> + * A branch holds an array of layer pointers, and has height > 0, and always
> + * has at least 2 layers (either branches or leaves) below it.
> + */
> +struct intel_timeline_sync {
> +	u64 prefix;
> +	unsigned int height;
> +	unsigned int bitmap;

u16 would be enough for the bitmap since NSYNC == 16? To no benefit 
though. Maybe just add a
BUILD_BUG_ON(sizeof(p->bitmap) * BITS_PER_BYTE >= NSYNC) somewhere?

> +	struct intel_timeline_sync *parent;
> +	/* union {
> +	 *	u32 seqno;
> +	 *	struct intel_timeline_sync *child;
> +	 * } slot[NSYNC];
> +	 */

Put a note saying this comment describes what follows after struct 
intel_timeline_sync.

Would "union { ... } slot[0];" work as a marker and have any benefit to
the readability of the code below?

You could save some bytes (64 I think) for the leaf nodes if you did
something like:

	union {
		u32 seqno[NSYNC];
		struct intel_timeline_sync *child[NSYNC];
	};

Although I think it conflicts with the slot marker idea. Hm, no actually 
it doesn't. You could have both union members as simply markers.

	union {
		u32 seqno[];
		struct intel_timeline_sync *child[];
	};

Again, not sure yet if it would make that much better readability.

> +};
> +
> +static inline u32 *__sync_seqno(struct intel_timeline_sync *p)
> +{
> +	GEM_BUG_ON(p->height);
> +	return (u32 *)(p + 1);
> +}
> +
> +static inline struct intel_timeline_sync **
> +__sync_child(struct intel_timeline_sync *p)
> +{
> +	GEM_BUG_ON(!p->height);
> +	return (struct intel_timeline_sync **)(p + 1);
> +}
> +
> +static inline unsigned int
> +__sync_idx(const struct intel_timeline_sync *p, u64 id)
> +{
> +	return (id >> p->height) & MASK;
> +}
> +
> +static void __sync_free(struct intel_timeline_sync *p)
> +{
> +	if (p->height) {
> +		unsigned int i;
> +
> +		while ((i = ffs(p->bitmap))) {
> +			p->bitmap &= ~0u << i;
> +			__sync_free(__sync_child(p)[i - 1]);

Maximum height is 64 for this tree so here there is no danger of stack 
overflow?

> +		}
> +	}
> +
> +	kfree(p);
> +}
> +
> +static void sync_free(struct intel_timeline_sync *sync)
> +{
> +	if (!sync)
> +		return;
> +
> +	while (sync->parent)
> +		sync = sync->parent;
> +
> +	__sync_free(sync);
> +}
> +
> +/** intel_timeline_sync_is_later -- compare against the last know sync point
> + * @tl - the @intel_timeline
> + * @id - the context id (other timeline) we are synchronising to
> + * @seqno - the sequence number along the other timeline
> + *
> + * If we have already synchronised this @tl with another (@id) then we can
> + * omit any repeated or earlier synchronisation requests. If the two timelines
> + * are already coupled, we can also omit the dependency between the two as that
> + * is already known via the timeline.
> + *
> + * Returns true if the two timelines are already synchronised wrt to @seqno,
> + * false if not and the synchronisation must be emitted.
> + */
> +bool intel_timeline_sync_is_later(struct intel_timeline *tl, u64 id, u32 seqno)
> +{
> +	struct intel_timeline_sync *p;
> +	unsigned int idx;
> +
> +	p = tl->sync;
> +	if (!p)
> +		return false;
> +
> +	if (likely((id >> SHIFT) == p->prefix))
> +		goto found;
> +
> +	/* First climb the tree back to a parent branch */
> +	do {
> +		p = p->parent;
> +		if (!p)
> +			return false;
> +
> +		if ((id >> p->height >> SHIFT) == p->prefix)

Worth having "id >> p->height >> SHIFT" as a macro for better readability?

> +			break;
> +	} while (1);
> +
> +	/* And then descend again until we find our leaf */
> +	do {
> +		if (!p->height)
> +			break;
> +
> +		p = __sync_child(p)[__sync_idx(p, id)];
> +		if (!p)
> +			return false;
> +
> +		if ((id >> p->height >> SHIFT) != p->prefix)
> +			return false;

Is this possible or a GEM_BUG_ON? Maybe I am not understanding it, but I 
thought it would be __sync_child slot had unexpected prefix in it?

> +	} while (1);
> +
> +	tl->sync = p;
> +found:
> +	idx = id & MASK;
> +	if (!(p->bitmap & BIT(idx)))
> +		return false;
> +
> +	return i915_seqno_passed(__sync_seqno(p)[idx], seqno);
> +}
> +
> +static noinline int
> +__intel_timeline_sync_set(struct intel_timeline *tl, u64 id, u32 seqno)
> +{
> +	struct intel_timeline_sync *p = tl->sync;
> +	unsigned int idx;
> +
> +	if (!p) {
> +		p = kzalloc(sizeof(*p) + NSYNC * sizeof(seqno), GFP_KERNEL);
> +		if (unlikely(!p))
> +			return -ENOMEM;
> +
> +		p->prefix = id >> SHIFT;
> +		goto found;
> +	}
> +
> +	/* Climb back up the tree until we find a common prefix */
> +	do {
> +		if (!p->parent)
> +			break;
> +
> +		p = p->parent;
> +
> +		if ((id >> p->height >> SHIFT) == p->prefix)
> +			break;
> +	} while (1);

__climb_back_to_prefix(p, id) as a helper since it is used in the lookup 
as well?

> +
> +	/* No shortcut, we have to descend the tree to find the right layer
> +	 * containing this fence.
> +	 *
> +	 * Each layer in the tree holds 16 (NSYNC) pointers, either fences
> +	 * or lower layers. Leaf nodes (height = 0) contain the fences, all
> +	 * other nodes (height > 0) are internal layers that point to a lower
> +	 * node. Each internal layer has at least 2 descendents.
> +	 *
> +	 * Starting at the top, we check whether the current prefix matches. If
> +	 * it doesn't, we have gone passed our layer and need to insert a join
> +	 * into the tree, and a new leaf node as a descendent as well as the
> +	 * original layer.
> +	 *
> +	 * The matching prefix means we are still following the right branch
> +	 * of the tree. If it has height 0, we have found our leaf and just
> +	 * need to replace the fence slot with ourselves. If the height is
> +	 * not zero, our slot contains the next layer in the tree (unless
> +	 * it is empty, in which case we can add ourselves as a new leaf).
> +	 * As descend the tree the prefix grows (and height decreases).
> +	 */
> +	do {
> +		struct intel_timeline_sync *next;
> +
> +		if ((id >> p->height >> SHIFT) != p->prefix) {
> +			/* insert a join above the current layer */
> +			next = kzalloc(sizeof(*next) + NSYNC * sizeof(next),
> +				       GFP_KERNEL);
> +			if (unlikely(!next))
> +				return -ENOMEM;
> +
> +			next->height = ALIGN(fls64((id >> p->height >> SHIFT) ^ p->prefix),
> +					    SHIFT) + p->height;

Got lost here - what's xor-ing accomplishing here? What is height then, 
not just depth relative to the bottom of the tree?

> +			next->prefix = id >> next->height >> SHIFT;
> +
> +			if (p->parent)
> +				__sync_child(p->parent)[__sync_idx(p->parent, id)] = next;
> +			next->parent = p->parent;
> +
> +			idx = p->prefix >> (next->height - p->height - SHIFT) & MASK;
> +			__sync_child(next)[idx] = p;
> +			next->bitmap |= BIT(idx);
> +			p->parent = next;
> +
> +			/* ascend to the join */
> +			p = next;
> +		} else {
> +			if (!p->height)
> +				break;
> +		}
> +
> +		/* descend into the next layer */
> +		GEM_BUG_ON(!p->height);
> +		idx = __sync_idx(p, id);
> +		next = __sync_child(p)[idx];
> +		if (unlikely(!next)) {
> +			next = kzalloc(sizeof(*next) + NSYNC * sizeof(seqno),
> +				       GFP_KERNEL);
> +			if (unlikely(!next))
> +				return -ENOMEM;
> +
> +			__sync_child(p)[idx] = next;
> +			p->bitmap |= BIT(idx);
> +			next->parent = p;
> +			next->prefix = id >> SHIFT;
> +
> +			p = next;
> +			break;
> +		}
> +
> +		p = next;
> +	} while (1);
> +
> +found:
> +	GEM_BUG_ON(p->height);
> +	GEM_BUG_ON(p->prefix != id >> SHIFT);
> +	tl->sync = p;
> +	idx = id & MASK;
> +	__sync_seqno(p)[idx] = seqno;
> +	p->bitmap |= BIT(idx);
> +	return 0;
> +}
> +
> +/** intel_timeline_sync_set -- mark the most recent syncpoint between contexts
> + * @tl - the @intel_timeline
> + * @id - the context id (other timeline) we have synchronised to
> + * @seqno - the sequence number along the other timeline
> + *
> + * When we synchronise this @tl with another (@id), we also know that we have
> + * synchronized with all previous seqno along that timeline. If we then have
> + * a request to synchronise with the same seqno or older, we can omit it,
> + * see intel_timeline_sync_is_later()
> + */
> +int intel_timeline_sync_set(struct intel_timeline *tl, u64 id, u32 seqno)
> +{
> +	struct intel_timeline_sync *p = tl->sync;
> +
> +	/* We expect to be called in sequence following a  _get(id), which
> +	 * should have preloaded the tl->sync hint for us.
> +	 */
> +	if (likely(p && (id >> SHIFT) == p->prefix)) {
> +		unsigned int idx = id & MASK;
> +
> +		__sync_seqno(p)[idx] = seqno;
> +		p->bitmap |= BIT(idx);
> +		return 0;
> +	}
> +
> +	return __intel_timeline_sync_set(tl, id, seqno);

Could pass in p and set tl->sync = p at this level. That would decouple 
the algorithm from the timeline better. With equivalent treatment for 
the query, and renaming of struct intel_timeline_sync, algorithm would 
be ready for moving out of drm/i915/ :)
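
Roughly this shape is what I mean (names invented on the spot):

	bool i915_syncmap_is_later(struct intel_timeline_sync **root,
				   u64 id, u32 seqno);
	int i915_syncmap_set(struct intel_timeline_sync **root,
			     u64 id, u32 seqno);

	/* callers pass &tl->sync and the helpers update the hint through
	 * the double pointer, without knowing about intel_timeline
	 */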

> +}
> +
>  static int __i915_gem_timeline_init(struct drm_i915_private *i915,
>  				    struct i915_gem_timeline *timeline,
>  				    const char *name,
> @@ -35,6 +305,12 @@ static int __i915_gem_timeline_init(struct drm_i915_private *i915,
>
>  	lockdep_assert_held(&i915->drm.struct_mutex);
>
> +	/* Ideally we want a set of engines on a single leaf as we expect
> +	 * to mostly be tracking synchronisation between engines.
> +	 */
> +	BUILD_BUG_ON(NSYNC < I915_NUM_ENGINES);
> +	BUILD_BUG_ON(NSYNC > BITS_PER_BYTE * sizeof(timeline->engine[0].sync->bitmap));

Ta-da! :)

> +
>  	timeline->i915 = i915;
>  	timeline->name = kstrdup(name ?: "[kernel]", GFP_KERNEL);
>  	if (!timeline->name)
> @@ -81,6 +357,37 @@ int i915_gem_timeline_init__global(struct drm_i915_private *i915)
>  					&class, "&global_timeline->lock");
>  }
>
> +/** i915_gem_timelines_mark_idle -- called when the driver idles
> + * @i915 - the drm_i915_private device
> + *
> + * When the driver is completely idle, we know that all of our sync points
> + * have been signaled and our tracking is then entirely redundant. Any request
> + * to wait upon an older sync point will be completed instantly as we know
> + * the fence is signaled and therefore we will not even look them up in the
> + * sync point map.
> + */
> +void i915_gem_timelines_mark_idle(struct drm_i915_private *i915)
> +{
> +	struct i915_gem_timeline *timeline;
> +	int i;
> +
> +	lockdep_assert_held(&i915->drm.struct_mutex);
> +
> +	list_for_each_entry(timeline, &i915->gt.timelines, link) {
> +		for (i = 0; i < ARRAY_SIZE(timeline->engine); i++) {
> +			struct intel_timeline *tl = &timeline->engine[i];
> +
> +			/* All known fences are completed so we can scrap
> +			 * the current sync point tracking and start afresh,
> +			 * any attempt to wait upon a previous sync point
> +			 * will be skipped as the fence was signaled.
> +			 */
> +			sync_free(tl->sync);
> +			tl->sync = NULL;
> +		}
> +	}
> +}
> +
>  void i915_gem_timeline_fini(struct i915_gem_timeline *timeline)
>  {
>  	int i;
> @@ -91,8 +398,15 @@ void i915_gem_timeline_fini(struct i915_gem_timeline *timeline)
>  		struct intel_timeline *tl = &timeline->engine[i];
>
>  		GEM_BUG_ON(!list_empty(&tl->requests));
> +
> +		sync_free(tl->sync);
>  	}
>
>  	list_del(&timeline->link);
>  	kfree(timeline->name);
>  }
> +
> +#if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
> +#include "selftests/mock_timeline.c"
> +#include "selftests/i915_gem_timeline.c"
> +#endif
> diff --git a/drivers/gpu/drm/i915/i915_gem_timeline.h b/drivers/gpu/drm/i915/i915_gem_timeline.h
> index 6c53e14cab2a..e16a62bc21e6 100644
> --- a/drivers/gpu/drm/i915/i915_gem_timeline.h
> +++ b/drivers/gpu/drm/i915/i915_gem_timeline.h
> @@ -26,10 +26,13 @@
>  #define I915_GEM_TIMELINE_H
>
>  #include <linux/list.h>
> +#include <linux/radix-tree.h>

What is used from it?

>
> +#include "i915_utils.h"
>  #include "i915_gem_request.h"
>
>  struct i915_gem_timeline;
> +struct intel_timeline_sync;
>
>  struct intel_timeline {
>  	u64 fence_context;
> @@ -55,6 +58,14 @@ struct intel_timeline {
>  	 * struct_mutex.
>  	 */
>  	struct i915_gem_active last_request;
> +
> +	/* We track the most recent seqno that we wait on in every context so
> +	 * that we only have to emit a new await and dependency on a more
> +	 * recent sync point. As the contexts may executed out-of-order, we
> +	 * have to track each individually and cannot not rely on an absolute
> +	 * global_seqno.
> +	 */
> +	struct intel_timeline_sync *sync;
>  	u32 sync_seqno[I915_NUM_ENGINES];
>
>  	struct i915_gem_timeline *common;
> @@ -73,6 +84,10 @@ int i915_gem_timeline_init(struct drm_i915_private *i915,
>  			   struct i915_gem_timeline *tl,
>  			   const char *name);
>  int i915_gem_timeline_init__global(struct drm_i915_private *i915);
> +void i915_gem_timelines_mark_idle(struct drm_i915_private *i915);
>  void i915_gem_timeline_fini(struct i915_gem_timeline *tl);
>
> +int intel_timeline_sync_set(struct intel_timeline *tl, u64 id, u32 seqno);
> +bool intel_timeline_sync_is_later(struct intel_timeline *tl, u64 id, u32 seqno);
> +
>  #endif
> diff --git a/drivers/gpu/drm/i915/selftests/i915_gem_timeline.c b/drivers/gpu/drm/i915/selftests/i915_gem_timeline.c
> new file mode 100644
> index 000000000000..66b4c24b0c26
> --- /dev/null
> +++ b/drivers/gpu/drm/i915/selftests/i915_gem_timeline.c
> @@ -0,0 +1,225 @@
> +/*
> + * Copyright © 2017 Intel Corporation
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a
> + * copy of this software and associated documentation files (the "Software"),
> + * to deal in the Software without restriction, including without limitation
> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> + * and/or sell copies of the Software, and to permit persons to whom the
> + * Software is furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice (including the next
> + * paragraph) shall be included in all copies or substantial portions of the
> + * Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
> + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
> + * IN THE SOFTWARE.
> + *
> + */
> +
> +#include <linux/random.h>
> +
> +#include "../i915_selftest.h"
> +#include "mock_gem_device.h"
> +#include "mock_timeline.h"
> +
> +static int igt_sync(void *arg)
> +{
> +	struct drm_i915_private *i915 = arg;
> +	const struct {
> +		const char *name;
> +		u32 seqno;
> +		bool expected;
> +		bool set;
> +	} pass[] = {
> +		{ "unset", 0, false, false },
> +		{ "new", 0, false, true },
> +		{ "0a", 0, true, true },
> +		{ "1a", 1, false, true },
> +		{ "1b", 1, true, true },
> +		{ "0b", 0, true, false },
> +		{ "2a", 2, false, true },
> +		{ "4", 4, false, true },
> +		{ "INT_MAX", INT_MAX, false, true },
> +		{ "INT_MAX-1", INT_MAX-1, true, false },
> +		{ "INT_MAX+1", (u32)INT_MAX+1, false, true },
> +		{ "INT_MAX", INT_MAX, true, false },
> +		{ "UINT_MAX", UINT_MAX, false, true },
> +		{ "wrap", 0, false, true },
> +		{ "unwrap", UINT_MAX, true, false },
> +		{},
> +	}, *p;
> +	struct i915_gem_timeline *timeline;
> +	struct intel_timeline *tl;
> +	int order, offset;
> +	int ret;
> +
> +	timeline = mock_timeline(i915);

Hey-ho.. when I suggested mock_timeline I did not realize we need the 
mock_device anyway. :( For struct_mutex I guess? Oh well, that was a
useless suggestion.. :(

> +	if (!timeline)
> +		return -ENOMEM;
> +
> +	tl = &timeline->engine[RCS];
> +	for (p = pass; p->name; p++) {
> +		for (order = 1; order < 64; order++) {
> +			for (offset = -1; offset <= (order > 1); offset++) {
> +				u64 ctx = BIT_ULL(order) + offset;
> +
> +				if (intel_timeline_sync_is_later
> +				    (tl, ctx, p->seqno) != p->expected) {

Unusual formatting.

> +					pr_err("1: %s(ctx=%llu, seqno=%u) expected passed %s but failed\n",
> +					       p->name, ctx, p->seqno, yesno(p->expected));
> +					ret = -EINVAL;
> +					goto out;
> +				}
> +
> +				if (p->set) {
> +					ret = intel_timeline_sync_set(tl, ctx, p->seqno);
> +					if (ret)
> +						goto out;
> +				}
> +			}
> +		}
> +	}
> +
> +	tl = &timeline->engine[BCS];
> +	for (order = 1; order < 64; order++) {
> +		for (offset = -1; offset <= (order > 1); offset++) {
> +			u64 ctx = BIT_ULL(order) + offset;
> +
> +			for (p = pass; p->name; p++) {
> +				if (intel_timeline_sync_is_later
> +				    (tl, ctx, p->seqno) != p->expected) {
> +					pr_err("2: %s(ctx=%llu, seqno=%u) expected passed %s but failed\n",
> +					       p->name, ctx, p->seqno, yesno(p->expected));
> +					ret = -EINVAL;
> +					goto out;
> +				}
> +
> +				if (p->set) {
> +					ret = intel_timeline_sync_set(tl, ctx, p->seqno);
> +					if (ret)
> +						goto out;
> +				}
> +			}
> +		}
> +	}
> +
> +out:
> +	mock_timeline_destroy(timeline);
> +	return ret;
> +}
> +
> +static u64 prandom_u64_state(struct rnd_state *rnd)
> +{
> +	u64 x;
> +
> +	x = prandom_u32_state(rnd);
> +	x <<= 32;
> +	x |= prandom_u32_state(rnd);
> +
> +	return x;
> +}
> +
> +static unsigned int random_engine(struct rnd_state *rnd)
> +{
> +	return ((u64)prandom_u32_state(rnd) * I915_NUM_ENGINES) >> 32;
> +}
> +
> +static int bench_sync(void *arg)
> +{
> +	struct drm_i915_private *i915 = arg;
> +	struct rnd_state prng;
> +	struct i915_gem_timeline *timeline;
> +	struct intel_timeline *tl;
> +	unsigned long end_time, count;
> +	ktime_t kt;
> +	int ret;
> +
> +	timeline = mock_timeline(i915);
> +	if (!timeline)
> +		return -ENOMEM;
> +
> +	prandom_seed_state(&prng, i915_selftest.random_seed);
> +	tl = &timeline->engine[RCS];
> +
> +	count = 0;
> +	kt = -ktime_get();
> +	end_time = jiffies + HZ/10;
> +	do {
> +		u64 id = prandom_u64_state(&prng);
> +
> +		intel_timeline_sync_set(tl, id, 0);
> +		count++;
> +	} while (!time_after(jiffies, end_time));
> +	kt = ktime_add(ktime_get(), kt);

Why not ktime_sub? I don't know if ktime_t is signed or not.
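
If ktime_t is a signed 64-bit nanosecond value (I believe it is), the
negate-then-add reads as an elapsed-time measurement, equivalent to:

	ktime_t start = ktime_get();
	/* ... timed section ... */
	kt = ktime_sub(ktime_get(), start);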

> +
> +	pr_info("%s: %lu random insertions, %lluns/insert\n",
> +		__func__, count, (long long)div64_ul(ktime_to_ns(kt), count));
> +
> +	prandom_seed_state(&prng, i915_selftest.random_seed);
> +
> +	end_time = count;
> +	kt = -ktime_get();
> +	while (end_time--) {

This is a new pattern for me - why not simply go by time in every test? 
You have to be sure lookups are not a bazillion times faster than 
insertions like this.

> +		u64 id = prandom_u64_state(&prng);
> +
> +		if (!intel_timeline_sync_is_later(tl, id, 0)) {
> +			pr_err("Lookup of %llu failed\n", id);
> +			ret = -EINVAL;
> +			goto out;
> +		}
> +	}
> +	kt = ktime_add(ktime_get(), kt);
> +
> +	pr_info("%s: %lu random lookups, %lluns/lookup\n",
> +		__func__, count, (long long)div64_ul(ktime_to_ns(kt), count));
> +
> +	prandom_seed_state(&prng, i915_selftest.random_seed);
> +	tl = &timeline->engine[BCS];
> +
> +	count = 0;
> +	kt = -ktime_get();
> +	end_time = jiffies + HZ/10;
> +	do {
> +		u32 id = random_engine(&prng);
> +		u32 seqno = prandom_u32_state(&prng);
> +
> +		if (!intel_timeline_sync_is_later(tl, id, seqno))
> +			intel_timeline_sync_set(tl, id, seqno);
> +
> +		count++;
> +	} while (!time_after(jiffies, end_time));
> +	kt = ktime_add(ktime_get(), kt);
> +
> +	pr_info("%s: %lu repeated insert/lookups, %lluns/op\n",
> +		__func__, count, (long long)div64_ul(ktime_to_ns(kt), count));
> +
> +	ret = 0;
> +out:
> +	mock_timeline_destroy(timeline);
> +	return ret;
> +}
> +
> +int i915_gem_timeline_mock_selftests(void)
> +{
> +	static const struct i915_subtest tests[] = {
> +		SUBTEST(igt_sync),
> +		SUBTEST(bench_sync),
> +	};
> +	struct drm_i915_private *i915;
> +	int err;
> +
> +	i915 = mock_gem_device();
> +	if (!i915)
> +		return -ENOMEM;
> +
> +	err = i915_subtests(tests, i915);
> +	drm_dev_unref(&i915->drm);
> +
> +	return err;
> +}
> diff --git a/drivers/gpu/drm/i915/selftests/i915_mock_selftests.h b/drivers/gpu/drm/i915/selftests/i915_mock_selftests.h
> index be9a9ebf5692..8d0f50c25df8 100644
> --- a/drivers/gpu/drm/i915/selftests/i915_mock_selftests.h
> +++ b/drivers/gpu/drm/i915/selftests/i915_mock_selftests.h
> @@ -12,6 +12,7 @@ selftest(sanitycheck, i915_mock_sanitycheck) /* keep first (igt selfcheck) */
>  selftest(scatterlist, scatterlist_mock_selftests)
>  selftest(uncore, intel_uncore_mock_selftests)
>  selftest(breadcrumbs, intel_breadcrumbs_mock_selftests)
> +selftest(timelines, i915_gem_timeline_mock_selftests)
>  selftest(requests, i915_gem_request_mock_selftests)
>  selftest(objects, i915_gem_object_mock_selftests)
>  selftest(dmabuf, i915_gem_dmabuf_mock_selftests)
> diff --git a/drivers/gpu/drm/i915/selftests/mock_timeline.c b/drivers/gpu/drm/i915/selftests/mock_timeline.c
> new file mode 100644
> index 000000000000..e8d62f5f6ed3
> --- /dev/null
> +++ b/drivers/gpu/drm/i915/selftests/mock_timeline.c
> @@ -0,0 +1,52 @@
> +/*
> + * Copyright © 2017 Intel Corporation
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a
> + * copy of this software and associated documentation files (the "Software"),
> + * to deal in the Software without restriction, including without limitation
> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> + * and/or sell copies of the Software, and to permit persons to whom the
> + * Software is furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice (including the next
> + * paragraph) shall be included in all copies or substantial portions of the
> + * Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
> + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
> + * IN THE SOFTWARE.
> + *
> + */
> +
> +#include "mock_timeline.h"
> +
> +struct i915_gem_timeline *
> +mock_timeline(struct drm_i915_private *i915)
> +{
> +	struct i915_gem_timeline *timeline;
> +
> +	timeline = kzalloc(sizeof(*timeline), GFP_KERNEL);
> +	if (!timeline)
> +		return NULL;
> +
> +	mutex_lock(&i915->drm.struct_mutex);
> +	i915_gem_timeline_init(i915, timeline, "mock");
> +	mutex_unlock(&i915->drm.struct_mutex);
> +
> +	return timeline;
> +}
> +
> +void mock_timeline_destroy(struct i915_gem_timeline *timeline)
> +{
> +	struct drm_i915_private *i915 = timeline->i915;
> +
> +	mutex_lock(&i915->drm.struct_mutex);
> +	i915_gem_timeline_fini(timeline);
> +	mutex_unlock(&i915->drm.struct_mutex);
> +
> +	kfree(timeline);
> +}
> diff --git a/drivers/gpu/drm/i915/selftests/mock_timeline.h b/drivers/gpu/drm/i915/selftests/mock_timeline.h
> new file mode 100644
> index 000000000000..b33dcd2151ef
> --- /dev/null
> +++ b/drivers/gpu/drm/i915/selftests/mock_timeline.h
> @@ -0,0 +1,33 @@
> +/*
> + * Copyright © 2017 Intel Corporation
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a
> + * copy of this software and associated documentation files (the "Software"),
> + * to deal in the Software without restriction, including without limitation
> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> + * and/or sell copies of the Software, and to permit persons to whom the
> + * Software is furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice (including the next
> + * paragraph) shall be included in all copies or substantial portions of the
> + * Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
> + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
> + * IN THE SOFTWARE.
> + *
> + */
> +
> +#ifndef __MOCK_TIMELINE__
> +#define __MOCK_TIMELINE__
> +
> +#include "../i915_gem_timeline.h"
> +
> +struct i915_gem_timeline *mock_timeline(struct drm_i915_private *i915);
> +void mock_timeline_destroy(struct i915_gem_timeline *timeline);
> +
> +#endif /* !__MOCK_TIMELINE__ */
>

I'll have another pass tomorrow. Hopefully with some helpful replies to
my questions I will be able to digest it.

Regards,

Tvrtko


_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v9] drm/i915: Squash repeated awaits on the same fence
  2017-04-27 16:47       ` Tvrtko Ursulin
@ 2017-04-27 17:25         ` Chris Wilson
  2017-04-27 20:34           ` Chris Wilson
  0 siblings, 1 reply; 95+ messages in thread
From: Chris Wilson @ 2017-04-27 17:25 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: intel-gfx

On Thu, Apr 27, 2017 at 05:47:32PM +0100, Tvrtko Ursulin wrote:
> 
> On 27/04/2017 12:48, Chris Wilson wrote:
> >diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
> >index 5fa4e52ded06..d9f76665bc6b 100644
> >--- a/drivers/gpu/drm/i915/i915_gem_request.c
> >+++ b/drivers/gpu/drm/i915/i915_gem_request.c
> >@@ -772,6 +772,12 @@ i915_gem_request_await_dma_fence(struct drm_i915_gem_request *req,
> > 		if (fence->context == req->fence.context)
> > 			continue;
> >
> >+		/* Squash repeated waits to the same timelines */
> >+		if (fence->context != req->i915->mm.unordered_timeline &&
> >+		    intel_timeline_sync_is_later(req->timeline,
> >+						 fence->context, fence->seqno))
> >+			continue;
> 
> Wrong base?

I haven't moved this patch relative to the others in the series? There's
a few patches to get to here first.

> >+struct intel_timeline_sync {
> >+	u64 prefix;
> >+	unsigned int height;
> >+	unsigned int bitmap;
> 
> u16 would be enough for the bitmap since NSYNC == 16? To no benefit
> though. Maybe just add a BUILD_BUG_ON(sizeof(p->bitmap) *
> BITS_PER_BYTE >= NSYNC) somewhere?

Indeed compacting these bits has no impact on allocation size, so I
went with natural sizes. But I didn't check if the compiler prefers u16.
 
> >+	struct intel_timeline_sync *parent;
> >+	/* union {
> >+	 *	u32 seqno;
> >+	 *	struct intel_timeline_sync *child;
> >+	 * } slot[NSYNC];
> >+	 */
> 
> Put a note saying this comment describes what follows after struct
> intel_timeline_sync.
> 
> Would "union { ... } slot[0];" work as a maker and have any benefit
> to the readability of the code below?
> 
> You could same some bytes (64 I think) for the leaf nodes if you did
> something like:

Hmm, where's the saving?

leaves are sizeof(*p) + NSYNC*sizeof(seqno) -> kmalloc-128 slab
branches are sizeof(*p) + NSYNC*sizeof(p) -> kmalloc-256 slab

> 	union {
> 		u32 seqno[NSYNC];
> 		struct intel_timeline_sync *child[NSYNC];
> 	};
> 
> Although I think it conflicts with the slot marker idea. Hm, no
> actually it doesn't. You could have both union members as simply
> markers.
> 
> 	union {
> 		u32 seqno[];
> 		struct intel_timeline_sync *child[];
> 	};
> 
> Again, not sure yet if it would make that much better readability.

Tried, gcc doesn't like unions of variable length arrays. Hence
resorting to manual packing the arrays after the struct.
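
Condensed, the manual packing is just the allocation plus a cast over
the trailing storage, as __sync_seqno()/__sync_child() do above:

	/* leaf: header followed by NSYNC seqno slots */
	p = kzalloc(sizeof(*p) + NSYNC * sizeof(u32), GFP_KERNEL);
	u32 *seqno = (u32 *)(p + 1);

	/* branch: header followed by NSYNC child pointers */
	p = kzalloc(sizeof(*p) + NSYNC * sizeof(p), GFP_KERNEL);
	struct intel_timeline_sync **child =
		(struct intel_timeline_sync **)(p + 1);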

> >+static void __sync_free(struct intel_timeline_sync *p)
> >+{
> >+	if (p->height) {
> >+		unsigned int i;
> >+
> >+		while ((i = ffs(p->bitmap))) {
> >+			p->bitmap &= ~0u << i;
> >+			__sync_free(__sync_child(p)[i - 1]);
> 
> Maximum height is 64 for this tree so here there is no danger of
> stack overflow?

Maximum recursion depth is 64 / SHIFT (4) = 16. Stack usage is small,
only a few registers to push/pop, so I didn't feel any danger in
allowing recursion.

The while() loop was chosen as that avoided a stack variable.
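
Fwiw the loop also relies on ffs() being 1-based: the shift clears the
bit just visited (everything below it is already clear), e.g.

	/* bitmap = 0b0101000: ffs() == 4, bitmap &= ~0u << 4 leaves
	 * 0b0100000; the next pass sees ffs() == 6, then 0 and we stop.
	 */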

> >+	/* First climb the tree back to a parent branch */
> >+	do {
> >+		p = p->parent;
> >+		if (!p)
> >+			return false;
> >+
> >+		if ((id >> p->height >> SHIFT) == p->prefix)
> 
> Worth having "id >> p->height >> SHIFT" as a macro for better readability?

Yeah, this is the main issue with the code, so many shifts.
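
A tiny helper would at least give the operation a name (the name below
is just a suggestion):

	static inline u64 __sync_prefix(const struct intel_timeline_sync *p, u64 id)
	{
		return id >> p->height >> SHIFT;
	}

	/* e.g. if (__sync_prefix(p, id) == p->prefix) ... */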

> >+			break;
> >+	} while (1);
> >+
> >+	/* And then descend again until we find our leaf */
> >+	do {
> >+		if (!p->height)
> >+			break;
> >+
> >+		p = __sync_child(p)[__sync_idx(p, id)];
> >+		if (!p)
> >+			return false;
> >+
> >+		if ((id >> p->height >> SHIFT) != p->prefix)
> >+			return false;
> 
> Is this possible or a GEM_BUG_ON? Maybe I am not understanding it,
> but I thought it would be __sync_child slot had unexpected prefix in
> it?

The tree may skip levels.
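
e.g. with SHIFT == 4, a join at height 12 can point straight at
height-0 leaves:

	height 12, prefix 0x0
	  slot 0x1: leaf, prefix 0x123  (ids 0x1230..0x123f)
	  slot 0x5: leaf, prefix 0x567  (ids 0x5670..0x567f)

A lookup of id 0x1299 matches the join, descends into slot 0x1 skipping
heights 8 and 4, and lands on a leaf whose prefix 0x123 != 0x129, so it
is an honest miss rather than a bug.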
 
> >+	} while (1);
> >+
> >+	tl->sync = p;
> >+found:
> >+	idx = id & MASK;
> >+	if (!(p->bitmap & BIT(idx)))
> >+		return false;
> >+
> >+	return i915_seqno_passed(__sync_seqno(p)[idx], seqno);
> >+}
> >+
> >+static noinline int
> >+__intel_timeline_sync_set(struct intel_timeline *tl, u64 id, u32 seqno)
> >+{
> >+	struct intel_timeline_sync *p = tl->sync;
> >+	unsigned int idx;
> >+
> >+	if (!p) {
> >+		p = kzalloc(sizeof(*p) + NSYNC * sizeof(seqno), GFP_KERNEL);
> >+		if (unlikely(!p))
> >+			return -ENOMEM;
> >+
> >+		p->prefix = id >> SHIFT;
> >+		goto found;
> >+	}
> >+
> >+	/* Climb back up the tree until we find a common prefix */
> >+	do {
> >+		if (!p->parent)
> >+			break;
> >+
> >+		p = p->parent;
> >+
> >+		if ((id >> p->height >> SHIFT) == p->prefix)
> >+			break;
> >+	} while (1);
> 
> __climb_back_to_prefix(p, id) as a helper since it is used in the
> lookup as well?

The two climbers were subtly different. :(

> >+	/* No shortcut, we have to descend the tree to find the right layer
> >+	 * containing this fence.
> >+	 *
> >+	 * Each layer in the tree holds 16 (NSYNC) pointers, either fences
> >+	 * or lower layers. Leaf nodes (height = 0) contain the fences, all
> >+	 * other nodes (height > 0) are internal layers that point to a lower
> >+	 * node. Each internal layer has at least 2 descendents.
> >+	 *
> >+	 * Starting at the top, we check whether the current prefix matches. If
> >+	 * it doesn't, we have gone past our layer and need to insert a join
> >+	 * into the tree, and a new leaf node as a descendant as well as the
> >+	 * original layer.
> >+	 *
> >+	 * The matching prefix means we are still following the right branch
> >+	 * of the tree. If it has height 0, we have found our leaf and just
> >+	 * need to replace the fence slot with ourselves. If the height is
> >+	 * not zero, our slot contains the next layer in the tree (unless
> >+	 * it is empty, in which case we can add ourselves as a new leaf).
> >+	 * As we descend the tree, the prefix grows (and height decreases).
> >+	 */
> >+	do {
> >+		struct intel_timeline_sync *next;
> >+
> >+		if ((id >> p->height >> SHIFT) != p->prefix) {
> >+			/* insert a join above the current layer */
> >+			next = kzalloc(sizeof(*next) + NSYNC * sizeof(next),
> >+				       GFP_KERNEL);
> >+			if (unlikely(!next))
> >+				return -ENOMEM;
> >+
> >+			next->height = ALIGN(fls64((id >> p->height >> SHIFT) ^ p->prefix),
> >+					    SHIFT) + p->height;
> 
> Got lost here - what's xor-ing accomplishing here? What is height
> then, not just depth relative to the bottom of the tree?

It's working out the height at which these two prefixes first differ.
(Height here is the number of low id bits resolved below a node, always
a multiple of SHIFT, rather than its depth in nodes from the bottom.)
That's the height immediately above which we need to insert the join.
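
Concretely (sketch; the helper name is made up just for illustration):

	static unsigned int
	__sync_join_height(const struct intel_timeline_sync *p, u64 id)
	{
		/*
		 * fls64() of the XOR of the two prefixes gives the highest
		 * bit at which they differ; rounding that up to a multiple
		 * of SHIFT converts it into how many id bits above @p the
		 * join has to cover.
		 */
		return ALIGN(fls64((id >> p->height >> SHIFT) ^ p->prefix),
			     SHIFT);
	}

with next->height = __sync_join_height(p, id) + p->height, as in the
hunk above.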

> >+int intel_timeline_sync_set(struct intel_timeline *tl, u64 id, u32 seqno)
> >+{
> >+	struct intel_timeline_sync *p = tl->sync;
> >+
> >+	/* We expect to be called in sequence following a  _get(id), which
> >+	 * should have preloaded the tl->sync hint for us.
> >+	 */
> >+	if (likely(p && (id >> SHIFT) == p->prefix)) {
> >+		unsigned int idx = id & MASK;
> >+
> >+		__sync_seqno(p)[idx] = seqno;
> >+		p->bitmap |= BIT(idx);
> >+		return 0;
> >+	}
> >+
> >+	return __intel_timeline_sync_set(tl, id, seqno);
> 
> Could pass in p and set tl->sync = p at this level. That would
> decouple the algorithm from the timeline better. With equivalent
> treatment for the query, and renaming of struct intel_timeline_sync,
> algorithm would be ready for moving out of drm/i915/ :)

I really did want to keep this as a tail call to keep the fast path neat
and tidy with minimal stack manipulation.

> >diff --git a/drivers/gpu/drm/i915/i915_gem_timeline.h b/drivers/gpu/drm/i915/i915_gem_timeline.h
> >index 6c53e14cab2a..e16a62bc21e6 100644
> >--- a/drivers/gpu/drm/i915/i915_gem_timeline.h
> >+++ b/drivers/gpu/drm/i915/i915_gem_timeline.h
> >@@ -26,10 +26,13 @@
> > #define I915_GEM_TIMELINE_H
> >
> > #include <linux/list.h>
> >+#include <linux/radix-tree.h>
> 
> What is used from it?

Stray. I started with common idr/radixtree, then plonked in the u64 idr
I had for reservation_object.

> >+static int igt_sync(void *arg)
> >+{
> >+	struct drm_i915_private *i915 = arg;
> >+	const struct {
> >+		const char *name;
> >+		u32 seqno;
> >+		bool expected;
> >+		bool set;
> >+	} pass[] = {
> >+		{ "unset", 0, false, false },
> >+		{ "new", 0, false, true },
> >+		{ "0a", 0, true, true },
> >+		{ "1a", 1, false, true },
> >+		{ "1b", 1, true, true },
> >+		{ "0b", 0, true, false },
> >+		{ "2a", 2, false, true },
> >+		{ "4", 4, false, true },
> >+		{ "INT_MAX", INT_MAX, false, true },
> >+		{ "INT_MAX-1", INT_MAX-1, true, false },
> >+		{ "INT_MAX+1", (u32)INT_MAX+1, false, true },
> >+		{ "INT_MAX", INT_MAX, true, false },
> >+		{ "UINT_MAX", UINT_MAX, false, true },
> >+		{ "wrap", 0, false, true },
> >+		{ "unwrap", UINT_MAX, true, false },
> >+		{},
> >+	}, *p;
> >+	struct i915_gem_timeline *timeline;
> >+	struct intel_timeline *tl;
> >+	int order, offset;
> >+	int ret;
> >+
> >+	timeline = mock_timeline(i915);
> 
> Hey-ho.. when I suggested mock_timeline I did not realize we need
> the mock_device anyway. :( For struct_mutex I guess? Oh well, that
> was a useless suggestion.. :(

It's only used for an assertion, which is annoying. I have been
contemplating making the mock_timeline be an intel_timeline, and then we
can remove the mock_gem_device.

> >+	tl = &timeline->engine[RCS];
> >+	for (p = pass; p->name; p++) {
> >+		for (order = 1; order < 64; order++) {
> >+			for (offset = -1; offset <= (order > 1); offset++) {
> >+				u64 ctx = BIT_ULL(order) + offset;
> >+
> >+				if (intel_timeline_sync_is_later
> >+				    (tl, ctx, p->seqno) != p->expected) {
> 
> Unusual formatting.

I prefer it over having arguments on each line that still go over 80
columns. Yes, it probably means I should break the loops into functions.

> >+static int bench_sync(void *arg)
> >+{
> >+	struct drm_i915_private *i915 = arg;
> >+	struct rnd_state prng;
> >+	struct i915_gem_timeline *timeline;
> >+	struct intel_timeline *tl;
> >+	unsigned long end_time, count;
> >+	ktime_t kt;
> >+	int ret;
> >+
> >+	timeline = mock_timeline(i915);
> >+	if (!timeline)
> >+		return -ENOMEM;
> >+
> >+	prandom_seed_state(&prng, i915_selftest.random_seed);
> >+	tl = &timeline->engine[RCS];
> >+
> >+	count = 0;
> >+	kt = -ktime_get();
> >+	end_time = jiffies + HZ/10;
> >+	do {
> >+		u64 id = prandom_u64_state(&prng);
> >+
> >+		intel_timeline_sync_set(tl, id, 0);
> >+		count++;
> >+	} while (!time_after(jiffies, end_time));
> >+	kt = ktime_add(ktime_get(), kt);
> 
> Why not ktime_sub? I don't know if ktime_t is signed or not.

It's s64, but I was just using a pattern I'm familiar with.

> >+
> >+	pr_info("%s: %lu random insertions, %lluns/insert\n",
> >+		__func__, count, (long long)div64_ul(ktime_to_ns(kt), count));
> >+
> >+	prandom_seed_state(&prng, i915_selftest.random_seed);
> >+
> >+	end_time = count;
> >+	kt = -ktime_get();
> >+	while (end_time--) {
> 
> This is a new pattern for me - why not simply go by time in every
> test? You have to be sure lookups are not a bazillion times faster
> than insertions like this.

It's because I wanted to measure lookups of known ids. After inserting
N ids, we reset the prng and then time how long it takes to look up the
same N.

I don't particularly want to do (insert+lookup) here as that is favoured
by the caching, and we measure it separately later.
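
i.e. the shape of the measurement is (condensed from the selftest,
declarations omitted):

	/* phase 1: time random insertions for ~100ms */
	prandom_seed_state(&prng, i915_selftest.random_seed);
	count = 0;
	end_time = jiffies + HZ / 10;
	kt = -ktime_get();
	do {
		intel_timeline_sync_set(tl, prandom_u64_state(&prng), 0);
		count++;
	} while (!time_after(jiffies, end_time));
	kt = ktime_add(ktime_get(), kt);

	/* phase 2: reseed so the prng replays the same ids, time the lookups */
	prandom_seed_state(&prng, i915_selftest.random_seed);
	end_time = count;
	kt = -ktime_get();
	while (end_time--)
		intel_timeline_sync_is_later(tl, prandom_u64_state(&prng), 0);
	kt = ktime_add(ktime_get(), kt);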
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v9] drm/i915: Squash repeated awaits on the same fence
  2017-04-27 17:25         ` Chris Wilson
@ 2017-04-27 20:34           ` Chris Wilson
  2017-04-27 20:53             ` Chris Wilson
  0 siblings, 1 reply; 95+ messages in thread
From: Chris Wilson @ 2017-04-27 20:34 UTC (permalink / raw)
  To: Tvrtko Ursulin, intel-gfx

On Thu, Apr 27, 2017 at 06:25:47PM +0100, Chris Wilson wrote:
> On Thu, Apr 27, 2017 at 05:47:32PM +0100, Tvrtko Ursulin wrote:
> > >+int intel_timeline_sync_set(struct intel_timeline *tl, u64 id, u32 seqno)
> > >+{
> > >+	struct intel_timeline_sync *p = tl->sync;
> > >+
> > >+	/* We expect to be called in sequence following a  _get(id), which
> > >+	 * should have preloaded the tl->sync hint for us.
> > >+	 */
> > >+	if (likely(p && (id >> SHIFT) == p->prefix)) {
> > >+		unsigned int idx = id & MASK;
> > >+
> > >+		__sync_seqno(p)[idx] = seqno;
> > >+		p->bitmap |= BIT(idx);
> > >+		return 0;
> > >+	}
> > >+
> > >+	return __intel_timeline_sync_set(tl, id, seqno);
> > 
> > Could pass in p and set tl->sync = p at this level. That would
> > decouple the algorithm from the timeline better. With equivalent
> > treatment for the query, and renaming of struct intel_timeline_sync,
> > algorithm would be ready for moving out of drm/i915/ :)
> 
> I really did want to keep this as a tail call to keep the fast path neat
> and tidy with minimal stack manipulation.

Happier with

_intel_timeline_sync_set(struct intel_timeline_sync **root,
			 u64 id, u32 seqno)
{
	struct intel_timeline_sync *p = *root;
	...
	*root = p;
	return 0;
}

return __intel_timeline_sync_set(&tl->sync, id, seqno);

A little step towards abstraction. Works equally well for
intel_timeline_sync_is_later().

Hmm. i915_seqmap.c ? Too cryptic?
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v9] drm/i915: Squash repeated awaits on the same fence
  2017-04-27 20:34           ` Chris Wilson
@ 2017-04-27 20:53             ` Chris Wilson
  0 siblings, 0 replies; 95+ messages in thread
From: Chris Wilson @ 2017-04-27 20:53 UTC (permalink / raw)
  To: Tvrtko Ursulin, intel-gfx

On Thu, Apr 27, 2017 at 09:34:10PM +0100, Chris Wilson wrote:
> On Thu, Apr 27, 2017 at 06:25:47PM +0100, Chris Wilson wrote:
> > On Thu, Apr 27, 2017 at 05:47:32PM +0100, Tvrtko Ursulin wrote:
> > > >+int intel_timeline_sync_set(struct intel_timeline *tl, u64 id, u32 seqno)
> > > >+{
> > > >+	struct intel_timeline_sync *p = tl->sync;
> > > >+
> > > >+	/* We expect to be called in sequence following a  _get(id), which
> > > >+	 * should have preloaded the tl->sync hint for us.
> > > >+	 */
> > > >+	if (likely(p && (id >> SHIFT) == p->prefix)) {
> > > >+		unsigned int idx = id & MASK;
> > > >+
> > > >+		__sync_seqno(p)[idx] = seqno;
> > > >+		p->bitmap |= BIT(idx);
> > > >+		return 0;
> > > >+	}
> > > >+
> > > >+	return __intel_timeline_sync_set(tl, id, seqno);
> > > 
> > > Could pass in p and set tl->sync = p at this level. That would
> > > decouple the algorithm from the timeline better. With equivalent
> > > treatment for the query, and renaming of struct intel_timeline_sync,
> > > algorithm would be ready for moving out of drm/i915/ :)
> > 
> > I really did want to keep this as a tail call to keep the fast path neat
> > and tidy with minimal stack manipulation.
> 
> Happier with
> 
> _intel_timeline_sync_set(struct intel_timeline_sync **root,
> 			 u64 id, u32 seqno)
> {
> 	struct intel_timeline_sync *p = *root;
> 	...
> 	*root = p;
> 	return 0;
> }
> 
> return __intel_timeline_sync_set(&tl->sync, id, seqno);
> 
> A little step towards abstraction. Works equally well for
> intel_timeline_sync_is_later().
> 
> Hmm. i915_seqmap.c ? Too cryptic?

Went with i915_syncmap (struct, .c, .h)

There's some knowledge of seqno built in (i.e. the is_later function).
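
It is just the wraparound-safe u32 comparison, i.e.

	static inline bool seqno_later(u32 a, u32 b)
	{
		/* same trick as i915_seqno_passed(), tolerates u32 wraparound */
		return (s32)(a - b) >= 0;
	}

which i915_syncmap_is_later() applies to the seqno stored for that
context.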
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 95+ messages in thread

* [PATCH v10] drm/i915: Squash repeated awaits on the same fence
  2017-04-27 11:48     ` [PATCH v9] " Chris Wilson
  2017-04-27 16:47       ` Tvrtko Ursulin
@ 2017-04-28  7:41       ` Chris Wilson
  2017-04-28  7:59         ` Chris Wilson
                           ` (3 more replies)
  1 sibling, 4 replies; 95+ messages in thread
From: Chris Wilson @ 2017-04-28  7:41 UTC (permalink / raw)
  To: intel-gfx

Track the latest fence waited upon on each context, and only add a new
asynchronous wait if the new fence is more recent than the recorded
fence for that context. This requires us to filter out unordered
timelines, which are noted by DMA_FENCE_NO_CONTEXT. However, in the
absence of a universal identifier, we have to use our own
i915->mm.unordered_timeline token.

v2: Throw around the debug crutches
v3: Inline the likely case of the pre-allocation cache being full.
v4: Drop the pre-allocation support, we can lose the most recent fence
in case of allocation failure -- it just means we may emit more awaits
than strictly necessary but will not break.
v5: Trim allocation size for leaf nodes, they only need an array of u32
not pointers.
v6: Create mock_timeline to tidy selftest writing
v7: s/intel_timeline_sync_get/intel_timeline_sync_is_later/ (Tvrtko)
v8: Prune the stale sync points when we idle.
v9: Include a small benchmark in the kselftests
v10: Separate the idr implementation into its own compartment.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
---
 drivers/gpu/drm/i915/Makefile                      |   1 +
 drivers/gpu/drm/i915/i915_gem.c                    |   1 +
 drivers/gpu/drm/i915/i915_gem.h                    |   2 +
 drivers/gpu/drm/i915/i915_gem_request.c            |   9 +
 drivers/gpu/drm/i915/i915_gem_timeline.c           |  92 +++++-
 drivers/gpu/drm/i915/i915_gem_timeline.h           |  36 ++
 drivers/gpu/drm/i915/i915_syncmap.c                | 362 +++++++++++++++++++++
 drivers/gpu/drm/i915/i915_syncmap.h                |  39 +++
 drivers/gpu/drm/i915/selftests/i915_gem_timeline.c | 257 +++++++++++++++
 .../gpu/drm/i915/selftests/i915_mock_selftests.h   |   1 +
 drivers/gpu/drm/i915/selftests/mock_timeline.c     |  45 +++
 drivers/gpu/drm/i915/selftests/mock_timeline.h     |  33 ++
 12 files changed, 860 insertions(+), 18 deletions(-)
 create mode 100644 drivers/gpu/drm/i915/i915_syncmap.c
 create mode 100644 drivers/gpu/drm/i915/i915_syncmap.h
 create mode 100644 drivers/gpu/drm/i915/selftests/i915_gem_timeline.c
 create mode 100644 drivers/gpu/drm/i915/selftests/mock_timeline.c
 create mode 100644 drivers/gpu/drm/i915/selftests/mock_timeline.h

diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
index 2cf04504e494..7b05fb802f4c 100644
--- a/drivers/gpu/drm/i915/Makefile
+++ b/drivers/gpu/drm/i915/Makefile
@@ -16,6 +16,7 @@ i915-y := i915_drv.o \
 	  i915_params.o \
 	  i915_pci.o \
           i915_suspend.o \
+	  i915_syncmap.o \
 	  i915_sw_fence.o \
 	  i915_sysfs.o \
 	  intel_csr.o \
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index c1fa3c103f38..f886ef492036 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -3214,6 +3214,7 @@ i915_gem_idle_work_handler(struct work_struct *work)
 		intel_engine_disarm_breadcrumbs(engine);
 		i915_gem_batch_pool_fini(&engine->batch_pool);
 	}
+	i915_gem_timelines_mark_idle(dev_priv);
 
 	GEM_BUG_ON(!dev_priv->gt.awake);
 	dev_priv->gt.awake = false;
diff --git a/drivers/gpu/drm/i915/i915_gem.h b/drivers/gpu/drm/i915/i915_gem.h
index 5a49487368ca..ee54597465b6 100644
--- a/drivers/gpu/drm/i915/i915_gem.h
+++ b/drivers/gpu/drm/i915/i915_gem.h
@@ -25,6 +25,8 @@
 #ifndef __I915_GEM_H__
 #define __I915_GEM_H__
 
+#include <linux/bug.h>
+
 #ifdef CONFIG_DRM_I915_DEBUG_GEM
 #define GEM_BUG_ON(expr) BUG_ON(expr)
 #define GEM_WARN_ON(expr) WARN_ON(expr)
diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
index 5fa4e52ded06..807fc1b65dd1 100644
--- a/drivers/gpu/drm/i915/i915_gem_request.c
+++ b/drivers/gpu/drm/i915/i915_gem_request.c
@@ -772,6 +772,11 @@ i915_gem_request_await_dma_fence(struct drm_i915_gem_request *req,
 		if (fence->context == req->fence.context)
 			continue;
 
+		/* Squash repeated waits to the same timelines */
+		if (fence->context != req->i915->mm.unordered_timeline &&
+		    intel_timeline_sync_is_later(req->timeline, fence))
+			continue;
+
 		if (dma_fence_is_i915(fence))
 			ret = i915_gem_request_await_request(req,
 							     to_request(fence));
@@ -781,6 +786,10 @@ i915_gem_request_await_dma_fence(struct drm_i915_gem_request *req,
 							    GFP_KERNEL);
 		if (ret < 0)
 			return ret;
+
+		/* Record the latest fence used against each timeline */
+		if (fence->context != req->i915->mm.unordered_timeline)
+			intel_timeline_sync_set(req->timeline, fence);
 	} while (--nchild);
 
 	return 0;
diff --git a/drivers/gpu/drm/i915/i915_gem_timeline.c b/drivers/gpu/drm/i915/i915_gem_timeline.c
index b596ca7ee058..a28a65db82e9 100644
--- a/drivers/gpu/drm/i915/i915_gem_timeline.c
+++ b/drivers/gpu/drm/i915/i915_gem_timeline.c
@@ -24,6 +24,31 @@
 
 #include "i915_drv.h"
 
+static void __intel_timeline_init(struct intel_timeline *tl,
+				  struct i915_gem_timeline *parent,
+				  u64 context,
+				  struct lock_class_key *lockclass,
+				  const char *lockname)
+{
+	tl->fence_context = context;
+	tl->common = parent;
+#ifdef CONFIG_DEBUG_SPINLOCK
+	__raw_spin_lock_init(&tl->lock.rlock, lockname, lockclass);
+#else
+	spin_lock_init(&tl->lock);
+#endif
+	init_request_active(&tl->last_request, NULL);
+	INIT_LIST_HEAD(&tl->requests);
+	i915_syncmap_init(&tl->sync);
+}
+
+static void __intel_timeline_fini(struct intel_timeline *tl)
+{
+	GEM_BUG_ON(!list_empty(&tl->requests));
+
+	i915_syncmap_free(&tl->sync);
+}
+
 static int __i915_gem_timeline_init(struct drm_i915_private *i915,
 				    struct i915_gem_timeline *timeline,
 				    const char *name,
@@ -35,6 +60,12 @@ static int __i915_gem_timeline_init(struct drm_i915_private *i915,
 
 	lockdep_assert_held(&i915->drm.struct_mutex);
 
+	/*
+	 * Ideally we want a set of engines on a single leaf as we expect
+	 * to mostly be tracking synchronisation between engines.
+	 */
+	BUILD_BUG_ON(KSYNCMAP < I915_NUM_ENGINES);
+
 	timeline->i915 = i915;
 	timeline->name = kstrdup(name ?: "[kernel]", GFP_KERNEL);
 	if (!timeline->name)
@@ -44,19 +75,10 @@ static int __i915_gem_timeline_init(struct drm_i915_private *i915,
 
 	/* Called during early_init before we know how many engines there are */
 	fences = dma_fence_context_alloc(ARRAY_SIZE(timeline->engine));
-	for (i = 0; i < ARRAY_SIZE(timeline->engine); i++) {
-		struct intel_timeline *tl = &timeline->engine[i];
-
-		tl->fence_context = fences++;
-		tl->common = timeline;
-#ifdef CONFIG_DEBUG_SPINLOCK
-		__raw_spin_lock_init(&tl->lock.rlock, lockname, lockclass);
-#else
-		spin_lock_init(&tl->lock);
-#endif
-		init_request_active(&tl->last_request, NULL);
-		INIT_LIST_HEAD(&tl->requests);
-	}
+	for (i = 0; i < ARRAY_SIZE(timeline->engine); i++)
+		__intel_timeline_init(&timeline->engine[i],
+				      timeline, fences++,
+				      lockclass, lockname);
 
 	return 0;
 }
@@ -81,18 +103,52 @@ int i915_gem_timeline_init__global(struct drm_i915_private *i915)
 					&class, "&global_timeline->lock");
 }
 
+/**
+ * i915_gem_timelines_mark_idle -- called when the driver idles
+ * @i915 - the drm_i915_private device
+ *
+ * When the driver is completely idle, we know that all of our sync points
+ * have been signaled and our tracking is then entirely redundant. Any request
+ * to wait upon an older sync point will be completed instantly as we know
+ * the fence is signaled and therefore we will not even look them up in the
+ * sync point map.
+ */
+void i915_gem_timelines_mark_idle(struct drm_i915_private *i915)
+{
+	struct i915_gem_timeline *timeline;
+	int i;
+
+	lockdep_assert_held(&i915->drm.struct_mutex);
+
+	list_for_each_entry(timeline, &i915->gt.timelines, link) {
+		for (i = 0; i < ARRAY_SIZE(timeline->engine); i++) {
+			struct intel_timeline *tl = &timeline->engine[i];
+
+			/*
+			 * All known fences are completed so we can scrap
+			 * the current sync point tracking and start afresh,
+			 * any attempt to wait upon a previous sync point
+			 * will be skipped as the fence was signaled.
+			 */
+			i915_syncmap_free(&tl->sync);
+		}
+	}
+}
+
 void i915_gem_timeline_fini(struct i915_gem_timeline *timeline)
 {
 	int i;
 
 	lockdep_assert_held(&timeline->i915->drm.struct_mutex);
 
-	for (i = 0; i < ARRAY_SIZE(timeline->engine); i++) {
-		struct intel_timeline *tl = &timeline->engine[i];
-
-		GEM_BUG_ON(!list_empty(&tl->requests));
-	}
+	for (i = 0; i < ARRAY_SIZE(timeline->engine); i++)
+		__intel_timeline_fini(&timeline->engine[i]);
 
 	list_del(&timeline->link);
 	kfree(timeline->name);
 }
+
+#if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
+#include "selftests/mock_timeline.c"
+#include "selftests/i915_gem_timeline.c"
+#endif
diff --git a/drivers/gpu/drm/i915/i915_gem_timeline.h b/drivers/gpu/drm/i915/i915_gem_timeline.h
index 6c53e14cab2a..86ade2890902 100644
--- a/drivers/gpu/drm/i915/i915_gem_timeline.h
+++ b/drivers/gpu/drm/i915/i915_gem_timeline.h
@@ -27,7 +27,9 @@
 
 #include <linux/list.h>
 
+#include "i915_utils.h"
 #include "i915_gem_request.h"
+#include "i915_syncmap.h"
 
 struct i915_gem_timeline;
 
@@ -55,6 +57,15 @@ struct intel_timeline {
 	 * struct_mutex.
 	 */
 	struct i915_gem_active last_request;
+
+	/**
+	 * We track the most recent seqno that we wait on in every context so
+	 * that we only have to emit a new await and dependency on a more
+	 * recent sync point. As the contexts may be executed out-of-order, we
+	 * have to track each individually and cannot rely on an absolute
+	 * global_seqno.
+	 */
+	struct i915_syncmap *sync;
 	u32 sync_seqno[I915_NUM_ENGINES];
 
 	struct i915_gem_timeline *common;
@@ -73,6 +84,31 @@ int i915_gem_timeline_init(struct drm_i915_private *i915,
 			   struct i915_gem_timeline *tl,
 			   const char *name);
 int i915_gem_timeline_init__global(struct drm_i915_private *i915);
+void i915_gem_timelines_mark_idle(struct drm_i915_private *i915);
 void i915_gem_timeline_fini(struct i915_gem_timeline *tl);
 
+static inline int __intel_timeline_sync_set(struct intel_timeline *tl,
+					    u64 context, u32 seqno)
+{
+	return i915_syncmap_set(&tl->sync, context, seqno);
+}
+
+static inline int intel_timeline_sync_set(struct intel_timeline *tl,
+					  const struct dma_fence *fence)
+{
+	return __intel_timeline_sync_set(tl, fence->context, fence->seqno);
+}
+
+static inline bool __intel_timeline_sync_is_later(struct intel_timeline *tl,
+						  u64 context, u32 seqno)
+{
+	return i915_syncmap_is_later(&tl->sync, context, seqno);
+}
+
+static inline bool intel_timeline_sync_is_later(struct intel_timeline *tl,
+						const struct dma_fence *fence)
+{
+	return __intel_timeline_sync_is_later(tl, fence->context, fence->seqno);
+}
+
 #endif
diff --git a/drivers/gpu/drm/i915/i915_syncmap.c b/drivers/gpu/drm/i915/i915_syncmap.c
new file mode 100644
index 000000000000..70762c3772a0
--- /dev/null
+++ b/drivers/gpu/drm/i915/i915_syncmap.c
@@ -0,0 +1,362 @@
+/*
+ * Copyright © 2017 Intel Corporation
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the next
+ * paragraph) shall be included in all copies or substantial portions of the
+ * Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ *
+ */
+
+#include <linux/slab.h>
+
+#include "i915_syncmap.h"
+
+#include "i915_gem.h" /* GEM_BUG_ON() */
+
+#define SHIFT ilog2(KSYNCMAP)
+#define MASK (KSYNCMAP - 1)
+
+/*
+ * struct i915_syncmap is a layer of a radixtree that maps a u64 fence
+ * context id to the last u32 fence seqno waited upon from that context.
+ * Unlike lib/radixtree it uses a parent pointer that allows traversal back to
+ * the root. This allows us to access the whole tree via a single pointer
+ * to the most recently used layer. We expect fence contexts to be dense
+ * and most reuse to be on the same i915_gem_context but on neighbouring
+ * engines (i.e. on adjacent contexts), so most lookups reuse the same
+ * leaf, which makes it a very effective cache. If the new lookup is not
+ * on the same leaf, we expect it to be on the neighbouring branch.
+ *
+ * A leaf holds an array of u32 seqno, and has height 0. The bitmap field
+ * allows us to store whether a particular seqno is valid (i.e. allows us
+ * to distinguish unset from 0).
+ *
+ * A branch holds an array of layer pointers, and has height > 0, and always
+ * has at least 2 layers (either branches or leaves) below it.
+ */
+
+struct i915_syncmap {
+	u64 prefix;
+	unsigned int height;
+	unsigned int bitmap;
+	struct i915_syncmap *parent;
+	/*
+	 * Following this header is an array of either seqno or child pointers:
+	 * union {
+	 *	u32 seqno[KSYNCMAP];
+	 *	struct i915_syncmap *child[KSYNMAP];
+	 * };
+	 */
+};
+
+/**
+ * i915_syncmap_init -- initialise the #i915_syncmap
+ * @root - pointer to the #i915_syncmap
+ */
+void i915_syncmap_init(struct i915_syncmap **root)
+{
+	BUILD_BUG_ON(KSYNCMAP > BITS_PER_BYTE * sizeof((*root)->bitmap));
+	*root = NULL;
+}
+
+static inline u32 *__sync_seqno(struct i915_syncmap *p)
+{
+	GEM_BUG_ON(p->height);
+	return (u32 *)(p + 1);
+}
+
+static inline struct i915_syncmap **__sync_child(struct i915_syncmap *p)
+{
+	GEM_BUG_ON(!p->height);
+	return (struct i915_syncmap **)(p + 1);
+}
+
+static inline unsigned int __sync_idx(const struct i915_syncmap *p, u64 id)
+{
+	return (id >> p->height) & MASK;
+}
+
+static inline u64 __sync_prefix(const struct i915_syncmap *p, u64 id)
+{
+	return id >> p->height >> SHIFT;
+}
+
+static inline u64 __sync_leaf(const struct i915_syncmap *p, u64 id)
+{
+	GEM_BUG_ON(p->height);
+	return id >> SHIFT;
+}
+
+static inline bool seqno_later(u32 a, u32 b)
+{
+	return (s32)(a - b) >= 0;
+}
+
+/**
+ * i915_syncmap_is_later -- compare against the last known sync point
+ * @root - pointer to the #i915_syncmap
+ * @id - the context id (other timeline) we are synchronising to
+ * @seqno - the sequence number along the other timeline
+ *
+ * If we have already synchronised this @root with another (@id) then we can
+ * omit any repeated or earlier synchronisation requests. If the two timelines
+ * are already coupled, we can also omit the dependency between the two as that
+ * is already known via the timeline.
+ *
+ * Returns true if the two timelines are already synchronised wrt to @seqno,
+ * false if not and the synchronisation must be emitted.
+ */
+bool i915_syncmap_is_later(struct i915_syncmap **root, u64 id, u32 seqno)
+{
+	struct i915_syncmap *p;
+	unsigned int idx;
+
+	p = *root;
+	if (!p)
+		return false;
+
+	if (likely(__sync_leaf(p, id) == p->prefix))
+		goto found;
+
+	/* First climb the tree back to a parent branch */
+	do {
+		p = p->parent;
+		if (!p)
+			return false;
+
+		if (__sync_prefix(p, id) == p->prefix)
+			break;
+	} while (1);
+
+	/* And then descend again until we find our leaf */
+	do {
+		if (!p->height)
+			break;
+
+		p = __sync_child(p)[__sync_idx(p, id)];
+		if (!p)
+			return false;
+
+		if (__sync_prefix(p, id) != p->prefix)
+			return false;
+	} while (1);
+
+	*root = p;
+found:
+	idx = id & MASK;
+	if (!(p->bitmap & BIT(idx)))
+		return false;
+
+	return seqno_later(__sync_seqno(p)[idx], seqno);
+}
+
+static struct i915_syncmap *
+__sync_alloc_leaf(struct i915_syncmap *parent, u64 id)
+{
+	struct i915_syncmap *p;
+
+	p = kmalloc(sizeof(*p) + KSYNCMAP * sizeof(u32), GFP_KERNEL);
+	if (unlikely(!p))
+		return NULL;
+
+	p->parent = parent;
+	p->height = 0;
+	p->bitmap = 0;
+	p->prefix = __sync_leaf(p, id);
+	return p;
+}
+
+static noinline int
+__i915_syncmap_set(struct i915_syncmap **root, u64 id, u32 seqno)
+{
+	struct i915_syncmap *p = *root;
+	unsigned int idx;
+
+	if (!p) {
+		p = __sync_alloc_leaf(NULL, id);
+		if (unlikely(!p))
+			return -ENOMEM;
+
+		goto found;
+	}
+
+	/* Climb back up the tree until we find a common prefix */
+	do {
+		if (!p->parent)
+			break;
+
+		p = p->parent;
+
+		if (__sync_prefix(p, id) == p->prefix)
+			break;
+	} while (1);
+
+	/*
+	 * No shortcut, we have to descend the tree to find the right layer
+	 * containing this fence.
+	 *
+	 * Each layer in the tree holds 16 (KSYNCMAP) pointers, either fences
+	 * or lower layers. Leaf nodes (height = 0) contain the fences, all
+	 * other nodes (height > 0) are internal layers that point to a lower
+	 * node. Each internal layer has at least 2 descendants.
+	 *
+	 * Starting at the top, we check whether the current prefix matches. If
+	 * it doesn't, we have gone past our layer and need to insert a join
+	 * into the tree, and a new leaf node as a descendant as well as the
+	 * original layer.
+	 *
+	 * The matching prefix means we are still following the right branch
+	 * of the tree. If it has height 0, we have found our leaf and just
+	 * need to replace the fence slot with ourselves. If the height is
+	 * not zero, our slot contains the next layer in the tree (unless
+	 * it is empty, in which case we can add ourselves as a new leaf).
+	 * As we descend the tree, the prefix grows (and height decreases).
+	 */
+	do {
+		struct i915_syncmap *next;
+
+		if (__sync_prefix(p, id) != p->prefix) {
+			unsigned int above;
+
+			/* insert a join above the current layer */
+			next = kzalloc(sizeof(*next) + KSYNCMAP * sizeof(next),
+				       GFP_KERNEL);
+			if (unlikely(!next))
+				return -ENOMEM;
+
+			above = fls64(__sync_prefix(p, id) ^ p->prefix);
+			above = round_up(above, SHIFT);
+			next->height = above + p->height;
+			next->prefix = __sync_prefix(next, id);
+
+			if (p->parent)
+				__sync_child(p->parent)[__sync_idx(p->parent, id)] = next;
+			next->parent = p->parent;
+
+			idx = p->prefix >> (above - SHIFT) & MASK;
+			__sync_child(next)[idx] = p;
+			next->bitmap |= BIT(idx);
+			p->parent = next;
+
+			/* ascend to the join */
+			p = next;
+		} else {
+			if (!p->height)
+				break;
+		}
+
+		/* descend into the next layer */
+		GEM_BUG_ON(!p->height);
+		idx = __sync_idx(p, id);
+		next = __sync_child(p)[idx];
+		if (unlikely(!next)) {
+			next = __sync_alloc_leaf(p, id);
+			if (unlikely(!next))
+				return -ENOMEM;
+
+			__sync_child(p)[idx] = next;
+			p->bitmap |= BIT(idx);
+
+			p = next;
+			break;
+		}
+
+		p = next;
+	} while (1);
+
+found:
+	GEM_BUG_ON(p->prefix != __sync_leaf(p, id));
+	idx = id & MASK;
+	__sync_seqno(p)[idx] = seqno;
+	p->bitmap |= BIT(idx);
+	*root = p;
+	return 0;
+}
+
+/**
+ * i915_syncmap_set -- mark the most recent syncpoint between contexts
+ * @root - pointer to the #i915_syncmap
+ * @id - the context id (other timeline) we have synchronised to
+ * @seqno - the sequence number along the other timeline
+ *
+ * When we synchronise this @root with another (@id), we also know that we have
+ * synchronised with all previous seqno along that timeline. If we then have
+ * a request to synchronise with the same seqno or older, we can omit it,
+ * see i915_syncmap_is_later()
+ *
+ * Returns 0 on success, or a negative error code.
+ */
+int i915_syncmap_set(struct i915_syncmap **root, u64 id, u32 seqno)
+{
+	struct i915_syncmap *p = *root;
+
+	/*
+	 * We expect to be called in sequence following an is_later(id), which
+	 * should have preloaded the root for us.
+	 */
+	if (likely(p && __sync_leaf(p, id) == p->prefix)) {
+		unsigned int idx = id & MASK;
+
+		__sync_seqno(p)[idx] = seqno;
+		p->bitmap |= BIT(idx);
+		return 0;
+	}
+
+	return __i915_syncmap_set(root, id, seqno);
+}
+
+static void __sync_free(struct i915_syncmap *p)
+{
+	if (p->height) {
+		unsigned int i;
+
+		while ((i = ffs(p->bitmap))) {
+			p->bitmap &= ~0u << i;
+			__sync_free(__sync_child(p)[i - 1]);
+		}
+	}
+
+	kfree(p);
+}
+
+/**
+ * i915_syncmap_free -- free all memory associated with the syncmap
+ * @root - pointer to the #i915_syncmap
+ *
+ * Either when the timeline is to be freed and we no longer need the sync
+ * point tracking, or when the fences are all known to be signaled and the
+ * sync point tracking is redundant, we can free the #i915_syncmap to recover
+ * its allocations.
+ *
+ * Will reinitialise the @root pointer so that the #i915_syncmap is ready for
+ * reuse.
+ */
+void i915_syncmap_free(struct i915_syncmap **root)
+{
+	struct i915_syncmap *p;
+
+	p = *root;
+	if (!p)
+		return;
+
+	while (p->parent)
+		p = p->parent;
+
+	__sync_free(p);
+	*root = NULL;
+}
diff --git a/drivers/gpu/drm/i915/i915_syncmap.h b/drivers/gpu/drm/i915/i915_syncmap.h
new file mode 100644
index 000000000000..7ca827d812ae
--- /dev/null
+++ b/drivers/gpu/drm/i915/i915_syncmap.h
@@ -0,0 +1,39 @@
+/*
+ * Copyright © 2017 Intel Corporation
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the next
+ * paragraph) shall be included in all copies or substantial portions of the
+ * Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ *
+ */
+
+#ifndef __I915_SYNCMAP_H__
+#define __I915_SYNCMAP_H__
+
+#include <linux/types.h>
+
+struct i915_syncmap;
+
+void i915_syncmap_init(struct i915_syncmap **root);
+bool i915_syncmap_is_later(struct i915_syncmap **root, u64 id, u32 seqno);
+int i915_syncmap_set(struct i915_syncmap **root, u64 id, u32 seqno);
+void i915_syncmap_free(struct i915_syncmap **root);
+
+#define KSYNCMAP 16
+
+#endif /* __I915_SYNCMAP_H__ */
diff --git a/drivers/gpu/drm/i915/selftests/i915_gem_timeline.c b/drivers/gpu/drm/i915/selftests/i915_gem_timeline.c
new file mode 100644
index 000000000000..2058e754c86d
--- /dev/null
+++ b/drivers/gpu/drm/i915/selftests/i915_gem_timeline.c
@@ -0,0 +1,257 @@
+/*
+ * Copyright © 2017 Intel Corporation
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the next
+ * paragraph) shall be included in all copies or substantial portions of the
+ * Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ *
+ */
+
+#include <linux/random.h>
+
+#include "../i915_selftest.h"
+#include "mock_gem_device.h"
+#include "mock_timeline.h"
+
+static int igt_sync(void *arg)
+{
+	const struct {
+		const char *name;
+		u32 seqno;
+		bool expected;
+		bool set;
+	} pass[] = {
+		{ "unset", 0, false, false },
+		{ "new", 0, false, true },
+		{ "0a", 0, true, true },
+		{ "1a", 1, false, true },
+		{ "1b", 1, true, true },
+		{ "0b", 0, true, false },
+		{ "2a", 2, false, true },
+		{ "4", 4, false, true },
+		{ "INT_MAX", INT_MAX, false, true },
+		{ "INT_MAX-1", INT_MAX-1, true, false },
+		{ "INT_MAX+1", (u32)INT_MAX+1, false, true },
+		{ "INT_MAX", INT_MAX, true, false },
+		{ "UINT_MAX", UINT_MAX, false, true },
+		{ "wrap", 0, false, true },
+		{ "unwrap", UINT_MAX, true, false },
+		{},
+	}, *p;
+	struct intel_timeline *tl;
+	int order, offset;
+	int ret;
+
+	tl = mock_timeline(0);
+	if (!tl)
+		return -ENOMEM;
+
+	for (p = pass; p->name; p++) {
+		for (order = 1; order < 64; order++) {
+			for (offset = -1; offset <= (order > 1); offset++) {
+				u64 ctx = BIT_ULL(order) + offset;
+
+				if (__intel_timeline_sync_is_later
+				    (tl, ctx, p->seqno) != p->expected) {
+					pr_err("1: %s(ctx=%llu, seqno=%u) expected passed %s but failed\n",
+					       p->name, ctx, p->seqno, yesno(p->expected));
+					ret = -EINVAL;
+					goto out;
+				}
+
+				if (p->set) {
+					ret = __intel_timeline_sync_set(tl, ctx, p->seqno);
+					if (ret)
+						goto out;
+				}
+			}
+		}
+	}
+	mock_timeline_destroy(tl);
+
+	tl = mock_timeline(0);
+	if (!tl)
+		return -ENOMEM;
+
+	for (order = 1; order < 64; order++) {
+		for (offset = -1; offset <= (order > 1); offset++) {
+			u64 ctx = BIT_ULL(order) + offset;
+
+			for (p = pass; p->name; p++) {
+				if (__intel_timeline_sync_is_later
+				    (tl, ctx, p->seqno) != p->expected) {
+					pr_err("2: %s(ctx=%llu, seqno=%u) expected passed %s but failed\n",
+					       p->name, ctx, p->seqno, yesno(p->expected));
+					ret = -EINVAL;
+					goto out;
+				}
+
+				if (p->set) {
+					ret = __intel_timeline_sync_set(tl, ctx, p->seqno);
+					if (ret)
+						goto out;
+				}
+			}
+		}
+	}
+
+out:
+	mock_timeline_destroy(tl);
+	return ret;
+}
+
+static u64 prandom_u64_state(struct rnd_state *rnd)
+{
+	u64 x;
+
+	x = prandom_u32_state(rnd);
+	x <<= 32;
+	x |= prandom_u32_state(rnd);
+
+	return x;
+}
+
+static unsigned int random_engine(struct rnd_state *rnd)
+{
+	return ((u64)prandom_u32_state(rnd) * I915_NUM_ENGINES) >> 32;
+}
+
+static int bench_sync(void *arg)
+{
+	struct rnd_state prng;
+	struct intel_timeline *tl;
+	unsigned long end_time, count;
+	ktime_t kt;
+
+	tl = mock_timeline(0);
+	if (!tl)
+		return -ENOMEM;
+
+	prandom_seed_state(&prng, i915_selftest.random_seed);
+	count = 0;
+	kt = ktime_get();
+	end_time = jiffies + HZ/10;
+	do {
+		u64 id = prandom_u64_state(&prng);
+
+		__intel_timeline_sync_set(tl, id, 0);
+		count++;
+	} while (!time_after(jiffies, end_time));
+	kt = ktime_sub(ktime_get(), kt);
+	pr_info("%s: %lu random insertions, %lluns/insert\n",
+		__func__, count, (long long)div64_ul(ktime_to_ns(kt), count));
+
+	prandom_seed_state(&prng, i915_selftest.random_seed);
+	end_time = count;
+	kt = ktime_get();
+	while (end_time--) {
+		u64 id = prandom_u64_state(&prng);
+
+		if (!__intel_timeline_sync_is_later(tl, id, 0)) {
+			mock_timeline_destroy(tl);
+			pr_err("Lookup of %llu failed\n", id);
+			return -EINVAL;
+		}
+	}
+	kt = ktime_sub(ktime_get(), kt);
+	pr_info("%s: %lu random lookups, %lluns/lookup\n",
+		__func__, count, (long long)div64_ul(ktime_to_ns(kt), count));
+
+	mock_timeline_destroy(tl);
+
+	tl = mock_timeline(0);
+	if (!tl)
+		return -ENOMEM;
+
+	count = 0;
+	kt = ktime_get();
+	end_time = jiffies + HZ/10;
+	do {
+		__intel_timeline_sync_set(tl, count++, 0);
+	} while (!time_after(jiffies, end_time));
+	kt = ktime_sub(ktime_get(), kt);
+	pr_info("%s: %lu in-order insertions, %lluns/insert\n",
+		__func__, count, (long long)div64_ul(ktime_to_ns(kt), count));
+
+	end_time = count;
+	kt = ktime_get();
+	while (end_time--) {
+		if (!__intel_timeline_sync_is_later(tl, end_time, 0)) {
+			pr_err("Lookup of %lu failed\n", end_time);
+			mock_timeline_destroy(tl);
+			return -EINVAL;
+		}
+	}
+	kt = ktime_sub(ktime_get(), kt);
+	pr_info("%s: %lu in-order lookups, %lluns/lookup\n",
+		__func__, count, (long long)div64_ul(ktime_to_ns(kt), count));
+
+	mock_timeline_destroy(tl);
+
+	tl = mock_timeline(0);
+	if (!tl)
+		return -ENOMEM;
+
+	prandom_seed_state(&prng, i915_selftest.random_seed);
+	count = 0;
+	kt = ktime_get();
+	end_time = jiffies + HZ/10;
+	do {
+		u32 id = random_engine(&prng);
+		u32 seqno = prandom_u32_state(&prng);
+
+		if (!__intel_timeline_sync_is_later(tl, id, seqno))
+			__intel_timeline_sync_set(tl, id, seqno);
+
+		count++;
+	} while (!time_after(jiffies, end_time));
+	kt = ktime_sub(ktime_get(), kt);
+	pr_info("%s: %lu repeated insert/lookups, %lluns/op\n",
+		__func__, count, (long long)div64_ul(ktime_to_ns(kt), count));
+	mock_timeline_destroy(tl);
+
+	tl = mock_timeline(0);
+	if (!tl)
+		return -ENOMEM;
+
+	count = 0;
+	kt = ktime_get();
+	end_time = jiffies + HZ/10;
+	do {
+		if (!__intel_timeline_sync_is_later(tl, count & 7, count >> 4))
+			__intel_timeline_sync_set(tl, count & 7, count >> 4);
+
+		count++;
+	} while (!time_after(jiffies, end_time));
+	kt = ktime_sub(ktime_get(), kt);
+	pr_info("%s: %lu cyclic insert/lookups, %lluns/op\n",
+		__func__, count, (long long)div64_ul(ktime_to_ns(kt), count));
+	mock_timeline_destroy(tl);
+
+	return 0;
+}
+
+int i915_gem_timeline_mock_selftests(void)
+{
+	static const struct i915_subtest tests[] = {
+		SUBTEST(igt_sync),
+		SUBTEST(bench_sync),
+	};
+
+	return i915_subtests(tests, NULL);
+}
diff --git a/drivers/gpu/drm/i915/selftests/i915_mock_selftests.h b/drivers/gpu/drm/i915/selftests/i915_mock_selftests.h
index be9a9ebf5692..8d0f50c25df8 100644
--- a/drivers/gpu/drm/i915/selftests/i915_mock_selftests.h
+++ b/drivers/gpu/drm/i915/selftests/i915_mock_selftests.h
@@ -12,6 +12,7 @@ selftest(sanitycheck, i915_mock_sanitycheck) /* keep first (igt selfcheck) */
 selftest(scatterlist, scatterlist_mock_selftests)
 selftest(uncore, intel_uncore_mock_selftests)
 selftest(breadcrumbs, intel_breadcrumbs_mock_selftests)
+selftest(timelines, i915_gem_timeline_mock_selftests)
 selftest(requests, i915_gem_request_mock_selftests)
 selftest(objects, i915_gem_object_mock_selftests)
 selftest(dmabuf, i915_gem_dmabuf_mock_selftests)
diff --git a/drivers/gpu/drm/i915/selftests/mock_timeline.c b/drivers/gpu/drm/i915/selftests/mock_timeline.c
new file mode 100644
index 000000000000..47b1f47c5812
--- /dev/null
+++ b/drivers/gpu/drm/i915/selftests/mock_timeline.c
@@ -0,0 +1,45 @@
+/*
+ * Copyright © 2017 Intel Corporation
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the next
+ * paragraph) shall be included in all copies or substantial portions of the
+ * Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ *
+ */
+
+#include "mock_timeline.h"
+
+struct intel_timeline *mock_timeline(u64 context)
+{
+	static struct lock_class_key class;
+	struct intel_timeline *tl;
+
+	tl = kzalloc(sizeof(*tl), GFP_KERNEL);
+	if (!tl)
+		return NULL;
+
+	__intel_timeline_init(tl, NULL, context, &class, "mock");
+
+	return tl;
+}
+
+void mock_timeline_destroy(struct intel_timeline *tl)
+{
+	__intel_timeline_fini(tl);
+	kfree(tl);
+}
diff --git a/drivers/gpu/drm/i915/selftests/mock_timeline.h b/drivers/gpu/drm/i915/selftests/mock_timeline.h
new file mode 100644
index 000000000000..c27ff4639b8b
--- /dev/null
+++ b/drivers/gpu/drm/i915/selftests/mock_timeline.h
@@ -0,0 +1,33 @@
+/*
+ * Copyright © 2017 Intel Corporation
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the next
+ * paragraph) shall be included in all copies or substantial portions of the
+ * Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ *
+ */
+
+#ifndef __MOCK_TIMELINE__
+#define __MOCK_TIMELINE__
+
+#include "../i915_gem_timeline.h"
+
+struct intel_timeline *mock_timeline(u64 context);
+void mock_timeline_destroy(struct intel_timeline *tl);
+
+#endif /* !__MOCK_TIMELINE__ */
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 95+ messages in thread

* Re: [PATCH v10] drm/i915: Squash repeated awaits on the same fence
  2017-04-28  7:41       ` [PATCH v10] " Chris Wilson
@ 2017-04-28  7:59         ` Chris Wilson
  2017-04-28  9:32         ` Tvrtko Ursulin
                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 95+ messages in thread
From: Chris Wilson @ 2017-04-28  7:59 UTC (permalink / raw)
  To: intel-gfx

On Fri, Apr 28, 2017 at 08:41:36AM +0100, Chris Wilson wrote:
> Track the latest fence waited upon on each context, and only add a new
> asynchronous wait if the new fence is more recent than the recorded
> fence for that context. This requires us to filter out unordered
> timelines, which are noted by DMA_FENCE_NO_CONTEXT. However, in the
> absence of a universal identifier, we have to use our own
> i915->mm.unordered_timeline token.
> 
> v2: Throw around the debug crutches
> v3: Inline the likely case of the pre-allocation cache being full.
> v4: Drop the pre-allocation support, we can lose the most recent fence
> in case of allocation failure -- it just means we may emit more awaits
> than strictly necessary but will not break.
> v5: Trim allocation size for leaf nodes, they only need an array of u32
> not pointers.
> v6: Create mock_timeline to tidy selftest writing
> v7: s/intel_timeline_sync_get/intel_timeline_sync_is_later/ (Tvrtko)
> v8: Prune the stale sync points when we idle.
> v9: Include a small benchmark in the kselftests
> v10: Separate the idr implementation into its own compartment.

Before you complain, I haven't yet attempted to refactor the kselftests
to avoid deep indentation.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v10] drm/i915: Squash repeated awaits on the same fence
  2017-04-28  7:41       ` [PATCH v10] " Chris Wilson
  2017-04-28  7:59         ` Chris Wilson
@ 2017-04-28  9:32         ` Tvrtko Ursulin
  2017-04-28  9:54           ` Chris Wilson
  2017-04-28  9:55         ` Tvrtko Ursulin
  2017-04-28 14:12         ` [PATCH v13] " Chris Wilson
  3 siblings, 1 reply; 95+ messages in thread
From: Tvrtko Ursulin @ 2017-04-28  9:32 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 28/04/2017 08:41, Chris Wilson wrote:
> Track the latest fence waited upon on each context, and only add a new
> asynchronous wait if the new fence is more recent than the recorded
> fence for that context. This requires us to filter out unordered
> timelines, which are noted by DMA_FENCE_NO_CONTEXT. However, in the
> absence of a universal identifier, we have to use our own
> i915->mm.unordered_timeline token.
>
> v2: Throw around the debug crutches
> v3: Inline the likely case of the pre-allocation cache being full.
> v4: Drop the pre-allocation support, we can lose the most recent fence
> in case of allocation failure -- it just means we may emit more awaits
> than strictly necessary but will not break.
> v5: Trim allocation size for leaf nodes, they only need an array of u32
> not pointers.
> v6: Create mock_timeline to tidy selftest writing
> v7: s/intel_timeline_sync_get/intel_timeline_sync_is_later/ (Tvrtko)
> v8: Prune the stale sync points when we idle.
> v9: Include a small benchmark in the kselftests
> v10: Separate the idr implementation into its own compartment.

FYI I am reading v11 and commenting here. Hopefully it works out. :)

>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> ---
>  drivers/gpu/drm/i915/Makefile                      |   1 +
>  drivers/gpu/drm/i915/i915_gem.c                    |   1 +
>  drivers/gpu/drm/i915/i915_gem.h                    |   2 +
>  drivers/gpu/drm/i915/i915_gem_request.c            |   9 +
>  drivers/gpu/drm/i915/i915_gem_timeline.c           |  92 +++++-
>  drivers/gpu/drm/i915/i915_gem_timeline.h           |  36 ++
>  drivers/gpu/drm/i915/i915_syncmap.c                | 362 +++++++++++++++++++++
>  drivers/gpu/drm/i915/i915_syncmap.h                |  39 +++
>  drivers/gpu/drm/i915/selftests/i915_gem_timeline.c | 257 +++++++++++++++
>  .../gpu/drm/i915/selftests/i915_mock_selftests.h   |   1 +
>  drivers/gpu/drm/i915/selftests/mock_timeline.c     |  45 +++
>  drivers/gpu/drm/i915/selftests/mock_timeline.h     |  33 ++
>  12 files changed, 860 insertions(+), 18 deletions(-)
>  create mode 100644 drivers/gpu/drm/i915/i915_syncmap.c
>  create mode 100644 drivers/gpu/drm/i915/i915_syncmap.h
>  create mode 100644 drivers/gpu/drm/i915/selftests/i915_gem_timeline.c
>  create mode 100644 drivers/gpu/drm/i915/selftests/mock_timeline.c
>  create mode 100644 drivers/gpu/drm/i915/selftests/mock_timeline.h
>
> diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
> index 2cf04504e494..7b05fb802f4c 100644
> --- a/drivers/gpu/drm/i915/Makefile
> +++ b/drivers/gpu/drm/i915/Makefile
> @@ -16,6 +16,7 @@ i915-y := i915_drv.o \
>  	  i915_params.o \
>  	  i915_pci.o \
>            i915_suspend.o \
> +	  i915_syncmap.o \
>  	  i915_sw_fence.o \
>  	  i915_sysfs.o \
>  	  intel_csr.o \
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index c1fa3c103f38..f886ef492036 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -3214,6 +3214,7 @@ i915_gem_idle_work_handler(struct work_struct *work)
>  		intel_engine_disarm_breadcrumbs(engine);
>  		i915_gem_batch_pool_fini(&engine->batch_pool);
>  	}
> +	i915_gem_timelines_mark_idle(dev_priv);
>
>  	GEM_BUG_ON(!dev_priv->gt.awake);
>  	dev_priv->gt.awake = false;
> diff --git a/drivers/gpu/drm/i915/i915_gem.h b/drivers/gpu/drm/i915/i915_gem.h
> index 5a49487368ca..ee54597465b6 100644
> --- a/drivers/gpu/drm/i915/i915_gem.h
> +++ b/drivers/gpu/drm/i915/i915_gem.h
> @@ -25,6 +25,8 @@
>  #ifndef __I915_GEM_H__
>  #define __I915_GEM_H__
>
> +#include <linux/bug.h>
> +
>  #ifdef CONFIG_DRM_I915_DEBUG_GEM
>  #define GEM_BUG_ON(expr) BUG_ON(expr)
>  #define GEM_WARN_ON(expr) WARN_ON(expr)
> diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
> index 5fa4e52ded06..807fc1b65dd1 100644
> --- a/drivers/gpu/drm/i915/i915_gem_request.c
> +++ b/drivers/gpu/drm/i915/i915_gem_request.c
> @@ -772,6 +772,11 @@ i915_gem_request_await_dma_fence(struct drm_i915_gem_request *req,
>  		if (fence->context == req->fence.context)
>  			continue;
>
> +		/* Squash repeated waits to the same timelines */
> +		if (fence->context != req->i915->mm.unordered_timeline &&
> +		    intel_timeline_sync_is_later(req->timeline, fence))
> +			continue;
> +
>  		if (dma_fence_is_i915(fence))
>  			ret = i915_gem_request_await_request(req,
>  							     to_request(fence));
> @@ -781,6 +786,10 @@ i915_gem_request_await_dma_fence(struct drm_i915_gem_request *req,
>  							    GFP_KERNEL);
>  		if (ret < 0)
>  			return ret;
> +
> +		/* Record the latest fence used against each timeline */
> +		if (fence->context != req->i915->mm.unordered_timeline)
> +			intel_timeline_sync_set(req->timeline, fence);
>  	} while (--nchild);
>
>  	return 0;
> diff --git a/drivers/gpu/drm/i915/i915_gem_timeline.c b/drivers/gpu/drm/i915/i915_gem_timeline.c
> index b596ca7ee058..a28a65db82e9 100644
> --- a/drivers/gpu/drm/i915/i915_gem_timeline.c
> +++ b/drivers/gpu/drm/i915/i915_gem_timeline.c
> @@ -24,6 +24,31 @@
>
>  #include "i915_drv.h"

#include "i915_syncmap.h"?

I think the over-reliance on i915_drv.h being an include-all header is
hurting us in some cases, and even though it is probably very hard to
untangle, perhaps it is worth making new source files cleaner in that
respect.

>
> +static void __intel_timeline_init(struct intel_timeline *tl,
> +				  struct i915_gem_timeline *parent,
> +				  u64 context,
> +				  struct lock_class_key *lockclass,
> +				  const char *lockname)
> +{
> +	tl->fence_context = context;
> +	tl->common = parent;
> +#ifdef CONFIG_DEBUG_SPINLOCK
> +	__raw_spin_lock_init(&tl->lock.rlock, lockname, lockclass);
> +#else
> +	spin_lock_init(&tl->lock);
> +#endif
> +	init_request_active(&tl->last_request, NULL);
> +	INIT_LIST_HEAD(&tl->requests);
> +	i915_syncmap_init(&tl->sync);
> +}
> +
> +static void __intel_timeline_fini(struct intel_timeline *tl)
> +{
> +	GEM_BUG_ON(!list_empty(&tl->requests));
> +
> +	i915_syncmap_free(&tl->sync);
> +}
> +
>  static int __i915_gem_timeline_init(struct drm_i915_private *i915,
>  				    struct i915_gem_timeline *timeline,
>  				    const char *name,
> @@ -35,6 +60,12 @@ static int __i915_gem_timeline_init(struct drm_i915_private *i915,
>
>  	lockdep_assert_held(&i915->drm.struct_mutex);
>
> +	/*
> +	 * Ideally we want a set of engines on a single leaf as we expect
> +	 * to mostly be tracking synchronisation between engines.
> +	 */
> +	BUILD_BUG_ON(KSYNCMAP < I915_NUM_ENGINES);

Maybe also add BUILD_BUG_ON(!is_power_of_2(KSYNCMAP)) just in case.

> +
>  	timeline->i915 = i915;
>  	timeline->name = kstrdup(name ?: "[kernel]", GFP_KERNEL);
>  	if (!timeline->name)
> @@ -44,19 +75,10 @@ static int __i915_gem_timeline_init(struct drm_i915_private *i915,
>
>  	/* Called during early_init before we know how many engines there are */
>  	fences = dma_fence_context_alloc(ARRAY_SIZE(timeline->engine));
> -	for (i = 0; i < ARRAY_SIZE(timeline->engine); i++) {
> -		struct intel_timeline *tl = &timeline->engine[i];
> -
> -		tl->fence_context = fences++;
> -		tl->common = timeline;
> -#ifdef CONFIG_DEBUG_SPINLOCK
> -		__raw_spin_lock_init(&tl->lock.rlock, lockname, lockclass);
> -#else
> -		spin_lock_init(&tl->lock);
> -#endif
> -		init_request_active(&tl->last_request, NULL);
> -		INIT_LIST_HEAD(&tl->requests);
> -	}
> +	for (i = 0; i < ARRAY_SIZE(timeline->engine); i++)
> +		__intel_timeline_init(&timeline->engine[i],
> +				      timeline, fences++,
> +				      lockclass, lockname);
>
>  	return 0;
>  }
> @@ -81,18 +103,52 @@ int i915_gem_timeline_init__global(struct drm_i915_private *i915)
>  					&class, "&global_timeline->lock");
>  }
>
> +/**
> + * i915_gem_timelines_mark_idle -- called when the driver idles
> + * @i915 - the drm_i915_private device
> + *
> + * When the driver is completely idle, we know that all of our sync points
> + * have been signaled and our tracking is then entirely redundant. Any request
> + * to wait upon an older sync point will be completed instantly as we know
> + * the fence is signaled and therefore we will not even look them up in the
> + * sync point map.
> + */
> +void i915_gem_timelines_mark_idle(struct drm_i915_private *i915)
> +{
> +	struct i915_gem_timeline *timeline;
> +	int i;
> +
> +	lockdep_assert_held(&i915->drm.struct_mutex);
> +
> +	list_for_each_entry(timeline, &i915->gt.timelines, link) {
> +		for (i = 0; i < ARRAY_SIZE(timeline->engine); i++) {
> +			struct intel_timeline *tl = &timeline->engine[i];
> +
> +			/*
> +			 * All known fences are completed so we can scrap
> +			 * the current sync point tracking and start afresh,
> +			 * any attempt to wait upon a previous sync point
> +			 * will be skipped as the fence was signaled.
> +			 */
> +			i915_syncmap_free(&tl->sync);
> +		}
> +	}
> +}
> +
>  void i915_gem_timeline_fini(struct i915_gem_timeline *timeline)
>  {
>  	int i;
>
>  	lockdep_assert_held(&timeline->i915->drm.struct_mutex);
>
> -	for (i = 0; i < ARRAY_SIZE(timeline->engine); i++) {
> -		struct intel_timeline *tl = &timeline->engine[i];
> -
> -		GEM_BUG_ON(!list_empty(&tl->requests));
> -	}
> +	for (i = 0; i < ARRAY_SIZE(timeline->engine); i++)
> +		__intel_timeline_fini(&timeline->engine[i]);
>
>  	list_del(&timeline->link);
>  	kfree(timeline->name);
>  }
> +
> +#if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
> +#include "selftests/mock_timeline.c"
> +#include "selftests/i915_gem_timeline.c"
> +#endif
> diff --git a/drivers/gpu/drm/i915/i915_gem_timeline.h b/drivers/gpu/drm/i915/i915_gem_timeline.h
> index 6c53e14cab2a..86ade2890902 100644
> --- a/drivers/gpu/drm/i915/i915_gem_timeline.h
> +++ b/drivers/gpu/drm/i915/i915_gem_timeline.h
> @@ -27,7 +27,9 @@
>
>  #include <linux/list.h>
>
> +#include "i915_utils.h"
>  #include "i915_gem_request.h"
> +#include "i915_syncmap.h"
>
>  struct i915_gem_timeline;
>
> @@ -55,6 +57,15 @@ struct intel_timeline {
>  	 * struct_mutex.
>  	 */
>  	struct i915_gem_active last_request;
> +
> +	/**
> +	 * We track the most recent seqno that we wait on in every context so
> +	 * that we only have to emit a new await and dependency on a more
> +	 * recent sync point. As the contexts may be executed out-of-order, we
> +	 * have to track each individually and cannot rely on an absolute
> +	 * global_seqno.
> +	 */
> +	struct i915_syncmap *sync;
>  	u32 sync_seqno[I915_NUM_ENGINES];
>
>  	struct i915_gem_timeline *common;
> @@ -73,6 +84,31 @@ int i915_gem_timeline_init(struct drm_i915_private *i915,
>  			   struct i915_gem_timeline *tl,
>  			   const char *name);
>  int i915_gem_timeline_init__global(struct drm_i915_private *i915);
> +void i915_gem_timelines_mark_idle(struct drm_i915_private *i915);
>  void i915_gem_timeline_fini(struct i915_gem_timeline *tl);
>
> +static inline int __intel_timeline_sync_set(struct intel_timeline *tl,
> +					    u64 context, u32 seqno)
> +{
> +	return i915_syncmap_set(&tl->sync, context, seqno);
> +}
> +
> +static inline int intel_timeline_sync_set(struct intel_timeline *tl,
> +					  const struct dma_fence *fence)
> +{
> +	return __intel_timeline_sync_set(tl, fence->context, fence->seqno);
> +}
> +
> +static inline bool __intel_timeline_sync_is_later(struct intel_timeline *tl,
> +						  u64 context, u32 seqno)
> +{
> +	return i915_syncmap_is_later(&tl->sync, context, seqno);
> +}
> +
> +static inline bool intel_timeline_sync_is_later(struct intel_timeline *tl,
> +						const struct dma_fence *fence)
> +{
> +	return __intel_timeline_sync_is_later(tl, fence->context, fence->seqno);
> +}
> +
>  #endif
> diff --git a/drivers/gpu/drm/i915/i915_syncmap.c b/drivers/gpu/drm/i915/i915_syncmap.c
> new file mode 100644
> index 000000000000..70762c3772a0
> --- /dev/null
> +++ b/drivers/gpu/drm/i915/i915_syncmap.c
> @@ -0,0 +1,362 @@
> +/*
> + * Copyright © 2017 Intel Corporation
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a
> + * copy of this software and associated documentation files (the "Software"),
> + * to deal in the Software without restriction, including without limitation
> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> + * and/or sell copies of the Software, and to permit persons to whom the
> + * Software is furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice (including the next
> + * paragraph) shall be included in all copies or substantial portions of the
> + * Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
> + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
> + * IN THE SOFTWARE.
> + *
> + */
> +
> +#include <linux/slab.h>
> +
> +#include "i915_syncmap.h"
> +
> +#include "i915_gem.h" /* GEM_BUG_ON() */
> +
> +#define SHIFT ilog2(KSYNCMAP)
> +#define MASK (KSYNCMAP - 1)
> +
> +/*
> + * struct i915_syncmap is a layer of a radixtree that maps a u64 fence
> + * context id to the last u32 fence seqno waited upon from that context.
> + * Unlike lib/radixtree it uses a parent pointer that allows traversal back to
> + * the root. This allows us to access the whole tree via a single pointer
> + * to the most recently used layer. We expect fence contexts to be dense
> + * and most reuse to be on the same i915_gem_context but on neighbouring
> + * engines (i.e. on adjacent contexts) and reuse the same leaf, a very
> + * effective lookup cache. If the new lookup is not on the same leaf, we
> + * expect it to be on the neighbouring branch.
> + *
> + * A leaf holds an array of u32 seqno, and has height 0. The bitmap field
> + * allows us to store whether a particular seqno is valid (i.e. allows us
> + * to distinguish unset from 0).
> + *
> + * A branch holds an array of layer pointers, and has height > 0, and always
> + * has at least 2 layers (either branches or leaves) below it.
> + */
> +
> +struct i915_syncmap {
> +	u64 prefix;
> +	unsigned int height;
> +	unsigned int bitmap;
> +	struct i915_syncmap *parent;
> +	/*
> +	 * Following this header is an array of either seqno or child pointers:
> +	 * union {
> +	 *	u32 seqno[KSYNCMAP];
> +	 *	struct i915_syncmap *child[KSYNCMAP];
> +	 * };
> +	 */
> +};
> +
> +/**
> + * i915_syncmap_init -- initialise the #i915_syncmap
> + * @root - pointer to the #i915_syncmap
> + */
> +void i915_syncmap_init(struct i915_syncmap **root)
> +{
> +	BUILD_BUG_ON(KSYNCMAP > BITS_PER_BYTE * sizeof((*root)->bitmap));
> +	*root = NULL;
> +}
> +
> +static inline u32 *__sync_seqno(struct i915_syncmap *p)
> +{
> +	GEM_BUG_ON(p->height);
> +	return (u32 *)(p + 1);
> +}
> +
> +static inline struct i915_syncmap **__sync_child(struct i915_syncmap *p)
> +{
> +	GEM_BUG_ON(!p->height);
> +	return (struct i915_syncmap **)(p + 1);
> +}
> +
> +static inline unsigned int __sync_idx(const struct i915_syncmap *p, u64 id)
> +{
> +	return (id >> p->height) & MASK;
> +}
> +
> +static inline u64 __sync_prefix(const struct i915_syncmap *p, u64 id)
> +{
> +	return id >> p->height >> SHIFT;
> +}
> +
> +static inline u64 __sync_leaf(const struct i915_syncmap *p, u64 id)
> +{
> +	GEM_BUG_ON(p->height);
> +	return id >> SHIFT;
> +}
> +
> +static inline bool seqno_later(u32 a, u32 b)
> +{
> +	return (s32)(a - b) >= 0;
> +}
> +
> +/**
> + * i915_syncmap_is_later -- compare against the last known sync point
> + * @root - pointer to the #i915_syncmap
> + * @id - the context id (other timeline) we are synchronising to
> + * @seqno - the sequence number along the other timeline
> + *
> + * If we have already synchronised this @root with another (@id) then we can
> + * omit any repeated or earlier synchronisation requests. If the two timelines
> + * are already coupled, we can also omit the dependency between the two as that
> + * is already known via the timeline.
> + *
> + * Returns true if the two timelines are already synchronised wrt @seqno,
> + * false if not and the synchronisation must be emitted.
> + */
> +bool i915_syncmap_is_later(struct i915_syncmap **root, u64 id, u32 seqno)
> +{
> +	struct i915_syncmap *p;
> +	unsigned int idx;
> +
> +	p = *root;
> +	if (!p)
> +		return false;
> +
> +	if (likely(__sync_leaf(p, id) == p->prefix))
> +		goto found;

Are you sure likely is appropriate here?

> +
> +	/* First climb the tree back to a parent branch */
> +	do {
> +		p = p->parent;
> +		if (!p)
> +			return false;
> +
> +		if (__sync_prefix(p, id) == p->prefix)
> +			break;
> +	} while (1);
> +
> +	/* And then descend again until we find our leaf */
> +	do {
> +		if (!p->height)
> +			break;
> +
> +		p = __sync_child(p)[__sync_idx(p, id)];
> +		if (!p)
> +			return false;
> +
> +		if (__sync_prefix(p, id) != p->prefix)
> +			return false;
> +	} while (1);
> +
> +	*root = p;
> +found:
> +	idx = id & MASK;

Would:

	GEM_BUG_ON(p->height);
	idx = __sync_idx(p, id);

be correct (even if more verbose) here instead of idx = id & MASK?

> +	if (!(p->bitmap & BIT(idx)))
> +		return false;

I was thinking briefly whether seqno+bitmap get/set helpers would be 
helpful but I think there is no need. With the __sync_*_prefix helpers 
the algorithm is much more readable already.

> +
> +	return seqno_later(__sync_seqno(p)[idx], seqno);
> +}
> +
> +static struct i915_syncmap *
> +__sync_alloc_leaf(struct i915_syncmap *parent, u64 id)
> +{
> +	struct i915_syncmap *p;
> +
> +	p = kmalloc(sizeof(*p) + KSYNCMAP * sizeof(u32), GFP_KERNEL);
> +	if (unlikely(!p))
> +		return NULL;
> +
> +	p->parent = parent;
> +	p->height = 0;
> +	p->bitmap = 0;
> +	p->prefix = __sync_leaf(p, id);
> +	return p;
> +}
> +
> +static noinline int
> +__i915_syncmap_set(struct i915_syncmap **root, u64 id, u32 seqno)
> +{
> +	struct i915_syncmap *p = *root;
> +	unsigned int idx;
> +
> +	if (!p) {
> +		p = __sync_alloc_leaf(NULL, id);
> +		if (unlikely(!p))
> +			return -ENOMEM;
> +
> +		goto found;
> +	}
> +

GEM_BUG_ON(p->prefix == __sync_leaf_prefix(p, id)) ? Or maybe better try 
to handle it rather than expect the caller will never do that? By handling 
it I mean not immediately starting to climb the tree just below, but first 
checking for the condition and doing a goto found.

> +	/* Climb back up the tree until we find a common prefix */
> +	do {
> +		if (!p->parent)
> +			break;
> +
> +		p = p->parent;
> +
> +		if (__sync_prefix(p, id) == p->prefix)
> +			break;
> +	} while (1);
> +
> +	/*
> +	 * No shortcut, we have to descend the tree to find the right layer
> +	 * containing this fence.
> +	 *
> +	 * Each layer in the tree holds 16 (KSYNCMAP) pointers, either fences
> +	 * or lower layers. Leaf nodes (height = 0) contain the fences, all
> +	 * other nodes (height > 0) are internal layers that point to a lower
> +	 * node. Each internal layer has at least 2 descendents.
> +	 *
> +	 * Starting at the top, we check whether the current prefix matches. If
> +	 * it doesn't, we have gone past our layer and need to insert a join
> +	 * into the tree, and a new leaf node as a descendent as well as the
> +	 * original layer.
> +	 *
> +	 * The matching prefix means we are still following the right branch
> +	 * of the tree. If it has height 0, we have found our leaf and just
> +	 * need to replace the fence slot with ourselves. If the height is
> +	 * not zero, our slot contains the next layer in the tree (unless
> +	 * it is empty, in which case we can add ourselves as a new leaf).
> +	 * As we descend the tree the prefix grows (and the height decreases).
> +	 */
> +	do {
> +		struct i915_syncmap *next;
> +
> +		if (__sync_prefix(p, id) != p->prefix) {
> +			unsigned int above;
> +
> +			/* insert a join above the current layer */
> +			next = kzalloc(sizeof(*next) + KSYNCMAP * sizeof(next),
> +				       GFP_KERNEL);

Next is above, right? Would common_parent be correct? Or 
lowest_common_parent? Possibly not the name because it is too long, but 
I'm trying to figure out if I got the fls and xor business right.
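
To check my understanding of the fls/xor step, a worked example (with 
SHIFT == 4 and purely illustrative numbers):

	p->prefix            = 0x12
	__sync_prefix(p, id) = 0x1f
	xor                  = 0x0d -> fls64() == 4 -> round_up(4, SHIFT) == 4

so the join ends up exactly one level (SHIFT bits) above p. Is that the 
intent?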

> +			if (unlikely(!next))
> +				return -ENOMEM;
> +
> +			above = fls64(__sync_prefix(p, id) ^ p->prefix);
> +			above = round_up(above, SHIFT);
> +			next->height = above + p->height;
> +			next->prefix = __sync_prefix(next, id);
> +
> +			if (p->parent)
> +				__sync_child(p->parent)[__sync_idx(p->parent, id)] = next;
> +			next->parent = p->parent;
> +
> +			idx = p->prefix >> (above - SHIFT) & MASK;
> +			__sync_child(next)[idx] = p;
> +			next->bitmap |= BIT(idx);
> +			p->parent = next;
> +
> +			/* ascend to the join */
> +			p = next;
> +		} else {
> +			if (!p->height)
> +				break;
> +		}
> +
> +		/* descend into the next layer */
> +		GEM_BUG_ON(!p->height);
> +		idx = __sync_idx(p, id);
> +		next = __sync_child(p)[idx];
> +		if (unlikely(!next)) {

Why is this one unlikely?

> +			next = __sync_alloc_leaf(p, id);
> +			if (unlikely(!next))
> +				return -ENOMEM;
> +
> +			__sync_child(p)[idx] = next;
> +			p->bitmap |= BIT(idx);
> +
> +			p = next;
> +			break;
> +		}
> +
> +		p = next;
> +	} while (1);
> +
> +found:
> +	GEM_BUG_ON(p->prefix != __sync_leaf(p, id));
> +	idx = id & MASK;
> +	__sync_seqno(p)[idx] = seqno;
> +	p->bitmap |= BIT(idx);

Actually __sync_set_seqno(p, id, seqno) might be useful here and in 
i915_syncmap_set below.
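
Something like this is what I have in mind (just a rough sketch):

	static inline void
	__sync_set_seqno(struct i915_syncmap *p, u64 id, u32 seqno)
	{
		unsigned int idx = id & MASK; /* we are on a leaf at this point */

		__sync_seqno(p)[idx] = seqno;
		p->bitmap |= BIT(idx);
	}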

The below looks OK for the moment. Still coming to terms with the above 
loop. Postponing drawing the diagram... :)
Regards,

Tvrtko

> +	*root = p;
> +	return 0;
> +}
> +
> +/**
> + * i915_syncmap_set -- mark the most recent syncpoint between contexts
> + * @root - pointer to the #i915_syncmap
> + * @id - the context id (other timeline) we have synchronised to
> + * @seqno - the sequence number along the other timeline
> + *
> + * When we synchronise this @root with another (@id), we also know that we have
> + * synchronized with all previous seqno along that timeline. If we then have
> + * a request to synchronise with the same seqno or older, we can omit it,
> + * see i915_syncmap_is_later()
> + *
> + * Returns 0 on success, or a negative error code.
> + */
> +int i915_syncmap_set(struct i915_syncmap **root, u64 id, u32 seqno)
> +{
> +	struct i915_syncmap *p = *root;
> +
> +	/*
> +	 * We expect to be called in sequence following a is_later(id), which
> +	 * should have preloaded the root for us.
> +	 */
> +	if (likely(p && __sync_leaf(p, id) == p->prefix)) {
> +		unsigned int idx = id & MASK;
> +
> +		__sync_seqno(p)[idx] = seqno;
> +		p->bitmap |= BIT(idx);
> +		return 0;
> +	}
> +
> +	return __i915_syncmap_set(root, id, seqno);
> +}
> +
> +static void __sync_free(struct i915_syncmap *p)
> +{
> +	if (p->height) {
> +		unsigned int i;
> +
> +		while ((i = ffs(p->bitmap))) {
> +			p->bitmap &= ~0u << i;
> +			__sync_free(__sync_child(p)[i - 1]);
> +		}
> +	}
> +
> +	kfree(p);
> +}
> +
> +/**
> + * i915_syncmap_free -- free all memory associated with the syncmap
> + * @root - pointer to the #i915_syncmap
> + *
> + * Either when the timeline is to be freed and we no longer need the sync
> + * point tracking, or when the fences are all known to be signaled and the
> + * sync point tracking is redundant, we can free the #i915_syncmap to recover
> + * its allocations.
> + *
> + * Will reinitialise the @root pointer so that the #i915_syncmap is ready for
> + * reuse.
> + */
> +void i915_syncmap_free(struct i915_syncmap **root)
> +{
> +	struct i915_syncmap *p;
> +
> +	p = *root;
> +	if (!p)
> +		return;
> +
> +	while (p->parent)
> +		p = p->parent;
> +
> +	__sync_free(p);
> +	*root = NULL;
> +}
> diff --git a/drivers/gpu/drm/i915/i915_syncmap.h b/drivers/gpu/drm/i915/i915_syncmap.h
> new file mode 100644
> index 000000000000..7ca827d812ae
> --- /dev/null
> +++ b/drivers/gpu/drm/i915/i915_syncmap.h
> @@ -0,0 +1,39 @@
> +/*
> + * Copyright © 2017 Intel Corporation
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a
> + * copy of this software and associated documentation files (the "Software"),
> + * to deal in the Software without restriction, including without limitation
> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> + * and/or sell copies of the Software, and to permit persons to whom the
> + * Software is furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice (including the next
> + * paragraph) shall be included in all copies or substantial portions of the
> + * Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
> + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
> + * IN THE SOFTWARE.
> + *
> + */
> +
> +#ifndef __I915_SYNCMAP_H__
> +#define __I915_SYNCMAP_H__
> +
> +#include <linux/types.h>
> +
> +struct i915_syncmap;
> +
> +void i915_syncmap_init(struct i915_syncmap **root);
> +bool i915_syncmap_is_later(struct i915_syncmap **root, u64 id, u32 seqno);
> +int i915_syncmap_set(struct i915_syncmap **root, u64 id, u32 seqno);
> +void i915_syncmap_free(struct i915_syncmap **root);
> +
> +#define KSYNCMAP 16
> +
> +#endif /* __I915_SYNCMAP_H__ */
> diff --git a/drivers/gpu/drm/i915/selftests/i915_gem_timeline.c b/drivers/gpu/drm/i915/selftests/i915_gem_timeline.c
> new file mode 100644
> index 000000000000..2058e754c86d
> --- /dev/null
> +++ b/drivers/gpu/drm/i915/selftests/i915_gem_timeline.c
> @@ -0,0 +1,257 @@
> +/*
> + * Copyright © 2017 Intel Corporation
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a
> + * copy of this software and associated documentation files (the "Software"),
> + * to deal in the Software without restriction, including without limitation
> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> + * and/or sell copies of the Software, and to permit persons to whom the
> + * Software is furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice (including the next
> + * paragraph) shall be included in all copies or substantial portions of the
> + * Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
> + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
> + * IN THE SOFTWARE.
> + *
> + */
> +
> +#include <linux/random.h>
> +
> +#include "../i915_selftest.h"
> +#include "mock_gem_device.h"
> +#include "mock_timeline.h"
> +
> +static int igt_sync(void *arg)
> +{
> +	const struct {
> +		const char *name;
> +		u32 seqno;
> +		bool expected;
> +		bool set;
> +	} pass[] = {
> +		{ "unset", 0, false, false },
> +		{ "new", 0, false, true },
> +		{ "0a", 0, true, true },
> +		{ "1a", 1, false, true },
> +		{ "1b", 1, true, true },
> +		{ "0b", 0, true, false },
> +		{ "2a", 2, false, true },
> +		{ "4", 4, false, true },
> +		{ "INT_MAX", INT_MAX, false, true },
> +		{ "INT_MAX-1", INT_MAX-1, true, false },
> +		{ "INT_MAX+1", (u32)INT_MAX+1, false, true },
> +		{ "INT_MAX", INT_MAX, true, false },
> +		{ "UINT_MAX", UINT_MAX, false, true },
> +		{ "wrap", 0, false, true },
> +		{ "unwrap", UINT_MAX, true, false },
> +		{},
> +	}, *p;
> +	struct intel_timeline *tl;
> +	int order, offset;
> +	int ret;
> +
> +	tl = mock_timeline(0);
> +	if (!tl)
> +		return -ENOMEM;
> +
> +	for (p = pass; p->name; p++) {
> +		for (order = 1; order < 64; order++) {
> +			for (offset = -1; offset <= (order > 1); offset++) {
> +				u64 ctx = BIT_ULL(order) + offset;
> +
> +				if (__intel_timeline_sync_is_later
> +				    (tl, ctx, p->seqno) != p->expected) {
> +					pr_err("1: %s(ctx=%llu, seqno=%u) expected passed %s but failed\n",
> +					       p->name, ctx, p->seqno, yesno(p->expected));
> +					ret = -EINVAL;
> +					goto out;
> +				}
> +
> +				if (p->set) {
> +					ret = __intel_timeline_sync_set(tl, ctx, p->seqno);
> +					if (ret)
> +						goto out;
> +				}
> +			}
> +		}
> +	}
> +	mock_timeline_destroy(tl);
> +
> +	tl = mock_timeline(0);
> +	if (!tl)
> +		return -ENOMEM;
> +
> +	for (order = 1; order < 64; order++) {
> +		for (offset = -1; offset <= (order > 1); offset++) {
> +			u64 ctx = BIT_ULL(order) + offset;
> +
> +			for (p = pass; p->name; p++) {
> +				if (__intel_timeline_sync_is_later
> +				    (tl, ctx, p->seqno) != p->expected) {
> +					pr_err("2: %s(ctx=%llu, seqno=%u) expected passed %s but failed\n",
> +					       p->name, ctx, p->seqno, yesno(p->expected));
> +					ret = -EINVAL;
> +					goto out;
> +				}
> +
> +				if (p->set) {
> +					ret = __intel_timeline_sync_set(tl, ctx, p->seqno);
> +					if (ret)
> +						goto out;
> +				}
> +			}
> +		}
> +	}
> +
> +out:
> +	mock_timeline_destroy(tl);
> +	return ret;
> +}
> +
> +static u64 prandom_u64_state(struct rnd_state *rnd)
> +{
> +	u64 x;
> +
> +	x = prandom_u32_state(rnd);
> +	x <<= 32;
> +	x |= prandom_u32_state(rnd);
> +
> +	return x;
> +}
> +
> +static unsigned int random_engine(struct rnd_state *rnd)
> +{
> +	return ((u64)prandom_u32_state(rnd) * I915_NUM_ENGINES) >> 32;
> +}
> +
> +static int bench_sync(void *arg)
> +{
> +	struct rnd_state prng;
> +	struct intel_timeline *tl;
> +	unsigned long end_time, count;
> +	ktime_t kt;
> +
> +	tl = mock_timeline(0);
> +	if (!tl)
> +		return -ENOMEM;
> +
> +	prandom_seed_state(&prng, i915_selftest.random_seed);
> +	count = 0;
> +	kt = ktime_get();
> +	end_time = jiffies + HZ/10;
> +	do {
> +		u64 id = prandom_u64_state(&prng);
> +
> +		__intel_timeline_sync_set(tl, id, 0);
> +		count++;
> +	} while (!time_after(jiffies, end_time));
> +	kt = ktime_sub(ktime_get(), kt);
> +	pr_info("%s: %lu random insertions, %lluns/insert\n",
> +		__func__, count, (long long)div64_ul(ktime_to_ns(kt), count));
> +
> +	prandom_seed_state(&prng, i915_selftest.random_seed);
> +	end_time = count;
> +	kt = ktime_get();
> +	while (end_time--) {
> +		u64 id = prandom_u64_state(&prng);
> +
> +		if (!__intel_timeline_sync_is_later(tl, id, 0)) {
> +			mock_timeline_destroy(tl);
> +			pr_err("Lookup of %llu failed\n", id);
> +			return -EINVAL;
> +		}
> +	}
> +	kt = ktime_sub(ktime_get(), kt);
> +	pr_info("%s: %lu random lookups, %lluns/lookup\n",
> +		__func__, count, (long long)div64_ul(ktime_to_ns(kt), count));
> +
> +	mock_timeline_destroy(tl);
> +
> +	tl = mock_timeline(0);
> +	if (!tl)
> +		return -ENOMEM;
> +
> +	count = 0;
> +	kt = ktime_get();
> +	end_time = jiffies + HZ/10;
> +	do {
> +		__intel_timeline_sync_set(tl, count++, 0);
> +	} while (!time_after(jiffies, end_time));
> +	kt = ktime_sub(ktime_get(), kt);
> +	pr_info("%s: %lu in-order insertions, %lluns/insert\n",
> +		__func__, count, (long long)div64_ul(ktime_to_ns(kt), count));
> +
> +	end_time = count;
> +	kt = ktime_get();
> +	while (end_time--) {
> +		if (!__intel_timeline_sync_is_later(tl, end_time, 0)) {
> +			pr_err("Lookup of %lu failed\n", end_time);
> +			mock_timeline_destroy(tl);
> +			return -EINVAL;
> +		}
> +	}
> +	kt = ktime_sub(ktime_get(), kt);
> +	pr_info("%s: %lu in-order lookups, %lluns/lookup\n",
> +		__func__, count, (long long)div64_ul(ktime_to_ns(kt), count));
> +
> +	mock_timeline_destroy(tl);
> +
> +	tl = mock_timeline(0);
> +	if (!tl)
> +		return -ENOMEM;
> +
> +	prandom_seed_state(&prng, i915_selftest.random_seed);
> +	count = 0;
> +	kt = ktime_get();
> +	end_time = jiffies + HZ/10;
> +	do {
> +		u32 id = random_engine(&prng);
> +		u32 seqno = prandom_u32_state(&prng);
> +
> +		if (!__intel_timeline_sync_is_later(tl, id, seqno))
> +			__intel_timeline_sync_set(tl, id, seqno);
> +
> +		count++;
> +	} while (!time_after(jiffies, end_time));
> +	kt = ktime_sub(ktime_get(), kt);
> +	pr_info("%s: %lu repeated insert/lookups, %lluns/op\n",
> +		__func__, count, (long long)div64_ul(ktime_to_ns(kt), count));
> +	mock_timeline_destroy(tl);
> +
> +	tl = mock_timeline(0);
> +	if (!tl)
> +		return -ENOMEM;
> +
> +	count = 0;
> +	kt = ktime_get();
> +	end_time = jiffies + HZ/10;
> +	do {
> +		if (!__intel_timeline_sync_is_later(tl, count & 7, count >> 4))
> +			__intel_timeline_sync_set(tl, count & 7, count >> 4);
> +
> +		count++;
> +	} while (!time_after(jiffies, end_time));
> +	kt = ktime_sub(ktime_get(), kt);
> +	pr_info("%s: %lu cyclic insert/lookups, %lluns/op\n",
> +		__func__, count, (long long)div64_ul(ktime_to_ns(kt), count));
> +	mock_timeline_destroy(tl);
> +
> +	return 0;
> +}
> +
> +int i915_gem_timeline_mock_selftests(void)
> +{
> +	static const struct i915_subtest tests[] = {
> +		SUBTEST(igt_sync),
> +		SUBTEST(bench_sync),
> +	};
> +
> +	return i915_subtests(tests, NULL);
> +}
> diff --git a/drivers/gpu/drm/i915/selftests/i915_mock_selftests.h b/drivers/gpu/drm/i915/selftests/i915_mock_selftests.h
> index be9a9ebf5692..8d0f50c25df8 100644
> --- a/drivers/gpu/drm/i915/selftests/i915_mock_selftests.h
> +++ b/drivers/gpu/drm/i915/selftests/i915_mock_selftests.h
> @@ -12,6 +12,7 @@ selftest(sanitycheck, i915_mock_sanitycheck) /* keep first (igt selfcheck) */
>  selftest(scatterlist, scatterlist_mock_selftests)
>  selftest(uncore, intel_uncore_mock_selftests)
>  selftest(breadcrumbs, intel_breadcrumbs_mock_selftests)
> +selftest(timelines, i915_gem_timeline_mock_selftests)
>  selftest(requests, i915_gem_request_mock_selftests)
>  selftest(objects, i915_gem_object_mock_selftests)
>  selftest(dmabuf, i915_gem_dmabuf_mock_selftests)
> diff --git a/drivers/gpu/drm/i915/selftests/mock_timeline.c b/drivers/gpu/drm/i915/selftests/mock_timeline.c
> new file mode 100644
> index 000000000000..47b1f47c5812
> --- /dev/null
> +++ b/drivers/gpu/drm/i915/selftests/mock_timeline.c
> @@ -0,0 +1,45 @@
> +/*
> + * Copyright © 2017 Intel Corporation
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a
> + * copy of this software and associated documentation files (the "Software"),
> + * to deal in the Software without restriction, including without limitation
> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> + * and/or sell copies of the Software, and to permit persons to whom the
> + * Software is furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice (including the next
> + * paragraph) shall be included in all copies or substantial portions of the
> + * Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
> + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
> + * IN THE SOFTWARE.
> + *
> + */
> +
> +#include "mock_timeline.h"
> +
> +struct intel_timeline *mock_timeline(u64 context)
> +{
> +	static struct lock_class_key class;
> +	struct intel_timeline *tl;
> +
> +	tl = kzalloc(sizeof(*tl), GFP_KERNEL);
> +	if (!tl)
> +		return NULL;
> +
> +	__intel_timeline_init(tl, NULL, context, &class, "mock");
> +
> +	return tl;
> +}
> +
> +void mock_timeline_destroy(struct intel_timeline *tl)
> +{
> +	__intel_timeline_fini(tl);
> +	kfree(tl);
> +}
> diff --git a/drivers/gpu/drm/i915/selftests/mock_timeline.h b/drivers/gpu/drm/i915/selftests/mock_timeline.h
> new file mode 100644
> index 000000000000..c27ff4639b8b
> --- /dev/null
> +++ b/drivers/gpu/drm/i915/selftests/mock_timeline.h
> @@ -0,0 +1,33 @@
> +/*
> + * Copyright © 2017 Intel Corporation
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a
> + * copy of this software and associated documentation files (the "Software"),
> + * to deal in the Software without restriction, including without limitation
> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> + * and/or sell copies of the Software, and to permit persons to whom the
> + * Software is furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice (including the next
> + * paragraph) shall be included in all copies or substantial portions of the
> + * Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
> + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
> + * IN THE SOFTWARE.
> + *
> + */
> +
> +#ifndef __MOCK_TIMELINE__
> +#define __MOCK_TIMELINE__
> +
> +#include "../i915_gem_timeline.h"
> +
> +struct intel_timeline *mock_timeline(u64 context);
> +void mock_timeline_destroy(struct intel_timeline *tl);
> +
> +#endif /* !__MOCK_TIMELINE__ */
>
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v10] drm/i915: Squash repeated awaits on the same fence
  2017-04-28  9:32         ` Tvrtko Ursulin
@ 2017-04-28  9:54           ` Chris Wilson
  0 siblings, 0 replies; 95+ messages in thread
From: Chris Wilson @ 2017-04-28  9:54 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: intel-gfx

On Fri, Apr 28, 2017 at 10:32:58AM +0100, Tvrtko Ursulin wrote:
> 
> On 28/04/2017 08:41, Chris Wilson wrote:
> >Track the latest fence waited upon on each context, and only add a new
> >asynchronous wait if the new fence is more recent than the recorded
> >fence for that context. This requires us to filter out unordered
> >timelines, which are noted by DMA_FENCE_NO_CONTEXT. However, in the
> >absence of a universal identifier, we have to use our own
> >i915->mm.unordered_timeline token.
> >
> >v2: Throw around the debug crutches
> >v3: Inline the likely case of the pre-allocation cache being full.
> >v4: Drop the pre-allocation support, we can lose the most recent fence
> >in case of allocation failure -- it just means we may emit more awaits
> >than strictly necessary but will not break.
> >v5: Trim allocation size for leaf nodes, they only need an array of u32
> >not pointers.
> >v6: Create mock_timeline to tidy selftest writing
> >v7: s/intel_timeline_sync_get/intel_timeline_sync_is_later/ (Tvrtko)
> >v8: Prune the stale sync points when we idle.
> >v9: Include a small benchmark in the kselftests
> >v10: Separate the idr implementation into its own compartment.
> 
> FYI I am reading v11 and commenting here. Hopefully it works out. :)
> 
> >
> >Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> >Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> >---
> > drivers/gpu/drm/i915/Makefile                      |   1 +
> > drivers/gpu/drm/i915/i915_gem.c                    |   1 +
> > drivers/gpu/drm/i915/i915_gem.h                    |   2 +
> > drivers/gpu/drm/i915/i915_gem_request.c            |   9 +
> > drivers/gpu/drm/i915/i915_gem_timeline.c           |  92 +++++-
> > drivers/gpu/drm/i915/i915_gem_timeline.h           |  36 ++
> > drivers/gpu/drm/i915/i915_syncmap.c                | 362 +++++++++++++++++++++
> > drivers/gpu/drm/i915/i915_syncmap.h                |  39 +++
> > drivers/gpu/drm/i915/selftests/i915_gem_timeline.c | 257 +++++++++++++++
> > .../gpu/drm/i915/selftests/i915_mock_selftests.h   |   1 +
> > drivers/gpu/drm/i915/selftests/mock_timeline.c     |  45 +++
> > drivers/gpu/drm/i915/selftests/mock_timeline.h     |  33 ++
> > 12 files changed, 860 insertions(+), 18 deletions(-)
> > create mode 100644 drivers/gpu/drm/i915/i915_syncmap.c
> > create mode 100644 drivers/gpu/drm/i915/i915_syncmap.h
> > create mode 100644 drivers/gpu/drm/i915/selftests/i915_gem_timeline.c
> > create mode 100644 drivers/gpu/drm/i915/selftests/mock_timeline.c
> > create mode 100644 drivers/gpu/drm/i915/selftests/mock_timeline.h
> >
> >diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
> >index 2cf04504e494..7b05fb802f4c 100644
> >--- a/drivers/gpu/drm/i915/Makefile
> >+++ b/drivers/gpu/drm/i915/Makefile
> >@@ -16,6 +16,7 @@ i915-y := i915_drv.o \
> > 	  i915_params.o \
> > 	  i915_pci.o \
> >           i915_suspend.o \
> >+	  i915_syncmap.o \
> > 	  i915_sw_fence.o \
> > 	  i915_sysfs.o \
> > 	  intel_csr.o \
> >diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> >index c1fa3c103f38..f886ef492036 100644
> >--- a/drivers/gpu/drm/i915/i915_gem.c
> >+++ b/drivers/gpu/drm/i915/i915_gem.c
> >@@ -3214,6 +3214,7 @@ i915_gem_idle_work_handler(struct work_struct *work)
> > 		intel_engine_disarm_breadcrumbs(engine);
> > 		i915_gem_batch_pool_fini(&engine->batch_pool);
> > 	}
> >+	i915_gem_timelines_mark_idle(dev_priv);
> >
> > 	GEM_BUG_ON(!dev_priv->gt.awake);
> > 	dev_priv->gt.awake = false;
> >diff --git a/drivers/gpu/drm/i915/i915_gem.h b/drivers/gpu/drm/i915/i915_gem.h
> >index 5a49487368ca..ee54597465b6 100644
> >--- a/drivers/gpu/drm/i915/i915_gem.h
> >+++ b/drivers/gpu/drm/i915/i915_gem.h
> >@@ -25,6 +25,8 @@
> > #ifndef __I915_GEM_H__
> > #define __I915_GEM_H__
> >
> >+#include <linux/bug.h>
> >+
> > #ifdef CONFIG_DRM_I915_DEBUG_GEM
> > #define GEM_BUG_ON(expr) BUG_ON(expr)
> > #define GEM_WARN_ON(expr) WARN_ON(expr)
> >diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
> >index 5fa4e52ded06..807fc1b65dd1 100644
> >--- a/drivers/gpu/drm/i915/i915_gem_request.c
> >+++ b/drivers/gpu/drm/i915/i915_gem_request.c
> >@@ -772,6 +772,11 @@ i915_gem_request_await_dma_fence(struct drm_i915_gem_request *req,
> > 		if (fence->context == req->fence.context)
> > 			continue;
> >
> >+		/* Squash repeated waits to the same timelines */
> >+		if (fence->context != req->i915->mm.unordered_timeline &&
> >+		    intel_timeline_sync_is_later(req->timeline, fence))
> >+			continue;
> >+
> > 		if (dma_fence_is_i915(fence))
> > 			ret = i915_gem_request_await_request(req,
> > 							     to_request(fence));
> >@@ -781,6 +786,10 @@ i915_gem_request_await_dma_fence(struct drm_i915_gem_request *req,
> > 							    GFP_KERNEL);
> > 		if (ret < 0)
> > 			return ret;
> >+
> >+		/* Record the latest fence used against each timeline */
> >+		if (fence->context != req->i915->mm.unordered_timeline)
> >+			intel_timeline_sync_set(req->timeline, fence);
> > 	} while (--nchild);
> >
> > 	return 0;
> >diff --git a/drivers/gpu/drm/i915/i915_gem_timeline.c b/drivers/gpu/drm/i915/i915_gem_timeline.c
> >index b596ca7ee058..a28a65db82e9 100644
> >--- a/drivers/gpu/drm/i915/i915_gem_timeline.c
> >+++ b/drivers/gpu/drm/i915/i915_gem_timeline.c
> >@@ -24,6 +24,31 @@
> >
> > #include "i915_drv.h"
> 
> #include "i915_syncmap.h"?
> 
> I think the over-reliance on i915_drv.h being an include-all is
> hurting us in some cases, and even though it is probably very hard
> to untangle, perhaps it is worth making new source files cleaner in
> that respect.

It's pulled in via i915_gem_timeline.h currently, which is itself pulled
in via i915_drv.h. Sigh, yes, this is a case where i915_drv.h hid an
error.

> > static int __i915_gem_timeline_init(struct drm_i915_private *i915,
> > 				    struct i915_gem_timeline *timeline,
> > 				    const char *name,
> >@@ -35,6 +60,12 @@ static int __i915_gem_timeline_init(struct drm_i915_private *i915,
> >
> > 	lockdep_assert_held(&i915->drm.struct_mutex);
> >
> >+	/*
> >+	 * Ideally we want a set of engines on a single leaf as we expect
> >+	 * to mostly be tracking synchronisation between engines.
> >+	 */
> >+	BUILD_BUG_ON(KSYNCMAP < I915_NUM_ENGINES);
> 
> Maybe also add BUILD_BUG_ON(!is_power_of_2(KSYNCMAP)) just in case.

Can do that over in syncmap_init. I figured we would hit a few
BUILD_BUG_ON_NOT_POWER_OF_2() anyway, but being explicit is sensible.
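
Something like this (sketch):

	void i915_syncmap_init(struct i915_syncmap **root)
	{
		BUILD_BUG_ON_NOT_POWER_OF_2(KSYNCMAP);
		BUILD_BUG_ON(KSYNCMAP > BITS_PER_BYTE * sizeof((*root)->bitmap));
		*root = NULL;
	}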

> >+/**
> >+ * i915_syncmap_is_later -- compare against the last known sync point
> >+ * @root - pointer to the #i915_syncmap
> >+ * @id - the context id (other timeline) we are synchronising to
> >+ * @seqno - the sequence number along the other timeline
> >+ *
> >+ * If we have already synchronised this @root with another (@id) then we can
> >+ * omit any repeated or earlier synchronisation requests. If the two timelines
> >+ * are already coupled, we can also omit the dependency between the two as that
> >+ * is already known via the timeline.
> >+ *
> >+ * Returns true if the two timelines are already synchronised wrt to @seqno,
> >+ * false if not and the synchronisation must be emitted.
> >+ */
> >+bool i915_syncmap_is_later(struct i915_syncmap **root, u64 id, u32 seqno)
> >+{
> >+	struct i915_syncmap *p;
> >+	unsigned int idx;
> >+
> >+	p = *root;
> >+	if (!p)
> >+		return false;
> >+
> >+	if (likely(__sync_leaf(p, id) == p->prefix))
> >+		goto found;
> 
> Are you sure likely is appropriate here?

Yes, it is primarily documenting the intent. If it fails the likely()
prediction, the idea of caching is mostly moot.
> 
> >+
> >+	/* First climb the tree back to a parent branch */
> >+	do {
> >+		p = p->parent;
> >+		if (!p)
> >+			return false;
> >+
> >+		if (__sync_prefix(p, id) == p->prefix)
> >+			break;
> >+	} while (1);
> >+
> >+	/* And then descend again until we find our leaf */
> >+	do {
> >+		if (!p->height)
> >+			break;
> >+
> >+		p = __sync_child(p)[__sync_idx(p, id)];
> >+		if (!p)
> >+			return false;
> >+
> >+		if (__sync_prefix(p, id) != p->prefix)
> >+			return false;
> >+	} while (1);
> >+
> >+	*root = p;
> >+found:
> >+	idx = id & MASK;
> 
> Would:
> 
> 	GEM_BUG_ON(p->height);
> 	idx = __sync_idx(p, id);
> 
> be correct (even if more verbose) here instead of idx = id & MASK?

Yes. But it will incur a shift (gcc will not know p->height is 0 in
!debug builds). So __sync_leaf_idx().
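
i.e. something along the lines of (sketch):

	static inline unsigned int
	__sync_leaf_idx(const struct i915_syncmap *p, u64 id)
	{
		GEM_BUG_ON(p->height);
		return id & MASK;
	}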

> 
> >+	if (!(p->bitmap & BIT(idx)))
> >+		return false;
> 
> I was thinking briefly whether seqno+bitmap get/set helpers would be
> helpful but I think there is no need. With the __sync_*_prefix helpers
> the algorithm is much more readable already.
> 
> >+
> >+	return seqno_later(__sync_seqno(p)[idx], seqno);
> >+}
> >+
> >+static struct i915_syncmap *
> >+__sync_alloc_leaf(struct i915_syncmap *parent, u64 id)
> >+{
> >+	struct i915_syncmap *p;
> >+
> >+	p = kmalloc(sizeof(*p) + KSYNCMAP * sizeof(u32), GFP_KERNEL);
> >+	if (unlikely(!p))
> >+		return NULL;
> >+
> >+	p->parent = parent;
> >+	p->height = 0;
> >+	p->bitmap = 0;
> >+	p->prefix = __sync_leaf(p, id);
> >+	return p;
> >+}
> >+
> >+static noinline int
> >+__i915_syncmap_set(struct i915_syncmap **root, u64 id, u32 seqno)
> >+{
> >+	struct i915_syncmap *p = *root;
> >+	unsigned int idx;
> >+
> >+	if (!p) {
> >+		p = __sync_alloc_leaf(NULL, id);
> >+		if (unlikely(!p))
> >+			return -ENOMEM;
> >+
> >+		goto found;
> >+	}
> >+
> 
> GEM_BUG_ON(p->prefix == __sync_leaf_prefix(p, id)) ? Or maybe better
> try to handle it rather than expect the caller will never do that? By
> handling it I mean not immediately starting to climb the tree just
> below, but first checking for the condition and doing a goto found.

Hmm, no. s/__i915_syncmap_set/__sync_set/; we shouldn't treat this as
being in isolation, as it is really a branch of i915_syncmap_set split out
to micro-optimise the likely path of i915_syncmap_set.

So a GEM_BUG_ON() on the preconditions is fine.
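
i.e. at the top of __sync_set, after the !p branch, something like (sketch):

	/* The caller handled the likely cached case already */
	GEM_BUG_ON(__sync_leaf(p, id) == p->prefix);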
 
> >+	/* Climb back up the tree until we find a common prefix */
> >+	do {
> >+		if (!p->parent)
> >+			break;
> >+
> >+		p = p->parent;
> >+
> >+		if (__sync_prefix(p, id) == p->prefix)
> >+			break;
> >+	} while (1);
> >+
> >+	/*
> >+	 * No shortcut, we have to descend the tree to find the right layer
> >+	 * containing this fence.
> >+	 *
> >+	 * Each layer in the tree holds 16 (KSYNCMAP) pointers, either fences
> >+	 * or lower layers. Leaf nodes (height = 0) contain the fences, all
> >+	 * other nodes (height > 0) are internal layers that point to a lower
> >+	 * node. Each internal layer has at least 2 descendents.
> >+	 *
> >+	 * Starting at the top, we check whether the current prefix matches. If
> >+	 * it doesn't, we have gone past our layer and need to insert a join
> >+	 * into the tree, and a new leaf node as a descendent as well as the
> >+	 * original layer.
> >+	 *
> >+	 * The matching prefix means we are still following the right branch
> >+	 * of the tree. If it has height 0, we have found our leaf and just
> >+	 * need to replace the fence slot with ourselves. If the height is
> >+	 * not zero, our slot contains the next layer in the tree (unless
> >+	 * it is empty, in which case we can add ourselves as a new leaf).
> >+	 * As we descend the tree the prefix grows (and the height decreases).
> >+	 */
> >+	do {
> >+		struct i915_syncmap *next;
> >+
> >+		if (__sync_prefix(p, id) != p->prefix) {
> >+			unsigned int above;
> >+
> >+			/* insert a join above the current layer */
> >+			next = kzalloc(sizeof(*next) + KSYNCMAP * sizeof(next),
> >+				       GFP_KERNEL);
> 
> Next is above, right? Would common_parent be correct? Or
> lowest_common_parent? Possibly not the name because it is too long,
> but I'm trying to figure out if I got the fls and xor business
> right.

The problem is that next here is the parent, while next later on is the
child. So next, for lack of a better name.

> >+			if (unlikely(!next))
> >+				return -ENOMEM;
> >+
> >+			above = fls64(__sync_prefix(p, id) ^ p->prefix);
> >+			above = round_up(above, SHIFT);
> >+			next->height = above + p->height;
> >+			next->prefix = __sync_prefix(next, id);
> >+
> >+			if (p->parent)
> >+				__sync_child(p->parent)[__sync_idx(p->parent, id)] = next;
> >+			next->parent = p->parent;
> >+
> >+			idx = p->prefix >> (above - SHIFT) & MASK;
> >+			__sync_child(next)[idx] = p;
> >+			next->bitmap |= BIT(idx);
> >+			p->parent = next;
> >+
> >+			/* ascend to the join */
> >+			p = next;
> >+		} else {
> >+			if (!p->height)
> >+				break;
> >+		}
> >+
> >+		/* descend into the next layer */
> >+		GEM_BUG_ON(!p->height);
> >+		idx = __sync_idx(p, id);
> >+		next = __sync_child(p)[idx];
> >+		if (unlikely(!next)) {
> 
> Why is this one unlikely?

Iirc, I was painting the malloc failure paths at the time. So this is a
silly one.

> >+			next = __sync_alloc_leaf(p, id);
> >+			if (unlikely(!next))
> >+				return -ENOMEM;
> >+
> >+			__sync_child(p)[idx] = next;
> >+			p->bitmap |= BIT(idx);
> >+
> >+			p = next;
> >+			break;
> >+		}
> >+
> >+		p = next;
> >+	} while (1);
> >+
> >+found:
> >+	GEM_BUG_ON(p->prefix != __sync_leaf(p, id));
> >+	idx = id & MASK;
> >+	__sync_seqno(p)[idx] = seqno;
> >+	p->bitmap |= BIT(idx);
> 
> Actually __sync_set_seqno(p, id, seqno) might be useful here and in
> i915_syncmap_set below.
> 
> The below looks OK for the moment. Still coming to terms with the
> above loop. Postponing drawing the diagram... :)

If you have a pretty ASCII diagram, I'll paste it in!
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v10] drm/i915: Squash repeated awaits on the same fence
  2017-04-28  7:41       ` [PATCH v10] " Chris Wilson
  2017-04-28  7:59         ` Chris Wilson
  2017-04-28  9:32         ` Tvrtko Ursulin
@ 2017-04-28  9:55         ` Tvrtko Ursulin
  2017-04-28 10:11           ` Chris Wilson
  2017-04-28 14:12         ` [PATCH v13] " Chris Wilson
  3 siblings, 1 reply; 95+ messages in thread
From: Tvrtko Ursulin @ 2017-04-28  9:55 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 28/04/2017 08:41, Chris Wilson wrote:

[snip]

> +static int igt_sync(void *arg)
> +{
> +	const struct {
> +		const char *name;
> +		u32 seqno;
> +		bool expected;
> +		bool set;
> +	} pass[] = {
> +		{ "unset", 0, false, false },
> +		{ "new", 0, false, true },
> +		{ "0a", 0, true, true },
> +		{ "1a", 1, false, true },
> +		{ "1b", 1, true, true },
> +		{ "0b", 0, true, false },
> +		{ "2a", 2, false, true },
> +		{ "4", 4, false, true },
> +		{ "INT_MAX", INT_MAX, false, true },
> +		{ "INT_MAX-1", INT_MAX-1, true, false },
> +		{ "INT_MAX+1", (u32)INT_MAX+1, false, true },
> +		{ "INT_MAX", INT_MAX, true, false },
> +		{ "UINT_MAX", UINT_MAX, false, true },
> +		{ "wrap", 0, false, true },
> +		{ "unwrap", UINT_MAX, true, false },
> +		{},
> +	}, *p;
> +	struct intel_timeline *tl;
> +	int order, offset;
> +	int ret;
> +
> +	tl = mock_timeline(0);
> +	if (!tl)
> +		return -ENOMEM;
> +
> +	for (p = pass; p->name; p++) {
> +		for (order = 1; order < 64; order++) {
> +			for (offset = -1; offset <= (order > 1); offset++) {
> +				u64 ctx = BIT_ULL(order) + offset;
> +
> +				if (__intel_timeline_sync_is_later
> +				    (tl, ctx, p->seqno) != p->expected) {
> +					pr_err("1: %s(ctx=%llu, seqno=%u) expected passed %s but failed\n",
> +					       p->name, ctx, p->seqno, yesno(p->expected));
> +					ret = -EINVAL;
> +					goto out;
> +				}
> +
> +				if (p->set) {
> +					ret = __intel_timeline_sync_set(tl, ctx, p->seqno);
> +					if (ret)
> +						goto out;
> +				}
> +			}
> +		}
> +	}

I think verification that the tree height matches the expectation, and 
also the total number of nodes, is required.

If that is too complicated in the current structure, you could add a simpler 
iteration over another table of steps, which would include the ctx id and 
seqno, plus the expected height/total node count after each step.
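
Something along these lines, just to illustrate (a rough sketch; 
check_syncmap() would be a new debug helper, since the selftest cannot see 
the tree internals as-is):

	static const struct {
		u64 context;
		u32 seqno;
		unsigned int expected_depth;	/* layers in the tree */
		unsigned int expected_nodes;	/* total allocated nodes */
	} steps[] = {
		{ 1, 1, 1, 1 },			/* a single leaf */
		{ 1 + KSYNCMAP, 1, 2, 3 },	/* forces a join above that leaf */
	};

and then a loop over steps[] calling __intel_timeline_sync_set() followed 
by check_syncmap(tl, steps[i].expected_depth, steps[i].expected_nodes).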

Regards,

Tvrtko

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v10] drm/i915: Squash repeated awaits on the same fence
  2017-04-28  9:55         ` Tvrtko Ursulin
@ 2017-04-28 10:11           ` Chris Wilson
  0 siblings, 0 replies; 95+ messages in thread
From: Chris Wilson @ 2017-04-28 10:11 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: intel-gfx

On Fri, Apr 28, 2017 at 10:55:09AM +0100, Tvrtko Ursulin wrote:
> 
> On 28/04/2017 08:41, Chris Wilson wrote:
> 
> [snip]
> 
> >+static int igt_sync(void *arg)
> >+{
> >+	const struct {
> >+		const char *name;
> >+		u32 seqno;
> >+		bool expected;
> >+		bool set;
> >+	} pass[] = {
> >+		{ "unset", 0, false, false },
> >+		{ "new", 0, false, true },
> >+		{ "0a", 0, true, true },
> >+		{ "1a", 1, false, true },
> >+		{ "1b", 1, true, true },
> >+		{ "0b", 0, true, false },
> >+		{ "2a", 2, false, true },
> >+		{ "4", 4, false, true },
> >+		{ "INT_MAX", INT_MAX, false, true },
> >+		{ "INT_MAX-1", INT_MAX-1, true, false },
> >+		{ "INT_MAX+1", (u32)INT_MAX+1, false, true },
> >+		{ "INT_MAX", INT_MAX, true, false },
> >+		{ "UINT_MAX", UINT_MAX, false, true },
> >+		{ "wrap", 0, false, true },
> >+		{ "unwrap", UINT_MAX, true, false },
> >+		{},
> >+	}, *p;
> >+	struct intel_timeline *tl;
> >+	int order, offset;
> >+	int ret;
> >+
> >+	tl = mock_timeline(0);
> >+	if (!tl)
> >+		return -ENOMEM;
> >+
> >+	for (p = pass; p->name; p++) {
> >+		for (order = 1; order < 64; order++) {
> >+			for (offset = -1; offset <= (order > 1); offset++) {
> >+				u64 ctx = BIT_ULL(order) + offset;
> >+
> >+				if (__intel_timeline_sync_is_later
> >+				    (tl, ctx, p->seqno) != p->expected) {
> >+					pr_err("1: %s(ctx=%llu, seqno=%u) expected passed %s but failed\n",
> >+					       p->name, ctx, p->seqno, yesno(p->expected));
> >+					ret = -EINVAL;
> >+					goto out;
> >+				}
> >+
> >+				if (p->set) {
> >+					ret = __intel_timeline_sync_set(tl, ctx, p->seqno);
> >+					if (ret)
> >+						goto out;
> >+				}
> >+			}
> >+		}
> >+	}
> 
> I think verification that the tree height matches the expectation,
> and also the total number of nodes, is required.

That sounds like a good excuse to start selftests/i915_syncmap.

My primary goal here is to exercise the simpler intel_timeline_sync
interface, i.e. this portion should be agnostic to the implementation.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 13/27] drm/i915/execlists: Pack the count into the low bits of the port.request
  2017-04-27 14:37     ` Chris Wilson
@ 2017-04-28 12:02       ` Tvrtko Ursulin
  2017-04-28 12:21         ` Chris Wilson
  0 siblings, 1 reply; 95+ messages in thread
From: Tvrtko Ursulin @ 2017-04-28 12:02 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx, Mika Kuoppala


On 27/04/2017 15:37, Chris Wilson wrote:
> On Thu, Apr 20, 2017 at 03:58:19PM +0100, Tvrtko Ursulin wrote:
>>> static void record_context(struct drm_i915_error_context *e,
>>> diff --git a/drivers/gpu/drm/i915/i915_guc_submission.c b/drivers/gpu/drm/i915/i915_guc_submission.c
>>> index 1642fff9cf13..370373c97b81 100644
>>> --- a/drivers/gpu/drm/i915/i915_guc_submission.c
>>> +++ b/drivers/gpu/drm/i915/i915_guc_submission.c
>>> @@ -658,7 +658,7 @@ static void nested_enable_signaling(struct drm_i915_gem_request *rq)
>>> static bool i915_guc_dequeue(struct intel_engine_cs *engine)
>>> {
>>> 	struct execlist_port *port = engine->execlist_port;
>>> -	struct drm_i915_gem_request *last = port[0].request;
>>> +	struct drm_i915_gem_request *last = port[0].request_count;
>>
>> It's confusing that in this new scheme sometimes we have direct
>> access to the request and sometimes we have to go through the
>> port_request macro.
>>
>> So maybe we should always use the port_request macro. Hm, could we
>> invent a new type to help enforce that? Like:
>>
>> struct drm_i915_gem_port_request_slot {
>> 	struct drm_i915_gem_request *req_count;
>> };
>>
>> And then execlist port would contain these and helpers would need to
>> be functions?
>>
>> I've also noticed some GVT/GuC patches which sounded like they are
>> adding the same single submission constraints so maybe now is the
>> time to unify the dequeue? (Haven't looked at those patches deeper
>> than the subject line so might be wrong.)
>>
>> Not sure 100% of all the above, would need to sketch it. What are
>> your thoughts?
>
> I foresee a use for the count in guc as well, so conversion is ok with
> me.

Conversion to a wrapper structure as I proposed or keeping it as you 
have it?

>>> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
>>> index d25b88467e5e..39b733e5cfd3 100644
>>> --- a/drivers/gpu/drm/i915/intel_ringbuffer.h
>>> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
>>> @@ -377,8 +377,12 @@ struct intel_engine_cs {
>>> 	/* Execlists */
>>> 	struct tasklet_struct irq_tasklet;
>>> 	struct execlist_port {
>>> -		struct drm_i915_gem_request *request;
>>> -		unsigned int count;
>>> +		struct drm_i915_gem_request *request_count;
>>
>> Would req(uest)_slot maybe be better?
>
> It's definitely a count (of how many times this request has been
> submitted), and I like long verbose names when I don't want them to be
> used directly. So expect guc to be tidied.

It is a pointer and a count. My point was that request_count sounds too 
much like a count of how many times something has been done to, or has 
happened to, the request.

Request_slot was my attempt to make it obvious from the name itself that 
there is more to it. And the wrapper struct was another step further; the 
idea was that it would make sure you always have to access this field via 
the accessor, since I think going sometimes directly and sometimes via the 
wrapper is too fragile.
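
For illustration, roughly what I mean (a sketch only; the exact 
packing/masking of the count is hand-waved here):

	struct drm_i915_gem_port_request_slot {
		/* request pointer with the submission count in the low bits */
		struct drm_i915_gem_request *req_count;
	};

	static inline struct drm_i915_gem_request *
	port_request(const struct drm_i915_gem_port_request_slot *slot)
	{
		/* strip the count bits; the mask width is a placeholder */
		return (struct drm_i915_gem_request *)
			((unsigned long)slot->req_count & ~3ul);
	}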

Anyway, my big issue is I am not sure if we are in agreement or not. Do 
you agree going with the wrapper structure makes sense or not?

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 16/27] drm/i915: Reinstate reservation_object zapping for batch_pool objects
  2017-04-19  9:41 ` [PATCH 16/27] drm/i915: Reinstate reservation_object zapping for batch_pool objects Chris Wilson
@ 2017-04-28 12:20   ` Tvrtko Ursulin
  0 siblings, 0 replies; 95+ messages in thread
From: Tvrtko Ursulin @ 2017-04-28 12:20 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx; +Cc: Matthew Auld, Mika Kuoppala


On 19/04/2017 10:41, Chris Wilson wrote:
> I removed the zapping of the reservation_object->fence array of shared
> fences prematurely. We don't yet have the code to zap that array when
> retiring the object, and so currently it remains possible to continually
> grow the shared array trapping requests when reusing the batch_pool
> object across many timelines.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> Cc: Matthew Auld <matthew.auld@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_gem_batch_pool.c | 18 ++++++++++++++++--
>  1 file changed, 16 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_gem_batch_pool.c b/drivers/gpu/drm/i915/i915_gem_batch_pool.c
> index 41aa598c4f3b..414e46e2f072 100644
> --- a/drivers/gpu/drm/i915/i915_gem_batch_pool.c
> +++ b/drivers/gpu/drm/i915/i915_gem_batch_pool.c
> @@ -114,12 +114,26 @@ i915_gem_batch_pool_get(struct i915_gem_batch_pool *pool,
>  	list_for_each_entry(obj, list, batch_pool_link) {
>  		/* The batches are strictly LRU ordered */
>  		if (i915_gem_object_is_active(obj)) {
> -			if (!reservation_object_test_signaled_rcu(obj->resv,
> -								  true))
> +			struct reservation_object *resv = obj->resv;
> +
> +			if (!reservation_object_test_signaled_rcu(resv, true))
>  				break;
>
>  			i915_gem_retire_requests(pool->engine->i915);
>  			GEM_BUG_ON(i915_gem_object_is_active(obj));
> +
> +			/* The object is now idle, clear the array of shared
> +			 * fences before we add a new request. Although we
> +			 * remain on the same engine, we may be on a different
> +			 * timeline and so may continually grow the array,
> +			 * trapping a reference to all the old fences, rather
> +			 * than replace the existing fence.
> +			 */
> +			if (rcu_access_pointer(resv->fence)) {
> +				reservation_object_lock(resv, NULL);
> +				reservation_object_add_excl_fence(resv, NULL);
> +				reservation_object_unlock(resv);
> +			}
>  		}
>
>  		GEM_BUG_ON(!reservation_object_test_signaled_rcu(obj->resv,
>

Not too familiar with the reservation object stuff, but having read some 
kerneldoc it looks correct to me.

Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 13/27] drm/i915/execlists: Pack the count into the low bits of the port.request
  2017-04-28 12:02       ` Tvrtko Ursulin
@ 2017-04-28 12:21         ` Chris Wilson
  0 siblings, 0 replies; 95+ messages in thread
From: Chris Wilson @ 2017-04-28 12:21 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: intel-gfx, Mika Kuoppala

On Fri, Apr 28, 2017 at 01:02:25PM +0100, Tvrtko Ursulin wrote:
> 
> On 27/04/2017 15:37, Chris Wilson wrote:
> >On Thu, Apr 20, 2017 at 03:58:19PM +0100, Tvrtko Ursulin wrote:
> >>>static void record_context(struct drm_i915_error_context *e,
> >>>diff --git a/drivers/gpu/drm/i915/i915_guc_submission.c b/drivers/gpu/drm/i915/i915_guc_submission.c
> >>>index 1642fff9cf13..370373c97b81 100644
> >>>--- a/drivers/gpu/drm/i915/i915_guc_submission.c
> >>>+++ b/drivers/gpu/drm/i915/i915_guc_submission.c
> >>>@@ -658,7 +658,7 @@ static void nested_enable_signaling(struct drm_i915_gem_request *rq)
> >>>static bool i915_guc_dequeue(struct intel_engine_cs *engine)
> >>>{
> >>>	struct execlist_port *port = engine->execlist_port;
> >>>-	struct drm_i915_gem_request *last = port[0].request;
> >>>+	struct drm_i915_gem_request *last = port[0].request_count;
> >>
> >>It's confusing that in this new scheme sometimes we have direct
> >>access to the request and sometimes we have to go through the
> >>port_request macro.
> >>
> >>So maybe we should always use the port_request macro. Hm, could we
> >>invent a new type to help enforce that? Like:
> >>
> >>struct drm_i915_gem_port_request_slot {
> >>	struct drm_i915_gem_request *req_count;
> >>};
> >>
> >>And then execlist port would contain these and helpers would need to
> >>be functions?
> >>
> >>I've also noticed some GVT/GuC patches which sounded like they are
> >>adding the same single submission constraints so maybe now is the
> >>time to unify the dequeue? (Haven't looked at those patches deeper
> >>than the subject line so might be wrong.)
> >>
> >>Not sure 100% of all the above, would need to sketch it. What are
> >>your thoughts?
> >
> >I foresee a use for the count in guc as well, so conversion is ok with
> >me.
> 
> Conversion to a wrapper structure as I proposed or keeping it as you
> have it?
> 
> >>>diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
> >>>index d25b88467e5e..39b733e5cfd3 100644
> >>>--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
> >>>+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
> >>>@@ -377,8 +377,12 @@ struct intel_engine_cs {
> >>>	/* Execlists */
> >>>	struct tasklet_struct irq_tasklet;
> >>>	struct execlist_port {
> >>>-		struct drm_i915_gem_request *request;
> >>>-		unsigned int count;
> >>>+		struct drm_i915_gem_request *request_count;
> >>
> >>Would req(uest)_slot maybe be better?
> >
> >It's definitely a count (of how many times this request has been
> >submitted), and I like long verbose names when I don't want them to be
> >used directly. So expect guc to be tidied.
> 
> It is a pointer and a count. My point was that request_count sounds
> too much like a count of how many times has something been done or
> happened to the request.
> 
> Request_slot was my attempt to make it obvious in the name itself
> there is more to it. And the wrapper struct was another step
> further, plus the idea was it would make sure you always need to
> access this field via the accessor. Since I think going sometimes
> directly and sometimes via wrapper is too fragile.

I read slot as port[slot], whereas I am using it as a count of how many
times I have done something with this request/context.

> Anyway, my big issue is I am not sure if we are in agreement or not.
> Do you agree going with the wrapper structure makes sense or not?

I'm using port_request() in guc, see the version in #prescheduler.

What I haven't come up with is a good plan for assignment, which still
uses port->request_count = port_pack(), but now that is limited to just
the port_assign() functions.
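
For anyone following along, the packing itself is just the usual
tagged-pointer trick of stashing a small submission count in the low
(alignment) bits of the request pointer. A rough sketch (the names and
the bit width here are illustrative, not the actual port_pack() and
port_request() helpers):

#define PORT_COUNT_BITS 2
#define PORT_COUNT_MASK ((1UL << PORT_COUNT_BITS) - 1)

static inline void *port_pack_sketch(struct drm_i915_gem_request *rq,
				     unsigned int count)
{
	/* rq is at least 4-byte aligned, so the two low bits are free */
	return (void *)((unsigned long)rq | (count & PORT_COUNT_MASK));
}

static inline struct drm_i915_gem_request *
port_unpack_sketch(void *packed, unsigned int *count)
{
	*count = (unsigned long)packed & PORT_COUNT_MASK;
	return (struct drm_i915_gem_request *)
		((unsigned long)packed & ~PORT_COUNT_MASK);
}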
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 95+ messages in thread

* [PATCH v13] drm/i915: Squash repeated awaits on the same fence
  2017-04-28  7:41       ` [PATCH v10] " Chris Wilson
                           ` (2 preceding siblings ...)
  2017-04-28  9:55         ` Tvrtko Ursulin
@ 2017-04-28 14:12         ` Chris Wilson
  2017-04-28 19:02           ` [PATCH v14] " Chris Wilson
  3 siblings, 1 reply; 95+ messages in thread
From: Chris Wilson @ 2017-04-28 14:12 UTC (permalink / raw)
  To: intel-gfx

Track the latest fence waited upon on each context, and only add a new
asynchronous wait if the new fence is more recent than the recorded
fence for that context. This requires us to filter out unordered
timelines, which are noted by DMA_FENCE_NO_CONTEXT. However, in the
absence of a universal identifier, we have to use our own
i915->mm.unordered_timeline token.

v2: Throw around the debug crutches
v3: Inline the likely case of the pre-allocation cache being full.
v4: Drop the pre-allocation support, we can lose the most recent fence
in case of allocation failure -- it just means we may emit more awaits
than strictly necessary but will not break.
v5: Trim allocation size for leaf nodes, they only need an array of u32
not pointers.
v6: Create mock_timeline to tidy selftest writing
v7: s/intel_timeline_sync_get/intel_timeline_sync_is_later/ (Tvrtko)
v8: Prune the stale sync points when we idle.
v9: Include a small benchmark in the kselftests
v10: Separate the idr implementation into its own compartment. (Tvrtko)
v11: Refactor igt_sync kselftests to avoid deep nesting (Tvrtko)
v12: __sync_leaf_idx() to assert that p->height is 0 when checking leaves
v13: kselftests to investigate struct i915_syncmap itself (Tvrtko)

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
---
Now just needs some ascii art!
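
Until the ascii art materialises, a quick sketch of the intended
semantics (illustrative only: the context ids below are arbitrary and
the error returns from i915_syncmap_set() are ignored for brevity):

	struct i915_syncmap *sync;

	i915_syncmap_init(&sync);

	i915_syncmap_set(&sync, 0x100, 20);
	i915_syncmap_is_later(&sync, 0x100, 10); /* true: 10 is older, skip the await */
	i915_syncmap_is_later(&sync, 0x100, 30); /* false: newer, must emit the await */
	i915_syncmap_is_later(&sync, 0x200, 10); /* false: context never seen */

	i915_syncmap_set(&sync, 0x100, UINT_MAX);
	i915_syncmap_is_later(&sync, 0x100, 1);  /* false: 1 is after the u32 wrap */

	i915_syncmap_free(&sync);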
---
 drivers/gpu/drm/i915/Makefile                      |   1 +
 drivers/gpu/drm/i915/i915_gem.c                    |   1 +
 drivers/gpu/drm/i915/i915_gem.h                    |   2 +
 drivers/gpu/drm/i915/i915_gem_request.c            |   9 +
 drivers/gpu/drm/i915/i915_gem_timeline.c           |  93 +++-
 drivers/gpu/drm/i915/i915_gem_timeline.h           |  38 ++
 drivers/gpu/drm/i915/i915_syncmap.c                | 381 +++++++++++++++
 drivers/gpu/drm/i915/i915_syncmap.h                |  39 ++
 drivers/gpu/drm/i915/selftests/i915_gem_timeline.c | 283 +++++++++++
 .../gpu/drm/i915/selftests/i915_mock_selftests.h   |   2 +
 drivers/gpu/drm/i915/selftests/i915_syncmap.c      | 539 +++++++++++++++++++++
 drivers/gpu/drm/i915/selftests/mock_timeline.c     |  45 ++
 drivers/gpu/drm/i915/selftests/mock_timeline.h     |  33 ++
 13 files changed, 1448 insertions(+), 18 deletions(-)
 create mode 100644 drivers/gpu/drm/i915/i915_syncmap.c
 create mode 100644 drivers/gpu/drm/i915/i915_syncmap.h
 create mode 100644 drivers/gpu/drm/i915/selftests/i915_gem_timeline.c
 create mode 100644 drivers/gpu/drm/i915/selftests/i915_syncmap.c
 create mode 100644 drivers/gpu/drm/i915/selftests/mock_timeline.c
 create mode 100644 drivers/gpu/drm/i915/selftests/mock_timeline.h

diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
index 2cf04504e494..7b05fb802f4c 100644
--- a/drivers/gpu/drm/i915/Makefile
+++ b/drivers/gpu/drm/i915/Makefile
@@ -16,6 +16,7 @@ i915-y := i915_drv.o \
 	  i915_params.o \
 	  i915_pci.o \
           i915_suspend.o \
+	  i915_syncmap.o \
 	  i915_sw_fence.o \
 	  i915_sysfs.o \
 	  intel_csr.o \
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 327645ae7d96..edd4baae892a 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -3214,6 +3214,7 @@ i915_gem_idle_work_handler(struct work_struct *work)
 		intel_engine_disarm_breadcrumbs(engine);
 		i915_gem_batch_pool_fini(&engine->batch_pool);
 	}
+	i915_gem_timelines_mark_idle(dev_priv);
 
 	GEM_BUG_ON(!dev_priv->gt.awake);
 	dev_priv->gt.awake = false;
diff --git a/drivers/gpu/drm/i915/i915_gem.h b/drivers/gpu/drm/i915/i915_gem.h
index 5a49487368ca..ee54597465b6 100644
--- a/drivers/gpu/drm/i915/i915_gem.h
+++ b/drivers/gpu/drm/i915/i915_gem.h
@@ -25,6 +25,8 @@
 #ifndef __I915_GEM_H__
 #define __I915_GEM_H__
 
+#include <linux/bug.h>
+
 #ifdef CONFIG_DRM_I915_DEBUG_GEM
 #define GEM_BUG_ON(expr) BUG_ON(expr)
 #define GEM_WARN_ON(expr) WARN_ON(expr)
diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
index 5fa4e52ded06..807fc1b65dd1 100644
--- a/drivers/gpu/drm/i915/i915_gem_request.c
+++ b/drivers/gpu/drm/i915/i915_gem_request.c
@@ -772,6 +772,11 @@ i915_gem_request_await_dma_fence(struct drm_i915_gem_request *req,
 		if (fence->context == req->fence.context)
 			continue;
 
+		/* Squash repeated waits to the same timelines */
+		if (fence->context != req->i915->mm.unordered_timeline &&
+		    intel_timeline_sync_is_later(req->timeline, fence))
+			continue;
+
 		if (dma_fence_is_i915(fence))
 			ret = i915_gem_request_await_request(req,
 							     to_request(fence));
@@ -781,6 +786,10 @@ i915_gem_request_await_dma_fence(struct drm_i915_gem_request *req,
 							    GFP_KERNEL);
 		if (ret < 0)
 			return ret;
+
+		/* Record the latest fence used against each timeline */
+		if (fence->context != req->i915->mm.unordered_timeline)
+			intel_timeline_sync_set(req->timeline, fence);
 	} while (--nchild);
 
 	return 0;
diff --git a/drivers/gpu/drm/i915/i915_gem_timeline.c b/drivers/gpu/drm/i915/i915_gem_timeline.c
index b596ca7ee058..f271e93310fb 100644
--- a/drivers/gpu/drm/i915/i915_gem_timeline.c
+++ b/drivers/gpu/drm/i915/i915_gem_timeline.c
@@ -23,6 +23,32 @@
  */
 
 #include "i915_drv.h"
+#include "i915_syncmap.h"
+
+static void __intel_timeline_init(struct intel_timeline *tl,
+				  struct i915_gem_timeline *parent,
+				  u64 context,
+				  struct lock_class_key *lockclass,
+				  const char *lockname)
+{
+	tl->fence_context = context;
+	tl->common = parent;
+#ifdef CONFIG_DEBUG_SPINLOCK
+	__raw_spin_lock_init(&tl->lock.rlock, lockname, lockclass);
+#else
+	spin_lock_init(&tl->lock);
+#endif
+	init_request_active(&tl->last_request, NULL);
+	INIT_LIST_HEAD(&tl->requests);
+	i915_syncmap_init(&tl->sync);
+}
+
+static void __intel_timeline_fini(struct intel_timeline *tl)
+{
+	GEM_BUG_ON(!list_empty(&tl->requests));
+
+	i915_syncmap_free(&tl->sync);
+}
 
 static int __i915_gem_timeline_init(struct drm_i915_private *i915,
 				    struct i915_gem_timeline *timeline,
@@ -35,6 +61,12 @@ static int __i915_gem_timeline_init(struct drm_i915_private *i915,
 
 	lockdep_assert_held(&i915->drm.struct_mutex);
 
+	/*
+	 * Ideally we want a set of engines on a single leaf as we expect
+	 * to mostly be tracking synchronisation between engines.
+	 */
+	BUILD_BUG_ON(KSYNCMAP < I915_NUM_ENGINES);
+
 	timeline->i915 = i915;
 	timeline->name = kstrdup(name ?: "[kernel]", GFP_KERNEL);
 	if (!timeline->name)
@@ -44,19 +76,10 @@ static int __i915_gem_timeline_init(struct drm_i915_private *i915,
 
 	/* Called during early_init before we know how many engines there are */
 	fences = dma_fence_context_alloc(ARRAY_SIZE(timeline->engine));
-	for (i = 0; i < ARRAY_SIZE(timeline->engine); i++) {
-		struct intel_timeline *tl = &timeline->engine[i];
-
-		tl->fence_context = fences++;
-		tl->common = timeline;
-#ifdef CONFIG_DEBUG_SPINLOCK
-		__raw_spin_lock_init(&tl->lock.rlock, lockname, lockclass);
-#else
-		spin_lock_init(&tl->lock);
-#endif
-		init_request_active(&tl->last_request, NULL);
-		INIT_LIST_HEAD(&tl->requests);
-	}
+	for (i = 0; i < ARRAY_SIZE(timeline->engine); i++)
+		__intel_timeline_init(&timeline->engine[i],
+				      timeline, fences++,
+				      lockclass, lockname);
 
 	return 0;
 }
@@ -81,18 +104,52 @@ int i915_gem_timeline_init__global(struct drm_i915_private *i915)
 					&class, "&global_timeline->lock");
 }
 
+/**
+ * i915_gem_timelines_mark_idle -- called when the driver idles
+ * @i915 - the drm_i915_private device
+ *
+ * When the driver is completely idle, we know that all of our sync points
+ * have been signaled and our tracking is then entirely redundant. Any request
+ * to wait upon an older sync point will be completed instantly as we know
+ * the fence is signaled and therefore we will not even look them up in the
+ * sync point map.
+ */
+void i915_gem_timelines_mark_idle(struct drm_i915_private *i915)
+{
+	struct i915_gem_timeline *timeline;
+	int i;
+
+	lockdep_assert_held(&i915->drm.struct_mutex);
+
+	list_for_each_entry(timeline, &i915->gt.timelines, link) {
+		for (i = 0; i < ARRAY_SIZE(timeline->engine); i++) {
+			struct intel_timeline *tl = &timeline->engine[i];
+
+			/*
+			 * All known fences are completed so we can scrap
+			 * the current sync point tracking and start afresh,
+			 * any attempt to wait upon a previous sync point
+			 * will be skipped as the fence was signaled.
+			 */
+			i915_syncmap_free(&tl->sync);
+		}
+	}
+}
+
 void i915_gem_timeline_fini(struct i915_gem_timeline *timeline)
 {
 	int i;
 
 	lockdep_assert_held(&timeline->i915->drm.struct_mutex);
 
-	for (i = 0; i < ARRAY_SIZE(timeline->engine); i++) {
-		struct intel_timeline *tl = &timeline->engine[i];
-
-		GEM_BUG_ON(!list_empty(&tl->requests));
-	}
+	for (i = 0; i < ARRAY_SIZE(timeline->engine); i++)
+		__intel_timeline_fini(&timeline->engine[i]);
 
 	list_del(&timeline->link);
 	kfree(timeline->name);
 }
+
+#if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
+#include "selftests/mock_timeline.c"
+#include "selftests/i915_gem_timeline.c"
+#endif
diff --git a/drivers/gpu/drm/i915/i915_gem_timeline.h b/drivers/gpu/drm/i915/i915_gem_timeline.h
index 6c53e14cab2a..82d59126eb60 100644
--- a/drivers/gpu/drm/i915/i915_gem_timeline.h
+++ b/drivers/gpu/drm/i915/i915_gem_timeline.h
@@ -27,7 +27,9 @@
 
 #include <linux/list.h>
 
+#include "i915_utils.h"
 #include "i915_gem_request.h"
+#include "i915_syncmap.h"
 
 struct i915_gem_timeline;
 
@@ -55,6 +57,17 @@ struct intel_timeline {
 	 * struct_mutex.
 	 */
 	struct i915_gem_active last_request;
+
+	/**
+	 * We track the most recent seqno that we wait on in every context so
+	 * that we only have to emit a new await and dependency on a more
+	 * recent sync point. As the contexts may be executed out-of-order, we
+	 * have to track each individually and cannot rely on an absolute
+	 * global_seqno. When we know that all tracked fences are completed
+	 * (i.e. when the driver is idle), we know that the syncmap is
+	 * redundant and we can discard it without loss of generality.
+	 */
+	struct i915_syncmap *sync;
 	u32 sync_seqno[I915_NUM_ENGINES];
 
 	struct i915_gem_timeline *common;
@@ -73,6 +86,31 @@ int i915_gem_timeline_init(struct drm_i915_private *i915,
 			   struct i915_gem_timeline *tl,
 			   const char *name);
 int i915_gem_timeline_init__global(struct drm_i915_private *i915);
+void i915_gem_timelines_mark_idle(struct drm_i915_private *i915);
 void i915_gem_timeline_fini(struct i915_gem_timeline *tl);
 
+static inline int __intel_timeline_sync_set(struct intel_timeline *tl,
+					    u64 context, u32 seqno)
+{
+	return i915_syncmap_set(&tl->sync, context, seqno);
+}
+
+static inline int intel_timeline_sync_set(struct intel_timeline *tl,
+					  const struct dma_fence *fence)
+{
+	return __intel_timeline_sync_set(tl, fence->context, fence->seqno);
+}
+
+static inline bool __intel_timeline_sync_is_later(struct intel_timeline *tl,
+						  u64 context, u32 seqno)
+{
+	return i915_syncmap_is_later(&tl->sync, context, seqno);
+}
+
+static inline bool intel_timeline_sync_is_later(struct intel_timeline *tl,
+						const struct dma_fence *fence)
+{
+	return __intel_timeline_sync_is_later(tl, fence->context, fence->seqno);
+}
+
 #endif
diff --git a/drivers/gpu/drm/i915/i915_syncmap.c b/drivers/gpu/drm/i915/i915_syncmap.c
new file mode 100644
index 000000000000..76e319d97645
--- /dev/null
+++ b/drivers/gpu/drm/i915/i915_syncmap.c
@@ -0,0 +1,381 @@
+/*
+ * Copyright © 2017 Intel Corporation
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the next
+ * paragraph) shall be included in all copies or substantial portions of the
+ * Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ *
+ */
+
+#include <linux/slab.h>
+
+#include "i915_syncmap.h"
+
+#include "i915_gem.h" /* GEM_BUG_ON() */
+#include "i915_selftest.h"
+
+#define SHIFT ilog2(KSYNCMAP)
+#define MASK (KSYNCMAP - 1)
+
+/*
+ * struct i915_syncmap is a layer of a radixtree that maps a u64 fence
+ * context id to the last u32 fence seqno waited upon from that context.
+ * Unlike lib/radixtree it uses a parent pointer that allows traversal back to
+ * the root. This allows us to access the whole tree via a single pointer
+ * to the most recently used layer. We expect fence contexts to be dense
+ * and most reuse to be on the same i915_gem_context but on neighbouring
+ * engines (i.e. on adjacent contexts) and reuse the same leaf, a very
+ * effective lookup cache. If the new lookup is not on the same leaf, we
+ * expect it to be on the neighbouring branch.
+ *
+ * A leaf holds an array of u32 seqno, and has height 0. The bitmap field
+ * allows us to store whether a particular seqno is valid (i.e. allows us
+ * to distinguish unset from 0).
+ *
+ * A branch holds an array of layer pointers, and has height > 0, and always
+ * has at least 2 layers (either branches or leaves) below it.
+ */
+
+struct i915_syncmap {
+	u64 prefix;
+	unsigned int height;
+	unsigned int bitmap;
+	struct i915_syncmap *parent;
+	/*
+	 * Following this header is an array of either seqno or child pointers:
+	 * union {
+	 *	u32 seqno[KSYNCMAP];
+	 *	struct i915_syncmap *child[KSYNCMAP];
+	 * };
+	 */
+};
+
+/**
+ * i915_syncmap_init -- initialise the #i915_syncmap
+ * @root - pointer to the #i915_syncmap
+ */
+void i915_syncmap_init(struct i915_syncmap **root)
+{
+	BUILD_BUG_ON_NOT_POWER_OF_2(KSYNCMAP);
+	BUILD_BUG_ON_NOT_POWER_OF_2(SHIFT);
+	BUILD_BUG_ON(KSYNCMAP > BITS_PER_BYTE * sizeof((*root)->bitmap));
+	*root = NULL;
+}
+
+static inline u32 *__sync_seqno(struct i915_syncmap *p)
+{
+	GEM_BUG_ON(p->height);
+	return (u32 *)(p + 1);
+}
+
+static inline struct i915_syncmap **__sync_child(struct i915_syncmap *p)
+{
+	GEM_BUG_ON(!p->height);
+	return (struct i915_syncmap **)(p + 1);
+}
+
+static inline unsigned int
+__sync_branch_idx(const struct i915_syncmap *p, u64 id)
+{
+	return (id >> p->height) & MASK;
+}
+
+static inline unsigned int
+__sync_leaf_idx(const struct i915_syncmap *p, u64 id)
+{
+	GEM_BUG_ON(p->height);
+	return id & MASK;
+}
+
+static inline u64 __sync_branch_prefix(const struct i915_syncmap *p, u64 id)
+{
+	return id >> p->height >> SHIFT;
+}
+
+static inline u64 __sync_leaf_prefix(const struct i915_syncmap *p, u64 id)
+{
+	GEM_BUG_ON(p->height);
+	return id >> SHIFT;
+}
+
+static inline bool seqno_later(u32 a, u32 b)
+{
+	return (s32)(a - b) >= 0;
+}
+
+/**
+ * i915_syncmap_is_later -- compare against the last known sync point
+ * @root - pointer to the #i915_syncmap
+ * @id - the context id (other timeline) we are synchronising to
+ * @seqno - the sequence number along the other timeline
+ *
+ * If we have already synchronised this @root with another (@id) then we can
+ * omit any repeated or earlier synchronisation requests. If the two timelines
+ * are already coupled, we can also omit the dependency between the two as that
+ * is already known via the timeline.
+ *
+ * Returns true if the two timelines are already synchronised wrt @seqno,
+ * false if not and the synchronisation must be emitted.
+ */
+bool i915_syncmap_is_later(struct i915_syncmap **root, u64 id, u32 seqno)
+{
+	struct i915_syncmap *p;
+	unsigned int idx;
+
+	p = *root;
+	if (!p)
+		return false;
+
+	if (likely(__sync_leaf_prefix(p, id) == p->prefix))
+		goto found;
+
+	/* First climb the tree back to a parent branch */
+	do {
+		p = p->parent;
+		if (!p)
+			return false;
+
+		if (__sync_branch_prefix(p, id) == p->prefix)
+			break;
+	} while (1);
+
+	/* And then descend again until we find our leaf */
+	do {
+		if (!p->height)
+			break;
+
+		p = __sync_child(p)[__sync_branch_idx(p, id)];
+		if (!p)
+			return false;
+
+		if (__sync_branch_prefix(p, id) != p->prefix)
+			return false;
+	} while (1);
+
+	*root = p;
+found:
+	idx = __sync_leaf_idx(p, id);
+	if (!(p->bitmap & BIT(idx)))
+		return false;
+
+	return seqno_later(__sync_seqno(p)[idx], seqno);
+}
+
+static struct i915_syncmap *
+__sync_alloc_leaf(struct i915_syncmap *parent, u64 id)
+{
+	struct i915_syncmap *p;
+
+	p = kmalloc(sizeof(*p) + KSYNCMAP * sizeof(u32), GFP_KERNEL);
+	if (unlikely(!p))
+		return NULL;
+
+	p->parent = parent;
+	p->height = 0;
+	p->bitmap = 0;
+	p->prefix = __sync_leaf_prefix(p, id);
+	return p;
+}
+
+static noinline int __sync_set(struct i915_syncmap **root, u64 id, u32 seqno)
+{
+	struct i915_syncmap *p = *root;
+	unsigned int idx;
+
+	if (!p) {
+		p = __sync_alloc_leaf(NULL, id);
+		if (unlikely(!p))
+			return -ENOMEM;
+
+		goto found;
+	}
+
+	/* Caller handled the likely cached case */
+	GEM_BUG_ON(__sync_leaf_prefix(p, id) == p->prefix);
+
+	/* Climb back up the tree until we find a common prefix */
+	do {
+		if (!p->parent)
+			break;
+
+		p = p->parent;
+
+		if (__sync_branch_prefix(p, id) == p->prefix)
+			break;
+	} while (1);
+
+	/*
+	 * No shortcut, we have to descend the tree to find the right layer
+	 * containing this fence.
+	 *
+	 * Each layer in the tree holds 16 (KSYNCMAP) pointers, either fences
+	 * or lower layers. Leaf nodes (height = 0) contain the fences, all
+	 * other nodes (height > 0) are internal layers that point to a lower
+	 * node. Each internal layer has at least 2 descendants.
+	 *
+	 * Starting at the top, we check whether the current prefix matches. If
+	 * it doesn't, we have gone past our layer and need to insert a join
+	 * into the tree, and a new leaf node as a descendant as well as the
+	 * original layer.
+	 *
+	 * The matching prefix means we are still following the right branch
+	 * of the tree. If it has height 0, we have found our leaf and just
+	 * need to replace the fence slot with ourselves. If the height is
+	 * not zero, our slot contains the next layer in the tree (unless
+	 * it is empty, in which case we can add ourselves as a new leaf).
+	 * As we descend the tree, the prefix grows (and the height decreases).
+	 */
+	do {
+		struct i915_syncmap *next;
+
+		if (__sync_branch_prefix(p, id) != p->prefix) {
+			unsigned int above;
+
+			/* insert a join above the current layer */
+			next = kzalloc(sizeof(*next) + KSYNCMAP * sizeof(next),
+				       GFP_KERNEL);
+			if (unlikely(!next))
+				return -ENOMEM;
+
+			above = fls64(__sync_branch_prefix(p, id) ^ p->prefix);
+			above = round_up(above, SHIFT);
+			next->height = above + p->height;
+			next->prefix = __sync_branch_prefix(next, id);
+
+			if (p->parent) {
+				idx = __sync_branch_idx(p->parent, id);
+				__sync_child(p->parent)[idx] = next;
+			}
+			next->parent = p->parent;
+
+			idx = p->prefix >> (above - SHIFT) & MASK;
+			__sync_child(next)[idx] = p;
+			next->bitmap |= BIT(idx);
+			p->parent = next;
+
+			/* ascend to the join */
+			p = next;
+		} else {
+			if (!p->height)
+				break;
+		}
+
+		/* descend into the next layer */
+		GEM_BUG_ON(!p->height);
+		idx = __sync_branch_idx(p, id);
+		next = __sync_child(p)[idx];
+		if (!next) {
+			next = __sync_alloc_leaf(p, id);
+			if (unlikely(!next))
+				return -ENOMEM;
+
+			__sync_child(p)[idx] = next;
+			p->bitmap |= BIT(idx);
+
+			p = next;
+			break;
+		}
+
+		p = next;
+	} while (1);
+
+found:
+	GEM_BUG_ON(p->prefix != __sync_leaf_prefix(p, id));
+	idx = __sync_leaf_idx(p, id);
+	__sync_seqno(p)[idx] = seqno;
+	p->bitmap |= BIT(idx);
+	*root = p;
+	return 0;
+}
+
+/**
+ * i915_syncmap_set -- mark the most recent syncpoint between contexts
+ * @root - pointer to the #i915_syncmap
+ * @id - the context id (other timeline) we have synchronised to
+ * @seqno - the sequence number along the other timeline
+ *
+ * When we synchronise this @root with another (@id), we also know that we have
+ * synchronised with all previous seqno along that timeline. If we then have
+ * a request to synchronise with the same seqno or older, we can omit it,
+ * see i915_syncmap_is_later().
+ *
+ * Returns 0 on success, or a negative error code.
+ */
+int i915_syncmap_set(struct i915_syncmap **root, u64 id, u32 seqno)
+{
+	struct i915_syncmap *p = *root;
+
+	/*
+	 * We expect to be called in sequence following an is_later(id), which
+	 * should have preloaded the root for us.
+	 */
+	if (likely(p && __sync_leaf_prefix(p, id) == p->prefix)) {
+		unsigned int idx = __sync_leaf_idx(p, id);
+
+		__sync_seqno(p)[idx] = seqno;
+		p->bitmap |= BIT(idx);
+		return 0;
+	}
+
+	return __sync_set(root, id, seqno);
+}
+
+static void __sync_free(struct i915_syncmap *p)
+{
+	if (p->height) {
+		unsigned int i;
+
+		while ((i = ffs(p->bitmap))) {
+			p->bitmap &= ~0u << i;
+			__sync_free(__sync_child(p)[i - 1]);
+		}
+	}
+
+	kfree(p);
+}
+
+/**
+ * i915_syncmap_free -- free all memory associated with the syncmap
+ * @root - pointer to the #i915_syncmap
+ *
+ * Either when the timeline is to be freed and we no longer need the sync
+ * point tracking, or when the fences are all known to be signaled and the
+ * sync point tracking is redundant, we can free the #i915_syncmap to recover
+ * its allocations.
+ *
+ * Will reinitialise the @root pointer so that the #i915_syncmap is ready for
+ * reuse.
+ */
+void i915_syncmap_free(struct i915_syncmap **root)
+{
+	struct i915_syncmap *p;
+
+	p = *root;
+	if (!p)
+		return;
+
+	while (p->parent)
+		p = p->parent;
+
+	__sync_free(p);
+	*root = NULL;
+}
+
+#if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
+#include "selftests/i915_syncmap.c"
+#endif
diff --git a/drivers/gpu/drm/i915/i915_syncmap.h b/drivers/gpu/drm/i915/i915_syncmap.h
new file mode 100644
index 000000000000..7ca827d812ae
--- /dev/null
+++ b/drivers/gpu/drm/i915/i915_syncmap.h
@@ -0,0 +1,39 @@
+/*
+ * Copyright © 2017 Intel Corporation
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the next
+ * paragraph) shall be included in all copies or substantial portions of the
+ * Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ *
+ */
+
+#ifndef __I915_SYNCMAP_H__
+#define __I915_SYNCMAP_H__
+
+#include <linux/types.h>
+
+struct i915_syncmap;
+
+void i915_syncmap_init(struct i915_syncmap **root);
+bool i915_syncmap_is_later(struct i915_syncmap **root, u64 id, u32 seqno);
+int i915_syncmap_set(struct i915_syncmap **root, u64 id, u32 seqno);
+void i915_syncmap_free(struct i915_syncmap **root);
+
+#define KSYNCMAP 16
+
+#endif /* __I915_SYNCMAP_H__ */
diff --git a/drivers/gpu/drm/i915/selftests/i915_gem_timeline.c b/drivers/gpu/drm/i915/selftests/i915_gem_timeline.c
new file mode 100644
index 000000000000..3b6725097eb0
--- /dev/null
+++ b/drivers/gpu/drm/i915/selftests/i915_gem_timeline.c
@@ -0,0 +1,283 @@
+/*
+ * Copyright © 2017 Intel Corporation
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the next
+ * paragraph) shall be included in all copies or substantial portions of the
+ * Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ *
+ */
+
+#include <linux/random.h>
+
+#include "../i915_selftest.h"
+#include "mock_gem_device.h"
+#include "mock_timeline.h"
+
+struct __igt_sync {
+	const char *name;
+	u32 seqno;
+	bool expected;
+	bool set;
+};
+
+static int __igt_sync(struct intel_timeline *tl,
+		      u64 ctx,
+		      const struct __igt_sync *p,
+		      const char *name)
+{
+	int ret;
+
+	if (__intel_timeline_sync_is_later(tl, ctx, p->seqno) != p->expected) {
+		pr_err("%s: %s(ctx=%llu, seqno=%u) expected passed %s but failed\n",
+		       name, p->name, ctx, p->seqno, yesno(p->expected));
+		return -EINVAL;
+	}
+
+	if (p->set) {
+		ret = __intel_timeline_sync_set(tl, ctx, p->seqno);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
+static int igt_sync(void *arg)
+{
+	const struct __igt_sync pass[] = {
+		{ "unset", 0, false, false },
+		{ "new", 0, false, true },
+		{ "0a", 0, true, true },
+		{ "1a", 1, false, true },
+		{ "1b", 1, true, true },
+		{ "0b", 0, true, false },
+		{ "2a", 2, false, true },
+		{ "4", 4, false, true },
+		{ "INT_MAX", INT_MAX, false, true },
+		{ "INT_MAX-1", INT_MAX-1, true, false },
+		{ "INT_MAX+1", (u32)INT_MAX+1, false, true },
+		{ "INT_MAX", INT_MAX, true, false },
+		{ "UINT_MAX", UINT_MAX, false, true },
+		{ "wrap", 0, false, true },
+		{ "unwrap", UINT_MAX, true, false },
+		{},
+	}, *p;
+	struct intel_timeline *tl;
+	int order, offset;
+	int ret;
+
+	tl = mock_timeline(0);
+	if (!tl)
+		return -ENOMEM;
+
+	for (p = pass; p->name; p++) {
+		for (order = 1; order < 64; order++) {
+			for (offset = -1; offset <= (order > 1); offset++) {
+				u64 ctx = BIT_ULL(order) + offset;
+
+				ret = __igt_sync(tl, ctx, p, "1");
+				if (ret)
+					goto out;
+			}
+		}
+	}
+	mock_timeline_destroy(tl);
+
+	tl = mock_timeline(0);
+	if (!tl)
+		return -ENOMEM;
+
+	for (order = 1; order < 64; order++) {
+		for (offset = -1; offset <= (order > 1); offset++) {
+			u64 ctx = BIT_ULL(order) + offset;
+
+			for (p = pass; p->name; p++) {
+				ret = __igt_sync(tl, ctx, p, "2");
+				if (ret)
+					goto out;
+			}
+		}
+	}
+
+out:
+	mock_timeline_destroy(tl);
+	return ret;
+}
+
+static u64 prandom_u64_state(struct rnd_state *rnd)
+{
+	u64 x;
+
+	x = prandom_u32_state(rnd);
+	x <<= 32;
+	x |= prandom_u32_state(rnd);
+
+	return x;
+}
+
+static unsigned int random_engine(struct rnd_state *rnd)
+{
+	return ((u64)prandom_u32_state(rnd) * I915_NUM_ENGINES) >> 32;
+}
+
+static int bench_sync(void *arg)
+{
+#define M (1 << 20)
+	struct rnd_state prng;
+	struct intel_timeline *tl;
+	unsigned long end_time, count;
+	u64 prng32_1M;
+	ktime_t kt;
+
+	tl = mock_timeline(0);
+	if (!tl)
+		return -ENOMEM;
+
+	prandom_seed_state(&prng, i915_selftest.random_seed);
+	count = 0;
+	kt = ktime_get();
+	end_time = jiffies + HZ/10;
+	do {
+		u32 x;
+
+		WRITE_ONCE(x, prandom_u32_state(&prng));
+
+		count++;
+	} while (!time_after(jiffies, end_time));
+	kt = ktime_sub(ktime_get(), kt);
+	pr_debug("%s: %lu random evaluations, %lluns/prng\n",
+		__func__, count, (long long)div64_ul(ktime_to_ns(kt), count));
+	prng32_1M = ktime_to_ns(kt) * M / count;
+
+	prandom_seed_state(&prng, i915_selftest.random_seed);
+	count = 0;
+	kt = ktime_get();
+	end_time = jiffies + HZ/10;
+	do {
+		u64 id = prandom_u64_state(&prng);
+
+		__intel_timeline_sync_set(tl, id, 0);
+		count++;
+	} while (!time_after(jiffies, end_time));
+	kt = ktime_sub(ktime_get(), kt);
+	kt = ktime_sub_ns(kt, count * prng32_1M * 2 / M);
+	pr_info("%s: %lu random insertions, %lluns/insert\n",
+		__func__, count, (long long)div64_ul(ktime_to_ns(kt), count));
+
+	prandom_seed_state(&prng, i915_selftest.random_seed);
+	end_time = count;
+	kt = ktime_get();
+	while (end_time--) {
+		u64 id = prandom_u64_state(&prng);
+
+		if (!__intel_timeline_sync_is_later(tl, id, 0)) {
+			mock_timeline_destroy(tl);
+			pr_err("Lookup of %llu failed\n", id);
+			return -EINVAL;
+		}
+	}
+	kt = ktime_sub(ktime_get(), kt);
+	kt = ktime_sub_ns(kt, count * prng32_1M * 2 / M);
+	pr_info("%s: %lu random lookups, %lluns/lookup\n",
+		__func__, count, (long long)div64_ul(ktime_to_ns(kt), count));
+
+	mock_timeline_destroy(tl);
+
+	tl = mock_timeline(0);
+	if (!tl)
+		return -ENOMEM;
+
+	count = 0;
+	kt = ktime_get();
+	end_time = jiffies + HZ/10;
+	do {
+		__intel_timeline_sync_set(tl, count++, 0);
+	} while (!time_after(jiffies, end_time));
+	kt = ktime_sub(ktime_get(), kt);
+	pr_info("%s: %lu in-order insertions, %lluns/insert\n",
+		__func__, count, (long long)div64_ul(ktime_to_ns(kt), count));
+
+	end_time = count;
+	kt = ktime_get();
+	while (end_time--) {
+		if (!__intel_timeline_sync_is_later(tl, end_time, 0)) {
+			pr_err("Lookup of %lu failed\n", end_time);
+			mock_timeline_destroy(tl);
+			return -EINVAL;
+		}
+	}
+	kt = ktime_sub(ktime_get(), kt);
+	pr_info("%s: %lu in-order lookups, %lluns/lookup\n",
+		__func__, count, (long long)div64_ul(ktime_to_ns(kt), count));
+
+	mock_timeline_destroy(tl);
+
+	tl = mock_timeline(0);
+	if (!tl)
+		return -ENOMEM;
+
+	prandom_seed_state(&prng, i915_selftest.random_seed);
+	count = 0;
+	kt = ktime_get();
+	end_time = jiffies + HZ/10;
+	do {
+		u32 id = random_engine(&prng);
+		u32 seqno = prandom_u32_state(&prng);
+
+		if (!__intel_timeline_sync_is_later(tl, id, seqno))
+			__intel_timeline_sync_set(tl, id, seqno);
+
+		count++;
+	} while (!time_after(jiffies, end_time));
+	kt = ktime_sub(ktime_get(), kt);
+	kt = ktime_sub_ns(kt, count * prng32_1M / M);
+	pr_info("%s: %lu repeated insert/lookups, %lluns/op\n",
+		__func__, count, (long long)div64_ul(ktime_to_ns(kt), count));
+	mock_timeline_destroy(tl);
+
+	tl = mock_timeline(0);
+	if (!tl)
+		return -ENOMEM;
+
+	count = 0;
+	kt = ktime_get();
+	end_time = jiffies + HZ/10;
+	do {
+		if (!__intel_timeline_sync_is_later(tl, count & 7, count >> 4))
+			__intel_timeline_sync_set(tl, count & 7, count >> 4);
+
+		count++;
+	} while (!time_after(jiffies, end_time));
+	kt = ktime_sub(ktime_get(), kt);
+	pr_info("%s: %lu cyclic insert/lookups, %lluns/op\n",
+		__func__, count, (long long)div64_ul(ktime_to_ns(kt), count));
+	mock_timeline_destroy(tl);
+
+	return 0;
+#undef M
+}
+
+int i915_gem_timeline_mock_selftests(void)
+{
+	static const struct i915_subtest tests[] = {
+		SUBTEST(igt_sync),
+		SUBTEST(bench_sync),
+	};
+
+	return i915_subtests(tests, NULL);
+}
diff --git a/drivers/gpu/drm/i915/selftests/i915_mock_selftests.h b/drivers/gpu/drm/i915/selftests/i915_mock_selftests.h
index be9a9ebf5692..76c1f149a0a0 100644
--- a/drivers/gpu/drm/i915/selftests/i915_mock_selftests.h
+++ b/drivers/gpu/drm/i915/selftests/i915_mock_selftests.h
@@ -10,8 +10,10 @@
  */
 selftest(sanitycheck, i915_mock_sanitycheck) /* keep first (igt selfcheck) */
 selftest(scatterlist, scatterlist_mock_selftests)
+selftest(syncmap, i915_syncmap_mock_selftests)
 selftest(uncore, intel_uncore_mock_selftests)
 selftest(breadcrumbs, intel_breadcrumbs_mock_selftests)
+selftest(timelines, i915_gem_timeline_mock_selftests)
 selftest(requests, i915_gem_request_mock_selftests)
 selftest(objects, i915_gem_object_mock_selftests)
 selftest(dmabuf, i915_gem_dmabuf_mock_selftests)
diff --git a/drivers/gpu/drm/i915/selftests/i915_syncmap.c b/drivers/gpu/drm/i915/selftests/i915_syncmap.c
new file mode 100644
index 000000000000..6237e3464391
--- /dev/null
+++ b/drivers/gpu/drm/i915/selftests/i915_syncmap.c
@@ -0,0 +1,539 @@
+/*
+ * Copyright © 2017 Intel Corporation
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the next
+ * paragraph) shall be included in all copies or substantial portions of the
+ * Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ *
+ */
+
+#include "../i915_selftest.h"
+#include "i915_random.h"
+
+static int check_syncmap_free(struct i915_syncmap **sync)
+{
+	i915_syncmap_free(sync);
+	if (*sync) {
+		pr_err("sync not cleared after free\n");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int igt_syncmap_init(void *arg)
+{
+	struct i915_syncmap *sync = (void *)~0ul;
+
+	/*
+	 * Cursory check that we can initialise a random pointer and transform
+	 * it into the root pointer of a syncmap.
+	 */
+
+	i915_syncmap_init(&sync);
+	return check_syncmap_free(&sync);
+}
+
+static u64 prandom_u64_state(struct rnd_state *rnd)
+{
+	u64 x;
+
+	x = prandom_u32_state(rnd);
+	x <<= 32;
+	x |= prandom_u32_state(rnd);
+
+	return x;
+}
+
+static int check_seqno(struct i915_syncmap *leaf, unsigned int idx, u32 seqno)
+{
+	if (leaf->height) {
+		pr_err("%s: not a leaf, height is %d\n",
+		       __func__, leaf->height);
+		return -EINVAL;
+	}
+
+	if (__sync_seqno(leaf)[idx] != seqno) {
+		pr_err("%s: seqno[%d], found %x, expected %x\n",
+		       __func__, idx, __sync_seqno(leaf)[idx], seqno);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int check_one(struct i915_syncmap **sync, u64 context, u32 seqno)
+{
+	int err;
+
+	err = i915_syncmap_set(sync, context, seqno);
+	if (err)
+		return err;
+
+	if ((*sync)->height) {
+		pr_err("Inserting first context=%llx did not return leaf (height=%d, prefix=%llx)\n",
+		       context, (*sync)->height, (*sync)->prefix);
+		return -EINVAL;
+	}
+
+	if ((*sync)->parent) {
+		pr_err("Inserting first context=%llx created branches!\n",
+		       context);
+		return -EINVAL;
+	}
+
+	if (hweight32((*sync)->bitmap) != 1) {
+		pr_err("First bitmap does not contain a single entry, found %x (count=%d)!\n",
+		       (*sync)->bitmap, hweight32((*sync)->bitmap));
+		return -EINVAL;
+	}
+
+	err = check_seqno((*sync), ilog2((*sync)->bitmap), seqno);
+	if (err)
+		return err;
+
+	if (!i915_syncmap_is_later(sync, context, seqno)) {
+		pr_err("Lookup of first context=%llx/seqno=%x failed!\n",
+		       context, seqno);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int igt_syncmap_one(void *arg)
+{
+	I915_RND_STATE(prng);
+	IGT_TIMEOUT(end_time);
+	struct i915_syncmap *sync;
+	unsigned long max = 1;
+	int err;
+
+	/*
+	 * Check that inserting a new id, creates a leaf and only that leaf.
+	 */
+
+	i915_syncmap_init(&sync);
+
+	do {
+		u64 context = prandom_u64_state(&prng);
+		unsigned long loop;
+
+		err = check_syncmap_free(&sync);
+		if (err)
+			goto out;
+
+		for (loop = 0; loop <= max; loop++) {
+			err = check_one(&sync, context,
+					prandom_u32_state(&prng));
+			if (err)
+				goto out;
+		}
+		max++;
+	} while (!__igt_timeout(end_time, NULL));
+	pr_debug("%s: Completed %lu single insertions\n",
+		__func__, max * (max - 1) / 2);
+out:
+	i915_syncmap_free(&sync);
+	return err;
+}
+
+static int check_leaf(struct i915_syncmap **sync, u64 context, u32 seqno)
+{
+	int err;
+
+	err = i915_syncmap_set(sync, context, seqno);
+	if (err)
+		return err;
+
+	if ((*sync)->height) {
+		pr_err("Inserting context=%llx did not return leaf (height=%d, prefix=%llx)\n",
+		       context, (*sync)->height, (*sync)->prefix);
+		return -EINVAL;
+	}
+
+	if (hweight32((*sync)->bitmap) != 1) {
+		pr_err("First entry into leaf (context=%llx) does not contain a single entry, found %x (count=%d)!\n",
+		       context, (*sync)->bitmap, hweight32((*sync)->bitmap));
+		return -EINVAL;
+	}
+
+	err = check_seqno((*sync), ilog2((*sync)->bitmap), seqno);
+	if (err)
+		return err;
+
+	if (!i915_syncmap_is_later(sync, context, seqno)) {
+		pr_err("Lookup of first entry context=%llx/seqno=%x failed!\n",
+		       context, seqno);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int igt_syncmap_join_above(void *arg)
+{
+	struct i915_syncmap *sync;
+	unsigned int pass, order;
+	int err;
+
+	i915_syncmap_init(&sync);
+
+	/*
+	 * When we have a new id that doesn't fit inside the existing tree,
+	 * we need to add a new layer above.
+	 *
+	 * 1: 0x00000001
+	 * 2: 0x00000010
+	 * 3: 0x00000100
+	 * 4: 0x00001000
+	 * ...
+	 * Each pass the common prefix shrinks and we have to insert a join.
+	 * Each join will only contain two branches, the latest of which
+	 * is always a leaf.
+	 *
+	 * If we then reuse the same set of contexts, we expect to build an
+	 * identical tree.
+	 */
+	for (pass = 0; pass < 3; pass++) {
+		for (order = 0; order < 64; order += SHIFT) {
+			u64 context = BIT_ULL(order);
+			struct i915_syncmap *join;
+
+			err = check_leaf(&sync, context, 0);
+			if (err)
+				goto out;
+
+			join = sync->parent;
+			if (!join) /* very first insert will have no parents */
+				continue;
+
+			if (!join->height) {
+				pr_err("Parent with no height!\n");
+				err = -EINVAL;
+				goto out;
+			}
+
+			if (hweight32(join->bitmap) != 2) {
+				pr_err("Join does not have 2 children: %x (%d)\n",
+				       join->bitmap, hweight32(join->bitmap));
+				err = -EINVAL;
+				goto out;
+			}
+
+			if (__sync_child(join)[__sync_branch_idx(join, context)] != sync) {
+				pr_err("Leaf misplaced in parent!\n");
+				err = -EINVAL;
+				goto out;
+			}
+		}
+	}
+out:
+	i915_syncmap_free(&sync);
+	return err;
+}
+
+static int igt_syncmap_join_below(void *arg)
+{
+	struct i915_syncmap *sync;
+	unsigned int step, order, idx;
+	int err;
+
+	i915_syncmap_init(&sync);
+
+	/*
+	 * Check that we can split a compacted branch by replacing it with
+	 * a join.
+	 */
+	for (step = 0; step < KSYNCMAP; step++) {
+		for (order = 64 - SHIFT; order > 0; order -= SHIFT) {
+			u64 context = step*BIT_ULL(order);
+
+			err = i915_syncmap_set(&sync, context, 0);
+			if (err)
+				goto out;
+
+			if (sync->height) {
+				pr_err("Inserting context=%llx (order=%d, step=%d) did not return leaf (height=%d, prefix=%llx)\n",
+				       context, order, step, sync->height, sync->prefix);
+				err = -EINVAL;
+				goto out;
+			}
+		}
+	}
+
+	for (step = 0; step < KSYNCMAP; step++) {
+		for (order = SHIFT; order < 64; order += SHIFT) {
+			u64 context = step*BIT_ULL(order);
+
+			if (!i915_syncmap_is_later(&sync, context, 0)) {
+				pr_err("1: context %llx (order=%d, step=%d) not found\n",
+				       context, order, step);
+				err = -EINVAL;
+				goto out;
+			}
+
+			for (idx = 1; idx < KSYNCMAP; idx++) {
+				if (i915_syncmap_is_later(&sync, context + idx, 0)) {
+					pr_err("1: context %llx (order=%d, step=%d) should not exist\n",
+					       context + idx, order, step);
+					err = -EINVAL;
+					goto out;
+				}
+			}
+		}
+	}
+
+	for (order = SHIFT; order < 64; order += SHIFT) {
+		for (step = 0; step < KSYNCMAP; step++) {
+			u64 context = step*BIT_ULL(order);
+
+			if (!i915_syncmap_is_later(&sync, context, 0)) {
+				pr_err("2: context %llx (order=%d, step=%d) not found\n",
+				       context, order, step);
+				err = -EINVAL;
+				goto out;
+			}
+		}
+	}
+
+out:
+	i915_syncmap_free(&sync);
+	return err;
+}
+
+static int igt_syncmap_neighbours(void *arg)
+{
+	I915_RND_STATE(prng);
+	IGT_TIMEOUT(end_time);
+	struct i915_syncmap *sync;
+	int err;
+
+	/*
+	 * Each leaf holds KSYNCMAP seqno. Check that when we create KSYNCMAP
+	 * neighbouring ids, they all fit into the same leaf.
+	 */
+
+	i915_syncmap_init(&sync);
+	do {
+		u64 context = prandom_u64_state(&prng) & ~MASK;
+		unsigned int idx;
+
+		if (i915_syncmap_is_later(&sync, context, 0)) /* Skip repeats */
+			continue;
+
+		for (idx = 0; idx < KSYNCMAP; idx++) {
+			err = i915_syncmap_set(&sync, context + idx, 0);
+			if (err)
+				goto out;
+
+			if (sync->height) {
+				pr_err("Inserting context=%llx did not return leaf (height=%d, prefix=%llx)\n",
+				       context, sync->height, sync->prefix);
+				err = -EINVAL;
+				goto out;
+			}
+
+			if (sync->bitmap != BIT(idx + 1) - 1) {
+				pr_err("Inserting neighbouring context=0x%llx+%d, did not fit into the same leaf bitmap=%x (%d), expected %lx (%d)\n",
+				       context, idx,
+				       sync->bitmap, hweight32(sync->bitmap),
+				       BIT(idx + 1) - 1, idx + 1);
+				err = -EINVAL;
+				goto out;
+			}
+		}
+	} while (!__igt_timeout(end_time, NULL));
+out:
+	i915_syncmap_free(&sync);
+	return err;
+}
+
+static int igt_syncmap_compact(void *arg)
+{
+	struct i915_syncmap *sync;
+	unsigned int idx, order;
+	int err;
+
+	i915_syncmap_init(&sync);
+
+	/*
+	 * The syncmap are "space efficient" compressed radix trees - any
+	 * branch with only one child is skipped and replaced by the child.
+	 *
+	 * If we construct a tree with ids that are neighbouring at a non-zero
+	 * height, we form a join but each child of that join is directly a
+	 * leaf holding the single id.
+	 */
+	for (order = SHIFT; order < 64; order += SHIFT) {
+		err = check_syncmap_free(&sync);
+		if (err)
+			goto out;
+
+		/* Create neighbours in the parent */
+		for (idx = 0; idx < KSYNCMAP; idx++) {
+			u64 context = idx * BIT_ULL(order) + idx;
+
+			err = i915_syncmap_set(&sync, context, 0);
+			if (err)
+				goto out;
+
+			if (sync->height) {
+				pr_err("Inserting context=%llx (order=%d, idx=%d) did not return leaf (height=%d, prefix=%llx)\n",
+				       context, order, idx,
+				       sync->height, sync->prefix);
+				err = -EINVAL;
+				goto out;
+			}
+		}
+
+		sync = sync->parent;
+		if (sync->parent) {
+			pr_err("Parent (join) of last leaf was not the sync!\n");
+			err = -EINVAL;
+			goto out;
+		}
+
+		if (sync->height != order) {
+			pr_err("Join does not have the expected height, found %d, expected %d\n",
+			       sync->height, order);
+			err = -EINVAL;
+			goto out;
+		}
+
+		if (sync->bitmap != BIT(KSYNCMAP) - 1) {
+			pr_err("Join is not full, found %x (%d), expected %lx (%d)\n",
+			       sync->bitmap, hweight32(sync->bitmap),
+			       BIT(KSYNCMAP) - 1, KSYNCMAP);
+			err = -EINVAL;
+			goto out;
+		}
+
+		/* Each of our children should be a leaf */
+		for (idx = 0; idx < KSYNCMAP; idx++) {
+			struct i915_syncmap *leaf = __sync_child(sync)[idx];
+
+			if (leaf->height) {
+				pr_err("Child %d is not a leaf!\n", idx);
+				err = -EINVAL;
+				goto out;
+			}
+
+			if (leaf->parent != sync) {
+				pr_err("Child %d is not attached to us!\n",
+				       idx);
+				err = -EINVAL;
+				goto out;
+			}
+
+			if (!is_power_of_2(leaf->bitmap)) {
+				pr_err("Child %d holds more than one id, found %x (%d)\n",
+				       idx, leaf->bitmap, hweight32(leaf->bitmap));
+				err = -EINVAL;
+				goto out;
+			}
+
+			if (leaf->bitmap != BIT(idx)) {
+				pr_err("Child %d has wrong seqno idx, found %d, expected %d\n",
+				       idx, ilog2(leaf->bitmap), idx);
+				err = -EINVAL;
+				goto out;
+			}
+		}
+	}
+out:
+	i915_syncmap_free(&sync);
+	return err;
+}
+
+static int igt_syncmap_random(void *arg)
+{
+	I915_RND_STATE(prng);
+	IGT_TIMEOUT(end_time);
+	struct i915_syncmap *sync;
+	unsigned long count, phase, i;
+	u32 seqno;
+	int err;
+
+	i915_syncmap_init(&sync);
+
+	/*
+	 * Having tried to test the individual operations within i915_syncmap,
+	 * run a smoketest exploring the entire u64 space with random
+	 * insertions.
+	 */
+
+	count = 0;
+	phase = jiffies + HZ/100 + 1;
+	do {
+		u64 context = prandom_u64_state(&prng);
+
+		err = i915_syncmap_set(&sync, context, 0);
+		if (err)
+			goto out;
+
+		count++;
+	} while (!time_after(jiffies, phase));
+	seqno = 0;
+
+	phase = 0;
+	do {
+		I915_RND_STATE(ctx);
+		u32 last_seqno = seqno;
+		bool expect;
+
+		seqno = prandom_u32_state(&prng);
+		expect = seqno_later(last_seqno, seqno);
+
+		for (i = 0; i < count; i++) {
+			u64 context = prandom_u64_state(&ctx);
+
+			if (i915_syncmap_is_later(&sync, context, seqno) != expect) {
+				pr_err("context=%llu, last=%u this=%u did not match expectation (%d)\n",
+				       context, last_seqno, seqno, expect);
+				err = -EINVAL;
+				goto out;
+			}
+
+			err = i915_syncmap_set(&sync, context, seqno);
+			if (err)
+				goto out;
+		}
+
+		phase++;
+	} while (!__igt_timeout(end_time, NULL));
+	pr_debug("Completed %lu passes, each of %lu contexts\n", phase, count);
+out:
+	i915_syncmap_free(&sync);
+	return err;
+}
+
+int i915_syncmap_mock_selftests(void)
+{
+	static const struct i915_subtest tests[] = {
+		SUBTEST(igt_syncmap_init),
+		SUBTEST(igt_syncmap_one),
+		SUBTEST(igt_syncmap_join_above),
+		SUBTEST(igt_syncmap_join_below),
+		SUBTEST(igt_syncmap_neighbours),
+		SUBTEST(igt_syncmap_compact),
+		SUBTEST(igt_syncmap_random),
+	};
+
+	return i915_subtests(tests, NULL);
+}
diff --git a/drivers/gpu/drm/i915/selftests/mock_timeline.c b/drivers/gpu/drm/i915/selftests/mock_timeline.c
new file mode 100644
index 000000000000..47b1f47c5812
--- /dev/null
+++ b/drivers/gpu/drm/i915/selftests/mock_timeline.c
@@ -0,0 +1,45 @@
+/*
+ * Copyright © 2017 Intel Corporation
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the next
+ * paragraph) shall be included in all copies or substantial portions of the
+ * Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ *
+ */
+
+#include "mock_timeline.h"
+
+struct intel_timeline *mock_timeline(u64 context)
+{
+	static struct lock_class_key class;
+	struct intel_timeline *tl;
+
+	tl = kzalloc(sizeof(*tl), GFP_KERNEL);
+	if (!tl)
+		return NULL;
+
+	__intel_timeline_init(tl, NULL, context, &class, "mock");
+
+	return tl;
+}
+
+void mock_timeline_destroy(struct intel_timeline *tl)
+{
+	__intel_timeline_fini(tl);
+	kfree(tl);
+}
diff --git a/drivers/gpu/drm/i915/selftests/mock_timeline.h b/drivers/gpu/drm/i915/selftests/mock_timeline.h
new file mode 100644
index 000000000000..c27ff4639b8b
--- /dev/null
+++ b/drivers/gpu/drm/i915/selftests/mock_timeline.h
@@ -0,0 +1,33 @@
+/*
+ * Copyright © 2017 Intel Corporation
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the next
+ * paragraph) shall be included in all copies or substantial portions of the
+ * Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ *
+ */
+
+#ifndef __MOCK_TIMELINE__
+#define __MOCK_TIMELINE__
+
+#include "../i915_gem_timeline.h"
+
+struct intel_timeline *mock_timeline(u64 context);
+void mock_timeline_destroy(struct intel_timeline *tl);
+
+#endif /* !__MOCK_TIMELINE__ */
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 95+ messages in thread

* ✓ Fi.CI.BAT: success for series starting with [01/27] drm/i915/selftests: Allocate inode/file dynamically (rev5)
  2017-04-19  9:41 Confluence of eb + timeline improvements Chris Wilson
                   ` (28 preceding siblings ...)
  2017-04-27  7:27 ` ✓ Fi.CI.BAT: success for series starting with [01/27] drm/i915/selftests: Allocate inode/file dynamically (rev2) Patchwork
@ 2017-04-28 14:31 ` Patchwork
  2017-04-28 19:22 ` ✓ Fi.CI.BAT: success for series starting with [01/27] drm/i915/selftests: Allocate inode/file dynamically (rev6) Patchwork
  30 siblings, 0 replies; 95+ messages in thread
From: Patchwork @ 2017-04-28 14:31 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: series starting with [01/27] drm/i915/selftests: Allocate inode/file dynamically (rev5)
URL   : https://patchwork.freedesktop.org/series/23227/
State : success

== Summary ==

Series 23227v5 Series without cover letter
https://patchwork.freedesktop.org/api/1.0/series/23227/revisions/5/mbox/

Test gem_exec_flush:
        Subgroup basic-batch-kernel-default-uc:
                pass       -> FAIL       (fi-snb-2600) fdo#100007

fdo#100007 https://bugs.freedesktop.org/show_bug.cgi?id=100007

fi-bdw-5557u     total:278  pass:267  dwarn:0   dfail:0   fail:0   skip:11  time:436s
fi-bdw-gvtdvm    total:278  pass:256  dwarn:8   dfail:0   fail:0   skip:14  time:422s
fi-bsw-n3050     total:278  pass:242  dwarn:0   dfail:0   fail:0   skip:36  time:576s
fi-bxt-j4205     total:278  pass:259  dwarn:0   dfail:0   fail:0   skip:19  time:508s
fi-bxt-t5700     total:278  pass:258  dwarn:0   dfail:0   fail:0   skip:20  time:539s
fi-byt-j1900     total:278  pass:254  dwarn:0   dfail:0   fail:0   skip:24  time:484s
fi-byt-n2820     total:278  pass:250  dwarn:0   dfail:0   fail:0   skip:28  time:479s
fi-hsw-4770      total:278  pass:262  dwarn:0   dfail:0   fail:0   skip:16  time:405s
fi-hsw-4770r     total:278  pass:262  dwarn:0   dfail:0   fail:0   skip:16  time:403s
fi-ilk-650       total:278  pass:228  dwarn:0   dfail:0   fail:0   skip:50  time:413s
fi-ivb-3520m     total:278  pass:260  dwarn:0   dfail:0   fail:0   skip:18  time:493s
fi-ivb-3770      total:278  pass:260  dwarn:0   dfail:0   fail:0   skip:18  time:463s
fi-kbl-7500u     total:278  pass:260  dwarn:0   dfail:0   fail:0   skip:18  time:460s
fi-kbl-7560u     total:278  pass:267  dwarn:1   dfail:0   fail:0   skip:10  time:571s
fi-skl-6260u     total:278  pass:268  dwarn:0   dfail:0   fail:0   skip:10  time:462s
fi-skl-6700hq    total:278  pass:261  dwarn:0   dfail:0   fail:0   skip:17  time:579s
fi-skl-6700k     total:278  pass:256  dwarn:4   dfail:0   fail:0   skip:18  time:462s
fi-skl-6770hq    total:278  pass:268  dwarn:0   dfail:0   fail:0   skip:10  time:490s
fi-skl-gvtdvm    total:278  pass:265  dwarn:0   dfail:0   fail:0   skip:13  time:430s
fi-snb-2520m     total:278  pass:250  dwarn:0   dfail:0   fail:0   skip:28  time:530s
fi-snb-2600      total:278  pass:248  dwarn:0   dfail:0   fail:1   skip:29  time:401s

86cc4197d2fa4c45b75bf54026765d27d86b84c8 drm-tip: 2017y-04m-28d-09h-14m-47s UTC integration manifest
2826b18 drm/i915: Redefine ptr_pack_bits() and friends
d366ae4 drm/i915: Make ptr_unpack_bits() more function-like
758209b drm/i915: Lift timeline ordering to await_dma_fence
5cf482d drm/i915: Mark up clflushes as belonging to an unordered timeline
5eec284 drm/i915: Mark CPU cache as dirty on every transition for CPU writes

== Logs ==

For more details see: https://intel-gfx-ci.01.org/CI/Patchwork_4580/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* [PATCH v14] drm/i915: Squash repeated awaits on the same fence
  2017-04-28 14:12         ` [PATCH v13] " Chris Wilson
@ 2017-04-28 19:02           ` Chris Wilson
  2017-05-02 12:24             ` Tvrtko Ursulin
  0 siblings, 1 reply; 95+ messages in thread
From: Chris Wilson @ 2017-04-28 19:02 UTC (permalink / raw)
  To: intel-gfx

Track the latest fence waited upon on each context, and only add a new
asynchronous wait if the new fence is more recent than the recorded
fence for that context. This requires us to filter out unordered
timelines, which are noted by DMA_FENCE_NO_CONTEXT. However, in the
absence of a universal identifier, we have to use our own
i915->mm.unordered_timeline token.

v2: Throw around the debug crutches
v3: Inline the likely case of the pre-allocation cache being full.
v4: Drop the pre-allocation support, we can lose the most recent fence
in case of allocation failure -- it just means we may emit more awaits
than strictly necessary but will not break.
v5: Trim allocation size for leaf nodes, they only need an array of u32
not pointers.
v6: Create mock_timeline to tidy selftest writing
v7: s/intel_timeline_sync_get/intel_timeline_sync_is_later/ (Tvrtko)
v8: Prune the stale sync points when we idle.
v9: Include a small benchmark in the kselftests
v10: Separate the idr implementation into its own compartment. (Tvrtko)
v11: Refactor igt_sync kselftests to avoid deep nesting (Tvrtko)
v12: __sync_leaf_idx() to assert that p->height is 0 when checking leaves
v13: kselftests to investigate struct i915_syncmap itself (Tvrtko)
v14: Foray into ascii art graphs

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
---
 drivers/gpu/drm/i915/Makefile                      |   1 +
 drivers/gpu/drm/i915/i915_gem.c                    |   1 +
 drivers/gpu/drm/i915/i915_gem.h                    |   2 +
 drivers/gpu/drm/i915/i915_gem_request.c            |   9 +
 drivers/gpu/drm/i915/i915_gem_timeline.c           |  93 +++-
 drivers/gpu/drm/i915/i915_gem_timeline.h           |  38 ++
 drivers/gpu/drm/i915/i915_syncmap.c                | 419 ++++++++++++++
 drivers/gpu/drm/i915/i915_syncmap.h                |  39 ++
 drivers/gpu/drm/i915/selftests/i915_gem_timeline.c | 272 +++++++++
 .../gpu/drm/i915/selftests/i915_mock_selftests.h   |   2 +
 drivers/gpu/drm/i915/selftests/i915_random.c       |  11 +
 drivers/gpu/drm/i915/selftests/i915_random.h       |   2 +
 drivers/gpu/drm/i915/selftests/i915_syncmap.c      | 609 +++++++++++++++++++++
 drivers/gpu/drm/i915/selftests/mock_timeline.c     |  45 ++
 drivers/gpu/drm/i915/selftests/mock_timeline.h     |  33 ++
 15 files changed, 1558 insertions(+), 18 deletions(-)
 create mode 100644 drivers/gpu/drm/i915/i915_syncmap.c
 create mode 100644 drivers/gpu/drm/i915/i915_syncmap.h
 create mode 100644 drivers/gpu/drm/i915/selftests/i915_gem_timeline.c
 create mode 100644 drivers/gpu/drm/i915/selftests/i915_syncmap.c
 create mode 100644 drivers/gpu/drm/i915/selftests/mock_timeline.c
 create mode 100644 drivers/gpu/drm/i915/selftests/mock_timeline.h
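
For reference, the new map reduces to the four entry points declared in
i915_syncmap.h. A minimal sketch of the intended call pattern follows, under
the assumption of a caller that already knows the (context, seqno) pair of
the fence it is about to wait upon (the real users are
i915_gem_request_await_dma_fence() and i915_gem_timelines_mark_idle() in the
diff below):

	struct i915_syncmap *sync;
	u64 context;	/* fence->context of the other timeline */
	u32 seqno;	/* fence->seqno we are about to wait upon */

	i915_syncmap_init(&sync);

	/* Skip the await if we already waited upon this point (or later) */
	if (!i915_syncmap_is_later(&sync, context, seqno)) {
		/* ... emit the asynchronous wait ... */

		/*
		 * Record the new sync point. On -ENOMEM we merely lose
		 * the cache entry and may emit a redundant await later,
		 * so the return value may be safely ignored.
		 */
		i915_syncmap_set(&sync, context, seqno);
	}

	/* Once every tracked fence is known to be signaled, or on fini */
	i915_syncmap_free(&sync);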

diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
index 2cf04504e494..7b05fb802f4c 100644
--- a/drivers/gpu/drm/i915/Makefile
+++ b/drivers/gpu/drm/i915/Makefile
@@ -16,6 +16,7 @@ i915-y := i915_drv.o \
 	  i915_params.o \
 	  i915_pci.o \
           i915_suspend.o \
+	  i915_syncmap.o \
 	  i915_sw_fence.o \
 	  i915_sysfs.o \
 	  intel_csr.o \
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index a7da9cdf6c39..0f8046e0a63c 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -3215,6 +3215,7 @@ i915_gem_idle_work_handler(struct work_struct *work)
 		intel_engine_disarm_breadcrumbs(engine);
 		i915_gem_batch_pool_fini(&engine->batch_pool);
 	}
+	i915_gem_timelines_mark_idle(dev_priv);
 
 	GEM_BUG_ON(!dev_priv->gt.awake);
 	dev_priv->gt.awake = false;
diff --git a/drivers/gpu/drm/i915/i915_gem.h b/drivers/gpu/drm/i915/i915_gem.h
index 5a49487368ca..ee54597465b6 100644
--- a/drivers/gpu/drm/i915/i915_gem.h
+++ b/drivers/gpu/drm/i915/i915_gem.h
@@ -25,6 +25,8 @@
 #ifndef __I915_GEM_H__
 #define __I915_GEM_H__
 
+#include <linux/bug.h>
+
 #ifdef CONFIG_DRM_I915_DEBUG_GEM
 #define GEM_BUG_ON(expr) BUG_ON(expr)
 #define GEM_WARN_ON(expr) WARN_ON(expr)
diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
index 022f5588d906..637b8cddf988 100644
--- a/drivers/gpu/drm/i915/i915_gem_request.c
+++ b/drivers/gpu/drm/i915/i915_gem_request.c
@@ -773,6 +773,11 @@ i915_gem_request_await_dma_fence(struct drm_i915_gem_request *req,
 		if (fence->context == req->fence.context)
 			continue;
 
+		/* Squash repeated waits to the same timelines */
+		if (fence->context != req->i915->mm.unordered_timeline &&
+		    intel_timeline_sync_is_later(req->timeline, fence))
+			continue;
+
 		if (dma_fence_is_i915(fence))
 			ret = i915_gem_request_await_request(req,
 							     to_request(fence));
@@ -782,6 +787,10 @@ i915_gem_request_await_dma_fence(struct drm_i915_gem_request *req,
 							    GFP_KERNEL);
 		if (ret < 0)
 			return ret;
+
+		/* Record the latest fence used against each timeline */
+		if (fence->context != req->i915->mm.unordered_timeline)
+			intel_timeline_sync_set(req->timeline, fence);
 	} while (--nchild);
 
 	return 0;
diff --git a/drivers/gpu/drm/i915/i915_gem_timeline.c b/drivers/gpu/drm/i915/i915_gem_timeline.c
index b596ca7ee058..f271e93310fb 100644
--- a/drivers/gpu/drm/i915/i915_gem_timeline.c
+++ b/drivers/gpu/drm/i915/i915_gem_timeline.c
@@ -23,6 +23,32 @@
  */
 
 #include "i915_drv.h"
+#include "i915_syncmap.h"
+
+static void __intel_timeline_init(struct intel_timeline *tl,
+				  struct i915_gem_timeline *parent,
+				  u64 context,
+				  struct lock_class_key *lockclass,
+				  const char *lockname)
+{
+	tl->fence_context = context;
+	tl->common = parent;
+#ifdef CONFIG_DEBUG_SPINLOCK
+	__raw_spin_lock_init(&tl->lock.rlock, lockname, lockclass);
+#else
+	spin_lock_init(&tl->lock);
+#endif
+	init_request_active(&tl->last_request, NULL);
+	INIT_LIST_HEAD(&tl->requests);
+	i915_syncmap_init(&tl->sync);
+}
+
+static void __intel_timeline_fini(struct intel_timeline *tl)
+{
+	GEM_BUG_ON(!list_empty(&tl->requests));
+
+	i915_syncmap_free(&tl->sync);
+}
 
 static int __i915_gem_timeline_init(struct drm_i915_private *i915,
 				    struct i915_gem_timeline *timeline,
@@ -35,6 +61,12 @@ static int __i915_gem_timeline_init(struct drm_i915_private *i915,
 
 	lockdep_assert_held(&i915->drm.struct_mutex);
 
+	/*
+	 * Ideally we want a set of engines on a single leaf as we expect
+	 * to mostly be tracking synchronisation between engines.
+	 */
+	BUILD_BUG_ON(KSYNCMAP < I915_NUM_ENGINES);
+
 	timeline->i915 = i915;
 	timeline->name = kstrdup(name ?: "[kernel]", GFP_KERNEL);
 	if (!timeline->name)
@@ -44,19 +76,10 @@ static int __i915_gem_timeline_init(struct drm_i915_private *i915,
 
 	/* Called during early_init before we know how many engines there are */
 	fences = dma_fence_context_alloc(ARRAY_SIZE(timeline->engine));
-	for (i = 0; i < ARRAY_SIZE(timeline->engine); i++) {
-		struct intel_timeline *tl = &timeline->engine[i];
-
-		tl->fence_context = fences++;
-		tl->common = timeline;
-#ifdef CONFIG_DEBUG_SPINLOCK
-		__raw_spin_lock_init(&tl->lock.rlock, lockname, lockclass);
-#else
-		spin_lock_init(&tl->lock);
-#endif
-		init_request_active(&tl->last_request, NULL);
-		INIT_LIST_HEAD(&tl->requests);
-	}
+	for (i = 0; i < ARRAY_SIZE(timeline->engine); i++)
+		__intel_timeline_init(&timeline->engine[i],
+				      timeline, fences++,
+				      lockclass, lockname);
 
 	return 0;
 }
@@ -81,18 +104,52 @@ int i915_gem_timeline_init__global(struct drm_i915_private *i915)
 					&class, "&global_timeline->lock");
 }
 
+/**
+ * i915_gem_timelines_mark_idle -- called when the driver idles
+ * @i915 - the drm_i915_private device
+ *
+ * When the driver is completely idle, we know that all of our sync points
+ * have been signaled and our tracking is then entirely redundant. Any request
+ * to wait upon an older sync point will be completed instantly as we know
+ * the fence is signaled and therefore we will not even look them up in the
+ * the fence is signaled and therefore we will not even look it up in the
+ */
+void i915_gem_timelines_mark_idle(struct drm_i915_private *i915)
+{
+	struct i915_gem_timeline *timeline;
+	int i;
+
+	lockdep_assert_held(&i915->drm.struct_mutex);
+
+	list_for_each_entry(timeline, &i915->gt.timelines, link) {
+		for (i = 0; i < ARRAY_SIZE(timeline->engine); i++) {
+			struct intel_timeline *tl = &timeline->engine[i];
+
+			/*
+			 * All known fences are completed so we can scrap
+			 * the current sync point tracking and start afresh;
+			 * any attempt to wait upon a previous sync point
+			 * will be skipped as the fence was signaled.
+			 */
+			i915_syncmap_free(&tl->sync);
+		}
+	}
+}
+
 void i915_gem_timeline_fini(struct i915_gem_timeline *timeline)
 {
 	int i;
 
 	lockdep_assert_held(&timeline->i915->drm.struct_mutex);
 
-	for (i = 0; i < ARRAY_SIZE(timeline->engine); i++) {
-		struct intel_timeline *tl = &timeline->engine[i];
-
-		GEM_BUG_ON(!list_empty(&tl->requests));
-	}
+	for (i = 0; i < ARRAY_SIZE(timeline->engine); i++)
+		__intel_timeline_fini(&timeline->engine[i]);
 
 	list_del(&timeline->link);
 	kfree(timeline->name);
 }
+
+#if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
+#include "selftests/mock_timeline.c"
+#include "selftests/i915_gem_timeline.c"
+#endif
diff --git a/drivers/gpu/drm/i915/i915_gem_timeline.h b/drivers/gpu/drm/i915/i915_gem_timeline.h
index 6c53e14cab2a..82d59126eb60 100644
--- a/drivers/gpu/drm/i915/i915_gem_timeline.h
+++ b/drivers/gpu/drm/i915/i915_gem_timeline.h
@@ -27,7 +27,9 @@
 
 #include <linux/list.h>
 
+#include "i915_utils.h"
 #include "i915_gem_request.h"
+#include "i915_syncmap.h"
 
 struct i915_gem_timeline;
 
@@ -55,6 +57,17 @@ struct intel_timeline {
 	 * struct_mutex.
 	 */
 	struct i915_gem_active last_request;
+
+	/**
+	 * We track the most recent seqno that we wait on in every context so
+	 * that we only have to emit a new await and dependency on a more
+	 * recent sync point. As the contexts may be executed out-of-order,
+	 * we have to track each individually and cannot rely on an absolute
+	 * global_seqno. When we know that all tracked fences are completed
+	 * (i.e. when the driver is idle), we know that the syncmap is
+	 * redundant and we can discard it without loss of generality.
+	 */
+	struct i915_syncmap *sync;
 	u32 sync_seqno[I915_NUM_ENGINES];
 
 	struct i915_gem_timeline *common;
@@ -73,6 +86,31 @@ int i915_gem_timeline_init(struct drm_i915_private *i915,
 			   struct i915_gem_timeline *tl,
 			   const char *name);
 int i915_gem_timeline_init__global(struct drm_i915_private *i915);
+void i915_gem_timelines_mark_idle(struct drm_i915_private *i915);
 void i915_gem_timeline_fini(struct i915_gem_timeline *tl);
 
+static inline int __intel_timeline_sync_set(struct intel_timeline *tl,
+					    u64 context, u32 seqno)
+{
+	return i915_syncmap_set(&tl->sync, context, seqno);
+}
+
+static inline int intel_timeline_sync_set(struct intel_timeline *tl,
+					  const struct dma_fence *fence)
+{
+	return __intel_timeline_sync_set(tl, fence->context, fence->seqno);
+}
+
+static inline bool __intel_timeline_sync_is_later(struct intel_timeline *tl,
+						  u64 context, u32 seqno)
+{
+	return i915_syncmap_is_later(&tl->sync, context, seqno);
+}
+
+static inline bool intel_timeline_sync_is_later(struct intel_timeline *tl,
+						const struct dma_fence *fence)
+{
+	return __intel_timeline_sync_is_later(tl, fence->context, fence->seqno);
+}
+
 #endif
diff --git a/drivers/gpu/drm/i915/i915_syncmap.c b/drivers/gpu/drm/i915/i915_syncmap.c
new file mode 100644
index 000000000000..8748dc50b3fd
--- /dev/null
+++ b/drivers/gpu/drm/i915/i915_syncmap.c
@@ -0,0 +1,419 @@
+/*
+ * Copyright © 2017 Intel Corporation
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the next
+ * paragraph) shall be included in all copies or substantial portions of the
+ * Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ *
+ */
+
+#include <linux/slab.h>
+
+#include "i915_syncmap.h"
+
+#include "i915_gem.h" /* GEM_BUG_ON() */
+#include "i915_selftest.h"
+
+#define SHIFT ilog2(KSYNCMAP)
+#define MASK (KSYNCMAP - 1)
+
+/*
+ * struct i915_syncmap is a layer of a radixtree that maps a u64 fence
+ * context id to the last u32 fence seqno waited upon from that context.
+ * Unlike lib/radixtree it uses a parent pointer that allows traversal back to
+ * the root. This allows us to access the whole tree via a single pointer
+ * to the most recently used layer. We expect fence contexts to be dense
+ * and most reuse to be on the same i915_gem_context but on neighbouring
+ * engines (i.e. on adjacent contexts) and reuse the same leaf, a very
+ * effective lookup cache. If the new lookup is not on the same leaf, we
+ * expect it to be on the neighbouring branch.
+ *
+ * A leaf holds an array of u32 seqno, and has height 0. The bitmap field
+ * allows us to store whether a particular seqno is valid (i.e. allows us
+ * to distinguish unset from 0).
+ *
+ * A branch holds an array of layer pointers, and has height > 0, and always
+ * has at least 2 layers (either branches or leaves) below it.
+ *
+ * For example,
+ * 	i915_syncmap_set(&sync, 0, 0);
+ *	i915_syncmap_set(&sync, 1, 1);
+ *	i915_syncmap_set(&sync, 2, 2);
+ *	i915_syncmap_set(&sync, 0x10, 0x10);
+ *	i915_syncmap_set(&sync, 0x11, 0x11);
+ *	i915_syncmap_set(&sync, 0x200, 0x200);
+ *	i915_syncmap_set(&sync, 0x201, 0x201);
+ *	i915_syncmap_set(&sync, 0x500000, 0x500000);
+ *	i915_syncmap_set(&sync, 0x500001, 0x500001);
+ *	i915_syncmap_set(&sync, 0x503000, 0x503000);
+ *	i915_syncmap_set(&sync, 0x503001, 0x503001);
+ *	i915_syncmap_set(&sync, 0xeull << 60 | 0xe, 0xe);
+ * will build a tree like:
+ *	0xffffffffffffffff
+ *	0-> 0x0000000000ffffff
+ *	|   0-> 0x0000000000000fff
+ *	|   |   0-> 0x00000000000000ff
+ *	|   |   |   0-> 0x000000000000000f 0:0, 1:1, 2:2
+ *	|   |   |   1-> 0x000000000000001f 0:10, 1:11
+ *	|   |   2-> 0x000000000000020f 0:200, 1:201
+ *	|   5-> 0x000000000050ffff
+ *	|   |   0-> 0x000000000050000f 0:500000, 1:500001
+ *	|   |   3-> 0x000000000050300f 0:503000, 1:503001
+ *	e-> 0xe00000000000000f e:e
+ */
+
+struct i915_syncmap {
+	u64 prefix;
+	unsigned int height;
+	unsigned int bitmap;
+	struct i915_syncmap *parent;
+	/*
+	 * Following this header is an array of either seqno or child pointers:
+	 * union {
+	 *	u32 seqno[KSYNCMAP];
+	 *	struct i915_syncmap *child[KSYNCMAP];
+	 * };
+	 */
+};
+
+/**
+ * i915_syncmap_init -- initialise the #i915_syncmap
+ * @root - pointer to the #i915_syncmap
+ */
+void i915_syncmap_init(struct i915_syncmap **root)
+{
+	BUILD_BUG_ON_NOT_POWER_OF_2(KSYNCMAP);
+	BUILD_BUG_ON_NOT_POWER_OF_2(SHIFT);
+	BUILD_BUG_ON(KSYNCMAP > BITS_PER_BYTE * sizeof((*root)->bitmap));
+	*root = NULL;
+}
+
+static inline u32 *__sync_seqno(struct i915_syncmap *p)
+{
+	GEM_BUG_ON(p->height);
+	return (u32 *)(p + 1);
+}
+
+static inline struct i915_syncmap **__sync_child(struct i915_syncmap *p)
+{
+	GEM_BUG_ON(!p->height);
+	return (struct i915_syncmap **)(p + 1);
+}
+
+static inline unsigned int
+__sync_branch_idx(const struct i915_syncmap *p, u64 id)
+{
+	return (id >> p->height) & MASK;
+}
+
+static inline unsigned int
+__sync_leaf_idx(const struct i915_syncmap *p, u64 id)
+{
+	GEM_BUG_ON(p->height);
+	return id & MASK;
+}
+
+static inline u64 __sync_branch_prefix(const struct i915_syncmap *p, u64 id)
+{
+	return id >> p->height >> SHIFT;
+}
+
+static inline u64 __sync_leaf_prefix(const struct i915_syncmap *p, u64 id)
+{
+	GEM_BUG_ON(p->height);
+	return id >> SHIFT;
+}
+
+static inline bool seqno_later(u32 a, u32 b)
+{
+	return (s32)(a - b) >= 0;
+}
+
+/**
+ * i915_syncmap_is_later -- compare against the last known sync point
+ * @root - pointer to the #i915_syncmap
+ * @id - the context id (other timeline) we are synchronising to
+ * @seqno - the sequence number along the other timeline
+ *
+ * If we have already synchronised this @root with another (@id) then we can
+ * omit any repeated or earlier synchronisation requests. If the two timelines
+ * are already coupled, we can also omit the dependency between the two as that
+ * is already known via the timeline.
+ *
+ * Returns true if the two timelines are already synchronised wrt @seqno,
+ * false if not and the synchronisation must be emitted.
+ */
+bool i915_syncmap_is_later(struct i915_syncmap **root, u64 id, u32 seqno)
+{
+	struct i915_syncmap *p;
+	unsigned int idx;
+
+	p = *root;
+	if (!p)
+		return false;
+
+	if (likely(__sync_leaf_prefix(p, id) == p->prefix))
+		goto found;
+
+	/* First climb the tree back to a parent branch */
+	do {
+		p = p->parent;
+		if (!p)
+			return false;
+
+		if (__sync_branch_prefix(p, id) == p->prefix)
+			break;
+	} while (1);
+
+	/* And then descend again until we find our leaf */
+	do {
+		if (!p->height)
+			break;
+
+		p = __sync_child(p)[__sync_branch_idx(p, id)];
+		if (!p)
+			return false;
+
+		if (__sync_branch_prefix(p, id) != p->prefix)
+			return false;
+	} while (1);
+
+	*root = p;
+found:
+	idx = __sync_leaf_idx(p, id);
+	if (!(p->bitmap & BIT(idx)))
+		return false;
+
+	return seqno_later(__sync_seqno(p)[idx], seqno);
+}
+
+static struct i915_syncmap *
+__sync_alloc_leaf(struct i915_syncmap *parent, u64 id)
+{
+	struct i915_syncmap *p;
+
+	p = kmalloc(sizeof(*p) + KSYNCMAP * sizeof(u32), GFP_KERNEL);
+	if (unlikely(!p))
+		return NULL;
+
+	p->parent = parent;
+	p->height = 0;
+	p->bitmap = 0;
+	p->prefix = __sync_leaf_prefix(p, id);
+	return p;
+}
+
+static inline void __sync_set_seqno(struct i915_syncmap *p, u64 id, u32 seqno)
+{
+	unsigned int idx = __sync_leaf_idx(p, id);
+
+	__sync_seqno(p)[idx] = seqno;
+	p->bitmap |= BIT(idx);
+}
+
+static inline void __sync_set_child(struct i915_syncmap *p,
+				    unsigned int idx,
+				    struct i915_syncmap *child)
+{
+	__sync_child(p)[idx] = child;
+	p->bitmap |= BIT(idx);
+}
+
+static noinline int __sync_set(struct i915_syncmap **root, u64 id, u32 seqno)
+{
+	struct i915_syncmap *p = *root;
+	unsigned int idx;
+
+	if (!p) {
+		p = __sync_alloc_leaf(NULL, id);
+		if (unlikely(!p))
+			return -ENOMEM;
+
+		goto found;
+	}
+
+	/* Caller handled the likely cached case */
+	GEM_BUG_ON(__sync_leaf_prefix(p, id) == p->prefix);
+
+	/* Climb back up the tree until we find a common prefix */
+	do {
+		if (!p->parent)
+			break;
+
+		p = p->parent;
+
+		if (__sync_branch_prefix(p, id) == p->prefix)
+			break;
+	} while (1);
+
+	/*
+	 * No shortcut, we have to descend the tree to find the right layer
+	 * containing this fence.
+	 *
+	 * Each layer in the tree holds 16 (KSYNCMAP) pointers, either fences
+	 * or lower layers. Leaf nodes (height = 0) contain the fences, all
+	 * other nodes (height > 0) are internal layers that point to a lower
+	 * node. Each internal layer has at least 2 descendants.
+	 *
+	 * Starting at the top, we check whether the current prefix matches. If
+	 * it doesn't, we have gone past our layer and need to insert a join
+	 * into the tree, with a new leaf node as a descendant alongside the
+	 * original layer.
+	 *
+	 * The matching prefix means we are still following the right branch
+	 * of the tree. If it has height 0, we have found our leaf and just
+	 * need to replace the fence slot with ourselves. If the height is
+	 * not zero, our slot contains the next layer in the tree (unless
+	 * it is empty, in which case we can add ourselves as a new leaf).
+	 * As we descend the tree the prefix grows (and the height decreases).
+	 */
+	do {
+		struct i915_syncmap *next;
+
+		if (__sync_branch_prefix(p, id) != p->prefix) {
+			unsigned int above;
+
+			/* Insert a join above the current layer */
+			next = kzalloc(sizeof(*next) + KSYNCMAP * sizeof(next),
+				       GFP_KERNEL);
+			if (unlikely(!next))
+				return -ENOMEM;
+
+			/* Compute the height at which these two diverge */
+			above = fls64(__sync_branch_prefix(p, id) ^ p->prefix);
+			above = round_up(above, SHIFT);
+			next->height = above + p->height;
+			next->prefix = __sync_branch_prefix(next, id);
+
+			/* Insert the join into the parent */
+			if (p->parent) {
+				idx = __sync_branch_idx(p->parent, id);
+				__sync_child(p->parent)[idx] = next;
+				GEM_BUG_ON(!(p->parent->bitmap & BIT(idx)));
+			}
+			next->parent = p->parent;
+
+			/* Compute the idx of the other branch, not our id! */
+			idx = p->prefix >> (above - SHIFT) & MASK;
+			__sync_set_child(next, idx, p);
+			p->parent = next;
+
+			/* Ascend to the join */
+			p = next;
+		} else {
+			if (!p->height)
+				break;
+		}
+
+		/* Descend into the next layer */
+		GEM_BUG_ON(!p->height);
+		idx = __sync_branch_idx(p, id);
+		next = __sync_child(p)[idx];
+		if (!next) {
+			next = __sync_alloc_leaf(p, id);
+			if (unlikely(!next))
+				return -ENOMEM;
+
+			__sync_set_child(p, idx, next);
+			p = next;
+			break;
+		}
+
+		p = next;
+	} while (1);
+
+found:
+	GEM_BUG_ON(p->prefix != __sync_leaf_prefix(p, id));
+	__sync_set_seqno(p, id, seqno);
+	*root = p;
+	return 0;
+}
+
+/**
+ * i915_syncmap_set -- mark the most recent syncpoint between contexts
+ * @root - pointer to the #i915_syncmap
+ * @id - the context id (other timeline) we have synchronised to
+ * @seqno - the sequence number along the other timeline
+ *
+ * When we synchronise this @root with another (@id), we also know that we have
+ * synchronised with all previous seqno along that timeline. If we then have
+ * a request to synchronise with the same seqno or older, we can omit it;
+ * see i915_syncmap_is_later().
+ *
+ * Returns 0 on success, or a negative error code.
+ */
+int i915_syncmap_set(struct i915_syncmap **root, u64 id, u32 seqno)
+{
+	struct i915_syncmap *p = *root;
+
+	/*
+	 * We expect to be called in sequence following a is_later(id), which
+	 * should have preloaded the root for us.
+	 */
+	if (likely(p && __sync_leaf_prefix(p, id) == p->prefix)) {
+		__sync_set_seqno(p, id, seqno);
+		return 0;
+	}
+
+	return __sync_set(root, id, seqno);
+}
+
+static void __sync_free(struct i915_syncmap *p)
+{
+	if (p->height) {
+		unsigned int i;
+
+		while ((i = ffs(p->bitmap))) {
+			p->bitmap &= ~0u << i;
+			__sync_free(__sync_child(p)[i - 1]);
+		}
+	}
+
+	kfree(p);
+}
+
+/**
+ * i915_syncmap_free -- free all memory associated with the syncmap
+ * @root - pointer to the #i915_syncmap
+ *
+ * Either when the timeline is to be freed and we no longer need the sync
+ * point tracking, or when the fences are all known to be signaled and the
+ * sync point tracking is redundant, we can free the #i915_syncmap to recover
+ * its allocations.
+ *
+ * Will reinitialise the @root pointer so that the #i915_syncmap is ready for
+ * reuse.
+ */
+void i915_syncmap_free(struct i915_syncmap **root)
+{
+	struct i915_syncmap *p;
+
+	p = *root;
+	if (!p)
+		return;
+
+	while (p->parent)
+		p = p->parent;
+
+	__sync_free(p);
+	*root = NULL;
+}
+
+#if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
+#include "selftests/i915_syncmap.c"
+#endif
diff --git a/drivers/gpu/drm/i915/i915_syncmap.h b/drivers/gpu/drm/i915/i915_syncmap.h
new file mode 100644
index 000000000000..7ca827d812ae
--- /dev/null
+++ b/drivers/gpu/drm/i915/i915_syncmap.h
@@ -0,0 +1,39 @@
+/*
+ * Copyright © 2017 Intel Corporation
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the next
+ * paragraph) shall be included in all copies or substantial portions of the
+ * Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ *
+ */
+
+#ifndef __I915_SYNCMAP_H__
+#define __I915_SYNCMAP_H__
+
+#include <linux/types.h>
+
+struct i915_syncmap;
+
+void i915_syncmap_init(struct i915_syncmap **root);
+bool i915_syncmap_is_later(struct i915_syncmap **root, u64 id, u32 seqno);
+int i915_syncmap_set(struct i915_syncmap **root, u64 id, u32 seqno);
+void i915_syncmap_free(struct i915_syncmap **root);
+
+#define KSYNCMAP 16
+
+#endif /* __I915_SYNCMAP_H__ */
diff --git a/drivers/gpu/drm/i915/selftests/i915_gem_timeline.c b/drivers/gpu/drm/i915/selftests/i915_gem_timeline.c
new file mode 100644
index 000000000000..1a9f9cb57878
--- /dev/null
+++ b/drivers/gpu/drm/i915/selftests/i915_gem_timeline.c
@@ -0,0 +1,272 @@
+/*
+ * Copyright © 2017 Intel Corporation
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the next
+ * paragraph) shall be included in all copies or substantial portions of the
+ * Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ *
+ */
+
+#include "../i915_selftest.h"
+#include "i915_random.h"
+
+#include "mock_gem_device.h"
+#include "mock_timeline.h"
+
+struct __igt_sync {
+	const char *name;
+	u32 seqno;
+	bool expected;
+	bool set;
+};
+
+static int __igt_sync(struct intel_timeline *tl,
+		      u64 ctx,
+		      const struct __igt_sync *p,
+		      const char *name)
+{
+	int ret;
+
+	if (__intel_timeline_sync_is_later(tl, ctx, p->seqno) != p->expected) {
+		pr_err("%s: %s(ctx=%llu, seqno=%u) expected passed %s but failed\n",
+		       name, p->name, ctx, p->seqno, yesno(p->expected));
+		return -EINVAL;
+	}
+
+	if (p->set) {
+		ret = __intel_timeline_sync_set(tl, ctx, p->seqno);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
+static int igt_sync(void *arg)
+{
+	const struct __igt_sync pass[] = {
+		{ "unset", 0, false, false },
+		{ "new", 0, false, true },
+		{ "0a", 0, true, true },
+		{ "1a", 1, false, true },
+		{ "1b", 1, true, true },
+		{ "0b", 0, true, false },
+		{ "2a", 2, false, true },
+		{ "4", 4, false, true },
+		{ "INT_MAX", INT_MAX, false, true },
+		{ "INT_MAX-1", INT_MAX-1, true, false },
+		{ "INT_MAX+1", (u32)INT_MAX+1, false, true },
+		{ "INT_MAX", INT_MAX, true, false },
+		{ "UINT_MAX", UINT_MAX, false, true },
+		{ "wrap", 0, false, true },
+		{ "unwrap", UINT_MAX, true, false },
+		{},
+	}, *p;
+	struct intel_timeline *tl;
+	int order, offset;
+	int ret;
+
+	tl = mock_timeline(0);
+	if (!tl)
+		return -ENOMEM;
+
+	for (p = pass; p->name; p++) {
+		for (order = 1; order < 64; order++) {
+			for (offset = -1; offset <= (order > 1); offset++) {
+				u64 ctx = BIT_ULL(order) + offset;
+
+				ret = __igt_sync(tl, ctx, p, "1");
+				if (ret)
+					goto out;
+			}
+		}
+	}
+	mock_timeline_destroy(tl);
+
+	tl = mock_timeline(0);
+	if (!tl)
+		return -ENOMEM;
+
+	for (order = 1; order < 64; order++) {
+		for (offset = -1; offset <= (order > 1); offset++) {
+			u64 ctx = BIT_ULL(order) + offset;
+
+			for (p = pass; p->name; p++) {
+				ret = __igt_sync(tl, ctx, p, "2");
+				if (ret)
+					goto out;
+			}
+		}
+	}
+
+out:
+	mock_timeline_destroy(tl);
+	return ret;
+}
+
+static unsigned int random_engine(struct rnd_state *rnd)
+{
+	return ((u64)prandom_u32_state(rnd) * I915_NUM_ENGINES) >> 32;
+}
+
+static int bench_sync(void *arg)
+{
+#define M (1 << 20)
+	struct rnd_state prng;
+	struct intel_timeline *tl;
+	unsigned long end_time, count;
+	u64 prng32_1M;
+	ktime_t kt;
+
+	tl = mock_timeline(0);
+	if (!tl)
+		return -ENOMEM;
+
+	prandom_seed_state(&prng, i915_selftest.random_seed);
+	count = 0;
+	kt = ktime_get();
+	end_time = jiffies + HZ/10;
+	do {
+		u32 x;
+
+		WRITE_ONCE(x, prandom_u32_state(&prng));
+
+		count++;
+	} while (!time_after(jiffies, end_time));
+	kt = ktime_sub(ktime_get(), kt);
+	pr_debug("%s: %lu random evaluations, %lluns/prng\n",
+		__func__, count, (long long)div64_ul(ktime_to_ns(kt), count));
+	prng32_1M = ktime_to_ns(kt) * M / count;
+
+	prandom_seed_state(&prng, i915_selftest.random_seed);
+	count = 0;
+	kt = ktime_get();
+	end_time = jiffies + HZ/10;
+	do {
+		u64 id = prandom_u64_state(&prng);
+
+		__intel_timeline_sync_set(tl, id, 0);
+		count++;
+	} while (!time_after(jiffies, end_time));
+	kt = ktime_sub(ktime_get(), kt);
+	kt = ktime_sub_ns(kt, count * prng32_1M * 2 / M);
+	pr_info("%s: %lu random insertions, %lluns/insert\n",
+		__func__, count, (long long)div64_ul(ktime_to_ns(kt), count));
+
+	prandom_seed_state(&prng, i915_selftest.random_seed);
+	end_time = count;
+	kt = ktime_get();
+	while (end_time--) {
+		u64 id = prandom_u64_state(&prng);
+
+		if (!__intel_timeline_sync_is_later(tl, id, 0)) {
+			mock_timeline_destroy(tl);
+			pr_err("Lookup of %llu failed\n", id);
+			return -EINVAL;
+		}
+	}
+	kt = ktime_sub(ktime_get(), kt);
+	kt = ktime_sub_ns(kt, count * prng32_1M * 2 / M);
+	pr_info("%s: %lu random lookups, %lluns/lookup\n",
+		__func__, count, (long long)div64_ul(ktime_to_ns(kt), count));
+
+	mock_timeline_destroy(tl);
+
+	tl = mock_timeline(0);
+	if (!tl)
+		return -ENOMEM;
+
+	count = 0;
+	kt = ktime_get();
+	end_time = jiffies + HZ/10;
+	do {
+		__intel_timeline_sync_set(tl, count++, 0);
+	} while (!time_after(jiffies, end_time));
+	kt = ktime_sub(ktime_get(), kt);
+	pr_info("%s: %lu in-order insertions, %lluns/insert\n",
+		__func__, count, (long long)div64_ul(ktime_to_ns(kt), count));
+
+	end_time = count;
+	kt = ktime_get();
+	while (end_time--) {
+		if (!__intel_timeline_sync_is_later(tl, end_time, 0)) {
+			pr_err("Lookup of %lu failed\n", end_time);
+			mock_timeline_destroy(tl);
+			return -EINVAL;
+		}
+	}
+	kt = ktime_sub(ktime_get(), kt);
+	pr_info("%s: %lu in-order lookups, %lluns/lookup\n",
+		__func__, count, (long long)div64_ul(ktime_to_ns(kt), count));
+
+	mock_timeline_destroy(tl);
+
+	tl = mock_timeline(0);
+	if (!tl)
+		return -ENOMEM;
+
+	prandom_seed_state(&prng, i915_selftest.random_seed);
+	count = 0;
+	kt = ktime_get();
+	end_time = jiffies + HZ/10;
+	do {
+		u32 id = random_engine(&prng);
+		u32 seqno = prandom_u32_state(&prng);
+
+		if (!__intel_timeline_sync_is_later(tl, id, seqno))
+			__intel_timeline_sync_set(tl, id, seqno);
+
+		count++;
+	} while (!time_after(jiffies, end_time));
+	kt = ktime_sub(ktime_get(), kt);
+	kt = ktime_sub_ns(kt, count * prng32_1M / M);
+	pr_info("%s: %lu repeated insert/lookups, %lluns/op\n",
+		__func__, count, (long long)div64_ul(ktime_to_ns(kt), count));
+	mock_timeline_destroy(tl);
+
+	tl = mock_timeline(0);
+	if (!tl)
+		return -ENOMEM;
+
+	count = 0;
+	kt = ktime_get();
+	end_time = jiffies + HZ/10;
+	do {
+		if (!__intel_timeline_sync_is_later(tl, count & 7, count >> 4))
+			__intel_timeline_sync_set(tl, count & 7, count >> 4);
+
+		count++;
+	} while (!time_after(jiffies, end_time));
+	kt = ktime_sub(ktime_get(), kt);
+	pr_info("%s: %lu cyclic insert/lookups, %lluns/op\n",
+		__func__, count, (long long)div64_ul(ktime_to_ns(kt), count));
+	mock_timeline_destroy(tl);
+
+	return 0;
+#undef M
+}
+
+int i915_gem_timeline_mock_selftests(void)
+{
+	static const struct i915_subtest tests[] = {
+		SUBTEST(igt_sync),
+		SUBTEST(bench_sync),
+	};
+
+	return i915_subtests(tests, NULL);
+}
diff --git a/drivers/gpu/drm/i915/selftests/i915_mock_selftests.h b/drivers/gpu/drm/i915/selftests/i915_mock_selftests.h
index be9a9ebf5692..76c1f149a0a0 100644
--- a/drivers/gpu/drm/i915/selftests/i915_mock_selftests.h
+++ b/drivers/gpu/drm/i915/selftests/i915_mock_selftests.h
@@ -10,8 +10,10 @@
  */
 selftest(sanitycheck, i915_mock_sanitycheck) /* keep first (igt selfcheck) */
 selftest(scatterlist, scatterlist_mock_selftests)
+selftest(syncmap, i915_syncmap_mock_selftests)
 selftest(uncore, intel_uncore_mock_selftests)
 selftest(breadcrumbs, intel_breadcrumbs_mock_selftests)
+selftest(timelines, i915_gem_timeline_mock_selftests)
 selftest(requests, i915_gem_request_mock_selftests)
 selftest(objects, i915_gem_object_mock_selftests)
 selftest(dmabuf, i915_gem_dmabuf_mock_selftests)
diff --git a/drivers/gpu/drm/i915/selftests/i915_random.c b/drivers/gpu/drm/i915/selftests/i915_random.c
index c17c83c30637..97796d3e3c9a 100644
--- a/drivers/gpu/drm/i915/selftests/i915_random.c
+++ b/drivers/gpu/drm/i915/selftests/i915_random.c
@@ -30,6 +30,17 @@
 
 #include "i915_random.h"
 
+u64 prandom_u64_state(struct rnd_state *rnd)
+{
+	u64 x;
+
+	x = prandom_u32_state(rnd);
+	x <<= 32;
+	x |= prandom_u32_state(rnd);
+
+	return x;
+}
+
 static inline u32 i915_prandom_u32_max_state(u32 ep_ro, struct rnd_state *state)
 {
 	return upper_32_bits((u64)prandom_u32_state(state) * ep_ro);
diff --git a/drivers/gpu/drm/i915/selftests/i915_random.h b/drivers/gpu/drm/i915/selftests/i915_random.h
index b9c334ce6cd9..0c65b87194ce 100644
--- a/drivers/gpu/drm/i915/selftests/i915_random.h
+++ b/drivers/gpu/drm/i915/selftests/i915_random.h
@@ -41,6 +41,8 @@
 #define I915_RND_SUBSTATE(name__, parent__) \
 	struct rnd_state name__ = I915_RND_STATE_INITIALIZER(prandom_u32_state(&(parent__)))
 
+u64 prandom_u64_state(struct rnd_state *rnd);
+
 unsigned int *i915_random_order(unsigned int count,
 				struct rnd_state *state);
 void i915_random_reorder(unsigned int *order,
diff --git a/drivers/gpu/drm/i915/selftests/i915_syncmap.c b/drivers/gpu/drm/i915/selftests/i915_syncmap.c
new file mode 100644
index 000000000000..5f14fbfef0f4
--- /dev/null
+++ b/drivers/gpu/drm/i915/selftests/i915_syncmap.c
@@ -0,0 +1,609 @@
+/*
+ * Copyright © 2017 Intel Corporation
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the next
+ * paragraph) shall be included in all copies or substantial portions of the
+ * Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ *
+ */
+
+#include "../i915_selftest.h"
+#include "i915_random.h"
+
+static char *
+__sync_print(struct i915_syncmap *p,
+	     char *buf, unsigned long *sz,
+	     unsigned int depth,
+	     unsigned int idx)
+{
+	unsigned long len;
+	unsigned i, bits;
+
+	if (depth) {
+		for (i = 0; i < depth - 1; i++) {
+			len = snprintf(buf, *sz, "|   ");
+			buf += len;
+			*sz -= len;
+		}
+		len = snprintf(buf, *sz, "%x-> ", idx);
+		buf += len;
+		*sz -= len;
+	}
+
+	if (p->height < 64 - SHIFT)
+		len = snprintf(buf, *sz, "0x%016llx",
+				(p->prefix << p->height << SHIFT) |
+				(BIT_ULL(p->height + SHIFT) - 1));
+	else
+		len = snprintf(buf, *sz, "0x%016llx", U64_MAX);
+	buf += len;
+	*sz -= len;
+
+	if (!p->height) {
+		for (bits = p->bitmap; (i = ffs(bits)); bits &= ~0u << i) {
+			len = snprintf(buf, *sz, " %x:%x,",
+				       i - 1, __sync_seqno(p)[i - 1]);
+			buf += len;
+			*sz -= len;
+		}
+		buf -= 1;
+		*sz += 1;
+	}
+
+	len = snprintf(buf, *sz, "\n");
+	buf += len;
+	*sz -= len;
+
+	if (p->height) {
+		for (bits = p->bitmap; (i = ffs(bits)); bits &= ~0u << i)
+			buf = __sync_print(__sync_child(p)[i - 1],
+					   buf, sz, depth + 1, i - 1);
+	}
+
+	return buf;
+}
+
+static bool
+i915_syncmap_print_to_buf(struct i915_syncmap *p, char *buf, unsigned long sz)
+{
+	if (!p)
+		return false;
+
+	while (p->parent)
+		p = p->parent;
+
+	__sync_print(p, buf, &sz, 0, false);
+	return true;
+}
+
+static int check_syncmap_free(struct i915_syncmap **sync)
+{
+	i915_syncmap_free(sync);
+	if (*sync) {
+		pr_err("sync not cleared after free\n");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int dump_syncmap(struct i915_syncmap *sync, int err)
+{
+	char *buf;
+
+	if (!err)
+		return check_syncmap_free(&sync);
+
+	buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
+	if (!buf)
+		goto skip;
+
+	if (i915_syncmap_print_to_buf(sync, buf, PAGE_SIZE))
+		pr_err("%s", buf);
+
+	kfree(buf);
+
+skip:
+	i915_syncmap_free(&sync);
+	return err;
+}
+
+static int igt_syncmap_init(void *arg)
+{
+	struct i915_syncmap *sync = (void *)~0ul;
+
+	/*
+	 * Cursory check that we can initialise a random pointer and transform
+	 * it into the root pointer of a syncmap.
+	 */
+
+	i915_syncmap_init(&sync);
+	return check_syncmap_free(&sync);
+}
+
+static int check_seqno(struct i915_syncmap *leaf, unsigned int idx, u32 seqno)
+{
+	if (leaf->height) {
+		pr_err("%s: not a leaf, height is %d\n",
+		       __func__, leaf->height);
+		return -EINVAL;
+	}
+
+	if (__sync_seqno(leaf)[idx] != seqno) {
+		pr_err("%s: seqno[%d], found %x, expected %x\n",
+		       __func__, idx, __sync_seqno(leaf)[idx], seqno);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int check_one(struct i915_syncmap **sync, u64 context, u32 seqno)
+{
+	int err;
+
+	err = i915_syncmap_set(sync, context, seqno);
+	if (err)
+		return err;
+
+	if ((*sync)->height) {
+		pr_err("Inserting first context=%llx did not return leaf (height=%d, prefix=%llx)\n",
+		       context, (*sync)->height, (*sync)->prefix);
+		return -EINVAL;
+	}
+
+	if ((*sync)->parent) {
+		pr_err("Inserting first context=%llx created branches!\n",
+		       context);
+		return -EINVAL;
+	}
+
+	if (hweight32((*sync)->bitmap) != 1) {
+		pr_err("First bitmap does not contain a single entry, found %x (count=%d)!\n",
+		       (*sync)->bitmap, hweight32((*sync)->bitmap));
+		return -EINVAL;
+	}
+
+	err = check_seqno((*sync), ilog2((*sync)->bitmap), seqno);
+	if (err)
+		return err;
+
+	if (!i915_syncmap_is_later(sync, context, seqno)) {
+		pr_err("Lookup of first context=%llx/seqno=%x failed!\n",
+		       context, seqno);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int igt_syncmap_one(void *arg)
+{
+	I915_RND_STATE(prng);
+	IGT_TIMEOUT(end_time);
+	struct i915_syncmap *sync;
+	unsigned long max = 1;
+	int err;
+
+	/*
+	 * Check that inserting a new id, creates a leaf and only that leaf.
+	 */
+
+	i915_syncmap_init(&sync);
+
+	do {
+		u64 context = prandom_u64_state(&prng);
+		unsigned long loop;
+
+		err = check_syncmap_free(&sync);
+		if (err)
+			goto out;
+
+		for (loop = 0; loop <= max; loop++) {
+			err = check_one(&sync, context,
+					prandom_u32_state(&prng));
+			if (err)
+				goto out;
+		}
+		max++;
+	} while (!__igt_timeout(end_time, NULL));
+	pr_debug("%s: Completed %lu single insertions\n",
+		__func__, max * (max - 1) / 2);
+out:
+	return dump_syncmap(sync, err);
+}
+
+static int check_leaf(struct i915_syncmap **sync, u64 context, u32 seqno)
+{
+	int err;
+
+	err = i915_syncmap_set(sync, context, seqno);
+	if (err)
+		return err;
+
+	if ((*sync)->height) {
+		pr_err("Inserting context=%llx did not return leaf (height=%d, prefix=%llx)\n",
+		       context, (*sync)->height, (*sync)->prefix);
+		return -EINVAL;
+	}
+
+	if (hweight32((*sync)->bitmap) != 1) {
+		pr_err("First entry into leaf (context=%llx) does not contain a single entry, found %x (count=%d)!\n",
+		       context, (*sync)->bitmap, hweight32((*sync)->bitmap));
+		return -EINVAL;
+	}
+
+	err = check_seqno((*sync), ilog2((*sync)->bitmap), seqno);
+	if (err)
+		return err;
+
+	if (!i915_syncmap_is_later(sync, context, seqno)) {
+		pr_err("Lookup of first entry context=%llx/seqno=%x failed!\n",
+		       context, seqno);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int igt_syncmap_join_above(void *arg)
+{
+	struct i915_syncmap *sync;
+	unsigned int pass, order;
+	int err;
+
+	i915_syncmap_init(&sync);
+
+	/*
+	 * When we have a new id that doesn't fit inside the existing tree,
+	 * we need to add a new layer above.
+	 *
+	 * 1: 0x00000001
+	 * 2: 0x00000010
+	 * 3: 0x00000100
+	 * 4: 0x00001000
+	 * ...
+	 * Each pass the common prefix shrinks and we have to insert a join.
+	 * Each join will only contain two branches, the latest of which
+	 * is always a leaf.
+	 *
+	 * If we then reuse the same set of contexts, we expect to build an
+	 * identical tree.
+	 */
+	for (pass = 0; pass < 3; pass++) {
+		for (order = 0; order < 64; order += SHIFT) {
+			u64 context = BIT_ULL(order);
+			struct i915_syncmap *join;
+
+			err = check_leaf(&sync, context, 0);
+			if (err)
+				goto out;
+
+			join = sync->parent;
+			if (!join) /* very first insert will have no parents */
+				continue;
+
+			if (!join->height) {
+				pr_err("Parent with no height!\n");
+				err = -EINVAL;
+				goto out;
+			}
+
+			if (hweight32(join->bitmap) != 2) {
+				pr_err("Join does not have 2 children: %x (%d)\n",
+				       join->bitmap, hweight32(join->bitmap));
+				err = -EINVAL;
+				goto out;
+			}
+
+			if (__sync_child(join)[__sync_branch_idx(join, context)] != sync) {
+				pr_err("Leaf misplaced in parent!\n");
+				err = -EINVAL;
+				goto out;
+			}
+		}
+	}
+out:
+	return dump_syncmap(sync, err);
+}
+
+static int igt_syncmap_join_below(void *arg)
+{
+	struct i915_syncmap *sync;
+	unsigned int step, order, idx;
+	int err;
+
+	i915_syncmap_init(&sync);
+
+	/*
+	 * Check that we can split a compacted branch by replacing it with
+	 * a join.
+	 */
+	for (step = 0; step < KSYNCMAP; step++) {
+		for (order = 64 - SHIFT; order > 0; order -= SHIFT) {
+			u64 context = step*BIT_ULL(order);
+
+			err = i915_syncmap_set(&sync, context, 0);
+			if (err)
+				goto out;
+
+			if (sync->height) {
+				pr_err("Inserting context=%llx (order=%d, step=%d) did not return leaf (height=%d, prefix=%llx)\n",
+				       context, order, step, sync->height, sync->prefix);
+				err = -EINVAL;
+				goto out;
+			}
+		}
+	}
+
+	for (step = 0; step < KSYNCMAP; step++) {
+		for (order = SHIFT; order < 64; order += SHIFT) {
+			u64 context = step*BIT_ULL(order);
+
+			if (!i915_syncmap_is_later(&sync, context, 0)) {
+				pr_err("1: context %llx (order=%d, step=%d) not found\n",
+				       context, order, step);
+				err = -EINVAL;
+				goto out;
+			}
+
+			for (idx = 1; idx < KSYNCMAP; idx++) {
+				if (i915_syncmap_is_later(&sync, context + idx, 0)) {
+					pr_err("1: context %llx (order=%d, step=%d) should not exist\n",
+					       context + idx, order, step);
+					err = -EINVAL;
+					goto out;
+				}
+			}
+		}
+	}
+
+	for (order = SHIFT; order < 64; order += SHIFT) {
+		for (step = 0; step < KSYNCMAP; step++) {
+			u64 context = step*BIT_ULL(order);
+
+			if (!i915_syncmap_is_later(&sync, context, 0)) {
+				pr_err("2: context %llx (order=%d, step=%d) not found\n",
+				       context, order, step);
+				err = -EINVAL;
+				goto out;
+			}
+		}
+	}
+
+out:
+	return dump_syncmap(sync, err);
+}
+
+static int igt_syncmap_neighbours(void *arg)
+{
+	I915_RND_STATE(prng);
+	IGT_TIMEOUT(end_time);
+	struct i915_syncmap *sync;
+	int err;
+
+	/*
+	 * Each leaf holds KSYNCMAP seqno. Check that when we create KSYNCMAP
+	 * neighbouring ids, they all fit into the same leaf.
+	 */
+
+	i915_syncmap_init(&sync);
+	do {
+		u64 context = prandom_u64_state(&prng) & ~MASK;
+		unsigned int idx;
+
+		if (i915_syncmap_is_later(&sync, context, 0)) /* Skip repeats */
+			continue;
+
+		for (idx = 0; idx < KSYNCMAP; idx++) {
+			err = i915_syncmap_set(&sync, context + idx, 0);
+			if (err)
+				goto out;
+
+			if (sync->height) {
+				pr_err("Inserting context=%llx did not return leaf (height=%d, prefix=%llx)\n",
+				       context, sync->height, sync->prefix);
+				err = -EINVAL;
+				goto out;
+			}
+
+			if (sync->bitmap != BIT(idx + 1) - 1) {
+				pr_err("Inserting neighbouring context=0x%llx+%d, did not fit into the same leaf bitmap=%x (%d), expected %lx (%d)\n",
+				       context, idx,
+				       sync->bitmap, hweight32(sync->bitmap),
+				       BIT(idx + 1) - 1, idx + 1);
+				err = -EINVAL;
+				goto out;
+			}
+		}
+	} while (!__igt_timeout(end_time, NULL));
+out:
+	return dump_syncmap(sync, err);
+}
+
+static int igt_syncmap_compact(void *arg)
+{
+	struct i915_syncmap *sync;
+	unsigned int idx, order;
+	int err;
+
+	i915_syncmap_init(&sync);
+
+	/*
+	 * The syncmap is a "space efficient" compressed radix tree - any
+	 * branch with only one child is skipped and replaced by the child.
+	 *
+	 * If we construct a tree with ids that are neighbouring at a non-zero
+	 * height, we form a join but each child of that join is directly a
+	 * leaf holding the single id.
+	 */
+	for (order = SHIFT; order < 64; order += SHIFT) {
+		err = check_syncmap_free(&sync);
+		if (err)
+			goto out;
+
+		/* Create neighbours in the parent */
+		for (idx = 0; idx < KSYNCMAP; idx++) {
+			u64 context = idx * BIT_ULL(order) + idx;
+
+			err = i915_syncmap_set(&sync, context, 0);
+			if (err)
+				goto out;
+
+			if (sync->height) {
+				pr_err("Inserting context=%llx (order=%d, idx=%d) did not return leaf (height=%d, prefix=%llx)\n",
+				       context, order, idx,
+				       sync->height, sync->prefix);
+				err = -EINVAL;
+				goto out;
+			}
+		}
+
+		sync = sync->parent;
+		if (sync->parent) {
+			pr_err("Parent (join) of last leaf was not the sync!\n");
+			err = -EINVAL;
+			goto out;
+		}
+
+		if (sync->height != order) {
+			pr_err("Join does not have the expected height, found %d, expected %d\n",
+			       sync->height, order);
+			err = -EINVAL;
+			goto out;
+		}
+
+		if (sync->bitmap != BIT(KSYNCMAP) - 1) {
+			pr_err("Join is not full, found %x (%d), expected %lx (%d)\n",
+			       sync->bitmap, hweight32(sync->bitmap),
+			       BIT(KSYNCMAP) - 1, KSYNCMAP);
+			err = -EINVAL;
+			goto out;
+		}
+
+		/* Each of our children should be a leaf */
+		for (idx = 0; idx < KSYNCMAP; idx++) {
+			struct i915_syncmap *leaf = __sync_child(sync)[idx];
+
+			if (leaf->height) {
+				pr_err("Child %d is not a leaf!\n", idx);
+				err = -EINVAL;
+				goto out;
+			}
+
+			if (leaf->parent != sync) {
+				pr_err("Child %d is not attached to us!\n",
+				       idx);
+				err = -EINVAL;
+				goto out;
+			}
+
+			if (!is_power_of_2(leaf->bitmap)) {
+				pr_err("Child %d holds more than one id, found %x (%d)\n",
+				       idx, leaf->bitmap, hweight32(leaf->bitmap));
+				err = -EINVAL;
+				goto out;
+			}
+
+			if (leaf->bitmap != BIT(idx)) {
+				pr_err("Child %d has wrong seqno idx, found %d, expected %d\n",
+				       idx, ilog2(leaf->bitmap), idx);
+				err = -EINVAL;
+				goto out;
+			}
+		}
+	}
+out:
+	return dump_syncmap(sync, err);
+}
+
+static int igt_syncmap_random(void *arg)
+{
+	I915_RND_STATE(prng);
+	IGT_TIMEOUT(end_time);
+	struct i915_syncmap *sync;
+	unsigned long count, phase, i;
+	u32 seqno;
+	int err;
+
+	i915_syncmap_init(&sync);
+
+	/*
+	 * Having tried to test the individual operations within i915_syncmap,
+	 * run a smoketest exploring the entire u64 space with random
+	 * insertions.
+	 */
+
+	count = 0;
+	phase = jiffies + HZ/100 + 1;
+	do {
+		u64 context = prandom_u64_state(&prng);
+
+		err = i915_syncmap_set(&sync, context, 0);
+		if (err)
+			goto out;
+
+		count++;
+	} while (!time_after(jiffies, phase));
+	seqno = 0;
+
+	phase = 0;
+	do {
+		I915_RND_STATE(ctx);
+		u32 last_seqno = seqno;
+		bool expect;
+
+		seqno = prandom_u32_state(&prng);
+		expect = seqno_later(last_seqno, seqno);
+
+		for (i = 0; i < count; i++) {
+			u64 context = prandom_u64_state(&ctx);
+
+			if (i915_syncmap_is_later(&sync, context, seqno) != expect) {
+				pr_err("context=%llu, last=%u this=%u did not match expectation (%d)\n",
+				       context, last_seqno, seqno, expect);
+				err = -EINVAL;
+				goto out;
+			}
+
+			err = i915_syncmap_set(&sync, context, seqno);
+			if (err)
+				goto out;
+		}
+
+		phase++;
+	} while (!__igt_timeout(end_time, NULL));
+	pr_debug("Completed %lu passes, each of %lu contexts\n", phase, count);
+out:
+	return dump_syncmap(sync, err);
+}
+
+int i915_syncmap_mock_selftests(void)
+{
+	static const struct i915_subtest tests[] = {
+		SUBTEST(igt_syncmap_init),
+		SUBTEST(igt_syncmap_one),
+		SUBTEST(igt_syncmap_join_above),
+		SUBTEST(igt_syncmap_join_below),
+		SUBTEST(igt_syncmap_neighbours),
+		SUBTEST(igt_syncmap_compact),
+		SUBTEST(igt_syncmap_random),
+	};
+
+	return i915_subtests(tests, NULL);
+}
diff --git a/drivers/gpu/drm/i915/selftests/mock_timeline.c b/drivers/gpu/drm/i915/selftests/mock_timeline.c
new file mode 100644
index 000000000000..47b1f47c5812
--- /dev/null
+++ b/drivers/gpu/drm/i915/selftests/mock_timeline.c
@@ -0,0 +1,45 @@
+/*
+ * Copyright © 2017 Intel Corporation
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the next
+ * paragraph) shall be included in all copies or substantial portions of the
+ * Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ *
+ */
+
+#include "mock_timeline.h"
+
+struct intel_timeline *mock_timeline(u64 context)
+{
+	static struct lock_class_key class;
+	struct intel_timeline *tl;
+
+	tl = kzalloc(sizeof(*tl), GFP_KERNEL);
+	if (!tl)
+		return NULL;
+
+	__intel_timeline_init(tl, NULL, context, &class, "mock");
+
+	return tl;
+}
+
+void mock_timeline_destroy(struct intel_timeline *tl)
+{
+	__intel_timeline_fini(tl);
+	kfree(tl);
+}
diff --git a/drivers/gpu/drm/i915/selftests/mock_timeline.h b/drivers/gpu/drm/i915/selftests/mock_timeline.h
new file mode 100644
index 000000000000..c27ff4639b8b
--- /dev/null
+++ b/drivers/gpu/drm/i915/selftests/mock_timeline.h
@@ -0,0 +1,33 @@
+/*
+ * Copyright © 2017 Intel Corporation
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the next
+ * paragraph) shall be included in all copies or substantial portions of the
+ * Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ *
+ */
+
+#ifndef __MOCK_TIMELINE__
+#define __MOCK_TIMELINE__
+
+#include "../i915_gem_timeline.h"
+
+struct intel_timeline *mock_timeline(u64 context);
+void mock_timeline_destroy(struct intel_timeline *tl);
+
+#endif /* !__MOCK_TIMELINE__ */
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 95+ messages in thread

* ✓ Fi.CI.BAT: success for series starting with [01/27] drm/i915/selftests: Allocate inode/file dynamically (rev6)
  2017-04-19  9:41 Confluence of eb + timeline improvements Chris Wilson
                   ` (29 preceding siblings ...)
  2017-04-28 14:31 ` ✓ Fi.CI.BAT: success for series starting with [01/27] drm/i915/selftests: Allocate inode/file dynamically (rev5) Patchwork
@ 2017-04-28 19:22 ` Patchwork
  30 siblings, 0 replies; 95+ messages in thread
From: Patchwork @ 2017-04-28 19:22 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: series starting with [01/27] drm/i915/selftests: Allocate inode/file dynamically (rev6)
URL   : https://patchwork.freedesktop.org/series/23227/
State : success

== Summary ==

Series 23227v6 Series without cover letter
https://patchwork.freedesktop.org/api/1.0/series/23227/revisions/6/mbox/

Test gem_exec_flush:
        Subgroup basic-batch-kernel-default-uc:
                pass       -> FAIL       (fi-snb-2600) fdo#100007
Test gem_exec_suspend:
        Subgroup basic-s4-devices:
                pass       -> DMESG-WARN (fi-kbl-7560u) fdo#100125

fdo#100007 https://bugs.freedesktop.org/show_bug.cgi?id=100007
fdo#100125 https://bugs.freedesktop.org/show_bug.cgi?id=100125

fi-bdw-5557u     total:278  pass:267  dwarn:0   dfail:0   fail:0   skip:11  time:431s
fi-bdw-gvtdvm    total:278  pass:256  dwarn:8   dfail:0   fail:0   skip:14  time:429s
fi-bsw-n3050     total:278  pass:242  dwarn:0   dfail:0   fail:0   skip:36  time:571s
fi-bxt-j4205     total:278  pass:259  dwarn:0   dfail:0   fail:0   skip:19  time:514s
fi-bxt-t5700     total:278  pass:258  dwarn:0   dfail:0   fail:0   skip:20  time:541s
fi-byt-j1900     total:278  pass:254  dwarn:0   dfail:0   fail:0   skip:24  time:489s
fi-byt-n2820     total:278  pass:250  dwarn:0   dfail:0   fail:0   skip:28  time:486s
fi-hsw-4770      total:278  pass:262  dwarn:0   dfail:0   fail:0   skip:16  time:413s
fi-hsw-4770r     total:278  pass:262  dwarn:0   dfail:0   fail:0   skip:16  time:402s
fi-ilk-650       total:278  pass:228  dwarn:0   dfail:0   fail:0   skip:50  time:412s
fi-ivb-3520m     total:278  pass:260  dwarn:0   dfail:0   fail:0   skip:18  time:482s
fi-ivb-3770      total:278  pass:260  dwarn:0   dfail:0   fail:0   skip:18  time:485s
fi-kbl-7500u     total:278  pass:260  dwarn:0   dfail:0   fail:0   skip:18  time:459s
fi-kbl-7560u     total:278  pass:267  dwarn:1   dfail:0   fail:0   skip:10  time:573s
fi-skl-6260u     total:278  pass:268  dwarn:0   dfail:0   fail:0   skip:10  time:453s
fi-skl-6700hq    total:278  pass:261  dwarn:0   dfail:0   fail:0   skip:17  time:576s
fi-skl-6700k     total:278  pass:256  dwarn:4   dfail:0   fail:0   skip:18  time:462s
fi-skl-6770hq    total:278  pass:268  dwarn:0   dfail:0   fail:0   skip:10  time:496s
fi-skl-gvtdvm    total:278  pass:265  dwarn:0   dfail:0   fail:0   skip:13  time:431s
fi-snb-2520m     total:278  pass:250  dwarn:0   dfail:0   fail:0   skip:28  time:535s
fi-snb-2600      total:278  pass:248  dwarn:0   dfail:0   fail:1   skip:29  time:402s

1d490e4b6d5324cfbf8dc800cf4a99471252802c drm-tip: 2017y-04m-28d-14h-14m-47s UTC integration manifest
a004fb4 drm/i915: Redefine ptr_pack_bits() and friends
78596713 drm/i915: Make ptr_unpack_bits() more function-like
a1f6541 drm/i915: Lift timeline ordering to await_dma_fence
c37268d drm/i915: Mark up clflushes as belonging to an unordered timeline
3f480d8 drm/i915: Mark CPU cache as dirty on every transition for CPU writes

== Logs ==

For more details see: https://intel-gfx-ci.01.org/CI/Patchwork_4583/
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v14] drm/i915: Squash repeated awaits on the same fence
  2017-04-28 19:02           ` [PATCH v14] " Chris Wilson
@ 2017-05-02 12:24             ` Tvrtko Ursulin
  2017-05-02 14:45               ` Chris Wilson
  2017-05-02 14:50               ` Chris Wilson
  0 siblings, 2 replies; 95+ messages in thread
From: Tvrtko Ursulin @ 2017-05-02 12:24 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 28/04/2017 20:02, Chris Wilson wrote:
> Track the latest fence waited upon on each context, and only add a new
> asynchronous wait if the new fence is more recent than the recorded
> fence for that context. This requires us to filter out unordered
> timelines, which are noted by DMA_FENCE_NO_CONTEXT. However, in the
> absence of a universal identifier, we have to use our own
> i915->mm.unordered_timeline token.
>
> v2: Throw around the debug crutches
> v3: Inline the likely case of the pre-allocation cache being full.
> v4: Drop the pre-allocation support, we can lose the most recent fence
> in case of allocation failure -- it just means we may emit more awaits
> than strictly necessary but will not break.
> v5: Trim allocation size for leaf nodes, they only need an array of u32
> not pointers.
> v6: Create mock_timeline to tidy selftest writing
> v7: s/intel_timeline_sync_get/intel_timeline_sync_is_later/ (Tvrtko)
> v8: Prune the stale sync points when we idle.
> v9: Include a small benchmark in the kselftests
> v10: Separate the idr implementation into its own compartment. (Tvrtko)
> v11: Refactor igt_sync kselftests to avoid deep nesting (Tvrtko)
> v12: __sync_leaf_idx() to assert that p->height is 0 when checking leaves
> v13: kselftests to investigate struct i915_syncmap itself (Tvrtko)
> v14: Foray into ascii art graphs
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> ---
>  drivers/gpu/drm/i915/Makefile                      |   1 +
>  drivers/gpu/drm/i915/i915_gem.c                    |   1 +
>  drivers/gpu/drm/i915/i915_gem.h                    |   2 +
>  drivers/gpu/drm/i915/i915_gem_request.c            |   9 +
>  drivers/gpu/drm/i915/i915_gem_timeline.c           |  93 +++-
>  drivers/gpu/drm/i915/i915_gem_timeline.h           |  38 ++
>  drivers/gpu/drm/i915/i915_syncmap.c                | 419 ++++++++++++++
>  drivers/gpu/drm/i915/i915_syncmap.h                |  39 ++
>  drivers/gpu/drm/i915/selftests/i915_gem_timeline.c | 272 +++++++++
>  .../gpu/drm/i915/selftests/i915_mock_selftests.h   |   2 +
>  drivers/gpu/drm/i915/selftests/i915_random.c       |  11 +
>  drivers/gpu/drm/i915/selftests/i915_random.h       |   2 +
>  drivers/gpu/drm/i915/selftests/i915_syncmap.c      | 609 +++++++++++++++++++++
>  drivers/gpu/drm/i915/selftests/mock_timeline.c     |  45 ++
>  drivers/gpu/drm/i915/selftests/mock_timeline.h     |  33 ++
>  15 files changed, 1558 insertions(+), 18 deletions(-)
>  create mode 100644 drivers/gpu/drm/i915/i915_syncmap.c
>  create mode 100644 drivers/gpu/drm/i915/i915_syncmap.h
>  create mode 100644 drivers/gpu/drm/i915/selftests/i915_gem_timeline.c
>  create mode 100644 drivers/gpu/drm/i915/selftests/i915_syncmap.c
>  create mode 100644 drivers/gpu/drm/i915/selftests/mock_timeline.c
>  create mode 100644 drivers/gpu/drm/i915/selftests/mock_timeline.h
>
> diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
> index 2cf04504e494..7b05fb802f4c 100644
> --- a/drivers/gpu/drm/i915/Makefile
> +++ b/drivers/gpu/drm/i915/Makefile
> @@ -16,6 +16,7 @@ i915-y := i915_drv.o \
>  	  i915_params.o \
>  	  i915_pci.o \
>            i915_suspend.o \
> +	  i915_syncmap.o \
>  	  i915_sw_fence.o \
>  	  i915_sysfs.o \
>  	  intel_csr.o \
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index a7da9cdf6c39..0f8046e0a63c 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -3215,6 +3215,7 @@ i915_gem_idle_work_handler(struct work_struct *work)
>  		intel_engine_disarm_breadcrumbs(engine);
>  		i915_gem_batch_pool_fini(&engine->batch_pool);
>  	}
> +	i915_gem_timelines_mark_idle(dev_priv);
>
>  	GEM_BUG_ON(!dev_priv->gt.awake);
>  	dev_priv->gt.awake = false;
> diff --git a/drivers/gpu/drm/i915/i915_gem.h b/drivers/gpu/drm/i915/i915_gem.h
> index 5a49487368ca..ee54597465b6 100644
> --- a/drivers/gpu/drm/i915/i915_gem.h
> +++ b/drivers/gpu/drm/i915/i915_gem.h
> @@ -25,6 +25,8 @@
>  #ifndef __I915_GEM_H__
>  #define __I915_GEM_H__
>
> +#include <linux/bug.h>
> +
>  #ifdef CONFIG_DRM_I915_DEBUG_GEM
>  #define GEM_BUG_ON(expr) BUG_ON(expr)
>  #define GEM_WARN_ON(expr) WARN_ON(expr)
> diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
> index 022f5588d906..637b8cddf988 100644
> --- a/drivers/gpu/drm/i915/i915_gem_request.c
> +++ b/drivers/gpu/drm/i915/i915_gem_request.c
> @@ -773,6 +773,11 @@ i915_gem_request_await_dma_fence(struct drm_i915_gem_request *req,
>  		if (fence->context == req->fence.context)
>  			continue;
>
> +		/* Squash repeated waits to the same timelines */
> +		if (fence->context != req->i915->mm.unordered_timeline &&
> +		    intel_timeline_sync_is_later(req->timeline, fence))
> +			continue;
> +
>  		if (dma_fence_is_i915(fence))
>  			ret = i915_gem_request_await_request(req,
>  							     to_request(fence));
> @@ -782,6 +787,10 @@ i915_gem_request_await_dma_fence(struct drm_i915_gem_request *req,
>  							    GFP_KERNEL);
>  		if (ret < 0)
>  			return ret;
> +
> +		/* Record the latest fence used against each timeline */
> +		if (fence->context != req->i915->mm.unordered_timeline)
> +			intel_timeline_sync_set(req->timeline, fence);
>  	} while (--nchild);
>
>  	return 0;
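
For readability, the per-fence flow after this change is roughly the following
(a sketch of the loop body only, not a literal excerpt; emit_await() is a
hypothetical stand-in for the existing i915-request / dma-fence await calls):

	if (fence->context == req->fence.context)
		continue; /* same timeline, ordering is implicit */

	/* Squash: skip if we already await an equal or later seqno */
	if (fence->context != req->i915->mm.unordered_timeline &&
	    intel_timeline_sync_is_later(req->timeline, fence))
		continue;

	ret = emit_await(req, fence); /* hypothetical stand-in */
	if (ret < 0)
		return ret;

	/* Remember the latest seqno awaited on this context */
	if (fence->context != req->i915->mm.unordered_timeline)
		intel_timeline_sync_set(req->timeline, fence);
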
> diff --git a/drivers/gpu/drm/i915/i915_gem_timeline.c b/drivers/gpu/drm/i915/i915_gem_timeline.c
> index b596ca7ee058..f271e93310fb 100644
> --- a/drivers/gpu/drm/i915/i915_gem_timeline.c
> +++ b/drivers/gpu/drm/i915/i915_gem_timeline.c
> @@ -23,6 +23,32 @@
>   */
>
>  #include "i915_drv.h"
> +#include "i915_syncmap.h"
> +
> +static void __intel_timeline_init(struct intel_timeline *tl,
> +				  struct i915_gem_timeline *parent,
> +				  u64 context,
> +				  struct lock_class_key *lockclass,
> +				  const char *lockname)
> +{
> +	tl->fence_context = context;
> +	tl->common = parent;
> +#ifdef CONFIG_DEBUG_SPINLOCK
> +	__raw_spin_lock_init(&tl->lock.rlock, lockname, lockclass);
> +#else
> +	spin_lock_init(&tl->lock);
> +#endif
> +	init_request_active(&tl->last_request, NULL);
> +	INIT_LIST_HEAD(&tl->requests);
> +	i915_syncmap_init(&tl->sync);
> +}
> +
> +static void __intel_timeline_fini(struct intel_timeline *tl)
> +{
> +	GEM_BUG_ON(!list_empty(&tl->requests));
> +
> +	i915_syncmap_free(&tl->sync);
> +}
>
>  static int __i915_gem_timeline_init(struct drm_i915_private *i915,
>  				    struct i915_gem_timeline *timeline,
> @@ -35,6 +61,12 @@ static int __i915_gem_timeline_init(struct drm_i915_private *i915,
>
>  	lockdep_assert_held(&i915->drm.struct_mutex);
>
> +	/*
> +	 * Ideally we want a set of engines on a single leaf as we expect
> +	 * to mostly be tracking synchronisation between engines.
> +	 */
> +	BUILD_BUG_ON(KSYNCMAP < I915_NUM_ENGINES);
> +
>  	timeline->i915 = i915;
>  	timeline->name = kstrdup(name ?: "[kernel]", GFP_KERNEL);
>  	if (!timeline->name)
> @@ -44,19 +76,10 @@ static int __i915_gem_timeline_init(struct drm_i915_private *i915,
>
>  	/* Called during early_init before we know how many engines there are */
>  	fences = dma_fence_context_alloc(ARRAY_SIZE(timeline->engine));
> -	for (i = 0; i < ARRAY_SIZE(timeline->engine); i++) {
> -		struct intel_timeline *tl = &timeline->engine[i];
> -
> -		tl->fence_context = fences++;
> -		tl->common = timeline;
> -#ifdef CONFIG_DEBUG_SPINLOCK
> -		__raw_spin_lock_init(&tl->lock.rlock, lockname, lockclass);
> -#else
> -		spin_lock_init(&tl->lock);
> -#endif
> -		init_request_active(&tl->last_request, NULL);
> -		INIT_LIST_HEAD(&tl->requests);
> -	}
> +	for (i = 0; i < ARRAY_SIZE(timeline->engine); i++)
> +		__intel_timeline_init(&timeline->engine[i],
> +				      timeline, fences++,
> +				      lockclass, lockname);
>
>  	return 0;
>  }
> @@ -81,18 +104,52 @@ int i915_gem_timeline_init__global(struct drm_i915_private *i915)
>  					&class, "&global_timeline->lock");
>  }
>
> +/**
> + * i915_gem_timelines_mark_idle -- called when the driver idles
> + * @i915 - the drm_i915_private device
> + *
> + * When the driver is completely idle, we know that all of our sync points
> + * have been signaled and our tracking is then entirely redundant. Any request
> + * to wait upon an older sync point will be completed instantly as we know
> + * the fence is signaled and therefore we will not even look it up in the
> + * sync point map.
> + */
> +void i915_gem_timelines_mark_idle(struct drm_i915_private *i915)
> +{
> +	struct i915_gem_timeline *timeline;
> +	int i;
> +
> +	lockdep_assert_held(&i915->drm.struct_mutex);
> +
> +	list_for_each_entry(timeline, &i915->gt.timelines, link) {
> +		for (i = 0; i < ARRAY_SIZE(timeline->engine); i++) {
> +			struct intel_timeline *tl = &timeline->engine[i];
> +
> +			/*
> +			 * All known fences are completed so we can scrap
> +			 * the current sync point tracking and start afresh,
> +			 * any attempt to wait upon a previous sync point
> +			 * will be skipped as the fence was signaled.
> +			 */
> +			i915_syncmap_free(&tl->sync);
> +		}
> +	}
> +}
> +
>  void i915_gem_timeline_fini(struct i915_gem_timeline *timeline)
>  {
>  	int i;
>
>  	lockdep_assert_held(&timeline->i915->drm.struct_mutex);
>
> -	for (i = 0; i < ARRAY_SIZE(timeline->engine); i++) {
> -		struct intel_timeline *tl = &timeline->engine[i];
> -
> -		GEM_BUG_ON(!list_empty(&tl->requests));
> -	}
> +	for (i = 0; i < ARRAY_SIZE(timeline->engine); i++)
> +		__intel_timeline_fini(&timeline->engine[i]);
>
>  	list_del(&timeline->link);
>  	kfree(timeline->name);
>  }
> +
> +#if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
> +#include "selftests/mock_timeline.c"
> +#include "selftests/i915_gem_timeline.c"
> +#endif
> diff --git a/drivers/gpu/drm/i915/i915_gem_timeline.h b/drivers/gpu/drm/i915/i915_gem_timeline.h
> index 6c53e14cab2a..82d59126eb60 100644
> --- a/drivers/gpu/drm/i915/i915_gem_timeline.h
> +++ b/drivers/gpu/drm/i915/i915_gem_timeline.h
> @@ -27,7 +27,9 @@
>
>  #include <linux/list.h>
>
> +#include "i915_utils.h"
>  #include "i915_gem_request.h"
> +#include "i915_syncmap.h"
>
>  struct i915_gem_timeline;
>
> @@ -55,6 +57,17 @@ struct intel_timeline {
>  	 * struct_mutex.
>  	 */
>  	struct i915_gem_active last_request;
> +
> +	/**
> +	 * We track the most recent seqno that we wait on in every context so
> +	 * that we only have to emit a new await and dependency on a more
> +	 * recent sync point. As the contexts may be executed out-of-order, we
> +	 * have to track each individually and cannot rely on an absolute
> +	 * global_seqno. When we know that all tracked fences are completed
> +	 * (i.e. when the driver is idle), we know that the syncmap is
> +	 * redundant and we can discard it without loss of generality.
> +	 */
> +	struct i915_syncmap *sync;
>  	u32 sync_seqno[I915_NUM_ENGINES];
>
>  	struct i915_gem_timeline *common;
> @@ -73,6 +86,31 @@ int i915_gem_timeline_init(struct drm_i915_private *i915,
>  			   struct i915_gem_timeline *tl,
>  			   const char *name);
>  int i915_gem_timeline_init__global(struct drm_i915_private *i915);
> +void i915_gem_timelines_mark_idle(struct drm_i915_private *i915);
>  void i915_gem_timeline_fini(struct i915_gem_timeline *tl);
>
> +static inline int __intel_timeline_sync_set(struct intel_timeline *tl,
> +					    u64 context, u32 seqno)
> +{
> +	return i915_syncmap_set(&tl->sync, context, seqno);
> +}
> +
> +static inline int intel_timeline_sync_set(struct intel_timeline *tl,
> +					  const struct dma_fence *fence)
> +{
> +	return __intel_timeline_sync_set(tl, fence->context, fence->seqno);
> +}
> +
> +static inline bool __intel_timeline_sync_is_later(struct intel_timeline *tl,
> +						  u64 context, u32 seqno)
> +{
> +	return i915_syncmap_is_later(&tl->sync, context, seqno);
> +}
> +
> +static inline bool intel_timeline_sync_is_later(struct intel_timeline *tl,
> +						const struct dma_fence *fence)
> +{
> +	return __intel_timeline_sync_is_later(tl, fence->context, fence->seqno);
> +}
> +
>  #endif
> diff --git a/drivers/gpu/drm/i915/i915_syncmap.c b/drivers/gpu/drm/i915/i915_syncmap.c
> new file mode 100644
> index 000000000000..8748dc50b3fd
> --- /dev/null
> +++ b/drivers/gpu/drm/i915/i915_syncmap.c
> @@ -0,0 +1,419 @@
> +/*
> + * Copyright © 2017 Intel Corporation
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a
> + * copy of this software and associated documentation files (the "Software"),
> + * to deal in the Software without restriction, including without limitation
> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> + * and/or sell copies of the Software, and to permit persons to whom the
> + * Software is furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice (including the next
> + * paragraph) shall be included in all copies or substantial portions of the
> + * Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
> + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
> + * IN THE SOFTWARE.
> + *
> + */
> +
> +#include <linux/slab.h>
> +
> +#include "i915_syncmap.h"
> +
> +#include "i915_gem.h" /* GEM_BUG_ON() */
> +#include "i915_selftest.h"
> +
> +#define SHIFT ilog2(KSYNCMAP)
> +#define MASK (KSYNCMAP - 1)
> +
> +/*
> + * struct i915_syncmap is a layer of a radixtree that maps a u64 fence
> + * context id to the last u32 fence seqno waited upon from that context.
> + * Unlike lib/radixtree it uses a parent pointer that allows traversal back to
> + * the root. This allows us to access the whole tree via a single pointer
> + * to the most recently used layer. We expect fence contexts to be dense
> + * and most reuse to be on the same i915_gem_context but on neighbouring
> + * engines (i.e. on adjacent contexts) and reuse the same leaf, a very
> + * effective lookup cache. If the new lookup is not on the same leaf, we
> + * expect it to be on the neighbouring branch.
> + *
> + * A leaf holds an array of u32 seqno, and has height 0. The bitmap field
> + * allows us to store whether a particular seqno is valid (i.e. allows us
> + * to distinguish unset from 0).
> + *
> + * A branch holds an array of layer pointers, and has height > 0, and always
> + * has at least 2 layers (either branches or leaves) below it.
> + *
> + * For example,
> + * 	i915_syncmap_set(&sync, 0, 0);
> + *	i915_syncmap_set(&sync, 1, 1);
> + *	i915_syncmap_set(&sync, 2, 2);
> + *	i915_syncmap_set(&sync, 0x10, 0x10);
> + *	i915_syncmap_set(&sync, 0x11, 0x11);
> + *	i915_syncmap_set(&sync, 0x200, 0x200);
> + *	i915_syncmap_set(&sync, 0x201, 0x201);
> + *	i915_syncmap_set(&sync, 0x500000, 0x500000);
> + *	i915_syncmap_set(&sync, 0x500001, 0x500001);
> + *	i915_syncmap_set(&sync, 0x503000, 0x503000);
> + *	i915_syncmap_set(&sync, 0x503001, 0x503001);
> + *	i915_syncmap_set(&sync, 0xeull << 60 | 0xe, 0xe);
> + * will build a tree like:
> + *	0xffffffffffffffff
> + *	0-> 0x0000000000ffffff
> + *	|   0-> 0x0000000000000fff
> + *	|   |   0-> 0x00000000000000ff
> + *	|   |   |   0-> 0x000000000000000f 0:0, 1:1, 2:2
> + *	|   |   |   1-> 0x000000000000001f 0:10, 1:11
> + *	|   |   2-> 0x000000000000020f 0:200, 1:201
> + *	|   5-> 0x000000000050ffff
> + *	|   |   0-> 0x000000000050000f 0:500000, 1:500001
> + *	|   |   3-> 0x000000000050300f 0:503000, 1:503001
> + *	e-> 0xe00000000000000f e:e
> + */
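
As a cross-check of the indexing in the example above (not part of the patch,
just the arithmetic with KSYNCMAP=16, i.e. SHIFT=4 and MASK=0xf), a tiny
standalone userspace snippet:

	#include <assert.h>
	#include <stdint.h>

	int main(void)
	{
		uint64_t id = 0x503001;

		/* leaf (height 0): prefix 0x50300, slot 1 -> "1:503001" */
		assert((id >> 4) == 0x50300);
		assert((id & 0xf) == 0x1);

		/* parent branch at height 12 (covers 0x500000..0x50ffff):
		 * prefix 0x50, child slot 3 -> the "3->" entry above
		 */
		assert((id >> 12 >> 4) == 0x50);
		assert(((id >> 12) & 0xf) == 0x3);

		return 0;
	}
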
> +
> +struct i915_syncmap {
> +	u64 prefix;
> +	unsigned int height;
> +	unsigned int bitmap;
> +	struct i915_syncmap *parent;
> +	/*
> +	 * Following this header is an array of either seqno or child pointers:
> +	 * union {
> +	 *	u32 seqno[KSYNCMAP];
> +	 *	struct i915_syncmap *child[KSYNCMAP];
> +	 * };
> +	 */
> +};
> +
> +/**
> + * i915_syncmap_init -- initialise the #i915_syncmap
> + * @root - pointer to the #i915_syncmap
> + */
> +void i915_syncmap_init(struct i915_syncmap **root)
> +{
> +	BUILD_BUG_ON_NOT_POWER_OF_2(KSYNCMAP);
> +	BUILD_BUG_ON_NOT_POWER_OF_2(SHIFT);
> +	BUILD_BUG_ON(KSYNCMAP > BITS_PER_BYTE * sizeof((*root)->bitmap));
> +	*root = NULL;
> +}
> +
> +static inline u32 *__sync_seqno(struct i915_syncmap *p)
> +{
> +	GEM_BUG_ON(p->height);
> +	return (u32 *)(p + 1);
> +}
> +
> +static inline struct i915_syncmap **__sync_child(struct i915_syncmap *p)
> +{
> +	GEM_BUG_ON(!p->height);
> +	return (struct i915_syncmap **)(p + 1);
> +}
> +
> +static inline unsigned int
> +__sync_branch_idx(const struct i915_syncmap *p, u64 id)
> +{
> +	return (id >> p->height) & MASK;
> +}
> +
> +static inline unsigned int
> +__sync_leaf_idx(const struct i915_syncmap *p, u64 id)
> +{
> +	GEM_BUG_ON(p->height);
> +	return id & MASK;
> +}
> +
> +static inline u64 __sync_branch_prefix(const struct i915_syncmap *p, u64 id)
> +{
> +	return id >> p->height >> SHIFT;
> +}
> +
> +static inline u64 __sync_leaf_prefix(const struct i915_syncmap *p, u64 id)
> +{
> +	GEM_BUG_ON(p->height);
> +	return id >> SHIFT;
> +}
> +
> +static inline bool seqno_later(u32 a, u32 b)
> +{
> +	return (s32)(a - b) >= 0;
> +}
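
The wrap handling here is the usual signed-difference trick; a standalone
sanity check (illustrative only, not part of the patch):

	#include <assert.h>
	#include <stdint.h>

	static int seqno_later(uint32_t a, uint32_t b)
	{
		return (int32_t)(a - b) >= 0;
	}

	int main(void)
	{
		assert(seqno_later(2, 1));            /* plainly later */
		assert(seqno_later(5, 5));            /* equal counts as later */
		assert(seqno_later(1, 0xffffffffu));  /* later across the wrap */
		assert(!seqno_later(0xffffffffu, 1)); /* but not the other way */
		return 0;
	}
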
> +
> +/**
> + * i915_syncmap_is_later -- compare against the last known sync point
> + * @root - pointer to the #i915_syncmap
> + * @id - the context id (other timeline) we are synchronising to
> + * @seqno - the sequence number along the other timeline
> + *
> + * If we have already synchronised this @root with another (@id) then we can
> + * omit any repeated or earlier synchronisation requests. If the two timelines
> + * are already coupled, we can also omit the dependency between the two as that
> + * is already known via the timeline.
> + *
> + * Returns true if the two timelines are already synchronised wrt @seqno,
> + * false if not and the synchronisation must be emitted.
> + */
> +bool i915_syncmap_is_later(struct i915_syncmap **root, u64 id, u32 seqno)
> +{
> +	struct i915_syncmap *p;
> +	unsigned int idx;
> +
> +	p = *root;
> +	if (!p)
> +		return false;
> +
> +	if (likely(__sync_leaf_prefix(p, id) == p->prefix))
> +		goto found;
> +
> +	/* First climb the tree back to a parent branch */
> +	do {
> +		p = p->parent;
> +		if (!p)
> +			return false;
> +
> +		if (__sync_branch_prefix(p, id) == p->prefix)
> +			break;
> +	} while (1);
> +
> +	/* And then descend again until we find our leaf */
> +	do {
> +		if (!p->height)
> +			break;
> +
> +		p = __sync_child(p)[__sync_branch_idx(p, id)];
> +		if (!p)
> +			return false;
> +
> +		if (__sync_branch_prefix(p, id) != p->prefix)
> +			return false;
> +	} while (1);
> +
> +	*root = p;
> +found:
> +	idx = __sync_leaf_idx(p, id);
> +	if (!(p->bitmap & BIT(idx)))
> +		return false;
> +
> +	return seqno_later(__sync_seqno(p)[idx], seqno);
> +}
> +
> +static struct i915_syncmap *
> +__sync_alloc_leaf(struct i915_syncmap *parent, u64 id)
> +{
> +	struct i915_syncmap *p;
> +
> +	p = kmalloc(sizeof(*p) + KSYNCMAP * sizeof(u32), GFP_KERNEL);
> +	if (unlikely(!p))
> +		return NULL;
> +
> +	p->parent = parent;
> +	p->height = 0;
> +	p->bitmap = 0;
> +	p->prefix = __sync_leaf_prefix(p, id);
> +	return p;
> +}
> +
> +static inline void __sync_set_seqno(struct i915_syncmap *p, u64 id, u32 seqno)
> +{
> +	unsigned int idx = __sync_leaf_idx(p, id);
> +
> +	__sync_seqno(p)[idx] = seqno;
> +	p->bitmap |= BIT(idx);
> +}
> +
> +static inline void __sync_set_child(struct i915_syncmap *p,
> +				    unsigned int idx,
> +				    struct i915_syncmap *child)
> +{
> +	__sync_child(p)[idx] = child;
> +	p->bitmap |= BIT(idx);
> +}
> +
> +static noinline int __sync_set(struct i915_syncmap **root, u64 id, u32 seqno)
> +{
> +	struct i915_syncmap *p = *root;
> +	unsigned int idx;
> +
> +	if (!p) {
> +		p = __sync_alloc_leaf(NULL, id);
> +		if (unlikely(!p))
> +			return -ENOMEM;
> +
> +		goto found;
> +	}
> +
> +	/* Caller handled the likely cached case */
> +	GEM_BUG_ON(__sync_leaf_prefix(p, id) == p->prefix);
> +
> +	/* Climb back up the tree until we find a common prefix */
> +	do {
> +		if (!p->parent)
> +			break;
> +
> +		p = p->parent;
> +
> +		if (__sync_branch_prefix(p, id) == p->prefix)
> +			break;
> +	} while (1);
> +
> +	/*
> +	 * No shortcut, we have to descend the tree to find the right layer
> +	 * containing this fence.
> +	 *
> +	 * Each layer in the tree holds 16 (KSYNCMAP) pointers, either fences
> +	 * or lower layers. Leaf nodes (height = 0) contain the fences, all
> +	 * other nodes (height > 0) are internal layers that point to a lower
> +	 * node. Each internal layer has at least 2 descendants.
> +	 *
> +	 * Starting at the top, we check whether the current prefix matches. If
> +	 * it doesn't, we have gone past our layer and need to insert a join
> +	 * into the tree, and a new leaf node as a descendant as well as the
> +	 * original layer.
> +	 *
> +	 * The matching prefix means we are still following the right branch
> +	 * of the tree. If it has height 0, we have found our leaf and just
> +	 * need to replace the fence slot with ourselves. If the height is
> +	 * not zero, our slot contains the next layer in the tree (unless
> +	 * it is empty, in which case we can add ourselves as a new leaf).
> +	 * As we descend the tree, the prefix grows (and the height decreases).
> +	 */
> +	do {
> +		struct i915_syncmap *next;
> +
> +		if (__sync_branch_prefix(p, id) != p->prefix) {
> +			unsigned int above;
> +
> +			/* Insert a join above the current layer */
> +			next = kzalloc(sizeof(*next) + KSYNCMAP * sizeof(next),
> +				       GFP_KERNEL);
> +			if (unlikely(!next))
> +				return -ENOMEM;
> +
> +			/* Compute the height at which these two diverge */
> +			above = fls64(__sync_branch_prefix(p, id) ^ p->prefix);
> +			above = round_up(above, SHIFT);
> +			next->height = above + p->height;
> +			next->prefix = __sync_branch_prefix(next, id);
> +
> +			/* Insert the join into the parent */
> +			if (p->parent) {
> +				idx = __sync_branch_idx(p->parent, id);
> +				__sync_child(p->parent)[idx] = next;
> +				GEM_BUG_ON(!(p->parent->bitmap & BIT(idx)));
> +			}
> +			next->parent = p->parent;
> +
> +			/* Compute the idx of the other branch, not our id! */
> +			idx = p->prefix >> (above - SHIFT) & MASK;
> +			__sync_set_child(next, idx, p);
> +			p->parent = next;
> +
> +			/* Ascend to the join */
> +			p = next;
> +		} else {
> +			if (!p->height)
> +				break;
> +		}
> +
> +		/* Descend into the next layer */
> +		GEM_BUG_ON(!p->height);
> +		idx = __sync_branch_idx(p, id);
> +		next = __sync_child(p)[idx];
> +		if (!next) {
> +			next = __sync_alloc_leaf(p, id);
> +			if (unlikely(!next))
> +				return -ENOMEM;
> +
> +			__sync_set_child(p, idx, next);
> +			p = next;
> +			break;
> +		}
> +
> +		p = next;
> +	} while (1);
> +
> +found:
> +	GEM_BUG_ON(p->prefix != __sync_leaf_prefix(p, id));
> +	__sync_set_seqno(p, id, seqno);
> +	*root = p;
> +	return 0;
> +}
> +
> +/**
> + * i915_syncmap_set -- mark the most recent syncpoint between contexts
> + * @root - pointer to the #i915_syncmap
> + * @id - the context id (other timeline) we have synchronised to
> + * @seqno - the sequence number along the other timeline
> + *
> + * When we synchronise this @root with another (@id), we also know that we have
> + * synchronised with all previous seqno along that timeline. If we then have
> + * a request to synchronise with the same seqno or older, we can omit it,
> + * see i915_syncmap_is_later().
> + *
> + * Returns 0 on success, or a negative error code.
> + */
> +int i915_syncmap_set(struct i915_syncmap **root, u64 id, u32 seqno)
> +{
> +	struct i915_syncmap *p = *root;
> +
> +	/*
> +	 * We expect to be called in sequence following a is_later(id), which
> +	 * should have preloaded the root for us.
> +	 */
> +	if (likely(p && __sync_leaf_prefix(p, id) == p->prefix)) {
> +		__sync_set_seqno(p, id, seqno);
> +		return 0;
> +	}
> +
> +	return __sync_set(root, id, seqno);
> +}
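
For reference, the intended call pattern around is_later()/set() (a sketch
only; record_await() is a hypothetical caller, and in the real request code
the await is emitted between the two calls):

	static int record_await(struct i915_syncmap **root, u64 id, u32 seqno)
	{
		/* is_later() filters redundant awaits and also preloads the
		 * root, so the following set() hits the likely() fast path.
		 */
		if (i915_syncmap_is_later(root, id, seqno))
			return 0;

		return i915_syncmap_set(root, id, seqno);
	}
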
> +
> +static void __sync_free(struct i915_syncmap *p)
> +{
> +	if (p->height) {
> +		unsigned int i;
> +
> +		while ((i = ffs(p->bitmap))) {
> +			p->bitmap &= ~0u << i;
> +			__sync_free(__sync_child(p)[i - 1]);
> +		}
> +	}
> +
> +	kfree(p);
> +}
> +
> +/**
> + * i915_syncmap_free -- free all memory associated with the syncmap
> + * @root - pointer to the #i915_syncmap
> + *
> + * Either when the timeline is to be freed and we no longer need the sync
> + * point tracking, or when the fences are all known to be signaled and the
> + * sync point tracking is redundant, we can free the #i915_syncmap to recover
> + * its allocations.
> + *
> + * Will reinitialise the @root pointer so that the #i915_syncmap is ready for
> + * reuse.
> + */
> +void i915_syncmap_free(struct i915_syncmap **root)
> +{
> +	struct i915_syncmap *p;
> +
> +	p = *root;
> +	if (!p)
> +		return;
> +
> +	while (p->parent)
> +		p = p->parent;
> +
> +	__sync_free(p);
> +	*root = NULL;
> +}
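
And a short lifecycle note (a sketch, assuming an intel_timeline tl as in this
patch): because free() re-initialises the root, idling can simply drop the
whole map and keep using it afterwards, which is what
i915_gem_timelines_mark_idle() does:

	i915_syncmap_free(&tl->sync);	/* tl->sync is NULL again */
	/* subsequent sync_set() calls rebuild the tree from scratch */
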
> +
> +#if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
> +#include "selftests/i915_syncmap.c"
> +#endif
> diff --git a/drivers/gpu/drm/i915/i915_syncmap.h b/drivers/gpu/drm/i915/i915_syncmap.h
> new file mode 100644
> index 000000000000..7ca827d812ae
> --- /dev/null
> +++ b/drivers/gpu/drm/i915/i915_syncmap.h
> @@ -0,0 +1,39 @@
> +/*
> + * Copyright © 2017 Intel Corporation
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a
> + * copy of this software and associated documentation files (the "Software"),
> + * to deal in the Software without restriction, including without limitation
> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> + * and/or sell copies of the Software, and to permit persons to whom the
> + * Software is furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice (including the next
> + * paragraph) shall be included in all copies or substantial portions of the
> + * Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
> + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
> + * IN THE SOFTWARE.
> + *
> + */
> +
> +#ifndef __I915_SYNCMAP_H__
> +#define __I915_SYNCMAP_H__
> +
> +#include <linux/types.h>
> +
> +struct i915_syncmap;
> +
> +void i915_syncmap_init(struct i915_syncmap **root);
> +bool i915_syncmap_is_later(struct i915_syncmap **root, u64 id, u32 seqno);
> +int i915_syncmap_set(struct i915_syncmap **root, u64 id, u32 seqno);
> +void i915_syncmap_free(struct i915_syncmap **root);
> +
> +#define KSYNCMAP 16
> +
> +#endif /* __I915_SYNCMAP_H__ */
> diff --git a/drivers/gpu/drm/i915/selftests/i915_gem_timeline.c b/drivers/gpu/drm/i915/selftests/i915_gem_timeline.c
> new file mode 100644
> index 000000000000..1a9f9cb57878
> --- /dev/null
> +++ b/drivers/gpu/drm/i915/selftests/i915_gem_timeline.c
> @@ -0,0 +1,272 @@
> +/*
> + * Copyright © 2017 Intel Corporation
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a
> + * copy of this software and associated documentation files (the "Software"),
> + * to deal in the Software without restriction, including without limitation
> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> + * and/or sell copies of the Software, and to permit persons to whom the
> + * Software is furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice (including the next
> + * paragraph) shall be included in all copies or substantial portions of the
> + * Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
> + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
> + * IN THE SOFTWARE.
> + *
> + */
> +
> +#include "../i915_selftest.h"
> +#include "i915_random.h"
> +
> +#include "mock_gem_device.h"
> +#include "mock_timeline.h"
> +
> +struct __igt_sync {
> +	const char *name;
> +	u32 seqno;
> +	bool expected;
> +	bool set;
> +};
> +
> +static int __igt_sync(struct intel_timeline *tl,
> +		      u64 ctx,
> +		      const struct __igt_sync *p,
> +		      const char *name)
> +{
> +	int ret;
> +
> +	if (__intel_timeline_sync_is_later(tl, ctx, p->seqno) != p->expected) {
> +		pr_err("%s: %s(ctx=%llu, seqno=%u) expected passed %s but failed\n",
> +		       name, p->name, ctx, p->seqno, yesno(p->expected));
> +		return -EINVAL;
> +	}
> +
> +	if (p->set) {
> +		ret = __intel_timeline_sync_set(tl, ctx, p->seqno);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	return 0;
> +}
> +
> +static int igt_sync(void *arg)
> +{
> +	const struct __igt_sync pass[] = {
> +		{ "unset", 0, false, false },
> +		{ "new", 0, false, true },
> +		{ "0a", 0, true, true },
> +		{ "1a", 1, false, true },
> +		{ "1b", 1, true, true },
> +		{ "0b", 0, true, false },
> +		{ "2a", 2, false, true },
> +		{ "4", 4, false, true },
> +		{ "INT_MAX", INT_MAX, false, true },
> +		{ "INT_MAX-1", INT_MAX-1, true, false },
> +		{ "INT_MAX+1", (u32)INT_MAX+1, false, true },
> +		{ "INT_MAX", INT_MAX, true, false },
> +		{ "UINT_MAX", UINT_MAX, false, true },
> +		{ "wrap", 0, false, true },
> +		{ "unwrap", UINT_MAX, true, false },
> +		{},
> +	}, *p;
> +	struct intel_timeline *tl;
> +	int order, offset;
> +	int ret;
> +
> +	tl = mock_timeline(0);
> +	if (!tl)
> +		return -ENOMEM;
> +
> +	for (p = pass; p->name; p++) {
> +		for (order = 1; order < 64; order++) {
> +			for (offset = -1; offset <= (order > 1); offset++) {
> +				u64 ctx = BIT_ULL(order) + offset;
> +
> +				ret = __igt_sync(tl, ctx, p, "1");
> +				if (ret)
> +					goto out;
> +			}
> +		}
> +	}
> +	mock_timeline_destroy(tl);
> +
> +	tl = mock_timeline(0);
> +	if (!tl)
> +		return -ENOMEM;
> +
> +	for (order = 1; order < 64; order++) {
> +		for (offset = -1; offset <= (order > 1); offset++) {
> +			u64 ctx = BIT_ULL(order) + offset;
> +
> +			for (p = pass; p->name; p++) {
> +				ret = __igt_sync(tl, ctx, p, "2");
> +				if (ret)
> +					goto out;
> +			}
> +		}
> +	}
> +
> +out:
> +	mock_timeline_destroy(tl);
> +	return ret;
> +}
> +
> +static unsigned int random_engine(struct rnd_state *rnd)
> +{
> +	return ((u64)prandom_u32_state(rnd) * I915_NUM_ENGINES) >> 32;
> +}
> +
> +static int bench_sync(void *arg)
> +{
> +#define M (1 << 20)
> +	struct rnd_state prng;
> +	struct intel_timeline *tl;
> +	unsigned long end_time, count;
> +	u64 prng32_1M;
> +	ktime_t kt;
> +
> +	tl = mock_timeline(0);
> +	if (!tl)
> +		return -ENOMEM;
> +
> +	prandom_seed_state(&prng, i915_selftest.random_seed);
> +	count = 0;
> +	kt = ktime_get();
> +	end_time = jiffies + HZ/10;
> +	do {
> +		u32 x;
> +
> +		WRITE_ONCE(x, prandom_u32_state(&prng));
> +
> +		count++;
> +	} while (!time_after(jiffies, end_time));
> +	kt = ktime_sub(ktime_get(), kt);
> +	pr_debug("%s: %lu random evaluations, %lluns/prng\n",
> +		__func__, count, (long long)div64_ul(ktime_to_ns(kt), count));
> +	prng32_1M = ktime_to_ns(kt) * M / count;
> +
> +	prandom_seed_state(&prng, i915_selftest.random_seed);
> +	count = 0;
> +	kt = ktime_get();
> +	end_time = jiffies + HZ/10;
> +	do {
> +		u64 id = prandom_u64_state(&prng);
> +
> +		__intel_timeline_sync_set(tl, id, 0);
> +		count++;
> +	} while (!time_after(jiffies, end_time));
> +	kt = ktime_sub(ktime_get(), kt);
> +	kt = ktime_sub_ns(kt, count * prng32_1M * 2 / M);
> +	pr_info("%s: %lu random insertions, %lluns/insert\n",
> +		__func__, count, (long long)div64_ul(ktime_to_ns(kt), count));
> +
> +	prandom_seed_state(&prng, i915_selftest.random_seed);
> +	end_time = count;
> +	kt = ktime_get();
> +	while (end_time--) {
> +		u64 id = prandom_u64_state(&prng);
> +
> +		if (!__intel_timeline_sync_is_later(tl, id, 0)) {
> +			mock_timeline_destroy(tl);
> +			pr_err("Lookup of %llu failed\n", id);
> +			return -EINVAL;
> +		}
> +	}
> +	kt = ktime_sub(ktime_get(), kt);
> +	kt = ktime_sub_ns(kt, count * prng32_1M * 2 / M);
> +	pr_info("%s: %lu random lookups, %lluns/lookup\n",
> +		__func__, count, (long long)div64_ul(ktime_to_ns(kt), count));
> +
> +	mock_timeline_destroy(tl);
> +
> +	tl = mock_timeline(0);
> +	if (!tl)
> +		return -ENOMEM;
> +
> +	count = 0;
> +	kt = ktime_get();
> +	end_time = jiffies + HZ/10;
> +	do {
> +		__intel_timeline_sync_set(tl, count++, 0);
> +	} while (!time_after(jiffies, end_time));
> +	kt = ktime_sub(ktime_get(), kt);
> +	pr_info("%s: %lu in-order insertions, %lluns/insert\n",
> +		__func__, count, (long long)div64_ul(ktime_to_ns(kt), count));
> +
> +	end_time = count;
> +	kt = ktime_get();
> +	while (end_time--) {
> +		if (!__intel_timeline_sync_is_later(tl, end_time, 0)) {
> +			pr_err("Lookup of %lu failed\n", end_time);
> +			mock_timeline_destroy(tl);
> +			return -EINVAL;
> +		}
> +	}
> +	kt = ktime_sub(ktime_get(), kt);
> +	pr_info("%s: %lu in-order lookups, %lluns/lookup\n",
> +		__func__, count, (long long)div64_ul(ktime_to_ns(kt), count));
> +
> +	mock_timeline_destroy(tl);
> +
> +	tl = mock_timeline(0);
> +	if (!tl)
> +		return -ENOMEM;
> +
> +	prandom_seed_state(&prng, i915_selftest.random_seed);
> +	count = 0;
> +	kt = ktime_get();
> +	end_time = jiffies + HZ/10;
> +	do {
> +		u32 id = random_engine(&prng);
> +		u32 seqno = prandom_u32_state(&prng);
> +
> +		if (!__intel_timeline_sync_is_later(tl, id, seqno))
> +			__intel_timeline_sync_set(tl, id, seqno);
> +
> +		count++;
> +	} while (!time_after(jiffies, end_time));
> +	kt = ktime_sub(ktime_get(), kt);
> +	kt = ktime_sub_ns(kt, count * prng32_1M / M);

Two prandom_u32 calls per iteration to account for here.
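Presumably the fix is just to mirror the normalisation of the earlier
prandom_u64_state() loops, i.e. something along the lines of (untested):

	kt = ktime_sub_ns(kt, count * prng32_1M * 2 / M);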

> +	pr_info("%s: %lu repeated insert/lookups, %lluns/op\n",
> +		__func__, count, (long long)div64_ul(ktime_to_ns(kt), count));
> +	mock_timeline_destroy(tl);
> +
> +	tl = mock_timeline(0);
> +	if (!tl)
> +		return -ENOMEM;
> +
> +	count = 0;
> +	kt = ktime_get();
> +	end_time = jiffies + HZ/10;
> +	do {
> +		if (!__intel_timeline_sync_is_later(tl, count & 7, count >> 4))
> +			__intel_timeline_sync_set(tl, count & 7, count >> 4);
> +
> +		count++;
> +	} while (!time_after(jiffies, end_time));
> +	kt = ktime_sub(ktime_get(), kt);
> +	pr_info("%s: %lu cyclic insert/lookups, %lluns/op\n",
> +		__func__, count, (long long)div64_ul(ktime_to_ns(kt), count));
> +	mock_timeline_destroy(tl);
> +
> +	return 0;
> +#undef M
> +}
> +
> +int i915_gem_timeline_mock_selftests(void)
> +{
> +	static const struct i915_subtest tests[] = {
> +		SUBTEST(igt_sync),
> +		SUBTEST(bench_sync),
> +	};
> +
> +	return i915_subtests(tests, NULL);
> +}
> diff --git a/drivers/gpu/drm/i915/selftests/i915_mock_selftests.h b/drivers/gpu/drm/i915/selftests/i915_mock_selftests.h
> index be9a9ebf5692..76c1f149a0a0 100644
> --- a/drivers/gpu/drm/i915/selftests/i915_mock_selftests.h
> +++ b/drivers/gpu/drm/i915/selftests/i915_mock_selftests.h
> @@ -10,8 +10,10 @@
>   */
>  selftest(sanitycheck, i915_mock_sanitycheck) /* keep first (igt selfcheck) */
>  selftest(scatterlist, scatterlist_mock_selftests)
> +selftest(syncmap, i915_syncmap_mock_selftests)
>  selftest(uncore, intel_uncore_mock_selftests)
>  selftest(breadcrumbs, intel_breadcrumbs_mock_selftests)
> +selftest(timelines, i915_gem_timeline_mock_selftests)
>  selftest(requests, i915_gem_request_mock_selftests)
>  selftest(objects, i915_gem_object_mock_selftests)
>  selftest(dmabuf, i915_gem_dmabuf_mock_selftests)
> diff --git a/drivers/gpu/drm/i915/selftests/i915_random.c b/drivers/gpu/drm/i915/selftests/i915_random.c
> index c17c83c30637..97796d3e3c9a 100644
> --- a/drivers/gpu/drm/i915/selftests/i915_random.c
> +++ b/drivers/gpu/drm/i915/selftests/i915_random.c
> @@ -30,6 +30,17 @@
>
>  #include "i915_random.h"
>
> +u64 prandom_u64_state(struct rnd_state *rnd)
> +{
> +	u64 x;
> +
> +	x = prandom_u32_state(rnd);
> +	x <<= 32;
> +	x |= prandom_u32_state(rnd);
> +
> +	return x;
> +}
> +
>  static inline u32 i915_prandom_u32_max_state(u32 ep_ro, struct rnd_state *state)
>  {
>  	return upper_32_bits((u64)prandom_u32_state(state) * ep_ro);
> diff --git a/drivers/gpu/drm/i915/selftests/i915_random.h b/drivers/gpu/drm/i915/selftests/i915_random.h
> index b9c334ce6cd9..0c65b87194ce 100644
> --- a/drivers/gpu/drm/i915/selftests/i915_random.h
> +++ b/drivers/gpu/drm/i915/selftests/i915_random.h
> @@ -41,6 +41,8 @@
>  #define I915_RND_SUBSTATE(name__, parent__) \
>  	struct rnd_state name__ = I915_RND_STATE_INITIALIZER(prandom_u32_state(&(parent__)))
>
> +u64 prandom_u64_state(struct rnd_state *rnd);
> +
>  unsigned int *i915_random_order(unsigned int count,
>  				struct rnd_state *state);
>  void i915_random_reorder(unsigned int *order,
> diff --git a/drivers/gpu/drm/i915/selftests/i915_syncmap.c b/drivers/gpu/drm/i915/selftests/i915_syncmap.c
> new file mode 100644
> index 000000000000..5f14fbfef0f4
> --- /dev/null
> +++ b/drivers/gpu/drm/i915/selftests/i915_syncmap.c
> @@ -0,0 +1,609 @@
> +/*
> + * Copyright © 2017 Intel Corporation
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a
> + * copy of this software and associated documentation files (the "Software"),
> + * to deal in the Software without restriction, including without limitation
> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> + * and/or sell copies of the Software, and to permit persons to whom the
> + * Software is furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice (including the next
> + * paragraph) shall be included in all copies or substantial portions of the
> + * Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
> + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
> + * IN THE SOFTWARE.
> + *
> + */
> +
> +#include "../i915_selftest.h"
> +#include "i915_random.h"
> +
> +static char *
> +__sync_print(struct i915_syncmap *p,
> +	     char *buf, unsigned long *sz,
> +	     unsigned int depth,
> +	     unsigned int idx)
> +{
> +	unsigned long len;
> +	unsigned i, bits;
> +
> +	if (depth) {
> +		for (i = 0; i < depth - 1; i++) {
> +			len = snprintf(buf, *sz, "|   ");
> +			buf += len;
> +			*sz -= len;
> +		}
> +		len = snprintf(buf, *sz, "%x-> ", idx);
> +		buf += len;
> +		*sz -= len;
> +	}
> +
> +	if (p->height < 64 - SHIFT)
> +		len = snprintf(buf, *sz, "0x%016llx",
> +				(p->prefix << p->height << SHIFT) |
> +				(BIT_ULL(p->height + SHIFT) - 1));
> +	else
> +		len = snprintf(buf, *sz, "0x%016llx", U64_MAX);
> +	buf += len;
> +	*sz -= len;
> +
> +	if (!p->height) {
> +		for (bits = p->bitmap; (i = ffs(bits)); bits &= ~0u << i) {

Would for_each_set_bit be more readable?

> +			len = snprintf(buf, *sz, " %x:%x,",
> +				       i - 1, __sync_seqno(p)[i - 1]);
> +			buf += len;
> +			*sz -= len;
> +		}
> +		buf -= 1;
> +		*sz += 1;
> +	}
> +
> +	len = snprintf(buf, *sz, "\n");
> +	buf += len;
> +	*sz -= len;
> +
> +	if (p->height) {
> +		for (bits = p->bitmap; (i = ffs(bits)); bits &= ~0u << i)
> +			buf = __sync_print(__sync_child(p)[i - 1],
> +					   buf, sz, depth + 1, i - 1);
> +	}
> +
> +	return buf;
> +}
> +
> +static bool
> +i915_syncmap_print_to_buf(struct i915_syncmap *p, char *buf, unsigned long sz)
> +{
> +	if (!p)
> +		return false;
> +
> +	while (p->parent)
> +		p = p->parent;
> +
> +	__sync_print(p, buf, &sz, 0, false);
> +	return true;
> +}
> +
> +static int check_syncmap_free(struct i915_syncmap **sync)
> +{
> +	i915_syncmap_free(sync);
> +	if (*sync) {
> +		pr_err("sync not cleared after free\n");
> +		return -EINVAL;
> +	}
> +
> +	return 0;
> +}
> +
> +static int dump_syncmap(struct i915_syncmap *sync, int err)
> +{
> +	char *buf;
> +
> +	if (!err)
> +		return check_syncmap_free(&sync);
> +
> +	buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
> +	if (!buf)
> +		goto skip;
> +
> +	if (i915_syncmap_print_to_buf(sync, buf, PAGE_SIZE))
> +		pr_err("%s", buf);
> +
> +	kfree(buf);
> +
> +skip:
> +	i915_syncmap_free(&sync);
> +	return err;
> +}
> +
> +static int igt_syncmap_init(void *arg)
> +{
> +	struct i915_syncmap *sync = (void *)~0ul;
> +
> +	/*
> +	 * Cursory check that we can initialise a random pointer and transform
> +	 * it into the root pointer of a syncmap.
> +	 */
> +
> +	i915_syncmap_init(&sync);
> +	return check_syncmap_free(&sync);
> +}
> +
> +static int check_seqno(struct i915_syncmap *leaf, unsigned int idx, u32 seqno)
> +{
> +	if (leaf->height) {
> +		pr_err("%s: not a leaf, height is %d\n",
> +		       __func__, leaf->height);
> +		return -EINVAL;
> +	}
> +
> +	if (__sync_seqno(leaf)[idx] != seqno) {
> +		pr_err("%s: seqno[%d], found %x, expected %x\n",
> +		       __func__, idx, __sync_seqno(leaf)[idx], seqno);
> +		return -EINVAL;
> +	}
> +
> +	return 0;
> +}
> +
> +static int check_one(struct i915_syncmap **sync, u64 context, u32 seqno)
> +{
> +	int err;
> +
> +	err = i915_syncmap_set(sync, context, seqno);
> +	if (err)
> +		return err;
> +
> +	if ((*sync)->height) {
> +		pr_err("Inserting first context=%llx did not return leaf (height=%d, prefix=%llx\n",
> +		       context, (*sync)->height, (*sync)->prefix);
> +		return -EINVAL;
> +	}
> +
> +	if ((*sync)->parent) {
> +		pr_err("Inserting first context=%llx created branches!\n",
> +		       context);
> +		return -EINVAL;
> +	}
> +
> +	if (hweight32((*sync)->bitmap) != 1) {
> +		pr_err("First bitmap does not contain a single entry, found %x (count=%d)!\n",
> +		       (*sync)->bitmap, hweight32((*sync)->bitmap));
> +		return -EINVAL;
> +	}
> +
> +	err = check_seqno((*sync), ilog2((*sync)->bitmap), seqno);
> +	if (err)
> +		return err;
> +
> +	if (!i915_syncmap_is_later(sync, context, seqno)) {
> +		pr_err("Lookup of first context=%llx/seqno=%x failed!\n",
> +		       context, seqno);
> +		return -EINVAL;
> +	}
> +
> +	return 0;
> +}
> +
> +static int igt_syncmap_one(void *arg)
> +{
> +	I915_RND_STATE(prng);
> +	IGT_TIMEOUT(end_time);
> +	struct i915_syncmap *sync;
> +	unsigned long max = 1;
> +	int err;
> +
> +	/*
> +	 * Check that inserting a new id, creates a leaf and only that leaf.
> +	 */
> +
> +	i915_syncmap_init(&sync);
> +
> +	do {
> +		u64 context = prandom_u64_state(&prng);
> +		unsigned long loop;
> +
> +		err = check_syncmap_free(&sync);
> +		if (err)
> +			goto out;
> +
> +		for (loop = 0; loop <= max; loop++) {
> +			err = check_one(&sync, context,
> +					prandom_u32_state(&prng));
> +			if (err)
> +				goto out;
> +		}
> +		max++;
> +	} while (!__igt_timeout(end_time, NULL));
> +	pr_debug("%s: Completed %lu single insertions\n",
> +		__func__, max * (max - 1) / 2);
> +out:
> +	return dump_syncmap(sync, err);
> +}
> +
> +static int check_leaf(struct i915_syncmap **sync, u64 context, u32 seqno)
> +{
> +	int err;
> +
> +	err = i915_syncmap_set(sync, context, seqno);
> +	if (err)
> +		return err;
> +
> +	if ((*sync)->height) {
> +		pr_err("Inserting context=%llx did not return leaf (height=%d, prefix=%llx\n",
> +		       context, (*sync)->height, (*sync)->prefix);
> +		return -EINVAL;
> +	}
> +
> +	if (hweight32((*sync)->bitmap) != 1) {
> +		pr_err("First entry into leaf (context=%llx) does not contain a single entry, found %x (count=%d)!\n",
> +		       context, (*sync)->bitmap, hweight32((*sync)->bitmap));
> +		return -EINVAL;
> +	}
> +
> +	err = check_seqno((*sync), ilog2((*sync)->bitmap), seqno);
> +	if (err)
> +		return err;
> +
> +	if (!i915_syncmap_is_later(sync, context, seqno)) {
> +		pr_err("Lookup of first entry context=%llx/seqno=%x failed!\n",
> +		       context, seqno);
> +		return -EINVAL;
> +	}
> +
> +	return 0;
> +}
> +
> +static int igt_syncmap_join_above(void *arg)
> +{
> +	struct i915_syncmap *sync;
> +	unsigned int pass, order;
> +	int err;
> +
> +	i915_syncmap_init(&sync);
> +
> +	/*
> +	 * When we have a new id that doesn't fit inside the existing tree,
> +	 * we need to add a new layer above.
> +	 *
> +	 * 1: 0x00000001
> +	 * 2: 0x00000010
> +	 * 3: 0x00000100
> +	 * 4: 0x00001000
> +	 * ...
> +	 * Each pass the common prefix shrinks and we have to insert a join.
> +	 * Each join will only contain two branches, the latest of which
> +	 * is always a leaf.
> +	 *
> +	 * If we then reuse the same set of contexts, we expect to build an
> +	 * identical tree.
> +	 */
> +	for (pass = 0; pass < 3; pass++) {
> +		for (order = 0; order < 64; order += SHIFT) {
> +			u64 context = BIT_ULL(order);
> +			struct i915_syncmap *join;
> +
> +			err = check_leaf(&sync, context, 0);
> +			if (err)
> +				goto out;
> +
> +			join = sync->parent;
> +			if (!join) /* very first insert will have no parents */
> +				continue;
> +
> +			if (!join->height) {
> +				pr_err("Parent with no height!\n");
> +				err = -EINVAL;
> +				goto out;
> +			}
> +
> +			if (hweight32(join->bitmap) != 2) {
> +				pr_err("Join does not have 2 children: %x (%d)\n",
> +				       join->bitmap, hweight32(join->bitmap));
> +				err = -EINVAL;
> +				goto out;
> +			}
> +
> +			if (__sync_child(join)[__sync_branch_idx(join, context)] != sync) {
> +				pr_err("Leaf misplaced in parent!\n");
> +				err = -EINVAL;
> +				goto out;
> +			}
> +		}
> +	}
> +out:
> +	return dump_syncmap(sync, err);
> +}
> +
> +static int igt_syncmap_join_below(void *arg)
> +{
> +	struct i915_syncmap *sync;
> +	unsigned int step, order, idx;
> +	int err;
> +
> +	i915_syncmap_init(&sync);
> +
> +	/*
> +	 * Check that we can split a compacted branch by replacing it with
> +	 * a join.
> +	 */
> +	for (step = 0; step < KSYNCMAP; step++) {
> +		for (order = 64 - SHIFT; order > 0; order -= SHIFT) {
> +			u64 context = step*BIT_ULL(order);
> +
> +			err = i915_syncmap_set(&sync, context, 0);
> +			if (err)
> +				goto out;
> +
> +			if (sync->height) {
> +				pr_err("Inserting context=%llx (order=%d, step=%d) did not return leaf (height=%d, prefix=%llx\n",
> +				       context, order, step, sync->height, sync->prefix);
> +				err = -EINVAL;
> +				goto out;
> +			}
> +		}
> +	}
> +
> +	for (step = 0; step < KSYNCMAP; step++) {
> +		for (order = SHIFT; order < 64; order += SHIFT) {
> +			u64 context = step*BIT_ULL(order);
> +
> +			if (!i915_syncmap_is_later(&sync, context, 0)) {
> +				pr_err("1: context %llx (order=%d, step=%d) not found\n",
> +				       context, order, step);
> +				err = -EINVAL;
> +				goto out;
> +			}
> +
> +			for (idx = 1; idx < KSYNCMAP; idx++) {
> +				if (i915_syncmap_is_later(&sync, context + idx, 0)) {
> +					pr_err("1: context %llx (order=%d, step=%d) should not exist\n",
> +					       context + idx, order, step);
> +					err = -EINVAL;
> +					goto out;
> +				}
> +			}
> +		}
> +	}
> +
> +	for (order = SHIFT; order < 64; order += SHIFT) {
> +		for (step = 0; step < KSYNCMAP; step++) {
> +			u64 context = step*BIT_ULL(order);
> +
> +			if (!i915_syncmap_is_later(&sync, context, 0)) {
> +				pr_err("2: context %llx (order=%d, step=%d) not found\n",
> +				       context, order, step);
> +				err = -EINVAL;
> +				goto out;
> +			}
> +		}
> +	}
> +
> +out:
> +	return dump_syncmap(sync, err);
> +}
> +
> +static int igt_syncmap_neighbours(void *arg)
> +{
> +	I915_RND_STATE(prng);
> +	IGT_TIMEOUT(end_time);
> +	struct i915_syncmap *sync;
> +	int err;
> +
> +	/*
> +	 * Each leaf holds KSYNCMAP seqno. Check that when we create KSYNCMAP
> +	 * neighbouring ids, they all fit into the same leaf.
> +	 */
> +
> +	i915_syncmap_init(&sync);
> +	do {
> +		u64 context = prandom_u64_state(&prng) & ~MASK;
> +		unsigned int idx;
> +
> +		if (i915_syncmap_is_later(&sync, context, 0)) /* Skip repeats */
> +			continue;
> +
> +		for (idx = 0; idx < KSYNCMAP; idx++) {
> +			err = i915_syncmap_set(&sync, context + idx, 0);
> +			if (err)
> +				goto out;
> +
> +			if (sync->height) {
> +				pr_err("Inserting context=%llx did not return leaf (height=%d, prefix=%llx\n",
> +				       context, sync->height, sync->prefix);
> +				err = -EINVAL;
> +				goto out;
> +			}
> +
> +			if (sync->bitmap != BIT(idx + 1) - 1) {
> +				pr_err("Inserting neighbouring context=0x%llx+%d, did not fit into the same leaf bitmap=%x (%d), expected %lx (%d)\n",
> +				       context, idx,
> +				       sync->bitmap, hweight32(sync->bitmap),
> +				       BIT(idx + 1) - 1, idx + 1);
> +				err = -EINVAL;
> +				goto out;
> +			}
> +		}
> +	} while (!__igt_timeout(end_time, NULL));
> +out:
> +	return dump_syncmap(sync, err);
> +}
> +
> +static int igt_syncmap_compact(void *arg)
> +{
> +	struct i915_syncmap *sync;
> +	unsigned int idx, order;
> +	int err;
> +
> +	i915_syncmap_init(&sync);
> +
> +	/*
> +	 * The syncmap is a "space efficient" compressed radix tree - any
> +	 * branch with only one child is skipped and replaced by the child.
> +	 *
> +	 * If we construct a tree with ids that are neighbouring at a non-zero
> +	 * height, we form a join but each child of that join is directly a
> +	 * leaf holding the single id.
> +	 */
> +	for (order = SHIFT; order < 64; order += SHIFT) {
> +		err = check_syncmap_free(&sync);
> +		if (err)
> +			goto out;
> +
> +		/* Create neighbours in the parent */
> +		for (idx = 0; idx < KSYNCMAP; idx++) {
> +			u64 context = idx * BIT_ULL(order) + idx;
> +
> +			err = i915_syncmap_set(&sync, context, 0);
> +			if (err)
> +				goto out;
> +
> +			if (sync->height) {
> +				pr_err("Inserting context=%llx (order=%d, idx=%d) did not return leaf (height=%d, prefix=%llx\n",
> +				       context, order, idx,
> +				       sync->height, sync->prefix);
> +				err = -EINVAL;
> +				goto out;
> +			}
> +		}
> +
> +		sync = sync->parent;
> +		if (sync->parent) {
> +			pr_err("Parent (join) of last leaf was not the sync!\n");
> +			err = -EINVAL;
> +			goto out;
> +		}
> +
> +		if (sync->height != order) {
> +			pr_err("Join does not have the expected height, found %d, expected %d\n",
> +			       sync->height, order);
> +			err = -EINVAL;
> +			goto out;
> +		}
> +
> +		if (sync->bitmap != BIT(KSYNCMAP) - 1) {
> +			pr_err("Join is not full!, found %x (%d) expected %lx (%d)\n",
> +			       sync->bitmap, hweight32(sync->bitmap),
> +			       BIT(KSYNCMAP) - 1, KSYNCMAP);
> +			err = -EINVAL;
> +			goto out;
> +		}
> +
> +		/* Each of our children should be a leaf */
> +		for (idx = 0; idx < KSYNCMAP; idx++) {
> +			struct i915_syncmap *leaf = __sync_child(sync)[idx];
> +
> +			if (leaf->height) {
> +				pr_err("Child %d is not a leaf!\n", idx);
> +				err = -EINVAL;
> +				goto out;
> +			}
> +
> +			if (leaf->parent != sync) {
> +				pr_err("Child %d is not attached to us!\n",
> +				       idx);
> +				err = -EINVAL;
> +				goto out;
> +			}
> +
> +			if (!is_power_of_2(leaf->bitmap)) {
> +				pr_err("Child %d holds more than one id, found %x (%d)\n",
> +				       idx, leaf->bitmap, hweight32(leaf->bitmap));
> +				err = -EINVAL;
> +				goto out;
> +			}
> +
> +			if (leaf->bitmap != BIT(idx)) {
> +				pr_err("Child %d has wrong seqno idx, found %d, expected %d\n",
> +				       idx, ilog2(leaf->bitmap), idx);
> +				err = -EINVAL;
> +				goto out;
> +			}
> +		}
> +	}
> +out:
> +	return dump_syncmap(sync, err);
> +}
> +
> +static int igt_syncmap_random(void *arg)
> +{
> +	I915_RND_STATE(prng);
> +	IGT_TIMEOUT(end_time);
> +	struct i915_syncmap *sync;
> +	unsigned long count, phase, i;
> +	u32 seqno;
> +	int err;
> +
> +	i915_syncmap_init(&sync);
> +
> +	/*
> +	 * Having tried to test the individual operations within i915_syncmap,
> +	 * run a smoketest exploring the entire u64 space with random
> +	 * insertions.
> +	 */
> +
> +	count = 0;
> +	phase = jiffies + HZ/100 + 1;
> +	do {
> +		u64 context = prandom_u64_state(&prng);
> +
> +		err = i915_syncmap_set(&sync, context, 0);
> +		if (err)
> +			goto out;
> +
> +		count++;
> +	} while (!time_after(jiffies, phase));
> +	seqno = 0;
> +
> +	phase = 0;
> +	do {
> +		I915_RND_STATE(ctx);
> +		u32 last_seqno = seqno;
> +		bool expect;
> +
> +		seqno = prandom_u32_state(&prng);
> +		expect = seqno_later(last_seqno, seqno);
> +
> +		for (i = 0; i < count; i++) {
> +			u64 context = prandom_u64_state(&ctx);
> +
> +			if (i915_syncmap_is_later(&sync, context, seqno) != expect) {
> +				pr_err("context=%llu, last=%u this=%u did not match expectation (%d)\n",
> +				       context, last_seqno, seqno, expect);
> +				err = -EINVAL;
> +				goto out;
> +			}
> +
> +			err = i915_syncmap_set(&sync, context, seqno);
> +			if (err)
> +				goto out;
> +		}
> +
> +		phase++;
> +	} while (!__igt_timeout(end_time, NULL));
> +	pr_debug("Completed %lu passes, each of %lu contexts\n", phase, count);
> +out:
> +	return dump_syncmap(sync, err);
> +}
> +
> +int i915_syncmap_mock_selftests(void)
> +{
> +	static const struct i915_subtest tests[] = {
> +		SUBTEST(igt_syncmap_init),
> +		SUBTEST(igt_syncmap_one),
> +		SUBTEST(igt_syncmap_join_above),
> +		SUBTEST(igt_syncmap_join_below),
> +		SUBTEST(igt_syncmap_neighbours),
> +		SUBTEST(igt_syncmap_compact),
> +		SUBTEST(igt_syncmap_random),
> +	};
> +
> +	return i915_subtests(tests, NULL);
> +}
> diff --git a/drivers/gpu/drm/i915/selftests/mock_timeline.c b/drivers/gpu/drm/i915/selftests/mock_timeline.c
> new file mode 100644
> index 000000000000..47b1f47c5812
> --- /dev/null
> +++ b/drivers/gpu/drm/i915/selftests/mock_timeline.c
> @@ -0,0 +1,45 @@
> +/*
> + * Copyright © 2017 Intel Corporation
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a
> + * copy of this software and associated documentation files (the "Software"),
> + * to deal in the Software without restriction, including without limitation
> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> + * and/or sell copies of the Software, and to permit persons to whom the
> + * Software is furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice (including the next
> + * paragraph) shall be included in all copies or substantial portions of the
> + * Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
> + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
> + * IN THE SOFTWARE.
> + *
> + */
> +
> +#include "mock_timeline.h"
> +
> +struct intel_timeline *mock_timeline(u64 context)
> +{
> +	static struct lock_class_key class;
> +	struct intel_timeline *tl;
> +
> +	tl = kzalloc(sizeof(*tl), GFP_KERNEL);
> +	if (!tl)
> +		return NULL;
> +
> +	__intel_timeline_init(tl, NULL, context, &class, "mock");
> +
> +	return tl;
> +}
> +
> +void mock_timeline_destroy(struct intel_timeline *tl)
> +{
> +	__intel_timeline_fini(tl);
> +	kfree(tl);
> +}
> diff --git a/drivers/gpu/drm/i915/selftests/mock_timeline.h b/drivers/gpu/drm/i915/selftests/mock_timeline.h
> new file mode 100644
> index 000000000000..c27ff4639b8b
> --- /dev/null
> +++ b/drivers/gpu/drm/i915/selftests/mock_timeline.h
> @@ -0,0 +1,33 @@
> +/*
> + * Copyright © 2017 Intel Corporation
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a
> + * copy of this software and associated documentation files (the "Software"),
> + * to deal in the Software without restriction, including without limitation
> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> + * and/or sell copies of the Software, and to permit persons to whom the
> + * Software is furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice (including the next
> + * paragraph) shall be included in all copies or substantial portions of the
> + * Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
> + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
> + * IN THE SOFTWARE.
> + *
> + */
> +
> +#ifndef __MOCK_TIMELINE__
> +#define __MOCK_TIMELINE__
> +
> +#include "../i915_gem_timeline.h"
> +
> +struct intel_timeline *mock_timeline(u64 context);
> +void mock_timeline_destroy(struct intel_timeline *tl);
> +
> +#endif /* !__MOCK_TIMELINE__ */
>

Looks very neat and tidy, and the extensive unit tests fill me with 
confidence that nothing functional was missed.

Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Regards,

Tvrtko

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v14] drm/i915: Squash repeated awaits on the same fence
  2017-05-02 12:24             ` Tvrtko Ursulin
@ 2017-05-02 14:45               ` Chris Wilson
  2017-05-02 15:11                 ` Chris Wilson
  2017-05-02 15:17                 ` Tvrtko Ursulin
  2017-05-02 14:50               ` Chris Wilson
  1 sibling, 2 replies; 95+ messages in thread
From: Chris Wilson @ 2017-05-02 14:45 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: intel-gfx

On Tue, May 02, 2017 at 01:24:58PM +0100, Tvrtko Ursulin wrote:
> On 28/04/2017 20:02, Chris Wilson wrote:
> >+	if (!p->height) {
> >+		for (bits = p->bitmap; (i = ffs(bits)); bits &= ~0u << i) {
> 
> Would for_each_set_bit be more readable?

The downside is that we have to cast bitmap to unsigned long. Something like this:

diff --git a/drivers/gpu/drm/i915/selftests/i915_syncmap.c b/drivers/gpu/drm/i915/selftests/i915_syncmap.c
index 1f8b594b4157..9fbc9e144833 100644
--- a/drivers/gpu/drm/i915/selftests/i915_syncmap.c
+++ b/drivers/gpu/drm/i915/selftests/i915_syncmap.c
@@ -33,7 +33,7 @@ __sync_print(struct i915_syncmap *p,
             unsigned int idx)
 {
        unsigned long len;
-       unsigned i, bits, X;
+       unsigned i, X;
 
        if (depth) {
                unsigned int d;
@@ -61,7 +61,7 @@ __sync_print(struct i915_syncmap *p,
        *sz -= len;
 
        if (!p->height) {
-               for (bits = p->bitmap; (i = ffs(bits)); bits &= ~0u << i) {
+               for_each_set_bit(i, (unsigned long *)&p->bitmap, KSYNCMAP) {
                        len = scnprintf(buf, *sz, " %x:%x,",
                                       i - 1, __sync_seqno(p)[i - 1]);
                        buf += len;
@@ -76,11 +76,11 @@ __sync_print(struct i915_syncmap *p,
        *sz -= len;
 
        if (p->height) {
-               for (bits = p->bitmap; (i = ffs(bits)); ) {
-                       bits &= ~0u << i;
+               for_each_set_bit(i, (unsigned long *)&p->bitmap, KSYNCMAP) {
                        buf = __sync_print(__sync_child(p)[i - 1],
                                           buf, sz,
-                                          depth + 1, (last << 1) | !!bits,
+                                          depth + 1, (last << 1) |
+                                          !!(p->bitmap & (~0u << i)),
                                           i - 1);
                }
        }

And thank you for not suggesting we use the horrible code generation of
for_each_set_bit() outside of the pretty printer. :)

P.S. Latest ASCII graph:
	0xXXXXXXXXXXXXXXXX
	0-> 0x0000000000XXXXXX
	|   0-> 0x0000000000000XXX
	|   |   0-> 0x00000000000000XX
	|   |   |   0-> 0x000000000000000X 0:0, 1:1, 2:2
	|   |   |   1-> 0x000000000000001X 0:10, 1:11
	|   |   2-> 0x000000000000020X 0:200, 1:201
	|   5-> 0x000000000050XXXX
	|       0-> 0x000000000050000X 0:500000, 1:500001
	|       3-> 0x000000000050300X 0:503000, 1:503001
	e-> 0xe00000000000000X e:e
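
For reference, a rough reconstruction (read off the leaves above, so an
assumption rather than code lifted from the patch) of the i915_syncmap_set()
calls that would build a tree of this shape, assuming the declarations from
i915_syncmap.h are in scope:

/* Sketch only: the inserts that should reproduce the graph above. */
static int build_example_tree(struct i915_syncmap **sync)
{
	static const struct {
		u64 context;
		u32 seqno;
	} leaves[] = {
		{ 0x0, 0x0 }, { 0x1, 0x1 }, { 0x2, 0x2 },
		{ 0x10, 0x10 }, { 0x11, 0x11 },
		{ 0x200, 0x200 }, { 0x201, 0x201 },
		{ 0x500000, 0x500000 }, { 0x500001, 0x500001 },
		{ 0x503000, 0x503000 }, { 0x503001, 0x503001 },
		{ 0xe00000000000000eULL, 0xe },
	};
	unsigned int i;
	int err;

	for (i = 0; i < ARRAY_SIZE(leaves); i++) {
		err = i915_syncmap_set(sync, leaves[i].context,
				       leaves[i].seqno);
		if (err)
			return err;
	}

	return 0;
}
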
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply related	[flat|nested] 95+ messages in thread

* Re: [PATCH v14] drm/i915: Squash repeated awaits on the same fence
  2017-05-02 12:24             ` Tvrtko Ursulin
  2017-05-02 14:45               ` Chris Wilson
@ 2017-05-02 14:50               ` Chris Wilson
  1 sibling, 0 replies; 95+ messages in thread
From: Chris Wilson @ 2017-05-02 14:50 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: intel-gfx

On Tue, May 02, 2017 at 01:24:58PM +0100, Tvrtko Ursulin wrote:
> On 28/04/2017 20:02, Chris Wilson wrote:
> >+	prandom_seed_state(&prng, i915_selftest.random_seed);
> >+	count = 0;
> >+	kt = ktime_get();
> >+	end_time = jiffies + HZ/10;
> >+	do {
> >+		u32 id = random_engine(&prng);
> >+		u32 seqno = prandom_u32_state(&prng);
> >+
> >+		if (!__intel_timeline_sync_is_later(tl, id, seqno))
> >+			__intel_timeline_sync_set(tl, id, seqno);
> >+
> >+		count++;
> >+	} while (!time_after(jiffies, end_time));
> >+	kt = ktime_sub(ktime_get(), kt);
> >+	kt = ktime_sub_ns(kt, count * prng32_1M / M);
> 
> Two randoms to account here.

Thank you. That fixes the discrepancy between the random_engine() results
and those from using the engines in order.
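
As an aside, a small userspace analogue of the calibration trick (illustration
only, using rand() and clock_gettime() rather than the kernel prng and ktime
helpers, and assuming random_engine() costs roughly one prng draw): the point
is simply that the timed loop consumes two random numbers per pass, so the
calibrated overhead has to be subtracted twice:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static long long now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int main(void)
{
	enum { M = 1000000 };
	long long t0, prng_1M, elapsed;
	unsigned long count;
	unsigned int sink = 0;
	long i;

	/* Calibrate: the cost of one million bare rand() calls. */
	t0 = now_ns();
	for (i = 0; i < M; i++)
		sink ^= rand();
	prng_1M = now_ns() - t0;

	/*
	 * The measured loop draws *two* random numbers per pass (one for
	 * the engine id, one for the seqno), so subtract the calibrated
	 * prng overhead twice per pass.
	 */
	t0 = now_ns();
	for (count = 0; count < M; count++) {
		unsigned int id = rand() & 3;	/* stand-in for random_engine() */
		unsigned int seqno = rand();	/* stand-in for prandom_u32_state() */

		sink ^= id + seqno;		/* stand-in for the timeline update */
	}
	elapsed = now_ns() - t0;
	elapsed -= 2LL * count * prng_1M / M;

	printf("net cost: %lld ns over %lu passes (sink=%u)\n",
	       elapsed, count, sink);
	return 0;
}
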
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v14] drm/i915: Squash repeated awaits on the same fence
  2017-05-02 14:45               ` Chris Wilson
@ 2017-05-02 15:11                 ` Chris Wilson
  2017-05-02 15:17                 ` Tvrtko Ursulin
  1 sibling, 0 replies; 95+ messages in thread
From: Chris Wilson @ 2017-05-02 15:11 UTC (permalink / raw)
  To: Tvrtko Ursulin, intel-gfx

On Tue, May 02, 2017 at 03:45:23PM +0100, Chris Wilson wrote:
> On Tue, May 02, 2017 at 01:24:58PM +0100, Tvrtko Ursulin wrote:
> > On 28/04/2017 20:02, Chris Wilson wrote:
> > >+	if (!p->height) {
> > >+		for (bits = p->bitmap; (i = ffs(bits)); bits &= ~0u << i) {
> > 
> > Would for_each_set_bit be more readable?
> 
> Downside is that we have to cast bitmap to unsigned long:
> 
> Something like:

Well, that forgot that for_each_set_bit() is 0-based rather than 1-based
like ffs(). Let's try again:


diff --git a/drivers/gpu/drm/i915/selftests/i915_syncmap.c b/drivers/gpu/drm/i915/selftests/i915_syncmap.c
index 7b9c6eeaf62c..f279347ab218 100644
--- a/drivers/gpu/drm/i915/selftests/i915_syncmap.c
+++ b/drivers/gpu/drm/i915/selftests/i915_syncmap.c
@@ -33,7 +33,7 @@ __sync_print(struct i915_syncmap *p,
             unsigned int idx)
 {
        unsigned long len;
-       unsigned i, bits, X;
+       unsigned i, X;
 
        if (depth) {
                unsigned int d;
@@ -60,9 +60,9 @@ __sync_print(struct i915_syncmap *p,
        scnprintf(buf - X, *sz + X, "%*s", X, "XXXXXXXXXXXXXXXXX");
 
        if (!p->height) {
-               for (bits = p->bitmap; (i = ffs(bits)); bits &= ~0u << i) {
+               for_each_set_bit(i, (unsigned long *)&p->bitmap, KSYNCMAP) {
                        len = scnprintf(buf, *sz, " %x:%x,",
-                                      i - 1, __sync_seqno(p)[i - 1]);
+                                      i, __sync_seqno(p)[i]);
                        buf += len;
                        *sz -= len;
                }
@@ -75,12 +75,12 @@ __sync_print(struct i915_syncmap *p,
        *sz -= len;
 
        if (p->height) {
-               for (bits = p->bitmap; (i = ffs(bits)); ) {
-                       bits &= ~0u << i;
-                       buf = __sync_print(__sync_child(p)[i - 1],
-                                          buf, sz,
-                                          depth + 1, (last << 1) | !!bits,
-                                          i - 1);
+               for_each_set_bit(i, (unsigned long *)&p->bitmap, KSYNCMAP) {
+                       buf = __sync_print(__sync_child(p)[i], buf, sz,
+                                          depth + 1,
+                                          (last << 1) |
+                                          !!(p->bitmap & (~0u << (i + 1))),
+                                          i);
                }
        }
 

I'm in favour of the cast over gratuitous use of ffs().
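
For completeness, a tiny userspace sketch (illustration only, not the kernel
code) of the indexing difference: ffs() reports bit positions 1-based, hence
the i - 1 in the original loop, while a 0-based walk such as
for_each_set_bit() indexes the slots directly:

#include <stdio.h>
#include <strings.h>	/* ffs() */

int main(void)
{
	unsigned int bitmap = 0x15;	/* slots 0, 2 and 4 populated */
	unsigned int bits, i;

	/* ffs()-style walk: i is 1-based, so the slot index is i - 1. */
	for (bits = bitmap; (i = ffs(bits)); bits &= ~0u << i)
		printf("ffs walk:     slot %u\n", i - 1);

	/* 0-based walk, roughly what for_each_set_bit() expands to. */
	for (i = 0; i < 8 * sizeof(bitmap); i++)
		if (bitmap & (1u << i))
			printf("0-based walk: slot %u\n", i);

	return 0;
}
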
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply related	[flat|nested] 95+ messages in thread

* Re: [PATCH v14] drm/i915: Squash repeated awaits on the same fence
  2017-05-02 14:45               ` Chris Wilson
  2017-05-02 15:11                 ` Chris Wilson
@ 2017-05-02 15:17                 ` Tvrtko Ursulin
  1 sibling, 0 replies; 95+ messages in thread
From: Tvrtko Ursulin @ 2017-05-02 15:17 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 02/05/2017 15:45, Chris Wilson wrote:
> On Tue, May 02, 2017 at 01:24:58PM +0100, Tvrtko Ursulin wrote:
>> On 28/04/2017 20:02, Chris Wilson wrote:
>>> +	if (!p->height) {
>>> +		for (bits = p->bitmap; (i = ffs(bits)); bits &= ~0u << i) {
>>
>> Would for_each_set_bit be more readable?
>
> Downside is that we have to cast bitmap to unsigned long:
>
> Something like:
>
> diff --git a/drivers/gpu/drm/i915/selftests/i915_syncmap.c b/drivers/gpu/drm/i915/selftests/i915_syncmap.c
> index 1f8b594b4157..9fbc9e144833 100644
> --- a/drivers/gpu/drm/i915/selftests/i915_syncmap.c
> +++ b/drivers/gpu/drm/i915/selftests/i915_syncmap.c
> @@ -33,7 +33,7 @@ __sync_print(struct i915_syncmap *p,
>              unsigned int idx)
>  {
>         unsigned long len;
> -       unsigned i, bits, X;
> +       unsigned i, X;
>
>         if (depth) {
>                 unsigned int d;
> @@ -61,7 +61,7 @@ __sync_print(struct i915_syncmap *p,
>         *sz -= len;
>
>         if (!p->height) {
> -               for (bits = p->bitmap; (i = ffs(bits)); bits &= ~0u << i) {
> +               for_each_set_bit(i, (unsigned long *)&p->bitmap, KSYNCMAP) {
>                         len = scnprintf(buf, *sz, " %x:%x,",
>                                        i - 1, __sync_seqno(p)[i - 1]);
>                         buf += len;
> @@ -76,11 +76,11 @@ __sync_print(struct i915_syncmap *p,
>         *sz -= len;
>
>         if (p->height) {
> -               for (bits = p->bitmap; (i = ffs(bits)); ) {
> -                       bits &= ~0u << i;
> +               for_each_set_bit(i, (unsigned long *)&p->bitmap, KSYNCMAP) {
>                         buf = __sync_print(__sync_child(p)[i - 1],
>                                            buf, sz,
> -                                          depth + 1, (last << 1) | !!bits,
> +                                          depth + 1, (last << 1) |
> +                                          !!(p->bitmap & (~0u << i)),
>                                            i - 1);
>                 }
>         }

It's a slightly smaller reduction in line count in this version, so I am 
not sure. Most importantly, it is not getting rid of the ~0u << i 
business. Have it as you prefer.

> And thank you for not suggesting to use the horrible code generation of
> for_each_set_bit() outside of the pretty printer. :)
>
> P.S. Latest ascii graphs:
> 	0xXXXXXXXXXXXXXXXX
> 	0-> 0x0000000000XXXXXX
> 	|   0-> 0x0000000000000XXX
> 	|   |   0-> 0x00000000000000XX
> 	|   |   |   0-> 0x000000000000000X 0:0, 1:1, 2:2
> 	|   |   |   1-> 0x000000000000001X 0:10, 1:11
> 	|   |   2-> 0x000000000000020X 0:200, 1:201
> 	|   5-> 0x000000000050XXXX
> 	|       0-> 0x000000000050000X 0:500000, 1:500001
> 	|       3-> 0x000000000050300X 0:503000, 1:503001
> 	e-> 0xe00000000000000X e:e

I think that's better. And thank you for adding the graph!

Regards,

Tvrtko

^ permalink raw reply	[flat|nested] 95+ messages in thread

end of thread, other threads:[~2017-05-02 15:17 UTC | newest]

Thread overview: 95+ messages
2017-04-19  9:41 Confluence of eb + timeline improvements Chris Wilson
2017-04-19  9:41 ` [PATCH 01/27] drm/i915/selftests: Allocate inode/file dynamically Chris Wilson
2017-04-20  7:42   ` Joonas Lahtinen
2017-04-19  9:41 ` [PATCH 02/27] drm/i915: Mark CPU cache as dirty on every transition for CPU writes Chris Wilson
2017-04-19 16:52   ` Dongwon Kim
2017-04-19 17:15     ` Chris Wilson
2017-04-19 17:46     ` Chris Wilson
2017-04-19 18:08     ` Chris Wilson
2017-04-19 18:13       ` Dongwon Kim
2017-04-19 18:26         ` Chris Wilson
2017-04-19 20:30           ` Dongwon Kim
2017-04-19 20:49           ` Dongwon Kim
2017-04-19  9:41 ` [PATCH 03/27] drm/i915: Mark up clflushes as belonging to an unordered timeline Chris Wilson
2017-04-19  9:41 ` [PATCH 04/27] drm/i915: Lift timeline ordering to await_dma_fence Chris Wilson
2017-04-19  9:41 ` [PATCH 05/27] drm/i915: Make ptr_unpack_bits() more function-like Chris Wilson
2017-04-19  9:41 ` [PATCH 06/27] drm/i915: Redefine ptr_pack_bits() and friends Chris Wilson
2017-04-19  9:41 ` [PATCH 07/27] drm/i915: Squash repeated awaits on the same fence Chris Wilson
2017-04-24 13:03   ` Tvrtko Ursulin
2017-04-24 13:19     ` Chris Wilson
2017-04-24 13:31       ` Chris Wilson
2017-04-26 10:20   ` Tvrtko Ursulin
2017-04-26 10:38     ` Chris Wilson
2017-04-26 10:54       ` Tvrtko Ursulin
2017-04-26 11:18         ` Chris Wilson
2017-04-26 12:13           ` Tvrtko Ursulin
2017-04-26 12:23             ` Chris Wilson
2017-04-26 14:36               ` Tvrtko Ursulin
2017-04-26 14:55                 ` Chris Wilson
2017-04-26 15:04                 ` Chris Wilson
2017-04-26 18:56             ` Chris Wilson
2017-04-26 22:22               ` Chris Wilson
2017-04-27  9:20                 ` Tvrtko Ursulin
2017-04-27  9:47                   ` Chris Wilson
2017-04-27  7:06   ` [PATCH v8] " Chris Wilson
2017-04-27  7:14     ` Chris Wilson
2017-04-27  9:50     ` Chris Wilson
2017-04-27 11:42       ` Chris Wilson
2017-04-27 11:48     ` [PATCH v9] " Chris Wilson
2017-04-27 16:47       ` Tvrtko Ursulin
2017-04-27 17:25         ` Chris Wilson
2017-04-27 20:34           ` Chris Wilson
2017-04-27 20:53             ` Chris Wilson
2017-04-28  7:41       ` [PATCH v10] " Chris Wilson
2017-04-28  7:59         ` Chris Wilson
2017-04-28  9:32         ` Tvrtko Ursulin
2017-04-28  9:54           ` Chris Wilson
2017-04-28  9:55         ` Tvrtko Ursulin
2017-04-28 10:11           ` Chris Wilson
2017-04-28 14:12         ` [PATCH v13] " Chris Wilson
2017-04-28 19:02           ` [PATCH v14] " Chris Wilson
2017-05-02 12:24             ` Tvrtko Ursulin
2017-05-02 14:45               ` Chris Wilson
2017-05-02 15:11                 ` Chris Wilson
2017-05-02 15:17                 ` Tvrtko Ursulin
2017-05-02 14:50               ` Chris Wilson
2017-04-19  9:41 ` [PATCH 08/27] drm/i915: Rename intel_timeline.sync_seqno[] to .global_sync[] Chris Wilson
2017-04-19  9:41 ` [PATCH 09/27] drm/i915: Confirm the request is still active before adding it to the await Chris Wilson
2017-04-19  9:41 ` [PATCH 10/27] drm/i915: Do not record a successful syncpoint for a dma-await Chris Wilson
2017-04-19  9:41 ` [PATCH 11/27] drm/i915: Switch the global i915.semaphores check to a local predicate Chris Wilson
2017-04-19  9:41 ` [PATCH 12/27] drm/i915: Only report a wakeup if the waiter was truly asleep Chris Wilson
2017-04-20 13:30   ` Tvrtko Ursulin
2017-04-20 13:57     ` Chris Wilson
2017-04-19  9:41 ` [PATCH 13/27] drm/i915/execlists: Pack the count into the low bits of the port.request Chris Wilson
2017-04-20 14:58   ` Tvrtko Ursulin
2017-04-27 14:37     ` Chris Wilson
2017-04-28 12:02       ` Tvrtko Ursulin
2017-04-28 12:21         ` Chris Wilson
2017-04-19  9:41 ` [PATCH 14/27] drm/i915: Don't mark an execlists context-switch when idle Chris Wilson
2017-04-20  8:53   ` Joonas Lahtinen
2017-04-19  9:41 ` [PATCH 15/27] drm/i915: Split execlist priority queue into rbtree + linked list Chris Wilson
2017-04-24 10:28   ` Tvrtko Ursulin
2017-04-24 11:07     ` Chris Wilson
2017-04-24 12:18       ` Chris Wilson
2017-04-24 12:44       ` Tvrtko Ursulin
2017-04-24 13:06         ` Chris Wilson
2017-04-19  9:41 ` [PATCH 16/27] drm/i915: Reinstate reservation_object zapping for batch_pool objects Chris Wilson
2017-04-28 12:20   ` Tvrtko Ursulin
2017-04-19  9:41 ` [PATCH 17/27] drm/i915: Amalgamate execbuffer parameter structures Chris Wilson
2017-04-19  9:41 ` [PATCH 18/27] drm/i915: Use vma->exec_entry as our double-entry placeholder Chris Wilson
2017-04-19  9:41 ` [PATCH 19/27] drm/i915: Split vma exec_link/evict_link Chris Wilson
2017-04-19  9:41 ` [PATCH 20/27] drm/i915: Store a direct lookup from object handle to vma Chris Wilson
2017-04-19  9:41 ` [PATCH 21/27] drm/i915: Pass vma to relocate entry Chris Wilson
2017-04-19  9:41 ` [PATCH 22/27] drm/i915: Eliminate lots of iterations over the execobjects array Chris Wilson
2017-04-20  8:49   ` Joonas Lahtinen
2017-04-19  9:41 ` [PATCH 23/27] drm/i915: First try the previous execbuffer location Chris Wilson
2017-04-19  9:41 ` [PATCH 24/27] drm/i915: Wait upon userptr get-user-pages within execbuffer Chris Wilson
2017-04-19  9:41 ` [PATCH 25/27] drm/i915: Allow execbuffer to use the first object as the batch Chris Wilson
2017-04-19  9:41 ` [PATCH 26/27] drm/i915: Async GPU relocation processing Chris Wilson
2017-04-19  9:41 ` [PATCH 27/27] drm/i915/scheduler: Support user-defined priorities Chris Wilson
2017-04-19 10:09   ` Chris Wilson
2017-04-19 11:07     ` Tvrtko Ursulin
2017-04-19 10:01 ` ✗ Fi.CI.BAT: failure for series starting with [01/27] drm/i915/selftests: Allocate inode/file dynamically Patchwork
2017-04-27  7:27 ` ✓ Fi.CI.BAT: success for series starting with [01/27] drm/i915/selftests: Allocate inode/file dynamically (rev2) Patchwork
2017-04-28 14:31 ` ✓ Fi.CI.BAT: success for series starting with [01/27] drm/i915/selftests: Allocate inode/file dynamically (rev5) Patchwork
2017-04-28 19:22 ` ✓ Fi.CI.BAT: success for series starting with [01/27] drm/i915/selftests: Allocate inode/file dynamically (rev6) Patchwork
