All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 00/16] Enabling GEN8 full PPGTT + fixes
@ 2014-07-01 18:17 Ben Widawsky
  2014-07-01 18:17 ` [PATCH 01/16] drm/i915: Split up do_switch Ben Widawsky
                   ` (17 more replies)
  0 siblings, 18 replies; 34+ messages in thread
From: Ben Widawsky @ 2014-07-01 18:17 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky

Here be all the patches to make full PPGTT relatively stable on Broadwell. Most
of the work was actually to the generic PPGTT code, and not BDW specific. There
are basically 3 fixes:
1. Make the error state not horrible, but more work still needed.
2. Fix up another tricky from != from case in do_switch
3. Generally more graceful handling in ppgtt_release

1. Various forms of this subseries have shown up on the list from both
myself, and Chris (and I feel like Mika, also). For whatever reason,
none have been merged. I don't necessarily mind missing information, but
without these patches we can OOPs and GP fault in the error state, which
is not acceptable.

2. The meat of the debugging came here. Essentially this problem has
already been seen, and solved. The issue is, there is another spot in
do_switch that can invoke a switch to the default context. We probably
want to get slightly better handling of this, but for now this solves my
problems.

3. There are 2 categories addressed in this bucket. Reset, and signals.  The
patches themselves explain the situation. I take a two step approach with this
where first I make things correct (and slow as heck), and then I do the more
optimal and trickier solution. Both of these are in line to be replaced by
Daniel, but I needed something sooner.

Here is the test case that caught all of the above issues:
while [ 1 ] ;  do
        (glxgears) & pid[0]=$!
        (glxgears) & pid[1]=$!
        (glxgears) & pid[2]=$!
        sleep 3
        kill ${pid[*]}
done


Ben Widawsky (16):
  drm/i915: Split up do_switch
  drm/i915: Extract l3 remapping out of ctx switch
  drm/i915/ppgtt: Load address space after mi_set_context
  drm/i915: Fix another another use-after-free in do_switch
  drm/i915/ctx: Return earlier on failure
  drm/i915/error: Check the potential ctx obj's vm
  drm/i915/error: vma error capture prettyify
  drm/i915/error: Do a better job of disambiguating VMAs
  drm/i915/error: Capture vmas instead of BOs
  drm/i915: Add some extra guards in evict_vm
  drm/i915: Make an uninterruptible evict
  drm/i915: Reorder ctx unref on ppgtt cleanup
  drm/i915: More correct (slower) ppgtt cleanup
  drm/i915: Defer PPGTT cleanup
  drm/i915/bdw: Enable full PPGTT
  drm/i915: Get the error state over the wire (HACKish)

 drivers/gpu/drm/i915/i915_debugfs.c        |   2 +-
 drivers/gpu/drm/i915/i915_drv.h            |  18 ++-
 drivers/gpu/drm/i915/i915_gem.c            | 110 +++++++++++++
 drivers/gpu/drm/i915/i915_gem_context.c    | 238 ++++++++++++++++++++++-------
 drivers/gpu/drm/i915/i915_gem_evict.c      |  39 +++--
 drivers/gpu/drm/i915/i915_gem_execbuffer.c |   2 +-
 drivers/gpu/drm/i915/i915_gem_gtt.c        |   3 +-
 drivers/gpu/drm/i915/i915_gem_gtt.h        |   4 +
 drivers/gpu/drm/i915/i915_gpu_error.c      | 157 ++++++++++++-------
 drivers/gpu/drm/i915/i915_sysfs.c          |   2 +-
 10 files changed, 449 insertions(+), 126 deletions(-)

-- 
2.0.1

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 01/16] drm/i915: Split up do_switch
  2014-07-01 18:17 [PATCH 00/16] Enabling GEN8 full PPGTT + fixes Ben Widawsky
@ 2014-07-01 18:17 ` Ben Widawsky
  2014-07-01 18:17 ` [PATCH 02/16] drm/i915: Extract l3 remapping out of ctx switch Ben Widawsky
                   ` (16 subsequent siblings)
  17 siblings, 0 replies; 34+ messages in thread
From: Ben Widawsky @ 2014-07-01 18:17 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

There are two important reasons for this patch. It should make the
existing code a lot more readable. It also makes the next patch much
easier to understand in my opinion. There are 2 main variables that
effect this function, leaving 4 permutations:
ring: RCS vs !RCS
PPGTT: full or not

I didn't find extracting the full PPGTT usage to be very beneficial at
this point, but it may be in the future.

This was originally recommended by Daniel Vetter, and in this case, I
agree. There was no intentional behavioral change.

v2: Change the pin assertion to be GGTT only. This is more accurate.

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_gem_context.c | 78 +++++++++++++++++++++------------
 1 file changed, 50 insertions(+), 28 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 0d2c75b..6dbe3e7 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -591,31 +591,57 @@ mi_set_context(struct intel_engine_cs *ring,
 	return ret;
 }
 
-static int do_switch(struct intel_engine_cs *ring,
-		     struct intel_context *to)
+static void do_switch_fini_common(struct intel_engine_cs *ring,
+				  struct intel_context *from,
+				  struct intel_context *to)
+{
+	if (likely(from))
+		i915_gem_context_unreference(from);
+	i915_gem_context_reference(to);
+	ring->last_context = to;
+}
+
+static int do_switch_xcs(struct intel_engine_cs *ring,
+			 struct intel_context *from,
+			 struct intel_context *to)
+{
+	struct drm_device *dev = ring->dev;
+	struct i915_hw_ppgtt *ppgtt = ctx_to_ppgtt(to);
+	int ret;
+
+	BUG_ON(from && from->obj != NULL);
+
+	if (USES_FULL_PPGTT(dev)) {
+		ret = ppgtt->switch_mm(ppgtt, ring, false);
+		if (ret)
+			return ret;
+	}
+
+	if (from)
+		do_switch_fini_common(ring, from, to);
+
+	return 0;
+}
+
+static int do_switch_rcs(struct intel_engine_cs *ring,
+			 struct intel_context *from,
+			 struct intel_context *to)
 {
 	struct drm_i915_private *dev_priv = ring->dev->dev_private;
-	struct intel_context *from = ring->last_context;
 	struct i915_hw_ppgtt *ppgtt = ctx_to_ppgtt(to);
 	u32 hw_flags = 0;
 	bool uninitialized = false;
 	int ret, i;
 
-	if (from != NULL && ring == &dev_priv->ring[RCS]) {
+	if (from != NULL) {
 		BUG_ON(from->obj == NULL);
-		BUG_ON(!i915_gem_obj_is_pinned(from->obj));
+		BUG_ON(!i915_gem_obj_ggtt_bound(from->obj));
 	}
 
-	if (from == to && !to->remap_slice)
-		return 0;
-
 	/* Trying to pin first makes error handling easier. */
-	if (ring == &dev_priv->ring[RCS]) {
-		ret = i915_gem_obj_ggtt_pin(to->obj,
-					    get_context_alignment(ring->dev), 0);
-		if (ret)
-			return ret;
-	}
+	ret = i915_gem_obj_ggtt_pin(to->obj, get_context_alignment(ring->dev), 0);
+	if (ret)
+		return ret;
 
 	/*
 	 * Pin can switch back to the default context if we end up calling into
@@ -630,12 +656,6 @@ static int do_switch(struct intel_engine_cs *ring,
 			goto unpin_out;
 	}
 
-	if (ring != &dev_priv->ring[RCS]) {
-		if (from)
-			i915_gem_context_unreference(from);
-		goto done;
-	}
-
 	/*
 	 * Clear this page out of any CPU caches for coherent swap-in/out. Note
 	 * that thanks to write = false in this call and us not setting any gpu
@@ -694,15 +714,11 @@ static int do_switch(struct intel_engine_cs *ring,
 
 		/* obj is kept alive until the next request by its active ref */
 		i915_gem_object_ggtt_unpin(from->obj);
-		i915_gem_context_unreference(from);
 	}
 
 	uninitialized = !to->is_initialized && from == NULL;
 	to->is_initialized = true;
-
-done:
-	i915_gem_context_reference(to);
-	ring->last_context = to;
+	do_switch_fini_common(ring, from, to);
 
 	if (uninitialized) {
 		ret = i915_gem_render_state_init(ring);
@@ -713,8 +729,7 @@ done:
 	return 0;
 
 unpin_out:
-	if (ring->id == RCS)
-		i915_gem_object_ggtt_unpin(to->obj);
+	i915_gem_object_ggtt_unpin(to->obj);
 	return ret;
 }
 
@@ -732,6 +747,7 @@ int i915_switch_context(struct intel_engine_cs *ring,
 			struct intel_context *to)
 {
 	struct drm_i915_private *dev_priv = ring->dev->dev_private;
+	struct intel_context *from = ring->last_context;
 
 	WARN_ON(!mutex_is_locked(&dev_priv->dev->struct_mutex));
 
@@ -745,7 +761,13 @@ int i915_switch_context(struct intel_engine_cs *ring,
 		return 0;
 	}
 
-	return do_switch(ring, to);
+	if (from == to && !to->remap_slice)
+		return 0;
+
+	if (ring->id == RCS)
+		return do_switch_rcs(ring, from, to);
+	else
+		return do_switch_xcs(ring, from, to);
 }
 
 static bool hw_context_enabled(struct drm_device *dev)
-- 
2.0.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 02/16] drm/i915: Extract l3 remapping out of ctx switch
  2014-07-01 18:17 [PATCH 00/16] Enabling GEN8 full PPGTT + fixes Ben Widawsky
  2014-07-01 18:17 ` [PATCH 01/16] drm/i915: Split up do_switch Ben Widawsky
@ 2014-07-01 18:17 ` Ben Widawsky
  2014-07-01 18:17 ` [PATCH 03/16] drm/i915/ppgtt: Load address space after mi_set_context Ben Widawsky
                   ` (15 subsequent siblings)
  17 siblings, 0 replies; 34+ messages in thread
From: Ben Widawsky @ 2014-07-01 18:17 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

This is just a cosmetic change to try to put do_switch_rcs on a diet. As
it stands, the function was quite complex, and error prone.

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_gem_context.c | 32 ++++++++++++++++++++------------
 1 file changed, 20 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 6dbe3e7..25cc889 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -623,6 +623,24 @@ static int do_switch_xcs(struct intel_engine_cs *ring,
 	return 0;
 }
 
+static void remap_l3(struct intel_engine_cs *ring,
+		     struct intel_context *ctx)
+{
+	int ret, i;
+
+	for (i = 0; i < MAX_L3_SLICES; i++) {
+		if (!(ctx->remap_slice & (1<<i)))
+			continue;
+
+		ret = i915_gem_l3_remap(ring, i);
+		/* If it failed, try again next round */
+		if (ret)
+			DRM_DEBUG_DRIVER("L3 remapping failed\n");
+		else
+			ctx->remap_slice &= ~(1<<i);
+	}
+}
+
 static int do_switch_rcs(struct intel_engine_cs *ring,
 			 struct intel_context *from,
 			 struct intel_context *to)
@@ -631,7 +649,7 @@ static int do_switch_rcs(struct intel_engine_cs *ring,
 	struct i915_hw_ppgtt *ppgtt = ctx_to_ppgtt(to);
 	u32 hw_flags = 0;
 	bool uninitialized = false;
-	int ret, i;
+	int ret;
 
 	if (from != NULL) {
 		BUG_ON(from->obj == NULL);
@@ -681,17 +699,7 @@ static int do_switch_rcs(struct intel_engine_cs *ring,
 	if (ret)
 		goto unpin_out;
 
-	for (i = 0; i < MAX_L3_SLICES; i++) {
-		if (!(to->remap_slice & (1<<i)))
-			continue;
-
-		ret = i915_gem_l3_remap(ring, i);
-		/* If it failed, try again next round */
-		if (ret)
-			DRM_DEBUG_DRIVER("L3 remapping failed\n");
-		else
-			to->remap_slice &= ~(1<<i);
-	}
+	remap_l3(ring, to);
 
 	/* The backing object for the context is done after switching to the
 	 * *next* context. Therefore we cannot retire the previous context until
-- 
2.0.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 03/16] drm/i915/ppgtt: Load address space after mi_set_context
  2014-07-01 18:17 [PATCH 00/16] Enabling GEN8 full PPGTT + fixes Ben Widawsky
  2014-07-01 18:17 ` [PATCH 01/16] drm/i915: Split up do_switch Ben Widawsky
  2014-07-01 18:17 ` [PATCH 02/16] drm/i915: Extract l3 remapping out of ctx switch Ben Widawsky
@ 2014-07-01 18:17 ` Ben Widawsky
  2014-07-01 18:17 ` [PATCH 04/16] drm/i915: Fix another another use-after-free in do_switch Ben Widawsky
                   ` (14 subsequent siblings)
  17 siblings, 0 replies; 34+ messages in thread
From: Ben Widawsky @ 2014-07-01 18:17 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

The simple explanation is, the docs say to do this for GEN8. Perhaps we
want to do this for GEN7 too, I am not certain.

PDPs are saved and restored with context. Contexts (without execlists)
only exist on the render ring. The docs say that PDPs are not power
context save/restored.  I've learned that this actually means something
which SW doesn't care about. So pretend the statement doesn't exist.
For non RCS, nothing changes.

All this patch now does is change the ordering of LRI vs MI_SET_CONTEXT
for the initialization of the context. I do this because the docs say to
do it, and frankly, I cannot reason why it is necessary. I've thought
about it a lot, and tried, without success, to get a reason from design.
The answer I got more or less says, "gen7 is different than gen8." I've
given up, and am adding this little bit of code to make it in sync with
the docs.

v2: Completely rewritten commit message that addresses the requests
Ville made for v1
Only load PDPs for initial context load (Ville)

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_gem_context.c | 29 +++++++++++++++++++++++++++--
 1 file changed, 27 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 25cc889..601a58f 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -649,6 +649,7 @@ static int do_switch_rcs(struct intel_engine_cs *ring,
 	struct i915_hw_ppgtt *ppgtt = ctx_to_ppgtt(to);
 	u32 hw_flags = 0;
 	bool uninitialized = false;
+	bool needs_pd_load = (INTEL_INFO(ring->dev)->gen < 8) && USES_FULL_PPGTT(ring->dev);
 	int ret;
 
 	if (from != NULL) {
@@ -668,7 +669,10 @@ static int do_switch_rcs(struct intel_engine_cs *ring,
 	 */
 	from = ring->last_context;
 
-	if (USES_FULL_PPGTT(ring->dev)) {
+	if (needs_pd_load) {
+		/* Older GENs still want the load first, "PP_DCLV followed by
+		 * PP_DIR_BASE register through Load Register Immediate commands
+		 * in Ring Buffer before submitting a context."*/
 		ret = ppgtt->switch_mm(ppgtt, ring, false);
 		if (ret)
 			goto unpin_out;
@@ -692,13 +696,34 @@ static int do_switch_rcs(struct intel_engine_cs *ring,
 		vma->bind_vma(vma, to->obj->cache_level, GLOBAL_BIND);
 	}
 
-	if (!to->is_initialized || i915_gem_context_is_default(to))
+	if (!to->is_initialized || i915_gem_context_is_default(to)) {
 		hw_flags |= MI_RESTORE_INHIBIT;
+		needs_pd_load = USES_FULL_PPGTT(ring->dev) && IS_GEN8(ring->dev);
+	}
 
 	ret = mi_set_context(ring, to, hw_flags);
 	if (ret)
 		goto unpin_out;
 
+	/* GEN8 does *not* require an explicit reload if the PDPs have been
+	 * setup, and we do not wish to move them.
+	 *
+	 * XXX: If we implemented page directory eviction code, this
+	 * optimization needs to be removed.
+	 */
+	if (needs_pd_load) {
+		ret = ppgtt->switch_mm(ppgtt, ring, false);
+		/* The hardware context switch is emitted, but we haven't
+		 * actually changed the state - so it's probably safe to bail
+		 * here. Still, let the user know something dangerous has
+		 * happened.
+		 */
+		if (ret) {
+			DRM_ERROR("Failed to change address space on context switch\n");
+			goto unpin_out;
+		}
+	}
+
 	remap_l3(ring, to);
 
 	/* The backing object for the context is done after switching to the
-- 
2.0.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 04/16] drm/i915: Fix another another use-after-free in do_switch
  2014-07-01 18:17 [PATCH 00/16] Enabling GEN8 full PPGTT + fixes Ben Widawsky
                   ` (2 preceding siblings ...)
  2014-07-01 18:17 ` [PATCH 03/16] drm/i915/ppgtt: Load address space after mi_set_context Ben Widawsky
@ 2014-07-01 18:17 ` Ben Widawsky
  2014-08-09 20:15   ` [PATCH] [v2] " Ben Widawsky
  2014-07-01 18:17 ` [PATCH 05/16] drm/i915/ctx: Return earlier on failure Ben Widawsky
                   ` (13 subsequent siblings)
  17 siblings, 1 reply; 34+ messages in thread
From: Ben Widawsky @ 2014-07-01 18:17 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

See the following for many more details.

commit acc240d41ea1ab9c488a79219fb313b5b46265ae
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Thu Dec 5 15:42:34 2013 +0100

    drm/i915: Fix use-after-free in do_switch

In this case, the issue is only for full PPGTT:
do_switch
  context_unref
    ppgtt_release
      i915_gpu_idle
	switch_to_default
	from changes to default context

This could be backported to the pre do_switch cleanup I did in this
series. However, it's much cleaner and more obvious as a patch on top,
so I'd really like to do this as a post cleanup patch.

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_gem_context.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 601a58f..0e6e743 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -745,13 +745,21 @@ static int do_switch_rcs(struct intel_engine_cs *ring,
 		from->obj->dirty = 1;
 		BUG_ON(from->obj->ring != ring);
 
-		/* obj is kept alive until the next request by its active ref */
-		i915_gem_object_ggtt_unpin(from->obj);
 	}
 
 	uninitialized = !to->is_initialized && from == NULL;
 	to->is_initialized = true;
 	do_switch_fini_common(ring, from, to);
+	/* From may have disappeared again after thecontext unref */
+	from = ring->last_context;
+	if (from != NULL) {
+		/* obj is kept alive until the next request by its active ref.
+		 * XXX: The context needs to be unpinned last, or else we risk
+		 * hitting evict/idle on the ppgtt free, which will call back
+		 * into this, and we'll get a double unpin on this context
+		 */
+		i915_gem_object_ggtt_unpin(from->obj);
+	}
 
 	if (uninitialized) {
 		ret = i915_gem_render_state_init(ring);
-- 
2.0.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 05/16] drm/i915/ctx: Return earlier on failure
  2014-07-01 18:17 [PATCH 00/16] Enabling GEN8 full PPGTT + fixes Ben Widawsky
                   ` (3 preceding siblings ...)
  2014-07-01 18:17 ` [PATCH 04/16] drm/i915: Fix another another use-after-free in do_switch Ben Widawsky
@ 2014-07-01 18:17 ` Ben Widawsky
  2014-07-04  8:14   ` Chris Wilson
  2014-07-01 18:17 ` [PATCH 06/16] drm/i915/error: Check the potential ctx obj's vm Ben Widawsky
                   ` (12 subsequent siblings)
  17 siblings, 1 reply; 34+ messages in thread
From: Ben Widawsky @ 2014-07-01 18:17 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

As what was correctly debugged here:
commit acc240d41ea1ab9c488a79219fb313b5b46265ae
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Thu Dec 5 15:42:34 2013 +0100

    drm/i915: Fix use-after-free in do_switch

It then becomes apparent that the default context cannot be the context
being switched to for context switch because it is always bound. It
follows that if the ring->last_context (from) has changed after the
bind_to_gtt, it will always be the default context - this is commented
in the code block.

This assertion will help catch issues without our logic sooner than
letting the system move long (which is possible for some time).

I really want this to be a BUG(), but I also want the patch to get
merged. I think the fact that none of the ERRNOs make any sense at all
is just more evidence that this shouldn't be a WARN.

//Cc: Ian Lister (don't have current email address)
Cc: Rafael Barbalho <rafael.barbalho@intel.com>
Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_gem_context.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 0e6e743..cf7cf81 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -669,6 +669,15 @@ static int do_switch_rcs(struct intel_engine_cs *ring,
 	 */
 	from = ring->last_context;
 
+	/* The only context which 'from' can be, if it was changed, is the default
+	 * context. The default context cannot end up in evict everything (as
+	 * commented above) because it is always pinned.
+	 */
+	if (WARN_ON(from == to)) {
+		ret = -EPERM;
+		goto unpin_out;
+	}
+
 	if (needs_pd_load) {
 		/* Older GENs still want the load first, "PP_DCLV followed by
 		 * PP_DIR_BASE register through Load Register Immediate commands
-- 
2.0.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 06/16] drm/i915/error: Check the potential ctx obj's vm
  2014-07-01 18:17 [PATCH 00/16] Enabling GEN8 full PPGTT + fixes Ben Widawsky
                   ` (4 preceding siblings ...)
  2014-07-01 18:17 ` [PATCH 05/16] drm/i915/ctx: Return earlier on failure Ben Widawsky
@ 2014-07-01 18:17 ` Ben Widawsky
  2014-07-17  8:47   ` Daniel Vetter
  2014-07-01 18:17 ` [PATCH 07/16] drm/i915/error: vma error capture prettyify Ben Widawsky
                   ` (11 subsequent siblings)
  17 siblings, 1 reply; 34+ messages in thread
From: Ben Widawsky @ 2014-07-01 18:17 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

The bound list is global (all objects which back the VMAs are stored
here). Recently the BUG() in the offset lookup was demoted to a WARN,
but the fault actually lies in the caller, here.

This bug has existed since the initial introduction of PPGTT (however,
it was fixed in unmerged patches to fix up the error state).

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_gpu_error.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index 66cf417..550ba38 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -871,6 +871,9 @@ static void i915_gem_record_active_context(struct intel_engine_cs *ring,
 		return;
 
 	list_for_each_entry(obj, &dev_priv->mm.bound_list, global_list) {
+		if (!i915_gem_obj_ggtt_bound(obj))
+			continue;
+
 		if ((error->ccid & PAGE_MASK) == i915_gem_obj_ggtt_offset(obj)) {
 			ering->ctx = i915_error_ggtt_object_create(dev_priv, obj);
 			break;
-- 
2.0.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 07/16] drm/i915/error: vma error capture prettyify
  2014-07-01 18:17 [PATCH 00/16] Enabling GEN8 full PPGTT + fixes Ben Widawsky
                   ` (5 preceding siblings ...)
  2014-07-01 18:17 ` [PATCH 06/16] drm/i915/error: Check the potential ctx obj's vm Ben Widawsky
@ 2014-07-01 18:17 ` Ben Widawsky
  2014-07-01 18:17 ` [PATCH 08/16] drm/i915/error: Do a better job of disambiguating VMAs Ben Widawsky
                   ` (10 subsequent siblings)
  17 siblings, 0 replies; 34+ messages in thread
From: Ben Widawsky @ 2014-07-01 18:17 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

Rename some variables, and clean up the code a bit to make things
clearer in our error capture.

There isn't an intentional functional change here.

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_gpu_error.c | 55 ++++++++++++++++++++---------------
 1 file changed, 32 insertions(+), 23 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index 550ba38..ebe2904 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -967,63 +967,72 @@ static void i915_gem_record_rings(struct drm_device *dev,
 	}
 }
 
-/* FIXME: Since pin count/bound list is global, we duplicate what we capture per
+/**
+ * i915_gem_capture_vm() - Capture a VMs error state.
+ * @error:	The main error structure
+ * @vm:		The address space we're capturing.
+ * @vm_ndx:	Which vm without the buffer array
+ *
+ * FIXME: Since pin count/bound list is global, we duplicate what we capture per
  * VM.
  */
 static void i915_gem_capture_vm(struct drm_i915_private *dev_priv,
 				struct drm_i915_error_state *error,
 				struct i915_address_space *vm,
-				const int ndx)
+				const int vm_ndx)
 {
 	struct drm_i915_error_buffer *active_bo = NULL, *pinned_bo = NULL;
 	struct drm_i915_gem_object *obj;
 	struct i915_vma *vma;
-	int i;
+	int active_vma_count = 0;
 
-	i = 0;
 	list_for_each_entry(vma, &vm->active_list, mm_list)
-		i++;
-	error->active_bo_count[ndx] = i;
+		active_vma_count++;
+
+	error->active_bo_count[vm_ndx] = active_vma_count;
+
 	list_for_each_entry(obj, &dev_priv->mm.bound_list, global_list)
 		if (i915_gem_obj_is_pinned(obj))
-			i++;
-	error->pinned_bo_count[ndx] = i - error->active_bo_count[ndx];
+			active_vma_count++;
+
+	/* XXX: this is an incorrect measurement of pinned BOs */
+	error->pinned_bo_count[vm_ndx] = active_vma_count - error->active_bo_count[vm_ndx];
 
-	if (i) {
-		active_bo = kcalloc(i, sizeof(*active_bo), GFP_ATOMIC);
+	if (active_vma_count) {
+		active_bo = kcalloc(active_vma_count, sizeof(*active_bo), GFP_ATOMIC);
 		if (active_bo)
-			pinned_bo = active_bo + error->active_bo_count[ndx];
+			pinned_bo = active_bo + error->active_bo_count[vm_ndx];
 	}
 
 	if (active_bo)
-		error->active_bo_count[ndx] =
+		error->active_bo_count[vm_ndx] =
 			capture_active_bo(active_bo,
-					  error->active_bo_count[ndx],
+					  error->active_bo_count[vm_ndx],
 					  &vm->active_list);
 
 	if (pinned_bo)
-		error->pinned_bo_count[ndx] =
+		error->pinned_bo_count[vm_ndx] =
 			capture_pinned_bo(pinned_bo,
-					  error->pinned_bo_count[ndx],
+					  error->pinned_bo_count[vm_ndx],
 					  &dev_priv->mm.bound_list);
-	error->active_bo[ndx] = active_bo;
-	error->pinned_bo[ndx] = pinned_bo;
+	error->active_bo[vm_ndx] = active_bo;
+	error->pinned_bo[vm_ndx] = pinned_bo;
 }
 
 static void i915_gem_capture_buffers(struct drm_i915_private *dev_priv,
 				     struct drm_i915_error_state *error)
 {
 	struct i915_address_space *vm;
-	int cnt = 0, i = 0;
+	int vm_count = 0, i = 0;
 
 	list_for_each_entry(vm, &dev_priv->vm_list, global_link)
-		cnt++;
+		vm_count++;
 
-	error->active_bo = kcalloc(cnt, sizeof(*error->active_bo), GFP_ATOMIC);
-	error->pinned_bo = kcalloc(cnt, sizeof(*error->pinned_bo), GFP_ATOMIC);
-	error->active_bo_count = kcalloc(cnt, sizeof(*error->active_bo_count),
+	error->active_bo = kcalloc(vm_count, sizeof(*error->active_bo), GFP_ATOMIC);
+	error->pinned_bo = kcalloc(vm_count, sizeof(*error->pinned_bo), GFP_ATOMIC);
+	error->active_bo_count = kcalloc(vm_count, sizeof(*error->active_bo_count),
 					 GFP_ATOMIC);
-	error->pinned_bo_count = kcalloc(cnt, sizeof(*error->pinned_bo_count),
+	error->pinned_bo_count = kcalloc(vm_count, sizeof(*error->pinned_bo_count),
 					 GFP_ATOMIC);
 
 	list_for_each_entry(vm, &dev_priv->vm_list, global_link)
-- 
2.0.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 08/16] drm/i915/error: Do a better job of disambiguating VMAs
  2014-07-01 18:17 [PATCH 00/16] Enabling GEN8 full PPGTT + fixes Ben Widawsky
                   ` (6 preceding siblings ...)
  2014-07-01 18:17 ` [PATCH 07/16] drm/i915/error: vma error capture prettyify Ben Widawsky
@ 2014-07-01 18:17 ` Ben Widawsky
  2014-07-04  7:57   ` Chris Wilson
  2014-07-01 18:17 ` [PATCH 09/16] drm/i915/error: Capture vmas instead of BOs Ben Widawsky
                   ` (9 subsequent siblings)
  17 siblings, 1 reply; 34+ messages in thread
From: Ben Widawsky @ 2014-07-01 18:17 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

Some of the original PPGTT patches in this area where unmerged, and this
left a lot of confusion in our error capture with regard to which vm/obj
we want to capture. There have been at least a couple of patches from
Chris, and myself to try to fix this up; so here is another shot. Nobody
running without full PPGTT is effected by this, and that is probably why
nobody has bothered to fix it yet.

Instead of using any of the global lists to find the VMAs we want to
capture, we use the union of the active, and the inactive list in the
VM. This allows us to replace our capture_bo with capture_vma, and know
all the VMAs we want to capture are valid.

I could have probably figured out a way to reuse mm_list. As we've had
bugs here before in the shrinker, I think the best way forward is to get
it working, and then optimize it later.

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_gem_gtt.c   |  1 +
 drivers/gpu/drm/i915/i915_gem_gtt.h   |  2 ++
 drivers/gpu/drm/i915/i915_gpu_error.c | 39 ++++++++++++++++++++++-------------
 3 files changed, 28 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
index a4153ee..88451dc 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -2114,6 +2114,7 @@ static struct i915_vma *__i915_gem_vma_create(struct drm_i915_gem_object *obj,
 		return ERR_PTR(-ENOMEM);
 
 	INIT_LIST_HEAD(&vma->vma_link);
+	INIT_LIST_HEAD(&vma->pin_capture_link);
 	INIT_LIST_HEAD(&vma->mm_list);
 	INIT_LIST_HEAD(&vma->exec_list);
 	vma->vm = vm;
diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.h b/drivers/gpu/drm/i915/i915_gem_gtt.h
index 8d6f7c1..1d75801 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.h
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.h
@@ -126,6 +126,8 @@ struct i915_vma {
 
 	struct list_head vma_link; /* Link in the object's VMA list */
 
+	struct list_head pin_capture_link; /* Link in the error capture */
+
 	/** This vma's place in the batchbuffer or on the eviction list */
 	struct list_head exec_list;
 
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index ebe2904..123a4fc 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -665,14 +665,14 @@ static u32 capture_active_bo(struct drm_i915_error_buffer *err,
 static u32 capture_pinned_bo(struct drm_i915_error_buffer *err,
 			     int count, struct list_head *head)
 {
-	struct drm_i915_gem_object *obj;
+	struct i915_vma *vma;
 	int i = 0;
 
-	list_for_each_entry(obj, head, global_list) {
-		if (!i915_gem_obj_is_pinned(obj))
+	list_for_each_entry(vma, head, pin_capture_link) {
+		if (!i915_gem_obj_is_pinned(vma->obj))
 			continue;
 
-		capture_bo(err++, obj);
+		capture_bo(err++, vma->obj);
 		if (++i == count)
 			break;
 	}
@@ -982,21 +982,32 @@ static void i915_gem_capture_vm(struct drm_i915_private *dev_priv,
 				const int vm_ndx)
 {
 	struct drm_i915_error_buffer *active_bo = NULL, *pinned_bo = NULL;
-	struct drm_i915_gem_object *obj;
 	struct i915_vma *vma;
 	int active_vma_count = 0;
+	int vma_pin_count = 0;
+	LIST_HEAD(pinned_vma);
 
-	list_for_each_entry(vma, &vm->active_list, mm_list)
+	list_for_each_entry(vma, &vm->active_list, mm_list) {
 		active_vma_count++;
+		if (vma->pin_count) {
+			vma_pin_count++;
+			list_move_tail(&vma->pin_capture_link, &pinned_vma);
+		}
+	}
 
-	error->active_bo_count[vm_ndx] = active_vma_count;
-
-	list_for_each_entry(obj, &dev_priv->mm.bound_list, global_list)
-		if (i915_gem_obj_is_pinned(obj))
-			active_vma_count++;
+	list_for_each_entry(vma, &vm->inactive_list, mm_list) {
+		/* Certain objects may be on the inactive list, but pinned, when
+		 * in the global GGTT. */
+		if (WARN_ON(!i915_is_ggtt(vm) &&
+			    vma->pin_count &&
+			    !(vma->exec_entry->flags & (1<<31)))) { /* FIXME: need the actual flag */
+			vma_pin_count++;
+			list_move_tail(&vma->pin_capture_link, &pinned_vma);
+		}
+	}
 
-	/* XXX: this is an incorrect measurement of pinned BOs */
-	error->pinned_bo_count[vm_ndx] = active_vma_count - error->active_bo_count[vm_ndx];
+	error->active_bo_count[vm_ndx] = active_vma_count;
+	error->pinned_bo_count[vm_ndx] = vma_pin_count;
 
 	if (active_vma_count) {
 		active_bo = kcalloc(active_vma_count, sizeof(*active_bo), GFP_ATOMIC);
@@ -1014,7 +1025,7 @@ static void i915_gem_capture_vm(struct drm_i915_private *dev_priv,
 		error->pinned_bo_count[vm_ndx] =
 			capture_pinned_bo(pinned_bo,
 					  error->pinned_bo_count[vm_ndx],
-					  &dev_priv->mm.bound_list);
+					  &pinned_vma);
 	error->active_bo[vm_ndx] = active_bo;
 	error->pinned_bo[vm_ndx] = pinned_bo;
 }
-- 
2.0.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 09/16] drm/i915/error: Capture vmas instead of BOs
  2014-07-01 18:17 [PATCH 00/16] Enabling GEN8 full PPGTT + fixes Ben Widawsky
                   ` (7 preceding siblings ...)
  2014-07-01 18:17 ` [PATCH 08/16] drm/i915/error: Do a better job of disambiguating VMAs Ben Widawsky
@ 2014-07-01 18:17 ` Ben Widawsky
  2014-07-01 18:17 ` [PATCH 10/16] drm/i915: Add some extra guards in evict_vm Ben Widawsky
                   ` (8 subsequent siblings)
  17 siblings, 0 replies; 34+ messages in thread
From: Ben Widawsky @ 2014-07-01 18:17 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

To follow up on the last patch, we can now capture the VMAs instead of
the BOs. The hope if we get more accurate error capture while debugging
PPGTT.

Note that this does not impact the previous argument about whether to
capture all VMAs, or just the guilty VMA. This merely allows the code to
do whatever we chose later.

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_drv.h       |  1 +
 drivers/gpu/drm/i915/i915_gpu_error.c | 53 +++++++++++++++++++----------------
 2 files changed, 30 insertions(+), 24 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 8cea596..d3a69aa 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -305,6 +305,7 @@ struct drm_i915_error_state {
 	char error_msg[128];
 	u32 reset_count;
 	u32 suspend_count;
+	u32 vm_count;
 
 	/* Generic register state */
 	u32 eir;
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index 123a4fc..e82e590 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -384,15 +384,19 @@ int i915_error_state_to_str(struct drm_i915_error_state_buf *m,
 		i915_ring_error_state(m, dev, &error->ring[i]);
 	}
 
-	if (error->active_bo)
-		print_error_buffers(m, "Active",
-				    error->active_bo[0],
-				    error->active_bo_count[0]);
+	for (i = 0; i < error->vm_count; i++) {
+		if (error->active_bo[i])
+			print_error_buffers(m, "Active",
+					    error->active_bo[i],
+					    error->active_bo_count[i]);
+	}
 
-	if (error->pinned_bo)
-		print_error_buffers(m, "Pinned",
-				    error->pinned_bo[0],
-				    error->pinned_bo_count[0]);
+	for (i = 0; i < error->vm_count; i++) {
+		if (error->pinned_bo[i])
+			print_error_buffers(m, "Pinned",
+					    error->pinned_bo[i],
+					    error->pinned_bo_count[i]);
+	}
 
 	for (i = 0; i < ARRAY_SIZE(error->ring); i++) {
 		struct drm_i915_error_object *obj;
@@ -623,22 +627,23 @@ unwind:
 	i915_error_object_create_sized((dev_priv), (src), &(dev_priv)->gtt.base, \
 				       (src)->base.size>>PAGE_SHIFT)
 
-static void capture_bo(struct drm_i915_error_buffer *err,
-		       struct drm_i915_gem_object *obj)
+static void capture_vma(struct drm_i915_error_buffer *err, struct i915_vma *vma)
 {
+	struct drm_i915_gem_object *obj = vma->obj;
+
 	err->size = obj->base.size;
 	err->name = obj->base.name;
 	err->rseqno = obj->last_read_seqno;
 	err->wseqno = obj->last_write_seqno;
-	err->gtt_offset = i915_gem_obj_ggtt_offset(obj);
+	err->gtt_offset = vma->node.start;
 	err->read_domains = obj->base.read_domains;
 	err->write_domain = obj->base.write_domain;
 	err->fence_reg = obj->fence_reg;
-	err->pinned = 0;
-	if (i915_gem_obj_is_pinned(obj))
-		err->pinned = 1;
-	if (obj->user_pin_count > 0)
+	if (obj->user_pin_count > 0) {
+		WARN_ON(i915_is_ggtt(vma->vm));
 		err->pinned = -1;
+	} else
+		err->pinned = !!vma->pin_count;
 	err->tiling = obj->tiling_mode;
 	err->dirty = obj->dirty;
 	err->purgeable = obj->madv != I915_MADV_WILLNEED;
@@ -654,7 +659,7 @@ static u32 capture_active_bo(struct drm_i915_error_buffer *err,
 	int i = 0;
 
 	list_for_each_entry(vma, head, mm_list) {
-		capture_bo(err++, vma->obj);
+		capture_vma(err++, vma);
 		if (++i == count)
 			break;
 	}
@@ -669,10 +674,10 @@ static u32 capture_pinned_bo(struct drm_i915_error_buffer *err,
 	int i = 0;
 
 	list_for_each_entry(vma, head, pin_capture_link) {
-		if (!i915_gem_obj_is_pinned(vma->obj))
+		if (!vma->pin_count)
 			continue;
 
-		capture_bo(err++, vma->obj);
+		capture_vma(err++, vma);
 		if (++i == count)
 			break;
 	}
@@ -1034,16 +1039,16 @@ static void i915_gem_capture_buffers(struct drm_i915_private *dev_priv,
 				     struct drm_i915_error_state *error)
 {
 	struct i915_address_space *vm;
-	int vm_count = 0, i = 0;
+	int i = 0;
 
 	list_for_each_entry(vm, &dev_priv->vm_list, global_link)
-		vm_count++;
+		error->vm_count++;
 
-	error->active_bo = kcalloc(vm_count, sizeof(*error->active_bo), GFP_ATOMIC);
-	error->pinned_bo = kcalloc(vm_count, sizeof(*error->pinned_bo), GFP_ATOMIC);
-	error->active_bo_count = kcalloc(vm_count, sizeof(*error->active_bo_count),
+	error->active_bo = kcalloc(error->vm_count, sizeof(*error->active_bo), GFP_ATOMIC);
+	error->pinned_bo = kcalloc(error->vm_count, sizeof(*error->pinned_bo), GFP_ATOMIC);
+	error->active_bo_count = kcalloc(error->vm_count, sizeof(*error->active_bo_count),
 					 GFP_ATOMIC);
-	error->pinned_bo_count = kcalloc(vm_count, sizeof(*error->pinned_bo_count),
+	error->pinned_bo_count = kcalloc(error->vm_count, sizeof(*error->pinned_bo_count),
 					 GFP_ATOMIC);
 
 	list_for_each_entry(vm, &dev_priv->vm_list, global_link)
-- 
2.0.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 10/16] drm/i915: Add some extra guards in evict_vm
  2014-07-01 18:17 [PATCH 00/16] Enabling GEN8 full PPGTT + fixes Ben Widawsky
                   ` (8 preceding siblings ...)
  2014-07-01 18:17 ` [PATCH 09/16] drm/i915/error: Capture vmas instead of BOs Ben Widawsky
@ 2014-07-01 18:17 ` Ben Widawsky
  2014-07-01 18:17 ` [PATCH 11/16] drm/i915: Make an uninterruptible evict Ben Widawsky
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 34+ messages in thread
From: Ben Widawsky @ 2014-07-01 18:17 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_gem_evict.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_evict.c b/drivers/gpu/drm/i915/i915_gem_evict.c
index bbf4b12..38297d3 100644
--- a/drivers/gpu/drm/i915/i915_gem_evict.c
+++ b/drivers/gpu/drm/i915/i915_gem_evict.c
@@ -214,6 +214,7 @@ int i915_gem_evict_vm(struct i915_address_space *vm, bool do_idle)
 	struct i915_vma *vma, *next;
 	int ret;
 
+	BUG_ON(!mutex_is_locked(&vm->dev->struct_mutex));
 	trace_i915_gem_evict_vm(vm);
 
 	if (do_idle) {
@@ -222,11 +223,15 @@ int i915_gem_evict_vm(struct i915_address_space *vm, bool do_idle)
 			return ret;
 
 		i915_gem_retire_requests(vm->dev);
+
+		WARN_ON(!list_empty(&vm->active_list));
 	}
 
-	list_for_each_entry_safe(vma, next, &vm->inactive_list, mm_list)
+	list_for_each_entry_safe(vma, next, &vm->inactive_list, mm_list) {
+		WARN_ON(!i915_is_ggtt(vm) && vma->pin_count);
 		if (vma->pin_count == 0)
 			WARN_ON(i915_vma_unbind(vma));
+	}
 
 	return 0;
 }
-- 
2.0.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 11/16] drm/i915: Make an uninterruptible evict
  2014-07-01 18:17 [PATCH 00/16] Enabling GEN8 full PPGTT + fixes Ben Widawsky
                   ` (9 preceding siblings ...)
  2014-07-01 18:17 ` [PATCH 10/16] drm/i915: Add some extra guards in evict_vm Ben Widawsky
@ 2014-07-01 18:17 ` Ben Widawsky
  2014-07-01 18:17 ` [PATCH 12/16] drm/i915: Reorder ctx unref on ppgtt cleanup Ben Widawsky
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 34+ messages in thread
From: Ben Widawsky @ 2014-07-01 18:17 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

There are no users of this yet, but the idea is presented and split out
to find bugs.

Also, while here, return -ERESTARTSYS to the caller, in case they want
to do something with it.

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_drv.h            |  2 +-
 drivers/gpu/drm/i915/i915_gem_context.c    |  5 ++---
 drivers/gpu/drm/i915/i915_gem_evict.c      | 32 ++++++++++++++++++++++--------
 drivers/gpu/drm/i915/i915_gem_execbuffer.c |  2 +-
 4 files changed, 28 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index d3a69aa..6c252c3 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2478,7 +2478,7 @@ int __must_check i915_gem_evict_something(struct drm_device *dev,
 					  unsigned long start,
 					  unsigned long end,
 					  unsigned flags);
-int i915_gem_evict_vm(struct i915_address_space *vm, bool do_idle);
+int i915_gem_evict_vm(struct i915_address_space *vm, bool do_idle, bool interruptible);
 int i915_gem_evict_everything(struct drm_device *dev);
 
 /* belongs in i915_gem_gtt.h */
diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index cf7cf81..b6803b3 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -121,11 +121,10 @@ static void do_ppgtt_cleanup(struct i915_hw_ppgtt *ppgtt)
 			if (WARN_ON(list_empty(&vma->vma_link) ||
 				    list_is_singular(&vma->vma_link)))
 				break;
-
-		i915_gem_evict_vm(&ppgtt->base, true);
+		i915_gem_evict_vm(&ppgtt->base, true, true);
 	} else {
 		i915_gem_retire_requests(dev);
-		i915_gem_evict_vm(&ppgtt->base, false);
+		i915_gem_evict_vm(&ppgtt->base, false, true);
 	}
 
 	ppgtt->base.cleanup(&ppgtt->base);
diff --git a/drivers/gpu/drm/i915/i915_gem_evict.c b/drivers/gpu/drm/i915/i915_gem_evict.c
index 38297d3..ac8d90f 100644
--- a/drivers/gpu/drm/i915/i915_gem_evict.c
+++ b/drivers/gpu/drm/i915/i915_gem_evict.c
@@ -197,8 +197,9 @@ found:
 /**
  * i915_gem_evict_vm - Evict all idle vmas from a vm
  *
- * @vm: Address space to cleanse
- * @do_idle: Boolean directing whether to idle first.
+ * @vm:			Address space to cleanse
+ * @do_idle:		Boolean directing whether to idle first.
+ * @interruptible:	How to wait
  *
  * This function evicts all idles vmas from a vm. If all unpinned vmas should be
  * evicted the @do_idle needs to be set to true.
@@ -209,18 +210,24 @@ found:
  * To clarify: This is for freeing up virtual address space, not for freeing
  * memory in e.g. the shrinker.
  */
-int i915_gem_evict_vm(struct i915_address_space *vm, bool do_idle)
+int i915_gem_evict_vm(struct i915_address_space *vm, bool do_idle, bool interruptible)
 {
+	struct drm_i915_private *dev_priv = to_i915(vm->dev);
+
 	struct i915_vma *vma, *next;
+	bool was_intr = dev_priv->mm.interruptible;
 	int ret;
 
 	BUG_ON(!mutex_is_locked(&vm->dev->struct_mutex));
 	trace_i915_gem_evict_vm(vm);
 
+	if (!interruptible)
+		dev_priv->mm.interruptible = false;
+
 	if (do_idle) {
 		ret = i915_gpu_idle(vm->dev);
 		if (ret)
-			return ret;
+			goto out;
 
 		i915_gem_retire_requests(vm->dev);
 
@@ -229,11 +236,20 @@ int i915_gem_evict_vm(struct i915_address_space *vm, bool do_idle)
 
 	list_for_each_entry_safe(vma, next, &vm->inactive_list, mm_list) {
 		WARN_ON(!i915_is_ggtt(vm) && vma->pin_count);
-		if (vma->pin_count == 0)
-			WARN_ON(i915_vma_unbind(vma));
+		if (vma->pin_count == 0) {
+			ret = i915_vma_unbind(vma);
+			if (ret == -ERESTARTSYS) {
+				BUG_ON(!interruptible);
+				goto out;
+			} else if (ret)
+				WARN(1, "Failed to unbind vma %d\n", ret);
+		}
 	}
+	ret = 0;
 
-	return 0;
+out:
+	dev_priv->mm.interruptible = was_intr;
+	return ret;
 }
 
 /**
@@ -276,7 +292,7 @@ i915_gem_evict_everything(struct drm_device *dev)
 
 	/* Having flushed everything, unbind() should never raise an error */
 	list_for_each_entry(vm, &dev_priv->vm_list, global_link)
-		WARN_ON(i915_gem_evict_vm(vm, false));
+		WARN_ON(i915_gem_evict_vm(vm, false, true));
 
 	return 0;
 }
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index d815ef5..40ec61c 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -722,7 +722,7 @@ err:
 		list_for_each_entry(vma, vmas, exec_list)
 			i915_gem_execbuffer_unreserve_vma(vma);
 
-		ret = i915_gem_evict_vm(vm, true);
+		ret = i915_gem_evict_vm(vm, true, true);
 		if (ret)
 			return ret;
 	} while (1);
-- 
2.0.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 12/16] drm/i915: Reorder ctx unref on ppgtt cleanup
  2014-07-01 18:17 [PATCH 00/16] Enabling GEN8 full PPGTT + fixes Ben Widawsky
                   ` (10 preceding siblings ...)
  2014-07-01 18:17 ` [PATCH 11/16] drm/i915: Make an uninterruptible evict Ben Widawsky
@ 2014-07-01 18:17 ` Ben Widawsky
  2014-07-17  9:56   ` Daniel Vetter
  2014-07-01 18:17 ` [PATCH 13/16] drm/i915: More correct (slower) " Ben Widawsky
                   ` (5 subsequent siblings)
  17 siblings, 1 reply; 34+ messages in thread
From: Ben Widawsky @ 2014-07-01 18:17 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

The comment [which was mine] is wrong. The context object can never be
bound in a PPGTT because it is only capable of living in the Global GTT.
So, remove the comment, and reorder the unref. What's nice about the
latter is it keeps the context object alive past the PPGTT. This makes
the destroy ordering symmetric with the creation ordering.

Create:
1. Create context
2. Create PPGTT

Destroy:
1. Destroy PPGTT
2. Destroy context

As far as I know, this does not fix a bug. The code previously kept the
context data structure, only the object was gone. As the code was,
nothing tried to use the object after this point.

NOTE: If in the future we have cases where the PPGTT can/should outlive
the context (which doesn't occur today, but the code permits it), this
ordering does not matter. Even if this occurs, as it stands now, we do
not expect that to be the normal case, and having this order makes
debugging a bit easier if we're tracking object lifetimes for the
context vs ppgtt

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_gem_context.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index b6803b3..8d106d9 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -185,14 +185,12 @@ void i915_gem_context_free(struct kref *ctx_ref)
 		/* We refcount even the aliasing PPGTT to keep the code symmetric */
 		if (USES_PPGTT(ctx->obj->base.dev))
 			ppgtt = ctx_to_ppgtt(ctx);
-
-		/* XXX: Free up the object before tearing down the address space, in
-		 * case we're bound in the PPGTT */
-		drm_gem_object_unreference(&ctx->obj->base);
 	}
 
 	if (ppgtt)
 		kref_put(&ppgtt->ref, ppgtt_release);
+	if (ctx->obj)
+		drm_gem_object_unreference(&ctx->obj->base);
 	list_del(&ctx->link);
 	kfree(ctx);
 }
-- 
2.0.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 13/16] drm/i915: More correct (slower) ppgtt cleanup
  2014-07-01 18:17 [PATCH 00/16] Enabling GEN8 full PPGTT + fixes Ben Widawsky
                   ` (11 preceding siblings ...)
  2014-07-01 18:17 ` [PATCH 12/16] drm/i915: Reorder ctx unref on ppgtt cleanup Ben Widawsky
@ 2014-07-01 18:17 ` Ben Widawsky
  2014-07-17  9:49   ` Daniel Vetter
  2014-07-01 18:17 ` [PATCH 14/16] drm/i915: Defer PPGTT cleanup Ben Widawsky
                   ` (4 subsequent siblings)
  17 siblings, 1 reply; 34+ messages in thread
From: Ben Widawsky @ 2014-07-01 18:17 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

If a VM still have objects which are bound (exactly: have a node
reserved in the drm_mm), and we are in the middle of a reset, we have no
hope of the standard methods fixing the situation (ring idle won't
work). We must therefore let the reset handler take it's course, and
then we can resume tearing down the VM.

This logic very much duplicates^Wresembles the logic in our wait for
error code. I've decided to leave it as open coded because I expect this
bit of code to require tweaks and changes over time.

Interruptions via signal causes a really similar problem.

This should deprecate the need for the yet unmerged patch from Chris
(and an identical patch from me, which was first!!):
drm/i915: Prevent signals from interrupting close()

I have a followup patch to implement deferred free, before you complain.

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_gem_context.c | 51 +++++++++++++++++++++++++++++++--
 1 file changed, 48 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 8d106d9..e1b5613 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -101,6 +101,32 @@ static void do_ppgtt_cleanup(struct i915_hw_ppgtt *ppgtt)
 	struct drm_device *dev = ppgtt->base.dev;
 	struct drm_i915_private *dev_priv = dev->dev_private;
 	struct i915_address_space *vm = &ppgtt->base;
+	bool do_idle = false;
+	int ret;
+
+	/* If we get here while in reset, we need to let the reset handler run
+	 * first, or else our VM teardown isn't going to go smoothly. There are
+	 * a could of options at this point, but letting the reset handler do
+	 * it's thing is the most desirable. The reset handler will take care of
+	 * retiring the stuck requests.
+	 */
+	if (i915_reset_in_progress(&dev_priv->gpu_error)) {
+		mutex_unlock(&dev->struct_mutex);
+#define EXIT_COND (!i915_reset_in_progress(&dev_priv->gpu_error) || \
+		   i915_terminally_wedged(&dev_priv->gpu_error))
+		ret = wait_event_timeout(dev_priv->gpu_error.reset_queue,
+					 EXIT_COND,
+					 10 * HZ);
+		if (!ret) {
+			/* it's unlikely idling will solve anything, but it
+			 * shouldn't hurt to try. */
+			do_idle = true;
+			/* TODO: go down kicking and screaming harder */
+		}
+#undef EXIT_COND
+
+		mutex_lock(&dev->struct_mutex);
+	}
 
 	if (ppgtt == dev_priv->mm.aliasing_ppgtt ||
 	    (list_empty(&vm->active_list) && list_empty(&vm->inactive_list))) {
@@ -117,14 +143,33 @@ static void do_ppgtt_cleanup(struct i915_hw_ppgtt *ppgtt)
 	if (!list_empty(&vm->active_list)) {
 		struct i915_vma *vma;
 
+		do_idle = true;
 		list_for_each_entry(vma, &vm->active_list, mm_list)
 			if (WARN_ON(list_empty(&vma->vma_link) ||
 				    list_is_singular(&vma->vma_link)))
 				break;
-		i915_gem_evict_vm(&ppgtt->base, true, true);
-	} else {
+	} else
 		i915_gem_retire_requests(dev);
-		i915_gem_evict_vm(&ppgtt->base, false, true);
+
+	/* We have a problem here where VM teardown cannot be interrupted, or
+	 * else the ppgtt cleanup will fail. As an example, a precisely timed
+	 * SIGKILL could leads to an OOPS, or worse. There are two options:
+	 * 1. Make the eviction uninterruptible
+	 * 2. Defer the eviction if it was interrupted.
+	 *
+	 * Option #1 is not the friendliest, but it's the easiest to implement,
+	 * and least error prone.
+	 * TODO: Implement option 2
+	 */
+	ret = i915_gem_evict_vm(&ppgtt->base, do_idle, !do_idle);
+	if (ret == -ERESTARTSYS)
+		ret = i915_gem_evict_vm(&ppgtt->base, do_idle, false);
+	WARN_ON(ret);
+	WARN_ON(!list_empty(&vm->active_list));
+
+	/* This is going to blow up badly if the mm is unclean */
+	if (WARN_ON(!list_empty(&ppgtt->base.mm.head_node.node_list))) {
+		/* TODO: go down kicking and screaming harder++ */
 	}
 
 	ppgtt->base.cleanup(&ppgtt->base);
-- 
2.0.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 14/16] drm/i915: Defer PPGTT cleanup
  2014-07-01 18:17 [PATCH 00/16] Enabling GEN8 full PPGTT + fixes Ben Widawsky
                   ` (12 preceding siblings ...)
  2014-07-01 18:17 ` [PATCH 13/16] drm/i915: More correct (slower) " Ben Widawsky
@ 2014-07-01 18:17 ` Ben Widawsky
  2014-07-01 18:17 ` [PATCH 15/16] drm/i915/bdw: Enable full PPGTT Ben Widawsky
                   ` (3 subsequent siblings)
  17 siblings, 0 replies; 34+ messages in thread
From: Ben Widawsky @ 2014-07-01 18:17 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

The last patch made PPGTT free cases correct. It left a major problem
though where in many cases it was possible to have to idle the GPU in
order to destroy a VM. This is really unfortunate as it is stalling the
active GPU process for the dying GPU process.

The workqueue grew very tricky. I left in my original wait based version
as #if 0, and used Chris' recommendation with the seqno check. I haven't
measure one vs the other, but I am in favor of the code as it is. I am
just leaving the old code for people to observe.

NOTE: I somewhat expect this patch to not get merged because of work in
the pipeline. I won't argue much against that, and I'll simply say, this
solves a real problem, or platforms running with PPGTT. PPGTT is not
enabled by default yet, and I therefore see little harm in merging this.
Because I was expecting this to not get merged, I did not spent much
time investigating any corner cases, in particular, cases where I need
to cleanup the wq. If this patch does head in the merge direction, I
will take a closer look at that.

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_drv.h         |  10 +++
 drivers/gpu/drm/i915/i915_gem.c         | 110 ++++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/i915_gem_context.c |  40 ++++++++----
 drivers/gpu/drm/i915/i915_gem_gtt.c     |   2 +-
 drivers/gpu/drm/i915/i915_gem_gtt.h     |   2 +
 5 files changed, 150 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 6c252c3..c23213d 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1077,6 +1077,15 @@ struct i915_gem_mm {
 	struct delayed_work idle_work;
 
 	/**
+	 * PPGTT freeing often happens during interruptible times (fd close,
+	 * execbuf, busy_ioctl). It therefore becomes difficult to clean up the
+	 * PPGTT when the refcount reaches 0 if a signal comes in. This
+	 * workqueue defers the cleanup of the VM to a later time, and allows
+	 * userspace to continue on.
+	 */
+	struct delayed_work ppgtt_work;
+
+	/**
 	 * Are we in a non-interruptible section of code like
 	 * modesetting?
 	 */
@@ -1461,6 +1470,7 @@ struct drm_i915_private {
 	struct mutex modeset_restore_lock;
 
 	struct list_head vm_list; /* Global list of all address spaces */
+	struct list_head ppgtt_free_list;
 	struct i915_gtt gtt; /* VM representing the global address space */
 
 	struct i915_gem_mm mm;
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index f6d1238..561bd64 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2707,6 +2707,112 @@ i915_gem_idle_work_handler(struct work_struct *work)
 	intel_mark_idle(dev_priv->dev);
 }
 
+static void
+i915_gem_ppgtt_work_handler(struct work_struct *work)
+{
+	struct drm_i915_private *dev_priv =
+		container_of(work, typeof(*dev_priv), mm.ppgtt_work.work);
+	struct i915_address_space *vm, *v;
+#if 0
+	unsigned reset_counter;
+	int ret;
+#endif
+
+	if (!mutex_trylock(&dev_priv->dev->struct_mutex)) {
+		queue_delayed_work(dev_priv->wq, &dev_priv->mm.ppgtt_work,
+				   round_jiffies_up_relative(HZ));
+		return;
+	}
+
+	list_for_each_entry_safe(vm, v, &dev_priv->ppgtt_free_list, global_link) {
+		struct i915_hw_ppgtt *ppgtt = container_of(vm, struct i915_hw_ppgtt, base);
+		struct i915_vma *vma = NULL, *vma_active = NULL, *vma_inactive = NULL;
+
+		/* The following attempts to find the newest (most recently
+		 * activated/inactivated) VMA.
+		 */
+		if (!list_empty(&ppgtt->base.active_list)) {
+			vma_active = list_last_entry(&ppgtt->base.active_list,
+						     typeof(*vma), mm_list);
+		}
+		if (!list_empty(&ppgtt->base.inactive_list)) {
+			vma_inactive = list_last_entry(&ppgtt->base.inactive_list,
+						       typeof(*vma), mm_list);
+		}
+
+		if (vma_active)
+			vma = vma_active;
+		else if (vma_inactive)
+			vma = vma_inactive;
+
+		/* Sanity check */
+		if (vma_active && vma_inactive) {
+			if (WARN_ON(vma_active->retire_seqno <= vma_inactive->retire_seqno))
+				vma = vma_inactive;
+		}
+
+		if (!vma)
+			goto finish;
+
+		/* Another sanity check */
+		if (WARN_ON(!vma->retire_seqno))
+			continue;
+
+		/* NOTE: We could wait here for the seqno, but that makes things
+		 * significantly more complex. */
+		if (!i915_seqno_passed(vma->retire_ring->get_seqno(vma->retire_ring, true), vma->retire_seqno))
+			break;
+
+#if 0
+		reset_counter = atomic_read(&dev_priv->gpu_error.reset_counter);
+		mutex_unlock(&dev_priv->dev->struct_mutex);
+		ret = __wait_seqno(vma->retire_ring, vma->retire_seqno,
+				   reset_counter, true, NULL, NULL);
+
+		if (i915_mutex_lock_interruptible(dev_priv->dev)) {
+			queue_delayed_work(dev_priv->wq, &dev_priv->mm.ppgtt_work,
+					   round_jiffies_up_relative(HZ));
+			return;
+		}
+
+		if (ret)
+			continue;
+
+#endif
+finish:
+		/* If the object was destroyed while our VMA was dangling, the
+		 * object free code did the synchronous cleanup for us.
+		 * Therefore every node on this list should have a living object
+		 * (and VMA), but let's try not to use it in case future code
+		 * allows a VMA to outlive an object.
+		 */
+		while (!list_empty(&ppgtt->base.mm.head_node.node_list)) {
+			struct drm_mm_node *node;
+			struct i915_vma *vma;
+			struct drm_i915_gem_object *obj;
+
+			node = list_first_entry(&ppgtt->base.mm.head_node.node_list,
+						struct drm_mm_node,
+						node_list);
+			vma = container_of(node, struct i915_vma, node);
+			obj = vma->obj;
+
+			drm_mm_remove_node(node);
+			i915_gem_vma_destroy(vma);
+			i915_gem_object_unpin_pages(obj);
+		}
+
+		ppgtt->base.cleanup(&ppgtt->base);
+		kfree(ppgtt);
+	}
+
+	if (!list_empty(&dev_priv->ppgtt_free_list))
+		queue_delayed_work(dev_priv->wq, &dev_priv->mm.ppgtt_work,
+				   round_jiffies_up_relative(HZ));
+
+	mutex_unlock(&dev_priv->dev->struct_mutex);
+}
+
 /**
  * Ensures that an object will eventually get non-busy by flushing any required
  * write domains, emitting any outstanding lazy request and retiring and
@@ -4551,6 +4657,7 @@ i915_gem_suspend(struct drm_device *dev)
 	mutex_unlock(&dev->struct_mutex);
 
 	del_timer_sync(&dev_priv->gpu_error.hangcheck_timer);
+	cancel_delayed_work_sync(&dev_priv->mm.ppgtt_work);
 	cancel_delayed_work_sync(&dev_priv->mm.retire_work);
 	cancel_delayed_work_sync(&dev_priv->mm.idle_work);
 
@@ -4892,6 +4999,7 @@ i915_gem_load(struct drm_device *dev)
 				  NULL);
 
 	INIT_LIST_HEAD(&dev_priv->vm_list);
+	INIT_LIST_HEAD(&dev_priv->ppgtt_free_list);
 	i915_init_vm(dev_priv, &dev_priv->gtt.base);
 
 	INIT_LIST_HEAD(&dev_priv->context_list);
@@ -4906,6 +5014,8 @@ i915_gem_load(struct drm_device *dev)
 			  i915_gem_retire_work_handler);
 	INIT_DELAYED_WORK(&dev_priv->mm.idle_work,
 			  i915_gem_idle_work_handler);
+	INIT_DELAYED_WORK(&dev_priv->mm.ppgtt_work,
+			  i915_gem_ppgtt_work_handler);
 	init_waitqueue_head(&dev_priv->gpu_error.reset_queue);
 
 	/* On GEN3 we really need to make sure the ARB C3 LP bit is set */
diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index e1b5613..5b4a9a0 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -96,7 +96,7 @@
 #define GEN6_CONTEXT_ALIGN (64<<10)
 #define GEN7_CONTEXT_ALIGN 4096
 
-static void do_ppgtt_cleanup(struct i915_hw_ppgtt *ppgtt)
+static int do_ppgtt_cleanup(struct i915_hw_ppgtt *ppgtt)
 {
 	struct drm_device *dev = ppgtt->base.dev;
 	struct drm_i915_private *dev_priv = dev->dev_private;
@@ -131,7 +131,7 @@ static void do_ppgtt_cleanup(struct i915_hw_ppgtt *ppgtt)
 	if (ppgtt == dev_priv->mm.aliasing_ppgtt ||
 	    (list_empty(&vm->active_list) && list_empty(&vm->inactive_list))) {
 		ppgtt->base.cleanup(&ppgtt->base);
-		return;
+		return 0;
 	}
 
 	/*
@@ -153,17 +153,30 @@ static void do_ppgtt_cleanup(struct i915_hw_ppgtt *ppgtt)
 
 	/* We have a problem here where VM teardown cannot be interrupted, or
 	 * else the ppgtt cleanup will fail. As an example, a precisely timed
-	 * SIGKILL could leads to an OOPS, or worse. There are two options:
-	 * 1. Make the eviction uninterruptible
-	 * 2. Defer the eviction if it was interrupted.
-	 *
-	 * Option #1 is not the friendliest, but it's the easiest to implement,
-	 * and least error prone.
-	 * TODO: Implement option 2
+	 * SIGKILL could leads to an OOPS, or worse. The real solution is to
+	 * properly track the VMA <-> OBJ relationship. This temporary bandaid
+	 * will simply defer the free until we know the seqno has passed for
+	 * this VMA.
 	 */
 	ret = i915_gem_evict_vm(&ppgtt->base, do_idle, !do_idle);
-	if (ret == -ERESTARTSYS)
-		ret = i915_gem_evict_vm(&ppgtt->base, do_idle, false);
+	if (ret == -ERESTARTSYS) {
+		struct drm_mm_node *entry;
+		/* First mark all VMAs */
+		drm_mm_for_each_node(entry, &ppgtt->base.mm) {
+			struct i915_vma *vma = container_of(entry, struct i915_vma, node);
+			vma->retire_seqno = vma->obj->last_read_seqno;
+			vma->retire_ring = vma->obj->ring;
+			/* It's okay to lose the object, we just can't lose the
+			 * VMA */
+		}
+
+		list_move_tail(&ppgtt->base.global_link, &dev_priv->ppgtt_free_list);
+		queue_delayed_work(dev_priv->wq,
+				   &dev_priv->mm.ppgtt_work,
+				   round_jiffies_up_relative(HZ));
+		return -EAGAIN;
+	}
+
 	WARN_ON(ret);
 	WARN_ON(!list_empty(&vm->active_list));
 
@@ -173,6 +186,7 @@ static void do_ppgtt_cleanup(struct i915_hw_ppgtt *ppgtt)
 	}
 
 	ppgtt->base.cleanup(&ppgtt->base);
+	return 0;
 }
 
 static void ppgtt_release(struct kref *kref)
@@ -180,8 +194,8 @@ static void ppgtt_release(struct kref *kref)
 	struct i915_hw_ppgtt *ppgtt =
 		container_of(kref, struct i915_hw_ppgtt, ref);
 
-	do_ppgtt_cleanup(ppgtt);
-	kfree(ppgtt);
+	if (!do_ppgtt_cleanup(ppgtt))
+		kfree(ppgtt);
 }
 
 static size_t get_context_alignment(struct drm_device *dev)
diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
index 88451dc..b2d14d7 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -1029,7 +1029,7 @@ static void gen6_ppgtt_cleanup(struct i915_address_space *vm)
 		container_of(vm, struct i915_hw_ppgtt, base);
 
 	list_del(&vm->global_link);
-	drm_mm_takedown(&ppgtt->base.mm);
+	drm_mm_takedown(&vm->mm);
 	drm_mm_remove_node(&ppgtt->node);
 
 	gen6_ppgtt_unmap_pages(ppgtt);
diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.h b/drivers/gpu/drm/i915/i915_gem_gtt.h
index 1d75801..d0dc567 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.h
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.h
@@ -137,6 +137,8 @@ struct i915_vma {
 	struct hlist_node exec_node;
 	unsigned long exec_handle;
 	struct drm_i915_gem_exec_object2 *exec_entry;
+	uint32_t retire_seqno; /* Last active seqno at context desruction */
+	struct intel_engine_cs *retire_ring; /* Last ring for retire_seqno */
 
 	/**
 	 * How many users have pinned this object in GTT space. The following
-- 
2.0.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 15/16] drm/i915/bdw: Enable full PPGTT
  2014-07-01 18:17 [PATCH 00/16] Enabling GEN8 full PPGTT + fixes Ben Widawsky
                   ` (13 preceding siblings ...)
  2014-07-01 18:17 ` [PATCH 14/16] drm/i915: Defer PPGTT cleanup Ben Widawsky
@ 2014-07-01 18:17 ` Ben Widawsky
  2014-07-01 18:17 ` [PATCH 16/16] drm/i915: Get the error state over the wire (HACKish) Ben Widawsky
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 34+ messages in thread
From: Ben Widawsky @ 2014-07-01 18:17 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

Broadwell is perfectly capable of full PPGTT. I've been using it for
some time, and seen no especially ill effects.

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>

Conflicts:
	drivers/gpu/drm/i915/i915_drv.h
---
 drivers/gpu/drm/i915/i915_drv.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index c23213d..1045006 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2003,7 +2003,7 @@ struct drm_i915_cmd_table {
 
 #define HAS_HW_CONTEXTS(dev)	(INTEL_INFO(dev)->gen >= 6)
 #define HAS_ALIASING_PPGTT(dev)	(INTEL_INFO(dev)->gen >= 6)
-#define HAS_PPGTT(dev)		(INTEL_INFO(dev)->gen >= 7 && !IS_GEN8(dev))
+#define HAS_PPGTT(dev)		(INTEL_INFO(dev)->gen >= 7 && !IS_CHERRYVIEW(dev))
 #define USES_PPGTT(dev)		intel_enable_ppgtt(dev, false)
 #define USES_FULL_PPGTT(dev)	intel_enable_ppgtt(dev, true)
 
-- 
2.0.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 16/16] drm/i915: Get the error state over the wire (HACKish)
  2014-07-01 18:17 [PATCH 00/16] Enabling GEN8 full PPGTT + fixes Ben Widawsky
                   ` (14 preceding siblings ...)
  2014-07-01 18:17 ` [PATCH 15/16] drm/i915/bdw: Enable full PPGTT Ben Widawsky
@ 2014-07-01 18:17 ` Ben Widawsky
  2014-07-04  8:02   ` Chris Wilson
  2014-07-03 22:01 ` [PATCH 1/2] drm/i915/gen8: Invalidate TLBs before PDP reload Ben Widawsky
  2014-07-17 12:04 ` [PATCH 00/16] Enabling GEN8 full PPGTT + fixes Daniel Vetter
  17 siblings, 1 reply; 34+ messages in thread
From: Ben Widawsky @ 2014-07-01 18:17 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

I was dealing with a bug recently where the system would hard hang
somewhere between hangcheck and reset. There was time after error
collection to actually get my error state out, but I couldn't get the
reads to work.

This patch is also useful for when reset kills the machine, and you want
to keep reset enabled but still get error state.

Since I found the patch pretty useful, I decided to clean it up and
submit it. It was mostly meant as a one-off hack originally though.

If a maintainer decides it's useful, then here it is.

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_debugfs.c   |  2 +-
 drivers/gpu/drm/i915/i915_drv.h       |  3 ++-
 drivers/gpu/drm/i915/i915_gpu_error.c | 31 +++++++++++++++++++++++++------
 drivers/gpu/drm/i915/i915_sysfs.c     |  2 +-
 4 files changed, 29 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index 6b7b32b..2daad46 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -929,7 +929,7 @@ static ssize_t i915_error_state_read(struct file *file, char __user *userbuf,
 	if (ret)
 		return ret;
 
-	ret = i915_error_state_to_str(&error_str, error_priv);
+	ret = i915_error_state_to_str(&error_str, error_priv->dev, error_priv->error);
 	if (ret)
 		goto out;
 
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 1045006..b6a4f1e 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2544,7 +2544,8 @@ static inline void intel_display_crc_init(struct drm_device *dev) {}
 __printf(2, 3)
 void i915_error_printf(struct drm_i915_error_state_buf *e, const char *f, ...);
 int i915_error_state_to_str(struct drm_i915_error_state_buf *estr,
-			    const struct i915_error_state_file_priv *error);
+			    struct drm_device *dev,
+			    const struct drm_i915_error_state *error);
 int i915_error_state_buf_init(struct drm_i915_error_state_buf *eb,
 			      size_t count, loff_t pos);
 static inline void i915_error_state_buf_release(
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index e82e590..1540bf6 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -184,8 +184,22 @@ static void i915_error_puts(struct drm_i915_error_state_buf *e,
 	__i915_error_advance(e, len);
 }
 
-#define err_printf(e, ...) i915_error_printf(e, __VA_ARGS__)
-#define err_puts(e, s) i915_error_puts(e, s)
+
+static bool wire = false;
+#define err_printf(e, ...) do {				\
+	if (wire) {					\
+		printk(__VA_ARGS__);			\
+	} else {					\
+		i915_error_printf(e, __VA_ARGS__);	\
+	}						\
+} while (0)
+#define err_puts(e, s) do {				\
+	if (wire) {					\
+		printk(s);				\
+	} else {					\
+		i915_error_puts(e, s);			\
+	}						\
+} while (0)
 
 static void print_error_buffers(struct drm_i915_error_state_buf *m,
 				const char *name,
@@ -240,7 +254,7 @@ static const char *hangcheck_action_to_str(enum intel_ring_hangcheck_action a)
 
 static void i915_ring_error_state(struct drm_i915_error_state_buf *m,
 				  struct drm_device *dev,
-				  struct drm_i915_error_ring *ring)
+				  const struct drm_i915_error_ring *ring)
 {
 	if (!ring->valid)
 		return;
@@ -322,11 +336,10 @@ static void print_error_obj(struct drm_i915_error_state_buf *m,
 }
 
 int i915_error_state_to_str(struct drm_i915_error_state_buf *m,
-			    const struct i915_error_state_file_priv *error_priv)
+			    struct drm_device *dev,
+			    const struct drm_i915_error_state *error)
 {
-	struct drm_device *dev = error_priv->dev;
 	struct drm_i915_private *dev_priv = dev->dev_private;
-	struct drm_i915_error_state *error = error_priv->error;
 	int i, j, offset, elt;
 	int max_hangcheck_score;
 
@@ -1197,6 +1210,12 @@ void i915_capture_error_state(struct drm_device *dev, bool wedged,
 	spin_lock_irqsave(&dev_priv->gpu_error.lock, flags);
 	if (dev_priv->gpu_error.first_error == NULL) {
 		dev_priv->gpu_error.first_error = error;
+#ifdef PUSH_TO_WIRE
+		/* Probably racy, but this is emergency debug */
+		wire = true;
+		i915_error_state_to_str(NULL, dev, error);
+		wire = false;
+#endif
 		error = NULL;
 	}
 	spin_unlock_irqrestore(&dev_priv->gpu_error.lock, flags);
diff --git a/drivers/gpu/drm/i915/i915_sysfs.c b/drivers/gpu/drm/i915/i915_sysfs.c
index 86ce39a..6f4be9d 100644
--- a/drivers/gpu/drm/i915/i915_sysfs.c
+++ b/drivers/gpu/drm/i915/i915_sysfs.c
@@ -512,7 +512,7 @@ static ssize_t error_state_read(struct file *filp, struct kobject *kobj,
 	error_priv.dev = dev;
 	i915_error_state_get(dev, &error_priv);
 
-	ret = i915_error_state_to_str(&error_str, &error_priv);
+	ret = i915_error_state_to_str(&error_str, dev, error_priv.error);
 	if (ret)
 		goto out;
 
-- 
2.0.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 1/2] drm/i915/gen8: Invalidate TLBs before PDP reload
  2014-07-01 18:17 [PATCH 00/16] Enabling GEN8 full PPGTT + fixes Ben Widawsky
                   ` (15 preceding siblings ...)
  2014-07-01 18:17 ` [PATCH 16/16] drm/i915: Get the error state over the wire (HACKish) Ben Widawsky
@ 2014-07-03 22:01 ` Ben Widawsky
  2014-07-03 22:01   ` [PATCH 2/2] drm/i915: Remove false assertion in ppgtt_release Ben Widawsky
  2014-07-04  7:51   ` [PATCH 1/2] drm/i915/gen8: Invalidate TLBs before PDP reload Chris Wilson
  2014-07-17 12:04 ` [PATCH 00/16] Enabling GEN8 full PPGTT + fixes Daniel Vetter
  17 siblings, 2 replies; 34+ messages in thread
From: Ben Widawsky @ 2014-07-03 22:01 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

This is a spec requirement for all rings.

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_gem_context.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 5b4a9a0..1ac648f 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -870,6 +870,9 @@ int i915_switch_context(struct intel_engine_cs *ring,
 	if (from == to && !to->remap_slice)
 		return 0;
 
+	if (IS_GEN8(ring->dev))
+		WARN_ON(ring->flush(ring, I915_GEM_GPU_DOMAINS, 0));
+
 	if (ring->id == RCS)
 		return do_switch_rcs(ring, from, to);
 	else
-- 
2.0.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 2/2] drm/i915: Remove false assertion in ppgtt_release
  2014-07-03 22:01 ` [PATCH 1/2] drm/i915/gen8: Invalidate TLBs before PDP reload Ben Widawsky
@ 2014-07-03 22:01   ` Ben Widawsky
  2014-07-04  7:51   ` [PATCH 1/2] drm/i915/gen8: Invalidate TLBs before PDP reload Chris Wilson
  1 sibling, 0 replies; 34+ messages in thread
From: Ben Widawsky @ 2014-07-03 22:01 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

Originally the thought for the assertion was that if there are no real
VMAs (died during execbuf), or there is only 1 VMA, but the VMA is on
the active list, it's a bug. The former case is pretty obvious. The
later case simply meant to assert the context unref/object retire
interactions were working properly

There is a flaw in the logic of the second when an object has multiple
VMAs. If there are multiple VMAs, it's possible that the object
continually had it's seqno increased as it was used by another context.
In this case, the context ref will die, but the VMA will not be taking
off the active list because of the missing retire seqno for a VMA.

Like some of the other fixes I've submitted recently, this should be
fixed by the eventual work Daniel will do.

This is pretty easy to reproduce whenever mesa uses the blit engine.

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_gem_context.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 1ac648f..2b39fca 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -145,8 +145,7 @@ static int do_ppgtt_cleanup(struct i915_hw_ppgtt *ppgtt)
 
 		do_idle = true;
 		list_for_each_entry(vma, &vm->active_list, mm_list)
-			if (WARN_ON(list_empty(&vma->vma_link) ||
-				    list_is_singular(&vma->vma_link)))
+			if (WARN_ON(list_empty(&vma->vma_link)))
 				break;
 	} else
 		i915_gem_retire_requests(dev);
-- 
2.0.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] drm/i915/gen8: Invalidate TLBs before PDP reload
  2014-07-03 22:01 ` [PATCH 1/2] drm/i915/gen8: Invalidate TLBs before PDP reload Ben Widawsky
  2014-07-03 22:01   ` [PATCH 2/2] drm/i915: Remove false assertion in ppgtt_release Ben Widawsky
@ 2014-07-04  7:51   ` Chris Wilson
  2014-07-04 16:55     ` Ben Widawsky
  1 sibling, 1 reply; 34+ messages in thread
From: Chris Wilson @ 2014-07-04  7:51 UTC (permalink / raw)
  To: Ben Widawsky; +Cc: Intel GFX, Ben Widawsky

On Thu, Jul 03, 2014 at 03:01:49PM -0700, Ben Widawsky wrote:
> This is a spec requirement for all rings.
> 
> Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
> ---
>  drivers/gpu/drm/i915/i915_gem_context.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> index 5b4a9a0..1ac648f 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> @@ -870,6 +870,9 @@ int i915_switch_context(struct intel_engine_cs *ring,
>  	if (from == to && !to->remap_slice)
>  		return 0;
>  
> +	if (IS_GEN8(ring->dev))
> +		WARN_ON(ring->flush(ring, I915_GEM_GPU_DOMAINS, 0));
> +

That's intel_ring_invalidate_all_caches(). The only time we won't be
doing this are the forced switches at startup/reset and close. Is the
requirement before or after the SET_CONTEXT? What I would prefer doing
is moving the invalidate-dispatch-flush-signal into one atomic operation
(that would help simplify execlist as well) - would that be good enough
here to be sure that an invalidate is performed after the context
switch and before the next batch?
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 08/16] drm/i915/error: Do a better job of disambiguating VMAs
  2014-07-01 18:17 ` [PATCH 08/16] drm/i915/error: Do a better job of disambiguating VMAs Ben Widawsky
@ 2014-07-04  7:57   ` Chris Wilson
  2014-07-04 16:56     ` Ben Widawsky
  0 siblings, 1 reply; 34+ messages in thread
From: Chris Wilson @ 2014-07-04  7:57 UTC (permalink / raw)
  To: Ben Widawsky; +Cc: Intel GFX, Ben Widawsky

On Tue, Jul 01, 2014 at 11:17:43AM -0700, Ben Widawsky wrote:
> Some of the original PPGTT patches in this area where unmerged, and this
> left a lot of confusion in our error capture with regard to which vm/obj
> we want to capture. There have been at least a couple of patches from
> Chris, and myself to try to fix this up; so here is another shot. Nobody
> running without full PPGTT is effected by this, and that is probably why
> nobody has bothered to fix it yet.
> 
> Instead of using any of the global lists to find the VMAs we want to
> capture, we use the union of the active, and the inactive list in the
> VM. This allows us to replace our capture_bo with capture_vma, and know
> all the VMAs we want to capture are valid.
> 
> I could have probably figured out a way to reuse mm_list. As we've had
> bugs here before in the shrinker, I think the best way forward is to get
> it working, and then optimize it later.
> 
> Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
> ---
>  drivers/gpu/drm/i915/i915_gem_gtt.c   |  1 +
>  drivers/gpu/drm/i915/i915_gem_gtt.h   |  2 ++
>  drivers/gpu/drm/i915/i915_gpu_error.c | 39 ++++++++++++++++++++++-------------
>  3 files changed, 28 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
> index a4153ee..88451dc 100644
> --- a/drivers/gpu/drm/i915/i915_gem_gtt.c
> +++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
> @@ -2114,6 +2114,7 @@ static struct i915_vma *__i915_gem_vma_create(struct drm_i915_gem_object *obj,
>  		return ERR_PTR(-ENOMEM);
>  
>  	INIT_LIST_HEAD(&vma->vma_link);
> +	INIT_LIST_HEAD(&vma->pin_capture_link);
>  	INIT_LIST_HEAD(&vma->mm_list);
>  	INIT_LIST_HEAD(&vma->exec_list);
>  	vma->vm = vm;
> diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.h b/drivers/gpu/drm/i915/i915_gem_gtt.h
> index 8d6f7c1..1d75801 100644
> --- a/drivers/gpu/drm/i915/i915_gem_gtt.h
> +++ b/drivers/gpu/drm/i915/i915_gem_gtt.h
> @@ -126,6 +126,8 @@ struct i915_vma {
>  
>  	struct list_head vma_link; /* Link in the object's VMA list */
>  
> +	struct list_head pin_capture_link; /* Link in the error capture */
> +
>  	/** This vma's place in the batchbuffer or on the eviction list */
>  	struct list_head exec_list;

We already have a slot for temporary lists...
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 16/16] drm/i915: Get the error state over the wire (HACKish)
  2014-07-01 18:17 ` [PATCH 16/16] drm/i915: Get the error state over the wire (HACKish) Ben Widawsky
@ 2014-07-04  8:02   ` Chris Wilson
  0 siblings, 0 replies; 34+ messages in thread
From: Chris Wilson @ 2014-07-04  8:02 UTC (permalink / raw)
  To: Ben Widawsky; +Cc: Intel GFX, Ben Widawsky

On Tue, Jul 01, 2014 at 11:17:51AM -0700, Ben Widawsky wrote:
> I was dealing with a bug recently where the system would hard hang
> somewhere between hangcheck and reset. There was time after error
> collection to actually get my error state out, but I couldn't get the
> reads to work.
> 
> This patch is also useful for when reset kills the machine, and you want
> to keep reset enabled but still get error state.
> 
> Since I found the patch pretty useful, I decided to clean it up and
> submit it. It was mostly meant as a one-off hack originally though.
> 
> If a maintainer decides it's useful, then here it is.

I think we could certainly push the register dump into DRM_DEBUG. That's
usually enough to identify duplicates and so may be useful from the dev
perspective. Or throw in a QA parameter that dumps an interesting subset to
KERN_ERR. Most people seem to manage to include the dmesg so getting the
right information into it would reduce some of the annoyance in vague
bug reports. (Too much of the wrong information though is equally
annoying.)
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 05/16] drm/i915/ctx: Return earlier on failure
  2014-07-01 18:17 ` [PATCH 05/16] drm/i915/ctx: Return earlier on failure Ben Widawsky
@ 2014-07-04  8:14   ` Chris Wilson
  0 siblings, 0 replies; 34+ messages in thread
From: Chris Wilson @ 2014-07-04  8:14 UTC (permalink / raw)
  To: Ben Widawsky; +Cc: Intel GFX, Ben Widawsky

On Tue, Jul 01, 2014 at 11:17:40AM -0700, Ben Widawsky wrote:
> As what was correctly debugged here:
> commit acc240d41ea1ab9c488a79219fb313b5b46265ae
> Author: Daniel Vetter <daniel.vetter@ffwll.ch>
> Date:   Thu Dec 5 15:42:34 2013 +0100
> 
>     drm/i915: Fix use-after-free in do_switch
> 
> It then becomes apparent that the default context cannot be the context
> being switched to for context switch because it is always bound. It
> follows that if the ring->last_context (from) has changed after the
> bind_to_gtt, it will always be the default context - this is commented
> in the code block.
> 
> This assertion will help catch issues without our logic sooner than
> letting the system move long (which is possible for some time).
> 
> I really want this to be a BUG(), but I also want the patch to get
> merged. I think the fact that none of the ERRNOs make any sense at all
> is just more evidence that this shouldn't be a WARN.

Hmm, we could adopt EREMOTEIO or for driver bugs.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] drm/i915/gen8: Invalidate TLBs before PDP reload
  2014-07-04  7:51   ` [PATCH 1/2] drm/i915/gen8: Invalidate TLBs before PDP reload Chris Wilson
@ 2014-07-04 16:55     ` Ben Widawsky
  0 siblings, 0 replies; 34+ messages in thread
From: Ben Widawsky @ 2014-07-04 16:55 UTC (permalink / raw)
  To: Chris Wilson, Ben Widawsky, Intel GFX

On Fri, Jul 04, 2014 at 08:51:59AM +0100, Chris Wilson wrote:
> On Thu, Jul 03, 2014 at 03:01:49PM -0700, Ben Widawsky wrote:
> > This is a spec requirement for all rings.
> > 
> > Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
> > ---
> >  drivers/gpu/drm/i915/i915_gem_context.c | 3 +++
> >  1 file changed, 3 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> > index 5b4a9a0..1ac648f 100644
> > --- a/drivers/gpu/drm/i915/i915_gem_context.c
> > +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> > @@ -870,6 +870,9 @@ int i915_switch_context(struct intel_engine_cs *ring,
> >  	if (from == to && !to->remap_slice)
> >  		return 0;
> >  
> > +	if (IS_GEN8(ring->dev))
> > +		WARN_ON(ring->flush(ring, I915_GEM_GPU_DOMAINS, 0));
> > +
> 
> That's intel_ring_invalidate_all_caches(). The only time we won't be
> doing this are the forced switches at startup/reset and close.

So there were three reasons for doing this.
1. AFAICT there were cases when we don't, which you seem to agree on.
2. I liked the explicit nature which /should/ have minimal impact given
that it's already done. It's future proof (to whatever extent such a
thing is possible). It's clear.
3. The worst of the reasons. On more than one occasion, I sent a trace
to certain people and they complained I was missing the TLB invalidate.
So I put it right there where nobody could miss it.

> Is the requirement before or after the SET_CONTEXT?

The requirement is before the PDP load. Which on GEN8 RCS means, before
MI_SET_CONTEXT. When we will RESTORE_INHIBIT, it must be before the
LRIs.

> What I would prefer doing
> is moving the invalidate-dispatch-flush-signal into one atomic operation
> (that would help simplify execlist as well) - would that be good enough
> here to be sure that an invalidate is performed after the context
> switch and before the next batch?
> -Chris
> 

As long as the ordering is correct based on the above, it sounds fine to
me. Assuming any of the other patches in the series gets merged though,
I'd propose sticking this one in as well, until the final solution
lands.

-- 
Ben Widawsky, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 08/16] drm/i915/error: Do a better job of disambiguating VMAs
  2014-07-04  7:57   ` Chris Wilson
@ 2014-07-04 16:56     ` Ben Widawsky
  2014-07-17  8:51       ` Daniel Vetter
  0 siblings, 1 reply; 34+ messages in thread
From: Ben Widawsky @ 2014-07-04 16:56 UTC (permalink / raw)
  To: Chris Wilson, Ben Widawsky, Intel GFX

On Fri, Jul 04, 2014 at 08:57:08AM +0100, Chris Wilson wrote:
> On Tue, Jul 01, 2014 at 11:17:43AM -0700, Ben Widawsky wrote:
> > Some of the original PPGTT patches in this area where unmerged, and this
> > left a lot of confusion in our error capture with regard to which vm/obj
> > we want to capture. There have been at least a couple of patches from
> > Chris, and myself to try to fix this up; so here is another shot. Nobody
> > running without full PPGTT is effected by this, and that is probably why
> > nobody has bothered to fix it yet.
> > 
> > Instead of using any of the global lists to find the VMAs we want to
> > capture, we use the union of the active, and the inactive list in the
> > VM. This allows us to replace our capture_bo with capture_vma, and know
> > all the VMAs we want to capture are valid.
> > 
> > I could have probably figured out a way to reuse mm_list. As we've had
> > bugs here before in the shrinker, I think the best way forward is to get
> > it working, and then optimize it later.
> > 
> > Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
> > ---
> >  drivers/gpu/drm/i915/i915_gem_gtt.c   |  1 +
> >  drivers/gpu/drm/i915/i915_gem_gtt.h   |  2 ++
> >  drivers/gpu/drm/i915/i915_gpu_error.c | 39 ++++++++++++++++++++++-------------
> >  3 files changed, 28 insertions(+), 14 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
> > index a4153ee..88451dc 100644
> > --- a/drivers/gpu/drm/i915/i915_gem_gtt.c
> > +++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
> > @@ -2114,6 +2114,7 @@ static struct i915_vma *__i915_gem_vma_create(struct drm_i915_gem_object *obj,
> >  		return ERR_PTR(-ENOMEM);
> >  
> >  	INIT_LIST_HEAD(&vma->vma_link);
> > +	INIT_LIST_HEAD(&vma->pin_capture_link);
> >  	INIT_LIST_HEAD(&vma->mm_list);
> >  	INIT_LIST_HEAD(&vma->exec_list);
> >  	vma->vm = vm;
> > diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.h b/drivers/gpu/drm/i915/i915_gem_gtt.h
> > index 8d6f7c1..1d75801 100644
> > --- a/drivers/gpu/drm/i915/i915_gem_gtt.h
> > +++ b/drivers/gpu/drm/i915/i915_gem_gtt.h
> > @@ -126,6 +126,8 @@ struct i915_vma {
> >  
> >  	struct list_head vma_link; /* Link in the object's VMA list */
> >  
> > +	struct list_head pin_capture_link; /* Link in the error capture */
> > +
> >  	/** This vma's place in the batchbuffer or on the eviction list */
> >  	struct list_head exec_list;
> 
> We already have a slot for temporary lists...
> -Chris
> 

I did mention that in the commit message, if I caught your meaning.

-- 
Ben Widawsky, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 06/16] drm/i915/error: Check the potential ctx obj's vm
  2014-07-01 18:17 ` [PATCH 06/16] drm/i915/error: Check the potential ctx obj's vm Ben Widawsky
@ 2014-07-17  8:47   ` Daniel Vetter
  0 siblings, 0 replies; 34+ messages in thread
From: Daniel Vetter @ 2014-07-17  8:47 UTC (permalink / raw)
  To: Ben Widawsky; +Cc: Intel GFX, Ben Widawsky

On Tue, Jul 01, 2014 at 11:17:41AM -0700, Ben Widawsky wrote:
> The bound list is global (all objects which back the VMAs are stored
> here). Recently the BUG() in the offset lookup was demoted to a WARN,
> but the fault actually lies in the caller, here.
> 
> This bug has existed since the initial introduction of PPGTT (however,
> it was fixed in unmerged patches to fix up the error state).

Aside: This is another bug than the one which spurred me to demote the
BUG_ON to a WARN_ON. Patch merged with that patch referenced for clarity.
-Daniel

> 
> Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
> ---
>  drivers/gpu/drm/i915/i915_gpu_error.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
> index 66cf417..550ba38 100644
> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
> @@ -871,6 +871,9 @@ static void i915_gem_record_active_context(struct intel_engine_cs *ring,
>  		return;
>  
>  	list_for_each_entry(obj, &dev_priv->mm.bound_list, global_list) {
> +		if (!i915_gem_obj_ggtt_bound(obj))
> +			continue;
> +
>  		if ((error->ccid & PAGE_MASK) == i915_gem_obj_ggtt_offset(obj)) {
>  			ering->ctx = i915_error_ggtt_object_create(dev_priv, obj);
>  			break;
> -- 
> 2.0.1
> 
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/intel-gfx

-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 08/16] drm/i915/error: Do a better job of disambiguating VMAs
  2014-07-04 16:56     ` Ben Widawsky
@ 2014-07-17  8:51       ` Daniel Vetter
  2014-07-20 23:49         ` Ben Widawsky
  0 siblings, 1 reply; 34+ messages in thread
From: Daniel Vetter @ 2014-07-17  8:51 UTC (permalink / raw)
  To: Ben Widawsky; +Cc: Intel GFX, Ben Widawsky

On Fri, Jul 04, 2014 at 09:56:54AM -0700, Ben Widawsky wrote:
> On Fri, Jul 04, 2014 at 08:57:08AM +0100, Chris Wilson wrote:
> > On Tue, Jul 01, 2014 at 11:17:43AM -0700, Ben Widawsky wrote:
> > > Some of the original PPGTT patches in this area where unmerged, and this
> > > left a lot of confusion in our error capture with regard to which vm/obj
> > > we want to capture. There have been at least a couple of patches from
> > > Chris, and myself to try to fix this up; so here is another shot. Nobody
> > > running without full PPGTT is effected by this, and that is probably why
> > > nobody has bothered to fix it yet.
> > > 
> > > Instead of using any of the global lists to find the VMAs we want to
> > > capture, we use the union of the active, and the inactive list in the
> > > VM. This allows us to replace our capture_bo with capture_vma, and know
> > > all the VMAs we want to capture are valid.
> > > 
> > > I could have probably figured out a way to reuse mm_list. As we've had
> > > bugs here before in the shrinker, I think the best way forward is to get
> > > it working, and then optimize it later.
> > > 
> > > Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
> > > ---
> > >  drivers/gpu/drm/i915/i915_gem_gtt.c   |  1 +
> > >  drivers/gpu/drm/i915/i915_gem_gtt.h   |  2 ++
> > >  drivers/gpu/drm/i915/i915_gpu_error.c | 39 ++++++++++++++++++++++-------------
> > >  3 files changed, 28 insertions(+), 14 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
> > > index a4153ee..88451dc 100644
> > > --- a/drivers/gpu/drm/i915/i915_gem_gtt.c
> > > +++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
> > > @@ -2114,6 +2114,7 @@ static struct i915_vma *__i915_gem_vma_create(struct drm_i915_gem_object *obj,
> > >  		return ERR_PTR(-ENOMEM);
> > >  
> > >  	INIT_LIST_HEAD(&vma->vma_link);
> > > +	INIT_LIST_HEAD(&vma->pin_capture_link);
> > >  	INIT_LIST_HEAD(&vma->mm_list);
> > >  	INIT_LIST_HEAD(&vma->exec_list);
> > >  	vma->vm = vm;
> > > diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.h b/drivers/gpu/drm/i915/i915_gem_gtt.h
> > > index 8d6f7c1..1d75801 100644
> > > --- a/drivers/gpu/drm/i915/i915_gem_gtt.h
> > > +++ b/drivers/gpu/drm/i915/i915_gem_gtt.h
> > > @@ -126,6 +126,8 @@ struct i915_vma {
> > >  
> > >  	struct list_head vma_link; /* Link in the object's VMA list */
> > >  
> > > +	struct list_head pin_capture_link; /* Link in the error capture */
> > > +
> > >  	/** This vma's place in the batchbuffer or on the eviction list */
> > >  	struct list_head exec_list;
> > 
> > We already have a slot for temporary lists...
> > -Chris
> > 
> 
> I did mention that in the commit message, if I caught your meaning.

Chris is probably talking about exec_list which is our canonical temporary
list, mostly used by execbuf. But also in other places.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 13/16] drm/i915: More correct (slower) ppgtt cleanup
  2014-07-01 18:17 ` [PATCH 13/16] drm/i915: More correct (slower) " Ben Widawsky
@ 2014-07-17  9:49   ` Daniel Vetter
  0 siblings, 0 replies; 34+ messages in thread
From: Daniel Vetter @ 2014-07-17  9:49 UTC (permalink / raw)
  To: Ben Widawsky; +Cc: Intel GFX, Ben Widawsky

On Tue, Jul 01, 2014 at 11:17:48AM -0700, Ben Widawsky wrote:
> If a VM still have objects which are bound (exactly: have a node
> reserved in the drm_mm), and we are in the middle of a reset, we have no
> hope of the standard methods fixing the situation (ring idle won't
> work). We must therefore let the reset handler take it's course, and
> then we can resume tearing down the VM.
> 
> This logic very much duplicates^Wresembles the logic in our wait for
> error code. I've decided to leave it as open coded because I expect this
> bit of code to require tweaks and changes over time.
> 
> Interruptions via signal causes a really similar problem.
> 
> This should deprecate the need for the yet unmerged patch from Chris
> (and an identical patch from me, which was first!!):
> drm/i915: Prevent signals from interrupting close()
> 
> I have a followup patch to implement deferred free, before you complain.
> 
> Signed-off-by: Ben Widawsky <ben@bwidawsk.net>

Imo this goes in the wrong direction. ppgtt_cleanup really shouldn't ever
have a need to wait for the gpu. We need to rework the lifetimes such that
we keep the ppgtt alive until the gpu is done with it. Similarly to how we
keep the objects themselves around when the gpu is still using them. Even
when userspace has already dropped the last reference.

Having such a stark behaviour difference between ppgtt lifetimes and
object lifetimes only leads to unecessary complexity and fragility in the
code. And this patch here is a good example for this.
-Daniel

> ---
>  drivers/gpu/drm/i915/i915_gem_context.c | 51 +++++++++++++++++++++++++++++++--
>  1 file changed, 48 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> index 8d106d9..e1b5613 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> @@ -101,6 +101,32 @@ static void do_ppgtt_cleanup(struct i915_hw_ppgtt *ppgtt)
>  	struct drm_device *dev = ppgtt->base.dev;
>  	struct drm_i915_private *dev_priv = dev->dev_private;
>  	struct i915_address_space *vm = &ppgtt->base;
> +	bool do_idle = false;
> +	int ret;
> +
> +	/* If we get here while in reset, we need to let the reset handler run
> +	 * first, or else our VM teardown isn't going to go smoothly. There are
> +	 * a could of options at this point, but letting the reset handler do
> +	 * it's thing is the most desirable. The reset handler will take care of
> +	 * retiring the stuck requests.
> +	 */
> +	if (i915_reset_in_progress(&dev_priv->gpu_error)) {
> +		mutex_unlock(&dev->struct_mutex);
> +#define EXIT_COND (!i915_reset_in_progress(&dev_priv->gpu_error) || \
> +		   i915_terminally_wedged(&dev_priv->gpu_error))
> +		ret = wait_event_timeout(dev_priv->gpu_error.reset_queue,
> +					 EXIT_COND,
> +					 10 * HZ);
> +		if (!ret) {
> +			/* it's unlikely idling will solve anything, but it
> +			 * shouldn't hurt to try. */
> +			do_idle = true;
> +			/* TODO: go down kicking and screaming harder */
> +		}
> +#undef EXIT_COND
> +
> +		mutex_lock(&dev->struct_mutex);
> +	}
>  
>  	if (ppgtt == dev_priv->mm.aliasing_ppgtt ||
>  	    (list_empty(&vm->active_list) && list_empty(&vm->inactive_list))) {
> @@ -117,14 +143,33 @@ static void do_ppgtt_cleanup(struct i915_hw_ppgtt *ppgtt)
>  	if (!list_empty(&vm->active_list)) {
>  		struct i915_vma *vma;
>  
> +		do_idle = true;
>  		list_for_each_entry(vma, &vm->active_list, mm_list)
>  			if (WARN_ON(list_empty(&vma->vma_link) ||
>  				    list_is_singular(&vma->vma_link)))
>  				break;
> -		i915_gem_evict_vm(&ppgtt->base, true, true);
> -	} else {
> +	} else
>  		i915_gem_retire_requests(dev);
> -		i915_gem_evict_vm(&ppgtt->base, false, true);
> +
> +	/* We have a problem here where VM teardown cannot be interrupted, or
> +	 * else the ppgtt cleanup will fail. As an example, a precisely timed
> +	 * SIGKILL could leads to an OOPS, or worse. There are two options:
> +	 * 1. Make the eviction uninterruptible
> +	 * 2. Defer the eviction if it was interrupted.
> +	 *
> +	 * Option #1 is not the friendliest, but it's the easiest to implement,
> +	 * and least error prone.
> +	 * TODO: Implement option 2
> +	 */
> +	ret = i915_gem_evict_vm(&ppgtt->base, do_idle, !do_idle);
> +	if (ret == -ERESTARTSYS)
> +		ret = i915_gem_evict_vm(&ppgtt->base, do_idle, false);
> +	WARN_ON(ret);
> +	WARN_ON(!list_empty(&vm->active_list));
> +
> +	/* This is going to blow up badly if the mm is unclean */
> +	if (WARN_ON(!list_empty(&ppgtt->base.mm.head_node.node_list))) {
> +		/* TODO: go down kicking and screaming harder++ */
>  	}
>  
>  	ppgtt->base.cleanup(&ppgtt->base);
> -- 
> 2.0.1
> 
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/intel-gfx

-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 12/16] drm/i915: Reorder ctx unref on ppgtt cleanup
  2014-07-01 18:17 ` [PATCH 12/16] drm/i915: Reorder ctx unref on ppgtt cleanup Ben Widawsky
@ 2014-07-17  9:56   ` Daniel Vetter
  0 siblings, 0 replies; 34+ messages in thread
From: Daniel Vetter @ 2014-07-17  9:56 UTC (permalink / raw)
  To: Ben Widawsky; +Cc: Intel GFX, Ben Widawsky

On Tue, Jul 01, 2014 at 11:17:47AM -0700, Ben Widawsky wrote:
> The comment [which was mine] is wrong. The context object can never be
> bound in a PPGTT because it is only capable of living in the Global GTT.
> So, remove the comment, and reorder the unref. What's nice about the
> latter is it keeps the context object alive past the PPGTT. This makes
> the destroy ordering symmetric with the creation ordering.
> 
> Create:
> 1. Create context
> 2. Create PPGTT
> 
> Destroy:
> 1. Destroy PPGTT
> 2. Destroy context
> 
> As far as I know, this does not fix a bug. The code previously kept the
> context data structure, only the object was gone. As the code was,
> nothing tried to use the object after this point.
> 
> NOTE: If in the future we have cases where the PPGTT can/should outlive
> the context (which doesn't occur today, but the code permits it), this
> ordering does not matter. Even if this occurs, as it stands now, we do
> not expect that to be the normal case, and having this order makes
> debugging a bit easier if we're tracking object lifetimes for the
> context vs ppgtt
> 
> Signed-off-by: Ben Widawsky <ben@bwidawsk.net>

Queued for -next, thanks for the patch.
-Daniel
> ---
>  drivers/gpu/drm/i915/i915_gem_context.c | 6 ++----
>  1 file changed, 2 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> index b6803b3..8d106d9 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> @@ -185,14 +185,12 @@ void i915_gem_context_free(struct kref *ctx_ref)
>  		/* We refcount even the aliasing PPGTT to keep the code symmetric */
>  		if (USES_PPGTT(ctx->obj->base.dev))
>  			ppgtt = ctx_to_ppgtt(ctx);
> -
> -		/* XXX: Free up the object before tearing down the address space, in
> -		 * case we're bound in the PPGTT */
> -		drm_gem_object_unreference(&ctx->obj->base);
>  	}
>  
>  	if (ppgtt)
>  		kref_put(&ppgtt->ref, ppgtt_release);
> +	if (ctx->obj)
> +		drm_gem_object_unreference(&ctx->obj->base);
>  	list_del(&ctx->link);
>  	kfree(ctx);
>  }
> -- 
> 2.0.1
> 
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/intel-gfx

-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 00/16] Enabling GEN8 full PPGTT + fixes
  2014-07-01 18:17 [PATCH 00/16] Enabling GEN8 full PPGTT + fixes Ben Widawsky
                   ` (16 preceding siblings ...)
  2014-07-03 22:01 ` [PATCH 1/2] drm/i915/gen8: Invalidate TLBs before PDP reload Ben Widawsky
@ 2014-07-17 12:04 ` Daniel Vetter
  17 siblings, 0 replies; 34+ messages in thread
From: Daniel Vetter @ 2014-07-17 12:04 UTC (permalink / raw)
  To: Ben Widawsky; +Cc: Intel GFX

On Tue, Jul 01, 2014 at 11:17:35AM -0700, Ben Widawsky wrote:
> Here be all the patches to make full PPGTT relatively stable on Broadwell. Most
> of the work was actually to the generic PPGTT code, and not BDW specific. There
> are basically 3 fixes:
> 1. Make the error state not horrible, but more work still needed.
> 2. Fix up another tricky from != from case in do_switch
> 3. Generally more graceful handling in ppgtt_release
> 
> 1. Various forms of this subseries have shown up on the list from both
> myself, and Chris (and I feel like Mika, also). For whatever reason,
> none have been merged. I don't necessarily mind missing information, but
> without these patches we can OOPs and GP fault in the error state, which
> is not acceptable.
> 
> 2. The meat of the debugging came here. Essentially this problem has
> already been seen, and solved. The issue is, there is another spot in
> do_switch that can invoke a switch to the default context. We probably
> want to get slightly better handling of this, but for now this solves my
> problems.
> 
> 3. There are 2 categories addressed in this bucket. Reset, and signals.  The
> patches themselves explain the situation. I take a two step approach with this
> where first I make things correct (and slow as heck), and then I do the more
> optimal and trickier solution. Both of these are in line to be replaced by
> Daniel, but I needed something sooner.
> 
> Here is the test case that caught all of the above issues:
> while [ 1 ] ;  do
>         (glxgears) & pid[0]=$!
>         (glxgears) & pid[1]=$!
>         (glxgears) & pid[2]=$!
>         sleep 3
>         kill ${pid[*]}
> done
> 
> 
> Ben Widawsky (16):
>   drm/i915: Split up do_switch
>   drm/i915: Extract l3 remapping out of ctx switch
>   drm/i915/ppgtt: Load address space after mi_set_context
>   drm/i915: Fix another another use-after-free in do_switch
>   drm/i915/ctx: Return earlier on failure
>   drm/i915/error: Check the potential ctx obj's vm
>   drm/i915/error: vma error capture prettyify
>   drm/i915/error: Do a better job of disambiguating VMAs
>   drm/i915/error: Capture vmas instead of BOs
>   drm/i915: Add some extra guards in evict_vm
>   drm/i915: Make an uninterruptible evict
>   drm/i915: Reorder ctx unref on ppgtt cleanup
>   drm/i915: More correct (slower) ppgtt cleanup
>   drm/i915: Defer PPGTT cleanup
>   drm/i915/bdw: Enable full PPGTT
>   drm/i915: Get the error state over the wire (HACKish)

I think we can pull the roughly first 9 patches (already merged one of
them). Just needs someone to do the in-depth review, and preferrably
someone who's looking at the relevant jira ppgtt tasks we're tracking
interannyl. For the changes affecting ppgtt_cleanup I've replied in a
separate mail.
-Daniel

> 
>  drivers/gpu/drm/i915/i915_debugfs.c        |   2 +-
>  drivers/gpu/drm/i915/i915_drv.h            |  18 ++-
>  drivers/gpu/drm/i915/i915_gem.c            | 110 +++++++++++++
>  drivers/gpu/drm/i915/i915_gem_context.c    | 238 ++++++++++++++++++++++-------
>  drivers/gpu/drm/i915/i915_gem_evict.c      |  39 +++--
>  drivers/gpu/drm/i915/i915_gem_execbuffer.c |   2 +-
>  drivers/gpu/drm/i915/i915_gem_gtt.c        |   3 +-
>  drivers/gpu/drm/i915/i915_gem_gtt.h        |   4 +
>  drivers/gpu/drm/i915/i915_gpu_error.c      | 157 ++++++++++++-------
>  drivers/gpu/drm/i915/i915_sysfs.c          |   2 +-
>  10 files changed, 449 insertions(+), 126 deletions(-)
> 
> -- 
> 2.0.1
> 
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/intel-gfx

-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 08/16] drm/i915/error: Do a better job of disambiguating VMAs
  2014-07-17  8:51       ` Daniel Vetter
@ 2014-07-20 23:49         ` Ben Widawsky
  0 siblings, 0 replies; 34+ messages in thread
From: Ben Widawsky @ 2014-07-20 23:49 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Intel GFX, Ben Widawsky

On Thu, Jul 17, 2014 at 10:51:23AM +0200, Daniel Vetter wrote:
> On Fri, Jul 04, 2014 at 09:56:54AM -0700, Ben Widawsky wrote:
> > On Fri, Jul 04, 2014 at 08:57:08AM +0100, Chris Wilson wrote:
> > > On Tue, Jul 01, 2014 at 11:17:43AM -0700, Ben Widawsky wrote:
> > > > Some of the original PPGTT patches in this area where unmerged, and this
> > > > left a lot of confusion in our error capture with regard to which vm/obj
> > > > we want to capture. There have been at least a couple of patches from
> > > > Chris, and myself to try to fix this up; so here is another shot. Nobody
> > > > running without full PPGTT is effected by this, and that is probably why
> > > > nobody has bothered to fix it yet.
> > > > 
> > > > Instead of using any of the global lists to find the VMAs we want to
> > > > capture, we use the union of the active, and the inactive list in the
> > > > VM. This allows us to replace our capture_bo with capture_vma, and know
> > > > all the VMAs we want to capture are valid.
> > > > 
> > > > I could have probably figured out a way to reuse mm_list. As we've had
> > > > bugs here before in the shrinker, I think the best way forward is to get
> > > > it working, and then optimize it later.
> > > > 
> > > > Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
> > > > ---
> > > >  drivers/gpu/drm/i915/i915_gem_gtt.c   |  1 +
> > > >  drivers/gpu/drm/i915/i915_gem_gtt.h   |  2 ++
> > > >  drivers/gpu/drm/i915/i915_gpu_error.c | 39 ++++++++++++++++++++++-------------
> > > >  3 files changed, 28 insertions(+), 14 deletions(-)
> > > > 
> > > > diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
> > > > index a4153ee..88451dc 100644
> > > > --- a/drivers/gpu/drm/i915/i915_gem_gtt.c
> > > > +++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
> > > > @@ -2114,6 +2114,7 @@ static struct i915_vma *__i915_gem_vma_create(struct drm_i915_gem_object *obj,
> > > >  		return ERR_PTR(-ENOMEM);
> > > >  
> > > >  	INIT_LIST_HEAD(&vma->vma_link);
> > > > +	INIT_LIST_HEAD(&vma->pin_capture_link);
> > > >  	INIT_LIST_HEAD(&vma->mm_list);
> > > >  	INIT_LIST_HEAD(&vma->exec_list);
> > > >  	vma->vm = vm;
> > > > diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.h b/drivers/gpu/drm/i915/i915_gem_gtt.h
> > > > index 8d6f7c1..1d75801 100644
> > > > --- a/drivers/gpu/drm/i915/i915_gem_gtt.h
> > > > +++ b/drivers/gpu/drm/i915/i915_gem_gtt.h
> > > > @@ -126,6 +126,8 @@ struct i915_vma {
> > > >  
> > > >  	struct list_head vma_link; /* Link in the object's VMA list */
> > > >  
> > > > +	struct list_head pin_capture_link; /* Link in the error capture */
> > > > +
> > > >  	/** This vma's place in the batchbuffer or on the eviction list */
> > > >  	struct list_head exec_list;
> > > 
> > > We already have a slot for temporary lists...
> > > -Chris
> > > 
> > 
> > I did mention that in the commit message, if I caught your meaning.
> 
> Chris is probably talking about exec_list which is our canonical temporary
> list, mostly used by execbuf. But also in other places.
> -Daniel

I think that was a typo on my part, I meant exec_list. In either case, I
think doing it this way and merging it later is the safest path.

-- 
Ben Widawsky, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH] [v2] drm/i915: Fix another another use-after-free in do_switch
  2014-07-01 18:17 ` [PATCH 04/16] drm/i915: Fix another another use-after-free in do_switch Ben Widawsky
@ 2014-08-09 20:15   ` Ben Widawsky
  2014-08-10  8:04     ` Chris Wilson
  0 siblings, 1 reply; 34+ messages in thread
From: Ben Widawsky @ 2014-08-09 20:15 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

See the following for many more details.

commit acc240d41ea1ab9c488a79219fb313b5b46265ae
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Thu Dec 5 15:42:34 2013 +0100

    drm/i915: Fix use-after-free in do_switch

In this case, the issue is only for full PPGTT:
do_switch
  context_unref
    ppgtt_release
      i915_gpu_idle
	switch_to_default
	from changes to default context

This could be backported to the pre do_switch cleanup I did in this
series. However, it's much cleaner and more obvious as a patch on top,
so I'd really like to do this as a post cleanup patch.

v2: There was a bug in the original patch where the ring->last_context
was set too early. I am not sure how this wasn't being hit when I sent
this previously. Perhaps I tested the wrong patch previously.

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_gem_context.c | 24 +++++++++++++++++-------
 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index c9aa3e6..0ce8fc9 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -609,14 +609,18 @@ mi_set_context(struct intel_engine_cs *ring,
 	return ret;
 }
 
-static void do_switch_fini_common(struct intel_engine_cs *ring,
-				  struct intel_context *from,
-				  struct intel_context *to)
+static struct intel_context *do_switch_fini_common(struct intel_engine_cs *ring,
+						   struct intel_context *from,
+						   struct intel_context *to)
 {
+	struct intel_context *ret;
 	if (likely(from))
 		i915_gem_context_unreference(from);
 	i915_gem_context_reference(to);
+	ret = ring->last_context;
 	ring->last_context = to;
+
+	return ret;
 }
 
 static int do_switch_xcs(struct intel_engine_cs *ring,
@@ -762,14 +766,20 @@ static int do_switch_rcs(struct intel_engine_cs *ring,
 		 */
 		from->legacy_hw_ctx.rcs_state->dirty = 1;
 		BUG_ON(from->legacy_hw_ctx.rcs_state->ring != ring);
-
-		/* obj is kept alive until the next request by its active ref */
-		i915_gem_object_ggtt_unpin(from->legacy_hw_ctx.rcs_state);
 	}
 
 	uninitialized = !to->legacy_hw_ctx.initialized && from == NULL;
 	to->legacy_hw_ctx.initialized = true;
-	do_switch_fini_common(ring, from, to);
+	/* From may have disappeared again after the context unref */
+	from = do_switch_fini_common(ring, from, to);
+	if (from != NULL) {
+		/* obj is kept alive until the next request by its active ref.
+		 * XXX: The context needs to be unpinned last, or else we risk
+		 * hitting evict/idle on the ppgtt free, which will call back
+		 * into this, and we'll get a double unpin on this context
+		 */
+		i915_gem_object_ggtt_unpin(from->legacy_hw_ctx.rcs_state);
+	}
 
 	if (uninitialized) {
 		ret = i915_gem_render_state_init(ring);
-- 
2.0.4

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH] [v2] drm/i915: Fix another another use-after-free in do_switch
  2014-08-09 20:15   ` [PATCH] [v2] " Ben Widawsky
@ 2014-08-10  8:04     ` Chris Wilson
  2014-08-11  9:26       ` Daniel Vetter
  0 siblings, 1 reply; 34+ messages in thread
From: Chris Wilson @ 2014-08-10  8:04 UTC (permalink / raw)
  To: Ben Widawsky; +Cc: Intel GFX, Ben Widawsky

On Sat, Aug 09, 2014 at 01:15:16PM -0700, Ben Widawsky wrote:
> See the following for many more details.
> 
> commit acc240d41ea1ab9c488a79219fb313b5b46265ae
> Author: Daniel Vetter <daniel.vetter@ffwll.ch>
> Date:   Thu Dec 5 15:42:34 2013 +0100
> 
>     drm/i915: Fix use-after-free in do_switch
> 
> In this case, the issue is only for full PPGTT:
> do_switch
>   context_unref
>     ppgtt_release
>       i915_gpu_idle
> 	switch_to_default
> 	from changes to default context
> 
> This could be backported to the pre do_switch cleanup I did in this
> series. However, it's much cleaner and more obvious as a patch on top,
> so I'd really like to do this as a post cleanup patch.
> 
> v2: There was a bug in the original patch where the ring->last_context
> was set too early. I am not sure how this wasn't being hit when I sent
> this previously. Perhaps I tested the wrong patch previously.
> 
> Signed-off-by: Ben Widawsky <ben@bwidawsk.net>

Ok, I convinced myself that the you are fixing the bug you describe and
don't seem to be introducing a new one, so

Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] [v2] drm/i915: Fix another another use-after-free in do_switch
  2014-08-10  8:04     ` Chris Wilson
@ 2014-08-11  9:26       ` Daniel Vetter
  0 siblings, 0 replies; 34+ messages in thread
From: Daniel Vetter @ 2014-08-11  9:26 UTC (permalink / raw)
  To: Chris Wilson, Ben Widawsky, Intel GFX, Ben Widawsky

On Sun, Aug 10, 2014 at 09:04:10AM +0100, Chris Wilson wrote:
> On Sat, Aug 09, 2014 at 01:15:16PM -0700, Ben Widawsky wrote:
> > See the following for many more details.
> > 
> > commit acc240d41ea1ab9c488a79219fb313b5b46265ae
> > Author: Daniel Vetter <daniel.vetter@ffwll.ch>
> > Date:   Thu Dec 5 15:42:34 2013 +0100
> > 
> >     drm/i915: Fix use-after-free in do_switch
> > 
> > In this case, the issue is only for full PPGTT:
> > do_switch
> >   context_unref
> >     ppgtt_release
> >       i915_gpu_idle
> > 	switch_to_default
> > 	from changes to default context

Pardon my ignorance (well this stuff is just hard), but can the above
still happen with Michel Thierry's patch to rework ppgtt_release?

In particular I seem to be too dense to find the ppgtt_release -> gpu_idle
step once the forcefull vma unbinding is gone. Doe I miss something?
Someone please enlighten me ...

Thanks, Daniel

> > 
> > This could be backported to the pre do_switch cleanup I did in this
> > series. However, it's much cleaner and more obvious as a patch on top,
> > so I'd really like to do this as a post cleanup patch.
> > 
> > v2: There was a bug in the original patch where the ring->last_context
> > was set too early. I am not sure how this wasn't being hit when I sent
> > this previously. Perhaps I tested the wrong patch previously.
> > 
> > Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
> 
> Ok, I convinced myself that the you are fixing the bug you describe and
> don't seem to be introducing a new one, so
> 
> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
> -Chris
> 
> -- 
> Chris Wilson, Intel Open Source Technology Centre
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/intel-gfx

-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2014-08-11  9:26 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-07-01 18:17 [PATCH 00/16] Enabling GEN8 full PPGTT + fixes Ben Widawsky
2014-07-01 18:17 ` [PATCH 01/16] drm/i915: Split up do_switch Ben Widawsky
2014-07-01 18:17 ` [PATCH 02/16] drm/i915: Extract l3 remapping out of ctx switch Ben Widawsky
2014-07-01 18:17 ` [PATCH 03/16] drm/i915/ppgtt: Load address space after mi_set_context Ben Widawsky
2014-07-01 18:17 ` [PATCH 04/16] drm/i915: Fix another another use-after-free in do_switch Ben Widawsky
2014-08-09 20:15   ` [PATCH] [v2] " Ben Widawsky
2014-08-10  8:04     ` Chris Wilson
2014-08-11  9:26       ` Daniel Vetter
2014-07-01 18:17 ` [PATCH 05/16] drm/i915/ctx: Return earlier on failure Ben Widawsky
2014-07-04  8:14   ` Chris Wilson
2014-07-01 18:17 ` [PATCH 06/16] drm/i915/error: Check the potential ctx obj's vm Ben Widawsky
2014-07-17  8:47   ` Daniel Vetter
2014-07-01 18:17 ` [PATCH 07/16] drm/i915/error: vma error capture prettyify Ben Widawsky
2014-07-01 18:17 ` [PATCH 08/16] drm/i915/error: Do a better job of disambiguating VMAs Ben Widawsky
2014-07-04  7:57   ` Chris Wilson
2014-07-04 16:56     ` Ben Widawsky
2014-07-17  8:51       ` Daniel Vetter
2014-07-20 23:49         ` Ben Widawsky
2014-07-01 18:17 ` [PATCH 09/16] drm/i915/error: Capture vmas instead of BOs Ben Widawsky
2014-07-01 18:17 ` [PATCH 10/16] drm/i915: Add some extra guards in evict_vm Ben Widawsky
2014-07-01 18:17 ` [PATCH 11/16] drm/i915: Make an uninterruptible evict Ben Widawsky
2014-07-01 18:17 ` [PATCH 12/16] drm/i915: Reorder ctx unref on ppgtt cleanup Ben Widawsky
2014-07-17  9:56   ` Daniel Vetter
2014-07-01 18:17 ` [PATCH 13/16] drm/i915: More correct (slower) " Ben Widawsky
2014-07-17  9:49   ` Daniel Vetter
2014-07-01 18:17 ` [PATCH 14/16] drm/i915: Defer PPGTT cleanup Ben Widawsky
2014-07-01 18:17 ` [PATCH 15/16] drm/i915/bdw: Enable full PPGTT Ben Widawsky
2014-07-01 18:17 ` [PATCH 16/16] drm/i915: Get the error state over the wire (HACKish) Ben Widawsky
2014-07-04  8:02   ` Chris Wilson
2014-07-03 22:01 ` [PATCH 1/2] drm/i915/gen8: Invalidate TLBs before PDP reload Ben Widawsky
2014-07-03 22:01   ` [PATCH 2/2] drm/i915: Remove false assertion in ppgtt_release Ben Widawsky
2014-07-04  7:51   ` [PATCH 1/2] drm/i915/gen8: Invalidate TLBs before PDP reload Chris Wilson
2014-07-04 16:55     ` Ben Widawsky
2014-07-17 12:04 ` [PATCH 00/16] Enabling GEN8 full PPGTT + fixes Daniel Vetter

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.