All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] drm/i915: Restore inhibiting the load of the default context
@ 2015-11-27 10:07 Chris Wilson
  2015-11-27 11:32 ` Mika Kuoppala
  0 siblings, 1 reply; 11+ messages in thread
From: Chris Wilson @ 2015-11-27 10:07 UTC (permalink / raw)
  To: intel-gfx; +Cc: Daniel Vetter, Mika Kuoppala

Following a GPU reset, we may leave the context in a poorly defined
state, and reloading from that context will leave the GPU flummoxed. For
secondary contexts, this will lead to that context being banned - but
currently it is also causing the default context to become banned,
leading to turmoil in the shared state.

This is a regression from

commit 6702cf16e0ba8b0129f5aa1b6609d4e9c70bc13b [v4.1]
Author: Ben Widawsky <benjamin.widawsky@intel.com>
Date:   Mon Mar 16 16:00:58 2015 +0000

    drm/i915: Initialize all contexts

which quietly introduced the removal of the MI_RESTORE_INHIBIT on the
default context.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Michel Thierry <michel.thierry@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
---
 drivers/gpu/drm/i915/i915_gem_context.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 43761c5bcaca..1041099d285a 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -708,7 +708,7 @@ static int do_switch(struct drm_i915_gem_request *req)
 	if (ret)
 		goto unpin_out;
 
-	if (!to->legacy_hw_ctx.initialized) {
+	if (!to->legacy_hw_ctx.initialized || i915_gem_context_is_default(to)) {
 		hw_flags |= MI_RESTORE_INHIBIT;
 		/* NB: If we inhibit the restore, the context is not allowed to
 		 * die because future work may end up depending on valid address
-- 
2.6.2

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] drm/i915: Restore inhibiting the load of the default context
  2015-11-27 10:07 [PATCH] drm/i915: Restore inhibiting the load of the default context Chris Wilson
@ 2015-11-27 11:32 ` Mika Kuoppala
  2015-11-27 13:14   ` Chris Wilson
  2015-11-27 13:28   ` [PATCH v2] " Chris Wilson
  0 siblings, 2 replies; 11+ messages in thread
From: Mika Kuoppala @ 2015-11-27 11:32 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx; +Cc: Daniel Vetter

Chris Wilson <chris@chris-wilson.co.uk> writes:

> Following a GPU reset, we may leave the context in a poorly defined
> state, and reloading from that context will leave the GPU flummoxed. For
> secondary contexts, this will lead to that context being banned - but
> currently it is also causing the default context to become banned,
> leading to turmoil in the shared state.
>
> This is a regression from
>
> commit 6702cf16e0ba8b0129f5aa1b6609d4e9c70bc13b [v4.1]
> Author: Ben Widawsky <benjamin.widawsky@intel.com>
> Date:   Mon Mar 16 16:00:58 2015 +0000
>
>     drm/i915: Initialize all contexts
>
> which quietly introduced the removal of the MI_RESTORE_INHIBIT on the
> default context.
>

As we never submit anything except driver initialization commands
for that context, what would cause this context to become corrupted?

Please consider:

diff --git a/drivers/gpu/drm/i915/i915_gem_context.c
b/drivers/gpu/drm/i915/i915_gem_context.c
index 43761c5..45b9a39 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -332,6 +332,7 @@ void i915_gem_context_reset(struct drm_device *dev)
        for (i = 0; i < I915_NUM_RINGS; i++) {
                struct intel_engine_cs *ring = &dev_priv->ring[i];
                struct intel_context *lctx = ring->last_context;
+               struct intel_context *dctx = ring->default_context;
 
                if (lctx) {
                        if (lctx->legacy_hw_ctx.rcs_state && i == RCS)
@@ -340,6 +341,9 @@ void i915_gem_context_reset(struct drm_device *dev)
                        i915_gem_context_unreference(lctx);
                        ring->last_context = NULL;
                }
+
+               if (dctx)
+                       dctx->legacy_hw_ctx.initialized = false;
        }
 }

To achieve the same effect and as a bonus, get the
same default context (with workarounds) as we
did in driver init.

I also think that we should zero the global
default context in here to gain similarity wrt
module init.

-Mika

> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Michel Thierry <michel.thierry@intel.com>
> Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> ---
>  drivers/gpu/drm/i915/i915_gem_context.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> index 43761c5bcaca..1041099d285a 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> @@ -708,7 +708,7 @@ static int do_switch(struct drm_i915_gem_request *req)
>  	if (ret)
>  		goto unpin_out;
>  
> -	if (!to->legacy_hw_ctx.initialized) {
> +	if (!to->legacy_hw_ctx.initialized || i915_gem_context_is_default(to)) {
>  		hw_flags |= MI_RESTORE_INHIBIT;
>  		/* NB: If we inhibit the restore, the context is not allowed to
>  		 * die because future work may end up depending on valid address
> -- 
> 2.6.2
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] drm/i915: Restore inhibiting the load of the default context
  2015-11-27 11:32 ` Mika Kuoppala
@ 2015-11-27 13:14   ` Chris Wilson
  2015-11-27 13:28   ` [PATCH v2] " Chris Wilson
  1 sibling, 0 replies; 11+ messages in thread
From: Chris Wilson @ 2015-11-27 13:14 UTC (permalink / raw)
  To: Mika Kuoppala; +Cc: Daniel Vetter, intel-gfx

On Fri, Nov 27, 2015 at 01:32:11PM +0200, Mika Kuoppala wrote:
> Chris Wilson <chris@chris-wilson.co.uk> writes:
> 
> > Following a GPU reset, we may leave the context in a poorly defined
> > state, and reloading from that context will leave the GPU flummoxed. For
> > secondary contexts, this will lead to that context being banned - but
> > currently it is also causing the default context to become banned,
> > leading to turmoil in the shared state.
> >
> > This is a regression from
> >
> > commit 6702cf16e0ba8b0129f5aa1b6609d4e9c70bc13b [v4.1]
> > Author: Ben Widawsky <benjamin.widawsky@intel.com>
> > Date:   Mon Mar 16 16:00:58 2015 +0000
> >
> >     drm/i915: Initialize all contexts
> >
> > which quietly introduced the removal of the MI_RESTORE_INHIBIT on the
> > default context.
> >
> 
> As we never submit anything except driver initialization commands
> for that context, what would cause this context to become corrupted?

I can only hazard that the act of reseting the GPU left it invalid. A
bisect pointed to that commit, and partially reverting each chunk left
me with the conclusion that the hang was a direct result of reloading
the context. Closer inspection may reveal someelse suspect about the
context, but I object to this sneaky change.

> Please consider:
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c
> b/drivers/gpu/drm/i915/i915_gem_context.c
> index 43761c5..45b9a39 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> @@ -332,6 +332,7 @@ void i915_gem_context_reset(struct drm_device *dev)
>         for (i = 0; i < I915_NUM_RINGS; i++) {
>                 struct intel_engine_cs *ring = &dev_priv->ring[i];
>                 struct intel_context *lctx = ring->last_context;
> +               struct intel_context *dctx = ring->default_context;
>  
>                 if (lctx) {
>                         if (lctx->legacy_hw_ctx.rcs_state && i == RCS)
> @@ -340,6 +341,9 @@ void i915_gem_context_reset(struct drm_device *dev)
>                         i915_gem_context_unreference(lctx);
>                         ring->last_context = NULL;
>                 }
> +
> +               if (dctx)
> +                       dctx->legacy_hw_ctx.initialized = false;
>         }
>  }
> 
> To achieve the same effect and as a bonus, get the
> same default context (with workarounds) as we
> did in driver init.

I considered it, and wondered why it wasn't already there. It is a
separate issue imo.
 
> I also think that we should zero the global
> default context in here to gain similarity wrt
> module init.

You mean reallocate it from scratch? We have avoided doing the
reallocations in the past, as they can fail at inopportune times
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v2] drm/i915: Restore inhibiting the load of the default context
  2015-11-27 11:32 ` Mika Kuoppala
  2015-11-27 13:14   ` Chris Wilson
@ 2015-11-27 13:28   ` Chris Wilson
  2015-12-10 10:19     ` Mika Kuoppala
  1 sibling, 1 reply; 11+ messages in thread
From: Chris Wilson @ 2015-11-27 13:28 UTC (permalink / raw)
  To: intel-gfx; +Cc: Daniel Vetter, Mika Kuoppala

Following a GPU reset, we may leave the context in a poorly defined
state, and reloading from that context will leave the GPU flummoxed. For
secondary contexts, this will lead to that context being banned - but
currently it is also causing the default context to become banned,
leading to turmoil in the shared state.

This is a regression from

commit 6702cf16e0ba8b0129f5aa1b6609d4e9c70bc13b [v4.1]
Author: Ben Widawsky <benjamin.widawsky@intel.com>
Date:   Mon Mar 16 16:00:58 2015 +0000

    drm/i915: Initialize all contexts

which quietly introduced the removal of the MI_RESTORE_INHIBIT on the
default context.

v2: Mark the global default context as uninitialized on GPU reset so
that the context-local workarounds are reloaded upon re-enabling.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Michel Thierry <michel.thierry@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
---
 drivers/gpu/drm/i915/i915_gem_context.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 43761c5bcaca..f024d5d2c746 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -340,6 +340,10 @@ void i915_gem_context_reset(struct drm_device *dev)
 			i915_gem_context_unreference(lctx);
 			ring->last_context = NULL;
 		}
+
+		/* Force the GPU state to be reinitialised on enabling */
+		if (ring->default_context)
+			ring->default_context->legacy_hw_ctx.initialized = false;
 	}
 }
 
@@ -708,7 +712,7 @@ static int do_switch(struct drm_i915_gem_request *req)
 	if (ret)
 		goto unpin_out;
 
-	if (!to->legacy_hw_ctx.initialized) {
+	if (!to->legacy_hw_ctx.initialized || i915_gem_context_is_default(to)) {
 		hw_flags |= MI_RESTORE_INHIBIT;
 		/* NB: If we inhibit the restore, the context is not allowed to
 		 * die because future work may end up depending on valid address
-- 
2.6.2

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH v2] drm/i915: Restore inhibiting the load of the default context
  2015-11-27 13:28   ` [PATCH v2] " Chris Wilson
@ 2015-12-10 10:19     ` Mika Kuoppala
  2015-12-10 10:45       ` Daniel Vetter
  2015-12-10 13:24       ` Francisco Jerez
  0 siblings, 2 replies; 11+ messages in thread
From: Mika Kuoppala @ 2015-12-10 10:19 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx; +Cc: Daniel Vetter

Chris Wilson <chris@chris-wilson.co.uk> writes:

> Following a GPU reset, we may leave the context in a poorly defined
> state, and reloading from that context will leave the GPU flummoxed. For
> secondary contexts, this will lead to that context being banned - but
> currently it is also causing the default context to become banned,
> leading to turmoil in the shared state.
>
> This is a regression from
>
> commit 6702cf16e0ba8b0129f5aa1b6609d4e9c70bc13b [v4.1]
> Author: Ben Widawsky <benjamin.widawsky@intel.com>
> Date:   Mon Mar 16 16:00:58 2015 +0000
>
>     drm/i915: Initialize all contexts
>
> which quietly introduced the removal of the MI_RESTORE_INHIBIT on the
> default context.
>
> v2: Mark the global default context as uninitialized on GPU reset so
> that the context-local workarounds are reloaded upon re-enabling.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>

> Cc: Michel Thierry <michel.thierry@intel.com>
> Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> ---
>  drivers/gpu/drm/i915/i915_gem_context.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> index 43761c5bcaca..f024d5d2c746 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> @@ -340,6 +340,10 @@ void i915_gem_context_reset(struct drm_device *dev)
>  			i915_gem_context_unreference(lctx);
>  			ring->last_context = NULL;
>  		}
> +
> +		/* Force the GPU state to be reinitialised on enabling */
> +		if (ring->default_context)
> +			ring->default_context->legacy_hw_ctx.initialized = false;
>  	}
>  }
>  
> @@ -708,7 +712,7 @@ static int do_switch(struct drm_i915_gem_request *req)
>  	if (ret)
>  		goto unpin_out;
>  
> -	if (!to->legacy_hw_ctx.initialized) {
> +	if (!to->legacy_hw_ctx.initialized || i915_gem_context_is_default(to)) {
>  		hw_flags |= MI_RESTORE_INHIBIT;
>  		/* NB: If we inhibit the restore, the context is not allowed to
>  		 * die because future work may end up depending on valid address
> -- 
> 2.6.2
>
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/intel-gfx
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2] drm/i915: Restore inhibiting the load of the default context
  2015-12-10 10:19     ` Mika Kuoppala
@ 2015-12-10 10:45       ` Daniel Vetter
  2016-01-06 10:07         ` Daniel Vetter
  2015-12-10 13:24       ` Francisco Jerez
  1 sibling, 1 reply; 11+ messages in thread
From: Daniel Vetter @ 2015-12-10 10:45 UTC (permalink / raw)
  To: Mika Kuoppala; +Cc: Daniel Vetter, intel-gfx

On Thu, Dec 10, 2015 at 12:19:29PM +0200, Mika Kuoppala wrote:
> Chris Wilson <chris@chris-wilson.co.uk> writes:
> 
> > Following a GPU reset, we may leave the context in a poorly defined
> > state, and reloading from that context will leave the GPU flummoxed. For
> > secondary contexts, this will lead to that context being banned - but
> > currently it is also causing the default context to become banned,
> > leading to turmoil in the shared state.
> >
> > This is a regression from
> >
> > commit 6702cf16e0ba8b0129f5aa1b6609d4e9c70bc13b [v4.1]
> > Author: Ben Widawsky <benjamin.widawsky@intel.com>
> > Date:   Mon Mar 16 16:00:58 2015 +0000
> >
> >     drm/i915: Initialize all contexts
> >
> > which quietly introduced the removal of the MI_RESTORE_INHIBIT on the
> > default context.
> >
> > v2: Mark the global default context as uninitialized on GPU reset so
> > that the context-local workarounds are reloaded upon re-enabling.
> >
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> 
> Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>

Queued for -next, thanks for the patch.
-Daniel

> 
> > Cc: Michel Thierry <michel.thierry@intel.com>
> > Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> > Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> > ---
> >  drivers/gpu/drm/i915/i915_gem_context.c | 6 +++++-
> >  1 file changed, 5 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> > index 43761c5bcaca..f024d5d2c746 100644
> > --- a/drivers/gpu/drm/i915/i915_gem_context.c
> > +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> > @@ -340,6 +340,10 @@ void i915_gem_context_reset(struct drm_device *dev)
> >  			i915_gem_context_unreference(lctx);
> >  			ring->last_context = NULL;
> >  		}
> > +
> > +		/* Force the GPU state to be reinitialised on enabling */
> > +		if (ring->default_context)
> > +			ring->default_context->legacy_hw_ctx.initialized = false;
> >  	}
> >  }
> >  
> > @@ -708,7 +712,7 @@ static int do_switch(struct drm_i915_gem_request *req)
> >  	if (ret)
> >  		goto unpin_out;
> >  
> > -	if (!to->legacy_hw_ctx.initialized) {
> > +	if (!to->legacy_hw_ctx.initialized || i915_gem_context_is_default(to)) {
> >  		hw_flags |= MI_RESTORE_INHIBIT;
> >  		/* NB: If we inhibit the restore, the context is not allowed to
> >  		 * die because future work may end up depending on valid address
> > -- 
> > 2.6.2
> >
> > _______________________________________________
> > Intel-gfx mailing list
> > Intel-gfx@lists.freedesktop.org
> > http://lists.freedesktop.org/mailman/listinfo/intel-gfx

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2] drm/i915: Restore inhibiting the load of the default context
  2015-12-10 10:19     ` Mika Kuoppala
  2015-12-10 10:45       ` Daniel Vetter
@ 2015-12-10 13:24       ` Francisco Jerez
  2015-12-10 13:37         ` Chris Wilson
  1 sibling, 1 reply; 11+ messages in thread
From: Francisco Jerez @ 2015-12-10 13:24 UTC (permalink / raw)
  To: Mika Kuoppala, Chris Wilson, intel-gfx; +Cc: Daniel Vetter, Ben Widawsky


[-- Attachment #1.1.1: Type: text/plain, Size: 4653 bytes --]

Mika Kuoppala <mika.kuoppala@linux.intel.com> writes:

> Chris Wilson <chris@chris-wilson.co.uk> writes:
>
>> Following a GPU reset, we may leave the context in a poorly defined
>> state, and reloading from that context will leave the GPU flummoxed. For
>> secondary contexts, this will lead to that context being banned - but
>> currently it is also causing the default context to become banned,
>> leading to turmoil in the shared state.
>>
>> This is a regression from
>>
>> commit 6702cf16e0ba8b0129f5aa1b6609d4e9c70bc13b [v4.1]
>> Author: Ben Widawsky <benjamin.widawsky@intel.com>
>> Date:   Mon Mar 16 16:00:58 2015 +0000
>>
>>     drm/i915: Initialize all contexts
>>
>> which quietly introduced the removal of the MI_RESTORE_INHIBIT on the
>> default context.
>>

AFAICT the removal of MI_RESTORE_INHIBIT in that commit seemed
justified.  Ben explained that it was needed to fix a pagefault in the
default context under certain conditions.  I don't know the details of
the original change (Ben CC'ed), but it seems like this would be trading
one bug for another?

Other than that this opens the door again to subtle state leaks between
processes, as I learned recently while implementing L3 partitioning in
Mesa.  Mesa now changes the L3 configuration in ways that will break
assumptions from processes that use the default context (the DDX).  The
DDX assumes, for instance, that the URB size is set according to the
hardware defaults, but it doesn't actually program the URB size itself,
so in a way the DDX relies on MI_RESTORE_INHIBIT *not* to be set for the
default context for correct operation.  This commit breaks that
assumption and the kernel ABI.

Mesa has a workaround in place to reduce the likelihood of such leaks,
but the solution is far from ideal because it relies on userspace
cooperation and had a measurable impact in performance (because it
requires userspace to assume the worst-case scenario that the following
batch is going to be from a different context with MI_RESTORE_INHIBIT
set, so we have to restore the hardware default L3 configuration at the
end of every batch even if that's actually not the case), for that
reason we would like to drop the userspace workaround in the future at
least on v4.1 kernels and newer.

One more question below.

>> v2: Mark the global default context as uninitialized on GPU reset so
>> that the context-local workarounds are reloaded upon re-enabling.
>>
>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>
> Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>
>
>> Cc: Michel Thierry <michel.thierry@intel.com>
>> Cc: Mika Kuoppala <mika.kuoppala@intel.com>
>> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
>> ---
>>  drivers/gpu/drm/i915/i915_gem_context.c | 6 +++++-
>>  1 file changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
>> index 43761c5bcaca..f024d5d2c746 100644
>> --- a/drivers/gpu/drm/i915/i915_gem_context.c
>> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
>> @@ -340,6 +340,10 @@ void i915_gem_context_reset(struct drm_device *dev)
>>  			i915_gem_context_unreference(lctx);
>>  			ring->last_context = NULL;
>>  		}
>> +
>> +		/* Force the GPU state to be reinitialised on enabling */
>> +		if (ring->default_context)
>> +			ring->default_context->legacy_hw_ctx.initialized = false;
>>  	}
>>  }
>>  
>> @@ -708,7 +712,7 @@ static int do_switch(struct drm_i915_gem_request *req)
>>  	if (ret)
>>  		goto unpin_out;
>>  
>> -	if (!to->legacy_hw_ctx.initialized) {
>> +	if (!to->legacy_hw_ctx.initialized || i915_gem_context_is_default(to)) {

This hunk causes MI_RESTORE_INHIBIT to be set again for the default
context regardless of whether a reset happened or not, so it seems
unrelated to the rest of your change.  Maybe I'm understanding this
wrong but doesn't the !initialized check together with the hunk above
already guarantee that MI_RESTORE_INHIBIT will be set after GPU reset,
which is what you wanted to achieve?

>>  		hw_flags |= MI_RESTORE_INHIBIT;
>>  		/* NB: If we inhibit the restore, the context is not allowed to
>>  		 * die because future work may end up depending on valid address
>> -- 
>> 2.6.2
>>
>> _______________________________________________
>> Intel-gfx mailing list
>> Intel-gfx@lists.freedesktop.org
>> http://lists.freedesktop.org/mailman/listinfo/intel-gfx
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/intel-gfx

[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 212 bytes --]

[-- Attachment #2: Type: text/plain, Size: 159 bytes --]

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2] drm/i915: Restore inhibiting the load of the default context
  2015-12-10 13:24       ` Francisco Jerez
@ 2015-12-10 13:37         ` Chris Wilson
  2015-12-10 13:57           ` Francisco Jerez
  2015-12-10 14:03           ` Daniel Vetter
  0 siblings, 2 replies; 11+ messages in thread
From: Chris Wilson @ 2015-12-10 13:37 UTC (permalink / raw)
  To: Francisco Jerez; +Cc: intel-gfx, Ben Widawsky, Daniel Vetter

On Thu, Dec 10, 2015 at 03:24:52PM +0200, Francisco Jerez wrote:
> Mika Kuoppala <mika.kuoppala@linux.intel.com> writes:
> 
> > Chris Wilson <chris@chris-wilson.co.uk> writes:
> >
> >> Following a GPU reset, we may leave the context in a poorly defined
> >> state, and reloading from that context will leave the GPU flummoxed. For
> >> secondary contexts, this will lead to that context being banned - but
> >> currently it is also causing the default context to become banned,
> >> leading to turmoil in the shared state.
> >>
> >> This is a regression from
> >>
> >> commit 6702cf16e0ba8b0129f5aa1b6609d4e9c70bc13b [v4.1]
> >> Author: Ben Widawsky <benjamin.widawsky@intel.com>
> >> Date:   Mon Mar 16 16:00:58 2015 +0000
> >>
> >>     drm/i915: Initialize all contexts
> >>
> >> which quietly introduced the removal of the MI_RESTORE_INHIBIT on the
> >> default context.
> >>
> 
> AFAICT the removal of MI_RESTORE_INHIBIT in that commit seemed
> justified.  Ben explained that it was needed to fix a pagefault in the
> default context under certain conditions.  I don't know the details of
> the original change (Ben CC'ed), but it seems like this would be trading
> one bug for another?

A bug in a feature that never worked and isn't enabled?
 
> Other than that this opens the door again to subtle state leaks between
> processes, as I learned recently while implementing L3 partitioning in
> Mesa.  Mesa now changes the L3 configuration in ways that will break
> assumptions from processes that use the default context (the DDX).  The
> DDX assumes, for instance, that the URB size is set according to the
> hardware defaults, but it doesn't actually program the URB size itself,
> so in a way the DDX relies on MI_RESTORE_INHIBIT *not* to be set for the
> default context for correct operation.  This commit breaks that
> assumption and the kernel ABI.

Wrong the ABI is the other way around and has been for 10 years. Note
the kernel isn't also ensuring that the default context has the default
hardware values either. The "golden renderstate" also doesn't set all
defaults and is also a recent introduction.

It changes the ABI back to what it was....
 
> Mesa has a workaround in place to reduce the likelihood of such leaks,
> but the solution is far from ideal because it relies on userspace
> cooperation and had a measurable impact in performance (because it
> requires userspace to assume the worst-case scenario that the following
> batch is going to be from a different context with MI_RESTORE_INHIBIT
> set, so we have to restore the hardware default L3 configuration at the
> end of every batch even if that's actually not the case), for that
> reason we would like to drop the userspace workaround in the future at
> least on v4.1 kernels and newer.

Mesa has workarounds for leaks between hardware contexts, i.e. state
that is not stored in the context itself. Mesa also has to assume a
clean slate when setting up the context for the first time, and doesn't
use the default context itself.
 
> One more question below.
> 
> >> v2: Mark the global default context as uninitialized on GPU reset so
> >> that the context-local workarounds are reloaded upon re-enabling.
> >>
> >> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> >
> > Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>
> >
> >> Cc: Michel Thierry <michel.thierry@intel.com>
> >> Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> >> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> >> ---
> >>  drivers/gpu/drm/i915/i915_gem_context.c | 6 +++++-
> >>  1 file changed, 5 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> >> index 43761c5bcaca..f024d5d2c746 100644
> >> --- a/drivers/gpu/drm/i915/i915_gem_context.c
> >> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> >> @@ -340,6 +340,10 @@ void i915_gem_context_reset(struct drm_device *dev)
> >>  			i915_gem_context_unreference(lctx);
> >>  			ring->last_context = NULL;
> >>  		}
> >> +
> >> +		/* Force the GPU state to be reinitialised on enabling */
> >> +		if (ring->default_context)
> >> +			ring->default_context->legacy_hw_ctx.initialized = false;
> >>  	}
> >>  }
> >>  
> >> @@ -708,7 +712,7 @@ static int do_switch(struct drm_i915_gem_request *req)
> >>  	if (ret)
> >>  		goto unpin_out;
> >>  
> >> -	if (!to->legacy_hw_ctx.initialized) {
> >> +	if (!to->legacy_hw_ctx.initialized || i915_gem_context_is_default(to)) {
> 
> This hunk causes MI_RESTORE_INHIBIT to be set again for the default
> context regardless of whether a reset happened or not, so it seems
> unrelated to the rest of your change.  Maybe I'm understanding this
> wrong but doesn't the !initialized check together with the hunk above
> already guarantee that MI_RESTORE_INHIBIT will be set after GPU reset,
> which is what you wanted to achieve?

This is the change to *restore* the kernel ABI. The flag in reset is to
force the WA LRI to be re-emitted (not for gen6 ofc) as they will not be
preserved.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2] drm/i915: Restore inhibiting the load of the default context
  2015-12-10 13:37         ` Chris Wilson
@ 2015-12-10 13:57           ` Francisco Jerez
  2015-12-10 14:03           ` Daniel Vetter
  1 sibling, 0 replies; 11+ messages in thread
From: Francisco Jerez @ 2015-12-10 13:57 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx, Ben Widawsky, Daniel Vetter


[-- Attachment #1.1.1: Type: text/plain, Size: 6128 bytes --]

Chris Wilson <chris@chris-wilson.co.uk> writes:

> On Thu, Dec 10, 2015 at 03:24:52PM +0200, Francisco Jerez wrote:
>> Mika Kuoppala <mika.kuoppala@linux.intel.com> writes:
>> 
>> > Chris Wilson <chris@chris-wilson.co.uk> writes:
>> >
>> >> Following a GPU reset, we may leave the context in a poorly defined
>> >> state, and reloading from that context will leave the GPU flummoxed. For
>> >> secondary contexts, this will lead to that context being banned - but
>> >> currently it is also causing the default context to become banned,
>> >> leading to turmoil in the shared state.
>> >>
>> >> This is a regression from
>> >>
>> >> commit 6702cf16e0ba8b0129f5aa1b6609d4e9c70bc13b [v4.1]
>> >> Author: Ben Widawsky <benjamin.widawsky@intel.com>
>> >> Date:   Mon Mar 16 16:00:58 2015 +0000
>> >>
>> >>     drm/i915: Initialize all contexts
>> >>
>> >> which quietly introduced the removal of the MI_RESTORE_INHIBIT on the
>> >> default context.
>> >>
>> 
>> AFAICT the removal of MI_RESTORE_INHIBIT in that commit seemed
>> justified.  Ben explained that it was needed to fix a pagefault in the
>> default context under certain conditions.  I don't know the details of
>> the original change (Ben CC'ed), but it seems like this would be trading
>> one bug for another?
>
> A bug in a feature that never worked and isn't enabled?
>  
>> Other than that this opens the door again to subtle state leaks between
>> processes, as I learned recently while implementing L3 partitioning in
>> Mesa.  Mesa now changes the L3 configuration in ways that will break
>> assumptions from processes that use the default context (the DDX).  The
>> DDX assumes, for instance, that the URB size is set according to the
>> hardware defaults, but it doesn't actually program the URB size itself,
>> so in a way the DDX relies on MI_RESTORE_INHIBIT *not* to be set for the
>> default context for correct operation.  This commit breaks that
>> assumption and the kernel ABI.
>
> Wrong the ABI is the other way around and has been for 10 years. Note
> the kernel isn't also ensuring that the default context has the default
> hardware values either. The "golden renderstate" also doesn't set all
> defaults and is also a recent introduction.
>
> It changes the ABI back to what it was....

Sounds like a change fully unrelated to fixing the bug on GPU reset.

The ABI change done in 6702cf16e0ba8b0129f5aa1b6609d4e9c70bc13b seemed
legitimate because it made the context state on switch more well-defined
than it used to be, IOW it was a backwards-compatible ABI change because
it couldn't possibly break userspace assumptions.  This change OTOH
breaks an assumption about the kernel ABI done by the DDX now.

>  
>> Mesa has a workaround in place to reduce the likelihood of such leaks,
>> but the solution is far from ideal because it relies on userspace
>> cooperation and had a measurable impact in performance (because it
>> requires userspace to assume the worst-case scenario that the following
>> batch is going to be from a different context with MI_RESTORE_INHIBIT
>> set, so we have to restore the hardware default L3 configuration at the
>> end of every batch even if that's actually not the case), for that
>> reason we would like to drop the userspace workaround in the future at
>> least on v4.1 kernels and newer.
>
> Mesa has workarounds for leaks between hardware contexts, i.e. state
> that is not stored in the context itself. Mesa also has to assume a
> clean slate when setting up the context for the first time, and doesn't
> use the default context itself.
>  

No, the workaround involves restoring state which *is* part of the
hardware context, it just won't get saved and restored with this commit
because of the MI_RESTORE_INHIBIT flag when switching to the DDX'
default context.

>> One more question below.
>> 
>> >> v2: Mark the global default context as uninitialized on GPU reset so
>> >> that the context-local workarounds are reloaded upon re-enabling.
>> >>
>> >> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>> >
>> > Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>
>> >
>> >> Cc: Michel Thierry <michel.thierry@intel.com>
>> >> Cc: Mika Kuoppala <mika.kuoppala@intel.com>
>> >> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
>> >> ---
>> >>  drivers/gpu/drm/i915/i915_gem_context.c | 6 +++++-
>> >>  1 file changed, 5 insertions(+), 1 deletion(-)
>> >>
>> >> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
>> >> index 43761c5bcaca..f024d5d2c746 100644
>> >> --- a/drivers/gpu/drm/i915/i915_gem_context.c
>> >> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
>> >> @@ -340,6 +340,10 @@ void i915_gem_context_reset(struct drm_device *dev)
>> >>  			i915_gem_context_unreference(lctx);
>> >>  			ring->last_context = NULL;
>> >>  		}
>> >> +
>> >> +		/* Force the GPU state to be reinitialised on enabling */
>> >> +		if (ring->default_context)
>> >> +			ring->default_context->legacy_hw_ctx.initialized = false;
>> >>  	}
>> >>  }
>> >>  
>> >> @@ -708,7 +712,7 @@ static int do_switch(struct drm_i915_gem_request *req)
>> >>  	if (ret)
>> >>  		goto unpin_out;
>> >>  
>> >> -	if (!to->legacy_hw_ctx.initialized) {
>> >> +	if (!to->legacy_hw_ctx.initialized || i915_gem_context_is_default(to)) {
>> 
>> This hunk causes MI_RESTORE_INHIBIT to be set again for the default
>> context regardless of whether a reset happened or not, so it seems
>> unrelated to the rest of your change.  Maybe I'm understanding this
>> wrong but doesn't the !initialized check together with the hunk above
>> already guarantee that MI_RESTORE_INHIBIT will be set after GPU reset,
>> which is what you wanted to achieve?
>
> This is the change to *restore* the kernel ABI. The flag in reset is to
> force the WA LRI to be re-emitted (not for gen6 ofc) as they will not be
> preserved.

Again you don't justify why changing the kernel ABI is required to fix
a GPU reset bug.

> -Chris
>
> -- 
> Chris Wilson, Intel Open Source Technology Centre

[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 212 bytes --]

[-- Attachment #2: Type: text/plain, Size: 159 bytes --]

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2] drm/i915: Restore inhibiting the load of the default context
  2015-12-10 13:37         ` Chris Wilson
  2015-12-10 13:57           ` Francisco Jerez
@ 2015-12-10 14:03           ` Daniel Vetter
  1 sibling, 0 replies; 11+ messages in thread
From: Daniel Vetter @ 2015-12-10 14:03 UTC (permalink / raw)
  To: Chris Wilson, Francisco Jerez, Mika Kuoppala, intel-gfx,
	Daniel Vetter, Ben Widawsky

On Thu, Dec 10, 2015 at 01:37:56PM +0000, Chris Wilson wrote:
> On Thu, Dec 10, 2015 at 03:24:52PM +0200, Francisco Jerez wrote:
> > Mika Kuoppala <mika.kuoppala@linux.intel.com> writes:
> > 
> > > Chris Wilson <chris@chris-wilson.co.uk> writes:
> > >
> > >> Following a GPU reset, we may leave the context in a poorly defined
> > >> state, and reloading from that context will leave the GPU flummoxed. For
> > >> secondary contexts, this will lead to that context being banned - but
> > >> currently it is also causing the default context to become banned,
> > >> leading to turmoil in the shared state.
> > >>
> > >> This is a regression from
> > >>
> > >> commit 6702cf16e0ba8b0129f5aa1b6609d4e9c70bc13b [v4.1]
> > >> Author: Ben Widawsky <benjamin.widawsky@intel.com>
> > >> Date:   Mon Mar 16 16:00:58 2015 +0000
> > >>
> > >>     drm/i915: Initialize all contexts
> > >>
> > >> which quietly introduced the removal of the MI_RESTORE_INHIBIT on the
> > >> default context.
> > >>
> > 
> > AFAICT the removal of MI_RESTORE_INHIBIT in that commit seemed
> > justified.  Ben explained that it was needed to fix a pagefault in the
> > default context under certain conditions.  I don't know the details of
> > the original change (Ben CC'ed), but it seems like this would be trading
> > one bug for another?
> 
> A bug in a feature that never worked and isn't enabled?
>  
> > Other than that this opens the door again to subtle state leaks between
> > processes, as I learned recently while implementing L3 partitioning in
> > Mesa.  Mesa now changes the L3 configuration in ways that will break
> > assumptions from processes that use the default context (the DDX).  The
> > DDX assumes, for instance, that the URB size is set according to the
> > hardware defaults, but it doesn't actually program the URB size itself,
> > so in a way the DDX relies on MI_RESTORE_INHIBIT *not* to be set for the
> > default context for correct operation.  This commit breaks that
> > assumption and the kernel ABI.
> 
> Wrong the ABI is the other way around and has been for 10 years. Note
> the kernel isn't also ensuring that the default context has the default
> hardware values either. The "golden renderstate" also doesn't set all
> defaults and is also a recent introduction.
> 
> It changes the ABI back to what it was....

Well it seems folks like the new ABI, even if somewhat accidental. And
more isolation between clients can't hurt I think (well it does hurt perf
somewhat). So I guess we need a respun patch that just makes sure that we
restore the default context stuff after reset, probably by marking it as
unitialized? And then just the one the kernel has I guess?
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2] drm/i915: Restore inhibiting the load of the default context
  2015-12-10 10:45       ` Daniel Vetter
@ 2016-01-06 10:07         ` Daniel Vetter
  0 siblings, 0 replies; 11+ messages in thread
From: Daniel Vetter @ 2016-01-06 10:07 UTC (permalink / raw)
  To: Mika Kuoppala; +Cc: Daniel Vetter, intel-gfx

On Thu, Dec 10, 2015 at 11:45:54AM +0100, Daniel Vetter wrote:
> On Thu, Dec 10, 2015 at 12:19:29PM +0200, Mika Kuoppala wrote:
> > Chris Wilson <chris@chris-wilson.co.uk> writes:
> > 
> > > Following a GPU reset, we may leave the context in a poorly defined
> > > state, and reloading from that context will leave the GPU flummoxed. For
> > > secondary contexts, this will lead to that context being banned - but
> > > currently it is also causing the default context to become banned,
> > > leading to turmoil in the shared state.
> > >
> > > This is a regression from
> > >
> > > commit 6702cf16e0ba8b0129f5aa1b6609d4e9c70bc13b [v4.1]
> > > Author: Ben Widawsky <benjamin.widawsky@intel.com>
> > > Date:   Mon Mar 16 16:00:58 2015 +0000
> > >
> > >     drm/i915: Initialize all contexts
> > >
> > > which quietly introduced the removal of the MI_RESTORE_INHIBIT on the
> > > default context.
> > >
> > > v2: Mark the global default context as uninitialized on GPU reset so
> > > that the context-local workarounds are reloaded upon re-enabling.
> > >
> > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > 
> > Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>
> 
> Queued for -next, thanks for the patch.

I guess I dropped this one again, but now merged again since it fixes a
suspend/resume regression.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2016-01-06 10:07 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-11-27 10:07 [PATCH] drm/i915: Restore inhibiting the load of the default context Chris Wilson
2015-11-27 11:32 ` Mika Kuoppala
2015-11-27 13:14   ` Chris Wilson
2015-11-27 13:28   ` [PATCH v2] " Chris Wilson
2015-12-10 10:19     ` Mika Kuoppala
2015-12-10 10:45       ` Daniel Vetter
2016-01-06 10:07         ` Daniel Vetter
2015-12-10 13:24       ` Francisco Jerez
2015-12-10 13:37         ` Chris Wilson
2015-12-10 13:57           ` Francisco Jerez
2015-12-10 14:03           ` Daniel Vetter

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.