Chris Wilson writes: > On Thu, Dec 10, 2015 at 03:24:52PM +0200, Francisco Jerez wrote: >> Mika Kuoppala writes: >> >> > Chris Wilson writes: >> > >> >> Following a GPU reset, we may leave the context in a poorly defined >> >> state, and reloading from that context will leave the GPU flummoxed. For >> >> secondary contexts, this will lead to that context being banned - but >> >> currently it is also causing the default context to become banned, >> >> leading to turmoil in the shared state. >> >> >> >> This is a regression from >> >> >> >> commit 6702cf16e0ba8b0129f5aa1b6609d4e9c70bc13b [v4.1] >> >> Author: Ben Widawsky >> >> Date: Mon Mar 16 16:00:58 2015 +0000 >> >> >> >> drm/i915: Initialize all contexts >> >> >> >> which quietly introduced the removal of the MI_RESTORE_INHIBIT on the >> >> default context. >> >> >> >> AFAICT the removal of MI_RESTORE_INHIBIT in that commit seemed >> justified. Ben explained that it was needed to fix a pagefault in the >> default context under certain conditions. I don't know the details of >> the original change (Ben CC'ed), but it seems like this would be trading >> one bug for another? > > A bug in a feature that never worked and isn't enabled? > >> Other than that this opens the door again to subtle state leaks between >> processes, as I learned recently while implementing L3 partitioning in >> Mesa. Mesa now changes the L3 configuration in ways that will break >> assumptions from processes that use the default context (the DDX). The >> DDX assumes, for instance, that the URB size is set according to the >> hardware defaults, but it doesn't actually program the URB size itself, >> so in a way the DDX relies on MI_RESTORE_INHIBIT *not* to be set for the >> default context for correct operation. This commit breaks that >> assumption and the kernel ABI. > > Wrong the ABI is the other way around and has been for 10 years. Note > the kernel isn't also ensuring that the default context has the default > hardware values either. The "golden renderstate" also doesn't set all > defaults and is also a recent introduction. > > It changes the ABI back to what it was.... Sounds like a change fully unrelated to fixing the bug on GPU reset. The ABI change done in 6702cf16e0ba8b0129f5aa1b6609d4e9c70bc13b seemed legitimate because it made the context state on switch more well-defined than it used to be, IOW it was a backwards-compatible ABI change because it couldn't possibly break userspace assumptions. This change OTOH breaks an assumption about the kernel ABI done by the DDX now. > >> Mesa has a workaround in place to reduce the likelihood of such leaks, >> but the solution is far from ideal because it relies on userspace >> cooperation and had a measurable impact in performance (because it >> requires userspace to assume the worst-case scenario that the following >> batch is going to be from a different context with MI_RESTORE_INHIBIT >> set, so we have to restore the hardware default L3 configuration at the >> end of every batch even if that's actually not the case), for that >> reason we would like to drop the userspace workaround in the future at >> least on v4.1 kernels and newer. > > Mesa has workarounds for leaks between hardware contexts, i.e. state > that is not stored in the context itself. Mesa also has to assume a > clean slate when setting up the context for the first time, and doesn't > use the default context itself. > No, the workaround involves restoring state which *is* part of the hardware context, it just won't get saved and restored with this commit because of the MI_RESTORE_INHIBIT flag when switching to the DDX' default context. >> One more question below. >> >> >> v2: Mark the global default context as uninitialized on GPU reset so >> >> that the context-local workarounds are reloaded upon re-enabling. >> >> >> >> Signed-off-by: Chris Wilson >> > >> > Reviewed-by: Mika Kuoppala >> > >> >> Cc: Michel Thierry >> >> Cc: Mika Kuoppala >> >> Cc: Daniel Vetter >> >> --- >> >> drivers/gpu/drm/i915/i915_gem_context.c | 6 +++++- >> >> 1 file changed, 5 insertions(+), 1 deletion(-) >> >> >> >> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c >> >> index 43761c5bcaca..f024d5d2c746 100644 >> >> --- a/drivers/gpu/drm/i915/i915_gem_context.c >> >> +++ b/drivers/gpu/drm/i915/i915_gem_context.c >> >> @@ -340,6 +340,10 @@ void i915_gem_context_reset(struct drm_device *dev) >> >> i915_gem_context_unreference(lctx); >> >> ring->last_context = NULL; >> >> } >> >> + >> >> + /* Force the GPU state to be reinitialised on enabling */ >> >> + if (ring->default_context) >> >> + ring->default_context->legacy_hw_ctx.initialized = false; >> >> } >> >> } >> >> >> >> @@ -708,7 +712,7 @@ static int do_switch(struct drm_i915_gem_request *req) >> >> if (ret) >> >> goto unpin_out; >> >> >> >> - if (!to->legacy_hw_ctx.initialized) { >> >> + if (!to->legacy_hw_ctx.initialized || i915_gem_context_is_default(to)) { >> >> This hunk causes MI_RESTORE_INHIBIT to be set again for the default >> context regardless of whether a reset happened or not, so it seems >> unrelated to the rest of your change. Maybe I'm understanding this >> wrong but doesn't the !initialized check together with the hunk above >> already guarantee that MI_RESTORE_INHIBIT will be set after GPU reset, >> which is what you wanted to achieve? > > This is the change to *restore* the kernel ABI. The flag in reset is to > force the WA LRI to be re-emitted (not for gen6 ofc) as they will not be > preserved. Again you don't justify why changing the kernel ABI is required to fix a GPU reset bug. > -Chris > > -- > Chris Wilson, Intel Open Source Technology Centre